A Comparison of OCR Methods on Natural Images in Different Image Domains
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

A comparison of OCR methods on natural images in different image domains

AGNES FORSBERG
MELVIN LUNDQVIST

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

En jämförelse av OCR-metoder i olika domäner

AGNES FORSBERG
MELVIN LUNDQVIST

Degree Project in Computer Science, DD142X
Date: June 8, 2020
Supervisor: Kevin Smith
Examiner: Pawel Herman
School of Electrical Engineering and Computer Science

Abstract

Optical character recognition (OCR) is a blanket term for methods that convert printed or handwritten text into machine-encoded text. As the digital world keeps growing, the number of digital images containing text increases, and so does the need for OCR methods that can handle more than plain text documents. There are OCR engines that can convert images of clean documents with a recognition rate of over 99%. OCR for natural images is receiving more and more attention, but because natural images can be far more diverse than plain text documents, it also leads to complications. To combat these issues, it needs to be clear in which areas today's OCR methods struggle. This thesis aims to answer this by testing three popular, readily available OCR methods on a dataset comprised only of natural images containing text. The results show that one of the methods, GOCR, cannot handle natural images, as its test results were very far from correct. For the other two methods, ABBYY FineReader and Tesseract, the results were better, but they also show that there is still a long way to go, especially when it comes to images with special fonts. However, when the images are less complicated, some of the methods performed above our expectations.

Sammanfattning

Optical character recognition (OCR) is a collective term for methods that convert printed or handwritten text into machine code. As the digital world grows, so does the number of digital images containing text, and so does the need for OCR methods that can handle more than ordinary text documents. Today there are OCR engines that can convert images of clean documents into machine code with over 99% accuracy. OCR for photographs is receiving more and more attention, but since photographs are far more diverse than clean text documents, this also leads to problems. Handling this requires clarity about the areas in which today's OCR methods struggle. This thesis aims to answer this question by examining and testing three popular, readily available OCR methods on a dataset containing only photographs of natural environments with text. The results showed that one of the methods, GOCR, cannot handle photographs; GOCR's test results were far from correct. For the other methods, ABBYY FineReader and Tesseract, the results were better but showed that there is still much work to be done in this area, especially when it comes to images with special fonts. For less complicated images, however, we were surprised by how good the results were for some of the methods.

Contents

1 Introduction
1.1 Problem statement
1.2 Scope
1.3 Hypothesis
1.4 Outline
2 Background
2.1 Optical character recognition
2.2 Tesseract
2.3 ABBYY FineReader Engine
2.4 GOCR
3 Method
3.1 The NEOCR dataset
3.2 Image domains
3.3 Image filtering
3.4 Experiment
4 Results
4.1 Font
4.2 Texture
4.3 Arrangement
4.4 Contrast
4.5 Blurriness
4.6 Comparative study
5 Discussion
5.1 Method discussion
5.2 Discussion of results
5.3 Future research
5.4 Ethical and sustainability considerations
5.5 Societal considerations
5.6 Conclusion
References

1 Introduction

Optical character recognition (OCR) is the process of converting images of typewritten, handwritten or printed text into editable machine-encoded text [7]. The first true OCR machine was installed in 1954 and was used on typewritten sales reports [4]. As the digital world grows and the number of digital images containing text increases, the potential uses of OCR technology are expanding. Nowadays there are many capable OCR engines for converting pictures of clean documents into editable text that can be used by computer software, some with a recognition rate of over 99% [6]. OCR is therefore very helpful for converting physical office documents into digital ones, but it is far from faultless in scenarios such as recognising text in natural scene images [7]. Recognising text in real-world images is receiving more and more attention, but it comes with numerous complications, since text in natural images is far more diverse than in plain text documents [6].

1.1 Problem statement

OCR engines are very complex and consist of several steps. To develop OCR technology for natural images further, it must be clear in which cases OCR methods perform poorly, in order to know what should be the focus. Therefore, the goal of this study is to evaluate the current state of three different OCR methods and identify possible scenarios in which each respective method demonstrates a lower recognition rate by answering the following question:

How accurate are the OCR methods Tesseract, ABBYY FineReader and GOCR on natural images, overall and in different domains?

1.2 Scope

The study is solely comparative and will not take into account the accessibility or computational cost of the OCR methods.

The study is restricted to investigating the performance of the three popular off-the-shelf OCR methods Tesseract (version 4.0.0-beta.1, https://tesseract-ocr.github.io/tessdoc/Home.html), ABBYY FineReader Engine (version 12 for Linux, https://abbyy.technology/en:products:fre:linux) and GOCR (version 0.52-20181015, https://www-e.ovgu.de/jschulen/ocr/download.html). The experiment will use the accessible software of the OCR methods and will not make use of additional support, such as language input, training or neural networks. The string distance function Levenshtein distance will be used to evaluate the performance (a minimal sketch of such an evaluation is given at the end of this section).

Furthermore, the OCR methods will only be evaluated on their overall performance and when applied to data in the presence of the domains (confounding factors) font, texture, arrangement, contrast and blurredness (from here on referred to as blurriness). All images for the study are included in the dataset NEOCR version 1.0 and will be filtered for specific domains. For more specifics on the characteristics of the images in NEOCR, see section 3.1.
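As a concrete illustration of this kind of evaluation, the sketch below runs one of the engines (Tesseract) from the command line and scores its output against a reference transcription with the Levenshtein distance. This is only an assumed setup, not the pipeline described in section 3.4: the image path, the ground-truth string and the use of the plain tesseract command-line interface are placeholders.

    # Minimal sketch: OCR one image with the Tesseract CLI and score the
    # result with Levenshtein distance. Assumes tesseract is installed and
    # that "image.png" and the ground-truth string are placeholders.
    import subprocess

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def ocr_with_tesseract(image_path: str) -> str:
        """Run the Tesseract CLI and return the recognised text."""
        result = subprocess.run(["tesseract", image_path, "stdout"],
                                capture_output=True, text=True, check=True)
        return result.stdout.strip()

    if __name__ == "__main__":
        ground_truth = "EXIT"                 # hypothetical annotation
        recognised = ocr_with_tesseract("image.png")
        print("recognised:", recognised)
        print("distance:", levenshtein(recognised, ground_truth))

A distance of 0 would mean a perfect match with the annotation; larger distances indicate more character insertions, deletions or substitutions in the recognised text.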
1.3 Hypothesis

GOCR seems to be a rather simple OCR method. In terms of results, we believe that this will be a disadvantage on complex images of natural environments.

Commercially, ABBYY FineReader appears to be a tool for digital documents only; however, the SDK engine used for these experiments supposedly supports text recognition in images as well. It is also used in many large-scale applications (https://www.abbyy.com/en-gb/case-studies/?product=3250). Because of this, we expect ABBYY FineReader to have better accuracy than the other methods.

Tesseract is open source but maintained by Google and is one of the most used systems in the world. However, we do not know how well it performs on natural images. We expect it to have good accuracy, but not as good as the ABBYY FineReader Engine.

In addition, we expect all methods to perform better on images with standard fonts, horizontal arrangement and high contrast, while performing worse on images with high texture and blurriness, since text in these types of images is more difficult to distinguish for the eye and then, hypothetically, also for OCR methods.

1.4 Outline

The next section presents a more detailed background, drawing on previous studies, required for the comprehension of the rest of the report. Section three describes how the experiment was carried out. It contains an explanation of how the images in the dataset were filtered, and a technical specification of how the OCR tools were accessed and how their outputs were retrieved and processed. The fourth section presents the results of the experiment and compares them. The fifth section discusses the chosen course of action and the reliability and usefulness of the results, both for answering the research question and for future research. Finally, the conclusion of the study is presented.

2 Background

In this section a theoretical background for the area of study will be given. Specifically, OCR technology will be introduced further, and the three OCR methods Tesseract, ABBYY FineReader and GOCR will be described.

2.1 Optical character recognition

Optical character recognition (OCR) uses technology to distinguish printed or handwritten text characters in digital images. It was originally invented as a tool that read text out loud for the blind or visually impaired. Nowadays it is most frequently used to transform historic documents and books into PDFs. OCR methods use algorithms