Angelica Gabasio May 2013

Comparison of optical character recognition (OCR) software

Master’s thesis work carried out at Sweco Position.

Supervisors: Björn Harrtell, Sweco; Tobias Lennartsson, Sweco
Examiner: Jacek Malec, Lund University

The purpose of the thesis was to test several different optical character recognition (OCR) tools to find out which one gives the best output. Different kinds of images have been OCR scanned to test the software. The output from the OCR tools has been automatically compared to the correct text and the error percentage has been calculated.

Introduction

Optical character recognition (OCR) is used to extract plain text from images containing text. This is useful when information in paper form is to be digitized: OCR is applied to the images to make the text editable and searchable. Another use of OCR is to process forms automatically [1]. If the output from the OCR tool contains many errors, a lot of manual work is required to fix them, which is why it is important to get a correct or almost correct output. The output from different OCR tools may differ considerably, and the purpose of this comparison is to find out which OCR software gives the most accurate results.

OCR steps

As seen in Figure 1, the OCR software takes an image, applies OCR to it and outputs plain text. Figure 1 also shows the basic steps of the OCR algorithm, which are explained below [1, 2].

Preprocessing: Preprocessing can be done either manually, before the image is given to the OCR tool, or internally by the software. The most common preprocessing steps are to 1) convert the image to black and white, 2) remove noise, and 3) rotate the image to make the text as horizontal as possible. The purpose of preprocessing is to modify the image in a way that makes the recognition as successful as possible (a small sketch of these steps is given after Figure 1).

Layout analysis: This step determines how to read the text in the image. It can include identification of columns or images in the text.

Recognition: The recognition step is when the OCR tool decides which character to output. The software extracts each character from the image and uses a database to match it to the (hopefully) correct character.

Figure 1: OCR - the basic steps. A text image passes through preprocessing, layout analysis and recognition, and plain text is output.
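The thesis does not state which tools, if any, were used for the manual preprocessing. Purely as an illustration, the three preprocessing steps listed above could be sketched in Python with the Pillow imaging library; the threshold and rotation angle below are placeholder values, not values used in the thesis.

    from PIL import Image, ImageFilter

    def preprocess(path, threshold=128, deskew_angle=0.0):
        # Rough pre-OCR cleanup: grayscale, denoise, binarize and deskew.
        img = Image.open(path).convert("L")                     # grayscale
        img = img.filter(ImageFilter.MedianFilter(size=3))      # 2) remove speckle noise
        img = img.point(lambda p: 255 if p > threshold else 0)  # 1) black and white only
        if deskew_angle:
            img = img.rotate(deskew_angle, expand=True, fillcolor=255)  # 3) make text horizontal
        return img

    # Example: preprocess("scan.png", deskew_angle=1.5).save("scan_clean.png")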

Method

Software

Nine different OCR tools have been used, four of them commercial and five open source. The tools included in the comparison are Tesseract, Ocrad, CuneiForm, GOCR, OCRopus, TOCR, Abbyy CLI OCR, Leadtools OCR SDK and OCR API Service. TOCR, Abbyy, Leadtools and OCR API Service are commercial tools; the rest are open source.

Input

Since it is important to get a good output from the OCR scan, different kinds of images have been tested to see how well the tools handle images of varying quality. The types of images used are skewed images, images containing handwriting, noisy and stained images, light images, images with pictures in the text, and images with underlined text. Most of the text in the images is in Swedish, but some images contain text in English as well.
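The report does not describe the exact test harness. As a purely hypothetical illustration, one of the open-source tools (Tesseract) could be run over a directory of test images from Python as sketched below; the directory names and the swe+eng language setting are assumptions, and the Swedish and English language data must be installed for that setting to work.

    import subprocess
    from pathlib import Path

    def run_tesseract(image_dir, out_dir, lang="swe+eng"):
        # Run the Tesseract CLI on every PNG in image_dir and write .txt files to out_dir.
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for image in sorted(Path(image_dir).glob("*.png")):
            base = out / image.stem  # Tesseract appends ".txt" to the output base name
            subprocess.run(["tesseract", str(image), str(base), "-l", lang], check=True)

    # Example: run_tesseract("test_images", "ocr_output")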

Comparison

To decide how correct the OCR output is, the Wagner-Fischer string comparison algorithm is used. The algorithm calculates the edit distance between two strings, i.e. the minimum number of character insertions, deletions and substitutions needed to change one string into the other [3]. The edit distance is used to get a percentage value of the error:

error(s1, s2) = edit_distance(s1, s2) / max{|s1|, |s2|} · 100,

where |s| is the length of the string s.
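A straightforward reference implementation of the Wagner-Fischer edit distance and the error measure above is sketched below; it illustrates the definitions and is not the code used in the thesis.

    def edit_distance(s1, s2):
        # Minimum number of character insertions, deletions and substitutions (Wagner-Fischer).
        m, n = len(s1), len(s2)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i          # delete all of s1[:i]
        for j in range(n + 1):
            d[0][j] = j          # insert all of s2[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[m][n]

    def error(s1, s2):
        # Error percentage: edit distance relative to the length of the longer string.
        return edit_distance(s1, s2) / max(len(s1), len(s2)) * 100

    # Example: error("kitten", "sitting") == 3 / 7 * 100, roughly 42.9 %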

Result

The mean error of the tested OCR tools over all of the images, in increasing order, is:

TOCR: 8.79%
Leadtools: 20.06%
Abbyy: 24.53%
OCR API Service: 27.81%
OCRopus: 31.42%
Tesseract: 33.9%
CuneiForm: 37.68%
Ocrad: 50.25%
GOCR: 64.46%

Conclusion

The four commercial tools in the comparison are more accurate than any of the open source tools. TOCR gives the most accurate output on most of the scanned images. In some of the images, TOCR generates the correct text but some whitespace between the rows is missing; since the comparison algorithm compares the strings character by character, a missing new line counts as an error. The comparison shows that it is probably worthwhile to invest in a commercial tool, preferably TOCR, since it is much more accurate than any of the other tools.

Software used

• Tesseract-ocr, http://code.google.com/p/tesseract-ocr/

• Ocrad, http://www.gnu.org/software/ocrad/

• CuneiForm, http://cognitiveforms.com/ru.html#1189-CuneiForm

• GOCR, http://jocr.sourceforge.net/

• OCRopus, https://code.google.com/p/ocropus/

• TOCR, http://www.transym.com/tocr-the-integrators-choice.htm/

• Abbyy CLI OCR, http://www.ocr4linux.com/

• Leadtools OCR SDK, http://www.leadtools.com/sdk/ocr/

• OCR API Service, http://ocrapiservice.com/

References

[1] Inad Aljarrah, Osama Al-Khaleel, Khaldoon Mhaidat, Mu’ath Alrefai, Abdullah Alzu’bi, Mohammad Rabab’ah, Automated System for Arabic Optical Character Recognition with Lookup Dictionary, Journal of Emerging Technologies in Web Intelligence, Vol. 4, Issue 4, Nov 2012, pp. 362-370

[2] Tobias Blanke, Michael Bryant, Mark Hedges, Open source optical character recognition for historical research, Journal of Documentation, Vol. 68, Issue 5, 2012, pp. 659-683

[3] Robert A. Wagner, Michael J. Fischer, The String-to-String Correction Problem, Journal of the Association for Computing Machinery, Vol. 21, No. 1, January 1974, pp. 168-173
