Download Download
Total Page:16
File Type:pdf, Size:1020Kb
-1- Which OCR toolset is good and why? A comparative study Pooja Jain1,*, Dr. Kavita Taneja1, Dr. Harmunish Taneja2 1Dept. Of Computer Science & Applications, Panjab University, Chandigarh, India. 2Dept. Of Computer Science & Information Tech., DAV College, Sec - 10, Chandigarh, India. *Corresponding author: [email protected] Abstract Optical Character Recognition (OCR) is a very active research area in many challenging fields like pattern recognition, natural language processing (NLP), computer vision, biomedical informatics, machine learning (ML), and artificial intelligence (AI). This computational technology extracts the text in an editable format (MS Word/Excel, text files, etc.) from PDF files, scanned or hand-written documents, images (photographs, advertisements, and alike), etc. for further processing and has been utilized in many real-world applications including banking, education, insurance, finance, healthcare and keyword-based search in documents, etc. Many OCR toolsets are available under various categories, including open-source, proprietary, and online services. This research paper provides a comparative study of various OCR toolsets considering a variety of parameters. Keywords: ABBYY FineReader; Calamari; Google Docs; OCR; Tesseract. 1. Introduction OCR (Bokser, 1992; Mori et al., 1992) is a 1.1. OCR process commonly used technology for recognizing text within digital images such as OCR process (Goswami et al., 2013; Cao, scanned documents, advertisements, 2014; Tomaschek, 2018) generally goes photographs, etc. It is widely used as an through multiple stages, as shown in Figure information entry tool that can extract useful 1, including Image-acquisition (downloading information from scanned documents, image from an online source or capturing it including printed forms (filled by users), using a camera or scanner), Pre-processing computerized receipts, bank statements, (modifying image in a way that may increase invoices, business cards, passport documents, the accuracy of OCR), Binarization mails, or any other suitable documentation. (separating the content from the background), Other applications include searching within Layout Analysis (a division of the document institutional repositories and scanned legal into various homogeneous regions), Character documents, automatic number plate level segmentation (segmentation of the recognition (ANPR), processing cheques in image into lines, words, and characters), banks, recognizing barcodes, testing text- Recognition (feature extraction of every based captcha codes, etc. character image) and finally, Classification (determining the output characters) followed by Postprocessing where classification results can be enhanced using various language models and dictionaries. Kuwait Journal of Science -2- Fig. 1. OCR Process 1.2. OCR Challenges OCR faces many problems in recognizing Fig. 2. Popular OCR Softwares. printed or handwritten characters such as image deformation (disconnected line domain. The rest of the paper is arranged as segments, isolated dots, breaks or holes in follows. First proprietary OCR toolsets are lines, rotation of text, etc.), shape introduced, followed by open-source and discrimination (some characters have very online OCR toolsets. Then a literature similar shapes like 0 (zero) and O, 5 and S, review for various comparative studies 2 and Z, 6 and G, etc.), size and pitch conducted for OCR toolsets is presented, variations, cluttered background, a camera followed by the discussion and conclusion captured documents (motion and out-of- of the study. focus blur), multilingual documents, text 2. Proprietary OCR Toolsets formatting, and complicated structures and natural scene text (variation in illumination Proprietary OCR toolsets are usually paid conditions and fonts, etc.). Touching for and supported by developers. They characters and different typefaces used in generally have an excellent graphical user early printed books (15th-19th century) interface (GUI). Popular proprietary OCR pose additional text recognition difficulties. toolsets are discussed in this section, and Recognition of calligraphy typefaces and their important features are summarized in fonts is another challenge (Al-Hmouz, Table 1. 2020). 2.1. ABBYY FineReader 1.3. OCR Toolsets ABBYY FineReader achieves up to 100 % OCR toolsets are software that focus on word-level accuracy with high quality accurate character recognition in a reliable images in English language manner. OCR toolsets can be broadly (https://abbyy.technology/_media/en:produ classified into three categories: cts:fre:win:v11:frengine11_performance_g (i) Proprietary uide.pdf), and has been an undisputed (ii) Open Source choice for layout analysis and OCR. It can (iii) Online be accessed in two different ways: ABBYY FineReader SDK and ABBYY Online/ Figure 2 presents various popular OCR cloud service. toolsets in different categories. Since many proprietaries, open-source, and online OCR 2.2. Transym OCR (TOCR) toolsets with varied features are available to (http://www.transym.com/tocr-the- choose from, this research paper reviews integrators-choice.htm) and analyse the performance of various OCR toolsets and provide better insight to TOCR (Tafti et al., 2016) is specifically the OCR study with an aim to help designed keeping in mind the ease of researchers in making a right choice of an integration with other softwares. With a OCR toolset specific to their application very light and efficient GUI, it is claimed Kuwait Journal of Science -3- that Transym has the capability to read French, and Spanish). For other languages, blurred, obscure, and even broken their language pack must be installed first. characters (Vithlani et al., 2015). However, multiple languages cannot be set 2.3. Readiris for a single document in MODI. (https://www.irislink.com/EN- 3. Open-source OCR Toolsets US/c1729/Readiris-17--the-PDF-and- Open-source OCR engines are best OCR-solution-for-Windows-.aspx) controlled via their command-line Preserving the original page layout, interfaces, and most of them do not have a Readiris automatically converts text from GUI. Some of the popular open-source PDF files, images, or paper documents into OCR toolsets are discussed in this section, fully editable files. Being compatible with and their important features are summarized most of the scanners in the market, it in Table 2. supports many different input formats and 3.1. Tesseract provides an attractive and intuitive GUI. 2.4. Adobe Acrobat Tesseract (Smith, 2007; Patel et al., 2012) (https://acrobat.adobe.com/in/en/acrobat/h comes with an easy-to-use command-line ow-to/ocr-software-convert-pdf-to- tool called ‘tesseract.’ It can be integrated in text.html) C++ or Python code by using Tesseract’s API. Many GUI desktop applications use Adobe Acrobat automatically converts Tesseract as a text recognition engine, image files, scanned documents, and PDF including FreeOCR, OCRFeeder, PDF files into searchable/editable documents OCR X, QTesseract, YAGF, while preserving the format and its accuracy gImageReader, Lector, VietOCR, is reportedly high. It provides fewer SunnyPage, and Lime OCR. Also, WeOCR, language options comparing with ABBYY CustomOCR, i2OCR, and NewOCR are FineReader, but it is more pervasive web applications using the Tesseract tool software as it is more business-oriented and (Vithlani et al., 2015). Its latest version, less academic. Tesseract4.0 (https://github.com/tesseract- 2.5. OmniPage ocr/tesseract/wiki/4.0-with-LSTM), uses a kind of Recurrent Neural Network (RNN) (https://www.kofax.com/Products/omnipage) called Long Short Term Memory (LSTM) OmniPage((https://www.nuancesoftwarest based recognition engine. ore.com/omnipage-ultimate) claims 99% or more character-accuracy and converts paper 3.2. Ocrad documents and PDFs into fully editable (https://www.gnu.org/software/ocrad/manu al/ocrad_manual.html) digital files preserving all type of formatting. Using OmniPage, digital Ocrad supports narrowing down the camera photos and other images can also be character search by defining which converted into text files. character sets to recognize. Ocrad recognizes characters very fast, but at the 2.6. Microsoft Office Document Imaging same time, it is very sensitive to character (MODI) defects, and it is difficult to modify Ocrad (https://support.microsoft.com/da- to recognize new characters. Best character dk/help/982760/install-modi-for-use-with- recognition results are achieved when microsoft-office-2010) characters are at least 20 pixels high, or the MODI is a free OCR tool that can scan hard image is scanned at 300 dpi. copies of documents and import them to MS Word for editing. By default, MODI can perform OCR in three languages (English, Kuwait Journal of Science -4- 3.3. GOCR 3.6. Calamari (http://jocr.sourceforge.net/) (https://github.com/Calamari-OCR) GOCR can be used either as a stand-alone Calamari (Reul et al., 2019) is one of the console application or as a back-end (OCR latest open-source OCR line recognition tool. engine) to other programs. Written in C Based on state-of-the-art Deep Neural language, its recognition process takes two Networks (DNNs) implemented in passes. In the first pass, the entire document Tensorflow (including Convolutional Neural is called. In the second pass, the unknown Networks (CNN) and LSTM layers), it uses characters are called (Dhiman et al., 2013). techniques such as pretraining and voting, It is claimed that GOCR can handle which help in minimizing its character error single-column sans-serif