<<

-1-

Which OCR toolset is good and why? A comparative study

Pooja Jain1,*, Dr. Kavita Taneja1, Dr. Harmunish Taneja2 1Dept. Of Computer Science & Applications, Panjab University, Chandigarh, India. 2Dept. Of Computer Science & Information Tech., DAV College, Sec - 10, Chandigarh, India. *Corresponding author: [email protected]

Abstract

Optical Character Recognition (OCR) is a very active research area in many challenging fields like pattern recognition, natural language processing (NLP), computer vision, biomedical informatics, machine learning (ML), and artificial intelligence (AI). This computational technology extracts the text in an editable format (MS Word/Excel, text files, etc.) from PDF files, scanned or hand-written documents, images (photographs, advertisements, and alike), etc. for further processing and has been utilized in many real-world applications including banking, education, insurance, finance, healthcare and keyword-based search in documents, etc. Many OCR toolsets are available under various categories, including open-source, proprietary, and online services. This research paper provides a comparative study of various OCR toolsets considering a variety of parameters.

Keywords: ABBYY FineReader; Calamari; Google Docs; OCR; .

1. Introduction OCR (Bokser, 1992; Mori et al., 1992) is a 1.1. OCR process commonly used technology for recognizing text within digital images such as OCR process (Goswami et al., 2013; Cao, scanned documents, advertisements, 2014; Tomaschek, 2018) generally goes photographs, etc. It is widely used as an through multiple stages, as shown in Figure information entry tool that can extract useful 1, including Image-acquisition (downloading information from scanned documents, image from an online source or capturing it including printed forms (filled by users), using a camera or scanner), Pre-processing computerized receipts, bank statements, (modifying image in a way that may increase invoices, business cards, passport documents, the accuracy of OCR), Binarization mails, or any other suitable documentation. (separating the content from the background), Other applications include searching within Layout Analysis (a division of the document institutional repositories and scanned legal into various homogeneous regions), Character documents, automatic number plate level segmentation (segmentation of the recognition (ANPR), processing cheques in image into lines, words, and characters), banks, recognizing barcodes, testing text- Recognition (feature extraction of every based captcha codes, etc. character image) and finally, Classification (determining the output characters) followed by Postprocessing where classification results can be enhanced using various language models and dictionaries.

Kuwait Journal of Science -2-

Fig. 1. OCR Process 1.2. OCR Challenges

OCR faces many problems in recognizing Fig. 2. Popular OCR . printed or handwritten characters such as image deformation (disconnected line domain. The rest of the paper is arranged as segments, isolated dots, breaks or holes in follows. First proprietary OCR toolsets are lines, rotation of text, etc.), shape introduced, followed by open-source and discrimination (some characters have very online OCR toolsets. Then a literature similar shapes like 0 (zero) and O, 5 and S, review for various comparative studies 2 and Z, 6 and G, etc.), size and pitch conducted for OCR toolsets is presented, variations, cluttered background, a camera followed by the discussion and conclusion captured documents (motion and out-of- of the study. focus blur), multilingual documents, text 2. Proprietary OCR Toolsets formatting, and complicated structures and natural scene text (variation in illumination Proprietary OCR toolsets are usually paid conditions and fonts, etc.). Touching for and supported by developers. They characters and different typefaces used in generally have an excellent graphical user early printed books (15th-19th century) interface (GUI). Popular proprietary OCR pose additional text recognition difficulties. toolsets are discussed in this section, and Recognition of calligraphy typefaces and their important features are summarized in fonts is another challenge (Al-Hmouz, Table 1. 2020). 2.1. ABBYY FineReader

1.3. OCR Toolsets ABBYY FineReader achieves up to 100 %

OCR toolsets are that focus on word-level accuracy with high quality accurate character recognition in a reliable images in manner. OCR toolsets can be broadly (https://abbyy.technology/_media/en:produ classified into three categories: cts:fre:win:v11:frengine11_performance_g (i) Proprietary uide.), and has been an undisputed (ii) Open Source choice for layout analysis and OCR. It can (iii) Online be accessed in two different ways: ABBYY FineReader SDK and ABBYY Online/ Figure 2 presents various popular OCR cloud service. toolsets in different categories. Since many proprietaries, open-source, and online OCR 2.2. Transym OCR (TOCR) toolsets with varied features are available to (http://www.transym.com/tocr-the- choose from, this research paper reviews integrators-choice.htm) and analyse the performance of various OCR toolsets and provide better insight to TOCR (Tafti et al., 2016) is specifically the OCR study with an aim to help designed keeping in mind the ease of researchers in making a right choice of an integration with other softwares. With a OCR toolset specific to their application very light and efficient GUI, it is claimed

Kuwait Journal of Science -3- that Transym has the capability to read French, and Spanish). For other languages, blurred, obscure, and even broken their language pack must be installed first. characters (Vithlani et al., 2015). However, multiple languages cannot be set 2.3. Readiris for a single document in MODI. (https://www.irislink.com/EN- 3. Open-source OCR Toolsets US/c1729/Readiris-17--the-PDF-and- Open-source OCR engines are best OCR-solution-for-Windows-.aspx) controlled via their command-line Preserving the original page layout, interfaces, and most of them do not have a Readiris automatically converts text from GUI. Some of the popular open-source PDF files, images, or paper documents into OCR toolsets are discussed in this section, fully editable files. Being compatible with and their important features are summarized most of the scanners in the market, it in Table 2. supports many different input formats and 3.1. Tesseract provides an attractive and intuitive GUI.

2.4. Adobe Acrobat Tesseract (Smith, 2007; Patel et al., 2012) (https://acrobat.adobe.com/in/en/acrobat/h comes with an easy-to-use command-line ow-to/ocr-software-convert-pdf-to- tool called ‘tesseract.’ It can be integrated in text.html) ++ or Python code by using Tesseract’s API. Many GUI desktop applications use Adobe Acrobat automatically converts Tesseract as a text recognition engine, image files, scanned documents, and PDF including FreeOCR, OCRFeeder, PDF files into searchable/editable documents OCR X, QTesseract, YAGF, while preserving the format and its accuracy gImageReader, , VietOCR, is reportedly high. It provides fewer SunnyPage, and Lime OCR. Also, WeOCR, language options comparing with ABBYY CustomOCR, i2OCR, and NewOCR are FineReader, but it is more pervasive web applications using the Tesseract tool software as it is more business-oriented and (Vithlani et al., 2015). Its latest version, less academic. Tesseract4.0 (https://github.com/tesseract- 2.5. OmniPage ocr/tesseract/wiki/4.0-with-LSTM), uses a (https://www.kofax.com/Products/omnipage) kind of (RNN) called Long Short Term Memory (LSTM) OmniPage((https://www.nuancesoftwarest based recognition engine. ore.com/omnipage-ultimate) claims 99% or more character-accuracy and converts paper 3.2. documents and into fully editable (https://www.gnu.org/software/ocrad/manu al/ocrad_manual.html) digital files preserving all type of formatting. Using OmniPage, digital Ocrad supports narrowing down the camera photos and other images can also be character search by defining which converted into text files. character sets to recognize. Ocrad recognizes characters very fast, but at the 2.6. Microsoft Office Document Imaging same time, it is very sensitive to character (MODI) defects, and it is difficult to modify Ocrad (https://support.microsoft.com/da- to recognize new characters. Best character dk/help/982760/install-modi-for-use-with- recognition results are achieved when microsoft-office-2010) characters are at least 20 pixels high, or the MODI is a free OCR tool that can scan hard image is scanned at 300 dpi. copies of documents and import them to MS Word for editing. By default, MODI can perform OCR in three languages (English,

Kuwait Journal of Science -4-

3.3. GOCR 3.6. Calamari (http://jocr.sourceforge.net/) (https://github.com/Calamari-OCR)

GOCR can be used either as a stand-alone Calamari (Reul et al., 2019) is one of the console application or as a back-end (OCR latest open-source OCR line recognition tool. engine) to other programs. Written in C Based on state-of-the-art Deep Neural language, its recognition process takes two Networks (DNNs) implemented in passes. In the first pass, the entire document Tensorflow (including Convolutional Neural is called. In the second pass, the unknown Networks (CNN) and LSTM layers), it uses characters are called (Dhiman et al., 2013). techniques such as pretraining and voting, It is claimed that GOCR can handle which help in minimizing its character error single-column sans-serif fonts, which are rates (CERs). Calamari doesn’t offer a full 20–60 pixels in height. Trouble is reported OCR pipeline and just focuses on recognizing with serif fonts, italic fonts, slanted fonts, text from text line images. It can be integrated small fonts, heterogeneous fonts, coloured in existing OCR pipelines and can replace images, handwritten text, overlapping their OCR-engine efficiently. characters, large angles of skew, noisy images, multiple columns, tables, 4. OCR Online services complicated layouts, and text other than Using OCR online services, there is no need Latin alphabets. to download or install any OCR software. The 3.4. OCRopus user is only required to upload the input file, (https://code.google.com/p/ocropus/) select the language(s), output format (optionally), and the output is generated. Few Supporting a command-line based user important OCR online services are discussed interface, OCRopus (Breuel, 2008) has a in this section, and their important features very modular design giving the flexibility to are summarized in perform each step of OCR (e.g., Table 3. binarization, page layout analysis, text line 4.1. ABBYY Cloud OCR recognition, etc.) individually using (https://www.abbyy.com/en-eu/cloud-ocr-sdk/) separate commands and the user can use different modules of his/her own choice to Running on Microsoft Azure infrastructure, perform these tasks. Its latest version, ABBYY's Cloud OCR SDK is a Web OCR OCRopus3 (Breuel et al., 2013), uses service that provides excellent text bi-directional LSTM models using PyTorch recognition quality. This web service can be networks. easily integrated into your own applications 3.5. CuneiForm using a Web API. Converted files can be (https://www.softpedia.com/get/Office- exported to Google Docs, Drop Box, and tools/Other-OfficeTools/CuneiForm.shtml) Ever note (Vithlani et al., 2015). 4.2. Google Docs CuneiForm(https://en.wikipedia.org/wiki/C (http://docs.google.com) uneiForm_(software)) can be used from the command line (as a stand-alone application) Google Docs is primarily a cloud document or with other programs (as a background storage and editing platform offered by application). It uses language dictionaries Google within the Google Drive service to improve recognition results. English is (http://drive.google.com). Once an image or a the default recognition language, but the PDF file is uploaded to Google Drive, OCR user can choose other languages as desired. conversion can be performed by right- clicking on the file and selecting the option “Open with Google Docs.” The extracted text is in editable form, which can be downloaded.

Kuwait Journal of Science -5-

Table 1. Comparison of Important Proprietary OCR Toolsets.

Proprietary Available Multilingual Operating Input Formats Output Formats S.No. Important Features OCR Toolset Online support System supported supported

DOC(X), XLS(X), 1. PDF, BMP, TIFF,  Pre-processing techniques include noise removal, skew correction, straightening ABBYY Windows, PPT(X), RTF, XML, Yes GIF, PNG, JPEG, text lines, trapezoidal distortion correction. FineReader Yes MAC OS X, PDF, CSV, TXT, PCX, DCX, DjVu, (190+) ALTO, FB2, DBF,  Uses AI and ML for precise document reconstruction and higher accuracy. JBIG2, WDP EPUB, HTML, ODT

 Automatically detect the page or image orientation. Yes  Can identify text with background defects (extremely light or dark backgrounds, 2. Transym No Windows PDF, BMP, TIFF TXT, RTF deformation, and speckle). (11)  Uses lexicon for maximising word accuracies and reliability.

 Pre-processing techniques include adaptive binarization, de-speckle filters, de- PDF, PDF/A, HTML, skew feature, document rotation, dark border removal, line removal, and colour XML, RTF, TXT, Windows, PDF, JPEG, PNG, dropout. Yes ODT, WordML, 3. Readiris No MAC OS X, BMP, TIFF, JBIG2,  Font-independent text recognition is complemented by self-learning techniques SpreadsheetML, CSV, (138) iOS, Android JPEG2000 derived from proprietary neural networks. DOC(X), XLS(X),  Uses proprietary dictionaries. XPS, ePub

 Allows editing of text in PDFs. PDF, BMP, JPG/ DOC(X), XLSX,  Automatically generates a custom font (for adding or editing within the PDF Windows, JPEG, GIF, TIF/ PPTX, HTML, RTF, document) that looks like the same font as in the original document. 4. Adobe Acrobat No Yes MAC OS X, TIFF, PNG, PCX, PS, EPS, XML, Edit iOS, Android  Smart PDFs can be created (only searching and copying capabilities without RLE, DIB text in PDF, TXT, CSV editing).

DOC(X), XLS(X),  Pre-processing tools and de-speckling methods are available to reduce PPT(X), RTF, MP3, background noise and enhancement of text and diagrams. BMP, DCX, GIF, Windows, ePUB, XML, PDF,  Removal of punch holes and auto-cropping of margins. Yes JBG, JP2, MAX, 5. OmniPage Yes MAC OS X, PDF/A, Searchable PCX, PDF, XIF,  Text extraction from shaded and coloured documents with very little human (120+) Linux PDF, Corel intervention. XPS WordPerfect, HTML  Able to perform document reading through mobile devices that support MP3 Text audio files.

Microsoft Office Yes DOC/DOCX, MDI, 6. No Windows TIF/TIFF, MDI  De-skews and re-orients the page where required. Document (by-default 3) TIFF  Produces TIFF files that violate the TIFF standard specifications and are only Imaging usable by MODI.

Note: All proprietary OCR softwares discussed above have GUI and are not free except MODI, which is free.

Kuwait Journal of Science -6-

Table 2. Comparison of Important Open-source OCR Toolset

S.No. Input Output Open-source Available Multilingual Operating Formats Formats Important Features OCR Toolset Online support System supported supported

 Pre-processing techniques include orientation detection and TIFF, JPEG, Windows, minor skew correction. JFIF, PNG, Yes MAC OS X, TXT, PDF,  Can use multiple languages in a single scan. 1. Tesseract4.0 No PNM (PGM, Linux, hOCR  Machine learning support for recognizing new languages, (100+) PBM, PPM), Android symbols, and fonts. BMP  No GPU support till date.

Yes  Pre-processing transformations including cut, rotate, scale, Yes MAC OS X, PNM (PGM, and layout detection. 2. Ocrad TXT Ocrad.js (Latin Linux, BSD PBM, PPM)  Both in-built and user-defined filters can be used for the alphabets) post-processing step. PNM (PGM,  GUI (.tcl). Windows, PBM, PPM), Yes Yes  No training data is required (no neural network) or large 3. GOCR MAC OS X, some PCX and Text file GOCR.js font bases to store. (20+) Linux, BSD TGA formats  Barcodes can also be recognized and translated.

Yes  Can be trained to recognize new languages or different MAC OS X, TXT, hOCR, fonts. 4. OCRopus No (Languages PNG with Latin Linux, BSD PDF, HTML  Used for Google Books. script)  GPU support in OCRopus3.  GUI for Windows, command based for Linux. Windows, Yes PNG, BMP, HTML, hOCR,  Saves text formatting and recognizes complicated tables of 5. Cuneiform No MAC OS X, JPG RTF, TeX, TXT any structure. (25+) Linux, BSD  A mixture of Russian and English can also be recognized.

GT.TXT, XML,  Uses Cross Fold Voting, Data Augmentation, Pretraining. 6. Calamari No Yes Not Known PNG, JPG, H5 ABBYY.XML,  GPU support. HDF5

Kuwait Journal of Science -7-

Table 3. Comparison of Important Online OCR Toolsets.

Online OCR Multilingual operating Input Formats Output Formats S.No Free Important Features Toolsets support System supported supported

DOC(X),  Provides highest data security by complying with the PDF, BMP, XLS(X), PPT(X), relevant data protection laws. No TIFF, GIF, RTF, XML, PDF,  Preserves formatting. ABBYY Yes Platform PNG, JPEG, 1. (Free trial CSV, TXT,  Multipage documents can also be converted. Web service independent PCX, DCX, (190+) version is ALTO, FB2, DjVu, JBIG2,  Maximum input file size: 30 MB. available) DBF, EPUB, WDP  For a multilingual document, up to 3 recognition HTML, ODT languages can be chosen. Windows,  Automatically determine the language of the document, Android, DOC(X), DOCM, and no need to specify the language. Yes MAC OS X, PDF, PNG, JPG, DOT(X), DOTM,  Currently, OCR works best on cleanly scanned, high- 2. Google Docs Yes GIF, TIFF HTML, TXT, resolution documents in the most commonly used (83+) iOS, RTF, ODT typefaces. BlackBerry, ChromeOS  Maximum input file size: 50 MB.  Automatically rotates pages. Free-Online No Browser- GIF, JPG, BMP, DOC, RTF, PDF,  Supports low-resolution images. 3. Yes OCR (English only) Based PNG, TIF, PDF TXT  Preserves the original layout and formatting.  Maximum input file size: 200 MB.

 For best text recognition, input images should have a resolution of 200-400 DPI. JPEG/JPG, DOCX, XLSX,  Maximum input file size: 200 MB Yes Yes Browser- PNG, PDF, 4. Online OCR TXT, HTML,  Automatically rotates images (full-page de-skew) for Based BMP, TIF/TIFF, (46) PDF, RTF better recognition. PCX, GIF, ZIP  Non-text, coloured regions are reinserted into the output document.

Note: All Online OCR Toolsets discussed above have GUI.

Kuwait Journal of Science -8-

4.3. Free-Online-OCR conversion, and hence, Tesseract becomes a (http://www.free-online-ocr.com/) better choice. Patel et al., 2012, compared the Free-Online-OCR can achieve high performance of open-source Tesseract with recognition accuracy even with low-quality proprietary Transym for ANPR. It is documents, including screenshots and observed that Tesseract provides better faxes. The accuracy is further increased accuracy than Transym for both grayscale with the help of an integrated dictionary. and coloured images. Also, Tesseract was found to be faster than Transym. The 4.4. Online OCR standard deviation (for accuracy) of (http://www.onlineocr.net/service/about) Tesseract was also less than Transym. Online OCR can convert digital camera- Tomaschek, 2018, conducted a captured images, photographs, faxes, and comparative study of popular open-source scanned documents into various searchable OCR tools, including Tesseract3, and editable formats. Documents written in Tesseract4.0, GOCR, Ocrad, and more than one language can also be established proprietary OCR tools, processed. It allows free conversion of 15 including Nuance OmniPage and images per hour in a guest mode without Readiris16 on a prepared dataset which registration, whereas free Registration contained a balanced mix of various provides extra features. document classes. The results clearly indicate that proprietary Nuance OmniPage 5. Review of comparative studies of could achieve 100% accuracy, whereas popular OCR softwares none of the open-source tools could achieve 100% accuracy. Also, in the open-source Many experimental studies have been category, Tesseract (both version 3 & 4.0) conducted to compare and analyse the performs better than GOCR and Ocrad. performance of various OCR toolsets on In another test conducted by standard as well as non-standard datasets. Tomaschek, 2018, on a synthetic page with Dhiman et al., 2013, used different different open source (Tesseract3, parameters such as image type, font type, Tesseract4.0, GOCR, Ocrad, and OCRopus) brightness, and resolution and compared and proprietary OCR tools (Cuneiform, Tesseract and GoCR based on precision as Nuance OmniPage, and Readiris16), it was well as accuracy. The authors concluded noted that proprietary Cuneiform and that Tesseract outperforms GoCR in most Nuance OmniPage could achieve 100% of cases. word- accuracy whereas in open-source Gabasio, 2013, calculated the mean error category only Tesseract4.0 could achieve values of various proprietary (TOCR, 100% word- accuracy ratio. While Leadtools, ABBYY, OCR API Service) and comparing the time taken by the tested OCR open-source (OCRopus, Tesseract, tools, OCRopus was found to be slowest CuneiForm, Ocrad) OCR tools and and Tesseract4.0 being the fastest among concluded that the mean error rates of tested these. proprietary OCR toolsets are much lower Tafti et al., 2016, performed the than popular open-source OCR toolsets and experimental evaluation of four popular suggested to invest in proprietary OCR OCR toolsets (Google Docs, ABBYY toolsets if the scanned images are of FineReader, Tesseract, and Transym) using different quality. Among the open-source a dataset of images from different category, the mean error rates of OCRopus categories. It is found that Google Docs and and Tesseract were comparable, but ABBYY FineReader performed more OCRopus takes a lot more time in consistently across different image categories with a population standard

Kuwait Journal of Science -9- deviation of 18.19 and 18.02, respectively, to extract text from a huge volume of as compared to 25.56 and 25.79 of Tesseract images uploaded to Facebook and and Transym, respectively. Instagram every day and facilitates many Vijayarani et al., 2015, evaluated the applications like search and performance of eight OCR tools, including recommendation of images. Rosetta's OCR Google Docs, Free OCR to Word Convert, process is based on Faster-RCNN for i2OCR, FreeOCR, Convertimagetotext.net, detecting text containing regions of the OCR Convert, Free Online OCR, and image, followed by a fully-convolutional Online OCR on a sample input image. It is character-based recognition model for concluded that all the tested OCR tools recognizing the text in those locations. (except Free OCR to Word Convert) Scene text or photographs may include performed reasonably good while email-ids, special symbols, URLs and converting characters from the text images, words from different languages and hence but none of the tested OCR tools performed the use of pre-defined dictionaries may not satisfactorily while converting suit here. Therefore, instead of using a mathematical symbols and equations. dictionary-based recognition model, Asad et al., 2016, compared the Rosetta uses a character-based recognition performance of three popular OCR toolsets model. for camera-captured blurred documents. Reul et al., 2019, compared Calamari The experimental evaluation on the with other popular open-source OCR tools SmartDoc-QA dataset shows that out of on the UW3 (University of Washington III) ABBYY FineReader, Tesseract, and dataset, and it is found that Calamari yields OCRopus, the lowest CER of 38.9% is superior recognition capabilities with achieved by ABBYY FineReader. It is 0.114% CER as compared to OCRopus3 further stated that the performance of these (0.436%) and Tesseract4.0 (0.397%). It is OCR toolsets is limited by the binarization also observed that Calamari and OCRopus3 techniques employed or by the factor of support batched GPU training and segmentation. A new framework called prediction and hence, are faster than BLSTM is proposed, which overcomes the Tesseract4.0. problem of segmentation-based OCR Namysl et al., 2019, presented a robust systems and eliminates the need for and fast, deep learning-based multi-font binarization of blurred documents. OCR engine that uses a segmentation-free Reul et al., 2017, concluded that both text recognition method and a novel data ABBYY and Tesseract don’t yield augmentation technique resulting in satisfactory results for early printed books. improved text recognition capabilities. A On the other hand, OCRopus3, when aided comparison of the presented OCR engine with text analysis tools like Aletheia with leading OCR tools including (manual segmentation) or fully automated Tesseract3, Tesseract4.0, ABBYY open-source tool: LAREX (Layout FineReader, and OmniPage revealed that Analysis and Region Extraction) to perform better recognition results were obtained by segmentation, provided high-quality OCR the presented OCR engine on distorted result with over 97% character-accuracy documents with background textures within and around 92% word-accuracy on an early comparable execution time. The low printed book (15th century) within a recognition performance by the compared reasonable amount of time. The results tools for distorted inputs is due to their clearly show that after being thoroughly inability to segment the characters in noisy trained, OCRopus could recognize even the backgrounds adequately. earliest printed typefaces. Borisyuk et al., 2018, presented a scalable OCR system Rosetta which is used

Kuwait Journal of Science -10-

6. Discussion

With up to 100% word-level accuracy in high- images from a large dataset. Rosetta does quality images and covering more than 190 not use pre-defined language dictionaries languages, proprietary ABBYY FineReader and instead uses a character-based clearly defines the state-of-the-art for modern recognition model to facilitate the prints. However, for early printed books, recognition of email-ids, special symbols, OCRopus3 provides high-quality OCR URLs, and words from different languages results within a reasonable amount of time. in a single image. Both OCRopus3 and Tesseract4.0 have When it comes to text recognition in implemented LSTM based recognition distorted documents with background engine for better recognition capabilities. textures, the leading OCR toolsets, However, the major disadvantage of including Tesseract4.0, ABBYY, and OCRopus is that it is not available for OmniPage, do not give satisfactory results. Windows operating systems, whereas A new OCR tool is presented by Namysl et Tesseract is available for all three major al., 2019, which performs much better than operating systems (Windows/ Linux/ MAC the established OCR tools for distorted OS). Another disadvantage of OCRopus is inputs within comparable execution time that it needs a set of commands in sequence and is able to recognize superscripts, for complete OCR rather than a single subscripts as well as different and command. Though this modularity has its alternating font styles located on the same own advantages but it makes OCRopus a page better than the leading OCR toolsets. slow tool. OCR online services are generally free, For modern English prints, among open- very useful, and convenient to use. source categories, Calamari yields better However, uploading the file on the internet recognition results as compared to to their servers may have some privacy and OCRopus3 and Tesseract4.0 (Reul et al., security concerns. Also, if the document is 2019), and its GPU support helps in high- big, the user may face time/bandwidth speed training and recognition. However, issues. Most online services limit the file Calamari is not designed as a complete OCR size which can be uploaded or the no. of solution and focuses solely on text pages that can be processed daily/weekly recognition, whereas other open-source for free. For bigger jobs, the user needs to OCR tools like Tesseract4.0 are designed to pay. support the full pipeline of OCR from image to text. Also, like ABBYY, Tesseract4.0 7. Conclusion uses dictionaries and language modelling, whereas Calamari and OCRopus do not use these. OCR is a challenging research area that For extracting text from a large volume deals with various complexities, including of images uploaded to Facebook and image quality, background noise, skewed or Instagram every day, a scalable OCR speckled images, recognition languages, toolset named Rosetta (Borisyuk et al., different typefaces and fonts, various input- 2018) is used, which uses a fully CNN output formats, text formatting and complex based text recognition model instead of structures, natural scene text, etc. Lots of recurrent LSTM used by other popular OCR toolsets are readily available for use in OCR toolsets including Tesseract4.0 and various domains. This paper provided a OCRopus3. It costs a small loss in the brief introduction and important features of accuracy, but the inference time is reduced, various popular OCR toolsets in different the desired feature in many applications categories, including open-source, like quick search and recommendation of proprietary, and online categories. A review of various comparative studies conducted to

Kuwait Journal of Science -11- evaluate the performance of these popular Bokser, M. (1992) Omnidocument OCR tools has been presented here. It is technologies. Proceedings of the concluded that different OCR tools have IEEE, 80(7), pp.1066-1078. different capabilities, and a single OCR Borisyuk, F.; Gordo, A. and Sivakumar, toolset may not fit in all the domains. Image V. (2018) Rosetta: Large scale system for quality plays an important role in text text detection and recognition in images. recognition. For high-quality images, In Proceedings of the 24th ACM SIGKDD proprietary ABBYY is best-in-class OCR International Conference on Knowledge software. For books printed later than 19th Discovery & Data Mining (pp. 71-79). century, ABBYY gives best results, but for ACM. early printed books, OCRopus3 provides much better OCR results. For modern Breuel, T.M. (2008) The OCRopus open- English prints, in the open-source category, source OCR system. In Document Calamari yields better recognition results as Recognition and Retrieval XV (Vol. 6815, compared to OCRopus3 and Tesseract4.0, p. 68150F). International Society for Optics But when a single software is required and Photonics. which can perform the complete pipeline of Breuel, T.M.; Ul-Hasan, A.; Al-Azawi, OCR in one go, Tesseract4.0 is the most M.A. & Shafait, F. (2013) High- popular choice among the open-source performance OCR for Printed English and category. When dealing with a large Fraktur using LSTM networks. In 12th volume of images, Rossetta provides faster International Conference on Document recognition results and facilitates natural Analysis and Recognition (pp. 683-687). scene text recognition. For distorted inputs IEEE. with noisy backgrounds, a new lexicon-free OCR toolset is presented by Namysl et al., Cao, H. (2014) Machine-printed character 2019, which provides much better text recognition. Handbook of Document Image recognition results as compared to Processing and Recognition: 331-358. established OCR toolsets. Many free online Dhiman, S. and Singh, A. (2013) OCR services are available, including Tesseract vs. gocr a comparative Google Docs, Online-OCR, Free-Online- study. International Journal of Recent OCR, etc., which allow the users to convert Technology and Engineering, 2(4), p.80. images into text files without downloading the OCR toolsets on their machines, but file Gabasio, A. (2013) Comparison of optical security may be a concern using these character recognition (OCR) software. online services. Master’s thesis Lund University, Sweden. pp. 8-19.

http://fileadmin.cs.lth.se/intern/Utskrifter/2 References 013-10%20Rapport.pdf Goswami, R. & Sharma, O.P. (2013) A Al-Hmouz, R. (2020) Deep learning Review on Character Recognition autoencoder approach: Automatic Techniques. International Journal of recognition of artistic Arabic calligraphy Computer Applications 83(7). types. Kuwait Journal of Science, 47(3). Mori, S.; Suen, C.Y. & Yamamoto, K. Asad, F.; Ul-Hasan, A.; Shafait, F. & (1992) Historical review of OCR research Dengel, A. (2016) High-Performance OCR and development. Proceedings of the for Camera-Captured Blurred Documents IEEE, 80(7), pp.1029-1058. with LSTM Networks. In 12th IAPR Namysl, M. & Konya, I. (2019) Efficient, Workshop on Document Analysis Systems Lexicon-Free OCR using Deep (DAS) (pp. 7-12). IEEE. Learning. In International Conference on

Kuwait Journal of Science -12-

Document Analysis and Recognition (ICDAR) (pp. 295-301). IEEE. Submitted : 20/04/2020 Patel, C.; Patel, A. & Patel, D. (2012) Revised : 20/08/2020 Optical character recognition by open- source OCR tool tesseract: A case Accepted : 24/08/2020 study. International Journal of Computer DOI : 10.48129/kjs.v48i2.9589 Applications, 55(10), pp.50-56. Reul, C.; Dittrich, M. and Gruner, M. (2017) Case Study of a highly automated Layout Analysis and OCR of an incunabulum:'Der Heiligen Leben'(1488). In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage (pp. 155-160). ACM. Reul, C.; Christ, D.; Hartelt, A.; Balbach, N.; Wehner, M.; Springmann, U.; Wick, C.; Grundig, C.; Büttner, A. & Puppe, F. (2019) OCR4all-an open-source tool providing a (semi-) automatic OCR workflow for historical printings. Applied Sciences, 9(22), pp.1-30. Smith, R. (2007) An overview of the Tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) (Vol. 2, pp. 629-633). IEEE. Tafti, A.P.; Baghaie, A.; Assefi, M.; Arabnia, H.R.; Yu, Z. & Peissig, P. (2016) OCR as a service: an experimental evaluation of Google Docs OCR, Tesseract, ABBYY FineReader, and Transym. International Symposium on Visual Computing. pp. 735-746. Springer, Cham. Tomaschek, M. (2018) Evaluation of off- the-shelf OCR technologies. Bachelor thesis Masaryk University, Brno, Czech Republic. pp. 3-24. https://is.muni.cz/th/v09x6/ Vijayarani, S. & Sakila, A. (2015) Performance comparison of OCR tools. International Journal of UbiComp (IJU), 6(3), pp.19-30. Vithlani, P. & Kumbharana, C.K. (2015) Comparative Study of Character Recognition Tools. International Journal of Computer Applications 118(9).

Kuwait Journal of Science