Evaluation of Off-The-Shelf OCR Technologies


Evaluation of off-the-shelf OCR technologies
Bachelor's Thesis
Martin Tomaschek
Masaryk University, Faculty of Informatics
Brno, Fall 2017

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Acknowledgements

I would like to thank my advisor for patience, my brother for help and my parents for their love.

Abstract

A comparison of OCR software.

Keywords

OCR, benchmark

Contents

1 Preface
2 Outlines of the OCR process
3 Challenges to OCR
4 OCR benchmarking and evaluation
  4.1 Dataset creation
    4.1.1 Synthetic and real data
    4.1.2 Formats
  4.2 Evaluation metrics
    4.2.1 Text recognition
    4.2.2 Text segmentation
    4.2.3 Existing datasets
    4.2.4 Ground-truthing tools
  4.3 Evaluation tools
    4.3.1 The ISRI Analytic Tools [13]
    4.3.2 hOCR tools [15]
    4.3.3 An open-source OCR evaluation tool
5 The tested OCR systems
  5.1 Proprietary
    5.1.1 Abby FineReader
    5.1.2 Readiris 16
    5.1.3 Adobe Acrobat 11
    5.1.4 Omnipage
  5.2 Open source
    5.2.1 Tesseract
    5.2.2 GNU Ocrad
    5.2.3 Gocr
    5.2.4 Ocropus
    5.2.5 Cuneiform
  5.3 Online services
    5.3.1 Google docs
  5.4 Tests
6 Conclusion
Bibliography

1 Preface

Optical character recognition (OCR) is the extraction of machine-encoded text from an image.
It is a subfield of computer vision and has many applications: digitizing scanned documents to enable editing, searching, and indexing, or storing them more efficiently; processing bank cheques; sorting mail [1]; recognizing license plate numbers in highway toll systems; etc. Since the first commercial OCR systems were created in the 1950s [2], they have improved significantly, alongside the computer: once room-sized, expensive custom-built systems used only by large organizations, OCR applications can nowadays even run on a smartphone and leverage its built-in camera to take the picture. Early OCR systems were limited to monospace¹ text, often of a single typeface; today's OCR software supports many common proportional² fonts.

OCR is a complex and computationally demanding task. There are countless combinations of document type, layout, paper type, font, language, and script, and countless other variables, such as material degradation, defects of imaging and print, etc. Because of this, there is also a large variety of OCR software, each designed for a particular application: for example, recognizing Hebrew³ or Japanese⁴, recognizing handprinted script, an OCR package fine-tuned for reading medieval scripts, and so on.

This thesis focuses on the evaluation of the most common type of OCR software, designed to recognize Western languages using Latin script and its derivatives. English has the most samples in the datasets used for the tests in this work; fewer Slovak and Czech documents are examined; the ISRI [3] dataset, which will also be used, contains Spanish documents as well; other languages were not tested.

1. Every character occupies the same, i.e. fixed, width.
2. The opposite of monospace.
3. Hebrew is an "impure abjad": it uses an alphabet of 22 (+5) consonants, vowels are indicated by diacritical marks beneath the consonants, and it is written right to left. https://en.wikipedia.org/wiki/Hebrew_alphabet
4.
Japanese uses four scripts: logographic characters adopted from China, i.e. kanji; two syllabic scripts, hiragana and katakana; and Latin for some foreign words (mostly acronyms), along with Arabic numerals. The core set of kanji used daily has about 3,000 symbols, with a few thousand more used from time to time. Japanese can be written both left to right and top to bottom. https://en.wikipedia.org/wiki/Japanese_writing_system

[Figure 1.1: Examples of scripts used around the globe]

The aim of this thesis is to compare available OCR software, selected from among the industry leaders and various open-source projects. While a few papers already exist on the subject, they tend to be rather outdated (e.g. [3] from 1996) or focused on a specific document type (e.g. [4]). Some websites contain more up-to-date reviews and comparisons; however, they are often not very credible, as they seldom describe their methodology, test the OCR solutions on very small datasets, contain subjective performance measures, or just list available features (e.g. [5]).

The first chapter provides an overview of the OCR process itself. The second lists factors and problems affecting OCR performance. Chapter three explains various metrics that can be used to evaluate OCR systems and presents the tools used to measure them. In the fourth chapter the tested OCR programs are introduced. The fifth chapter investigates the actual impact of various variables, such as image resolution, lossy compression, skew, font of the text, etc., on the accuracy of OCR systems. The last chapter presents and discusses the results obtained in the tests.

2 Outlines of the OCR process

The OCR process generally involves these stages:

∙ Image acquisition – an image is taken using a scanner, a camera, or a similar device. To achieve highly accurate results, a good quality image is needed.

∙ Preprocessing – text orientation detection, deskewing, noise filtering, perspective correction (if the source is a photograph), etc.
∙ Binarization – the content is separated from the background.

∙ Page segmentation – the document is divided into homogeneous regions, such as columns of text, tables, images, etc.

∙ Line, word and character segmentation [6] – the image is further divided, down to the character level.¹

∙ Recognition [7] –
  – Feature extraction – various characteristics (called features) are calculated for every character image.²
  – Classification – the features are compared with trained data³ to determine what the output character should be, via a classifier (a program).⁴

∙ Postprocessing – dictionaries and various language models can be used to enhance the results.

OCR packages use very different algorithms and techniques to perform their task. Some examples can be found in the articles referenced above.

1. Line and word segmentation is relatively easy to do, especially on printed documents, where lines are straight and evenly spread. Character segmentation is a much tougher problem and is often closely coupled with recognition, because already recognized characters can be used to improve segmentation accuracy. Some recognition approaches (notably hidden Markov model (HMM) ones) do not need character-level pre-segmentation.
2. For instance, gradient features can be obtained by splitting the character image into a 4 by 4 grid of tiles and applying the Sobel operator to calculate the gradient orientation at each pixel, which is then quantized into 12 orientations (as per 5 minutes or 1 hour on the clock). Finally, for each tile the features are defined as the count of pixels with a given gradient orientation, normalized by tile size.
3. Basically all OCR programs require reference data, which is used to identify the patterns in an image. This data is usually bundled with the OCR package. Some OCR software is user-trainable, which allows adding new symbols or even languages and scripts, or improving accuracy. Training is principally done by presenting images of characters or even whole sentences to the OCR program together with the correct solution. See [8] for examples.
4. Many types of character classifiers exist; each one works with a different set of features and is therefore good at distinguishing among different character classes. Several classifiers are often used together to leverage their individual strengths and achieve better accuracy.

3 Challenges to OCR

This chapter presents challenges OCR software has to overcome in order to be able to correctly convert an image to text, expanding on Nartker, Nagy and Rice [9], who have described some key factors contributing to OCR errors.

∙ Imaging defects – there are many ways imaging defects may be introduced while printing and scanning a document. Common imaging defects include:
  – Heavy or light print – heavy print may be produced, for example, when a tape is replaced with a new one in a dot matrix printer; light print when a printer is running low on ink or toner.
  – Uneven contrast – cheap or old laser printers often do not produce quality output, scanning a book results in darker areas near the binding, etc.
  – Stray marks
  – Curved baselines
  – Lens geometry and perspective transformation – these affect images acquired by a camera.
  – Paper quality – paper slowly degrades over time, and so does the information it carries.

∙ Similar symbols – many characters look similar to a vertical line and therefore to one another: i, j, I, l, 1, !, |. Some capital letters differ from their lowercase counterparts just by size, e.g. v,V, o,O, s,S, z,Z. Other pairs of glyphs that bear close resemblance are 0,O, (,{,[, u,v, U,V, p,P, k,K and so on. Commas (,) and dots (.) look almost identical in some fonts, and so do many other punctuation symbols. While English does not use diacritical marks much, some languages do so extensively. Punctuation and diacritical marks are often very small and thus hard to recognize correctly and easy to mistake for noise.
∙ Special or new symbols – Unicode contains a great many characters, and OCR software is simply not trained to recognize all of them. Many languages contain little peculiarities and use different alphabets (in Spanish an opening question mark is written upside down, Scandinavian languages slash some letters, etc.) in addition to the aforementioned diacritical marks and different punctuation, and OCR systems use different trained data and/or minor modifications to support them.
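The tile-based gradient features described in the footnotes of the previous chapter can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used by any of the tested systems: `np.gradient` stands in for the Sobel operator, and the 4 by 4 tile grid and 12 orientation bins follow the description above.

```python
import numpy as np

def gradient_features(img, grid=4, bins=12):
    """Tile-based gradient-orientation features (illustrative sketch).

    Assumptions: np.gradient approximates the Sobel operator, and
    zero-gradient pixels fall into orientation bin 0.
    """
    gy, gx = np.gradient(img.astype(float))
    # Gradient orientation at each pixel, mapped into [0, 2*pi)
    theta = np.arctan2(gy, gx) % (2 * np.pi)
    # Quantize into `bins` orientations (like 5-minute marks on a clock)
    q = np.floor(theta / (2 * np.pi) * bins).astype(int) % bins
    h, w = img.shape
    th, tw = h // grid, w // grid
    feats = []
    for i in range(grid):
        for j in range(grid):
            tile = q[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
            # Count pixels per orientation, normalized by tile size
            hist = np.bincount(tile.ravel(), minlength=bins) / tile.size
            feats.extend(hist)
    return np.array(feats)

# A 32x32 dummy "character image": left half dark, right half bright
img = np.zeros((32, 32))
img[:, 16:] = 255
f = gradient_features(img)
print(f.shape)  # (192,) = 4*4 tiles * 12 orientation bins
```

Each tile contributes a 12-bin orientation histogram that sums to one, so the feature vector is invariant to tile size; a classifier would consume the resulting 192-dimensional vector per character image.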