Text Classification and Layout Analysis for Document Reassembling

Total Page:16

File Type:pdf, Size:1020Kb

Text Classification and Layout Analysis for Document Reassembling Text Classification and Layout Analysis for Document Reassembling DISSERTATION submitted in partial fulfillment of the requirements for the degree of Doktor der technischen Wissenschaften by Markus Diem Registration Number 0226595 to the Faculty of Informatics at the Vienna University of Technology Advisor: Ao.Univ.Prof. Dipl.-Ing. Dr.techn. Robert Sablatnig The dissertation has been reviewed by: (Ao.Univ.Prof. Dipl.-Ing. (Basilis G. Gatos, PhD) Dr.techn. Robert Sablatnig) Wien, March 10, 2014 (Markus Diem) Technische Universit¨atWien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.ac.at Erkl¨arung zur Verfassung der Arbeit Markus Diem Mollardgasse 22/19, 1060 Wien Hiermit erkl¨are ich, dass ich diese Arbeit selbst¨andig verfasst habe, dass ich die ver- wendeten Quellen und Hilfsmittel vollst¨andig angegeben habe und dass ich die Stellen der Arbeit { einschließlich Tabellen, Karten und Abbildungen {, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe. (Ort, Datum) (Unterschrift Verfasser) i Danksagung Ich m¨ochte mich an dieser Stelle bei all jenen Menschen bedanken, die mich begleitet und unterst¨utzthaben. Sie haben wesentlich dazu beigetragen, dass ich diese Arbeit schreiben konnte. F¨urdie fachliche Unterst¨utzungm¨ochte ich mich bei allen Arbeitskollegen am Cvl bedanken, die in vielen Diskussionen meine Idee der Computer Vision geformt haben. Speziell m¨ochte ich mich bei Melanie Gau, Michael H¨odlmoser,Martin Kampel, Rainer Planinc, Michael Reiter, Sebastian Zambanini und Andreas Zweng bedanken. Ein Dank gilt auch Stefan, der mich nie zu seiner Masterparty eingeladen hat, Florian, dessen zum Teil absurde Ideen den B¨uroalltagauffrischen und Fabian, der mich schon beim Studium f¨urseine nicht immer absurden Ideen begeisterte. Florian hat mich zus¨atzlich durch alle Stasi Projekte begleitet und mir dabei die Mathematik sowie eine gewisse Stressresistenz n¨ahergebracht. Die letzteren drei hatten einen wesentlichen Einfluss auf die Entstehung des Montagsbier, welches in jeder gesunden B¨uroumgebungempfehlenswert ist. Ein spezieller Dank gilt auch meinem Betreuer Robert Sablatnig. Er hat eine B¨uroat- mosph¨aregeschaffen, die sowohl produktiv als auch belebend ist. Des Weiteren hat er mein Interesse an der Forschung geweckt und mich auf die Verfassung wissenschaftlicher Artikel vorbereitet. Begriffe wie many, more und jeglicher Konjunktiv verwendete ich weit h¨aufigerohne seine Korrekturen. Bei Basilis Gatos m¨ochte ich mich f¨urdie Zeit bedanken, die er in das Review dieser Arbeit investiert. Seine wissenschaftliche Arbeit ist die Grundlage einiger Entwicklungen die hier vorgestellt werden. Ein abschließender Dank gilt meiner Familie, die mich zu jenem Menschen formte, der ich nun bin. Meine Großmutter Frieda f¨ordertebesonders in meiner Kindheit meinen Wissensdurst. Lukas hat immer Zeit f¨urDiskussionen ¨uber Gott, die Welt und im speziellen die Informatik. Im Gegensatz dazu holt mich Clemens aus dieser Welt gerne heraus und zeigt mir jene, in der die Asthetik¨ der Form dominiert. Meine beiden Schwest- ern Sarah und Sabine holen mich wiederum gerne auf den Boden der Realit¨at.Ein letzter Dank gilt Alex, Sandro, Lara, Nina, Werner und Annemarie, die immer da sind wenn es notwendig ist. Diese Dissertation widme ich meinen Eltern und Babsi. Meine Eltern haben mein Studium erm¨oglicht und mir f¨urmeinen Lebensweg Werte mitgegeben, die sich bis heute als sehr gut erwiesen haben. Babsi, du bist nach Platons Symposion meine fehlende H¨alfte. Danke iii \as goht als" Abstract In the context of automated reassembling of manually torn document snippets contour based approaches are insufficient because snippets have the same rupture edges if more than one page is torn at the same time. Moreover jigsaw puzzling is np hard which requests for a grouping of document snippets beforehand such that the complexity and computational speed of reassembling is improved. Analyzing the visual content of doc- ument snippets renders the distinction of snippets with the same contours possible. In addition, a visual content extraction enables fine alignment of snippets with the same content and for grouping snippets. The document analysis approaches presented in this thesis are part of a combined reassembling which utilizes content and contour for the reconstruction of about 600 Million Stasi snippets. The ruling analysis classifies the supporting material into void, lined, and checked paper. If a ruling is detected, the lines are localized accurately which allows for snippet alignments. Snippets might have sparse visual content depending on the conscientious- ness when tearing. Therefore a new word localization { the so-called Profile Box { is introduced which keeps a compact word representation while accounting for anticipated deformations such as a word's local skew. These word boxes are further classified into printed, manuscript, and non-text elements by means of Gradient Shape Features (Gsf) which are designed newly for this task. The latter class allows for rejecting falsely bi- narized elements which improves the robustness in the presence of degraded or noisy documents. Finally, a layout analysis is performed that is based on a bottom-up ap- proach to keep the element clustering flexible even if a global text structure is not present. Results on various publicly available databases show that the methodology is capable of being adopted to different document analysis scenarios. A synthetic database for ruling line removal is created and made publicly available which allows comparisons between the approach proposed and other state-of-the-art methodologies. The text classification is compared to other approaches by means of the PRImA benchmarking database and the Iam database, which is a handwriting database written by multiple authors. The methodology presented achieves the best results in both empirical evaluations. On real world Stasi snippets, the recognition rate is lower because of the heterogeneity and sparseness of content in the data. The layout analysis is additionally evaluated on the most recent Handwriting Segmentation Contests where it competes state-of-the-art methods and on a medieval database. vii Kurzfassung Eine automatische Dokumentrekonstruktion von handzerrissenen Akten erm¨oglicht die Wiederherstellung von verlorengeglaubtem Inhalt. Das zugrundeliegende Datenmaterial beinhaltet 600 Millionen Stasi Schnipsel die beim Fall der Berliner Mauer vernichtet wurden. Konturbasierte Ans¨atze k¨onnen Schnipsel nicht richtig zusammensetzten wenn mehrere Seiten gleichzeitig zerrissen wurden. Zus¨atzlich ist die Komplexit¨at der automa- tischen Rekonstruktion zu hoch wenn jedes Schnipsel mit jedem verglichen werden muss. Deshalb wird vor der Rekonstruktion der visuelle Inhalt von Schnipseln analysiert. Da- durch kann einerseits eine Vorsortierung vorgenommen werden, andererseits erm¨oglicht es Schnipsel mit gleichen Risskanten voneinander zu unterscheiden. Algorithmen zur automatischen Textlokalisierung und Papieranalyse werden in dieser Doktorarbeit vor- gestellt, die bei der Rekonstruktion mit konturbasierten Ans¨atzen kombiniert werden. Die Papieranalyse klassifiziert Papier in liniiert, kariert und leeres Papier. Liegt ein liniiertes oder kariertes Schnipsel vor, so werden die Linien genau lokalisiert um benach- barte Schnipsel basierend auf deren Liniierung auszurichten. Des Weiteren wurde eine neue Textlokalisierung entwickelt, die W¨orter kompakt repr¨asentiert und deren lokale Ausrichtung genau wiedergibt. Alle Elemente eines Dokuments werden in Maschinen- schrift, Handschrift oder kein Text mit Hilfe von sogenannten Gradient Shape Features (Gsf) klassifiziert. Die Erkennung von kein Text erm¨oglicht es falsch binarisierte Ele- mente zu verwerfen um auf diese Weise gute Ergebnisse in verrauschten Dokumenten zu erzielen. Nach der Klassifikation und Lokalisierung von Text werden diese Elemente zu hierarchisch h¨oheren Strukturen zusammengefasst. Dabei wird ein bottom-up Verfahren verwendet welches auch auf zerrissenen Dokumenten angewendet werden kann. Die Methodik wurde empirisch auf ¨offentlich verfugbaren¨ Datens¨atzen evaluiert und mit bestehenden Dokumentanalysesystemen verglichen. Die Textklassifikation und Loka- lisierung konnte dabei bisherige Ergebnisse auf einem Datensatz mit modernen gedruck- ten Layouts und einem Handschriftendatensatz verbessern. Die Layout Analyse wurde auf den letzten drei Page Segmentation Contest Datens¨atzen evaluiert. Dort wurden im Vergleich zu State-of-the-Art Methoden ¨ahnliche Ergebnisse erzielt, wobei sich heraus- stellte, dass besonders Bangla eine Schwierigkeit fur¨ die entwickelte Methode darstellt. ix Contents 1 Introduction 1 1.1 Motivation . 2 1.2 Main Contribution . 7 1.3 Definition of Terms . 9 1.4 Results . 11 2 Related Work 15 2.1 Ruling Analysis . 17 2.2 Features for Text Classification . 18 2.3 Machine Learning . 27 2.4 Text Classification . 30 2.5 Document Layout Analysis & Page Segmentation . 41 2.6 Text Line Segmentation . 49 3 Ruling Analysis 57 3.1 Methodology . 57 3.2 Evaluation . 65 4 Text Classification 75 4.1 Pre Processing . 76 4.2 Text Localization . 80 4.3 Gradient Shape Features . 83 4.4 Classification . 88 4.5 Evaluation . 91 5 Layout Analysis 113 5.1 Text Line Clustering . 114 5.2 Text Line Localization . 118 5.3 Evaluation . 121 6 Conclusion 131 Bibliography 141 xi CHAPTER 1 Introduction Document analysis in the context of computer vision aims at automatically extracting
Recommended publications
  • Reconocimiento De Escritura Lecture 4/5 --- Layout Analysis
    Reconocimiento de Escritura Lecture 4/5 | Layout Analysis Daniel Keysers Jan/Feb-2008 Keysers: RES-08 1 Jan/Feb-2008 Outline Detection of Geometric Primitives The Hough-Transform RAST Document Layout Analysis Introduction Algorithms for Layout Analysis A `New' Algorithm: Whitespace Cuts Evaluation of Layout Analyis Statistical Layout Analysis OCR OCR - Introduction OCR fonts Tesseract Sources of OCR Errors Keysers: RES-08 2 Jan/Feb-2008 Outline Detection of Geometric Primitives The Hough-Transform RAST Document Layout Analysis Introduction Algorithms for Layout Analysis A `New' Algorithm: Whitespace Cuts Evaluation of Layout Analyis Statistical Layout Analysis OCR OCR - Introduction OCR fonts Tesseract Sources of OCR Errors Keysers: RES-08 3 Jan/Feb-2008 Detection of Geometric Primitives some geometric entities important for DIA: I text lines I whitespace rectangles (background in documents) Keysers: RES-08 4 Jan/Feb-2008 Outline Detection of Geometric Primitives The Hough-Transform RAST Document Layout Analysis Introduction Algorithms for Layout Analysis A `New' Algorithm: Whitespace Cuts Evaluation of Layout Analyis Statistical Layout Analysis OCR OCR - Introduction OCR fonts Tesseract Sources of OCR Errors Keysers: RES-08 5 Jan/Feb-2008 Hough-Transform for Line Detection Assume we are given a set of points (xn; yn) in the image plane. For all points on a line we must have yn = a0 + a1xn If we want to determine the line, each point implies a constraint yn 1 a1 = − a0 xn xn Keysers: RES-08 6 Jan/Feb-2008 Hough-Transform for Line Detection The space spanned by the model parameters a0 and a1 is called model space, parameter space, or Hough space.
    [Show full text]
  • Comparison of Visual and Logical Character Segmentation in Tesseract OCR Language Data for Indic Writing Scripts
    Comparison of Visual and Logical Character Segmentation in Tesseract OCR Language Data for Indic Writing Scripts Jennifer Biggs National Security & Intelligence, Surveillance & Reconnaissance Division Defence Science and Technology Group Edinburgh, South Australia {[email protected]} interfaces, online services, training and training Abstract data preparation, and additional language data. Originally developed for recognition of Eng- Language data for the Tesseract OCR lish text, Smith (2007), Smith et al (2009) and system currently supports recognition of Smith (2014) provide overviews of the Tesseract a number of languages written in Indic system during the process of development and writing scripts. An initial study is de- internationalization. Currently, Tesseract v3.02 scribed to create comparable data for release, v3.03 candidate release and v3.04 devel- Tesseract training and evaluation based opment versions are available, and the tesseract- on two approaches to character segmen- ocr project supports recognition of over 60 lan- tation of Indic scripts; logical vs. visual. guages. Results indicate further investigation of Languages that use Indic scripts are found visual based character segmentation lan- throughout South Asia, Southeast Asia, and parts guage data for Tesseract may be warrant- of Central and East Asia. Indic scripts descend ed. from the Brāhmī script of ancient India, and are broadly divided into North and South. With some 1 Introduction exceptions, South Indic scripts are very rounded, while North Indic scripts are less rounded. North The Tesseract Optical Character Recognition Indic scripts typically incorporate a horizontal (OCR) engine originally developed by Hewlett- bar grouping letters. Packard between 1984 and 1994 was one of the This paper describes an initial study investi- top 3 engines in the 1995 UNLV Accuracy test gating alternate approaches to segmenting char- as “HP Labs OCR” (Rice et al 1995).
    [Show full text]
  • Design a Fast 3D Scanner Using a Laser Line
    REPUBLIC OF TURKEY FIRAT UNIVERSITY GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCE AN ANDROID BASED RECEIPT TRACKER SYSTEM USING OPTICAL CHARACTER RECOGNITION KAREZ ABDULWAHHAB HAMAD Master Thesis Department: Software Engineering Supervisor: Asst. Prof. Dr. Mehmet KAYA JULY – 2017 ACKNOWLEDGEMENTS First, thanks to ALLAH, the Almighty, for granting me the well and strength, with which this master thesis was accomplished; it will be the first step to propose much more great scientific researches. I would like to acknowledge my thankfulness and appreciation to my supervisor Asst. Prof. Dr. Mehmet KAYA for his guidance, assistance encouragement, wisdom suggestions, and valuable advice that made the completion of the present master thesis possible. Last but not the least; I want to express my special thankfulness to my lovely parents, and special gratitude to all members of my family and friends. Special thanks to my lovely uncle Assoc. Prof. Dr. Yadgar Rasool, who helped me and encouraged me a lot during my study. II TABLE OF CONTENTS Page No ACKNOWLEDGEMENTS ............................................................................................... II TABLE OF CONTENTS ................................................................................................. III ABSTRACT ....................................................................................................................... VI ÖZET ................................................................................................................................ VII
    [Show full text]
  • Mathematical Expression Detection and Segmentation in Document Images
    Mathematical Expression Detection and Segmentation in Document Images Jacob R. Bruce Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Master of Science In Computer Engineering A. Lynn Abbott, Committee Chair Jianhua J. Xuan, Co-Chair Michael S. Hsiao February 12, 2014 Blacksburg, VA Keywords: document layout analysis, optical character recognition, mathematical expression detection and segmentation, document image, type- specific layout analysis Mathematical Expression Detection and Segmentation in Document Images Jacob R. Bruce Abstract Various document layout analysis techniques are employed in order to enhance the accuracy of optical character recognition (OCR) in document images. Type-specific document layout analysis involves localizing and segmenting specific zones in an image so that they may be recognized by specialized OCR modules. Zones of interest include titles, headers/footers, paragraphs, images, mathematical expressions, chemical equations, musical notations, tables, circuit diagrams, among others. False positive/negative detections, oversegmentations, and undersegmentations made during the detection and segmentation stage will confuse a specialized OCR system and thus may result in garbled, incoherent output. In this work a mathematical expression detection and segmentation (MEDS) module is implemented and then thoroughly evaluated. The module is fully integrated with the open source OCR software, Tesseract, and is designed to function as a component of it. Evaluation is carried out on freely available public domain images so that future and existing techniques may be objectively compared. Acknowledgments I would like to thank Dr. Abbott for his helpful insights. I'd also like to thank my lab-mates Sherin Aly, Sherin Ghannam, Ahmed Ibrahim, Mahmoud Sobhy, and Amira Youssef for keeping me company, teaching me a thing or two about their rich culture, and letting me try some great foods.
    [Show full text]
  • Geometric Layout Analysis of Scanned Documents
    Geometric Layout Analysis of Scanned Documents Dissertation submitted to the Department of Computer Science Technical University of Kaiserslautern for the fulfillment of the requirements for the doctoral degree Doctor of Engineering (Dr.-Ing.) by Faisal Shafait Thesis supervisors: Prof. Dr. Thomas Breuel, TU Kaiserslautern Prof. Dr. Horst Bunke, University of Bern Chair of supervisory committee: Prof. Dr. Karsten Berns, TU Kaiserslautern Kaiserslautern, April 22, 2008 D 386 Abstract Layout analysis–the division of page images into text blocks, lines, and determination of their reading order–is a major performance limiting step in large scale document dig- itization projects. This thesis addresses this problem in several ways: it presents new performance measures to identify important classes of layout errors, evaluates the per- formance of state-of-the-art layout analysis algorithms, presents a number of methods to reduce the error rate and catastrophic failures occurring during layout analysis, and de- velops a statistically motivated, trainable layout analysis system that addresses the needs of large-scale document analysis applications. An overview of the key contributions of this thesis is as follows. First, this thesis presents an efficient local adaptive thresholding algorithm that yields the same quality of binarization as that of state-of-the-art local binarization methods, but runs in time close to that of global thresholding methods, independent of the local window size. Tests on the UW-1 dataset demonstrate a 20-fold speedup compared to traditional local thresholding techniques. Then, this thesis presents a new perspective for document image cleanup. Instead of trying to explicitly detect and remove marginal noise, the approach focuses on locating the page frame, i.e.
    [Show full text]
  • Extending the Page Segmentation Algorithms of the Ocropus Document Layout Analysis System
    EXTENDING THE PAGE SEGMENTATION ALGORITHMS OF THE OCROPUS DOCUMENTATION LAYOUT ANALYSIS SYSTEM by Amy Alison Winder A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science Boise State University August 2010 c 2010 Amy Alison Winder ALL RIGHTS RESERVED BOISE STATE UNIVERSITY GRADUATE COLLEGE DEFENSE COMMITTEE AND FINAL READING APPROVALS of the thesis submitted by Amy Alison Winder Thesis Title: Extending the Page Segmentation Algorithms of the OCRopus Document Layout Analysis System Date of Final Oral Examination: 28 June 2010 The following individuals read and discussed the thesis submitted by student Amy Alison Winder, and they evaluated her presentation and response to questions during the final oral examination. They found that the student passed the final oral examination. Elisa Barney Smith, Ph.D. Co-Chair, Supervisory Committee Timothy Andersen, Ph.D. Co-Chair, Supervisory Committee Amit Jain, Ph.D. Member, Supervisory Committee The final reading approval of the thesis was granted by Elisa Barney Smith, Ph.D., Co- Chair of the Supervisory Committee. The thesis was approved for the Graduate College by John R. Pelton, Ph.D., Dean of the Graduate College. Dedicated to my parents, Mary and Robert, and to my children, Samantha and Thomas. iv ACKNOWLEDGMENTS The author wishes to express gratitude to Dr. Barney Smith for providing a structured and supportive environment in which to formulate, develop, and complete this thesis. Not only did she provide the initial concept for the work, but she was instrumental in selecting the specific topic and followed through by meeting with me weekly to guide and support this effort.
    [Show full text]
  • Ocrdroid: a Framework to Digitize Text Using Mobile Phones
    OCRdroid: A Framework to Digitize Text Using Mobile Phones Anand Joshiy, Mi Zhangx, Ritesh Kadmawalay, Karthik Dantuy, Sameera Poduriy and Gaurav S. Sukhatmeyx yComputer Science Department, xElectrical Engineering Department University of Southern California, Los Angeles, CA 90089, USA {ananddjo,mizhang,kadmawal,dantu,sameera,gaurav}@usc.edu Abstract. As demand grows for mobile phone applications, research in optical character recognition, a technology well developed for scanned documents, is shifting focus to the recognition of text embedded in digital photographs. In this paper, we present OCRdroid, a generic framework for developing OCR-based ap- plications on mobile phones. OCRdroid combines a light-weight image prepro- cessing suite installed inside the mobile phone and an OCR engine connected to a backend server. We demonstrate the power and functionality of this framework by implementing two applications called PocketPal and PocketReader based on OCRdroid on HTC Android G1 mobile phone. Initial evaluations of these pilot experiments demonstrate the potential of using OCRdroid framework for real- world OCR-based mobile applications. 1 Introduction Optical character recognition (OCR) is a powerful tool for bringing information from our analog lives into the increasingly digital world. This technology has long seen use in building digital libraries, recognizing text from natural scenes, understanding hand-written office forms, and etc. By applying OCR technologies, scanned or camera- captured documents are converted into machine editable soft copies that can be edited, searched, reproduced and transported with ease [15]. Our interest is in enabling OCR on mobile phones. Mobile phones are one of the most commonly used electronic devices today. Commodity mobile phones with powerful microprocessors (above 500MHz), high- resolution cameras (above 2megapixels), and a variety of embedded sensors (ac- celerometers, compass, GPS) are widely deployed and becoming ubiquitous.
    [Show full text]
  • Australasian Language Technology Association Workshop 2015
    Australasian Language Technology Association Workshop 2015 Proceedings of the Workshop Editors: Ben Hachey Kellie Webster 8{9 December 2015 Western Sydney University Parramatta, Australia Australasian Language Technology Association Workshop 2015 (ALTA 2015) http://www.alta.asn.au/events/alta2015 Online Proceedings: http://www.alta.asn.au/events/alta2015/proceedings/ Silver Sponsors: Bronze Sponsors: Volume 13, 2015 ISSN: 1834-7037 II ALTA 2015 Workshop Committees Workshop Co-Chairs • Ben Hachey (The University of Sydney | Abbrevi8 Pty Ltd) • Kellie Webster (The University of Sydney) Workshop Local Organiser • Dominique Estival (Western Sydney University) • Caroline Jones (Western Sydney University) Programme Committee • Timothy Baldwin (University of Melbourne) • Julian Brook (University of Toronto) • Trevor Cohn (University of Melbourne) • Dominique Estival (University of Western Sydney) • Gholamreza Haffari (Monash University) • Nitin Indurkhya (University of New South Wales) • Sarvnaz Karimi (CSIRO) • Shervin Malmasi (Macquarie University) • Meladel Mistica (Intel) • Diego Moll´a(Macquarie University) • Anthony Nguyen (Australian e-Health Research Centre) • Joel Nothman (University of Melbourne) • C´ecileParis (CSIRO { ICT Centre) • Glen Pink (The University of Sydney) • Will Radford (Xerox Research Centre Europe) • Rolf Schwitter (Macquarie University) • Karin Verspoor (University of Melbourne) • Ingrid Zukerman (Monash University) III Preface This volume contains the papers accepted for presentation at the Australasian Language Tech-
    [Show full text]
  • Ocrdroid: a Framework to Digitize Text Using Mobile Phones
    OCRdroid: A Framework to Digitize Text on Smart Phones Mi Zhang§, Anand Joshi†, Ritesh Kadmawala†, Karthik Dantu†, Sameera Poduri†, Gaurav S. Sukhatme†§ †Computer Science Department, §Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, CA 90089, USA {mizhang, ananddjo, kadmawal, dantu, sameera, gaurav}@usc.edu Abstract— As demand grows for mobile phone applications, bad orientation, and text misalignment) [17]. Moreover, since research in optical character recognition, a technology well the system is running on a mobile phone, real time response developed for scanned documents, is shifting focus to the recog- is also a critical challenge. We have developed a framework nition of text embedded in digital photographs. In this paper, we present OCRdroid, a generic framework for developing called OCRdroid. It utilizes embedded sensors (orientation OCR-based applications on mobile phones. OCRdroid combines sensor, camera) combined with image preprocessing suite a lightweight image preprocessing suite installed inside the built on top of the hardware to address those issues men- mobile phone and an OCR engine connected to a backend tioned above. We have also evaluated our OCRdroid frame- server. We demonstrate the power and functionality of this work by implementing two applications called PocketPal and framework by implementing two applications called PocketPal and PocketReader based on OCRdroid on HTC Android G1 PocketReader based on this framework. Our experimental mobile phone. Initial evaluations of these pilot experiments results demonstrate the OCRdroid framework is feasible demonstrate the potential of using OCRdroid framework for for building real-world OCR-based mobile applications. The real-world OCR-based mobile applications. main contributions of this work are: • An algorithm to detect text misalignment in real time, I.
    [Show full text]
  • Report on File Formats for Hand-Written Text Recognition (HTR) Material
    Report on File Formats for Hand-written Text Recognition (HTR) Material CO:OP Community as Opportunity The Creative Archives´ and Users´ Network Author: Sami Nousiainen, National Archives of Finland December 16, 2016 This work is licensed under Creative Commons CC BY 4.0. Abbreviations and terms Abbreviation or term Explanation ADDML Archival Data Description Markup Language AHAA A metadata project of the National Archives of Finland AIP Archival Information Package ALLÄRS General Swedish Thesaurus ALTO Analyzed Layout and Text Object binarize To convert the document image into bi -level form as an attempt to separate the text from the background. CHANGE Change Vocabulary Specification CSS Cascading Style Sheets DC Dublin Core deskew To correct the angle of the text. dewarp To correct the distortion due to page curl and perspective that is present in document page images captured using digital cameras. DIP Dissemination Information Package Dublin Core Schema Set of terms to describe web resources or physical resources. EAC -CPF Encoded Archival Context – Corporate bodies, Persons and Families EAD Encoded Archival Description EDM Europeana Data Model ESE Europeana Semantic Elements faceted search A way of accessing information enabling the usage of several filters (e.g. filter for organization, format, year). FINNA Search interface for Finnish archives, libraries and museums FINTO Finnish Thesaurus and Ontology service GTS Ground Truth Storage HTML HyperText Markup Language HTR Handwritten Text Recognition IE Information Extraction indexing Can be used in two meanings in this document: 1) Providing a more detailed structural description of an archival unit by specifying page numbers and corresponding topic. 2) Automatic processing of documents resulting in the creation of an index within indexing/search software enabling (e.g.
    [Show full text]
  • 2.1.4 Document Layout Analysis
    http://waikato.researchgateway.ac.nz/ Research Commons at the University of Waikato Copyright Statement: The digital copy of this thesis is protected by the Copyright Act 1994 (New Zealand). The thesis may be consulted by you, provided you comply with the provisions of the Act and the following conditions of use: Any use you make of these documents or images must be for research or private study purposes only, and you may not make them available to any other person. Authors control the copyright of their thesis. You will recognise the author’s right to be identified as the author of the thesis, and due acknowledgement will be made to the author where appropriate. You will obtain the author’s permission before publishing any material from the thesis. Improving Digital Library Support for Historic Newspaper Collections A thesis submitted in partial fulfilment of the requirements of the degree of Master of Science in Computer Science at The University of Waikato By Leo Lin The University of Waikato 2009 Abstract National and international initiatives are underway around the globe to digitise the vast treasure troves of historical artefacts they contain and make them available as digital libraries (DLs). The developed DLs are often constructed from facsimile pages with pre-existing metadata, such as historic newspapers stored on microfiche or generated from the non-destructive scanning of precious manuscripts. Access to the source documents is therefore limited to methods constructed from the metadata. Other projects look to introduce full-text indexing through the application of off-the-shelf commercial Optical Character Recognition (OCR) software.
    [Show full text]