Sheet Music Search


Masaryk University
Faculty of Informatics

Sheet Music Search

Bachelor's Thesis

Jan-Sebastian Fabík

Brno, Spring 2018

MASARYK UNIVERSITY
Faculty of Informatics

BACHELOR'S THESIS ASSIGNMENT

Student: Jan-Sebastian Fabík
Program: Informatics
Field of study: Computer Systems and Data Processing
Field guarantor: prof. RNDr. Václav Matyáš, M.Sc., Ph.D. (PSZD)
Thesis supervisor: doc. RNDr. Aleš Horák, Ph.D.
Department: Department of Machine Learning and Data Processing
Thesis title: Vyhledávání notových zápisů
Thesis title in English: Sheet Music Search

Assignment: Sheet music of all kinds is freely available on the Internet, but there is no suitable way to search for it. The aim of the thesis is to design and implement a metasearch engine that can filter the search results of standard search engines and recognize sheet music in the form of images and PDF documents. The designed system must use machine learning to recognize whether a result is sheet music and use an OCR tool to determine the textual data of the piece (title, author, ...). The resulting work will consist of two parts: a text part that describes existing techniques and projects focused on recognizing sheet music (and its metadata) and searching for it, followed by an analysis, a solution design, and a description and evaluation of the implementation; and a functional implemented system.

Literature:
Chamberlain, A., & Crabtree, A. (2016). Searching for music: understanding the discovery, acquisition, processing and organization of music in a domestic setting for design. Personal and Ubiquitous Computing, 20(4), 559-571.
The International Society for Music Information Retrieval, www.ismir.net

The assignment was approved through the IS MU.

Declaration of the author of the school work

Name, surname, and UČO of the student:

I acknowledge that Masaryk University may, on the basis of the law (§ 35 paragraphs 3 and 4 of the Copyright Act No. 121/2000 Sb.), use my qualification thesis or any other school work that I created as its author to fulfil my study obligations towards this university, for teaching or for its own internal needs, and not for the purpose of direct or indirect commercial or other economic benefit.

The internal needs of Masaryk University cover use of the work not only in its original form but also in a processed or otherwise altered form. This includes use that consists in assigning my school work to another student of this university (a member of the same academic community) for further processing, in order to create another qualification thesis or other school work derived from my work, with attribution of my authorship, the title of my work, and the source. All of this is in accordance with the development of education at Masaryk University and the interest of this public university in building on the results of my work and developing my school work further within the same academic community.

I am obliged to inform this university, through the study department and no later than upon submission of the qualification thesis, of any circumstances worthy of special consideration on my part, for example an interest in developing my qualification thesis further myself at Masaryk University or elsewhere.

In Brno, date _________________ _________________ student's signature

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Jan-Sebastian Fabík

Advisor: doc. RNDr. Aleš Horák, Ph.D.

Acknowledgements

I would like to thank my advisor, doc. RNDr. Aleš Horák, Ph.D., for his helpful guidance.

Abstract

A wide variety of sheet music is freely available online, but there is no suitable search tool for it. The aim of this thesis is to design and implement a metasearch engine that filters and recognizes sheet music in the search results of standard image search engines. The designed system uses convolutional neural networks and deep residual networks to recognize whether a given result is sheet music, and it uses OCR to determine the text information of the track (the track title and artist name).

Keywords

classification, convolutional neural networks, deep residual networks, OCR, sheet music identification, web application, search engine

Contents

1 Introduction
  1.1 Motivation
  1.2 A brief description of the system
    1.2.1 Image classifiers
    1.2.2 Metadata extractor
    1.2.3 Term search
    1.2.4 Web application
  1.3 Structure of the thesis
2 State of the art
  2.1 Sheet music metasearch engines
  2.2 Catalogs and music databases
  2.3 Optical music recognition (OMR)
  2.4 Sheet music identification
  2.5 Optical character recognition (OCR)
  2.6 Feedforward neural networks
    2.6.1 Architecture
    2.6.2 Activation functions
    2.6.3 Cost function
    2.6.4 Stochastic gradient descent algorithm (SGD)
    2.6.5 Back-propagation algorithm
    2.6.6 Regularization
  2.7 Convolutional neural networks
    2.7.1 Convolution operation
    2.7.2 Architecture
    2.7.3 Convolution stage
    2.7.4 Detector stage
    2.7.5 Pooling stage
  2.8 Deep residual networks
3 Image Classifiers
  3.1 Dataset
    3.1.1 Downloading dataset
    3.1.2 Manual classification of images
    3.1.3 Dataset storage
    3.1.4 Preprocessing
  3.2 Neural network architectures
    3.2.1 Convolutional neural networks
    3.2.2 Deep residual networks
  3.3 Implementation
    3.3.1 Format of the HTTP API
    3.3.2 Command line interface
  3.4 Test results
    3.4.1 Sheet music classification
    3.4.2 Watermark detection
    3.4.3 Title detection
4 Metadata extraction
  4.1 Database of track titles and artists
    4.1.1 Normalization
  4.2 Implementation
    4.2.1 Text entries recognition
    4.2.2 Text entries processing
    4.2.3 Weight calculation
  4.3 Search with an error tolerance
    4.3.1 Levenshtein distance
    4.3.2 Prefix trees
    4.3.3 Searching for strings in a prefix tree by the Levenshtein distance from a given string
    4.3.4 Implementation
  4.4 Test results
    4.4.1 Text recognition
    4.4.2 Metadata recognition
5 Conclusion
  5.1 Image classification
  5.2 Metadata extraction
  5.3 Web application
A System Implementation
  A.1 System components
  A.2 Web application
    A.2.1 Controllers
    A.2.2 Command line interface
  A.3 Deployment
  A.4 Installation instructions
Bibliography

List of Tables

3.1 Category sizes
3.2 Categories included in the classes
3.3 Parameters of the ConvNets with topology A
3.4 Parameters of the ConvNets with topology B
3.5 Modified dropout rates of the ConvNets
3.6 Accuracy of the models evaluated on the test set
3.7 Accuracy of the ConvNets with modified dropout rates
4.1 Metadata recognition statistics

List of Figures

2.1 The rectified linear function
2.2 The logistic sigmoid function
2.3 An example of 2-D convolution, Based on: Figure 9.1 from Deep Learning by Ian Goodfellow et al. [28]
2.4 Structure of a convolutional neural network, Source: Mathworks.com [29]
2.5 Training error (left) and test error (right) on CIFAR-10 dataset with 20-layer and 56-layer networks, Source: Deep Residual Learning for Image Recognition, Figure 1 [30]
2.6 Residual learning: building block, Source: Deep Residual Learning for Image Recognition, Figure 2 [30]
3.1 Manual classification application UI
3.2 Examples of images by categories (cropped)
3.3 Resize modes
3.4 Sample handwritten sheet music
3.5 ConvNet topologies
3.6 Error and loss of convnetA2 and convnetA3 on the sheet music dataset
3.7 Sample images incorrectly marked as sheet music
3.8 Sample images incorrectly marked as watermarked
3.9 Sample images with incorrectly detected title
4.1 The structure of the database tables for the Discogs.com dataset
4.2 Sample prefix tree
4.3 Examples of images not recognized by the OCR tool
4.4 Examples of images with incorrectly recognized metadata
A.1 Search form and results page
A.2 Deployment diagram

1 Introduction

1.1 Motivation

The primary aim of this thesis is to design and implement a metasearch engine for sheet music freely available on the Internet. As of April 2018, there is no specialized search engine for freely available sheet music. People who want to find sheet music online typically use standard search engines and append the phrase “sheet music” to the query. The search results sometimes include images that are not sheet music or that contain a different track than the user expected. A metasearch engine designed specifically for sheet music search would help them find the most relevant results faster.

1.2 A brief description of the system

The system designed and implemented in this thesis provides image classifiers, a metadata extractor, an application for term search parameterized by Levenshtein distance, and a web application with a search form.
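To make the idea of a term search parameterized by Levenshtein distance concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation used in the thesis: the function names `levenshtein` and `search_terms` are hypothetical, and the sketch scans a plain list of terms, whereas the table of contents indicates that the thesis searches a prefix tree.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the rows short
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def search_terms(query: str, terms: list[str], max_distance: int) -> list[str]:
    """Hypothetical helper: return the terms whose distance from the
    query does not exceed max_distance (linear scan, for illustration)."""
    return [t for t in terms if levenshtein(query, t) <= max_distance]
```

For instance, `search_terms("beethoven", ["beethoven", "bethoven", "mozart"], 1)` keeps the first two terms, because the misspelling differs from the query by a single deleted character. A linear scan like this is only practical for small term lists, which is one plausible reason to prefer a prefix tree for a large database of titles and artists.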
1.2.1 Image classifiers

The classifiers classify images by three criteria: whether they contain sheet music, whether they are watermarked, and whether they contain a heading with a track title. They are implemented using neural networks. I experimented with convolutional neural networks and deep residual networks. For training, I downloaded a dataset of 13,000 images from an image search engine and manually assigned them to the corresponding classes. After training, I evaluated the models by their accuracy on the test set.

1.2.2 Metadata extractor

The metadata extractor uses an OCR tool to extract text from the input image.