bachelor’s thesis

Font Family/Style Recognition

Tereza Soukupová

May 2014

Ing. Michal Bušta

Czech Technical University in Prague

Faculty of Electrical Engineering, Department of Cybernetics


Czech Technical University in Prague

Faculty of Electrical Engineering

Department of Cybernetics

BACHELOR PROJECT ASSIGNMENT

Student: Tereza S o u k u p o v á

Study programme: Open Informatics

Specialisation: Computer and Information Science

Title of Bachelor Project: Font Family/Style Recognition

Guidelines:

1. Familiarize yourself with the TextSpotter [2] for “text in the wild” detection and recognition developed at the Centre for Machine Perception at the Department of Cybernetics FEE CTU. Focus on the OCR module. 2. Familiarize yourself with the state-of-the-art in estimation of the Font Family/Style Recognition. 3. Define the “Class of the Font” (particularly useful for the OCR task). 4. Suggest an algorithm for the estimation of the class of the Font. 5. Implement it and test its quality. 6. Try to use the information about the class of the Font to improve the OCR quality.

Bibliography/Sources: [1] Neumann L., Matas J.: Scene Text Localization and Recognition with Oriented Stroke Detection. ICCV 2013 (Sydney, Australia). [2] Neumann L.: Vyhledání a rozpoznání textu v obrazech reálných scén. Master thesis, ČVUT, 2010. [3] Al-Khaffaf H. S. M., Shafait F., Cutter M. P., Breuel T. M.: On the Performance of Decapod’s Digital Font Reconstruction. International Conference on Pattern Recognition November 2012, pp.649 – 652 (2012).

Bachelor Project Supervisor: Ing. Michal Bušta

Valid until: the end of the summer semester of academic year 2014/2015

L.S.

doc. Dr. Ing. Jan Kybic prof. Ing. Pavel Ripka, CSc. Head of Department Dean

Prague, January 10, 2014



Acknowledgement

I would like to thank Ing. Michal Bušta for his guidance, patience, willingness and assistance during the writing of my thesis. I also thank prof. Ing. Jiří Matas, Ph.D. for his advice.

Declaration

I declare that I have completed this thesis independently and that I have listed all used information sources in accordance with Methodical instruction about ethical principles in the preparation of university theses.


Abstract

This work presents an algorithm for optical font recognition of text in real-scene images. It is based on the OCR system TextSpotter developed at the Czech Technical University in Prague. This system locates areas with text, detects connected components of characters, binarizes them, extracts features and tries to recognize the characters. The OCR system trains its classifier on images of letters of the Latin alphabet in all 162 initial fonts. Some characters are not well recognized or stay unrecognized, because mismatching characters may look similar in different fonts; e.g. a 'g' in one font may look like an '8' in another. The goal of this project is to recognize the font of the text and to train the classifier only on this font. The unrecognized characters are then classified again by the classifier with the reduced training set. It is thus more likely that they will be classified correctly, and the OCR quality increases.

Keywords

Optical Font Recognition; OFR; text detection; Optical Character Recognition; OCR

Contents

1 Introduction 1
  1.1 Problem formulation 1
  1.2 Definitions 3

2 State-of-the-art 4
  2.1 Global feature approaches 4
  2.2 Local feature approaches 7
  Our approach 8

3 OCR system TextSpotter 9
  3.1 Input 9
  3.2 Processing 9
  3.3 Extracting features 10
  3.4 OCR classifier 10
    3.4.1 K-nearest neighbor 10
    3.4.2 Training 11
    3.4.3 Classification 11

4 Clustering 12
  Hierarchical clustering 13
  Implementation 13
  4.1 Clustering through all characters 14
    4.1.1 Distances between fonts 14
    The class of the font 14
    4.1.2 Clusters and their representatives 15
    Example with the character a 15
  4.2 Clustering per character 18
    4.2.1 Distances 18
    4.2.2 Clusters and their representatives 18

5 Recognition algorithms 19
  5.1 Font recognition 19
    5.1.1 Nearest neighbour font voting 19
    5.1.2 Finding the shortest path in a multistage graph 20
    Dynamic programming approach 23
  5.2 Clustering through all characters 26
    5.2.1 Nearest neighbour cluster voting 26
    5.2.2 Finding the shortest path in a multistage graph 26
  5.3 Clustering per character 28
    The algorithm 28
  5.4 An utilization of font knowledge to improve the OCR 29
    Recognition pipeline 30

6 Experiments 31
  6.1 Font or cluster recognition 31
    6.1.1 DATASET 1 – computer-generated images 31
    6.1.2 DATASET 2 – real-scene images 32
    Example 1 32
    Example 2 33
    Example 3 33
    Example 4 34
  6.2 An utilization of font knowledge to improve the OCR quality 35
    6.2.1 DATASET 1 – computer-generated images 35
    6.2.2 DATASET 2 – all 132 real-scene images 36
    6.2.3 DATASET 2a – method FontNN 37
    6.2.4 DATASET 2b – method FontDynamic 37
    6.2.5 DATASET 2c – method ClusterNN 37
    6.2.6 DATASET 2d – method ClusterDynamic 38
    6.2.7 DATASET 2e – method ClusterPerCharacter 38
    6.2.8 Examples 39
      Examples with an improvement 39
      Examples with no improvement 40
      Examples where the methods have errors 42

7 Implementation 44
  7.1 Programming language 44
  7.2 Used libraries 44
  7.3 The code 44

8 Conclusion 45
  8.1 Font or cluster recognition 45
  8.2 An utilization of font knowledge to improve the OCR quality 45

Appendices

A Appendix 47
  A.1 The maximum distance between the letters within the cluster is 6.5 48
  A.2 The maximum distance between the letters within the cluster is 7.5 49
  A.3 The maximum distance between the letters within the cluster is 8 50
  A.4 The maximum distance between the letters within the cluster is 8.5 51

B Appendix 52
  B.1 The maximum distance between the letters within the cluster is 5 53
  B.2 The maximum distance between the letters within the cluster is 6 54
  B.3 The maximum distance between the letters within the cluster is 8 55
  B.4 The maximum distance between the letters within the cluster is 8 56

C Appendix 57
  C.1 Enclosed CD 57

Bibliography 58

Abbreviations

OFR    Optical Font Recognition
OCR    Optical Character Recognition
k-NN   k-nearest neighbors


1 Introduction

Optical character recognition (OCR) automatically converts photographed text in real-scene images to a computer-readable form. The output of the OCR is a sequence of ASCII character codes. It is more difficult to recognize text in real-scene images than in printed documents: worse readability is caused by geometric distortion, bad lighting conditions and other factors influencing legibility. Automatic processing of text in images is useful in many areas, e.g. automatic number plate recognition, converting books into electronic form, or automatic reading of administrative documents such as passports, bank statements or receipts. An OCR system is also very helpful for blind and visually impaired people, allowing them to read labels on food or medicaments, timetables or bills. Nowadays almost everyone has a mobile phone with a camera, so taking photos is ever more popular. OCR is a complex problem in the field of computer vision. Optical font recognition (OFR) is also important, although it is a much less researched problem than OCR. OFR does not concern the labeling of characters; instead it tries to recognize the font of a typed text. Knowing the font is necessary e.g. for document reproduction. It can also be useful for graphic designers to detect the font of labels written on products.

1.1 Problem formulation

This work focuses on text detection and recognition in real-scene images, e.g. traffic signs, labels on buildings, labels on food or medicaments. It is built on the basis of the TextSpotter system [1], where text localization is discussed; text localization is not the focus of this thesis. The OCR system recognizes each character individually without any a priori knowledge of the font of the whole text. The main contributions of this work are the font recognition of the text and the application of font knowledge to improve the OCR system.


The input of the algorithm is a sequence of blobs, each of which we try to label with a character. For each blob the admissible representatives are determined by a character c_i ∈ Alphabet and a font f_i ∈ Fonts. In the ideal case the algorithm finds a classification of the blobs {(c_1, f_1), ..., (c_n, f_n)} such that the font is invariant: f_1 = f_2 = ... = f_n. The knowledge of the font can be used to improve the OCR quality. The problem is that mismatching characters in two different fonts may look the same (e.g. 'g' and '9' or 'g' and '8', see Figure 1). The training set can then be reduced and the classifier trained only on the recognized fonts, which may lead to better classification. There are a great many fonts, despite the fact that we only focus on the Latin alphabet. It would be very difficult to train the classifier on all existing fonts, and new fonts are still being made. However, many fonts are very similar, so getting a similar font is sufficient for our purposes. For initial experiments I have decided to use a set of the most common fonts: the Windows system fonts.

a) ’8’ - Bold, ’g’ - Bold. b) ’g’ - Vrinda Bold, ’9’ - Calibri Bold.

Figure 1 Mismatching letters which look similar in different fonts.

The algorithm has been tested on 2 datasets. The first dataset is a set of the 20 most common English words in all 162 initial fonts, synthetically generated in a computer, 3240 tested words in total. The second dataset consists of 132 real-scene images with 940 typed words.

This work is structured as follows. Chapter 2 reviews the state of the art in OFR. The TextSpotter system and the preprocessing of the images are described in Chapter 3. Chapter 4 presents different ways of clustering fonts. The font recognition algorithms are proposed in Chapter 5. Chapter 6 deals with experiments and testing of the proposed algorithms. The implementation details and used libraries are described in Chapter 7. Chapter 8 evaluates the work.


1.2 Definitions

Let me define terms that I will use throughout the thesis.

Term             Explanation
Character c      a representation of the ASCII code
Alphabet Ω       the set of all characters
Font f           a set of all characters that share design features
Letter           a character written in one font
Blob x           a bitmap that resembles a letter
Visual mistake   an unmeasurable variable, decided subjectively by visual inspection

2 State-of-the-art

There are several reasons for font recognition. The most common one is document reproduction: you have a scan or an image of a document (a book, an article) and you want to make the most similar digitized copy of it, for which you need to know the font of the text. Optical font recognition is also useful for graphic and font designers. Another reason for OFR is to improve OCR systems: we believe that if we know the font of the given text, we can reach a higher OCR quality. This is our main goal for the font recognition. Most of the methods for OFR are intended for document reproduction. They mostly need several lines of text for the recognition, which is unusable for us, because we usually have only a few characters or words in a line. We can only borrow some approaches and feature extractors from these methods. There are two main approaches: global and local feature approaches. In the global feature approach, features are computed from a whole block of text (lines, paragraphs, pages). Such approaches are good for document reproduction without any knowledge of the content and letters; they usually work with texture recognition. The local feature approach deals with individual letters or even parts of letters. These features are based on specific details of each letter.

2.1 Global feature approaches

Zramdini and Ingold [2, 3] present an a priori font recognition method based on global features extracted from vertical and horizontal projection profiles of text lines. They state that these features are independent. A multivariate Bayesian classifier is used for classification. They recognize only 10 typefaces in 7 sizes and 4 styles (280 font models). Each font model has been estimated from the features of about 100 text lines scanned at 300 dpi. The classifier achieved a recognition rate of nearly 97%. The font recognition rate is sensitive to the length of the text; however, they consider the method applicable to short texts of about ten characters.

Figure 2 Horizontal and vertical projection profiles. This picture is taken from the paper[3].

Satkhozhina, Ahmadullin and Allebach [4] deal with the reproduction of printed documents in their article and try to generate a new, visually similar document with new content. They use 8 features from Zramdini and Ingold [2] based on projection profiles (width and height of the word, densities of projections, normalized height and width, spaces between characters). These features can be used only on some characters, so they were expanded by 5 new ones for better usage on short texts. They used a conditional random field, which is a probabilistic graphical model. The advantage of this method is that the typographical features need not be independent of each other. They scanned 168 pages at 300 dpi of 11 typefaces, 2 weights, 2 slopes and 14 font sizes. The recognition quality was about 64%. However, they consider their method sufficient for the automated generation of a new document that looks visually similar to the original.

Figure 3 Horizontal and vertical projection profiles. This picture is reproduced from the paper [3].

In the papers by Cutter et al. [5, 6] a method for font reconstruction is proposed which makes scanned documents searchable and similar to the original. They form the reconstructed font from letters created from the approximated shapes of tokens of the letters. They scanned pages with multiple fonts at 300 dpi, and the method has 98.4% accuracy in assigning letters to their candidate fonts.

Zhang, Lu and Tan [7] describe a method for italic font recognition using stroke pattern analysis. They apply wavelet decomposition to each word image from scanned text documents. The recognition accuracy is about 96%.

Khoubyari and Hull [8, 9] focus on font recognition using frequent words in a document such as the, of, and, a and to. They make clusters of equivalent words from word images segmented from a document. Then they find clusters with the given function words and compare them with a database of fonts. In their experiments, 34 out of 40 test cases were correctly identified.

Zhu, Tan and Wang [10] deal with a font recognition method based on global feature extraction and texture analysis of a document. The recognition rate is 99.1%.

Carlos, Juan and Hidalgo [11] proposed a global method based on the analysis of the texture of a text block. They use invariant moments of a random variable computed from the image block. They use 8 basic fonts with their varieties: regular, italic, bold and italic bold, 32 combinations in total. They scanned text blocks at 300 dpi and reached 95% accuracy.

Figure 4 A uniform text block after preprocessing. This picture is taken from the paper [10].

In the paper [12] a method is presented for reconstructing fonts from printed documents and generating a PDF document with a font faithful to the original.

Figure 5 Sample line of Times font. Top - Original, Middle - Decapod. This picture is taken from the paper [12].


2.2 Local feature approaches

Solli and Lenz [13] use a local approach based on eigenimages. Principal component analysis describes the images in a lower-dimensional subspace. The disadvantage of this method is that it cannot be used without a working OCR system. It is evaluated on printed and scanned text lines with 2763 different fonts. The method finds the correct font name for 99 percent of the queries within the 5 best matches.

Figure 6 The first five eigenimages for character a, reshaped to 2-D images. This picture is taken from the paper [13].

Lidke [14] applied a Bag-of-Visual-Words approach, which takes letter snippets and translates them into visual words. It was tested on 9809 different fonts and the best-in-five rate is 94%.

Figure 7 Examples of visual words. This picture is taken from the paper [14].

Ozturk, Sankur and Abak [15] deal with font clustering. Their goal is not to get the exact font but a similar font in order to improve the OCR. They try to find a minimal set of clusters that provides adequate OCR performance across all fonts and to choose one representative from each cluster. They use 28 typefaces in weight and slope varieties (65 in total). The paper shows that six to eight clusters prove to be adequate for all the fonts.


Our approach

Most of the global feature approaches are not useful for us because they use features from a whole block of text, while we have only several characters. These approaches often use the ratio between uppercase and lowercase letters; we cannot use this because our words often contain only uppercase or only lowercase letters. We also have to deal with perspective distortion and rotation of the text, which makes the task more difficult. The local feature approaches working with single letters are more useful for us, because we often have only a few words (sometimes even just one) at the input and we process them per letter.

3 OCR system TextSpotter

The algorithm follows the work of Neumann and Matas [16], [17]. The OCR task is a complex computer vision problem with many applications. This chapter describes the preprocessing of characters.

Figure 8 The output of the TextSpotter system.

3.1 Input

The input for this algorithm is any picture with captured text. The text needs to have at least 3 characters.

3.2 Processing

The preprocessing consists of text localization, segmentation and binarization of characters. The characters in the real-scene image are detected as connected components (described in more detail in [16]). The detection is based on the idea that the text in the image has to be in contrast with its background. The algorithm then determines whether each connected component is a character or a non-character, and the whole text is binarized. The connected components are merged into lines. Each character is then cropped by its bounding box and normalized to a fixed size of 20×20 pixels.
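The cropping and normalization step can be sketched in a few lines of Python. This is only a minimal illustration assuming a binarized connected component as input; cv2.resize is an illustrative choice, not necessarily what TextSpotter itself uses.

# A sketch: crop a binarized connected component by its bounding box
# and normalize it to 20x20 pixels.
import cv2
import numpy as np

def normalize_blob(binary_img: np.ndarray) -> np.ndarray:
    ys, xs = np.nonzero(binary_img)                  # foreground pixel coordinates
    crop = binary_img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return cv2.resize(crop.astype(np.uint8), (20, 20),
                      interpolation=cv2.INTER_AREA)  # fixed 20x20 output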

Figure 9 The character a and its bounding box.

3.3 Extracting features

We use 200-dimensional features based on the chain code. The image is divided into 25 (5×5) small squares of the same size. For each of 8 directions (down, up, left, right, down-left, down-right, up-left, up-right) we check each small square for a contour in that direction. Finally, the eight 25-dimensional features are concatenated into one 200-dimensional feature vector. The images are blurred to be more robust to distortion.
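To make the feature construction concrete, the following is a minimal sketch of such a 200-dimensional descriptor, assuming a 20×20 binarized glyph. The exact contour tracing and blur used by TextSpotter are not specified in the text, so the contour test, the Gaussian sigma and the pooling below are illustrative choices, not the system's actual implementation.

import numpy as np
from scipy.ndimage import binary_erosion, gaussian_filter

# The 8 chain-code directions: down, up, left, right and the four diagonals.
DIRECTIONS = [(1, 0), (-1, 0), (0, -1), (0, 1),
              (1, -1), (1, 1), (-1, -1), (-1, 1)]

def chain_code_features(glyph: np.ndarray, blur_sigma: float = 1.0) -> np.ndarray:
    """glyph: 20x20 binary array (1 = foreground). Returns a 200-dim vector."""
    glyph = glyph.astype(bool)
    contour = glyph & ~binary_erosion(glyph)             # border pixels only
    maps = np.zeros((8, 20, 20))
    for y, x in zip(*np.nonzero(contour)):
        for d, (dy, dx) in enumerate(DIRECTIONS):
            ny, nx = y + dy, x + dx
            # mark direction d where the contour continues that way
            if 0 <= ny < 20 and 0 <= nx < 20 and contour[ny, nx]:
                maps[d, y, x] = 1.0
    features = []
    for d in range(8):
        blurred = gaussian_filter(maps[d], blur_sigma)   # robustness to distortion
        # pool into a 5x5 grid of 4x4 cells -> 25 values per direction
        features.append(blurred.reshape(5, 4, 5, 4).sum(axis=(1, 3)).ravel())
    return np.concatenate(features)                      # 8 * 25 = 200 dimensions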

Figure 10 Heat maps from chain code features with Gaussian distortion of character A, the font is Bold.

3.4 OCR classifier

3.4.1 K-nearest neighbor

A sufficient classifier for the moment is the k-nearest neighbor. It finds the k most similar results to the input data. The training set is a set of multidimensional feature vectors with class labels. Computing exact nearest neighbors in higher dimensions is a very computationally expensive task, so the Fast Library for Approximate Nearest Neighbors (FLANN) [18] is used and only approximate neighbors that are close enough are found.

3.4.2 Training

The classifier is trained on synthetic data: a set of images of letters generated in a computer. All 62 characters of the alphabet are generated in 162 fonts, in a pure form and also with blur distortion, 40,176 samples in total.

3.4.3 Classification

The OCR system extracts features from the image of the character to be recognized and searches for the k nearest neighbours. It searches through all characters of all fonts. The most frequent result is considered the recognized character. This step does not use any knowledge of the font of the word; in the following chapters we will assume that each word is written in one font.
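A minimal sketch of this classification step is shown below. FLANN is what the system uses; scipy's cKDTree stands in here as an exact-search substitute, and the value of k is an illustrative assumption.

import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

class OCRClassifier:
    def __init__(self, train_features: np.ndarray, train_labels):
        self.tree = cKDTree(train_features)   # train_features: (n, 200)
        self.labels = list(train_labels)      # character label per sample

    def classify(self, feature: np.ndarray, k: int = 11):
        _, idx = self.tree.query(feature, k=k)
        votes = Counter(self.labels[i] for i in np.atleast_1d(idx))
        char, count = votes.most_common(1)[0]
        return char, count / k                # most frequent char and its quality

The returned quality count / k is the same ratio that reappears as Q_c in Equation (8) of Chapter 6.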

Figure 11 Example of k-NN classification. The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 (solid line circle) it is assigned to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle). This image and caption are taken from Wikipedia [19].

4 Clustering

There are many fonts all over the world, and it is complex and inefficient to train the k-nearest neighbors classifier on all existing fonts. The task is to observe the feature space of letters in different fonts and explore whether it is possible to divide the fonts into classes and to find one representative from each class.

Figure 12 Minimum spanning tree of all fonts. It is a difficult task to cluster fonts.

Hierarchical clustering

Hierarchical clustering is a clustering method which groups data according to a criterion and creates a hierarchy tree. There are two main approaches: agglomerative and divisive. In the agglomerative approach each observation begins in its own separate cluster and clusters are then merged. In the divisive approach all observations start in one cluster and are then divided according to the criterion. I have decided to use the agglomerative approach because it is a less complex method and easier to implement. The result of hierarchical clustering can be shown in a dendrogram.

Figure 13 Dendrogram of all fonts with coloured clusters.

Implementation

For clustering, the library SciPy [20] is used, especially its module scipy.cluster.hierarchy. It is very important to choose the linkage method properly; it determines how the distance between 2 clusters is computed in hierarchical agglomerative clustering. The main linkage methods are single, complete and average. The single linkage method merges the two clusters with the minimum distance over all points in both clusters. The complete linkage method merges two clusters in the same way as the single method but uses the maximum of the distances, and the average method computes the average distance over all points in a cluster. I have decided to use the complete criterion because it guarantees that no two fonts in the same cluster are too far apart.

$$d(u, v) = \max_{i,j} \operatorname{dist}(u[i], v[j]) \qquad (1)$$

for all points $i$ in cluster $u$ and $j$ in cluster $v$. This is also known as the Farthest Point Algorithm.
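A minimal sketch of this step with scipy.cluster.hierarchy is shown below, assuming the font-to-font dissimilarity matrix of Section 4.1.1 has already been computed; the function and variable names are illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_fonts(dist_matrix: np.ndarray, max_dist: float = 8.0) -> np.ndarray:
    """dist_matrix: symmetric (n, n) font-to-font dissimilarities D(f1, f2)."""
    condensed = squareform(dist_matrix, checks=False)  # scipy expects condensed form
    Z = linkage(condensed, method='complete')          # Farthest Point Algorithm
    # cut the dendrogram so that no two fonts in one cluster exceed max_dist
    return fcluster(Z, t=max_dist, criterion='distance')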

4.1 Clustering through all characters

As the first experiment I have tried to cluster fonts that are similar in all characters.

4.1.1 Distances between fonts

There are many ways to compute a distance between two fonts, e.g. using the minimal, average or maximal distance between matching characters from both fonts. For the computation we use our 200-dimensional features based on the chain code. As the dissimilarity measure D between two fonts I have chosen the largest Euclidean distance between the features of the same character from both fonts. The dissimilarity measure is thus determined by the two least similar matching characters.

$$D(f_1, f_2) = \max_{c \in C} \| c_{f_1} - c_{f_2} \|_2 \qquad (2)$$

where $C$ is the set of all possible characters:

$$C = \{1, 2, \ldots, A, B, \ldots, Z, a, b, \ldots, z\} \qquad (3)$$

$f_1$ and $f_2$ are two given fonts and $c_{f_i}$ are the extracted chain code features of character $c$ in font $f_i$.
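A direct transcription of Equation (2) in Python, assuming features[font][char] holds the 200-dimensional chain-code vector of a character in a font (the data layout is an assumption):

import numpy as np

def font_distance(features: dict, f1: str, f2: str) -> float:
    # D(f1, f2) = maximum over all matching characters of the Euclidean distance
    return max(np.linalg.norm(features[f1][c] - features[f2][c])
               for c in features[f1])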

The class of the font

Here we can define the class of the font as all fonts belonging to the same cluster. This means that the dissimilarity measure between any two fonts within one cluster is bounded by a maximum threshold.


4.1.2 Clusters and their representatives

I attach some images of clusters made with hierarchical clustering using the maximal Euclidean distance and the complete linkage criterion. I have experimented with changing the maximum distance within each cluster. The representative font is chosen as the font having the smallest sum of feature distances to all other fonts. More examples can be seen in Appendix A.
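Selecting the representatives can be sketched as follows; whether the distance sums run over cluster members only or over all fonts is read from the text above, so treat that detail as an assumption.

import numpy as np

def pick_representatives(dist_matrix: np.ndarray, cluster_ids: np.ndarray) -> dict:
    """Return, per cluster id, the member with the smallest summed distance."""
    reps = {}
    for cid in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == cid)
        sums = dist_matrix[np.ix_(members, members)].sum(axis=1)
        reps[cid] = int(members[np.argmin(sums)])   # index of the representative font
    return reps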

Example with the character a

A satisfying number of clusters is obtained with the maximal distance between letters within a cluster set to 8, see Figure 14. However, there are some visual mistakes, coloured red, which do not look good. We can decrease the distance to 7.5, see Figure 15; there are still some mistakes. If we decrease the distance to 6.5, we can see well separated clusters, see Figure 16. But 52 clusters is too many, and many fonts end up in a cluster by themselves.


Figure 14 12 clusters of the character a with the maximal distance set to 8.

Figure 15 25 clusters of the character a with the maximal distance set to 7.5.

Figure 16 52 clusters of the character a with the maximal distance set to 6.5.


The table below shows the 25 representative fonts and their clusters, which I will use in the following parts of this thesis.

REPRESENTATIVE      FONTS IN CLUSTER
SegoeScript-Bold    SegoeScript-Bold, SegoeScript-R.
IskoolaPota-R.      AngsanaNew-Bold, AngsanaNew-R., AngsanaUPC-R., Aparajita-Bold, Aparajita-R., David-Bold, David-R., FrankRuehl-R., FreeSerif-Medium, IskoolaPota-Bold, IskoolaPota-R., Kokila-Bold, Kokila-R., MicrosoftHimalaya-R., MongolianBaiti-R., ShonarBangla-Bold, ShonarBangla-R., TimesNewRoman-Bold, TimesNewRoman-R., TraditionalArabic-Bold, TraditionalArabic-R.
[?]-R.              Andalus-R., ArabicTypesetting-R., Constantia-R., DaunPenh-R., MicrosoftUighur-R., Narkisim-R., [?]-R., PalatinoLinotype-Bold, PalatinoLinotype-R., PlantagenetCherokee-R.
IrisUPC-Bold        IrisUPC-Bold, IrisUPC-R.
Cambria-Bold        [?]-Bold, Constantia-Bold, EucrosiaUPC-Bold, EucrosiaUPC-R., Georgia-Bold, Georgia-R., KodchiangUPC-Bold, KodchiangUPC-R., Vani-Bold, Vani-R.
[?]-R.              Gabriola-R.
SegoePrint-Bold     SegoePrint-Bold, SegoePrint-R.
MVBoli-R.           MVBoli-R.
LilyUPC-Bold        LilyUPC-Bold, LilyUPC-R.
FranklinG.M.-R.     FranklinGothicMedium-R., FreesiaUPC-Bold, FreesiaUPC-R.
Arial-Black         Arial-Black, CourierNew-Bold, DilleniaUPC-Bold, JasmineUPC-Bold
Calibri-Bold        Aharoni-Bold, Calibri-Bold, SimplifiedArabic-Bold
KaiTi-R.            [?]-Bold, Consolas-R., KaiTi-R., SimHei-R.
[?]-R.              Corbel-Bold, Corbel-R., LevenimMT-Bold, LevenimMT-R.
Kalinga-Bold        DilleniaUPC-R., JasmineUPC-R., Kalinga-Bold, Kalinga-R.
SimplifiedA.F.-R.   CourierNew-R., MiriamFixed-R., Rod-R., SimplifiedArabicFixed-R.
KhmerUI-Bold        Arial-Bold, BrowalliaNew-Bold, BrowalliaUPC-Bold, ComicSansMS-Bold, ComicSansMS-R., [?]-Bold, Gisha-Bold, Kartika-Bold, KhmerUI-Bold, LaoUI-Bold, Latha-Bold, Leelawadee-Bold, Mangal-Bold, MicrosoftNewTaiLue-Bold, MicrosoftPhagsPa-Bold, MicrosoftTaiLe-Bold, MicrosoftYaHei-Bold, Raavi-Bold, SegoeUI-Bold, [?]-Bold, Tunga-Bold, Utsaah-Bold, [?]-Bold, Vrinda-Bold
[?]-Bold            Calibri-Light, Calibri-R., Candara-Bold, Candara-R., EstrangeloEdessa-R., [?]-R., MoolBoran-R., SakkalMajalla-Bold, SakkalMajalla-R., TrebuchetMS-Bold, TrebuchetMS-R.
CordiaNew-R.        Arial-R., BrowalliaNew-R., CordiaNew-R., CordiaUPC-Bold, CordiaUPC-R., DokChampa-R., Gautami-R., Kartika-R., Latha-R., Mangal-R., MicrosoftSansSerif-R., MicrosoftYiBaiti-R., Miriam-R., Raavi-R., Shruti-Bold, Shruti-R., SimplifiedArabic-R., Tunga-R., Utsaah-R., Vrinda-R.
KhmerUI-R.          Gisha-R., KhmerUI-R., LaoUI-R., Leelawadee-R., LucidaConsole-R., LucidaSansUnicode-R., MalgunGothic-Bold, MalgunGothic-R., MicrosoftJhengHei-Bold, MicrosoftJhengHei-R., MicrosoftNewTaiLue-R., MicrosoftPhagsPa-R., MicrosoftTaiLe-R., MicrosoftYaHei-R., SegoeUI-Light, SegoeUI-R., SegoeUI-Semibold, SegoeUISymbol-R., Tahoma-R., Verdana-R.
FangSong-R.         DFKai-SB-R., FangSong-R., SimSun-ExtB-R.
Vijaya-Bold         Vijaya-Bold, Vijaya-R.
FISHfingers-Light   FISHfingers-Light, FISHfingers-R.
Impact-R.           Impact-R., TheMightyAvengers-TheMightyAvengers
ASweetM.M.L.-R.     ASweetMelodyMyLady-R.

Figure 17 The table shows the clusters of the fonts and their selected representative fonts ([?] marks font names that are not recoverable from the source).


4.2 Clustering per character

It is difficult to cluster fonts in a way that satisfies all the characters without producing a lot of clusters, because the distances between the same character in different fonts are sometimes larger than the distances between different characters. The second experiment for font clustering is therefore to cluster per character: for each character its own clusters of fonts are created. Then we define the classes of the font for each character. One class of the font for character c contains all fonts whose characters c have a smaller distance from one another than a maximum threshold.

4.2.1 Distances

The same clustering procedure as described above is used. The only difference is in computing the distances. Each character is taken separately, and the distances are computed between the letters of all fonts for that character.

$$Cdistance(f_1, f_2) = \| c_{f_1} - c_{f_2} \|_2 \qquad (4)$$

for all characters:

$$C = \{1, 2, \ldots, A, B, \ldots, Z, a, b, \ldots, z\} \qquad (5)$$

$f_1$ and $f_2$ are two given fonts and $c_{f_i}$ are the extracted chain code features of character $c$ in font $f_i$.

4.2.2 Clusters and their representatives

I attach an example of clustering of the character a. For a satisfying clustering with only 1 visual mistake, 8 clusters are enough. More examples are shown in Appendix B.

Figure 18 Clusters of the character a with maximal distance between letters set to 7.

5 Recognition algorithms

The main idea of this work is to recognize a font and to use this knowledge to improve the OCR quality. The current OCR system searches for letters similar to the connected component across all fonts, and the nearest neighbor classifier returns several hypotheses. The problem is that a character in one font can look like a different character in another font. The idea is to introduce an additional latent variable (the class of the font) to improve the OCR. We can thus reduce the training set and try to find a classification of the whole word with an invariant font.

5.1 Font recognition

This section proposes methods for the recognition of a specific font, i.e. determining one of the 162 training fonts.

5.1.1 Nearest neighbour font voting

This is the simplest method of determining a font and serves as a baseline for font recognition. For each connected component in a detected text the 1-nearest neighbour is found; it is assumed to be the right character. The most frequent font is chosen as the correct one.

In this example the word Lectures has 8 letters, and the font Kokila Bold was recognized at 2 letters. This means that the font was recognized with 25% certainty. Certainty is the ratio between the number of letters with the recognized font and the number of all letters. If each font occurs only once, the font of the recognized character with the lowest distance is chosen.
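The voting itself is a few lines; the sketch below assumes each blob's 1-NN result carries the font label of the matched training letter, and the names are illustrative rather than taken from the thesis code.

from collections import Counter

def recognize_font_nn(blob_matches):
    """blob_matches: one (font, distance) 1-NN result per blob."""
    votes = Counter(font for font, _ in blob_matches)
    best_font, count = votes.most_common(1)[0]
    if count == 1:   # every font occurs only once: take the closest match
        best_font = min(blob_matches, key=lambda m: m[1])[0]
    certainty = votes[best_font] / len(blob_matches)
    return best_font, certainty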


     1st
L    Shonar Bangla Bold
e    Kokila Bold
c    David Bold
t    Traditional Arabic Bold
u    David Regular
r    Times New Roman Bold
e    Kokila Bold
s    Microsoft Himalaya Regular

Figure 19 The table shows the nearest neighbour fonts for each character from the word Lectures.

5.1.2 Finding the shortest path in a multistage graph

A more robust way to detect a font is finding the shortest path in a weighted directed multistage graph. More than one nearest neighbour is found for each character; in our example we have found 5 nearest neighbours, as you can see in the table below. A font can be repeated in one column because there are 4 samples generated in the training set for each letter: one normal and three with some distortion.

     1st                         2nd                   3rd                    4th                       5th
L    Shonar Bangla Bold          Aparajita Bold        Shonar Bangla Bold     Shonar Bangla Bold        David Bold
e    Kokila Bold                 Vani Bold             Cambria Bold           Cambria Bold              Georgia Regular
c    David Bold                  Kokila Bold           Jasmine UPC Bold       Kodchiang UPC Bold        Franklin Gothic Regular
t    Traditional Arabic Bold     Iskoola Pota Bold     Iskoola Pota Bold      Traditional Arabic Bold   Frank Ruehl Regular
u    David Regular               Frank Ruehl Regular   Iskoola Pota Regular   Times New Roman Regular   Mongolian Baiti Regular
r    Times New Roman Bold        Shonar Bangla Bold    Aparajita Bold         Shonar Bangla Bold        David Bold
e    Kokila Bold                 Eucrosia UPC Bold     DFKai Regular          Shonar Bangla Bold        Georgia Bold
s    Microsoft Himalaya Regular  Aparajita Regular     Shonar Bangla Bold     Times New Roman Bold      Eucrosia UPC Regular

Figure 20 The table shows 5 nearest neighbour fonts for each character from the word Lectures.

The cardinalities of the fonts are counted from the table above as the number of columns in which the given font appears.

Shonar Bangla Bold 4     Eucrosia UPC Bold 1          Microsoft Himalaya Regular 1
Kokila Bold 3            Georgia Regular 1            David Regular 1
David Bold 3             Franklin Gothic Regular 1    Traditional Arabic Bold 1
Aparajita Bold 2         Aparajita Regular 1          Mongolian Baiti Regular 1
Times New Roman Bold 2   Vani Bold 1                  Times New Roman Regular 1
Frank Ruehl Regular 2    Kodchiang UPC Bold 1         Jasmine UPC Bold 1
Georgia Bold 1           Cambria Bold 1               Iskoola Pota Regular 1
Eucrosia UPC Regular 1   DFKai Regular 1              Iskoola Pota Bold 1

Figure 21 The cardinalities of the fonts found as the nearest neighbours.


The font of the evaluated text is not known, so an image of the word Lectures in Kokila Bold is attached only for visual comparison.

Figure 22 The original image with the comparison of the cropped word and the word generated in the detected font Kokila Bold and in the second best font Shonar Bangla Bold.

Let us have a directed multistage graph G = (V, E). The vertices of this graph are partitioned into disjoint sets V_i, 0 ≤ i ≤ L, where L = 1 + len(word). The cardinalities of the sets V_0 and V_L are |V_0| = |V_L| = 1, and the cardinality of V_i, 1 ≤ i ≤ L − 1, is the number of nearest neighbours found, |V_i| = K. The set V_0 contains the start node (S) and V_L the final node (F). Each set V_i is called a stage. Each node in a stage V_i, 1 ≤ i ≤ L − 1, contains information about the font, the character and the distance.

From each node in stage V_i, 0 ≤ i ≤ L − 1, directed edges lead to all nodes in stage V_{i+1}. An edge has a cost c(i, j). Our problem is to find a minimum cost path from the start node to the final node. The cost of an edge between nodes A, B is computed dynamically as the sum of the rank of the nearest neighbour B (the row in which node B appears), a penalty of the font for low cardinality, and a penalty if the font differs from the font of the previous node on the path. The algorithm finds the shortest path between nodes S and F. The most frequent font on the shortest path is returned as the recognized font.


Let us define:

c(A, B)             the cost of the edge between nodes A and B
penNN(A)            the rank of node A in the nearest-neighbours queue (the row of node A in Figure 20)
penDiffFont(A, B)   the penalty if the font name of node B differs from the font name of the previous node A
card(A)             the number of columns in which the font of A appears
penCard(A)          the penalty for the low cardinality of the font of node A
K                   the number of nearest neighbours found

$$penDiffFont(A, B) = \begin{cases} K, & \text{if } font(B) \neq font(A) \\ 0, & \text{otherwise} \end{cases}$$

$$penCard(F_i) = \left[\max_{j \in F} card(F_j) - card(F_i)\right] \cdot 2 \qquad (6)$$

where F is the set of all fonts found. The cost c(A, B) is computed as:

$$c(A, B) = penNN(B) + penDiffFont(A, B) + penCard(B) \qquad (7)$$


[Figure: the multistage graph for the word Lectures, with start node S, final node F, and a column of 3 nearest-neighbour nodes per character.]

Figure 23 Multistage graph of the 3 nearest neighbours. Each column (except the first and the last one) holds the nearest neighbours for one character of the word Lectures. The abbreviations mean: SB - Shonar Bangla Bold, KB - Kokila Bold, DB - David Bold, TA - Traditional Arabic Bold, DR - David Regular, TN - Times New Roman Bold, MH - Microsoft Himalaya Regular, AB - Aparajita Bold, VB - Vani Bold, IP - Iskoola Pota Bold, FR - Frank Ruehl Regular, EU - Eucrosia UPC Bold, AR - Aparajita Regular, CB - Cambria Bold, JU - Jasmine UPC Bold, IR - Iskoola Pota Regular, DF - DFKai Regular.

Because of the 'global' penalties penDiffFont and penCard, the found path is not guaranteed to be optimal. This could be solved e.g. by finding the K shortest paths or by increasing the number of nearest neighbours. Designing an optimal algorithm for this difficult task was beyond the time available for the thesis.

Dynamic programming approach

The shortest path in the multistage graph is found by dynamic programming; the forward approach [21] is used. Let c(A, B) be the cost of the edge between nodes A and B and d(S, A) be the cost of a path from the start node S to node A.

c(S, A) = 2    c(A, D) = 4    c(B, D) = 3    c(C, D) = 4    c(D, F) = 1
c(S, B) = 2    c(A, E) = 5    c(B, E) = 2    c(C, E) = 3    c(E, F) = 3
c(S, C) = 4    c(A, G) = 4    c(B, G) = 4    c(C, G) = 1    c(G, F) = 2

The distances are computed from left (start node) to right (final node):

d(S, A) = c(S, A) = 2
d(S, B) = c(S, B) = 2
d(S, C) = c(S, C) = 4

d(S, D) = min[d(S, A) + c(A, D), d(S, B) + c(B, D), d(S, C) + c(C, D)]
        = min[2 + 4, 2 + 3, 4 + 4] = d(S, B) + c(B, D) = 5

d(S, E) = min[d(S, A) + c(A, E), d(S, B) + c(B, E), d(S, C) + c(C, E)]
        = min[2 + 5, 2 + 2, 4 + 3] = d(S, B) + c(B, E) = 4

d(S, G) = min[d(S, A) + c(A, G), d(S, B) + c(B, G), d(S, C) + c(C, G)]
        = min[2 + 4, 2 + 4, 4 + 1] = d(S, C) + c(C, G) = 5

d(S, F) = min[d(S, D) + c(D, F), d(S, E) + c(E, F), d(S, G) + c(G, F)]
        = min[5 + 1, 4 + 3, 5 + 2] = d(S, D) + c(D, F) = 6

The minimum cost of the path is 6 and the path is S-B-D-F.


[Figure: the example multistage weighted directed graph with nodes S, A, B, C, D, E, G, F and the edge costs listed above.]

Figure 24 An example of a multistage weighted directed graph.

input : a matrix of nearest neighbours F of size (number of NN = N) × (length of word)
output: the recognized font

for c ← 1 to len(word) do
    for r ← 1 to N do
        BestPredecessor ← NULL;
        BestCost ← MAXINT;
        for node in nodes of previous stage do
            cost(c, r) = node.cost + r + penaltyDiffFont + penaltyCard;
            if cost(c, r) < BestCost then
                BestCost ← cost(c, r);
                BestPredecessor ← node;
            end
        end
        nodes[r][c].path = BestPredecessor.path + F[r][c];
        nodes[r][c].cost = BestCost;
    end
end
[BestCost, BestNode] ← findMinCost(nodes[:, -1]);
RecognizedFont ← findMostOccuredFont(BestNode.path);

Algorithm 1: Finding the shortest path
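A runnable Python rendering of Algorithm 1 under the cost model of Equations (6) and (7) may look as follows; the interface (a list of rank-ordered (font, character) hypotheses per blob) is an illustrative assumption, not the thesis code.

from collections import Counter

def recognize_font_dp(nn):
    """nn[i] = list of K (font, char) hypotheses for blob i, rank-ordered."""
    # cardinality of a font = number of blobs (columns) in which it appears
    card = Counter()
    for hyps in nn:
        for font in {f for f, _ in hyps}:
            card[font] += 1
    max_card = max(card.values())
    K = len(nn[0])
    stage = [(0.0, [])]                                    # virtual start node: (cost, path)
    for hyps in nn:
        new_stage = []
        for rank, (font, _) in enumerate(hyps, start=1):   # penNN = rank
            pen_card = (max_card - card[font]) * 2         # Eq. (6)
            best = None
            for cost, path in stage:
                pen_diff = K if path and path[-1] != font else 0
                total = cost + rank + pen_diff + pen_card  # Eq. (7)
                if best is None or total < best[0]:
                    best = (total, path)
            new_stage.append((best[0], best[1] + [font]))
        stage = new_stage
    _, best_path = min(stage)
    return Counter(best_path).most_common(1)[0][0]         # most frequent font on the path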


5.2 Clustering through all characters

Several clusters of similar fonts have been made, see Section 4.1. If the maximal distance between letters is set to 8, then 12 clusters are created, and from each cluster one representative font is chosen. The algorithm for recognition is the same as in the previous section, with the difference that only the cluster representative fonts are used.

5.2.1 Nearest neighbour cluster voting

The method is similar to the method in Subsection 5.1.1: the 1-nearest neighbour is found and the most frequent cluster is selected. The clustering is described in Section 4.1. In the example with the word Lectures, the cluster Iskoola Pota Regular was recognized with 50% certainty.

     1st
L    Iskoola Pota Regular
e    KhmerUI Regular
c    Iskoola Pota Regular
t    Iskoola Pota Regular
u    Fang Song Regular
r    Vijaya Bold
e    Constantia Regular
s    Iskoola Pota Regular

Figure 25 The table shows the nearest neighbour clusters for each character from the word Lectures.

Figure 26 The original image with the comparison of the cropped word and the word generated in the detected cluster Iskoola Pota Regular.

5.2.2 Finding the shortest path in a multistage graph

The method is described above in Section 5.1.2. The difference is that only the representative fonts of the clusters are used. Because the training set consists of a smaller number of fonts, only 3 nearest neighbours are found for each blob.


     1st                    2nd                    3rd
L    Iskoola Pota Regular   Kaiti Regular          Corbel Regular
e    KhmerUI Regular        Fang Song Regular      Constantia Regular
c    Iskoola Pota Regular   Vijaya Bold            KhmerUI Bold
t    Iskoola Pota Regular   KhmerUI Regular        Corbel Regular
u    Fang Song Regular      Iskoola Pota Regular   KhmerUI Regular
r    Vijaya Bold            KhmerUI Bold           Kaiti Regular
e    Constantia Regular     Kaiti Regular          KhmerUI Regular
s    Iskoola Pota Regular   Constantia Regular     Fang Song Regular

Figure 27 The table shows 3 nearest neighbour clusters for each character from the word Lectures.

The cardinalities of the clusters are computed. The cardinality of a cluster is the number of columns in which the cluster appears.

Iskoola Pota Regular 5   Fang Song Regular 3   KhmerUI Bold 2
Khmer UI Regular 4       KaiTi Regular 3       Corbel Regular 2
Constantia Regular 3     Vijaya Bold 2

Figure 28 The cardinalities of the clusters found as the nearest neighbours.

The path in the multistage graph found by dynamic programming is shown below.

[Figure: the multistage graph of the 3 nearest-neighbour clusters for the word Lectures with the found shortest path.]

Figure 29 Multistage graph of the 3 nearest neighbours. Each column (except the first and the last one) holds the nearest neighbours for one character of the word Lectures. The abbreviations mean: IP - Iskoola Pota Regular, KR - Kaiti Regular, CR - Corbel Regular, KU - KhmerUI Regular, FS - Fang Song Regular, CO - Constantia Regular, VB - Vijaya Bold, KB - KhmerUI Bold.


5.3 Clustering per character

This type of clustering is described in more detail in Section 4.2. Each character has its own clusters of fonts.

The algorithm

The blobs are first recognized as characters. If the quality of the recognition is greater than 0.5, the blob is marked as well recognized and classified as that character. For each such blob, the first well recognized result is taken from the nearest-neighbours queue and its font is saved together with the character label. Then, for each saved pair (font and character), the cluster of that character which contains the font is found. For the experiments, 25 clusters were made for each character. All fonts from the chosen cluster are appended to the original font as hypotheses of other possible fonts, because the recognized character differs only a little between these fonts and it is very easy to confuse the font. The most frequent fonts among all saved ones are chosen as the correct fonts.
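A sketch of this procedure, assuming a per-character cluster table clusters[char][font] that maps a font to the set of fonts in that character's cluster; the data layout and the threshold handling are assumptions.

from collections import Counter

def recognize_fonts_per_char(blobs, clusters, quality_threshold=0.5):
    """blobs: (char, quality, font) triples, one per recognized blob."""
    hypotheses = Counter()
    for char, quality, font in blobs:
        if quality > quality_threshold:        # keep well recognized blobs only
            # the saved font plus all fonts from that character's cluster
            hypotheses.update(clusters[char].get(font, set()) | {font})
    if not hypotheses:
        return []
    top = max(hypotheses.values())
    return [f for f, n in hypotheses.items() if n == top]   # most frequent fonts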

Character   OCR quality   Recognized char   Recognized font
L           1.0           L                 Aparajita Bold
e           1.0           e                 Microsoft Uighur Regular
c           1.0           c                 David Bold
t           1.0           t                 Iskoola Pota Bold
u           1.0           u                 Free Serif Medium
r           1.0           r                 Aparajita Bold
e           0.9           e                 Eucrosia UPC Bold
s           0.2           -                 -

Figure 30 Characters with their OCR qualities and fonts.

Figure 31 The letters in their recognized fonts described in Figure 30.


David Bold 4          Angsana New Bold 4          Cambria Bold 3
Eucrosia UPC Bold 4   Times New Roman Bold 4      Vani Bold 3
Aparajita Bold 4      Traditional Arabic Bold 4   ...

Figure 32 The computed cardinalities of fonts from the saved font list.

There are 6 fonts with the highest cardinality, and these are considered the correct fonts. For 4 of the 6 recognized characters it was determined that they can be written in one of these 6 fonts.

Figure 33 The word Lectures in its recognized fonts. The first line: the original image. The second line: David Bold, Eucrosia UPC Bold, Aparajita Bold. The third line: Angsana New Bold, Times New Roman Bold, Traditional Arabic Bold.

5.4 An utilization of font knowledge to improve the OCR

The shortest path found in the multistage graph with the nearest neighbours, described in 5.2.2, was used as the first attempt to improve the OCR quality. The characters belonging to the nodes of the shortest path were taken as the classified characters and compared with the standard OCR method [1]. The results were worse in most cases, although there were some examples with an improvement. We have therefore suggested an algorithm that solves this problem better.


Recognition pipeline

Firstly, a classifier is trained on the 162 initial fonts. If a blob is not classified with a certainty bigger than a threshold, the class of the font is recognized and the classifier is retrained only on letters of this class. Then the blob is reclassified.

[Figure: recognition pipeline - blobs are classified by the OCR trained on all fonts; when the OCR certainty is below the threshold, the OFR recognizes the font, the classifier is retrained on the recognized font class, and the letters are reclassified.]

Figure 34 Recognition pipeline for improving the OCR quality.
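In code, the pipeline of Figure 34 amounts to a retrain-and-retry loop. The sketch below assumes the OCRClassifier sketched in Section 3.4 plus some font-recognition function from this chapter; all names and the threshold value are illustrative.

def recognize_with_font(blobs, full_clf, recognize_font, train_on_font, theta=0.9):
    """blobs: feature vectors of one word; returns one character per blob."""
    results = [full_clf.classify(b) for b in blobs]        # (char, quality) pairs
    if all(q >= theta for _, q in results):
        return [c for c, _ in results]                     # OCR certain everywhere
    font = recognize_font(blobs)                           # OFR step
    small_clf = train_on_font(font)                        # classifier on one font class
    return [c if q >= theta else small_clf.classify(b, k=1)[0]
            for b, (c, q) in zip(blobs, results)]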

6 Experiments

We have made several experiments, described in this chapter. The font or cluster recognition is evaluated in the first part. In the second part we have tried to use this information to improve the OCR quality. We have 2 datasets. The first, DATASET 1, is synthetically generated in a computer: the 20 most common English words rendered in 162 fonts (3240 words). This dataset is important because it provides ground truth, so we can verify the results of the font recognition. The second, DATASET 2, contains 132 real-scene images with written text. In the real-scene images we do not have any information about the correct font.

6.1 Font or cluster recognition

Experiments were performed to estimate a font or a similar font of a written text. We have tested 5 methods for the font recognition.

Method FontNN            font recognition by 1-nearest-neighbour voting, see 5.1.1
Method FontDynamic       font recognition by the dynamic programming approach, see 5.1.2
Method ClusterNN         cluster recognition by 1-nearest-neighbour voting, see 5.2.1
Method ClusterDynamic    cluster recognition by the dynamic programming approach, see 5.2.2
Method ClusterPerChar    font recognition with the help of per-character clusters, see 5.3

6.1.1 DATASET 1 – computer-generated images

The table shows the results of the font or cluster recognition rate. For the methods FontNN and FontDynamic it is tested whether exactly the right font is recognized. The methods ClusterNN and ClusterDynamic indicate whether the cluster which includes the font of the written text is detected. The output of ClusterPerCharacter is not only one font or cluster but a set of possible fonts; it is tested whether the original font is in this set of most likely fonts. It is therefore obvious that its rate will be higher, and thus these methods cannot be directly compared.


              FontNN   FontDynamic   ClusterNN   ClusterDynamic   ClusterPerChar
Quality [%]   79.2     78.1          44.6        47.1             93.6

Figure 35 Font or cluster recognition qualities for each method.

method - number of clusters   N-25   D-25   N-50   D-50   N-100   D-100
F_OCR [%]                     85.8   86.1   89.7   89.4   94.2    94.5
Cluster rec. [%]              44.6   47.1   53.0   54.5   73.3    76.8

Figure 36 Cluster recognition rate and its quality measured by F_OCR with θ = 1.0. The abbreviations mean: N - the ClusterNN method, D - the ClusterDynamic method.

If the number of clusters is increased in the ClusterNN and ClusterDynamic methods, the cluster recognition rate also increases. For 100 clusters the rate is closer to the FontNN and FontDynamic rates, but it means that one cluster contains only 1.6 fonts on average. The ClusterNN and ClusterDynamic methods are not very successful for cluster recognition.

6.1.2 DATASET 2 – real-scene images

It is impossible to evaluate the methods for font recognition on real-scene images because we do not have ground truth information about the font. I can only attach some examples of images for visual comparison.

Example 1

In this image there is only one word kores.


Original image   [image of the word kores]
FontNN           Candara Bold
FontDynamic      Arial Black
ClusterNN        Khmer UI Bold
ClusterDynamic   Khmer UI Bold
ClusterPerChar   Aharoni Bold, Arial Black, Tahoma Bold, Microsoft PhagsPa Bold, Verdana Bold, Candara Bold

Figure 37 The results of the suggested methods for Image 1.

Example 2

In this image we have chosen the first word Rauchen.

Original image

FontNN           Impact Regular
FontDynamic      Impact Regular
ClusterNN        Khmer UI Bold
ClusterDynamic   Khmer UI Bold

Figure 38 The results of the suggested methods for Image 2. The method ClusterPerChar returned 20 fonts; they are not shown.

Example 3

In this image there is only one word springer.


Original image

FontNN           Iskoola Pota Bold
FontDynamic      Iskoola Pota Bold
ClusterNN        Constantia Regular
ClusterDynamic   Khmer UI Bold
ClusterPerChar   Vani Bold, Kodchiang UPC Bold, Georgia Bold, Linotype Bold, Constantia Bold, Iskoola Pota Bold

Figure 39 The results of the suggested methods for Image 3.

Example 4

In this image we have chosen the word GALAXY.

Original image   [image of the word GALAXY]
FontNN           Cordia New Regular
FontDynamic      Tunga Regular
ClusterNN        Khmer UI Regular
ClusterDynamic   Khmer UI Regular
ClusterPerChar   Lily UPC Regular, Candara Bold

Figure 40 The results of the suggested methods for Image 4.

It is very difficult to evaluate the results when the ground truth about the font is unknown. I think that in Example 1 all detected fonts are admissible. In Example 2 the methods FontNN and FontDynamic are good. In Example 3 I do not like the result of the method ClusterDynamic because of the different letter 'g'. I think all results are admissible in Example 4, but the best is FontNN.


6.2 An utilization of font knowledge to improve the OCR quality

The success of the algorithm is tested as follows. The standard OCR (S_OCR) quality is computed first. This means that a classifier is trained on a set of letters generated in all fonts, and the K nearest neighbours from this set are found for each blob. The OCR quality for each character c is computed as

$$Q_c = \frac{n_c}{K} \qquad (8)$$

where $n_c$ is the count of character $c$ among the $K$ nearest neighbours. The maximal quality $M_q$ and the most frequent character $M_c$ are found ($\Omega$ is the alphabet):

$$M_q = \max_{c \in \Omega} Q_c, \qquad M_c = \operatorname*{arg\,max}_{c \in \Omega} Q_c \qquad (9)$$

The standard OCR (S_OCR) returns $M_c$ as the final classification.

A standard OCR with unrecognized letters (U_OCR) classifies almost in the same way as S_OCR, but it rejects classifications of low quality. If M_q is greater than or equal to a threshold θ, the blob is classified as M_c; otherwise it is marked as an unrecognized letter.

The improved OCR (F_OCR) method takes the classified blobs from the S_OCR method. For each blob whose M_q is less than the threshold θ, the classification is done in a different way: the font of the whole word is detected and the nearest neighbour classifier is trained only on this font (the class of the font). For each blob with M_q < θ, one nearest neighbour is found and the blob is reclassified as the character found among the nearest neighbours.

The final OCR qualities are computed as the number of correctly recognized blobs divided by the number of all blobs.

6.2.1 DATASET 1 – computer-generated images

In this dataset 3240 words were computer-generated. There are 20 different English words in 162 fonts.


θ                                        0.5    0.6    0.7    0.8    0.9    1.0    F_OCR
All methods, S_OCR quality [%]           84.4   84.4   84.4   84.4   84.4   84.4   84.4
All methods, U_OCR quality [%]           80.3   76.1   73.3   69.6   65.4   57.4   0.0
FontNN, F_OCR quality [%]                88.9   92.3   93.6   94.5   95.8   96.0   95.4
FontDynamic, F_OCR quality [%]           88.2   91.0   92.1   93.0   94.4   94.5   94.0
ClusterNN, F_OCR quality [%]             85.1   88.4   88.1   87.8   87.3   85.8   82.0
ClusterDynamic, F_OCR quality [%]        85.9   87.2   87.1   86.8   86.4   85.1   82.1
ClusterPerCharacter, F_OCR quality [%]   89.0   92.3   93.6   94.5   95.8   95.9   95.9
The best improvement [%]                 4.6    7.9    9.2    10.1   11.4   11.6   11.5

We have observed an improvement for all methods when testing the computer-generated images with typed words. The most successful method is ClusterPerCharacter, followed by FontNN. The best improvement of 11.6% is achieved with θ = 1.0. This means that only classifications with quality 1.0 are taken from S_OCR and the rest is classified by F_OCR. The influence of F_OCR increases with θ, because more blobs are classified by F_OCR. It is a success for F_OCR that the improvement increases with θ.

6.2.2 DATASET 2 – all 132 real-scene images

The methods successful on the computer-generated data have been tested on the real-scene images.

θ                                        0.1    0.2    0.3    0.4
All methods, S_OCR quality [%]           65.5   65.5   65.5   65.5
All methods, U_OCR quality [%]           65.5   65.2   64.6   62.5
FontNN, F_OCR quality [%]                65.5   65.5   65.5   65.2
FontDynamic, F_OCR quality [%]           65.5   65.5   65.5   65.5
ClusterNN, F_OCR quality [%]             65.5   65.6   65.7   65.1
ClusterDynamic, F_OCR quality [%]        65.5   65.5   65.6   65.1
ClusterPerCharacter, F_OCR quality [%]   65.5   65.5   65.7   65.6
Best improvement, F_OCR quality [%]      0.0    0.1    0.2    0.1

As can be seen, the best improvement across all methods is zero or very small. The methods are unusable as an improvement of S_OCR on the whole dataset. They can only be used as a confirmation of the classifications with low quality (lower than θ) which the U_OCR method rejects. However, it has been found that the selection of images can greatly affect the final results. We were able to find a set of images for each method where the method is very successful; in some images the quality has been improved by F_OCR by up to about 40%.

A new subset of images from the whole DATASET 2 has been made for each method. Each subset includes the images whose improvement by that method was at least 10% for some θ.

6.2.3 DATASET 2a – method FontNN

This dataset contains 17 images.

θ                                        0.5    0.6    0.7    0.8    0.9
All methods, S_OCR quality [%]           50.0   50.0   50.0   50.0   50.0
FontNN, F_OCR quality [%]                56.7   59.7   61.0   62.1   61.5
FontDynamic, F_OCR quality [%]           56.7   58.5   59.7   61.0   60.3
ClusterNN, F_OCR quality [%]             55.6   57.2   58.2   58.2   58.5
ClusterDynamic, F_OCR quality [%]        55.4   56.9   58.2   58.7   59.2
ClusterPerCharacter, F_OCR quality [%]   53.3   55.4   56.2   55.9   56.4
The best improvement [%]                 6.7    9.7    11.0   12.1   11.5

6.2.4 DATASET 2b – method FontDynamic

This dataset contains 21 images.

θ                                        0.5    0.6    0.7    0.8    0.9
All methods, S_OCR quality [%]           45.9   45.9   45.9   45.9   45.9
FontNN, F_OCR quality [%]                51.6   53.4   54.4   54.9   53.1
FontDynamic, F_OCR quality [%]           52.9   55.4   57.9   59.4   58.4
ClusterNN, F_OCR quality [%]             51.6   52.9   53.6   53.4   52.6
ClusterDynamic, F_OCR quality [%]        51.4   52.6   52.9   53.1   52.6
ClusterPerCharacter, F_OCR quality [%]   48.4   49.9   50.6   50.9   50.6
The best improvement [%]                 3.0    5.5    8.0    9.5    8.5

6.2.5 DATASET 2c – method ClusterNN

This dataset contains 10 images.


θ                                        0.5    0.6    0.7    0.8    0.9
S_OCR quality (all methods) [%]         25.4   25.4   25.4   25.4   25.4
FontNN F_OCR quality [%]                32.6   33.5   35.2   36.9   36.9
FontDynamic F_OCR quality [%]           33.5   34.3   36.0   37.7   37.7
ClusterNN F_OCR quality [%]             34.7   36.0   38.6   40.3   42.4
ClusterDynamic F_OCR quality [%]        33.9   35.2   37.3   40.0   39.8
ClusterPerCharacter F_OCR quality [%]   29.7   30.5   30.9   30.9   31.8
Best improvement [%]                     9.3   10.6   13.2   14.9   17.0

6.2.6 DATASET 2d – method ClusterDynamic

This dataset contains 12 images.

θ                                        0.5    0.6    0.7    0.8    0.9
S_OCR quality (all methods) [%]         45.0   45.0   45.0   45.0   45.0
FontNN F_OCR quality [%]                49.6   52.2   53.7   54.0   53.7
FontDynamic F_OCR quality [%]           50.1   51.9   53.5   54.0   53.5
ClusterNN F_OCR quality [%]             50.1   52.7   55.0   55.3   56.3
ClusterDynamic F_OCR quality [%]        51.4   54.2   56.8   57.8   59.4
ClusterPerCharacter F_OCR quality [%]   48.3   49.1   49.9   50.1   50.9
Best improvement [%]                     6.4    9.2   11.8   12.8   14.4

6.2.7 DATASET 2e – method ClusterPerCharacter

This dataset contains 13 images.

θ                                        0.5    0.6    0.7    0.8    0.9
S_OCR quality (all methods) [%]         59.6   59.6   59.6   59.6   59.6
FontNN F_OCR quality [%]                61.9   64.2   66.2   66.5   65.4
FontDynamic F_OCR quality [%]           61.9   62.7   63.8   64.2   63.5
ClusterNN F_OCR quality [%]             56.9   51.2   49.2   44.2   41.5
ClusterDynamic F_OCR quality [%]        56.9   51.2   49.2   44.2   41.5
ClusterPerCharacter F_OCR quality [%]   63.5   66.5   69.6   69.6   70.8
Best improvement [%]                     3.9    6.9   10.0   10.0   11.2

As can be seen, for some subsets of images the improvement by F_OCR reaches about 10 %.


6.2.8 Examples

Examples with an improvement

Figure 41 Examples of images where the suggested FontDynamic method helped: a) ALKOHOL, b) ZAPAD.

The word ALKOHOL in figure 41a was improved by the FontDynamic method by 43 %. Only 1 character was recognized by the S_OCR method, whereas the F_OCR method recognized 4 characters out of 7. The font was detected as Impact Regular.

True label                A  L  K  O  H  O  L
Classification by S_OCR   1  i  K  0  N  0  i
Classification by F_OCR   A  L  K  o  u  o  L

Figure 42 Classification by S_OCR and its improvement by F_OCR.

Figure 43 Example shown in figure 41a with the word ALKOHOL: a) original image, b) detected font Impact Regular.

The word ZAPAD in figure 41b was improved by the FontDynamic method by 20 %. Three characters were correctly recognized by the S_OCR method; the F_OCR method recognized 1 more character out of the total of 5. The font was detected as Fish Fingers Regular.

True label                Z  A  P  A  D
Classification by S_OCR   Z  A  F  A  J
Classification by F_OCR   Z  A  n  A  D

Figure 44 Classification by S_OCR and its improvement by F_OCR.

Figure 45 Example shown in figure 41b with the word ZAPAD: a) original image, b) detected font Fish Fingers Light.

Examples with no improvement

Figure 46 a) Original image, b) detected font Impact Regular.

True label                R  E  S  T  A  U  R  A  C  E
Classification by S_OCR   8  i  C  1  A  L  i  t  r  t
Classification by F_OCR   B  L  9  1  b  L  B  j  C  D

Figure 47 Classification by the S_OCR and F_OCR methods.

40 6.2 An utilization of font knowledge to improve the OCR quality

The text in the figure is curved and, moreover, is written in a specific font which we do not have in the training set, nor anything similar to it. Training the classifier with rotated samples could perhaps help. Only 1 of the 10 characters was recognized by S_OCR, and likewise only 1 character was recognized by F_OCR.

Figure 48 Examples of words well recognized by the S_OCR method: a) MODERNI, b) Jecmen.

In the two examples 48a and 48b there was nothing to improve, because all characters of the words MODERNI and Jecmen were recognized correctly by the S_OCR method. In image 48a we can even see a deterioration caused by the classification of the F_OCR method.

True label                J  e  c  m  e  n
Classification by S_OCR   J  e  c  m  e  n
Classification by F_OCR   J  e  c  m  e  n

Figure 49 Classification by the S_OCR and F_OCR methods.

True label                M  O  D  E  R  N  I
Classification by S_OCR   M  O  D  E  R  N  I
Classification by F_OCR   V  O  O  E  R  N  I

Figure 50 Classification by the S_OCR and F_OCR methods.


Examples where the methods make errors

Figure 51 Logitech written in the detected fonts: a) original image, b) recognized fonts. From top to bottom: original image, FontNN – Fish Fingers Regular, FontDynamic – Print Regular, ClusterNN = ClusterDynamic – Calibri Bold, ClusterPerCharacter – Candara Regular.

In this example all methods are unsuccessful. The FontNN and FontDynamic methods determined an entirely different font and correctly classified only 2 characters out of 8. The fonts found by the remaining methods differ mainly in the ‘g’ character; these methods correctly classified all characters except ‘g’ and ‘i’.

Figure 52 Samsung written in the detected fonts: a) original image, b) recognized fonts. From top to bottom: original image, FontNN = FontDynamic – Euphemia Regular, ClusterNN – KaiTi Regular, ClusterDynamic – Cordia New Regular, ClusterPerCharacter (last 2 images) – Kalinga Bold, Console Regular.


In this example the detected fonts are very similar and differ only in the ‘g’ character. As expected, the character ‘g’ was correctly classified only by the ClusterDynamic and ClusterPerCharacter methods.

7 Implementation

7.1 Programming language

This work is programmed in Python 2.7. Spyder was used as the integrated development environment.

7.2 Used libraries

TextSpotter – real-time scene text localization and recognition [1]
SciPy – fundamental library for scientific computing [20]
Matplotlib – Python 2D plotting library [22]
NumPy – computing with multi-dimensional arrays and matrices [23]
OpenCV – computer vision functions [24]
FLANN – Fast Library for Approximate Nearest Neighbors [18]

7.3 The code

The class FontCluster contains methods for both types of clustering: the first clustering, described in section 4.1, is performed by the method make_clusters, and the second type, described in section 4.2, by the method make_clusters_of_char. For simple font or cluster recognition there is the class FontsDetector with the method detect_font. The dynamic approach is implemented in the class FontsDynDetector and is invoked through the methods find_shortest_path and detect_font_dyn. The algorithm described in section 5.3 is invoked through the methods classify_images, add_fonts_within_clusters and most_occ_font.
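The following sketch illustrates how these classes might be combined; the argument lists are assumptions, and only the class and method names come from the implementation:

    # Hypothetical driver combining the classes described above; the
    # argument lists and return values are assumptions, only the class
    # and method names come from the implementation.

    def recognize_word_font(word_image, detector, dyn_detector, dynamic=False):
        """Detect the font of one word image.

        detector is assumed to be a FontsDetector instance and
        dyn_detector a FontsDynDetector instance.
        """
        if dynamic:
            dyn_detector.find_shortest_path(word_image)     # dynamic approach
            return dyn_detector.detect_font_dyn(word_image)
        return detector.detect_font(word_image)             # simple recognition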

8 Conclusion

This thesis presents several algorithms for font recognition and for improving character recognition by using information about the detected font.

8.1 Font or cluster recognition

Two datasets of images have been used. The first, DATASET 1, consists of computer-generated images of the 20 most common English words typed in 162 fonts. The font recognition rate is 79.2 % for the FontNN method and 78.2 % for the FontDynamic method. In cluster recognition the ClusterNN and ClusterDynamic methods have not been very successful: for 25 clusters created from 162 fonts (6.48 fonts per cluster on average), the recognition accuracy has been only 44.6 % for ClusterNN and 47.1 % for ClusterDynamic. With an increasing number of clusters the recognition rate also moves upwards, but with a large number of clusters the methods coincide with FontNN and FontDynamic. The recognition rate of the ClusterPerCharacter method cannot be directly compared with the other methods, because it sometimes returns more than one font; the font is considered correctly determined if it is included in the set of returned fonts (a small sketch of this evaluation follows below). Its recognition rate is 93.6 %. The second dataset, DATASET 2, consists of 132 real-scene images (940 words). The font recognition rate cannot be evaluated for this dataset, because we have no information about the font in which the text is written.
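As an illustration, the recognition rate under this set-valued convention could be computed as follows; this is a sketch under assumed data types, not the thesis code:

    # Illustrative computation of the recognition rate; a prediction may be
    # a single font or, for ClusterPerCharacter, a set of candidate fonts.

    def recognition_rate(predictions, true_fonts):
        """Percentage of words whose true font matches the prediction."""
        correct = 0
        for pred, truth in zip(predictions, true_fonts):
            if isinstance(pred, (list, set, frozenset)):
                correct += truth in pred       # set-valued: membership counts
            else:
                correct += pred == truth       # single font: exact match
        return 100.0 * correct / len(true_fonts)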

8.2 An utilization of font knowledge to improve the OCR quality

We have also tested the suggested methods for improving the OCR quality using knowledge of the detected font. On the computer-generated DATASET 1 an improvement could be seen for all methods. The most successful have been ClusterPerCharacter and FontNN: compared to the standard OCR method they have reached an improvement of 11.6 %, raising the OCR quality from 84.4 % to 96.0 %. However, we have discovered that these methods are not generally applicable to the second dataset, DATASET 2, consisting of real-scene images; on the whole dataset there was no improvement, or only a very small one. Still, we have found that the methods are very successful for some images: for each method we are able to find a subset of 10 to 21 images where the improvement is about 10 %, and in some images even about 40 %. On the whole dataset the methods have failed either because the font was determined completely wrong, or because the letters were not recognized due to distortion and blurring. More features for font recognition, additional information about the blobs, and a bigger training set including rotation and distortion could perhaps help.

Appendix A

I attach examples of clustering for different within-cluster distances. The cluster creation is described in section 4.1.


A.1 The maximum distance between the letters within the cluster is 6.5

Figure 53 Clusters of letters ‘a’, ‘Q’, ‘g’ and ‘E’ with their representatives.


A.2 The maximum distance between the letters within the cluster is 7.5

Figure 54 Clusters of letters ‘a’, ‘Q’, ‘g’ and ‘E’ with their representatives.


A.3 The maximum distance between the letters within the cluster is 8

Figure 55 Clusters of letters ‘a’, ‘Q’, ‘g’ and ‘E’ with their representatives.


A.4 The maximum distance between the letters within the cluster is 8.5

Figure 56 Clusters of letters ‘a’, ‘Q’, ‘g’ and ‘E’ with their representatives.

Appendix B

I attach examples of clustering for different within-cluster distances. The cluster creation is described in section 4.2.


B.1 The maximum distance between the letters within the cluster is 5

Figure 57 Clusters of letters ‘a’, ‘Q’, ‘g’ and ‘E’ with their representatives.


B.2 The maximum distance between the letters within the cluster is 6

Figure 58 Clusters of letters ‘a’, ‘Q’, ‘g’ and ‘E’ with their representatives.


B.3 The maximum distance between the letters within the cluster is 8

Figure 59 Clusters of letters ‘a’, ‘Q’, ‘g’ and ‘E’ with their representatives.


B.4 The maximum distance between the letters within the cluster is 8

Figure 60 Clusters of letters ‘a’, ‘Q’, ‘g’ and ‘E’ with their representatives.

Appendix C

C.1 Enclosed CD

I enclose a CD with my program, the datasets and this thesis in PDF format.

Bibliography

[1] Lukáš Neumann and Jiří Matas. TextSpotter. Real-Time Scene Text Localization and Recognition. 2012. url: http://www.textspotter.org/.

[2] A. Zramdini and R. Ingold. “Optical font recognition using typographical features”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 20.8 (Aug. 1998), pp. 877–882. issn: 0162-8828. doi: 10.1109/34.709616.

[3] A. Zramdini and R. Ingold. “Optical font recognition from projection profiles”. In: Electronic Publishing 6.3 (Sept. 1993), pp. 249–260.

[4] A. Satkhozhina, I. Ahmadullin, and J. P. Allebach. “Optical Font Recognition using Conditional Random Field”. In: (Sept. 2013).

[5] Michael P. Cutter et al. Font group identification using reconstructed fonts. 2011. doi: 10.1117/12.873398. url: http://dx.doi.org/10.1117/12.873398.

[6] Michael Patrick Cutter et al. “Unsupervised Font Reconstruction Based on Token Co-occurrence”. In: Proceedings of the 10th ACM Symposium on Document Engineering. DocEng ’10. Manchester, United Kingdom: ACM, 2010, pp. 143–150. isbn: 978-1-4503-0231-9. doi: 10.1145/1860559.1860589. url: http://doi.acm.org/10.1145/1860559.1860589.

[7] L. Zhang, Yue Lu, and C.L. Tan. “Italic font recognition using stroke pattern analysis on wavelet decomposed word images”. In: Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on. Vol. 4. Aug. 2004, 835–838 Vol.4. doi: 10.1109/ICPR.2004.1333902.

[8] Siamak Khoubyari and Jonathan J. Hull. “Font and Function Word Identification in Document Recognition”. In: Computer Vision and Image Understanding 63.1 (1996), pp. 66–74. issn: 1077-3142. doi: 10.1006/cviu.1996.0005.


[9] Siamak Khoubyari and Jonathan J. Hull. “Keyword Location in Noisy Document Images”. In: Symp. on Document Analysis and Information Retrieval, Las Vegas, NV (Apr. 1993).

[10] Yong Zhu, Tieniu Tan, and Yunhong Wang. “Font recognition based on global texture analysis”. In: Pattern Analysis and Machine Intelligence, IEEE Transactions on 23.10 (Oct. 2001), pp. 1192–1200. issn: 0162-8828. doi: 10.1109/34.954608.

[11] Avilé-Cruz Carlos, Villegas-Cortes Juan, and J. Ocampo-Hidalgo. “A Robust Font Recognition Using Invariant Moments”. In: Proceedings of the 5th WSEAS International Conference on Applied Computer Science. ACOS’06. Hangzhou, China: World Scientific, Engineering Academy, and Society (WSEAS), 2006, pp. 114–117. isbn: 960-8457-43-2. url: http://dl.acm.org/citation.cfm?id=1973598.1973622.

[12] Hasan S. M. Al-Khaffaf et al. “On the performance of Decapod’s digital font reconstruction”. In: ICPR. IEEE, 2012, pp. 649–652. isbn: 978-1-4673-2216-4. url: http://dblp.uni-trier.de/db/conf/icpr/icpr2012.html#Al-KhaffafSCB12.

[13] M. Solli and R. Lenz. “A Font Search Engine for Large Font Databases”. In: Electronic Letters on Computer Vision and Image Analysis 10.1 (2011), pp. 24–41.

[14] J. T. Lidke. “Hierarchical Font Recognition. Letter Snippets - Visual Words in Font Recognition”. Diploma Thesis. Philipps University of Marburg, 2010.

[15] Serdar Ozturk, A. Toygar Abak, and Bulent Sankur. “Font clustering and cluster identification in document images”. In: Journal of Electronic Imaging 10.2 (2001), pp. 418–430. doi: 10.1117/1.1351820. url: http://dx.doi.org/10.1117/1.1351820.

[16] Lukáš Neumann. “Vyhledávání a rozpoznávání textu v obrazech reálných scén”. Master's thesis. ČVUT, 2010. url: http://cmp.felk.cvut.cz/~neumalu1/Neumann-thesis-2010.pdf.

[17] Lukáš Neumann and Jiří Matas. Scene Text Localization and Recognition with Oriented Stroke Detection. ICCV 2013, IEEE International Conference on Computer Vision, Sydney, Australia, 2013. url: http://cmp.felk.cvut.cz/~neumalu1/neumann-iccv2013.pdf.


[18] FLANN - Fast Library for Approximate Nearest Neighbors. url: http://www.cs.ubc.ca/research/flann/.

[19] Wikipedia. K-nearest neighbors algorithm — Wikipedia, The Free Encyclopedia. [Online; accessed 14-May-2014]. 2014. url: http://en.wikipedia.org/w/index.php?title=K-nearest_neighbors_algorithm&oldid=604493028.

[20] The SciPy community, ed. SciPy. 2008-2009. url: http://docs.scipy.org/doc/scipy/reference/.

[21] A. A. Puntambekar. Analysis of Algorithm and Design. Technical Publications Pune, 2009.

[22] The matplotlib development team, ed. Matplotlib. 2013. url: http://matplotlib.org/.

[23] Numpy developers, ed. NumPy. 2013. url: http://www.numpy.org/.

[24] Itseez, ed. OpenCV. 2014. url: http://opencv.org/.

60