Exploiting similarity hierarchies for multi-script scene text understanding

A dissertation submitted by Lluís Gómez-Bigordà at Universitat Autònoma de Barcelona, Dept. Ciències de la Computació, to fulfil the degree of Doctor of Philosophy in Computer Science. Bellaterra, February 19, 2016

Director: Dr. Dimosthenis Karatzas Universitat Autònoma de Barcelona Dept. Ciències de la Computació & Centre de Visió per Computador

This document was typeset by the author using LaTeX 2ε.

The research described in this book was carried out at the Computer Vision Center, Universitat Autònoma de Barcelona.

Copyright © MMXVI by Lluís Gómez-Bigordà. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the author.

Abstract

This thesis addresses the problem of automatic scene text understanding in unconstrained conditions. In particular, we tackle the tasks of multi-language and arbitrary-oriented text detection, tracking, and recognition in natural scene images and videos. For this we have developed a set of generic methods that build on top of the basic assumption that text always has some key visual characteristics that are independent of the language or script in which it is written.

Scene text extraction methodologies are usually based on the classification of individual regions or patches, using a priori knowledge for a given script or language. Human perception of text, on the other hand, is based on perceptual organisation, through which text emerges as a perceptually significant group of atomic objects. In this thesis we argue that the text extraction problem can be posed as the detection of meaningful groups of regions. We address the problem of text extraction in natural scenes from a hierarchical perspective, making explicit use of text structure and aiming directly at the detection of region groupings corresponding to text within a hierarchy produced by an agglomerative similarity clustering process over individual regions. We propose an optimal way to construct such a hierarchy, introducing a feature space designed to produce text group hypotheses with high recall and a novel stopping rule combining a discriminative classifier and a probabilistic measure of group meaningfulness based on perceptual organization.

We propose a new Object Proposals algorithm that is specifically designed for text and compare it with other generic methods in the state of the art. At the same time, we study to what extent the existing generic Object Proposals methods may be useful for scene text understanding.

Then, we present a hybrid algorithm for detection and tracking of scene text where the notion of region groupings also plays a central role. A scene text extraction module based on Maximally Stable Extremal Regions (MSER) is used to detect text asynchronously, while in parallel detected text objects are tracked by MSER propagation. The cooperation of these two modules goes beyond full-detection approaches in terms of time performance optimization, and yields real-time video processing at high frame rates even on low-resource devices.

Finally, we focus on the problem of script identification in scene text images in order to build a multi-language end-to-end reading system. Facing this problem with state-of-the-art CNN classifiers is not straightforward, as they fail to address a key characteristic of scene text instances: their extremely variable aspect ratio. Instead of resizing input images to a fixed size, as in the typical use of holistic CNN classifiers, we propose a patch-based classification framework in order to preserve discriminative parts of the image that are characteristic of its class. We describe a novel method based on the use of ensembles of conjoined networks to jointly learn discriminative stroke-part representations and their relative importance in a patch-based classification scheme. Our experiments with this learning procedure demonstrate the viability of script identification in natural scene images, paving the road towards true multi-lingual end-to-end scene text understanding.

Acknowledgments

Contents

1 Introduction
  1.1 Scene text understanding tasks
  1.2 Challenges
  1.3 Applications and socio-economic impact
  1.4 Objectives and research hypotheses
  1.5 Contributions
  1.6 Publications

2 Related Work
  2.1 Scene text localization and extraction
  2.2 Object proposals
  2.3 Scene text detection and tracking in video sequences
  2.4 End-to-end methods
  2.5 Script Identification

3 Datasets
  3.1 The ICDAR Robust Reading Competition
  3.2 Multi-language scene text detection
  3.3 Scene Text Script Identification
  3.4 MLe2e multi-lingual end-to-end dataset
  3.5 An on-line platform for ground truthing and performance evaluation of text extraction systems

4 Scene Text Extraction based on Perceptual Organization
  4.1 Text Localization Method
    4.1.1 Region Decomposition
    4.1.2 Perceptual Organization Clustering
  4.2 Experiments and Results
  4.3 Conclusion

5 Optimal design and efficient analysis of similarity hierarchies
  5.1 Hierarchy guided text extraction
    5.1.1 Optimal clustering feature space
    5.1.2 Discriminative and Probabilistic Stopping Rules
    5.1.3 From Pixel Level Segmentation to Bounding Box Localization
  5.2 Experiments
    5.2.1 Baseline analysis
    5.2.2 Scene text segmentation results
    5.2.3 Scene text localization results
  5.3 Conclusions

6 Text Regions Proposals
  6.1 Text Specific Selective Search
    6.1.1 Creation of hypotheses
    6.1.2 Ranking
  6.2 Experiments and Results
    6.2.1 Evaluation of diversification strategies
    6.2.2 Evaluation of proposals' rankings
    6.2.3 Comparison with state of the art
  6.3 Conclusion

7 Efficient tracking of text groupings
  7.1 Text Detection and Tracking Method
    7.1.1 Tracking Module
    7.1.2 Merging detected regions and propagated regions
  7.2 Experiments
    7.2.1 Time performance
  7.3 Conclusion

8 Scene text script identification
  8.1 Patch-based script identification
    8.1.1 Patch representation with Convolutional Features
    8.1.2 Naive-Bayes Nearest Neighbor
    8.1.3 Weighting per class image patch templates by their importance
  8.2 Ensembles of conjoined deep networks
    8.2.1 Convolutional Neural Network for stroke-parts classification
    8.2.2 Training with an Ensemble of Conjoined Networks
    8.2.3 Implementation details
  8.3 Experiments
    8.3.1 Script identification in pre-segmented text lines
    8.3.2 Joint text detection and script identification in scene images
    8.3.3 Cross-domain performance and confusion in single-language datasets
  8.4 Conclusion

9 Applications
  9.1 Unconstrained Text Recognition with off-the-shelf OCR engines
    9.1.1 End-to-end Pipeline
    9.1.2 Experiments
    9.1.3 Discussion
  9.2 End-to-end spotting
    9.2.1 Discussion
  9.3 Public releases

10 Conclusion and Future Work

Bibliography

Chapter 1 Introduction

Reading is the complex cognitive process of transforming forms (letters and symbols) into meanings [94] in order to understand and interpret written content. Since ancient times¹, written language has been the most important form of nonverbal communication, information storage, and sharing of ideas for humankind. Reading is therefore considered a basic skill for humans, which allows them access to the vast amount of knowledge available through text.

¹ The use of ideographic symbols (proto-writing) is traced to 6000 BCE. Surviving evidence of today's in-use scripts, such as the one shown in Figure 1.1, dates from three centuries BCE.

In the field of human-machine interfaces, the idea of equipping computers with reading capabilities is as old as computer science itself. It arose as a natural way to acquire data from non-digital written content and to automate related tasks. With its roots in the first Optical Character Recognition (OCR) systems developed around the 1960s, Document Image Analysis is nowadays a mature research field at the crossroads between Pattern Recognition and Computer Vision. It is considered one of the most fruitful areas of Computer Vision, and off-the-shelf OCR solutions available today are extremely efficient on documents created with modern printers and standard font types. However, research in document analysis still has many open challenges that involve reading tasks in non-optimal conditions, e.g. handwritten text, historical and/or deteriorated documents, complex layouts, etc.

Although the majority of humans' reading activity is carried out over text written on paper or electronic display devices, text may also appear written on objects around us: text on street signs, a motto painted on a wall, and even text produced by arranging small stones on the sand. This type of text, from now on referred to as "scene text", is an important source of information in our daily life. Scene text is ubiquitous in man-made environments, and most of our everyday activities imply reading and understanding written information in the world around us (shopping, finding places, viewing advertisements, etc.).

Figure 1.1: Cippus Perusinus, Etruscan writing near Perugia, Italy. The beginning of writing with the Latin alphabet, 3rd or 2nd century BCE.

The automated recognition of scene text in uncontrolled environments is an open Computer Vision problem. At its core lies the extensive variability of scene text in terms of its location, physical appearance, and design (Figure 1.2).

Figure 1.2: The extensive variability of scene text in terms of its location, physical appearance, and design makes automated recognition of scene text a challenging problem.

1.1 Scene text understanding tasks

Scene text reading systems typically follow a two-step pipeline design: first applying an algorithm to detect (localize) text regions in the input image, and then performing the actual text recognition on the previously localized regions. These two well differentiated tasks (localization and recognition) have traditionally been treated as isolated problems by researchers, each one having its own evaluation framework. A frequent intermediate task is the pixel-level segmentation of text components, also known as text extraction, which, when attainable, allows the use of document-oriented OCR techniques in the recognition step.

When a method is designed to do both localization and recognition tasks at once we call it an "end-to-end" reading system. We refer to the task of end-to-end word spotting when the recognition is constrained to a relatively small list of given words (queries).

The automated understanding of textual information in natural scene images has received increasing attention from computer vision researchers over the last decade. Text localization, extraction, and recognition methods have evolved significantly and their accuracy has increased drastically in recent years [78, 112, 60, 57]. However, the problem is far from being considered solved. The best performing methods in the last ICDAR Robust Reading Competition achieve only 69%, 74%, and 70% recall in the tasks of text localization, segmentation, and end-to-end recognition respectively. Moreover, although standard benchmark datasets have traditionally concentrated on English text that is well-focused and horizontally-aligned, new datasets have recently appeared covering much more unconstrained scenarios. These include incidental text [57], i.e. text appearing in the scene without the user's intention, as well as multi-script and arbitrary oriented text [143, 68].

If one considers multi-language environments, script and language identification also become crucial tasks for reliable scene text understanding. Since text recognition algorithms are language-dependent, detecting the script and language at hand allows selecting the correct language model to employ [130].
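To make the two-step pipeline described in this section concrete, the following minimal sketch wires a detector and a recognizer together. The functions detect_text_regions and recognize_word are hypothetical placeholders, not components of any specific system discussed in this thesis.

```python
# Minimal sketch of a two-step (localization + recognition) reading pipeline.
# detect_text_regions() and recognize_word() are hypothetical placeholders.
def detect_text_regions(image):
    """Step 1: return a list of (x, y, w, h) boxes likely to contain words."""
    return []  # placeholder

def recognize_word(image, box):
    """Step 2: return the transcription of the word inside the given box."""
    return ""  # placeholder

def end_to_end_read(image):
    # An "end-to-end" system chains both steps for every localized region.
    return [(box, recognize_word(image, box)) for box in detect_text_regions(image)]
```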

Figure 1.3: Complexity of natural scene images: (a) cluttered background, (b) occlusions, (c) shadows and highlights, (d) perspective distortions.

1.2 Challenges

Most of the difficulty in scene text understanding comes from the fact that there is virtually no restriction on how scene text can appear within an image. Thus, the first big challenge we face is to deal in a robust way with the high variability in font type, size, color, orientation, and alignment of text; see for example Figure 1.2. Other notable challenges stem from the intrinsic complexity of natural scene images (Figure 1.3), and from certain disadvantages inherent to the capturing technology of camera based systems (Figure 1.4).

There exist many aspects of scene text that make its recognition a totally different problem from the one that OCR systems solve for scanned documents. Complex backgrounds and occlusions are difficult challenges to address in a completely uncontrolled environment. Low contrast and uneven lighting, which can appear in the form of reflections and shadows, can make things even worse. The projective nature of camera based imaging makes planar text in natural scenes appear with perspective distortion, and other kinds of distortion arise when text appears on non-planar and non-rigid surfaces or when using wide-angle lenses.

Regarding the inherent challenges of digital camera imaging, images with low resolution make any text segmentation task difficult and at the same time are not well posed input for the final OCR step of the system. The same applies to blur and uneven focus, which can appear in the image when shooting moving objects, when the camera is not static, or in low light, if the sensor is not fast enough to maintain an optimal shutter speed. Color quantization performed in low-profile cameras produces a noticeable coloring difference when comparing a scanner-obtained image with a digital camera one, and sensor noise tends to also be a common problem in consumer-grade imaging devices. Another important challenge of digital camera imaging is that one must often work with compressed images without access to the raw data.

Figure 1.4: Examples of inherent challenges of digital camera imaging: (a) uneven focus, (b) motion blurring, (c) image compression artifacts.

Finally, if one considers portable devices as the final target for automated text reading systems, another important challenge to keep in mind is the necessity for fast algorithms, as limited computing resources are often available on the device.

1.3 Applications and socio-economic impact

We live in the days of the mobile and wearable computing revolution. Billions² of pervasive, personal, and portable devices, equipped with built-in digital cameras, flood our streets and have become indispensable in our daily life. At the same time, our strongly networked society produces an amount of digital media data per day that was unimaginable just one decade ago. Providing those already ubiquitous imaging devices with automatic image understanding capabilities is a primary goal for the Computer Vision community, a field that is certainly being revolutionized these days.

² Global camera phone sales reached 1 billion in 2011. http://www.strategyanalytics.com/default.aspx?mod=reportabstractviewer&a0=6216

Possible applications of scene text understanding are countless. Being able to automatically read scene text in any conditions would enable new and exciting applications such as automatic translation, way-finding and navigation aids, support tools for elderly and visually impaired persons, or contextual image indexing and search, among many others.

Assisted navigation and translation tools are an important kind of application with recurrent appearance in the literature [48, 148, 46, 83, 109, 7]. On one side elderly and visually impaired persons, and on the other foreign visitors, can take advantage of these technologies to access textual information present in urban environments that would otherwise be unreadable for them.

Nowadays there exist reliable translation applications that are already in the hands of millions of users³ [7]. However, such real world applications are limited to very well delimited conditions, e.g. horizontally oriented and well focused bi-level text, and often require the user to specify the location and the original language of the text to be translated. Figure 1.5 shows common errors found in existing image-based text translation applications. Also, in some cases they rely on large-scale data center infrastructure. Thus, there is still room for research on more robust and efficient methods.

³ WordLens, Google Translate, and Microsoft Translator are examples of real applications of end-to-end scene text detection and recognition that have acquired market-level maturity.

Figure 1.5: Existing image-based text translation applications, (a) Google Translate, (b) Word Lens, (c) Microsoft Translator, are still limited to well delimited conditions. Common errors of such applications include incorrect recognition for out-of-dictionary words, missing detections in non-bilevel or blurred text, and false detections in textures. Images from Neumann and Matas [101].

Similarly, while there exist pre-market prototypes of what would be the future of assisted reading technology for visually impaired persons⁴, reliable applications are still designed to perform specific tasks, e.g. as traditional OCR engines for camera based devices⁵, rather than providing an anytime/anywhere real solution.

⁴ http://www.orcam.com/
⁵ http://www.knfbreader.com/

On a different note, robust scene text reading systems can provide an important source of information for automatic annotation and indexing of images and videos. Digital media archives (e.g. from broadcast channels or film industry agents) have traditionally annotated their databases densely by hand in order to make them searchable. Automatic algorithms to help or replace humans in such an effort are highly desirable and will certainly have a direct impact in industry. Nowadays, text-based image retrieval [91, 51, 38] is already a viable solution for searching in large scale collections of unlabeled data.

Other possible applications in this line include the use of textual information to improve lifelog photo stream analysis, automatic captioning algorithms, and on-line image search engines. The list can be extended with countless ad-hoc applications addressing specific purposes, like for example forensic investigations of digital media⁶.

⁶ TRAIT-2016 is an open challenge for text detection and recognition algorithms to support forensic investigations. The presence of text may allow a location to be identified or generate leads. http://www.nist.gov/itl/iad/ig/trait-2016.cfm

Other possible applications of scene text understanding for diverse automated systems include reading license plates, address numbers, gas meters, cargo container and warehouse codes, or athlete bibs in images and videos. In a similar manner, text data could also provide useful information to mobile robots and intelligent vehicle systems.

1.4 Objectives and research hypotheses

The main goal of this thesis is to contribute to the state of the art of automatic scene text understanding in unconstrained conditions. We put the focus on the most challenging scenario, where there is no prior information about the kind of text that may be present in a given input image. In particular, we tackle the tasks of multi-language and arbitrary-oriented text detection, tracking, and recognition in natural scene images and videos.

For this we have developed a set of generic methods that build on top of the basic assumption that text always has some key visual characteristics that are independent of the language or script in which it is written. As can be appreciated in Figure 1.6, text instances in any language are always formed as groups of similar atomic parts, be they individual characters, small stroke parts, or even words. This observation is central to the development of the work presented in this thesis.

Figure 1.6: The observation that all text, independently of its language/script, has common key visual characteristics is central in the development of this thesis.

The primary research hypothesis of this thesis is that reliable scene text detection and tracking methods must take advantage of this holistic (sum-of-parts) perspective. Thus, we pose the problem of scene text localization as the detection of groups of regions, as opposed to the traditional classification of individual regions or image patches as text or non-text entities by themselves. We consider the latter not feasible because individual text components (e.g. a single character or stroke) lack any semantic information, and many times their shapes resemble other objects or object parts that are frequently found in scene images.

Throughout this thesis we make extensive use of Hierarchical Clustering algorithms in order to generate a hierarchy of grouping proposals that emerge bottom-up from an initial set of image regions agglomerated by their similarity; a minimal sketch of this basic construction is given at the end of this section. However, as we will see in the following chapters, the definition of such a similarity measure is not trivial for scene text components. For example, while in general using color similarity may seem adequate, in some cases a given scene text instance may be composed of components with large color disparity, either by design or because it suffers from illumination highlights or shadows. Indeed, in natural scenes a relevant grouping may appear as the result of combining several grouping cues together. In this context, we use the term "similarity heterarchy" for the ensemble of hierarchies generated by grouping the initial regions in a number of different ways. Such similarity heterarchies can be efficiently exploited for multi-script scene text understanding tasks.

Since the algorithms built on top of our primary hypothesis are designed with multi-language environments in mind, we further expand the research scope of this thesis to the task of script identification in the wild. For this, however, we turn our main intuition the other way around: while our claim is that the detection and tracking tasks benefit from taking a holistic approach, in the case of script identification we put the focus back on the parts. Examining text samples from different scripts in Figure 1.6, it is clear that some stroke parts are quite discriminative when it comes to classifying text instances by their script, whereas others can be trivially ignored as they occur in multiple scripts. Our research hypothesis here is that once we have identified a group of text components as a text instance, be it a single word or a text line, the ability to distinguish these relevant stroke parts can be leveraged to recognize the corresponding script.
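The following minimal sketch illustrates the basic construction referred to above: character candidate regions are described by a feature vector and agglomerated with single-linkage clustering, and every merge node of the resulting dendrogram is read as a group hypothesis. The feature choice is purely illustrative, not the feature space developed in Chapter 5.

```python
# Minimal sketch: build a similarity hierarchy over candidate regions with
# single-linkage agglomerative clustering and enumerate every merge node as
# a text-group hypothesis. Features are illustrative placeholders.
import numpy as np
from scipy.cluster.hierarchy import linkage

def group_hypotheses(region_features):
    """region_features: (N, D) array, one row per candidate region (N >= 2)."""
    Z = linkage(region_features, method="single")  # similarity dendrogram
    n = len(region_features)
    members = {i: [i] for i in range(n)}           # leaves = individual regions
    hypotheses = []
    for k, (a, b, dist, _) in enumerate(Z):
        group = members[int(a)] + members[int(b)]  # regions merged at this node
        members[n + k] = group
        hypotheses.append((group, float(dist)))    # every node = one hypothesis
    return hypotheses
```

Each hypothesis is later accepted or rejected by a text-group classifier (the stopping rules discussed in Chapter 5), and several such hierarchies, built with complementary similarity cues, form the heterarchy.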

1.5 Contributions

The following is the list of contributions that we have made throughout this Thesis:

Contributed data

In Chapter 3 the ICDAR Robust Reading datasets are detailed. The author of this Thesis has contributed to the design of the evaluation protocol and to coordinating the labeling efforts of the Reading Text in Video challenge. In the same Chapter we introduce the Multi-Lingual end-to-end (MLe2e) dataset. MLe2e is the first multi-lingual dataset for the evaluation of scene text end-to-end reading systems and all their intermediate stages. It is thus an important contribution of this Thesis that has been made publicly available to the research community.

The use of Perceptual Organization for text detection

In Chapter 4 we present a novel text extraction method based on perceptual organization laws. We argue that the fact that a region can be related to other neighboring ones (by similarity and/or proximity laws) is central to classifying the region as a text part. We claim that this is in line with the essence of human perception of text, which is largely based on perceptual organization. We make use of a state-of-the-art perceptual organization computational model to assess the meaningfulness of different candidate groups of regions. These groups emerge naturally through the activation of different visual similarity laws in collaboration with a proximity law.

Efficient analysis of similarity hierarchies for text detection

Chapter 5 presents a text detection method that efficiently exploits the inherent hierarchical structure of text. The main contributions of this chapter are the following. First, we learn an optimal feature space that encodes the similarities between text components, thus allowing the Single Linkage Clustering algorithm to generate text group hypotheses with high recall, independently of their orientation and script. Second, we couple the hierarchical clustering algorithm with novel discriminative and probabilistic stopping rules that allow the efficient detection of text groups in a single grouping step. Third, we propose a new set of features for text group classification that can be efficiently calculated in an incremental way and are able to capture the inherent structural properties of text related to the arrangement of text parts and the intra-group similarity.
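As an illustration of the incremental-descriptor idea, the sketch below keeps sufficient statistics per group so that, when two nodes merge during agglomeration, the merged group's features are obtained in constant time and fed to a classifier acting as a stopping rule. The single statistic and the classifier interface are assumptions for the sake of the example, not the actual feature set of Chapter 5.

```python
# Illustrative sketch of incrementally computable group descriptors with a
# classifier-based stopping rule. The single per-region feature and the
# is_text_group() classifier interface are assumptions, not the thesis design.
from dataclasses import dataclass

@dataclass
class GroupStats:
    n: int      # number of regions in the group
    s1: float   # sum of a per-region feature (e.g. stroke width)
    s2: float   # sum of squares of the same feature

    def merge(self, other):
        # O(1) update: no need to revisit the member regions.
        return GroupStats(self.n + other.n, self.s1 + other.s1, self.s2 + other.s2)

    def mean(self):
        return self.s1 / self.n

    def variance(self):
        return max(self.s2 / self.n - self.mean() ** 2, 0.0)

def merge_and_test(group_a, group_b, is_text_group):
    """is_text_group: hypothetical classifier over (mean, variance, size)."""
    merged = group_a.merge(group_b)
    keep = is_text_group((merged.mean(), merged.variance(), merged.n))
    return merged, keep   # the stopping rule decides whether to emit the group
```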

Text specific Object Proposals

In Chapter 6 we explore the applicability of Object Proposals techniques to scene text understanding, aiming to produce a set of word proposals with high recall in an efficient way. We propose a simple text-specific selective search strategy, where initial regions in the image are grouped by agglomerative clustering into a hierarchy where each node defines a possible word hypothesis. We evaluate different state-of-the-art Object Proposals methods in their ability to detect text words in natural scenes. We compare the proposals obtained with well known class-independent methods against our own method, demonstrating that ours is superior in its ability to produce good quality word proposals in an efficient way.
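A typical way to quantify the quality of such word proposals is detection recall at a fixed intersection-over-union threshold as a function of the number of ranked proposals considered. The sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) format and a 0.5 IoU threshold; it illustrates the metric, not the exact evaluation protocol of Chapter 6.

```python
# Illustrative recall-at-k evaluation for word proposals (IoU >= 0.5).
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union) if union else 0.0

def recall_at(proposals, gt_words, k, thr=0.5):
    """Fraction of ground-truth word boxes covered by the top-k ranked proposals."""
    top = proposals[:k]
    hits = sum(any(iou(p, g) >= thr for p in top) for g in gt_words)
    return hits / float(len(gt_words)) if gt_words else 1.0
```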

Group-based text tracking

Chapter 7 introduces our method for tracking text in video sequences. The novelty of the proposed method lies in its ability to effectively track groupings of text regions (establishing a pixel-level segmentation of the constituent text parts in every frame), not merely their bounding boxes as usually done in state-of-the-art tracking-by-detection algorithms. The contribution of this chapter is mainly methodological, presenting a simple but effective method for text detection and tracking suitable for devices with limited computational power. We demonstrate how a reasonably fast text detection method can be efficiently combined with tracking to give rise to real-time performance on such devices.
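The sketch below illustrates this cooperation scheme: the tracking loop propagates regions at every frame, while a detector runs asynchronously on only a subset of frames and its results are merged back in. The detect, propagate, and merge functions are placeholders standing in for the MSER-based detector, the MSER propagation, and the merging step of Chapter 7; the thread layout is illustrative only.

```python
# Illustrative sketch of asynchronous detection combined with per-frame tracking.
# detect(), propagate() and merge() are hypothetical placeholders.
import threading
import queue

def detect(frame):                 # placeholder: full text detection on one frame
    return []

def propagate(region, frame):      # placeholder: propagate one tracked region
    return region

def merge(tracked, detected):      # placeholder: reconcile detections with tracks
    return tracked + detected

pending = queue.Queue()

def detection_worker(frame):
    # Runs in its own thread so the tracking loop never blocks on detection.
    pending.put(detect(frame))

def track_video(frames, detect_every=10):
    tracked = []
    for i, frame in enumerate(frames):
        tracked = [propagate(r, frame) for r in tracked]   # every frame
        while not pending.empty():                         # fold in async results
            tracked = merge(tracked, pending.get())
        if i % detect_every == 0:                          # only some frames
            threading.Thread(target=detection_worker, args=(frame,)).start()
        yield tracked
```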

Part-based script identification in the wild

Finally, in Chapter 8 we describe two different patch-based methods for script identification of scene text instances. We develop the idea that applying holistic CNNs to the problem of script identification is not straightforward, because the aspect ratio of scene text instances is largely variable (i.e. from a single character to a whole text line). The key intuition behind the presented methods is that in order to retain the discriminative power of small stroke patches we must rely on powerful local feature representations and use them within a patch-based classifier. We demonstrate that patch-based classifiers can be essential in tasks where scaling the input image to a fixed size is not feasible.
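The core of any such patch-based scheme is the aggregation of per-patch evidence into an image-level decision. The sketch below averages per-patch class scores with optional per-patch weights; the plain weighting is a placeholder for the learned patch importances discussed in Chapter 8.

```python
# Illustrative patch-based classification rule: weighted average of per-patch
# class scores (e.g. CNN softmax outputs). The weighting is a placeholder for
# the learned per-patch importance of Chapter 8.
import numpy as np

def classify_text_line(patch_scores, patch_weights=None):
    """patch_scores: (P, C) array of class scores for P patches of one text line.
       patch_weights: optional (P,) array of per-patch importance."""
    patch_scores = np.asarray(patch_scores, dtype=float)
    if patch_weights is None:
        patch_weights = np.ones(len(patch_scores))
    w = np.asarray(patch_weights, dtype=float)
    image_scores = (w[:, None] * patch_scores).sum(axis=0) / w.sum()
    return int(np.argmax(image_scores)), image_scores
```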

1.6 Publications

Articles published in conferences and workshops:

• Gómez, Lluís, and Dimosthenis Karatzas. "Multi-script text ex- traction from natural scenes." Document Analysis and Recogni- tion (ICDAR), 2013 12th International Conference on. IEEE, 2013.

• Gómez, Lluís, and Dimosthenis Karatzas. "Demonstration of a Human Perception Inspired System for Text Extraction from Nat- ural Scenes." CBDAR-Demos.

• Gómez, Lluís, and Dimosthenis Karatzas. "MSER-Based Real- Time Text Detection and Tracking." Pattern Recognition (ICPR), 2014 22nd International Conference on. IEEE, 2014.

• Gómez, Lluís, and Dimosthenis Karatzas. "Scene Text Recogni- tion: No Country for Old Men?." Computer Vision-ACCV 2014 Workshops. Springer International Publishing, 2014. introduction 9

• Gómez, Lluís, and Dimosthenis Karatzas. "Object Proposals for Text Extraction in the Wild." Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015.

• Gómez, Lluís, and Dimosthenis Karatzas. "A Fine Grained Classi- fication Approach to Scene Text Script Identification." DAS (2016).

Journal articles under review:

• Gómez, Lluis, and Dimosthenis Karatzas. "A fast hierarchical method for multi-script and arbitrary oriented scene text extraction." Submitted to IJDAR (2015).

• Gómez, Lluís, Anguelos Nicolaou, and Dimosthenis Karatzas. "Boosting patch-based scene text script identification with ensembles of conjoined networks." Submitted to Pattern Recognition (2016).

Publications not as first author but related to this thesis' work:

• Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Gómez i Bigordà, L., Robles Mestre, S., ..., de las Heras, L. P. (2013, August). ICDAR 2013 robust reading competition. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on (pp. 1484-1493). IEEE.

• Karatzas, Dimosthenis, Sergi Robles, and Lluis Gómez. "An on-line platform for ground truthing and performance evaluation of text extraction systems." Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on. IEEE, 2014.

• Jon Almazan, Suman K. Ghosh, Lluís Gómez, Dimosthenis Karatzas and Ernest Valveny. "A Selective Search Framework for Efficient End-to-end Scene Text Recognition and Retrieval." Technical Report (2015).

• Dimosthenis Karatzas, Lluis Gómez-Bigordà, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, Ernest Valveny. ICDAR 2015 robust reading competition. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE.

Chapter 2 Related Work

In this Chapter we review the state of the art for all tasks related to the work presented in this thesis, and at the same time we put our work in context by explicitly comparing it with existing approaches. First, in Section 2.1 we focus on methods for scene text detection in still images, while Section 2.3 presents the most relevant works on text detection and tracking in video sequences. In Section 2.2 we review generic object proposals methods that are related to our work, and compare them with our approach for text-specific region proposals. Section 2.4 is dedicated to end-to-end scene text recognition methods, where we also review works tackling the specific task of word spotting. Finally, in Section 2.5 we review recent approaches for script identification in the wild.

2.1 Scene text localization and extraction

The automatic understanding of textual information in natural scenes has gained increasing attention over the past decade, giving rise to various new computer vision challenges. Ye and Doermann [145] offer an exhaustive survey of recent developments in text detection and recognition in imagery, while Jung et al. [56] and Liang et al. [75] have published corresponding surveys of early works on camera-based document and scene text analysis.

Scene text detection methods can be categorized into texture-based and region-based approaches. Texture-based methods usually work by performing a sliding window search over the image and extracting certain texture features in order to classify each possible patch as text or non-text. Coates et al. [17], and in a different flavour Wang et al. [137] and Netzer et al. [95], propose the use of unsupervised feature learning to generate the features for text versus non-text classification. Wang et al. [135], extending their previous work [136], have built an end-to-end scene text recognition system based on a sliding window character classifier using Random Ferns, with features originating from a HOG descriptor. Mishra et al. [90] propose a closely related end-to-end method based on HOG features and an SVM classifier.

Texture-based methods yield good text localisation results, although they do not directly address the issue of text segmentation (separation of text from background). Their main drawback compared to region-based methods is their lower time performance, as sliding window approaches are confronted with a huge search space in such an unconstrained (i.e. variable scale, rotation, aspect-ratio) task. Moreover, these methods are usually limited to the detection of the single language and orientation for which they have been trained, and therefore they are not directly applicable to the multi-script and arbitrary oriented text scenario.

Region-based methods, on the other hand, are based on a typical bottom-up pipeline: first performing an image segmentation and subsequently classifying the resulting regions into text or non-text ones. Yao et al. [143] extract regions in the Stroke Width Transform (SWT) domain, proposed earlier for text detection by Epshtein et al. [27]. Yin et al. [146] obtain state-of-the-art performance with a method that prunes the tree of Maximally Stable Extremal Regions (MSER) using the strategy of minimizing regularized variations. The effectiveness of MSER for character candidate detection is also exploited by Chen et al. [13] and Novikova et al. [105], while Neumann et al. [98] propose a region representation derived from MSER where character/non-character classification is done for each possible Extremal Region (ER).

Most of the region-based methods are complemented with a post-processing step where regions assessed to be characters are grouped together into words or text lines. The hierarchical structure of text has traditionally been exploited in a post-processing stage with heuristic rules [27, 13], usually constrained to search for horizontally aligned text in order to avoid a combinatorial explosion when enumerating all possible text lines. Neumann and Matas [98] introduce an efficient exhaustive search algorithm using heuristic verification functions at different grouping levels (i.e. region pairs, triplets, etc.), but still constrained to horizontal text. Yao et al. [143] make use of a greedy agglomerative clustering where regions are grouped if their average alignment is under a certain threshold. Yin et al. [146] use a self-training distance metric learning algorithm that can learn distance weights and clustering thresholds simultaneously and automatically for text group detection in a similarity feature space.

There exist two main differences between current state-of-the-art approaches and the methods proposed in this Thesis. On one side, methods relying on learning processes [135, 137, 17, 95, 136, 90] are usually constrained to detect the single script on which they have been trained. The feedback loop between localization and recognition they propose, although performing well on certain tasks (e.g. detecting English horizontal text), contradicts the human ability to detect text structures even in scripts or languages never seen before. In comparison, the methodology proposed here requires no training, is largely parameter free, and is independent of the text script.

On the other hand, there is an important distinction between the way grouping is used by existing methods [144, 27, 13] and the way perceptual organisation is used in Chapter 4. In past work, grouping is used solely as a post-processing step, once the text parts have already been identified as such, the main reason being to address the results/ground truth semantic level mismatch mentioned before. As a matter of fact, we also use a post-processing step to enable us to evaluate on standard datasets. However, crucially, in our approach perceptual organisation provides the means to perform classification of regions, based on whether there exists an interpretation of the image that involves their participation in a perceptually relevant group.

In Chapter 5 we present a novel hierarchical approach in which region hierarchies are built efficiently using Single Linkage Clustering in a weighted similarity feature space. The hierarchies are built in different color channels in order to diversify the number of hypotheses and thus increase the maximum theoretical recall. Our method is less heuristic in nature and faster than the greedy algorithm of Yao et al. [143], because the number of atomic objects in our clustering analysis is not increased by taking into account all possible region pairs; besides, our method uses similarity and not collinearity for grouping. Yin et al. [146] make use of a two-step architecture, first doing an automatic clustering analysis in a similarity feature space and then classifying the groups obtained in the first step. The method presented here differs from such approaches in that our agglomerative clustering algorithm integrates a group classifier, acting as a stopping rule, that evaluates the conditional probability for each group in the hierarchy to correspond to a text group in an efficient manner through the use of incrementally computable descriptors. In this sense our work is related to the region detection algorithm of Matas and Zimmerman [81], while the incremental descriptors proposed here are designed to find relevant groups of regions in a similarity dendrogram instead of detecting individual regions in the component tree of the image. There is also a relationship between our method and the work of Van de Sande et al. [131] and Uijlings et al. [129] on using segmentation and grouping as selective search for object recognition. However, our approach is distinct in that their region grouping algorithm agglomerates regions in a class-independent way, while our hierarchical clusterings are designed to maximize the chances of finding specifically text groups. Thus, our algorithm can be seen as a task-specific selective search. This idea motivates our work in object proposals for text detection and will be discussed further in Chapter 6.
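As an illustration of the first stage of such a region-based pipeline, the sketch below extracts MSER character candidates and keeps only those passing simple geometric filters, assuming OpenCV 3 or later. The thresholds are arbitrary placeholders, not the parameters of any of the methods cited above.

```python
# Illustration of the first stage of a region-based pipeline: MSER character
# candidate extraction with simple geometric filtering (OpenCV 3+ assumed).
# Thresholds are arbitrary placeholders.
import cv2

def character_candidates(image_bgr, min_area=30, max_aspect=5.0):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)
    candidates = []
    for pts, (x, y, w, h) in zip(regions, bboxes):
        if w * h < min_area:                            # too small to be a character
            continue
        if max(w, h) / float(min(w, h)) > max_aspect:   # implausible aspect ratio
            continue
        candidates.append((x, y, w, h))
    # The surviving candidates would next be classified as text/non-text and
    # grouped into words or text lines.
    return candidates
```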

2.2 Object proposals

Over and above the specific problem of scene text detection, the use of Object Proposals methods to generate candidate class-independent object locations has become a popular trend in computer vision in recent times.

A comprehensive survey can be found in Hosang et al. [49]. In general terms, we can distinguish between two major types of Object Proposals methods: those that make use of exhaustive search to evaluate a fast-to-compute objectness measure [1, 15, 151], and those where the search is segmentation-driven [128, 79, 65].

In the first category, Alexe et al. [1] propose a generic objectness measure for a given image window that combines several image cues, such as a saliency score, the color contrast to its immediate surrounding area, the edge density, and the number of straddling contours. Computation of these features is made efficient by using integral images. Cheng et al. [15] propose a very fast objectness score using the norm of image gradients in a sliding window, with a suitable resizing of windows into a small fixed size. A different sliding window driven approach is given by Zitnick et al. [151], where a box objectness score is measured as the number of edges [22] that are wholly contained in the box minus those that are members of contours that overlap the box's boundary. Using efficient data structures they manage to evaluate millions of candidate boxes in a fraction of a second.

On the other hand, selective search methods make use of the image's inherent structure through segmentation to guide the search. In this spirit, Gu et al. [47] make use of a hierarchical segmentation engine [4] and consider each node in the hierarchy as an object part hypothesis. Uijlings et al. [128] argue that a single segmentation and grouping strategy is not enough to generate high quality object locations in all conditions, and thus propose a selective search algorithm that uses multiple complementary strategies. In particular, they start from superpixels using different parameter settings [31] for a variety of color spaces, and then produce a set of hierarchies by merging adjacent regions using different complementary similarity measures. Another method based on superpixel merging is due to Manen et al. [79]: using the connectivity graph induced by the segmentation [31] of an image, with edge weights representing the likelihood that two neighboring superpixels belong to the same object, their Randomized Prim's algorithm generates proposals by sampling random partial spanning trees with large expected sums of weights. Finally, Krähenbühl et al. [65] compute an oversegmentation of the image using a fast edge detector [22] and the Geodesic K-means algorithm [108]. They then identify a small set of seed superpixels, aiming to hit all objects in the image, and object proposals are identified as critical level sets of the Signed Geodesic Distance Transforms (SGDT) computed for several foreground and background masks for these seeds.

The use of Object Proposals techniques in scene text understanding has been exploited very recently in state-of-the-art word spotting methods [51]. Jaderberg et al. [51] propose the use of a generic Object Proposals algorithm [151] and deep convolutional neural networks for whole word recognition.

In the method proposed in Chapter 6, initial regions in the image are grouped by agglomerative clustering, using complementary similarity measures, into hierarchies where each node defines a possible word hypothesis.

2.3 Scene text detection and tracking in video sequences

While most of the recently published state-of-the-art text detection methods focus on still images, many authors have also addressed the task of text detection and tracking in digital video sequences. A comprehensive survey dedicated exclusively to video processing is presented by Zhang and Kasturi [150]. Relevant work on video based text detection and tracking can be categorized into batch-processing and online-processing methods.

Within the former category, Li et al. [73] proposed a text tracking scheme where an SSD (Sum of Squared Differences) based correlation module was used to track the detected text between adjacent frames. Crandall et al. [19] proposed a method for caption text extraction where the motion vectors encoded in MPEG-compressed videos were used for tracking. Gllavata et al. [40] take advantage of temporal redundancy for detecting challenging text displayed against a complex background by building a multi-frame clustering. Myers and Burns [93] propose a method to track planar regions of scene text with arbitrary 3-D rigid motion by correlating small patches and computing homographies on multi-frame blocks simultaneously. All the aforementioned text tracking methods were designed for off-line video processing, and thus are not directly applicable to a real-time system.

Regarding online processing, Kim et al. [64] proposed a method that analyses the textural properties of text in images using a sliding window based classifier, and then locates and tracks the text regions by applying the continuously adaptive mean shift algorithm (CAMSHIFT) [9] on the texture classification results. Merino and Mirmehdi develop in [82] a real-time probabilistic tracking framework based on particle filtering, where SIFT matching is used to identify text regions from one frame to the next. Their region based text detection method is further extended in [83], where the authors present a head-mounted device for text recognition in natural scenes. A similar wearable camera system for the blind is presented in the work of Goto and Tanaka [46], where text strings are extracted using a DCT-based method [45] and then grouped into temporal chains by a text tracking method also based on a particle filter. Minetto et al. [87] propose a method combining a region-based text detection algorithm [86] with a particle filter tracking module. Fragoso et al. [35] have developed an Augmented Reality (AR) see-through text translator where the initial text detection is done with the help of the user, who has to tap on the screen near the center of the word of interest; the plane in which the text stands is then tracked in real time using efficient second-order minimization (ESM). Petter et al. [109] have extended this application with automatic text detection (not requiring user interaction) based on Connected Components analysis of the Canny edge detector output.

From the methods described, only [82], [46], and [109] report frame rates comparable to the method detailed in Chapter 7 (between 7 and 10 fps), but in [82] and [46] processing times are measured on a standard PC and not on low-resource devices. The key difference is that [82] and [46] make use of a full-detection strategy, performing a full text detection on every frame in order to provide observations to the tracking system, while in our method the detection module is only run periodically on a subset of the incoming frames. On the other hand, although [73] and [87] use a similar strategy to combine and merge text detections and tracked text regions, in our method the detection algorithm is executed asynchronously in a separate thread, thus not affecting the overall speed of the system. An analogous multi-threaded tracking solution is also used by Takeda et al. [126] in a document retrieval AR system to display relevant information as an image superimposed in real time over the camera captured frames.

Another important difference between the work presented here and the ones in [87, 73, 64, 35] is that our system is able to obtain an updated segmentation of tracked characters at every frame, despite not doing a full detection. Since we are not tracking just bounding box information but the text regions themselves, we can take advantage of several segmentations of each character, which would eventually lead to improved recognition accuracy. Finally, while most of the described methods are constrained to horizontally aligned text [82, 46, 87, 109] and pure translational movement models [87, 73, 19], our method can deal with multi-oriented text and is able to track it under scale, rotation, and perspective distortions.

2.4 End-to-end methods

Table 2.1 summarizes several aspects of the language models used in existing end-to-end approaches for unconstrained recognition; a more detailed description of these methods is provided afterwards. Table 2.1 compares the use of character/word n-grams (as well as their sizes), smoothing, over-segmentation, and dictionaries (here the "Soft" keyword means the method allows out-of-dictionary word recognition, while "Hard" means only in-dictionary words are recognized).

Method                           | Dictionary  | Char n-gram | Word n-gram | Smoothing | Over-segmentation
Neumann et al. [96, 97, 98, 99]  | Soft (10k)  | bi-gram     | No          | No        | Yes
Wang et al. [137]                | Hard ()     | No          | No          | No        | Yes
Neumann et al. [100]             | No          | 3-gram      | No          | No        | Yes
Yao et al. [142]                 | Soft (100k) | bi-gram     | No          | No        | No
Bissacco et al. [7]              | Soft (100k) | 8-gram      | 4-gram      | Yes       | Yes

Table 2.1: Summary of different language model aspects of end-to-end scene text recognition methods.

In [96] Neumann and Matas propose the classification of Maximally Stable Extremal Regions (MSERs) as characters and non-characters, and then the grouping of character candidates into text lines with multiple (mutually exclusive) hypotheses. For the recognition, each MSER region mask is normalized and resized to a fixed size in order to extract a feature vector based on pixel directions along the chain-code of its perimeter. Character recognition is done with an SVM classifier trained with synthetic examples. Ambiguous recognition of upper-case and lower-case variants of certain letters (e.g. "C" and "c") is handled by treating them as a single class, which is then differentiated using a typographic model that also serves to split the line into words. Finally, a combination of bi-gram and dictionary-based language models scores each text line hypothesis individually and the most probable hypothesis is selected. The authors further extended the text detection part of their method in several ways [97, 98], increasing their end-to-end recognition results. In [99] they add a new inference layer to the recognition framework, where the best sequence selection is posed as an optimal path problem, solved by a standard dynamic programming algorithm, allowing the efficient processing of even more segmentation hypotheses.

Wang et al. [137] propose the use of Convolutional Neural Networks together with unsupervised feature learning to train a text detector and a character recognizer. The responses of the detector in a multi-scale sliding window, with Non-Maximal Suppression (NMS), give rise to text line hypotheses. Word segmentation and recognition are then performed jointly for each text line using beam search. For every possible word the character recognizer is applied with a horizontal sliding window, giving a score matrix that (after NMS) can be used to compute an alignment score for all words in a small given lexicon. Words with a low recognition score are pruned as being "non-text". Since the method is only able to recognize words in the small lexicon provided, in order to perform a more unconstrained recognition the authors make use of an off-the-shelf spell checking software to generate the lexicon given the raw recognition sequences.

In a work closely related to the one presented in this thesis, Milyaev et al. [84] demonstrate that off-the-shelf OCR engines can still perform well on the scene text recognition task as long as appropriate image binarization is applied to the input images. For this, they evaluate 12 existing binarization methods and propose a new one using graph cuts. Their binarization method is combined with an AdaBoost classifier trained with simple features for character/non-character classification. The components accepted by the classifier are used to generate a graph by connecting pairs of regions that fulfill a set of heuristic rules on their distance and color similarity. Text lines obtained in this way are then split into words and passed to a commercial OCR engine¹ for recognition.

¹ OCR Omnipage Professional, available at http://www.nuance.com/

In [100] Neumann and Matas propose the detection of constant width strokes by convolving the gradient image with a set of bar filters at different orientations and scales.

Assuming that characters consist of a limited number of such strokes, a set of candidate bounding boxes is created as the union of the bounding boxes of 1 to 5 nearest strokes. Characters are detected and recognized by matching the stroke patterns with an Approximate Nearest Neighbor classifier trained with synthetic data. Each candidate bounding box is labeled with a set of character labels or rejected by the classifier. The non-rejected regions are then agglomerated into text lines and the word recognition is posed as an optimal sequence search, maximizing an objective function that combines the probabilities of the character classifier, the probability of the inter-character spacing difference of each triplet, the probability of the regions' relative positioning, and the character adjacency probability given by a 3-gram language model.

Yao et al. [142] propose an arbitrary oriented scene text detection and recognition method that extracts connected components in the Stroke Width Transform (SWT) domain [27]. Component analysis filters out non-text components using a Random Forest classifier trained with novel rotation invariant features. This component level classifier performs both text detection and recognition. The remaining character candidates are then grouped into text lines using the algorithm proposed by Yin et al. [146], and text lines are split into words using the method in [27]. Finally, the authors propose a modified dictionary search method, based on the Levenshtein edit distance but relaxing the cost of the edit operation between very similar classes, to correct errors in individual character recognition using a large-lexicon dictionary². To cope with out-of-dictionary words and numbers, n-gram based correction [122] is used if the distance to the closest dictionary word is under a certain threshold.

² The word list is provided by the Microsoft Web N-Gram Service (http://webngram.research.microsoft.com/info/) with the top 100k most frequently searched words on the Bing search engine.

Bissacco et al. [7] propose a large-scale end-to-end method using the conventional multistage approach to text extraction. In order to achieve high recall text detection the authors propose to combine the outputs of three different detection methods: a boosted cascade of Haar wavelets [133], a graph cuts based method similar to [107], and a novel approach based on anisotropic Gaussian filters. After splitting text regions into text lines, a combination of two over-segmentation methods is applied, providing a set of possible segmentation points for each text line. Beam search is then used to maximize a score function among all possible segmentations of a given text line. The score function is the average per-character log-likelihood of the text line under the character classifier and the language model. The character classifier is a deep neural network trained on HOG features over a training set of around 8 million examples. The output layer of the network is a softmax over 99 character classes and a noise (non-text) class. At test time this classifier evaluates all segmentation combinations selected by the beam search. The language model used in the score function optimized by the beam search is a compact character-level n-gram model (8-gram). Once the beam search has found a solution, a second-level language model, a much larger distributed word-level n-gram (4-gram), is used to rerank.

As a conclusion of this state-of-the-art review, we can see that the majority of reviewed methods make use of similar language models, based on a dictionary of frequent words and a character n-gram (usually a bi-gram). A much stronger language model has been proposed by Bissacco et al. [7]. On the other hand, while the system in [7] makes use of three different detection methods combined with an independent recognition module, in other cases detection and recognition are treated together, e.g. using the same features for detection and recognition [142] or using multiple character detections as an over-segmentation cue for the recognition [97], giving rise to more compact and efficient methods.
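To make the dictionary-plus-character-n-gram combination concrete, the toy sketch below scores a candidate transcription with a smoothed character bigram model and a soft dictionary bonus. The probabilities, smoothing floor, and bonus are made-up placeholders, not values used by any of the cited systems.

```python
# Toy scoring of a candidate transcription with a character bigram language
# model and a "soft" dictionary bonus. All constants are made-up placeholders.
import math

def score_transcription(word, bigram_probs, dictionary, floor=1e-4, bonus=2.0):
    """Average per-character log-likelihood plus a bonus for dictionary words."""
    padded = "^" + word.lower() + "$"              # word boundary markers
    logp = sum(math.log(bigram_probs.get((a, b), floor))
               for a, b in zip(padded, padded[1:]))
    logp /= max(len(padded) - 1, 1)
    if word.lower() in dictionary:                 # soft: out-of-dictionary allowed
        logp += bonus
    return logp

# Usage: pick the candidate hypothesis with the highest score, e.g.
#   best = max(candidates, key=lambda w: score_transcription(w, bigrams, lexicon))
```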

2.5 Script Identification

Script identification is a well studied problem in document image analysis. Ghosh et al. [37] have published a comprehensive review of methods dealing with this problem. They identify two broad categories of methods: structure-based and visual appearance-based techniques. In the first category, Spitz and Ozaki [124, 123] propose the use of the vertical distribution of upward concavities in connected components and their optical density for page-wise script identification. Lee et al. [70] and Waked et al. [134], among others, build on top of Spitz's seminal work by incorporating additional connected-component based features. Similarly, Chaudhuri et al. [12] use the projection profile, statistical and topological features, and stroke features for the classification of text lines in printed documents. Hochberg et al. [55] propose the use of cluster-based templates to identify unique characteristic shapes, a method that is similar in spirit to the one presented in Chapter 8, while requiring textual symbols to be precisely segmented in order to generate the templates.

Regarding segmentation-free methods based on the visual appearance of scripts, i.e. not directly analyzing the character patterns in the document, Wood et al. [139] experimented with the use of vertical and horizontal projection profiles of full-page document images. More recent methods in this category have used texture features from Gabor filter analysis [127, 106] or Local Binary Patterns [32].

All the methods discussed above are designed specifically with printed document images in mind. Structure-based methods require text connected components to be precisely segmented from the image, while visual appearance-based techniques are known to work better on bilevel text. Moreover, some of these methods require large blocks of text in order to obtain sufficient information and thus are not well suited for scene text, which typically comprises only a few words.

Contrary to the case of printed document images, research in script identification on non-traditional paper layouts is scarce, and until very recently it has been mainly dedicated to video overlaid text. Gllavata et al. [39], in the first work dealing with this task, proposed a method using the wavelet transform to detect edges in overlaid-text images. They then extract a set of low-level edge features and make use of a K-NN classifier.

Sharma et al. [113] have explored the use of traditional document analysis techniques for video overlaid-text script identification at word level. They analyze three sets of features: Zernike moments, Gabor filters, and a set of hand-crafted gradient features previously used for handwritten character recognition. They propose a number of pre-processing algorithms to overcome the inherent challenges of video overlaid-text. In their experiments the combination of super resolution, gradient features, and an SVM classifier performs significantly better than the other combinations.

Phan et al. [110] propose a method for combined detection of video overlaid text and script identification. They propose the extraction of upper and lower extreme points for each connected component of Canny edges of text lines and analyse their smoothness and cursiveness. Shivakumara et al. [118, 119] rely on skeletonization of the dominant gradients. They analyze the angular curvatures [118] of skeleton components, and the spatial/structural [119] distribution of their end, joint, and intersection points to extract a set of hand-crafted features. For classification they build a set of feature templates from training data, and use the Nearest Neighbor rule for classifying scripts at word [118] or text block [119] level.

As said before, all these methods have been designed (and evaluated) specifically for video overlaid-text, which presents in general different challenges than scene text. Concretely, they mainly rely on accurate edge detection of text components, which is not always feasible in scene text.

More recently, Sharma et al. [114] explored the use of Bag-of-Visual-Words based techniques for word-wise script identification in video overlaid-text. They use Bag-of-Features (BoF) and Spatial Pyramid Matching (SPM) with patch based SIFT descriptors and found that the SPM pipeline outperforms traditional script identification techniques involving gradient based features (e.g. HoG) and texture based features (e.g. LBP). Singh et al. [120] propose a closely related approach where densely computed (SIFT) local features are pooled to encode mid-level representations of the scripts.

In 2015, the ICDAR Competition on Video Script Identification (CVSI-2015) [115] challenged the document analysis community with a new competitive benchmark dataset, with images extracted from different video sources (news, sports, etc.) covering mainly overlaid-text, but also a few instances of scene text. The top performing methods in the competition were all based on Convolutional Neural Networks, showing a clear difference in overall accuracies over pipelines using hand-crafted features (e.g. LBP and/or HoG).

The first dataset for script identification in real scene text images was provided by Shi et al. in [117], where the authors propose the Multi-stage Spatially-sensitive Pooling Network (MSPN) method. The MSPN network overcomes the limitation of having a fixed size

input in traditional Convolutional Neural Networks by pooling along each row of the intermediate layers' outputs, taking the maximum (or average) value in each row. Their method is extended in [116] by combining deep features and mid-level representations into a globally trainable deep model. They extract local deep features at every layer of the MSPN and describe images with a codebook-based encoding method that can be used to fine-tune the CNN weights. Nicolaou et al. [103] have presented a method based on texture features producing state of the art results in script identification for both scene and overlaid text images. They rely on hand-crafted texture features, a variant of LBP, and a deep Multi-Layer Perceptron to learn a metric space in which they perform K-NN classification.

In Chapter 8 we propose a patch-based method for script identification using convolutional features, extracted from small image patches, and the Naive-Bayes Nearest Neighbour (NBNN) classifier. We also present a simple weighting strategy in order to discover the most discriminative parts per class in a fine-grained classification approach.

We take inspiration from recent methods in fine-grained recognition. In particular, Krause et al. [66] focus on learning expressive appearance descriptors and localizing discriminative parts. By analyzing images of objects with the same pose they automatically discover which are the most important parts for class discrimination. Yao et al. [141] obtain image representations by running template matching using a large number of randomly generated image templates. Then they use a bagging-based algorithm to build a classifier by aggregating a set of discriminative yet largely uncorrelated classifiers. Our method resembles [141, 66] in trying to discover the most discriminative parts (or templates) per class. However, in our case we do not assume those discriminative parts to be constrained in space, because the relative arrangement of individual patches in text samples of the same script is largely variable.

Also in Chapter 8 we extend this patch-based methodology in two ways: on one side, we make use of a much deeper Convolutional Neural Network model; on the other hand, we replace the weighted NBNN classifier by a patch-based classification rule that can be integrated in the CNN training process by using an Ensemble of Conjoined Networks. This way, our CNN model is able to learn at the same time expressive representations for image patches and their relative contribution to the patch-based classification rule.

Of all the reviewed methods, the ones proposed in this thesis, along with [114] and [120], are the only ones based on a patch-based classification framework. While [114] and [120] make use of traditional image recognition techniques, our method tries to combine the strengths of CNN representations and patch-based frameworks. Our intuition is that in cases where holistic CNN models are not directly applicable, as in the case of text images (because of their highly variable aspect ratios), the contribution of rich part descriptors without any deterioration (either by image distortion or by descriptor quantization) is essential for correct image classification.

In the experimental section we demonstrate that our approach improves the state of the art on the SIW-13 [116] dataset for scene text script classification by a large margin of 5 percentage points, while performing competitively on the CVSI-2015 [115] video overlaid-text dataset.

Chapter 3 Datasets

Standardized datasets and evaluation frameworks offer us the possibility to measure the progress of our research in a quantitative way, putting our work in context with existing methods. Public scene text datasets also constitute an invaluable source of training data that we have used extensively in our work.

In this Chapter we offer a description of all datasets that are used throughout this thesis. We start in Section 3.1 with details of the different datasets that are part of the ICDAR Robust Reading Competition series. These include the earliest scene text dataset from 2003, its respective revisions, and the newly added video and incidental text datasets. Then, we give details of several multi-language datasets for scene text detection and script identification in Sections 3.2 and 3.3 respectively. In Section 3.4 we introduce the Multi-Language end-to-end dataset (MLe2e) that has been built by us. Finally, Section 3.5 shortly describes the on-line platform for ground truthing and performance evaluation of text extraction systems that was developed for the organization of the ICDAR Robust Reading competition and has been made public.

3.1 The ICDAR Robust Reading Competition

The series of Robust Reading Competitions addresses the need to quantify and track progress in the domain of text extraction from versatile text containers like born-digital images, real scenes, and videos. The competition was started in 2003 by Simon Lucas, initially focusing only on scene text detection and recognition [78]. In 2011, the competition introduced a new challenge on text extraction from born-digital images [112, 58]. The 2013 edition of the Robust Reading Competition [60] further integrated the two challenges on real scenes and born-digital images, unifying their performance evaluation metrics, ground truth specification, and the list of offered tasks. Also in 2013 a new challenge was established on text extraction and tracking from video sequences, introducing a new dataset, tools, and evaluation frameworks. In its latest edition (2015) a new dataset on Incidental Scene Text has been made public, and the task of end-to-end recognition has been introduced for all existing datasets.


Focused Scene Text dataset

In its original version, created for the ICDAR 2003 Robust Reading competition [78], the Scene Text dataset contains 509 images, split into a training set of 258 and a test set of 251, with resolutions varying from 640 × 480 to 1280 × 960. During the different editions of the competition, in 2005, 2011, and 2013, the dataset was revamped to remove duplicated images and to improve the quality of the annotations. In its last version the ICDAR2013 dataset [60] contains 462 images, of which 229 comprise the training set and 233 the test set.

While covering only English and horizontally aligned text, this dataset represents the reference benchmark to date for the evaluation of scene text understanding related tasks: localization, segmentation, and recognition.

Initially the Focused Scene Text dataset was designed to evaluate three different tasks: scene text localization, word recognition, and character recognition. For the task of text localization the dataset has ground-truth defined at the word level with axis oriented bounding boxes, and the proposed evaluation framework is detailed in [78] for the 2003 version and by Wolf and Jolion [138] for all the following editions since 2005. For the tasks of word and character recognition, the cropped (pre-segmented) word/character images are provided and a unique word/character transcription is required for each one. For the task of text segmentation, introduced in 2013, the dataset has ground-truth defined at the pixel level in the form of binary masks. The evaluation framework for this task is defined in [16] by Clavelli and Karatzas.

Finally, the end-to-end task, introduced in 2015, evaluates the ability of scene text reading systems to jointly detect and recognize text in the full dataset images. The evaluation protocol proposed by Wang [135] is used, which considers a detection correct if it overlaps a ground truth bounding box by more than 50% and their transcriptions match, ignoring the case. At test time end-to-end methods can make use of a provided lexicon to improve their recognition. This way methods are evaluated in different tables depending on the size of the lexicon used: strongly contextualized recognition, weakly contextualized recognition, or generic recognition.

Figure 3.1: Sample images from the ICDAR 2003/2013 dataset.
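The end-to-end matching criterion described above is simple enough to express directly. The following is a minimal sketch (not the official competition evaluation code), assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max) tuples; function names are illustrative.

```python
# Illustrative sketch of the end-to-end matching rule (not the official
# ICDAR evaluation code). Boxes are assumed to be axis-aligned tuples
# (x_min, y_min, x_max, y_max); transcriptions are compared ignoring case.

def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned bounding boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def end_to_end_match(det_box, det_text, gt_box, gt_text, thr=0.5):
    """A detection is correct if it overlaps the ground truth by more than
    50% and the transcriptions match, ignoring case."""
    return iou(det_box, gt_box) > thr and det_text.lower() == gt_text.lower()
```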

Reading Text in Video

The ICDAR 2013 Robust Reading competition established a new challenge on text extraction from video sequences, introducing a new dataset, tools, and an evaluation framework. The Reading Text in Video dataset responds to the necessity of adapting to a new format. Text understanding in video sequences differs from still images in many aspects, and it is not a straightforward assumption that a method devised and trained on static text images would be applicable to video sequences.

A key characteristic of those video sequences is the temporal redundancy of text, which calls for tracking based processes that take advantage of past history to increase the stability and quality of detections. Keeping constant track of a text object throughout all the frames where it is visible is desirable, for example, to ensure a unique response of the system (e.g. translation, or text to speech conversion) for each distinct text, and also to be able to enhance the text regions, or to select the best frames in which they appear, before doing the full text recognition. Moreover, one can take advantage of the tracking process in order to obtain a real-time system, under the assumption that the scene does not change much from frame to frame.

The released dataset provides a total of 49 annotated video sequences, 25 videos in the training set and 24 in the test set, with durations between 10 seconds and 1 minute. The video sequences represent real-life applications in both indoor and outdoor scenarios, and were recorded using different devices including mobile phones, hand-held cameras, and head-mounted cameras. Figure 3.2 shows some frames extracted from the dataset videos.

The competition task requires that words are localized and recognized correctly in every frame and tracked over the video sequence, and thus an evaluation procedure has been designed in those terms. From the numerous evaluation frameworks proposed for multiple object tracking systems, we selected the CLEAR-MOT [6] and VACE [61] metrics, adapted to the specificities of text detection, recognition, and tracking, extending the CLEAR-MOT code provided by Bagdanov et al. [5].

Figure 3.2: Sample images from the ICDAR 2015 Scene Text video dataset.

Figure 3.3: Sample images from the ICDAR 2015 Incidental Text dataset.

Incidental Scene Text

The ICDAR 2015 competition introduced a new challenge on Incidental Scene Text detection and recognition, based on a dataset of 2000 images (split into equal-size sets for training and test). All images were acquired with wearable devices in a variety of indoor/outdoor scenarios and have a resolution of 1280 × 720 pixels. Sample images of this dataset are shown in Figure 3.3.

Incidental scene text refers to text that appears in the scene without the user having taken any specific prior action to cause its appearance or improve its positioning or quality in the frame. While focused scene text (see Section 3.1) is the expected input for applications such as translation on demand, incidental scene text covers another wide range of applications linked to wearable cameras or massive urban captures where the capture is difficult or undesirable to control.

Similarly to the case of Focused Scene Text, the dataset is designed to evaluate different tasks: localization, word recognition, and end-to-end, but not pixel-level segmentation. The ground truth is defined again at the word level, but in this case using arbitrary oriented bounding boxes (described by the coordinates of their four corners). The evaluation framework for the localization and end-to-end tasks is based on a single Intersection-over-Union criterion, with a threshold of 50%, similarly to standard practice in object recognition and the Pascal VOC challenge [29].

3.2 Multi-language scene text detection

KAIST

The KAIST dataset [71] comprises 3000 natural scene images, with a resolution of 640 × 480, categorized according to the language of the scene text captured: Korean, English, and Mixed (Korean + English). Images were obtained with digital cameras and mobile phones in a variety of scenarios (indoor and outdoor), and the type of text found is similar to that of the ICDAR Focused Scene Text dataset. Sample images are shown in Figure 3.4. This dataset does not provide a fixed split for training and test; in our experiments we use only 800 images for testing, corresponding to the Mixed subset, in accordance with other reported results. The dataset provides ground-truth annotations for the tasks of text localization and segmentation.

Figure 3.4: Sample images from the KAIST dataset.

For localization the ground truth is defined at the levels of words and individual characters, and the evaluation framework is the same as for the ICDAR Focused Scene Text dataset. For the segmentation task the ground-truth is defined at the pixel level in the form of binary masks. The evaluation metrics precision p and recall r are defined as p = |E ∩ T|/|E| and r = |E ∩ T|/|T|, where E is the set of pixels estimated as text and T is the set of pixels corresponding to text components in the ground truth.
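These pixel-level metrics translate directly into a few lines of code. A minimal sketch, assuming E and T are boolean NumPy masks of identical shape; the harmonic mean f is also computed, since it is the score reported in the result tables.

```python
# Minimal sketch of the pixel-level segmentation metrics defined above,
# assuming E and T are boolean NumPy masks of identical shape (estimated
# text pixels and ground-truth text pixels respectively).
import numpy as np

def segmentation_metrics(E, T):
    inter = np.logical_and(E, T).sum()               # |E ∩ T|
    p = inter / E.sum() if E.sum() > 0 else 0.0      # p = |E ∩ T| / |E|
    r = inter / T.sum() if T.sum() > 0 else 0.0      # r = |E ∩ T| / |T|
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0  # harmonic mean
    return p, r, f
```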

MSRA-TD500

The MSRA-TD500 dataset [144] contains arbitrary oriented text in both English and Chinese and is proposed as an alternative to the ICDAR2003 [78] dataset, where only horizontal English text appears. The dataset contains 500 images in total, with resolutions varying from 1296 × 864 to 1920 × 1280. Images are taken from indoor (office and mall) and outdoor (street) scenes using a pocket camera and depict typically well focused text. Sample images are shown in Figure 3.5. Ground truth annotations are defined at the text-line level with oriented rectangles, each defined by its axis-aligned version and a rotation angle.

The evaluation is done as proposed in [144] using minimum area rectangles. For an estimated minimum area rectangle D to be considered a true positive, it is required to find a ground truth rectangle G such that:

$$A(D' \cap G') / A(D' \cup G') > 0.5, \qquad |\alpha_D - \alpha_G| < \pi / 8$$

where D' and G' are the axis-aligned versions of D and G, A(D' ∩ G') and A(D' ∪ G') are respectively the areas of their intersection and union, and αD and αG are their rotation angles.

Figure 3.5: Sample images from the MSRA-TD500 dataset.

The definitions of precision p and recall r are p = |TP|/|E| and r = |TP|/|T|, where TP is the set of true positive detections, while E and T are the sets of estimated rectangles and ground truth rectangles, respectively.
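The matching rule can be sketched as follows, assuming rotated rectangles are given as (cx, cy, w, h, angle) tuples with angles in radians; this is an illustrative implementation, not the official evaluation script.

```python
# Illustrative sketch of the MSRA-TD500 matching rule. D' and G' are the
# axis-aligned versions of the rotated rectangles, so their overlap is a
# plain axis-aligned IoU; the rectangle format is an assumption.
import math

def axis_aligned(rect):
    cx, cy, w, h, _ = rect
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

def aa_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_true_positive(D, G):
    """D matches G if the axis-aligned IoU exceeds 0.5 and the rotation
    angles differ by less than pi/8."""
    return (aa_iou(axis_aligned(D), axis_aligned(G)) > 0.5
            and abs(D[4] - G[4]) < math.pi / 8)
```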

ICDAR2013 Multi-Script Robust Reading Competition

A competition to detect mixed text in Kannada and/or English from scene images was organized by the MILE Lab (http://mile.ee.iisc.ernet.in/mrrc/) in 2013. The motivation was to look for script-independent algorithms that detect text and extract it from scene images, and which may be applied directly to an unknown script. The Multi-Script Robust Reading Competition (MSRRC) dataset [68] comprises 334 camera-captured scene images, 167 in the training set and 167 in the test set, with sizes around 1.2MP, annotated for the text localization and segmentation tasks, as well as for cropped word recognition. It covers Latin, Chinese, Kannada, and Devanagari scripts, and includes text with multiple orientations. Sample images are shown in Figure 3.6. For the localization task the ground-truth is defined at the word level with axis oriented bounding boxes and the proposed evaluation framework is the one by Wolf and Jolion [138]. For the segmentation task the ground-truth is defined at the pixel level in the form of binary masks and the evaluation metrics are the same as explained in Section 3.2 for the KAIST dataset.

Figure 3.6: Sample images from the MSRRC dataset.

3.3 Scene Text Script Identification

ICDAR2015 Competition on Video Script Identification

The CVSI-2015 [115] dataset comprises pre-segmented words in ten scripts: English, Hindi, Bengali, Oriya, Gujrathi, Punjabi, Kannada, Tamil, Telegu, and Arabic. The dataset contains about 1000 words for each script and is divided into three parts: a training set (60% of the total images), a validation set (10%), and a test set (30%). Text is extracted from various video sources (news, sports, etc.) and, while it contains a few instances of scene text, it covers mainly overlaid video text. Sample pre-segmented words are shown in Figure 3.7(a).


Figure 3.7: Sample images from the CVSI-2015 (a) and SIW-13 (b) datasets.

Script Identification in the Wild (SIW-13)

The SIW-13 dataset [116] comprises 16291 pre-segmented text lines in thirteen scripts: Arabic, Cambodian, Chinese, English, Greek, Hebrew, Japanese, Kannada, Korean, Mongolian, Russian, Thai, and Tibetan. The test set contains 500 text lines for each script, 6500 in total, and all the other images are provided for training. In this case, text was extracted from natural scene images from Google Street View. Sample text lines are shown in Figure 3.7(b).

3.4 MLe2e multi-lingual end-to-end dataset

This section introduces the Multi-Lingual end-to-end (MLe2e) dataset for the evaluation of scene text end-to-end reading systems and all intermediate stages: text detection, script identification, and text recognition. The MLe2e dataset has been harvested from various existing scene text datasets for which the images and ground-truth have been revised in order to make them homogeneous. The original images come from the following datasets: Multilanguage (ML) [107] and MSRA-TD500 [143] contribute Latin and Chinese text samples, Chars74K [20] and MSRRC [68] contribute Latin and Kannada samples, and KAIST [71] contributes Latin and Hangul samples.

In order to provide a homogeneous dataset, all images have been resized proportionally to fit in 640 × 480 pixels, which is the default image size of the KAIST dataset. Moreover, the ground-truth has been revised to ensure a common text line annotation level [59]. During this process human annotators were asked to review all resized images, adding the script class labels and text transcriptions to the ground-truth, and checking for annotation consistency: discarding images with unknown scripts or where all text is unreadable (this may happen because images were resized); joining individual word annotations into text line level annotations; discarding images where correct text line segmentation is not clear or cannot be established, and images where a bounding box annotation contains significant parts of more than one script or significant parts of background (this may happen with heavily slanted or curved text). Arabic numerals (0, ..., 9), widely used in combination with many (if not all) scripts, are labeled as follows. A text line containing text and Arabic numerals is labeled as the script of the text it contains, while a text line containing only Arabic numerals is labeled as Latin.

Figure 3.8: Sample images from the MLe2e dataset.

The MLe2e dataset contains a total of 711 scene images covering four different scripts (Latin, Chinese, Kannada, and Hangul) and a large variability of scene text samples. The dataset is split into a train and a test set with 450 and 261 images respectively. The split was done randomly, but in a way that the test set contains a balanced number of instances of each class (approx. 160 text line samples of each script), leaving the rest of the images for the train set (which is not balanced by default). The dataset is suitable for evaluating various typical stages of end-to-end pipelines, such as multi-script text detection, joint detection and script identification, end-to-end multi-lingual recognition, and script identification in pre-segmented text lines. For the latter, the dataset also provides the cropped images with the text lines corresponding to each data split: 1178 and 643 images in the train and test set respectively.

While the dataset has been harvested from a mix of existing datasets, it is important to notice that building it has required a significant annotation effort, since some of the original datasets did not provide text transcriptions and/or were annotated at different granularity levels. Moreover, although the number of scripts in the dataset is rather limited (only four), it is one of the two first public datasets (along with ILST [120]) to cover the evaluation of all stages of multi-lingual end-to-end systems for scene text understanding in natural scenes. We think it is an important contribution of this thesis and hope it will be used by other researchers in the community.

3.5 An on-line platform for ground truthing and performance evaluation of text extraction systems

Coinciding with the 11th IAPR Workshop on Document Analysis Systems (DAS) the Computer Vision Center has released the CVC Annotation and Performance Evaluation Platform for Text Extraction (APEP-te) [59]: a set of on-line software tools that facilitate ground truthing and streamline performance evaluation over a range of text

extraction research tasks. The platform supports distributed ground truthing, allowing multiple users to work in parallel, while maintaining centralized management and quality control of the process. It supports annotation at different text representation levels, from pixels to text lines. The platform supports the definition of evaluation scenarios for text localization, text segmentation, and word recognition tasks, and brings together implementations of state of the art performance evaluation algorithms, on-line calculation of performance metrics, and per-image visualization of results. The APEP-te platform has been used extensively for ground truth creation, and it provides the submission management, performance evaluation, and results visualization functionality of the ICDAR 2011 and 2013 Robust Reading competitions. The on-line framework is available for public use. An installation pack that allows local deployment of the Web portal is available through http://www.cvc.uab.es/apep.

Chapter 4 Scene Text Extraction based on Perceptual Organization

A key characteristic of text is the fact that it emerges as a gestalt: a perceptually significant group of similar atomic objects. These atomic objects in the case of text are the character strokes giving rise to text-parts, be it well-separated characters, disjoint parts of characters, or merged groups of characters such as in cursive text. Such text-parts carry little semantic value when viewed separately (see Figure 4.1(a)), but become semantically relevant and easily identifiable when perceived as a group. Indeed, it can be shown that humans detect text without problems when perceptual organization is evident, irrespective of the script or language; in fact, they are able to do so for non-languages as well (see Figure 4.1(b)).

Figure 4.1: (a) Should a single character be considered “text”? (b) An example of automatically created non-language text (generated with Daniel Uzquiano’s random stroke generator: http://danieluzquiano.com/491). (c) Our method exploits perceptual organization laws always present in text, irrespective of scripts and languages.

In this sense, text detection is an interesting problem since it can be posed as the detection of meaningful groups of regions, as opposed to the analysis and classification of individual regions. Still, the latter is the approach typically adopted in state of the art methodologies. Some methods do include a post-processing stage where identified text regions are grouped into higher level entities: words, text lines, or paragraphs. This grouping stage is not meant to facilitate or complement the detection of text parts, but to prepare the already detected text regions for evaluation, as the ground truth is

specified at the word or text line level. This mismatch of semantic level between results and ground truth is a recognized problem that has given rise to specific evaluation methodologies that intend to tackle the problem [138].

We posit that the fact that a region can be related to other neighboring ones is central to classifying the region as a text part, and is in line with the essence of human perception of text, which is largely based on perceptual organization. The research hypothesis of this chapter is thus that building an automatic text detection process around the above fact can help overcome numerous identified difficulties of state of the art text detection methods.

To test the above hypothesis a state of the art perceptual organization computational model is employed to assess the meaningfulness of different candidate groups of regions. These groups emerge naturally through the activation of different visual similarity laws in collaboration with a proximity law.

Similarity between regions is not strictly defined in our framework. This is intentional, as due to design, scene layout, and environment effects different perceptual organization laws might be active in each case. Since text is a strong gestalt, a subset of such laws is expected to be active in parallel (collaborating) at any given instance. As a result a flexible approach is proposed, where various similarity laws are taken into account and the groups emerging through the individual activation of each similarity law provide the evidence to decide on the final set of most meaningful groups. The resulting method does not depend on the script, language, or orientation of the text to be detected.

4.1 Text Localization Method

We present a region based method where an initial set of image regions are grouped together in a bottom-up manner guided by similarity evidence obtained over various modalities such as color, size, or stroke width, among others, in order to obtain meaningful groups likely to be text gestalts, i.e. paragraphs, text lines, or words. Figure 4.2 shows the pipeline of the proposed algorithm for text extraction, where the process is divided into three main steps: region decomposition, perceptual organization based analysis, and line formation.

Figure 4.2: Text Extraction algorithm pipeline.

4.1.1 Region Decomposition

We make use of the Maximally Stable Extremal Regions (MSER) [80] algorithm for detecting text character candidates in the input images. The MSER algorithm builds a tree of regions with an extremal property of the intensity function over its outer boundary, and this property is normally present in all text parts as they are explicitly designed with high contrast in order to be easily read by humans. The resulting MSER tree is pruned by filtering the regions that are not likely to be text parts by their size, aspect ratio, stroke width variance, and number of holes.
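A minimal sketch of this region decomposition step using OpenCV's MSER implementation is shown below. The pruning thresholds are illustrative placeholders, not the values used in our experiments, and the stroke-width and hole filters are omitted for brevity.

```python
# Minimal sketch of MSER-based character candidate extraction using OpenCV.
# Thresholds are illustrative; the full pipeline also filters by stroke
# width variance and number of holes.
import cv2

def extract_character_candidates(gray):
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)
    candidates = []
    for pts, (x, y, w, h) in zip(regions, bboxes):
        if len(pts) < 30 or len(pts) > 0.25 * gray.size:  # too small / too large
            continue
        aspect = w / float(h)
        if aspect < 0.1 or aspect > 10.0:                 # implausible aspect ratio
            continue
        candidates.append((pts, (x, y, w, h)))
    return candidates

# Usage: gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
#        chars = extract_character_candidates(gray)
```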

4.1.2 Perceptual Organization Clustering

The perceptual organization clustering is applied to the entire set of resulting MSERs in three stages. First we create a number of possible grouping hypotheses by examining different feature sub-spaces. Then, these groups of regions are analyzed and we keep the most meaningful ones, thus providing an ensemble of clusterings. Finally, those meaningful clusterings are combined based on evidence accumulation [36].

Group Hypothesis Creation

We aim to use simple and low computational cost features describing similarity relations between characters of a word or text line. The list of features we use for this kind of similarity grouping is as follows:

• Geometrical features. Characters in the same word usually have similar geometric appearance. We make use of the bounding box area, number of pixels, and diameter of the bounding circle.
• Intensity and color mean of the region. We calculate the mean intensity value and the mean color, in the L*a*b* colorspace, of the pixels that belong to the region.
• Intensity and color mean of the outer boundary. Same as before but for the pixels in the immediate outer boundary of the region.
• Stroke width. To determine the stroke width of a region we make use of the Distance Transform as in [13].
• Gradient magnitude mean on the border. We calculate the mean of the gradient magnitude on the border of the region.

Each of these similarity features is coupled with spatial information, i.e. the x, y coordinates of the regions' centers, in order to capture the collaboration of the proximity and similarity laws. So, independently of the similarity feature we consider, we restrict the groups of regions that are of interest to those that comprise spatially close regions. Although it is usually expected that text parts belonging to the same word or text line share similar colors, stroke widths, and sizes, the previous assumption does not always hold (see Figure 4.3). This is why in our method we explore each of the similarity spaces in parallel.

Figure 4.3: There is no single best feature for character clustering: characters in the same word may appear with different color (a), stroke width (b), or sizes (c).

We build a dendrogram using Single Linkage Clustering analysis for each of the feature sub-spaces described above. Each node in the obtained dendrograms represents a group hypothesis whose perceptual meaningfulness will be evaluated in the next step of the pipeline.
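The hypothesis creation step can be sketched with SciPy's hierarchical clustering: for every similarity cue we stack the regions' (x, y) centers with that cue and build a single-linkage dendrogram, so proximity and similarity collaborate in the distance. Variable names are illustrative.

```python
# Sketch of group hypothesis creation: one single-linkage dendrogram per
# similarity cue, built over (x, y, cue) vectors.
import numpy as np
from scipy.cluster.hierarchy import linkage

def group_hypotheses(centers, cue_values):
    """centers: (N, 2) array of region centers; cue_values: (N,) array with
    one similarity feature (e.g. intensity mean or stroke width).
    Returns the SLC linkage matrix: its N-1 merge steps are the group
    hypotheses whose meaningfulness is tested in the next stage."""
    X = np.column_stack([centers, cue_values])
    return linkage(X, method="single", metric="euclidean")

# One dendrogram per similarity cue:
# dendrograms = [group_hypotheses(centers, cue) for cue in similarity_cues]
```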

Meaningfulness Testing

In order to find meaningful groups of regions in each of the defined feature sub-spaces we make use of a probabilistic approach to Gestalt Theory as formalized by Desolneux et al. [21]. The cornerstone of this theoretical model of perceptual organization is the Helmholtz principle, which could be informally summarized as: “We do not perceive anything in a uniformly random image”, or with its equivalent “a contrario” statement: “Whenever some large deviation from randomness occurs in an image some structure is perceived”. This general perception law, also known as the principle of common cause or of non-accidentalness, has been stated several times in Computer Vision, with Lowe [76] being the first to pose it in probabilistic terms. The Helmholtz principle provides the basis to derive a statistical approach to automatically detect deviations from randomness, corresponding to meaningful events. Consider that n atomic objects are present in the image and that a group G of k of them have a feature in common. We need to answer the question of whether this common feature is happening by chance or not (and thus is a significant property of the group). Assuming that the observed quality has been distributed randomly and uniformly across all objects, the probability that the observed distribution for G is a random realization of this uniform process is given by the tail of the binomial distribution:

$$B_G(k, n, p) = \sum_{i=k}^{n} \binom{n}{i} p^i (1 - p)^{n-i} \qquad (4.1)$$

where p is the probability of a single object having the aforementioned feature.

We make use of this metric in the dendrogram of each of the feature sub-spaces produced in the previous step separately to assess the meaningfulness of all produced grouping hypotheses. We calculate (4.1) for each node (merging step) of the dendrogram, using as p the ratio of the volume defined by the distribution of features of the samples forming the group with respect to the total volume of the feature sub-space. We then select a cluster A as maximally meaningful iff for every successor B and every ancestor C it holds that B_B(k, n, p) > B_A(k, n, p) and B_C(k, n, p) ≥ B_A(k, n, p). Notice that by using this maximality criterion no region is allowed to belong to more than one meaningful group at the same time.


Figure 4.4: (a) A scene image from the ICDAR2003 dataset, (b) its MSER decomposition. (c) Dendrogram of the feature sub-space (x, y coordinates, intensity mean), and (d) the maximal meaningful clusters found; (e)(f) same for the feature sub-space (x, y coordinates, stroke width).

The clustering analysis is done without specifying any parameter or cut-off value and without making any assumption on the number of meaningful clusters, but just by comparing the values of (4.1) at each node in the dendrogram.

Figure 4.4 shows the maximal meaningful clusters detected in a natural scene image for two of the feature sub-sets defined: in Figure 4.4(e) image regions are clustered in a three dimensional space based on proximity and intensity value, while in Figure 4.4(f) they are clustered based on proximity and stroke width. This behavior, of arriving at different grouping results depending on the similarity modality examined, is desirable, as it allows us to deal with variabilities in text design, illumination, perspective distortions, and so on.
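For reference, the binomial tail in (4.1) can be evaluated with standard scientific libraries. A minimal sketch, assuming SciPy is available; the example values in the comments are purely illustrative.

```python
# Sketch of the meaningfulness score of Eq. (4.1): the binomial tail gives
# the probability that k out of n regions share the observed feature purely
# by chance (smaller values = more meaningful groups).
from scipy.stats import binom

def group_meaningfulness(k, n, p):
    """B_G(k, n, p) = sum_{i=k}^{n} C(n, i) p^i (1 - p)^(n - i)."""
    return binom.sf(k - 1, n, p)   # survival function, i.e. P(X >= k)

# e.g. 8 of 40 regions falling in 5% of the feature sub-space volume:
# group_meaningfulness(8, 40, 0.05) -> a very small probability
```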

Evidence Accumulation

Once we have detected the set of maximally meaningful clusters Pi in each feature sub-space i ∈ N, the clustering ensemble P = {P1, P2, P3, ..., PN} is used to calculate the evidence [36] for each pair of regions to belong to the same group, producing a co-occurrence matrix D defined as:

$$D(i, j) = \frac{m_{ij}}{N} \qquad (4.2)$$

where m_ij is the number of times the regions i and j have been assigned to the same maximal meaningful cluster in P.

The co-occurrence matrix D is used as a dissimilarity matrix in order to perform the final clustering analysis of the regions, by applying the same hierarchical clustering process described in section 4.1.2.
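A sketch of this evidence accumulation step is given below, assuming each clustering in the ensemble is encoded as a label vector over the N regions, with -1 marking regions that belong to no maximal meaningful cluster; names are illustrative.

```python
# Sketch of evidence accumulation: count, for every pair of regions, in how
# many feature sub-spaces they fall in the same maximal meaningful cluster.
import numpy as np

def cooccurrence_matrix(ensemble):
    """Returns D with D[i, j] = m_ij / N as in Eq. (4.2), where m_ij counts
    how many clusterings put regions i and j in the same meaningful
    cluster and N is the number of feature sub-spaces."""
    n_regions = len(ensemble[0])
    D = np.zeros((n_regions, n_regions))
    for labels in ensemble:
        labels = np.asarray(labels)
        same = (labels[:, None] == labels[None, :]) & (labels[:, None] != -1)
        D += same
    return D / len(ensemble)
```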

Figure 4.5: Maximally meaningful clusters validated through the Evidence Accumulation framework.

Figure 4.5 shows two examples of the results obtained where the method exhibits its ability to deal with a flexible definition of what a text group is: on the bottom row text characters do not have the same stroke width but they share the same color, while on the top, on the contrary, text characters do not have the same color but have similar stroke, size, etc.

As expected, not only text is detected as meaningful, but also any kind of region arrangement with a text-like structure. It is important however to note at this stage that the algorithm produces relatively pure text-only clusters, with almost all text parts falling in clusters that contain virtually no non-text components. In order to filter non-text meaningful groups we use a combination of two classifiers. First, each region of the group is scored with the probability of being or not being a character by a Real AdaBoost classifier using features combining information of stroke width, area, perimeter, and number and area of holes. Then, simple statistics of these scores are fed into a second Real AdaBoost classifier for text/non-text group classification, together with the same features as before (but in this case computed for the whole group and not for independent regions) and a histogram of edge orientations of the Delaunay triangulation built with the group regions' centers. Both classifiers are trained using the ICDAR2003 [78] and MSRA-TD500 [144] training sets as well as with synthetic text images, using different scripts, to ensure script independence.

4.2 Experiments and Results

The proposed method has been evaluated on three multi-script datasets and one English-only dataset for different tasks: on the one hand for text segmentation on the KAIST and MSRRC datasets, and on the other for text localization on the MSRA-TD500 and ICDAR2003 datasets (see Chapter 3 for dataset details).

Text Segmentation

Table 4.1 shows the obtained results on the KAIST and MSRRC datasets. Sample qualitative results are shown in Figure 4.6, where the ability of our method to robustly extract regions in difficult cases like curved and multi-colored text can be appreciated.


Table 4.1: Scene text segmentation results on the KAIST and MSRRC datasets.

                          KAIST                MSRRC
                       p     r     f        p     r     f
Lee et al. [71]       0.69  0.60  0.64       -     -     -
OTCYMIST [69]         0.52  0.61  0.56     0.50  0.29  0.37
Sethi et al. [68]      -     -     -       0.33  0.72  0.45
Yin et al. [146, 68]   -     -     -       0.71  0.67  0.69
This Chapter          0.67  0.78  0.71     0.64  0.58  0.61

The method presented in this Chapter participated as a competing entry in the 2013 Multi-script Robust Reading Competition (MSRRC), reaching second place. While, as can be appreciated in Table 4.1, the winner of the competition [146] shows better numbers in both precision and recall, our proposal outperformed the other two entries by a noticeable margin. Consistently, our method also outperforms the Lee et al. [71] and OTCYMIST [69] methods on the KAIST dataset.

Text Detection

Since the perceptually meaningful text groups detected by our method rarely correspond directly to the semantic level at which ground truth information is defined (words in the case of ICDAR, and lines in the case of the MSRA-TD500 dataset), the proposed method is extended with a simple post-processing step in order to obtain text line level bounding boxes.

We consider a group of regions as a valid text line if the mean of the y-centers of its constituent regions lies in an interval of 40% around the y-center of their bounding box and the variation coefficient of their distribution is lower than 0.2.

Figure 4.6: Qualitative segmentation results on sample images of the KAIST (top) and MSRRC (bottom) datasets.

Notice that in MSRA-TD500, as we are considering text lines at any possible orientation, the orientation of the group (and consequently the definition of the y-axis) is always defined in relation to the axes of the circumscribed rectangle of minimum area for the given group.

If the collinearity test fails, it may be the case that the group comprises more than one text line. In such a case we perform a histogram projection analysis in order to identify the text lines' orientation, and then split the initial group into possible lines by clustering regions along the identified direction. This process is iteratively repeated, using the collinearity test described above, until all regions have either been assigned to a valid text line or rejected, or until no more partitions can be found.
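A sketch of the collinearity test used for text line validation is given below. It interprets the "40% interval around the y-center" as a band of plus or minus 20% of the bounding box height, which is our reading of the rule above and should be taken as an assumption; the 0.2 threshold on the coefficient of variation follows the value quoted in the text.

```python
# Sketch of the text line validation (collinearity) test. The band
# interpretation and variable names are assumptions made for illustration.
import numpy as np

def is_valid_text_line(y_centers, bbox_y_min, bbox_y_max):
    y_centers = np.asarray(y_centers, dtype=float)
    bbox_center = (bbox_y_min + bbox_y_max) / 2.0
    bbox_height = bbox_y_max - bbox_y_min
    # mean of the regions' y-centers must stay inside the central band
    in_band = abs(y_centers.mean() - bbox_center) <= 0.2 * bbox_height
    # coefficient of variation of the y-centers must be lower than 0.2
    cv = y_centers.std() / y_centers.mean() if y_centers.mean() != 0 else np.inf
    return in_band and cv < 0.2
```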

Table 4.2 shows a comparison of the obtained results with other state of the art methods on the MSRA-TD500 and ICDAR2003 datasets. Sample qualitative results are shown in Figure 4.7.

Table 4.2: Scene text localization results (precision, recall, and f-score) on the MSRA-TD500 and ICDAR2003 datasets.

                           MSRA-TD500            ICDAR2003
                         P     R     f         P     R     f
Chen et al. [14]        0.05  0.05  0.05      0.60  0.60  0.58
Epshtein et al. [27]    0.25  0.25  0.25      0.73  0.60  0.66
Li et al. [74]          0.30  0.32  0.31      0.45  0.80  0.57
TD-ICDAR [144]          0.53  0.52  0.54      0.68  0.66  0.66
TD-Mixture [144]        0.63  0.63  0.60      0.69  0.66  0.67
This Chapter            0.58  0.54  0.56      0.71  0.57  0.64

While our method is outperformed by TD-Mixture [144] on both text localization datasets, the most important outcome of the comparison in Table 4.2 is that our results are consistent with the claim of being script and orientation independent: the method presented in this Chapter outperforms on MSRA-TD500 other methods that are designed with only horizontal text in mind, even though some of them perform better than us on ICDAR2003.

Figure 4.7: Qualitative localization results on sample images of the MSRA-TD500 dataset.

By analyzing the errors of our method in both the localization and segmentation tasks we have found that in many cases we fail to detect small text, something that can be explained by the fact that the meaningfulness measure does not find those small clusters perceptually meaningful. On the other hand, we also found that sometimes the late fusion of the different similarity modalities through the evidence accumulation algorithm, while it helps in finding consensus and thus in increasing precision by removing duplicate detections, also removes groups that are detected as meaningful only in one of the similarity cues.

This effectively reduces the potential recall rate of the method. To mitigate these weaknesses, in the next Chapter we explore the idea of an early fusion of the similarity modalities, and we also complement the meaningfulness test presented in this Chapter with an efficient discriminative classifier.

4.3 Conclusion

In this chapter we have proposed a new methodology for scene text extraction inspired by the human perception of textual content, which is largely based on perceptual organization. The proposed method requires practically no training, as the perceptual organization based analysis is parameter free. It is totally independent of the language and script in which text appears, it can deal efficiently with any type of font and text size, and it makes no assumptions about the orientation of the text. Experimental results demonstrate competitive performance when compared with the state of the art.

Chapter 5 Optimal design and efficient analysis of similarity hierarchies

Hierarchical organization is an essential feature of text. Induced by typography and layout, the hierarchical arrangement of text strokes leads to the structural formation of text component groupings at different levels (e.g. words, text lines, paragraphs, etc.), see Figure 5.1. This hierarchical property applies independently of the script, language, or style of the glyphs, and thus it allows us to pose the problem of text detection in natural scenes in a holistic framework rather than as the classification of individual patches or regions as text or non-text.

Hierarchies are well-studied structures in mathematics and computer science, where they are used extensively as data structures. They have several interesting properties that make them suitable (desirable) for many different tasks, e.g. tree searches, knowledge representation, etc. One particularly interesting property of hierarchies, stemming from their recursive definition, is the inclusion relation that exists between every node and its ancestors in the hierarchy. This property can be exploited for example to efficiently update cluster distances at each step of a hierarchical clustering analysis (see the Lance-Williams family of agglomerative hierarchical clustering algorithms) or, in a similar way, for the incremental computation of certain cluster features [81, 131, 98].

In this Chapter we investigate how the recursive nature of hierarchical structures can be exploited for scene text detection.

Figure 5.1: A natural scene image and a hierarchical representation of its text. Atomic objects (characters) extracted in the bottom layer are agglomerated into text groupings at different levels of the hierarchy.

Figure 5.2: Individual text-parts are less distinguishable when viewed separately, but become structurally relevant and easily identifiable when perceived as a group.

For this, we follow in essence the same intuitions developed in the previous Chapter: we tackle directly the problem of detecting groups of text regions, instead of the classification of individual regions (see Figure 5.2), and the proposed method is again driven by an agglomerative clustering process exploiting the strong similarities that are expected between text components in such groups.

However, we here take a step further in exploiting the hierarchical structure of text grouping proposals, and make the following contributions in order to build a more robust and faster method. First, we learn an optimal feature space that encodes the similarities between text components, thus allowing the Single Linkage Clustering algorithm to generate text group hypotheses with high recall, independently of their orientations and scripts. Second, we couple the hierarchical clustering algorithm with a discriminative stopping rule that allows the efficient detection of text groups in a single clustering step. Third, we propose a new set of features for text group classification, which can be efficiently calculated in an incremental way and are able to capture the inherent structural properties of text related to the arrangement of text parts and the intra-group similarity. The use of incrementally computable group descriptors makes possible the direct evaluation of all group hypotheses generated by the clustering algorithm without affecting the overall time complexity of the method.

5.1 Hierarchy guided text extraction

Our hierarchical approach to text extraction involves an initial region decomposition step where non-overlapping atomic text parts are identified. These regions are then grouped with an agglomerative process led by their similarity, producing a dendrogram where each node represents a text group hypothesis. We can then find the branches corresponding to text groups by simply traversing the dendrogram with an efficient stopping rule. Figure 5.3 shows an example of the main steps of the pipeline.

As in the previous Chapter we make use of the MSER [80] algorithm to get the initial set of low-level text region candidates. Recall in character detection is increased by extracting regions from different single channel projections of the image (i.e. the gray, red, green, and blue channels). This technique, denoted MSER++ [97], is a way of diversifying the segmentation step in order to maximize the chances of detecting all text regions.

Figure 5.3: A bottom-up agglomerative clustering of individual regions produces a dendrogram in which each node represents a text group hypothesis. Our work focuses on learning the optimal features allowing the generation of pure text groups (comprising only text regions) with high recall, and designing a stopping rule that allows the efficient detection of those groups in a single grouping step.

5.1.1 Optimal clustering feature space

Our agglomerative grouping algorithm is based on the fact that regions belonging to the same word or text line necessarily have some properties in common that make such a group perceptually meaningful. As in the previous Chapter we make use of low computational cost features to define similarity measures between characters of a word or text line. The list of similarity measures used here is as follows: size (major axis of the regions' fitting ellipse), intensity mean of the regions, and intensity and gradient magnitude mean on the border of the regions. In addition to those similarity measures we make use of spatial information, i.e. the x and y coordinates of the regions' centers. This way we restrict the groups of regions that are of interest to those that comprise spatially close regions.

Distinctly from the previous Chapter, we now consider that it is possible to weight those simple similarity features, obtaining an optimal feature space projection that maximizes the probability of finding pure text groups (groups comprising only regions that correspond to text parts) in a Single Linkage Clustering (SLC) dendrogram.

Let Rc be the initial set of individual regions extracted with the MSER algorithm from channel c. We start an agglomerative clustering process, where initially each region r ∈ Rc starts in its own cluster and then the closest pair of clusters (A, B) is merged iteratively,

using the single linkage criterion (min { d(ra, rb) : ra ∈ A, rb ∈ B }), until all regions are clustered together (C ≡ Rc). The distance between two regions d(ra, rb) is calculated as the squared Euclidean distance between their weighted feature vectors, adding a spatial constraint term (the squared Euclidean distance between their centers'

coordinates ca and cb) in order to induce neighboring regions to

merge first:

$$d(r_a, r_b) = \lVert c_a - c_b \rVert_2^2 + \sum_{i=1}^{D} \big( w_i \cdot (a_i - b_i) \big)^2 \qquad (5.1)$$

where we consider the 5-dimensional similarity space (D = 5) comprising the features described before: mean gray value of the region, mean gray value in the immediate outer boundary of the region, region's major axis, mean stroke width, and mean of the gradient magnitude at the region's border. This way, a region (ra ∈ Rc) is defined by its center coordinates (ca) and its feature vector (a), with everything being rescaled to the range [0, 1]: dividing the features in a by the maximum values found in the training data, and the coordinates in ca by the image width/height respectively.

It is worth noting that, by using the squared Euclidean distance for the spatial term in equation 5.1, our clustering analysis remains rotation invariant, thus the obtained hierarchy generates the same text group hypotheses independently of the image orientation. For example, rotating the image in Figure 5.4(a) by any degree would produce exactly the same SLC dendrograms shown in Figures 5.4(c) and 5.4(d). This is intentional, as we want our method to be capable of detecting text in arbitrary orientations. In this way, our algorithm deals naturally with arbitrary oriented text without using any heuristic assumption or threshold.

An alternative formulation for an "orientation dependent" distance would be to include the center coordinates in the feature vector, thus allowing the x and y coordinates to weight differently in the distance sum. We evaluate this approach in Section 5.2 for the case of horizontally-aligned text datasets.

Given a possible set of weights w in equation 5.1, the SLC algorithm produces a dendrogram Dw where each node H ∈ Dw is a subset of Rc and represents a text group hypothesis. The text group recall represents the ability of a particular weighting configuration to produce pure text groupings (comprising only text regions) corresponding to words or text lines in the ground-truth. Figure 5.4 shows an example of how different weight configurations lead to different recall rates.

Figure 5.4: (a) Scene image, (b) its MSER decomposition, and (c,d) two possible hierarchies built from two different weight configurations; red nodes indicate pure text groupings.

We make use of a metric learning algorithm in order to learn the optimal weights w in equation 5.1 and maximize the text group recall in our training dataset.


For this we define the following loss function:

$$L(\mathbf{w}) = \sum_{r_q, r_p, r_n} I\big( d(r_q, r_p) > d(r_q, r_n) \big) \qquad (5.2)$$

where I(g) = 1 if g is true and I(g) = 0 otherwise, and the triplets {rq, rp, rn} stand for regions in Rc with the following relation: {rq, rp} are two regions that are part of the same text group (i.e. word or text line), and rn is a region that does not belong to their group. To find the w that minimizes the loss L(w) we use stochastic gradient descent (SGD) to minimize the convex surrogate LS(w) ≥ L(w):

$$L_S(\mathbf{w}) = \sum_{r_q, r_p, r_n} \max\big( 0, \; d(r_q, r_p) - d(r_q, r_n) \big) \qquad (5.3)$$

Thus, our SGD update rule is:

$$\mathbf{w} \leftarrow \mathbf{w} - \lambda \left. \frac{\partial L_S(\mathbf{w})}{\partial \mathbf{w}} \right|_{r_q, r_p, r_n} \qquad (5.4)$$

which evaluates to:

$$\mathbf{w} \leftarrow \begin{cases} \mathbf{w} & \text{if } d(r_q, r_p) < d(r_q, r_n) \\ \mathbf{w} - \lambda \big[ (\mathbf{q} - \mathbf{p})^{\circ 2} - (\mathbf{q} - \mathbf{n})^{\circ 2} \big] & \text{otherwise} \end{cases} \qquad (5.5)$$

where q, p, n are the feature vectors of regions rq, rp, and rn respectively, and (·)^{∘2} stands for the Hadamard (element-wise) power operation.

We have assembled a mixed set of training examples using the MSRRC and ICDAR training sets, which contain 167 and 229 images respectively. We have manually separated all text lines and words in the ground truth data of these images, giving rise to 1611 examples of text groups. Figure 5.5 shows the group examples extracted from one of the training set images. Since the MSRRC and ICDAR datasets already provide pixel level segmentation ground truth, the manual work performed to get our training data was limited to separating the individual groups (words and text lines).

The use of a single mixed training set for learning the optimal weights w is intentional, as we want to capture the importance of each similarity measure in equation 5.1 in a domain independent manner.

At training time, we perform SGD over the whole training set in the following way: at each iteration we randomly pick an image and a region rq that belongs to a text group (i.e. word or text line) in its ground-truth annotation. Then, rp is selected as the closest contiguous region in the same text group as rq, and rn is selected as the region with the smallest distance d(rq, rn) among the MSER regions extracted from the image that do not overlap with any ground-truth region in the same group as rq. The learning rate λ has been set to a small value by careful inspection, in order to not allow the weights in w to reach negative values in any case.
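The weight learning procedure of equations (5.2)-(5.5) reduces to a simple per-triplet update. A minimal sketch, assuming feature vectors rescaled to [0, 1] as described above; the non-negativity clipping at the end is an extra safeguard added for illustration, whereas in practice a small enough learning rate already prevents negative weights.

```python
# Minimal sketch of the per-triplet SGD update of Eqs. (5.1)-(5.5).
import numpy as np

def sgd_update(w, q, p, n, cq, cp, cn, lr=1e-3):
    """w: per-feature weights; q, p, n: feature vectors of r_q, r_p, r_n;
    cq, cp, cn: the corresponding region centers."""
    def dist(ca, cb, a, b):                        # Eq. (5.1)
        return np.sum((ca - cb) ** 2) + np.sum((w * (a - b)) ** 2)
    if dist(cq, cp, q, p) < dist(cq, cn, q, n):    # triplet already satisfied
        return w
    w = w - lr * ((q - p) ** 2 - (q - n) ** 2)     # Eq. (5.5), hinge-loss step
    return np.maximum(w, 0.0)                      # safeguard (illustrative)
```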

Figure 5.5: Our training set is assembled by manually separating the pixel level ground-truth of train images into all possible text groups (lines and words).

As discussed before, there is no single best way to define similarity between text parts, hence there is no single best set of weights for our strategy; instead, missing groups under a particular configuration may potentially be found under another. An alternative to using a single feature space would be to diversify our clustering strategy, adding more hypotheses to the system by building different hierarchies obtained from different weight configurations (similarly to what we do with different color channels). As a diversification strategy, after an optimized set of weights wopt is obtained we subsequently remove from the training set the groups that have been correctly detected, and then learn again new optimal weights

(w_{opt_2}, ..., w_{opt_n}) with the remaining groups. We evaluate this diversification strategy in section 5.2. At test time, each of the optimal weight configurations is used to generate a dendrogram where each node represents a text group hypothesis. Selecting the branches corresponding to text groups is done by traversing the dendrogram with an efficient stopping rule.

5.1.2 Discriminative and Probabilistic Stopping Rules

Given a dendrogram representing a set of text group hypotheses from the SLC algorithm, we need a strategy to determine the partition of the data that best fits our objective of finding pure text groups. A rule to decide the best partition in a Hierarchical Clustering is known as a stopping rule because it can be seen as stopping the agglomerative process. Differently from standard clustering stopping rules, here we do not expect to obtain a full partition of the regions in R_c. In fact, we do not even know if there are any text clusters at all in R_c. Moreover, in our case we have a quite clear model for the kind of groups sought, corresponding to text. These particularities motivate the next contribution of this Chapter. We propose a stopping rule, to select a subset of meaningful clusters in a given dendrogram D_w, comprising the combination of the following two elements:

• A text group discriminative classifier.

• A probabilistic measure for hierarchical clustering validity assessment [11].

Discriminative Text Group Classifier

The first part of our stopping rule takes advantage of supervised learning, building a discriminative classification model F in a group-level feature space. Thus, given a group hypothesis H and its feature vector h, our stopping rule will accept H only if F(h) = 1. We use a Real AdaBoost classifier [111] with decision stumps. Our group-level features originate from four different types: 1) Color and edge intra-similarity statistics, since we expect regions in the same word to have low variation in color and contrast to their background; 2) Geometric intra-similarity statistics, same as before but for the size of the regions; 3) Shape similarity of participating regions, in order to discriminate repetitive patterns, such as bricks or windows in a building, which tend to be confused with text; 4) Structural collinearity and equidistance features, which measure the text-like structure of text groups by using statistics of the 2-D Minimum Spanning Tree (MST) built with their region centers. The list of the 14 used features is as follows:

Color and edge features:

• Foreground intensities standard deviation.
• Background intensities standard deviation.
• Mean gradient standard deviation.

Geometric features:

• Major axis coefficient of variation.
• Aspect ratios coefficient of variation.

Shape features:

• Stroke widths [13] coefficient of variation.
• Hu's invariant moments [50] average Euclidean distance.
• Convex hull compactness [98] mean and standard deviation.
• Convexity defects coefficient of variation.

Structural features:

• MST angles mean and standard deviation.
• MST edge distances coefficient of variation.
• MST edge distances mean vs. regions' diameters mean ratio.

The AdaBoost classifier is trained using the same training set described in section 5.1.1. We have two sources of positive samples: 1) using each GT group as if it were the output of the region decomposition step; 2) we run MSER and SLC (w_opt) against a training image and use as positive samples those pure-text groups in the SLC tree with more than 80% match with a GT group.

From the same tree we extract negative examples as nodes with 0 matchings. This gives us around 3k positive and 15k negative samples. We balance the positive and negative data and train a first classifier that is used to select around 500 hard negatives that are used to re-train and improve accuracy.
Figure 5.6 shows the accuracy of the AdaBoost classifier using different features, in order to visualize the contribution of each group feature to the classifier performance. Reported accuracies are obtained by dividing the training samples described before into two sets for training (80%) and testing (20%). The final test accuracy using all 18 group features is 94.25%.
In order to compare our text group classifier with other state of the art approaches for text classification, we evaluate the performance of the T-HOG gradient-based descriptor [89] using exactly the same data as in Figure 5.6. In these conditions T-HOG obtains a 94.56% test accuracy, just slightly better than ours. However, it is very important to notice here that the computational time needed by T-HOG (as well as by many other descriptors in the literature) makes its use unfeasible in our pipeline, since its computation cannot be performed in an incremental way. The processing time needed by T-HOG to evaluate the whole hypotheses dendrograms in our pipeline was on average 32 seconds, while our method is able to process the same number of hypotheses in 1.6 seconds. In this sense what makes our classifier a better choice is not only its discriminative power but also the ability to be calculated efficiently over the whole set of group hypotheses defined by the SLC analysis, as will be explained next.

Figure 5.6: Classification accuracies of the text group classifier using different features.

Incrementally computable descriptors

Since at test time we have to calculate the group-level features at each node of the similarity hierarchy, it is important that they are fast to compute. We take advantage of the inclusion relation of the dendrogram's nodes in order to make such features incrementally computable when possible. This allows us to compute the probability of each possible group of regions being a text group without affecting the overall time complexity of our algorithm.

Group-level features consisting of simple statistics over individual region features (e.g. diameters, strokes, intensity, etc.) can be incrementally computed straightforwardly with a few arithmetic operations and so have a constant complexity O(1). Regarding the MST-based features, an incremental algorithm (i.e. propagating the MST of children nodes to their parent) computing the MST on each node of the dendrogram takes O(n log2 n) in the worst case. Although this complexity is much lower than the O(n²) complexity of the SLC step and thus does not affect the overall complexity of the algorithm, it has a noticeable impact on run time. For this reason we add a heuristic rule on the maximum size of valid clusters: clusters with more than a certain number of regions are immediately discarded and there is no need to compute their features. By taking the length of the largest text line in the MSRRC training set (50) as the maximum cluster size, the run-time growth due to the feature calculation in our algorithm is negligible and the obtained results are not affected at all.
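As an illustration of what "incrementally computable" means here, the sketch below (a simplified example of mine, not the thesis code) merges the sufficient statistics of two child nodes to obtain the mean, standard deviation, and coefficient of variation of a region feature at their parent node in O(1), instead of revisiting all member regions.

```python
from dataclasses import dataclass
import math

@dataclass
class FeatStats:
    """Sufficient statistics of one region feature for a dendrogram node."""
    n: int        # number of regions in the node
    s1: float     # sum of feature values
    s2: float     # sum of squared feature values

    def merge(self, other: "FeatStats") -> "FeatStats":
        # Parent statistics are just the sums of the children's statistics.
        return FeatStats(self.n + other.n, self.s1 + other.s1, self.s2 + other.s2)

    def mean(self) -> float:
        return self.s1 / self.n

    def std(self) -> float:
        var = max(self.s2 / self.n - self.mean() ** 2, 0.0)
        return math.sqrt(var)

    def coeff_of_variation(self) -> float:
        m = self.mean()
        return self.std() / m if m != 0 else 0.0
```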

Probabilistic cluster meaningfulness estimation

The way our classifier F is designed may eventually make the discriminative stopping rule accept groups with outliers. For example, Figure 5.7 shows the situation where a node of the dendrogram consisting of a correctly detected word is merged with a (character-like) region which is not part of the text group (an outlier). In order to increase the discriminative power of our stopping rule in such situations, we make use of a probabilistic measure of cluster meaningfulness [21, 11]. This probabilistic measure, also used in the previous Chapter, provides us with a way to compare clusters' qualities in order to decide if a given node in the dendrogram is a better text candidate than its children.
The Number of False Alarms (NFA) [21, 11], based on the principle of non-accidentalness, measures the meaningfulness of a particular group of regions in R_c by quantifying how much the distribution of their features deviates from randomness.

$$NFA(H) = \mathcal{B}(k, n, p) = \sum_{i=k}^{n} \binom{n}{i} \, p^i \, (1-p)^{n-i} \qquad (5.6)$$

where n is the number of regions in R_c, k is the number of regions in the group hypothesis H we are evaluating, and p is the probability that the feature vector of a randomly selected region from R_c falls in the volume defined by the distribution of the feature vectors of the comprising regions (h ∈ H) in the 5-D feature space defined in Section 5.1.1. The lower the NFA is, the more meaningful H is.
Our stopping rule is defined recursively in order to accept a particular hypothesis H as a valid group iff its classifier-predicted label is "text" (F(h) = 1) and its meaningfulness measure is higher than the respective meaningfulness measures of every successor A ∈ suc(H) and every ancestor B ∈ anc(H) labeled as text, i.e. the following

inequalities hold:

$$NFA(H) < NFA(A), \quad \forall A \in suc(H) \mid F(A) = 1 \qquad (5.7)$$

$$NFA(H) < NFA(B), \quad \forall B \in anc(H) \mid F(B) = 1 \qquad (5.8)$$

Again, it is important to notice that by using these criteria no region is allowed to belong to more than one text group at the same time. Thus, the final selection of non-overlapping text clusters that conform the output of our method is done in a parameter-free procedure, just by comparing the values of (5.6) at all nodes in the dendrogram that are labeled as "text" by the discriminative classifier (F(H) = 1), without making any assumption on the number of desired clusters. Figure 5.8 shows the effect of the NFA selection criteria on the output of the discriminative classifier in a particular image hypotheses tree. See Figure 5.7 for a synthetic example of how this stopping rule is able to detect outliers. As a side effect, the stopping rule is eventually able to correctly separate different words in a text line.
At this point, applying the method described so far, our algorithm is able to produce results for the scene text segmentation task. The segmentation task is evaluated at pixel level, that is, the algorithm must provide a binary image where white pixels correspond to text and black pixels to background. All segmentation results given in section 5.2 are obtained with this algorithm, trained with a single mixed dataset and without any further post-processing, by setting to white the pixels corresponding to the detected text groups. Figures 5.9 and 5.11 show segmentation results of our method on the MSRRC and KAIST datasets.

Figure 5.7: A node in a similarity dendrogram consisting of a correctly detected word (H1) is merged with a cluster consisting of a single region outlier (H2). Our stopping rule will not consider the resulting cluster H = {H1 ∪ H2} valid, although the classifier has labeled it as a text group (F(h) = 1), because NFA(H) is larger than NFA(H1). The scatter plot simulates the arrangement of the feature vectors of the regions forming H1, H2, and H in the similarity feature space.
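The binomial tail in equation 5.6 and the comparisons in (5.7)-(5.8) are simple to compute. The following is a minimal sketch of the idea, my own simplification rather than the thesis implementation, assuming that the per-node probability p and the classifier decision are already available for every node of the dendrogram.

```python
from math import comb

def nfa(k: int, n: int, p: float) -> float:
    """Binomial tail B(k, n, p) of equation 5.6 (lower = more meaningful)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def is_valid_group(node, n_regions: int) -> bool:
    """Stopping rule: the node must be classified as text and be more
    meaningful than every text-labeled successor and ancestor (eqs. 5.7-5.8).

    `node` is assumed to expose: is_text (classifier output), k (group size),
    p (probability term of eq. 5.6), successors() and ancestors() iterators.
    """
    if not node.is_text:
        return False
    score = nfa(node.k, n_regions, node.p)
    relatives = list(node.successors()) + list(node.ancestors())
    return all(score < nfa(r.k, n_regions, r.p)
               for r in relatives if r.is_text)
```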

5.1.3 From Pixel Level Segmentation to Bounding Box Localization

In order to evaluate our method on the text localization task we extend it with a simple post-processing step to obtain word and text line bounding boxes, depending on the semantic level at which the ground-truth information is defined (e.g. words in the case of the ICDAR and MSRRC datasets, lines in the case of the MSRA-TD500 dataset). This is because the text groups detected by our stopping rule may correspond indistinctly to words, lines, or even paragraphs in some cases,


depending on the particular typography and layout of the detected text.
First of all, the region groups selected as text by our stopping rule in the different dendrograms are combined in a procedure that serves to deduplicate repeated groups (e.g. the same group may potentially be found in several channels or weight configurations) and to merge collinear groups that may have been detected in chunks. Two given text groups are merged if they are collinear, and their relative distance and height ratio are under thresholds learned during training. After that, if needed by the granularity of the ground-truth level, we split the resulting text lines into words by considering as word boundaries all spaces between regions whose distance is larger than a certain threshold, learned during training, proportional to the group's average inter-region distance.

Figure 5.8: (a) Selective search of text groups defined by our SLC analysis, each rectangle represents a text group hypothesis: in green, groups labeled as text by the discriminative classifier, in red, groups labeled as non-text; (b) text groups selected with the NFA criteria over the hypotheses tree.
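The line-to-word splitting just described reduces to a one-pass scan over the horizontally sorted regions of a line. The snippet below is a hedged illustration of that rule; the threshold factor alpha is a stand-in for the value learned during training, not the actual learned number.

```python
def split_line_into_words(regions, alpha=1.5):
    """Split a detected text line into words at large inter-region gaps.

    `regions` is a list of axis-aligned boxes (x, y, w, h) belonging to one
    text line; a gap larger than alpha times the average gap starts a new word.
    """
    regions = sorted(regions, key=lambda r: r[0])
    gaps = [regions[i + 1][0] - (regions[i][0] + regions[i][2])
            for i in range(len(regions) - 1)]
    if not gaps:
        return [regions]
    threshold = alpha * (sum(gaps) / len(gaps))   # proportional to average gap
    words, current = [], [regions[0]]
    for gap, region in zip(gaps, regions[1:]):
        if gap > threshold:
            words.append(current)
            current = []
        current.append(region)
    words.append(current)
    return words
```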

5.2 Experiments

The proposed method has been extensively evaluated on three multi-script and arbitrarily oriented text datasets and two English-only datasets for the tasks of scene text segmentation and localization.

5.2.1 Baseline analysis

We have evaluated different variants of our method in order to assess the contribution of each of the proposed techniques. This baseline analysis is performed on the MSRRC test set. The baseline method is configured by setting all weights to 1 (w_I) and accepting all group hypotheses which are labeled as text by the classifier (F(H) = 1). We compare this baseline with the variants making use of the learned optimal weights w_opt, and with including the meaningfulness criteria in our stopping rule. Finally, we have evaluated the impact of different diversification strategies on the initial segmentation, both in the number of image channels (MSER vs. MSER++), and in the

number of weight configurations by adding a variable number of

optimally weighted configurations w_{opt_2}, ..., w_{opt_n} into the system. Table 5.1 shows segmentation results of our method on the MSRRC 2013 test set comparing different variants of our method and different diversification strategies. The table also includes two variants of our previously published work [42] for comparison. We chose to use the MSRRC dataset for this analysis as it is representative of the targeted scenario of multi-script and arbitrarily oriented text.

Table 5.1: Segmentation results comparing different variants of our method.

Method                                          Precision  Recall  F-score
Previous Chapter [42, 68]                       0.64       0.58    0.61
Previous Chapter MSER++                         0.50       0.71    0.59
MSER | wI                                       0.69       0.58    0.63
MSER | wopt                                     0.69       0.62    0.65
MSER | wI | stop-rule                           0.76       0.60    0.67
MSER | wopt | stop-rule                         0.77       0.62    0.69
MSER++ | wopt | stop-rule                       0.75       0.71    0.73
MSER++ | wopt, wopt2, ..., wopt4 | stop-rule    0.67       0.73    0.70
MSER++ | wopt, wopt2, ..., wopt7 | stop-rule    0.56       0.74    0.64

From the obtained results we can see that the optimized weights w_opt have a noticeable impact on the method's recall, while the stopping rule leads to a considerable increase in precision without any deterioration of recall. Regarding diversification, if one wants to maximize the harmonic mean of precision and recall, the use of MSER++ is well justified even though it produces a slight drop in precision. However, examining the effect of further diversification using more optimal weighting configurations, we can see that the gain in recall obtained by adding more hypotheses does not help improve the f-score, as it produces a significant deterioration in precision. Such a diversification strategy should be considered only if one wants to maximize the system's recall.

5.2.2 Scene text segmentation results

Table 5.2 compares the final results of our method with the state of the art on the scene text segmentation task in three different benchmarks (KAIST, MSRRC, and ICDAR2013). A set of example qualitative results is shown in Figures 5.9 and 5.11.

Figure 5.9: Qualitative segmentation results on the MSRRC 2013 dataset.

Table 5.2: Segmentation results in standard datasets. Methods marked with * have not been published up to date.

                        KAIST              MSRRC              IC13
Method                  p    r    f        p    r    f        p    r    f
Previous Chapter        0.67 0.78 0.71     0.64 0.58 0.61     0.63 0.59 0.61
NUS FAR [60]*           -    -    -        -    -    -        0.82 0.75 0.78
Lee et al. [71]         0.69 0.60 0.64     -    -    -        -    -    -
NSTextractor [60]       -    -    -        -    -    -        0.76 0.61 0.68
OTCYMIST [69]           0.52 0.61 0.56     0.50 0.29 0.37     0.46 0.59 0.52
Sethi et al. [68]       -    -    -        0.33 0.72 0.45     -    -    -
TextDetect. [60]        -    -    -        -    -    -        0.76 0.65 0.70
FuStar [146]            -    -    -        -    -    -        0.74 0.70 0.72
Yin et al. [146, 68]    -    -    -        0.71 0.67 0.69     -    -    -
This Chapter            0.67 0.89 0.76     0.75 0.71 0.73     0.74 0.71 0.73
(Horiz. only)           -    -    -        -    -    -        0.77 0.73 0.75

As can be seen in Table 5.2, our method outperforms the previous state of the art on the KAIST and MSRRC benchmarks, while providing competitive results on the ICDAR dataset. In the case of horizontally aligned text datasets we also provide results of a specialized version of our method (Horiz. only), obtained by using the 'orientation dependent' distance explained in section 5.1.1 and filtering out detections with non-horizontal orientations.
Figure 5.10 shows the inverse grade curves of different methods on the MSRRC dataset. The inverse grade curve plots the f-score divided by the ratio of text pixels for each image, and inversely sorts these values by the amount of text pixels, thus larger values on the x-axis correspond to images with less text. As can be seen, our curve is the nearest to follow the ground-truth benchmark curve.
The interpretation of the large increase in recall observed on the KAIST dataset, compared to that obtained on MSRRC and ICDAR, follows from the fact that in the KAIST dataset small text characters are not labeled in the ground truth. These small text components are in general the most difficult ones to detect. On the other hand, in some cases precision suffers when such small text is correctly detected, as it counts as a false positive.

Figure 5.10: Inverse grade curves of different methods in the MSRRC dataset [68].

Figure 5.11: Qualitative segmentation results on the KAIST dataset.

5.2.3 Scene text localization results

Table 5.3 compares the final results of our method with state-of-the-art scene text localization methods in five benchmarks (KAIST, MSRA-TD500, MSRRC, ICDAR2013, and ICDAR2003). A set of example qualitative results is shown in Figures 5.12 and 5.13.

Table 5.3: Scene text localization results (precision, recall, and f-score) in standard datasets.

                          KAIST             MSRA-TD500        MSRRC             IC13              IC03
Method                    P    R    f       P    R    f       P    R    f       P    R    f       P    R    f
CASIA NLPR [60]           -    -    -       -    -    -       -    -    -       0.79 0.68 0.73    -    -    -
Chen et al. [14]          -    -    -       0.05 0.05 0.05    -    -    -       -    -    -       0.60 0.60 0.58
Epshtein et al. [27]      -    -    -       0.25 0.25 0.25    -    -    -       -    -    -       0.73 0.60 0.66
Previous Chapter          -    -    -       0.58 0.54 0.56    -    -    -       -    -    -       0.71 0.57 0.64
Lee et al. [71]           0.69 0.60 0.64    -    -    -       -    -    -       -    -    -       -    -    -
Li et al. [74]            0.59 0.79 0.67    0.30 0.32 0.31    -    -    -       -    -    -       0.45 0.80 0.57
I2R NUS FAR [60]*         -    -    -       -    -    -       -    -    -       0.75 0.69 0.72    -    -    -
Text Detection [60]       -    -    -       -    -    -       -    -    -       0.74 0.53 0.62    -    -    -
TextSpotter [60, 98]      -    -    -       -    -    -       -    -    -       0.88 0.65 0.74    -    -    -
TD-ICDAR [144]            -    -    -       0.53 0.52 0.54    -    -    -       -    -    -       0.68 0.66 0.66
TD-Mixture [144]          -    -    -       0.63 0.63 0.60    -    -    -       -    -    -       0.69 0.66 0.67
Yin et al. [146, 68, 60]  -    -    -       -    -    -       0.64 0.42 0.51    0.88 0.66 0.76    -    -    -
This Chapter              0.71 0.83 0.77    0.69 0.54 0.61    0.63 0.54 0.58    0.78 0.67 0.72    0.74 0.65 0.69
(Horiz. only)             -    -    -       -    -    -       -    -    -       0.80 0.69 0.74    0.75 0.66 0.70

Results in Table 5.3 demonstrate that the method proposed in this Chapter outperforms other state-of-the-art methods on the KAIST, MSRA-TD500, and MSRRC datasets, while being competitive with the ICDAR2013 robust reading competition results. The ICDAR2013 results have a coherent interpretation, as we aim for the highest generality of our method, addressing the unconstrained problem of detecting text irrespective of its language, script, and orientation. Contrary to our method, most methods listed in the ICDAR columns of Table 5.3 have been trained explicitly for horizontally aligned English text and address only this particular scenario. In this sense, it is important to notice that some of the top scoring methods in the IC03 column have been evaluated on the MSRA-TD500 arbitrarily oriented text dataset with a much worse performance compared to the method proposed here.
The average time performance of our method using a standard commodity i7 CPU ranges from 0.5 to 3 seconds depending on the input image size. This processing time could be reduced by a factor of 4 if each input color channel is processed independently in parallel.

Figure 5.12: Qualitative localization results on the ICDAR 2013 dataset.

Figure 5.13: Qualitative localization results on the MSRA-TD500 dataset.

5.3 Conclusions

This Chapter details a novel scene text extraction method in which the exploitation of the hierarchical structure of text plays an integral part. We have seen that the algorithm can efficiently detect text groups with arbitrary orientation in a single clustering process that involves: a learned optimal clustering feature space for text region grouping, novel discriminative and probabilistic stopping rules, and a new set of features for text group classification that can be efficiently calculated in an incremental way.
Experimental results demonstrate that the presented algorithm outperforms other state of the art methods on three multi-script and arbitrarily oriented scene text standard datasets, while it stays competitive in the more restricted scenario of the horizontally aligned English-text ICDAR dataset. Moreover, the presented results in all datasets are obtained with a single (mixed) training set, demonstrating the general purpose character of the method, which yields robust performance in a variety of distinctly different scenarios.
Finally, the baseline analysis of the algorithm reveals that the overall system recall can be substantially increased if needed by using feature space diversification. Our findings are in line with recent advances in object recognition [129] where bottom-up grouping of an initial segmentation is used to generate object location hypotheses, producing a substantially reduced search space in comparison to the traditional sliding window approaches, an intuition that will be further explored in the next Chapter.

Chapter 6 Text Regions Proposals

Object Proposals is a recent computer vision technique for the generation of high quality object locations. The main interest of such methods is their ability to speed up recognition pipelines that make use of complex and expensive classifiers, by considering only a few thousand bounding boxes. It therefore constitutes an alternative to exhaustive search, which has many well known drawbacks, and enables the efficient use of more powerful classifiers in end-to-end pipelines by greatly reducing the search space, as shown in Figure 6.1.
On the other hand, in the context of scene text understanding, whole-word recognition methods [41, 2, 51] have demonstrated great success in difficult tasks like word spotting or text based retrieval; however, they are usually based on expensive techniques. In this scenario the underlying process is similar to the one in multiclass object recognition, which suggests the use of Object Proposals techniques mimicking the state of the art object recognition pipelines.
Traditionally, high precision specialized detectors have been used for segmentation of text in natural scenes, and afterwards text recognition techniques are applied to their output [98, 137, 142, 43]. But it is a well known fact that the perfect text detector, able to work in any conditions, does not exist to date. In fact, to mitigate the lack of a perfect detector, Bissacco et al. [7] propose an end-to-end scene text recognition pipeline using a combination of several detection methods running in parallel, demonstrating that if one has a robust

Figure 6.1: Sliding a window over all possible locations, sizes, and aspect ratios represents a considerable waste of resources. The best ranked 250 proposals generated with our text specific selective search method provide 100% recall and high-quality coverage of words in this particular image.

recognition method at the end of the pipeline, the most important thing in earlier stages is to achieve high recall, while precision is not so critical.
The dilemma is thus to choose between having a small set of detections with very high precision but most likely losing some of the words in the scene, or a larger set of proposals, usually in the range of a few thousand, with better coverage, and then letting the recognizer make the final decision. The latter seems to be a well-grounded procedure in the case of word spotting and retrieval for various reasons. First, as said before, we have powerful whole-word recognizers but they are complex and expensive; second, the recall of current text detection methods may limit their accuracy; and third, sliding window cannot be considered an efficient option, mainly because words do not have a constrained aspect ratio.
In this chapter we explore the applicability of Object Proposals techniques to scene text understanding, aiming to produce a set of word proposals with high recall in an efficient way. We propose a simple text-specific selective search strategy, where initial regions in the image are grouped by agglomerative clustering in a hierarchy where each node defines a possible word hypothesis. Moreover, we evaluate different state of the art Object Proposals methods in their ability to detect text words in natural scenes. We compare the proposals obtained with well known class-independent methods with our own method, demonstrating that our algorithm is superior in its ability to produce good quality word proposals in an efficient way.

6.1 Text Specific Selective Search

Here we make use of the same grouping framework developed in the preceding chapters of this thesis, where a set of complementary grouping cues are used in parallel to generate hierarchies in which each node corresponds to a text-group hypothesis. Our algorithm is divided into three main steps: segmentation, creation of hypotheses through bottom-up clustering, and ranking.

6.1.1 Creation of hypotheses

The grouping process starts with a set of MSER regions and the SLC algorithm builds a hierarchy of grouping proposals using a certain

distance metric d(r_a, r_b). Similarly to [128] we assume that there is no single grouping strategy that is guaranteed to work well in all cases. Thus, our basic agglomerative process is extended with several diversification strategies in order to ensure the detection of the highest number of text regions in any case. First, we extract regions separately from different color channels (i.e. Red, Green, Blue, and Gray) and spatial pyramid levels. Second, on each of the obtained segmentations we apply SLC using different complementary distance

metrics:

$$d^{(i)}(r_a, r_b) = \left( f^i(r_a) - f^i(r_b) \right)^2 + (x_a - x_b)^2 + (y_a - y_b)^2 \qquad (6.1)$$

where (x_a, y_a) and (x_b, y_b) are the centers of regions r_a and r_b, and f^i(r) is a feature aimed at measuring the similarity of two regions. Here, as in the previous Chapter, we make use of the following features: mean gray value of the region, mean gray value in the immediate outer boundary of the region, region's major axis, mean stroke width, and mean of the gradient magnitude at the region's border.
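As an illustration, one hierarchy per similarity cue can be built with an off-the-shelf single-linkage implementation. The sketch below uses SciPy and assumes a `regions` array whose columns are the normalized center coordinates followed by the five region features listed above; it is a simplified stand-in for the incremental clustering actually used in the pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def build_hierarchies(regions: np.ndarray):
    """Build one single-linkage dendrogram per similarity cue (eq. 6.1).

    `regions` has shape (n, 7): columns 0-1 are the (x, y) centers and
    columns 2-6 are the five region features, all rescaled to [0, 1].
    """
    centers = regions[:, :2]
    dendrograms = []
    for i in range(2, regions.shape[1]):          # one cue at a time
        feats = np.column_stack([centers, regions[:, i:i + 1]])
        # Squared Euclidean distance over (x, y, f_i) matches eq. 6.1.
        dendrograms.append(linkage(feats, method="single",
                                   metric="sqeuclidean"))
    return dendrograms
```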

6.1.2 Ranking

Once we have created our similarity hierarchies, each one providing a set of text group hypotheses, we need an efficient way to sort them in order to provide a ranked list of proposals that prioritizes the best hypotheses. In the experimental section we explore the use of the following rankings:

Pseudo-random ranking

We make use of the same ranking strategy proposed by Uijlings et al. in [128]. In particular, each hypothesis is assigned an increasing integer value, starting from 1 for the root node of a hierarchy and subsequently incrementing for the rest of the nodes up to the leaves of the tree. Then each of these values is multiplied by a random number between zero and one, thus providing a ranking that is randomly produced but prioritizes larger regions. As in [128], the ranking process is performed before removing duplicate hypotheses. This way, if a particular grouping has been found several times within the different hierarchies, indicating a more consistent hypothesis under different similarity cues, this group has a higher probability of being ranked at the top of the list.
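A minimal sketch of this pseudo-random ranking, under the assumption that each hierarchy is available as a list of nodes ordered from root to leaves and that duplicate removal happens afterwards:

```python
import random

def pseudo_random_ranking(hierarchies):
    """Rank hypotheses as in Uijlings et al.: node index (1 = root) times a
    random factor in (0, 1), so larger groups tend to come first."""
    scored = []
    for nodes in hierarchies:                  # nodes ordered root -> leaves
        for depth_index, hypothesis in enumerate(nodes, start=1):
            scored.append((depth_index * random.random(), hypothesis))
    scored.sort(key=lambda t: t[0])            # smaller score = higher rank
    # Duplicate removal (not shown) happens after ranking, so a group found
    # in several hierarchies gets several chances of drawing a low score.
    return [hyp for _, hyp in scored]
```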

Cluster meaningfulness ranking

Instead of assigning an increasing value prioritizing larger groups, we propose here to use the cluster quality measure that we detailed in section 5.1.2 of the previous Chapter. Intuitively this value is going to be very small for groups comprising a set of very similar regions that are densely concentrated in small volumes of the feature space. This measure is thus well suited to measuring the text-likeness of groups, because such a strong similarity property is expected to be found in text groups. However, the ranking provided by calculating the NFA (equation 5.6) of each node in our hierarchies is going to prioritize large text groups, e.g. paragraphs, rather than individual words, and thus we multiply the ranking provided by equation 5.6 by a random number between zero and one as done before, providing a pseudo-random ranking where more meaningful hypotheses are prioritized.

Text classifier confidence

Finally, we propose the use of a weak classifier to rank text grouping candidates. The basic idea here is to train a classifier to discriminate between text and non-text hypotheses and to produce a confidence value that can be used to rank group hypotheses. Since the classifier is going to be evaluated on every node of our hierarchies, we aim to use a fast classifier and features with low computational cost. We train a Real AdaBoost classifier with decision stumps using as features the coefficients of variation of the individual region features f^i described in section 6.1.1: F^i(G) = σ^i/μ^i, where μ^i and σ^i are respectively the mean and standard deviation of the region features f^i in a particular group G, {f^i(r) : r ∈ G}. Intuitively the value of F^i is smaller for text hypotheses than for non-text groups, and thus the classifier should be able to generate a ranking that prioritizes the best hypotheses. Notice that all F^i group features can be computed efficiently in an incremental way along the SLC hierarchies, and that all f^i region features have been previously computed.
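The group-level features F^i are just coefficients of variation over the already-computed region features, so they can be obtained from the same incremental sufficient statistics sketched in Chapter 5. A short hedged illustration follows; the `classifier` argument is a placeholder for any scikit-learn-style predictor exposing decision_function, not the exact model used in the thesis.

```python
import numpy as np

def group_features(region_feats: np.ndarray) -> np.ndarray:
    """Coefficients of variation F^i = sigma_i / mu_i of the region features
    f^i of one group hypothesis (rows = regions, columns = features)."""
    mu = region_feats.mean(axis=0)
    sigma = region_feats.std(axis=0)
    return np.divide(sigma, mu, out=np.zeros_like(sigma), where=mu != 0)

def classifier_ranking(hypotheses, classifier):
    """Rank hypotheses by the confidence of a weak text/non-text classifier.

    `hypotheses` is a list of (group, region_feats) pairs.
    """
    feats = np.vstack([group_features(rf) for _, rf in hypotheses])
    scores = classifier.decision_function(feats)      # higher = more text-like
    order = np.argsort(-scores)
    return [hypotheses[i][0] for i in order]
```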

6.2 Experiments and Results

In our experiments we make use of two standard scene text datasets: the ICDAR Robust Reading Competition dataset (ICDAR2013) [60] and the Street View Text dataset (SVT) [136]. In both cases we provide results for their test sets, consisting of 233 and 249 images respectively, using the original word-level ground-truth annotations.
The evaluation framework used is the standard one for Object Proposals methods [49] and is based on the analysis of the detection recall achieved by a given method under certain conditions. Recall is calculated as the ratio of GT bounding boxes that have been predicted among the object proposals with an intersection over union (IoU) larger than a given threshold. This way, we evaluate the recall as a function of the number of proposals, and the quality of the first N ranked proposals by calculating their recall at different IoU thresholds.
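The recall-versus-proposals evaluation reduces to counting which ground-truth boxes are covered by at least one of the top-N proposals at a given IoU threshold. A minimal sketch of that computation (boxes as (x1, y1, x2, y2) tuples; the helper names are my own):

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter > 0 else 0.0

def detection_recall(gt_boxes, proposals, n_proposals, iou_thr=0.5):
    """Fraction of GT words covered by any of the first n_proposals boxes."""
    top = proposals[:n_proposals]
    covered = sum(1 for gt in gt_boxes
                  if any(iou(gt, p) >= iou_thr for p in top))
    return covered / float(len(gt_boxes)) if gt_boxes else 1.0
```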

6.2.1 Evaluation of diversification strategies

First, we analyse the performance of different variants of our method by evaluating the combination of diversification strategies presented in Section 6.1. Table 6.1 shows the average number of proposals per image, recall rates, and time performance obtained with some of the possible combinations. We select two of them, which we will call "FAST" and "FULL", for further evaluation, as they represent a trade-off between recall and time complexity.

6.2.2 Evaluation of proposals’ rankings

Figure 6.2 shows the performance of our "FAST" pipeline at 0.5 IoU using the various ranking strategies discussed in Section 6.1.

Table 6.1: Max recall at different IoU thresholds and running time comparison of different diversification strategies in the ICDAR2013 dataset. We indicate the use of individual color channels: (R), (G), (B), and (I); spatial pyramid levels: (P2); and similarity cues: (D) Diameter, (F) Foreground intensity, (B) Background intensity, (G) Gradient, and (S) Stroke width.

Method           # prop.  0.5 IoU  0.7 IoU  0.9 IoU  time(s)
I+D              536      0.84     0.65     0.41     0.26
I+DF             993      0.91     0.78     0.53     0.29
I+DFBGS          1323     0.95     0.86     0.60     0.45
RGB+DF           3359     0.96     0.91     0.69     0.73
RGBI+DFBGS       5659     0.98     0.94     0.75     1.72
P2+RGBI+DFBGS    8164     0.98     0.96     0.79     2.18

The area under the curve (AUC) is 0.39 for the NFA ranking, 0.43 for both the PR and PR-NFA rankings, and a slightly better 0.46 for the ranking provided by the weak classifier. Since the overhead of using the classifier is negligible, we use this ranking strategy for the rest of the experiments.

6.2.3 Comparison with state of the art

In the following we further evaluate the performance of our method on the ICDAR2013 and SVT datasets, and compare it with the following state of the art Object Proposals methods: BING [15], EdgeBoxes [151], Randomized Prim's [79] (RP), and Geodesic Object Proposals [65] (GOP).
In our experiments we use the publicly available code of these methods with the following setup. For BING we use the default parameters: base of 2 for the window size quantization, feature window size of 8 × 8, and non maximal suppression (NMS) size of 2. For EdgeBoxes we also use the default parameters: step size of the sliding window of 0.65, and NMS threshold of 0.75; but we change the maximum number of boxes to 10^6. GOP is configured with Multi-Scale Structured Forests for the segmentation, 150 seeds heuristically placed, and 8 segmentations per seed; in this case we tried other configurations in order to increase the number and quality of the proposals, without success. For RP we use the default configuration with 4 color spaces (HSV, Lab, Opponent, RG) because it provided much better results than sampling from a single graph, while being 4 times slower.

Figure 6.2: Performance of our "FAST" pipeline at 0.5 IoU using different ranking strategies: (PR) Pseudo-random ranking, (NFA) Meaningfulness ranking, (PR-NFA) Randomized NFA ranking, (Prob) the ranking provided by the weak classifier.


Figure 6.3: A comparison of various state-of-the-art object proposals methods in the ICDAR2013 (top) and SVT (bottom) datasets. (left and center) Detection rate versus number of proposals for various intersection over union thresholds. (right) Detection rate versus intersection over union threshold for various fixed numbers of proposals.

Tables 6.2 and 6.3 show the performance comparison of all the evaluated methods on the ICDAR2013 and SVT datasets respectively. A more detailed comparison is provided in Figure 6.3. All time measurements in Tables 6.2 and 6.3 have been calculated by executing the code in a single thread on the same i7 CPU for a fair comparison, while most of the methods allow parallelization. For instance, the multi-threaded version of our method is able to achieve execution times of 0.31 and 0.71 seconds respectively for the "FAST" and "FULL" variants on the ICDAR2013 dataset.

Table 6.2: Average number of proposals, recall at different IoU thresholds, and running time comparison with Object Proposals state of the art algorithms in the ICDAR2013 dataset.

Method            # prop.  0.5 IoU  0.7 IoU  0.9 IoU  time(s)
BING [15]         2716     0.63     0.08     0.00     1.21
EdgeBoxes [151]   9554     0.85     0.53     0.08     2.24
RP [79]           3393     0.77     0.45     0.08     12.80
GOP [65]          855      0.45     0.18     0.08     4.76
Ours-FAST         3359     0.96     0.91     0.69     0.79
Ours-FULL         8164     0.98     0.96     0.79     2.25

As can be seen in Table 6.2 and Figure 6.3, our method outperforms all the evaluated algorithms in terms of detection recall on the ICDAR2013 dataset. Moreover, it is important to notice that the detection rates of all the generic Object Proposals methods heavily deteriorate for large IoU thresholds, while our text specific method provides much more stable rates, indicating a better coverage of text objects; see the

high AUC difference in Figure 6.3 bottom plots.

Table 6.3: Average number of proposals, recall at different IoU thresholds, and running time comparison with Object Proposals state of the art algorithms in the SVT dataset.

Method            # prop.  0.5 IoU  0.7 IoU  0.9 IoU  time(s)
BING [15]         2987     0.64     0.09     0.00     0.81
EdgeBoxes [151]   15319    0.94     0.63     0.04     2.71
RP [79]           5620     0.02     0.00     0.00     10.51
GOP [65]          778      0.53     0.19     0.03     4.31
Ours-FAST         3791     0.90     0.46     0.03     0.66
Ours-FULL         10365    0.95     0.61     0.06     2.22

The results on the SVT dataset in Table 6.3 and Figure 6.3 exhibit a radically distinct scenario. While our "FULL" pipeline is slightly better than EdgeBoxes at 0.5 IoU, the latter is able to outperform both of our pipelines at 0.7 and our "FAST" variant at 0.5. Moreover, on this dataset our method does not provide the same stability properties shown before. This can be explained by the fact that the two datasets are very different in nature: SVT contains more challenging text, with lower quality and often under bad illumination conditions, while in ICDAR2013 text is mostly well focused and flatly illuminated. Still, the AUC in most of the plots in Figure 6.3 shows a fairly competitive performance for our method.

6.3 Conclusion

In this chapter we have evaluated the performance of generic Object Proposals algorithms on the task of detecting text words in natural scenes. We have presented a text specific method that is able to outperform generic methods in many cases, and to show competitive numbers in others. Moreover, the proposed algorithm is parameter free and fits well the multi-script and arbitrarily oriented text scenario.
An interesting observation from our experiments is that while in class-independent object detection generic methods suffice with around a thousand proposals to achieve high detection recall, in the case of text we still need around 10000 in order to achieve similar rates, indicating that there is large room for improvement in text specific Object Proposals methods.

Chapter 7 Efficient tracking of text groupings

Text detection in video sequences differs from still images in many aspects, and it is not a straightforward assumption that a method devised and trained on static text images would be applicable to video sequences. A key characteristic of video sequences is the temporal redundancy of text, which calls for tracking-based processes taking advantage of past history to increase the stability and quality of detections.
Keeping constant track of a text object throughout all the frames where it is visible is desirable, for example, to ensure a unique response of the system (e.g. translation, or text to speech conversion) for each distinct text, and also to be able to enhance the text regions [72], or to select the best frames in which they appear, before doing the full text recognition. Moreover, one can take advantage of the tracking process in order to obtain a real-time detection system, under the assumption that the scene does not change much from frame to frame. This yields an extra speed-up that can be exploited in see-through applications (e.g. Augmented Reality translation [35, 109] and augmented documents [126]) or street-view navigation [87].
In this Chapter we develop a real-time scene text detection algorithm that combines the text detection method presented in Chapter 5 with an MSER-based tracking module [24]. As both detection and tracking modules are based on MSER regions they can be integrated symbiotically, improving robustness and providing a speed boost to the system. Real-time text detection is simulated by propagating in time the previously detected text regions, until a new text detection takes place.
The novelty of the proposed method lies in the ability to effectively track text regions' groupings (establishing a pixel level segmentation of constituent text parts in every frame), not merely their bounding boxes as usually done in state-of-the-art tracking-by-detection algorithms [140], while providing a considerable speed-up in comparison to performing a full frame text detection on each frame. Moreover, the proposed method can deal with rotation, translation, scaling, and perspective deformations of the detected text.

7.1 Text Detection and Tracking Method

In this Section we explain the proposed text tracking algorithm, and the way it is integrated with our text detection method. Figure 7.1 shows a timeline representation of the interactions between the detection and tracking modules. The main idea behind our proposal is that even with a slow text detection method it is possible to achieve real-time performance by running it periodically, while in parallel a fast tracking module takes care of propagating previous detections for the frames that are not processed by the detection module. The overall speed of the process really only depends on the tracking module, therefore the system can be real-time. However, the speed of the text detector is nevertheless important, as it is the only source of new information to the system, and long times between consecutive detections could substantially deteriorate the overall performance. A sketch of this asynchronous interplay is given below.
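The following is a minimal sketch of that detection/tracking interplay; the `detect_text`, `track` and `merge` functions are hypothetical stand-ins for the modules described in this Chapter. The slow detector runs in a background thread while the tracker propagates the last known text objects on every frame.

```python
import threading

def run_pipeline(camera, detect_text, track, merge):
    """Asynchronous combination of a slow detector and a fast tracker."""
    tracked, pending = [], {}

    def detector_worker(frame):
        pending["result"] = detect_text(frame)     # slow, runs in background

    worker = None
    for frame in camera:                           # camera yields video frames
        if worker is None or not worker.is_alive():
            if "result" in pending:                # a full detection finished
                tracked = merge(tracked, pending.pop("result"))
            worker = threading.Thread(target=detector_worker, args=(frame,))
            worker.start()
        tracked = track(tracked, frame)            # fast, every frame
        yield tracked                              # per-frame output
```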

Figure 7.1: Timeline view of the text detection and tracking modules' combination. Rectangles represent lasting processes in time and diamonds are the output of the system. The detection module creates the first estimates of text object positions, which are propagated to subsequent frames by the tracking module. Each time the detection module provides new results, a merging mechanism combines the detected and tracked objects in a unique output (maroon diamonds).

7.1.1 Tracking Module

Our tracking module is built upon the framework proposed by Donoser and Bischof in [24], where tracking of single MSERs in successive frames is posed as a correspondence problem within a window surrounding their previous location. The component tree of this small window can be used as an efficient data structure to solve the correspondence problem: it contains all the information needed to search for the best matching region, and can obviously be computed much faster than for the whole image.
Figure 7.2 shows how finding the corresponding MSER between two consecutive frames can be done efficiently by constraining the search in two ways: searching only in the component tree of a small window (Figure 7.2b), and looking only in a subset of the tree levels (Figure 7.2c). These two constraints define two parameters for the tracking method: the size of the window (with respect to the query region size), and the interval of levels to look at (with respect to the level of the query region). Notice that the search process is done among all the regions in the component tree and would be able to find correct matches even when the target region has lost the stability criterion (e.g. appears blurred) in the consecutive frame.
MSER tracking has been used effectively for license plate [23] and hand [25] tracking using a weighted vector of simple features: mean gray value, region size, center of mass, width and height of the


Figure 7.2: Tracking of text grouping elements posed as search in the component tree of a small region of interest. Given a text regions' grouping detection (a) in frame t0, the location of its constituent elements in the next frame t1 is constrained to a small region of interest around their initial locations at t0 (b). The structure of the component tree of this small region (c) can be used to search for the best matching component, and again this search can be constrained, e.g. by looking only at certain levels of the tree that correspond to components with an intensity (and/or size) similar to the query.


bounding box, and stability. A further extension in [26] makes use of novel shape descriptors in order to increase the robustness of the tracking by considering shape similarity.
The text tracking module proposed here differs from the work of Donoser and Bischof in two aspects, by considering the specificities of text regions' groupings: 1) we use invariant moments as features to find correspondences, as a trade-off between the fast computation of the simple features used in [24] and the robustness of the shape descriptors in [26]; 2) here we consider groups of regions (text lines) instead of single MSER regions as done in [23] and [25], and thus we can detect mismatches using RANSAC when they do not fit an underlying line model.

MSER-tracking with incrementally computable invariant moments

The inclusion relation between regions in the component tree can be exploited to extract incrementally computable descriptors without any extra computational cost, as proposed in [81][98]. Geometric moments [50][33] can be calculated in this way and thus result in an invariant descriptor that can be used efficiently by the MSER-tracking algorithm for matching. At each grow step of the MSER algorithm the raw moments up to order three are updated with constant complexity. Then, when required in order to find correspondences, the seven Hu moments [50] and the four Affine Invariant Moments [33] can be derived from those raw moments, again with constant complexity.
Invariant moments have generally robust performance for rigid objects with simple contour shapes and simple transformations such as scaling, rotation, and affine transformations [149]. This is the case for text characters [34], assuming that they do not change much in shape between successive frames. We consider this compact descriptor to be a trade-off between the simple feature based analysis in [24] and the integral shape descriptors used in [26], as it provides a richer representation than the former while being much less computationally expensive than the latter.
Moreover, in cases where invariant moments are prone to fail, e.g. for particularly weak shaped characters (e.g. the letter "I"), partial occlusions, or motion blur mismatches, we can take advantage of two particularities in our scenario: first, we have quite a constrained search along the component tree, and second, we can exploit the group-level (text line) coherence in order to detect and reject mismatching correspondences via RANSAC.
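To make the incremental-moment idea concrete, the sketch below accumulates raw geometric moments pixel by pixel (or by merging child components) and derives the first two Hu invariants. It is a simplified illustration of mine, restricted to second-order moments, not the thesis implementation, which updates the moments inside the MSER grow step.

```python
class RawMoments:
    """Raw moments m_pq (up to order 2 here) of an image region."""
    def __init__(self):
        self.m00 = self.m10 = self.m01 = 0.0
        self.m20 = self.m11 = self.m02 = 0.0

    def add_pixel(self, x, y):
        self.m00 += 1;      self.m10 += x;      self.m01 += y
        self.m20 += x * x;  self.m11 += x * y;  self.m02 += y * y

    def merge(self, other):
        # Raw moments of a parent component are the sums of its children's.
        for k in list(vars(self)):
            setattr(self, k, getattr(self, k) + getattr(other, k))

    def hu12(self):
        """First two Hu invariants from normalized central moments."""
        xc, yc = self.m10 / self.m00, self.m01 / self.m00
        mu20 = self.m20 - xc * self.m10
        mu02 = self.m02 - yc * self.m01
        mu11 = self.m11 - xc * self.m01
        norm = self.m00 ** 2                 # eta_pq = mu_pq / m00^((p+q)/2+1)
        eta20, eta02, eta11 = mu20 / norm, mu02 / norm, mu11 / norm
        phi1 = eta20 + eta02
        phi2 = (eta20 - eta02) ** 2 + 4 * eta11 ** 2
        return phi1, phi2
```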

Mismatch detection with RANSAC

Fitting a simple linear regression model to the individually tracked MSERs using RANSAC allows us to improve the whole text-line tracking, because false correspondences do not affect the tracking process: they are eliminated from the RANSAC consensus set and thus not propagated to subsequent frames. This way the tracking

module is able to robustly track text lines even when some of their characters are not correctly tracked. This outlier detection makes the method more robust in case of partial occlusion of the tracked text line.
Notice that RANSAC is used here as an outlier detection mechanism and not as a homography estimator as done in other tracking algorithms. Regions not correctly tracked are detected as outliers of a simple linear regression model and removed from the tracking system for the following frames. This process also allows the tracking to stop naturally when the number of inliers for a text line in a given frame is fewer than 3 regions.
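A hedged sketch of this outlier rejection, fitting a line to the tracked region centers with a basic RANSAC loop (my own simplified version; the iteration count and tolerance are illustrative, not the system's actual parameters):

```python
import random
import numpy as np

def ransac_line_inliers(centers, n_iter=100, tol=5.0):
    """Return indices of region centers consistent with a single line.

    `centers` is an (n, 2) array with the (x, y) centers of the tracked MSERs
    of one text line; `tol` is the point-to-line distance threshold in pixels.
    """
    pts = np.asarray(centers, dtype=float)
    n = len(pts)
    if n < 3:
        return list(range(n))
    best = []
    for _ in range(n_iter):
        i, j = random.sample(range(n), 2)
        p, q = pts[i], pts[j]
        dx, dy = q - p
        norm = np.hypot(dx, dy)
        if norm == 0:
            continue
        # Perpendicular distance of every center to the line through p and q.
        dist = np.abs(dx * (pts[:, 1] - p[1]) - dy * (pts[:, 0] - p[0])) / norm
        inliers = np.where(dist < tol)[0]
        if len(inliers) > len(best):
            best = list(inliers)
    return best   # fewer than 3 inliers stops the tracking of this line
```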

7.1.2 Merging detected regions and propagated regions

Each time the detection module provides new results from a full detection, a merging mechanism, depicted with maroon diamonds in Figure 7.1, is needed in order to identify whether the newly arrived detections correspond to objects we are already tracking or not. This matching is done with the Hungarian algorithm by optimizing the one-to-one overlap of the minimum enclosing boxes of detected and tracked text lines.
Matched text lines are updated with the newly detected MSERs, thus regenerating the tracking process with new evidence, and may also recover regions that have been lost during the tracking process (i.e. detected as outliers and removed from the system). Unmatched text lines are treated as "first time viewed" objects and start their own tracking process from their initial locations.
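The matching step can be written directly on top of an off-the-shelf Hungarian solver. The sketch below uses SciPy's linear_sum_assignment with an IoU-based cost over the enclosing boxes; the `iou` helper from Chapter 6's sketch is assumed, and the acceptance threshold is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(tracked_boxes, detected_boxes, iou, min_iou=0.3):
    """One-to-one matching between tracked and newly detected text lines.

    Returns (matches, unmatched_detections), where matches is a list of
    (tracked_idx, detected_idx) pairs with overlap above min_iou.
    """
    if not tracked_boxes or not detected_boxes:
        return [], list(range(len(detected_boxes)))
    cost = np.array([[1.0 - iou(t, d) for d in detected_boxes]
                     for t in tracked_boxes])
    rows, cols = linear_sum_assignment(cost)       # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols)
               if 1.0 - cost[r, c] >= min_iou]
    matched_dets = {c for _, c in matches}
    unmatched = [i for i in range(len(detected_boxes)) if i not in matched_dets]
    return matches, unmatched
```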

7.2 Experiments

We have evaluated our algorithm for the task of text detection and tracking on a dataset of synthetically generated video sequences. The dataset contains 10 sequences of 400 frames, with a resolution of 640 × 480 pixels, where still images from the ICDAR [60] and MSRRC [68] datasets are deformed iteratively with random rotation, translation, scale changes, and perspective transformations. Figure 7.4 shows two example sequences from the generated synthetic dataset. The main reason for the use of synthetic data in this series of experiments is that ground-truth data can be created automatically, without any labelling effort.
Figure 7.3 shows the CLEAR-MOT metrics [62] performance comparison of the proposed method against two other approaches: performing the full detection on every frame, and MSER tracking with simple features as in [24]. We can see how the proposed method outperforms the others both in tracking precision (MOTP) and accuracy (MOTA). MOTP is basically an average measure of the overlap between correct detections and ground-truth text lines over the whole video sequence. We can see how this value is lower for the MSER tracking with simpler features, indicating that the simple features

Figure 7.3: Text detection and tracking performance comparison in the synthetic video dataset.

are not enough to correctly track individual regions and thus produce less accurate bounding boxes at the text line level. The MOTA metric accounts for all the errors made by the tracker: false positives, misses, and mismatches, over all the frames in a video sequence. The lower accuracy of the full-detection approach is due to missed objects, which the MSER tracking is able to compensate for by correctly propagating the detections of previous frames. In the MSER-tracking approaches the main source of MOTA errors is false positive detections provided by the detection module, which are propagated in time. On the other hand, the difference in accuracy between the two MSER-tracking methods indicates that invariant moments are more robust than the simpler feature vector.

Figure 7.4: Still frame results from two of the synthetic image sequences (top) and two of the real scene image sequences (bottom) used for performance evaluation.

We further evaluated our method qualitatively on two sets of real scene image sequences: a set of 4 videos provided in [82], and a set of 10 videos obtained by the authors with a mobile device. The obtained results (see Figure 7.4) show that in general the system is able to detect and track the targeted text correctly, dealing with rotation, translation, scaling, and perspective deformations. There are, however, errors of missing text components in the presence of motion blur and strong illumination changes. Nevertheless, it is worth noting that in such situations the tracking module is still able to propagate some regions that would otherwise result in text missed by the detection module.

7.2.1 Time performance

In order to obtain time performance measurements and qualitative results for the task of text detection and tracking, we have implemented the proposed method using the Android development framework provided by the OpenCV library (http://opencv.org/platforms/android.html), and tested it on a tablet computer with a 1.5GHz quad-core processor.
Table 7.1 shows average time performance measurements for each of the method's modules. The average frame rate of the system is 15 fps, a value just slightly lower than that achievable from the Android camera service (without any image processing) for the concrete device used in the evaluation. Such fast performance is achieved thanks to the negligible processing time of the tracking module (40 ms on average). On the other hand, the detection module running in the background needs about 1 second on average to provide localization results, which is roughly double the time needed to do the same processing in the main thread. Such a relatively low performance affects the system with a noticeable delay when new targets appear in the scene and when recovering a missed target.

Table 7.1: Average time performance measurements.

                                  Android device          Standard PC
                                  avg. time     fps.      avg. time    fps.
Text detection module (async.)    1041.01 ms.   n.a.      53.45 ms.    n.a.
Initial merging and tracking      125.57 ms.    7.96      17.11 ms.    58.44
Text tracking module              40.01 ms.     24.99     6.91 ms.     144.71

As shown in Table 7.1, the system has a variable frame rate: it is able to achieve a really high average frame rate (near 25 fps) during the tracking process (which is most of the time), but once the asynchronous detection process finishes (every 1 second on average) the frame rate slows down to 8 fps, although just for one frame, as the new detections must be propagated and matched with the existing ones. All in all, the proposed method demonstrates real-time performance on low-resource devices with an average frame rate of 15 fps.
Time performance measurements for the tracking module would increase linearly with the number of tracked regions and the size of their search windows. In the evaluated sequences the average number of tracked regions is 21, covering around 10% of the input image size, which fits well with a realistic scene text detection scenario.

7.3 Conclusion

In this Chapter we have presented a method for detection and tracking of scene text able to work in real time even on low-resource mobile devices. Although far from being a final solution, the proposed method goes beyond the full-detection approaches in terms of time performance optimization. The combination of text detection with a tracker provides considerable stability, allowing the system to

provide predicted estimates in cases where the detection module itself is not capable of returning a valid response. The use of MSER tracking as an alternative, fast technique to provide simulated text detections for the frames that are not processed by the full frame text detector proves to be an adequate solution, providing the system with enough information to continue tracking until the text detector returns updated positions.
The main limitation of the proposed method is the tracking degradation in the presence of severe motion blur or strong illumination changes. As in all tracking systems, the longer the full frame text detector takes to provide a result, the more the uncertainty of the tracker will grow. At the same time, the response of the full frame text detector will be less reliable the more frames pass since the one being processed. Therefore, it is important that reasonably fast text detection methods are used in such a framework to ensure that tracking does not deteriorate rapidly.

Chapter 8 Scene text script identification

Script and language identification are important steps in modern OCR systems designed for multi-language environments. Since text recognition algorithms are language-dependent, detecting the script and language at hand allows selecting the correct language model to employ [130]. While script identification has been widely studied in document analysis, it remains an almost unexplored problem for scene text. In contrast to document images, scene text presents a set of specific challenges, stemming from the high variability in terms of perspective distortion, physical appearance, variable illumination, and typeface design. At the same time, scene text typically comprises a few words, contrary to the longer text passages available in document images.
Current end-to-end systems for scene text reading [7, 52, 102] assume single script and language inputs given beforehand, i.e. provided by the user, or inferred from available meta-data. The unconstrained text understanding problem for large collections of images from unknown sources (see Figure 8.1) has not been considered until very recently [117, 116, 103, 44]. In this Chapter we address the problem of script identification in natural scene images, paving the road towards true multi-lingual end-to-end scene text understanding.

Figure 8.1: Collections of images from unknown sources may contain textual information in different scripts.

Figure 8.2: (best viewed in color) Certain stroke-parts (in green) are discriminative for the identification of a particular script (left), while others (in red) can be trivially ignored because they are frequent in other classes (right).

Multi-script text exhibits high intra-class variability (words written in the same script vary a lot) and high inter-class similarity (certain scripts resemble each other). Examining text samples from different scripts, it is clear that some stroke-parts are quite discriminative, whereas others can be trivially ignored as they occur in multiple scripts. The ability to distinguish these relevant stroke-parts can be leveraged for recognizing the corresponding script. Figure 8.2 shows an example of this idea.

The use of state of the art CNN classifiers for script identification is not straightforward, as they fail to address a key characteristic of scene text instances: their extremely variable aspect ratio. As can be seen in Figure 8.3, scene text images may span from single characters to long text sentences, and thus resizing images to a fixed aspect ratio, as in the typical use of holistic CNN classifiers, will deteriorate discriminative parts of the image that are characteristic of its class. The key intuition behind the methodology proposed in this Chapter is that in order to retain the discriminative power of stroke-parts we must rely on powerful local feature representations and use them within a patch-based classifier. In other words, while holistic CNNs have superseded patch-based methods for image classification, we claim that patch-based classifiers can still be essential in tasks where image shrinkage is not feasible.

To test our intuition we build a method for script identification that combines convolutional features, extracted by sliding a window with a single layer Convolutional Neural Network (CNN) [18], with the Naive-Bayes Nearest Neighbor (NBNN) classifier [8], obtaining promising results. Afterwards, we demonstrate far superior performance by extending our previous work in two different ways: first, we use deep CNN architectures in order to learn more discriminative representations for the individual image patches; second, we propose a novel learning methodology to jointly learn the patch representations and their importance (contribution) in a global image-to-class probabilistic measure. For this, we train our CNN using an Ensemble of Conjoined Networks and a loss function that takes into account the global classification error for a group of N patches instead of looking only at a single image patch. Thus, at training time our network is presented with a group of N patches sharing the same class label and produces a single probability distribution over the classes for all of them. This way we model the goal for which the network is trained, not only to learn good local patch representations, but also to learn their relative importance in the global image classification task.

Figure 8.3: Scene text images with the largest/smallest aspect ratio available in three different datasets: MLe2e (left), SIW-13 (center), and CVSI (right).

Experiments performed over three public datasets for scene text script identification demonstrate state-of-the-art results. In particular we are able to reduce the classification error by 5 percentage points in the SIW-13 dataset.

8.1 Patch-based script identification

Our method for script identification in scene images follows a multi-stage approach. Given a text line provided by a text detection algorithm, our script identification method proceeds as follows: first, we resize the input image to a fixed height of 64 pixels, maintaining its original aspect ratio in order to preserve the appearance of stroke-parts. Second, we densely extract 32 × 32 image patches with a sliding window. And third, each image patch is fed into a single layer Convolutional Neural Network to obtain its feature representation. These steps are illustrated in Figure 8.4, which shows an end-to-end system pipeline incorporating our method (the script-agnostic text detection module is abstracted in a single step as the focus of this Chapter is on the script identification part).

This way, each input region is represented by a variable number of descriptors (one for each image patch), the number of which depends on the length of the input region. Thus, a given text line representation can be seen as a bag of image patch descriptors. However, in our method we do not make use of the Bag of visual Words model, as the quantization process severely degrades informative (rare) descriptors [8]. Instead we directly classify the text lines using the Naive Bayes Nearest Neighbor classifier.

Figure 8.4: Method deployment pipeline: Text lines provided by a text detection algorithm are resized to a fixed height, image patches are extracted with a sliding window and fed into a single layer Convolutional Neural Network (CNN). This way, each text line is represented by a variable number of patch descriptors, which are used to calculate image to class (I2C) distances and classify the input text line using the Naive Bayes Nearest Neighbor (NBNN) classifier.

8.1.1 Patch representation with Convolutional Features

Convolutional Features provide the expressive representations of image patches needed in our method. We make use of a single layer Convolutional Neural Network [18] which provides us with highly discriminative descriptors while not requiring the large amount of training resources typically needed by deeper networks. The weights of the convolutional layer can be efficiently learned using the K-means algorithm [18].

We adopt a similar design for our network as the one presented in [17]. We set the number of convolutional kernels to 256, the receptive field size to 8 × 8, and we adopt the same non-linear activation function as in [17]. After the convolutional layer we stack a spatial average pooling layer to reduce the dimensionality of our representation to 2304 (3 × 3 × 256). The number of convolutional kernels and the kernel sizes of the convolution and pooling layers have been set experimentally, by cross-validation over a number of typical values for single-layer networks.

To train the network we first resize all training images to a fixed height, while retaining the original aspect ratio. Then we extract random patches with size equal to the receptive field size, and perform contrast normalization and ZCA whitening [63]. Finally we apply the K-means algorithm to the pre-processed patches in order to learn the K = 256 convolutional kernels of the CNN. Figure 8.5 depicts a subset of the learned convolutional kernels, where their resemblance to small elementary stroke-parts can be appreciated.

Figure 8.5: Convolution kernels of our single layer network learned with k-means.

Once the network is trained, we can use it to extract convolutional feature representations of 32 × 32 image patches, after contrast normalization and ZCA whitening. A key difference of our work with [17], and in general with the typical use of CNN feature representations, is that we do not aim at representing the whole input image with a single feature vector, but instead we extract a set of convolutional features from small parts in a dense fashion. The number of features per image varies according to its aspect ratio. Notice that the typical use of a CNN, resizing the input images to a fixed aspect ratio, is not appealing in our case because it may produce a significant distortion of the discriminative parts of the image that are characteristic of its class.
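The following is a minimal NumPy sketch of this feature extraction stage. The function names (`learn_kernels`, `patch_descriptor`) are hypothetical, the k-means variant is a naive spherical one and the non-linearity is a plain rectification, so it only illustrates the overall flow: 8 × 8 kernels learned with k-means on whitened crops, convolution over a 32 × 32 patch, and 3 × 3 average pooling into a 2304-dimensional descriptor.

```python
# Sketch of the single-layer convolutional feature extractor (simplified
# with respect to [17, 18]; names and the exact non-linearity are assumptions).
import numpy as np

K, RF = 256, 8          # number of kernels and receptive field size (8x8)
PATCH = 32              # size of the densely extracted patches

def zca_whiten(X, eps=0.1):
    """Contrast-normalize and ZCA-whiten rows of X (one flattened crop per row)."""
    X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-8)
    cov = np.cov(X, rowvar=False)
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return X @ W, W

def learn_kernels(train_crops_8x8):
    """Learn K convolution kernels with a (very naive) k-means on whitened 8x8 crops."""
    X, W = zca_whiten(train_crops_8x8.reshape(len(train_crops_8x8), -1))
    D = X[np.random.choice(len(X), K, replace=False)]            # random initialization
    for _ in range(10):                                          # a few Lloyd iterations
        assign = np.argmax(X @ D.T, axis=1)                      # dot-product assignment
        for k in range(K):
            if np.any(assign == k):
                D[k] = X[assign == k].mean(axis=0)
        D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-8
    return D.reshape(K, RF, RF), W

def patch_descriptor(patch32, kernels):
    """Map one (already whitened) 32x32 patch to the 3x3x256 = 2304-d pooled feature."""
    out = PATCH - RF + 1                                         # 25x25 response maps
    resp = np.zeros((K, out, out))
    for y in range(out):
        for x in range(out):
            window = patch32[y:y + RF, x:x + RF].ravel()
            resp[:, y, x] = kernels.reshape(K, -1) @ window
    resp = np.maximum(resp, 0)                                   # simple rectification
    pooled = np.zeros((3, 3, K))                                 # 3x3 spatial average pooling
    step = out // 3
    for i in range(3):
        for j in range(3):
            pooled[i, j] = resp[:, i*step:(i+1)*step, j*step:(j+1)*step].mean(axis=(1, 2))
    return pooled.ravel()                                        # 2304-d descriptor
```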

8.1.2 Naive-Bayes Nearest Neighbor

The Naive-Bayes Nearest Neighbor (NBNN) classifier [8] is a natural choice in our pipeline because it computes direct Image-to-Class (I2C) distances without any intermediate descriptor quantization. Thus, there is no loss in the discriminative power of the image patch representations. Moreover, having classes with large diversity encourages the use of I2C distances instead of measuring Image-to-Image similarities. All the feature representations of image patches extracted from the training set images provide the templates that populate the NBNN search space.

In NBNN the I2C distance $d_{I2C}(I, C)$ is computed as:

$$d_{I2C}(I, C) = \sum_{i=1}^{n} \lVert d_i - NN_C(d_i) \rVert^2 \qquad (8.1)$$

where $d_i$ is the i-th descriptor of the query image I, and $NN_C(d_i)$ is the Nearest Neighbor of $d_i$ in class C. Then NBNN classifies the query image to the class $\hat{C}$ with the lowest I2C distance, i.e. $\hat{C} = \arg\min_C d_{I2C}(I, C)$.

Figure 8.4 shows how the computation of I2C distances in our pipeline reduces to N × n Nearest Neighbor searches, where N is the number of classes and n is the number of descriptors in the query image. To efficiently search for $NN_C(d_i)$ we make use of the Fast Approximate Nearest Neighbor kd-tree algorithm described in [92].
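A small illustrative sketch of NBNN classification with I2C distances (Equation 8.1) is given below. It uses brute-force nearest neighbour search for clarity, whereas the actual pipeline relies on the approximate kd-tree search of [92]; all names are hypothetical.

```python
# NBNN classification sketch: one I2C distance per class, arg min over classes.
import numpy as np

def i2c_distance(query_descriptors, class_templates):
    """Sum of squared distances from each query descriptor to its NN in the class."""
    d = 0.0
    for di in query_descriptors:                       # one descriptor per image patch
        diffs = class_templates - di                   # (num_templates, dim)
        d += np.min(np.sum(diffs * diffs, axis=1))     # squared distance to the NN
    return d

def nbnn_classify(query_descriptors, templates_per_class):
    """templates_per_class: dict mapping class label -> (num_templates, dim) array."""
    distances = {c: i2c_distance(query_descriptors, T)
                 for c, T in templates_per_class.items()}
    return min(distances, key=distances.get)           # class with the lowest I2C distance
```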

8.1.3 Weighting per class image patch templates by their importance

When measuring the I2C distance $d_{I2C}(I, C)$ it is possible to use a weighted distance function that weights each template in the training dataset according to its discriminative power. The weighted I2C distance is then computed as:

$$d_{I2C}(I, C, w) = \sum_{i=1}^{n} \left(1 - w_{NN_C(d_i)}\right) \lVert d_i - NN_C(d_i) \rVert^2 \qquad (8.2)$$

where $w_{NN_C(d_i)}$ is the weight of the Nearest Neighbor of $d_i$ in class C. The weight assigned to each template reflects its ability to discriminate against the class that it can discriminate best. We learn the weights associated to each template as follows. First, for each template we search for the maximum distance to any of its Nearest Neighbors in all classes except its own, then we normalize these values in the range [0, 1] by dividing by the largest distance encountered over all templates. This way, templates that are important in discriminating one class against, at least, one other class have a lower contribution to the I2C distance when they are matched as the $NN_C$ of one of the query image's parts.
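A sketch of this weighting scheme and of the weighted I2C distance of Equation 8.2 follows (hypothetical names; the brute-force searches are only for illustration). Each training template receives a weight in [0, 1] proportional to its largest nearest-neighbour distance to any class other than its own, and (1 − w) then scales its contribution when it is matched.

```python
# Template weight learning and weighted I2C distance (Eq. 8.2), illustrative only.
import numpy as np

def learn_template_weights(templates_per_class):
    """Return, per class, one weight per template, normalized over all templates."""
    raw = {c: np.zeros(len(T)) for c, T in templates_per_class.items()}
    for c, T in templates_per_class.items():
        for i, t in enumerate(T):
            best = 0.0
            for c2, T2 in templates_per_class.items():
                if c2 == c:
                    continue                                   # skip the template's own class
                d = np.min(np.sum((T2 - t) ** 2, axis=1))      # NN distance within class c2
                best = max(best, d)
            raw[c][i] = best
    largest = max(w.max() for w in raw.values()) + 1e-8
    return {c: w / largest for c, w in raw.items()}            # normalize to [0, 1]

def weighted_i2c(query_descriptors, class_templates, class_weights):
    """Weighted I2C distance of Eq. 8.2 for a single class."""
    d = 0.0
    for di in query_descriptors:
        dists = np.sum((class_templates - di) ** 2, axis=1)
        j = np.argmin(dists)                                   # index of the NN template
        d += (1.0 - class_weights[j]) * dists[j]
    return d
```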

8.2 Ensembles of conjoined deep networks

In this section we further develop our patch-based method by extending it in two different ways: we use deep CNN architectures for the individual local descriptors, and we propose a novel learning methodology to jointly learn the patch representations and their importance (contribution) in a global image-to-class probabilistic measure.


Figure 8.6: The original scene text images (a) are converted to greyscale and resized to a fixed height (b) in order to extract small local patches with a dense sampling strategy (c).

8.2.1 Convolutional Neural Network for stroke-parts classification

Given an input scene text image (i.e. a pre-segmented word or text line) we first resize it to a fixed height of 40 pixels, retaining its original aspect ratio. Then, we densely extract patches at two different scales, 32 × 32 and 40 × 40, by sliding a window with a step of 8 pixels. The choice of these two particular scales is justified as follows: the 40 × 40 patch, covering the full height of the resized image, is a natural choice in our system because it provides the largest squared region we can crop; the 32 × 32 patches are conceived for better scale invariance of the CNN model, similarly to the random crops typically used for data augmentation in CNN-based image classification [67]. Figure 8.6 shows the patches extracted from a given example image. This way we build a large dataset of image patches that take the same label as the image they were extracted from. With this dataset of patches we train a CNN classifier for the task of individual image patch classification.

We use a Deep Convolutional Neural Network to build the expressive image patch representations needed in our method. For the design of our network we start from the CNN architecture proposed in [117], as it is known to work well for script identification. We then iteratively do random search to optimize the following CNN hyper-parameters: number of convolutional and fully connected layers, number of filters per layer, kernel sizes, and feature map normalization schemes. The CNN architecture providing the best performance in our experiments is shown in Figure 8.7. Our CNN consists of three convolutional+pooling stages followed by an extra convolution and three fully connected layers. Details about the specific configuration and parameters are given in Section 8.2.3.

Figure 8.7: Network architecture of the CNN trained to classify individual image patches. The network has three convolutional+pooling stages (conv1-pool1, conv2-pool2, conv3-pool3) followed by an extra convolution (conv4) and three fully connected layers (fc5, fc6, fc7).

At testing time, given a query scene text image the trained CNN model is applied to image patches following the same sampling strategy described before. Then, the individual CNN responses for each image patch can be fed into the global classification rule in order to make a single labeling decision for the query image.
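A minimal sketch of the dense two-scale patch sampling described above is shown below (hypothetical names; the input is assumed to be a greyscale NumPy array already resized to a height of 40 pixels while keeping its aspect ratio).

```python
# Dense two-scale sliding-window patch extraction (32x32 and 40x40, step 8).
import numpy as np

def extract_patches(image, sizes=(32, 40), step=8):
    """Return a list of square patches sampled with a sliding window at two scales."""
    h, w = image.shape
    patches = []
    for s in sizes:
        for y in range(0, h - s + 1, step):
            for x in range(0, w - s + 1, step):
                patches.append(image[y:y + s, x:x + s])
    return patches

# Every patch inherits the script label of the image it was cropped from, which
# yields the large patch-level training set used to train the CNN classifier.
```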

8.2.2 Training with an Ensemble of Conjoined Networks

Since the output of the CNN for an individual image patch is a probability distribution over class labels, a simple global decision rule would be just to average the responses of the CNN for all patches in a given query image:

$$y^{(I)} = \frac{1}{n_I} \sum_{i=1}^{n_I} \mathrm{CNN}(x_i) \qquad (8.3)$$

where an image I takes the label with the highest probability in the averaged softmax responses $y^{(I)}$ of its $n_I$ individual patches $\{x_1, \ldots, x_{n_I}\}$.

The problem with this global classification rule is that the CNN weights have been trained to solve a problem (individual patch classification) that is different from the final goal (i.e. classifying the whole query image). Besides, it is based on a simplistic voting strategy in which all patches are assumed to weight equally, i.e. no patches are more or less discriminative than others. To overcome this we propose the use of an Ensemble of Conjoined Nets in order to train the CNN for a task that more closely resembles the final classification goal.

An Ensemble of Conjoined Nets (ECN), depicted in Figure 8.8, consists of a set of identical networks that are joined at their outputs in order to provide a unique classification response. At training time the ECN is presented with a set of N image patches extracted from the same image, thus sharing the same label, and produces a single output for all of them. Thus, to train an ECN we must build a new training dataset where each sample consists of a set of N patches with the same label (extracted from the same image). ECNs take inspiration from Siamese Networks [10] but, instead of trying to learn a metric space with a distance-based loss function, the individual networks in the ECN are joined at their last fully connected layer (fc7 in our case), which has the same number of neurons as the number of classes, with a simple element-wise sum operation, and thus we can use the standard cross-entropy classification loss.

Figure 8.8: An Ensemble of Conjoined Nets consists of a set of identical networks that are joined at their outputs (with an element-wise sum) in order to provide a unique classification response.

This way, the cross-entropy classification loss function of the ECN can be written in terms of the N individual patch responses as follows:

$$E = \frac{-1}{M} \sum_{m=1}^{M} \log\left(\hat{p}_{m,l_m}\right), \qquad \hat{p}_{m,k} = \exp\left(\sum_{n=1}^{N} x_{mnk}\right) \Bigg/ \sum_{k'=1}^{K} \exp\left(\sum_{n=1}^{N} x_{mnk'}\right) \qquad (8.4)$$

where M is the number of input samples in a mini-batch, $\hat{p}_m$ is the probability distribution over classes provided by the softmax function, $l_m$ is the label of the m-th sample, N is the number of conjoined networks in the ensemble, K is the number of classes, and $x_{mnk} \in [-\infty, +\infty]$ indicates the response (score) of the k-th neuron in the n-th network for the m-th sample.

As can be appreciated in Equation 8.4, in an ECN network a single input patch contributes to the backpropagation error in terms of a global goal function for which it is not the only patch responsible. For example, even when a single patch is correctly scored in the last fully connected layer it may be penalized, and induced to produce a larger activation, if the other patches in its same sample contribute to a wrong classification at the ensemble output.

At test time, the CNN model trained in this way is applied to all image patches in the query image and the global classification rule is defined as:

$$y^{(I)} = \sum_{i=1}^{n_I} \mathrm{CNN}_{fc7}(x_i) \qquad (8.5)$$

where an image I takes the label with the highest score in the sum $y^{(I)}$ of the fc7 layer responses of its $n_I$ individual patches $\{x_1, \ldots, x_{n_I}\}$. This is the same as in Equation 8.3 but using the fc7 layer responses instead of the output softmax responses of the CNN.

Notice that the task for which the ECN network has been trained is still not exactly the same as the one defined by this global classification rule, as the number of patches $n_I$ is variable for each image and usually different from the number of conjoined networks N. However, it certainly resembles the true final classification goal more closely. The number of conjoined networks N is a hyper-parameter of the method that is largely dependent on the task to be solved and is discussed in the experimental section.
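The following NumPy sketch illustrates the ECN loss of Equation 8.4 and the two global classification rules of Equations 8.3 and 8.5 (hypothetical function names; in practice the loss is implemented with standard Caffe layers and back-propagated through the conjoined networks).

```python
# Forward computation of the ECN cross-entropy loss and the global rules.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ecn_loss(scores, labels):
    """scores: (M, N, K) fc7 outputs for M samples of N patches over K classes."""
    summed = scores.sum(axis=1)                        # element-wise sum over the N nets
    p = softmax(summed)                                # (M, K) ensemble probabilities
    M = scores.shape[0]
    return -np.mean(np.log(p[np.arange(M), labels] + 1e-12))

def classify_avg_softmax(patch_scores):
    """Eq. 8.3: average the per-patch softmax responses; patch_scores is (n_I, K)."""
    return int(np.argmax(softmax(patch_scores).mean(axis=0)))

def classify_fc7_sum(patch_scores):
    """Eq. 8.5: sum the per-patch fc7 scores and take the arg max."""
    return int(np.argmax(patch_scores.sum(axis=0)))
```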

8.2.3 Implementation details

In this section we detail the architectures of the network models used in this Chapter, as well as the different hyper-parameter setups that can be used to reproduce the results provided in the following sections. In all our experiments we have used the open source Caffe [54] framework for deep learning running on commodity GPUs.

The basic CNN model for individual image patch classification described in section 8.2.1 and Figure 8.7 has the following per layer configuration:

• Input layer: single channel 32 × 32 image patch.

• conv1 layer: 96 filters with size 5 × 5. Stride=1, pad=0.

• pool1 layer: kernel size=3, stride=2, pad=1.

• conv2 layer: 256 filters with size 3 × 3. Stride=1, pad=0.

• pool2 layer: kernel size=3, stride=2, pad=1.

• conv3 layer: 384 filters with size 3 × 3. Stride=1, pad=0.

• pool3 layer: kernel size=3, stride=2, pad=1.

• conv4 layer: 512 filters with size 1 × 1. Stride=1, pad=0.

• fc5 layer: 4096 neurons.

• fc6 layer: 1024 neurons.

• fc7 layer: N neurons, where N is the number of classes.

• SoftMax layer: Outputs a probability distribution over class labels.

The total number of parameters of the network is ≈ 24M for the N = 13 case of the SIW-13 dataset. All convolutional and fully connected layers use Rectified Linear Units (ReLU). In the conv1 and conv2 layers we perform normalization over input regions using Local Response Normalization (LRN) [53]. At training time, we use dropout [125] (with a 0.5 ratio) in the fc5 and fc6 layers.

To train the basic network model we use Stochastic Gradient Descent (SGD) with momentum and L2 regularization. We use mini-batches of 64 images. The base learning rate is set to 0.01 and is decreased by a factor of 10 every 100k iterations. The momentum weight parameter is set to 0.9, and the weight decay regularization parameter to 5 × 10^-4. When training for individual patch classification, we build a dataset of small patches extracted by densely sampling the original training set images, as explained in Section 8.2.1. Notice that this produces a large set of patch samples, e.g. in the SIW-13 dataset the number of training samples is close to half a million. With these numbers the network converges after 250k iterations.
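As a back-of-the-envelope sanity check of the ≈ 24M figure, the per-layer parameters of the configuration listed above can be summed as follows. The 3 × 3 spatial size assumed for the conv4 output (and hence the fc5 input) is an assumption derived from the strides and paddings listed above.

```python
# Parameter count for the basic CNN (N = number of classes, 13 for SIW-13).
def conv_params(num_filters, k, in_channels):
    return num_filters * (k * k * in_channels + 1)    # weights + biases

def fc_params(out_units, in_units):
    return out_units * (in_units + 1)

N = 13
total = (conv_params(96, 5, 1) +          # conv1
         conv_params(256, 3, 96) +        # conv2
         conv_params(384, 3, 256) +       # conv3
         conv_params(512, 1, 384) +       # conv4
         fc_params(4096, 3 * 3 * 512) +   # fc5 (conv4 output flattened, assumed 3x3x512)
         fc_params(1024, 4096) +          # fc6
         fc_params(N, 1024))              # fc7
print(total)                              # 24393293, i.e. roughly 24M parameters
```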

In the case of the Ensemble of Conjoined Networks the basic network detailed above is replicated N times, and all replicas are tied at their fc7 outputs with an element-wise sum layer which is connected to a single output SoftMax layer. All networks in the ECN share the same parameter values.

Training the ECN requires a dataset where each input sample is composed of N image patches. We generate this dataset as follows: given an input image we extract patches in the same way as for the simple network, then we generate random N-combinations of the image patches, allowing repetitions if the number of patches is < N. Notice that this way the number of samples can be increased up to very large numbers, because the number of possible different N-combinations is $\binom{M}{N}$ when the number of patches M in a given image is larger than the number of conjoined nets N, which is the usual case. This is an important aspect of ECNs, as the training dataset generation process becomes a data augmentation technique in itself. We can see this data augmentation process as generating new small text instances that are composed from randomly chosen parts of their original generators. However, it is obviously not practical to use all possible combinations for training; thus, in order to get a manageable number of samples, we have used the simple rule of generating 2 × M samples per input image, which for example in the SIW-13 dataset produces around one million samples.

In terms of computational training complexity, the ECN has an important drawback compared to the simple network model: the number of computations is multiplied by N in each forward pass, and similarly the amount of memory needed increases linearly with N. To overcome this limitation, we use a fine-tuning approach to train ECNs. First, we train the simple network model, and then we fine-tune the ECN parameters starting from the values learned with the simple net. When fine-tuning, we have found that starting from a fully converged network in the single-patch classification task we reach a local minimum of the global task, producing zero loss in most (if not all) iterations and not allowing the network to learn anything new. In order to avoid this local minimum we start the fine-tuning from a non-converged network (at about 90-95% of the attainable individual patch classification accuracy). Using fine-tuning with a base learning rate of 0.001 (decreased by a factor of 10 every 10k iterations) the ECN converges much faster, in the order of 35k iterations. All other learning parameters are set the same as in the simple network training setup.
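A minimal sketch of the ECN sample generation described above is given below (hypothetical names; the default number of conjoined nets is only illustrative). For an image with M patches it draws 2 × M random N-combinations, sampling with replacement when M < N.

```python
# Generation of N-patch training samples for the ECN.
import random

def make_ecn_samples(patches, label, n_nets=10):
    """Return a list of (N-patch tuple, label) samples for one training image."""
    M = len(patches)
    samples = []
    for _ in range(2 * M):
        if M >= n_nets:
            combo = random.sample(patches, n_nets)      # without repetition
        else:
            combo = random.choices(patches, k=n_nets)   # repetitions allowed when M < N
        samples.append((tuple(combo), label))
    return samples
```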

Figure 8.9: Validation accuracy for various numbers of networks N in the ensemble of conjoined networks model.

The number of nets N in the ensemble can be seen as an extra hyper-parameter of the ECN learning algorithm. Intuitively, a dataset with longer text sequences would benefit from larger N values, while on the contrary, in the extreme case of classifying small squared images (i.e. each image is represented by a single patch), any value of N > 1 does not make sense. Since our datasets contain text instances of variable length, a possible procedure to select the optimal value of N is to use a validation set. We have done experiments on the SIW-13 dataset by dividing the provided train set and keeping 10% for validation. Classification accuracies on the validation set for various N values are shown in Figure 8.9. As can be appreciated, the positive impact of training with an ensemble of networks is evident for small values of N, and mostly saturates for values N > 9. In the following we use a value of N = 10 for all the remaining experiments.

8.3 Experiments

In this section we study the performance of the methods proposed in Section 8.2 and Section 8.1 for the tasks of script identification in pre-segmented text lines, joint text detection and script identification in scene images, and cross-domain performance. All reported experiments were conducted over three different datasets, namely the CVSI-2015, SIW-13, and MLe2e datasets.

8.3.1 Script identification in pre-segmented text lines

Table 8.1 shows the overall performance comparison of our methods with the state-of-the-art on all datasets for script identification in pre-segmented text lines. We also provide a comparison with three well known image recognition pipelines using Scale Invariant Features [77] (SIFT) in three different encodings: Fisher Vectors, Vector of Locally Aggregated Descriptors (VLAD), and Bag of Words (BoW); and a linear SVM classifier. In all baselines we extract SIFT features at four different scales in a sliding window with a step of 8 pixels. For the Fisher Vectors we use a GMM with 256 visual words, for VLAD 256 vector quantized visual words, and for BoW histograms of 2,048 vector quantized visual words. The step size and number of visual words were set to values similar to our method when possible in order to offer a fair evaluation. These three pipelines have been implemented with the VLFeat [132] and liblinear [30] open source libraries.

The method presented in Section 8.1 participated as a competing entry in the ICDAR2015 Competition on Video Script Identification (CVSI-2015), reaching the third place (CVC-2 entry in Table 8.1). While, as can be appreciated in Table 8.1, the winner of the competition [146] has better accuracy, our proposal was credited as a very close competitor in the competition report [115], and outperformed other entries by a noticeable margin. After the competition we found that with a denser sampling strategy (using an 8 pixel step for the sliding window instead of 16) our results can be even better (CNN+NBNN entry in Table 8.1).

However, Table 8.1 indicates a clear weakness of this method when evaluated on the more challenging scene text datasets (SIW-13 and MLe2e). This weakness motivated the extensions of our framework that we have detailed in Section 8.2.

Method                                      SIW-13   MLe2e   CVSI
This Chapter - Ensemble of Conjoined Nets   94.8     94.4    97.2
This Chapter - CNN (Avg.)                   92.8     93.1    96.7
This Chapter - CNN + NBNN                   76.9     91.12   97.91
Shi et al. [116]                            89.4     -       94.3
HUST [117, 115]                             88.0     -       96.69
Google [115]                                -        -       98.91
Nicolaou et al. [103]                       83.7     -       98.18
Singh et al. [120]                          -        -       96.70
CVC-2 [44, 115]                             -        88.16   96.0
SRS-LBP + KNN [104]                         -        82.71   94.20
C-DAC [115]                                 -        -       84.66
CUK [115]                                   -        -       74.06
Baseline SIFT + Fisher Vectors + SVM        90.7     88.63   94.11
Baseline SIFT + VLAD + SVM                  89.2     90.19   93.92
Baseline SIFT + Bag of Words + SVM          83.4     86.45   84.38

Table 8.1: Overall classification performance comparison with the state-of-the-art on three different datasets: SIW-13 [116], MLe2e, and CVSI [115].

As shown in Table 8.1, the method presented in Section 8.2 (CNN-ECN in the table) outperforms the state of the art and all baseline methods on the SIW-13 and MLe2e scene text datasets, while performing competitively on the CVSI video overlay text dataset. On the SIW-13 dataset the proposed method significantly outperforms the best performing method known to date by more than 4 percentage points. The contribution of training with ensembles of conjoined nets is significant on the SIW-13 and MLe2e datasets, as can be appreciated by comparing the first two rows of Table 8.1, which correspond to the nets trained with the ensemble (first row) and the simple model (second row).

Our interpretation of the results on the CVSI dataset in comparison with the ones obtained on SIW-13 and MLe2e relates to its distinct nature. CVSI overlaid-text variability and clutter are rather limited compared with those found in the scene text of MLe2e and SIW-13. As can be appreciated in Figure 8.10, overlaid text is usually bi-level without much clutter. Figure 8.11 shows another important characteristic of the CVSI dataset: since cropped words in the dataset belong to very long sentences of overlay text in videos, e.g. from rotating headlines, it is common to find a few dozen samples sharing exactly the same font and background both in the train and test sets. This particularity makes the ECN network not really helpful in the case of CVSI, as the data augmentation by image patch recombination is somehow already implicit in the dataset.

Figure 8.10: The variability and clutter of overlaid-text samples (top row) are rather limited compared with those found in scene text images (bottom row).

Figure 8.11: Cropped words in the CVSI dataset belong to very long sentences of overlay text in videos. It is common to find several samples sharing exactly the same font and background both in the train (top row) and test (bottom row) sets.

Furthermore, the CVSI-2015 competition winner (Google) makes use of a deep convolutional network but applies a binarization pre-processing to the input images. In our opinion this binarization may not be a realistic pre-processing in general for scene text images. As an example of this argument, one can easily see in Figure 8.10 that binarization of scene text instances is not as trivial as for overlay text. A similar justification applies to other methods performing better than ours on CVSI. In particular the LBP features used in [103], as well as the patch-level whitening used in our CNN-NBNN method, may potentially take advantage of the simpler, bi-level nature of text instances in the CVSI dataset. It is important to notice that these two algorithms have numbers close to Google's on CVSI-2015 (see Table 8.1) but perform quite poorly on SIW-13.

Figure 8.12 shows the confusion matrices of the method presented in Section 8.2 on all three datasets with detailed per class classification results.

8.3.2 Joint text detection and script identification in scene images

In this experiment we evaluate the performance of a complete pipeline for detection and script identification in its joint ability to detect text lines in natural scene images and properly recognize their scripts. The key interest of this experiment is to study the performance of the proposed script identification algorithm when realistic, non-perfect, text localization is available.

The experiment is performed over the new MLe2e dataset, in which script identification is performed at the text line level, because segmentation into words is largely script-dependent, and not meaningful in Chinese/Korean scripts. Notice however that in some cases, by the intrinsic nature of scene text, a text line provided by the text detection module may correspond to a single word, so we must deal with a large variability in the length of the provided text lines.

Figure 8.12: Confusion matrices with per class classification accuracy of the method presented in Section 8.2 on the SIW-13, MLe2e, and CVSI datasets.

We use the script-agnostic text detection method of Chapter 5 and apply the script identification methods presented in this Chapter to the bounding boxes of detected regions. For the evaluation of the joint text detection and script identification task in the MLe2e dataset we propose the use of a simple two-stage evaluation framework. First, localization is assessed based on the Intersection-over-Union (IoU) metric between detected and ground-truth regions, as commonly used in object detection tasks [28] and the recent ICDAR 2015 Robust Reading Competition [57] (http://rrc.cvc.uab.es). Second, the predicted script is verified against the ground-truth. A detected bounding box is thus considered correct if it has an IoU > 0.5 with a bounding box in the ground-truth and the predicted script is correct.

The localization-only performance, corresponding to the first stage of the evaluation, yields an F-score of 0.63 (Precision of 0.57 and Recall of 0.69). This defines the upper bound for the joint task. The two-stage evaluation, including script identification, of the proposed method compared with our previous work is shown in Table 8.2.

Method                      Correct   Wrong   Missing   Precision   Recall   F-score
This Chapter - ECN          395       376     245       0.51        0.62     0.56
This Chapter - CNN + NBNN   364       407     278       0.47        0.57     0.52

Table 8.2: Text detection and script identification performance on the MLe2e dataset.

Figure 8.13: Sample results of our method for the joint task of text detection and script recognition in natural scenes.

Intuitively, the proposed method for script identification is effective even when the text region is badly localized, as long as part of the text area is within the localized region. To support this argument we have performed an additional experiment where our algorithm is applied to cropped regions from pre-segmented text images. For this, we take the SIW-13 original images and calculate the performance of our method when applied to cropped regions of variable length, down to the minimum possible size (40 × 40 pixels). As can be appreciated in Figure 8.14, the experiment demonstrates that the proposed method is effective even when only small parts of the text lines are provided. Such a behavior is to be expected, due to the way our method treats local information to decide on a script class. In the case of the pipeline for joint detection and script identification, this extends to regions that did not pass the 0.5 IoU threshold, but had their script correctly identified. This opens the possibility of using script identification to inform and/or improve the text localization process: the information of the identified script can be used to refine the detections.
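For reference, the matching rule of the two-stage evaluation described above can be sketched as follows (hypothetical names; boxes are axis-aligned rectangles given as corner coordinates): a detection counts as correct only if it overlaps a ground-truth box with IoU > 0.5 and its predicted script matches that box's script.

```python
# Two-stage correctness check: localization (IoU > 0.5) plus script match.
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter > 0 else 0.0

def is_correct(det_box, det_script, gt_boxes, gt_scripts):
    return any(iou(det_box, gb) > 0.5 and det_script == gs
               for gb, gs in zip(gt_boxes, gt_scripts))
```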

Figure 8.14: Classification error of our CNN-ECN method when applied to variable length cropped regions of SIW-13 images, down to the minimum possible size (40 × 40 pixels).

8.3.3 Cross-domain performance and confusion in single-language datasets

In this experiment we evaluate the cross-domain performance of the learned CNN-ECN weights from one dataset to another. For example, we evaluate on the MLe2e and CVSI test sets using the network trained with the SIW-13 train set, measuring classification accuracy only for their common script classes: Arabic, English, and Kannada in CVSI; Chinese, English, Kannada, and Korean in MLe2e. Finally, we evaluate the misclassification error of our method (trained on the different datasets) over two single-script datasets. For this experiment we use the ICDAR2013 [60] and ALIF [147] datasets, which provide cropped word images of English scene text and Arabic video overlaid text respectively. Table 8.3 shows the results of these experiments.

Method             SIW-13   MLe2e   CVSI   ICDAR   ALIF
ECN CNN (SIW-13)   94.8     86.8    90.6   74.7    100
ECN CNN (MLe2e)    90.8     94.4    98.3   95.3    -
ECN CNN (CVSI)     42.3     43.5    97.2   65.2    91.8

Table 8.3: Cross-domain performance of our method measured by training/testing on different datasets.

Notice that the results in Table 8.3 are not directly comparable among rows because each classifier has been trained with a different number of classes, thus having different rates for a random-choice classification. However, the experiment serves as a validation of how well a given classifier performs on data that is distinct in nature from the one used for training. In this sense, the obtained results show a clear weakness when the model is trained on the video overlaid text of CVSI and subsequently applied to scene text images (SIW-13, MLe2e, and ICDAR). On the contrary, models trained on scene text datasets are quite stable on other scene text data, as well as on video overlaid text (CVSI and ALIF).

In fact, this is an expected result, because the domain of video overlay text can be seen as a subdomain of the scene text domain. Since the scene text datasets are richer in text variability, e.g. in terms of perspective distortion, physical appearance, variable illumination, and typeface designs, script identification on these datasets is a more difficult problem, and their data is better suited if one wants to learn effective cross-domain models. This demonstrates that our method is able to learn discriminative stroke-part representations that are not dataset-specific, and provides evidence for the claims made in Section 8.3.1 when interpreting the results obtained on the CVSI dataset in comparison with other methods that may be more engineered to the specific CVSI data but do not generalize well to scene text datasets.

8.4 Conclusion

In this Chapter we have presented two different patch-based methods for script identification in natural scene images. One of the methods combines the expressive representation of convolutional features and the fine-grained classification characteristics of the Naive-Bayes Nearest Neighbor classifier. The other is based on the use of ensembles of conjoined convolutional networks to jointly learn discriminative stroke-part representations and their relative importance in a patch-based classification scheme.

Experiments done on three different datasets exhibit state of the art accuracy rates in comparison to a number of methods, including the participants in the CVSI-2015 competition and standard image recognition pipelines.

Chapter 9 Applications

In this Chapter we present two end-to-end pipelines that are built on top of the methods developed throughout this thesis. Although this is primarily an engineering work, because we make use of existing solutions, the results serve as an argument to further demonstrate the value of the contributions made in this thesis. In Section 9.1 we build a multi-lingual end-to-end reading system by combining the scene text detection method presented in Chapter 5 with the script identification methods of Chapter 8 and two different off-the-shelf OCR engines. Second, in Section 9.2 we apply the text specific object proposals explained in Chapter 6 together with two state of the art whole-word recognition methods in order to build an end-to-end word spotting system. Finally, in Section 9.3 we give details of all public releases of code, data, and models developed as part of the work in this thesis. This includes some parts of the recently developed OpenCV text module.

9.1 Unconstrained Text Recognition with off-the-shelf OCR engines

It is a generally accepted fact that off-the-shelf OCR engines do not perform well in unconstrained scenarios like natural scene imagery, where text appears among the clutter of the scene. A typical experiment frequently repeated to demonstrate the need of specific techniques for scene text detection and recognition is to attempt to process a raw scene image with a conventional OCR engine. This normally produces a bunch of garbage on the recognition output. Obviously this is not the task for which OCR engines have been designed, and the recognition may be much better if we provide them with a pixel level segmentation of the text. Figure 9.1 shows the output of the open source Tesseract [121] OCR engine (http://code.google.com/p/tesseract-ocr/) for a raw scene image and for its binarized text mask obtained with a scene text extraction method.

Figure 9.1: Off-the-shelf OCR engine recognition accuracy may increase to acceptable rates if we provide pixel level text segmentation instead of raw pixels.

Recent work [84, 43, 85] indicates that a conventional shape-based OCR engine is able to produce competitive results in the end-to-end scene text recognition task when provided with a conveniently pre-processed image. In this section we confirm this finding with a set of experiments where an off-the-shelf OCR engine is combined with our scene text detection framework. The obtained results demonstrate that in such a pipeline, conventional OCR solutions still perform competitively compared to other solutions specifically designed for scene text recognition. While this is primarily an engineering work, we think there are two important aspects of it that can be of broad interest from a research perspective: one is the question of whether pixel-level text segmentation is really useful for the final recognition, the other is how stronger language models affect the final results.

9.1.1 End-to-end Pipeline

In order to build a multi-lingual end-to-end reading system we combine the scene text detection method presented in Chapter 5 with the script identification methods of Chapter 8 and an off-the-shelf OCR engine: the open source project Tesseract [121] (http://code.google.com/p/tesseract-ocr/).

The setup of the OCR engine in our pipeline is minimal: given a text detection hypothesis from the detection module we set the recognition language to the one provided by the script identification module, and we set the OCR to interpret the input as a single text line. Apart from that we use the default Tesseract parameters.

The recognition output is filtered with a simple post-processing junk filter in order to eliminate garbage recognitions, i.e. sequences of identical characters like "IIii" that may appear as the result of trying to recognize repetitive patterns in the scene. Concretely, we discard the words in which more than half of the characters are recognized as one of "i", "l", "I", or other special characters like punctuation marks, quotes, exclamation marks, etc. We also reject those detections for which the recognition confidence provided by the OCR engine is under a certain threshold.
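A minimal sketch of this junk filter is given below (hypothetical names; the concrete confidence threshold shown is only illustrative, not the value used in the experiments).

```python
# Post-processing junk filter for OCR outputs.
import string

SUSPICIOUS = set("ilI") | set(string.punctuation)

def is_junk(word, confidence, min_confidence=60):
    """Reject garbage recognitions such as 'IIii' or low-confidence outputs."""
    if confidence < min_confidence:
        return True
    suspicious = sum(1 for ch in word if ch in SUSPICIOUS)
    return suspicious > len(word) / 2.0
```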

9.1.2 Experiments

We conduct our experiments on the MLe2e and ICDAR2011 datasets, respectively for multi-lingual and English-only scene text end-to-end recognition.

Multi-lingual scene text end-to-end recognition

Table 9.1 shows a comparison of the proposed end-to-end pipeline using different script identification modules: the two methods presented in Chapter 8, and Tesseract's built-in alternative. Figure 9.2 shows the output of our full end-to-end pipeline for some images of the MLe2e test set. The Tesseract entry in Table 9.1 refers to the use of Tesseract's own script estimation algorithm [130]. We have found that Tesseract's algorithm is designed to work with large corpora of text (e.g. full page documents) and does not work well for the case of single text lines.

Figure 9.2: End-to-end recognition of text from images containing textual information in different scripts/languages.

Script identification   Correct   Wrong   Missing   Precision   Recall   F-score
Chapter 8 ECN           96        212     503       0.31        0.16     0.21
Chapter 8 CNN-NBNN      82        211     517       0.28        0.14     0.18
Tesseract               50        93      549       0.35        0.08     0.13

Table 9.1: End-to-end multi-lingual recognition performance on the MLe2e dataset.

Results in Table 9.1 demonstrate the direct correlation between better script identification rates and better end-to-end recognition results. The final multi-lingual recognition f-score obtained (0.21) is far from the state of the art in end-to-end recognition systems designed for English-only environments [7, 52, 102]. As a fair comparison, a very similar pipeline using the Tesseract OCR engine [43] achieves an f-score of 0.37 on the ICDAR English-only dataset (see the next section). The lower performance obtained on the MLe2e dataset stems from a number of challenges that are specific to its multi-lingual nature. For example, in some scripts (e.g. Chinese and Kannada) glyphs are often not single-body regions, being composed of (or complemented with) small strokes that in many cases are lost in the text segmentation stage. In such cases a bad pixel-level segmentation of the text makes it practically impossible for the OCR engine to produce a correct recognition.

Our pipeline results represent the first reference result for multi-lingual scene text recognition and a first benchmark from which better systems can be built, e.g. replacing the off-the-shelf OCR engine by other recognition modules better suited for scene text imagery.

English-only scene text end-to-end recognition

Table 9.2 shows the comparison of the best obtained results of the two evaluated OCR engines with the current state of the art. In both cases our results outperform [98] in total f-score while, as stated before, our detection pipeline is a simplified implementation of that method. However, it is important to notice that our recall is in general lower than that of the other methods, while it is our precision that makes the difference in f-score. A similar behaviour can be seen for the method of Milyaev et al. [84], which also uses a commercial OCR engine for the recognition. Such higher precision rates indicate that in general off-the-shelf OCR engines are very good at rejecting false detections. Notice that Table 9.2 does not include the method in [7] because end-to-end results are not available. However, it would be expected to be at the top of the table if we take into account their cropped word recognition rates and their high recall detection strategy.

Method                    Precision   Recall   F-score
Milyaev et al. [84]       66.0        46.0     54.0
Yao et al. [142]          49.2        44.0     45.4
Neumann and Matas [100]   44.8        45.4     45.2
OpenCV + Tesseract        52.9        32.4     40.2
Neumann and Matas [99]    37.8        39.4     38.6
Chapter 5 + Tesseract     48.1        30.7     37.5
Neumann and Matas [98]    37.1        37.2     36.5
Wang et al. [137]*        54.0        30.0     38.0

Table 9.2: End-to-end results on the ICDAR 2011 dataset comparing our proposed pipeline with other state-of-the-art methods. Methods marked with an asterisk evaluate on a slightly different dataset (ICDAR 2003).

Figure 9.3: Qualitative results on well recognized text using the CSER+Tesseract pipeline.

Figure 9.4: Common misspelling mistakes using the CSER+Tesseract pipeline are due to missed diacritics, rare font types, and/or poor segmentation of some characters.

Figures 9.3, 9.4, and 9.5 show qualitative results of the evaluated pipeline. The end-to-end system recognizes words correctly in a variety of different situations, including difficult cases, e.g. where text appears blurred or with non-standard fonts. Common misspelling mistakes are, most of the time, due to missed diacritics, rare font types, and/or poor segmentation of some characters. Also, in many cases the OCR performs poorly because the text extraction algorithm is not able to produce a good segmentation, e.g. in challenging situations or in cases where characters are broken into several strokes.

9.1.3 Discussion

In this section we have seen that pixel level segmentation methods are still among the best solutions in state of the art end-to-end recognition for focused scene text. It is true, however, that scene text is not always binarizable and that in such cases other techniques must be employed. But current segmentation methods combined with existing OCR technologies may produce optimal results in many cases, by taking advantage of more than 40 years of research and development in automated reading systems, e.g. all the accumulated knowledge of shape-based classifiers, and the state of the art in language modelling for OCR.

Figure 9.5: Common errors when segmentation is particularly challenging or characters are broken into several strokes.

Figure 9.6: End-to-end text recognition results on the Street View Text dataset.

9.2 End-to-end word spotting

In this section we combine the text specific object proposals explained in Chapter 6 with two state of the art whole-word recognition methods [2, 51] in order to build an end-to-end word spotting system.

End-to-end text recognition and image retrieval results

Table 9.3 shows end-to-end recognition F-scores on SVT, ICDAR2003, and ICDAR2015 datasets.

Method                              IC03-50   IC03-Full   IC03    SVT-50   SVT    IC15-50   IC15-Full   IC15
Wang et al. [135]                   0.68      0.61        -       0.38     -      -         -           -
Wang and Wu [137]                   0.72      0.67        -       0.46     -      -         -           -
Alsharif [3]                        0.77      0.70        0.63*   0.48     -      -         -           -
Jaderberg et al. [52]               0.80      0.75        -       0.56     -      -         -           -
Jaderberg et al. [51]               0.90      0.86        0.78    0.76     0.53   0.90**    -           0.76
StradVision-1                       -         -           -       -        -      0.86      0.83        0.70
Deep2Text II-2                      -         -           -       -        -      0.77      0.77        0.77
Chapter 6 + Almazan et al. [2]      0.82      0.73        -       0.67     -      -         -           -
Chapter 6 + Jaderberg et al. [51]   0.92      0.90        0.75    0.85     0.52   0.85      0.84        0.71

Table 9.3: Comparison of end-to-end word spotting F-scores on the ICDAR2003, ICDAR2015, and SVT datasets.

9.2.1 Discussion

In this section we described a complete system for robust reading of text in natural scene images, able to perform both end-to-end text spotting and image retrieval based on textual information. The system builds upon two holistic word recognition methods: one is based on a compact attribute-based word representation that permits performing word recognition and retrieval in a unified and integrated way [2]; the other consists of a deep CNN model trained to recognize words by assigning a label among 90k possible classes. Word recognition is applied on a set of candidate text boxes returned by the selective search text localization approach presented in Chapter 6, which generates multiple word hypotheses. Results on the SVT and ICDAR datasets show competitive performance both for the text recognition and retrieval tasks.

9.3 Public releases

Open source algorithm implementations are of great help for new researchers: they serve as reference baselines to compare against, and permit researchers to focus on a certain part of the text extraction pipeline, while using other "standard" implemented modules, and to analyse the improvement of their work. We strongly believe that releasing open source implementations of state of the art methods helps improve the experience of the research community interested in the text extraction problem, as well as having a positive impact on the quality of new methods to come.

For a long time no such open framework was available but, fortunately, things are changing. In the past four years the code of a few text extraction methods has been released to the public (not necessarily with open source licenses). These include the work of Kai Wang (http://vision.ucsd.edu/~kai/grocr/), the work of Rodrigo Minetto [88] (http://www.dainf.ct.utfpr.edu.br/~rminetto/projects/snoopertext/), an implementation of the Stroke Width Transform [27] in the libccv project (http://libccv.org/), and our own work on multi-script text extraction and script identification.

Multi-script Text Extraction method

We have released the code of the text extraction method presented in Chapter 4 and published in an ICDAR 2015 paper. The released code (http://github.com/lluisgomez/text_extraction) allows the recreation of Table 4.1.

Text Object Proposals

We have released the code of our text specific object proposals method detailed in Chapter 6. The released code allows the recreation of Tables 6.2 and 6.3, as well as the plots in Figure 6.3.

Script identification code and CNN models

We have released the code of the two methods for script identification described in Chapter 8. The released code and CNN models allow the recreation of the results shown in Tables 8.1 and 8.3.

OpenCV Text module

The author of this thesis took part in the 2013 and 2014 editions of the Google Summer of Code (GSoC) program (http://www.google-melange.com/gsoc/homepage/google/gsoc2014). This was an exceptional opportunity to work with members of the OpenCV (http://opencv.org) library development team in order to implement state-of-the-art scene text detection and recognition methods [98][42]. The project, which has already produced some code (http://docs.opencv.org/trunk/modules/objdetect/doc/erfilter.html) and whose development is still ongoing, will hopefully make available to the OpenCV community a good quality open source implementation of a complete scene text end-to-end understanding system.

Chapter 10 Conclusion and Future Work

In this thesis we have contributed several methods to the state of the art of automatic scene text understanding in unconstrained conditions. Our contributions are mainly on multi-language and arbitrary-oriented text detection, tracking, and recognition in natural scene images and videos.

In Chapter 4 a new methodology for text extraction from scene images was presented, inspired by the human perception of textual content, which is largely based on perceptual organisation. The proposed method requires practically no training as the perceptual organisation based analysis is parameter free. It is totally independent of the language and script in which text appears, it can deal efficiently with any type of font and text size, and it makes no assumptions about the orientation of the text.

In Chapter 5 we have detailed a scene text extraction method in which the exploitation of the hierarchical structure of text plays an integral part. We have shown that the algorithm can efficiently detect text groups with arbitrary orientation in a single clustering process that involves: a learned optimal clustering feature space for text region grouping, novel discriminative and probabilistic stopping rules, and a new set of features for text group classification that can be efficiently calculated in an incremental way.

In Chapter 6 we have evaluated the performance of generic Object Proposals in the task of detecting text words in natural scenes. We have presented a text specific method that is able to outperform generic methods in many cases, or to show competitive numbers in others. Moreover, the proposed algorithm is parameter free and fits well the multi-script and arbitrary oriented text scenario.

In Chapter 7 we have presented a method for detection and tracking of scene text able to work in real-time on low-resource mobile devices. Although far from being a final solution, the proposed method goes beyond the full-detection approaches in terms of time performance optimization. The combination of text detection with a tracker provides considerable stability, allowing the system to provide predicted estimates in cases where the detection module itself is not capable of returning a valid response. The use of MSER-tracking as an alternative, fast technique to provide simulated text detections for the frames that are not processed by the full frame text detector proves to be an adequate solution, providing the system with enough information to continue tracking until the text detector returns updated positions.

In Chapter 8 a patch-based framework for script identification in natural scene images was presented. The two proposed methods are based on the intuition that effective script identification must leverage the discriminative power of certain small patches of the image. For this we rely on the use of ensembles of conjoined convolutional networks to jointly learn discriminative stroke-part representations and their relative importance in a patch-based classification scheme. Experiments performed on three different datasets exhibit state of the art accuracy rates in comparison to a number of methods, including three standard image classification pipelines. Our work demonstrates the viability of script identification in natural scene images, paving the road towards true multi-lingual end-to-end scene text understanding.

Future work

Improved text region proposals. An interesting observation from our experiments in Chapter 6 is that while class-independent object proposals methods suffice with near a thousand proposals to achieve high recall rates for object detection, in the case of text we still need around 10,000 in order to achieve similar numbers. This indicates there is a large room for improvement in text specific Object Proposals methods. One possible direction would be to improve the quality of the proposal ranking with better classifiers while maintaining low time complexity. The perceptual organization approach presented in Chapter 4 opens up a number of possible paths for future research in object proposals methods, including the tighter integration of the region decomposition stage with the perceptual organisation analysis, and further investigation of the computational modelling of perceptual organisation aspects such as masking, conflict and collaboration.

Integration of script-independent and script-specific approaches. In Chapter 8 we have seen that script identification is effective even when the text region is badly localized, as long as part of the text area is within the localized region. This opens the possibility of using script identification to inform and/or improve the text localization process. The information of the identified script can be used to refine the detections with an ad-hoc detection method specialized in a certain script. On the other hand, end-to-end word spotting systems like the one built in Chapter 9 may be extended to multi-lingual environments by training independent per-script whole word recognizers.

Bibliography
