Large Scale Character Classification

Large Scale Character Classification Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science (by Research) in Computer Science by Neeba N.V 200650016 [email protected] http://research.iiit.ac.in/~neeba International Institute of Information Technology Hyderabad, India August 2010 ii INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY Hyderabad, India CERTIFICATE It is certified that the work contained in this thesis, titled Large Scale Character Classi- fication by Neeba N.V, has been carried out under my supervision and is not submitted elsewhere for a degree. Date Advisor: Dr. C. V. Jawahar iv Copyright c Neeba N.V, 2008 All Rights Reserved vi To my Loving parents. viii O Lord, May I accept gracefully, what I can not change. O Lord, May I have the will and effort to change what I can change. O Lord, May I have the wisdom to understand, what I can change, and What I can not Change. x Acknowledgements I am deeply indebted to my advisor Dr. C. V. Jawahar for his kindness, dedication, encouragement, motivation and also for his inspiring guidance and supervision throughout my thesis work. I am also greatly indebted to Dr. P. J. Narayanan (PJN) for his concern, encouragement and advise. My sincere thanks also forwarded to Dr. Anoop M. Namboodiri for his critical comments on my conference papers. I would also like to thank document understanding research groups at Centre for Vi- sual Information Technology (CVIT), who had made great contribution by sharing ideas, comments and materials. My dearest thanks goes to Anand Kumar, Million Meshesha, Jyotirmoy, Rasagna, and Jinesh for their valuable suggestions and kindness to help me in any way possible. A special thanks goes to my friend Ilayraja, who was my project partner for the work ”Efficient Implementation of SVM for Large Class Problems”. I extend my thanks to my friends Lini, Satya, Pooja and Uma for their support during my MS. Last, but not the least, the almighty, my parents, my relatives and all those from CVIT who had at some or the other point in time helped me with their invaluable suggestions and feedback, and my research center, Center for Visual Information Technology (CVIT), for funding my MS by research in IIIT Hyderabad. Abstract Large scale pattern classification systems are necessary in many real life problems like object recognition, bio-informatics, character recognition, biometrics and data-mining. This thesis focuses on pattern classification issues associated with character recognition, with special emphasis on Malayalam. We propose an architecture for the character classification, and proves the utility of the the proposed method by validating on a large dataset. The challenges we address in this work includes: (i) Classification in presence of large number of classes (ii) Efficient implementation of effective large scale classification (iii) Simultaneous performance analysis and learning in large data sets (of Millions of examples). Throughout this work, we use examples of characters (or symbols) extracted from real- life Malayalam document images. Developing annotated data set at the symbol level from a coarse (say word-level) annotated data is addressed first with the help of a dynamic programming based algorithm. Algorithm is then generalized to handle the popular degradations in the form of cuts, merges and other artifacts. As a byproduct, this algorithms allows to quantitatively estimate the quality of the books, documents and words. The dynamic programming based algorithm aligns the text (in UNICODE) with images (in Pixels). This helps in developing a large data set which could help in conducting large scale character classification experiments. We then conduct an empirical study of classifiers and feature combination to explore their suitability to the problem of character classification. The scope of this study include (a) applicability of a spectrum of popular classifiers and features (b)scalability of classifiers with the increase in number of classes (c) sensitivity of features to degradation (d) generalization across fonts and (e) applicability across scripts. It may be noted that all these aspects are important to solve practical character classification problems. Our empirical studies provide convincing evidences to support the utility of SVM (multiple pair-wise) classifiers for solving the problem. However, a direct use of multiple SVM classifiers has certain disadvantages: (i) since there are nC2 pairwise classifiers, storage and computational complexity of the final classifier becomes high for many practical applications. (ii) they directly provide a class label and fail to provide an estimate of the posterior probability. We address these issues by efficiently designing a Decision Directed Acyclic Graph (DDAG) classifier and using the appropriate feature space. We also propose efficient methods to minimize the storage complexity of support vectors for the classification purpose. We also extend our algebraic simplification method for simplifying hierarchical classifier solutions.We use SVM pair-wise classifiers with DDAG architecture for classification. We use linear kernel for SVM, considering the fact that most of the classes in a large class problem are linearly separable. We carried out our classification experiments on a huge data set, with more than 200 classes and 50 million examples, collected from 12 scanned Malayalam books. Based on the number of cuts, merges detected, the quality definitions are imposed on the document image pages. The experiments are conducted on pages with varying quality. We could achieve a reasonably high accuracy on all the data considered. We do an extensive evaluation of the performance on this data set which is more than 2000 pages. In presence of large and diverse collection of examples, it becomes important to continuously learn and adapt. Such an approach could be more significant while recognizing books. We extend our classifier system to continuously improve the performance by providing feedback and retraining the classifier. We also discuss the limitations of the current work and scope for future work. ii Contents 1 Introduction 1 1.1 Pattern Classifiers ................................ 1 1.2 Overview of an OCR System .......................... 2 1.3 Indian Language OCR : Literature Survey ................... 4 1.4 Challenges ..................................... 8 1.4.1 Challenges Specific to Malayalam Script ................ 10 1.5 Overview of this work .............................. 12 1.5.1 Contribution of the work ........................ 12 1.5.2 Organization of the thesis ........................ 13 2 Building Datasets from Real Life Documents 15 2.1 Introduction .................................... 15 2.2 Challenges in Real-life Documents ....................... 17 2.2.1 Document level Issues .......................... 17 2.2.2 Content level Issues ........................... 18 2.2.3 Representational level Issues ...................... 18 2.3 Background on Dynamic Programming ..................... 19 2.3.1 A worked out Example - String Matching ............... 20 2.4 A Naive Algorithm to Align Text and Image for English ........... 23 2.5 Algorithm to Align Text and Image for Indian Scripts ............ 26 2.6 Challenges for Degraded Documents ...................... 28 2.7 Implementation and Discussions ........................ 31 2.7.1 Features for matching .......................... 31 2.7.2 Malayalam script related issues ..................... 35 2.8 Results ....................................... 35 2.8.1 Symbol level Unigram and Bigram ................... 36 i 2.8.2 Estimate of Degradations ........................ 38 2.8.3 Estimate of various Quality Measures ................. 38 2.9 Quality definitions of document images ..................... 39 2.9.1 Word level Degradation ......................... 40 2.10 Summary ..................................... 41 3 Empirical Evaluation of Character Classification Schemes 42 3.1 Introduction .................................... 42 3.2 Problem Parameters ............................... 43 3.2.1 Classifiers ................................. 43 3.2.2 Features .................................. 48 3.3 Empirical Evaluation and Discussions ..................... 53 3.3.1 Experiment 1: Comparison of Classifiers and Features ........ 53 3.3.2 Experiment 2: Richness in the Feature space ............. 54 3.3.3 Experiment 3: Scalability of classifiers ................. 55 3.3.4 Experiment 4: Degradation of Characters ............... 56 3.3.5 Experiment 5: Generalization Across Fonts .............. 58 3.3.6 Experiment 6: Applicability across scripts ............... 59 3.4 Discussion ..................................... 60 3.5 Summary ..................................... 62 4 Design and Efficient Implementation of Classifiers for Large Class Prob- lems 64 4.1 Introduction .................................... 64 4.2 Multiclass Data Structure(MDS) ........................ 66 4.2.1 Discussions ................................ 70 4.2.2 SVM simplification with linear kernel ................. 72 4.3 Hierarchical Simplification of SVs ........................ 73 4.4 OCR and Classification ............................. 76 4.5 Summary ..................................... 78 5 Performance Evaluation 80 5.1 Introduction .................................... 80 5.1.1 Performance Metrics ........................... 81 5.2 Experiments and Results ............................. 82 5.2.1 Symbol and Unicode level Results ................... 82 ii 5.2.2 Word level Results

Large Scale Character Classification

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support