Optical Character Recognition
Total Page:16
File Type:pdf, Size:1020Kb
OPTICAL CHARACTER RECOGNITION Màster de Visió per Computador Curs 2006 - 2007 Outline • Introduction • Pre-processing (document level) – Binarization – Skew correction • Segmentation – Layout analysis – Character segmentation • Pre-processing (character level) • Feature extraction – Image-based features – Statistical features – Transform-based features – Structural features • Classification • Post-proccessing – Classifier combination – Exploitation of context information • Examples of OCR systems • Bibliography 2 Outline 1 Optical Character Recognition Pattern Recognition Statistical Structural Image Processing Pattern Recognition Pattern Recognition Methods Applications Optical Document Character Analysis Recognition 3 Introduction Some examples Drawings, maps Books, journals, reports License plates Identity PDAs Cards Old documents Cheques, bills Postal addresses Quality control 4 Introduction 2 Document Image Analysis G. Nagy: Twenty years of document image analysis in PAMI. IEEE Trans. on PAMI, vol. 22, nº 1, pp. 38-62 January 2000. Document image analysis is the subfield of digital image processing that aims at converting document images to symbolic form for modification, storage, retrieval, reuse and transmission. Document image analysis is the theory and practice of recovering the symbol structure of digital images scanned from paper or produced by computer. What is a document? Objects created expressly to convey information encoded as iconic symbols – Scanned images from paper documents – Electronic documents – Multimedia documents (video with text) –… 5 Introduction Applications of DIA Document Imaging: •Digitization •Storage •Compression •Re-printing Applications of DIA Document understanding •Recognition •Interpretation •Indexing •Retrieval 6 Introduction 3 DIA tasks Mostly text Mostly graphics •Acquisition •Acquisition •Binarization •Binarization Document Imaging •Filtering •Filtering •Skew correction •Vectorization •Segmentation •Text-graphics •Layout analysis separation Document Understanding •OCR •Symbol recognition •Interpretation 7 Introduction Outline of the course Focus: Document understanding of mostly text documents 1. Acquisition 2. Pre-processing − Binarization − Skew correction 3. Layout analysis 4. Character segmentation 5. OCR − Feature extraction − Classification − Post-processing 8 Introduction 4 Categorization of Character Recognition Optical Character Recognition According to the type of writing Machine-printed Hand-written character recognition character recognition According to the type of acquisition Off-line On-line character recognition character recognition 9 Introduction Machine-printed character recognition • Characters are totally defined by the font type: – Dimensions (segmentation) • Character width • Inter-character separation • Character height – Shape (recognition) • Typographic effects (boldface, italics, underline). •Challenges: – Similar shapes among characters – Multiple fonts – Joined characters – Digitization noise: broken lines, random noise, heavy characters, etc. – Document degradation: old documents, photocopies, etc. 10 Introduction 5 Machine-printed character recognition • Classification of machine-printed OCR systems – Monofont: • One single type of font – Multifont: • Recognition of a fixed and known set of fonts • It is necessary to identify and learn the differences between characters of all the types of fonts – Omnifont: • Recognition of any arbitrary type of font, even if it has not been previously learned 11 Introduction Off-line hand-written character recognition • Hand-written • Off-line: acquisition by a scanner or a camera • Challenges: – Shape variability among images of the same character – Character segmentation • Subproblems: – Hand-written numeral recognition: digit recognition – Hand-printed character recognition: well-separated characters – Cursive character recognition: non-separated characters 12 Introduction 6 On-line hand-written character recognition • On-line acquisition – Digitizer tablets – Digital Pen – Tablet PC • Advantages with respect to off-line acquisition: – Image is acquired while the text is written – We can take advantage of dynamic information: • Temporal information: writing order, stroke segmentation, etc • Writing speed • Pen pressure • Subproblems: – Cursive script recognition. – Signature verification/recognition. 13 Introduction Levels of difficulty in character recognition •S.Mori, H.Nishida, H.Yamada. Optical Character Recognition. John Wiley and sons. 1999. •S.V. Rice, G. Nagy, T.A. Nartker. Optical Character Recognition: An illustrated guide to the frontier. Kluwer Academic Publishers. 1999. • Little shape variability 0.0. Printed characters. Specific font. Constant size. Roman • Small number of characters alphabets. • Little noise 0.1. Constrained hand-printed characters. Arabic numerals. Level 0 • Medium variation in shape 1.0. Printed characters. Multiple fonts. Nº characters < 100 • Medium noise 1.1. Loosely constrained hand-printed char. Nº char < 100 1.2. Chinese characters of few fonts Level 1 1.3. Loosely constrained hand-printed char. Nº char≈1000 • Much variation in shape 2.0. Printed characters of multiple fonts • Heavy noise 2.1. Unconstrained hand-printed characters 2.2. Affine transformed characters Level 2 • Nonsegmented strings of 3.0. Touching/broken characters characters 3.1. Cursive handwriting characters Level 3 3.2. Characters on a textured background 14 Introduction 7 Levels of difficulty in character recognition Level 0 0.0. Printed character of a specific font with a constant size • Constant size • Connectivity of characters • Variation in the stroke thickness • Little noise 0.1. Constrained hand-printed characters • Characters are written according to some instructions or box guidelines Solved problem 15 Introduction Levels of difficulty in character recognition Level 1 1.0. Printed characters of multiple fonts 1.1. Loosely constrained hand-printed characters 1.2. Chinese characters of few fonts. 1.3. Loosely constrained hand-printed characters. Nº characters ≈ 1000 Solved problem 16 Introduction 8 Levels of difficulty in character recognition Level 2 2.0. Printed characters of multiple fonts 2.1. Unconstrained hand-printed characters 2.2. Affine transformed characters 17 Introduction Levels of difficulty in character recognition Level 3 3.0. Caràcters no separats o trencats 3.1. Cursive handwriting characters 3.2. Characters on a textured background 18 Introduction 9 Databases for OCR Off-line Hand-written characters On-line Machine- characters printed CEDAR CENPARMI NIST UNIPEN Univ. Washington • 50.000 • 17.000 • More than • Definition of a • More than segmented manually 1.000.000 format to 1500 pages of numerals from segmented characters from represent on- articles in zip codes numerals from forms line data english • 5.000 zip codes zip codes • Several learning • 4.500.000 • More than 500 • 5.000 city and test sets characters pages of articles names (more • Segmented in japanese • 9.000 state variability) characters, • Originals, names • 91.500 words and photocopies and sentences with a sentences pages with dictionary arificially generated noise • Page segmentation into labeled zones 19 Introduction Performance evaluation of OCR systems • Hand-printed Character Recognition – Institute for Posts and Telecommunications Policy (IPIP) – Japan - 1996 – 5000 hand-written numerals from japanese zip codes – Performance of the best system: 97.94% (human performance: 99.84%) • Machine-printed Character Recognition – “The fifth Annual Test of OCR Accuracy”. Information Science Research Institute. TR-96-01. April 1996. http://www.isri.unlv.edu – 5.000.000 characters from 2000 pages in journals, newspapers, letters and technical reports – Performance in good quality documents: 99.77% - 99.13% – Performance in medium quality documents: 99.27% - 98.21% – Performance in low quality documents: 97.01% - 89.34% • Performance 99% => 30 errors /page (3000 characters/page) 20 Introduction 10 Performance evaluation of OCR systems C.Y. Suen, J. Tan: Analysis of errors of handwritten digits made by a multitude of classifiers. PRL 26, pp. 369-379. 2005 Classifier N of test samples Recognition (%) Error (%) MNIST database GPR 10,000 99.06 0.94 VSVMb 10,000 99.62 0.38 VSV2 10,000 99.44 0.56 LeNet-5 10,000 99.18 0.82 POE 10,000 98.32 1.68 CENPARMI database VSVM 2000 98.7 1.3 USPS database VSVM 2007 97.66 2.34 NIST SD19 database MLP 30,000 99.16 0.84 21 Introduction Performance evaluation of OCR systems Classifier N. of errors Category 1 Category 2 Category 3 MNIST database GPR 94 24 11 59 VSVMb 38 15 6 17 VSV2 56 15 9 32 LeNet5 82 17 14 51 POE 168 41 9 118 CENPARMI database VSVM 26 6 4 16 USPS database VSVM 47 13 13 21 NIST SD 19 database MLP 119 30 8 81 Sum 630 161 74 395 Percentage (%) 100 25.56 11.75 62.70 22 Introduction 11 Components of an OCR system ACQUISITION • Filtering. DOCUMENT • Binarization. PRE-PROCESSING • Skew correction. • Layout analysis SEGMENTATION • Text/graphics separation • Character segmentation CHARACTER • Filtering PRE-PROCESSING • Normalization • Image-based features FEATURE • Statistical features EXTRACTION • Transform-based features • Structural features LEARNING CLASSIFICATION Models Context POSTPROCESSING infromation 23 Introduction Acquisition. Scanners A scanner is a linear camera with a lighting and a displacement system Scan line Opening Video circuit Digital image text Lens CCD Document Lights 24 Acquisition 12 Acquisition. Scanners Types of scanners: • Flatbed scanners. The CCD