Reconocimiento De Escritura Lecture 4/5 --- Layout Analysis
Daniel Keysers, Jan/Feb 2008

Outline

Detection of Geometric Primitives
  - The Hough-Transform
  - RAST
Document Layout Analysis
  - Introduction
  - Algorithms for Layout Analysis
  - A 'New' Algorithm: Whitespace Cuts
  - Evaluation of Layout Analysis
  - Statistical Layout Analysis
OCR
  - OCR - Introduction
  - OCR fonts
  - Tesseract
  - Sources of OCR Errors

Detection of Geometric Primitives

Some geometric entities important for DIA (document image analysis):
- text lines
- whitespace rectangles (background in documents)

Hough-Transform for Line Detection

Assume we are given a set of points $(x_n, y_n)$ in the image plane. For all points on a line we must have

$$y_n = a_0 + a_1 x_n$$

If we want to determine the line, each point implies a constraint

$$a_1 = \frac{y_n}{x_n} - \frac{1}{x_n} a_0$$

The space spanned by the model parameters $a_0$ and $a_1$ is called model space, parameter space, or Hough space. Its implementation is called the accumulator array.

http://www.rob.cs.tu-bs.de/content/04-teaching/06-interactive/Hough.html

- points in data space on a straight line → lines in parameter space meet in one point
- many points → reliable estimate of the two parameters of the line
- line in data space → point in parameter space
- This transformation from the data space to the parameter space via a model equation is called the Hough transform.

Possible improvement: include orientation information at each point.

Problem with the slope-intercept parameterization: the slope of the line may become infinite.

Better parameterization: angle of the line with respect to the x-axis and distance from the coordinate center

$$d = x \sin\varphi + y \cos\varphi, \qquad \varphi \in [0, 2\pi), \quad d > 0$$

http://www.rob.cs.tu-bs.de/content/04-teaching/06-interactive/HNF.html

Extensions of the Hough Transform

Generally, the Hough transform is applicable to any parameterized curve or shape. However, the size of the accumulator grows exponentially with the number of parameters, so it is usually feasible only for a few parameters.

Example: circle detection, e.g. iris detection

Iris Detection [figure]

Generalized Hough Transform [figure]

Hough-Transform

Drawbacks of the Hough transform:
- computational effort, lots of storage for the accumulator
- discretization of model space:
  - bin size too small → solutions lost
  - bin size too large → solutions too inaccurate

History:
- slope-intercept variant patented in 1962 by Hough
- angle-distance variant published 1972 by Duda and Hart
- generalization to arbitrary shapes 1981 by Ballard

Generalized Hough Transform?

B. Leibe and B. Schiele: Interleaved Object Categorization and Segmentation. Proc. British Machine Vision Conference, Norwich, UK, 2003.
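
To make the accumulator idea for the angle-distance parameterization concrete, here is a minimal sketch in Python with NumPy. The function names `hough_lines` and `best_line`, the number of bins, and the distance limit `d_max` are assumptions made for this illustration and do not come from the lecture. Each point votes once per angle bin, and the fullest bin gives the line estimate.

```python
import numpy as np


def hough_lines(points, n_phi=360, n_d=100, d_max=100.0):
    """Vote in a (phi, d) accumulator using d = x*sin(phi) + y*cos(phi).

    phi is sampled in [0, 2*pi); only votes with 0 < d < d_max are kept.
    """
    phis = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    bin_width = d_max / n_d
    acc = np.zeros((n_phi, n_d), dtype=int)
    for x, y in points:
        # One candidate distance per sampled angle for this point.
        d = x * np.sin(phis) + y * np.cos(phis)
        valid = (d > 0) & (d < d_max)
        phi_idx = np.nonzero(valid)[0]
        d_idx = (d[valid] / bin_width).astype(int)
        # Each phi index occurs at most once here, so += counts correctly.
        acc[phi_idx, d_idx] += 1
    return acc, phis, bin_width


def best_line(acc, phis, bin_width):
    """Return (phi, d) at the center of the accumulator bin with most votes."""
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return phis[i], (j + 0.5) * bin_width


# Example: 50 points on the horizontal line y = 20.5.
points = [(float(x), 20.5) for x in range(50)]
acc, phis, bin_width = hough_lines(points)
phi, d = best_line(acc, phis, bin_width)
```

The bin sizes directly expose the discretization trade-off listed among the drawbacks: finer bins spread the votes of slightly noisy points over several cells, while coarser bins give a less precise estimate of the angle and distance.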
RAST

- Recognition by Adaptive Subdivision of Transformation Space
- branch-and-bound search in parameter space w.r.t. a given quality function
- related to adaptive Hough transform
- easily adaptable to new problems

RAST: Algorithm

- algorithm
  - keep a priority queue of parameter regions
  - use an upper bound of the quality estimate for each region
  - on the best region:
    - stop?
    - output?
    - split and insert new regions
- use interval arithmetic to determine bounds

RAST: Example Subdivision [figure]

Document Layout Analysis: Introduction

Page segmentation
- divides the document image into homogeneous zones, each consisting of only one physical layout structure (e.g. text, graphics, or pictures)
- an important basic step of document analysis systems
- many algorithms proposed over the last two decades
- comparative performance evaluation?

See also: R. Cattoni, T. Coianiz, S. Messelodi, C.M. Modena, "Geometric layout analysis techniques for document image understanding: a review," IRST, Trento, Italy, TR 9703-09, 1998.

Acknowledgment: This lecture is largely based on work by Faisal Shafait.

"Mostly-text pages are separated into columns, paragraph-blocks, text-lines, word, and characters. OCR converts the individual word or character images into a character code like ASCII or Unicode. But, there is much more to a text document than a string of symbols. Additional, and sometimes essential, information is conveyed by long-established conventions of layout and format, choice of type size and typeface, italics and boldface, and the two-dimensional arrangement of tables and formulas. To capture the whole meaning of a document, DIA must extract and interpret all of this subtle encoding." (Nagy 2000)

Introduction [figure]

Page Segmentation Approaches

area based:
- sensitive to spacing (e.g. fails on large-font titles and headlines)
- still requires text-line extraction

text-line based:
- insensitive to spacing, works on pages even with minimum spacing
- text-lines directly used for OCR

Area-Based Approach

Pipeline: Document Image → Whitespace Cover → Whitespace Evaluation → Page Segments

- principled approach
- hard to implement
- evaluation function not very document-centric
- oversegments (too many vertical cuts)

Text-Line-Based Approach

Pipeline: Document Image → Find Column Separators → Extract Text-Lines

- globally optimal whitespace cover computation
- very accurate column model
- works very well on clean text
- uses straight (non-curled) text-line model
- excludes noise from detected text-lines

Overview of Algorithms

Page segmentation algorithms:
- X-Y cut by Nagy et al.
- Smearing by Wong et al.
- Whitespace analysis by Baird
- Docstrum by O'Gorman
- Voronoi-diagram by Kise et al.
- Constrained text-line finding by Breuel
- Whitespace Cuts by Shafait et al.

("Dummy" algorithm for comparison: return the whole page as one segment)

Segmentation Results of an Example Image

[Figure panels: Whitespace, Whitespace Cuts, Text-Line, Smearing, Voronoi, Docstrum, X-Y cut]

Recursive X-Y Cuts Algorithm

G. Nagy, S. Seth, and M. Viswanathan, "A prototype document image analysis system for technical journals," Computer, vol. 25, no. 7, pp. 10-22, 1992.

- tree-based top-down algorithm
- root = entire document page
- leaves = final segmentation
- recursively split each node into two or more smaller rectangular zones
- at each step compute horizontal and vertical projection profiles
- if a valley in the profile is low and wide enough, split

Run-length Smearing Algorithm

K. Y. Wong, R. G. Casey, and F. M. Wahl, "Document analysis system," IBM Journal of Research and Development, vol. 26, no. 6, pp. 647-656, 1982.

- operate on the binary image separately in both dimensions
- white pixel runs are changed to black if shorter than a threshold
- first along x, then along y, then AND the two images, then smear in x-direction again with a smaller threshold
- effect: link connected components that are separated by only a few pixels of background (words, lines, but not blocks)
- connected component analysis determines zones
- classify as text-block if the height is not too large and the mean run-lengths are not too large

→ essentially morphological operations

Run-length Smearing Algorithm [figure]

Whitespace

H. S. Baird, "Background structure in document images," in Document Image Analysis, H. Bunke, P. Wang, and H. S. Baird, Eds., World Scientific, Singapore, 1994, pp. 17-34.

- find a set of maximal white rectangles (called covers) whose union completely covers the background
- covers are sorted with respect to the sort key $K(c)$

  $$K(c) = \sqrt{\mathrm{area}(c)} \cdot W\left(\left|\log_2 \frac{\mathrm{height}(c)}{\mathrm{width}(c)}\right|\right)$$

  the weighting function $W(\cdot)$ assigns higher weight to tall and long rectangles because they are supposed to be meaningful separators of text blocks
- add covers to form a background cover until the sort key is less than a threshold
- uncovered areas are blocks (not necessarily rectangular)

Docstrum

L.