MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com
Text Super-Resolution: A Bayesian Approach
Gerald Dalley, Bill Freeman, Joe Marks TR2003-147 October 2004
Abstract We address the problem of text super-resolution: given an image of text scanned in at low resolution from a piece of paper, return the image that is most likely to be generated from a noiseless high-resolution scan of the same piece of paper. In doing so, we wish to: (1) avoid introducing artifacts in the high-resolution image such as blurry edges and rounded corners, (2) recover from quantization noise and grid-alignment effects that introduce errors in the low-resolution image, and (3) handle documents with very large glyph sets such as Japanese Kanji. Applications for this technology include improving the display of fax documents, low-resolution scans of archival documents, and low-resolution bitmapped fonts on high-resolution output devices.
IEEE International Conference on Image Processing (ICIP)
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.
Copyright © Mitsubishi Electric Research Laboratories, Inc., 2004
201 Broadway, Cambridge, Massachusetts 02139
TEXT SUPER-RESOLUTION: A BAYESIAN APPROACH
Gerald Dalley1,2, Bill Freeman1,2, Joe Marks1
1Mitsubishi Electric Research Labs, Cambridge, MA [email protected] 2CSAIL, Massachusetts Institute of Technology, Cambridge, MA {dalleyg,billf}@ai.mit.edu
ABSTRACT

We address the problem of text super-resolution: given an image of text scanned in at low resolution from a piece of paper, return the image that is most likely to be generated from a noiseless high-resolution scan of the same piece of paper. In doing so, we wish to: (1) avoid introducing artifacts in the high-resolution image such as blurry edges and rounded corners, (2) recover from quantization noise and grid-alignment effects that introduce errors in the low-resolution image, and (3) handle documents with very large glyph sets such as Japanese Kanji. Applications for this technology include improving the display of fax documents, low-resolution scans of archival documents, and low-resolution bitmapped fonts on high-resolution output devices.

Fig. 1. Bayesian Network Model. (Nodes: the true shape S, hidden; the low-res grayscale patch L; the high-res grayscale patch H; the binary testing patch T; the binary final high-res patch F.)

1. INTRODUCTION

Some early work on fully automatic super-resolution of text was done by Ulichney and Troxel [1]. They use a fixed set of heuristic templates to model local continuous shape. While doing so, they verify that neighboring shape assignments are mutually compatible and make local straight-line approximations.

Several pieces of more-recent work on super-resolution [2, 3, 4] and compression [5] have centered on inferring shape by clustering glyphs seen in a document. For a document containing Latin characters, Hobby and Ho [3] generate nearly 300 glyph clusters. Ideally, each cluster contains all instances of a single glyph. During the clustering process, a high-resolution model of each glyph is generated by averaging polygonal shape approximations. Some care is taken to try to preserve sharp corners [2]. Bern and Goldberg expand on Hobby and Ho's approach by modeling the scanning process probabilistically and using slightly different algorithms at each step.

Thouin and Chang [6] use nonlinear optimization on a grayscale input image to minimize a Bimodal Smoothness Average (BSA) score. To be bimodal, the high-resolution image should be black and white with few gray pixels. To be considered smooth, its locally estimated second derivatives should be small. For the "average" measure to be small, the average of high-resolution pixels should be close to the value of the corresponding pixel in the low-resolution image. This method is 1-2 orders of magnitude slower than cubic-spline interpolation, and works well in areas away from sharp corners and thin curves. Thouin, Du, and Chang also experimented with a Markov Random Field formulation, with results of similar quality [7].

Kim presents a time-efficient method that is extensible to large glyph sets at low resolution [8]. He begins by obtaining a training set consisting of a pair of registered black-and-white images at both low and high resolution. This pair of images is then scanned to build a database that indicates which high-res patch should be output given an input low-res patch. When using this database, any time an input low-res patch occurs that was not present in the training image, a weighted k-nearest-neighbor search is performed to estimate the best high-res output patch. Kim uses this approach for zooming noiseless 300 dpi, 12 pt. Times text to 600 dpi, for zooming handwritten text, and for several nonzooming image-processing tasks. He also presents error bounds, confidence intervals, and inductive bias measures.

We employ a training-based method, similar to Kim's, but adopt a fully Bayesian approach with an explicit noise model. To handle noise and large glyph sets while limiting the training set size, we utilize grayscale training images. A side effect of our approach is that we trade off training time so that super-resolution can be done via simple table lookup, without any nearest-neighbor searches at runtime. We perform most of our experiments with fax-resolution (200 dpi) inputs.

2. METHOD

We pose the problem in the Bayesian framework depicted in Figs. 1 and 2. This network is for a pair of image patches from a high-res and a low-res image. For example, in the low-res image we might be analyzing 5 × 5 patches to be able to produce a single pixel in the high-res image. S is the true shape of the image and is a hidden node, meaning that we never know and will not even try to estimate what that shape is. We will discuss the other nodes of the network in the following sections.

Fig. 2. Pictorial representation of our Bayesian framework. The patch of interest for the low-res images is highlighted with a dotted line on the large letter 'm'. The true shape of the patch is the hidden state S. It has been quantized in space onto the 2 × 2 high-res and 5 × 5 low-res image windows, resulting in H and L, respectively. Note that because the high- and low-resolution processes are independent when conditioned on S, the amount of area each window covers may be different, as shown in this figure. H and L are then quantized in intensity to produce the bilevel images F and T. The task is to infer the most likely F from T.

2.1. Training

During training, we are given pairs of grayscale low- and high-resolution image patches (L and H) that have been deterministically generated from the true glyph shape (S). As a simplification, we assume that when the underlying shape (S) is scanned to a binary image (named T for "test" or F for "final") during testing, the probability that a particular pixel will be quantized to white is equal to the grayscale value of that pixel in the training image (L or H). We assume that the quantization of each pixel is independent of all the others.
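The per-pixel quantization model just described can be sketched in a few lines. This is an illustrative sketch, not the authors' code; it assumes patches are stored as NumPy arrays of grayscale values in [0, 1], with white encoded as 1, and the function names (`quantize`, `patch_likelihood`) are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(gray_patch, rng):
    """Sample a bilevel patch: each pixel is white (1) with probability
    equal to its grayscale value, independently of all other pixels."""
    return (rng.random(gray_patch.shape) < gray_patch).astype(np.uint8)

def patch_likelihood(gray_patch, binary_patch):
    """P(T | L = l): product over pixels of l_i when T_i = 1, (1 - l_i) when T_i = 0."""
    p = np.where(binary_patch == 1, gray_patch, 1.0 - gray_patch)
    return float(np.prod(p))

l = np.array([[0.9, 0.1], [0.5, 0.0]])  # toy 2x2 grayscale low-res patch
t = quantize(l, rng)                    # one sampled bilevel patch T
print(patch_likelihood(l, t))           # P(T | L = l) for that sample
```

The same `patch_likelihood` form applies to the high-res pair (h, F) by symmetry of the model.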
A more complete model might consider shape rotation, shape placement relative to the imaging grid, nonuniform placement of individual imaging sensors, and sensor noise.

To be more precise,

P(T \mid L = l) = \prod_{i=1}^{|L|} \bigl( l_i \, \delta(1, T_i) + (1 - l_i) \, \delta(0, T_i) \bigr)    (1)

P(F \mid H = h) = \prod_{j=1}^{|F|} \bigl( h_j \, \delta(1, F_j) + (1 - h_j) \, \delta(0, F_j) \bigr)    (2)

where δ(·, ·) is the Kronecker delta function, |L| is the number of pixels in a low-res patch, l_i is the grayscale value of patch l at pixel location i, T_i is the binary pixel value at location i, |F| is the number of pixels in a high-res patch, h_j is the grayscale value of patch h at pixel location j, and F_j is the binary pixel value at location j. As indicated in Fig. 1, we assume that the quantization processes producing F and T are independent.

Ultimately, we wish to estimate P(F | T). Applying Bayes' rule and marginalizing over S, L, and H, we obtain:

P(F \mid T) = \int_S \int_L \int_H \frac{P(T \mid L) \, P(F \mid H) \, P(L, H \mid S) \, P(S)}{P(T)}    (5)

            = \int_S \int_L \int_H \frac{P(T \mid L) \, P(F \mid H) \, P(L, H, S)}{P(T)}    (6)

Since rendering to grayscale is a deterministic process, the joint PDF of L, H, and S is a delta function. To estimate P(F | T), we then need:

\hat{P}(F \mid T) = \frac{1}{N \hat{P}(T)} \sum_{s=1}^{N} P\bigl(T \mid L = l^{(s)}\bigr) \, P\bigl(F \mid H = h^{(s)}\bigr)    (7)

where s indexes the training samples, N is the number of training samples, (l^{(s)}, h^{(s)}) is a training sample pair, and we estimate P(T) as follows:

\hat{P}(T) = \frac{1}{N} \sum_{s=1}^{N} P\bigl(T \mid L = l^{(s)}\bigr)    (8)

2.2. Complexity Reduction
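Before any complexity reduction, the estimator of Eqs. (7) and (8) admits a direct sketch. One convenient consequence of those equations is that, because P(F | H = h) in Eq. (2) has per-pixel mean h_j, the posterior probability that high-res pixel j is white reduces to a likelihood-weighted average of the training high-res patches. The sketch below is illustrative only (not the authors' implementation), and all array and function names are our assumptions:

```python
import numpy as np

def posterior_white(T, train_low, train_high):
    """Estimate P(F_j = 1 | T) per Eqs. (7)-(8) as a likelihood-weighted
    average of the training high-res patches.

    T          : (ph, pw) binary observed low-res patch
    train_low  : (N, ph, pw) grayscale low-res training patches l^(s)
    train_high : (N, qh, qw) grayscale high-res training patches h^(s)
    """
    # Per-pixel factors of P(T | L = l^(s)) from Eq. (1), for every sample s
    factors = np.where(T[None] == 1, train_low, 1.0 - train_low)
    likelihoods = factors.reshape(len(train_low), -1).prod(axis=1)
    # Normalize by sum_s P(T | l^(s)) = N * Phat(T), per Eq. (8)
    weights = likelihoods / likelihoods.sum()
    # E[F | T] = sum_s w_s h^(s); entry j is P(F_j = 1 | T)
    return np.tensordot(weights, train_high, axes=1)
```

Thresholding the returned array at 0.5 then gives a per-pixel maximum a posteriori estimate of the binary high-res patch.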