Browsing Through High Quality Document

Browsing through High Quality Do cument Images with DjVu Patrick Haner Leon Bottou Paul G Howard Patrice Simard Yoshua Bengio and Yann Le Cun ATT LabsResearch Schultz Drive Red Bank NJ fhanerleonbpghpatriceyoshuayanngresearchattcom Abstract Even if pictures and drawings are scanned and in tegrated into the web page much of the visual asp ect We present a new image compression technique of the original do cument is likely to be lost Visual cal led DjVu that is specical ly geared towards the details including font irregularities pap er color and compression of highresolution highquality images of pap er texture are particularly imp ortant for histori scanned documents in color With DjVu any scr een cal do cuments and may also b e crucial in do cuments connected to the Internet can access and display im with tables mathematical or chemical formulae and ages of scannedpages while faithful ly reproducing the handwritten text font color drawing pictures and paper texture A typical magazine page in color at dpi can becom A simple alternativewould b e to scan the original pressed down to between to KB approximately page and simply compress the image as a JPEG or to times better than JPEG for a similar level of GIF le Unfortunately those les tend to be quite subjective quality BW documents are typical ly large if one wants to preserv e the readability of the to KBytes at dpi or to times better than text Compressed with JPEG a color image of a ecient version of CCITTG Arealtime memory typical magazine page scanned at dpi dots per the decoder was implemented and is available as a inch would be around KBytes to KBytes plugin for popular web browsers and would be barely readable The same page at Keywords digital libraries image compression im dpi would b e of acceptable quality but would o c age segmentation arithmetic coding wavelet coding cupy around KBytes Even worse not only would JBIG the decompressed image ll up the entire memory of an average PC but only a small p ortion of it would Intro duction b e visible on the screen at once A justreadable black As electronic storage retrieval and distribution of and white page in GIF would be around to do cuments b ecomes faster and cheap er libraries are KBytes b ecoming increasingly digital Recent studies have To summarize the current situation it is clear that shown that it is already less costly to store do cuments the complete digitization of the worlds ma jor library digitally than to provide for buildings and shelves to collections is only a matter of time In this context it house them Unfortunately the diculty of con seems paradoxical that there exist no universal stan verting existing do cuments to electronic form is a ma dard for ecient storage retrieval and transmission jor obstacle to the development of digital libraries of highquality do cument images in color Existing do cuments are usually retyp ed and con verted to HTML or Adob es PDF format a tedious Tomake remote access to digital libraries a pleasant and exp ensive task sometimes facilitated by the use exp erience pages must app ear on the screen after only of Optical Character Recognition OCR While the a few seconds delay Assuming a kilobits p er second accuracy of OCR systems has b een steadily improv kbps connection this means that the most relevant ing over the last decade they are still far from b eing parts of the do cument the text must b e compressed able to translate faithfully a scanned do cumentinto a down to ab out to KBytes With a progressive computerreadable format without extensive manual compression technique the text would b e transmitted correction and displayed rst Then the pictures drawings and backgrounds would b e transmitted and displayed im background The p erformance of b oth of these meth proving the quality of the image as more bits arrive o ds heavily relies on a new adaptive binary arithmetic The overall size of the le should b e on the order of co ding technique called the Zco der also briey de to KBytes to keep the overall transmission time scrib ed Section intro duces the plugin that allows and storage requirements within reasonable b ounds to browse DjVu do cuments through standard appli cations such as Netscap e Navigator or Microsoft Ex Another p eculiarity of do cument images their large plorer with a DjVu plugin Comparative results on a size makes current image compression techniques in wide variety of do cumenttyp es are given in Section appropriate A magazinesize page at dots per inch is pixel high and pixel wide Uncom ImageBased Digital Libraries pressed it o ccupies MBytes of memory more than The imagebased approach to digital libraries is what the average PC can handle A practical do cu to store and to transmit do cuments as images To ment image viewer would need to keep the image in a achieve that we need to devise a metho d for com compressed form in the memory of the machine and pressing do cument images that makes it p ossible to only decompress ondemand the part of the image that transfer a highquality page over lowsp eed links mo is displayed on the screen dem or ISDN in a few seconds OCR would only b e The DjVu do cument image compression technique used for indexing with less stringent constraints on describ ed in this pap er is an answer to all the ab ove accuracy problems With DjVu scanned pages at dpi in full Several authors have prop osed imagebased ap color can b e compressed down to to KBytes les proaches to digital libraries The most notable ex from MBytes originals with excellent quality Black ample is the RightPages system The RightPages and white pages typically o ccupy to KBytes system was designed for do cument image transmission once compressed This puts the size of highquality over a lo cal area network It was used for some years scanned pages in the same order of magnitude as an bytheATT Bell Labs technical community and dis average HTML page KBytes according to the lat tributed commercially for customized applications in est statistics DjVu pages are displayed within the several countries The absence of a universal and op en browser window through a plugin The DjVu plugin platform for networking and bro wsing suchastodays allows easy panning and zo oming of very large images Internet limited the dissemination of the RightPages This is made p ossible by an onthey decompression system Similar prop osals have b een made more re metho d whichallows images that would normally re cently quire MBytes of RAM once decompressed to re All of the ab ove imagebased approaches and most quire only MBytes of RAM commercially available do cument image management The basic idea b ehind DjVu is to separate the text systems are restricted to black and white bilevel from the backgrounds and pictures and to use dier images This is adequate for technical and business ent techniques to compress each of those comp onents do cuments but insucient for other typ es of do cu Traditional metho ds are either designed to compress ments such as magazines catalogs or historical do c natural images with few edges JPEG or to compress uments Many formats exist for co ding bilevel do c black and white do cument images almost entirely com ument images notably the CCITT G and G fax p osed of sharp edges CCITT G G and JBIG standards the recent JBIG standard and the up The DjVutechnique improves on b oth and combines coming JBIG standard Using ATTs prop osals to the b est of b oth approaches the JBIG standard images of a t ypical black and white page at dpi dots p er inch can b e transmit Section reviews the currentavailable compression ted over a mo dem link in a few seconds Work on and displaying technologies and states the require bilevel image compression standards is motivated by ments for do cument image compression Section de the fact that until recently do cument images were scrib es the DjVu metho d of separately co ding the text primarily destined to b e printed on pap er Most low and drawings on one hand and the pictures and back cost printer technologies excel at printing bilevel im grounds on the other hand Section turns this idea ages but they must rely on dithering and halftoning into an actual image compression format It starts to print greylevel or color images thus reducing their with a description of the metho d used by DjVutoen eective resolution co de the text and the drawings This metho d is a variation of ATTs prop osal to the new JBIG fax The low cost and availability of highresolution standard It also includes a description of the IW color displays is causing more and more users to rely waveletbased compression metho d for pictures and on their screen rather than on their printer to dis play do cument images Even mo dern lowend PCs can display x pixel images with bits per pixelbitsperRGB comp onent while highend PCs and workstations can display x at bits p er pixel Most do cuments displayed in bilevel mo de are readable at dpi but are not pleasanttoread At dpi the quality is quite acceptable in bilevel mo de Displaying an entire x lettersize page at such high resolution requires a screen resolution of pix els vertically and pixels horizontallywhichisbe yond traditional display technology Fortunately us ing color or graylevels when displaying do cument im ages at lower resolutions drastically improves readabil ity and sub jective quality Most do cuments are read able when displayed at dpi on a color or greyscale display Only do cuments with particularly small fonts require dpi for eortless readability At dpi a typical

Load more