Memory-Efficient Algorithms for Raster Document Image
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Optimization of Image Compression for Scanned Document's Storage
ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print) IJCST VOL . 4, Iss UE 1, JAN - MAR C H 2013 Optimization of Image Compression for Scanned Document’s Storage and Transfer Madhu Ronda S Soft Engineer, K L University, Green fields, Vaddeswaram, Guntur, AP, India Abstract coding, and lossy compression for bitonal (monochrome) images. Today highly efficient file storage rests on quality controlled This allows for high-quality, readable images to be stored in a scanner image data technology. Equivalently necessary and faster minimum of space, so that they can be made available on the web. file transfers depend on greater lossy but higher compression DjVu pages are highly compressed bitmaps (raster images)[20]. technology. A turn around in time is necessitated by reconsidering DjVu has been promoted as an alternative to PDF, promising parameters of data storage and transfer regarding both quality smaller files than PDF for most scanned documents. The DjVu and quantity of corresponding image data used. This situation is developers report that color magazine pages compress to 40–70 actuated by improvised and patented advanced image compression kB, black and white technical papers compress to 15–40 kB, and algorithms that have potential to regulate commercial market. ancient manuscripts compress to around 100 kB; a satisfactory At the same time free and rapid distribution of widely available JPEG image typically requires 500 kB. Like PDF, DjVu can image data on internet made much of operating software reach contain an OCR text layer, making it easy to perform copy and open source platform. In summary, application compression of paste and text search operations. -
Caminova HC-PDF Creation Using Djvu Segmentation Technology
CaminovaHC-PDF Creation using DjVu Segmentation technology abiding by PDF ISO standard PDF Specification Portable Document Format (PDF) is a file format created by Adobe Systems in 1993 for document exchange. PDF is used for representing two-dimensional documents in a manner independent of the application software, hardware, and operating system. Each PDF file encapsulates a complete description of a fixed-layout 2D document that includes the text, fonts, images, and 2D vector graphics which compose the documents. Formerly a proprietary format, PDF was officially released as an open standard on July 1, 2008, and published by the International Organization for Standardization as ISO/IEC 32000-1:2008. Raster images in PDF Raster images in PDF (called Image XObjects) are represented by dictionaries with an associated stream. The dictionary describes properties of the image, and the stream contains the image data. (Less commonly, a raster image may be embedded directly in a page description as an inline image) Images are typically filtered for compression purposes. Image filters supported in PDF include the general purpose filters: ? ASCII85Decode a deprecated filter used to put the stream into 7-bit ASCII ? ASCIIHexDecode similar to ASCII85Decode but less compact ? FlateDecode a commonly used filter based on the DEFLATE or Zip algorithm ? LZWDecode a deprecated filter based on LZW Compression ? RunLengthDecode a simple compression method for streams with repetitive data using the Run-length encoding algorithm and the image-specific filters: ? DCTDecode a lossy filter based on the JPEG standard ? CCITTFaxDecode a lossless filter based on the CCITT fax compression standard ? JBIG2Decode a lossy or lossless filter based on the JBIG2 standard, introduced in PDF 1.4 Copyright c2011 Caminova,Inc. -
Djvu: a Compression Method for Distributing Scanned Documents in Color Over the Internet
DjVu: a Compression Method for Distributing Scanned Documents in Color over the Internet Yann LeCun, Leon´ Bottou, Patrick Haffner and Paul G. Howard AT&T Labs-Research 100 Schultz Drive Red Bank, NJ 07701-7033 yann,leonb,haffner,pgh ¡ @research.att.com ¿ to preserve the readability of the text. Compressed with JPEG, a color image of a typical magazine page scanned at Abstract 100dpi (dots per inch) would be around 100 KB to 200 KB, We present a new image compression technique called and would be barely readable. The same page at 300dpi “DjVu” that is specifically geared towards the compression would be of acceptable quality, but would occupy around of scanned documents in color at high revolution. DjVu en- 500 KB. Even worse, not only would the decompressed able any screen connected to the Internet to access and dis- image fill up the entire memory of an average PC, but play images of scanned pages while faithfully reproducing only a small portion of it would be visible on the screen at the font, color, drawings, pictures, and paper texture. With once. A just-readable black and white page in GIF would DjVu, a typical magazine page in color at 300dpi can be be around 50 to 100 KB. compressed down to between 40 to 60 KB, approximately To make remote access to digital libraries a pleasant 5 to 10 times better than JPEG for a similar level of subjec- experience, pages must appear on the screen after only tive quality. A real-time, memory efficient version of the a few seconds delay. -
JPEG2000-Matched MRC Compression of Compound Documents
JPEG2000-Matched MRC Compression of Compound Documents Debargha Mukherjee, Christos Chrysafis1, Amir Said Imaging Systems Laboratory HP Laboratories Palo Alto HPL-2002-165 June 6th , 2002* JPEG2000, The Mixed Raster Content (MRC) ITU document compression standard MRC, JBIG, (T.44) specifies a multilayer decomposition model for compound wavelets, documents into two contone image layers and a binary mask layer for mask, independent compression. While T.44 does not recommend any procedure scanned for decomposition, it does specify a set of allowable layer codecs to be used after decomposition. While T.44 only allows older standardized document codecs such as JPEG/JBIG/G3/G4, higher compression could be achieved if newer contone and bi-level compression standards such as JPEG2000/JBIG2 were used instead. In this paper, we present a MRC compound document codec using JPEG2000 as the image layer codec and a layer decomposition scheme matched to JPEG2000 for efficient compression. JBIG still codes the mask. Noise removal routines enable efficient coding of scanned documents along with electronic ones. Resolution scalable decoding features are also implemented. The segmentation mask obtained from layer decomposition, serves to separate text and other features. * Internal Accession Date Only Approved for External Publication ãCopyright IEEE 1 Divio Inc. To be published in and presented at IEEE International Conference on Image Processing, 22-25 September 2002, Rochester, New York JPEG2000-MATCHED MRC COMPRESSION OF COMPOUND DOCUMENTS Debargha Mukherjee, Christos Chrysafis, Amir Said Compression and Multimedia Technologies Group Hewlett Packard Laboratories Palo Alto, CA 94304 ABSTRACT redundant initially, if the decomposition is intelligent enough, The Mixed Raster Content (MRC) ITU document the three layers when compressed individually can yield a very compression standard (T.44) specifies a multilayer compact and high quality representation of the compound decomposition model for compound documents into two document. -
Compression of the Image General Schemes for Application Of
Compression of the image Adolf Knoll National Library of the Czech Republic General schemes for application of compression The schemes adapt to the character of the represented objects: ¡ Bitonal image (1-bit, black-and-white) ¡ Colour photorealistic image ¡ Mixed document (two above-mentioned components) 1 Trends n Bitonal ¡ from CCITT Gr. Fax 3 and 4 to JBIG variants n Fotorealistický ¡ Lossless compression: PNG, TIFF/LZW ¡ Lossy: from JPEG DCT to wavelet n Mixed document ¡ Both applied (Mixed Raster Content – usually vertically) 2 How is it built into formats? n Trying to have it in ISO TIFF (even JPEG, LZW, or PNG) – but it is not enough due to lack of tools for conversion and display. n That is why the other more suitable formats are used: JPEG, PNG n That is why there is a lot of development in the area of mixed formats – they do not aim to become ISO Relevant directions n Bitonal image ¡ JBIG2 (ISO) – no support (exc. Xerox), but many similar activities n Photorealistic image ¡ wavelet JPEG2000 and many other non- ISO initiatives (WI, LWF, IW44, SID, Imagepower IW, … ) n Mixed content ¡ DjVu, LDF, Imagepower MRC Aims n Image Archiving n Image Delivery ¡ standardized ¡ More efficient archival format modern format (TIFF, JPEG, PNG, (JB2, MrSID, DjVu, … ) LDF, … ) Which relationship will be between both of them? It will be defined by the goal of the project. 3 Around compression n Pre-processing of the image n Compression n Encoding in a format n De-coding from the format n De-compression n Display – print-out Pre-processing of the bitonal image - I n Efficient schemes are built on possibilities to apply vocabularies of pixel chunks/groups: ¡ E.g. -
Text Segmentation for MRC Document Compression Eri Haneda, Student Member, IEEE, and Charles A
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 6, JUNE 2011 1611 Text Segmentation for MRC Document Compression Eri Haneda, Student Member, IEEE, and Charles A. Bouman, Fellow, IEEE Abstract—The mixed raster content (MRC) standard (ITU-T T.44) specifies a framework for document compression which can dramatically improve the compression/quality tradeoff as compared to traditional lossy image compression algorithms. The key to MRC compression is the separation of the document into foreground and background layers, represented as a binary mask. Therefore, the resulting quality and compression ratio of a MRC document encoder is highly dependent upon the segmentation algorithm used to compute the binary mask. In this paper, we pro- pose a novel multiscale segmentation scheme for MRC document encoding based upon the sequential application of two algorithms. The first algorithm, cost optimized segmentation (COS), is a blockwise segmentation algorithm formulated in a global cost opti- Fig. 1. Illustration of MRC document compression standard mode 1 structure. mization framework. The second algorithm, connected component An image is divided into three layers: a binary mask layer, foreground layer, classification (CCC), refines the initial segmentation by classifying and background layer. The binary mask indicates the assignment of each pixel feature vectors of connected components using an Markov random to the foreground layer or the background layer by a “1” (black) or “0” (white), field (MRF) model. The combined COS/CCC segmentation algo- respectively. Typically, text regions are classified as foreground while picture rithms are then incorporated into a multiscale framework in order regions are classified as background. Each layer is then encoded independently to improve the segmentation accuracy of text with varying size. -
Universal Image Format (UIF)
1 2 A Project of the PWG IPPFAX Working Group 3 Universal Image Format (UIF) 4 5 IEEE-ISTO Printer Working Group 6 Draft Standard 5102.2-D0.76 7 8 July 25October 16, 2001 9 10 ftp://ftp.pwg.org/pub/pwg/QUALDOCS/uif-spec-067.pdf, .doc, .rtf 11 12 Abstract 13 This standard specifies the an extension to TIFF-FX known as Universal Image Format (UIF) 14 by formally defining a series of TIFF-FX “profiles” distinguished primarily by the method of 15 compression employed and color space used. The UIF requirements [7] are derived from the 16 requirements for IPPFAX [8] and Internet Fax [9]. 17 In summary UIF is a raster image data format intended for use by, but not limited to, the 18 IPPFAX protocol, which is used to provide a synchronous, reliable exchange of image 19 Documents between Senders and Receivers. UIF is based on the TIFF-FX specification [4], 20 which describes the TIFF (Tag Image File Format) representation of image data specified by 21 the ITU-T Recommendations for black-and-white and color facsimile. 22 This document (1) formally defines a series of “UIF profiles” distinguished primarily by the 23 method of compression employed and color space used; (2) describes the use of CONNEG in 24 capabilities communication between two UIF-enabled Implementations; and (3) defines a set 25 of baseline capabilities that permits a CONNEG implementation to be OPTIONAL. 26 This document is a draft of an IEEE-ISTO PWG Proposed Standard and is in full conformance with all 27 provisions of the PWG Process (see: ftp//ftp.pwg.org/pub/pwg/general/pwg-process.pdf).