PDF/A for Scanned Documents

Total Pages: 16

File Type: PDF, Size: 1020 KB

PDF/A for Scanned Documents Webinar
Paper Becomes Digital

Mark McKinney, President, LuraTech, Inc.
Armin Ortmann, CTO, LuraTech
© 2009 PDF/A Competence Center, www.pdfa.org

Existing Solutions for Scanned Documents

Black & White: TIFF G4
Color: Mostly JPEG, but sometimes PNG, BMP and other raster graphics formats
Often special format variants like “JPEG in TIFF”

Disadvantages:
• Several formats already for scanned documents
• Even more formats for born-digital documents
• Loss of information, e.g. with TIFF G4
• Bad image quality and huge file size, e.g. with JPEG
• No standardized metadata across all formats
• Not full-text searchable (OCR) inside the files

Formats in use (Black/White and Color): TIFF FAX G4, TIFF, TIFF LZW, JPEG, PDF

Existing Solutions for Scanned Documents: Bad image quality vs. file size

Format      File size
TIFF/BMP    23.8 MB
JPEG        180 kB
TIFF G4     60 kB

Alternative Solution: PDF

PDF is already widely used to:
• Unify file formats
  - Image → PDF
  - “Office” documents → PDF
  - Other sources → PDF
• Create full-text searchable files
• Apply modern compression technology (e.g. the JPEG2000 family of file formats)
• Harmonize metadata

Conclusion: PDF avoids the disadvantages of the legacy formats

“So if you are already using PDF as an archival format, why not use PDF/A with its many advantages?”

PDF/A

What is PDF/A?
• ISO 19005-1, Document Management
• Electronic document file format for long-term preservation

Goals of PDF/A:
• Maintain static visual representation of documents
• Consistent handling of metadata
• Option to maintain structure and semantic meaning of content
• Transparency to guarantee access
• Limit the number of restrictions
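To put the file sizes quoted on the "Bad image quality vs. file size" slide into proportion, the short Python sketch below works out how many times smaller each format is than the TIFF/BMP file. It is only an illustration: the function name is made up for this example, and it assumes 1 MB = 1000 kB, which the slide does not specify.

```python
# File sizes quoted on the slide, converted to kB.
# Assumption: 1 MB = 1000 kB (the slide does not say which convention it uses).
SIZES_KB = {
    "TIFF/BMP": 23.8 * 1000,
    "JPEG": 180,
    "TIFF G4": 60,
}


def compression_ratios(sizes_kb, baseline):
    """Return how many times smaller each entry is than the baseline entry."""
    base = sizes_kb[baseline]
    return {name: base / size for name, size in sizes_kb.items()}


ratios = compression_ratios(SIZES_KB, "TIFF/BMP")
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}x smaller than TIFF/BMP")
```

The ratios say nothing about image quality; the slide's point is that the small TIFF G4 file loses the color information and the JPEG file loses legibility.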
PDF/A – Full-Text Searchability (OCR)

Benefit: Searchable at the File Level
• Digital Library - “after book download”
• Large Manuals / Multi-Page Construction Documents
• Downloaded Documents from Archive Databases
• Documents sent to customers, suppliers, lawyers, etc. as email attachments

PDF/A – Enhanced Compression: For Black & White Documents

• JBIG2 - ISO/IEC 14492
• Used as an alternative to TIFF G4
• Lossless and visually lossless modes
• Embedded in PDF/A, available in Acrobat Reader

Format            File size
FAX G4            60 kB
JBIG2/lossless    46 kB
JBIG2/lossy       29 kB

PDF/A – Enhanced Compression: For Color Documents

• MRC (Mixed Raster Content) compression, standardized in the JPEG2000 family as JPM
• Splits documents into three layers that are compressed independently and stored in PDF/A

PDF/A – Enhanced Compression: For Color Documents

• Extreme compression, fully legible
• Preserves the color and the visual quality

Format     File size
TIFF       23.8 MB
TIFF G4    60 kB
JPEG       180 kB
PDF/A      65 kB

PDF Compressor Basics: How it works

[Diagram labels: Paper, Scanner, TIFF, JPEG, LuraDocument PDF Compressor, Conversion and Optimization Process, PDF, Network / Workflow, Storage / ECM]

• Convert scanned documents
• Batch conversion “unattended”
• Fully automated

Demo

Armin, let’s have a look!

Question:

PDF/A: hype or the future archiving format?
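The three-layer split mentioned on the MRC slide can be illustrated with a toy Python sketch. The slides do not name the layers; the usual MRC model uses a binary mask, a foreground layer and a background layer, and that is what is assumed here. The threshold, the fill values and the function names are hypothetical choices for this illustration, not LuraTech's segmentation method.

```python
# Toy illustration of the MRC (Mixed Raster Content) layer model.
# Assumptions (not from the slides): a grayscale page stored as a list of
# rows, a fixed darkness threshold for "text", and simple fill values for
# the pixels that a layer does not cover.

TEXT_THRESHOLD = 128  # hypothetical cut-off: darker pixels count as text


def split_layers(page, threshold=TEXT_THRESHOLD):
    """Split a page into (mask, foreground, background) layers."""
    mask = [[1 if px < threshold else 0 for px in row] for row in page]
    # Foreground keeps the text pixels; everywhere else it is filled with 0.
    foreground = [[px if m else 0 for px, m in zip(row, mrow)]
                  for row, mrow in zip(page, mask)]
    # Background keeps the remaining pixels; text positions are filled with 255.
    background = [[255 if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(page, mask)]
    return mask, foreground, background


def merge_layers(mask, foreground, background):
    """Rebuild the page: the mask picks foreground or background per pixel."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, foreground, background)]


page = [
    [250, 250, 30, 250],
    [240, 20, 25, 245],
    [200, 180, 190, 210],
]
mask, fg, bg = split_layers(page)
restored = merge_layers(mask, fg, bg)
print(mask)
print(restored == page)
```

The round trip is exact here only because nothing is compressed. In an actual MRC workflow each layer would be handed to a coder suited to its content, typically a bi-level coder such as JBIG2 for the mask and a continuous-tone coder for the two color layers, which is where the size savings come from.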
PDF/A – Example e-Government

Medical and Student Records
State of New York Long-term Archive
Department of Health
Department of Education

Project Outline
• Previously using 1 terabyte of storage every 2 weeks
• Capture all documents through a scan service provider using Fujitsu and Kodak scanners
• Convert images to optimized PDF/A with LuraDocument PDF Compressor
• Deliver and store PDF/A documents with ECM

Results
• Highly compressed PDF/A files reduce storage costs and bandwidth needs by 90%
• Long-term readability of all files with retention time of over 40 years
• Files are now available quickly for daily research

AIIM 2008: Best Practices Award
GTC West 2008: Best Solutions Award

PDF/A – Example Credit Files

Mailroom for credit files and international checks
Example: HeLaBa (German State Bank)
Revenue: 168B Euros
Employees: 5,700

Project Outline
• Convert a paper-based archive of 20 million pages to PDF/A
• Convert all daily incoming mail to PDF/A
• Create complete electronic credit files
• Tools used: LuraTech PDF Compressor, Kofax Ascent, EMC Centera, Wincor Nixdorf archive:net (Taxnet)

Results
• Full-color scans in electronic archive
• Highly compressed PDF/A files
• Full-text searchable credit files
• Long-term readability of credit files
• First step on the way to a single archiving format

Billions of Pages Preserved

• Airbus (D)
• AOK (D)
• APO-Bank (D)
• Bank Julius Baer (CH)
• Blohm & Voss (D)
• Bosch Rexroth (D)
• British Library (UK)
• City of Arlington (USA)
• City of Toronto (CA)
• DAK Insurance (D)
• Department of Defense (USA)
• Harvard Library (USA)
• Het Utrechts Archief (NL)
• International Labor Organization (CH)
• Library of Congress (USA)
• OCE (NL/D)
• RWE Energy (D)
• Siemens (D)
• Southern Nuclear (USA)
• Southern CA Edison (USA)
• West LB (D)
• Sparkassen Informatik (D)
• State of New York (USA)
• Swiss RE (CH)
• Universa Insurance (D)
• Vattenfall (D)

A few of the projects that LuraTech knows about.

PDF/A for Scanned Documents

Thanks for your interest!
Please fill out our questionnaire.
Demo software or more information? [email protected]