Two Diet Plans for Fat PDF
Total Page:16
File Type:pdf, Size:1020Kb
Two Diet Plans for Fat PDF Thomas A. Phelps and Robert Wilensky University of California, Berkeley [email protected], [email protected] ABSTRACT pdfTeX, and various OCR engines. Predictably not every one As Adobe's Portable Document Format has exploded in popularity generates the absolutely most space efficient PDF file. Initially so too has the number PDF generators, and predictably the quality Adobe software was the primary way to generate PDF. First the of generated PDF varies considerably. This paper surveys a range user "printed" to a PostScript file, which was the universal way of of PDF optimizations for space, and reports the results of a tool communicating with printers and therefore nearly every that can postprocess existing PDFs to reduce file sizes by 20 to application could produce PostScript, and then "distilled" the 70% for large classes of PDFs. (Further reduction can often be PostScript to PDF with Adobe Distiller. Distiller is engineered by obtained by recoding images to lower resolutions or with newer the company that invented PostScript and has a long history of compression methods such as JBIG2 or JPEG2000, but those expertise with graphics- and font-related applications, and thus the operations are independent of PDF per se and not a component of user could depend on a certain level of quality. Rather than the results reported here.) A new PDF storage format called distilling, it is better for an application to directly write PDF in "Compact PDF" is introduced that achieves for many classes of order to better capture the source document's semantics and in PDF an additional reduction of 30 to 60% beyond what is possible order to take advantage of technical features in PDF that are not in in the latest PDF specification (version 1.5, corresponding to PostScript, such as gradients. However, if a PDF generator is just Acrobat 6); for example, the PDF 1.5 Reference manual shrinks one of many features of a large application, then as a shipping from 12.2MB down to 4.2MB. The changes required by Compact deadline approaches refinements of a basically working PDF to the PDF specification and to PDF readers are easily subsystem are not high priority. understood and straightforward to implement. Second, even for those PDF generators and libraries primarily Categories and Subject Descriptors concerned with PDF, the amount of work to track the PDF E.3 [Coding and Information Theory]: Data compaction and specification is enormous and ongoing. Adobe regularly compression improves PDF by adopting new technology, such as JBIG2 over CCITT Fax, JPEG2000 over JPEG, and Flate over LZW, and General Terms compressible object streams over individual uncompressed top- Algorithms, Measurement, Documentation, Languages level objects. The increasing sophistication of PDF is reflected in Keywords PDF Reference — which as of version 1.5 stands at 1,100 pages, PDF, Compression, Multivalent, Compact PDF and incorporates by reference several other large, complex specifications such as JPEG2000. Moreover, some PDF features interact with one another and multiply complexity. For example, MOTIVATION on top of page building command streams, there is compression, It is uncontroversial to state that Adobe's Portable Document optional encryption, and optional painstaking "linearization", Format (PDF) is the de facto way final form digital documents are which orders content so that the first page can be viewed quickly distributed today. There are many reasons for this, including high over a slow network. technical quality and the free Acrobat viewer available on all major platforms. However, as our results will show, PDFs are Third, regardless of however well PDF generators track the PDF often 50% larger than they need to be and in some cases 1000% specification, there remain billions of legacy PDFs. While all times larger. There are several reasons for this. PDFs are forward compatible with later specifications (another primary reason for the popularity of PDF), they use older, less In the first place, there are now innumerable PDF generators, efficient technology (which of course was all that was available at including Adobe Distiller, Adobe PDFWriter, Adobe PDF the time of document generation). In almost all cases, these new Library, Aladdin Ghostscript, Corel PDF Engine, CL-PDF, PDFs cannot be regenerated from source, since one usually DaVince C++ Class Library, Apache FOP, HPA image bureau, receives many more PDFs than one generates (just like email) and Oracle PDF driver, Panda, PDFlib, ClibPDF Library, dvipdfm, the sources are not available. dvips + GNU Ghostscript, htmldoc, iSEDQuickPDF iText, Permission to make digital or hard copies of all or part of this work for We have developed a tool that optimizes PDF space requirements. personal or classroom use is granted without fee provided that copies are It postprocesses existing PDFs, working with all PDF generators, not made or distributed for profit or commercial advantage and that inefficient and efficient, old and modern. It centralizes expertise copies bear this notice and the full citation on the first page. To copy in the back end so that general applications can concentrate on otherwise, or republish, to post on servers or to redistribute to lists, translating their visuals to clean PDF. Or, since an integrated requires prior specific permission and/or a fee. system is often preferable, applications can compare their file DocEng’03, November 20-22, 2003, Grenoble, France. sizes and see if significant improvements are possible, and if so Copyright 2003 ACM 1-58113-724-9/03/0011…$5.00. applications can examine the tool's output to identify optimization 175 opportunities. Furthermore, since the tool postprocesses PDF, it compression loses information and a program must at best rely on operates on legacy PDF, bringing the benefits of modern various heuristics or manual intervention to decide whether the loss is compression algorithms as well as other new techniques. significant or not. While PDF structure information, which is not related to the visual appearance, is relatively new and as yet This paper surveys a range of PDF optimizations for space, and seldom used, a program cannot automatically determine whether utilizes the tool to measure their effectiveness. PDF was designed the structure is meaningful or unintended bloat. PDFs can have more than 10 years ago, or almost seven Moore's Law doublings many named destinations, which are similar to HTML anchors; if ago, and we consider optimizations that are newly technically not all of them are referenced within the document, the unused practical. may be referenced from other PDFs or instead be due to overzealous labeling (as by FrameMaker). Such optimizations can be enabled with explicit switches to our tool and other PDF THE STRUCTURE OF PDF optimization tools have target "profiles" that specify the In order to understand the ways PDF can be optimized, a high- combinations that are appropriate, but none is used in the results level familiarity with the PDF file format is needed. PDF is reported in this paper. relatively simple. A brief header of the form %PDF-m.n marks the file as PDF of version m.n, a number at the very end of the From among the many possible PDF space optimizations, the file points to a cross-reference table, the cross-reference table following are those that are most effective on most document holds the exact byte offsets of PDF objects, and everything else is instances. one of those objects. Objects can be of the usual types found in programming languages, including strings, integers and real numbers, and arrays. A core data type is the dictionary, which is Use a modern compression algorithm in effect a hash table. Dictionaries and arrays can nest objects, PDF is fundamentally a text-based format, writing objects as including other dictionaries and arrays. Objects are identified by human-readable text, as opposed to a binary format with carefully number, and objects can refer to other objects by number by defined bit fields. However, compression is essential for indirect references. Arbitrary byte sequences can be embedded in reasonable file sizes. Originally the general-purpose compression streams, which are dictionaries with metadata (length, algorithm was LZW, but this has been superceded by the superior compression type, data type) followed by the data bytes. Streams performance of Flate [7]. Only PDF streams can be compressed; are used for image data, embedded fonts, and arbitrary embedded the new PDF 1.5 of May 2003 introduces object streams, which files, among others uses. Page contents are a sequence of collect one or more non-streams into streams, which can then be PostScript-like textual commands that are stored in streams and compressed. This is especially useful for hyperlinks and that are executed to build the page as a series of graphical annotations, of which there can be many and which share much of operations. Only streams can be compressed. PDF 1.5 [2] also the same content such as dictionary entries (/Subtype /Link, introduced cross-reference streams, which are more flexible and /Border [0 0 0]). compressible than previous cross-reference tables. Modern compression of images can also result in large space So as not to overwhelm the reader, we introduce refinements to savings. Images can be compressed with a variety of formats, and this basic description as they become relevant. PDF 1.4 and PDF 1.5, respectively, introduced JBIG2 for bitonal images (such as black and white scanned paper) and JPEG2000 for continuous-tone images (such as color photographs). OPTIMIZATIONS However, image compression is independent of PDF per se: Techniques armed with an image compressor, applying it to PDF is a simple PDF is a rich format and few PDFs take advantage of every matter of rewriting the PDF image's data stream; the other objects aspect: many PDFs have JPEG images, some have JPEG2000 in the PDF are unaffected, except their file offsets in the cross images, some are scanned paper CCITT FAX images lightly reference.