Threshing EPS files

Bogus law Jackowski, Piotr Pianowski, and Piotr Strzelczyk BOP s.c. ul. Piastowska 70, Gda´nsk, Poland [email protected]\\ [email protected]\\ [email protected]

Abstract In this article we describe the CEP package for compressing EPS files. It belongs to the public domain and was released at the GUST meeting in Bachotek, 1997.

The amount of disk space occupied by bitmap means “the flail”. We hope that others find thresh- graphics is a well-recognized problem. For example, ing EPS files useful, in order to get rid of chaff, i.e., a 300 dpi picture (A4) contains ca 8700000 pixels; redundant data. assuming that each CMYK pixel occupies four bytes, one obtains ca 35MB of disk space needed to store About the program the picture. Now, imagine a TEX-er, who is not allowed to Our package consists of two pairs of AWK programs use binary graphic data (because of the otherwise (cep.awk – uncep.awk and cop.awk– uncop.awk), magnificent DVIPS); thus our poor TEX-er usually four MS-DOS batch files and text information. cep.awk converts the binary data to EPS files, and cop.awk generate (on-the-fly) PostScript pro- thus doubling the required space, and next, after grams which, processed by Ghostscript, yield the compiling a document with TEX+DVIPS, the whole appropriate data compression. UNCEP and UNCOP graphic data is put into the resulting PostScript file, accomplish (using a similar technique) the reverse so the required space is doubled again — altogether process, i.e., uncompression. 140MB per one A4 page. The nightmare begins. . . CEP is devised for the compression of the This problem is not a new one; it was recognised usual bitmapped EPS files, containing a single, by Adobe a relatively long time ago. In the Post- hexadecimally-coded image; COP can be used to Script Level 2 specification, they included objects compress any PostScript data. called filters which enable data compression. In The question arises: Why use two packing particular, instead of hexadecimal data, one can techniques? The answer is simple: the efficiency use encoding (there are explanations of of compression is higher if a compression program abbreviations at the end of the article), run length knows in advance which kinds of data are to be compression, LZW compression, DCT (used in JPEG expected. In general, bitmaps are more regular files), and many others. Why not make use of (redundant) than arbitrary PostScript data, hence these tools? The question is not as silly as it may even simple algorithms turn out to be more efficient. look at the first glance, as there exist relatively few Tests show that in the best case (screen dumps) applications capable of generating well-compressed squeezing up to 10% of the original size is noth- PostScript graphics. ing unusual. Sometimes, however, no compression We decided to patch somehow this gap. We method gives a satisfactory result. In such a case, developed a little package enabling the compression one can always use encoding data using the ASCII85 of “normal” (non-compressed) graphic data. The filter, obtaining a reduction of a hexadecimal bitmap nature of the problem is more complex, however, size by approximately 35%. than one might expect. In particular, a universal, Below we give a brief description of CEP and always efficient compression technique does not ex- COP. So far, only the MS-DOS version of the ist. Choice of an optimal algorithm depends upon PostScript-compressors is available, but it should be the kind of data, form of the file and the expected easy to adapt our package to any platform, where application. Hence, the package has several “but- GAWK and Ghostscript are available. In this version tons” which enable controlling various aspects of the GNU implementation of AWK (GAWK-EMX.EXE) compression. and Aladdin Ghostscript interpreter (GS386.EXE) Actually, the name CEP is derived from “com- are used. pressed EPS”. Coincidentally, the name in Polish

TUGboat, Volume 19 (1998), No. 3 — Proceedings of the 1998 Annual Meeting 267 Bogus law Jackowski, Piotr Pianowski, and Piotr Strzelczyk

We tested the package using several Ghostscript One should remember that the names of input and and GAWK implementations; now we use Ghost- output files must differ. The program recognizes the script 5.10 and GAWK 3.0.3. following options: 8 —useASCII85 coding (default) CEP h or H —useHEX (hexadecimal) coding The CEP subpackage consists of the MS-DOS batch r or R —useRLE (RunLength) compression files cep.bat and uncep.bat and the AWK pro- (default) grams cep.awk and uncep.awk.First,AWK in- l or L —useLZW compression spects the source EPS file doing its best to recognize f or F — use Flate compression a position of a hexadecimal bitmap; next it creates (PDF and Level 3) an appropriate PostScript program; and then the n or N — don’t compress control is passed on to Ghostscript which just per- Invoking UNCEP is even simpler: forms the submitted program: encodes the bitmap uncep.bat in file out file and copies verbatim the remaining lines. The orig- h ih i inal preamble is slightly modified; nevertheless, all As mentioned, decompression and decoding meth- DSC comments are left intact. ods are taken from an input file. If the bitmap cannot be found or the AWK COP suspects that troubles may arise, the CEP engine gives up. The subpackage consists of the MS-DOS batch files The resulting file should be verified prior to cop.bat and uncop.bat,andtheAWK programs removing the original one, as the CEP heuristic cop.awk and uncop.awk. COP reads and encodes tricks may fail to fix the bitmap properly; moreover, appropriately the supplied data. No analysis of the due to Ghostscript bugs, premature removal of the PostScript data is performed, as the entire file is source may also be painful. encoded without changing even a bit. The only as- CEP never generates binary output — only hexa- pect that is taken into account is the DSC comment decimal or ASCII85 encoding are supported. This %%BoundingBox:; if it is found, COP inserts this is due to the fact that CEP-compressed EPS files comment in the preamble, otherwise the resulting are primarily meant to be used by TEX+DVIPS. file does not contain the bounding box information. Nevertheless, the resulting files can be used in other COP-generated files are readable by any Post- typesetting systems as so-called placeable EPS files. Script Level 2 interpreter. The applicability to non-TEX applications, however, UNCOP scans the header and deduces from it is somewhat limited, as binary TIFF previews (re- the method of decompression, hence no options are quired by WYSIWYG applications) may be misin- needed. UNCOP, unlike UNCEP, retrieves precisely terpreted by (G)AWK. the original file. It is still recommended, however, UNCEP requires that a CEP-compressed file that a user verifies whether the resulting file is was not changed. In particular, it relies on the properly interpreted by Ghostscript. Due to Ghost- information in a quasi-DSC comment %UNCEPInfo:. script bugs, premature removal of the source file This information can be destroyed by a seemingly after compression or decompression may turn out innocent modification (e.g., by adding or removing to be painful. a comment line). Note that the technique employed Since COP canbeusedtocompressanydatafor by CEP destroys, by its nature, the information arbitrary applications, binary encoding is allowed about the line-breaking structure of the hexadecimal also. The resulting files can be used with typesetting bitmap. Therefore, UNCEP cannot retrieve the orig- systems that accept so-called placeable EPSs. Un- inal file. Line-breaking structure does not make any fortunately, binary TIFF previewers make files after problem for a PostScript interpreter. There exist compression illegible for PostScript. programs, however, that read their own bitmapped The usage of COP is similar to that of CEP: EPS files, which for unknown reasons make use of cop.bat in file out file options such (sub)lexical information; Aldus PhotoStyler is h ih ih i a notable example. The program recognizes the following options: 8 —useASCII85 coding (default) b or B — use binary coding The command line invoking CEP is pretty sim- h or H —useHEX (hexadecimal) coding ple: r or R —useRLE (RunLength) compression cep.bat in file out file options (default) h ih ih i

268 TUGboat, Volume 19 (1998), No. 3 — Proceedings of the 1998 Annual Meeting Threshing EPS files

l or L —useLZW compression colour photo images. It is just a weakness of all f or F — use Flate compression non-lossy techniques — algorithms employed by (PDF and Level 3) ARJ, ZIP, LHARC, and others would yield poor n or N — don’t compress results also. A reasonable alternative for data Observe that binary encoding is, in fact, no encoding of this kind would be DCT (JPEG) compression. at all. 3. As was mentioned above, ASCII85 encoding can The reverse process, i.e., UNCOP decompres- usually be recommended; it added, however, sion, is also straightforward: some troubles. First, due to Ghostscript bugs, we decided to add the (dummy) NullEncode uncop.bat in file out file h ih i filter which seems to cure the problem. But As with UNCEP, decompression and decoding meth- there is one more problem: ASCII85-encoded ods are taken from an input file. bitmaps may contain lines looking like DSC comments, i.e, they may begin with double A heap of remarks concerning our package percent signs, %%, or with a percent-exclamation The applied solution addresses several problems: sign pair, %! — why didn’t Adobe exclude the 1. It is not at all obvious how to determine syn- percent from ASCII85? Some programs may tactically where a hexadecimal bitmap begins try to interpret pseudo-DSC lines. For example, in an EPS file; semantic analysis (by redefining DVIPS just removes such lines, unless the option PostScript primitives image, imagemask and -K0 is not used; on the other hand, leaving colorimage) is possible, but it also has its DSC comments intact may stupefy document limitations; anyway, we decided to recognize managers. a bitmap syntactically, which implied a problem 4. It would be convenient to have some more filters of recognizing such artifacts as add or def which implemented, in particular DCT and CCITT- look like fragments of a bitmap but, in fact, are Fax; both of them, however, make use of some not. additional input data which makes using them 2. Also, it is not obvious which compression method more complex; moreover, it is not clear whether should be applied for a given data type; usu- one can find the optimal compression parame- ally, ASCII85 encoding is advisable; for pure ters for DCT without a WYSIWYG program; we bitmaps (CEP) RLE compression is satisfactory, consider a possibility of one-to-one conversion although LZW and Flate filters usually produce between JPEG files and EPS files making use of much better results (the latter seems to be DCT filters; also, a similar conversion between the best); nevertheless, both LZW and Flate GIF files and EPS files making use of LZW filters encodings have limited usability: can perhaps be implemented. (a) LZW encoding is not implemented in 5. The package conserves the working disk space — Ghostscript ver. > 4 due to USA patent no large temporary files are created; roughly, law; as a by-pass, Aladdin implemented the needed disk space is equal to the size of the an LZW-compatible filter which produces source plus the size of the target. non-compressed data (in fact, enlarged by 6. The following file may be helpful for verifying some 10%) readable for any LZWDecode whether a given PostScript device is able to filter. You can use an old Ghostscript interpret compressed EPS files: version, or compile a Ghostscript version %!PS-Adobe-2.0 EPSF-1.2 containing the real LZW filter at your own %%Pages: 1 risk, but... %%BoundingBox: 0 0 540 150 (b) Flate encoding (the same that is used %%EndComments in GZIP) is available on photo-typesetters /Helvetica 8 selectfont having implemented PostScript Level 3; 90 rotate it is also used in PDF files. It is safe 1 2 moveto to assume that Ghostscript ver. > 4has (*) this filter built-in. With other PostScript {0 -10 rmoveto gsave show grestore} devices, in particular commercial ones, the 255 string test described in point 6 may prove useful. /Filter As a rule of thumb we would suggest not to resourceforall use any compression but ASCII85 for detailed showpage

TUGboat, Volume 19 (1998), No. 3 — Proceedings of the 1998 Annual Meeting 269 Bogus law Jackowski, Piotr Pianowski, and Piotr Strzelczyk

%%EOF sibly with LZWEncode compiled in) and Running this program yields the list of filters GAWK 3.x: GS 4.x is nearly complete for a given device. The error reported during implementation of the Level 2 PostScript; the processing of this file proves that the device GAWK 3.x provides regular expressions for is not Level 2 compatible. In such a case, using record separators, which makes it possible the CEP package should be abandoned. to force handling end-of-lines in exactly 7. Bugs and traps: the same manner as PostScript does and, moreover, is more reliable than earlier ver- (a) Apparently, by preceding the closefile sions. command by flushfile, one neutralizes an error in GS 3.x (tail of output swal- CEP for everybody lowed). The packages described herein, we have developed (b) Adding (a dummy) NullEncode filter neu- for our own purposes. We make them available to tralizes (probably) another Ghostscript the public in the hope that others will find them bug: an ASCII85Encode filter with a tar- useful too. We intend to support the packages, but get procedure may produce superfluous we cannot guarantee that we will be able to follow EOD marks, i.e., “~>” (if things go really the frequency of Ghostscript upgrades. badly you can obtain thousands of them). Note that all copyrights, copylefts, copyups, Using the target procedure instead of a file copydowns, or whatever you wish to call them, object excludes GS ver. < 3.x, because concerning all the files in the CEP/COP packages early Ghostscripts didn’t support all fea- are essentially of the public domain character. tures of PostScript Level 2. Nevertheless, Ghostscript ver. 2.6 can be used for Vocabulary compression with≥ hexadecimal encoding (it has a “legal” LZW compression). You may find useful the following short explanation of terms appearing throughout the article. (c) The target procedure mentioned in 7b is, in turn, due to special treatment of the Ghostscript, GS: A reliable and efficient inter- ASCII85-encoded lines that look like DSC preter of PostScript language by Aladdin Enter- comments; this special treatment is break- prises, available as a free public license product; ing lines after the first percent character. its current version (in July — 5.10) turns out to It is dedicated to the DVIPS driver which be much more reliable than not a few commer- has a dangerous option remove comments cial interpreters. (-K1). AWK: A popular utility and a programming lan- (d) An artificial form of quitting, i.e., {2 2 guage for convenient and efficient batch data- .quit} instead of {2 .quit}, is due to an reformatting; written in 1977 by Alfred V. Aho, infinite loop of Ghostscript 3.5x caused by Peter J. Weinberger, and Brian W. Kernighan. the latter form. The Ghostscript internal GAWK: GNU AWK, GNU Free Software Founda- operation .quit was chosen to provide error handling at the level of the operating tion implementation of AWK, written in 1986 system. by Paul Rubin and Jay Fenlason, with advice from Richard Stallman. (e) Still, there exist bugs in older Ghostscript that we were not able to neutralize; e.g., GNU: The initiative of the Free Software Founda- some EPS files are properly compressed by tion (FSF), a non-profit organization dedicated GS 2.6, but Ghostscript 2.6 breaks while to the production and distribution of freely dis- displaying them; GS 3.51 behaves simi- tributable software, founded by Richard M. Stall- larly with other bitmaps. So far, Ghost- man. script > 4 seems to be the most resistant TEX: Public domain typesetting system by Don- to the “filter trial”, but it also reveals some ald E. Knuth of Stanford University. deficiencies. Let’s hope that GS 5.5, which is expected to appear soon and is claimed DVIPS: A popular TEX-to-PostScript driver by to have most of PostScript Level 3 features Tomas Rokicki of Stanford University. implemented, will be still better. DSC: Document Structuring Convention — a stan- (f) Summing up, we would strongly recom- dard for structuring PostScript documents de- mend using Ghostscript 4.x or 5.x (pos- signed by Adobe.

270 TUGboat, Volume 19 (1998), No. 3 — Proceedings of the 1998 Annual Meeting Threshing EPS files

ASCII85: Algorithm for coding binary data as 7- DCT: Discrete cosine transform compression, an bit ASCII text consisting of only printable char- elaborate, very efficient but lossy compression acters; encodes every four bytes as five charac- scheme, used in JPEG file format. ters from % to u; additionally z is used to code JPEG: Joint Photographic Experts Group, an or- four zeros (see PostScript Language Reference ganization responsible for developing an in- Manual, second edition, pp. 128–130). ternational standard for compression of image RLE: Run Length Encoding — a standard method data; the PostScript (Level 2) DCTEncode filter of data compression (see PostScript Language conforms to the JPEG-proposed standard. Reference Manual, second edition, pp. 133– GZIP: Compressing utility by GNU Free Software 134). Foundation, based on a superior and unpatented LZW: An algorithm of data compression by J. Ziv, compression algorithm (modified Lempel and A. Lempel (1978), improved by T. Welch (1984); Ziv algorithm), developed in order to get rid Unisys, at the time Welch’s employer, was of the patented LZW algorithm. It has become granted a US patent in 1985 on Welch’s al- a standard compression tool in UNIX systems. gorithm; a grandfather clause was established Flate: Filter in PostScript Level 3 based on the by Unisys to make pre-1995 implementations of compression algorithm used by GZIP; the name LZW code free of royalty requirements, thereby is the truncation of the words “inflate” and eliminating such claims on UNIX compress (in- “deflate”. formation from Nelson H. F. Beebe, e-mail: [email protected]).

TUGboat, Volume 19 (1998), No. 3 — Proceedings of the 1998 Annual Meeting 271 Bogus law Jackowski, Piotr Pianowski, and Piotr Strzelczyk

1

272 TUGboat, Volume 19 (1998), No. 3 — Proceedings of the 1998 Annual Meeting