Threshing EPS Files
Total Page:16
File Type:pdf, Size:1020Kb
Threshing EPS files Bogus law Jackowski, Piotr Pianowski, and Piotr Strzelczyk BOP s.c. ul. Piastowska 70, Gda´nsk, Poland [email protected]\\ [email protected]\\ [email protected] Abstract In this article we describe the CEP package for compressing EPS files. It belongs to the public domain and was released at the GUST meeting in Bachotek, 1997. The amount of disk space occupied by bitmap means “the flail”. We hope that others find thresh- graphics is a well-recognized problem. For example, ing EPS files useful, in order to get rid of chaff, i.e., a 300 dpi picture (A4) contains ca 8700000 pixels; redundant data. assuming that each CMYK pixel occupies four bytes, one obtains ca 35MB of disk space needed to store About the program the picture. Now, imagine a TEX-er, who is not allowed to Our package consists of two pairs of AWK programs use binary graphic data (because of the otherwise (cep.awk – uncep.awk and cop.awk– uncop.awk), magnificent DVIPS); thus our poor TEX-er usually four MS-DOS batch files and text information. cep.awk converts the binary data to hexadecimal EPS files, and cop.awk generate (on-the-fly) PostScript pro- thus doubling the required space, and next, after grams which, processed by Ghostscript, yield the compiling a document with TEX+DVIPS, the whole appropriate data compression. UNCEP and UNCOP graphic data is put into the resulting PostScript file, accomplish (using a similar technique) the reverse so the required space is doubled again — altogether process, i.e., uncompression. 140MB per one A4 page. The nightmare begins. CEP is devised for the compression of the This problem is not a new one; it was recognised usual bitmapped EPS files, containing a single, by Adobe a relatively long time ago. In the Post- hexadecimally-coded image; COP can be used to Script Level 2 specification, they included objects compress any PostScript data. called filters which enable data compression. In The question arises: Why use two packing particular, instead of hexadecimal data, one can techniques? The answer is simple: the efficiency use ASCII85 encoding (there are explanations of of compression is higher if a compression program abbreviations at the end of the article), run length knows in advance which kinds of data are to be compression, LZW compression, DCT (used in JPEG expected. In general, bitmaps are more regular files), and many others. Why not make use of (redundant) than arbitrary PostScript data, hence these tools? The question is not as silly as it may even simple algorithms turn out to be more efficient. look at the first glance, as there exist relatively few Tests show that in the best case (screen dumps) applications capable of generating well-compressed squeezing up to 10% of the original size is noth- PostScript graphics. ing unusual. Sometimes, however, no compression We decided to patch somehow this gap. We method gives a satisfactory result. In such a case, developed a little package enabling the compression one can always use encoding data using the ASCII85 of “normal” (non-compressed) graphic data. The filter, obtaining a reduction of a hexadecimal bitmap nature of the problem is more complex, however, size by approximately 35%. than one might expect. In particular, a universal, Below we give a brief description of CEP and always efficient compression technique does not ex- COP. So far, only the MS-DOS version of the ist. Choice of an optimal algorithm depends upon PostScript-compressors is available, but it should be the kind of data, form of the file and the expected easy to adapt our package to any platform, where application. Hence, the package has several “but- GAWK and Ghostscript are available. In this version tons” which enable controlling various aspects of the GNU implementation of AWK (GAWK-EMX.EXE) compression. and Aladdin Ghostscript interpreter (GS386.EXE) Actually, the name CEP is derived from “com- are used. pressed EPS”. Coincidentally, the name in Polish TUGboat, Volume 19 (1998), No. 3 — Proceedings of the 1998 Annual Meeting 267 Bogus law Jackowski, Piotr Pianowski, and Piotr Strzelczyk We tested the package using several Ghostscript One should remember that the names of input and and GAWK implementations; now we use Ghost- output files must differ. The program recognizes the script 5.10 and GAWK 3.0.3. following options: 8 —useASCII85 coding (default) CEP h or H —useHEX (hexadecimal) coding The CEP subpackage consists of the MS-DOS batch r or R —useRLE (RunLength) compression files cep.bat and uncep.bat and the AWK pro- (default) grams cep.awk and uncep.awk.First,AWK in- l or L —useLZW compression spects the source EPS file doing its best to recognize f or F — use Flate compression a position of a hexadecimal bitmap; next it creates (PDF and Level 3) an appropriate PostScript program; and then the n or N — don’t compress control is passed on to Ghostscript which just per- Invoking UNCEP is even simpler: forms the submitted program: encodes the bitmap uncep.bat in file out file and copies verbatim the remaining lines. The orig- h ih i inal preamble is slightly modified; nevertheless, all As mentioned, decompression and decoding meth- DSC comments are left intact. ods are taken from an input file. If the bitmap cannot be found or the AWK COP suspects that troubles may arise, the CEP engine gives up. The subpackage consists of the MS-DOS batch files The resulting file should be verified prior to cop.bat and uncop.bat,andtheAWK programs removing the original one, as the CEP heuristic cop.awk and uncop.awk. COP reads and encodes tricks may fail to fix the bitmap properly; moreover, appropriately the supplied data. No analysis of the due to Ghostscript bugs, premature removal of the PostScript data is performed, as the entire file is source may also be painful. encoded without changing even a bit. The only as- CEP never generates binary output — only hexa- pect that is taken into account is the DSC comment decimal or ASCII85 encoding are supported. This %%BoundingBox:; if it is found, COP inserts this is due to the fact that CEP-compressed EPS files comment in the preamble, otherwise the resulting are primarily meant to be used by TEX+DVIPS. file does not contain the bounding box information. Nevertheless, the resulting files can be used in other COP-generated files are readable by any Post- typesetting systems as so-called placeable EPS files. Script Level 2 interpreter. The applicability to non-TEX applications, however, UNCOP scans the header and deduces from it is somewhat limited, as binary TIFF previews (re- the method of decompression, hence no options are quired by WYSIWYG applications) may be misin- needed. UNCOP, unlike UNCEP, retrieves precisely terpreted by (G)AWK. the original file. It is still recommended, however, UNCEP requires that a CEP-compressed file that a user verifies whether the resulting file is was not changed. In particular, it relies on the properly interpreted by Ghostscript. Due to Ghost- information in a quasi-DSC comment %UNCEPInfo:. script bugs, premature removal of the source file This information can be destroyed by a seemingly after compression or decompression may turn out innocent modification (e.g., by adding or removing to be painful. a comment line). Note that the technique employed Since COP canbeusedtocompressanydatafor by CEP destroys, by its nature, the information arbitrary applications, binary encoding is allowed about the line-breaking structure of the hexadecimal also. The resulting files can be used with typesetting bitmap. Therefore, UNCEP cannot retrieve the orig- systems that accept so-called placeable EPSs. Un- inal file. Line-breaking structure does not make any fortunately, binary TIFF previewers make files after problem for a PostScript interpreter. There exist compression illegible for PostScript. programs, however, that read their own bitmapped The usage of COP is similar to that of CEP: EPS files, which for unknown reasons make use of cop.bat in file out file options such (sub)lexical information; Aldus PhotoStyler is h ih ih i a notable example. The program recognizes the following options: 8 —useASCII85 coding (default) b or B — use binary coding The command line invoking CEP is pretty sim- h or H —useHEX (hexadecimal) coding ple: r or R —useRLE (RunLength) compression cep.bat in file out file options (default) h ih ih i 268 TUGboat, Volume 19 (1998), No. 3 — Proceedings of the 1998 Annual Meeting Threshing EPS files l or L —useLZW compression colour photo images. It is just a weakness of all f or F — use Flate compression non-lossy techniques — algorithms employed by (PDF and Level 3) ARJ, ZIP, LHARC, and others would yield poor n or N — don’t compress results also. A reasonable alternative for data Observe that binary encoding is, in fact, no encoding of this kind would be DCT (JPEG) compression. at all. 3. As was mentioned above, ASCII85 encoding can The reverse process, i.e., UNCOP decompres- usually be recommended; it added, however, sion, is also straightforward: some troubles. First, due to Ghostscript bugs, we decided to add the (dummy) NullEncode uncop.bat in file out file h ih i filter which seems to cure the problem. But As with UNCEP, decompression and decoding meth- there is one more problem: ASCII85-encoded ods are taken from an input file. bitmaps may contain lines looking like DSC comments, i.e, they may begin with double A heap of remarks concerning our package percent signs, %%, or with a percent-exclamation The applied solution addresses several problems: sign pair, %! — why didn’t Adobe exclude the 1.