Data Compression
Total Page:16
File Type:pdf, Size:1020Kb
Algorithms ROBERT SEDGEWICK | KEVIN WAYNE 5.5 DATA COMPRESSION ‣ introduction ‣ run-length coding Algorithms ‣ Huffman compression FOURTH EDITION ‣ LZW compression ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu Last updated on 12/2/19 5:12 PM 5.5 DATA COMPRESSION ‣ introduction ‣ run-length coding Algorithms ‣ Huffman compression ‣ LZW compression ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu Data compression Compression reduces the size of a file: ・To save space when storing it. ・To save time when transmitting it. ・Most files have lots of redundancy. Who needs compression? ・Moore’s law: # transistors on a chip doubles every 18–24 months. ・Parkinson’s law: data expands to fill space available. ・Text, images, sound, video, sensors, … Every day, we create: 500 million Tweets. ・ ・300 billion emails. ・350 million Facebook photos. ・500,000 hours YouTube video. Basic concepts ancient (1950s), best technology recently developed. 3 Applications Generic file compression. ・Files: Gzip, bzip2, 7z, PKZIP. ・File systems: NTFS, ZFS, HFS+, ReFS, GFS. Multimedia. ・Images: GIF, JPEG. ・Sound: MP3. ・Video: MPEG, DivX™, HDTV. Communication. Fax, Skype, Google hangout. Databases. Google, Facebook, NSA, …. Smart sensors. Phone, watch, car, health, …. 4 Lossless compression and expansion Message. Bitstream B we want to compress. Compress. Generates a “compressed” representation C (B). Expand. Reconstructs original bitstream B. uses fewer bits (you hope) Compress Expand bitstream B compressed version C(B) original bitstream B 0110110101... 1101011111... 0110110101... Basic model for data compression Compression ratio. Bits in C (B) / bits in B. Ex. 50–75% or better compression ratio for natural language. 5 Data representation: genomic code Genome. String over the alphabet { A, T, C, G }. Goal. Encode an n-character genome: ATAGATGCATAG ... Standard ASCII encoding. Two-bit encoding. ・8 bits per char. ・2 bits per char. ・8 n bits. ・2 n bits (25% compression ratio). char hex binary char binary 'A' 41 01000001 'A' 00 'T' 54 01010100 'T' 01 'C' 43 01000011 'C' 10 'G' 47 01000111 'G' 11 Fixed-length code. k-bit code supports alphabet of size 2k. Amazing but true. Some genomic databases in 1990s used ASCII. 6 664 CHAPTER 6 n Strings 664 CHAPTER 6 n Strings Binary input and output. Most systems nowadays, including Java, base their I/O on 8-bit bytestreams, so we might decide to read and write bytestreams to match I/O for- mats with the internal representations of primitive types, encoding an 8-bit char with Binary input and output. Most systems nowadays, including Java, base their I/O on 1 byte, a 16-bit short with 2 bytes, a 32-bit int with 4 bytes, and so forth. Since bit- 8-bitstreams bytestreams, are the primary so we mightabstraction decide for to dataread compression,and write bytestreams we go a bit to furthermatch I/Oto allow for- matsclients with to readthe internal and write representations individual bits of, intermixedprimitive types, with dataencoding of various an 8-bit types char (primi with- 1 byte, a 16-bit short with 2 bytes, a 32-bit int with 4 bytes, and so forth. Since bit- Reading andtive writing types and Stringbinary). The datagoal is to minimize the necessity for type conversion in streamsclient programs are the primary and also abstraction to take care for of dataoperating-system compression, conventionswe go a bit furtherfor representing to allow clientsdata.We to useread the and following write individual API for reading bits, intermixed a bitstream with from data standard of various input: types (primi- Binary standardtive types input. and String Read). The bits goal isfrom to minimize standard the necessity input. for type conversion in clientpublic programs class and BinaryStdIn also to take care of operating-system conventions for representing data.We use the following API for reading a bitstream from standard input: boolean readBoolean() read 1 bit of data and return as a boolean value publicchar classreadChar() BinaryStdIn read 8 bits of data and return as a char value boolean readBoolean() char readChar(int r) readread 1 r bit bits of of data data and and return return as as a aboolean char value value char readChar() byte shortread 8 bits of dataint and returnlong as a chardouble value [similar methods for (8 bits); (16 bits); (32 bits); and (64 bits)] booleanchar readChar(intisEmpty() r) readis the r bitstreambits of data empty? and return as a char value [similarvoid methodsclose() for byte (8 bits); shortclose (16 the bits); bitstream int (32 bits); long and double (64 bits)] boolean isEmpty()API for static methods thatis the read bitstream from a bitstreamempty? on standard input void close() close the bitstream A key feature of the abstraction is that, in marked constrast to StdIn, the data on stan- API for static methods that read from a bitstream on standard input dard input is not necessarily aligned on byte boundaries. If the input stream is a single byte, a client could read it 1 bit at a time with 8 calls to readBoolean(). The close() A key feature of the abstraction is that, in marked constrast to StdIn, the data on stan- method is not essential, but, for clean termination, clients should call close() to in- Binary standarddard inputoutput. is not necessarily Write aligned bits on to byte standard boundaries. Ifoutput the input stream is a single dicate that no more bits are to be read. As with StdIn/StdOut, we use the following byte,complementary a client could API read for itwriting 1 bit at bitstreams a time with to 8standard calls to readBoolean()output: . The close() method is not essential, but, for clean termination, clients should call close() to in- dicatepublic that class no more BinaryStdOut bits are to be read. As with StdIn/StdOut, we use the following complementary API for writing bitstreams to standard output: void write(boolean b) write the specifed bit publicvoid classwrite(char BinaryStdOut c) write the specifed 8-bit char void write(boolean b) void write(char c, int r) writewrite the the speci r leastfed signi bit fcant bits of the specifed char [similarvoid methodswrite(char for byte c)(8 bits); shortwrite (16 bits); the speci intf (32ed 8-bitbits); char long and double (64 bits)] voidvoid write(charclose() c, int r) writeclose thethe rbitstream least signifcant bits of the specifed char [similar methodsAPI for for byte static (8 methods bits); short that write (16 bits); to a bitstream int (32 bits); on standard long and output double (64 bits)] 7 void close() close the bitstream API for static methods that write to a bitstream on standard output Writing binary data Date representation. Three different ways to represent 12/31/1999. A character stream (StdOut) A character stream (StdOut) A character stream (StdOut) StdOut.print(month + "/" + day + "/" + year); StdOut.print(month + "/" + day + "/" +StdOut.print(month year); + "/" + day + "/" + year); 00110001001100100010111100110111001100010010111100110001001110010011100100111001 00110001001100100010111100110111001100010010111100110001001110010011100100111001 001100010011001000101111001101110011000100101111001100010011100100111001001110011 2 / 3 1 / 1 9 9 9 1 2 / 3 1 / 1 9 9 9 80 bits 1 2 / 3 1 / 1 9 9 9 Three ints (BinaryStdOut) 80 bits 80 bits Three ints (BinaryStdOut) Three ints (BinaryStdOut) BinaryStdOut.write(month); BinaryStdOut.write(month); BinaryStdOut.write(month);BinaryStdOut.write(day); BinaryStdOut.write(day); BinaryStdOut.write(day);BinaryStdOut.write(year); BinaryStdOut.write(year); BinaryStdOut.write(year); 000000000000000000000000000011000000000000000000000000000001111100000000000000000000011111001111 000000000000000000000000000011000000000000000000000000000001111100000000000000000000011111001111 00000000000000000000000000001100000000000000000000000000000111110000000000000000000001111100111112 31 1999 96 bits 12 31 1999 12 31 1999 96 bits Two chars and a short (BinaryStdOut) 96A bits 4-bit feld, a 5-bit feld, and a 12-bit feld (BinaryStdOut) Two chars and a short (BinaryStdOut) A 4-bit feld, a 5-bit feld, and a 12-bit feld (BinaryStdOut) Two chars and a short (BinaryStdOut) A BinaryStdOut.write((char)4-bit feld, a 5-bit feld, and a 12-bit feld (BinaryStdOut)month); BinaryStdOut.write(month, 4); BinaryStdOut.write((char) month); BinaryStdOut.write((char)BinaryStdOut.write(month, month);4);day); BinaryStdOut.write(month,BinaryStdOut.write(day, 5); 4); BinaryStdOut.write((char) day); BinaryStdOut.write((char)BinaryStdOut.write(day,BinaryStdOut.write((short) 5); day); year); BinaryStdOut.write(day,BinaryStdOut.write(year, 5); 12); BinaryStdOut.write((short) year); BinaryStdOut.write((short)BinaryStdOut.write(year, 12); year); BinaryStdOut.write(year, 12); 00001100000111110000011111001111 110011111011111001111000 00001100000111110000011111001111 110011111011111001111000 00001100000111110000011111001111 110011111011111001111000 12 31 1999 32 bits 12 31 1999 21 bits ( + 3 bits for byte alignment at close) 12 31 1999 12 31 1999 12 31 1999 32 bits 12 31 1999 21 bits ( + 3 bits32 for bits byte alignment at close) 21 bits ( + 3 bits for byte alignment at close) Four ways to put a date onto standard output Four ways to put a date onto standard output Four ways to put a date onto standard output 8 628 CHAPTER 5 Strings 628 CHAPTER 5 Strings to open a file with an edi- public class BinaryDump tor or view it in the manner { to open a file with an edi- public class BinaryDump public static void bits(String[] args) you view text files (or just { { tor or view it in the manner run a program that uses public