Algorithms ROBERT SEDGEWICK | KEVIN WAYNE

5.5

‣ introduction ‣ run-length coding Algorithms ‣ Huffman compression FOURTH EDITION ‣ LZW compression

ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu

Last updated on 12/2/19 5:12 PM 5.5 DATA COMPRESSION

‣ introduction ‣ run-length coding Algorithms ‣ Huffman compression ‣ LZW compression

ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu Data compression

Compression reduces the size of a file: ・To save space when storing it. ・To save time when transmitting it. ・Most files have lots of redundancy.

Who needs compression? ・Moore’s law: # transistors on a chip doubles every 18–24 months. ・Parkinson’s law: data expands to fill space available. ・Text, images, sound, video, sensors, …

Every day, we create: 500 million Tweets. ・ ・300 billion emails. ・350 million Facebook photos. ・500,000 hours YouTube video.

Basic concepts ancient (1950s), best technology recently developed.

3 Applications

Generic file compression. ・Files: Gzip, bzip2, 7z, PKZIP. ・File systems: NTFS, ZFS, HFS+, ReFS, GFS.

Multimedia. ・Images: GIF, JPEG. ・Sound: MP3. ・Video: MPEG, DivX™, HDTV.

Communication. Fax, Skype, Google hangout.

Databases. Google, Facebook, NSA, ….

Smart sensors. Phone, watch, car, health, ….

4 Lossless compression and expansion

Message. Bitstream B we want to compress. Compress. Generates a “compressed” representation C (B). Expand. Reconstructs original bitstream B. uses fewer bits (you hope)

Compress Expand bitstream B compressed version C(B) original bitstream B 0110110101... 1101011111... 0110110101...

Basic model for data compression

Compression ratio. Bits in C (B) / bits in B. Ex. 50–75% or better compression ratio for natural language.

5 Data representation: genomic code

Genome. String over the alphabet { A, T, C, G }.

Goal. Encode an n-character genome: ATAGATGCATAG ...

Standard ASCII encoding. Two-bit encoding. ・8 bits per char. ・2 bits per char. ・8 n bits. ・2 n bits (25% compression ratio).

char hex binary char binary

'A' 41 01000001 'A' 00

'T' 54 01010100 'T' 01

'C' 43 01000011 'C' 10

'G' 47 01000111 'G' 11

Fixed-length code. k-bit code supports alphabet of size 2k. Amazing but true. Some genomic databases in 1990s used ASCII.

6 664 CHAPTER 6 n Strings

664 CHAPTER 6 n Strings Binary input and output. Most systems nowadays, including Java, base their I/O on 8-bit bytestreams, so we might decide to read and write bytestreams to match I/O for- mats with the internal representations of primitive types, encoding an 8-bit char with Binary input and output. Most systems nowadays, including Java, base their I/O on 1 , a 16-bit short with 2 , a 32-bit int with 4 bytes, and so forth. Since bit- 8-bitstreams bytestreams, are the primary so we mightabstraction decide for to dataread compression,and write bytestreams we go a bit to furthermatch I/Oto allow for- matsclients with to readthe internal and write representations individual bits of, intermixedprimitive types, with dataencoding of various an 8-bit types char (primi with- 1 byte, a 16-bit short with 2 bytes, a 32-bit int with 4 bytes, and so forth. Since bit- Reading andtive writing types and Stringbinary). The datagoal is to minimize the necessity for type conversion in streamsclient programs are the primary and also abstraction to take care for of dataoperating-system compression, conventionswe go a bit furtherfor representing to allow clientsdata.We to useread the and following write individual API for reading bits, intermixed a bitstream with from data standard of various input: types (primi- Binary standardtive types input. and String Read). The bits goal isfrom to minimize standard the necessity input. for type conversion in clientpublic programs class and BinaryStdIn also to take care of operating-system conventions for representing data.We use the following API for reading a bitstream from standard input: boolean readBoolean() read 1 bit of data and return as a boolean value publicchar classreadChar() BinaryStdIn read 8 bits of data and return as a char value boolean readBoolean() char readChar(int r) readread 1 r bit bits of of data data and and return return as as a aboolean char value value char readChar() byte shortread 8 bits of dataint and returnlong as a chardouble value [similar methods for (8 bits); (16 bits); (32 bits); and (64 bits)] booleanchar readChar(intisEmpty() r) readis the r bitstreambits of data empty? and return as a char value

[similarvoid methodsclose() for byte (8 bits); shortclose (16 the bits); bitstream int (32 bits); long and double (64 bits)] boolean isEmpty()API for static methods thatis the read bitstream from a bitstreamempty? on standard input void close() close the bitstream A key feature of the abstraction is that, in marked constrast to StdIn, the data on stan- API for static methods that read from a bitstream on standard input dard input is not necessarily aligned on byte boundaries. If the input stream is a single byte, a client could read it 1 bit at a time with 8 calls to readBoolean(). The close() A key feature of the abstraction is that, in marked constrast to StdIn, the data on stan- method is not essential, but, for clean termination, clients should call close() to in- Binary standarddard inputoutput. is not necessarily Write aligned bits on to byte standard boundaries. Ifoutput the input stream is a single dicate that no more bits are to be read. As with StdIn/StdOut, we use the following byte,complementary a client could API read for itwriting 1 bit at bitstreams a time with to 8standard calls to readBoolean()output: . The close() method is not essential, but, for clean termination, clients should call close() to in- dicatepublic that class no more BinaryStdOut bits are to be read. As with StdIn/StdOut, we use the following complementary API for writing bitstreams to standard output: void write(boolean b) write the specifed bit publicvoid classwrite(char BinaryStdOut c) write the specifed 8-bit char void write(boolean b) void write(char c, int r) writewrite the the speci r leastfed signi bit fcant bits of the specifed char [similarvoid methodswrite(char for byte c)(8 bits); shortwrite (16 bits); the speci intf (32ed 8-bitbits); char long and double (64 bits)] voidvoid write(charclose() c, int r) writeclose thethe rbitstream least signifcant bits of the specifed char [similar methodsAPI for for byte static (8 methods bits); short that write (16 bits); to a bitstream int (32 bits); on standard long and output double (64 bits)] 7 void close() close the bitstream API for static methods that write to a bitstream on standard output Writing binary data

Date representation. Three different ways to represent 12/31/1999.

A character stream (StdOut) A character stream (StdOut) A character stream (StdOut) StdOut.print(month + "/" + day + "/" + year); StdOut.print(month + "/" + day + "/" +StdOut.print(month year); + "/" + day + "/" + year); 00110001001100100010111100110111001100010010111100110001001110010011100100111001 00110001001100100010111100110111001100010010111100110001001110010011100100111001 001100010011001000101111001101110011000100101111001100010011100100111001001110011 2 / 3 1 / 1 9 9 9 1 2 / 3 1 / 1 9 9 9 80 bits 1 2 / 3 1 / 1 9 9 9 Three ints (BinaryStdOut) 80 bits 80 bits Three ints (BinaryStdOut) Three ints (BinaryStdOut) BinaryStdOut.write(month); BinaryStdOut.write(month); BinaryStdOut.write(month);BinaryStdOut.write(day); BinaryStdOut.write(day); BinaryStdOut.write(day);BinaryStdOut.write(year); BinaryStdOut.write(year); BinaryStdOut.write(year); 000000000000000000000000000011000000000000000000000000000001111100000000000000000000011111001111 000000000000000000000000000011000000000000000000000000000001111100000000000000000000011111001111 00000000000000000000000000001100000000000000000000000000000111110000000000000000000001111100111112 31 1999 96 bits 12 31 1999 12 31 1999 96 bits Two chars and a short (BinaryStdOut) 96A bits 4-bit feld, a 5-bit feld, and a 12-bit feld (BinaryStdOut) Two chars and a short (BinaryStdOut) A 4-bit feld, a 5-bit feld, and a 12-bit feld (BinaryStdOut) Two chars and a short (BinaryStdOut) A BinaryStdOut.write((char)4-bit feld, a 5-bit feld, and a 12-bit feld (BinaryStdOut)month); BinaryStdOut.write(month, 4); BinaryStdOut.write((char) month); BinaryStdOut.write((char)BinaryStdOut.write(month, month);4);day); BinaryStdOut.write(month,BinaryStdOut.write(day, 5); 4); BinaryStdOut.write((char) day); BinaryStdOut.write((char)BinaryStdOut.write(day,BinaryStdOut.write((short) 5); day); year); BinaryStdOut.write(day,BinaryStdOut.write(year, 5); 12); BinaryStdOut.write((short) year); BinaryStdOut.write((short)BinaryStdOut.write(year, 12); year); BinaryStdOut.write(year, 12); 00001100000111110000011111001111 110011111011111001111000 00001100000111110000011111001111 110011111011111001111000 00001100000111110000011111001111 110011111011111001111000 12 31 1999 32 bits 12 31 1999 21 bits ( + 3 bits for byte alignment at close) 12 31 1999 12 31 1999 12 31 1999 32 bits 12 31 1999 21 bits ( + 3 bits32 for bits byte alignment at close) 21 bits ( + 3 bits for byte alignment at close) Four ways to put a date onto standard output Four ways to put a date onto standard output Four ways to put a date onto standard output 8 628 CHAPTER 5 Strings 628 CHAPTER 5 Strings

to open a file with an edi- public class BinaryDump tor or view it in the manner { to open a file with an edi- public class BinaryDump public static void bits(String[] args) you view text files (or just { { tor or view it in the manner run a program that uses public static void bits(String[] args) int width = Integer.parseInt(args[0]);you view text files (or just BinaryStdOut), you are { int cnt; run a program that uses likely to see gibberish, de- int width = Integer.parseInt(args[0]); for (cnt = 0; !BinaryStdIn.isEmpty();BinaryStdOut), you arecnt++) pending on the system you { int cnt; likely to see gibberish, de- use. BinaryStdIn allows 628 CHAPTER 5 Strings if (cnt % width == 0) StdOut.println(); for (cnt = 0; !BinaryStdIn.isEmpty(); cnt++) us to avoid such system de- if (BinaryStdIn.readBoolean())pending on the system you { StdOut.print("1");use. BinaryStdIn allows pendencies by writing our if (cnt % width == 0) StdOut.println(); else StdOut.print("0");us to avoid such system de- own programs to convert if (BinaryStdIn.readBoolean()) bitstreams such that we can } to openpendencies a file with by writing an edi- our public class BinaryDump StdOut.print("1"); StdOut.println(cnt + " bits"); see them with our standard own programs to convert { else StdOut.print("0"); } tor or view it in the manner tools. For example, the pro- } bitstreams such that we can public static void bits(String[] args)} you view text files (or just gram BinaryDump at left is StdOut.println(cnt + " bits"); see them with our standard { run a program that uses a BinaryStdIn client that } Printing a bitstream on standard (character) output int width = Integer.parseInt(args[0]); BinaryStdOuttools. For example,), you the are pro - prints out the bits from } int cnt; standard input,likely encodedgram to with seeBinaryDump gibberish,the characters at de left- 0 and is 1. This program is useful for debug- for (cnt = 0; !BinaryStdIn.isEmpty();ging cnt++) when workingpendinga with BinaryStdIn onsmall the inputs. system client We you use that a slightly more complicated version that Printing a {bitstream on standard (character) output just prints the countuse.prints whenBinaryStdIn outthe width the bits allowsargument from is 0 (see Exercise 5.5.X). The similar if (cnt % width == 0) StdOut.println(); standard input, encoded with the charactersclient 0 and HexDump 1. This usgroups programto avoid the datasuchis useful intosystem for8-bit debugde bytes- - and prints each as two hexadecimal if (BinaryStdIn.readBoolean()) digits that each represent 4 bits. The client PictureDump displays the bits in a Picture. ging when StdOut.print("1"); working with small inputs. We use a slightly pendenciesmore complicated by writing version our that justelse prints StdOut.print("0"); the count when the width argumentYou can isdownload 0 (seeown Exercise HexDump programs and5.5 . PictureDumpX to). The convert similar from the booksite. Typically, we use pip- } client HexDump groups the data into 8-biting bytes and redirectionand printsbitstreams at each the command-line suchas two that hexadecimal we can level when working with binary files: we can Binary dumps pipe the output of an encoder to BinaryDump, HexDump, or PictureDump, or redirect StdOut.println(cntdigits that each represent + " bits");4 bits. The client PictureDumpsee displays them withthe bits our in standard a Picture . it to a file. } You can download HexDump and PictureDump from the tools.booksite. For Typically,example, the we prouse -pip- } Q. Howing to and examine redirection the atcontents the command-line of a bitstream? Standardlevel when character working streamgram BinaryDumpwith binary files:at left we is Bitstreamcan represented with hex digits % more abra.txt BinaryStdIn % java HexDump 4 < abra.txt Printing a bitstreampipe on standardthe output (character) of an output encoder to BinaryDump, HexDumpa , or PictureDump client, or redirect that ABRACADABRA! prints out the bits from 41 42 52 41 it to a file. 43 41 44 41 standard input, encoded with the characters 0 and 1. This program is useful for debug- 42 52 41 21 Standard character stream BitstreamBitstream represented represented as 0 and with 1 characters hex digits ging when working with small inputs. We use a slightly more complicated version that 12 bytes % more abra.txt % java% BinaryDumpjava HexDump 16 <4 abra.txt< abra.txt just prints the count when the width argument0100000101000010 is 0 (see ). The similar ABRACADABRA! 41 42 52 Exercise41 5.5.X Bitstream represented as pixels in a Picture 0101001001000001 client HexDump groups the data into 8-bit bytes 43and 41 prints44 41 each as two hexadecimal % java PictureDump 16 6 < abra.txt 0100001101000001 digitsBitstream that represented each represent as 0 and 14 characters bits. The client PictureDump42 52 41 displays 21 the bits in a Picture. 0100010001000001 16-by-6 pixel 010000100101001012 bytes You% canjava download BinaryDump HexDump 16 < abra.txt and PictureDump from the booksite. Typically, we use pip- window, magnified 0100000100100001 ing0100000101000010 and redirection at the command-line level when working with binary files: we can 96 Bitstreambits represented as pixels in a Picture 96 bits pipe 00101001001000001 1the2 output3 4 5 6 of7 an8 9encoderA B C DtoE BinaryDumpF , HexDump, or PictureDump, or redirect % java PictureDump 16 6 Four< abra.txt ways to look at a bitstream it0 toNUL0100001101000001 SOHa file.STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI 1 DLE0100010001000001DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 16-by-6 pixel Standard2 SP0100001001010010! character" # $ % stream& ' ( ) * + , - . / Bitstream represented with hex digits window, magnified 0100000100100001 3% 0more1 2abra.txt3 4 5 6 7 8 9 : ; < = > ? % java HexDump 4 < abra.txt 96 bits 96 bits 4ABRACADABRA!@ A B C D E F G H I J K L M N O 41 42 52 41 5 P Q R S T U V W X Y Z [ \ ] ^Four_ ways to look43 at 41 a bitstream44 41 6 ` a b c d e f g h i j k l m n o Bitstream represented as 0 and 1 characters 42 52 41 21 7 p q r s t u v w x y z { | } ~ DEL 12 bytes % javaHe xBinaryDumpadecimal-to-AS C16II co n

Pied Piper. Claims 3.8:1 lossless compression of arbitrary data.

10 Universal data compression

Proposition. No algorithm can compress every bitstring. U Pf 1. [by contradiction] ・Suppose you have a universal data compression algorithm U U that can compress every bitstream. ・Given bitstring B0, compress it to get smaller bitstring B1. U

Compress B1 to get a smaller bitstring B2. . ・ . ・Continue until reaching bitstring of size 0. . ・Implication: all bitstrings can be compressed to 0 bits!

U

Pf 2. [by counting] U ・Suppose your algorithm that can compress all 1,000-bit strings. 21000 possible bitstrings with 1,000 bits. ・ U ・Only 1 + 2 + 4 + … + 2998 + 2999 can be encoded with ≤ 999 bits. ・Similarly, only 1 in 2499 bitstrings can be encoded with ≤ 500 bits! ⑀ Universal 11 data compression? Rdenudcany in Enlgsih lnagugae

Q. How much redundancy in the English language? A. Quite a bit.

“ ... randomising letters in the middle of words [has] little or no effect on the ability of skilled readers to understand the text. This is easy to denmtrasote. In a pubiltacion of New Scnieitst you could ramdinose all the letetrs, keipeng the first two and last two the same, and reibadailty

would hadrly be aftcfeed. My ansaylis did not come to much beucase the

thoery at the time was for shape and senqeuce retigcionon. Saberi’s work

sugsegts we may have some pofrweul palrlael prsooscers at work. The

resaon for this is suerly that idnetiyfing coentnt by paarllel prseocsing

speeds up regnicoiton. We only need the first and last two letetrs to spot chganes in meniang. ” — Graham Rawlinson

The gaol of data cmperisoson is to inetdify rdenudcany and epxloit it.

12 5.5 DATA COMPRESSION

‣ introduction ‣ run-length coding Algorithms ‣ Huffman compression ‣ LZW compression

ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu Run-length encoding (RLE)

Simple type of redundancy in a bitstream. Long runs of repeated bits.

0000000000000001111111000000011111111111 40 bits run of length 7 Representation. 4-bit counts to represent alternating runs of 0s and 1s: 15 0s, then 7 1s, then 7 0s, then 11 1s.

1111011101111011 16 bits (instead of 40) 15 7 7 11

Q. How many bits to store the counts? A. Typically 8 bits (but 4 on this slide for brevity).

Q. What to do when run length exceeds max count? A. Intersperse runs of length 0.

Applications. JPEG, ITU-T T4 Group 3 Fax, ...

14 Run-length encoding: Java implementation

public class RunLength { private static final int R = 256; maximum run-length count private static final int lgR = 8; number of bits per count

public static void compress() { /* see textbook */ }

public static void expand() { boolean bit = false; initial runs are 0 while (!BinaryStdIn.isEmpty()) { int run = BinaryStdIn.readInt(lgR); read 8-bit count from standard input for (int i = 0; i < run; i++) write run of 0s or 1s to standard output BinaryStdOut.write(bit); bit = !bit; flip bit (for next run) } BinaryStdOut.close(); pad 0s for byte alignment }

}

15 Data compression: quiz 1

What is the best compression ratio achievable from run-length encoding when using 8-bit counts?

A. 1 / 256

B. 1 / 16

C. 8 / 255

D. 1 / 8

E. 16 / 255

255n bits (input) 0000...01111...10000...01111...1... run of length 255 run of length 255 run of length 255 run of length 255

8n bits (compressed) 11111111111111111111111111111111... 25 255 255 255 255 16 5.5 DATA COMPRESSION

‣ introduction ‣ run-length coding Algorithms ‣ Huffman compression ‣ LZW compression

ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu

David Hu"man Variable-length codes

Key idea. Use different number of bits to encode different characters.

Ex. Morse code: ● ● ● ● ● ●

A N Issue. Ambiguity. B O C P SOS ? D Q VZE ? E R F S EEJIE ? codeword for S G T is a prefix of EEWNI ? H U codeword for V I V J W K X L Y M Z

In practice. Use a short gap to separate characters.

18 Variable-length codes

Q. How do we avoid ambiguity? A. Ensure that no codeword is a prefix of another.

Codeword table Trie representation key value Ex 1. Fixed-length code. ! 101 0 1 A 0 A Ex 2. Append special “stop” character to eachB codeword.1111 0 1 C 110 Ex 3. General prefix-free code. D 100 0 1 0 1 R 1110 D ! C 0 1 R B Compressed bitstring 011111110011001000111111100101 30 bits A B RA CA DA B RA !

Codeword table Trie representation Codeword table Trie representation key value key value 0 1 ! 101 ! 101 0 1 A 0 A A 11 0 1 B 1111 B 00 0 1 0 1 C 110 C 010 0 1 0 1 B A D 100 D 100 0 1 0 1 R 1110 D ! C 0 1 R 011 C R D ! R B Compressed bitstring Compressed bitstring 11000111101011100110001111101 011111110011001000111111100101 30 bits 29 bits A B R A C A D A B R A ! A B RA CA DA B RA ! Two prefx-free codes Codeword table Trie representation 19 key value ! 101 0 1 A 11 B 00 0 1 0 1 C 010 B A D 100 0 1 0 1 R 011 C R D !

Compressed bitstring 11000111101011100110001111101 29 bits A B R A C A D A B R A !

Two prefx-free codes Prefix-free codes: trie representation

Q. How to represent the prefix-free code? A. A binary trie! Characters in leaves. Codeword table Trie representation ・ key value Codeword is path from root to leaf. ! 101 0 1 ・ A 0 A B 1111 0 1 C 110 D 100 0 1 0 1 R 1110 D ! C 0 1 R B Compressed bitstring 011111110011001000111111100101 30 bits A B RA CA DA B RA !

Codeword table Trie representation Codeword table Trie representation key value key value 0 1 ! 101 ! 101 0 1 A 0 A A 11 0 1 B 1111 B 00 0 1 0 1 C 110 C 010 0 1 0 1 B A D 100 D 100 0 1 0 1 R 1110 D ! C 0 1 R 011 C R D ! R B Compressed bitstring Compressed bitstring 11000111101011100110001111101 011111110011001000111111100101 30 bits 29 bits A B R A C A D A B R A ! A B RA CA DA B RA ! Two prefx-free codes Codeword table Trie representation 20 key value ! 101 0 1 A 11 B 00 0 1 0 1 C 010 B A D 100 0 1 0 1 R 011 C R D !

Compressed bitstring 11000111101011100110001111101 29 bits A B R A C A D A B R A !

Two prefx-free codes Prefix-free codes: expansion

Expansion. ・Start at root. ・Go left if bit is 0; go right if 1. ・If leaf node, write character; return to root node; repeat.

11000111101011100110001111101

A B R A C A D A B R A !

0 1

0 1 0 1

B A

0 1 0 1

C R D !

21 Prefix-free codes: compression

Compression. ・Method 1: start at leaf; follow path up to the root; print bits in reverse. ・Method 2: create ST of key–value pairs.

0 1

0 1 0 1

B A

0 1 0 1

C R D !

22 Data compression: quiz 2

Consider the following trie representation of a prefix-free code. Expand the compressed bitstring 100101000111011 ?

S P E E D Y

A. PEED 0 1 B. PESDEY E C. SPED 0 1 D. SPEEDY D 0 1

S

0 1 P Y

23 Huffman coding overview

Static model. Use the same prefix-free code for all messages. Dynamic model. Use a custom prefix-free code for each message.

Compression. ・Read message. ・Build best prefix-free code for message. How? [ahead] ・Write prefix-free code. ・Compress message using prefix-free code.

Expansion. ・Read prefix-free code. ・Read compressed message and expand using prefix-free code.

24 Huffman trie node data type

private static class Node implements Comparable { private final char ch; // used only for leaf nodes private final int freq; // used only by compress() private final Node left, right;

public Node(char ch, int freq, Node left, Node right) { this.ch = ch; this.freq = freq; initializing constructor this.left = left; this.right = right; }

public boolean isLeaf() { return left == null && right == null; } is Node a leaf?

public int compareTo(Node that) compare nodes by frequency { return this.freq - that.freq; } (stay tuned)

}

25 Prefix-free codes: expansion

public void expand() { Node root = readTrie(); read encoding trie int n = BinaryStdIn.readInt(); read number of chars

for (int i = 0; i < n; i++) for each encoded character i {

Node x = root; follow path from root to leaf to determine character while (!x.isLeaf()) { if (!BinaryStdIn.readBoolean()) x = x.left; else x = x.right; } BinaryStdOut.write(x.ch, 8); write character (8 bits)

} BinaryStdOut.close(); }

Running time. Linear in input size (number of bits).

26 ENCODING THE TRIE

Q. How to transmit the trie? A. Write preorder traversal; mark leaf nodes and internal nodes with a bit.

0 for internal nodes 1 for leaf nodes preorder traversal 1

A 2

3 5

D 4 R B

! C

leaves A D ! C R B 01010000010010100010001000010101010000110101010010101000010

1 2 3 4 5 internal nodes Using preorder traversal to encode a trie as a bitstream

27

! Huffman codes

Q. How to find best prefix-free code?

Huffman algorithm: ・Count frequency freq[i] for each char i in input. ・Start with one node corresponding to each char i (with weight freq[i]). ・Repeat until single trie formed: – select two tries with min weight freq[i] and freq[j] – merge into single trie with weight freq[i] + freq[j]

Applications:

28 Constructing a Huffman encoding trie: Java implementation

private static Node buildTrie(int[] freq) { MinPQ pq = new MinPQ(); for (char i = 0; i < R; i++) initialize PQ with if (freq[i] > 0) singleton tries pq.insert(new Node(i, freq[i], null, null));

merge two while (pq.size() > 1) smallest tries { Node x = pq.delMin(); Node y = pq.delMin(); Node parent = new Node('\0', x.freq + y.freq, x, y);

pq.insert(parent); }

return pq.delMin(); not used for total frequency two subtries internal nodes }

29 Huffman compression summary

Proposition. Huffman’s algorithm produces an optimal prefix-free code.

Pf. See textbook. no prefix-free code uses fewer bits

Two-pass implementation (for compression). ・Pass 1: tabulate character frequencies; build trie. ・Pass 2: encode file by traversing trie (or symbol table).

Running time (for compression). Using a binary heap ⇒ n + R log R .

Running time (for expansion). Using a binary trie ⇒ n . input alphabet size size

Q. Can we do better (in terms of compression ratio)? [stay tuned]

30 5.5 DATA COMPRESSION

‣ introduction ‣ run-length coding Algorithms ‣ Huffman compression ‣ LZW compression

ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu

Abraham Lempel Jacob Ziv Terry Welch Statistical methods

Static model. Same model for all texts. ・Fast. ・Not optimal: different texts have different statistical properties. ・Ex: ASCII, Morse code.

Dynamic model. Generate model based on text. ・Preliminary pass needed to generate model. ・Must transmit the model. ・Ex: Huffman code.

Adaptive model. Progressively learn and update model as you read text. ・More accurate modeling produces better compression. ・Decoding must start from beginning. ・Ex: LZW.

32 LZW compression demo (for 7-bit chars and 8-bit codewords)

input A B R A C A D AA B RR A BB RR AA BB RR AA matches A B R A C A D A B R A B R A B R A

value 41 42 52 41 43 41 44 81 83 82 88 41 80

LZW compression for A B R A C A D A B R A B R A B R A

Input. 7-bit ASCII chars. key value key value key value ・ ・ ASCII 'A' is 4116. ⋮ ⋮ AB 81 DA 87 A 41 BR 82 ABR 88 Codeword table. ・ 8-bit codewords. B 42 RA 83 RAB 89 ・ Codewords for single C 43 AC 84 BRA 8A chars are ASCII values. ・ Use codewords 8116 to D 44 CA 85 ABRA 8B FF16 for multiple chars. ⋮ ⋮ AD 86 ・ Stop symbol = 8016.

codeword table 33 LZW expansion demo (for 7-bit chars and 8-bit codewords)

value 41 42 52 41 43 41 44 81 83 82 88 41 80 output A B R A C A D A B R A B R A B R A

LZW expansion for 41 42 52 41 43 41 44 81 83 82 88 41 80

Input. 7-bit ASCII chars. key value key value key value ・ ・ ASCII 'A' is 4116. ⋮ ⋮ 81 AB 87 DA 41 A 82 BR 88 ABR Codeword table. ・ 8-bit codewords. 42 B 83 RA 89 RAB ・ Codewords for single 43 C 84 AC 8A BRA chars are ASCII values. ・ Use codewords 8116 to 44 D 85 CA 8B ABRA FF16 for multiple chars. ⋮ ⋮ 86 AD ・ Stop symbol = 8016. codeword table 34 Data compression: quiz 3

Which is the LZW compression for ABABABA ?

A. 41 42 41 42 41 42 80

B. 41 42 41 81 81 80

C. 41 42 81 81 41 80

D. 41 42 81 83 80

35 Data compression: quiz 4

Which is key data structure to implement LZW compression e$ciently?

A. array

B. red–black BST

C. hash table

key operation: find longest string in ST D. trie that is a prefix of unscanned part of input

37 Implementing LZW compression: longest prefix match

Find longest key in symbol table that is a prefix of query string.

input … RABRABRABROCCOLI… value 89 82 88 88

41 A B 42 C 43 D 44 R 52

81 B 84 C D 86 R 82 A 85 A 87 A 83

88 R A 8A B 89

8B A 38 LZW tricky case: expansion

value 41 42 81 83 80 need to know code for 83 output A B A B A B xA before it is in codeword table!

LZW expansion for 41 42 81 83 80 we can deduce that the code for 83 is ABx for some character x

key value key value now, we have deduced x!

⋮ ⋮ 81 AB

41 A 82 BA

42 B 83 AB?xA

43 C

44 D

⋮ ⋮

codeword table 39 LZW in the real world

Lempel–Ziv and friends. ・LZ77. ・LZ78. ・LZW. ・Deflate / zlib = LZ77 variant + Huffman.

Unix compress, GIF, TIFF, V.42bis modem: LZW. previously under patent zip, 7zip, gzip, jar, png, : deflate / zlib. not patented iPhone, , Apache HTTP server: deflate / zlib. (widely used in open source)

40 Lossless data compression benchmarks

year scheme bits / char

1967 ASCII 7

1950 Hu"man 4.7

1977 LZ77 3.94

1984 LZMW 3.32

1987 LZH 3.3

1987 move-to-front 3.24

1987 LZB 3.18

1987 gzip 2.71

1988 PPMC 2.48

1994 SAKDC 2.47

1994 PPM 2.34

1995 Burrows-Wheeler 2.29 next programming assignment

1997 BOA 1.99

1999 RK 1.89

data compression using Calgary corpus 41 Data compression summary

Lossless compression. ・Represent fixed-length symbols with variable-length codes. [Huffman] ・Represent variable-length symbols with fixed-length codes. [LZW]

Lossy compression. [not covered in this course]

JPEG, MPEG, MP3, … n 1 ・ 1 X = x cos i + k FFT/DCT, wavelets, fractals, … k i n 2 ・ i=0

n Theoretical limits on compression. Shannon entropy: H(X)= p(xi)lgp(xi) i X

Practical compression. Exploit extra knowledge whenever possible.

42