Algorithms ROBERT SEDGEWICK | KEVIN WAYNE

The smallest number requiring nine words to express. ・What is it? – Nine hundred ninety nine million nine thousand ninety nine? – One trillion to the power of one trillion factorial? 5.5

‣ introduction ‣ run-length coding Algorithms ‣ Huffman compression FOURTH EDITION ‣ LZW compression ‣ ROBERT SEDGEWICK | KEVIN WAYNE

http://algs4.cs.princeton.edu

1

Data compression

Compression reduces the size of a file: ・To save space when storing it. ・To save time when transmitting it. ・Most files have lots of redundancy. 5.5 DATA COMPRESSION Who needs compression? ‣ introduction ・Moore's law: # transistors on a chip doubles every 18–24 months. ‣ run-length coding ・Parkinson's law: data expands to fill space available. ・Text, images, sound, , … Algorithms ‣ Huffman compression ‣ LZW compression “ Everyday, we create 2.5 quintillion of data—so much that ‣ Kolmogorov Complexity 90% of the data in the world today has been created in the last ROBERT SEDGEWICK | KEVIN WAYNE two years alone. ” — IBM report on big data (2011) http://algs4.cs.princeton.edu

Basic concepts ancient (1950s), best technology recently developed.

4 Applications Way back in the day

Generic file compression. ・Polybius, 200 BC ・Files: , BZIP, . ・Archivers: PKZIP. ・File systems: NTFS, HFS+, ZFS.

Multimedia. ・Images: GIF, JPEG. ・Sound: MP3. ・Video: MPEG, DivX™, HDTV.

Communication. ・ITU-T T4 Group 3 Fax. ・V.42bis modem. ・Skype.

http://people.seas.harvard.edu/~jones/cscie129/images/history/encoding.html Databases. Google, Facebook, ....

5 6

Semaphore telegraph These days

Every sensible person will agree that a single man in a single day could, without interference, cut all the electrical wires terminating in Paris; it is obvious that a single man could sever, in ten places, in the course of a day, the electrical wires of a particular line of communication, without being stopped or even Now recognized. Video [Clearly, no serious competition could be expected from] Uncompressed 1080p, 100 megabytes per second “a few wretched wires.’’ ・ 1080p on Netflix, ~0.5 megabytes per second ” — Dr. Jules Guoyot (July 5th, 1841) ・

vs

The Claude Chappe system (1792) Then: 1 symbol per minute 7 8 and expansion Data representation: genomic code

Message. Binary data B we want to . Genome. String over the alphabet { A, C, T, G }. Compress. Generates a "compressed" representation C (B). Expand. Reconstructs original bitstream B. Goal. Encode an N-character genome: ATAGATGCATAG... uses fewer bits (you hope)

Standard ASCII encoding. Two-bit encoding. Compress Expand 8 bits per char. 2 bits per char. bitstream B compressed version C(B) original bitstream B ・ ・ 0110110101... 1101011111... 0110110101... ・8 N bits. ・2 N bits.

char hex binary char binary Basic model for data compression A 41 01000001 A 00 C 43 01000011 C 01 T 54 01010100 T 10 G 47 01000111 G 11 Compression ratio. Bits in C (B) / bits in B.

Ex. 50–75% or better compression ratio for natural language. 664 AGCHAPTER 6 Strings AG AG 0100000101000111 0011 0100000101000111

9 Binary input and output. Most systems nowadays, including Java, base their I/O on 10 664 CHAPTER8-bit bytestreams, 6 Strings so we might decide to read and write bytestreams to match I/O for- mats with the internal representations of primitive types, encoding an 8-bit char with 1 , a 16-bit short with 2 bytes, a 32-bit int with 4 bytes, and so forth. Since bit- streams are the primary abstraction for data compression, we go a bit further to allow Binary input and output. Most systems nowadays, including Java, base their I/O on clients to read and write individual bits, intermixed with data of various types (primi- 8-bit bytestreams, so we might decide to read and write bytestreams to match I/O for- Data representation: genomic code Reading andtive writing types and Stringbinary). The datagoal is to minimize the necessity for type conversion in mats with the internal representations of primitive types, encoding an 8-bit char with client programs and also to take care of operating-system conventions for representing 1 byte, a 16-bit short with 2 bytes, a 32-bit int with 4 bytes, and so forth. Since bit- data.We use the following API for reading a bitstream from standard input: streams are the primary abstraction for data compression, we go a bit further to allow Genome. String over the alphabet { A, C, T, G }. Binary standard input and standard output. Libraries to read and write bits clientspublic to readclass and BinaryStdIn write individual bits, intermixed with data of various types (primi- tive types and String). The goal is to minimize the necessity for type conversion in from standardboolean input readBoolean()and to standard output. boolean client programs and also to take careread of 1operating-system bit of data and return conventions as a for valuerepresenting Goal. Encode an N-character genome: ATAGATGCATAG... data.Wechar use thereadChar() following API for readingread 8 bits a bitstream of data and from return standard as a char input: value

publicchar classreadChar(int BinaryStdIn r) read r bits of data and return as a char value [similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)] Standard ASCII encoding. Two-bit encoding. boolean readBoolean() read 1 bit of data and return as a boolean value booleanchar readChar()isEmpty() readis the 8 bitstreambits of data empty? and return as a char value 8 bits per char. 2 bits per char. void close() ・ ・ char readChar(int r) readclose r the bits bitstream of data and return as a char value 8 N bits. 2 N bits. API for static methods that read from a bitstream on standard input ・ ・ [similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)] Aboolean key featureisEmpty() of the abstraction is isthat, the bitstreamin marked empty? constrast to StdIn, the data on stan- char hex binary char binary dard voidinput isclose() not necessarily alignedclose on thebyte bitstream boundaries. If the input stream is a single byte, a client could read it 1 bit at a time with 8 calls to readBoolean(). The close() A 41 01000001 A 00 API for static methods that read from a bitstream on standard input method is not essential, but, for clean termination, clients should call close() to in- C 43 01000011 C 01 dicate that no more bits are to be read. As with StdIn/StdOut, we use the following A key feature of the abstraction is that, in marked constrast to StdIn, the data on stan- complementary API for writing bitstreams to standard output: T 54 01010100 T 10 dard input is not necessarily aligned on byte boundaries. If the input stream is a single byte,public a client class could BinaryStdOut read it 1 bit at a time with 8 calls to readBoolean(). The close() G 47 01000111 G 11 method is not essential, but, for clean termination, clients should call close() to in- void write(boolean b) write the specifed bit dicate that no more bits are to be read. As with StdIn/StdOut, we use the following complementaryvoid write(char API for writing c) bitstreamswrite to the standard specifed output:8-bit char

publicvoid classwrite(char BinaryStdOut c, int r) write the r least signifcant bits of the specifed char k Fixed-length code. k-bit code supports alphabet of size 2 . [similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)] void write(boolean b) write the specifed bit void close() close the bitstream Amazing but true. Initial genomic databases in 1990s used ASCII. void write(char c) write the specifed 8-bit char API for static methods that write to a bitstream on standard output 11 void write(char c, int r) write the r least signifcant bits of the specifed char 12 [similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)] void close() close the bitstream API for static methods that write to a bitstream on standard output 628 CHAPTER 5 Strings

to open a file with an edi- public class BinaryDump { tor or view it in the manner public static void bits(String[] args) you view text files (or just { run a program that uses int width = Integer.parseInt(args[0]); BinaryStdOut), you are int cnt; likely to see gibberish, de- for (cnt = 0; !BinaryStdIn.isEmpty(); cnt++) pending on the system you { use. BinaryStdIn allows if (cnt % width == 0) StdOut.println(); us to avoid such system de- if (BinaryStdIn.readBoolean()) StdOut.print("1"); pendencies by writing our else StdOut.print("0"); own programs to convert } bitstreams such that we can StdOut.println(cnt + " bits"); see them with our standard } tools. For example, the pro- } gram BinaryDump at left is BinaryStdIn Printing a bitstream on standard (character) output a client that prints out the bits from standard input, encoded with the characters 0 and 1. This program is useful for debug- ging when working with small inputs. We use a slightly more complicated version that Writing binary data Binaryjust dumps prints the count when the width argument is 0 (see Exercise 5.5.X). The similar client HexDump groups the data into 8-bit bytes and prints each as two hexadecimal digits that each represent 4 bits. The client PictureDump displays the bits in a Picture. Date representation. Three different ways to represent 12/31/1999. Q. HowYou to can examine download HexDumpthe contents and PictureDump of a bitstream? from the booksite. Typically, we use pip- ing and redirection at the command-line level when working with binary files: we can pipe the output of an encoder to BinaryDump, HexDump, or PictureDump, or redirect it to a file. A character stream (StdOut) A character stream (StdOut) Standard character stream Bitstream represented with hex digits A character stream (StdOut) StdOut.print(month + "/" + day + "/" + year); StdOut.print(month + "/" + day + "/" + year); % more abra.txt % java HexDump 4 < abra.txt StdOut.print(month + "/" + day + "/" + year); ABRACADABRA! 41 42 52 41 00110001001100100010111100110111001100010010111100110001001110010011100100111001 00110001001100100010111100110111001100010010111100110001001110010011100100111001 43 41 44 41 001100010011001000101111001101110011000100101111001100010011100100111001001110011 2 / 3 1 / 1 9 9 9 42 52 41 21 1 2 / 3 1 / 1 9 9 9 80 bits Bitstream represented as 0 and 1 characters 1 2 / 3 1 / 1 9 9 9 80 bits Three ints (BinaryStdOut) 80 bits 12 bytes Three ints (BinaryStdOut) % java BinaryDump 16 < abra.txt Three ints (BinaryStdOut) BinaryStdOut.write(month); 0100000101000010 Bitstream represented as pixels in a Picture BinaryStdOut.write(month); 0101001001000001 BinaryStdOut.write(month); BinaryStdOut.write(day); % java PictureDump 16 6 < abra.txt BinaryStdOut.write(day); BinaryStdOut.write(year);BinaryStdOut.write(day); 0100001101000001 0100010001000001 BinaryStdOut.write(year); BinaryStdOut.write(year); 16-by-6 pixel 0100001001010010 window, magnified 000000000000000000000000000011000000000000000000000000000001111100000000000000000000011111001111 0100000100100001 6.5 Data Compression 667 000000000000000000000000000011000000000000000000000000000001111100000000000000000000011111001111 00000000000000000000000000001100000000000000000000000000000111110000000000000000000001111100111112 31 1999 96 bits 96 bits 96 bits 12 31 1999 12 31 1999 96 bits Two chars and a short (BinaryStdOut) 96A bits 4-bit feld, a 5-bit feld, and a 12-bit feld (BinaryStdOut) Four ways to look at a bitstream Two chars and a short (BinaryStdOut) A 4-bit feld, a 5-bit feld, and a 12-bit feld (BinaryStdOut) Two chars and a short (BinaryStdOut) ABinaryStdOut.write((char) 4-bit feld, a 5-bit feld, and a 12-bit feld (BinaryStdOut)month); BinaryStdOut.write(month, 4); ASCII encoding. When you HexDump a bit- 0 1 2 3 4 5 6 7 8 9 A B C D E F stream that contains ASCII-encoded charac- NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI BinaryStdOut.write((char) month); BinaryStdOut.write((char)BinaryStdOut.write(month, day);month);4); BinaryStdOut.write(day,BinaryStdOut.write(month, 5); 4); 0 ters, the table at right is useful for reference. 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US BinaryStdOut.write((char) day); BinaryStdOut.write((short)BinaryStdOut.write((char)BinaryStdOut.write(day, 5); day);year); BinaryStdOut.write(year,BinaryStdOut.write(day, 5);12); BinaryStdOut.write((short) year); BinaryStdOut.write(year, 12); Given a 2-digit hex number, use the first hex 2 SP ! “ # $ % & ‘ ( ) * + , - . / BinaryStdOut.write((short) year); BinaryStdOut.write(year, 12); digit as a row index and the second hex digit 3 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 00001100000111110000011111001111 110011111011111001111000 as a column reference to find the character 4 @ A B C D E F G H I J K L M N O 00001100000111110000011111001111 110011111011111001111000 00001100000111110000011111001111 110011111011111001111000 that it encodes. For example, 31 encodes the 12 31 1999 32 bits 12 31 1999 21 bits ( + 3 bits for byte alignment at close) 5 P Q R S T U V W X Y Z [ \ ] ^ _ 12 31 1999 12 31 1999 digit 1, 4A encodes the letter J, and so forth. 12 31 1999 12 31 1999 32 bits 21 bits ( + 3 bits for byte alignment at close) 6 ` a b c d e f g h i j k l m n o 32 bits 21 bits ( + 3 bits for byte alignment at close) This table is for 7-bit ASCII, so the first hex 7 p q r s t u v w x y z { | } ~ DEL Four ways to put a date onto standard output digit must be 7 or less. Hex numbers starting Four ways to put a date onto standard output Hexadecimal to ASCII conversion table Four ways to put a date onto standard output with 0 and 1 (and the numbers 20 and 7F) 13 correspond to non-printing control charac- 14 ters. Many of the control characters are left over from the days when physical devices like typewriters were controlled by ASCII input; the table highlights a few that you might see in dumps. For example SP is the space character, NUL is the null character, LF is line-feed, and CR is carriage-return.

In summaryRun-length, working with encoding data compression requires us to reorient our thinking about standard input and standard output to include binary encoding of data. BinaryStdIn and BinaryStdOut provide the methods that we need. They provide a way for you to make a clear distinction in your client programs between writing out information in- tendedSimple for file storage type and of data redundancy transmission (that will in be reada bitstream. by programs) and print Long- runs of repeated bits. ing information (that is likely to be read by humans). 0000000000000001111111000000011111111111

40 bits Representation. 4-bit counts to represent alternating runs of 0s and 1s: 5.5 DATA COMPRESSION 15 0s, then 7 1s, then 7 0s, then 11 1s.

1111011101111011 16 bits (instead of 40)

‣ introduction 15 7 7 11 ‣ run-length coding Q. How many bits to store the counts? ‣ Huffman compression Algorithms A. 4 bits (but we’ll use 8 bits in the code on the next slide) ‣ LZW compression ‣ Kolmogorov complexity Q. What to do when run length exceeds max count? ROBERT SEDGEWICK | KEVIN WAYNE

http://algs4.cs.princeton.edu A. If longer than 15, intersperse runs of length 0.

111100000001011101111011 If we had 16 1s at the front instead

15 0 1 7 7 11 Applications. JPEG, ITU-T T4 Group 3 Fax, ...

16 Run-length encoding: Java implementation An application: compress a bitmap

7 1s % java BinaryDump 32 < q32x48.bin Typical black-and-white-scanned image. 00000000000000000000000000000000 32 public class RunLength 00000000000000000000000000000000 32 00000000000000011111110000000000 15 7 10 00000000000011111111111111100000 12 15 5 { 300 pixels/inch. 00000000001111000011111111100000 10 4 4 9 5 ・ 00000000111100000000011111100000 8 4 9 6 5 private final static int R = 256; maximum run-length count 00000001110000000000001111100000 7 3 12 5 5 8.5-by-11 inches. 00000011110000000000001111100000 6 4 12 5 5 private final static int lgR = 8; ・ 00000111100000000000001111100000 5 4 13 5 5 number of bits per count 00001111000000000000001111100000 4 4 14 5 5 00001111000000000000001111100000 4 4 14 5 5 300 × 8.5 × 300 × 11 = 8.415 million bits. 00011110000000000000001111100000 3 4 15 5 5 ・ 00011110000000000000001111100000 2 5 15 5 5 public static void compress() 00111110000000000000001111100000 2 5 15 5 5 00111110000000000000001111100000 2 5 15 5 5 00111110000000000000001111100000 2 5 15 5 5 { /* see textbook */ } 00111110000000000000001111100000 2 5 15 5 5 00111110000000000000001111100000 2 5 15 5 5 Observation. Bits are mostly white. 00111110000000000000001111100000 2 5 15 5 5 00111110000000000000001111100000 2 5 15 5 5 00111110000000000000001111100000 2 5 15 5 5 public static void expand() 00111111000000000000001111100000 2 6 14 5 5 00111111000000000000001111100000 2 6 14 5 5 { 00011111100000000000001111100000 3 6 13 5 5 00011111100000000000001111100000 3 6 13 5 5 boolean bit = false; 00001111110000000000001111100000 4 6 12 5 5 00001111111000000000001111100000 4 7 11 5 5 00000111111100000000001111100000 5 7 10 5 5 while (!BinaryStdIn.isEmpty()) 00000011111111000000011111100000 6 8 7 6 5 00000001111111111111111111100000 7 20 5 { 00000000011111111111001111100000 9 11 2 5 5 00000000000011111000001111100000 22 5 5 Typical amount of text on a page. 00000000000000000000001111100000 22 5 5 int run = BinaryStdIn.readInt(lgR); read 8-bit count from standard input 00000000000000000000001111100000 22 5 5 00000000000000000000001111100000 22 5 5 for (int i = 0; i < run; i++) 40 lines × 75 chars per line = 3,000 chars. 00000000000000000000001111100000 22 5 5 00000000000000000000001111100000 22 5 5 BinaryStdOut.write(bit); 00000000000000000000001111100000 22 5 5 write 1 bit to standard output 00000000000000000000001111100000 22 5 5 00000000000000000000001111100000 22 5 5 bit = !bit; 00000000000000000000001111100000 22 5 5 00000000000000000000001111100000 22 5 5 } 00000000000000000000001111100000 22 5 5 00000000000000000000011111110000 21 7 4 00000000000000000011111111111100 18 12 2 BinaryStdOut.close(); pad 0s for byte alignment 00000000000000000111111111111110 17 14 1 00000000000000000000000000000000 32 } 00000000000000000000000000000000 32 1536 bits 17 0s } A typical bitmap, with run lengths for each row

17 18

Variable-length codes

Use different number of bits to encode different chars.

Ex. Morse code: • • • − − − • • •

5.5 DATA COMPRESSION Issue. Ambiguity. SOS ? ‣ introduction V7 ? ‣ run-length coding IAMIE ? EEWNI ? Algorithms ‣ Huffman compression ‣ LZW compression ‣ Kolmogorov complexity In practice. Use a medium gap to ROBERT SEDGEWICK | KEVIN WAYNE

http://algs4.cs.princeton.edu separate codewords. codeword for S is a prefix of codeword for V

David Hufman

20 Variable-length codes Prefix-free codes: trie representation

Q. How do we avoid ambiguity? Q. How to represent the prefix-free code? A. Ensure that no codeword is a prefix of another. A. A binary trie!

Codeword table Trie representation Chars in leaves. Codeword table Trie representation key value ・ key value Ex 1. Fixed-length code. ! 101 0 1 Codeword is path from root to leaf. ! 101 0 1 A 0 A ・ A 0 A Ex 2. Append special stop char to each codeword.B 1111 0 1 B 1111 0 1 C 110 C 110 Ex 3. General prefix-free code. D 100 0 1 0 1 D 100 0 1 0 1 R 1110 D ! C R 1110 D ! C 0 1 0 1 R B R B Compressed bitstring Compressed bitstring 011111110011001000111111100101 30 bits 011111110011001000111111100101 30 bits A B RA CA DA B RA ! A B RA CA DA B RA !

Codeword table Trie representation Codeword table Trie representation Codeword table Trie representation Codeword table Trie representation key value key value key value key value 0 1 0 1 ! 101 ! 101 0 1 ! 101 ! 101 0 1 A 0 A A 11 A 0 A A 11 B 0 1 B 0 1 1111 B 00 0 1 0 1 1111 B 00 0 1 0 1 C 110 C 010 C 110 C 010 0 1 0 1 B A 0 1 0 1 B A D 100 D 100 0 1 0 1 D 100 D 100 0 1 0 1 R 1110 D ! C R 1110 D ! C 0 1 R 011 C R D ! 0 1 R 011 C R D ! R B R B Compressed bitstring Compressed bitstring Compressed bitstring Compressed bitstring 011111110011001000111111100101 30 bits 11000111101011100110001111101 29 bits 011111110011001000111111100101 30 bits 11000111101011100110001111101 29 bits A B RA CA DA B RA ! A B R A C A D A B R A ! A B RA CA DA B RA ! A B R A C A D A B R A ! Two prefx-free codes Two prefx-free codes Codeword table Trie representation 21 Codeword table Trie representation 22 key value key value ! 101 0 1 ! 101 0 1 A 11 A 11 B 00 0 1 0 1 B 00 0 1 0 1 C 010 B A C 010 B A D 100 0 1 0 1 D 100 0 1 0 1 ExpansionR 011 C R D ! Prefix-freeR 011 codes: C compressionR D ! and expansion

Compressed bitstring Compressed bitstring 11000111101011100110001111101 29 bits Compression.11000111101011100110001111101 29 bits A B R A C A D A B R A ! A B R A C A D A B R A ! Method 1: start at leaf; follow path up to the root; print bits in reverse. Two prefx-free codes ・ Two prefx-free codes Method 2: create ST of key-value pairs.Codeword table Trie representation ・ key value ! 101 0 1 A 0 A Expansion. B 1111 0 1 C 110 Start at root. D 100 0 1 0 1 ・ R 1110 D ! C Go left if bit is 0; go right if 1. 0 1 ・ R B If leaf node, print char and return to root.Compressed bitstring ・ 011111110011001000111111100101 30 bits A B RA CA DA B RA !

Codeword table Trie representation Codeword table Trie representation key value key value 0 1 ! 101 ! 101 0 1 A 0 A A 11 pollEv.com/jhug text to 37607 B 0 1 1111 B 00 0 1 0 1 C 110 C 010 0 1 0 1 B A Q: Which string is encoded by the bitstream 011001010? D 100 D 100 0 1 0 1 R 1110 D ! C 0 1 R 011 C R D ! A. EDSEEP [313698] D. EDSEEPE [317827] R B B. YSPP [313889] E. None [787438] Compressed bitstring Compressed bitstring 011111110011001000111111100101 30 bits 11000111101011100110001111101 29 bits C. EDEEP [317826] A B RA CA DA B RA ! A B R A C A D A B R A ! Two prefx-free codes 23 Codeword table Trie representation 24 key value ! 101 0 1 A 11 B 00 0 1 0 1 C 010 B A D 100 0 1 0 1 R 011 C R D !

Compressed bitstring 11000111101011100110001111101 29 bits A B R A C A D A B R A !

Two prefx-free codes Huffman trie node data type

Dynamic modeling ・Use a different codebook for EVERY text. private static class Node implements Comparable { private final char ch; // used only for leaf nodes Interesting Tasks Required private final int freq; // used only for compress private final Node left, right; Compress ・ – Build trie codebook. public Node(char ch, int freq, Node left, Node right) { – Write trie codebook. this.ch = ch; – Compress input (using array codebook). this.freq = freq; initializing constructor this.left = left; ・Expand this.right = right; – Read trie codebook. }

public boolean isLeaf() – Expand input using trie codebook. is Node a leaf? { return left == null && right == null; }

public int compareTo(Node that) compare Nodes by frequency { return this.freq - that.freq; } (stay tuned) }

25 26

Prefix-free codes: expansion Prefix-free codes: how to transmit

public void expand() Q. How to write the trie? { A. Write preorder traversal of trie; mark leaf and internal nodes with a bit. Node root = readTrie(); read in encoding trie int N = BinaryStdIn.readInt(); read in number of chars

for (int i = 0; i < N; i++) { preorder private static void writeTrie(Node x) traversal 1 Node x = root; { while (!x.isLeaf()) expand codeword for ith char A 2 if (x.isLeaf()) { { if (!BinaryStdIn.readBoolean()) 3 5 BinaryStdOut.write(true); x = x.left; BinaryStdOut.write(x.ch, 8); D 4 R B else return; x = x.right; ! C } BinaryStdOut.write(false); } leaves writeTrie(x.left); BinaryStdOut.write(x.ch, 8); A D ! C R B writeTrie(x.right); } 01010000010010100010001000010101010000110101010010101000010 } BinaryStdOut.close(); 1 2 3 4 5 internal nodes } Using preorder traversal to encode a trie as a bitstream

Running time. Linear in input size N. Note. If message is long, overhead of transmitting trie is small.

27 28

! Prefix-free codes: how to transmit Huffman coding

Q. How to read in the trie? Interesting Tasks Required A. Reconstruct from preorder traversal of trie. ・Compress – Build trie codebook. – Write trie. – Compress input using codebook (codebook as symbol table). preorder private static Node readTrie() traversal 1 { Expand A 2 if (BinaryStdIn.readBoolean()) ・ { – Read trie. 3 5 char c = BinaryStdIn.readChar(8); return new Node(c, 0, null, null); – Expand input using trie. D 4 R B } ! C Node x = readTrie(); Node y = readTrie(); leaves return new Node('\0', 0, x, y); A D ! C R B } 01010000010010100010001000010101010000110101010010101000010 used only for

1 2 3 4 5 internal nodes leaf nodes Using preorder traversal to encode a trie as a bitstream

29 30

!

Shannon-Fano codes Huffman algorithm demo

char freq encoding Q. How to find best prefix-free code? ・Count frequency for each character in input. A B Shannon-Fano algorithm: C D Partition symbols S into two subsets S and S of (roughly) equal freq. ・ 0 1 R ・Codewords for symbols in S0 start with 0; for symbols in S1 start with 1. ! ・Recur in S0 and S1. input char freq encoding char freq encoding A B R A C A D A B R A ! A 5 0... B 2 1...

C 1 0... D 1 1...

R 2 1... S0 = codewords starting with 0

! 1 1...

S1 = codewords starting with 1

Problem 1. How to divide up symbols? Problem 2. Not optimal!

31 Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding ・Count frequency for each character in input. A 5 ・Start with one node corresponding to each character A 5 B 2 with weight equal to frequency. B 2 C 1 C 1 D 1 D 1 R 2 R 2 ! 1 ! 1

input A B R A C A D A B R A !

1 ! 1 C 1 D 2 R 2 B 5 A

Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding ・Select two tries with min weight. A 5 ・Select two tries with min weight. A 5 ・Merge into single trie with cumulative weight. B 2 ・Merge into single trie with cumulative weight. B 2 C 1 C 1 D 1 D 1 R 2 R 2 ! 1 ! 1

1 ! 1 C 1 D 2 R 2 B 5 A 1 ! 1 C 1 D 2 R 2 B 5 A Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding ・Select two tries with min weight. A 5 ・Select two tries with min weight. A 5 ・Merge into single trie with cumulative weight. B 2 ・Merge into single trie with cumulative weight. B 2 C 1 1 C 1 1 D 1 D 1 R 2 R 2 ! 1 0 ! 1 0

2

0 1

1 ! 1 C 2

0 1

1 D 2 R 2 B 5 A 1 D 1 ! 1 C 2 R 2 B 5 A

Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding ・Select two tries with min weight. A 5 ・Select two tries with min weight. A 5 ・Merge into single trie with cumulative weight. B 2 ・Merge into single trie with cumulative weight. B 2 C 1 1 C 1 1 D 1 D 1 R 2 R 2 ! 1 0 ! 1 0

2 2

0 1 0 1

1 D ! C 2 R 2 B 5 A 1 D ! C 2 R 2 B 5 A Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding ・Select two tries with min weight. A 5 ・Select two tries with min weight. A 5 ・Merge into single trie with cumulative weight. B 2 ・Merge into single trie with cumulative weight. B 2 C 1 1 1 C 1 1 1 D 1 0 D 1 0 R 2 R 2 ! 1 1 0 ! 1 1 0

3

0 1 3 1 D 2 0 1 0 1 1 D 2 ! C 0 1

2 R 2 B 5 A 2 R 2 B ! C 5 A

Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding ・Select two tries with min weight. A 5 ・Select two tries with min weight. A 5 ・Merge into single trie with cumulative weight. B 2 ・Merge into single trie with cumulative weight. B 2 C 1 1 1 C 1 1 1 pollEv.com/jhug text to 37607 D 1 0 pollEv.com/jhug text to 37607 D 1 0 Q: What will be the weight of the new trie? R 2 Q: What will be the weight of the new trie? R 2 A. 2 [787387] D. 5 [787452] ! 1 1 0 C. 4 ! 1 1 0 B. 3 [787388] E. 6 [787453] C. 4 [787389]

3 3

0 1 0 1

D D

0 1 0 1

2 R 2 B ! C 5 A 2 R 2 B ! C 5 A Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding ・Select two tries with min weight. A 5 ・Select two tries with min weight. A 5 ・Merge into single trie with cumulative weight. B 2 ・Merge into single trie with cumulative weight. B 2 1 C 1 1 1 C 1 1 1 D 1 0 D 1 0 R 2 R 2 0 ! 1 1 0 ! 1 1 0

4 3 3 0 1 0 1 0 1 2 R 2 B D D

0 1 0 1

2 R 2 B ! C 5 A ! C 5 A

Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding ・Select two tries with min weight. A 5 ・Select two tries with min weight. A 5 ・Merge into single trie with cumulative weight. B 2 1 ・Merge into single trie with cumulative weight. B 2 1 C 1 1 1 C 1 1 1 D 1 0 D 1 0 R 2 0 R 2 0 ! 1 1 0 ! 1 1 0

3 3

0 1 0 1

D 4 D 4

0 1 0 1 0 1 0 1

! C 2 R 2 B 5 A ! C R B 5 A Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding ・Select two tries with min weight. A 5 ・Select two tries with min weight. A 5 ・Merge into single trie with cumulative weight. B 2 1 1 ・Merge into single trie with cumulative weight. B 2 1 1 C 1 0 1 1 C 1 0 1 1 D 1 0 0 D 1 0 0 R 2 1 0 R 2 1 0 ! 1 0 1 0 ! 1 0 1 0

7 7 0 1 0 1

3 4 3 4 0 1 0 1 0 1 0 1 D R B D R B 0 1 0 1 ! C 5 A 5 A ! C

Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding ・Select two tries with min weight. A 5 ・Select two tries with min weight. A 5 0 ・Merge into single trie with cumulative weight. B 2 1 1 ・Merge into single trie with cumulative weight. B 2 1 1 1 C 1 0 1 1 C 1 1 0 1 1 D 1 0 0 D 1 1 0 0 R 2 1 0 R 2 1 1 0 12 ! 1 0 1 0 ! 1 1 0 1 0 0 1

7 5 A 7

0 1 0 1

0 1 0 1 0 1 0 1

D R B D R B

0 1 0 1

5 A ! C ! C Huffman algorithm demo Huffman algorithm demo

char freq encoding char freq encoding A 5 0 A 5 0 B 2 1 1 1 B 2 1 1 1 C 1 1 0 1 1 C 1 1 0 1 1 D 1 1 0 0 D 1 1 0 0 R 2 1 1 0 R 2 1 1 0 ! 1 1 0 1 0 ! 1 1 0 1 0 0 1 0 1

A A

0 1 0 1

0 1 0 1 0 1 0 1

D R B D R B

0 1 0 1 ! C ! C

Huffman codes - the most singular moment of one man’s life Constructing a Huffman encoding trie: Java implementation

Q. How to find best prefix-free code?

private static Node buildTrie(int[] freq) Huffman algorithm: { MinPQ pq = new MinPQ(); Count frequency freq[i] for each char i in input. ・ for (char i = 0; i < R; i++) initialize PQ with ・Start with one node corresponding to each char i (with weight freq[i]). if (freq[i] > 0) singleton tries ・Repeat until single trie formed: pq.insert(new Node(i, freq[i], null, null)); – select two tries with min weight freq[i] and freq[j] while (pq.size() > 1) – merge into single trie with weight freq[i] + freq[j] { merge two Node x = pq.delMin(); smallest tries Node y = pq.delMin(); Node parent = new Node('\0', x.freq + y.freq, x, y); Applications: pq.insert(parent); }

return pq.delMin(); not used for total frequency two subtries } internal nodes

55 56 Huffman encoding summary

Proposition. [Huffman 1950s] Huffman algorithm produces an optimal prefix-free code.

Pf. See textbook. no prefix-free code uses fewer bits

5.5 DATA COMPRESSION Implementation. ・Pass 1: tabulate char frequencies and build trie. ‣ introduction ・Pass 2: encode file by traversing trie or lookup table. ‣ run-length coding ‣ Huffman compression Running time. Using a binary heap ⇒ N + R log R . Algorithms ‣ LZW compression input alphabet ‣ Kolmogorov complexity size size ROBERT SEDGEWICK | KEVIN WAYNE

http://algs4.cs.princeton.edu

Q. Can we do better? [stay tuned] Abraham Lempel Jacob Ziv

57

Statistical methods LZW compression example

Static model. Same model for all texts. input A B R A C A D A B R A B R A B R A Fast. ・ matchesA B R A C A D A B R A B R A B R A Not optimal: different texts have different statistical properties. ・ value 41 42 52 41 43 41 44 81 83 82 88 41 80 ・Ex: ASCII, Morse code. LZW compression for A B R A C A D A B R A B R A B R A Dynamic model. Generate model based on text. ・Preliminary pass needed to generate model. ・Must transmit the model. ・Ex: Huffman code. key value key value key value ⋮ ⋮ AB 81 DA 87

Adaptive model. Progressively learn and update model as you read text. A 41 BR 82 ABR 88 More accurate modeling produces better compression. ・ B 42 RA 83 RAB 89 Decoding must start from beginning. ・ C 43 AC 84 BRA 8A Ex: LZW. ・ D 44 CA 85 ABRA 8B

⋮ ⋮ AD 86

59 codeword table 60 Lempel-Ziv-Welch compression LZW expansion example

LZW compression. Create ST associating W-bit codewords with string keys. ・ value 41 42 52 41 43 41 44 81 83 82 88 41 80 Initialize ST with codewords for single-char keys. ・ output A B R A C A D A B ・Find longest string s in ST that is a prefix of unscanned part of input. Write the W-bit codeword associated with s. ・ LZW expansion for 41 42 52 41 43 41 44 81 83 82 88 41 80 ・Add s + c to ST, where c is next char in the input.

Q. How to represent LZW compression code table? A. A trie to support longest prefix match. key value key value key value A 41 B 42 C 43 D 44 R 52 ⋮ ⋮ 81 AB 87 DA

key value 41 A 82 BR B 81 C 84 D 86 R 82 A 85 A 87 A 83 Q: Next two entries? 42 B 83 RA AB 81 88: AB, 89: BR [787394] BR 82 43 C 84 AC 88: AB, 89: BRA [787396] R 88 A 8A B 89 88: ABR, 89: BRA [787397] RA 83 44 D 85 CA 88: ABR, 89: RAB [787398] longest prefix match AC 84 A 8B ⋮ ⋮ 86 AD

codeword table R A B R A B R A 61 pollEv.com/jhug text to 3760762

LZW expansion example LZW expansion

LZW expansion. key value Create ST associating string values with W-bit keys. ⋮ ⋮ value 41 42 52 41 43 41 44 81 83 82 88 41 80 ・ Initialize ST to contain single-char values. 65 A output A B R A C A D A B R A B R A B R A ・ 66 B Read a W-bit key. ・ 67 C Find associated string value in ST and write it out. LZW expansion for 41 42 52 41 43 41 44 81 83 82 88 41 80 ・ 68 D ・Update ST. ⋮ ⋮ 129 AB Q. How to represent LZW expansion code table? 130 BR 131 RA key value key value key value A. An array of size 2W. 136 ABR 132 AC ⋮ ⋮ 81 AB 87 DA 133 CA

41 A 82 BR 88 ABR 134 AD

42 B 83 RA 89 RAB 135 DA 136 ABR 43 C 84 AC 8A BRA 137 RAB

44 D 85 CA 8B ABRA 138 BRA

139 ABRA ⋮ ⋮ 86 AD ⋮ ⋮ D. 88: ABR, 89: RAB codeword table 63 64 LZW example: tricky case LZW example: tricky case

A B A B A B A input value 41 42 81 83 80 need to know which A B A B A B A matches output A B A B A B A key has value 83 before it is in ST! value 41 42 81 83 80

LZW expansion for 41 42 81 83 80 LZW compression for ABABABA

key value key value key value key value

⋮ ⋮ AB 81 ⋮ ⋮ 81 AB

A 41 BA 82 41 A 82 BA

B 42 ABA 83 42 B 83 ABA

C 43 43 C

D 44 44 D

⋮ ⋮ ⋮ ⋮

codeword table codeword table 65 66

LZW implementation details LZW in the real world

How big to make ST? Lempel-Ziv and friends. ・How long is message? ・LZ77. LZ77 not patented ⇒ widely used in open source ・Whole message similar model? ・LZ78. LZW patent #4,558,302 expired in U.S. on June 20, 2003 ・[many other variations] ・LZW. ・ / zlib = LZ77 variant + Huffman. What to do when ST fills up? ・Throw away and start over. [GIF] ・Throw away when not effective. [Unix compress] ・[many other variations]

Why not put longer substrings in ST? ・[many variations have been developed]

67 68 LZW in the real world Lossless data compression benchmarks

Lempel-Ziv and friends. year scheme bits / char ・LZ77. 1967 ASCII 7.00 ・LZ78. 1950 Huffman 4.70 ・LZW. 1977 LZ77 3.94 Deflate / zlib = LZ77 variant + Huffman. ・ 1984 LZMW 3.32 1987 LZH 3.30

1987 move-to-front 3.24

1987 LZB 3.18 Unix compress, GIF, TIFF, V.42bis modem: LZW. 1987 gzip 2.71 , 7zip, gzip, , png, pdf: deflate / zlib. 1988 PPMC 2.48 iPhone, Sony Playstation 3, Apache HTTP server: deflate / zlib. 1994 SAKDC 2.47

1994 PPM 2.34

1995 Burrows-Wheeler 2.29 Coursera programming assignment

1997 BOA 1.99

2010 PAQ 1.30

data compression using 69 70

Universal data compression

Proposition. No algorithm can compress every bitstring.

U

Pf. [by contradiction] ・Suppose you have a universal data compression algorithm U U 5.5 DATA COMPRESSION that can compress every bitstream. U ・Given bitstring B0, compress it to get smaller bitstring . . ‣ introduction Compress B1 to get a smaller bitstring B2. . ・ . ‣ run-length coding ・Continue until reaching bitstring of size 0. Implication: all bitstrings can be compressed to 0 bits! ・ U Algorithms ‣ Huffman compression ‣ LZW compression U ‣ Kolmogorov complexity ROBERT SEDGEWICK | KEVIN WAYNE http://algs4.cs.princeton.edu U

Universal data compression?

Andrej Kolmogorov 72 Compression This plant

42 4d 7a 00 10 00 00 00 00 00 7a Compression likelihood 00 00 00 6c 00 00 00 00 02 00 00 00 02 00 00 01 00 20 00 03 00 00 00 00 00 10 00 12 0b 00 00 12 0b How many different bitstreams are there with 5 bits? 00 00 00 00 00 00 00 00 00 00 00 ・ 00 ff 00 00 ff 00 00 ff 00 00 00 5 00 00 00 ff 01 00 00 00 00 00 00 – 2 =32 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 00 ... ・How many of these sequences can be uniquely represented with EXACTLY 4 bits? Total: 8389584 bits – 24=16 ・What fraction of our sequences can be represented with EXACTLY 2 bits? – 22/25 = 1/8 ・What fraction can be represented with 3 bits or fewer? – Number of 3 bit or fewer sequences: 1+2+22+23 = 24 - 1 Run length encoding: – Fraction: 15/32 4224800 bits ・What fraction of the sequences of N bits can be represented by N/2 bits? – Approximately: 2N/2+1 / 2N – 1 out of 2N/2+1 – For N=1000, this is 1 out of 2501

73 74

hugPlant.bmp hugPlant.huf Decoding

42 4d 7a 00 10 00 00 00 00 00 7a 00 00 00 6c 00 00 00 00 02 00 00 00 02 00 00 01 00 20 00 03 00 00 00 00 00 10 00 12 0b 00 00 12 0b 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 00 ff 00 00 ff 00 00 00 00 00 00 ff 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 00 ... Total: 8389584 bits

Huffman Total: 1994024 bits

00 46 6d 4a bd f8 b1 d2 9b 5c 50 e7 dc 10 cc 21 00 20 00 f4 c3 b7 6d c2 31 24 92 dc 24 a7 c9 25 f9 ba d9 c2 40 8e 6f 99 cc 81 85 de 20 39 b1 fa ae 24 b5 c4 85 88 40 be c4 92 46 25 79 2f c4 af d7 2e 57 62 b7 8e 5c 51 f9 70 55 1b 7c ba 68 bc 25 f8 92 49 24 92 64 c9 92 49 30 b1 24 92 49 24 e9 07 af 4e d0 6f cb be 09 65 75 7a c0 9a fa a2 2c 49 24 92 49 0b 12 49 24 92 42 c4 92 49 24 92 09 eb cd d4 2a 55 9f d8 98 d1 4e e7 97 56 58 68 49 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 0c 7a 43 dd 80 00 7b 11 58 f4 75 73 77 bc 26 01 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff e0 92 28 ef 47 24 66 9b de 8b 25 04 1f 0e 87 bd 25% ratio! 87 9e 03 c9 f1 cf ad fa 82 dc 9f a1 31 b5 79 13 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 9b 95 d5 63 26 8b 90 5e d5 b0 17 fb e9 c0 e6 53 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff c7 cb dd 5f 77 d3 bd 80 f9 b6 5e 94 aa 74 34 3a ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff a9 c1 ca e6 b8 9c 60 ab 36 3b a5 8a b4 3a 5c 5a ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 62 e9 2f 16 4c 34 60 6e 51 28 36 2c e7 4e 50 be ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff c0 15 1b 01 d9 c0 bd b4 20 87 42 be d4 e2 23 a2 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff b6 84 22 4c cf 74 cd 4f 23 06 54 e6 c2 0f 2d bd ff ff ff ff ff ff ff ff ff ff ff ff ... e5 81 f4 c6 de 15 59 f1 68 a4 a5 88 16 b0 7f bf 8a 1d 98 bd 33 b4 d5 71 22 93 81 af e0 cc ce 12 57 23 62 3a e4 3d 8c f1 12 8d a5 40 3b 70 d6 9b Image data: 1991432 bits 12 49 62 8d 6f d4 52 f6 7f d5 11 7c ca 07 dd e3 ??? dc 1c 7f c4 a4 69 77 6e 5e 60 db 5a 69 01 95 c8 + 32 bits for length cc 16 b4 0e 64 34 68 b4 6a 8c 50 78 6c cf 9f fe Trie: 2560 bits

75 76 hugPlant.bmp hugPlant.huf Data is bits. Code is bits.

42 4d 7a 00 10 00 00 00 00 00 7a 00 00 00 6c 00 00 00 00 02 00 00 2f 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 00 02 00 00 01 00 20 00 03 00 00 00 46 6d 4a bd2a f82a b1 2a d2 2a 9b 2a5c 502a e72a dc 2a 10 2a cc 2a21 2a 2a00 2a 20 2a 00 2a f4 2a c3 2a b7 2a 6d 00 c2 46 31 6d 24 4a 92 bd dc f8 24 b1 a7 d2 c9 9b 25 5c 50 e72f 2adc 2a 10 2a 2acc 2a 21 2a 2af9 2a ba 2a 2a 2a 2a 2a 2a 2a 00 00 00 10 00 12 0b 00 00 12 0b f9 ba d9 c2 40d9 8ec2 6f 40 99 8e cc 6f81 8599 decc 20 81 39 85 b1 defa 20 39ae b1 24 fa b5 d7 c4 2e 85 57 88 62 40 b7 be 8e c4 5c 92 51 46 f9 25 70 79 55 2f 1b c4 7c af ba 68 bc2a 2ae9 2a 07 2a 2aaf 2a4e 2a d02a 2a6f 2a 2a 2a 2a 2a 2a 2a 00 00 00 00 00 00 00 00 00 00 00 d7 2e 57 62 b7cb 8ebe 5c 09 51 65 f9 7570 557a 1bc0 7c 9a ba fa 68 a2bc 09 eb25 cd f8 d4 92 2a 49 55 24 9f 92 d8 64 98 c9 d1 92 4e 49 e7 30 97 b1 56 24 58 92 68 49 0c 24 7a 43 dd2a 2a80 2a 00 2a 2a7b 2a11 2a 582a 2af4 2a 2a 2a 2a 2a 2a 2a 00 ff 00 00 ff 00 00 ff 00 00 00 e9 07 af 4e d0 6f cb be 09 65 75 7a c0 9a fa a2 75 73 77 bc 26 01 e0 92 28 ef 47 242c 66 49 9b 24 de 92 8b 49 25 0b 04 12 1f 49 0e 24 87 92 bd 42 87 c4 9e 92 03 49 c9 24 f1 92 cf ad fa2a 2a82 2a dc 2a 2a9f 2aa1 2a 312a 2ab5 2a 2a 2a 2a 2a 2a 2a 00 00 00 ff 01 00 00 00 00 00 00 09 eb cd d4 2a 55 9f d8 98 d1 4e e7 97 56 58 68 79 13 9b 95 d5 63 26 8b 90 5e d5 b049 17 ff fb ff e9 ff c0 ff e6 ff 53 ff c7 ff cb ff dd ff 5f ff 77 ff d3 ff bd ff 80 ff f9 ff b6 5e 942a 2aaa 2a 74 2a 2a34 2a3a 2a a92a 2ac1 2a 0a 20 2a 20 20 43 00 00 00 00 00 01 00 00 00 00 00 0c 7a 43 dd 80 00 7b 11 58 f4 75 73 77 bc 26 01 ca e6 b8 9c 60 ab 36 3b a5 8a b4 3aff 5c ff 5a ff 62 ff e9 ff 2f ff 16 ff 4c ff 34 ff 60 ff 6e ff 51 ff 28 ff 36 ff 2c ff e7 ff 4e 50 be6f 6dc0 70 15 69 1b6c 6101 74 d969 6fc0 6e 3a 20 20 6a 61 76 00 00 00 00 00 00 01 00 ... e0 92 28 ef 47 24 66 9b de 8b 25 04 1f 0e 87 bd bd b4 20 87 42 be d4 e2 23 a2 b6 84 22 4c cf 74 cd 4f 23 06 54 e6 c2 0f 2d bd e5 81 f4 c6 de 15 59 f1 68 a4 87 9e 03 c9 f1 cf ad fa 82 dc 9f a1 31 b5 79 13 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 61 63 20 48 75 66 66 6d 61 6e 2e 6a ... Total: 8389584 bits 9b 95 d5 63 26a5 8b88 90 16 5e b0 d5 7fb0 17bf fb8a e9 1d c0 98 e6 bd53 33 b4ff d5 ff 71 ff 22 ff 93 ff 81 ff af ff e0 ff cc ff ce ff 12 ff 57 ff 23 ff 62 ff 3a ff e4 ff 3d 8c f1 12 8d a5 40 3b 70 c7 cb dd 5f 77d6 d39b bd 12 80 49 f9 62b6 5e8d 946f aa d4 74 52 34 f63a 7f d5ff 11 ff 7c ff ca ff 07 ff dd ff e3 ff dc ff 1c ff 7f ff c4 ff a4 ff 69 ff 77 ff 6e ff 5e ff 60 db 5a 69 01Huffman.java: 95 c8 cc 16 a9 c1 ca e6 b8b4 9c0e 60 64 ab 34 36 683b a5b4 8a6a b4 8c 3a 50 5c 785a 6c cfff 9f ff fe ff 2a ff 2a ff 2a ff 2a ff 2a ff 2a ff 2a ff 2a ff 2a ff 2a ff 2a ff 2a ff 2a ff 2a 2a 2a 2a 2a 2a 2a 0a 00 62 e9 2f 16 4c20 3400 60 f4 6e c3 51 b728 366d 2cc2 e7 31 4e 24 50 92be dc 24ff a7 ff c9 ff 25 ff ae ff 24 ff b5 ff c4 ff 85 ff 88 ff 40 ff be ff c4 ff 92 ff 46 ff 25 ff 79 2f c4 af 2543400 f8 92 49 24bits Huffman Total: 2037424 bits c0 15 1b 01 d992 c064 bd c9 b4 92 20 4987 4230 beb1 d4 24 e2 92 23 49a2 24 2cff 49 ff 24 ff 92 ff 49 ff 0b ff 12 ff 49 ff 24 ff 92 ff 42 ff c4 ff 92 ff 49 ff 24 ff 92 ff 49 ff ff ff ff ff ff ff ff b6 84 22 4c cfff 74ff cd ff 4f ff 23 ff06 54ff e6ff c2 ff 0f ff 2d ffbd ff ffff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ... ff ff ff ff ff ff ff ff ff ff ff ff e5 81 f4 c6 deff 15ff 59 ff f1 ff 68 ffa4 a5ff 88ff 16 ff b0 ff 7f ffbf ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 8a 1d 98 bd 33 b4 d5 71 22 93 81 af e0 cc ce 12 00 46 6d 4a bd f8 b1 d2 9b 5c 50 e7 dc 10 cc 21 00 20 00 f4 c3 b7 6d c2 31 24 92 dc 24 a7 c9 25 2f 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 57 23 62 3a e4 3d 8c f1 12 8d a5 40 3b 70 d6 9b f9 ba d9 c2 40 8e 6f 99 cc 81 85 de 20 39 b1 fa ae 24 b5 c4 85 88 40 be c4 92 46 25 79 2f c4 af 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a ff ff ff ff ff ff ff ff ff ff ff ff ffImage ff ff ff ffdata: ff ff ff1991432 ff ff ff .. 20 bits 2a 20 20 43 6f 6d 70 69 6c 61 74 12 49 62 8d 6f d4 52 f6 7f d5 11 7c ca 07 dd e3 d7 2e 57 62 b7 8e 5c 51 f9 70 55 1b 7c ba 68 bc 69 6f 6e 3a 20 20 6a 61 76 61 63 20 48 75 66 66 6d 61 6e 2e 6a ... 25 f8 92 49 24 92 64 c9 92 49 30 b1 24 92 49 24 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a dc 1c 7f c4 a4 69 77 6e 5e 60 db 5a 69 01 95 c8 e9 07 af 4e d0 6f cb be 09 65 75 7a c0 9a fa a2 + 32 bits for length 2c 49 24 92 49 0b 12 49 24 92 42 c4 92 49 24 92 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a cc 16 b4 0e 64 34 68 b4 6a 8c 50 78 6c cf 9f fe 09 eb cd d4 2a 55 9f d8 98 d1 4e e7 97 56 58 68 49 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 0a 20 2a 20 20 43 0c 7a 43 dd 80 00 7b 11 58 f4 75 73 77 bc 26 01 HuffmanWithHardWiredData.java: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Trie: 2560 bits e0 92 28 ef 47 24 66 9b de 8b 25 04 1f 0e 87 bd 6f 6d 70 69 6c 61 74 69 6f 6e 3a 20 20 6a 61 76 87 9e 03 c9 f1 cf ad fa 82 dc 9f a1 31 b5 79 13 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 61 63 20 48 75 66 66 6d 61 6e 2e 6a ... ~2000000 bits 9b 95 d5 63 26 8b 90 5e d5 b0 17 fb e9 c0 e6 53 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff c7 cb dd 5f 77 d3 bd 80 f9 b6 5e 94 aa 74 34 3a ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Huffman.java: a9 c1 ca e6 b8 9c 60 ab 36 3b a5 8a b4 3a 5c 5a ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 62 e9 2f 16 4c 34 60 6e 51 28 36 2c e7 4e 50 be ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 43400 bits c0 15 1b 01 d9 c0 bd b4 20 87 42 be d4 e2 23 a2 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff b6 84 22 4c cf 74 cd 4f 23 06 54 e6 c2 0f 2d bd ff ff ff ff ff ff ff ff ff ff ff ff ... Code e5 81 f4 c6 de 15 59 f1 68 a4 a5 88 16 b0 7f bf 8a 1d 98 bd 33 b4 d5 71 22 93 81 af e0 cc ce 12 57 23 62 3a e4 3d 8c f1 12 8d a5 40 3b 70 d6 9b Image data: 1991432 bits Trie 12 49 62 8d 6f d4 52 f6 7f d5 11 7c ca 07 dd e3 dc 1c 7f c4 a4 69 77 6e 5e 60 db 5a 69 01 95 c8 + 32 bits for length cc 16 b4 0e 64 34 68 b4 6a 8c 50 78 6c cf 9f fe ImageData Trie: 2560 bits

77 78

A program unambiguously describes a bitstream HugPlant.java hugPlant.bmp

2f 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 69 6d 70 6f 72 74 20 6a 61 76 61 2e 61 77 74 2e 42 4d 7a 00 10 00 00 00 00 00 7a 00 00 00 6c 00 00 00 00 02 00 00 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 00 46 6d 4a bd f8 b1 d2 9b 5c 50 e7 dc 10 cc 21 f9 ba 43 6f 6c 6f 72 3b 0a 0a 0a 70 75 62 6c 69 63 20 00 02 00 00 01 00 20 00 03 00 00 d9 c2 40 8e 6f 99 cc 81 85 de 20 39 b1 fa d7 2e 57 62 b7 8e 5c 51 f9 70 55 1b 7c ba 68 bc e9 07 af 4e d0 6f 63 6c 61 73 73 20 48 75 67 50 6c 61 6e 74 20 7b 00 00 00 10 00 12 0b 00 00 12 0b cb be 09 65 75 7a c0 9a fa a2 09 eb cd d4 2a 55 9f d8 98 d1 4e e7 97 56 58 68 0c 7a 43 dd 80 00 7b 11 58 f4 00 00 00 00 00 00 00 00 00 00 00 75 73 77 bc 26 01 e0 92 28 ef 47 24 66 9b de 8b 25 04 1f 0e 87 bd 87 9e 03 c9 f1 cf ad fa 82 dc 9f a1 31 b5 0a 09 2f 2f 73 65 6e 64 20 65 6d 61 69 6c 20 74 00 ff 00 00 ff 00 00 ff 00 00 00 79 13 9b 95 d5 63 26 8b 90 5e d5 b0 17 fb e9 c0 e6 53 c7 cb dd 5f 77 d3 bd 80 f9 b6 5e 94 aa 74 34 3a a9 c1 6f 20 70 72 69 7a 65 40 6a 6f 73 68 68 2e 75 67 00 00 00 ff 01 00 00 00 00 00 00 ca e6 b8 9c 60 ab 36 3b a5 8a b4 3a 5c 5a 62 e9 2f 16 4c 34 60 6e 51 28 36 2c e7 4e 50 be c0 15 1b 01 d9 c0 00 00 00 00 00 01 00 00 00 00 00 20 74 6f 20 72 65 63 65 69 76 65 20 79 6f 75 72 00 00 00 00 00 00 01 00 ... bd b4 20 87 42 be d4 e2 23 a2 b6 84 22 4c cf 74 cd 4f 23 06 54 e6 c2 0f 2d bd e5 81 f4 c6 de 15 59 f1 68 a4 Java a5 88 16 b0 7f bf 8a 1d 98 bd 33 b4 d5 71 22 93 81 af e0 cc ce 12 57 23 62 3a e4 3d 8c f1 12 8d a5 40 3b 70 20 70 72 69 7a 65 0a 0a 09 70 72 69 76 61 74 65 d6 9b 12 49 62 8d 6f d4 52 f6 7f d5 11 7c ca 07 dd e3 dc 1c 7f c4 a4 69 77 6e 5e 60 db 5a 69 01 95 c8 cc 16 20 73 74 61 74 69 63 20 64 6f 75 62 6c 65 20 73 b4 0e 64 34 68 b4 6a 8c 50 78 6c cf 9f fe 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 0a 00 20 00 f4 c3 b7 6d c2 31 24 92 dc 24 a7 c9 25 ae 24 b5 c4 85 88 40 be c4 92 46 25 79 2f c4 af 25 f8 92 49 24 63 61 6c 65 46 61 63 74 6f 72 3d 32 30 2e 30 3b 92 64 c9 92 49 30 b1 24 92 49 24 2c 49 24 92 49 0b 12 49 24 92 42 c4 92 49 24 92 49 ff ff ff ff ff ff ff ff 0a 0a 09 70 72 69 76 61 74 65 20 73 74 61 74 69 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 63 20 69 6e 74 20 67 65 6e 43 6f 6c 6f 72 56 61 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 6c 75 65 28 69 6e 74 20 6f 6c 64 56 61 6c 2c 20 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff .. 20 2a 20 20 43 6f 6d 70 69 6c 61 74 69 6e 74 20 6d 61 78 44 65 76 29 0a 09 7b 0a 09 69 6f 6e 3a 20 20 6a 61 76 61 63 20 48 75 66 66 6d 61 6e 2e 6a ... 09 69 6e 74 20 6f 66 66 53 65 74 20 3d 20 28 69 Total: 8389584 bits HuffmanWithHardWiredData.java: 6e 74 29 20 28 53 74 64 52 61 6e 64 6f 6d 2e 72 61 6e 64 6f 6d 28 29 2a 6d 61 78 44 65 76 2d 6d ~2000000 bits 61 78 44 65 76 2f 33 2e 30 29 3b 0a 09 09 69 6e Java 74 20 6e 65 77 56 61 6c 20 3d 20 6f 6c 64 56 61 6c 20 2b 20 6f 66 66 53 65 74 3b ... Total: 29432 bits 42 4d 7a 00 10 00 00 00 00 00 7a 00 00 00 6c 00 00 00 00 02 00 00 00 02 00 00 01 00 20 00 03 00 00 Of Note: 00 00 00 10 00 12 0b 00 00 12 0b 00 00 00 00 00 00 00 00 00 00 00 00 ff 00 00 ff 00 00 ff 00 00 00 0.35% compression ratio! 00 00 00 ff 01 00 00 00 00 00 00 ・ 00 00 00 00 00 01 00 00 00 00 00 Of the 2^(8389584) possible bit streams, only roughly one in 00 00 00 00 00 00 01 00 ... ・ 2^(8360151) can be compressed this well. 8389584 bits 79 80 So what? Kolmogorov Complexity

Definition ??? 42 4d 7a 00 10 00 00 00 00 00 7a 69 6d 70 6f 72 74 20 6a 61 76 61 2e 61 77 74 2e 00 00 00 6c 00 00 00 00 02 00 00 43 6f 6c 6f 72 3b 0a 0a 0a 70 75 62 6c 69 63 20 Given a bitstream x, the Java Kolmogorov complexity K(x) is the length 00 02 00 00 01 00 20 00 03 00 00 63 6c 61 73 73 20 48 75 67 50 6c 61 6e 74 20 7b ・ 00 00 00 10 00 12 0b 00 00 12 0b 0a 09 2f 2f 73 65 6e 64 20 65 6d 61 69 6c 20 74 in bits of the shortest Java program (also a bitstream) that outputs x. 00 00 00 00 00 00 00 00 00 00 00 6f 20 70 72 69 7a 65 40 6a 6f 73 68 68 2e 75 67 00 ff 00 00 ff 00 00 ff 00 00 00 Java 20 74 6f 20 72 65 63 65 69 76 65 20 79 6f 75 72 00 00 00 ff 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 20 70 72 69 7a 65 0a 0a 09 70 72 69 76 61 74 65 00 00 00 00 00 00 01 00 ... 20 73 74 61 74 69 63 20 64 6f 75 62 6c 65 20 73 63 61 6c 65 46 61 63 74 6f 72 3d 32 30 2e 30 3b Claim 0a 0a 09 70 72 69 76 61 74 65 20 73 74 61 74 69 Total: 8389584 bits 63 20 69 6e 74 20 67 65 6e 43 6f 6c 6f 72 56 61 Given any arbitrary Turing complete programming language P. 6c 75 ... ・ – K(x) < KP(x) + CP Total: 29432 bits – Shortest program in Java is no more than a constant ADDITIVE factor larger than the shortest program in P. Interesting questions: Do different programming languages have different sets of ‘easy’ ・ Proof: bitstreams? ・ – Since Java and P are both Turing complete languages, we can write a Is there some way to find the smallest Java program that generates a ・ compiler for P in Java. given bitstream? – The size of this compiler is independent of x, and is given by CP.

K( ) ≤ 29432

81 82

So what? K(X)

Upper bound is easy 42 4d 7a 00 10 00 00 00 00 00 7a ??? 69 6d 70 6f 72 74 20 6a 61 76 61 2e 61 77 74 2e 00 00 00 6c 00 00 00 00 02 00 00 43 6f 6c 6f 72 3b 0a 0a 0a 70 75 62 6c 69 63 20 x = “What is the shortest Java program that can output exactly this 00 02 00 00 01 00 20 00 03 00 00 63 6c 61 73 73 20 48 75 67 50 6c 61 6e 74 20 7b ・ 00 00 00 10 00 12 0b 00 00 12 0b 0a 09 2f 2f 73 65 6e 64 20 65 6d 61 69 6c 20 74 Stringggggggggggggggggggggggggggggggggggggggggggggggggg??? 00 00 00 00 00 00 00 00 00 00 00 6f 20 70 72 69 7a 65 40 6a 6f 73 68 68 2e 75 67 00 ff 00 00 ff 00 00 ff 00 00 00 20 74 6f 20 72 65 63 65 69 76 65 20 79 6f 75 72 00 00 00 ff 01 00 00 00 00 00 00 Java ??????????????????????????????????????????????????????????????????????????????????? 00 00 00 00 00 01 00 00 00 00 00 20 70 72 69 7a 65 0a 0a 09 70 72 69 76 61 74 65 00 00 00 00 00 00 01 00 ... 20 73 74 61 74 69 63 20 64 6f 75 62 6c 65 20 73 63 61 6c 65 46 61 63 74 6f 72 3d 32 30 2e 30 3b ?????????????????” 0a 0a 09 70 72 69 76 61 74 65 20 73 74 61 74 69 Total: 8389584 bits 63 20 69 6e 74 20 67 65 6e 43 6f 6c 6f 72 56 61 6c 75 ... public class PrintX { Total: 29432 bits public static void main(String[] args) { String x = “What is the shortest Java program that can output exactly this Stringggggggggggggggggggggg Interesting questions: gggggggggggggggggggggggggggg??????????????????????? ・Do different programming languages have different sets of ‘easy’ ??????????????????????????????????????????????????? bitstreams? No! ?????????????????????????????”; ・Is there some way to find the smallest Java program that generates a System.out.println(x); given bitstream? No, but we won’t discuss the proof in class. See following } slides or come to flipped lecture. No proving this won’t } be on the exam. Yes, it is a super cool proof. ・Might be able to do better, but no reason to do worse than K(x) ≤ 2680. 83 84 StringGenerator - a tool for exploring random strings Brute force approach for finding K(x)

StringGenerator sg = new StringGenerator(); public class FindBestClass { String s0 = sg.next(); public static void main(String[] args) { String s1 = sg.next(); String x = args[0]; String s2 = sg.next(); StringGenerator sg = new StringGenerator(); ... while(true) { String s65538 = sg.next(); String sourceCode = sg.next(); ... String output = CompileRunAndReturn(sourceCode); if (output.equals(x)) { System.out.println(sourceCode); return; s0 0000 s1 0001 } s2 0002 } ... } s65536 FFFF Prints source code of shortest program that outputs x. s65536 0000 0000 ・ s65537 0000 0001 s65538 0000 0002 ... s131071 0000 FFFF s131072 0001 0000 85 86

Brute force approach for finding K(x) Brute force approach for finding K(x)

public class FindBestClass { public class FindBestClass { public static void main(String[] args) { public static void main(String[] args) { String x = args[0]; String x = args[0]; StringGenerator sg = new StringGenerator(); StringGenerator sg = new StringGenerator(); while(true) { while(true) { String sourceCode = sg.next(); String sourceCode = sg.next(); String output = CompileRunAndReturn(sourceCode); String output = CompileRunAndReturn(sourceCode); if (output.equals(x)) { if (output.equals(x)) { System.out.println(sourceCode); System.out.println(sourceCode); return; return; } } } } } } ・Prints source code of shortest program that outputs x. ・Prints source code of shortest program that outputs x. ・K(x): Length of source code. ・K(x): Length of source code. ・Will this approach work (if we’re willing to wait a long long time)?

87 88 Brute force approach for finding K(x) Is computing K(x) theoretically possible?

//Returns Kolmogorov Complexity of a string x public class FindBestClass { public int KolmogorovComplexity(String x) public static void main(String[] args) { String x = args[0]; Gives K(x), but does not provide the actual Java source code! StringGenerator sg = new StringGenerator(); ・ while(true) { String sourceCode = sg.next(); String output = CompileRunAndReturn(sourceCode); if (output.equals(x)) { System.out.println(sourceCode); return; } } } ・Prints source code of shortest program that outputs x. ・K(x): Length of source code. ・Will this approach work (if we’re willing to wait a long long time)? – No! Some of the random programs will go into an infinite loop.

89 90

public class ComplexStringFinder public class ComplexStringFinder

//Returns Kolmogorov Complexity of a string x //Returns Kolmogorov Complexity of a string x public int KolmogorovComplexity(String x) public int KolmogorovComplexity(String x)

public String GenerateComplexString(int n) { //Returns a string with Kolmogorov Complexity at least n G bits int sLength = 0; public String GenerateComplexString(int n)

while (true) { Observations i++; GenerateComplexString(n) StringGenerator sg = new StringGenerator(i); ・ – G total bits for the code (including both methods). while (sg.hasNext()) { String complexString = sg.next(); – lg n bits for parameter n. if (KolmogorovComplexity(complexString) >= n) ・Java program of length lg n + G bits provides an x such that K(x) ≥ n. return complexString; } } } ・If there is a String of length i with complexity n, the inner loop returns this String. GenerateComplexString(n) gives shortest string of complexity n. ・ 91 92 public class ComplexStringFinder public class ComplexStringFinder

//Returns Kolmogorov Complexity of a string x //Returns Kolmogorov Complexity of a string x public int KolmogorovComplexity(String x) public int KolmogorovComplexity(String x)

//Returns a string with Kolmogorov Complexity at least n //Returns a string with Kolmogorov Complexity at least n G bits G bits public String GenerateComplexString(int n) public String GenerateComplexString(int n)

Observations Observations ・GenerateComplexString(n) ・GenerateComplexString(n) – G total bits for the code (including both methods). – G total bits for the code (including both methods). – lg n bits for parameter n. – lg n bits for parameter n. ・Java program of length lg n + G bits provides an x such that K(x) ≥ n. ・Java program of length lg n + G bits provides an x such that K(x) ≥ n. – DANGER!! – DANGER!! ・Our Java program of length lg n0 + G bits generates strings that can only be generated by Java programs of length n0.

93 94

public class ComplexStringFinder Berry Paradox

//Returns Kolmogorov Complexity of a string x Let n be “the smallest number requiring nine words to express” public int KolmogorovComplexity(String x) ・But this word can be expressed in eight words, namely: “the smallest number requiring nine words to express”. //Returns a string with Kolmogorov Complexity at least n 1 2 3 4 5 6 7 8 G bits public String GenerateComplexString(int n) ・There is no such number! Observations ・GenerateComplexString(n) ・Not quite the same thing as our Kolmogorov paradox (Java is a better – G total bits for the code (including both methods). defined language than English). – lg n bits for parameter n. – Kolmogorov complexity IS a specific number. ・Java program of length lg n + G bits provides an x such that K(x) ≥ n. – You just can’t find it! – DANGER!! ・Our Java program of length lg n0 + G bits generates strings that can only be generated by Java programs of length n0.

– Example: Suppose G is 100,000 bits. If we pick n0=1,000,000, then Java program is of length 100,020, but K(x) ≥ 1,000,000.

95 96 So what? An uncomputable function

Neat Facts. ??? 42 4d 7a 00 10 00 00 00 00 00 7a 69 6d 70 6f 72 74 20 6a 61 76 61 2e 61 77 74 2e 00 00 00 6c 00 00 00 00 02 00 00 43 6f 6c 6f 72 3b 0a 0a 0a 70 75 62 6c 69 63 20 There exists SOME function K(X) that gives the Java Kolmogorov 00 02 00 00 01 00 20 00 03 00 00 63 6c 61 73 73 20 48 75 67 50 6c 61 6e 74 20 7b ・ 00 00 00 10 00 12 0b 00 00 12 0b 0a 09 2f 2f 73 65 6e 64 20 65 6d 61 69 6c 20 74 complexity of a given bitstream X. 00 00 00 00 00 00 00 00 00 00 00 6f 20 70 72 69 7a 65 40 6a 6f 73 68 68 2e 75 67 00 ff 00 00 ff 00 00 ff 00 00 00 Java 20 74 6f 20 72 65 63 65 69 76 65 20 79 6f 75 72 00 00 00 ff 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 20 70 72 69 7a 65 0a 0a 09 70 72 69 76 61 74 65 00 00 00 00 00 00 01 00 ... 20 73 74 61 74 69 63 20 64 6f 75 62 6c 65 20 73 We can compute this function for some bitstreams. For example, in 63 61 6c 65 46 61 63 74 6f 72 3d 32 30 2e 30 3b ・ 0a 0a 09 70 72 69 76 61 74 65 20 73 74 61 74 69 Total: 8389584 bits 63 20 69 6e 74 20 67 65 6e 43 6f 6c 6f 72 56 61 Python, we can exactly compute KP(X) for some small X: 6c 75 ... – KP(6) = 56: print 6 Total: 29432 bits – KP(387420489) = 80: print 9**9

Interesting questions: ・It is impossible to compute K(X) for arbitrary X. ・Do different programming languages have different sets of ‘easy’ bitstreams? No! ・Is there some way to find the smallest Java program that outputs a given bitstream? No!

97 98

Kolmogorov complexity summary Data compression summary

Kolmogorov Complexity Lossless compression. ・Uncomputable function. ・Represent fixed-length symbols with variable-length codes. [Huffman] ・Represents the compressibility of a bitstream using ANY technique. ・Represent variable-length symbols with fixed-length codes. [LZW] – a.k.a. measures randomness of the bitstream. ・Special purpose code for generating a specific bitstream. [hugPlant] ・Most bitstreams cannot be compressed using any technique in any language. . [not covered in this course] ・No Turing complete language is better at compressing any sufficiently ・JPEG, MPEG, MP3, … long bitstream than any other. ・FFT, wavelets, fractals, …

n Theoretical limits on compression. Shannon entropy: H(X)= p(xi)lgp(xi) i ・Better upper bounds on K(x) X – Human experiments on English text: 1.1 bits/character

Practical compression. Use extra knowledge whenever possible.

SETI data: Complex hugPlant: Simple 99 100