Compressing Data

Konstantin Tretyakov ([email protected])

MTAT.03.238 Advanced Algorithmics
April 26, 2012

Claude Elwood Shannon (1916-2001)

C. E. Shannon. A mathematical theory of communication. 1948

C. E. Shannon. The mathematical theory of communication. 1949

 Shannon-Fano coding
 Nyquist-Shannon sampling theorem
 Shannon-Hartley theorem
 Shannon’s noisy channel coding theorem
 Shannon’s source coding theorem
 Rate-distortion theory

Ethernet, Wifi, GSM, CDMA, EDGE, CD, DVD, BD, ZIP, JPEG, MPEG, …

 MTMS.02.040 Informatsiooniteooria (Information Theory, 3-5 EAP)
 Jüri Lember

 http://ocw.mit.edu/ 6.441 Information Theory

 https://www.coursera.org/courses/

Basic terms: Information, Code

 “Information”
 “Coding”, “Code”
 Can you code the same information differently?
 Why would you?
 What properties can you require from a coding scheme? Are they contradictory?
 Show 5 ways of coding the concept “number 42”.
 What is the shortest way of coding this concept? How many bits are needed?
 Aha! Now define the term “code” once again.

Basic terms: Coding

 Suppose we have a set of three concepts. Denote them as A, B and C.
 Propose a code for this set.
 Consider the following code: A → 0, B → 1, C → 01. What do you think about it?
 Define “variable length code”.
 Define “uniquely decodable code”.


Basic terms: Prefix-free

 If we want to code a series of messages, what would be a great property for a code to have?
 Define “prefix-free code”.
 For historical reasons those are more often referred to as “prefix codes”.
 Find a prefix-free code for {A, B, C}.
 Is it uniquely decodable?
 Is prefix-free ⇒ uniquely decodable?
 Is uniquely decodable ⇒ prefix-free?

Prefix-free code

 .. can always be represented as a tree with symbols at the leaves.

Compression

 Consider some previously derived code for {A, B, C}. Is it good for compression purposes?

 Define “expected code length”.

 Let event probabilities be as follows: A → 0.50, B → 0.25, C → 0.25. Find the shortest possible prefix-free code.

Compression & Prefix coding

 Does the “prefix-free” property sacrifice code length?

 No!

For each uniquely-decodable code there exists a prefix-code with the same codeword lengths.

Huffman code

 Consider the following event probabilities A → 0.50, B → 0.25, C → 0.125, D → 0.125 and some event sequence ADABAABACDABACBA…
 Replace all events C and D with a new event “Z”.
 Construct the optimal code for {A, B, Z}.
 Extend this code to a new code for {A, B, C, D}.


 Generalize the previous construction to construct an optimal prefix-free code.

 Use Huffman coding to encode “YAYBANANABANANA”. Compare its efficiency to straightforward 2-bit encoding.
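A minimal sketch of this greedy bottom-up construction in Python (the function name and the use of a heap are illustrative choices, not part of the slides):

# A minimal sketch of Huffman's construction: repeatedly merge the two least
# probable events, then read the codewords off the merged subtrees.
import heapq
from collections import Counter

def huffman_code(freqs):
    """freqs: dict symbol -> probability (or count). Returns dict symbol -> bitstring."""
    # Each heap entry is (weight, tie-breaker, {symbol: code-so-far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)          # the two least probable events
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}         # prepend 0 to one part...
        merged.update({s: "1" + c for s, c in c2.items()})   # ...and 1 to the other
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# The exercise string: 27 bits with the Huffman code vs 30 bits at 2 bits per letter.
text = "YAYBANANABANANA"
code = huffman_code(Counter(text))
print(code, sum(len(code[ch]) for ch in text), "bits vs", 2 * len(text), "bits")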

D. Huffman. “A Method for the Construction of Minimum-Redundancy Codes”, 1952

Huffman coding in practice

 Is just saving the result of Huffman coding to file enough?

 What else should be done? How?
 Straightforward approach – dump the tree using preorder traversal.
 Smarter approach – save only code lengths.
 Wikipedia:
 RFC1951: Compressed Data Format Specification version 1.3, Section 3.2.2
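For the “save only code lengths” approach, here is a minimal sketch of canonical code assignment in the spirit of RFC 1951, Section 3.2.2 (the function name is illustrative and this is a sketch, not the RFC's exact routine):

# Given only the codeword length of each symbol, reconstruct one concrete
# prefix-free code: sort symbols by (length, symbol) and hand out consecutive
# binary values, shifting left whenever the codeword length increases.
def canonical_codes(lengths):
    """lengths: dict symbol -> codeword length in bits."""
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, 0
    for s in symbols:
        code <<= lengths[s] - prev_len           # make room for a longer codeword
        codes[s] = format(code, "0{}b".format(lengths[s]))
        code += 1
        prev_len = lengths[s]
    return codes

# Lengths 1, 2, 3, 3 give the codes 0, 10, 110, 111.
print(canonical_codes({"A": 1, "B": 2, "C": 3, "D": 3}))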

Huffman code optimality

 Consider an alphabet, sorted by event (letter) probability, e.g. x1 → 0.42, x2 → 0.25, …, x9 → 0.01, x10 → 0.01

 Is there just a single optimal code for it, or several of them?

 Show that each optimal code has: l(x1) ≤ l(x2) ≤ … ≤ l(x10)
 Show that there is at least one optimal code where x9 and x10 are siblings in the prefix tree.
 Let L be the expected length of the optimal code. Merge x9 and x10, and let Ls be the expected length of the resulting smaller code. Express L in terms of Ls. Complete the proof.

Huffman code in real life

 Which of those use Huffman coding?

 DEFLATE (ZIP, GZIP)
 JPEG
 PNG
 GIF
 MP3
 MPEG-2

 All of them do, as a post-processing step.

Shannon-Fano code

 I randomly chose a letter from this probability distribution:

A → 0.45, B → 0.35, C → 0.125, D → 0.125

You need to guess it in the smallest expected number of yes/no questions. Devise an optimal strategy.

Shannon-Fano code

 Constructs a prefix code in a top-down manner:
 Split the alphabet into two parts with as equal probability as possible.
 Construct a code for each part.
 Prepend 0 to the codes of the first part.
 Prepend 1 to the codes of the second part.
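A minimal sketch of this recursive top-down split in Python, assuming the symbols are passed in sorted by decreasing probability (the function name and the exact split criterion are illustrative choices):

# Shannon-Fano coding: split the alphabet so that the two halves have
# probabilities as equal as possible, then recurse on each half.
def shannon_fano(symbols):
    """symbols: list of (symbol, probability) pairs, sorted by decreasing probability."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}               # a single symbol needs no more bits
    total = sum(p for _, p in symbols)
    running, split, best = 0.0, 1, float("inf")
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        if abs(2 * running - total) < best:      # |left part - right part|
            best, split = abs(2 * running - total), i
    left, right = shannon_fano(symbols[:split]), shannon_fano(symbols[split:])
    # Prepend 0 to the codes of the first part and 1 to those of the second.
    return {**{s: "0" + c for s, c in left.items()},
            **{s: "1" + c for s, c in right.items()}}

# The distribution from the Huffman slide gives A -> 0, B -> 10, C -> 110, D -> 111.
print(shannon_fano([("A", 0.5), ("B", 0.25), ("C", 0.125), ("D", 0.125)]))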

 Is Shannon-Fano the same as Huffman?

Shannon-Fano & Huffman

 Shannon-Fano is not always optimal.

 Show that it is optimal, though, for letter probabilities of the form 1/2^k.

log(p) as amount of information

 Let letter probabilities all be of the form p = 1/2^k. Show that for the optimal prefix code, the length of the codeword for a letter with probability p_i = 1/2^(k_i) is exactly k_i = log2(1/p_i) = −log2 p_i.

Why logarithms?

 Intuitively, we want a measure of information to be “additive”. Receiving N equivalent events must correspond to “N times” the information in a single event.

 However, probabilities are …

 Therefore, the most logical way to measure information of an event is …

The thing to remember

log2(1/p) is the information content of a single random event with probability p.

For p of the form 2^(−k) it is exactly the number of bits needed to code this event using an optimal binary prefix-free code.

For other values of p the information content is not an integer: obviously you cannot use something like “2.5 bits” to encode a single symbol. However, for longer texts you can code multiple symbols at once, and in that case you can achieve this number (e.g. 2.5) as the average number of bits per occurrence of the corresponding event.

Expected codeword length

 Let letter probabilities all be of the form p = 1/2^k.

 What is the expected code length for the optimal binary prefix-free code?

The thing to remember

For a given discrete probability distribution, the function

H(p1, p2, …, pn) = p1·log2(1/p1) + ⋯ + pn·log2(1/pn)

is called the entropy of this distribution.
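A small sketch computing H(P) for the distribution used on the Huffman slide, to check that it matches the expected Huffman codeword length found there:

# Entropy of a discrete distribution: H(P) = sum of p * log2(1/p) over all p > 0.
from math import log2

def entropy(probs):
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

# A -> 0.5, B -> 0.25, C -> 0.125, D -> 0.125:
print(entropy([0.5, 0.25, 0.125, 0.125]))      # 1.75 bits
# The Huffman code 0, 10, 110, 111 has expected length
# 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75, matching H(P) exactly because
# every probability here is of the form 2^(-k).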

Meaning of entropy

The average codeword length L for both Huffman and Shannon-Fano codes satisfies:

H(P) ≤ L < H(P) + 1

Meaning of entropy

Shannon Source Coding Theorem: A sequence of N events from a probability distribution P can be losslessly represented as a sequence of N·H(P) bits, for sufficiently large N.

Conversely, it is impossible to losslessly represent the sequence using fewer than N·H(P) bits.

The things to remember

log2(1/p) is the information content of a single random event with probability p, measured in bits.

H(P) is the expected information content for the distribution P, measured in bits.

In other words, log2(1/p) is the expected number of bits necessary to optimally encode an event with such probability, and H(P) is the expected number of bits necessary to optimally encode a single random event from this distribution.

Algorithmics 26.04.2012  Demonstrate an N-element distribution with zero entropy.

 Demonstrate an N-element distribution with maximal entropy.

 Define entropy for a continuous distribution 푝(푥).

Algorithmics 26.04.2012  Is Huffman code good for coding:

 Images?
 Music?
 Text?

None of them, because Huffman coding assumes an I.I.D. sequence, yet all of those have a lot of structure.

 What is it good for?

It is good for coding random-like sequences.

Algorithmics 26.04.2012  Say we need to encode the text

 THREE SWITCHED WITCHES WATCH THREE SWISS SWATCH WATCH SWITCHES. WHICH SWITCHED WITCH WATCHES WHICH SWISS SWATCH WATCH SWITCH?

 Can we code this better than Huffman?

Of course, if we use a dictionary. Can we build the dictionary adaptively from the data itself?

Lempel-Ziv-Welch algorithm

 Say we want to code string “AABABBCAB”
 Start with a dictionary {0 → “”}
 Scan string from the beginning.
 Find the longest prefix present in the dictionary (0, “”).
 Read one more letter “A”.
 Output prefix id and this letter (0, “A”).
 Append to the dictionary. New dictionary: {0 → “”, 1 → “A”}.

 Finish the coding.
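A minimal sketch of the scheme exactly as described above, i.e. the LZ78-style variant that emits explicit (prefix id, letter) pairs; real LZW implementations instead pre-load the dictionary with all single characters and output ids only. Function names are illustrative:

# Encode: grow the dictionary with every (longest known prefix + next letter).
def lz_encode(text):
    dictionary, output, prefix = {"": 0}, [], ""
    for ch in text:
        if prefix + ch in dictionary:
            prefix += ch                          # keep extending the known prefix
        else:
            output.append((dictionary[prefix], ch))
            dictionary[prefix + ch] = len(dictionary)
            prefix = ""
    if prefix:                                    # leftover prefix at end of input
        output.append((dictionary[prefix], ""))
    return output

# Decode: rebuild the very same dictionary from the pairs themselves.
def lz_decode(pairs):
    dictionary, out = {0: ""}, []
    for idx, ch in pairs:
        entry = dictionary[idx] + ch
        out.append(entry)
        dictionary[len(dictionary)] = entry
    return "".join(out)

pairs = lz_encode("AABABBCAB")
print(pairs)               # [(0, 'A'), (1, 'B'), (2, 'B'), (0, 'C'), (2, '')]
print(lz_decode(pairs))    # 'AABABBCAB'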

Terry Welch, “A Technique for High-Performance Data Compression,” 1984.

LZW Algorithm

 Unpack the obtained code.

 Can we do smarter initialization?

 If we pack a long text, the dictionary may bloat. How do we handle it?

 In practice LZW coding is followed by Huffman (or a similar) coding.

Theorem: LZW coding is asymptotically optimal.

I.e. as the length of the string goes to infinity, the compression ratio approaches the best possible (given some conditions).

LZW and variations in real life

 Which of those use variations of LZW?

 DEFLATE (ZIP, GZIP)
 JPEG
 PNG
 GIF
 MP3
 MPEG-2


Remember, LZW is aimed at text-like data with many repeating substrings. It is used in GIF after the run-length encoding step (which produces this kind of data). Not sure why PNG uses it, but probably for a similar reason.

Ideal compression?

 Given a string of bytes, what would be the theoretically best way to encode it?

Kolmogorov complexity

 The Kolmogorov complexity of a byte string is the length of the shortest program which outputs this string.

Kolmogorov complexity

 Can we achieve Kolmogorov complexity when packing data?

Kolmogorov complexity

Theorem: Kolmogorov complexity is not computable.

Summary

 Thou shalt study information theory!
 Huffman code is a length-wise optimal uniquely-decodable code.
 log(1/p) is the information content of an event. H(P) is the information content of a distribution.
 LZW is asymptotically optimal.
 Kolmogorov complexity is a fun (but practically useless) idea.
