Algorithmics :: Compressing Data
Compressing Data
Konstantin Tretyakov ([email protected])
MTAT.03.238 Advanced Algorithmics
April 26, 2012

Claude Elwood Shannon (1916-2001)
C. E. Shannon. A mathematical theory of communication. 1948.
C. E. Shannon. The mathematical theory of communication. 1949.

Shannon-Fano coding, Nyquist-Shannon sampling theorem, Shannon-Hartley theorem, Shannon's noisy channel coding theorem, Shannon's source coding theorem, rate-distortion theory.
Ethernet, WiFi, GSM, CDMA, EDGE, CD, DVD, BD, ZIP, JPEG, MPEG, ...

Related courses:
MTMS.02.040 Informatsiooniteooria (Information Theory, 3-5 ECTS), Jüri Lember
6.441 Information Theory, http://ocw.mit.edu/
https://www.coursera.org/courses/

Basic terms: Information, Code
"Information", "Coding", "Code".
Can you code the same information differently? Why would you?
What properties can you require from a coding scheme? Are they contradictory?
Show 5 ways of coding the concept "number 42". What is the shortest way of coding this concept? How many bits are needed?
Aha! Now define the term "code" once again.

Basic terms: Coding
Suppose we have a set of three concepts. Denote them as A, B and C. Propose a code for this set.
Consider the following code: A → 0, B → 1, C → 01. What do you think about it?
Define "variable-length code". Define "uniquely decodable code".

Basic terms: Prefix-free
If we want to code a series of messages, what would be a great property for a code to have?
Define "prefix-free code". For historical reasons these are more often referred to as "prefix codes".
Find a prefix-free code for {A, B, C}. Is it uniquely decodable?
Is prefix-free ⇒ uniquely decodable? Is uniquely decodable ⇒ prefix-free?

Prefix-free code
... can always be represented as a tree with symbols at the leaves.

Compression
Consider some previously derived code for {A, B, C}. Is it good for compression purposes?
Define "expected code length".
Let event probabilities be as follows: A → 0.50, B → 0.25, C → 0.25. Find the shortest possible prefix-free code.

Compression & prefix coding
Does the "prefix-free" property sacrifice code length? No! For each uniquely decodable code there exists a prefix code with the same codeword lengths.

Huffman code
Consider the following event probabilities: A → 0.50, B → 0.25, C → 0.125, D → 0.125, and some event sequence ADABAABACDABACBA...
Replace all events C and D with a new event "Z". Construct the optimal code for {A, B, Z}. Extend this code to a new code for {A, B, C, D}.

Huffman coding algorithm
Generalize the previous construction to construct an optimal prefix-free code.
Use Huffman coding to encode "YAYBANANABANANA". Compare its efficiency to a straightforward 2-bit encoding.
D. Huffman. "A Method for the Construction of Minimum-Redundancy Codes", 1952.
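The merge-the-two-rarest-events construction above translates directly into code. The following is a minimal Python sketch (an added illustration, not part of the original slides; the function name huffman_code is mine) that builds a Huffman code with a priority queue and applies it to the "YAYBANANABANANA" exercise.

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a prefix-free code from {symbol: weight} using Huffman's algorithm."""
    # Heap entries are (total weight, unique tie-breaker, tree),
    # where a tree is either a symbol or a (left, right) pair.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate case: one symbol gets codeword "0"
        return {next(iter(freqs)): "0"}
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)      # two least probable subtrees...
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))  # ...become siblings
        count += 1
    codes = {}
    def walk(tree, prefix):                  # assign 0 to left branches, 1 to right branches
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(heap[0][2], "")
    return codes

text = "YAYBANANABANANA"
codes = huffman_code(Counter(text))
encoded = "".join(codes[c] for c in text)
print(codes, len(encoded), 2 * len(text))    # compare to the 2-bit-per-symbol baseline
```

For this string the Huffman code needs 27 bits versus 30 bits for the fixed 2-bit encoding.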
Huffman coding in practice
Is just saving the result of Huffman coding to a file enough? What else should be done? How?
Straightforward approach: dump the tree using a preorder traversal.
Smarter approach: save only the code lengths (Wikipedia: Canonical Huffman Code; RFC 1951: DEFLATE Compressed Data Format Specification version 1.3, Section 3.2.2).

Huffman code optimality
Consider an alphabet, sorted by event (letter) probability, e.g.
x_1 → 0.42, x_2 → 0.25, ..., x_9 → 0.01, x_10 → 0.01.
Is there just a single optimal code for it, or several of them?

Huffman code optimality
Show that each optimal code has l(x_1) ≤ l(x_2) ≤ ... ≤ l(x_10).
Show that there is at least one optimal code where x_9 and x_10 are siblings in the prefix tree.
Let L be the expected length of the optimal code. Merge x_9 and x_10, and let L_s be the expected length of the resulting smaller code. Express L in terms of L_s. Complete the proof.

Huffman code in real life
Which of those use Huffman coding? DEFLATE (ZIP, GZIP), JPEG, PNG, GIF, MP3, MPEG-2?
All of them do, as a post-processing step.

Shannon-Fano code
I randomly chose a letter from this probability distribution: A → 0.45, B → 0.35, C → 0.125, D → 0.125.
You need to guess it in the smallest expected number of yes/no questions. Devise an optimal strategy.

Shannon-Fano code
Constructs a prefix code in a top-down manner:
- Split the alphabet into two parts with as equal probability as possible.
- Construct a code for each part.
- Prepend 0 to the codes of the first part.
- Prepend 1 to the codes of the second part.
Is Shannon-Fano the same as Huffman?

Shannon-Fano & Huffman
Shannon-Fano is not always optimal. Show that it is optimal, though, for letter probabilities of the form 1/2^k.

log(p) as amount of information
Let letter probabilities all be of the form p = 1/2^k.
Show that for the optimal prefix code, the length of the codeword for a letter with probability p_i = 1/2^k is exactly k = log_2(1/p_i) = -log_2(p_i).

Why logarithms?
Intuitively, we want a measure of information to be "additive": receiving N equivalent events must correspond to "N times" the information in a single event. However, probabilities are ...
Therefore, the most logical way to measure the information of an event is ...

The thing to remember
log_2(1/p) is the information content of a single random event with probability p.
For p of the form 2^-k it is exactly the number of bits needed to code this event using an optimal binary prefix-free code.
For other values of p the information content is not an integer. Obviously you cannot use something like "2.5 bits" to encode a symbol. However, for longer texts you can code multiple symbols at once, and in this case you can achieve an average coding rate of this number (e.g. 2.5) of bits per occurrence of the corresponding event.

Expected codeword length
Let letter probabilities all be of the form p = 1/2^k. What is the expected code length for the optimal binary prefix-free code?

The thing to remember
For a given discrete probability distribution, the function
H(p_1, p_2, ..., p_n) = p_1 log_2(1/p_1) + ... + p_n log_2(1/p_n)
is called the entropy of this distribution.

Meaning of entropy
The average codeword length L for both Huffman and Shannon-Fano codes satisfies
H(P) ≤ L < H(P) + 1.
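To make the bound concrete, here is a small self-contained Python check (an added illustration, not from the slides). It uses the dyadic distribution from the earlier Huffman example; the code table written out below is one optimal prefix-free code for it, so the expected length should land inside [H(P), H(P) + 1).

```python
import math

def entropy(probs):
    """H(P) = sum of p * log2(1/p), measured in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Dyadic distribution from the Huffman example: every p is of the form 1/2^k.
P = {"A": 0.50, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}   # one optimal prefix-free code

H = entropy(P.values())
L = sum(p * len(code[s]) for s, p in P.items())        # expected codeword length
print(f"H(P) = {H} bits, L = {L} bits")                # both equal 1.75 here
assert H <= L < H + 1                                  # the bound from the slide
```

Because every probability is an exact power of 1/2, the codeword lengths equal -log_2(p_i) and L coincides with H(P); for non-dyadic distributions L is strictly larger but still below H(P) + 1.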
Meaning of entropy
Shannon Source Coding Theorem: a sequence of N events drawn from a probability distribution P can be losslessly represented as a sequence of N·H(P) bits for sufficiently large N. Conversely, it is impossible to losslessly represent the sequence using fewer than N·H(P) bits.

The things to remember
log_2(1/p) is the information content of a single random event with probability p, measured in bits; i.e. it is the number of bits necessary to optimally encode an event with such probability.
H(P) is the expected information content for the distribution P, measured in bits; i.e. it is the expected number of bits necessary to optimally encode a single random event from this distribution.

Demonstrate an N-element distribution with zero entropy.
Demonstrate an N-element distribution with maximal entropy.
Define entropy for a continuous distribution p(x).

Is Huffman code good for coding images? Music? Text?
None of them: Huffman coding assumes an i.i.d. sequence, yet all of those have a lot of structure.
What is it good for? It is good for coding random-like sequences.

Say we need to encode the text
THREE SWITCHED WITCHES WATCH THREE SWISS SWATCH WATCH SWITCHES. WHICH SWITCHED WITCH WATCHES WHICH SWISS SWATCH WATCH SWITCH?
Can we code this better than Huffman? Of course, if we use a dictionary. Can we build the dictionary adaptively from the data itself?

Lempel-Ziv-Welch algorithm
Say we want to code the string "AABABBCAB".
Start with a dictionary {0 → ""}.
Scan the string from the beginning. Find the longest prefix present in the dictionary (0, ""). Read one more letter, "A". Output the prefix id and this letter: (0, "A"). Append <current prefix><current letter> to the dictionary. New dictionary: {0 → "", 1 → "A"}.
Finish the coding.
Terry Welch, "A Technique for High-Performance Data Compression," 1984.

LZW algorithm
Unpack the obtained code.
Can we do smarter initialization?
If we pack a long text, the dictionary may bloat. How do we handle it?
In practice LZW coding is followed by Huffman (or a similar) coding.

Theorem
LZW coding is asymptotically optimal, i.e. as the length of the string goes to infinity, the compression ratio approaches the best possible (given some conditions).
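As a recap of the dictionary-building procedure walked through above, here is a minimal Python sketch (an added illustration, not from the slides; the function names are mine). It follows the slide's variant exactly, emitting (prefix id, next letter) pairs; strictly speaking this pair-emitting form is usually called LZ78, while textbook LZW seeds the dictionary with all single characters and emits only indices, which is the "smarter initialization" hinted at above.

```python
def lzw_encode(text):
    """Encode text as (prefix_id, next_char) pairs, as in the slide's walk-through.

    The dictionary starts as {"": 0}; after each output pair the string
    <matched prefix> + <next char> is added as a new dictionary entry.
    """
    dictionary = {"": 0}          # phrase -> id
    output = []
    prefix = ""
    for ch in text:
        if prefix + ch in dictionary:
            prefix += ch          # keep extending the longest known prefix
        else:
            output.append((dictionary[prefix], ch))
            dictionary[prefix + ch] = len(dictionary)
            prefix = ""
    if prefix:                    # flush a trailing prefix with no following letter
        output.append((dictionary[prefix], ""))
    return output

def lzw_decode(pairs):
    """Invert lzw_encode: rebuild the same dictionary while reading the pairs."""
    phrases = [""]                # id -> phrase
    out = []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)

pairs = lzw_encode("AABABBCAB")
print(pairs)                      # [(0, 'A'), (1, 'B'), (2, 'B'), (0, 'C'), (2, '')]
assert lzw_decode(pairs) == "AABABBCAB"
```

Note that the decoder never needs the dictionary transmitted: it rebuilds exactly the same phrase table from the pairs themselves, which is what makes the scheme adaptive.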