Text Algorithms (6EAP)
Compression
Jaak Vilo 2012 fall
Problem
• Compress
  – Text
  – Images, video, sound, …
• Reduce space, efficient communication, etc.
  – Data deduplication
• Exact compression/decompression
• Lossy compression

Managing Gigabytes: Compressing and Indexing Documents and Images
• Ian H. Witten, Alistair Moffat, Timothy C. Bell
• Hardcover: 519 pages; Publisher: Morgan Kaufmann; 2nd Revised edition (11 May 1999); Language: English; ISBN-10: 1558605703

Links
• http://datacompression.info/
• http://en.wikipedia.org/wiki/Data_compression
• Data Compression, Debra A. Lelewer and Daniel S. Hirschberg – http://www.ics.uci.edu/~dan/pubs/DataCompression.html
• Compression FAQ http://www.faqs.org/faqs/compression-faq/
• Information Theory Primer With an Appendix on Logarithms by Tom Schneider http://www.lecb.ncifcrf.gov/~toms/paper/primer/
• http://www.cbloom.com/algs/index.html

Problem
• Information transmission
• Information storage
• The data sizes are huge and growing
  – fax: 1.5 × 10^6 bits/page
  – photo: 2M pixels × 24 bit = 6 MB
  – X-ray image: ~100 MB?
  – Microarray scanned image: 30-100 MB
  – Tissue-microarray: hundreds of images, each tens of MB
  – Large Hadron Collider (CERN): the device will produce a few peta (10^15) bytes of stored data in a year
  – TV (PAL): 2.7 × 10^8 bit/s
  – CD-sound, super-audio, DVD, ...
  – Human genome: 3.2 Gbases; 30× sequencing => 100 Gbases + quality info (+ raw data)
  – 1000 genomes, all individual genomes …

What is it about?
• Elimination of redundancy
• Being able to predict…
• Compression and decompression
  – Represent data in a more compact way
  – Decompression - restore the original form
• Lossy and lossless compression
  – Lossless - restore an exact copy
  – Lossy - restore almost the same information
    • Useful when 100% accuracy is not needed: voice, images, movies, ...
    • Decompression is deterministic (lossy in the compression phase)
    • Can achieve much more effective results

Methods covered:
• Code words (Huffman coding)
• Run-length encoding
• Arithmetic coding
• Lempel-Ziv family (compress, gzip, zip, pkzip, ...)
• Burrows-Wheeler family (bzip2)
• Other methods, including images
• Kolmogorov complexity
• Search from compressed texts

Model
[Diagram: Data → Encoder → Compressed data → Decoder → Data; encoder and decoder share the same Model.]

• Let pS be the probability of message S
• The information content can be represented in terms of bits
• I(S) = -log2( pS ) bits
• If p = 1 then the information content is 0 (no new information)
  – If Pr[s] = 1 then I(s) = 0.
  – In other words, I(death) = I(taxes) = 0
• I( heads or tails ) = 1 bit -- if the coin is fair
• Entropy H(S) is the average information content, summed over the possible symbols (or messages) s:
  – H = ∑s p(s) · I(s) = -∑s p(s) · log2( p(s) ) bits

http://en.wikipedia.org/wiki/Information_entropy
• Shannon's experiments with human predictors show an information rate of between 0.6 and 1.3 bits per character, depending on the experimental setup; the PPM compression algorithm can achieve a compression ratio of 1.5 bits per character.
• No compression can on average achieve better compression than the entropy
• The entropy depends on the model (or the choice of symbols)
• Let M = { m1, ..., mn } be the set of symbols of model A, and let p(mi) be the probability of symbol mi
• The entropy of model A is H(M) = -∑i=1..n p(mi) · log2( p(mi) ) bits
• Let the message S = s1, ..., sk, where every symbol si is in the model M. The information content of S under model A is -∑i=1..k log2 p(si)
• Every symbol has to have a probability, otherwise it cannot be coded if it is present in the data

http://prize.hutter1.net/

• The data compression world is all abuzz about Marcus Hutter's recently announced 50,000 euro prize for record-breaking data compressors. Marcus, of the Swiss Dalle Molle Institute for Artificial Intelligence, apparently in cahoots with Florida compression maven Matt Mahoney, is offering cash prizes for what amounts to the most impressive ability to compress 100 MBytes of Wikipedia data. (Note that nobody is going to win exactly 50,000 euros - the prize amount is prorated based on how well you beat the current record.)
• This prize differs considerably from my Million Digit Challenge, which is really nothing more than an attempt to silence people foolishly claiming to be able to compress random data. Marcus is instead looking for the most effective way to reproduce the Wiki data, and he's putting up real money as an incentive. The benchmark that contestants need to beat is that set by Matt Mahoney's paq8f, the current record holder at 18.3 MB. (Alexander Ratushnyak's submission of a paq variant looks to clock in at a tidy 17.6 MB, and should soon be confirmed as the new standard.)
• So why is an AI guy inserting himself into the world of compression? Well, Marcus realizes that good data compression is all about modeling the data. The better you understand the data stream, the better you can predict the incoming tokens in a stream. Claude Shannon empirically found that humans could model English text with an entropy of 0.6 to 1.3 bits per character, which at best should mean that 100 MB of Wikipedia data could be reduced to 7.5 MB, with an upper bound of perhaps 16.25 MB. The theory is that reaching that 7.5 MB range is going to take such a good understanding of the data stream that it will amount to a demonstration of Artificial Intelligence. http://marknelson.us/2006/08/24/the-hutter-prize/#comment-293
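To make the definitions above concrete, here is a minimal Python sketch of the empirical entropy computation (the function name entropy_bits_per_symbol is ours, not from the slides; the example string S reappears in the code-word slides below):

import math
from collections import Counter

def entropy_bits_per_symbol(message):
    # H = -sum over symbols s of p(s) * log2 p(s)
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
H = entropy_bits_per_symbol(S)
print(round(H, 3))           # 2.894 bits per symbol
print(round(H * len(S), 1))  # 115.7 bits: lower bound for coding S under this model

Note how close this lower bound is to the 117 bits that the Shannon-Fano code achieves on S later in this lecture.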
Model
[Diagram: Data → Encoder → Compressed data → Decoder → Data; encoder and decoder share the same Model.]

Static or adaptive
• A static model does not change during the compression
• An adaptive model can be updated during the process
  – symbols not yet seen in the message cannot be given 0 probability
• A semi-adaptive model works in 2 stages, off-line:
  – first create the code table, then encode the message with the code table

How to compare compression techniques?

• Ratio (t/p), where t is the original message length and p the compressed message length
• In texts: bits per symbol
• The time and memory used for compression
• The time and memory used for decompression
• Error tolerance (e.g. self-correcting code)

Shorter code words…
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
• Alphabet of 8 symbols
• Length = 40 symbols
• Equal-length codewords
• 3-bit codes: a 000, b 001, c 010, d 011, e 100, f 101, g 110, space 111
• S compressed: 3 × 40 = 120 bits

Run-length encoding
• http://michael.dipperstein.com/rle/index.html
• The string
• "aaaabbcdeeeeefghhhij"
• may be replaced with
• "a4b2c1d1e5f1g1h3i1j1".
• This is not shorter, because a 1-letter repeat takes more characters...
• "a3b1cde4fgh2ij"
• Now we need to know which characters are followed by a run-length:
  – e.g. use escape symbols,
  – or use the symbol itself - if it is repeated, it must be followed by the run-length:
• "aa2bb0cdee3fghh1ij"
Alphabetically ordered word-lists

• Each word is coded as the length of the prefix shared with the previous word, followed by the differing suffix (front coding):

resume   0resume
retail   2tail
retain   5n
retard   4rd
retire   3ire
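A minimal sketch of this front coding of a sorted word list (the function name is ours):

def front_encode(words):
    prev, out = '', []
    for w in words:
        k = 0
        while k < min(len(prev), len(w)) and prev[k] == w[k]:
            k += 1                      # length of the shared prefix
        out.append(f"{k}{w[k:]}")
        prev = w
    return out

print(front_encode(['resume', 'retail', 'retain', 'retard', 'retire']))
# ['0resume', '2tail', '5n', '4rd', '3ire']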
Coding techniques
• Coding refers to techniques used to encode tokens or symbols.
• Two of the best known coding algorithms are Huffman coding and arithmetic coding.
• Coding algorithms are effective at compressing data when they use fewer bits for high-probability symbols and more bits for low-probability symbols.

Variable length encoders
• How to use codes of variable length?
• The decoder needs to know how long each symbol's code is
• Prefix-free code: no code word is a prefix of another code word
• Calculate the frequencies and probabilities of the symbols:
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'

symbol  freq  ratio  p(s)
a       2     2/40   0.05
b       3     3/40   0.075
c       4     4/40   0.1
d       5     5/40   0.125
space   5     5/40   0.125
e       6     6/40   0.15
f       7     7/40   0.175
g       8     8/40   0.2
Shannon-Fano algorithm

• Input: probabilities of the symbols
• Output: code words in a prefix-free code

1. Sort symbols by frequency
2. Divide into two groups of as equal probability as possible
3. The first group gets prefix 0, the other prefix 1
4. Repeat recursively in each group until only 1 symbol remains
Example 1

Code:
a  1/2   0
b  1/4   10
c  1/8   110
d  1/16  1110
e  1/32  11110
f  1/32  11111
Shannon-Fano

S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'

symbol  p(s)   code
g       0.2    00
f       0.175  010
e       0.15   011
d       0.125  100
space   0.125  101
c       0.1    110
b       0.075  1110
a       0.05   1111

(The first split separates {g,f,e}, total 0.525, from {d,space,c,b,a}, total 0.475; the next splits give {f,e} = 0.325, and so on.)
Shannon-Fano

• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
• S compressed is 117 bits:
  2·4 + 3·4 + 4·3 + 5·3 + 5·3 + 6·3 + 7·3 + 8·2 = 117
• Shannon-Fano is not always optimal
• Sometimes 2 equally probable groups cannot be achieved
• Usually better than H+1 bits per symbol, where H is the entropy
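A recursive sketch of the procedure, choosing each split point so that the two groups are as equally probable as possible; on the probabilities of S it reproduces the code table above:

def shannon_fano(symbols):
    # symbols: list of (symbol, probability); returns {symbol: code}
    symbols = sorted(symbols, key=lambda sp: -sp[1])
    codes = {}

    def split(group, prefix):
        if len(group) == 1:
            codes[group[0][0]] = prefix or '0'
            return
        total = sum(p for _, p in group)
        acc, k, best = 0.0, 1, float('inf')
        for i in range(1, len(group)):      # find the most balanced split
            acc += group[i - 1][1]
            if abs(total - 2 * acc) < best:
                best, k = abs(total - 2 * acc), i
        split(group[:k], prefix + '0')
        split(group[k:], prefix + '1')

    split(symbols, '')
    return codes

probs = [('a', .05), ('b', .075), ('c', .1), ('d', .125),
         (' ', .125), ('e', .15), ('f', .175), ('g', .2)]
print(shannon_fano(probs))
# {'g': '00', 'f': '010', 'e': '011', 'd': '100', ' ': '101',
#  'c': '110', 'b': '1110', 'a': '1111'}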
Huffman code

• Works the opposite way.
• Start from the two least probable symbols and separate them with 0 and 1 (suffix)
• Add the probabilities to form a "new symbol" with the combined probability
• Prepend new bits in front of the old ones.

Huffman example: "this is an example of a huffman tree"

char   freq  code
space  7     111
a      4     010
e      4     000
f      3     1101
h      2     1010
i      2     1000
m      2     0111
n      2     0010
s      2     1011
t      2     0110
l      1     11001
o      1     00110
p      1     10011
r      1     11000
u      1     00111
x      1     10010

• Huffman coding is optimal when the frequencies of the input characters are powers of two. Arithmetic coding produces slight gains over Huffman coding, but in practice these gains have not been large enough to offset arithmetic coding's higher computational complexity and patent royalties
• (as of November 2001/July 2006, IBM owned patents on the core concepts of arithmetic coding in several jurisdictions). http://en.wikipedia.org/wiki/Arithmetic_coding#US_patents_on_arithmetic_coding
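A minimal heap-based sketch of the construction. Tie-breaking between equal weights is arbitrary, so individual code words may differ from the table above, but the code lengths - and hence the compressed size - come out optimal:

import heapq
from collections import Counter

def huffman_codes(text):
    # heap of (weight, tiebreak, tree); a tree is a char or a (left, right) pair
    heap = [(w, i, ch) for i, (ch, w) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)     # the two least probable subtrees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, n, (t1, t2)))  # the merged "new symbol"
        n += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, str):
            codes[tree] = prefix or '0'
        else:
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
    walk(heap[0][2], '')
    return codes

for ch, code in sorted(huffman_codes("this is an example of a huffman tree").items(),
                       key=lambda kv: (len(kv[1]), kv[0])):
    print(repr(ch), code)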
Properties of Huffman coding

• Error tolerance is quite good
• In case of the loss, addition, or change of a single bit, the differences remain local to the place of the error
• The error usually remains quite local (proof?)
• The code has been shown to be optimal (among symbol-by-symbol codes)
• It can be shown that the average result is at most H + p + 0.086, where H is the entropy and p is the probability of the most probable symbol

Move to Front
• Move to Front (MTF), Least Recently Used (LRU)
• Keep a list of the last k symbols of S
• Code - use the symbol's position in the list:
  – if in the codebook, move it to the front
  – if not in the codebook, insert it at the front, remove the last
• c.f. the handling of memory paging
• Other heuristics ...
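A sketch of the basic move-to-front transform over a fixed alphabet (the bounded codebook with escape symbols mentioned above is left out for brevity):

def mtf_encode(s, alphabet):
    table, out = list(alphabet), []
    for ch in s:
        i = table.index(ch)
        out.append(i)                   # emit the current position...
        table.insert(0, table.pop(i))   # ...and move the symbol to the front
    return out

def mtf_decode(indices, alphabet):
    table, out = list(alphabet), []
    for i in indices:
        out.append(table[i])
        table.insert(0, table.pop(i))
    return ''.join(out)

codes = mtf_encode('bananaaa', 'abn')
print(codes)     # [1, 1, 2, 1, 1, 1, 0, 0] - runs become runs of 0
assert mtf_decode(codes, 'abn') == 'bananaaa'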
Arithmetic (en)coding

• Arithmetic coding is a method for lossless data compression.
• It is a form of entropy encoding, but where other entropy encoding techniques separate the input message into its component symbols and replace each symbol with a code word, arithmetic coding encodes the entire message into a single number, a fraction n where 0.0 ≤ n < 1.0.
• Huffman coding is optimal for character encoding (one character - one code word) and simple to program. Arithmetic coding is better still, since it can allocate fractional bits, but it is more complicated.
• Wikipedia http://en.wikipedia.org/wiki/Arithmetic_encoding
• The entire message is a single floating point number, a fraction n where 0.0 ≤ n < 1.0.
• Every symbol gets a probability based on the model
• The probabilities represent non-intersecting intervals
• Every text corresponds to such an interval

Let P(A)=0.1, P(B)=0.4, P(C)=0.5
A [0,0.1) AA [0 , 0.01) AB [0.01 , 0.05) AC [0.05 , 0.1 )
B [0.1,0.5) BA [0.1 , 0.14) BB [0.14 , 0.3 ) BC [0.3 , 0.5 )
C [0.5,1) CA [0.5 , 0.55 ) CB [0.55 , 0.75 ) CC [0.75 , 1 )
• Add an EOF symbol.
• Problem with infinite-precision arithmetics
• Alternative - blockwise, use integer arithmetics
• Works if the smallest pi is not too small
• Best ratio
• Problems - the speed, and error tolerance: a small change has a catastrophic effect
• Invented by Jorma Rissanen (then at IBM)
• Arithmetic coding revisited by Alistair Moffat, Radford M. Neal, Ian H. Witten - http://portal.acm.org/citation.cfm?id=874788
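A toy sketch of the interval narrowing with exact fractions, using the model P(A)=0.1, P(B)=0.4, P(C)=0.5 from the table above; a practical coder would instead emit bits incrementally with the blockwise integer arithmetic just mentioned:

from fractions import Fraction

MODEL = {'A': (Fraction(0), Fraction(1, 10)),
         'B': (Fraction(1, 10), Fraction(1, 2)),
         'C': (Fraction(1, 2), Fraction(1))}

def interval(message):
    # narrow [low, high) symbol by symbol; any number inside the final
    # interval identifies the message (given its length or an EOF symbol)
    low, high = Fraction(0), Fraction(1)
    for ch in message:
        lo, hi = MODEL[ch]
        low, high = low + (high - low) * lo, low + (high - low) * hi
    return low, high

def decode(x, n):
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(n):
        for ch, (lo, hi) in MODEL.items():
            nlo, nhi = low + (high - low) * lo, low + (high - low) * hi
            if nlo <= x < nhi:
                out.append(ch)
                low, high = nlo, nhi
                break
    return ''.join(out)

print(interval('BC'))                  # (3/10, 1/2), i.e. [0.3, 0.5) as in the table
assert decode(Fraction(2, 5), 2) == 'BC'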
• Models for arithmetic coding:
• HMM - Hidden Markov Models
• ...
• Context methods: Abrahamson dependency model
• Use the context to the maximum, to predict the next symbol
• PPM - Prediction by Partial Matching
  – several contexts, choose the best
  – variations

Dictionary based compression
• Dictionary (symbol table), list of codes
• If not in the dictionary, use an escape
• The usual heuristic searches for the longest repeat
• With a fixed table one can search for an optimal code
• With an adaptive dictionary the optimal coding is NP-complete
• Quite good for English language texts, for example

Lempel-Ziv family, LZ, LZ-77
• Use a dictionary to memorise the previously compressed parts
• LZ-77
• Sliding window over the previous text, followed by the text still to be compressed:

/bbaaabbbaaffacbbaa…./...abbbaabab...

• Lookahead - the longest prefix that begins within the moving window is encoded as [position, length]
• In the example: [5,6]
• Fast! (Commercial software, e.g. PKZIP, Stacker, DoubleSpace, ...)
• Several alternative codes for the same string (alternative substrings will match)
• Lempel-Ziv compression from McGill Univ. http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/

Original LZ77

• Triples [position, length, next char]
• After outputting [a,b,c], advance by b+1 positions
• The number of bits reserved for each part of the triple depends on the window length:
  ⎡log(n-f)⎤ + ⎡log(f)⎤ + 8, where n is the window size and f is the lookahead size
• Example: abbbbabbc → [0,0,a] [0,0,b] [1,3,a] [3,2,c]
• In the example the match actually overlaps with the lookahead window
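A naive quadratic sketch of the triple encoder and decoder (offsets count back from the current position); on 'abbbbabbc' it reproduces the triples of the example above, including the [1,3,a] match that overlaps the lookahead:

def lz77_encode(s, window=16, lookahead=8):
    i, out = 0, []
    while i < len(s):
        best_len, best_off = 0, 0
        for off in range(1, min(i, window) + 1):
            l = 0
            # the match may run into the lookahead (self-overlap is allowed)
            while l < lookahead and i + l < len(s) - 1 and s[i - off + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_off = l, off
        out.append((best_off, best_len, s[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    s = []
    for off, length, ch in triples:
        for _ in range(length):
            s.append(s[-off])           # overlapping copies work char by char
        s.append(ch)
    return ''.join(s)

code = lz77_encode('abbbbabbc')
print(code)        # [(0, 0, 'a'), (0, 0, 'b'), (1, 3, 'a'), (3, 2, 'c')]
assert lz77_decode(code) == 'abbbbabbc'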
LZ-78

• Dictionary
• Store strings from the already processed part of the message
• The next phrase is the longest match from the dictionary that matches the text still to be processed
• LZ78 (Ziv and Lempel 1978)
• At first the dictionary is empty; index 0 denotes the empty string
• Code [i,c] refers to the dictionary (word u at position i), and c is the next symbol
• Add uc to the dictionary
• Example: ababcabc → [0,a][0,b][1,b][0,c][3,c]
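A short sketch of LZ78; on 'ababcabc' it outputs the pair sequence of the example above:

def lz78_encode(s):
    dictionary = {'': 0}
    out, cur = [], ''
    for ch in s:
        if cur + ch in dictionary:
            cur += ch                   # extend the current match
        else:
            out.append((dictionary[cur], ch))
            dictionary[cur + ch] = len(dictionary)
            cur = ''
    if cur:                             # flush a pending match at end of input
        out.append((dictionary[cur[:-1]], cur[-1]))
    return out

def lz78_decode(pairs):
    words, out = [''], []
    for i, ch in pairs:
        words.append(words[i] + ch)
        out.append(words[-1])
    return ''.join(out)

code = lz78_encode('ababcabc')
print(code)        # [(0, 'a'), (0, 'b'), (1, 'b'), (0, 'c'), (3, 'c')]
assert lz78_decode(code) == 'ababcabc'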
LZW

• The code consists of indices only!
• At first the dictionary contains every symbol of the alphabet
• Update the dictionary like in LZ78
• In decoding there is a danger. See abcabca:
  – if abc is in the dictionary, abca is added to the dictionary
  – next comes abca; output that code
  – but when decoding, after abc it is not yet known that abca is in the dictionary
• Solution - if a dictionary entry is used immediately after its creation, its first and last characters must match (see the sketch below)
• Many representations for the dictionary:
• List, hash, sorted list, combination, binary tree, trie, suffix tree, ...
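A sketch of LZW with the decoder-side handling of the dangerous case described above (a code used immediately after its creation, whose word must then begin and end with the same character):

def lzw_encode(s, alphabet):
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    out, cur = [], ''
    for ch in s:
        if cur + ch in dictionary:
            cur += ch
        else:
            out.append(dictionary[cur])
            dictionary[cur + ch] = len(dictionary)
            cur = ch
    if cur:
        out.append(dictionary[cur])
    return out

def lzw_decode(codes, alphabet):
    words = {i: ch for i, ch in enumerate(alphabet)}
    prev = words[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in words:
            w = words[code]
        else:                           # the dangerous case: the code was
            w = prev + prev[0]          # created on the previous step
        words[len(words)] = prev + w[0]
        out.append(w)
        prev = w
    return ''.join(out)

code = lzw_encode('abcabca', 'abc')
print(code)                             # [0, 1, 2, 3, 5]
assert lzw_decode(code, 'abc') == 'abcabca'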
LZJ

• Coding - search for the longest prefix.
• The code is the address of the trie node
• From the root of the trie there is a transition on every symbol (like in LZW).
• If out of memory, remove the nodes/branches that have been used only once
• In practice, h=6, the dictionary has 8192 nodes

LZFG
• An effective LZ method
• Derived from LZJ
• Create a suffix tree for the window
• The code is a node address plus the number of characters from the edge
• Internal and leaf nodes get different codes
• Small matches are coded directly... (?)

Burrows-Wheeler
• See FAQ http://www.faqs.org/faqs/compression-faq/part2/section-9.html
• The method described in the original paper is really a composite of three different algorithms:
  – the block sorting main engine (a lossless, very slightly expansive preprocessor),
  – the move-to-front coder (a byte-for-byte simple, fast, locally adaptive noncompressive coder) and
  – a simple statistical compressor (first order Huffman is mentioned as a candidate) eventually doing the compression.
• Of these three methods only the first two are discussed here, as they are what constitutes the heart of the algorithm. These two algorithms combined form a completely reversible (lossless) transformation that - with typical input - skews the first order symbol distributions to make the data more compressible with simple methods. Intuitively speaking, the method transforms slack in the higher order probabilities of the input block (thus making them more even, whitening them) to slack in the lower order statistics. This effect is what is seen in the histogram of the resulting symbol data.
• Please, read the article by Mark Nelson:
• Data Compression with the Burrows-Wheeler Transform, Mark Nelson, Dr. Dobb's Journal, September 1996. http://marknelson.us/1996/09/01/bwt/
Burrows-Wheeler Transform (BWT)
CODE (the output column - the character preceding each sorted context, from Mark Nelson's example):

t: hat acts like this:<13><10><1
t: hat buffer to the constructor
t: hat corrupted the heap, or wo
W: hat goes up must come down<13
t: hat happens, it isn't likely
w: hat if you want to dynamicall
t: hat indicates an error.<13><1
t: hat it removes arguments from
t: hat looks like this:<13><10><
t: hat looks something like this
t: hat looks something like this
t: hat once I detect the mangled
Example
• Decode: errktreteoe.e
• Hint: '.' is the last character, and alphabetically first…
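A naive sketch of the transform and its inverse (rotation sort plus repeated insert-and-sort; quadratic, but fine for small examples), using '.' as the unique end-of-text marker as in the exercise:

def bwt_encode(s, eof='.'):
    # sort all rotations of s (s must end with the unique eof symbol)
    # and output the last column
    assert s.endswith(eof)
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(row[-1] for row in rotations)

def bwt_decode(last, eof='.'):
    # repeatedly prepend the transformed column and re-sort; after
    # len(last) rounds the table holds all rotations in sorted order
    table = [''] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith(eof))

code = bwt_encode('abracadabra.')
assert bwt_decode(code) == 'abracadabra.'
# bwt_decode('errktreteoe.e') solves the exercise above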
Syntactic compression

• Context Free Grammar for representing the syntax tree
• Usually for source code
• Assumption - the program is syntactically correct
• Comments
• Features, constants - group by group

Image compression
• Many images, photos, sound, video, ...

Fax group 3
• Fax/group 3
• Black/white, 0/1 code
• Run-length: 000111001000 → 3,3,2,1,3
• Variable-length codes for representing the run-lengths.

JPEG

• Joint Photographic Experts Group; JPEG 2000 http://www.jpeg.org/jpeg2000/
• Image Compression -- JPEG, from W.B. Pennebaker, J.L. Mitchell, "The JPEG Still Image Data Compression Standard", Van Nostrand Reinhold, 1993.
• Color image, 8 or 12 bits per pixel per color.
• Four modes: Sequential Mode, Lossless Mode, Progressive Mode, Hierarchical Mode
• DCT (Discrete Cosine Transform)
• From http://www.utdallas.edu/~aria/mcl/post/
• Lossy signal compression works on the basis of transmitting the "important" signal content while omitting other parts (quantization). To perform this quantization effectively, a linear de-correlating transform is applied to the signal prior to quantization. All existing image and video coding standards use this approach. The most commonly used transform is the Discrete Cosine Transform (DCT), used in JPEG, MPEG-1, MPEG-2, H.261 and H.263 and their descendants. For a detailed discussion of the theory behind quantization and justification of the usage of linear transforms, see reference [1] below.
• A brief overview of JPEG compression is as follows. The JPEG encoder partitions the image into 8x8 blocks of pixels. To each of these blocks it applies a 2-dimensional DCT. The transform matrix is normalized (element-wise) by an 8x8 quantization matrix and then rounded to the nearest integer. This operation is equivalent to applying different uniform quantizers to different frequency bands of the image. The high-frequency image content can be quantized more coarsely than the low-frequency content, due to two factors.
• [Lena example images: L9_Compression/lena/]
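A sketch of the per-block DCT-and-quantize step just described, assuming NumPy and SciPy are available; the quantization matrix is the standard JPEG luminance table (Annex K of the spec), and the test block is a synthetic gradient:

import numpy as np
from scipy.fftpack import dct, idct

Q = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
              [12, 12, 14, 19, 26, 58, 60, 55],
              [14, 13, 16, 24, 40, 57, 69, 56],
              [14, 17, 22, 29, 51, 87, 80, 62],
              [18, 22, 37, 56, 68, 109, 103, 77],
              [24, 35, 55, 64, 81, 104, 113, 92],
              [49, 64, 78, 87, 103, 121, 120, 101],
              [72, 92, 95, 98, 112, 100, 103, 99]])

def dct2(block):
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coefs):
    return idct(idct(coefs, axis=0, norm='ortho'), axis=1, norm='ortho')

x, y = np.meshgrid(np.arange(8), np.arange(8))
block = 16 * x + 8 * y + 60             # a smooth 8x8 block, values in 0..255

coefs = dct2(block - 128)               # level-shift, then 2D DCT
quantized = np.round(coefs / Q)         # most high-frequency entries become 0
restored = idct2(quantized * Q) + 128   # lossy reconstruction
print(int((quantized != 0).sum()), 'of 64 coefficients survive')
print(float(np.abs(restored - block).max()))   # small error despite the zeros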
Vector quantization

• Vector quantization
• Dictionary method
• 2-dimensional blocks

Discrete cosine transform
• A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies. DCTs are important to numerous applications in science and engineering, from lossy compression of audio and images (where small high-frequency components can be discarded) to spectral methods for the numerical solution of partial differential equations. The use of cosine rather than sine functions is critical in these applications: for compression, it turns out that cosine functions are much more efficient (fewer are needed to approximate a typical signal), whereas for differential equations the cosines express a particular choice of boundary conditions.
• http://en.wikipedia.org/wiki/Discrete_cosine_transform
• [Figure: 2D DCT (type II) compared to the DFT. For both transforms, the magnitude of the spectrum is on the left and the histogram on the right; both spectra are cropped to 1/4 to zoom in on the behaviour at the lower frequencies. The DCT concentrates most of the power in the lower frequencies.]
• In particular, a DCT is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even), where in some variants the input and/or output data are shifted by half a sample. There are eight standard DCT variants, of which four are common.
• Digital Image Processing: http://www-ee.uta.edu/dip/

[Slides: Block Diagram of JPEG Baseline, from Wallace's JPEG tutorial (1993); example images from Liu's EE330 (Princeton) / ENEE631 Lec13 - Transform Coding & JPEG (Spring '04) and UMCP ENEE408G slides (M. Wu & R. Liu, 2002):]
• Original image: 475 x 330 x 3 = 157 KB; luminance and RGB components
• Y U V (Y Cb Cr) components - assign more bits to Y, fewer bits to Cb and Cr
• JPEG compression (Q=75%): 45 KB, compression ratio ~ 4:1
• JPEG compression (Q=75% & Q=30%): 45 KB and 22 KB
• Uncompressed (100 KB) vs JPEG 75% (18 KB), JPEG 50% (12 KB), JPEG 30% (9 KB), JPEG 10% (5 KB)

1.4-billion-pixel digital camera
• Monday, November 24, 2008, http://www.technologyreview.com/computing/21705/page1/
• Giant Camera Tracks Asteroids
• The camera will offer sharper, broader views of the sky.
• The focal plane of each camera contains an almost complete 64 x 64 array of CCD devices, each containing approximately 600 x 600 pixels, for a total of about 1.4 gigapixels. The CCDs themselves employ the innovative technology called "orthogonal transfer", which is described below.
• [Diagram showing how an OTA chip is made up of 64 OTCCD cells, each of which has 600 x 600 pixels]

Fractal compression
• Fractal Compression group at Waterloo http://links.uwaterloo.ca/fractals.home.html
• A "Hitchhiker's Guide to Fractal Compression" For Beginners http://links.uwaterloo.ca/pub/Fractals/Papers/Waterloo/vr95.pdf
• Encode using fractals.
• Search for regions that, with a simple transformation, can be made similar to each other.
• Compression ratio 20-80

MPEG

• Moving Pictures Experts Group ( http://www.chiariglione.org/mpeg/ )
• MPEG Compression: http://www.cs.cf.ac.uk/Dave/Multimedia/node255.html
• The screen is divided into 256 blocks, in which changes and movements are tracked
• Only differences from the previous frame are encoded
• Compression ratio 50-100

Search from compressed texts

• When one compresses files, can we still use fast search techniques without decompressing first?
• Sometimes, yes
• e.g. Udi Manber has developed a method
• Approximate Matching of Run-Length Compressed Strings, Veli Mäkinen, Gonzalo Navarro, Esko Ukkonen
• We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance, achieving O(m'n + n'm) complexity. Furthermore, we extend this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily. This approach also gives an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Then we propose improvements for a greedy algorithm for the LCS, and conjecture that the improved algorithm has O(m'n') expected case complexity. Experimental results are provided to support the conjecture.

Kolmogorov (or Algorithmic) complexity

• Kolmogorov, Chaitin, ...
• What is the compressed version of the sequence '1234567891011121314151617181920212223242526...'?
• Every symbol appears almost equally frequently; almost "random" by entropy
• for i=1 to n do print i ;
• The algorithmic complexity (or Kolmogorov complexity) of a string S is the length of the shortest program that reproduces S, often denoted K(S)
• Conditional complexity: K(S|T) - reproduce S given T.
• http://en.wikipedia.org/wiki/Algorithmic_information_theory
• Algorithmic information theory is a field of study which attempts to capture the concept of complexity using tools from theoretical computer science. The chief idea is to define the complexity (or Kolmogorov complexity) of a string as the length of the shortest program which, when run without any input, outputs that string. Strings that can be produced by short programs are considered to be not very complex. This notion is surprisingly deep and can be used to state and prove impossibility results akin to Gödel's incompleteness theorem and Turing's halting problem.
• The field was developed by Andrey Kolmogorov and Gregory Chaitin starting in the late 1960s.
• Kolmogorov complexity: the size of the circle in bits...
[Diagram: Data → Encoder → Compressed data → Decoder → Data; encoder and decoder share the same Model.]

G. J. Chaitin
• http://www.cs.umaine.edu/~chaitin/
• Distance using K:
• d(S,T) = ( K(S|T) + K(T|S) ) / ( K(S) + K(T) )
• We cannot compute K, but we can approximate it
• e.g. by compression: LZ, BWT, etc.
• d(S,T) = ( C(ST) + C(TS) ) / ( C(S) + C(T) )
• Use of Kolmogorov Distance Identification of Web Page Authorship, Topic and Domain, David Parry (PPT), in OSWIR 2005, 2005 workshop on Open Source Web Information Retrieval
• Informatsioonikaugus (Information Distance) by Mart Sõmermaa, Fall 2003 (in the Data Mining Research seminar) http://www.egeen.ee/u/vilo/edu/2003-04/DM_seminar_2003_II/Raport/P06/main.pdf
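A sketch of the compression-based approximation, with zlib (an LZ77-family compressor) standing in for C; by this formula related strings score near 1 and unrelated strings near 2 (the helper names are ours):

import zlib

def C(s):
    # approximate Kolmogorov complexity by the compressed size
    return len(zlib.compress(s.encode(), 9))

def d(s, t):
    # d(S,T) = ( C(ST) + C(TS) ) / ( C(S) + C(T) ), as on the slide
    return (C(s + t) + C(t + s)) / (C(s) + C(t))

a = 'abcabcabcabc' * 20
b = 'abcabcabcabc' * 19 + 'abcabcabcxbc'
c = 'zqwxyvuptrsn' * 20
print(d(a, b))   # smaller: related strings share almost all structure
print(d(a, c))   # larger: unrelated strings compress poorly together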