
27.11.2012

Text Algorithms (6EAP)
Compression
Jaak Vilo
MTAT.03.190, 2012 fall

Problem
• Compress
  – Text
  – Images, video, sound, …
• Reduce space, efficient communication, etc…
  – Data deduplication
• Exact compression/decompression
• Lossy compression

Managing Gigabytes: Compressing and Indexing Documents and Images
• Ian H. Witten, Alistair Moffat, Timothy C. Bell
• Hardcover: 519 pages; Publisher: Morgan Kaufmann; 2nd Revised edition (11 May 1999); Language: English; ISBN-10: 1558605703

Links
• http://datacompression.info/
• http://en.wikipedia.org/wiki/Data_compression
• Data Compression, Debra A. Lelewer and Daniel S. Hirschberg
  – http://www.ics.uci.edu/~dan/pubs/DataCompression.html
• Compression FAQ: http://www.faqs.org/faqs/compression-faq/
• Information Theory Primer With an Appendix on Logarithms, by Tom Schneider: http://www.lecb.ncifcrf.gov/~toms/paper/primer/
• http://www.cbloom.com/algs/index.html

Problem
• Information transmission
• Information storage
• The data sizes are huge and growing
  – fax: 1.5 x 10^6 bit/page
  – photo: 2M pixels x 24 bit = 6 MB
  – X-ray image: ~100 MB?
  – Microarray scanned image: 30-100 MB
  – Tissue-microarray: hundreds of images, each tens of MB
  – Large Hadron Collider (CERN): the device will produce a few peta (10^15) bytes of stored data in a year
  – TV (PAL): 2.7 · 10^8 bit/s
  – CD-sound, super-audio, DVD, ...
  – Human genome: 3.2 Gbase; 30x sequencing => 100 Gbase + quality info (+ raw data)
  – 1000 genomes, all individual genomes, …

What is it about?
• Elimination of redundancy
• Being able to predict…
• Compression and decompression
  – Represent data in a more compact way
  – Decompression: restore the original form
• Lossy and lossless compression
  – Lossless: restore an exact copy
  – Lossy: restore almost the same information
    • Useful when 100% accuracy is not needed (voice, image, movies, ...)
    • Decompression is deterministic (the loss happens in the compression phase)
    • Can achieve much more effective results

Methods covered:
• Code words (Huffman coding)
• Run-length encoding
• Arithmetic coding
• Lempel-Ziv family (compress, gzip, zip, pkzip, ...)
• Burrows-Wheeler family (bzip2)
• Other methods, including images
• Kolmogorov complexity
• Search from compressed texts

Model
[Diagram: Data -> Encoder (uses a Model) -> Compressed data -> Decoder (uses the same Model) -> Data]

Information content and entropy
• Let p_S be the probability of message S
• The information content can be expressed in bits: I(S) = -log2(p_S) bits
• If p_S = 1 then the information content is 0 (no new information)
  – If Pr[s] = 1 then I(s) = 0
  – In other words, I(death) = I(taxes) = 0
• I(heads or tails) = 1 bit, if the coin is fair
• Entropy H is the average information content: H = Σ_S p_S · I(S) = -Σ_S p_S · log2(p_S) bits
  – http://en.wikipedia.org/wiki/Information_entropy
• Shannon's experiments with human predictors show an information rate of between 0.6 and 1.3 bits per character, depending on the experimental setup; the PPM compression algorithm can achieve a compression rate of about 1.5 bits per character.
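As a quick illustration of these formulas, here is a minimal Python sketch (added for this write-up, not part of the original slides) that computes the information content I(p) = -log2(p) and the empirical order-0 entropy of a string; the example string is the one used in the Shannon-Fano slides further below.

```python
import math
from collections import Counter

def information_content(p):
    """Information content of an event with probability p, in bits: I = -log2(p) = log2(1/p)."""
    return math.log2(1 / p)

def entropy(text):
    """Empirical order-0 entropy of a string: H = -sum p_i * log2(p_i),
    where p_i are the relative frequencies of the symbols in the string."""
    n = len(text)
    counts = Counter(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

if __name__ == "__main__":
    print(information_content(0.5))   # a fair coin flip carries 1 bit
    print(information_content(1.0))   # a certain event carries 0 bits (I(death) = I(taxes) = 0)

    s = "aa bbb cccc ddddd eeeeee fffffffgggggggg"
    h = entropy(s)
    print(f"{h:.3f} bits/symbol, lower bound ~ {h * len(s):.1f} bits for the whole string")
```

For that string the entropy works out to roughly 2.9 bits per symbol, i.e. about 116 bits in total, which lines up with the 117-bit Shannon-Fano code and the 120-bit fixed 3-bit code discussed later.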
Entropy
• No compression method can, on average, achieve better compression than the entropy
• Entropy depends on the model (on the choice of symbols)
• Let M = {m_1, .., m_n} be the set of symbols of a model A, and let p(m_i) be the probability of the symbol m_i
• The entropy of the model A is H(M) = -Σ_{i=1..n} p(m_i) · log2(p(m_i)) bits
• Let the message be S = s_1, .., s_k, where every symbol s_i is in the model M. The information content of S under the model A is -Σ_{i=1..k} log2(p(s_i)) bits
• Every symbol has to have a probability, otherwise it cannot be coded if it occurs in the data

http://prize.hutter1.net/
• The data compression world is all abuzz about Marcus Hutter's recently announced 50,000 euro prize for record-breaking data compressors. Marcus, of the Swiss Dalle Molle Institute for Artificial Intelligence, apparently in cahoots with Florida compression maven Matt Mahoney, is offering cash prizes for what amounts to the most impressive ability to compress 100 MBytes of Wikipedia data. (Note that nobody is going to win exactly 50,000 euros - the prize amount is prorated based on how well you beat the current record.)
• This prize differs considerably from my Million Digit Challenge, which is really nothing more than an attempt to silence people foolishly claiming to be able to compress random data. Marcus is instead looking for the most effective way to reproduce the Wiki data, and he's putting up real money as an incentive. The benchmark that contestants need to beat is that set by Matt Mahoney's paq8f, the current record holder at 18.3 MB. (Alexander Ratushnyak's submission of a paq variant looks to clock in at a tidy 17.6 MB, and should soon be confirmed as the new standard.)
• So why is an AI guy inserting himself into the world of compression? Well, Marcus realizes that good data compression is all about modeling the data. The better you understand the data stream, the better you can predict the incoming tokens in the stream. Claude Shannon empirically found that humans could model English text with an entropy of 0.6 to 1.3 bits per character, which at best should mean that 100 MB of Wikipedia data could be reduced to 7.5 MB, with an upper bound of perhaps 16.25 MB. The theory is that reaching that 7.5 MB range is going to take such a good understanding of the data stream that it will amount to a demonstration of Artificial Intelligence.
• http://marknelson.us/2006/08/24/the-hutter-prize/#comment-293

Static or adaptive models
• A static model does not change during the compression
• An adaptive model can be updated during the process
  – Symbols not in the message cannot have 0 probability
• A semi-adaptive model works in 2 stages, off-line:
  – first create the code table, then encode the message with the code table

How to compare compression techniques?
• Ratio (t/p); t: original message length, p: compressed message length
• In texts: bits per symbol
• The time and memory used for compression
• The time and memory used for decompression
• Error tolerance (e.g. self-correcting code)

Shorter code words…
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
• Alphabet of 8 symbols
• Length = 40 symbols
• Equal-length codewords, 3 bits each:
  a 000, b 001, c 010, d 011, e 100, f 101, g 110, space 111
• S compressed: 3 * 40 = 120 bits

Alphabetically ordered word-lists
• Each word can be encoded as the length of the prefix it shares with the previous word, followed by the remaining suffix:
  resume   0resume
  retail   2tail
  retain   5n
  retard   4rd
  retire   3ire

Run-length encoding
• http://michael.dipperstein.com/rle/index.html
• The string "aaaabbcdeeeeefghhhij" may be replaced with "a4b2c1d1e5f1g1h3i1j1".
• This is not shorter, because a 1-letter repeat now takes more characters...
• "a3b1cde4fgh2ij" - but now we need to know which characters are followed by a run length, e.g. by using escape symbols.
• Or, use the symbol itself: if it is repeated, the repeat must be followed by the run length: "aa2bb0cdee3fghh1ij"
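To make the last run-length variant concrete, here is a small Python sketch (my own illustration, not from the slides) of the scheme where a doubled symbol marks a run and is followed by the number of extra repeats. It assumes single-digit run counts and an input alphabet without digit characters, as in the slide's example.

```python
def rle_encode(s):
    """Run-length encode: a repeated symbol is written twice, followed by the
    number of additional repeats beyond two ('aaaa' -> 'aa2', 'bb' -> 'bb0')."""
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        run = j - i
        out.append(s[i] if run == 1 else s[i] * 2 + str(run - 2))
        i = j
    return "".join(out)

def rle_decode(s):
    """Invert rle_encode (assumes run counts are single digits, as in the slide)."""
    out = []
    i = 0
    while i < len(s):
        if i + 1 < len(s) and s[i + 1] == s[i]:
            extra = int(s[i + 2])          # digit: repeats beyond the first two
            out.append(s[i] * (2 + extra))
            i += 3
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

if __name__ == "__main__":
    s = "aaaabbcdeeeeefghhhij"
    enc = rle_encode(s)
    print(enc)                             # aa2bb0cdee3fghh1ij, as on the slide
    assert rle_decode(enc) == s
```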
Coding techniques
• Coding refers to techniques used to encode tokens or symbols.
• Two of the best known coding algorithms are Huffman coding and arithmetic coding.
• Coding algorithms are effective at compressing data when they use fewer bits for high-probability symbols and more bits for low-probability symbols.

Variable-length encoders
• How to use codes of variable length?
• The decoder needs to know how long each symbol's code is
• Prefix-free code: no codeword can be a prefix of another codeword

Algorithm Shannon-Fano
• Input: probabilities of the symbols
• Output: codewords of a prefix-free code
1. Sort symbols by frequency
2. Divide them into two groups of as nearly equal probability as possible
3. The first group gets prefix 0, the other prefix 1
4. Repeat recursively within each group until only 1 symbol remains

• Calculate the frequencies and probabilities of the symbols:
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'

  symbol  freq  ratio  p(s)
  a       2     2/40   0.05
  b       3     3/40   0.075
  c       4     4/40   0.1
  d       5     5/40   0.125
  space   5     5/40   0.125
  e       6     6/40   0.15
  f       7     7/40   0.175
  g       8     8/40   0.2

Example 1
• Code:
  a  1/2   0
  b  1/4   10
  c  1/8   110
  d  1/16  1110
  e  1/32  11110
  f  1/32  11111

Shannon-Fano
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'

  symbol  p(s)   code
  g       0.2    00
  f       0.175  010
  e       0.15   011
  d       0.125  100
  space   0.125  101
  c       0.1    110
  b       0.075  1110
  a       0.05   1111

  (The first split separates {g, f, e}, total probability 0.525, from {d, space, c, b, a}, total probability 0.475.)

Shannon-Fano
• S compressed is 117 bits:
  2*4 + 3*4 + 4*3 + 5*3 + 5*3 + 6*3 + 7*3 + 8*2 = 117
• Shannon-Fano is not always optimal
• Sometimes two equally probable groups cannot be achieved
• Usually better than H+1 bits per symbol, where H is the entropy.

Huffman code
• Works the opposite way (bottom-up).
• Start from the two least probable symbols and separate them with 0 and 1 (as a suffix)
• Add their probabilities to form a "new symbol" with the combined probability
• Prepend the new bits in front of the old ones.

Huffman example
• "this is an example of a huffman tree"

  Char   Freq  Code
  space  7     111
  a      4     010
  e      4     000
  f      3     1101
  h      2     1010
  i      2     1000
  m      2     0111
  n      2     0010
  s      2     1011
  t      2     0110
  l      1     11001
  o      1     00110
  p      1     10011
  r      1     11000
  u      1     00111
  x      1     10010

Properties of Huffman coding
• Huffman coding is optimal when the frequencies of the input characters are powers of two.
• Error tolerance is quite good
• In case of the loss, addition, or change of a bit …
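The bottom-up construction described in the Huffman slides can be sketched in a few lines of Python (an illustration with my own tie-breaking, not the course's reference code). Because ties between equal frequencies may be broken differently, individual codewords can differ from the table above, but the total encoded length of the example sentence is the same, since every Huffman code is optimal.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a Huffman code: repeatedly merge the two least frequent subtrees,
    prepending '0'/'1' to the codewords of the symbols inside them."""
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # least frequent subtree
        f2, _, c2 = heapq.heappop(heap)      # second least frequent subtree
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

if __name__ == "__main__":
    text = "this is an example of a huffman tree"
    code = huffman_code(text)
    encoded = "".join(code[ch] for ch in text)
    for sym in sorted(code, key=lambda s: (len(code[s]), s)):
        print(repr(sym), code[sym])
    print(len(encoded), "bits, vs", 8 * len(text), "bits as 8-bit characters")
```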