Text Algorithms (6EAP) – Compression
Jaak Vilo, MTAT.03.190, 2012 fall

Problem
• Compress
  – Text
  – Images, video, sound, …
  – Data deduplication
• Reduce space, efficient communication, etc…
• Exact compression/decompression
• Lossy compression
Links
• http://datacompression.info/
• http://en.wikipedia.org/wiki/Data_compression
• Data Compression, Debra A. Lelewer and Daniel S. Hirschberg
  – http://www.ics.uci.edu/~dan/pubs/DataCompression.html
• Compression FAQ: http://www.faqs.org/faqs/compression-faq/
• Information Theory Primer With an Appendix on Logarithms, Tom Schneider
  – http://www.lecb.ncifcrf.gov/~toms/paper/primer/
• http://www.cbloom.com/algs/index.html

Managing Gigabytes: Compressing and Indexing Documents and Images
• Ian H. Witten, Alistair Moffat, Timothy C. Bell
• Hardcover: 519 pages; Morgan Kaufmann, 2nd Revised edition (11 May 1999); Language: English; ISBN-10: 1558605703
Problem
• Information transmission
• Information storage
• The data sizes are huge and growing
  – fax: 1.5 x 10^6 bit/page
  – photo: 2M pixels x 24 bit = 6 MB
  – X-ray image: ~100 MB?
  – Microarray scanned image: 30–100 MB
  – Tissue-microarray: hundreds of images, each tens of MB
  – Large Hadron Collider (CERN): the device will produce a few peta (10^15) bytes of stored data in a year
  – TV (PAL): 2.7 · 10^8 bit/s
  – CD-sound, super-audio, DVD, ...
  – Human genome: 3.2 Gbase; 30x sequencing => 100 Gbase + quality info (+ raw data)
  – 1000 genomes, all individual genomes …

What it's about?
• Elimination of redundancy
• Being able to predict…
• Compression and decompression
  – Represent data in a more compact way
  – Decompression: restore the original form
• Lossy and lossless compression
  – Lossless: restore an exact copy
  – Lossy: restore almost the same information
    • Useful when 100% accuracy is not needed (voice, image, movies, ...)
    • Decompression is deterministic (the loss happens in the compression phase)
    • Can achieve much more effective results
Methods covered:
• Code words (Huffman coding)
• Run-length encoding
• Arithmetic coding
• Lempel-Ziv family (compress, gzip, zip, pkzip, ...)
• Burrows-Wheeler family (bzip2)
• Other methods, including images
• Kolmogorov complexity
• Search from compressed texts

Model
• Data → Encoder → Compressed data → Decoder → Data
• A Model guides both the encoder and the decoder
Information content and entropy
• Let pS be the probability of message S
• The information content can be represented in terms of bits:
  I(S) = -log2( pS ) bits
• If pS = 1 then the information content is 0 (no new information)
  – If Pr[s] = 1 then I(s) = 0
  – In other words, I(death) = I(taxes) = 0
• I( heads or tails ) = 1 bit, if the coin is fair
• Entropy H is the average information content over all messages S:
  H = ∑S pS · I(S) = -∑S pS log2( pS ) bits
  http://en.wikipedia.org/wiki/Information_entropy
• Shannon's experiments with human predictors show an information rate of between 0.6 and 1.3 bits per character, depending on the experimental setup; the PPM compression algorithm can achieve a compression ratio of 1.5 bits per character.
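To make the definitions concrete, here is a minimal sketch (Python; function name is mine) that estimates the order-0 entropy of a string from its symbol frequencies:

```python
import math
from collections import Counter

def entropy(text):
    """Empirical entropy, in bits per symbol, of the symbol
    distribution of `text` (order-0 model)."""
    counts = Counter(text)
    n = len(text)
    h = 0.0
    for symbol, k in counts.items():
        p = k / n                  # p_s: probability of symbol s
        h -= p * math.log2(p)      # H = -sum p_s * log2(p_s)
    return h

s = "aa bbb cccc ddddd eeeeee fffffffgggggggg"
print(entropy(s))   # ~2.89 bits per symbol for the example used below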
Entropy and models
• No compression can on average achieve better compression than the entropy
• Entropy depends on the model (or choice of symbols)
• Let M = {m1, .., mn} be the set of symbols of model A, and let p(mi) be the probability of symbol mi
• The entropy of model A is H(M) = -∑i=1..n p(mi) · log( p(mi) ) bits
• Let the message be S = s1, .., sk, where every symbol si is in the model M. The information content of S under model A is -∑i=1..k log( p(si) ) bits
• Every symbol has to have a probability, otherwise it cannot be coded if it is present in the data

http://prize.hutter1.net/
• The data compression world is all abuzz about Marcus Hutter's recently announced 50,000 euro prize for record-breaking data compressors. Marcus, of the Swiss Dalle Molle Institute for Artificial Intelligence, apparently in cahoots with Florida compression maven Matt Mahoney, is offering cash prizes for what amounts to the most impressive ability to compress 100 MBytes of Wikipedia data. (Note that nobody is going to win exactly 50,000 euros - the prize amount is prorated based on how well you beat the current record.)
• This prize differs considerably from my Million Digit Challenge, which is really nothing more than an attempt to silence people foolishly claiming to be able to compress random data. Marcus is instead looking for the most effective way to reproduce the Wiki data, and he's putting up real money as an incentive. The benchmark that contestants need to beat is that set by Matt Mahoney's paq8f, the current record holder at 18.3 MB. (Alexander Ratushnyak's submission of a paq variant looks to clock in at a tidy 17.6 MB, and should soon be confirmed as the new standard.)
• So why is an AI guy inserting himself into the world of compression? Well, Marcus realizes that good data compression is all about modeling the data. The better you understand the data stream, the better you can predict the incoming tokens in a stream. Claude Shannon empirically found that humans could model English text with an entropy of 0.6 to 1.3 bits per character, which at best should mean that 100 MB of Wikipedia data could be reduced to 7.5 MB, with an upper bound of perhaps 16.25 MB. The theory is that reaching that 7.5 MB range is going to take such a good understanding of the data stream that it will amount to a demonstration of Artificial Intelligence.
  http://marknelson.us/2006/08/24/the-hutter-prize/#comment-293
Static or adaptive
• A static model does not change during the compression
• An adaptive model can be updated during the process
  – Symbols not in the message cannot have 0 probability
• A semi-adaptive model works in 2 stages, off-line:
  – First create the code table, then encode the message with the code table
How to compare compression techniques?
• Ratio (t/p), where t is the original message length and p the compressed message length
• In texts: bits per symbol
• The time and memory used for compression
• The time and memory used for decompression
• Error tolerance (e.g. self-correcting code)

Shorter code words…
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'
• Alphabet of 8 symbols; length = 40 symbols
• Equal-length codewords, 3 bits each:
  a 000, b 001, c 010, d 011, e 100, f 101, g 110, space 111
• S compressed: 3 * 40 = 120 bits
Run-length encoding
• http://michael.dipperstein.com/rle/index.html
• The string "aaaabbcdeeeeefghhhij" may be replaced with "a4b2c1d1e5f1g1h3i1j1"
• This is not shorter, because a 1-letter repeat takes more characters...
• "a3b1cde4fgh2ij" — store a count only after characters that repeat
• Now we need to know which characters are followed by a run-length
  – E.g. use escape symbols
  – Or use the symbol itself: if it is repeated, then it must be followed by the run-length
  – "aa2bb0cdee3fghh1ij"

Alphabetically ordered word-lists
• Encode only the suffix that differs from the previous word (front coding):
  resume → 0resume
  retail → 2tail
  retain → 5n
  retard → 4rd
  retire → 3ire
Coding techniques
• Coding refers to techniques used to encode tokens or symbols
• Two of the best known coding algorithms are Huffman Coding and Arithmetic Coding
• Coding algorithms are effective at compressing data when they use fewer bits for high-probability symbols and more bits for low-probability symbols

Variable-length encoders
• How to use codes of variable length?
• The decoder needs to know how long each symbol's code is
• Prefix-free code: no code word can be a prefix of another code word
Shannon-Fano algorithm
• Input: probabilities of symbols
• Output: code words in a prefix-free code
1. Sort symbols by frequency
2. Divide into two groups of (nearly) equal total probability
3. The first group gets prefix 0, the other prefix 1
4. Repeat recursively in each group until 1 symbol remains

• Calculate the frequencies and probabilities of the symbols:
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'

  symbol  freq  ratio  p(s)
  a       2     2/40   0.05
  b       3     3/40   0.075
  c       4     4/40   0.1
  d       5     5/40   0.125
  space   5     5/40   0.125
  e       6     6/40   0.15
  f       7     7/40   0.175
  g       8     8/40   0.2
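A recursive sketch of these four steps (Python; names are mine, and the greedy split point is chosen to best balance the two halves, so tie-breaking may produce slightly different but equally long codes than the table below):

```python
def shannon_fano(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> code."""
    codes = {s: "" for s in probs}

    def split(symbols):
        if len(symbols) <= 1:
            return
        total = sum(probs[s] for s in symbols)
        acc, best_i, best_diff = 0.0, 1, float("inf")
        for i in range(1, len(symbols)):     # find the most balanced cut
            acc += probs[symbols[i - 1]]
            diff = abs(2 * acc - total)      # |left half - right half|
            if diff < best_diff:
                best_diff, best_i = diff, i
        for s in symbols[:best_i]:
            codes[s] += "0"                  # first group gets prefix 0
        for s in symbols[best_i:]:
            codes[s] += "1"                  # other group gets prefix 1
        split(symbols[:best_i])
        split(symbols[best_i:])

    split(sorted(probs, key=probs.get, reverse=True))
    return codes

p = {'a': .05, 'b': .075, 'c': .1, 'd': .125, ' ': .125,
     'e': .15, 'f': .175, 'g': .2}
print(shannon_fano(p))
```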
Example 1
Code:
  a  1/2   0
  b  1/4   10
  c  1/8   110
  d  1/16  1110
  e  1/32  11110
  f  1/32  11111
Here each code length equals -log2(p), so the code matches the entropy exactly.
Shannon-Fano
• S = 'aa bbb cccc ddddd eeeeee fffffffgggggggg'

  symbol  p(s)   code
  g       0.2    00
  f       0.175  010
  e       0.15   011
  d       0.125  100
  space   0.125  101
  c       0.1    110
  b       0.075  1110
  a       0.05   1111

• S compressed is 117 bits:
  2*4 + 3*4 + 4*3 + 5*3 + 5*3 + 6*3 + 7*3 + 8*2 = 117 (2.925 bits per symbol)
• Shannon-Fano is not always optimal
  – Sometimes 2 groups of equal probability cannot be achieved
• Usually better than H+1 bits per symbol, where H is the entropy
Huffman code
• Works the opposite way (bottom-up):
  – Start from the two least probable symbols and separate them with 0 and 1 (as a suffix)
  – Add the probabilities to form a "new symbol" with the combined probability
  – Prepend new bits in front of old ones
• Example: "this is an example of a huffman tree"

  char   freq  code
  space  7     111
  a      4     010
  e      4     000
  f      3     1101
  h      2     1010
  i      2     1000
  m      2     0111
  n      2     0010
  s      2     1011
  t      2     0110
  l      1     11001
  o      1     00110
  p      1     10011
  r      1     11000
  u      1     00111
  x      1     10010
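A compact Huffman construction using a heap (Python sketch; function name is mine, and ties are broken arbitrarily, so the exact code words may differ from the table above while the code lengths stay optimal):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code for the symbols of `text`.
    Returns dict symbol -> bit string."""
    heap = [(f, i, {c: ""}) for i, (c, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tick = len(heap)                      # unique tie-breaker for the heap
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least probable "symbols"
        f2, _, c2 = heapq.heappop(heap)
        for s in c1: c1[s] = "0" + c1[s]  # prepend the new bit in front
        for s in c2: c2[s] = "1" + c2[s]
        c1.update(c2)                     # merge into a combined "symbol"
        heapq.heappush(heap, (f1 + f2, tick, c1))
        tick += 1
    return heap[0][2]

codes = huffman_codes("this is an example of a huffman tree")
for c in sorted(codes, key=lambda c: (len(codes[c]), c)):
    print(repr(c), codes[c])
```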
Properties of Huffman coding
• Has been shown that the code is optimal among symbol-by-symbol prefix codes
• Can be shown that the average result is at most H + p + 0.086 bits per symbol, where H is the entropy and p is the probability of the most probable symbol
• Error tolerance is quite good
  – In case of the loss, addition or change of a single bit, the differences remain local to the place of the error
  – The error usually remains quite local (proof?)
• Huffman coding is optimal when the frequencies of input characters are powers of two. Arithmetic coding produces slight gains over Huffman coding, but in practice these gains have not been large enough to offset arithmetic coding's higher computational complexity and patent royalties (as of November 2001/Jul 2006, IBM owned patents on the core concepts of arithmetic coding in several jurisdictions).
  http://en.wikipedia.org/wiki/Arithmetic_coding#US_patents_on_arithmetic_coding
Move to Front
• Move to Front (MTF), cf. Least Recently Used (LRU)
• Keep a list of the last k symbols of S
• Code:
  – use the symbol's current position in the list as its code
  – if in the codebook, move it to the front
  – if not in the codebook, insert at the front, remove the last
• c.f. the handling of memory paging
• Other heuristics ...

Arithmetic (en)coding
• Arithmetic coding is a method for lossless data compression
• It is a form of entropy encoding, but where other entropy encoding techniques separate the input message into its component symbols and replace each symbol with a code word, arithmetic coding encodes the entire message into a single number, a fraction n where 0.0 ≤ n < 1.0
• Huffman coding is optimal for character encoding (one character - one code word) and simple to program. Arithmetic coding is better still, since it can allocate fractional bits, but it is more complicated.
  Wikipedia: http://en.wikipedia.org/wiki/Arithmetic_encoding
• Every symbol gets a probability based on the model
• The probabilities represent non-intersecting intervals
• Every text corresponds to such an interval
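A minimal MTF sketch over a fixed alphabet (Python; function names are mine). Encode emits list positions; decode mirrors the moves, so the pair is lossless:

```python
def mtf_encode(s, alphabet):
    table = list(alphabet)
    out = []
    for c in s:
        i = table.index(c)                # code = current position
        out.append(i)
        table.insert(0, table.pop(i))     # move to front
    return out

def mtf_decode(codes, alphabet):
    table = list(alphabet)
    out = []
    for i in codes:
        c = table[i]
        out.append(c)
        table.insert(0, table.pop(i))     # same move keeps tables in sync
    return "".join(out)

codes = mtf_encode("bananaaa", "abn")
print(codes)                     # [1, 1, 2, 1, 1, 1, 0, 0] - repeats become 0s
print(mtf_decode(codes, "abn"))  # 'bananaaa'
```

Runs of repeated symbols become runs of 0s, which is why MTF pairs so well with the Burrows-Wheeler transform below.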
Let P(A)=0.1, P(B)=0.4, P(C)=0.5
A [0,0.1) AA [0 , 0.01) AB [0.01 , 0.05) AC [0.05 , 0.1 )
B [0.1,0.5) BA [0.1 , 0.14) BB [0.14 , 0.3 ) BC [0.3 , 0.5 )
C [0.5,1) CA [0.5 , 0.55 ) CB [0.55 , 0.75 ) CC [0.75 , 1 )
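The interval narrowing can be written out directly. A toy sketch (Python; `RANGES` and the function name are mine; floats stand in for arbitrary-precision arithmetic, so this only works for short messages) that reproduces the table above — encoding "BC" yields the interval [0.3, 0.5):

```python
# Cumulative intervals for P(A)=0.1, P(B)=0.4, P(C)=0.5
RANGES = {'A': (0.0, 0.1), 'B': (0.1, 0.5), 'C': (0.5, 1.0)}

def arith_interval(msg):
    """Return the [low, high) interval representing `msg`. Any number
    inside the interval decodes back to `msg`, given its length or an
    end-of-message convention."""
    low, high = 0.0, 1.0
    for c in msg:
        width = high - low
        lo_c, hi_c = RANGES[c]
        low, high = low + width * lo_c, low + width * hi_c
    return low, high

print(arith_interval("BC"))   # (0.3, 0.5)  - matches the table
print(arith_interval("AB"))   # (0.01, 0.05)
```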
Arithmetic coding in practice
• Add an EOF symbol
• Problem with infinite-precision arithmetic
  – Alternative: work blockwise, use integer arithmetic
  – Works if the smallest pi is not too small
• Best compression ratio
• Problems: the speed, and error tolerance (a small change has a catastrophic effect)

• Invented by Jorma Rissanen (then at IBM)
• Arithmetic coding revisited, Alistair Moffat, Radford M. Neal, Ian H. Witten
  http://portal.acm.org/citation.cfm?id=874788
• Models for arithmetic coding:
  – HMM, Hidden Markov Models
  – Context methods: Abrahamson dependency model
  – Use the context to the maximum, to predict the next symbol
  – PPM - Prediction by Partial Matching: several contexts, choose the best; variations
Dictionary-based compression
• Dictionary (symbol table), list of codes
• If not in the dictionary, use an escape
• The usual heuristic searches for the longest repeat
• With a fixed table one can search for the optimal code
• With an adaptive dictionary, optimal coding is NP-complete
• Quite good for English-language texts, for example

Lempel-Ziv family, LZ, LZ-77
• Use the dictionary to memorise the previously compressed parts
• LZ-77: sliding window over the previous text, plus the text still to be compressed
  /bbaaabbbaaffacbbaa…./...abbbaabab...
• Lookahead: the longest prefix of the remaining text that begins within the moving window is encoded with [position, length]
• In the example, [5,6]
• Fast! (Commercial software, e.g. PKZIP, Stacker, DoubleSpace)
• Several alternative codes for the same string (alternative substrings will match)
• Lempel-Ziv compression, McGill Univ.: http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/
Original LZ77
• Triples [position, length, next char]
• If the output is [a,b,c], advance by b+1 positions
• For each part of the triple, the number of bits reserved depends on the window length:
  ⌈log(n-f)⌉ + ⌈log(f)⌉ + 8, where n is the window size and f is the lookahead size
• Example: abbbbabbc → [0,0,a] [0,0,b] [1,3,a] [3,2,c]
• In the example, the match actually overlaps with the lookahead window ([1,3,a] copies 'bbb' starting inside the text it is producing)

LZ-78
• Dictionary: store strings from the already processed part of the message
• The next phrase is the longest match from the dictionary that matches the text still to be processed
• LZ78 (Ziv and Lempel 1978)
• Initially the dictionary is empty; index 0 denotes the empty string
• Code [i,c]: refers to the dictionary (word u at position i), and c is the next symbol
• Add uc to the dictionary
• Example: ababcabc → [0,a][0,b][1,b][0,c][3,c]
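A short LZ78 encoder (Python sketch; names are mine) that reproduces the ababcabc example; index 0 stands for the empty string:

```python
def lz78_encode(s):
    """LZ78: emit [index, next-char] pairs; index 0 = empty string."""
    dictionary = {"": 0}
    out = []
    w = ""
    for c in s:
        if w + c in dictionary:
            w += c                                # extend current match
        else:
            out.append((dictionary[w], c))        # code [i, c]
            dictionary[w + c] = len(dictionary)   # add uc to dictionary
            w = ""
    if w:                                         # flush a trailing match
        out.append((dictionary[w[:-1]], w[-1]))
    return out

print(lz78_encode("ababcabc"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (0, 'c'), (3, 'c')]
```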
LZW
• The code consists of dictionary indices only!
• Initially the dictionary contains every symbol of the alphabet
• Update the dictionary like LZ78
• In decoding there is a danger: see abcabca
  – If abc is in the dictionary, abca is added to the dictionary
  – The next phrase is abca; the encoder outputs that code
  – But when decoding, after abc it is not yet known that abca is in the dictionary
  – Solution: if a dictionary entry is used immediately after its creation, its 1st and last characters have to match

LZJ
• Coding: search for the longest prefix; the code is the address of the trie node
• From the root of the trie there is a transition on every symbol (like in LZW)
• If out of memory, remove the nodes/branches that have been used only once
• In practice, h=6, the dictionary has 8192 nodes
• Many representations for the dictionary:
  – list, hash, sorted list, combination, binary tree, trie, suffix tree, ...
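An LZW encoder sketch (Python; names are mine, and the alphabet is passed explicitly). The dictionary here is a plain dict mapping strings to codes, one of the many representations listed above; the output is indices only:

```python
def lzw_encode(s, alphabet):
    """LZW: the dictionary starts with every symbol of the alphabet;
    output is a list of dictionary indices only."""
    dictionary = {c: i for i, c in enumerate(alphabet)}
    out = []
    w = s[0]
    for c in s[1:]:
        if w + c in dictionary:
            w += c                               # grow the match
        else:
            out.append(dictionary[w])
            dictionary[w + c] = len(dictionary)  # update like LZ78
            w = c
    out.append(dictionary[w])                    # flush the last match
    return out

print(lzw_encode("abcabcabca", "abc"))
# [0, 1, 2, 3, 5, 4, 0]  (3='ab', 5='ca', 4='bc' were learned on the fly)
```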
LZFG
• An effective LZ method, derived from LZJ
• Create a suffix tree for the window
• Code: node address plus the number of characters
• The internal and leaf nodes get different codes
• Small matches are coded directly... (?)

Burrows-Wheeler
• See the FAQ: http://www.faqs.org/faqs/compression-faq/part2/section-9.html
• The method described in the original paper is really a composite of three different algorithms:
  – the block sorting main engine (a lossless, very slightly expansive preprocessor),
  – the move-to-front coder (a byte-for-byte simple, fast, locally adaptive noncompressive coder), and
  – a simple statistical compressor (first order Huffman is mentioned as a candidate) eventually doing the compression
• Of these three methods only the first two are discussed here, as they are what constitutes the heart of the algorithm. These two algorithms combined form a completely reversible (lossless) transformation that - with typical input - skews the first order symbol distributions to make the data more compressible with simple methods. Intuitively speaking, the method transforms slack in the higher order probabilities of the input block (thus making them more even, whitening them) to slack in the lower order statistics. This effect is what is seen in the histogram of the resulting symbol data.
• Please read the article by Mark Nelson: Data Compression with the Burrows-Wheeler Transform, Dr. Dobb's Journal, September 1996. http://marknelson.us/1996/09/01/bwt/
Burrows-Wheeler Transform (BWT)
Example
• Sorted rotations from Mark Nelson's article; the column of characters preceding the sorted suffixes "hat ..." is the BWT output:
  t: hat acts like this:<13><10><1
  t: hat buffer to the constructor
  t: hat corrupted the heap, or wo
  W: hat goes up must come down<13
  t: hat happens, it isn't likely
  w: hat if you want to dynamicall
  t: hat indicates an error.<13><1
  t: hat it removes arguments from
  t: hat looks like this:<13><10><
  t: hat looks something like this
  t: hat looks something like this
  t: hat once I detect the mangled

• Decode: errktreteoe.e
  – Hint: '.' is the last character, alphabetically first…
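A naive BWT sketch (Python; names are mine; O(n² log n) from sorting all rotations, fine for illustration). The '$' end-of-string sentinel is an assumed convention, lexicographically smallest, which makes the inverse unambiguous:

```python
def bwt(s):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    s = s + "$"                              # unique, smallest sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def ibwt(code):
    """Inverse BWT: repeatedly prepend the code column and re-sort;
    after len(code) rounds each row is a full rotation."""
    table = [""] * len(code)
    for _ in range(len(code)):
        table = sorted(code[i] + table[i] for i in range(len(code)))
    row = next(r for r in table if r.endswith("$"))
    return row[:-1]

t = bwt("banana")
print(t)           # 'annb$aa' - equal characters cluster together
print(ibwt(t))     # 'banana'
```

The clustering of equal characters is exactly what makes the MTF stage produce long runs of small numbers.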
Syntactic compression
• Context Free Grammar for presenting the syntax tree
• Usually for source code
• Assumption: the program is syntactically correct
• Comments
• Features, constants: compress group by group

Image compression
• Many images, photos, sound, video, ...
Fax group 3
• Black/white, 0/1 code
• Run-length: 000111001000 → 3,3,2,1,3
• Variable-length codes for representing the run-lengths

JPEG
• Joint Photographic Experts Group; JPEG 2000: http://www.jpeg.org/jpeg2000/
• Image Compression -- JPEG, from W.B. Pennebaker, J.L. Mitchell, "The JPEG Still Image Data Compression Standard", Van Nostrand Reinhold, 1993
• Color image, 8 or 12 bits per pixel per color
• Four modes: Sequential Mode, Lossless Mode, Progressive Mode, Hierarchical Mode
• DCT (Discrete Cosine Transform)
Lena
• From http://www.utdallas.edu/~aria/mcl/post/
• Lossy signal compression works on the basis of transmitting the "important" signal content, while omitting other parts (quantization). To perform this quantization effectively, a linear de-correlating transform is applied to the signal prior to quantization. All existing image and video coding standards use this approach. The most commonly used transform is the Discrete Cosine Transform (DCT), used in JPEG, MPEG-1, MPEG-2, H.261 and H.263 and their descendants. For a detailed discussion of the theory behind quantization and justification of the usage of linear transforms, see reference [1] below.
• A brief overview of JPEG compression is as follows. The JPEG encoder partitions the image into 8x8 blocks of pixels. To each of these blocks it applies a 2-dimensional DCT. The transform matrix is normalized (element-wise) by an 8x8 quantization matrix and then rounded to the nearest integer. This operation is equivalent to applying different uniform quantizers to different frequency bands of the image. The high-frequency image content can be quantized more coarsely than the low-frequency content, due to two factors.
Vector quantization
• Vector quantization: a dictionary method
• 2-dimensional blocks (each block is coded by the index of the nearest codebook vector)

Discrete cosine transform
• A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies. DCTs are important to numerous applications in science and engineering, from lossy compression of audio and images (where small high-frequency components can be discarded), to spectral methods for the numerical solution of partial differential equations. The use of cosine rather than sine functions is critical in these applications: for compression, it turns out that cosine functions are much more efficient (fewer are needed to approximate a typical signal), whereas for differential equations the cosines express a particular choice of boundary conditions.
• http://en.wikipedia.org/wiki/Discrete_cosine_transform
• In particular, a DCT is a Fourier-related transform similar to the discrete Fourier transform (DFT), but using only real numbers. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry (since the Fourier transform of a real and even function is real and even), where in some variants the input and/or output data are shifted by half a sample. There are eight standard DCT variants, of which four are common.
• [Figure: 2D DCT (type II) compared to the DFT. For both transforms, the magnitude of the spectrum is on the left and the histogram on the right; both spectra are cropped to 1/4 to zoom the behaviour in the lower frequencies. The DCT concentrates most of the power in the lower frequencies.]
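A numpy sketch of the orthonormal DCT-II on an 8x8 block (the JPEG building block; names are mine). The 2D transform is Y = C X Cᵀ, where C[k,n] = α(k)·√(2/N)·cos(π(n+½)k/N) with α(0)=1/√2, α(k)=1 otherwise:

```python
import numpy as np

def dct2_matrix(N=8):
    """Orthonormal DCT-II basis matrix C (N x N)."""
    n = np.arange(N)
    C = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / N)
    C[0] *= 1 / np.sqrt(2)               # alpha(0) normalization
    return C * np.sqrt(2.0 / N)

def dct2(block):
    """2D DCT-II of a square block: apply C along rows and columns."""
    C = dct2_matrix(block.shape[0])
    return C @ block @ C.T

# A smooth 8x8 block: almost all energy lands in the low frequencies,
# which is exactly what JPEG's coarse high-frequency quantization exploits.
x = np.fromfunction(lambda i, j: np.cos(np.pi * (i + j) / 16), (8, 8))
y = dct2(x)
print(np.round(y, 2))    # large values only in the top-left corner
```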
• Digital Image Processing: http://www-ee.uta.edu/dip/
[Figure: Block Diagram of JPEG Baseline, from Wallace's JPEG tutorial (1993) and Liu's EE330 slides (Princeton); example luminance image 475 x 330 x 3 = 157 KB.]
[Figures from Liu's EE330 slides (Princeton) and UMCP ENEE408G slides (M. Wu & R. Liu, 2002):]
• RGB components vs. Y U V (Y Cb Cr) components: assign more bits to Y, fewer bits to Cb and Cr
• JPEG compression at Q=75%: 45 KB, compression ratio ~ 4:1
• JPEG compression at Q=75% and Q=30%: 45 KB and 22 KB
• Quality vs. size: Uncompressed (100 KB), JPEG 75% (18 KB), JPEG 50% (12 KB), JPEG 30% (9 KB), JPEG 10% (5 KB)
1.4-billion-pixel digital camera
• Monday, November 24, 2008: http://www.technologyreview.com/computing/21705/page1/
• Giant Camera Tracks Asteroids: the camera will offer sharper, broader views of the sky
• The focal plane of each camera contains an almost complete 64 x 64 array of CCD devices, each containing approximately 600 x 600 pixels, for a total of about 1.4 gigapixels. The CCDs themselves employ the innovative technology called "orthogonal transfer", which is described below.
• [Diagram showing how an OTA chip is made up of 64 OTCCD cells, each of which has 600 x 600 pixels]

Fractal compression
• Fractal Compression group at Waterloo: http://links.uwaterloo.ca/fractals.home.html
• A "Hitchhiker's Guide to Fractal Compression" For Beginners:
  http://links.uwaterloo.ca/pub/Fractals/Papers/Waterloo/vr95.pdf
• Encode using fractals
• Search for regions that, with a simple transformation, can be similar to each other
• Compression ratio 20-80
Searching in compressed texts
• When one compresses files, can we still use fast search techniques without decompressing first?
• Sometimes, yes - e.g. Udi Manber has developed such a method
• Approximate Matching of Run-Length Compressed Strings, Veli Mäkinen, Gonzalo Navarro, Esko Ukkonen:
  "We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance, achieving O(m'n + n'm) complexity. Furthermore, we extend this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily. This approach also gives an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Then we propose improvements for a greedy algorithm for the LCS, and conjecture that the improved algorithm has O(m'n') expected case complexity. Experimental results are provided to support the conjecture."

MPEG
• Moving Pictures Experts Group: http://www.chiariglione.org/mpeg/
• MPEG Compression: http://www.cs.cf.ac.uk/Dave/Multimedia/node255.html
• The screen is divided into 256 blocks, in which the changes and movements are tracked
• Only differences from the previous frame are encoded
• Compression ratio 50-100
Kolmogorov (or Algorithmic) complexity
• Kolmogorov, Chaitin, ...
• What is the compressed version of the sequence '1234567891011121314151617181920212223242526...'?
  – Every symbol appears almost equally frequently: almost "random" by entropy
  – Yet: for i=1 to n do print i
• Algorithmic complexity (or Kolmogorov complexity) of a string S is the length of the shortest program that reproduces S, often noted K(S)
• Conditional complexity: K(S|T) - reproduce S given T
• Algorithmic information theory is a field of study which attempts to capture the concept of complexity using tools from theoretical computer science. The chief idea is to define the complexity (or Kolmogorov complexity) of a string as the length of the shortest program which, when run without any input, outputs that string. Strings that can be produced by short programs are considered to be not very complex. This notion is surprisingly deep and can be used to state and prove impossibility results akin to Gödel's incompleteness theorem and Turing's halting problem.
• The field was developed by Andrey Kolmogorov and Gregory Chaitin starting in the late 1960s.
  http://en.wikipedia.org/wiki/Algorithmic_information_theory
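The point is easy to demonstrate (Python sketch, my own setup; zlib stands in for a general-purpose compressor): the digit sequence looks nearly random by order-0 entropy, yet a one-line program reproduces it, so K(S) is tiny:

```python
import math, zlib
from collections import Counter

# The sequence '123456789101112...' for i = 1..10000
s = "".join(str(i) for i in range(1, 10001))

# Order-0 entropy: the digit distribution is almost uniform
n = len(s)
H = -sum(k / n * math.log2(k / n) for k in Counter(s).values())
print(round(H, 3), "bits/char; maximum for digits is log2(10) =",
      round(math.log2(10), 3))

# But K(s) is bounded by the size of the short program that prints it
program = 's = "".join(str(i) for i in range(1, 10001))'
print(len(s), "chars;", len(zlib.compress(s.encode(), 9)),
      "bytes with zlib;", len(program), "bytes of generating program")
```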
Kolmogorov complexity: the size of the circle, in bits...
• G J Chaitin: http://www.cs.umaine.edu/~chaitin/

Distance using K
• d(S,T) = ( K(S|T) + K(T|S) ) / ( K(S) + K(T) )
• We cannot calculate K, but we can approximate it, e.g. by compression (LZ, BWT, etc.):
  d(S,T) = ( C(ST) + C(TS) ) / ( C(S) + C(T) )
• Use of Kolmogorov Distance Identification of Web Page Authorship, Topic and Domain, David Parry (PPT), OSWIR 2005 (workshop on Open Source Web Information Retrieval)
• Informatsioonikaugus (Information distance), Mart Sõmermaa, Fall 2003 (Data Mining Research seminar)
  http://www.egeen.ee/u/vilo/edu/2003-04/DM_seminar_2003_II/Raport/P06/main.pdf
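A sketch of the compression-based approximation (Python; function names and test strings are mine; zlib plays the role of the compressor C). Similar strings share structure, so their concatenation compresses well and the distance comes out lower:

```python
import zlib

def C(b):
    """Compressed size: our computable stand-in for K."""
    return len(zlib.compress(b, 9))

def kdist(s, t):
    """d(S,T) = (C(ST) + C(TS)) / (C(S) + C(T)), as on the slide.
    Roughly: near 1 for very similar strings, larger for unrelated ones."""
    s, t = s.encode(), t.encode()
    return (C(s + t) + C(t + s)) / (C(s) + C(t))

a = "the quick brown fox jumps over the lazy dog " * 20
b = "the quick brown fox leaps over the lazy cat " * 20
c = "0123456789abcdef" * 56
print(round(kdist(a, b), 3))   # smaller: a and b are similar
print(round(kdist(a, c), 3))   # larger: unrelated strings
```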