Pre-history of string processing !

A collection of strings: documents, books, emails, source code, DNA sequences, ...

Paolo Ferragina Dipartimento di Informatica, Università di Pisa


An XML excerpt (dblp):

<dblp>
  <book>
    <author>Donald E. Knuth</author>
    <title>The TeXbook</title>
    <publisher>Addison-Wesley</publisher>
    <year>1986</year>
  </book>
  <article>
    <author>Donald E. Knuth</author>
    <author>Ronald W. Moore</author>
    <title>An Analysis of Alpha-Beta Pruning</title>
    <pages>293-326</pages>
    <year>1975</year>
    <journal>Artificial Intelligence</journal>
  </article>
  ...
</dblp>

The Query-Log graph (Yahoo! QueryLog dataset, 2005). Nodes are queries (e.g. "Dept CS pisa") and URLs (e.g. www.di.unipi.it/index.html); edges carry #clicks, time, country, ...; #nodes ≈ 50 Mil, #links ≈ 70 Mil.
- Dictionary of URLs: 24 Mil strings, 56.3 avg chars, 1.6 GB
- Dictionary of terms: 44 Mil strings, 7.3 avg chars, 307 MB
- Dictionary of infos: 2.6 GB
- Trie over the URLs: #leaves ≥ 7 Mil (for 75 MB), #internal nodes ≥ 4 Mil (for 25 MB), depth ≤ 7, size ≈ 100 MB

In all cases...
- Some structure: a relation among the items; trees, (hyper-)graphs, ...
- Some data: (meta-)information about the items; labels on nodes and/or edges
- Large space (I/O, cache, compression, ...)
- Various operations to be supported:
  - Given a node u: retrieve its label, Fw(u), Bw(u), ...
  - Given an edge (i,j): check its existence, retrieve its label, ...
  - Given a string p: search for all nodes/edges whose label includes p; search for adjacent nodes whose label equals p
In short: an Id <-> String mapping, plus an Index.

Virtually enlarge M [Zobel et al, '07]. Do you use (z)grep? [de Moura et al, '00]. Compare gzip vs. HuffWord on ≈1 GB of data:
- grep takes 29 secs (scan the uncompressed data)
- zgrep takes 33 secs (gunzip the data | grep)
- cgrep takes 8 secs (scan directly the compressed data)

Seven years ago... [now in J. ACM '05]: "Opportunistic Data Structures with Applications", P. Ferragina, G. Manzini. In our lectures we are interested not only in the storage issue: Data Compression + Random Access + Search Data Structures. Nowadays several papers, theory & experiments (see Navarro-Mäkinen's survey).

Our starting point was... Muthu's challenge!! Ken Church (AT&T, 1995) said: "If I compress the Suffix Array with Gzip I do not save anything. But the underlying text is compressible... What's going on?" Practitioners use many "squeezing heuristics" that compress data and still support fast access to them. Can we "automate" and "guarantee" the process?

In these lectures... a path consisting of five steps:
1) The problem
2) What practitioners do and why they did not use "theory"
3) What theoreticians then did
4) Experiments
5) The moral ;-)
At the end, hopefully, you'll bring home:
- Algorithmic tools to compress & index data
- Data-aware measures to evaluate them
- Algorithmic reductions: theorists and practitioners love them!
- No ultimate recipes!!

A basic problem. Given a dictionary D of strings, of variable length, compress them in a way that we can efficiently support id <-> string:
- Hash table: need D to avoid false positives and for id -> string
- (Minimal) ordered perfect hashing: need D for id -> string, or check
- (Compacted) trie: need D for edge match
Yet the dictionary D needs to be stored: its space is not negligible, and causes I/O- or cache-misses in retrieval.

String Storage

Front-coding. Practitioners use the following approach:
- Sort the dictionary strings
- Strip off the shared prefixes [e.g. after host reversal?]
- Introduce some bucketing, to ensure fast random access
On the uk-2002 crawl (≈250 MB) front-coding achieves 30-35%, while gzip achieves ≈12%.

Locality-preserving FC [Bender et al, 2006]. Drop bucketing, yet keep optimal string decompression:
- Compress D up to (1+ε) FC(D) bits
- Decompress any string S in O(1 + |S|/ε) time.
So: do we need bucketing? No, only experimental tuning of the constant c. Be back on this later on!

The front-coding example (uk-2002 crawl, ≈250 MB; FC ≈ 30-35%, gzip ≈ 12%):

0  http://checkmate.com/All_Natural/
33 Applied.html          (http://checkmate.com/All_Natural/Applied.html)
34 roma.html             (http://checkmate.com/All_Natural/Aroma.html)
38 1.html                (http://checkmate.com/All_Natural/Aroma1.html)
38 tic_Art.html          (http://checkmate.com/All_Natural/Aromatic_Art.html)
34 yate.html             (http://checkmate.com/All_Natural/Ayate.html)
35 er_Soap.html          (http://checkmate.com/All_Natural/Ayer_Soap.html)
35 urvedic_Soap.html     (http://checkmate.com/All_Natural/Ayurvedic_Soap.html)
33 Bath_Salt_Bulk.html   (http://checkmate.com/All_Natural/Bath_Salt_Bulk.html)
42 s.html                (http://checkmate.com/All_Natural/Bath_Salts.html)
25 Essence_Oils.html     (http://checkmate.com/All/Essence_Oils.html)
25 Mineral_Bath_Crystals.html
38 Salt.html             (http://checkmate.com/All/Mineral_Bath_Salt.html)
33 Cream.html            (http://checkmate.com/All/Mineral_Cream.html)
0  http://checkmate.com/All/Natural/Washcloth.html
...

Locality-preserving FC [Bender et al, 2006]: a simple incremental encoding algorithm [where ε = 2/(c-2)].
I.  Assume to have already built FC(S_1, ..., S_{i-1}).
II. Given S_i, we proceed backward for X = c|S_i| chars in the encoding built so far.
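The plain front-coding just illustrated can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code; function names are mine:

```python
def fc_encode(sorted_strings):
    """Front-coding: replace each string's prefix shared with its
    predecessor by its length, keeping only the new suffix.
    The input must be sorted."""
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def fc_decode(encoded):
    """Rebuild the strings; decoding string i may require walking back to
    its predecessors, which is why plain FC needs bucketing (or LPFC)
    for fast random access."""
    strings, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        strings.append(prev)
    return strings
```

On the sorted URL dictionary of the slides, each output pair is exactly a row of the example above, e.g. (33, "Applied.html").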
(Two cases: if S_i is fully decodable within those X chars, then we add FC(S_i); else we add S_i verbatim, "copied".) Decoding is unaffected!!

Space occupancy (sketch):
- FC-encoded strings are OK!
- Partition the copied strings into crowded and uncrowded ones.
- Let S_i be crowded, and Z the copied string preceding it: |Z| ≥ X/2 ≥ (c/2)|S_i|, hence |S_i| ≤ (2/c)|Z|. So the lengths along a chain of crowded strings decrease geometrically!
- Consider chains of copied strings: |uncrowd crowd*| ≤ (c/(c-2)) |uncrowd|, and the chain cost can be charged to the X/2 = (c/2)|uncrowd| FC-chars preceding it.

Random access to LPFC. Call C the LPFC-string, n = #strings in C, m = total length of C. How do we random-access the compressed C?
- Get(i): return the position of the i-th string in C (id -> string);
- Previous(j), Next(j): return the position of the string preceding or following character C[j].
Classical answers ;-) Store pointers to the positions of the copied strings in C:
- space is O(n log m) bits, access time is O(1) + O(|S|/ε);
with some form of bucketing (one pointer every b strings), a trade-off:
- space is O((n/b) log m) bits, access time is O(b) + O(|S|/ε).
Can we avoid the trade-off?

Re-phrasing our problem. C is the LPFC-string, n = #strings in C, m = total length of C. Support Get/Previous/Next on C by marking the string beginnings in a binary array B:

C = 0 http://checkmate.com/All_Natural/ 33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html ...
B = 1 000000000000000000000000000 10 0000000000 10 00000000 10 00000 10 000000000 ...

Then:
- Get(i) = Select_1(i)
- Previous(j) = Select_1(Rank_1(j) - 1)
- Next(j) = Select_1(Rank_1(j) + 1)

A basic problem! [Jacobson, '89]. Given a binary string B, with m = |B| and n = #1s:
- Rank_b(i) = number of b's in B[1, i]
- Select_b(i) = position of the i-th b in B
Example: on B = 0010100101010101111111000001101010101010111000..., Select_1(3) = 8.
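A runnable sketch of the two operations just defined, with the two-level rank directory described in the next slides (absolute counts every Z bits, bucket-relative counts every z bits); the constants here are tiny for illustration, whereas the theory sets Z = poly(log m) and z = (1/2) log m, and select is a plain scan standing in for the constant-time structure. Class and parameter names are mine. The Elias γ/δ coders from the int-codes slide are included as well:

```python
class RankSelect:
    """Two-level rank directory in the style of [Jacobson, '89]."""
    def __init__(self, bits, Z=16, z=4):
        assert Z % z == 0
        self.bits, self.Z, self.z = bits, Z, z
        self.super, self.block = [], []   # absolute / bucket-relative counts
        acc = rel = 0
        for i, b in enumerate(bits):
            if i % Z == 0:
                self.super.append(acc)
                rel = 0
            if i % z == 0:
                self.block.append(rel)
            acc += b
            rel += b
        self.total = acc

    def rank1(self, i):
        """#1s in bits[0:i): superblock count + block count + short scan."""
        if i >= len(self.bits):
            return self.total
        sb, bl = i // self.Z, i // self.z
        return self.super[sb] + self.block[bl] + sum(self.bits[bl * self.z:i])

    def select1(self, k):
        """0-based position of the k-th 1 (k >= 1); simple scan."""
        for i, b in enumerate(self.bits):
            k -= b
            if k == 0:
                return i
        raise ValueError("fewer than k ones")

def gamma(x):
    """Elias gamma(x), x >= 1: |bin(x)|-1 zeros then bin(x);
    2*floor(log2 x) + 1 bits, e.g. gamma(6) = 00 110."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def delta(x):
    """Elias delta(x): gamma(|bin(x)|) then bin(x) without its leading 1;
    about log x + 2 loglog x + O(1) bits."""
    b = bin(x)[2:]
    return gamma(len(b)) + b[1:]
```

Consistently with the slides, γ is shorter for tiny values while |γ(x)| > |δ(x)| once x > 31.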
Considering b = 1 is enough:
- Rank_0(i) = i - Rank_1(i), so supporting Rank_1 suffices for Rank.
- Select_0 does not reduce so directly, but any Select can be supported via Rank and Select over two auxiliary binary arrays B_0, B_1, with |B_0| = m-n, |B_1| = n (so |B_0| + |B_1| = m) and #1s in B_0 and B_1 ≤ min{m-n, n} [see Moffat '07].

Uniquely-decodable int-codes (useful for gaps and lengths):
- γ-code(x): ⌊log₂ x⌋ zeros followed by the binary representation of x, e.g. γ(6) = 00 110; takes 2⌊log₂ x⌋ + 1 bits.
- δ-code(x): γ(|bin(x)|) followed by bin(x) without its leading 1, e.g. δ(33) = γ(6)·00001; takes log x + 2 loglog x + O(1) bits.
- |γ(x)| > |δ(x)| for x > 31; no recursion in practice, and Huffman on the lengths works well.
- Look at them as pointerless data structures.

The Bit-Vector Index (Rank), with m = |B|, n = #1s. Goal: B is read-only, and the additional index takes o(m) bits.
- Store the absolute Rank_1 at the start of each block of Z bits.
- Store the bucket-relative Rank_1 every z bits.
- Setting Z = poly(log m) and z = (1/2) log m: space is |B| + (m/Z) log m + (m/z) log Z + o(m) = m + O(m loglog m / log m) bits, and Rank time is O(1).
- The o(m) term is crucial in practice; lower bounds can be inherited [Patrascu-Thorup, '06].
- Given an integer set, take B as its characteristic vector: then pred(x) = Select_1(Rank_1(x-1)).

The Bit-Vector Index (Select). Consider the stretch of B holding k consecutive 1s, of size r:
- Sparse case: if r > k², store explicitly the positions of the k 1s.
- Dense case: k ≤ r ≤ k², recurse... One level is enough!! We
still need a table of size o(m) to solve the queries inside the small chunks.

Setting k ≈ polylog m:
- Space is m + o(m) bits, and B is not touched!
- Select time is O(1).

Summing up: there exists a Bit-Vector Index taking |B| + o(|B|) bits and constant time for Rank/Select, with B read-only. Hence LPFC + Rank/Select takes [1 + o(1)] extra bits per FC-char.

FC versus Gzip. Gzip emits copies like <6,3,c>: its dictionary consists of all substrings starting before the current position, within a window. Two features:
- Repetitiveness is deployed at any position.
- The window is used for (practical) computational reasons.
On the previous dataset of URLs (uk-2002): FC achieves >30%, gzip achieves 12%, PPM achieves 7%. But these give no random access to substrings. May we combine the best of the two worlds?

The empirical entropy H_0. For a string S (e.g. S = aacaacbababcbac):

H_0(S) = - Σ_i (m_i/m) log₂ (m_i/m),   m_i = frequency in S of the i-th symbol.

- m·H_0(S) is the best you can hope for from a memoryless compressor.
- We know that Huffman or Arithmetic coding come close to this bound.
- Yet H_0 cannot distinguish between A^x B^y and a random string with x A's and y B's. We get better compression using a codeword that depends on the k symbols preceding the one to be compressed (its context).
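The H_0 definition above (and the H_k one that follows) can be computed directly. A small sketch of mine, using the convention that the last k symbols contribute no context:

```python
from collections import Counter
from math import log2

def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    m = len(s)
    return -sum((c / m) * log2(c / m) for c in Counter(s).values())

def hk(s, k):
    """k-th order empirical entropy: the weighted average of H0 over the
    strings S[w] of symbols following each length-k context w."""
    followers = {}
    for i in range(len(s) - k):
        followers.setdefault(s[i:i + k], []).append(s[i + k])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)
```

For S = "mississippi" the followers of the context "is" are exactly "ss", as in the slides, and H_1(S) is already well below H_0(S).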
The empirical entropy H_k, and entropy-bounded string storage [Ferragina-Venturini, '07]. Goal: given a string S[1,m] drawn from an alphabet of size σ,
- encode S within m·H_k(S) + o(m log σ) bits, simultaneously for all k ≤ α log_σ m;
- extract any substring of L symbols in optimal O(1 + L/log_σ m) time.
This encoding fully replaces S in the RAM model!

Here H_k(S) means: compress every symbol up to the H_0 of its k-context, i.e. use Huffman or Arithmetic per context, compressing each conditional string S[ω] up to its H_0:

H_k(S) = (1/|S|) Σ_{|ω|=k} |S[ω]| · H_0(S[ω])

where S[ω] is the string of symbols that follow the occurrences of ω in S (entropy-wise, "follow" ≈ "precede"). Example: for S = mississippi, S["is"] = ss.

Two corollaries:
- Compressed Rank/Select data structures: B was read-only in the simplest R/S scheme, so we can store it in |B|·H_k(B) + o(|B|) bits and keep R/S in O(1) time.
- Compressed front-coding + random access. Promising: FC+gzip saves 16% over gzip on uk-2002.
How much is this "operational"?

The storage scheme:
- Split S into blocks of b = (1/2) log_σ m symbols; #blocks = m/b = O(m / log_σ m).
- #distinct blocks = O(σ^b) = O(m^(1/2)): sort them by frequency and assign to each the shortest available binary codeword (ε, 0, 1, 00, 01, ...), kept in a table T.
- V = the concatenation of the codewords of S's blocks; B = a bitvector marking where each codeword begins.
Decoding is easy:
- R/S on B determine the codeword's position in V;
- retrieve cw from V;
- the decoded block is T[2^len(cw) + cw].

Bounding |V| in terms of H_k(S). Introduce the statistical encoder E_k(S):
- compute F(i) = frequency of S[i] within its k-th order context S[i-k, i-1];
- encode every block B[1,b] of S by 1) writing B[1,k] explicitly, and 2) encoding B[k+1,b] by Arithmetic with the k-th order frequencies.
Some algebra: |E_k(S)| ≤ (m/b)·(k log σ) + m·H_k(S) + 2(m/b) bits = m·H_k(S) + o(m log σ). Moreover, E_k assigns to each distinct block a unique codeword, and these codewords are a subset of {0,1}*, while our codewords are the shortest of {0,1}*. Hence |V| ≤ |E_k(S)| ≤ m·H_k(S) + o(m log σ) bits. Overall, T+V+B take
|V| + o(|S| log σ) bits, by the golden rule of data compression: |B| ≤ |S| log σ with #1s in B = #blocks = o(|S|), and |V| ≤ |E_k(S)| ≤ |S|·H_k(S) + o(|S| log σ) bits.

Part #2: Take-home Msg
- Given a binary string B, we can store it in |B|·H_k(B) + o(|B|) bits, support Rank & Select in constant time, and access any substring of B in optimal time: a pointerless data structure.
- Given a string S on Σ, we can store it in |S|·H_k(S) + o(|S| log |Σ|) bits, for k ≤ α log_{|Σ|} |S|, and access any substring of S in optimal time: always better than keeping S verbatim in RAM.
- Experimentally: ≈10⁷ Select/sec, ≈10⁶ Rank/sec.

(Compressed) String Indexing

What do we mean by "indexing"?
- Word-based indexes, where a notion of "word" must be devised: inverted files, signature files, bitmaps.
- Full-text indexes, with no constraint on text and queries: suffix array, suffix tree, String B-tree, ...

The Problem. Given a text T, we wish to devise a (compressed) representation of T that efficiently supports the following operations:
- Count(P): how many times does string P occur in T as a substring?
- Locate(P): list the positions of the occurrences of P in T.
- Visualize(i,j): print T[i,j].
Known solutions are either time-efficient but not compressed (suffix arrays, suffix trees, ... and many others) or space-efficient but not time-efficient (zgrep: uncompress and then grep; cgrep, ngrep: pattern matching over compressed text).

The Suffix Array.
Prop 1. All suffixes of T having prefix P are contiguous in lexicographic order.
Prop 2.
Starting position is the lexicographic position of P among the suffixes.

The suffix array of T = mississippi# (storing the suffixes SUF(T) explicitly would take Θ(N²) space; SA stores only their starting positions):

SA  suffix
12  #
11  i#
 8  ippi#
 5  issippi#
 2  ississippi#
 1  mississippi#
10  pi#
 9  ppi#
 7  sippi#
 4  sissippi#
 6  ssippi#
 3  ssissippi#

Space: SA takes Θ(N log₂ N) bits, plus the text T of N chars; in practice, a total of 5N bytes.

Searching a pattern, e.g. P = si: indirected binary search on SA, with O(p) time per suffix comparison and 2 accesses per step (one to SA, one to T). If P is larger than the middle suffix go right, if smaller go left. Overall, O(log₂ N) binary-search steps, each taking O(p) char comparisons, i.e. O(p log₂ N) time [Manber-Myers, '90]; improvements exist [Cole et al, '06].

Listing the occurrences: binary-search for P# and P$ (where # < Σ < $) to delimit the interval of suffixes prefixed by P; listing then takes O(occ) time. For P = si, occ = 2 (positions 4 and 7).

Text mining. Let Lcp[1, N-1] store the LCP length between suffixes adjacent in SA.
- Does there exist a repeated substring of length ≥ L? Search for an entry Lcp[i] ≥ L.
- Does there exist a substring of length ≥ L occurring ≥ C times? Search for a window Lcp[i, i+C-1] whose entries are all ≥ L.

What about space occupancy? SA + T take Θ(N log₂ N) bits. Do we need such an amount?
1) #permutations of {1, 2, ..., N} = N!
2) but SA cannot be an arbitrary permutation of {1, ..., N}.
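The binary search over SA and the Lcp-based mining query above can be sketched as follows; the constructions are deliberately naive (the SA build is O(n² log n), and Kasai's linear-time algorithm stands in for whichever LCP construction the course assumes):

```python
def suffix_array(t):
    """Naive construction by sorting suffix start positions."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def sa_range(t, sa, p):
    """[lo, hi) of SA rows whose suffix starts with p: two binary
    searches, each comparison costing O(|p|) chars as in the slides."""
    def bsearch(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = t[sa[mid]:sa[mid] + len(p)]
            if pref < p or (strict and pref == p):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bsearch(False), bsearch(True)

def lcp_array(t, sa):
    """Kasai's O(n) algorithm: lcp[r] = LCP(suffix sa[r], suffix sa[r-1])."""
    n = len(t)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
        else:
            h = 0
    return lcp
```

A repeated substring of length ≥ L exists iff max(lcp) ≥ L; on mississippi# the maximum is 4, witnessed by the repeat "issi".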
3) #SA ≤ #texts = |Σ|^N.
So the lower bounds are Ω(N log |Σ|) bits from #texts, and Ω(N·H_k(T)) bits from compression: both very far from Θ(N log N).

The Burrows-Wheeler Transform (1994): an elegant mathematical tool. Take the text T = mississippi#, form all its cyclic rotations, and sort the rows. F is the first column, L the last:

F              L
#mississipp    i
i#mississip    p
ippi#missis    s
issippi#mis    s
ississippi#    m
mississippi    #
pi#mississi    p
ppi#mississ    i
sippi#missi    s
sissippi#mi    s
ssippi#miss    i
ssissippi#m    i

A useful tool: the L -> F mapping. How do we map L's chars onto F's chars? We need to distinguish equal chars. Take two equal chars of L and rotate their rows rightward by one position: they become rows starting with that char, and they keep the same relative order. Hence the i-th occurrence of a char c in L corresponds to the i-th occurrence of c in F.

The BWT is invertible, thanks to two key properties:
1. the LF-array maps L's chars to F's chars;
2. L[i] precedes F[i] in T.

How to compute the BWT? Given SA, we have L[i] = T[SA[i] - 1] (e.g. L[3] = T[SA[3] - 1] = T[7]). Building the rotation matrix explicitly is elegant but inefficient: O(n³) time in the worst case, or O(n²) cache misses / I/O faults; the role of # is to make sorting the rotations equivalent to sorting the suffixes. Reconstruct T backward:
T = .... i p p i #

InvertBWT(L)
  Compute LF[0, n-1];
  r = 0; i = n;
  while (i > 0) { T[i] = L[r]; r = LF[r]; i--; }

Compressing L seems promising... Key observation: L is locally homogeneous, hence highly compressible.

An encoding example. T = mississippimississippimississippi# (the # at position 34):
L   = ipppssssssmmmii#pppiiissssssiiiiii
MTF = 020030000030030200300300000100000
Shifting the symbols to make room for run-length codes gives 030040000040040300400400000200000, on which RLE0 encodes the 0-runs with Wheeler's code (run lengths in binary, e.g. Bin(6) = 110).

Algorithm bzip:
- Move-to-Front coding of L;
- run-length coding, over an alphabet of |Σ|+1 symbols;
- a statistical coder (Arithmetic/Huffman).
bzip vs. gzip: 20% vs. 33% compression ratio, but bzip is slower in (de)compression!

Be back on indexing: BWT vs. SA. L includes SA and T: can we search within L?
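The pieces of the pipeline on these slides, sketched in runnable form: BWT via the suffix array (L[i] = T[SA[i]-1]), inversion via the LF-mapping, and the MTF step (RLE and the final statistical coder are omitted). A small illustration of mine under the slides' assumption that T ends with a unique, smallest terminator '#':

```python
def bwt(t):
    """BWT from the suffix array: L[i] = T[SA[i]-1] (index -1 wraps)."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in sa)

def ibwt(L):
    """Invert via LF: equal chars keep their relative order in L and F."""
    counts, rank = {}, []
    for c in L:                      # rank of each char among its equals
        rank.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1
    first, tot = {}, 0               # first row of each char in F
    for c in sorted(counts):
        first[c], tot = tot, tot + counts[c]
    lf = [first[c] + r for c, r in zip(L, rank)]
    out, r = [], 0                   # row 0 starts with the unique '#'
    for _ in range(len(L)):
        out.append(L[r])             # L[r] precedes F[r] in T
        r = lf[r]
    s = "".join(reversed(out))       # T rotated to start at '#'
    return s[1:] + s[0]              # move the terminator back to the end

def mtf(s, alphabet):
    """Move-to-front: runs of equal symbols in L become runs of 0s."""
    table, out = list(alphabet), []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out
```

On T = mississippi# this reproduces the L column of the matrix above, and inverting it gives T back.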
Why it works... Consider the BWT matrix of T: each piece of L corresponds to a context, i.e. to the block of rows sharing the same prefix. Compressing each piece up to its H_0 achieves H_k(T); MTF + RLE avoids the need to explicitly partition BWT into pieces.

Implement the LF-mapping [Ferragina-Manzini]. Keep the array of F-starts: # starts at row 1, i at row 2, m at row 6, p at row 7, s at row 9. To map, say, L[9] (an s) to its row in F: Rank_s(L, 9) = 3, so it is the 3rd s in L, hence the 3rd s in F, i.e. F[9 + 3 - 1] = F[11]. What we need is a generalized Rank oracle over strings: Rank_c(L, i) = number of c's in L[1, i].

Rank and Select on strings:
- If Σ is small (i.e. constant), build one binary Rank data structure per symbol of Σ: Rank takes O(1) time in entropy-bounded space.
- If Σ is large (words?), we need a smarter solution: the Wavelet Tree data structure [Grossi-Gupta-Vitter, '03].
Another step of reduction: reduce Rank&Select over arbitrary strings
to Rank&Select over binary strings: binary R/S are the key tools (tons of papers).

The FM-index [Ferragina-Manzini, FOCS '00; J. ACM '05]: substring search in T, counting the pattern occurrences by backward search on L. For P = si:
- First step: [fr, lr] = the range of rows prefixed by the last char, "i".
- Inductive step: given [fr, lr] for P[j+1, p], take c = P[j]; find the first and the last c in L[fr, lr] (two Rank queries); LF-map them; the new [fr, lr] is the range of rows prefixed by P[j, p].
At the end, #occurrences = lr - fr + 1 (here occ = 2). Rank is enough.

The result (on small alphabets):
- Count(P) takes O(p) time;
- Locate(P) takes O(occ · log^{1+ε} N) time;
- Visualize(i, i+L) takes O(L + log^{1+ε} N) time;
- space occupancy is O(N·H_k(T)) + o(N) bits, i.e. o(N) if T is compressible. The index does not depend on k, and the bound holds for all k simultaneously.
New concept: the FM-index is an opportunistic data structure. The survey by Navarro-Mäkinen describes many compressed-index variants.

Is this a technological breakthrough? The question then was: how to turn these challenging and mature theoretical achievements into a technological breakthrough?
- Engineered implementations [December 2003; January 2005]
- A flexible API to allow reuse and development
- A framework for extensive testing
In a joint effort with Navarro's group we engineered the best known indexes: FMI, CSA, SSA, AF-FMI, RL-FM, LZ, .... All implemented indexes follow a carefully designed API which offers build, count, locate, extract, ...; some tools have been designed to automatically plan, execute and check the index performance over the text collections; a group of variegate texts is available, with sizes ranging from 50 MB to 2 GB (>400 downloads, >50 registered users).

Some figures, over hundreds of MBs of data:
- Count(P) takes 5 µsecs/char, in ≈42% space: about 10 times slower (than plain suffix arrays);
- Extract takes 20 µsecs/char;
- Locate(P) takes 50 µsecs/occ, in +10% space: about 50 times slower.
Trade-offs are possible!!! We need your applications...

Part #5: Take-home msg. This is a powerful paradigm to design compressed indexes:
1. Transform the input into a few arrays.
2. Index (+ compress) the arrays to support rank/select ops.
It connects indexing with compression (and with I/Os and the query distribution/flow), and extends to other data types: labeled trees, 2D.

Where we are... A data structure is "opportunistic" if it indexes a text T within compressed space and supports three kinds of queries:
- Count(P): count the occurrences of P in T;
- Locate(P): list the occurrences of P in T;
- Display(i,j): print T[i,j].
Key tools: Burrows-Wheeler Transform + Suffix Array. Key idea: reduce P's queries to a few rank/select queries on BWT(T). Space complexity: a function of the k-th order empirical entropy of T.

(Compressed) Tree Indexing

Another data format: XML [W3C '98].

<dblp>
  <book>
    <author>Donald E. Knuth</author>
    <title>The TeXbook</title>
    <publisher>Addison-Wesley</publisher>
    <year>1986</year>
  </book>
  <article>
    <author>Donald E. Knuth</author>
    <author>Ronald W. Moore</author>
    <title>An Analysis of Alpha-Beta Pruning</title>
    <pages>293-326</pages>
    <year>1975</year>
    <volume>6</volume>
    <journal>Artificial Intelligence</journal>
  </article>
  ...
</dblp>

A tree interpretation: elements become internal nodes, text content becomes leaves.
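Looking back at the FM-index slides, the backward-search Count can be written in a few lines. A minimal sketch of mine: rank here is a plain scan over L, where the real index uses per-symbol bitvectors or a wavelet tree to make it constant time:

```python
from collections import Counter

def fm_count(L, p):
    """Count the occurrences of p in the text whose BWT is L."""
    counts = Counter(L)
    C, tot = {}, 0
    for c in sorted(counts):   # C[c] = #chars smaller than c = F-start of c
        C[c] = tot
        tot += counts[c]

    def rank(c, i):            # occurrences of c in L[0:i]
        return L[:i].count(c)

    fr, lr = 0, len(L)         # [fr, lr) = all rows of the BWT matrix
    for c in reversed(p):      # backward search, one char of p per step
        if c not in C:
            return 0
        fr = C[c] + rank(c, fr)
        lr = C[c] + rank(c, lr)
        if fr >= lr:
            return 0
    return lr - fr
```

Only rank queries on L are used, which is exactly the point of the slides: Count needs neither T nor SA.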
XML document exploration ≡ tree navigation; XML document search ≡ labeled subpath searches (a subset of XPath [W3C]).

A key concern is verbosity. The problem, in practice: devise a (compressed) representation of the tree T that efficiently supports
- navigational operations: parent(u), child(u, i), child(u, i, c);
- subpath searches over a sequence of k labels;
- content searches: subpath search + substring.
Today:
- XML-aware compressors (like XMill, XmlPpm, ScmPpm, ...) need the whole decompression for navigation and search;
- XML-queriable compressors (like XPress, XGrind, XQzip, ...) achieve poor compression and need a scan of the whole (compressed) file;
- XML-native search engines need this tool as a core block for query optimization and (compressed) storage of information [IEEE Computer, April 2005]. Theory?

A transform for labeled trees [Ferragina et al, 2005]: the XBW-transform plays on trees the role that the BW-transform plays on strings. The XBW-transform linearizes T into 2 arrays such that:
- the compression of T reduces to the compression of these two arrays (e.g. gzip, bzip2, ppm, ...);
- the indexing of T reduces to implementing generalized rank/select over these two arrays. Rank&Select are again crucial.

Step 1. Visit the tree in pre-order.
D B C b D B C For each node, write down its label Permutation a B C and the labels on its upward path of tree nodes Rank&Select are again crucial Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa upward labeled paths</p><p>The XBW-Transform The XBW-Transform</p><p>C S Sα Sπ C last Sα Sπ 1 C ε C ε 0 b A C b A C B A B 0 a A C B A B a A C 1 D A C D A C 0 D B C D B C 1 c B C D c b a D D a c B C 0 D B C D c b a D D a D B C 1 a B C a B C 0 B C XBW can be B C c a c b 0 A C built and inverted c a c b A C 1 B C in optimal O(t) time B C 1 c D A C c D A C 0 c D B C StepKey fact 3. Step 2. c D B C 1 a D B C a D B C NodesAdd a binarycorrespond array toS last itemsmarking in <S the ,S > 1 b D B C Stably sort according to S last α π b D B C rows corresponding to last children XBW Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa XBW takes optimal t log | Σ| + t bits upward labeled paths</p><p>18 XBW is highly compressible XBzip œ a simple XML compressor</p><p>Sπ Slast Sα <a href="/tags/Donald_Knuth/" rel="tag">Donald Knuth</a> /author/article/dblp <a href="/tags/Kurt_Mehlhorn/" rel="tag">Kurt Mehlhorn</a> /author/article/dblp Kurt Mehlhorn /author/article/dblp ...... Kurt Mehlhorn /author/book/dblp John Kleinberg /author/book/dblp Kurt Mehlhorn /author/book/dblp ...... Theoretically,Journal we of couldthe ACM extend/journal/article/dblp the definition of H k to labeled trees by takingAlgorithmica as k-context of a/journal/article/dblp node its leading path of k-length Journal of the ACM /journal/article/dblp ...... (related to Markov random fields over trees) 120-128 /pages/article/dblp 137-157 /pages/article/dblp ...... Tags, Attributes and = ACM Press /publisher/journal/dblp ACM Press /publisher/journal/dblp IEEE Press /publisher/journal/dblp ...... XBW is compressible : 1977 /year/book/dblp 2000 /year/journal/dblp XBW is compressible :  Compress S with PPM ... α ...  Sα is locally homogeneous Pcdata  Slast is small...  
Paolo Ferragina, Università di Pisa XBW Slast has some structure and is small Paolo Ferragina, Università di Pisa</p><p>XBzip = XBW + PPM [Ferragina et al, 2006] Some structural properties</p><p>C 25% Slast Sα Sπ 20% 1 C ε B A B 0 b A C 15% 0 a A C 1 D A C 10% 0 D B C D c b a D D a 1 c B C 5% 0 D B C 1 a B C 0% c a c b 0 B C DBLP Pathways News 0 A C 1 B C Two useful properties: 1 c D A C gzip bzip2 ppmdi xmill + ppmdi scmppm XBzip • Children are contiguous and delimited by 1s 0 c D B C 1 a D B C • Children reflect the order of their parents String compressors are not so bad: within 5% 1 b D B C</p><p>XBW Paolo Ferragina, Università di Pisa Deploy huge literature on string compression Paolo Ferragina, Università di Pisa</p><p>19 XBW is navigational Subpath search in XBW C A 2 A 2 C ΠΠΠ C S Sπ B 5 [i+1] last Sα B 5 S Sα Sπ C 9 last 1 C ε C 9 D 12 Π 1 C ε = B D 0 b A C D 12 B A B 0 b A C B A B 0 a A C 0 a A C 1 D A C Bfr 0 B C 1 D A C Select in S D last Rows 0 D B C the 2° item 1 1 c B C D c b a D D a from here... D c b a D D a whose 1 c B C 0 D B C Sπ starts 0 D B C lr 1 a B C with ”B‘ 1 a B C C 0 B C Jump c a c b 0 B C c a c b 0 A C to their Get_children children 0 A C 1 B C 1 D A C 1 B C Rank(B,S α)=2 Inductive step: c 1 c D A C 0 c D B C XBW is navigational:  Pick the next char in [i+1] i.e. 
”D‘ 0 c D B C Π , 1 a D B C • Rank-Select data structures on S and S  last α 1 a D B C Search for the first and last ”D‘ in S α[fr,lr] 1 b D B C • The array C of | Σ| integers 1 b D B C  Jump to their children XBW-index</p><p>Paolo Ferragina, Università di Pisa XBW Paolo Ferragina, Università di Pisa</p><p>A 2 : XBW + FM-index [Ferragina et al, 2006] Subpath search in XBW B 5 XBzipIndex C 9 C S ΠΠΠ[i+1] Slast Sα π D 12 Under patenting by 60% ε Π 1 C Pisa + Rutgers = B D 0 b A C 2° D 50% B A B 0 a A C 1 D A C 3° D 40% fr 0 D B C 30% 1 c B C D c b a D D a B C 0 D 20% lr B C 1 a Look at S last 0 B C Jumpto find to the 10% their2° children and 3° c a c b 0 A C 1s after 12 1 B C 0% D A C Inductive step: 1 c DBLP Pathways News fr D B C Rows 0 c whose XBW Pick indexing the next char[ reductionin Π[i+1] ,to i.e. string ”D‘ indexing ] Huffword XPress XQzip XBzipIndex XBzip 1 a D B C Sπ starts  with ”D B‘ Search for the first and last ”D‘ in S α[fr,lr] lr 1 b D B C Rank and Select data structures DBLP : 1.75 bytes/node , Pathways : 0.31 bytes/node , News : 3.91 bytes/node  Jump to their children are enough to navigate and search T XBW-index Two occurrences Upto 36% improvement in compression ratio because of two 1s Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa Query (counting) time ≅ 8 ms, Navigation time ≅ 3 ms</p><p>20 Part #6: Take-home msg...</p><p>Data type I/O issues This is a powerful paradigm to design compressed indexes: Indexing 1. Transform the input in few arrays [Kosaraju, Focs ”89] 2. Index (+ Compress) the arrays to support rank/select ops</p><p>Compressed Strong connection Indexing Paolo Ferragina Dipartimento di Informatica, Università di Pisa</p><p>More experiments Other data types: More ops and Applications 2D, Labeled graphs</p><p>Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa</p><p>What about I/O-issues ? 
The B-tree</p><p>B-tree is ubiquitous in large-scale applications: P[1,p] pattern to search Search(P) – Atomic keys : integers, reals, ... •O((p/B) log n) I/Os 2 O(p/B log B) I/Os – Prefix B-tree : bounded length keys ( ≤ 255 chars) •O(occ/B) I/Os 2 29 13 20 18 3 23 O(log B n) levels</p><p>String B-tree = B-tree + Patricia Trie [Ferragina-Grossi, 95] – Unbounded length keys – I/O-optimal prefix searches Variants for various models</p><p>– Efficient string updates 29 2 26 13 20 25 6 18 3 14 21 23 – Guaranteed optimal page fill ratio</p><p>They are not opportunistic  [Bender et al  FC]</p><p>29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23</p><p>Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa</p><p>21 On small sets... [Ferguson, 92] On larger sets... 0 Patricia Trie A G Scan FC( D) : Space = Θ(#D) words C Two-phase search:  If P[L[x]]=1, then { x++ } else { jump; } G P = GCACGCAC Search(P):  Compare P and S[x]  Max_lcp • Phase 1: tree navigation 3 4 0 4 6 • Phase 2: Compute LCP A C A G  If P[Max_lcp+1] = 0 go left, else go right, until L[] ≤ Max_lcp • Phase 3: tree navigation x = 2 x = 3 x = 4 5 L 0 2 4 5 0 2 A 6 6 G A A G A G max LCP jump P 0 0 0 0 0 1 1 with P 1 1 1 1 1 1 1 1 string checked 1 2 3 4 5 6 7 1 0 1 1 1 0 1 + 0 0 1 1 1 0 1 Space PT ≈ #D A A A G G G G G 1 0 1 1 1 1 G G C C C C A 1 A A G G G G 0 0 1 1 0 A A C C C C C 0 1 G G A A G G Init x = 1 A G G G G G Correct 4 is the candidate position ,Mlcp=3 P’s position A G A G Time is #D + |P| ≤ |FC(D)| A Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa Just S[x] needs to be decoded !!</p><p>Succinct PT  smaller height in practice The String B-tree ...not opportunistic: Ω(#D log |D|) bits + P[1,p] pattern to search Search(P) •O((p/B) log n) I/Os B O(p/B) I/Os •O(occ/B) I/Os PT 29 13 20 18 3 23 O(log n) levels It is dynamic... 
B</p><p>PT PT PT 29 2 26 13 20 25 6 18 3 14 21 23</p><p>PT PT PT PT PT PT 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23</p><p>Lexicographic position of P Paolo Ferragina, Università di Pisa</p><p>22</p> </div> </article> </div> </div> </div> <script type="text/javascript" async crossorigin="anonymous" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-8519364510543070"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.1/jquery.min.js" crossorigin="anonymous" referrerpolicy="no-referrer"></script> <script> var docId = 'ec576a92db911ad9b8129aff94ae13e7'; var endPage = 1; var totalPage = 22; var pfLoading = false; window.addEventListener('scroll', function () { if (pfLoading) return; var $now = $('.article-imgview .pf').eq(endPage - 1); if (document.documentElement.scrollTop + $(window).height() > $now.offset().top) { pfLoading = true; endPage++; if (endPage > totalPage) return; var imgEle = new Image(); var imgsrc = "//data.docslib.org/img/ec576a92db911ad9b8129aff94ae13e7-" + endPage + (endPage > 3 ? ".jpg" : ".webp"); imgEle.src = imgsrc; var $imgLoad = $('<div class="pf" id="pf' + endPage + '"><img src="/loading.gif"></div>'); $('.article-imgview').append($imgLoad); imgEle.addEventListener('load', function () { $imgLoad.find('img').attr('src', imgsrc); pfLoading = false }); if (endPage < 7) { adcall('pf' + endPage); } } }, { passive: true }); </script> <script> var sc_project = 11552861; var sc_invisible = 1; var sc_security = "b956b151"; </script> <script src="https://www.statcounter.com/counter/counter.js" async></script> </html>