Pre-history of string processing !

A collection of strings: documents, books, emails, source code, DNA sequences, ...

Paolo Ferragina Dipartimento di Informatica, Università di Pisa


An XML excerpt (dblp):

<dblp>
  <book>
    <author>Donald E. Knuth</author>
    <title>The TeXbook</title>
    <publisher>Addison-Wesley</publisher>
    <year>1986</year>
  </book>
  <article>
    <author>Donald E. Knuth</author>
    <author>Ronald W. Moore</author>
    <title>An Analysis of Alpha-Beta Pruning</title>
    <pages>293-326</pages>
    <year>1975</year>
    <journal>Artificial Intelligence</journal>
  </article>
  ...
</dblp>

The Query-Log graph (Yahoo! QueryLog dataset, 2005). Nodes are queries (e.g. "Dept CS pisa") and URLs (e.g. www.di.unipi.it/index.html); edges carry #clicks, time, country, ...; #nodes ≈ 50 Mil, #links ≈ 70 Mil.
- Dictionary of URLs: 24 Mil strings, 56.3 avg chars, 1.6 GB
- Dictionary of terms: 44 Mil strings, 7.3 avg chars, 307 MB
- Dictionary of infos: 2.6 GB
- Trie over the URLs: #leaves ≥ 7 Mil (for 75 MB), #internal nodes ≥ 4 Mil (for 25 MB), depth ≤ 7, size ≈ 100 MB

In all cases...
- Some structure: a relation among the items; trees, (hyper-)graphs, ...
- Some data: (meta-)information about the items; labels on nodes and/or edges
- Large space (I/O, cache, compression, ...)
- Various operations to be supported:
  - Given a node u: retrieve its label, Fw(u), Bw(u), ...
  - Given an edge (i,j): check its existence, retrieve its label, ...
  - Given a string p: search for all nodes/edges whose label includes p; search for adjacent nodes whose label equals p
In short: an Id <-> String mapping, plus an Index.

Virtually enlarge M [Zobel et al, '07]. Do you use (z)grep? [de Moura et al, '00]. Compare gzip vs. HuffWord on ≈1 GB of data:
- grep takes 29 secs (scan the uncompressed data)
- zgrep takes 33 secs (gunzip the data | grep)
- cgrep takes 8 secs (scan directly the compressed data)

Seven years ago... [now in J. ACM '05]: "Opportunistic Data Structures with Applications", P. Ferragina, G. Manzini. In our lectures we are interested not only in the storage issue: Data Compression + Random Access + Search Data Structures. Nowadays several papers, theory & experiments (see Navarro-Mäkinen's survey).

Our starting point was... Muthu's challenge!! Ken Church (AT&T, 1995) said: "If I compress the Suffix Array with Gzip I do not save anything. But the underlying text is compressible... What's going on?" Practitioners use many "squeezing heuristics" that compress data and still support fast access to them. Can we "automate" and "guarantee" the process?

In these lectures... a path consisting of five steps:
1) The problem
2) What practitioners do and why they did not use "theory"
3) What theoreticians then did
4) Experiments
5) The moral ;-)
At the end, hopefully, you'll bring home:
- Algorithmic tools to compress & index data
- Data-aware measures to evaluate them
- Algorithmic reductions: theorists and practitioners love them!
- No ultimate recipes!!

A basic problem. Given a dictionary D of strings, of variable length, compress them in a way that we can efficiently support id <-> string:
- Hash table: need D to avoid false positives and for id -> string
- (Minimal) ordered perfect hashing: need D for id -> string, or check
- (Compacted) trie: need D for edge match
Yet the dictionary D needs to be stored: its space is not negligible, and causes I/O- or cache-misses in retrieval.

String Storage

Front-coding. Practitioners use the following approach:
- Sort the dictionary strings
- Strip off the shared prefixes [e.g. after host reversal?]
- Introduce some bucketing, to ensure fast random access
On the uk-2002 crawl (≈250 MB) front-coding achieves 30-35%, while gzip achieves ≈12%.

Locality-preserving FC [Bender et al, 2006]. Drop bucketing, yet keep optimal string decompression:
- Compress D up to (1+ε) FC(D) bits
- Decompress any string S in O(1 + |S|/ε) time.
So: do we need bucketing? No, only experimental tuning of the constant c. Be back on this later on!

The front-coding example (uk-2002 crawl, ≈250 MB; FC ≈ 30-35%, gzip ≈ 12%):

0  http://checkmate.com/All_Natural/
33 Applied.html          (http://checkmate.com/All_Natural/Applied.html)
34 roma.html             (http://checkmate.com/All_Natural/Aroma.html)
38 1.html                (http://checkmate.com/All_Natural/Aroma1.html)
38 tic_Art.html          (http://checkmate.com/All_Natural/Aromatic_Art.html)
34 yate.html             (http://checkmate.com/All_Natural/Ayate.html)
35 er_Soap.html          (http://checkmate.com/All_Natural/Ayer_Soap.html)
35 urvedic_Soap.html     (http://checkmate.com/All_Natural/Ayurvedic_Soap.html)
33 Bath_Salt_Bulk.html   (http://checkmate.com/All_Natural/Bath_Salt_Bulk.html)
42 s.html                (http://checkmate.com/All_Natural/Bath_Salts.html)
25 Essence_Oils.html     (http://checkmate.com/All/Essence_Oils.html)
25 Mineral_Bath_Crystals.html
38 Salt.html             (http://checkmate.com/All/Mineral_Bath_Salt.html)
33 Cream.html            (http://checkmate.com/All/Mineral_Cream.html)
0  http://checkmate.com/All/Natural/Washcloth.html
...

Locality-preserving FC [Bender et al, 2006]: a simple incremental encoding algorithm [where ε = 2/(c-2)].
I.  Assume to have already built FC(S_1, ..., S_{i-1}).
II. Given S_i, we proceed backward for X = c|S_i| chars in the encoding built so far.
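The plain front-coding just illustrated can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code; function names are mine:

```python
def fc_encode(sorted_strings):
    """Front-coding: replace each string's prefix shared with its
    predecessor by its length, keeping only the new suffix.
    The input must be sorted."""
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def fc_decode(encoded):
    """Rebuild the strings; decoding string i may require walking back to
    its predecessors, which is why plain FC needs bucketing (or LPFC)
    for fast random access."""
    strings, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        strings.append(prev)
    return strings
```

On the sorted URL dictionary of the slides, each output pair is exactly a row of the example above, e.g. (33, "Applied.html").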
(Two cases: if S_i is fully decodable within those X chars, then we add FC(S_i); else we add S_i verbatim, "copied".) Decoding is unaffected!!

Space occupancy (sketch):
- FC-encoded strings are OK!
- Partition the copied strings into crowded and uncrowded ones.
- Let S_i be crowded, and Z the copied string preceding it: |Z| ≥ X/2 ≥ (c/2)|S_i|, hence |S_i| ≤ (2/c)|Z|. So the lengths along a chain of crowded strings decrease geometrically!
- Consider chains of copied strings: |uncrowd crowd*| ≤ (c/(c-2)) |uncrowd|, and the chain cost can be charged to the X/2 = (c/2)|uncrowd| FC-chars preceding it.

Random access to LPFC. Call C the LPFC-string, n = #strings in C, m = total length of C. How do we random-access the compressed C?
- Get(i): return the position of the i-th string in C (id -> string);
- Previous(j), Next(j): return the position of the string preceding or following character C[j].
Classical answers ;-) Store pointers to the positions of the copied strings in C:
- space is O(n log m) bits, access time is O(1) + O(|S|/ε);
with some form of bucketing (one pointer every b strings), a trade-off:
- space is O((n/b) log m) bits, access time is O(b) + O(|S|/ε).
Can we avoid the trade-off?

Re-phrasing our problem. C is the LPFC-string, n = #strings in C, m = total length of C. Support Get/Previous/Next on C by marking the string beginnings in a binary array B:

C = 0 http://checkmate.com/All_Natural/ 33 Applied.html 34 roma.html 38 1.html 38 tic_Art.html ...
B = 1 000000000000000000000000000 10 0000000000 10 00000000 10 00000 10 000000000 ...

Then:
- Get(i) = Select_1(i)
- Previous(j) = Select_1(Rank_1(j) - 1)
- Next(j) = Select_1(Rank_1(j) + 1)

A basic problem! [Jacobson, '89]. Given a binary string B, with m = |B| and n = #1s:
- Rank_b(i) = number of b's in B[1, i]
- Select_b(i) = position of the i-th b in B
Example: on B = 0010100101010101111111000001101010101010111000..., Select_1(3) = 8.
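A runnable sketch of the two operations just defined, with the two-level rank directory described in the next slides (absolute counts every Z bits, bucket-relative counts every z bits); the constants here are tiny for illustration, whereas the theory sets Z = poly(log m) and z = (1/2) log m, and select is a plain scan standing in for the constant-time structure. Class and parameter names are mine. The Elias γ/δ coders from the int-codes slide are included as well:

```python
class RankSelect:
    """Two-level rank directory in the style of [Jacobson, '89]."""
    def __init__(self, bits, Z=16, z=4):
        assert Z % z == 0
        self.bits, self.Z, self.z = bits, Z, z
        self.super, self.block = [], []   # absolute / bucket-relative counts
        acc = rel = 0
        for i, b in enumerate(bits):
            if i % Z == 0:
                self.super.append(acc)
                rel = 0
            if i % z == 0:
                self.block.append(rel)
            acc += b
            rel += b
        self.total = acc

    def rank1(self, i):
        """#1s in bits[0:i): superblock count + block count + short scan."""
        if i >= len(self.bits):
            return self.total
        sb, bl = i // self.Z, i // self.z
        return self.super[sb] + self.block[bl] + sum(self.bits[bl * self.z:i])

    def select1(self, k):
        """0-based position of the k-th 1 (k >= 1); simple scan."""
        for i, b in enumerate(self.bits):
            k -= b
            if k == 0:
                return i
        raise ValueError("fewer than k ones")

def gamma(x):
    """Elias gamma(x), x >= 1: |bin(x)|-1 zeros then bin(x);
    2*floor(log2 x) + 1 bits, e.g. gamma(6) = 00 110."""
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def delta(x):
    """Elias delta(x): gamma(|bin(x)|) then bin(x) without its leading 1;
    about log x + 2 loglog x + O(1) bits."""
    b = bin(x)[2:]
    return gamma(len(b)) + b[1:]
```

Consistently with the slides, γ is shorter for tiny values while |γ(x)| > |δ(x)| once x > 31.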
Considering b = 1 is enough:
- Rank_0(i) = i - Rank_1(i), so supporting Rank_1 suffices for Rank.
- Select_0 does not reduce so directly, but any Select can be supported via Rank and Select over two auxiliary binary arrays B_0, B_1, with |B_0| = m-n, |B_1| = n (so |B_0| + |B_1| = m) and #1s in B_0 and B_1 ≤ min{m-n, n} [see Moffat '07].

Uniquely-decodable int-codes (useful for gaps and lengths):
- γ-code(x): ⌊log₂ x⌋ zeros followed by the binary representation of x, e.g. γ(6) = 00 110; takes 2⌊log₂ x⌋ + 1 bits.
- δ-code(x): γ(|bin(x)|) followed by bin(x) without its leading 1, e.g. δ(33) = γ(6)·00001; takes log x + 2 loglog x + O(1) bits.
- |γ(x)| > |δ(x)| for x > 31; no recursion in practice, and Huffman on the lengths works well.
- Look at them as pointerless data structures.

The Bit-Vector Index (Rank), with m = |B|, n = #1s. Goal: B is read-only, and the additional index takes o(m) bits.
- Store the absolute Rank_1 at the start of each block of Z bits.
- Store the bucket-relative Rank_1 every z bits.
- Setting Z = poly(log m) and z = (1/2) log m: space is |B| + (m/Z) log m + (m/z) log Z + o(m) = m + O(m loglog m / log m) bits, and Rank time is O(1).
- The o(m) term is crucial in practice; lower bounds can be inherited [Patrascu-Thorup, '06].
- Given an integer set, take B as its characteristic vector: then pred(x) = Select_1(Rank_1(x-1)).

The Bit-Vector Index (Select). Consider the stretch of B holding k consecutive 1s, of size r:
- Sparse case: if r > k², store explicitly the positions of the k 1s.
- Dense case: k ≤ r ≤ k², recurse... One level is enough!! We
still need a table of size o(m) to solve the queries inside the small chunks.

Setting k ≈ polylog m:
- Space is m + o(m) bits, and B is not touched!
- Select time is O(1).

Summing up: there exists a Bit-Vector Index taking |B| + o(|B|) bits and constant time for Rank/Select, with B read-only. Hence LPFC + Rank/Select takes [1 + o(1)] extra bits per FC-char.

FC versus Gzip. Gzip emits copies like <6,3,c>: its dictionary consists of all substrings starting before the current position, within a window. Two features:
- Repetitiveness is deployed at any position.
- The window is used for (practical) computational reasons.
On the previous dataset of URLs (uk-2002): FC achieves >30%, gzip achieves 12%, PPM achieves 7%. But these give no random access to substrings. May we combine the best of the two worlds?

The empirical entropy H_0. For a string S (e.g. S = aacaacbababcbac):

H_0(S) = - Σ_i (m_i/m) log₂ (m_i/m),   m_i = frequency in S of the i-th symbol.

- m·H_0(S) is the best you can hope for from a memoryless compressor.
- We know that Huffman or Arithmetic coding come close to this bound.
- Yet H_0 cannot distinguish between A^x B^y and a random string with x A's and y B's. We get better compression using a codeword that depends on the k symbols preceding the one to be compressed (its context).
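The H_0 definition above (and the H_k one that follows) can be computed directly. A small sketch of mine, using the convention that the last k symbols contribute no context:

```python
from collections import Counter
from math import log2

def h0(s):
    """Zeroth-order empirical entropy of s, in bits per symbol."""
    m = len(s)
    return -sum((c / m) * log2(c / m) for c in Counter(s).values())

def hk(s, k):
    """k-th order empirical entropy: the weighted average of H0 over the
    strings S[w] of symbols following each length-k context w."""
    followers = {}
    for i in range(len(s) - k):
        followers.setdefault(s[i:i + k], []).append(s[i + k])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)
```

For S = "mississippi" the followers of the context "is" are exactly "ss", as in the slides, and H_1(S) is already well below H_0(S).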
The empirical entropy H_k, and entropy-bounded string storage [Ferragina-Venturini, '07]. Goal: given a string S[1,m] drawn from an alphabet of size σ,
- encode S within m·H_k(S) + o(m log σ) bits, simultaneously for all k ≤ α log_σ m;
- extract any substring of L symbols in optimal O(1 + L/log_σ m) time.
This encoding fully replaces S in the RAM model!

Here H_k(S) means: compress every symbol up to the H_0 of its k-context, i.e. use Huffman or Arithmetic per context, compressing each conditional string S[ω] up to its H_0:

H_k(S) = (1/|S|) Σ_{|ω|=k} |S[ω]| · H_0(S[ω])

where S[ω] is the string of symbols that follow the occurrences of ω in S (entropy-wise, "follow" ≈ "precede"). Example: for S = mississippi, S["is"] = ss.

Two corollaries:
- Compressed Rank/Select data structures: B was read-only in the simplest R/S scheme, so we can store it in |B|·H_k(B) + o(|B|) bits and keep R/S in O(1) time.
- Compressed front-coding + random access. Promising: FC+gzip saves 16% over gzip on uk-2002.
How much is this "operational"?

The storage scheme:
- Split S into blocks of b = (1/2) log_σ m symbols; #blocks = m/b = O(m / log_σ m).
- #distinct blocks = O(σ^b) = O(m^(1/2)): sort them by frequency and assign to each the shortest available binary codeword (ε, 0, 1, 00, 01, ...), kept in a table T.
- V = the concatenation of the codewords of S's blocks; B = a bitvector marking where each codeword begins.
Decoding is easy:
- R/S on B determine the codeword's position in V;
- retrieve cw from V;
- the decoded block is T[2^len(cw) + cw].

Bounding |V| in terms of H_k(S). Introduce the statistical encoder E_k(S):
- compute F(i) = frequency of S[i] within its k-th order context S[i-k, i-1];
- encode every block B[1,b] of S by 1) writing B[1,k] explicitly, and 2) encoding B[k+1,b] by Arithmetic with the k-th order frequencies.
Some algebra: |E_k(S)| ≤ (m/b)·(k log σ) + m·H_k(S) + 2(m/b) bits = m·H_k(S) + o(m log σ). Moreover, E_k assigns to each distinct block a unique codeword, and these codewords are a subset of {0,1}*, while our codewords are the shortest of {0,1}*. Hence |V| ≤ |E_k(S)| ≤ m·H_k(S) + o(m log σ) bits. Overall, T+V+B take
|V| + o(|S| log σ) bits, by the golden rule of data compression: |B| ≤ |S| log σ with #1s in B = #blocks = o(|S|), and |V| ≤ |E_k(S)| ≤ |S|·H_k(S) + o(|S| log σ) bits.

Part #2: Take-home Msg
- Given a binary string B, we can store it in |B|·H_k(B) + o(|B|) bits, support Rank & Select in constant time, and access any substring of B in optimal time: a pointerless data structure.
- Given a string S on Σ, we can store it in |S|·H_k(S) + o(|S| log |Σ|) bits, for k ≤ α log_{|Σ|} |S|, and access any substring of S in optimal time: always better than keeping S verbatim in RAM.
- Experimentally: ≈10⁷ Select/sec, ≈10⁶ Rank/sec.

(Compressed) String Indexing

What do we mean by "indexing"?
- Word-based indexes, where a notion of "word" must be devised: inverted files, signature files, bitmaps.
- Full-text indexes, with no constraint on text and queries: suffix array, suffix tree, String B-tree, ...

The Problem. Given a text T, we wish to devise a (compressed) representation of T that efficiently supports the following operations:
- Count(P): how many times does string P occur in T as a substring?
- Locate(P): list the positions of the occurrences of P in T.
- Visualize(i,j): print T[i,j].
Known solutions are either time-efficient but not compressed (suffix arrays, suffix trees, ... and many others) or space-efficient but not time-efficient (zgrep: uncompress and then grep; cgrep, ngrep: pattern matching over compressed text).

The Suffix Array.
Prop 1. All suffixes of T having prefix P are contiguous in lexicographic order.
Prop 2.
Starting position is the lexicographic position of P among the suffixes.

The suffix array of T = mississippi# (storing the suffixes SUF(T) explicitly would take Θ(N²) space; SA stores only their starting positions):

SA  suffix
12  #
11  i#
 8  ippi#
 5  issippi#
 2  ississippi#
 1  mississippi#
10  pi#
 9  ppi#
 7  sippi#
 4  sissippi#
 6  ssippi#
 3  ssissippi#

Space: SA takes Θ(N log₂ N) bits, plus the text T of N chars; in practice, a total of 5N bytes.

Searching a pattern, e.g. P = si: indirected binary search on SA, with O(p) time per suffix comparison and 2 accesses per step (one to SA, one to T). If P is larger than the middle suffix go right, if smaller go left. Overall, O(log₂ N) binary-search steps, each taking O(p) char comparisons, i.e. O(p log₂ N) time [Manber-Myers, '90]; improvements exist [Cole et al, '06].

Listing the occurrences: binary-search for P# and P$ (where # < Σ < $) to delimit the interval of suffixes prefixed by P; listing then takes O(occ) time. For P = si, occ = 2 (positions 4 and 7).

Text mining. Let Lcp[1, N-1] store the LCP length between suffixes adjacent in SA.
- Does there exist a repeated substring of length ≥ L? Search for an entry Lcp[i] ≥ L.
- Does there exist a substring of length ≥ L occurring ≥ C times? Search for a window Lcp[i, i+C-1] whose entries are all ≥ L.

What about space occupancy? SA + T take Θ(N log₂ N) bits. Do we need such an amount?
1) #permutations of {1, 2, ..., N} = N!
2) but SA cannot be an arbitrary permutation of {1, ..., N}.
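The binary search over SA and the Lcp-based mining query above can be sketched as follows; the constructions are deliberately naive (the SA build is O(n² log n), and Kasai's linear-time algorithm stands in for whichever LCP construction the course assumes):

```python
def suffix_array(t):
    """Naive construction by sorting suffix start positions."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def sa_range(t, sa, p):
    """[lo, hi) of SA rows whose suffix starts with p: two binary
    searches, each comparison costing O(|p|) chars as in the slides."""
    def bsearch(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            pref = t[sa[mid]:sa[mid] + len(p)]
            if pref < p or (strict and pref == p):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bsearch(False), bsearch(True)

def lcp_array(t, sa):
    """Kasai's O(n) algorithm: lcp[r] = LCP(suffix sa[r], suffix sa[r-1])."""
    n = len(t)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
        else:
            h = 0
    return lcp
```

A repeated substring of length ≥ L exists iff max(lcp) ≥ L; on mississippi# the maximum is 4, witnessed by the repeat "issi".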
3) #SA ≤ #texts = |Σ|^N.
So the lower bounds are Ω(N log |Σ|) bits from #texts, and Ω(N·H_k(T)) bits from compression: both very far from Θ(N log N).

The Burrows-Wheeler Transform (1994): an elegant mathematical tool. Take the text T = mississippi#, form all its cyclic rotations, and sort the rows. F is the first column, L the last:

F              L
#mississipp    i
i#mississip    p
ippi#missis    s
issippi#mis    s
ississippi#    m
mississippi    #
pi#mississi    p
ppi#mississ    i
sippi#missi    s
sissippi#mi    s
ssippi#miss    i
ssissippi#m    i

A useful tool: the L -> F mapping. How do we map L's chars onto F's chars? We need to distinguish equal chars. Take two equal chars of L and rotate their rows rightward by one position: they become rows starting with that char, and they keep the same relative order. Hence the i-th occurrence of a char c in L corresponds to the i-th occurrence of c in F.

The BWT is invertible, thanks to two key properties:
1. the LF-array maps L's chars to F's chars;
2. L[i] precedes F[i] in T.

How to compute the BWT? Given SA, we have L[i] = T[SA[i] - 1] (e.g. L[3] = T[SA[3] - 1] = T[7]). Building the rotation matrix explicitly is elegant but inefficient: O(n³) time in the worst case, or O(n²) cache misses / I/O faults; the role of # is to make sorting the rotations equivalent to sorting the suffixes. Reconstruct T backward:
T = .... i p p i #

InvertBWT(L)
  Compute LF[0, n-1];
  r = 0; i = n;
  while (i > 0) { T[i] = L[r]; r = LF[r]; i--; }

Compressing L seems promising... Key observation: L is locally homogeneous, hence highly compressible.

An encoding example. T = mississippimississippimississippi# (the # at position 34):
L   = ipppssssssmmmii#pppiiissssssiiiiii
MTF = 020030000030030200300300000100000
Shifting the symbols to make room for run-length codes gives 030040000040040300400400000200000, on which RLE0 encodes the 0-runs with Wheeler's code (run lengths in binary, e.g. Bin(6) = 110).

Algorithm bzip:
- Move-to-Front coding of L;
- run-length coding, over an alphabet of |Σ|+1 symbols;
- a statistical coder (Arithmetic/Huffman).
bzip vs. gzip: 20% vs. 33% compression ratio, but bzip is slower in (de)compression!

Be back on indexing: BWT vs. SA. L includes SA and T: can we search within L?
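The pieces of the pipeline on these slides, sketched in runnable form: BWT via the suffix array (L[i] = T[SA[i]-1]), inversion via the LF-mapping, and the MTF step (RLE and the final statistical coder are omitted). A small illustration of mine under the slides' assumption that T ends with a unique, smallest terminator '#':

```python
def bwt(t):
    """BWT from the suffix array: L[i] = T[SA[i]-1] (index -1 wraps)."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in sa)

def ibwt(L):
    """Invert via LF: equal chars keep their relative order in L and F."""
    counts, rank = {}, []
    for c in L:                      # rank of each char among its equals
        rank.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1
    first, tot = {}, 0               # first row of each char in F
    for c in sorted(counts):
        first[c], tot = tot, tot + counts[c]
    lf = [first[c] + r for c, r in zip(L, rank)]
    out, r = [], 0                   # row 0 starts with the unique '#'
    for _ in range(len(L)):
        out.append(L[r])             # L[r] precedes F[r] in T
        r = lf[r]
    s = "".join(reversed(out))       # T rotated to start at '#'
    return s[1:] + s[0]              # move the terminator back to the end

def mtf(s, alphabet):
    """Move-to-front: runs of equal symbols in L become runs of 0s."""
    table, out = list(alphabet), []
    for c in s:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out
```

On T = mississippi# this reproduces the L column of the matrix above, and inverting it gives T back.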
Why it works... Consider the BWT matrix of T: each piece of L corresponds to a context, i.e. to the block of rows sharing the same prefix. Compressing each piece up to its H_0 achieves H_k(T); MTF + RLE avoids the need to explicitly partition BWT into pieces.

Implement the LF-mapping [Ferragina-Manzini]. Keep the array of F-starts: # starts at row 1, i at row 2, m at row 6, p at row 7, s at row 9. To map, say, L[9] (an s) to its row in F: Rank_s(L, 9) = 3, so it is the 3rd s in L, hence the 3rd s in F, i.e. F[9 + 3 - 1] = F[11]. What we need is a generalized Rank oracle over strings: Rank_c(L, i) = number of c's in L[1, i].

Rank and Select on strings:
- If Σ is small (i.e. constant), build one binary Rank data structure per symbol of Σ: Rank takes O(1) time in entropy-bounded space.
- If Σ is large (words?), we need a smarter solution: the Wavelet Tree data structure [Grossi-Gupta-Vitter, '03].
Another step of reduction: reduce Rank&Select over arbitrary strings
to Rank&Select over binary strings: binary R/S are the key tools (tons of papers).

The FM-index [Ferragina-Manzini, FOCS '00; J. ACM '05]: substring search in T, counting the pattern occurrences by backward search on L. For P = si:
- First step: [fr, lr] = the range of rows prefixed by the last char, "i".
- Inductive step: given [fr, lr] for P[j+1, p], take c = P[j]; find the first and the last c in L[fr, lr] (two Rank queries); LF-map them; the new [fr, lr] is the range of rows prefixed by P[j, p].
At the end, #occurrences = lr - fr + 1 (here occ = 2). Rank is enough.

The result (on small alphabets):
- Count(P) takes O(p) time;
- Locate(P) takes O(occ · log^{1+ε} N) time;
- Visualize(i, i+L) takes O(L + log^{1+ε} N) time;
- space occupancy is O(N·H_k(T)) + o(N) bits, i.e. o(N) if T is compressible. The index does not depend on k, and the bound holds for all k simultaneously.
New concept: the FM-index is an opportunistic data structure. The survey by Navarro-Mäkinen describes many compressed-index variants.

Is this a technological breakthrough? The question then was: how to turn these challenging and mature theoretical achievements into a technological breakthrough?
- Engineered implementations [December 2003; January 2005]
- A flexible API to allow reuse and development
- A framework for extensive testing
In a joint effort with Navarro's group we engineered the best known indexes: FMI, CSA, SSA, AF-FMI, RL-FM, LZ, .... All implemented indexes follow a carefully designed API which offers build, count, locate, extract, ...; some tools have been designed to automatically plan, execute and check the index performance over the text collections; a group of variegate texts is available, with sizes ranging from 50 MB to 2 GB (>400 downloads, >50 registered users).

Some figures, over hundreds of MBs of data:
- Count(P) takes 5 µsecs/char, in ≈42% space: about 10 times slower (than plain suffix arrays);
- Extract takes 20 µsecs/char;
- Locate(P) takes 50 µsecs/occ, in +10% space: about 50 times slower.
Trade-offs are possible!!! We need your applications...

Part #5: Take-home msg. This is a powerful paradigm to design compressed indexes:
1. Transform the input into a few arrays.
2. Index (+ compress) the arrays to support rank/select ops.
It connects indexing with compression (and with I/Os and the query distribution/flow), and extends to other data types: labeled trees, 2D.

Where we are... A data structure is "opportunistic" if it indexes a text T within compressed space and supports three kinds of queries:
- Count(P): count the occurrences of P in T;
- Locate(P): list the occurrences of P in T;
- Display(i,j): print T[i,j].
Key tools: Burrows-Wheeler Transform + Suffix Array. Key idea: reduce P's queries to a few rank/select queries on BWT(T). Space complexity: a function of the k-th order empirical entropy of T.

(Compressed) Tree Indexing

Another data format: XML [W3C '98].

<dblp>
  <book>
    <author>Donald E. Knuth</author>
    <title>The TeXbook</title>
    <publisher>Addison-Wesley</publisher>
    <year>1986</year>
  </book>
  <article>
    <author>Donald E. Knuth</author>
    <author>Ronald W. Moore</author>
    <title>An Analysis of Alpha-Beta Pruning</title>
    <pages>293-326</pages>
    <year>1975</year>
    <volume>6</volume>
    <journal>Artificial Intelligence</journal>
  </article>
  ...
</dblp>

A tree interpretation: elements become internal nodes, text content becomes leaves.
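Looking back at the FM-index slides, the backward-search Count can be written in a few lines. A minimal sketch of mine: rank here is a plain scan over L, where the real index uses per-symbol bitvectors or a wavelet tree to make it constant time:

```python
from collections import Counter

def fm_count(L, p):
    """Count the occurrences of p in the text whose BWT is L."""
    counts = Counter(L)
    C, tot = {}, 0
    for c in sorted(counts):   # C[c] = #chars smaller than c = F-start of c
        C[c] = tot
        tot += counts[c]

    def rank(c, i):            # occurrences of c in L[0:i]
        return L[:i].count(c)

    fr, lr = 0, len(L)         # [fr, lr) = all rows of the BWT matrix
    for c in reversed(p):      # backward search, one char of p per step
        if c not in C:
            return 0
        fr = C[c] + rank(c, fr)
        lr = C[c] + rank(c, lr)
        if fr >= lr:
            return 0
    return lr - fr
```

Only rank queries on L are used, which is exactly the point of the slides: Count needs neither T nor SA.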
XML document exploration ≡ tree navigation; XML document search ≡ labeled subpath searches (a subset of XPath [W3C]).

A key concern is verbosity. The problem, in practice: devise a (compressed) representation of the tree T that efficiently supports
- navigational operations: parent(u), child(u, i), child(u, i, c);
- subpath searches over a sequence of k labels;
- content searches: subpath search + substring.
Today:
- XML-aware compressors (like XMill, XmlPpm, ScmPpm, ...) need the whole decompression for navigation and search;
- XML-queriable compressors (like XPress, XGrind, XQzip, ...) achieve poor compression and need a scan of the whole (compressed) file;
- XML-native search engines need this tool as a core block for query optimization and (compressed) storage of information [IEEE Computer, April 2005]. Theory?

A transform for labeled trees [Ferragina et al, 2005]: the XBW-transform plays on trees the role that the BW-transform plays on strings. The XBW-transform linearizes T into 2 arrays such that:
- the compression of T reduces to the compression of these two arrays (e.g. gzip, bzip2, ppm, ...);
- the indexing of T reduces to implementing generalized rank/select over these two arrays. Rank&Select are again crucial.

Step 1. Visit the tree in pre-order.
D B C b D B C For each node, write down its label Permutation a B C and the labels on its upward path of tree nodes Rank&Select are again crucial Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa upward labeled paths</p><p>The XBW-Transform The XBW-Transform</p><p>C S Sα Sπ C last Sα Sπ 1 C ε C ε 0 b A C b A C B A B 0 a A C B A B a A C 1 D A C D A C 0 D B C D B C 1 c B C D c b a D D a c B C 0 D B C D c b a D D a D B C 1 a B C a B C 0 B C XBW can be B C c a c b 0 A C built and inverted c a c b A C 1 B C in optimal O(t) time B C 1 c D A C c D A C 0 c D B C StepKey fact 3. Step 2. c D B C 1 a D B C a D B C NodesAdd a binarycorrespond array toS last itemsmarking in <S the ,S > 1 b D B C Stably sort according to S last α π b D B C rows corresponding to last children XBW Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa XBW takes optimal t log | Σ| + t bits upward labeled paths</p><p>18 XBW is highly compressible XBzip œ a simple XML compressor</p><p>Sπ Slast Sα <a href="/tags/Donald_Knuth/" rel="tag">Donald Knuth</a> /author/article/dblp <a href="/tags/Kurt_Mehlhorn/" rel="tag">Kurt Mehlhorn</a> /author/article/dblp Kurt Mehlhorn /author/article/dblp ...... Kurt Mehlhorn /author/book/dblp John Kleinberg /author/book/dblp Kurt Mehlhorn /author/book/dblp ...... Theoretically,Journal we of couldthe ACM extend/journal/article/dblp the definition of H k to labeled trees by takingAlgorithmica as k-context of a/journal/article/dblp node its leading path of k-length Journal of the ACM /journal/article/dblp ...... (related to Markov random fields over trees) 120-128 /pages/article/dblp 137-157 /pages/article/dblp ...... Tags, Attributes and = ACM Press /publisher/journal/dblp ACM Press /publisher/journal/dblp IEEE Press /publisher/journal/dblp ...... XBW is compressible : 1977 /year/book/dblp 2000 /year/journal/dblp XBW is compressible :  Compress S with PPM ... α ...  Sα is locally homogeneous Pcdata  Slast is small...  
Paolo Ferragina, Università di Pisa XBW Slast has some structure and is small Paolo Ferragina, Università di Pisa</p><p>XBzip = XBW + PPM [Ferragina et al, 2006] Some structural properties</p><p>C 25% Slast Sα Sπ 20% 1 C ε B A B 0 b A C 15% 0 a A C 1 D A C 10% 0 D B C D c b a D D a 1 c B C 5% 0 D B C 1 a B C 0% c a c b 0 B C DBLP Pathways News 0 A C 1 B C Two useful properties: 1 c D A C gzip bzip2 ppmdi xmill + ppmdi scmppm XBzip • Children are contiguous and delimited by 1s 0 c D B C 1 a D B C • Children reflect the order of their parents String compressors are not so bad: within 5% 1 b D B C</p><p>XBW Paolo Ferragina, Università di Pisa Deploy huge literature on string compression Paolo Ferragina, Università di Pisa</p><p>19 XBW is navigational Subpath search in XBW C A 2 A 2 C ΠΠΠ C S Sπ B 5 [i+1] last Sα B 5 S Sα Sπ C 9 last 1 C ε C 9 D 12 Π 1 C ε = B D 0 b A C D 12 B A B 0 b A C B A B 0 a A C 0 a A C 1 D A C Bfr 0 B C 1 D A C Select in S D last Rows 0 D B C the 2° item 1 1 c B C D c b a D D a from here... D c b a D D a whose 1 c B C 0 D B C Sπ starts 0 D B C lr 1 a B C with ”B‘ 1 a B C C 0 B C Jump c a c b 0 B C c a c b 0 A C to their Get_children children 0 A C 1 B C 1 D A C 1 B C Rank(B,S α)=2 Inductive step: c 1 c D A C 0 c D B C XBW is navigational:  Pick the next char in [i+1] i.e. 
”D‘ 0 c D B C Π , 1 a D B C • Rank-Select data structures on S and S  last α 1 a D B C Search for the first and last ”D‘ in S α[fr,lr] 1 b D B C • The array C of | Σ| integers 1 b D B C  Jump to their children XBW-index</p><p>Paolo Ferragina, Università di Pisa XBW Paolo Ferragina, Università di Pisa</p><p>A 2 : XBW + FM-index [Ferragina et al, 2006] Subpath search in XBW B 5 XBzipIndex C 9 C S ΠΠΠ[i+1] Slast Sα π D 12 Under patenting by 60% ε Π 1 C Pisa + Rutgers = B D 0 b A C 2° D 50% B A B 0 a A C 1 D A C 3° D 40% fr 0 D B C 30% 1 c B C D c b a D D a B C 0 D 20% lr B C 1 a Look at S last 0 B C Jumpto find to the 10% their2° children and 3° c a c b 0 A C 1s after 12 1 B C 0% D A C Inductive step: 1 c DBLP Pathways News fr D B C Rows 0 c whose XBW Pick indexing the next char[ reductionin Π[i+1] ,to i.e. string ”D‘ indexing ] Huffword XPress XQzip XBzipIndex XBzip 1 a D B C Sπ starts  with ”D B‘ Search for the first and last ”D‘ in S α[fr,lr] lr 1 b D B C Rank and Select data structures DBLP : 1.75 bytes/node , Pathways : 0.31 bytes/node , News : 3.91 bytes/node  Jump to their children are enough to navigate and search T XBW-index Two occurrences Upto 36% improvement in compression ratio because of two 1s Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa Query (counting) time ≅ 8 ms, Navigation time ≅ 3 ms</p><p>20 Part #6: Take-home msg...</p><p>Data type I/O issues This is a powerful paradigm to design compressed indexes: Indexing 1. Transform the input in few arrays [Kosaraju, Focs ”89] 2. Index (+ Compress) the arrays to support rank/select ops</p><p>Compressed Strong connection Indexing Paolo Ferragina Dipartimento di Informatica, Università di Pisa</p><p>More experiments Other data types: More ops and Applications 2D, Labeled graphs</p><p>Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa</p><p>What about I/O-issues ? 
The B-tree</p><p>B-tree is ubiquitous in large-scale applications: P[1,p] pattern to search Search(P) – Atomic keys : integers, reals, ... •O((p/B) log n) I/Os 2 O(p/B log B) I/Os – Prefix B-tree : bounded length keys ( ≤ 255 chars) •O(occ/B) I/Os 2 29 13 20 18 3 23 O(log B n) levels</p><p>String B-tree = B-tree + Patricia Trie [Ferragina-Grossi, 95] – Unbounded length keys – I/O-optimal prefix searches Variants for various models</p><p>– Efficient string updates 29 2 26 13 20 25 6 18 3 14 21 23 – Guaranteed optimal page fill ratio</p><p>They are not opportunistic  [Bender et al  FC]</p><p>29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23</p><p>Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa</p><p>21 On small sets... [Ferguson, 92] On larger sets... 0 Patricia Trie A G Scan FC( D) : Space = Θ(#D) words C Two-phase search:  If P[L[x]]=1, then { x++ } else { jump; } G P = GCACGCAC Search(P):  Compare P and S[x]  Max_lcp • Phase 1: tree navigation 3 4 0 4 6 • Phase 2: Compute LCP A C A G  If P[Max_lcp+1] = 0 go left, else go right, until L[] ≤ Max_lcp • Phase 3: tree navigation x = 2 x = 3 x = 4 5 L 0 2 4 5 0 2 A 6 6 G A A G A G max LCP jump P 0 0 0 0 0 1 1 with P 1 1 1 1 1 1 1 1 string checked 1 2 3 4 5 6 7 1 0 1 1 1 0 1 + 0 0 1 1 1 0 1 Space PT ≈ #D A A A G G G G G 1 0 1 1 1 1 G G C C C C A 1 A A G G G G 0 0 1 1 0 A A C C C C C 0 1 G G A A G G Init x = 1 A G G G G G Correct 4 is the candidate position ,Mlcp=3 P’s position A G A G Time is #D + |P| ≤ |FC(D)| A Paolo Ferragina, Università di Pisa Paolo Ferragina, Università di Pisa Just S[x] needs to be decoded !!</p><p>Succinct PT  smaller height in practice The String B-tree ...not opportunistic: Ω(#D log |D|) bits + P[1,p] pattern to search Search(P) •O((p/B) log n) I/Os B O(p/B) I/Os •O(occ/B) I/Os PT 29 13 20 18 3 23 O(log n) levels It is dynamic... 
B</p><p>PT PT PT 29 2 26 13 20 25 6 18 3 14 21 23</p><p>PT PT PT PT PT PT 29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23</p><p>Lexicographic position of P Paolo Ferragina, Università di Pisa</p><p>22</p> </div> </article> </div> </div> </div> <script type="text/javascript" async crossorigin="anonymous" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-8519364510543070"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.1/jquery.min.js" crossorigin="anonymous" referrerpolicy="no-referrer"></script> <script> var docId = 'ec576a92db911ad9b8129aff94ae13e7'; var endPage = 1; var totalPage = 22; var pfLoading = false; window.addEventListener('scroll', function () { if (pfLoading) return; var $now = $('.article-imgview .pf').eq(endPage - 1); if (document.documentElement.scrollTop + $(window).height() > $now.offset().top) { pfLoading = true; endPage++; if (endPage > totalPage) return; var imgEle = new Image(); var imgsrc = "//data.docslib.org/img/ec576a92db911ad9b8129aff94ae13e7-" + endPage + (endPage > 3 ? ".jpg" : ".webp"); imgEle.src = imgsrc; var $imgLoad = $('<div class="pf" id="pf' + endPage + '"><img src="/loading.gif"></div>'); $('.article-imgview').append($imgLoad); imgEle.addEventListener('load', function () { $imgLoad.find('img').attr('src', imgsrc); pfLoading = false }); if (endPage < 7) { adcall('pf' + endPage); } } }, { passive: true }); </script> <script> var sc_project = 11552861; var sc_invisible = 1; var sc_security = "b956b151"; </script> <script src="https://www.statcounter.com/counter/counter.js" async></script> </html>