New Algorithms for Position Heaps Travis Gagie, Wing-Kai Hon, Tsung-Han Ku
CPM2013
1 Outline
Position Heap
Limiting Length and Height
Turing a Suffix Tree into a Position Heap
Using a Position Heap as a Suffix Array
Using a Compressed Suffix Array as a Position Heap
2 Position Heap
The root is labeled 0 and the other nodes are labeled 1 to n such that parent’s labels are smaller than their children’s labels
For 1≤ i ≤ n, the path label of the node labeled i is a prefix of S[i..n].
For 1≤ i ≤ n, the node labeled i stores a pointer (called maximal-reach pointer) to the deepest node whose path label is a prefix of S[i..n]
3 Position Heap
0 a 1
A position heap for S = abaababbabbab$
4 Position Heap
0 b a 2 1
A position heap for S = abaababbabbab$
5 Position Heap
0 b a 2 1 a
3
A position heap for S = abaababbabbab$
6 Position Heap
0 b a 2 1 a b 3 4
A position heap for S = abaababbabbab$
7 Position Heap
0 b a 2 1 a b a 3 4 5
A position heap for S = abaababbabbab$
8 Position Heap
0 b a 2 1 a b a 3 4 5 b
6
A position heap for S = abaababbabbab$
9 Position Heap
0 b a 2 1 b a b a 3 4 5 7 b
6
A position heap for S = abaababbabbab$
10 Position Heap
0 b a 2 1 b a b a 3 4 5 7 b b 6 8
A position heap for S = abaababbabbab$
11 Position Heap
0 b a 2 1 b a b a 3 4 5 7 b b 6 8
a
9
A position heap for S = abaababbabbab$
12 Position Heap
0 b a 2 1 b a b a 3 4 5 7 b b a 6 8 10 a
9
A position heap for S = abaababbabbab$
13 Position Heap
0 b a 2 1 b a b a 3 4 5 7 b b a 6 8 10 a $ 11 9
A position heap for S = abaababbabbab$
14 Position Heap
0 b a 2 1 b a b a 3 4 5 7 $ b b a 12 6 8 10 a $ 11 9
A position heap for S = abaababbabbab$
15 Position Heap
0 b a 2 1 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9
A position heap for S = abaababbabbab$
16 Position Heap
0 $ b a 2 1 14 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9
A position heap for S = abaababbabbab$
17 Position Heap
0 $ b a 2 1 14 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9
A position heap for S = abaababbabbab$
18 Position Heap
With the following auxiliary data structures, the pattern matching problem can be solved in optimal time (i.e. O(|P| + occ)).
(1) An array A of pointers such that, given i, in O(1) time we can find the node labeled i.
(2) A data structure B such that, given i and j, in O(1) time we can determine whether the node labeled i is an ancestor of the node labeled j.
19 Query on Position Heap
0 $ b a 2 1 14 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9 A position heap for S = abaababbabbab$ Query pattern P = aabab, candidate are 1 and 3. 20 Query on Position Heap
0 $ b a 2 1 14 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9 Query pattern P = aabab, candidate is 3. Check if 3+2= 5 is on the path or in the subtree of 8. 21 Limiting Length and Height
Suppose the length of the query pattern P is always less than or equal to a variable M.
Then we can build a position heap whose height would be bounded in O(M).
22 Limiting Length and Height
Suppose we want to build the position heap of a string
S’ …..
2M 2M 2M S’’ …..
2M 2M 2M Instead of building the position heap on S, we build the position heap of S’!S’’. 23 Limiting Length and Height
According to the nature of the position heap, the height of the position heap of S’!S’’ would be O(M). By adopting Ehrenfeucht et al.’s approaches with an AVL tree, we have the following theorem.
Theorem 1: If we will never search for a pattern of length greater than M in a dynamic string S, then we can maintain a position heap that works as an index for S such that • Searching for a pattern of length m≦M takes O(mlog|S|+occ) time, • Inserting a substring of length l takes O((M+l)Mlog(|S|+l)) time, • Deleting a substring of length l takes O((M+l)Mlog|S|) time.
24 Suffix Tree Into Position Heap
A position heap of a string S is a subtree of the suffix trie of S.
Suffix tree is a compact suffix trie.
Here, we show how to turn a suffix tree into a position heap.
25 b $ a 14 a b 13 b $ a $ 10 12 $ 11 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 26 0 b $ a 14 1 a b 13 b $ a $ 10 12 $ 11 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 27 0 b $ a 2 14 1 a b 13 b $ a $ 10 12 $ 11 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 28 0 b $ 2 14 1 a b 13 b $ a $ 10 12 $ 11 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 29 0 b $ 2 14 1 a b 13 4 b $ a $ 10 12 $ 11 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 30 0 b $ 2 14 1 a b 13 5 4 b $ a $ 10 12 $ 11 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 31 0 b $ 2 14 1 a b 13 5 4 b $ a 6 $ 10 12 $ 11 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 32 0 b $ 2 14 1 a 7 b 13 5 4 b $ a 6 $ 12 $ 10 11 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 33 0 b $ 2 14 1 a 7 b 13 5 4 b $ a 6 8 $ 12 $ 10 11 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 34 0 b $ 2 14 1 a 7 b 13 5 4 b $ a 6 8 $ 12 9 $ 10 11 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 35 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 36 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 37 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 38 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 39 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 40 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7
3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 41 Suffix Tree Into Position Heap
Given a suffix tree, we can build the position heap in linear time by using Westbrook’s approaches, and Berkman et al.’s approaches.
By using Farach’s linear time suffix tree construction algorithm independent of the alphabet size, we can build a position heap of a string S in linear time independent of the alphabet size.
42 Using Position Heap as Suffix Array
Define D[i] = the depth of the node labeled SA[i]
Define E[i] = the depth of the node with rank i in preorder
43 0 $ b a 2 1 14 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9 + D = [ 1, 2, 3, 1, 2, 4, 3, 2, 1, 4, 3, 2, 3, 2] E = [ 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 4, 2, 3] = SA = [14, 3, 12, 1, 4, 9, 6, 13, 2, 11, 8, 5, 10, 7] 44 Using Position Heap as Suffix Array
By using succinct partial rank and select query structures which support constant time query on arrays D and E, we can compute the SA[i] value in constant time.
Theorem: We can add O(n log h) bits to a position heap, where h is the height of the heap, such that it supports access to the corresponding suffix array and inverse suffix array in O(1) time.
45 Using CSA as Position Heap
To represent a position heap, we need
Its structure as a tree; balanced parentheses representation
The nodes’ labels; by using E, D, and SA -1 The edges’ labels; by using SA and a bit-vector
The maximal-reach pointers; next slide, we will show
The array A of pointers; similar to previous one described in the previous slides
The data structure B; similar to previous one described in the previous slides
46 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7
3 $ 2 9 6 8 4 5 1 ( (*) ( (*) ( (*)**( (**) ) ) ) ( (*) (*( (*) ** ) ) ( (**) ) ) ) 47 Using CSA as Position Heap
( (*) ( (*) ( (*)**( (**) ) ) ) ( (*) (*( (*) ** ) ) ( (**) ) ) )
This pair represents the node labeled 9. Suppose we want to compute the maximal-reach pointer of the node labeled 6. First we compute SA-1[6] = 7. ; it is the 7-th star we visited Then we compute which parentheses pair encloses the 7-th star.
48 Thank you
49