New Algorithms for Position Heaps Travis Gagie, Wing-Kai Hon, Tsung-Han Ku

CPM2013

1 Outline

 Position

 Limiting Length and Height

 Turing a Suffix into a Position Heap

 Using a Position Heap as a

 Using a Compressed Suffix Array as a Position Heap

2 Position Heap

 The root is labeled 0 and the other nodes are labeled 1 to n such that parent’s labels are smaller than their children’s labels

 For 1≤ i ≤ n, the path label of the node labeled i is a prefix of S[i..n].

 For 1≤ i ≤ n, the node labeled i stores a pointer (called maximal-reach pointer) to the deepest node whose path label is a prefix of S[i..n]

3 Position Heap

0 a 1

A position heap for S = abaababbabbab$

4 Position Heap

0 b a 2 1

A position heap for S = abaababbabbab$

5 Position Heap

0 b a 2 1 a

3

A position heap for S = abaababbabbab$

6 Position Heap

0 b a 2 1 a b 3 4

A position heap for S = abaababbabbab$

7 Position Heap

0 b a 2 1 a b a 3 4 5

A position heap for S = abaababbabbab$

8 Position Heap

0 b a 2 1 a b a 3 4 5 b

6

A position heap for S = abaababbabbab$

9 Position Heap

0 b a 2 1 b a b a 3 4 5 7 b

6

A position heap for S = abaababbabbab$

10 Position Heap

0 b a 2 1 b a b a 3 4 5 7 b b 6 8

A position heap for S = abaababbabbab$

11 Position Heap

0 b a 2 1 b a b a 3 4 5 7 b b 6 8

a

9

A position heap for S = abaababbabbab$

12 Position Heap

0 b a 2 1 b a b a 3 4 5 7 b b a 6 8 10 a

9

A position heap for S = abaababbabbab$

13 Position Heap

0 b a 2 1 b a b a 3 4 5 7 b b a 6 8 10 a $ 11 9

A position heap for S = abaababbabbab$

14 Position Heap

0 b a 2 1 b a b a 3 4 5 7 $ b b a 12 6 8 10 a $ 11 9

A position heap for S = abaababbabbab$

15 Position Heap

0 b a 2 1 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9

A position heap for S = abaababbabbab$

16 Position Heap

0 $ b a 2 1 14 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9

A position heap for S = abaababbabbab$

17 Position Heap

0 $ b a 2 1 14 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9

A position heap for S = abaababbabbab$

18 Position Heap

 With the following auxiliary data structures, the problem can be solved in optimal time (i.e. O(|P| + occ)).

(1) An array A of pointers such that, given i, in O(1) time we can find the node labeled i.

(2) A B such that, given i and j, in O(1) time we can determine whether the node labeled i is an ancestor of the node labeled j.

19 Query on Position Heap

0 $ b a 2 1 14 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9 A position heap for S = abaababbabbab$ Query pattern P = aabab, candidate are 1 and 3. 20 Query on Position Heap

0 $ b a 2 1 14 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9 Query pattern P = aabab, candidate is 3. Check if 3+2= 5 is on the path or in the subtree of 8. 21 Limiting Length and Height

 Suppose the length of the query pattern P is always less than or equal to a variable M.

 Then we can build a position heap whose height would be bounded in O(M).

22 Limiting Length and Height

 Suppose we want to build the position heap of a string

S’ …..

2M 2M 2M S’’ …..

2M 2M 2M Instead of building the position heap on S, we build the position heap of S’!S’’. 23 Limiting Length and Height

According to the nature of the position heap, the height of the position heap of S’!S’’ would be O(M). By adopting Ehrenfeucht et al.’s approaches with an AVL tree, we have the following theorem.

Theorem 1: If we will never search for a pattern of length greater than M in a dynamic string S, then we can maintain a position heap that works as an index for S such that • Searching for a pattern of length m≦M takes O(mlog|S|+occ) time, • Inserting a of length l takes O((M+l)Mlog(|S|+l)) time, • Deleting a substring of length l takes O((M+l)Mlog|S|) time.

24 Suffix Tree Into Position Heap

 A position heap of a string S is a subtree of the suffix of S.

 Suffix tree is a compact suffix trie.

 Here, we show how to turn a suffix tree into a position heap.

25 b $ a 14 a b 13 b $ a $ 10 12 $ 11 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 26 0 b $ a 14 1 a b 13 b $ a $ 10 12 $ 11 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 27 0 b $ a 2 14 1 a b 13 b $ a $ 10 12 $ 11 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 28 0 b $ 2 14 1 a b 13 b $ a $ 10 12 $ 11 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 29 0 b $ 2 14 1 a b 13 4 b $ a $ 10 12 $ 11 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 30 0 b $ 2 14 1 a b 13 5 4 b $ a $ 10 12 $ 11 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 31 0 b $ 2 14 1 a b 13 5 4 b $ a 6 $ 10 12 $ 11 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 32 0 b $ 2 14 1 a 7 b 13 5 4 b $ a 6 $ 12 $ 10 11 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 33 0 b $ 2 14 1 a 7 b 13 5 4 b $ a 6 8 $ 12 $ 10 11 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 34 0 b $ 2 14 1 a 7 b 13 5 4 b $ a 6 8 $ 12 9 $ 10 11 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 35 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 36 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 37 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 38 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 39 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 40 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7

3 $ 2 9 6 8 4 5 1 The suffix tree of S = abaababbabbab$. 41 Suffix Tree Into Position Heap

 Given a suffix tree, we can build the position heap in linear time by using Westbrook’s approaches, and Berkman et al.’s approaches.

 By using Farach’s linear time suffix tree construction algorithm independent of the alphabet size, we can build a position heap of a string S in linear time independent of the alphabet size.

42 Using Position Heap as Suffix Array

 Define D[i] = the depth of the node labeled SA[i]

 Define E[i] = the depth of the node with rank i in preorder

43 0 $ b a 2 1 14 b a b $ a 3 4 13 5 7 $ b b a 12 6 8 10 a $ 11 9 + D = [ 1, 2, 3, 1, 2, 4, 3, 2, 1, 4, 3, 2, 3, 2] E = [ 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 4, 2, 3] = SA = [14, 3, 12, 1, 4, 9, 6, 13, 2, 11, 8, 5, 10, 7] 44 Using Position Heap as Suffix Array

 By using succinct partial rank and select query structures which support constant time query on arrays D and E, we can compute the SA[i] value in constant time.

 Theorem: We can add O(n log h) bits to a position heap, where h is the height of the heap, such that it supports access to the corresponding suffix array and inverse suffix array in O(1) time.

45 Using CSA as Position Heap

To represent a position heap, we need

 Its structure as a tree; balanced parentheses representation

 The nodes’ labels;  by using E, D, and SA -1  The edges’ labels;  by using SA and a bit-vector

 The maximal-reach pointers;  next slide, we will show

 The array A of pointers;  similar to previous one described in the previous slides

 The data structure B; similar to previous one described in the previous slides

46 0 b $ 2 14 1 a 7 b 13 5 10 4 b $ a 6 8 12 9 $ $ 11 10 7

3 $ 2 9 6 8 4 5 1 ( (*) ( (*) ( (*)**( (**) ) ) ) ( (*) (*( (*) ** ) ) ( (**) ) ) ) 47 Using CSA as Position Heap

( (*) ( (*) ( (*)**( (**) ) ) ) ( (*) (*( (*) ** ) ) ( (**) ) ) )

This pair represents the node labeled 9. Suppose we want to compute the maximal-reach pointer of the node labeled 6. First we compute SA-1[6] = 7. ; it is the 7-th star we visited Then we compute which parentheses pair encloses the 7-th star.

48 Thank you

49