String Algorithm
Tries and suffix trees
Alon Efrat
Computer Science Department
University of Arizona

Trie: a data structure for a set of words
• All words are over the alphabet Σ = {a, b, ..., z}. In the slides, we assume the alphabet is only {a, b, c, d}.
• S is a set of words, e.g. S = {a, aba, aca, addd}.
• We need to support the operations
  • insert(w): add a new word w into S.
  • delete(w): delete the word w from S.
  • find(w): is w in S?
• Future operation: given a text (many words), where is w in the text?
• The time for each operation should be O(k), where k is the number of letters in w.
• Usually each word is associated with additional info; this is not discussed here.

Trie (Tree + Retrieve) for S
• A tree where each node is a struct consisting of
    struct node {
        struct node *ar[4];  /* one pointer per letter of {a, b, c, d} */
        char flag;           /* 1 if a word ends at this node, otherwise 0 */
    };
• Rule: each node corresponds to a word w, and w is in S iff the flag of its node is 1.

A trie: example
• S = {a, b, dbb}.
• Note: the label of an edge is the label of the cell from which this edge exits. For a node p, the edge labelled b is the pointer p->ar['b'-'a'].
• The root has flag 0. The nodes for w = "a" and w = "b" have flag 1.
• The nodes for w = "d" and w = "db" are not in S, so flag = 0.
• The node for w = "dbb" is in S, so flag = 1.

Finding whether a word w is in the tree
    p = root; i = 0
    while (1) {
        if w[i] == '\0'              // we scanned all letters of w
            return the flag of p     // true/false
        if the entry of p corresponding to w[i] is NULL
            return false
        set p to the node pointed to by this entry, and set i++
    }

Inserting a word w
• Recall: we need to modify the tree so that find(w) would return TRUE.
• Try to perform find(w).
• If it runs into a NULL pointer, create new node(s) along the path.
• The flag field of every new node is 0.
• Set the flag of the last node to 1.
• Example: inserting "cbb" into the trie for S = {a, b, dbb}.

Deleting a word w
• Find the node p corresponding to w (using the find operation).
• Set the flag field of p to 0.
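The struct node, find, insert and delete slides can be sketched in C roughly as follows. This is a minimal sketch, not the lecturer's code: the names SIGMA, new_node, trie_find, trie_insert, trie_delete, is_dead and delete_rec are made up for illustration, words are assumed to contain only the letters a to d, and the slides' parent(p) step is replaced by recursion (an assumption, since the struct on the slide has no parent pointer). The delete sketch also frees nodes that are left with flag 0 and no children.

```c
#include <assert.h>
#include <stdlib.h>

#define SIGMA 4  /* alphabet {a, b, c, d}, as assumed in the slides */

struct node {
    struct node *ar[SIGMA]; /* ar[x - 'a'] is the child reached by letter x */
    char flag;              /* 1 if a word ends at this node, otherwise 0 */
};

/* Allocate a node with all pointers NULL and flag 0. */
struct node *new_node(void) {
    struct node *p = malloc(sizeof *p);
    assert(p != NULL); /* sketch only: abort if out of memory */
    for (int c = 0; c < SIGMA; c++) p->ar[c] = NULL;
    p->flag = 0;
    return p;
}

/* Return 1 if w is in the trie, 0 otherwise. O(k) for k letters in w. */
int trie_find(const struct node *root, const char *w) {
    const struct node *p = root;
    for (int i = 0; w[i] != '\0'; i++) {
        p = p->ar[w[i] - 'a'];
        if (p == NULL) return 0;   /* ran into a NULL pointer */
    }
    return p->flag;                /* scanned all letters of w */
}

/* Walk as in find, creating missing nodes, then mark the last node. */
void trie_insert(struct node *root, const char *w) {
    struct node *p = root;
    for (int i = 0; w[i] != '\0'; i++) {
        struct node **slot = &p->ar[w[i] - 'a'];
        if (*slot == NULL) *slot = new_node(); /* new nodes get flag 0 */
        p = *slot;
    }
    p->flag = 1;
}

/* A node is dead if its flag is 0 and all its pointers are NULL. */
int is_dead(const struct node *p) {
    if (p->flag) return 0;
    for (int c = 0; c < SIGMA; c++)
        if (p->ar[c] != NULL) return 0;
    return 1;
}

/* Clear the flag of the node for w, then free dead nodes on the way back
   up. Returns 1 if p itself was freed. The root is never freed. */
int delete_rec(struct node *p, const char *w, int is_root) {
    if (*w == '\0') {
        p->flag = 0;
    } else {
        struct node **slot = &p->ar[*w - 'a'];
        if (*slot == NULL) return 0;            /* w is not in the trie */
        if (delete_rec(*slot, w + 1, 0)) *slot = NULL;
    }
    if (!is_root && is_dead(p)) { free(p); return 1; }
    return 0;
}

void trie_delete(struct node *root, const char *w) {
    delete_rec(root, w, 1);
}
```

For instance, after inserting a, b and dbb into an empty root, trie_find(root, "dbb") returns 1 while trie_find(root, "db") returns 0, matching the flags in the example trie. Each operation touches one node per letter of w, which is where the O(k) bound comes from.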
Deleting a word w - cont
• If p is dead (i.e. flag == 0 and all its pointers are NULL), then free(p), set p = parent(p), and repeat this check.

Inserting "cbb" - cont
• After the insertion, S = {a, b, dbb, cbb}.
• Try to perform find(cbb). If it runs into a NULL pointer, create new node(s) along the path. The flag field of every new node is 0. Set the flag of the last node to 1.
• New nodes: w = "c" (flag 0), w = "cb" (flag 0), w = "cbb" (flag 1).
• Existing nodes on the d branch: w = "d" (flag 0), w = "db" (flag 0), w = "dbb" (in S, so flag = 1).

Space requirements
• Let m be the sum of the numbers of characters of all words in S.
• The space required might be Θ(|Σ| m): for each letter of each word of S, we need an array of size |Σ|.
• This might be an issue by itself, and might also slow down performance.

Heuristics for space saving
• If Σ is larger, there are a few heuristics we can use to save some space. Assume Σ = {a, b, ..., z}.
• We use two types of nodes.
• Type "A" is used when the number of children of a node is more than 3.
  • Fields of a type "A" node: type, flag, and one pointer for each of a, b, ..., z.
  • Note: the letters are not stored explicitly.
• Type "B" is used if there are 3 or fewer children.
  • The letter of each child is also stored.
  • Fields of a type "B" node: type, flag, and three (letter, pointer) pairs, e.g. for the letters B, F, R.
  • The role of the flag is the same as in type "A" nodes.
  • We only store the 3 pointers, but we need to know which letters they correspond to.

Another heuristic: path compression
• Replace a long sequence of nodes, each of which happens to have only a single child, with a single node (of type "pointer to string") that keeps a pointer to the next node and a pointer to a string.
• Example: a chain of nodes spelling bbac becomes one node pointing to the string "bbac\0".

Suffix tree
• Assume B (for book) is a long text.
• Example: B = "aabab", S = {"aabab", "abab", "bab", "ab", "b"}.
• We want to preprocess B so that, when a word w is given, we can quickly find whether it is in B.
Suffix tree - cont
• We also want the locations of w in B, how many times it appears, etc.
• We can find w in O(|w|) time (incremental search).
• Idea: consider B as a long string, and create a trie T of all suffixes of B.
• In addition to the flag (specifying whether a word ends at the node), we also store the index in B where this word begins.
• To know where a word appears in B, we store with each node the index of the beginning of the suffix in B. (We can store only the first appearance of the word in the text.)
• Example: B = "aabab", with positions 1 2 3 4 5, and S = {"aabab", "abab", "bab", "ab", "b"}.

Size of suffix tree
• Assume n = |B|.
• The total length of all suffixes is Θ(n²).
• The size of a node is |Σ|.
• So the size of the tree is Θ(n² |Σ|).
• The time to construct the tree is Θ(n²).
• In addition to the flag (or rather than the flag), we store the first index (in the book) where the suffix starts (shown in red in the figure).

Suffix tries on a diet
• Def: a shred is a path from node u to node v in the trie, consisting of nodes of outdegree 1 (except maybe the last one) and flag = 0.
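To make the suffix trie and the shred definition concrete, here is a small C sketch that inserts all suffixes of B, stores the first index with each node, and counts the nodes that have outdegree 1 and flag 0 (the kind of node a shred is made of). The names snode, snode_new, suffix_trie_build, suffix_trie_find and count_nodes are hypothetical, and the sketch assumes the alphabet {a, b, c, d} and 1-based positions as in the slides.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SIGMA 4  /* alphabet {a, b, c, d}, as assumed in the slides */

struct snode {
    struct snode *ar[SIGMA];
    char flag;  /* 1 if a suffix of B ends at this node */
    int first;  /* 1-based index in B where this node's word first begins */
};

struct snode *snode_new(int first) {
    struct snode *p = malloc(sizeof *p);
    assert(p != NULL); /* sketch only: abort if out of memory */
    for (int c = 0; c < SIGMA; c++) p->ar[c] = NULL;
    p->flag = 0;
    p->first = first;
    return p;
}

/* Insert every suffix of B, longest first. A node is created by the
   leftmost suffix that passes through it, so `first` is the first
   appearance of the node's word. Theta(n^2) time in the worst case. */
struct snode *suffix_trie_build(const char *B) {
    struct snode *root = snode_new(0);
    size_t n = strlen(B);
    for (size_t s = 0; s < n; s++) {
        struct snode *p = root;
        for (size_t i = s; i < n; i++) {
            struct snode **slot = &p->ar[B[i] - 'a'];
            if (*slot == NULL) *slot = snode_new((int)s + 1);
            p = *slot;
        }
        p->flag = 1;
    }
    return root;
}

/* For a non-empty w: the 1-based index of the first appearance of w in B,
   or 0 if w does not appear. Takes O(|w|) steps. */
int suffix_trie_find(const struct snode *root, const char *w) {
    const struct snode *p = root;
    for (; *w != '\0'; w++) {
        p = p->ar[*w - 'a'];
        if (p == NULL) return 0;
    }
    return p->first;
}

/* Count all nodes, and those with outdegree 1 and flag 0. */
void count_nodes(const struct snode *p, int *total, int *thin) {
    int deg = 0;
    for (int c = 0; c < SIGMA; c++) {
        if (p->ar[c] != NULL) {
            deg++;
            count_nodes(p->ar[c], total, thin);
        }
    }
    (*total)++;
    if (deg == 1 && p->flag == 0) (*thin)++;
}
```

For B = "aabab" this builds 12 nodes including the root, 5 of which have outdegree 1 and flag 0, and suffix_trie_find returns 2 for "ab" and 0 for "bb". Inserting the suffixes from left to right is a design choice of the sketch: it makes the index stored at node creation the first appearance without any later update.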
Suffix tries on a diet - cont
• Obs: there is a contiguous part of B identical to the string the shred represents. We call this part the shred-string.
• We store B itself as an array.
• We use a new type of node, called a shred-node, that maintains only the indexes of the first (id1) and last (id2) letters of the shred-string in B.
• Fields of a shred-node: type, the pointers a b c d, id1, id2, flag.
• Example: for the shred of "adbd" in B = "cadbdaadbd", id1 = 7 and id2 = 10.

Algorithm for constructing a "thin" trie:
• Given B, create an empty trie T and insert all suffixes of B into T, generating a trie of size Θ(n²).
• Traverse the trie, and each time a shred is seen, replace all nodes of the shred with a single shred-node.

Suffix tries on a diet - cont
• Clearly the use of shred-nodes saves some space, but can we prove something?
• Observation: the number of leaves of T is at most n (every leaf is the end of one suffix).

Thanks for your patience. See you at the review.
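As a supplementary sketch for the shred-nodes described above: the C fragment below stores only id1 and id2 and compares a query word against the shred-string directly inside B. The names shred, shred_len and shred_match are hypothetical, the struct omits the type, pointer and flag fields of a full shred-node, and the indexes are 1-based and inclusive as in the "adbd" example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, reduced shred-node: only the two indexes into B. */
struct shred {
    int id1;  /* 1-based index in B of the first letter of the shred-string */
    int id2;  /* 1-based index in B of the last letter of the shred-string */
};

/* Number of letters in the shred-string B[id1..id2]. */
int shred_len(struct shred s) {
    return s.id2 - s.id1 + 1;
}

/* Compare w against the shred-string, letter by letter, reading from B.
   Returns the number of letters matched when either w or the shred-string
   ends, or -1 at the first mismatch. */
int shred_match(const char *B, struct shred s, const char *w) {
    int k = 0;
    while (k < shred_len(s) && w[k] != '\0') {
        if (w[k] != B[s.id1 - 1 + k]) return -1;  /* B is 0-based in C */
        k++;
    }
    return k;
}
```

With B = "cadbdaadbd" and a shred with id1 = 7 and id2 = 10, shred_match returns 4 for "adbda" (the whole shred-string is consumed and the search would continue at the next node), 2 for "ad", and -1 for "adc". The point of the representation is that a shred of any length costs two integers instead of one full node per letter.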