String Algorithm

Tries and suffix trees

Alon Efrat
Computer Science Department
University of Arizona

Trie: a data structure for a set of words

• All words are over the alphabet Σ = {a, b, ..., z}. In the slides, the alphabet is restricted to {a, b, c, d}.
• S is a set of words, e.g. S = {a, aba, aca, addd}.
• We need to support the operations
    • insert(w): add a new word w into S.
    • delete(w): delete the word w from S.
    • find(w): is w in S?
• Future operation: given a text (many words), where is w in the text?
• The time for each operation should be O(k), where k is the number of letters in w.
• Usually each word is associated with additional information; this is not discussed here.

Trie (Tree + Retrieve) for S

• A tree in which each node is a struct consisting of

    struct node {
        struct node *ar[4];  /* one child pointer per letter of {a, b, c, d} */
        char flag;           /* 1 if a word ends at this node, otherwise 0 */
    };

• Rule: each node corresponds to a word w, and w is in S iff the flag of that node is 1.

A trie: example

S = {a, b, dbb}

[Figure: the trie for S = {a, b, dbb}]

• Note: the label of an edge is the label of the cell from which the edge exits. For a node p, the edge labeled b exits from the cell p->ar['b'-'a'].
• The nodes corresponding to w = "a" and w = "b" have flag = 1.
• The node corresponding to w = "d" has flag = 0.
• The node corresponding to w = "db" has flag = 0 (not in S).
• The node corresponding to w = "dbb" has flag = 1 (in S).

Finding if word w is in the tree

    p = root; i = 0
    while (1) {
        if w[i] == '\0'                 // we scanned all letters of w
            return the flag of p        // true/false
        if the entry of p corresponding to w[i] is NULL
            return false
        set p to the node pointed to by this entry, and set i++
    }

Inserting a word w

Recall that we need to modify the tree so that find(w) would return TRUE.

• Try to perform find(w).
• If it runs into a NULL pointer, create new node(s) along the path. The flag fields of all new nodes are 0.
• Set the flag of the last node to 1.

Inserting "cbb"

S = {a, b, dbb, cbb}

[Figure: the trie for S = {a, b, dbb} after inserting "cbb"]

• Try to perform find(cbb).
• It runs into NULL pointers, so new nodes are created along the path. The new nodes corresponding to w = "c" and w = "cb" have flag = 0.
• The flag of the last node, corresponding to w = "cbb", is set to 1.

Deleting a word w

• Find the node p corresponding to w (using the find operation).
• Set the flag field of p to 0.
• If p is dead (i.e. flag == 0 and all pointers are NULL), then free(p), set the pointer to it in its parent to NULL, set p = parent(p), and repeat this check.

Space requirements

• Let m be the sum of the number of characters of all words in S.
• The space required might be Θ(|Σ| m): for each letter of each word of S, we might need an array of size |Σ|.
• This might be an issue by itself, and might slow down performance.

Heuristics for space saving

• To save some space when Σ is larger, there are a few heuristics we can use. Assume Σ = {a, b, ..., z}.
• We use two types of nodes.
• Type "A" is used when the number of children of a node is more than 3. It holds the type, the flag, and one pointer per letter:

    type | flag | a | b | ... | z

  Note: the letters are not stored explicitly.
• Type "B" is used if there are 3 or fewer children. The letter of each child is also stored:

    type | flag | letter, pointer | letter, pointer | letter, pointer

  The role of the flag is the same as in type "A" nodes. We store only the 3 pointers, but we need to know which letters they correspond to.

Another heuristic: path compression

• Replace a long sequence of nodes, each of which happens to have only a single child, with a single node (of type "pointer to string") that keeps a pointer to the next node and a pointer to a string, e.g. "bbac\0".

Suffix tree

• Assume B (for book) is a long text.
• We want to preprocess B so that, when a word w is given, we can quickly find whether it is in B (incremental search), as well as its locations, how many times it appears, etc.
• We can find it in O(|w|).
• Idea: consider B as a long string, and create a trie T of all suffixes of B.
• In addition to the flag (specifying whether a word ends at the node), we also store the index in B where this word begins.
• To know where a word appears in B, we store with each node the index of the beginning of the suffix in B. (We can store only the first appearance of the word in the text.)
• Example: B = "aabab", S = {"aabab", "abab", "bab", "ab", "b"}.

Size of suffix tree

Example: B = "aabab" (positions 1 2 3 4 5), S = {"aabab", "abab", "bab", "ab", "b"}.

[Figure: the trie of all suffixes of B = "aabab", with the first index of each suffix shown in red]

• Assume n = |B|.
• The total length of all strings is Θ(n^2).
• The size of a node is |Σ|.
• So the size of the tree can be Θ(n^2 |Σ|).
• The time to construct the tree is Θ(n^2).
• Rather than (or in addition to) a flag, we store the first index in the book where the suffix starts.

Suffix tries on a diet

• Def: a shred is a path from node u to node v in the trie, consisting of nodes of outdegree 1 (except maybe the last one) and flag = 0.
• Obs: there is a contiguous part of B identical to the string the shred represents. We call this part the shred-string.
• We store B itself as an array.
• We use a new type of node, called a shred-node, that maintains only the indexes of the first (id1) and last (id2) letters of the shred-string in B:

    type | id1 | id2 | flag

• Example: for the shred "adbd" in B = "cadbdaadbd", id1 = 7 and id2 = 10.

Suffix tries on a diet, cont.

Algorithm for constructing a "thin" trie:

• Given B, create an empty trie T and insert all suffixes of B into T, generating a trie of size Θ(n^2).
• Traverse the trie, and each time a shred is seen, replace all nodes of the shred with a single shred-node.

Suffix tries on a diet, cont.

• Clearly the use of shred-nodes saves some space, but can we prove something?
• Observation: the number of leaves of T is at most n (every leaf is the end of one suffix).

Thanks for your patience. See you at the review.
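The find, insert, and delete procedures for a trie described in the slides can be sketched in C roughly as follows. This is a minimal illustration and not the lecturer's code: the names SIGMA, new_node, trie_find, trie_insert, trie_delete, and is_dead are made up for this sketch, the alphabet is assumed to be {a, b, c, d}, and recursion is used in delete in place of the parent(p) pointer mentioned in the slides.

```c
#include <assert.h>
#include <stdlib.h>

#define SIGMA 4  /* assumed alphabet {a, b, c, d}, as in the slides */

struct node {
    struct node *ar[SIGMA];  /* child pointers, indexed by letter - 'a' */
    char flag;               /* 1 if a word ends at this node, otherwise 0 */
};

/* Allocate a node with all pointers NULL and flag 0 (hypothetical helper). */
static struct node *new_node(void) {
    struct node *p = malloc(sizeof *p);
    assert(p != NULL);  /* sketch only: no real out-of-memory handling */
    for (int c = 0; c < SIGMA; c++) p->ar[c] = NULL;
    p->flag = 0;
    return p;
}

/* find(w): follow the letters of w; return the flag of the node reached. */
static int trie_find(const struct node *root, const char *w) {
    const struct node *p = root;
    for (int i = 0; w[i] != '\0'; i++) {
        p = p->ar[w[i] - 'a'];
        if (p == NULL) return 0;
    }
    return p->flag;
}

/* insert(w): walk as in find, creating missing nodes, then set the flag. */
static void trie_insert(struct node *root, const char *w) {
    struct node *p = root;
    for (int i = 0; w[i] != '\0'; i++) {
        int c = w[i] - 'a';
        if (p->ar[c] == NULL) p->ar[c] = new_node();
        p = p->ar[c];
    }
    p->flag = 1;
}

/* A node is dead if its flag is 0 and all its pointers are NULL. */
static int is_dead(const struct node *p) {
    if (p->flag) return 0;
    for (int c = 0; c < SIGMA; c++)
        if (p->ar[c] != NULL) return 0;
    return 1;
}

/* delete(w): clear the flag, then free dead nodes on the way back up.
   Returns 1 if p is dead afterwards; the caller frees it and clears
   its own pointer. The root is never freed here. */
static int trie_delete(struct node *p, const char *w) {
    if (*w == '\0') {
        p->flag = 0;
    } else {
        int c = *w - 'a';
        struct node *child = p->ar[c];
        if (child == NULL) return 0;  /* w is not in S: nothing to do */
        if (trie_delete(child, w + 1)) {
            free(child);
            p->ar[c] = NULL;
        }
    }
    return is_dead(p);
}
```

With S = {a, b, dbb}, calling trie_delete(root, "dbb") clears the flag of the "dbb" node and then frees the nodes for "dbb", "db", and "d" in turn, since each becomes dead once its only child is removed. Each operation visits one node per letter of w, matching the O(k) bound stated in the slides.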
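The suffix-trie idea (insert all suffixes of B, and store with each node the first index where its string begins) can be sketched in C as below. This is an illustrative sketch under assumptions: the names snode, new_snode, suffix_trie_build, and suffix_trie_find are hypothetical, the alphabet is again {a, b, c, d}, indexes are 1-based as in the slides, and the index is stored at every node (not only at nodes where a suffix ends) so that any substring of B can be located. It builds the plain Θ(n^2) trie, without shred-nodes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SIGMA 4  /* assumed alphabet {a, b, c, d}, as in the slides */

struct snode {
    struct snode *ar[SIGMA];  /* child pointers, indexed by letter - 'a' */
    int first;                /* 1-based index in B where this node's string first begins */
};

/* Allocate a node with all pointers NULL (hypothetical helper). */
static struct snode *new_snode(int first) {
    struct snode *p = malloc(sizeof *p);
    assert(p != NULL);  /* sketch only: no real out-of-memory handling */
    for (int c = 0; c < SIGMA; c++) p->ar[c] = NULL;
    p->first = first;
    return p;
}

/* Build the trie of all suffixes of B by plain insertion, Theta(n^2) time. */
static struct snode *suffix_trie_build(const char *B) {
    size_t n = strlen(B);
    struct snode *root = new_snode(0);
    for (size_t s = 0; s < n; s++) {      /* insert the suffix starting at B[s] */
        struct snode *p = root;
        for (size_t i = s; i < n; i++) {
            int c = B[i] - 'a';
            /* Suffixes are inserted left to right, so the suffix that
               creates a node gives the first occurrence of its string. */
            if (p->ar[c] == NULL) p->ar[c] = new_snode((int)s + 1);
            p = p->ar[c];
        }
    }
    return root;
}

/* Return the 1-based index of the first occurrence of w in B, or 0 if
   w does not occur (the empty word also yields 0). Runs in O(|w|). */
static int suffix_trie_find(const struct snode *root, const char *w) {
    const struct snode *p = root;
    for (int i = 0; w[i] != '\0'; i++) {
        p = p->ar[w[i] - 'a'];
        if (p == NULL) return 0;
    }
    return p->first;
}
```

For B = "aabab", searching for "ab" follows two edges from the root and returns 2, the position where "ab" first begins; searching for "bb" hits a NULL pointer after the first letter and returns 0.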
