Tries


Outline

  • review
  • tries
  • TSTs
  • applications
  • refinements

References: Algorithms in Java, Chapter 15; Intro to Algs and Data Structs, Section 6.2.

Copyright © 2007 by Robert Sedgewick and Kevin Wayne.

Review: summary of the performance of searching (symbol-table) algorithms

Frequency of execution of instructions in the inner loop:

    implementation    guarantee                       average case                       ordered      operations
                      search    insert    delete      search     insert     delete      iteration?   on keys
    BST               N         N         N           1.38 lg N  1.38 lg N  ?           yes          Comparable
    randomized BST    7 lg N    7 lg N    7 lg N      1.38 lg N  1.38 lg N  1.38 lg N   yes          Comparable
    red-black tree    2 lg N    2 lg N    2 lg N      lg N       lg N       lg N        yes          Comparable
    hashing           1*        1*        1*          1*         1*         1*          no           equals(), hashCode()

Q: Can we do better?

Review

Symbol tables.
  • Associate a value with a key.
  • Search for value given key.

Balanced trees
  • use between lg N and 2 lg N key comparisons
  • support ordered iteration and other operations

Hash tables
  • typically use 1-2 probes
  • require good hash function for each key type

Radix sorting
  • some keys are inherently digital
  • digital keys give linear and sublinear sorts

This lecture. Symbol tables for digital keys.

Digital keys (review)

Many commonly used key types are inherently digital (sequences of fixed-length characters).

Examples
  • Strings
  • 64-bit integers

    interface Digital
    {
        public int charAt(int k);   // k-th digit of the key
        public int length();        // number of digits in the key
        static int R();             // radix
    }

This lecture:
  • refer to fixed-length vs. variable-length strings
  • R different characters
  • key type implements charAt() and length() methods
  • code works for String and key types that implement Digital.

Digital keys in applications

Key = sequence of "digits."
  • DNA: sequence of a, c, g, t.
  • IPv6 address: sequence of 128 bits.
  • English words: sequence of lowercase letters.
  • Protein: sequence of amino acids A, C, ..., Y.
  • Credit card number: sequence of 16 decimal digits.
  • International words: sequence of Unicode characters.
  • Library call numbers: sequence of letters, numbers, periods.

This lecture. Key = string over ASCII alphabet.

String Set ADT

String set. Unordered collection of distinct strings.

API for StringSET.
  • add(key): insert the key into the set
  • contains(key): is the given key in the set?

Typical client: Dedup (remove duplicate strings from input)

    StringSET set = new StringSET();
    while (!StdIn.isEmpty())
    {
        String key = StdIn.readString();
        if (!set.contains(key))
        {
            set.add(key);                 // first occurrence: remember it
            System.out.println(key);      // and print it
        }
    }

This lecture: focus on StringSET implementation.
Same ideas improve STs with wider API.

StringSET implementation cost summary

    implementation    typical case                       dedup
                      Search hit    Insert    Space      moby    actors
    input *           L             L         L          0.26    15.1
    red-black         L + log N     log N     C          1.40    97.4
    hashing           L             L         C          0.76    40.6

    * only reads in data

    N = number of strings
    L = length of string
    C = number of characters in input
    R = radix

    actors: 82MB, 11.4M words, 900K distinct.
    moby: 1.2MB, 210K words, 32K distinct.

Challenge. Efficient performance for long keys (large L).

Tries

Tries. [from retrieval, but pronounced "try"]
  • Store characters in internal nodes, not keys.
  • Store records in external nodes.
  • Use the characters of the key to guide the search.

Ex. sells sea shells by the sea

[Figure: trie for the example keys]

Tries

Q. How to handle case when one key is a prefix of another?
A1. Append sentinel character '\0' to every key so it never happens.
A2. Store extra bit to denote which nodes correspond to keys.

Ex. she sells sea shells by the sea shore

[Figure: trie for the example keys]

Branching in tries

Q. How to branch to next level?
A. One link for each possible character.

Ex. sells sea shells by the sea shore

[Figure: R-way trie for the example keys, with R empty links on leaves]

R-Way Trie: Java Implementation

R-way existence trie. A node.
Node. Reference to R nodes.

    private class Node
    {
        Node[] next = new Node[R];   // one link per possible character
        boolean end;                 // true if a key ends at this node
    }

[Figure: 8-way trie with links a through h that represents {a, f, h}]

R-way trie implementation of StringSET

    public class StringSET
    {
        private static final int R = 128;     // ASCII
        private Node root = new Node();       // empty trie

        private class Node
        {
            Node[] next = new Node[R];
            boolean end;
        }

        public boolean contains(String s)
        {  return contains(root, s, 0);  }

        private boolean contains(Node x, String s, int i)
        {
            if (x == null) return false;
            if (i == s.length()) return x.end;
            char c = s.charAt(i);             // current digit
            return contains(x.next[c], s, i+1);
        }

        public void add(String s)
        // see continuation below
    }

R-way trie implementation of StringSET (continued)

    public void add(String s)
    {
        root = add(root, s, 0);
    }

    private Node add(Node x, String s, int i)
    {
        if (x == null) x = new Node();
        if (i == s.length()) x.end = true;    // all characters consumed: mark key
        else
        {
            char c = s.charAt(i);
            x.next[c] = add(x.next[c], s, i+1);
        }
        return x;
    }

R-way trie performance characteristics

Time
  • examine one character to move down one level in the trie
  • trie has ~log_R N levels (not many!)
  • need to check whole string for search hit (equality)
  • search miss only involves examining a few characters

Space
  • R empty links at each leaf
  • 65536-way branching for Unicode impractical

Bottom line.
  • method of choice for small R
  • you use tries every day
  • stay tuned for ways to address space waste

Sublinear search with tries

Tries enable user to present keys one char at a time.

Search hit
  • can present possible matches after a few digits
  • need to examine all L digits for equality

Search miss
  • could have mismatch on first character
  • typical case: mismatch on first few characters

Bottom line: sublinear search cost (only a few characters)

Further help
  • object equality test
  • cached hash values

StringSET implementation cost summary

    implementation    typical case                       dedup
                      Search hit    Insert    Space      moby    actors
    input *           L             L         L          0.26    15.1
    red-black         L + log N     log N     C          1.40    97.4
    hashing           L             L         C          0.76    40.6
    R-way trie        L             << L      RN + C     1.12    out of memory

    N = number of strings
    L = size of string
    C = number of characters in input
    R = radix

R-way trie
  • faster than hashing for small R
  • too much memory if R not small

65536-way trie for Unicode??

Challenge. Use less memory!

Digression: Out of memory?

"640 K ought to be enough for anybody."
- attributed to Bill Gates, 1981 (commenting on the amount of RAM in personal computers)

"64 MB of RAM may limit performance of some Windows XP features; therefore, 128 MB or higher is recommended for best performance."
- Windows XP manual, 2002

"64 bit is coming to desktops, there is no doubt about that. But apart from Photoshop, I can't think of desktop applications where you would need more than 4GB of physical memory, which is what you have to have in order to benefit from this technology. Right now, it is costly."
- Bill Gates, 2003

Digression: Out of memory?

A short (approximate) history

    machine      era      address    addressable    typical actual    cost
                          bits       memory         memory
    PDP-8        1960s    12         6K             6K                $16K
    PDP-10       1970s    18         256K           256K              $1M
    IBM S/360    1970s    24         4M             512K              $1M
    VAX          1980s    32         4G             1M                $1M
    Pentium      1990s    32         4G             1 GB              $1K
    Xeon         2000s    64         enough         4 GB              $100
    ??           future   128+       enough         enough            $1

A modest proposal

Number of atoms in the universe: < 2^266 (estimated)
Age of universe (estimated): 20 billion years ~ 2^50 secs < 2^80 nanoseconds

How many bits address every atom that ever existed?

A modest proposal: use a unique 512-bit address for every object.

512 bits is enough:
  • 266 bits: place
  • 80 bits: time
  • 174 bits: cushion for whatever

Current plan:
  • 128 bits: place (ipv6)
  • 64 bits: place (machine)

Use trie to map to current location. 64 8-bit chars
  • wastes 255/256 actual memory (only good for tiny memories or Gates)
  • need better use of memory

Ternary Search Tries (TSTs)

Ternary search tries. [Bentley-Sedgewick, 1997]
  • Store characters in internal nodes, records in external nodes.
  • Use the characters of the key to guide the search.
  • Each node has three children: left (smaller), middle (equal), right (larger).

Ex. sells sea shells by the sea shore

[Figure: TST for the example keys]

Observation. Only three null links in leaves!

26-Way Trie vs. TST

TST. Collapses empty links in 26-way trie.

Keys: now for tip ilk dim tag jot sob nob sky hut ace bet men egg few jay owl joy rap gig wee was cab wad caw cue fee tap ago tar jam dug and

[Figure: 26-way trie (1035 null links, not shown)]
[Figure: TST (155 null links)]

TST implementation of contains() for StringSET

Recursive code practically writes itself!

    public boolean contains(String s)
    {
        if (s.length() == 0) return false;
        return contains(root, s, 0);
    }

    private boolean contains(Node x, String s, int i)
    {
        if (x == null) return false;
        char c = s.charAt(i);
        if      (c < x.c) return contains(x.l, s, i);              // smaller: go left, same character
        else if (c > x.c) return contains(x.r, s, i);              // larger: go right, same character
        else if (i < s.length()-1) return contains(x.m, s, i+1);   // equal: go to middle, next character
        else return x.end;                                         // last character: is a key marked here?
    }

TST representation

A TST string set is a TST node.

TST implementation of add() for StringSET
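The text breaks off before the add() code is shown. As a minimal sketch only, and not the authors' slide code, insertion can be written to mirror the recursive contains() above. The class name TstStringSet is hypothetical; the node fields c, l, m, r and end are taken from the names used in contains(). Ignoring the empty string in add() is an assumption, chosen to match contains() returning false for it.

```java
// Hypothetical sketch of a TST-based string set; not the original slide code.
class TstStringSet {
    private Node root;                    // null for an empty set

    // Field names follow the contains() code above.
    private static class Node {
        char c;                           // character stored in this node
        Node l, m, r;                     // left (smaller), middle (equal), right (larger)
        boolean end;                      // true if some key ends at this node
    }

    public boolean contains(String s) {
        if (s.length() == 0) return false;
        return contains(root, s, 0);
    }

    private boolean contains(Node x, String s, int i) {
        if (x == null) return false;
        char c = s.charAt(i);
        if      (c < x.c) return contains(x.l, s, i);
        else if (c > x.c) return contains(x.r, s, i);
        else if (i < s.length() - 1) return contains(x.m, s, i + 1);
        else return x.end;
    }

    public void add(String s) {
        if (s.length() == 0) return;      // assumption: the empty string is never stored
        root = add(root, s, 0);
    }

    // Same three-way walk as contains(), creating nodes where a link is null.
    private Node add(Node x, String s, int i) {
        char c = s.charAt(i);
        if (x == null) {
            x = new Node();
            x.c = c;
        }
        if      (c < x.c) x.l = add(x.l, s, i);
        else if (c > x.c) x.r = add(x.r, s, i);
        else if (i < s.length() - 1) x.m = add(x.m, s, i + 1);
        else x.end = true;                // last character of the key: mark it
        return x;
    }
}
```

With the keys from the earlier example (she sells sea shells by the sea shore) added, contains("sea") is true, while contains("se") and contains("shell") are false because no key ends at those nodes. The end flag plays the role of the "extra bit" answer to the prefix question raised earlier.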
