Algorithms ROBERT SEDGEWICK | KEVIN WAYNE

Algorithms ROBERT SEDGEWICK | KEVIN WAYNE 5.2 TRIES ‣ string symbol tables ‣ R-way tries Algorithms ‣ ternary search tries FOURTH EDITION ‣ character-based operations ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu Last updated on 4/15/21 9:40 AM 5.2 TRIES ‣ string symbol tables ‣ R-way tries Algorithms ‣ ternary search tries ‣ character-based operations ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu Symbol tables: performance summary Review. Two classic symbol tables: red–black BSTs and hash tables. frequency of core operations ordered core operations implementation operations on keys search insert delete red–black BST log n log n log n ✔ compareTo() equals() hash table 1 † 1 † 1 † hashCode() † under uniform hashing assumption Q. Can we do better? A. Yes, if we can avoid examining the entire key, as with string sorting. 3 String symbol tables: performance summary Goal (for string keys). Faster than hashing, more flexible than BSTs. Benchmark. Count distinct words in a text file. exchange rate: L character accesses per hash around Θ(log n) character accesses per string compare character accesses (typical case) count distinct search search space implementation insert moby.txt actors.txt hit miss (references) 2 2 2 red–black BST L + log n log n log n 4 n 1.4 97.4 hashing L L L 4 n to 16 n 0.76 40.6 (linear probing) n = number of key–value pairs file size words distinct L = length of key moby.txt 1.2 MB 210 K 32 K R = radix actors.txt 82 MB 11.4 M 900 K 4 5.2 TRIES ‣ string symbol tables ‣ R-way tries Algorithms ‣ ternary search tries ‣ character-based operations ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu Tries Etymology. [ from retrieval, but pronounced “try” ] 6 Tries Abstract trie. Store characters in nodes (not keys). Each node has up to R children, one for each possible character in alphabet. link to trie containing all keys root that start with s link to trie containing all keys that start with she b s t y 4 e h h key value by 4 a 6 l e 0 o e 5 sea 6 sells 1 l l r value associated with she in node corresponding to she 0 last character in key shells 3 s 1 l e 7 shore 7 the 5 s 3 7 Tries: search hit Follow links corresponding to each character in the key. Search hit: node where search ends has a non-null value. Search miss: reach null link or node where search ends has null value. get("shells") b s t y 4 e h h a 6 l e 0 o e 5 l l r s 1 l e 7 return value in node corresponding to s 3 last character in key (return 3) 8 Tries: search hit Follow links corresponding to each character in the key. Search hit: node where search ends has a non-null value. Search miss: reach null link or node where search ends has null value. get("she") b s t y 4 e h h a 6 l e 0 o e 5 l l r search may terminate at an internal node (return 0) s 1 l e 7 s 3 9 Tries: search miss Follow links corresponding to each character in the key. Search hit: node where search ends has a non-null value. Search miss: reach null link or node where search ends has null value. get("shelter") b s t y 4 e h h a 6 l e 0 o e 5 l l r s 1 l e 7 no link to t s 3 (return null) 10 Tries: search miss Follow links corresponding to each character in the key. Search hit: node where search ends has a non-null value. Search miss: reach null link or node where search ends has null value. get("shell") b s t y 4 e h h a 6 l e 0 o e 5 l l r s 1 l e 7 no value associated node corresponding to s 3 last character in key (return null) 11 Tries: insertion Follow links corresponding to each character in the key. Encounter a null link: create new node. Encounter the last character of the key: set value in that node. put("shore", 7) b s t y 4 e h h a 6 l e 0 o e 5 l l r s 1 l e 7 s 3 12 Trie construction demo trie b s t y 4 e h h a 6 l e 0 o e 5 l l r s 1 l e 7 s 3 13 R-way tries: Java representation Node. A value, plus references to R nodes. private static class Node { private Object val; no generic array creation private Node[] next = new Node[R]; } characters are implicitlycharacters are implicitly defined by link definedindex by link indexs s s s e e h h e h e h e e a l a l a 2 l e 0a 2 l e2 0 2 0 0 l l l l each node has each node has an array of linksan array of links s sand a value and a value s 1 s 1 1 1 Trie representationTrie representation Remark. An R-way trie stores neither keys nor characters explicitly. 14 R-way tries: Java implementation public class TrieST<Value> { private static final int R = 256; extended ASCII private Node root = new Node(); private static class Node { /* see previous slide */ } public void put(String key, Value val) { root = put(root, key, val, 0); } private Node put(Node x, String key, Value val, int d) { if (x == null) x = new Node(); if (d == key.length()) { x.val = val; return x; } char c = key.charAt(d); x.next[c] = put(x.next[c], key, val, d+1); return x; } private Value get(String key) { /* similar, see book or booksite */ } } 15 R-way trie: performance Parameters. n = number of key–value pairs; L = length of key; R = alphabet size. Search hit. Θ(L). Search miss (worst case). Θ(L). sublinear in L Search miss (typical case). Θ(log R n). Space. At least Θ(nR) space. characters are implicitly defined by link index s s e h e h a l e a 2 l e 0 2 0 l l each node has an array of links s and a value at least R links per key s 1 1 Trie representation Bottom line. Fast search hit; even faster search miss; but wastes space. 16 Trie quiz 1 What is worst-case running time to insert a key of length L into an R-way trie that contains n key–value pairs? n = number of key–value pairs L = length of key A. Θ(L) R = alphabet size B. Θ(R + L) C. Θ(n + L) D. Θ(R L) might need to create L nodes, each containing an array of R links 17 String symbol table implementations cost summary character accesses (typical case) count distinct search search space implementation insert moby.txt actors.txt hit miss (references) 2 2 2 red–black BST L + log n log n log n 4 n 1.4 97.4 hashing L L L 4 n to 16 n 0.76 40.6 (linear probing) out of R-way trie L log R n R + L (R+1) n 1.12 memory R-way trie. Method of choice for small R. Effective for medium R. Too much memory for large R. Challenge. Use less memory, e.g., a 65,536-way trie for Unicode! 18 5.2 TRIES ‣ string symbol tables ‣ R-way tries Algorithms ‣ ternary search tries ‣ character-based operations ROBERT SEDGEWICK | KEVIN WAYNE https://algs4.cs.princeton.edu Ternary search tries Store characters and values in nodes (not keys). Each node has three children: smaller (left), equal (middle), larger (right). Fast Algorithms for Sorting and Searching Strings Jon L. Bentley* Robert Sedgewick# Abstract that is competitive with the most efficient string sorting We present theoretical algorithms for sorting and programs known. The second program is a symbol table searching multikey data, and derive from them practical C implementation that is faster than hashing, which is com- implementations for applications in which keys are charac- monly regarded as the fastest symbol table implementa- ter strings. The sorting algorithm blends Quicksort and tion. The symbol table implementation is much more radix sort; it is competitive with the best known C sort space-efficient than multiway trees, and supports more codes. The searching algorithm blends tries and binary advanced searches. search trees; it is faster than hashing and other commonly In many application programs, sorts use a Quicksort used search methods. The basic ideas behind the algo- implementation based on an abstract compare operation, rithms date back at least to the 1960s but their practical and searches use hashing or binary search trees. These do utility has been overlooked. We also present extensions to not take advantage of the properties of string keys, which more complex string problems, such as partial-match are widely used in practice. Our algorithms provide a nat- searching. ural and elegant way to adapt classical algorithms to this important class of applications. 1. Introduction Section 6 turns to more difficult string-searching prob- Section 2 briefly reviews Hoare’s [9] Quicksort and lems. Partial-match queries allow “don’t care” characters binary search trees. We emphasize a well-known isomor- (the pattern “so.a”, for instance, matches soda and sofa). phism relating the two, and summarize other basic facts.

Load more