Algorithms ROBERT SEDGEWICK | KEVIN WAYNE

FOURTH EDITION

https://algs4.cs.princeton.edu

Last updated on 4/15/21 9:40 AM

5.2 TRIES

‣ string symbol tables
‣ R-way tries
‣ ternary search tries
‣ character-based operations

Symbol tables: performance summary

Review. Two classic symbol tables: red–black BSTs and hash tables.

implementation   search   insert   delete   ordered operations   operations on keys
red–black BST    log n    log n    log n    ✔                    compareTo()
hash table       1 †      1 †      1 †                           equals(), hashCode()

Columns search, insert, delete: frequency of core operations.
† under uniform hashing assumption

Q. Can we do better? A. Yes, if we can avoid examining the entire key, as with string sorting.

String symbol tables: performance summary

Goal (for string keys). Faster than hashing, more flexible than BSTs.
Benchmark. Count distinct words in a text file.
Exchange rate. L character accesses per hash; around Θ(log n) character accesses per string compare.

                            character accesses (typical case)                            count distinct
implementation              search hit    search miss   insert    space (references)   moby.txt   actors.txt
red–black BST               L + log² n    log² n        log² n    4 n                  1.4        97.4
hashing (linear probing)    L             L             L         4 n to 16 n          0.76       40.6

n = number of key–value pairs
L = length of key
R = radix

file         size     words    distinct
moby.txt     1.2 MB   210 K    32 K
actors.txt   82 MB    11.4 M   900 K

Tries

Etymology. [ from retrieval, but pronounced “try” ]

Abstract trie. Store characters in nodes (not keys). Each node has up to R children, one for each possible character in the alphabet.

[Figure: trie for the keys below. From the root, the link for s leads to the trie containing all keys that start with s; following s, h, e leads to the trie containing all keys that start with she. The value associated with she (0) is stored in the node corresponding to the last character in the key.]

key      value
by       4
sea      6
sells    1
she      0
shells   3
shore    7
the      5

Tries: search hit

Follow links corresponding to each character in the key. Search hit: node where search ends has a non-null value. Search miss: reach null link or node where search ends has null value.

get("shells")

[Figure: search path s, h, e, l, l, s in the trie. Return the value in the node corresponding to the last character in the key (return 3).]

get("she")

[Figure: search path s, h, e in the trie. Search may terminate at an internal node (return 0).]

Tries: search miss

get("shelter")

[Figure: search path s, h, e, l in the trie. No link to t (return null).]

get("shell")

[Figure: search path s, h, e, l, l in the trie. No value associated with the node corresponding to the last character in the key (return null).]

Tries: insertion

Follow links corresponding to each character in the key. Encounter a null link: create new node. Encounter the last character of the key: set value in that node.

put("shore", 7)

[Figure: the trie after put("shore", 7); the value 7 is stored in the node corresponding to the final e.]

Trie construction demo

[Figure: trie built from the keys by, sea, sells, she, shells, shore, the.]

R-way tries: Java representation

Node. A value, plus references to R nodes.

   private static class Node
   {
      private Object val;                  // Object instead of Value: no generic array creation
      private Node[] next = new Node[R];
   }

[Figure: trie representation. Each node has an array of links and a value; characters are implicitly defined by link index.]

Remark. An R-way trie stores neither keys nor characters explicitly.

R-way tries: Java implementation

   public class TrieST<Value>
   {
      private static final int R = 256;   // extended ASCII
      private Node root = new Node();

      private static class Node
      { /* see previous slide */ }

      public void put(String key, Value val)
      { root = put(root, key, val, 0); }

      private Node put(Node x, String key, Value val, int d)
      {
         if (x == null) x = new Node();                    // null link: create new node
         if (d == key.length()) { x.val = val; return x; } // past last character: set value
         char c = key.charAt(d);
         x.next[c] = put(x.next[c], key, val, d+1);
         return x;
      }

      public Value get(String key)
      { /* similar, see book or booksite */ }
   }
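The slide leaves get() to the book. As a hedged illustration only, here is a minimal self-contained sketch of how put() and get() fit together. The class name TrieSketch and the main() demo are assumptions made for this example and are not the book's TrieST; keys are assumed to use extended ASCII (character codes below 256).

```java
// A minimal sketch of an R-way trie symbol table (put and get only).
// Assumption: every key character has a code below R = 256.
class TrieSketch<Value> {
    private static final int R = 256;   // alphabet size (extended ASCII)
    private Node root = new Node();

    private static class Node {
        Object val;                     // Object, because Java forbids generic array creation
        Node[] next = new Node[R];      // next[c] is the link for character c
    }

    // Insert the key-value pair, overwriting the old value if the key is present.
    public void put(String key, Value val) {
        root = put(root, key, val, 0);
    }

    private Node put(Node x, String key, Value val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) { x.val = val; return x; }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    // Return the value associated with key, or null on a search miss.
    @SuppressWarnings("unchecked")
    public Value get(String key) {
        Node x = get(root, key, 0);
        if (x == null) return null;     // miss: reached a null link
        return (Value) x.val;           // null here means the node holds no value
    }

    // Return the node reached by following the characters of key, or null.
    private Node get(Node x, String key, int d) {
        if (x == null) return null;
        if (d == key.length()) return x;
        char c = key.charAt(d);
        return get(x.next[c], key, d + 1);
    }

    public static void main(String[] args) {
        TrieSketch<Integer> st = new TrieSketch<>();
        String[] keys = { "she", "sells", "sea", "shells", "by", "the", "shore" };
        int[] vals    = { 0, 1, 6, 3, 4, 5, 7 };
        for (int i = 0; i < keys.length; i++) st.put(keys[i], vals[i]);
        System.out.println(st.get("shells"));   // hit at a leaf
        System.out.println(st.get("she"));      // hit at an internal node
        System.out.println(st.get("shelter"));  // miss: null link
        System.out.println(st.get("shell"));    // miss: node without a value
    }
}
```

The four calls in main() mirror the search hit and search miss examples from the earlier slides.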

R-way trie: performance

Parameters. n = number of key–value pairs; L = length of key; R = alphabet size.

Search hit. Θ(L).
Search miss (worst case). Θ(L).
Search miss (typical case). Θ(log_R n), which is sublinear in L.

Space. At least Θ(nR) space: at least R links per key.

[Figure: trie representation. Each node has an array of links and a value; characters are implicitly defined by link index.]

Bottom line. Fast search hit; even faster search miss; but wastes space.

Trie quiz 1

What is the worst-case running time to insert a key of length L into an R-way trie that contains n key–value pairs?

n = number of key–value pairs
L = length of key
R = alphabet size

A. Θ(L)
B. Θ(R + L)
C. Θ(n + L)
D. Θ(R L)    [might need to create L nodes, each containing an array of R links]

String symbol table implementations cost summary

                            character accesses (typical case)                            count distinct
implementation              search hit    search miss   insert    space (references)   moby.txt   actors.txt
red–black BST               L + log² n    log² n        log² n    4 n                  1.4        97.4
hashing (linear probing)    L             L             L         4 n to 16 n          0.76       40.6
R-way trie                  L             log_R n       R + L     (R+1) n              1.12       out of memory

R-way trie. Method of choice for small R. Effective for medium R. Too much memory for large R.

Challenge. Use less memory, e.g., a 65,536-way trie for Unicode!
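To see why large R is a problem, a small calculation with the (R+1) n space figure from the cost summary helps. The sketch below is only illustrative: the class name TrieSpaceSketch is made up for this example, n = 900,000 is the number of distinct words in actors.txt, and 8 bytes per reference is an assumption about the JVM, not something stated in the slides.

```java
// Rough space estimate for an R-way trie, using the (R + 1) n figure
// from the cost summary. Bytes per reference is an assumption.
class TrieSpaceSketch {
    // Number of references: (R + 1) n.
    static long references(long n, long R) {
        return (R + 1) * n;
    }

    public static void main(String[] args) {
        long n = 900_000;                 // distinct words in actors.txt
        long bytesPerReference = 8;       // assumption: 64-bit references
        for (long R : new long[] { 256, 65_536 }) {
            long refs = references(n, R);
            System.out.printf("R = %d: %d references, about %.1f GB%n",
                              R, refs, refs * bytesPerReference / 1e9);
        }
    }
}
```

Under these assumptions, R = 256 already needs 231,300,000 references (on the order of 2 GB), and R = 65,536 needs 58,983,300,000 references, which is why the R-way trie is impractical for large alphabets.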

Ternary search tries

Store characters and values in nodes (not keys). Each node has three children: smaller (left), equal (middle), larger (right).

[Figure: first page of the paper "Fast Algorithms for Sorting and Searching Strings" by Jon L. Bentley (Bell Labs, Lucent Technologies) and Robert Sedgewick (Princeton University).]

[Figure: an abstract trie and the corresponding TST. In the TST, the first characters of the keys form a mini BST (drawn in red); abstract trie links are drawn in black. From the root node s, one link leads to the TST for all keys that start with a character less than s, and another to the TST for all keys that start with s. Each node has 3 children.]

Trie quiz 2

Which keys are stored in the designated subtrie of the TST?

A. Strings that start with s.
B. Strings that start with se.
C. Strings that start with sh.
D. Strings that start with she.

[Figure: TST with one subtrie designated.]

Search hit in a TST

get("sea")

[Figure: search path for sea in the TST. Return the value in the node corresponding to the last character in the key.]

Search miss in a TST

get("shelter")

[Figure: search path for shelter in the TST. No link to t (return null).]

Search in a TST

Compare the current key character to the character in the node and follow links accordingly:
  If less, go left.
  If greater, go right.
  If equal, go middle and advance to the next key character.

Search hit. Node where search ends has a non-null value. Search miss. Either (1) reach a null link or (2) node where search ends has null value.

Trie quiz 3

Which value is associated with the key CAC?

A. 3
B. 4
C. 5
D. null

[Figure: TST over the characters A, C, G holding the values 0 through 6.]

Ternary search trie construction demo

[Figure: ternary search trie built from the keys by, sea, sells, she, shells, shore, the.]

Trie quiz 4

In which subtrie would the key CCC be inserted?

[Figure: TST over the characters A, C, G with five candidate positions labeled A through E.]

26-way trie vs. TST

26-way trie. 26 null links in each leaf.
TST. 3 null links in each leaf.

[Figure: 26-way trie (1035 null links, not shown) and TST (155 null links) for the keys now, for, tip, ilk, dim, tag, jot, sob, nob, sky, hut, ace, bet, men, egg, few, jay, owl, joy, rap, gig, wee, was, cab, wad, caw, cue, fee, tap, ago, tar, jam, dug, and.]

TST representation in Java

A TST node is five fields:
  A value.
  A character.
  A reference to a left TST.
  A reference to a middle TST.
  A reference to a right TST.

   private class Node
   {
      private Value val;
      private char c;
      private Node left, mid, right;
   }

[Figure: trie node representations. A standard array of links (R = 26) next to a ternary search tree (TST) node; in both, one link leads to the keys that start with s and another to the keys that start with su.]

TST: Java implementation

   public class TST<Value>
   {
      private Node root;

      private class Node
      { /* see previous slide */ }

      public Value get(String key)
      { return get(root, key, 0); }

      private Value get(Node x, String key, int d)
      {
         if (x == null) return null;                  // miss: null link
         char c = key.charAt(d);
         if      (c < x.c)              return get(x.left,  key, d);
         else if (c > x.c)              return get(x.right, key, d);
         else if (d < key.length() - 1) return get(x.mid,   key, d+1);
         else                           return x.val; // last character: hit if non-null
      }

      public void put(String key, Value val)
      { /* similar, see book or booksite */ }
   }
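The slide defers put() to the book. The following is a minimal sketch of how put() can mirror get() in a TST. The class name TstSketch, the rejection of empty keys, and the main() demo are assumptions of this example, not part of the slides.

```java
// A minimal sketch of a ternary search trie (put and get only).
// Assumption: keys are nonempty strings.
class TstSketch<Value> {
    private Node root;

    private class Node {
        Value val;                // value, if a key ends at this node
        char c;                   // character stored in this node
        Node left, mid, right;    // smaller, equal, larger
    }

    public void put(String key, Value val) {
        if (key.isEmpty()) throw new IllegalArgumentException("key must be nonempty");
        root = put(root, key, val, 0);
    }

    private Node put(Node x, String key, Value val, int d) {
        char c = key.charAt(d);
        if (x == null) { x = new Node(); x.c = c; }     // null link: create node for c
        if      (c < x.c)              x.left  = put(x.left,  key, val, d);
        else if (c > x.c)              x.right = put(x.right, key, val, d);
        else if (d < key.length() - 1) x.mid   = put(x.mid,   key, val, d + 1);
        else                           x.val   = val;   // last character: set value
        return x;
    }

    public Value get(String key) {
        if (key.isEmpty()) throw new IllegalArgumentException("key must be nonempty");
        return get(root, key, 0);
    }

    private Value get(Node x, String key, int d) {
        if (x == null) return null;
        char c = key.charAt(d);
        if      (c < x.c)              return get(x.left,  key, d);
        else if (c > x.c)              return get(x.right, key, d);
        else if (d < key.length() - 1) return get(x.mid,   key, d + 1);
        else                           return x.val;
    }

    public static void main(String[] args) {
        TstSketch<Integer> st = new TstSketch<>();
        String[] keys = { "she", "sells", "sea", "shells", "by", "the", "shore" };
        int[] vals    = { 0, 1, 6, 3, 4, 5, 7 };
        for (int i = 0; i < keys.length; i++) st.put(keys[i], vals[i]);
        System.out.println(st.get("sea"));      // search hit
        System.out.println(st.get("shelter"));  // search miss
    }
}
```

Unlike the R-way trie, each node here holds exactly three links regardless of alphabet size, which is where the space savings come from.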

String symbol table implementation cost summary

                            character accesses (typical case)                            count distinct
implementation              search hit    search miss   insert       space (references)   moby.txt   actors.txt
red–black BST               L + log² n    log² n        log² n       4 n                  1.4        97.4
hashing (linear probing)    L             L             L            4 n to 16 n          0.76       40.6
R-way trie                  L             log_R n       R + L        (R+1) n              1.12       out of memory
TST                         L + log n     log n         L + log n    4 n                  0.72       38.7

Bottom line. TST is as fast as hashing (for string keys) and space efficient.

Autocompletion

Autocompletion. [ in a cell phone, search bar, text editor, shell, ... ]
  User types characters one at a time.
  System reports all matching strings.

Prefix matches

Prefix matches. Find all keys in symbol table that start with a given prefix.

Ex 1. Prefix = "sh" ⟹ matches = "she", "shells", and "shore".
Ex 2. Prefix = "se" ⟹ matches = "sea" and "sells".

key      value
by       4
sea      6
sells    1
she      0
shells   3
shore    7
the      5

Warmup: ordered iteration

To iterate over all keys in sorted order: Do inorder traversal of trie; add keys encountered to a queue. Maintain sequence of characters on path from root to node.

prefix    queue
b
by        by
s
se
sea       by sea
sel
sell
sells     by sea sells
sh
she       by sea sells she
shel
shell
shells    by sea sells she shells
sho
shor
shore     by sea sells she shells shore
t
th
the       by sea sells she shells shore the

[Figure: trie for the keys by, sea, sells, she, shells, shore, the.]

Ordered iteration: Java implementation

   public Iterable<String> keys()
   {
      Queue<String> queue = new Queue<String>();
      collect(root, "", queue);
      return queue;
   }

   // prefix = sequence of characters on path from root to x
   private void collect(Node x, String prefix, Queue<String> queue)
   {
      if (x == null) return;
      if (x.val != null) queue.enqueue(prefix);
      for (char c = 0; c < R; c++)
         collect(x.next[c], prefix + c, queue);   // or use StringBuilder
   }
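The slide notes that a StringBuilder could replace the string concatenation in collect(). Below is a hedged sketch of that variant. The class name TrieKeysSketch, the iterative put(), and the use of java.util.ArrayDeque in place of the book's Queue class are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch: ordered iteration over the keys of an R-way trie,
// maintaining the current path in a single StringBuilder.
class TrieKeysSketch {
    private static final int R = 256;           // extended ASCII (assumption)
    private final Node root = new Node();

    private static class Node {
        Integer val;                            // null means no key ends here
        Node[] next = new Node[R];
    }

    public void put(String key, int val) {
        Node x = root;
        for (int d = 0; d < key.length(); d++) {
            char c = key.charAt(d);
            if (x.next[c] == null) x.next[c] = new Node();
            x = x.next[c];
        }
        x.val = val;
    }

    public Iterable<String> keys() {
        Queue<String> queue = new ArrayDeque<>();
        collect(root, new StringBuilder(), queue);
        return queue;
    }

    // prefix holds the sequence of characters on the path from the root to x
    private void collect(Node x, StringBuilder prefix, Queue<String> queue) {
        if (x == null) return;
        if (x.val != null) queue.add(prefix.toString());
        for (char c = 0; c < R; c++) {
            prefix.append(c);                           // descend along link c
            collect(x.next[c], prefix, queue);
            prefix.deleteCharAt(prefix.length() - 1);   // backtrack
        }
    }

    public static void main(String[] args) {
        TrieKeysSketch st = new TrieKeysSketch();
        String[] keys = { "she", "sells", "sea", "shells", "by", "the", "shore" };
        for (int i = 0; i < keys.length; i++) st.put(keys[i], i);
        for (String key : st.keys()) System.out.println(key);
    }
}
```

Appending and then deleting one character per link avoids building a new string for each recursive call, at the cost of slightly more bookkeeping.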

Prefix matches in an R-way trie

keysWithPrefix("sh");

[Figure: prefix match in a trie. Find the subtrie for all keys beginning with "sh", then collect the keys in that subtrie; the queue ends up holding she, shells, shore.]
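The two steps in the figure (find the subtrie for the prefix, then collect its keys) can be sketched as follows. This is an illustrative sketch rather than the book's code: the class name TriePrefixSketch, the iterative put(), and the use of a java.util.List for results are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: prefix matches in an R-way trie.
class TriePrefixSketch {
    private static final int R = 256;           // extended ASCII (assumption)
    private final Node root = new Node();

    private static class Node {
        Integer val;                            // null means no key ends here
        Node[] next = new Node[R];
    }

    public void put(String key, int val) {
        Node x = root;
        for (int d = 0; d < key.length(); d++) {
            char c = key.charAt(d);
            if (x.next[c] == null) x.next[c] = new Node();
            x = x.next[c];
        }
        x.val = val;
    }

    // Return all keys that start with the given prefix, in sorted order.
    public List<String> keysWithPrefix(String prefix) {
        List<String> results = new ArrayList<>();
        Node x = root;
        // Step 1: find the subtrie for all keys beginning with prefix.
        for (int d = 0; d < prefix.length() && x != null; d++)
            x = x.next[prefix.charAt(d)];
        // Step 2: collect the keys in that subtrie.
        collect(x, prefix, results);
        return results;
    }

    private void collect(Node x, String prefix, List<String> results) {
        if (x == null) return;
        if (x.val != null) results.add(prefix);
        for (char c = 0; c < R; c++)
            collect(x.next[c], prefix + c, results);
    }

    public static void main(String[] args) {
        TriePrefixSketch st = new TriePrefixSketch();
        String[] keys = { "she", "sells", "sea", "shells", "by", "the", "shore" };
        for (int i = 0; i < keys.length; i++) st.put(keys[i], i);
        System.out.println(st.keysWithPrefix("sh"));
        System.out.println(st.keysWithPrefix("se"));
    }
}
```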

T9 texting (predictive texting)

Goal. Type text messages on a phone keypad.

Multi-tap input. Enter a letter by repeatedly pressing a key.
Ex. good: 4 6 6 6 6 6 6 3

T9 text input (on 4 billion handsets). “a much faster and more fun way to enter text”
  Find all words that correspond to given sequence of numbers.
  4663: good, home, gone, hoof. [ textonyms ]
  Press * to select next option.
  Press 0 to see all completion options.
  System adapts to user’s tendencies.

http://www.t9.com

T9 TEXTING

Q. How to implement T9 texting on a mobile phone?
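One possible answer, offered only as a sketch: store the dictionary in a trie keyed by digit sequences rather than letters, and keep a list of words in each node. Everything here is an assumption made for illustration (the class name T9Sketch, the method names add and textonyms, lowercase a to z words only, and digit-only queries); the slides do not prescribe a solution, and adaptation to the user's tendencies is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: T9 lookup with a trie whose links are keypad digits.
// Assumptions: words contain only lowercase a-z; queries contain only digits.
class T9Sketch {
    // Keypad digit for each letter a..z: abc=2, def=3, ghi=4, jkl=5, mno=6, pqrs=7, tuv=8, wxyz=9.
    private static final String DIGITS = "22233344455566677778889999";

    private static class Node {
        Node[] next = new Node[10];                 // one link per digit (only 2-9 are used)
        List<String> words = new ArrayList<>();     // words whose digit sequence ends here
    }

    private final Node root = new Node();

    // Add a dictionary word under its digit sequence.
    public void add(String word) {
        Node x = root;
        for (int i = 0; i < word.length(); i++) {
            int digit = DIGITS.charAt(word.charAt(i) - 'a') - '0';
            if (x.next[digit] == null) x.next[digit] = new Node();
            x = x.next[digit];
        }
        x.words.add(word);
    }

    // Return all words that correspond to the given sequence of digits.
    public List<String> textonyms(String digits) {
        Node x = root;
        for (int i = 0; i < digits.length() && x != null; i++)
            x = x.next[digits.charAt(i) - '0'];
        if (x == null) return new ArrayList<>();
        return x.words;
    }

    public static void main(String[] args) {
        T9Sketch t9 = new T9Sketch();
        for (String w : new String[] { "good", "home", "gone", "hoof", "hello" }) t9.add(w);
        System.out.println(t9.textonyms("4663"));
    }
}
```

Words are returned in insertion order, so a real system would presumably order the dictionary by frequency; completion options for a partial digit sequence could be gathered by collecting the subtrie below the node reached, as in the prefix match operation.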

Network router IP address lookup

IP address lookup. To send a packet toward destination IP address x, a network router finds the longest IP address in its routing table that is a prefix of x.

A backbone router might have 1M entries in its routing table and process millions of packets per second. For IPv4, addresses are represented as 32-bit binary numbers (instead of strings).

routing table
destination (key)    gateway (value)
128
128.112
128.112.055
128.112.055.15
128.112.136
128.112.155.11
128.112.155.13
128.222
128.222.136

longestPrefixOf(128.112.136.11) = 128.112.136
longestPrefixOf(128.112.100.16) = 128.112
longestPrefixOf(128.166.123.45) = 128

Note. Not the same as floor: floor(128.112.100.16) = 128.112.055.15

Longest prefix match

Longest prefix match. Find longest key in symbol table that is a prefix of query string.

Ex 1. Query = "shellsort" ⟹ match = "shells".
Ex 2. Query = "sheep" ⟹ match = "she".

key      value
by       4
sea      6
sells    1
she      0
shells   3
shore    7
the      5

Longest prefix match in an R-way trie

Longest prefix match. Find longest key in symbol table that is a prefix of query string.
  Search for query string.
  Keep track of longest key encountered.

[Figure: possibilities for longestPrefixOf().
  "she": search ends at end of string; value is not null; return she.
  "shell": search ends at end of string; value is null; return she (last key on path).
  "shellsort": search ends at null link; return shells (last key on path).
  "shelters": search ends at null link; return she (last key on path).]
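The idea of searching for the query while remembering the longest key seen so far can be sketched as below. This is an assumption-laden illustration, not the book's implementation: the class name TrieLongestPrefixSketch and the iterative style are choices made for this example, and returning null when no key is a prefix is one possible convention.

```java
// Sketch: longest prefix match in an R-way trie.
class TrieLongestPrefixSketch {
    private static final int R = 256;           // extended ASCII (assumption)
    private final Node root = new Node();

    private static class Node {
        Integer val;                            // null means no key ends here
        Node[] next = new Node[R];
    }

    public void put(String key, int val) {
        Node x = root;
        for (int d = 0; d < key.length(); d++) {
            char c = key.charAt(d);
            if (x.next[c] == null) x.next[c] = new Node();
            x = x.next[c];
        }
        x.val = val;
    }

    // Return the longest key that is a prefix of query, or null if there is none.
    public String longestPrefixOf(String query) {
        Node x = root;
        int length = -1;                        // length of the longest key seen on the path
        for (int d = 0; ; d++) {
            if (x.val != null) length = d;      // a key ends at this node
            if (d == query.length()) break;     // search ends at end of string
            x = x.next[query.charAt(d)];
            if (x == null) break;               // search ends at null link
        }
        if (length < 0) return null;
        return query.substring(0, length);
    }

    public static void main(String[] args) {
        TrieLongestPrefixSketch st = new TrieLongestPrefixSketch();
        String[] keys = { "she", "sells", "sea", "shells", "by", "the", "shore" };
        for (int i = 0; i < keys.length; i++) st.put(keys[i], i);
        System.out.println(st.longestPrefixOf("shellsort"));
        System.out.println(st.longestPrefixOf("sheep"));
    }
}
```

The four cases in the figure correspond to the two break statements combined with whether the final node visited holds a value.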

Patricia tries

Patricia trie. [ Practical Algorithm to Retrieve Information Coded in Alphanumeric ]
  Remove one-way branching.
  Each node represents a sequence of characters.
  Implementation: one step beyond this course.

Applications.
  Database search.
  P2P network search.
  IP routing tables: find longest prefix match.
  Compressed quad-tree for n-body simulation.
  Efficiently storing and querying XML documents.

Also known as: crit-bit tree, radix tree.

[Figure: removing one-way branching in a trie. After put("shells", 1) and put("shellfish", 2), the standard trie has internal one-way branching (s, h, e, l, l) and external one-way branching (f, i, s, h). With no one-way branching, a single node shell has two children: s with value 1 and fish with value 2.]

Suffix trees

Suffix tree. Patricia trie of suffixes of a string. Linear-time construction: well beyond scope of this course.

[Figure: suffix tree for BANANAS.]

Applications.
  Linear-time: longest repeated substring, longest common substring, longest palindromic substring, substring search, tandem repeats, ….
  Computational biology databases (BLAST, FASTA).

String symbol tables summary

A success story in algorithm design and analysis.

Balanced BSTs. [ red–black BSTs ]
  Θ(log n) key compares per search/insert (worst case).
  Supports ordered operations (e.g., rank, select, floor).

Hash tables. [ separate chaining, linear probing ]
  Θ(1) probes per search/insert (uniform hashing assumption).

Tries. [ R-way tries, ternary search tries ]
  In typical applications:
    Θ(L + log n) character accesses per search hit/insert.
    Θ(log n) character accesses per search miss.
  Supports character-based operations (e.g., prefix match).
  Works only for string (or digital) keys.

© Copyright 2021 Robert Sedgewick and Kevin Wayne