COMPUTER SCIENCE SEDGEWICK/WAYNE
15. Symbol Tables
Section 4.4 http://introcs.cs.princeton.edu
• APIs and clients
• A design challenge
• Binary search trees
• Implementation
• Analysis

FAQs about sorting and searching
Hey, Alice. That whitelist filter with mergesort and binary search is working great.
Right, but it's a pain sometimes.
Why?
We have to sort the whole list whenever we add new customers.
Also, we want to process transactions and associate all sorts of information with our customers.
Bottom line. Need a more flexible API.

Why are telephone books obsolete?
Unsupported operations
• Change the number associated with a given name.
• Add a new name, associated with a given number.
• Remove a given name and associated number.
Observation. Mergesort + binary search has the same problem with add and remove (see the Sorting and Searching lecture).

Associative array abstraction
Imagine using arrays whose indices are string values.
   phoneNumber["Alice"] = "(212) 123-4567"
   phoneNumber["Bob"]   = "(609) 987-6543"
   phoneNumber["Carl"]  = "(800) 888-8888"
   phoneNumber["Dave"]  = "(888) 800-0800"
   phoneNumber["Eve"]   = "(999) 999-9999"

   transactions["Alice"] = "Dec 12 12:01AM $111.11 Amazon, Dec 12 1:11 AM $989.99 Ebay"
   ...

   URL["128.112.136.11"] = "www.cs.princeton.edu"
   URL["128.112.128.15"] = "www.princeton.edu"
   URL["130.132.143.21"] = "www.yale.edu"
   URL["128.103.060.55"] = "www.harvard.edu"

   IPaddr["www.cs.princeton.edu"] = "128.112.136.11"
   IPaddr["www.princeton.edu"]    = "128.112.128.15"
   IPaddr["www.yale.edu"]         = "130.132.143.21"
   IPaddr["www.harvard.edu"]      = "128.103.060.55"

This is legal code in some programming languages (not Java).
A fundamental abstraction
• Use keys to access associated values.
• Keys and values could be any type of data.
• Client code could not be simpler.
Q. How to implement?

Symbol table ADT
A symbol table is an ADT whose values are sets of key-value pairs, with keys all different.
Basic symbol-table operations
• Associate a given key with a given value.
  [If the key is not in the table, add it to the table.]
  [If the key is in the table, change its value.]
• Return the value associated with a given key.
• Test if a given key is in the table.
• Iterate through the keys.
Useful additional assumptions
• Keys are comparable and iteration is in order.
• No limit on number of key-value pairs.
• All keys not in the table associate with null.
Examples
• key: word, value: definition
• key: number, value: function value
• key: time+channel, value: TV show
• key: name, value: phone number
• key: term, value: article

Benchmark example of symbol-table operations
Application. Count frequency of occurrence of strings in StdIn.
Keys. Strings from a sequence. Values. Integers.
   key     it  was  the  best  of  times  it  was  the  worst
   value   1   1    1    1     1   1      2   2    2    1

[Figure: symbol-table contents after each operation. A repeated key changes the value. Contents after the last operation: best 1, it 2, of 1, the 2, times 1, was 2, worst 1.]
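The trace above can be reproduced with a few lines of client code. This is a minimal sketch, assuming java.util.TreeMap as a stand-in for the lecture's symbol table (its put, get, and containsKey play the roles of put(), get(), and contains()); the class name FrequencyTrace and the helper count() are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class FrequencyTrace {
    // Count occurrences of each word. TreeMap stands in for the lecture's ST (assumption).
    static TreeMap<String, Integer> count(String[] words) {
        TreeMap<String, Integer> st = new TreeMap<>();
        for (String word : words) {
            if (st.containsKey(word)) st.put(word, st.get(word) + 1); // key in table: change its value
            else                      st.put(word, 1);                // key not in table: add it
        }
        return st;
    }

    public static void main(String[] args) {
        String[] words = "it was the best of times it was the worst".split(" ");
        TreeMap<String, Integer> st = count(words);
        // Iteration visits the keys in sorted order.
        for (Map.Entry<String, Integer> e : st.entrySet())
            System.out.println(e.getKey() + " " + e.getValue());
    }
}
```

Each key is either added with the value 1 or has its value replaced, which matches the two bracketed cases of the "associate" operation.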

Parameterized API for symbol tables
Goal. Simple, safe, and clear client code for symbol tables holding any type of data.
Java approach: Parameterized data types (generics)
• Use placeholder type names for both keys and values.
• Substitute concrete types for placeholders in clients.

Symbol Table API
   public class ST<Key extends Comparable<Key>, Value>     (Key "implements compareTo()")
      ST()                           create a symbol table
      void put(Key key, Value val)   associate key with val
      Value get(Key key)             return value associated with key, null if none
      boolean contains(Key key)      is there a value associated with key?
      Iterable<Key> keys()           all the keys in the table

Aside: Iteration (client code)
Q. How to print the contents of a stack/queue?
A. Use Java's foreach construct.
• Substantially simplifies client code.
• Works when API "implements Iterable".

   public class Stack<Item> implements Iterable<Item>
      Stack()                 create a stack
      void push(Item item)    add item to stack
      Item pop()              remove and return item most recently pushed
      boolean isEmpty()       is the stack empty?
      int size()              # of objects on the stack

Performance specification. Constant-time per entry.

Aside: Iteration (implementation)
Q. How to "implement Iterable"?
A. We did it for Stack and Queue, so you don't have to.
A. Implement an Iterator (see text pp. 588-89).
Meets performance specification. Constant-time per entry.
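To make "implement Iterable" concrete, here is a minimal sketch of a linked-list stack that supplies an iterator, so the foreach construct works in client code. It is written for illustration and is not the book's Stack code; the class name IterableStack is hypothetical.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class IterableStack<Item> implements Iterable<Item> {
    private Node first;   // top of the stack
    private int n;        // number of items on the stack

    private class Node {
        Item item;
        Node next;
    }

    public void push(Item item) {
        Node oldFirst = first;
        first = new Node();
        first.item = item;
        first.next = oldFirst;
        n++;
    }

    public Item pop() {
        if (isEmpty()) throw new NoSuchElementException("stack is empty");
        Item item = first.item;
        first = first.next;
        n--;
        return item;
    }

    public boolean isEmpty() { return first == null; }
    public int size()        { return n; }

    // Required by Iterable: an iterator that walks the list from the top down.
    public Iterator<Item> iterator() {
        return new Iterator<Item>() {
            private Node current = first;

            public boolean hasNext() { return current != null; }

            public Item next() {
                if (current == null) throw new NoSuchElementException();
                Item item = current.item;   // constant time per entry
                current = current.next;
                return item;
            }
        };
    }

    public static void main(String[] args) {
        IterableStack<String> stack = new IterableStack<>();
        for (String s : "to be or not".split(" ")) stack.push(s);
        for (String s : stack)              // foreach works because of iterator()
            System.out.println(s);
    }
}
```

Each call to next() follows one link, which is how the constant-time-per-entry specification is met.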
Bottom line. Use iteration in client code that uses collections.

Why ordered keys?
Natural for many applications
• Numeric types.
• Strings.
• Date and time.
• Client-supplied types (Account numbers, ...).
Enables useful API extensions
• Provide the keys in sorted order.
• Find the kth largest key.
Enables efficient implementations
• Mergesort.
• Binary search.
• BSTs (this lecture).
[Image: thingsorganizedneatly.tumblr.com]

Symbol table client example 1: Sort (with dedup)
Goal. Sort lines on standard input (and remove duplicates).
• Key type. String (line on standard input).
• Value type. (ignored).

   % more tale.txt
   it was the best of times
   it was the worst of times
   it was the age of wisdom
   it was the age of foolishness
   it was the epoch of belief
   it was the epoch of incredulity
   it was the season of light
   it was the season of darkness
   it was the spring of hope
   it was the winter of despair

   % java Sort < tale.txt
   it was the age of foolishness
   it was the age of wisdom
   ...

   public class Sort
   {
      public static void main(String[] args)
      {  // Sort lines on StdIn
         BST ...
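The Sort client is cut off in the source, so the following is only a hedged sketch of what such a client might look like, not the original code. It assumes java.util.TreeMap in place of the lecture's BST and java.util.Scanner in place of StdIn; the names SortDedup and sortedDistinct() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.TreeMap;

class SortDedup {
    // Insert every line as a key (the value is ignored). Iterating over the keys
    // then yields the lines in sorted order, with duplicates removed.
    static List<String> sortedDistinct(Iterable<String> lines) {
        TreeMap<String, Boolean> st = new TreeMap<>();
        for (String line : lines) st.put(line, Boolean.TRUE);
        return new ArrayList<>(st.keySet());
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        Scanner in = new Scanner(System.in);
        while (in.hasNextLine()) lines.add(in.nextLine());
        for (String line : sortedDistinct(lines)) System.out.println(line);
    }
}
```

Deduplication falls out of the symbol-table rule that keys are all different: putting an existing key only replaces its value.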

Symbol table client example 2: Frequency counter
Goal. Compute frequencies of words on standard input.
• Key type. String (word on standard input).
• Value type. Integer (frequency count).

   % more tale.txt
   it was the best of times
   it was the worst of times
   it was the age of wisdom
   it was the age of foolishness
   it was the epoch of belief
   it was the epoch of incredulity
   it was the season of light
   it was the season of darkness
   it was the spring of hope
   it was the winter of despair

   % java Freq < tale.txt | java Sort
   1 belief
   1 best
   1 darkness
   1 despair
   1 foolishness
   1 hope
   1 incredulity
   ...

   public class Freq
   {
      public static void main(String[] args)
      {  // Frequency counter
         BST ...

Goal. Print index to words on standard input.
• Key type. String (word on standard input).
• Value type. Queue ...

   % more tale.txt
   it was the best of times
   it was the worst of times
   ...
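The index client is also truncated in the source (the value type breaks off after "Queue"). As a rough sketch only, assume each value is a queue of the positions at which the word occurs; that choice, the use of java.util.TreeMap and java.util.ArrayDeque, and the names IndexSketch and index() are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

class IndexSketch {
    // Map each word to the queue of positions (0-based) where it appears.
    static TreeMap<String, Queue<Integer>> index(String[] words) {
        TreeMap<String, Queue<Integer>> st = new TreeMap<>();
        for (int i = 0; i < words.length; i++) {
            if (!st.containsKey(words[i])) st.put(words[i], new ArrayDeque<>()); // first occurrence
            st.get(words[i]).add(i);                                              // record position
        }
        return st;
    }

    public static void main(String[] args) {
        String[] words = "it was the best of times it was the worst".split(" ");
        for (Map.Entry<String, Queue<Integer>> e : index(words).entrySet())
            System.out.println(e.getKey() + " " + e.getValue());
    }
}
```

The point of the example is that a value can itself be a collection, so one key can be associated with many items.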

   application           key               value
   contacts              name              phone number, address
   credit card           account number    transaction details
   file share            name of song      computer ID
   dictionary            word              definition
   web search            keyword           list of web pages
   book index            word              list of page numbers
   cloud storage         file name         file contents
   domain name service   domain name       IP address
   reverse DNS           IP address        domain name
   compiler              variable name     value and type
   internet routing      destination       best route
   ...                   ...               ...

Symbol tables are ubiquitous in today's computational infrastructure.
We're going to need a good symbol-table implementation!

Benchmark
Application. Linguistic analysis.
Zipf's law (for a natural language corpus) [hypothesis]
• Suppose the most frequent word occurs about t times.
• The 2nd most frequent word occurs about t/2 times.
• The 3rd most frequent word occurs about t/3 times.
• The 4th most frequent word occurs about t/4 times.
Goal. Validate Zipf's law for real natural language data.
Method. % java Freq < data.txt | java Sort
Required. Efficient symbol-table implementation.

   % java Freq < mobydick.txt | java Sort      [observation]
   ...
   1940 i
   2370 it
   2481 his
   2911 that
   4037 in
   4508 to
   4583 a
   6247 and
   6415 of
   13967 the

Benchmark statistics
Goal. Validate Zipf's law for real natural language data.
Method. % java Freq < data.txt | java Sort
   file              description             words        distinct
   mobydick.txt      Melville's Moby Dick    210,028      16,834
   liepzig100k.txt   100K random sentences   2,121,054    144,256
   liepzig200k.txt   200K random sentences   4,238,435    215,515
   liepzig1m.txt     1M random sentences     21,191,455   534,580

Reference: Wortschatz corpus, Universität Leipzig http://corpora.informatik.uni-leipzig.de
Required. Efficient symbol-table implementation.

Strawman I: Ordered array
Idea
• Keep keys in order in an array.
• Keep values in a parallel array.
Reasons (see "Sorting and Searching" lecture)
• Takes advantage of fast sort (mergesort).
• Enables fast search (binary search).
Known challenge. How big to make the arrays?
Fatal flaw. How to insert a new key?
• To keep the key array in order, need to move larger entries, as in insertion sort.
• Hypothesis: Quadratic time for benchmark (easy to validate with experiments).

   before               after inserting craig
   keys     values      keys     values
   alice    121         alice    121
   bob      873         bob      873
   carlos   884         carlos   884
   carol    712         carol    712
   dave     585         craig    999
   erin     247         dave     585
   eve      577         erin     247
   oscar    675         eve      577
   peggy    895         oscar    675
   trent    557         peggy    895
   trudy    926         trent    557
   walter   51          trudy    926
   wendy    152         walter   51
                        wendy    152

Strawman II: Linked list
Idea
• Keep keys in order in a linked list.
• Add a value to each node.
Reason. Meets memory-use performance specification.
[Figure: linked list of key-value nodes: alice 2, bob 7, carlos 1, carol 8, dave 2, erin 8, eve 1, oscar 8, peggy 2.]
Fatal flaw. How to search?
• Binary search requires indexed access.
• Example: How to access the middle of a linked list?
• Only choice: search sequentially through the list.
• Hypothesis: Quadratic time for benchmark (easy to validate with experiments).

Design challenge
Implement scalable symbol tables.
Goal. Simple, safe, clear, and efficient client code.
Performance specifications
• Order of growth of running time for put(), get() and contains() is logarithmic.
• Memory use is proportional to the size of the collection, when it is nonempty.
• No limits within the code on the collection size.
Only slightly more costly than stacks or queues!
Are such guarantees achievable?? Can we implement associative arrays with just log-factor extra cost??
   phoneNumber["Alice"] = "(212) 123-4567"
No way!
This lecture. Yes way!

Doubly-linked data structures
With two links, a wide variety of data structures are possible.
• Doubly-linked list
• Doubly-linked circular list
• Binary tree (this lecture)
• Tree
• General case
From the point of view of a particular object, all of these structures look the same.
Maintenance can be complicated!

A doubly-linked data structure: binary search tree
Binary search tree (BST)
• A recursive data structure containing distinct comparable keys that is ordered.
• Def. A BST is null or a reference to a BST node (the root).
• Def. A BST node is a data type that contains references to a key, a value, and two BSTs, a left subtree and a right subtree.
• Ordered. All keys in the left subtree of each node are smaller than its key and all keys in the right subtree of each node are larger than its key.

   private class Node
   {
      private Key key;
      private Value val;
      private Node left;    // left subtree (smaller keys)
      private Node right;   // right subtree (larger keys)
   }

BST processing code
Standard operations for processing data structured as a binary search tree
• Search for the value associated with a given key.
• Add a new key-value pair.
• Traverse the BST (visit every node, in order of the keys).
• Remove a given key and associated value (not addressed in this lecture).
[Figure: example BST. The root is it (2), with children best (1) and was (2); the (2) is the left child of was, with children of (1) and times (1).]

BST processing code: Search
Goal. Find the value associated with a given key in a BST.
• If less than the key at the current node, go left.
• If greater than the key at the current node, go right.
• If equal to the key at the current node, search hit: return the value.

Example. get("the")
• the? At it: GREATER, go right.
• At was: LESS, go left.
• At the: SEARCH HIT, return value.

   public Value get(Key key)
   {  return get(root, key);  }

   private Value get(Node x, Key key)
   {
      if (x == null) return null;               // search miss
      int cmp = key.compareTo(x.key);
      if      (cmp < 0) return get(x.left, key);
      else if (cmp > 0) return get(x.right, key);
      else              return x.val;           // cmp == 0: search hit
   }

BST processing code: Associate a new value with a key
Goal. Associate a new value with a given key in a BST.
• If less than the key at the current node, go left.
• If greater than the key at the current node, go right.

Example. put("the", 2)
• the? At it: GREATER, go right.
• At was: LESS, go left.
• At the: SEARCH HIT, update value (1 becomes 2).

   public void put(Key key, Value val)
   {  root = put(root, key, val);  }

   private Node put(Node x, Key key, Value val)
   {
      if (x == null) return new Node(key, val);
      int cmp = key.compareTo(x.key);
      if      (cmp < 0) x.left  = put(x.left, key, val);
      else if (cmp > 0) x.right = put(x.right, key, val);
      else              x.val   = val;
      return x;
   }

BST processing code: Add a new key
Goal. Add a new key-value pair to a BST.
• Search for key.
• Return link to new node when null reached.

Example. put("worst", 1)
• worst? At it: GREATER, go right.
• At was: GREATER, go right.
• NULL: add new node (worst 1 becomes the right child of was).

The code is the same put() shown above: the new node is created when the search reaches a null link, and each recursive call returns the link to be stored in its parent.

BST processing code: Traverse the BST
Goal. Put keys in a BST on a queue, in sorted order.
• Do it for the left subtree.
• Put the key at the root on the queue.
• Do it for the right subtree.
[Figure: the example BST with root it (2) and children best (1) and was (2).]

   public Iterable<Key> keys() ...
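The traversal code is cut off in the source. The three steps above can be sketched as follows, on a hand-built copy of the example tree; the class InorderSketch, its non-generic Node, and the helper example() are hypothetical and exist only for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class InorderSketch {
    static class Node {
        String key; int val; Node left, right;
        Node(String key, int val, Node left, Node right) {
            this.key = key; this.val = val; this.left = left; this.right = right;
        }
    }

    // Left subtree, then the root, then the right subtree:
    // the keys reach the queue in sorted order.
    static void inorder(Node x, Queue<String> queue) {
        if (x == null) return;
        inorder(x.left, queue);
        queue.add(x.key);
        inorder(x.right, queue);
    }

    // The example tree from the slides, linked by hand.
    static Node example() {
        Node the = new Node("the", 2, new Node("of", 1, null, null), new Node("times", 1, null, null));
        Node was = new Node("was", 2, the, null);
        return new Node("it", 2, new Node("best", 1, null, null), was);
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        inorder(example(), queue);
        for (String key : queue) System.out.println(key);
    }
}
```

Because a queue is iterable, returning it from a keys() method is one way to satisfy an Iterable-returning API.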

ADT for symbol tables: review
A symbol table is an idealized model of an associative storage mechanism.
An ADT allows us to write Java programs that use and manipulate symbol tables.

API
   public class ST<Key extends Comparable<Key>, Value>
      ST()                           create a symbol table
      void put(Key key, Value val)   associate key with val
      Value get(Key key)             return value associated with key, null if none
      boolean contains(Key key)      is there a value associated with key?
      Iterable<Key> keys()           all the keys in the table

Performance specifications
• Order of growth of running time for put(), get() and contains() is logarithmic.
• Memory use is proportional to the size of the collection, when it is nonempty.
• No limits within the code on the collection size.

Symbol table implementation: Instance variables and constructor
Data structure choice. Use a BST to hold the collection.

   public class BST<Key extends Comparable<Key>, Value>
   {
      ...
      private class Node
      {
         private Key key;
         private Value val;
         private Node left;
         private Node right;
      }
      ...
   }

[Figure: root refers to the example BST with keys it, best, was, the, of, times.]

BST implementation: Test client (frequency counter)

   public static void main(String[] args)
   {
      BST ...

Methods define data-type operations (implement the API).

   public class BST<Key extends Comparable<Key>, Value>
   {
      ...
      public boolean isEmpty()
      {  return root == null;  }

      public void put(Key key, Value value)
      {  /* See BST add slides and next slide. */  }

      public Value get(Key key)
      {  /* See BST search slide and next slide. */  }

      public boolean contains(Key key)
      {  return get(key) != null;  }

      public Iterable<Key> keys() ...
      ...
   }

BST implementation

   public class BST<Key extends Comparable<Key>, Value> ...
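The full listing is truncated in the source. The sketch below assembles the pieces that do survive on the earlier slides (Node, get(), put(), contains(), isEmpty()) into one class, with a frequency-counter test client. The class name BSTSketch, the Node constructor, the use of java.util.ArrayDeque as the queue, and the hard-coded input are assumptions, not the original code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class BSTSketch<Key extends Comparable<Key>, Value> {
    private Node root;                       // root of the BST (null when empty)

    private class Node {
        private final Key key;
        private Value val;
        private Node left, right;
        Node(Key key, Value val) { this.key = key; this.val = val; }
    }

    public boolean isEmpty() { return root == null; }

    public void put(Key key, Value val) { root = put(root, key, val); }

    private Node put(Node x, Key key, Value val) {
        if (x == null) return new Node(key, val);            // null link: add new node
        int cmp = key.compareTo(x.key);
        if      (cmp < 0) x.left  = put(x.left, key, val);
        else if (cmp > 0) x.right = put(x.right, key, val);
        else              x.val   = val;                     // search hit: change the value
        return x;
    }

    public Value get(Key key) { return get(root, key); }

    private Value get(Node x, Key key) {
        if (x == null) return null;                          // search miss
        int cmp = key.compareTo(x.key);
        if      (cmp < 0) return get(x.left, key);
        else if (cmp > 0) return get(x.right, key);
        else              return x.val;
    }

    public boolean contains(Key key) { return get(key) != null; }

    public Iterable<Key> keys() {
        Queue<Key> queue = new ArrayDeque<>();
        inorder(root, queue);
        return queue;
    }

    private void inorder(Node x, Queue<Key> queue) {
        if (x == null) return;
        inorder(x.left, queue);
        queue.add(x.key);
        inorder(x.right, queue);
    }

    // Test client: frequency counter on a fixed input (an assumption; the slides read StdIn).
    public static void main(String[] args) {
        BSTSketch<String, Integer> st = new BSTSketch<>();
        for (String word : "it was the best of times it was the worst".split(" ")) {
            if (st.contains(word)) st.put(word, st.get(word) + 1);
            else                   st.put(word, 1);
        }
        for (String key : st.keys()) System.out.println(key + " " + st.get(key));
    }
}
```

Note that contains() relies on the convention that keys not in the table associate with null, so null is not usable as a stored value in this sketch.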

[Figure: trace of BST construction, showing the tree after each of the keys it, was, the, best, of, times is inserted.]

BST analysis
Costs depend on order of key insertion.
[Figure: three BSTs holding the keys best, it, of, the, times, was, worst, labeled Best case, Typical case, and Worst case.]

BST insertion: random order visualization
Insert keys in random order.
• Tree is roughly balanced.
• Tends to stay that way!

BST analysis
Running time depends on order of key insertion.
Model. Insert keys in random order.
• Tree is roughly balanced.
• Tends to stay that way!
Proposition. Building a BST by inserting N randomly ordered keys into an initially empty tree uses ~2 N ln N (about 1.39 N lg N) compares.
Proof. A very interesting exercise in discrete math.
Interested in details? Take a course in algorithms.
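The proposition can be checked empirically. This is a small experiment sketch, not part of the lecture's code: it inserts N distinct integer keys in shuffled order into an unbalanced BST and counts one compare per node visited. The class name, the seed, and N = 100000 are arbitrary choices. For moderate N the measured count typically falls somewhat below 2 N ln N, because the ~ notation drops lower-order terms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class RandomBSTCompares {
    static class Node {
        int key; Node left, right;
        Node(int key) { this.key = key; }
    }

    // Build a BST by inserting the keys in the given order; return the total number of compares.
    static long compares(List<Integer> keys) {
        Node root = null;
        long count = 0;
        for (int key : keys) {
            if (root == null) { root = new Node(key); continue; }
            Node x = root;
            while (true) {
                count++;                                     // one compare against x.key
                if (key < x.key) {
                    if (x.left == null) { x.left = new Node(key); break; }
                    x = x.left;
                } else if (key > x.key) {
                    if (x.right == null) { x.right = new Node(key); break; }
                    x = x.right;
                } else break;                                // duplicate key: nothing to add
            }
        }
        return count;
    }

    // The keys 0..n-1 in a pseudo-random order determined by the seed.
    static List<Integer> shuffled(int n, long seed) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < n; i++) keys.add(i);
        Collections.shuffle(keys, new Random(seed));
        return keys;
    }

    public static void main(String[] args) {
        int n = 100000;
        long c = compares(shuffled(n, 42L));
        double model = 2.0 * n * Math.log(n);
        System.out.printf("N = %d, compares = %d, 2 N ln N = %.0f%n", n, c, model);
    }
}
```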

Benchmarking the BST implementation
BST implements the associative-array abstraction for randomly ordered keys.

Symbol table API
   public class ST<Key extends Comparable<Key>, Value>
      ST()                             create a symbol table
      void put(Key key, Value value)   associate key with value
      Value get(Key key)               return value associated with key, null if none
      boolean contains(Key key)        is there a value associated with key?
      Iterable<Key> keys()             all the keys in the table

Performance specifications (for random keys, but stay tuned)
✓ Order of growth of running time for put(), get() and contains() is logarithmic.
✓ Memory use is proportional to the size of the collection, when it is nonempty.
✓ No limits within the code on the collection size.
Made possible by binary tree data structure.

Empirical tests of BSTs
Count number of words that appear more than once in StdIn (frequency count without the output).

   % java Generator 1000000 ...
   263934 (5 seconds)
   % java Generator 2000000 ...
   593973 (9 seconds)
   % java Generator 4000000 ...
   908795 (17 seconds)
   % java Generator 8000000 ...
   996961 (34 seconds)
   % java Generator 16000000 ...
   999997 (72 seconds)

   where ... = 6 0123456789 | java DupsBST   (6-digit integers)

   N            T_N (seconds)   T_N / T_(N/2)
   1 million    5
   2 million    9               1.8
   4 million    17              1.9
   8 million    34              2
   16 million   72              2.1
   ...
   1 BILLION    4608            2

• Confirms hypothesis that order of growth is N log N.
• WILL scale.
• Easy to process 21M word corpus (NOT possible without BSTs).

Performance guarantees
Practical problem. Keys may not be randomly ordered.
• BST may be unbalanced.
• Running time may be quadratic.
• Happens in practice (insert keys in order).
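To see the "insert keys in order" case concretely, the sketch below builds an unbalanced BST from sorted keys and from shuffled keys and reports the height of each tree. It is an illustration under stated assumptions (distinct integer keys, hypothetical class and method names), not code from the lecture.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class BSTHeightDemo {
    static class Node {
        int key; Node left, right;
        Node(int key) { this.key = key; }
    }

    // Insert the keys in the given order (no balancing) and return the tree height,
    // counted as the number of links on the longest path from the root.
    static int height(List<Integer> keys) {
        Node root = null;
        int height = 0;
        for (int key : keys) {
            if (root == null) { root = new Node(key); continue; }
            Node x = root;
            int depth = 0;
            while (true) {
                depth++;                                     // depth of the link we are about to follow
                if (key < x.key) {
                    if (x.left == null) { x.left = new Node(key); break; }
                    x = x.left;
                } else if (key > x.key) {
                    if (x.right == null) { x.right = new Node(key); break; }
                    x = x.right;
                } else { depth = 0; break; }                 // duplicate key: nothing added
            }
            height = Math.max(height, depth);
        }
        return height;
    }

    // The keys 0..n-1 in increasing order.
    static List<Integer> sorted(int n) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < n; i++) keys.add(i);
        return keys;
    }

    public static void main(String[] args) {
        int n = 1000;
        List<Integer> random = sorted(n);
        Collections.shuffle(random, new Random(1L));
        System.out.println("sorted order:   height " + height(sorted(n)));
        System.out.println("shuffled order: height " + height(random));
    }
}
```

With sorted input every new key becomes the right child of the previous one, so the tree is a linked list of height N - 1 and building it takes a quadratic number of compares.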
Remarkable resolution.
• Balanced tree algorithms perform simple transformations that guarantee balance.
• AVL trees (Adelson-Velskii and Landis, 1962) proved concept.
• Red-black trees (Guibas and Sedgewick, 1979) are implemented in many modern systems.

Red-black tree insertion: random order visualization
Insert keys in random order.
• Same # of black links on every path from root to leaf.
• No two red links in a row.
• Tree is roughly balanced.
• Guaranteed to stay that way!

ST implementation with guaranteed logarithmic performance
   import java.util.TreeMap;

   public class ST<Key extends Comparable<Key>, Value>
   {
      private TreeMap<Key, Value> st = new TreeMap<Key, Value>();

      public void put(Key key, Value val)
      {
         if (val == null) st.remove(key);
         else             st.put(key, val);
      }
      public Value get(Key key)         {  return st.get(key);  }
      public Value remove(Key key)      {  return st.remove(key);  }
      public boolean contains(Key key)  {  return st.containsKey(key);  }
      public Iterable<Key> keys()       {  return st.keySet();  }
   }

Java's TreeMap library uses red-black trees. Several other useful operations are also available.
Proposition. In a red-black tree of size N, put(), get() and contains() are guaranteed to use fewer than 2 lg N compares.
Proof. A fascinating exercise in algorithmics.
Interested in details? Take a course in algorithms.

Summary
BSTs. Simple symbol-table implementation, usually efficient.
Red-black trees. More complicated variation, guaranteed to be efficient.
Applications. Many, many, many things are enabled by efficient symbol tables.
Example. Search among 1 trillion customers with fewer than 80 compares!
Example. Search among all the atoms in the universe (roughly 10^80) with fewer than 540 compares!
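The trillion-customer figure follows directly from the 2 lg N bound in the proposition for red-black trees. A tiny sketch of the arithmetic, with a hypothetical class and method name:

```java
class CompareBound {
    // Bound from the proposition: fewer than 2 lg N compares in a red-black tree of size N.
    static double bound(double n) {
        return 2 * Math.log(n) / Math.log(2);   // lg N = ln N / ln 2
    }

    public static void main(String[] args) {
        // 1 trillion = 10^12 keys
        System.out.printf("N = 10^12: 2 lg N = %.1f%n", bound(1e12));
    }
}
```

Since lg(10^12) is just under 40, doubling it stays below 80.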
Can we implement associative arrays with just log-factor extra cost??
YES!