COMPUTER SCIENCE SEDGEWICK/WAYNE

15. Symbol Tables

Section 4.4 http://introcs.cs.princeton.edu

• APIs and clients • A design challenge • Binary search trees • Implementation • Analysis

FAQs about sorting and searching

"Hey, Alice. That whitelist filter with mergesort and binary search is working great."

"Right, but it's a pain sometimes."

"Why?"

"We have to sort the whole list whenever we add new customers."

"Also, we want to process transactions and associate all sorts of information with our customers."

Bottom line. Need a more flexible API.

Why are telephone books obsolete?

Unsupported operations
• Change the number associated with a given name.
• Add a new name, associated with a given number.
• Remove a given name and associated number.

Observation. Mergesort + binary search has the same problem with add and remove (see the Sorting and Searching lecture).

Associative array abstraction

Imagine using arrays whose indices are string values.

    phoneNumber["Alice"] = "(212) 123-4567"
    phoneNumber["Bob"]   = "(609) 987-6543"
    phoneNumber["Carl"]  = "(800) 888-8888"
    phoneNumber["Dave"]  = "(888) 800-0800"
    phoneNumber["Eve"]   = "(999) 999-9999"

    transactions["Alice"] = "Dec 12 12:01AM $111.11 Amazon, Dec 12 1:11 AM $989.99 Ebay"
    ...

    URL["128.112.136.11"] = "www.cs.princeton.edu"
    URL["128.112.128.15"] = "www.princeton.edu"
    URL["130.132.143.21"] = "www.yale.edu"
    URL["128.103.060.55"] = "www.harvard.edu"

    IPaddr["www.cs.princeton.edu"] = "128.112.136.11"
    IPaddr["www.princeton.edu"]    = "128.112.128.15"
    IPaddr["www.yale.edu"]         = "130.132.143.21"
    IPaddr["www.harvard.edu"]      = "128.103.060.55"

This is legal code in some programming languages (not Java).

A fundamental abstraction
• Use keys to access associated values.
• Keys and values could be any type of data.
• Client code could not be simpler.

Q. How to implement?

Symbol table ADT

A symbol table is an ADT whose values are sets of key-value pairs, with keys all different.

Basic symbol-table operations
• Associate a given key with a given value.
  [If the key is not in the table, add it to the table.]
  [If the key is in the table, change its value.]
• Return the value associated with a given key.
• Test if a given key is in the table.
• Iterate through the keys.

Useful additional assumptions
• Keys are comparable and iteration is in order.
• No limit on number of key-value pairs.
• All keys not in the table associate with null.

Examples
• key: word, value: definition
• key: number, value: function value
• key: time+channel, value: TV show
• key: name, value: phone number
• key: term, value: article

Benchmark example of symbol-table operations

Application. Count frequency of occurrence of strings in StdIn.

Keys. Strings from a sequence. Values. Integers.

    key     value   symbol-table contents after operation
    it      1       it 1
    was     1       it 1, was 1
    the     1       it 1, the 1, was 1
    best    1       best 1, it 1, the 1, was 1
    of      1       best 1, it 1, of 1, the 1, was 1
    times   1       best 1, it 1, of 1, the 1, times 1, was 1
    it      2       best 1, it 2, of 1, the 1, times 1, was 1            (change the value)
    was     2       best 1, it 2, of 1, the 1, times 1, was 2            (change the value)
    the     2       best 1, it 2, of 1, the 2, times 1, was 2            (change the value)
    worst   1       best 1, it 2, of 1, the 2, times 1, was 2, worst 1
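The benchmark trace can be reproduced with a few lines of client code. The following is a minimal sketch, not the lecture's code: it assumes java.util.TreeMap as a stand-in symbol table (the lecture's own ST and BST classes come later), and the class name FrequencyTrace and method name count are hypothetical.

```java
import java.util.TreeMap;

class FrequencyTrace {
    // Counts occurrences of each string, using only the basic
    // symbol-table operations: test, associate, and return.
    static TreeMap<String, Integer> count(String[] words) {
        TreeMap<String, Integer> st = new TreeMap<>();
        for (String word : words) {
            if (st.containsKey(word)) st.put(word, st.get(word) + 1);  // key in table: change the value
            else                      st.put(word, 1);                 // key not in table: add it
        }
        return st;
    }

    public static void main(String[] args) {
        String[] words = "it was the best of times it was the worst".split(" ");
        TreeMap<String, Integer> st = count(words);
        for (String key : st.keySet())        // iteration visits keys in sorted order
            System.out.println(key + " " + st.get(key));
    }
}
```

The final loop prints the last row of the trace, one key-value pair per line.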

Parameterized API for symbol tables

Goal. Simple, safe, and clear client code for symbol tables holding any type of data.

Java approach: Parameterized data types (generics)
• Use placeholder type names for both keys and values.
• Substitute concrete types for placeholders in clients.

Symbol Table API

    public class ST<Key extends Comparable<Key>, Value>

    ("extends Comparable<Key>" means "implements compareTo()")

                  ST()                          create a symbol table
             void put(Key key, Value val)       associate key with val
            Value get(Key key)                  return value associated with key, null if none
          boolean contains(Key key)             is there a value associated with key?
    Iterable<Key> keys()                        all the keys in the table

Aside: Iteration (client code)

Q. How to print the contents of a stack/queue?

A. Use Java's foreach construct.

Enhanced for loop.
• Useful for any collection.
• Iterate through each entry in the collection.
• Order determined by implementation.
• Substantially simplifies client code.
• Works when API "implements Iterable".

Java foreach construct

    Stack<String> stack = new Stack<String>();
    ...
    for (String s : stack)
        StdOut.println(s);
    ...

    public class Stack<Item> implements Iterable<Item>

                  Stack()                 create a stack of objects, all of type Item
             void push(Item item)         add item to stack
             Item pop()                   remove and return item most recently pushed
          boolean isEmpty()               is the stack empty?
              int size()                  # of objects on the stack

Performance specification. Constant-time per entry.

Aside: Iteration (implementation)

Q. How to "implement Iterable"?

    public class Stack<Item> implements Iterable<Item>

                  Stack()                 create a stack of objects, all of type Item
             void push(Item item)         add item to stack
             Item pop()                   remove and return item most recently pushed
          boolean isEmpty()               is the stack empty?
              int size()                  # of objects on the stack

A. We did it for Stack and Queue, so you don't have to.

A. Implement an Iterator (see text pp. 588-89).

Meets performance specification. Constant-time per entry.
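For readers who want to see what "implement an Iterator" amounts to, here is a minimal sketch of a linked-list stack that supports the foreach construct. It is an illustration under stated assumptions rather than the textbook's code: the class name IterableStack is hypothetical, and the iterator is written as an anonymous class for brevity.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class IterableStack<Item> implements Iterable<Item> {
    private Node first;   // most recently pushed node
    private int n;        // number of items on the stack

    private class Node {
        Item item;
        Node next;
    }

    public boolean isEmpty() { return first == null; }
    public int size()        { return n; }

    public void push(Item item) {
        Node oldFirst = first;
        first = new Node();
        first.item = item;
        first.next = oldFirst;
        n++;
    }

    public Item pop() {
        if (isEmpty()) throw new NoSuchElementException("stack underflow");
        Item item = first.item;
        first = first.next;
        n--;
        return item;
    }

    // Iterable requires only this method; each next() call takes constant time.
    public Iterator<Item> iterator() {
        return new Iterator<Item>() {
            private Node current = first;
            public boolean hasNext() { return current != null; }
            public Item next() {
                if (current == null) throw new NoSuchElementException();
                Item item = current.item;
                current = current.next;
                return item;
            }
        };
    }

    public static void main(String[] args) {
        IterableStack<String> stack = new IterableStack<>();
        stack.push("to");
        stack.push("be");
        stack.push("or");
        for (String s : stack)     // visits items from most to least recently pushed
            System.out.println(s);
    }
}
```

Iterating does not remove anything: the stack still holds all of its items after the loop.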

Bottom line. Use iteration in client code that uses collections.

Why ordered keys?

Natural for many applications
• Numeric types.
• Strings.
• Date and time.
• Client-supplied types (account numbers, ...).

Enables useful API extensions
• Provide the keys in sorted order.
• Find the kth largest key.

Enables efficient implementations
• Mergesort.
• Binary search.
• BSTs (this lecture).
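As a reminder of why comparable keys matter, here is a minimal sketch of binary search over a sorted array of keys, written against compareTo() only. The class name OrderedKeySearch, the method name indexOf, and the sample keys are illustrative assumptions, not part of the lecture's API.

```java
class OrderedKeySearch {
    // Returns the index of key in the sorted array keys, or -1 if it is absent.
    // Each compare halves the remaining range, so about lg N compares suffice.
    static <Key extends Comparable<Key>> int indexOf(Key[] keys, Key key) {
        int lo = 0, hi = keys.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = key.compareTo(keys[mid]);
            if      (cmp < 0) hi = mid - 1;   // key is in the left half
            else if (cmp > 0) lo = mid + 1;   // key is in the right half
            else return mid;                  // search hit
        }
        return -1;                            // search miss
    }

    public static void main(String[] args) {
        String[] keys = { "alice", "bob", "carlos", "carol", "dave", "erin" };
        System.out.println(indexOf(keys, "carol"));
        System.out.println(indexOf(keys, "craig"));
    }
}
```

The same code works for any key type that implements Comparable, which is the point of requiring comparable keys in the API.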

Symbol table client example 1: Sort (with dedup)

Goal. Sort lines on standard input (and remove duplicates).
• Key type. String (line on standard input).
• Value type. (ignored).

    public class Sort
    {
        public static void main(String[] args)
        {   // Sort lines on StdIn
            BST<String, Integer> st = new BST<String, Integer>();
            while (StdIn.hasNextLine())
                st.put(StdIn.readLine(), 0);
            for (String s : st.keys())      // foreach construct
                StdOut.println(s);
        }
    }

    % more tale.txt
    it was the best of times
    it was the worst of times
    it was the age of wisdom
    it was the age of foolishness
    it was the epoch of belief
    it was the epoch of incredulity
    it was the season of light
    it was the season of darkness
    it was the spring of hope
    it was the winter of despair

    % java Sort < tale.txt
    it was the age of foolishness
    it was the age of wisdom
    it was the best of times
    it was the epoch of belief
    it was the epoch of incredulity
    it was the season of darkness
    it was the season of light
    it was the spring of hope
    it was the winter of despair
    it was the worst of times

Symbol table client example 2: Frequency counter

Goal. Compute frequencies of words on standard input.
• Key type. String (word on standard input).
• Value type. Integer (frequency count).

    public class Freq
    {
        public static void main(String[] args)
        {   // Frequency counter
            BST<String, Integer> st = new BST<String, Integer>();
            while (!StdIn.isEmpty())
            {
                String key = StdIn.readString();
                if (st.contains(key)) st.put(key, st.get(key) + 1);
                else                  st.put(key, 1);
            }
            for (String s : st.keys())
                StdOut.printf("%8d %s\n", st.get(s), s);
        }
    }

    % more tale.txt
    it was the best of times
    it was the worst of times
    it was the age of wisdom
    it was the age of foolishness
    it was the epoch of belief
    it was the epoch of incredulity
    it was the season of light
    it was the season of darkness
    it was the spring of hope
    it was the winter of despair

    % java Freq < tale.txt | java Sort
           1 belief
           1 best
           1 darkness
           1 despair
           1 foolishness
           1 hope
           1 incredulity
           1 light
           1 spring
           1 winter
           1 wisdom
           1 worst
           2 age
           2 epoch
           2 season
           2 times
          10 it
          10 of
          10 the
          10 was

Symbol table client example 3: Index

Goal. Print index to words on standard input.
• Key type. String (word on standard input).
• Value type. Queue<Integer> (indices where word occurs).

    public class Index
    {
        public static void main(String[] args)
        {
            BST<String, Queue<Integer>> st;
            st = new BST<String, Queue<Integer>>();
            int i = 0;
            while (!StdIn.isEmpty())
            {
                String key = StdIn.readString();
                if (!st.contains(key))
                    st.put(key, new Queue<Integer>());
                st.get(key).enqueue(i++);
            }
            for (String s : st.keys())
                StdOut.println(s + " " + st.get(s));
        }
    }

    % more tale.txt
    it was the best of times
    it was the worst of times
    it was the age of wisdom
    it was the age of foolishness
    it was the epoch of belief
    it was the epoch of incredulity
    it was the season of light
    it was the season of darkness
    it was the spring of hope
    it was the winter of despair

    % java Index < tale.txt
    age 15 21
    belief 29
    best 3
    darkness 47
    despair 59
    epoch 27 33
    foolishness 23
    hope 53
    incredulity 35
    it 0 6 12 18 24 30 36 42 48 54
    light 41
    of 4 10 16 22 28 34 40 46 52 58
    season 39 45
    spring 51
    the 2 8 14 20 26 32 38 44 50 56
    times 5 11
    was 1 7 13 19 25 31 37 43 49 55
    winter 57
    wisdom 17
    worst 9

Symbol-table applications

    application            key               value
    contacts               name              phone number, address
    credit card            account number    transaction details
    file share             name of song      computer ID
    dictionary             word              definition
    web search             keyword           list of web pages
    book index             word              list of page numbers
    cloud storage          file name         file contents
    domain name service    domain name       IP address
    reverse DNS            IP address        domain name
    compiler               variable name     value and type
    internet routing       destination       best route
    ...                    ...               ...

Symbol tables are ubiquitous in today's computational infrastructure.

We're going to need a good symbol-table implementation!

Benchmark

Application. Linguistic analysis

Zipf's law (for a natural language corpus) • Suppose most frequent word occurs about t times. • 2nd most frequent word occurs about t/2 times. • 3rd most frequent word occurs about t/3 times. • 4th most frequent word occurs about t/4 times.
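To make the hypothesis concrete, the following sketch computes the Zipf prediction t/k for the first few ranks and prints it next to the Moby Dick counts listed on this slide. It is only an arithmetic illustration: the class name ZipfPrediction and method name predicted are hypothetical, and the five counts are copied from the slide's output.

```java
class ZipfPrediction {
    // Zipf's law: the word of rank k occurs about t/k times,
    // where t is the count of the most frequent word.
    static double predicted(int t, int rank) {
        return (double) t / rank;
    }

    public static void main(String[] args) {
        // Counts from the Moby Dick run on this slide, most frequent first.
        String[] words = { "the", "of", "and", "a", "to" };
        int[] observed = { 13967, 6415, 6247, 4583, 4508 };
        int t = observed[0];
        for (int k = 1; k <= words.length; k++)
            System.out.printf("%-4s rank %d  observed %6d  predicted %8.1f%n",
                              words[k - 1], k, observed[k - 1], predicted(t, k));
    }
}
```

Comparing the two columns is the validation step; doing so for a multi-million-word corpus is what calls for an efficient symbol table.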

Goal. Validate Zipf's law (the hypothesis) for real natural language data (the observation).

Method. % java Freq < data.txt | java Sort

Required. Efficient symbol-table implementation.

    % java Freq < mobydick.txt | java Sort
    ...
        1940 i
        2370 it
        2481 his
        2911 that
        4037 in
        4508 to
        4583 a
        6247 and
        6415 of
       13967 the

Benchmark statistics

Goal. Validate Zipf's law for real natural language data.

Method. % java Freq < data.txt | java Sort

    file               description              words         distinct
    mobydick.txt       Melville's Moby Dick     210,028       16,834
    leipzig100k.txt    100K random sentences    2,121,054     144,256
    leipzig200k.txt    200K random sentences    4,238,435     215,515
    leipzig1m.txt      1M random sentences      21,191,455    534,580

Reference: Wortschatz corpus, Universität Leipzig, http://corpora.informatik.uni-leipzig.de

Required. Efficient symbol-table implementation.

Strawman I: Ordered array

Idea
• Keep keys in order in an array.
• Keep values in a parallel array.

Reasons (see "Sorting and Searching" lecture)
• Takes advantage of fast sort (mergesort).
• Enables fast search (binary search).

Known challenge. How big to make the arrays?

Fatal flaw. How to insert a new key?
• To keep the key array in order, need to move larger entries, à la insertion sort.
• Hypothesis: Quadratic time for benchmark (easy to validate with experiments).

Example. Inserting craig (value 999) moves every larger entry one position.

    before              after
    keys     values     keys     values
    alice    121        alice    121
    bob      873        bob      873
    carlos   884        carlos   884
    carol    712        carol    712
    dave     585        craig    999
    erin     247        dave     585
    eve      577        erin     247
    oscar    675        eve      577
    peggy    895        oscar    675
    trent    557        peggy    895
    trudy    926        trent    557
    walter   51         trudy    926
    wendy    152        walter   51
                        wendy    152

Strawman II: Linked list

Idea
• Keep keys in order in a linked list.
• Add a value to each node.

Reason. Meets memory-use performance specification.

    alice 2 → bob 7 → carlos 1 → carol 8 → dave 2 → erin 8 → eve 1 → oscar 8 → peggy 2

Fatal flaw. How to search?
• Binary search requires indexed access.
• Example: How to access the middle of a linked list?
• Only choice: search sequentially through the list.
• Hypothesis: Quadratic time for benchmark (easy to validate with experiments).

Design challenge

Implement scalable symbol tables.

Goal. Simple, safe, clear, and efficient client code.

Only slightly more costly than stacks or queues!

Performance specifications
• Order of growth of running time for put(), get() and contains() is logarithmic.
• Memory use is proportional to the size of the collection, when it is nonempty.
• No limits within the code on the collection size.

Are such guarantees achievable? Can we implement associative arrays with just log-factor extra cost?

    phoneNumber["Alice"] = "(212) 123-4567"

No way!

This lecture. Yes way!

Doubly-linked data structures

With two links, a wide variety of data structures are possible.
• Doubly-linked list
• Binary tree (this lecture)
• Tree
• Doubly-linked circular list
• General case

From the point of view of a particular object, all of these structures look the same.

Maintenance can be complicated!

A doubly-linked data structure: binary search tree

Binary search tree (BST)
• A recursive data structure containing distinct comparable keys that is ordered.
• Def. A BST is null or a reference to a BST node (the root).
• Def. A BST node is a data type that contains references to a key, a value, and two BSTs, a left subtree and a right subtree.
• Ordered. All keys in the left subtree of each node are smaller than its key and all keys in the right subtree of each node are larger than its key.

A BST node

    private class Node
    {
        private Key key;
        private Value val;
        private Node left;     // link to left subtree
        private Node right;    // link to right subtree
    }

BST processing code

Standard operations for processing data structured as a binary search tree
• Search for the value associated with a given key.
• Add a new key-value pair.
• Traverse the BST (visit every node, in order of the keys).
• Remove a given key and associated value (not addressed in this lecture).

    root
                it 2
               /    \
          best 1    was 2
                    /
                the 2
                /    \
            of 1    times 1

BST processing code: Search

Goal. Find the value associated with a given key in a BST.
• If less than the key at the current node, go left.
• If greater than the key at the current node, go right.

Example. get("the")
• At it: the is GREATER, go right.
• At was: the is LESS, go left.
• At the: SEARCH HIT, return value.

    root
                it 2
               /    \
          best 1    was 2
                    /
                the 1
                /    \
            of 1    times 1

    public Value get(Key key)
    {  return get(root, key);  }

    private Value get(Node x, Key key)
    {
        if (x == null) return null;                  // search miss
        int cmp = key.compareTo(x.key);
        if      (cmp < 0) return get(x.left, key);
        else if (cmp > 0) return get(x.right, key);
        else              return x.val;              // search hit (cmp == 0)
    }

BST processing code: Associate a new value with a key

Goal. Associate a new value with a given key in a BST.
• If less than the key at the current node, go left.
• If greater than the key at the current node, go right.

Example. put("the", 2)
• At it: the is GREATER, go right.
• At was: the is LESS, go left.
• At the: SEARCH HIT, update value (1 becomes 2).

    root
                it 2
               /    \
          best 1    was 2
                    /
                the 2       (was the 1)
                /    \
            of 1    times 1

    public void put(Key key, Value val)
    {  root = put(root, key, val);  }

    private Node put(Node x, Key key, Value val)
    {
        if (x == null) return new Node(key, val);
        int cmp = key.compareTo(x.key);
        if      (cmp < 0) x.left  = put(x.left, key, val);
        else if (cmp > 0) x.right = put(x.right, key, val);
        else              x.val = val;
        return x;
    }

BST processing code: Add a new key

Goal. Add a new key-value pair to a BST.
• Search for key.
• Return link to new node when null reached.

Example. put("worst", 1)
• At it: worst is GREATER, go right.
• At was: worst is GREATER, go right.
• NULL reached: add new node.

    root
                it 2
               /    \
          best 1    was 2
                    /    \
                the 2    worst 1    (new node)
                /    \
            of 1    times 1

    public void put(Key key, Value val)
    {  root = put(root, key, val);  }

    private Node put(Node x, Key key, Value val)
    {
        if (x == null) return new Node(key, val);    // add new node
        int cmp = key.compareTo(x.key);
        if      (cmp < 0) x.left  = put(x.left, key, val);
        else if (cmp > 0) x.right = put(x.right, key, val);
        else              x.val = val;
        return x;
    }

BST processing code: Traverse the BST

Goal. Put keys in a BST on a queue, in sorted order.
• Do it for the left subtree.
• Put the key at the root on the queue.
• Do it for the right subtree.

    root
                it 2
               /    \
          best 1    was 2
                    /
                the 2
                /    \
            of 1    times 1

    public Iterable<Key> keys()
    {
        Queue<Key> queue = new Queue<Key>();
        inorder(root, queue);
        return queue;
    }

    private void inorder(Node x, Queue<Key> queue)
    {
        if (x == null) return;
        inorder(x.left, queue);
        queue.enqueue(x.key);
        inorder(x.right, queue);
    }

    Queue: best it of the times was

ADT for symbol tables: review

A symbol table is an idealized model of an associative storage mechanism.

An ADT allows us to write Java programs that use and manipulate symbol tables.

API

    public class ST<Key extends Comparable<Key>, Value>

                  ST()                          create a symbol table
             void put(Key key, Value val)       associate key with val
            Value get(Key key)                  return value associated with key, null if none
          boolean contains(Key key)             is there a value associated with key?
    Iterable<Key> keys()                        all the keys in the table

Performance specifications
• Order of growth of running time for put(), get() and contains() is logarithmic.
• Memory use is proportional to the size of the collection, when it is nonempty.
• No limits within the code on the collection size.

Symbol table implementation: Instance variables and constructor

Data structure choice. Use a BST to hold the collection.

    public class BST<Key extends Comparable<Key>, Value>
    {
        private Node root = null;       // instance variable

        private class Node
        {
            private Key key;
            private Value val;
            private Node left;
            private Node right;
        }
        ...
    }

    root
                it
               /  \
           best    was
                   /
                the
               /   \
             of    times

BST implementation: Test client (frequency counter)

    public static void main(String[] args)
    {
        BST<String, Integer> st = new BST<String, Integer>();
        while (!StdIn.isEmpty())
        {
            String key = StdIn.readString();
            if (st.contains(key)) st.put(key, st.get(key) + 1);
            else                  st.put(key, 1);
        }
        for (String s : st.keys())
            StdOut.printf("%8d %s\n", st.get(s), s);
    }

What we expect, once the implementation is done.

    % java BST < tale.txt
           2 age
           1 belief
           1 best
           1 darkness
           1 despair
           2 epoch
           1 foolishness
           1 hope
           1 incredulity
          10 it
           1 light
          10 of
           2 season
           1 spring
          10 the
           2 times
          10 was
           1 winter
           1 wisdom
           1 worst

BST implementation: Methods

Methods define data-type operations (implement the API).

    public class BST<Key extends Comparable<Key>, Value>
    {
        ...

        public boolean isEmpty()
        {  return root == null;  }

        public void put(Key key, Value value)
        {  /* See BST add slides and next slide. */  }

        public Value get(Key key)
        {  /* See BST search slide and next slide. */  }

        public boolean contains(Key key)
        {  return get(key) != null;  }

        public Iterable<Key> keys()
        {  /* See BST traverse slide and next slide. */  }

        ...
    }

BST implementation

    public class BST<Key extends Comparable<Key>, Value>
    {
        private Node root = null;                      // instance variable

        private class Node                             // nested class
        {
            private Key key;
            private Value val;
            private Node left;
            private Node right;

            private Node(Key key, Value val)           // constructor needed by put()
            {  this.key = key; this.val = val;  }
        }

        // public methods

        public boolean isEmpty()
        {  return root == null;  }

        public void put(Key key, Value val)
        {  root = put(root, key, val);  }

        public Value get(Key key)
        {  return get(root, key);  }

        public boolean contains(Key key)
        {  return get(key) != null;  }

        public Iterable<Key> keys()
        {
            Queue<Key> q = new Queue<Key>();
            inorder(root, q);
            return q;
        }

        // private methods

        private Value get(Node x, Key key)
        {
            if (x == null) return null;
            int cmp = key.compareTo(x.key);
            if      (cmp < 0) return get(x.left, key);
            else if (cmp > 0) return get(x.right, key);
            else              return x.val;
        }

        private Node put(Node x, Key key, Value val)
        {
            if (x == null) return new Node(key, val);
            int cmp = key.compareTo(x.key);
            if      (cmp < 0) x.left  = put(x.left, key, val);
            else if (cmp > 0) x.right = put(x.right, key, val);
            else              x.val = val;
            return x;
        }

        private void inorder(Node x, Queue<Key> q)
        {
            if (x == null) return;
            inorder(x.left, q);
            q.enqueue(x.key);
            inorder(x.right, q);
        }

        public static void main(String[] args)         // test client
        {  /* Frequency counter */  }
    }

Trace of BST construction

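Keys inserted in the order: it was the best of times

    key inserted    where the new node goes
    it              root
    was             right child of it
    the             left child of was
    best            left child of it
    of              left child of the
    times           right child of the

Resulting BST

                it
               /  \
           best    was
                   /
                the
               /   \
             of    times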

BST analysis

Costs depend on order of key insertion.

Best case

                 the
               /     \
             it       was
            /  \     /    \
        best    of  times  worst

Typical case

                it
               /  \
           best    was
                  /    \
               the     worst
              /   \
            of    times

Worst case

    best → it → of → the → times → was → worst    (every link is a right link)

BST insertion: random order visualization

Insert keys in random order. • Tree is roughly balanced. • Tends to stay that way!

BST analysis

Running time depends on order of key insertion.

Model. Insert keys in random order. • Tree is roughly balanced. • Tends to stay that way!

Proposition. Building a BST by inserting N randomly ordered keys into an initially empty tree uses ~2 N ln N (about 1.39 N lg N ) compares.

Proof. A very interesting exercise in discrete math.

Interested in details? Take a course in algorithms.
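The proposition is easy to check empirically. The sketch below is an illustration under stated assumptions, not the lecture's code: it uses int keys, an iterative insert in place of the recursive put(), a fixed pseudo-random seed, and a hypothetical class name CompareCount. It counts one compare per node visited and prints the total next to 2 N ln N.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class CompareCount {
    private static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;
    private long compares;   // key compares made by all insertions so far

    // Iterative equivalent of the recursive put(): one compare per node visited.
    void insert(int key) {
        if (root == null) { root = new Node(key); return; }
        Node x = root;
        while (true) {
            compares++;
            if (key < x.key) {
                if (x.left == null) { x.left = new Node(key); return; }
                x = x.left;
            } else if (key > x.key) {
                if (x.right == null) { x.right = new Node(key); return; }
                x = x.right;
            } else return;   // duplicate key: nothing to add
        }
    }

    // Inserts the keys 0..n-1 in pseudo-random order and returns the compare count.
    static long comparesForRandomOrder(int n, long seed) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < n; i++) keys.add(i);
        Collections.shuffle(keys, new Random(seed));
        CompareCount bst = new CompareCount();
        for (int key : keys) bst.insert(key);
        return bst.compares;
    }

    public static void main(String[] args) {
        int n = 100000;
        long measured = comparesForRandomOrder(n, 42L);
        double model = 2.0 * n * Math.log(n);
        System.out.printf("measured %d, 2 N ln N = %.0f, ratio %.2f%n",
                          measured, model, measured / model);
    }
}
```

The tilde in ~2 N ln N hides a lower-order term proportional to N, so for moderate N the measured count typically falls somewhat below the leading term.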

Benchmarking the BST implementation

BST implements the associative-array abstraction for randomly ordered keys.

Symbol table API (for random keys, but stay tuned)

    public class ST<Key extends Comparable<Key>, Value>

                  ST()                           create a symbol table
             void put(Key key, Value value)      associate key with value
            Value get(Key key)                   return value associated with key, null if none
          boolean contains(Key key)              is there a value associated with key?
    Iterable<Key> keys()                         all the keys in the table (sorted) ✓

Performance specifications
• Order of growth of running time for put(), get() and contains() is logarithmic. ✓
• Memory use is proportional to the size of the collection, when it is nonempty. ✓
• No limits within the code on the collection size. ✓

Made possible by binary tree data structure.

                it
               /  \
           best    was
                  /    \
               the     worst
              /   \
            of    times

Empirical tests of BSTs

Count number of words that appear more than once in StdIn (frequency count without the output).

    % java Generator 1000000 ...
    263934 (5 seconds)
    % java Generator 2000000 ...
    593973 (9 seconds)
    % java Generator 4000000 ...
    908795 (17 seconds)
    % java Generator 8000000 ...
    996961 (34 seconds)
    % java Generator 16000000 ...
    999997 (72 seconds)

    where ... = 6 0123456789 | java DupsBST    (6-digit integers)

    N             T_N (seconds)    T_N / T_(N/2)
    1 million     5
    2 million     9                1.8
    4 million     17               1.9
    8 million     34               2
    16 million    72               2.1
    ...
    1 BILLION     4608             2

Confirms hypothesis that order of growth is N log N.

Easy to process 21M word corpus. NOT possible without BSTs.

WILL scale.

Performance guarantees

Practical problem. Keys may not be randomly ordered. • BST may be unbalanced. • Running time may be quadratic. • Happens in practice (insert keys in order).

Remarkable resolution. • Balanced tree algorithms perform simple transformations that guarantee balance. • AVL trees (Adelson-Velskii and Landis, 1962) proved concept. • Red-black trees (Guibas and Sedgewick, 1979) are implemented in many modern systems.
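The practical problem is easy to observe. The following sketch (an illustration only, with int keys and a hypothetical class name InsertionOrderHeight) builds an unbalanced BST from the same keys in sorted order and in shuffled order and reports the number of nodes on the longest root-to-leaf path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class InsertionOrderHeight {
    private static class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    // Builds an unbalanced BST from keys in the given order and returns the
    // number of nodes on its longest root-to-leaf path.
    static int height(List<Integer> keys) {
        Node root = null;
        int height = 0;
        for (int key : keys) {
            if (root == null) { root = new Node(key); height = 1; continue; }
            Node x = root;
            int depth = 1;                       // nodes on the path so far, including x
            while (true) {
                if (key < x.key) {
                    if (x.left == null) { x.left = new Node(key); depth++; break; }
                    x = x.left;
                } else if (key > x.key) {
                    if (x.right == null) { x.right = new Node(key); depth++; break; }
                    x = x.right;
                } else { depth = 0; break; }     // duplicate key: no new node
                depth++;                         // moved one level down
            }
            height = Math.max(height, depth);
        }
        return height;
    }

    public static void main(String[] args) {
        int n = 10000;
        List<Integer> sorted = new ArrayList<>();
        for (int i = 0; i < n; i++) sorted.add(i);
        List<Integer> shuffled = new ArrayList<>(sorted);
        Collections.shuffle(shuffled, new Random(42L));
        System.out.println("sorted order:   height " + height(sorted));
        System.out.println("shuffled order: height " + height(shuffled));
    }
}
```

With keys in sorted order every new node becomes a right child, so the height equals the number of keys and each insertion walks the whole path; that is the quadratic behavior that balanced trees rule out.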

Red-black tree insertion: random order visualization

Insert keys in random order. • Same # of black links on every path from root to leaf. • No two red links in a row. • Tree is roughly balanced. • Guaranteed to stay that way!

ST implementation with guaranteed logarithmic performance

    import java.util.TreeMap;

    public class ST<Key extends Comparable<Key>, Value>
    {
        private TreeMap<Key, Value> st = new TreeMap<Key, Value>();

        public void put(Key key, Value val)
        {
            if (val == null) st.remove(key);
            else             st.put(key, val);
        }
        public Value get(Key key)           {  return st.get(key);  }
        public Value remove(Key key)        {  return st.remove(key);  }
        public boolean contains(Key key)    {  return st.containsKey(key);  }
        public Iterable<Key> keys()         {  return st.keySet();  }
    }

Java's TreeMap library uses red-black trees. Several other useful operations are also available.

Proposition. In a red-black tree of size N, put(), get() and contains() are guaranteed to use fewer than 2 lg N compares.

Proof. A fascinating exercise in algorithmics.

Interested in details? Take a course in algorithms.

Summary

BSTs. Simple symbol-table implementation, usually efficient.

Red-black trees. More complicated variation, guaranteed to be efficient.

Applications. Many, many, many things are enabled by efficient symbol tables.

Example. Search among 1 trillion customers with fewer than 80 compares! (2 lg 10^12 is about 79.7.)

Example. Search among all the atoms in the universe with fewer than 550 compares! (Taking roughly 10^80 atoms, 2 lg 10^80 is about 532.)

Can we implement associative arrays with just log-factor extra cost?

YES!
