Compsci 201 Priority Queues & Autocomplete

Owen Astrachan
Jeff Forbes
November 15, 2017

11/15/17 Compsci 201, Fall 2017, PQ + Compare 1

U is for …
• URL and URI
• Uniform Resource (Locator and Identifier)
• Usenet
• p2p original source of FAQ, Flame, Spam
• Unix
• Before there was Linux, …
• User Interface, UI, UX
• User is the heart and soul

Plan for the Day
• Where are we? Where are we going? What’s left?
• Review of PQs+Heaps: implementation and API
• PQ API is a key to Autocomplete
• Software Design and Software Testing
• Autocomplete, algorithms, trade-offs
• Testing, debugging, understanding

Work [todate | todo]
• APTQuiz2: Median 30, Mean 24.6
• APTQuiz1: Median 30, Mean 25.5
• Midterm 1: Median 78%, Mean 75%
• Midterm 2: Median 87%, Mean 84%
• Assignments: 5 of 6 out
• APT: one more set, one more quiz

One path, two paths

Heap Review
• Used to implement priority queues efficiently
• Binary tree implemented in array: indexes!
• Heap shape and heap property; children of index k at 2*k and 2*k+1
• Minimal element: O(1) peek and O(log N) poll
• Change definition of min: max-heap!
• How to compare elements?
• Comparable or Comparator!

A Sore by any other name …
• String implements Comparable<String>
• We know there’s a .compareTo method
• What if we want to change how strings are compared, e.g., length or lowercase or …
• Sometimes we don’t have access to class, or sub-classing not a good idea
• Some classes are final, no sub-classing!
• Some classes not designed for sub-classing

Comparator or Raptor Coma?
• Must implement .compare(T a, T b)
• Different than .compareTo, similar as well
• Return < 0 when a < b
• Return == 0 when a == b
• Return > 0 when a > b
• You write code to determine what this means!
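
The .compare contract on the Comparator slide can be made concrete with a small example. This is a minimal sketch, not code from the course repository: the names LengthComparatorDemo, ByLength, and sortedByLength are made up for illustration. It orders strings by length and breaks ties with String's own compareTo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical example class; not part of the course code.
class LengthComparatorDemo {

    // Orders strings by length; equal lengths fall back to natural order.
    static class ByLength implements Comparator<String> {
        @Override
        public int compare(String a, String b) {
            int diff = Integer.compare(a.length(), b.length());
            if (diff != 0) {
                return diff;        // < 0 when a is shorter, > 0 when a is longer
            }
            return a.compareTo(b);  // same length: alphabetical
        }
    }

    // Returns a sorted copy so the caller's list is left unchanged.
    static List<String> sortedByLength(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, new ByLength());
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("pear", "fig", "banana", "kiwi");
        System.out.println(sortedByLength(words)); // [fig, kiwi, pear, banana]
    }
}
```

String itself is untouched here, which is the point of the previous slide: the ordering lives in a separate object, so it works even for final classes.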
Something Old, Something New
• Create class that implements Comparator<T>
• Write the .compare(T a, T b) method
• Create Comparator object, use it
• Java 8: New tricks: Create classes “anonymously” by calling Comparator.comparing
• Pass method names as parameters
https://coursework.cs.duke.edu/201fall17/sortall/blob/master/src/PersonSorter.java

PersonSorter.java
    static class Person implements Comparable<Person> {
        String first;
        String last;
        public String getLast(){
            return last;
        }
        // more here
https://coursework.cs.duke.edu/201fall17/sortall/blob/master/src/PersonSorter.java
• Changing how people are compared (run it)
    Comparator<Person> comp =
        Comparator.comparing(Person::getFirst)
                  .thenComparing(Person::getLast);
    Collections.sort(list,comp);

WOTO
http://bit.ly/201fall17-nov15-compare
• What is Comparator.comparing?
• Creating Comparator at runtime

It's time for Autocomplete

What is Autocomplete?
• 40,000 queries/second, thousands of computers, 0.2 seconds to answer query

Geolocating Heaven…

Data Structure for Autocomplete
• We'd like the "best" or "top" matching Terms
• Each Term is a (word, weight) pair
• If we sort by weight, we get the best easily!
• Priority Queues help
• Find "best" element
• Comparator!
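
A rough sketch of how a (word, weight) Term, a Comparator, and a priority queue could fit together. The Term class here is a stand-in written for this example, and TermSketch, heaviestFirst, and heaviestWord are hypothetical names. The Term class used in the assignment may look different.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; the assignment's real Term class may differ.
class TermSketch {

    // A (word, weight) pair.
    static class Term {
        private final String word;
        private final double weight;

        Term(String word, double weight) {
            this.word = word;
            this.weight = weight;
        }

        String getWord() { return word; }
        double getWeight() { return weight; }
    }

    // Built at runtime from a method reference; reversed so heavier terms come first.
    static Comparator<Term> heaviestFirst() {
        Comparator<Term> byWeight = Comparator.comparingDouble(Term::getWeight);
        return byWeight.reversed();
    }

    // Puts every term in a priority queue and peeks at the "best" one.
    static String heaviestWord(List<Term> terms) {
        PriorityQueue<Term> pq = new PriorityQueue<>(heaviestFirst());
        pq.addAll(terms);
        Term best = pq.peek();
        return best == null ? null : best.getWord();
    }

    public static void main(String[] args) {
        List<Term> terms = Arrays.asList(
            new Term("bee", 10), new Term("beet", 70), new Term("beer", 30));
        System.out.println(heaviestWord(terms)); // beet
    }
}
```

The comparator alone decides what "best" means; swapping in a different comparator changes which Term the queue hands back first without touching Term.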
Tradeoffs in Autocomplete
• Bruteforce: look at every (word, weight) pair
• Find heaviest/best ones after looking at all N
• Binary Search: search efficiently for prefixes
• Find these candidates, choose best from M
• Trie Search: search really efficiently for prefixes
• Search tree-like structure, choose best from M

Overview of Approaches
• Have N total terms, want k best matches from M
• N is millions; k is 10’s; M is hundreds…
• We want k "heav…" prefix matches, from M of N
• We can organize the N elements to find M
• ITRW (in the real world) we'd do lots/different organizing
• Bruteforce: search for M matching terms, sort M
• O(N) to search, O(M log M) to sort, get top k

How to get organized
https://www.youtube.com/watch?v=1ve57l3c19g

Bruteforce made smarter
• Suppose we insert all N elements into a priority queue ordered by weight, limited to M elements
• PQ contains M elements, minimum first
• Default PQ in Java, can specify comparator
• Example: M = 4, add 10,70,30,40,20,60,50
• After adding 20, drop using pq.remove()
• After adding 60, drop using pq.remove()
• Contains largest M elements seen so far

Quantifying Improvements
• PQ changes O(N + M log M) to O(N log M)
• Find M, then sort versus PQ insertion. Doesn't take constant factors into account
• Still typically faster when N > M
https://coursework.cs.duke.edu/201fall17/sortall/blob/master/src/TopMSorting.java
• Where are heaviest/best when sorting?
• Last M: Collections.sort(...)
• First M: .sort(...,comp.reversed());

Sort N take top k
• Why use reversed comparator? Alternative?
• If we used comp, greatest/top k at end
• Could still use subList to return these!
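
The alternative raised in the bullets above, sorting with comp itself and taking the tail with subList, might look like the sketch below. TopMTailDemo and sortTopMFromEnd are hypothetical names, and clamping mSize to the list size is an added assumption rather than something the slides specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the "no reversed comparator" alternative.
class TopMTailDemo {

    // Sorts ascending with comp, so the greatest elements end up at the end,
    // then copies the last mSize elements and flips them so the best is first.
    static List<String> sortTopMFromEnd(List<String> list, int mSize,
                                        Comparator<String> comp) {
        List<String> copy = new ArrayList<>(list);
        Collections.sort(copy, comp);
        int m = Math.min(mSize, copy.size());   // assumption: clamp oversized requests
        List<String> top = new ArrayList<>(copy.subList(copy.size() - m, copy.size()));
        Collections.reverse(top);
        return top;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("d", "a", "c", "b", "e");
        System.out.println(sortTopMFromEnd(words, 2, Comparator.<String>naturalOrder())); // [e, d]
    }
}
```

Both approaches cost one O(N log N) sort. The reversed comparator just saves the index arithmetic and the extra reverse.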
https://coursework.cs.duke.edu/201fall17/sortall/blob/master/src/TopMSorting.java
    public static List<String> sortTopM(List<String> list,
                                        int mSize,
                                        Comparator<String> comp){
        List<String> copy = new ArrayList<>(list);
        Collections.sort(copy,comp.reversed());
        return copy.subList(0, mSize);
    }

Use limited-PQ for top k
https://coursework.cs.duke.edu/201fall17/sortall/blob/master/src/TopMSorting.java
• Remove/poll min when pq size exceeds mSize
• Why do we remove smallest seen so far?
• Why do we use LinkedList and addFirst?
    public static List<String> pqTopM(List<String> list,
                                      int mSize,
                                      Comparator<String> comp) {
        PriorityQueue<String> pq = new PriorityQueue<>(comp);
        for(String s : list) {
            pq.add(s);
            if (pq.size() > mSize)
                pq.remove();
        }
        LinkedList<String> ret = new LinkedList<>();
        while (pq.size() > 0)
            ret.addFirst(pq.remove());
        return ret;
    }

No organization in BruteForce
• Always search through all N (word, weight) terms
• Even to find the best 10, or the best 100 of M
• New Query? Search again, no improvement
• We can organize data to facilitate prefix search!
• Binary search through a sorted list
• Trie data structure to help with prefixes
• Both better in theory, in practice? It depends

Binary Search in Autocomplete
• Given "beenie" and prefix of 3, find M matches
• Find first "bee.." and last "bee.."
• Sort these M elements by weight! Done
• O(log N) to find first and last, O(M log M) to sort

BinarySearch (Autocomplete)
• Sort all N elements: cost O(N log N)
• Find the M prefix matches, (weight,word) pairs, in O(log N) using binary search for firstIndex and lastIndex (matches are adjacent)
• Take the top k of these
• More queries? Sort once! Sorting cost amortized
• Carefully code binary search to find first/last
• Top k after sorting: O(M log M)
• Use limited PQ?: O(M log k)

Summary of Two Approaches
• Bruteforce: O(N + M log M) or O(N log M)
• Binary search: O(log N + M log M)*
• Requires initial sort of O(N log N), only once!
• We are willing to sort once, recoup $$ over time
• Which is better?
• What if we do LOTS of queries? Q?
• Comparing QN to Q log N

One compare, cut list in half!
binary search

Finding the firstIndex
• Use Collections.binarySearch
• Code below doesn't check index < 0
• Why is this O(N) in worst case?
    public static int firstIndex(String[] values,
                                 String target,
                                 Comparator<String> comp) {
        List<String> list = Arrays.asList(values);
        int index = Collections.binarySearch(list,target,comp);
        // walk left one element at a time past every equal element
        while (0 <= index &&
               comp.compare(list.get(index),target) == 0) {
            index -= 1;
        }
        return index+1;
    }

Start with code and change, …
• Do not use this reference to achieve O(log N)
http://stackoverflow.com/questions/6676360/first-occurrence-in-a-binary-search
• One idea: find standard code and mess with it until it works

How to develop loops
• David Gries: The Science of Programming
• Edsger Dijkstra: A Discipline of Programming

Reasoning about code
https://coursework.cs.duke.edu/201fall17/sortall/blob/master/src/Looper.java
A. Runs Forever
B. Exhausts Memory and stops
C. Prints ~ (2 billion)
D. Prints ~ (-2 billion)
    public class Looper {
        public static void main(String[] args){
            int x = 0;
            while (x < x + 1) {
                x = x + 1;
            }
            System.out.println("value of x = "+x);
        }
    }

Reasoning with Logic
• While loop test is a boolean expression
• Negation must be true when loop exits, why?
• Other boolean expressions aid loop development
• Loop invariant: true when loop test checked
• Use invariant and loop guard to develop loop
• Reason semi-formally about loops

Better than late-night coding?
• Proving code correct?
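
One way to apply the invariant-plus-guard idea to the firstIndex problem from the earlier slide. This is a sketch under stated assumptions, not the course's reference solution: FirstIndexSketch is a hypothetical name, the list is assumed to be sorted according to comp, and the O(log N) bound assumes a random-access list such as the one returned by Arrays.asList.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical class name; a sketch, not the course's reference solution.
class FirstIndexSketch {

    // Returns the index of the first element that compares equal to target,
    // or -1 if there is none. Assumes list is sorted according to comp.
    static <T> int firstIndex(List<T> list, T target, Comparator<? super T> comp) {
        int low = 0;
        int high = list.size();
        // Invariant: every element before low compares less than target,
        // and every element at or after high compares >= target.
        while (low < high) {
            int mid = low + (high - low) / 2;
            if (comp.compare(list.get(mid), target) < 0) {
                low = mid + 1;   // mid and everything before it is too small
            } else {
                high = mid;      // mid is >= target, so the answer is at or before mid
            }
        }
        // Guard is false, so low == high: the first position holding something >= target.
        if (low < list.size() && comp.compare(list.get(low), target) == 0) {
            return low;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("ape", "bee", "bee", "bee", "cat");
        System.out.println(firstIndex(words, "bee", Comparator.<String>naturalOrder())); // 1
    }
}
```

Each iteration halves the range between low and high, so a list in which every element matches still takes O(log N) comparisons, unlike the linear walk left in the Collections.binarySearch version.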
