Nearest neighbor searching and priority queues

Nearest Neighbor Search
• Given: a set P of n points in Rd
• Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P, or the k nearest neighbors

Variants of nearest neighbor
• Near neighbor (range search): find one/all points in P within distance r from q
• Spatial join: given two sets P, Q, find all pairs p in P, q in Q, such that p is within distance r from q
• Approximate near neighbor: find one/all points p' in P whose distance to q is at most (1+ε) times the distance from q to its nearest neighbor

Solutions

Depends on the value of d:
• low d: graphics, GIS, etc.
• high d:
 – similarity search in databases (text, images, etc.)
 – finding pairs of similar objects (e.g., copyright violation detection)

Nearest neighbor search in documents

• How could we represent documents so that we can define a reasonable distance between two documents?
• Vector of word frequency occurrences
 – Probably want to get rid of useless words that occur in all documents
 – Probably need to worry about synonyms and other details of language
 – But basically, we get a VERY long vector
• And maybe we ignore the frequencies and just identify with a "1" the words that occur in some document.

Nearest neighbor search in documents

• One reasonable measure of distance between two documents is just a count of the words they share – this is just the pointwise product of the two vectors when we ignore counts.
• Easy enough to compute for a pair of documents, but suppose our document database contains millions of documents. How can we solve the nearest neighbor problem FAST?

Algorithms
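The shared-word measure above is easy to sketch. This is an illustrative example, not code from the slides: documents are reduced to sets of words, so the count of shared words is exactly the pointwise product of the two 0/1 word vectors.

```python
# Sketch (assumed representation): each document becomes the set of words
# it contains; similarity is the number of distinct shared words, i.e. the
# pointwise product of the two 0/1 word vectors.
def shared_words(doc_a: str, doc_b: str) -> int:
    """Count the distinct words the two documents share."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    return len(words_a & words_b)

print(shared_words("the cat sat on the mat", "a cat on a hat"))  # 2
```

Note this is a similarity (bigger is closer), so a nearest-neighbor search would maximize it; a real system would also strip stop words, as the slide suggests.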

• Main memory
 – linear scan
 – tree-based:
   • kd-tree
 – hashing-based: Locality-Sensitive Hashing
• Secondary storage (Databases)
 – R-tree (and numerous variants)
 – Vector Approximation File (VA-file)

Nearest neighbors in k-d trees

Make a guess about the nearest neighbor of the star.

Nearest neighbors in k-d trees
ub, the radius of the circle, is the upper bound on the distance to the nearest neighbor

• Establishing an upper bound ub lets us prune parts of the tree which cannot hold the true nearest neighbor.
• In particular, this circle is entirely to the right of the splitting line running through the root of the tree. So, any point to the left of the root cannot be in the candidate circle, and so can't be any better than our current guess.
 – Once we have a guess about where the nearest neighbor is, we can start eliminating parts of the tree where the actual answer cannot be.
• This general technique of searching a large space and pruning options based on partial results is called branch-and-bound.

Nearest neighbors in k-d trees

• It is easy to tell where this circle is with respect to the splitting line y = y0 passing through the k-d tree point: a circle of radius r1 centered at (x1, y1) with y1 + r1 < y0 lies entirely below the line, while a circle of radius r2 centered at (x2, y2) with y2 + r2 > y0 crosses it.

Nearest neighbors in k-d trees
• Let the query point be q = (a1, a2).
• Maintain a global best estimate of the nearest neighbor, called 'guess'.
• Maintain a global value of the distance to that neighbor, called 'bestDist'.
• Set 'guess' to NULL.
• Set 'bestDist' to infinity.
Starting at the root, execute the following procedure:

    if curr == NULL
        return
    /* If the current location is better than the best known location,
       update the best known location. */
    if distance(curr, q) < bestDist
        bestDist = distance(curr, q)
        guess = curr
    /* Recursively search the half of the tree that contains the test point. */
    if a_i < curr_i
        recursively search the left subtree on the next axis
    else
        recursively search the right subtree on the next axis
    /* If the candidate circle crosses this splitting plane, look on the other
       side of the plane by examining the other subtree. */
    if |curr_i - a_i| < bestDist
        recursively search the other subtree on the next axis

• The procedure works by walking down to a leaf of the k-d tree as if searching for the test point.
• As we start unwinding the recursion and walking back up the tree, check whether each node is better than the best estimate we have so far.
 – If so, update the best estimate to be the current node.
• Finally, check whether the candidate circle based on the current guess could cross the splitting line of the current node. If not, eliminate all points on the other side of the splitting line and walk back up to the next node in the tree. Otherwise, look in that side of the tree to see if there are any closer points.

Suppose we want more than 1 nearest neighbor?
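The 1-NN procedure above can be sketched in Python for 2-d points. The node layout, the median-split builder, and all names here are illustrative assumptions, not the course's code:

```python
import math

# A minimal 1-NN search over a 2-d k-d tree, following the branch-and-bound
# procedure above: descend toward the query, update the best guess on the
# way back up, and cross a splitting line only if the candidate circle
# reaches it.
class Node:
    def __init__(self, point, left=None, right=None):
        self.point, self.left, self.right = point, left, right

def build(points, depth=0):
    """Build a k-d tree by splitting on the median, alternating axes."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid],
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(root, q):
    best = {"guess": None, "dist": math.inf}

    def search(curr, depth):
        if curr is None:
            return
        axis = depth % 2
        d = math.dist(curr.point, q)
        # Update the best known location if the current node is closer.
        if d < best["dist"]:
            best["guess"], best["dist"] = curr.point, d
        # Search the half of the tree containing the query point first.
        near, far = ((curr.left, curr.right) if q[axis] < curr.point[axis]
                     else (curr.right, curr.left))
        search(near, depth + 1)
        # Only cross the splitting plane if the candidate circle reaches it.
        if abs(curr.point[axis] - q[axis]) < best["dist"]:
            search(far, depth + 1)

    search(root, 0)
    return best["guess"]

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))  # (8, 1)
```

The pruning step never discards a possible answer: a subtree is skipped only when every point in it is provably farther than the current best.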

• Find the k nearest neighbors (kNN) of a query point in the k-d tree (sorry about using k in two different ways!)
• The algorithm uses a data structure called a bounded priority queue (or BPQ for short).
• A bounded priority queue stores a fixed number of entries, each of which has a key and a priority (lower is better).
• When you add a new element to the BPQ and the BPQ is full, you eject the entry with maximum priority (which might be the new one).
 – If we have not reached the bound, then we just insert the new element in its appropriate location.

kNN searching
• There are two changes to this algorithm that differentiate it from the initial 1-NN search algorithm.
1. First, when determining whether to look on the opposite side of the splitting plane, we use as the radius of the candidate circle the distance from the test point to the maximum-priority point in the BPQ. The rationale behind this is that when finding the k nearest neighbors, our candidate circle for the k nearest points needs to encompass all k of those neighbors, not just the closest.
2. The other main change is that when we consider whether to look on the opposite side of the splitting plane, our decision takes into account whether the BPQ contains at least k points.
 – This is extremely important! If we prune out parts of the tree before we have made at least k guesses, we might accidentally throw out one of the closest points.

K-NN search
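A bounded priority queue can be sketched with Python's heapq module. The class and method names here are illustrative assumptions; negating priorities turns heapq's min-heap into the max-heap we need for ejecting the worst entry:

```python
import heapq

# A bounded priority queue sketch: keep at most `bound` entries; when full,
# adding ejects whichever entry has the largest priority (possibly the new
# one). Internally a max-heap via negated priorities.
class BoundedPQ:
    def __init__(self, bound):
        self.bound = bound
        self._heap = []                   # stores (-priority, key)

    def add(self, key, priority):
        if len(self._heap) < self.bound:
            heapq.heappush(self._heap, (-priority, key))
        elif priority < self.max_priority():
            # Eject the current worst entry and insert the new one.
            heapq.heapreplace(self._heap, (-priority, key))

    def max_priority(self):
        return -self._heap[0][0]          # largest priority currently stored

    def full(self):
        return len(self._heap) >= self.bound

    def items(self):
        return sorted((-p, k) for p, k in self._heap)

bpq = BoundedPQ(2)
for key, dist in [("a", 5.0), ("b", 1.0), ("c", 3.0), ("d", 4.0)]:
    bpq.add(key, dist)
print(bpq.items())  # [(1.0, 'b'), (3.0, 'c')] -- the two best survive
```

In the kNN search, `priority` would be the distance from the query to a candidate point, `max_priority()` is the candidate-circle radius, and `full()` answers "do we have k guesses yet?".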

• Perform a 2-NN lookup for the star.
• Recursively check the left subtree of the splitting plane, and find the blue point as a candidate nearest neighbor. Since we haven't found two nearest neighbors yet, we still need to look on the other side of the splitting plane for more neighbors, even though the candidate circle does not cross the splitting line.

Priority Queue

• A priority queue stores a collection of items
• An item is a pair: (key, element)
• Main methods:
 – insert(key, element): inserts an item with the specified key and element
 – removeMin(): removes the item with the smallest key and returns the associated element
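The two main methods map directly onto Python's heapq module (a binary min-heap over a list); this is an illustrative sketch, not the course's code:

```python
import heapq

# insert(key, element) and removeMin() expressed with heapq: items are
# (key, element) pairs, and heapq orders the list by key.
pq = []
heapq.heappush(pq, (5, "write report"))   # insert(key, element)
heapq.heappush(pq, (2, "fix bug"))
heapq.heappush(pq, (9, "lunch"))
key, element = heapq.heappop(pq)          # removeMin()
print(element)  # fix bug -- the element with the smallest key
```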

Monday, March 30, 15

Priority Queue Implementations

Implementation         add       removeMin
Unsorted Array         O(1)      O(n)
Sorted Array           O(n)      O(1)
Unsorted Linked List   O(1)      O(n)
Sorted Linked List     O(n)      O(1)
Hash Table             O(1)      O(n)
Heap                   O(log n)  O(log n)

Binary heap implementation of priority queues
• A binary heap (or heap) is a complete binary tree having the following heap-order property:
 – for every node X, the key in the parent of X is smaller than the key at X.
• Heaps are stored using the sequential representation of complete binary trees
• The smallest element is at the root of the heap
Example heap (level order): 13; 21 16; 24 31 19 68; 65 26 32

Insertion of x into a binary heap
• Create a hole in the next available location
• If x can be placed in the hole, finished
• Otherwise, percolate x up into its parent's location and recurse
• Terminate if x reaches the root.

Example - Insert 14

Inserting 14 into the heap 13; 21 16; 24 31 19 68; 65 26 32: a hole is created in the next available location, and 14 percolates up past 31 and then past 21, yielding the heap 13; 14 16; 24 21 19 68; 65 26 32 31.

Code for insertion

• Place a small element in position 0 of the heap to avoid testing for the root
 – this value is known as a sentinel
• The routine does not use swaps as it percolates up
 – percolating up using swaps would require 3d assignments for d percolates
 – the code shown uses d+1 assignments

Code for insertion

procedure insert (x: element to be inserted; H: priority queue);
var i: integer;
begin
  if H.size = Maximum then
    error
  else begin
    H.size := H.size + 1;
    i := H.size;
    while H.element[i div 2] > x do begin
      H.element[i] := H.element[i div 2];  { move that value down }
      i := i div 2                          { this is now an empty heap location }
    end;
    H.element[i] := x
  end
end;
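The same structure can be sketched in Python: a 1-indexed list with a sentinel at position 0, percolate-up on insert (d+1 assignments, no swaps), plus the matching delete-min that percolates the last key down. Class and method names are my own, not the course's:

```python
import math

# Binary heap sketch mirroring the slides: 1-indexed array, sentinel at 0.
class BinaryHeap:
    def __init__(self):
        self.a = [-math.inf]            # sentinel: avoids testing for the root

    def insert(self, x):
        self.a.append(x)                # hole in the next available location
        i = len(self.a) - 1
        while self.a[i // 2] > x:       # percolate the hole up, no swaps
            self.a[i] = self.a[i // 2]  # move the parent's value down
            i //= 2
        self.a[i] = x

    def delete_min(self):
        root = self.a[1]                # key at root is always deleted
        x = self.a.pop()                # last key moves toward the root
        n = len(self.a) - 1
        if n >= 1:                      # heap still non-empty: percolate down
            i = 1
            while 2 * i <= n:
                c = 2 * i               # pick the smaller child
                if c < n and self.a[c + 1] < self.a[c]:
                    c += 1
                if self.a[c] < x:
                    self.a[i] = self.a[c]
                    i = c
                else:
                    break
            self.a[i] = x
        return root

h = BinaryHeap()
for k in [13, 21, 16, 24, 31, 19, 68, 65, 26, 32]:
    h.insert(k)
h.insert(14)          # percolates up past 31 and 21, as in the slides
print(h.a[1:])        # [13, 14, 16, 24, 21, 19, 68, 65, 26, 32, 31]
print(h.delete_min()) # 13
```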

Delete-min
• Key at root is always deleted
• Move last key, x, in heap into root
• Percolate down until it is smaller than both of its children
 – if x is smaller than both of its children, halt
 – otherwise swap x with its smaller child and repeat

Example: the last key, 32, is moved into the root of the heap 32; 21 16; 24 31 19 68; 65 26. It is swapped with its smaller child 16, then with its smaller child 19, yielding the heap 16; 21 19; 24 31 32 68; 65 26.

Building a heap
• A heap can be built from n keys in O(n) time
• Insert the keys in any order, maintaining the structure property (complete BT)
• Then percolate keys down from "bottom" to "top"
 – percolating a node down can only take time proportional to the height of the node
 – But the "total" height of a complete BT is O(n)

Example
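The bottom-up construction can be sketched as follows (function names are illustrative; the 1-indexed array matches the slides' representation):

```python
# Build-heap sketch: place all keys in the array, then percolate down from
# the last internal node up to the root -- O(n) total work, since a node's
# percolate-down costs only its height.
def percolate_down(a, i, n):
    x = a[i]
    while 2 * i <= n:
        c = 2 * i                       # pick the smaller child
        if c < n and a[c + 1] < a[c]:
            c += 1
        if a[c] < x:
            a[i] = a[c]
            i = c
        else:
            break
    a[i] = x

def build_heap(keys):
    a = [None] + list(keys)             # 1-indexed, as in the slides
    n = len(a) - 1
    for i in range(n // 2, 0, -1):      # bottom-up: last internal node first
        percolate_down(a, i, n)
    return a[1:]

heap = build_heap([150, 80, 40, 30, 10, 70, 110,
                   100, 20, 90, 60, 50, 120, 140, 130])
print(heap)
```

Running it on the 15 keys of the example that follows reproduces the final heap shown there.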

Starting from the complete tree 150; 80 40; 30 10 70 110; 100 20 90 60 50 120 140 130, percolate-down is applied at positions 7, 6, 5, 4, 3, 2 and finally at the root; the result is the heap 10; 20 40; 30 60 50 110; 100 150 90 80 70 120 140 130.

Binomial queues

• Consider the problem of merging two priority queues
 – a binary heap solution would require inserting the keys one at a time from H1 into H2. This would lead to a linear algorithm
• Binomial queues allow merging in log(n) time while still supporting fast insertion and delete-min

Binomial queues
• A binomial queue is a collection of trees
 – each tree is heap ordered
 – the collection is represented as a forest: a list of root nodes, each the root of a heap-ordered tree
• Each of these trees is a binomial tree.
 – There is only one binomial tree of any given height.

 – B0, the binomial tree of height 0, is a single node
 – Bk, the binomial tree of height k, is formed by attaching one Bk-1 to the root of another Bk-1

Examples

The binomial trees B0, B1, B2, B3 and B4.

Binomial queues

• A binomial tree Bk consists of a root with children B0, B1, ..., Bk-1.
• Bk has exactly 2^k nodes.
• The number of nodes at depth d is the binomial coefficient C(k, d).
• How can we represent an arbitrary priority queue as a binomial queue?
 – Write the size of the priority queue in binary
 – Include a binomial tree for each "1" in the binary representation of the size

Representing priority queues as binomial queues
• Consider a priority queue containing 13 elements
 – 13 = 1101 in binary

 – so, include B3, B2 and B0 in the forest of binomial trees representing the priority queue.

• Example: a binomial queue H1 with 6 elements (keys 16, 12, 24, 18, 21, 65), consisting of a B1 and a B2 since 6 = 110 in binary

Binomial queue operations
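The size-in-binary rule is easy to sketch; `forest_shape` is a hypothetical helper name:

```python
# Which binomial trees B_k make up a queue of n elements: one B_k for each
# "1" bit in the binary representation of n.
def forest_shape(n: int) -> list:
    return [k for k in range(n.bit_length()) if (n >> k) & 1]

print(forest_shape(13))  # [0, 2, 3] -- 13 = 1101, so B0, B2, B3
print(forest_shape(6))   # [1, 2]    -- 6 = 110, so B1, B2
```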

• Merging is the basic operation
 – accomplished by "adding" the two queues together, as in binary addition: 10010 + 10111 = 101001
 – merging two binomial trees takes constant time
 – only log(n) pairs of trees to merge in merging two binomial queues

Example: merging H1 (6 elements, 110 in binary) with H2 (7 elements, 111 in binary) produces a queue H3 with 13 elements, 1101 in binary: a B0, a B2 and a B3.

Operations on binomial queues

• Insertion
 – special case of merging
 – create a one-node tree and then perform the merge
 – if the priority queue into which the element is merged has the property that its smallest nonexistent binomial tree is Bi, then the time to insert is proportional to i+1
 – since each tree is present with probability 1/2, the expected time to perform an insertion is constant

Example - inserting the keys 1-7

Inserting the keys 1 through 7 one at a time yields, after the seventh insertion, the forest B0, B1, B2 (7 = 111 in binary), with 1 at the root of the B2.

Operations on binomial queues

• Delete-min

– find binomial tree, Bk, with smallest root in binomial queue H

– Remove Bk from the forest H forming a new binomial queue H’.

 – Remove the root of Bk, creating the binomial trees B'0, B'1, ..., B'k-1, which collectively form the binomial queue H''.
 – Merge H' with H'' to get the answer

delete-min(H)

Example: delete-min on a 7-element binomial queue H (111 in binary): the minimum key, 1, is the root of the B2; removing that tree leaves H' (= 11), the children of 1 form H'' (= 11), and merging H' with H'' yields delete-min(H) (= 110).

Operation      Linked List   Binary Heap   Binomial Heap   Fibonacci Heap†   Relaxed Heap
make-heap      1             1             1               1                 1
is-empty       1             1             1               1                 1
insert         1             log n         log n           1                 1
delete-min     n             log n         log n           log n             log n
decrease-key   n             log n         log n           1                 1
delete         n             log n         log n           log n             log n
union          1             n             log n           1                 1
find-min       n             1             log n           1                 1

† amortized; n = number of elements in priority queue

Disjoint sets with union
• Given:
 – a fixed set T

 – a partition of T into subsets S1, S2, ..., Sk, with T = S1 ∪ S2 ∪ ... ∪ Sk
• Operations to be supported are
 – Find(X): returns the set that contains X
 – Union(R,S): compute R ∪ S, which replaces R and S

Why do we care?

• Lots of situations in which you have a set of elements (cities in the U.S.) and some property that induces a partition over the set (the state in which they lie).
• The goal is to find the partition into subsets determined by the property (the set of all cities in the same state).
• There are many very bad ways to solve this problem!

Disjoint sets with union - Up trees

• Since the subsets are disjoint, each element belongs to only one set
• Up-tree: each node contains a pointer to its parent in the tree representing a set Si.
 – A node can have an arbitrary number of children, since there is no limit on the number of pointers that can point to it.
 – Sets are identified by their root nodes.

Up-trees: find

• Find(X): follow pointers from X back up to the root of the tree to which X belongs
• But how do we get to the node containing X?
 – Use some tree dictionary for the names of the set elements; in this case it takes log(n) time to find the node containing X
 – If X is drawn from a small dictionary, then maintain a table that gives constant lookup times

Up-trees: Union
• Union(R,S): make one set point to another
 – make the root of one point "up" to the root of the other
 – if we make the root of R point to S then we say we merge R into S

Merging
Example: an up-tree rooted at M (with child O) is merged with an up-tree rooted at C (containing D, F, B, A, R, T and V); after the Union, either the root M points to C, or C points to M.

Merging

• When we merge two sets we want the height of the merged tree to be as small as possible
 – Always merge the smaller tree into the larger
 – Associate a field, Count, with the root of each tree which contains the number of nodes in the tree
• Let R be an up-tree representing a set of size n constructed from singleton sets by repeatedly forming unions using the merge-smaller rule. Then the height of R is at most log n.

Path compression
• A simple modification to Find can make subsequent Finds faster
 – In the case where our auxiliary dictionary is a table, this will actually lower the complexity of Finds to sub-logarithmic
• Finds take less time in a shallow tree than in a deep tree
 – trick: during a Find operation, nodes that are visited have their parent pointers updated to point to the root
 – called path compression

Path compression
function PathcompressFind (P: pointer): pointer;
{ return the root of the tree to which P belongs }
begin
  R <- P;
  while Parent(R) <> Nil do
    R <- Parent(R);
  Q <- P;
  { Now we retrace the path, repointing each visited node at the root }
  while Q <> R do begin
    Next <- Parent(Q);   { save the parent before overwriting it }
    Parent(Q) <- R;
    Q <- Next
  end;
  return R
end

Example

Find(D) in an up-tree rooted at A.
• Nodes C and D, which were encountered on the path to D, have their pointers changed to point directly to the root A.
• Subsequent Finds on them or on nodes in their subtrees will be faster.
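Both ideas, merging the smaller tree into the larger and path compression on Find, can be sketched together. Class and method names here are illustrative assumptions:

```python
# Union-find sketch with the two optimizations above: union by size
# (merge the smaller up-tree into the larger) and path compression on Find.
class DisjointSets:
    def __init__(self, elements):
        self.parent = {x: x for x in elements}   # roots point to themselves
        self.count = {x: 1 for x in elements}    # tree sizes, kept at roots

    def find(self, x):
        root = x
        while self.parent[root] != root:         # walk up to the root
            root = self.parent[root]
        while self.parent[x] != root:            # retrace, compressing the path
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, r, s):
        r, s = self.find(r), self.find(s)
        if r == s:
            return
        if self.count[r] > self.count[s]:        # merge smaller into larger
            r, s = s, r
        self.parent[r] = s
        self.count[s] += self.count[r]

ds = DisjointSets("ABCDEF")
ds.union("A", "B"); ds.union("C", "D"); ds.union("A", "C")
print(ds.find("D") == ds.find("B"))  # True -- same set now
print(ds.find("E") == ds.find("A"))  # False
```

After a Find, every node on the traversed path points directly at the root, so later Finds on those nodes take constant time.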