Nearest neighbor searching and priority queues

Nearest Neighbor Search
• Given: a set P of n points in R^d
• Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P, or the k nearest neighbors
Variants of nearest neighbor
• Near neighbor (range search): find one/all points in P within distance r from q
• Spatial join: given two sets P, Q, find all pairs p in P, q in Q such that p is within distance r from q
• Approximate near neighbor: find one/all points p' in P whose distance to q is at most (1+ε) times the distance from q to its nearest neighbor

Solutions
Depend on the value of d:
• low d: graphics, GIS, etc.
• high d:
  – similarity search in databases (text, images, etc.)
  – finding pairs of similar objects (e.g., copyright violation detection)

Nearest neighbor search in documents
• How could we represent documents so that we can define a reasonable distance between two documents?
• Vector of word frequency occurrences
  – Probably want to get rid of useless words that occur in all documents
  – Probably need to worry about synonyms and other details of language
  – But basically, we get a VERY long vector
• And maybe we ignore the frequencies and just identify with a “1” the words that occur in some document.

Nearest neighbor search in documents
• One reasonable measure of how close two documents are is just a count of the words they share – this is just the pointwise product of the two vectors when we ignore counts.
• Easy enough to compute for a pair of documents, but suppose our document database contains millions of documents. How can we solve the nearest neighbor problem FAST?

Algorithms
• Main memory
  – linear scan
  – tree-based: quadtree, kd-tree
  – hashing-based: Locality-Sensitive Hashing
• Secondary storage (databases)
  – R-tree (and numerous variants)
  – Vector Approximation File (VA-file)

Nearest neighbors in k-d trees
[Figure: make a guess about the nearest neighbor of the star; ub, the radius of the circle, is an upper bound on the distance to the nearest neighbor.]

Nearest neighbors in k-d trees
• Establishing an upper bound lets us prune parts of the tree which cannot hold the true nearest neighbor.
• In particular, this circle is entirely to the right of the splitting line running through the root of the tree. So any point to the left of the root cannot be in the candidate circle, and so can't be any better than our current guess.
  – Once we have a guess about where the nearest neighbor is, we can start eliminating parts of the tree where the actual answer cannot be.
• This general technique of searching a large space and pruning options based on partial results is called branch-and-bound.

Nearest neighbors in k-d trees
• It is easy to tell where this circle is with respect to the splitting line y = y0 passing through the k-d tree point:
[Figure: a circle of radius r1 centered at (x1, y1) with y1 + r1 < y0 lies entirely on one side of the line; a circle of radius r2 centered at (x2, y2) with y2 + r2 > y0 crosses it.]

Nearest neighbors in k-d trees
• Let the query point be (a1, a2).
• Maintain a global best estimate of the nearest neighbor, called 'guess'.
• Maintain a global value of the distance to that neighbor, called 'bestDist'.
• Set 'guess' to NULL.
• Set 'bestDist' to infinity.

Starting at the root, execute the following procedure:

  if curr == NULL
      return
  /* If the current location is better than the best known location,
     update the best known location. */
  if distance(curr, query) < bestDist
      bestDist = distance(curr, query)
      guess = curr
  /* Recursively search the half of the tree that contains the query point. */
  if a_i < curr_i
      recursively search the left subtree on the next axis
  else
      recursively search the right subtree on the next axis
  /* If the candidate circle crosses this splitting plane, look on the other
     side of the plane by examining the other subtree. */
  if |curr_i - a_i| < bestDist
      recursively search the other subtree on the next axis

• The procedure works by walking down to a leaf of the k-d tree as if searching for the query point.
• As we start unwinding the recursion and walking back up the tree, we check whether each node is better than the best estimate we have so far.
  – If so, we update the best estimate to be the current node.
• Finally, we check whether the candidate circle based on the current guess could cross the splitting line of the current node. If not, we eliminate all points on the other side of the splitting line and walk back up to the next node in the tree. Otherwise, we look in that side of the tree to see if there are any closer points.

Suppose we want more than 1 nearest neighbor?
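Before extending the search to k neighbors, the 1-NN procedure above can be sketched in Python. This is an illustrative sketch, not the lecture's code; the Node class, its field names, and the example points are assumptions.

```python
import math

class Node:
    """A k-d tree node; 'axis' is the splitting dimension used at this node."""
    def __init__(self, point, axis, left=None, right=None):
        self.point = point          # a tuple of coordinates
        self.axis = axis
        self.left = left
        self.right = right

def nearest(root, query):
    """Branch-and-bound nearest-neighbor search in a k-d tree."""
    best = {"point": None, "dist": float("inf")}

    def search(node):
        if node is None:
            return
        # If the current location is better than the best known, update it.
        d = math.dist(node.point, query)
        if d < best["dist"]:
            best["dist"], best["point"] = d, node.point
        # Recursively search the half of the tree containing the query point.
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        search(near)
        # Only cross the splitting plane if the candidate circle reaches it.
        if abs(diff) < best["dist"]:
            search(far)

    search(root)
    return best["point"]

# A 3-node tree: the root splits on x, its children split on y.
tree = Node((5, 5), 0, Node((2, 3), 1), Node((8, 7), 1))
```

For example, nearest(tree, (9, 8)) returns (8, 7); the left subtree is pruned because the splitting line at x = 5 is farther from the query than the current best distance.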
• Find the k nearest neighbors (kNN) of a query point in the k-d tree (sorry about using k in two different ways!)
• The algorithm uses a data structure called a bounded priority queue (or BPQ for short).
• A bounded priority queue stores a fixed number of entries, each of which has a key and a priority (lower is better).
• When you add a new element to the BPQ and the BPQ is full, you eject the entry with maximum priority (which might be the new entry).
  – If we have not reached the bound, then we just insert the new element in its appropriate location.

kNN searching
• There are two changes to this algorithm that differentiate it from the initial 1-NN search algorithm.
1. First, when determining whether to look on the opposite side of the splitting plane, we use as the radius of the candidate circle the distance from the query point to the maximum-priority point in the BPQ. The rationale is that when finding the k nearest neighbors, our candidate circle needs to encompass all k of those neighbors, not just the closest.
2. The other main change is that when we consider whether to look on the opposite side of the splitting plane, our decision takes into account whether the BPQ contains at least k points.
  – This is extremely important! If we prune out parts of the tree before we have made at least k guesses, we might accidentally throw out one of the closest points.

K-NN search
• Perform a 2-NN lookup for the star.
• Recursively check the left subtree of the splitting plane, and find the blue point as a candidate nearest neighbor. Since we haven't found two nearest neighbors yet, we still need to look on the other side of the splitting plane for more neighbors, even though the candidate circle does not cross the splitting line.

Priority Queue
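Before looking at priority queues in general, the bounded priority queue used by the kNN search can be sketched with Python's heapq module. The class name and interface are assumptions for illustration; entries are kept in a max-heap via negated priorities so the worst entry can be ejected cheaply.

```python
import heapq

class BoundedPQ:
    """Stores at most 'bound' (key, priority) entries; lower priority is better.
    Adding to a full queue ejects the maximum-priority entry."""
    def __init__(self, bound):
        self.bound = bound
        self._heap = []                      # entries stored as (-priority, key)

    def add(self, key, priority):
        heapq.heappush(self._heap, (-priority, key))
        if len(self._heap) > self.bound:
            heapq.heappop(self._heap)        # ejects the largest priority

    def full(self):
        return len(self._heap) >= self.bound

    def max_priority(self):
        """The largest priority currently stored (the candidate-circle radius)."""
        return -self._heap[0][0]

    def items(self):
        """(priority, key) pairs, best first."""
        return sorted((-p, k) for p, k in self._heap)

# In a 2-NN search, only the two closest candidates survive:
bpq = BoundedPQ(2)
for key, dist in [("a", 3.0), ("b", 1.0), ("c", 2.0)]:
    bpq.add(key, dist)
```

After the three adds, bpq.items() is [(1.0, 'b'), (2.0, 'c')] and bpq.max_priority() is 2.0, the radius the kNN search would use for its candidate circle.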
• A priority queue stores a collection of items
• An item is a pair: (key, element)
• Main methods:
  – insert(key, element): inserts an item with the specified key and element
  – removeMin(): removes the item with the smallest key and returns the associated element
Monday, March 30, 15

Priority Queue Implementations
Implementation        add       removeMin
Unsorted Array        O(1)      O(n)
Sorted Array          O(n)      O(1)
Unsorted Linked List  O(1)      O(n)
Sorted Linked List    O(n)      O(1)
Hash Table            O(1)      O(n)
Heap                  O(log n)  O(log n)
Binary heap implementation of priority queues
• A binary heap (or heap) is a complete binary tree having the following heap-order property:
  – for every node X, the key in the parent of X is smaller than the key at X.
[Figure: heap with root 13, children 21 and 16, next level 24, 31, 19, 68, and leaves 65, 26, 32.]
• Heaps are stored using the sequential representation of complete binary trees
• The smallest element is at the root of the heap

Insertion of x into a binary heap
• Create a hole in the next available location
• If x can be placed in the hole, we are finished
• Otherwise, percolate x up into its parent's location and recurse
• Terminate, at the latest, when x is switched with the key at the root.

Example - Insert 14
[Figure: inserting 14 into the heap above. A hole opens in the next free position (a child of 31); 31 and then 21 percolate down, and 14 comes to rest as the left child of the root: 13 / 14 16 / 24 21 19 68 / 65 26 32 31.]

Code for insertion
• Place a small element in position 0 of the heap to avoid testing for the root
  – this value is known as a sentinel
• The routine does not use swaps as it percolates up
  – percolating up using swaps would require 3d assignments for d percolations
  – the code shown uses d+1 assignments

Code for insertion
procedure insert (x: element to be inserted; H: priority queue);
var i: integer;
begin
  if H.size = Maximum then
    error
  else begin
    H.size := H.size + 1;
    i := H.size;
    while H.element[i div 2] > x do begin
      H.element[i] := H.element[i div 2];  { move that value down }
      i := i div 2                         { i is now an empty heap location }
    end;
    H.element[i] := x
  end
end;
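A Python version of the same routine, using a 1-indexed list (index 0 unused) and a bounds check in place of the sentinel, is sketched below; it is an illustration, not the lecture's code.

```python
def heap_insert(heap, x):
    """Insert x into a min-heap stored 1-indexed in a Python list.
    Percolates up with single assignments rather than swaps, so d
    percolations cost d+1 assignments."""
    heap.append(None)               # open a hole in the next free position
    i = len(heap) - 1
    while i > 1 and heap[i // 2] > x:
        heap[i] = heap[i // 2]      # move the parent's value down into the hole
        i //= 2                     # the hole moves up to the parent's slot
    heap[i] = x

# The slide's example: insert 14 into the heap 13 / 21 16 / 24 31 19 68 / 65 26 32.
h = [None, 13, 21, 16, 24, 31, 19, 68, 65, 26, 32]
heap_insert(h, 14)
```

Afterwards h is [None, 13, 14, 16, 24, 21, 19, 68, 65, 26, 32, 31]: 31 and 21 moved down, and 14 came to rest as the left child of the root, matching the figure above.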
Delete-min
• The key at the root is always deleted
• Move the last key, x, in the heap into the root
• Percolate x down until it is smaller than both of its children:
  – if x is smaller than both of its children, halt
  – otherwise swap x with its smaller child and repeat

Example
[Figure: after the root 13 is deleted, the last key 32 moves into the root: 32 / 21 16 / 24 31 19 68 / 65 26. Swapping 32 with its smaller child 16, and then with 19, gives 16 / 21 19 / 24 31 32 68 / 65 26.]

Building a heap
• A heap can be built from n keys in O(n) time
• Insert the keys in any order, maintaining the structure property (complete binary tree)
• Then percolate keys down from “bottom” to “top”:
  – percolating a node down can only take time proportional to the height of the node
  – but the “total” height of a complete binary tree is O(n)

Example
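The O(n) bottom-up construction can be sketched in Python as follows (an illustrative sketch using a 1-indexed list, run here on the keys of the example that follows):

```python
def percolate_down(heap, i, n):
    """Sift heap[i] down in a 1-indexed min-heap holding n elements."""
    x = heap[i]
    while 2 * i <= n:
        child = 2 * i
        if child < n and heap[child + 1] < heap[child]:
            child += 1                  # pick the smaller of the two children
        if heap[child] < x:
            heap[i] = heap[child]       # move the smaller child up
            i = child
        else:
            break
    heap[i] = x

def build_heap(keys):
    """O(n) heap construction: percolate down from the last internal node
    (position n // 2) back to the root."""
    heap = [None] + list(keys)          # 1-indexed storage
    n = len(keys)
    for i in range(n // 2, 0, -1):
        percolate_down(heap, i, n)
    return heap

h = build_heap([150, 80, 40, 30, 10, 70, 110, 100, 20, 90, 60, 50, 120, 140, 130])
```

The result places 10 at the root, with h[1:] equal to [10, 20, 40, 30, 60, 50, 110, 100, 150, 90, 80, 70, 120, 140, 130], and satisfies the heap-order property at every node.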
[Figure: building a heap from the keys 150, 80, 40, 30, 10, 70, 110, 100, 20, 90, 60, 50, 120, 140, 130 (in level order, 150 at the root). Percolate-down is applied at positions 7, 6, 5, 4, 3, 2, and finally at the root, yielding the heap 10 / 20 40 / 30 60 50 110 / 100 150 90 80 70 120 140 130.]

Binomial queues
• Consider the problem of merging two priority queues
  – the binary heap solution would require inserting the keys one at a time from H1 into H2; this leads to a linear-time algorithm
• Binomial queues allow merging in log(n) time while still supporting fast insertion and delete-min

Binomial queues
• A binomial queue is a collection of trees
  – each tree is heap-ordered
  – the collection of trees is represented as a forest: a root node whose sons are the roots of the heap-ordered trees
• Each of these trees is a binomial tree.
  – There is only one binomial tree of any given height.
  – B0, the binomial tree of height 0, is a single node
  – Bk, the binomial tree of height k: attach a Bk-1 to the root of another Bk-1

Examples
[Figure: the binomial trees B0, B1, B2, B3, and B4.]

Binomial queues
• The binomial tree Bk consists of a root and children B0, B1, ..., Bk-1.
• Bk has exactly 2^k nodes
• The number of nodes at depth d is the binomial coefficient C(k, d)
• How can we represent an arbitrary priority queue as a binomial queue?
  – Expand the size of the priority queue in binary
  – Include a binomial tree for each “1” in the binary representation of the size.

Representing priority queues as binomial queues
• Consider a priority queue containing 13 elements
  – 13 = 1101 in binary
  – so include B3, B2, and B0 in the forest of binomial trees representing the priority queue.
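Which trees appear in the forest follows directly from the binary representation of the size; a small helper (illustrative, not from the slides) makes this concrete:

```python
def binomial_forest(n):
    """Ranks i of the binomial trees B_i in a binomial queue of n elements:
    one B_i for each 1-bit in the binary representation of n."""
    return [i for i in range(n.bit_length()) if n >> i & 1]

# 13 = 1101 in binary, so a 13-element queue holds B0, B2, and B3.
```

binomial_forest(13) returns [0, 2, 3], and the sizes check out: 2**0 + 2**2 + 2**3 == 13.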
• Example: binomial queue H1 with 6 elements
[Figure: 6 = 110, so H1 consists of a B1 and a B2 on the keys 16, 12, 24, 18, 21, 65.]

Binomial queue operations
• Merging is the basic operation
  – accomplished by “adding” the two queues together, just like binary addition:
      10010
    + 10111
    -------
     101001
  – merging two binomial trees takes constant time
  – there are only log(n) pairs of trees to merge when merging two binomial queues

Example
[Figure: merging H1 (110 = 6 elements) with H2 (111 = 7 elements) produces H3 (1101 = 13 elements), whose forest contains a B0, a B2, and a B3.]

Operations on binomial queues
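The “binary addition” merge can be sketched in Python. The BTree class and the dict-of-ranks representation are assumptions for illustration (the keys in h2 are made up); linking two B(k-1)'s into a Bk plays the role of a carry.

```python
class BTree:
    """A binomial tree: a root key plus a list of child trees."""
    def __init__(self, key, children=None):
        self.key = key
        self.children = children or []

def link(a, b):
    """Combine two binomial trees of rank k-1 into one of rank k in O(1):
    the tree with the larger root is attached below the smaller root."""
    if b.key < a.key:
        a, b = b, a
    a.children.append(b)
    return a

def merge(q1, q2):
    """Merge two binomial queues, represented as dicts mapping rank -> tree.
    Works exactly like binary addition, one rank (bit position) at a time."""
    result, carry = {}, None
    for r in range(max(list(q1) + list(q2) + [-1]) + 2):
        trees = [t for t in (q1.get(r), q2.get(r), carry) if t is not None]
        carry = None
        if len(trees) == 1:
            result[r] = trees[0]
        elif len(trees) >= 2:
            carry = link(trees[0], trees[1])   # two B_r make one B_{r+1}
            if len(trees) == 3:
                result[r] = trees[2]
    return result

# Merge a 6-element queue (B1, B2, the keys of H1 above) with a 7-element one.
h1 = {1: BTree(16, [BTree(18)]),
      2: BTree(12, [BTree(24), BTree(21, [BTree(65)])])}
h2 = {0: BTree(23), 1: BTree(13, [BTree(51)]),
      2: BTree(14, [BTree(26), BTree(60, [BTree(70)])])}
h3 = merge(h1, h2)
```

The result h3 has trees at ranks 0, 2, and 3, mirroring 110 + 111 = 1101.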
• Insertion
  – a special case of merging: create a one-node tree and then perform the merge
  – if the priority queue into which the element is merged has the property that its smallest nonexistent binomial tree is Bi, then the time to insert is proportional to i+1
  – since each tree is present with probability 1/2, the expected time to perform an insertion is constant

Example - inserting the keys 1-7
[Figure: inserting the keys 1–7 in order. After 2 keys: a B1; after 3: B0 and B1; after 4: a single B2; after 6: B1 and B2; after 7: B0, B1, and B2 with roots 7, 5, and 1.]

Operations on binomial queues
• Delete-min
  – find the binomial tree Bk with the smallest root in the binomial queue H
  – remove Bk from the forest H, forming a new binomial queue H'
  – remove the root of Bk, creating the binomial trees B'0, B'1, ..., B'k-1, which collectively form the binomial queue H''
  – merge H' with H'' to get the answer

delete-min(H)
[Figure: delete-min on the 7-element queue H built above. Its B2 has the smallest root, 1; removing that tree leaves H' (11 = 3 elements), the deleted root's children form H'' (11 = 3 elements), and merging H' with H'' gives delete-min(H) (110 = 6 elements).]

Operation      Linked List   Binary Heap   Binomial Heap   Fibonacci Heap†   Relaxed Heap
make-heap      1             1             1               1                 1
is-empty       1             1             1               1                 1
insert         1             log n         log n           1                 1
delete-min     n             log n         log n           log n             log n
decrease-key   n             log n         log n           1                 1
delete         n             log n         log n           log n             log n
union          1             n             log n           1                 1
find-min       n             1             log n           1                 1

n = number of elements in priority queue; † amortized

Disjoint sets with union
• Given:
  – a fixed set T
  – a partition of T into subsets S1, S2, ..., Sk:  T = S1 ∪ S2 ∪ ... ∪ Sk
• Operations to be supported:
  – Find(X): returns the set that contains X
  – Union(R, S): compute R ∪ S, which replaces R and S

Why do we care?
• There are lots of situations in which you have a set of elements (cities in the U.S.) and some property that induces a partition over the set (the state in which they lie).
• The goal is to find the partition into subsets determined by the property (the set of all cities in the same state).
• There are many very bad ways to solve this problem!

Disjoint sets with union - Up trees
• Since the subsets are disjoint, each element belongs to only one set
• Up-tree: each node contains a pointer to its parent in the tree representing a set Si.
  – A node can have an arbitrary number of children, since there is no limit on the number of pointers that can point to it.
  – Sets are identified by their root nodes.

Up-trees: find
• Find(X): follow pointers from X back up to the root of the tree to which X belongs
• But how do we get to the node containing X?
  – Use some tree dictionary for the names of the set elements; in this case it takes log(n) time to find the node containing X.
  – If X is drawn from a small dictionary, then maintain a table that gives constant lookup times.

Up-trees: Union
• Union(R, S): make one set point to the other
  – make the root of one point “up” to the root of the other
  – if we make the root of R point to S, then we say we merge R into S
[Figure: an up-tree R with root C (children D and F, below them B, A, R, T, V) and an up-tree S with root M and child O.]

Merging
[Figure: the two possible merges of R and S: make M point up to C (merging S into R), or make C point up to M (merging R into S).]

Merging
• When we merge two sets, we want the height of the merged tree to be as small as possible
  – always merge the smaller tree into the larger
  – associate a field, Count, with the root of each tree, which contains the number of nodes in the tree
• Let R be an up-tree representing a set of size n, constructed from singleton sets by repeatedly forming unions using the merge-smaller rule. Then the height of R is at most log n.

Path compression
• A simple modification to Find can make subsequent Finds faster
  – in the case where our auxiliary dictionary is a table, this will actually lower the complexity of Finds to sub-logarithmic
• Finds take less time in a shallow tree than in a deep tree
  – trick: during a Find operation, nodes that are visited have their parent pointers updated to point to the root
  – this is called path compression

Path compression

function PathcompressFind (pointer P): pointer
{return the root of the tree to which P belongs}
  R <- P
  while Parent(R) <> Nil do
    R <- Parent(R)
  Q <- P                      {now we retrace the path}
  while Q <> R do begin
    Temp <- Parent(Q)         {the next two assignments happen "simultaneously"}
    Parent(Q) <- R
    Q <- Temp
  end
  return R

Example
[Figure: a chain: A is the root, B is a child of A, C a child of B, and D a child of C.]
• Find(D): nodes C and D, which were encountered on the path from D to the root, have their pointers changed to point to the root A.
• Subsequent Finds on them, or on nodes in their subtrees, will be faster.
[Figure: after compression, B, C, and D are all children of A.]
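Putting the pieces together, the up-tree structure with union-by-size (“merge smaller into larger”) and path compression can be sketched in Python. This is an illustrative sketch, with a dict playing the role of the parent pointers and the Count field; note the simultaneous assignment in find, which mirrors the {simultaneously} step in the pseudocode above.

```python
class DisjointSets:
    """Disjoint sets as up-trees, with union-by-size and path compression."""
    def __init__(self, elements):
        self.parent = {x: x for x in elements}   # a root points to itself
        self.count = {x: 1 for x in elements}    # tree sizes, kept at the roots

    def find(self, x):
        root = x
        while self.parent[root] != root:         # walk up to the root
            root = self.parent[root]
        while self.parent[x] != root:            # retrace, compressing the path
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, r, s):
        r, s = self.find(r), self.find(s)
        if r == s:
            return
        if self.count[r] > self.count[s]:        # merge the smaller tree
            r, s = s, r                          # into the larger one
        self.parent[r] = s
        self.count[s] += self.count[r]

ds = DisjointSets("ABCDE")
ds.union("A", "B")
ds.union("C", "D")
ds.union("A", "C")
```

After these unions, A, B, C, and D are all in one set while E is alone; each later find flattens the path it traverses, so repeated finds get faster.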