Nearest Neighbor Searching and Priority Queues

Nearest Neighbor Search
• Given: a set P of n points in R^d
• Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P, or the k nearest neighbors

Variants of nearest neighbor
• Near neighbor (range search): find one/all points in P within distance r from q
• Spatial join: given two sets P, Q, find all pairs p in P, q in Q, such that p is within distance r from q
• Approximate near neighbor: find one/all points p' in P whose distance to q is at most (1+ε) times the distance from q to its nearest neighbor

Solutions
Depends on the value of d:
• low d: graphics, GIS, etc.
• high d:
  – similarity search in databases (text, images, etc.)
  – finding pairs of similar objects (e.g., copyright violation detection)

Nearest neighbor search in documents
• How could we represent documents so that we can define a reasonable distance between two documents?
• Vector of word frequencies
  – Probably want to get rid of useless words that occur in all documents
  – Probably need to worry about synonyms and other details from language
  – But basically, we get a VERY long vector
• And maybe we ignore the frequencies and just identify with a "1" the words that occur in a document.
• One reasonable measure of how close two documents are is a count of the words they share. When we ignore counts, this is the dot product of the two 0/1 vectors, and a larger count means the documents are more similar.
• Easy enough to compute for a pair of documents, but suppose our document database contains millions of documents. How can we solve the nearest neighbor problem FAST?
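
As a baseline for the question above, here is a minimal Python sketch of the shared-word measure combined with a plain linear scan over the document database. It stores each 0/1 word vector sparsely as a set of words, so the dot product becomes the size of a set intersection. The stop-word list and the names word_set, shared_words and most_similar are illustrative assumptions, not part of the original slides.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical stop-word list standing in for "useless words that occur in all documents".
STOP_WORDS: Set[str] = {"the", "a", "of", "and", "is", "in", "on"}


def word_set(text: str) -> Set[str]:
    """Sparse form of the 0/1 word vector: the set of non-stop words present."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def shared_words(a: Set[str], b: Set[str]) -> int:
    """Dot product of two 0/1 vectors, i.e. the number of words in common."""
    return len(a & b)


def most_similar(query: str, docs: Dict[str, str]) -> Tuple[Optional[str], int]:
    """Linear scan: score every document against the query and keep the best."""
    q = word_set(query)
    best_name: Optional[str] = None
    best_score = -1
    for name, text in docs.items():
        score = shared_words(q, word_set(text))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


docs = {
    "d1": "the cat sat on the mat",
    "d2": "dogs chase the cat in the park",
    "d3": "stock prices fell sharply",
}
print(most_similar("a cat on a mat", docs))
```

The scan touches every document once per query, which is exactly the cost that the tree-based and hashing-based structures listed next try to avoid.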

Algorithms
• Main memory
  – linear scan
  – tree-based:
    • quadtree
    • kd-tree
  – hashing-based: Locality-Sensitive Hashing
• Secondary storage (Databases)
  – R-tree (and numerous variants)
  – Vector Approximation File (VA-file)

Nearest neighbors in k-d trees
• Make a guess about the nearest neighbor of the star (the query point).
• ub, the radius of the circle centered at the query point and passing through the guess, is the upper bound on the distance to the nearest neighbor.
• Establishing an upper bound lets us prune parts of the tree which cannot hold the true nearest neighbor.
• In particular, this circle is entirely to the right of the splitting line running through the root of the tree. So any point to the left of the root cannot be in the candidate circle, and so can't be any better than our current guess.
  – Once we have a guess about where the nearest neighbor is, we can start eliminating parts of the tree where the actual answer cannot be.
• This general technique of searching a large space and pruning options based on partial results is called branch-and-bound.
• It is easy to tell where this circle is with respect to the line passing through the k-d tree point.
[Figure: splitting line y = y0 and two circles, one centered at (x1, y1) with radius r1, where y1 + r1 < y0, and one centered at (x2, y2) with radius r2, where y2 + r2 > y0]
• Let the query point be (a1, a2).
• Maintain a global best estimate of the nearest neighbor, called 'guess'.
• Maintain a global value of the distance to that neighbor, called 'bestDist'.
• Set 'guess' to NULL.
• Set 'bestDist' to infinity.
Starting at the root, execute the following procedure (i is the splitting axis of the current node):

    if curr == NULL
        return

    /* If the current location is better than the best known location,
       update the best known location. */
    if distance(curr, (a1, a2)) < bestDist
        bestDist = distance(curr, (a1, a2))
        guess = curr

    /* Recursively search the half of the tree that contains the test point. */
    if ai < curri
        recursively search the left subtree on the next axis
    else
        recursively search the right subtree on the next axis

    /* If the candidate circle crosses this splitting plane, look on the
       other side of the plane by examining the other subtree. */
    if |curri – ai| < bestDist
        recursively search the other subtree on the next axis

• The procedure works by walking down to a leaf of the kd-tree as if searching for the test point.
• At each node visited, check whether the node is better than the best estimate we have so far.
  – If so, update the best estimate to be the current node.
• As the recursion unwinds and we walk back up the tree, check whether the candidate circle based on the current guess could cross the splitting line of the current node. If not, eliminate all points on the other side of the splitting line and walk back up to the next node in the tree. Otherwise, look in that side of the tree to see if there are any closer points.

Suppose we want more than 1 nearest neighbor?
• Find the k nearest neighbors (kNN) of a query point in the k-d tree (sorry about using k in two different ways!)
• The algorithm uses a data structure called a bounded priority queue (or BPQ for short).
• A bounded priority queue stores a fixed number of entries, each of which has a key and a priority (lower is better).
• When you add a new element to the BPQ and the BPQ is full, you eject the element with the largest priority value (which might be the new element).
  – If we have not reached the bound, then we just insert the new element in its appropriate location.

kNN searching
• There are two changes to this algorithm that differentiate it from the initial 1-NN search algorithm.
1. First, when determining whether to look on the opposite side of the splitting plane, we use as the radius of the candidate circle the distance from the test point to the maximum-priority point in the BPQ.
   The rationale behind this is that when finding the k nearest neighbors, our candidate circle for the k nearest points needs to encompass all k of those neighbors, not just the closest.
2. The other main change is that when we consider whether to look on the opposite side of the splitting plane, our decision takes into account whether the BPQ contains at least k points.
   – This is extremely important! If we prune out parts of the tree before we have made at least k guesses, we might accidentally throw out one of the closest points.

k-NN search
• Perform a 2-NN lookup for the star.
• Recursively check the left subtree of the splitting plane, and find the blue point as a candidate nearest neighbor. Since we haven't found two nearest neighbors yet, we still need to look on the other side of the splitting plane for more neighbors, even though the candidate circle does not cross the splitting line.

Priority Queue
• A priority queue stores a collection of items
• An item is a pair: (key, element)
• Main methods:
  – insert(key, element) inserts an item with the specified key and element
  – removeMin() removes the item with the smallest key and returns the associated element
Monday, March 30, 15 19

Priority Queue Implementations
  Implementation          add         removeMin
  Unsorted Array          O(1)        O(n)
  Sorted Array            O(n)        O(1)
  Unsorted Linked List    O(1)        O(n)
  Sorted Linked List      O(n)        O(1)
  Hash Table              O(1)        O(n)
  Heap                    O(log n)    O(log n)

Binary heap implementation of priority queues
• A binary heap (or heap) is a complete binary tree having the following heap order property:
  – for every node X other than the root, the key in the parent of X is smaller than (or equal to) the key at X.
• Heaps are stored using the sequential (array) representation of complete binary trees
• The smallest element is at the root of the heap
[Figure: example heap, level by level: 13 | 21 16 | 24 31 19 68 | 65 26 32]

Insertion of x into a binary heap
• Create a hole in the next available location
• If x can be placed in the hole without violating heap order, finished
• Otherwise, move the parent's key down into the hole, so that the hole percolates up into the parent's location, and repeat
• Terminate when x can be placed in the hole, which happens at the latest when the hole reaches the root.

Example - Insert 14 (heaps shown level by level)
  hole at the next available location:  13 | 21 16 | 24 31 19 68 | 65 26 32 (hole)
  31 moves down, hole moves up:         13 | 21 16 | 24 (hole) 19 68 | 65 26 32 31
  21 moves down, 14 fills the hole:     13 | 14 16 | 24 21 19 68 | 65 26 32 31

Code for insertion
• Place a small element in position 0 of the heap to avoid testing for the root
  – this value is known as a sentinel
• The routine does not use swaps as it percolates up
  – percolating up using swaps would require 3d assignments for d percolates
  – the code shown uses d+1 assignments

    procedure insert(x: element to be inserted; H: priority queue);
    var i: integer;
    begin
      if H.size = Maximum then
        error
      else begin
        H.size := H.size + 1;
        i := H.size;
        while H.element[i div 2] > x do begin
          H.element[i] := H.element[i div 2];  { move that value down }
          i := i div 2                         { this is now an empty heap location }
        end;
        H.element[i] := x
      end
    end;

Delete-min
• The key at the root is always the one deleted
• Move the last key, x, in the heap into the root
• Percolate x down until it is smaller than both of its children
  – if x is smaller than both of its children, halt
  – otherwise swap x with its smaller child and repeat

Example (heaps shown level by level; 13 has been deleted and the last key, 32, moved to the root)
  32 at the root:         32 | 21 16 | 24 31 19 68 | 65 26
  32 swapped with 16:     16 | 21 32 | 24 31 19 68 | 65 26
  32 swapped with 19:     16 | 21 19 | 24 31 32 68 | 65 26

Building a heap
• A heap can be built from n keys in O(n) time
• Insert the keys in any order, maintaining the structure property (complete BT)
• Then percolate keys down from "bottom" to "top".
  – percolating a node down takes time at most proportional to the height of the node
  – but the "total" height of a complete BT (the sum of the heights of all its nodes) is O(n)

Example (heap array shown in level order after each step)
  initial keys:         150  80  40  30  10  70 110 100  20  90  60  50 120 140 130
  percolate down (7):   150  80  40  30  10  70 110 100  20  90  60  50 120 140 130   (no change)
  percolate down (6):   150  80  40  30  10  50 110 100  20  90  60  70 120 140 130
  percolate down (5):   150  80  40  30  10  50 110 100  20  90  60  70 120 140 130   (no change)
  percolate down (4):   150  80  40  20  10  50 110 100  30  90  60  70 120 140 130
  percolate down (3):   150  80  40  20  10  50 110 100  30  90  60  70 120 140 130   (no change)
  percolate down (2):   150  10  40  20  60  50 110 100  30  90  80  70 120 140 130
  percolate down (1):    10  20  40  30  60  50 110 100 150  90  80  70 120 140 130

Binomial queues
• Consider the problem of merging two priority queues
  – the binary heap solution would require inserting the keys one at a time from H1 into H2.
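
The one-at-a-time merge described above can be sketched in Python by reusing the hole-based insertion routine from earlier in these notes. This is only an illustration under stated assumptions: the heap is a 1-indexed list whose slot 0 holds a negative-infinity sentinel, and the names make_heap, insert and merge_into are hypothetical, not taken from the slides.

```python
from typing import List


def make_heap() -> List[float]:
    """Empty 1-indexed min-heap; index 0 holds the sentinel (assumed -infinity)."""
    return [float("-inf")]


def insert(heap: List[float], x: float) -> None:
    """Insert x by percolating a hole up, without swaps."""
    heap.append(x)               # placeholder marking the hole at the next location
    i = len(heap) - 1
    while heap[i // 2] > x:      # the sentinel at index 0 stops the loop at the root
        heap[i] = heap[i // 2]   # move the parent's key down into the hole
        i //= 2                  # the hole is now at the parent's location
    heap[i] = x


def merge_into(h2: List[float], h1: List[float]) -> None:
    """Naive merge: insert every key of h1 into h2, one at a time."""
    for key in h1[1:]:           # skip h1's sentinel
        insert(h2, key)


h1 = make_heap()
for key in (5, 3, 8):
    insert(h1, key)
h2 = make_heap()
for key in (4, 1):
    insert(h2, key)
merge_into(h2, h1)
print(h2[1:])
```

Each insertion costs O(log n) in the worst case, so merging this way costs O(n log n) for heaps of n keys in total, which is the expense that motivates a structure designed for merging.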