Similarity Searching in Peer-To-Peer Environment
Masaryk University in Brno
Faculty of Informatics

Similarity Searching in Peer-to-Peer Environment

Dissertation Proposal

Mgr. David Novák

Supervisor: Prof. Ing. Pavel Zezula, CSc.

In Brno, January 2006

Contents

1 Introduction
2 Metric Space Indexing
2.1 Metric Space Model
2.1.1 Metric Space
2.1.2 Similarity Queries
2.1.3 Metric Distance Measures
2.2 Metric Partitioning Principles
2.2.1 Ball Partitioning
2.2.2 Excluded Middle Partitioning
2.2.3 Generalized Hyperplane Partitioning
2.2.4 Voronoi-like Partitioning
2.3 Search Space Pruning Policies
2.3.1 Object-Pivot Distance Constraint
2.3.2 Range-Pivot Distance Constraint
2.3.3 Double-Pivot Distance Constraint
2.3.4 Pivot Filtering
2.4 Metric Data Structures
2.4.1 Vantage Point Tree
2.4.2 Generalized Hyperplane Tree
2.4.3 M-Tree
2.4.4 D-Index
3 Peer-to-Peer Structures
3.1 Distributed Hash Tables
3.1.1 Chord
3.1.2 CAN
3.2 One-Dimensional Range Queries
3.2.1 P-Grid
3.2.2 Skip Graphs & SkipNet
3.2.3 P-Tree
3.3 Range Queries over Multiple Attributes
3.3.1 SCRAP & HSFC-based
3.3.2 ZNet
3.3.3 MAAN
3.3.4 Mercury
3.3.5 MURK
3.4 Nearest Neighbors Queries
3.4.1 pSearch
3.4.2 Distributed Quadtree
3.4.3 SWAM
3.5 Metric Space Structures
3.5.1 GHT*
3.5.2 MCAN
4 Thesis Plan
4.1 Overview
4.2 Objectives
4.3 System Structure Proposal
4.4 Future Plans
References

1 Introduction

The spread of computer technology into a wide spectrum of human activities has led to a variety of new complex data types. This calls for new ways of searching such data effectively, ways that fit both the nature of the specific data and user requirements. Some of these types, e.g., text documents or multimedia data such as digital images or videos, exhibit complex, often user-defined relationships that prevent the data from being simply sorted according to a single feature.
Furthermore, a simple YES or NO classification is not adequate for answering queries that are meaningful for these data types. The data processing techniques that address the requirements outlined above are usually referred to as content-based or similarity searching.

Research in this area proceeds on many levels, from techniques tailored to very specific data and a specific application to solutions based on general models and solid theoretical grounds that are applicable to a variety of data types. The most general data model, and the only usable one in some cases, is the metric space model. This concept treats the dataset as unstructured objects together with a distance or dissimilarity function measurable for every pair of objects. The universality of this approach led us to adopt it.

Considerable effort has been concentrated on the field of metric-based indexing [26, 16, 48]. A variety of index structures have been proposed; many of them nicely demonstrate the principles of metric indexing, but they are usually static or main-memory structures and, therefore, not very suitable for large data volumes. Certain dynamic and disk-oriented structures have been published as well, e.g., the M-Tree [17] and the D-Index [20].

However, the huge amounts of digital data produced nowadays place heavy demands on the scalability of data-oriented applications. Similarity search is inherently very expensive: the critical aspect is the evaluation of the distance function, which is typically time consuming. Even though sophisticated index structures can reduce both computational and I/O costs, the overall search costs become unacceptable as the data volume grows, because they increase linearly with the size of the dataset.

A way to achieve swift processing of large datasets is to shift from centralized data structures towards a distributed environment.
This step provides not only easily extensible and practically unlimited storage capacity but, above all, a significant increase in the computational power of the system and the possibility of exploiting parallelism during query processing. Recently, two metric-based distributed structures were published: the GHT* index [9, 8] and MCAN [21].

In the survey part of this proposal, the author follows two separate lines of research. First, the principles and basic data structures for indexing based on the metric space model are outlined in Section 2. Then, Section 3 studies the development trends in the area of self-organized distributed systems based on the peer-to-peer paradigm and focuses on various query paradigms heading towards similarity searching. In Section 4, we recognize the need for a new distributed structure for similarity search in metric spaces and propose the architecture of such a system.

2 Metric Space Indexing

In this section, we provide the theoretical background for similarity search based on the metric space model. We give a brief survey of the principles of metric-based indexing techniques and of similarity query processing. Finally, we describe several centralized data structures that adopt the metric space as a data model. For further details, refer to several recent comprehensive surveys of this area [16, 26, 48].

2.1 Metric Space Model

This section provides the basic concepts of metric-based similarity searching and gives several examples of application domains for this model.

2.1.1 Metric Space

Under certain circumstances, the dataset together with a function measuring the proximity of pairs of objects can be treated as a metric space.
Definition 1 A metric space $\mathcal{M}$ is a pair $\mathcal{M} = (\mathcal{D}, d)$, where $\mathcal{D}$ is the domain of objects and $d$ is the total distance function $d : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ satisfying the following conditions for all objects $x, y, z \in \mathcal{D}$:

$d(x, y) \geq 0$ (non-negativity),
$d(x, y) = 0$ iff $x = y$ (identity),
$d(x, y) = d(y, x)$ (symmetry),
$d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality).

The metric space model is considered to be the most abstract data model that can still be used for designing an index structure. Many well-known data concepts fulfil the conditions of this simple but powerful model. In particular, note that any normed vector space $(V, \|\cdot\|)$ forms a metric space $(V, d)$ by defining $\forall x, y \in V : d(x, y) = \|y - x\|$.

2.1.2 Similarity Queries

Several types of metric similarity queries are defined in the literature. We use the following notation in the rest of the text.

Notation 1 Let $I \subseteq \mathcal{D}$ be a finite set of objects indexed by a data structure.

The range query is the basic type of similarity query, and it is often exploited when processing more complex queries.

Definition 2 Given an object $q \in \mathcal{D}$ and a maximal search radius $r$, the range query Range$(q, r)$ selects a subset of indexed objects $S_A \subseteq I$ such that $S_A = \{x \mid x \in I \wedge d(q, x) \leq r\}$.

Figure 1a shows an example of a Range query in a planar domain. As a real-life example of such a query, a user of a road-map application could formulate the following requirement: Give me all gas stations within a distance of ten kilometers from my current position.

Figure 1: Examples of a Range$(q, r)$ query (a) and of a kNN$(q, 3)$ query (b)

Querying a system via a range query requires background knowledge about the data and the distance function, as well as a certain "rule of thumb" estimate of a suitable query radius $r$. This is inconvenient for users and not even possible in some applications. The nearest neighbors query eliminates the necessity of specifying a data-dependent query radius.
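For illustration, the range query of Definition 2 can be sketched as a plain linear scan over the indexed set. This is only a minimal sketch and not one of the index structures surveyed in this proposal: the function name `range_query`, the sample coordinates in `stations`, and the choice of the Euclidean distance (`math.dist`) as the metric $d$ are all assumptions made for the example.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def range_query(indexed, q, r, d=dist):
    """Return every object x of `indexed` with d(q, x) <= r.

    Hypothetical helper: a linear scan that evaluates the distance
    function once per indexed object and uses no index structure.
    """
    return [x for x in indexed if d(q, x) <= r]


# Made-up planar positions (in kilometers) standing in for gas stations.
stations = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (20.0, 0.0)]

# "All gas stations within ten kilometers of my position (0, 0)."
print(range_query(stations, (0.0, 0.0), 10.0))
```

The scan computes one distance per indexed object, which is exactly the linear cost that the pruning policies and index structures discussed later aim to reduce.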
Definition 3 Given an object $q \in \mathcal{D}$ and an integer $k \geq 1$, the $k$-nearest neighbors query kNN$(q, k)$ retrieves a subset of indexed objects $S_A \subseteq I$ such that $|S_A| = k$ and $\forall x \in S_A, \forall y \in I \setminus S_A : d(q, x) \leq d(q, y)$.

In other words, the kNN$(q, k)$ query selects the $k$ indexed objects that are nearest to the query object $q$. If more than one object lies at the same borderline distance, the ties are resolved arbitrarily. Remaining with the road-map example, a user can pose the query: Give me the three closest gas stations.

This survey of similarity queries is not comprehensive, and we refer the reader to specialized publications [48] for further details and for other types of queries.

2.1.3 Metric Distance Measures

Most natural dissimilarity measures fulfil the conditions of a metric function, so the metric space approach can be applied to datasets that use these distances. We present a few important examples of such functions in this section.

Minkowski Distances

A very important family of distance functions for vector spaces are the Minkowski distances. For two $n$-dimensional real vectors and for a numeric parameter $p$, the Minkowski $L_p$ metric is defined as:

$$L_p[(x_1, \ldots, x_n), (y_1, \ldots, y_n)] = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \qquad (1)$$

The $L_1$ distance is also referred to as the Manhattan distance, the $L_2$ metric is the well-known Euclidean distance, and $L_\infty = \max_{i=1}^{n} |x_i - y_i|$ is known as the chessboard distance. Figure 2 shows examples of several important members of this family in the two-dimensional vector space.
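As a rough illustration of Equation (1) and of Definition 3, the following sketch computes Minkowski distances and answers a kNN query by scanning all indexed objects. The names `minkowski` and `knn_query` and the sample points are hypothetical and introduced only for this example; the sketch assumes non-empty vectors of equal length and, again, uses no index structure.

```python
import heapq


def minkowski(x, y, p):
    """Minkowski L_p distance of two equal-length real vectors.

    Passing p = float("inf") selects the chessboard distance L_infinity.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(t ** p for t in diffs) ** (1.0 / p)


def knn_query(indexed, q, k, d):
    """Return the k objects of `indexed` nearest to q under distance d.

    Objects at the same borderline distance are kept or dropped
    arbitrarily, as Definition 3 allows.
    """
    return heapq.nsmallest(k, indexed, key=lambda x: d(q, x))


a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 1))             # Manhattan distance
print(minkowski(a, b, 2))             # Euclidean distance
print(minkowski(a, b, float("inf")))  # chessboard distance

# Made-up planar positions standing in for gas stations.
stations = [(20.0, 0.0), (6.0, 8.0), (0.0, 1.0), (3.0, 4.0)]
# "Give me the three closest gas stations", measured with L_2.
print(knn_query(stations, a, 3, lambda u, v: minkowski(u, v, 2)))
```

Passing the distance function as a parameter mirrors the metric space model, in which the query algorithm relies only on $d$ and not on the structure of the objects themselves.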