M-Chord: A Scalable Distributed Similarity Search Structure

David Novak and Pavel Zezula

Masaryk University, Brno, Czech Republic
{xnovak8, zezula}@fi.muni.cz

Abstract

The need for retrieval based not on attribute values but on the very data content has recently led to the rise of metric-based similarity search. The computational complexity of such retrieval and the large volumes of processed data call for distributed processing, which makes it possible to achieve scalability. In this paper, we propose M-Chord, a distributed data structure for metric-based similarity search. The structure takes advantage of the idea of the vector index method iDistance in order to transform the issue of similarity searching into the problem of interval search in one dimension. The proposed peer-to-peer organization, based on the Chord protocol, distributes the storage space and parallelizes the execution of similarity queries. Promising features of the structure are validated by experiments on the prototype implementation and two real-life datasets.

1. Introduction

The field of similarity data retrieval has recently made rapid progress. Generally, the objective of this search mechanism is to retrieve all indexed data that are similar to a given query object (a digital image, a text document, etc.).

One way towards similarity search is adapting the traditional attribute-based retrieval, which usually leads to complex queries in multi-dimensional vector spaces. The standard index structures that are able to process such queries, e.g., kd-trees, seem to become inefficient for the high numbers of dimensions that are common in current data. Furthermore, data types that do not use the Euclidean distance to measure similarity (e.g., digital images compared by the Earth Mover's Distance) and some special data types (e.g., texts or DNA sequences) cannot be indexed efficiently by vector-based data structures at all.

These research challenges have led to the development of the area of metric-based similarity search. This approach considers the data space as a metric space: a dataset together with a distance function applicable to every pair of objects. The queries in this model are defined by a sample query object and a constraint on the required proximity to the query object. Many principles and index structures have been proposed in this field, summarized in several comprehensive surveys [14, 21].

In real-life applications, the distance function is typically expensive to compute. Unfortunately, even with sophisticated index structures [9, 10] the similarity search is expensive and the increase of the costs is linear with respect to the size of the indexed dataset. Since the volumes of managed data grow ever larger, there is an evident need for distributed processing that brings two benefits: distribution of the storage and parallelization of the time-consuming query execution. Most of the recent effort in the field of distributed data structures has focused on the vector-based approach [20, 8, 12, 2, 6, 1]. As far as we know, the only metric-based distributed structures published are the GHT index [4, 3] and, very recently, MCAN [11].

In this paper, we introduce a new distributed data structure called M-Chord. It maps the metric space into a one-dimensional domain using a generalized and adapted variant of iDistance [15], a vector index method for nearest-neighbors search. In this way, M-Chord reduces the issue of similarity search to the interval search problem. The data is divided among the nodes of the structure and the peer-to-peer protocol Chord [17] is utilized for intra-system navigation. Finally, the similarity search algorithms are designed for the proposed architecture.

The paper is organized as follows. Section 2 recalls the theoretical background of metric-based searching and describes the iDistance and Chord techniques. In Section 3, we provide both the ideas behind and the architecture of the proposed structure. Section 4 presents results of the performance trials, and the paper concludes with related work and a future work outline in Sections 5 and 6.

2. Preliminaries

In this section, we briefly recall the basic concepts of similarity searching and indexing in metric spaces and then describe two techniques important for this work: iDistance and Chord.

INFOSCALE '06: Proceedings of the First International Conference on Scalable Information Systems, May 29-June 1, 2006, Hong Kong. © 2006 ACM 1-59593-428-6/06/05.

2.1 Similarity searching in metric spaces

Mathematically, a metric space is a pair (D, d), where D is the domain of objects and d is the total distance function d: D x D -> R satisfying the following conditions for all objects x, y, z in D:

d(x, y) >= 0 (non-negativity),
d(x, y) = 0 iff x = y (identity),
d(x, y) = d(y, x) (symmetry),
d(x, z) <= d(x, y) + d(y, z) (triangle inequality).

Let us define the two types of similarity queries [21] we focus on. Let I, a subset of D, be the finite set of indexed objects.

Definition 1: Given an object q in D and a maximal search radius r, the range query R(q, r) selects the set of indexed objects R(q, r) = { o in I | d(o, q) <= r }.

Definition 2: Given an object q in D and an integer k >= 1, the k-nearest neighbors query kNN(q, k) retrieves a set A, a subset of I, such that |A| = k and, for every x in A and every y in I \ A, d(q, x) <= d(q, y).

2.2 iDistance

iDistance [15] is an index method for similarity search in vector spaces. It partitions the data space into n clusters and selects a reference point p_i for each cluster C_i, 0 <= i < n. Every data object is assigned a one-dimensional iDistance key according to its distance to the reference object of its cluster. Having a constant c to separate the individual clusters, the iDistance key for an object x in C_i is

    idist(x) = d(p_i, x) + i*c.    (1)

Expecting that c is large enough, all objects in cluster C_i are mapped to the interval [i*c, (i+1)*c); see Figure 1(a) for a visualization of the mapping. The data is then stored in a B+-tree according to the iDistance keys.

Figure 1. The principles of iDistance

Although iDistance is primarily proposed as a kNN search method, the algorithm for range queries R(q, r) is quite straightforward. It searches separately those clusters that possibly contain objects from the query range, that is, all clusters C_i that satisfy

    d(p_i, q) - r <= maxdist_i,

where maxdist_i is the maximum distance between p_i and the objects in cluster C_i. Figure 1(b) shows the clusters influenced by the query R(q, r) and specifies more precisely the space areas to be searched. Such an area within cluster C_i corresponds to the iDistance interval

    [i*c + d(p_i, q) - r, i*c + d(p_i, q) + r].    (2)

So, several iDistance intervals are determined, the distance d(q, x) is evaluated for all objects x from these intervals, and the query answer set is created as { x | d(q, x) <= r }.

The iDistance kNN(q, k) algorithm is based on repetitive range queries with growing radius. Such a policy is not very convenient for a distributed environment and we propose a different one (see below).
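To make the mapping concrete, the following minimal sketch (not the authors' code; the function names, the example Euclidean metric, and the closest-pivot assignment are illustrative assumptions) computes the iDistance key (1) and the per-cluster key intervals (2) that a range query R(q, r) has to scan.

```python
import math

def euclidean(x, y):
    """Example metric; any metric distance function d(.,.) could be plugged in."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def closest_pivot(x, pivots, d=euclidean):
    """Voronoi-like assignment: x belongs to the cluster of its closest pivot."""
    return min(range(len(pivots)), key=lambda i: d(pivots[i], x))

def idistance_key(x, pivots, c, d=euclidean):
    """Formula (1): idist(x) = d(p_i, x) + i*c for x in cluster C_i."""
    i = closest_pivot(x, pivots, d)
    return d(pivots[i], x) + i * c

def range_search_intervals(q, r, pivots, maxdist, c, d=euclidean):
    """Clusters touched by R(q, r) and, for each, the key interval (2) to scan.
    maxdist[i] is the maximum distance between p_i and the objects of C_i."""
    intervals = []
    for i, p in enumerate(pivots):
        dq = d(p, q)
        if dq - r <= maxdist[i]:                   # cluster C_i may contain results
            intervals.append((i, i * c + dq - r,   # lower bound of interval (2)
                                 i * c + dq + r))  # upper bound of interval (2)
    return intervals
```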

2.3 Chord

Chord [17] is a P2P protocol providing the functionality of a Distributed Hash Table: an efficient localization of the node that stores the data item corresponding to a given search key. It is a message-driven dynamic structure that is able to adapt as nodes (cooperating computers) join or leave the system.

Using consistent hashing, the protocol uniformly maps the domain of search keys into the Chord domain of keys [0, 2^m). Every Chord node is assigned a key from the same domain. The identifiers are ordered in an identifier circle modulo 2^m. A node is responsible for all keys from the interval between its predecessor's key (exclusive) and its own key (inclusive), taken modulo 2^m; see Figure 2 for a visualization.

Figure 2. The Chord structure

Notation: for every key k, let N_k denote the node responsible for key k.

Every node stores the addresses of its predecessor and successor on the identifier circle and, furthermore, it maintains a routing table called the finger table with addresses of up to m other nodes [17]. Due to the uniformity of the Chord domain distribution, the protocol guarantees, with high probability, that:

- in a system of n nodes, the node responsible for a given key is located via O(log n) messages to other nodes (number of hops);
- the storage load of the nodes is balanced.
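For illustration only, here is a schematic, in-memory sketch of the Chord lookup: how a node uses its finger table to locate the node responsible for a key. The class layout and field names are hypothetical; the real protocol [17] is message-driven and also handles joins, departures, and stabilization, which are omitted here.

```python
M = 16                                 # identifier space size is 2**M

def in_interval(x, a, b):
    """True if x lies in the circular interval (a, b] modulo 2**M."""
    x, a, b = x % 2**M, a % 2**M, b % 2**M
    return (a < x <= b) if a < b else (x > a or x <= b)

class ChordNode:
    def __init__(self, key):
        self.key = key
        self.successor = self          # maintained by the Chord protocol
        self.fingers = []              # up to M nodes: successor(key + 2**j), j = 0..M-1

    def closest_preceding(self, key):
        """The farthest known node that still precedes `key` on the circle."""
        for node in reversed(self.fingers):
            if in_interval(node.key, self.key, key - 1):
                return node
        return self

    def find_successor(self, key):
        """Locate the node responsible for `key`; O(log n) hops w.h.p."""
        node = self
        while not in_interval(key, node.key, node.successor.key):
            nxt = node.closest_preceding(key)
            node = nxt if nxt is not node else node.successor
        return node.successor
```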

3. M-Chord

In this section, we describe the basic ideas and the architecture of M-Chord, the proposed distributed data structure for similarity searching in general metric spaces. The M-Chord approach can be summarized as follows:

- generalize the idea of iDistance to metric spaces and map the dataset to a domain of keys;
- divide the domain into intervals and let every node store the data with keys from one interval;
- use the Chord routing mechanism for navigation;
- design the range and nearest-neighbors search algorithms;
- provide an additional filtering mechanism to reduce the computational cost of the query processing.

The rest of this section analyzes this approach in detail.

3.1 Generalizing iDistance to metric spaces

The iDistance algorithm partitions the data space and selects a reference point in every partition. This approach is applicable in vector spaces, where the coordinate system can be used for partitioning and the reference points may be selected from the whole domain. In a general metric space (without any knowledge about the specific dataset), the only way to specify reference objects is by choosing them from a given set of objects, and only then can the space be partitioned with respect to these objects.

In order to generalize iDistance to metric spaces, we first choose a set of n pivots p_0, ..., p_{n-1} from an a priori given sample dataset S, a subset of D. Then, the Voronoi-like partitioning [21] is used to divide the set of indexed objects I into n clusters:

    C_i = { x in I | d(p_i, x) <= d(p_j, x) for all j, 0 <= j < n }.

This modification has no impact on the iDistance functionality.

While partitioning the data space, the distances d(p_i, x), 0 <= i < n, are computed for every object x. These values can be stored together with the object and used at query time for further filtering. The triangle inequality property of the function d implies the standard filtering criterion for the range query R(q, r) [21]: an object o may be excluded without evaluating d(q, o) if, for some pivot p_i,

    |d(p_i, q) - d(p_i, o)| > r.    (3)

This criterion is applied to reduce the number of distance computations at query time.

Pivot selection. Let us discuss the way in which the pivots are selected. The proposed selection algorithm is general and applicable to any metric dataset. Exploiting some additional knowledge about the particular dataset, various data-tailored methods can be developed.

Because the pivots are used for filtering, their selection influences the performance of the search algorithm. So, the main objective of the pivoting technique is to increase the efficiency of the filtering criterion (3). We follow the approach described by Chávez et al. [7], which corresponds with our requirements.

First, let us deduce a formula for comparing the quality of two sets of pivots. For a query R(q, r), an object o, and a set of pivots P = {p_0, ..., p_{n-1}}, let us denote

    D_P(q, o) = max over p in P of |d(p, q) - d(p, o)|.

The filtering condition (3) may be reformulated as

    D_P(q, o) > r    (4)

and the efficiency of the filtering grows with the growing probability of (4). One way to increase this probability is to find a set of pivots that maximizes the mean of the distribution of D_P(x, y), x, y in D. Let us denote this mean value mu_{D_P}. So, we say that P_1 is a better set of pivots than P_2 when

    mu_{D_{P_1}} > mu_{D_{P_2}}.    (5)

The value of mu_{D_P} can be estimated on a sample set S, a subset of D. Having the comparison criterion (5), several actual techniques for pivot selection can be defined. The proposed system adopts the incremental selection technique [7].

If the dataset to be indexed is known a priori, the pivot selection can be tailored directly for it. Otherwise, we try to maximize the filtering efficiency on the whole domain D. This is possible when designing a specific real-life application with a specific domain D. If the distribution of the sample strongly differed from the distribution of D, the pivot filtering efficiency would be damaged.
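A small sketch of how the pieces above could look in code, under illustrative names and with a random sampling of candidate pivots and object pairs: the filtering criterion (3) applied to precomputed pivot distances, the estimation of the mean of D_P used by criterion (5), and a greedy incremental selection in the spirit of [7].

```python
import random

def can_discard(pivot_dists_q, pivot_dists_o, r):
    """Criterion (3): o can be discarded without computing d(q, o)
    if |d(p_i, q) - d(p_i, o)| > r for some pivot p_i."""
    return any(abs(dq - do) > r for dq, do in zip(pivot_dists_q, pivot_dists_o))

def mean_dp(pivots, pairs, d):
    """Estimate the mean of D_P(x, y) = max_i |d(p_i, x) - d(p_i, y)| over a set
    of sample object pairs; a higher mean means better filtering, criterion (5)."""
    total = 0.0
    for x, y in pairs:
        total += max(abs(d(p, x) - d(p, y)) for p in pivots)
    return total / len(pairs)

def incremental_selection(sample, n, d, candidates_per_step=10, num_pairs=1000):
    """Greedily add, n times, the candidate pivot that maximizes the estimate."""
    pairs = [(random.choice(sample), random.choice(sample)) for _ in range(num_pairs)]
    pivots = []
    for _ in range(n):
        candidates = random.sample(sample, candidates_per_step)
        best = max(candidates, key=lambda p: mean_dp(pivots + [p], pairs, d))
        pivots.append(best)
    return pivots
```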

M-Chord domain. Having the pivots selected and the data space partitioned, we can use iDistance to map the dataset into a one-dimensional domain and join this domain with the Chord protocol. Since Chord presumes a key space of size 2^m, the range of the iDistance domain is normalized by an order-preserving function h to the [0, 2^m) interval. The iDistance formula (1) for an object x in C_i becomes the M-Chord key-assignment formula:

    m-chord(x) = h(d(p_i, x) + i*c).    (6)

The transformation h has another important purpose. In our approach, every node takes over responsibility for a sub-interval of the M-Chord domain. Such a partitioning should follow the domain distribution to preserve a balanced load of the nodes. Further, the Chord protocol guarantees its efficiency while the nodes are distributed uniformly on the domain circle. Therefore, the ideal transformation would map the original domain onto the [0, 2^m) interval uniformly.

Generally, the task is to find an order-preserving uniform transformation h (knowing the distribution of idist on a set S). This is a very well studied topic and can be solved, for instance, by a piecewise-linear transformation [13]. This method fixes the h-values in several selected points and the overall transformation is the linear interpolation of these values.

The described construction works flawlessly only when the distribution of the sample set S copies the distribution of the indexed data I. It is impossible to reach this criterion fully in a real-life application. But note that an imperfection of the M-Chord domain uniformity can only slightly degrade the distribution of the keys assigned to nodes and, thus, the routing efficiency (hop count). The storage load balance of the nodes is addressed in another way (see Section 3.2) and is not influenced by the domain distribution.
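The following is one possible sketch of such a piecewise-linear, order-preserving normalization h, built from the empirical distribution of the iDistance keys of the sample set; the number of breakpoints and the handling of out-of-range keys are illustrative choices, not the paper's exact construction, and a reasonably large sample with many distinct key values is assumed.

```python
import bisect

def build_h(sample_keys, m=30, num_breakpoints=100):
    """Fix h at empirical quantiles of the iDistance keys observed on the sample
    set and interpolate linearly in between, mapping (approximately uniformly)
    onto [0, 2**m)."""
    keys = sorted(set(sample_keys))
    step = max(1, len(keys) // num_breakpoints)
    xs = keys[::step]
    if xs[-1] != keys[-1]:
        xs.append(keys[-1])                       # ensure the maximum is a breakpoint
    ys = [i * (2 ** m - 1) / (len(xs) - 1) for i in range(len(xs))]

    def h(key):
        """Order-preserving piecewise-linear transformation of an iDistance key."""
        if key <= xs[0]:
            return 0
        if key >= xs[-1]:
            return 2 ** m - 1
        j = bisect.bisect_right(xs, key) - 1      # segment containing the key
        frac = (key - xs[j]) / (xs[j + 1] - xs[j])
        return int(ys[j] + frac * (ys[j + 1] - ys[j]))

    return h
```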
3.2 The M-Chord structure

The M-Chord system constitutes a logical overlay over any network of directly addressable nodes. The structure consists of autonomous nodes that store data, can insert objects into the system, and retrieve them by executing similarity queries. The nodes communicate via messages.

The logical topology of the network corresponds to the Chord structure. As explained above, the data objects are assigned keys from the M-Chord domain according to (6). The nodes are assigned keys from the same domain in order to divide the data among the nodes. So, every node contains:

- the Chord routing information: the node's key, links to the predecessor and successor nodes, and the finger table;
- a B+-tree storage for the objects with keys from the interval between the predecessor's key (exclusive) and the node's key (inclusive), modulo 2^m.

Initialization phase. The initialization proceeds on the first node started and the determined settings are then used in all nodes of the system. The initialization algorithm has the following parameters:

- the sample dataset S, a subset of D;
- the number of pivots n.

First, the n pivots are selected from S using the algorithm described in Section 3.1. Then, the iDistance formula (1) is applied to determine the distribution of the idist function on S and this information is used to find the transformation h. Both the pivots and the function h are necessary for the evaluation of the key-assignment function (6). When a node other than the first one joins the system, it receives this configuration from an already running node; we presume that the joining node learns about an existing node through some external mechanism.

Nodes activation. The first node of the system is automatically assigned key 0 and covers the whole M-Chord domain. Other nodes are not assigned M-Chord keys on startup; they become non-active nodes at first. There are several ways to maintain the pool of non-active nodes, e.g., a separate distributed peer-to-peer layer, a central (replicated) register, broadcast messages in a LAN, etc. The choice of a suitable solution is left to a specific implementation environment.

If there is an available non-active node, then any active node can invoke a split request. Generally, every node may define its own criteria forcing a split; these could be reaching a storage capacity limit, heavy computational load (either because of frequent M-Chord claims, demands of other processes, or weak technical parameters), etc. The splitting of a node N, responsible for the interval between its predecessor's key and its own key, proceeds as follows:

- determine a key k lying strictly inside that interval and assign it to a non-active node N';
- move the data with keys up to k (the sub-interval N' becomes responsible for) to N' and follow the standard Chord join mechanism for N'.

Note that, due to the separation constant c in formula (6), the values h(i*c), 0 <= i <= n, form the boundaries of the particular clusters within the domain. If the interval of N covers more than one cluster (j clusters, j > 1), then k is selected as a cluster boundary so that the covered clusters are divided between N and N'. If the interval covers only one cluster (or its part), then k is set so as to split the storage of N equally.

Insert. Any node N_ins can initiate the insert operation for an object x in D. First, node N_ins applies formula (6) to calculate the m-chord(x) key. The values d(p_i, x) are obtained as a by-product and are carried along with x from now on. If N_ins is non-active, then the request is forwarded to the node known from the startup. If N_ins is active, then the Chord protocol is followed to forward the request to the node responsible for m-chord(x). This node stores x into its B+-tree storage according to the m-chord(x) key. See an example in Figure 3(a).
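A minimal sketch of the insert path under the assumptions above: the key-assignment formula (6) yields the pivot distances as a by-product, and the object travels with them to the storage of the responsible node. The MChordNode class, the locate_node callable (standing for the Chord lookup), and the list-based storage are illustrative stand-ins for the real node, routing, and B+-tree.

```python
class MChordNode:
    """Illustrative stand-in for a peer: its M-Chord key and a local storage
    list playing the role of the B+-tree over the node's key interval."""
    def __init__(self, key):
        self.key = key
        self.storage = []              # entries: (m-chord key, object, pivot distances)

def mchord_key(x, pivots, c, h, d):
    """Formula (6): m-chord(x) = h(d(p_i, x) + i*c) with p_i the closest pivot;
    all distances d(p_j, x) are obtained as a by-product and kept with x."""
    dists = [d(p, x) for p in pivots]
    i = min(range(len(pivots)), key=lambda j: dists[j])
    return h(dists[i] + i * c), dists

def insert(x, pivots, c, h, d, locate_node):
    """INSERT: compute the key, let the (abstracted) Chord lookup find the
    responsible node, and store the object there with its pivot distances."""
    key, pivot_dists = mchord_key(x, pivots, c, h, d)
    node = locate_node(key)                       # Chord navigation, O(log n) hops
    node.storage.append((key, x, pivot_dists))    # the real node inserts into a B+-tree
```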

Figure 3. The insert (a) and range search (b)

3.3 Range search

The range query algorithm is usually the base algorithm for more sophisticated similarity queries. The M-Chord structure has been designed in such a way that the range search algorithm may follow the iDistance pruning idea. The node N_q that initiates the query executes the RANGESEARCH(q, r) algorithm with the following schema:

- for each cluster C_i, determine the interval I_i of M-Chord keys to be scanned, i.e., the image of the iDistance interval (2) under the function h;
- send an INTERVALSEARCH(I_i, q, r) request to the node responsible for the midpoint of the interval I_i;
- wait for all responses and create the final answer set.

The INTERVALSEARCH(I_i, q, r) algorithm, executed on the contacted nodes, has the following schema:

- if the node is not responsible for the whole interval I_i, forward the INTERVALSEARCH(I_i, q, r) request to the predecessor and/or successor;
- examine the locally stored objects and create the local answer set;
- use the precomputed values d(p_i, x) and the filtering formula (3) while evaluating the local answer set;
- return the local answer set to node N_q and, if the request was forwarded, notify N_q to wait for responses from the predecessor and/or successor nodes.

Figure 3(b) shows an example of a range query flow through the system. Both the algorithm description and the picture simplify the query forwarding to the contacted nodes: the Chord protocol must be employed to reach these nodes. So, n queries are spread simultaneously from node N_q following the Chord protocol. According to this protocol, many of these messages travel partially along the same path. In order to decrease the flooding of the network, the requests are sent as one message along the common path parts.

As described in Section 2.2, the iDistance search algorithm may exclude a cluster C_i from the search if d(p_i, q) - r > maxdist_i, where maxdist_i is the radius of C_i. This is not possible in the distributed algorithm because the maxdist_i values are not known to all nodes. Thus, some (in fact, most) interval search requests return immediately after reaching the responsible node because there is nothing to be searched within the interval. This issue and its possible solutions are candidates for future studies.
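The two schemas can be summarized by the following sequential sketch, in which message passing is simulated by function calls and the forwarding of an interval request to the predecessor or successor is omitted; locate_node stands for the Chord lookup, h for the normalization (6), and the storage layout follows the earlier sketches. This is an illustration of the idea, not the prototype's implementation.

```python
def range_search(q, r, pivots, c, h, d, locate_node):
    """RANGESEARCH(q, r) as run by the initiating node: per cluster, determine
    the M-Chord key interval to scan, ask the node responsible for its midpoint,
    and merge the partial answers."""
    answer = []
    dq_all = [d(p, q) for p in pivots]        # computed once, carried with the request
    for i, dq in enumerate(dq_all):
        lo, hi = h(i * c + dq - r), h(i * c + dq + r)   # image of interval (2) under h
        node = locate_node((lo + hi) // 2)              # node covering the midpoint
        answer.extend(interval_search(node, lo, hi, q, r, dq_all, d))
    return answer

def interval_search(node, lo, hi, q, r, dq_all, d):
    """INTERVALSEARCH on a contacted node: scan the local storage for keys in
    [lo, hi], discard objects by the precomputed pivot distances, criterion (3),
    and only then evaluate d(q, x)."""
    local_answer = []
    for key, x, pivot_dists in node.storage:
        if not (lo <= key <= hi):
            continue
        if any(abs(dq - dp) > r for dq, dp in zip(dq_all, pivot_dists)):
            continue                          # excluded by criterion (3)
        if d(q, x) <= r:
            local_answer.append(x)
    return local_answer
```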
3.4 Nearest-neighbors search

The iDistance approach to kNN query processing, a sequence of range queries with growing radius, does not seem to be suitable for a distributed environment because multiple iterations would result in an unpleasant number of successive message transmissions increasing the overall response time. Our approach, similar to the approach adopted, e.g., in GHT [4], has two phases:

1. Employ a low-cost heuristic to find k objects that are near q. Measure the distance r_k to the k-th nearest object found. The value r_k is an upper bound of the distance to the actual k-th nearest neighbor of q.

2. Run the range query R(q, r_k) and return the k nearest objects from the query result (skip the space searched during the first phase).

In the first phase, the node responsible for the key m-chord(q) searches the cluster into which q belongs:

- localize the B+-tree leaf covering the key m-chord(q);
- walk through the B+-tree leaves alternately left and right, add the first k objects to the answer set A, and initialize the r_k value;
- continue walking left and right, examining objects as long as their keys lie within the interval corresponding to the current radius r_k;
- if d(q, x) < r_k for an examined object x, remove the farthest object from A, add x into A, and update r_k (the interval shrinks);
- finish searching when either the whole interval or the cluster containing q has been searched.

Note that the described algorithm presumes that at least k objects are stored in the cluster of q on that node. If so, the second phase is applied; otherwise we adopt an optimistic strategy by iterating over the following steps:

- run a range query with the radius equal to the distance to the farthest object found so far;
- if fewer than k objects are found, increase the radius and search again, skipping the space already searched.

The search finishes when k objects are found.

4. Performance evaluation

In this section, we present and analyze results of experiments conducted on the prototype implementation of M-Chord. The experiments focus mainly on the system performance for range and kNN query processing and on various aspects of the scalability of the system.

The experiments have been performed on up to 300 workstation nodes connected by a high-speed local-area network. The nodes communicate via the TCP protocol. The following real-life datasets have been selected for the experiments:

VEC: 45-dimensional vectors of color image features compared by a quadratic-form distance function [21]. The distribution of the distances between pairs of objects is practically normal and such a high-dimensional space is extremely sparse.

TTL: titles and subtitles of books and periodicals from several academic libraries. The edit distance function [16] on the level of individual characters is used to compare these strings of lengths from 3 to 200 characters. The distance distribution of this dataset is skewed.

Observe that neither of these datasets can be efficiently indexed and searched by standard vector data structures: VEC uses a specific distance function and TTL does not form a vector space at all. Both distance functions are highly computationally intensive.

When testing the system scalability with respect to the dataset size, we use the whole sets (1,000,000 objects for VEC and 800,000 objects for TTL); otherwise the size of the stored data is fixed to 200,000 objects. The sample set S used in the initialization phase (Section 3.2) consists of 5,000 objects randomly chosen from the dataset. The number of pivots is discussed in Section 4.3. The separation constant c, used in Equation (6), is determined by a simple heuristic: the double of the maximal distance between an object in the sample set and its farthest pivot. Because the M-Chord domain is normalized and redistributed by the h function, the value of the constant c has no serious impact. The size of the M-Chord domain is 2^m. For simplicity and transparency, we define only one criterion forcing node splits: reaching a storage capacity limit of 5,000 objects.

Since the technical resources used for testing were not dedicated, our experimental environment was very fluctuating: the computational load of the workstations caused by other running processes and the load of the network were unpredictable and difficult to measure. Therefore, we do not present exact response time values or durations of computations. Generally, the query response times of our implementation were below one second for smaller radii and about two seconds for bigger radii, regardless of the dataset size.

The CPU costs are measured as the number of evaluations of the distance function d. This is a standard way of evaluating metric-based structures: other operations (and usually the I/O costs as well) are practically negligible compared to the distance evaluation demands. We measure the total cost on all participating nodes and the parallel cost, the maximal number of distance computations performed in a sequential manner. The total costs correspond to the costs of a centralized version of the structure.

The communication costs are measured by the total number of messages (both requests and responses) and by the maximal number of messages sent in a serial way, the maximal hop count. We also present the number of nodes influenced by the query processing. All presented measurement results are taken as an average over fifty query objects randomly selected from the dataset.
In the first set of experiments, we measured the influence of the number of pivots (clusters) on the performance of the search algorithms. For all tested numbers of pivots (1 to 70) and for 200,000 stored objects, there were about sixty active nodes in the system, so the average storage load ratio was approx. 66%.

Figure 4 shows the total number of distance computations for range queries with respect to the number of pivots n. Generally, the filtering formula (3) reduces the number of distance computations less effectively for TTL because two strings of similar length often have a similar edit distance from any pivot.

Figure 4. Number of pivots: total costs

Figure 5 shows the parallel number of distance computations for the same experiment. We can observe that the parallel costs do not decrease so significantly for larger radii because close objects are usually stored on a single node, which is searched sequentially.

Figure 5. Number of pivots: parallel costs

Figure 6 presents the maximal hop count (a) for VEC and the total number of messages (b) for TTL. The hop count is bigger for small numbers of clusters because the intervals of the M-Chord domain to be visited by the query are longer. The hop count remains stable for higher values of n. Though the INTERVALSEARCH() request is sent for every cluster (see Section 3.3), the mentioned message passing optimization reduces the total number of messages, and it does not grow as fast as the number of clusters.

Figure 6. Number of pivots: maximal hop count (a) and total number of messages (b)

In summary, there is an obvious trade-off between the CPU costs and the messaging while increasing the number of pivots. In the following experiments, we fix the number of pivots to forty for both datasets.

Range search performance. In the rest of this section, we analyze the performance of the range search algorithm more deeply. First, we monitor the algorithm while the query radius grows, then we concern ourselves with the interquery parallelism of the query execution, and, finally, we study the scalability with respect to the growing dataset.

Size of the query. The dotted line in Figure 7 shows the number of retrieved objects while increasing the query radius r. Observe that, e.g., radius 2,000 for VEC retrieves about 8,000 objects on average and radius 20 for TTL about 12,000 objects (6% of the whole database). Such queries are usually not reasonable for applications (strings with edit distance 20 differ significantly), but we study the system performance even in these cases.

The solid line in Figure 7 depicts the parallel number of distance computations. The algorithm is designed so that the query replicas are always forwarded before the local storage is searched and any metric distance is evaluated. Therefore, the parallel costs are upper bounded by the number of objects stored in a single peer (5,000 in our setting).

Figure 7. Size of the query: parallel costs

The parallel distance computations together with the maximal hop count (presented in Figure 8) can be considered as a characterization of the actual response time of the query. For example, running a range query on TTL that retrieves approx. 4,000 objects on average, the longest branch of the query execution consists of at most nine forwardings of the query replicas and 4,000 distance computations.

Figure 8. Size of the query: messages sent

Interquery parallelism. In the previous experiments, we focused on the intraquery parallelism, i.e., the parallel processing of a single query. The interquery parallelism refers to the ability of the system to accept multiple queries at the same time. In the following experiment, we executed l queries simultaneously, each from a different node. We measured the overall parallel cost of the set of queries as the maximal number of distance computations performed on a single node of the system (inter-cost_l).

Figure 9 presents the inter-cost_l with respect to the number of simultaneous queries l. The baseline of this experiment (a system with zero level of interquery parallelism) is inter-cost_l equal to l times the parallel cost of one query, i.e., the parallel cost of l queries executed one by one.

Figure 9. Overall parallel costs

For example, considering the VEC results for radius 1,000, inter-cost_30 is approx. 15,000, while the parallel cost of the one-by-one execution is approx. 30 * 2,200 = 66,000 distance computations (Figure 7). The inter-cost for TTL is higher because the TTL total costs are higher (Figure 4).

The following example brings another point of view on the way in which the computations are spread over the nodes. The sum of the total costs of thirty TTL queries is approx. 30 * 38,000 = 1,140,000 distance computations (see Figure 4 for TTL and forty pivots). The average number of distance computations on the sixty active nodes is, therefore, 1,140,000 / 60 = 19,000. The maximal number of computations per node is 40,000 (inter-cost_30), which indicates a quite balanced computational load of the nodes.

Size of the dataset. The following set of experiments concerns the range search scalability with respect to the growing dataset size, up to 1,000,000 objects for VEC and 800,000 objects for TTL. The number of active nodes grows from approx. thirty for 100,000 objects to approx. 300 for 1,000,000 objects, so the average storage load of 66% remains stable.

Figure 10 shows the percentage of nodes actively visited by the range search algorithm. This value decreases with the growing data size because the indexed space becomes more dense and, therefore, the nodes cover smaller regions of the data space. During the query execution, some nodes of the structure are used only for routing and these are not considered actively visited.

Figure 10. Size of the dataset: visited nodes

Figure 11 shows the parallel number of distance computations, the main factor influencing the query response time. Observe that it remains quite stable, with the upper bound given by the size of the data stored by individual nodes (5,000 in our setting). The visible fluctuations are caused by the nodes' splitting.

Figure 11. Size of the dataset: parallel costs

The storage capacity limit of the individual nodes is one of the parameters that can be utilized for tuning the system performance. By decreasing this limit, the parallel number of distance computations can be lowered if there is a sufficient number of available nodes to spread the data over. The price to be paid is an increase in the number of messages.

Figure 12 shows the maximal hop count while increasing the dataset size. The value slowly grows because, while the size of the particular clusters grows, the sequential walk within the cluster interval gets longer. This is a weak point of the M-Chord approach but, at the moment, we are implementing an improved message passing algorithm which should make the hop count stable without increasing the total number of messages sent.

Figure 12. Size of the dataset: hop count

the determined radius may be larger than , Figure 14. Maximal hop count for kNN multiple iterations could be necessary if less then objects are retrieved in the first phase. approaches are usually applicable and tested on data of low dimensions, they cannot manage inherently-metric data or Figure 13 shows results of the experiment that compares data with non-standard distance measures, and they do not the parallel costs of a query and of its corre- support similarity queries like nearest neighbors query. sponding (for growing ). The costs are ob- The MAAN structure [8] extends the Chord proto- tained as the sum of the parallel number of distance compu- col to support multi-attribute and range queries by means tations of the consecutive phases of the algorithm. of uniform locality-preserving hashing expecting the ex- act knowledge of distributions of the attributes’ domains. VEC TTL 8000 kNN query 10000 kNN query Ganesan et al. [12] use space-filling curves and the 7000 corresponding range corresponding range 8000 Chord protocol to support multi-dimensional range queries 6000 5000 6000 in the SCRAP structure. The authors also propose the 4000 3000 4000 MURK structure that adapts the kd-trees to support multi- 2000 2000 dimensional interval queries in peer-to-peer environment. 1000 parallel distance comp.

parallel distance comp. 0 0 The Mercury [6] provides another protocol for routing 0 20 40 60 80 100 0 20 40 60 80 1 k k multi-attribute range queries. This structure builds a Chord- like index for each attribute. The Skip Graphs [1] provide Figure 13. kNN parallel costs increasing k efficient routing mechanism while preserving the locality of data which enables an efficient support for range queries on In most of the test cases, the initial phase has explored one dimension. a significant part of a single node storage. The determined Tanin et al. [19] introduce a P2P generalization of a radius has usually been very close to and the answer set quadtree index. The authors propose range and nearest retrieved by the first phase was only slightly corrected in neighbor algorithms over this structure [20]. The applica- the second phase (the first phase result can be treated as the tion of this structure is limited to spatial domain. approximation of the precise answer). But the high costs of The pSearch approach [18] uses two information re- the initial phase influence negatively the parallel costs for trieval techniques vector space model and latent semantic which can be observed in Figure 13. indexing to build a P2P information retrieval system for top- There is a natural trade-off between the first phase cost similarity queries. This concept, defining similarity be- and the demands of the second phase. Finding a suitable tween a document and a query by means of their common compromise is a matter of further testing and tuning of the terms, is suitable for a specific application area only. system for a specific application. Banaei-Kashani and Shahabi [2] formalize the problem Figure 14 gives comparison of the maximal hop count of attribute-based similarity search in P2P Data Networks in the same experiment. As discussed above, this value is and propose SWAM a family of small-world based access usually negatively influenced only by the hop-count of one methods. This concept provides a general solution for the Chord lookup operation during the initial phase. range and nearest neighbors search in the vector data. All trends observed in the experiments impact So far, the GHT [4, 3] and the MCAN [11] are the of the number of pivots, interquery parallelism, scalability only metric-based distributed data structures published. The with respect to the dataset size hold true for experi- GHT is a native metric P2P structure utilizing the idea of ments as well and we do not present these results for . Generalized Hyperplane Tree for data partitioning and for navigation. The MCAN, similarly to the M-Chord,first 5. Related work transforms the metric-space searching problem and then takes advantage of an existing solution namely the CAN A number of distributed structures for interval queries protocol. A very detailed comparison of our approach to in attribute-based (vector) data have been proposed. These these structures is provided in a separate publication [5]. 6. Conclusions and future work Library Architectures, volume 3664/2005 of LNCS, pages 25 44. Springer, 2005. [5] M. Batko, D. Novak, F. Falchi, and P. Zezula. On scalability So far, the area of distributed data structures for met- of the similarity search in the world of peers. In Proceedings ric data has not been investigated sufficiently. We consider of INFOSCALE 2006, Hong Kong. IEEE Computer Society, this topic very relevant for many present-day applications 2006. that manage large volumes of complex digital data. We [6] A. R. Bharambe, M. 
Agrawal, and S. Seshan. Mercury: Sup- propose M-Chord, a peer-to-peer data structure for simi- porting scalable multi-attribute range queries. SIGCOMM larity search in metric spaces. It maps the data space into Comput. Commun. Rev., 34(4):353 366, 2004. a one-dimensional domain by means of a generalized and [7] B. Bustos, G. Navarro, and E. Cháavez. Pivot selection tech- improved variant of a vector index method iDistance.The niques for proximity searching in metric spaces. Pattern Recognition Letters, 24(14):2357 2366, 2003. indexed data is divided among the participating nodes and [8] M. Cai, M. Frank, J. Chen, and P. Szekely. MAAN: A multi- protocol Chord defines the topology of the structure and is attribute addressable network for grid information services. used for navigation. Algorithms for the range and nearest- In Proceedings of GRID 2003, pages 184 191, Washington, neighbors similarity queries are proposed. DC, USA. IEEE Computer Society, 2003. We have conducted experiments on a prototype M-Chord [9] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient implementation and on two real-life datasets. The presented access method for similarity search in metric spaces. In Pro- results focus mainly on performance of the query process- ceedings of VLDB 1997, Athens, Greece, pages 426 435. ing and on scalability of the system from various points Morgan Kaufmann, 1997. [10] V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-index: of view. The query processing trials have proved a good Distance searching index for metric data sets. Multimedia level of intraquery and interquery parallelism without any Tools Appl., 21(1):9 33, 2003. hot spots. The system scales well with the size of the query [11] F. Falchi, C. Gennaro, and P. Zezula. A content-addressable and with size of the dataset managed, assuming that enough network for similarity search in metric spaces. In Proceed- computational resources are available. The system perfor- ings of DBISP2P 2005, Trondheim, Norway, pages 126 137, mance can be easily tuned through the nodes split policy. 2005. The maximal hop count during the query processing [12] P. Ganesan, B. Yang, and H. Garcia-Molina. One torus to rule them all: Multi-dimensional queries in p2p systems. In grows with the number of nodes in the system. We are im- Proceedings of WebDB 2004, New York, USA, pages 19 24. plementing a new Chord-based messages passing algorithm ACM Press, 2004. that is expected to stabilize the hop count while the system [13] A. K. Garg and C. C. Gotlieb. Order-preserving key transfor- grows. Furthermore, our plans for future work cover imple- mations. ACM Trans. Database Syst., 11(2):213 234, 1986. mentation of the delete and update operations and support [14] G. R. Hjaltason and H. Samet. Index-driven similarity for proper departure of nodes from the system. search in metric spaces. ACM Trans. Database Syst., Although we provide a very detailed comparison of our 28(4):517 580, 2003. [15] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. approach to the most relevant structures in a separate publi- iDistance: An adaptive B -tree based indexing method for cation [5], we plan to study the system behavior on simple nearest neighbor search. ACM Transactions on Database vector data and compare it with vector-based index struc- Systems (TODS 2005), 30(2):364 397, 2005. tures, e.g., Mercury [6]. Finally, we would like to inves- [16] V. I. Levenshtein. 
Binary codes capable of correcting spuri- tigate deeply the influence and behavior of various load- ous insertions and deletions of ones. Problems of Informa- balancing and replication strategies. tion Transmission, 1:8 17, 1965. [17] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Bal- akrishnan. Chord: A scalable peer-to-peer lookup service References for internet applications. In Proceedings of SIGCOMM 2001, San Diego, USA, pages 149 160. ACM Press, 2001. [1] J. Aspnes and G. Shah. Skip graphs. In Fourteenth Annual [18] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer information ACM-SIAM Symposium on Discrete Algorithms, pages 384 retrieval using self-organizing semantic overlay networks. 393, 2003. Technical report, HP Labs, 2002. [19] E. Tanin, A. Harwood, and H. Samet. A distributed quadtree [2] F. Banaei-Kashani and C. Shahabi. SWAM: A family of ac- index for peer-to-peer settings. In ICDE, pages 254 255. cess methods for similarity-search in peer-to-peer data net- IEEE Computer Society, 2005. works. In Proceedings of CIKM 2004, pages 304 313. ACM [20] E. Tanin, D. Nayar, and H. Samet. An efficient nearest Press, 2004. neighbor algorithm for p2p settings. In Proceedings of the [3] M. Batko, C. Gennaro, and P. Zezula. A scalable nearest 2005 national conference on Digital government research, neighbor search in P2P systems. In Proceedings of DBISP2P pages 21 28. Digital Government Research Center, 2005. 2004, Toronto, Canada, Revised Selected Papers, volume [21] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similar- 3367 of LNCS, pages 79 92. Springer, 2004. ity Search: The Metric Space Approach, volume 32 of Ad- [4] M. Batko, C. Gennaro, and P. Zezula. Similarity grid for vances in Database Systems. Springer-Verlag, 2006. searching in metric spaces. In DELOS Workshop: Digital