Masaryk University in Brno Faculty of Informatics


Similarity Searching in Peer-to-Peer Environment

Dissertation Proposal

Mgr. David Novák

Supervisor: Prof. Ing. Pavel Zezula, CSc.

In Brno, January 2006

Contents

1 Introduction

2 Metric Space Indexing
  2.1 Metric Space Model
    2.1.1 Metric Space
    2.1.2 Similarity Queries
    2.1.3 Metric Distance Measures
  2.2 Metric Partitioning Principles
    2.2.1 Ball Partitioning
    2.2.2 Excluded Middle Partitioning
    2.2.3 Generalized Hyperplane Partitioning
    2.2.4 Voronoi-like Partitioning
  2.3 Search Space Pruning Policies
    2.3.1 Object-Pivot Distance Constraint
    2.3.2 Range-Pivot Distance Constraint
    2.3.3 Double-Pivot Distance Constraint
    2.3.4 Pivot Filtering
  2.4 Metric Data Structures
    2.4.1 Vantage Point Tree
    2.4.2 Generalized Hyperplane Tree
    2.4.3 M-Tree
    2.4.4 D-Index

3 Peer-to-Peer Structures
  3.1 Distributed Hash Tables
    3.1.1 Chord
    3.1.2 CAN
  3.2 One-Dimensional Range Queries
    3.2.1 P-Grid
    3.2.2 Skip Graphs & SkipNet
    3.2.3 P-Tree
  3.3 Range Queries over Multiple Attributes
    3.3.1 SCRAP & HSFC-based
    3.3.2 ZNet
    3.3.3 MAAN
    3.3.4 Mercury
    3.3.5 MURK
  3.4 Nearest Neighbors Queries
    3.4.1 pSearch
    3.4.2 Distributed Quadtree
    3.4.3 SWAM
  3.5 Metric Space Structures
    3.5.1 GHT*
    3.5.2 MCAN

4 Thesis Plan
  4.1 Overview
  4.2 Objectives
  4.3 System Structure Proposal
  4.4 Future Plans

References

1 Introduction

The diffusion of computer technology into a wide spectrum of human activities leads to the formation of a variety of new complex data types. This calls for new ways of searching this data effectively, ways that fit both the nature of the specific data and the user requirements. Some of these types, e.g., text documents or multimedia data like digital images or videos, exhibit complex, often user-defined relationships that prevent this data from being simply sorted according to a single feature. Furthermore, a simple YES or NO classification is not convenient when answering queries that are meaningful for the mentioned data types.

The data processing techniques that address the requirements outlined above are usually referred to as content-based or similarity searching. The development of this research area proceeds on many levels, from techniques tailored to very specific data and a specific application to solutions based on general models and solid theoretical grounds that are applicable to a variety of data types. The most general data model, and the only usable one in some cases, is the metric space model. This concept treats the dataset as unstructured objects together with a distance or dissimilarity function measurable for every pair of objects. The universality of this approach led us to adopt it.

Considerable effort has concentrated on the field of metric-based indexing [26, 16, 48]. A variety of index structures have been proposed; many of them nicely demonstrate the principles of metric indexing, but they are usually static or main-memory structures and, therefore, not very suitable for large data volumes. Certain dynamic and disk-oriented structures have been published as well, e.g., the M-Tree [17] and the D-Index [20]. However, the huge amounts of digital data that are produced nowadays make heavy demands on the scalability of data-oriented applications.
The similarity search is inherently very expensive - the critical aspect is the evaluation of the distance function, which is typically time consuming. Even though sophisticated index structures can reduce both computational and I/O costs, the overall search costs become unacceptable as the data volume grows, because they increase linearly with the size of the dataset.

A way to achieve swift processing of large datasets is to shift from centralized data structures towards a distributed environment. This step provides not only easily extensible and practically unlimited storage capacity, but especially a significant increase in the computational power of the system and the possibility of exploiting parallelism during query processing. Recently, two metric-based distributed structures were published - the GHT* index [9, 8] and MCAN [21].

In the survey part of this proposal, the author follows two separate research lines. First, the principles and basic data structures for indexing based on the metric space model are outlined in Section 2. Then, Section 3 studies the development trends in the area of self-organized distributed systems based on the peer-to-peer paradigm and focuses on various query paradigms heading towards similarity searching. In Section 4, we recognize the need for a new distributed structure for similarity search in metric spaces and propose the architecture of such a system.

2 Metric Space Indexing

In this section, we provide the theoretical background for the similarity search based on the metric space model. We give a brief survey of the principles of metric-based indexing techniques and of similarity query processing. Finally, we describe several centralized data structures that adopt the metric space as a data model. For further details, refer to several recent comprehensive surveys of this area [16, 26, 48].

2.1 Metric Space Model

This section provides basic concepts for metric-based similarity searching and gives several examples of the application domains for this model.

2.1.1 Metric Space

Under certain circumstances, the dataset together with a function measuring proximity of pairs of objects can be treated as metric space.

Definition 1 A metric space $\mathcal{M}$ is a pair $\mathcal{M} = (\mathcal{D}, d)$, where $\mathcal{D}$ is the domain of objects and $d$ is the total distance function $d : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ satisfying the following conditions for all objects $x, y, z \in \mathcal{D}$:

$d(x, y) \ge 0$ (non-negativity),
$d(x, y) = 0$ iff $x = y$ (identity),
$d(x, y) = d(y, x)$ (symmetry),
$d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality).

The metric space model is considered to be the most abstract data model which can still be used for designing an index structure. Many well-known data concepts fulfil the conditions of this simple but powerful model. In particular, note that any normed vector space $(V, \|\cdot\|)$ forms a metric space $(V, d)$ by defining $\forall x, y \in V : d(x, y) = \|y - x\|$.

2.1.2 Similarity Queries

There are several types of metric similarity queries defined in the literature. Let us use this notation in the following text.

Notation 1 Let $I \subseteq \mathcal{D}$ be a finite set of objects indexed by a data structure.

The range query is the basic type of similarity query and it is often exploited when processing more complex queries.

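Definition 2 Given an object $q \in \mathcal{D}$ and a maximal search radius $r$, the range query $Range(q, r)$ selects all indexed objects within distance $r$ from $q$: $Range(q, r) = \{o \in I \mid d(o, q) \le r\}$.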

Figure 1a shows an example of a Range query in a planar domain. Let us give a real-life example of such a query - a user of a road-map application could formulate the following requirement: Give me all gas stations within a distance of ten kilometers from my current position.

Figure 1: Examples of a Range query (a) and a kNN query (b).

Querying a system via a range query requires background knowledge about the data and the distance function, and a certain "rule of thumb" estimate of a suitable query radius r. This is inconvenient for users and not even possible in some applications. The nearest neighbors query eliminates the necessity of specifying a data-dependent query radius.

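Definition 3 Given an object $q \in \mathcal{D}$ and an integer $k \ge 1$, the k-nearest neighbors query $kNN(q, k)$ retrieves a set $A \subseteq I$ such that $|A| = k$ and $\forall x \in A, \forall y \in I \setminus A : d(q, x) \le d(q, y)$.

In other words, the $kNN(q, k)$ query returns the $k$ indexed objects that are the closest to the query object $q$.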
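As an illustration of the two query types, the following minimal sketch evaluates both by a sequential scan, i.e., without any index. The function names `range_query` and `knn_query`, the Euclidean distance and the sample coordinates are assumptions made for this example only.

```python
import heapq
import math

def euclidean(x, y):
    """L2 distance between two equal-length tuples (an illustrative choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def range_query(indexed, d, q, r):
    """Return every indexed object o with d(q, o) <= r (one distance per object)."""
    return [o for o in indexed if d(q, o) <= r]

def knn_query(indexed, d, q, k):
    """Return the k indexed objects closest to q; ties are broken arbitrarily."""
    return heapq.nsmallest(k, indexed, key=lambda o: d(q, o))

# Hypothetical "gas stations" in the plane, query position at the origin.
stations = [(0, 0), (3, 4), (6, 8), (1, 1), (10, 0)]
print(range_query(stations, euclidean, (0, 0), 5.0))
print(knn_query(stations, euclidean, (0, 0), 2))
```

Both functions evaluate the distance function once per indexed object, which is exactly the linear cost that the index structures surveyed below try to avoid.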

2.1.3 Metric Distance Measures

Most of the natural dissimilarity measures fulfil the metric function conditions, and the metric space approach can be applied to datasets that use these distances. We present a few important examples of such functions in this section.

Minkowski Distances A very important family of distance functions for vector spaces are the Minkowski distances. For two n-dimensional real vectors and for a numeric parameter p, the Minkowski Lp metric is defined as:

$$L_p[(x_1, \ldots, x_n), (y_1, \ldots, y_n)] = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \qquad (1)$$

The $L_1$ distance is also referred to as the Manhattan distance, the $L_2$ metric is the well-known Euclidean distance, and $L_\infty = \max_{i=1}^{n} |x_i - y_i|$ is known as the chessboard distance. Figure 2 shows examples of several important members of this family in the two-dimensional vector space. The shapes depict sets of points that are at a constant distance from the central point for various $L_p$ metrics.


Figure 2: Points at constant distance from a center for various Lp metrics.
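A minimal sketch of Eq. 1 follows. The function name `minkowski` is an assumption of this example; the check `p >= 1` reflects the known fact that the triangle inequality does not hold for smaller values of p.

```python
def minkowski(x, y, p):
    """L_p distance of two equal-length real vectors; p = float('inf') gives L_infinity."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)                      # chessboard distance
    if p < 1:
        raise ValueError("p must be >= 1 for L_p to be a metric")
    return sum(v ** p for v in diffs) ** (1.0 / p)

print(minkowski((0, 0), (3, 4), 1))             # Manhattan
print(minkowski((0, 0), (3, 4), 2))             # Euclidean
print(minkowski((0, 0), (3, 4), float("inf")))  # chessboard
```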

Quadratic Form Distance Some vector datasets, e.g., color histograms of digital images, have individual components that are correlated rather than independent. The $L_p$ metrics do not reflect any inter-coordinate correlations. A concept that takes these into account and that has been successfully applied, e.g., to image databases is the quadratic form distance [38]. In an n-dimensional vector space, this approach bases the measure on an $n \times n$ positive semi-definite matrix $M = [m_{ij}]$, where the weights $m_{ij}$ denote the strength of the connection between components $i$ and $j$ of vectors $x$ and $y$, respectively. These weights are usually normalized so that $0 \le m_{ij} \le 1$ with the diagonal elements $m_{ii} = 1$. The quadratic form distance $d_M$ is evaluated by means of the following expression:

$$d_M(x, y) = \sqrt{(x - y)^T \cdot M \cdot (x - y)}.$$

Observe that for $M$ equal to the identity matrix this definition becomes the Euclidean distance.

Edit Distance The similarity of two sequences of symbols (strings) can be effectively measured by the edit distance; its most famous variant is also called the Levenshtein distance [30]. The edit distance of two strings $x = x_1 \cdots x_n$ and $y = y_1 \cdots y_m$ is defined as the minimal number of atomic edit operations insert, delete, and replace necessary to transform string $x$ into string $y$. The atomic operations are defined as follows:

insert $ins(x, i, c)$ inserts symbol $c$ into the string $x$ at the position $i$;

delete $del(x, i)$ deletes the symbol at the position $i$ from the string $x$:

$$del(x, i) = x_1 x_2 \cdots x_{i-1} x_{i+1} \cdots x_n;$$

replace $replace(x, i, c)$ replaces the symbol at position $i$ of $x$ with symbol $c$:

$$replace(x, i, c) = x_1 x_2 \cdots x_{i-1} c x_{i+1} \cdots x_n.$$

The weighted edit distance assigns weights (positive real numbers) to individual atomic operations. Then, the distance between strings is the minimum value of the weighted sum of atomic operations needed to transform one string to the other. In order to preserve the metric distance symmetry property, the weights for insert and delete operations must be equal.
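The edit distance is commonly computed by dynamic programming over string prefixes. The sketch below uses this standard technique; the function name and the parameters `w_indel` and `w_replace` are assumptions of the example. A single shared weight for insert and delete keeps the distance symmetric, as required above.

```python
def edit_distance(x, y, w_indel=1.0, w_replace=1.0):
    """Weighted edit distance of strings x and y (row-by-row dynamic programming)."""
    # prev[j] = cost of transforming the empty prefix of x into y[:j]
    prev = [j * w_indel for j in range(len(y) + 1)]
    for i, cx in enumerate(x, start=1):
        curr = [i * w_indel]                       # delete the first i symbols of x
        for j, cy in enumerate(y, start=1):
            match_cost = 0.0 if cx == cy else w_replace
            curr.append(min(prev[j] + w_indel,         # delete cx
                            curr[j - 1] + w_indel,     # insert cy
                            prev[j - 1] + match_cost)) # replace cx by cy (or keep it)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
```

With unit weights this is the Levenshtein distance; the running time is proportional to the product of the two string lengths, which illustrates why distance evaluations dominate the search costs.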

Figure 3: Examples of basic metric partitioning techniques: (a) the ball partitioning, (b) the excluded middle partitioning, (c) hyperplane partitioning.

2.2 Metric Partitioning Principles

In general, the fundamental principle of any storage structure is partitioning of the data space into clusters, so that not all of these clusters have to be searched at query time. In the metric space, the partitioning can only be defined with the assistance of some designated data objects, unlike, e.g., a standard vector space that is divisible by means of the coordinate system. Let us introduce the basic metric partitioning schemas for a subset $S \subseteq \mathcal{D}$.

2.2.1 Ball Partitioning

The ball partitioning [44] divides $S$ into two subsets $S_1$ and $S_2$ by means of a spherical cut with respect to a designated pivot $p \in \mathcal{D}$. Let $d_m$ be the median of $\{d(o, p) \mid o \in S\}$. Then $S_1$ and $S_2$ are created by the following rules:

$$S_1 \leftarrow \{o \in S \mid d(o, p) \le d_m\},$$

$$S_2 \leftarrow \{o \in S \mid d(o, p) \ge d_m\}.$$

This notation is not fully accurate - the redundant conditions $\le$ and $\ge$ are to be interpreted as enabling balanced partitioning if the median value is not unique. Then, the division of the median elements is arbitrary, but balanced. See Figure 3a for a ball partitioning example.
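The rules above can be sketched as follows. The function name `ball_partition` and the policy of sending each median object to the currently smaller subset are assumptions of this example, one possible reading of "arbitrary, but balanced".

```python
import statistics

def ball_partition(objects, pivot, d):
    """Split objects into (inner, outer) around the median distance d_m to the pivot."""
    dists = [d(o, pivot) for o in objects]
    d_m = statistics.median(dists)
    inner = [o for o, x in zip(objects, dists) if x < d_m]
    outer = [o for o, x in zip(objects, dists) if x > d_m]
    # Objects exactly at the median distance go to whichever side is smaller.
    for o, x in zip(objects, dists):
        if x == d_m:
            (inner if len(inner) <= len(outer) else outer).append(o)
    return d_m, inner, outer

d_m, s1, s2 = ball_partition([1, 2, 3, 4, 5, 6], 0, lambda a, b: abs(a - b))
print(d_m, s1, s2)
```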

2.2.2 Excluded Middle Partitioning

The following modification [46] of the ball partitioning schema is motivated by the effectiveness of execution of a Range(q, r) query. Given a threshold $g$, the set $S$ is divided into three subsets:

$$S_1 \leftarrow \{o \in S \mid d(o, p) \le d_m - g\},$$

$$S_2 \leftarrow \{o \in S \mid d(o, p) > d_m + g\},$$

$$S_3 \leftarrow \{o \in S \mid d_m - g < d(o, p) \le d_m + g\}.$$

Figure 3b gives an example of this partitioning. Please note that this schema is no longer balanced.

2.2.3 Generalized Hyperplane Partitioning

Unlike the previous schemas, the generalized hyperplane partitioning [44] divides the data space with respect to two pivots $p_1$, $p_2$:

$$S_1 \leftarrow \{o \in S \mid d(p_1, o) \le d(p_2, o)\},$$

$$S_2 \leftarrow \{o \in S \mid d(p_1, o) \ge d(p_2, o)\}.$$

Objects at the same distance from both pivots can, again, be separated arbitrarily. Figure 3c gives an example of this non-balanced partitioning schema.

2.2.4 Voronoi-like Partitioning

The Voronoi-like partitioning principle [6] can be seen as an extension of the generalized hyperplane partitioning to $n > 2$ pivots. Having a set of pivots $p_1, \ldots, p_n$, the set $S$ is divided into $n$ sets $S_1, \ldots, S_n$ in the following way:

$$\forall i\,(1 \le i \le n) : S_i \leftarrow \{o \in S \mid \forall j\,(1 \le j \le n) : d(o, p_i) \le d(o, p_j)\}.$$

The assignment of objects on the borderlines is resolved arbitrarily.

2.3 Search Space Pruning Policies

The general goal of most index structures is to reduce the amount of data that has to be searched by "brute force". Thus, having the data space partitioned, the structures may take advantage of the metric space properties to prune the search space at query time. Let us focus on the execution of the Range(q, r) query.

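2.3.1 Object-Pivot Distance Constraint

Having the distances between objects $o \in I$ and a pivot $p$ precomputed, the following constraint sets both the lower and the upper limits on the $d(q, o)$ values. When distance $d(q, p)$ is evaluated, the following lemma can be applied.

Lemma 1 Given a metric space $\mathcal{M} = (\mathcal{D}, d)$ and three arbitrary objects $q, p, o \in \mathcal{D}$, it is always guaranteed:

$$|d(p, q) - d(p, o)| \le d(q, o) \le d(p, q) + d(p, o).$$

When processing a Range(q, r) query, an object $o$ can therefore be discarded without evaluating $d(q, o)$ if $|d(p, q) - d(p, o)| > r$, and it can be included in the result directly if $d(p, q) + d(p, o) \le r$.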

2.3.2 Range-Pivot Distance Constraint

The next constraint can be applied when the exact distances $d(p, o)$ are not known but can be bounded by an upper and/or lower limit: $r_l \le d(p, o) \le r_h$.

Lemma 2 Given a metric space $\mathcal{M} = (\mathcal{D}, d)$ and objects $o, p \in \mathcal{D}$ such that $r_l \le d(p, o) \le r_h$, and given some $q \in \mathcal{D}$ and an associated distance $d(q, p)$, the distance $d(q, o)$ can be restricted by the range:

$$\max\{d(p, q) - r_h,\ r_l - d(p, q),\ 0\} \le d(q, o) \le d(p, q) + r_h.$$

This constraint is typically applied together with the ball or excluded middle partitioning (Sections 2.2.1 and 2.2.2) and, thus, a part of the data space is skipped without further searching.

2.3.3 Double-Pivot Distance Constraint

Unlike the previous two, this constraint requires the existence of two pivots $p_1$, $p_2$ and is related to the generalized hyperplane partitioning (Section 2.2.3). It offers the opportunity to skip searching one of the two separated subsets.

Lemma 3 Assume a metric space $\mathcal{M} = (\mathcal{D}, d)$ and objects $o, p_1, p_2 \in \mathcal{D}$ such that $d(p_1, o) \le d(p_2, o)$. Given a query object $q \in \mathcal{D}$ and the distances $d(p_1, q)$, $d(p_2, q)$, the distance $d(q, o)$ is lower-bounded as follows:

$$\max\{(d(p_1, q) - d(p_2, q))/2,\ 0\} \le d(q, o).$$
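The bound of Lemma 3 follows from two applications of the triangle inequality together with the assumption $d(p_1, o) \le d(p_2, o)$. The short derivation below is a sketch added for the reader's convenience; it is not quoted from the cited works.

```latex
\begin{align*}
d(p_1, q) &\le d(p_1, o) + d(q, o)
  && \text{(triangle inequality)} \\
d(p_2, q) &\ge d(p_2, o) - d(q, o) \ge d(p_1, o) - d(q, o)
  && \text{(triangle inequality, then } d(p_1, o) \le d(p_2, o)\text{)} \\
d(p_1, q) - d(p_2, q) &\le \bigl(d(p_1, o) + d(q, o)\bigr) - \bigl(d(p_1, o) - d(q, o)\bigr) = 2\, d(q, o)
\end{align*}
```

Dividing by two and combining with the non-negativity of $d$ gives the stated lower bound. Consequently, for a range query with radius $r$, the subset assigned to $p_1$ can be skipped whenever $(d(p_1, q) - d(p_2, q))/2 > r$.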

2.3.4 Pivot Filtering

The following constraint is the generalization of Lemma 1 to more than one pivot. Having precomputed distances between database objects o and n pivots, the following lemma enables effective search space pruning.

Lemma 4 Given a metric space $\mathcal{M} = (\mathcal{D}, d)$, a set of pivots $\{p_1, \ldots, p_n\} \subseteq \mathcal{D}$, objects $o, q \in \mathcal{D}$ and the respective distances $\forall i\,(1 \le i \le n) : d(p_i, o), d(p_i, q)$, the distance $d(q, o)$ can be bounded from below:

$$\max_{i} |d(p_i, q) - d(p_i, o)| \le d(q, o).$$

This filtering schema can be applied, e.g., together with the Voronoi-like partitioning (Section 2.2.4), for which the distances to all n pivots are evaluated when assigning objects to the corresponding subsets.
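The following sketch shows how the lower bound of Lemma 4 can prune a range query evaluated over a table of precomputed object-pivot distances. The names `build_pivot_table` and `range_query_filtered`, and the counter of evaluated distances, are assumptions introduced for this illustration.

```python
def build_pivot_table(indexed, pivots, d):
    """Precompute d(p_i, o) for every indexed object o and every pivot p_i."""
    return [[d(p, o) for p in pivots] for o in indexed]

def range_query_filtered(indexed, table, pivots, d, q, r):
    """Range query that skips objects whose pivot-based lower bound exceeds r.

    Returns the result list and the number of d(q, o) evaluations performed.
    """
    q_dists = [d(p, q) for p in pivots]        # n distance computations per query
    result, computed = [], 0
    for o, o_dists in zip(indexed, table):
        lower = max(abs(dq - do) for dq, do in zip(q_dists, o_dists))
        if lower > r:
            continue                           # discarded without computing d(q, o)
        computed += 1
        if d(q, o) <= r:
            result.append(o)
    return result, computed

dist = lambda a, b: abs(a - b)
data = list(range(0, 100, 10))
pivots = [0, 90]
table = build_pivot_table(data, pivots, dist)
print(range_query_filtered(data, table, pivots, dist, 42, 10))
```

In this one-dimensional toy dataset the lower bound is exact, so only the objects that qualify are compared with the query; in general metric spaces the bound is looser and some non-qualifying objects still have to be checked.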

2.4 Metric Data Structures

Several comprehensive surveys of the area of metric-based index structures have been published [26, 16, 48]. Many of the structures introduced there are static, main-memory structures that are not very suitable for huge volumes of data. Their rather simple architectures nevertheless reveal the principles of metric indexing; thus, we describe two of them - the Vantage Point Tree [45] and the Generalized Hyperplane Tree [44]. We also briefly introduce the principles of the two most significant dynamic disk-oriented structures - the M-Tree [17] and the D-Index [20].

2.4.1 Vantage Point Tree

The idea of the Vantage Point Tree (VPT) approach is based on the ball partitioning principle (see Section 2.2.1), which divides a given set $S \subseteq \mathcal{D}$ into two sets $S_1$, $S_2$ according to a pivot $p$ and the median $d_m$ of the distances between $p$ and the objects from $S$. Initially having the whole set of indexed objects $I \subseteq \mathcal{D}$, this technique recursively selects a pivot, applies the ball partitioning and builds a balanced binary tree until the size of the sets gets below a predefined limit.

The search algorithm for a Range(q, r) query traverses the tree from the root. In every internal node, it evaluates $d(q, p)$ and, using Lemma 2, derives lower bounds $b_l$ and $b_r$ on the distances between $q$ and the objects in the left and right subtrees, respectively. If $b_l > r$ or $b_r > r$, then the left or right subtree is not accessed, respectively. Please note that it may happen that both subtrees have to be accessed.

Two other structures with a similar approach can be found in the literature, namely, the Multi-Way Vantage Point Tree [14], which uses more than one radius for partitioning, and the Excluded Middle Vantage Point Forest [46], which builds a separate tree from the objects in the exclusion area (see Section 2.2.2).
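A compact sketch of a Vantage Point Tree with a range search is given below. It is an illustration under stated assumptions rather than the published algorithm: pivots are chosen at random, the pivot is stored in the internal node, objects at the median distance all go to the left subtree (so the tree is balanced only when distances are distinct), and leaves are plain lists.

```python
import random
import statistics

class VPNode:
    """Internal node: a pivot, the median distance d_m, and two subtrees."""
    def __init__(self, pivot, d_m, left, right):
        self.pivot, self.d_m = pivot, d_m
        self.left, self.right = left, right

def build_vpt(objects, d, leaf_size=2, rng=None):
    """Recursively apply ball partitioning; small sets become leaf buckets (lists)."""
    rng = rng or random.Random(0)      # fixed seed: illustrative pivot selection only
    objects = list(objects)
    if len(objects) <= leaf_size:
        return objects
    pivot = objects.pop(rng.randrange(len(objects)))
    dists = [d(o, pivot) for o in objects]
    d_m = statistics.median(dists)
    left = [o for o, x in zip(objects, dists) if x <= d_m]
    right = [o for o, x in zip(objects, dists) if x > d_m]
    return VPNode(pivot, d_m,
                  build_vpt(left, d, leaf_size, rng),
                  build_vpt(right, d, leaf_size, rng))

def range_search(node, d, q, r):
    """Collect all objects within distance r of q, skipping subtrees when possible."""
    if isinstance(node, list):                     # leaf bucket: sequential scan
        return [o for o in node if d(q, o) <= r]
    dq = d(q, node.pivot)
    found = [node.pivot] if dq <= r else []
    if dq - node.d_m <= r:     # lower bound for the left subtree does not exceed r
        found += range_search(node.left, d, q, r)
    if node.d_m - dq <= r:     # lower bound for the right subtree does not exceed r
        found += range_search(node.right, d, q, r)
    return found

dist = lambda a, b: abs(a - b)
tree = build_vpt(range(20), dist)
print(sorted(range_search(tree, dist, 7, 3)))
```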

2.4.2 Generalized Hyperplane Tree

The basic structure of the Generalized Hyperplane Tree (GHT) is similar to the VPT. The GHT selects two pivots in every step of the tree building and partitions the dataset using the generalized hyperplane schema (see Section 2.2.3). The Range(q, r) search can skip one of the two subtrees of a node by applying the double-pivot distance constraint (see Section 2.3.3).

2.4.3 M-Tree

The above-mentioned approaches, and other similar ones, form static, main-memory and often imbalanced index structures. Let us briefly introduce the M-Tree [17] - a dynamic, disk-oriented, balanced tree-like structure. Similarly to R-Trees [24] and B-Trees [11], it is built in a bottom-up fashion by splitting overfilled nodes. Every tree node contains a pivot and a radius that specify a sphere-like region of the space covered by the node and its sub-tree. The leaf nodes store data objects together with their distances to the pivot in the parent node. The internal nodes keep distances to the parent node pivots as well. All these values are utilized to prune the search space in the search algorithms.

2.4.4 D-Index

Unlike the mentioned tree-based structures, the D-Index [20] defines a hash function that maps objects to respective storage items. The excluded middle partitioning with an exclusion area of width 2g (see Section 2.2.2) is used multiple times in order to define several g-split functions. These functions form a multi-level hierarchy and their consecutive usage constitutes the overall hashing function. On the first level, a g-split function separates objects of the whole dataset. For any other level, objects mapped to the exclusion area of the previous level are the candidates for storage in buckets on this level.

3 Peer-to-Peer Structures

The fundamental idea of the peer-to-peer (P2P) paradigm is that all nodes in the logical network are equal, all acting as clients and servers at the same time. While this basis remains stable, the scope of application together with the meaning of the term P2P went through a significant development during the past few years. Although the most widespread P2P applications are still the world-scale file sharing systems, the Distributed Hash Tables (DHTs) have formed another important family of P2P systems. These structures provide storage rather than data sharing, because they redistribute the data between the peers and enable efficient localization of a stored data item given its identifier. Later on, the functionality of DHTs has been extended in more complex structures that support the evaluation of queries with an interval scope in one of the search attributes or even in several attributes at the same time. Some systems have taken a different path and focus on the localization of data items that are near a specified query point. Let us track the path from DHTs towards structures providing a full-featured similarity search.

3.1 Distributed Hash Tables

As mentioned above, the DHTs are storage-providing structures supporting an efficient identifier-item lookup.

3.1.1 Chord

Chord [40] is one of the most famous DHT protocols. It provides support for the following operation: it maps a given search key onto the node responsible for this key. It is a message-driven dynamic structure that is able to adapt as nodes join or leave the system. Using consistent hashing [29], the protocol uniformly maps the domain of search keys into the Chord domain of keys $[0, 2^m)$. Every Chord node $N_i$ is assigned a key $K_i$ from the same domain, $K_i \in [0, 2^m)$. The identifiers are ordered in an identifier circle modulo $2^m$, $K_i < K_j \Leftrightarrow i < j$. Node $N_i$ is "responsible" for all keys from the interval $(K_{i-1}, K_i] \pmod{2^m}$ - see Figure 4a for a visualization. A node stores the objects whose keys fall within its responsibility. Every node maintains the physical address of its successor on the identifier circle and, furthermore, a routing table called the finger table with long-distance links to up to m other nodes. Due to the uniformity of the Chord domain distribution, the protocol preserves (with high probability) a balanced load of the nodes and a logarithmic hop count for the key search operation (Figure 4a gives an example of message passing to reach node $N_i$ responsible for key k).
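To make the lookup mechanism concrete, the following sketch simulates a static Chord ring in a single process. It is a simplification under stated assumptions: there is no joining, leaving or messaging, the finger tables are computed from a global view, and the class and function names (`ChordRing`, `in_interval`, `lookup`) are introduced for this example only. The i-th finger of a node with key n points to the node responsible for key n + 2^i, following the Chord design.

```python
from bisect import bisect_left

def in_interval(x, a, b, closed_right):
    """Is x in (a, b), or in (a, b] if closed_right, on the identifier circle?"""
    if a < b:
        return a < x < b or (closed_right and x == b)
    return x > a or x < b or (closed_right and x == b)   # interval wraps around

class ChordRing:
    def __init__(self, m, node_ids):
        self.size = 2 ** m
        self.nodes = sorted(set(node_ids))
        # finger[i] of node n = node responsible for key (n + 2^i) mod 2^m
        self.fingers = {n: [self.successor(n + 2 ** i) for i in range(m)]
                        for n in self.nodes}

    def successor(self, key):
        """First node at or after key on the circle (global view, for setup and checks)."""
        i = bisect_left(self.nodes, key % self.size)
        return self.nodes[i % len(self.nodes)]

    def lookup(self, start, key):
        """Route from node `start` towards key; return (responsible node, path)."""
        key %= self.size
        node, path = start, [start]
        while node != key:
            succ = self.fingers[node][0]
            if in_interval(key, node, succ, closed_right=True):
                if succ != node:
                    path.append(succ)
                return succ, path
            # forward to the farthest finger that still precedes the key
            node = next(f for f in reversed(self.fingers[node])
                        if in_interval(f, node, key, closed_right=False))
            path.append(node)
        return node, path

ring = ChordRing(6, [1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
print(ring.lookup(8, 54))
```

In this ten-node example, node 8 reaches the node responsible for key 54 in three hops, because every forwarding step at least halves the remaining distance on the identifier circle.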


Figure 4: Structure schema of the Chord (a), the CAN (b), and the P-Grid (c)

3.1.2 CAN

The Content Addressable Network (CAN) [36] is a DHT organization that maps the data objects to a virtual d-dimensional space using a uniform hash function. Every node of the system takes over responsibility for a zone of the space and stores data with hash values from this zone (see an example of CAN partitioning of a two-dimensional space in Figure 4b). Every peer maintains a routing table with the network identifiers and the coordinates of its neighbors in the virtual space. In every step, the routing algorithm passes the query to the neighboring peer that is geometrically the closest to the target point in the space. The average number of neighbors per peer is proportional to d, while the average number of hops needed to reach a given zone decreases as d grows (it is of order $d \cdot n^{1/d}$ for n nodes). See an example of routing towards the node responsible for key (x, y) in Figure 4b.

3.2 One-Dimensional Range Queries

Although the DHT routing algorithms could be, in most cases, modified in order to support interval queries in the routing space, DHTs usually utilize hashing functions to reach a balanced load of the peers, and the hashing breaks the connection between proximity in the search space and proximity in the routing space. The structures introduced in this section deal with the issue of one-dimensional range queries while preserving the balanced load. An example of a range query on the attribute "age" is: All people born in the 1970s.

3.2.1 P-Grid

The P-Grid [1, 3] divides the key space between the peers by making each peer responsible for a prefix of the binary notation of keys. Then, the routing mechanism is based on a virtual binary trie structure (see Figure 4c). For each bit of the binary string prefix, the peer maintains a reference to a peer that is responsible for the other side of the binary tree at that level. For instance, peer A in Figure 4c is assigned key 11 and it has references to peer D (responsible for key 0*) and to peer C (key 10). If A receives a 0001 query, it forwards it to D, which forwards it further to B, which is responsible for keys 00*. Because the storage load of the leaf node peers is kept balanced, the trie is not necessarily balanced for non-uniform distributions of the search domain.

Figure 5: A Skip graph with 3 levels (a); z-ordering space filling curve (b).

However, the authors show [2] that the average expected cost of a search in terms of the number of messages is logarithmic even for an unbalanced logical tree. Since P-Grid employs no randomized hashing, it can resolve range queries, which are studied in detail separately [19].

3.2.2 Skip Graphs & SkipNet

The Skip Graphs [5] and the SkipNet [25] structures were published simultaneously and their core is identical. They are based on the idea of Skip lists [35] and have the following multi-level architecture (see an example in Figure 5a). Level 0 contains a list of all nodes ordered by their keys taken from the search domain. Every node is assigned a random membership vector which determines its membership in the higher-level lists. On level $i \in \mathbb{N}$, all nodes sharing the same membership vector prefix of length i are in the same ordered list. There are $O(\log n)$ levels in an n-node system - until every node remains alone in some list. In the SkipNet, these lists are linked to form circles.

If peer p searches for key k, the search starts in the highest-level list by looking for a neighboring peer with a key closer to k. If such a peer is found, the request is forwarded to it; otherwise, the search continues in the lower-level list. For instance, node 48 searching for key 18 first forwards the request to node 13, which then finds, on level 0, its neighbor 21, which is the closest to key 18. The range queries in Skip Graphs can be resolved by finding a single element of the interval and then broadcasting the query to all nodes in the interval by flooding. This operation is proved to take $O(\log n)$ time [5].

However, the Skip Graphs either support load balancing of the nodes by hashing the key space or preserve the content locality property (and support range queries) but without the guarantee of a balanced load. The SkipNet enables usage of a hybrid storage with a constrained load balancing within a given domain subinterval. This interval is then redistributed by uniform hashing, which damages the full-fledged spatial locality. But the technique described in [4] can be used for load balancing of these structures while keeping them range-queriable.

3.2.3 P-Tree

The P-Tree index [18] is based on the idea of the B+-tree and inherits the range lookup algorithm from it. Unfortunately, it is primarily designed for one resource stored in each peer and, therefore, has no load-balancing mechanism itself. As for Skip Graphs, the approach from [4] can be used to balance the load. More precisely, there is a semi-independent B+-tree maintained by every peer. Let us first describe fully independent trees for a set of peers $p_1, \ldots, p_n$ with keys $k_1, \ldots, k_n$ (in increasing order). Each peer $p_i$ views the set of keys as organized in a ring with $k_i$ as the smallest value and builds a full B+-tree over these keys. The B+-tree leaf nodes have pointers to the peers with the respective keys. Such an index is space consuming and has high management requirements. The semi-independent tree, i.e., the P-Tree, lets every peer maintain only the left-most root-to-leaf path of the corresponding fully independent B+-tree. The pointers to the tree nodes that are not stored locally lead to the corresponding peers.

3.3 Range Queries over Multiple Attributes

The next step towards a full similarity search is supporting range queries over multiple attributes. Let us introduce several systems that move in this direction.

3.3.1 SCRAP & HSFC-based

The SCRAP [23] and the HSFC-based [37] structures adopt a very similar approach. They both use space-filling curves (z-ordering [34] or the Hilbert curve [27]) in order to map the n-dimensional space into a one-dimensional domain (see Figure 5b for an example of z-ordering). The data is mapped onto the peers by building the Chord (Section 3.1.1) or Skip graphs (Section 3.2.2) structures on the one-dimensional domain. To resolve a multi-dimensional range query, well-known algorithms (e.g., [34, 13]) are first used to map the query onto several intervals of the space-filling domain. Then, the DHT protocols are employed to route the query to all peers overlapping with these intervals.

The load balancing for both structures is achieved by performing the boundary adjustment for the overloaded nodes [22] or by inserting a new node in order to split the most loaded node (HSFC-based).
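The z-ordering itself can be sketched in a few lines by interleaving the bits of the coordinates. The two-dimensional case, the function names, and the convention that x occupies the even bit positions are assumptions of this example; the cited systems may use a different convention.

```python
def z_encode(x, y, bits):
    """Interleave the lowest `bits` bits of x and y into one z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x takes the even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y takes the odd positions
    return code

def z_decode(code, bits):
    """Inverse of z_encode: recover (x, y) from a z-order key."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y

# The four cells of a 2 x 2 grid in z-order, then one larger example.
print([z_encode(x, y, 1) for y in (0, 1) for x in (0, 1)])
print(z_encode(3, 5, 3), z_decode(39, 3))
```

Points that are close on the curve are close in the original space, but the converse does not always hold, which is why a single multi-dimensional range maps to several one-dimensional intervals.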

3.3.2 ZNet

The ZNet structure [39] takes advantage of the z-ordering as well, but in a slightly different manner. It first partitions the original multi-dimensional space into zones that are assigned to the peers and only then employs the z-curve to order the zones (and peers) into a sequence. Again, the skip graphs (Section 3.2.2) are used for intra-system navigation. The properties of the z-curve are utilized to resolve all nodes covering the query region. The system employs the same load-balancing techniques as mentioned in the previous section.

3.3.3 MAAN

The Multi-Attribute Addressable Network (MAAN) [15] first maps each attribute domain into the Chord routing space. It expects knowledge of the distributions of the attribute domains and, thus, the mapping can be done using a locality-preserving hash function with uniform distribution. So, for each of n attributes, the peer is responsible for a portion of the attribute domain. When inserting a new resource x with identifier $r_x$ and attribute values $v_1, \ldots, v_n$, the vector $(v_1, \ldots, v_n; r_x)$ is forwarded to all n peers responsible for the $v_i$ values. The authors do not discuss the space requirements of the index. When resolving a multi-attribute range query, the query is routed to all peers responsible for the query interval in one selected attribute $a_j$. These peers locally search for resources that fulfil the whole query. The authors present experiments on a static dataset, knowing the data distribution in advance.

3.3.4 Mercury

The approach adopted in Mercury [12] can be compared to the MAAN approach. It does not hash the attribute domains but uses them directly to build Chord-style routing circles (using local and long-distance links). Such a circle, called a hub, is created for every attribute, but every hub is built only on a subset of the nodes (the subsets are potentially disjoint). The multi-attribute range query algorithm is similar to the MAAN algorithm. The routing and load balancing are achieved by a technique based on random sampling which allows every peer to estimate metrics about the whole system. The efficiency is measured in terms of the number of hops and the total number of messages for routing three-dimensional range queries.

3.3.5 MURK

Unlike all the previous structures in this section, the MURK system [23] does not map the original space into one dimension but routes directly in the multi-dimensional space. This system partitions the data space using the k-d tree approach: every node is responsible for "rectangles" of the space (hypercuboids in higher dimensions) and every new peer splits the region of an existing peer in order to divide its load equally. The routing mechanism is based on the idea of greedy CAN navigation (Section 3.1.2). This simple routing is extended by skip pointers in order to speed up the navigation. The skip pointers are either random or emulate the exponential distribution of pointers known from one-dimensional routing. This approach gives the desired logarithmic hop count.

3.4 Nearest Neighbors Queries

The systems introduced in the previous section generalize the search query to retrieving data objects that are mutually similar in the sense of belonging to a range specified by the query. The last step towards the similarity search as we perceive it is searching for data objects that are "the most similar" to a given query point. The systems described in this section allow such querying for specific data domains.

3.4.1 pSearch

The pSearch [41] is an information retrieval system that looks up documents whose content best fits the query terms (words). The vector space model represents the documents and queries as term vectors where each element of the vector corresponds to the importance of a term in the document or query. The similarity between a query and a document is measured as the cosine of the angle between their term vectors. In order to decrease the dimensionality of the term space and to filter out the noise, pSearch does not utilize the vector space model itself but the latent semantic indexing, which uses statistically derived conceptual indices instead of terms. The CAN structure is employed to partition the semantic space and to route the insert and query operations. To resolve a query, its position in the space is computed first; the query is then forwarded to the peer responsible for this position and flooded to the peers within a determined similarity radius. The receiving peers perform a local search and return the best matching documents.
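The cosine measure used by the vector space model can be sketched as follows. The function name and the convention of returning 0.0 for an all-zero vector are assumptions of this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0          # assumed convention: an empty document matches nothing
    return dot / (norm_u * norm_v)

# Term weights over a hypothetical three-term vocabulary.
query = (1.0, 1.0, 0.0)
print(cosine_similarity(query, (2.0, 2.0, 0.0)))   # same direction as the query
print(cosine_similarity(query, (0.0, 0.0, 3.0)))   # no shared terms
```

Note that the cosine is a similarity (larger means closer) and not a distance; a larger value here plays the role of a smaller distance in the queries of Section 2.1.2.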

3.4.2 Distributed Quadtree

The distributed quadtree [42, 43] is a P2P index for the spatial domain. The index builds a virtual quadtree over the set of managed spatial objects, and the quadtree blocks are hashed using the Chord hashing in order to make peers responsible for particular blocks. Chord is used for routing as well. The system supports searching for the k objects that are nearest to a given query point (in the sense of Definition 3 on Page 3). The peer posing the query maintains a priority queue of blocks that possibly contain answer objects. The queue is ordered by the lower bound on the distance from the query point to the objects in the blocks. While the query is processed, the blocks that cannot contain any object nearer than the current answer set are removed from the queue.
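The queue-driven search described above can be sketched in a single process as follows. This is a simplified illustration under stated assumptions: blocks are axis-aligned rectangles in the plane, all blocks are available locally in a dictionary (in the distributed index they would be fetched from the responsible peers), and the names `min_dist`, `knn`, and `blocks` are hypothetical.

```python
import heapq
import math

def min_dist(q, block):
    """Lower bound on the distance from q to any point inside a rectangular block."""
    x0, y0, x1, y1 = block
    dx = max(x0 - q[0], 0.0, q[0] - x1)
    dy = max(y0 - q[1], 0.0, q[1] - y1)
    return math.hypot(dx, dy)

def knn(q, blocks, k):
    """Return the k points nearest to q; blocks maps (x0, y0, x1, y1) -> list of points."""
    if k <= 0:
        return []
    queue = [(min_dist(q, b), b) for b in blocks]  # ordered by lower bound
    heapq.heapify(queue)
    answer = []  # max-heap of the best k so far, stored as (-distance, point)
    while queue:
        bound, block = heapq.heappop(queue)
        if len(answer) == k and bound > -answer[0][0]:
            break  # the head cannot improve the answer, so no remaining block can
        for p in blocks[block]:
            d = math.hypot(q[0] - p[0], q[1] - p[1])
            if len(answer) < k:
                heapq.heappush(answer, (-d, p))
            elif d < -answer[0][0]:
                heapq.heapreplace(answer, (-d, p))
    return [p for _, p in sorted((-nd, p) for nd, p in answer)]

# Hypothetical data: four unit blocks with a few points each.
blocks = {
    (0.0, 0.0, 1.0, 1.0): [(0.1, 0.1), (0.9, 0.9)],
    (1.0, 0.0, 2.0, 1.0): [(1.5, 0.5)],
    (0.0, 1.0, 1.0, 2.0): [(0.5, 1.9)],
    (1.0, 1.0, 2.0, 2.0): [(1.9, 1.9)],
}
```

Because the queue is ordered by the lower bound, stopping at the first block whose bound exceeds the current k-th distance has the same effect as removing all such blocks from the queue.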

3.4.3 SWAM

The authors of the SWAM approach [7] formalize in general terms the issue of similarity search in vector domains in P2P Data Networks (PDNs). They define similarity in terms of Range and kNN queries on vector spaces with Lp metrics (Eq. 1). They propose SWAM, a family of small-world based access methods. It can be considered a general model for all structures that constitute a graph of peers (nodes) and links (edges) in which nodes have edges to spatially neighboring nodes as well as random edges to distant nodes. Further, they describe SWAM-V, a member of this family that partitions the data space in a Voronoi-like manner into neighboring cells. For instance, the MURK system (Section 3.3.5) can be considered an example of a SWAM-V structure. The authors also define three general metrics to measure the performance of similarity searching on PDNs. The experimental part compares SWAM-V with the CAN (Section 3.1.2) and with the BASIC access method, a random graph that simply floods queries to all nodes.

3.5 Metric Space Structures

In this section, we introduce two distributed structures that treat the dataset $\mathcal{D}$ together with the distance function $d$ as the metric space $\mathcal{M} = (\mathcal{D}, d)$ and define similarity in the terms of Section 2. Both the GHT* [9, 8] and the MCAN [21] provide support for the Range and kNN similarity queries.

Figure 6: Address in GHT* (a); MCAN query region for Range(

[Legend of panel (a): bucket; NNID or BID; inner node]

These systems adopt the requirements for designing Scalable and Distributed Data Structures (SDDS) [31]:

• Data expands to new nodes gracefully and only when the nodes already used are efficiently loaded.
• There is no master site to be accessed when searching for objects, e.g., there is no centralized directory.
• The data access and maintenance primitives, e.g., search, insertion, split, etc., never require atomic updates to multiple nodes.

Therefore, unlike the previous structures, these systems do not automatically employ all available peers but expand according to the size of the stored data. Load balance is achieved in this way.

3.5.1 GHT*

In Section 2.4.2, we introduced the Generalized Hyperplane Tree (GHT) as one of the fundamental structures for metric indexing. The core idea of the GHT* is a dynamic and distributed extension of this structure. In more detail, the architecture of the system and its navigation are determined by an Address Search Tree (AST) based on the GHT principles. Every peer has a unique identifier (NNID) and provides a set of storage buckets, each holding a BID identifier unique within the peer. Further, the peers maintain the AST (or a part of it), whose inner nodes consist of two pivots (in the GHT manner) and whose leaf nodes point either to local buckets (BID) or to another peer (NNID) (see Figure 6a).

The insert operation traverses the AST using the generalized hyperplane splitting (Section 2.2.3) down to the bucket into which the inserted object belongs. When the bucket overflows, it is split by selecting two pivots, creating an inner node and allocating a new bucket either on the local peer or on another one. The AST can be unbalanced. The Range(

1. Search the bucket that would store q for the k nearest objects within this bucket. Measure the distance r from q to the kth nearest object found.
2. Execute the Range(
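The insert operation of the GHT* described earlier in this section (traversal by generalized hyperplane partitioning and bucket split on overflow) can be sketched locally as follows. This is a minimal single-process illustration, not the GHT* implementation: the class names, the naive choice of the first two stored objects as pivots, and the absence of any peer allocation (NNID pointers) are assumptions of the sketch.

```python
class Bucket:
    """Leaf of the tree: a storage bucket with a fixed capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.objects = []

class Inner:
    """Inner node holding two pivots, in the GHT manner."""
    def __init__(self, p1, p2, left, right):
        self.p1, self.p2 = p1, p2
        self.left, self.right = left, right

def insert(node, obj, dist):
    """Insert obj below node and return the (possibly new) root of this subtree."""
    if isinstance(node, Inner):
        # Generalized hyperplane rule: follow the side of the closer pivot.
        if dist(obj, node.p1) <= dist(obj, node.p2):
            node.left = insert(node.left, obj, dist)
        else:
            node.right = insert(node.right, obj, dist)
        return node
    node.objects.append(obj)
    if len(node.objects) <= node.capacity:
        return node
    # Overflow: select two pivots (naively, an assumption) and redistribute.
    p1, p2 = node.objects[0], node.objects[1]
    left, right = Bucket(node.capacity), Bucket(node.capacity)
    for o in node.objects:
        (left if dist(o, p1) <= dist(o, p2) else right).objects.append(o)
    return Inner(p1, p2, left, right)

# Hypothetical usage with numbers and the absolute difference as the metric.
dist = lambda a, b: abs(a - b)
root = Bucket(capacity=2)
for x in [1, 10, 2, 9]:
    root = insert(root, x, dist)
```

With this pivot choice a split may leave one of the new buckets still over capacity (for example, when many objects are identical); the sketch simply splits again on the next insert into that bucket, which is one reason the resulting tree can be unbalanced.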

3.5.2 MCAN

Unlike the GHT*, the MCAN structure [21] is not purely based on native metric indexing principles. It transforms the similarity search problem into the problem of resolving multi-dimensional range queries in vector spaces and then employs the CAN structure. Thus, the data space $\mathcal{D}$ is mapped into the n-dimensional vector space $\mathbb{R}^n$ using relative distances to n selected pivots $p_1, \ldots, p_n$:

$$F : \mathcal{D} \rightarrow \mathbb{R}^n; \quad F(x) = (d(x, p_1), d(x, p_2), \ldots, d(x, p_n)).$$

The CAN (Section 3.1.2) is used to partition the n-dimensional routing space. Together with the $L_\infty$ metric as the distance function in $\mathbb{R}^n$, the mapping $F$ is contractive, i.e., $L_\infty(F(x), F(y)) \le d(x, y)$. Thus, the Range(
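The pivot mapping F and its contractive property can be checked numerically with a small Python sketch. The Euclidean plane as the metric space, the three pivots, and the function names are illustrative assumptions; the contractiveness itself follows from the triangle inequality, since |d(x, p) - d(y, p)| ≤ d(x, y) holds for every pivot p.

```python
import math

def euclid(a, b):
    """Euclidean distance in the plane, standing in for a generic metric d."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def pivot_map(x, pivots, dist):
    """F(x) = (d(x, p_1), ..., d(x, p_n))."""
    return tuple(dist(x, p) for p in pivots)

def l_inf(u, v):
    """L-infinity distance between two vectors of equal length."""
    return max(abs(a - b) for a, b in zip(u, v))

# Hypothetical objects and pivots.
pivots = [(1.0, 0.0), (0.0, 5.0), (6.0, 7.0)]
x, y = (0.0, 0.0), (3.0, 4.0)

fx, fy = pivot_map(x, pivots, euclid), pivot_map(y, pivots, euclid)
mapped_distance = l_inf(fx, fy)      # distance in the vector space
original_distance = euclid(x, y)     # distance in the metric space
```

In this example `mapped_distance` does not exceed `original_distance`, as the contractive property requires. A consequence is that any object within distance r of a query q is mapped to a point within L-infinity distance r of F(q), so searching that region of the vector space cannot miss a qualifying object.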

4 Thesis Plan

In the previous section, the author aimed to provide a comprehensive overview of the families of peer-to-peer systems and of their most significant members. This section analyzes the current state of the art and proposes a new system that would fill a gap in the present spectrum.

4.1 Overview

The development trends in the peer-to-peer world clearly head towards structured storage-providing systems that are able to process more complex types of queries. Similarity searching, e.g., finding the k most relevant data objects for a given query, is very convenient for users, and for some types of data it is the only feasible way of searching. On the other hand, indexing and query processing in such a searching paradigm are inherently expensive: centralized data structures for similarity searching do not scale well with the growing size of the dataset.

Besides P2P systems that handle exact match queries (Distributed Hash Tables), there are structures for interval queries in one or multiple dimensions (see Sections 3.2 and 3.3) and there are several systems supporting full-fledged similarity searching tailored to a specific dataset (see Section 3.4). Considering similarity in its most general definition, i.e., using the metric space model, two distributed systems have been proposed recently: the GHT* and the MCAN.

One of the hot topics throughout the area of self-organizing distributed structures is how to achieve a balanced load of the cooperating peers without redistributing the data by a uniform random hash function, which violates the locality of data. The two metric-based distributed structures introduced above cope with this issue by splitting overloaded peers. This concept requires a global control mechanism to administer the available resources. Such global control is possible if the system is implemented within a cluster of computers or over a dedicated set of workstations connected by a high-speed LAN, but it is not suitable for a world-wide peer-to-peer overlay network.

4.2 Objectives

The main objective of the following work is to propose a distributed self-organizing structure for general similarity searching that would balance the load of its nodes without the necessity of managing the pool of available resources. Several more or less general techniques for such load balancing have recently appeared in the literature (e.g., [4, 22]). These techniques focus on data organized in a one-dimensional key space. We would like to study and utilize these balancing techniques and, thus, the objectives can be itemized as follows:

• propose a distributed peer-to-peer system that would transform the issue of similarity searching in metric spaces into searching in a one-dimensional domain;
• create a prototype implementation of this structure and compare it with the GHT* and the MCAN on the levels of principles and performance;
• find suitable techniques to achieve a balanced load of the storage units without any global control;
• study the influence of various ways of load balancing and replication on the intra-query and inter-query parallelism of the query processing.

The author has already made an effort to achieve the first two items of the list above. The proposed system is referred to as M-Chord [33, 32] and its performance has been compared with the GHT*, with its mutation VPT*, and with the MCAN [10].

4.3 System Structure Proposal

Let us describe the architecture of the proposed system M-Chord in detail. The fundamental idea is to map the data space into a one-dimensional domain and to link this domain with the P2P routing protocol Chord (Section 3.1.1).

First, let us analyze the mapping of the space to one dimension. Learning from the peer-to-peer approaches introduced in Section 3, a possible solution is to first map the metric space into an n-dimensional space in the same way as the MCAN does and then use space-filling curves in order to reduce the dimensionality to one. This approach has been used in the SCRAP structure [23]; unfortunately, analysis of its performance shows that it becomes inefficient for more than two dimensions.

Figure 7: The principles of the iDistance method.

Therefore, another solution has been found and the mapping is defined by the metric space generalization of the vector indexing method iDistance [47, 28]. This technique partitions the vector space into n clusters $(C_0, C_1, \ldots, C_{n-1})$, then identifies reference points within the clusters $(p_0, p_1, \ldots, p_{n-1})$, and maps the data objects according to their distances from the cluster's reference point. Having a separation constant $c$ that avoids overlaps of the clusters, the iDistance value for an object $o \in C_i$ is $idist(o) = d(p_i, o) + i \cdot c$. This mapping schema is visualized in Figure 7a. When a range query Range(

$$mchord(o) = h(d(p_i, o) + i \cdot c). \qquad (2)$$
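The key computation of Eq. 2 can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements of the proposal: each object is assigned to the cluster of its closest reference point, the function h is replaced by a hypothetical order-preserving placeholder that scales the iDistance domain into [0, 1), and the data are numbers with the absolute difference as the metric.

```python
def idist(o, pivots, c, dist):
    """iDistance value d(p_i, o) + i * c, with i the index of the closest pivot (assumed)."""
    distances = [dist(p, o) for p in pivots]
    i = min(range(len(pivots)), key=lambda j: distances[j])
    return distances[i] + i * c

def make_h(n, c):
    """Hypothetical placeholder for h: order-preserving scaling of [0, n*c) into [0, 1)."""
    return lambda value: value / (n * c)

def mchord(o, pivots, c, dist, h):
    """M-Chord key of object o in the sense of Eq. 2."""
    return h(idist(o, pivots, c, dist))

# Hypothetical setting: two reference points on the real line.
dist = lambda a, b: abs(a - b)
pivots = [0.0, 10.0]
c = 100.0                      # must exceed every distance occurring inside a cluster
h = make_h(len(pivots), c)

key_a = mchord(2.0, pivots, c, dist, h)   # cluster 0, iDistance value 2.0
key_b = mchord(9.0, pivots, c, dist, h)   # cluster 1, iDistance value 101.0
```

Because the placeholder h preserves order, objects of cluster i always receive smaller keys than objects of cluster i + 1, and within a cluster the keys grow with the distance from the reference point.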

Once the data space is mapped into the one-dimensional M-Chord domain, the responsibility for intervals of this domain is divided between the active peers of the system. The navigation within the system is provided by the P2P routing protocol Chord (see Section 3.1.1). Figure 8 shows the logical architecture of the system.

Figure 8: The schema of (a) the insert and (b) Range operation for the M-Chord structure.

Part (a) of the figure provides a schema of the insert operation of an object $o \in \mathcal{D}$ into the structure. First, the initiating peer $N_{ins}$ computes the $mchord(o)$ key using Eq. 2 and then employs Chord to forward a store request to the peer $N_o$ responsible for the computed key. Peers store data objects in a B+-tree according to their M-Chord keys. When a peer reaches its storage size limit, it executes a split: a new peer is placed on the M-Chord circle so that the requester's storage is split evenly.

Since the data mapping is based on the iDistance idea, several M-Chord intervals of interest may be determined for a Range( r are filtered out. Such a filtering can be performed with respect to all n pivots in the sense of Lemma 4 on Page 7. When inserting an object $o$ into the M-Chord and evaluating Eq. 2, the distances $d(o, p_i)$ are computed $\forall i : 0 \le i < n$. These values are stored along with the object $o$, and the improved filtering using all pivots is applied at query time in order to avoid unnecessary distance evaluations.
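The filtering with all pivots can be illustrated by the following Python sketch of a local range search. It assumes the standard pivot filtering rule: an object o can be discarded without computing d(q, o) if |d(o, p_i) - d(q, p_i)| > r for some pivot p_i. The data layout (a list of objects paired with their precomputed pivot distances), the one-dimensional example data, and the function names are assumptions; the B+-tree storage of a peer is not modeled.

```python
def can_discard(obj_pivot_dists, query_pivot_dists, r):
    """True if some pivot proves, by the triangle inequality, that d(q, o) > r."""
    return any(abs(do - dq) > r
               for do, dq in zip(obj_pivot_dists, query_pivot_dists))

def range_search(stored, q, r, pivots, dist):
    """stored: list of (object, distances to all pivots precomputed at insert time)."""
    q_dists = [dist(q, p) for p in pivots]   # n distance computations per query
    result, evaluated = [], 0
    for obj, o_dists in stored:
        if can_discard(o_dists, q_dists, r):
            continue                          # filtered without computing d(q, obj)
        evaluated += 1
        if dist(q, obj) <= r:
            result.append(obj)
    return result, evaluated

# Hypothetical data: numbers with the absolute difference as the metric.
dist = lambda a, b: abs(a - b)
pivots = [0.0, 100.0]
stored = [(o, [dist(o, p) for p in pivots]) for o in [1.0, 50.0, 99.0]]

answer, evaluated = range_search(stored, q=2.0, r=3.0, pivots=pivots, dist=dist)
```

In this example only one of the three stored objects requires an actual distance evaluation; the other two are excluded by the stored pivot distances alone.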

4.4 Future Plans

According to the objectives specified in Section 4.2, I plan to concentrate on load balancing and to look for suitable techniques to achieve a balanced load of individual peers without any global control. Further, I would like to study replication and the influence of both load balancing and replication on the intra-query and inter-query parallelism of query processing.

References

[1] Karl Aberer. P-Grid: A self-organizing access structure for P2P information systems. Lecture Notes in Computer Science, 2172:179-??, 2001.

[2] Karl Aberer. Efficient Search in Unbalanced, Randomized Peer-To-Peer Search Trees. Technical report, Ecole Polytechnique Fédérale de Lausanne, 2002. Karl Aberer, EPFL Technical Report IC/2002/79.

[3] Karl Aberer, Philippe Cudré-Mauroux, Anwitaman Datta, Zoran Despotovic, Manfred Hauswirth, Magdalena Punceva, and Roman Schmidt. P-grid: a self-organizing structured p2p system. SIGMOD Rec., 32(3):29-33, 2003.

[4] James Aspnes, Jonathan Kirsch, and Arvind Krishnamurthy. Load balancing and locality in range-queriable data structures. In PODC '04: Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing, pages 115-124, New York, NY, USA, 2004. ACM Press.

[5] James Aspnes and Gauri Shah. Skip graphs. In Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 384-393, January 2003.

[6] Franz Aurenhammer. Voronoi diagrams - a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR 1991), 23(3):345-405, 1991.

[7] Farnoush Banaei-Kashani and Cyrus Shahabi. SWAM: A family of access methods for similarity-search in peer-to-peer data networks. In CIKM '04: Proceedings of the Thirteenth ACM conference on Information and knowledge management, pages 304-313. ACM Press, 2004.

[8] Michal Batko, Claudio Gennaro, and Pavel Zezula. A scalable nearest neighbor search in p2p systems. In DBISP2P, volume 3367 of Lecture Notes in Computer Science, pages 79-92, 2004.

[9] Michal Batko, Claudio Gennaro, and Pavel Zezula. Similarity grid for searching in metric spaces. In DELOS Workshop: Digital Library Architectures, Lecture Notes in Computer Science, volume 3664/2005, pages 25-44, 2005.

[10] Michal Batko, David Novák, Fabrizio Falchi, and Pavel Zezula. On scalability of the similarity search in the world of peers. Submitted for publication.

[11] Rudolf Bayer and Edward M. McCreight. Organization and maintenance of large ordered indexes. In Record of the 1970 ACM SIGFIDET Workshop on Data Description and Access, November 15-16, 1970, Rice University, Houston, Texas, USA (Second Edition with an Appendix), pages 107-141. ACM, 1970.

[12] Ashwin R. Bharambe, Mukesh Agrawal, and Srinivasan Seshan. Mercury: supporting scalable multi-attribute range queries. SIGCOMM Comput. Commun. Rev., 34(4):353-366, 2004.

[13] Christian Böhm, Gerald Klump, and Hans-Peter Kriegel. Xz-ordering: A space-filling curve for objects with spatial extension. In SSD '99: Proceedings of the 6th International Symposium on Advances in Spatial Databases, pages 75-90, London, UK, 1999. Springer-Verlag.

[14] Tolga Bozkaya and Z. Meral Ozsoyoglu. Distance-based indexing for high-dimensional metric spaces. In Joan Peckham, editor, Proceedings of the ACM International Conference on Management of Data (SIGMOD 1997), Tucson, Arizona, USA, May 13-15, 1997, pages 357-368. ACM Press, 1997.

[15] Min Cai, Martin Frank, Jinbo Chen, and Pedro Szekely. Maan: A multi-attribute addressable network for grid information services. In GRID '03: Proceedings of the Fourth International Workshop on Grid Computing, page 184, Washington, DC, USA, 2003. IEEE Computer Society.

[16] Edgar Chavez, Gonzalo Navarro, Ricardo Baeza-Yates, and Jose Luis Marroquin. Searching in metric spaces. ACM Comput. Surv., 33(3):273-321, 2001.

[17] Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 426-435. Morgan Kaufmann, 1997.

[18] Adina Crainiceanu, Prakash Linga, Johannes Gehrke, and Jayavel Shanmugasundaram. Querying peer-to-peer networks using p-trees. In WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases, pages 25-30, New York, NY, USA, 2004. ACM Press.

[19] Anwitaman Datta, Manfred Hauswirth, Renault John, Roman Schmidt, and Karl Aberer. Range queries in trie-structured overlays. In The Fifth IEEE International Conference on Peer-to-Peer Computing, 2005.

[20] Vlastislav Dohnal, Claudio Gennaro, Pasquale Savino, and Pavel Zezula. D-index: Distance searching index for metric data sets. Multimedia Tools Appl., 21(1):9-33, 2003.

[21] F. Falchi, C. Gennaro, and P. Zezula. A content-addressable network for similarity search in metric spaces. In Proceedings of the 2nd International Workshop on Databases, Information Systems and Peer-to-Peer Computing, Trondheim, Norway, pages 126-137, August 2005.

[22] Prasanna Ganesan, Mayank Bawa, and Hector Garcia-Molina. Online balancing of range-partitioned data with applications to peer-to-peer systems. Technical report, Stanford U., 2004.

[23] Prasanna Ganesan, Beverly Yang, and Hector Garcia-Molina. One torus to rule them all: multi-dimensional queries in p2p systems. In WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases, pages 19-24, New York, NY, USA, 2004. ACM Press.

[24] Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD international conference on Management of data, pages 47-57. ACM Press, 1984.

[25] Nicholas Harvey, Michael B. Jones, Stefan Saroiu, Marvin Theimer, and Alec Wolman. Skipnet: A scalable overlay network with practical locality properties. In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems (USITS '03), Seattle, WA, March 2003.

[26] Gisli R. Hjaltason and Hanan Samet. Index-driven similarity search in metric spaces. ACM Trans. Database Syst., 28(4):517-580, 2003.

[27] H. V. Jagadish. Linear clustering of objects with multiple attributes. In SIGMOD '90: Proceedings of the 1990 ACM SIGMOD international conference on Management of data, pages 332-342, New York, NY, USA, 1990. ACM Press.

[28] H. V. Jagadish, Beng Chin Ooi, Kian-Lee Tan, Cui Yu, and Rui Zhang. iDistance: An adaptive b+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst., 30(2):364-397, 2005.

[29] David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In STOC '97: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 654-663. ACM Press, 1997.

[30] V. I. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8-17, 1965.

[31] Witold Litwin, Marie-Anna Neimat, and Donovan A. Schneider. LH* — a scalable, distributed data structure. ACM Transactions on Database Systems, 21(4):480-525, 1996.

[32] David Novák and Pavel Zezula. M-chord: A scalable distributed similarity search structure. Submitted for publication.

[33] David Novák and Pavel Zezula. Indexing the distance using chord: A distributed similarity search structure. In Proceeding of the 8th International Workshop of the DELOS Network of Excellence on Digital Libraries, pages 94-108, 2005.

[34] J. A. Orenstein and T. H. Merrett. A class of data structures for associative searching. In PODS '84: Proceedings of the 3rd ACM SIGACT-SIGMOD symposium on Principles of database systems, pages 181-190, New York, NY, USA, 1984. ACM Press.

[35] William Pugh. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM, 33(6):668-676, 1990.

[36] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Schenker. A scalable content-addressable network. In Proceedings of the 2001 conference on applications, technologies, architectures, and protocols for computer communications, pages 161-172. ACM Press, 2001.

[37] Cristina Schmidt and Manish Parashar. Flexible information discovery in decentralized distributed systems. In HPDC '03: Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC'03), page 226, Washington, DC, USA, 2003. IEEE Computer Society.

[38] Thomas Seidl and Hans-Peter Kriegel. Efficient user-adaptable similarity search in large multimedia databases. In The VLDB Journal, pages 506-515, 1997.

[39] Yanfeng Shu, Kian-Lee Tan, and Aoying Zhou. Adapting the content native space for load balanced indexing. In Wee Siong Ng, Beng Chin Ooi, Aris M. Ouksel, and Claudio Sartori, editors, DBISP2P, volume 3367 of Lecture Notes in Computer Science, pages 122-135. Springer, 2004.

[40] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of ACM SIGCOMM, pages 149-160. ACM Press, 2001.

[41] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks, 2002.

[42] Egemen Tanin, Aaron Harwood, and Hanan Samet. A distributed quadtree index for peer-to-peer settings. In ICDE, pages 254-255. IEEE Computer Society, 2005.

[43] Egemen Tanin, Deepa Nayar, and Hanan Samet. An efficient nearest neighbor algorithm for p2p settings. In Proceedings of the 2005 national conference on Digital government research, pages 21-28. Digital Government Research Center, 2005.

[44] Jeffrey K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4):175-179, 1991.

[45] Peter N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the 4th Annual ACM Symposium on Discrete Algorithms (SODA 1993), Austin, Texas, USA, January 25-27, 1993, pages 311-321. ACM Press, 1993.

[46] Peter N. Yianilos. Excluded middle vantage point forests for nearest neighbor search. In Proceedings of the 6th DIMACS Implementation Challenge: Near Neighbor Searches (ALENEX 1999), Baltimore, Maryland, USA, January 15-16, 1999, 1999.

[47] Cui Yu, Beng Chin Ooi, Kian-Lee Tan, and H. V. Jagadish. Indexing the distance: An efficient method to KNN processing. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 421-430. Morgan Kaufmann, 2001.

[48] Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. Similarity Search: The Metric Space Approach, volume 32 of Advances in Database Systems. Springer-Verlag, 2006.

[49] Pavel Zezula, Vlastislav Dohnal, and David Novák. Global Data Management, chapter Towards Scalability of Similarity Searching. IOS Press, 2006.

Current Results of Study

In the past three semesters of my doctoral study at the Faculty of Informatics, Masaryk University, I mainly focused on distributed self-organizing structures based on the peer-to-peer paradigm and, especially, on techniques for similarity searching in parallel and distributed environments. In the first months, I became familiar with the principles of indexing based on the metric space model and with the current state of the field of peer-to-peer systems.

Soon, I started to work on the design of M-Chord, a peer-to-peer system for similarity searching in metric spaces that is based on transforming the issue of similarity search into searching in a one-dimensional domain. I created a prototype implementation, and the first results were presented at the Seminar of Informatics and at the 8th International Workshop of the DELOS Network of Excellence on Digital Libraries, Dagstuhl, Germany, 2005 [33].

Later on, I continued working on the structure design and implementation, which resulted in a presentation at the MEMICS 2005 workshop, Znojmo, October 2005, and in a paper currently submitted for publication [32]. I have also worked on experiments comparing the performance of the M-Chord structure with the other existing distributed structures for similarity searching in metric spaces; this work resulted in a paper submitted for publication [10].

In Autumn 2005, I made a three-week working visit to the Max-Planck Institut für Informatik, Saarbrücken, Germany, invited by Prof. Gerhard Weikum, director of the Databases and Information Systems Group. During the visit, I succeeded in linking the Minerva system, a distributed data-retrieval system developed by this research group, with M-Chord and in creating a prototype of a system called Minerva-Similar, which provides data retrieval with similarity based on the edit distance.

At the end of the Autumn semester 2005/06, I cooperated on the chapter Towards Scalability of Similarity Searching in the book Global Data Management [49]. During my study, I have cooperated mainly with my supervisor Prof. Pavel Zezula, with my colleagues Michal Batko and Vlastislav Dohnal, and with Fabrizio Falchi from ISTI-CNR, Pisa, Italy. I participate in the doctoral grant Integrated approach to education of PhD students in the area of parallel and distributed systems.
