M-Tree: an Efficient Accessmethod for Similarity Search in Metric Spaces
Total Page:16
File Type:pdf, Size:1020Kb
M-tree: An Efficient Access Method for Similarity Search in Metric Spaces Pa010 Ciaccia Marco Patella Pave1 Zezula DEIS - CSITE-CNR DEIS - CSITE-CNR CNUCE-CNR Bologna, Italy Bologna, Italy Piss, Italy [email protected] mpatella@deis . unibo . it [email protected] increased and resulted in the development of multi- media database systems aiming at a uniform manage- Abstract ment of voice, video, image, text, and numerical data. Among the many research challenges which the multi- A new access method, called M-tree, is pro- media technology entails - including data placement, posed to organize and search large data sets presentation, synchronization, etc. - content-based re- from a generic “metric space”, i.e. where ob- trieval plays a dominant role. In order to satisfy the ject proximity is only defined by a distance information needs of users, it is of vital importance to function satisfying the positivity, symmetry, effectively and efficiently support the retrieval process and triangle inequality postulates. We detail devised to determine which portions of the database algorithms for insertion of objects and split are relevant to users’ requests. management, which keep the M-tree always In particular, there is an urgent need of index- balanced - several heuristic split alternatives ing techniques able to support execution of similarity are considered and experimentally evaluated. queries. Since multimedia applications typically re- Algorithms for similarity (range and k-nearest quire complex distance functions to quantify similari- neighbors) queries are also described. Re- ties of multi-dimensional features, such as shape, tex- sults from extensive experimentation with a ture, color [FEF+94], image patterns [VM95], sound prototype system are reported, considering as [WBKW96], text, fuzzy values, set values [HNP95], se- the performance criteria the number of page quence data [AFS93, FRM94], etc., multi-dimensional I/O’s and the number of distance computa- (spatial) access methods (SAMs), such as R-tree tions. The results demonstrate that the M- [Gut841 and its variants [SRF87, BKSSSO], have been tree indeed extends the domain of applica- considered to index such data. However, the applica- bility beyond the traditional vector spaces, bility of SAMs is limited by the following assumptions performs reasonably well in high-dimensional which such structures rely on: data spaces, and scales well in case of growing files. 1. objects are, for indexing purposes, to be repre- sented by means of feature values in a multi- 1 Introduction dimensional vector space; Recently, the need to manage various types of data 2. the (dis)similarity of any two objects has to be stored in large computer repositories has drastically based on a distance function which does not in- troduce any correlation (or “cross-talk”) between Permission to copy without fee all or part of this material is feature values [FEF+94]. More precisely, an L, granted provided that the copies are not made or distributed for metric, such as the Euclidean distance, has to be direct commercial advantage, the VLDB copyright notice and used. the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee Furthermore, from a performance point of view, SAMs and/or special permission from the Endowment. assume that comparison of keys (feature values) is a Proceedings of the 23rd VLDB Conference trivial operation with respect to the cost of accessing a Athens, Greece, 1997 disk page, which is not always the case in multimedia 426 applications. Consequently, no attempt in the design cific metric distance function d. of these structures has been done to reduce the number Formally, a metric space is a pair, M = (D,d), of distance computations. where V is a domain of feature values - the indexing A more general approach to the “similarity index- keys - and d is a total (distance) function with the ing” problem has gained some popularity in recent following properties:l years, leading to the development of so-called metric 1. d(O,, 0,) = d(C),, 0,) (symmetry) trees (see [UhlSl]). M et ric trees only consider relative distances of objects (rather than their absolute posi- 2. d(O,, 0,) > 0 (0, # 0,) and d(O,, 0,) = 0 tions in a multi-dimensional space) to organize and (non negativity) partition the search space, and just require that the function used to measure the distance (dissimilarity) 3. d(O,, 0,) 5 4% 0,) + d(Oz,O,) between objects is a metric (see Section 2), so that the (triangle inequality) triangle inequality property applies and can be used to In principle, there are two basic types of similarity prune the search space. queries: the range query and the k nearest neighbors Although the effectiveness of metric trees has been query. clearly demonstrated [Chi94, Bri95, B097], current de- signs suffer from being intrinsically static, which limits Definition 2.1 (Range) their applicability in dynamic database environments. Given a query object Q E 2) and a maximum search Contrary to SAMs, known metric trees have only tried distance r(Q), the range query range(Q, r(Q)) selects to reduce the number of distance computations re- all indexed objects Oj such that d(Oj, Q) 5 r(Q). q quired to answer a query, paying no attention to I/O Definition 2.2 (k nearest neighbors (LNN)) costs. Given a query object Q E 2) and an integer k 2 1, In this article, we introduce a paged metric tree, the k-NN query NN(&, k) selects the k indexed objects called M-tree, which has been explicitly designed to which have the shortest distance from Q. 0 be integrated with other access methods in database systems. We demonstrate such possibility by imple- There have already been some attempts to tackle the menting M-tree in the GiST (Generalized Search Tree) difficult metric space indexing problem. The FastMap [HNP95] framework, which allows specific access meth- algorithm [FL951 transforms a matrix of pairwise dis- ods to be added to an extensible database system. tances into a set of low-dimensional points, which can The M-tree is a balanced tree, able to deal with then be indexed by a SAM. However, FastMap as- dynamic data files, and as such it does not require sumes a static data set and introduces approximation periodical reorganizations. M-tree can index objects errors in the mapping process. The Vantage Point using features compared by distance functions which (VP) tree [Chi94] partitions a data set according to either do not fit into a vector space or do not use an L, distances the objects have with respect to a reference metric, thus considerably extends the cases for which (vantage) point. The median value of such distances efficient query processing is possible. Since the design is used as a separator to partition objects into two of M-tree is inspired by both principles of metric trees balanced subsets, to which the same procedure can re- and database access methods, performance optimiza- cursively be applied. The MVP-tree [B097] extends tion concerns both CPU (distance computations) and this idea by using multiple vantage points, and exploits I/O costs. pre-computed distances to reduce the number of dis- After providing some preliminary background in tance computations at query time. The GNAT de- Section 2, Section 3 introduces the basic M-tree prin- sign [Bri95] applies a different - so-called generalized ciples and algorithms. In Section 4, we discuss the hyperplane [Uhlgl] - partitioning style. In the basic available alternatives for implementing the split strat- case, two reference objects are chosen and each of the egy used to manage node overflows. Section 5 presents remaining objects is assigned to the closest reference experimental results. Section 6 concludes and suggests object. So obtained subsets can recursively be split topics for future research activity. again, when necessary. Since all of the above organizations build trees by 2 Preliminaries means of a top-down recursive process, these trees are not guaranteed to remain balanced in case of inser- Indexing a metric space means to provide an efficient tions and deletions, and require costly reorganizations support for answering similarity queries, i.e. queries to prevent performance degradation. whose purpose is to retrieve DB objects which are ‘In order to simplify the presentation, we sometimes refer to “similar” to a reference (query) object, and where the 0, as an object, rather than as the feature value of the object (dis)similarity between objects is measured by a spe- itself. 427 3 The M-tree file.3 In summary, entries in leaf nodes are structured as follows. The research challenge which has led to the design of M-tree was to combine advantages of balanced and Oj (feature value of the) DB object dynamic SAMs, with the capabilities of static metric oid(Oj) object identifier trees to index objects using features and distance func- d(Oj, P(Oj)) distance of Oj from its parent tions which do not fit into a vector space and are only constrained by the metric postulates. 3.2 Processing Similarity Queries The M-tree partitions objects on the basis of their Before presenting specific algorithms for building the relative distances, as measured by a specific distance M-tree, we show how the information stored in nodes function d, and stores these objects into fixed-size is used for processing similarity queries. Although per- nodes,2 which correspond to constrained regions of the formance of search algorithms is largely influenced by metric space. The M-tree is fully parametric on the the actual construction of the M-tree , the correctness distance function d, so the function implementation is and the logic of search are independent of such aspects.