
Similarity Search and Mining: Techniques Supporting Next Decade's Applications

Christian Böhm
Unit for Database Systems, University for and Technology
Innrain 98, 6020 Innsbruck, Austria
[email protected] − www.ChrBoehm.de
Phone & Fax: +49-700-24726346 (+49-700-ChrBoehm)

Abstract

Similarity Search and Data Mining have become widespread problems of modern database applications involving complex objects such as Multimedia, CAD, Molecular Biology, Sequence Analysis, etc. Search problems in such applications are rarely based on exact matches but rather on some application specific notion of similarity. A common approach to grasp the intuitive idea of similarity by a formal means is to translate complex objects into multidimensional vectors by a feature transformation, which allows not only retrieval of the most similar objects to a given query object (similarity search) but also analysis of the complete set of complex objects with respect to clusters, correlations etc. (data mining). In this contribution we identify several areas of applications where the classical feature approach is not sufficient. Example applications include Biometric Identification, Medical Imaging, Electronic Commerce and Share Price Analysis. We show that existing feature based similarity models fail for different reasons, e.g. because they do not cope with the uncertainty which is inherent to their feature vectors (biometric identification), or because they do not integrate application specific methods into the similarity model (share price analysis, medical imaging). We survey the challenges and possible solutions to these problems to direct future research.

1. Introduction

Information is the master key to economic success and influence in the contemporary information society. "Only those who can apply the newest information for their product development are able to survive in the global competition" [Sch 95]. Crucial for the applicability of information is its quality and its fast availability. What is lacking most, however, is not the access to information resources but rather the facility to effectively and efficiently search for the required information. To cope with the information overkill will be the central competence of the next decade. Therefore, research on information systems will be one of the most important domains of computer science. To have connection with the worldwide leading group in this area will not only be of vital interest for academic scholars but even for the complete worldwide economy, in all branches of traditional and "new" economy.

If the structure of the information to be searched is sufficiently simple, such as one-dimensional numerical attributes or character strings, search problems can be considered as solved. Database management systems (DBMS) provide index structures for the management of such data [BM 77, Com 79] which are well understood and widely applied. Requirements of traditional applications such as accounting and billing are perfectly met by a commercial DBMS. Therefore, the information infrastructure of most enterprises is based on products such as Oracle or Informix.

Recently, an increasing number of applications has emerged processing large amounts of complex, application-specific data objects [Jag 91, GM 93, FBF+ 94, FRM 94, ALSS 95, KSF+ 96]. In application domains such as multimedia, medical imaging, molecular biology, computer aided design, marketing and purchasing assistance, etc., a high efficiency of query processing is crucial due to the immense and even increasing size of current databases. The search in such databases, called non-standard databases, is rarely based on an exact match of objects. Instead, the search is often based on some notion of similarity which is specific to the application.

For applications which do not only support transaction oriented search operations but also high-level decision making, it is necessary not only to search for objects which are similar to a given query object but rather to analyze the data set as a whole. Information which is interesting in the process of decision making includes common patterns, classifications of data, knowledge about clusters (collections of similar objects), and, as the opposite of a cluster, exceptional data (outliers), which can be, for instance, an indicator for the misuse of a system. This kind of information is commonly referred to as knowledge, and the process of deriving such higher-level information from low-level transactional data is called data mining or (in the presence of a vast amount of data) knowledge discovery in databases (KDD). Because such applications on top of modern databases are also often based on some notion of similarity (or, equivalently, on the notion of data density), they also depend on similarity search. The difference to traditional similarity search applications is, however, that these applications do not raise only a few single similarity queries but rather a high number of such queries.

Our main focus is on efficiency. Many problems of similarity search and data mining have been basically solved, and algorithms have been proposed that produce useful results. A general problem, however, are the vast amounts of data of today's applications, and a processing time that makes the algorithms inoperative. Our general motivation is to change such algorithms, basing them on powerful database primitive operations such that they scale well to large data sets, potentially not fitting into main memory. Thus, the database system becomes a powerful toolset to support the next decade's applications.

A typical approach to handle complex objects of modern applications such as CAD, multimedia, etc. is the feature transformation. This transformation extracts from the objects a number of characterizing properties and translates them into vectors of a multidimensional space. The specific property of a feature transformation is that the similarity of the objects corresponds to a small distance (in most cases measured according to the Euclidean norm) of the associated feature vectors. Therefore, similarity search systems and data mining algorithms rely on distance based queries (similarity queries) on the feature vectors (cf. figure 1).

Figure 1. Basic Idea of Feature-based Similarity Search (complex objects, e.g. a Dow Jones time sequence, are mapped by a feature transformation into high-dimensional vectors, which are inserted into and queried from a multidimensional index supporting ε- and NN-search)

The remainder of this paper is organized as follows: Section 2 describes previous work in the areas of similarity search and data mining. Section 3 indicates promising directions of future research, and section 4 summarizes and concludes the paper.

2. Previous Work

Our previous work primarily concentrates on indexing of feature transformed data from modern database applications such as CAD, multimedia, molecular biology, time sequences etc., and on advanced query processing techniques for applications of similarity search and data mining.

2.1 Similarity Search

Index Structures

First we considered single similarity queries and their support by multidimensional index structures. Starting from existing approaches for high-dimensional spaces such as the X-tree [BKK 96], the TV-tree [LJF 95], or the SS-tree [WJ 96], our observation was that for sufficiently high dimensions, the complexity of similarity queries is still far away from being logarithmic. Moreover, simple query processing techniques based on the sequential scan [WSB 98] of the data set are often able to outperform approaches based on sophisticated index structures. The deterioration of performance for increasing dimensions is often called the curse of dimensionality. For previous approaches to indexing high-dimensional data spaces, cf. also our survey [BBK 02].

Our primary intention was to develop index structures that solve the problems of the existing approaches and clearly outperform them. This could be achieved with the pyramid technique [BBK 98b] for a limited kind of queries. Another important approach was the parallelization of index structures for high-dimensional data spaces. A declustering technique for distributed and parallel environments [BBB+ 97] gained the best paper award 1997 of the ACM SIGMOD Int. Conf. on Management of Data.

A few techniques primarily concentrating on the efficient construction of multidimensional index structures are proposed in [BBK 98a] and [BK 99]. Problems of the integration of multidimensional index structures into object-relational environments were considered in [BBKM 99] and [BBKM 00], and, with the focus on objects with a spatial extension, which are more prevalent in geographical databases, in [BKK 99].

Cost Modelling and Optimization

As it was not possible to develop index structures which are not subject to the curse of dimensionality and which yield a uniformly good performance over all dimensions, we analyzed index based query processing from a theoretical point of view and proposed a cost model [BBKK 97], which was later significantly extended [Böh 00]. The central notion of the cost model is the access probability of an index page for an arbitrary query. The mathematical concept to capture this probability is the Minkowski sum (cf. figure 2), a concept primarily used in robot motion planning which was introduced in this work for the first time to cost modelling. Together with techniques for estimating the extension of index pages and queries in the data space, this concept could be used to accurately estimate the disk access and CPU cost of similarity queries under Euclidean and maximum metrics in low-dimensional data spaces. Two extensions accounting for boundary effects and non-linear correlations (described by fractal concepts) enabled our cost model also to accurately predict the performance of index structures constructed upon high-dimensional spaces.
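The central quantity of the cost model, the access probability of a page, can be illustrated with a small sketch. Under the maximum metric, the Minkowski sum of a page region and the query ball is simply the page's bounding box enlarged by ε on every side, so its volume, clipped to the unit data space, is easy to compute. This is only a toy version of the model: it assumes query points uniformly distributed in [0,1]^d and ignores the Euclidean case, boundary effects, and the fractal extensions; the function name is ours.

```python
def access_probability_maxmetric(mbr_lo, mbr_hi, eps):
    """Probability that a range query of radius eps (maximum metric)
    accesses the page with bounding box [mbr_lo, mbr_hi], assuming
    query points uniformly distributed in the unit cube [0,1]^d.
    The Minkowski sum of a box and an L-infinity ball of radius eps
    is the box enlarged by eps on every side; its volume, clipped to
    the unit cube, is the access probability."""
    volume = 1.0
    for lo, hi in zip(mbr_lo, mbr_hi):
        side = min(hi + eps, 1.0) - max(lo - eps, 0.0)
        volume *= max(side, 0.0)
    return volume

# a 2d page covering [0.2, 0.4] x [0.2, 0.4], query radius 0.1:
p = access_probability_maxmetric((0.2, 0.2), (0.4, 0.4), 0.1)
print(round(p, 3))  # -> 0.16  (the volume of an enlarged 0.4 x 0.4 box)
```

For small ε the enlarged box, and hence the estimated probability, shrinks toward the page's own volume; as ε grows, the probability saturates at 1 and every page is accessed, which is one way to see why high-dimensional indexes degenerate toward the sequential scan.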

Figure 2. The Minkowski Sum

Figure 3. The Fast Index Scan of the IQ-Tree (a directory over blocks b1-b13; runs of adjacent blocks are read in chained scans, interrupted by jumps)

We extensively used the cost model to optimize high-dimensional index structures. The most limiting drawback of previous index structures for similarity search was the possibility of being outperformed by the simplest query processing technique, the sequential scan, which considers all data objects. In [BK 00], we identified the problem that usual index structures access data in too small portions. This problem could be eliminated by a suitable optimization of the logical block size of the index structure. As the optimal block size depends on characteristics of the data set which are not known in advance and which are subject to change over time, we proposed an index structure which adapts the block size dynamically during update operations. The result was an index structure which is always guaranteed to outperform both previous approaches, the classical (unoptimized) index and the sequential scan. For the extreme situation of a very high data space dimension (100d), the performance of the optimized index converges to that of the sequential scan. In the other extreme case of a quite low data space dimension (2d, 3d), the performance converges to that of a common index. For the very interesting case of dimensions in a middle range (5d-20d), the optimal block size is also in a middle range (between 64 KBytes and 1 MByte) and the resulting index outperforms both previous approaches by high factors.

A similar kind of optimization was proposed in [BBJ+ 00], where not the size of the stored blocks was optimized but the number and order of blocks which are processed in a single chained I/O operation (cf. figure 3). In the extreme cases, this fast index scan converges in a similar way as above either to classical processing of small units as in traditional indexes (if the data space dimension is small) or to the sequential scan (if the data space dimension is large). The size of the units to be processed is also optimized by a probability model which is derived from our cost model [BBKK 97].

In our tree striping technique [BBK+ 00], our cost model was also applied. It was shown that for high-dimensional query processing it is not optimal to use a single, high-dimensional index structure. Instead, data vectors can be cut into smaller pieces which can be stored in separate indexes. The extreme case of storing each attribute in a single, one-dimensional index is called the inverted list approach. We have shown, however, that the inverted list approach suffers from similar performance problems as high-dimensional indexes. Often, though, there exists an optimum between the two extreme cases, inverted lists and a single high-dimensional index, which is exploited by our tree striping technique: we construct a limited number of indexes of mid-range dimensionality and, again, outperform the classical approaches. The optimal number and dimensionality of the indexes can be determined by a modification of our cost model.

Our cost model also influenced other research groups in their work. Weber, Schek and Blott compared various query processing approaches in a performance study [WSB 98] and used our results to prove the general limitation of all indexing approaches for sufficiently high dimensional data spaces. As a consequence, they proposed a query processing technique based on the sequential scan which applies lossy compression techniques to the feature vectors. In [BBJ+ 00], we extended their work such that a similar data compression technique can be integrated into index based query processing. As the choice of an optimal compression rate was out of the scope of [WSB 98], we applied a modification of our cost model for this purpose. Our results were also transferred into the research domain of metric index structures, for which similar cost models have been devised [CPZ 98] and applied for optimization [TTSF 00]. Papadopoulos and Manolopoulos [PM 97] used our results for a novel declustering technique in parallel environments. Several papers such as [RGF 98] or [AGGR 98] apply the cost model in the area of data mining. Our cost model has been extensively cited by internationally recognized research groups.

2.2 Data Mining

Data mining algorithms in feature databases typically rely heavily on repeated similarity queries. Therefore, all results described in section 2.1 can also be successfully applied to speed up such data mining algorithms. However, the fact that data mining algorithms analyze the whole data set and, therefore, raise a high number of similarity queries can be exploited in particular. In [BBBK 00], we analyzed a high number of data mining algorithms, particularly the density based clustering algorithm DBSCAN [SEKX 98] and the hierarchical method OPTICS [ABKS 99], which raise a similarity query for each object stored in the database.
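The per-object range queries raised by such algorithms can be batched into a single set-oriented operation. A deliberately naive nested-loop sketch of the ε-distance join (the published algorithms replace the quadratic loop pair by index-, hash- or sort-based processing, but produce the identical result set; the data and names below are illustrative):

```python
def similarity_join(P, Q, eps):
    """Naive nested-loop evaluation of the epsilon-join: all pairs
    (p, q) with Euclidean distance ||p - q|| <= eps, i.e. the same
    result as SELECT * FROM P, Q WHERE ||P.vector - Q.vector|| <= eps."""
    result = []
    for p in P:
        for q in Q:
            dist = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            if dist <= eps:
                result.append((p, q))
    return result

P = [(0.0, 0.0), (1.0, 1.0)]
Q = [(0.1, 0.0), (5.0, 5.0)]
print(similarity_join(P, Q, eps=0.2))  # -> [((0.0, 0.0), (0.1, 0.0))]
```

Evaluating the whole batch as one join rather than as n independent range queries is what opens the door to the set-oriented optimizations (indexing, hashing, sorting) discussed in this section.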

We showed that such algorithms can also be based on top of a database primitive called the similarity join. The similarity join P ⋈ε Q of two finite sets P = {p1, ..., pn} and Q = {q1, ..., qm} with a given distance parameter ε is formally defined as the set (cf. figure 4)

    P ⋈ε Q := { (pi, qj) | || pi − qj || ≤ ε }

The similarity join can also be expressed in an SQL-like fashion:

    SELECT * FROM P, Q WHERE || P.vector − Q.vector || ≤ ε

Figure 4. The Similarity Join (points of P and points of Q paired within distance ε)

We have shown that these algorithms, and a number of further data mining tasks such as outlier detection, simultaneous classification, and some other special data mining techniques, can be transformed such that the result is guaranteed to be equivalent to that of the original algorithm. Our transformation, however, allows high performance improvements. It is possible to use the similarity join as a black box, i.e. the details of the actual algorithm which implements this basic operation need not be known to the implementor of the data mining algorithm. It is possible to select a suitable algorithm from a collection of different algorithms according to the performance requirements and the characteristics of the data set, and depending on which indexes are preconstructed.

Many algorithms for the similarity join are known, partly operating on index structures such as R-trees [BKS 93, HJR 97], or based on hashing [LR 96, PD 96] or sorting [SSA 97]. For a complete overview of previous and new techniques cf. our tutorial [Böh 01] and a survey which is currently in progress. Motivated by our applications, we have recently published original work in the area of similarity join algorithms. An algorithm based on a new kind of sort order, the epsilon-grid-order [BBKK 01], has been accepted at the ACM SIGMOD Int. Conf. on Management of Data. Several further articles regard cost modelling and optimization of the similarity join: In [BK 01a], we propose a cost model for the index based processing of the similarity join. The cost model accurately estimates the probability with which a pair of pages must be formed in the joining process (the mating probability). Starting from that point, we optimize the page capacity of an index for join processing. Our work here reveals a serious optimization conflict between CPU processing (where typically small pages with up to 20 points capacity are optimal) and disk I/O processing (where some 10,000 points are typically optimal). To solve this conflict, we propose a two-tier index architecture where large pages, which are subject to I/O, accommodate a secondary search structure which is optimized for the CPU operations.

In [BKK 01], we propose a generic technique for improving the CPU performance of similarity join algorithms. The underlying similarity join algorithm can be based on an index structure or on some hashing or sorting schema. The general idea is to avoid and/or accelerate the distance computations among feature vectors by dimension sweeping, an idea which is similar to the plane sweep paradigm of geometric algorithms. As the order of the dimensions which are subject to this sweeping technique is also optimized, we call our generic technique optimal dimension sweeping. In [BBKS 00], we proposed scheduling algorithms for a different kind of similarity join operation which is not based on a fixed distance predicate, but combines each of the points from one data set with one or more of its nearest neighbors from the other data set.

3. Future Research Plan

Besides complementing and even strengthening our efforts in the successful areas of database primitives for similarity search and data mining, we have identified several research directions into which we plan to extend our future work. This includes opening new, innovative application domains with challenging research potential, a general framework for the development of similarity search systems, and database technology centered research.

3.1 New Data Mining Tasks

Due to the complex analysis of the complete data set, data mining algorithms are often of a much higher computational complexity than traditional database applications. This has mainly prevented data mining tasks from being strongly integrated into the database environment. With our method of identifying very powerful database primitives such as the similarity join (or, as another example, the convex hull operation, cf. [BK 01b]), data mining algorithms may become standard database applications like others. The consequence is a much tighter integration of data mining into the information infrastructure of an enterprise, which yields many advantages.

Due to the dramatic increase of performance by our approaches, it will be possible to implement quite new kinds of data mining algorithms which detect new kinds of patterns. An interesting new challenge is subspace clustering [AGGR 98]. Typically, not all attributes of feature vectors carry information which is useful in data mining. Other attributes may be noisy and should be ignored as they deteriorate the data mining result. Identifying the relevant attributes, however, is a difficult task. Subspace clustering combines the two tasks of selecting attributes and finding clusters.
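As a deliberately brute-force toy (not the algorithm of [AGGR 98]; the data, the threshold, and the tie-breaking rule are our own), the combined task can be pictured as searching all attribute subsets for the projection with the most ε-close point pairs:

```python
from itertools import combinations

def pairs_within_eps(data, dims, eps):
    """Count point pairs whose projection onto the given dims lies
    within L-infinity distance eps -- a crude density signal."""
    count = 0
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if all(abs(data[i][d] - data[j][d]) <= eps for d in dims):
                count += 1
    return count

def densest_subspace(data, eps):
    """Toy subspace clustering: exhaustively score every attribute
    subset, preferring higher-dimensional subspaces on ties."""
    d = len(data[0])
    subspaces = [dims for k in range(1, d + 1)
                 for dims in combinations(range(d), k)]
    return max(subspaces,
               key=lambda dims: (pairs_within_eps(data, dims, eps), len(dims)))

# the third attribute is pure noise; the cluster lives in dims (0, 1)
data = [(0.0, 0.0, 0.9), (0.1, 0.1, 0.3), (0.0, 0.1, 0.6), (0.1, 0.0, 0.1)]
print(densest_subspace(data, eps=0.15))  # -> (0, 1)
```

The exponential number of subsets is exactly why such first algorithms are expensive, and why pushing the inner pair counting down to a database primitive such as the similarity join is attractive.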

Subspaces, i.e. groups of attributes, are determined such that maximal, distinguishable clusters can be found. The first algorithms, however, suffer from high computational cost. Basing them on top of powerful database primitives could open the potential to make this computation feasible.

Another approach could be to make current data mining much more interactive. The current process is to select parameters, to run a data mining algorithm and, finally, to visualize the result of the algorithm. Our dramatic performance gains could open the potential to make this process so fast that a user may change parameters and immediately see the resulting changes in the visualization. Here, it could be beneficial to apply new concepts in database systems which evaluate queries approximately [CP 00] or produce first results in an early stage of processing.

3.2 New Application Domains

We have identified three areas of new applications which have only superficially been considered as database applications, in spite of vast data amounts and clear relations to similarity search and data mining.

Electronic Commerce

Many stages in electronic commerce require concepts from similarity search and data mining. In the early stage, it is essential to perform a customer segmentation, a typical data mining application, to make directed offers to which the customers are maximally responsive.

In the core area of e-commerce, booking and sales systems, users specify their needs in an inexact way. For instance, they have initial ideas about the features their product should have and the corresponding price. Then, in an interactive process, the system has to find out which of the initial features are how relevant to the specific customer and will in this way find a product which best fits the user's notions.

After commitment of the trade, the final stage is marketing for additional products. A good (human) salesman develops a sense of what additional high-revenue products could be of interest for the customer, based on his experience with previous customers purchasing similar products. This behavior could also be imitated using concepts of similarity search and data mining.

For the above mentioned applications, it is necessary to extend known concepts and to develop new ones. Classical similarity search takes the basic assumption that the similarity measure is a parameter given by the user. Therefore, weights for the individual features are assumed to be known. Here, we are rather facing the situation that the measures are initially completely unknown and develop during the selection process. Instead of assuming a uniform importance of the features and ranking the products according to the Euclidean distance, the user should be provided with a selection of products that reveals different weightings of the features. A selection with varying weights of features essentially corresponds to the convex hull of a subset of the data [BK 01b]. The products which are further investigated by the customers can be used for relevance feedback, to determine a suitable similarity measure. A first approach to use relevance feedback for this purpose is the MindReader [ISF 98], which determines a quadratic form distance measure [BKS 01]. For electronic commerce, we identify two additional requirements. First, relevance feedback should be extended to a multimodal model to take into account that users in general do not only like one single "ideal" product but often have a few alternatives in mind which are not clearly separated in their notion. The second requirement is a seamless integration of the concepts of similarity search, convex hull, and relevance feedback.

Biometry Databases

Biometry applications store human data such as features from face images, fingerprints, hand geometry, the retina, or even voice and handwriting for identification and authentication purposes. For many applications, a high number of feature vectors is stored and, due to the inexactness of the measuring devices, similarity search is needed for the identification.

In contrast to traditional similarity search, the uncertainty of the individual features is not uniform among all features, and even for a single feature, the uncertainty is not uniform among all stored vectors. Instead, each feature of each feature vector is associated with an individual uncertainty which is stored in the database. With this concept, it is possible to capture problems introduced by different environments and technical devices. The uncertainty of facial features such as the eye distance, for instance, depends on the angle between camera and person, and also on the illumination. The error can be assumed to be taken from a Gaussian distribution, so the uncertainty is measured in terms of a standard deviation.

The challenge here is to develop specialized index structures to store feature vectors with individual uncertainty vectors, and query processing algorithms that facilitate a fast and efficient evaluation of queries such as:

• determine all persons that match the query person with a probability of at least 10%
• determine the person that matches the query person with maximum probability.
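One possible formalization of such probabilistic queries (our illustrative model, not necessarily the one intended above): treat each stored feature as a Gaussian with its individually stored standard deviation, and define the match probability as the product of the per-feature probabilities that the stored feature agrees with the query value up to a tolerance. The independence assumption across features, the tolerance, and all data values are assumptions of this sketch.

```python
import math

def feature_match_prob(mu, sigma, q, tol):
    """P(|X - q| <= tol) for X ~ N(mu, sigma): the probability that a
    stored feature with individual uncertainty sigma agrees with the
    query value q up to the tolerance tol."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return cdf(q + tol) - cdf(q - tol)

def match_prob(person, query, tol):
    """Overall match probability under a (simplifying) independence
    assumption: the product of the per-feature probabilities."""
    p = 1.0
    for (mu, sigma), q in zip(person, query):
        p *= feature_match_prob(mu, sigma, q, tol)
    return p

# each person: a list of (feature value, individual uncertainty) pairs
db = {
    "anna": [(10.0, 0.5), (4.0, 0.2)],
    "bert": [(12.0, 2.0), (4.1, 1.0)],
}
query = [10.1, 4.0]
probs = {name: match_prob(f, query, tol=1.0) for name, f in db.items()}

print([n for n, p in probs.items() if p >= 0.10])  # persons above the 10% threshold
print(max(probs, key=probs.get))                   # the maximum-probability person
```

A linear scan like this answers both query types; the research challenge stated above is to answer them without inspecting every stored vector, via index structures that are aware of the per-feature uncertainties.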

Figure 5. Chart Analysis (formations such as ascending triangle, trend channel, and double bottom)

Technical Analysis of Share Prices

One of the classical applications of similarity search and data mining is clearly the analysis of time sequences [ALSS 95] such as share price analysis. Various similarity measures have been proposed. For practical analysis, however, quite different concepts are used, such as indicators, i.e. mathematical formulas derived from the time sequence that generate trading signals (buy, sell). Another concept for the analysis of a time sequence is chart analysis (cf. figure 5), which detects typical formations in the share price which are known to indicate a certain direction of the future price. Examples are triangles, trend channels, lines of support and resistance, W-formations (double bottom), head-and-shoulder formations etc.

For effectively supporting users in their share price analysis, the requirement is to integrate both indicators as well as formation analysis into search systems and into sequence mining algorithms. Suitable index structures and query processing techniques must be developed to facilitate a fast and efficient analysis.

3.3 A Framework for the Development of Similarity Search Systems

The problem of similarity search should also be considered in a more general way. Currently, similarity search methods are tailored to specific application domains, and only very basic techniques such as nearest neighbor search solve general problems that arise in virtually all similarity search systems. The main difficulty in the development of similarity measures is the communication between domain experts and similarity experts, as similarity search involves a deep knowledge of the scientific concepts of the domain. Vice versa, domain experts can hardly imagine what a similarity search system may achieve and what concepts must be applied for this purpose.

Our idea is to alleviate this problem by a common framework that bundles concepts which are often applied in similarity search into a toolbox. This toolbox could contain various methods of feature extraction such as histograms, Fourier transformation, and moment invariants, and various search methods such as similarity search, query decomposition for making the search robust, search for partial similarity, etc. This toolbox could be complemented with visualization systems, evaluation methods and the above mentioned data mining techniques such as subspace clustering, convex hull and MindReader, which may be used to determine whether the resulting feature transformation is adequate.

4. Conclusion

In this paper, we have described our previous work in the research domains of similarity search and data mining. The focus of our work was to develop general purpose index structures and query processing techniques, such as the similarity join, that allow a tight integration of these applications into a DBMS centered information infrastructure. Since most of the applications are data intensive, the efficiency of the corresponding database primitives is of highest importance.

In our future research directions, we have identified several key fields of research such as new data mining requirements, novel applications such as e-commerce, biometry, and the analysis of time sequences. A particular challenge of the future could be the development of a common framework for similarity measures.

References

[ABKS 99] Ankerst M., Breunig M. M., Kriegel H.-P., Sander J.: OPTICS: Ordering Points To Identify the Clustering Structure, ACM SIGMOD Int. Conf. on Management of Data, 1999.
[AGGR 98] Agrawal R., Gehrke J., Gunopulos D., Raghavan P.: Automatic Subspace Clustering of High-Dimensional Data for Data Mining Applications, ACM SIGMOD Int. Conf. on Management of Data, 1998, pp. 94-105.
[ALSS 95] Agrawal R., Lin K., Sawhney H., Shim K.: Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases, Proc. 21st Int. Conf. on Very Large Databases (VLDB), 1995, pp. 490-501.
[BBB+ 97] Berchtold S., Böhm C., Braunmüller B., Keim D. A., Kriegel H.-P.: Fast Parallel Similarity Search in Multimedia Databases, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997, SIGMOD Best Paper Award.
[BBBK 00] Böhm C., Braunmüller B., Breunig M. M., Kriegel H.-P.: Fast Clustering Based on High-Dimensional Similarity Joins, Int. Conf. on Information and Knowledge Management (CIKM), 2000.
[BBK 98a] Berchtold S., Böhm C., Kriegel H.-P.: Improving the Query Performance of High-Dimensional Index Structures Using Bulk-Load Operations, 6th Int. Conf. on Extending Database Technology (EDBT), 1998.
[BBK 98b] Berchtold S., Böhm C., Kriegel H.-P.: The Pyramid-Technique: Towards Indexing beyond the Curse of Dimensionality, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998, pp. 142-153.
[BBK 02] Böhm C., Berchtold S., Keim D.: Searching in High-dimensional Spaces: Index Structures for Improving the Performance of Multimedia Databases, ACM Computing Surveys, 2002.
[BBK+ 00] Berchtold S., Böhm C., Keim D., Kriegel H.-P., Xu X.: Optimal Multidimensional Query Processing Using Tree Striping, Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK), 2000.

[BBKK 97] Berchtold S., Böhm C., Keim D., Kriegel H.-P.: A Cost Model for Nearest Neighbor Search in High-Dimensional Data Space, ACM PODS Symposium on Principles of Database Systems, 1997.
[BBKK 01] Böhm C., Braunmüller B., Krebs F., Kriegel H.-P.: Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data, ACM SIGMOD Int. Conf. on Management of Data, 2001.
[BBKM 99] Berchtold S., Böhm C., Kriegel H.-P., Michel U.: Implementation of Multidimensional Index Structures for Knowledge Discovery in Relational Databases, Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK), 1999.
[BBKM 00] Böhm C., Berchtold S., Kriegel H.-P., Michel U.: Multidimensional Index Structures in Relational Databases, Journal of Intelligent Information Systems (JIIS) 15(1), 2000.
[BBKS 00] Böhm C., Braunmüller B., Kriegel H.-P., Schubert M.: Efficient Similarity Search in Digital Libraries, IEEE Int. Conf. on Advances in Digital Libraries (ADL), 2000.
[BBJ+ 00] Berchtold S., Böhm C., Jagadish H. V., Kriegel H.-P., Sander J.: Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces, Proc. 16th Int. Conf. on Data Engineering, 2000.
[BK 99] Böhm C., Kriegel H.-P.: Efficient Construction of Large High-Dimensional Indexes, Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK), 1999.
[BK 00] Böhm C., Kriegel H.-P.: Dynamically Optimizing High-Dimensional Index Structures, Proc. Int. Conf. on Extending Database Technology (EDBT), 2000.
[BK 01a] Böhm C., Kriegel H.-P.: A Cost Model and Index Architecture for the Similarity Join, IEEE Int. Conf. on Data Engineering, 2001.
[BK 01b] Böhm C., Kriegel H.-P.: Determining the Convex Hull in Large Multidimensional Databases, Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK), 2001.
[BKK 96] Berchtold S., Keim D., Kriegel H.-P.: The X-tree: An Index Structure for High-Dimensional Data, Proc. 22nd Int. Conf. on Very Large Data Bases (VLDB), Mumbai, India, Morgan Kaufmann, 1996, pp. 28-39.
[BKK 99] Böhm C., Klump G., Kriegel H.-P.: XZ-Ordering: A Space-Filling Curve for Objects with Spatial Extension, 6th Int. Symposium on Large Spatial Databases (SSD), 1999.
[BKK 01] Böhm C., Krebs F., Kriegel H.-P.: Optimal Dimension Order: A Generic Technique for the Similarity Join, Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK), 2002.
[BKS 93] Brinkhoff T., Kriegel H.-P., Seeger B.: Efficient Processing of Spatial Joins Using R-trees, ACM SIGMOD Int. Conf. on Management of Data, 1993.
[BKS 01] Böhm C., Seidl T., Kriegel H.-P.: Combining Approximation Techniques and Vector Quantization for Adaptable Similarity Search, Journal on Intelligent Information Systems (JIIS), 19(2), 2002.
[BM 77] Bayer R., McCreight E. M.: Organization and Maintenance of Large Ordered Indices, Acta Informatica 1(3), 1977, pp. 173-189.
[Böh 00] Böhm C.: A Cost Model for Query Processing in High-Dimensional Data Spaces, ACM Transactions on Database Systems (TODS) 25(2), 2000.
[Böh 01] Böhm C.: The Similarity Join - a Powerful Database Primitive for High Performance Data Mining, tutorial, IEEE Int. Conf. on Data Engineering, 2001.
[Com 79] Comer D.: The Ubiquitous B-tree, ACM Computing Surveys 11(2), 1979, pp. 121-138.
[CP 00] Ciaccia P., Patella M.: PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces, Int. Conf. on Data Engineering, 2000, pp. 244-255.
[CPZ 98] Ciaccia P., Patella M., Zezula P.: A Cost Model for Similarity Queries in Metric Spaces, ACM Symp. on Principles of Database Systems (PODS), 1998, pp. 59-68.
[FBF+ 94] Faloutsos C., Barber R., Flickner M., Hafner J., Niblack W., Petkovic D., Equitz W.: Efficient and Effective Querying by Image Content, Journal of Intelligent Information Systems, Vol. 3, 1994, pp. 231-262.
[FRM 94] Faloutsos C., Ranganathan M., Manolopoulos Y.: Fast Subsequence Matching in Time-Series Databases, Proc. ACM Int. Conf. on Management of Data (SIGMOD), 1994, pp. 419-429.
[GM 93] Gary J. E., Mehrotra R.: Similar Shape Retrieval Using a Structural Feature Index, Information Systems, Vol. 18, No. 7, 1993, pp. 525-537.
[HJR 97] Huang Y.-W., Jing N., Rundensteiner E. A.: Spatial Joins Using R-trees: Breadth-First Traversal with Global Optimizations, Int. Conf. on Very Large Databases (VLDB), 1997.
[ISF 98] Ishikawa Y., Subramanya R., Faloutsos C.: MindReader: Querying Databases Through Multiple Examples, Proc. 24th Int. Conf. on Very Large Databases (VLDB), 1998, pp. 218-227.
[Jag 91] Jagadish H. V.: A Retrieval Technique for Similar Shapes, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1991, pp. 208-217.
[KSF+ 96] Korn F., Sidiropoulos N., Faloutsos C., Siegel E., Protopapas Z.: Fast Nearest Neighbor Search in Medical Image Databases, Proc. 22nd VLDB Conference, 1996, pp. 215-226.
[LJF 95] Lin K., Jagadish H. V., Faloutsos C.: The TV-Tree: An Index Structure for High-Dimensional Data, VLDB Journal, Vol. 3, 1995, pp. 517-542.
[LR 96] Lo M.-L., Ravishankar C. V.: Spatial Hash Joins, ACM SIGMOD Int. Conf. on Management of Data, 1996.
[PD 96] Patel J. M., DeWitt D. J.: Partition Based Spatial-Merge Join, ACM SIGMOD Int. Conf. on Management of Data, 1996.
[PM 97] Papadopoulos A., Manolopoulos Y.: Performance of Nearest Neighbor Queries in R-Trees, 6th Int. Conf. on Database Theory, LNCS Vol. 1186, 1997, pp. 394-408.
[RGF 98] Riedel E., Gibson G., Faloutsos C.: Active Storage for Large Scale Data Mining and Multimedia, Int. Conf. on Very Large Databases (VLDB), 1998.
[Sch 95] Schiele O. H.: Forschung und Entwicklung im Maschinenbau auf dem Weg in die Informationsgesellschaft (in German, translation by the author), Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie, 1995, http://www.iid.de/informationen/vdma/infoway3.html.
[SEKX 98] Sander J., Ester M., Kriegel H.-P., Xu X.: Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications, Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 2(2), 1998.
[SSA 97] Shim K., Srikant R., Agrawal R.: High-Dimensional Similarity Joins, Int. Conf. on Data Engineering (ICDE), 1997.
[TTSF 00] Traina C. Jr., Traina A., Seeger B., Faloutsos C.: Slim-Trees: High Performance Metric Trees Minimizing Overlap Between Nodes, Int. Conf. on Extending Database Technology (EDBT), 2000, pp. 51-65.
[WJ 96] White D. A., Jain R.: Similarity Indexing with the SS-tree, Proc. 12th Int. Conf. on Data Engineering, 1996.
[WSB 98] Weber R., Schek H.-J., Blott S.: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, Proc. 24th Int. Conf. on Very Large Databases (VLDB), 1998, pp. 194-205.
