
A revised version of this report appeared in GeoInformatica, Int. Journal on Advances of Computer Science for Geographic Information Systems, 2(2): 113-147, 1998.

Approximation-Based Similarity Search for 3-D Surface Segments¹

HANS-PETER KRIEGEL AND THOMAS SEIDL
University of Munich, Institute for Computer Science, Oettingenstr. 67, D-80538 München, Germany
Contact: [email protected], phone +49-89-2178-2191, fax +49-89-2178-2192

¹ This research was funded by the German Ministry for Education, Science, Research and Technology (BMBF) under grant no. 01 IB 307 B. The authors are responsible for the content of this paper.

Figure 1. Similarity search in a database of 3-D surface segments.

Abstract. The issue of finding similar 3-D surface segments arises in many recent applications of spatial database systems, such as molecular biology, medical imaging, CAD, and geographic information systems. Surface segments that are similar in shape to a given query segment are to be retrieved from the database. The two main questions are how to define shape similarity and how to efficiently execute similarity search queries. We propose a new similarity model based on shape approximation by multi-parametric surface functions that are adaptable to specific application domains. We then define the shape similarity of two 3-D surface segments in terms of their mutual approximation errors. Applying the multi-step query processing paradigm, we propose algorithms to efficiently support complex similarity search queries in large spatial databases. A new query type, called the ellipsoid query, is utilized in the filter step. Ellipsoid queries, being specified by quadratic forms, represent a general concept for similarity search. Our major contribution is the introduction of efficient algorithms to perform ellipsoid queries on multidimensional index structures. Experimental results on a large 3-D protein database containing 94,000 surface segments demonstrate the successful application and the high performance of our method.

Keywords: Approximation-based similarity search, multi-step similarity query processing, ellipsoid queries on multidimensional index structures, 3-D spatial database systems

1. Introduction

Currently, more and more applications managing spatial objects become involved with the problem of efficient similarity search in large databases. The applications of retrieving similar surface segments range from molecular biology and medical imaging to geographic information systems (GIS) and CAD databases containing mechanical parts. The following examples illustrate typical requirements and potential benefits of shape-oriented similarity search:

Molecular Biology. A challenging problem in molecular biology is the prediction of protein-protein interactions (the molecular docking problem): Which proteins from the database form a stable complex with a given query protein? It is well known that docking partners are recognized by complementary surface regions [MWS 96]. In many cases, the active sites of the proteins, i.e. the docking regions, are known and can be extracted from the protein surface and subsequently be stored in a database [SK 95]. Thus, the problem of finding docking partners is reduced to finding similar (complementary) surface segments from a large database of segments matching a given query segment.

Medical Imaging. Modern medical imaging technology such as CT or MRI produces descriptions of 3-D objects like organs or tumors by a set of 2-D images. These images represent slices through the object, from which the 3-D shape can be reconstructed. A method for retrieving similar surface segments can help to discover correlations between shape deformations of organs and certain diseases.

Geographic Information Systems. Regions with a similar topography, or hills that have a similar shape as a given example, are of great interest for geographers and the mining industry, for example. A modern GIS will benefit users by supporting shape similarity search for 3-D surface segments.

Further application fields include CAD and mechanical engineering. In order to meet the specific requirements of these application domains, our method supports invariance against translation and rotation, because the position and orientation of an object in 3-D space does not affect its shape similarity. Since the number of objects in a spatial database typically is very large, efficient query processing is important and is supported by our algorithms. Our method requires the surface segments to be given as sets of points, which can be obtained from common surface representations. Figure 1 illustrates the problem of similarity search for 3-D surface segments: Given a query segment query, retrieve all segments from the database DB of 3-D surface segments that are similar to query.

The paper is organized as follows: In the remainder of this introduction, we sketch some related work as well as the basic idea of our approach. In Section 2, we provide the background for shape approximation of 3-D segments by multi-parametric surface functions. Our novel shape similarity model for 3-D surface segments is defined in Section 3, along with an experimental evaluation of similarity results. Starting with Section 4, we turn to efficient query processing and provide a framework for multi-step similarity query processing. We derive a lower-bounding filter distance function that guarantees no false dismissals. This function corresponds to an ellipsoid query, which represents a new and general query type for spatial database systems. In Section 5 we introduce a new algorithm for efficiently performing ellipsoid queries on multidimensional index structures. Section 6 shows the performance results for our similarity search system, and Section 7 concludes the paper.

1.1. Related Work

In recent years, considerable work has been done on similarity search in database systems. Most of the previous approaches deal with one- or two-dimensional data, such as time series, digital images or polygonal data. However, they do not manage three-dimensional objects.

Agrawal et al. present a method for similarity search in a sequence database of one-dimensional data [AFS 93]. The sequences are mapped onto points of a low-dimensional feature space using a Discrete Fourier Transform. A Point Access Method (PAM) is then used for efficient retrieval of similar sequences. This technique was later generalized for subsequence matching in [FRM 94], and for searching in the presence of noise, scaling, and translation in [ALSS 95]. Nevertheless, it remains restricted to one-dimensional sequence data.

Jagadish proposes a technique for the retrieval of similar shapes in two dimensions [Jag 91]. He derives an appropriate object description from a rectilinear cover of an object, i.e. a cover consisting of rectilinear rectangles. The rectangles belonging to a single object are sorted by size, and the largest ones serve as retrieval key for the shape of the object. Normalization is used to achieve invariance with respect to scaling and translation. Though this method can be generalized to three dimensions by using covers of hyperrectangles, it has not been evaluated for real-world 3-D data and, furthermore, does not achieve rotational invariance.

Mehrotra and Gary suggest the use of boundary features for the retrieval of shapes [MG 93] [GM 93]. A 2-D shape is represented by an ordered set of surface points, and fixed-sized subsets of this representation are extracted as shape features. All of these features are mapped to points in a multidimensional space which are stored using a PAM. This method provides translational, rotational and scaling invariance. It can handle partially occluded objects, but is limited to two dimensions.

Previous work on retrieving similar 2-D shapes from a CAD database system is presented in [BKK 97] and [BK 97]. This technique applies the Fourier Transformation in order to obtain a shape encoding for retrieving similar sections of polygon contours. The polygon sections are stored as extended multidimensional feature objects, and a Spatial Access Method (SAM) is used for efficient retrieval [BKK 96] [Ber+ 97]. This approach is also limited to two-dimensional objects.

Korn et al. propose a method for searching similar tumor shapes in a medical image database [Kor+ 96]. Even though the solution seems to be easily extensible to 3-D, the authors consider only 2-D images. The proposed similarity measure is volume-based, while surface segments do not enclose a volume but piece by piece model the boundary of a solid object. Therefore the concept cannot be adapted to 3-D surface segments.

The QBIC (Querying By Image Content) system [Nib+ 93] [Fal+ 94] contains a component for 2-D shape retrieval where shapes are given as sets of points. The method is based on algebraic moment invariants and is also applicable to 3-D objects [TC 91]. Nevertheless, the adaptability of the method to specific application domains is restricted. Appropriate moment invariants have to be selected from a set of feasible ones. In this approach, the moment invariants that have to be chosen are abstract quantities, while in our approach, the approximation models that have to be chosen may be graphically visualized as 3-D surfaces, thus providing an early impression of their suitability for the given application domain.

1.2. 3-D Surface Segments

We assume that the 3-D segments are provided as sets of surface points. Since sets of surface points can be obtained from common representations of 3-D objects such as raster representations or vector representations, this assumption does not restrict generality. Figure 2 shows an example from molecular biology. A point set is computed for the surface of every molecule. The surface is then decomposed into (not necessarily disjoint) segments. The resulting set of segments should include all docking sites at which the interaction with partner molecules takes place.

Figure 2. a) 3-D spatial objects such as molecules or mechanical parts; b) surface representation by points; c) surface segments as subsets of the surface points.

Several techniques are available for the segmentation of molecular surfaces. They are adapted from image and signal processing, or from clustering techniques in spatial databases. Two different classes may be distinguished:

Guided segmentation. If typical shapes or locations of docking sites on molecular surfaces are known, the segmentation technique may be provided with appropriate heuristics to determine potential docking segments. Such a guided algorithm, while returning a small number of segments, has a low probability of dismissing or splitting actual docking sites. The reliability of the algorithm, however, critically depends on the quality of the underlying heuristics. It is non-trivial to provide well-suited heuristics without deep insight into the characteristics of molecular docking processes [Mei+ 95] [MPSS 97].

Naive segmentation. If no information on how to extract typical docking sites from molecular surfaces is available, a brute-force segmentation has to be applied. A lot of segments are produced for each object. The more segments we produce, the higher the probability that no actual docking sites are missed. The drawback as compared to the guided segmentation approach is that the system has to manage considerably more 3-D segments.

Domain experts have to decide which technique to use and, in the case of guided segmentation, which heuristics to employ. Our approach for similarity search does not depend on the segmentation method. In the following, we assume that appropriate segments are already available and that they fulfill the requirements of the application, e.g. molecular docking prediction, or similarity search in medical image or CAD databases.

1.3. Basic Idea of Approximation-based Shape Similarity

For a similarity search of 3-D surface segments as sketched above, the system has to be provided with an appropriate similarity model, in our case a distance function for 3-D segments. The similarity of 3-D segments depends on their shape and on their extension in the 3-D space. In order to compare the shape of two segments s and q, we use multi-parametric surface functions as approximations app_s and app_q. The shape distance is defined in terms of the mutual approximation error. Figure 3 depicts two segments s and q, their approximations app_s and app_q, and illustrates the mutual comparison of the segments with the approximation of the partner segment. As we can see, the approximation error is a measure for how well or how badly the segment q is approximated by the approximation app_s of segment s, and vice versa.

Figure 3. Mutual approximation of two 3-D surface segments s and q.

The extension distance is the Euclidean distance of the 3-D extension vectors obtained from the principal moments of inertia. In the subsequent sections, we present a formal introduction and an experimental evaluation of this new model.

2. Approximation of 3-D Segments

A main concept of our new approach is the approximation of 3-D surface segments in order to provide comparable representations of shapes. We present a generic method based on modelling 3-D shapes by a multi-parametric surface function. We call this function the approximation model. The method is adapted to specific applications by choosing the model (i.e. the function). As already mentioned, the similarity of 3-D segments is measured by their mutual approximation error (and their extensions). The better the chosen multi-parametric surface function fits the characteristics of the application, the more powerful is the distance function in distinguishing between shapes that differ only slightly.

2.1. Approximation Models

The basic component of any approximation technique is the approximation model. We use surface functions since they fit the two-dimensional character of the 3-D surface segments. Whereas any multi-parametric two-dimensional surface function f: ℜ² → ℜ can be employed as an approximation model, we focus on a particular class of functions for which efficient algorithms to compute the approximation of a 3-D segment are available. The class is characterized by the following definition.

Definition 1 (Surface approximation model): The class of multi-parametric two-dimensional surface functions f_app: ℜ² → ℜ is called a d-dimensional surface approximation model if it is the scalar product of a vector app = (a_1, ..., a_d) ∈ ℜ^d of d approximation parameters and a vector (f_1, ..., f_d) of d two-dimensional base functions f_i: ℜ² → ℜ:

f_app(x, y) = a_1·f_1(x, y) + ... + a_d·f_d(x, y) = (a_1, ..., a_d) · (f_1, ..., f_d)(x, y)

As we can see, surface approximation models are linear combinations of the base functions. The base functions themselves, however, may be as simple or complex as is useful for the particular application. Examples of multi-parametric surface functions which we used in our experiments are paraboloids and trigonometric polynomials of various degrees. For example, figure 4 depicts the graphs of the surface functions PARAB-2, TRIGO-4, TRIGO-8, and TRIGO-12, and table 1 provides the respective formulas for the 2-, 4-, 8-, and 12-dimensional approximation models.

Figure 4. Four approximation models: PARAB-2, TRIGO-4, TRIGO-8, and TRIGO-12.

model name    formula of approximation model
PARAB-2       (a_1, a_2) · (x², y²) = a_1·x² + a_2·y²
TRIGO-4       (a_1, a_2, a_3, a_4) · (cos x, sin x, cos y, sin y)
TRIGO-8       (a_1, ..., a_8) · (cos x, sin x, cos y, sin y, cos 2x, sin 2x, cos 2y, sin 2y)
TRIGO-12      (a_1, ..., a_12) · (cos x, sin x, cos y, sin y, ..., cos 3x, sin 3x, cos 3y, sin 3y)

Table 1. A sample of approximation models of various dimensionalities

2.2. Approximation of a 3-D Segment

The notion by which we relate 3-D surface segments and multi-parametric approximation models is the approximation error. For any arbitrary 3-D surface segment s and any instance app of approximation parameters, the approximation error indicates the deviation of the surface function f_app from the points of the segment s:

Definition 2 (Approximation error): Let the 3-D surface segment s be represented by a set of n surface points. Given an approximation model f and a vector app of approximation parameters, the approximation error of app and s is defined as

d_s²(app) = (1/n) · Σ_{p ∈ s} ( f_app(p_x, p_y) – p_z )²

Given this definition, from all possible choices we select the parameter vector app_s which yields the minimum approximation error for a given 3-D segment s:

Definition 3 (Approximation of a segment): Given an approximation model f and a 3-D surface segment s, the (unique) approximation of s is given by the parameter set app_s for which the approximation error is minimum:

app_s is the approximation of s  ⇔  ∀ app: d_s²(app) ≥ d_s²(app_s)

Figure 5 illustrates the approximation of a surface segment. The approximation app_s of s is required to be unique. Theoretically, it is possible that the approximation parameters vary without affecting the approximation error (in which case app_s would not be well defined). This indicates that the approximation model has been chosen inappropriately for the application domain and has to be changed. Our algorithm detects this situation and notifies the user. Note that in all our experiments this situation never occurred.

Figure 5. A 3-D surface segment s and its approximation app_s.

In general, even the minimum approximation error d_s²(app_s) will be greater than zero. In order to obtain a similarity function that characterizes the similarity of an object to itself by the value zero, we introduce the relative approximation error.

Definition 4 (Relative approximation error): Given an approximation model f, a 3-D surface segment s, and an arbitrary vector app' of approximation parameters, the relative approximation error ∆d_s²(app') of app' and s is defined to be

∆d_s²(app') = d_s²(app') – d_s²(app_s)

Figure 6 shows a given 3-D surface segment s being compared to various approximation parameter vectors. The (unique) approximation app_s is closest to the original surface points and may be used as a more or less coarse representation of the shape of s, whereas the other surface functions do not fit the shape of the segment s very well.

Figure 6. Various approximation candidates of a 3-D surface segment.

For later use, we focus on two immediate implications of this definition: First, the relative approximation error never evaluates to a negative value; second, it reaches zero for the (unique) approximation of a segment:

Lemma 1. (i) For any 3-D surface segment s and any approximation parameter set app', the relative approximation error is non-negative: ∆d_s²(app') ≥ 0. (ii) The relative approximation error reaches zero; in particular, ∆d_s²(app_s) = 0 for all segments s.

Proof. (i) By the definition of app_s, the inequality d_s²(app') ≥ d_s²(app_s) holds for all parameter sets app'. Therefore ∆d_s²(app') ≥ 0 for all app'. (ii) For app' = app_s we have ∆d_s²(app_s) = d_s²(app_s) – d_s²(app_s) = 0. ◊

As a final observation, consider figure 7. Two different segments s and q may share the same approximation app_s = app_q. Consequently, they cannot be distinguished by a simple comparison of their approximation parameters. The approximation error, however, provides additional information, and the segments may be discriminated if they differ in their approximation errors.

Figure 7. Two different 3-D segments s and q that share the same approximation. Possibly, s and q may be distinguished by their approximation errors.

If too many 3-D segments share the same approximation or even the same approximation error for a particular application, it is recommended to modify the approximation model, since it does not reflect the differences between the shapes very well. Another parametric surface function may be better suited to describe the variety of shapes that occur in the application. Table 2 summarizes the notions, symbols, and definitions that we have introduced so far.

description                     symbol          definition
approximation model             f_app(x, y)     Σ_{i=1..d} a_i·f_i(x, y) = (a_1, ..., a_d) · (f_1(x, y), ..., f_d(x, y))
approximation error             d_s²(app)       (1/n) · Σ_{p ∈ s} ( f_app(p_x, p_y) – p_z )²
(unique) approximation of s     app_s           ∀ app: d_s²(app) ≥ d_s²(app_s)
relative approximation error    ∆d_s²(app')     d_s²(app') – d_s²(app_s)

Table 2. Symbols and definitions for the approximation of 3-D segments
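To make these definitions concrete, the following minimal C++ sketch (not part of the original paper; all type and function names are our own) evaluates a surface approximation model as the scalar product of definition 1 and computes the approximation error of definition 2 and the relative error of definition 4 for a segment given as surface points, using the TRIGO-4 base functions of table 1 as an example:

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// A surface point of a segment s = { (x, y, z) } and a base-function vector (f_1, ..., f_d).
struct Point3 { double x, y, z; };
using BaseFunctions = std::vector<std::function<double(double, double)>>;

// Definition 1: f_app(x, y) = a_1*f_1(x, y) + ... + a_d*f_d(x, y)
double evalModel(const std::vector<double>& app, const BaseFunctions& f, double x, double y) {
    double value = 0.0;
    for (std::size_t i = 0; i < app.size(); ++i) value += app[i] * f[i](x, y);
    return value;
}

// Definition 2: d_s^2(app) = (1/n) * sum_{p in s} (f_app(p_x, p_y) - p_z)^2
double approximationError(const std::vector<double>& app, const BaseFunctions& f,
                          const std::vector<Point3>& segment) {
    double sum = 0.0;
    for (const Point3& p : segment) {
        double diff = evalModel(app, f, p.x, p.y) - p.z;
        sum += diff * diff;
    }
    return sum / segment.size();
}

// Definition 4: relative approximation error of app' with respect to the optimal parameters app_s.
double relativeApproximationError(const std::vector<double>& appPrime,
                                  const std::vector<double>& appS,
                                  const BaseFunctions& f, const std::vector<Point3>& segment) {
    return approximationError(appPrime, f, segment) - approximationError(appS, f, segment);
}

// Example base functions: the TRIGO-4 model of table 1, (a1, a2, a3, a4) . (cos x, sin x, cos y, sin y).
BaseFunctions trigo4() {
    return { [](double x, double)  { return std::cos(x); },
             [](double x, double)  { return std::sin(x); },
             [](double, double y)  { return std::cos(y); },
             [](double, double y)  { return std::sin(y); } };
}
```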

2.3. Computation by Singular Value Decomposition

For our approximation models, we restrict ourselves to the class of linear combinations of non-parameterized base functions as introduced in definition 1. According to definitions 2 and 3, finding an approximation is a least-squares minimization problem for which an efficient numerical computation method is required. For linearly parameterized functions in particular, it is recommended to perform least-squares approximation by Singular Value Decomposition (SVD) [PTVF 92].

Besides the d approximation parameters app_s = (a_1, ..., a_d), the SVD also returns a d-vector w_s of confidence or condition factors, and an orthogonal d×d-matrix V_s. Using V_s we can compute the relative approximation error for any approximation parameter vector app' with respect to the segment s. Let A_s = V_s·diag(w_s)²·V_s^T, and let us denote the rows of V_s by V_si. Now the error formula can be written as:

∆d_s²(app') = Σ_{i=1..d} ( w_si · (app' – app_s)·V_si )² = (app' – app_s) · A_s · (app' – app_s)^T

2.4. Normalization in the 3-D Space

In general, the points of a segment s are located anywhere in the 3-D space and are oriented arbitrarily. Since we are only interested in the shape of the segment s, but not in its location and orientation in the 3-D space, we transform s by a rigid 3-D transformation into a normalized representation. There are two ways to integrate normalization into our method: (1) Separate: First normalize the segment s, and then compute the approximation app_s by least-squares minimization. (2) Combined: Minimize the approximation error simultaneously over all the normalization and approximation parameters.

In our experiments, we used the combined normalization approach. For similarity search purposes, only the resulting approximation parameters are used. However, the normalization parameters may be required later for superimposing segments.

3. Shape Similarity of 3-D Segments

In this section, we introduce our new similarity model for 3-D surface segments. It is based on the shape approximation technique from the previous section. After an introduction of the similarity distance function, we illustrate the successful application of the model to the docking problem from molecular biology.

3.1. Approximation-based Similarity Distance Function

For 3-D surface segments, our shape similarity criterion considers two components: the extension of the segments in the 3-D space, and the shape of the segments in a narrower sense. We define the extension similarity as the Euclidean distance of the 3-D extension vectors ext_s and ext_q which we obtain from the principal moments of inertia. The shape component of the distance function is based on shape approximation, and we exploit more information than just the approximation parameters. This is recommended, as the confidence of the approximation parameters may vary substantially from one segment approximation to another. For this reason, we introduce the concept of mutual approximation errors. The basic question for shape similarity quantification is the following: How much does the approximation error increase if segment q is approximated by the approximation app_s of segment s instead of its own approximation app_q, and vice versa? (cf. figure 8). This approach leads us to the definition of approximation-based shape similarity.

Figure 8. Similarity quantification by mutual approximation and 3-D extension distance.

Definition 5 (Shape similarity): Let s and q be two 3-D surface segments, with ext_s and ext_q being their 3-D extension vectors in space. Define the mutual approximation distance as d_app²(s, q) := ½·∆d_s²(app_q) + ½·∆d_q²(app_s), and the 3-D extension distance as the squared Euclidean distance d_ext²(s, q) := (ext_s – ext_q)². With u_app and u_ext being non-negative weighting factors, the shape similarity function d_shape is now defined as:

d_shape(s, q) = √( u_app·d_app²(s, q) + u_ext·d_ext²(s, q) )
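As an illustration of how definition 5 can be evaluated, the following C++ sketch assumes that each segment stores its approximation parameters, the matrix A = V·diag(w)²·V^T delivered by the SVD of section 2.3, and its 3-D extension vector; the struct and function names are ours, not the paper's implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-segment data assumed to be precomputed: app_s, A_s = V_s * diag(w_s)^2 * V_s^T, ext_s.
struct SegmentKey {
    std::vector<double> app;                  // d approximation parameters
    std::vector<std::vector<double>> A;       // d x d positive definite error matrix
    double ext[3];                            // 3-D extension vector
};

// Relative approximation error (section 2.3): delta d_s^2(app') = (app' - app_s) * A_s * (app' - app_s)^T
double relativeError(const SegmentKey& s, const std::vector<double>& appPrime) {
    const std::size_t d = s.app.size();
    std::vector<double> diff(d);
    for (std::size_t i = 0; i < d; ++i) diff[i] = appPrime[i] - s.app[i];
    double sum = 0.0;
    for (std::size_t i = 0; i < d; ++i)
        for (std::size_t j = 0; j < d; ++j)
            sum += diff[i] * s.A[i][j] * diff[j];
    return sum;
}

// Definition 5: mutual approximation distance, extension distance, and final shape similarity.
double shapeSimilarity(const SegmentKey& s, const SegmentKey& q, double uApp, double uExt) {
    double dApp2 = 0.5 * relativeError(s, q.app) + 0.5 * relativeError(q, s.app);
    double dExt2 = 0.0;
    for (int i = 0; i < 3; ++i) {
        double e = s.ext[i] - q.ext[i];
        dExt2 += e * e;
    }
    return std::sqrt(uApp * dApp2 + uExt * dExt2);
}
```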

Additionally, for a segment s, we define the (d+3)-dimensional key vector key_s to be the concatenation of the d-dimensional vector app_s and the 3-dimensional vector ext_s: key_s = (app_s, ext_s). Since we have chosen to combine the squared distance functions d_app² and d_ext² by the root over their sum, d_shape is related to the Euclidean distance in the following way:

Lemma 2. Let s and q be two segments. If for both approximations, app_s and app_q, all the values w_si and w_qi are equal to 1, as well as the weighting factors u_app = u_ext = 1, the shape similarity d_shape(s, q) is equal to the Euclidean distance of the key vectors key_s and key_q.

Proof. Recall the error formula ∆d_s²(app') = Σ_i ( w_si · (app' – app_s)·V_si )², and assume all the w_si and w_qi to be equal to 1. Since the orthogonal matrices V_s and V_q represent pure base transformations without any scaling, we obtain ∆d_s²(app') = Σ_i ((app' – app_s)·V_si)² = ((app' – app_s)·V_s)² = (app' – app_s)², and analogously for ∆d_q². This implies d_shape(s, q)² = 1·d_app²(s, q) + 1·d_ext²(s, q) = ½·∆d_s²(app_q) + ½·∆d_q²(app_s) + d_ext²(s, q) = ½·(app_q – app_s)² + ½·(app_s – app_q)² + (ext_s – ext_q)² = (key_s – key_q)². ◊

In general, however, the approximation confidence values w_i will be different from 1, and similarity search methods based on the Euclidean distance in feature spaces (cf. [AFS 93] [BKK 97]) do not support shape similarity query processing immediately.

As for other similarity functions, a small value of d_shape indicates a high degree of similarity, whereas a large value of d_shape signals strong dissimilarity. In table 3, we summarize the definitions that we used for the introduction of our new 3-D shape similarity model.

description                   symbol            definition
key vector                    key_s             (app_s, ext_s) = (a_1, ..., a_d, e_1, e_2, e_3)
(simple) shape similarity     d_app²(s, q)      ½·∆d_s²(app_q) + ½·∆d_q²(app_s)
3-D extension similarity      d_ext²(s, q)      (ext_s – ext_q)²
(final) shape similarity      d_shape(s, q)     √( u_app·d_app²(s, q) + u_ext·d_ext²(s, q) ) with weighting factors u_app and u_ext

Table 3. Symbols and definitions for the shape similarity of 3-D surface segments

3.2. Sample Application

We successfully applied the similarity distance d_shape to the field of molecular biology, in particular, to the molecular recognition or docking problem. As a sample application, we present the search for similar docking sites in a database of some 6,200 docking segments. The protein data are available from the Brookhaven Protein Data Bank (PDB) [Ber+ 77], which currently provides the atomic coordinates of approx. 3,000 proteins. From the FSSP (Families of Structurally Similar Proteins) database [HS 94], we selected families of molecules that are similar in their sequence and, hence, are similar in their 3-D shape. Examples for our experimental evaluation are the azurin family (PDB code 1AZC-A) covering four proteins with a high structural similarity, and the fructose bisphosphatase family (PDB code 1FRP-A) with 18 members. Figure 9 depicts four members of the azurin family, in particular BA, BB, CA, and CB.

Figure 9. Four similar surface segments from azurin molecules.

We performed ranking queries on the database of 6,200 segments with the docking segment of each member of the azurin family, and compared the approximation models PARAB-2, TRIGO-4, TRIGO-8, and TRIGO-12 that have 2, 4, 8, and 12 parameters, leading to 5-, 7-, 11-, and 15-dimensional key vectors, respectively (cf. table 4).

model name    dimension d of the approximation model    dimension d+3 of the key vector
PARAB-2       2                                          5
TRIGO-4       4                                          7
TRIGO-8       8                                          11
TRIGO-12      12                                         15

Table 4. Dimensions of the key vectors for several approximation models

Figure 10 demonstrates how the docking segments of the four azurin molecules rank within the database of 6,200 segments according to the shape distance d_shape for the approximation models PARAB-2 and TRIGO-4. As expected, the similar azurin segments rank at top positions. Figure 11 includes the ranking for the models TRIGO-8 and TRIGO-12. For each of the four azurin molecules, all the 6,200 database objects were ranked according to their d_shape-distance to the query object. In particular, the positions of the azurin molecules were recorded. The diagrams summarize the result: Whereas the members of the family were desired to rank at the top positions 1 to 4 (indicated on the abscissa), the actual positions are only a little worse. The ordinate depicts the minimum, maximum, and average position that was achieved by ranking all 6,200 database objects. The experiments support the adequacy of the TRIGO-4 model for the docking application.

Figure 10. Ranking of four Azurin segments for the approximation models PARAB-2 and TRIGO-4 among 6,200 segments. As desired, the members of the Azurin family rank at top positions.

Figure 11. Ranking of azurin segments for various approximation models. The members of the azurin family rank at top positions, in particular below rank 25 out of 6,200 (abscissa: rank 1 to 4 within the members of the azurin family; ordinate: ranks within the overall database of 6,200 objects).

Additional experiments, e.g. for hemoglobin molecules or for trypsin inhibitors, show that most of the molecules that were ranked at top positions within the overall database according to d_shape also belong to the same family of molecules as the query object.

4. Efficient Similarity Query Processing

Due to the immense and even increasing size of current databases for molecular biology, medical imaging, and engineering applications, strong efficiency requirements have to be met. The performance of similarity query processing, however, is limited by the complexity of the similarity model. Since it has been successfully applied to complex spatial query processing, a multi-step query processing architecture is recommended in our case [OM 88] [BHKS 93]: Several filter and refinement steps produce and reduce candidate sets from the database, yielding an overall result that contains the correct answer and produces neither false positive nor false negative answers (no false hits and no false drops). For improved efficiency, filter steps are usually supported by multidimensional index structures [GG 97].

4.1. Multi-Step Similarity Query Processing

Two types of queries are highly relevant for similarity search: range queries and k-nearest neighbor queries. A range query is specified by a query object q and a query range ε, asking for the answer set that contains all the objects s from the database having a similarity distance of less than ε to the query object q. A k-nearest neighbor query for a query object q and a cardinal number k specifies the retrieval of those k objects from the database that are most similar to q.

Following the multi-step query processing paradigm, refinement steps discard false positive candidates (false hits), but are not able to reconstruct false negatives (false drops) that have been dismissed by the filter step. Therefore, a basic requirement for any filter step is to prevent false drops. The lower-bounding property has been identified as a fundamental criterion that guarantees no false drops [FRM 94] [Kor+ 96]. The filter step has to be provided with a filter distance function d_f that lower-bounds the exact object distance function d_o for all pairs of objects s and q:

d_f(s, q) ≤ d_o(s, q)

The efficiency of the algorithms depends on the selectivity and the query processing efficiency provided by the underlying multidimensional index structure. To adapt the mentioned algorithms to our new 3-D shape similarity model, it suffices to provide a filter function f(s, q) that fulfills both the correctness criterion (crucial) and the efficiency requirements (desired), and the following tasks remain to be done:
(1) Show the lower-bounding property, i.e. f(s, q) ≤ d_shape(s, q).
(2) Provide efficient algorithms to perform queries on the index that use the filter distance function f(s, q).

4.2. A Lower Bound for Shape Similarity

In order to apply the multi-step query processing paradigm to our new approximation-based shape similarity model, we derive an appropriate filter function for the shape similarity function d_shape. Besides the correctness, which is ensured by the lower-bounding criterion, we look for a solution that provides efficient support by a multidimensional access method. Although the X-tree has been developed as an index structure that efficiently manages feature spaces of dimension 10 to 20 or more [BKK 96], the lower the dimension, the better the performance.

Let us investigate the similarity function d_shape and its components with respect to the number of data values that are required for the evaluation, and let us assume a d-dimensional approximation model. Table 5 illustrates the situation: Observe that for the evaluation of the first term, ∆d_s²(app_q), d² + d data values are required for the segment s, since the formula contains the d-vector app_s as well as the d×d-matrix A_s. Concerning the query segment q, only the d-vector app_q is required. Conversely, the evaluation of the second term, ∆d_q²(app_s), requires d values for the vector app_s and d² + d values concerning the query segment q.

For the third component, i.e. the extension distance d_ext²(s, q) = (ext_s – ext_q)², only the 3-D extension vectors ext_s and ext_q have to be provided.

term    formula                                                     data of q    data of s
1       ∆d_s²(app_q) = (app_q – app_s) · A_s · (app_q – app_s)^T    d            d² + d
2       ∆d_q²(app_s) = (app_s – app_q) · A_q · (app_s – app_q)^T    d² + d       d
3       d_ext²(s, q) = (ext_s – ext_q)²                             3            3

Table 5. Number of data values required for the evaluation of the approximation-based shape similarity distance function d_shape(s, q)

In order to avoid an index dimension that is quadratic in the dimension of the approximation model, only the components 2 and 3 of d_shape (see table 5) shall be evaluated in the index-based filter step. Thus, the index has to manage only d + 3 values for each database segment s, since the quadratic component of ∆d_q²(app_s) belongs to the query segment q, and the evaluation of the quadratic term ∆d_s²(app_q) is deferred to the refinement step.

Now, for any given query object q, let us compose the positive definite (d+3)×(d+3)-matrix A'_q from the positive definite d×d-matrix A_q and the 3×3 unit matrix I_3. The positive weights u_app and u_ext are used as before to attach more or less importance to the similarity of shapes or to the similarity of extensions:

A'_q = ( ½·u_app·A_q       0         )
       ( 0                 u_ext·I_3 )

Recall the definition of the (d+3)-dimensional key vectors key_s = (app_s, ext_s), and let us use these key vectors for the definition of our filter distance function as follows:

Definition 6 (Filter distance function): Given two 3-D surface segments s and q and their key vectors key_s = (app_s, ext_s) and key_q = (app_q, ext_q) for a particular shape approximation model, the filter distance function f_q: ℜ^(d+3) → ℜ is defined by

f_q²(key_s) = (key_s – key_q) · A'_q · (key_s – key_q)^T

The following lemma states the lower-bounding property of f_q with respect to d_shape:

Lemma 3. Given a d-dimensional approximation model with two positive weighting factors u_app and u_ext. For all segments s and q, the filter distance function f_q(key_s) is a lower bound of the shape distance d_shape(s, q), i.e. f_q(key_s) ≤ d_shape(s, q).

Proof. Observe that f_q²(key_s) = ½·u_app·∆d_q²(app_s) + u_ext·d_ext²(s, q), and consider the difference d_shape²(s, q) – f_q²(key_s) = ½·u_app·∆d_s²(app_q), which is greater than or equal to zero according to Lemma 1. Since f_q²(key_s) ≤ d_shape²(s, q) and both functions are non-negative, also f_q(key_s) ≤ d_shape(s, q). ◊

Now, the obvious strategy is as follows: Create an index over the (d+3)-dimensional key vectors key_s of all the segments s in the database. Then provide efficient processing methods for filter queries on the index structure, i.e. range queries f_q(x) ≤ ε and k-nearest neighbor queries using the filter distance function f_q(x).
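Because A'_q is block diagonal, the filter distance of definition 6 can be evaluated directly from A_q, the two weights, and the key vectors, without ever materializing the full (d+3)×(d+3) matrix. The following C++ sketch illustrates this; it is our own illustration with hypothetical names, not the system's code:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// key = (app, ext): the d+3 values stored in the index for a segment.
struct Key {
    std::vector<double> app;   // d approximation parameters
    double ext[3];             // 3-D extension vector
};

// Filter distance of definition 6:
//   f_q(key_s)^2 = (key_s - key_q) * A'_q * (key_s - key_q)^T
// with the block-diagonal matrix A'_q = diag( (1/2) * u_app * A_q , u_ext * I_3 ).
double filterDistance(const Key& keyS, const Key& keyQ,
                      const std::vector<std::vector<double>>& Aq,  // d x d matrix of the query
                      double uApp, double uExt) {
    const std::size_t d = keyQ.app.size();
    std::vector<double> diff(d);
    for (std::size_t i = 0; i < d; ++i) diff[i] = keyS.app[i] - keyQ.app[i];

    // approximation block: (1/2) * u_app * diff * A_q * diff^T
    double quad = 0.0;
    for (std::size_t i = 0; i < d; ++i)
        for (std::size_t j = 0; j < d; ++j)
            quad += diff[i] * Aq[i][j] * diff[j];
    double value = 0.5 * uApp * quad;

    // extension block: u_ext * (ext_s - ext_q)^2
    for (int i = 0; i < 3; ++i) {
        double e = keyS.ext[i] - keyQ.ext[i];
        value += uExt * e * e;
    }
    return std::sqrt(value);   // Lemma 3: this value lower-bounds d_shape(s, q)
}
```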

4.3. Experimental Evaluation

We implemented both the exact similarity distance function following the approximation-based similarity model and the filter distance function. Before showing the overall query processing efficiency later on, we evaluate the selectivity which is obtained by our filter distance function. For this purpose, we extended our database of 3-D surface segments to approximately 94,000 objects. These segments are extracted from 5,000 molecules by a segmentation of the surfaces such that each segment covers an area of approximately 300 Å². The results are obtained for the approximation models COS-2, COS-4, and COS-6 which are defined in table 6.

model name    formula of approximation model                              dimension of key
COS-2         (a_1, a_2) · (cos x, cos y)                                  2 + 3 = 5
COS-4         (a_1, a_2, a_3, a_4) · (cos x, cos y, cos 2x, cos 2y)        4 + 3 = 7
COS-6         (a_1, ..., a_6) · (cos x, cos y, ..., cos 3x, cos 3y)        6 + 3 = 9

Table 6. Additional approximation models for 3-D surface segments

We computed the approximations for each of the 94,000 surface segments and selected a sample of 200 representative query objects. The diagrams in figure 12 depict the first and second approximation parameters of the query objects, thus providing an impression of how the objects in the database are distributed.

Figure 12. Sample of query objects out of 94,000 surface segments. The diagrams depict the first and second approximation parameters of 200 representative query objects.

Figure 13. Selectivity of the filter distance function. For various approximation models, the diagram depicts the average fraction of 94,000 surface segments that passed the filter step for a sample of k-nearest neighbor queries (i.e. up to 0.06% of the database).

For each of the approximation models, we performed k-nearest neighbor queries for k from 1 to 60, which corresponds to a fraction of up to approximately 0.06 percent of the overall database. Figure 13 shows the results: For the retrieval of 0.06 percent of the objects, the filter returns between 0.13 and 0.6 percent of the database as candidates, depending on the approximation model. For these candidates, the exact similarity distance to the query object has to be evaluated in the refinement step.

An interesting observation is that for the model COS-2, only twice as many candidates as final results pass the filter, whereas for COS-6, the factor is approximately 10. This difference in the filter quality is caused by the distribution of the filter and the exact distances which, in our case, are closer to each other for the lower-dimensional approximation model (cf. figure 14).

Figure 14. Distribution of the filter distances and exact similarity distances. The diagram depicts the average exact distance and the average filter distance of 200 sample queries for COS-2, COS-4, and COS-6.

4.4. Efficient Multi-Step Similarity Search

We are now provided with some building blocks for multi-step similarity query processing. Before introducing our new efficient index-based query processing method for the ellipsoid query, which is a new, general query type in spatial database systems, we present the framework into which the components are integrated in order to efficiently support complex similarity search.

Multi-Step Range Query Processing. In [FRM 94], a multi-step algorithm for similarity range query processing is given. This solution guarantees no false dismissals provided the lower-bounding property holds. Figure 15 shows this algorithm adapted to our notation. Note that the computation of the key vector corresponds to the general concept of a 'feature transform' in the original version.

SimilarityRangeQuery (Object q, range ε)
  Preprocessing. Compute the key vector key_q of the query object q ('feature transform').
  Filter Step. Perform a range query on the index to obtain { o | f_q(key_o) ≤ ε }.
  Refinement Step. From the candidate set, report the objects o that fulfill d_shape(o, q) ≤ ε.

Figure 15. Algorithm for range queries based on a multidimensional index (cf. [FRM 94]).

Multi-Step k-Nearest Neighbor Search. For multi-step k-nearest neighbor query processing, an early algorithm has been presented in [Kor+ 96]. For our experiments, we use an improved algorithm which has been shown to offer an optimal filter selectivity, i.e. to produce a minimum number of candidates [Sei 97] [SK 98]. Again, we adapt the code to the context of approximation-based similarity search, where the lower-bounding filter distance function is given by f_q (see figure 16).

k-NearestNeighborSearch (Object q, number k)   // optimal version
 1   initialize index.incremental_ranking (key_q, f_q)
 2   initialize result = new sorted list <dist, object>
 3   initialize d_max = ∞
 4   while index.getnext(o) and f_q(key_o) ≤ d_max do
 5     if d_shape(o, q) ≤ d_max then result.insert (d_shape(o, q), o)   // condition is optional
 6     if result.length ≥ k then d_max = result[k].dist
 7     remove all entries from result where dist > d_max                // optional optimization
 8   endwhile
 9   report all entries from result where dist ≤ d_max

Figure 16. Optimal multi-step k-nearest neighbor algorithm [SK 98].
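The following C++ sketch mirrors the loop of figure 16; it is an illustration only, with the incremental ranking of the index abstracted as a getNext callback (discussed next), and with filter and exact standing for f_q(key_o) and d_shape(o, q). All names in the sketch are our own assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Sketch of the optimal multi-step k-nearest neighbor algorithm (figure 16).
// getNext yields object ids in ascending order of the filter distance f_q (incremental ranking);
// filter(o) = f_q(key_o), exact(o) = d_shape(o, q).
struct Neighbor { double dist; std::size_t object; };

std::vector<Neighbor> kNearestNeighborSearch(
        std::size_t k,
        const std::function<bool(std::size_t&)>& getNext,
        const std::function<double(std::size_t)>& filter,
        const std::function<double(std::size_t)>& exact) {
    std::vector<Neighbor> result;                                  // kept sorted by dist (ascending)
    double dmax = std::numeric_limits<double>::infinity();
    std::size_t o = 0;
    while (getNext(o) && filter(o) <= dmax) {                      // line 4: prune by the filter distance
        double d = exact(o);
        if (d <= dmax) {                                           // line 5: insert candidate
            auto pos = std::upper_bound(result.begin(), result.end(), d,
                       [](double v, const Neighbor& n) { return v < n.dist; });
            result.insert(pos, Neighbor{d, o});
        }
        if (result.size() >= k) dmax = result[k - 1].dist;         // line 6: tighten the pruning distance
        while (!result.empty() && result.back().dist > dmax)       // line 7: drop entries beyond dmax
            result.pop_back();
    }
    return result;                                                 // line 9: all remaining entries have dist <= dmax
}
```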

Note that the optimal multi-step k-nearest neighbor algorithm presumes an incremental ranking on the underlying multidimensional index structure. Algorithms to perform such a ranking have been proposed in [Hen 94] [HS 95], and we employ an adapted version of the latter that works on hierarchical index structures such as the R-tree [Gut 84], the R+-tree [SRF 87], the R*-tree [BKSS 90], or the X-tree [BKK 96]. Figure 17 shows the code of the procedure. Since for our index-based filter step the number k of objects that are retrieved as candidates is not known in advance, algorithms for k-nearest neighbor processing on indexes for a given k (e.g. [RKV 95]) are of no use.

method RTree :: incremental_ranking (Object query, DistanceFunction distfct)
 1   PriorityQueue queue;             // manages nodes and objects by their distance
 2   queue.insert (0, root);          // start search at the root node
 3   wait (getnext_is_called);        // wait for request of first object
 4   while not queue.isempty() do
 5     Element first = queue.pop();
 6     case first isa
 7       DirNode:
 8         foreach child in first do
 9           queue.insert (mindist (distfct, query, child.box), child);
10       DataNode:
11         foreach object in first do
12           queue.insert (distance (distfct, query, object), object);
13       Object:
14         report (first);
15         wait (getnext_is_called);  // wait for request of next object
16     end
17   enddo
18   report (nil);                    // there are no more objects available

Figure 17. Incremental ranking query processing on R-trees (adapted from [HS 95]).

The ranking algorithm provides a GIVE-ME-MORE facility and, for this purpose, communicates with the calling environment by the routines wait and report. Although the priority queue for internal nodes and indexed objects is immediately initialized at the beginning, the search does not start (or even proceed) before the next object is requested by a getnext call. This behavior is reasonable since it is not known in advance after which answer the calling routine will be satisfied. As a response to a getnext call, the ranking method reports the next object from the index according to the similarity distance to the query object. By reporting nil, the algorithm signals that the database is exhausted and no more objects are available.

In order to complete the algorithm in figure 17, we still need methods to compute the functions mindist and distance according to the given (filter) distance function. We already suggested a well-suited filter distance function in the previous section. Due to its geometry, we call the query type that corresponds to this filter distance function the ellipsoid query and provide efficient algorithms for query processing on multidimensional index structures in the following section.

5. Ellipsoids as Query Objects

In this section, we investigate ellipsoids as query objects and present a new algorithm for efficient ellipsoid query processing on multidimensional index structures. The main advantage of our technique as compared to other approaches is the high flexibility that supports modifications of the similarity matrix at query specification time. In the context of approximation-based similarity queries, every query segment defines its own ellipsoid shape, and our method is able to employ a precomputed index even if the similarity matrix is not known prior to query time.

Up to now, algorithms for similarity query processing using multidimensional access methods only support the Euclidean distance, whose query ranges have a spherical shape, and weighted Euclidean distances, whose query ranges are iso-oriented ellipsoids. General quadratic form distance functions d_A²(p, q) = (p – q) · A · (p – q)^T produce query ranges that, for positive definite query matrices A, correspond to arbitrarily oriented ellipsoids.

5.1. Query Ellipsoids in the Filter Step

In section 4.2, we derived a filter distance function whose square is equal to a quadratic form (cf. definition 6):

f_q²(key_s) = (key_s – key_q) · A'_q · (key_s – key_q)^T

Based on the key vector difference of the two segments s and q, the quadratic form is determined by the matrix A'_q, which results from combining the matrices A_q and I_3 as follows:

A'_q = ( ½·u_app·A_q       0         )
       ( 0                 u_ext·I_3 )

Clearly, the identity matrix I_3 is positive definite since it has the three-fold positive eigenvalue 1. From the theory of least-squares minimization for our approximation model, we know that A_q is positive definite [PTVF 92]. Because u_app and u_ext are positive, A'_q is positive definite as well. The matrix A_q corresponds to an ellipsoid in the space of approximation parameters; this ellipsoid represents the confidence behavior of the solution since it indicates how a variation of the approximation parameters affects the approximation error. Apart from the approximation parameters, the Singular Value Decomposition (SVD) delivers the principal axes of the ellipsoid. Figure 18 provides an example in a two-dimensional parameter space. In particular, the singular values w_1 and w_2 correspond to the (reciprocal) lengths of the principal axes, and the directions of the principal axes are given by the vectors V_1 and V_2 which are obtained directly from the SVD.

In figure 18, the ellipsoid is depicted for a given similarity range parameter ε. The approximation parameters of three objects q, s', and s'' are marked in the diagram. The point app_q is the center of the ellipsoid; for this point, the filter distance function evaluates to zero, i.e. f_q(app_q) = 0. The point app_s' is located outside the ellipsoid of range ε since f_q(app_s') > ε, and the point app_s'' is located in the interior of the ellipsoid because f_q(app_s'') < ε.

For every query segment q, the filter distance function f_q is a quadratic form which geometrically corresponds to an ellipsoid.

Figure 18. Ellipsoid for distance range ε in a two-dimensional parameter space.

Quadratic forms have already been introduced for distance functions of color histograms [Fal+ 94]. Efficient query processing algorithms have been developed that are based on techniques reducing the dimensionality. Unfortunately, for these algorithms the matrices need to be known in advance. In our case, the matrix is not known before query time, and each query object has its own matrix. Therefore, the previous solution is not applicable, and we present a new algorithm that efficiently supports the required flexibility for ellipsoid query processing.

5.2. Ellipsoid Query Processing on Multidimensional Index Structures

To complete the ranking function (cf. figure 17), we need to define the basic operations distance (d_A, q, point) and mindist (d_A, q, box). For this purpose, we introduce the ellipsoid query distance function ELLIP(A, q) by combining the distance function d_A and the query point q. Then, distance (d_A, q, point) = ELLIP(A, q).distance (point) and mindist (d_A, q, box) = ELLIP(A, q).mindist (box). For range queries with a given query range ε, we additionally need to provide the basic operations ELLIP(A, q, ε).contains (point) and ELLIP(A, q, ε).intersects (box). Our query processor makes use of multidimensional index structures which are hierarchically organized by rectilinear bounding boxes. These boxes are the parameters of the basic operations.

The query ellipsoid distance function d_A²(p, q) is specified by the positive definite matrix A and the query point q. In the case of a range query, an additional parameter ε denotes the level of the particular query ellipsoid ELLIP(A, q, ε) = { p | d_A²(p, q) ≤ ε } (see figure 19). Thus, for the two basic operations concerned with points, the implementation is straightforward:

ELLIP(A, q).distance (point) = d_A²(point, q)
ELLIP(A, q, ε).contains (point) = [ d_A²(point, q) ≤ ε ]

Figure 19. Problems ellip.contains(point) for similarity range queries and ellip.distance(point) for k-nearest neighbor search and similarity ranking.

The remaining two basic operations deal with the boxes in the index. While traversing the index from the root down to the leaves of the tree, the query is tested against the minimum bounding boxes in the respective directory nodes. For range query processing, the boxes have to be tested for intersection with the query ellipsoid. Figure 20 provides a 3-D example where two distinct query ellipsoids (differing in their defining matrices A1 and A2) are shown with the same set of boxes. Only those paths of the index are examined whose boxes overlap the respective query ellipsoid. Formally, the intersection criterion may be transformed to the following representation:

ELLIP(A, q, ε).intersects (box) ⇔ ∃ p ∈ box: d_A²(p, q) ≤ ε

Figure 20. Problem ellipsoid intersects box for two similarity matrices A1 and A2.

For k-nearest neighbor search and similarity ranking, the minimum ellipsoid distance of the query point to any point of the box has to be determined. The following equation provides a formal specification of the mindist function of ellipsoids and boxes. We denote mindist as a method of the ELLIP class:

ELLIP(A, q).mindist (box) = min { d_A²(p, q) | p ∈ box }

Let p_min ∈ box denote the point of the box with the minimum ellipsoid distance d_min to the query point q, d_min = d_A²(p_min, q) = min { d_A²(p, q) | p ∈ box }. Observe that an ellipsoid of level ε and the box intersect if the point p_min is contained in the ellipsoid.

Hence, both box-related operations can be implemented in terms of the minimal distance. When only intersection has to be tested, the actual minimum distance is not required as long as it is less than or equal to ε. Therefore, we introduce a bounded minimum distance function distance (A, q, box, ε) which meets the requirements of both the intersection test and the actual minimum distance computation. This generalized distance function returns the minimum ellipsoid distance d_min from the query point q to box if d_min ≥ ε, and an arbitrary value below ε if d_min < ε. The following lemma shows the relationship of distance to the basic operations ellip.intersects (box) and ellip.mindist (box).

Lemma 4. Given a similarity matrix A, a query point q, a rectilinear hyperrectangle box and a query range parameter ε. Then, the function distance fulfills the following correspondences:

(i)  ELLIP(A, q, ε).intersects (box)  ≡  [ distance (A, q, box, ε) ≤ ε ]
(ii) ELLIP(A, q).mindist (box)        ≡  distance (A, q, box, 0)

Proof. Let d_min = min { d_A²(p, q) | p ∈ box } be the actual minimum ellipsoid distance from the query point q to box. (i) By definition, the estimation distance (A, q, box, ε) ≤ ε holds if and only if d_min is lower than or equal to ε. On the other hand, d_min ≤ ε holds if and only if the hyperrectangle box intersects the ellipsoid ellip of level ε. (ii) Since d_min ≥ 0, distance (A, q, box, 0) always returns the actual value of d_min, which is never less than ε = 0. ◊

Figure 21 demonstrates the integration of the new function distance into the basic operations of the class ELLIP for exact ellipsoid query processing. Note that lemma 4 helps to improve the runtime performance of intersection tests with a positive result, in particular, for the following reason: Given an ellipsoid ELLIP(A, q, ε) and an intersecting hyperrectangle box, the fact of intersection can be reported without knowledge of the actual value of d_min as long as it is smaller than the ellipsoid level ε.

class ELLIP
  {float[n][n] A; float[] q; float range;};

ELLIP :: init (A, query)            {A = A; q = query;}
ELLIP :: set_range (ε)              {range = ε;}
ELLIP :: contains (point)  —> bool  {return d_A²(point, q) ≤ range;}
ELLIP :: intersects (box)  —> bool  {return distance (A, q, box, range) ≤ range;}
ELLIP :: distance (point)  —> float {return d_A²(point, q);}
ELLIP :: mindist (box)     —> float {return distance (A, q, box, 0);}

Figure 21. Ellipsoid operations for similarity query processing.

5.3. Basic Distance Algorithm for Ellipsoids and Boxes

For the implementation of distance (A, q, box, ε), we combine two paradigms: the steepest descent method [PTVF 92] and the technique of iteration over feasible points [BR 85]. Figure 22 shows the code of distance.

function distance (A, q, box, ε) → float;
 1   box := box.move (–q);                     // for all p ∈ box, let p := p – q
 2   p_0 := box.closest (origin);              // 'closest' in the Euclidean sense
 3   loop
 4     if (d²_{A,origin}(p_i) ≤ ε) break;      // ellipsoid ellip(A, q, ε) is reached
 5     g := –∇_ellip(p_i);                     // descending gradient at p_i
 6     g := box.project (p_i, g);              // gradient projection onto the box
 7     if (|g| = 0) break;                     // no feasible progress along g
 8     s := –∇_ellip(p_i)*g / ∇_ellip(g)*g;    // linear minimization along g
 9     p_{i+1} := box.closest (p_i + s*g);     // projection of new p onto box
10     if (d²_{A,origin}(p_i) ≈ d²_{A,origin}(p_{i+1})) break;   // no more progress was achieved
11   endloop
12   return d²_{A,origin}(p_i);
end distance;

Figure 22. Basic ellipsoid-and-box algorithm. The procedure iterates over feasible points p_i within the box until the constrained minimum of the ellipsoid or the value ε is reached.

The translation in line 1 adjusts the coordinate system such that the query point q, which is also the center of the ellipsoid, is the origin. We achieve this by moving the box, and we then compute the ellipsoid distance function and the gradient by the more efficient formulas d²_{A,origin}(p_i) = p_i · A · p_i^T and ∇_ellip(p_i) = 2 · p_i · A. Thus, we save the evaluation of the difference vector p_i – q for each intermediate point p_i.

Our algorithm exploits the basic idea of the feasible points method adapted from the linear programming algorithm of [BR 85]. This concept differs from the standard technique for linear programming, the simplex method [Dan 66] [PTVF 92], since it ensures that every point that is visited on the way down to the minimum belongs to the feasible region, which is, in our case, the box. In particular, the algorithm starts at the query point, which is the origin after the translation of line 1. In order to ensure the feasibility of the visited points, the starting point p_0 is projected onto the box (line 2). The same projection is later performed for all the points p_i that are reached by the iteration (line 9). For any point p, the rectilinear projection yields the closest point of the box according to the Euclidean distance. In our case, the boxes are rectilinear and, therefore, the projection is simply performed for the dimensions d = 1, ..., n independently of each other as follows:

box.closest (p)[d] =  box.lower[d]   if p[d] < box.lower[d]
                      box.upper[d]   if p[d] > box.upper[d]
                      p[d]           otherwise
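For reference, here is a self-contained C++ rendering of the procedure of figure 22; it follows the pseudocode above, including the gradient truncation and line minimization discussed in the following paragraphs, but the helper names Box, quad, grad, and clampToBox as well as the convergence tolerance are our own choices, not the paper's:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Box { std::vector<double> lower, upper; };   // rectilinear bounding box

// d²_{A,origin}(p) = p · A · p^T  (coordinates already translated by –q)
static double quad(const std::vector<std::vector<double>>& A, const std::vector<double>& p) {
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
        for (std::size_t j = 0; j < p.size(); ++j)
            sum += p[i] * A[i][j] * p[j];
    return sum;
}

// gradient of the quadratic form: 2 · A · p  (A symmetric)
static std::vector<double> grad(const std::vector<std::vector<double>>& A, const std::vector<double>& p) {
    std::vector<double> g(p.size(), 0.0);
    for (std::size_t i = 0; i < p.size(); ++i)
        for (std::size_t j = 0; j < p.size(); ++j)
            g[i] += 2.0 * A[i][j] * p[j];
    return g;
}

// Bounded minimum ellipsoid distance distance(A, q, box, eps), cf. figure 22.
double ellipsoidBoxDistance(const std::vector<std::vector<double>>& A,
                            const std::vector<double>& q, Box box, double eps) {
    const std::size_t n = q.size();
    for (std::size_t d = 0; d < n; ++d) { box.lower[d] -= q[d]; box.upper[d] -= q[d]; }  // line 1
    auto clampToBox = [&](std::vector<double>& x) {               // Euclidean projection onto the box
        for (std::size_t d = 0; d < n; ++d) {
            if (x[d] < box.lower[d]) x[d] = box.lower[d];
            if (x[d] > box.upper[d]) x[d] = box.upper[d];
        }
    };
    std::vector<double> p(n, 0.0);
    clampToBox(p);                                                // line 2: start at the feasible point closest to q
    for (;;) {
        double dist = quad(A, p);
        if (dist <= eps) return dist;                             // line 4: ellipsoid of level eps is reached
        std::vector<double> g = grad(A, p);
        double norm = 0.0;
        for (std::size_t d = 0; d < n; ++d) {
            g[d] = -g[d];                                         // line 5: descending direction
            bool leaving = (p[d] <= box.lower[d] && g[d] < 0.0) ||
                           (p[d] >= box.upper[d] && g[d] > 0.0);
            if (leaving) g[d] = 0.0;                              // line 6: nullify leaving components
            norm += g[d] * g[d];
        }
        if (norm == 0.0) return dist;                             // line 7: no feasible progress along g
        std::vector<double> Ap = grad(A, p), Ag = grad(A, g);     // 2·A·p and 2·A·g
        double num = 0.0, denom = 0.0;
        for (std::size_t d = 0; d < n; ++d) { num += Ap[d] * g[d]; denom += Ag[d] * g[d]; }
        double s = -num / denom;                                  // line 8: exact line minimization along g
        for (std::size_t d = 0; d < n; ++d) p[d] += s * g[d];     // line 9: step ...
        clampToBox(p);                                            //         ... and project back onto the box
        double next = quad(A, p);
        if (std::fabs(dist - next) <= 1e-12 * (1.0 + dist)) return next;  // line 10: no more progress
    }
}
```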

[Figure 23 illustration: a box and a query ellipsoid of level ε around q that do not intersect (dmin > ε); the Euclidean-closest point pe differs from the ellipsoid-closest point pmin.]

Figure 23. The distinct closest points pe for the Euclidean distance and pmin for an ellipsoid distance function. The objects box and ELLIP(A, q, ε) do not intersect, which is indicated by dmin > ε.

Figure 23 provides an illustration that demonstrates two basic facts which both apply to the general case of query ellipsoids that are not iso-oriented. First, the closest point p0 with respect to the Euclidean distance is distinct from the point pmin that has the minimum ellipsoid distance to the query point. Second, the fundamental theorem of linear optimization [PTVF 92], stating that the optimum solution coincides with a vertex of the feasible region, does not apply to our non-linear objective function. This is the reason why our algorithm incorporates the steepest descent technique.

Our algorithm is designed for both the determination of the minimum ellipsoid distance of a query point to a rectilinear box and the intersection detection between a query ellipsoid of level ε and a box. In line 4, an optimization for the intersection detection is provided: on its way down to the minimum, the algorithm may reach an intermediate point pi that is already contained in the ellipsoid. In the case of intersection detection, as it occurs for range queries, the algorithm may stop at this point. This situation is detected by the condition d²_{A,origin}(pi) ≤ ε.

For the steepest descent, the gradient ∇ellip(pi) of the ellipsoid function at pi is determined in line 5. In order to proceed from the current point pi to the desired minimum while remaining within the box, we decompose the gradient g into two components g = gfeasible + gleaving and project g onto the direction gfeasible that does not leave the box when it is affixed to p (line 6). For rectilinear boxes, the operation box.project(p, g) is easily performed by nullifying the leaving components of the gradient g. Formally, we obtain for each dimension d = 1, …, n:

\[
\mathit{box.project}(p, g)[d] =
\begin{cases}
0 & \text{if } \bigl(g[d] < 0 \wedge p[d] = \mathit{box.lower}[d]\bigr) \vee \bigl(g[d] > 0 \wedge p[d] = \mathit{box.upper}[d]\bigr) \\
g[d] & \text{otherwise}
\end{cases}
\]

This gradient projection means a restriction of the search space to those dimensions that correspond to the feasible directions when proceeding from the current location of p. In other words, the truncated gradient gfeasible represents the gradient of the ellipsoid function when restricted to the subspace that corresponds to the active constraints of the box (see figure 24). Note that by the projection, the gradient may vanish. In that case, no more progress is feasible, and the desired minimum is reached at the current position. This situation is recognized in line 7, and the algorithm stops.

[Figure 24 illustration: the gradient g at a boundary point p of the box (the feasible region) decomposed into the components gfeasible and gleaving.]

Figure 24. Gradient truncation with respect to the box boundary.

Now, the algorithm descends along the new, feasible direction down to the local minimum. In line 8, the scaling factor s ∈ ℝ is determined that leads to the point p + s·g on that line for which d²_A(p + s·g, q) is minimal. This holds if ∇ellip(p + s·g)·g = 0, which immediately implies s = –∇ellip(p)*g / ∇ellip(g)*g (a short derivation is given at the end of this section).

In line 9, the local minimum point pi + s·g is projected onto the box, yielding the new point pi+1 (cf. line 2). Again, the projection ensures that the algorithm does not leave the box on its way down to the global minimum within the box. Unless it finishes already in line 4 or 7, the steepest descent method stops in line 10 after a finite number of iterations [PTVF 92] when no more progress is observed. Finally, the function returns the desired minimum ellipsoid distance value in line 12.
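The step size used in line 8 follows from the linearity of the ellipsoid gradient. For completeness, here is a one-line derivation in the notation introduced above; it is our own restatement and not part of the original text:

\[
\frac{d}{ds}\, d^2_{A,\mathrm{origin}}(p + s\,g)
= \nabla_{\mathrm{ellip}}(p + s\,g)\cdot g
= \nabla_{\mathrm{ellip}}(p)\cdot g + s\,\nabla_{\mathrm{ellip}}(g)\cdot g
\overset{!}{=} 0
\quad\Longrightarrow\quad
s = -\,\frac{\nabla_{\mathrm{ellip}}(p)\cdot g}{\nabla_{\mathrm{ellip}}(g)\cdot g}
\]

Since A is positive definite and the projected direction g is non-zero whenever line 8 is reached, the denominator ∇ellip(g)·g = 2·gᵀ·A·g is strictly positive, so s is well defined.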
6. Performance Evaluation

We implemented all of our algorithms for ellipsoid query processing in C/C++ and performed the experiments on an HP 9000/780 under HP-UX 10.20. In the following, we present the results obtained from our test database of 3-D surface segments. This database contains 94,000 surface segments from 5,100 molecules. Originally, the 3-D molecule data are obtained from the PDB [Ber+ 77]. We computed the molecular surfaces and generated surface points with a density of 1.0 Å⁻¹ [SK 95]. A segmentation of the surfaces yielded 94,000 segments for which we computed several shape approximations, in particular using the approximation models COS-2, COS-4, and COS-6. These models lead to key vectors of dimension 5, 7, and 9, respectively. We managed the key vectors by an X-tree [BKK 96] with a page size of 4 kbytes. For the approximation-based similarity distance function, we set the weighting factors uapp and uext both to 1.

Whereas this experimental setting is typical for the docking search with the presently available data, the number of available 3-D protein structures will increase significantly in the near future for two reasons: First, the more powerful Nuclear Magnetic Resonance (NMR) technique is going to replace the crystallographic structure determination used up to now, leading to much larger database sizes. Second, methods for predicting the 3-D structure from protein sequences are currently under development and become more and more successful. Since sequence analysis is much easier and cheaper than 3-D structure determination, the majority of 3-D protein structures will be available from structure prediction in the future [ASS 95].

[Figure 25 illustration: a box, a query point q, and an iso-oriented ellipsoid A; the Euclidean-closest point pe and the ellipsoid-closest point pmin coincide.]

Figure 25. For iso-oriented ellipsoids, the algorithm distance stops after the first iteration.

6.1. Runtime Complexity of the Algorithm

In each iteration of the loop in distance, the evaluation of both the ellipsoid value d²_{A,origin}(p) = pᵀ·A·p and the gradient vector ∇ellip(p) = 2·p·A requires O(d²) time for d dimensions. The overall runtime of distance(A, q, box, ε) thus results in O(#iter·d²), where #iter denotes the number of iterations. Note that our starting point p0 ∈ box is closest to the query point with respect to the Euclidean distance. This also holds for any weighted Euclidean distance. Thus, the desired point pmin coincides with the starting point p0 if the similarity matrix A is diagonal, and the algorithm immediately stops in the first iteration, at line 4 if the box and ellipsoid intersect, or at line 7 if they do not. Overall, the runtime complexity of the operation distance is O(d²) for diagonal similarity matrices A (representing the Euclidean distance or weighted Euclidean distance functions, cf. figure 25).

For general query ellipsoids that are not iso-oriented, the number of iterations is hard to estimate. In the following, we show typical values that occurred in our experimental evaluations. We observed that our iterative algorithm works well in practice. One might imagine examples, however, where the method does not perform well: in particular, if the ellipsoid is far from being iso-oriented and some of its principal axes are very long while others are very short, the number of gradient iterations may increase significantly.

6.2. Performance of Similarity Query Processing

For our experiments, we randomly selected some 200 query objects from the database of 94,000 surface segments. We performed k-nearest neighbor queries for k = 10, 20, …, 100, which corresponds to approximately 0.1 percent of the database. For each query object and each considered value of k, we determined the equivalent query range ε. With these query ranges, we performed range queries which are exactly equivalent to the corresponding k-nearest neighbor queries. Thus, we are able to present a direct comparison of both query types.

As a first result, we demonstrate the number of candidates that were generated by the filter step (see figure 26) for each of the three considered approximation models. Due to the optimality of our multi-step k-nearest neighbor algorithm, the numbers of candidates for k-nearest neighbor queries and equivalent range queries coincide. The diagrams illustrate the good selectivity of our filter distance function: For k = 100, there were 215, 402, and 830 candidates generated, which corresponds to 0.2, 0.4, and 0.9 percent of the 94,000 objects in the database.

[Figure 26 diagrams: one panel per approximation model, COS-2 (dim = 5), COS-4 (dim = 7), and COS-6 (dim = 9); x-axis: retrieved results (10 to 90), y-axis: number of candidates; the curves for k-nn queries and range queries coincide.]

Figure 26. Number of candidates generated in the filter step. For 2,000 k-nearest neighbor and equivalent range queries on 94,000 surface segments, the diagrams depict the average number of candidates depending on the number of requested results. For k = 100, there were 215, 402, and 830 candidates generated, which corresponds to 0.2, 0.4, and 0.9 percent of the objects in the database.

6.3. Performance of Ellipsoid Queries on Indexes

Our next experiments demonstrate the efficiency of ellipsoid query processing on the index. The ellipsoids represent the filter distance function of the approximation-based similarity measure. Figure 27 illustrates the results, averaged over the sample queries on our 3-D database of 94,000 surface segments. In the diagrams, the abscissa axes depict the number of candidates rather than the number of final results, since the candidates are the objects that are retrieved from the index in the filter step.

We observed that the number of iterations (top row) does not vary significantly with the number of retrieved results. Range queries require fewer iterations than k-nn queries for an obvious reason: For k-nn queries, the minimum distance of the ellipsoid has to be evaluated exactly for every box, which results in a high number of iterations. For range queries, the iteration may stop as soon as it is detected that the box intersects the query ellipsoid. In our examples, up to 30% of the iterations are saved.

The overall CPU time (middle row) depends on the number of iterations and, clearly, on the number of results that are obtained from the index. As expected from the number of iterations, the range queries are faster than the equivalent k-nn queries. For the higher dimensions (7 and 9), this effect is mainly a result of the number of iterations. For lower dimensions (e.g. 5), the overhead for managing the priority queue becomes noticeable.

The fraction of accessed index pages depends on the number of retrieved results as well as on the dimension of the index. Clearly, it does not depend on the query type, since we use a purely mindist-based k-nearest neighbor algorithm which causes the same number of index page accesses as the equivalent range query.
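The saving for range queries discussed above stems entirely from the early termination in line 4 of distance. The following C++ sketch contrasts the two calling patterns on the directory regions of an index; the Entry record, the function names, and the queue layout are hypothetical simplifications for illustration and do not reproduce the X-tree interface used in the experiments.

#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Box { std::vector<double> lower, upper; };
struct Entry { Box region; int childPage; };    // hypothetical directory entry

// The bounded ellipsoid-box distance of Figure 22: exact for eps = 0, and only
// guaranteed to stay below eps when the region intersects the query ellipsoid.
using BoundedDistance = std::function<double(const Box&, double)>;

// Range-query filter step: a region qualifies if its (bounded) distance is at
// most the query range; the distance evaluation may stop as soon as line 4 fires.
std::vector<int> rangeFilter(const std::vector<Entry>& directory,
                             const BoundedDistance& distance, double range) {
    std::vector<int> qualifying;
    for (const Entry& e : directory)
        if (distance(e.region, range) <= range)
            qualifying.push_back(e.childPage);
    return qualifying;
}

// k-nn filter step: every region needs its exact mindist (eps = 0) so that the
// priority queue visits pages in increasing order of minimum distance.
using QueueItem = std::pair<double, int>;       // (mindist, child page)
std::priority_queue<QueueItem, std::vector<QueueItem>, std::greater<QueueItem>>
knnFrontier(const std::vector<Entry>& directory, const BoundedDistance& distance) {
    std::priority_queue<QueueItem, std::vector<QueueItem>, std::greater<QueueItem>> pq;
    for (const Entry& e : directory)
        pq.emplace(distance(e.region, 0.0), e.childPage);
    return pq;
}

In both cases the set of accessed pages is the same, which is why the bottom row of Figure 27 shows coinciding curves.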

[Figure 27 diagrams: one column per approximation model, COS-2 (dim = 5), COS-4 (dim = 7), and COS-6 (dim = 9); three rows of plots over the number of candidates (#candidates): number of iterations (top), elapsed time [sec] (middle), and fraction of accessed index pages (bottom); each plot contains curves for k-nn queries and range queries; in the bottom row the curves coincide.]

Figure 27. Performance of approximation-based similarity query processing. On our test database of 94,000 surface segments, we performed 2,000 k-nearest neighbor and range queries with a selectivity of up to 0.1% of the database. Depending on the number of candidates, the diagrams depict the average of the following values: number of iterations in the function distance (top), elapsed CPU time (middle), and number of accessed index pages (bottom). The accessed pages for k-nn queries and range queries coincide due to the optimality of the k-nn algorithm.

7. Conclusions

In this paper, we presented a new approach to quantify the shape similarity of 3-D surface segments. The method is adaptable to specific applications by providing appropriate approximation models that fit the requirements of the particular problem. The similarity of two 3-D surface segments is measured by using a shape approximation technique, and the distance function is defined in terms of the mutual approximation error combined with the 3-D extension distance. In order to support efficient query processing, we derive a lower-bounding filter distance function that is designed for an index-based filter step. The successful application of the approximation-based similarity model is demonstrated by experiments on a protein database system.

The filter distance functions are positive definite quadratic forms and, therefore, represent ellipsoids as query objects. We present an algorithm for efficient ellipsoid query processing that supports both range queries and k-nearest neighbor queries. The technique is very general since it is not committed to a particular index structure but works for the wide class of rectilinearly organized multidimensional index structures. Thus, the performance of our method may benefit from advances in high-dimensional indexing by adapting the algorithm to the respective index methods. Theoretical investigations as well as experimental evaluations demonstrate the efficiency of the technique in a multi-step query processing environment where the ellipsoid query is used in the filter step.

In our future work, we plan to investigate complex approximation models which may be given as combinations of the presented simple approximation models. By this approach, a better support for even more complex surface segments could be provided. Two aspects arise and have to be considered: One question is how to combine the individual models in order to obtain a useful similarity measure. The other question is how to extend efficient query algorithms to high-dimensional key vectors. Such high-dimensional key vectors result from complex approximation models, or from a combination of several low-dimensional approximation models. A first step in this direction can be found in [SK 97].

7.1. Acknowledgements

We thank the anonymous reviewers for thoroughly and carefully reading the paper, and appreciate their helpful suggestions to improve the presentation of our concepts. We are also very grateful to our colleague Markus Breunig, who was very engaged in reading the paper and thus helped to polish the presentation. Finally, we thank our colleague Thomas Schmidt for fruitful discussions and his assistance in the preparation of an earlier, shorter version of this paper which appeared in the proceedings of the Fifth International Symposium on Large Spatial Databases (SSD'97), Berlin, Germany.

References

[AFS 93] Agrawal R., Faloutsos C., Swami A.: 'Efficient Similarity Search in Sequence Databases', Proc. 4th Int. Conf. on Foundations of Data Organization and Algorithms (FODO'93), Evanston, ILL, in: Lecture Notes in Computer Science, Vol. 730, Springer, 1993, pp. 69-84.

[ALSS 95] Agrawal R., Lin K.-I., Sawhney H. S., Shim K.: 'Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases', Proc. 21st Int. Conf. on Very Large Databases (VLDB'95), Morgan Kaufmann, 1995, pp. 490-501.

[ASS 95] Aehle W., Sobek H., Schomburg D.: 'Evaluation of Protein 3D-Structure Prediction: Comparison of Modelled and X-Ray Structure of an Alkaline Serine Protease', Journal of Biotechnology, Vol. 41, 1995, pp. 211-220.

[Ber+ 77] Bernstein F. C., Koetzle T. F., Williams G. J., Meyer E. F., Brice M. D., Rodgers J. R., Kennard O., Shimanovichi T., Tasumi M.: 'The Protein Data Bank: a Computer-based Archival File for Macromolecular Structures', Journal of Molecular Biology, Vol. 112, 1977, pp. 535-542.

[Ber+ 97] Berchtold S., Böhm C., Braunmüller B., Keim D., Kriegel H.-P.: 'Fast Parallel Similarity Search in Multimedia Databases', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997, pp. 1-12.

[BHKS 93] Brinkhoff T., Horn H., Kriegel H.-P., Schneider R.: 'A Storage and Access Architecture for Efficient Query Processing in Spatial Database Systems', Proc. 3rd Int. Symp. on Large Spatial Databases (SSD'93), Singapore, 1993, Lecture Notes in Computer Science, Vol. 692, Springer, pp. 357-376.

[BKK 96] Berchtold S., Keim D., Kriegel H.-P.: 'The X-tree: An Index Structure for High-Dimensional Data', Proc. 22nd Int. Conf. on Very Large Data Bases (VLDB'96), Mumbai, India, 1996, pp. 28-39.

[BKK 97] Berchtold S., Keim D., Kriegel H.-P.: 'Using Extended Feature Objects for Partial Similarity Retrieval', VLDB Journal, Vol. 6, No. 4, 1997, pp. 333-348.

[BK 97] Berchtold S., Kriegel H.-P.: 'S3: Similarity Search in CAD Database Systems', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997.

[BKSS 90] Beckmann N., Kriegel H.-P., Schneider R., Seeger B.: 'The R*-tree: An Efficient and Robust Access Method for Points and Rectangles', Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, 1990, pp. 322-331.

[BR 85] Best M. J., Ritter K.: 'Linear Programming. Active Set Analysis and Computer Programs', Prentice Hall, Englewood Cliffs, N.J., 1985.

[Dan 66] Dantzig G. B.: 'Linear Programming and Extensions' (in German), Springer, Berlin, 1966.

[Fal+ 94] Faloutsos C., Barber R., Flickner M., Hafner J., Niblack W., Petkovic D., Equitz W.: 'Efficient and Effective Querying by Image Content', Journal of Intelligent Information Systems, Vol. 3, 1994, pp. 231-262.

[FRM 94] Faloutsos C., Ranganathan M., Manolopoulos Y.: 'Fast Subsequence Matching in Time-Series Databases', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1994, pp. 419-429.

[GG 97] Gaede V., Günther O.: 'Multidimensional Access Methods', ACM Computing Surveys.

[GM 93] Gary J. E., Mehrotra R.: 'Similar Shape Retrieval using a Structural Feature Index', Information Systems, Vol. 18, No. 7, 1993, pp. 525-537.

[Gut 84] Guttman A.: 'R-trees: A Dynamic Index Structure for Spatial Searching', Proc. ACM SIGMOD Int. Conf. on Management of Data, Boston, MA, 1984, pp. 47-57.

[Hen 94] Henrich A.: 'A Distance-Scan Algorithm for Spatial Access Structures', Proc. 2nd ACM Workshop on Advances in Geographic Information Systems, Gaithersburg, Maryland, 1994, pp. 136-143.

[HS 94] Holm L., Sander C.: 'The FSSP database of structurally aligned protein fold families', Nucl. Acids Res. 22, 1994, pp. 3600-3609.

[HS 95] Hjaltason G. R., Samet H.: 'Ranking in Spatial Databases', Proc. 4th Int. Symposium on Large Spatial Databases (SSD'95), Lecture Notes in Computer Science, Vol. 951, Springer, 1995, pp. 83-95.

[Jag 91] Jagadish H. V.: 'A Retrieval Technique for Similar Shapes', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1991, pp. 208-217.

[Kor+ 96] Korn F., Sidiropoulos N., Faloutsos C., Siegel E., Protopapas Z.: 'Fast Nearest Neighbor Search in Medical Image Databases', Proc. 22nd VLDB Conference, Mumbai, India, 1996, pp. 215-226.

[MG 93] Mehrotra R., Gary J. E.: 'Feature-Based Retrieval of Similar Shapes', Proc. 9th Int. Conf. on Data Engineering, Vienna, Austria, 1993, pp. 108-115.

[Mei+ 95] Meier R., Herrmann G., Ackermann F., Posch S., Sagerer G.: 'Segmentation of Molecular Surfaces based on their Convex Hull Surfaces', Proc. Int. Conf. on Image Processing, Washington, D.C., IEEE Computer Society Press, 1995, pp. 552-555.

[MPSS 97] Massmann A., Posch S., Sagerer G., Schlüter D.: 'Using Markov Random Fields for Contour-Based Grouping', Proc. Int. Conf. on Image Processing, Santa Barbara, CA, IEEE Computer Society Press, pp. 207-210.

[MWS 96] Meyer M., Wilson P., Schomburg D.: 'Hydrogen Bonding and Molecular Surface Shape Complimentarity as a Basis for Protein Docking', Journal of Molecular Biology, Vol. 264, 1996, pp. 199-210.

[Nib+ 93] Niblack W., Barber R., Equitz W., Flickner M., Glasmann E., Petkovic D., Yanker P., Faloutsos C., Taubin G.: 'The QBIC Project: Querying Images by Content Using Color, Texture, and Shape', SPIE 1993 Int. Symposium on Electronic Imaging: Science and Technology, Conference 1908, Storage and Retrieval for Image and Video Databases, San Jose, CA, 1993.

[OM 88] Orenstein J. A., Manola F. A.: 'PROBE Spatial Data Modeling and Query Processing in an Image Database Application', IEEE Trans. on Software Engineering, Vol. 14, No. 5, 1988, pp. 611-629.

[PTVF 92] Press W. H., Teukolsky S. A., Vetterling W. T., Flannery B. P.: 'Numerical Recipes in C', 2nd ed., Cambridge University Press, 1992.

[RKV 95] Roussopoulos N., Kelley S., Vincent F.: 'Nearest Neighbor Queries', Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995, pp. 71-79.

[Sei 97] Seidl T.: 'Adaptable Similarity Search in 3-D Spatial Database Systems', PhD thesis, Faculty for Mathematics and Computer Science, University of Munich, 1997.

[SK 95] Seidl T., Kriegel H.-P.: 'A 3D Molecular Surface Representation Supporting Neighborhood Queries', Proc. 4th Int. Symposium on Large Spatial Databases (SSD'95), Portland, Maine, USA, Lecture Notes in Computer Science, Vol. 951, Springer, 1995, pp. 240-258.

[SK 97] Seidl T., Kriegel H.-P.: 'Efficient User-Adaptable Similarity Search in Large Multimedia Databases', Proc. 23rd Int. Conf. on Very Large Databases (VLDB'97), Athens, Greece, 1997, pp. 506-515.

[SK 98] Seidl T., Kriegel H.-P.: 'Optimal Multi-Step k-Nearest Neighbor Search', Proc. ACM SIGMOD Int. Conf. on Management of Data, Seattle, Washington, 1998.

[SRF 87] Sellis T., Roussopoulos N., Faloutsos C.: 'The R+-Tree: A Dynamic Index for Multi-Dimensional Objects', Proc. 13th Int. Conf. on Very Large Databases, Brighton, England, 1987, pp. 507-518.

[TC 91] Taubin G., Cooper D. B.: 'Recognition and Positioning of Rigid Objects Using Algebraic Moment Invariants', in Geometric Methods in Computer Vision, Vol. 1570, SPIE, 1991, pp. 175-186.
