Approximate Space-Filling Curve

Ana Karina Tavares da Moura Gomes

Dissertation submitted to obtain the Master's Degree in Information Systems and Computer Engineering

Supervisor: Prof. Dr. Andreas Wichert

Examination Committee
Chairperson: Prof. Dr. Mário Jorge Costa Gaspar da Silva
Supervisor: Prof. Dr. Andreas Miroslaus Wichert
Member of the Committee: Prof. Dr. Pável Pereira Calado

June 2015

Dedicated to my beloved son Enzo. You were my greatest motivation to finish the course and not give up, regardless of the difficulties. I love you so much Zô!

Acknowledgments

First of all, I would like to thank my mother Jú and my sister Carol for all the support they gave me. I have special thanks to Carol, who so many times listened to me without understanding anything just to help me think out loud. Thank you for all the love and care. Love you sis! I would like to thank my advisor for the patience and for the advice, but mainly for helping me grow as a student and future professional by allowing me to think and decide for myself. Thank you! I would like to thank my mother-in-law Cristina for taking care of my child, allowing me to work in the afternoons. I would like to thank my closest friends Ana Silva and José Leira for all the support. Thank you for softening the worst days of work. You are one of the greatest treasures that I gained from this course. Ana, thank you for not giving up on me and for being an angel! Zé, thank you for being my thesis partner and for listening to all my endless doubts. Finally, I have a special thanks to my husband Tiago. Thanks for not judging me, for believing in me (even when I didn't) and for your endless patience. There are no words to describe how grateful I am for having you in my life. Thank you for all the love, all the support and comprehension during all these years. Love you!

Resumo

Curvas preenchedoras de espaço são fractais gerados por computador que podem ser usadas para indexar espaços de baixas dimensões. Existem estudos anteriores que as usam como método de acesso nestes cenários, contudo são muito conservadores, usando-as apenas em espaços até quatro dimensões. Adicionalmente, os métodos alternativos tendem a apresentar um desempenho pior do que uma pesquisa linear quando os espaços ultrapassam as dez dimensões. Deste modo, no contexto da minha tese, estudo as curvas preenchedoras de espaço e as suas propriedades assim como os desafios apresentados pelos dados multidimensionais. Eu proponho o uso destas, especificamente da curva de Hilbert, como um método de acesso para indexar pontos multidimensionais até trinta dimensões. Eu começo por mapear os pontos para a curva de Hilbert, gerando os seus h-values. Em seguida, desenvolvo três heurísticas para procurar vizinhos aproximadamente mais próximos de um dado ponto de pesquisa com o objectivo de testar o desempenho da curva. Duas heurísticas usam a curva como método de acesso direto e a restante usa a curva como chave secundária combinada com uma variante da B-tree. Estas resultam de um processo iterativo que basicamente consiste no planeamento, concepção e teste da heurística. De acordo com os resultados do teste, a heurística é alterada ou é criada uma nova. Os resultados experimentais com as três heurísticas provam que a curva de Hilbert pode ser usada como método de acesso e que esta consegue funcionar pelo menos em espaços até trinta e seis dimensões.

Palavras-chave: Fractais, Curva de Hilbert Preenchedora de Espaço, Indexação de Baixas Dimensões, Vizinho Aproximadamente Mais Próximo

Abstract

Space-filling curves are computer-generated fractals that can be used to index low-dimensional spaces. There are previous studies using them as an access method in these scenarios, although they are very conservative, applying them only up to four-dimensional spaces. Additionally, the alternative access methods tend to present worse performance than a linear search when the spaces surpass ten dimensions. Therefore, in the context of my thesis, I study space-filling curves and their properties as well as the challenges presented by multidimensional data. I propose their use, specifically the Hilbert curve, as an access method for indexing multidimensional points up to thirty dimensions. I start by mapping the points to the Hilbert curve, generating their h-values. Then, I develop three heuristics to search for approximate nearest neighbors of a given query point with the aim of testing the performance of the curve. Two of the heuristics use the curve as a direct access method and the other uses the curve as a secondary key combined with a B-tree variant. These result from an iterative process that basically consists of planning, conceiving and testing the heuristic. According to the test results, the heuristic is adjusted or a new one is created. Experimental results with the three heuristics prove that the Hilbert curve can be used as an access method, and that it can operate at least in spaces up to thirty-six dimensions.

Keywords: Fractals, Hilbert Space-Filling Curve, Low-Dimensional Indexing, Approximate Nearest Neighbor

Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures

1 Introduction
1.1 Hypothesis and Methodology
1.2 Main contributions
1.3 Document Outline

2 Multidimensional Data
2.1 Introduction
2.2 Multidimensional Relations
2.3 Vector Space
2.4 Multidimensional Query
2.4.1 Range Query
2.4.2 Nearest Neighbor Query
2.5 Multidimensional Index
2.5.1 R-tree
2.5.2 Kd-tree
2.6 Summary

3 Fractals
3.1 Introduction
3.2 The Hausdorff Dimension
3.3 Space-Filling Curves
3.3.1 Z-Curve
3.3.2 Hilbert Curve
3.4 Higher Dimensions
3.5 Clustering Property
3.6 Fractals: An Access Method

3.6.1 Secondary Key Retrieval
3.6.2 Direct Access Index
3.7 Summary

4 Approximate Space-Filling Curve
4.1 Motivation
4.2 Methodology
4.3 Data Description
4.4 Mapping System
4.5 Dataset Analysis based on a Linear Search
4.6 Experiment 1: Hypercube Zoom Out
4.7 Experiment 2: Enzo Full Space
4.8 Experiment 3: Enzo Reduced Space
4.9 Summary

5 Conclusions
5.1 Contributions
5.2 Future Work

Bibliography

A Appendix
A.1 Chapter: Fractals
A.2 Chapter: Approximate Space-Filling Curve

List of Tables

4.1 Datasets Description
4.2 Neighbor Distance to Query Point per Dimension
4.3 HZO Runtime Results
4.4 HZO Approximate Nearest Neighbors Results
4.5 Enzo FS Runtime Results
4.6 Enzo FS Approximate Nearest Neighbors Results
4.7 Enzo RS Runtime Results
4.8 Enzo RS Relative Error

A.1
A.2 Enzo RS Results

List of Figures

2.1 Examples of Multidimensional Data

2.2 Example of Unit Circles for L1, L2 and L∞ Norms
2.3 R-tree Planar and Directory Representation
2.4 Kd-tree Planar and Directory Representation

3.1 Ratio Similarity Example 1
3.2 Ratio Similarity Example 2
3.3 Hausdorff Dimension Analysis
3.4 Z-curve
3.5 Z-curve Bit Shuffling
3.6 Z-curve Query Example
3.7 Hilbert Curve
3.8 Hilbert Curve 3D
3.9 Hilbert Second Order Curve
3.10 Hilbert Curve VS Z-curve
3.11 Hilbert Curve Analysis to Higher Dimensions
3.12 Analysis of the Entry and Exit Points
3.13 Clustering Analysis
3.14 Lawder and King Hilbert Curve Indexing Structure
3.15 Chen and Chang Hilbert Curve Indexing Structure

4.1 Approximate Space-filling Curve Framework
4.2 The Nearest Neighbor Distribution per Dimension
4.3 Linear Search Runtime Performance
4.4 Linear Search Runtime Performance
4.5 Hypercube Zoom Out Algorithm Scheme
4.6 Hypercube Zoom Out Example
4.7 HZO Relative Error Distribution
4.8 HZO Algorithm Scheme
4.9 Enzo FS Relative Error Distribution
4.10 Enzo RS Algorithm Scheme

4.11 Enzo RS Runtime Performance of the First Group of Analysis
4.12 Enzo RS Runtime Performance of the Second Group of Analysis
4.13 Enzo RS Relative Error Distribution
4.14 Runtime Comparison Between Heuristics
4.15 Average Relative Error Comparison Between Heuristics
4.16 Enzo RS Global Runtime Performance
4.17 Enzo RS Global Average Relative Error

Chapter 1

Introduction

Throughout the years, several indexes have been proposed to handle multidimensional (or spatial) data, but the most popular is the R-tree. Despite all the effort dedicated to improving it, R-tree performance does not appear to keep up with the increase in the number of dimensions. It is common for multidimensional indexes to suffer from the curse of dimensionality, which means they present a similarity search performance worse than a linear search in spaces that exceed ten dimensions [Weber et al., 1998, Yu, 2002, Mamoulis, 2012]. Even with other indexes offering better performance, Mamoulis considers that similarity search in these spaces is a difficult and unsolved problem in its generic form [Mamoulis, 2012].

The approximate nearest neighbor is a similarity search technique that trades the accuracy of query results for reduced response time and memory usage. It is an invaluable search technique in high-dimensional spaces. Commonly, we use feature vectors to simplify and represent data in order to lower the complexity. The features are represented by points in the high-dimensional space and the closer the points are in space, the more similar the objects are. Images, for example, are usually represented by feature vectors, which are then used to find approximately similar objects. In this case, an approximate result is an acceptable solution. Multimedia data is only one example among several other applications. The approximate nearest neighbor, as its name suggests, does not guarantee the return of the exact neighbor, and its performance is highly related to the data distribution [Wichert, 2015].

Space-filling curves are also highly related to the data distribution. They are fractals that enable mapping multidimensional points to a single dimension, allowing traditional one-dimensional indexing structures, such as the B+-tree, to be used for indexing multidimensional data. Basically, they consist of a path in multidimensional space that visits each point exactly once without crossing itself. The curve defines a total order between the points visited, making their linear ordering possible. The purpose is to preserve the spatial proximity between points, allowing a good clustering property. Indexes are then built on the reduced space, enabling the resulting data to be placed on a linear range of page memory or disk block addresses [Yu, 2002, Jagadish, 1990, Faloutsos and Lin, 1995, ip Lin et al., 1994, B.-U. Pagel and Faloutsos, 2000]. Therefore, space-filling curves become very useful in areas such as multidimensional indexing [Mamoulis, 2012, Faloutsos and Roseman, 1989].

In the context of my thesis, I studied space-filling curves and their properties as well as the challenges presented by multidimensional data. I explored the use of space-filling curves, specifically the Hilbert curve, as an access

method for indexing multidimensional points. I also tested three heuristics to search for an approximate nearest neighbor on the Hilbert curve. The main goal was to explore the potential of the Hilbert curve in low-dimensional spaces (up to thirty dimensions). My thesis can be justified by previous studies indicating that, unless the data are well clustered or correlations exist between the dimensions, nearest neighbor search may be meaningless in spaces where the number of dimensions is bigger than ten [Mamoulis, 2012]. Studies indicate as well that space-filling curves are very useful in the domain of multidimensional indexing due to their properties of good clustering, preservation of the total order and spatial proximity [Mamoulis, 2012, Faloutsos and Roseman, 1989].

1.1 Hypothesis and Methodology

In the context of my thesis, I studied space-filling curves as a direct and secondary access method in multidimensional spaces. I chose the Hilbert space-filling curve due to its superior clustering property. Additionally, I applied a similarity search technique on the Hilbert curve, where I opted for the approximate nearest neighbor to speed up the search. Formally, the hypothesis that I tried to validate was that a space-filling curve can operate in a low-dimensional space (up to thirty dimensions), in terms of indexing and search. Considering this hypothesis, there were some questions that I tried to answer:

• What is the performance of an approximate nearest neighbor search, in a Hilbert space-filling curve, in terms of relative error and runtime, compared with a basic linear search?

• What are the Hilbert space-filling curve's limitations in terms of usability?

In order to validate my thesis hypothesis and answer the questions above, I built a prototype called the approximate space-filling curve. The system basically comprises two steps: (1) mapping the original space to the Hilbert space-filling curve, generating the Hilbert index for each multidimensional point, and (2) performing the query by applying the chosen heuristic. In the first step, I started by cleaning the data and removing the duplicated points in order to obtain clear results. Then, for each multidimensional point in the original data space, a Hilbert index key is generated with the help of the library Uzaygezen 0.2¹ from Google Code, which is a space-filling curve library capable of mapping multidimensional data points into one dimension through a compact Hilbert index [Hamilton and Rau-Chaplin, 2008]. In the second step, I applied two heuristics using the Hilbert curve as a direct access method and one using the Hilbert index with a B-tree variant. The three heuristics search for an approximate nearest neighbor to a defined query point on the Hilbert curve. I evaluated the results for the three heuristics and compared them with a simple linear search.
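
To make the two-step system concrete, here is a minimal sketch in Python. It is not the thesis implementation: curve_key stands in for whatever mapping produces the Hilbert index (the Uzaygezen library in the thesis), a sorted list plus binary search stands in for the B-tree variant, and the fixed candidate window is only an illustration rather than any of the three heuristics.

import bisect

def build_index(points, curve_key):
    """Step 1: drop duplicate points, compute their curve keys and keep them sorted."""
    unique = {tuple(p) for p in points}                  # data cleaning: remove duplicates
    return sorted((curve_key(p), p) for p in unique)     # (h-value, point) pairs in curve order

def approximate_nn(index, query, curve_key, window=4):
    """Step 2: inspect only a few curve neighbors of the query's key."""
    keys = [k for k, _ in index]
    pos = bisect.bisect_left(keys, curve_key(query))     # where the query falls on the curve
    lo, hi = max(0, pos - window), min(len(index), pos + window)
    candidates = [p for _, p in index[lo:hi]]
    return min(candidates, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))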

1.2 Main contributions

The main contributions of this work are summarized below:

¹ https://code.google.com/p/uzaygezen/

• I tested the performance of the Hilbert space-filling curve as a direct access method with two heuristics up to 36 dimensions. The results indicated that the curve generates too much data, making it difficult to use the curve beyond 4 dimensions.

• I tested the performance of the same curve as a secondary key combined with a B-tree variant up to 36 dimensions. The results showed that this combination works in the six spaces tested, including the space with 36 dimensions. Compared with the linear method, it is generally faster as the number of dimensions and the order of the curve increase.

1.3 Document Outline

The remaining chapters of this document are structured as follows: Chapter 2 introduces the fundamental concepts required to understand the context of the problem, such as multidimensional data, indexing and querying, as well as the alternatives to the proposed fractal access method. Chapter 3 presents the main concepts relating to fractals and reviews some previous work done with space-filling curves in multidimensional spaces in terms of indexing and query execution. Chapter 4 presents the solution proposal and the methodology followed, the experiments and the results. Finally, Chapter 5 summarizes this document, the results achieved and the future work in this domain.

Chapter 2

Multidimensional Data

In this chapter, we will see the basic concepts of the surrounding context of this thesis proposal as well as the possible alternatives. Therefore, the first section presents a short introduction to multidimensional data and its challenges. The second section introduces the basic notion of relations in multidimensional data, which influence the choice of the access method. Section 2.3 explains the concept of vector space as the result of applying a feature transformation function in order to perform easier operations on multidimensional data, commonly used for multimedia data. This section also shows the common distance metrics used in vector spaces. Section 2.4 introduces the main queries performed in a multidimensional space with the aid of the distance metrics. It is followed by Section 2.5, which presents the alternatives to the fractals' proposal to index low-dimensional data (two to thirty dimensions). Finally, the last section sums up this chapter.

2.1 Introduction

Data is becoming increasingly complex. It contains an extensive set of attributes and sometimes needs to be seen as part of a multidimensional space. Database Management Systems (DBMSs) have undergone several changes in their access methods in order to support this type of data, also known as spatial data. It is defined by its location, which is limited by a boundary, known as the spatial extent. In a DBMS file, spatial data can be represented as point data or region data. In point data, the spatial extent comes down to its location, since a point has no area or volume. On the other hand, when the data has a spatial extent, the location refers to the centroid of the region. In two or three-dimensional space, the boundary corresponds to a closed line and to a surface, respectively. To store a data object, we can perform a geometric approximation through points, lines or other geometric figures and store the data as a feature vector. For example, as a point the vector is stored as a d-tuple where d represents the number of dimensions. The purpose is to perform easy operations on the data, but this can result in data spaces of up to hundreds of dimensions. Spatial data is often called multidimensional when the data are feature vectors and the vector space usually has more than three dimensions. Despite these differences, both concepts relate to the same subject. For convenience, we will refer to all of it as multidimensional data, since it is the broader concept. The differences, when they exist, will be referred to in due course. The challenges start when applying access methods and query processing techniques. There are several index

structures developed to deal with multidimensional data. Some are more suitable for low-dimensional spaces, some for high-dimensional ones. The purpose of this research is to study fractals and understand how their properties can help to optimize multidimensional indexing structures to index data of low dimensions (two to thirty). These low-dimensional spaces may result from the application of indexing structures oriented to high-dimensional spaces or from the application of dimensionality reduction techniques for subsequent indexing. In either situation, the resulting space, despite being of lower dimensions, still requires a large computational effort to perform operations on the data. The main problem deals with the fact that the majority of the indexes tend to suffer from the curse of dimensionality when the number of dimensions is bigger than ten. This translates into a degradation of search performance as the dimensions increase, becoming worse than a linear search [Berchtold et al., 1996, Weber et al., 1998, Böhm et al., 2001, Clarke et al., 2009, Mamoulis, 2012]. This happens because the volume of the space also increases, so fast that the data in the space become sparse [Clarke et al., 2009]. In terms of query challenges, a special property of multidimensional data states that "there is no total ordering in the multidimensional space that preserves spatial proximity" [Mamoulis, 2012]. This means that there is no guarantee that two objects close in space will also be close in the linear order. Therefore, according to Mamoulis, the objects in space cannot be physically clustered to disk pages in a way that provides theoretically optimal performance bounds for multidimensional queries.

2.2 Multidimensional Relations

It is normal to speak of multidimensional data without thinking of it as such. When we want to give our house location to someone, it is common to use one of two options: either we give the address, which gives a good location of the house, or we give a reference to something that is near it. According to [Mamoulis, 2012], our house can be defined as a spatial object since it has at least one attribute which characterizes its location. Whether the house is a detached house or an apartment, it occupies an area in two or three-dimensional space, depending on the chosen representation. Figure 2.1(a) shows this area, which is defined as the geometric extent of the spatial object. However, there is not always a geometric extent. Another example is a large-scale map where our home city is a spatial object which has a location, yet does not have a geometric extent. This is typically represented as a point on the map (see Figure 2.1(b)). Multidimensional data clearly fall in the last case, since they are represented by points or vectors without geometric extent in spaces usually with more than three dimensions. Two or more objects with the same semantics define a spatial relation. It can be represented in a table where each row and column corresponds to an object and its attributes, respectively [Mamoulis, 2012]. In Figure 2.1(a), if we choose to give a reference to something near our house, for example, the Paladares restaurant, we are establishing a spatial relation between two objects in space based on the distance between them. This can be defined using a distance metric, or implicitly through a distance range. This distance can be further classified objectively or subjectively [Mamoulis, 2012]. In Figure 2.1(a), our house is within 5 km of the nearest university. This distance can be considered near if we drive or far if we walk. These relations can also be of two other types: directional or topological. The first relates two objects based on a global reference system [Mamoulis, 2012], such as a compass rose. Following our example in Figure 2.1(a), we may say that the University is located southwest of our house.

Figure 2.1: Examples of multidimensional data. (a) A map including our house, the nearest restaurant (Paladares) and the nearest University. (b) A large-scale map representing cities, such as Lisbon and Oeiras, as points.

We can also add that our house is right in front of the Paladares restaurant. In this example, we are using a reference system defined by the viewer, in this case, us. An object can be defined by a given set of points that fill it - defining its interior - and another set that defines its frontier. The topological relation uses the notions of interior and frontier of an object to relate objects, as when we say, for instance, that our house has a garage that is adjacent to the kitchen. The words "has" and "adjacent" express topological relations since the first implies that the house contains a garage, and the second reveals a relation between the limits of two objects. There are other topological relations of intersection in addition to these, like inside, equals or overlap. There are also disjoint topological relations, which are another extension of the topological relations in multidimensional spaces. For more information on this topic, please see [Mamoulis, 2012].

2.3 Vector Space

Multidimensional data, especially high-dimensional data, arise mostly from the need to map features of multimedia objects to a high-dimensional space. Also called a feature vector space, it can represent multimedia repositories in a simple way and be used to find similar objects through similar feature vectors. The idea behind it is very simple. Take, for example, your favorite song. You can choose to represent your song as a histogram containing the percentage of each note used in the song. Then, each note will correspond to a dimension and the percentage of the note used in the song will be the value in that dimension. At the end, you will have a d-tuple that can be represented by a point in a multidimensional space. Finally, if you look for the nearest neighbor point to yours in a song repository, this will probably be the most similar song to your favorite song. Accordingly, two songs mapped to two nearby points should be more similar than two songs that map to two distant points [Ramakrishnan and Gehrke, 2002]. According to [Böhm et al., 2001], the feature transformation F can be defined as the mapping of a multimedia object Obj into a d-dimensional feature vector

F : Obj −→ R^d    (2.1)

The similarity between two objects can be determined using a distance metric dist_p, where p describes the metric

Figure 2.2: Examples of unit circles for the (a) L1, (b) L2 and (c) L∞ norms. Adapted from Wikipedia.

chosen,

dist_p(obj_1, obj_2) = ||F(obj_1) − F(obj_2)||_p    (2.2)

The L_m norm is usually known as L_p, but an m is used here to distinguish it from the point p. This norm represents the variety of metrics that we can use to define the distance between two points p and q in an arbitrary space S. Let m ∈ [1, ∞[ be a real number. The L_m metric can be defined as follows:

||p − q||_m = [ Σ_i |p_i − q_i|^m ]^(1/m)    (2.3)

According to Böhm et al., the Euclidean metric L_2 is the most common function used, and it is obtained by substituting the m in Equation 2.3 with the number 2:

||p − q||_2 = [ Σ_i |p_i − q_i|^2 ]^(1/2)    (2.4)

Basically, it consists of the straight-line distance between p and q. A unit circle using this metric is represented in

Figure 2.2(b). There are other popular functions like the Manhattan metric L_1, which is obtained by substituting the m in Equation 2.3 with the number 1:

||p − q||_1 = Σ_i |p_i − q_i|    (2.5)

This metric is named after a city because the distances are conceived as streets in a city. The distance is not a straight line because we assume that we cannot cut through a building. A unit circle using this metric is represented in Figure

2.2(a). Another metric is the maximum norm or Chebyshev metric L_∞, where the m is substituted by the infinity symbol:

||p − q||_∞ = max_i { |p_i − q_i| }    (2.6)

This metric is the limiting case of the L_m norms, where only the maximum distance between the two points along any single dimension is considered. Figure 2.2 illustrates examples of unit circles for the three norms presented here. For more information on this topic, please see [Böhm et al., 2001].
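
As a quick illustration of Equations 2.3 to 2.6 (a sketch of mine, not part of the thesis), the following Python snippet computes the three distances for two example points:

def minkowski(p, q, m):
    """L_m distance of Equation 2.3 for a real m >= 1."""
    return sum(abs(a - b) ** m for a, b in zip(p, q)) ** (1.0 / m)

def chebyshev(p, q):
    """L_infinity distance of Equation 2.6: the largest per-dimension difference."""
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
print(minkowski(p, q, 1))   # Manhattan (L1): 5.0
print(minkowski(p, q, 2))   # Euclidean (L2): about 3.606
print(chebyshev(p, q))      # Chebyshev (L_inf): 3.0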

2.4 Multidimensional Query

Similarly to relational queries, multidimensional queries are the substantial motivation for the optimization of multidimensional data management. Take again the example of your favorite song presented in the previous section. After applying a feature transformation, the idea is to find the nearest point to ours in order to discover our preferred song in a song repository. These searches are called similarity queries and there are mainly two types: range queries and nearest neighbor queries. The first refers to the spatial extent around a point and returns all regions that overlap the query region. On the other hand, nearest neighbor queries aim to find the spatial data objects close to a particular point. There are other types of queries [Mamoulis, 2012] but we will focus on these since they are the most common in d-dimensional vector spaces with d > 3. According to Böhm et al., all these queries are based on the notion of a distance metric between two points p and q in the vector space S^d with d dimensions. Let m be the metric chosen to compute the distance [Böhm et al., 2001].

2.4.1 Range Query

As the name indicates, the query is related to a range, for example, "Find all points within a radius of 5". This will depend on the distance metric previously defined. Imagine a large multimedia repository in which you want to find your favorite song. The search should retrieve all the songs with the same specifications you defined in the query. More formally, the simplest type of query searches for all the points p in the vector space S^d that are identical to our query point q:

PointQuery(S^d, q) = { p ∈ S^d | p = q }    (2.7)

Looking now at Figure 2.1(a), what if we want to search for all the restaurants that are less than 5 km away from our house? This question is a range query where the predicate is well defined - less than 5 km away from our house. More formally, this query searches for all the points p in the vector space S^d that are within a radius r defined by a distance metric m.

RangeQuery(S^d, q, r, m) = { p ∈ S^d | dist_m(p, q) ≤ r }    (2.8)

Point queries are particular cases of range queries with radius r = 0. Typically, range queries describe geometric figures in the vector space according to the metric m chosen. For example, if m relates to the

Euclidean metric (L2), the figure described will be a hypersphere and all the points inside this figure are retrieved by the query. On the other hand, if m relates to the Chebyshev metric (L∞), the figure will be a hypercube. The range query and its variants have some disadvantages because the size of the query result set is unknown, which may produce either an empty set or almost the entire space. This happens because the user must specify the radius without knowing the amount of results the query may produce.

2.4.2 Nearest Neighbor Query

For example, "Find the nearest point to mine". In Figure 2.1(a), suppose we want the nearest restaurant to our house. This query represents the classic nearest neighbor query that retrieves the point p with the shortest distance

9 to our query object q, according to the distance metric m. In case of a tie, the query retrieves all the tied points.

NNQuery(S^d, q, m) = { p ∈ S^d | ∀p' ∈ S^d : dist_m(p, q) ≤ dist_m(p', q) }    (2.9)

There are a few variants of the nearest neighbor query. We may want to specify the k nearest neighbors the query must retrieve instead of getting a single nearest neighbor, obtain an approximate nearest neighbor instead of an exact one, or combine both and get approximate k-nearest neighbors.

The k-nearest neighbor query searches for the k nearest neighbor points p_i, i ∈ [0, k − 1], such that no other point p' is closer to our query point q than any p_i.

kNNQuery(S^d, q, k, m) = { p_i ∈ S^d | ∄p' ∈ S^d \ {p_i} ∧ ∄i : dist_m(p_i, q) > dist_m(p', q) }    (2.10)

The approximate nearest neighbor query and the approximate k-nearest neighbor query follow the same idea as the previous ones, given a query point q and a defined number k of neighbors. In these cases, the query retrieves an approximation instead of an exact result, which in high-dimensional spaces can be a significant advantage in terms of computational effort. The heuristic used to find the approximate neighbor varies according to each application. For more information on multidimensional queries, please see [Böhm et al., 2001].
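
A brute-force linear scan is the baseline against which these queries can be checked, and it is also the reference used later in the experiments. The sketch below (my illustration, not the thesis code) implements the exact k-nearest neighbor query of Equation 2.10 with the Euclidean metric:

def dist2(p, q):
    """Squared Euclidean distance; the square root is not needed for ranking."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_linear(points, q, k=1):
    """Exact k nearest neighbors obtained by scanning every point."""
    return sorted(points, key=lambda p: dist2(p, q))[:k]

data = [(0, 0), (1, 2), (3, 1), (5, 5)]
print(knn_linear(data, (1, 1), k=2))   # [(1, 2), (0, 0)]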

2.5 Multidimensional Index

Database Management Systems (DBMSs) have undergone several changes in their access methods in order to support multidimensional data. In order to locate and fetch data while fulfilling minimum performance requirements, structures like indexes are of the utmost importance. Multidimensional indexing turns out to be essential to perform faster data mining and similarity search compared with traditional indexing structures. In multidimensional space, data can be seen as points or regions according to the application. To organize these points, an important factor must be taken into account over the common space: the spatial relation between them. This factor is considered when organizing the entries in the index structure [Mamoulis, 2012]. There are several structures oriented towards indexing multidimensional data. These vary according to their structure, whether or not they take the data distribution into account, whether they target points or regions, whether they are more suited to high or low-dimensional spaces, etc. For example, structures such as Kd-trees [Bentley, 1975, Bentley and Friedman, 1979], hB-trees [Lomet and Salzberg, 1990], Grid files [Nievergelt et al., 1984] and Point Quad trees, among others, are oriented towards indexing points. Structures such as the Region Quad trees [Finkel and Bentley, 1974], SKD trees [Ooi et al., 1987] or R-trees can handle both points and regions [Ramakrishnan and Gehrke, 2002]. The Subspace-tree [Wichert, 2008] and the LSD-tree are more suited to high dimensions, while the R-tree, the Kd-tree and space-filling curves are more suited to low-dimensional spaces. These lists are quite incomplete due to the wide variety of existing proposals. Although there is no consensus on the best structure for multidimensional indexing, the R-tree is the one most commonly used to benchmark the performance of new proposals. Furthermore, in commercial DBMSs, R-trees are preferred due to their suitability for points and regions and their orientation to low-dimensional spaces [Yu, 2002, Ramakrishnan and Gehrke, 2002]. The list of alternative access methods competing with fractals in low-dimensional spaces (two to thirty dimensions) is quite extensive.


Figure 2.3: R-tree planar and directory representation adapted from [Yu, 2002].

Therefore, I chose the R-tree and the Kd-tree as two alternatives to examine more closely.

2.5.1 R-tree

Since multidimensional indexes are primarily conceived for secondary storage, the notion of a data page is a concern. These access methods perform a hierarchical clustering of the data space, and the clusters are usually covered by page regions. Each data page is assigned a page region, which is a subset of the data space. The page regions vary according to the index, and in the R-tree case the page region is called a minimum bounding rectangle (MBR) [Böhm et al., 2001]. At the highest resolution level, each MBR represents a single object and is stored on the lower leaf nodes (see Figure 2.3). These leaves also contain an identifier of a multidimensional point or region that represents some data. The R-tree descends from the B+-tree and therefore preserves the height-balanced property. Thus, at the higher levels we will have clusters of objects, and since they are represented by rectangles, we will also have clusters of rectangles [Yu, 2002, Ramakrishnan and Gehrke, 2002]. To insert a new object in the tree, the search for the right leaf describes a single path traversed from the root to the leaf node. At each level, the base criterion is to find the covering box that needs the least enlargement to include the new object. In the event of a tie, we choose the node whose covering box has the smaller area. This rule is applied until we reach a leaf node. Here, if the node is not full, the insertion is straightforward. The object is inserted, and the box area is enlarged in order to cover it. The enlargement must be propagated to the parent nodes in order for the tree to remain coherent. But if the leaf node is full, the node is split into two, which can generate recursive splits across the branch or even increase the height of the tree [Yu, 2002, Ramakrishnan and Gehrke, 2002]. On the other hand, the deletion of an object may alter the upper nodes. If a leaf node underflows, the node is deleted and all the remaining entries are reinserted into the tree. Therefore, the deletion may cause re-adjustments upwards and downwards.


Figure 2.4: Kd-tree planar and directory representation adapted from [Böhm et al., 2001].

In order to find elements that are encompassed in a given query rectangle, all the nodes whose MBRs overlap the query are traversed until reaching a leaf node. There, the MBRs are tested against the query rectangle, and the data is fetched if the intersection is not empty [Yu, 2002]. The R-tree search performance relates to the size of the MBRs. Note that if there are too many overlapping regions, a search will result in intersections with several subtrees of a node. This requires traversing them all, and we must consider that the search occurs in a multidimensional space. Thus, the minimization of both coverage and overlap has a large impact on the performance [Yu, 2002, Ramakrishnan and Gehrke, 2002]. Other proposals based on the R-tree have been presented over the years. The R*-tree tries to improve the R-tree by choosing coverage closer to a square shape rather than the traditional rectangle. This shape allows a reduction in coverage area by reducing the overlap and hence improving the performance of the tree. Like the R*-tree, many other structures have been proposed based on the R-tree, such as the R+-tree, the X-tree and the Hilbert R-tree, among others. However, there is also another problem that haunts these hierarchical indexing structures, known as the curse of dimensionality. It refers to the degradation of performance of these structures with the increasing number of dimensions. The main underlying cause relates to the volume of the chosen form that covers the points: for a constant radius or edge size, this volume increases exponentially with the number of dimensions [Berchtold et al., 1996, Wichert, 2009]. Nevertheless, the R-tree is the most popular multidimensional access method and has been used as a benchmark to evaluate new structural proposals [Yu, 2002, Ramakrishnan and Gehrke, 2002].
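
The insertion criterion described above (least enlargement, with ties broken by the smaller area) can be sketched in a few lines. This is only an illustration of the rule, assuming two-dimensional MBRs stored as (xmin, ymin, xmax, ymax) tuples; it is not an R-tree implementation from the works cited here.

def area(r):
    """Area of an MBR given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])

def enlarge(r, p):
    """Smallest MBR covering both rectangle r and point p."""
    return (min(r[0], p[0]), min(r[1], p[1]), max(r[2], p[0]), max(r[3], p[1]))

def choose_subtree(children, p):
    """Pick the child MBR needing the least enlargement to include p; ties go to the smaller area."""
    return min(children, key=lambda r: (area(enlarge(r, p)) - area(r), area(r)))

children = [(0, 0, 2, 2), (3, 3, 6, 6)]
print(choose_subtree(children, (2.5, 2.5)))   # (0, 0, 2, 2) needs the smaller enlargement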

2.5.2 Kd-tree

Throughout the years, several hierarchical indexes have been proposed to handle the multidimensional data space. These structures evolve from the basic B+-tree index and can be grouped into two classes: indexes based on the R-tree and indexes based on the K-dimensional tree (Kd-tree). Unlike the R-tree, the Kd-tree is unbalanced, point-oriented and memory-oriented (see Figure 2.4). It divides the data space through hyper-rectangles, resulting in mutually disjoint subsets. As in the R-tree, the hyper-rectangles correspond to the page regions, but in this case most of them do not represent any object [Böhm et al., 2001, Wichert, 2009]. If a point is on the left of the splitting hyper-rectangle it will be represented on the left side of the tree, otherwise it will be represented on the right side. The advantage is that the choice of which subtree to search is always unambiguous [Bentley, 1975, Yu, 2002]. Adding an element to this tree follows the same process as in any other tree structure. We start at the root, and we decide between the left or right branch according to whether the element's position is on the left or right side of the data

space. Once we find the parent node below which the element should be located, we insert the node on the right or left side according to the element's position relative to the parent node. There are some options to remove an element from this tree, but in general it is easier than in the R-tree because all leaf nodes with the same parent form a disjoint hyper-rectangle of the data space. Böhm et al. note that, for this reason, the leaves can be merged without violating the conditions of completeness. There are some disadvantages to having completely disjoint partitions of the data space. Böhm et al. note that the page regions are normally larger than necessary, which can lead to an unbalanced use of the data space, resulting in more accesses than with the MBRs. The R-tree only covers the space containing data, unlike the Kd-tree. The latter is unbalanced by definition, in such a way that there are no direct connections between the subtree structures [Yu, 2002]. This makes it impossible to pack contiguous subtrees into directory pages. The Kd-B-tree [Robinson, 1981] results from a combination of the Kd-tree with the B-tree [Comer, 1979] and solves this problem by forcing splits [Böhm et al., 2001, Mamoulis, 2012]. This culminates in a balanced resulting tree, like the B-tree, overcoming the problem above. Creating complete partitions has its drawbacks. Namely, the rectangles can be too big, containing only a small cluster of points, or the reverse, a rectangle may have too many points, becoming overloaded. Thus, more adjustable rectangles may bring better performance. The hB-tree tries to overcome this problem by decreasing the partition area using holey bricks. However, the hB-tree is not ideal for disk-based operation [Yu, 2002]. Berchtold et al. state that all these index structures are restricted with respect to the data space partitioning. They also suffer from the well-known drawbacks of multidimensional index structures such as high costs for insert and delete operations and poor support for concurrency control and recovery [Berchtold et al., 1998].
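
As an illustration of the insertion procedure just described, here is a minimal Kd-tree sketch (my own, not code from any of the cited works): each level splits on one dimension, cycling through the dimensions.

class KdNode:
    def __init__(self, point, axis):
        self.point, self.axis = point, axis
        self.left = self.right = None

def kd_insert(root, point, dims):
    """Insert a point, descending left or right on the current splitting axis."""
    if root is None:
        return KdNode(point, axis=0)
    node = root
    while True:
        next_axis = (node.axis + 1) % dims
        if point[node.axis] < node.point[node.axis]:
            if node.left is None:
                node.left = KdNode(point, next_axis)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = KdNode(point, next_axis)
                return root
            node = node.right

tree = None
for p in [(3, 6), (2, 7), (17, 15), (6, 12)]:
    tree = kd_insert(tree, p, dims=2)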

2.6 Summary

This chapter introduced the basic concepts relating to multidimensional data and its challenges within the context of this thesis. After the introduction, we saw the multidimensional relations (Section 2.2) that later influence multidimensional indexing. We also saw the vector space (Section 2.3), which basically consists of a simple representation of the data, typically multimedia data. The vector space allows searching for similar objects since they are represented in the same way. This similarity search is performed using distance metrics, also described in that section. In Section 2.4, we saw how to perform similarity search queries. Finally, in Section 2.5, we saw the most common access methods used in lower dimensional spaces (two to thirty dimensions) as alternatives to the space-filling curves, which will be introduced in the following chapter. However, all these index structures are restricted with respect to the data space partitioning. Additionally, they suffer from the well-known drawbacks of multidimensional index structures, such as high costs for insert and delete operations and poor support for concurrency control and recovery.

Chapter 3

Fractals

In the previous chapter, we saw basic concepts of the surrounding context of this thesis proposal as well as the possible alternatives. In this chapter, we will see the fundamental concepts for understanding and reasoning about the thesis proposal. Thus, this chapter starts with a short introduction to fractals, followed by Section 3.2, which explains the concept of the Hausdorff dimension. Section 3.3 describes the space-filling curves, which are computer-generated fractals. Section 3.4 describes how a space-filling curve can be generated for any dimension. The following Section 3.5 describes a special property of the space-filling curves that can be very useful when they are used as an access method. Section 3.6 describes how the Hilbert curve, which is the most promising space-filling curve, has been used as an access method. Finally, the last Section 3.7 summarizes this chapter.

3.1 Introduction

Benoit B. Mandelbrot introduced the concept of a fractal in 1975, as an object that can be broken into several parts, each one similar to the original object [Mandelbrot, 1975]. Examples of fractals are forests, mountains, leaves, galaxies, etc. They all have a self-similarity that repeats at distinct levels of magnitude, with different similarity ratios in different directions. In the traditional geometric point of view, objects are abstracted and based on the classic geometric figures like squares, triangles or others [Frame and Mandelbrot, 2002]. Fractal geometry will make us all see the world differently [Barnsley, 2013].

Clouds are not spheres, mountains are not cones, coastlines are not circles, and bark is not smooth, nor does lightning travel in a straight line.

Mandelbrot in [Mandelbrot, 1982]

A closer look at a cloud or a mountain reveals a degree of repetition between the parts and the whole. Fractal geometry allows us to construct precise models of all kinds of physical structures. According to Frame and Mandelbrot, it allows us to describe the shape of a cloud as an architect describes a house. It also relies on a special notion of dimension that I will introduce below.

15 3.2 The Hausdorff Dimension

Usually when you go to the cinema you probably say, "I am going to see Big Hero 6 in 3D!". You don't say, "I am going to see Big Hero 6 in 2.7D!". In fractal geometry, it is possible to use a fraction as the value of an object's dimension. This is a classical mathematical subject called the Hausdorff Dimension, which is often forgotten due to a lack of interest or ignorance about its potential [Falconer, 2007, Mandelbrot, 1982]. Fractal geometry relies on this special notion of dimension - the Hausdorff Dimension [Mandelbrot, 1982]. The idea of a line having dimension one and a surface having dimension two may not hold in this domain. For example, in classical geometry two coordinates are required to define a point. In fractal geometry, a point can be seen as lying on a line, thus we need a single coordinate to represent it [Mandelbrot, 1982]. An object can have a dimension bigger than one, expressing an infinite length and, simultaneously, smaller than two, corresponding to a zero area [Falconer, 2007].

Figure 3.1: Ratio similarity example with line segments: a line A of length L = 9 decomposed into segments B of length s = 3.

A line can be decomposed into N parts, which together are the size of the original (see Figure 3.1). Although

the previous statement is obvious, it transmits information that will be understood in more detail below. Imagine a

line A of length L and N subsegments B of length s. The union of the N segments B is equal to A. Therefore, the length of B can be derived by dividing the length of A by the number N of segments, known as the scale factor:

s = L / N    (3.1)

Or it can be defined in terms of the similarity ratio:

r = N / L    (3.2)

Figure 3.2: Ratio similarity example with squares (r = 1/2).

In Figure 3.1, the line A has length L = 9 and can be broken into 3 (N) segments B. Note that the length of B is obtained by dividing the length of A by the number of segments, thus 9/3 = 3, and the ratio of similarity corresponds to r = 1/3 since the line segments are 1/3 the size of the original. Following the same line of thought, it is possible to increase the complexity by thinking now in squares. In Figure 3.2 the ratio of similarity corresponds to r = 1/2 since the side length of the inner square corresponds to half the side of the outer square. In Euclidean geometry, the notion of the similarity ratio is not very useful to grasp the self-similarity relationship between large and small objects. Therefore, the Hausdorff Dimension, also known as the fractal dimension, gives a more complete notion of dimension and self-similarity. Thus, supposing a figure that can be decomposed into N sub-figures of size s, the Hausdorff Dimension D is obtained through this formula:

D = log(N) / log(s)    (3.3)

Analyzing dimension 2 for a square, the variations of the two variables are shown in Figure 3.3. N varies according to the number of cuts the outer square suffers. When the side length of the outer square is divided into two (s = 2), this results in two segments along each axis with half of the original size. As the dimension of the square is two, the result is four inner squares; this last number is obtained using Equation 3.4, which is a variation of Equation 3.3 and corresponds to the number of elements needed to cover the outer square, in this case 4. The same analogy applies when the side length is divided into three.

N = s^D    (3.4)

Figure 3.3: The variation of the variables s and N of Equation 3.3 for dimension D = 2 (s = 2, N = 4 and s = 3, N = 9).

Consequently, if a square, which is a figure that has dimension two, has a curve that covers its entire area, we must conclude that the curve also has dimension two. The Hilbert curve is an example of a curve that has dimension two while still being a line in space. On the other hand, for curves such as the snowflake, the Hausdorff Dimension is a fraction, since D = log(4)/log(3) [Mandelbrot, 1982].
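
A quick numerical check of Equation 3.3 (a small sketch of mine, not from the thesis):

import math

def hausdorff_dimension(n_parts, scale):
    """D = log(N) / log(s), Equation 3.3."""
    return math.log(n_parts) / math.log(scale)

print(hausdorff_dimension(9, 3))   # a square split into 9 copies at scale 3 -> 2.0
print(hausdorff_dimension(4, 3))   # the snowflake curve: 4 copies at scale 3 -> about 1.26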

17 3.3 Space-Filling Curves

A fractal can be found in Nature, as we saw, or can be computer generated thanks to its self-similarity property. A space-filling curve is an example of a computer-generated fractal. Jordan, in 1887, formally defined a curve as a continuous function with endpoints whose domain is the unit interval [0,1] [Moon et al., 2001, Hales, 2007], but the space-filling curve is more than a simple curve. It has special properties. It consists of a path in multidimensional space that visits each point exactly once without crossing itself, hence introducing a total ordering on the points. However, in the previous chapter, we saw that there is no total ordering that fully preserves the spatial proximity between the multidimensional points in a way that provides theoretically optimal performance for spatial queries. According to Moon et al., once the problem of the total order is overcome, it is possible to use the traditional one-dimensional indexing structures for multidimensional data with good performance on similarity queries [Moon et al., 2001]. So, the reader must be thinking about how we can represent a space-filling curve. Well, we assume that every space has a finite granularity and it is possible to see the space as grid cells according to the Hausdorff dimension. A space-filling curve starts with a basic curve called the first-order curve. To obtain the curves of the following orders, we replace each vertex of the curve with the previous order curve (see Figure 3.4). The order i − 1 curve can be rotated or mirrored to form the curve of order i [Jagadish, 1990]. The curves can vary along the axes, order and dimension. The first order of a curve in 2-dimensional space has 4 cells (2^(2×1)). The scale factor for the space-filling curves is 2. Therefore, we can rewrite Equation 3.4 to calculate the number of cells for each order o, obtaining

N = 2^(D·o)    (3.5)

According to this equation, the second order curve has 16 cells, the third has 64 cells and so on. There are several ways to sort the grid, but some of them preserve the locality better than others. The purpose is to keep the spatial locality, or in other words, points that are close in space should also be close in the linear order [Yu, 2002, Jagadish, 1990, Mamoulis, 2012, Moon et al., 2001]. So, the reader must now be thinking about how we can represent the points on the curve. Well, each vertex of the curve corresponds to a possible point. Each axis is represented using a binary code and their combination results in the point's location. The curves can be used as a secondary access method combined with a simple B-tree [Comer, 1979]. Some of the curves use the traditional binary code and some use the Gray code, as we will see below.

3.3.1 Z-Curve

The Z-curve is a space-filling curve and was introduced for the first time by Morton in 1966. This curve, as its name suggests, has a "Z" (or "N") shape, as you can see in Figure 3.4. In this figure, the curve starts on the lower left corner and ends on the upper right corner, but this can be changed as long as the curve keeps the "Z" shape. The following orders are derived by replacing each vertex with the previous order curve and then connecting the figures.

Bit Shuffling A space-filling curve maps an n-dimensional point to a single dimension. Each space-filling curve has its own algorithm to do the mapping. The Z-curve performs a bit interleaving of the coordinate values in order to generate the so-called z-value, which is the value on the curve. Since there are a few algorithms to calculate

18 Figure 3.4: Z-curve orders 1, 2, and 3 respectively. the z-value [Faloutsos and Roseman, 1989], we select one to present here. The Z-curve bit shuffling function in two-dimensional space will be represented as follows [Wichert, 2015]:

z = f(x, y) (3.6)

In the Cartesian axes for two dimensions, the point coordinates can be represented by x and y using decimal notation. In this case, the coordinates are represented using a binary code of maximum length 2n. Therefore, the two coordinates can be expressed as x = (x_1x_2)_(2) and y = (y_1y_2)_(2), where the indices correspond to the bit positions in the binary representation. As stated before, the Z-curve uses bit interleaving in order to generate the z-value, thus Expression 3.6 can be rewritten as

z = (x_1y_1x_2y_2)_(2)    (3.7)

In Figure 3.5 it is easy to see how the z-value is generated. To obtain the upper left value on the grid, we interleave the x and y bits. Starting with x = 0_(2) and y = 1_(2), the result will be z = 01_(2). The same applies to the other z-values.


Figure 3.5: Z-curve first order bit shuffling.
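
The bit interleaving above is straightforward to implement. The following sketch (my illustration, not from the thesis) computes the z-value for coordinates with a given number of bits per dimension:

def z_value(x, y, bits):
    """Interleave the bits of x and y, most significant bit first: z = x1 y1 x2 y2 ..."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)   # next bit of x
        z = (z << 1) | ((y >> i) & 1)   # next bit of y
    return z

print(z_value(0, 1, 1))   # first-order curve: x = 0, y = 1 -> z = 1 (binary 01)
print(z_value(1, 2, 2))   # second-order curve: x = 01, y = 10 -> z = 6 (binary 0110)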

Queries Space-filling curves allow performing range queries and, usually, other types of similarity search are transformed into range queries. A square or rectangle is regularly used to represent the query area in the grid.

Take Figure 3.6 as an example. Suppose the query point has the coordinates x = 01_(2) and y = 10_(2); therefore, z = 0110_(2), or z = 6_(10) using decimal notation. In this figure, the darker gray square is the query point, and

the lighter gray one covers the query area. The latter corresponds to four ranges in the linear mapping.

R_1 = {1}, R_2 = {3, 4, 5, 6, 7}, R_3 = {9}, R_4 = {12, 13}


Figure 3.6: Z-curve query example transformed into linear mapping.

These ranges are also called clusters. They represent agglomerates of points that can be read at once. Another way to analyze the clusters has to do with the number of entry or exit points of the search area. Looking at this example, we can conclude that the Z-curve generated four clusters since it has four entry (or exit) points. This clustering property of space-filling curves attracts considerable attention due to the benefits it can bring. For example, a good clustering of multidimensional data on a disk translates into a reduction in the disk (or memory) accesses required for a range query [Moon et al., 2001]. The path that the curve describes is very important in this domain. The Z-curve has a few long jumps that could be rearranged in order to reduce the number of clusters.
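
The four ranges above can be reproduced in a few lines. This sketch (mine, for illustration) collects the z-values of all cells in the 3x3 query window around the query point of Figure 3.6 and groups consecutive values into clusters; applying the same grouping to Hilbert h-values would yield fewer clusters, as shown in the next subsection.

def z_value(x, y, bits=2):
    """Interleave the bits of x and y, most significant bit first."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

def clusters(cells):
    """Group a set of curve values into runs of consecutive values (the clusters)."""
    runs, run = [], []
    for v in sorted(cells):
        if run and v != run[-1] + 1:
            runs.append(run)
            run = []
        run.append(v)
    if run:
        runs.append(run)
    return runs

# 3x3 query window centered on the query cell (x = 1, y = 2) of Figure 3.6
window = {z_value(x, y) for x in (0, 1, 2) for y in (1, 2, 3)}
print(clusters(window))   # [[1], [3, 4, 5, 6, 7], [9], [12, 13]]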

3.3.2 Hilbert Curve

Using a similar technique, the Hilbert curve is another space-filling curve; it is based on the Peano curve and comes as an improvement over it [Faloutsos, 1988]. The Peano basic curve has the shape of an upside-down "U". The curves of higher order are obtained by replacing the lower vertices with the previous order curve. The upper vertices are also replaced by the previous order curve, but rotated by 180 degrees. The resulting curves can be seen in [Faloutsos, 1988]. The Hilbert curve starts with the same basic Peano curve but evolves differently.

Figure 3.7: Hilbert Curve orders 1, 2, and 3 respectively.

On the following orders, the top vertices are replaced by the previous order curve, and the bottom vertices suffer a rotation (see Figure 3.7). The bottom left vertex is rotated 90 degrees clockwise, and the bottom right rotates 90 degrees counterclockwise. Figure 3.7 shows the Hilbert curve of orders one, two and three. In this figure, the curve starts on the lower left corner and ends on the lower right corner, but this can be changed as long as the curve keeps the "U" shape. The Hilbert curve in three dimensions can have different Hamiltonian paths for the first-order curve, depending on the defined entry and exit points of the cube. Figure 3.8 shows an example of this curve in three dimensions.

Figure 3.8: Hilbert Curve in three dimensions from order one to three respectively [Dickau, 2015]

Bit Shuffling Analogous to the z-value, the Hilbert curve has its own h-value. This curve, like the Peano curve, uses the Gray code to order its values. The Gray code is a binary code system in which every two adjacent numbers differ in only one digit (see Table A.1). There are several algorithms [Faloutsos and Roseman, 1989] to compute the h-value. However, we will see a simple explanation in order to get a basic comprehension. The Hilbert curve bit shuffling function for two-dimensional space can be represented as follows [Wichert, 2015]:

h = f(x, y) (3.8)

The two coordinates can be expressed as x = (x_1x_2)_(2) and y = (y_1y_2)_(2), where each one is a bit string of length two. To compute the h-value, we can follow four basic steps:

1. Perform the bit interleaving;

2. Split the resulting bit string into substrings and represent them using decimal notation;

3. Apply the ”0-3 rule”;

4. And finally, concatenate the strings to obtain the h-value.

As a result of the first step we obtain:

h = x1y1x2y2(2) (3.9)

In the second step, we split the previous string into n substrings, where n corresponds to the bit length of each coordinate, each substring having a length equal to the number of coordinates. After splitting, the substrings are converted to the decimal notation having the Gray code as reference. Thus,


Figure 3.9: Hilbert second-order curve - location of the h-value 12

h = x1y1x2y2(2)

s1 = x1y1(2), s2 = x2y2(2)

d1 = s1(10), d2 = s2(10)

d = [d1, d2] (3.10)

The third step applies a rule that I informally call the "0-3 rule":

• If 0 is present in d, change every following occurrence of 1 to 3 and vice-versa;

• If 3 is present in d, change every following occurrence of 0 to 2 and vice-versa.

After applying these changes to the array d, its values are converted back to binary code and concatenated in order to obtain the h-value.

d = [d1, d2]

d1 = x1y1(2), d2 = x2y2(2) (3.11)

h = x1y1x2y2(2)

In order to clarify the explanation, we will see an example of an h-value computation in Figure 3.9. In this example, we intend to obtain h = 12 using its coordinates x = 11 and y = 01. First, we apply the bit interleaving technique using the two coordinates. Then, (2) we split the string into two substrings and convert them to the decimal notation having the Gray code as reference (see Table A.1).

h = 1011(2)

s1 = 10(2), s2 = 11(2)

d1 = 3(10), d2 = 2(10)

d = [3, 2] (3.12)

Afterwards, (3) we apply the "0-3 rule". Since the first number is 3, it is necessary to swap every following 2 for 0 and vice-versa.

d = [3, 2]

d = [3, 0] (3.13)

Finally, (4) we convert d to binary code and concatenate the substrings.

s1 = 11(2), s2 = 00(2)

h = 1100(2) (3.14)

h = 12(10)

This algorithm only works up to three dimensions. A more general one will be introduced in section 3.4.
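The four steps above can be condensed into a small routine. The sketch below, with hypothetical names, follows the described procedure for 2-bit coordinates (a second-order curve); it is only an illustration of this particular case, not a general implementation.

```java
public class HilbertBitShuffle2D {
    // Gray code and its inverse for 2-bit values.
    static int gc(int i) { return i ^ (i >> 1); }
    static int gcInverse(int g) {
        for (int i = 0; i < 4; i++) if (gc(i) == g) return i;
        throw new IllegalArgumentException();
    }

    // Computes the h-value of a point with 2-bit coordinates, following the
    // four steps described above.
    static int hValue(int x, int y) {
        int[] d = new int[2];
        // Steps 1 and 2: interleave one bit of x and y per level and convert
        // each 2-bit substring to decimal via the Gray code.
        for (int level = 0; level < 2; level++) {
            int xi = (x >> (1 - level)) & 1;
            int yi = (y >> (1 - level)) & 1;
            d[level] = gcInverse((xi << 1) | yi);
        }
        // Step 3: the "0-3 rule".
        for (int i = 0; i < 2; i++) {
            if (d[i] == 0) {
                for (int j = i + 1; j < 2; j++) d[j] = (d[j] == 1) ? 3 : (d[j] == 3) ? 1 : d[j];
            } else if (d[i] == 3) {
                for (int j = i + 1; j < 2; j++) d[j] = (d[j] == 0) ? 2 : (d[j] == 2) ? 0 : d[j];
            }
        }
        // Step 4: concatenate the binary representations of d.
        return (d[0] << 2) | d[1];
    }

    public static void main(String[] args) {
        System.out.println(hValue(0b11, 0b01)); // prints 12, as in the worked example
    }
}
```

Calling hValue(0b11, 0b01) reproduces the worked example and returns 12.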

Queries The queries on the Hilbert Curve are done in the same way as described for the Z-curve, but the resulting clusters are slightly different. Let's analyze the same query done for the Z-curve, now with the Hilbert curve (see Figure 3.10).

Suppose the query point has the coordinates x = 01(2) and y = 10(2); therefore, z = 6(10) and h = 7(10). For the same grid with 4x4 cells and for the same query point, the same range query produces two different mappings. The Hilbert curve maps only two clusters compared to the four generated by the Z-curve. This means that the Hilbert curve has a better clustering property, which translates into a reduction in disk (or memory) accesses required to perform a range query [Moon et al., 2001].


Figure 3.10: Hilbert Curve clusters generation compared with Z-curve for the same query.

There are several ways to perform similarity search on space-filling curves [Chen and Chang, 2011, Faloutsos, 1988]. Suppose we want to find the nearest gas station to a certain point X. Faloutsos and Roseman suggest that we follow this algorithm:

1. Calculate the h-value of X;

2. Find the X’s preceding and succeeding points on the Hilbert path until one of the points corresponds to the gas station;

3. Calculate the distance d from the X location to the gas-station location;

4. And check all the points within d blocks of X.

Note that the fourth step is done because we intend to find the nearest neighbor of X. If an approximate value is sufficient, the algorithm could end at the third step as soon as it finds the first gas station.
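Assuming the occupied h-values are kept in an ordered structure, the first two steps of this procedure amount to looking up the predecessor and successor of the query's h-value on the curve. The sketch below, with hypothetical names, shows only this approximate variant; mapping the result back to coordinates and checking all points within d blocks (steps 3 and 4) is omitted.

```java
import java.util.TreeSet;

public class CurveSearch {
    // Returns the occupied h-value closest to the query's h-value along the
    // curve, i.e. the nearer of its predecessor and successor (steps 1 and 2).
    static Long approximateNeighbour(TreeSet<Long> occupied, long queryH) {
        Long below = occupied.floor(queryH - 1);
        Long above = occupied.ceiling(queryH + 1);
        if (below == null) return above;
        if (above == null) return below;
        return (queryH - below <= above - queryH) ? below : above;
    }

    public static void main(String[] args) {
        TreeSet<Long> gasStations = new TreeSet<>();
        gasStations.add(3L); gasStations.add(9L); gasStations.add(14L);
        System.out.println(approximateNeighbour(gasStations, 7L)); // prints 9
    }
}
```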

3.4 Higher Dimensions

We have seen how the space-filling curves work in 2 dimensions. In this section, we will see the behavior of the Hilbert curve when the number of dimensions is greater than 2. The concepts introduced in this section are based on [Hamilton, 2006]. The author takes a geometric approach to the curve in order to extend the concept to higher dimensions. We will focus on the Hilbert curve since it generally presents the best clustering property, a subject on which we will focus in the next section. To simplify the explanation, the author chose to use the Hilbert first-order curve rotated 90 degrees clockwise and then 180 degrees along the horizontal axis (see Figure 3.11).


Figure 3.11: Hilbert curve analysis to higher dimensions.

In order to increase the curve dimensions, we need to generalize its construction and its representation. If we look closely at the Hilbert first-order curve in two dimensions (see Figure 3.11), we can observe an unfinished square with 2 × 2 vertices. Each one of them represents a single point. On the second-order curve, we have 2^2 × 2^2 vertices due to the recursive construction of the curve. Generalizing, we will have 2 × · · · × 2 = 2^n points in n dimensions, corresponding to the vertices of a hypercube. Each of the 2^n vertices is represented by an n-bit string such as

b = [βn−1 · · · β0] (3.15)

where βi ∈ B takes 0 (low) or 1 (high) according to its position on the grid. Looking at Figure 3.11, we can conclude that b = [00] corresponds to the lower left vertex of the square and b = [11] corresponds to the top right vertex of the square. Along the curve, all the vertices are immediate neighbors. This means that their binary representation only changes in one bit. The directions of the curve inside the grid are therefore represented using the Gray code (see Table A.1). The function gc(i) returns the Gray code value for the ith vertex. The symbols ⊕ and >> represent the exclusive-or and the logical shift right respectively.

gc(i) = i ⊕ (i >> 1) (3.16)

Applying this equation to the four vertices in Figure 3.11, we have:

gc(0) = 00 ⊕ (00 >> 1) = 00 ⊕ 00 = 00(2)

gc(1) = 01 ⊕ (01 >> 1) = 01 ⊕ 00 = 01(2)

gc(2) = 10 ⊕ (10 >> 1) = 10 ⊕ 01 = 11(2)

gc(3) = 11 ⊕ (11 >> 1) = 11 ⊕ 01 = 10(2) (3.17)
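In code, gc and its inverse (which will be needed later as gc^{-1}) are short routines. The following sketch, with hypothetical names, reproduces Equation 3.16 and the values of Equation 3.17.

```java
public class GrayCode {
    // gc(i) = i XOR (i >> 1), Equation 3.16.
    static int gc(int i) {
        return i ^ (i >>> 1);
    }

    // Inverse Gray code: recovers i such that gc(i) == g, by XOR-ing all the
    // right shifts of g together.
    static int gcInverse(int g) {
        int b = g;
        for (int shifted = g >>> 1; shifted != 0; shifted >>>= 1) {
            b ^= shifted;
        }
        return b;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            System.out.printf("gc(%d) = %s%n", i, Integer.toBinaryString(gc(i)));
        }
        // Prints 0, 1, 11, 10 (leading zeros omitted) - matching Equation 3.17.
        System.out.println(gcInverse(0b10)); // prints 3
    }
}
```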

The curve is constructed by replacing each vertex with the previous-order curve. Each replacement is a transformation/rotation that generates a new sub-hypercube. These operations must make sense along the 2^n vertices. By sense, I mean that "every exit point of the curve through one sub-hypercube must be immediately adjacent to the entry point of the next sub-hypercube" [Hamilton, 2006]. The author defines an entry point e(i), where i refers to the ith sub-hypercube.

e(i) = 0, if i = 0; gc(2⌊(i − 1)/2⌋), if 0 < i ≤ 2^n − 1 (3.18)

The entry point e(i) and the exit point f(i) are symmetric and the latter is defined as

f(i) = e(2^n − 1 − i) ⊕ 2^{n−1} (3.19)

Even at the first-order curve, we can look at each vertex as a sub-hypercube. Therefore, each one has an entry and an exit point. The entry and exit points for the Hilbert first-order curve presented in Figure 3.12 are calculated now:

e(0) = 00(2)            f(0) = e(3) ⊕ 10 = 01(2)

e(1) = gc(0) = 00(2)    f(1) = e(2) ⊕ 10 = 10(2)

e(2) = gc(0) = 00(2)    f(2) = e(1) ⊕ 10 = 10(2)

e(3) = gc(2) = 11(2)    f(3) = e(0) ⊕ 10 = 10(2) (3.20)

We can look at Figure 3.12 and observe that, to generate the first vertex/sub-hypercube, the curve enters at the lower left corner (00) and exits at the lower right corner (01). The next vertex is generated by entering at the lower left corner (00) and exiting at the upper right corner (10). And so on. Additionally, we need to know along which coordinate our next neighbor in the curve lies. The directions between the sub-hypercubes are given by the function g(i), where i refers to the ith sub-hypercube.

g(i) = k such that gc(i) ⊕ gc(i + 1) = 2^k (3.21)

Calculating g(i) for the Hilbert first-order curve, we have that

g(0) = k, 00 ⊕ 01 = 01 = 2^0, k = 0

g(1) = k, 01 ⊕ 11 = 10 = 2^1, k = 1

g(2) = k, 11 ⊕ 10 = 01 = 2^0, k = 0 (3.22)

i    e(i)     f(i)     d(i)    g(i)
0    [00]2    [01]2    0       0
1    [00]2    [10]2    1       1
2    [00]2    [10]2    1       0
3    [11]2    [10]2    0       -

Figure 3.12: Analysis of the entry and exit points of the Hilbert first-order curve in two dimensions. The author notes that x corresponds to the least significant bit and y to the most significant. Adapted from [Hamilton, 2006]

This means that, in Figure 3.12, the next neighbor of the vertex i = 0 is along the x coordinate. In this case, x is the least significant bit. For i = 1, the next neighbor is along the y coordinate. The author notes that g(i) can also be calculated as g(i) = tsb(i), where tsb is the number of trailing set bits in the binary representation of i. The g(i) indicates the inter sub-hypercube direction along the curve. We can also calculate the intra direction d(i) on the sub-hypercubes.

d(i) = 0, if i = 0; g(i − 1) mod n, if i ≡ 0 (mod 2); g(i) mod n, if i ≡ 1 (mod 2); for 0 ≤ i ≤ 2^n − 1 (3.23)

Again, for the Hilbert first-order curve, we have

d(0) = 0

d(1) = g(1) mod 2 = 1 mod 2 = 1

d(2) = g(1) mod 2 = 1 mod 2 = 1

d(3) = g(3) mod 2 = 2 mod 2 = 0 (3.24)
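A compact sketch of the functions introduced so far, using hypothetical names and assuming g(i) is computed as the number of trailing set bits, is shown below; for n = 2 it reproduces the values of the table in Figure 3.12.

```java
public class HilbertFunctions {
    static int gc(int i) { return i ^ (i >>> 1); }

    // Entry point of the i-th sub-hypercube (Equation 3.18).
    static int e(int i) {
        return (i == 0) ? 0 : gc(2 * ((i - 1) / 2));
    }

    // Exit point of the i-th sub-hypercube (Equation 3.19); n = number of dimensions.
    static int f(int i, int n) {
        return e((1 << n) - 1 - i) ^ (1 << (n - 1));
    }

    // Inter sub-hypercube direction: number of trailing set bits of i.
    static int g(int i) {
        return Integer.numberOfTrailingZeros(~i);
    }

    // Intra sub-hypercube direction (Equation 3.23).
    static int d(int i, int n) {
        if (i == 0) return 0;
        return (i % 2 == 0) ? g(i - 1) % n : g(i) % n;
    }

    public static void main(String[] args) {
        int n = 2;
        for (int i = 0; i < (1 << n); i++) {
            System.out.printf("i=%d e=%s f=%s d=%d%n",
                    i, Integer.toBinaryString(e(i)), Integer.toBinaryString(f(i, n)), d(i, n));
        }
        // Reproduces the table of Figure 3.12 (values printed without leading zeros).
    }
}
```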

All these functions can be calculated for any dimension, allowing us to construct a Hilbert curve of higher dimensions. Now that the basic functions are introduced, we can pass to the main function. The author's main idea is to define a geometric transformation T such that "the ordering of the sub-hypercubes in the Hilbert curve defined by e and d will map to the binary reflected Gray code" [Hamilton, 2006]. The operator x ≫ i is defined as the right bit rotation; it rotates the n bits of x to the right by i places. A more formal definition is given in [Hamilton, 2006].

Te,d(b) = (b ⊕ e) ≫ (d + 1) (3.25)
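A minimal sketch of this transformation, assuming the rotation acts on the n low-order bits of b ⊕ e and using hypothetical names, could look as follows. It reproduces, for instance, T3,0(01) = 01 from the last iteration of the example below.

```java
public class HilbertTransform {
    // Rotates the low n bits of x to the right by i places.
    static int rotateRight(int x, int i, int n) {
        i = i % n;
        int mask = (1 << n) - 1;
        return ((x >>> i) | (x << (n - i))) & mask;
    }

    // T_{e,d}(b) = (b XOR e) rotated right by (d + 1) bits (Equation 3.25).
    static int transform(int e, int d, int b, int n) {
        return rotateRight(b ^ e, d + 1, n);
    }

    public static void main(String[] args) {
        int n = 2;
        System.out.println(Integer.toBinaryString(transform(0b11, 0, 0b01, n))); // prints 1, i.e. 01
    }
}
```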

Therefore, we will now do a complete example to consolidate the explanation. Given p = [5, 6] = [101(2), 110(2)], we want to determine the h-value for p. In this case, n = 2 and m = 3, where n refers to the dimension and m to the bit precision - 3 bits to represent each value. The result is achieved through a series of m projections and Gray code calculations. Given p, we can extract an n-bit number lm−1 that will tell us whether the point p is in the lower or upper half set of points with respect to a given axis.

lm−1 = [bit(pn−1, m − 1) ... bit(p0, m − 1)] (3.26)

The bit function returns the bit at position m − 1 of the given coordinate. The simplified algorithm has two steps:

1. Rotate and reflect the space such that the Gray code ordering corresponds to the binary reflected Gray code,

l̄m−1 = Te,d(lm−1) (3.27)

2. Determine the index of the associated sub-hypercube,

wm−1 = gc^{−1}(l̄m−1) (3.28)

Applying this to our example, we will have that:

p = [p0, p1] = [101(2), 110(2)] (3.29)

We will have m iterations to compute the h-value h. So i represents the iterations, and it varies from i = m − 1 down to i = 0. For i = 2, we start with the variables e = 00(2), d = 1 and h = 0.

l2 = [bit(p1, 2) bit(p0, 2)] = [11],

T0,1(l2) = (11 ⊕ 00) ≫ 2 = 11(2),

w2 = gc^{−1}(11) = 10(2),

e(w2) = e(2) = 00(2),

d(w2) = d(2) = 1(10),

h = [w2] = 10(2) = 2(10) (3.30)

So, for the first iteration, we can locate the point on the sub-hypercube h = 2 of the first-order curve. Since the curve needed to represent this point is a third-order curve, we must perform another two iterations to find the exact location of the point. For i = 1, we have the variables e = 00(2), d = 1(10) and h = 2(10).

l1 = [bit(p1, 1) bit(p0, 1)] = [10],

T0,1(l1) = (10 ⊕ 00) ≫ 2 = 10(2),

w1 = gc^{−1}(10) = 11(2),

e(w1) = e(3) = 11(2),

d(w1) = d(3) = 0(10),

h = [w2 w1] = 1011(2) = 11(10) (3.31)

So, for the second iteration, we can locate the point on the sub-hypercube h = 11 of the second-order curve. For the last iteration, i = 0, we have the variables e = 11(2), d = 0(10) and h = 11(10).

l0 = [bit(p1, 0) bit(p0, 0)] = [01],

T3,0(l0) = (01 ⊕ 11) ≫ 1 = 01(2),

w0 = gc^{−1}(01) = 01(2),

e(w0) = e(1) = 00(2),

d(w0) = d(1) = 1(10),

h = [w2 w1 w0] = 101101(2) = 45(10) (3.32)

Thus, the h-value for p = [5, 6] is h = 45(10).

3.5 Clustering Property

The clustering property of space-filling curves arouses considerable attention due to the benefits it can bring. Clustering relates to the preservation of locality between multidimensional objects when mapped to the linear space. For example, a good clustering of multidimensional data translates into a reduction in disk accesses required for a range query. Moon et al. conducted a study [Moon et al., 2001] to derive closed-form formulas for the number of clusters in a particular region. The formulas provide a measure that can be used to predict the total disk access time. The analysis is based on the Hilbert space-filling curve since it presents better results in preserving locality. For this purpose, some assumptions are made to simplify and clarify the lines of this argument. The multidimensional space is considered to have finite granularity, where each point corresponds to a grid cell. The factors that influence disk accesses are diverse [Moon et al., 2001, Faloutsos and Roseman, 1989]. Therefore, the average number of clusters in a subspace of a point grid is used as a performance measure of the Hilbert curve. This subspace is the region of a query, and each grid point maps a disk block. This performance measure corresponds to the number of non-consecutive hits. The analysis takes two different courses. The first one is the asymptotic analysis of the clustering property of the Hilbert curve. It focuses on the relation between the growth of the grid space, tending to infinity, and the average number of clusters in a query subspace region. Through a series of demonstration exercises, the authors reach Theorem 1. The formal definition is presented below.

Theorem 1. In a sufficiently large d-dimensional grid space mapped by H_k^d, let Sq be the total surface area of a given rectilinear polyhedral query q. Then,

lim_{k→∞} N_d = Sq / (2d). (3.33)

In a simple way, this sets an asymptotic solution revealing that the number of clusters is approximately proportional to the hyper-surface area of a d-dimensional polyhedron. It also provides the constant factor of the linear function as being (2d)^{−1}, where d represents the number of dimensions and H_k^d corresponds to the Hilbert curve representation in dimension d and order k. This theorem has immediate consequences and is therefore accompanied by an important corollary.

(a) Rectangle query area example. (b) Square query area example.

Figure 3.13: Clustering analysis example with different query area shapes.

Corollary 1. In a sufficiently large d-dimensional grid space mapped by H_k^d, the following properties are satisfied:

1. Given an s1 × s2 × ... × sd hyper-rectangle,

d  d  1 X 1 Y lim Nd =  sj (3.34) k→∞ d s i=1 i j=1

2. Given a hypercube of side length s,

lim_{k→∞} N_d = s^{d−1} (3.35)

Understanding the true meaning of these formulas is difficult without an example, so let's observe Figure 3.13. The idea is to calculate the average number of clusters inside a query area. In this Figure, we have two examples to test the Corollary 1 formulas. The first example, Figure 3.13(a), is a rectangle area with the sides s1 = 2 and s2 = 3. We now apply Equation 3.34 to dimension d = 2.

lim_{k→∞} N_2 = (1/2) Σ_{i=1}^{2} [ (1/s_i) Π_{j=1}^{2} s_j ]
= (1/2) [ (1/s_1)(s_1 · s_2) + (1/s_2)(s_1 · s_2) ]
= (1/2) [ (1/2)(6) + (1/3)(6) ]
= 5/2 = 2,5 (3.36)

So we can conclude that the average number of clusters is 2,5. In this case, the exact cluster number is 2, since we have two uninterrupted curves inside the query area. Let's now analyze the square case in Figure 3.13(b).

For square query areas, we apply Equation 3.35. Since the dimension is 2 and the square has side 2, we can now substitute.

lim_{k→∞} N_2 = 2^{2−1} = 2 (3.37)

In this case, the average number of clusters matches the example, since we have again two uninterrupted curves inside the query area. The second course of this analysis is related to the exact analysis of the same property in a 2-dimensional space. It is the same idea but for a finite grid space, this way giving a notion of how fast the number of clusters converges to the asymptotic solution. In order to demonstrate the accuracy and correctness of the asymptotic analysis (see Appendix A.1), a simulation experiment is created for range queries of different shapes and sizes. Despite the distinct forms of queries chosen, all of them can be decomposed into rectangles. The simulation is performed only for grids of two and three dimensions due to the extended range of options. Although the paper focuses on the Hilbert curve, the simulation is extended to the Z and Gray-code curves in order to compare performance. The simulation results show deep similarities between the empirical results and the results obtained from the derived formulas. Theorem 1 and Corollary 1 provide an excellent approach to d-dimensional queries of different shapes. It is further concluded that the Hilbert curve overall outperforms the Z and Gray-code curves. Moon et al. also refer that, assuming the blocks on the disk are ordered according to a Hilbert curve, accessing the minimum bounding rectangles of a d-dimensional query (d ≥ 3) may increase the number of non-consecutive accesses, as may non-rectangular queries. The same does not happen for the two-dimensional query case. Despite the satisfactory results, it has only been tested for 2 and 3 dimensions. It would be interesting to have more empirical results in higher dimensions. For further details, please see [Moon et al., 2001].
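For reference, the two Corollary 1 formulas are trivial to evaluate; the sketch below, with hypothetical names, reproduces the values 2,5 and 2 obtained in Equations 3.36 and 3.37.

```java
public class ClusterEstimate {
    // Average number of clusters for an s1 x ... x sd hyper-rectangle (Equation 3.34).
    static double hyperRectangle(double[] s) {
        int d = s.length;
        double volume = 1.0;
        for (double side : s) volume *= side;
        double sum = 0.0;
        for (double side : s) sum += volume / side;
        return sum / d;
    }

    // Average number of clusters for a hypercube of side s (Equation 3.35).
    static double hyperCube(double s, int d) {
        return Math.pow(s, d - 1);
    }

    public static void main(String[] args) {
        System.out.println(hyperRectangle(new double[]{2, 3})); // 2.5, as in Equation 3.36
        System.out.println(hyperCube(2, 2));                    // 2.0, as in Equation 3.37
    }
}
```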

3.6 Fractals: An Access Method

The use of space-filling curves for indexing multidimensional spaces is not something new. Basically, there are two ways to use them and perform queries on space-filling curves. One hypothesis is to use a variant of the B-tree combined with the curve as a secondary key retrieval. The other alternative is to use the curve directly through the h-values. In section 3.3, we saw what the queries look like on the grid of the curve. Now, we will see how these indexes are used in the related work. Since the scope of this work is also focused on similarity search, I will only present works related to this matter or showing different ways to use fractals as an access method.

3.6.1 Secondary Key Retrieval

Faloutsos and Roseman propose the use of fractals to achieve good clustering results for secondary key retrieval. The performance of an access method for secondary keys depends on how well the map preserves locality. The authors proceed with the study on the basis that the Hilbert curve shows a better locality-preserving map by avoiding long jumps between points. To make the study more interesting, two other curves are added to compare with the Hilbert curve, the Z-curve and the Gray-code curve. The analysis focuses on range and nearest neighbor queries to test the hypothesis. The main purpose of their study is to find the best locality-preserving map. The best is defined based on two measures. The first focuses on the performance of the locality-preserving map when using

range queries. Simplifying, it is the average number of disk accesses required for a range query, which is exclusively dependent on the map, revealing how good the clustering property is. The second measure is related to the nearest neighbor query and is called the maximum neighbor distance. Faloutsos and Roseman refer that, for a given locality-preserving mapping and a given radius R, the maximum neighbor distance to a point X is the largest Manhattan - L1 metric - distance (in the k-dimensional space) of the points within distance R on the linear curve. The shorter the distance between the points, the better the locality-preserving map. In order to make the experiment feasible, the maps are tested for dimensions two, three and four, up to the fourth order. Generally, the Hilbert curve presents better results than the other two curves for the two measures applied. This study thus contributes to encourage the use of fractals as secondary keys, especially the Hilbert curve. However, despite the positive contribution this study makes, it would be interesting to know the performance of fractals beyond four dimensions.


Figure 3.14: Lawder and King indexing structure based on the Hilbert curve - Example of indexing a second-order curve. Adapted from [Lawder and King, 2000].

Lawder and King introduce a more developed work based on this proposal. They present a multidimensional index based on the Hilbert curve. The points are partitioned into sections according to the curve, and each section corresponds to a storage page. The h-value of the first point in a section is used as the corresponding page-key, giving the pages a coherent ordering. The page-keys are then indexed using a B-tree variant. The height of the tree reflects the order of the curve indexed, and the root corresponds to the first-order curve. Each node makes the correspondence between the h-value and the position of the point on the grid (see Figure 3.14). For the second-order curve level, the first ordered pairs are the parent of the lower nodes, and so on. In Figure 3.14, we present an example adapted from [Lawder and King, 2000] that represents the index of a Hilbert second-order curve. The nodes have a top line and a bottom line. The top one corresponds to the ith vertex of the curve, and the bottom one corresponds to its position on the grid, as we explained in section 3.4. Reading the bottom line of the node from left to right, we describe the curve's path on the grid. So, the first (left) node in the second level of the hierarchy starts on

the lower left corner (00), goes to the lower right corner (01), to the upper right corner (11) and finally to the upper left corner (10). The same is valid for the rest of the nodes. Since the tree can easily become too big, the authors use a state diagram, suggested by [Faloutsos and Roseman, 1989], to represent the four possible nodes and express the tree in a compact mode. However, this only works up to 10 dimensions. Above that, the "memory requirements become prohibitive" [Lawder and King, 2000]. Some modifications are made, but the method only works up to 16 dimensions. The implementation of the tree is compared to the R-tree, revealing that indexing based on the Hilbert curve saved 75% of the time to populate the data store and reduced by 90% the time to perform range queries. The authors focused on points as datum-points (records), and a further investigation considering points as spatial data would also be interesting. For more information on the Lawder and King indexing structure, please see [Lawder and King, 2000].

3.6.2 Direct Access Index

Another interesting indexing structure proposal based on the Hilbert curve is from Hamilton and Rau-Chaplin. They present an algorithm based on [Butz, 1971] and perform a geometrical approximation of the curve, already presented in section 3.4. The authors developed a compact Hilbert index that enables a grid to have different sizes along the dimensions. The index simply converts a certain point p to a compact h-value. The authors extract redundant bits that are not used to calculate the index. This allows a reduction in terms of space and time usage. The index is tested and compared with a regular Hilbert index and Butz's algorithm. The authors conclude that although the compact Hilbert index is more "computationally expensive to derive", it saves significantly more space and reduces the sorting time. The compact Hilbert index is tested up to 120 dimensions to map the indexes [Hamilton and Rau-Chaplin, 2008]. Chen and Chang also choose to use the Hilbert curve, using the h-values for direct access to the points. They perform a set of operations similar to what we presented in section 3.4 in order to access a certain point. Unlike Hamilton and Rau-Chaplin, they also developed a query technique based on the map. There are two basic steps to find the nearest neighbor to a query point. First, locate the query point on the curve, then locate the neighbor. To locate a certain point on the grid, they start by defining the direction sequence of the curve (DSQ) based on the cardinal points. The direction is fixed, starting from the lower left vertex (SW) and finishing at the lower right vertex (SE). The DSQ is represented as (SW, NW, NE, SE). Each of the four cardinal points is then substituted by the h-value at that position (see Figure 3.15). The combination of the h-values with the cardinal points gives us the direction of the curve in each sub-hypercube. The h-value of the query point is converted to a number of base four [d1 · · · dm]4, and each digit indicates the curve's direction as we zoom in on the curve. In other words, the number of digits m refers to the order of the curve. Additionally, as we know, the curve suffers some transformations/rotations along the sub-hypercubes. These transformations are defined here and related to the quaternary number composed along the zoom-in:

• C13 - If the number ends with zero, then switch all ones to three and vice-versa;

• C02 - If the number ends with three, then switch all zeros to two and vice-versa;

• C11/22 - If it ends with one or two, it remains the same.


Figure 3.15: Chen and Chang indexing structure based on the Hilbert curve - Example of locating the h-value 50, marked with an X on the grid. Adapted from Chen and Chang [2011].

The authors present an example for a better understanding of the concepts. Suppose we want to find the nearest neighbor to the h-value 50 on the Hilbert third-order curve. First, we need to find the location of the query point on the grid (see Figure 3.15). They start by converting the number 50(10) to 302(4). Each of the digits in (d1d2d3)(4) gives us the direction along each sub-hypercube. Therefore, on the first-order curve we have:

d1 = 3, DSQ3 = (0, 1, 2, 3) (3.38)

The digit d1 = 3 points to the number 3 at the SE position. This indicates that the point is in the sub-hypercube located at the SE of the grid. Since the first sub-hypercube is found, we must continue locating the point inside this sub-hypercube. As the number composed so far ends with three, we must apply to the DSQ3 the case C02. For the second order,

d2 = 0, DSQ30 = (2, 1, 0, 3) (3.39)

The digit d2 = 0 points to the number 0 at the NE position. This indicates that the point is in the sub-hypercube located at the NE of the first sub-hypercube. Since the second sub-hypercube is found, we must continue locating the point inside this sub-hypercube. As the number now ends with zero, we must apply to the DSQ30 the case C13. For the third order,

d3 = 2, DSQ302 = (2, 3, 0, 1) (3.40)

Finally, we find the point located at the SW of the second sub-hypercube, marked with an X in Figure 3.15. So, once the query point is located, we can now search for neighbors. The authors store the location and the DSQs used to find the location. With this, we know that our query point 50(10) is located at the SW of a hypercube of level three. Therefore, they generate the following neighbors' h-values in order to find out if they exist. Since we know that the query point 50(10) is located at the SW of the grid, we know that the other three positions can be generated through the quaternary number used to guide the query point location. So, the values 300(4), 301(4) and 303(4) will give the locations of the possible local neighbors. And, if this search turns out empty, we can go up one level, for example, to 31(4), and restart the process again.
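The whole location procedure can be sketched as follows; the names are hypothetical and the code assumes the DSQ rewriting rule stated above (C02 after a digit 3, C13 after a digit 0). For h = 50 and a third-order curve it returns the path SE, NE, SW of the example.

```java
import java.util.Arrays;

public class DsqLocator {
    // Locates an h-value on the grid: the quaternary digits of h select a
    // cardinal position at each level, and the DSQ is rewritten after each step.
    static String[] locate(int h, int order) {
        String[] cardinal = {"SW", "NW", "NE", "SE"};
        int[] dsq = {0, 1, 2, 3};                    // h-values at positions SW, NW, NE, SE
        String[] path = new String[order];

        // Quaternary digits of h, most significant first.
        int[] digits = new int[order];
        for (int i = order - 1; i >= 0; i--) { digits[i] = h % 4; h /= 4; }

        for (int level = 0; level < order; level++) {
            int digit = digits[level];
            for (int pos = 0; pos < 4; pos++)        // find the position holding this digit
                if (dsq[pos] == digit) path[level] = cardinal[pos];
            if (digit == 0)                          // case C13: swap ones and threes
                for (int j = 0; j < 4; j++) dsq[j] = (dsq[j] == 1) ? 3 : (dsq[j] == 3) ? 1 : dsq[j];
            else if (digit == 3)                     // case C02: swap zeros and twos
                for (int j = 0; j < 4; j++) dsq[j] = (dsq[j] == 0) ? 2 : (dsq[j] == 2) ? 0 : dsq[j];
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(locate(50, 3))); // [SE, NE, SW]
    }
}
```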

The authors tested this algorithm with six distinct spatial datasets and compared it with a previous version that computes the exact location through the binary bits of the coordinates. The newer version showed better results, despite only being tested in two dimensions [Chen and Chang, 2011].

3.7 Summary

This chapter presented the basic concepts relating to fractals, which are the technique I intend to use as an access method in the following chapter. Therefore, in section 3.2, we understood how fractals can have a fraction as a value for dimension instead of an integer. In section 3.3, we saw how the most popular space-filling curves are generated by computers. Section 3.4 explained how the space-filling curves can be generically calculated for any dimension. Although there are several different curves, in section 3.5 we focused on the Hilbert curve, which presents a superior clustering property when compared with the others. In section 3.6, we saw that the Hilbert curve is the most promising curve to be used as a multidimensional index. In general, the related work encourages the use of the Hilbert curve as an index method, whether as a secondary key retrieval or as a direct access. Nevertheless, it is very conservative, testing the index performance in extremely low dimensional spaces, usually up to two or three dimensions.

Chapter 4

Approximate Space-Filling Curve

In the two previous chapters, we saw the basic concepts relating to this thesis' context and proposal. This chapter presents a solution proposal to explore the behavior of fractals, especially the Hilbert space-filling curve, used as an access method for multidimensional data in low-dimensional spaces. The first section explains the motivation behind this work. Section 4.2 describes the methodology chosen to develop these experiments. Section 4.3 describes the origin of the data used during the experiments. Section 4.4 describes the first part of the proposal. Section 4.5 presents a characterization and analysis of the data before running the experiments, based on a linear search. The rest of the sections, 4.6, 4.7 and 4.8, describe the experiments done and the results achieved. The final section sums up the content of this chapter.

4.1 Motivation

Multidimensional data is not an easy subject. In chapter 2, I introduced the basic concepts relating to multidimensional data and its indexing challenges. First, we saw that multidimensional or spatial data has a special property. Ideally, in order to reduce the search time, all the multidimensional data that are related should be linearly stored on disk or memory. However, this cannot be done because it is impossible to guarantee that all the pairs of objects close in space will all be close in the linear order. It is impossible for all but not for some, and the space-filling curves, as we saw in chapter 3, provide a total order among the points in space with a superior clustering property. Another problem relating to multidimensional data refers to its indexing structures. In chapter 2, we also saw two alternatives to index low dimensional data up to thirty dimensions - the R-tree and the Kd-tree. Although, like many other indexes, they present a problem when the number of dimensions increases. When this happens, the volume of the space increases so fast that the data in the space become sparse. The curse of dimensionality appears when the spaces surpass roughly ten dimensions, making their performance comparable to a linear search or worse [Clarke et al., 2009]. Berchtold et al. refer that they also suffer from the well-known drawbacks of multidimensional index structures, such as high costs for insert and delete operations and poor support of concurrency control and recovery. A possible solution to this problem is to use dimensionality reduction techniques like the space-filling curves [Berchtold et al., 1998, Clarke et al., 2009]. They allow mapping a d-dimensional space to a one-dimensional space in order to reduce the complexity. After this, it is possible to use a B-tree variant to

store the data and take advantage of all the properties of these structures, such as fast insert, update and delete operations, good concurrency control and recovery, easy implementation and re-usage of the B-tree implementation. Additionally, the space-filling curve can be easily implemented on top of an existing DBMS [Berchtold et al., 1998]. Previous studies indicate that unless the multidimensional data, typically over ten dimensions, are well-clustered or correlations exist between the dimensions, similarity search like the nearest neighbor in these spaces may be meaningless [Mamoulis, 2012]. Other studies indicate as well that fractals are very useful in the domain of multidimensional indexing due to their properties of good clustering, preservation of the total order and spatial proximity [Mamoulis, 2012, Faloutsos and Roseman, 1989]. In the previous chapter 3, we saw that there is related work using fractals as an access method. However, it is very conservative, using those access methods to index and search in extremely low dimensional spaces, usually up to three or four dimensions. It would be interesting to explore fractals as an access method up to thirty dimensions. For all the reasons mentioned above, the purpose of this research is to study the space-filling curves. The goal is to understand how their properties can help to optimize the multidimensional indexing structures to index low-dimensional data (two to thirty dimensions). These spaces may result from the application of indexing structures oriented to high-dimensional spaces or result from the application of dimensionality reduction techniques for subsequent indexing. In either situation, the resulting space, despite being of lower dimensions, still requires a large computational effort to perform operations on data.

4.2 Methodology


Figure 4.1: Approximate Space-filling Curve Framework

My proposal is based on the framework presented in Figure 4.1, and it has two main phases: the Mapping System and the ANN Heuristic System. In this thesis, we are studying the use of the space-filling curve as an access method in low-dimensional spaces. It can be used as a dimensionality reduction technique, and so it reduces the number of dimensions of a space by mapping the data from a d-dimensional space to a single one. In order to do this, I create the Mapping System. It starts by extracting the multidimensional points and generating the respective h-values with the help of the Uzaygezen 0.2 library. As a result, a space is created according to the Hilbert curve. Once the space is created, it is necessary to test the curve behavior as an access method. Therefore, I develop at least two heuristics to search for approximate nearest neighbors in the space. One of them uses the Hilbert curve as a direct access method and the other as secondary key retrieval combined with a B-tree variant. This second phase

is called the Approximate Nearest Neighbor Heuristic System (ANN Heuristic System) and focuses on creating heuristics to find the ANN of an arbitrary query point, exploiting the Hilbert curve properties. Basically, I test a heuristic on the Hilbert space and readjust it or create a new one based on its performance measures. In other words, the ANN Heuristic System is an iterative testing system. In order to test something, it is necessary to define evaluation metrics. These are defined in terms of runtime performance and distance relative error.

Runtime performance The first is the most basic metric that we can use. As I mentioned before, several index structures present worse performances than a simple linear search when the number of dimensions surpasses ten. For this reason, I decided to have as reference the running time of the linear search. It has a simple algorithm. It assumes that the first multidimensional point present in the data file is the nearest neighbor to a given query point. It stores the coordinates of the neighbor and computes the distance to the query point. Then, it visits the next point in the data file and computes the distance to the query point. If the new point is closer than the stored nearest neighbor, the point becomes the new nearest neighbor. The verification is repeated for each point present in the data file. The heuristics' runtimes are considered acceptable if they are lower than the linear search runtime. As the reader may have noticed, I referred to a distance that must be computed, but I did not say how this distance is calculated. Therefore, we must also define the distance metric that is used either for the linear search or for the heuristics. A point in a two-dimensional space can have several nearest neighbors at the same time. If they are all at a distance of one cell to a central query point, they describe a unit circle in the grid. This is clearly the L∞ norm, since the unit circle is a square (remember section 2.3). Therefore, a nearest neighbor, or a set of them, is defined by having the smallest distance in the space to a given query point.
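A minimal sketch of this baseline, using hypothetical names and the L∞ distance just defined, could look as follows.

```java
public class LinearSearch {
    // Chebyshev (L-infinity) distance between two points.
    static long distance(long[] a, long[] b) {
        long max = 0;
        for (int i = 0; i < a.length; i++) max = Math.max(max, Math.abs(a[i] - b[i]));
        return max;
    }

    // Scans every point and keeps the one with the smallest L-infinity distance
    // to the query, exactly as the baseline described above.
    static long[] nearestNeighbour(long[][] points, long[] query) {
        long[] best = points[0];
        long bestDistance = distance(best, query);
        for (int i = 1; i < points.length; i++) {
            long d = distance(points[i], query);
            if (d < bestDistance) { bestDistance = d; best = points[i]; }
        }
        return best;
    }

    public static void main(String[] args) {
        long[][] data = {{0, 0}, {3, 1}, {5, 5}};
        System.out.println(java.util.Arrays.toString(nearestNeighbour(data, new long[]{4, 4}))); // [5, 5]
    }
}
```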

Distance relative error On the other hand, I intend to find an approximate nearest neighbor to a given query. The word "approximate" immediately leads us to think that the neighbor, not being the nearest neighbor, is not far from it. Thus, it is necessary to calculate the approximation error. Consider a space S containing points in R^d and a query point q ∈ R^d. Given c > 0, we say that the point p* ∈ S is a c-approximate nearest neighbor of q if:

1. There is a p ∈ S with ||p − q||∞ = r, where r is the distance value of the nearest neighbor;

2. It returns p* ∈ S with ||p* − q||∞ ≤ c · r; therefore, p* is an approximate nearest neighbor of q and c is the error factor.

If we define c as being 1 + ε, we may see p* as being within the nearest neighbor's relative error ε:

||p* − q||∞ ≤ (1 + ε) · ||p − q||∞,

The relative error ε can vary from zero, where it corresponds to the nearest neighbor, to infinity. The values of the nearest neighbors are previously defined by the linear search and used to calculate the error.
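In practice, the relative error of a returned neighbor is simply its distance divided by the exact nearest-neighbor distance, minus one; a trivial sketch with hypothetical names:

```java
public class RelativeError {
    // epsilon = dist(p*, q) / r - 1, where r is the exact nearest-neighbour
    // distance previously obtained by the linear search.
    static double epsilon(double approximateDistance, double exactDistance) {
        return approximateDistance / exactDistance - 1.0;
    }

    public static void main(String[] args) {
        System.out.println(epsilon(5.0, 1.0)); // 4.0: a neighbour 5 cells away when r = 1
    }
}
```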

4.3 Data Description

The recently introduced subspace tree [Wichert, 2009] promises low retrieval complexity of extremely high-dimensional features. The search in this structure begins in the subspace with the lowest dimension. In this subspace, the set of all possible similar objects is determined. In the next subspace, additional metric information

that corresponds to a higher dimension is used to reduce this set. This process is repeated. However, for the lowest dimensional spaces, with dimensions from 2 to 36, a linear search has to be performed. In this thesis, I research how to speed up this search using space-filling curves. I perform empirical experiments on the following low-dimensional data: SIFT, GIST and RGB images.

SIFT A scale-invariant feature transform (SIFT) generates local features of an image that are invariant to image scaling, rotation and illumination. An image is described by several vectors of fixed dimension 128. Each vector represents an invariant key feature descriptor of the image.

• The vector x of dimensionality 128 is split into 2 distinct windows of dimensionality 64. The mean value is computed in each window resulting in a 2 dimensional vector.

• The vector x of dimensionality 128 is split into 4 distinct windows of dimensionality 32. The mean value is computed in each window resulting in a 4 dimensional vector.

GIST Observers recognize a real-world scene at a single glance. The GIST is a corresponding concept of a descriptor and is not a unique algorithm. It is a low-dimensional representation of a scene that does not require any segmentation. It is related to the SIFT descriptor. The scaled image of a scene is partitioned into coarse cells, and each cell is described by a common value. An image pyramid is formed with l levels. A GIST descriptor is represented by a vector of the dimension

grid × grid × l × orientations (4.1)

For example, the image is scaled to the size 32 × 32 and segmented into a 4 × 4 grid. From the grid, orientation histograms are extracted on three scales, l = 3. A histogram is defined with each bin covering 18 degrees, which results in 20 bins (orientations = 20). The dimension of the vector that represents the GIST descriptor is 960.

• The vector x of dimensionality 960 is split into 5 distinct windows of dimensionality 192. The mean value is computed in each window resulting in a 5 dimensional vector.

• The vector x of dimensionality 960 is split into 10 distinct windows of dimensionality 96. The mean value is computed in each window resulting in a 10 dimensional vector.

• The vector x of dimensionality 960 is split into 30 distinct windows of dimensionality 32. The mean value is computed in each window resulting in a 30 dimensional vector.

RGB images The database consists of 9.877 web-crawled color images of size 128 × 96. The images were scaled to the size of 128 × 96 through a bilinear method, resulting in a 12288-dimensional vector space. Each color is represented by 8 bits. Each of the three bands of size 128 × 96 is tiled with rectangular windows W of size 32 × 32. The mean value is computed in each window resulting in a

36 = 12 × 3 = 4 × 3 × 3 dimensional vector.
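All three reductions above follow the same pattern: split the vector into equally sized windows and keep the mean of each. A minimal sketch, with hypothetical names, assuming the vector length is divisible by the number of windows:

```java
public class WindowMean {
    // Splits a feature vector into `windows` equally sized blocks and replaces
    // each block by its mean value, reducing the dimensionality as described above.
    static double[] reduce(double[] vector, int windows) {
        int width = vector.length / windows;
        double[] reduced = new double[windows];
        for (int w = 0; w < windows; w++) {
            double sum = 0;
            for (int i = w * width; i < (w + 1) * width; i++) sum += vector[i];
            reduced[w] = sum / width;
        }
        return reduced;
    }

    public static void main(String[] args) {
        double[] sift = new double[128];            // a SIFT descriptor
        java.util.Arrays.fill(sift, 1.0);
        System.out.println(reduce(sift, 2).length); // 2-dimensional vector
    }
}
```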

38 4.4 Mapping System

The subspace tree generated the low-dimensional data that is used by this framework. However, the resulting six spaces contain duplicated data that could lead to biased conclusions. The reader must remember that, in the next phase, we will be searching for approximate nearest neighbors. If there are many duplicated points, we could find neighbors at a zero distance, that is, the query point itself. Thus, these spaces need to be cleaned before being used by the Mapping System. The resulting data of the subspace tree application are, in fact, multidimensional points. For each of these spaces, I map each multidimensional point to its one-dimensional h-value with the aid of the Uzaygezen 0.2 library. It is based on the theory of the Compact Hilbert Indexes [Hamilton, 2006], already presented in section 3.4, and follows this basic algorithm [Hamilton and Rau-Chaplin, 2008]:

1. Find the cell containing the point of interest;

2. Transform as necessary;

3. Update the cell index value appropriately;

4. Continue until sufficient precision has been attained.

Find the cell containing the point of interest Locate the cell that contains our point of interest by determining whether it lies in the upper or lower half-plane regarding each dimension. Assuming we are working with an m-order curve, where m also refers to the number of bits necessary to represent a coordinate, we can use Equation 3.26, defined in section 3.4.

lm−1 = [bit(pn−1, m − 1) ... bit(p0, m − 1)] (3.26 revisited)
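A minimal sketch of this bit extraction, with hypothetical names, is shown below; for p = [5, 6] it reproduces the values l2 = 11, l1 = 10 and l0 = 01 of the example in section 3.4.

```java
public class CellLocator {
    // Builds l_i = [bit(p_{n-1}, i) ... bit(p_0, i)], i.e. one bit taken from
    // each coordinate at resolution level i (Equation 3.26).
    static int extractLevelBits(long[] p, int level) {
        int l = 0;
        for (int j = p.length - 1; j >= 0; j--) {
            l = (l << 1) | (int) ((p[j] >>> level) & 1);
        }
        return l;
    }

    public static void main(String[] args) {
        long[] p = {5, 6};                                                  // p = [101, 110]
        System.out.println(Integer.toBinaryString(extractLevelBits(p, 2))); // 11
        System.out.println(Integer.toBinaryString(extractLevelBits(p, 1))); // 10
        System.out.println(Integer.toBinaryString(extractLevelBits(p, 0))); // 1, i.e. 01
    }
}
```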

Transform as necessary We zoom in on the cell containing the point and rotate and reflect the space such that the Gray code ordering corresponds to the binary reflected Gray code. The goal is to transform the curve into the canonical orientation.

l̄m−1 = Te,d(lm−1) (3.27 revisited)

Update the index value appropriately The orientation of the curve is given only by an entry point e and an exit point f through the grid. Once we know the orientation at the current resolution, we can determine in which order the cells are visited. So, h gives the index of a cell at the current resolution.

h = [wm−1 ··· w0] (4.2)

where w is the index of the associated cell, as defined in section 3.4:

wm−1 = gc^{−1}(l̄m−1) (3.28 revisited)

We continue until sufficient precision has been attained. A full example can be found in section 3.4. Each multidimensional point is mapped to the Hilbert curve, allowing the ANN Heuristic System to perform queries on this space.

4.5 Dataset Analysis based on a Linear Search

The code is developed in Java and tested on a computer with the following properties:

• Windows 8 Professional 64 Bits

• Processor Pentium(R) Dual-Core CPU T4300 @ 2.10GHz

• 4 GB RAM

In order to run the experiments, we must first remove the duplicated data, as I explained in section 4.4. The datasets are organized in files per dimension. The dimensions tested are 2, 4, 5, 10, 30 and 36. The files of dimensions 2, 4 and 36 have 10.000 points, and the remaining ones have 100.000. The files are analyzed, and duplicated data are removed. After this cleaning process, the files have the number of points presented in Table 4.1, in the column "Data Volume". It is possible to know already the order of the Hilbert curve through the number of bits needed to represent the highest number present in a file. The order of the curve is therefore, in this same table, in the column labeled "Resolution".

Spaces   Original Volume   Data Volume   Resolution
S2       10 000            264           6
S4       10 000            9 229         7
S5       100 000           98 917        16
S10      100 000           98 917        16
S30      100 000           98 917        16
S36      9 877             9 831         8

Table 4.1: Datasets description

Before running the heuristics experiments, we must define the set of query points for which we intend to find neighbors. We also need to define the distance values of their nearest neighbors. The query group varies from point P0 to P9; these are the first 10 points extracted from each file after the cleaning process. The linear search, described in section 4.2, defines the distance values of the nearest neighbors for the query group, and these are presented in Table 4.2. In this Table, we can observe that the spaces of dimension 2 and 4 are densely populated, since all the neighbors are immediate cell neighbors and the mean distance is 1. The spaces regarding dimensions 5, 10 and 30 are quite sparse, since the mean distances are 332,3, 629,2 and 943,4 respectively. Finally, for the space 36, we can say that it is in between the first two and the rest of the spaces regarding the density of the data. It presents a mean distance of 34,7. In terms of analysis, I grouped the spaces into two sets, {2, 4, 36} and {5, 10, 30}, because the spaces within each set are more similar to each other, as shown above. Figure 4.2 presents the nearest neighbor to a query point distributed over all spaces. It is interesting to observe how the neighbor distances are distributed over the spaces and through the query points.

Query Point   S2   S4   S5    S10     S30     S36
P0            1    1    341   323     669     35
P1            1    1    325   759     1 197   37
P2            1    1    421   239     691     14
P3            1    1    466   1 516   961     41
P4            1    1    249   1 411   853     7
P5            1    1    169   117     1 599   36
P6            1    1    99    563     734     90
P7            1    1    760   538     831     39
P8            1    1    278   465     545     14
P9            1    1    215   361     1 354   34

Table 4.2: Neighbor distance to query point per dimension

If we put our query point at the center of the web graphic, we can see how far the nearest neighbor is from it. We can see it in every dimension and for each one of the query points. Once again, it is clear that the spaces 5, 10 and 30 are sparser than the rest. The spaces 2 and 4 don't even appear in the graphic since the distance to the nearest neighbor is 1 for each query point. On the query points P3 and P4, the space 10 even surpasses the space 30, which has the highest mean distance over all spaces.


Figure 4.2: The nearest neighbor distribution per dimension.

In terms of runtime performance, the linear search results for the first group are shown in Figure 4.3. There are some aspects to consider that influence this performance (see Table 4.1). The first relates to the resolution of each space, which is 6, 7 and 8 for the spaces 2, 4 and 36 respectively. Then, the amount of points to compute is significantly lower for dimension 2 (264) and almost equivalent for the remaining spaces (9.229 and 9.831 for dimensions 4 and 36 respectively). Therefore, it is faster to compute the space 2 since it has the lowest resolution, the lowest amount of data and the lowest number of dimensions to compute. The analogy is valid for the spaces 4 and 36.


Figure 4.3: Linear search runtime performance.

The linear search results for the second group are shown in Figure 4.4. In this scenario, the three spaces have the same amount of data and the same space resolution (see Table 4.1). Therefore, the runtime performance is only related to the number of dimensions of each space. The time required to compute the space 10 is almost twice the time needed to compute the space 5. The same analogy applies to the space 30, which takes almost three times longer than the space 10 and nearly six times longer than the space 5.


Figure 4.4: Linear search runtime performance.

4.6 Experiment 1: Hypercube Zoom Out

Since the reference values are established, we can now focus on developing the heuristics. As mentioned before, the ANN Heuristic System works on the space generated by the Mapping System. It is an empirical system that evolves based on the experimental performances. The idea is to find the ANNs of the query group defined in the previous section, exploiting the best properties of the Hilbert curve. After creating the heuristic, I test it on the Hilbert space and readjust it or create a new heuristic based on the performance measures. The first heuristic is named Hypercube Zoom Out (HZO), and it tries to explore the following properties of the Hilbert curve:

• The space-filling curves divide the space into four quadrants. Each quadrant is a square that has an entry point and an exit point (remember section 3.3);

• The clustering property of the Hilbert curve is better when the area of analysis tends to a square (remember section 3.5).

Suggestion In order to perform an approximate nearest neighbor search using the Hilbert curve, the area of the search should be a square to minimize the number of clusters. To make use of the squares that the curve forms, it is necessary to point out the entry and exit points of each square, as they define the search area. The number of points inside a square can vary from one to infinity, depending on the space resolution. Thus, the search must begin at a higher resolution, zooming out until it reaches the lowest resolution; however, it stops once it reaches the first neighbor point inside a square. Since the curve preserves the spatial proximity between the points, it is expected to find an approximate neighbor point inside a higher-resolution square that will be, in fact, closer to the query point, therefore resulting in an efficient computational effort. The reader must notice that this solution generates data like the proposal presented by Chen and Chang, introduced in section 3.6. Thus, it uses the Hilbert curve, like their proposal, as a direct access method.


Figure 4.5: Hypercube Zoom Out algorithm scheme in a two dimensional space with resolution 2.

Algorithm Figure 4.5 illustrates how the HZO searches for a neighbor in a two-dimensional space. The left space corresponds to the initial search at the highest resolution hypercube. The right space corresponds to the search at the lowest resolution hypercube. The set of cells {0, 4, 8, 12} are entry points of their respective hypercubes. Similarly, the set of cells {3, 7, 11, 15} are exit points of the same hypercubes. The cells 0 and 15 are as well the entry and exit points, respectively, of the lower-resolution hypercube. A search always flows from an entry to an exit point of the highest resolution hypercube. Now that the basic notions are introduced, suppose we are searching for an approximate nearest neighbor to the query point in the cell 10. We stop searching once we find the first point. So, the HZO starts looking for neighbors at the entry point of the father hypercube of this cell, which is the cell number 8. The search continues until it reaches a neighbor, or it reaches the exit point of this hypercube, the cell 11. If it does not find any neighbors, it will start again at the entry point of the father hypercube, which is the cell 0. The search continues again until it finds a neighbor or reaches the already visited son hypercube {8, 9, 10, 11}. At this point, the search jumps from the cell 7 to the cell 12 and continues until it reaches the exit point, since the hypercube between these cells has already been visited.


Figure 4.6: Hypercube Zoom Out - Three scenarios in two dimensions with the cell 10 as the query point.

Now, let’s observe three spaces (Figure 4.6 from left to right) in a two-dimensional space to apply the HZO heuristic. At all three, we are looking for an approximate nearest neighbor to the cell 10. The spaces vary in terms of population density. In the left space, the HZO returns the cell 9 as the approximate nearest neighbor since it starts at the entry point 8 and continues until find the cell 9. In the middle space, the neighbor is the cell 4. It starts at the cell 8 and continues until it reaches the exit point. Since it does not find any neighbor, it changes the search to the hypercube father, which is the entire space, thus restarting at the entry point 0. The search finishes when it reaches the cell 4. On the right space, it does not find any neighbor at higher resolution so it restarts from the hypercube father entry point 0. The search continues until it reaches the cell 7 and then jumps to the cell 12. The resulting neighbor is the cell 13.

Spaces   Linear Search   HZO
S2       0,047           0,047
S4       0,312           0,289
S5       2,761           n/d
S10      4,587           n/d
S30      11,091          n/d
S36      0,686           n/d

Table 4.3: HZO runtime results in seconds.

Evaluation This heuristic is tested and evaluated on the space generated by the Mapping System, with the data presented in section 4.3 and analyzed in section 4.5. The HZO did not present any result in a time considered acceptable (less than a day) for spaces with dimensions higher than 4. It turned out to produce huge amounts of data, since it generates the h-value of each potential neighbor and verifies if it exists in the dataset. In the worst-case scenario, the space 30 at curve order 16, it can generate the number of cells existing in the maximum order curve, which is an amazing 3,121749 × 10^144 cells. The heuristic requires high computational resources, namely space, and due to the space constraint, those results could not be computed (n/d). Therefore, the results for the remaining spaces are presented. Table 4.3 shows the HZO runtime performance compared with the linear search. For the space with the lowest number of dimensions, the runtime is the same as the linear search. For the space 4, the HZO is 23 milliseconds better. In terms of the distance metric, the results for the approximate nearest neighbors produced by the HZO are presented in Table 4.4. For the space 2, the HZO presents a mean relative error of 0, which translates into all exact neighbors. For the space 4, the HZO presents 70% of exact nearest neighbors and an average relative error of 0,7.

              HZO Results      Relative Error
Query Point   S2      S4       S2      S4
P0            1       5        0       4
P1            1       1        0       0
P2            1       1        0       0
P3            1       1        0       0
P4            1       1        0       0
P5            1       2        0       1
P6            1       3        0       2
P7            1       1        0       0
P8            1       1        0       0
P9            1       1        0       0
Mean Value    1       1,7      0       0,7

Table 4.4: HZO approximate nearest neighbors results.

In total, twenty queries are performed in the HZO heuristic evaluation for the two spaces observed. In this broader overview, and looking at Figure 4.7, we can see that 85% of the queries return the nearest neighbors. The rest of the returned results are approximate nearest neighbors, where 5% present a relative error ε = 1. This means that those neighbors are 2 cells away from the nearest neighbor, since the error factor is c = 1 + ε. The remaining 10% of the queries return approximate neighbors with ε ∈ [2, 4]. Thus, the analogy of the error factor is the same and, on average, for these, c = 4. Globally, the average relative error for the HZO is ε = 0,35.

Conclusion In general, we can conclude that the HZO does not work for dimensions bigger than 4. On computing this heuristic, it stands out that the Hilbert curve generates much more data than the dataset used to test the heuristic. As the spaces 2 and 4 are densely populated, the data generated by the space-filling curve does not have much impact, since the heuristic in general only moves from one cell to another on the curve to find a neighbor. The problems begin when the number of dimensions increases and the points get sparser. In the space 5, for example, the points are on average 332,3 cells away from the query point, and the smallest hypercube (highest resolution) has 32 cells. For resolution 2, the hypercube jumps to 1.024 cells. If, by chance, the approximate nearest neighbor is not in this hypercube, the jump will be to 32.768 cells, and so on. The number of cells to visit grows exponentially, which means that the data generated by the curve follow this trend. Overall, we can say the results for the lower dimensions are acceptable, since in general it presents a good percentage of nearest neighbors and the runtime performance is slightly better than the linear search.

45 90% 80% 70% 60% 50% 40% Queries 30% 20% 10% 0% 0 1 [2-4] Ɛ

Figure 4.7: HZO relative error distribution over all queries performed. exponentially, which means that the data generated by the curve follow this trend. Overall, we can say the results for the lower dimensions are acceptable since in general it presents a good percentage of nearest neighbors and the runtime performance are slightly better than the linear search.

4.7 Experiment 2: Enzo Full Space

This heuristic proposal tries to overcome the fact that the HZO does not run in acceptable time in spaces with more than 4 dimensions and, at the same time, to reduce the overall ε.

Suggestion Enzo Full Space (FS) abandons the hypercube search-area strategy and focuses on the curve path itself. The idea is to launch two search branches that start at the query point location and walk along the curve towards the lowest-resolution hypercube entry and exit points, until one of them, or both, find a neighbor. In theory, the Enzo FS will be faster and will present a lower ε, as we will see below. Like the HZO, the Enzo FS uses the Hilbert curve as a direct access method.

Figure 4.8: Enzo FS algorithm scheme (a 4×4 Hilbert grid with h-values 0–15; from the query cell 10, the entry branch walks along the curve towards the entry point 0 and the exit branch towards the exit point 15).

Spaces   Linear Search   HZO     Enzo FS
S2        0,047           0,047   0,031
S4        0,312           0,289   0,115
S5        2,761           n/d     n/d
S10       4,587           n/d     n/d
S30      11,091           n/d     n/d
S36       0,686           n/d     n/d

Table 4.5: Enzo FS runtime results

Algorithm Figure 4.8 illustrates how the Enzo FS searches for neighbors. The cell 10 represents the query point. In order to simplify the explanation, I named the branch that searches towards the entry point the entry branch and, by analogy, the other the exit branch. The entry branch starts from the immediate neighbor value of the query point on the curve and walks towards the entry point. It verifies whether each neighbor value exists in the dataset. If it does exist, a neighbor is found, and the branch tries to store the point as the approximate nearest neighbor. It only tries because the other branch may be doing the same; if there is a tie, the decision is made by storing the point closest to the query point. Once the approximate nearest neighbor is set, the branch that wrote it ends the concurrent branch. There are factors that influence the performance of this heuristic, related to the density of the space and to the location of the query point. In the worst-case scenario, the query point is an entry or an exit point of the lowest-resolution hypercube and the search performance can be worse than the linear search.
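A minimal sequential Java sketch of this search follows (my own illustration; the prototype launched the two branches concurrently and resolved ties by keeping the point closest to the query point, whereas this sketch advances both directions in lock step and simply keeps the entry-branch hit on a tie):

import java.math.BigInteger;
import java.util.Set;

public final class EnzoFsSketch {

    // Walks the entry branch (towards h-value 0) and the exit branch (towards hMax, the
    // exit point of the whole space, i.e. 2^(dims * order) - 1) in lock step and returns
    // the first populated h-value found, or null if the space holds no other point.
    public static BigInteger search(Set<BigInteger> hValues, BigInteger q, BigInteger hMax) {
        BigInteger down = q.subtract(BigInteger.ONE);   // entry branch
        BigInteger up = q.add(BigInteger.ONE);          // exit branch
        while (down.signum() >= 0 || up.compareTo(hMax) <= 0) {
            if (down.signum() >= 0 && hValues.contains(down)) {
                return down;
            }
            if (up.compareTo(hMax) <= 0 && hValues.contains(up)) {
                return up;
            }
            down = down.subtract(BigInteger.ONE);
            up = up.add(BigInteger.ONE);
        }
        return null;
    }
}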

              Enzo FS Results   Relative Error
Query Point    S2     S4         S2     S4
P0              1      4          0      3
P1              1      1          0      0
P2              1      1          0      0
P3              1      1          0      0
P4              1      1          0      0
P5              1      2          0      1
P6              1      2          0      1
P7              1      1          0      0
P8              1      1          0      0
P9              1      1          0      0
Mean Value      1      1,5        0      0,5

Table 4.6: Enzo FS approximate nearest neighbors results.

Evaluation This heuristic is tested and evaluated in the same context as the previous one and, like the HZO, it does not present results in a time considered acceptable (less than a day) for spaces with more than 4 dimensions, for the same reasons. Therefore, the results for the remaining spaces are now presented. In terms of runtime performance (see Table 4.5), the Enzo FS in the space 2 is 16 ms faster than the linear search and the HZO. In the space 4, the difference is bigger, reaching 197 ms compared with the linear search. In terms of distance metric, the results for the approximate nearest neighbors produced by the Enzo FS are presented in Table 4.6. For the space 2, the Enzo FS returns all the nearest neighbors. For the space 4, the Enzo FS presents 70% of the results being the nearest neighbors and an average relative error ε of 0,5. Analyzing in terms of the error factor c = 1 + ε, we can say that, on average, an approximate nearest neighbor produced by the Enzo FS is 1,5 cells away from the nearest neighbor. For the two spaces observed, twenty queries are performed in total. In this broader overview, observing Figure 4.9 we can see that 85% of the queries return the nearest neighbor. From the rest of the results, 10% present a relative error ε = 1, which means that the neighbor is 2 cells away from the nearest neighbor. The remaining 5% of the queries return neighbors with ε = 3; the reasoning for the error factor is the same, and c = 4. Globally, the Enzo FS results are, on average, 0,25 cells away from the nearest neighbor.


Figure 4.9: Enzo FS relative error distribution over all queries performed.

Conclusion The problems encountered for the HZO are basically the same for the Enzo FS:

• The Hilbert Curve generates much more data than the dataset used to test the heuristic;

• Both heuristics work in the lower-dimensional spaces apparently because those spaces are densely populated and, in general, the heuristics only generate a few cells (up to 4) before finding an approximate nearest neighbor.

On the other hand, the Enzo FS presents an average ε 28,6% lower than the HZO (from 0,35 to 0,25) and reduces the maximum value of ε by 25% (from 4 to 3). In terms of runtime evaluation, the Enzo FS is overall better than the HZO and, therefore, better than the linear search.

4.8 Experiment 3: Enzo Reduced Space

The Enzo Reduced Space (RS) tries to overcome the fact that neither the HZO nor the Enzo FS worked on spaces with more than 4 dimensions. The goal of this heuristic is to continue to reduce the ε as well as the runtime.

Suggestion Since the main problem of both heuristics is related to the data generated by the Hilbert curve, the idea of the Enzo Reduced Space (RS) is to simply order the dataset according to the curve, without generating any more data. On top of the curve ordering, it uses a B-tree variant in Java, therefore using the curve for secondary key retrieval.
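As a rough illustration of this idea, the sketch below builds such a curve-ordered index (my own sketch: java.util.TreeMap stands in for the B-tree variant used in the prototype, and hValueOf is a hypothetical placeholder for the mapping performed by the Mapping System):

import java.math.BigInteger;
import java.util.List;
import java.util.TreeMap;

public final class EnzoRsIndex {

    // Orders the dataset by the curve: each point is stored under its h-value,
    // which acts as the secondary key.
    public static TreeMap<BigInteger, double[]> build(List<double[]> dataset) {
        TreeMap<BigInteger, double[]> index = new TreeMap<>();
        for (double[] point : dataset) {
            index.put(hValueOf(point), point);
        }
        return index;
    }

    // Hypothetical stand-in for the Mapping System, which maps a point to its h-value.
    static BigInteger hValueOf(double[] point) {
        throw new UnsupportedOperationException("provided by the Mapping System");
    }
}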

Figure 4.10: Enzo RS algorithm scheme (the 4×4 Hilbert grid with h-values 0–15, unrolled into its one-dimensional curve order 0, 1, ..., 15, on which the entry and exit branches move over the curve-ordered data).

Algorithm Figure 4.10 illustrates with an example how this algorithm works. The cell 10 is the query point and the other colored cells are the population of the space. The idea is basically the same as the Enzo FS but, in this case, the branches do not walk on the curve itself but on the existing data ordered by the curve. We insert the query point into the ordered dataset, locate it with the help of the B-tree, and then locate its neighbors. Since the data are now in one dimension, we pick the immediate left and right neighbors of the query point. The two points are then compared in terms of distance to the query point, and the closer one is the approximate nearest neighbor.
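The lookup itself can then be sketched as follows (again my own illustration on top of the TreeMap stand-in; instead of physically inserting the query point into the dataset, the sketch asks the map directly for the closest lower and higher h-values):

import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

public final class EnzoRsSearch {

    // Returns the approximate nearest neighbor of qPoint, whose h-value is qH.
    public static double[] search(TreeMap<BigInteger, double[]> index,
                                  BigInteger qH, double[] qPoint) {
        Map.Entry<BigInteger, double[]> left = index.lowerEntry(qH);    // immediate left neighbor on the curve
        Map.Entry<BigInteger, double[]> right = index.higherEntry(qH);  // immediate right neighbor on the curve
        if (left == null) {
            return right == null ? null : right.getValue();
        }
        if (right == null) {
            return left.getValue();
        }
        // The candidate closest to the query point in the original space wins.
        return distance(qPoint, left.getValue()) <= distance(qPoint, right.getValue())
                ? left.getValue() : right.getValue();
    }

    // Euclidean distance between two points of the original multidimensional space.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}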

Spaces   Linear Search   HZO     Enzo FS   Enzo RS
S2        0,047           0,047   0,031     0,078
S4        0,312           0,289   0,115     0,67
S5        2,761           n/d     n/d       2,669
S10       4,587           n/d     n/d       3,853
S30      11,091           n/d     n/d       7,256
S36       0,686           n/d     n/d       0,608

Table 4.7: Enzo RS runtime results

Evaluation This heuristic is tested and evaluated in the same context as the previous heuristics and the results are now presented. Concerning the runtime performance (see Table 4.7), the Enzo RS generally has worse performance in the spaces with lower dimensions {2, 4} and better performance in the spaces with higher dimensions {5, 10, 30, 36}, achieving its best result at dimension 30 with 3,835 seconds less than the linear search. Figures 4.11 and 4.12 present runtime graphics for the space groups {2, 4, 36} and {5, 10, 30}, respectively. The spaces in each graphic are sorted in ascending order of resolution, amount of data and, finally, dimensions. Figure 4.11 presents the analysis for the spaces with less data and a maximum resolution of 8 (for the space 36). Even so, the Enzo RS is faster to compute the space 36 than the space 4. The main difference between these spaces is the average neighbor distance, which is 1 for the spaces 2 and 4 and 34,7 for the space 36. Figure 4.12 presents the analysis for the spaces with a higher amount of data (almost 100.000), which is equal for the three spaces. They all have the same resolution of 16 and differ only in dimensions and in the average distance of neighbors, which is 332,3, 629,2 and 943,4 for the spaces 5, 10 and 30, respectively. Since this group is more even than the previous one, it seems that the higher the average neighbor distance and the number of dimensions, the faster the Enzo RS computes the space.

Figure 4.11: Enzo RS runtime performance compared with the linear search - first group of analysis.


Figure 4.12: Enzo RS runtime performance compared with the linear search - second group of analysis.

In terms of distance metric, the results for the approximate nearest neighbors produced by this heuristic are presented in Table A.2. The table shows the results for each of the 10 query points in each of the 6 spaces tested. Table 4.8 shows the relative error ε, rounded to two decimal places, for each of the 10 query points in each space. For the space 2, the Enzo RS returns all the nearest neighbors, since ε = 0 for all the queries. For the spaces 4, 5, 10, 30 and 36, the percentage of nearest neighbors drops to 80, 10, 20, 20 and 0%, respectively. Generally speaking, the mean relative error ε seems to grow as the number of dimensions increases. For the 6 spaces observed, sixty queries are performed in total.

Query Point   S2    S4     S5     S10    S30    S36
P0             0     3      1,46   0,08   1,74   1,49
P1             0     0      0,89   0,61   0,29   0,89
P2             0     0      0,19   1,54   1,59   0,71
P3             0     0      0,24   0      1,78   0,37
P4             0     0      0,23   0      0      1
P5             0     1      0,88   0,83   0      0,92
P6             0     0      2,15   0,73   0,78   0,88
P7             0     0      0      1,02   1,26   1,51
P8             0     0      0,22   0,59   0,81   0,43
P9             0     0      1,39   1,49   0,49   0,91
Mean Value     0     0,40   0,77   0,69   0,87   0,91

Table 4.8: Enzo RS Relative error ε from space 2 to 36.

In this broader overview, and looking at Figure 4.13, we can see that 38% of the queries return the nearest neighbor. Among the remaining approximate results, 40% have ε ∈ ]0, 1], 18% have ε ∈ ]1, 2] and just 3% have ε ∈ ]2, 3]. Globally, the Enzo RS presents an average relative error of 0,61; restricted to the spaces 2 and 4, its results are, on average, 0,2 cells away from the nearest neighbor.


Figure 4.13: Enzo RS relative error distribution over all queries performed.

Conclusion The problem of the huge amounts of data generated by the Hilbert curve is overcome, and this heuristic presents results in all the spaces tested. It even shows better runtime performance than the linear search in dimensions over 4. Its good performance seems to be related to the low density of the space and the high number of dimensions. On the other hand, the Enzo RS presents worse runtime results in lower-dimensional spaces when compared with the HZO or the Enzo FS. In terms of distance metrics, and looking only at the spaces 2 and 4, the Enzo RS presents the lowest relative error ε when compared with the previous heuristics.


Figure 4.14: Runtime comparison between the heuristics.

4.9 Summary

This chapter described the solution proposed to explore the behavior of fractals, especially the Hilbert space-filling curve used as an access method for multidimensional data in low-dimensional spaces. Section 4.1 described the motivation for this thesis. Section 4.2 presented my Approximate Space-Filling Curve Framework as the plan to develop, test and evaluate my proposal. Section 4.3 explained that the data used to test my proposal result from applying the Subspace tree to high-dimensional data. Section 4.4 described how the mapping of the data is done using the Uzagezen 0.2 library. Section 4.5 presented the results of performing similarity search using linear search. The following sections 4.6, 4.7 and 4.8 presented the heuristic proposals, their tests and their evaluation according to the metrics presented in section 4.2.


Figure 4.15: Average relative error comparison between the heuristics.

As said before, the HZO (section 4.6) and the Enzo FS (section 4.7) did not compute results in spaces with more than 4 dimensions because they require high computational resources, namely space. For this reason, Figures 4.14 and 4.15 present the overall comparison only for the spaces in which those heuristics produced results. Concerning the runtime performance, Figure 4.14 shows that the Enzo FS presented the lowest runtime to search for an approximate nearest neighbor. Figure 4.15 shows that the Enzo FS (section 4.7), despite having the lowest runtime, did not present the lowest mean relative error when searching for the approximate nearest neighbor, being surpassed by the Enzo RS.


Figure 4.16: Enzo RS runtime performance along the spaces.


Figure 4.17: Enzo RS average relative error along the spaces.

Overall, in these spaces, the Enzo FS presents the most acceptable balance between the mean relative error and the runtime performance. On the other hand, the Enzo RS is the only heuristic which worked beyond 4 dimensions. Its global runtime performance can be seen in Figure 4.16. The spaces in the graphic are sorted in ascending order of resolution, amount of data to compute, dimensions and average neighbor distance. It is possible to observe that the Enzo RS is faster than the linear search from the space 36 onwards, although it only starts to stand out from the space 10. The lower the density of the space, the greater the distance between the Enzo RS curve and the linear search curve. The Enzo RS average relative error seems to grow as the number of dimensions increases (Figure 4.17).

Chapter 5

Conclusions

In the context of my MSc thesis, I proposed to study the problem of indexing low-dimensional data, up to thirty dimensions, using the Hilbert space-filling curve as a dimensionality reduction technique that is known to have superior clustering properties. To address the problem, I created a framework named Approximate Space-Filling Curve that structured all the steps of the proposal. A few experiments were performed with the resulting prototype. This framework had two main phases: the Mapping System and the Approximate Nearest Neighbor Heuristic System (ANN Heuristic System). The first system mapped each of the multidimensional points in the dataset to the Hilbert curve, generating its h-value. In the second system, I developed a heuristic, tested it on the curve and re-adapted it based on its performance. As a result of this process, three heuristics were created: the Hypercube Zoom Out (HZO), the Enzo Full Space (FS) and finally the Enzo Reduced Space (RS). The HZO searches for the approximate neighbor starting at the highest-resolution hypercube, excluding the query point itself. It starts at the entry point of the hypercube and ends the first iteration at the exit point of the same hypercube. If it does not find any neighbor, it performs a zoom out to the hypercube of the next resolution, always starting at an entry point and ending at an exit point. Once it finds the first neighbor, the heuristic ends. The Enzo FS searches for the approximate neighbor by launching two search branches that walk from the query point towards the lowest-resolution hypercube entry and exit points. If both branches find neighbors, they are compared and the one closest to the query point is the approximate nearest neighbor.

Once tested, both the HZO and the Enzo FS failed to present any results in a time considered acceptable (less than a day) in spaces with more than 4 dimensions. They turned out to produce huge amounts of data, since they generate the h-value of each potential neighbor and verify whether it exists in the dataset. In the worst-case scenario, the space 30 with a curve of order 16, they can generate the number of cells existent in the maximum-order curve, which is an amazing 2^480 ≈ 3,121749 × 10^144 cells. The heuristics required high computational resources, namely space, and due to this constraint those results could not be computed.

Nevertheless, in the spaces 2 and 4, regarding the runtime performance, the HZO revealed itself to be slightly better than a linear search and the Enzo FS slightly better than the HZO, although we must take into account that their results are approximate. In terms of the overall approximate error distance, 85% of the results of both heuristics are the nearest neighbor to the given query point. Globally, the results are, on average, 0,35 cells away from the nearest neighbor for the HZO and 0,25 for the Enzo FS.

Unlike the previous two, which use the Hilbert curve as a direct access method, the Enzo RS uses the Hilbert curve for secondary key retrieval combined with a B-tree variant. Instead of generating neighbors and verifying whether they exist in the data, it uses the data ordered according to the curve. Using the B-tree variant and the h-value of the query point, the immediate neighbors of the query point on the curve are determined. Only later, when writing this report and after testing the heuristics, did I notice that Faloutsos and Roseman had already suggested this heuristic in [Faloutsos and Roseman, 1989]. Their algorithm was therefore presented in Section 3.6 as related work. Nevertheless, they only tested the heuristic up to 4 dimensions. The Enzo RS, unlike the previous two, presented results for all the dimensions tested (2, 4, 5, 10, 30 and 36). Concerning the runtime performance, it revealed itself to be worse than the linear search in spaces up to 4 dimensions. Concerning the error of the approximate neighbor distance in these spaces, the Enzo RS presented an average relative error of 0,2. Therefore, this heuristic presented the lowest average approximate error but the highest runtime in these spaces. Beyond these spaces, the Enzo RS ran in less time than the linear search in all dimensions tested, never forgetting that we are talking about approximate neighbors, while the linear search always provides the nearest one. The average relative error of the Enzo RS seems to grow along the spaces. Globally, over all the query points tested, the Enzo RS presented an average relative error of 0,61 cells away from the nearest neighbor. Although this relative error is higher when compared with the other heuristics, the reader should not forget that the Enzo RS was tested on forty more queries than the previous ones, so we cannot compare the overall relative error between them. Considering the results obtained from the experiments, we can conclude that a space-filling curve can operate in a low-dimensional space (up to thirty dimensions), in terms of indexing and searching. However, some observations can be made:

• The Hilbert curve requires high space resources when generating the cell keys to apply similarity search techniques. Therefore, the curve seems to be more suited to highly dense spaces when used as a direct access method;

• The Hilbert curve seems to be more suited to sparse spaces when applied to the original dataset combined with a B-tree variant.

5.1 Contributions

The main contributions of this work are summarized below:

Hilbert curve tested as a direct access method up to 36 dimensions. I tested the performance of the Hilbert space-filling curve as a direct access method with two heuristics up to 36 dimensions. The results indicated that the curve generates too much data, making it difficult to use beyond 4 dimensions.

Hilbert curve tested as a secondary key retrieval up to 36 dimensions. I tested the performance of the same curve as a secondary key combined with a B-tree variant up to 36 dimensions. The results showed that this combination worked in all six spaces tested, including the space with 36 dimensions. Compared with a linear search, it is generally faster as the number of dimensions and the order of the curve increase, although it provides approximate results.

5.2 Future Work

Despite the interesting results, there are many aspects in which this work could be improved. I suggest the following:

Increase the space Both the HZO and the Enzo FS did not compute their results in spaces with more than four dimensions due to space restrictions. The experiments could be repeated on another computer with more space resources in order to compute the experiments that did not provide results due to this problem.

Extend the group of query points The group of query points could be increased in order to test, for example, 100 points instead of the 10 tested in this work. A larger group could provide a more solid set of results.

Extend the evaluation metrics It would also be interesting to evaluate the heuristics considering other evaluation metrics, such as the number of accesses to disk or memory. It would allow analyzing the clustering property empirically beyond the three dimensions tested in [Moon et al., 2001].

Explore new heuristics The heuristics proposed are simple and had the purpose of testing the Hilbert curve performance as a low-dimensional index. However, a more complete heuristic could be explored in order to optimize the use of the space-filling curves.

Explore new dimensions Since the curve worked up to 36 dimensions, it could be tested in higher dimensions, applied directly to the original data as in the Enzo RS. It would be interesting to know the Hilbert curve's limitations in terms of dimensions.

Bibliography

B.-U. Pagel, F. Korn, and C. Faloutsos. Deflating the dimensionality curse using multiple fractal dimensions. In Proceedings of the 16th International Conference on Data Engineering, ICDE ’00, pages 589–, Washington, DC, USA, 2000. IEEE Computer Society. ISBN 0-7695-0506-6. URL http://dl.acm.org/citation.cfm?id=846219.847327.

M. Barnsley. Fractals Everywhere: New Edition. Dover books on mathematics. Dover Publications, 2013. ISBN 9780486320342. URL https://books.google.pt/books?id=PbMAAQAAQBAJ.

J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509–517, Sept. 1975. ISSN 0001-0782. doi: 10.1145/361002.361007. URL http://doi.acm.org/10.1145/361002.361007.

J. L. Bentley and J. H. Friedman. Data structures for range searching. ACM Comput. Surv., 11(4):397–409, Dec. 1979. ISSN 0360-0300. doi: 10.1145/356789.356797. URL http://doi.acm.org/10.1145/356789.356797.

S. Berchtold, D. A. Keim, and H.-P. Kriegel. The x-tree: An index structure for high-dimensional data. pages 28–39, 1996.

S. Berchtold, C. Böhm, and H.-P. Kriegel. The pyramid-technique: Towards breaking the curse of dimensionality. SIGMOD Rec., 27(2):142–153, June 1998. ISSN 0163-5808. doi: 10.1145/276305.276318. URL http://doi.acm.org/10.1145/276305.276318.

C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv., 33(3):322–373, Sept. 2001. ISSN 0360-0300. doi: 10.1145/502807.502809. URL http://doi.acm.org/10.1145/502807.502809.

A. R. Butz. Alternative algorithm for hilbert’s space-filling curve. IEEE Transactions on Computers, 20(4):424–426, 1971.

H.-L. Chen and Y.-I. Chang. All-nearest-neighbors finding based on the hilbert curve. Expert Syst. Appl., 38(6):7462–7475, June 2011. ISSN 0957-4174. doi: 10.1016/j.eswa.2010.12.077. URL http://dx.doi.org/10.1016/j.eswa.2010.12.077.

B. Clarke, E. Fokoue, and H. Zhang. Principles and Theory for Data Mining and Machine Learning. Springer Series in Statistics. Springer New York, 2009. ISBN 9780387981352. URL https://books.google.pt/books?id=RQHB4_p3bJoC.

D. Comer. Ubiquitous b-tree. ACM Comput. Surv., 11(2):121–137, June 1979. ISSN 0360-0300. doi: 10.1145/356770.356776. URL http://doi.acm.org/10.1145/356770.356776.

R. Dickau. Hilbert and moore 3d fractal curves, January 2015. URL http://demonstrations.wolfram.com/HilbertAndMoore3DFractalCurves/.

K. Falconer. Fractal Geometry: Mathematical Foundations and Applications. Wiley, 2007. ISBN 9780470848616. URL http://books.google.pt/books?id=xTKvG_j4LvsC.

C. Faloutsos. Gray codes for partial match and range queries. IEEE Trans. Softw. Eng., 14(10):1381–1393, Oct. 1988. ISSN 0098-5589. doi: 10.1109/32.6184. URL http://dx.doi.org/10.1109/32.6184.

C. Faloutsos and K.-I. D. Lin. Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. pages 163–174, 1995.

C. Faloutsos and S. Roseman. Fractals for secondary key retrieval. In Proceedings of the eighth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, PODS ’89, pages 247–252, New York, NY, USA, 1989. ACM. ISBN 0-89791-308-6. doi: 10.1145/73721.73746. URL http://doi.acm.org/10.1145/73721.73746.

R. Finkel and J. Bentley. Quad trees: a data structure for retrieval on composite keys. Acta Informatica, 4:1–9, 1974. ISSN 0001-5903. doi: 10.1007/BF00288933. URL http://dx.doi.org/10.1007/BF00288933.

M. Frame and B. Mandelbrot. Fractals, Graphics, and Mathematics Education. MAA notes. Mathematical Association of America, 2002. ISBN 9780883851692. URL https://books.google.pt/books?id=Wz7iCaiB2C0C.

T. C. Hales. Jordan’s proof of the Jordan curve theorem. Studies in Logic, Grammar and Rhetoric, 10(23):45–60, 2007.

C. Hamilton. Compact hilbert indices. Technical report, Faculty of Computer Science, 6050 University Ave., Halifax, Nova Scotia, B3H 1W5, Canada, 2006.

C. H. Hamilton and A. Rau-Chaplin. Compact hilbert indices: Space-filling curves for domains with unequal side lengths. Inf. Process. Lett., 105(5):155–163, Feb. 2008. ISSN 0020-0190. doi: 10.1016/j.ipl.2007.08.034. URL http://dx.doi.org/10.1016/j.ipl.2007.08.034.

K.-I. Lin, H. V. Jagadish, and C. Faloutsos. The tv-tree: an index structure for high-dimensional data. VLDB Journal, 3:517–542, 1994.

H. V. Jagadish. Linear clustering of objects with multiple attributes. SIGMOD Rec., 19(2):332–342, May 1990. ISSN 0163-5808. doi: 10.1145/93605.98742. URL http://doi.acm.org/10.1145/93605.98742.

J. Lawder and P. King. Using space-filling curves for multi-dimensional indexing. In B. Lings and K. Jeffery, editors, Advances in Databases, volume 1832 of Lecture Notes in Computer Science, pages 20–35. Springer Berlin Heidelberg, 2000. ISBN 978-3-540-67743-7. doi: 10.1007/3-540-45033-5_3. URL http://dx.doi.org/10.1007/3-540-45033-5_3.

D. B. Lomet and B. Salzberg. The hb-tree: a multiattribute indexing method with good guaranteed performance. ACM Trans. Database Syst., 15(4):625–658, Dec. 1990. ISSN 0362-5915. doi: 10.1145/99935.99949. URL http://doi.acm.org/10.1145/99935.99949.

N. Mamoulis. Spatial Data Management. Synthesis Lectures on Data Management. Morgan & Claypool, 2012. ISBN 9781608458325. URL http://books.google.pt/books?id=6z5grzUcPhoC.

B. Mandelbrot. The fractal geometry of nature. Times Books, 1982.

B. B. Mandelbrot. Les objets fractals: forme, hasard et dimension. Flammarion, Paris, 1975.

B. Moon, H. V. Jagadish, C. Faloutsos, and J. H. Saltz. Analysis of the clustering properties of the hilbert space- filling curve. IEEE Transactions on Knowledge and Data Engineering, 13:2001, 2001.

G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, 1966.

J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst., 9(1):38–71, Mar. 1984. ISSN 0362-5915. doi: 10.1145/348.318586. URL http://doi.acm.org/10.1145/348.318586.

B. C. Ooi, K. J. McDonell, and R. Sacks-Davis. Spatial kd-tree: An indexing mechanism for spatial database. In IEEE COMPSAC Conference, pages 433–438, 1987.

R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill international editions: Computer science series. McGraw-Hill Companies, Incorporated, 2002. ISBN 9780072465631. URL http://books.google.pt/books?id=JSVhe-WLGZ0C.

J. T. Robinson. The kdb-tree: a search structure for large multidimensional dynamic indexes. In Proceedings of the 1981 ACM SIGMOD international conference on Management of data, pages 10–18. ACM, 1981.

R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases, VLDB ’98, pages 194–205, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 1-55860-566-5. URL http://dl.acm.org/citation.cfm?id=645924.671192.

A. Wichert. Content-based image retrieval by hierarchical linear subspace method. J. Intell. Inf. Syst., 31(1):85–107, Aug. 2008. ISSN 0925-9902. doi: 10.1007/s10844-007-0041-4. URL http://dx.doi.org/10.1007/s10844-007-0041-4.

A. Wichert. Subspace tree. In Content-Based Multimedia Indexing, 2009. CBMI ’09. Seventh International Work- shop on, pages 38 –43, june 2009. doi: 10.1109/CBMI.2009.14.

A. Wichert. Intelligent Big Multimedia Databases. World Scientific, 2015.

C. Yu. High-Dimensional Indexing: Transformational Approaches to High-Dimensional Range and Similarity Searches. Lecture notes in computer science. Springer, 2002. ISBN 9783540441991. URL http://books.google.pt/books?id=CgsFwuSC4q8C.

Appendix A

Appendix

A.1 Chapter: Fractals

Decimal Code   Binary Code   Gray Code
0              000           000
1              001           001
2              010           011
3              011           010
4              100           110
5              101           111
6              110           101
7              111           100

Table A.1: The conversion between Decimal, Binary and Gray code with three bits.

Theorem 2. Given a 2^(k+n) × 2^(k+n) grid region, the average number of clusters within a 2^k × 2^k query window is

N_2(k, k + n) = ((2^n − 1)^2 · 2^(3k) + (2^n − 1) · 2^k + 2^n) / (2^(k+n) − 2^k + 1)^2    (A.1)

A.2 Chapter: Approximate Space-Filling Curve

Query Point   S2   S4    S5      S10     S30      S36
P0             1    4     839     349     1835     87
P1             1    1     613    1223     1541     70
P2             1    1     503     606     1788     24
P3             1    1     576    1516     2669     56
P4             1    1     307    1411      853     14
P5             1    2     318     214     1599     69
P6             1    1     312     974     1308    169
P7             1    1     760    1086     1881     98
P8             1    1     339     740      989     20
P9             1    1     514     898     2022     65
Mean Value     1    1,4   508,1   901,7   1648,5   67,2

Table A.2: Enzo RS Results from space 2 to 36.
