Masaryk University Faculty of Informatics

Similarity Search using Hazelcast In-memory Data Grid

Master’s Thesis

Bc. Ľudovít Labaj

Brno, Spring 2018

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Ľudovít Labaj

Advisor: RNDr. David Novák, Ph.D


Acknowledgements

I would like to thank my thesis advisor RNDr. David Novák, Ph.D. for his time and guidance on this very challenging and interesting assignment.

Abstract

The metric space is a universal approach to searching for similar objects in a collection by providing a sample object. This approach is very powerful but computationally expensive. This thesis introduces a prototype of a distributed system which uses Hazelcast In-Memory Data Grid for data storage and in-memory processing. The system is then evaluated in terms of quality of results and performance.

Keywords

similarity search, metric space, distributed systems, NoSQL


Contents

1 Introduction

2 Similarity Search in Metric Space
  2.1 Metric Space
  2.2 Distance Functions
    2.2.1 Minkowski Distances
    2.2.2 Edit Distance
  2.3 Similarity Queries
    2.3.1 Range Query
    2.3.2 Nearest Neighbor Query
    2.3.3 Similarity Join
  2.4 Use of Indexes
    2.4.1 M-Chord
    2.4.2 M-Index

3 NoSQL Databases
  3.1 Classification
  3.2 MongoDB
  3.3 Cassandra
  3.4 Hazelcast In-Memory Data Grid

4 Index Structure and Implementation
  4.1 Index Structure
    4.1.1 kNN Search Process
  4.2 Implementation

5 Testing and Evaluation
  5.1 Precision of Search Results
  5.2 Performance Tests
    5.2.1 Latency
    5.2.2 Throughput

6 Conclusion
  6.1 Future work

References

A Content of Attached Archive

1 Introduction

In the early days of the Internet, it mostly contained structured, textual data in the form of web pages and articles. Searching in such data is not a difficult task, because structured textual data can be sorted, categorized and indexed, and is therefore easy to search in. As technologies advanced, new kinds of data became available in the form of images, audio, video and others, where conventional search approaches stopped being effective. For some forms of data it might be very difficult or even impossible to formulate a search query using conventional approaches; it may be much easier to give the search engine a sample object and let it retrieve the most similar entries from the database. The objective of Similarity Search is to create techniques and algorithms for efficient searching in large collections of data, given a sample object as the input parameter. A fundamental problem with unstructured data is that it cannot be sorted with respect to some of its attributes; data organization therefore becomes a key part of any similarity search system, in terms of both quality of results (the system returns relevant results) and search performance (low latency).

Thesis Structure

Chapter 2 gives a theoretical introduction to Similarity Search, its basic concepts, definitions and terminology. Chapter 3 describes distributed data stores, a technology often used to store unstructured or semi-structured datasets, such as objects for similarity search. The search structure and its implementation are described in Chapter 4, and the evaluation of the system, from both the performance and the quality-of-results perspective, is described in Chapter 5. The last Chapter 6 contains the conclusion and possibilities for future improvements.


2 Similarity Search in Metric Space

This chapter describes the theoretical foundations of similarity search in metric spaces. The term metric space is defined in Section 2.1, followed by distance functions in Section 2.2 and operations (similarity queries) in Section 2.3. Section 2.4 describes the possible use of indexes to speed up the search process. The theory of metric spaces is a well-studied topic with many textbooks and articles published [1, 2]. For the purposes of this thesis, the book Similarity Search: The Metric Space Approach [3] by Zezula, Amato, Dohnal and Batko was used to write Sections 2.1 to 2.4.

2.1 Metric Space

In general, similarity search can be seen as a process of finding data objects in a database according to their distance to a query object – an input object specified by the user. The distance is determined by a distance function, and the constraints on which objects should be returned are called similarity queries. A metric space ℳ is defined by a tuple ℳ = (풟, d), where 풟 is a domain of objects and d is a distance function (also called metric function or just metric). For a function d : 풟 × 풟 → R to be a valid distance function, the following properties must hold:

∀x, y ∈ 풟, d(x, y) ≥ 0 (non-negativity)
∀x, y ∈ 풟, d(x, y) = d(y, x) (symmetry)
∀x, y ∈ 풟, x = y ⇔ d(x, y) = 0 (identity)
∀x, y, z ∈ 풟, d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)

There are situations where the symmetry property does not hold – for example, the distance between two buildings in a city can differ in each direction because of the layout of roads and traffic rules (one-way roads etc.). Such non-symmetric metric functions are called quasi-metrics.



Figure 2.1: Minkowski distances examples for L1, L2, L6 and L∞ [3].

2.2 Distance Functions

Distance functions represent a way to determine how close individual objects in a metric space are. They return a number representing the distance between two objects x and y from the same domain 풟. The return type can be either discrete (for example the Edit Distance, Section 2.2.2, returns a natural number) or continuous (for example the Minkowski Distances, Section 2.2.1, return a real number).

2.2.1 Minkowski Distances

The Minkowski Distances, or Lp metrics, are a family of distance functions where p is an input parameter. They are defined on two n-dimensional vectors X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) of real numbers:

L_p(X, Y) = \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}

L1 is also known as the Manhattan distance (or the City-Block distance), and L2 is the Euclidean distance. One special case is p = ∞, or L∞, which is called the maximum distance, defined as:

L_\infty(X, Y) = \max_{i=1}^{n} |x_i - y_i|

Note that due to the absolute value of the differences between x_i and y_i, the order of the parameters does not influence the result; in other words, all Lp distances are symmetric. Figure 2.1 shows examples of L1, L2, L6 and L∞ where all points are at the same distance from the middle according to the respective distance function.
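To make the definitions concrete, the following minimal Java sketch (Java is also the implementation language used later in this thesis) computes L_p for a finite p and the maximum distance L_∞; the class and method names are illustrative only.

final class MinkowskiDistance {

    // Lp distance for a finite p >= 1 between two vectors of equal length.
    static double lp(double[] x, double[] y, double p) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.pow(Math.abs(x[i] - y[i]), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    // Special case p = infinity: the maximum of the coordinate differences.
    static double lMax(double[] x, double[] y) {
        double max = 0.0;
        for (int i = 0; i < x.length; i++) {
            max = Math.max(max, Math.abs(x[i] - y[i]));
        }
        return max;
    }
}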


2.2.2 Edit Distance

In contrast to the Lp distances, which are defined on numeric vectors, the Edit Distance is used to calculate the distance between two sequences of symbols (strings). The distance between string x = x1 x2 ... xn and string y = y1 y2 ... ym is defined as the minimum number of atomic edit operations needed to transform string x into string y. The atomic edit operations are:

∙ insert the character c into the string x at position i

  ins(x, i, c) = x1 x2 ... xi c xi+1 ... xn

∙ delete the character from string x at position i

  del(x, i) = x1 x2 ... xi−1 xi+1 ... xn

∙ replace the character at position i in string x with the new character c

  replace(x, i, c) = x1 x2 ... xi−1 c xi+1 ... xn

Due to the representation of strings in computers, these edit operations may have different computational costs, which can be addressed by assigning weights to the edit operations. However, assigning different weights may violate symmetry. For example, let w_ins = 2, w_del = 1 and w_replace = 1:

d_edit(“combine”, “combination”) = 9 (replace e → a, insert t, i, o, n)

d_edit(“combination”, “combine”) = 5 (replace a → e, delete t, i, o, n)

As long as the weights of the insert and delete operations are equal, the symmetry property holds regardless of the weight of the replace operation. The replace operation can also have a different weight for different values – for example, replacing a → b can have a different weight than a → c, but a → b must have the same weight as b → a.
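The weighted edit distance can be computed with the standard Wagner–Fischer dynamic-programming algorithm. The sketch below is illustrative and uses a single replacement weight for all character pairs; the operation weights are passed as parameters.

final class EditDistance {

    // Minimum total cost of transforming string x into string y using
    // insert, delete and replace operations with the given weights.
    static int distance(String x, String y, int wIns, int wDel, int wReplace) {
        int n = x.length(), m = y.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = i * wDel;   // delete all of x
        for (int j = 1; j <= m; j++) d[0][j] = j * wIns;   // insert all of y
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int replace = d[i - 1][j - 1]
                        + (x.charAt(i - 1) == y.charAt(j - 1) ? 0 : wReplace);
                int delete = d[i - 1][j] + wDel;
                int insert = d[i][j - 1] + wIns;
                d[i][j] = Math.min(replace, Math.min(delete, insert));
            }
        }
        return d[n][m];
    }
}

With the weights from the example above, distance("combine", "combination", 2, 1, 1) returns 9 while distance("combination", "combine", 2, 1, 1) returns 5, reproducing the asymmetry.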



Figure 2.2: Range query for query object q with radius r [3].

2.3 Similarity Queries

A similarity query specifies constraints for selection given a query object q, typically expressed as a distance. The result contains all objects in the database which satisfy the selection, usually ordered by their distance to the query object q. The following sections discuss some basic types of similarity queries.

2.3.1 Range Query

The similarity range query is probably the most intuitive one; it basically says “find me all objects that are at most ‘r’ distance units away”. The query is specified by a query object q ∈ 풟 and a distance (often called radius) r ∈ R≥0. Formal definition:

R(q, r) = {o ∈ X, d(o, q) ≤ r}

In general, the query object q does not need to exist in the database X, but it has to belong to the metric domain 풟. It is also possible for the radius to be zero, which means we are looking for the existence of one specific object in the database; this is also called a point query or exact match. This type of query is mostly used in the delete operation, where we want to locate and delete a specific object.

2.3.2 Nearest Neighbor Query

The range query might be the most intuitive one, but it might not be the most practical one. The problem is that the maximum distance



Figure 2.3: 3NN query for query object q [3].

is specified as an input parameter – the notion of distance is often too abstract for practical usage by humans and is heavily dependent on the used domain and distance function. That is why an alternative way of finding similar objects was created – the nearest neighbor query, or k-th nearest neighbor query, kNN for short. This query specifies how many objects nearest to the query object q should be returned in the response set. The k in kNN, k ∈ N+, specifies the number of nearest objects – for example 2NN(q) searches for the 2 nearest objects, 3NN(q) for the 3 nearest objects etc. The formal definition is:

kNN(q) = {R ⊆ X, |R| = k ∧ ∀x ∈ R, y ∈ X ∖ R : d(q, x) ≤ d(q, y)}

In other words – “find me the ‘k’ objects that are closest to my specified object”. If more than one object lies at the same distance from q, the order is usually arbitrary.
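A direct, non-indexed evaluation of kNN(q) simply sorts the whole database by distance to q and takes the first k objects; such a sequential scan is what the index structures in Section 2.4 try to avoid, and it is also the natural way to obtain the exact answer for evaluation purposes. A minimal sketch, assuming a generic distance function d:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class NaiveKnn {

    // Returns the k objects of the database that are closest to the query object q.
    static <T> List<T> knn(List<T> database, T q, int k, ToDoubleBiFunction<T, T> d) {
        List<T> sorted = new ArrayList<>(database);
        sorted.sort(Comparator.comparingDouble((T o) -> d.applyAsDouble(q, o)));
        return sorted.subList(0, Math.min(k, sorted.size()));
    }
}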

2.3.3 Similarity Join

In today's world of information technologies, the biggest source of data is the Internet. However, those data may not be reliable – they can be unstructured, inaccurate and contain duplicates. Often a cleanup operation is needed, using a so-called similarity join. The similarity join between two datasets X ⊆ 풟 and Y ⊆ 풟 returns all pairs of objects (x ∈ X, y ∈ Y) whose distance does not exceed a given threshold µ ∈ R≥0, formally defined as:

J(X, Y, µ) = {(x, y) ∈ X × Y : d(x, y) ≤ µ}


Figure 2.4: Similarity self join with threshold µ [3].

If µ = 0, the similarity join has the same effect as the natural join known from relational algebra. When we apply the similarity join to the same dataset, X = Y, we talk about a similarity self join. The similarity self join can be used to detect duplicates or near-duplicates – objects that are so close to each other that we can merge them into one.
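A naive similarity self join compares all pairs of objects, which costs n(n − 1)/2 distance computations for n objects – exactly the kind of cost that the indexes in Section 2.4 aim to reduce. A minimal sketch, again assuming a generic distance function:

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

final class NaiveSimilarityJoin {

    // Returns all pairs (x, y), with x preceding y in the list, whose distance is at most mu.
    static <T> List<Map.Entry<T, T>> selfJoin(List<T> data, double mu,
                                              ToDoubleBiFunction<T, T> d) {
        List<Map.Entry<T, T>> result = new ArrayList<>();
        for (int i = 0; i < data.size(); i++) {
            for (int j = i + 1; j < data.size(); j++) {
                if (d.applyAsDouble(data.get(i), data.get(j)) <= mu) {
                    result.add(new AbstractMap.SimpleEntry<>(data.get(i), data.get(j)));
                }
            }
        }
        return result;
    }
}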

2.4 Use of Indexes

Applications using similarity search typically search in large datasets of millions, possibly billions of entries. Computing a single distance is quite expensive – for example, the time complexity of the Minkowski Distances is linear in the number of dimensions of the vectors. For that much data a naive linear search is not sufficient, and an indexing structure with a sophisticated search process is therefore needed. This section first describes three elementary techniques, namely ball partitioning, generalized hyperplane partitioning and Voronoi partitioning, that are used to create complex index structures like M-Chord [4] (Section 2.4.1), M-Index [5] (Section 2.4.2) and PPP-Codes [6]. Most of these indexes work in two phases – first they find a candidate set of objects, which is then refined by explicitly evaluating distances between the query object and all objects in the candidate set. Two of these structures are introduced and briefly described later in this section.


Figure 2.5: Ball partitioning example [3].

There are two main properties of indexes – they can be either centralized or distributed and either precise or approximate. The advantage of approximate indexes over precise ones is mostly performance, achieved by greatly reducing the number of distance calculations. For most applications approximate search is sufficient, because the user's and the system's points of view on what is similar differ anyway, and therefore small inaccuracies in the result set are not a concern. There are three elementary techniques used by indexes – ball partitioning, generalized hyperplane partitioning and Voronoi partitioning. The main goal of partitioning is to split the metric space into sections (partitions), so that the search process has to access only the selected partitions and not the whole dataset.

Ball Partitioning

The principle of ball partitioning is quite simple – take a reference point (also called pivot or anchor) p ∈ 풟 and a radius r ∈ R>0 and split the metric space S into two sections: S1 containing every data object oi ∈ S whose distance to the pivot p is less than or equal to r, and S2 containing all oi whose distance to the pivot p is greater than or equal to r. Formally written as:

S1 ← {oi | d(oi, p) ≤ r}

S2 ← {oi | d(oi, p) ≥ r}


Figure 2.6: Generalized hyperplane partitioning example [3].

The reason for both cases having ≤ and ≥ is to let a concrete algorithm decide how to handle ties, for example to let every such oi be inside S1, or to ensure balance between S1 and S2 etc. Figure 2.5 shows an example of a 2-dimensional metric space with pivot p and a circle of radius r dividing the space into S1 and S2.

Generalized Hyperplane Partitioning

A hyperplane in an n-dimensional space is a subspace whose dimension is one less than that of its ambient space. For example, a hyperplane of a 3D space is a 2D plane, a hyperplane of a 2D plane is a 1D line etc. Generalized hyperplane partitioning uses a hyperplane of a metric space S to split it into two subsets S1 and S2. The idea is to draw a hyperplane between two pivots p1, p2 ∈ 풟 such that all points on the hyperplane are at equal distance to both p1 and p2, therefore splitting the space into S1 – the subset of S where all objects oi are closer to p1 than to p2 – and S2, where all oi are closer to p2 than to p1. The formal definition of S1 and S2 is:

S1 ← {oi | d(p1, oi) ≤ d(p2, oi)}

S2 ← {oi | d(p1, oi) ≥ d(p2, oi)}

Figure 2.6 shows an example of a two-dimensional metric space with two pivots p1 and p2 and a thick line splitting the space into S1 and S2.


Figure 2.7: Voronoi partitioning example [5].

Voronoi Partitioning

Given a set of pivots P = {p1, p2, ..., pn}, P ⊆ 풟, Voronoi partitioning splits the space into polygonal regions (cells) Ci, i = 1, ..., n, such that for each point within a cell Ci the closest pivot is pi. Figure 2.7 shows an example of Voronoi partitioning for a 2-dimensional space and four pivots. It is also possible to create a recursive Voronoi partitioning by creating cells that are defined not by one closest pivot but by the l ∈ N+ closest pivots, splitting the space into smaller cells addressed with l pivots, C_{p1,p2,...,pl}.

2.4.1 M-Chord

The M-Chord utilizes two existing ideas: the iDistance index method and the Chord peer-to-peer protocol. The main idea of iDistance is to reduce a generally multi-dimensional metric space into a one-dimensional iDistance key space. The process is:

∙ split the data space into n partitions (clusters) Ci, i = 0, ..., n − 1,

∙ select a pivot pi for each Ci,

∙ calculate the iDistance key according to the formula

  iDist(x) = d(pi, x) + i · c


Figure 2.8: (a) visualization of assignment of iDistance keys to objects, (b) range query for query object q and radius r, showing which subsets of iDistance keys contain candidate set [4].

where x is an input object x ∈ Ci and c is a constant

If c is sufficiently large, all objects in Ci are mapped into the interval [i · c, (i + 1) · c). All objects are then stored in a B+-tree according to their iDistance keys. Figure 2.8(a) shows the assignment of iDistance keys to data objects; Figure 2.8(b) shows a range query for query object q and radius r. The highlighted parts of the iDistance key space represent a subset of keys which are then transformed back into the original metric space, where the highlighted areas show the space containing the candidate set for refinement. The Chord is a peer-to-peer structure which provides the functionality of a distributed hash table. It uses consistent hashing to uniformly map the domain of search keys to the domain of Chord keys [0, 2^m). Every node Ni is assigned a key Ki from [0, 2^m) and every node is responsible for a subset of keys (Ki−1, Ki] mod 2^m. Because Chord has a circle topology, every node has the addresses of its predecessor and successor stored in memory, and also a finger table with up to m other peers. Figure 2.9 shows an example of the Chord structure. Arrows around the circle pointing from Ni represent the subset of the Chord domain keys (Ki−1, Ki] managed by Ni.

The M-Chord index has the following five main features:


Figure 2.9: Chord structure of four nodes [4].

∙ generalizes the formula of iDistance to any metric space and maps the dataset to the domain of Chord keys [0, 2^m),

∙ distributes the [0, 2^m) domain among nodes, such that every node manages data from one interval (intervals don't overlap and the whole domain is covered),

∙ uses the Chord structure for routing,

∙ designs algorithms for range and kNN query,

∙ provides additional computation-reducing filtering for the query processing.

Generalization of iDistance

The iDistance is defined on spaces with a coordinate system, not on a general metric space. Because it uses numeric coordinates for partitioning and pivot selection, it needs to be generalized to any metric space, which can be done by choosing n pivots p0, p1, ..., pn−1 from a given sample S ⊆ 풟 known in advance. Splitting into clusters C0, C1, ..., Cn−1 is then done by applying Voronoi-like partitioning and assigning the indexed objects I to their corresponding clusters, formally:

Ci = {x ∈ I | d(pi, x) ≤ d(pj, x), 0 ≤ j < n}.


This modification has no impact on the iDistance functionality. During partitioning of the data space, all distances d(p0, x), ..., d(pn−1, x), x ∈ I are calculated and can be stored for later use during query processing for additional filtering.

M-Chord domain

In order to join iDistance's one-dimensional domain and the Chord's domain, we need to normalize the iDistance range to the [0, 2^m) interval using an order-preserving function h, therefore creating the M-Chord key-assignment formula for an object x ∈ Ci, i = 0, ..., n − 1:

mchord(x) = h(d(pi, x) + i · c).

The function h should also distribute keys uniformly, because Chord guarantees its efficiency when nodes have a uniform distribution on the node circle. Choosing the right function h is a well-studied topic and can be solved using, for example, a piecewise-linear transformation.

M-Chord data structure

At first, there is one node N1 containing all initial data S ⊆ 풟. This node performs initialization steps described above:

1. selects n pivots p0, p1, ..., pn−1 and creates clusters C0, ..., Cn−1,

2. the iDistance formula is used to determine the distribution of iDistance keys, which is needed to find a suitable h function for creating the M-Chord key-assignment formula,

3. N1 is assigned key 2^m − 1, therefore it manages the whole M-Chord domain [0, 2^m).

When another node Nnew joins the system, a node whose interval will be split needs to be determined first. Choosing the node may be specific to each implementation – for example, a node may be split due to storage capacity limitations or for load-balancing purposes. Suppose a node Ni managing (Ki−1, Ki] is about to be split. What it needs to do is to find a key Knew, Ki−1 < Knew < Ki, assign it to Nnew and transfer all data corresponding to the interval (Ki−1, Knew] from Ni


to Nnew. Because the constant c is different in every node, the values h(i · c), i = 0, ..., n form the boundaries of the clusters within the [0, 2^m) domain. If the interval (Ki−1, Ki] contains only one cluster (whole or partial), Knew is set to split the interval such that both parts are balanced. If, on the other hand, the interval covers z, z > 1, clusters, it is split such that (Ki−1, Knew] covers ⌊z/2⌋ clusters.

Range query

Suppose node Nq initiates the range query RangeSearch(q, r). The search algorithm follows this basic structure:

∙ for each cluster Ci find interval Ii, such that

Ii = [h(d(pi, q) + i · c − r), h(d(pi, q) + i · c + r)]

interval Ii contains keys to be processed in next step,

∙ ∀i : 0, . . . , n − 1 send IntervalSearch(Ii, q, r) to node NIi which manages the midpoint of interval Ii,

∙ wait for all partial answers and create the final answer.

The IntervalSearch(Ii, q, r) executed on NIi nodes is as follows:

∙ if NIi does not manage the whole interval Ii, it resends the request to its predecessor and/or successor,

∙ process local entries and create local answer SA

SA = {x | mchord(x) ∈ Ii, d(q, x) ≤ r},

∙ return SA to node Nq together with list of other nodes (usually predecessor and/or successor) involved in computation.

Figure 2.10 shows an example query with the messages sent through the system. The initial node Nq sends IntervalSearch(Ii, q, r) requests to nodes NI1, NI2 and NI3, where NI1 also resends the request to its predecessor and successor. The illustration simplifies routing to direct access of NIi; in a real implementation this routing follows the Chord protocol.


Figure 2.10: M-Chord range query search process with simplified routing [4].

kNN query

In general, iDistance's approach to kNN queries is an iterative RangeSearch(q, r) with increasing range r. This approach is not suitable for a distributed environment due to the large amount of message transmissions. The M-Chord's approach consists of two phases:

1. use a heuristic to find k objects that are near q and measure the distance ρk to the k-th nearest object found so far; this ρk forms an upper bound on the distance to the actual k-th nearest neighbor of q,

2. execute RangeSearch(q, ρk) and return the k objects closest to q; the space searched in the first phase can be omitted.

In the first phase, node Nmchord(q) searches the cluster Ci containing q:

∙ find the B+-tree leaf covering the key mchord(q),

∙ scan the B+-tree leaves both left and right, adding the first k objects to the answer set SA and initializing the ρk value,

∙ keep scanning the objects x while keys mchord(x) ∈ Ii

Ii = [h(d(pi, q) + i · c − ρk), h(d(pi, q) + i · c + ρk)],

∙ if d(q, x) < ρk, remove the last (most distant from q) element from SA, put x into SA and update ρk – the interval Ii shrinks,


∙ the search terminates when the whole interval Ii or the cluster Ci in Nmchord(q) has been searched.

The process described above assumes that at least k objects belong to the cluster Ci in the node Nmchord(q). If the assumption does not hold, the following steps are executed instead of phase two:

∙ execute RangeSearch(x, ρ), where x is the most distant object found so far,

∙ if k′ < k objects are found, set ρ to ρ + ρ · (k − k′)/k and repeat, omitting the space already searched,

∙ terminate when k objects are found.

2.4.2 M-Index

The Metric Index (or M-Index) uses a fixed set of pivots P = {p1, p2, ..., pn}, n ∈ N+, to create a Voronoi partitioning splitting the metric space into partitions (cells) Ci, i = 0, ..., n − 1. Like M-Chord, M-Index uses iDistance to map the space into a one-dimensional domain. Figure 2.11 shows the mapping from a general metric space domain to the iDistance domain. With a constant c ∈ R+ large enough to separate the individual clusters, cluster Ci maps to the interval [i · c, (i + 1) · c).

Multi-level M-Index

In order to make M-Index more scalable for large datasets, further partitioning can be applied in the form of recursive Voronoi partitioning with level l ∈ N+. A cluster is then determined by the l closest pivots (called a pivot permutation) – C_{i1,...,il}. This idea is also utilized by the M-Index with Dynamic Level, where instead of a fixed constant l, the level of each cell is determined on demand – once a cell's size on level m exceeds a certain threshold, it is split into two cells with level m + 1, up to level lmax.

Approximate kNN Search

The algorithm searches the clusters ordered by a heuristic based on the distances between the individual pivots and the query object q.


Figure 2.11: M-Index mapping into the iDistance domain [5].

The heuristic penalty is calculated as:

penalty(C_{i_0,\dots,i_{l-1}}) = \frac{1}{l} \sum_{j=0}^{l-1} \max\{d(p_{i_j}, q) - d(p_{(j)_q}, q), 0\}

where p_{i_j} are the individual pivots forming the cluster and p_{(j)_q} is the j-th closest pivot to q. The whole sum is divided by l to make the value comparable between clusters on different levels, which is relevant only in the M-Index with Dynamic Level. The algorithm stops once a certain threshold is reached, such as the number of objects refined, the number of clusters visited or the number of disk blocks read.

Distributed M-Index

The M-Index specifies only the indexing structure and the search algorithm, but does not prescribe any concrete data structure. The only requirement on the data structure is efficient retrieval of objects according to their M-Index keys. An ideal structure for a centralized solution would be a B+-tree. As for a distributed solution, the data can be stored in distributed structures with efficient interval key search, for example the structured peer-to-peer network Skip Graphs [7]. The index structure can be stored in:

∙ centralized structure, handling incoming search queries and forwarding them to nodes managing the data,

∙ replicated and synchronized in every node,

∙ fully distributed over the nodes.

A single-level M-Index using the Chord data structure corresponds to M-Chord.


3 NoSQL Databases

NoSQL databases were developed to overcome new challenges created by new trends and changes in data management and computing. Several of the factors are:

∙ support for large amount of concurrent users,

∙ need to store and query unstructured or semi-structured data,

∙ high availability – minimize (or eliminate) downtime,

∙ high volume and velocity of data.

Conventional relational databases are a well-studied and mature technology, but they are better suited for centralized, highly consistent applications. NoSQL databases, on the other hand, are usually designed as distributed systems in which one node can take over the responsibility of another node in case of failure. However, a significant disadvantage is consistency – distributed systems are usually eventually consistent, a weak consistency model which means the system might not be consistent at a given time but will eventually converge to a consistent state. Another disadvantage is performance for small amounts of data, where communication overhead usually degrades performance to the level where it is no longer worth using a distributed system. Nowadays, many companies, from small to enterprise-grade, use NoSQL databases as part of their systems, showing that the technology has established its place not as a replacement for other kinds of databases, but as a tool suited for a specific task.

3.1 Classification

Unlike relational databases with established standards and definitions, NoSQL databases do not follow any standards and each database is different. Despite the lack of standardization, several categories of NoSQL databases came into existence; the most common are:

∙ Key-Value Store,


∙ Document-oriented Databases,

∙ Column-Family Store,

∙ Graph Databases.

Key-Value Store

Usually a hash table (also called a map) that provides access to every element via its ID/primary key. From a relational database's point of view it can be modeled as a table with two columns – ID as the primary key and DATA as the value, an unstructured binary object. Notable representatives are Redis¹ or Hazelcast².

Document-oriented Databases

The basic unit of data is the Document – an independent, self-describing piece of information stored as a hierarchical structure consisting of arrays, scalars or other documents. The physical structure of documents can be, for example, JSON or XML. The concrete structure of individual documents can differ, but documents in the same collection should have a similar structure. Notable representatives are MongoDB³ or CouchDB⁴.

Column-Family Store

The data model of Column-Family stores (also known as wide-column stores) is that rows may have multiple columns associated with a row key. Groups of related data which are often accessed together are called column families. Notable representatives are Cassandra⁵ or HBase⁶.

Graph Databases

In graph databases, data entities are represented as nodes and relations are represented as edges in a directed graph. Querying data is done by

1. https://redis.io
2. https://hazelcast.org
3. https://mongodb.com
4. https://couchdb.apache.org
5. https://cassandra.apache.org
6. https://hbase.apache.org


traversing the stored graph(s) (for example using Breadth-First Search) and returning the subset of data satisfying the given condition(s). A notable representative is Neo4j⁷.

3.2 MongoDB

MongoDB, developed by MongoDB Inc.⁸, is one of the most used NoSQL databases. It classifies as a Document-oriented database using JSON-like documents and is written in C++, making it available for Windows, Linux and OS X platforms. Most of the database features are available in the free, open-source edition licensed under the GNU Affero General Public License⁹, but the authors also offer a paid enterprise edition, database as a service as well as commercial support and trainings. Key features of MongoDB are:

∙ custom query language – queries use a JavaScript Domain Specific Language,

∙ strong consistency – queries always return the most recent piece of data,

∙ flexible schema – changes in the application model require no changes in the database model,

∙ scalability – MongoDB was designed as a distributed database, therefore horizontal scaling is automatic and efficient.

3.3 Cassandra

Cassandra was originally developed by Facebook; now it is open-sourced and developed by the Apache Software Foundation¹⁰. It is written in Java and published under the Apache License 2.0¹¹. As a representative of Column-Family stores, Cassandra is relatively easy to learn for people already familiar with relational databases, including its own query

7. https://neo4j.com
8. https://www.mongodb.com/company
9. https://www.gnu.org/licenses/agpl-3.0.en.html
10. https://apache.org/foundation
11. https://apache.org/licenses/LICENSE-2.0

language, the Cassandra Query Language (CQL). What makes Cassandra stand out is its large write throughput. Key features are:

∙ custom query language CQL – the syntax of SQL modified for a Column-Family data store,

∙ fault tolerance – data is automatically replicated to multiple nodes or multiple data centers, failed nodes can be replaced with no downtime,

∙ durable – each operation is written to log and committed to disk first, then executed on the data, ensuring every operation is either completed entirely or not at all,

∙ decentralized – every cluster node is identical, therefore there is no single point of failure,

∙ elastic – read/write throughput automatically scales as machines are added or removed, with no downtime or interruptions.

3.4 Hazelcast In-Memory Data Grid

Hazelcast In-Memory Data Grid (or Hazelcast for short) is developed by Hazelcast Inc.¹². There is a free and open-source community edition published under the Apache License 2.0¹³ and a paid enterprise edition offering additional features, priority support, trainings and more. The server-side software (called a member) is written in Java; a client API is currently available for Java, C#, C++, Node.js, Python and Go. The main use cases for Hazelcast are:

∙ distributed in-memory computing – the main focus of Hazelcast, hence the name In-Memory Data Grid,

∙ caching – first or second level cache over relational database, implements JCache (Java Caching Standard),

12. https://hazelcast.com/company/about
13. https://apache.org/licenses/LICENSE-2.0


∙ web session clustering – storing web sessions in distributed memory, allowing any server to take over a web session if another server becomes unavailable,

∙ messaging – sending data to other nodes using the publish/subscribe messaging model.

Unlike most other NoSQL databases, Hazelcast does not offer persistent storage, only volatile in-memory storage with options to persist data in another data store – a relational database, a file system or a NoSQL database. Member applications can be run as standalone processes or embedded in application code. The embedded option is used to minimize communication overhead between two processes (the application and Hazelcast itself) by eliminating the need for serialization/de-serialization of objects sent to/from local data structures and synchronization between processes. At its core, Hazelcast is a collection of distributed data structures and operations; the most notable are:

∙ Map – implementation of a distributed hash table,

∙ Replicated Map – every node stores a whole copy of the map, modifications are propagated to all nodes,

∙ Multi-Map – a map where one key is associated with multiple values; supports List-Multi-Map (keys may contain duplicate values, order is preserved) and Set-Multi-Map (keys may not contain duplicate values, order is not preserved),

∙ Entry Processor – allows executing a user-defined function on entries in a map, which may atomically modify the entry,

∙ Executor Service – allows distributed, asynchronous execution of tasks,

∙ Topic – used for sending messages to other nodes using the publish/subscribe messaging model.

Hazelcast supports querying with SQL-like predicates, for example

SELECT * FROM users WHERE users.is_admin = true
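The following hedged sketch shows how such a predicate is typically used from application code with the Hazelcast 3.x Java API available at the time of writing; the map name and the User class are hypothetical and not part of any system described in this thesis.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.SqlPredicate;

import java.io.Serializable;
import java.util.Collection;

public class HazelcastQueryExample {

    // Hypothetical value type; it must be serializable to be stored in the grid.
    public static class User implements Serializable {
        public String name;
        public boolean is_admin;

        public User(String name, boolean isAdmin) {
            this.name = name;
            this.is_admin = isAdmin;
        }
    }

    public static void main(String[] args) {
        // Starts an embedded member; further members started on the same
        // network join it and form a cluster automatically.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        IMap<String, User> users = hz.getMap("users");
        users.put("u1", new User("Alice", true));
        users.put("u2", new User("Bob", false));

        // SQL-like predicate, evaluated in parallel on all cluster members.
        Collection<User> admins = users.values(new SqlPredicate("is_admin = true"));
        System.out.println("admins found: " + admins.size());

        hz.shutdown();
    }
}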


Some distributed collections also support local indexes, which are utilized by the predicates. Data in distributed data structures is kept in partitions; by default there are 271 partitions P1, ..., P271 in the whole cluster. When a new cluster starts with only one node N1, it manages all 271 partitions. When a second node N2 joins the cluster, partitions are migrated such that each node manages an equal part of the partitions, in this case N1 manages P1 ... P135 and N2 manages P136 ... P271. If data replication is configured with replication factor r ∈ N0, for each partition Pi there are r backup partitions Bi stored in the cluster, each one guaranteed to be stored on a different node. Suppose r = 1: if the number of nodes n = 1 there are no backup partitions, because it would only lead to a node holding the whole dataset twice. If n = 2, every node owns its half of the partitions and stores backup copies of the other half; if n = 4, every node owns its quarter of the partitions and stores the backup of one other node. Figure 3.1 shows the distribution of partitions for r = 1 and n = 1, 2 and 4. When a node becomes unavailable, there are r other copies which are redistributed among the remaining nodes. Partitions in a real cluster may not be distributed in sequential order; they are ordered in this explanation only for illustrative purposes [8].
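Both the number of partitions and the replication factor r are configuration options. A minimal sketch using the programmatic configuration of the Hazelcast 3.x API (the map name is hypothetical):

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;

public class ClusterConfigExample {
    public static void main(String[] args) {
        Config config = new Config();

        // Total number of partitions in the cluster (271 is the default).
        config.setProperty("hazelcast.partition.count", "271");

        // Replication factor r = 1: one synchronous backup per partition,
        // always stored on a different member than the primary copy.
        config.getMapConfig("data").setBackupCount(1);

        Hazelcast.newHazelcastInstance(config);
    }
}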


Figure 3.1: Distribution of Hazelcast partitions in cluster of size a) 1, b) 2, c) 4 with replication factor r = 1.


4 Index Structure and Implementation

The goal of this thesis is to design, implement and evaluate a prototype of a fully distributed system for similarity search using Hazelcast for data storage and processing. This chapter describes the index structure used to find the subspace containing the candidate set, the process of refining the candidate set and the combination of partial answers into the final answer. Section 4.1 describes the index structure itself, Section 4.1.1 describes the algorithm used for the kNN search query as well as the combined kNN search using both a query object and additional information for filtering. The last Section 4.2 provides more details about the implementation of the system and some of the technical issues overcome during its creation, along with their solutions.

4.1 Index Structure

Like the other indexes described in Section 2.4, this index uses pivots to split the data space into partitions. These pivots are known a priori and their number is fixed. The data space is then partitioned into recursive Voronoi cells, illustrated in Figure 4.1, according to their Pivot Prefix – an ordered set of the l closest pivots, l ∈ N+. There are three main data structures used in cooperation to perform operations on the dataset:

∙ Data Map – distributed map containing data objects together with their Pivot Prefix (Voronoi cell ID),

∙ Pivot Map – map containing all pivots, every node stores its own copy,

∙ Pivot Prefix Counts – map containing the number of objects in each Voronoi cell, every node stores its own copy.

If a combined search is required, one more data structure per attribute is needed – a map containing the number of objects in each Voronoi cell that also contain a certain value of the indexed attribute; every node stores its own copy. More details are discussed later in Section 4.1.1.


Figure 4.1: Recursive Voronoi partitioning with l = 2 (2-level recursion) [9].

Insert operation

In order to insert a new object x into the database, the following steps are executed:

1. find the Pivot Prefix PPx for object x,

2. update the value in the Pivot Prefix Counts map for key PPx,

3. for each tag t in object x, update the Tags Pivot Prefix Counts map with key t_PPx.

Finding the Pivot Prefix is done by calculating the distance between x and every pivot. Because the number of pivots is generally small, a linear search for the l closest pivots is sufficient.
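A minimal sketch of this step (the generic object type, the distance function and the pivot identifiers are illustrative; the actual classes in the system may differ):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

final class PivotPrefixes {

    // Returns the identifiers of the l pivots closest to x, ordered by distance.
    static <T> List<Integer> pivotPrefix(T x, Map<Integer, T> pivots, int l,
                                         ToDoubleBiFunction<T, T> d) {
        List<Integer> ids = new ArrayList<>(pivots.keySet());
        // The number of pivots is small, so sorting all of them is sufficient.
        ids.sort(Comparator.comparingDouble((Integer id) -> d.applyAsDouble(x, pivots.get(id))));
        return ids.subList(0, Math.min(l, ids.size()));
    }
}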

4.1.1 kNN Search Process

There are three parameters needed for the kNN search:


∙ k – the number of closest neighbors to be found,

∙ q – the query object,

∙ s – the minimum number of objects in Candidate Set.

The process itself consists of two main parts:

1. find a set of candidate Pivot Prefixes PP containing at least s objects – in other words, find the Voronoi cells closest to q that together contain at least s objects,

2. refine objects in the candidate set by explicitly calculating their distance to the query object, returning only the closest k objects.

Find Candidate Pivot Prefixes

The steps for finding the candidate Pivot Prefixes are:

∙ calculate distance d(q, p) between the query object q and every pivot p,

∙ sort prefixes in Pivot Prefix Counts map according to distance between Pivot Prefix PP and query object q using function d∆ (see below), which uses distances calculated in previous step,

∙ take prefixes from the sorted list until the answer contains at least s objects.

The distance from the query object q to a Pivot Prefix PP is estimated using a weighted sum of the distances between q and the individual pivots PP[i] from the prefix [6]:

d_\Delta(q, PP) = \sum_{i=1}^{l} c^{i-1} \, d(q, PP[i])

where c ∈ [0, 1] is a constant, set to 0.75. From another point of view, the algorithm sorts the Voronoi cells according to their distance to the query object q and then takes the first m ∈ N+ cells such that those m cells contain at least s objects. Figure 4.2 shows an example of the partitioned space with dotted lines showing the cells closest to q.


Figure 4.2: Voronoi partitioning, dotted arrows show the closest cells.
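The ordering of the Voronoi cells can be sketched as follows; the types are illustrative (a prefix is represented as a list of pivot identifiers and queryPivotDistances holds the distances d(q, p) computed in the first step), and C is the constant c = 0.75 from the formula above.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class PrefixOrdering {

    private static final double C = 0.75;

    // Weighted-sum estimate of the distance between q and a Pivot Prefix.
    static double dDelta(List<Integer> prefix, Map<Integer, Double> queryPivotDistances) {
        double sum = 0.0;
        double weight = 1.0;                         // c^(i-1), starting with c^0 = 1
        for (Integer pivotId : prefix) {
            sum += weight * queryPivotDistances.get(pivotId);
            weight *= C;
        }
        return sum;
    }

    // Orders all non-empty prefixes (Voronoi cells) by their estimated distance to q.
    static List<List<Integer>> orderPrefixes(List<List<Integer>> prefixes,
                                             Map<Integer, Double> queryPivotDistances) {
        List<List<Integer>> ordered = new ArrayList<>(prefixes);
        ordered.sort(Comparator.comparingDouble((List<Integer> p) -> dDelta(p, queryPivotDistances)));
        return ordered;
    }
}

The caller then walks through the ordered prefixes, summing the counts stored in the Pivot Prefix Counts map, until at least s objects are covered.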

Candidate Set Refinement

The process of candidate set refinement uses the Map Reduce [10] programming model, which is widely used by many NoSQL databases and distributed systems and is supported by many distributed-processing technologies, including Hazelcast. General Map Reduce consists of three phases:

1. Mapping – iterates over individual entries in collection, may modify the entry or filter it out,

2. Grouping (optional) – groups entries with the same key together, ensuring those entries will be on the same node,

3. Reducing – collects intermediate results and produces the final answer.

The design of Map Reduce allows the iterations in the individual phases to be executed simultaneously, making it an ideal solution for highly computationally demanding and embarrassingly parallel tasks, such as refinement of the candidate set. The refinement process uses Map Reduce as follows:

∙ Mapping – calculate distance d(x, q) between query object q and every object x in candidate set and store up to k closest objects,

∙ Reducing – combine intermediate sets into final answer, con- taining k closest objects.

The Grouping phase is not used.
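The core of both phases is a bounded structure that keeps only the k closest objects seen so far. The following sketch illustrates the logic of the local mapping step and of the merge performed during reduction in plain Java; it is not the Hazelcast-specific code of the system, and the types are illustrative.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.ToDoubleBiFunction;

final class KnnRefinement {

    // A scored candidate: the object and its distance to the query.
    static final class Scored<T> {
        final T object;
        final double distance;
        Scored(T object, double distance) { this.object = object; this.distance = distance; }
    }

    // Mapping: evaluate d(q, x) for every local candidate and keep the k closest ones.
    static <T> List<Scored<T>> mapLocal(Iterable<T> candidates, T q, int k,
                                        ToDoubleBiFunction<T, T> d) {
        // Max-heap on distance, so the most distant of the kept objects is on top.
        PriorityQueue<Scored<T>> topK =
                new PriorityQueue<>(Comparator.comparingDouble((Scored<T> s) -> s.distance).reversed());
        for (T x : candidates) {
            topK.add(new Scored<>(x, d.applyAsDouble(q, x)));
            if (topK.size() > k) {
                topK.poll();                         // evict the most distant element
            }
        }
        return new ArrayList<>(topK);
    }

    // Reducing: merge the per-node partial answers into the final k closest objects.
    static <T> List<Scored<T>> reduce(List<List<Scored<T>>> partialAnswers, int k) {
        List<Scored<T>> all = new ArrayList<>();
        partialAnswers.forEach(all::addAll);
        all.sort(Comparator.comparingDouble((Scored<T> s) -> s.distance));
        return all.subList(0, Math.min(k, all.size()));
    }
}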

Combined Search

The system supports a combined kNN search with additional search restrictions, finding the k closest data objects which also satisfy a supplied predicate. The predicate is roughly equivalent to an SQL WHERE clause, for example:

SELECT * FROM images WHERE tags CONTAINS ('paris', 'eiffel', 'france') AND author = 'Jane'

Search is restricted to simple attribute types like text, number or logical value, and arrays of a simple type. For each attribute used in search, a local index is created on each node.

4.2 Implementation

This section describes how the data structures described in Section 4.1 are implemented, as well as the operations needed for similarity search. The main part of the system is in the form of a module, meant to be used as a library embedded in an application. In order to utilize Hazelcast's ability to store data as Java in-memory objects, the module also has to be implemented in Java. As a demo, a web application using the module was also developed, showcasing the usage and results of the system. The demo application searches in a database of images and supports combined search using tags.


Data Map

Hazelcast has native support for a distributed map through its IMap class. This map supports data replication for fault tolerance, local indexes for faster data lookup and automatic synchronization when a member joins or leaves the cluster, making it a suitable structure for data storage. The data have the following structure:

{
  "_id" : "123...",
  "title" : "The title...",
  "description" : "Description of the object...",
  "tags" : [ "tag1", "tag2", "tag3", ... ],
  "_file" : "/path/to/file.jpg",
  "descriptor" : [1.0, 2.0, 3.0, ...]
}

These data are loaded as a JSON document and then stored in memory as plain Java objects. Objects may also contain other attributes which will not be used during search, only stored for use by the concrete application. Description of the individual attributes:

∙ _id – unique identifier of the object, used as the key,

∙ title, description – information about the object displayed in the user interface,

∙ tags – used for the combined search; the object may also contain different attributes for search purposes,

∙ _file – location of the file represented by the data object, specific to each application,

∙ descriptor – coordinates in the metric space used for calculation of the distance between objects, in this case a vector of numbers representing a point in an n-dimensional space.


During the insert operation, the attribute pivot_prefix is calculated and added to the object before storing it. For increased performance, the following local indexes are created in each node:

∙ _id – an index for keys is created automatically,

∙ pivot_prefix – needed for identification of objects inside Voronoi cell,

∙ tags – for additional filtering of objects.

Searching in the IMap uses SQL-like predicates utilizing the local indexes described above. After calculating the Candidate Pivot Prefixes CPP, the query

SELECT * FROM data WHERE data.pivot_prefix = CPP

is executed if only a query object has been provided, and the query

SELECT * FROM data WHERE data.pivot_prefix = CPP AND data.tags CONTAINS (tag1, tag2, ...)

is executed if a combined search is in place.
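A hedged sketch of how such a selection can be expressed programmatically with the Hazelcast 3.x Predicates API (the attribute names follow the data structure above; the exact construction used in the system may differ, and only a single tag is shown):

import com.hazelcast.query.Predicate;
import com.hazelcast.query.Predicates;

final class CandidatePredicates {

    // Selects objects whose pivot_prefix is one of the candidate prefixes
    // and whose tags contain the given tag; both attributes are locally indexed.
    static Predicate buildCombinedPredicate(String[] candidatePrefixes, String tag) {
        return Predicates.and(
                Predicates.in("pivot_prefix", candidatePrefixes),
                // "tags[any]" matches any element of the tags collection.
                Predicates.equal("tags[any]", tag));
    }
}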

kNN Search Process

The process starts by calculating the Candidate Pivot Prefixes CPP, described in Section 4.1.1. The potential number of prefixes P(p, l) is rather large, because it is the number of permutations of prefix length l drawn from p pivots, P(p, l) = p! / (p − l)!. However, when using a fixed, pre-selected set of pivots, many Voronoi cells are empty and therefore not stored in memory, greatly reducing both the memory cost and the computational time of calculating CPP. Having CPP calculated, the next step is to send a kNNSearch(q, k, CPP) message to all nodes. The nodes start a local Map Reduce process on the entries satisfying the predicate mentioned above. The Mapping phase iterates over the individual entries x ∈ X, X ⊆ 풟 and:

∙ calculates the distance d(q, x) between the query object q and x,


∙ puts tuple (x, d(q, x)) into priority queue Q of fixed size k,

– if number of elements in Q exceeds k, the last (most distant) object is evicted.

The combination phase takes the priority queues and merges them into one priority queue of size k, returning it to the initiating node. All operations on a single node are executed in parallel and every node executes its sub-task independently, making the kNN search process both distributed and parallel. The output of a single node is its local kNN search result, which is returned to the initiating node and combined into the final answer.

Combined Search

The combined search is a modification of the kNN search with two changes:

∙ predicate for selecting objects takes tags into account,

∙ a different map is used for determining the number of objects in Voronoi cells during the calculation of CPP – instead of the Pivot Prefix Counts map, containing the number of objects in every cell, a map containing the number of objects that also contain a specific tag is used.

If only one tag is used, the value from the map is retrieved and used directly. When using multiple tags, the intersection of all tags would need to be calculated. This is not possible, because the map contains only counts, not the objects themselves. The system therefore estimates the intersection between tags t1 and t2 using the formula

EstimateIntersection(t1, t2) = ⌈c · min(|Xt1|, |Xt2|)⌉

where Xti is the subset of objects in a specific Voronoi cell containing tag ti and c ∈ (0, 1] is a constant. Because the refinement of the candidate set operates on entries determined by the predicate, the refinement process itself is identical.


Other Data Structures

Maps that are required to be stored locally on each node are implemented using Hazelcast's ReplicatedMap class. This class ensures that whenever a node inserts, deletes or updates an entry, all other nodes are notified about the modification and adjust their local copies accordingly. When an application reads data from Hazelcast's data structures, it always receives a copy of the stored entries. The reason is to provide consistency: for example, if the code map.get(key).setAttribute("new value") were executed, the modification of the attribute would create an inconsistency among nodes, because the other nodes would not be notified about it. However, this behavior is undesirable if there are many read-only operations, because it creates unnecessary overhead and degrades the performance of the search process. The solution used in this system is to store entries in a distributed map and create a Cached Map, which serves as a cached, unmodifiable view of the underlying distributed map. Whenever there is a read operation, the Cached Map returns the cached object itself and not its copy. Updating and deleting entries is not supported on the Cached Map itself, only directly on the underlying distributed map. Thanks to listeners attached to the distributed map, the Cached Map is automatically updated when any node inserts, deletes or updates an entry. A disadvantage of this solution is the inconsistency issue mentioned above, as well as additional memory consumption, because every node stores its local copy and manages 1/N of the data in the distributed map. The map containing the Pivot Prefix Counts is implemented using the Cached Map.
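A simplified sketch of the Cached Map idea follows; it is an illustration only (the real class handles more event types and concurrency concerns) and relies on Hazelcast's EntryAdapter listener from the 3.x API.

import com.hazelcast.core.EntryAdapter;
import com.hazelcast.core.EntryEvent;
import com.hazelcast.core.IMap;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Read-only, locally cached view of a distributed map; writes must go
// directly to the underlying IMap and are propagated through the listener.
final class CachedMap<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();

    CachedMap(IMap<K, V> source) {
        cache.putAll(source);                  // initial snapshot of the distributed map
        source.addEntryListener(new EntryAdapter<K, V>() {
            @Override
            public void entryAdded(EntryEvent<K, V> event) {
                cache.put(event.getKey(), event.getValue());
            }
            @Override
            public void entryUpdated(EntryEvent<K, V> event) {
                cache.put(event.getKey(), event.getValue());
            }
            @Override
            public void entryRemoved(EntryEvent<K, V> event) {
                cache.remove(event.getKey());
            }
        }, true);                              // true = include values in the events
    }

    // Returns the cached object itself, not a copy.
    V get(K key) {
        return cache.get(key);
    }
}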


5 Testing and Evaluation

This chapter describes the process of testing and evaluation of the system described in Chapter 4. There are two main aspects of evaluation taken into account – the quality of the result set and the performance of the system. Section 5.1 describes the process and results of evaluating the quality of the index, and Section 5.2 describes the performance aspect, discussing the process and results of testing latency and throughput. The data used for the experiments were DeCAF7 [11] visual descriptors of an image collection [12], extracted using a deep neural network. The visual descriptors are represented as 4096-dimensional vectors of floating-point numbers, reduced to 256-dimensional vectors using the PCA [13] dimension-reduction algorithm.

5.1 Precision of Search Results

Definitions

There are two standard measures to evaluate the quality of approximate search – precision and recall [3]. Suppose a dataset X ⊆ 풟 and two result sets: S, the set of qualifying objects (in other words, the result of a precise search), and SA, the set of objects returned by an approximate search. Precision P is defined as the ratio of the number of qualifying objects returned by the approximate search to the number of all objects returned by the approximate search, formally:

P = \frac{|S \cap S_A|}{|S_A|}

and recall R is defined as the ratio of the number of qualifying objects returned by the approximate search to the number of all qualifying objects, formally written as:

R = \frac{|S \cap S_A|}{|S|}.

Because kNN search always returns k objects, the sizes of S and SA are always equal, and therefore precision is always equal to recall. For the sake of brevity, only precision will be considered in this chapter.
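For completeness, a small sketch of the precision computation (object identity is assumed to be well defined by equals):

import java.util.HashSet;
import java.util.Set;

final class PrecisionMeasure {

    // P = |S ∩ SA| / |SA|, where S is the precise answer and SA the approximate one.
    static <T> double precision(Set<T> precise, Set<T> approximate) {
        Set<T> intersection = new HashSet<>(approximate);
        intersection.retainAll(precise);
        return (double) intersection.size() / approximate.size();
    }
}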


Table 5.1: Summaries of precision evaluation

Candidate set   Min    1st Qu.  Median  Mean   3rd Qu.  Max
10,000          0.100  0.500    0.700   0.692  0.900    1.000
15,000          0.200  0.638    0.800   0.767  0.950    1.000
20,000          0.200  0.700    0.850   0.816  0.950    1.000
25,000          0.250  0.750    0.900   0.849  1.000    1.000
30,000          0.250  0.800    0.950   0.875  1.000    1.000
35,000          0.250  0.850    0.950   0.893  1.000    1.000
40,000          0.300  0.850    0.950   0.911  1.000    1.000

Results

The experiments were executed on a sample dataset of 1 million objects with kNN search for k = 20. The tests consist of 200 sample queries chosen randomly; every test uses the same 200 query objects. The qualifying objects were calculated using a sequential scan. The input variable of the experiments is the size of the candidate set for refinement. The bigger the candidate set, the more precise the results, but at the cost of performance. Table 5.1 shows summaries of the individual experiments. The first column is the size of the candidate set, the second is the minimum precision measured, followed by the 1st quartile (1st Qu.), median, mean, 3rd quartile (3rd Qu.) and maximum precision measured. Figure 5.1 a) shows the mean precision for each experiment and b) shows the average latency. With performance costs in mind (discussed in more detail in Section 5.2), a candidate set of size 30,000 seems to have a good precision-to-performance ratio. Therefore, we use this value in the following experiments.

5.2 Performance Tests

Tool

Apache JMeter [14] was used for performance testing; it is the de facto standard tool for performance testing in the Java ecosystem. JMeter contains all the features needed, like simulating multiple users at once, sending requests to multiple hosts, writing detailed logs about individual requests, highly configurable parameters and others. The tests were executed as REST requests for kNN search with a specified query


Figure 5.1: a) Mean precision for candidate set size b) Mean latency for candidate set size.

object, and the result was returned as a JSON document containing the individual objects and their distance to the query object. Requests were sent from machines that were not part of the cluster (they did not run the application) and were on the same local network as the cluster, to minimize network costs. One specific property of every Java-based application when testing performance is the need to warm up the Java Virtual Machine (JVM) before executing the tests themselves. When the JVM starts executing an application (for example a JAR archive), it first runs the compiled bytecode in interpreted mode. As the application executes more and more bytecode, the JVM collects statistics about individual parts of the application (usually methods or loops); frequently executed parts become “hot”¹ and the JVM applies various optimizations to them, like method inlining or

1. for this reason, JVM made by Oracle Corporation is called HotSpot [15]

compiling them to native code using so-called just-in-time compilation (JIT compilation), making them run faster. For this reason, a full restart of the application was performed before every experiment and the warm-up phase was finished before taking any measurements. Hardware specifications of the nodes:

∙ Node 1:

– Intel Xeon Processor E5-2620 v2², 6 cores and 12 threads, 2.10 to 2.60 GHz clock speed,
– 50 GB memory.

∙ Node 2:

– Intel Xeon Processor E5-2620³, 6 cores and 12 threads, 2.00 to 2.50 GHz clock speed,
– 58 GB memory.

∙ Node 3:

– Intel Xeon Processor E5-2620, 6 cores and 12 threads, 2.00 to 2.50 GHz clock speed,
– 90 GB memory.

Following parameters were used to execute individual scenarios:

∙ number of nodes (cluster configuration): 1, 2 and 3

∙ size of the dataset: 1 million and 3 million

∙ number of simultaneous requests:

– latency: 1 user (a sequence of requests), 10 users and 20 users,
– throughput: 50, 100 and 150 users.

2. https://ark.intel.com/products/75789/Intel-Xeon-Processor-E5-2620-v2-15M-Cache-2_10-GHz
3. https://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI


The size of the candidate set was set to 30,000 for all scenarios, and the nodes were physical machines. The search process runs in parallel, therefore running multiple virtual nodes on a single machine would not increase performance, only introduce additional overhead.

5.2.1 Latency

Figure 5.2 shows a graph of the mean latency for each combination of parameters. The latency for a sequential set of requests is almost equal for every cluster configuration; the 2-node configuration is actually slightly slower than the 1-node configuration because of communication and synchronization overhead. The experiments with 10 and 20 simultaneous requests show a significant, near-linear improvement as new nodes are added to the cluster. When one node receives multiple simultaneous requests, its processing power needs to be shared among all of the requests, leading to higher latency. A multi-node cluster divides its work evenly among the nodes, which results in lower latencies.


Figure 5.2: Mean latency, 1 million objects.



Figure 5.3: Mean throughput, 1 million objects.

5.2.2 Throughput

Our first throughput experiments uncovered that it is necessary to first determine the maximum number of simultaneous requests (threads) a client machine is capable of handling. After extensive testing, the most suitable number was found to be 50. Up to 50, the throughput increased as more threads were added, which means that the threads did not utilize the system to its maximum capacity. A number of threads greater than 50 showed no increase or decrease in throughput and eventually led to errors on the client side, meaning that the client was not capable of handling that amount of requests. To test more than 50 simultaneous requests, more machines were added as clients. The tests were executed for 50 (1 client node), 100 (2 client nodes, 50 threads each) and 150 threads (3 client nodes). Figure 5.3 shows the results of the experiments. The system scales almost linearly as more nodes are added to the cluster. As with latency, throughput is greatly increased for multi-node clusters because the workload is balanced among the nodes.


3 million objects

Figure 5.4 shows the results of the experiments on 3 million objects. The results are consistent with the other experiments in terms of scalability. With 3 times as much data, performance is not 3 times worse, because:

∙ as more objects are present in the metric space, it becomes denser and potentially fewer Voronoi cells need to be accessed,

∙ data objects are indexed, therefore the lookup time grows, but not linearly with the dataset size.


Figure 5.4: Mean latency and throughput for 3 million objects. [Two bar charts: mean latency (ms) against the number of threads (1, 10 and 20) for 1-, 2- and 3-node clusters, and throughput (req/s) against the number of nodes (1, 2 and 3) for 50, 100 and 150 threads.]

6 Conclusion

The objective of this thesis was to design and implement a similarity search system utilizing Hazelcast as an in-memory database and data-processing engine. The result is a fully distributed system in the form of a library, independent of the data used and the distance metric. As a proof of concept, a web application with a user interface built on the library was implemented. This web application searches images by their similarity, but the library itself is capable of searching any kind of data, for example audio recordings, as long as the data are represented as objects in a metric space and a suitable distance metric is provided.

Chapter 2 introduces the basic terminology and definitions necessary to understand the contents of this thesis, such as metric space, distance functions and similarity queries. A large part is also dedicated to indexes, discussing why they are needed in similarity search applications and presenting elementary space-partitioning techniques; two complex indexes, M-Chord and M-Index, are briefly described.

Chapter 3 describes NoSQL databases in general, their advantages, disadvantages and classification. Three representative databases are then briefly described, with emphasis on Hazelcast, which is covered in more detail.

The index structure and its implementation are described in Chapter 4. The main idea is to use a fixed set of pivots and recursive Voronoi partitioning to split the data space, and to keep statistics about the individual cells, such as the number of objects they contain. The search algorithm then visits the cells ordered with respect to their distance to the query object, until a certain number of objects has been accessed. The refinement phase is implemented using the MapReduce programming model to fully utilize system resources, executing it both in parallel and distributed across the cluster.

Finally, Chapter 5 discusses the system evaluation from multiple points of view: precision, latency and throughput. The measured values show little to no improvement in latency for a multi-node cluster processing a sequence of requests, but as the number of simultaneous requests grows, the differences become more significant. Throughput scales almost linearly as more nodes are added, thanks to the distribution of the workload.
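
The candidate-cell selection summarized above can be illustrated by the following simplified sketch: cells are ordered by the distance of their pivot to the query object and accumulated until the candidate-set limit (30,000 objects in the performance experiments) is reached; only these cells are then passed to the refinement phase. The Cell class and the distance function are hypothetical placeholders and do not reflect the actual data structures of the implementation described in Chapter 4.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

public class CandidateSelection {

    // Placeholder for a Voronoi cell: its pivot and the number of stored objects.
    static class Cell {
        final float[] pivot;
        final int objectCount;
        Cell(float[] pivot, int objectCount) {
            this.pivot = pivot;
            this.objectCount = objectCount;
        }
    }

    static List<Cell> selectCandidateCells(List<Cell> cells, float[] query,
                                           ToDoubleBiFunction<float[], float[]> distance,
                                           int candidateLimit) {
        // Order all cells by the distance of their pivot to the query object.
        List<Cell> ordered = new ArrayList<>(cells);
        ordered.sort(Comparator.comparingDouble(
                (Cell c) -> distance.applyAsDouble(c.pivot, query)));

        // Accumulate cells until the candidate-set size limit is reached;
        // only these cells are refined afterwards (e.g. by the MapReduce phase).
        List<Cell> selected = new ArrayList<>();
        int accessed = 0;
        for (Cell cell : ordered) {
            selected.add(cell);
            accessed += cell.objectCount;
            if (accessed >= candidateLimit) {
                break;
            }
        }
        return selected;
    }
}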

6.1 Future work

The biggest limitation of similarity search systems is performance. For a web application searching images, a response time of tens or even hundreds of milliseconds might be sufficient, but not for real-time applications, for example recognition of people in a live video broadcast. Building efficient index structures and algorithms is still an open issue. The combined search with tags can be improved by adding approximate string matching instead of exact matching; for example, the input string ‘building’ can find similar tags like ‘buildings’. Another improvement would be the ability to recognize and search synonyms and words with similar meaning, for example ‘female’ would also find ‘woman’ and ‘girl’.
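
As a minimal sketch of the first suggested improvement, the following code performs approximate tag matching based on edit (Levenshtein) distance instead of exact matching; the threshold value and the matchTags helper are illustrative assumptions only.

import java.util.List;
import java.util.stream.Collectors;

public class ApproximateTagMatching {

    // Classic dynamic-programming Levenshtein (edit) distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Returns all stored tags within the given edit-distance threshold,
    // so that e.g. 'building' also matches 'buildings' (distance 1).
    static List<String> matchTags(String query, List<String> tags, int threshold) {
        return tags.stream()
                   .filter(tag -> editDistance(query.toLowerCase(), tag.toLowerCase()) <= threshold)
                   .collect(Collectors.toList());
    }
}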


A Contents of the Attached Archive

The archive submitted together with the thesis contains:

∙ source code of the implemented system,

– fixedpivots – library module containing data structures and algorithms for searching and indexing,
– fixedpivotswebapp – web application module,
– messif – dependency for data import/storage,
– scripts – helper scripts used during development/testing,
– README.md – information about deployment and operation,

∙ screenshots – directory with screenshots of the web application.
