Similarity Search Using Hazelcast In-Memory Data Grid

Masaryk University Faculty of Informatics Similarity Search using Hazelcast In-memory Data Grid Master’s Thesis Bc. Ľudovít Labaj Brno, Spring 2018 Declaration Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source. Bc. Ľudovít Labaj Advisor: RNDr. David Novák, Ph.D i Acknowledgements I would like to thank thesis advisor RNDr. David Novák, Ph.D for his time and guidance in this very challenging and interesting assignment. iii Abstract Metric space is universal approach to search for similar objects in database by providing a sample object. This approach is very powerful but computationally expensive. This thesis introduces a prototype of distributed system, which uses Hazelcast In-Memory Data Grid for data storage and in-memory computing. The system is then evaluated in terms of quality of results and performance iv Keywords similarity search, metric space, distributed systems, NoSQL v Contents 1 Introduction 1 2 Similarity Search in Metric Space 3 2.1 Metric Space .........................3 2.2 Distance Functions ......................4 2.2.1 Minkowski Distances . .4 2.2.2 Edit Distance . .5 2.3 Similarity Queries ......................6 2.3.1 Range Query . .6 2.3.2 Nearest Neighbor Query . .6 2.3.3 Similarity Join . .7 2.4 Use of Indexes .........................8 2.4.1 M-Chord . 11 2.4.2 M-Index . 17 3 NoSQL Databases 21 3.1 Classification ......................... 21 3.2 MongoDB ........................... 23 3.3 Cassandra ........................... 23 3.4 Hazelcast In-Memory Data Grid ............... 24 4 Index Structure and Implementation 29 4.1 Index Structure ........................ 29 4.1.1 kNN Search Process . 30 4.2 Implementation ........................ 33 5 Testing and Evaluation 39 5.1 Precision of Search Results .................. 39 5.2 Performance Tests ....................... 40 5.2.1 Latency . 43 5.2.2 Throughput . 44 6 Conclusion 47 6.1 Future work .......................... 48 References 49 vii A Contend of Attached Archive 51 viii 1 Introduction In the beginning of the Internet, it mostly contained structured, textual data in form of web pages and articles. Searching in those kinds of data is not so difficult task, because structured textual data can be sorted, categorized and indexed, therefore easy to search in. As technologies advanced, new kinds of data became available in form of images, au- dio, video and others, where conventional search approaches stopped being effective. For some forms of data it might be very difficult or even impossible to formulate a search query using conventional approaches, it might be much easier to give search engine a sample data, and it will retrieve the most similar entries from the database. The objective of the Similarity Search is to create techniques and algorithms how to efficiently search in large collections of data, providing sample object as input parameter. Fundamental problem with unstructured data is their inability to be sorted in respect to some of their attribute, data organization therefore becomes key part of any similarity search system, in terms of both quality of results (system returns relevant results) and search performance (low latency). Thesis Structure Chapter 2 describes theoretical introduction to Similarity Search, basic concepts, definitions and terminology. Chapter 3 descries distributed data stores, technology often used to store unstructured or semi-structured datasets, such as objects for similarity search. Search structure and implementation is described in Chapter 4 and system evaluation, from both performance and quality of results perspective is described in Chapter 5. The last Chapter 6 contains conclusion and possibilities for future improvements. 1 2 Similarity Search in Metric Space This chapter describes theoretical foundations of similarity search in metric spaces. The term metric space is defined in Section 2.1, followed by distance functions in Section 2.2 and operations (similarity queries) in Section 2.3. Section 2.4 describes possible use of indexes to speed up the search process. The theory of metric space is a well studied topic with many text- books and articles published [1, 2]. For purposes of this thesis, book Similarity Search, The Metric Space Approach [3] written by Zezula; Am- ato; Dohnal; Batko was used to write Sections 2.1 to 2.4. 2.1 Metric Space In general, the similarity search can be seen as a process of finding data objects in database according to their distance to a query object – an input object specified by user. The distance is determined by distance function and constrains on which objects should be returned called similarity queries. A metric space M is defined by a tuple M = (D, d) where D is domain of objects and d is distance function (also called metric function or just metric). For a function d : D × D 7! R to be valid distance function, these properties must hold: 8x, y 2 D, d(x, y) ≥ 0 non-negativity 8x, y 2 D, d(x, y) = d(y, x) symmetry 8x, y 2 D, x = y , d(x, y) = 0 identity 8x, y, z 2 D, d(x, z) ≤ d(x, y) + d(y, z) triangle inequality There are situations, where the symmetry property does not hold – for example distance between two building in a city can be different because of layout of roads and traffic rules (one-way roads etc.). These non-symmetrical metric functions are called quasi-metrics. 3 2. Similarity Search in Metric Space L L 1 2 L6 L 8 Figure 2.1: Minkowski distances examples for L1, L2, L6 and L¥ [3]. 2.2 Distance Functions Distance functions represent a way to determine how close are indi- vidual objects in a metric space. They return a number representing distance between two objects x and y from the same domain D. Return type can be both discrete (for example Edit Distance, Section 2.2.2, returns natural number) or continuous (for example Minkowski Distances, Section 2.2.1, return real number). 2.2.1 Minkowski Distances The Minkowski Distances or Lp metrics are family of distance functions, where p is input parameter. They are defined on two n-dimensional vectors X = (x1, x2,..., xn) and Y = (y1, y2,..., yn) of real numbers: s n p p Lp(X, Y) = ∑ jxi − yij i=1 L1 is also known as Manhattan distance (or the City-Block distance), L2 is Euclidean distance. One special case is when p = ¥ or L¥ which is called the maximum distance, defined as: n L¥(X, Y) = maxi=1jxi − yij Note that due to absolute value of differences between xi and yi the order of parameters doesn’t influence the result, in another words – all Lp distances are symmetric. Figure 2.1 shows examples of L1, L2, L6 and L¥ where all points are at the same distance from the middle according to different distance function. 4 2. Similarity Search in Metric Space 2.2.2 Edit Distance Opposing to Lp distances, which are defined on numeric vectors, the Edit Distance is used to calculate distance between two sequences of symbols (strings). The distance between string x = x1x2 ... xn and string y = y1y2 ... ym is defined as the minimum number of atomic edit operations needed to transform string x to string y. Atomic edit operations are: ∙ insert the character c into the string x at position i ins(x, i, c) = x1x2 ... xicxi+1 ... xn ∙ delete the character from string x at position i del(x, i) = x1x2 ... xi−1xi+1 ... xn ∙ replace the character at position i in string x with the new character c replace(x, i, c) = x1x2 ... xi−1cxi+1 ... xn Due to representation of string in computers, these edit operations may have different computational cost, what can be addressed by assigning weights to the edit operations. However, assigning different weights may violate symmetry, for example: let wins = 2, wdel = 1 and wreplace = 1 00 00 dedit(“combine ,“combination ) = 9 replace e ! a, insert t, i, o, n 00 00 dedit(“combination ,“combine ) = 5 replace a ! e, delete t, i, o, n As far as weights for insert and delete operations are equal, the symmetry property holds regardless of weight of the replace operation, which can also have different weight for different values – for example replace a ! b can have different weight than a ! c, but a ! b must have the same weight as b ! a. 5 2. Similarity Search in Metric Space q r Figure 2.2: Range query for query object q with radius r [3]. 2.3 Similarity Queries A similarity query specifies constrains for selection by a query object q, typically expressed as a distance. The result contains all objects in database which satisfies the selection, usually ordered by their distance to the query object q. The following sections discuss some basic types of similarity queries. 2.3.1 Range Query The similarity range query is probably the most intuitive one, it basically says “find me all objects that are at most‘r’ distance units away”. The query is specified by a query object q 2 D and a distance (often called radius) r 2 R≥0. Formal definition: R(q, r) = fo 2 X, d(o, q) ≤ rg In general, query object q does not need to exist in database X, but has to belong in the metric domain D. It is also possible for the radius to be zero, what means we are looking for existence of one specific object in the database, also called point query or exact match. This type of query is mostly used in delete operation, where we want to locate and delete a specific object.

Similarity Search Using Hazelcast In-Memory Data Grid

High Performance with Distributed Caching

White Paper Using Hazelcast with Microservices

Alfresco Enterprise on AWS: Reference Architecture October 2013

Getting Started

Hazelcast In-Memory Platform Terry Walters Sr Solutions Architect

Spring Boot Starter Cache Example

Migrating to In-Memory Computing for Modern, Cloud-Native Workloads

Lumada Data Catalog Product Manager Lumada Data Catalog V 6

In-Memory Databases and Apache Ignite

Data Platforms

Hazelcast-Documentation-3.8.Pdf

Pragmatic App Migration to the Cloud: Quarkus, Kotlin, Hazelcast and Graalvm