
Efficient Similarity Search with Hamming Constraints

by

Xiaoyang Zhang

B.Sc., Wuhan University, China, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING IN THE SCHOOL

OF

Computer Science and Engineering

November 10, 2013

All rights reserved. This work may not be

reproduced in whole or in part, by photocopy

or other means, without the permission of the author.

© Xiaoyang Zhang 2013

Abstract

In this thesis, we study the Hamming distance query problem. Hamming distance measures the number of dimensions where two vectors have different values. In applications such as pattern recognition, information retrieval, Chemoinformatics and databases, it is often necessary to perform efficient Hamming distance queries, which retrieve the vectors in a database that are within Hamming distance k of a given query vector.

Existing work on efficient Hamming distance query processing suffers from some of the following limitations: it cannot deal with k beyond tiny values, it cannot deal with vectors over a large value domain, or it cannot attain robust performance in the presence of data skew. To address these limitations, we propose

HmSearch, an efficient query processing method for Hamming distance query. Our method is based on enumeration-based signatures and a novel partitioning scheme.

We develop enhanced filtering as well as a filtering-and-verification procedure.

To deal with data skew, we design an effective dimension rearrangement method.

We also illustrate a hybrid technique for LSH data. Extensive experimental evaluation demonstrates that our methods outperform state-of-the-art methods by up to two orders of magnitude. We also identify several open problems and show a few possible directions for future work.

Publications Involved in the Thesis

Published conference paper:

• Xiaoyang Zhang, Jianbin Qin, Wei Wang, Yifang Sun, Jiaheng Lu. HmSearch: An Efficient Hamming Distance Query Processing Algorithm. SSDBM 2013.

Acknowledgements

I dedicate this thesis to my parents. Their love and support are indispensable; without them, I would never have had the opportunity to concentrate on my study and finish this thesis.

I would like to express my sincere appreciation to my supervisor, Prof. Wei Wang. He not only guides and supports me in research, but also offers care and help in my everyday life. Moreover, he shows me the attitude a true researcher should have and how diligently one should chase one's dream. The knowledge learned from him and the experience of working with him will benefit me for my whole life.

I would like to thank Prof. Xuemin Lin for his guidance and support of the whole database group.

I would like to thank Dr. Jianbin Qin and Dr. Jiaheng Lu for their collaboration and assistance with the work in Chapter 3, and special thanks to Dr. Jianbin Qin for his long-standing help with my research.

I also would like to thank all our group members: Jianbin Qin (again), Yifei Lu, Yifang Sun, Xiaoling Zhou and Chen Chen. We are brothers and sisters forever.

Contents

Abstract

Acknowledgements

List of Figures

List of Tables

List of Algorithms

1 Introduction

1.1 The Applications for Hamming Distance Query

1.1.1 Hamming distance for Near Duplicate Detection

1.1.2 Chemical Informatics

1.1.3 LSH

1.1.4 Image Retrieval

1.1.5 Iris Recognition

1.2 Challenge and Our Contribution

1.3 Thesis organisation

1.4 Notations Involved in This Thesis

2 Related Work

2.1 Overview of the Similarity Search

2.1.1 Exact Match Query

2.1.2 Similarity Query in Metric Space

2.2 Hamming Distance Query

2.2.1 Theoretical Studies

2.2.2 Practical Solutions

2.2.3 Solutions in Other Areas

2.2.4 Relationship with Other Similarity Measures

3 HmSearch: An Efficient Hamming Distance Query Processing Algorithm

3.1 Overview

3.2 Background Information

3.2.1 Problem Definition

3.2.2 Most Closely Related Techniques

3.3 Reduction of the General Hamming Distance Problem

3.3.1 Reduction Strategy

3.3.2 Heuristics of Choosing κ and k′

3.4 Answer the Reduced Query

3.4.1 Variants and Deletion Variants

3.4.2 1-Query Processing using Variants and Deletion Variants

3.5 The HmSearch Algorithm

3.5.1 Partitioning

3.5.2 Hierarchical Binary Filtering and Verification

3.6 Partition Strategies

3.6.1 Equal Length Partition and its Drawback

3.6.2 Dimension Rearrangement

3.7 Hybrid Techniques for LSH Data

3.7.1 Hamming Distance Query in C2LSH

3.7.2 Hybrid Algorithm

3.8 Experiments

3.8.1 Experiment Setup

3.8.2 Hamming Similarity Query Performance

3.8.3 Candidate Size Analysis

3.8.4 Query Time Fluctuation

3.8.5 Effect of Enhanced Filter and Hierarchical Binary Verification

3.8.6 Effect of Rearranging Dimensions

3.8.7 Scalability

3.8.8 Index Size

3.9 Discussion

3.9.1 Complexity Analysis

3.9.2 2-Query Processing using 1-Variants

3.9.3 Triangular Inequality Pruning

3.10 Summary

4 Final Remark

4.1 Conclusions

4.2 Existing Problems and Future Work

Bibliography

List of Figures

3.1 Google's Method

3.2 Google's Method Recursively

3.3 HEngine

3.4 Index for 1-Variants

3.5 Example of Hierarchical Binary Filtering and Verification

3.6 Impact of Data Skew and Benefit of Dimension Rearrangement

3.7 Dimension Rearrangement Example

3.8 Experiment Results - I

3.9 Experiment Results - II

3.10 Posting List

List of Tables

1.1 Notations

3.1 Statistics of Datasets

3.2 Complexities of Empirical Hamming Distance Query Methods

List of Algorithms

1 HammingQuery(Q, k, κ)

2 filter(v, m)

3 oneHammingQuery1Var(q)

4 HmSearch-V(Q, k, κ)

5 enhancedFilter(v, k)

6 oneHammingQuery1DelVar(q)

7 HBVerify(Q, S)

8 Reorder(Q, N, k)

Chapter 1

Introduction

In this thesis, we study the problem of efficiently and exactly processing similarity queries under a Hamming distance constraint (for simplicity, we call this the Hamming distance query). Hamming distance is a widely used distance function, which measures the number of dimensions where two vectors have different values. The Hamming distance query is to retrieve vectors in a database that have Hamming distance no more than k from a given query vector. A novel and practical approach to solve this problem will be proposed in Chapter 3. It will be demonstrated that our algorithm substantially outperforms previous state-of-the-art algorithms.

With the advancement of information technology, digital data have become an integral part of everyday life. As the growth of digital data accelerates in both scale and variety, it is urgent to find ways to manipulate the data efficiently and effectively. A natural way to manipulate the data is to execute a search against it. A simple example is using Google to search for certain objects across the Internet. As searching is such a prominent data processing operation, researchers have focused much attention on the searching problem, especially exact match at the very beginning.

Exact match searching is a well studied area. However, in plenty of situations, the requirement of searching is naturally not restricted to finding identical objects. For example, if a criminal's face is captured by a surveillance camera, the police probably wish to find all similar faces to identify the suspect. Moreover, with the rapid development of the information industry, data are increasingly being generated and gathered in different areas. Hence current data usually come in huge sizes and a variety of categories, such as images, audio, videos, time series, fingerprints, documents, protein sequences and so on. Since these data collections are huge and complicated, the data objects may well lack precision. Therefore, it is inevitable to use similarity search to find similar objects instead of only seeking exact matches. Currently, similarity search is applied in a variety of areas, including databases, data mining, machine learning, bioinformatics and so on. Hence, the study of similarity search has become a fundamental problem and attracts more and more attention.

A primary task of similarity search is to quantify the degree of similarity of a query object against the objects in a data collection. There are fundamentally two ways to capture the requirements of similarity quantitatively. One is to specify a distance constraint, and the other is to specify a similarity constraint. There are plenty of different distance constraints, such as Minkowski distance, Quadratic Form distance, Edit distance, Hamming distance and so on. There are also a few different similarity constraints, such as Jaccard similarity, Cosine similarity, Dice similarity and so on. Among these measures, Hamming distance is one of the highly popular and widely used ones [Liu et al., 2011].

Hamming distance measures the number of dimensions where two vectors have different values. It is used in many applications such as multimedia databases, bioinformatics, pattern recognition, text mining, computer vision, Chemical Informatics and so on. The reasons for its popularity are as follows:

• Currently, many data objects, such as chemical data, image data, audio data, iris biometrics data and so on, are extraordinarily complicated. Hence, it is extremely expensive to execute searches on those data objects directly. Therefore, to facilitate searching, modern digital objects are usually represented by characteristic features extracted from them, and the binary string is one of the most popular representations. For example, in Chemical Informatics, a compound is usually represented by a fingerprint, which is essentially a binary string. Because plenty of data types are abstracted and stored as equal-length binary strings, it is natural to apply Hamming distance to these strings to measure the dissimilarity among them.

• The calculation of Hamming distance between two vectors is relatively fast (O(n) by linear scan) compared to other distance functions, especially the popular Edit distance (O(n²) by dynamic programming). Moreover, this dimension-wise comparison maps naturally onto bit operations, which can further accelerate the process (see the sketch after this list). During searching, it is sometimes inevitable to employ an exact similarity or distance calculation. Therefore, when dealing with huge amounts of data, using Hamming distance generally has an advantage in running speed.

• A Hamming distance constraint is inter-related with constraints based on some other measures, such as Jaccard similarity, Cosine similarity and Overlap similarity. Therefore, these similarity measures can be mutually transformed into each other under certain conditions.

In terms of similarity searching strategy, performance has become an important criterion for a successful design, and a lack of it will probably lead to failure. Although the Hamming distance between two vectors can be computed in linear time by a simple scan, in applications such as fingerprint-based compound retrieval in Chemical Informatics, with up to millions of binary vectors each corresponding to a compound, it is easy to see both the importance of searching speed and the difficulty of achieving it. Therefore, how to answer Hamming distance queries efficiently is an important research issue. Before we introduce the problem formally and discuss the different solution paradigms, we start by presenting some of its numerous applications in different fields.

1.1 The Applications for Hamming Distance Query

1.1.1 Hamming distance for Near Duplicate Detection

In order to identify near-duplicate web pages, Google uses SimHash [Manku et al., 2007] to obtain a 64-dimension vector for each web page. Two web pages are considered near-duplicates if their vectors are within Hamming distance 3 [Manku et al., 2007]. This method is very practical and can be applied on a distributed system. In [Manku et al., 2007], the method is evaluated at a huge scale: using 64-bit strings, 1 million batch queries are executed on a collection of 8 billion vectors under the Hamming distance constraint k = 3. The experiment runs on a distributed system with 200 mappers and finishes in fewer than 100 seconds.

1.1.2 Chemical Informatics

Similarity search is widely used in Chemical Informatics to search and classify known chemicals, virtually screen chemicals for drug discovery, and predict and optimize the properties of existing active compounds [Flower, 1998, Nasr et al., 2012]. A fundamental query is to find all the molecules whose 881-bit fingerprints have Tanimoto similarity no less than t to the fingerprint of a query molecule. As will be shown in Chapter 3.2, this can be transformed into a Hamming distance query.

1.1.3 LSH

Locality sensitive hashing [Indyk and Motwani, 1998] is a widely used technique for approximate similarity search with probabilistic guarantees. Recently, C2LSH [Gan et al., 2012] was proposed to address the excessive index space required by traditional LSH methods without affecting the theoretical guarantees.

At the core of the method is a Hamming distance query with a medium-valued threshold k against vectors of the database objects generated by N LSH functions.

1.1.4 Image Retrieval

Hamming distance query is employed in image retrieval [Landré and Truchetet, 2007, Wu et al., 2009]. Image data are generally very large, so the original image data cannot be directly indexed for retrieval. One solution is to extract features from an image and hash them into binary strings [Landré and Truchetet, 2007, Chum et al., 2007], which can then be evaluated by Hamming distance queries [Landré and Truchetet, 2007]. Another approach is based on the Bag-of-Words model [Wu et al., 2009]. In this approach, image collections are represented in a visual dictionary based on a k-d tree [Philbin et al., 2007], where each object carries an extra 24-bit signature. During the evaluation, objects that have a large Hamming distance from the query feature are filtered out [Wu et al., 2009].

1.1.5 Iris Recognition

Hamming distance query is widely applied in iris recognition systems with a minor adjustment [Daugman, 1993, Daugman, 2001, Daugman, 2007]. In iris biometrics, the iris texture is usually represented by an iris code, which is essentially a binary string. The Hamming distance used in iris recognition is usually the normalised Hamming distance, which is the standard Hamming distance divided by the number of dimensions of the vector (a constant when the data is fixed). The normalised Hamming distance measures the fraction of bits for which two iris codes disagree, and the minimum computed normalised Hamming distance between two iris codes is assumed to correspond to the correct alignment of the two images [Bowyer et al., 2008]. All m-bit iris codes whose normalised Hamming distance to the query iris code is no more than k/m can be found by a Hamming distance query with constraint k.
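As a small illustration of this adjustment (a sketch with assumed example values, not taken from the iris literature), a normalised threshold t over m-bit codes translates into an absolute Hamming threshold k = ⌊t · m⌋:

import math

def normalised_hamming(s, t):
    # fraction of bit positions where two equal-length iris codes disagree
    return sum(1 for a, b in zip(s, t) if a != b) / len(s)

def absolute_threshold(t_norm, m):
    # H/m <= t_norm  <=>  H <= floor(t_norm * m)
    return math.floor(t_norm * m)

assert absolute_threshold(0.32, 2048) == 655  # 0.32 and 2048 are assumed example values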

1.2 Challenge and Our Contribution

Because Hamming distance is widely used in numerous applications, there are quite a few prior studies focusing on efficient query processing methods for Hamming distance search with a fixed threshold k. However, all of them suffer from some of the following problems:

• Unable to handle medium-valued k. The value of k defines the tolerance to the dissimilarity between two vectors. In some settings, k needs to be set larger to find the distant neighbours of the query. For example, in Chemical Informatics, k needs to be set slightly larger if it is required to find all substances having at most 1/4 of their composition different from a certain compound. Early solutions based on reduction to exact matching problems only work for a very small k [Manku et al., 2007, Tabei et al., 2010]. Recent proposals [Liu et al., 2011, Norouzi et al., 2012] are able to process slightly larger k, but are still fairly limited, as their performance deteriorates rapidly with the increase of k due to a lack of effective pruning. Hence it is sensible to develop an algorithm that better handles searching with medium-valued k.

• Unable to handle a large value domain. Sometimes we need to perform approximate Hamming search on datasets with large domain sizes, such as the datasets generated by MinHash [Broder et al., 1997, Theobald et al., 2008] or kNN search [Norouzi et al., 2012, Gan et al., 2012]. However, most existing solutions were designed for binary vectors (i.e., a value domain of size 2), and they incur huge space usage when the value domain is large. Thus it is preferable to develop an algorithm that can efficiently deal with datasets with large domain sizes.

• Unable to handle skewed data. Some real-world datasets are highly skewed, such as the PubChem dataset. As we show in Chapter 3.6.2, existing methods all degenerate to essentially brute-force linear scans when dealing with such data collections. Therefore it is desirable to develop an algorithm that handles such datasets better.

In this thesis, we study the Hamming distance query problem and tackle the above problems. We propose HmSearch, an efficient query processing method for Hamming distance query that addresses the above-mentioned limitations. Our method partitions the dimensions in such a way that the query results must have at least one partition whose Hamming distance to the corresponding partition of the query is no more than 1. We can then use either 1-deletion variants or 1-variants to efficiently process this special 1-Hamming distance query. We fully exploit the partitioning method by developing a tighter pruning condition that requires candidates to match more partitions under certain circumstances. We also develop a novel hierarchical binary representation of the data that enables us to perform filtering and verification simultaneously at almost no additional cost. To deal with data skew, we design an effective dimension rearrangement method. Moreover, for LSH data, we demonstrate a hybrid technique that switches between the filtering strategy and direct verification, which helps improve performance under certain conditions.

Extensive experimental evaluation using various LSH and chemical datasets demonstrates that our methods outperform the state-of-the-art methods by up to two orders of magnitude, especially for medium-valued k and skewed datasets.

Our contributions can be summarised as follows.

• We propose a versatile method to process Hamming distance queries under a wide spectrum of settings, including error threshold k and value domain size. It is also robust against data skew thanks to the dimension rearrangement technique.

• We compare the proposed method with state-of-the-art methods in an extensive experimental study. The results demonstrate that our method can outperform existing ones by up to two orders of magnitude.

1.3 Thesis organisation

The rest of the thesis is organised as follows.

Chapter 2 introduces some important studies in the similarity search area that are highly related to the Hamming distance query. Chapter 2.1 presents a brief overview of similarity search in metric space, together with some important studies on several distance measures that are highly related to Hamming distance. Chapter 2.2 presents the rich body of research related to the Hamming distance query problem, including works in the theoretical area, works in the practical area, and works outside the database and IR areas, especially Chemical Informatics.

Chapter 3 illustrates the details of our HmSearch algorithms. Chapter 3.2 defines the problem and introduces the preliminaries. Chapter 3.3 presents a general Hamming distance reduction strategy. Chapter 3.4 introduces the variant-based signatures for Hamming distance query with threshold 1. Chapter 3.5 presents our HmSearch method with tighter pruning and a filtering-and-verification procedure based on a hierarchical binary representation of the data. Chapter 3.6.2 presents our technique of rearranging the dimensions to further reduce sensitivity to data skew. Chapter 3.9 presents some discussions, including a theoretical comparison between our methods and several state-of-the-art algorithms; some of our preliminary work and conjectures are also presented there. Experimental results are presented in Chapter 3.8. Chapter 3.10 concludes the chapter.

Chapter 4 concludes this thesis, introduces some open problems in this area and presents our plans for future work. Chapter 4.1 gives a brief conclusion and Chapter 4.2 shows several existing problems and plans for our future work.

1.4 Notations Involved in This Thesis

We list the notations used in this thesis in Table 1.1.

Table 1.1: Notations

Symbol      Definition
N           Dimensionality of all the vectors
k           Hamming distance threshold
M           N − k
τ           M/N
Σ           The domain for all values of the vector
#           Deletion marker
1_1, 2_2    We use the partition ID in the subscript to distinguish values
v_i         The i-th partition of vector v
I_sig       The postings list of signature sig
x(i)        The i-th bit (from left to right) of the binary representation of an integer x (e.g., 5(3) is 1)

Chapter 2

Related Work

In this chapter, we survey the literature on similarity queries, especially the Hamming distance query. First we present an overview of similarity search. Then we give a detailed survey of work related to the Hamming distance query.

2.1 Overview of the Similarity Search

In this section, we give a brief survey of similarity search in metric space. We start with the exact match query problem, then briefly introduce three highly related similarity measures: Jaccard similarity, Edit distance and Hamming distance.

2.1.1 Exact Match Query

Exact match query is commonly used in traditional databases, especially structured databases containing numeric or alphabetic data. Given a database and a query object, the exact match query is to find all objects in the database that are identical to the query object. A naive method to support exact match queries is to keep the database objects sorted in a global order, then employ binary search on the sorted data to find the query object. This takes O(log n) time (n is the number of objects in the database).

A more efficient method to answer the exact match query is to build a hash table as an index over the database objects, then hash the query object and probe the index. This method supports O(1)-time access [Fredman et al., 1984]. [Fredman et al., 1984] uses a simple probabilistic construction algorithm to do the hashing, and some other hashing strategies have been developed more recently, e.g., Cuckoo hashing [Pagh and Rodler, 2001] and perfect hashing [Botelho and Ziviani, 2007, Botelho et al., 2011]. In some cases, the database might be huge, which means there is a high probability that two distinct objects will have the same hash value. In these situations, a cryptographic hash function, such as MD5, will probably be employed.

If the objects are strings, the database objects can be organised as a prefix tree [Knuth, 1973] (also known as a trie), and an exact match query Q can then be answered in O(|Q|) time.
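A minimal trie sketch (our own illustration, not from the cited work) that answers an exact match query Q in O(|Q|) time:

class Trie:
    def __init__(self):
        self.children = {}     # maps a character to a child Trie node
        self.terminal = False  # True if a database string ends here

    def insert(self, s):
        node = self
        for ch in s:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def contains(self, q):
        node = self
        for ch in q:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal

trie = Trie()
for s in ["cat", "car", "dog"]:
    trie.insert(s)
assert trie.contains("car") and not trie.contains("ca")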

2.1.2 Similarity Query in Metric Space

The natural extension of the exact match query is the similarity query, which is to find objects similar to the query in a collection of database objects. One way to measure the similarity among objects is to use a distance function in a metric space. A metric space is a set where a notion of distance between elements is defined. Formally, let M be a set of objects and let d be a function (a distance) defined on M:

d : M × M → R

(M, d) is a metric space if the following holds for any x, y, z ∈ M:

1. d(x, y) ≥ 0 (non-negativity)

2. d(x, y) = 0 ⇐⇒ x = y (identity)

3. d(x, y) = d(y, x) (symmetry)

4. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)

There are different kinds of distance functions d, e.g., Minkowski distances, Quadratic Form distance, Edit distance and Hamming distance.

Usually, the choice of distance function is highly dependent on the application domain; e.g., Jaccard similarity [Gionis et al., 1999] is usually used for sets, Edit distance [Levenshtein, 1966] for strings, and Hamming distance [Manku et al., 2007] for binary strings. The solutions for different similarity measures are usually different.

A very practical approach to the similarity problem is the approximate strategy. Generally, an approximate strategy answers the similarity query with most of the required results (not necessarily all), with a theoretical guarantee on the number of missed results. One of the widely used approximate solutions is LSH [Indyk and Motwani, 1998]. The basic idea of LSH is that two similar objects have a high probability of sharing the same hash value, while dissimilar objects have a high probability of having different hash values.

Different hash functions [Broder, 1997, Charikar, 2002, Datar et al., 2004] have been created to approximate different similarity measures, e.g., MinHash [Broder, 1997] for Jaccard similarity. Several recent works in this area are [Gan et al., 2012, Satuluri and Parthasarathy, 2012].

Although approximate strategies usually offer short query times and small index sizes, under some conditions the query has to be answered exactly. Exact similarity query in metric space is a very important area of study. Three widely used and well studied distance functions in this area are the Jaccard coefficient, Edit distance and Hamming distance; the three are inter-related [Xiao et al., 2008b].

2.1.2.1 Jaccard Distance and Some Other Set Similarity Measures

The Jaccard distance is a popular metric defined as follows: given sets S and T,

J(S, T) = 1 − |S ∩ T| / |S ∪ T|

A more widely used variant is the Jaccard similarity (or coefficient), defined as the complement of the Jaccard distance:

J(S, T) = |S ∩ T| / |S ∪ T|

Under this setting, J(S, T) = 1 if S = T. The Jaccard similarity is a very popular measure to estimate the similarity between sets. Some important works about similarity search using the Jaccard coefficient are [Gionis et al., 1999, Charikar, 2002, Chaudhuri et al., 2006, Xiao et al., 2008b].

There are several other set similarity measures, including Overlap similarity, Cosine similarity and Dice similarity [Xiao et al., 2008b]. Although these similarity measures are not metrics, they are closely related to the Jaccard similarity, so we list them here as well (a small code sketch of all four measures follows the list):

• Overlap. Given sets S and T, the overlap coefficient [Charikar, 2002] is defined as

O(S, T) = |S ∩ T| / min(|S|, |T|)

The Overlap similarity constraint can be equivalently transformed into constraints on several other similarity measures, including Jaccard similarity, Cosine similarity, Dice similarity, Hamming distance and Edit distance [Xiao et al., 2008b]. Therefore, several studies focus on efficient query processing with the overlap constraint, then extend the methods to other similarity constraints [Sarawagi and Kirpal, 2004, Bayardo et al., 2007, Xiao et al., 2008b].

• Cosine. Given sets S and T, the Cosine similarity is defined as

C(S, T) = Σᵢ Sᵢ · Tᵢ / (√|S| · √|T|)

Some important related works are [Bayardo et al., 2007, Xiao et al., 2008b].

• Dice. Given sets S and T, the Dice similarity is defined as

dice(S, T) = 2|S ∩ T| / (|S| + |T|)

[Xiao et al., 2008b] introduces a solution for the Dice similarity using the Overlap similarity.

2.1.2.2 Edit Distance

Edit distance is a distance function measuring how dissimilar two strings are. The basic Edit distance is defined as the minimum number of operations (deletion, insertion, substitution) required to transform one string into the other [Levenshtein, 1966]. A more general version of Edit distance is introduced in [Kurtz, 1996], where each operation comes with a cost function. The Edit distance can be computed in O(n²) time using dynamic programming [Wagner and Fischer, 1974]. This calculation was later improved in [Masek and Paterson, 1980, Ukkonen, 1985, Myers, 1999, Qin et al., 2013].

[Masek and Paterson, 1980] improves the bound to O(n²/log n). [Myers, 1999] proposes an algorithm with time complexity O(n²/w) (w is the machine word length). [Ukkonen, 1985] achieves a time complexity of O(τ · n). [Qin et al., 2013] further improves [Ukkonen, 1985] by precomputing the transition states of some diagonal values.
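The following is the textbook O(n²) dynamic program of [Wagner and Fischer, 1974], included as a sketch to make the recurrence concrete:

def edit_distance(s, t):
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions
    for j in range(n + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[m][n]

assert edit_distance("kitten", "sitting") == 3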

There are plenty of theoretical works on Edit similarity. The solutions for large τ (the Edit distance threshold) and small τ are different. For τ = 1, there are [Yao and Yao, 1997, Brodal and Gasieniec, 1996, Belazzougui, 2009, Belazzougui and Venturini, 2012]; [Belazzougui and Venturini, 2012] gives an efficient method that solves the problem in query time O(|Q| + occ). When τ > 1, the problem becomes more difficult; among [Cole et al., 2004, Tsur, 2010, Chan et al., 2011], [Chan et al., 2011] presents a structure that solves the problem in query time O(|Q| + lg^{d(d+1)} n · lg lg n + occ) using O(n) space.

The study of practical solutions for Edit similarity is fruitful. There are generally three categories of solutions:

• Sketch-based. Most sketch-based algorithms follow a filtering-and-verification framework. Because the exact Edit distance computation between two strings is very expensive, measures are needed to quickly eliminate pairs that are not within the Edit distance threshold. Hence, some sketches (usually grams) of each string are selected as signatures, so that a dissimilar pair can be pruned instantly by checking certain requirements on their signatures. The real Edit distance is only calculated if a pair is not pruned. There are a number of studies on how to choose sketches [Gravano et al., 2001, Chaudhuri et al., 2006, Li et al., 2007, Xiao et al., 2008a, Yang et al., 2008, Li et al., 2011a, Qin et al., 2011] and how to do post-pruning before verification [Gravano et al., 2001, Xiao et al., 2008a, Qin et al., 2011] to improve performance.

• Enumeration-based. The enumeration-based method generates the entire τ-Edit-distance neighbourhood of the strings in the database. This exhaustive enumeration can answer any Edit distance query within τ in O(1) time. The naive enumeration method only works when τ is extremely small (0 or 1) because of the large number of enumerations. [T. Bocek, 2007] proposes to enumerate the deletion neighbourhoods instead of the real neighbourhood to significantly reduce the number of enumerations. Recent studies [Wang et al., 2009, Li et al., 2011a] introduce a partition strategy that reduces the general Edit distance problem to several small Edit distance sub-problems, then uses enumeration to solve each sub-problem. Apart from that, [Arasu et al., 2006] proposes to do the partitioning and enumeration on the domain, and [Li et al., 2011b] introduces a bit-operation-based strategy.

• Automata-based. Most automata-based algorithms are based on the trie structure, which is a prefix tree (and can be viewed as a DFA). Generally, the strings are organised into a trie, such that strings with the same prefix are grouped together. When a query comes in, the trie is traversed top-down to find the answers. Unlike the sketch-based algorithms, with a trie the verification can be done during the traversal. This means there is no need for verification after filtering, so it is highly efficient for very short strings. An open problem for this category is to utilise not only prefix sharing but also suffix sharing. [Mihov and Schulz, 2004, Chaudhuri and Kaushik, 2009, Wang et al., 2010, Deng et al., 2012, Qin et al., 2013, Deng et al., 2013, Xiao et al., 2013] all belong to this category.

2.1.2.3 Hamming Distance

Hamming distance measures the number of dimensions where two vectors have different values. Between two equal-length strings, Hamming distance can be viewed as a special case of Edit distance where the only edit operation is substitution. Therefore, almost all algorithms designed for Edit similarity can be directly applied to the Hamming distance constraint. However, there are also studies focusing specifically on the Hamming distance query; these usually obtain more efficient solutions by exploiting the unique properties of the Hamming distance measure [Yao and Yao, 1997, Brodal and Gasieniec, 1996, Brodal and Venkatesh, 2000, Manku et al., 2007, Tabei et al., 2010, Liu et al., 2011, Norouzi et al., 2012]. We give a detailed survey of these studies in the next section.

2.2 Hamming Distance Query

In this section, we introduce works that are highly related to the Hamming distance query problem. We first present some studies in the theoretical area, then show some practical solutions, and next introduce works on this problem in other areas, especially Chemical Informatics. Note that some works also deal with similarity search among vectors, but in a non-metric space and focusing on a different aspect, such as [Dayal et al., 2006]; we do not discuss them in this thesis.

2.2.1 Theoretical Studies

The Hamming distance query with threshold k was originally known as the k-query problem. [Minsky and Papert, 1987] first proposed the d-query problem (d plays the same role as k), which asks whether there exists a string in a dictionary (consisting of n strings) within Hamming distance d of a given (binary) query string Q of length m. Different solutions are required for small d and large d. For the special case d = 1, there exist many efficient solutions [Yao and Yao, 1997, Brodal and Gasieniec, 1996, Brodal and Venkatesh, 2000].

Yao et al. [Yao and Yao, 1997] present a data structure for the 1-query that achieves query time O(m log log n) bit probes and space O(mn log m) bits in the bit probe model. The basic idea of their work is that given two strings whose Hamming distance is at most one, if these two strings are divided into two partitions arbitrarily, at least one partition must be identical. Based on this idea, they first construct a dictionary Dw recursively for the database strings using the FKS dictionary [Fredman et al., 1984]. When a query comes in, it is first looked up in Dw to check whether a direct answer can be returned. If not, the query is divided into two parts and each part is recursively checked in Dw; the final answer is returned when the recursion ends.

Brodal et al. [Brodal and Gasieniec, 1996] introduce two data structures, both of which answer the 1-query in O(m) memory accesses using O(mn) words of space, based on the standard unit-cost RAM model with logarithmic word size. The basic structure they use is the trie, a prefix tree usually used to represent a set of strings. The first data structure is simple: a single trie that stores all strings within Hamming distance 1 of any string in the data collection. The basic idea of the second data structure is that if two strings have Hamming distance exactly 1, they must share a prefix and a suffix with one mismatch in between. Therefore, two tries are organised in the second structure: one for the original strings and the other for the reversed suffixes of the strings. They guarantee that the second data structure can be constructed in O(mn) time.

[Brodal and Venkatesh, 2000] constructs a data structure that answers the 1-query in O(1) time using O(n log m) space in a cell probe model with word size m. The basic idea is that if a query is identical to a string in the database, it is easy to find it using an index. Otherwise, if the query is at Hamming distance one from a database string, they just need to find the mismatched position, flip it, and check the resulting string against the database again. To use this idea, they create a hash table index using the FKS dictionary [Fredman et al., 1984]. They also create a perfect hash function [Schmidt and Siegel, 1990] for the set of all strings at Hamming distance one from a database string, and use the hash values as entries to record the positions where each 1-neighbour string differs from the original database string. When a query comes in, they first check it against the index of database strings. If it is not found, they use the index of 1-neighbours to find the mismatched positions, flip one position at a time, and check each changed string against the index of database strings.

The d-query for large d is much harder, with few results beating the naive solution with O(m^d) query time. The state-of-the-art result is obtained by [Cole et al., 2004], which answers a d-query, where d = O(1), in time O(m + log^d(nm) + occ) using space O(n log^d(nm)), where occ is the number of query results. The basic idea is to build a trie over the database strings and, for each node in the trie except the leaves, an auxiliary data structure named the substitution tree to help accelerate the query.

2.2.2 Practical Solutions

Due to the wide range of applications of Hamming distance queries (e.g., those mentioned in Chapter 1.1), many practical solutions have been proposed. Almost all of them are based on a reduction framework: the k-query problem is reduced to several k′-query sub-problems, where k′ < k.

[Manber and Wu, 1994] essentially suggests indexing all the 1-variants of the strings in the dictionary to answer the 1-query efficiently.

To handle small k, [Manku et al., 2007, Tabei et al., 2010] divide the strings arbitrarily into k + 1 partitions, such that for each pair within the Hamming distance threshold k, at least one partition of the data string exactly matches the corresponding partition of the query. [Manku et al., 2007] also proposes methods to apply the same idea recursively. Because this study [Manku et al., 2007] is closely related to our work, we give a more detailed illustration of their techniques in Chapter 3.2.2.

The PartEnum method [Arasu et al., 2006] proposes a two-level partitioning strategy. First, the N dimensions are divided into κ1 (κ1 ≤ k + 1) equal-sized partitions. By the pigeonhole principle, any two vectors within Hamming distance k must have at least one partition with Hamming distance at most k′ = (k + 1)/κ1 − 1. Next, the dimensions in each partition are divided into κ2 (κ1 · κ2 > k + 1) second-level partitions, and all possible subsets of the second-level partitions of size κ2 − k′ are enumerated. Each subset (its values along with its dimensions) is considered a signature. Under this partitioning strategy, if two vectors are within Hamming distance k, they must have at least one signature in common. The main disadvantage of this method is that the exhaustive enumeration over the two-level partitions results in a huge number of signatures, which significantly hurts both query performance and index space.

The above methods work well for very small values of k. When k becomes larger, they cannot deal with the problem efficiently, as the number of dimensions in each partition becomes small, which results in poor selectivity. Recent work addresses this limitation by reducing the general problem into several 1-query sub-problems [Liu et al., 2011, Norouzi et al., 2012].

In [Liu et al., 2011], the number of partitions κ is chosen to be ⌊k/2⌋ + 1, such that two vectors within the Hamming distance threshold k must have at least one partition within Hamming distance 1. To efficiently support the 1-query, [Liu et al., 2011] proposes a solution that enumerates the entire 1-neighbourhood of the query. We give a more detailed illustration of the techniques in [Liu et al., 2011] in Chapter 3.2.2.

[Norouzi et al., 2012] uses a very similar strategy to [Liu et al., 2011]. The difference between the two proposals is that [Liu et al., 2011] replicates the data while [Norouzi et al., 2012] resorts to indexes; the latter was implemented and compared with in our experiments. Although both papers mention the possibility of k′ > 1, this results in a very large index, as its size is super-linear in the number of dimensions. We also note that the approaches reducing to 1-queries (including ours) are better than those reducing to 0-queries via two-level partitioning [Manku et al., 2007, Arasu et al., 2006]: they have similar signature lengths (2N/(k + 2) vs. (2k + 1)N/(k + 1)²), but the latter generate far more signatures per vector (k/2 vs. (k + 1)²) (see Table 3.2); therefore, we do not compare with the latter in our experiments.

2.2.3 Solutions in Other Areas

Because Hamming distance is widely used in different areas, there is also plenty of work on Hamming distance outside the database and IR areas. As far as we can tell, most of these works follow, or are very similar to, the above solutions [Landré and Truchetet, 2007, Daugman, 1993, Wu et al., 2009, Bowyer et al., 2008, Miller et al., 2005]. However, we find some solutions in Chemical Informatics very interesting and list them below.

Solutions in Chemical Informatics. Similarity queries with a Tanimoto threshold on binary fingerprints of chemicals are widely used in Chemoinformatics applications [Flower, 1998, Chen et al., 2005, Chen et al., 2009, Nasr et al., 2009, Norouzi et al., 2012]. There exist many specialised solutions [Swamidass and Baldi, 2007, Baldi et al., 2008, Nasr et al., 2010, R. Nasr, 2011, Nasr et al., 2012], most of which are based on bounding the number of 1-bits in the fingerprints or their partitions.

[Swamidass and Baldi, 2007] develops a bound on the number of 1-bits given a query fingerprint, and [Nasr et al., 2010] further applies this idea to partitioned fingerprints. Another 1-bit bound is developed in [Baldi et al., 2008], where fingerprints are "folded" down to shorter fingerprints via the XOR operation, and a bound on the similarity can be established on the short fingerprints. [R. Nasr, 2011] proposes a method named MultiBit Tree, a binary tree built recursively by choosing a certain dimension to split the remaining fingerprints. At query time, a depth-first traversal of the tree is performed together with pruning based on the number of 1-bits.

One of the latest methods is [Nasr et al., 2012], where each fingerprint is transformed into a set (as in Chapter 3.2) and an inverted index is built on the set elements. This essentially reduces the original problem into a set overlap search problem, where the DivideSkip method proposed in their earlier work [Li et al., 2008] is employed for query processing.

2.2.4 Relationship with Other Similarity Measures

Different applications usually come with different similarity measures. In fact, several prevalent similarity measures can be equivalently transformed into Hamming distance constraints.

Relationship with Several Popular Similarity Constraints. First we introduce several widely used similarity functions. Let S and T be binary vectors, and let set(S) be the set representation of S, defined as set(S) = { Di | S[Di] ≠ 0 }.

• Jaccard similarity is defined as J(set(S), set(T)) = |set(S) ∩ set(T)| / |set(S) ∪ set(T)|

• Cosine similarity is defined as C(set(S), set(T)) = Σᵢ Sᵢ · Tᵢ / (√|set(S)| · √|set(T)|)

• Overlap similarity is defined as O(set(S), set(T)) = |set(S) ∩ set(T)|

The equivalent forms of the above similarity measures are as follows:

• J(Q, S) ≥ t ⇐⇒ H(Q, S) ≤ ((1 − t)/(1 + t)) · (|set(Q)| + |set(S)|)

• C(Q, S) ≥ t ⇐⇒ H(Q, S) ≤ |set(Q)| + |set(S)| − 2 · ⌈t · √(|set(Q)| · |set(S)|)⌉

• O(Q, S) ≥ α ⇐⇒ H(Q, S) ≤ |set(Q)| + |set(S)| − 2α

Relationship with Tanimoto Similarity. In Chemical Informatics, molecules can be represented by binary vectors called fingerprints [Flower, 1998]. One of the most popular measures of similarity between fingerprints is the Tanimoto similarity [Nasr et al., 2009], which is essentially the Jaccard similarity and is defined as:

T(S, T) = |set(S) ∩ set(T)| / |set(S) ∪ set(T)|

We can derive the following equivalence between a constraint based on Tanimoto similarity and one based on Hamming distance:

T(Q, S) ≥ t ⇐⇒ H(Q, S) ≤ ((1 − t)/(1 + t)) · (|set(S)| + |set(Q)|)

If we perform a search using a Tanimoto similarity threshold of t, we can derive the threshold of a Hamming distance query as kQ = ((1 − t)/t) · |set(Q)|. This is because for any result S satisfying T(Q, S) ≥ t, we know that

|set(S)| ∈ [t · |set(Q)|, |set(Q)|/t].
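A small sketch of this reduction (the helper name and example values are ours):

import math

def hamming_threshold(q_ones, t):
    # q_ones = |set(Q)|, the number of 1-bits in the query fingerprint;
    # k_Q = (1 - t) / t * |set(Q)| covers every S with T(Q, S) >= t
    return math.floor((1 - t) / t * q_ones)

# e.g. a query fingerprint with 200 one-bits and Tanimoto threshold t = 0.75
# (assumed example values):
assert hamming_threshold(200, 0.75) == 66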

Chapter 3

HmSearch: An Efficient Hamming Distance Query Processing Algorithm

3.1 Overview

In this chapter, we report our studies on the Hamming distance query problem.

We present some preliminaries first, then demonstrate our HmSearch algorithms.

Next, we illustrate how to optimise HmSearch for LSH data. After that, we present some discussions and our experimental results. Finally, we conclude the chapter.

3.2 Background Information

In this section, we first formally define the Hamming distance query problem in Chapter 3.2.1, and then describe the most closely related techniques in Chapter 3.2.2.


3.2.1 Problem Definition

Since Hamming distance works on a fixed number of dimensions, we consider all data and query vectors to have N dimensions in this thesis. The i-th dimension is denoted Di, and V[Di] represents the i-th dimension value of a vector V. Without loss of generality, we assume the domain of possible values for every Di is the same, denoted Σ.

Let ∆(x, y) = 0 if x = y and 1 otherwise. The Hamming distance between two vectors S and T is defined as:

H(S, T) = Σ_{i=1}^{N} ∆(S[i], T[i])

If we consider T using S as a yardstick, we can also say T has made H(S, T) errors with respect to S.

Given a dataset V of vectors, a Hamming distance query of a query vector Q and threshold k retrieves all vectors in the dataset with Hamming distance to Q no more than k, or

{ vi ∈ V | H(vi,Q) ≤ k }

Such a query is also known as the k-query due to [Minsky and Papert, 1987].
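As a baseline, the k-query can always be answered by a naive linear scan directly following the definitions above (a sketch; the function names are ours):

def hamming(s, t):
    return sum(1 for a, b in zip(s, t) if a != b)

def k_query(database, q, k):
    # all vectors v in the database with H(v, Q) <= k
    return [v for v in database if hamming(v, q) <= k]

db = [(0, 1, 2), (0, 1, 3), (4, 5, 6)]
assert k_query(db, (0, 1, 2), 1) == [(0, 1, 2), (0, 1, 3)]

All the indexing methods discussed in this chapter are ultimately attempts to beat this O(|V| · N) scan.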

3.2.2 Most Closely Related Techniques

To better illustrate our work, we briefly describe the two most closely related techniques for processing the Hamming distance query.

Google. Google's algorithm was introduced in [Manku et al., 2007]. The idea is based on the observation that if two vectors are within Hamming distance k, then by dividing the two vectors arbitrarily into k + 1 segments, at least one of the k + 1 segments must be exactly the same for the two vectors. Based on this idea, they develop the following technique (see Figure 3.1). Their method first partitions the vectors in the database into k + 1 even partitions, then replicates the database vectors k + 1 times (each replica denoted as a table), where each vector in the i-th table (i ranges from 1 to k + 1) has its i-th partition moved to the leftmost position. When a query comes in, the same partitioning and replication strategy is applied to the query vector, and each replicated query is put into the table with the same partition ID at the leftmost position. Next, the vectors in each table are sorted by the leftmost partition, and vectors whose leftmost partition is identical to that of the query replica are fetched. Finally, results from each table are merged to obtain the final results.

[Figure 3.1: Google's Method]

They further develop a recursive version of this method. The idea is that, following the previous setting, given one of the k + 1 partitions, if the remaining dimensions are again partitioned into k + 1 partitions, at least one of the k + 1 segments of the remaining part must be exactly the same for the two vectors. This process can be applied recursively; Figure 3.2 shows a two-level partitioning.

[Figure 3.2: Google's Method, Applied Recursively]

This recursive method helps decrease the number of results retrieved from each table, because intuitively there is more information to look at during the sorting. However, each additional level of recursion multiplies the total number of tables by k + 1; therefore, at most a two-level partitioning is practical. Note that Google's method was designed for distributed computing: the database vectors are replicated as tables and placed on a farm of machines, which contributes a decent load balance.

HEngine. HEngine was introduced in [Liu et al., 2011]. Its basic idea can be viewed as an extension of Google's one-level partitioning method: if two vectors are within Hamming distance k, then by dividing the two vectors arbitrarily into κ segments, where κ ≥ ⌊k/2⌋ + 1, at least m = κ − ⌊k/2⌋ partitions are within Hamming distance 1 for the two vectors. Based on this idea, a general Hamming distance problem can be reduced to several 1-Hamming-distance problems. Each 1-Hamming-distance problem is then tackled by enumerating the whole 1-substitution neighbourhood of the query's partition plus the partition itself ((|Σ| − 1) · l_pi + 1 strings, where l_pi is the length of the i-th partition) and finding the vectors that have identical partitions with the query. The process of HEngine is presented in Figure 3.3.

[Figure 3.3: HEngine]

At the beginning, the vectors in the database are partitioned into ⌊k/2⌋ + 1 parts, so m = 1. The partitioned vectors are then replicated into ⌊k/2⌋ + 1 tables, with the i-th partition moved to the leftmost position, similar to the one-level partition version of Google's method. When a query comes in, the query vector is also partitioned into ⌊k/2⌋ + 1 parts and replicated ⌊k/2⌋ + 1 times with each partition at the leftmost. Then, for the leftmost partition of each replicated query, its entire 1-substitution neighbourhood (e.g., T1(1) to T1(f(T1)) in Figure 3.3, where f(·) gives the number of enumerations) is generated by substituting the element at each position of that partition with every possible element of the alphabet. Next, each replicated query and its enumerations are put into the table with the same partition ID at the leftmost: the vectors in each table are sorted by the leftmost partition, and vectors whose leftmost partition is identical to that of an enumerated query replica are found. Finally, results from all replications are merged to generate the final results.

The maximum number of probes in HEngine is then (m/κ + 1)^q · C(κ, q), where m is the string length and q is the threshold of the sub-queries; therefore, probably only q = 1 is practical. In addition, HEngine also has a recursive version, similar to that of Google's method. However, since HEngine is not designed for distributed computing, the recursion greatly increases the number of probes; hence the recursive version is probably not practical.
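A sketch of the 1-substitution neighbourhood enumeration that HEngine relies on (assumed representation: a partition is a tuple over the alphabet Σ); the enumeration has (|Σ| − 1) · l + 1 members for a partition of length l:

def one_neighbourhood(partition, sigma):
    # the partition itself plus every single-position substitution
    yield partition
    for i, old in enumerate(partition):
        for c in sigma:
            if c != old:
                yield partition[:i] + (c,) + partition[i + 1:]

sigma = (0, 1)
variants = list(one_neighbourhood((0, 1, 1), sigma))
assert len(variants) == (len(sigma) - 1) * 3 + 1  # = 4 for a binary alphabet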

3.3 Reduction of the General Hamming Distance Problem

In this section, we first illustrate how to reduce the general Hamming distance problem to Hamming distance problems with a smaller threshold (denoted k′). Next, we discuss our heuristics for choosing k′ and the number of partitions κ.

3.3.1 Reduction Strategy

The prevalent approach to answering Hamming distance queries is based on reducing the problem, via partitioning, into several instances of Hamming distance query with a lower threshold.

First, we introduce a few concepts. We consider a partitioning scheme that divides the N dimensions into κ partitions; each partition, denoted pi, is a subset of dimensions { Di,1, Di,2, ..., Di,|pi| }. Given a vector v, its projection on a partition, i.e., v[Di,1, ..., Di,|pi|], forms a new projected vector, denoted v^i.

Definition 1 (Match, Exact-Match, and k′-Match) Given two partitions pi and pj, if H(pi, pj) ≤ 1, they are said to match each other. In addition, if H(pi, pj) = 0, they are said to exact-match each other; if H(pi, pj) = k′ (k′ ≠ 0), they are said to k′-match each other.

Lemma 1 Given two vectors S and T such that H(S, T) ≤ k, if we divide the N dimensions (arbitrarily) into κ partitions, there are at least m = κ − ⌊k/(k′ + 1)⌋ partitions, { p1, p2, ..., pm }, such that H(S^i, T^i) ≤ k′ for all 1 ≤ i ≤ m, where k′ ≥ ⌊k/κ⌋ and k′ ∈ ℕ.

Proof. Assume on the contrary that at most m − 1 partitions have at most k′ errors. Then all the remaining κ − m + 1 partitions have at least k′ + 1 errors each. Let β = k′ + 1. Since κ − m + 1 = ⌊k/β⌋ + 1, the total number of errors is at least

β · (⌊k/β⌋ + 1) > k,

which contradicts the condition that H(S, T) ≤ k.

This lemma is a generalisation of previous results (such as Theorem 3.1 in [Liu et al., 2011] and [Norouzi et al., 2012]). As such, it has several instantiations, each resulting in a different algorithm. For example, [Manku et al., 2007] chooses k′ = 0 and κ = k + 1, such that there must be m = 1 exact-matching partition, as ⌊k/(k + 1)⌋ = 0. [Liu et al., 2011] essentially chooses k′ = 1 and κ = ⌊k/2⌋ + 1, hence entailing at least m = 1 1-matching partition. [Norouzi et al., 2012] considers the general case of choosing any κ, but fails to capitalise on cases where m could be greater than 1.

The overall query processing method based on reduction can also be captured in the following general framework:

• In the indexing phase, each vector in the database is partitioned into κ partitions. Each partition is indexed in such a way that it is possible to efficiently answer a Hamming distance query with threshold ⌊k/κ⌋ over the projections of all vectors on this partition.

• In the query processing phase (see Algorithm 1), the query vector is partitioned in the same way into κ partitions. A special Hamming distance query with threshold k′ is issued on each query partition to obtain a list of candidate vectors whose corresponding partition has at most k′ Hamming distance from the query partition Qi (Line 4). The returned results of these κ queries are added to the CAND hash table, which counts the number of times a vector has been encountered. We perform the filtering (see Algorithm 2), which essentially checks the occurrence count against m. If a vector passes the filtering, it is then verified (Line 7) against the entire query Q.

Algorithm 1: HammingQuery(Q, k, κ)
/* generate candidates */
1  CAND ← empty hash table that maps a vector ID to an integer;
2  partition(Q, κ);
3  for each i-th partition Q^i of the query Q do
4      for each vector ID v ∈ reducedHammingQuery(Q^i, ⌊k/κ⌋) do
5          CAND[v] ← CAND[v] + 1;   // CAND[v] is initialised to 0 upon first visit
/* filtering and then verification */
6  for each candidate v ∈ CAND do
7      if filter(v, κ − ⌊k/(⌊k/κ⌋ + 1)⌋) = false then
8          if verify(Q, v) then
9              output v;

Algorithm 2: filter(v, m)
Output: Returns true if v is filtered (i.e., disqualified)
1  if CAND[v] < m then
2      return true;
3  return false;

Remark 1 In addition to the indexing approach described above, the other option is to replicate the vectors and keep the vectors in each copy sorted. Binary search is used on each copy to locate candidates. This method usually incurs much overhead in both space and query time, and is mostly adopted in distributed systems to achieve a high degree of parallelism [Manku et al., 2007].
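To make the reduction framework concrete, the following is a minimal C++ sketch of Algorithm 1's candidate generation, filtering, and verification loop. The reducedQuery callback stands in for the per-partition index lookup described in Chapter 3.4; all names here are illustrative, not part of any fixed API.

    #include <cstddef>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    using Vector = std::vector<int>;

    // Exact Hamming verification against the entire query (early exit at k+1).
    static bool verify(const Vector& q, const Vector& v, int k) {
        int errs = 0;
        for (std::size_t i = 0; i < q.size(); ++i)
            if (q[i] != v[i] && ++errs > k) return false;
        return true;
    }

    // Sketch of Algorithm 1. reducedQuery(i, kPrime) returns the IDs of
    // vectors whose i-th partition is within Hamming distance kPrime of
    // the query's i-th partition.
    std::vector<int> hammingQuery(
        const Vector& q, const std::vector<Vector>& db, int k, int kappa,
        const std::function<std::vector<int>(int, int)>& reducedQuery) {
        const int kPrime = k / kappa;            // floor(k / kappa)
        const int m = kappa - k / (kPrime + 1);  // occurrence bound (Lemma 1)
        std::unordered_map<int, int> cand;       // vector ID -> match count
        for (int i = 0; i < kappa; ++i)
            for (int v : reducedQuery(i, kPrime))
                ++cand[v];                       // initialised to 0 on first visit
        std::vector<int> results;
        for (const auto& [v, count] : cand)      // filter (Algorithm 2), then verify
            if (count >= m && verify(q, db[v], k))
                results.push_back(v);
        return results;
    }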

3.3.2 Heuristics of Choosing κ and k′

Given Lemma 1, the general Hamming distance problem can be reduced to $\kappa$ Hamming distance problems with threshold $k'$, and the final result is the merge of the results of those subproblems. $\kappa$ and $k'$ are parameters which can be customised, where $k' \ge \lfloor k/\kappa \rfloor$.

Intuitively, an efficient solution for the case $k' = 0$ is straightforward, while a larger $k'$ will probably result in much slower processing of each $k'$-query.

Based on our survey of the existing work on this problem and on our own research, the largest value of $k'$ that can be efficiently handled is 1 [Liu et al., 2011] (in Chapter 3.4.1, we present our methods, which also handle the case $k' = 1$ but more efficiently). Hence the value of $k'$ effectively has only two options: 0 and 1.

Next, we present a model to quantify the pruning power (and hence the size of CAND) with respect to $\kappa$. Let the query generate $\kappa$ partitions. We model $P_{\ge m}$, the probability that a set $s$ matches at least $m$ partitions of the query.

Lemma 2 Assume the matching conditions of the partitions are independent, and that a set matches any given partition of the query with probability $p_{sig}$. Then
$$P_{\ge m} = \sum_{i=m}^{\kappa} \binom{\kappa}{i} \, p_{sig}^{i} \, (1 - p_{sig})^{\kappa - i} = 1 - I_{1 - p_{sig}}(\kappa - m + 1, m),$$
where $I$ is the regularised incomplete beta function.
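As an illustration of Lemma 2, the following C++ sketch evaluates $P_{\ge m}$ directly as the binomial tail sum (rather than through the incomplete beta function); kappa, m, and psig are the quantities defined above.

    #include <cmath>
    #include <cstdio>

    // Probability that at least m of kappa independent partitions match,
    // each with probability psig (the binomial tail sum in Lemma 2).
    double pAtLeastM(int kappa, int m, double psig) {
        double total = 0.0;
        for (int i = m; i <= kappa; ++i) {
            // log of the binomial coefficient C(kappa, i), via lgamma
            double logC = std::lgamma(kappa + 1.0) - std::lgamma(i + 1.0)
                        - std::lgamma(kappa - i + 1.0);
            total += std::exp(logC + i * std::log(psig)
                                   + (kappa - i) * std::log1p(-psig));
        }
        return total;
    }

    int main() {
        // e.g., kappa = 5 partitions, m = 2 required matches, psig = 0.01
        std::printf("P>=m = %.6g\n", pAtLeastM(5, 2, 0.01));
        return 0;
    }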

Intuitively, given a fixed $k$, a smaller $\kappa$ allows longer partitions. Therefore $p_{sig}$ will be smaller, because each partition is more selective. Given Lemma 2, $p_{sig}$ is a key parameter in the formulation, and the pruning power increases as $p_{sig}$ decreases. Viewed from this angle, a smaller $\kappa$ will probably improve the pruning power. However, we also notice that in Lemma 1 a larger $\kappa$ results in a larger $m$, and that in Lemma 2, if $p_{sig}$ is fixed, a larger $\kappa$ will probably lead to a larger $P_{\ge m}$. There is therefore a trade-off. In our experience, a larger $m$ always requires a more complicated merging step that is difficult to execute efficiently; meanwhile, $p_{sig}$ is usually the more important factor, especially when $k$ is not large. Hence we generally prefer a smaller $\kappa$.

Then we consider the relationship between $\kappa$ and $k'$. Based on Lemma 1, we can easily derive the following lemma:

Lemma 3 Based on Lemma 1, given a $k'$, $\kappa > \frac{k}{k'+1}$.

Given Lemma 3, if $k$ is fixed, $\kappa$ is bounded from below in terms of $k'$, and this lower bound decreases as $k'$ increases. Because we wish to keep $\kappa$ small, we set $k' = 1$, so $\kappa > \frac{k}{2}$ and the lower bound on $\kappa$ is $\lceil \frac{k+1}{2} \rceil$. Notice that when $k$ and $k'$ are both fixed, a slightly larger $\kappa$ results in a larger $m$, which can increase the pruning power when $k$ is large. Therefore, in our method, we set $k' = 1$ and $\kappa = \lfloor \frac{k+3}{2} \rfloor$, which is slightly larger than the lower bound. By doing this, we can raise the value of $m$ to 2 with almost no loss of selectivity. In Chapter 3.6 we will show some improvements based on these settings, which further improve the pruning power.

3.4 Answering the Reduced Query

In this section, we introduce the definitions of 1-variants and 1-deletion variants, and then illustrate how to use each of these two kinds of variants to efficiently answer the Hamming distance query for $k' = 1$.

3.4.1 Variants and Deletion Variants

The 1-variants of a vector $v$ with respect to the value domain $\Sigma$ are all the vectors $v'$ in $\Sigma^N$ such that $H(v, v') \le 1$, denoted collectively as 1-Var-Set($v$). $v$ is by definition its own 1-variant. If we exclude $v$ itself, the remaining 1-variants are called $v$'s strict 1-variants. They can be computed easily by substituting $v[i]$ ($1 \le i \le N$) with another character from $\Sigma$. The total number of 1-variants of a vector is therefore $1 + (|\Sigma| - 1) \cdot N$.

Let $\Sigma^* = \Sigma \cup \{\#\}$. The 1-deletion variants [Wang et al., 2009] of a vector $v$ are all the vectors obtained by substituting $v[i]$ with the deletion marker $\#$, plus $v$ itself. If we remove $v$ itself, the rest are called $v$'s strict 1-deletion variants, and are collectively denoted as Strict-1-Del-Var-Set($v$). The total number of 1-deletion variants is $1 + N$.

All the above different kinds of variants are referred to as variants generically.

Example 1 Consider $v = [1_1, 2_2, 1_3]$ and $\Sigma = \{1, 2, 3\}$. Its 1-variants are: $[1_1, 2_2, 1_3]$, $[2_1, 2_2, 1_3]$, $[3_1, 2_2, 1_3]$, $[1_1, 1_2, 1_3]$, $[1_1, 3_2, 1_3]$, $[1_1, 2_2, 2_3]$, $[1_1, 2_2, 3_3]$. Its strict 1-deletion variants are $[\#, 2_2, 1_3]$, $[1_1, \#, 1_3]$, $[1_1, 2_2, \#]$.
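The two kinds of variants can be generated mechanically from the definitions; the following C++ sketch does so over an integer alphabet {0, ..., sigma-1}, with -1 standing in for the deletion marker #.

    #include <cstddef>
    #include <vector>

    using Vector = std::vector<int>;
    constexpr int DELETION_MARK = -1;  // stands in for the '#' marker

    // All 1-variants of v: v itself plus every vector obtained by
    // substituting one position with a different symbol.
    std::vector<Vector> oneVariants(const Vector& v, int sigma) {
        std::vector<Vector> out{v};               // v is its own 1-variant
        for (std::size_t i = 0; i < v.size(); ++i)
            for (int c = 0; c < sigma; ++c)
                if (c != v[i]) {
                    Vector u = v;
                    u[i] = c;
                    out.push_back(u);             // total: 1 + (|Sigma|-1) * N
                }
        return out;
    }

    // Strict 1-deletion variants: substitute each position with the marker.
    std::vector<Vector> strictOneDeletionVariants(const Vector& v) {
        std::vector<Vector> out;
        for (std::size_t i = 0; i < v.size(); ++i) {
            Vector u = v;
            u[i] = DELETION_MARK;
            out.push_back(u);                     // total: N
        }
        return out;
    }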

3.4.2 1-Query Processing using Variants and Deletion Variants

3.4.2.1 1-Query Processing using 1-Variants

Lemma 4 Consider two vectors $S$ and $T$ such that $H(S,T) \le 1$. Then 1-Var-Set($S$) $\cap \{T\} \ne \emptyset$.

According to Lemma 4, we can use the following procedure to answer a Hamming distance query with threshold $k' = 1$.

• Indexing. We generate all the 1-variants for every vector in the database and index the variants using an inverted index $I$.

• Query Processing. We directly look up the query in the index. The returned results are exactly the query results.

The index space complexity of this method is O(|Σ| · N · n). The query time complexity is O(1 + occ), where occ denotes the number of query results.
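To make the O(1 + occ) lookup concrete, here is a minimal sketch of the 1-variant inverted index; the string serialization of variants used as hash keys is an implementation detail assumed here, not part of the method.

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    using Vector = std::vector<int>;

    // Serialize a vector to key a hash map; ',' keeps values unambiguous.
    static std::string key(const Vector& v) {
        std::string s;
        for (int x : v) { s += std::to_string(x); s += ','; }
        return s;
    }

    struct OneVariantIndex {
        std::unordered_map<std::string, std::vector<int>> inv;  // variant -> IDs

        // Index every 1-variant of v; space is O(|Sigma| * N * n) overall.
        void add(int id, const Vector& v, int sigma) {
            inv[key(v)].push_back(id);            // v is its own 1-variant
            for (std::size_t i = 0; i < v.size(); ++i)
                for (int c = 0; c < sigma; ++c)
                    if (c != v[i]) {
                        Vector u = v;
                        u[i] = c;
                        inv[key(u)].push_back(id);
                    }
        }

        // A single lookup answers the 1-query: O(1 + occ).
        const std::vector<int>* query(const Vector& q) const {
            auto it = inv.find(key(q));
            return it == inv.end() ? nullptr : &it->second;
        }
    };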

3.4.2.2 1-Query Processing using Strict 1-Deletion Variants

Many existing Hamming distance query processing methods assume a binary value domain, in which case 1-variant-based methods are almost always preferred to those based on 1-deletion variants (to be introduced below), as the former achieve $O(1)$ query time at the cost of merely doubling the index space. However, when $|\Sigma|$ is large (e.g., $|\Sigma|$ can be as large as 172 for vectors generated by MinHash [Broder et al., 1997]), 1-variant-based methods incur an excessive amount of space usage (and building time) for the index, which is neither practical nor competitive.

We now study the 1-deletion variants and their query processing method.

The following Lemma gives a necessary condition for two vectors to be within Hamming distance 1, based on the intersection of their strict 1-deletion variants.

Lemma 5 Consider two vectors $S$ and $T$ such that $H(S,T) \le 1$. Then Strict-1-Del-Var-Set($S$) $\cap$ Strict-1-Del-Var-Set($T$) $\ne \emptyset$.

Note that strict 1-deletion variants, rather than 1-deletion variants, are used in the above Lemma. This is simply because if S = T , they will have N common strict 1-deletion variants anyway.

According to Lemma 5, we can use the following procedure to answer a Hamming distance query with threshold $k' = 1$.

• Indexing. We generate all the strict 1-deletion variants for every vector in the database and index the variants using an inverted index $I$.

• Query Processing. We generate the strict 1-deletion variants of the query and look them up in the index. The returned results are merged to form the query results.

The index space complexity of this method is O(N · n). The query time complexity is O(N + N · occ).

Example 2 Continuing Example 1, we index all the strict 1-deletion variants of $v$ (and of the other vectors in the database). To process the query $Q = [1_1, 2_2, 3_3]$, we first generate all of $Q$'s strict 1-deletion variants: $[\#, 2_2, 3_3]$, $[1_1, \#, 3_3]$, and $[1_1, 2_2, \#]$; we then look them up in the inverted index and merge the returned results. $v$ will be found in the postings list of $I_{[1_1, 2_2, \#]}$.
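A matching sketch for the strict 1-deletion-variant index is shown below; the same serialized-key hash map as in the previous sketch is assumed, with -1 encoding the marker #.

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    using Vector = std::vector<int>;

    static std::string key(const Vector& v) {
        std::string s;
        for (int x : v) { s += std::to_string(x); s += ','; }
        return s;
    }

    struct DeletionVariantIndex {
        std::unordered_map<std::string, std::vector<int>> postings;

        // Index the N strict 1-deletion variants of v: O(N * n) space.
        void add(int id, const Vector& v) {
            for (std::size_t i = 0; i < v.size(); ++i) {
                Vector u = v;
                u[i] = -1;                        // deletion marker '#'
                postings[key(u)].push_back(id);
            }
        }

        // IDs of vectors within Hamming distance 1 of q (Lemma 5);
        // duplicates may appear and are merged by the caller.
        std::vector<int> query(const Vector& q) const {
            std::vector<int> result;
            for (std::size_t i = 0; i < q.size(); ++i) {
                Vector u = q;
                u[i] = -1;
                auto it = postings.find(key(u));
                if (it != postings.end())
                    result.insert(result.end(), it->second.begin(), it->second.end());
            }
            return result;
        }
    };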

Compared to using strict 1-deletion variants, using 1-variants gives a much lower query time ($O(1)$). However, because the number of 1-variants depends on $|\Sigma|$, when $|\Sigma|$ is large (e.g., for the sets generated by MinHash [Broder et al., 1997, Theobald et al., 2008]), the number of 1-variants generated from the database vectors will probably be too large to index ($O(|\Sigma| \cdot N \cdot n)$). Because the generation of strict 1-deletion variants does not depend on $|\Sigma|$, it still works decently when $|\Sigma|$ is large. Therefore, we propose to use 1-variants when $|\Sigma|$ is small and strict 1-deletion variants when $|\Sigma|$ is large.

3.5 The HmSearch Algorithm

In this section, based on the heuristics mentioned above, we present HmSearch, our proposed query processing method with advanced threshold-based pruning and a technique to perform pruning and verification simultaneously.

3.5.1 Partitioning

In our HmSearch method, as discussed in Chapter 3.3.2, we choose to partition the dimensions into $\kappa = \lfloor \frac{k+3}{2} \rfloor$ partitions and choose $k' = 1$. According to Lemma 1, any query result vector must have at least one matching partition, i.e., a partition within Hamming distance at most 1. However, we show below that this condition can be strengthened, which helps keep the candidate size low as $k$ increases.

Our enhanced filtering is based on the following property of the partitioning scheme. Let $k = 2c$, where $c$ is an integer. It can be shown that $\kappa = c + 1$, and based on Lemma 1, $m = 1$. We observe that if the first $c + 1$ errors are distributed evenly into the $c + 1$ partitions, there are only $c - 1$ errors left to place into the $c + 1$ partitions; in this case, at least two 1-matches exist. By carefully analysing this condition, we find that if the first match is not an exact match, there must exist at least two 1-matches. A similar observation can be made for $k = 2c + 1$. Therefore, we establish the following Lemma, which gives a tighter filtering condition.

Lemma 6 (Enhanced Filtering Condition) Consider processing the Hamming distance query for $Q$ with threshold $k$, where the dimensions have been divided into $\kappa = \lfloor (k+3)/2 \rfloor$ partitions. A query result $S$ must satisfy the following conditions:

• If $k$ is even, $S$ must have at least one exact-matching partition, or two 1-matching partitions.

• If $k$ is odd, $S$ must have at least two matching partitions of which at least one is an exact-match, or $S$ must have at least three 1-matching partitions.

Proof. When $k$ is even, let $k = 2c$; then $\kappa = c + 1$. Assume the contrary, i.e., there is no exact-matching partition and at most one 1-matching partition. Then one partition has at least 1 error and the remaining $c$ partitions have at least 2 errors each. The total number of errors is at least $2c + 1 = k + 1$, which contradicts the fact that $S$ is a query result.

When $k$ is odd, let $k = 2c + 1$; then $\kappa = c + 2$. Assume the contrary, i.e., there are at most two 1-matching partitions (and no exact-match), or only one exact-matching partition. In the former case, both matching partitions have at least 1 error each and the remaining $c$ partitions have at least 2 errors each, so the total number of errors is at least $2c + 2 = k + 1$. In the latter case, $c + 1$ partitions have at least 2 errors each, so the total number of errors is at least $2c + 2 = k + 1$. Both cases contradict the fact that $S$ is a query result.

This Lemma helps control the growth of the candidate size as $k$ increases. As shown in Figures 3.8(l) and 3.8(m), the reduction in candidate size can reach up to two orders of magnitude when comparing HSV-nEB versus HSV and HSD-nEB versus HSD.

3.5.1.1 Implementation based on 1-Variants

Consider HmSearch implemented in the general framework of Algorithm 1, where reducedHammingQuery uses strict 1-variants as signatures. Hence, we use the indexing and query processing methods described in Chapter 3.4.2.1 to implement reducedHammingQuery. The only subtlety is that we augment each signature with its partition ID, so that signatures from different partitions can share one index without interfering with each other.

Lemma 6 requires the ability to distinguish an exact-match from a 1-match. We achieve this by the following modification to the postings lists of the inverted index. The inverted index maps a signature $sig$ to $I_{sig}$, the list of vectors for which $sig$ is one of their 1-variants. We divide the vectors in the postings list into two parts: those that match $sig$ exactly, and those that have one error. We denote the former set as $I_{sig}[0]$ and the latter as $I_{sig}[1]$. This can be implemented by keeping an additional pointer at the beginning of the postings list which points to the starting entry of $I_{sig}[1]$, as shown in Figure 3.4. Therefore, if a candidate is returned from $I_{sig}[0]$, it is an exact-match; otherwise it is a 1-match. Finally, we check the number of matching partitions according to Lemma 6 in the function filter.
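One possible layout for such a split postings list is a single contiguous array of vector IDs plus the offset at which $I_{sig}[1]$ begins, as sketched below; the field names are illustrative assumptions.

    #include <cstddef>
    #include <vector>

    // Postings list for one signature: exact matches first, then 1-matches.
    // 'firstOneMatch' plays the role of the extra pointer in Figure 3.4.
    struct SplitPostings {
        std::vector<int> ids;        // I_sig[0] followed by I_sig[1]
        std::size_t firstOneMatch;   // offset where I_sig[1] starts

        // Visit each entry with its error count: 0 (exact) or 1 (1-match).
        template <typename Visit>
        void forEach(Visit&& visit) const {
            for (std::size_t i = 0; i < ids.size(); ++i)
                visit(ids[i], i < firstOneMatch ? 0 : 1);
        }
    };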

The complete listings of the algorithms are given in Algorithms 3 to 5.

Figure 3.4: Index for 1-Variants

Algorithm 3: oneHammingQuery1Var(q)
1  C ← ∅;
2  for each vector ID v in I_q[0] do
3      C ← C ∪ {(v, 0)};
4  for each vector ID v in I_q[1] do
5      C ← C ∪ {(v, 1)};
6  return C;

Algorithm 4: HmSearch-V(Q, k, κ)
/* generate candidates */
1  CAND ← empty hash table that maps a vector ID to a list of integers;
2  partition(Q, κ);
3  for each i-th partition Q^i of the query Q do
4      for each (v, err) ∈ oneHammingQuery1Var(Q^i) do
5          CAND[v].append(err);
/* filtering and then verification */
6  for each candidate v ∈ CAND do
7      if enhancedFilter(v, k) = false then
8          if HBVerify(Q, v) then   /* see Algorithm 7 */
9              output v;

Example 3 Consider $N = 4$, $k = 2$, $Q = [1_1, 1_2, 2_3, 2_4]$, and the following data vectors:

$v_1: [1_1, 1_2, 1_3, 1_4]$

$v_2: [1_1, 2_2, 1_3, 1_4]$

Algorithm 5: enhancedFilter(v, k)
Output: Returns true if v is filtered (i.e., disqualified)
1  errors ← CAND[v];   /* the list of per-partition error counts */
2  if k is even then
3      if errors has fewer than two entries then
4          if errors[0] = 1 then
5              return true;
6  else
7      if errors has fewer than three entries then
8          if errors has only one entry, or (errors[0] = 1 and errors[1] = 1) then
9              return true;
10 return false;
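For concreteness, the same filter can be written as a small C++ function; the errors list corresponds to CAND[v] above and is assumed non-empty, since a vector enters CAND only after at least one match.

    #include <vector>

    // Enhanced filter (Algorithm 5 / Lemma 6): returns true if the
    // candidate with per-partition error list 'errors' is disqualified.
    bool enhancedFilter(const std::vector<int>& errors, int k) {
        if (k % 2 == 0) {
            // even k: need one exact-match or two 1-matches
            if (errors.size() < 2 && errors[0] == 1) return true;
        } else {
            // odd k: need two matches including an exact one, or three 1-matches
            if (errors.size() < 3 &&
                (errors.size() == 1 || (errors[0] == 1 && errors[1] == 1)))
                return true;
        }
        return false;
    }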

Assume the domain is $\{1, 2\}$. Then $\kappa = \lfloor \frac{2+3}{2} \rfloor = 2$; the first partition consists of the first two dimensions and the remaining two dimensions form the second partition.

The variants and the index built for them are shown in Figure 3.4.

At the beginning, CAND and C are initialised to empty. The query is partitioned into $[1_1, 1_2]$ and $[2_3, 2_4]$. Since $[1_1, 1_2]$ is in the index, all its postings are fetched. $v_1$ is in $I[0]$, which means $Q$ and $v_1$ have an exact match on $[1_1, 1_2]$; we denote this matching as $(v_1, 0)$ and add it to C. Next, $v_2$ is in $I[1]$, which means $Q$ and $v_2$ match with one error on this partition, so this matching is recorded as $(v_2, 1)$ and added to C as well. Then, in CAND, the matching conditions are attached to each vector in C (e.g., errors[0] denotes the number of errors the first matching incurs). In this case, we have $v_1$.errors[0] = 0 and $v_2$.errors[0] = 1. For $v_1$, $|v_1.errors| = 1 < 2$, so it has one matching with the query; since $v_1$.errors[0] = 0, this is an exact match, so $v_1$ cannot be filtered and is sent to verification. Next $v_2$ is processed. Because $|v_2.errors| = 1 < 2$, it also has one matching with the query; however, $v_2$.errors[0] = 1 means this is not an exact match, so $v_2$ is pruned. Finally, since $[2_3, 2_4]$ has no match in the index, the process ends here.

3.5.1.2 Implementation based on Strict 1-deletion Variants

The only major difference to the previous section is the method to distinguish be- tween exact-match and 1-match. For the former case, we know the number of com- mon strict 1-deletion variants in a partition p is exactly |p|, i.e., the number of di- mensions in p. For the latter case, we know the number is exactly 1. So we only need to test if the number is greater than 1 to tell these two cases apart (See Algorithm 6).

Hence we can replace oneHammingQuery1Var by oneHammingQuery1DelVar at line

4 of Algorithm 4 to implement HmSearch based on strict 1-deletion variants.

Algorithm 6: oneHammingQuery1DelVar(q)
1  C ← empty hash table that maps a vector ID to an integer;
2  for each strict 1-deletion variant δ(q) of q do
3      for each vector ID v in I_{δ(q)} do
4          C[v] ← C[v] + 1;
5  C′ ← empty list;
6  for each key v in C do
7      if C[v] > 1 then   /* |p| common variants: an exact-match */
8          C′ ← C′ ∪ {(v, 0)};
9      else
10         C′ ← C′ ∪ {(v, 1)};
11 return C′;

3.5.2 Hierarchical Binary Filtering and Verification

Another improvement in our HmSearch method is a new algorithm, HBVerify, that performs additional filtering and verification simultaneously; it is highly optimised by exploiting a vertical layout and bit-parallelism.

Let $d = \lceil \log_2 |\Sigma| \rceil$. We can represent each dimension value of a vector using $d$ bits. For a vector $v$, we store its dimension values in binary in a vertical format, i.e., using $N$ bits to store the most significant bits of all the $N$ dimension values, another $N$ bits for the second most significant bits, and so on. We use the notation $v^{(i)}$ to denote the array consisting of the $i$-th most significant bits of the dimension values of vector $v$.

We can derive a filtering condition as follows:

Lemma 7 If $H(Q, v) \le k$, then $H(Q^{(i)}, v^{(i)}) \le k$, $\forall i$.

Proof. Let $H(Q, v) \le k$. Assume the contrary, i.e., $H(Q^{(i)}, v^{(i)}) \ge k + 1$ for some $i$, which means $v^{(i)}$ differs from $Q^{(i)}$ in at least $k + 1$ bits. Since each bit in $v^{(i)}$ belongs to a distinct dimension value of $v$, $v$ differs from $Q$ in at least $k + 1$ dimension values, which contradicts $H(Q, v) \le k$.

This filter can be implemented efficiently using bit-level operations exploiting the bit-parallelism offered by CPUs.

• XOR: Perform a bitwise XOR between $v^{(i)}$ and $Q^{(i)}$ to obtain a bitmap $A$. This requires only $\lceil N/w \rceil$ instructions, where $w$ is the machine word size in bits.

• BitCount: Count the number of set bits (i.e., 1s) in $A$. This can be done using $\lceil 12N/w \rceil$ machine instructions based on the trick at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel.

Therefore, the filtering can be performed efficiently, exploiting the bit-parallelism.

We can further strengthen the above filter by invoking it in an accumulative fashion over our binary representation (see Algorithm 7). In each iteration, we reuse the XOR'ed bitmap accumulated in the previous steps (stored in cumdiff): we perform the XOR for the current level of bits, merge in the bits that were already different in earlier iterations using a bitwise OR, and bit-count the resulting bitmap. The resulting count is the total number of dimensions on which the two vectors have differed so far, so the filtering power is much better than applying the filter on the current iteration alone. We iterate from the least significant to the most significant bits, to maximise the probability of filtering early (Line 3).

Another benefit of this scheme is that after iterating over all $d$ levels, the final bit count is exactly the Hamming distance between the two vectors, so no separate verification stage is needed.

Vertical binary layout of v = [5, 0, 3, 6] and Q = [5, 2, 3, 5]:

significant bit     v       Q       diff    cumdiff
3rd (least)         1010  ⊕ 1011  = 0001    0001
2nd                 0011  ⊕ 0110  = 0101    0101
1st (most)          1001    1001            (not reached)

Figure 3.5: Example of Hierarchical Binary Filtering and Verification

Algorithm 7: HBVerify(Q, v)
1  maxlevel ← ⌈log₂(|Σ|)⌉;
2  cumdiff ← ⌈N/w⌉ machine words filled with 0x0;   /* w is the machine word size in bits */
3  for i = maxlevel downto 1 do
4      errs ← 0;
5      for j = 1 to ⌈N/w⌉ do
6          diff ← Q^{(i)}[j] ⊕ v^{(i)}[j];   /* XOR for diffs */
7          cumdiff[j] ← cumdiff[j] ∨ diff;   /* OR */
8          errs ← errs + popcount(cumdiff[j]);   /* count set bits */
9      if errs > k then
10         return false;
11 output (v, errs);
12 return true;

Example 4 Consider the vector $v = [5, 0, 3, 6]$ and the query $Q = [5, 2, 3, 5]$ in vertical binary representation in Figure 3.5. Let $N = 4$, $k = 1$, $|\Sigma| = 8$, and $w = 4$. We first filter-and-verify the 3rd most significant bits of $Q$ and $v$: there is 1 mismatch between 1010 and 1011, so the cumulative difference bitmap cumdiff in Algorithm 7 is 0001. After bit counting, the total number of errors is 1, which is no larger than $k = 1$, so we move on to the 2nd most significant bits of $Q$ and $v$: diff = 0011 ⊕ 0110 = 0101. cumdiff is then OR'ed with diff, producing cumdiff = 0101, which has 2 bits set. This means $H(Q, v) \ge 2$, and hence we can prune $v$ immediately.
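Below is a compact C++ rendering of Algorithm 7, assuming w = 64 and that each vector's vertical layout is stored as bits[level][word] with level 0 holding the least significant bits; these layout choices, and the use of GCC's __builtin_popcountll, are ours rather than fixed by the algorithm.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Vertical layout: bits[level][word], level 0 = least significant bits.
    using VerticalBits = std::vector<std::vector<std::uint64_t>>;

    // Hierarchical binary filtering and verification (Algorithm 7).
    // Returns true iff H(Q, v) <= k; processes least significant bits first.
    bool hbVerify(const VerticalBits& q, const VerticalBits& v, int k) {
        const std::size_t levels = q.size();
        const std::size_t words = q[0].size();
        std::vector<std::uint64_t> cumdiff(words, 0);
        for (std::size_t lvl = 0; lvl < levels; ++lvl) {
            int errs = 0;
            for (std::size_t j = 0; j < words; ++j) {
                cumdiff[j] |= q[lvl][j] ^ v[lvl][j];       // accumulate diffs
                errs += __builtin_popcountll(cumdiff[j]);  // dims differing so far
            }
            if (errs > k) return false;                    // prune early
        }
        return true;  // the final count is the exact Hamming distance, <= k
    }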

3.6 Partition Strategies

In this section, we first introduce the prevalent equal-length partition strategy for partition-based methods and present its weakness under certain conditions. Then we describe our proposed partition strategy, which not only overcomes these drawbacks but also improves performance in most cases.

3.6.1 Equal Length Partition and its Drawback

Traditionally, almost all partition-based methods use an arbitrary equal-length partitioning strategy [Manku et al., 2007, Liu et al., 2011, Norouzi et al., 2012]. The heuristic is simple: keep the lengths of the partitions as even as possible.

• Partition the dimensions into $\kappa$ partitions evenly, such that each partition has length either $\lfloor N/\kappa \rfloor$ or $\lceil N/\kappa \rceil$. We can always make the last $N - \lfloor N/\kappa \rfloor \cdot \kappa$ partitions the longer ones.

• For each partition $p_i$, generate its signatures (either 1-variants or 1-deletion variants), each of which is a pair of the partition ID and the variant, i.e., $(i, v_{ij})$.

This method is simple and works well in some cases. However, when the data is skewed, this arbitrary partitioning strategy has severe drawbacks that can hurt performance significantly. Consider a partition of $l$ dimensions over $n$ vectors. Data skew exists if one of the $|\Sigma|^l$ possible values occurs very frequently (e.g., close to $n$ times). If the corresponding partition value of the query is exactly this frequently occurring value, then the majority of the vectors become candidates. Such a large number of candidates makes the algorithm degenerate into a brute-force linear scan.

Example 5 Consider the example dataset in Figure 3.6(a). $N = 6$ and $k = 1$, so $\kappa = 2$. Since every vector's second partition is within Hamming distance 1 of the query's second partition, all of the vectors become candidates. However, if we permute the dimensions before partitioning, as in Figure 3.6(b), the only candidate is $v_1$.

         (a)                              (b)
     Partition1  Partition2          Partition1  Partition2
Dim  1  2  3     4  5  6        Dim  1  2  5     4  3  6
Q    1  1  1     1  0  0        Q    1  1  0     1  1  0
v1   1  1  1     0  0  0        v1   1  1  0     0  1  0
v2   0  0  0     2  0  0        v2   0  0  0     2  0  0
v3   2  0  2     0  0  0        v3   2  0  0     0  2  0
v4   3  0  0     0  0  0        v4   3  0  0     0  0  0

Figure 3.6: Impact of Data Skew and Benefit of Dimension Rearrangement

3.6.2 Dimension Rearrangement

As obtaining the optimal dimension rearrangement is likely to be a computationally hard problem, we resort to a bottom-up, greedy algorithm to find a reasonably good rearrangement for a specific $\kappa$. Assuming that we have a way to measure the quality of a partition, the general idea of the algorithm is:

• Initially, we form $\kappa$ partitions, each consisting of one of the “worst” single dimensions (in terms of quality) among the remaining dimensions.

• In each of the following $N - \kappa$ rounds, we choose the worst partition and add to it the remaining dimension that gives the resulting partition the best possible quality.

Now we consider how to define the “quality” of a partition. Consider a partition $D$ consisting of $l$ dimensions. Since we do not know the query's value on this partition a priori, we choose to minimise the maximum frequency of any value occurring in these dimensions, i.e.,
$$\mathrm{MaxFreq}(D) \stackrel{\mathrm{def}}{=} \max_{x \in \Sigma^l} \left| \{ v_i \in DB \mid v_i[D] = x \} \right|.$$

Minimising MaxFreq also contributes to minimising the worst-case candidate size for a query. Let $D \circ \{D_j\}$ denote the partition formed by adding dimension $D_j$ to $D$. We choose the dimension $D_j$ that yields the smallest $\mathrm{MaxFreq}(D \circ \{D_j\})$.
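The quality measure can be computed in one pass over the (possibly sampled) dataset, as in the sketch below; the column-major layout data[d][i] is an assumption made here for convenience.

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <vector>

    // data[d][i] = value of dimension d in the i-th vector (column-major).
    // Returns MaxFreq(D): the highest frequency of any combined value over
    // the dimensions in D.
    std::size_t maxFreq(const std::vector<std::vector<int>>& data,
                        const std::vector<std::size_t>& D) {
        if (D.empty() || data.empty()) return 0;
        const std::size_t n = data[0].size();
        std::map<std::vector<int>, std::size_t> freq;
        std::size_t best = 0;
        for (std::size_t i = 0; i < n; ++i) {
            std::vector<int> projection;           // v_i[D]
            for (std::size_t d : D) projection.push_back(data[d][i]);
            best = std::max(best, ++freq[projection]);
        }
        return best;
    }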

Algorithm 8: Reorder(Dim, N, κ)
1  P ← κ empty buckets;
2  Dim ← the set of dimensions 0 to N − 1;
3  for any X ∈ P ∪ Dim, X.frq denotes the maximum value frequency in X;
4  Frq(X, Y) denotes (X, Y).frq, where (X, Y) means putting X and Y together;
5  forall buckets P_i ∈ P do
6      dim_t ← GetIthLargestFrqDim(Dim, i);
7      P_i ← { dim_t };
8      Dim ← Dim − { dim_t };
9  N ← N − κ;
10 while N > 0 do
11     P_max ← GetMaxFrqBucket(P);
12     tempdim ← arg min_{dim_i ∈ Dim} Frq(P_max, dim_i);
13     P_max ← P_max ∪ { tempdim };
14     Dim ← Dim − { tempdim };
15     N ← N − 1;

The complexity of this greedy algorithm is $O(N^3 \cdot n)$. While it could have a long running time when $N$ is large, it only needs to be run once for a fixed dataset; to further reduce its running time for large $N$, we run it on a sample of the dataset. Finally, the effort spent on dimension rearrangement is worthwhile, as it ... in our experiment (see Chapter 3.8.6).

Example 6 We illustrate the process of running the dimension rearrangement algorithm on the dataset of Example 5 for $\kappa = 2$ in Figure 3.7. Initially, the MaxFreqs of the single dimensions are computed, and we pick the worst two to start the partitions. Then we consider the best dimension to add to partition 1 (currently only $D_5$) such that the resulting MaxFreq is minimised; we find $D_1$, which results in a MaxFreq of 1 for the new partition $\{D_5, D_1\}$. The process runs until all remaining dimensions have been distributed to one of the partitions.

Figure 3.7: Dimension Rearrangement Example (the intermediate MaxFreq values and partition contents at each step of the greedy algorithm)

3.7 Hybrid Techniques for LSH Data

LSH is a prevailing technique for approximate similarity queries. In this section, we first show that our HmSearch technique can be used for LSH-based methods. Then we discuss several key issues in generating appropriate LSH data. Finally, we present an optimisation of our HmSearch method for LSH data.

3.7.1 Hamming Distance Query in C2LSH

LSH [Indyk and Motwani, 1998] is a widely used technique for performing approximate similarity or distance queries, with instances for many important problems. For a concrete example, MinHash [Broder, 1997] is an instance of LSH for finding sets that approximately satisfy a given Jaccard similarity threshold $t$. The LSH functions $h_i$ ($1 \le i \le k \cdot l$) are drawn from a family of functions with the property that $\Pr[h_i(X) = h_i(Y)] = \mathrm{Jaccard}(X, Y)$, where Jaccard is the Jaccard similarity function. Traditionally, LSH combines $k$ signatures into a super-signature and indexes these super-signatures; to maintain a high recall, this process is repeated $l$ times.

Recently, C2LSH [Gan et al., 2012] showed that we can take as candidates the objects that share at least $M$ signatures out of the total $N$ signatures, and that this scheme maintains the same rigorous accuracy guarantees as the original LSH. A similar idea is also used in approximate verification of candidates returned by LSH methods [Satuluri and Parthasarathy, 2012].

In C2LSH, given that $N$ signatures are extracted for each object to form a set, the core task of query processing is to find those sets whose Hamming distance from the query set is no larger than $k$. Using HmSearch, this process can be significantly accelerated. In this thesis, we consider the following three prevailing LSH functions:

• SimHash [Charikar, 2002] is a common technique that converts documents into vectors using the TF-IDF representation. It measures the similarity of documents by approximating the cosine similarity of the corresponding vectors.

• MinHash [Broder, 1997] is a prevailing technique for quickly estimating the similarity between two sets. The basic idea is to approximate the widely used Jaccard similarity using the minimum values of the hash values.

• P-Stable [Datar et al., 2004] is a prevalent technique for the nearest neighbour problem. It approximates the $l_p$ norm using $p$-stable distributions (in practice $p$ can only be set to 1 or 2).

3.7.2 Hybrid Algorithm

Linear scan is inevitable, and it outperforms index-based solutions when $k$ is sufficiently large. Below we show that, for LSH-based data, we can build an accurate cost model to predict the relative cost of index-based query processing versus linear-scan-based query processing, thanks to the independence of the dimension values, which are generated by independent LSH functions. We then propose a hybrid algorithm that uses this model to switch between the two query processing strategies to achieve the best performance across a wide range of $k$ settings.

3.7.2.1 Cost Model

Let $n$ be the total number of sets in the database. We denote by $n_0$ the sum of the lengths of the postings lists of all the signatures generated from the query; this quantity can be obtained from the inverted index without actually accessing the postings lists. Let $n_1$ be the size of CAND after duplicate elimination among the $n_0$ set IDs.

Let $c_{process}$ be the average time cost to process one set in CAND, and $c_{verify}$ the average time cost to verify one set. Then the cost of index-based filtering followed by hierarchical binary verification is $n_0 \cdot c_{process} + n_1 \cdot c_{verify}$, while the cost of direct hierarchical binary verification of all the sets in the database is $n \cdot c_{verify}$. Therefore, it is easy to see that if
$$\frac{n_0}{n - n_1} \le \frac{c_{verify}}{c_{process}}, \qquad (3.1)$$
the index-based solution will be faster.

Consider the index-based approach using 1-variants. Given $N$ and $k$, we divide the query set into $\kappa$ parts. We can obtain the length $l_i$ of the postings list for each partition, with $\sum_{i=1}^{\kappa} l_i = n_0$. Now we need to estimate $n_1$, the size of the union of the set IDs contained in the $\kappa$ postings lists. We assume the postings list associated with each signature is random, so for any set $s$ in the database, the probability that it appears in the $i$-th postings list is estimated as $\frac{l_i}{n}$. Since the dimension values of any set are created by independent LSH functions, we also assume independence across lists. The probability of set $s$ appearing in at least one such list is thus $p = 1 - \prod_{i=1}^{\kappa} (1 - \frac{l_i}{n})$, and the expected number of such sets, $p \cdot n$, is exactly our estimate of $n_1$.

Therefore $n - n_1 = n \cdot \prod_{i=1}^{\kappa} (1 - \frac{l_i}{n})$. In the decision rule of Equation (3.1), we can assume the right-hand side is a constant whose value depends on the machine configuration, the implementation, $D_{max}$, and $N$. The first two factors are fixed for a given system, and it is easy to see that the value grows linearly in $\log_2(D_{max})$ and $N$, due to the physical layout of the dataset in the verification process. Therefore, we model
$$\frac{c_{verify}}{c_{process}} = \alpha \cdot \log_2(D_{max}) \cdot N. \qquad (3.2)$$

The value of $\alpha$ can be estimated by running a sample workload of queries and keeping track of the total CAND sizes and the total number of verifications, from which we use the average values to obtain $c_{process}$ and $c_{verify}$.

For the case using strict 1-deletion variants, we only need to calculate, for each partition, the number of vectors that have at most one error, denoted $I_i$, and then follow the same estimation process as for 1-variants. To calculate this number, we generate 1-deletion variants rather than strict 1-deletion variants: we record not only the length $l_{ij}$ of the postings list of each strict 1-deletion variant of partition $i$, but also the length $I(exact)_i$ of the postings list of the original (undeleted) partition and the length $l_i$ of partition $i$. Then, by inclusion-exclusion, the number of vectors with at most one error against the query partition is
$$I_i = \sum_{j=1}^{l_i} l_{ij} - (l_i - 1) \cdot I(exact)_i.$$
Based on the above derivation, HmSearch employs the hybrid algorithm when dealing with LSH data, using Equation (3.1) to switch between the index-based approach and the linear-scan-based approach.
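The switching rule itself is only a few lines; the sketch below assumes the per-partition postings-list lengths l_i are already collected from the index and that alpha has been calibrated offline as described above.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Decide between the index-based strategy (true) and a linear scan
    // (false), following Equations (3.1) and (3.2).
    // listLens: postings-list length l_i per query partition; n: database
    // size; alpha: calibrated constant; dmax: |Sigma|; N: dimensionality.
    bool useIndex(const std::vector<std::uint64_t>& listLens,
                  std::uint64_t n, double alpha, int dmax, int N) {
        double n0 = 0.0, survive = 1.0;
        for (std::uint64_t li : listLens) {
            n0 += static_cast<double>(li);
            survive *= 1.0 - static_cast<double>(li) / n;  // Pr[s misses list i]
        }
        double nMinusN1 = n * survive;                     // expected n - n1
        double rhs = alpha * std::log2(static_cast<double>(dmax)) * N;  // (3.2)
        return n0 <= rhs * nMinusN1;                       // Equation (3.1)
    }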

3.8 Experiments

In this section, we report the findings of our extensive experimental study. We first compare the performance of our proposed algorithms with three state-of-the-art methods for Hamming distance queries. Then we evaluate our dimension rearrangement method to show the resulting performance improvement. Finally, we analyse the scalability and index sizes of our methods.

3.8.1 Experiment Setup

The following algorithms are used in the experiment.

• HSD and HSV are our proposed algorithms: HSD generates 1-deletion variants as signatures, while HSV generates 1-variants. Both employ all three techniques we proposed: the Enhanced Filter (EF), the Hierarchical Binary Filter (HB), and Rearranged Dimensions (RD). HSD-nEB and HSV-nEB are variations that remove both the EF and HB techniques; HSD-nB and HSV-nB remove only HB; HSD-nR and HSV-nR remove only RD.

• ScanCount [Li et al., 2008] is an index-merging method that scans through the postings list of each element of the query and counts the occurrences of the data IDs. We use it as a baseline. Note that when handling a Hamming distance constraint with ScanCount, the 0s in each vector must also be indexed and processed to guarantee correctness.

• Google [Manku et al., 2007] is one of the state-of-the-art Hamming distance query algorithms, specifically designed for detecting near-duplicate documents at Web scale. The method is based on partitioning and exact matching. We also implemented Google-R, a variation of Google that integrates our Rearranged Dimensions (RD) technique.

• HEngine [Liu et al., 2011] is a recently proposed Hamming distance query processing method, based on partitioning and reducing the $k$-query to 1-queries.

In our experiments, we select four publicly available real datasets. They cover a wide range of data distributions and application domains.

• Audio is extracted from the DARPA TIMIT collection¹. It contains 54,387 192-dimensional feature vectors. We use p-stable LSH [Datar et al., 2004] to convert each feature vector into a 64-dimensional integer vector.

• TREC is extracted from the TREC-9 Filtering Track Collections². Each string is a reference from the MEDLINE database with author, title, and abstract information. We apply SimHash [Manku et al., 2007] to convert each string into a 64-dimensional binary vector.

¹http://www.cs.princeton.edu/cass/audio.tar.gz
²http://trec.nist.gov/data/t9_filtering.html

• ENRON is extracted from the Enron email collection³. We extract and concatenate the email title and body, and employ MinHash [Broder et al., 1997] to convert each string into a 64-dimensional integer vector. As MinHash selects a token of the string as each signature, the $|\Sigma|$ of ENRON is large.

• PubChem is a database of chemical molecules⁴. We sample 1 million entries; each entry contains a fingerprint, which is an 881-dimensional binary vector.

Statistics about the datasets are listed in Table 3.1.

The experiments on the Audio, TREC, and ENRON data were carried out on a PC with an Intel(R) Xeon(R) X3330 2.66GHz CPU and 4GB RAM running Debian 5.0.6. The experiments on the PubChem data were carried out on a PC with a Quad-Core AMD Opteron(tm) 8378 2.4GHz CPU and 96GB RAM running Ubuntu/Linaro 4.6.3-1ubuntu5. All algorithms were implemented in C/C++ and compiled using GCC 4.4.5 with the -O3 flag, and all run in in-memory mode.

We measured the query time and the candidate size. By query time, we mean the average elapsed time (in milliseconds) for a query. Due to the wide range of values, the y-axes of most figures on running time are plotted in logarithmic scale. The candidate size is the average number of data vectors sent to the final verification.

3.8.2 Hamming Similarity Query Performance

To test the query processing time of all algorithms on the four datasets, we randomly sample 1,000 vectors from each dataset as queries. We measure the query time and show the results of the five algorithms in Figures 3.8(a)–3.8(d). For the Audio, TREC, and ENRON datasets, the Hamming distance threshold varies from 1 to 31 (nearly a 50% error rate); for the PubChem dataset, it varies from 1 to 81 (nearly a 10% error rate).

³http://www.cs.cmu.edu/~enron/
⁴http://pubchem.ncbi.nlm.nih.gov/

Table 3.1: Statistics of Datasets

Data      n          N    Generation Function       |Σ|
Audio     54,387     64   2-stable LSH              16
TREC      239,580    64   SimHash                   2
ENRON     95,997     64   MinHash                   172
PubChem   1,000,000  881  chemical fingerprinting   2

We observe that

• The query performance on Audio, TREC, and PubChem exhibits the following patterns.

– The fastest algorithm is HSV at all Hamming distance thresholds.

– For small thresholds (less than 7), Google is better than HSD. On the other hand, once the threshold grows beyond 7, HSD outperforms Google by up to two orders of magnitude.

– When the Hamming distance threshold is 1, HSV and Google have similar performance, as both methods use highly selective signatures. As the threshold increases, the performance of Google deteriorates faster than HSV's. The reason is that HSV's partitions are nearly twice as long as Google's, so HSV generates much more selective signatures, which results in better performance.

– The slowest algorithm is always ScanCount, and it is insensitive to the Hamming distance threshold. This is because ScanCount always naïvely goes through all the postings lists for each dimension value of the query and counts the occurrences of each vector encountered.

• ENRON has a large alphabet size, hence HSV becomes inapplicable. We compare HSD with the other algorithms in Figure 3.8(c). The trends are:

– HSD has competitive performance from middle (10) to large (31) Hamming distance thresholds.

– When the threshold is low (e.g., up to 7), Google outperforms HSD in most cases. This is because at low thresholds both Google and HSD generate highly selective signatures, so the advantage of HSD's longer signatures is not pronounced; meanwhile, the overhead of HSD enumerating the 1-deletion variants of the query slows its queries down.

– HEngine has substantially worse performance on ENRON, because it needs to generate a large number of the query's 1-variants and probe them against the index; this cost is proportional to the alphabet size $|\Sigma|$.

• The overall trend for HSV, HSD, and Google is that the query time increases with the Hamming distance threshold. This is expected, as a larger threshold leads to more candidates and eventually more results, which increases the computation time.

3.8.3 Candidate Size Analysis

We measure the candidate sizes of four algorithms on the four datasets and show the results in Figures 3.8(e)–3.8(h).

We observe that

• Except for ScanCount, the candidate sizes of the algorithms increase with the Hamming distance threshold. Google has a larger candidate size than HSV and HEngine; the reason is that Google's partition length is about half that of HSV, HSD, and HEngine.

• As the Hamming distance threshold increases, the candidate sizes of HSV and HSD grow much more slowly than those of the other algorithms, thanks to the enhanced filtering.

• The candidate sizes of all partitioning-based methods eventually reach $n$ (i.e., all data vectors become candidates) when the threshold is sufficiently large. Once an algorithm's candidate size reaches $n$, it is better to use a brute-force verification-only method, so this is the maximum error threshold for which the algorithm is useful. For Google, this happens around threshold 25 on Audio and ENRON, and at thresholds 10 and 17 on TREC and PubChem, respectively. This phenomenon occurs much later for HSV and HSD than for Google and HEngine.

3.8.4 Query Time Fluctuation

The overall trend is that the query time increases with the Hamming distance threshold. However, as shown in Figure 3.8(i), on the micro scale the query time of our HSV may fluctuate; for example, the query time at $k = 26$ is slightly higher than at $k = 27$. This phenomenon is caused by the enhanced filtering of Lemma 6. When the Hamming distance threshold is even, under certain conditions two variant matches are required to pass the filtering condition; when the threshold increases by one, the filtering condition may be strengthened to require three matches. Although increasing the threshold shortens the partitions and hence reduces their selectivity, the stronger pruning condition may reduce the candidate size substantially and thus improve the overall performance.

Figure 3.8: Experiment Results - I. (a) Audio, Query Time (N = 64); (b) TREC, Query Time (N = 64); (c) ENRON, Query Time (N = 64); (d) PubChem, Query Time (N = 881); (e) Audio, Candidate Size (N = 64); (f) TREC, Candidate Size (N = 64); (g) ENRON, Candidate Size (N = 64); (h) PubChem, Candidate Size (N = 881); (i) Audio, Query Time Fluctuation.

Figure 3.8 (continued): Experiment Results - I. (j) Effect of EF and HB, Audio, Time; (k) Effect of EF and HB, ENRON, Time; (l) Effect of EF and HB, Audio, Candidate; (m) Effect of EF and HB, ENRON, Candidate; (n) Effect of Reordering, Audio, Time; (o) Effect of Reordering, TREC, Time; (p) Effect of Reordering, ENRON, Time; (q) Effect of Reordering, PubChem, Time; (r) Scalability, TREC, Query Time.

3.8.5 Effect of Enhanced Filter and Hierarchical Binary Verification

We present the query time and candidate size of our algorithms to exhibit the effects of the Enhanced Filter and Hierarchical Binary Verification. HSV denotes the algorithm with both techniques, HSV-nB the algorithm with only the Enhanced Filter, and HSV-nEB the algorithm with neither. The query times are shown in Figures 3.8(j) and 3.8(k), and the corresponding candidate sizes in Figures 3.8(l) and 3.8(m).

Comparing HSV-nEB with HSV-nB, the Enhanced Filter contributes significantly to the performance when the threshold is in the middle range: for example, for thresholds between 13 and 28, HSV-nB is almost one order of magnitude faster than HSV-nEB. The reason is that at these thresholds the selectivity of the variants is low, so requiring two or three matching partitions improves performance dramatically. The same trend appears in Figures 3.8(l) and 3.8(m). Notice that the candidate size reduction is larger than the running time reduction: for example, at threshold 16 on the Audio data, a nearly 90% reduction in candidate size yields a nearly 70% reduction in running time, mainly due to the overhead of performing the filtering itself. Another observation is that when the threshold is very small, the improvement due to the Enhanced Filter is insignificant; for example, on ENRON at threshold 1, HSV-nEB and HSV-nB have almost the same performance, because the variants of HSV-nEB are already long enough to be highly selective.

By comparing HSV-nB (HSD-nB) with HSV (HSD), we can evaluate the effectiveness of Hierarchical Binary Verification. The improvement is noticeable on both datasets, especially when the Hamming distance threshold is not small. Generally speaking, the gap between traditional verification and Hierarchical Binary Verification widens as the Hamming distance threshold increases; for example, at threshold 22 on the Audio data, the improvement over traditional verification is more than 4 times.

3.8.6 Effect of Rearranging Dimensions

We study the effectiveness of rearranging dimensions in Figures 3.8(n)–3.8(q). Note that Google-R is Google with the dimension rearrangement technique, and HSV (HSV-nR) and HSD (HSD-nR) are our algorithms with and without dimension rearrangement, respectively. The following observations can be made:

• Dimension rearrangement boosts performance in most cases, especially on the PubChem data (by up to two orders of magnitude). The reason is that each PubChem dimension corresponds to a manually defined feature of chemical molecules, so the dataset contains plenty of skew. In addition, it is not uncommon for several skewed dimensions to be consecutive and thus reside in the same partition under existing methods, in which case the majority of the dataset may be retrieved as candidates.

• For the Audio and TREC datasets, the effect of dimension rearrangement is noticeable but not significant (see Figures 3.8(n) and 3.8(o)). The dimensions of these datasets are generated by independent LSH functions, so there is much less data skew, and the improvement is not as remarkable as on PubChem.

• Although our dimension rearrangement method helps in most cases, it does not always deliver better performance. For example, on the ENRON dataset with thresholds between 5 and 8, Google performs better than Google-R (see Figure 3.8(p)). In such cases the variants already have very good selectivities, and our greedy algorithm cannot guarantee a globally optimal rearrangement.

3.8.7 Scalability

We study the scalability of the algorithms by varying the dataset size, randomly sampling 20% to 100% of the data vectors from TREC, with the Hamming distance threshold fixed at 7. Figure 3.8(r) shows the query time ratio, defined as the query time on the current dataset over the query time on the 20% sample.

The general trend is that the query time of all four algorithms grows with the dataset size. ScanCount exhibits the slowest growth rate, followed by HSV: when the dataset size increases from 20% to 100%, the query time increases by 4.5 times for ScanCount, 5.9 times for HSV, 8.2 times for Google, and 11 times for HEngine. This showcases the better scalability of HSV and ScanCount with respect to dataset size compared to Google and HEngine.

3.8.8 Index Size

Figures 3.9(b)–3.9(d) show the index sizes of the algorithms on three of the datasets⁵ at different Hamming distance thresholds. In general, our HSV has a large index size. For the TREC data, when the threshold is small ($k = 1$) the index size is even larger than when the threshold is large ($k > 25$): for small $k$, HSV generates a large number of unique signatures, each of which requires two pointers, and the total size of the pointers is huge, contributing to the large index size.

⁵The results on the Audio dataset are similar to those on ENRON.

Figure 3.9: Experiment Results - II. (a) Audio, Index Size; (b) TREC, Index Size; (c) ENRON, Index Size; (d) PubChem, Index Size.

HSD and ScanCount have competitive space usage on the ENRON dataset, yet on the TREC and PubChem datasets they consume relatively more index space. Note that in some cases (e.g., $k \ge 25$ on ENRON) HSD has the smallest index size among all the methods. Generally speaking, Google has a relatively small index size in most cases. HEngine's index size increases linearly with the threshold and is usually larger than HSD's.

3.9 Discussion

In this section, we first compare our HmSearch with some state-of-the-art algorithms. Then we illustrate several innovative yet incomplete ideas.

3.9.1 Complexity Analysis

We list the time and space complexities of previous methods and of our methods in Table 3.2.

Table 3.2: Complexities of Empirical Hamming Distance Query Methods

Algorithm                              Query Time                                            Index Size
[Manku et al., 2007] (1-level part.)   $(k+1) \cdot f(\frac{N}{k+1}) + vc_1$                 $(k+1) \cdot n$
[Manku et al., 2007] (2-level part.)   $(k+1)^2 \cdot f(\frac{(2k+1)N}{(k+1)^2}) + vc_2$     $(k+1)^2 \cdot n$
[Liu et al., 2011]                     $N \cdot |\Sigma| \cdot g_1(\frac{2N}{k}) + vc_3$     $N \cdot n$
HmSearch 1-var                         $\frac{k}{2} \cdot g_1(\frac{2N}{k}) + vc_4$          $N \cdot |\Sigma| \cdot n$
HmSearch 1-del-var                     $N \cdot g_2(\frac{2N}{k} - 1) + vc_5$                $N \cdot n$

where $f(x) = \max(1, \frac{n}{|\Sigma|^{x}})$, $g_1(x) = \max(1, \frac{n \cdot |\Sigma| \cdot N}{|\Sigma|^{x}})$, and $g_2(x) = \max(1, \frac{n \cdot N}{|\Sigma|^{x}})$, under the uniform-distribution assumption; $vc_i$ stands for the total time used for pruning and verifying the candidates in each algorithm.

Comparison with Google. Google's algorithm was introduced in [Manku et al., 2007]. There are two practical versions of its partition strategy: a one-level version and a two-level version. The one-level strategy partitions the set into $k + 1$ even partitions and takes each partition as a signature; a qualified query result must share at least one signature with the query. Under this scheme, Google's method probes the index $k + 1$ times, which is small but still more than our 1-variants-based algorithm ($\frac{k}{2}$). Moreover, as shown in Table 3.2, when $k$ is not large, the query time of Google's algorithm will probably be much longer than that of both our methods: its partition length is $\frac{N}{k+1}$, almost half of our algorithms' ($\frac{2N}{k}$), and its performance suffers accordingly. Note that the index size of Google is small and does not depend on the alphabet size. Fortunately, the index size of our 1-deletion variants also does not depend on the alphabet size, so it remains acceptable even when the alphabet is very large (e.g., for sets generated by MinHash [Broder et al., 1997, Theobald et al., 2008]).

In terms of the two-level version, it first partitions the vector into k + 1 even partitions. Then, for each such partition, it divides the remaining dimensions again into k + 1 even partitions, combines the first-level partition with each second-level partition, and takes the combined dimensions as a signature. A qualified query result must share at least one such signature with the query.

Under this scheme, the length of each signature is (2k+1)N/(k+1)^2, which compensates for the short partition length of the one-level version. However, it needs to probe the index (k + 1)^2 times, which is far more than the one-level version and will probably drag down the performance. The index size of the two-level version is much larger than that of the one-level version as well.
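A sketch of the corresponding two-level signature generation follows (again our own reading of the description, with an assumed even split of the remaining dimensions):

def two_level_signatures(vec, k):
    """If H(S, T) <= k, some first-level block of S matches T exactly; all k
    errors then fall among the remaining dimensions, so one of their k + 1
    sub-blocks is also error-free. Hence S and T share one of the (k + 1)^2
    combined signatures, each of length about (2k + 1)N/(k + 1)^2."""
    N = len(vec)
    cut = [round(i * N / (k + 1)) for i in range(k + 2)]
    sigs = set()
    for i in range(k + 1):
        block = tuple(vec[cut[i]:cut[i + 1]])
        rest = [vec[p] for p in range(N) if not cut[i] <= p < cut[i + 1]]
        m = len(rest)
        sub = [round(j * m / (k + 1)) for j in range(k + 2)]
        for j in range(k + 1):
            sigs.add((i, j, block, tuple(rest[sub[j]:sub[j + 1]])))
    return sigs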

Comparison with HEngine. HEngine was introduced in [Liu et al., 2011]. The basic idea is to partition the vector into ⌈(k+1)/2⌉ parts. It then takes each partition of the database vectors as a signature on the data side, and the 1-variant sets of the query's partitions as signatures on the query side. A qualified query result must share at least one signature with the query. Although the partitioning method in [Liu et al., 2011] is similar to the idea in this thesis, the basic method we propose differs from [Liu et al., 2011] in three important aspects:

• Although the candidate size of HEngine is identical to that of our 1-variants based method, and probably smaller than that of our strict 1-deletion variants based method in many cases, it needs to spend far more time enumerating the possible variants of the query and probing the index. This exhaustive enumeration and probing hurt its performance significantly. We apply the variant generation on the data side rather than the query side, which reduces the number of probes at query time and hence improves the performance substantially.

• We propose to consider both 1-variants and strict 1-deletion variants as signatures. The strict 1-deletion variants do not depend on the alphabet size, and hence are applicable to cases where the dimension domains are large (e.g., those sets generated by MinHash [Broder et al., 1997, Theobald et al., 2008]). Therefore, on this kind of data, where HEngine suffers from a huge number of query-side enumerations and our 1-variants suffer from a large index space, our strict 1-deletion variants can still perform decently.

• Rather than replicating the sets in the database multiple times, we generate signatures for each set and index them using an inverted index.

All these differences contribute to a substantial improvement in query performance, as the following sketch illustrates.
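The contrast between query-side and data-side variant generation can be sketched as follows (an illustration of the probing pattern only; position-tagged signatures are an assumed index layout):

def one_variants(part, alphabet):
    """part itself plus every vector at Hamming distance 1 from it."""
    yield part
    for i, c in enumerate(part):
        for a in alphabet:
            if a != c:
                yield part[:i] + (a,) + part[i + 1:]

def probe_query_side(index, q_part, pos, alphabet):
    """HEngine-style: enumerate the query partition's 1-variants, so each
    partition costs O(len(q_part) * |alphabet|) index probes."""
    for v in one_variants(q_part, alphabet):
        yield from index.get((pos, v), ())

def probe_data_side(variant_index, q_part, pos):
    """HmSearch 1-var: variants were enumerated once at indexing time,
    so each query partition costs a single probe."""
    return variant_index.get((pos, q_part), ())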

3.9.2 2-Query Processing using 1-Variants

As discussed in Chapter 3.3.2, if there were an efficient solution to the Hamming distance problem with threshold 2, κ could be made even smaller, and each signature could have an even better selectivity. From our observation, there is a way to answer the 2-query using 1-variants. The basic idea is to generate 1-variants on both the index side and the query side.

Lemma 8 Consider two vectors S and T such that H(S, T) ≤ 2. Then

|1-Var-Set(S) ∩ 1-Var-Set(T)| ≥ 2.

Proof. We consider all the possible cases, where H(S, T) = 0, 1, or 2.

• When H(S, T) = 0, all the 1-variants match, hence |1-Var-Set(S) ∩ 1-Var-Set(T)| ≥ N · (|Σ| − 1) + 1.

• When H(S, T) = 1, every enumeration of 1-variants on the mismatched dimension matches, thus |1-Var-Set(S) ∩ 1-Var-Set(T)| ≥ |Σ|.

• When H(S, T) = 2, each of the two mismatched dimensions yields one match, thus |1-Var-Set(S) ∩ 1-Var-Set(T)| ≥ 2.

Whenever there is a mismatched dimension, we must have |Σ| ≥ 2. Therefore, in all cases, |1-Var-Set(S) ∩ 1-Var-Set(T)| ≥ 2.
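Lemma 8 can be checked exhaustively on small instances; the short Python script below (a sanity check added here, with an arbitrarily chosen small alphabet and dimension) verifies it by brute force.

from itertools import product

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def one_var_set(s, alphabet):
    """s itself plus every vector at Hamming distance exactly 1 from s."""
    vs = {s}
    for i, c in enumerate(s):
        for a in alphabet:
            if a != c:
                vs.add(s[:i] + (a,) + s[i + 1:])
    return vs

alphabet, N = (0, 1, 2), 4
for s in product(alphabet, repeat=N):
    for t in product(alphabet, repeat=N):
        if hamming(s, t) <= 2:
            assert len(one_var_set(s, alphabet) & one_var_set(t, alphabet)) >= 2
print("Lemma 8 verified for all pairs with |Sigma| = 3, N = 4")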

Notice that, conveniently, there is no case where H(S, T) > 2 and yet 1-Var-Set(S) ∩ 1-Var-Set(T) is non-empty: any common 1-variant u would give H(S, T) ≤ H(S, u) + H(u, T) ≤ 2. Therefore we can test the condition 1-Var-Set(S) ∩ 1-Var-Set(T) ≠ ∅ instead, which is apparently easier to check. According to Lemma 8, we can use the following procedure to answer a Hamming distance query with threshold k′ = 2.

• Indexing. We generate all the 1-variants for every vector in the database and index the variants using an inverted index I.

• Query Processing. We generate all the 1-variants for the query and look them up in the index. The returned results are merged and become the query results.

The index space complexity of this method is O(|Σ| · N · n). The query time complexity is O(|Σ| · N · n + occ), where occ denotes the number of query results.
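A minimal sketch of this procedure is given below, reusing one_var_set from the Lemma 8 check above. By the triangle-inequality argument just given, sharing a 1-variant is both necessary and sufficient for being within Hamming distance 2, so no further verification of the returned vectors is needed.

from collections import defaultdict

def build_2query_index(vectors, alphabet):
    """Index every 1-variant of every database vector."""
    index = defaultdict(set)
    for vid, v in enumerate(vectors):
        for var in one_var_set(tuple(v), alphabet):
            index[var].add(vid)
    return index

def query_within_2(index, q, alphabet):
    """Return exactly {vid : H(q, vectors[vid]) <= 2}."""
    result = set()
    for var in one_var_set(tuple(q), alphabet):
        result |= index.get(var, set())
    return result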

Based on the partitioning framework of Chapter 3, using the 2-query strategy, κ can be made even smaller than in our proposed method (about k/3), so the signatures will probably have even better selectivity. However, this method needs to generate and probe all 1-variants on the query side, and this exhaustive enumeration and probing would severely hurt the performance. We therefore do not adopt this strategy, and only leave it here as a point for discussion.

3.9.3 Triangular Inequality Pruning

A bottleneck of HmSearch is that it needs to naively go through all the retrieved vectors, and verification is applied to almost all of the candidates1. One possible way to address this problem is to record some auxiliary information and use it to skip some steps during processing.

The triangle inequality is a prevalent pruning technique for problems in metric spaces. Its definition is very simple: given a metric space M with metric d, any three points x, y, z satisfy

|d(x, y) − d(y, z)| ≤ d(x, z) ≤ d(x, y) + d(y, z)

Since Hamming distance is a metric, this property holds for it. Therefore, given a query Q and two vectors S and T with H(S, T) = hst, suppose the Hamming distance threshold is k. Once H(Q, S) is calculated (denote the result by hqs), the triangle inequality gives

|hqs − hst| ≤ H(Q, T) ≤ hqs + hst

Hence, if |hqs − hst| > k, T can be quickly pruned; and if hqs + hst ≤ k, T can be directly accepted as a final result. Based on this, we wish to design an algorithm which can quickly prune some vectors during index probing and accept some vectors without verification.

Suppose there is a dataset of vectors, signatures are generated from each vector, and an inverted index I has already been built on these signatures. One way to use the above heuristics is, for each posting list of I (denoted Ii), to pre-compute the Hamming distance Hj between every pair of consecutive postings Ii(j) and Ii(j + 1) (j ∈ [0, length(Ii) − 2]) and record it in Ii(j). When a query Q comes in, we go through all the retrieved postings. Once Ii(j) is verified, based on the triangle inequality, there is a chance that Ii(j + 1) can be quickly accepted or pruned.

[Figure 3.10: Posting list. Index entry A points to vectors v2, v3, v4, v5; the values 1, 5, 10 above the list are the precomputed Hamming distances between consecutive vectors.]

Example 7 Consider the posting list shown in Figure 3.10: A is the index entry; v2, v3, v4, v5 are vectors; and H(v2, v3) = 1, H(v3, v4) = 5, H(v4, v5) = 10 are pre-computed and stored. Assume that k = 5 and that this posting list is retrieved for a query Q. For ease of illustration, suppose v2 is a candidate, so verification is applied to v2 and yields H(Q, v2) = 2; hence v2 is a result. Next, before v3 is processed, the triangle inequality gives the estimate H(Q, v3) ≤ H(Q, v2) + H(v2, v3) = 3 ≤ 5, so v3 is quickly accepted as a result without verification. Then, supposing v4 is also a candidate and H(Q, v4) = 3, v4 is also a result. After that, before processing v5, the triangle inequality gives H(Q, v5) ≥ |H(Q, v4) − H(v4, v5)| = 7 > 5. Therefore, v5 is quickly pruned.
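A sketch of such a scan is given below (our own formulation of this heuristic, not an implemented component of HmSearch). Instead of keeping a single exact distance, it carries an interval [lo, hi] bounding H(query, current posting), so that a posting accepted or pruned without verification still passes usable bounds on to its successor.

def scan_posting_list(query, postings, gaps, k, hamming):
    """gaps[j] = H(postings[j], postings[j+1]), precomputed at indexing time.
    The triangle inequality widens the interval [lo, hi] by the gap when
    moving to the next posting; exact verification runs only when the
    interval straddles the threshold k."""
    results = []
    lo, hi = 0, float("inf")              # trivial bounds before the first posting
    for j, v in enumerate(postings):
        if j > 0:
            g = gaps[j - 1]
            lo, hi = max(lo - g, g - hi, 0), hi + g
        if lo > k:                        # v cannot be within k: prune
            continue
        if hi <= k:                       # v must be within k: accept unverified
            results.append(v)
            continue
        h = hamming(query, v)             # uncertain case: verify exactly
        lo = hi = h
        if h <= k:
            results.append(v)
    return results

On Example 7 (gaps = [1, 5, 10], k = 5), this scan verifies only v2 and v4, accepts v3 via the upper bound, and prunes v5 via the lower bound, matching the steps above.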

However, this method has some drawbacks.

• A main drawback is that at least a char (or even a short) is needed to store the Hamming distance between each pair of consecutive vectors. The space overhead therefore grows with the number of postings.

• Another problem is that the order of the vectors within the postings will probably affect the performance of this pruning method. There are at least two natural choices: make similar vectors adjacent, or make very different vectors adjacent. The former makes quick acceptance easier, while the latter makes quick rejection easier.

3.10 Summary

In this chapter, we propose HmSearch, an efficient Hamming distance query algorithm that works well for a wide spectrum of error thresholds, has no limitation on the domain size, and is robust against data skew. Our method is based on a different partitioning scheme, with tightened count filtering and a filter-and-verification technique based on a hierarchical binary representation. A greedy algorithm to rearrange the dimensions before partitioning is also developed. We also present a theoretical comparison and some incomplete studies. We demonstrate the superior performance of our proposed method against the previous state-of-the-art methods, using LSH and chemical datasets under a wide range of parameter settings.

Chapter 4

Final Remark

4.1 Conclusions

With the rapid development of the IT industry, data of ever-growing size are generated in a variety of fields. Efficient manipulation of these data requires similarity search. Hamming distance is a prevailing measure for estimating the similarity between two vectors, and similarity search under a Hamming distance constraint is applied in numerous different applications. To support the versatile demands of those applications, the Hamming distance query has become an important issue in the research area and attracts increasing attention.

In this thesis, we present efficient techniques to answer the Hamming distance query, mainly in Chapter 3, where we report our studies on the Hamming distance query problem. Using a partition-based Hamming distance reduction framework, our approach uses variant-based signatures to answer the reduced Hamming distance queries. In addition, our method makes a deeper inspection of the requirement for being within the reduced Hamming threshold; this enhanced condition improves the pruning power dramatically. Moreover, we introduce a novel verification strategy named hierarchical filtering and verification, which combines the filtering and verification processes and significantly improves the overall performance. Next, we introduce a hybrid technique for LSH data. After that, we present some discussion, including a complexity comparison and several strategies we have explored.

4.2 Existing Problems and Future Work

We list several open problems for the study of the Hamming distance query:

• One of the existing problems, in both the theoretical and the practical areas, is that current solutions do not work well when k is large, e.g., when k > N/2. Most current practical strategies follow the filtering-and-verification framework. When k is large, most current filtering techniques are almost useless, so that almost all the vectors in the data collection are sent to verification. Moreover, because of the overhead of the filtering process, these algorithms usually perform worse than a linear scan. Although our algorithm outperforms other practical methods in most cases, when k is large (k > N/2) it suffers from this problem as well.

• Another problem lies in the partition-based methods. Most of the prevailing practical methods are partition-based (including [Manber and Wu, 1994, Tabei et al., 2010, Arasu et al., 2006, Liu et al., 2011, Norouzi et al., 2012] and ours). A main drawback of the partition-based strategy is that if the vectors are very short (e.g., shorter than 10), the length of each partition will be very short as well, so the selectivity of the signatures may be terrible even when k is small. This may also result in worse performance than a linear scan.

Based on these two problems and our current work, we list the directions of our future work as follows.

• One direction we wish to explore is the signature generation strategy. We hope to find ways to complement the current signature generation strategies. One way to do so is to find a data structure that efficiently supports the Hamming distance query with k = 2, so that κ can be set even smaller, which would help boost the selectivity of the signatures. Furthermore, since partition-based methods use partitions as signatures, each signature only contains information about a fraction of a vector. We believe there should be a method to extract information from the whole vector, which could support more efficient filtering.

• For short vectors, the trie might be a good structure. There are some theoretical studies on solving the Hamming distance query using tries [Brodal and Venkatesh, 2000, Cole et al., 2004, Arslan, 2006]; although none of these algorithms performs well when k > 1, they still give us some indications. We hope to find a practical way to use the trie structure to solve the Hamming distance query problem.

• In Chapter 3.6 we discuss the skewness problem and present a greedy solution. However, the time complexity of this solution is not good. Although sampling can be employed to make the processing time acceptable, and this process only needs to be done once, we still hope to optimize it or find a more efficient replacement.

Bibliography

[Arasu et al., 2006] Arasu, A., Ganti, V., and Kaushik, R. (2006). Efficient exact set-similarity joins. In VLDB.

[Arslan, 2006] Arslan, A. N. (2006). Efficient approximate dictionary look-up for long words over small alphabets. In Correa, J. R., Hevia, A., and Kiwi, M. A., editors, LATIN, volume 3887 of Lecture Notes in Computer Science, pages 118–129. Springer.

[Baldi et al., 2008] Baldi, P., Hirschberg, D. S., and Nasr, R. J. (2008). Speeding up chemical database searches using a proximity filter based on the logical exclusive-or. J. Chem. Inf. Model, pages 1367–1378.

[Bayardo et al., 2007] Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up all pairs similarity search. In WWW.

[Belazzougui, 2009] Belazzougui, D. (2009). Faster and space-optimal edit distance "1" dictionary. In CPM, pages 154–167.

[Belazzougui and Venturini, 2012] Belazzougui, D. and Venturini, R. (2012). Compressed string dictionary look-up with edit distance one. In CPM, pages 280–292.

[Botelho et al., 2011] Botelho, F. C., Lacerda, A., Menezes, G. V., and Ziviani, N. (2011). Minimal perfect hashing: A competitive method for indexing internal memory. Inf. Sci., 181(13):2608–2625.

[Botelho and Ziviani, 2007] Botelho, F. C. and Ziviani, N. (2007). External perfect hashing for very large key sets. In CIKM, pages 653–662.

[Bowyer et al., 2008] Bowyer, K. W., Hollingsworth, K., and Flynn, P. J. (2008). Image understanding for iris biometrics: A survey. Comput. Vis. Image Underst., 110(2):281–307.

[Brodal and Gasieniec, 1996] Brodal, G. S. and Gasieniec, L. (1996). Approximate dictionary queries. In CPM, pages 65–74.

[Brodal and Venkatesh, 2000] Brodal, G. S. and Venkatesh, S. (2000). Improved bounds for dictionary look-up with one error. Inf. Process. Lett., 75(1-2):57–59.

[Broder, 1997] Broder, A. Z. (1997). On the resemblance and containment of documents. In SEQS.

[Broder et al., 1997] Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. (1997). Syntactic clustering of the web. Computer Networks, 29(8-13):1157–1166.

[Chan et al., 2011] Chan, H.-L., Lam, T. W., Sung, W.-K., Tam, S.-L., and Wong, S.-S. (2011). A linear size index for approximate pattern matching. J. Discrete Algorithms, 9(4):358–364.

[Charikar, 2002] Charikar, M. (2002). Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388.

[Chaudhuri et al., 2006] Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A primitive operator for similarity joins in data cleaning. In ICDE.

[Chaudhuri and Kaushik, 2009] Chaudhuri, S. and Kaushik, R. (2009). Extending autocompletion to tolerate errors. In SIGMOD Conference, pages 707–718.

[Chen et al., 2009] Chen, B., Wild, D., and Guha, R. (2009). PubChem as a source of polypharmacology. Journal of Chemical Information and Modeling, 49(9):2044–2055.

[Chen et al., 2005] Chen, J., Swamidass, S. J., Dou, Y., and Baldi, P. (2005). ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics, 21:4133–4139.

[Chum et al., 2007] Chum, O., Philbin, J., Isard, M., and Zisserman, A. (2007). Scalable near identical image and shot detection. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, CIVR '07, pages 549–556, New York, NY, USA. ACM.

[Cole et al., 2004] Cole, R., Gottlieb, L.-A., and Lewenstein, M. (2004). Dictionary matching and indexing with errors and don't cares. In STOC, pages 91–100.

[Datar et al., 2004] Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, pages 253–262.

[Daugman, 1993] Daugman, J. (1993). High confidence visual recognition of persons by a test of statistical independence. IEEE Trans. Pattern Anal. Mach. Intell., 15(11):1148–1161.

[Daugman, 2001] Daugman, J. (2001). Statistical richness of visual phase information: Update on recognizing persons by iris patterns. International Journal of Computer Vision, 45(1):25–38.

[Daugman, 2007] Daugman, J. (2007). New methods in iris recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 37(5):1167–1175.

[Dayal et al., 2006] Dayal, U., Whang, K.-Y., Lomet, D. B., Alonso, G., Lohman, G. M., Kersten, M. L., Cha, S. K., and Kim, Y.-K., editors (2006). Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006. ACM.

[Deng et al., 2012] Deng, D., Li, G., and Feng, J. (2012). An efficient trie-based method for approximate entity extraction with edit-distance constraints. In ICDE, pages 762–773.

[Deng et al., 2013] Deng, D., Li, G., and Feng, J. (2013). Top-k string similarity search with edit-distance constraints. In ICDE.

[Flower, 1998] Flower, D. R. (1998). On the properties of bit string-based measures of chemical similarity. Journal of Chemical Information and Computer Sciences, 38(3):379–386.

[Fredman et al., 1984] Fredman, M. L., Komlós, J., and Szemerédi, E. (1984). Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544.

[Gan et al., 2012] Gan, J., Feng, J., Fang, Q., and Ng, W. (2012). Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD Conference, pages 541–552.

[Gionis et al., 1999] Gionis, A., Indyk, P., and Motwani, R. (1999). Similarity search in high dimensions via hashing. In VLDB, pages 518–529.

[Gravano et al., 2001] Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., and Srivastava, D. (2001). Approximate string joins in a database (almost) for free. In VLDB.

[Indyk and Motwani, 1998] Indyk, P. and Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC.

[Knuth, 1973] Knuth, D. E. (1973). The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley.

[Kurtz, 1996] Kurtz, S. (1996). Approximate string searching under weighted edit distance. In Proc. of Third South American Workshop on String Processing, pages 156–170.

[Landré and Truchetet, 2007] Landré, J. and Truchetet, F. (2007). Image retrieval with binary Hamming distance. In VISAPP (2)'07, pages 237–240.

[Levenshtein, 1966] Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–720.

[Li et al., 2008] Li, C., Lu, J., and Lu, Y. (2008). Efficient merging and filtering algorithms for approximate string searches. In ICDE.

[Li et al., 2007] Li, C., Wang, B., and Yang, X. (2007). VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB.

[Li et al., 2011a] Li, G., Deng, D., Wang, J., and Feng, J. (2011a). Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253–264.

[Li et al., 2011b] Li, Y., Terrell, A., and Patel, J. M. (2011b). WHAM: a high-throughput sequence alignment method. In SIGMOD Conference, pages 445–456.

[Liu et al., 2011] Liu, A. X., Shen, K., and Torng, E. (2011). Large scale Hamming distance query processing. In ICDE, pages 553–564.

[Manber and Wu, 1994] Manber, U. and Wu, S. (1994). An algorithm for approximate membership checking with application to password security. Inf. Process. Lett., 50(4):191–197.

[Manku et al., 2007] Manku, G. S., Jain, A., and Sarma, A. D. (2007). Detecting near-duplicates for web crawling. In WWW, pages 141–150.

[Masek and Paterson, 1980] Masek, W. J. and Paterson, M. (1980). A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18–31.

[Mihov and Schulz, 2004] Mihov, S. and Schulz, K. U. (2004). Fast approximate search in large dictionaries. Computational Linguistics, 30(4):451–477.

[Miller et al., 2005] Miller, M. L., Rodriguez, M. A., and Cox, I. J. (2005). Audio fingerprinting: Nearest neighbor search in high dimensional binary spaces. J. VLSI Signal Process. Syst., 41(3):285–291.

[Minsky and Papert, 1987] Minsky, M. and Papert, S. (1987). Perceptrons - an introduction to computational geometry. MIT Press.

[Myers, 1999] Myers, G. (1999). A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM, 46(3):395–415.

[Nasr et al., 2010] Nasr, R., Hirschberg, D., and Baldi, P. (2010). Hashing algorithms and data structures for rapid searches of fingerprint vectors. J. Chem. Inf. Model, 50(8):1358–68.

[Nasr et al., 2009] Nasr, R., Swamidass, S. J., and Baldi, P. (2009). Large scale study of multiple-molecule queries. J. Cheminformatics, 1:7.

[Nasr et al., 2012] Nasr, R., Vernica, R., Li, C., and Baldi, P. (2012). Speeding up chemical searches using the inverted index: The convergence of chemoinformatics and text search methods. J. Chem. Inf. Model.

[Norouzi et al., 2012] Norouzi, M., Punjani, A., and Fleet, D. J. (2012). Fast search in Hamming space with multi-index hashing. In CVPR, pages 3108–3115.

[Pagh and Rodler, 2001] Pagh, R. and Rodler, F. F. (2001). Cuckoo hashing. In ESA, pages 121–133.

[Philbin et al., 2007] Philbin, J., Chum, O., Isard, M., Sivic, J., and Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In CVPR.

[Qin et al., 2011] Qin, J., Wang, W., Lu, Y., Xiao, C., and Lin, X. (2011). Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 1033–1044.

[Qin et al., 2013] Qin, J., Zhou, X., Wang, W., and Xiao, C. (2013). Trie-based similarity search and join. In Guerrini, G., editor, EDBT/ICDT Workshops, pages 392–396. ACM.

[R. Nasr, 2011] Nasr, R., Kristensen, T., and Baldi, P. (2011). Tree and hashing data structures to speed up chemical searches: Analysis and experiments. Molecular Informatics, 30(9):791–800. Special Issue on Machine Learning Methods in Chemoinformatics/NIPS.

[Sarawagi and Kirpal, 2004] Sarawagi, S. and Kirpal, A. (2004). Efficient set joins on similarity predicates. In SIGMOD.

[Satuluri and Parthasarathy, 2012] Satuluri, V. and Parthasarathy, S. (2012). Bayesian locality sensitive hashing for fast similarity search. PVLDB, 5(5):430–441.

[Schmidt and Siegel, 1990] Schmidt, J. P. and Siegel, A. (1990). The spatial complexity of oblivious k-probe hash functions. SIAM J. Comput., 19(5):775–786.

[Swamidass and Baldi, 2007] Swamidass, S. and Baldi, P. (2007). Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sublinear time. J Chem Inf Model, 47(2):302–17.

[T. Bocek, 2007] Bocek, T., Hunt, E., and Stiller, B. (2007). Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich.

[Tabei et al., 2010] Tabei, Y., Uno, T., Sugiyama, M., and Tsuda, K. (2010). Single versus multiple sorting in all pairs similarity search. Journal of Machine Learning Research - Proceedings Track, 13:145–160.

[Theobald et al., 2008] Theobald, M., Siddharth, J., and Paepcke, A. (2008). SpotSigs: robust and efficient near duplicate detection in large web collections. In SIGIR, pages 563–570.

[Tsur, 2010] Tsur, D. (2010). Fast index for approximate string matching. J. Discrete Algorithms, 8(4):339–345.

[Ukkonen, 1985] Ukkonen, E. (1985). Algorithms for approximate string matching. Information and Control, 64(1-3):100–118.

[Wagner and Fischer, 1974] Wagner, R. A. and Fischer, M. J. (1974). The string-to-string correction problem. J. ACM, 21(1):168–173.

[Wang et al., 2010] Wang, J., Li, G., and Feng, J. (2010). Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1):1219–1230.

[Wang et al., 2009] Wang, W., Xiao, C., Lin, X., and Zhang, C. (2009). Efficient approximate entity extraction with edit constraints. In SIGMOD.

[Wu et al., 2009] Wu, Z., Ke, Q., Isard, M., and Sun, J. (2009). Bundling features for large scale partial-duplicate web image search. In CVPR, pages 25–32.

[Xiao et al., 2013] Xiao, C., Qin, J., Wang, W., Ishikawa, Y., Tsuda, K., and Sadakane, K. (2013). Efficient error-tolerant query autocompletion. In PVLDB.

[Xiao et al., 2008a] Xiao, C., Wang, W., and Lin, X. (2008a). Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944.

[Xiao et al., 2008b] Xiao, C., Wang, W., Lin, X., and Yu, J. X. (2008b). Efficient similarity joins for near duplicate detection. In WWW.

[Yang et al., 2008] Yang, X., Wang, B., and Li, C. (2008). Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In SIGMOD Conference, pages 353–364.

[Yao and Yao, 1997] Yao, A. C.-C. and Yao, F. F. (1997). Dictionary look-up with one error. J. Algorithms, 25(1):194–202.