
Efficient Similarity Search with Hamming Constraints

by

Xiaoyang Zhang

B.Sc., Wuhan University, China, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING IN THE SCHOOL

OF

Computer Science and Engineering

November 10, 2013

All rights reserved. This work may not be

reproduced in whole or in part, by photocopy

or other means, without the permission of the author.

© Xiaoyang Zhang 2013

Abstract

In this thesis, we study the Hamming distance query problem. Hamming distance measures the number of dimensions where two vectors have different values. In applications such as pattern recognition, information retrieval, Chemoinformatics and databases, it is often necessary to perform efficient Hamming distance queries, which retrieve the vectors in a database that are within Hamming distance k of a given query vector.

Existing work on efficient Hamming distance query processing suffers from some of the following limitations: it cannot deal with k beyond tiny values, it cannot deal with vectors over a large value domain, or it cannot attain robust performance in the presence of data skew. To address these limitations, we propose

HmSearch, an efficient query processing method for Hamming distance query. Our method is based on enumeration-based signatures and a novel partitioning scheme.

We develop enhanced filtering as well as a filtering-and-verification procedure.

To deal with data skew, we design an effective dimension rearrangement method.

We also illustrate a hybrid technique for LSH data. Extensive experimental evaluation demonstrates that our methods outperform state-of-the-art methods by up to two orders of magnitude. We also identify several open problems and show a few possible directions for future work.

Publications Involved in the Thesis

Published conference paper:

• Xiaoyang Zhang, Jianbin Qin, Wei Wang, Yifang Sun, Jiaheng Lu. HmSearch: An Efficient Hamming Distance Query Processing Algorithm. SSDBM 2013.

Acknowledgements

I dedicate this thesis to my parents. Their love and support are indispensable; without them, I would never have had the opportunity to concentrate on my study and finish this thesis.

I would like to express my sincere appreciation to my supervisor, Prof. Wei Wang. He not only guides and supports me in research, but also offers care and help in my everyday life. Moreover, he shows me the attitude a true researcher should have and how diligently one should chase one's dream. The knowledge learned from him and the experience of working with him will benefit me for my whole life.

I would like to thank Prof. Xuemin Lin for his guidance and support of the whole database group.

I would like to thank Dr. Jianbin Qin and Dr. Jiaheng Lu for their collaboration and assistance with the work in Chapter 3, and special thanks to Dr. Jianbin Qin for his long-standing help with my research.

I also would like to thank all our group members: Jianbin Qin (again), Yifei Lu, Yifang Sun, Xiaoling Zhou and Chen Chen. We are brothers and sisters forever.

Contents

Abstract

Acknowledgements

List of Figures

List of Tables

List of Algorithms

1 Introduction

1.1 The Applications for Hamming Distance Query

1.1.1 Hamming distance for Near Duplicate Detection

1.1.2 Chemical Informatics

1.1.3 LSH

1.1.4 Image Retrieval

1.1.5 Iris Recognition

1.2 Challenge and Our Contribution

1.3 Thesis organisation

1.4 Notations Involved in This Thesis

2 Related Work

2.1 Overview of the Similarity Search

2.1.1 Exact Match Query

2.1.2 Similarity Query in Metric Space

2.2 Hamming Distance Query

2.2.1 Theoretical Studies

2.2.2 Practical Solutions

2.2.3 Solutions in Other Areas

2.2.4 Relationship with Other Similarity Measures

3 HmSearch: An Efficient Hamming Distance Query Processing Algorithm

3.1 Overview

3.2 Background Information

3.2.1 Problem Definition

3.2.2 Most Closely Related Techniques

3.3 Reduction of the General Hamming Distance Problem

3.3.1 Reduction Strategy

3.3.2 Heuristics of Choosing κ and k′

3.4 Answer the Reduced Query

3.4.1 Variants and Deletion Variants

3.4.2 1-Query Processing using Variants and Deletion Variants

3.5 The HmSearch Algorithm

3.5.1 Partitioning

3.5.2 Hierarchical Binary Filtering and Verification

3.6 Partition Strategies

3.6.1 Equal Length Partition and its Drawback

3.6.2 Dimension Rearrangement

3.7 Hybrid Techniques for LSH Data

3.7.1 Hamming Distance Query in C2LSH

3.7.2 Hybrid Algorithm

3.8 Experiments

3.8.1 Experiment Setup

3.8.2 Hamming Similarity Query Performance

3.8.3 Candidate Size Analysis

3.8.4 Query Time Fluctuation

3.8.5 Effect of Enhanced Filter and Hierarchical Binary Verification

3.8.6 Effect of Rearranging Dimensions

3.8.7 Scalability

3.8.8 Index Size

3.9 Discussion

3.9.1 Complexity Analysis

3.9.2 2-Query Processing using 1-Variants

3.9.3 Triangular Inequality Pruning

3.10 Summary

4 Final Remark

4.1 Conclusions

4.2 Existing Problems and Future Work

Bibliography

List of Figures

3.1 Google's Method

3.2 Google's Method Recursively

3.3 HEngine

3.4 Index for 1-Variants

3.5 Example of Hierarchical Binary Filtering and Verification

3.6 Impact of Data Skew and Benefit of Dimension Rearrangement

3.7 Dimension Rearrangement Example

3.8 Experiment Results - I

3.9 Experiment Results - II

3.10 Posting List

List of Tables

1.1 Notations

3.1 Statistics of Datasets

3.2 Complexities of Empirical Hamming Distance Query Methods

List of Algorithms

1 HammingQuery(Q, k, κ)

2 filter(v, m)

3 oneHammingQuery1Var(q)

4 HmSearch-V(Q, k, κ)

5 enhancedFilter(v, k)

6 oneHammingQuery1DelVar(q)

7 HBVerify(Q, S)

8 Reorder(Q, N, k)

Chapter 1

Introduction

In this thesis, we study the problem of efficiently and exactly processing similarity queries under a Hamming distance constraint (for simplicity, we call this the Hamming distance query). Hamming distance is a widely used distance function, which measures the number of dimensions where two vectors have different values. The Hamming distance query is to retrieve vectors in a database that have Hamming distance no more than k from a given query vector. A novel and practical approach to solve this problem will be proposed in Chapter 3. It will be demonstrated that our algorithm substantially outperforms previous state-of-the-art algorithms.

With the advancement of information technology, digital data have become an integral part of everyday life. As the growth of digital data accelerates in both scale and variety, it is urgent to find ways to manipulate the data efficiently and effectively. A natural way to manipulate the data is to execute a search against it. A simple example is using Google to search for certain objects across the Internet. As searching is such a prominent data processing operation, researchers have focused much attention on the searching problem, especially exact match at the very beginning.

Exact match searching is a well studied area. However, in plenty of situations, the requirement of searching is naturally not restricted to finding identical objects. For example, if a criminal's face is captured by a surveillance camera, the police probably wish to find all similar faces to identify the suspect. Moreover, with the rapid development of the information industry, data are increasingly being generated and gathered in different areas. Hence current data usually come in huge sizes and a variety of categories, such as images, audio, videos, time series, fingerprints, documents, protein sequences and so on. Since these data collections are huge and complicated, the data objects may well lack precision. Therefore, it is inevitable to use similarity search to find similar objects instead of only seeking exact matches. Currently, similarity search is applied in a variety of areas, including databases, data mining, machine learning, bioinformatics and so on. Hence, the study of similarity search has become a fundamental problem and attracts more and more attention.

A primary task of similarity search is to quantify the degree of similarity of a query object against the objects in a data collection. There are fundamentally two ways to capture the requirements of similarity quantitatively. One is to specify a distance constraint, and the other is to specify a similarity constraint. There are plenty of different distance constraints, such as Minkowski distance, Quadratic Form distance, Edit distance, Hamming distance and so on. There are also a few different similarity constraints, such as Jaccard similarity, Cosine similarity, Dice similarity and so on. Among these measures, Hamming distance is one of the highly popular and widely used ones [Liu et al., 2011].

Hamming distance measures the number of dimensions where two vectors have different values. It is used in many applications such as multimedia databases, bioinformatics, pattern recognition, text mining, computer vision, Chemical Informatics and so on. The reasons for its popularity are as follows:

• Currently, many data objects, such as chemical data, image data, audio data, iris biometrics data and so on, are extraordinarily complicated. Hence, it is extremely expensive to execute searches on those data objects directly. Therefore, to facilitate searching, modern digital objects are usually represented by characteristic features extracted from them, and the binary string is one of the most popular representations. For example, in Chemical Informatics, a compound is usually represented by a fingerprint, which is essentially a binary string. Because plenty of data types are abstracted and stored as equal-length binary strings, it is natural to apply Hamming distance to these strings to measure the dissimilarity among them.

• The calculation of Hamming distance between two vectors is relatively fast (O(n) by linear scan) compared to other distance functions, especially the popular Edit distance (O(n²) by dynamic programming). Moreover, this dimension-wise comparison maps naturally onto bit operations, which can further accelerate the process (see the sketch after this list). During searching, it is sometimes inevitable to employ an exact similarity or distance calculation. Therefore, when dealing with huge amounts of data, using Hamming distance generally has an advantage in running speed.

• A Hamming distance constraint is inter-related with constraints based on some other measures, such as Jaccard similarity, Cosine similarity and Overlap similarity. Therefore, these similarity measures can be mutually transformed into each other under certain conditions.

In terms of similarity searching strategy, performance has become an important criterion for a successful design, and a lack of it will probably lead to failure. Although the Hamming distance between two vectors can be computed in linear time by a simple scan, in applications such as fingerprint-based compound retrieval in Chemical Informatics, with up to millions of binary vectors each corresponding to a compound, it is easy to see both the importance of searching speed and the difficulty of achieving it. Therefore, how to answer Hamming distance queries efficiently is an important research issue. Before we introduce the problem formally and discuss the different solution paradigms, we start by presenting some of its numerous applications in different fields.

1.1 The Applications for Hamming Distance Query

1.1.1 Hamming distance for Near Duplicate Detection

In order to identify near-duplicate web pages, Google uses SimHash [Manku et al., 2007] to obtain a 64-dimension vector for each web page. Two web pages are considered near-duplicates if their vectors are within Hamming distance 3 [Manku et al., 2007]. This method is very practical and can be applied on a distributed system. In [Manku et al., 2007], the method is evaluated at a huge scale: using 64-bit strings, 1 million batch queries are executed on a collection of 8 billion vectors under the Hamming distance constraint k = 3. The experiment runs on a distributed system with 200 mappers and finishes in fewer than 100 seconds.

1.1.2 Chemical Informatics

Similarity search is widely used in Chemical Informatics to search and classify known chemicals, virtually screen chemicals for drug discovery, and predict and optimize the properties of existing active compounds [Flower, 1998, Nasr et al., 2012]. A fundamental query is to find all the molecules whose 881-bit fingerprints have Tanimoto similarity no less than t to the fingerprint of a query molecule. As will be shown in Chapter 3.2, this can be transformed into a Hamming distance query.

1.1.3 LSH

Locality sensitive hashing [Indyk and Motwani, 1998] is a widely used technique for approximate similarity search with probabilistic guarantees. Recently, C2LSH [Gan et al., 2012] was proposed to address the excessive index space required by traditional LSH methods without affecting the theoretical guarantees.

At the core of the method is a Hamming distance query with a medium-valued threshold k against vectors of the database objects generated by N LSH functions.

1.1.4 Image Retrieval

Hamming distance query is employed in image retrieval [Landré and Truchetet, 2007, Wu et al., 2009]. Image data are generally very large, so the original image data cannot be directly indexed for retrieval. One solution is to extract features from an image and hash them into binary strings [Landré and Truchetet, 2007, Chum et al., 2007], which can then be evaluated by Hamming distance queries [Landré and Truchetet, 2007]. Another approach is based on the Bag-of-Words model [Wu et al., 2009]. In this approach, image collections are represented in a visual dictionary based on a k-d tree [Philbin et al., 2007], where each object carries an extra 24-bit signature. During the evaluation, objects that have a large Hamming distance from the query feature are filtered out [Wu et al., 2009].

1.1.5 Iris Recognition

Hamming distance query is widely applied in iris recognition systems with a minor adjustment [Daugman, 1993, Daugman, 2001, Daugman, 2007]. In iris biometrics, the iris texture is usually represented by an iris code, which is essentially a binary string. The Hamming distance used in iris recognition is usually the normalised Hamming distance, which is the standard Hamming distance divided by the number of dimensions of the vector (a constant when the data is fixed). The normalised Hamming distance measures the fraction of bits for which two iris codes disagree, and the minimum computed normalised Hamming distance between two iris codes is assumed to correspond to the correct alignment of the two images [Bowyer et al., 2008]. All m-bit iris codes whose normalised Hamming distance to the query iris code is no more than k/m can be found by a Hamming distance query with constraint k.
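As a small illustration of this adjustment (a sketch with assumed example values, not taken from the iris literature), a normalised threshold t over m-bit codes translates into an absolute Hamming threshold k = ⌊t · m⌋:

import math

def normalised_hamming(s, t):
    # fraction of bit positions where two equal-length iris codes disagree
    return sum(1 for a, b in zip(s, t) if a != b) / len(s)

def absolute_threshold(t_norm, m):
    # H/m <= t_norm  <=>  H <= floor(t_norm * m)
    return math.floor(t_norm * m)

assert absolute_threshold(0.32, 2048) == 655  # 0.32 and 2048 are assumed example values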

1.2 Challenge and Our Contribution

Because Hamming distance is widely used in numerous applications, there are quite a few prior studies focusing on efficient query processing methods for Hamming distance search with a fixed threshold k. However, all of them suffer from some of the following problems:

• Unable to handle medium-valued k. The value of k defines the tolerance to the dissimilarity between two vectors. In some settings, k needs to be set larger to find the distant neighbours of the query. For example, in Chemical Informatics, k needs to be set slightly larger if it is required to find all substances having at most 1/4 of their composition different from a certain compound. Early solutions based on reduction to exact matching problems only work for a very small k [Manku et al., 2007, Tabei et al., 2010]. Recent proposals [Liu et al., 2011, Norouzi et al., 2012] are able to process slightly larger k, but are still fairly limited, as their performance deteriorates rapidly with the increase of k due to a lack of effective pruning. Hence it is sensible to develop an algorithm that better handles searching with medium-valued k.

• Unable to handle a large value domain. Sometimes we need to perform approximate Hamming search on datasets with large domain sizes, such as the datasets generated by MinHash [Broder et al., 1997, Theobald et al., 2008] or kNN search [Norouzi et al., 2012, Gan et al., 2012]. However, most existing solutions were designed for binary vectors (i.e., a value domain of size 2), and they incur huge space usage when the value domain is large. Thus it is preferable to develop an algorithm that can efficiently deal with datasets with large domain sizes.

• Unable to handle skewed data. Some real-world datasets are highly skewed, such as the PubChem dataset. As we show in Chapter 3.6.2, existing methods all degenerate to essentially brute-force linear scans when dealing with such data collections. Therefore it is desirable to develop an algorithm that handles such datasets better.

In this thesis, we study the Hamming distance query problem and tackle the above problems. We propose HmSearch, an efficient query processing method for Hamming distance query that addresses the above-mentioned limitations. Our method partitions the dimensions in such a way that the query results must have at least one partition whose Hamming distance to the corresponding partition of the query is no more than 1. We can then use either 1-deletion variants or 1-variants to efficiently process this special 1-Hamming distance query. We fully exploit the partitioning method by developing a tighter pruning condition that requires candidates to match more partitions under certain circumstances. We also develop a novel hierarchical binary representation of the data that enables us to perform filtering and verification simultaneously at almost no additional cost. To deal with data skew, we design an effective dimension rearrangement method. Moreover, for LSH data, we demonstrate a hybrid technique that switches between the filtering strategy and direct verification, which helps improve performance under certain conditions.

Extensive experimental evaluation using various LSH and chemical datasets demonstrates that our methods outperform the state-of-the-art methods by up to two orders of magnitude, especially for medium-valued k and skewed datasets.

Our contributions can be summarised as follows.

• We propose a versatile method to process Hamming distance queries under a wide spectrum of settings, including error threshold k and value domain size. It is also robust against data skew thanks to the dimension rearrangement technique.

• We compare the proposed method with state-of-the-art methods in an extensive experimental study. The results demonstrate that our method can outperform existing ones by up to two orders of magnitude.

1.3 Thesis organisation

The rest of the thesis is organised as follows.

Chapter 2 introduces some important studies in the similarity search area that are highly related to the Hamming distance query. Chapter 2.1 presents a brief overview of similarity search in metric space, together with some important studies on several distance measures that are highly related to Hamming distance. Chapter 2.2 presents the rich body of research related to the Hamming distance query problem, including works in the theoretical area, works in the practical area, and works outside the database and IR areas, especially Chemical Informatics.

Chapter 3 illustrates the details of our HmSearch algorithms. Chapter 3.2 defines the problem and introduces the preliminaries. Chapter 3.3 presents a general Hamming distance reduction strategy. Chapter 3.4 introduces the variant-based signatures for Hamming distance query with threshold 1. Chapter 3.5 presents our HmSearch method with tighter pruning and a filtering-and-verification procedure based on a hierarchical binary representation of the data. Chapter 3.6.2 presents our technique of rearranging the dimensions to further reduce sensitivity to data skew. Chapter 3.9 presents some discussions, including a theoretical comparison between our methods and several state-of-the-art algorithms; some of our preliminary work and conjectures are also presented there. Experimental results are presented in Chapter 3.8. Chapter 3.10 concludes the chapter.

Chapter 4 concludes this thesis, introduces some open problems in this area and presents our plans for future work. Chapter 4.1 gives a brief conclusion and Chapter 4.2 shows several existing problems and plans for our future work.

1.4 Notations Involved in This Thesis

We list the notations used in this thesis in Table 1.1.

Table 1.1: Notations

Symbol      Definition
N           Dimensionality of all the vectors
k           Hamming distance threshold
M           N − k
τ           M/N
Σ           The domain for all values of the vector
#           Deletion marker
1_1, 2_2    We use the partition ID in the subscript to distinguish values
v_i         The i-th partition of vector v
I_sig       The postings list of signature sig
x(i)        The i-th bit (from left to right) of the binary representation of an integer x (e.g., 5(3) is 1)

Chapter 2

Related Work

In this chapter, we survey the literature on similarity queries, especially the Hamming distance query. First we present an overview of similarity search. Then we give a detailed survey of work related to the Hamming distance query.

2.1 Overview of the Similarity Search

In this section, we give a brief survey of similarity search in metric space. We start with the exact match query problem, then briefly introduce three highly related similarity measures: Jaccard similarity, Edit distance and Hamming distance.

2.1.1 Exact Match Query

Exact match query is commonly used in traditional databases, especially structured databases containing numeric or alphabetic data. Given a database and a query object, the exact match query is to find all objects in the database that are identical to the query object. A naive method to support exact match queries is to keep the database objects sorted in a global order, then employ binary search on the sorted data to find the query object. This takes O(log n) time (n is the number of objects in the database).

A more efficient method to answer the exact match query is to build a hash table as an index over the database objects, then hash the query object and probe the index. This method supports O(1)-time access [Fredman et al., 1984]. [Fredman et al., 1984] uses a simple probabilistic construction algorithm to do the hashing, and some other hashing strategies have been developed more recently, e.g., Cuckoo hashing [Pagh and Rodler, 2001] and perfect hashing [Botelho and Ziviani, 2007, Botelho et al., 2011]. In some cases, the database might be huge, which means there is a high probability that two distinct objects will have the same hash value. In these situations, a cryptographic hash function, such as MD5, will probably be employed.

If the objects are strings, the database objects can be organised as a prefix tree [Knuth, 1973] (also known as a trie), and an exact match query Q can then be answered in O(|Q|) time.
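A minimal trie sketch (our own illustration, not from the cited work) that answers an exact match query Q in O(|Q|) time:

class Trie:
    def __init__(self):
        self.children = {}     # maps a character to a child Trie node
        self.terminal = False  # True if a database string ends here

    def insert(self, s):
        node = self
        for ch in s:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def contains(self, q):
        node = self
        for ch in q:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal

trie = Trie()
for s in ["cat", "car", "dog"]:
    trie.insert(s)
assert trie.contains("car") and not trie.contains("ca")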

2.1.2 Similarity Query in Metric Space

The natural extension of the exact match query is the similarity query, which is to find objects similar to the query in a collection of database objects. One way to measure the similarity among objects is to use a distance function in a metric space. A metric space is a set where a notion of distance between elements is defined. Formally, let M be a set of objects and let d be a function (a distance) defined on M:

d : M × M → R

(M, d) is a metric space if the following holds for any x, y, z ∈ M:

1. d(x, y) ≥ 0 (non-negativity)

2. d(x, y) = 0 ⇐⇒ x = y (identity)

3. d(x, y) = d(y, x) (symmetry)

4. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)

There are different kinds of distance functions d, e.g., Minkowski distances, Quadratic Form distance, Edit distance and Hamming distance.

Usually, the choice of distance function is highly dependent on the application domain; e.g., Jaccard similarity [Gionis et al., 1999] is usually used for sets, Edit distance [Levenshtein, 1966] for strings, and Hamming distance [Manku et al., 2007] for binary strings. The solutions for different similarity measures are usually different.

A very practical approach to the similarity problem is the approximate strategy. Generally, an approximate strategy answers the similarity query with most of the required results (not necessarily all), with a theoretical guarantee on the number of missed results. One of the widely used approximate solutions is LSH [Indyk and Motwani, 1998]. The basic idea of LSH is that two similar objects have a high probability of sharing the same hash value, while dissimilar objects have a high probability of having different hash values.

Different hash functions [Broder, 1997, Charikar, 2002, Datar et al., 2004] have been created to approximate different similarity measures, e.g., MinHash [Broder, 1997] for Jaccard similarity. Several recent works in this area are [Gan et al., 2012, Satuluri and Parthasarathy, 2012].

Although approximate strategies usually offer short query times and small index sizes, under some conditions the query has to be answered exactly. Exact similarity query in metric space is a very important area of study. Three widely used and well studied distance functions in this area are the Jaccard coefficient, Edit distance and Hamming distance; the three are inter-related [Xiao et al., 2008b].

2.1.2.1 Jaccard Distance and Some Other Set Similarity Measures

The Jaccard distance is a popular metric defined as follows: given sets S and T,

J(S, T) = 1 − |S ∩ T| / |S ∪ T|

A more widely used variant is the Jaccard similarity (or coefficient), defined as the complement of the Jaccard distance:

J(S, T) = |S ∩ T| / |S ∪ T|

Under this setting, J(S, T) = 1 if S = T. The Jaccard similarity is a very popular measure to estimate the similarity between sets. Some important works about similarity search using the Jaccard coefficient are [Gionis et al., 1999, Charikar, 2002, Chaudhuri et al., 2006, Xiao et al., 2008b].

There are several other set similarity measures, including Overlap similarity, Cosine similarity and Dice similarity [Xiao et al., 2008b]. Although these similarity measures are not metrics, they are closely related to the Jaccard similarity, so we list them here as well (a small code sketch of all four measures follows the list):

• Overlap. Given sets S and T, the overlap coefficient [Charikar, 2002] is defined as

O(S, T) = |S ∩ T| / min(|S|, |T|)

The Overlap similarity constraint can be equivalently transformed into constraints on several other similarity measures, including Jaccard similarity, Cosine similarity, Dice similarity, Hamming distance and Edit distance [Xiao et al., 2008b]. Therefore, several studies focus on efficient query processing with the overlap constraint, then extend the methods to other similarity constraints [Sarawagi and Kirpal, 2004, Bayardo et al., 2007, Xiao et al., 2008b].

• Cosine. Given sets S and T, the Cosine similarity is defined as

C(S, T) = Σᵢ Sᵢ · Tᵢ / (√|S| · √|T|)

Some important related works are [Bayardo et al., 2007, Xiao et al., 2008b].

• Dice. Given sets S and T, the Dice similarity is defined as

dice(S, T) = 2|S ∩ T| / (|S| + |T|)

[Xiao et al., 2008b] introduces a solution for the Dice similarity using the Overlap similarity.

2.1.2.2 Edit Distance

Edit distance is a distance function measuring how dissimilar two strings are. The basic Edit distance is defined as the minimum number of operations (deletion, insertion, substitution) required to transform one string into the other [Levenshtein, 1966]. A more general version of Edit distance is introduced in [Kurtz, 1996], where each operation comes with a cost function. The Edit distance can be computed in O(n²) time using dynamic programming [Wagner and Fischer, 1974]. This calculation was later improved in [Masek and Paterson, 1980, Ukkonen, 1985, Myers, 1999, Qin et al., 2013].

[Masek and Paterson, 1980] improves the bound to O(n²/log n). [Myers, 1999] proposes an algorithm with time complexity O(n²/w) (w is the machine word length). [Ukkonen, 1985] achieves a time complexity of O(τ · n). [Qin et al., 2013] further improves [Ukkonen, 1985] by precomputing the transition states of some diagonal values.
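The following is the textbook O(n²) dynamic program of [Wagner and Fischer, 1974], included as a sketch to make the recurrence concrete:

def edit_distance(s, t):
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions
    for j in range(n + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[m][n]

assert edit_distance("kitten", "sitting") == 3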

There are plenty of theoretical works on Edit similarity. The solutions for large τ (the Edit distance threshold) and small τ are different. For τ = 1, there are [Yao and Yao, 1997, Brodal and Gasieniec, 1996, Belazzougui, 2009, Belazzougui and Venturini, 2012]; [Belazzougui and Venturini, 2012] gives an efficient method that solves the problem in query time O(|Q| + occ). When τ > 1, the problem becomes more difficult; among [Cole et al., 2004, Tsur, 2010, Chan et al., 2011], [Chan et al., 2011] presents a structure that solves the problem in query time O(|Q| + lg^{d(d+1)} n · lg lg n + occ) using O(n) space.

The study of practical solutions for Edit similarity is fruitful. There are generally three categories of solutions:

• Sketch-based. Most sketch-based algorithms follow a filtering-and-verification framework. Because the exact Edit distance computation between two strings is very expensive, measures are needed to quickly eliminate pairs that are not within the Edit distance threshold. Hence, some sketches (usually grams) of each string are selected as signatures, so that a dissimilar pair can be pruned instantly by checking certain requirements on their signatures. The real Edit distance is only calculated if a pair is not pruned. There are a number of studies on how to choose sketches [Gravano et al., 2001, Chaudhuri et al., 2006, Li et al., 2007, Xiao et al., 2008a, Yang et al., 2008, Li et al., 2011a, Qin et al., 2011] and how to do post-pruning before verification [Gravano et al., 2001, Xiao et al., 2008a, Qin et al., 2011] to improve performance.

• Enumeration-based. The enumeration-based method generates the entire τ-Edit-distance neighbourhood of the strings in the database. This exhaustive enumeration can answer any Edit distance query within τ in O(1) time. The naive enumeration method only works when τ is extremely small (0 or 1) because of the large number of enumerations. [T. Bocek, 2007] proposes to enumerate the deletion neighbourhoods instead of the real neighbourhood to significantly reduce the number of enumerations. Recent studies [Wang et al., 2009, Li et al., 2011a] introduce a partition strategy that reduces the general Edit distance problem to several small Edit distance sub-problems, then uses enumeration to solve each sub-problem. Apart from that, [Arasu et al., 2006] proposes to do the partitioning and enumeration on the domain, and [Li et al., 2011b] introduces a bit-operation-based strategy.

• Automata-based. Most automata-based algorithms are based on the trie structure, which is a prefix tree (and can be viewed as a DFA). Generally, the strings are organised into a trie, such that strings with the same prefix are grouped together. When a query comes in, the trie is traversed top-down to find the answers. Unlike the sketch-based algorithms, with a trie the verification can be done during the traversal. This means there is no need for verification after filtering, so it is highly efficient for very short strings. An open problem for this category is to utilise not only prefix sharing but also suffix sharing. [Mihov and Schulz, 2004, Chaudhuri and Kaushik, 2009, Wang et al., 2010, Deng et al., 2012, Qin et al., 2013, Deng et al., 2013, Xiao et al., 2013] all belong to this category.

2.1.2.3 Hamming Distance

Hamming distance measures the number of dimensions where two vectors have different values. Between two equal-length strings, Hamming distance can be viewed as a special case of Edit distance where the only edit operation is substitution. Therefore, almost all algorithms designed for Edit similarity can be directly applied to the Hamming distance constraint. However, there are also studies focusing specifically on the Hamming distance query; these usually obtain more efficient solutions by exploiting the unique properties of the Hamming distance measure [Yao and Yao, 1997, Brodal and Gasieniec, 1996, Brodal and Venkatesh, 2000, Manku et al., 2007, Tabei et al., 2010, Liu et al., 2011, Norouzi et al., 2012]. We give a detailed survey of these studies in the next section.

2.2 Hamming Distance Query

In this section, we introduce works that are highly related to the Hamming distance query problem. We first present some studies in the theoretical area, then show some practical solutions, and next introduce works on this problem in other areas, especially Chemical Informatics. Note that some works also deal with similarity search among vectors, but in a non-metric space and focusing on a different aspect, such as [Dayal et al., 2006]; we do not discuss them in this thesis.

2.2.1 Theoretical Studies

The Hamming distance query with threshold k was originally known as the k-query problem. [Minsky and Papert, 1987] first proposed the d-query problem (d plays the same role as k), which asks whether there exists a string in a dictionary (consisting of n strings) within Hamming distance d of a given (binary) query string Q of length m. Different solutions are required for small d and large d. For the special case d = 1, there exist many efficient solutions [Yao and Yao, 1997, Brodal and Gasieniec, 1996, Brodal and Venkatesh, 2000].

Yao et al. [Yao and Yao, 1997] present a data structure for the 1-query that achieves query time O(m log log n) bit probes and space O(mn log m) bits in the bit probe model. The basic idea of their work is that given two strings whose Hamming distance is at most one, if these two strings are divided into two partitions arbitrarily, at least one partition must be identical. Based on this idea, they first construct a dictionary Dw recursively for the database strings using the FKS dictionary [Fredman et al., 1984]. When a query comes in, it is first looked up in Dw to check whether a direct answer can be returned. If not, the query is divided into two parts and each part is recursively checked in Dw; the final answer is returned when the recursion ends.

Brodal et al. [Brodal and Gasieniec, 1996] introduce two data structures, both of which answer the 1-query in O(m) memory accesses using O(mn) words of space, based on the standard unit-cost RAM model with logarithmic word size. The basic structure they use is the trie, a prefix tree usually used to represent a set of strings. The first data structure is simple: a single trie that stores all strings within Hamming distance 1 of any string in the data collection. The basic idea of the second data structure is that if two strings have Hamming distance exactly 1, they must share a prefix and a suffix with one mismatch in between. Therefore, two tries are organised in the second structure: one for the original strings and the other for the reversed suffixes of the strings. They guarantee that the second data structure can be constructed in O(mn) time.

[Brodal and Venkatesh, 2000] constructs a data structure that answers the 1-query in O(1) time using O(n log m) space in a cell probe model with word size m. The basic idea is that if a query is identical to a string in the database, it is easy to find it using an index. Otherwise, if the query is at Hamming distance one from a database string, they just need to find the mismatched position, flip it, and check the resulting string against the database again. To use this idea, they create a hash table index using the FKS dictionary [Fredman et al., 1984]. They also create a perfect hash function [Schmidt and Siegel, 1990] for the set of all strings at Hamming distance one from a database string, and use the hash values as entries to record the positions where each 1-neighbour string differs from the original database string. When a query comes in, they first check it against the index of database strings. If it is not found, they use the index of 1-neighbours to find the mismatched positions, flip one position at a time, and check each changed string against the index of database strings.

The d-query for large d is much harder, with few results beating the naive solution with O(m^d) query time. The state-of-the-art result is obtained by [Cole et al., 2004], which answers a d-query, where d = O(1), in time O(m + log^d(nm) + occ) using space O(n log^d(nm)), where occ is the number of query results. The basic idea is to build a trie over the database strings and, for each node in the trie except the leaves, an auxiliary data structure named the substitution tree to help accelerate the query.

2.2.2 Practical Solutions

Due to the wide range of applications of Hamming distance queries (e.g., those mentioned in Chapter 1.1), many practical solutions have been proposed. Almost all of them are based on a reduction framework: the k-query problem is reduced to several k′-query sub-problems, where k′ < k.

[Manber and Wu, 1994] essentially suggests indexing all the 1-variants of the strings in the dictionary to answer the 1-query efficiently.

To handle small k, [Manku et al., 2007, Tabei et al., 2010] divide the strings arbitrarily into k + 1 partitions, such that for each pair within the Hamming distance threshold k, at least one partition of the data string exactly matches the corresponding partition of the query. [Manku et al., 2007] also proposes methods to apply the same idea recursively. Because this study [Manku et al., 2007] is closely related to our work, we give a more detailed illustration of their techniques in Chapter 3.2.2.

The PartEnum method [Arasu et al., 2006] proposes a two-level partitioning strategy. First, the N dimensions are divided into κ1 (κ1 ≤ k + 1) equal-sized partitions. By the pigeonhole principle, any two vectors within Hamming distance k must have at least one partition with Hamming distance at most k′ = (k + 1)/κ1 − 1. Next, the dimensions in each partition are divided into κ2 (κ1 · κ2 > k + 1) second-level partitions, and all possible subsets of the second-level partitions of size κ2 − k′ are enumerated. Each subset (its values along with its dimensions) is considered a signature. Under this partitioning strategy, if two vectors are within Hamming distance k, they must have at least one signature in common. The main disadvantage of this method is that the exhaustive enumeration over the two-level partitions results in a huge number of signatures, which significantly hurts both query performance and index space.

The above methods work well for very small values of k. When k becomes larger, they cannot deal with the problem efficiently, as the number of dimensions in each partition becomes small, which results in poor selectivity. Recent work addresses this limitation by reducing the general problem into several 1-query sub-problems [Liu et al., 2011, Norouzi et al., 2012].

In [Liu et al., 2011], the number of partitions κ is chosen to be ⌊k/2⌋ + 1, such that two vectors within the Hamming distance threshold k must have at least one partition within Hamming distance 1. To efficiently support the 1-query, [Liu et al., 2011] proposes a solution that enumerates the entire 1-neighbourhood of the query. We give a more detailed illustration of the techniques in [Liu et al., 2011] in Chapter 3.2.2.

[Norouzi et al., 2012] uses a very similar strategy to [Liu et al., 2011]. The difference between the two proposals is that [Liu et al., 2011] replicates the data while [Norouzi et al., 2012] resorts to indexes; the latter was implemented and compared with in our experiments. Although both papers mention the possibility of k′ > 1, this results in a very large index, as its size is super-linear in the number of dimensions. We also note that the approaches reducing to 1-queries (including ours) are better than those reducing to 0-queries via two-level partitioning [Manku et al., 2007, Arasu et al., 2006]: they have similar signature lengths (2N/(k + 2) vs. (2k + 1)N/(k + 1)²), but the latter generate far more signatures per vector (k/2 vs. (k + 1)²) (see Table 3.2); therefore, we do not compare with the latter in our experiments.

2.2.3 Solutions in Other Areas

Because Hamming distance is widely used in different areas, there is also plenty of work on Hamming distance outside the database and IR areas. As far as we can tell, most of these works follow, or are very similar to, the above solutions [Landré and Truchetet, 2007, Daugman, 1993, Wu et al., 2009, Bowyer et al., 2008, Miller et al., 2005]. However, we find some solutions in Chemical Informatics very interesting and list them below.

Solutions in Chemical Informatics. Similarity queries with a Tanimoto threshold on binary fingerprints of chemicals are widely used in Chemoinformatics applications [Flower, 1998, Chen et al., 2005, Chen et al., 2009, Nasr et al., 2009, Norouzi et al., 2012]. There exist many specialised solutions [Swamidass and Baldi, 2007, Baldi et al., 2008, Nasr et al., 2010, R. Nasr, 2011, Nasr et al., 2012], most of which are based on bounding the number of 1-bits in the fingerprints or their partitions.

[Swamidass and Baldi, 2007] develops a bound on the number of 1-bits given a query fingerprint, and [Nasr et al., 2010] further applies this idea to partitioned fingerprints. Another 1-bit bound is developed in [Baldi et al., 2008], where fingerprints are "folded" down to shorter fingerprints via the XOR operation, and a bound on the similarity can be established on the short fingerprints. [R. Nasr, 2011] proposes a method named MultiBit Tree, a binary tree built recursively by choosing a certain dimension to split the remaining fingerprints. At query time, a depth-first traversal of the tree is performed together with pruning based on the number of 1-bits.

One of the latest methods is [Nasr et al., 2012], where each fingerprint is transformed into a set (as in Chapter 3.2) and an inverted index is built on the set elements. This essentially reduces the original problem into a set overlap search problem, where the DivideSkip method proposed in their earlier work [Li et al., 2008] is employed for query processing.

2.2.4 Relationship with Other Similarity Measures

Different applications usually come with different similarity measures. In fact, several prevalent similarity measures can be equivalently transformed into Hamming distance constraints.

Relationship with Several Popular Similarity Constraints. First we introduce several widely used similarity functions. Let S and T be binary vectors, and let set(S) be the set representation of S, defined as set(S) = { Di | S[Di] ≠ 0 }.

• Jaccard similarity is defined as J(set(S), set(T)) = |set(S) ∩ set(T)| / |set(S) ∪ set(T)|

• Cosine similarity is defined as C(set(S), set(T)) = Σᵢ Sᵢ · Tᵢ / (√|set(S)| · √|set(T)|)

• Overlap similarity is defined as O(set(S), set(T)) = |set(S) ∩ set(T)|

The equivalent forms of the above similarity measures are as follows:

• J(Q, S) ≥ t ⇐⇒ H(Q, S) ≤ ((1 − t)/(1 + t)) · (|set(Q)| + |set(S)|)

• C(Q, S) ≥ t ⇐⇒ H(Q, S) ≤ |set(Q)| + |set(S)| − 2 · ⌈t · √(|set(Q)| · |set(S)|)⌉

• O(Q, S) ≥ α ⇐⇒ H(Q, S) ≤ |set(Q)| + |set(S)| − 2α

Relationship with Tanimoto Similarity. In Chemical Informatics, molecules can be represented by binary vectors called fingerprints [Flower, 1998]. One of the most popular measures of similarity between fingerprints is the Tanimoto similarity [Nasr et al., 2009], which is essentially the Jaccard similarity and is defined as:

T(S, T) = |set(S) ∩ set(T)| / |set(S) ∪ set(T)|

We can derive the following equivalence between a constraint based on Tanimoto similarity and one based on Hamming distance:

T(Q, S) ≥ t ⇐⇒ H(Q, S) ≤ ((1 − t)/(1 + t)) · (|set(S)| + |set(Q)|)

If we perform a search using a Tanimoto similarity threshold of t, we can derive the threshold of a Hamming distance query as kQ = ((1 − t)/t) · |set(Q)|. This is because for any result S satisfying T(Q, S) ≥ t, we know that

|set(S)| ∈ [t · |set(Q)|, |set(Q)|/t].
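A small sketch of this reduction (the helper name and example values are ours):

import math

def hamming_threshold(q_ones, t):
    # q_ones = |set(Q)|, the number of 1-bits in the query fingerprint;
    # k_Q = (1 - t) / t * |set(Q)| covers every S with T(Q, S) >= t
    return math.floor((1 - t) / t * q_ones)

# e.g. a query fingerprint with 200 one-bits and Tanimoto threshold t = 0.75
# (assumed example values):
assert hamming_threshold(200, 0.75) == 66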

Chapter 3

HmSearch: An Efficient Hamming Distance Query Processing Algorithm

3.1 Overview

In this chapter, we report our studies on the Hamming distance query problem.

We present some preliminaries first, then demonstrate our HmSearch algorithms.

Next, we illustrate how to optimise HmSearch for LSH data. After that, we present some discussions and our experimental results. Finally, we conclude the chapter.

3.2 Background Information

In this section, we first formally define the Hamming distance query problem in Chapter 3.2.1, and then describe the most closely related techniques in Chapter 3.2.2.


3.2.1 Problem Definition

Since Hamming distance works on a fixed number of dimensions, we consider all data and query vectors to have N dimensions in this thesis. The i-th dimension is denoted Di, and V[Di] represents the i-th dimension value of a vector V. Without loss of generality, we assume the domain of possible values for every Di is the same, denoted Σ.

Let ∆(x, y) = 0 if x = y and 1 otherwise. The Hamming distance between two vectors S and T is defined as:

H(S, T) = Σ_{i=1}^{N} ∆(S[i], T[i])

If we consider T using S as a yardstick, we can also say T has made H(S, T) errors with respect to S.

Given a dataset V of vectors, a Hamming distance query of a query vector Q and threshold k retrieves all vectors in the dataset with Hamming distance to Q no more than k, or

{ vi ∈ V | H(vi,Q) ≤ k }

Such a query is also known as the k-query due to [Minsky and Papert, 1987].
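As a baseline, the k-query can always be answered by a naive linear scan directly following the definitions above (a sketch; the function names are ours):

def hamming(s, t):
    return sum(1 for a, b in zip(s, t) if a != b)

def k_query(database, q, k):
    # all vectors v in the database with H(v, Q) <= k
    return [v for v in database if hamming(v, q) <= k]

db = [(0, 1, 2), (0, 1, 3), (4, 5, 6)]
assert k_query(db, (0, 1, 2), 1) == [(0, 1, 2), (0, 1, 3)]

All the indexing methods discussed in this chapter are ultimately attempts to beat this O(|V| · N) scan.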

3.2.2 Most Closely Related Techniques

To better illustrate our work, we briefly describe the two most closely related techniques for processing the Hamming distance query.

Google. Google's algorithm was introduced in [Manku et al., 2007]. The idea is based on the observation that if two vectors are within Hamming distance k, then by dividing the two vectors arbitrarily into k + 1 segments, at least one of the k + 1 segments must be exactly the same for the two vectors. Based on this idea, they develop the following technique (see Figure 3.1). Their method first partitions the vectors in the database into k + 1 even partitions, then replicates the database vectors k + 1 times (each replica denoted as a table), where each vector in the i-th table (i ranges from 1 to k + 1) has its i-th partition moved to the leftmost position. When a query comes in, the same partitioning and replication strategy is applied to the query vector, and each replicated query is put into the table with the same partition ID at the leftmost position. Next, the vectors in each table are sorted by the leftmost partition, and vectors whose leftmost partition is identical to that of the query replica are fetched. Finally, results from each table are merged to obtain the final results.

[Figure 3.1: Google's Method]

They further develop a recursive version of this method. The idea is that, following the previous setting, given one of the k + 1 partitions, if the remaining dimensions are again partitioned into k + 1 partitions, at least one of the k + 1 segments of the remaining part must be exactly the same for the two vectors. This process can be applied recursively; Figure 3.2 shows a two-level partitioning.

[Figure 3.2: Google's Method, Applied Recursively]

This recursive method helps decrease the number of results retrieved from each table, because intuitively there is more information to look at during the sorting. However, each additional level of recursion multiplies the total number of tables by k + 1; therefore, at most a two-level partitioning is practical. Note that Google's method was designed for distributed computing: the database vectors are replicated as tables and placed on a farm of machines, which contributes a decent load balance.

HEngine. HEngine was introduced in [Liu et al., 2011]. Its basic idea can be viewed as an extension of Google's one-level partitioning method: if two vectors are within Hamming distance k, then by dividing the two vectors arbitrarily into κ segments, where κ ≥ ⌊k/2⌋ + 1, at least m = κ − ⌊k/2⌋ partitions are within Hamming distance 1 for the two vectors. Based on this idea, a general Hamming distance problem can be reduced to several 1-Hamming-distance problems. Each 1-Hamming-distance problem is then tackled by enumerating the whole 1-substitution neighbourhood of the query's partition plus the partition itself ((|Σ| − 1) · l_pi + 1 strings, where l_pi is the length of the i-th partition) and finding the vectors that have identical partitions with the query. The process of HEngine is presented in Figure 3.3.

[Figure 3.3: HEngine]

At the beginning, the vectors in the database are partitioned into ⌊k/2⌋ + 1 parts, so m = 1. The partitioned vectors are then replicated into ⌊k/2⌋ + 1 tables, with the i-th partition moved to the leftmost position, similar to the one-level partition version of Google's method. When a query comes in, the query vector is also partitioned into ⌊k/2⌋ + 1 parts and replicated ⌊k/2⌋ + 1 times with each partition at the leftmost. Then, for the leftmost partition of each replicated query, its entire 1-substitution neighbourhood (e.g., T1(1) to T1(f(T1)) in Figure 3.3, where f(·) gives the number of enumerations) is generated by substituting the element at each position of that partition with every possible element of the alphabet. Next, each replicated query and its enumerations are put into the table with the same partition ID at the leftmost: the vectors in each table are sorted by the leftmost partition, and vectors whose leftmost partition is identical to that of an enumerated query replica are found. Finally, results from all replications are merged to generate the final results.

The maximum number of probes in HEngine is then (m/κ + 1)^q · C(κ, q), where m is the string length and q is the threshold of the sub-queries; therefore, probably only q = 1 is practical. In addition, HEngine also has a recursive version, similar to that of Google's method. However, since HEngine is not designed for distributed computing, the recursion greatly increases the number of probes; hence the recursive version is probably not practical.
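A sketch of the 1-substitution neighbourhood enumeration that HEngine relies on (assumed representation: a partition is a tuple over the alphabet Σ); the enumeration has (|Σ| − 1) · l + 1 members for a partition of length l:

def one_neighbourhood(partition, sigma):
    # the partition itself plus every single-position substitution
    yield partition
    for i, old in enumerate(partition):
        for c in sigma:
            if c != old:
                yield partition[:i] + (c,) + partition[i + 1:]

sigma = (0, 1)
variants = list(one_neighbourhood((0, 1, 1), sigma))
assert len(variants) == (len(sigma) - 1) * 3 + 1  # = 4 for a binary alphabet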

3.3 Reduction of the General Hamming Distance Problem

In this section, we first illustrate how to reduce the general Hamming distance problem to Hamming distance problems with a smaller threshold (denoted k′). Next, we discuss our heuristics for choosing k′ and the number of partitions κ.

3.3.1 Reduction Strategy

The prevalent approach to answering Hamming distance queries is based on reducing the problem, via partitioning, into several instances of Hamming distance query with a lower threshold.

First, we introduce a few concepts. We consider a partitioning scheme that divides the N dimensions into κ partitions; each partition, denoted pi, is a subset of dimensions { Di,1, Di,2, ..., Di,|pi| }. Given a vector v, its projection on a partition, i.e., v[Di,1, ..., Di,|pi|], forms a new projected vector, denoted v^i.

Definition 1 (Match, Exact-Match, and k′-Match) Given two partitions pi and pj, if H(pi, pj) ≤ 1, they are said to match each other. In addition, if H(pi, pj) = 0, they are said to exact-match each other; if H(pi, pj) = k′ (k′ ≠ 0), they are said to k′-match each other.

Lemma 1 Given two vectors S and T such that H(S, T) ≤ k, if we divide the N dimensions (arbitrarily) into κ partitions, there are at least m = κ − ⌊k/(k′ + 1)⌋ partitions, { p1, p2, ..., pm }, such that H(S^i, T^i) ≤ k′ for all 1 ≤ i ≤ m, where k′ ≥ ⌊k/κ⌋ and k′ ∈ ℕ.

Proof. Assume on the contrary that at most m − 1 partitions have at most k′ errors. Then all the remaining κ − m + 1 partitions have at least k′ + 1 errors each. Let β = k′ + 1. Since κ − m + 1 = ⌊k/β⌋ + 1, the total number of errors is at least

β · (⌊k/β⌋ + 1) > k,

which contradicts the condition that H(S, T) ≤ k.

This lemma is a generalisation of previous results (such as Theorem 3.1 in [Liu et al., 2011] and [Norouzi et al., 2012]). As such, it has several instantiations, each resulting in a different algorithm. For example, [Manku et al., 2007] chooses k′ = 0 and κ = k + 1, such that there must be m = 1 exact-matching partition, as ⌊k/(k + 1)⌋ = 0. [Liu et al., 2011] essentially chooses k′ = 1 and κ = ⌊k/2⌋ + 1, hence entailing at least m = 1 1-matching partition. [Norouzi et al., 2012] considers the general case of choosing any κ, but fails to capitalise on cases where m could be greater than 1.

The overall query processing method based on reduction can also be captured in the following general framework:

• In the indexing phase, each vector in the database is partitioned into κ partitions. Each partition is indexed in such a way that it is possible to efficiently answer a Hamming distance query with threshold ⌊k/κ⌋ over the projections of all vectors on this partition.

• In the query processing phase (see Algorithm 1), the query vector is partitioned in the same way into κ partitions. A special Hamming distance query with threshold k′ is issued on each query partition to obtain a list of candidate vectors whose corresponding partition has at most k′ Hamming distance from the query partition Qi (Line 4). The returned results of these κ queries are added to the CAND hash table, which counts the number of times a vector has been encountered. We perform the filtering (see Algorithm 2), which essentially checks the occurrence count against m. If a vector passes the filtering, it is then verified (Line 7) against the entire query Q.

Algorithm 1: HammingQuery(Q, k, κ)
/* generate candidates */
1  CAND ← empty hash table that maps a vector ID to an integer;
2  partition(Q, κ);
3  for each i-th partition Q^i of the query Q do
4      for each vector ID v ∈ reducedHammingQuery(Q^i, ⌊k/κ⌋) do
5          CAND[v] ← CAND[v] + 1;   // CAND[v] is initialised to 0 upon first visit
/* filtering and then verification */
6  for each candidate v ∈ CAND do
7      if filter(v, κ − ⌊k/(⌊k/κ⌋ + 1)⌋) = false then
8          if verify(Q, v) then
9              output v;

Algorithm 2: filter(v, m)
Output: Returns true if v is filtered (i.e., disqualified)
1  if CAND[v] < m then
2      return true;
3  return false;

Remark 1 In addition to the indexing approach described above, the other option is to replicate the vectors and keep the vectors in each copy sorted. Binary search is used on each copy to locate candidates. This method usually incurs much overhead in both space and query time, and is mostly adopted in distributed systems to achieve a high degree of parallelism [Manku et al., 2007].
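To make the reduction framework concrete, the following is a minimal C++ sketch of Algorithm 1's candidate generation, filtering, and verification loop. The reducedQuery callback stands in for the per-partition index lookup described in Chapter 3.4; all names here are illustrative, not part of any fixed API.

    #include <cstddef>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    using Vector = std::vector<int>;

    // Exact Hamming verification against the entire query (early exit at k+1).
    static bool verify(const Vector& q, const Vector& v, int k) {
        int errs = 0;
        for (std::size_t i = 0; i < q.size(); ++i)
            if (q[i] != v[i] && ++errs > k) return false;
        return true;
    }

    // Sketch of Algorithm 1. reducedQuery(i, kPrime) returns the IDs of
    // vectors whose i-th partition is within Hamming distance kPrime of
    // the query's i-th partition.
    std::vector<int> hammingQuery(
        const Vector& q, const std::vector<Vector>& db, int k, int kappa,
        const std::function<std::vector<int>(int, int)>& reducedQuery) {
        const int kPrime = k / kappa;            // floor(k / kappa)
        const int m = kappa - k / (kPrime + 1);  // occurrence bound (Lemma 1)
        std::unordered_map<int, int> cand;       // vector ID -> match count
        for (int i = 0; i < kappa; ++i)
            for (int v : reducedQuery(i, kPrime))
                ++cand[v];                       // initialised to 0 on first visit
        std::vector<int> results;
        for (const auto& [v, count] : cand)      // filter (Algorithm 2), then verify
            if (count >= m && verify(q, db[v], k))
                results.push_back(v);
        return results;
    }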

3.3.2 Heuristics of Choosing κ and k′

Given Lemma 1, the general Hamming distance problem can be reduced to $\kappa$ Hamming distance problems with threshold $k'$, and the final result is the merge of the results of those subproblems. $\kappa$ and $k'$ are parameters which can be customised, where $k' \ge \lfloor k/\kappa \rfloor$.

Intuitively, an efficient solution for the case $k' = 0$ is straightforward, while a larger $k'$ will probably result in much slower processing of each $k'$-query.

Based on our survey of the existing work on this problem and on our own research, the largest value of $k'$ that can be efficiently handled is 1 [Liu et al., 2011] (in Chapter 3.4.1, we present our methods, which also handle the case $k' = 1$ but more efficiently). Hence the value of $k'$ effectively has only two options: 0 and 1.

Next, we present a model to quantify the pruning power (and hence the size of CAND) with respect to $\kappa$. Let the query generate $\kappa$ partitions. We model $P_{\ge m}$, the probability that a set $s$ matches at least $m$ partitions of the query.

Lemma 2 Assume the matching conditions of the partitions are independent, and that a set matches any given partition of the query with probability $p_{sig}$. Then
$$P_{\ge m} = \sum_{i=m}^{\kappa} \binom{\kappa}{i} \, p_{sig}^{i} \, (1 - p_{sig})^{\kappa - i} = 1 - I_{1 - p_{sig}}(\kappa - m + 1, m),$$
where $I$ is the regularised incomplete beta function.
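As an illustration of Lemma 2, the following C++ sketch evaluates $P_{\ge m}$ directly as the binomial tail sum (rather than through the incomplete beta function); kappa, m, and psig are the quantities defined above.

    #include <cmath>
    #include <cstdio>

    // Probability that at least m of kappa independent partitions match,
    // each with probability psig (the binomial tail sum in Lemma 2).
    double pAtLeastM(int kappa, int m, double psig) {
        double total = 0.0;
        for (int i = m; i <= kappa; ++i) {
            // log of the binomial coefficient C(kappa, i), via lgamma
            double logC = std::lgamma(kappa + 1.0) - std::lgamma(i + 1.0)
                        - std::lgamma(kappa - i + 1.0);
            total += std::exp(logC + i * std::log(psig)
                                   + (kappa - i) * std::log1p(-psig));
        }
        return total;
    }

    int main() {
        // e.g., kappa = 5 partitions, m = 2 required matches, psig = 0.01
        std::printf("P>=m = %.6g\n", pAtLeastM(5, 2, 0.01));
        return 0;
    }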

Intuitively, given a fixed $k$, a smaller $\kappa$ allows longer partitions. Therefore $p_{sig}$ will be smaller, because each partition is more selective. Given Lemma 2, $p_{sig}$ is a key parameter in the formulation, and the pruning power increases as $p_{sig}$ decreases. Viewed from this angle, a smaller $\kappa$ will probably improve the pruning power. However, we also notice that in Lemma 1 a larger $\kappa$ results in a larger $m$, and that in Lemma 2, if $p_{sig}$ is fixed, a larger $\kappa$ will probably lead to a larger $P_{\ge m}$. There is therefore a trade-off. In our experience, a larger $m$ always requires a more complicated merging step that is difficult to execute efficiently; meanwhile, $p_{sig}$ is usually the more important factor, especially when $k$ is not large. Hence we generally prefer a smaller $\kappa$.

Then we consider the relationship between $\kappa$ and $k'$. Based on Lemma 1, we can easily derive the following lemma:

Lemma 3 Based on Lemma 1, given a $k'$, $\kappa > \frac{k}{k'+1}$.

Given Lemma 3, if $k$ is fixed, $\kappa$ is bounded from below in terms of $k'$, and this lower bound decreases as $k'$ increases. Because we wish to keep $\kappa$ small, we set $k' = 1$, so $\kappa > \frac{k}{2}$ and the lower bound on $\kappa$ is $\lceil \frac{k+1}{2} \rceil$. Notice that when $k$ and $k'$ are both fixed, a slightly larger $\kappa$ results in a larger $m$, which can increase the pruning power when $k$ is large. Therefore, in our method, we set $k' = 1$ and $\kappa = \lfloor \frac{k+3}{2} \rfloor$, which is slightly larger than the lower bound. By doing this, we can raise the value of $m$ to 2 with almost no loss of selectivity. In Chapter 3.6 we will show some improvements based on these settings, which further improve the pruning power.

3.4 Answering the Reduced Query

In this section, we introduce the definitions of 1-variants and 1-deletion variants, and then illustrate how to use each of these two kinds of variants to efficiently answer the Hamming distance query for $k' = 1$.

3.4.1 Variants and Deletion Variants

The 1-variants of a vector $v$ with respect to the value domain $\Sigma$ are all the vectors $v'$ in $\Sigma^N$ such that $H(v, v') \le 1$, denoted collectively as 1-Var-Set($v$). $v$ is by definition its own 1-variant. If we exclude $v$ itself, the remaining 1-variants are called $v$'s strict 1-variants. They can be computed easily by substituting $v[i]$ ($1 \le i \le N$) with another character from $\Sigma$. The total number of 1-variants of a vector is therefore $1 + (|\Sigma| - 1) \cdot N$.

Let $\Sigma^* = \Sigma \cup \{\#\}$. The 1-deletion variants [Wang et al., 2009] of a vector $v$ are all the vectors obtained by substituting $v[i]$ with the deletion marker $\#$, plus $v$ itself. If we remove $v$ itself, the rest are called $v$'s strict 1-deletion variants, and are collectively denoted as Strict-1-Del-Var-Set($v$). The total number of 1-deletion variants is $1 + N$.

All the above different kinds of variants are referred to as variants generically.

Example 1 Consider $v = [1_1, 2_2, 1_3]$ and $\Sigma = \{1, 2, 3\}$. Its 1-variants are: $[1_1, 2_2, 1_3]$, $[2_1, 2_2, 1_3]$, $[3_1, 2_2, 1_3]$, $[1_1, 1_2, 1_3]$, $[1_1, 3_2, 1_3]$, $[1_1, 2_2, 2_3]$, $[1_1, 2_2, 3_3]$. Its strict 1-deletion variants are $[\#, 2_2, 1_3]$, $[1_1, \#, 1_3]$, $[1_1, 2_2, \#]$.
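The two kinds of variants can be generated mechanically from the definitions; the following C++ sketch does so over an integer alphabet {0, ..., sigma-1}, with -1 standing in for the deletion marker #.

    #include <cstddef>
    #include <vector>

    using Vector = std::vector<int>;
    constexpr int DELETION_MARK = -1;  // stands in for the '#' marker

    // All 1-variants of v: v itself plus every vector obtained by
    // substituting one position with a different symbol.
    std::vector<Vector> oneVariants(const Vector& v, int sigma) {
        std::vector<Vector> out{v};               // v is its own 1-variant
        for (std::size_t i = 0; i < v.size(); ++i)
            for (int c = 0; c < sigma; ++c)
                if (c != v[i]) {
                    Vector u = v;
                    u[i] = c;
                    out.push_back(u);             // total: 1 + (|Sigma|-1) * N
                }
        return out;
    }

    // Strict 1-deletion variants: substitute each position with the marker.
    std::vector<Vector> strictOneDeletionVariants(const Vector& v) {
        std::vector<Vector> out;
        for (std::size_t i = 0; i < v.size(); ++i) {
            Vector u = v;
            u[i] = DELETION_MARK;
            out.push_back(u);                     // total: N
        }
        return out;
    }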

3.4.2 1-Query Processing using Variants and Deletion Variants

3.4.2.1 1-Query Processing using 1-Variants

Lemma 4 Consider two vectors $S$ and $T$ such that $H(S,T) \le 1$. Then 1-Var-Set($S$) $\cap \{T\} \ne \emptyset$.

According to Lemma 4, we can use the following procedure to answer a Hamming distance query with threshold $k' = 1$.

• Indexing. We generate all the 1-variants for every vector in the database and index the variants using an inverted index $I$.

• Query Processing. We directly look up the query in the index. The returned results are exactly the query results.

The index space complexity of this method is O(|Σ| · N · n). The query time complexity is O(1 + occ), where occ denotes the number of query results.
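To make the O(1 + occ) lookup concrete, here is a minimal sketch of the 1-variant inverted index; the string serialization of variants used as hash keys is an implementation detail assumed here, not part of the method.

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    using Vector = std::vector<int>;

    // Serialize a vector to key a hash map; ',' keeps values unambiguous.
    static std::string key(const Vector& v) {
        std::string s;
        for (int x : v) { s += std::to_string(x); s += ','; }
        return s;
    }

    struct OneVariantIndex {
        std::unordered_map<std::string, std::vector<int>> inv;  // variant -> IDs

        // Index every 1-variant of v; space is O(|Sigma| * N * n) overall.
        void add(int id, const Vector& v, int sigma) {
            inv[key(v)].push_back(id);            // v is its own 1-variant
            for (std::size_t i = 0; i < v.size(); ++i)
                for (int c = 0; c < sigma; ++c)
                    if (c != v[i]) {
                        Vector u = v;
                        u[i] = c;
                        inv[key(u)].push_back(id);
                    }
        }

        // A single lookup answers the 1-query: O(1 + occ).
        const std::vector<int>* query(const Vector& q) const {
            auto it = inv.find(key(q));
            return it == inv.end() ? nullptr : &it->second;
        }
    };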

3.4.2.2 1-Query Processing using Strict 1-Deletion Variants

Many existing Hamming distance query processing methods assume a binary value domain, in which case 1-variant-based methods are almost always preferred to those based on 1-deletion variants (to be introduced below), as the former achieve $O(1)$ query time at the cost of merely doubling the index space. However, when $|\Sigma|$ is large (e.g., $|\Sigma|$ can be as large as 172 for vectors generated by MinHash [Broder et al., 1997]), 1-variant-based methods incur an excessive amount of space usage (and building time) for the index, which is neither practical nor competitive.

We now study the 1-deletion variants and their query processing method.

The following Lemma gives a necessary condition for two vectors to be within Hamming distance 1, based on the intersection of their strict 1-deletion variants.

Lemma 5 Consider two vectors $S$ and $T$ such that $H(S,T) \le 1$. Then Strict-1-Del-Var-Set($S$) $\cap$ Strict-1-Del-Var-Set($T$) $\ne \emptyset$.

Note that strict 1-deletion variants, rather than 1-deletion variants, are used in the above Lemma. This is simply because if S = T , they will have N common strict 1-deletion variants anyway.

According to Lemma 5, we can use the following procedure to answer a Hamming distance query with threshold $k' = 1$.

• Indexing. We generate all the strict 1-deletion variants for every vector in the database and index the variants using an inverted index $I$.

• Query Processing. We generate the strict 1-deletion variants of the query and look them up in the index. The returned results are merged to form the query results.

The index space complexity of this method is O(N · n). The query time complexity is O(N + N · occ).

Example 2 Continuing Example 1, we index all the strict 1-deletion variants of $v$ (and of the other vectors in the database). To process the query $Q = [1_1, 2_2, 3_3]$, we first generate all of $Q$'s strict 1-deletion variants: $[\#, 2_2, 3_3]$, $[1_1, \#, 3_3]$, and $[1_1, 2_2, \#]$; we then look them up in the inverted index and merge the returned results. $v$ will be found in the postings list of $I_{[1_1, 2_2, \#]}$.
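A matching sketch for the strict 1-deletion-variant index is shown below; the same serialized-key hash map as in the previous sketch is assumed, with -1 encoding the marker #.

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    using Vector = std::vector<int>;

    static std::string key(const Vector& v) {
        std::string s;
        for (int x : v) { s += std::to_string(x); s += ','; }
        return s;
    }

    struct DeletionVariantIndex {
        std::unordered_map<std::string, std::vector<int>> postings;

        // Index the N strict 1-deletion variants of v: O(N * n) space.
        void add(int id, const Vector& v) {
            for (std::size_t i = 0; i < v.size(); ++i) {
                Vector u = v;
                u[i] = -1;                        // deletion marker '#'
                postings[key(u)].push_back(id);
            }
        }

        // IDs of vectors within Hamming distance 1 of q (Lemma 5);
        // duplicates may appear and are merged by the caller.
        std::vector<int> query(const Vector& q) const {
            std::vector<int> result;
            for (std::size_t i = 0; i < q.size(); ++i) {
                Vector u = q;
                u[i] = -1;
                auto it = postings.find(key(u));
                if (it != postings.end())
                    result.insert(result.end(), it->second.begin(), it->second.end());
            }
            return result;
        }
    };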

Compared to using strict 1-deletion variants, using 1-variants gives a much lower query time ($O(1)$). However, because the number of 1-variants depends on $|\Sigma|$, when $|\Sigma|$ is large (e.g., for the sets generated by MinHash [Broder et al., 1997, Theobald et al., 2008]), the number of 1-variants generated from the database vectors will probably be too large to index ($O(|\Sigma| \cdot N \cdot n)$). Because the generation of strict 1-deletion variants does not depend on $|\Sigma|$, it still works decently when $|\Sigma|$ is large. Therefore, we propose to use 1-variants when $|\Sigma|$ is small and strict 1-deletion variants when $|\Sigma|$ is large.

3.5 The HmSearch Algorithm

In this section, based on the heuristics mentioned above, we present HmSearch, our proposed query processing method with advanced threshold-based pruning and a technique to perform pruning and verification simultaneously.

3.5.1 Partitioning

In our HmSearch method, as discussed in Chapter 3.3.2, we choose to partition the dimensions into $\kappa = \lfloor \frac{k+3}{2} \rfloor$ partitions and choose $k' = 1$. According to Lemma 1, any query result vector must have at least one matching partition, i.e., a partition within Hamming distance at most 1. However, we show below that this condition can be strengthened, which helps keep the candidate size low as $k$ increases.

Our enhanced filtering is based on the following property of the partitioning scheme. Let $k = 2c$, where $c$ is an integer. It can be shown that $\kappa = c + 1$, and based on Lemma 1, $m = 1$. We observe that if the first $c + 1$ errors are distributed evenly into the $c + 1$ partitions, there are only $c - 1$ errors left to place into the $c + 1$ partitions; in this case, at least two 1-matches exist. By carefully analysing this condition, we find that if the first match is not an exact match, there must exist at least two 1-matches. A similar observation can be made for $k = 2c + 1$. Therefore, we establish the following Lemma, which gives a tighter filtering condition.

Lemma 6 (Enhanced Filtering Condition) Consider processing the Hamming distance query for $Q$ with threshold $k$, where the dimensions have been divided into $\kappa = \lfloor (k+3)/2 \rfloor$ partitions. A query result $S$ must satisfy the following conditions:

• If $k$ is even, $S$ must have at least one exact-matching partition, or two 1-matching partitions.

• If $k$ is odd, $S$ must have at least two matching partitions of which at least one is an exact-match, or $S$ must have at least three 1-matching partitions.

Proof. When $k$ is even, let $k = 2c$; then $\kappa = c + 1$. Assume the contrary, i.e., there is no exact-matching partition and at most one 1-matching partition. Then one partition has at least 1 error and the remaining $c$ partitions have at least 2 errors each. The total number of errors is at least $2c + 1 = k + 1$, which contradicts the fact that $S$ is a query result.

When $k$ is odd, let $k = 2c + 1$; then $\kappa = c + 2$. Assume the contrary, i.e., there are at most two 1-matching partitions (and no exact-match), or only one exact-matching partition. In the former case, both matching partitions have at least 1 error each and the remaining $c$ partitions have at least 2 errors each, so the total number of errors is at least $2c + 2 = k + 1$. In the latter case, $c + 1$ partitions have at least 2 errors each, so the total number of errors is at least $2c + 2 = k + 1$. Both cases contradict the fact that $S$ is a query result.

This Lemma helps control the growth of the candidate size as $k$ increases. As shown in Figures 3.8(l) and 3.8(m), the reduction in candidate size can reach up to two orders of magnitude when comparing HSV-nEB versus HSV and HSD-nEB versus HSD.

3.5.1.1 Implementation based on 1-Variants

Consider HmSearch implemented in the general framework of Algorithm 1, where reducedHammingQuery uses strict 1-variants as signatures. Hence, we use the indexing and query processing methods described in Chapter 3.4.2.1 to implement reducedHammingQuery. The only subtlety is that we augment each signature with its partition ID, so that signatures from different partitions can share one index without interfering with each other.

Lemma 6 requires the ability to distinguish an exact-match from a 1-match. We achieve this by the following modification to the postings lists of the inverted index. The inverted index maps a signature $sig$ to $I_{sig}$, the list of vectors for which $sig$ is one of their 1-variants. We divide the vectors in the postings list into two parts: those that match $sig$ exactly, and those that have one error. We denote the former set as $I_{sig}[0]$ and the latter as $I_{sig}[1]$. This can be implemented by keeping an additional pointer at the beginning of the postings list which points to the starting entry of $I_{sig}[1]$, as shown in Figure 3.4. Therefore, if a candidate is returned from $I_{sig}[0]$, it is an exact-match; otherwise it is a 1-match. Finally, we check the number of matching partitions according to Lemma 6 in the function filter.
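One possible layout for such a split postings list is a single contiguous array of vector IDs plus the offset at which $I_{sig}[1]$ begins, as sketched below; the field names are illustrative assumptions.

    #include <cstddef>
    #include <vector>

    // Postings list for one signature: exact matches first, then 1-matches.
    // 'firstOneMatch' plays the role of the extra pointer in Figure 3.4.
    struct SplitPostings {
        std::vector<int> ids;        // I_sig[0] followed by I_sig[1]
        std::size_t firstOneMatch;   // offset where I_sig[1] starts

        // Visit each entry with its error count: 0 (exact) or 1 (1-match).
        template <typename Visit>
        void forEach(Visit&& visit) const {
            for (std::size_t i = 0; i < ids.size(); ++i)
                visit(ids[i], i < firstOneMatch ? 0 : 1);
        }
    };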

The complete listings of the algorithms are given in Algorithms 3 to 5.

Figure 3.4: Index for 1-Variants

Algorithm 3: oneHammingQuery1Var(q)
1  C ← ∅;
2  for each vector ID v in I_q[0] do
3      C ← C ∪ {(v, 0)};
4  for each vector ID v in I_q[1] do
5      C ← C ∪ {(v, 1)};
6  return C;

Algorithm 4: HmSearch-V(Q, k, κ)
/* generate candidates */
1  CAND ← empty hash table that maps a vector ID to a list of integers;
2  partition(Q, κ);
3  for each i-th partition Q^i of the query Q do
4      for each (v, err) ∈ oneHammingQuery1Var(Q^i) do
5          CAND[v].append(err);
/* filtering and then verification */
6  for each candidate v ∈ CAND do
7      if enhancedFilter(v, k) = false then
8          if HBVerify(Q, v) then   /* see Algorithm 7 */
9              output v;

Example 3 Consider $N = 4$, $k = 2$, $Q = [1_1, 1_2, 2_3, 2_4]$, and the following data vectors:

$v_1: [1_1, 1_2, 1_3, 1_4]$

$v_2: [1_1, 2_2, 1_3, 1_4]$

Algorithm 5: enhancedFilter(v, k)
Output: Returns true if v is filtered (i.e., disqualified)
1  errors ← CAND[v];   /* the list of per-partition error counts */
2  if k is even then
3      if errors has fewer than two entries then
4          if errors[0] = 1 then
5              return true;
6  else
7      if errors has fewer than three entries then
8          if errors has only one entry, or (errors[0] = 1 and errors[1] = 1) then
9              return true;
10 return false;
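For concreteness, the same filter can be written as a small C++ function; the errors list corresponds to CAND[v] above and is assumed non-empty, since a vector enters CAND only after at least one match.

    #include <vector>

    // Enhanced filter (Algorithm 5 / Lemma 6): returns true if the
    // candidate with per-partition error list 'errors' is disqualified.
    bool enhancedFilter(const std::vector<int>& errors, int k) {
        if (k % 2 == 0) {
            // even k: need one exact-match or two 1-matches
            if (errors.size() < 2 && errors[0] == 1) return true;
        } else {
            // odd k: need two matches including an exact one, or three 1-matches
            if (errors.size() < 3 &&
                (errors.size() == 1 || (errors[0] == 1 && errors[1] == 1)))
                return true;
        }
        return false;
    }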

Assume the domain is $\{1, 2\}$. Then $\kappa = \lfloor \frac{2+3}{2} \rfloor = 2$; the first partition consists of the first two dimensions and the remaining two dimensions form the second partition.

The variants and the index built for them are shown in Figure 3.4.

At the beginning, CAND and C are initialised to empty. The query is partitioned into $[1_1, 1_2]$ and $[2_3, 2_4]$. Since $[1_1, 1_2]$ is in the index, all its postings are fetched. $v_1$ is in $I[0]$, which means $Q$ and $v_1$ have an exact match on $[1_1, 1_2]$; we denote this matching as $(v_1, 0)$ and add it to C. Next, $v_2$ is in $I[1]$, which means $Q$ and $v_2$ match with one error on this partition, so this matching is recorded as $(v_2, 1)$ and added to C as well. Then, in CAND, the matching conditions are attached to each vector in C (e.g., errors[0] denotes the number of errors the first matching incurs). In this case, we have $v_1$.errors[0] = 0 and $v_2$.errors[0] = 1. For $v_1$, $|v_1.errors| = 1 < 2$, so it has one matching with the query; since $v_1$.errors[0] = 0, this is an exact match, so $v_1$ cannot be filtered and is sent to verification. Next $v_2$ is processed. Because $|v_2.errors| = 1 < 2$, it also has one matching with the query; however, $v_2$.errors[0] = 1 means this is not an exact match, so $v_2$ is pruned. Finally, since $[2_3, 2_4]$ has no match in the index, the process ends here.

3.5.1.2 Implementation based on Strict 1-deletion Variants

The only major difference to the previous section is the method to distinguish be- tween exact-match and 1-match. For the former case, we know the number of com- mon strict 1-deletion variants in a partition p is exactly |p|, i.e., the number of di- mensions in p. For the latter case, we know the number is exactly 1. So we only need to test if the number is greater than 1 to tell these two cases apart (See Algorithm 6).

Hence we can replace oneHammingQuery1Var by oneHammingQuery1DelVar at line

4 of Algorithm 4 to implement HmSearch based on strict 1-deletion variants.

Algorithm 6: oneHammingQuery1DelVar(q)
1  C ← empty hash table that maps a vector ID to an integer;
2  for each strict 1-deletion variant δ(q) of q do
3      for each vector ID v in I_{δ(q)} do
4          C[v] ← C[v] + 1;
5  C′ ← empty list;
6  for each key v in C do
7      if C[v] > 1 then   /* |p| common variants: an exact-match */
8          C′ ← C′ ∪ {(v, 0)};
9      else
10         C′ ← C′ ∪ {(v, 1)};
11 return C′;

3.5.2 Hierarchical Binary Filtering and Verification

Another improvement in our HmSearch method is a new algorithm, HBVerify, that performs additional filtering and verification simultaneously; it is highly optimised by exploiting a vertical layout and bit-parallelism.

Let $d = \lceil \log_2 |\Sigma| \rceil$. We can represent each dimension value of a vector using $d$ bits. For a vector $v$, we store its dimension values in binary in a vertical format, i.e., using $N$ bits to store the most significant bits of all the $N$ dimension values, another $N$ bits for the second most significant bits, and so on. We use the notation $v^{(i)}$ to denote the array consisting of the $i$-th most significant bits of the dimension values of vector $v$.

We can derive a filtering condition as follows:

Lemma 7 If $H(Q, v) \le k$, then $H(Q^{(i)}, v^{(i)}) \le k$, $\forall i$.

Proof. Let $H(Q, v) \le k$. Assume the contrary, i.e., $H(Q^{(i)}, v^{(i)}) \ge k + 1$ for some $i$, which means $v^{(i)}$ differs from $Q^{(i)}$ in at least $k + 1$ bits. Since each bit in $v^{(i)}$ belongs to a distinct dimension value of $v$, $v$ differs from $Q$ in at least $k + 1$ dimension values, which contradicts $H(Q, v) \le k$.

This filter can be implemented efficiently using bit-level operations exploiting the bit-parallelism offered by CPUs.

• XOR: Perform a bitwise XOR between $v^{(i)}$ and $Q^{(i)}$ to obtain a bitmap $A$. This requires only $\lceil N/w \rceil$ instructions, where $w$ is the machine word size in bits.

• BitCount: Count the number of set bits (i.e., 1s) in $A$. This can be done using $\lceil 12N/w \rceil$ machine instructions based on the trick at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel.

Therefore, the filtering can be performed efficiently, exploiting the bit-parallelism.

We can further strengthen the above filter by invoking it in an accumulative fashion over our binary representation (see Algorithm 7). In each iteration, we reuse the XOR'ed bitmap accumulated in the previous steps (stored in cumdiff): we perform the XOR for the current level of bits, merge in the bits that were already different in earlier iterations using a bitwise OR, and bit-count the resulting bitmap. The resulting count is the total number of dimensions on which the two vectors have differed so far, so the filtering power is much better than applying the filter on the current iteration alone. We iterate from the least significant to the most significant bits, to maximise the probability of filtering early (Line 3).

Another benefit of this scheme is that after iterating over all $d$ levels, the final bit count is exactly the Hamming distance between the two vectors, so no separate verification stage is needed.

Vertical binary layout of v = [5, 0, 3, 6] and Q = [5, 2, 3, 5]:

significant bit     v       Q       diff    cumdiff
3rd (least)         1010  ⊕ 1011  = 0001    0001
2nd                 0011  ⊕ 0110  = 0101    0101
1st (most)          1001    1001            (not reached)

Figure 3.5: Example of Hierarchical Binary Filtering and Verification

Algorithm 7: HBVerify(Q, v)
1  maxlevel ← ⌈log₂(|Σ|)⌉;
2  cumdiff ← ⌈N/w⌉ machine words filled with 0x0;   /* w is the machine word size in bits */
3  for i = maxlevel downto 1 do
4      errs ← 0;
5      for j = 1 to ⌈N/w⌉ do
6          diff ← Q^{(i)}[j] ⊕ v^{(i)}[j];   /* XOR for diffs */
7          cumdiff[j] ← cumdiff[j] ∨ diff;   /* OR */
8          errs ← errs + popcount(cumdiff[j]);   /* count set bits */
9      if errs > k then
10         return false;
11 output (v, errs);
12 return true;

Example 4 Consider the vector $v = [5, 0, 3, 6]$ and the query $Q = [5, 2, 3, 5]$ in vertical binary representation in Figure 3.5. Let $N = 4$, $k = 1$, $|\Sigma| = 8$, and $w = 4$. We first filter-and-verify the 3rd most significant bits of $Q$ and $v$: there is 1 mismatch between 1010 and 1011, so the cumulative difference bitmap cumdiff in Algorithm 7 is 0001. After bit counting, the total number of errors is 1, which is no larger than $k = 1$, so we move on to the 2nd most significant bits of $Q$ and $v$: diff = 0011 ⊕ 0110 = 0101. cumdiff is then OR'ed with diff, producing cumdiff = 0101, which has 2 bits set. This means $H(Q, v) \ge 2$, and hence we can prune $v$ immediately.
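Below is a compact C++ rendering of Algorithm 7, assuming w = 64 and that each vector's vertical layout is stored as bits[level][word] with level 0 holding the least significant bits; these layout choices, and the use of GCC's __builtin_popcountll, are ours rather than fixed by the algorithm.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Vertical layout: bits[level][word], level 0 = least significant bits.
    using VerticalBits = std::vector<std::vector<std::uint64_t>>;

    // Hierarchical binary filtering and verification (Algorithm 7).
    // Returns true iff H(Q, v) <= k; processes least significant bits first.
    bool hbVerify(const VerticalBits& q, const VerticalBits& v, int k) {
        const std::size_t levels = q.size();
        const std::size_t words = q[0].size();
        std::vector<std::uint64_t> cumdiff(words, 0);
        for (std::size_t lvl = 0; lvl < levels; ++lvl) {
            int errs = 0;
            for (std::size_t j = 0; j < words; ++j) {
                cumdiff[j] |= q[lvl][j] ^ v[lvl][j];       // accumulate diffs
                errs += __builtin_popcountll(cumdiff[j]);  // dims differing so far
            }
            if (errs > k) return false;                    // prune early
        }
        return true;  // the final count is the exact Hamming distance, <= k
    }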

3.6 Partition Strategies

In this section, we first introduce the prevalent equal-length partition strategy for partition-based methods and present its weakness under certain conditions. Then we describe our proposed partition strategy, which not only overcomes these drawbacks but also improves performance in most cases.

3.6.1 Equal Length Partition and its Drawback

Traditionally, almost all partition-based methods use an arbitrary equal-length partitioning strategy [Manku et al., 2007, Liu et al., 2011, Norouzi et al., 2012]. The heuristic is simple: keep the lengths of the partitions as even as possible.

• Partition the dimensions into $\kappa$ partitions evenly, such that each partition has length either $\lfloor N/\kappa \rfloor$ or $\lceil N/\kappa \rceil$. We can always make the last $N - \lfloor N/\kappa \rfloor \cdot \kappa$ partitions the longer ones.

• For each partition $p_i$, generate its signatures (either 1-variants or 1-deletion variants), each of which is a pair of the partition ID and the variant, i.e., $(i, v_{ij})$.

This method is simple and works well in some cases. However, when the data is skewed, this arbitrary partitioning strategy has severe drawbacks that can hurt performance significantly. Consider a partition of $l$ dimensions over $n$ vectors. Data skew exists if one of the $|\Sigma|^l$ possible values occurs very frequently (e.g., close to $n$ times). If the corresponding partition value of the query is exactly this frequently occurring value, then the majority of the vectors become candidates. Such a large number of candidates makes the algorithm degenerate into a brute-force linear scan.

Example 5 Consider the example dataset in Figure 3.6(a). $N = 6$ and $k = 1$, so $\kappa = 2$. Since every vector's second partition is within Hamming distance 1 of the query's second partition, all of the vectors become candidates. However, if we permute the dimensions before partitioning, as in Figure 3.6(b), the only candidate is $v_1$.

         (a)                              (b)
     Partition1  Partition2          Partition1  Partition2
Dim  1  2  3     4  5  6        Dim  1  2  5     4  3  6
Q    1  1  1     1  0  0        Q    1  1  0     1  1  0
v1   1  1  1     0  0  0        v1   1  1  0     0  1  0
v2   0  0  0     2  0  0        v2   0  0  0     2  0  0
v3   2  0  2     0  0  0        v3   2  0  0     0  2  0
v4   3  0  0     0  0  0        v4   3  0  0     0  0  0

Figure 3.6: Impact of Data Skew and Benefit of Dimension Rearrangement

3.6.2 Dimension Rearrangement

As obtaining the optimal dimension rearrangement is likely to be a computationally hard problem, we resort to a bottom-up, greedy algorithm to find a reasonably good rearrangement for a specific $\kappa$. Assuming that we have a way to measure the quality of a partition, the general idea of the algorithm is:

• Initially, we form $\kappa$ partitions, each consisting of one of the “worst” single dimensions (in terms of quality) among the remaining dimensions.

• In each of the following $N - \kappa$ rounds, we choose the worst partition and add to it the remaining dimension that gives the resulting partition the best possible quality.

Now we consider how to define the “quality” of a partition. Consider a partition $D$ consisting of $l$ dimensions. Since we do not know the query's value on this partition a priori, we choose to minimise the maximum frequency of any value occurring in these dimensions, i.e.,
$$\mathrm{MaxFreq}(D) \stackrel{\mathrm{def}}{=} \max_{x \in \Sigma^l} \left| \{ v_i \in DB \mid v_i[D] = x \} \right|.$$

Minimising MaxFreq also contributes to minimising the worst-case candidate size for a query. Let $D \circ \{D_j\}$ denote the partition formed by adding dimension $D_j$ to $D$. We choose the dimension $D_j$ that yields the smallest $\mathrm{MaxFreq}(D \circ \{D_j\})$.
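The quality measure can be computed in one pass over the (possibly sampled) dataset, as in the sketch below; the column-major layout data[d][i] is an assumption made here for convenience.

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <vector>

    // data[d][i] = value of dimension d in the i-th vector (column-major).
    // Returns MaxFreq(D): the highest frequency of any combined value over
    // the dimensions in D.
    std::size_t maxFreq(const std::vector<std::vector<int>>& data,
                        const std::vector<std::size_t>& D) {
        if (D.empty() || data.empty()) return 0;
        const std::size_t n = data[0].size();
        std::map<std::vector<int>, std::size_t> freq;
        std::size_t best = 0;
        for (std::size_t i = 0; i < n; ++i) {
            std::vector<int> projection;           // v_i[D]
            for (std::size_t d : D) projection.push_back(data[d][i]);
            best = std::max(best, ++freq[projection]);
        }
        return best;
    }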

Algorithm 8: Reorder(Dim, N, κ)
1  P ← κ empty buckets;
2  Dim ← the set of dimensions 0 to N − 1;
3  for any X ∈ P ∪ Dim, X.frq denotes the maximum value frequency in X;
4  Frq(X, Y) denotes (X, Y).frq, where (X, Y) means putting X and Y together;
5  forall buckets P_i ∈ P do
6      dim_t ← GetIthLargestFrqDim(Dim, i);
7      P_i ← { dim_t };
8      Dim ← Dim − { dim_t };
9  N ← N − κ;
10 while N > 0 do
11     P_max ← GetMaxFrqBucket(P);
12     tempdim ← arg min_{dim_i ∈ Dim} Frq(P_max, dim_i);
13     P_max ← P_max ∪ { tempdim };
14     Dim ← Dim − { tempdim };
15     N ← N − 1;

The complexity of this greedy algorithm is $O(N^3 \cdot n)$. While it could have a long running time when $N$ is large, it only needs to be run once for a fixed dataset; to further reduce its running time for large $N$, we run it on a sample of the dataset. Finally, the effort spent on dimension rearrangement is worthwhile, as it ... in our experiment (see Chapter 3.8.6).

Example 6 We illustrate the process of running the dimension rearrangement algorithm on the dataset of Example 5 for $\kappa = 2$ in Figure 3.7. Initially, the MaxFreqs of the single dimensions are computed, and we pick the worst two to start the partitions. Then we consider the best dimension to add to partition 1 (currently only $D_5$) such that the resulting MaxFreq is minimised; we find $D_1$, which results in a MaxFreq of 1 for the new partition $\{D_5, D_1\}$. The process runs until all remaining dimensions have been distributed to one of the partitions.

Figure 3.7: Dimension Rearrangement Example (the intermediate MaxFreq values and partition contents at each step of the greedy algorithm)

3.7 Hybrid Techniques for LSH Data

LSH is a prevailing technique for approximate similarity queries. In this section, we first show that our HmSearch technique can be used for LSH-based methods. Then we discuss several key issues in generating appropriate LSH data. Finally, we present an optimisation of our HmSearch method for LSH data.

3.7.1 Hamming Distance Query in C2LSH

LSH [Indyk and Motwani, 1998] is a widely used technique for performing approximate similarity or distance queries, with instances for many important problems. For a concrete example, MinHash [Broder, 1997] is an instance of LSH for finding sets that approximately satisfy a given Jaccard similarity threshold $t$. The LSH functions $h_i$ ($1 \le i \le k \cdot l$) are drawn from a family of functions with the property that $\Pr[h_i(X) = h_i(Y)] = \mathrm{Jaccard}(X, Y)$, where Jaccard is the Jaccard similarity function. Traditionally, LSH combines $k$ signatures into a super-signature and indexes these super-signatures; to maintain a high recall, this process is repeated $l$ times.

Recently, C2LSH [Gan et al., 2012] showed that we can take as candidates the objects that share at least $M$ signatures out of the total $N$ signatures, and that this scheme maintains the same rigorous accuracy guarantees as the original LSH. A similar idea is also used in approximate verification of candidates returned by LSH methods [Satuluri and Parthasarathy, 2012].

In C2LSH, given that $N$ signatures are extracted for each object to form a set, the core task of query processing is to find those sets whose Hamming distance from the query set is no larger than $k$. Using HmSearch, this process can be significantly accelerated. In this thesis, we consider the following three prevailing LSH functions:

• SimHash [Charikar, 2002] is a common technique that converts documents into vectors using the TF-IDF representation. It measures the similarity of documents by approximating the cosine similarity of the corresponding vectors.

• MinHash [Broder, 1997] is a prevailing technique for quickly estimating the similarity between two sets. The basic idea is to approximate the widely used Jaccard similarity using the minimum values of the hash values.

• P-Stable [Datar et al., 2004] is a prevalent technique for the nearest neighbour problem. It approximates the $l_p$ norm using $p$-stable distributions (in practice $p$ can only be set to 1 or 2).

3.7.2 Hybrid Algorithm

Linear scan is inevitable, and it outperforms index-based solutions when $k$ is sufficiently large. Below we show that, for LSH-based data, we can build an accurate cost model to predict the relative cost of index-based query processing versus linear-scan-based query processing, thanks to the independence of the dimension values, which are generated by independent LSH functions. We then propose a hybrid algorithm that uses this model to switch between the two query processing strategies to achieve the best performance across a wide range of $k$ settings.

3.7.2.1 Cost Model

Let $n$ be the total number of sets in the database. We denote by $n_0$ the sum of the lengths of the postings lists of all the signatures generated from the query; this quantity can be obtained from the inverted index without actually accessing the postings lists. Let $n_1$ be the size of CAND after duplicate elimination among the $n_0$ set IDs.

Let $c_{process}$ be the average time cost to process one set in CAND, and $c_{verify}$ the average time cost to verify one set. Then the cost of index-based filtering followed by hierarchical binary verification is $n_0 \cdot c_{process} + n_1 \cdot c_{verify}$, while the cost of direct hierarchical binary verification of all the sets in the database is $n \cdot c_{verify}$. Therefore, it is easy to see that if
$$\frac{n_0}{n - n_1} \le \frac{c_{verify}}{c_{process}}, \qquad (3.1)$$
the index-based solution will be faster.

Consider the index-based approach using 1-variants. Given $N$ and $k$, we divide the query set into $\kappa$ parts. We can obtain the length $l_i$ of the postings list for each partition, with $\sum_{i=1}^{\kappa} l_i = n_0$. Now we need to estimate $n_1$, the size of the union of the set IDs contained in the $\kappa$ postings lists. We assume the postings list associated with each signature is random, so for any set $s$ in the database, the probability that it appears in the $i$-th postings list is estimated as $\frac{l_i}{n}$. Since the dimension values of any set are created by independent LSH functions, we also assume independence across lists. The probability of set $s$ appearing in at least one such list is thus $p = 1 - \prod_{i=1}^{\kappa} (1 - \frac{l_i}{n})$, and the expected number of such sets, $p \cdot n$, is exactly our estimate of $n_1$.

Therefore $n - n_1 = n \cdot \prod_{i=1}^{\kappa} (1 - \frac{l_i}{n})$. In the decision rule of Equation (3.1), we can assume the right-hand side is a constant whose value depends on the machine configuration, the implementation, $D_{max}$, and $N$. The first two factors are fixed for a given system, and it is easy to see that the value grows linearly in $\log_2(D_{max})$ and $N$, due to the physical layout of the dataset in the verification process. Therefore, we model
$$\frac{c_{verify}}{c_{process}} = \alpha \cdot \log_2(D_{max}) \cdot N. \qquad (3.2)$$

The value of $\alpha$ can be estimated by running a sample workload of queries and keeping track of the total CAND sizes and the total number of verifications, from which we use the average values to obtain $c_{process}$ and $c_{verify}$.

For the case using strict 1-deletion variants, we only need to calculate, for each partition, the number of vectors that have at most one error, denoted $I_i$, and then follow the same estimation process as for 1-variants. To calculate this number, we generate 1-deletion variants rather than strict 1-deletion variants: we record not only the length $l_{ij}$ of the postings list of each strict 1-deletion variant of partition $i$, but also the length $I(exact)_i$ of the postings list of the original (undeleted) partition and the length $l_i$ of partition $i$. Then, by inclusion-exclusion, the number of vectors with at most one error against the query partition is
$$I_i = \sum_{j=1}^{l_i} l_{ij} - (l_i - 1) \cdot I(exact)_i.$$
Based on the above derivation, HmSearch employs the hybrid algorithm when dealing with LSH data, using Equation (3.1) to switch between the index-based approach and the linear-scan-based approach.
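The switching rule itself is only a few lines; the sketch below assumes the per-partition postings-list lengths l_i are already collected from the index and that alpha has been calibrated offline as described above.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Decide between the index-based strategy (true) and a linear scan
    // (false), following Equations (3.1) and (3.2).
    // listLens: postings-list length l_i per query partition; n: database
    // size; alpha: calibrated constant; dmax: |Sigma|; N: dimensionality.
    bool useIndex(const std::vector<std::uint64_t>& listLens,
                  std::uint64_t n, double alpha, int dmax, int N) {
        double n0 = 0.0, survive = 1.0;
        for (std::uint64_t li : listLens) {
            n0 += static_cast<double>(li);
            survive *= 1.0 - static_cast<double>(li) / n;  // Pr[s misses list i]
        }
        double nMinusN1 = n * survive;                     // expected n - n1
        double rhs = alpha * std::log2(static_cast<double>(dmax)) * N;  // (3.2)
        return n0 <= rhs * nMinusN1;                       // Equation (3.1)
    }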

3.8 Experiments

In this section, we report the findings of our extensive experimental study. We first compare the performance of our proposed algorithms with three state-of-the-art methods for Hamming distance queries. Then we evaluate our dimension rearrangement method to show the resulting performance improvement. Finally, we analyse the scalability and index sizes of our methods.

3.8.1 Experiment Setup

The following algorithms are used in the experiment.

• HSD and HSV are our proposed algorithms: HSD generates 1-deletion variants as signatures, while HSV generates 1-variants. Both employ all three techniques we proposed: the Enhanced Filter (EF), the Hierarchical Binary Filter (HB), and Rearranged Dimensions (RD). HSD-nEB and HSV-nEB are variations that remove both the EF and HB techniques; HSD-nB and HSV-nB remove only HB; HSD-nR and HSV-nR remove only RD.

• ScanCount [Li et al., 2008] is an index-merging method that scans through the postings list of each element of the query and counts the occurrences of the data IDs. We use it as a baseline. Note that when handling a Hamming distance constraint with ScanCount, the 0s in each vector must also be indexed and processed to guarantee correctness.

• Google [Manku et al., 2007] is one of the state-of-the-art Hamming distance query algorithms, specifically designed for detecting near-duplicate documents at Web scale. The method is based on partitioning and exact matching. We also implemented Google-R, a variation of Google that integrates our Rearranged Dimensions (RD) technique.

• HEngine [Liu et al., 2011] is a recently proposed Hamming distance query processing method, based on partitioning and reducing the $k$-query to 1-queries.

In our experiments, we select four publicly available real datasets. They cover a wide range of data distributions and application domains.

• Audio is extracted from the DARPA TIMIT collection¹. It contains 54,387 192-dimensional feature vectors. We use p-stable LSH [Datar et al., 2004] to convert each feature vector into a 64-dimensional integer vector.

• TREC is extracted from the TREC-9 Filtering Track Collections². Each string is a reference from the MEDLINE database with author, title, and abstract information. We apply SimHash [Manku et al., 2007] to convert each string into a 64-dimensional binary vector.

¹http://www.cs.princeton.edu/cass/audio.tar.gz
²http://trec.nist.gov/data/t9_filtering.html

• ENRON is extracted from the Enron email collection³. We extract and concatenate the email title and body, and employ MinHash [Broder et al., 1997] to convert each string into a 64-dimensional integer vector. As MinHash selects a token of the string as each signature, the $|\Sigma|$ of ENRON is large.

• PubChem is a database of chemical molecules⁴. We sample 1 million entries; each entry contains a fingerprint, which is an 881-dimensional binary vector.

Statistics about the datasets are listed in Table 3.1.

The experiments on the Audio, TREC, and ENRON data were carried out on a PC with an Intel(R) Xeon(R) X3330 2.66GHz CPU and 4GB RAM running Debian 5.0.6. The experiments on the PubChem data were carried out on a PC with a Quad-Core AMD Opteron(tm) 8378 2.4GHz CPU and 96GB RAM running Ubuntu/Linaro 4.6.3-1ubuntu5. All algorithms were implemented in C/C++ and compiled using GCC 4.4.5 with the -O3 flag, and all run in in-memory mode.

We measured the query time and the candidate size. By query time, we mean the average elapsed time (in milliseconds) for a query. Due to the wide range of values, the y-axes of most figures on running time are plotted in logarithmic scale. The candidate size is the average number of data vectors sent to the final verification.

3.8.2 Hamming Similarity Query Performance

To test the query processing time of all algorithms on the four datasets, we randomly sample 1,000 vectors from each dataset as queries. We measure the query time and show the results of the five algorithms in Figures 3.8(a)–3.8(d). For the Audio, TREC, and ENRON datasets, the Hamming distance threshold varies from 1 to 31 (nearly a 50% error rate); for the PubChem dataset, it varies from 1 to 81 (nearly a 10% error rate).

³http://www.cs.cmu.edu/~enron/
⁴http://pubchem.ncbi.nlm.nih.gov/

Table 3.1: Statistics of Datasets

Data      n          N    Generation Function       |Σ|
Audio     54,387     64   2-stable LSH              16
TREC      239,580    64   SimHash                   2
ENRON     95,997     64   MinHash                   172
PubChem   1,000,000  881  chemical fingerprinting   2

We observe that

• The query performance on Audio, TREC, and PubChem exhibits the following patterns.

– The fastest algorithm is HSV at all Hamming distance thresholds.

– For small thresholds (less than 7), Google is better than HSD. On the other hand, once the threshold grows beyond 7, HSD outperforms Google by up to two orders of magnitude.

– When the Hamming distance threshold is 1, HSV and Google have similar performance, as both methods use highly selective signatures. As the threshold increases, the performance of Google deteriorates faster than HSV's. The reason is that HSV's partitions are nearly twice as long as Google's, so HSV generates much more selective signatures, which results in better performance.

– The slowest algorithm is always ScanCount, and it is insensitive to the Hamming distance threshold. This is because ScanCount always naïvely goes through all the postings lists for each dimension value of the query and counts the occurrences of each vector encountered.

• ENRON has a large alphabet size, hence HSV becomes inapplicable. We compare HSD with the other algorithms in Figure 3.8(c). The trends are:

– HSD has competitive performance from middle (10) to large (31) Hamming distance thresholds.

– When the threshold is low (e.g., up to 7), Google outperforms HSD in most cases. This is because at low thresholds both Google and HSD generate highly selective signatures, so the advantage of HSD's longer signatures is not pronounced; meanwhile, the overhead of HSD enumerating the 1-deletion variants of the query slows its queries down.

– HEngine has substantially worse performance on ENRON, because it needs to generate a large number of the query's 1-variants and probe them against the index; this cost is proportional to the alphabet size $|\Sigma|$.

• The overall trend for HSV, HSD, and Google is that the query time increases with the Hamming distance threshold. This is expected, as a larger threshold leads to more candidates and eventually more results, which increases the computation time.

3.8.3 Candidate Size Analysis

We measure the candidate sizes of four algorithms on the four datasets and show the results in Figures 3.8(e)–3.8(h).

We observe that

• Except for ScanCount, the candidate sizes of the algorithms increase with the Hamming distance threshold. Google has a larger candidate size than HSV and HEngine; the reason is that Google's partition length is about half that of HSV, HSD, and HEngine.

• As the Hamming distance threshold increases, the candidate sizes of HSV and HSD grow much more slowly than those of the other algorithms, thanks to the enhanced filtering.

• The candidate sizes of all partitioning-based methods eventually reach $n$ (i.e., all data vectors become candidates) when the threshold is sufficiently large. Once an algorithm's candidate size reaches $n$, it is better to use a brute-force verification-only method, so this is the maximum error threshold for which the algorithm is useful. For Google, this happens around threshold 25 on Audio and ENRON, and at thresholds 10 and 17 on TREC and PubChem, respectively. This phenomenon occurs much later for HSV and HSD than for Google and HEngine.

3.8.4 Query Time Fluctuation

The overall trend is that the query time increases with the Hamming distance threshold. However, as shown in Figure 3.8(i), on the micro scale the query time of our HSV may fluctuate; for example, the query time at $k = 26$ is slightly higher than at $k = 27$. This phenomenon is caused by the enhanced filtering of Lemma 6. When the Hamming distance threshold is even, under certain conditions two variant matches are required to pass the filtering condition; when the threshold increases by one, the filtering condition may be strengthened to require three matches. Although increasing the threshold shortens the partitions and hence reduces their selectivity, the stronger pruning condition may reduce the candidate size substantially and thus improve the overall performance.

Figure 3.8: Experiment Results - I. (a) Audio, Query Time (N = 64); (b) TREC, Query Time (N = 64); (c) ENRON, Query Time (N = 64); (d) PubChem, Query Time (N = 881); (e) Audio, Candidate Size (N = 64); (f) TREC, Candidate Size (N = 64); (g) ENRON, Candidate Size (N = 64); (h) PubChem, Candidate Size (N = 881); (i) Audio, Query Time Fluctuation.

Figure 3.8 (continued): Experiment Results - I. (j) Effect of EF and HB, Audio, Time; (k) Effect of EF and HB, ENRON, Time; (l) Effect of EF and HB, Audio, Candidate; (m) Effect of EF and HB, ENRON, Candidate; (n) Effect of Reordering, Audio, Time; (o) Effect of Reordering, TREC, Time; (p) Effect of Reordering, ENRON, Time; (q) Effect of Reordering, PubChem, Time; (r) Scalability, TREC, Query Time.

3.8.5 Effect of Enhanced Filter and Hierarchical Binary Verification

We present the query time and candidate size of our algorithms to exhibit the effects of the Enhanced Filter and Hierarchical Binary Verification. HSV denotes the algorithm with both techniques, HSV-nB the algorithm with only the Enhanced Filter, and HSV-nEB the algorithm with neither. The query times are shown in Figures 3.8(j) and 3.8(k), and the corresponding candidate sizes in Figures 3.8(l) and 3.8(m).

Comparing HSV-nEB with HSV-nB, the Enhanced Filter contributes significantly to the performance when the threshold is in the middle range: for example, for thresholds between 13 and 28, HSV-nB is almost one order of magnitude faster than HSV-nEB. The reason is that at these thresholds the selectivity of the variants is low, so requiring two or three matching partitions improves performance dramatically. The same trend appears in Figures 3.8(l) and 3.8(m). Notice that the candidate size reduction is larger than the running time reduction: for example, at threshold 16 on the Audio data, a nearly 90% reduction in candidate size yields a nearly 70% reduction in running time, mainly due to the overhead of performing the filtering itself. Another observation is that when the threshold is very small, the improvement due to the Enhanced Filter is insignificant; for example, on ENRON at threshold 1, HSV-nEB and HSV-nB have almost the same performance, because the variants of HSV-nEB are already long enough to be highly selective.

By comparing HSV-nB (HSD-nB) with HSV (HSD), we can evaluate the effectiveness of Hierarchical Binary Verification. The improvement is noticeable on both datasets, especially when the Hamming distance threshold is not small. Generally speaking, the gap between traditional verification and Hierarchical Binary Verification widens as the Hamming distance threshold increases; for example, at threshold 22 on the Audio data, the improvement over traditional verification is more than 4 times.

3.8.6 Effect of Rearranging Dimensions

We study the effectiveness of rearranging dimensions in Figures 3.8(n)–3.8(q). Note that Google-R is Google with the dimension rearrangement technique, and HSV (HSV-nR) and HSD (HSD-nR) are our algorithms with and without dimension rearrangement, respectively. The following observations can be made:

• Dimension rearrangement boosts performance in most cases, especially on the PubChem data (by up to two orders of magnitude). The reason is that each PubChem dimension corresponds to a manually defined feature of chemical molecules, so the dataset contains plenty of skew. In addition, it is not uncommon for several skewed dimensions to be consecutive and thus reside in the same partition under existing methods, in which case the majority of the dataset may be retrieved as candidates.

• For the Audio and TREC datasets, the effect of dimension rearrangement is noticeable but not significant (see Figures 3.8(n) and 3.8(o)). The dimensions of these datasets are generated by independent LSH functions, so there is much less data skew, and the improvement is not as remarkable as on PubChem.

• Although our dimension rearrangement method helps in most cases, it does not always deliver better performance. For example, on the ENRON dataset with thresholds between 5 and 8, Google performs better than Google-R (see Figure 3.8(p)). In such cases the variants already have very good selectivities, and our greedy algorithm cannot guarantee a globally optimal rearrangement.

3.8.7 Scalability

We study the scalability of the algorithms by varying the dataset size, randomly sampling 20% to 100% of the data vectors from TREC, with the Hamming distance threshold fixed at 7. Figure 3.8(r) shows the query time ratio, defined as the query time on the current dataset over the query time on the 20% sample.

The general trend is that the query time of all four algorithms grows with the dataset size. ScanCount exhibits the slowest growth rate, followed by HSV: when the dataset size increases from 20% to 100%, the query time increases by 4.5 times for ScanCount, 5.9 times for HSV, 8.2 times for Google, and 11 times for HEngine. This showcases the better scalability of HSV and ScanCount with respect to dataset size compared to Google and HEngine.

3.8.8 Index Size

Figures 3.9(b)–3.9(d) show the index sizes of the algorithms on three of the datasets⁵ at different Hamming distance thresholds. In general, our HSV has a large index size. For the TREC data, when the threshold is small ($k = 1$) the index size is even larger than when the threshold is large ($k > 25$): for small $k$, HSV generates a large number of unique signatures, each of which requires two pointers, and the total size of the pointers is huge, contributing to the large index size.

⁵The results on the Audio dataset are similar to those on ENRON.

Figure 3.9: Experiment Results - II. (a) Audio, Index Size; (b) TREC, Index Size; (c) ENRON, Index Size; (d) PubChem, Index Size.

HSD and ScanCount have competitive space usage on the ENRON dataset, yet on the TREC and PubChem datasets they consume relatively more index space. Note that in some cases (e.g., $k \ge 25$ on ENRON) HSD has the smallest index size among all the methods. Generally speaking, Google has a relatively small index size in most cases. HEngine's index size increases linearly with the threshold and is usually larger than HSD's.

3.9 Discussion

In this section, we first compare our HmSearch with some state-of-the-art algorithms. Then we illustrate several innovative yet incomplete ideas.

3.9.1 Complexity Analysis

We list the time and space complexities of previous methods and of our methods in Table 3.2.

Table 3.2: Complexities of Empirical Hamming Distance Query Methods

Algorithm                              Query Time                                            Index Size
[Manku et al., 2007] (1-level part.)   $(k+1) \cdot f(\frac{N}{k+1}) + vc_1$                 $(k+1) \cdot n$
[Manku et al., 2007] (2-level part.)   $(k+1)^2 \cdot f(\frac{(2k+1)N}{(k+1)^2}) + vc_2$     $(k+1)^2 \cdot n$
[Liu et al., 2011]                     $N \cdot |\Sigma| \cdot g_1(\frac{2N}{k}) + vc_3$     $N \cdot n$
HmSearch 1-var                         $\frac{k}{2} \cdot g_1(\frac{2N}{k}) + vc_4$          $N \cdot |\Sigma| \cdot n$
HmSearch 1-del-var                     $N \cdot g_2(\frac{2N}{k} - 1) + vc_5$                $N \cdot n$

where $f(x) = \max(1, \frac{n}{|\Sigma|^{x}})$, $g_1(x) = \max(1, \frac{n \cdot |\Sigma| \cdot N}{|\Sigma|^{x}})$, and $g_2(x) = \max(1, \frac{n \cdot N}{|\Sigma|^{x}})$, under the uniform-distribution assumption; $vc_i$ stands for the total time used for pruning and verifying the candidates in each algorithm.

Comparison with Google. Google's algorithm was introduced in [Manku et al., 2007]. There are two practical versions of its partition strategy: a one-level version and a two-level version. The one-level strategy partitions the set into $k + 1$ even partitions and takes each partition as a signature; a qualified query result must share at least one signature with the query. Under this scheme, Google's method probes the index $k + 1$ times, which is small but still more than our 1-variants-based algorithm ($\frac{k}{2}$). Moreover, as shown in Table 3.2, when $k$ is not large, the query time of Google's algorithm will probably be much longer than that of both our methods: its partition length is $\frac{N}{k+1}$, almost half of our algorithms' ($\frac{2N}{k}$), and its performance suffers accordingly. Note that the index size of Google is small and does not depend on the alphabet size. Fortunately, the index size of our 1-deletion variants also does not depend on the alphabet size, so it remains acceptable even when the alphabet is very large (e.g., for sets generated by MinHash [Broder et al., 1997, Theobald et al., 2008]).

In terms of the two-level version, it first partitions the vector into k + 1 even partitions. Then, for each such partition, it divides the remaining dimensions again into k + 1 even partitions, combines the first-level partition with each second-level partition, and takes the combined dimensions as a signature. A qualified query result must share at least one such signature with the query.

Under this scheme, the length of each signature is (2k+1)N/(k+1)^2, which compensates for the short partition length of the one-level version. However, it needs to probe the index (k + 1)^2 times, which is far more than the one-level version and will probably drag down the performance. The index size of the two-level version is much larger than that of the one-level version as well.
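A sketch of the corresponding two-level signature generation follows (again our own reading of the description, with an assumed even split of the remaining dimensions):

def two_level_signatures(vec, k):
    """If H(S, T) <= k, some first-level block of S matches T exactly; all k
    errors then fall among the remaining dimensions, so one of their k + 1
    sub-blocks is also error-free. Hence S and T share one of the (k + 1)^2
    combined signatures, each of length about (2k + 1)N/(k + 1)^2."""
    N = len(vec)
    cut = [round(i * N / (k + 1)) for i in range(k + 2)]
    sigs = set()
    for i in range(k + 1):
        block = tuple(vec[cut[i]:cut[i + 1]])
        rest = [vec[p] for p in range(N) if not cut[i] <= p < cut[i + 1]]
        m = len(rest)
        sub = [round(j * m / (k + 1)) for j in range(k + 2)]
        for j in range(k + 1):
            sigs.add((i, j, block, tuple(rest[sub[j]:sub[j + 1]])))
    return sigs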

Comparison with HEngine. HEngine was introduced in [Liu et al., 2011]. The basic idea is to partition the vector into ⌈(k+1)/2⌉ parts. It then takes each partition of the database vectors as a signature on the data side, and the 1-variant sets of the query's partitions as signatures on the query side. A qualified query result must share at least one signature with the query. Although the partitioning method in [Liu et al., 2011] is similar to the idea in this thesis, the basic method we propose differs from [Liu et al., 2011] in three important aspects:

• Although the candidate size of HEngine is identical to that of our 1-variants based method, and probably smaller than that of our strict 1-deletion variants based method in many cases, it needs to spend far more time enumerating the possible variants of the query and probing the index. This exhaustive enumeration and probing hurt its performance significantly. We apply the variant generation on the data side rather than the query side, which reduces the number of probes at query time and hence improves the performance substantially.

• We propose to consider both 1-variants and strict 1-deletion variants as signatures. The strict 1-deletion variants do not depend on the alphabet size, and hence are applicable to cases where the dimension domains are large (e.g., those sets generated by MinHash [Broder et al., 1997, Theobald et al., 2008]). Therefore, on this kind of data, where HEngine suffers from a huge number of query-side enumerations and our 1-variants suffer from a large index space, our strict 1-deletion variants can still perform decently.

• Rather than replicating the sets in the database multiple times, we generate signatures for each set and index them using an inverted index.

All these differences contribute to a substantial improvement in query performance, as the following sketch illustrates.
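The contrast between query-side and data-side variant generation can be sketched as follows (an illustration of the probing pattern only; position-tagged signatures are an assumed index layout):

def one_variants(part, alphabet):
    """part itself plus every vector at Hamming distance 1 from it."""
    yield part
    for i, c in enumerate(part):
        for a in alphabet:
            if a != c:
                yield part[:i] + (a,) + part[i + 1:]

def probe_query_side(index, q_part, pos, alphabet):
    """HEngine-style: enumerate the query partition's 1-variants, so each
    partition costs O(len(q_part) * |alphabet|) index probes."""
    for v in one_variants(q_part, alphabet):
        yield from index.get((pos, v), ())

def probe_data_side(variant_index, q_part, pos):
    """HmSearch 1-var: variants were enumerated once at indexing time,
    so each query partition costs a single probe."""
    return variant_index.get((pos, q_part), ())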

3.9.2 2-Query Processing using 1-Variants

As discussed in Chapter 3.3.2, if there were an efficient solution to the Hamming distance problem with threshold 2, κ could be made even smaller, and each signature could have an even better selectivity. From our observation, there is a way to answer the 2-query using 1-variants. The basic idea is to generate 1-variants on both the index side and the query side.

Lemma 8 Consider two vectors S and T such that H(S, T) ≤ 2. Then

|1-Var-Set(S) ∩ 1-Var-Set(T)| ≥ 2.

Proof. We consider all the possible cases, where H(S, T) = 0, 1, or 2.

• When H(S, T) = 0, all the 1-variants match, hence |1-Var-Set(S) ∩ 1-Var-Set(T)| ≥ N · (|Σ| − 1) + 1.

• When H(S, T) = 1, every enumeration of 1-variants on the mismatched dimension matches, thus |1-Var-Set(S) ∩ 1-Var-Set(T)| ≥ |Σ|.

• When H(S, T) = 2, each of the two mismatched dimensions yields one match, thus |1-Var-Set(S) ∩ 1-Var-Set(T)| ≥ 2.

Whenever there is a mismatched dimension, we must have |Σ| ≥ 2. Therefore, in all cases, |1-Var-Set(S) ∩ 1-Var-Set(T)| ≥ 2.
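Lemma 8 can be checked exhaustively on small instances; the short Python script below (a sanity check added here, with an arbitrarily chosen small alphabet and dimension) verifies it by brute force.

from itertools import product

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def one_var_set(s, alphabet):
    """s itself plus every vector at Hamming distance exactly 1 from s."""
    vs = {s}
    for i, c in enumerate(s):
        for a in alphabet:
            if a != c:
                vs.add(s[:i] + (a,) + s[i + 1:])
    return vs

alphabet, N = (0, 1, 2), 4
for s in product(alphabet, repeat=N):
    for t in product(alphabet, repeat=N):
        if hamming(s, t) <= 2:
            assert len(one_var_set(s, alphabet) & one_var_set(t, alphabet)) >= 2
print("Lemma 8 verified for all pairs with |Sigma| = 3, N = 4")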

Notice that, conveniently, there is no case where H(S, T) > 2 and yet 1-Var-Set(S) ∩ 1-Var-Set(T) is non-empty: any common 1-variant u would give H(S, T) ≤ H(S, u) + H(u, T) ≤ 2. Therefore we can test the condition 1-Var-Set(S) ∩ 1-Var-Set(T) ≠ ∅ instead, which is apparently easier to check. According to Lemma 8, we can use the following procedure to answer a Hamming distance query with threshold k′ = 2.

• Indexing. We generate all the 1-variants for every vector in the database and index the variants using an inverted index I.

• Query Processing. We generate all the 1-variants for the query and look them up in the index. The returned results are merged and become the query results.

The index space complexity of this method is O(|Σ| · N · n). The query time complexity is O(|Σ| · N · n + occ), where occ denotes the number of query results.
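A minimal sketch of this procedure is given below, reusing one_var_set from the Lemma 8 check above. By the triangle-inequality argument just given, sharing a 1-variant is both necessary and sufficient for being within Hamming distance 2, so no further verification of the returned vectors is needed.

from collections import defaultdict

def build_2query_index(vectors, alphabet):
    """Index every 1-variant of every database vector."""
    index = defaultdict(set)
    for vid, v in enumerate(vectors):
        for var in one_var_set(tuple(v), alphabet):
            index[var].add(vid)
    return index

def query_within_2(index, q, alphabet):
    """Return exactly {vid : H(q, vectors[vid]) <= 2}."""
    result = set()
    for var in one_var_set(tuple(q), alphabet):
        result |= index.get(var, set())
    return result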

Based on the partitioning framework of Chapter 3, using the 2-query strategy, κ can be made even smaller than in our proposed method (about k/3), so the signatures will probably have even better selectivity. However, this method needs to generate and probe all 1-variants on the query side, and this exhaustive enumeration and probing would severely hurt the performance. We therefore do not adopt this strategy, and only leave it here as a point for discussion.

3.9.3 Triangular Inequality Pruning

A bottleneck of HmSearch is that it needs to naively go through all the retrieved vectors, and verification is applied to almost all of the candidates1. One possible way to address this problem is to record some auxiliary information and use it to skip some steps during processing.

The triangle inequality is a prevalent pruning technique for problems in metric spaces. Its definition is very simple: given a metric space M with metric d, any three points x, y, z satisfy

|d(x, y) − d(y, z)| ≤ d(x, z) ≤ d(x, y) + d(y, z)

Since Hamming distance is a metric, this property holds for it. Therefore, given a query Q and two vectors S and T with H(S, T) = hst, suppose the Hamming distance threshold is k. Once H(Q, S) is calculated (denote the result by hqs), the triangle inequality gives

|hqs − hst| ≤ H(Q, T) ≤ hqs + hst

Hence, if |hqs − hst| > k, T can be quickly pruned; and if hqs + hst ≤ k, T can be directly accepted as a final result. Based on this, we wish to design an algorithm which can quickly prune some vectors during index probing and accept some vectors without verification.

Suppose there is a dataset of vectors, signatures are generated from each vector, and an inverted index I has already been built on these signatures. One way to use the above heuristics is, for each posting list of I (denoted Ii), to pre-compute the Hamming distance Hj between every pair of consecutive postings Ii(j) and Ii(j + 1) (j ∈ [0, length(Ii) − 2]) and record it in Ii(j). When a query Q comes in, we go through all the retrieved postings. Once Ii(j) is verified, based on the triangle inequality, there is a chance that Ii(j + 1) can be quickly accepted or pruned.

[Figure 3.10: Posting list. Index entry A points to vectors v2, v3, v4, v5; the values 1, 5, 10 above the list are the precomputed Hamming distances between consecutive vectors.]

Example 7 Consider the posting list shown in Figure 3.10: A is the index entry; v2, v3, v4, v5 are vectors; and H(v2, v3) = 1, H(v3, v4) = 5, H(v4, v5) = 10 are pre-computed and stored. Assume that k = 5 and that this posting list is retrieved for a query Q. For ease of illustration, suppose v2 is a candidate, so verification is applied to v2 and yields H(Q, v2) = 2; hence v2 is a result. Next, before v3 is processed, the triangle inequality gives the estimate H(Q, v3) ≤ H(Q, v2) + H(v2, v3) = 3 ≤ 5, so v3 is quickly accepted as a result without verification. Then, supposing v4 is also a candidate and H(Q, v4) = 3, v4 is also a result. After that, before processing v5, the triangle inequality gives H(Q, v5) ≥ |H(Q, v4) − H(v4, v5)| = 7 > 5. Therefore, v5 is quickly pruned.
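A sketch of such a scan is given below (our own formulation of this heuristic, not an implemented component of HmSearch). Instead of keeping a single exact distance, it carries an interval [lo, hi] bounding H(query, current posting), so that a posting accepted or pruned without verification still passes usable bounds on to its successor.

def scan_posting_list(query, postings, gaps, k, hamming):
    """gaps[j] = H(postings[j], postings[j+1]), precomputed at indexing time.
    The triangle inequality widens the interval [lo, hi] by the gap when
    moving to the next posting; exact verification runs only when the
    interval straddles the threshold k."""
    results = []
    lo, hi = 0, float("inf")              # trivial bounds before the first posting
    for j, v in enumerate(postings):
        if j > 0:
            g = gaps[j - 1]
            lo, hi = max(lo - g, g - hi, 0), hi + g
        if lo > k:                        # v cannot be within k: prune
            continue
        if hi <= k:                       # v must be within k: accept unverified
            results.append(v)
            continue
        h = hamming(query, v)             # uncertain case: verify exactly
        lo = hi = h
        if h <= k:
            results.append(v)
    return results

On Example 7 (gaps = [1, 5, 10], k = 5), this scan verifies only v2 and v4, accepts v3 via the upper bound, and prunes v5 via the lower bound, matching the steps above.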

However, this method has some drawbacks.

• A main drawback is that at least a char (or even a short) is needed to store the Hamming distance between each pair of consecutive vectors. The space overhead therefore grows with the number of postings.

• Another problem is that the order of the vectors within the postings will probably affect the performance of this pruning method. There are at least two natural choices: make similar vectors adjacent, or make very different vectors adjacent. The former makes quick acceptance easier, while the latter makes quick rejection easier.

3.10 Summary

In this chapter, we propose HmSearch, an efficient Hamming distance query algorithm that works well for a wide spectrum of error thresholds, has no limitation on the domain size, and is robust against data skew. Our method is based on a different partitioning scheme, with tightened count filtering and a filter-and-verification technique based on a hierarchical binary representation. A greedy algorithm to rearrange the dimensions before partitioning is also developed. We also present a theoretical comparison and some incomplete studies. We demonstrate the superior performance of our proposed method against the previous state-of-the-art methods, using LSH and chemical datasets under a wide range of parameter settings.

Chapter 4

Final Remark

4.1 Conclusions

With the rapid development of the IT industry, data of ever-growing size are generated in a variety of fields. Efficient manipulation of these data requires similarity search. Hamming distance is a prevailing measure for estimating the similarity between two vectors, and similarity search under a Hamming distance constraint is applied in numerous different applications. To support the versatile demands of those applications, the Hamming distance query has become an important issue in the research area and attracts increasing attention.

In this thesis, we present efficient techniques to answer the Hamming distance query, mainly in Chapter 3, where we report our studies on the Hamming distance query problem. Using a partition-based Hamming distance reduction framework, our approach uses variant-based signatures to answer the reduced Hamming distance queries. In addition, our method makes a deeper inspection of the requirement for being within the reduced Hamming threshold; this enhanced condition improves the pruning power dramatically. Moreover, we introduce a novel verification strategy named hierarchical filtering and verification, which combines the filtering and verification processes and significantly improves the overall performance. Next, we introduce a hybrid technique for LSH data. After that, we present some discussion, including a complexity comparison and several strategies we have explored.

4.2 Existing Problems and Future Work

We list several open problems for the study of the Hamming distance query:

• One of the existing problems, in both the theoretical and the practical areas, is that current solutions do not work well when k is large, e.g., when k > N/2. Most current practical strategies follow the filtering-and-verification framework. When k is large, most current filtering techniques are almost useless, so that almost all the vectors in the data collection are sent to verification. Moreover, because of the overhead of the filtering process, these algorithms usually perform worse than a linear scan. Although our algorithm outperforms other practical methods in most cases, when k is large (k > N/2) it suffers from this problem as well.

• Another problem lies in the partition-based methods. Most of the prevailing practical methods are partition-based (including [Manber and Wu, 1994, Tabei et al., 2010, Arasu et al., 2006, Liu et al., 2011, Norouzi et al., 2012] and ours). A main drawback of the partition-based strategy is that if the vectors are very short (e.g., shorter than 10), the length of each partition will be very short as well, so the selectivity of the signatures may be terrible even when k is small. This may also result in worse performance than a linear scan.

Based on these two problems and our current work, we list the directions of our future work as follows.

• One direction we wish to explore is the signature generation strategy. We hope to find ways to complement the current signature generation strategies. One way to do so is to find a data structure that efficiently supports the Hamming distance query with k = 2, so that κ can be set even smaller, which would help boost the selectivity of the signatures. Furthermore, since partition-based methods use partitions as signatures, each signature only contains information about a fraction of a vector. We believe there should be a method to extract information from the whole vector, which could support more efficient filtering.

• For short vectors, the trie might be a good structure. There are some theoretical studies on solving the Hamming distance query using tries [Brodal and Venkatesh, 2000, Cole et al., 2004, Arslan, 2006]; although none of these algorithms performs well when k > 1, they still give us some indications. We hope to find a practical way to use the trie structure to solve the Hamming distance query problem.

• In Chapter 3.6 we discuss the skewness problem and present a greedy solution. However, the time complexity of this solution is not good. Although sampling can be employed to make the processing time acceptable, and this process only needs to be done once, we still hope to optimize it or find a more efficient replacement.

Bibliography

[Arasu et al., 2006] Arasu, A., Ganti, V., and Kaushik, R. (2006). Efficient exact set-similarity joins. In VLDB.

[Arslan, 2006] Arslan, A. N. (2006). Efficient approximate dictionary look-up for long words over small alphabets. In Correa, J. R., Hevia, A., and Kiwi, M. A., editors, LATIN, volume 3887 of Lecture Notes in Computer Science, pages 118–129. Springer.

[Baldi et al., 2008] Baldi, P., Hirschberg, D. S., and Nasr, R. J. (2008). Speeding up chemical database searches using a proximity filter based on the logical exclusive-or. J. Chem. Inf. Model, pages 1367–1378.

[Bayardo et al., 2007] Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up all pairs similarity search. In WWW.

[Belazzougui, 2009] Belazzougui, D. (2009). Faster and space-optimal edit distance "1" dictionary. In CPM, pages 154–167.

[Belazzougui and Venturini, 2012] Belazzougui, D. and Venturini, R. (2012). Compressed string dictionary look-up with edit distance one. In CPM, pages 280–292.

[Botelho et al., 2011] Botelho, F. C., Lacerda, A., Menezes, G. V., and Ziviani, N. (2011). Minimal perfect hashing: A competitive method for indexing internal memory. Inf. Sci., 181(13):2608–2625.

[Botelho and Ziviani, 2007] Botelho, F. C. and Ziviani, N. (2007). External perfect hashing for very large key sets. In CIKM, pages 653–662.

[Bowyer et al., 2008] Bowyer, K. W., Hollingsworth, K., and Flynn, P. J. (2008). Image understanding for iris biometrics: A survey. Comput. Vis. Image Underst., 110(2):281–307.

[Brodal and Gasieniec, 1996] Brodal, G. S. and Gasieniec, L. (1996). Approximate dictionary queries. In CPM, pages 65–74.

[Brodal and Venkatesh, 2000] Brodal, G. S. and Venkatesh, S. (2000). Improved bounds for dictionary look-up with one error. Inf. Process. Lett., 75(1-2):57–59.

[Broder, 1997] Broder, A. Z. (1997). On the resemblance and containment of documents. In SEQS.

[Broder et al., 1997] Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. (1997). Syntactic clustering of the web. Computer Networks, 29(8-13):1157–1166.

[Chan et al., 2011] Chan, H.-L., Lam, T. W., Sung, W.-K., Tam, S.-L., and Wong, S.-S. (2011). A linear size index for approximate pattern matching. J. Discrete Algorithms, 9(4):358–364.

[Charikar, 2002] Charikar, M. (2002). Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388.

[Chaudhuri et al., 2006] Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A primitive operator for similarity joins in data cleaning. In ICDE.

[Chaudhuri and Kaushik, 2009] Chaudhuri, S. and Kaushik, R. (2009). Extending autocompletion to tolerate errors. In SIGMOD Conference, pages 707–718.

[Chen et al., 2009] Chen, B., Wild, D., and Guha, R. (2009). PubChem as a source of polypharmacology. Journal of Chemical Information and Modeling, 49(9):2044–2055.

[Chen et al., 2005] Chen, J., Swamidass, S. J., Dou, Y., and Baldi, P. (2005). ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics, 21:4133–4139.

[Chum et al., 2007] Chum, O., Philbin, J., Isard, M., and Zisserman, A. (2007). Scalable near identical image and shot detection. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, CIVR '07, pages 549–556, New York, NY, USA. ACM.

[Cole et al., 2004] Cole, R., Gottlieb, L.-A., and Lewenstein, M. (2004). Dictionary matching and indexing with errors and don't cares. In STOC, pages 91–100.

[Datar et al., 2004] Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, pages 253–262.

[Daugman, 1993] Daugman, J. (1993). High confidence visual recognition of persons by a test of statistical independence. IEEE Trans. Pattern Anal. Mach. Intell., 15(11):1148–1161.

[Daugman, 2001] Daugman, J. (2001). Statistical richness of visual phase information: Update on recognizing persons by iris patterns. International Journal of Computer Vision, 45(1):25–38.

[Daugman, 2007] Daugman, J. (2007). New methods in iris recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 37(5):1167–1175.

[Dayal et al., 2006] Dayal, U., Whang, K.-Y., Lomet, D. B., Alonso, G., Lohman, G. M., Kersten, M. L., Cha, S. K., and Kim, Y.-K., editors (2006). Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006. ACM.

[Deng et al., 2012] Deng, D., Li, G., and Feng, J. (2012). An efficient trie-based method for approximate entity extraction with edit-distance constraints. In ICDE, pages 762–773.

[Deng et al., 2013] Deng, D., Li, G., and Feng, J. (2013). Top-k string similarity search with edit-distance constraints. In ICDE.

[Flower, 1998] Flower, D. R. (1998). On the properties of bit string-based measures of chemical similarity. Journal of Chemical Information and Computer Sciences, 38(3):379–386.

[Fredman et al., 1984] Fredman, M. L., Komlós, J., and Szemerédi, E. (1984). Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544.

[Gan et al., 2012] Gan, J., Feng, J., Fang, Q., and Ng, W. (2012). Locality-sensitive hashing scheme based on dynamic collision counting. In SIGMOD Conference, pages 541–552.

[Gionis et al., 1999] Gionis, A., Indyk, P., and Motwani, R. (1999). Similarity search in high dimensions via hashing. In VLDB, pages 518–529.

[Gravano et al., 2001] Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., and Srivastava, D. (2001). Approximate string joins in a database (almost) for free. In VLDB.

[Indyk and Motwani, 1998] Indyk, P. and Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC.

[Knuth, 1973] Knuth, D. E. (1973). The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley.

[Kurtz, 1996] Kurtz, S. (1996). Approximate string searching under weighted edit distance. In Proc. of Third South American Workshop on String Processing, pages 156–170.

[Landré and Truchetet, 2007] Landré, J. and Truchetet, F. (2007). Image retrieval with binary Hamming distance. In VISAPP (2)'07, pages 237–240.

[Levenshtein, 1966] Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707–720.

[Li et al., 2008] Li, C., Lu, J., and Lu, Y. (2008). Efficient merging and filtering algorithms for approximate string searches. In ICDE.

[Li et al., 2007] Li, C., Wang, B., and Yang, X. (2007). VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB.

[Li et al., 2011a] Li, G., Deng, D., Wang, J., and Feng, J. (2011a). Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253–264.

[Li et al., 2011b] Li, Y., Terrell, A., and Patel, J. M. (2011b). WHAM: a high-throughput sequence alignment method. In SIGMOD Conference, pages 445–456.

[Liu et al., 2011] Liu, A. X., Shen, K., and Torng, E. (2011). Large scale Hamming distance query processing. In ICDE, pages 553–564.

[Manber and Wu, 1994] Manber, U. and Wu, S. (1994). An algorithm for approximate membership checking with application to password security. Inf. Process. Lett., 50(4):191–197.

[Manku et al., 2007] Manku, G. S., Jain, A., and Sarma, A. D. (2007). Detecting near-duplicates for web crawling. In WWW, pages 141–150.

[Masek and Paterson, 1980] Masek, W. J. and Paterson, M. (1980). A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18–31.

[Mihov and Schulz, 2004] Mihov, S. and Schulz, K. U. (2004). Fast approximate search in large dictionaries. Computational Linguistics, 30(4):451–477.

[Miller et al., 2005] Miller, M. L., Rodriguez, M. A., and Cox, I. J. (2005). Audio fingerprinting: Nearest neighbor search in high dimensional binary spaces. J. VLSI Signal Process. Syst., 41(3):285–291.

[Minsky and Papert, 1987] Minsky, M. and Papert, S. (1987). Perceptrons - an introduction to computational geometry. MIT Press.

[Myers, 1999] Myers, G. (1999). A fast bit-vector algorithm for approximate string matching based on dynamic programming. J. ACM, 46(3):395–415.

[Nasr et al., 2010] Nasr, R., Hirschberg, D., and Baldi, P. (2010). Hashing algorithms and data structures for rapid searches of fingerprint vectors. J. Chem. Inf. Model, 50(8):1358–68.

[Nasr et al., 2009] Nasr, R., Swamidass, S. J., and Baldi, P. (2009). Large scale study of multiple-molecule queries. J. Cheminformatics, 1:7.

[Nasr et al., 2012] Nasr, R., Vernica, R., Li, C., and Baldi, P. (2012). Speeding up chemical searches using the inverted index: The convergence of chemoinformatics and text search methods. J. Chem. Inf. Model.

[Norouzi et al., 2012] Norouzi, M., Punjani, A., and Fleet, D. J. (2012). Fast search in Hamming space with multi-index hashing. In CVPR, pages 3108–3115.

[Pagh and Rodler, 2001] Pagh, R. and Rodler, F. F. (2001). Cuckoo hashing. In ESA, pages 121–133.

[Philbin et al., 2007] Philbin, J., Chum, O., Isard, M., Sivic, J., and Zisserman, A. (2007). Object retrieval with large vocabularies and fast spatial matching. In CVPR.

[Qin et al., 2011] Qin, J., Wang, W., Lu, Y., Xiao, C., and Lin, X. (2011). Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 1033–1044.

[Qin et al., 2013] Qin, J., Zhou, X., Wang, W., and Xiao, C. (2013). Trie-based similarity search and join. In Guerrini, G., editor, EDBT/ICDT Workshops, pages 392–396. ACM.

[R. Nasr, 2011] Nasr, R., Kristensen, T., and Baldi, P. (2011). Tree and hashing data structures to speed up chemical searches: Analysis and experiments. Molecular Informatics, 30(9):791–800. Special Issue on Machine Learning Methods in Chemoinformatics/NIPS.

[Sarawagi and Kirpal, 2004] Sarawagi, S. and Kirpal, A. (2004). Efficient set joins on similarity predicates. In SIGMOD.

[Satuluri and Parthasarathy, 2012] Satuluri, V. and Parthasarathy, S. (2012). Bayesian locality sensitive hashing for fast similarity search. PVLDB, 5(5):430–441.

[Schmidt and Siegel, 1990] Schmidt, J. P. and Siegel, A. (1990). The spatial complexity of oblivious k-probe hash functions. SIAM J. Comput., 19(5):775–786.

[Swamidass and Baldi, 2007] Swamidass, S. and Baldi, P. (2007). Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sublinear time. J Chem Inf Model, 47(2):302–17.

[T. Bocek, 2007] Bocek, T., Hunt, E., and Stiller, B. (2007). Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich.

[Tabei et al., 2010] Tabei, Y., Uno, T., Sugiyama, M., and Tsuda, K. (2010). Single versus multiple sorting in all pairs similarity search. Journal of Machine Learning Research - Proceedings Track, 13:145–160.

[Theobald et al., 2008] Theobald, M., Siddharth, J., and Paepcke, A. (2008). SpotSigs: robust and efficient near duplicate detection in large web collections. In SIGIR, pages 563–570.

[Tsur, 2010] Tsur, D. (2010). Fast index for approximate string matching. J. Discrete Algorithms, 8(4):339–345.

[Ukkonen, 1985] Ukkonen, E. (1985). Algorithms for approximate string matching. Information and Control, 64(1-3):100–118.

[Wagner and Fischer, 1974] Wagner, R. A. and Fischer, M. J. (1974). The string-to-string correction problem. J. ACM, 21(1):168–173.

[Wang et al., 2010] Wang, J., Li, G., and Feng, J. (2010). Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1):1219–1230.

[Wang et al., 2009] Wang, W., Xiao, C., Lin, X., and Zhang, C. (2009). Efficient approximate entity extraction with edit constraints. In SIGMOD.

[Wu et al., 2009] Wu, Z., Ke, Q., Isard, M., and Sun, J. (2009). Bundling features for large scale partial-duplicate web image search. In CVPR, pages 25–32.

[Xiao et al., 2013] Xiao, C., Qin, J., Wang, W., Ishikawa, Y., Tsuda, K., and Sadakane, K. (2013). Efficient error-tolerant query autocompletion. In PVLDB.

[Xiao et al., 2008a] Xiao, C., Wang, W., and Lin, X. (2008a). Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944.

[Xiao et al., 2008b] Xiao, C., Wang, W., Lin, X., and Yu, J. X. (2008b). Efficient similarity joins for near duplicate detection. In WWW.

[Yang et al., 2008] Yang, X., Wang, B., and Li, C. (2008). Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In SIGMOD Conference, pages 353–364.

[Yao and Yao, 1997] Yao, A. C.-C. and Yao, F. F. (1997). Dictionary look-up with one error. J. Algorithms, 25(1):194–202.