Faster Approximate Pattern Matching: a Unified Approach


2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS)

Panagiotis Charalampopoulos (Department of Informatics, King's College London, UK; Institute of Informatics, University of Warsaw, Poland), [email protected]
Tomasz Kociumaka (Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel), [email protected]
Philip Wellnitz (Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany), [email protected]

Abstract—In the approximate pattern matching problem, given a text T, a pattern P, and a threshold k, the task is to find (the starting positions of) all substrings of T that are at distance at most k from P. We consider the two most fundamental string metrics: Under the Hamming distance, we search for substrings of T that have at most k mismatches with P, while under the edit distance, we search for substrings of T that can be transformed to P with at most k edits.

Exact occurrences of P in T have a very simple structure: If we assume for simplicity that |P| < |T| ≤ 3/2·|P| and that P occurs both as a prefix and as a suffix of T, then both P and T are periodic with a common period. However, an analogous characterization for occurrences with up to k mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are O(k²) k-mismatch occurrences of P in T, or both P and T are at Hamming distance O(k) from strings with a common string period of length O(m/k). We tighten this characterization by showing that there are O(k) k-mismatch occurrences in the non-periodic case, and we lift it to the edit distance setting, where we tightly bound the number of k-edit occurrences by O(k²) in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. In particular, we provide meta-algorithms that only rely on a small set of primitive operations. We showcase the generality of our meta-algorithms with results for the fully compressed setting, the dynamic setting, and the standard setting.

Keywords—approximate pattern matching, grammar compression, dynamic strings, Hamming distance, edit distance

I. INTRODUCTION

The pattern matching problem, asking to search for occurrences of a given pattern P in a given text T, is perhaps the most fundamental problem on strings. However, in most applications, finding all exact occurrences of a pattern is not enough: Think of human spelling mistakes or DNA sequencing errors, for example. In this work, we focus on approximate pattern matching, where we are interested in finding substrings of the text that are "similar" to the pattern. While various similarity measures are imaginable, we focus on the two most commonly encountered metrics in this context: the Hamming distance and the edit distance.

Hamming Distance: Recall that the Hamming distance of two (equal-length) strings is the number of positions where the strings differ. Now, given a text T of length n, a pattern P of length m, and an integer threshold k > 0, we want to compute the k-mismatch occurrences of P in T, that is, all length-m substrings of T that are at Hamming distance at most k from P. This pattern matching with mismatches problem has been extensively studied. In the late 1980s, Abrahamson [2] and Kosaraju [26] independently proposed an FFT-based O(n√(m log m))-time algorithm for computing the Hamming distance of P and all the length-m fragments of T. While their algorithms can be used to solve the pattern matching with mismatches problem, the first algorithm to benefit from the threshold k was given by Landau and Vishkin [27] and slightly improved by Galil and Giancarlo [15]: Based on so-called "kangaroo jumping", they obtained an O(nk)-time algorithm, which is faster than O(n√(m log m)) even for moderately large k. Amir et al. [4] developed two algorithms with running times O(n√(k log k)) and Õ(n + k³n/m), respectively; the latter algorithm was then improved upon by Clifford et al. [11], who presented an Õ(n + k²n/m)-time solution. Subsequently, Gawrychowski and Uznański [17] provided a smooth trade-off between the running times Õ(n√k) and Õ(n + k²n/m) by designing an Õ(n + kn/√m)-time algorithm. Very recently, Chan et al. [10] removed most of the polylog n factors in the latter solution at the cost of (Monte-Carlo) randomization. Furthermore, Gawrychowski and Uznański [17] showed that a significantly faster "combinatorial" algorithm would have (unexpected) consequences for the complexity of Boolean matrix multiplication. Pattern matching with mismatches on strings is thus well understood in the standard setting. Nevertheless, in the settings where the strings are not given explicitly, a similar understanding is yet to be obtained. One of the main contributions of this work is to improve the upper bounds for two such settings, obtaining algorithms with running times analogous to the algorithm of [11].

Edit Distance: Recall that for two strings S and T, the edit distance (also known as Levenshtein distance) is the minimum number of edits required to transform S into T. Here, an edit is an insertion, a substitution, or a deletion of a single character. In the pattern matching with edits problem, we are given a text T, a pattern P, and an integer threshold k > 0, and the task is to find all starting positions of the k-edit (or k-error) occurrences of P in T. Formally, we are to find all positions i in T such that the edit distance between T[i..j] and P is at most k for some position j. Again, a classic algorithm by Landau and Vishkin [28] runs in O(nk) time. Subsequent research [37], [13] resulted in an O(n + k⁴n/m)-time algorithm (which is faster for k ≤ ∛m). From a lower-bound perspective, we can benefit from the discovery that the classic quadratic-time algorithm for computing the edit distance of two strings is essentially optimal: Backurs and Indyk [5] recently proved that a significantly faster algorithm would yield a major breakthrough for the satisfiability problem. For pattern matching with edits, this means that there is no hope for an algorithm that is significantly faster than O(n + k²n/m); however, apart from that "trivial" lower bound and the 20-year-old conjecture of Cole and Hariharan [13] that an O(n + k³n/m)-time algorithm should be possible, nothing is known that would close this gap. While we do not manage to tighten this gap, we do believe that the structural insights we obtain may be useful for doing so. What we do manage, however, is to significantly improve the running time of the known algorithms in two settings where T and P are not given explicitly, thereby obtaining running times that can be seen as analogous to the running time of Cole and Hariharan's algorithm [13].

Grammar Compression: One of the settings that we consider in this paper is the fully compressed setting, when both the text T and the pattern P are given as straight-line programs. Compressing the text and the pattern is, in general, a natural thing to do—think of huge natural-language texts, for example. A common framework that captures many compression schemes represents strings as straight-line programs. We refer an interested reader to [36], [34], [30], [38] to learn more about grammar compression.

Working directly with a compressed representation of a text, intuitively at least, seems to be hard—in fact, Abboud et al. [1] showed that, for some problems, decompress-and-solve is the best we can hope for, under some reasonable assumptions from fine-grained complexity theory. Nevertheless, Jeż [22] managed to prove that exact pattern matching can be solved on grammar-compressed strings in near-linear time: Given an SLP of size n representing a string T and an SLP of size m representing a string P, we can find all exact occurrences of P in T in O((n + m) log |P|) time. For fully compressed approximate pattern matching, no such near-linear time algorithm is known, though. While the Õ((n + |P|)k⁴)-time algorithm by Bringmann et al. [8] for pattern matching with mismatches comes close, it works in an easier setting where only the text is compressed. We fill this void by providing the first algorithm for fully compressed pattern matching with mismatches that runs in near-linear time. Denote by Occ_k^H(P, T) the set of (starting positions of) k-mismatch occurrences of P in T; then, our result reads as follows.

Theorem I.1 (♠). Let G_T denote an SLP of size n generating a text T, let G_P denote an SLP of size m generating a pattern P, let k denote a threshold, and set N := |T| + |P|. Then, we can compute |Occ_k^H(P, T)| in time O(m log N + nk² log³ N). The elements of Occ_k^H(P, T) can be reported within O(|Occ_k^H(P, T)|) extra time.

For pattern matching with edits, near-linear time algorithms are not known even in the case that the pattern is given explicitly. Currently, the best pattern matching algorithms on an SLP-compressed text run in time O(n|P| log |P|) [41] and O(n(min{|P|k, k⁴ + |P|} + log |T|)) [6].

A full version of this paper is available at arxiv.org/abs/2004.08350. Proofs of the claims marked with ♠ are presented only in the full version.

2575-8454/20/$31.00 ©2020 IEEE. DOI: 10.1109/FOCS46700.2020.00095
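To make the problem statement concrete, the following is a minimal brute-force sketch of pattern matching with mismatches (it is not an algorithm from this paper): it compares P against every length-m window of T and reports the windows with at most k mismatches, in O(nm) time.

```python
def k_mismatch_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Brute-force k-mismatch search in O(|text| * |pattern|) time."""
    n, m = len(text), len(pattern)
    occ = []
    for i in range(n - m + 1):
        # Hamming distance between pattern and the window text[i:i+m]
        mismatches = sum(a != b for a, b in zip(text[i:i + m], pattern))
        if mismatches <= k:
            occ.append(i)
    return occ

print(k_mismatch_occurrences("abcabcabca", "abcd", 1))  # → [0, 3, 6]
```

The FFT-based and threshold-sensitive algorithms cited above improve on exactly this quadratic baseline.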
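The "kangaroo jumping" idea behind the O(nk) algorithm of Landau and Vishkin can be sketched as follows. To verify one alignment, one repeatedly jumps over the longest run of matching characters (a longest-common-extension, or LCE, query) and charges one mismatch per jump; an alignment is rejected after k+1 jumps. In the real algorithm, each LCE query takes O(1) time via a suffix tree with lowest-common-ancestor preprocessing; the naive LCE scan below is only for illustration, so this sketch shows the jumping structure but not the O(nk) bound.

```python
def lce(text, pattern, i, j):
    """Naive longest common extension of text[i:] and pattern[j:].
    (Landau-Vishkin answer such queries in O(1) after suffix-tree
    preprocessing; this linear scan is for illustration only.)"""
    l = 0
    while i + l < len(text) and j + l < len(pattern) and text[i + l] == pattern[j + l]:
        l += 1
    return l

def verify_alignment(text, pattern, i, k):
    """Kangaroo jumping: does pattern occur at position i of text with
    at most k mismatches?  Uses at most k+1 LCE jumps."""
    j, mismatches = 0, 0
    while j < len(pattern):
        j += lce(text, pattern, i + j, j)   # jump over the matching run
        if j < len(pattern):                # text[i+j] != pattern[j]
            mismatches += 1
            if mismatches > k:
                return False
            j += 1                           # skip the mismatching position
    return True

print([i for i in range(7) if verify_alignment("abcabcabca", "abcd", i, 1)])  # → [0, 3, 6]
```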
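For the edit distance setting, the textbook O(nm) dynamic program (due to Sellers) makes the definition of k-edit occurrences concrete: initializing the row for the empty pattern prefix to all zeros lets an occurrence start anywhere, and the positions j whose final DP value is at most k are exactly the end positions of k-edit occurrences. This is a baseline sketch, not the O(nk) method of Landau and Vishkin [28].

```python
def k_edit_end_positions(text, pattern, k):
    """O(|text| * |pattern|) DP for approximate matching: return all
    positions j such that some substring of text ending at j has edit
    distance at most k from pattern."""
    n, m = len(text), len(pattern)
    prev = [0] * (n + 1)                 # empty pattern prefix: cost 0 everywhere
    for i in range(1, m + 1):
        cur = [i] + [0] * n              # deleting i pattern characters
        for j in range(1, n + 1):
            sub = prev[j - 1] + (pattern[i - 1] != text[j - 1])
            cur[j] = min(sub,            # substitution or match
                         prev[j] + 1,    # delete pattern[i-1]
                         cur[j - 1] + 1) # insert text[j-1]
        prev = cur
    return [j for j in range(n + 1) if prev[j] <= k]

print(k_edit_end_positions("imissmiss", "miss", 0))  # → [5, 9]
```

Reporting the starting positions i, as in the problem statement above, additionally requires tracing back through the DP table (or running the DP on reversed strings).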
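For readers unfamiliar with straight-line programs: an SLP is a grammar in which every rule produces either a single character or the concatenation of two earlier rules, so each rule generates exactly one string, and a grammar of size n can generate a string of length exponential in n. The toy SLP below (generating a Fibonacci string; the encoding is illustrative, not from the paper) also shows that basic quantities such as generated lengths are computable directly from the grammar, without decompression.

```python
from functools import lru_cache

# A straight-line program: each rule is either a single character or the
# concatenation of two earlier rules (referenced by index).
rules = [
    "b",        # 0 generates "b"
    "a",        # 1 generates "a"
    (1, 0),     # 2 generates "ab"
    (2, 1),     # 3 generates "aba"
    (3, 2),     # 4 generates "abaab"
    (4, 3),     # 5 generates "abaababa"
]

def expand(i):
    """Decompress rule i (exponential output length in the worst case)."""
    r = rules[i]
    return r if isinstance(r, str) else expand(r[0]) + expand(r[1])

@lru_cache(maxsize=None)
def length(i):
    """Length of the string generated by rule i, in O(grammar size) total."""
    r = rules[i]
    return 1 if isinstance(r, str) else length(r[0]) + length(r[1])

print(expand(5))  # → abaababa
print(length(5))  # → 8
```

Algorithms in the fully compressed setting, such as Theorem I.1, must work with such grammar-level information, since expanding T or P may be exponentially expensive.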