Efficient and Private Distance Approximation in The
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
The One-Way Communication Complexity of Hamming Distance
THEORY OF COMPUTING, Volume 4 (2008), pp. 129–135 http://theoryofcomputing.org The One-Way Communication Complexity of Hamming Distance T. S. Jayram Ravi Kumar∗ D. Sivakumar∗ Received: January 23, 2008; revised: September 12, 2008; published: October 11, 2008. Abstract: Consider the following version of the Hamming distance problem for ±1 vec- n √ n √ tors of length n: the promise is that the distance is either at least 2 + n or at most 2 − n, and the goal is to find out which of these two cases occurs. Woodruff (Proc. ACM-SIAM Symposium on Discrete Algorithms, 2004) gave a linear lower bound for the randomized one-way communication complexity of this problem. In this note we give a simple proof of this result. Our proof uses a simple reduction from the indexing problem and avoids the VC-dimension arguments used in the previous paper. As shown by Woodruff (loc. cit.), this implies an Ω(1/ε2)-space lower bound for approximating frequency moments within a factor 1 + ε in the data stream model. ACM Classification: F.2.2 AMS Classification: 68Q25 Key words and phrases: One-way communication complexity, indexing, Hamming distance 1 Introduction The Hamming distance H(x,y) between two vectors x and y is defined to be the number of positions i such that xi 6= yi. Let GapHD denote the following promise problem: the input consists of ±1 vectors n √ n √ x and y of length n, together with the promise that either H(x,y) ≤ 2 − n or H(x,y) ≥ 2 + n, and the goal is to find out which of these two cases occurs. -
Fault Tolerant Computing CS 530 Information Redundancy: Coding Theory
Fault Tolerant Computing CS 530 Information redundancy: Coding theory Yashwant K. Malaiya Colorado State University April 3, 2018 1 Information redundancy: Outline • Using a parity bit • Codes & code words • Hamming distance . Error detection capability . Error correction capability • Parity check codes and ECC systems • Cyclic codes . Polynomial division and LFSRs 4/3/2018 Fault Tolerant Computing ©YKM 2 Redundancy at the Bit level • Errors can bits to be flipped during transmission or storage. • An extra parity bit can detect if a bit in the word has flipped. • Some errors an be corrected if there is enough redundancy such that the correct word can be guessed. • Tukey: “bit” 1948 • Hamming codes: 1950s • Teletype, ASCII: 1960: 7+1 Parity bit • Codes are widely used. 304,805 letters in the Torah Redundancy at the Bit level Even/odd parity (1) • Errors can bits to be flipped during transmission/storage. • Even/odd parity: . is basic method for detecting if one bit (or an odd number of bits) has been switched by accident. • Odd parity: . The number of 1-bit must add up to an odd number • Even parity: . The number of 1-bit must add up to an even number Even/odd parity (2) • The it is known which parity it is being used. • If it uses an even parity: . If the number of of 1-bit add up to an odd number then it knows there was an error: • If it uses an odd: . If the number of of 1-bit add up to an even number then it knows there was an error: • However, If an even number of 1-bit is flipped the parity will still be the same. -
Sequence Distance Embeddings
Sequence Distance Embeddings by Graham Cormode Thesis Submitted to the University of Warwick for the degree of Doctor of Philosophy Computer Science January 2003 Contents List of Figures vi Acknowledgments viii Declarations ix Abstract xi Abbreviations xii Chapter 1 Starting 1 1.1 Sequence Distances . ....................................... 2 1.1.1 Metrics ............................................ 3 1.1.2 Editing Distances ...................................... 3 1.1.3 Embeddings . ....................................... 3 1.2 Sets and Vectors ........................................... 4 1.2.1 Set Difference and Set Union . .............................. 4 1.2.2 Symmetric Difference . .................................. 5 1.2.3 Intersection Size ...................................... 5 1.2.4 Vector Norms . ....................................... 5 1.3 Permutations ............................................ 6 1.3.1 Reversal Distance ...................................... 7 1.3.2 Transposition Distance . .................................. 7 1.3.3 Swap Distance ....................................... 8 1.3.4 Permutation Edit Distance . .............................. 8 1.3.5 Reversals, Indels, Transpositions, Edits (RITE) . .................... 9 1.4 Strings ................................................ 9 1.4.1 Hamming Distance . .................................. 10 1.4.2 Edit Distance . ....................................... 10 1.4.3 Block Edit Distances . .................................. 11 1.5 Sequence Distance Problems . ................................. -
Dictionary Look-Up Within Small Edit Distance
Dictionary Lo okUp Within Small Edit Distance Ab dullah N Arslan and Omer Egeciog lu Department of Computer Science University of California Santa Barbara Santa Barbara CA USA farslanomergcsucsbedu Abstract Let W b e a dictionary consisting of n binary strings of length m each represented as a trie The usual dquery asks if there exists a string in W within Hamming distance d of a given binary query string q We present an algorithm to determine if there is a memb er in W within edit distance d of a given query string q of length m The metho d d+1 takes time O dm in the RAM mo del indep endent of n and requires O dm additional space Intro duction Let W b e a dictionary consisting of n binary strings of length m each A dquery asks if there exists a string in W within Hamming distance d of a given binary query string q Algorithms for answering dqueries eciently has b een a topic of interest for some time and have also b een studied as the approximate query and the approximate query retrieval problems in the literature The problem was originally p osed by Minsky and Pap ert in in which they asked if there is a data structure that supp orts fast dqueries The cases of small d and large d for this problem seem to require dierent techniques for their solutions The case when d is small was studied by Yao and Yao Dolev et al and Greene et al have made some progress when d is relatively large There are ecient algorithms only when d prop osed by Bro dal and Venkadesh Yao and Yao and Bro dal and Gasieniec The small d case has applications -
Searching All Seeds of Strings with Hamming Distance Using Finite Automata
Proceedings of the International MultiConference of Engineers and Computer Scientists 2009 Vol I IMECS 2009, March 18 - 20, 2009, Hong Kong Searching All Seeds of Strings with Hamming Distance using Finite Automata Ondřej Guth, Bořivoj Melichar ∗ Abstract—Seed is a type of a regularity of strings. Finite automata provide common formalism for many al- A restricted approximate seed w of string T is a fac- gorithms in the area of text processing (stringology), in- tor of T such that w covers a superstring of T under volving forward exact and approximate pattern match- some distance rule. In this paper, the problem of all ing and searching for borders, periods, and repetitions restricted seeds with the smallest Hamming distance [7], backward pattern matching [8], pattern matching in is studied and a polynomial time and space algorithm a compressed text [9], the longest common subsequence for solving the problem is presented. It searches for [10], exact and approximate 2D pattern matching [11], all restricted approximate seeds of a string with given limited approximation using Hamming distance and and already mentioned computing approximate covers it computes the smallest distance for each found seed. [5, 6] and exact covers [12] and seeds [2] in generalized The solution is based on a finite (suffix) automata ap- strings. Therefore, we would like to further extend the set proach that provides a straightforward way to design of problems solved using finite automata. Such a problem algorithms to many problems in stringology. There- is studied in this paper. fore, it is shown that the set of problems solvable using finite automata includes the one studied in this Finite automaton as a data structure may be easily imple- paper. -
Distance-Sensitive Hashing∗
Distance-Sensitive Hashing∗ Martin Aumüller Tobias Christiani IT University of Copenhagen IT University of Copenhagen [email protected] [email protected] Rasmus Pagh Francesco Silvestri IT University of Copenhagen University of Padova [email protected] [email protected] ABSTRACT tight up to lower order terms. In particular, we extend existing Locality-sensitive hashing (LSH) is an important tool for managing LSH lower bounds, showing that they also hold in the asymmetric high-dimensional noisy or uncertain data, for example in connec- setting. tion with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH CCS CONCEPTS framework is not known to yield good solutions, and instead ad hoc • Theory of computation → Randomness, geometry and dis- solutions have been designed for particular similarity and distance crete structures; Data structures design and analysis; • Informa- measures. For example, this is true for output-sensitive similarity tion systems → Information retrieval; search/join, and for indexes supporting annulus queries that aim to report a point close to a certain given distance from the query KEYWORDS point. locality-sensitive hashing; similarity search; annulus query In this paper we initiate the study of distance-sensitive hashing (DSH), a generalization of LSH that seeks a family of hash functions ACM Reference Format: Martin Aumüller, Tobias Christiani, Rasmus Pagh, and Francesco Silvestri. such that the probability of two points having the same hash value is 2018. Distance-Sensitive Hashing. In Proceedings of 2018 International Confer- a given function of the distance between them. More precisely, given ence on Management of Data (PODS’18). -
Arxiv:2102.08942V1 [Cs.DB]
A Survey on Locality Sensitive Hashing Algorithms and their Applications OMID JAFARI, New Mexico State University, USA PREETI MAURYA, New Mexico State University, USA PARTH NAGARKAR, New Mexico State University, USA KHANDKER MUSHFIQUL ISLAM, New Mexico State University, USA CHIDAMBARAM CRUSHEV, New Mexico State University, USA Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many diverse application domains. Locality Sensitive Hashing (LSH) is one of the most popular techniques for finding approximate nearest neighbor searches in high-dimensional spaces. The main benefits of LSH are its sub-linear query performance and theoretical guarantees on the query accuracy. In this survey paper, we provide a review of state-of-the-art LSH and Distributed LSH techniques. Most importantly, unlike any other prior survey, we present how Locality Sensitive Hashing is utilized in different application domains. CCS Concepts: • General and reference → Surveys and overviews. Additional Key Words and Phrases: Locality Sensitive Hashing, Approximate Nearest Neighbor Search, High-Dimensional Similarity Search, Indexing 1 INTRODUCTION Finding nearest neighbors in high-dimensional spaces is an important problem in several diverse applications, such as multimedia retrieval, machine learning, biological and geological sciences, etc. For low-dimensions (< 10), popular tree-based index structures, such as KD-tree [12], SR-tree [56], etc. are effective, but for higher number of dimensions, these index structures suffer from the well-known problem, curse of dimensionality (where the performance of these index structures is often out-performed even by linear scans) [21]. Instead of searching for exact results, one solution to address the curse of dimensionality problem is to look for approximate results. -
Are All Distributions Easy?
Are all distributions easy? Emanuele Viola∗ November 5, 2009 Abstract Complexity theory typically studies the complexity of computing a function h(x): f0; 1gn ! f0; 1gm of a given input x. We advocate the study of the complexity of generating the distribution h(x) for uniform x, given random bits. Our main results are: • There are explicit AC0 circuits of size poly(n) and depth O(1) whose output n P distribution has statistical distance 1=2 from the distribution (X; i Xi) 2 f0; 1gn × f0; 1; : : : ; ng for uniform X 2 f0; 1gn, despite the inability of these P circuits to compute i xi given x. Previous examples of this phenomenon apply to different distributions such as P n+1 (X; i Xi mod 2) 2 f0; 1g . We also prove a lower bound independent from n on the statistical distance be- tween the output distribution of NC0 circuits and the distribution (X; majority(X)). We show that 1 − o(1) lower bounds for related distributions yield lower bounds for succinct data structures. • Uniform randomized AC0 circuits of poly(n) size and depth d = O(1) with error can be simulated by uniform randomized circuits of poly(n) size and depth d + 1 with error + o(1) using ≤ (log n)O(log log n) random bits. Previous derandomizations [Ajtai and Wigderson '85; Nisan '91] increase the depth by a constant factor, or else have poor seed length. Given the right tools, the above results have technically simple proofs. ∗Supported by NSF grant CCF-0845003. Email: [email protected] 1 Introduction Complexity theory, with some notable exceptions, typically studies the complexity of com- puting a function h(x): f0; 1gn ! f0; 1gm of a given input x. -
B-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches
b-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches Shunsuke Kanda Yasuo Tabei RIKEN Center for Advanced Intelligence Project RIKEN Center for Advanced Intelligence Project Tokyo, Japan Tokyo, Japan [email protected] [email protected] Abstract—Recently, randomly mapping vectorial data to algorithms intending to build sketches of non-negative inte- strings of discrete symbols (i.e., sketches) for fast and space- gers (i.e., b-bit sketches) have been proposed for efficiently efficient similarity searches has become popular. Such random approximating various similarity measures. Examples are b-bit mapping is called similarity-preserving hashing and approximates a similarity metric by using the Hamming distance. Although minwise hashing (minhash) [12]–[14] for Jaccard similarity, many efficient similarity searches have been proposed, most of 0-bit consistent weighted sampling (CWS) for min-max ker- them are designed for binary sketches. Similarity searches on nel [15], and 0-bit CWS for generalized min-max kernel [16]. integer sketches are in their infancy. In this paper, we present Thus, developing scalable similarity search methods for b-bit a novel space-efficient trie named b-bit sketch trie on integer sketches is a key issue in large-scale applications of similarity sketches for scalable similarity searches by leveraging the idea behind succinct data structures (i.e., space-efficient data structures search. while supporting various data operations in the compressed Similarity searches on binary sketches are classified -
SIGMOD Flyer
DATES: Research paper SIGMOD 2006 abstracts Nov. 15, 2005 Research papers, 25th ACM SIGMOD International Conference on demonstrations, Management of Data industrial talks, tutorials, panels Nov. 29, 2005 June 26- June 29, 2006 Author Notification Feb. 24, 2006 Chicago, USA The annual ACM SIGMOD conference is a leading international forum for database researchers, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences. We invite the submission of original research contributions as well as proposals for demonstrations, tutorials, industrial presentations, and panels. We encourage submissions relating to all aspects of data management defined broadly and particularly ORGANIZERS: encourage work that represent deep technical insights or present new abstractions and novel approaches to problems of significance. We especially welcome submissions that help identify and solve data management systems issues by General Chair leveraging knowledge of applications and related areas, such as information retrieval and search, operating systems & Clement Yu, U. of Illinois storage technologies, and web services. Areas of interest include but are not limited to: at Chicago • Benchmarking and performance evaluation Vice Gen. Chair • Data cleaning and integration Peter Scheuermann, Northwestern Univ. • Database monitoring and tuning PC Chair • Data privacy and security Surajit Chaudhuri, • Data warehousing and decision-support systems Microsoft Research • Embedded, sensor, mobile databases and applications Demo. Chair Anastassia Ailamaki, CMU • Managing uncertain and imprecise information Industrial PC Chair • Peer-to-peer data management Alon Halevy, U. of • Personalized information systems Washington, Seattle • Query processing and optimization Panels Chair Christian S. Jensen, • Replication, caching, and publish-subscribe systems Aalborg University • Text search and database querying Tutorials Chair • Semi-structured data David DeWitt, U. -
Lower Bounds on Lattice Sieving and Information Set Decoding
Lower bounds on lattice sieving and information set decoding Elena Kirshanova1 and Thijs Laarhoven2 1Immanuel Kant Baltic Federal University, Kaliningrad, Russia [email protected] 2Eindhoven University of Technology, Eindhoven, The Netherlands [email protected] April 22, 2021 Abstract In two of the main areas of post-quantum cryptography, based on lattices and codes, nearest neighbor techniques have been used to speed up state-of-the-art cryptanalytic algorithms, and to obtain the lowest asymptotic cost estimates to date [May{Ozerov, Eurocrypt'15; Becker{Ducas{Gama{Laarhoven, SODA'16]. These upper bounds are useful for assessing the security of cryptosystems against known attacks, but to guarantee long-term security one would like to have closely matching lower bounds, showing that improvements on the algorithmic side will not drastically reduce the security in the future. As existing lower bounds from the nearest neighbor literature do not apply to the nearest neighbor problems appearing in this context, one might wonder whether further speedups to these cryptanalytic algorithms can still be found by only improving the nearest neighbor subroutines. We derive new lower bounds on the costs of solving the nearest neighbor search problems appearing in these cryptanalytic settings. For the Euclidean metric we show that for random data sets on the sphere, the locality-sensitive filtering approach of [Becker{Ducas{Gama{Laarhoven, SODA 2016] using spherical caps is optimal, and hence within a broad class of lattice sieving algorithms covering almost all approaches to date, their asymptotic time complexity of 20:292d+o(d) is optimal. Similar conditional optimality results apply to lattice sieving variants, such as the 20:265d+o(d) complexity for quantum sieving [Laarhoven, PhD thesis 2016] and previously derived complexity estimates for tuple sieving [Herold{Kirshanova{Laarhoven, PKC 2018]. -
Large Scale Hamming Distance Query Processing
Large Scale Hamming Distance Query Processing Alex X. Liu, Ke Shen, Eric Torng Department of Computer Science and Engineering Michigan State University East Lansing, MI 48824, U.S.A. {alexliu, shenke, torng}@cse.msu.edu Abstract—Hamming distance has been widely used in many a new solution to the Hamming distance range query problem. application domains, such as near-duplicate detection and pattern The basic idea is to hash each document into a 64-bit binary recognition. We study Hamming distance range query problems, string, called a fingerprint or a sketch, using the simhash where the goal is to find all strings in a database that are within a Hamming distance bound k from a query string. If k is fixed, algorithm in [11]. Simhash has the following property: if two we have a static Hamming distance range query problem. If k is documents are near-duplicates in terms of content, then the part of the input, we have a dynamic Hamming distance range Hamming distance between their fingerprints will be small. query problem. For the static problem, the prior art uses lots of Similarly, Miller et al. proposed a scheme for finding a song memory due to its aggressive replication of the database. For the that includes a given audio clip [20], [21]. They proposed dynamic range query problem, as far as we know, there is no space and time efficient solution for arbitrary databases. In this converting the audio clip to a binary string (and doing the paper, we first propose a static Hamming distance range query same for all the songs in the given database) and then finding algorithm called HEngines, which addresses the space issue in the nearest neighbors to the given binary string in the database prior art by dynamically expanding the query on the fly.