Efficient Privacy-Preserving General Edit Distance and Beyond

Efficient Privacy-Preserving General Edit Distance and Beyond Ruiyu Zhu Yan Huang Indiana University Indiana University Email: [email protected] Email: [email protected] Abstract—Edit distance is an important non-linear metric that secure computation protocols such as weighted edit distance, has many applications ranging from matching patient genomes Needleman-Wunsch, longest common subsequence (LCS), and to text-based intrusion detection. Depends on the application, heaviest common subsequence (HCS), using all existing ap- related string-comparison metrics, such as weighted edit distance, Needleman-Wunsch distance, longest common subsequences, and plicable optimizations including fixed-key hardware AES [12], heaviest common subsequences, can usually fit better than the [15], Half-Gate garbling [13], free-XOR technique [11]. We basic edit distance. When these metrics need to be calculated on report the performance of these protocols in the “Best Prior” sensitive input strings supplied by mutually distrustful parties, row of TableI, as well as in the performance charts of Figure4, it is more desirable but also more challenging to compute 5 in Section V-A and use them as baselines to evaluate our them in privacy-preserving ways. In this paper, we propose efficient secure computation protocols for private edit distance as new approach. Note that our baseline performance numbers well as several generalized applications including weighted edit are already much better than any generic protocols we can distance (with potentially content-dependent weights), longest find in the literature, simply because we have, for the first common subsequence, and heaviest common subsequence. Our time, applied the most recent optimizations (such as Half- protocols run 20+ times faster and use an order-of-magnitude Gates, efficient AESNI-based garbling, and highly customized less bandwidth than their best previous counterparts. Along- side, we propose a garbling scheme that allows free arithmetic circuits) to solve this particular set of problems. addition, free multiplication with constants, and low-cost com- To circumvent the deficiency of the generic approach, parison/minimum for inputs of restricted relative-differences. researchers have proposed some interesting heuristic methods Moreover, the encodings (i.e. wire-labels) in our garbling scheme that exploit a public reference string to compute the basic edit can be converted from and to encodings used by traditional binary circuit garbling schemes with light to moderate costs. distance over low-entropy genome strings [16], [17]. Wang et Therefore, while being extremely efficient on certain kinds of al. proposed to approximate edit distances by converting it to computations, the new garbling scheme remains composable and set-difference-size problem then used sketch algorithms to ap- capable of handling generic computational tasks. proximate the set-difference-size [16]. Asharov et al. proposed to divide strings into short segments and approximate the over- I. INTRODUCTION all edit distance by accumulating the scores on the individual Edit Distance quantifies the dissimilarity of two strings by segments [17]. As a result, these methods achieved very high the minimal number of editing operations (insert, delete, and efficiency and scalability. These heuristic methods, however, substitute) to transform one string to the other. It finds many have several notable limitations in common. First, they only interesting applications ranging from diagnosis and treatment work with low-entropy strings. When the entropy of the input of genetic diseases [1–3] to computer immunology [4] and strings increases, it is unclear how to find good reference intrusion detection [5], [6]. The basic edit distance can be strings as those papers did not discuss how to identify good generalized to settings where different costs are associated to reference strings, especially in a privacy-preserving manner. different kinds of editing operations (known as weighted edit Second, they do not support more generalized string metrics distance), and the costs of the edits can even depend on the such as weighted edit distance, Needleman-Wunsch, longest values of the operands (e.g., Needleman-Wunsch [7] distance). common subsequence (LCS), heaviest common subsequence To maximize utility, real world applications often favor vari- (HCS) [18] which are more widely used in the fields than the ants of edit distance with weights empirically adjusted based basic edit distance [2], [6], [8]. Finally, they assume a weaker on the likelihoods of various mutations [2], [6], [8]. threat model that leaks more than what is allowed by the To securely compute the basic edit distance, researchers standard security definition of private edit distance [19]. See have tried generic approaches using binary garbled circuit [9], SectionVI for more detailed discussion on the comparisons. [10]. However, due to the significant constant-factor blowups Thus, our work is motivated by the following two questions: in translating the computation into binary circuits, the cost of • Can private edit distance be done more efficiently over these protocols are prohibitive for practical uses. Moreover, arbitrary inputs according to the standard definition of even leveraging all the recent technical breakthroughs [10–14] security, without sacrificing accuracy? in generic secure computation, performance of the resulting protocols remains less than satisfactory to enable many real • Can the result be extended to compute other more general world applications. As a background-study part of this work, edit-distance-like string metrics used in broader scenarios? we have implemented a class of private-edit-distance-like Our approach to answering these questions is based on two insights. First, most part of these computations can be more Bellare et al. have proposed three security notions for garbling: efficiently realized with arithmetic circuits than binary circuits. privacy, obliviousness, and authenticity, which we summarize Second, we observe that, inherent to these applications, there is as below. special public correlation of intermediate values used by many • Privacy: there exists an efficient simulator S such that for component computations, e.g., the inputs to the minimum every x, circuit differ by some publicly inferable values. To leverage (F; e; d) Gb(1k; f); these insights, we propose a special garbling scheme and (F; X; d): ≈ S(1k; f; f(x) : X En(e; x): demonstrate with experiments that a range of edit-distance- like string-metrics can be securely computed an order-of- where “≈” symbolizes computational indistinguishability. magnitude more efficiently than the best prior work. • Obliviousness: there exists an efficient simulator S such Threat Model. In this paper, we focus on the semi-honest that for every x, model but leave it as an interesting future work to prevent (F; e; d) Gb(1k; f); (F; X): ≈ S(1k; f) : active attacks. X En(e; x): Contribution. We propose a new garbling scheme that ex- ploits interesting characteristics of edit-distance-like computa- • -Authenticity: for every efficient adversary A = (A1; A2), tions. When applied to this particular class of computations, 0 (f; x) A (1k); 1 Y 6= Ev(F; X) 1 the proposed garbling scheme features (almost) free addition, (F; e; d) Gb(1k; f); Pr B and : C ≤ . low cost comparison, minimum, and public table lookup (with B X En(e; x); C @ De(d; Y ) 6= ? A secret indices) operations. Its construction is conceptually Y A (1k; F; X): simple and we formally prove its security. Finally, the scheme 2 can also be extended to handle arbitrary functions through Many optimizations have been proposed to improve circuit tethering with traditional garbling of binary circuits. We garbling in various aspects such as bandwidth [13], [21], have implemented our scheme through exploiting the high- [22], evaluator’s computation [21], memory consumption [10], performance fixed-key AES cipher accelerated by Intel AESNI and using dedicated hardware [12], [13], [23]. State-of-the- instructions. art implementations of garbling schemes using AESNI can We experimentally evaluate our approach by applying the typically produce a garbled row of the garbled truth table in proposed garbling scheme to securely compute edit distance, roughly every 25ns [12], [14], [23]. weighted edit distance, Needleman-Wunsch distance, LCS, HCS. Unlike the approximation approach, our protocols will B. Edit Distance and its Variants and Generalizations always calculate precise results. Our experiments show that these new protocols run 20+ times faster and are an order- The edit distance (also known as Levenshtein distance) of-magnitude more bandwidth-efficient than the state-of-the- between any two strings s and t is the minimum number of art protocols (see TableI for some highlights of the perfor- edits needed to transform s into t, where an edit is typically mance comparisons). Moreover, we stress that, unlike secure one of three basic operations: insert, delete, and substitute. approximation protocols [16], [17], our approach keeps all the Algorithm1 is a standard dynamic programming approach to secret intermediate states of the computation, which can be compute the edit distance between two strings. The invariant obliviously used in a subsequent computation if needed. is that Di;j always represents the edit distance between s[1::i] and t[1::j]. Lines 1–2 initialize the first row of the matrix D II. BACKGROUND while lines 3–4 initialize the first column. Within the main Notations. We denote the computational parameter by κ. We nested loops (lines 5–7), Di;j is set at line 7 to the smallest use “a := b” to denote assigning the value of b to a; use of Di−1;j + cins , Di;j−1 + cdel , and Di−1;j−1 + csub , where “x S” to denote uniformly sampling an element of the set cins ; cdel , and csub correspond to the cost of insert, delete, and S and assigning it to x. substitute a single character (at any position). For basic edit distance, c := 1, c := 1, and c := (s[i] = t[j]) ? 0 : 1, A. Secure Garbling ins del sub i.e., each single-character insert, delete, and substitute incurs First proposed by Yao [20], garbled circuits were later one unit cost while matching characters costs zero.

Efficient Privacy-Preserving General Edit Distance and Beyond

Dictionary Look-Up Within Small Edit Distance

Approximate String Matching with Reduced Alphabet

Approximate Boyer-Moore String Matching

Problem Set 7 Solutions

3. Approximate String Matching

Boyer Moore Pattern Matching Algorithm Example

More Fuzzy String Matching!

Approximate String Matching Using Compressed Suffix Arrays

Levenshtein Distance Based Information Retrieval Veena G, Jalaja G BNM Institute of Technology, Visvesvaraya Technological University

A New Edit Distance for Fuzzy Hashing Applications

The String Edit Distance Matching Problem with Moves

Analysis of Algorithms and Data Structures for Text Indexing