
On Estimating Edit Distance: Alignment, Dimension Reduction, and Embeddings

Moses Charikar¹, Department of Computer Science, Stanford University, Stanford, CA, USA, [email protected]
Ofir Geri², Department of Computer Science, Stanford University, Stanford, CA, USA, [email protected]
Michael P. Kim³, Department of Computer Science, Stanford University, Stanford, CA, USA, [email protected]
William Kuszmaul, Department of Computer Science, Stanford University, Stanford, CA, USA, [email protected]

Abstract

Edit distance is a fundamental measure of distance between strings and has been widely studied in computer science. While the problem of estimating edit distance has been studied extensively, the equally important question of actually producing an alignment (i.e., the sequence of edits) has received far less attention. Somewhat surprisingly, we show that any algorithm to estimate edit distance can be used in a black-box fashion to produce an approximate alignment of strings, with modest loss in approximation factor and small loss in run time. Plugging in the result of Andoni, Krauthgamer, and Onak, we obtain an alignment that is a $(\log n)^{O(1/\varepsilon^2)}$ approximation in time $\tilde{O}(n^{1+\varepsilon})$.

Closely related to the study of approximation is the study of embeddings for edit distance. We show that min-hash techniques can be useful in designing edit distance embeddings through three results: (1) An embedding from Ulam distance (edit distance over permutations) to Hamming space that matches the best known distortion of $O(\log n)$ and also implicitly encodes a sequence of edits between the strings; (2) In the case where the edit distance between the input strings is known to have an upper bound $K$, we show that embeddings of edit distance into Hamming space with distortion $f(n)$ can be modified in a black-box fashion to give distortion $O(f(\mathrm{poly}(K)))$ for a class of periodic-free strings; (3) A randomized dimension-reduction map with contraction $c$ and asymptotically optimal expected distortion $O(c)$, improving on the previous $\tilde{O}(c^{1+2/\log\log\log n})$ distortion result of Batu, Ergun, and Sahinalp.

2012 ACM Subject Classification Theory of computation → Approximation algorithms analysis

Keywords and phrases edit distance, alignment, approximation algorithms, embedding, dimension reduction

Digital Object Identifier 10.4230/LIPIcs.ICALP.2018.34

Related Version A full version is available at https://arxiv.org/abs/1804.09907.

1 Supported by NSF grant CCF-1617577 and a Simons Investigator Award.
2 Supported by NSF grant CCF-1617577, a Simons Investigator Award, and the Graduate Fellowship in Computer Science in the School of Engineering at Stanford University.
3 Supported by NSF grant CCF-1763299.

© Moses Charikar, Ofir Geri, Michael P. Kim, and William Kuszmaul; licensed under Creative Commons License CC-BY. 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Editors: Ioannis Chatzigiannakis, Christos Kaklamanis, Dániel Marx, and Donald Sannella; Article No. 34; pp. 34:1–34:14. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany.

1 Introduction

The edit distance $\Delta_{ed}(x, y)$ between two strings $x$ and $y$ is the minimum number of character insertions, deletions, and substitutions needed to transform $x$ into $y$. This is a fundamental distance measure on strings, extensively studied in computer science [2–5, 9, 11, 17, 21, 24]. Edit distance has applications in areas including computational biology, signal processing, handwriting recognition, and image compression [19]. One of its oldest and most important uses is as a tool for comparing differences between genetic sequences [1, 19, 20].

The textbook dynamic-programming algorithm for edit distance runs in time $O(n^2)$ [20, 23, 24], and can be leveraged to recover a sequence of edits, also known as an alignment. The quadratic run time is prohibitively large for massive datasets (e.g., genomic data), and conditional lower bounds suggest that no strongly subquadratic time algorithm exists [5]. The difficulty of computing edit distance has motivated the development of fast heuristics [1, 10, 14, 19]. On the theoretical side, the tradeoff between run time and approximation factor (or distortion) is an important question (see [15, Section 6] and [16, Section 8.3.2]). Andoni and Onak [4] (building on beautiful work of Ostrovsky and Rabani [21]) gave an algorithm that estimates edit distance within a factor of $2^{\tilde{O}(\sqrt{\log n})}$ in time $n^{1+o(1)}$. The current best known tradeoff was obtained by Andoni, Krauthgamer, and Onak [3], who gave an algorithm that estimates edit distance to within factor $(\log n)^{O(1/\varepsilon)}$ with run time $O(n^{1+\varepsilon})$.
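To make the baseline concrete, the sketch below implements the textbook dynamic program together with a traceback recovering one optimal alignment; the function name and the tuple encoding of edit operations are our own illustrative choices.

```python
def edit_alignment(x, y):
    """Textbook O(|x||y|) dynamic program for edit distance, with a traceback
    that recovers one optimal alignment (a sequence of edits turning x into y)."""
    n, m = len(x), len(y)
    # dp[i][j] = edit distance between the prefixes x[:i] and y[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                           # delete x[i-1]
                           dp[i][j - 1] + 1,                           # insert y[j-1]
                           dp[i - 1][j - 1] + (x[i - 1] != y[j - 1]))  # match/substitute
    # Walk back from (n, m), emitting the edits of one optimal alignment.
    # Positions in the emitted tuples refer to indices of x (an ad hoc encoding).
    edits, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            edits.append(("delete", i - 1))
            i -= 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            edits.append(("insert", i, y[j - 1]))
            j -= 1
        else:
            if x[i - 1] != y[j - 1]:
                edits.append(("substitute", i - 1, y[j - 1]))
            i, j = i - 1, j - 1
    return dp[n][m], edits[::-1]

# Example: edit_alignment("kitten", "sitting") returns distance 3 and three edits.
```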

Alignment Recovery

While these algorithms produce estimates of edit distance, they do not produce an alignment between strings (i.e., a sequence of edits). By decoupling the problem of numerical estimation from the problem of alignment recovery, the authors of [4] and [3] are able to exploit techniques such as embeddings⁴ and random sampling in order to obtain better approximations. The algorithm of [3] runs in phases, with the $i$-th phase distinguishing between whether $\Delta_{ed}(x, y)$ is greater than $\frac{n}{2^i}$ or significantly smaller than $\frac{n}{2^i}$. At the beginning of each phase, a nuanced random process is used to select a small fraction of the positions in $x$, and then the entire phase is performed while examining only those positions. In total, the full algorithm samples an $\tilde{O}\left(\frac{n^{\varepsilon}}{\Delta_{ed}(x,y)}\right)$-fraction of the letters in $x$. Note that this is a polynomially small fraction, as we are interested in the case where $\Delta_{ed}(x, y) > n^{1/2}$ (if the edit distance is small, we can run the algorithm of Landau et al. [18] in linear time). Given that the algorithm only views a small portion of the positions in $x$, it is not clear how to recover a global alignment between the two strings.

We show, somewhat surprisingly, that any edit distance estimator can be turned into an approximate aligner in a black-box fashion with modest loss in approximation factor and small loss in run time. For example, plugging the result of [3] into our framework, we get an algorithm with distortion $(\log n)^{O(1/\varepsilon^2)}$ and run time $\tilde{O}(n^{1+\varepsilon})$. To the best of our knowledge, the best previous result that gave an approximate alignment was the work of Batu, Ergun, and Sahinalp [7], which has distortion that is polynomial in $n$.

4 The algorithm of [4] has a recursive structure in which at each level of recursion, every substring $\alpha$ of some length $l$ is assigned a vector $v_\alpha$ such that the $\ell_1$ distance between vectors closely approximates edit distance between substrings. The vectors at each level of recursion are constructed from the vectors in lower levels through a series of procedures culminating in an application of Bourgain's embedding to a sparse graph metric. As a result, although the distances between vectors in the top level of recursion allow for a numerical estimation of edit distance, it is not immediately clear how one might attempt to extract additional information from the vectors in order to recover an alignment.

Embeddings of Edit Distance Using Min-Hash Techniques

The study of approximation algorithms for edit distance closely relates to the study of embeddings [4, 7, 21]. An embedding from a metric space $M_1$ to a metric space $M_2$ is a map of points in $M_1$ to $M_2$ such that distances are preserved up to some factor $D$, known as the distortion. Loosely speaking, low-distortion embeddings from a complex metric space $M_1$ to a simpler metric space $M_2$ allow algorithm designers to focus on the simpler metric space, rather than directly handling the more complex one. Embeddings from edit distance to Hamming space have been widely studied [8, 9, 11, 21] and have played pivotal roles in the development of approximation algorithms [4] and streaming algorithms [8, 9].

The second contribution of this paper is to introduce new algorithms for three problems related to embeddings for edit distance. The algorithms are unified by the use of min-hash techniques to select pivots in strings. We find this technique to be particularly useful for edit distance because hashing the content of strings allows us to split strings in places where their content is aligned, thereby getting around the problem of insertions and deletions misaligning the strings. In several of our results, this allows us to obtain algorithms which are either more intuitive or simpler than their predecessors. The three results are summarized below.

Efficiently Embedding the Ulam Metric into Hamming Space: For the special case of the Ulam metric (edit distance on permutations), we present a randomized embedding 휑 of permutations of size 푛 to poly(푛)-dimensional Hamming space with distortion 푂(log 푛). Given strings 푥 and 푦, the Hamming differences between 휑(푥) and 휑(푦) not only approximate the edit distance between 푥 and 푦, but also implicitly encode a sequence of edits from 푥 to 푦. If the output string of our embedding is stored using a sparse vector representation, then the embedding can be computed in linear time, and its output can be stored in linear space. The logarithmic distortion matches that of Charikar and Krauthgamer’s embedding into ℓ1-space [11], which did not encode the actual edits and needed quadratic time and number of dimensions. Our embedding also supports efficient updates, and can be modified to reflect an edit in expected time 푂(log 푛) (as opposed to the deterministic linear time of [11]).

Embedding Edit Distance in the Low-Distance Regime: Recently, there has been considerable attention devoted to edit distance in the low-distance regime [8, 9]. In this regime, we are interested in finding algorithms that run faster or perform better given the promise that the edit distance between the input strings is small. This regime is of considerable interest from the practical point of view. Landau, Myers, and Schmidt [18] gave an exact algorithm for strings with edit distance $K$ that runs in time $O(n + K^2)$. Recently, Chakraborty, Goldenberg, and Koucký [9] gave a randomized embedding of edit distance into Hamming space that has distortion linear in the edit distance with probability at least 2/3. Given an embedding with distortion $\gamma(n)$ (a function of the input size), could one obtain an embedding whose distortion is a function of $K$, the edit distance, instead of $n$? We answer this question in the affirmative for the class of $(D, R)$-periodic free strings. We say that a string is $(D, R)$-periodic free if none of its substrings of length $D$ are periodic with period of at most $R$. For $D \in \mathrm{poly}(K)$ and $R = O(K^3)$, we show that the embedding of Ostrovsky and Rabani [21] can be used in a black-box fashion to obtain an embedding with distortion $2^{O(\sqrt{\log K \log\log K})}$ for $(D, R)$-periodic free strings with edit distance of at most $K$. Our result can be seen as building on the min-hash techniques of [11, Section 3.5] (which in turn extends ideas from [6]). The authors of [11] give an embedding for $(t, 180tK)$-non-repetitive strings with distortion $O(t \log(tK))$. The key difference is that our notion of $(D, R)$-periodic free is much less restrictive than the notion of non-repetitive strings studied in [11].


Optimal Dimension Reduction for Edit Distance: The aforementioned work of Batu et al. [7] introduced and studied an interesting notion of dimension reduction for edit distance: an embedding of edit distance on length-$n$ strings to edit distance on length-$n/c$ strings (with larger alphabet size) is called a dimension-reduction map with contraction $c$. By first performing dimension reduction, one can then apply inefficient algorithms to the contracted strings at a relatively small overall cost. This idea was used in [7] to design an approximation algorithm with approximation factor $O(n^{1/3+o(1)})$. We provide a dimension-reduction map with contraction $c$ and asymptotically optimal expected distortion $O(c)$, improving on the distortion of $\tilde{O}(c^{1+2/\log\log\log n})$ obtained by the deterministic map of [7].⁵

2 Preliminaries

Throughout the paper, we will use $\Sigma$ to denote an alphabet,⁶ and $\Sigma^n$ to denote the set of words of length $n$ over that alphabet. Additionally, we use $\mathcal{P}_n$ to denote the set of permutations of length $n$ over $\Sigma$, or equivalently, the subset of $\Sigma^n$ containing words whose letters are distinct. Given a string $w$ of length $n$, we denote its letters by $w_1, w_2, \ldots, w_n$, and we use $w[i : j]$ to denote the substring $w_i w_{i+1} \cdots w_j$ (which is empty if $j < i$).

An edit operation is either an insertion, a deletion, or a substitution of a letter with another letter in a word. Given words $x$ and $y$, an alignment from $x$ to $y$ is a sequence of edits transforming $x$ to $y$. The edit distance $\Delta_{ed}(x, y)$ is the minimum number of edits needed to transform $x$ to $y$. Alternatively, it is the length of an optimal alignment.

For convenience, we make the Simple Uniform Hashing Assumption [12], which assumes access to a fully independent family $\mathcal{H}$ of hash functions mapping $\Theta(\log n)$ bits to $\Theta(\log n)$ bits with constant time evaluation. For our applications, this can be simulated using the family of Pagh and Pagh [22], which is independent on any given set of size $n$ with high probability. The family can be constructed in linear time and uses $O(n \log n)$ random bits.

3 Alignment Recovery Using a Black-Box Approximation Algorithm

In this section, we show how to transform a black-box edit distance approximation algorithm 풜 into an approximate alignment algorithm ℬ. The algorithm ℬ appears here as Algorithm 1. In the description of the algorithm, we rely on the following definition of a partition.

▶ Definition 3.1. A partition of a string $u$ into $m$ parts is a tuple $P = (p_0, p_1, p_2, \ldots, p_m)$ such that $p_0 = 0$, $p_m = |u|$, and $p_0 \le p_1 \le \cdots \le p_m$. For $i \in \{1, \ldots, m\}$, the $i$-th part of $P$ is the subword $P_i := u[p_{i-1} + 1 : p_i]$, which is empty if $p_i = p_{i-1}$. A partition of a string $u$ into $m$ parts is an equipartition if each of the parts is of size either $\lfloor |u|/m \rfloor$ or $\lceil |u|/m \rceil$.

Formally, we assume that the approximation algorithm $\mathcal{A}$ has the following properties:
1. There is some non-decreasing function $\gamma$ such that for all $n > 0$, and for any two strings $u, v$ with $|u| + |v| \le n$, $\Delta_{ed}(u, v) \le \mathcal{A}(u, v) \le \gamma(n) \cdot \Delta_{ed}(u, v)$.
2. $\mathcal{A}(u, v)$ runs in time at most $T(n)$ for some non-decreasing function $T$ which is super-additive in the sense that $T(j) + T(k) \le T(j + k)$ for $j, k \ge 0$.

We are now ready to state the main theorem of this section.

5 When comparing these distortions, one should note that $2/\log\log\log n$ goes to zero very slowly; in particular, $c^{1+2/\log\log\log n} \ge c^{1.66}$ for all $n \le 10^{82}$, the number of atoms in the universe.
6 We assume that characters in $\Sigma$ can be represented in $\Theta(\log n)$ bits, where $n$ is the size of the input strings.

Algorithm 1 Black-Box Approximate Alignment Algorithm. Input: Strings 푢, 푣 with |푢| + |푣| ≤ 푛. Parameters: 푚 ∈ N satisfying 푚 ≥ 2 and an approximation algorithm 풜 for edit distance.

1. If $|u| \le 1$, then find an optimal alignment in time $O(|v|)$ naively.
2. Let $P = (p_0, p_1, \ldots, p_m)$ be an equipartition of $u$.
3. Let $S$ consist of the positions in $v$ which can be reached by adding or subtracting a power of $(1 + \frac{1}{m})$ to some $p_i$. Formally, define
$$S = \left( \{p_0, \ldots, p_m\} \cup \{|v|\} \cup \left\{ \left\lceil p_i \pm \left(1 + \frac{1}{m}\right)^{j} \right\rceil \,\middle|\, i, j \ge 0 \right\} \right) \cap \{0, \ldots, |v|\}.$$

4. Using dynamic programming, find a partition $Q = (q_0, \ldots, q_m)$ of $v$ such that each $q_i$ is in $S$, and such that the cost $\sum_{i=1}^{m} \mathcal{A}(P_i, Q_i)$ is minimized:
a. For $l \in S$, let $f(l, j)$ be the subproblem of returning a choice of $q_0, q_1, \ldots, q_j$ with $q_j = l$ which minimizes $\sum_{i=1}^{j} \mathcal{A}(P_i, Q_i)$.
b. Solve $f(l, j)$ by examining precomputed answers for subproblems of the form $f(l', j-1)$ with $l' \le l$, $l' \in S$: if $f(l', j-1)$ gives a choice of $q_0, q_1, \ldots, q_{j-1}$ with $\sum_{i=1}^{j-1} \mathcal{A}(P_i, Q_i) = t$, then we can set $q_j = l$ to get $\sum_{i=1}^{j} \mathcal{A}(P_i, Q_i) = t + \mathcal{A}(P_j, v[l' + 1 : l])$. (Here, $\mathcal{A}(P_j, v[l' + 1 : l])$ is computed using $\mathcal{A}$.)
5. Recurse on each pair $(P_i, Q_i)$. Combine the resulting alignments between each $P_i$ and $Q_i$ to obtain an alignment between $u$ and $v$.
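The following sketch mirrors Algorithm 1 in Python, returning the size of the alignment it finds (recovering the explicit edits follows the same recursion, bottoming out in the naive base case). The parameter defaults and rounding details are our own choices, and the black-box estimator `A`, purely for illustration, defaults to the exact distance.

```python
import math

def exact_ed(u, v):
    """Exact edit distance; an illustrative stand-in for the black-box estimator A."""
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, 1):
        cur = [i]
        for j, cv in enumerate(v, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cu != cv)))
        prev = cur
    return prev[-1]

def approx_align_cost(u, v, A=exact_ed, m=4):
    """Sketch of Algorithm 1: returns the number of edits in the alignment found."""
    n = len(u) + len(v)
    if len(u) <= 1:
        # Step 1: an optimal alignment is found naively when |u| <= 1.
        return len(u) if not v else len(v) - (1 if u and u in v else 0)
    # Step 2: equipartition of u with cut points p_0 <= ... <= p_m.
    p = [(i * len(u)) // m for i in range(m + 1)]
    # Step 3: candidate cut points S for v (each p_i shifted by powers of 1 + 1/m).
    S = {0, len(v)} | set(p)
    for pi in p:
        x = 1.0
        while x <= n + 1:
            for s in (math.ceil(pi + x), math.ceil(pi - x)):
                if 0 <= s <= len(v):
                    S.add(s)
            x *= 1 + 1 / m
    S = sorted(s for s in S if 0 <= s <= len(v))
    # Step 4: dynamic program over f(l, j) = cheapest q_0, ..., q_j with q_j = l.
    best = {l: (0 if l == 0 else math.inf) for l in S}
    parent = {}
    for j in range(1, m + 1):
        new = {}
        for l in S:
            cost, lp = min((best[lp] + A(u[p[j - 1]:p[j]], v[lp:l]), lp)
                           for lp in S if lp <= l)
            new[l], parent[(j, l)] = cost, lp
        best = new
    q = [len(v)]                 # recover Q = (q_0, ..., q_m) from the parent pointers
    for j in range(m, 0, -1):
        q.append(parent[(j, q[-1])])
    q.reverse()
    # Step 5: recurse on each pair (P_i, Q_i) and combine the costs.
    return sum(approx_align_cost(u[p[i - 1]:p[i]], v[q[i - 1]:q[i]], A, m)
               for i in range(1, m + 1))
```

With the exact stand-in ($\gamma(n) = 1$) the guarantee of Theorem 3.2 below applies directly; substituting a fast estimator such as that of [3] yields the trade-offs of Corollaries 3.3–3.5.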

▶ Theorem 3.2. For all $u, v$ with $|u| + |v| \le n$ and $m \ge 2$, Algorithm 1 outputs an alignment from $u$ to $v$ of size at most $(3\gamma(n))^{O(\log_m n)} \cdot \Delta_{ed}(u, v)$. Moreover, the run time is $\tilde{O}(m^5 \cdot T(n))$.

Before continuing, we provide a brief discussion of Algorithm 1. The algorithm first breaks $u$ into a partition $P$ of $m$ equal parts. It then uses the black-box algorithm $\mathcal{A}$ to search for a partition $Q$ of $v$ such that $\sum_i \Delta_{ed}(P_i, Q_i)$ is near minimal; after finding such a $Q$, the algorithm recurses to find approximate alignments between $P_i$ and $Q_i$ for each $i$. Rather than considering every option for the partition $Q = (q_0, \ldots, q_m)$, the algorithm limits itself to those for which each $q_i$ comes from a relatively small set $S$. The set $S$ is carefully designed so that although it is small, any optimal partition $Q^{opt}$ of $v$ can be in some sense well approximated by some partition $Q$ using only $q_i$ values from $S$. This limits the multiplicative error introduced at each level of recursion to be bounded by $3\gamma(n)$; across the $O(\log_m n)$ levels of recursion, the total multiplicative error becomes $(3\gamma(n))^{O(\log_m n)}$. The fact that the recursion depth appears in the exponent of the multiplicative error is why we partition $u$ and $v$ into many parts at each level.

Next we discuss several implications of Theorem 3.2. The parameter $m$ allows us to trade off the approximation factor and the run time of the algorithm. When taken to the extreme, this gives two particularly interesting results.

▶ Corollary 3.3. Let $0 < \varepsilon < 1$ (not necessarily constant). Then $m$ can be chosen so that Algorithm 1 has approximation ratio $(3\gamma(n))^{O(1/\varepsilon)}$ and run time $\tilde{O}(T(n) \cdot n^{\varepsilon})$.

▶ Corollary 3.4. Let $0 < \varepsilon < 1$ (not necessarily constant). Then $m$ can be chosen so that Algorithm 1 has approximation ratio $n^{O(\varepsilon)}$ and run time $\tilde{O}(T(n)) \cdot (3\gamma(n))^{O(1/\varepsilon)}$.

We can apply Corollary 3.3 to the algorithm of Andoni et al. [3] with approximation ratio $(\log n)^{O(1/\varepsilon)}$ and run time $O(n^{1+\varepsilon})$ as follows. (Note that $\varepsilon$ may be $o(1)$.)


▶ Corollary 3.5. There exists an approximate-alignment algorithm which runs in time $\tilde{O}(n^{1+\varepsilon})$, and has approximation factor $(\log n)^{O(1/\varepsilon^2)}$ with probability $1 - \frac{1}{\mathrm{poly}(n)}$.

3.1 Proof of Theorem 3.2

The proof of the theorem will follow from Proposition 3.6, which bounds the run time of Algorithm 1, and Proposition 3.11, which bounds the approximation ratio. Throughout this section, let $u$, $v$, and $m$ be the values given to Algorithm 1. Let $P = (p_0, \ldots, p_m)$ be the equipartition of $u$, and let $S$ be the set defined by Algorithm 1.

▶ Proposition 3.6. Algorithm 1 runs in time $\tilde{O}(T(|u| + |v|) \cdot m^5)$.

Proof. If $|u| \le 1$, then we can find an optimal alignment in time $O(|v|)$ naively. Suppose $|u| > 1$. Notice that $|S| \le O(m^2 \log n)$. In particular, because $(1 + \frac{1}{m})^{(m+1)\ln n} \ge n$,
$$S \subseteq \{p_0, \ldots, p_m\} \cup \{|v|\} \cup \left\{ \left\lceil p_i \pm \left(1 + \frac{1}{m}\right)^{j} \right\rceil \,\middle|\, i \in [0 : m],\ j \in [0 : (m+1)\ln n] \right\},$$

which has size at most $O(m^2 \log n)$. Finding an equipartition of $u$ can be done in linear time, and constructing $S$ takes time $O(|S|) = \tilde{O}(m^2)$.

In order to perform the fourth step, which selects $Q$, we must compute $f(l, j)$ for each $l \in S$ and $j \in [0 : m]$. To evaluate $f(l, j)$, we must consider each $l' \in S$ satisfying $l' \le l$, and then compute the cost of $f(l', j - 1)$ plus $\mathcal{A}(P_j, v[l' + 1 : l])$ (which takes time at most $T(|u| + |v|)$ to compute). Therefore, each $f(l, j)$ is computed in time $O(|S| \cdot T(|u| + |v|)) \le \tilde{O}(T(|u| + |v|) \cdot m^2)$. Because there are $O(m \cdot |S|) = \tilde{O}(m^3)$ subproblems of the form $f(l, j)$, the total run time of the dynamic program is $\tilde{O}(T(|u| + |v|) \cdot m^5)$.

So far we have shown that the first level of recursion takes time $\tilde{O}(T(|u| + |v|) \cdot m^5)$. The sum of the lengths of the inputs to Algorithm 1 at a particular level of the recursion is at most $|u| + |v|$. It follows by the super-additivity of $T(n)$ that the time spent in any given level of recursion is at most $\tilde{O}(T(|u| + |v|) \cdot m^5)$. Because each level of recursion reduces the sizes of the parts of $u$ by a factor of $\Omega(m)$, the number of levels is at most $O(\log_m n) \le O(\log n)$. Therefore, the run time is $\tilde{O}(T(|u| + |v|) \cdot m^5)$. ◀

When discussing the approximation ratio of Algorithm 1, it will be useful to have a notion of edit distance between partitions of strings.

▶ Definition 3.7. Given two partitions $C = (c_0, \ldots, c_m)$ and $D = (d_0, \ldots, d_m)$ of strings $a$ and $b$ respectively, we define $\Delta_{ed}(C, D) := \sum_i \Delta_{ed}(C_i, D_i)$.

In order to bound the approximation ratio of Algorithm 1, we will introduce, for the sake of analysis, a partition $Q^{opt} = (q_0^{opt}, \ldots, q_m^{opt})$ of $v$ satisfying $\Delta_{ed}(P, Q^{opt}) = \Delta_{ed}(u, v)$. Recall that $P$ is fixed, which allows us to use it in the definition of $Q^{opt}$.

We claim that some partition $Q^{opt}$ satisfying $\Delta_{ed}(P, Q^{opt}) = \Delta_{ed}(u, v)$ must exist. If $u$ and $v$ differed by only a single edit, one could start from $P$ and explicitly define $Q^{opt}$ so that $\Delta_{ed}(P, Q^{opt}) = 1$ (by a case analysis of which type of edit was performed). It can then be shown by induction on the number of edits that, in general, we can obtain a partition $Q^{opt}$ satisfying $\Delta_{ed}(P, Q^{opt}) = \Delta_{ed}(u, v)$.

Our strategy for bounding the approximation ratio will be to compare $\Delta_{ed}(P, Q)$ for the partition $Q$ selected by our algorithm to $\Delta_{ed}(P, Q^{opt})$. We do this through three observations. The first observation upper bounds $\Delta_{ed}(P, Q)$. Informally, it shows that the cost in edit distance which Algorithm 1 pays for selecting $Q$ instead of $Q^{opt}$ is at most $2 \sum_{i=1}^{m} |q_i - q_i^{opt}|$.

▶ Lemma 3.8. Let $Q = (q_0, \ldots, q_m)$ be a partition of $v$. Then
$$\Delta_{ed}(P, Q) \le \Delta_{ed}(u, v) + 2 \sum_{i=1}^{m} |q_i - q_i^{opt}|.$$

Proof. Observe that
$$\Delta_{ed}(P, Q) \le \Delta_{ed}(P, Q^{opt}) + \Delta_{ed}(Q^{opt}, Q) = \Delta_{ed}(u, v) + \sum_{i=1}^{m} \Delta_{ed}(Q_i^{opt}, Q_i).$$

Because $Q$ and $Q^{opt}$ are both partitions of $v$, $\Delta_{ed}(Q_i, Q_i^{opt}) \le |q_{i-1} - q_{i-1}^{opt}| + |q_i - q_i^{opt}|$. In particular, $|q_{i-1} - q_{i-1}^{opt}|$ insertions to the left side of one of $Q_i$ or $Q_i^{opt}$ (whichever has its start point further to the right) will result in the two substrings having the same start point; and then $|q_i - q_i^{opt}|$ insertions to the right side of one of $Q_i$ or $Q_i^{opt}$ (whichever has its end point further to the left) will result in the two substrings having the same end point. Thus

$$\Delta_{ed}(u, v) + \sum_{i=1}^{m} \Delta_{ed}(Q_i^{opt}, Q_i) \le \Delta_{ed}(u, v) + \sum_{i=1}^{m} \left( |q_{i-1} - q_{i-1}^{opt}| + |q_i - q_i^{opt}| \right) \le \Delta_{ed}(u, v) + 2 \sum_{i=1}^{m} |q_i - q_i^{opt}|,$$

where we are able to disregard the case of $i = 0$ because $q_0 = q_0^{opt} = 0$. ◀

The next observation establishes a lower bound for Δed(푢, 푣).

▶ Lemma 3.9. $\Delta_{ed}(u, v) \ge \frac{1}{m} \sum_{i=1}^{m} |p_i - q_i^{opt}|$.

Proof. Because $\Delta_{ed}(P, Q^{opt}) = \Delta_{ed}(u, v)$, we must have that for each $i \in [m]$,
$$\Delta_{ed}(u, v) = \Delta_{ed}(u[1 : p_i], v[1 : q_i^{opt}]) + \Delta_{ed}(u[p_i + 1 : |u|], v[q_i^{opt} + 1 : |v|]).$$

Notice, however, that the strings $u[1 : p_i]$ and $v[1 : q_i^{opt}]$ differ in length by at least $|q_i^{opt} - p_i|$. Therefore, their edit distance must be at least $|q_i^{opt} - p_i|$, implying that $\Delta_{ed}(u, v) \ge |q_i^{opt} - p_i|$. Thus $\frac{1}{m} \Delta_{ed}(u, v) \ge \frac{1}{m} |q_i^{opt} - p_i|$. Summing over $i \in [m]$ gives the desired inequality. ◀

So far we have shown that the cost in edit distance which Algorithm 1 pays for selecting $Q$ instead of $Q^{opt}$ is at most $2 \sum_{i=1}^{m} |q_i - q_i^{opt}|$ (Lemma 3.8), and that the edit distance from $u$ to $v$ is at least $\frac{1}{m} \sum_{i=1}^{m} |p_i - q_i^{opt}|$ (Lemma 3.9). Next we compare these two quantities. In particular, we show that if $Q$ is chosen to mimic $Q^{opt}$ as closely as possible, then each of the $|q_i - q_i^{opt}|$ will become small relative to each of the $|p_i - q_i^{opt}|$.

▶ Lemma 3.10. There exists a partition $Q = (q_0, \ldots, q_m)$ of $v$ such that each $q_i$ is in $S$, and such that for each $i \in [0 : m]$, $|q_i - q_i^{opt}| \le \frac{1}{m} |p_i - q_i^{opt}|$.

Proof Sketch. Consider the partition $Q$ in which $q_i$ is chosen to be the largest $s \in S$ satisfying $s \le q_i^{opt}$. Observe that: (i) because $0 \in S$, each $q_i$ always exists; (ii) because $|v| \in S$, we will have $q_m = |v|$; and (iii) because $q_0^{opt} \le q_1^{opt} \le \cdots \le q_m^{opt}$, we will have that $q_0 \le q_1 \le \cdots \le q_m$. Therefore, $Q$ is a well-defined partition of $v$.

It remains to prove $|q_i - q_i^{opt}| \le \frac{1}{m} |p_i - q_i^{opt}|$. For brevity, we focus on the case of $p_i < q_i^{opt}$. The other cases are conceptually similar and appear in the full version of this paper.


Assume $p_i < q_i^{opt}$. Consider the largest non-negative integer $j$ such that $p_i + (1 + \frac{1}{m})^j \le q_i^{opt}$. One can verify that by definition of $j$, we must have
$$\left(1 + \frac{1}{m}\right)^{j} \le q_i^{opt} - p_i \le \left(1 + \frac{1}{m}\right)^{j+1}. \tag{3.1}$$

It follows that
$$q_i^{opt} - \left( p_i + \left(1 + \frac{1}{m}\right)^{j} \right) \le \left(1 + \frac{1}{m}\right)^{j+1} - \left(1 + \frac{1}{m}\right)^{j} = \frac{1}{m} \left(1 + \frac{1}{m}\right)^{j}. \tag{3.2}$$

Since $\lceil p_i + (1 + \frac{1}{m})^j \rceil \in S$, the definition of $q_i$ ensures that $q_i$ is between $p_i + (1 + \frac{1}{m})^j$ and $q_i^{opt}$ inclusive. Therefore, (3.2) implies $q_i^{opt} - q_i \le \frac{1}{m} (1 + \frac{1}{m})^j$. Combining this with (3.1), it follows that $q_i^{opt} - q_i \le \frac{1}{m} (q_i^{opt} - p_i)$, as desired. ◀

We are now equipped to bound the approximation ratio of Algorithm 1, thereby completing the proof of Theorem 3.2. We will use the preceding lemmas to bound the approximation ratio at each level of recursion to $O(\gamma(n))$. The approximation ratio will then multiply across the $O(\log_m n)$ levels of recursion, giving total approximation ratio $O(\gamma(n))^{O(\log_m n)}$.

▶ Proposition 3.11. Let $E(u, v)$ be the number of edits returned by Algorithm 1. Then
$$\Delta_{ed}(u, v) \le E(u, v) \le \Delta_{ed}(u, v) \cdot (3\gamma(n))^{O(\log_m n)}.$$

Proof. Because Algorithm 1 finds a sequence of edits from $u$ to $v$, clearly $\Delta_{ed}(u, v) \le E(u, v)$. By Lemma 3.10 there is some partition $Q = (q_0, \ldots, q_m)$ of $v$ such that each $q_i$ is in $S$, and such that for each $i \in [0 : m]$, $|q_i - q_i^{opt}| \le \frac{1}{m} |p_i - q_i^{opt}|$. By Lemma 3.9, it follows that
$$\sum_{i=1}^{m} |q_i - q_i^{opt}| \le \frac{1}{m} \sum_{i=1}^{m} |p_i - q_i^{opt}| \le \Delta_{ed}(u, v).$$
Applying Lemma 3.8, we then get that
$$\Delta_{ed}(P, Q) \le \Delta_{ed}(u, v) + 2 \sum_{i=1}^{m} |q_i - q_i^{opt}| \le 3 \Delta_{ed}(u, v).$$

Thus there is some $Q$ which Algorithm 1 is allowed to select such that $\Delta_{ed}(P, Q) \le 3 \Delta_{ed}(u, v)$. Since the approximation ratio of $\mathcal{A}$ is $\gamma(n)$, the true partition $Q$ selected at the first level of recursion must satisfy $\Delta_{ed}(P, Q) \le 3\gamma(n) \Delta_{ed}(u, v)$.

After the $i$-th level of recursion, $u$ has implicitly been split into a large partition $P^i$, $v$ has implicitly been split into a large partition $Q^i$, and the recursive subproblems are searching for edits between pairs of parts of $P^i$ and $Q^i$. Since there is $3\gamma(n)$ distortion at each level, we can get by induction that $\Delta_{ed}(P^i, Q^i) \le (3\gamma(n))^i \Delta_{ed}(u, v)$. Since there are $O(\log_m n)$ levels of recursion, the number of edits returned by the algorithm is at most $\Delta_{ed}(u, v) \cdot (3\gamma(n))^{O(\log_m n)}$. ◀

4 Embeddings and Dimension Reduction Using Min-Hash Techniques

4.1 Alignment Embeddings for Permutations

Here we present a randomized embedding from $\mathcal{P}_n$, the set of permutations of length $n$, into Hamming space with expected distortion $O(\log n)$.

Algorithm 2 Alignment Embedding for Permutations.

Input: A string $w = w_1 \cdots w_n \in \mathcal{P}_n$. Parameters: $\varepsilon$ and $m \ge \log_{\frac{1}{1/2+\varepsilon}} n + 1$.

1. At the first level of recursion only:
a. Initialize an array $A$ of size $2^m - 1$ (indexed starting at one) with zeros. The array $A$ will contain the output embedding.
b. Select a hash function $h$ mapping $\Sigma$ to $r \log n$ bits for a sufficiently large constant $r$.⁸
2. Let $i$ minimize $h(w_i)$ out of the $i \in [n/2 - \varepsilon n : n/2 + \varepsilon n]$. We call $w_i$ the pivot in $w$.
3. Set $A[2^{m-1}] = w_i$.
4. Recursively embed $w_1 \cdots w_{i-1}$ into $A[1 : 2^{m-1} - 1]$.
5. Recursively embed $w_{i+1} \cdots w_n$ into $A[2^{m-1} + 1 : 2^m - 1]$.
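A minimal Python rendering of Algorithm 2 appears below. It stores $A$ densely rather than run-length encoded, simulates the hash family with a deterministic per-symbol priority, and pads $m$ with a little slack to absorb integer rounding; all three are our own implementation choices, not the paper's.

```python
import hashlib
import math

def ulam_embed(w, eps=0.25, seed=0):
    """Sketch of Algorithm 2: embed a permutation w (with 0 not among its
    symbols, since 0 plays the null character) into an array A of size 2^m - 1."""
    def h(c):
        # Stand-in for the hash function h: a fixed pseudo-random priority per symbol,
        # consistent across calls so that x and y are hashed identically.
        return hashlib.sha256(f"{seed}:{c}".encode()).digest()
    n = len(w)
    # Depth bound m >= log_{1/(1/2+eps)} n + 1, plus slack for rounding effects.
    m = math.ceil(math.log(max(n, 2)) / math.log(1 / (0.5 + eps))) + 2
    A = [0] * (2 ** m - 1)
    def rec(sub, offset, k):
        if not sub:
            return
        assert k >= 1, "m too small for this n and eps"
        mid = offset + 2 ** (k - 1) - 1              # center of the current block of A
        if len(sub) == 1:
            A[mid] = sub[0]
            return
        nn = len(sub)
        lo = max(0, math.floor(nn / 2 - eps * nn))
        hi = min(nn - 1, math.ceil(nn / 2 + eps * nn))
        i = min(range(lo, hi + 1), key=lambda t: h(sub[t]))  # step 2: the pivot
        A[mid] = sub[i]                              # step 3
        rec(sub[:i], offset, k - 1)                  # step 4: left of the pivot
        rec(sub[i + 1:], mid + 1, k - 1)             # step 5: right of the pivot
    rec(list(w), 0, m)
    return A
```

Since two permutations in $\mathcal{P}_n$ receive arrays of the same length, the Hamming estimate is simply `sum(a != b for a, b in zip(ulam_embed(x), ulam_embed(y)))`.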

The embedding has the surprising property that it implicitly encodes alignments between strings. If the output is stored using run-length encoding,⁷ then the size of the output and the run time are both $O(n)$.

The description of the embedding appears as Algorithm 2. For simplicity, we assume $0 \notin \Sigma$, which allows us to use 0 as a null character. The algorithm takes two parameters: $\varepsilon$ and $m$. The parameter $\varepsilon$ controls a trade-off between the distortion and the output dimension. The parameter $m$ dictates the maximum depth of recursion that can be performed within the array $A$. In particular, $m$ needs to be chosen such that the algorithm does not run out of space for the embedding in the recursive calls. Since each recursive step takes as input words of size in the range $[(1/2 - \varepsilon)n : (1/2 + \varepsilon)n]$, the input size at the $i$-th level of the recursion is at most $(1/2 + \varepsilon)^{i-1} n$. We need to choose $m$ such that at the $m$-th level of recursion, the input size will be at most 1. Therefore, it suffices to pick $m$ satisfying $m \ge \log_{\frac{1}{1/2+\varepsilon}} n + 1$.

We denote the resulting embedding of the input string $w$ into the output array $A$ by $\varphi_{\varepsilon,m}(w)$. Moreover, for $m = \lceil \log_{\frac{1}{1/2+\varepsilon}} n + 1 \rceil$, we define $\varphi_\varepsilon(w)$ to be $\varphi_{\varepsilon,m}(w)$. Note that $\varphi_\varepsilon$ embeds $w$ into an array $A$ of size $O\left(2^{\log_{\frac{1}{1/2+\varepsilon}} n}\right) = O\left(n^{-1/\log(1/2+\varepsilon)}\right)$, which one can verify for $\varepsilon \le \frac{1}{4}$ is $O(n^{1+6\varepsilon})$.

We call $\varphi_\varepsilon$ an alignment embedding because $\varphi_\varepsilon$ maps a string $x$ to a copy of $x$ spread out across an array of zeros. When we compare $\varphi_\varepsilon(x)$ with $\varphi_\varepsilon(y)$ by Hamming differences, $\varphi_\varepsilon$ encodes an alignment between $x$ and $y$; it pays for every letter which it fails to match up with another copy of the same letter. In particular, every pairing of a letter with a null corresponds to an insertion or deletion, and every pairing of a letter with a different letter corresponds to a substitution. Thus $\mathrm{Ham}(\varphi_\varepsilon(x), \varphi_\varepsilon(y))$ will always be at least $\Delta_{ed}(x, y)$. In the rest of this subsection we will prove the following theorem.

▶ Theorem 4.1. For $\varepsilon \le \frac{1}{4}$, there exists a randomized embedding $\varphi_\varepsilon$ from $\mathcal{P}_n$ to $O(n^{1+6\varepsilon})$-dimensional Hamming space with the following properties.
- For $x, y \in \mathcal{P}_n$, $\varphi_\varepsilon(x)$ and $\varphi_\varepsilon(y)$ encode a sequence of $\mathrm{Ham}(\varphi_\varepsilon(x), \varphi_\varepsilon(y))$ edits from $x$ to $y$. In particular, $\mathrm{Ham}(\varphi_\varepsilon(x), \varphi_\varepsilon(y)) \ge \Delta_{ed}(x, y)$.
- For $x, y \in \mathcal{P}_n$, $\mathbb{E}[\mathrm{Ham}(\varphi_\varepsilon(x), \varphi_\varepsilon(y))] \le O\left(\frac{1}{\varepsilon} \log n\right) \cdot \Delta_{ed}(x, y)$.
- For $x \in \mathcal{P}_n$, $\varphi_\varepsilon(x)$ is sparse in the sense that it only contains $n$ non-zero entries. Moreover, if $\varphi_\varepsilon(x)$ is stored with run-length encoding, it can be computed in time $O(n)$.

7 In run-length encoding, runs of identical characters are stored as a pair whose first entry is the character and the second entry is the length of the run.
8 With high probability, there are no hash collisions.


The first property in the theorem follows from the discussion above. In order to prove $\mathbb{E}[\mathrm{Ham}(\varphi_\varepsilon(x), \varphi_\varepsilon(y))] \le \Delta_{ed}(x, y) \cdot O\left(\frac{1}{\varepsilon} \log n\right)$, we will consider a series of at most $2\Delta_{ed}(x, y)$ insertions or deletions that are used to transform $x$ into $y$. Each substitution operation can be emulated by an insertion and a deletion. Moreover, note that by ordering deletions before insertions, each of the intermediate strings will still be a permutation. In the following lemma, we bound the expected Hamming distance for just a single insertion (or equivalently, a deletion). By the triangle inequality, we get the bound on $\mathbb{E}[\mathrm{Ham}(\varphi_\varepsilon(x), \varphi_\varepsilon(y))]$.

▶ Lemma 4.2. Let $x \in \mathcal{P}_n$ be a permutation, and let $y$ be a permutation derived from $x$ by a single insertion. Let $0 < \varepsilon \le \frac{1}{4}$ and let $m$ be large enough so that $\varphi_{\varepsilon,m}$ is well-defined on $x$ and $y$. Then $\mathbb{E}[\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y))] \le O\left(\frac{1}{\varepsilon} \log n\right)$.

Proof. Observe that the set of letters in position-range $[(1/2 - \varepsilon)|x| : (1/2 + \varepsilon)|x|]$ in $x$ differs by at most $O(1)$ elements from the set of letters in position-range $[(1/2 - \varepsilon)|y| : (1/2 + \varepsilon)|y|]$ in $y$. Thus with probability $1 - O(1/(\varepsilon n))$, there will be a letter $l$ in the overlap between the two ranges whose hash is smaller than that of any other letter in either of the two ranges. In other words, the pivot in $x$ (i.e., the letter in the position range with minimum hash) will differ from the pivot in $y$ with probability $O(1/(\varepsilon n))$. Therefore,

$$\begin{aligned}
\mathbb{E}[\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y))] &= \Pr[\text{pivots differ}] \cdot \mathbb{E}[\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y)) \mid \text{pivots differ}] \\
&\quad + \Pr[\text{pivots same}] \cdot \mathbb{E}[\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y)) \mid \text{pivots same}] \\
&\le O\left(\frac{1}{\varepsilon n}\right) \cdot \mathbb{E}[\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y)) \mid \text{pivots differ}] \\
&\quad + \mathbb{E}[\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y)) \mid \text{pivots same}].
\end{aligned}$$

In general, $\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y))$ cannot exceed $O(n)$. Thus

$$\begin{aligned}
\mathbb{E}[\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y))] &\le O\left(\frac{1}{\varepsilon n}\right) \cdot O(n) + \mathbb{E}[\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y)) \mid \text{pivots same}] \\
&\le O\left(\frac{1}{\varepsilon}\right) + \mathbb{E}[\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y)) \mid \text{pivots same}].
\end{aligned}$$

If the pivot in $x$ is the same as in $y$, then the insertion must take place to either the left or the right of the pivot. Clearly $\varphi_{\varepsilon,m}(x)$ and $\varphi_{\varepsilon,m}(y)$ will agree on the side of the pivot in which the edit does not occur. Inductively applying our argument to the side on which the edit occurs, we incur a cost of $O(1/\varepsilon)$ once for each level in the recursion. The maximum depth of the recursion is $O\left(\log_{\frac{1}{1/2+\varepsilon}} n\right) = O(\log n)$. This gives us $\mathbb{E}[\mathrm{Ham}(\varphi_{\varepsilon,m}(x), \varphi_{\varepsilon,m}(y))] \le O\left(\frac{1}{\varepsilon}\right) \cdot \log n$. ◀

It remains only to analyze the run time. Notice that $\varphi_\varepsilon(x)$ can be stored in space $\Theta(n)$ using run-length encoding. We can compute $\varphi_\varepsilon(x)$ in time $O(n)$, as follows. Using Range Minimum Query [13] we can build a data structure which supports constant-time queries returning minimum hashes in contiguous substrings of $x$. This allows each recursive step in the embedding to be performed in constant time. Since each recursive step writes one of the $n$ letters to the output, the total run time is bounded by $O(n)$.
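The $O(n)$ bound relies on the RMQ structure of Fischer and Heun [13]; as a sketch of the idea only, the snippet below uses the simpler sparse-table variant, which pays $O(n \log n)$ preprocessing but answers the same argmin queries in constant time.

```python
def build_rmq(vals):
    """Sparse-table range-minimum queries: query(l, r) returns the index of the
    minimum of vals[l:r] (half-open) in O(1) time after O(n log n) preprocessing."""
    n = len(vals)
    table = [list(range(n))]  # table[j][i] = argmin of vals[i : i + 2^j]
    j = 1
    while (1 << j) <= n:
        prev, half = table[j - 1], 1 << (j - 1)
        table.append([min(prev[i], prev[i + half], key=lambda t: vals[t])
                      for i in range(n - (1 << j) + 1)])
        j += 1
    def query(l, r):
        # Two overlapping power-of-two windows cover the whole range [l, r).
        j = (r - l).bit_length() - 1
        return min(table[j][l], table[j][r - (1 << j)], key=lambda t: vals[t])
    return query
```

In the embedding, `vals` would hold the hash priorities of the letters of $x$, so each recursive pivot search over a position range becomes a single `query` call.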

4.2 Embedding into Hamming Space in the Low-Distance Regime

Assume we are given an embedding from edit distance in $\Sigma^n$ into Hamming space with subpolynomial distortion $\gamma(n)$. We wish to use such an embedding as a black box in order to obtain a new embedding for the low edit distance regime.

Algorithm 3 Choose Next Block (Informal). Input: A string 푥, the index 푖 where the current block begins. Output: The index where the next block begins. Parameters: 푊 ′′ ≪ 푊 ′ ≪ 푊 set as needed.

1. Consider the window $x_i \cdots x_{i+W-1}$ of size $W$. Divide the second half of the window into non-overlapping sub-windows of size $W'$, and pick one such sub-window at random.
2. Each substring of length $W''$ inside the sub-window is called a sub-sub-window (sub-sub-windows may overlap). Compute a hash of each of the sub-sub-windows.⁹
3. Return the start position of the sub-sub-window with the smallest hash.
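In Python, the boundary rule of Algorithm 3 can be sketched as follows; `Wsub` and `Wss` stand for $W'$ and $W''$, the SHA-256 call is an arbitrary stand-in for the hash, and boundary checks are elided.

```python
import hashlib
import random

def choose_next_block(x, i, W, Wsub, Wss, rng=random):
    """Sketch of Algorithm 3: given that the current block begins at index i,
    return the index where the next block begins. Assumes i + W <= len(x)
    and Wss <= Wsub <= W // 2."""
    def h(s):
        return hashlib.sha256(s.encode()).digest()
    # Step 1: pick a random size-Wsub sub-window of the window's second half.
    half = i + W // 2
    s = half + rng.randrange((W - W // 2) // Wsub) * Wsub
    # Steps 2-3: among the (overlapping) length-Wss sub-sub-windows inside it,
    # return the start position of the one with the smallest hash.
    return min(range(s, s + Wsub - Wss + 1), key=lambda t: h(x[t:t + Wss]))
```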

The new embedding, which would be parameterized by a value $K$, would take any two strings $x, y \in \Sigma^n$ with $\Delta_{ed}(x, y) \le K$ and map $x$ and $y$ into Hamming space with distortion $\gamma'(K)$, a function of $K$ rather than $n$.

We make progress toward such an embedding with the added constraint that our strings $x$ and $y$ are $(D, R)$-periodic free for $D \in \mathrm{poly}(K)$ of our choice and $R \in O(K^3)$. A string is $(D, R)$-periodic free if it contains no contiguous substrings of length $D$ that are periodic with period at most $R$. Our embedding takes two such strings with $\Delta_{ed}(x, y) \le K$ and maps them into Hamming space with distortion $\gamma(\mathrm{poly}(K))$. If we select the black-box embedding to be the embedding of Ostrovsky and Rabani [21], then this gives distortion $2^{O(\sqrt{\log K \log\log K})}$. Our main result in this section is stated as the following theorem.

▶ Theorem 4.3. Suppose we have an embedding from edit distance in $\Sigma^n$ to Hamming space with subpolynomial distortion $\gamma(n) \ge 2$. Let $K \in \mathbb{N}$ and pick some $D \in \mathrm{poly}(K)$. Then there exists $R \in O(K^3)$ and an embedding $\alpha$ from edit distance into scaled Hamming space with the following property. For strings $x$ and $y$ that are $(D, R)$-periodic free and of edit distance at most $K$ apart, $\alpha$ distorts the distance between $x$ and $y$ by at most $O(\gamma(K^3 D)) \le O(\gamma(\mathrm{poly}(K)))$ in expectation.

The complete proof of Theorem 4.3 appears in the full version of the paper. Below we discuss the key ideas, which once again make use of min-hash techniques. The main step in our embedding is to partition the strings 푥 and 푦 into parts of length poly(퐾) in a way so that the sum of the edit distances between the parts equals Δed(푥, 푦) (with some probability), and then to apply the black-box embedding to each individual part. We select the partition of 푥 by going from left to right, and at each step choosing the next index at which the current block ends and the next one begins. The algorithm for selecting each successive block appears as Algorithm 3. The goal of the algorithm is to find partitions of 푥 and 푦 that are aligned despite the edits. By picking a sub-window uniformly at random, we guarantee that with high probability (as a function of 푊, 푊 ′) no edits occur directly within that sub-window. However, if 푥 and 푦 differ by an insertion or deletion prior to that sub-window, the sub-window within 푥 may be misaligned with the sub-window within 푦. Nonetheless, the set of 푊 ′′-letter sub-sub-windows of the sub-window of 푥 will be almost the same as the set of 푊 ′′-letter sub-sub-windows of the sub-window of 푦. By selecting the sub-sub-window with minimum hash as the start position for the next block, we are then able to guarantee that with high probability (as a function of 푊 ′) we pick positions in 푥 and 푦 which are aligned with each other.
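The $(D, R)$-periodic free property is what guarantees (footnote 9) that the sub-sub-windows are distinct. To pin the definition down, here is a naive quadratic check; the helper is our own and is for illustration only.

```python
def is_periodic_free(s, D, R):
    """Return True iff s is (D, R)-periodic free: no length-D substring of s is
    periodic with period at most R (w has period p if w[j] == w[j+p] for all j)."""
    for i in range(len(s) - D + 1):
        window = s[i:i + D]
        for p in range(1, min(R, D - 1) + 1):
            if all(window[j] == window[j + p] for j in range(D - p)):
                return False
    return True
```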

9 By selecting 푊 ′′, 푊 ′, 푊 appropriately, we can use the (퐷, 푅)-periodic free property to guarantee that these sub-sub-windows are distinct.


4.3 Dimension Reduction

A mapping of edit distance on length-$n$ strings to edit distance on length-at-most-$n/c$ strings (with larger alphabet size) is called a dimension-reduction map with contraction $c$ [7]. The distortion of the map measures the multiplicative factor to which edit distance is preserved.

▶ Theorem 4.4. There is a randomized dimension-reduction map $\varphi$ with contraction $c$ and expected distortion $O(c)$. In particular, for $x, y \in \Sigma^n$,
$$\frac{1}{2c} \cdot \Delta_{ed}(x, y) \le \Delta_{ed}(\varphi(x), \varphi(y)), \quad \text{and} \quad \mathbb{E}[\Delta_{ed}(\varphi(x), \varphi(y))] \le O(1) \cdot \Delta_{ed}(x, y).$$
Moreover, for this definition of distortion, $\varphi$ is within a constant factor of optimal. Additionally, $\varphi$ can be evaluated in time $O(n \log c)$.

Note that the output of a dimension-reduction map may be over a much larger alphabet than the input. One should not be too alarmed by this, however, because alphabet reduction can be performed after the dimension reduction (by simply hashing to Θ(log 푛) bits). Below we summarize our approach to proving Theorem 4.4. The complete proof appears in the full version of the paper. When computing our dimension-reduction map 휑(푤) on a word 푤, one approach would be to split 푤 into blocks of size 푐 and to then define 휑(푤)’s letters to correspond to blocks. This would achieve the desired reduction in dimension, and would have the effect that a single edit to 휑(푤) would correspond to at most 푐 edits in 푤. However, a single insertion to 푤 could change the content of linearly many blocks, corresponding to a large number of edits to 휑(푤). Thus the challenge is to instead break 푤 into blocks in a way so that an edit to 푤 affects only a small number of blocks. In order to accomplish this, our actual 휑 breaks 푤 into long periodic substrings and non-periodic substrings. The two types of substrings are then handled as follows:

Handling Long Periodic Substrings: Within the periodic substrings, we break the substring into blocks based on the periodic behavior. The key insight is that the embedding of the periodic substring will consist of the same block repeating many times. If an edit occurs in the middle of a long periodic substring, the embedding will still include the same block repeated many times, but 푂(1) blocks around the edit will be modified. Since the blocks correspond to letters in 휑(푤), the edit to 푤 results in 푂(1) edits to 휑(푤).

Handling Non-Periodic Substrings: Within the non-periodic substrings, we select markers in a randomized fashion to determine block boundaries. The definition of non-periodic substrings guarantees that for every $c$ adjacent letters, the $8c$-letter substrings $w_i w_{i+1} \cdots w_{i+8c-1}$ beginning at each of those $c$ letters $w_i$ are distinct. We utilize this property to select markers based on the minimum hash of $8c$-letter substrings. A letter is selected as a marker if the hash of the $8c$-letter string beginning at that letter is smaller than the hashes of any of the $8c$-letter strings beginning in the $c/2$ letters to its left or right. This localizes the selection so that edits to the string will only affect nearby markers.

When two markers are more than $c$ apart, we additionally break the space between them into sub-blocks of size at most $c$. By preventing blocks in $\varphi(w)$ from exceeding $O(c)$ in size, we can take a sequence of edits in $\varphi(w)$ and generate a corresponding sequence of edits in $w$ with distortion $O(c)$. When bounding the distortion in the other direction, we risk edits in

$w$ taking place between two far-apart markers, which in turn could affect all of the blocks between those markers in $\varphi(w)$. The main technical challenge is bounding the effect this has on the distortion. We do so by showing probabilistically that wherever there is an edit, there will be markers nearby which mitigate the impact of that edit on the block structure of $\varphi(w)$.
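A direct, unoptimized rendering of the marker rule in Python (the hash is a stand-in, and the $O(nc)$ scan could be brought to $O(n)$ with rolling hashes):

```python
import hashlib

def select_markers(w, c):
    """Sketch of the marker rule for non-periodic substrings: position i is a
    marker if the hash of the 8c-letter substring starting at i is smaller than
    the hash of every 8c-letter substring starting within c/2 letters of i."""
    def h(s):
        return hashlib.sha256(s.encode()).digest()
    starts = len(w) - 8 * c + 1          # number of valid 8c-letter substrings
    hashes = [h(w[i:i + 8 * c]) for i in range(max(starts, 0))]
    markers = []
    for i in range(len(hashes)):
        lo, hi = max(0, i - c // 2), min(len(hashes), i + c // 2 + 1)
        if hashes[i] == min(hashes[lo:hi]):  # strict minimum w.h.p.: substrings distinct
            markers.append(i)
    return markers
```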

References
1 Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
2 Alexandr Andoni and Robert Krauthgamer. The computational hardness of estimating edit distance. SIAM J. Comput., 39(6):2398–2429, 2010.
3 Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. Polylogarithmic approximation for edit distance and the asymmetric query complexity. In Proceedings of the 51st Annual Symposium on Foundations of Computer Science (FOCS), pages 377–386. IEEE, 2010.
4 Alexandr Andoni and Krzysztof Onak. Approximating edit distance in near-linear time. SIAM J. Comput., 41(6):1635–1648, 2012.
5 Arturs Backurs and Piotr Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Proceedings of the 47th Annual Symposium on Theory of Computing (STOC), pages 51–58. ACM, 2015.
6 Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar. Approximating edit distance efficiently. In Proceedings of the 45th Annual Symposium on Foundations of Computer Science (FOCS), pages 550–559. IEEE, 2004.
7 Tugkan Batu, Funda Ergün, and Süleyman Cenk Sahinalp. Oblivious string embeddings and edit distance approximations. In Proceedings of the 17th Annual Symposium on Discrete Algorithms (SODA), pages 792–801, 2006.
8 Djamal Belazzougui and Qin Zhang. Edit distance: Sketching, streaming, and document exchange. In Proceedings of the 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 51–60. IEEE, 2016.
9 Diptarka Chakraborty, Elazar Goldenberg, and Michal Koucký. Streaming algorithms for embedding and computing edit distance in the low distance regime. In Proceedings of the 48th Annual Symposium on Theory of Computing (STOC), pages 712–725. ACM, 2016.
10 Kun-Mao Chao, William R. Pearson, and Webb Miller. Aligning two sequences within a specified diagonal band. Computer Applications in the Biosciences, 8(5):481–487, 1992.
11 Moses Charikar and Robert Krauthgamer. Embedding the Ulam metric into ℓ1. Theory of Computing, 2(11):207–224, 2006.
12 Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, third edition, 2009.
13 Johannes Fischer and Volker Heun. Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 36–48, 2006.
14 Dan Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
15 Piotr Indyk. Algorithmic aspects of geometric embeddings (tutorial). In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science (FOCS), pages 10–33. IEEE, 2001.
16 Piotr Indyk and Jiří Matoušek. Low-distortion embeddings of finite metric spaces. Handbook of Discrete and Computational Geometry, page 177, 2004.


17 Robert Krauthgamer and Yuval Rabani. Improved lower bounds for embeddings into L1. In Proceedings of the 17th Annual Symposium on Discrete Algorithms (SODA), pages 1010–1017, 2006.
18 Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental string comparison. SIAM Journal on Computing, 27(2):557–582, 1998.
19 Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88, 2001.
20 Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
21 Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. J. ACM, 54(5):23, 2007.
22 Anna Pagh and Rasmus Pagh. Uniform hashing in constant time and optimal space. SIAM Journal on Computing, 38(1):85–96, 2008.
23 Taras K. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4(1):52–57, 1968.
24 Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974.