
Sequence Distance Embeddings

by

Graham Cormode

Thesis

Submitted to the University of Warwick

for the degree of

Doctor of Philosophy

Computer Science

January 2003

Contents

List of Figures
Acknowledgments
Declarations
Abstract
Abbreviations

Chapter 1  Starting
1.1 Sequence Distances
  1.1.1 Metrics
  1.1.2 Editing Distances
  1.1.3 Embeddings
1.2 Sets and Vectors
  1.2.1 Set Difference and Set Union
  1.2.2 Symmetric Difference
  1.2.3 Intersection Size
  1.2.4 Vector Norms
1.3 Permutations
  1.3.1 Reversal Distance
  1.3.2 Transposition Distance
  1.3.3 Swap Distance
  1.3.4 Permutation Edit Distance
  1.3.5 Reversals, Indels, Transpositions, Edits (RITE)
1.4 Strings
  1.4.1 Hamming Distance
  1.4.2 Edit Distance
  1.4.3 Block Edit Distances
1.5 Sequence Distance Problems
  1.5.1 Efficient Computation and Communication
  1.5.2 Approximate Pattern Matching
  1.5.3 Geometric Problems
  1.5.4 Approximate Neighbors
  1.5.5 Clustering for k-centers
1.6 The Shape of Things to Come

Chapter 2  Sketching and Streaming
2.1 Approximations and Estimates
  2.1.1 Sketch Model
  2.1.2 Streaming
  2.1.3 Equality Testing
2.2 Vector Distances
  2.2.1 Johnson-Lindenstrauss lemma
  2.2.2 Frequency Moments
  2.2.3 L1 Streaming Algorithm
  2.2.4 Sketches using Stable Distributions
  2.2.5 Summary of Vector Lp Distance Algorithms
2.3 Set Spaces and Vector Distances
  2.3.1 Symmetric Difference and Hamming Distance
  2.3.2 Set Union and Distinct Elements
  2.3.3 Set Intersection Size
  2.3.4 Approximating Set Measures
2.4 Geometric Problems
  2.4.1 Locality Sensitive Hash Functions
  2.4.2 Approximate Furthest Neighbors for Euclidean Distance
  2.4.3 Clustering for k-centers
2.5 Discussion

Chapter 3  Searching Sequences
3.1 Introduction
  3.1.1 Computational Biology Background
  3.1.2 Results
3.2 Embeddings of Permutation Distances
  3.2.1 Swap Distance
  3.2.2 Reversal Distance
  3.2.3 Transposition Distance
  3.2.4 Permutation Edit Distance
  3.2.5 Hardness of Estimating Permutation Distances
  3.2.6 Extensions
3.3 Applications of the Embeddings
  3.3.1 Sketching for Permutation Distances
  3.3.2 Approximating Pairwise Distances
  3.3.3 Approximate Nearest Neighbors and Clustering
  3.3.4 Approximate Pattern Matching with Permutations
3.4 Discussion

Chapter 4  Strings and Substrings
4.1 Introduction
4.2 Embedding String Edit Distance with Moves into L1 Space
  4.2.1 Edit Sensitive Parsing
  4.2.2 Parsing of Different Metablocks
  4.2.3 Constructing ET(a)
4.3 Properties of ESP
  4.3.1 Upper Bound Proof
  4.3.2 Lower Bound Proof
4.4 Embedding for other block edit distances

  4.4.1 Compression Distance
  4.4.2 Unconstrained Deletes
  4.4.3 LZ Distance
  4.4.4 Q-gram distance
4.5 Solving the Approximate Pattern Matching Problem for String Edit Distance with Moves
  4.5.1 Using the Pruning Lemma
  4.5.2 ESP subtrees
  4.5.3 Approximate Pattern Matching Algorithm
4.6 Applications to Geometric Problems
  4.6.1 Approximate Nearest and Furthest Neighbors
  4.6.2 String Outliers
  4.6.3 Sketches in the Streaming model
  4.6.4 Approximate p-centers problem
  4.6.5 Dynamic Indexing
4.7 Discussion

Chapter 5  Stables, Subtables and Streams
5.1 Introduction
  5.1.1 Data Stream Comparison
  5.1.2 Tabular Data Comparison
5.2 Sketch Computation
  5.2.1 Implementing Sketching Using Stable Distributions
  5.2.2 Median of Stable Distributions
  5.2.3 Faster Sketch Computation
  5.2.4 Implementation Issues
5.3 Stream Based Experiments
5.4 Experimental Results for Clustering
  5.4.1 Accuracy Measures
  5.4.2 Assessing Quality and Efficiency of Sketching
  5.4.3 Clustering Using Sketches
  5.4.4 Clustering Using Various Lp Norms
5.5 Discussion

Chapter 6  Sending and Swapping
6.1 Introduction
  6.1.1 Prior Work
  6.1.2 Results
6.2 Bounds on communication
6.3 Near Optimal Document Exchange
  6.3.1 Single Round Protocol
  6.3.2 Application to distances of interest
  6.3.3 Computational Cost
6.4 Computationally Efficient Protocols for String Distances
  6.4.1 Hamming distance
  6.4.2 Edit Distance
  6.4.3 Tichy’s Distance
  6.4.4 LZ Distance
  6.4.5 Compression Distances and Edit Distance with Moves
  6.4.6 Compression Distance with Unconstrained Deletes
6.5 Computationally Efficient Protocols for Permutation Distances

6.6 Discussion

Chapter 7  Stopping
7.1 Discussion
  7.1.1 Nature of Embeddings
  7.1.2 Permutations and Strings
7.2 Extensions
  7.2.1 Trees
  7.2.2 Graphs
7.3 Further Work

Appendix A  Supplemental Section on Sequence Similarity
A.1 Combined Permutation Distances
  A.1.1 Combining All Operations
  A.1.2 Transpositions, Insertions, Reversals, Edits and Deletions

Bibliography

List of Figures

1.1 A map is an embedding of geographic data into the 2D plane with some loss of information
1.2 Summary of the distances of interest

2.1 Key features of the different methods described in Section 2.2
2.2 Approximability of set quantities

3.1 There is always a transposition that removes one breakpoint
3.2 Illustrating Permutation Edit Distance embedding
3.3 Analysing the Longest Increasing Sequence
3.4 An exact embedding of Permutation Edit Distance into Hamming distance
3.5 A set of permutations that forms K2,3 under permutation edit distance
3.6 Permutation Edit Distance between the two pairs of possibilities
3.7 Summary of the different sequence distance measures

4.1 The process of alphabet reduction and landmark finding
4.2 Given the landmark characters, the nodes are formed
4.3 The hierarchy is formed by repeated application of alphabet reduction and landmark finding
4.4 The hierarchical structure of nodes is represented as a parse tree on the string a
4.5 The hierarchical parsing induces a vector representation
4.6 For the lower bound proof, we mark nodes that are common to both strings
4.7 The first stages of editing the string
4.8 The target string is completed
4.9 An ESP subtree can be iteratively computed

5.1 The value of the median of stable distributions with parameter p < 2
5.2 Compound sketches can be formed from four sketches that represent the area required
5.3 Results for insertions based on a Zipf distribution
5.4 Finding the Hamming norm of Network Data
5.5 Testing performance for a sequence of insertions and deletions
5.6 Testing performance for a sequence of insertions
5.7 Measuring the Hamming distance between network streams
5.8 Assessing the distance between 20,000 randomly chosen pairs
5.9 Accuracy results when tiling four sketches to cover a subrectangle
5.10 Quality of pairwise comparisons with tiled sketches
5.11 Timing results for different Lp distances
5.12 Quality results for different Lp distances
5.13 Clustering with different numbers of means
5.14 Varying p to find a known clustering
5.15 Detail of a clustering for a single day (subsection of Figure 5.16)

5.16 Clustering of one day’s call data for the whole United States

6.1 Calculating a colour using Hamming
6.2 Illustrating how Hamming differences affect the binary parse tree
6.3 Illustrating how edit differences affect the binary parse tree of a
6.4 The protocol for exchanging strings based on Compression Distance
6.5 Main results on document exchange

7.1 Approximation factors for permutation distances

Acknowledgments

Any thesis is more than just the work of one person, and I owe a large debt to many people who have helped me along the way. In particular, I must acknowledge the help of three people who have all made huge contributions to my work. Mike Paterson has been a great supervisor throughout my years at Warwick. His willingness to help with whatever I came to him with and his meticulous attention to detail have been of immense help. Cenk Şahinalp introduced me to some marvellous problems, and worked on them together with me with insight and guidance. He also arranged for me to spend a year of my degree with him at Case Western Reserve University. S. Muthukrishnan (Muthu) has always responded with enthusiasm to my work, and together we have had a lot of fun writing papers. He was a superb mentor when I visited AT&T as an intern in Summer 2000, and I’m very grateful to him for arranging frequent trips back to visit since then.

After these three, I must also thank all the other people who have written papers with me since I began my studies: Uzi Vishkin, Nick Koudas, Piotr Indyk, Tekin Ozsoyoglu, Meral Ozsoyoglu, Mayur Datar and Hurkan Balkir. Throughout my studies, I have been very fortunate with my officemates, first amongst these being Jon Sharp, who has been both an officemate and a flatmate. There have also been Mary Cryan, Hesham Al-ammal, Sun Chung, Steven Kelk, and Harish Bhanderi. I also need to thank everyone with whom I’ve talked about my work: Amit Chakrabarti, Funda Ergun, Martin Strauss, and many others.

My studies have been supported financially by a University of Warwick Graduate Award; by a Case Western Reserve University Fellowship; by an internship at AT&T Research; and by travel grants from the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS); Algorithms and Complexity — Future Technologies (ALCOM-FT); the University of Haifa, Israel; the University of Warwick American Study and Student Exchanges Committee (ASSEC); and the Department of Computer Science, University of Warwick.

My family, and especially my parents, deserve recognition for putting up with my travails, and for never knowing entirely what it is that I do1. My friends and flatmates, too many to mention, have also been supportive or diverting, and so made my studies easier to manage. Mark Hadley deserves much credit for making available his wthesis LaTeX style file. I would also like to acknowledge the Students’ Union of the University of Warwick, and in particular the Elections Group, the Postgraduate Committee and the crossword of the Warwick Boar newspaper, all of which have contributed to dragging this thesis out longer than it otherwise needed to take.

1Except for my grandmother, who was convinced that I was working on Knitting Sequins In Bed

Declarations

Unlike Wilde, I have a great deal to declare. Much of the material in this thesis derives from papers that have been published in refereed conferences. These are all collaborative works, and as is the nature of such collaborations, it is often difficult to disentangle or lay claim to the individual efforts of any particular author.

Chapter 1 consists mostly of introductory and background material. Section 1.5.2 derives from an idea of S. Muthukrishnan, described in [CM02]; the remainder is new for this thesis. Chapter 2 consists mostly of a discussion of relevant prior work of other authors (attributed when referenced), which is used by later sections. In particular, the methods described in Section 2.2.4 are due to Indyk. They were outlined in [Ind00], and then expanded upon in [CIKM02] and [CDIM02]. Some of the subsections of Section 2.3.1 are my work from [CDIM02]. The remainder, constituting the bulk of the chapter, is an original presentation of related work.

In Chapter 3, Section 3.1, Section 3.2.2, Section 3.2.3, Section 3.2.4, and Section 3.3 are extended versions of results first shown in [CMŞ01]. All the embeddings and related proofs presented there are my own work, as are the hardness results and applications of the embeddings. S. Muthukrishnan suggested the application to Approximate Pattern Matching, which I described and analysed. The second example in Section 3.2.1 has been published in another form as 1 Across in [Cor02]. Sections 3.2.1, 3.2.5, 3.2.6 and 3.3.3 are my work that has not previously been published.

The embedding presented in Chapter 4 derives from that given by Cenk Şahinalp in [MŞ00], as is discussed in the text. The construction given here is a significantly simplified reformulation, about which it is possible to prove stronger results. The proofs are due to me, as is the application to edit distance with moves. The applications to pattern matching and streaming computation were suggested by S. Muthukrishnan, and were again described and analysed by myself. The material in Sections 4.4.2, 4.4.3, 4.4.4, and 4.6 is my own work which has not been published elsewhere. Chapter 5 reports experiments that I designed, coded, ran and analysed myself, which were first described in [CIKM02] and [CDIM02]. In Chapter 6, Section 6.2, Section 6.3, Section 6.4.1, Section 6.4.2 and Section 6.4.4 are derived from [CPŞV00]. The entirety of the results in this chapter were discovered and described by me, and the results in Section 6.5 have not previously been published. None of this work is drawn from [OBCO00], but I wanted to mention it anyway.

The following declarations are required by the University of Warwick Guide to Examinations for Higher Degrees by research. I’d skip over them if I were you. This thesis is entirely my own work apart from the exceptions listed above, which are the result of published collaborative research in which I took the lead role in the research, experiments, and preparation for publication. This thesis has not been previously submitted in any form for a degree at Warwick or any other university.

Sequence Distance Embeddings

The Maze of Ways from A to B

Writing a book is rather like going on a long journey. You know where you are (at the beginning) and you know where you want to get to (the end). The big problem is, what route should you take? I’m sure you know that the shortest distance between two points is a straight line, but if you decide now to go somewhere in a straight line, after going just a few paces you will probably come to a sudden stop and say, ‘Ouch!’ If the pain is in your leg, you will have walked into the furniture, but if the pain is in your nose, you will have walked into the wall. And it’s no good me telling you to stop being stupid and sit down — with your nose right up against the wall, you won’t be able to read this book. [Bal87]

Abstract

Sequences represent a large class of fundamental objects in Computer Science — sets, strings, vectors and permutations are considered to be sequences. Distances between sequences measure their similarity, and computations based on distances are ubiquitous: either to compute the distance, or to use distance computation as part of a more complex problem. This thesis takes a very specific approach to solving questions of sequence distance: sequences are embedded into other distance measures, so that distance in the new space approximates the original distance. This allows the solution of a variety of problems including:

• Fast computation of short ‘sketches’ in a variety of computing models, which allow sequences to be compared in constant time and space irrespective of the size of the original sequences.

• Approximate nearest neighbor and clustering problems, significantly faster than the naïve exact solutions.

• Algorithms to find approximate occurrences of pattern sequences in long text sequences in near linear time.

• Efficient communication schemes to approximate the distance between, and exchange, sequences in close to the optimal amount of communication.

Solutions are given for these problems for a variety of distances, including fundamental distances on sets and vectors; distances inspired by biological problems for permutations; and certain text editing distances for strings. Many of these embeddings are computable in a streaming model where the data is too large to store in memory, and instead has to be processed as and when it arrives, piece by piece. The embeddings are also shown to be practical, with a series of large scale experiments which demonstrate that given only a small space, approximate solutions to several similarity and clustering problems can be found that are as good as or better than those found with prior methods.

Abbreviations

Sets

A, B        Sets A and B
A ∪ B       Set Union
A ∩ B       Set Intersection
A∆B         Symmetric set difference
|A|         Number of distinct elements in set A
χ(A)        The characteristic function of a set, mapping a subset of an (ordered) set onto a bit-string — Section 2.3.1

Vectors and Matrices

a, b          Vectors
|a|           Length (dimensionality) of the vector a
||a||p        Lp norm of the vector a — Definition 1.2.5
||a − b||p    Lp distance between vectors a and b
||a||H        The Hamming norm of a, the number of locations in which it is non-zero
A, B          Matrices
Ai            The ith row of matrix A, a vector
median(a)     The median of the multi-set of values in the vector a
sk(a)         The sketch of a vector a — Definition 2.1.3
i(a, b)       The size of the intersection of two bit-strings a and b — Definition 1.2.4

Permutations

P, Q          Permutations of the integers 1 . . . |P| = n
P^−1          Inverse of permutation P — Section 1.3
φ(P, Q)       Number of reversal breakpoints between P and Q — Definition 3.2.5
tb(P, Q)      Number of transposition breakpoints between P and Q — Definition 3.2.8
LIS(P)        Longest (strictly) Increasing Subsequence of permutation P — Section 3.2.4
LCS(P, Q)     Longest Common Subsequence of sequences P and Q — Section 3.2.4

Permutation Distances

r(P, Q)       Reversal distance between P and Q — Definition 1.3.1
t(P, Q)       Transposition distance between P and Q — Definition 1.3.2
swap(P, Q)    Swap distance between P and Q — Definition 1.3.3
d(P, Q)       Permutation edit distance between P and Q — Definition 1.3.4
τ(P, Q)       Reversal, Transposition and Edit distance between P and Q — Section 3.2.6
τ(P, Q)       Reversal, Indel, Transposition and Edit distance (RITE) — Section 3.2.6
τ(P, Q)       RITE distance between a permutation and a string — Section 3.2.6

Strings

a, b          Strings a and b drawn from a finite alphabet σ
a[i]          The ith character of the string a (a member of σ)
a[l : r]      The substring formed by the concatenation of a[l], a[l + 1], . . . a[r]
k             Shorthand for |σ| − 1 in Section 6.2
h(a, b)       The Hamming distance between strings a and b — Definition 1.4.2
e(a, b)       String edit distance between a and b — Definition 1.4.3
tichy(a, b)   Tichy’s distance between a and b — Definition 1.4.4
lz(a, b)      LZ-distance between a and b — Definition 1.4.5
c(a, b)       Compression distance between a and b — Definition 1.4.6
d(a, b)       String edit distance with moves — Definition 1.4.8
T, P          Text T and pattern P in approximate pattern matching — Section 1.5.2
D[i]          The least distance between P and a prefix of T[i : n] — Definition 1.5.1

Edit Sensitive Parsing

ESP              Edit Sensitive Parsing — Section 4.2
label(a[i])      Function applied in the alphabet reduction step of ESP — Section 4.2.2
ET(a)            The parse tree produced by ESP — Section 4.2.3
T(a)             The set of substrings induced by ET(a) — Definition 4.3.1
V(a)             The characteristic vector of T(a) — Definition 4.3.1
ETi(a)j          The jth node at level i in ET(a) — Definition 4.5.1
range(ETi(a)j)   The range [l . . . r] such that ETi(a)j corresponds to a[l : r] — Definition 4.5.1
EST(a, l, r)     The subtree of ET(a) that contains nodes that intersect with S[l : r] — Definition 4.5.1
VS(a, l, r)      The characteristic vector of EST(a, l, r) — Definition 4.5.2

Other notation

xor                    The exclusive-or function, {((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)}
ā                      The complement of the bit-string a: ā[i] = xor(a[i], 1)
χ(H)                   The chromatic number of the (hyper)graph H
hash(a)                A hash function mapping onto a smaller range of integers — Section 2.1.3
x̂                      An approximation of any quantity x
(ε, δ)-approximation   An approximation within a factor of (1 ± ε) with probability δ — Section 2.1
c-approximation        An approximation that is within a factor of c — Section 2.1

Chapter 1

Starting

Miss Hepburn runs the gamut of emotions from A to B — Dorothy Parker, 1933

1.1 Sequence Distances

Sequences are fundamental mathematical structures to be found in any Discrete Mathematics textbook. Such familiar items as Sets, Strings, Permutations and Vectors are counted here as sequences. A very natural question to ask, given two such objects of the same kind, is “How similar are these objects?” Mathematically, we should like to define the similarity between them in a quantitative manner. This question is at the core of this work; throughout, we shall be concerned with posing and answering questions which ultimately rely on comparing pairs of sequences. We therefore begin by exploring the kinds of sequence of interest, and by describing some of the distances between them.

It is not the case that there is always a single natural distance between structures. Depending on the situation, there might be many sensible ways of choosing a distance measure. By way of an example, consider measuring the distance between two houses in different towns: how far apart are they? The scientist’s answer is to measure the ‘straight line’ distance between them. A cartographer might prefer to measure the shortest road distance between them, which is unlikely to be a straight line. A traveller would perhaps measure the distance in hours, as the time it would take to go from one to another (taking into account traffic conditions and speed limits). A child might just say that the houses were either ‘near’ or ‘far’. Each measurement of distance is appropriate to a particular circumstance, and all are worth studying.

Sequence distances are vital to many computer science problems. The simplest question is to measure the distance between two objects, or find an approximation to this quantity. Other problems fall into various categories. There are geometric problems — organising objects into groups, or finding similar objects, based on the distance between them. Pattern matching problems revolve around finding occurrences or near occurrences of a short pattern sequence within a longer sequence. This requires finding a subsequence that has a small distance from the pattern. Communication problems require two or more parties to communicate with each other in order to solve some problem based on the distance.

Throughout this work we will take a uniform approach to solving problems of sequence distances. We use the idea of embeddings. These are defined formally later, but the basic idea is simple: transform the sequences of interest into different sequences so that an appropriate distance between the transformations approximates the distance between the original sequences. This can sometimes seem a counter-intuitive step: how does switching sequences around do anything but add work? There are two basic reasons why this is a useful approach. First, the embedding can reduce the dimensionality of the space. In other words, the resulting sequence can be much shorter than the original sequence. This means that if the sequence needs to be compared with many others, or sent over a communication link, then the cost of this is greatly reduced. Secondly, the embedding can transform from a distance space about which little is known into one that has been well-studied. So if we want to solve a certain problem for a novel distance measure, one approach is to transform the sequences into sequences in a well-understood space, and then solve the problem in this space. With certain caveats, a solution to the problem in the target space translates to a solution in the original space. This gives a general outline of this work.
We will first consider a variety of sequences and distances on them. We will also describe a set of problems that can be parameterised by a distance measure. Then, for each class of objects (sets, strings, vectors and permutations) we will give embeddings for distances between these objects. Next, we show how these embeddings can be used to solve problems of interest. The first few chapters address distances and their embeddings and immediate applications. Later chapters discuss particular applications in greater detail, and give detailed experimental studies of these methods in practice. The rest of this chapter lays the groundwork for this by giving descriptions of the different object types and distance measures between them, as well as details of the kind of problems that we are interested in solving.

1.1.1 Metrics

In discussing distances, we shall rely on the standard notion of a metric to define the distance from point A to point B. Formally, let a and b be objects drawn from some universe of objects U. Let d be a function d : U × U → R+. Then d is a metric if it has the following three properties:

• Equality: ∀a, b ∈ U : d(a, b) = 0 ⇐⇒ a = b

• Symmetry: ∀a, b ∈ U : d(a, b) = d(b, a)

• Triangle Inequality: ∀a, b, c ∈ U : d(a, c) ≤ d(a, b) + d(b, c)

Because our objects are all discrete, rather than continuous, it will nearly always turn out that the distance between any pair is a natural number. In fact, many of the distances we shall consider will be a particular kind of metric we call ‘editing distances’.

1.1.2 Editing Distances

Definition 1.1.1 A unit cost editing distance is a distance defined by a set of editing operations. Formally, the editing operations are defined by a symmetric relation R on the universe U. The editing distance between two objects is the minimum number of editing operations needed to transform one into the other. That is, d(a, b) is the minimum n such that R^n(a, b), and 0 if a = b.

Lemma 1.1.1 Any editing distance is a metric. Proof.

• Equality: follows by definition, the distance is zero if and only if a = b.

• Symmetry: If d(a, b) = n, it follows that there is a sequence of n + 1 intermediate objects, a0, a1, . . . , an where a0 = a, an = b and (ai, ai+1) ∈ R. Because R is a symmetric relation, it follows that (ai+1, ai) ∈ R also, and so there is a sequence an, an−1, . . . a0. Hence if d(a, b) = n then d(b, a) ≤ n also, and so d(a, b) = d(b, a) for all a, b.

• Triangle inequality: Suppose d(a, b) + d(b, c) < d(a, c). Then concatenating a sequence of d(a, b) editing operations transforming a into b with a sequence of d(b, c) operations transforming b into c gives a sequence of fewer than d(a, c) operations transforming a into c, contradicting the minimality of d(a, c). Hence d(a, c) ≤ d(a, b) + d(b, c).

This means that many of the distances we consider will be metrics, since most will be constructed as editing distances.

1.1.3 Embeddings

A majority of our results will be embeddings of one distance into another space. Informally, the idea of an embedding is to perform a transformation on objects of one type producing a new object, such that the distance between the transformed objects approximates the distance between the original objects. The motivation for these embeddings is that we may not know much about the original space, but we do know a lot about the target space, and so we are able to manipulate the objects more easily following this transformation. To continue the example of distance between places, a map is an embedding of the surface of the earth onto a small piece of paper — see Figure 1.1. It approximates the features of the earth’s surface and although much fine detail is lost, the distance between any two points can be approximated by scaling the distance between them on the map.

Figure 1.1: A map is an embedding of geographic data into the 2D plane with some loss of information

Definition 1.1.2 An embedding from a space X to a space Y is defined by functions f1, f2 : X → Y such that for distance functions dX : X × X → R and dY : Y × Y → R, and for a distortion factor k,

∀x1, x2 ∈ X : dX(x1, x2) ≤ dY(f1(x1), f2(x2)) ≤ k · dX(x1, x2)

For many settings we shall see, f1 = f2 (there is a single embedding function), although we allow for this kind of “non-symmetric” embedding for full generality. We shall also make use of probabilistic embeddings: these are embeddings for which there is a (small) probability that any given pair of elements will be stretched more than k. Formally,

Definition 1.1.3 A probabilistic embedding is defined by functions f1, f2 : X × {0, 1}^R → Y so that for some constant δ and r picked uniformly from {0, 1}^R:

∀x1, x2 ∈ X : Pr[dX(x1, x2) ≤ dY(f1(x1, r), f2(x2, r)) ≤ k · dX(x1, x2)] ≥ 1 − δ

The probability is taken over all choices of r, where r represents R bits which can be chosen uniformly at random to ensure that this embedding is not affected by adversarially chosen input.

1.2 Sets and Vectors

Sets and vectors are two of the most basic combinatorial objects, and as such there are just a few natural operations upon them. We shall begin by discussing basic set operations on sets A and B drawn from a finite universe. We will write |A| for the number of elements in the set A.

1.2.1 Set Difference and Set Union

Definition 1.2.1 The set difference between two sets A and B is A\B = {x | x ∈ A ∧ x ∉ B}

Definition 1.2.2 The union of two sets A and B is A ∪ B = {x|x ∈ A ∨ x ∈ B}

1.2.2 Symmetric Difference

Definition 1.2.3 The symmetric difference between sets A and B is

A∆B = {x | (x ∈ A ∧ x ∉ B) ∨ (x ∉ A ∧ x ∈ B)}

A natural measure of the difference between two sets is the size of their symmetric difference. In fact, this is an example of an editing distance: to transform a set A into a set B we can remove elements from set A or add elements from the universe to the set A. This editing distance is exactly the size of the symmetric difference |A∆B|, since the most efficient strategy is to remove everything in A but not in B (the set A\B) and to insert everything in B but not in A (the set B\A). Note that the symmetric difference is not really a primitive operation, since it can be defined in terms of our previous set operations: A∆B = (A\B) ∪ (B\A).
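As a concrete illustration, a short Python sketch of this editing distance might look as follows (the function name is chosen here for exposition only):

def set_edit_distance(A, B):
    # Delete everything in A\B and insert everything in B\A: the minimum
    # number of operations is the size of the symmetric difference.
    A, B = set(A), set(B)
    return len(A - B) + len(B - A)   # equivalently, len(A ^ B)

# set_edit_distance({1, 2, 3}, {2, 3, 4}) == 2   (delete 1, insert 4)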

1.2.3 Intersection Size

Definition 1.2.4 The intersection of sets A and B is

A ∩ B = {x|x ∈ A ∧ x ∈ B}

We shall consider the size of the intersection of two sets, |A ∩ B|. This measure is quite contrary to the usual notion of distance, since |A ∩ A| = |A|, whereas if A and B have no elements in common (so they are completely different) then |A ∩ B| =0. So this size is certainly not a metric.

1.2.4 Vector Norms

Throughout, we shall use a and b to denote vectors. The length of a vector (number of entries) is denoted as |a|. The concatenation of two vectors, a||b is the vector of length |a| + |b| consisting of the entries of a followed by the entries of b.

Definition 1.2.5 The Lp norm on a vector a of dimension n is

||a||_p = ( Σ_{i=1..n} |a_i|^p )^{1/p}

We can use this norm to define distances between pairs of vectors:

Definition 1.2.6 The Lp distance between vectors a and b of dimension n is the Lp norm of their difference,

||a − b||_p = ( Σ_{i=1..n} |a_i − b_i|^p )^{1/p}

Note that this definition is symmetric, and defines a metric when p is a positive integer. Of particular interest to us will be the L2 and L1 distances. If we treat the vectors as defining points in n dimensional space then the L2, or Euclidean, distance between two points gives the length of the straight line joining those points. The L1, or Manhattan, distance gives the sum of the differences in coordinates. It gives the length of a shortest path between the points which travels only parallel to the axes. Unlike other objects we shall see, vectors are more continuous than discrete, and the Lp distance will be in R rather than Z. However, if we restrict the vectors to having only integer entries, then for a vector a, ||a||_1 and ||a||_2^2 (the L2 norm squared, sometimes written L2^2) will be in N.
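As a small illustration of Definition 1.2.6, the following Python sketch (names chosen here for exposition only) computes the Lp distance directly:

def lp_distance(a, b, p):
    # Lp distance between equal-length vectors a and b (Definition 1.2.6).
    assert len(a) == len(b)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# lp_distance([0, 3, 1], [4, 0, 1], 1) == 7.0   (Manhattan)
# lp_distance([0, 3, 1], [4, 0, 1], 2) == 5.0   (Euclidean)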

5 Vector Hamming Distances

Definition 1.2.7 The Hamming norm of a vector a, denoted by ||a||H is the number of places in which it is not zero.

From this we can find the distance between two vectors as the norm of their difference.

Definition 1.2.8 The Vector Hamming distance, ||a − b||_H, between vectors a and b of length n is Σ_{i=1..n} (a_i ≠ b_i)

Here, we implicitly extend x ≠ y as a function onto {0, 1}: it is 0 if x = y and 1 otherwise. This can be cast as a special case of an Lp distance: the vector Hamming distance can be thought of as being related to the L0 distance since (a_i − b_i)^0 = (a_i ≠ b_i). This is to be compared to the String Hamming distance defined later. We will also consider a variation on the Vector Hamming distance, which we call the zero-based Hamming distance.

Definition 1.2.9 The Zero-based Hamming distance is defined as H0(a, b) = Σ_{i=1..n} ((a_i = 0 ∧ b_i ≠ 0) ∨ (a_i ≠ 0 ∧ b_i = 0))
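Both of these distances are straightforward to compute directly; a small Python sketch (illustrative only, with function names chosen for exposition) is:

def vector_hamming(a, b):
    # Vector Hamming distance: number of positions in which a and b differ.
    return sum(1 for x, y in zip(a, b) if x != y)

def zero_based_hamming(a, b):
    # Zero-based Hamming distance H0: positions where exactly one entry is zero.
    return sum(1 for x, y in zip(a, b) if (x == 0) != (y == 0))

# vector_hamming([1, 0, 2, 2], [1, 3, 0, 2]) == 2
# zero_based_hamming([1, 0, 2, 2], [1, 3, 0, 2]) == 2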

These measures — Lp distance, symmetric difference, Vector Hamming distance and set intersection size — define very well understood spaces. In Chapter 2 we shall explore them further and look at approximation results in these spaces. A major theme of this thesis will be to take new problems and show how they can be answered by recourse to these basic distances via embeddings.

1.3 Permutations

A permutation is an ordering of n symbols such that within a permutation each symbol is unique. We shall often represent these symbols as integers drawn from some range (usually {1, . . . , n}), so 1324 is a valid permutation, but 1232 is not. We will name our permutations P, Q, . . .. The ith symbol of a permutation P will be denoted as P[i], and the inverse of the permutation P^−1 is defined so that if P[i] = j then P^−1[j] = i. We can also compose one permutation with another, so (P ◦ Q)[i] = P[Q[i]]. The “identity permutation” I is the permutation for which I[i] = i for all i.

Permutations are of interest for a number of reasons. They are fundamental combinatorial objects, and so comparing permutations is a natural problem to study. They are also a specialisation of strings, as we shall see later, and so studying distances between permutations can help in the study of corresponding distances on strings. Sorting algorithms and circuits manipulate permutations with compare-and-swap operations, since the sorted order is merely a permutation of the input. Permutations are also often used in computational biology applications. Comparative studies of gene locations on the genetic maps of closely related species help us to understand the complex phylogenetic relationships between them. If the genetic maps of two species are modelled as permutations of (homologous) genes, the number of chromosomal rearrangements in the form of deletions, block moves, inversions and so on to transform one such permutation to another can be used as a measure of their evolutionary distance.

In computing permutation distances between pairs of permutations, previous approaches have uniformly chosen one sequence as the ‘goal sequence’. One can perform a relabelling on both, so that the goal sequence is the identity permutation, and the problem reduces to sorting the other modified sequence. Hence when the operations to edit the permutations are based on reversing or transposing subsequences, these problems are often known as “sorting by reversals” and “sorting by transpositions”. Clearly, relabelling both permutations consistently does not alter the distance between them, since we are just changing labels. We shall later see why this step is not always possible. In the following sections, we shall refer to several earlier works on these kinds of problems, although our focus is somewhat different to theirs. In particular, they seek to find a sequence of operations to sort a permutation whose number is an approximation to the distance. Our main concern is just to find the distance approximation, not a set of operations that achieves this bound.

We now describe the different permutation distances that we focus on. These are defined by describing the editing operations that can be performed on the permutations. There is a very large space of potential editing distances defined by arbitrary selections of editing operations. We pick out a few based on a ‘meaningful’ set of operations: reversals, transpositions, swaps, moves and combinations of these. A survey of metrics on permutations from a mathematical perspective is given by Deza and Huang in [DH98].

1.3.1 Reversal Distance

Definition 1.3.1 The Reversal Distance between two permutations, r(P, Q), is defined as the minimum number of reversals of contiguous subsequences necessary to transform one permutation into another. So if P is a permutation P[1] . . . P[n], then a reversal operation with parameters i, j (1 ≤ i ≤ j ≤ n) results in the permutation P[1] . . . P[i − 1], P[j], P[j − 1], . . . , P[i + 1], P[i], P[j + 1] . . . P[n].

Example. The following is a reversal on the permutation 7 3 4 1 5 6 2 8 with parameters 4, 7.

7 3 4 1 5 6 2 8 −→ 7 3 4 2 6 5 1 8

Because Reversal Distance is an editing distance, it is consequently a metric. For this distance to be well defined, we require that P and Q are permutations of each other, that is, that they have exactly the same set of symbols. Later we shall relax this requirement for a generalised notion of reversal distance.

Background and History Finding the reversal distance is frequently motivated from biological problems, since genes of a chromosome can be distinguished and given distinct labels. In certain situations, the prime mutation mechanism acts by reversing the order of a contiguous subsequence of genes [Gus97]. Hence the reversal distance between two sequences is a good indicator of their genetic similarity. The problem attracted a lot of interest in the mid-nineties, although a similar distance arose earlier in the context of “pancake flipping” [GP79]. The current interest in sorting by reversals was set off by Kececioglu and Sankoff [KS95], who gave a 2-approximation (that is, an approximation that is within a factor of 2 of the correct answer) for reversal distance based on counting ‘reversal breakpoints’. Later work improved the approximation to 7/4 by introducing the notion of permutation graphs, and decomposing the graph into cycles [BP93]. A modified version of the same approach improved the approximation to 3/2 [Chr98a], and this was subsequently improved to 11/8 [BHK01]. Reversal distance has been shown to be NP-hard to find exactly [Cap97] and subsequently to be Max-SNP hard [BK98]. However, the related problem of sorting signed permutations by reversals has been shown to be solvable in polynomial time [HP95] and in fact in linear time [BMY01].

1.3.2 Transposition Distance

Definition 1.3.2 The Transposition Distance between two permutations, t(P, Q), is defined as the minimum number of moves of contiguous subsequences to arbitrary new locations necessary to transform one permutation into the other. Given P[1] . . . P[n], a transposition with parameters i, j, k (i ≤ j ≤ k) gives

P[1] . . . P[i − 1], P[j], P[j + 1] . . . P[k], P[i], P[i + 1] . . . P[j − 1], P[k + 1] . . . P[n]

Example. The following is a transposition on the permutation 7 3 4 1 5 6 2 8 with parameters 2, 4, 7.

7 3 4 1 5 6 2 8 −→ 7 1 5 6 2 3 4 8

Transposition distance is also motivated from a Biological perspective. Like Reversal distance, it is an editing distance, and it has received much attention in recent years. The complexity of transposition distance is unknown, although it is widely conjectured to be NP-Hard [Chr98b]. As with Reversal distance, constant factor approximation schemes have been proposed. By counting breakpoints in the permutations (see Chapter 3), it is easy to come up with a 3-approximation. The same measure was shown to be a 9/4 approximation in [WDM00] and this was improved to a 2 approximation in [EEK+01]. The best known approximation factor for Transposition distance is 3/2 [BP98, Chr98b]. These cited works also use a similar idea to approximations of reversal distance, to build a graph based on a permutation and examine the decomposition of this graph into cycles.

1.3.3 Swap Distance

For permutation distances, one could easily define new distances by choosing an arbitrary set of permitted operations and taking the distance as the minimum number of these operations transforming one sequence into another, generating an editing distance. In general, computing the distance under such a definition is NP-Hard (that is, the problem is NP-Hard when the set of operations is part of the input) [EG81] and in fact PSPACE-complete [Jer85]. Instead, we choose to focus on sets of operations that are not arbitrary, but have some relevant motivation, such as reversals and transpositions as described above. The intersection of reversals and transpositions is the swap operation, which interchanges two adjacent elements of a permutation.

Definition 1.3.3 The Swap Distance, swap(P, Q), is the minimum number of transpositions of two adjacent characters necessary to transform one permutation into another. Given a permutation P , a swap at location i generates the new permutation P [1] ...P[i − 1],P[i +1],P[i],P[i +2]...P[n].

It is straightforward to show how the Swap Distance can be computed exactly by counting the number of inversions. We will later see how to compute this distance in other contexts, and how it relates to a similar distance defined on strings.
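To illustrate this, one can relabel the permutations so that Q becomes the identity and count inversions in the relabelled copy of P; each adjacent swap removes exactly one inversion. A short Python sketch follows (a quadratic count for clarity; an O(n log n) merge-sort based count is equally standard, and the function name is chosen here for exposition only):

def swap_distance(P, Q):
    # Relabel P by each symbol's position in Q, then count inversions.
    pos_in_Q = {v: i for i, v in enumerate(Q)}
    R = [pos_in_Q[v] for v in P]
    n = len(R)
    return sum(1 for i in range(n) for j in range(i + 1, n) if R[i] > R[j])

# swap_distance([2, 1, 3], [1, 2, 3]) == 1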

1.3.4 Permutation Edit Distance

Definition 1.3.4 The permutation edit distance between two permutations, d(P, Q), is the minimum number of moves required to transform P into Q. A move can take a single symbol and place it at an arbitrary new position in the permutation. Hence a move with parameters i, j (i > j) turns P into P[1] . . . P[j], P[i], P[j + 1] . . . P[i − 1], P[i + 1] . . . P[n]

This can be seen as only allowing transpositions where one of the blocks being moved is of length one. This distance is analogous to the Levenshtein edit distance on strings (given later in Definition 1.4.3), because it is closely related to the longest common subsequence of the two sequences. Let LCS(P, Q) represent the length of the longest common subsequence of P and Q. An optimal edit sequence will have length n − LCS(P, Q): every element that is not moved must form part of a common subsequence of P and Q, and so an optimal edit scheme will ensure that this common subsequence is as long as possible. Hence, in an online environment, it is very easy to use dynamic programming to calculate this distance exactly in polynomial time: simply calculate the length of the longest common subsequence, and subtract this from n. This distance is sometimes also referred to as the Ulam Metric.
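To make this concrete (a sketch added for illustration, with names chosen for exposition only): since P and Q are permutations over the same symbols, LCS(P, Q) equals the length of the longest increasing subsequence of P after relabelling each symbol by its position in Q, which can be found in O(n log n) time; the edit distance is then n minus this length.

import bisect

def permutation_edit_distance(P, Q):
    # d(P, Q) = n - LCS(P, Q); for permutations, LCS(P, Q) is the longest
    # increasing subsequence of P relabelled by positions in Q.
    pos_in_Q = {v: i for i, v in enumerate(Q)}
    relabelled = [pos_in_Q[v] for v in P]
    tails = []   # tails[k] = smallest tail of an increasing subsequence of length k+1
    for x in relabelled:
        k = bisect.bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(P) - len(tails)

# permutation_edit_distance([3, 1, 2], [1, 2, 3]) == 1   (move the symbol 3 to the end)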

1.3.5 Reversals, Indels, Transpositions, Edits (RITE)

Each of the above distances can be augmented by additionally allowing insertions and deletions of a single symbol at a time. This takes care of the fact that the alphabet set in two permutations need not be identical. It will be of interest to (1) combine various operations (transposition, reversal, symbol moves) and define the respective distance between any two permutations involving the minimum number of operations, and (2) generalise the definitions so that at most one of P or Q is a string (as opposed to a permutation). If both P and Q are strings then this is an instance of string distances, covered below. Collectively, these distances are referred to as Reversals, Indels, Transpositions and Edits, or ‘RITE’ distances. These are described in more detail in Chapter 3. Various approaches have been taken to these hybrid distances, including a technique described in [HP95] which combines reversals with translocations (prefix and suffix reversals) and other operations. In [GPS99] a 2-approximation is given for a distance which allows transpositions, reversals and a combined transposition-reversal (a block is moved and reversed in one operation). Closest to our concept of RITE distances is [WDM00], which gives a 3-approximation for a combination of reversals and transpositions.

1.4 Strings

Definition 1.4.1 A string is a sequence of characters drawn from a finite alphabet σ. The length of a string a is denoted by |a|. A substring of a string a from position l to position r is the string formed by concatenating the r − l +1contiguous characters of a from position l. It is written a[l : r], and so a[l : r]=a[l],a[l +1],...a[r − 1],a[r].

Strings are such fundamental structures in Computer Science, and their uses so many and varied, that it is hard to give a representative account of them. Any text is a string, so any word processing document, web page or sequence of DNA can be treated as a string, and manipulated using the techniques developed in the areas of string editing and matching. String searching has traditionally been an active area of research with many applications in Compilers, Text Editors and other systems. It has a long history in Computer Science theory tracing back to the first stored program computers. The first string problems that were studied dealt with finding exact occurrences of one or more query strings or regular expressions within single or multiple texts. As applications of string searching diversified, more involved string searching was considered during the 1980s, such as finding approximate occurrences of a string within texts; different notions of approximate occurrences such as various edit distances were isolated. New focus emerged in the last decade in areas of Computational Biology, Information Retrieval and Web Search Engines which required string searching on an unprecedented scale. New algorithmic string problems were studied including string searching in presence of dynamic changes to text, compressed matching, and subsequence searching. Now, as these application areas have matured, some of the largest data stores have come to comprise string data: collections of web documents, online technical libraries of text, molecular and genetic data, and many other examples. For a wider overview of strings and string distances, see [SK83] and [Gus97].

1.4.1 Hamming Distance

The first and most fundamental string distance is the Hamming distance, proposed by Hamming in the 1950’s (see, for example, [Ham80]).

Definition 1.4.2 The Hamming distance, h, between two strings a, b of the same length is the number of locations in which they differ. So h(a, b) = |{i | a[i] ≠ b[i]}|.

Implicitly, we assume that both strings are drawn from the same alphabet σ. We can cast Hamming distance as an editing distance: the editing operation is to change a character at a particular location. Hence Hamming distance defines a metric on strings.

Example. The Hamming distance between these two strings is h(a, b) = 3.

a = 10111010
b = 11101000

Hamming distance is of particular interest in coding theory. In synchronous communication channels, the major cause of error is if a one bit is mistakenly read as a zero bit, or vice-versa. The Hamming distance between the original signal and that which was received indicates the error in transmission (ideally, it should be zero). Codes are designed to correct up to a certain number of ‘Hamming errors’ (places where a bit has been misread), since the Hamming distance best captures the nature of the corruption that happens. Hamming codes [Ham80] can detect and correct a single error in a block of bits, whereas more sophisticated Reed-Solomon and BCH codes can correct higher numbers of errors and “burst errors” (a large number of errors happening in close proximity) [MS77, PW72].

1.4.2 Edit Distance

Almost as fundamental as Hamming distance is the String Edit distance, often also known as Levenshtein distance, after the first known reference to such a distance [Lev66].

Definition 1.4.3 The Edit distance between two strings, e(a, b) is the minimum number of character inserts, deletes or replacements required to change one string into another. Formally, these operations on a string a of length n are:

• Character inserts with parameters i, x: this transforms a into a[1] ...a[i − 1] xa[i] ...a[n].

• Character deletes with parameter i: this transforms a into a[1] ...a[i − 1] a[i +1]...a[n].

• Character replacements with parameters i, x: this turns a into a[1] ...a[i − 1] xa[i +1]...a[n].

There are variations of the edit distance, depending on whether character replacements are allowed or not (if not, then they can be simulated by a delete followed by an insertion). In the case where replacements are not allowed, then as with permutation edit distance, the editing distance can be found by finding the longest common subsequence of the two strings. Everything in the subsequence is preserved; then characters in the first string but not in the common subsequence are deleted, and characters in the second string but not in the common subsequence are inserted. So here e(a, b)=|a| + |b|−2LCS(a, b).

Example. The edit distance between the strings algorithms and logarithm, allowing only insertions and deletions, is found by isolating a longest common subsequence; here one such subsequence is lgrithm, of length 7:

algorithms
logarithm

The edit distance is the sum of the lengths of the two sequences, less twice the length of the longest common subsequence. Here, this is 10 + 9 − 2 · 7 = 5. A sequence of editing operations to make the transformation will delete the first, fourth and tenth characters of the original string (a, o and s, leaving lgrithm), then insert o between l and g, and a between g and r. Since this is a metric, the reverse transformation can be carried out by inverting each of the operations — turning inserts into deletes, and vice-versa.

History and applications The edit distance originally arose in the scenario of errors on communica- tion channels, where characters get inserted or deleted to the message being communicated [Lev66]. It has also arisen in the context of biological sequences which are mutated by characters being added or removed [Gus97] and from typing errors being made by humans writing documents [CR94]. For both versions (with and without character replacements) the edit distance can be calculated by standard dynamic programming techniques. This solution has been discovered independently several times over the past thirty years. The cost of this procedure is O(mn) if the strings are length |a| = n and |b| = m respectively. An improvement, using the so-called “Four Russians” speed up can improve this to O(nm/ log(m)), by precalculating sub-arrays of the dynamic programming table [MP80]. The cost can be expressed in terms of the distance d itself, in which case the cost is O(d max(m, n)), although in the worst case this is O(mn) again [Gus97]. Many variations have been considered, such as additionally allowing swaps of adjacent characters [Wag75] and multiple [Gus97].
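As an illustration of the dynamic programming approach mentioned above (a sketch with names chosen for exposition only), the following computes the edit distance with replacements allowed, keeping only one row of the table at a time:

def edit_distance(a, b):
    # Standard dynamic program: O(|a|*|b|) time, O(|b|) space.
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # replace, or keep if equal
        prev = curr
    return prev[-1]

# edit_distance("algorithms", "logarithm") == 4 when replacements are allowed
# (compared with 5 when only insertions and deletions are permitted).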

1.4.3 Block Edit Distances

Hamming distance and (Levenshtein) edit distance are both character edit distances — that is, every operation affects only a single character at a time. These correspond to the Permutation Edit Distance and Swap Distance described above: each permitted editing operation only affects a single character, although the domain here is strings rather than permutations. Another class of string editing distances are Block Edit Distances. These manipulate arbitrarily large substrings (or blocks) at a time. These are analogous to the (permutation) reversal and transposition distances we saw above. There are many applications where substring moves are taken as a primitive: in certain computational biology settings a large subsequence being moved is just as likely as an insertion or deletion; in a text processing environment, moving a large block intact may be considered a similar level of rearrangement to inserting or deleting characters. We go on to describe a variety of edit distances which incorporate block operations: these will apply to different situations.

Tichy Distance One of the first attempts at using block operations was given in [Tic84]. Here, the problem attempted was to describe one string as a sequence of blocks that occur in another. From this we can induce a distance measure:

Definition 1.4.4 The Tichy distance between two strings a and b is the minimum number of substrings of b that a can be parsed into. It is denoted tichy(a, b).

If tichy(a, b)=d then this means that a = b1b2 ...bd where each string bi is some substring of b, that is, bi = b[li : ri].

Example. The Tichy distance tichy(a, b) between the strings below is 4: a can be parsed into four blocks, each of which is a substring of b.

b: The quick brown fox jumps over the lazy dog
a: lazy dog | jumps over | lazy | fox

We can immediately see that the Tichy distance is not a metric. Still, it is a convenient measure of similarity, since it can be easily computed. As described in [Tic84], a greedy algorithm suffices, by starting at the left end of a and repeatedly finding the longest substring of b that matches the unparsed section of a. This parsing can be carried out in O(|a| + |b|) time and space by building a suffix tree1 of b in linear time, and repeatedly searching this from the root, taking one step for each character in a. This first block edit distance is also part of a class we loosely describe as “compression distances”, since it corresponds closely to a compression algorithm: a is compressed using b as a dictionary. It means that if someone holds the string b but does not know a then a can be communicated to them efficiently by listing the parsing of a implied by finding the Tichy distance between a and b. A similar class of distance is discussed in [LT97]. However, it is shown that with only mild changes to the definitions, problems of finding block edit distances between strings are NP-Hard. This gives us the intuition that, in general, with a few exceptions these distances will be hard to find exactly.
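The greedy parsing can be sketched as follows (illustrative only: this version rescans b naively for each block rather than using a suffix tree, so it does not achieve the linear time bound, and the function name is chosen for exposition):

def tichy_distance(a, b):
    # Greedily take, at each step, the longest prefix of the unparsed part
    # of a that occurs as a substring of b (Definition 1.4.4).
    blocks, i = 0, 0
    while i < len(a):
        length = 0
        while i + length < len(a) and a[i:i + length + 1] in b:
            length += 1
        if length == 0:
            return None   # some character of a never appears in b: no parsing exists
        blocks += 1
        i += length
    return blocks

# tichy_distance("lazy dog jumps over lazy fox",
#                "The quick brown fox jumps over the lazy dog") == 4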

LZ Distance The LZ distance is a development of the Tichy distance. It is introduced formally here and in [CPŞV00].

Definition 1.4.5 The LZ distance between two strings a and b is the minimum number of substrings that a can be parsed into, where each substring is either a substring of b or a substring that occurs earlier (to the left) in a. It is denoted lz(a, b).

This distance further develops the idea of relating distance between strings to compressibility. It has been applied to comparing texts in different languages and DNA sequences [BCL02] and in exchanging information efficiently [Evf00]. The idea is that the distance between a and b should relate to the shortest possible description of a using knowledge of b. Note that if b is empty, then the LZ distance is proportional to the size of the optimal Lempel-Ziv compressed form of the string a [ZL77]. The parsing can be computed by adapting an algorithm that performs Lempel-Ziv compression, and running it on the compound string ba (b followed by a), outputting just the parsing of a. Such an optimal parsing can be computed greedily in linear time [Sto88]. However, this measure is still not a metric, hence the introduction of the compression distance below.
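A corresponding sketch of the greedy parse (again illustrative only, and quadratic rather than linear time; treating a character never seen before as a block by itself is an assumption of this sketch rather than part of the definition above):

def lz_distance(a, b):
    # Greedy parsing of a where each block is a substring of b or of the
    # already-parsed prefix of a (Definition 1.4.5).
    blocks, i = 0, 0
    while i < len(a):
        length = 0
        while (i + length < len(a) and
               (a[i:i + length + 1] in b or a[i:i + length + 1] in a[:i])):
            length += 1
        blocks += 1
        i += max(length, 1)   # a brand-new character is taken as a block by itself
    return blocks

# lz_distance("ababab", "ab") == 3 under this (non-overlapping) reading of "earlier in a".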

Compression Distance The power of distances like the LZ distance and the Tichy distance comes from the ability to copy long blocks at unit cost. We show how to define a distance on strings that has this operation and is also a metric. We introduce this metric here and in [CPŞV00]. We call this ‘compression distance’ because of its relation to lossless compression, which we shall go into further later. Suppose we allow copying as a basic operation in our editing metric. Then, because the relation we define must be symmetric, we have to define the inverse of a copy operation. We refer to this as an ‘uncopy’: this is the removal of a substring provided that a copy of this substring remains in the string. Because we want this distance to be defined between all pairs of strings we also need to include character insert and delete operations (otherwise the distance between a string consisting of the single character x and one consisting of the character y is undefined).

1 A suffix tree is a compressed trie containing all suffixes of a string. See [Wei73, McC76, Gus97] for how this can be accomplished in time linear in the length of the string.

It is natural to think of this as a distance that measures how many “editing” operations separate two documents, where the operations are like the “cut and paste” and “copy” operations supported by word processors, as well as the basic character editing operations. This distance also has applications to computational biology settings and constructing phylogenies of texts and languages [LCL+03] and other pattern matching areas [LT97]. We now give a formal definition of this distance.

Definition 1.4.6 The Compression Distance between two strings a (of length n) and b, c(a, b), is the minimum number of the following operations to transform a into b:

• Substring copies with parameters 1 ≤ i ≤ j ≤ n, k2 : this turns a into a[1] ...a[k − 1],a[i] ...a[j],a[k] ...a[n].

• Substring uncopies with parameters 1 ≤ i ≤ j ≤ n: this transforms a into a[1] ...a[i − 1],a[j +1]...a[n], provided that a[i : j] remains a substring of the resulting string.

• Character inserts with parameters i, x: this transforms a into a[1] ...a[i − 1],x,a[i] ...a[n].

• Character deletes with parameter i: this transforms a into a[1] ...a[i − 1],a[i +1]...a[n].

Variations of Compression Distance It is reasonable to allow additional editing operations for variants of the Compression distance. For example, we may further allow character replacements and substring transpositions (moving a substring from one place to another in the string) as primitive operations, although these can be simulated by combinations of the core operations: a move is a copy followed by an uncopy, for example. By analogy with Permutation distances, we may also allow substrings to be reversed. Finally, we may permit arbitrary substrings to be deleted rather than the restricted deletions from the uncopy operation — note that if this operation is allowed then the induced distance is not a metric. Formally, these additional operations are defined as follows:

• Character replacements with parameters i, x turns a into a[1] ...a[i − 1],x,a[i +1]...a[n].

• Substring moves with parameters i ≤ j ≤ k transforms a into a[1] ...a[i − 1],a[j] ...a[k − 1],a[i] ...a[j − 1],a[k] ...a[n].

• Reversals with parameters 1 ≤ i ≤ j ≤ n turns a into a[1] ...a[i − 1],a[j], ...a[i],a[j +1]...a[n].

• Substring deletions with parameters 1 ≤ i ≤ j ≤ n turns a into a[1] ...a[i − 1],a[j +1]...a[n].

Definition 1.4.7 Compression distance with unconstrained deletes between strings a and b is defined as the minimum number of character insertions, deletions or changes, or substring copies, moves or deletions, necessary to turn a into b

Although the question is still open, it is conjectured that the problem of computing any of these distances between a pair of strings is NP-Hard. This intuition comes from the related hardness results for similar distances on strings [LT97, SS02] and permutations [Cap97, BK98].

2 Note that k can lie between i and j, in which case the operation is equivalent to two copies, with parameters (k, j, k) and (i, k, k).

String Edit Distance with Moves A final block edit distance arises when trying to answer the question: what is the simplest and most intuitive block edit distance that is also a metric? The operation of "uncopying" feels somewhat unnatural, and copying is perhaps unsuited for some settings where sequences are manipulated by mutations which only rearrange [SS02]. The string edit distance with moves is introduced here and in [CM02]. It takes the string edit distance (Definition 1.4.3) and augments it with a single block edit operation, of moving a substring. It has recently been investigated further, and shown to be NP-hard [SS02].

Definition 1.4.8 The string edit distance with moves between two strings, d(a, b) is the smallest number of the following operations to turn string a into string b:

• Character inserts with parameters i, x: this transforms a into a[1] ...a[i − 1],x,a[i] ...a[n].

• Character deletes with parameter i: this transforms a into a[1] ...a[i − 1],a[i +1]...a[n].

• Substring moves with parameters 1 ≤ i ≤ j ≤ k ≤ n transforms a into a[1] ...a[i − 1],a[j] ...a[k − 1],a[i] ...a[j − 1],a[k] ...a[n].

This distance is by definition a metric. Note that there is no restriction on the interaction of edit operations so, for example, it is quite possible for a substring move to take a substring to a new location and then for a subsequent move to operate on a substring which overlaps the moved substring and its surrounding characters. This leads to a very powerful and flexible distance measure.

1.5 Sequence Distance Problems

Now that we have defined a variety of sequence distances, we can go on and introduce a number of problems that are based on these distances. These are problems of Efficient Communication, which asks for these embeddings to be used by distributed parties; Approximate Pattern Matching, which generalises the well-studied problems of string matching; Approximate Searching and Clustering problems which generalise problems from computational geometry.

1.5.1 Efficient Computation and Communication

The most basic problem on these distances is to compute them efficiently. We can use the embeddings to assist in this goal: our embeddings are designed to be efficiently computable (in time at most polynomial in the length of the original sequences). They give guaranteed approximations of the distances, even if there are no known efficient ways to find these distances exactly. So approximations to the distances of interest should be relatively easy to compute, by using the embeddings. A rather more challenging goal is to allow an approximation to the distance to be computed in a distributed model of computing. There are many variations of this model of computation, but we will use the most straightforward: that there are two people, A and B, who each hold a sequence (a and b). The first problem is for A and B to communicate so that they find out (an approximation to) the distance between a and b. The goal is for this communication to be as small as possible, and certainly much smaller than the cost of sending either sequence to the other person. A further problem in this model is for A and B to communicate to allow B to discover what A’s sequence is. In general, this kind of problem cannot be solved for every possible a and b without sending an amount of information linear in the size of a and b. However, we will use the intuition that if the distance between a and b is small, then a can be described to B using a much smaller amount

of communication. For example, if a represents a new version of a web page that B has cached an old version of, we would hope that there was an efficient way to use this fact to save having to send the whole page again. In particular, if we measure the distance between a and b using an editing distance, then we could describe a by listing the editing operations necessary to turn b into a. We will therefore look for solutions that communicate a to B using an amount of communication that is parameterised by the distance between a and b. These may be achieved in a single communication between A and B or in a series of rounds of interactive communication.

1.5.2 Approximate Pattern Matching

String matching has a long history in computer science, dating back to the first compilers in the sixties and before. Text comparison now appears in all areas of the discipline, from compression and pattern matching to computational biology and web searching. The basic notion of string similarity used in such comparisons is the Levenshtein edit distance between pairs of strings, although other distances are also appropriate. We can define a generalised problem of approximate pattern matching for any sequences. We consider the abstract situation of having one long sequence (the text) and wishing to find the best way of aligning a shorter sequence (the pattern) against each position in the text. The quality of an alignment can be measured using an appropriate sequence distance measure.

Definition 1.5.1 The problem of approximate pattern matching is, given a pattern P [1 : m] and a long text T [1 : n], find D[i] for each 1 ≤ i ≤ n. D[i] is defined as minj d(P, T[i : j]) for a given sequence distance d.

The choice of distance d then gives different instances of this general problem. For Hamming distance, this yields the familiar "string matching with mismatches" problem (see [Gus97] for a discussion of this problem). Here the issue of alignment is not significant, since for each i ≤ n − m + 1 we have D[i] = h(P, T[i : i + m − 1]). This problem can be trivially solved in time O(mn). Faster solutions can be achieved in time O(|σ| n log m) by using Fast Fourier transforms [FP74], and in time O(n √(m log m)) using other combinatorial approaches [Abr87]. We shall generally be interested not in the exact version of this problem, but rather the approximate version, where for each D[i] we find an approximation D̂[i]. For the approximate version of string matching with mismatches, Karloff gives an algorithm to find a (1 + ε) approximation in time O(1/ε² n log³ m) [Kar93]. This has been improved by the use of embeddings to O(1/ε² n log n) in [Ind98, IKM00]. For the string edit distance, this problem has also been well studied. In particular, the variation where we only wish to find D[i] if D[i] is below some threshold k has received particular attention.

Lemma 1.5.1 Pruning Lemma from [CM02] Given any pattern P and text T , for all l ≤ r ≤ n,

d(P, T[l : l + m − 1]) ≤ 2 d(P, T[l : r]).

Proof. Observe that for all r in the lemma, d(P, T[l : r]) ≥ |(r − l + 1) − m|, since this many characters must be inserted or deleted. Using the triangle inequality, since d is a metric, we have for all r

d(P, T[l : l + m − 1]) ≤ d(P, T[l : r]) + d(T[l : r], T[l : l + m − 1]) = d(P, T[l : r]) + |(r − l + 1) − m| ≤ 2 d(P, T[l : r]).

The middle step follows by considering the longest common prefix of T[l : r] and T[l : l + m − 1]. ✷

The significance of the Pruning Lemma is that it suffices to approximate only O(n) distances, namely, d(P, T[l : l + m − 1]) for all l, in order to solve Approximate Pattern Matching problems, up to an approximation factor of 2. This follows since we now know that D[i] ≤ d(P, T[i : i + m − 1]) ≤ 2D[i].
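As an illustration of how the Pruning Lemma is used, the following sketch evaluates a supplied distance function only against the n windows of length m, which by the lemma gives a factor-2 approximation of D[i]. The function dist is a placeholder for any metric on strings (for example, an estimator of string edit distance with moves); it is an assumption of this sketch, not part of the lemma.

```python
def approximate_pattern_matching(pattern, text, dist):
    """Return estimates D_hat[i] with D[i] <= D_hat[i] <= 2*D[i], where
    D[i] = min_j dist(pattern, text[i:j]).  By the Pruning Lemma it suffices
    to evaluate dist against the length-m window starting at each position
    (windows near the end of the text are simply shorter)."""
    m, n = len(pattern), len(text)
    return [dist(pattern, text[i:i + m]) for i in range(n)]
```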

1.5.3 Geometric Problems

There are many different questions which are collectively referred to as Geometric Problems. These include questions relating to sets of points, shapes and objects (usually in Euclidean space) and the (Euclidean) distance between them [SU98, PS85]. However, we can generalise these problems to be meaningful for any distance measure. Examples of these problems include finding points that are close to a query point, far away from a query point, dividing the points into subsets of close points, or building spanning trees of the points. We shall consider two particular problems, Approximate Neighbors and Clustering for k-Centers3. Both of these problems on m points of dimensionality n can be solved using O(m) comparisons of points, each comparison taking time Ω(n). However, in practical situations with m equal to thousands or millions of points in high dimensional space, solutions that are linear in the size of the data are impractical. In Chapter 2 we shall describe approximate solutions for these problems with vector distances that achieve time sublinear in the size of the collection of points. In subsequent chapters we shall see how equivalent problems using different distance metrics can be solved by using these solutions as a “black box” to which we can feed transformed versions of the problem instances. We now go on to formally define these two geometric problems.

1.5.4 Approximate Neighbors

The Nearest Neighbors problem is to pre-process a given set of points P so that presented with a query point q the closest point from P can be found quickly. There are efficient solutions to this exact problem for low dimensional Euclidean space (say, 2 or 3 dimensions) using data structures such as k-d trees and R-trees [GG98]. However, as the dimension of the space increases these approaches are struck by the so-called "curse of dimensionality". That is, as the dimension increases, the cost of the pre-processing stage increases exponentially with the dimension. This is clearly undesirable. To overcome this, recent attention has focused on the relaxation of this problem, to finding "ε-approximate nearest neighbors" (ANN):

Definition 1.5.2 The ε-Approximate Nearest Neighbors Problem is, given a set of points P in n dimensional space and a query point q, find a point p ∈ P such that ∀p′ ∈ P : d(p, q) ≤ (1 + ε) d(p′, q), with probability 1 − δ for a parameter δ.4

3The US spellings of "neighbours" and "centres" are adopted throughout in deference to the fact that this is the accepted technical terminology for these problems.

4Different approaches exist for how this probability is realised: in [KOR98], with probability 1 − δ the structure created is good for all possible queries; whereas in [IM98] the point returned for a query is good with probability 1 − δ taken over all queries selected uniformly. There the space is the Hamming space, which is discrete.

A satisfactory solution will have pre-processing costs polynomial in |P| = m and n, and query costs sublinear in m and linear or near linear in n.5 Note that the problem statement does not specify the nature of the distance function d, and so applies to any distance measure, so we can ask for Approximate Nearest Neighbors for permutation and string distances. The dual problem to Nearest Neighbors search is Furthest Neighbors search. By analogy with ANN, the "ε-approximate furthest neighbors" (AFN) problem is defined as follows.

Definition 1.5.3 Given a set of points P in n dimensional space and a query point q, find a point p ∈ P such that ∀p′ ∈ P : (1 + ε) d(p, q) ≥ d(p′, q), with probability 1 − δ (over all queries).

1.5.5 Clustering for k-centers

The general problem of clustering is to divide a set of m data points into clusters, so that each cluster contains similar points. There are many formalisations of this statement. We shall consider a commonly used formalisation. Let S be the set of objects, and we are asked to divide S into k non-overlapping subsets, cluster = {S1 ... Sk}, such that the union of these Si's is S. The goal is to minimise the size of each cluster, that is, to minimise

$$\mathrm{spread}(cluster) = \max_{i}\; \max_{j,l \in S_i} d(j, l)$$

This problem is well-defined for any distance measure d. For this version of clustering, we will define the clusters implicitly by picking k points from the space. Each of these points defines a cluster, consisting of all the data points that are closer to this point than any of the other k − 1 centers. Ties are broken arbitrarily, by assigning a point to any of the centers it is equidistant from.

Definition 1.5.4 A c-approximate solution to the clustering problem is a set of k clusters approx (any partition of S into k disjoint subsets) such that for every clustering clusters: spread(approx) ≤ c · spread(clusters).

Again, this definition is general, so this problem is equally valid for any class of combinatorial objects and an appropriate distance function d.
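To make the objective concrete, the following minimal sketch computes the spread of a clustering and the clusters implicitly defined by a set of chosen centers, for an arbitrary distance function d supplied by the caller. It is an illustration of the definitions only, not a clustering algorithm.

```python
def spread(clusters, d):
    """spread of a clustering: the largest pairwise distance within any cluster,
    for an arbitrary distance function d."""
    return max((d(x, y) for cluster in clusters for x in cluster for y in cluster),
               default=0)

def assign_to_centers(points, centers, d):
    """Implicitly define clusters from k centers: each point joins the cluster of
    its closest center (ties broken arbitrarily by min())."""
    clusters = [[] for _ in centers]
    for p in points:
        c = min(range(len(centers)), key=lambda c: d(p, centers[c]))
        clusters[c].append(p)
    return clusters
```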

1.6 The Shape of Things to Come

Now that we have introduced the distances that we will study and some of the problems we would like to answer, we can go on to describe the results we shall show for these distances. Our results are of two types: firstly, embeddings of these distances into different spaces with bounded distortion; and secondly, applications of these embeddings to solve problems related to the original distance. The distances that we consider are summarised in Figure 1.2 — we include the distortion factor that we will show for embedding these distances into alternative spaces. We now outline the structure of the rest of this work: Chapters 2, 3 and 4 give results on embedding sequence distances, and Chapters 5 and 6 give some applications of these embeddings. Finally, Chapter 7 concludes with some discussions on extensions, improvements and open problems.

Chapter 2: Sketching and Streaming

We begin our study of embeddings by considering metrics in Vector Spaces with the Lp distance, and show how these can be embedded in vector spaces of much smaller dimension. These embeddings

5In the ANN literature, it is more usual to denote the dimensionality of the space with d and the number of objects with n. We use a different convention in keeping with our choice of n for the size of objects throughout.

Set and Vector Distances | Description | Reference | Metric | Factor
Symmetric Difference | Number of elements in only one set | [AMS99] | Yes | 1 ± ε
Intersection Size | Number of elements in both sets | [KN97] | No | n
Union Size | Number of elements in either set | [FM85] | No | 1 ± ε
Lp distances | (Σ_i |a_i − b_i|^p)^{1/p} | [Ind00] | Yes | 1 ± ε

String Distance | Description | Reference | Metric | Factor
Hamming Distance | Change individual characters | [Ham80] | Yes | 1 ± ε
Levenshtein Edit Distance | Character operations (insert, delete, change) | [Lev66] | Yes | —
LZ Distance | Create target string by copying from source or partially built string | [CPSV00] | No | log n log* n
Compression Distances | Unit cost substring copy, move, uncopy | [CPSV00] | Yes | log n log* n
Unconstrained Delete | As Compression, with unconstrained deletes | [Evf00] | No | log n log* n
Edit Distance with moves | Substring move, insert and delete characters | [CM02] | Yes | log n log* n
Tichy's Distance | Create target by copying from source only | [Tic84, LT97] | No | —

Permutation Distance | Description | Reference | Metric | Factor
Permutation Edit Distance | Elementary operations (insert, delete, change) | [DH98] | Yes | log n
Reversal Distance | Block reversals | [Gus97] | Yes | 2
Transposition Distance | Block transpositions | [Gus97] | Yes | 2
Swap Distance | Adjacent character transpositions | [Jer85] | Yes | 1
RITE Distances | Reversals, Indels, Transpositions and Edits | [CMS01] | Yes | 3

A summary of the main distances that we will discuss: for each one we give a brief description and a reference. We record whether it is a metric, and give the factor of distortion of our embedding of the distance into a smaller space.

Figure 1.2: Summary of the distances of interest

can be computed in a number of computation models, including the streaming model in which a large vector is processed in an arbitrary order one entry at a time with very small working space requirements. We go on to study set measurements, and show how these are related to vector distances. We are able to show hardness results for other set problems in a number of models including the sketching model, where a short summary of a set is created to allow approximation of the set measure. Finally, we discuss further problems in vector spaces, of clustering (organising a set of vectors into subsets of close vectors) and approximate nearest neighbors (pre-processing a set of vectors to rapidly find the approximately closest to a query vector). This chapter consists mostly of a survey of prior results of other authors, which will be used in subsequent chapters to build solutions for different sequence distances.

Chapter 3: Sorting Sequences

We next study embeddings of permutations into vector spaces. Motivated by computational biology scenarios, we study problems of computing distances between permutations as well as matching permutations in sequences, and finding approximate nearest neighbors. We adopt the general approach of embedding permutation distances into well-known vector spaces in an approximately distance-preserving manner, and solve the resulting problems on those spaces. The main results are as follows: we present approximately distance-preserving embeddings of these permutation distances into well-known spaces. Using these embeddings, we obtain several results, including (1) communication complexity protocols to estimate the permutation distances accurately; (2) efficient solutions for approximately solving nearest neighbor problems with permutations; and (3) algorithms for finding permutation distances in the streaming model. We use these embeddings to allow us to solve approximate pattern matching problems for permutation distances.

Chapter 4: Strings and Substrings

We use the approximate pattern matching problem as applied to string edit distance with moves as a guiding motivation to study problems on string distances. We seek a solution to this problem that is close to linear in the size of the input instead of the existing quadratic solutions to related problems on string edit distance. Our result is a near linear time deterministic algorithm for our version of the problem. It produces an answer that is an O(log n log* n) approximation of the correct answer. This is the first known significantly subquadratic algorithm for a string edit distance problem in which the distance involves nontrivial alignments.

The results are obtained by embedding strings into L1 using a simplified parsing technique we call Edit Sensitive Parsing (ESP). This embedding is approximately distance-preserving, and we show many applications of this embedding to string proximity problems including nearest neighbors and streaming computations with strings. We also show embeddings for other variations of this distance, including the compression distances, and block edit distances with unconstrained deletes.

Chapter 5: Stables, Subtables and Streams

We study the solution of some problems on vectors using the embeddings seen in Chapter 2 to achieve sublinear time and space. We first tackle the problem of comparing data streams. In particular, we use the Hamming norm which is a well-known measure throughout data processing. When applied to a single stream, Hamming norm gives the number of distinct items that are present in the data stream, which is a statistic of great interest in databases. When applied to a difference between a pair of streams, Hamming norm gives an important measure of (dis)similarity: the number of unequal items in the two

streams. This can be used in auditing data streams that are expected to be nearly identical, and can be applied to network intrusion detection. We also examine the problem of data mining and clustering massive data tables. Detecting similarity patterns in such data sets (e.g., which geographic regions have similar cell phone usage distribution, which IP subnet traffic distributions over time intervals are similar, etc) is of great importance. Identification of such patterns poses many conceptual challenges (what is a suitable similarity distance function for two "regions") as well as technical challenges (how to perform similarity computations efficiently as massive tables get accumulated over time) that we address. We implement methods for determining similar regions in massive tabular data. Our methods are approximate, but highly accurate as we show empirically, and they are fast, running in time nearly linear in the table size.

Chapter 6: Sending and Swapping

We address the problem of minimising the communication involved in the exchange of similar documents. This turns out to be facilitated by the embeddings we have advanced in Chapters 2, 3 and 4. We consider two users, A and B, who hold documents a and b respectively. Neither of the users has any information about the other’s document. They exchange messages so that B computes a; it may be required that A computes b as well. The goal is to design communication protocols with the main objective of minimising the total number of bits they exchange; other objectives are minimising the number of rounds and the complexity of internal computations. An important notion which determines the efficiency of the protocols is how we measure the distance between a and b.We consider several metrics for measuring this distance as described in Chapter 1. For each metric, we present communication-efficient protocols, which often match the corresponding lower bounds up to a constant factor. In consequence, we obtain error-correcting codes for these error models which correct up to d errors in n characters using O(d · poly-log(n)) bits.

Chapter 2

Sketching and Streaming

The Siege of Belgrade

An Austrian army, awfully arrayed,
Boldly by battery besieged Belgrade.
Cossack commanders cannonading come,
Dealing destruction's devastating doom.
Every endeavor engineers essay,
For fame, for fortune fighting - furious fray!
Generals 'gainst generals grapple - gracious God!
How honors Heaven heroic hardihood!
Infuriate, indiscriminate in ill,
Kindred kill kinsmen, kinsmen kindred kill.
Labor low levels longest, loftiest lines;
Men march 'mid mounds, 'mid moles, 'mid murderous mines;
Now noxious, noisey numbers nothing, naught
Of outward obstacles, opposing ought;
Poor patriots, partly purchased, partly pressed,
Quite quaking, quickly "Quarter! Quarter!" quest.
Reason returns, religious right redounds,
Suwarrow stops such sanguinary sounds.
Truce to thee, Turkey! Triumph to thy train,
Unwise, unjust, unmerciful Ukraine!
Vanish vain victory! vanish, victory vain!
Why wish we warfare? Wherefore welcome were
Xerxes, Ximenes, Xanthus, Xavier?
Yield, yield, ye youths! ye yeomen, yield your yell!
Zeus', Zarpater's, Zoroaster's zeal,
Attracting all, arms against acts appeal!

— Alaric Alexander Watts, 1797 – 1864

Chapter Outline

This chapter is concerned with setting up the basic embedding results for this thesis which will be built on in later chapters. We consider just sets and integer valued vectors (for the most part, these two can be interchanged using some simple isomorphisms), and survey the known results on embeddings for distance measurements on these objects. Almost all results reported here are due to other authors, and this should be viewed as background material for the discussions and development of solutions for questions relating to the sequence distances we consider in subsequent chapters. Some detail is provided on the methods and theorems described to give an insight into how they work. Summaries of the proofs are given when this gives further insight into how they work, and also when we build on these ideas in later chapter, which requires modifying the nature of the mechanism which is used, or altering the proof. For example, we need to understand the methods used to compute sketches of vectors, clusterings and nearest neighbors because in later chapters we will want to use these algorithms on transformed sequences and be sure that corresponding results can be proved about their accuracy and running time. We progress as follows: Section 2.1 begins by discussing different models for computing embeddings, and how to compute probabilistic equality tests in these models. Section 2.2 takes the important class of vector Lp distances, and reports on the recent progress that has been made in computing embeddings of these in various models. In Section 2.3 we consider various measurements on pairs of sets in terms of entries in Venn diagrams, and how these can be related to vector distances. Lastly, in Section 2.4 we look at a number of so-called “geometric problems” on vector spaces, which will later be applied to other spaces by way of our embeddings.

2.1 Approximations and Estimates

Throughout this thesis we shall be concerned with finding approximations of certain quantities. We therefore need to introduce the concept of an approximation, and distinguish it from the related concept of an estimate.

Definition 2.1.1 An estimate of a quantity x is a random variable x¯ so that x¯ = x with some probability 1 − δ. In other words, an estimate allows finding the exact answer with some probability. On the other hand, an approximation allows finding a value close to the exact answer.

Definition 2.1.2 An approximation of a quantity x is a random variable x̂ so that ax̂ ≤ x ≤ bx̂ with probability 1 − δ, for constants a, b, δ > 0. From this, we see that an estimate is a special case of an approximation with a = b = 1. However, we shall keep these notions distinct since in many cases it will prove to be unfeasible to estimate a quantity, but highly efficient to approximate it. We shall see several different kinds of approximation: "tunable" approximations where a = 1 − ε, b = 1 + ε, where ε is a parameter that can be chosen to be arbitrarily small; constant factor approximations, where a = 1 and b is some (small) constant; and input dependent approximations, where a = 1 and b is some function of the size of the input. Tunable approximations, or "(ε, δ)-approximations", are especially desirable. In the case where the running time of the approximation algorithm is polynomial in the size of the object, n; depends on the quality of approximation ε as a polynomial in 1/ε; and depends on the fidelity δ as a polynomial in log 1/δ; then this can be thought of as a 'Fully Polynomial Randomised Approximation Scheme' or FPRAS. See Section 11.2 of [MR95] for more details of this notion of approximation schemes.

2.1.1 Sketch Model

The sketch model of computation can be described informally as a model where given an object x,a shorter “sketch” of x can be made so that comparing two sketches allows a function of the original objects to be approximated. We shall focus here on where both objects are of the same type, and the function being approximated is a distance function.

Definition 2.1.3 A distance sketch function sk(a, r) with parameters ε, δ has the property that for a distance d(·, ·), a specified deterministic function f outputs a random variable f(sk(a, r), sk(b, r)) so that

(1 − ε) d(a, b) ≤ f(sk(a, r), sk(b, r)) ≤ (1 + ε) d(a, b)

for any pair of points a and b with probability 1 − δ, taken over all choices of (small) seed r, chosen uniformly at random.

For a sketch function to be of interest, it should reduce the size of the objects. Ideally, |sk(a, r)| and |r| should be O(poly-log(|a|)). There is a strong connection between sketching and the communication complexity protocols that we shall see later: the sketch of an object a can be communicated to another party who holds b to allow the computation of f(sk(a, r), sk(b, r)) by separated parties. The approximation of the distance between a and b can then be performed by comparing their sketches. Alternatively, two parties can each send their sketches to a third party who can find the approximate distance between a and b without ever knowing a or b. The term 'sketch' was introduced in [Bro98], where the distance considered was the resemblance of two sets A and B, defined as |A ∩ B|/|A| or alternatively |A ∩ B|/|A ∪ B|. We shall see details of these and other sketch algorithms later.
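As an informal illustration of resemblance sketches, the following minimal min-wise sketch keeps, for each of k hash functions, the minimum hash value over the set; the fraction of agreeing coordinates estimates |A ∩ B|/|A ∪ B|. Python's built-in hash with a random salt stands in for the min-wise independent permutations of [Bro98]; this is an assumption made purely to keep the example short (and is only consistent within a single run of the interpreter).

```python
import random

def minhash_sketch(items, k, seed=0):
    """A k-value min-wise sketch of a set: for each of k salted hash functions,
    keep the minimum hash value over the set (an illustrative stand-in for
    min-wise independent permutations)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimate_resemblance(sk_a, sk_b):
    """The fraction of coordinates where the two sketches agree estimates
    |A intersect B| / |A union B|."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)
```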

2.1.2 Streaming

The model of sketching assumes that the computation has complete access to one part of the input. An alternate model is that of streaming, in which the computation has limited access to the whole data. In the streaming model, the data arrives as a stream — a predetermined sequence — but there is only a limited amount of storage space in which to hold information. It is not possible to backtrack over the stream, but instead each item must be processed in turn. Thus a stream is a sequence of n data items z =(s1,s2,...,sn) which arrive one at a time in this order. Sometimes the number of items, n, will be known in advance, otherwise the final data item sn+1 will be an “end of stream” marker. Data streams are now fundamental to many data processing applications. For example, network elements such as switches and routers periodically generate records of their traffic — telephone calls, Internet packet and flow traces — which are data streams [BSW01, Net, GKMS01a]. Atmospheric observations — weather measurements, lightning strike data and satellite imagery — also produce multiple data streams [Uni, NOA]. Emerging sensor networks produce many streams of observations, for example highway traffic conditions [Sma, MF02]. Sources of data streams — large scale transactions, web clicks, ticker tape updates of stock quotes, toll booth observations — are ubiquitous in daily life. Data streams are often generated at multiple distributed locations. Despite the exponential growth in the capacity of storage devices, it not common for such streams to be stored. Nor is it desirable or helpful to store them, since the cost of any simple processing — even just sorting the data — would be too great. Instead they must be processed “on the fly” as they are produced.

Definition 2.1.4 A streaming algorithm accepts a data stream z and outputs a random variable str(z, r) to approximate a function g so that

(1 − ε) g(z) ≤ str(z, r) ≤ (1 + ε) g(z)

with probability 1 − δ over all choices of the random seed r, for parameters ε and δ.

23 The most important parameter of a streaming algorithm is the amount of working space that it uses (we include in this the size of any random bits, r, used by the algorithm). For a streaming algorithm to be of interest, then the working space must be o(n), and preferably O(poly-log n). The streaming model is a good model for dealing with massive data sets that are too large to fit entirely within computer memories. We can think of the situations where data resides on tapes, or is supplied over a network link, where random access is not possible. The notion of streaming was introduced in [HRR98], where it was applied to graphs. Recent work has subdivided the streaming model into various sub-categories [GKMS01b]. They considered two features of the stream: whether it is ordered or unordered (the data arrives conforming to some order of some attribute); and whether it is aggregated or unaggregated (whether all the information for a particular attribute arrives together or whether it is spread out arbitrarily within the stream). Since we deal exclusively with sequences in this work, it is straightforward to map these concepts onto sequences that arrive in streams. We imagine that each element in the stream will be a pair that indicates that for the sequence a, we have a[i]=j. Additionally, there may be information identifying which stream the entry relates to if multiple streams are being processed concurrently — we gloss over this detail. We can then consider the four possible sub-divisions of the streaming model.

1. Ordered and aggregated — pairs arrive so that i is strictly increasing.

2. Ordered and unaggregated — pairs arrive so that i is increasing, but several js may be seen consecutively for the same i. In most cases it will be trivial to aggregate this information, and so these first two cases are essentially identical from our point of view.

3. Unordered and aggregated — pairs arrive such that there is no ordering on i, but there is at most one pair for each i.

4. Unordered and unaggregated — pairs arrive with no restrictions.

The last case, of unordered unaggregated streams, is the most general and requires some further discussion. It is not immediately clear what are the semantics of seeing multiple different pairs throughout the stream. For a vector, we shall usually take it to mean that a[i] should be the sum of all jk that arrive in this way. This fits neatly with the idea that a[i] is assumed to be zero if no pair arrives, and that the other variations of streaming are a special case of unordered unaggregated streams. For strings and permutations, it is not so clear how to interpret unaggregated streams — perhaps that we should take the most recent pair to define a[i]. Instead, our streaming algorithms for strings and permutations will mostly work in the ordered case, where this question does not arise. We can adapt these models of streaming for distance functions: suppose that z consists of two arbitrarily interleaved sequences, a and b, and that g(z) is d(a, b). Then a streaming algorithm to solve this problem is approximating some distance between a and b. It is possible that an algorithm can work in both the sketch and the streaming model. Often, a stream algorithm can be construed as a sketch algorithm, if the sketch is the contents of the streaming algorithm’s working store. However, a sketch algorithm is not necessarily a streaming algorithm, and a streaming algorithm is not always a sketch algorithm. We shall see examples of streaming algorithms as applied to vector distances in this chapter.
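For vectors, the semantics just described can be made concrete: the following small sketch interprets an unordered, unaggregated stream of (i, j) pairs as a vector whose i-th entry is the sum of the values seen for i (indices are 0-based here purely for convenience).

```python
from collections import defaultdict

def aggregate_stream(pairs, n):
    """Interpret an unordered, unaggregated stream of (i, j) pairs as a vector a
    of length n: a[i] is the sum of all values j seen for index i, and 0 if no
    pair arrives for i."""
    a = defaultdict(int)
    for i, j in pairs:
        a[i] += j
    return [a[i] for i in range(n)]

# Example: the stream (2, 5), (0, 1), (2, -2) defines the vector [1, 0, 3, 0].
print(aggregate_stream([(2, 5), (0, 1), (2, -2)], 4))
```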

2.1.3 Equality Testing

As a first example of sketching and streaming, we consider the basic problem of determining whether two objects are identical or not. In many places, we would like to be able to test efficiently whether two vectors (equivalently, sets or strings) are equal. We can pre-compute a hash function of a vector a, which is hash(a), and we want hash(a) = hash(b) if and only if a = b. Ideally |hash(a)| << |a|.

It turns out that this goal as stated is impossible, by a simple counting argument. However, if we use probabilistic methods, and accept some chance of error, then we can achieve that for any pair a ≠ b, with high probability hash(a) ≠ hash(b). This relies on choosing the function hash from an appropriate distribution: we define a family of functions, from which we pick hash at random. Then the probability statement holds over choices of hash from the family. Of course, if a = b then hash(a) = hash(b), since hash will be a deterministic function. Such a hash function is often called a fingerprint, the idea being that if two fingerprints match then it is very likely they came from the same individual. To put this into the same setting as our discussion of distance metrics, we consider the trivial discrete metric d= on any finite set

d=(a, b) = 0 ⟺ a = b

d=(a, b) = 1 ⟺ a ≠ b

We look for a sketch function which is an (ε, δ) approximation for d=. Hence our fingerprints will be sketch functions for d=. The parameter ε is no longer meaningful, since d= takes on only the values 0 or 1, so this can equivalently be viewed as an estimation.

Lemma 2.1.1 There exist fingerprint functions for vectors which represent vectors of size n bits using O(log n) bits.

Proof. We could just use the sketches for L1 or L2 distance that we will later develop, since these have the property that they estimate the distance to be zero only when the vectors are identical. But we can instead consider functions designed specifically for this problem. Such functions have been used in string searching [KR87] and are described in more detail in Section 7.4 of [MR95]. We assume that the vectors have only binary entries (it is easy to represent a vector where each entry is an integer that is no more than M as a binary vector of length log M times its original length). The fingerprint of such a vector a of dimension n is

$$\mathrm{hash}(a) = \Big(\sum_{i=1}^{|a|} a_i 2^{i-1}\Big) \bmod p$$

where p is a prime chosen uniformly at random from the range [2 . . . (n/δ) log(n/δ)]. Over a random choice of p, the probability of two different vectors having the same fingerprint is at most O(δ) (Theorem 7.5 in [MR95]). ✷

These fingerprints have a useful property, that the fingerprint of a vector formed by the concatenation of two other vectors can be computed by composing their fingerprints. Let a||b be the concatenation of two vectors a and b, so |(a||b)| = |a| + |b|.

Lemma 2.1.2 $\mathrm{hash}(a||b) = (\mathrm{hash}(a) + 2^{|a|}\,\mathrm{hash}(b)) \bmod p$

Proof.

$$\mathrm{hash}(a||b) = \sum_{i=1}^{|a|+|b|} (a||b)_i\, 2^{i-1} \bmod p = \Big(\sum_{i=1}^{|a|} a_i 2^{i-1} + \sum_{i=1}^{|b|} b_i 2^{|a|+i-1}\Big) \bmod p$$
$$= \Big(\big(\sum_{i=1}^{|a|} a_i 2^{i-1} \bmod p\big) + 2^{|a|}\big(\sum_{i=1}^{|b|} b_i 2^{i-1} \bmod p\big)\Big) \bmod p = (\mathrm{hash}(a) + 2^{|a|}\,\mathrm{hash}(b)) \bmod p$$

✷

Observation 2.1.1 Fingerprints can be computed in the unordered, unaggregated streaming model when values of the vector arrive as a stream in arbitrary, interleaved order. That is, hash(a) can be computed when a is presented in the form of a stream of tuples (i, c), meaning that ai = c. The space required is the same as for the sketch model, O(log n).

This follows by observing that when we encounter the tuple (i, c) in the stream, we can modify our hash value (initially zero) by adding on c · 2^{i−1}, and computing the result modulo p. After processing the whole stream, we have computed $\sum_i a_i 2^{i-1} \bmod p$, since modular addition is commutative. A less obvious advantage of using fingerprints is that they are integers which can be represented in O(log n) bits. In the commonly used RAM model of computation, it is usual to assume that quantities like this can be manipulated in time O(1). They can also be used for building hash tables, allowing convenient access to items without requiring sorting or complex data structure management.
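A minimal illustration of such fingerprints follows, supporting both the streaming update of Observation 2.1.1 and the concatenation rule of Lemma 2.1.2. A fixed Mersenne prime stands in for the randomly chosen prime; this is an assumption made for brevity only.

```python
P = 2**61 - 1  # stand-in for a prime chosen at random from the stated range

def fingerprint(bits):
    """hash(a) = sum_i a_i * 2^(i-1) mod p for a binary vector a."""
    h = 0
    for i, bit in enumerate(bits):           # index i = 0 corresponds to a_1
        h = (h + bit * pow(2, i, P)) % P
    return h

def streaming_update(h, i, c):
    """Process a stream tuple (i, c) meaning a_i = c (1-based i): add c*2^(i-1)."""
    return (h + c * pow(2, i - 1, P)) % P

def concat(h_a, len_a, h_b):
    """Lemma 2.1.2: hash(a||b) = (hash(a) + 2^|a| * hash(b)) mod p."""
    return (h_a + pow(2, len_a, P) * h_b) % P

a, b = [1, 0, 1], [1, 1]
assert fingerprint(a + b) == concat(fingerprint(a), len(a), fingerprint(b))
```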

2.2 Vector Distances

We shall now consider various different approaches to approximating vector Lp distances, and see how they fit into these categories of sketching and streaming. We shall mainly be interested in vectors where the entries are (positive) integers bounded by a constant, M. We shall assume this to be the case, although some of these methods extend to when the entries are rationals. An important property that sketches of vectors may possess is that of composability.

Definition 2.2.1 A sketch function is said to be composable if, for any pair of sketches sk(a,r),sk(b,r) we have that sk(a + b,r)=sk(a,r)+sk(b,r).

2.2.1 Johnson-Lindenstrauss lemma

The Johnson-Lindenstrauss lemma [JL84] gives a way to embed a vector in Euclidean space into a much smaller space, with only small loss in accuracy. There have been a variety of presentations of this lemma, including [DG99] and [IM98]. We shall adopt the simplified version used in [IKM00] and elsewhere.

Lemma 2.2.1 Let a, b be vectors of length n. Let v be a set of k different random vectors of length n. Each component v_{i,j} is picked independently from the Gaussian distribution N(0, 1), then each vector v_i is normalised under the L2 norm so that the magnitude of v_i is 1. Define the sketch of a to be a vector sk(a, r) of length k so that $sk(a,r)_i = \sum_{j=1}^{n} v_{i,j}\, a_j = v_i \cdot a$. Given parameters δ and ε, we have with probability 1 − δ

$$(1 - \epsilon)\,\frac{\|a-b\|_2^2}{n} \;\le\; \frac{\|sk(a,r) - sk(b,r)\|_2^2}{k} \;\le\; (1 + \epsilon)\,\frac{\|a-b\|_2^2}{n}$$

where k is O(1/ε² log 1/δ).

This lemma states that we can make a sketch of much smaller (O(1/ε² log 1/δ)) dimension by forming the convolution of each vector with a set of randomly created vectors drawn from the Normal distribution. In its general statement, the Johnson-Lindenstrauss lemma is concerned with embedding m different vectors into a reduced space. For this, we would require that δ is smaller, so that with constant probability, all m²/2 pairwise comparisons of vectors fall within the same bounds. Thus, we arrange that the overall success probability is $(1-\delta)^{m^2} = 1 - m^2\delta + O(\delta^2)$. So we must ensure that δ = O(m^{−2}): hence for m vectors, the length of the sketch vector is O(1/ε² log m). Some care is needed here: mathematically we may assume that each entry of the sketch vector is representable with infinite precision, but in a computation setting we are concerned with the number of bits we require to achieve a sufficiently accurate representation. It was shown that O(log n) bits are sufficient by Indyk [Ind00]. We shall adopt the RAM model of computation, which states that we can manipulate quantities of this size in constant time, hence this factor will not figure in time bounds, although it will be accounted for in cases when every bit is counted (such as in communication and storage bounds). The sketching procedure is not immediately a streaming algorithm, however recent work describes how to extend this approach to a streaming environment, and for L1 as well as L2 distances — see Section 2.2.4. The sketches formed by this method also have a nice property, which is that they are composable, as described in Definition 2.2.1. By the linearity of the sketch construction function, we see in this case that composability follows easily, since sk(a + b, r) = sk(a, r) + sk(b, r), and also sk(a − b, r) = sk(a, r) − sk(b, r). This property will be made use of in later chapters.
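A minimal illustration of the construction in Lemma 2.2.1 follows, using dense Gaussian random vectors normalised to unit L2 norm; the particular values of k and the seeds below are arbitrary.

```python
import numpy as np

def jl_sketch(a, k, seed):
    """Sketch a length-n vector a with k random unit vectors whose entries are
    drawn from N(0, 1), as in Lemma 2.2.1.  The same seed (i.e. the same random
    vectors) must be used for every vector that is to be compared."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((k, len(a)))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # normalise each v_i
    return v @ a

n, k, seed = 1000, 400, 7
a = np.random.default_rng(1).integers(0, 10, n)
b = np.random.default_rng(2).integers(0, 10, n)
est = np.sum((jl_sketch(a, k, seed) - jl_sketch(b, k, seed)) ** 2) / k
true = np.sum((a - b) ** 2) / n
# est is close to true with high probability; the sketch is linear in a, so it
# is composable in the sense of Definition 2.2.1.
```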

2.2.2 Frequency Moments

Alon, Matias and Szegedy [AMS99] (originally published as [AMS96]) gave some of the first algorithms to work in the streaming model. In this case, there is an unordered unaggregated stream of n integers in the range 1 . . . M, so z = (s_1, s_2, . . . , s_n) for integers s_j. From this stream define m_i = |{j | s_j = i}|, the number of occurrences of the integer i in the stream. The paper focuses on computing the frequency moments of the stream, that is, F_k, where

$$F_k = \sum_{i=1}^{M} (m_i)^k$$

F_0 is then the number of distinct elements in the sequence, F_1 is trivially the length of the sequence, n, and F_2 is referred to as the repeat rate of the sequence. It turns out to be related closely to the L2 norm, as we shall see shortly. An algorithm to compute F_2 is given: a vector v of length M, v ∈ {−1, +1}^M, is created with entries chosen at random. Each entry in the vector is either +1 or −1. The distribution should be 4-wise independent, that is, for any four entries chosen at random, all possibilities from {−1, +1}^4 should be equally likely. A variable Z is initialised to zero. The stream s_1 . . . s_n is processed item by item: when s_j = i is seen, v_i is added to Z. So after the whole stream has been seen, $Z = \sum_{i=1}^{M} v_i m_i$. The estimate of F_2 is given as Z²:

$$Z^2 = \sum_{i=1}^{M} v_i^2 m_i^2 + \sum_{i=1}^{M}\sum_{j \ne i} v_i m_i v_j m_j = \sum_{i=1}^{M} m_i^2 + \sum_{i=1}^{M}\sum_{j \ne i} v_i v_j m_i m_j$$

If the entries of vector v are pairwise independent, then the expectation of the "cross-terms" v_i v_j is zero, and the expected result is $\sum_{i=1}^{M} m_i^2 = F_2$. By repeating this procedure O(1/ε²) times with a different random v each time, and taking the average, the result can be guaranteed to be a 1 ± ε approximation with constant probability if the random variables have four-wise independence. By finding the median of O(log 1/δ) such averages, this constant probability can be amplified to 1 − δ.

Example. Suppose the stream to be processed is (1, 2, 1, 4). So m_1 = 2, m_2 = 1, m_3 = 0, m_4 = 1 and F_2 = 2² + 1² + 0² + 1² = 6. Then Z = v_1 + v_2 + v_1 + v_4 = 2v_1 + v_2 + v_4. We find Z² = 4v_1² + v_2² + v_4² + 4v_1v_2 + 4v_1v_4 + 2v_2v_4 = 6 + cross-terms. The expectation of these cross-terms is zero: if this procedure is repeated several times, then the average of Z² will tend to 6, which is F_2.

The important detail in [AMS99] is that we do not wish to explicitly create the vectors v, since these would consume O(M) space. Instead, individual entries v_i are constructed when they are needed by using a construction based on BCH codes and a randomly chosen irreducible polynomial of degree log M. Thus each v_i can be made efficiently when it is needed, and based on this construction, has the 4-wise independence necessary to give the accuracy guarantees. In [FKSV99] it was observed that this procedure to find F_2 can be adapted to find the L2 distance between two interleaved, unaggregated streams, a and b. We assume that the stream arrives as triples: s_j = (a_i, i, +1) if the element is from a, and (b_i, i, −1) if the item is from stream b. The goal is to find the square of the L2 distance between a and b, $\sum_i (a_i - b_i)^2$. Assuming the same set-up as before, we initialise Z to 0. When a triple (a_i, i, +1) arrives, we add a_i v_i to Z. When a triple (b_i, i, −1) arrives, we subtract b_i v_i from Z. After the whole stream has been processed, $Z = \sum_i (a_i v_i - b_i v_i) = \sum_i (a_i - b_i) v_i$. So Z² is an estimator for ||a − b||²_2 for the same reasons: we find that Z² = $\sum_i (a_i - b_i)^2$ + cross-terms. Again, the expectation of these cross-terms is zero, so the expectation of Z² is the L2 difference of a and b. The procedure for L2 also has a useful property that it can cope with an unaggregated stream containing multiple triples of the form (a_i, i, +1) for the same i. If k such triples occur anywhere in the stream: (a_{i,1}, i, +1), (a_{i,2}, i, +1), . . . , (a_{i,k}, i, +1), then because of linearity of addition, this is treated identically as if it were a single triple $(\sum_{j=1}^{k} a_{i,j}, i, +1)$. This property will be useful for some of the algorithms developed in later chapters. This streaming algorithm also translates to the sketch model: given a vector a, the values of Z can be computed. The sketch of a is then these values of Z formed into a vector, z(a). So $z(a)_i = \sum_j a_j v_{i,j}$. This sketch vector has O(1/ε² log 1/δ) entries. Each entry of the vector requires O(log Mn) bits to represent it, since the maximum size of z is bounded by Mn. Two such sketches can be combined to find the sketch of the difference of the corresponding vectors since the sketches are composable in the sense of Definition 2.2.1: z(a − b) = z(a) − z(b). Note that the random seed bits need to be shared, but since there are only O(1/ε² log 1/δ log n) random bits, this can easily be arranged. The space used by the above streaming algorithm, and hence also the size of the sketch, is a vector of length O(1/ε² log 1/δ). It requires O(log n + log M) bits to hold each entry.
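The following is a minimal illustration of this "tug-of-war" estimator in the unordered, unaggregated model, with fully random ±1 vectors standing in for the 4-wise independent construction (an assumption made purely to keep the example short; the space-efficient version generates each v_i on demand).

```python
import numpy as np

def tug_of_war_sketch(stream, M, k, seed):
    """Process an unordered, unaggregated stream of triples (value, i, sign) and
    return k counters Z; the mean of Z^2 estimates the squared L2 norm of the
    implied vector difference.  Fully random +/-1 entries are used here only for
    illustration."""
    rng = np.random.default_rng(seed)
    v = rng.choice([-1, 1], size=(k, M))       # v[t, i] is the t-th random vector
    z = np.zeros(k)
    for value, i, sign in stream:
        z += sign * value * v[:, i]
    return z

# Two interleaved streams describing a = [3, 0, 1] and b = [1, 0, 2]:
stream = [(3, 0, +1), (1, 0, -1), (1, 2, +1), (2, 2, -1)]
z = tug_of_war_sketch(stream, M=3, k=2000, seed=0)
print(np.mean(z ** 2))   # close to ||a - b||_2^2 = 4 + 0 + 1 = 5
```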

Conversion to normed space

A subsequent paper [Ach01] takes essentially the same method and proves some stronger results. Specifically, the paper states that

Theorem 2.2.1 (Theorem 2 in [Ach01])

$$(1 - \epsilon)\,\frac{\|a-b\|_2^2}{n} \;\le\; \frac{\|z(a) - z(b)\|_2^2}{k} \;\le\; (1 + \epsilon)\,\frac{\|a-b\|_2^2}{n}$$

with probability 1 − δ, where k = |z| = O(1/ε² log 1/δ).

In other words, the method of Alon, Matias and Szegedy [AMS96] can be used as an approximate embedding into L2 space of much smaller dimension. By construction of the sketch function z, if a and b are vectors with integer entries, then so too will be their sketch z. This removes the need for the averaging and median finding step, and allows these sketches to be used in existing algorithms that operate in Euclidean space.

2.2.3 L1 Streaming Algorithm

The first algorithm to answer the problem of finding the L1 distance in the streaming model was given by Feigenbaum, Kannan, Strauss and Viswanathan [FKSV99]. A similar approach is used to that described in Section 2.2.2. Again, it is assumed that items from vectors a, b arrive in an arbitrarily interleaved order. A restriction is that it is assumed that for a given value of i, at most one triple (a_i, i, +1) and at most one triple (b_i, i, −1) occurs in the stream. So we can view this as the unordered but aggregated model. The basic algorithm also makes use of a family of random variables, r_{i,j}, which take values {+1, −1}. A variable Z is initialised to 0. When the item (a_i, i, +1) arrives, add $\sum_{j=0}^{a_i - 1} r_{i,j}$ to Z. When the item (b_i, i, −1) arrives, subtract $\sum_{j=0}^{b_i - 1} r_{i,j}$ from Z. After the whole stream has been considered, the value of Z² is found. The value of Z² is then $\sum_{i=1}^{n} |a_i - b_i|$ plus cross terms of r_{i,j}'s. Assuming that these r_{i,j}'s are independent and uniform, the expectation of the sum of these cross terms is zero. To give this method the required accuracy, this is repeated with different r_{i,j}'s. It is done O(1/ε²) times to get an average value of Z², and then the median of O(log 1/δ) such averages is found to get a result with the required accuracy. The bulk of the technical results in [FKSV99] are to show how to construct families of range-summable pseudo-random variables that have 4-wise independence, using only a limited number of random bits. The 4-wise independence ensures that the expectation of the cross terms is zero, and that the variance is limited. They also need to be range-summable efficiently, so that $\sum_{j=0}^{k} r_{i,j}$ can be computed in time O(poly-log k) rather than O(k). With these results, it then follows that the L1 distance can be computed in this streaming model using space O(1/ε² log 1/δ log M log n) and time O(log n log log n + 1/ε² log 1/δ log M) per item. In the same way as for L2 distance, this streaming algorithm can be used to give a sketch algorithm by using the contents of the memory space as the sketch for the data. That is, the sketch consists of O(1/ε² log 1/δ) different values of Z, plus a poly-logarithmic number of random bits. Again, each Z is an integer requiring log M + log n bits to represent it. A later work of Fong and Strauss [FS00] takes a similar approach but works for any Lp distance where p ≤ 2, by construction of a set of functions which allow the Lp difference to be found. We do not give details of this method since it shares the same limitation as [FKSV99]: it cannot be applied in the unaggregated model. We will instead make use of the methods described in Section 2.2.4 below.

2.2.4 Sketches using Stable Distributions

A limitation of the methods of [FKSV99] and [FS00] is that they insist that the stream is aggregated, so we see only one value for each ai. The approach described in [Ind00, CIKM02] is conceptually a little simpler, and works in the most general unordered, unaggregated model of streaming. It relies extensively on properties of stable distributions. There is not room for a thorough description of stable distributions here: see, for example, [Nol] for more details. We shall make reference to certain key properties of stable distributions, without proof.

Definition 2.2.2 A (strictly) stable distribution is a statistical distribution, X with a parameter α in the range (0, 2]. It has the property that if X1,X2,...Xn are independently and identically distributed as X then a1X1 + a2X2 + ...+ anXn is distributed as ||(a1,a2,...,an)||α X.

Several well-known distributions are known to be stable. The Gaussian distribution is strictly stable with α = 2; the Cauchy distribution is strictly stable with α = 1; and the Lévy distribution is stable with α = 1/2. For all values of α ≤ 2, stable distributions can be simulated by using appropriate transformations from uniform distributions. Further details on stable distributions and computing values taken from stable distributions can be found in [Nol]. We focus firstly on L1. These results were given by Indyk [Ind00].

29 Sketches for L1 Distance

Given a vector a, we define a sketch vector of a, sk¹(a). This vector is of length k = O(1/ε² log 1/δ). Similar to the L2 case, we pick k random vectors v_1 . . . v_k where each v_i = (v_{i,1}, . . . , v_{i,n}) is made by drawing each component independently from a random distribution — in this case, the Cauchy distribution, given by $f(x) = \frac{1}{\pi(1+x^2)}$ for −∞ < x < ∞. The sketch is defined by

$$sk^1(a)_i = a \cdot v_i$$

— that is, as the dot product of a with each of the k random vectors v_i. For any d-length vector a let median(a) denote the median of the sequence a_1, . . . , a_d.

Theorem 2.2.2 (Theorem 1 of [Ind00]) With probability 1 − δ,

$$(1 - \epsilon)\,\|a - b\|_1 \;\le\; \mathrm{median}(|sk^1(a) - sk^1(b)|) \;\le\; (1 + \epsilon)\,\|a - b\|_1$$

Proof. The proof of this hinges on the fact that each element of v_i is a real value, drawn from a stable distribution, and so (v_i · a) − (v_i · b) = v_i · (a − b) has the same distribution as ||a − b||_1 X, where X has the Cauchy distribution. Let L = |v_i · (a − b)|. It was shown in [Ind00] that the median of L is equal to ||a − b||_1. Thus, the repetition of the sampling ensures that the probability bound is met. By the linearity of the sketching function (it is just a dot product) it is computable in the unaggregated model. ✷
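A minimal illustration of the Cauchy sketch and the median estimator of Theorem 2.2.2 follows; the sketch length and seeds below are arbitrary choices for the example.

```python
import numpy as np

def cauchy_sketch(a, k, seed):
    """sk1(a)_i = a . v_i, where each entry of v_i is drawn from the Cauchy
    distribution; the same seed must be shared by all vectors compared."""
    rng = np.random.default_rng(seed)
    v = rng.standard_cauchy((k, len(a)))
    return v @ a

a = np.array([5, 0, 2, 7])
b = np.array([1, 1, 2, 9])
k, seed = 1000, 42
est = np.median(np.abs(cauchy_sketch(a, k, seed) - cauchy_sketch(b, k, seed)))
print(est)   # close to ||a - b||_1 = 4 + 1 + 0 + 2 = 7
```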

Sketches for Lp distances

We next turn to non-integral values of p for Lp distances. If we replace the variables from the Cauchy distribution with a strictly stable variable for which α = p, then a similar result follows for the Lp distance between vectors. We define sk^p(a) by sk^p(a)_i = a · v_i, where here each v_i is made by drawing each entry independently from a stable distribution with parameter p. The sketch is still of length O(1/ε² log 1/δ).

Theorem 2.2.3 (Theorem 2 from [CIKM02]) Given a sketch sk^p constructed as described above of length O(1/ε² log 1/δ), then for any p ∈ (0, 2] there exists a scaling factor B(p), such that for all vectors a, b, with probability 1 − δ we have

$$(1 - \epsilon)\,\|a - b\|_p \;\le\; B(p) \cdot \mathrm{median}(|sk^p(a) - sk^p(b)|) \;\le\; (1 + \epsilon)\,\|a - b\|_p$$

Proof. By the properties of stable distributions, if X_0 . . . X_n are distributed identically and independently as stable distributions with parameter p, then a_1X_1 + . . . + a_nX_n is distributed as ||a||_p X_0. Then median(|a_1X_1 + . . . + a_nX_n|) is distributed as median(| ||a||_p X_0 |) = ||a||_p median(|X_0|). Hence each |sk^p(a) − sk^p(b)| = |sk^p(a − b)|, which follows from the linearity of the dot product function. This has median ||a − b||_p median(|X_0|), so this sets the scale factor B(p) as median(|X_0|), the median of absolute values from a stable distribution with parameter p. To get this to a (1 ± ε) approximation with constant probability we require the median of 1/ε² independent repetitions, and to improve this to probability at least 1 − δ we repeat this a further log 1/δ times and take the median of all values. ✷

Observation 2.2.1 sk^p(a + b) = sk^p(a) + sk^p(b), sk^p(a − b) = sk^p(a) − sk^p(b) and sk^p(c·a) = c · sk^p(a).

These sketches are constructed in the same way as those in Section 2.2.1, as the dot product of the data with vectors of random variables, so it follows that these sketches are composable. This is a straightforward implication of the fact that sketches are constructed by a linear function.

Reference | Metric | Vector Size | Streaming | Sketching | Composable
[JL84] | L2 | O(1/ε² log 1/δ) | No | Yes | Yes
[AMS99] | L2 | O(1/ε² log 1/δ) | Yes | Yes | Yes
[FKSV99] | L1 | O(1/ε² log 1/δ) | Yes | Yes | No
[CIKM02] | Lp, p ≤ 2 | O(1/ε² log 1/δ) | Yes | Yes | Yes

In each case, the Vector Size gives the number of entries in a sketch vector. Each entry needs O(log n + log M) bits to represent it if M is an upper bound on the value of each coordinate . ‘Streaming’ indicates whether this result applies in the unordered streaming model; ‘sketching’ indicates whether it works in the sketch model, and ‘composable’ indicates whether the sketches are composable. These results apply for comparing a constant number of vectors.

Figure 2.1: Key features of the different methods described in Section 2.2.

The observation of [Ind00] is that this sketching approach can be placed into the unordered streaming model if the random stable variables — the entries of the vectors v_i — can be made on the fly, so that whenever v_{i,j} is needed, the same value will be found. This can be done using standard pseudo-random generators as a function of i and j, using a limited number of truly random bits, as described in [Nis92]. Hence the necessary random variables can be created when needed, and so the sketch can be made in the streaming model. Since B(1) = B(2) = 1, this means that these methods can be applied to approximating L1 distance and L2 distance exactly. For other values of p in the range 0 < p ≤ 2, the appropriate scaling factor B(p) must also be applied.

2.2.5 Summary of Vector Lp Distance Algorithms

The key features of these four approaches are listed in Figure 2.1. Between them, we have efficient sketching and streaming algorithms for both L1 and L2 — in fact, in Section 2.2.4 we have shown that there are efficient sketching and streaming algorithms for all Lp distances, where 0 < p ≤ 2.

2.3 Set Spaces and Vector Distances

Having established these results for vector Lp distances, we now go on to describe a variety of distances based on sets, and show how these can be related to vector problems.

2.3.1 Symmetric Difference and Hamming Space

One of the most commonly used spaces in this thesis will be the (Vector) Hamming space, so we shall devote some attention to showing results on this space. Firstly, we shall describe several different scenarios which yield fundamentally the same distance measure. Let A and B be sets, drawn from a universe U = {x_1, x_2, ..., x_n}, and consider the size of their symmetric difference |A∆B|, as described in Definition 1.2.3. Let a and b be binary strings of length n. As stated in Definition 1.4.2, the Hamming distance between these strings h(a, b) is the number of places where they differ. It is straightforward to show how these are related: let χ be the characteristic function for the sets A and B, so χ maps a set onto a bit-string of length n. The i'th bit of χ(A) is set to 1 if x_i is a member of A, and 0 otherwise. Clearly, h(χ(A), χ(B)) = |A∆B|, since each x_i is included in A∆B if and only if it occurs in exactly one of A and B. This contributes to h(χ(A), χ(B)) if and only if exactly one of χ(A)_i, χ(B)_i is 1.

Definition 2.3.1 We define F(a, b, ⊙) to be the function on binary strings a and b and binary functions ⊙ : {0, 1} × {0, 1} → {0, 1} such that:

F(a, b, ⊙) = Σ_{i=1}^{n} a_i ⊙ b_i

We observe that if ⊙ = xor, the exclusive-or function,1 then this function is also equivalent to the Hamming distance, h(a, b) = ||a − b||_H = F(a, b, xor) = F(a, b, ≠).
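As a small illustration of this equivalence (added here, not part of the original text), the following Python fragment checks that the symmetric difference of two small sets equals F applied with xor to their characteristic bit-strings; the universe and sets are arbitrary.

```python
# Toy check: |A delta B| = h(chi(A), chi(B)) = F(chi(A), chi(B), xor).
universe = ["x1", "x2", "x3", "x4", "x5"]
A = {"x1", "x3", "x4"}
B = {"x2", "x3", "x5"}

chi_A = [1 if x in A else 0 for x in universe]   # characteristic bit-string of A
chi_B = [1 if x in B else 0 for x in universe]

hamming = sum(ai ^ bi for ai, bi in zip(chi_A, chi_B))   # F(chi(A), chi(B), xor)
sym_diff = len(A ^ B)                                    # |A delta B|
assert hamming == sym_diff == 4
```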

Approximating Hamming distance

The next observation allows us to relate the above discussion of vector distances to Hamming distance.

Observation 2.3.1 If a, b are bit-strings, then ||a − b||_H = ||a − b||_p^p for any p > 0.

Proof. Since ||a − b||_p^p = Σ_i |a_i − b_i|^p, we just need to consider each of the two possibilities for each (a_i, b_i) pair. If a_i = b_i then |a_i − b_i|^p = 0. If a_i ≠ b_i then |a_i − b_i|^p = 1^p = 1. Hence we add one to the sum for each pair of bits that differ, and only for these bits, and so this gives the Hamming distance between the bit-strings. ✷

This observation means that any of the methods described earlier that work for approximating L1 or L2 distance can be used to answer equivalent problems on the Hamming distance, given an appropriate representation of the Hamming distance instance. This means that there are efficient ways of approximating the Hamming distance in the sketch and streaming models, by using the algorithms described in Section 2.2. We now state this formally:

Theorem 2.3.1 A sketch for the Symmetric Difference (Hamming distance) between sets can be computed in the unordered, aggregated streaming model. Pairs of sketches can be used to make 1 ± ε approximations of the Hamming distance between their sequences, which succeed with probability 1 − δ. The sketch is a vector of dimension O(1/ε² log 1/δ) and each entry is an integer in the range [−n ... n].

1The exclusive-or function is defined as xor = {((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)}

Proof. This proof draws on the results already discussed from [AMS99] and [FKSV99]. We assume each member of the sets is an integer i, so that 1 ≤ i ≤ n. We begin with the sketch vector equal to zero, z = 0. In the unordered, aggregated streaming model, members of the set A arrive in an arbitrary order. When a member of the set, x, arrives, we can compute v_{i,x} ∈ {−1, +1} using a pseudo-random function of x as described in Section 2.2.3, by interpreting x as an item (1, x, +1). We then compute z_i ← z_i + v_{i,x} for all i. The results of [FKSV99] as discussed in Section 2.2.3 show that the difference of two such sketches is a sketch of the difference. This allows the (ε, δ)-approximation of the L1 distance between the vectors, which is exactly the Hamming distance in this case, using Observation 2.3.1. Any entry of the sketch vector lies in the range [−n, n], and so each entry requires O(log n) bits to represent it. The number of random bits required is poly-logarithmic in n, as before. ✷
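The following Python fragment (an illustration added here, not the construction used later in the thesis) mimics this streaming proof on small sets: each set element contributes a pseudo-random ±1 value to every sketch entry, the sketches of two sets are subtracted, and the symmetric difference is estimated from the squared entries, using the fact from Observation 2.3.1 that for 0/1 difference vectors the Hamming distance coincides with the squared L2 distance. The use of a cryptographic hash for the ±1 values and the particular sketch dimensions are simplifying assumptions.

```python
import hashlib
from statistics import mean, median

def sign(i, x):
    """Pseudo-random +/-1 value v[i, x]; repeated items get the same value."""
    h = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return 1 if h[0] & 1 else -1

def sketch(members, rows, cols):
    """Process set members one by one (unordered, aggregated stream)."""
    z = [[0] * cols for _ in range(rows)]
    for x in members:
        for r in range(rows):
            for c in range(cols):
                z[r][c] += sign(r * cols + c, x)
    return z

def estimate_sym_diff(zA, zB):
    """Median of row means of squared entries of the sketch difference."""
    return median(mean((a - b) ** 2 for a, b in zip(rowA, rowB))
                  for rowA, rowB in zip(zA, zB))

A = set(range(0, 60))
B = set(range(30, 90))              # |A delta B| = 60
zA = sketch(A, rows=9, cols=50)
zB = sketch(B, rows=9, cols=50)
print("estimate:", estimate_sym_diff(zA, zB), "exact:", len(A ^ B))
```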

Conversely, results from communication complexity show that estimating the Hamming distance is not possible in an efficient way. Pang and Gamal [PG86] showed that if two parties communicate to estimate the Hamming distance between bit-strings (where one person has one string and another has the other), then there must be Ω(n) bits of communication between them. This is enough to show that there cannot be any sketch algorithm to allow estimation of the Hamming distance using a sketch of size o(n) bits — otherwise this would imply the existence of a communication scheme to estimate Hamming distance (one party sends the sketch of their bit-string to the other). Similarly, there cannot be a streaming algorithm to estimate Hamming distance that uses o(n) space, since otherwise there would be a sketch algorithm (the sketch is the contents of the streaming data structure after one bit-string has streamed past). We go on to show that there are specific situations where approximation of Hamming distance is not possible.

Vector Hamming distance on unaggregated streams We will show that for different notions of Hamming distance it is not always possible to compute an approximation to the Hamming distance in sublinear space.

Lemma 2.3.1 Let a, b be vectors of length n represented as streams of tuples (a, i, +1) or (b, i, +1) in arbitrary interleaved order. The same tuple is allowed to appear multiple times in each stream. Any sketch to approximate the Zero-based Hamming distance (Definition 1.2.9) by Ĥ_0(a, b) such that

(1 − ε) Ĥ_0(a, b) ≤ H_0(a, b) ≤ (1 + ε) Ĥ_0(a, b)

with constant probability, requires Ω(n) space.

Proof. We reduce from a problem in communication complexity which has known high cost. The problem known as "Index" is defined as follows: one person, A, holds an integer 1 ≤ i ≤ n while the other, B, holds a vector x ∈ {0, 1}^n; the Index problem is for the first person, A, to compute the i'th bit of x. In two rounds of communication, this can be computed exactly, using log n bits. However, in the one round randomized communication complexity model, Ω(n) bits are needed [KN97]. That is, any one-round communication protocol to compute Index requires a linear number of bits to be sent. We now show how any method in the streaming model to find the zero-based Hamming distance can be transformed into a one-round communication protocol for Index. Let B create a representation of its vector x as the stream a, by sending the pairs (j, +1) for all positions j where x_j = 1, so that the memory contents of the streaming algorithm are set accordingly. Call this memory state M: we are concerned with the size of M. Note that we consider M to include any random bits used in the creation of the representation of a. B then sends this memory state to A, who can simulate the arrival of tuples (a, j, +1) and (b, j, +1) for all 1 ≤ j ≤ n except for j = i, the index of interest. If the streaming algorithm works as claimed then with probability 1 − δ we have Ĥ_0(a, b) = 0 ⟺ x_i = 0 and Ĥ_0(a, b) ≠ 0 ⟺ x_i = 1. So with constant probability we can solve the Index problem, and so the information that was communicated, the memory contents M, must be of size at least Ω(n). ✷

What this lemma states is that it is not possible to have a sublinear sized sketch in the streaming model to compute the symmetric difference of sets if information is repeated (if the same index can occur more than once); on the other hand, we have already seen algorithms which use a sublinear amount of storage if no repeated information occurs. This will have implications for some of the algorithms we develop in later chapters.

Simple Embeddings into Hamming Distance

We show how methods that solve problems based on (Vector) Hamming distance can also solve problems for Hamming distance on arbitrary strings or sparse vectors, and for L1 distance on vectors of bounded integers.

Hamming distance on non-binary alphabets A simple reduction suffices to embed the Hamming distance on constant sized non-binary alphabets into the Hamming distance on a binary alphabet. Given a string x drawn from an alphabet σ, we construct an arbitrary bijection ord : σ ↔ {1, ..., |σ|}. We then encode each character x_i as a bit-string b(x_i) of length |σ| where the j'th bit of the bit-string is set to one if and only if ord(x_i) = j. The binary encoding of x is the concatenation of these bit-strings in order from 1 to |x|. Because ||b(x_i) − b(y_i)||_H = 0 if x_i = y_i, and is 2 if x_i ≠ y_i, the Hamming distance of two such bit-strings is exactly twice the Hamming distance of the original strings. This approach allows solutions in the sketch and streaming models for problems utilising Hamming distance between strings from constant sized alphabets. When computing a sketch using the streaming methods outlined in Theorem 2.3.1, there is almost no overhead from this approach: we only have to encode non-zero bits (since the zero bits contribute nothing to the sketches), and so we just have to work out the contribution from the bit generated by each character.
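A toy check of this reduction (added for illustration; the alphabet and strings are arbitrary) confirms that the encoded Hamming distance is exactly twice the original one.

```python
# One-hot encoding of a constant-size alphabet into binary Hamming space.
sigma = "acgt"
ord_map = {ch: i for i, ch in enumerate(sigma)}      # an arbitrary bijection ord

def encode(s):
    """Concatenate a |sigma|-bit block per character, with one bit set per block."""
    bits = []
    for ch in s:
        block = [0] * len(sigma)
        block[ord_map[ch]] = 1
        bits.extend(block)
    return bits

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

x, y = "acgta", "aggtc"
assert hamming(encode(x), encode(y)) == 2 * hamming(list(x), list(y))
```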

Embedding L1 distance into Hamming distance There are various technical solutions to embedding normed spaces into other normed spaces with certain probability guarantees. For general settings quite complicated mathematical solutions are needed, see [IM98, KOR98, Ind00]. For the kinds of L1 distance we are interested in, we have an easier problem to solve, since our vectors a have the property that each a_i is an integer with 0 ≤ a_i ≤ M for some known value M. This leads to a simple embedding of L1 into Hamming distance: for each a_i create M bits b(a_i) such that b(a_i)_j = 1 if j ≤ a_i, and 0 otherwise. This method, which has been described independently elsewhere, ensures that the Hamming distance of pairs of induced bit-strings is identical to the original L1 distance. There is a blow-up in the dimensionality by a factor of M; however, in many cases this will not be important, since either M will be a small constant, or else this will be a preliminary step to an algorithm that does not depend on the dimensionality, so the cost will be independent of M. In fact, we will never make this embedding explicitly: the value of any bit can easily be derived when needed. By this encoding, every group of M bits can only take on M + 1 possible values, rather than 2^M.
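Similarly, a toy check of the unary embedding (again an added illustration, with M and the vectors chosen arbitrarily):

```python
# Unary embedding of bounded non-negative integers: L1 distance becomes Hamming.
M = 10

def unary(v):
    """For each coordinate a_i, the first a_i of its M positions are set to 1."""
    return [1 if j < ai else 0 for ai in v for j in range(M)]

a = [3, 0, 7, 2]
b = [5, 1, 7, 9]
l1 = sum(abs(x - y) for x, y in zip(a, b))
ham = sum(x != y for x, y in zip(unary(a), unary(b)))
assert ham == l1 == 10
```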

Lower bound on Hamming distance

Lemma 2.3.2 Let a and b be vectors with non-negative integer valued entries. Then H_0(a, b) ≤ ||a − b||_H ≤ ||a − b||_1

Proof. H_0(a, b) ≤ ||a − b||_H = Σ_i [a_i ≠ b_i] = Σ_i [(a_i − b_i) ≠ 0] ≤ Σ_i |a_i − b_i| = ||a − b||_1. ✷

This lemma generalises Observation 2.3.1 for the L1 distance.

Algorithm 2.3.1 The Flajolet-Martin probabilistic counting algorithm
  initialise bits[1, 1] ... bits[m, log n] ← 0
  for all items i do
    for j = 1 to m do
      bits[j, zeros(hash_j(i))] ← 1
  for j = 1 to m do
    for k = log n downto 1 do
      if bits[j, k] = 0 then minzero ← k
    total ← total + minzero
  return 1.2928 × 2^(total/m)

Approximating Hamming Distance for large, sparse vectors Later, we shall see examples where we construct vectors of size O(n²) and O(2^n), which have only O(n) non-zero entries. Directly applying the Johnson-Lindenstrauss lemma would suggest that we should construct an exponentially large vector of Gaussian variables, and convolve the two vectors. However, since the majority of the entries in the data vector are zero, these contribute nothing to the sum, and so instead we only need to focus on the non-zero entries, and their contribution to the convolution. We consider a set representation of the vector: only the locations of the non-zero entries are listed. The size of this representation is polynomial in the size of the set. The sketch for the set can then be formed by adding appropriate 'random' variables, either by creating them on the fly and storing the values to be used again, or else by creating them based on pseudo-random functions of the value. This second option equates to the streaming approach described in Theorem 2.3.1 above. The length of these sketches is independent of the size of the vectors being represented, depending only on log 1/δ and 1/ε². The number of bits required to represent each entry depends on the maximum value of any z_i, which is n, hence still only O(log n) bits are needed for each entry. Since the streaming algorithms assume any values not seen in the stream are zero, if we pass only the locations of non-zero entries then only an amount of work proportional to this number of entries needs to be done.

2.3.2 Set Union and Distinct Elements

As stated in Chapter 1, the union of two sets A and B, A ∪ B is the set {x|x ∈ A ∨ x ∈ B} (Definition 1.2.2). Approximating the size of the union of two sets can be achieved in the sketching and streaming models, by using results shown in [FM85] and developed in [AMS96]. It can also be achieved by using techniques based on the vector norm approximations that we have already discussed. In the unaggregated streaming model, we might also want to count the number of distinct elements in the stream. This turns out to be almost identical to computing the size of the union of two sets in the same model.

Probabilistic Counting

This method to compute the number of distinct items seen in a stream of elements was first elaborated in [FM85]. Let n be the size of the universe from which the elements are being drawn. Without loss of generality, the authors assume that these are represented as integers in the range 1 to n. The aim is to compute how many distinct values are seen, that is, how many integers are represented in the stream. This is precisely the quantity F0 of the stream as described in Section 2.2.2.

Theorem 2.3.2 (Due to Flajolet and Martin, Theorem 2 of [FM83]) Using space O(1/ε² log n log 1/δ), it is possible to compute an approximation of the number of distinct values in an unordered, unaggregated stream.

Proof. The procedure at the core of the estimation is as follows: for each element x which arrives in the stream for set A, compute a randomly chosen hash function of x, hash(x), mapping onto [1 ... n]. In their analysis, Alon et al [AMS96] pick Linear Congruential Generators for this hash function, that is, functions of the form hash(x) = ((ax + b) mod p) mod n, with the parameters a and b picked uniformly at random from the range [1 ... p], where p is a prime chosen in the range n ≤ p < 2n. From this, compute the function zeros(hash(x)), which is the number of consecutive bits, counting from the rightmost (least significant), which are zero. This has the property that, over all x, Pr[zeros(hash(x)) = i] = 2^{−(i+1)}. We keep a bit vector bits, and every time x is seen, we set bits[zeros(hash(x))] ← 1. Following the processing of the sequence of values, we find minzero as the smallest entry in the vector bits that is zero. If this procedure is run on a stream of values, then O(2^minzero) is a good approximation for the number of distinct values in the stream, using only O(log n) space to store bits, plus O(log n) working space. This procedure can then be made into an (ε, δ)-approximation by appropriate repetition and averaging for m = O(1/ε² log 1/δ) different hash functions (see [FM85, AMS96, BYJK+02] for details), yielding a scheme with space requirements O(1/ε² log n log 1/δ). Our output is two raised to the power of the average value of minzero, scaled by an appropriate scaling constant. This scaling factor is given in [FM85] as 1.2928. The algorithm implementing this is shown in Algorithm 2.3.1. ✷

An important property of this scheme is that if the same element occurs multiple times in the stream, it does not affect the result. This is because if x is seen twice, hash(x) remains the same, and so seeing x again will not change minzero. So this procedure can be used to approximately count the number of distinct elements in an unordered unaggregated stream. This is especially important in database applications, where it is useful to maintain approximate information about database relations to allow query planning and optimisation, and even approximate query answering. We can cast this as a sketch algorithm for set union size, by using the bit vector as the sketch for the set that has been processed. It is straightforward to combine two such sketches to find the size of the union of two streams: given the bit vectors bits_A and bits_B for streams A and B, we find minzero for (bits_A ∨ bits_B). This can extend to multiple sets in the obvious way. Hence, this approach can be used to compute the size of a set in the streaming model, with repeated information; and the size of the union of two (or more) sets can be approximated in the streaming and sketch models. Recent work has taken essentially the same approach, and using the same idea at the core (of a hash function where the probability of returning i is related to 2^{−i}) keeps a small sample of distinct elements. This is reported in [GT01], with experimental work in [Gib01].
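A compact Python rendering of this procedure may be helpful; it is an added illustration rather than the thesis's implementation, it replaces the linear congruential generators with seeded SHA-256 hashing, and the parameter choices are arbitrary. It also shows the union of two sketches being taken by OR-ing the bit arrays, as described above.

```python
import hashlib

def zeros(value):
    """Number of trailing zero bits of a positive integer."""
    return (value & -value).bit_length() - 1 if value else 32

def hash_val(j, x):
    return int.from_bytes(hashlib.sha256(f"{j}:{x}".encode()).digest()[:4], "big") or 1

def fm_sketch(stream, m=64, logn=32):
    bits = [[0] * logn for _ in range(m)]
    for x in stream:
        for j in range(m):
            bits[j][min(zeros(hash_val(j, x)), logn - 1)] = 1
    return bits

def fm_estimate(bits):
    total = sum(row.index(0) if 0 in row else len(row) for row in bits)
    return 1.2928 * 2 ** (total / len(bits))

def fm_union(bits_a, bits_b):
    """Sketch of the union: bitwise OR of the two bit arrays."""
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(bits_a, bits_b)]

A = [i % 500 for i in range(5000)]        # repeats do not affect the estimate
B = list(range(250, 900))
skA, skB = fm_sketch(A), fm_sketch(B)
print("|A| ~", round(fm_estimate(skA)),
      " |A u B| ~", round(fm_estimate(fm_union(skA, skB))))
```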

Union Computation via Hamming Norms

An alternative approach to finding the union of two streams is given in [CDIM02]. We first observe that, for a single stream, the number of distinct items is given by the Hamming norm of a vector, defined as ||a||_H = |{i | a_i ≠ 0}| in Definition 1.2.7. By using a sufficiently small value of p, we can use the Lp norm in order to approximate the Hamming norm in the unaggregated streaming model.

Theorem 2.3.3 The Hamming norm can be approximated by finding the Lp norm of a vector a for sufficiently small p>0 provided we have a limit on the size of each entry in the vector.

Proof. We provide an alternative mathematical definition for the Hamming norm. We want to find |{i | a_i ≠ 0}|. Observe that a_i^0 = 1 if a_i ≠ 0; we can define a_i^0 = 0 for a_i = 0. Thus, the Hamming norm of a vector is given by ||a||_H = Σ_i a_i^0. This is similar to the definition of the Lp norm of a vector, which is defined as (Σ_i |a_i|^p)^{1/p}. Define the L0 norm of a as Σ_i a_i^0. We show that the L0 norm (Hamming norm) of a vector can be well-approximated by (Lp)^p if we take p > 0 small enough.

We consider Σ_i |a_i|^p = (Lp)^p for a small value of p (p > 0). If, for all i, we have that |a_i| ≤ U for some upper bound U, then

Σ_i a_i^0 ≤ Σ_i |a_i|^p ≤ Σ_i U^p a_i^0 = U^p Σ_i a_i^0 ≤ (1 + ε) Σ_i a_i^0

if we set p ≤ log(1 + ε)/log U ≈ ε/log U. ✷

Hence, if we can arrange to be able to find the quantity Σ_i |a_i|^p, then we can approximate the Hamming norm up to a (1 + ε) factor.

Corollary 2.3.1 We can make sublinear sized sketches in the unaggregated streaming model to allow the approximation of the vector Hamming distance.

This corollary follows based on a few observations. Firstly, these sketches can be computed in the unaggregated streaming model since we can use sketches for the Lp norm which are computable in that model. Secondly, since these sketches are generated by a linear function (the dot product), it is easy to combine two sketches of vectors a, b to get a sketch of a − b, whose Hamming norm is the vector Hamming distance of the original vectors. The space required is the same as in Theorem 2.2.3, that is, O(1/ε² log 1/δ). We can also find the union of two streams representing a and b by noting that the quantity we seek is ||a + b||_H. This can be achieved by sketching a and b separately, generating sk(a, r) and sk(b, r), and then creating sk(a + b, r) = sk(a, r) + sk(b, r). This follows, since these sketches are composable, by Observation 2.2.1. Note that this approach to Hamming norms and related quantities is rather more general than probabilistic counting, since it still functions even if entries in a and b are negative. It is unclear how to handle negative entries in probabilistic counting, but here the behaviour is straightforward and well defined: since we are using Lp norms, a negative entry simply counts one towards the total of non-zero entries, just as a positive entry does. This behaviour is useful in many situations where these sketches are computed in a distributed fashion, or when negative values are a reasonable part of the input.

2.3.3 Set Intersection Size

If a, b are binary strings, then their intersection size i(a, b) = F(a, b, ×) gives the dot product of a and b (see Definition 2.3.1). These two statements are essentially identical, since |A ∩ B| = F(χ(A), χ(B), ×). Hamming distance and set intersection size are closely related. Simple consideration of Venn diagrams shows that |A| + |B| − 2|A ∩ B| = |A∆B|. Equivalently, for bit-strings, |a| + |b| − 2i(a, b) = ||a − b||_H, where |a| is the number of one bits, or 'weight', of the bit-string a. However, this equality does not mean that we can approximate the intersection size by first approximating the Hamming distance. Informally, although we can find |a|, |b| and ||a − b||_H within a factor of 1 ± ε each, i(a, b) may be small in comparison to these and hence directly applying the above relation would not give an (ε, δ) approximation to i. There is a more direct embedding of Hamming distance into intersection space. Let ā be the bit-vector formed by taking the complement of each bit in a. As noted before, the concatenation of two vectors a, b is denoted a||b. Then i(a||ā, b̄||b) = i(a, b̄) + i(ā, b) = ||a − b||_H, so in this case an approximation of i would allow approximation of h (since we are adding these quantities). However, there is no similar reduction in the opposite direction, expressing the intersection size in terms of a summation of Hamming distances or other quantities. This is unfortunate, since although there are efficient approximations for Hamming distance, there are no known approximations for the intersection size. In fact, we go on to show that no such approximation can exist.

Hardness of Approximating Intersection Size Certainly, estimating the intersection size is hard. This follows immediately from the fact that estimating Hamming distance is hard. If in the sketch model we could estimate the intersection size with probability δ, then we could also estimate Hamming distance with the same probability, by using the relations above (since |a| can easily be included in a sketch using log n bits). To show that approximating the intersection size is hard, we consider the related problem of disjointness:

disj(a, b) = 1 ⟺ i(a, b) = 0
disj(a, b) = 0 ⟺ i(a, b) ≠ 0

It has been shown that in a communication complexity scenario, estimating the disjointness function requires Ω(n) bits of communication, in the main theorems of [KS92, Raz92].

Theorem 2.3.4 (from [Raz92], [KS92]) Let person A hold a and person B hold b. Any scheme which allows A and B to collaboratively compute disj(a, b) with probability 1 − δ requires Ω(n) bits of communication, for every δ < 1/2.

Corollary 2.3.2 An approximation scheme for intersection size would imply a way to estimate disjointness.

Proof. We just have to consider two cases. Firstly, when i(a, b) = 0, then any approximation would have to return a result of 0, whatever the approximation parameters a, b (see Definition 2.1.2), with probability δ. Similarly, if i(a, b) ≠ 0 then any approximation would have to return a non-zero result with probability δ. Hence if an approximation scheme for i existed then we could derive an estimate for disj by returning 1 if î = 0 and returning 0 if î ≠ 0. ✷

Therefore, by contradiction, there can be no streaming or sketching algorithm for approximating intersection size since otherwise there would exist a communication protocol for estimating disjointness. We comment that although this rules out the existence of any constant factor approximation scheme for intersection size, there is still the possibility of a scheme that returns an answer correct up to a constant factor plus an additional constant. That is, find î such that i(a, b) ≤ î(a, b) ≤ ε·i(a, b) + c for constants c and ε. Next we pursue an alternative weaker kind of approximation.

Rough Sketches for Intersection Size Although we cannot make sketches for the intersection size that meet the specifications of Definition 2.1.3, it is still possible to find a sort of approximation that may be "good enough". Informally, we will call this a rough sketch. This is an additive approximation of the intersection size: that is, an approximation î such that |i(a, b) − î(a, b)| ≤ εn (rather than ε·i(a, b)). We describe a simple communication scheme that achieves such an approximation, adapted from Section 5.5 of [KN97]. Suppose person A holds a bit-string a, and person B holds b. Then A selects k locations S_1, ..., S_k from {1 ... n} uniformly at random, with replacement, and sends these locations followed by the values of a at each of them. B then estimates i by

î = (n/k) · Σ_{i=1}^{k} a_{S_i} b_{S_i}

Lemma 2.3.3 î is an ε-additive approximation for i (that is, |i − î| ≤ εn) with probability 1 − δ, for k sufficiently large in Θ(1/ε² log 1/δ).

Proof. We pick k locations uniformly at random from the bit-strings, with replacement. For any chosen bit, the probability that it is in the intersection is then i(a, b)/n. We compute the intersection of the two sets of chosen bits and scale by n/k; by linearity of expectation, the expected value of this quantity is (n/k) · (k · i(a, b)/n) = i(a, b) = i. Each test of a chosen bit is a Bernoulli trial where each outcome is independently and identically distributed, so we can apply the Hoeffding inequality to the sum of k such trials, as described in Chapter 4 of [MR95]. The probability that the difference between î and i is more than εn is given by

Pr(|î − i| ≥ εn) = Pr((k/n)|î − i| ≥ εk) ≤ 2e^{−2kε²} = δ

Rearranging this shows that k = (1/2ε²) ln(2/δ), and so k = Θ(1/ε² log 1/δ). ✷

So to allow this to take place in a communication setting, A just needs to send the k locations, plus the k bits at those locations, to B, at a cost of O((1/ε²) log 1/δ log n) bits of communication. We remark again on the close relation between communication protocols and sketching algorithms, since any lower bound in a communication paradigm is a lower bound on the size of a sketch — because a sketch can be viewed as a message being communicated.
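A small simulation of this rough sketch (added here for illustration; the vectors, the value of k and the seed are arbitrary choices) shows the additive behaviour of the estimator on large bit-strings.

```python
import random

def rough_intersection(a, b, k, rng=random.Random(1)):
    """Sample k coordinates with replacement and rescale: an additive estimate of i(a, b)."""
    n = len(a)
    sample = [rng.randrange(n) for _ in range(k)]      # the locations A sends to B
    return (n / k) * sum(a[i] * b[i] for i in sample)

n = 10_000
a = [1 if i < 6000 else 0 for i in range(n)]
b = [1 if 4000 <= i < 9000 else 0 for i in range(n)]   # true intersection size = 2000
print("estimate:", round(rough_intersection(a, b, k=4000)),
      "exact:", sum(x * y for x, y in zip(a, b)))
```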

Rough Sketching for Smaller Sets Work by Broder described in [Bro98] and with further technical details in [BCFM98] shows a way to make sketches for intersection size. These sketches are used in retrieving documents in the AltaVista search engine. The approach is based on choosing hash functions and (pseudo)random permutations of the universe from which sets are drawn. For a given set, a permutation is applied, and then each element is hashed to an integer. These hash values are taken modulo some integer m, and those that are congruent to 0 are selected for the sketch. The additive approximation of the intersection size of two sets is then found by taking the intersection size of their representative sketches, and scaling appropriately. The purpose of this approach is to deal with cases where the size of any individual set is very small compared to the size of the universe from which the sets are drawn. In this case the method of Lemma 2.3.3 would mainly sample bits that were 0, giving a poor approximation. Repetition of this process (using different permutations and hash functions) can improve the quality of the approximation. However, the size of this representation varies linearly with the size of the sets: the expected size of the representation of A is O(|A|/m).

Variations of Intersection

We consider an alternative measure of set similarity, the Set Resemblance measure, an extension from sets to vectors, and the related set measure of Set Difference.

Set Resemblance Closely related to Set Intersection size is the Set Resemblance measure studied by Broder [Bro98]. This is also known as the Jaccard coefficient, as used in statistics and information retrieval. The resemblance of two sets A and B is r(A, B) = |A∩B| / |A∪B| (it is assumed that A and B are not both the empty set). Since r(A, B) = 0 if and only if A ∩ B = ∅, approximating this quantity is hard for the same reasons as approximating the intersection size (otherwise, we could solve the disjointness problem). Let A and B be both drawn from a universe U of size n. Suppose that P is a permutation on U (that is, it maps U bijectively to {1 ... n}) chosen uniformly at random from all such permutations. We can apply P to A to get P(A) ⊆ {1 ... n}. We can create a rough sketch for A by considering many different random permutations P_i, so that the sketch of A is given by:

sk(A, P) = (min(P_1(A)), min(P_2(A)), ..., min(P_k(A)))

The approximation of |A∩B| / |A∪B| is 1 − ||sk(A, P) − sk(B, P)||_H / k, where || · ||_H is the Hamming norm as usual. This is because, for each coordinate, Pr(min(P_i(A)) = min(P_i(B))) = |A∩B| / |A∪B|. So by summing these, the expectation of 1 − ||sk(A, P) − sk(B, P)||_H / k is the resemblance of the sets A and B. The advantage of this approach is that, unlike that described above, it is independent of the size of the sets A and B, working equally well whether the sets are large or small. The size of this sketch is k elements, each of which can be represented using log n bits. This method of approximating the Jaccard coefficient was defined in [Bro98] and refined in [BCFM98]. Attention is given there to how to choose the permutations P: storing a permutation in full would consume too much space. Instead, permutations are chosen at random from a smaller class of "Min-Wise Permutations", which can be represented in small space, although we do not discuss this further here. It is also remarked in [Bro98] that 1 − |A∩B| / |A∪B| is a metric, although we do not make use of this fact.
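The following Python fragment (an added illustration, not the construction of [Bro98]) imitates the scheme with explicit random permutations, which is only feasible for a small universe; min-wise permutation families would replace the shuffles in practice, and the parameters here are arbitrary.

```python
import random

def minhash_sketch(A, universe_size, k, seed=0):
    sketch = []
    for i in range(k):
        perm = list(range(universe_size))
        random.Random(f"{seed}:{i}").shuffle(perm)     # P_i, shared between parties
        sketch.append(min(perm[x] for x in A))         # min(P_i(A))
    return sketch

def resemblance_estimate(skA, skB):
    """1 - (Hamming distance of the sketches)/k, i.e. the fraction of agreeing coordinates."""
    return sum(x == y for x, y in zip(skA, skB)) / len(skA)

U, k = 1000, 200
A = set(range(0, 300))
B = set(range(150, 450))                               # resemblance 150/450 = 1/3
print("estimate:", resemblance_estimate(minhash_sketch(A, U, k),
                                         minhash_sketch(B, U, k)),
      "exact:", len(A & B) / len(A | B))
```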

Dot products Given two vectors a, b, we have already stated that their dot product is Σ_i a_i b_i. In the case where the entries of the vectors are restricted to being 0 or 1, then as we observed, the dot product is isomorphic to the set intersection measure. It therefore follows that approximating the dot product of two vectors is at least as hard as set intersection, that is, it requires Ω(n) bits of communication in a communication complexity model. For non-negative vectors, such as those we shall be considering in later chapters, there are techniques to select from a collection of vectors those that have a large dot-product with a query vector, which may improve on the direct evaluation, depending on the nature of the input data [CL99].

Set Difference We show that there can be no algorithm in the sketch model to allow the approximation of the size of the set difference, |A\B|. This measure is quite similar to intersection, since A\B = A ∩ B̄, hence we would not expect there to be an approximation scheme. Suppose that it were possible to make sketches for this measure, so that there are sketches of A and B, sk(A, r) and sk(B, r). Then for some function f, we would have |f(sk(A, r), sk(B, r)) − |A\B|| ≤ ε|A\B| with probability 1 − δ. If such a sketch has been made for some set A, then this sketch and the random bits used to create it could be communicated to B, who could then compute sk(B̄, r) correspondingly, where B̄ is the complement of the set B with respect to the universe, as usual. We then consider the approximation of |A\B̄|. The result is 0 with probability at least 1 − δ if A does not intersect B, and non-zero with the same probability if A and B are not disjoint. In other words, we can reduce the problem to that of Disjointness, and show that the combined size of the sketch and all the information used to create it (i.e. the total amount of memory used) must be Ω(n). This is due to the fact that computing the disjointness function has been shown to have linear communication complexity (see Section 2.3.3).

2.3.4 Approximating Set Measures

The results of the previous section allow us to bound the space cost of approximating different parts of the Venn diagram for two sets A and B. These are reported in Figure 2.2. We consider two sets, A and B, drawn from a universe of size n. The table gives space bounds in bits for the approximation (in the streaming model) of various quantities from the Venn diagram. We consider two fundamental variations of the streaming model: when each element of a set is guaranteed to be presented exactly once (equivalently, when we are in the sketch model) — this gives the results in the 'aggregated' column. In the 'unaggregated' column, we treat this as a purely streaming problem where the same element may be seen several times over within the stream. Since linear space requirements will generally be regarded as impractical or uninteresting, the divide between linear and sub-linear (poly-logarithmic) space requirements is interesting to note: intersection size and set difference size are impractical in both models, whereas set size and union size are always practical. Our most frequently used measure, symmetric difference, straddles the boundary, since it is practical in the sketch and aggregated streaming model, but impractical in the unaggregated streaming model when information can be repeated.

Quantity    Aggregated                Unaggregated
|A|         O(log n)                  O(1/ε² log n log 1/δ)
|A∆B|       O(1/ε² log n log 1/δ)     Ω(n)
|A ∪ B|     O(1/ε² log n log 1/δ)     O(1/ε² log n log 1/δ)
|A ∩ B|     Ω(n)                      Ω(n)
|A\B|       Ω(n)                      Ω(n)

Figure 2.2: Approximability of set quantities

2.4 Geometric Problems

We first saw these Geometric Problems in Section 1.5.3. We now describe some solutions to these problems under some of the vector distances we have described. We begin by looking at Approximate Nearest and Furthest Neighbors. The solutions we shall see here will be for the Hamming distance, but in later chapters we shall solve approximate nearest neighbors in other distance spaces. A detailed consideration of Vector Nearest Neighbors methods is given in [Tsa00]. We then go on to look at a general solution for k-centers clustering. We first describe in brief recent methods for solving the problems of Approximate Nearest Neighbors (Definition 1.5.2). These methods actually solve a slightly simpler problem, which is: given a distance r and a query point q, return (with high probability) some point p whose distance is at most r from q, or report that there is no point whose distance is at most (1 + ε)r (in between, either answer is acceptable). If we can solve this problem well enough for all possible values of r then we can perform a search on this structure to find an ε-approximate nearest or furthest neighbor of q: we begin by testing whether q is in the database or not. If not, then we can do a binary chop to find the minimum value of l for which there is a point returned by this query at distance (1 + ε)^l. Note that we will be dealing with discrete metric spaces where the distance between two distinct points is an integer between 1 and n for some n, hence there are a limited number of distances that we need to check.

2.4.1 Locality Sensitive Hash Functions

The method we describe is originally from [IM98] and refined in [GIM99] where it is shown to work empirically. It also solves the problem of determining whether there is an Approximate Neighbor at distance r. Their method is defined to use ‘Locality Sensitive Hash Functions’, whose collision probability is related to the distance between objects. Related ideas are used to give alternative solutions by Kushilevitz, Ostrovsky and Rabani [KOR98].

Definition 2.4.1 (From [IM98]) A family H of locality sensitive hash functions h, with parameters r, ε, p_1 > p_2, mapping points p, q from the relevant metric space onto some discrete range, satisfies the following properties for any p and q: If d(p, q) ≤ r then Pr_H[h(p) = h(q)] ≥ p_1. If d(p, q) ≥ r(1 + ε) then Pr_H[h(p) = h(q)] ≤ p_2.

Since p_1 > p_2 this gives a bias towards nearest neighbors. That is, the functions h are more likely to give the same value if p and q are close, with the probability taken over all choices of h from the family H. On the other hand, if we exchange the roles of p_1 and p_2, so that the functions are more likely to give the same value when p and q are far apart, then the same test finds furthest neighbors, and so can be used to solve Approximate Furthest Neighbors. However, we need to amplify these probabilities to give a bounded probability of success. The method then works as

follows: first, we fix l sets of k locality sensitive hash functions h_{i,j}, where the functions are drawn independently and uniformly at random from the family H. Then for each of the m points in the database, we repeat l times: compute the k different locality sensitive hash functions for this point p, giving h_{i,1} ... h_{i,k}. We concatenate the output of these hash functions to give a single hash of the point, h_i(p) = h_{i,1}, h_{i,2}, ..., h_{i,k}. Then use a second (linear) hash function to compress the range of this function into a hash table as described in Section 2.1.3: select a random prime p̂ and compute hash(h_i(p)) using the function hash defined there, setting the collision probability δ = 1/m. We do this l times, giving hash(h_1(p)) ... hash(h_l(p)). To look up a query point q, we perform the same transformation: hash the point under the hash functions, getting hash(h_1(q)) ... hash(h_l(q)). We then examine the contents of the l different buckets hash(h_i(q)). We extract all the points found in these buckets (stopping in the unlikely case that we exceed 4l distinct points), obtaining a set of points P such that for each p ∈ P there is some i such that hash(h_i(p)) = hash(h_i(q)). Amongst this set of points P we search for the (exact) nearest or furthest neighbor of q and test whether this is within a distance of r from q, outputting it if it is, and outputting a negative response if there is no point within a distance of r(1 + ε) from q. For the proof, we shall assume that we are dealing with Nearest Neighbors; the argument and technical details are almost identical for Furthest Neighbors.

Theorem 2.4.1 (From Theorem 4 of [IM98] and Theorem 1 of [GIM99]) This method finds a (1 + ε) Approximate Nearest Neighbor of a query point q with constant probability. It requires O(m^{log(1/p_1)/log(1/p_2)}) hash function evaluations per query.

Proof. Suppose there is a (close) point p* such that d(p*, q) ≤ r. Let us set k = log_{1/p_2} m. The probability for any given i that p* hashes to the same value as q (so hash(h_i(p*)) = hash(h_i(q))) is at least the probability that each of the k LSH functions gives the same answer on p* and q, which is

p_1^k = p_1^{log_{1/p_2} m} = m^{−log(1/p_1)/log(1/p_2)} = m^{−ρ}

for ρ = log(1/p_1)/log(1/p_2). If we set l = m^ρ then the probability that this works for at least one value i is 1 − (1 − m^{−ρ})^{m^ρ}, which is at least 1 − 1/e. We consider the probability that some (far) point p′ for which d(p′, q) ≥ (1 + ε)r hashes to the same value as q, either under all k LSH functions or because there is a collision in the bucketing hash function (hash): this is at most p_2^k + 1/m = 1/m + 1/m = 2/m. So the expectation of the number of such p′ that create hash clashes is at most 2, by linearity of expectation. Over all l tables, the number of buckets we expect to see filled with bad points is then at most 2l. By the Markov inequality, the probability that this is more than 4l is less than 1/2. The probability that we do find a good point, and we do not find too many bad points, is at least 1 − ((1/2) + (1 − (1 − 1/e))) = 1/2 − 1/e, which is a constant. ✷

We can increase this constant probability of success to some arbitrary probability 1 − δ by repeating the procedure (with different hash functions) O(log 1/δ) times. The query running time is that needed for l = m^ρ evaluations of a hash function.

Application to Approximate Nearest Neighbors under the Hamming Metric

An LSH family for the Hamming metric is just the set of functions h_i(a) = a_i, that is, the functions which pick out the i'th bit of a bit-string. From this, we can easily find that p_1 = 1 − r/n and p_2 = 1 − r(1 + ε)/n, and so p_1 > p_2. Hence to generate a function h_i, we choose the k locations to sample from [1 ... n] uniformly at random (with replacement). Recall that the running time depends on m^ρ evaluations of a group of hash functions, each of which takes time at most O(n). We can simplify the expression of ρ and show, using the inequality (1 − r/n)^{1+ε} ≥ 1 − r(1 + ε)/n, that ρ ≤ 1/(1 + ε) and hence m^ρ ≤ m^{1/(1+ε)}. To answer approximate

nearest neighbors, there is the additional O(log n) factor to find the smallest value of r at which there is a neighbor, giving an overall ANN query time of O(n log n · m^{1/(1+ε)}). In [GIM99] it is argued that for many practical applications a single test of this nature is suitable, since empirically all nearest neighbors occur at about the same distance.
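A minimal Python sketch of this bit-sampling scheme may make the construction concrete. It is an added illustration: it keeps the concatenated coordinate values themselves as hash-table keys, omits the second-level hashing and the 4l cut-off from the analysis above, and uses arbitrary parameter choices.

```python
import random

def build_tables(points, k, l, n, seed=0):
    rng = random.Random(seed)
    samples = [[rng.randrange(n) for _ in range(k)] for _ in range(l)]   # g_i: k sampled bits
    tables = [dict() for _ in range(l)]
    for idx, p in enumerate(points):
        for t, coords in enumerate(samples):
            key = tuple(p[c] for c in coords)
            tables[t].setdefault(key, []).append(idx)
    return samples, tables

def query(q, points, samples, tables):
    candidates = set()
    for coords, table in zip(samples, tables):
        candidates.update(table.get(tuple(q[c] for c in coords), []))
    # exact search restricted to the retrieved candidates
    return min(candidates, default=None,
               key=lambda i: sum(x != y for x, y in zip(points[i], q)))

rng = random.Random(7)
n = 128
points = [[rng.randrange(2) for _ in range(n)] for _ in range(200)]
q = list(points[42])
for pos in rng.sample(range(n), 5):        # a query at Hamming distance 5 from points[42]
    q[pos] ^= 1
samples, tables = build_tables(points, k=12, l=20, n=n)
print("closest candidate:", query(q, points, samples, tables))
```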

Application to Furthest Neighbors under Set Resemblance Let r be the Jaccard coefficient of set resemblance, |A∩B| / |A∪B|. We can define an LSH family H for this measure based on the sketch method described above. The properties that we want for furthest neighbors are: If d(p, q) ≤ r then Pr_H[h(p) = h(q)] ≤ p_2. If d(p, q) ≥ r(1 + ε) then Pr_H[h(p) = h(q)] ≥ p_1. We pick a permutation P of the universe uniformly at random, and pick the minimum element of a set under this permutation, so h_P(A) = min P(A). For a fixed r, we have that if d(p, q) ≤ r then |A∩B| / |A∪B| ≤ r. Then Pr[first item picked is in both A and B] = |A∩B| / |A∪B|. So p_2 ≤ r and p_1 ≥ r(1 + ε). Recall throughout that 0 ≤ r ≤ 1.

ρ = ln(1/p_1) / ln(1/p_2) = ln(r(1 + ε)) / ln(r) = 1 + ln(1 + ε)/ln(r) ≤ 1 − 2ε/(3 ln(1/r))

The last step assumes that ε ≤ 1. This means that we have to compute O(m^ρ) = m^{1 − 2ε/(3 ln 1/r)} different hash tables. Each look-up in a hash table requires computing log_{1/p_2} m = ln m / ln(1/r) hash values. To find an approximate furthest neighbor, we may try log n such tests, doing a binary search on the distance. The total cost is then O(m^{1 − 2ε/(3 ln 1/r)} · log n · ln m / ln(1/r)) locality sensitive hash function evaluations, compared to the exact algorithm whose cost is O(mn). Note that the cost of computing these hash functions based on permutations is not constant, and that as r approaches 1 or 0 this can become too high a cost. We shall have to address these issues when we come to applying this method later.

2.4.2 Approximate Furthest Neighbors for Euclidean Distance

A relatively simple scheme is given in [GIV01] for finding √2-approximate furthest neighbors under the Euclidean distance. At a high level, the algorithm is as follows: Let S be the set of points of interest. Let B be a smallest hypersphere such that ∀s ∈ S : s ∈ B. B has radius r and is centred at some point O. We find a subset R of S such that every point of R is at distance r from O and the convex hull of R contains O. R is chosen to be of size at most ℓ + 2 points for ℓ-dimensional space. Approximate furthest neighbor queries can then be answered using R: given a query point q, we compute a hyperplane that passes through O and is normal to the line joining q and O. The response to the query is any point from R that is on the other side of this hyperplane from q. A geometric argument shows that the true furthest neighbor of q cannot be more than √2 times further than the answer found this way. The time to perform this is that to examine the O(ℓ) elements of R, each of which is a vector of dimension ℓ. The actual solution requires more work, to construct a good approximation to R and O in a reasonable amount of time. In [GIV01] it is shown that a set of size O(ℓ log |S|) can be found to use for R. The scheme has a per-query cost of O(ℓ² log |S|) and returns a neighbor that is at most a √2 + o(1) approximation to the furthest neighbor. The preprocessing cost to find R is O(ℓ³ |S| poly-log ℓ|S|). Since this solution works in Euclidean space, we can reduce the dimensionality of the space using the Johnson-Lindenstrauss lemma (Section 2.2.1) to O(log |S|) while still guaranteeing that the expansion of any pair of distances is bounded with high probability. In this randomised setting, the algorithm gives O(log³ |S|) query time (plus the time to process the query point) and with arbitrary constant probability returns a √2-approximate furthest neighbor.

Algorithm 2.4.1 Gonzalez's Algorithm for clustering in a metric space
  Initialise centers to an arbitrary point
  j = 1
  repeat
    Find x such that min_{i ≤ j} d(x, center_i) is maximised
    j = j + 1
    Set center_j = x
  until j = k

2.4.3 Clustering for k-centers

We return to the problem of clustering by finding k-centers as described in Section 1.5.5. A simple approximation algorithm due to Gonzalez [Gon85] allows us to find a 2-approximate solution for any distance measure that is a metric. This is a clustering approx such that, for any clustering clusters, spread(approx) ≤ 2 spread(clusters). The approximation algorithm picks out k centers from the data to define the clusters. A center is one of the data points, and the clustering induced from the k centers is to associate each other data point with the center that it is closest to. A greedy approach picks out the centers: an arbitrary data point is selected to initialise the centers. Then the data point that is furthest from all of the centers is added to the set of centers, until there are k centers. The induced clustering is then claimed to be a 2-approximation. The cost of the algorithm, given in pseudo-code as Algorithm 2.4.1, is O(km) comparisons.

Theorem 2.4.2 (Theorem 2.2 of [Gon85]) Algorithm 2.4.1 gives a 2-approximation to the optimal clustering.

Proof. We consider the effect of running an additional iteration of the algorithm to find a (k + 1)th center, x_{k+1}. Let d be the distance of this point from its closest center. All of the k + 1 centers are pairwise separated by at least d, since in each iteration we pick out the point that maximises its distance from the existing centers. Because there are k + 1 points where any pair is separated by at least d, two of them must fall into the same cluster of an optimal clustering, so the spread of an optimal clustering is at least d. Also, since the (k + 1)th center maximises d amongst all non-centers, it follows that the distance from any point to its closest center is at most d. So, the distance between any two points in a cluster is at most 2d, by the triangle inequality. The spread of each cluster is thus at most 2d, while the optimal spread is at least d, hence this is a 2-approximation. The running time of the algorithm is O(kmn): there are O(km) comparisons, each of which takes O(n) time. ✷
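For concreteness, a direct Python rendering of Algorithm 2.4.1 is given below (an added illustration, here instantiated with Hamming distance on short strings; the data are arbitrary).

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def gonzalez(points, k, dist=hamming):
    centers = [points[0]]                      # initialise with an arbitrary point
    while len(centers) < k:
        # the point furthest from its closest current center becomes a center
        furthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(furthest)
    return centers

data = ["aaaaa", "aaaab", "bbbbb", "bbbab", "ccccc", "cccca"]
print(gonzalez(data, k=3))                     # picks one center from each group
```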

This straightforward approximation is convenient, since it allows us to modify it to work using approximate distances rather than exact distances.

Theorem 2.4.3 If, instead of the exact distance function, we use an approximation that is at most c times the actual value in Algorithm 2.4.1, then the algorithm gives a 2c-approximation to the optimal clustering.

Proof. We repeat the above proof, but adjust for the fact that the distances found are approximations of the true distance. We assume that the approximate distance d̂ behaves so that d(x, y) ≤ d̂(x, y) ≤ c · d(x, y). As before, we run the algorithm, and consider the (k + 1)th center; let d̂ be its approximate distance from its closest center. Then there are k + 1 points where any pair is separated by at least d̂/c (by the upper bound on the approximation), so it follows that the spread of an optimal clustering is at least d̂/c. Also, since the (k + 1)th center maximises d̂ amongst all non-centers, it follows that the distance from any point to its closest center is at most d̂ (by the lower bound on the approximation). So, the distance between any two points in a cluster is at most 2d̂, by the triangle inequality. The approximation factor is then 2d̂/(d̂/c) = 2c. ✷

Corollary 2.4.1 For the vector distance approximations in this chapter, we can find a clustering that is within a factor of 2 + ε of the optimal with probability 1 − δ. The running time is O((1/ε²) m(n + k) log m).

Proof. We use the above Theorem, but we need to attend to a couple of technical details. We assume that we are using some approximate distance oracle to approximate distances efficiently. In practice, these will be the sketches that we have described in Section 2.2.4, which are (1 ± ε′) approximations with probability 1 − δ′. Firstly, we re-scale the approximation by a factor of 1/(1 − ε′) so that d(a, b) ≤ d̂(a, b) ≤ ((1 + ε′)/(1 − ε′)) d(a, b). For ε′ ≤ 1/2 we have (1 + ε′)/(1 − ε′) ≤ 1 + 4ε′; hence we set ε′ = ε/4. Secondly, the above proof requires that the approximation is always within the stated bounds. To overcome this, we set the probability of failure of each approximation, δ′, so that out of all O(m²) comparisons, the probability that they all succeed is 1 − δ. So (1 − δ′)^{m²} = (1 − δ) and hence

(1 − m²δ′ + O(m⁴δ′²)) = (1 − δ)

This leads us to choose δ′ = δ/m^α for some small constant α. The space needed for each distance approximation is then O(1/ε² (α log m + log 1/δ)). The clustering is then, with probability 1 − δ, within 2 + ε of the optimal. The running time of this algorithm is O((1/ε²) mn log m) to make the sketches and then O((1/ε²) km log m) to run the algorithm. ✷

2.5 Discussion

We have seen a number of problems and their solutions for questions relating to vectors and (equivalently) binary strings. The main themes and concepts introduced in this chapter are:

• The idea of a sketch, which allows us to approximate the distance between objects by manipulat- ing sketches which are significantly smaller than the objects they represent. Certain sketches can be computed in a streaming model, where the data is too large to be held in memory and instead is accessed with a single pass over it.

• Efficient ways to sketch vectors for the Lp norms for all 0 < p ≤ 2.

• Applications of sketching to the Hamming distance for binary vectors and arbitrary strings. These vectors can be large and sparse without affecting the size of the sketch, and they can be presented in the streaming model.

• Proof that the intersection space is hard to approximate, but that a rough kind of sketching is possible on intersection size and set resemblance.

• The efficient approximate solution to a number of geometric problems under the Hamming distance, Euclidean Distance and Set Resemblance measure.

Chapter 3

Searching Sequences

A, B, C, It's easy as 1, 2, 3, As simple as Do re mi, A, B, C, 1, 2, 3, Baby you and me. [Fiv70]

3.1 Introduction

In this chapter, embeddings of the permutation distances discussed in Chapter 1 into vector distances are described. Unlike the embeddings of vector distances we saw in Chapter 2, these have fixed distortion factors (compared to the 1 ± ε distortion factors for Lp distances). On the other hand, these are non-probabilistic embeddings: the distance between any possible pair of permutations is distorted by no more than the stated distortion factor. These embeddings are constructed by using combinatorial observations about the different permutation distances. Once we have embedded permutations into vector distances, we can go on and embed these vectors into much smaller vector spaces using the embeddings in Chapter 2. This approach yields many other interesting results. In Section 2.4 we looked at some geometric problems on vectors. We can ask the same questions for permutations, replacing the vector distances with permutation distances. By using the embeddings of permutations into vector spaces, it is often straightforward to use algorithms which solve geometric problems in those vector spaces to solve the permutation problems. Further, we can use the properties of our embeddings to solve approximate pattern matching problems under permutation distances.

3.1.1 Computational Biology Background

Following the successful sequencing of the Human Genome [V+01], attention is shifting from raw sequence data to genetic maps. Comparative studies of gene loci among closely related species provide clues towards understanding the complex phylogenetic relationships between species and their evolutionary order. Genetic maps of two species can be thought of as permutations of homologous genes, and the number of chromosomal rearrangements in the form of deletions, copies, inversions and transpositions needed to transform one such permutation to another can be used as a measure of their evolutionary distance. Computational methods for measuring genetic distance between species are an active area of research in computational genomics, especially in the context of comparative mapping [NT84], using reversal distance, transposition distance or other measures. In a more general setting it is of interest not only to compute the distances between two permutations but also to find the closest gene permutation to a given one in a database, or to approximately find a given permutation of genes in a larger sequence. Given the representation as permutations, these can all be abstracted as permutation editing and matching problems. More complex notions of genetic distance which take into account that (1) the genome is composed of multiple chromosomes [FNS96, SN96], or (2) the exact order of the genes within a genome is not necessarily known [GGP+99], have recently been proposed. Unfortunately computing the genetic distance under such extensions is NP-hard [DJK+98, GGP+99], and polynomial time algorithms only exist when a limited subset of chromosomal rearrangements are permitted [HP95, KR95]. Thus modelling genomes as permutations provides a simple but tractable means for computing genetic distance between species. We know that permutations are ordered sequences over some alphabet with no repetitions allowed. A permutation is a string, although strings are not generally permutations since they are allowed to repeat symbols. We study problems of computing the pairwise edit distances described in Section 1.3.4 and performing similarity searching, matching and so on, motivated by Computational Biology and other scenarios. We adopt a general approach to solving all such problems on permutations: develop an embedding of permutations into vector spaces such that the distance between the resulting vectors approximates the distance between any two permutations. Thus permutation editing and matching problems reduce to natural problems on vector spaces. Even though we have motivated permutation editing, matching and similarity searching problems from Computational Biology applications, there are other reasons for their study. Permutations

form an interesting class of combinatorial objects by themselves, and therefore it is quite natural to study the complexity of computing edit distances between permutations, and to do similarity searching. In addition, they arise in many applications such as sorting networks and group theory.

3.1.2 Results

In Section 3.2 we present embeddings of permutation distances into previously described spaces such as Hamming or Set Intersection. The embeddings preserve the original distances within a small constant or logarithmic factor, and embed into a vector space that is quadratically larger than the size of the permutation being embedded. The embeddings use a technique of capturing the relative layout of pairs of symbols in the permutations by two dimensional matrices. These embeddings capture the relevant pairs that help approximate the permutation distances accurately and the resulting matrices are often sparse. The embeddings above immediately give approximation algorithms for computing the distance between two permutations in (near) linear time. Studies have been made in Computational Biology on this problem [BP93, BP98, Chr98a, Chr98b, KS95, KR95, BHK01]. In addition, we can use the embeddings above to solve several further algorithmic problems: (1) Computing permutation distances in a distributed or Communication Complexity setting wherein the number of bits exchanged is the prime criterion for efficiency, and also in the sketch and streaming models. (2) Providing efficient approximate nearest neighbor searches and clusterings for permutation distances. These are described in Section 3.3. While there exist some approximation algorithms for computing permutation distances ([BP98, Chr98a, KR95, KS95, BHK01] and many others), they are not communication complexity protocols. In particular, they rely on consistent relabelling so one of them is an identity permutation. This is clearly not possible in the sketch model. The problem of Approximate Pattern Matching for Permutations is, given a long text string and a pattern permutation, to find all occurrences of the pattern in the text with at most a given threshold of distance. This is an application of the general pattern matching problem of Definition 1.5.1 to the permutation distances. In Section 3.3.4 we present highly efficient, linear or near-linear time approximations for the permutation matching problems. This is intriguing since approximately solving string matching problems with corresponding string distances seems to be harder, with best known algorithms taking much longer. For example, existing algorithms for approximate string matching with edits take worst-case time Ω(nm/ log mn) for n-long text and m-long pattern where edits are transpositions, character indels and substitutions (see the survey in [Nav01]). In contrast, our algorithms take only O(n + m) or O(n log m) time for permutations.

3.2 Embeddings of Permutation Distances

The best known algorithms for sorting by reversals and by transpositions all assume that the target permutation is known in advance, and so uniformly take the first step of relabelling one of the sequences as the identity permutation, and proceed to sort the correspondingly relabelled second sequence into order. However, when trying to carry out approximate pattern matching or creating a sketch, this is not a reasonable step — in particular for sketching, at the time of creating the sketch, there is no goal sequence. Hence, we cannot follow any of the existing approaches which assume both permutations are known. Instead, we shall give approximations that are sometimes slightly weaker than the best known approximations in order to admit calculation in the sketching model and for the other problems that we consider in this chapter.

3.2.1 Swap Distance

We begin our demonstration of embeddings of permutation distances into vector spaces with a very simple example, the Swap Distance (see Definition 1.3.3), which illustrates the basic principles that we will use repeatedly. That is, we show how to define a transformation from a permutation into a binary vector, so that the vector distance between pairs of transforms gives (an approximation to) the distance of interest. Recall from Section 1.3 the definition of the inverse of a permutation, P^{-1}, such that P^{-1}[i] = j if P[j] = i.

Definition 3.2.1 Define an inversion in a permutation P relative to a permutation Q as a pair (i, j) such that P^{-1}[i] > P^{-1}[j] but Q^{-1}[i] < Q^{-1}[j].

This definition of an inversion is a generalisation of the definition of an inversion given in [Knu98] (see section 5.1.1). There, Q is the identity permutation, and so an inversion is a pair i < j such that P^{-1}[i] > P^{-1}[j]. It is important to keep this notion of inversion distinct from the idea of 'inversion distance' in computational biology (a synonym for the reversal distance).

Definition 3.2.2 Define an n × n binary matrix S(P) by

P^{-1}[i] < P^{-1}[j] ∧ i < j ⟹ S(P)[i, j] = 1
P^{-1}[i] > P^{-1}[j] ∨ i ≥ j ⟹ S(P)[i, j] = 0

The Hamming distance between two matrices A and B is denoted ||A − B||H . The Hamming distance between two matrices is just the Hamming distance between two vectors obtained by linearising the two matrices in any manner.

Lemma 3.2.1 swap(P, Q)=||S(P ) − S(Q)||H

Proof. We prove two separate facts: firstly, that ||S(P) − S(Q)||_H counts exactly the number of inversions between P and Q, and secondly that the number of inversions is exactly the swap distance. (i) Consider any pair 1 ≤ i < j ≤ n. The entry (i, j) contributes one to ||S(P) − S(Q)||_H exactly when S(P)[i, j] ≠ S(Q)[i, j], that is, when i and j occur in one relative order in P and in the other relative order in Q; in other words, when (i, j) is an inversion between P and Q. (ii) Each swap of two adjacent symbols changes the relative order of exactly one pair, and so changes the number of inversions by exactly one; and whenever P ≠ Q there is some adjacent pair of symbols in P that is inverted relative to Q, so a swap that reduces the count is always available. Hence the minimum number of swaps needed to transform P into Q is exactly the number of inversions, giving the result. ✷

Example. Consider two permutations of the letters {A, E, I, M, N, R, S}, P = SEMINAR and Q = REMAINS. Their transforms are respectively:

S(P)   A E I M N R S        S(Q)   A E I M N R S
A      0 0 0 0 0 1 0        A      0 0 1 0 1 0 1
E      0 0 1 1 1 1 0        E      0 0 1 1 1 0 1
I      0 0 0 0 1 1 0        I      0 0 0 0 1 0 1
M      0 0 0 0 1 1 0        M      0 0 0 0 1 0 1
N      0 0 0 0 0 1 0        N      0 0 0 0 0 0 1
R      0 0 0 0 0 0 0        R      0 0 0 0 0 0 1
S      0 0 0 0 0 0 0        S      0 0 0 0 0 0 0

By comparing these tables, we see that ||S(P) − S(Q)||_H = 13. So the swap distance should be 13. We verify this by transforming P into Q:

SEMINAR →1 ESMINAR →2 EMSINAR →3 EMISNAR →4 EMINSAR →5 EMINASR →6 EMINARS →7 EMINRAS →8 EMIRNAS →9 EMRINAS →10 ERMINAS →11 REMINAS →12 REMIANS →13 REMAINS

Each step reduces the number of inversions by one.
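To make the embedding concrete, the following sketch (illustrative Python, not part of the original development; the function names are ours) represents S(P) by the set of its non-zero entries and confirms that the Hamming distance between the transforms of SEMINAR and REMAINS is 13.

    def S(perm):
        # S(P)[i, j] = 1 exactly when i < j and i occurs before j in P;
        # the sparse binary matrix is represented by its set of 1-entries.
        pos = {sym: k for k, sym in enumerate(perm)}
        return {(i, j) for i in perm for j in perm if i < j and pos[i] < pos[j]}

    def hamming(a, b):
        # Hamming distance between two binary matrices given as sets of 1-entries.
        return len(a ^ b)

    print(hamming(S("SEMINAR"), S("REMAINS")))   # 13, the swap distance verified above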

Extension to strings

A particularly nice feature of this approach to swap distance is that it immediately translates to the equivalent distance on strings. That is, given two strings a and b (with the same quantity of each character in the alphabet σ), the swap distance is the number of swaps of adjacent pairs of characters needed to transform a into b, denoted as swap(a, b).

Definition 3.2.3 We define the function permify(a) on a string a. Suppose a[j] is α. Then permify(a)[j] = α_i if this is the i'th α that has been seen reading left to right in a.

Lemma 3.2.2 swap(a, b) = swap(permify(a), permify(b))

Proof. This approach identifies the i'th occurrence of each character α in a with the i'th occurrence of α in b. Suppose an optimal sequence of swaps did not respect this identification: then there exist occurrences α_i and α_j of the same character, with α_i identified with α_{i′} and α_j identified with α_{j′}, where i < j but i′ > j′. Then there must be a swap which exchanges α_i with α_j in transforming a into b. But this swap is superfluous, since we can omit it and the sequence of intermediate strings is identical. Therefore, we never need swap any pair of identical characters, and the ordering of the occurrences of each character is the same in a and b. ✷

Example. Suppose we wish to demonstrate that Mike Paterson is Mistake Prone. Then we apply permify to both strings to yield M_1 i_1 k_1 e_1 ␣_1 P_1 a_1 t_1 e_2 r_1 s_1 o_1 n_1 and M_1 i_1 s_1 t_1 a_1 k_1 e_1 ␣_1 P_1 r_1 o_1 n_1 e_2. Analysing the inversions of these permutations shows that

swap(Mike Paterson, Mistake Prone) = 20
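This calculation can be reproduced directly by applying permify and counting inversions between the resulting permutations. The sketch below (illustrative Python only, using a simple quadratic inversion count for clarity) recovers the value 20 for the two strings above.

    def permify(s):
        # Tag each character with its occurrence number, reading left to right.
        seen, out = {}, []
        for ch in s:
            seen[ch] = seen.get(ch, 0) + 1
            out.append((ch, seen[ch]))
        return out

    def swap_distance(a, b):
        # Number of inversions between permify(a) and permify(b): map each tagged
        # symbol of a to its position in b, then count out-of-order pairs.
        pos_in_b = {sym: k for k, sym in enumerate(permify(b))}
        seq = [pos_in_b[sym] for sym in permify(a)]
        return sum(1 for i in range(len(seq)) for j in range(i + 1, len(seq))
                   if seq[i] > seq[j])

    print(swap_distance("Mike Paterson", "Mistake Prone"))   # 20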

Note the largest possible swap distance (referred to as the diameter) can be worked out from this transformation: the greatest value of the Hamming distance will be seen if one transformation has 1s in every possible location, and another has 0s — corresponding to the identity permutation and its reverse. The Hamming distance, and hence the swap distance, is n(n − 1)/2. In this case, where n = 13, the largest possible swap distance is 78.

Although straightforward, the analysis of swap distance shows us some interesting points. Firstly, we see that initially different-seeming distances can be reconciled: the swap distance on permutations can be recast as a problem on Hamming space. Although the size of the Hamming space

is quadratically larger than the permutation space, this makes little difference for most applications, since they can be made independent of the size of the space. This embedding is an isometry: the distortion factor is 1, so the distance is preserved exactly. This is a highly desirable feature of an embedding, although it will be quite rare. Secondly, it shows how study of a permutation distance can lead to a greater understanding of a corresponding string distance. We would hope that the same is true of the other permutation distances that we study. A lower bound we could prove for a permutation distance might apply to a corresponding string distance, since a permutation is just a special case of a string. This is valid when the strings are drawn from a large alphabet, one which is at least linear in the length of the sequence, so the permutations could be expressed as strings.

3.2.2 Reversal Distance

To tackle reversal distance, we should like to show a similar result to that described for swap distance: that we can make a vector from a permutation so that a vector distance gives (an approximation to) the reversal distance. We begin by considering a method that finds a 2-approximation to the reversal distance, r(P, Q) (see Definition 1.3.1). This was originally given as Theorem 1 of [KS95], and was also described in Chapter 19 of [Gus97]. The goal there is to take an arbitrary permutation and apply reversals until it is sorted — that is, until it is the identity permutation. The approach is based on local features of the permutation called breakpoints, which consider adjacent pairs of symbols. For uniformity, we extend all permutations P by adding P[0] = 0 and P[n + 1] = n + 1, where n is the length of P. This allows the first and last symbols of P to be treated identically to the other symbols (as having two adjacent symbols).

Definition 3.2.4 A reversal breakpoint is a location i in a permutation such that |P[i] − P[i + 1]| ≠ 1. The number of reversal breakpoints in a permutation is denoted by φ(P).

Theorem 3.2.1 (from [KS95]) r(P, I) ≤ φ(P ) ≤ 2r(P, I)

Proof. For the upper bound, consider the identity permutation, and observe that it has φ(I) = 0. Therefore, in transforming P into I the goal is to remove all breakpoints. Any single reversal can remove at most 2 breakpoints; hence the upper bound follows. For the lower bound, we must consider increasing strips (sequences of the form j, j + 1, j + 2, ...) and decreasing strips (sequences of the form j, j − 1, j − 2, ...). It can be shown that (i) any permutation with a decreasing strip has a reversal that removes 1 breakpoint; (ii) if every possible reversal on a permutation with a decreasing strip leaves no decreasing strips, then there is a reversal which removes 2 breakpoints; (iii) any permutation with only increasing strips has a reversal that does not affect the number of breakpoints but creates decreasing strips. With these facts, all breakpoints can be removed using at most as many reversals as there are breakpoints: greedily look for reversals which remove as many breakpoints as possible, and favour those that leave decreasing strips. Whenever there is no way to retain decreasing strips, case (ii) applies, so we have an extra 'credit' to pay for the next move which will be of case (iii). The first reversal is 'paid for' by observing that the final reversal always removes two breakpoints. ✷

We now show how to adapt this method to give an embedding of reversal distance into the Hamming distance. Firstly we extend our notion of a reversal breakpoint.

Definition 3.2.5 Define a Reversal Breakpoint of P relative to Q as a location, i, where the symbol following P[i] in P is not adjacent to P[i] where it occurs in Q. Formally, this is when |Q^{-1}[P[i]] − Q^{-1}[P[i + 1]]| ≠ 1. We denote the total number of such breakpoints as φ(P, Q).

Next, we give the embedding into the Hamming distance by defining a new vector transformation of a permutation.

Definition 3.2.6 We define a two dimensional matrix, R(P), as a binary matrix of size (n + 2) × (n + 2). For all 0 ≤ i < j ≤ n + 1, R(P)[i, j] = 1 if i is adjacent to j in P; otherwise, R(P)[i, j] = 0. So

j > i ∧ |P^{-1}[i] − P^{-1}[j]| = 1 ⟹ R(P)[i, j] = 1
otherwise ⟹ R(P)[i, j] = 0

A few simple observations about the matrix R(P) follow: it has n + 1 non-zero entries if P is a permutation of n items. From R(P) it is possible to reconstruct P, so P is unique for R(P).

Theorem 3.2.2 r(P, Q) ≤ ½||R(P) − R(Q)||_H ≤ 2r(P, Q)

Proof. We show that the reversal breakpoints of P relative to Q are closely related to the earlier notion of a reversal breakpoint. Clearly, if P = Q, then φ(P, Q) = 0, and this is the only way in which the count is zero. So in transforming P into Q using reversals, our goal is to reduce φ to zero. Now consider relabelling Q as the identity permutation, and applying this same relabelling to P, generating Q^{-1} ◦ P. Since we have only changed the labels, this does not affect the number of reversal breakpoints, since these do not depend on the names of the labels. Hence φ(P, Q) = φ(Q^{-1} ◦ P, I). But from the definitions, φ(Q^{-1} ◦ P, I) = φ(Q^{-1} ◦ P). Also, as we have already mentioned, r(P, Q) = r(Q^{-1} ◦ P, I) (this is tacitly assumed by any approach that relabels permutations so that the target is the identity permutation). Hence by using Theorem 3.2.1 above, r(P, Q) ≤ φ(P, Q) ≤ 2r(P, Q). It remains to show that ||R(P) − R(Q)||_H = 2φ(P, Q). Suppose that R(P)[i, j] = 1 and R(Q)[i, j] = 0. This means that i and j are adjacent in P but not in Q. If we sum the number of distinct pairs i, j which are adjacent in P but not in Q, then this finds φ(P, Q). This is because every breakpoint will generate such a pair, and such pairs can only arise from breakpoints. An identical argument follows when R(P)[i, j] = 0 and R(Q)[i, j] = 1, yielding φ(Q, P). Since φ(P, Q) = φ(Q, P), it follows that ||R(P) − R(Q)||_H counts each breakpoint exactly twice. ✷

Example. Observe that P and Q can be parsed canonically into maximal contiguous subsequences that are common to both, either forwards or backwards. For example, if Q = 024371658 and P = 023756148, then the parsing of Q is (0 2)(4)(3 7)(1 6 5)(8), and P is (0 2)(3 7)(5 6 1)(4)(8). If we choose to fix Q and relabel P accordingly, then in this example P becomes (0 1)(3 4)(7 6 5)(2)(8). That is, this parsing tells us where the reversal breakpoints are. Since both sequences are parsed using the same number of segments, this tells us that necessarily φ(P, Q) = φ(Q, P). In this example, φ(P, Q) = 4.

R(P)   0 1 2 3 4 5 6 7 8     R(Q)   0 1 2 3 4 5 6 7 8
0      0 0 1 0 0 0 0 0 0     0      0 0 1 0 0 0 0 0 0
1      0 0 0 0 1 0 1 0 0     1      0 0 0 0 0 0 1 1 0
2      0 0 0 1 0 0 0 0 0     2      0 0 0 0 1 0 0 0 0
3      0 0 0 0 0 0 0 1 0     3      0 0 0 0 1 0 0 1 0
4      0 0 0 0 0 0 0 0 1     4      0 0 0 0 0 0 0 0 0
5      0 0 0 0 0 0 1 1 0     5      0 0 0 0 0 0 1 0 1
6      0 0 0 0 0 0 0 0 0     6      0 0 0 0 0 0 0 0 0
7      0 0 0 0 0 0 0 0 0     7      0 0 0 0 0 0 0 0 0
8      0 0 0 0 0 0 0 0 0     8      0 0 0 0 0 0 0 0 0

Comparing R(P) and R(Q) we see that ||R(P) − R(Q)||_H = 2φ(P, Q) = 8. So according to Theorem 3.2.2, r(P, Q) must be between 2 and 4. In fact, r(P, Q) = 3, as demonstrated by the following sequence of reversals.¹

¹The only way there could be a pair of reversals to turn P into Q would be if there were two reversals which removed two breakpoints each. There is no reversal that removes two breakpoints from P.

Q:  0  Q[1] ... Q[i−1]  Q[i] ... ... Q[n]  n+1
P:  0  Q[1] ... Q[i−1]  P[j] ... Q[i] ... ... P[n]  n+1

Figure 3.1: There is always a transposition that removes one breakpoint

023756148 −→ 023716548 −→ 024561738 −→ 024371658
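The reversal example can be checked in the same mechanical style. The sketch below (illustrative Python, not part of the thesis) represents R(P) by its non-zero entries, so the Hamming distance is a symmetric-difference size; the permutations are taken with the extended endpoints already included, as in the example above.

    def R(perm):
        # Non-zero entries of R(P): unordered adjacent pairs, stored as (min, max).
        return {(min(a, b), max(a, b)) for a, b in zip(perm, perm[1:])}

    def hamming(a, b):
        return len(a ^ b)

    P = [0, 2, 3, 7, 5, 6, 1, 4, 8]
    Q = [0, 2, 4, 3, 7, 1, 6, 5, 8]
    h = hamming(R(P), R(Q))
    print(h)                                  # 8, so phi(P, Q) = 4
    print(h // 4, "<= r(P,Q) <=", h // 2)     # the bounds 2 and 4 from Theorem 3.2.2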

3.2.3 Transposition Distance

The Transposition Distance (Definition 1.3.2) forms a very natural companion measure to the Reversal Distance. We shall give a transformation that embeds Transposition distance into Hamming distance and has an approximation factor of 2.

Definition 3.2.7 We define T(P), a binary matrix for a permutation P such that T(P)[i, j] = 1 if j immediately follows i in P. So

P^{-1}[i] + 1 = P^{-1}[j] ⟹ T(P)[i, j] = 1
otherwise ⟹ T(P)[i, j] = 0

From T(P) it is possible to uniquely determine P. Although T(P) has O(n²) entries, only n + 1 of these are set to 1; the rest are 0.

Definition 3.2.8 Define a Transposition Breakpoint in a permutation P relative to another permutation Q as a location, i, such that P[i + 1] does not immediately follow P[i] when it occurs in Q,² that is Q^{-1}[P[i]] + 1 ≠ Q^{-1}[P[i + 1]]. Let the total number of such transposition breakpoints between P and Q be denoted as tb(P, Q).

Theorem 3.2.3 t(P, Q) ≤ ½||T(P) − T(Q)||_H ≤ 3t(P, Q)

Proof. Observe that to convert P to Q we must remove all breakpoints, since tb(Q, Q) = 0. A single transposition affects three locations and so could 'fix' at most three breakpoints — this gives a lower bound. Also, we can always fix at least one breakpoint per transposition using the trivial greedy algorithm: find the first location where P[i] ≠ Q[i], searching from the left. Find P^{-1}[Q[i]] (it must be at a location greater than i), and choose a block starting there and extending to the next transposition breakpoint. Move this block into place so that now Q[i − 1] is next to Q[i]. This transposition removes the transposition breakpoint caused because Q[i] was not next to Q[i − 1] in P. We know that there was also a transposition breakpoint at the location where Q[i] was in P, and at the end of the move. Hence, we cannot have introduced any new transposition breakpoints into P and we have removed one. This is always possible, and so we have the upper bound. This is illustrated in Figure 3.1. Hence t(P, Q) ≤ tb(P, Q) ≤ 3t(P, Q). We now need to show that ||T(P) − T(Q)||_H = 2tb(P, Q). T(P)[i, j] = 1 and T(Q)[i, j] = 0 if and only if there is a transposition breakpoint in Q at the location of i, so summing these contributions generates tb(P, Q). A symmetrical argument holds when T(P)[i, j] = 0 and T(Q)[i, j] = 1. Because tb(P, Q) = tb(Q, P), these two cases summed generate exactly ||T(P) − T(Q)||_H = 2tb(P, Q). ✷

This is sufficient to show that we can find a 3-approximation. In fact, this can be improved using the work of others.

2 As usual, we extend all permutations so that the first symbol is 0 and their last is n +1.

Theorem 3.2.4 Any method which can sort a permutation by removing b transposition breakpoints in m moves means that t(P, Q) ≤ (m/2b)||T(P) − T(Q)||_H ≤ (3m/b)·t(P, Q)

Proof. Relabelling both permutations consistently does not change the transposition distance between them. So, by analogy with reversal distance, tb(P, Q) = tb(Q^{-1} ◦ P, I) = ½||T(P) − T(Q)||_H = ½||T(Q^{-1} ◦ P) − T(I)||_H. In other words, ½||T(P) − T(Q)||_H counts exactly the number of transposition breakpoints between the permutation Q^{-1} ◦ P and the identity permutation. Any method which guarantees to remove b transposition breakpoints in m moves will be able to sort this permutation in m·tb(P, Q)/b moves, and we know that, since no more than 3 breakpoints can be removed in any move, at least tb(P, Q)/3 moves are necessary. Also, every move that is made on Q^{-1} ◦ P can be carried out on P, and will remove the same number of breakpoints. So t(P, Q) ≤ (m/2b)||T(P) − T(Q)||_H ≤ (3m/b)·t(P, Q) as required. ✷

Observe that our original theorem is the case where b = m =1.

Corollary 3.2.1 t(P, Q) ≤ ⅓||T(P) − T(Q)||_H ≤ 2t(P, Q)

This corollary follows from the work presented in [EEK+01], which proves in Lemma 5.1 that in two transpositions it is always possible to remove three breakpoints, setting b = 3 and m = 2 in the above theorem. Consequently, t(P, Q) ≤ ⅔·tb(P, Q).

Example. Suppose P = 043167528 and Q = 031647258. Then as before we can parse P into contiguous subsequences of Q with the gaps corresponding to the transposition breakpoints: P = (0) (4) (3 1 6) (7) (5) (2) (8). So tb(P, Q) = 6. From the above theorem and corollary, this means that t(P, Q) is between 2 and 4.

T(P)   0 1 2 3 4 5 6 7 8     T(Q)   0 1 2 3 4 5 6 7 8
0      0 0 0 0 1 0 0 0 0     0      0 0 0 1 0 0 0 0 0
1      0 0 0 0 0 0 1 0 0     1      0 0 0 0 0 0 1 0 0
2      0 0 0 0 0 0 0 0 1     2      0 0 0 0 0 1 0 0 0
3      0 1 0 0 0 0 0 0 0     3      0 1 0 0 0 0 0 0 0
4      0 0 0 1 0 0 0 0 0     4      0 0 0 0 0 0 0 1 0
5      0 0 1 0 0 0 0 0 0     5      0 0 0 0 0 0 0 0 1
6      0 0 0 0 0 0 0 1 0     6      0 0 0 0 1 0 0 0 0
7      0 0 0 0 0 1 0 0 0     7      0 0 1 0 0 0 0 0 0
8      0 0 0 0 0 0 0 0 0     8      0 0 0 0 0 0 0 0 0

||T(P) − T(Q)||_H = 2tb(P, Q) = 12, and so 2 ≤ t(P, Q) ≤ 4. In this case, the lower bound is tight: we need only two moves to transform P into Q:

043167528 −→ 031647528 −→ 031647258
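As before, a few lines suffice to check this example: T(P) is represented by its non-zero entries (ordered adjacent pairs), and the symmetric difference gives the Hamming distance. The sketch is illustrative Python only.

    def T(perm):
        # Non-zero entries of T(P): ordered pairs (i, j) with j immediately following i.
        return set(zip(perm, perm[1:]))

    P = [0, 4, 3, 1, 6, 7, 5, 2, 8]
    Q = [0, 3, 1, 6, 4, 7, 2, 5, 8]
    h = len(T(P) ^ T(Q))
    print(h)                                  # 12, so tb(P, Q) = 6
    print(h // 6, "<= t(P,Q) <=", h // 3)     # 2 <= t(P,Q) <= 4, by Corollary 3.2.1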

Why this approach won’t work for strings

As with Swap distance, we might hope that looking at pairs of characters might extend to equivalent string distances. Sadly, this is not the case. Consider the two strings x = ababab...acacac...a and y = abacabaca...a. We can arrange it so that if we just consider adjacent pairs of characters, as we have for reversal distances and transposition distances, then these two strings are indistinguishable: the number of adjacent pairs (a, b), (b, a), (a, c) and (c, a) is the same. But their transposition or reversal distance is Ω(n). So to be able to handle string distances under this general approach of encoding features of the string in a vector fashion (with bits indicating the presence or absence of features in the string), we need to look at more than just character adjacencies: substrings of all lengths may need to be

considered. This will have implications on the quality of approximations possible for these distances, since editing operations could affect either many such features, or very few.

3.2.4 Permutation Edit Distance

The Permutation Edit distance or Ulam Metric, d(P, Q) (Definition 1.3.4) is closely related to the longest common subsequence, and in turn to the longest increasing subsequence. It can be found exactly in time O(n log log n) using an appropriate data structure [HS77]. Somewhat contrarily, we will show an approximation scheme for this distance with a log n factor,3 because our goal is to produce an embedding which is computable in the sketch model, where the Longest Increasing Subsequence algorithm will not work. This embeds the distance into the Intersection size. Unlike the earlier embeddings, this is not symmetric, and uses two different matrix transformations.
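For reference, the exact computation is straightforward outside the sketch model: relabel Q to the identity, apply the same relabelling to P, and then d(P, Q) = n − LIS(Q^{-1} ◦ P). The sketch below is illustrative Python only; it uses patience sorting, which is O(n log n) rather than the O(n log log n) of [HS77] (the latter needs a more specialised data structure).

    from bisect import bisect_left

    def ulam_distance(P, Q):
        # d(P, Q) = n - length of the longest increasing subsequence of Q^{-1} o P.
        qinv = {sym: k for k, sym in enumerate(Q)}
        relabelled = [qinv[sym] for sym in P]
        tails = []   # tails[k] = smallest tail of an increasing subsequence of length k+1
        for v in relabelled:
            k = bisect_left(tails, v)
            if k == len(tails):
                tails.append(v)
            else:
                tails[k] = v
        return len(P) - len(tails)

    # The example used in Figure 3.2 below: d(P, Q) = 3.
    print(ulam_distance([5, 2, 3, 4, 1, 7, 6, 8], [5, 8, 3, 1, 2, 7, 6, 4]))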

Definition 3.2.9 We shall define A(P) as an n × n binary matrix derived from a permutation P of length n. A_k(P)[i, j] is set to one if a symbol i occurs a distance of exactly 2^k before j in P. Otherwise, A_k(P)[i, j] = 0. A(P) is formed by taking the union of the matrices A_0 ... A_{⌈log n⌉−1}. That is,

∃k ∈ ℕ : (P^{-1}[i] + 2^k = P^{-1}[j]) ⟹ A(P)[i, j] = 1
otherwise ⟹ A(P)[i, j] = 0

Definition 3.2.10 Let B(Q) be an n × n binary matrix defined on a permutation Q such that B(Q)[i, j] is zero if i occurs before j in Q. Otherwise B(Q)[i, j]=1. Thus,

Q^{-1}[i] < Q^{-1}[j] ⟹ B(Q)[i, j] = 0
otherwise ⟹ B(Q)[i, j] = 1

Note that A(P) is a binary matrix, and n⌈log n⌉ − 2^{⌈log n⌉} + 1 entries are 1. B(Q) is also a binary matrix, and n²/2 + n/2 entries are 1.

Definition 3.2.11 Finally, define D(P, Q) as the size of the intersection between A(P) and B(Q). Put another way, this intersection can be calculated using multiplication of the elements of the matrices, pairwise:

D(P, Q) = Σ_{i,j} (A(P)[i, j] × B(Q)[i, j])

The effect of this intersection is shown graphically in Figure 3.2.
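The construction is easy to state operationally; the following illustrative Python (not part of the thesis) computes D(P, Q) for the example used in Figure 3.2, without ever building the full matrices.

    def D(P, Q):
        # D(P, Q) = |A(P) intersect B(Q)|: count pairs (i, j) where j occurs
        # exactly 2^k positions after i in P, but i does not occur before j in Q.
        n = len(P)
        p_pos = {sym: k for k, sym in enumerate(P)}
        q_pos = {sym: k for k, sym in enumerate(Q)}
        count, k = 0, 1
        while k < n:
            for i in P:
                z = p_pos[i] + k
                if z < n:
                    j = P[z]
                    if q_pos[i] >= q_pos[j]:   # B(Q)[i, j] = 1
                        count += 1
            k *= 2
        return count

    P = [5, 2, 3, 4, 1, 7, 6, 8]
    Q = [5, 8, 3, 1, 2, 7, 6, 4]
    print(D(P, Q))    # 6, while d(P, Q) = 3, within the bounds of Theorem 3.2.5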

Theorem 3.2.5 d(P, Q) ≤ D(P, Q) ≤ ⌈log n⌉ · d(P, Q).

Proof. (i) D(P, Q) ≤ ⌈log n⌉ · d(P, Q). Consider the pairs (i, j) such that A(P)[i, j] = B(Q)[i, j] = 1. The number of such pairs is exactly D(P, Q). Each of these pairs has i occurring before j in P, but the other way round in Q, and so one of either i or j must be moved to turn P into Q. So in effect, these pairs represent a "to-do" list of changes that must be made. By construction of A, any symbol i appears at most ⌈log n⌉ times amongst these pairs. Hence whenever a move is made, at most ⌈log n⌉ pairs can be removed from this to-do list. It therefore follows that in each move, D can change by at most ⌈log n⌉. If at every step we change D by at most ⌈log n⌉, then this bounds the minimum number of operations possible to transform P into Q as d(P, Q)·⌈log n⌉ ≥ D(P, Q). (ii) d(P, Q) ≤ D(P, Q). We shall show the bound by concentrating on the fact that an optimal edit sequence preserves a Longest

3Here and throughout, all logarithms are taken to base 2

P: 5 2 3 4 1 7 6 8
Q: 5 8 3 1 2 7 6 4
Two permutations, P and Q. A longest common subsequence has length 5 (for example 5 3 1 7 6), so the permutation editing distance, d(P, Q), is 3. We can illustrate D(P, Q) by marking on P all of the pairs which contribute to the intersection size of the transforms: the contributing pairs are (2, 3), (4, 1), (4, 7), (4, 8), (7, 8) and (6, 8). There are 6 such pairs and so D(P, Q) = 6. This is within the approximation bounds, since d(P, Q) = 3 ≤ 6 ≤ 3 × 3.

Figure 3.2: Illustrating Permutation Edit Distance embedding

Common Subsequence of the two sequences. Note that an optimal edit sequence will have length n − LCS(P, Q): every symbol that is not moved must form part of a common subsequence of P and Q, and so an optimal edit scheme will ensure that this common subsequence is as long as possible. Consider the relabelling of Q so that for all i, Q[i] is relabelled with i. We analyse the effect of applying this relabelling to P and examine its longest increasing subsequence. Call this relabelled sequence P′. It is defined as P′[i] = Q^{-1}[P[i]]. Clearly, every common subsequence of P and Q corresponds to a common subsequence of P′ and I since we have just relabelled distinct symbols. In particular, the length of the longest common subsequence of P and Q is not altered. Because Q is replaced by a strictly increasing sequence, it follows that each Longest Common Subsequence of P and Q corresponds exactly to each Longest Increasing Subsequence of P′, whose length is denoted by LIS(P′). What D(P, Q) told us was that we should count 1 if a symbol is 2^k to the right of symbol i's location in P but is anywhere to the left of i in Q. When we relabel according to Q, exactly the same location pairs will contribute to D(P′, I) as contributed to D(P, Q), since we have just given new names to symbols. Because we are dealing with I, this means we count 1 for each pair of symbols i, j where i > j and j occurs 2^k symbols to the right of i for some integer k. This we can derive from the definition: A(P′)[i, j] only counts 1 when i and j are separated by 2^k symbols, and B(I) counts when I^{-1}[i] > I^{-1}[j]. Since I^{-1}[x] = x for any x in the permutation, this simplifies to i > j. In order to characterise the Longest Increasing Subsequence of P′, we shall split P′ into two subsequences, one of which consists only of the symbols at odd locations in P′, and the other of the symbols which occur at even locations. Let odd(P′) be the sequence formed as P′[1] P′[3] ... P′[2i − 1] ... P′[2⌊(|P′| + 1)/2⌋ − 1], and similarly even(P′) = P′[2] ... P′[2i] ... P′[2⌊|P′|/2⌋]. Symbols of P′ will now be referred to as 'odd symbols' or 'even symbols': this refers only to their location, not whether the value of a symbol is odd or even. Suppose LIS(odd(P′)) is the length of a longest increasing subsequence of symbols at odd locations in P′, and LIS(even(P′)) is similarly defined for the even symbols. Define b(P′) as the number of locations ('sequence breakpoints') where P′[i] > P′[i + 1]. Formally then, b(P′) = |{i | P′[i] > P′[i + 1]}|.

Lemma 3.2.3 LIS(P′) ≥ LIS(odd(P′)) + LIS(even(P′)) − b(P′).

Proof. Let S_even represent an increasing sequence of even symbols whose length is LIS(even(P′)), and define S_odd similarly. We shall see how we can build a longer increasing subsequence starting from each of the subsequences of even and odd symbols. Consider a symbol of S_even, P′[i], and the subsequent symbol of S_even, P′[j]. There is at least one odd symbol (possibly more) separating these two symbols when they occur in P′.

We notionally relabel Q to the identity permutation, and apply the same relabelling to P to get P′.
Q: 5 8 3 1 2 7 6 4 −→ 1 2 3 4 5 6 7 8
P: 5 2 3 4 1 7 6 8 −→ 1 5 3 8 4 6 7 2

We can then separate P′ into two interleaved sequences of odd and even locations, and mark the breakpoints between adjacent elements:
odd(P′): 1 3 4 7
even(P′): 5 8 6 2

The longest increasing subsequence of odd(P′) is the whole sequence, 1 3 4 7. We can attempt to extend the increasing subsequence by inserting members of even(P′): between 1 and 3 there is 5, which cannot extend the subsequence, and there is a breakpoint between 5 and 3; likewise, 8 cannot go between 3 and 4, and there is a breakpoint between 8 and 4. 6 can go between 4 and 7, but 2 cannot go after 7, and there is a breakpoint between 7 and 2. Hence LIS(P′) ≥ 2LIS(odd(P′)) − b(P′) since LIS(P′) = 5, LIS(odd(P′)) = 4 and b(P′) = 3. Similarly, LIS(even(P′)) = 2, so LIS(P′) ≥ 2LIS(even(P′)) − b(P′). Together, LIS(P′) ≥ LIS(odd(P′)) + LIS(even(P′)) − b(P′).

Figure 3.3: Analysing the Longest Increasing Sequence

Now, either all odd symbols that occur at locations between i and j have values between P′[i] and P′[j], in which case we could extend the increasing sequence S_even by including at least one of these symbols; or else there is at least one symbol which is less than P′[i] or greater than P′[j], in which case there is a contribution of at least one to b(P′) involving this intervening symbol. This allows us to conclude that from the increasing sequence S_even we can form an increasing sequence of length at least 2LIS(even(P′)) − b(P′), as there are LIS(even(P′)) − 1 consecutive pairs of symbols from S_even, and in addition we can also consider the sequence before the first symbol of S_even. Similarly, from S_odd, we can find an increasing sequence of length at least 2LIS(odd(P′)) − 1 − b(P′). Further, depending on whether |P′| is odd or even, we can always increase one of these bounds by 1, by considering the effect of the last member of S_odd and the subsequent even symbols if |P′| is even, or the effect with the last of S_even and subsequent odd symbols if |P′| is odd. We know that each of these generated increasing sequences of P′ is of length at most LIS(P′) by definition of LIS(P′). Summing these, we find that 2LIS(odd(P′)) + 2LIS(even(P′)) − 2b(P′) ≤ 2LIS(P′). This process is illustrated in Figure 3.3. ✷

We now introduce some new terminology to assist in the proof.

Definition 3.2.12 P′_{x,y} is the subsequence of P′ given by P′[1·2^x − y] P′[2·2^x − y] ... P′[w·2^x − y] ... P′[2^x⌊(|P′| + y)/2^x⌋ − y]

Observe that P′_{0,0} = P′ while odd(P′) = P′_{1,1} and even(P′) = P′_{1,0}. We now make use of the following lemma:

Lemma 3.2.4 LIS(P′_{x,y}) ≥ LIS(P′_{x+1,y}) + LIS(P′_{x+1,y+2^x}) − b(P′_{x,y})

Proof. Observe that P′_{x,y} is the sequence P′[w·2^x − y] for integer w = 1, 2, 3, .... So odd(P′_{x,y}) consists of those elements of P′_{x,y} indexed by an odd value of w. Hence odd(P′_{x,y}) = P′[(2v − 1)·2^x − y] = P′[v·2^{x+1} − y − 2^x] for integer v = 1, 2, 3, ..., which defines P′_{x+1,y+2^x}. Similarly, even(P′_{x,y}) consists of P′[(2v)·2^x − y] = P′[v·2^{x+1} − y], which defines P′_{x+1,y}. The proof follows by taking Lemma 3.2.3 and substituting P′_{x,y} for the sequence P′, and using the two above observations on even() and odd(). ✷

Lemma 3.2.5   log|P | 2x−1  b(Px,y) = D(P, Q) x=0 y=0

Proof.

Σ_x Σ_y b(P′_{x,y}) = Σ_x Σ_{y=0}^{2^x − 1} |{w | P′[w·2^x − y] > P′[(w + 1)·2^x − y]}|
                   = Σ_x |{z | P′[z] > P′[z + 2^x]}|
                   = Σ_x |{z | Q^{-1}[P[z]] > Q^{-1}[P[z + 2^x]]}|
                   = Σ_x |{(i, j) | (Q^{-1}[i] > Q^{-1}[j]) ∧ (P[z] = i) ∧ (P[z + 2^x] = j)}|
                   = Σ_x |{(i, j) | (P^{-1}[i] + 2^x = P^{-1}[j]) ∧ (Q^{-1}[i] > Q^{-1}[j])}|
                   = Σ_x Σ_{i,j} A_x(P)[i, j] × B(Q)[i, j]
                   = Σ_{i,j} A(P)[i, j] × B(Q)[i, j]
                   = D(P, Q) ✷

To complete the proof, we apply Lemma 3.2.4 repeatedly to P′ = P′_{0,0}, and the sequences generated thereby, until we can split the sequences P′_{x,y} no further. After the last split, all that remains are |P′| = |P| single symbols, which each constitute a trivial increasing subsequence of length one. Telescoping the inequality of Lemma 3.2.3, we find that

LIS(P′_{0,0}) ≥ LIS(P′_{1,0}) + LIS(P′_{1,1}) − b(P′_{0,0})
             ≥ LIS(P′_{2,0}) + LIS(P′_{2,2}) − b(P′_{1,0}) + LIS(P′_{2,1}) + LIS(P′_{2,3}) − b(P′_{1,1}) − b(P′_{0,0})
             ...
             ≥ Σ_{w=0}^{|P′|−1} LIS(P′_{⌈log |P′|⌉, w}) − Σ_{i=0}^{⌈log |P′|⌉} Σ_{j=0}^{2^i − 1} b(P′_{i,j})
             = Σ_{w=0}^{|P′|−1} LIS(P′[w + 1]) − D(P, Q)
             = |P′| − D(P, Q)

Hence we conclude that LCS(P, Q) = LIS(P′) ≥ |P′| − D(P, Q). Rearranging and substituting, we find D(P, Q) ≥ n − LCS(P, Q) = d(P, Q), as required. ✷

We observe that these bounds can be tight: consider the distance between the permutation P defined by P[2i] = 2i − 1, P[2i − 1] = 2i, and the identity permutation I. Then we have D(P, I) = d(P, I) = |P|/2, achieving the lower bound. On the other hand, take Q defined by Q[|Q|] = 1 and Q[i] = i + 1 for all 1 ≤ i < |Q|. Then d(Q, I) = 1 but D(Q, I) = ⌈log |Q|⌉. This achieves the upper bound.

3.2.5 Hardness of Estimating Permutation Distances

For permutation edit distance, we have seen how to transform this problem into one of finding intersection size. Although we can make sketches for the intersection size as described in Section 2.3.3, these are only additive approximations, and so are only accurate when the intersection size is relatively large. Ideally, we should instead like to embed into a space like the Hamming space, where we know we can make better approximations irrespective of the size of the intersection. Figure 3.4 gives an exact

P:    123        132        213        231        312        321
f(P): [0,0,0,0]  [0,0,1,1]  [1,0,0,1]  [1,1,0,0]  [0,1,1,0]  [1,1,1,1]

The top graph illustrates the structure of the metric space, by drawing the graph of permutations at edit distance one from each other. Below it is a table showing an isometric embedding of permutations into the Hamming space.

Figure 3.4: An exact embedding of Permutation Edit Distance into Hamming distance


Figure 3.5: A set of permutations that forms K2,3 under permutation edit distance

embedding of Permutation Edit Distance on permutations of size 3 into the Hamming distance. Here, ||f(P ) − f(Q)||H =2d(P, Q). However, for permutations of length four or more, we can prove a lower bound on the distortion of any embedding of this metric into L1 space.

Theorem 3.2.6 Any embedding of permutation distance on permutations of length 4 or more into L1 will have a distortion of at least 4/3.

Proof. We consider “forbidden subgraphs” which are difficult to embed into the target space. The graph K2,3 — the complete bipartite graph between one set of two nodes and another of three nodes — cannot be embedded into L1 space with distortion better than 4/3 [Mat02]. Figure 3.5 shows such a subgraph of permutation edit space: therefore for permutations of length four, a distortion of 4/3 is unavoidable. For longer permutations of length n, we can use the same example and simply append 5 ...n to the end of each permutation shown. Therefore the lower bound applies to all permutations of length ≥ 4. ✷

Note that the Hamming distance is a subspace of L1 — that is, if we consider the L1 distance restricted to considering only binary vectors, then this is exactly the Hamming distance. So this bound also applies to embeddings into Hamming space, since it applies to any embedding into L1 space, which includes all embeddings into Hamming space. This lower bound does not apply to Transposition distance — although each permutation edit operation is a transposition, when we look at the permutations 1234 and 3412, we see that their transposition distance is 1, and so the example given does not form K2,3 under transposition distance. Similarly for Euclidean distance, the star graph K1,3 cannot be embedded into Euclidean space with distortion better than √(4/3).⁴ This can give a lower bound of √(4/3) for embeddings of Permutation Edit Distance, Transposition Distance and Reversal distance into Euclidean space. However, this is less interesting, since our earlier embeddings were all into Hamming space and L1 space, which are not affected by this bound.

⁴This observation is folklore, but follows from simple geometry. Let the three points in Euclidean space representing the degree one nodes be a, b and c, and take the point representing the degree three node to be the origin. We consider d, the minimum (Euclidean) distance between any pair of a, b, c, and let the maximum distance from the degree three node to each of a, b and c be m. Then d² ≤ ⅓((a − b)·(a − b) + (b − c)·(b − c) + (a − c)·(a − c)) = ⅓(2(a² + b² + c²) − 2(a·b + b·c + a·c)) = ⅓(3(a² + b² + c²) − (a + b + c)·(a + b + c)) ≤ a² + b² + c² ≤ 3m². Hence d/m ≤ √3; however, in the graph this ratio is 2. Therefore, the distortion is lower bounded by 2/√3 = √(4/3).

                 i_a i_b i_c    i_a i_c i_b
i_c i_b i_a          2              1
i_b i_a i_c          1              1

Figure 3.6: Permutation Edit Distance between the two pairs of possibilities

These lower bounds on embedding Permutation Edit distance into sketchable spaces do not rule out the possibility that there is an alternative way to make sketches for this metric. However we will now give some further evidence which suggests that this will not be easy. We show that estimating this distance in the same communication setting is provably hard, by doing the reverse embedding of intersection size into permutation distance.

Theorem 3.2.7 Estimating the permutation edit distance with any constant probability 1 − δ requires Ω(n) bits of communication.

Proof. We shall show that if we could estimate the permutation edit distance in a communication complexity setting then we could estimate the intersection size with the same probability. Let A and B each hold a bit-string, a and b respectively, of length n. They wish to communicate in order to discover the value of |a ∧ b| with some constant probability 1 − δ. We use the following transformation into permutation edit distance: A considers each i = 1, 2, ..., n in order, and begins to construct an output permutation, which is initially empty. If a_i = 1 then A appends the sequence i_a i_b i_c to the permutation; otherwise, A appends i_a i_c i_b. B performs a similar transformation as follows: for each i ≤ n, if b_i = 1 then B appends i_c i_b i_a, else B appends i_b i_a i_c. If we consider the permutation edit distance between these generated permutations, we observe that for each i, there is a contribution of 2 to the permutation edit distance if a_i = 1 ∧ b_i = 1, and 1 otherwise. This is laid out in Figure 3.6. Hence the permutation edit distance is n + |a ∧ b|, and so if any communication scheme could estimate the permutation edit distance with constant probability, then it could also estimate the intersection size with constant probability. Because Intersection Size has been shown to require Ω(n) bits of communication in the randomized communication complexity model (see Section 2.3.3 for details), it follows that permutation edit distance on permutations of length n requires Ω(n/3) = Ω(n) bits also. ✷
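The reduction is concrete enough to test on small inputs. The sketch below is illustrative Python only (the permutation edit distance is computed directly as 3n minus the LCS length, via a simple dynamic program): it builds the two permutations from bit-strings a and b and checks that their distance is n + |a ∧ b|.

    def gadget(bits, role):
        # A's blocks: (i_a, i_b, i_c) if a_i = 1, else (i_a, i_c, i_b).
        # B's blocks: (i_c, i_b, i_a) if b_i = 1, else (i_b, i_a, i_c).
        out = []
        for i, bit in enumerate(bits):
            if role == "A":
                out += [(i, "a"), (i, "b"), (i, "c")] if bit else [(i, "a"), (i, "c"), (i, "b")]
            else:
                out += [(i, "c"), (i, "b"), (i, "a")] if bit else [(i, "b"), (i, "a"), (i, "c")]
        return out

    def ulam(P, Q):
        # Permutation edit distance as n - LCS(P, Q), via an O(n^2) dynamic program.
        n = len(P)
        lcs = [[0] * (n + 1) for _ in range(n + 1)]
        for x in range(n):
            for y in range(n):
                lcs[x + 1][y + 1] = (lcs[x][y] + 1 if P[x] == Q[y]
                                     else max(lcs[x][y + 1], lcs[x + 1][y]))
        return n - lcs[n][n]

    a, b = [1, 0, 1, 1, 0], [1, 1, 0, 1, 0]
    d = ulam(gadget(a, "A"), gadget(b, "B"))
    print(d, len(a) + sum(x & y for x, y in zip(a, b)))   # both 7 = n + |a AND b|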

However, this is only a negative result for estimating the permutation edit distance; it does not extend to showing that it is hard to approximate it (see Section 2.1 for the distinction between an estimation and an approximation). This gap leaves scope either for a way to approximate the permutation edit distance, or an improved hardness proof that shows there can be no such approximation. The next result is a much stronger negative result for the dual problem to Permutation Edit Distance, that of longest common subsequence.

Theorem 3.2.8 Approximating the Longest Common Subsequence of two sequences drawn from a universe of size n with no repetitions in a communication complexity model requires Ω(n) bits of communication.

Proof. We show by contradiction that such a scheme would imply the existence of an efficient scheme to approximate intersection size. Consider two subsets of {1 ...n}, A and B, and form a sequence from each as the elements of A (and B) in sorted order. The longest common subsequence of these sequences is exactly the intersection size, |A ∩ B|. Since finding intersection size requires Ω(n) bits of communication (Section 2.3.3) in the probabilistic model, then the longest common subsequence must also require this much communication. ✷

Corollary 3.2.2 Approximating the Longest Common Subsequence of two strings on an alphabet σ requires Ω(|σ|) bits of communication.

The corollary follows immediately by considering a string that is a permutation of a subset of σ and using the same proof above. This corollary is weaker than the original theorem, since for strings of length O(n) drawn from a constant alphabet, the lower bound is insignificant. The main theorem says something interesting: that there is no hope for sketching to find the Longest Common Subsequence of sequences, and consequently it is unlikely that there are solutions to Approximate Nearest Neighbors and related geometric problems based on the Longest Common Subsequence measure that are more efficient than the naïve solutions based on scanning the entire collection of sequences.

3.2.6 Extensions

These embedding techniques can also be adapted for a large range of permutation distances. Thus far we have limited ourselves to strict permutations. We now consider extensions of the above embeddings for cases where we relax these requirements and mix operations: we consider the effect of allowing reversals, transpositions and edits together; we also either allow the set of symbols in each sequence to be non-identical, or additionally allow one sequence to contain repetitions, that is to be a string. To be precise, we describe the following extensions to the embeddings of Section 3.2. Each can be described in terms of the permitted operations, and the domain of the inputs:

1. Reversals, Transpositions and Edits (this distance is denoted by τ);
2. Reversals, transpositions, edits and indels between permutation pairs (distance denoted by τ′);
3. Reversals, transpositions, edits and indels between a permutation and a string (τ″);
4. Reversals or transpositions with indels (distances denoted r′ and t′);
5. Reversals or transpositions between a permutation and a string (r″ and t″).

There are many other possible combinations of operations that can be embedded into vector distances: these have very similar embeddings and are omitted. These extensions are for the most part fairly straightforward and build on results that have already been seen. Moreover, they had a certain soporific effect on readers of earlier versions of this work. To maintain the reader’s interest, these extensions have therefore been moved to an appendix (Appendix A), where they can be enjoyed at leisure. Some of the results are used in later sections, so a table summarises all the embeddings of permutation distances covered here. These are recorded in Figure 3.7, the important points being the space embedded into and the approximation factor.

3.3 Applications of the Embeddings

We have built up quite a repertoire of embeddings now. We can immediately find algorithmic applications of our embeddings. On the whole, these rely on the results for the spaces described in Chapter 2.

3.3.1 Sketching for Permutation Distances

The notion of a short sketch to approximate distances with is described in Definition 2.1.3. We shall now show how to sketch for the notions of permutation distance.

Editing Operations                        Denoted  Between        Approx  Embedding     Section
Swap                                      swap     Perm-Perm      1       Hamming       3.2.1
Swap                                      swap     String-String  1       Hamming       3.2.1
Reversal                                  r        Perm-Perm      2       Hamming       3.2.2
Reversal                                  r′       Seq-Seq        2       Hamming       A.1.2
Reversal                                  r″       Seq-String     2       L1            A.1.2
Transposition                             t        Perm-Perm      2       Hamming       3.2.3
Transposition                             t′       Seq-Seq        2       Hamming       A.1.2
Transposition                             t″       Seq-String     2       L1            A.1.2
Reversals, Transpositions, Edits          τ        Perm-Perm      3       Hamming       A.1.1
Reversals, Indels, Transpositions, Edits  τ′       Seq-Seq        3       Hamming       A.1.2
Reversals, Indels, Transpositions, Edits  τ″       Seq-String     3       L1            A.1.2
Permutation Edit Distance                 d        Perm-Perm      log n   Intersection  3.2.4

The different distances, their symbol, the objects that they relate, the approximation factor and the vector space the embedding is into. Perm-Perm means that the distance is only defined between mutual permutations; Seq-Seq means that the sequences are drawn from a universe of size ℓ with no repetitions, and Seq-String means that one is a sequence and the other is a string.

Figure 3.7: Summary of the different sequence distance measures

Reversal and Transposition Distance

Theorem 3.3.1 Sketches for the Reversal distance or Transposition distance can be computed using O((1/ε²) log(1/δ) log n) bits of storage such that the Reversal distance (respectively Transposition distance) can be estimated from a pair of sketches, accurate up to a factor of 2 + ε with probability 1 − δ.

Proof. Because we have transformations into the Hamming distance, we just have to take the bit-vector representation and apply the technique given in the proof of Theorem 2.3.1. We can then find the 2-approximation up to a factor of 1 + ε. Multiplying through, and rescaling ε, allows us to make sketches which allow the distance to be found to a 2 + ε factor. ✷

These sketches can be made in the ordered streaming model: note that our sketching method assumes that any non-specified value is zero. Since each one bit in the matrices comes from information about adjacent pairs in the permutation, we can parse the permutation as a stream of tuples, so ... i, j, k ... is viewed as ... (i, j), (j, k) .... Although the matrix space is O(n²), the size of the synopsis will of course depend only on log n. The processing time is dependent only on the linear number of non-zero entries in our matrices, rather than the total quadratic number of entries. These same results also apply to any of the embeddings listed in Figure 3.7 that embed into spaces that we can sketch (Hamming or L1). If the approximation in the embedding is c, then we can sketch these distances accurately up to an approximation factor of c(1 + ε).
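The streaming view can be illustrated with a small sketch. The Python below is illustrative only and is not the construction of Theorem 2.3.1: it uses an AMS-style (tug-of-war) second-moment estimator, relying on the fact that for binary vectors the Hamming distance equals the squared Euclidean distance of the difference. Each adjacent pair of the permutation updates the counters directly, so the quadratic matrix is never built; the hash-based signs stand in for the independent hash families used in the formal analysis.

    import hashlib

    def sign(counter_id, coord):
        # Pseudo-random +/-1 hash for coordinate `coord` of counter `counter_id`.
        h = hashlib.sha256(f"{counter_id}:{coord}".encode()).digest()
        return 1 if h[0] & 1 else -1

    def sketch(perm, counters=64):
        # Stream the permutation as ordered adjacent pairs; each pair is one
        # non-zero coordinate of T(P) and adds +/-1 to every counter.
        c = [0] * counters
        for pair in zip(perm, perm[1:]):
            for k in range(counters):
                c[k] += sign(k, pair)
        return c

    def estimate_hamming(s1, s2):
        # Mean of squared counter differences estimates ||T(P) - T(Q)||_H in expectation.
        return sum((x - y) ** 2 for x, y in zip(s1, s2)) / len(s1)

    P = [0, 4, 3, 1, 6, 7, 5, 2, 8]
    Q = [0, 3, 1, 6, 4, 7, 2, 5, 8]
    print(estimate_hamming(sketch(P), sketch(Q)))   # close to 12 on average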

Permutation Edit Distance

We have seen that there are various hardness results in the communication complexity model for Permutation Edit Distance. The consequence of these results is that sketching in the sense of Definition 2.1.3 is not possible. Nevertheless, we can still make some weak ‘guesses’ of the distance (these are not strong enough results to be called estimates or approximations).

Theorem 3.3.2 Rough sketches for the Permutation Edit Distance can be computed of size O(log n) such that the expectation of the sketched distance is a log(2n) approximation of the edit distance.

Proof. We take advantage of the embeddings in Section 3.2.4 of Permutation Edit Distance using matrices A(P) and B(Q). These are into the intersection size. We use one of the methods outlined in Section 2.3.3 to roughly sketch the intersection size of the matrices A(P) and B(Q), which gives a log n approximation of the edit distance. We shall focus on using the Jaccard coefficient of set resemblance, since this gives a method that scales well to different set sizes. This approximates r = |A ∩ B|/|A ∪ B|, whereas we are interested in |A ∩ B|. However, in Section 3.2.4 we saw that A and B are of fixed weight, and that |A| = n⌈log₂ n⌉ − 2^{⌈log₂ n⌉} + 1 and |B| = n²/2 + n/2. Consequently, n²/2 ≤ |A ∪ B| ≤ n²/2 + n⌈log n⌉. We are interested in the value of |A ∩ B| = r|A ∪ B|. So if we use rn²/2 as the approximation of |A ∩ B|, then rn²/2 ≤ |A ∩ B| ≤ r(n²/2 + n⌈log n⌉). This is a r(n²/2 + n⌈log n⌉)/(rn²/2) = 1 + 2⌈log n⌉/n approximation. Since |A ∩ B| is a ⌈log n⌉ approximation for the edit distance, this is consequently a ⌈log n⌉ + (2/n)⌈log n⌉² approximation for the edit distance. For n ≥ 98, (2/n)⌈log n⌉² ≤ 1, hence this is a log(2n)-approximation for the permutation edit distance. The overall size of this rough sketch is the O(log n) bits for the set resemblance sketch. ✷
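A small illustration of the rough sketch follows (illustrative Python only; for clarity it materialises A(P) and B(Q) explicitly, which one would avoid in the sketch and streaming settings, and uses hash-based min-wise estimates of the Jaccard coefficient r).

    import hashlib

    def A(P):
        n, pairs, k = len(P), set(), 1
        while k < n:
            pairs |= {(P[z], P[z + k]) for z in range(n - k)}
            k *= 2
        return pairs

    def B(Q):
        pos = {sym: k for k, sym in enumerate(Q)}
        return {(i, j) for i in Q for j in Q if pos[i] >= pos[j]}

    def minhash(items, seed):
        h = lambda x: hashlib.sha256(f"{seed}:{x}".encode()).hexdigest()
        return min(h(x) for x in items)

    def rough_intersection(P, Q, reps=200):
        a, b = A(P), B(Q)
        r_hat = sum(minhash(a, s) == minhash(b, s) for s in range(reps)) / reps
        return r_hat * len(P) ** 2 / 2   # estimate of |A(P) ∩ B(Q)| via r * n^2/2

    P = [5, 2, 3, 4, 1, 7, 6, 8]
    Q = [5, 8, 3, 1, 2, 7, 6, 4]
    print(rough_intersection(P, Q))   # rough guess at D(P, Q) = 6; slack factors are large for such small n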

3.3.2 Approximating Pairwise Distances

The embeddings allow distance approximations to be made efficiently in a communication setting. We have the following scenario: there are two communicants, A and B, who each hold a permutation of {1 ...n}, P and Q respectively, and they wish to communicate in such a way as to calculate the approximate distance between their permutations.

Theorem 3.3.3 There is a single round communication protocol to allow reversal or transposition distance approximation up to a factor of 2 + ε with a message of size O((1/ε²) log(1/δ) log n) bits. The protocol succeeds with probability 1 − δ.

Proof. This follows immediately from the fact that we can sketch these distances. Any sketch can be sent as a message, and the recipient can make a sketch of their own sequence, and compare the two to find a distance approximation. Each sketch is a vector of dimension O((1/ε²) log(1/δ)) and each entry in the sketch requires O(log n) bits to represent it. We also need to send the random bits used to construct the sketch: the size of these is also O((1/ε²) log(1/δ) log n) bits. See Sections 2.2.2 and 2.2.3 for more details. ✷

Now suppose that we have a number of permutations, and we wish to be able to rapidly find the approximate distance between any pair of them. Traditional methods would suggest that for each pair, we should take one and relabel it as the identity permutation, and then solve the sorting by reversals or sorting by transpositions problem for the correspondingly relabelled permutation. We show that, given a near linear amount of pre-processing, this problem can be solved exponentially faster.

Corollary 3.3.1 With a near linear amount of pre-processing of m permutations of length n, the Reversal distance (or Transposition distance) between any pair can be approximated in time O((1/ε²) log n log m). This is a 2 + ε approximation with constant probability.

This is achieved by computing a sketch for each sequence in advance. Any pairwise distance can be approximated by comparing the sketches of the pair of interest in time linear in the size of the sketches. As in the proof of Lemma 2.2.1, we must set δ = o(m^{-2}) to ensure that over all O(m²) possible comparisons the probability of any one failing is at most constant. Again, these results apply to any of the embeddings listed in Figure 3.7 that can be sketched, with corresponding approximation factors. A similar result holds for Permutation Edit Distance using the rough sketches described above. Since the embedding is non-symmetric, we can create sketches for both B(Q) and A(P). However, we cannot make the same guarantees of accuracy, so these results are less interesting for us.

3.3.3 Approximate Nearest Neighbors and Clustering

We first saw problems of Approximate Nearest Neighbors and Clustering defined in Section 2.4. They were described in terms of a vector distance and a set of vectors. We now talk about the same problems but defined for a permutation distance and a set of permutations, and we can take Definition 1.5.2 and consider it in this setting. The problem is to pre-process a collection of permutations so that given a new query permutation, the closest permutation from the collection can be found. The crux here is to avoid the dimensionality curse: that is, design a polynomial space data structure that answers queries in time polynomial in the query size and sublinear in the collection size. Such questions may be of interest in the context of processing large numbers of (biological) sequences in order to find patterns of similarity between them, and given new sequences (representing, for example, a new virus) to quickly be able to find the sequence in the database which is most similar. These are also of interest from a more general point of view, in order to discover what can and cannot be accomplished for a variety of distances, not just the vector distances that have been the topic of most study.

Theorem 3.3.4 We can find approximate nearest neighbors under Reversal distance (respectively Transposition distance and compound distances thereof) up to a factor of 2 + ε with query time O(ℓ log ℓ · m^{1/(1+ε)}), where m is the number of sequences in the database, and ℓ the size of the universe from which sequence symbols are drawn.

Proof. This follows immediately by adapting the method for Approximate Nearest Neighbors described in Section 2.4.1. Some care is needed, since for efficiency we need to ensure that the sampling

at the root of the Locality Sensitive Hash functions used therein does not attempt to sample directly from the quadratic (O(ℓ²)) space of the matrices of the embeddings. Recall how the procedure based on Locality Sensitive Hash functions works: firstly, a number of bit locations are queried to create a new bit-string. This bit-string is then hashed into a hash table. We consider an alternative method to reach the same result: in outline, we will invert the sampling mechanism, so given a location, we can find out whether that location is included in the set of those locations that are sampled. We can then create the hash value by applying the hash function for this location. By using the linear fingerprint function described in Section 2.1.3, this will be easy to achieve: we have to add on the appropriate amount to the running total to create the Locality Sensitive Hash value. We now flesh out the details of this approach, by considering the binary matrices described in Section 3.2: these have size O(ℓ²) when we allow each symbol of our sequences to be drawn from a universe of size O(ℓ). However, we only need to consider the O(ℓ) locations where there are bits set to 1 in these matrices: all other values are 0. From these, we can compute the value of the hash function, since the 0 values add nothing to the value of the hash function and so can be omitted. We can process these 1 values efficiently, since they are generated by adjacent pairs in the permutations. Hence with a linear pass over the permutation, we can evaluate the hash function, provided we can perform the "is this location sampled" test efficiently. Recall the structure of the hash functions that we wish to compute from Section 2.4.1, for the Hamming distance: each hash function is made by picking k locations uniformly at random with replacement from the bit-string, and this is done independently l times to generate l different hash functions. If we specify each hash function as a bit-matrix by picking k locations uniformly at random and setting these to 1, then we can compute the hash efficiently: for each pair of adjacent values in the permutation, we look up this pair in each of the bit-matrices. If the ith matrix has a 1 in this location, then we are to count this pair towards the ith hash function. The overall time cost for evaluating all these hash functions is then O(l · ℓ), which gives a cost for finding approximate nearest neighbors in time O(ℓ log ℓ · m^{1/(1+ε)}). ✷
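A compressed view of the inverted-sampling idea is sketched below in illustrative Python (the parameter choices, and the hashing of the resulting bit-strings into buckets, are omitted): each of the l Hamming-LSH functions is specified by its k sampled coordinates, kept in a dictionary, so a hash value is computed by a single pass over the adjacent pairs of a permutation and never touches the O(ℓ²) matrix.

    import random

    def make_lsh_functions(universe, l, k, seed=0):
        rng = random.Random(seed)
        funcs = []
        for _ in range(l):
            # k sampled coordinates of the (implicit) adjacency matrix, chosen
            # uniformly with replacement; record which bit of the k-bit hash each feeds.
            coords = {}
            for bit in range(k):
                i, j = rng.choice(universe), rng.choice(universe)
                coords.setdefault((i, j), []).append(bit)
            funcs.append(coords)
        return funcs

    def lsh_value(perm, coords, k):
        # Evaluate one k-bit hash by scanning only the non-zero entries of T(P),
        # i.e. the adjacent pairs of the permutation.
        bits = [0] * k
        for pair in zip(perm, perm[1:]):
            for bit in coords.get(pair, ()):
                bits[bit] = 1
        return tuple(bits)

    universe = list(range(10))
    funcs = make_lsh_functions(universe, l=4, k=16)
    P = [0, 4, 3, 1, 6, 7, 5, 2, 8, 9]
    print([lsh_value(P, f, 16) for f in funcs])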

Theorem 3.3.5 2 log(2n)-Approximate Furthest Neighbors can be found under Permutation Edit Distance in expected time O(m^{1−1/(2 ln n)} ln m + n) per query with polynomial time pre-processing.

Proof. We can use the procedure for Approximate Furthest Neighbors using Locality Sensitive Hash functions described in Section 2.4.1. We can now apply some specifics of the transformation into intersection distance to show that this procedure is tractable. Recall that the parameter r = |A(P) ∩ B(Q)|/|A(P) ∪ B(Q)| is set by a binary search looking for a distance r at which we have an approximate neighbor. The range of distance values is discrete, because we are looking at the intersection size of two matrices, and so the permitted values will be polynomial in n. If r = 0 then |A(P) ∩ B(Q)| = 0 and so P = Q. This means that if we are ever testing the case r = 0, we can deal with this separately by making fingerprints of each sequence and building these into a hash table. If we are testing the case r = 0 for a query permutation Q, we first look the hash of Q up in the hash table. If it collides with the hash of any other permutation(s), then we compare the query to the permutation(s) to see if we have an exact match, and report this. Otherwise, we have a lower bound on r, r ≥ 1/n², since we are comparing two matrices of size n², and we know there is at least one location where they differ, as they are not identical. We know that there are O(m^{1−2ε/(3 ln(1/r))} log n ln m / ln(1/r)) hash function evaluations per query. By the construction of A(P) and B(Q), |A(P) ∩ B(Q)|/|A(P) ∪ B(Q)| ≤ (2 log n)/n. Putting in these bounds on r, this is O(m^{1−ε/(3 ln n)} log n ln m/(ln n − ln(2 log n))), which is O(m^{1−ε/(3 ln n)} ln m) hash function evaluations. We now have to consider the cost of evaluating each hash function. Given a random permutation R we have to find min R(A(P)). Let us assume that in the pre-processing stage, we apply these permutations to the transforms A(P) for every given permutation P. Then, at the query time we have to find min R(B(Q)). Recall that B(Q) is (conceptually) a binary matrix where half the bits are 0 and half are 1. We therefore expect that if R is chosen at random, then finding this function takes O(1) probes

of this matrix. Note that we would not actually construct the matrix B(Q), but rather apply R to B(Q) by testing the relative position of pairs in Q as listed by R. We do not have to construct and store a huge quantity of random permutations R in advance. These can be made randomly when required. Since we use R to find min(R(B(Q))) and min(R(A(P))), then for any R we expect to need the first O(log n) values from R over its whole life. So we can store a prefix of R and extend this if necessary, or else create R pseudo-randomly on the fly based on a small amount of randomness. Since the Jaccard coefficient of set resemblance gives a log(2n) approximation, we may as well set ε to some constant value, say 1. Since for any query permutation Q, we need to construct Q^{-1}, this gives the stated time bounds. ✷

Theorem 3.3.6 A set of k centers can be found under Reversal distance (respectively, Transposition distance) that is guaranteed to be within a factor of 4 + ε of the optimum clustering.

Proof. Again, we adapt existing results for the Hamming space given in Section 2.4.3. For each permutation we can compute a sketch that will allow rapid approximation of the distance. We can then use Algorithm 2.4.1 on these sketches to find a set of centers. By Theorem 2.4.3, the approximation is of the stated quality. The time of this algorithm is O(mn log m) to make the sketches and O(km log m) to find the clustering, totalling O(m log m(n + k)). Note that if we used a pairwise comparison method, then each comparison would take time at least Θ(n), and the time for this clustering would be Θ(kmn). Hence this method is superior for large enough k and n. ✷

3.3.4 Approximate Pattern Matching with Permutations

We shall show how properties of the embeddings into vector distances can be harnessed to give solutions to the problem of Approximate Pattern Matching for some of the permutation distances of interest. As described in Section 1.5.2, we want to compute for each i the value of D[i], the cost of aligning the pattern permutation against a text at position i. Since some of the distances we are considering are NP-hard to find, it would be unreasonable to ask for the exact value of D[i], and so we shall instead find approximations D̂[i]. Directly applying existing distance computations or naïvely using the transformations of our distances would be expensive; we take advantage of the fact that because the embeddings are based on pairwise comparisons, the approximate cost can be calculated incrementally with only a small amount of work. For generality, we shall insist that the pattern P is a permutation, but we shall allow the text to be a string (that is, it can contain more than one occurrence of any element). We shall make use of the Pruning Lemma (Lemma 1.5.1). Because the distances we consider can change the sequence length by one per operation, we can apply this lemma. So we only need to consider the alignment of the pattern P of length m against each substring of the text of length m to get a 2-approximation to the optimal alignment.

Theorem 3.3.7 (i) Approximate pattern matching for reversal distance can be solved in time and space O(n + m); each D[i] is approximated to a factor of 5. (ii) Approximate pattern matching for transposition distance can be solved in time and space O(n + m); each D[i] is approximated to a factor of 4.

Proof. We must allow insertions and deletions to our sequences since in this scenario we cannot insist that we will always find exact permutations at each alignment location. Therefore, we shall use the results of the extended embeddings given in Section A.1.2 using R and T . It is important to note that although these embeddings and this proof are described in terms of quadratic sized matrices, we do not construct these matrices, but instead concentrate only on the linear number of non-zero entries

in these figurative matrices. We shall prove both claims together, since we take advantage of common properties of the embeddings. Suppose we know the cost of aligning T[i ... i+m−1] against P, and we now want to find the cost for T[i+1 ... i+m]. This is equivalent to adding a character to one end of T[i ... i+m−1] and removing one from the other. So only two adjacencies are affected — at the start and end of the subsequence. This affects only a constant number of symbol pairs in our matrices. Consequently, we need only perform a constant amount of work to update our count of the number of transposition or reversal breakpoints, provided we have pre-computed the inverse of the pattern P in time O(m). To begin the process, we imagine aligning P with a location to the left of T so that there is no overlap of pattern and text. Initially, the distance between P and T[−m ... 0] is defined as m. From this position, we can advance the pattern by one location at a time and do a constant amount of work to update the count. The total time required is O(n + m). Some subtlety is needed to design a data structure to support this counting, but recall that our pattern P is a fixed sequence with no repetitions. This can be stored as a vector, with the adjacency information. We have to record how many copies of adjacent pairs have been seen in the current part of the text being analysed; these are either pairs that are in P, which can be recorded in P's data structure, or pairs classified as "not in P", which are simply counted. ✷
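A simplified Python sketch of this incremental bookkeeping for the reversal-distance case follows: the window of the text slides one position at a time, and only the adjacent pair leaving at the left end and the pair entering at the right end can change the count of breakpoints (adjacencies of the window that are not adjacencies of the pattern). The function names are illustrative, and the length corrections used in the full proof are omitted.

```python
def count_breakpoints(pattern, text):
    """For each alignment of a window of length m = len(pattern) against
    `text`, count adjacent pairs in the window that are not adjacent
    (in either order) in `pattern`.  Only the pair leaving on the left and
    the pair arriving on the right change between consecutive windows."""
    m, n = len(pattern), len(text)
    assert n >= m, "sketch assumes the text is at least as long as the pattern"
    adjacent = {frozenset((pattern[j], pattern[j + 1])) for j in range(m - 1)}

    def is_break(x, y):
        return frozenset((x, y)) not in adjacent

    # initial window text[0 : m]
    breaks = sum(is_break(text[j], text[j + 1]) for j in range(m - 1))
    counts = [breaks]
    for i in range(1, n - m + 1):
        # pair leaving at the left end, pair entering at the right end
        breaks -= is_break(text[i - 1], text[i])
        breaks += is_break(text[i + m - 2], text[i + m - 1])
        counts.append(breaks)
    return counts
```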

Theorem 3.3.8 Approximate Permutation Matching can be solved for Permutation Edit Distance in time O(n log m), approximating each D[i] up to a factor of 2 log m.

Proof. We make use of the extended transformation in Section 3.2.6 for permutation edit distance, and so our result will be accurate up to a factor of log m for each alignment; the pruning lemma then tells us that in turn these are a 2-approximation of the optimal alignment. We can use the trick of relabelling P as 12...m, and relabelling T accordingly as we go along. Suppose we have found the cost of matching T[i ... i+m−1] against P. We can advance this match by one location to the right by comparing T[i+m] with the log m locations T[i+m−1], T[i+m−2], T[i+m−4], .... Each pair of the form T[i+m−2^k] > T[i+m] that we find adds one to our total. At the same time, we maintain a record of the number of symbols in P missing from T[i ... i+m−1] (since each of these must participate in an insertion operation to transform T[i ... i+m−1] into P). This can be updated in constant time using O(|P|) space. We can step the left end of a match by one symbol in constant time if we also keep a record for each T[i] of how many comparisons it caused to fail from symbols to the right — we reduce the count by this much to find the cost of T[i+1 ... i+m] from T[i ... i+m]. In total we do O(log m) work per step, giving a total running time of O(n log m). ✷
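A fragment illustrating only the right-end update is given below: when the window grows to include a new text symbol, it is compared against the symbols at power-of-two offsets behind it, and each inverted pair adds one to the running cost. The function name is illustrative; the left-end bookkeeping and the missing-symbol correction described above are not shown, and the text is assumed to have already been relabelled so that the pattern reads 1, 2, ..., m.

```python
def right_end_increment(text, pos, m):
    """Cost added when the window grows to include text[pos]: compare
    text[pos] against the positions at offsets 1, 2, 4, ... (up to m)
    behind it, and count the inverted pairs."""
    cost = 0
    offset = 1
    while offset <= m and pos - offset >= 0:
        if text[pos - offset] > text[pos]:
            cost += 1
        offset *= 2
    return cost
```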

Of course, it may well be the case that the best alignment of P against T at location i would choose a substring whose length was not exactly m. This method is not conducive to finding the “optimal” alignment; however, note that moving the left end of the alignment has cost only O(1) per move. It is therefore possible to perform a small local search in an O(log m) sized neighbourhood around the left hand end of the alignment to find better (approximate) local alignments without affecting the asymptotic time bounds on this procedure.

3.4 Discussion

Motivated by biological scenarios, we have studied problems of computing distances between permutations as well as matching permutations in sequences, and solving permutation proximity problems. In summary, the main points of this chapter have been:

• The embeddings of permutation distances into well known spaces, such as Hamming space, L1 or the Set Intersection space using matrices that capture the relative layout of symbols in the

permutation. These embeddings are approximately distance-preserving, that is, they preserve original distances up to a small constant or logarithmic factor.

• Using these embeddings, we gained several results, including communication complexity protocols to estimate the permutation distances accurately; efficient solutions for approximately answering nearest and furthest neighbor problems with permutations; and algorithms for finding permutation distances in the streaming model.

• We considered a class of permutation matching problems which are a subset of our approximate pattern matching problems. We saw linear or near-linear time algorithms for approximately solving permutation matching problems; in contrast, the corresponding string problems take significantly longer.

Permutation editing and matching is not only of interest for Computational Biology, but also as combinatorial problems of independent interest, and as a foundational mechanism towards understanding the fundamental complexity of editing and matching strings. In general we would expect that permutations are easier to deal with than strings: certainly they are no harder, since any string algorithm is applicable to permutations, whereas the converse is not true. Next we go on to take this general idea of embedding sequence distances, and apply it to string distances. We have already seen the first string distances dealt with in this way: Hamming distance is one of our basic tools, and Swap distance proves to be easy to deal with for strings. The next string distances we shall consider will require more sophisticated techniques. Although we use the same general approach — transform a sequence into a bit-vector representation so that vector distances approximate the original string distance — the transformations now become much more involved. The dimensionality of the space being embedded into increases massively and so has to be handled with care. The quality of the approximations decreases significantly, going from small constant factors to growing (slowly) with the size of the object. We shall return to this discrepancy later.

Chapter 4

Strings and Substrings

Ages ago, Alex, Allen and Alva arrived at Antibes, and Alva allowing all, allowing anyone, against Alex's admonition, against Allen's angry assertion: another African amusement... anyhow, as all argued, an awesome African army assembled and arduously advanced against an African anthill, assiduously annihilating ant after ant, and afterward, Alex astonishingly accuses Albert as also accepting Africa's antipodal ant annexation. Albert argumentatively answers at another apartment. Answers: ants are Amesian. Ants are Amesian? Africa again: Antelopes, alligators, ants and attractive Alva, are arousing all angular Africans, also arousing author's analytically aggressive anticipations, again and again. Anyhow author apprehends Alva anatomically, affirmatively and also accurately. [Abi74]

4.1 Introduction

In this chapter we will present embeddings of some of the string distances discussed in Chapter 1 into vector distances such as Hamming distance and L1 distance. From these we are able to develop a number of solutions to geometric and pattern matching problems in the string edit distance spaces. We shall use the problem of approximate pattern matching under string distances as a motivating problem to guide the development of our solutions.

Approximate String Pattern Matching

In the Approximate String Pattern Matching problem studied in the Combinatorial Pattern Matching area, we are given a text string t of length n and a pattern string p of length m, and we wish to compute D[i], the cost of the best alignment of p against the text at each position i. The classical dynamic programming solutions take time O(mn); to improve on this quadratic cost, previous work has typically taken one of two approaches:

• Restrict the D[i]'s of interest. Specifically, the restricted goal is to only determine those i's for which D[i] is at most some threshold k.

• Consider simpler string distances.

If we restrict the string distances to exclude insertions and deletions, we obtain Hamming distance. Abrahamson [Abr87] gave an O(n√m poly-log n) time solution breaking the quadratic bound; since then, it has been improved to O(n√k poly-log n) [ALP00]. Karloff gave an O(n log^3 n) algorithm to approximate Hamming distances to a 1 + ε factor [Kar93]. This was tightened to O(n log n) by Indyk [Ind00]. Hamming distance results however sidestep the fundamental difficulty in the string edit distance problem, namely, the need to consider nontrivial alignment of the pattern against text when characters are inserted or deleted.

Results

In this chapter, we give a near linear time algorithm for solving the approximate pattern matching problem for a string distance. We consider the case where the edit distance in question is the string edit distance with moves. As usual, we will be approximating the distance rather than finding it exactly. In this case, the quality of the approximation depends on the length of the pattern sequence being considered. The resulting algorithm has to consider nontrivial alignment between the text and the pattern, and obtains a significantly subquadratic algorithm in the worst case. The main pattern matching result is a deterministic algorithm for the string edit distance matching problem with moves which runs in time O(n log n). The output is the matching array D, where each D[i] is approximated to within an O(log n log* n) factor. The approach relies on an embedding of strings into vectors in the L1 space. The L1 distance between two such vectors is an O(log n log* n)

approximation of the string edit distance with moves between the two original strings. This is a general approach, and can be used to solve many other questions of interest beyond the core approximate pattern matching problem. These include string similarity search problems such as indexing for string nearest neighbors, outliers, clustering and so on, under this metric of edit distance with moves. We are also able to vary the embedding and give results on related block edit distances such as the compression distance. All of these results rely at the core on a few components. Foremost is a parsing of strings into a hierarchy of substrings. This relies on deterministic coin tossing (also known as local symmetry breaking), a well known technique in parallel algorithms [CV86, GPS87, AM91] with applications to string algorithms [SV94, SV96, MSU97, CPSV00, ABR00, MS00]. In its application to string matching, precise methods for obtaining hierarchical substrings differ from one application instance to another, and are fairly sophisticated: in some cases they produce non-trees, in other cases trees of degree 4 or more, and so on. Inspired by these techniques, we use a simpler hierarchical parsing procedure called Edit Sensitive Parsing (ESP) that produces a tree of degree 3. Furthermore, it is computable in one pass in the data stream model. ESP should not be perceived to be a novel parsing technique, as it draws strongly on earlier work; however, it is an attempt to simplify the technical description of applying deterministic coin tossing to obtain hierarchical decomposition of strings.

4.2 Embedding String Edit Distance with Moves into L1 Space

In the following sections, we describe how to embed strings into a vector space so that d(), the string edit distance with substring moves, will be approximated by vector distances. Consider any string a over an alphabet set σ. We will embed it as V(a), a vector with an exponential number of dimensions, O(|σ|^|a|); however, the number of dimensions in which the vector is nonzero will be quite small, in fact, O(|a|). This embedding V will be computable in near linear time, and it will have the approximation preserving property we seek. At a high level, our approach will parse a into special substrings, and consider the multi-set T(a) of all such substrings. The size of T(a) will be at most 2|a|. Then, V(a) will be the "characteristic" vector for the multi-set T(a), containing for each substring present in T(a) the number of occurrences of that substring in the multi-set. The technical crux is the parsing of a into its special substrings to generate T(a). This procedure is called Edit Sensitive Parsing, or ESP for short. In what follows, first ESP is described, and then the vector embedding is given and its approximation preserving properties are proven.

4.2.1 Edit Sensitive Parsing

We will build a parse tree, called the ESP tree (denoted ET(a)), for a string a: a will be parsed into hierarchical substrings corresponding to the nodes of ET(a). The goal is that string edit operations only have a localised effect on the ET. An obvious parse tree will have strings of length 2^i, that is, a[k·2^i : (k+1)·2^i − 1] for all integers k and i; this will yield a binary parse tree. But if a is edited by a single character insertion or deletion to obtain a', a and a' will get parsed by this approach into two different multi-sets of hierarchical substrings. These can be similar or very different depending on the location of the edit operation, and the resulting embedding will not be approximation preserving. Given a string a, we now show how to hierarchically build its ESP tree in O(log |a|) iterations. Each iteration generates a new level of the tree, where each level contains between a third and a half of the number of nodes in the level from which it is derived. At each iteration i, we start with a string ai and partition it into blocks of length 2 or 3. We replace each such block ai[j : k] by its name, hash(ai[j : k]), where hash() is a one-to-one hash function on strings of length at most 3. Then ai+1 consists of the hash()

values for the blocks in the order in which they appear in ai. So |ai|/3 ≤ |ai+1| ≤ |ai|/2 and the height of the structure is O(log |a|). We assume a0 = a, and the iterations continue until we are left with a string of length 1. The ESP tree of a consists of levels such that there is a node at level i for each of the blocks of ai−1; their children are the nodes in level i − 1 that correspond to the symbols in the block. Each character of a0 = a is a leaf node. We also denote by σ0 the alphabet σ itself, and the set of names in ai as σi, the alphabet at level i. It remains for us to specify how to partition the string ai at iteration i. This will be based on designating some local features as "landmarks" of the string. A landmark (say ai[j]) has the property that if ai is transformed into ai' by an edit operation (say a character insertion at position k) far away from j (so |k − j| >> 1), our partitioning strategy will ensure that ai[j] will still be designated a landmark. In other words, an edit operation on ai[k] will only affect j being a landmark if j is close to k. This will have the effect that each edit operation will only change O(maxj |kj − j|) nodes of the ESP tree at every level, where kj is the closest unaffected landmark to j. In order to inflict the minimal number of changes to the ESP tree, we would like this quantity to be as small as possible, but still require the ai's to be geometrically decreasing in size. In what follows, we will describe our method for marking landmarks and partitioning ai into blocks more precisely. Repeated characters — any substring of the form a^l for some character a — will make a good landmark, since these repeated characters will be identifiable irrespective of any inserts or deletes that take place around them. But we cannot guarantee that there will be repeated characters often enough (or at all) to build a low degree tree. Another good way to make these landmarks is to define a total order on the alphabet σ, and pick out characters that are locally maximal under this ordering. However, the distance between such maxima can be Θ(|σ|), and string edit operations can then have an impact a distance Θ(|σ|) away, which is too large. Repeatedly using such a procedure to parse the string hierarchically, as we do, will aggravate this problem. We will combine these ideas and overcome the deficiencies by using a procedure of alphabet reduction to guarantee that these landmarks do occur close enough together. We canonically parse any string into maximal non-overlapping substrings of three types:

1. Maximal contiguous substrings of ai that consist of a repeated symbol (so they are of the form a^l for a ∈ σi, where l > 1),

2. "Long" substrings of length at least log* |σi−1| not of type 1 above.

3. "Short" substrings of length less than log* |σi−1| not of type 1.

Each such substring is called a metablock. We process each metablock as described below to generate the next level in the parsing.
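To make the split concrete, here is a small Python sketch of this canonical decomposition into metablocks, assuming the symbols of the level string can be compared for equality and writing log_star for the iterated logarithm. The function names are illustrative, and the rule that attaches length-one metablocks to a neighbouring repeat (described below) is omitted.

```python
import math

def log_star(x):
    # iterated logarithm: how many times log2 must be applied before the value drops to 1 or below
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

def metablocks(s, prev_alphabet_size):
    """Split a level string `s` into maximal metablocks: type 1 = run of a
    repeated symbol, type 2 = long non-repeating, type 3 = short non-repeating."""
    threshold = log_star(prev_alphabet_size)
    blocks, i = [], 0
    while i < len(s):
        if i + 1 < len(s) and s[i + 1] == s[i]:          # type 1: repeat run
            j = i + 1
            while j < len(s) and s[j] == s[i]:
                j += 1
            blocks.append((1, s[i:j]))
        else:                                            # no-repeat run; stop just before the next repeat
            j = i
            while j < len(s) and not (j + 1 < len(s) and s[j] == s[j + 1]):
                j += 1
            blocks.append((2 if j - i >= threshold else 3, s[i:j]))
        i = j
    return blocks
```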

4.2.2 Parsing of Different Metablocks

Type 2: Long strings without repeats

The discussion here is similar to those in [GPS87] and [MSU97]. Suppose we are given a string s in which no two adjacent symbols are identical and which is counted as a metablock of type 2. We will carry out a procedure on it which will enable it to be parsed into nodes of two or three symbols. Given a sequence s with no repeats (i.e., s[i] ≠ s[i+1] for i = 1 ... |s|−1), we will designate at most |s|/2 and at least |s|/3 substrings of s as nodes. The concatenation of these nodes gives s. The first stage consists of iterating an alphabet reduction technique. This is effectively the same procedure as the Deterministic Coin Tossing in [GPS87], but applied to strings.

The original text ("cabageheadbag"), drawn from an alphabet of size 8, is written out as binary integers. Following one round of alphabet reduction, the new alphabet has size 6, and the new text is rewritten as integers. A final stage of alphabet reduction brings the alphabet size to 3, and local maxima and some local minima are used as landmarks (denoted by boxes).

Figure 4.1: The process of alphabet reduction and landmark finding

Alphabet reduction.

For each symbol s[i] we compute a new label, as follows. s[i−1] is the left neighbour of s[i]; consider s[i] and s[i−1] represented as binary integers. Denote by l the index of the least significant bit in which s[i] differs from s[i−1], so l = min{ j : s[i] ≠ s[i−1] mod 2^(j+1) }. Let bit(l, s[i]) be the value of s[i] at that bit location, so bit(l, s[i]) = ⌊s[i]/2^l⌋ mod 2. Form label(s[i]) as 2l + bit(l, s[i]) — in other words, as the index l followed by the value at that index.
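A minimal Python sketch of one round of this alphabet reduction, assuming the symbols are non-negative integers; the function names are illustrative.

```python
def label(left, cur):
    """Alphabet-reduction label of `cur` given its left neighbour `left`
    (both non-negative integers, assumed distinct): l is the least
    significant bit position where they differ, and the label packs l
    together with cur's bit at that position."""
    diff = left ^ cur
    l = (diff & -diff).bit_length() - 1     # index of the lowest set bit of the XOR
    return 2 * l + ((cur >> l) & 1)

def reduce_once(s):
    # One round over a sequence with no adjacent repeats; the first symbol
    # receives no label and is dropped, as in the text.
    return [label(s[i - 1], s[i]) for i in range(1, len(s))]
```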

Lemma 4.2.1 For any i, if s[i] ≠ s[i+1] then label(s[i]) ≠ label(s[i+1]).

Proof. Suppose that the least significant bit position at which s[i] differs from s[i+1] is the same as that at which s[i] differs from s[i−1] (otherwise, label(s[i]) ≠ label(s[i+1]) immediately). But the bit values at this location in each character must differ, and hence label(s[i]) ≠ label(s[i+1]). ✷

Following this procedure, we generate a new sequence. If the original alphabet τ was of size |τ|, then the new alphabet has size 2 log |τ|. We now iterate and perform the alphabet reduction until the size of the alphabet no longer shrinks. This iteration is orthogonal to the iteration that constructs the ESP tree of a; we are iterating on s, which is a sequence with no identical adjacent symbols. This takes log* |τ| iterations. Note that there will be no labels for the first log* |τ| characters.

Lemma 4.2.2 After the final iteration of alphabet reduction, the alphabet size is 6.

Proof. At each iteration of tagging, the alphabet size goes from |τ| to 2 log |τ|. If |τ| > 6, then 2 log |τ| is strictly less than |τ|, so the alphabet keeps shrinking until its size is at most 6. ✷

Since s did not have identical adjacent symbols, neither does the final sequence of labels on s, by applying Lemma 4.2.1 repeatedly. Finally, we perform three passes over the sequence of symbols to reduce the alphabet from {0 ... 5} to {0, 1, 2}: first we replace each 3 with the least element from {0, 1, 2} that does not neighbour the 3, then do the same for each 4 and 5. This generates a sequence of labels drawn from the alphabet {0, 1, 2} where no adjacent characters are identical. Denote this sequence as s.

Finding landmarks

We can now pick out special locations, known as landmarks, from this sequence that are sufficiently close together. We first select any position i which is a local maximum, that is, s[i−1] < s[i] > s[i+1], as a landmark. Two maxima could still have three intervening labels, so in addition we select as a landmark any i which is a local minimum, that is, s[i−1] > s[i] < s[i+1], provided it is not adjacent to an already chosen landmark.

Figure 4.2: Given the landmark characters of the label sequence 210121021201, the nodes are formed.

Lemma 4.2.3 For any two successive landmark positions i and j, we have 2 ≤ |i − j| ≤ 3.

Proof. By our marking procedure, we insist that no adjacent pair of tags are marked — since we cannot have two adjacent maxima, and we specifically forbid marking any local minimum which is next to a maximum. Simple case analysis shows that the separation of landmark positions is at most two intervening symbols. ✷

Lemma 4.2.4 Determining the closest landmark to position i depends on only log* |τ| + 5 contiguous positions to the left and 5 to the right.

Proof. After one iteration of alphabet reduction, each label depends only on the symbol to its left. We repeat this log* |τ| times, hence the label at position i depends on log* |τ| symbols to its left. When we perform the final step of alphabet reduction from an alphabet of size six to one of size three, the final symbol at position i depends on at most three additional symbols to its left and to its right. We must mark any position that is a local maximum, and then any that is a local minimum not adjacent to a local maximum; hence we must examine at most two labels to the left of i and two labels to the right, which in turn each depend on log* |τ| + 3 symbols to the left and 3 to the right. The total dependency is therefore as stated. ✷

Now we show how to partition s into blocks of length 2 or 3 around the landmarks. We treat the leftmost log* |σi−1| symbols of the substring as if they were a short metablock (type 3, the procedure for which is described below). The other positions are treated as follows. Since there are either one or two positions between each landmark, it is simply a matter of dealing with different cases and boundary conditions deterministically. We make each position part of the block generated by its closest landmark, breaking ties to the right (see Figure 4.2). As a consequence of Lemma 4.2.3, each block is now of length two or three.
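A Python sketch of the landmark selection and block formation just described, under the simplifying assumption that the fully reduced label sequence (alphabet {0, 1, 2}, no two adjacent labels equal) is available and contains at least one landmark; the treatment of the leftmost log* symbols and other boundary conditions is omitted, and the function names are illustrative.

```python
def find_landmarks(labels):
    """Mark local maxima, then local minima that are not adjacent to an
    already chosen landmark; `labels` has no two equal neighbours."""
    n = len(labels)
    marks = [False] * n
    for i in range(1, n - 1):
        if labels[i - 1] < labels[i] > labels[i + 1]:
            marks[i] = True
    for i in range(1, n - 1):
        if (labels[i - 1] > labels[i] < labels[i + 1]
                and not marks[i - 1] and not marks[i + 1]):
            marks[i] = True
    return [i for i in range(n) if marks[i]]

def blocks_from_landmarks(length, landmarks):
    """Assign every position to its closest landmark (ties broken to the
    right) and return the resulting blocks as (start, end) index pairs."""
    owner = [min(landmarks, key=lambda l: (abs(l - pos), -l))
             for pos in range(length)]
    blocks, start = [], 0
    for pos in range(1, length):
        if owner[pos] != owner[pos - 1]:
            blocks.append((start, pos))
            start = pos
    blocks.append((start, length))
    return blocks
```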

Type 1 (Repeating metablocks) and Type 3 (Short metablocks)

Recall that we seek "landmarks" which can be identified easily based only on a local neighbourhood. Then we can treat repeating metablocks as large landmarks. Type 1 and Type 3 metablocks can each be parsed in a regular fashion; we give the details for completeness. Metablocks of length one are attached to the repeating metablock to the left or the right, with preference to the left when both are possible, and parsed as described below. Metablocks of length two or three are retained as blocks without further partitioning, while a metablock of length four is divided into two blocks of length two. In any metablock of length five or more, we parse the leftmost three symbols as a block and iterate on the remainder.
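These rules translate directly into a short routine; the Python sketch below, with an illustrative name, splits a single repeating or short metablock into blocks of length two or three.

```python
def parse_regular(block):
    """Greedily split a repeating or short metablock: length 2 or 3 is kept
    whole, length 4 is split 2+2, and length >= 5 takes a block of 3 and
    iterates on the remainder.  (Length-one metablocks are handled by
    attaching them to a neighbouring repeat, as described above.)"""
    out = []
    while len(block) >= 5:
        out.append(block[:3])
        block = block[3:]
    if len(block) == 4:
        out.extend([block[:2], block[2:]])
    elif block:
        out.append(block)
    return out
```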

4.2.3 Constructing ET(a)

Having partitioned ai into blocks of 2 or 3 symbols, we construct ai+1 by replacing each block b by hash(i, b) where hash is a perfect one-to-one (hash) function. Each block is a node in the parse tree, and its children are the 2 or 3 nodes from which it was formed. Note that blocks of different levels can use different hash functions for computing names, so we focus on any given level i, noting that we wish to ensure that two blocks get the same name if and only if they represent the same substring, and they occur at the same level in the parsing. If we use randomisation, hash() can be computed for any block


Following the generation of the first level nodes, we relabel from the alphabet {0, 1 ...}. We parse this string, generating three second level nodes which we relabel from {0, 1 ...}, which in turn become a single top level node.

Figure 4.3: The hierarchy is formed by repeated application of alphabet reduction and landmark finding

(recall they are of length 2 or 3) in O(1) time using Karp-Rabin fingerprints (from Section 2.1.3); if we set the base large enough, then we guarantee the probability of a collision between any pair to be small. In total, we have at most n blocks to deal with, and we wish them each to be given a distinct name. We are setting the parameter δ of the Karp-Rabin hash functions, and there are n(n−1)/2 pairs of blocks which can collide. The probability of no collisions is bounded below by 1 − (n(n−1)/2)δ, so we choose δ = 2δ'/n^2 for some chosen probability δ'. Then the procedure succeeds with probability at least 1 − δ', and no blocks collide. For deterministic solutions, we can use the algorithm in [KMR72].

Karp-Miller-Rosenberg labelling The algorithm given in [KMR72] describes a method of efficiently computing a naming function for a string that ensures that two substrings are given the same name if and only if they are identical. The algorithm proceeds by doubling: if we have a naming scheme for all substrings of length m then these are used to give names to all substrings of length 2m. Clearly, the characters themselves can be used to name all strings of length 1. From these we can create names for all substrings of length 2, 4, 8, ..., n. These naming functions have the property that hash_k(a[i : i+2^k−1]) = hash_k(a[j : j+2^k−1]) ⇐⇒ a[i : i+2^k−1] = a[j : j+2^k−1]. With careful implementation as described in [KMR72], this naming can be performed in time O(|a| log |a|). A name can then be given to any substring in time O(1) using lookups: the label for a[l : r] can be formed using two labels. Let k = ⌊log_2(r − l + 1)⌋. Then the label for a[l : r] is the triple (hash_k(a[l : l+2^k−1]), hash_k(a[r−2^k+1 : r]), r − l + 1). Clearly, two substrings are assigned the same label by this procedure if and only if they are identical.
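A compact Python sketch of the doubling idea (not the O(|a| log |a|) radix-sorting implementation of [KMR72]): level k assigns equal names to equal substrings of length 2^k by pairing up names from the previous level. The function name is illustrative.

```python
def kmr_names(a):
    """names[k][i] is the name of the substring a[i : i + 2**k - 1]; two
    substrings of the same length receive the same name iff they are equal."""
    n = len(a)
    names = [{i: c for i, c in enumerate(a)}]        # level 0: single characters
    length, k = 1, 0
    while 2 * length <= n:
        pairs, level = {}, {}
        for i in range(n - 2 * length + 1):
            key = (names[k][i], names[k][i + length])
            if key not in pairs:
                pairs[key] = len(pairs)              # fresh name for a new pair of names
            level[i] = pairs[key]
        names.append(level)
        length *= 2
        k += 1
    return names
```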

Following this relabelling of the sequence, we have generated a new sequence ai+1; we then iterate this procedure until the sequence is of length 1: this is then the root of the tree. Let |ai| be the number of nodes in ET(a) at level i. Since the first (leaf) level is formed from the characters of the original string, |a0| = |a|. We have |ai|/3 ≤ |ai+1| ≤ |ai|/2. Therefore, (3/2)|a| ≤ Σ_i |ai| ≤ 2|a|. Hence

c a b a g e h e a d b a g
T(a) = { c, a, b, a, g, e, h, e, a, d, b, a, g, ca, ba, geh, ea, db, ag, caba, gehea, dbag, cabageheadbag }

Figure 4.4: The hierarchical structure of nodes is represented as a parse tree on the string a.

for any i, |σi| ≤ |a| (recall that this hash() is one-to-one) and so log* |σi| ≤ log* |a|. The example is completed in Figure 4.3.

Theorem 4.2.1 Given a string a, its ESP tree ET(a) can be computed in time O(|a| log* |a|).

4.3 Properties of ESP

We can compute ET(a) for any string a as described above (see Figure 4.4). Each node x in ET(a) represents a substring of a given by the concatenation of the leaf nodes in the subtree rooted at x.

Definition 4.3.1 Define the multi-set T (a) as all substrings of a that are represented by the nodes of ET(a) (over all levels). We define V (a) to be the “characteristic vector” of T (a), that is, V (a)[i, x] is the number of times a substring x appears in T (a) at level i. Finally, we define Vi(a) as the characteristic vector restricted to only nodes which occur at level i in ET(a).

Note that T(a) comprises at most 2|a| strings, each of length at most |a|. V(a) is an O(|σ|^|a|) dimensional vector since its domain is any string that may be present in T(a); however, it is a (highly) sparse vector since at most 2|a| components are nonzero. In an implementation, this set could be efficiently stored using pointers into the string a. As usual, we denote the standard L1 distance between two vectors u and v by ||u − v||1. By definition, ||V(a) − V(b)||1 = Σ_{x ∈ T(a) ∪ T(b)} |V(a)[x] − V(b)[x]|. Recall that d(a, b) denotes the edit distance with moves between strings a and b. Our main theorem on this structure shows that V() is an approximation preserving embedding of string edit distance with moves.
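For concreteness, a minimal Python sketch of V(a) as a sparse map from (level, substring) pairs to counts, and of the L1 distance between two such maps; here T(a) is assumed to be available as a list of (level, substring) pairs produced by the parsing, and the function names are illustrative.

```python
from collections import Counter

def characteristic_vector(t_a):
    # V(a): for each (level, substring) node of the parsing, the number of
    # times it occurs; t_a is the multi-set T(a) as a list of pairs.
    return Counter(t_a)

def l1_distance(va, vb):
    # Sum over every coordinate that is non-zero in either sparse vector;
    # Counter returns 0 for missing keys.
    return sum(abs(va[x] - vb[x]) for x in set(va) | set(vb))
```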

Theorem 4.3.1 For strings b and a, let n be max(|a|, |b|). Then

d(a, b) ≤ 2||V(b) − V(a)||1 ≤ 16 log n log* n · d(a, b)

4.3.1 Upper Bound Proof

||V(b) − V(a)||1 ≤ 8 log n log* n · d(a, b)

Proof. To show this bound on the L1 distance, we consider the effect of the editing operations, and demonstrate that each one causes a contribution to the L1 distance that is bounded by O(log n log* n). Note again that, as mentioned in the introduction, edit operations are allowed to "overlap" on blocks that were previously the subject of moves and other operations. We give a Lemma which is similar to Lemma 4.2.4 but which applies to any string, not just those with no adjacent repeated characters.

Lemma 4.3.1 The closest landmark to any symbol of ai is determined by at most log* |σi| + 5 consecutive symbols of ai to the left, and at most 5 consecutive symbols of ai to the right.

Proof. Given a symbol of ai, say ai[j], we show how to find the closest landmark.
Type 1 Repeating metablock. Recall that a long repeat of a symbol a is treated as a single, large landmark. ai[j] is included in such a metablock if ai[j] = ai[j+1] or if ai[j] = ai[j−1]. We also consider ai[j] to be part of a repeating substring if ai[j−1] = ai[j−2]; ai[j+1] = ai[j+2]; and ai[j] ≠ ai[j+1] and ai[j] ≠ ai[j−1] — this is the special case of a metablock of length one. In total, only 2 consecutive symbols to the left and right need to be examined.
Types 2 and 3 Non-repeating metablocks. If it is determined that ai[j] is not part of a repeating metablock, then we have to decide whether it is in a short or long metablock. We examine the substring ai[j − log* |σi| − 3 : j − 1]. If there is any k such that ai[k] = ai[k−1] then there is a repeating metablock terminating at position k. This is a landmark, and so we parse ai[j] as part of a short metablock, starting from ai[k+1] (recall that the first log* |σi| symbols of a long metablock get parsed as if they were in a short metablock). Examining the substring ai[j+1 : j+5] allows us to determine if there is another repeating metablock this close to position j, and hence we can determine what node to form containing ai[j]. If there is no repeating metablock evident in ai[j − log* |σi| − 3 : j − 1] then it is possible to apply the alphabet reduction technique to find a landmark. From Lemma 4.2.4, we know that this can be done by examining log* |σi| + 5 consecutive symbols to the left and 5 to the right. ✷
This ability to find the nearest landmark to a symbol by examining only a bounded number of consecutive neighbouring symbols means that if an editing operation occurs outside of this region, the same landmark will be found, and so the same node will be formed containing that symbol. This allows us to prove the following lemma.

Lemma 4.3.2 Inserting k ≤ log* n + 10 consecutive characters into a to get a' means ||Vi(a) − Vi(a')||1 ≤ 2(log* n + 10) for all levels i.

Proof. We shall make use of Lemma 4.3.1 to show this. We have a contribution to the L1 distance from the insertion itself, plus its effect on the surrounding locality. Consider the total number of symbols at level i that are parsed into different nodes after the insertion compared to the nodes beforehand. Let the number of symbols at level i which are parsed differently as a consequence of the insertion be Mi. Lemma 4.3.1 means that in a non-repeating metablock, any symbol more than 5 positions to the left, or log* |σi| + 5 positions to the right, of any symbols which have changed will find the same closest landmark as it did before, and so will be formed into the same node. Therefore it will not contribute to Mi. Similarly, for a repeating metablock (type 1), any symbol inside the block will be parsed into the same node (that is, into a triple of that symbol), except for the last 4 symbols, which depend on the length of the block. So for a repeating metablock, Mi ≤ 4. The number of symbols from the level below which are parsed differently into nodes as a consequence of the insertion is at most Mi−1/2, and there is a region of at most 5 symbols to the left and log* |σi| + 5 symbols to the right which will be parsed differently at level i. Because |σi| ≤ |a| ≤ n as previously observed, we can therefore form the recurrence Mi ≤ Mi−1/2 + log* n + 10. If Mi−1 ≤ 2(log* n + 10) then Mi ≤ 2(log* n + 10). From the insertion itself, M0 ≤ log* n + 10. Finally ||Vi(a) − Vi(a')||1 ≤ 2(Mi−1/2), since we could lose Mi−1/2 old nodes, and gain this many new nodes. ✷

Lemma 4.3.3 Deleting k ≤ log* n + 10 consecutive symbols from a to get a' means ||Vi(a) − Vi(a')||1 ≤ 2(log* n + 10).

Proof. Observe that a deletion of a sequence of labels is precisely the dual to an insertion of that sequence at the same location. If we imagine that a sequence of characters is inserted, then deleted, the resultant string is identical to the original string. Therefore, the number of affected nodes must be bounded by the same amount as for an insertion, as described in Lemma 4.3.2. ✷

78 We combine these two lemmas to show that editing operations have only a bounded effect on the parse tree.

Lemma 4.3.4 If a single permitted edit operation transforms a string a into a' then ||V(a) − V(a')||1 ≤ 8 log n(log* n + 10).

Proof. We consider each allowable operation in turn.
Character edit operations. The case for insertion follows immediately from Lemma 4.3.2 since the effect of the character insertion affects the parsing of at most 2(log* n + 10) symbols at each level and there are at most log_2 n levels. In total then ||V(a) − V(a')||1 ≤ 2 log n(log* n + 10). Similarly, the case for deletion follows immediately from Lemma 4.3.3. Finally, the case for a replacement operation (if allowed) is shown by noting that a character replacement can be considered to be a deletion immediately adjacent to an insertion.
Substring Moves. If the substring being moved is at most log* n + 10 in length, then a move can be thought of as a deletion of the substring followed by its re-insertion elsewhere. From Lemma 4.3.2 and Lemma 4.3.3, then ||V(a) − V(a')||1 ≤ 4 log n(log* n + 10). Otherwise, we consider the parsing of the substring using ESP. Consider a character in a non-repeating metablock which is more than log* n + 5 characters from the start of the substring and more than 5 characters from the end. Then according to Lemma 4.3.1, only characters within the substring being moved determine how that character is parsed. Hence the parsing of all such characters, and so the contribution to V(a), is independent of the location of this substring in the string. Only the first log* n + 5 and last 5 characters of the substring will affect the parsing of the string. We can treat these as the deletion of two substrings of length k ≤ log* n + 10 and their re-insertion elsewhere. For a repeating metablock, if this extends to the boundary of the substring being moved then still only 4 symbols of the block can be parsed into different nodes. So by appealing to Lemmas 4.3.2 and 4.3.3, ||V(a) − V(a')||1 ≤ 8 log n(log* n + 10). ✷

Lemma 4.3.4 shows that each allowable operation affects the L1 distance of a transform by at most 8 log n(log* n + 10). Suppose we begin with a, and perform a series of d editing operations, generating a1, a2, ..., ad. At the conclusion, ad = b, so ||V(ad) − V(b)||1 = 0. We begin with a quantity ||V(b) − V(a)||1, and we also know from the above argument that at each step ||V(aj) − V(aj+1)||1 ≤ 8 log n(log* n + 10). Hence, if d(a, b) is the minimum number of operations to transform a into b, then d(a, b) must be at least ||V(b) − V(a)||1 / (8 log n(log* n + 10)), giving a bound of ||V(b) − V(a)||1 ≤ d(a, b) · 8 log n(log* n + 10). ✷

4.3.2 Lower Bound Proof

d(a, b) ≤ 2||V (b) − V (a)||1

Here, we shall prove a slightly more general statement, since we do not need to take account of any of the special properties of the parsing; instead, we need only assume that the parse structure built on the strings has bounded degree (in this case three), and forms a tree whose leaves are the characters of the string. Our technique is to show a particular way we can use the ‘credit’ from the potential function ||V (b)−V (a)||1 to transform a into b. We give a constructive proof, although the computational efficiency of the construction is not important. For the purpose of this proof, we treat the parse trees as if they were static tree structures, so following an editing operation, we do not need to consider the effect this has on the parse structure.

Lemma 4.3.5 If the trees which represent the transforms have degree at most k, then the tree ET(b) can be made from the tree ET(a) using no more than (k − 1)||V (a) − V (b)||1 move, insert and delete operations.

Proof. We first ensure that any good features of a are preserved. In a top-down, left to right pass over the tree of a, we 'protect' certain nodes — we place a mark on any node x that occurs in the parse tree of both a and b, provided that the total number of nodes marked as protected does not exceed Vi(b)[x]. If a node is protected, then all its descendants become protected. The number of marked copies of any node x is min(V(a)[x], V(b)[x]). Once this has been done, the actual editing commences, with the restriction that we do not allow any edit operation to split or change a protected node. We shall proceed bottom-up in log n rounds, ensuring that after round i, when we have created ai, ||Vi(ai) − Vi(b)||1 = 0. The base case to create a0 deals with individual characters, and is trivial: for any symbol c ∈ σ, if V0(a)[c] > V0(b)[c] then we delete the (V0(a)[c] − V0(b)[c]) unmarked copies of c from a; else if V0(a)[c] < V0(b)[c] then we insert the (V0(b)[c] − V0(a)[c]) missing copies of c. For the inductive step at level i, consider any node x for which Vi(b)[x] > Vi(a)[x]; we would have protected Vi(a)[x] copies of x. Hence we need to build Vi(b)[x] − Vi(a)[x] new copies of x, and the contribution from x to ||Vi(a) − Vi(b)||1 is exactly Vi(b)[x] − Vi(a)[x]: this gives us the credit to build each copy of x. To make each of the copies of x, we need to bring together at most k nodes from level i − 1. So pick one of these, and move the other k − 1 into place around it (note that we can move any node from level i − 1 so long as its parent is not protected). We do not care where the node is made — this will be taken care of at higher levels. Because ||Vi−1(ai−1) − Vi−1(b)||1 = 0 we know that there are enough nodes from level i − 1 to build every level i node in b. We then require at most k − 1 move operations to form each copy of x by moving unprotected nodes. ✷

Since this inductive argument holds, and we use at most k − 1 = 2 moves for each contribution to the L1 distance, the claim follows.

Example

An extended example of this embedding is given in Figures 4.5, 4.6, 4.7 and 4.8. It shows how the embedding into a vector is found from the parsing, and then how one string may be converted into another using a number of operations linear in the size of the L1 difference of their representing vectors.

4.4 Embedding for other block edit distances

The ESP approach can be adapted to handle similar string distance measures, which allow additional operations such as substring reversals, linear scaling and copying, amongst others. These can be handled for the most part by altering the methodology for the alphabet reduction step, which ensures that, for example, substrings are parsed in the same way as their reverse. This approach to this kind of parsing was discussed in [MS00], and we do not go into further detail. Instead, we focus on the compression distance, since here the embedding is into Hamming distance rather than L1 distance. It makes use of exactly the same ESP parsing of strings, suggesting the wider applicability of this technique.

Figure 4.5: The hierarchical parsing induces a vector representation. The string "BABBAGE_DEBAGGED_A_DEAF_CABBAGE_DEBA" is parsed into levels 0 to 4, and each entry in the resulting vector corresponds to a pair of the level and the name of the node. The entry then counts how many copies of this node occur in the parsing. So there are 8 copies of the character 'A', 2 copies of node 12 in level one, and a single copy of node 10 at level four. Any entries not present in this vector representation are zero.

Figure 4.6: For the lower bound proof, we mark nodes that are common to both strings. We mark as protected as many nodes common to both parsings as are needed in the target, ensuring that we do not protect more than this; following this step, we will not break up protected nodes.

Figure 4.7: The first stages of editing the string. We first delete any characters that are in excess (ensuring that deletions take place outside of protected blocks) and insert any characters of which there are too few; this completes the base case. We then form the nodes of level 1, and then the nodes of level 2, by moving nodes of the level below into place.

Figure 4.8: The target string is completed. We move around the nodes of level 2 to form the nodes of level 3, which are: CABBAGE_A_D, EAF_CAB_BA, GGAGED_BABBAGE. Finally, we move nodes of level 3 around to form the target string.

4.4.1 Compression Distance

For this embedding we need to use the idea of the Symmetric Difference of Multi-sets. This is defined to be the symmetric difference of the supporting sets of the multi-sets. We will deal with T (a)∆T (b) which indicates the symmetric difference of the supporting sets of the multi-sets T (a) and T (b). This is equivalent to H0(V (a),V(b)), the zero based Hamming difference between their vector representations (Definition 1.2.9). For the purpose of the following theorem, we treat these multi-sets T (a) and T (b) as their support sets.

Theorem 4.4.1 For strings a and b with max(|a|, |b|)=n,

c(a, b) ≤ 3|T(a)∆T(b)| = O(log n log* n) · c(a, b)

Proof. We need to give a proof analogous to that for Theorem 4.3.1. Again, we show an upper and lower bound.

Upper bound

As in the proof of Theorem 4.3.1, we will show that each permitted edit operation has a limited effect. In particular, we show that if an edit operation transforms a into a', then |T(a)∆T(a')| = O(log n log* n). We already know that every move, insert, delete or change operation affects the L1 distance by at most O(log n log* n). Since H0(V(b), V(a)) ≤ ||V(b) − V(a)||1 (that is, the Hamming distance of two integer valued vectors is less than the L1 distance, Lemma 2.3.2), we only need to consider the extra operations, of copying and uncopying a block. For copy operations, we consider a copy of a (long) block of a generating a'. Note that any element within the block that is more than log* |σ| + 5 from the left end of the block, or 5 from the right end, will be oblivious of whatever goes on outside the block, and so will be parsed in the same way in both copies. So these do not affect the Hamming distance of the parsings. Therefore, the copy operation will have an effect no greater than that of inserting the first log* |σ| + 5 and last 5 elements, since we can neglect the centre of the block from after the first log* |σ| + 5 characters to 5 from the end. We have seen that such an insertion affects at most 8 log n(log* n + 10) existing nodes, and it will add only O(log* n) nodes for the inserted blocks themselves, so |T(a)∆T(a')| = O(log n log* n). For a block deletion, note that we can only delete a block that has a copy elsewhere in the sequence, so that deletion is the inverse operation of copying (uncopying). For this case also, |T(a)∆T(a')| = O(log n log* n). Consequently, each operation can contribute O(log n log* n) to the Hamming distance, and so the total zero based Hamming distance between V(a) and V(b) can be at most c(a, b) · O(log n log* n).

Lower bound.

To show c(a, b) ≤ 3|T (b)∆T (a)|, we first show how given a we can build the compound string ab (a followed by b) using a number of operations linear in the zero based Hamming difference between the transforms. A special case to deal with is when b is a substring of a, or vice-versa. We shall consider this case shortly, but in the meantime we assume that this is not the case.

Lemma 4.4.1 The compound string ab can be built from string a using no more than 3|T (b)\T (a)| copy and insert operations.

Proof. Again, we will treat set operations on multi-sets in terms of their support sets, so T (b)\T (a) is the set of items present in b but not present in a. We begin with the highest level node of ET(b) (note that there will be precisely one, which represents the whole sequence b). We shall proceed inductively on the nodes of ET(b). Since we assume that b is not a substring of a, and vice-versa, then this node is

not present in a, in which case it contributes to T(b)\T(a). We will use three units of potential from this element of the difference, and allocate these by giving one unit to each of its children (recall that each level-i node is formed from at most 3 level-(i − 1) nodes). The induction considers each node that has been given a credit from the top down. The base case is if a node under consideration is a leaf node. This character can be inserted at unit cost. Otherwise, the node represents a substring of b. If this child node is present in a, or already present in the partially built string b, then the unit credit can be used to copy the string that the node represents. Otherwise, we can apply this argument recursively (since this node is present in b but not in a and hence contributes to the set difference) to build this node by forming its children in place. ✷
We can use exactly the same argument symmetrically to argue that starting from b we can build the compound string ab using c(b, ab) ≤ 3|T(a)\T(b)| operations. Since c is a metric, it is symmetric and so c(ab, b) ≤ 3|T(a)\T(b)|. Using the triangle inequality we have that c(a, b) ≤ c(a, ab) + c(ab, b) ≤ 3|T(b)\T(a)| + 3|T(a)\T(b)| = 3|T(a)∆T(b)|. We finally dispense with the case that a is a substring of b. In this case we fix a, and attempt to build b around a in a similar way to before: for each node in b that is not present, we use credit to build its children. We only require credit for nodes that are in b but not in a, hence there are only 3|T(b)\T(a)| ≤ 3|T(a)∆T(b)| operations. The case where b is a substring of a follows by symmetry of the metric c. ✷
This theorem essentially states that if we consider the embedding of the ESP transformation into zero based Hamming space instead of L1 space, then it approximates the compression distance. We construct the vectors in the Hamming space as vectors which take values from {0, 1}. They record 1 for a substring if that substring is present in the parsing of the string, and 0 otherwise. This is currently the best approximation known for this string edit distance. [CPSV00] showed an O(log^2 n log* n) approximation, which was improved to O(log n (log* n)^2) in [MS00]; here and in [CM02], we improve it modestly to O(log n log* n). This approach also allows us to apply many of the applications in this chapter to this compression distance as well as the edit distance with moves, as we will see later.
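Since the compression distance is estimated through the support sets of the two parse multi-sets, the comparison itself reduces to a symmetric difference of sets. A minimal Python sketch, assuming T(a) and T(b) are available as lists of (level, substring) pairs and using an illustrative function name:

```python
def compression_distance_estimate(t_a, t_b):
    """Zero-based Hamming comparison of the two parsings: the size of the
    symmetric difference of the support sets of T(a) and T(b)."""
    return len(set(t_a) ^ set(t_b))
```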

4.4.2 Unconstrained Deletes

In Definition 1.4.7 we described a particular edit distance which resembles the compression distance but additionally allows a deletion operation of an arbitrary substring at unit cost. We can show a result which relates the compression distance with unconstrained deletes, du, to the ESP transform.

Theorem 4.4.2 (1/4) du(a, b) ≤ |T(b)\T(a)| ≤ du(a, b) · 4 log n(log* n + 10)

Proof. As usual, let n denote max(|a|, |b|), that is, the length of the longest string. The proof of this follows from our earlier results, with some variations. Again, we prove this in two parts, for the two bounds.

Lower bound

This we do again by construction. In fact, the proof is almost complete from previous results. We consider different possibilities: (1) The two strings are the same. If b = a then du(a, b) = 0, and we need take no action. In all other cases, T(a) ≠ T(b) and we need |T(b)\T(a)| ≥ 1; we can force this condition by adding a new symbol to the alphabet, $, and rewriting the strings as $a$ and $b$. This does not affect the distance, but removes the case where b occurs as a node in the parsing of a. (2) b is a substring of a. In this case, with at most two deletion operations, we can turn a into b. (3) All other cases. Using Lemma 4.4.1 we can construct the string ab from the string a using 3|T(b)\T(a)| operations. We can then with one delete turn this into b. This extra operation is paid for by increasing the constant 3 to 4. Hence in all cases du(a, b) ≤ 4|T(b)\T(a)|.

Upper Bound

Again, most of the work has already been done for this. We observe in the upper bound of Theorem 4.4.1 that each permitted operation affects the symmetric difference by at most 8 log n(log* n + 10). This we observe arises from Theorem 4.3.1, which shows that each permitted operation affects |T(b)\T(a)| (and symmetrically, |T(a)\T(b)|) by at most 4(log* n + 10). We need to make the same observation about the new operation, unconstrained deletes. This is fine: as with moves and copies, we consider long blocks whose centre is parsed by ESP. Following a delete of a substring, various nodes disappear from the parsing. However, although these nodes can affect T(a)\T(b) (since they were nodes that were present but are no longer), they do not directly contribute to T(b)\T(a), since this is interested in nodes that are in b but not in a. We only have to consider new nodes that are introduced. This happens following a deletion, since it can cause a local neighbourhood around the location of the deletion to be reparsed. However, as we have seen many times, surrounding the location of any edit operation, only an O(log* n) region is reparsed. Hence, for this operation, only 2 log* n log n nodes can contribute to |T(b)\T(a)|. Putting these two parts together is sufficient to show that finding the set difference of the two sets T(a) and T(b) is sufficient to approximate the compression distance with unconstrained deletes, up to an O(log n log* n) factor. ✷

We will return to this result in Chapter 6, where it is necessary for some of the algorithms described there.

4.4.3 LZ Distance

The structure of the formation of the goal string in the lower bound of the above proof allows us to relate these distances to LZ distance and data compression. Notice that the target string is formed by copying substrings that either occur in the original string or occur earlier in the partly formed string. This fits exactly into the model of Lempel-Ziv compression, in which a compressed string is formed by copying substrings from a dictionary, if we treat the concatenation of the original string and the partly built string as defining the dictionary. Hence we can relate our LZ distance (Definition 1.4.5) to the other string edit distances.

Theorem 4.4.3 LZ(a, b) ≤ 3|T(b)\T(a)| ≤ O(log n log* n) · LZ(a, b)

Proof. Note that, as observed above, the string ab can be formed from a using 3|T(b)\T(a)| copy operations (from Theorem 4.4.2). This fits into the model of the LZ distance (in which b is built using copying operations only) and so LZ(a, b) ≤ 3|T(b)\T(a)| ≤ O(log n log* n) · du(a, b). But also compression distance with unconstrained deletes is a more powerful distance than LZ distance, and so du(a, b) ≤ LZ(a, b). Combining these gives the desired result. ✷

Corollary 4.4.1 c(a, b) ≤ LZ(a, b) + LZ(b, a) ≤ O(log n log* n) · c(a, b)

This follows since we can build ab using LZ(a, b) copy operations, and then build b from ab using LZ(b, a) uncopy operations.

Relation to Data Compression

We have already argued informally that both the LZ distance and the compression distance are closely related to notions of compression. The compression distance with unconstrained deletes counts how many quite general operations are required to describe a goal string in terms of another. In particular, suppose that the original string is the empty string. Adapting Theorem 4.4.1, we find that

du(0, a) ≤ 3||V(a)||H = O(log n log* n) · du(0, a). This gives a statement about compressibility: suppose that we are allowed to describe a string using an arbitrary interleaving of move, copy, unconstrained deletes, and character inserts, deletes and changes. Then the minimum number of operations is du(0, a), which we will write as c(a). An optimal Lempel-Ziv compressed form of a string can be found by repeatedly parsing the longest unparsed prefix of the string into a substring that has already been seen [ZL77, MS99]. This is at least as good as that produced by using 3|T(a)| copy operations, since this is a compression algorithm in the Lempel-Ziv framework. This then allows us to state a relation between "best possible compression", using very powerful operations of unrestricted copying, moving and deletion, and Lempel-Ziv, which is restricted to copying earlier substrings. Let LZ(a) be the number of operations in the minimal Lempel-Ziv representation of a; then:

c(a) ≤ LZ(a) ≤ 3|T(a)| = O(log n log* n) · c(a)

In other words, Lempel-Ziv compression is competitive with unrestricted compression, up to a factor of O(log n log* n).

4.4.4 Q-gram distance

Many previous works have used q-grams as methods of comparing texts, especially those from a database standpoint. We shall consider one formal definition in particular, that of Ukkonen [Ukk92]. A q-gram is simply a substring of length q; the q-grams of a string are all its substrings of length q. In [Ukk92], a similarity measure is defined in a similar way to our vector, V(a). Here a vector Gq(a) is induced by the string a, recording the frequency of each q-gram v in a. The similarity of two strings is then given by ||Gq(a) − Gq(b)||1. Clearly, this measure is symmetric, and it satisfies the triangle inequality. However, note that if ||Gq(a) − Gq(b)||1 = 0, it is not necessarily the case that b = a. Consider, for example, b = W c^q X c^q Y c^q Z and a = W c^q Y c^q X c^q Z, where W, X, Y, Z are arbitrary strings, and c some character. Then Gq(a) = Gq(b) and so this does not represent a metric: strings a and b can be quite dissimilar, while having a low (zero) distance as measured in this way. On the other hand, for non-adversarially constructed strings, this measure will be a good discriminator in many situations. It is easy to compute, and in particular it is trivial to construct sketches for this distance in the ordered streaming model. Hence q-grams have often been used in a variety of applications for string similarity. From our perspective, there are several points of interest about q-grams. The first is to note that these form a very basic exact embedding of a string distance (that is, q-gram distance) into the L1 distance. This can be computed easily in the ordered streaming model using the algorithms of Section 2.2.4, with additional O(q) memory to store the last q characters. We can also cast our embeddings as being a justification of previous heuristic methods: we have shown that it is possible to use a method based on registering the number of substrings found by a particular parsing method to get a distortion bounded approximation of a string distance of interest. This is a post hoc justification of previous ad hoc approaches which have used counting the number of substrings found by a particular parsing (such as q-grams) to measure string similarity. It is also shown in [Ukk92] how to solve the approximate pattern matching problem under this distance measure. Using some observations on the structure of this distance, and with some careful use of data structures, it is shown that this problem can be solved in time O(n log(m − q) + m|σ|) for a text of length n and a pattern of length m. We should also compare our negative results on strings, namely that q-grams are insufficient to capture the string distance metrics we study, with the results for permutations in Chapter 3. These showed that for several important permutation distances it is sufficient to consider 2-grams (q-grams of length 2), or similar pairs, to approximate them.
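A minimal Python sketch of this q-gram similarity measure, building the frequency vectors with a Counter and taking their L1 difference; the function name is illustrative.

```python
from collections import Counter

def qgram_distance(a, b, q):
    """Ukkonen-style q-gram distance: the L1 difference between the vectors
    of q-gram frequencies of the two strings."""
    ga = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    gb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(ga[g] - gb[g]) for g in set(ga) | set(gb))
```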

4.5 Solving the Approximate Pattern Matching Problem for String Edit Distance with Moves

In this section, we present an algorithm to solve the approximate pattern matching problem for string edit distance with moves. For any string a, we will assume that V(a) can be stored in O(|a|) space by listing only its non-zero components. More formally, we store V(a)[x] if it is non-zero in a table indexed by hash(x), and we store x as a pointer into a together with |x|. The result below on pairwise string comparison follows immediately from Theorems 4.2.1 and 4.3.1 together with the observation that, given V(a) and V(b), ||V(b) − V(a)||1 can be found in O(|a| + |b|) time.
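As a small illustration of this observation (the function name is ours, and V(a) is taken to be available as a sparse table mapping node names to their non-zero counts), the L1 difference is a single pass over both tables:

def l1_difference(Va: dict, Vb: dict) -> int:
    """||V(a) - V(b)||_1 for sparse characteristic vectors stored as
    {node name: non-zero count}; time linear in the sizes of the two tables."""
    return sum(abs(Va.get(x, 0) - Vb.get(x, 0)) for x in Va.keys() | Vb.keys())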

Theorem 4.5.1 Given strings a and b with n = max(|a|, |b|), there exists a deterministic algorithm to approximate d(a, b) accurate up to an O(log n log∗ n) factor in O(n log∗ n) time with O(n) space.

4.5.1 Using the Pruning Lemma

In order to go on to solve the string edit distance problem, we need to “compare” a pattern p of length m against t[i : n] for each i, and there are O(n) such “comparisons” to be made. Further, we need to compute the distance between p and t[i : k] for all possible k ≥ i in order to compute the best alignment starting at position i, which presents O(n²) subproblems in general. The classical dynamic programming algorithm for Levenshtein edit distance performs all necessary comparisons in a total of O(mn) time in the worst case by using the dependence amongst the subproblems. Our algorithm will take a different approach. We make use of the Pruning Lemma, which is Lemma 1.5.1. This reduces the number of comparisons we have to make down to O(n), although it introduces a further factor of 2 into the quality of the approximation. Hence, it prunes candidates away from the quadratic number of distance computations that a straightforward procedure would entail. Still, we cannot directly apply Theorem 4.5.1 to compute d(p, t[l : l + m − 1]) for all l, because that will be expensive. It will be desirable to use the answer for d(p, t[l : l + m − 1]) to compute d(p, t[l + 1 : l + m]) more efficiently. In what follows, we will give a more general procedure that will help compute d(p, t[l : l + m − 1]) very fast for every l in order, by using further properties of ESP.

4.5.2 ESP subtrees

Given a string a and its corresponding ESP tree, ET(a), we show that the subtree of ET(a) induced by the substring a[l : r] has the same edit-sensitive properties as the whole tree.

Definition 4.5.1 Let ETi(a)j be the jth node in level i of the parsing of a. We define an ESP subtree of a, EST(a, l, r), as the subtree of ET(a) containing the leaf nodes corresponding to a[l] to a[r], and all of their ancestors. Define range(ETi(a)j) as the set of values [p . . . q] so that the leaf labels of the subtree rooted at ETi(a)j correspond to the substring a[p : q]. Formally, we take all nodes ETi(a)j where [l . . . r] ∩ range(ETi(a)j) ≠ ∅. The name of a node (substring) ETi(a)j is hash(a[range(ETi(a)j) ∩ [l . . . r]]).

This yields a proper subtree of ET(a), since a node is included in the subtree if and only if at least one of its children is included (as the ranges of the children partition the range of the parent). As before, we can define a vector representation of this tree.

Definition 4.5.2 Define VS(a, l, r) as the characteristic vector of EST(a, l, r) by analogy with V(a); that is, VS(a, l, r)[x] is the number of times the substring x is represented as a node in EST(a, l, r).

Note that EST(a, 1, |a|) = ET(a), but in general it is not the case that EST(a, l, r) = ET(a[l : r]). However, EST(a, l, r) shares the properties of the edit sensitive parsing. We can now state a theorem that is analogous to Theorem 4.3.1.

Theorem 4.5.2

d(a[lp : rp], b[lq : rq]) ≤ 2||VS(a, lp, rp) − VS(b, lq, rq)||1 = O(log n log∗ n) · d(a[lp : rp], b[lq : rq])

Proof. Certainly, since Lemma 4.3.5 makes no assumptions about the structure of the tree, the lower bound holds: the editing distance is no more than twice the size of the difference between the ESP subtrees. For the upper bound, consider applying the necessary editing operations to the substrings of a and b. We study the effect on the original ESP trees, ET(a) and ET(b). Theorem 4.3.1 proved that each editing operation can cause a difference of at most O(log n log∗ n) between the vectors V before and after the operation. It follows that the difference in VS(a, l, r) must be bounded by the same amount: it is not possible for any more nodes to be added or removed, since the nodes of EST(a, l, r) are a subset of the nodes of ET(a). Therefore, by the same reasoning as in Theorem 4.3.1, the total difference ||VS(a, lp, rp) − VS(b, lq, rq)||1 = d(a[lp : rp], b[lq : rq]) · O(log n log∗ n). ✷

We need one final lemma before proceeding to build an algorithm to solve the String Edit Distance problem with moves.

Lemma 4.5.1 VS(a, l + 1, r + 1) can be computed from VS(a, l, r) in time O(log |a|).

Proof. Recall that a node is included in EST(a, l, r) if and only if one of its children is. A leaf node corresponding to a[i] is included if and only if i ∈ [l . . . r]. This gives a simple procedure for finding EST(a, l + 1, r + 1) from EST(a, l, r), and so for finding VS(a, l + 1, r + 1): (1) At the left hand end, let x be the node corresponding to a[l] in EST(a, l, r). We must remove x from EST(a, l, r). We must also adjust every ancestor of x to ensure that their name is correct, and remove any ancestors which no longer contain any leaf in a[l + 1 : r + 1]. (2) At the right hand end, let y be the node corresponding to a[r + 1] in ET(a). We must add y to EST(a, l, r), and set the parent of y to be its parent in ET(a), adding any ancestor if it is not present. We then adjust every ancestor of y to ensure that their name is correct. Since in both cases we only consider ancestors of one leaf node, and the depth of the tree is O(log |a|), it follows that this procedure takes time O(log |a|). ✷

An ESP subtree, and this process of computing VS(a, l + 1, r + 1) from VS(a, l, r), is illustrated in Figure 4.9.

4.5.3 Approximate Pattern Matching Algorithm

Combining these results allows us to solve the main problem we study in this chapter.

Theorem 4.5.3 Given text t and pattern p, we can solve the approximate pattern matching problem for the string edit distance with moves, that is, compute an O(log n log∗ n) approximation to D[i] = min_{i≤k≤n} d(p, t[i : k]) for all i, in time O(n log n).

Proof. Our algorithm is as follows: given pattern p of length m and text t of length n, we compute ET(p) and ET(t) in time O(n log∗ n) as per Theorem 4.2.1. We then compute EST(t, 1, m). This can be carried out in time at worst O(n), since we have to perform a pre-order traversal of ET(t) to discover which nodes are in EST(t, 1, m). From this we can compute D̂[1] = ||VS(t, 1, m) − VS(p, 1, m)||1. We then iteratively compute ||VS(t, i + 1, i + m) − VS(p, 1, m)||1 from ||VS(t, i, i + m − 1) − VS(p, 1, m)||1

[Figure 4.9: An ESP subtree can be iteratively computed. The top part shows the names of the nodes of an ESP parsing of the string BABBAGE_DEBAGGED_A_DEAF_CABBAGE_DEBA; the lower part shows how the subtree for the substring advanced by one position can be derived with only a constant amount of work at each level, updating the names of those nodes whose children have changed.]
by using Lemma 4.5.1 to find which nodes to add to or remove from EST(t, i, i + m − 1) and adjusting the count of the difference appropriately. This takes n comparisons, each of which takes O(log n) time. By Theorem 4.5.2 and Lemma 1.5.1, D[i] ≤ D̂[i] ≤ O(log n log∗ n) · D[i]. ✷

If log∗ n is O(log m), as one would expect for any reasonably sized pattern and text, then a tighter analysis shows the running time to be O(n log m). This is because we only need to consider the lower log m levels of the parse trees; above this, EST(t, i, i + m − 1) has only a single node in each level.
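The shape of this computation can be sketched as follows (a hedged outline only, not the implementation: window_nodes and window_shift are hypothetical helpers standing in for the ESP subtree machinery of Lemma 4.5.1, and the constant factors of the approximation are ignored):

from collections import Counter

def sliding_alignment_estimates(P_vec, window_nodes, window_shift, n, m):
    """D_hat[i] = ||VS(t, i, i+m-1) - VS(p, 1, m)||_1, maintained incrementally.

    P_vec: Counter of the node names occurring in ET(p).
    window_nodes(i): hypothetical helper giving the node names (with multiplicity)
        of EST(t, i, i+m-1).
    window_shift(i): hypothetical helper giving (added, removed) node names when
        the window advances from i to i+1 -- O(log n) names each, by Lemma 4.5.1.
    """
    VS = Counter(window_nodes(1))
    diff = sum(abs(VS[x] - P_vec[x]) for x in VS.keys() | P_vec.keys())
    D_hat = {1: diff}
    for i in range(1, n - m + 1):
        added, removed = window_shift(i)
        for x in removed:                       # node x leaves the subtree
            diff += abs(VS[x] - 1 - P_vec[x]) - abs(VS[x] - P_vec[x])
            VS[x] -= 1
        for x in added:                         # node x joins the subtree
            diff += abs(VS[x] + 1 - P_vec[x]) - abs(VS[x] - P_vec[x])
            VS[x] += 1
        D_hat[i + 1] = diff
    return D_hat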

Relation to classical edit distance

So far we have not related this work to the problem of classical Levenshtein edit distance, in which substring moves are not allowed. However, there are several parts of this work which can be applied directly to this problem. Firstly, Lemma 1.5.1 can be applied to classical edit distance, suggesting a simplification that can be made for solving approximate pattern alignment under that distance. Secondly, although the lower bound of Theorem 4.3.1 requires the use of the move operation, the upper bound does not, and so it also holds for the edit distance: ||T(b) − T(a)||1 ≤ O(log n log∗ n) · e(a, b). In fact, it is reasonable to suppose that for genuine data, using the ESP transform T would give good results for pairwise comparisons — that is, given three strings a, b, c, then e(a, c) ≤ e(b, c) ⇐⇒ ||T(c) − T(a)||1 ≤ ||T(c) − T(b)||1. This claim is borne out by experiments reported in [MS. 02]. However, in general it is possible to contrive counterexamples to this relationship.

4.6 Applications to Geometric Problems

The embedding of strings into vector spaces in an approximately distance preserving manner has many other applications directly, and with extensions. In this section, we will describe some of the important results we obtain. In contrast to our previous results, which have all been deterministic, many of these applications make use of randomised techniques.

4.6.1 Approximate Nearest and Furthest Neighbors

A fundamental open problem in string matching is that of approximate string indexing. Specifically, we are given a collection C of strings that may be pre-processed. Given a query string q, the goal is to find the string c ∈ C closest to q under string edit distance, that is, ∀x ∈ C : d(q, c) ≤ d(q, x). This is precisely the Approximate Nearest Neighbors problem of Section 1.5.4 under a string distance. Here, we focus on edit distance with moves, and let d denote this function. The approximate version of the nearest neighbors problem is to find c ∈ C such that ∀x ∈ C : d(q, c) ≤ f · d(q, x), where f is the factor of approximation. Let m = |C| be the number of strings in the collection C, and n the length of the longest string in the collection. The challenge here is again to break the “curse” of high-dimensionality, and provide schemes with polynomial pre-processing whose query cost is o(mn); that is, schemes which take less time to respond to queries than it takes to examine the whole collection. We will make use of our embedding into L1 distance and the results we saw in Section 2.4.1.

Theorem 4.6.1 With polynomial time pre-processing of a collection C, queries for approximate nearest neighbors under edit distance with moves can be answered in time O(n m^{1/2} log n), finding a string from C that is an O(log n log∗ n) approximation of the nearest neighbor with constant probability.

Proof. Essentially, we use the same basic idea that we used in Section 3.3.3. We use the existing algorithms for approximate nearest neighbors, and take care that our use of embeddings does not slow things down significantly. Firstly, we note that although we have described algorithms for Hamming

distance in Section 2.4.1, we can apply these to L1 distance by observing that each entry in our vector V(a) is bounded by n, since we cannot have more than n copies of any node in a string of length n. So we can use the unary embedding of L1 into Hamming space as outlined in Section 7. Since we are dealing with a large approximation factor inherent from the embedding into L1 space, when choosing the parameter ε for the Approximate Nearest Neighbor algorithm, we shall fix this as a constant, say 1. This gives us the number of hash functions to compute as m^{1/(1+ε)} = m^{1/2}. As with our use of permutations, we need to take care that when applying the Locality Sensitive Hash functions, we are not trying to sample from the whole (exponentially large) vector, since most of these entries are zero — as stated before, only a linear number of entries in V(a) are non-zero. So we use the same trick: we take a pass over the non-zero entries of the vector V(a), held in some suitable data structure, and use a hash function to determine whether this entry would be sampled by the locality sensitive hash function. That is, in the precomputation phase, we first compute the locality sensitive hash function by picking m locations to sample, and store these in a table indexed by the name of that location. To compute this function for a new query requires a single pass over all the non-zero entries of V(a): for each entry, we look it up in the table we have computed. If it is present, then we include this location in our computation of the hash function. Else, we do not. There are O(n) names to look up, hence each of the O(m^{1/2} log n) hash functions is computed in time O(n) (we assume a data structure supporting constant time look-ups). This absorbs the cost of computing the ESP parsing and embedding of the query string. ✷
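The pre-computation and the query pass just described can be sketched as follows (a minimal illustration, with names of our choosing; V(a) is assumed to be a table from node names to counts, and the unary embedding into Hamming space is elided, so this corresponds most closely to the compression distance case of the corollary below):

import random

def pick_sample_locations(num_locations, name_space=2**32, seed=0):
    """Precompute the coordinates sampled by one locality sensitive hash
    function, stored in a set so that each look-up takes constant time."""
    rng = random.Random(seed)
    return {rng.randrange(name_space) for _ in range(num_locations)}

def lsh_bucket(sparse_V, sample_locations):
    """One pass over the non-zero entries of V(a): a sampled coordinate absent
    from the table is implicitly zero, so the bucket is determined by the
    sampled non-zero entries alone."""
    return tuple(sorted((x, c) for x, c in sparse_V.items() if x in sample_locations))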

Corollary 4.6.1 With polynomial time pre-processing of a collection C, compression distance approximate nearest neighbors queries can be answered in time O(n k^{1/2} log n), finding a string from C that is an O(log n log∗ n) approximation of the nearest neighbor with constant probability.

Proof. It is straightforward to switch the above discussion from edit distance with moves to compression distance: it is simply a matter of changing from using the L1 embedding to using the (zero-based) Hamming distance embedding. So here, we do not need to use the unary embedding of L1 into Hamming space, but can instead directly sample from V(a) using locality sensitive hash functions on those entries that are non-zero. In all other respects, the algorithm is identical. ✷

The approach for Nearest Neighbors also applies to the Furthest Neighbors problem where, given a set of strings we must pre-process these so as to be able to rapidly find the furthest string in the collection from a query string.

Lemma 4.6.1 A set D of k strings of length at most n can be pre-processed so that, given q, with constant probability we find c satisfying ∀x ∈ D : d(q, c) ≥ d(q, x)/O(log n log∗ n), in time O(n log k + log³ k).

Proof. We will make use of the technique described in Section 2.4.2, which applies in L2 space. So we first use a simple transformation to embed from L1 space into L2 space: we express our quantities in unary, recalling that no count can be greater than n. We can then use the method outlined in Theorem 2.2.1 to embed each transformed string from this unary Hamming space into L2 space of dimensionality ℓ = O(log k). The furthest neighbor results from Section 2.4.2 then immediately imply that an approximate O(1)-furthest neighbor in the target space can be found in time O(ℓ² log k) per query following pre-processing of the data and query. The time to process the query string is just that to compute the embedding into L1 space and then project this into L2 space, which takes total time O(n log k). ✷

Again, the same technique can be used for Compression distance, by using the Hamming embedding directly rather than the L1 distance embedding.

4.6.2 String Outliers

A problem of interest in string data mining is to find “outlier” strings, that is, those that differ substantially from the rest or from some other string. We study the case where we are given a query string, and we want to know if there is an outlier for this query, that is, if there is a string in the collection whose distance from the query is very large. For the purpose of this discussion, we shall consider a string to be an outlier for q if the distance is some constant fraction of the radius of the space (the greatest possible distance between two strings). Certainly, if two strings are unrelated then their distance is Ω(n). More formally, we are given a set D of k strings each of length at most n that may be pre-processed. Given a query string q, the goal is to find a string s ∈ D such that d(q, s) ≥ εn, for some constant fraction ε. We can strengthen our analysis of the embedding to show an improved result for this problem.

Lemma 4.6.2 For strings a and b, let n = max(|a|, |b|). Then

d(a, b) ≤ 2||V(b) − V(a)||1 = O(log(n/d(a, b)) log∗ n) · d(a, b)

Proof. We improve the upper bound by observing that in the top levels of the tree there are only a limited number of nodes, so although these might change many times during a sequence of editing operations, we always have the bound ||Vi(a) − Vi(b)||1 ≤ |ai| + |bi|. A worst case argument says that the greatest number of nodes that could be affected is when one level in the tree is completely altered. The size of this level is |ai| + |bi| ≤ 8(log∗ n + 10) d(a, b), and the number of nodes above it in the tree (which may all be affected) is Σ_{j≥i} |aj| + |bj| ≤ 16(log∗ n + 10) d(a, b). This follows since we have a tree where each internal node has at least two children. Below this level, we may assume that the worst case is when each edit operation contributes 8(log∗ n + 10) to ||Vj(a) − Vj(b)||1 for j < i. Since each level is at most half the size of the level below it, there are O(log(n/d(a, b))) levels below level i, and so these levels contribute a total of O(log(n/d(a, b)) log∗ n) · d(a, b) to ||V(a) − V(b)||1. Adding the contribution of level i and the levels above gives the claimed upper bound; the lower bound is as in Theorem 4.3.1. ✷

We note that since the approximation depends on log n/d, the quality of the approximation actually increases the less alike the strings are. This improved approximation helps in the outliers problem. To solve this problem, we again use the solution to approximate furthest neighbors.

Theorem 4.6.2 We pre-process a set C of strings in time O(kn poly-log(kn)). For a query q, we either return an approximate outlier s or a null value. If returned, s will satisfy the property that d(q, s) ≥ εn/O(log∗ n) with constant probability, and if t is an outlier for q in C, then d(q, s) ≥ d(q, t)/O(log∗ n); hence it is an O(log∗ n) approximation. This requires time O(n log k + log³ k) per query. If no outlier is reported, then there is no outlier for q.

Proof. This follows by using the above procedure for finding approximate furthest neighbors under string edit distances. We can reuse the algorithm for furthest neighbors, and analyse it using the improved bound of Lemma 4.6.2. Suppose the distance of the furthest neighbor is approximated as d̂ ≥ εn. Then the true distance must be at least εn/O(log∗ n), and so s is returned as an outlier. On the other hand, if no string is at distance at least εn/O(log∗ n) from the query, then the furthest neighbors procedure can return no string whose approximate distance from the query is ≥ εn. So if no outlier is found this way, then there is no outlier. ✷

Once more, this result also applies to the compression distance with only minor alterations to the proof.

4.6.3 Sketches in the Streaming model

We consider the embedding of a string a into a vector space as before, but now suppose a is truly massive, too large to be contained in main memory. Instead, the string arrives as a stream of characters in order: (s1,s2 ...sn). So here we are computing with an ordered stream (see Section 2.1.2). The result of our computations is a sketch vector for the string a.

Theorem 4.6.3 A sketch sk(V(a)) can be computed in the streaming model to allow approximation of the string edit distance with moves using O(log n log∗ n) space. For an appropriate combining function f, d(a, b) ≤ f(sk(V(b)), sk(V(a))) ≤ O(log n log∗ n) · d(a, b) with probability 1 − δ. Each sketch is a vector of length O(log 1/δ) that can be manipulated in time linear in its size. Sketch creation takes total time O(n log∗ n log 1/δ).

Proof. It is precisely the properties of ESP that ensure edit operations have only local effect on the parse structure that also allow us to process the stream with very little space requirements. Since Lemma 4.3.1 tells us that the parsing of any symbol at any level aj depends only on at most O(log∗ n) adjacent symbols, we only need to have these symbols in memory to make the parsing. This is true at every level in the parsing: only O(log∗ n) nodes at each level need to be held in memory to make the parsing. When we group nodes of level i together to make a node of level i + 1 we can conceptually “pass up” this new node to the next level in the process. Each node corresponds to addition of one to an entry in V(a). However, we cannot store all the entries of the vector V(a) without using linear space. Instead, we can store a short summary of V(a) which can be used as a surrogate for distance computations. We use our earlier techniques for stream computations to do this, in particular we use the result of Theorem 2.2.2 to make a sketch of V(a) that will be good for approximating the L1 distance between this vector and similarly formed sketches. These give a fixed probability 1 − δ of getting the answer within a factor of 1 ± ε. Since we are already approximating up to large approximation factors, we may as well set ε to be some fixed constant here, hence the size of the sketch depends on O(log 1/δ). The requirement for a naming function hash() is solved by using Karp-Rabin signatures for the substrings (see Section 2.1.3). These have the useful property that the signature for a long substring can be computed from the signatures of its two or three component substrings (see Lemma 2.1.2). Thus, we can compute entries of V(a) and so build a sketch of V(a) using poly-logarithmic space. Since V(a) can be used to approximate the string edit distance with moves up to a O(log n log∗ n) factor, it follows that these sketches achieve the same order of approximation. Overall, the total working space needed to create the parsing is log n levels each keeping information on log∗ n nodes, totalling O(log n log∗ n) space. ✷

This type of computation on the data stream is tremendously useful in the case where the string is too large to be stored in memory, and so is held on secondary storage, or is communicated over a network. Sketches allow rapid comparison of strings: hence they can be used in many situations to allow approximate comparisons to be carried out probabilistically in time O(log 1/δ) instead of the O(n) time necessary to even inspect both strings. It allows distributed parties to compare their strings using only O(log n log 1/δ) bits of communication, a significantly sublinear amount.

4.6.4 Approximate p-centers problem

The p-centers problem asks us to process a set of k strings of length O(n) and find p strings C = c1,...cp. The centers ci define a partitioning of the strings into sets Ci where a ∈ Ci ⇐⇒ ∀ j : d(a, ci) ≤ d(a, cj ).

The set of centers C is chosen with the aim that diameter(C) = max_i max_{a,b∈Ci} d(a, b) is minimised. This is another case where the improved bound of Lemma 4.6.2 can help.

Theorem 4.6.4 A set of p-centers Ĉ can be found in time O(k(p + n) log n) so that

∀C : diameter(Ĉ) ≤ O(log n log∗ n) · diameter(C)

Proof. We will use the algorithm of Gonzalez (see Section 2.4.3) which chooses the p-centers from the strings themselves. If the optimal solution is Copt then let dopt = diameter(Copt). Firstly, we create for each string a short summarising sketch as described in Theorem 4.6.3. We set δ = O(1/n²) to ensure that the probability of any distance comparison falling outside the approximation bounds is at most a small constant. The sketch construction takes time O(kn log n log∗ n). We next initialise the set of centers with an arbitrary input string. Then we repeatedly add to the set the string that maximises the (approximate) distance to the set of centers, until there are p centers. Now consider the (p + 1)st center, which is a distance d from the other centers. This means that there must be a cluster of diameter d and so d ≤ dopt. We may have been unlucky, and overestimated all the distances that caused us to pick out the p-centers. If this is the case, then the actual size of these clusters could be as high as 2 dopt O(log n log∗ n), but no higher (from the approximation bounds of Lemma 4.6.2). Therefore, in this case the approximation is a factor of O(log n log∗ n) above the optimal, and we make O(kp) comparisons of sketches. ✷
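The selection step above is the farthest-first traversal of Gonzalez; a minimal sketch, written against an abstract distance function dist (for example, approximate distances obtained by comparing the sketches of Theorem 4.6.3) and with a function name of our choosing, is:

def gonzalez_centers(strings, dist, p):
    """Farthest-first traversal: start from an arbitrary item, then repeatedly add
    the item whose (approximate) distance to its nearest chosen center is largest.
    Uses O(p * len(strings)) distance evaluations."""
    centers = [strings[0]]
    nearest = [dist(strings[0], s) for s in strings]   # distance to the closest center so far
    while len(centers) < p:
        i = max(range(len(strings)), key=nearest.__getitem__)
        centers.append(strings[i])
        nearest = [min(nearest[j], dist(strings[i], strings[j])) for j in range(len(strings))]
    return centers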

4.6.5 Dynamic Indexing

Thus far we have considered strings to be static immutable objects. The Dynamic Indexing problem is to maintain a data structure on a set of strings so that, under certain permitted editing operations on individual strings, we can rapidly compute the (approximate) distance between any pair of strings. A similar scenario was adopted in [MSU97] where the problem is to maintain strings under editing operations to rapidly answer equality queries: this is a special case of the general dynamic indexing problem we address. The technique was developed in [ABR00] for dynamic pattern matching: finding exact occurrences of one string in another. Our situation is that we are given a collection of strings to pre-process. We are then given a sequence of requests to perform on the collection on-line. The requests are of the following types:

1. perform a string edit operation (inserts, deletes, changes, substring moves) on one of the strings;

2. perform a split operation on one of the strings — split a string a into two new strings a[1 : i] and a[i + 1 : |a|] for a parameter i;

3. perform a join operation on two strings — create a new string b from a1 and a2 as a1a2;

4. return an approximation to d(a, b) for any two strings a and b in the collection.

We consider a set of strings whose total length is n. For simplicity, we assume here that the operations of split and join are non-persistent — their input strings are lost following the operation.

Theorem 4.6.5 Following O(n log n log∗ n) pre-processing time using O(n log n) storage, approximating the distance between two strings from the collection takes O(log n) time. This gives an O(log n log∗ n) approximation with constant probability. Edit operations upon strings of type (1), (2) or (3) above take O(log² n log∗ n) time each.

Proof. For each string a in the collection, we shall maintain ET(a), and for each node in ET(a) we shall store a sketch of size O(log n) corresponding to the characteristic vector of the ESP subtree rooted at that node, and a Karp-Rabin signature for the substring corresponding to the concatenation of the leaves of that subtree. We show that each update operation (insertion, deletion, replacement, move, split, join) can be performed efficiently to generate a′ from a while ensuring that the parse tree ET(a′) is correctly maintained, as well as the ESP subtrees at every node. We again make use of Theorem 4.3.1 and its proof, to show that any edit operation will require the reparsing of no more than O(log∗ n) symbols for each level i in ET(a). For character inserts, deletes and changes, we only need to reparse an O(log∗ n) region at each level in ET(a). This follows from Lemma 4.3.1. The nodes can be changed, and the stored sketches and

signatures updated using arithmetic operations on them. As noted in the proof of Theorem 4.3.1, moves can be handled by retaining the parsing of all but the fringes of the substring that is being moved. Using pointer manipulations, the move of a large substring can be dealt with using only O(log n) pointer changes. The fringes must be reparsed in the same way as with character operations, at a cost of O(log∗ n) operations per level. Similarly, split and join operations can be handled by just reparsing the O(log∗ n) symbols which are affected by the split or join at every level. We can update the sketches and signatures for each node as this is going on, since these can be manipulated arithmetically. This is a consequence of the composability of these sketches, as described in Theorem 2.2.3. Requests of type (4) are handled by taking the sketches of the strings in question, and using these to generate the approximation of their distance. Since we can maintain a sketch for every string in a collection, it follows from Theorem 4.6.3 that we can approximate the distance between any pair accurate up to a factor of O(log n log∗ n). ✷

4.7 Discussion

This chapter has focussed on studying the main problems as applied to string distances. The main results shown here are:

• A hierarchical parsing of strings that parses substrings in a manner that is resilient to edit operations. This is an attempt to give as simple as possible a presentation of such parsings.

• The embedding of string edit distance with moves into the L1 metric based on using the Edit Sensitive Parsing of strings, and the detailed proof of the approximation factor of the embeddings.

• Embeddings of related distances such as the Compression distance, and the Compression distance with unconstrained deletes, using the same Edit Sensitive Parsing, and similar proof techniques.

• A solution to Approximate Pattern Matching under string edit distance with moves, and solutions to geometric problems such as Approximate Nearest Neighbors for these string distances.

With only minor modifications, the techniques in this chapter can allow the distances being approximated to incorporate additional operations such as linear scalings, reversals, and similar block operations. However, the outstanding open problem is to understand the standard string edit distance matching problem (or quite simply computing the standard edit distance between two strings in the sketch or stream model) where substring moves are not allowed.

Chapter 5

Stables, Subtables and Streams

Bypasses are devices which allow some people to drive from point A to point B very fast whilst other people dash from point B to point A very fast. People living at point C, being a point directly in between, are often given to wonder what’s so great about point A that so many people of point B are so keen to get there, and what’s so great about point B that so many people of point A are so keen to get there. They often wish that people would just once and for all work out where the hell they wanted to be. Mr Prosser wanted to be at point D. Point D wasn’t anywhere in particular, it was just any convenient point a very long way from points A, B and C. He would have a nice little cottage at point D, with axes over the door, and spend a pleasant amount of time at point E, which would be the nearest pub to point D. His wife of course wanted climbing roses, but he wanted axes. He didn’t know why — he just liked axes. [Ada79]

5.1 Introduction

We describe an experimental study of some of the dimensionality reduction techniques described in Chapter 2. It is impractical and unnecessary to study every technique there, so instead we focus on applications of the methods for approximating vector Lp norm distances described in Section 2.2.4. Many of the results in other chapters rely on specifically these approximations, and so it is important to show that these work in practice as well as in theory. We consider two applications of these methods, to problems derived from real-world situations. In both cases, these require non-trivial extensions to the sketching methods. We will look at answering questions on massive data streams, to give information about a single stream and about the similarity of two separate streams. Here, the low memory overheads of sketches will be vital. We will also consider problems of data mining on massive tables of data. This gives further challenges for keeping only a manageable amount of additional information, and for doing an amount of work that is much smaller than that implied by the size of the tables. We will first give outlines of the two situations, and then describe how the sketching procedures were implemented, with extensions to the sketching methods to address particular aspects of the application scenarios. We then report the results of some detailed experimental evaluation of the methods, and give comparisons to existing methods which demonstrate that these dimensionality- reducing embeddings are highly practical.

5.1.1 Data Stream Comparison

As observed in Section 2.1.2, massive streams of data that are too large to be stored in traditional databases are now routinely generated by many computer systems. Processing these streams seems to call for the support of special operations on the data. For example, one of the most basic tasks that arises in data stream processing is to compare different data streams, be they from different sources or over different time periods for the same source. Comparison of data streams reveals the structural relationship between them: whether two different streams represent similar underlying behaviour; which of a number of different historical streams does a new stream most resemble; how can a group of streams be organised into subgroups with similar characteristics. We will consider streams that define vectors. These vectors are updated dynamically as the stream flows past. We will use sketches to compute the Hamming norm of a stream, and also to compare two streams based on their (vector) Hamming distance. We now go on to describe two situations where these computations will tell us valuable information about the streams and about pairs of streams.

Auditing Network Databases

Network managers view information from multiple data stream sources. Routers periodically send traffic information: traces of IP packets and IP flows (which are aggregated IP packet flows) [Net]; there are management routine updates: SNMP traps, card/interface/link status updates, route reachability via pings and other alarms [GKPV01]; configuration information: topology and various routing tables [BSW01]. Network managers need ways to take this continuous stream of diagnostic information and extract meaningful information. The infrastructure for collecting this information is often error- prone because of unreliable transfer (typically UDP and not TCP is used for data collection); network elements fail (links go down); configuration tables have errors; and data is incomplete (not all network elements might be configured to provide diagnostic data). Continuous monitoring tools are needed to audit different data sources to ensure their integrity. This calls for “slicing and dicing” different data streams and corroborating them with alternative data

sources. We give three examples of how this can be accomplished by computing the Hamming distance between data streams.

1. Let ai be the number of transit IP packets sent from IP address i that enter a part of the network, and bi be the number of IP packets that exit that part from i. We would like to determine the Hamming Distance to find out how many transit flows are losing packets within this part of the network.

2. There are published methods for constructing flows from IP packet traces. We can take traces of IP packets, aggregate them, and generate a flow log from this. This can then be compared with the flows generated by routers [Net, CGJS02], to spot discrepancies in the network.

3. Denial of Service attacks involve flooding a network with a large number of requests from spoofed IP addresses. Since these addresses are faked, responses are not acknowledged. So the Hamming difference between the vector of addresses which issued requests and the vector of addresses which sent acknowledgements will be high in this situation [MVS01]. The Hamming norm of the difference between these two vectors provides a quick check for the presence of sustained Denial of Service attacks and other network abnormalities, and could be incorporated into network monitoring toolkits.

Maintaining distinct values in traditional databases

The Hamming norm of a stream1 is of large interest in itself. It follows from Definition 1.2.7 that this quantity is precisely the number of distinct items in the stream. For example, let a be a stream representing any attribute of a given database, so ai is the number of tuples in the database with value i in the attribute of interest. Computing the Hamming norm of a provides the number of distinct values of that attribute taken by the tuples. This is a foundational problem. We consider a traditional database table which is subject to a sequence of insertions and deletions of rows. It is of great importance to query optimisation and otherwise to know the number of distinct values that each attribute of the table assumes. The importance of this problem is highlighted by Charikar, Chaudhuri, Motwani and Narasayya [CCMN00]: “A principled choice of an execution plan by an optimiser heavily depends on the availability of statistical summaries like histograms and the number of distinct values in a column for the tables referenced in the query.” Distinct values are also of importance in statistics and scientific computing (see [Gib01, GM99, HNSS95]). Unfortunately, it is provably impossible to approximate this statistic without looking at a large fraction of the database (such as via sampling) [CCMN00]. The sketch algorithm avoids this problem by maintaining the desired statistics under database updates, so that we never have to compute them from scratch. There are many other potential applications of Hamming norm computation such as in database auditing and data cleaning. Data cleaning requires finding columns that are mostly similar [DJMS02]; Hamming norm of columns in a table can quickly identify such candidates, even if the rows are arranged in different orders. It is beyond the scope of this work to go into detail on all these applications, so we do not elaborate further on them.

5.1.2 Tabular Data Comparison

Tabular data sets are ubiquitous in data management applications. Traditional relations in database systems are tables. New applications also generate massive tabular datasets — consider the application of cellular telephone networks. These networks have base stations (cell towers) geographically distributed over the country, with each base station being responsible for handling the calls to and from a specific geographic region. Each such base station handles a vast number of calls over different

1To simplify the exposition, we will write “a norm of a stream” instead of “a norm of the implicit state vector of a stream”.

time periods, and network operators keep statistics on the traffic volume at the base stations over time. This data may be represented by a table indexed by the latitude and longitude of the base station and storing the call volume for a period of time such as an hour. As another application, consider the representation of the Internet traffic between IP hosts over time. For example, one may visualise a table indexed by destination IP host and discretised time representing the number of bytes of data forwarded at a router to the particular destination for each time period (since the current generation of IP routers route traffic based on destination IP address only, this is valuable information about network congestion and performance that IP routers routinely store and dump). In these network data management scenarios and others, massive tables are generated routinely (see [BCC+00, BGR01] for other examples). While some of the data may be warehoused in traditional relational databases, this is seldom true in case of emerging applications. Here, tabular data is stored and processed in proprietary formats such as compressed flat files [Day]. Thus tabular data is emerging as a data format of independent interest and support within large scale applications [BCH99, BCC+00, GGR00]. Of great interest in tabular data management is the task of mining the tables for interesting patterns. For example, in the examples above, the “knowledge” from tabular data analysis drives key network-management tasks such as traffic engineering and flexible capacity planning [FGL+00]. In the cellular calls case, one may be interested in finding geographic regions where the call distribution is similar: for example, can one analyse the call volume patterns to separate dense urban areas, suburban areas, commuter regions, and so on? Can one correlate hours-of-day behaviour across different geographic regions with the time zones across the world? In the Internet traffic table, one may be interested in finding IP subnets that have similar traffic distribution across different time intervals, to isolate web server farms, web crawler farms etc. In the telephone world, mining tabular data helps distinguish fax lines from voice lines [KSS99]. Thus many creative mining questions arise with tabular data. Our informal goal of mining tabular data can be instantiated in one of many standard ways: as clustering, as association rule mining or in other ways. These tasks involve comparing large (possibly arbitrary) portions of the table with each other (possibly many times). The problem of efficient similarity computation between table portions is a challenge since tabular data is massive, generated at the rate of several terabytes a month in most applications [BCC+00, BCH99, BGR01]. Tabular data becomes very large very quickly, since it grows with the product of its defining characteristics: an extra base station will take thousands of readings a day, and an extra day’s data adds hundreds of thousands of readings. As these databases grow larger, not only does the size of data stored increase, but also the size of the interesting table portions increases. With data storage capacities easily in the terabyte range, any subregions of interest to be “compared” (for example, cell call data for the Los Angeles versus San Francisco areas) can themselves be megabytes or even gigabytes in size.
This means that previous assumptions that were commonly made in mining — that comparing two objects is a basic unit of computation — no longer hold. Instead, the metric by which algorithms are judged is no longer just the number of comparisons used, but rather the number of comparisons multiplied by the cost of each comparison. Clustering is a good example of a mining task affected by growing table sizes. A typical clustering algorithm “compares” each “object” with others many times over, where each comparison involves examining the distance of one object to each of a number of others. Many good algorithms have been described in the literature already such as k-means [JD88], CLARANS [NH94], BIRCH [ZRL96, GGR00], DBSCAN [EKSX96] and CURE [GRS98]. These focus on limiting the number of comparisons. As we described earlier, when objects are large, the cost of comparisons (not just the number of comparisons) affects performance significantly. Orthogonal to the efforts mentioned above, we focus on reducing the cost of each comparison. Since what is important in clustering is often not the exact distances between objects, but rather which of a set of objects another object is closest to, and since known clustering algorithms are in their nature approximate and use randomness, we are able

to show that using randomness and approximation in distance estimation does not sacrifice clustering quality by any appreciable amount. Vectors and matrices are routinely compared based on their Lp distances for p = 1, 2 or ∞, and more unusually other integral Lp distances [FY00]. Only recently has the similarity behaviour of Lp distances with non-integral p been examined for 0 < p ≤ 2.

5.2 Sketch Computation

We describe in detail the implementation of the vector sketching procedures based on using stable distributions. For this presentation, we will assume that item identifiers are integers. The data stream is formed as a stream of tuples, where the k’th tuple is (i, dk). This indicates that we should add the integer dk to the count for item i. Clearly, we can accommodate subtractions by allowing dk to be negative. We can then represent the current state of the stream as a vector a, such that ai = l means that over all tuples for item i the total of the dk’s is l. Each update operation therefore has the effect ai ← ai + dk.

5.2.1 Implementing Sketching Using Stable Distributions

We briefly recap the procedure for computing a sketch of a vector to allow the approximation of Lp norms, as described in Section 2.2.4. We compute values ri,j for all i, j where 0 ≤ i ≤ n and 0 ≤ j ≤ m (n is the dimension of the vector a, and m the dimension of the sketch vector). Each ri,j is drawn independently from a random stable distribution with parameter p. Our update procedure on receiving tuple (i, dk) is as follows: we add dk times ri,j to each entry j in the sketch. That is,

∀1 ≤ j ≤ m : sk(a)j ← sk(a)j + dkri,j

As a result, at any point sk(a) is the dot product r · a (treating r as the matrix of values ri,j), so

sk(a)j = Σ_{i=1}^{n} ri,j · ai

It follows from the property of stable distributions that ∀j : sk(a)j is distributed as ||a||p Xj, where Xj is a random variable with a p-stable distribution. From this, each sk(a)j can be used to estimate the Lp norm, scaled by the median of the stable distribution with parameter p. We combine these values to get a good estimator for the Lp norm by taking the median of the absolute values of all entries sk(a)j. By Theorem 2.2.3, this is an (ε, δ) approximation if m is O(1/ε² log 1/δ). We need to show that this technique can be implemented in small space. So we do not wish to pre-compute and store all the values ri,j. To do so would consume much more space than simply

[Figure 5.1: The value of the median of stable distributions with parameter p < 2, determined by random simulation: the median (vertical axis, roughly 0.8 to 1.4) plotted against the value of p (horizontal axis, 0 to 2).]

recording each ai, from which the number of distinct items could easily be found. Instead, we will generate ri,j when it is needed. Note that ri,j may be needed several times during the course of the algorithm, and must take the same value each time. We can achieve this by using pseudo-random generators for the generation of the values from the stable distributions. In other words, we will use standard techniques to generate a sequence of pseudo-random numbers from a seed value (see for example [PTVF92]). We will use i as a seed to generate ri,1, and then use this stream of pseudo-random numbers random() to generate ri,2, . . . ri,m. This ensures that ri,j takes the same value each time it is used. We also need to be able to generate values from a stable distribution with a very small stability parameter. This can be done using standard methods such as those described in [Nol]. These take uniform random numbers r1, r2 drawn from the range [0 . . . 1] and output a value drawn from a stable distribution with parameter p. We denote this transform as a (deterministic) function stable(r1, r2, p). This function is defined as follows: first, we compute θ = π(r1 − 1/2). Then

stable(1/2 + θ/π, r2, p) = (sin pθ / (cos θ)^{1/p}) · (cos(θ − pθ) / (− ln r2))^{(1−p)/p}
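Written out in Python (a minimal sketch, assuming r1 and r2 are drawn uniformly from the open interval (0, 1) and that 0 < p ≤ 2):

import math

def stable(r1: float, r2: float, p: float) -> float:
    """Map two uniform (0, 1) values to one sample from a symmetric p-stable
    distribution, following the formula above."""
    theta = math.pi * (r1 - 0.5)          # uniform on (-pi/2, pi/2)
    w = -math.log(r2)                     # exponentially distributed with mean 1
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)) \
        * (math.cos(theta - p * theta) / w) ** ((1.0 - p) / p)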

5.2.2 Median of Stable Distributions

The result we find from our comparison is close to the Lp norm of the vectors, multiplied by the median of absolute values from a stable distribution with parameter p. So to find the accurate answer, we need to scale this by an appropriate scaling factor. This follows from the fact that the median of a p-stable distribution is only 1 when p = 1 or p = 2 (Gaussian distribution). We do not know how to find this scaling factor analytically, so we find it empirically instead. In Figure 5.1 we show the results of using random simulation to find the median of stable distributions for different values of the stability parameter p. Each data point represents the median of 10,000 values chosen at random from a distribution with that parameter. We then take this to the power 1/p, giving the scaling factor necessary for finding the (Lp)^p distance — this is what we will want when we are using very small values of p to approximate the Hamming norm of a sequence. We repeated the experiment seven times for each value of p that is a multiple of 0.01 less than 2. Note that there is a break at 2, where using the Gaussian distribution gives a scaling factor of exactly 1, while stable distributions generated by the transform

Algorithm 5.2.1 Algorithm to compute sketches of a stream
    for 1 ≤ i ≤ m do
        sk[i] ← 0.0
    for all tuples (i, dk) do
        random-init(i)
        for 1 ≤ j ≤ m do
            r1 ← uniformrandom(0, 1)
            r2 ← uniformrandom(0, 1)
            sk[j] ← sk[j] + dk ∗ stable(r1, r2, p)
    for 1 ≤ j ≤ m do
        sk[j] ← absolute(sk[j])
    return median(sk) / B(p)

have a lower median. However the curve is unbroken around 1, where the scaling factor is also 1. For any given value of p, we can use the median of stable distributions with that parameter to scale our results accordingly. We will denote this scaling factor by the function B(p). This gives the algorithm for maintaining the sketch of a vector: see Algorithm 5.2.1. As a side note, when we use the Lp sketches for clustering purposes only, we do not need to know B(p). This is because, within the clustering algorithms, we do not need to know the value of the distance between any two items, but only the relative size of the distance (so, for example, whether some c is closer to a or to b).
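A compact Python rendering of the simulation for B(p) and of Algorithm 5.2.1 is given below; this is a minimal sketch rather than the implementation evaluated later in this chapter. It reuses the stable() helper written out above, and regenerates each ri,j by seeding the pseudo-random generator with the item identifier i, as described in the text.

import random
import statistics

def estimate_B(p: float, trials: int = 10_000, seed: int = 1) -> float:
    """Empirical median of |X| for a p-stable X, found by simulation.
    (For very small p the samples have enormous tails, so more numerical
    care is needed than this simple version takes.)"""
    rng = random.Random(seed)
    return statistics.median(abs(stable(rng.random(), rng.random(), p))
                             for _ in range(trials))

class StableSketch:
    """sk[j] = sum_i a_i * r_{i,j}, where each r_{i,j} is p-stable and is
    regenerated on demand rather than stored."""
    def __init__(self, m: int, p: float):
        self.m, self.p = m, p
        self.sk = [0.0] * m

    def update(self, i: int, dk: float) -> None:
        rng = random.Random(i)                 # random-init(i): r_{i,j} is reproducible
        for j in range(self.m):
            self.sk[j] += dk * stable(rng.random(), rng.random(), self.p)

    def norm_estimate(self, B_p: float) -> float:
        """median_j |sk[j]| / B(p) approximates ||a||_p."""
        return statistics.median(abs(v) for v in self.sk) / B_p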

Counting Distinct Elements

It is straightforward to restrict our solution above to the problem of approximately counting the number of distinct elements. We will find the Lp norm of the stream, setting p to be small in accordance with Theorem 2.3.3. Initially, we have seen no items, so ai is zero for all i, and sk(a)j = 0 for all j. We then treat every insertion of an element i as a tuple of the form (i, +1), and every deletion as a tuple (i, −1). Following a number of operations, ai will then represent the number of items i that have been seen. The Hamming norm allows us to find the number of distinct elements: provided that ai ≥ 0 for all i, the number of distinct elements is the number of i for which ai > 0. Note that the behaviour of the sketching algorithm following the removal of an item that has a zero count in a is well-defined. In Theorem 2.2.3, it is the absolute values of the entries ai of the notional vector a that are found. Hence, if some entry in a is negative (indicating that there have been more removals of that item than insertions), then this will be counted as 1 towards the total of distinct elements. This is the result that we want in situations such as network monitoring and so on. The ability to sum sketches also allows us to compute sketches in a distributed manner and then combine the results. Suppose that a stream is split into two parts: one observer sees stream a, and the other sees b. We wish to compute a sketch that is good for the merged stream, that is, the sketch that would be formed if one observer had seen both a and b. The value of the underlying vector is clearly a + b, which by Observation 2.2.1 can be sketched as sk(a) + sk(b). This allows the computation of measurements on distributed data streams. The theory has so far spoken in terms of vectors. However, it is conceptually simple to shift this theory from one-dimensional vectors to two-dimensional matrices, thanks to the nature of the Lp norms: we can think of any matrix as being represented by a vector that is linearised in some consistent way. It follows that we can replace a vector a with a matrix A, and the above argument carries through.
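A short usage sketch of these two observations, using the StableSketch and estimate_B helpers above (p = 0.25 and m = 64 are arbitrary illustrative choices; p = 0.25 is already small enough here because every non-zero count in the merged vector is 1, whereas the text suggests much smaller p when counts can be large):

p, m = 0.25, 64
B = estimate_B(p)

left, right = StableSketch(m, p), StableSketch(m, p)
left.update(7, +1); left.update(7, +1); left.update(9, +1)     # insertions seen at one site
right.update(7, -1); right.update(3, +1)                       # a deletion and an insertion at another

merged = StableSketch(m, p)
merged.sk = [x + y for x, y in zip(left.sk, right.sk)]         # sketch of the merged stream
distinct = merged.norm_estimate(B) ** p                        # roughly the number of non-zero entries (3 here)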

Algorithm 5.2.2 Algorithm to compute sketches for a table
    for 1 ≤ i ≤ k do
        for 0 ≤ m ≤ a − 1 do
            for 0 ≤ n ≤ b − 1 do
                r1 ← uniformrandom(0, 1)
                r2 ← uniformrandom(0, 1)
                R[i, m, n] ← stable(r1, r2, p)
    for 1 ≤ i ≤ k do
        for 1 ≤ j ≤ x do
            for 1 ≤ l ≤ y do
                for 0 ≤ m ≤ a − 1 do
                    for 0 ≤ n ≤ b − 1 do
                        sketch[i, j, l] ← sketch[i, j, l] + data[j + m, l + n] ∗ R[i, m, n]

The basic algorithm to compute sketches for all subtables of a table: suppose the table is size x × y, and we want to make sketches of every subtable of size a × b, then we form the dot product of each such subtable with k randomly created matrices.

5.2.3 Faster Sketch Computation

For a given vector or matrix, its sketch is a short real-valued vector as defined previously. Each entry in the sketch is the dot-product of the object (vector or matrix) with a number of randomly created objects (as specified above, using values from stable distributions), necessarily of the same size. When dealing with tabular data, we are interested in computing sketches for all subtables of some large set of tabular data. We describe this in two steps:

• We construct the sketch for all subtables of a fixed dimension (size) very fast using the Fast Fourier Transform.

• We choose a small, canonical set of dimensions and construct sketches for all subtables of that dimension. The pool of all such sketches is used to quickly compute the sketch of a subtable of any arbitrary dimension.

Computing sketches for all subtables of a fixed size.

We first focus on computing sketches with a fixed subtable size, and computing the sketch for all subtables of that size. The definitions above immediately translate into algorithms. A pre-processing phase can compute the necessary k different R[i] matrices from an appropriate stable distribution. Note that here we are not performing these computations in the stream (performing the clustering is an offline computation rather than an online one), and so we can afford the extra memory required to store the values of the random variables for speed purposes. If so desired, it is possible to dispense with these, at the cost of longer computation times. Where we have tabular data, any subtable of fixed size defines a matrix that we would like to make a sketch of. Each sketch can then be computed by finding the dot product of each of the random matrices with all sub-rectangles of the tabular data. In our scenario, when we are dealing with tabular data, we could consider each sub-rectangle of fixed size in turn, and compute the sketches individually. Let the data have width x, height y, and the subtables of interest have size a × b. The algorithm is outlined in Algorithm 5.2.2. If the total size of the input data is N = xy and the sub-rectangle size is M = ab then the total cost of this approach is O(kMN) computations. This is because for each of the N locations in the data table we must multiply k different random matrices with a subtable of size M. This can be improved, since the basic computation is

Algorithm 5.2.3 More efficient sketch computation using the Fast Fourier Transform
    for 1 ≤ i ≤ k do
        for 0 ≤ m ≤ a − 1 do
            for 0 ≤ n ≤ b − 1 do
                r1 ← uniformrandom(0, 1)
                r2 ← uniformrandom(0, 1)
                R[i, m, n] ← stable(r1, r2, p)
    data ← FFT(data)
    for 1 ≤ i ≤ k do
        sketch[i] ← InverseFFT(Convolve(FFT(R[i]), data))

simply the convolution of two matrices, which can be computed more quickly using two-dimensional Fast Fourier Transforms (FFT).

Theorem 5.2.1 All sketches of fixed size sub-rectangles of the data can be made efficiently using the Fast Fourier Transformation taking time O(kN log M).

Proof. We firstly observe that to find the sketch of any sub-rectangle requires that we find the dot product of pre-computed random matrices with that sub-rectangle. If we find the sketches for all sub-rectangles, then we must find the dot product with every sub-rectangle. This is essentially the convolution of each of the random matrices with the data (technically, the convolution finds the dot product with the smaller matrix flipped along its horizontal and vertical axes; but since each entry of the random matrices is drawn in the same way from a random distribution, it does not matter if the matrix is notionally upside down and back to front, because it still has the same properties). It is well known that n-dimensional convolutions can be done quickly using the n-dimensional Fourier transform, and so here we can find the 2-dimensional Fourier transform of the data and the matrices, and perform the convolution in the Fourier domain. Careful use of Fourier Transforms then reduces the time complexity of sketching to O(kN log M) [PTVF92]. This technique is outlined in Algorithm 5.2.3. ✷
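A minimal sketch of this computation in Python, using scipy's FFT-based convolution (the function name is ours; correlating with the kernel flipped in both axes gives exactly the dot product of R[i] with every a × b subtable, as discussed in the parenthetical remark above):

import numpy as np
from scipy.signal import fftconvolve

def all_subtable_sketches(data: np.ndarray, a: int, b: int, k: int, p: float, seed: int = 0):
    """sketch[i, j, l] = dot product of R[i] with the a-by-b subtable of `data`
    whose top-left corner is (j, l); total time O(k N log M) via the FFT."""
    rng = np.random.default_rng(seed)
    # Draw k random a-by-b matrices from a p-stable distribution using the same
    # Chambers-Mallows-Stuck style transform as the stable() helper above.
    theta = np.pi * (rng.random((k, a, b)) - 0.5)
    w = -np.log(rng.random((k, a, b)) + 1e-300)
    R = (np.sin(p * theta) / np.cos(theta) ** (1 / p)) \
        * (np.cos(theta - p * theta) / w) ** ((1 - p) / p)
    # Correlation = convolution with the kernel flipped along both axes.
    return np.stack([fftconvolve(data, Ri[::-1, ::-1], mode="valid") for Ri in R])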

This is a generalisation to tabular data of the technique in [IKM00] for one-dimensional time series data. A further improvement that can be made is to compute the sketches in parallel. Note that each component of the sketches is independent of the others. Therefore, provided parallel access to the data is available, the job of computing sketches can be divided up between multiple processors. Alternatively, the data itself can be divided between multiple computing nodes, and sketches for each subsection computed separately (however, these sections must overlap so all subtables can be sketched, and so the total amount of work required increases slightly).

Canonical sizes and combining sketches

Next we consider computing sketches for many different sizes. Once we have all sketches for sub-rectangles of a particular size, we can combine these to make a sketch for a rectangle up to twice the size in either dimension. Suppose that four independent sets of sketches, sk^p_w, sk^p_x, sk^p_y, sk^p_z, have been computed for sub-rectangles of dimensions a × b. Provided a ≤ c ≤ 2a and b ≤ d ≤ 2b, a sketch can be made for the sub-rectangle of size c × d. Let the data be a large table, Z, so the sketch sk(Z[i, j]) will cover the sub-rectangle from Z[i, j] to Z[i + a, j + b].

Definition 5.2.1 A compound sketch, sk^p, can be formed as follows:

sk^p(Z[i, j]) = sk^p_w(Z[i, j]) + sk^p_x(Z[i + c − a, j]) + sk^p_y(Z[i, j + d − b]) + sk^p_z(Z[i + c − a, j + d − b])

2Technically, the convolution finds the dot product with the smaller matrix flipped along its horizontal and vertical axes. But since each entry of the random matrices is made in the same way — by drawing from a random distribution — then it doesn’t matter if the matrix is notionally upside down and back to front, because it still has the same properties.

Figure 5.2: Compound sketches can be formed from four sketches that represent the area required.
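To make the combination step concrete, here is a minimal Python sketch of Definition 5.2.1. It assumes the four sketch sets are arrays of shape (k, ·, ·), such as those produced by the hypothetical all_subrectangle_sketches routine above; the function name and interface are ours.

    def compound_sketch(sk_w, sk_x, sk_y, sk_z, i, j, a, b, c, d):
        # Tile the c x d rectangle anchored at (i, j) with four overlapping
        # a x b sketches, summed component-wise as in Definition 5.2.1.
        assert a <= c <= 2 * a and b <= d <= 2 * b
        return (sk_w[:, i, j]
                + sk_x[:, i + c - a, j]
                + sk_y[:, i, j + d - b]
                + sk_z[:, i + c - a, j + d - b])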

In effect, the new sketch is formed by tiling the required sub-rectangle with sketches of overlapping rectangles. This is shown in Figure 5.2. From this definition, we state the theorem that shows that this compound sketch still has the desirable properties of a sketch.

Theorem 5.2.2 For two compound sketches created as above, with probability 1 − δ, we have

||A − B||_p (1 − ε) ≤ B(p) · median(|sk^p(A) − sk^p(B)|) ≤ 4(1 + ε) ||A − B||_p

Proof. Recall that B(p) is the scaling factor used to make the approximation of the Lp distance. We consider the two extremes. Firstly, when 2a = c and 2b = d, then with probability at least 1 − 4δ we have

(1 − ε)||A − B||_p ≤ B(p) · median( sk^p_w(A[0, 0]) + sk^p_x(A[a, 0]) + sk^p_y(A[0, b]) + sk^p_z(A[a, b])
                                    − sk^p_w(B[0, 0]) − sk^p_x(B[a, 0]) − sk^p_y(B[0, b]) − sk^p_z(B[a, b]) )

giving the lower bound, since these four regions do not overlap and hence have the same properties as one sketch covering the whole area. The other extreme occurs when a = c and b = d, where with probability at least 1 − 4δ

B(p) · median( sk^p_w(A) + sk^p_x(A) + sk^p_y(A) + sk^p_z(A) − sk^p_w(B) − sk^p_x(B) − sk^p_y(B) − sk^p_z(B) ) ≤ 4(1 + ε)||A − B||_p

for the upper bound. ✷

For intermediate situations, we can consider the arrangement in Figure 5.2. We assume that each of the four sketches estimates the Lp distance accurately up to a factor of (1 ± ε). Where all four sketches overlap, the contribution to the Lp distance is counted four times, so the estimate could be as much as 4(1 + ε) times the true value. However, this worst case only occurs when the majority of the difference between the two subtables is concentrated in the area of overlap. In practice, we would expect the differences to be spread out within the whole subtable, and hence the accuracy to be better. Recall that for clustering purposes it is not the absolute accuracy that is important, but the relative sizes of the errors, since we compare one object against several others to see which it is closest to. This theorem shows that, by summing four sketches component-wise, we can make reasonable sketches from those for smaller overlapping subtables. Therefore we choose a canonical collection of sizes: we fix sizes 2^i × 2^j for all integer values of i and j up to the logarithm of the dimensions of the data, and compute sketches for all subtables of size 2^i × 2^j using Theorem 5.2.1. Following that, we can compute the sketch for any c × d sized subtable using these sketches and Theorem 5.2.2. Therefore, we conclude

Theorem 5.2.3 Given any tabular data of size N = n × n, in time O(kN log^3 N) we can compute all the O(log^2 N) sets of sketches corresponding to the canonical sizes. Following that, the sketch for any subtable can be computed in time O(k). For the Lp distances of our interest, k = O(log(1/δ)/ε^2), and each sketch is a (4 + ε)-approximation with probability at least 1 − δ.
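A possible rendering of the look-up step in code (Python): canonical is assumed to map each power-of-two size pair to four independent all-sub-rectangle sketch arrays, and compound_sketch is the hypothetical helper sketched after Definition 5.2.1.

    def query_sketch(canonical, i, j, c, d):
        # Pick the canonical size a x b with a <= c <= 2a and b <= d <= 2b:
        # the largest powers of two not exceeding c and d respectively.
        a = 1 << (c.bit_length() - 1)
        b = 1 << (d.bit_length() - 1)
        sk_w, sk_x, sk_y, sk_z = canonical[(a, b)]
        return compound_sketch(sk_w, sk_x, sk_y, sk_z, i, j, a, b, c, d)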

Note that there are O(N^2) subtables, and computing sketches for some of these requires O(N) time. However, the procedure above produces sketches for them significantly more efficiently than the time it would take to compute any large subset of such sketches directly. All the discussions above generalise to tables with more than two orthogonal components; the issues are simple if a small constant number of orthogonal components are involved. This applies, for example, to the cellular data case, which is indexed by latitude and longitude as well as time period.

5.2.4 Implementation Issues

For computing with stable distributions, we implemented the method of [Nol] to generate stable distributions with arbitrary stability parameters. This uses a method similar to the Box-Muller procedure for generating values from Normal distributions: a transformation takes two values drawn from a uniform [0, 1] distribution and outputs a value drawn from the appropriate stable distribution. For computing the Hamming norm, we set the stability parameter p of the stable distribution to be as low as possible. However, as p gets closer to zero, the values generated from stable distributions get significantly larger, eventually exceeding the range of floating point representation. Through experimentation, the smallest value of p that did not generate floating point overflow was found to be 0.02. Hence we set p to this value, and empirically found the median of the stable distribution as generated by this procedure. This median is used in the scaling of the result to get the approximation of the Hamming norm. Note that using p = 0.02 means that, if one element occurs a million times, then its contribution to the Hamming norm will be (10^6)^0.02 = 1.318, so this gives a worst case overestimate of 32%. This could be a large source of error, although even this level of error is likely to be acceptable for many applications; in fact, we shall see that most of our experiments show an error of much less than 10%.

Computing the approximate distance using sketches requires finding the median of a set of numbers. There are many ways to find the median: the simplest is to sort the numbers, then read out the middle value. This approach is somewhat wasteful, since it is not necessary to have the values sorted. At the other extreme, there are methods that are guaranteed to take a linear amount of time, but in practice the constant factors make these expensive. Instead, we adopted the randomised procedure which is expected to find the median in linear time (described in [CLR90] and [PTVF92] amongst others). We are confident in adopting this procedure because (1) the data being searched is generated based on randomly chosen values and so is unlikely to create the "bad cases" that can lead to excessive running time, and (2) because many thousands of comparisons may be made in the course of the execution of an algorithm, the total time spent finding medians will tend towards the expected (linear) cost.
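To make the scaling step concrete, here is a minimal sketch (Python/numpy) of how a Hamming norm estimate might be recovered from the k sketch values. The function name and the exact form of the scaling are our reading of the procedure, not code from the thesis: the median of the absolute sketch values is divided by the empirically found median of the stable distribution, giving an estimate of the Lp norm, which is then raised to the power p to approximate the number of non-zero entries.

    import numpy as np

    def hamming_norm_estimate(sketch, stable_median, p=0.02):
        # sketch: k dot products of the implicit vector with p-stable random vectors.
        # median(|sketch|) / median(|stable|) estimates the Lp norm; raising to the
        # power p gives an estimate of sum_i |a_i|^p, which for small p is close to
        # the Hamming norm (count of non-zero entries).
        return (np.median(np.abs(sketch)) / stable_median) ** p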

Clustering Using the k-means Algorithm

Algorithm 5.2.4 The k-means algorithm
  for 1 ≤ i ≤ k do
    kmeans[i] ← random item from data
    centroid[i] ← 0
    count[i] ← 0
  repeat
    for all item ∈ items do
      mindist ← ∞
      for 1 ≤ i ≤ k do
        if dist(item, kmeans[i]) < mindist then
          mindist ← dist(item, kmeans[i])
          best ← i
      centroid[best] ← centroid[best] + item
      count[best] ← count[best] + 1
    for 1 ≤ i ≤ k do
      kmeans[i] ← centroid[i] / count[i]
      centroid[i] ← 0
      count[i] ← 0
  until the clustering is unchanged or the iteration limit is reached

The use of approximate distance computations is tested by adapting the k-means algorithm. This algorithm is a simple way of generating a clustering with k clusters. The k-means are initialised to be k randomly chosen items from those to be clustered. Each iteration allocates each of the items to be clustered to the mean that it is closest to. Then a set of new means is generated as the average of all the items in each cluster. This procedure is iterated until the clustering is unchanged, or an iteration limit is reached. This is shown in Algorithm 5.2.4. Further details of this algorithm are described in [JD88]. We can apply this algorithm with either exact distance computations or approximate distance computations.
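For the approximate variant, the distance routine might look like the following minimal sketch (Python/numpy), which applies the median estimator of Theorem 5.2.2 to pre-computed sketch vectors. The name and the scale parameter (standing in for the B(p) factor) are assumptions for illustration.

    import numpy as np

    def sketch_distance(sk_a, sk_b, scale):
        # Approximate Lp distance between the objects summarised by sk_a and sk_b:
        # the median of the component-wise absolute differences, scaled by B(p).
        return scale * np.median(np.abs(sk_a - sk_b))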

5.3 Stream Based Experiments

We used our method of sketches formed with stable distributions to approximate the Hamming norm of a stream and the Hamming distance between streams. We also implemented Probabilistic Counting as described in Section 2.3.2 for approximating the number of distinct elements, since this is the method that comes closest to being able to compute the Hamming norm of a sequence. Experiments were run on a Sun Enterprise Server, on one of its UltraSparc 400MHz processors. To test our methods, we used a mixture of synthetic data generated by random statistical distributions, and real data from network monitoring tools. For this, we obtained 26Mb of streaming NetFlow data [Net] from a major network. We performed a series of experiments, firstly to compare the accuracy of using sketches against existing methods for counting the number of distinct elements. We started by comparing our approach with the probabilistic counting algorithm for the insertions-only case (when there are no deletions). We then investigated the problem for sequences where both insertions and deletions were allowed. Next we ran experiments on the more general situations presented by network data, with sequences where entries in the implicit vectors are allowed to be negative. As mentioned earlier, probabilistic counting techniques fail dramatically when presented with this situation. Finally, we ran experiments for computing the Hamming distance between network data streams and on the union of multiple data streams. We used only the stable distributions method, since this is the only method which is able to solve this problem using very small working space in the data streams model.

For our experiments, the main measurement that we gathered is how close the approximation was to the correct value. This was done by using exact methods to find the correct answer (exact), and then comparing this to the approximation (approx). A percentage error was then calculated simply as (max(exact, approx)/min(exact, approx) − 1) × 100%.

Figure 5.3: Results for insertions based on a Zipf distribution. The relative error of both methods (FM85 probabilistic counting and sketching) for counting the number of distinct elements, as generated by Zipf distributions with skewness parameter varying from 0 to 4.
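For concreteness, the error measure defined above translates into a one-line helper (Python; the function name is ours):

    def pct_error(exact, approx):
        # Symmetric relative error: (max/min - 1) x 100%.
        return (max(exact, approx) / min(exact, approx) - 1.0) * 100.0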

Timing Issues

Under our initial implementation, the time cost of using stable distributions against using probabilistic counting was quite high — a factor of about six or seven times, although still only a few milliseconds per item. The time can be much reduced at the cost of some extra space usage. This is due to the fact that the majority of the processing time is in creating the values of the stable distribution using a transformation from uniform random variables. By creating a look-up table for computing a stable distribution with a fixed stability parameter we could avoid this processing time. This table is indexed by values from a uniform distribution, and a value interpolated linearly from the nearby values in the table. The extra space cost could be shared, if there are multiple concurrent computations to find the Hamming norm of several different data streams (for example, multiple attributes in a database table). This approach would also be suitable for use in embedded monitoring systems, since the method only requires simple arithmetic operations and a small amount of writable memory. However, we do not analyse the quality of the results produced by applying this heuristic. A compromise solution is to use a cache to store the random variables corresponding to some recently encountered attribute values. If there is locality in the attribute values then the number of cache misses will be small.
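One way such a look-up table might be realised is sketched below (Python/numpy). Building the table from empirical quantiles and indexing it with a single uniform value are our assumptions for illustration, not the implementation used in the experiments; sampler stands for any routine returning variates from the fixed-parameter stable distribution, such as the transformation described above.

    import numpy as np

    def build_stable_table(sampler, size=4096, samples=1_000_000):
        # sampler(n) is assumed to return n variates from the fixed-parameter
        # stable distribution; store evenly spaced empirical quantiles.
        xs = np.sort(sampler(samples))
        idx = np.linspace(0, samples - 1, size).astype(int)
        return xs[idx]

    def stable_from_table(table, u):
        # Index by a single uniform [0, 1) value, interpolating linearly
        # between neighbouring table entries.
        pos = u * (len(table) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(table) - 1)
        frac = pos - lo
        return (1.0 - frac) * table[lo] + frac * table[hi]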

Insertions Only with Synthetic Data

We tested the two algorithms on synthetic data generated from a Zipf distribution with varying levels of skewness. The results are shown in Figure 5.3. We used sketches that were vectors with 512 entries, against Flajolet-Martin probabilistic counting given the equivalent amount of working space. 100,000 elements were generated from Zipf distributions with parameters ranging from 0 (uniform distribution) up to 4 (highly skewed). The variation in skew tests the ability of the methods to cope with differing numbers of distinct items, since low skew generates many distinct items, whereas high skew gives only a few. Both algorithms produce answers that in most cases are within a few percentage points of the true answer. The worst case is still less than 10% away from the correct answer. This should be compared against sampling based methods as reported in [Gib01], where the error ratio was frequently in excess of 200%. This shows that for comparable amounts of space usage, the two methods are competitive with each other for counting distinct items. The worst case for stable distributions occurs when the number of distinct elements is very low (high skew), and here exact methods could be used with small additional space requirements.

Figure 5.4: Finding the Hamming Norm of Network Data. The relative error of FM85 probabilistic counting and of sketches, plotted against the correct answer, which ranges from about 21,000 to 66,000 distinct addresses.

Hamming Norm of Network Data

We examined finding the Hamming norm of the sequence of IP addresses, to find out how many distinct hosts were active. We used exact methods to be able to find the error ratio. We increased the number of items examined from the stream, and looked at between 100,000 and 700,000 items in 100,000 increments. The results presented in Figure 5.4 show that there were between 20,000 and 70,000 distinct IP addresses in the stream. Both probabilistic counting and sketches were used, given a workspace of 8Kb. Again, using sketches is highly competitive with probabilistic counting, and is on the whole more reliable, with an expected error of close to 5% against probabilistic counting, which is nearer to 10%.

Streams based on sequences of inserts and deletes

Our second experiment tested how the methods worked on more dynamic data, with a mixture of insertions and deletions. It also tested how much they depend on the amount of working space. We created a sequence of insertions and deletions of items, to simulate the addition and removal of records from a database table. This was done by inserting an element with one probability, p1, and removing an element with probability p2, while ensuring that for each element i the number of such elements seen was never less than zero. Again, 100,000 transactions were carried out to test the implementation. We ran a sequence of experiments, varying the amount of working space allocated to the counting programs from 0.5Kb up to 32Kb. The results are shown in Figure 5.5. The first observation is that the results outdo what we would expect from our theoretical limits from Theorem 2.2.3. Even with only 1Kb of working space, the sketching procedure using stable distributions was able to compute a very accurate approximation, correct to within a few percentage points. It is important to note that L0 sketches were able to nearly equal or better the fidelity of probabilistic counting in every case, and also offer additional functionality (being able to cope with negative values, and to find the difference between streams). There is no immediate explanation for the clear trend of improvement with additional space, but certainly, with a workspace of only a few kilobytes, we can be sure of a result which is highly likely to be within a few percentage points of the correct answer. This is more than good enough for most of the applications we have already mentioned.

Figure 5.5: Testing performance for a sequence of insertions and deletions. The relative error of FM85 probabilistic counting and of sketches, as the size of the summary is varied from 0.5Kb to 32Kb.

Hamming norm of unrestricted streams

We generated a set of synthetic data to test the method's performance on the more general problems presented by network data. The main purpose of the next experiment was to highlight that existing methods are unable to cope with many data sets. Zipf distributions with a variety of skewness parameters were used. Additionally, when a value was generated, with probability 1/2 the transaction is an insertion of that element, and with probability 1/2 it is a deletion of that element. The results are presented in Figure 5.6. When we compare the results of probabilistic counting on this sequence to the Hamming norm of the induced vector, we see massive disparity. The error fraction ranges from 20% to 400% depending on the skew of the distribution, and it is the uniform distribution on which this procedure performs the worst. This is because the induced vector includes negative values, which can occur when there are more deletions than insertions (for example, if we are only seeing part of the transaction stream, then we may see a part which is mostly deletions). Probabilistic Counting is unable to cope with negative values, and hence its approximations of the Hamming norm are very far off. On the other hand, using stable distributions gives a result which is consistently close to the correct answer, and in the same region of error as the previous experiments. Probabilistic counting is only competitive at computing the Hamming norm for distributions of high skew. This is when the number of non-zero entries is low (less than 100), and so it could be computed exactly without difficulty.

Figure 5.6: Testing performance for a sequence of insertions and deletions drawn from Zipf distributions. The relative error of FM85 probabilistic counting and of sketches, as the Zipf parameter varies from 0 to 4.

Hamming Distance between Network Streams

Our second experiment on network data was to investigate finding the Hamming distance (dissimilarity) between two streams. This we did by construction, to observe the effect as the Hamming distance increased. We fixed one stream, then constructed a new stream by merging this stream with a second: with probability p we took item i from the first stream, and with probability 1 − p we took item i from the second, for 100,000 items. We then created sketches for both streams and found their difference using sketches of size 8Kb. The results are shown in the chart in Figure 5.7, as we varied the parameter p from 0.1 to 0.9. Here, it was not possible to compare to existing approximation methods, since no other method is able to find the Hamming distance between streams. The performance of sketching shows high fidelity: the answers are all within 7% of the true answer, and the trend is for the quality to improve as the Hamming distance between the streams increases. This is because the worst performance of sketches for this problem is observed when the number of differing items is low (when the norm is low). Hence sketches are shown to be good tools for this network monitoring task.

5.4 Experimental Results for Clustering

In analysing the performance of sketch-based methods, we wish to evaluate the performance of sketch construction as well as sketch accuracy. In the context of mining, we wish to evaluate the performance of sketching applied to clustering, when compared to other traditional (exact) methods as well as the utility of various Lp norms combined with clustering. To facilitate such assessment, we first define various measures of accuracy. Then we describe the data sets we used, and show experiments that demonstrate the speed and accuracy of sketches. Finally, we apply these sketches to clustering algorithms; here, using Lp for fractional p, we show that it can reveal interesting patterns both visually and quantitatively.

Data Sets

Figure 5.7: Measuring the Hamming distance between network streams. The relative error of sketches against the correct Hamming distance, which ranges from about 5,400 to 23,700.

Real data sets obtained from AT&T were used in our experiments. From these we projected a tabular array of values reflecting the call volume in the AT&T network. The data sets give the number of calls collected in intervals of 10 minutes over the day (x-axis) from approximately 20,000 collection stations allocated over the United States, spatially ordered based on a mapping of zip code (y-axis). This effectively creates a tabular dataset of approximately 34Mb for each day. We stitched consecutive days to obtain data sets of various sizes.

For a separate set of experiments, we constructed synthetic tabular data (128Mb) to test the value of varying the distance parameter p in searching for a known clustering. We divided this dataset into six areas representing 1/4, 1/4, 1/4, 1/8, 1/16 and 1/16 of the data respectively. Each of these pieces was then filled in to mimic six distinct patterns: the values were chosen from random uniform distributions with distinct means in the range 10,000 - 30,000. We then changed about 1% of these values at random to be relatively large or small values that were still plausible (so they should not be removed by a pre-filtering stage). We divide the data into small rectangles, which we refer to as tiles. Under any sensible clustering scheme, all tiles in areas created from the same distribution should be grouped together in the clustering. These experiments were run on an UltraSparc 400MHz processor with 1.5Gbytes of memory.

5.4.1 Accuracy Measures

Let the sketched Lp distance between two matrices A and B be denoted by ~||A − B||_p. We use the following measures to assess sketching accuracy:

Definition 5.4.1 The cumulative correctness of a set of k separate experiments between a set of matrices A_i and B_i is

( Σ_{i=1}^{k} ~||A_i − B_i||_p ) / ( Σ_{i=1}^{k} ||A_i − B_i||_p )

This measure gives an idea of how accurate the sketches are in the long run. It is forgiving towards variations if they are balanced out, so we next introduce a measure to look at the average variation of each comparison.

Definition 5.4.2 The average correctness of a set of k experiments is

1 − (1/k) Σ_{i=1}^{k} | 1 − ~||A_i − B_i||_p / ||A_i − B_i||_p |

The above two measures are good for when we are testing the sketches as estimators of the actual distance. In our clustering application, however, what is more important is the correctness of pairwise comparisons: testing whether some object c is closer to a or to b. The next measure evaluates, out of a number of experiments of this type, which fraction gives the correct answer (ideally, this would be 100%).

Definition 5.4.3 The pairwise comparison correctness of k experiments between three sets of matrices, A_i, B_i, C_i, is

(1/k) Σ_{i=1}^{k} xor( ~||C_i − A_i||_p < ~||C_i − B_i||_p ,  ||C_i − A_i||_p > ||C_i − B_i||_p )

To assess the quality of a clustering obtained using exact methods against one obtained using sketches, we want to find ways of comparing the two k-clusterings. A commonly used construct in comparing clusterings is the confusion matrix. Given two k-clusterings of a large number of items, each item will be associated with two clusters: one from the first clustering, and one from the second. A k × k matrix records how many items are placed in each possible pair of classifications. Let confusion(i, j) be the function that reports how many items are classified as being in cluster i in the first clustering and in cluster j in the other clustering. If the clusterings agreed completely, then only the main diagonal of this confusion matrix would be populated: all other entries would be zero. We therefore define the following measure for assessing a pair of clusterings:

Definition 5.4.4 The confusion matrix agreement between two k-clusterings requires the use of a confusion matrix on the clusterings that records the number of items in each clustering that are allocated to the same cluster. The agreement is given by

Σ_{i=1}^{k} confusion(i, i) / Σ_{i=1}^{k} Σ_{j=1}^{k} confusion(i, j)

However, it is quite possible that two clusterings with very different allocations of items to clusters could each still be of good quality. To remedy this, for any clustering we can compute the total distance of each element in the clustering from the center of its cluster; any clustering algorithm should attempt to minimise this amount. We can then evaluate this distance for the clustering obtained using sketches as a percentage of that for the clustering obtained by exact distance computations. Let spread_exact(i) be the spread of the ith cluster following a clustering with exact comparisons, and spread_sketch(i) be similarly defined when approximate comparisons are used.

Definition 5.4.5 The quality of sketched clustering with k clusters is defined as

Σ_{i=1}^{k} spread_exact(i) / Σ_{i=1}^{k} spread_sketch(i)
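These measures translate directly into code. A minimal Python/numpy rendering follows; the sketched and exact distances are assumed to be supplied as arrays, and all names are illustrative.

    import numpy as np

    def cumulative_correctness(sketched, exact):
        # Definition 5.4.1: ratio of summed sketched distances to summed exact distances.
        return np.sum(sketched) / np.sum(exact)

    def average_correctness(sketched, exact):
        # Definition 5.4.2: one minus the mean absolute relative deviation per experiment.
        return 1.0 - np.mean(np.abs(1.0 - sketched / exact))

    def pairwise_correctness(sk_ca, sk_cb, ex_ca, ex_cb):
        # Definition 5.4.3: fraction of comparisons where the sketched and exact
        # orderings agree (the xor of "<" and ">" is true exactly when they agree).
        return np.mean(np.logical_xor(sk_ca < sk_cb, ex_ca > ex_cb))

    def confusion_agreement(confusion):
        # Definition 5.4.4: fraction of items on the diagonal of the confusion matrix.
        return np.trace(confusion) / np.sum(confusion)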

5.4.2 Assessing Quality and Efficiency of Sketching

To assess the performance of sketch construction, we conducted the following experiment. For a tabular call volume data set corresponding to a single day (approximately 34Mb), we measured the time to create sketches of various sizes, for both L1 and L2 norms. We considered objects (tiles) from size only 256 bytes up to 256 kilobytes. The tests evaluated the time to assess the distance between 20,000 random pairs of tiles in the data space. Figure 5.8 presents the results. The exact method requires the whole tile to be examined, so its cost grows linearly with the size of the tile. For this experiment the sketch size, k, is 64: that is, we pick the dimensionality of the sketch vector to be 64, and so the sketch is formed by convolving each tile with 64 randomly created matrices. The sketch size is independent of the tile size. If increased accuracy is desired, then this size could easily be increased to reduce the error rate, as described in Theorem 2.2.3. Notice that the cost of assessing distance using sketches is also independent of the size of the tile: this is because we make a sketch vector based only on the parameters δ and ε. Hence, when we fix these parameters, the comparison time is more or less constant. When we create the sketches in advance, we consider all possible subtables of the data in square tiles (of size 8 × 8, 16 × 16 and so on up to 256 × 256). By using the Fast Fourier Transform method described in Theorem 5.2.1, the processing cost is largely independent of the tile size, depending mainly on the data size. Since we use the same sized data set each time, the pre-processing time varies little. The time to assess the Lp distance using sketches is much faster than the exact method when sketches are pre-computed in almost every case. It is clear that for a particular object size there will be a number of comparisons after which sketching is always beneficial for assessing the Lp distance. Our experiments reported in Figure 5.8 show that for more than 20,000 comparisons between objects of size greater than 64k, sketching is always beneficial.

To evaluate sketching accuracy, we computed the measures introduced above, comparing the distance computed using sketches with the exact value. We computed Average Correctness and Cumulative Correctness (Definitions 5.4.2 and 5.4.1); in most cases these were within a few percent of the actual value, for relatively small sized sketches (recall that the accuracy of sketching can be improved by using larger sized sketches). Figure 5.8 presents the results. Note that the times are shown on a logarithmic scale, since the cost for exact evaluation becomes much higher for larger tile sizes.

Figure 5.8: Assessing the distance between 20,000 randomly chosen pairs. The four panels show the time for L2 distance, the accuracy of L2 sketching, the time for L1 distance, and the accuracy of L1 sketching, for object sizes from 256 bytes to 256k. The timing panels compare computation with sketches, preprocessing for sketches, and exact computation; the accuracy panels show cumulative, average and pairwise correctness.

On the average, the cumulative distance found by sketching agrees with very high accuracy with the actual answer. Since in clustering it is not the absolute value that matters, but the results of comparisons, we ran tests by picking a pair of random points in data space and a third point. We determined which of the pair was closest to the test point using both sketching and exact methods, and recorded how frequently the result of the comparison was erroneous. This is the Pairwise Comparison Correctness of Definition 5.4.3. Under all measures of accuracy, the results of Figure 5.8 show that sketching preserves distance computations very effectively.

Figure 5.9: Accuracy results when tiling four sketches to cover a subrectangle. Both panels (L1 and L2) show cumulative, average and pairwise correctness for sketch sizes from 16 to 256 entries.

Observe that the quality of the pairwise comparisons decreases slightly under L1 distance for large tile sizes. We claim that this is explained by our data set: for large enough tiles, the L1 distance between any pair is quite similar. Hence, with the variation introduced by our approximation, it becomes harder to get these comparisons correct. However, in this situation, we claim that such errors will not affect the quality of any derived clustering detrimentally, because following such a comparison, it does not make much difference which cluster the tile is allocated to, since the tile is approximately the same distance from both. We shall see that this claim is vindicated when we examine the quality of the clusterings derived using approximate distance comparisons.

Tiling Sketches

In Definition 5.2.1, we saw how four sketches can be tiled to cover a rectangular area to make a sketch that is up to a factor of four worse than a normal sketch. In this section we go on to see how well this technique works in practice, and whether pairwise comparisons work well with tiled sketches. It turns out that although tiling sketches is not as accurate as having sketches of the right size, this method of constructing sketches is much more accurate than our worst case analysis in Theorem 5.2.2 would lead us to expect. Figure 5.9 shows the result of experiments with the L1 and L2 metrics. As before, we took 10,000 subrectangles from our data at random, and found the exact and sketched distance between them. This time, the size of the subrectangles was also a random parameter. The sketched results were scaled in accordance with the ratio of the size of the rectangle being covered to the size of the rectangles covered by sketches. The cumulative correctness is in the high ninety percentiles, as before, but now the average correctness for both metrics only approaches 90% as the sketch size increases to 256 entries (recall that increasing the sketch size should improve the accuracy). The pairwise accuracy of using tiled sketches also improves beyond 90% as the sketch size is increased. These results are interesting, since they suggest that in practice tiling sketches can be almost as effective as using sketches of the exact size. This could in part be explained by the data: real life data might tend to have similar properties throughout the sub-rectangle being sketched. So despite the fact that some parts of the Lp distance between rectangles are "over-counted" by tiling sketches (some parts of the difference are counted twice, some four times, and the rest only once, as shown in Figure 5.2), we can still get an answer that is "good enough" in most cases following the scaling we do to adjust for the areas of overlap.

Figure 5.10: Quality of pairwise comparisons with tiled sketches (pairwise accuracy, between 80% and 95%, as p varies from 0.25 to 2.0).

When we look at the accuracy of pairwise comparisons for other Lp norms, we see that the quality is not as good as for L1 and L2, but still in the high eighty percentiles (Figure 5.10). Here, the parts of the rectangle that are over-counted by the overlapping rectangles of the tiled sketches have more of an impact: where the distances between one rectangle and the two it is compared against are quite similar, this over-counting can easily bias the comparison in the wrong direction. Still, this approximation of the distance is sufficiently accurate for many applications, such as clustering, where, as we will go on to show, a small fraction of erroneous comparisons does not affect the overall quality of the result.

5.4.3 Clustering Using Sketches

The second set of experiments aims to evaluate the utility of sketching when applied to clustering with k-means, a popular data mining algorithm. In this experiment, we stitch 18 days of data together, constructing a data set of over 600Mb of raw data. With our experiments we wish to evaluate the performance of sketching under the following possible scenarios: (1) sketches have been pre-computed, so no time is devoted to sketch computation, just to running the clustering algorithm on sketches; (2) sketches are not available and so they have to be computed "on demand"; (3) sketching is not used and instead the exact distance is computed. In each case, we divided the data up into tiles of a meaningful size, such as a day, or a few hours, and ran the k-means clustering algorithm on these tiles. To ensure that the methods were comparable, the only difference between the three types of experiments was the routines to calculate the distance between tiles: everything else was held constant.

Varying the value of p

Figures 5.11 and 5.12 present the results for clustering the data using Lp distances, for a variety of settings of p between 2.0 and 1.0 and with fractional values in between. We experimented with a tile size of 9Kb: each tile represents a day's data for groups of 16 neighbouring stations. The first set of results show that clustering using sketching is dramatically faster than using exact distance computations, while giving a clustering that is as good as that found with exact computation. To assess the quality of the clustering found using sketches we used two approaches. First, the confusion matrix between the clustering created by sketching and that created with exact computations, as described in Definition 5.4.4: the percentage of tiles that are classified as being in the same cluster by both methods indicates how close the sketched clustering gets to the exact one. An alternative measure of the quality of a clustering comes from comparing the spread of each cluster: the smaller the spread of each cluster of a clustering, the better that clustering will be. The spread of a cluster is the sum of the divergence of each of the tiles in that cluster from the centroid of the cluster, as described in Definition 5.4.5. This gives an objective way to test the quality of the clusterings being compared.

When sketches are pre-computed, the time to perform the clustering is many times faster, in some cases an order of magnitude faster, than using exact distance computations. By Theorem 2.2.3 the size of the sketches is independent of the size of the objects being compared (the tiles compared in this test are 9Kb in size, whereas the sketches are less than 1Kb), so we know that this difference will become more pronounced as the size of the objects being compared increases. The major time cost comes when a sketch is first obtained, since the data tile must be convolved repeatedly with the randomised matrices; storing the sketch and using it for future comparisons brings major savings, as in the course of the clustering each tile is compared with others several times, so the cost of making the sketch is recouped and the speed-up makes the application of sketching worthwhile.

Perhaps less expected is the result that even when sketches are not available in advance, computing them on demand is still faster than using exact comparisons: although creating the sketches has a cost, it is much lower than that of the corresponding exact computations, and is recouped over the subsequent comparisons. Observing Figure 5.11, it is evident that the timings for the exact computation are much more variable, but consistently and significantly more costly, than for approximate comparisons using sketches, for which there exists little variation in the times. The variation in time for the exact computation has two reasons: firstly, the L1 distance is much faster to compute, since it requires only computing sums of absolute differences, whereas the other methods need to compute sums of powers of differences. Secondly, the k-means clustering procedure stops when a stable clustering is found, and for each value of p a different clustering will be reached, so some computations finish earlier than others, whereas computations with sketches display less variability. In each case, creating sketches adds the same cost (about 130s of compute time), since the number of sketches required, and the cost of creating each sketch, is independent of the data values and of the value of p. Note that a slightly different method is used for p = 2 compared to p < 2 (see Section 2.2): the L2 distance is faster to estimate with sketches, since the approximate distance is found by computing the L2 distance between the sketches, rather than by running a median algorithm, which is slower.

Figure 5.11: Timing results for different values of p for k-means (k = 20), with data divided into tiles of size 9Kb (comparing exact distance computation, sketches pre-computed, and sketches computed on demand).

Figure 5.12: Quality results for different Lp distances (confusion matrix agreement and quality of the sketched clustering, for values of p from 0.25 to 2.0).

Another positive result regards the accuracy of the sketching approach, shown in Figure 5.12. By analysing the confusion matrix between computations using sketches and computations using exact distances, we see that for several of our experiments there is a high degree of correlation, indicating that tiles were allocated into the same cluster. We observe that the agreement is less good for higher values of p — for L2 distance, it reduces to around 60%. But although the clustering we get is quite different from that obtained with the exact distance, the quality of the clustering in all cases is as good as the one found by exact methods. In fact, in many cases using sketches produces a clustering that is better than that with exact comparisons (where the quality rating is greater than 100%). This at first surprising result is partially explained as follows: even when we compute exact distances, the k-means algorithm does not guarantee finding the best clustering. Evaluating distances using sketches introduces small distortions in the distance computation that cause a different clustering to be reached (some objects will be placed in a different cluster). This clustering can be a small improvement over the one obtained by exact computations. There is no obvious reason for this, except to note that there is no guarantee that k-means will give the best result, and it appears that comparisons with a small probability of erring do not adversely affect it.

Varying the number of clusters

Figure 5.13 shows a series of experiments on the same set of data with k-means, as k is increased. The difference between having pre-computed sketches, and sketching on demand, remains mostly constant, at around 140s. The cost of using exact computations is significantly more expensive in all but one case3, and without sketches the time cost appears to rise linearly with k. This is achieved using relatively large sketches with 256 entries. This time benefit could be made even more pronounced by reducing the size of the sketches at the expense of a loss in accuracy. We performed additional experiments, varying the size of the tile and other parameters. The results obtained were consistent with those reported; we omit these for brevity.

3 This occurs when the number of comparisons made in the course of clustering is not enough to "buy back" the cost of making the sketch.

In summary, the use of approximate comparisons does not significantly affect the quality of the clustering found. Indeed, we have shown cases where the quality of the clustering is improved. The k-means algorithm is already inexact: it depends on a heuristic to generate the clustering, and uses randomness to generate the initial k-means that are refined over the course of the program. We have shown that adding approximate distance comparisons to this clustering algorithm makes it run significantly faster, without noticeably affecting the quality of the output. This provides evidence that clustering algorithms or other procedures making use of Lp distance computations can be significantly speeded up by using approximate computations with sketches, and that this will yield results that are just as good as those made with exact computations.

Figure 5.13: Clustering with different numbers of means. Time in seconds for clustering with sketches precomputed, sketches computed on demand, and exact distance computations, as the number of clusters varies from 4 to 48.

5.4.4 Clustering Using Various Lp Norms

With our last experiments we wish to evaluate the utility of using various Lp norms in clustering for data mining tasks. We examine the output of the clustering procedure to determine the effects of varying the parameter p in the distance computations. We approach this in two ways: first, by examining how varying p can find a known clustering in synthetic data; and second, by presenting a case study analysing a single day’s worth of data.

Synthetic Data

Recall that our synthetic data set is made by dividing the data into six pieces, filling each with a distinct pattern, and then adding "errors" of high and low readings. We divided this data set into square tiles of size 64k, and ran clustering using sketching on them for many values of p in the range [0, 2]. Since we know which cluster each tile should belong to, we can accurately measure how well the clustering does in the presence of the artificial errors and the approximation caused by using sketches. We measure the percentage of the 2000 tiles allocated to their correct cluster (as per Definition 5.4.4). Figure 5.14 shows the results as we vary p. We first observe that the traditional measures, L1 and L2 especially, perform very badly on this dataset. But if we set p between 0.25 and 0.8, then we get the answer with 100% accuracy. The explanation for this is that the larger p is, the more emphasis it puts on "outlier" values: if a single value is unusually high, then this will contribute a huge amount to the L2 distance — it adds the square of the difference. With distances this high, it is impossible to decide which of the clusters a tile best belongs to, and the clustering obtained is very poor (if we allocated every tile to the same cluster, then this data set would score 25% under this measure). On the other hand, as p gets smaller, the Lp distance gives a lower penalty to outliers, as we would wish in this case. If we keep decreasing p closer to zero, then we have seen that the measure approaches the Hamming distance, that is, counting how many values are different. Since here almost all values are different, the quality of the clustering is also poor when p is too small. However, we suggest that p = 0.5 gives a good compromise: the results are not overly affected by the presence of outliers, and even with the approximation of sketching, we manage to find the intended clustering with 100% accuracy.

Figure 5.14: Varying p to find a known clustering. The percentage of tiles allocated to their correct cluster, as p varies from 0 to 2.

Real Data Case Study

For the case study, the geographic data we used was linearised, and grouped into sets of 75 neighbouring stations to facilitate visual presentation. A subsection of the results of the clustering is shown in Figure 5.15 (the full results are displayed in Figure 5.16). Each point represents a tile of the data, an hour in height. Each of the clusters is represented by a different shade of grey. The results shown in Figure 5.16 are for three different distance measures, L2, L1 and L0.25. The largest cluster is represented with a blank space, since this effectively represents a low volume of calls, and it is only the higher call volumes that show interesting patterns. Visual analysis of this clustering immediately yields information about the calling patterns. Firstly, it is generally true that access patterns in any area are almost identical from 9am till 9pm. We can see very pronounced similarities throughout this time — long, vertical lines of the same hue indicating that an area retains the same attributes throughout the day. Call volume is negligible before 9am, but drops off gradually towards midnight. It is also clear how different values of p bring out different features of the data in this data set.

Figure 5.15: Detail of a clustering for a single day (subsection of Figure 5.16). The top figure shows the clustering graphically when p = 2.0; the lower when p = 0.25. Each shade of grey denotes a different cluster (the largest cluster is denoted by a blank space to aid visibility). For higher p, more detail can be seen; for lower p, the most important sections are brought to the fore.

For p = 2, perhaps the most common distance metric, a large fraction is allocated to non-trivial clusters. These clearly correspond to centers of population (such as New York, Los Angeles and so on) represented by clusters of darker shades. On either side of the centre of these areas (the greater metropolitan areas), patterns are less strong, and access is only emphasised during peak hours. We see how the clusters represented by darker shades are often flanked by those of lighter hue, corresponding to this phenomenon. For p = 1, there is less detail, but equally, the results are less cluttered (in Figure 5.16). Most of the clusters are indistinguishable from the background, leaving only a few places which can be seen to differ under the L1 measure. It is significant that these also show strong vertical similarity of these clusters, suggesting that these areas show particular points of interest. When studying the full data set, we noticed that while some areas figure throughout the day, some sections on one side of the data are only active from 9am till 6pm; at the other extreme of the data diagram there are regions that are active from 6am to 3pm. This is explained by business hours and the three hour time difference between the East and West coasts of the USA (this is visible in Figure 5.16). This difference is not so clearly visible on the more populated clustering with p = 2. As we further decrease p to 0.25, only a few regions are distinct from the default cluster. These are clearly areas worthy of closer inspection to determine what special features they exhibit. It therefore seems that p can be used as a useful parameter of the clustering algorithm: set p higher to show full details of the data set, or reduce p to bring out unusually strong clusters in the data.

This example demonstrates how clusterings of the tabular data set can highlight important features. It has been confirmed by our knowledge of this data set; in novel data sets, we would expect to be able to find the important features of the data, without the foreknowledge of what these might be. It also demonstrates the importance of different values of p, in contrast with previous attention focusing on the traditional distance measures L1 and L2. The whole continuum of Lp distances can give different and useful information about the data.

Figure 5.16: Clustering of one day's call data for the whole United States, under p = 2.0, p = 1.0 and p = 0.25; each panel covers the hours 00:00 to 24:00.

5.5 Discussion

This chapter has focussed on studying the application of sketch based dimensionality reduction techniques to real world problems. The main results shown here are:

• Application of sketching to solve real world problems of comparing streams based on Hamming distance, and clustering in large tabular data based on Lp distances.

• Empirical proof that this style of sketching is practical, fast, space-efficient and useful in competition with other methods to tackle these problems.

• A study of using Lp norms for small p to successfully compare massive data streams and accurately compute the number of distinct elements.

• Experiments on using sketches in place of exact comparisons in clustering applications, and demonstration that this gives significant speed improvements while sacrificing very little quality of the final result.

Comparisons based on Lp distances — either L1, L2 or Hamming distance — occur as a fundamental operation throughout Computer Science, especially in applied areas such as Database Theory. These sketching techniques could be applied to almost any procedure which uses this kind of comparison at its core to speed it up, or to reduce the working space that it uses. But further to this, the study of Lp norms for fractional values of p has also shown significant results. We have seen that values of p less than 1 sometimes give results more akin to those we want than traditional metrics such as L1 and L2. We have also seen that by setting p close enough to 0 we obtain a good approximation of the Hamming distance on vectors. Thus the parameter p is an important variable, and letting p take values in the full range [0 ...2] can give different and interesting results depending on our application scenario.

Chapter 6

Sending and Swapping

...let’s assume that an author drafts the first version of his text, and let’s call this version A. To simplify things, let’s assume that the author writes it directly on a computer, or that if he had made any handwritten notes, they have disappeared. This version A is printed, and at that point, the author begins correcting it by hand. In this way, we get version B, which in turn is transferred to computer, where again it is cleaned up and printed anew and becomes version C. In turn, this version is altered by hand and again recopied as version D on the computer, from which a new version will be created: version E. Since computers encourage corrections and reconstructions, this is how the process can give rise to — if the author does not throw the intermediate steps in the wastepaper basket — a series of versions, let’s say from A to Z... And here — when we print out the version again — we do not have that version C, which was supposed to faithfully reproduce version B. Instead, out comes a version that we will call X, but between B and X there are ”ghost” versions, each one different from the other. It could be the case, though rare, that the author — narcissistic and fanatical about his own changes, and using some kind of special computer program — has kept somewhere, inside the memory of the machine, all these intermediate changes. But usually this does not happen. Those ”ghost” copies have vanished; they are erased as soon as the work is finished.

— Umberto Eco, 2002

6.1 Introduction

This chapter focuses on the communication issues related to managing documents shared by physically separated users. Suppose that two users each have a different version of what was, at some unknown point of time in the past, the same document. Each version was perhaps obtained by a series of updates to that original document (perhaps by a series of transmissions on the Internet and updates along the way). In this situation it would be reasonable to expect that many parts are similar if not identical. The challenge addressed here is how to let one user know the other's version while saving the transmission of redundant information as much as possible. What makes the problem more challenging is that we do not assume that the updates themselves, or any logs of them, are available; the original document is also not assumed to be available — each user has only their current version.

Formally, let A and B be two such users holding documents a and b respectively. Neither A nor B has any information about the document held by the other. Their goal is to exchange messages so that B computes a; it may also be required that A computes b. Our objective is to design efficient communication protocols to achieve this goal. We measure the efficiency of a protocol in terms of the total number of bits communicated, as per Yao's model of communication complexity (described in [KN97]). We are also interested in minimising the number of rounds (changes of direction in communication) as well as minimising the running time of the internal computations performed by A and B.

A trivial but inefficient way of exchanging documents is for A to send a to B in full. On the other hand, if A knew b (in addition to a), A could communicate a to B more efficiently by sending a sequence of operations for converting b to a. In this chapter we consider protocols in which the set of operations to convert one string into another is pre-determined between the users. Hence we shall be dealing with the same distances that we have already discussed, and using some of the embedding results already seen to build these protocols. The distance between a and b will allow us to find a lower bound on the length of the message that A must send to B. In most common cases, a and b will be strings, and we shall be interested in analysing protocols that are efficient in terms of string distance. We will also discuss the case when a and b are other sequences — permutations or vectors.

In this chapter, we will first study the body of work on this kind of problem in Section 6.1.1. We shall then find lower bounds on the amount of communication needed under different distance measures, described in Section 6.2. A general protocol for a large class of metrics is presented in Section 6.3, which gives a single round scheme that is within a factor of 2 of the optimal amount of communication. For individual distances, a series of divide-and-conquer schemes are presented in Section 6.4 which require a greater amount of communication, but are more computationally efficient to perform.

6.1.1 Prior Work

There have been several earlier works which take the following scenario: two users each have a copy of a file. The file is divided into a sequence of pages, and certain pages differ between the two copies. The distance between two files is then the number of differing pages — it is assumed that the differences between pages are aligned with the page boundaries, so effectively this results in the Hamming distance between two strings. Each page corresponds to a character in the string, and two corresponding pages are either identical or disagree. These protocols assume a fixed upper bound f on h(a, b), and they typically require the transmission of Ω(h log n) bits to exchange files of size n with Hamming distance h.

Metzner [Met83] first studied this problem, and described a communication scheme based on hash functions. It works in an obvious divide-and-conquer fashion: a hash for the whole file is exchanged; if there is a discrepancy then the file is divided into two parts, and the procedure continues on each half. Metzner asserts that by choosing a sufficiently large basis, the probability of failure of this protocol becomes vanishingly small; in fact, as we shall see in our detailed analysis of this protocol (Section 6.4.1), the hash size must be a function of the file size and the distance between the files for this protocol to succeed with constant probability. We shall later show that the communication cost of this protocol is O(h log n log f), using O(log n) rounds of communication. In a later paper, Metzner presents an alternative procedure for the Hamming distance [Met91]. This uses the techniques of Reed-Muller codes to identify the location of any single differing page between the files. For files which differ in more than one place, the same divide and conquer technique is suggested, but using Reed-Muller codes in place of hash functions. In this way, if a section of the file under consideration has only one difference within it, then only a single exchange is required, rather than repeated division until the differing page is identified. In the worst case (when differences are clustered), this protocol still requires O(h log n log f) bits of communication and O(log n) rounds.

Abdel-Ghaffar and El Abbadi [AGE94] and Barbará and Lipton [BL91] extend the approach of using techniques from error-correcting codes, with the aim of reducing the total amount of communication and the number of rounds. In particular, [AGE94] describes a protocol based on Reed-Solomon codes; this protocol sends a single message of O(f log n) bits to correct up to f differences, which is within a constant factor of optimal if the bound f is tight. The same coding theory approach is taken in [MTZ01], which attempts to be computationally efficient in terms of the size of the sets (the number of non-zero entries in a vector), rather than in the size of the universe. This accounts for distances akin to the Hamming distance; there are somewhat fewer studies of the harder problems of general editing distances.

A simple scheme is described by Tridgell and Mackerras [TM96] in the design of a file exchange utility, rsync. The scheme has a single round of communication, and is initiated by B, who wishes to receive a from A. B divides b into fixed length blocks, and computes fixed length hashes of each of these blocks; all these hashes are then sent to A. Starting from the first character in a, A computes a hash of the block starting at that location. If this hash matches a hash sent by B, then A adds the identifier of this hash to the output and advances by the block length; otherwise A outputs the character under consideration, and advances by one character. In this way, A builds up a description of a in terms of blocks of b and single characters that B will be able to interpret. From our point of view, this approach is undesirable, since the total amount of communication is linear in the length of the files being exchanged, irrespective of the true distance between the files — even if a and b are identical. From a practical viewpoint though, this protocol is acceptable, since it generally uses less communication to exchange related files than simply sending them directly, requires only two rounds of communication, and is computationally efficient. In fact, this utility is available to download and run on Unix systems [Tri].
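The encoding step of this scheme, as described above, can be sketched in a few lines (Python). This is a simplified, hypothetical rendering of the idea, not the rsync implementation: it omits the rolling hash that makes the real scan efficient, and the names are ours.

    def encode(a, b_blocks, block_len, h):
        # b_blocks: map from h(block) to the block's index in b, as sent by B.
        # Output: a description of a as a mix of block references and literals.
        out, i = [], 0
        while i < len(a):
            if i + block_len <= len(a):
                hv = h(a[i:i + block_len])
                if hv in b_blocks:
                    out.append(("block", b_blocks[hv]))
                    i += block_len
                    continue
            out.append(("char", a[i]))
            i += 1
        return out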
Schwarz, Bowdidge and Burkhard [SBB90] take a model of file differences where not only can pages be changed, but they can also be moved, inserted or deleted. The proposed protocol is similar in nature to that originally given in [Met83]. One party sends hashes to the other, and if they cannot be reconciled, then the substring under investigation is split into two and the process continues. Hence there are again O(log n) rounds of communication. The difference is that because insertions and deletions can alter the alignment of pages, upon receiving a hash the second party must try aligning it against all possible locations in the file. They claim that their protocol sends d log n hash values to exchange files with an edit distance of d. We shall later show that this can be improved to O(d log(n/d)), and formally prove this claim. We shall also analyse how large the hash values must be to ensure that the probability of failure is bounded by a constant, and hence find the total amount of communication necessary. Evfimievski [Evf00] also considers block operations, but again uses essentially the same divide and conquer technique to exchange files. However, whereas the earlier schemes performed the divisions in parallel — first considering all differing substrings of length n/2, then all of those of length n/4, and so on — Evfimievski's technique requires that the process be carried out by an inorder traversal of the binary parse tree. The bulk of the paper is devoted to showing that the LZ distance is at most 112du², where du is (the distance we call) compression distance with unconstrained deletes. As we will prove in Section 6.4.4, the method described can exchange files using O(l log n) hashes for an LZ distance of l. Hence, applying this result, the files can be exchanged using O(du² log n) hashes (so this does not fit our definition of being efficient with respect to the distance measure, given in Definition 6.1.1). We shall later show how, using results based on the Edit Sensitive Parsing, this can be improved to O(du log n log* n). This is superior for all values of du.

Between them, the abovementioned papers consider Hamming distance or other particular editing distances. Orlitsky [Orl91] takes a more general approach, showing how, given a suitable distance metric, documents can be exchanged using at most twice the optimal amount of communication in a single round. The proof of this is based on colouring hypergraphs. Later, we shall give an alternative proof of this fact, which is valid for all of the metric editing distances that we consider herein. Our proof is based on colouring graphs, and leads to a deterministic algorithmic solution. We shall see that the computational cost of a literal translation of this proof is exponential, rendering this approach of theoretical interest for the most part. However, by considering specifics of each distance it is sometimes possible to adopt this approach and make it practical.

6.1.2 Results

Many of the methods outlined above require an upper bound on the distance between two strings, but give no indication of how such an upper bound could be obtained. By using the embeddings of string distances into vector spaces for which we are able to make sketches (Hamming and L1), we can overcome this limitation by using approximate upper bounds on the distance. One party can build a sketch for their sequence and communicate it to the other in a single round of communication. In this chapter we shall describe communication efficient methods for exchanging documents using this upper bound. We begin by describing a single-round method that allows the files to be exchanged with an amount of communication that is at most twice the optimal amount. Here A deterministically computes and sends B an identifier for the string a which is unique among all strings that are within a distance of 2d(a, b) from b, where d is the distance of interest. By the use of known techniques in graph colouring or error-correcting codes, the size of this identifier can be bounded above by c · d(a, b) log |b| bits, where c is a small constant. We also provide simpler protocols that locate the differences between a and b in divide-and-conquer fashion (see Section 6.4). These protocols make extensive use of fingerprints for searching and comparing substrings of a and b. Some of these have already been described for Hamming distance, Edit distance and Compression distance with unconstrained deletes. We shall put these different methods into a common context and formally analyse their cost. We shall also give methods for distances, including Block Edit Distances, Tichy's Distance, and the Permutation distances studied in Chapter 3, where we know of no prior work on these problems. These methods are distinguished by the fact that they are both communication efficient and computationally tractable. In particular, we shall be interested in protocols that are efficient with respect to the relevant distance measure:

Definition 6.1.1 A protocol which allows two parties to exchange objects a and b is efficient with respect to a distance measure, d, if the communication cost of the protocol is guaranteed to be O(d(a, b) poly-log(|a| + |b|)).

The important part is that for an efficient scheme, the dependency on the number of differences between the two objects must be linear. All the protocols that we describe will be efficient with respect to their corresponding distance measures. We also consider the following definition:

Definition 6.1.2 A protocol between two communicating parties A and B is non-trivial if it enables A to transmit a string a to B, under a stated restriction on the distance between a and b, using fewer than |a| log |σ| bits of communication in total.

Under this definition, the obvious scheme of sending the string a directly to B is considered trivial. We shall be analysing our protocols to determine under what circumstances they are provably better than this naïve scheme — that is, when they are non-trivial.

6.2 Bounds on communication

To examine the performance of the proposed protocols, we need a lower bound on the amount of communication required. We can obtain an information theoretic lower bound by considering how many bits are necessary for B to be able to compute a, given that B already holds b. If A already knew b in addition to a, A could send a single message containing sufficient information for B to derive a from b. If, for any b, there are at most m strings that a could possibly be, then log m bits are necessary to distinguish between them. The number of such possible strings is determined by the distance between a and b. We shall denote by k the quantity |σ|−1, where σ is the alphabet from which our strings are drawn.

Lemma 6.2.1 The amount S of communication needed to correct h Hamming differences between strings a and b (|a| = |b| = n) is bounded in bits as h log(nk/h) ≤ S ≤ h(log(nk/h) + 2), provided h ≤ n/2.

Proof. The number of strings which satisfy d(a, b) = h can be found exactly: we can choose h locations out of the n and make a change there. Each of these characters can be changed to any one of |σ| − 1 = k different values. This gives precisely (n choose h)·k^h different strings. We show (n choose h)·k^h ≥ (kn/h)^h by induction on h, which follows since (1 + 1/h)^h (n − h) ≥ 2(n − h) ≥ n for all 2 ≤ h ≤ n/2. The base case h = 1 gives equality. For the upper bound, we can similarly show that (n choose h)·k^h ≤ (ken/h)^h (here e is the base of the natural logarithm, as normal), also by induction on h. Taking logarithms of these two bounds gives the result. ✷

Lemma 6.2.2 The amount S of communication needed to correct e edit operations is bounded as e log(2(|σ| − 1)n/e) ≤ S ≤ e(log(2|σ|(n + 1)/e) + log log n), where n denotes the length of the string a.

Proof. The lower bound follows by considering strings generated by only changes or insertions. The number of such strings that can be formed from a with e insertions or changes can be found by choosing i locations to make an insert and (e − i) locations to make a change. This is at least Σ_{i=0..e} (n choose i) k^i (n choose e−i) k^{e−i}. This expression simplifies to k^e Σ_{i=0..e} (n choose i)(n choose e−i), which is just k^e (2n choose e). Using the methods of Lemma 6.2.1 bounds this quantity from below by (2kn/e)^e. Taking the logarithm of this expression gives the required lower bound. For the upper bound, we instead consider a bitwise encoding of the operations. Clearly, if we can encode all possible sequences of e edit operations in some number of bits, this upper bounds the amount of communication needed. We consider the operations in order of their occurrence (left to right). For the i'th difference, we encode the distance di from the last difference with ⌈log di⌉ bits, plus ⌈log log n⌉ bits to encode the length of the encoding of di. The total cost of this is Σ_i (⌈log di⌉ + ⌈log log n⌉) = e⌈log log n⌉ + e Σ_i (log di)/e. Using Jensen's inequality, this is no more than e⌈log log n⌉ + e log Σ_i (di/e) ≤ e(log((n + 1)/e) + log log n). It is possible to use a bit flag to denote whether each edit is an insertion or a change, requiring a further e bits. We then use log |σ| bits to code the character concerned. In the case of a deletion, this is represented as a 'change', but using the code of the existing character at that location. The total cost of this encoding is given by e(log(|σ|(n + 1)/e) + log log n) + e. Summing these two costs gives the upper bound, as required. ✷
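As an illustration of the bitwise encoding used in the upper bound, the following Python sketch (our own; the field widths are simplified, with a fixed-width length field standing in for the ⌈log log n⌉ term) writes each edit as a gap from the previous edit, an insert/change flag, and a character code.

```python
def encode_edits(edits, sigma_bits):
    """Encode a left-to-right list of edits (position, kind, char), kind in
    {'ins', 'chg'}; deletions are coded as a 'chg' back to the existing
    character, as in the proof above. Returns a bit string."""
    bits, last = [], 0
    for pos, kind, ch in edits:
        gap = pos - last
        last = pos
        width = max(1, gap.bit_length())            # about log(gap) bits for the gap
        bits.append(format(width, '06b'))           # fixed-width length field,
                                                    # standing in for the log log n term
        bits.append(format(gap, '0{}b'.format(width)))
        bits.append('1' if kind == 'ins' else '0')  # one flag bit per edit
        bits.append(format(ch, '0{}b'.format(sigma_bits)))  # log|sigma| bits per character
    return ''.join(bits)

# three edits over a string drawn from an alphabet of 16 symbols
code = encode_edits([(3, 'chg', 7), (10, 'ins', 2), (11, 'chg', 15)], sigma_bits=4)
print(len(code), "bits for 3 edits")
```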

For each of the other distances, we can take this idea of encoding each edit operation using a certain number of bits to get a bound on the amount of communication needed.

Lemma 6.2.3 The amount S of communication needed to exchange strings or permutations such that the relevant distance between the two items is d is bounded in the following ways for the following distance measures (|a| = n):
i) Compression distances: S ≤ 9d log(|a| + |b|)
ii) Tichy's distance: S ≤ 2d log n
iii) LZ distance: S ≤ 2d log(|a| + |b|)
iv) Edit Distance with moves: S ≤ 3d log 2n
v) Reversal distance: S ≤ 2d log n
vi) Transposition distance: S ≤ 3d log n
vii) Swap distance: S ≤ d log n
viii) RITE distance: S ≤ 3d log 2(|a| + |b|)
ix) Permutation Edit distance: S ≤ 2d log n

Proof. In each case we show a simple bitwise encoding of the relevant operations to give the corresponding upper bound.
i) Compression distance: A copy or move operation can use log |a| bits to specify each of the start, length of substring and destination. Uncopies, or other operations, can be encoded using fewer bits. Provided |a| + |b| is more than |σ|, we will require no more than 3d log((|a| + |b|)³) = 9d log(|a| + |b|) bits for d operations. In order to show that each operation can be specified using O(log n) bits, we claim that any intermediate string consists solely of substrings of a or b. Clearly, this is true at the start and the end of the editing operation. Then note that any block operation (copying, moving or uncopying) which operates on a string containing only substrings of a and b results in a string consisting of substrings of a and b. Creating a substring that does not belong to either the source or destination string can be accomplished by character operations, but we must remove all such substrings at the end, and so these operations are superfluous, and can be removed from any optimal transformation. Hence, intermediate strings can be parsed into substrings of a and b. There is no need for any intermediate string to contain more than one copy of any substring of a or b. The total length of such substrings is O(n³), and so the length of any intermediate string never exceeds this. Hence locations in each intermediate string can be specified using O(log n) bits.
ii) Tichy's distance: Each operation consists of copying a block from the original string. The cost of describing a copy is at most 2 log n bits.
iii) LZ distance: As with Tichy's distance, each operation is a copy of a block from one or other of the strings, totalling at most 2 log(|a| + |b|) bits.
iv) Edit distance with moves: A substring move operation requires 3 log n bits to describe the block being moved and its new location. The other operations require fewer bits to encode since |σ| ≤ n. An additional 2 bits are required to flag whether the operation is an insertion, deletion, change or move.
v) Reversal distance: There are n(n − 1)/2 distinct reversals, so any one can be specified in 2 log n bits.
vi) Transposition distance: There are (n+1 choose 3) distinct transpositions, so any one can be specified in 3 log n bits.
vii) Swap distance: There are (n − 1) distinct swaps, so any one can be specified in log n bits.
viii) RITE distance: The sequence need never exceed max(|a|, |b|) in length, given a suitable organisation of the operations. This is because if the sequence does exceed this length, then there is a character being inserted that is subsequently deleted. We will use n to denote max(|a|, |b|). Each of the operations can be specified in at most 3 log n bits, to describe its location (start, end and destination), plus an additional 3 bits to describe which of the five possible operations it is. This gives a cost of 3(log n + 1) = 3 log 2n bits.
ix) Permutation edit distance: Each operation moves one element to a new position, and so can be described using 2 log n bits. ✷

6.3 Near Optimal Document Exchange

As we have already seen, many of the distances of interest are metrics. Metric spaces have been studied extensively, and knowing that a distance is a metric tells us a lot about its structure. In fact, all of the metrics that we study here are defined in a particular way: we define a set of unit cost operations (such as the edit operations), and set the distance between two objects to be the minimum cost of a sequence of operations turning one into the other. They are therefore editing distances in the sense of Definition 1.1.1. In this situation, we can further exploit properties of the metric, and obtain a communication scheme that allows document exchange with at most twice the optimal amount of communication. We begin this section with a description of how this can be achieved with a known bound on the editing distance, and then go on to describe a multi-round adaptation that does not require an a priori bound on the distance.

6.3.1 Single Round Protocol

Let R be a symmetric relation on a set X. Let d(x, y) be an editing distance on this set defined as usual from the transitive closure of R — that is, d(x, y) is the minimum n such that R^n(x, y). Suppose that there are two individuals, A who holds x ∈ X and B who holds y ∈ X, such that d(x, y) ≤ l for some known l. The range of d is the non-negative integers, and we will refer to this value as the distance. Let Gi be the undirected graph whose vertices are X and whose edges are {(x, y) | d(x, y) ≤ i}.

Lemma 6.3.1 deg(G2l) ≤ (deg(Gl))².

Proof. deg(G2l) corresponds to the greatest number of vertices at distance at most 2l (in the graph corresponding to R) from any particular vertex x. As the metric is defined by unit cost operations, these vertices are those at distance at most l from some vertex that is itself at distance at most l from x. Since from each vertex there are at most deg(Gl) vertices at distance at most l, the lemma follows immediately. ✷

Following the model of Orlitsky, we shall temporarily employ a hypergraph H whose vertices are the members of X and whose hyperedges are ey = {x : d(x, y) ≤ l}. Both parties can calculate this hypergraph independently and without communication (if some parameter of the graph is not implicit, A can prepend this information to the message with small additive cost). They can then use a deterministic scheme to assign colours to the nodes of the hypergraph with some number of colours. We use χ(H) to denote the chromatic number of H, the smallest number of colours required to colour H so that, for every hyperedge, all nodes in that edge have a unique colour. A's message is the colour of x. By the colouring scheme and the specification of the problem, B knows that x must be amongst the vertices in ey, and can use the colour sent by A to identify exactly which it is. We can place upper and lower bounds on the single message complexity by considering this scheme. Since we must be able to distinguish x among the vertices in the hyperedge ey, at least log(deg(H)) bits must be sent overall. To send the colour of x, we need ⌈log χ(H)⌉ bits [KN97]. Let S be the number of bits necessary in a single message to allow B to calculate x (the single message cost). So log(deg(H)) ≤ S ≤ ⌈log(χ(H))⌉.

Lemma 6.3.2 χ(H) ≤ deg(G2l).

Proof. χ(H) is the number of colours required to colour the vertices of H such that any two vertices which are both at distance at most l from any point x are coloured differently. Consider y and y′ such that d(x, y) ≤ l and d(x, y′) ≤ l. Since d is a metric, d(y, y′) ≤ d(y, x) + d(x, y′) = d(x, y) + d(x, y′) ≤ 2l. Any colouring which ensures that any two points within distance 2l have different colours is therefore a colouring of the hypergraph H. A colouring of G2l achieves this, showing that χ(H) ≤ χ(G2l). On the assumption that G2l is not complete, is connected and has degree three or more (which will be satisfied by the examples in which we are interested), then χ(G2l) ≤ deg(G2l) [Gib85], and so the lemma follows. ✷

Theorem 6.3.1 log(deg(Gl)) ≤ S ≤ 2 log(deg(Gl)).

Proof. It is clear that deg(Gl) = deg(H); in both cases the degree of each vertex corresponds to the number of vertices at distance at most l away. We combine the fact that log(deg(H)) ≤ S ≤ ⌈log(χ(H))⌉ with Lemma 6.3.2 and Lemma 6.3.1: S ≤ log χ(H) ≤ log deg(G2l) ≤ log (deg(Gl))² = 2 log(deg(Gl)), which proves the theorem. ✷

Thus, under certain conditions, a single message can be at worst twice the lower bound on its length for any number of messages. A similar theoretical result to that shown here is given by Orlitsky [Orl91]. By considering a more restricted class of functions, we can construct a concrete protocol which achieves the above bound. Note that our upper bound is related to the degree of a graph which we can calculate, and algorithms exist which will colour a graph with this number of colours; hence it is achievable in practice. Tractable instances of this technique are discussed further in Section 6.3.3.
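For intuition, the colouring argument can be carried out literally on a tiny example. The sketch below (our own illustration, not part of the thesis) does this for Hamming distance on short binary strings: it greedily colours G2l by brute force over the whole space, A sends only the colour of a, and B recovers a as the unique string within distance l of b carrying that colour. The exponential cost of enumerating the space is exactly the issue taken up in Section 6.3.3.

```python
from itertools import product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def greedy_colouring(n, l):
    """Colour all binary strings of length n so that any two strings within
    distance 2l of each other receive different colours (a colouring of G_2l)."""
    space = [''.join(p) for p in product('01', repeat=n)]
    colour = {}
    for x in space:
        used = {colour[y] for y in space if y in colour and hamming(x, y) <= 2 * l}
        c = 0
        while c in used:                       # smallest colour not used nearby
            c += 1
        colour[x] = c
    return colour

def decode(colour_of_a, b, colour, l):
    """B recovers a: the unique string within distance l of b with that colour."""
    candidates = [x for x in colour
                  if hamming(x, b) <= l and colour[x] == colour_of_a]
    assert len(candidates) == 1
    return candidates[0]

colour = greedy_colouring(n=6, l=1)
a, b = '101100', '100100'                      # d(a, b) = 1 <= l
assert decode(colour[a], b, colour, l=1) == a
```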

Multi-round versions of the Colouring Protocol

The above section assumes that we know a tight upper bound, l, on the distance between the objects in question. This is an assumption shared by all previous works on this subject. In earlier sections, we have shown how to arrive at an approximate upper bound on the distance for most distances, but a few are not covered. Here, we briefly discuss how to adapt the above technique to work in the absence of an upper bound on the distance, by introducing a probabilistic test. The modified protocol works in O(log d) rounds, instead of a single message, where d is the true distance. In the first round the users run the protocol under the assumption that d(a, b) is upper bounded by some small constant. B then applies a hash function to the value of a generated by this protocol, and sends the result to A. In turn, A computes the same hash function on the true value of a, and compares the two. If they agree, then it is assumed that the two values of a are the same, and the protocol terminates. We choose our hash functions in order to ensure that the probability of this happening when the two values do not in fact agree is small. The number of bits to represent this hash value is small in comparison to the number of bits needed to represent the messages exchanged, which is O(log n/δ). If the hashes do not agree, the parties double the assumed distance bound and repeat the protocol, until they do agree. This requires 2 log d rounds, and in the worst case transmits four times as many bits as if an exact value for d were available in advance.
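A minimal sketch of this guess-and-double wrapper, assuming some single-round routine exchange(a, b, bound) implementing the colouring protocol and using a fixed digest as a stand-in for the random verification hash (both are our own illustrative names, not part of the thesis):

```python
import hashlib

def fingerprint(s):
    """Stands in for the short random hash used to verify B's guess for a."""
    return hashlib.sha256(s.encode()).hexdigest()[:16]

def exchange_without_bound(a, b, exchange):
    """Repeatedly run a single-round protocol exchange(a, b, bound) with
    bound 1, 2, 4, ... until B's reconstruction verifies against A's fingerprint."""
    bound = 1
    while True:
        guess = exchange(a, b, bound)      # B's candidate value for a
        if fingerprint(guess) == fingerprint(a):
            return guess                   # accepted; correct with high probability
        bound *= 2                         # bound was too small: double and retry
```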

6.3.2 Application to distances of interest

We will now show how Theorem 6.3.1 can be applied to the distances of interest. We shall make use of the definition of a non-trivial communication scheme (Definition 6.1.2).

Hamming distance is a metric on the space of strings of length n over an alphabet σ, and is induced by defining a symmetric relation on strings which differ in a single character. It is therefore an editing distance. For l ≤ n/2, deg(Gl) is the number of strings within Hamming distance l of a given string, and we can consider the upper bound on (the logarithm of) this quantity for up to l Hamming differences described in Lemma 6.2.1. This uses l(log((|σ| − 1)(n/l)) + 2) bits, hence we have the result that up to l Hamming errors can be corrected using a single message with asymptotic cost at most 2l(log((|σ| − 1)(n/l)) + 2), which is non-trivial for l up to a constant fraction of n.

Edit distance is defined by charging unit cost to insertions, deletions or alterations of a character. Again, for l edit operations, the upper bound of Lemma 6.2.2 bounds log(deg(Gl)), so Theorem 6.3.1 gives a single message cost of at most 2l(log(2|σ|(n + 1)/l) + log log n) bits, which is non-trivial for l up to a constant fraction of n.

Edit Distance with Moves is an editing distance. Using the bitwise encoding, this gives an upper bound on the single message cost of 6d log 2n bits.

Compression Edit Distance is an editing distance. Using the simple encoding from Lemma 6.2.3 we can achieve a single message cost of 18l log(|a| + |b|) bits, non-trivial for l ≤ |a| log |σ| / (18 log(|a| + |b|)).

Permutation Edit Distance has only a single operation, of moving a single element. We can use the bitwise encoding to get a single message cost of 4l log n bits, non-trivial for l < n/4.

Reversal Distance is found from unit cost reversals. Following the same argument, we get a single message cost of 4l log n bits, non-trivial for l < n/4.

Transposition Distance is similar to reversal distance, and yields a single message cost of 6l log n bits, non-trivial for l < n/6.

Swap Distance is similar to Permutation Edit Distance, and can have a single message cost of 2l log n bits, non-trivial for l < n/2.

RITE Distances allow combinations of the various operations, reversals, indels, transpositions and edits, each at unit cost. By using a bitwise coding, we find a cost of 6l log 2(|P| + |Q|) bits, which is non-trivial for l < |P|/12 (if P and Q are approximately the same length).

So for all the metric distances described here, a single difference correcting message can be constructed which has size O(l log n), with low coefficients. They are all efficient with respect to their corresponding distances. Almost all of them are non-trivial schemes for up to a linear number of differences, which covers all practical situations. When there is a linear number of differences, then the two sequences will look very different indeed.

6.3.3 Computational Cost

While these protocols are communication-efficient, they are computationally expensive. This expense comes from the requirement to construct a graph of all possible objects. For example, for permutation edit distance, there are n! possible permutations of length n, so constructing the nodes of this graph requires O(n!) time and space. Once the graph has been constructed the other operations are polynomial in the size of the graph; however, since the graph is exponentially large, these operations take time exponential in n. Firstly, G1 can be found as the graph with edges between each pair of nodes that have a distance of one between their respective objects. In the permutation edit distance example, this can be done in time O(n² n!), since each permutation is at distance one from O(n²) others. Constructing G2l from G1 can be done in time O((n!)² log l), and finally colouring this graph can be done greedily in time linear in the size of the graph [Gib85]. So overall, the cost will be bounded by the most expensive operation, computing G2l, which is O((n!)² log l). This is clearly impractical for all but very small values of n.

However, this technique is not solely of theoretical interest. Firstly, observe that since this procedure generates a single message, this message can be thought of as an error-correcting code to correct errors of the corresponding distance model. This proof shows that error-correcting codes for these models of error (transpositions, insertions, deletions, block operations etc.) exist, since we have shown concrete methods to construct such codes, albeit in a fashion that is computationally intractable. Secondly, note that the exponential cost of the method outlined is not an inherent part of constructing the message. Given a sequence a, we wish to give it a colour so that its colour is unique amongst all other objects within a distance d of another sequence b. To ensure that both parties agreed on the colouring, we arranged for them both to construct the same graph from scratch and colour it using the same algorithm. What would instead be desirable is a deterministic function that directly gives the colour of an object, such that both the function and its inverse are computable in polynomial time. Then the overall computational cost would be polynomial rather than exponential. In fact, in some cases this is possible. In particular, the similarity of our colourings to error-correcting codes means that we can use techniques from the field of error-correcting codes to get our desired colourings. We show how this is possible for several of the distances under consideration.

                                              a                      b
Bit-string                                    1101 0100 0101 0011    1001 0100 0101 0011
Parity of whole string (locations 0-15)       0                      1
Parity of locations 8,9,10,11,12,13,14,15     0                      0
Parity of locations 4,5,6,7,12,13,14,15       1                      1
Parity of locations 2,3,6,7,10,11,14,15       0                      0
Parity of odd locations                       0                      1

A calculates the colour of the bitstring a using parities to get 00100, which is sent to B. In turn, B finds the colour of b as 10101. By comparing the colours in a bitwise fashion, B can immediately find the location of the difference and hence work out what the string a is. The exclusive-or of the colours is 10001. The first bit indicates that a is not identical to b; the next four bits can be read as a binary integer to show that the differing bit occurs at position 1.

Figure 6.1: Calculating a colour using Hamming codes

Hamming Distance

The majority of coding theory focuses exclusively on the problem of correcting errors that change characters in place. It was precisely this situation that originally gave rise to the concept of the Hamming distance [Ham80]. We can use techniques from coding theory to efficiently compute the kinds of colours that are required. An example of this is given in Figure 6.1, which uses a Hamming code to allow the exchange of documents with up to a single difference. In this special case, the colour is calculated in O(n log n) time, and requires log n + 1 bits, which matches the lower bound in this case where h = 1 and so is optimal. For the general case, Reed-Solomon codes can be used to create a colour for a string using 2h log n bits to correct up to h differences. This is essentially the technique used in [AGE94]. We can extend this approach to certain other distances by taking advantage of their embeddings into the Hamming distance.
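The colour in Figure 6.1 can be computed directly from the parities; the following Python sketch (our own illustration of the Hamming-code construction, not code from the thesis) reproduces the figure's values and decodes the single differing position from the exclusive-or of the two colours.

```python
def colour(bits):
    """Hamming-code style colour of a 16-bit string (as in Figure 6.1):
    overall parity plus one parity bit per bit of the position index."""
    n = len(bits)
    c = [sum(bits) % 2]                               # parity of the whole string
    for k in reversed(range(4)):                      # positions with bit k set: 8s, 4s, 2s, 1s
        c.append(sum(bits[i] for i in range(n) if (i >> k) & 1) % 2)
    return c

def locate_difference(col_a, col_b):
    """If a and b differ in exactly one position, recover it from the colours."""
    x = [u ^ v for u, v in zip(col_a, col_b)]
    if x[0] == 0:
        return None                                   # the strings agree
    return int(''.join(map(str, x[1:])), 2)           # remaining bits give the position

a = [int(c) for c in "1101010001010011"]
b = [int(c) for c in "1001010001010011"]
assert colour(a) == [0, 0, 1, 0, 0] and colour(b) == [1, 0, 1, 0, 1]
assert locate_difference(colour(a), colour(b)) == 1   # the strings differ at position 1
```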

Swap Distance

We saw in Section 3.2.1 that there is an exact, non-distortive embedding of swap distance into the Hamming distance. So for any given sequence, there is a corresponding binary matrix of size n², and the Hamming distance between two matrices is the swap distance between the corresponding sequences. So to find the colour for a sequence under swap distance, we can compute its binary matrix, treat it as a binary string of length n², and compute the colour accordingly. To correct up to h differences, a message of size 2h log n² = 4h log n bits will suffice. One point to note is that swap distance can be as high as n(n − 1)/2, which occurs between a permutation and its reverse. This protocol is only non-trivial for h < n/4.
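As a concrete check of this embedding (our own rendering, using one entry per unordered pair i < j so that the correspondence is exact), the sketch below maps a permutation to a 0/1 vector recording which element of each pair appears first, and verifies that the Hamming distance between two such vectors equals the adjacent-swap (inversion) count.

```python
from itertools import combinations

def pair_vector(perm):
    """One entry per pair {i, j} with i < j: 1 if i appears before j in perm."""
    pos = {v: k for k, v in enumerate(perm)}
    return [int(pos[i] < pos[j]) for i, j in combinations(sorted(perm), 2)]

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

def swap_distance(p, q):
    """Adjacent-swap (inversion) distance, counted directly for checking."""
    pos = {v: k for k, v in enumerate(q)}
    r = [pos[v] for v in p]                     # express p relative to q
    return sum(1 for i, j in combinations(range(len(r)), 2) if r[i] > r[j])

p, q = [3, 1, 4, 2, 5], [1, 2, 3, 4, 5]
assert hamming(pair_vector(p), pair_vector(q)) == swap_distance(p, q)
```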

Reversal Distance, Transposition Distance, RITE distances

These distances can also be embedded into the Hamming distance, although with distortion factors of between 2 and 3. Each permutation generates a binary matrix of size n² bits. The same approach can be used as above: we treat the matrix as a bit-string of length n², and use an error-correcting code to generate a colour. If the distance (reversal, transposition) is d, then the Hamming distance between the bit-strings will be between d and 2d. Hence we can use an error-correcting code to correct up to 2d differences. The cost of this will be 2 · 2d log n² = 8d log n bits for transposition distances, which is non-trivial for distances up to n/8. With a little care, the same can be achieved for RITE distances. The embedding of RITE distances is into the L1 metric; we can use the trivial embedding of L1 into Hamming distance (in Section 7), and so gain a single message cost of size O(d log n).

6.4 Computationally Efficient Protocols for String Distances

The techniques described in Section 6.3.1 are computationally intractable for several of our distances, and not applicable to some of the others. Even in the cases for which an efficient colouring procedure is available, the computational cost may become excessive for very large strings or permutations. We therefore seek alternative protocols for document exchange which are not only relatively efficient in terms of communication, but also in terms of the computation required. We shall adopt a divide-and-conquer approach, and tackle the problem over a series of rounds. We will make use of hash functions for probabilistic matching of substrings. In some cases, the approaches that we describe have already been described independently in the literature; however, we will set these methods into a common framework and additionally analyse their communication cost and probability of success, which has not been done rigorously before. We first consider an abstract problem which will be the basis for the solution of the problems under consideration. We are given a string of length n, and h of its characters are 'distinguished'. We can use queries of the form "does this substring contain any distinguished characters?", which give a yes or no answer. Our goal is to locate all h such characters. A simple solution is to divide the string into two equal size substrings and query them. If a substring is free of distinguished characters, then we do not consider it further. Otherwise it is further divided into two halves, each of which is queried inductively. Clearly, this approach will find all the distinguished characters; the question is how many queries it will require. We first define a simple parsing technique to build a binary tree on a sequence.

Definition 6.4.1 The binary parse tree of a sequence a is defined by repeatedly pairing adjacent entries. The result is a binary tree of height log |a|. It can be made iteratively: if a = a[1]a[2] ... a[n] then the first level consists of the nodes a′[i] = (a[2i − 1], a[2i]) for i = 1 ... n/2 (we assume for simplicity that n is a power of two: a can be 'padded' with null characters if this is not the case). This pairing procedure can be repeated on the resulting sequences until the new sequence is of length 1. An example of a binary parse tree is given in Figure 6.2.

Lemma 6.4.1 An upper bound on the number of queries made in this protocol is 2h log(2n/h).

Proof. The simple divide and conquer procedure effectively searches the binary parse tree of the string (from Definition 6.4.1). We consider how the distinguished characters could be placed in the string given the fixed search procedure. At any given level of the binary parse tree of the string, h distinguished characters will necessitate 2h different queries if they are in separate subtrees. This is because we might know h different substrings that contain a single distinguished character each; the protocol would split each of these substrings into two and make a query on each. So to maximise the number of queries, distinguished characters must occupy as many distinct subtrees as possible. Starting from the root of the binary parse tree and working down, we can arrange for the distinguished characters to occupy the first log h levels completely, requiring a query for every node in those levels of the parse tree. At the next level, there are more than 2h substrings, so it is not possible to require a query of each of them; instead, to maximise the number of queries, we ensure that no two distinguished characters are placed in the same subtree. This requires 2h queries at each successive level of the tree until the leaves are reached. Following this pattern, we must query all 2h − 2 substrings in the first log h levels, and 2h substrings in each of the remaining log n − log h levels. By construction, we could not have arranged the distinguished characters to force any more queries, and so this is an upper bound. ✷
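The behaviour of this search, and the query bound just proved, can be checked with a small simulation; in the sketch below (our own, with a local oracle standing in for the communicated hash comparisons) the queries are counted explicitly.

```python
def locate(n, distinguished):
    """Locate all distinguished positions in [0, n) using only interval queries
    'does [l, r] contain a distinguished position?', counting the queries made."""
    queries = 0

    def contains(l, r):                      # stands in for one hash comparison
        nonlocal queries
        queries += 1
        return any(l <= d <= r for d in distinguished)

    found, stack = [], [(0, n - 1)]
    while stack:
        l, r = stack.pop()
        if not contains(l, r):
            continue                         # clean substring: never split further
        if l == r:
            found.append(l)
        else:
            m = (l + r) // 2
            stack.extend([(l, m), (m + 1, r)])
    return sorted(found), queries

positions, queries = locate(1024, {5, 600, 601, 999})
assert positions == [5, 600, 601, 999]
print(queries, "queries for h = 4, n = 1024")   # well within 2h log(2n/h) = 72
```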

Each query gives us one bit of information, so here we are gaining no more than 2h log(n/h) + O(h) bits of information. This is asymptotically only twice the minimum number of bits required to locate all the distinguished characters, which follows from Lemma 6.2.1 if we treat the locations of Hamming differences as distinguished characters. We will now apply this abstract problem to our various distance measures.

6.4.1 Hamming distance

The problem of locating the Hamming errors, given an upper bound on the number of errors, is similar to the problem of group testing (also observed by Madej [Mad89]). We wish to group samples and perform a test which will return either 'all the same' or 'at least one mismatch'. The differences are that we have an ordering of the samples, given by their location in the string, and that we have to contend with the problem of false negatives. We use negative to mean that the test indicates no errors; false negatives are an inevitable consequence of using fewer than n bits of communication to locate the Hamming differences, since we are checking whether substrings are identical. Deterministically comparing two strings for equality requires n bits of communication; hence we shall be using probabilistic tests to reduce the communication load, which have some probability of failure. Our tests will be based on fingerprinting with hash functions such as those from Lemma 2.1.1. If a substring in one string hashes to the same value as the corresponding substring in the other, then we take this as evidence that the two substrings match. Note that there is no danger of false positives; if a test leads us to believe that the substrings are different, then there is zero probability that they actually match. The following scheme is a straightforward way to locate and correct differences, and is similar to that proposed in [Met83]. Mapping our terminology onto that of the above problem, the locations of Hamming differences between the two strings are the distinguished characters. We shall use the proposed simple divide and conquer algorithm to locate them. The algorithm is illustrated with an example in Figure 6.2. The test we perform involves communicating to make the queries: in each round, one party will send the value of a hash function for each of the substrings in question, in the order they occur in the string. The other party will calculate the value for the corresponding substrings of their string. If they agree, then there is (with high probability) no difference between the two substrings; otherwise there is (with certainty) some discrepancy between them. The reply to the message is a bitmap indicating the success or failure of each test in order — say, using 1 to indicate that the hashes did not agree, 0 if they did. Every substring that was not identified is split into two halves, and the procedure continues on these new substrings until it is more expensive to send the hashes than to send the unidentified substrings. There are clearly no more than 2 log n rounds, since A can only split substrings of a in half log n times. This procedure is outlined in pseudocode — the algorithm run by A is given in Algorithm 6.4.1, that run by B in Algorithm 6.4.2. We must choose our hash functions so that the overall probability of success of the procedure is high. We will make use of the family of Karp-Rabin hash functions described in Section 2.1.3. These take a parameter δ′ and ensure that, over all choices of hash functions from the family, the probability of two arbitrarily chosen sequences having the same hash value is at most δ′. We would like that, after performing all the tests, the probability of them all succeeding is at least a constant. Using the union bound, if we perform t tests, then the probability of them all succeeding is at least 1 − tδ′. Therefore, if we choose δ′ = δ/t, for some constant δ, then we guarantee that overall the procedure has probability at most δ of failing.

Algorithm 6.4.1 Run by A, who holds a
  range ← [0, n − 1]
  repeat
    for all [l, r] in range do
      append hash(a[l, (l + r)/2 − 1]) to output
      append hash(a[(l + r)/2, r]) to output
    send output
    receive bitmap
    newrange ← empty
    for all [l, r] in range do
      dequeue bit1, bit2 from bitmap
      if bit1 = 1 then
        append [l, (l + r)/2 − 1] to newrange
      if bit2 = 1 then
        append [(l + r)/2, r] to newrange
    range ← newrange
  until cheaper to send characters
  for all [l, r] in range do
    append a[l, r] to output
  send output

Algorithm 6.4.2 Run by B, who holds b
  range ← [0, n − 1]
  repeat
    receive hashes
    newrange ← empty
    for all [l, r] in range do
      dequeue hash1, hash2 from hashes
      if hash(b[l, (l + r)/2 − 1]) = hash1 then
        append 0 to bitmap
      else
        append 1 to bitmap
        append [l, (l + r)/2 − 1] to newrange
      if hash(b[(l + r)/2, r]) = hash2 then
        append 0 to bitmap
      else
        append 1 to bitmap
        append [(l + r)/2, r] to newrange
    send bitmap
    range ← newrange
  until cheaper to send characters
  a ← b
  receive characters
  for all [l, r] in range do
    a[l, r] ← next (r − l + 1) characters

Theorem 6.4.1 With an estimate of the Hamming distance, ĥ, if A runs Algorithm 6.4.1 and B runs Algorithm 6.4.2, this protocol is efficient with respect to the Hamming distance. It communicates no more than 2h log(2n/h) hash values of O(log n + log 1/δ) bits each, and succeeds with probability 1 − δ.

Proof. Lemma 6.4.1 says that we must make at most 2h log(2n/h) queries (note that this is h, not ĥ). We fix δ, and so we will choose δ′ so that δ′ = δ/(2ĥ log(2n/ĥ)). To represent the values of the hash function that are exchanged requires log((n/δ′) log(n/δ′)) bits, which is log((2nĥ/δ) log(2n/ĥ) · log((2nĥ/δ) log(2n/ĥ))). This is O(log n/δ). In total, we communicate at most 2h log(2n/h) hash values. For each hash that is sent, the other party sends one bit in response. We must also communicate the prime, p, that is used in the hash functions, which also has size O(log n/δ). Combining all these gives the communication cost. ✷

a b a c d b b c a d c b b b a c
a b a c a b b a a d c a b b a c

The lower string is treated as a, the upper is b. The Hamming distance between the two strings is 3. Blobs mark the differing substrings which are discovered in the traversal of the search tree; all other substrings are unchanged. The rounds progress as follows:
• Round 1 A sends hashes [0, 7], [8, 15]
• Round 2 B replies with 11 (indicating that both hashes disagreed)
• Round 3 A sends hashes [0, 3], [4, 7], [8, 11], [12, 15]
• Round 4 B replies with 0110
• Round 5 A sends hashes [4, 5], [6, 7], [8, 9], [10, 11]
• Round 6 B replies with 1101
• Round 7 A sends the characters abbaca
• Finish B now knows what the string a is.

Figure 6.2: Illustrating how Hamming differences affect the binary parse tree

This protocol is non-trivial for h = O(n/(log n log log n)). The overall cost is a factor of O(log((ĥ/δ) log n)) above the optimal; however, this cost comes from the size of the hash values sent. With regard to the number of hash values sent, we have shown it is about twice the optimal of any possible scheme which uses hash functions to give a binary answer to a question. The number of rounds is precisely 2 log n − 1: we can progress one level of the binary parse tree every other round.

Theorem 6.4.2 The computational complexity of Algorithms 6.4.1 and 6.4.2 is O(n log h).

Proof. We make use of some of the observations of Lemma 6.4.1. Considering the binary parse tree of the string, once we have passed level log h, in the worst case we have to deal with 2h substrings whose total length is O(n). At this level, the parties must do O(n) work to compute the hashes. In each subsequent level, the substrings under consideration can be no more than half the maximum length of those of the level above, so the total cost of all these lower levels is O(n). In the worst case, we also need to find hashes for every substring in the binary parse tree above level log h. This costs O(n) for each of the log h levels, giving the claimed cost. ✷

Note that although an estimate of the Hamming distance is required for choosing the size of the hash functions, the number of queries depends on the true Hamming distance. In the absence of a good bound on the Hamming distance, the trivial upper bound h ≤ n can be used. Depending on the choice of the hash function, it may be possible to halve the amount of information sent: if linear hashes are used then the hash value for the second half of a substring can be calculated from the hash value of the first half and that of the whole substring. Additionally, by pre-computing the hashes of every substring bottom-up, the computational complexity of the protocol can be reduced from O(n log h) to O(n). Otherwise, if hashes are independent, then stored values can be used to check that the procedure has succeeded. This comment applies to all of the hierarchical schemes presented in this section.
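One way to realise the bottom-up precomputation is with a polynomial hash, where a parent's hash is obtained from its two children in constant time. The sketch below is our own illustration with arbitrarily chosen parameters, not the exact hash family of Lemma 2.1.1.

```python
P = (1 << 61) - 1          # a large prime modulus (example choice)
R = 1_000_003              # evaluation point of the polynomial hash

def parse_tree_hashes(a):
    """Hashes of every node of the binary parse tree of a, level by level,
    computing each parent from its two children in O(1) time."""
    level = [(ord(c) % P, 1) for c in a]         # (hash, length) for each leaf
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            (h1, n1), (h2, n2) = level[i], level[i + 1]
            # hash of the concatenation: shift the left hash by the right length
            nxt.append(((h1 * pow(R, n2, P) + h2) % P, n1 + n2))
        if len(level) % 2:                        # unpaired node carried up unchanged
            nxt.append(level[-1])
        level = nxt
        levels.append(level)
    return levels

levels = parse_tree_hashes("abacabbaadcabbac")    # the string a from Figure 6.2
print(len(levels), "levels;", [len(l) for l in levels])
```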

6.4.2 Edit Distance.

We next translate the above approach to deal with the edit distance. A similar scheme is proposed, but not analysed, in [SBB90]. It is not immediately apparent how to map the edit distance problem onto the abstract problem presented above. However, observe that the way that differences were located was by a process of elimination: identical substrings were found, until all that remained were disparate substrings. We may use the same kind of hierarchical approach to identify common fragments. It is no longer the case that substrings are aligned between the strings; however, we know that matching substrings will be offset by at most the edit distance. The party B receiving the hash values must therefore do more work to identify them with a substring. B has a bound d̂ on the edit distance; if a substring is unaltered by the editing process, then it will be found at a displacement of no more than d̂ from its location in A's string. So B calculates hashes of substrings of the appropriate length at all such displacements left and right from the corresponding position in its string, testing each one to see if it agrees with the sent hash. If one does agree, it is assumed that they match, and so B now knows the substring at this location in A's string. The outline pseudocode for this is given in Algorithm 6.4.3. We consider an optimal edit sequence from a to b of length d and how it affects the binary parse tree representing the splitting of a. A replacement has the same effect as in the Hamming case — it affects every ancestor of the changed character in the binary tree, but nowhere else in the tree. As shown in Figure 6.2, each affected node in the binary parse tree can cause up to two hashes to be sent. The same idea can be applied to the other operations. A deletion also affects characters at the leaf nodes, and every ancestor of that node (it could be thought of as a change of the original character to a null character, '−'). An insertion of any number of consecutive characters between two adjacent characters is considered to occur at the internal node which is the lowest common ancestor of the pair. It therefore changes every ancestor of the affected node. The protocol must traverse this tree, computing hashes on every node whose subtree contains any of these edit operations. This is illustrated in Figure 6.3.

Theorem 6.4.3 If A runs Algorithm 6.4.1 and B runs Algorithm 6.4.3 then this protocol is efficient with respect to the edit distance. It communicates no more than 2d log(2n/d) hash values, each of which is of size O(log n/δ) bits and succeeds with probability 1 − δ.

Proof. Observe that the worst case is exactly as before: (almost) all errors are deletions or alterations, since we have to descend further down the tree to discover these. Certainly, in the worst case, the number of hashes sent will be that given by Lemma 6.4.1, 2d log(2n/d). We must compare each hash sent with up to 2d̂ + 1 others in the worst case. As before, we want the probability of an error to be no more than a constant, which gives δ = 2d log(2n/d)(2d̂ + 1)δ′. Following the same line of argument as in Theorem 6.4.1, we compute hashes of size O(log((2/δ) n d̂ (2d̂ + 1) log(2n/d))) bits, which is O(log n/δ) since d ≤ n. ✷

Corollary 6.4.1 The computational complexity of this protocol is O(n log n) hash computations.

Proof. The time complexity for A is clearly O(n log d), since A makes the same computations as before, by running Algorithm 6.4.1 — we have already shown that the worst case number of hashes sent is the same. However, B has to do more work to align hash values in Algorithm 6.4.3. Suppose at each level B computes the hash of every substring of the appropriate length — there are O(n) of these, and the substrings may be Θ(n) long. We make the assumption here that given the hash for a substring a[l : r], we can easily compute the hash for a[l + 1 : r + 1]. This is true of a large class of hash functions, including those described in Lemma 2.1.1. B can store each hash in a hash table (running a second hash function on the hash values if necessary). Working in the RAM model, in which manipulating hashes takes time O(1), the time required to build the hash table is O(n log n): O(n) work at each level. Searching the hash table takes time O(1) per hash. We follow the same procedure at each of the log n levels, giving a total cost of O(n log n). ✷

Algorithm 6.4.3 Run by B, who holds b
  range ← [0, n − 1]
  repeat
    receive hashes
    newrange ← empty
    for all [l, r] in range do
      w ← (r − l)/2
      dequeue hash1, hash2 from hashes
      if ∃ 1 ≤ i ≤ n − w : hash(b[i, i + w]) = hash1 then
        append 0 to bitmap
        a[l, l + w] ← b[i, i + w]
      else
        append 1 to bitmap
        enqueue [l, l + w] to newrange
      if ∃ 1 ≤ j ≤ n − w : hash(b[j, j + w]) = hash2 then
        append 0 to bitmap
        a[r − w, r] ← b[j, j + w]
      else
        append 1 to bitmap
        enqueue [r − w, r] to newrange
    send bitmap
    range ← newrange
  until cheaper to send remaining characters
  receive characters
  for all [l, r] in range do
    a[l, r] ← next (r − l + 1) characters
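The assumption that the hash of a[l + 1 : r + 1] follows in constant time from the hash of a[l : r] holds for Karp-Rabin style polynomial hashes. A minimal sketch of that sliding computation (again our own illustration with example parameters, not the exact family of Lemma 2.1.1):

```python
P = (1 << 61) - 1
R = 1_000_003

def all_window_hashes(b, w):
    """Polynomial hashes of every length-w substring of b, each obtained from
    the previous one in O(1) time (drop the left character, add one on the right)."""
    h = 0
    for c in b[:w]:
        h = (h * R + ord(c)) % P                 # hash of the first window
    out = [h]
    top = pow(R, w - 1, P)                       # weight of the leftmost character
    for i in range(1, len(b) - w + 1):
        h = ((h - ord(b[i - 1]) * top) * R + ord(b[i + w - 1])) % P
        out.append(h)
    return out

hashes = all_window_hashes("cabageheadbag", 3)
# B stores these values in a hash table and looks up each hash received from A
```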

6.4.3 Tichy’s Distance

Consider a variation of the above abstract problem. Instead of h distinguished characters, suppose that there are some number of “dividers” which are positioned between characters, such that there is at most one divider between any two adjacent characters. This problem is reducible to the distinguished character problem: there are n − 1 locations where dividers can be placed. If our queries are whether there are any dividers within a substring, then clearly the same protocol will suffice, with the same number of queries in the worst case. We consider two strings a and b whose Tichy distance is l. This means that a can be parsed into l pieces each of which is a substring of b. So we can use the same divide-and-conquer technique as before: for every hash value A sends, B searches to see whether there is any substring of b that hashes to the same value. If there is, then these substrings are identified; if not, the substring will be split in two and the procedure recurses on the two halves.

a b a c d a b - a a d c a b b a d
a b a c a b e a a d c a b b a e

The lower string is a, the upper is b. The edit distance between the two strings is 3. The parse tree of the string a is shown, with the editing operations to make b. Again, blobs mark the differing substrings which are discovered in the traversal of the search tree. A d is inserted after the fourth character — this affects the lowest common ancestor of its neighbouring nodes, but below this, their substrings are common to both a and b. The deletion of an e, the seventh character of a, and the change of the final character c to e both affect all their ancestors. The communication proceeds as follows:
• Round 1 A sends hashes on [0, 7], [8, 15]
• Round 2 B replies with 11
• Round 3 A sends hashes on [0, 3], [4, 7], [8, 11], [12, 15]
• Round 4 B replies with 0101
• Round 5 A sends hashes on [4, 5], [6, 7], [12, 13], [14, 15]
• Round 6 B replies with 0101
• Round 7 A sends the characters eaae

Figure 6.3: Illustrating how edit differences affect the binary parse tree of a

Theorem 6.4.4 If A runs Algorithm 6.4.1 and B runs Algorithm 6.4.3 then this protocol uses no more than O(l log(2n/l) log n/δ) bits of communication. It succeeds with probability 1 − δ.

Proof. If A sends a hash of a substring that is also a substring of b, then B can identify the substring. Since a is formed from l substrings of b, it follows that a hash will only fail to be identified if the substring of a overlaps more than one substring of b. We can imagine a written out with l − 1 markers between characters indicating the start and end of each of the substrings of b. As noted above, this is equivalent to the original abstract problem: if a hashed substring of a contains no markers, then it can certainly be identified by B; if it contains markers then it may not be identified (if we are lucky, it might be, but we shall take a worst-case analysis). Therefore, we need to send no more than 2l log(2n/l) hash values. We now need to calculate the size of the hash value necessary to ensure that the probability of success is at least a constant. Each hash received from A has to be compared with up to n others, as we try every offset within b. So we make at most 2ln log(2n/l) comparisons, and following the same pattern as in the above sections, we need to choose the hash parameter δ′ so that 2ln log(2n/l)·δ′ ≤ δ. If we have no a priori bound on l, we will have to use the fact that l is at most n. This gives the size in bits of the hash value as O(log(n³/δ)), which leads to the stated communication cost. As with the other protocols, the number of rounds is logarithmic in the length of a, since the size of the substrings being handled halves in each round. ✷

The computational complexity of this protocol is identical to the protocol for edit distance, O(|b| log |b|), since the algorithms being run are the same, and we have shown that they obey the same bound on the number of hashes.

6.4.4 LZ Distance

Since we have shown how to deal with Tichy’s distance, the LZ distance, which is similar in nature, follows fairly directly. We again take the same divide-and-conquer approach. The string a can be parsed into l substrings, which are either substrings of b or substrings which occur earlier in a. Rather than descending the tree in parallel, the protocol performs a left-to-right depth-first search for identifiable substrings of a. At each stage B tries to resolve the pair of hashes that have been sent by looking for a substring of b or the partially built a that hashes to the same value. B first considers the hash of the left substring, and attempts to resolve that. If B cannot resolve the hash, then this is indicated to A, who splits the substring into two, and sends hashes of each of these halves. If B can resolve the left substring, then B proceeds to the right substring, and follows the same procedure.

Theorem 6.4.5 If A runs Algorithm 6.4.4 and B runs Algorithm 6.4.5, then this protocol uses O(l log(2n/l) log n/δ) bits of communication. It succeeds with probability 1 − δ and the number of rounds involved is at most 2l log(n/l).

Proof. In an optimal parsing, a is parsed into l pieces, each piece of which is a substring present in b or earlier in a, or a single character. We can imagine that a has l − 1 "dividers" that separate the substrings in this parsing. If a hash falls between two dividers, then it can be resolved. The number of hashes necessary for B, who holds b, to identify a is that needed to locate the l − 1 dividers, which is given by Lemma 6.4.1. This is 2(l − 1) log(2(n − 1)/(l − 1)), which is less than 2l log(2n/l). We are performing many more comparisons between hash values than before, so we require a larger base over which to compute hashes in order to ensure that the chance of a hash collision is still only constant for the whole process. We only compare hashes of substrings of the same length. If we have two strings each of length n then there are fewer than 2n substrings of any given length, so there are fewer than 2n² possible pairwise comparisons. The theorem then follows. ✷

Algorithm 6.4.4 Run by A, who holds a
  push [0, n − 1] onto rangestack
  repeat
    pop [l, r] from rangestack
    m ← (l + r)/2
    if cheaper to send characters then
      send a[l, r]
    else
      send hash(a[l, m − 1])
      send hash(a[m, r])
      receive bit1, bit2
      if bit2 = 1 then
        push [m, r] onto rangestack
      if bit1 = 1 then
        push [l, m − 1] onto rangestack
  until rangestack is empty

Algorithm 6.4.5 Run by B, who holds b
  push [0, n − 1] onto rangestack
  repeat
    pop [l, r] from rangestack
    if cheaper to send characters then
      a[l, r] ← receive characters
    else
      m ← (l + r)/2
      o ← (r − l + 1)/2
      receive hash1, hash2
      if ∃ j : hash(b[j, j + o]) = hash2 or hash(a[j, j + o]) = hash2 then
        a[m, r] ← the matching substring; bit2 ← 0
      else
        push [m, r] onto rangestack; bit2 ← 1
      if ∃ i : hash(b[i, i + o]) = hash1 or hash(a[i, i + o]) = hash1 then
        a[l, m − 1] ← the matching substring; bit1 ← 0
      else
        push [l, m − 1] onto rangestack; bit1 ← 1
      send bit1, bit2
  until rangestack is empty

Evfimievski [Evf00] considers what turns out to be the same distance measure, and claims that it requires no more than 3l log n hashes to be sent. With a more rigorous analysis, we have improved this to 2l log(n/l). The time complexity of our protocol is O((|a| + |b|) log(|a| + |b|)) hash function manipulations. The procedure to achieve this complexity is essentially the same as described in Section 6.4.2, except that in addition to keeping tables of hashes of b, B must also add hashes of the received parts of a, yielding the slightly higher time complexity.

6.4.5 Compression Distances and Edit Distance with Moves

We develop the above protocol for LZ distance further to cope with Block Edit Distances. Suppose that the Compression Distance between a and b is c. Instead of using the simple divide-and-conquer approach based on a binary parse tree, we shall make use of the more structured parsing given by the ESP tree of the strings. Initially A forms the ESP parsing of the string a, and begins by sending the hash of the substring represented by the top level node in the parsing. B also finds the ESP parse tree of the string b. On receiving hashes of nodes, B attempts to find substrings represented by nodes of b that hash to the same value, and indicates whether they could be found with a bitmap sent to A. A then examines the node that occurs earliest in a which could not be resolved by B, and descends a level in the parse tree, splitting the node into up to three child nodes. A then sends hashes on the substrings represented by each of these nodes, and proceeds in this fashion until B has resolved the whole string.

Lemma 6.4.2 If A runs Algorithm 6.4.6, and B runs Algorithm 6.4.7, then the number of hashes sent under this scheme is at most 24c log* n (log n + 10).

Proof. The first time that B encounters a hash that cannot be resolved is when A sends the hash of a substring that is in the parsing of a but not in the parsing of b. Once this node has been identified, whenever it occurs elsewhere in a it can now be identified by B. Hence each node that is in ET(a) but not in ET(b) generates at most 3 hashes. This is very similar to the proof of Theorem 4.4.1, and in total 3|T(a)\T(b)| hashes are exchanged (recall the definition of the transformation T from Section 4.2). 3|T(a)\T(b)| ≤ 3|T(a)∆T(b)| ≤ 3 · 8 c(a, b) log* n (log n + 10), as shown in Theorem 4.4.1. ✷

Algorithm 6.4.6 Run by A, who holds a
  send n, height
  push ESProot(a) onto nodestack
  repeat
    pop node from nodestack
    while cheaper to send characters do
      send characters(node)
      pop node from nodestack
    send hash(node)
    receive bit
    if bit = 1 then
      for all children of node do
        push child(node) onto nodestack
  until nodestack is empty

Algorithm 6.4.7 Run by B, who holds b
  receive n, height
  a ← null
  repeat
    while cheaper to send characters do
      receive characters
      append characters to a
    receive hashx
    if ∃ node : hash(node) = hashx then
      send 0
      append the substring of node to a
    else
      send 1
  until length(a) = n

Theorem 6.4.6 If A runs Algorithm 6.4.6, and B runs Algorithm 6.4.7, this protocol is efficient with respect to the Compression Distance, using O(c(a, b) log² n log* n) bits of communication.

Proof. As observed in Lemma 6.4.2, this protocol exchanges O(c(a, b) log n log* n) hashes. Each hash needs to be compared to at most O(n) hashes of substrings, and a trivial bound on the block edit distance is n. Putting all this together, we choose an appropriate hash size of O(log n) bits, and follow the usual line of argument to reach the desired result. ✷

Lemma 6.4.3 The time complexity of this protocol is O(n log* n).

Proof. The time complexity of this protocol is slightly lower than those of others we have considered, since we can take advantage of properties of linear hash functions and the tree structure of the ESP tree. It is dominated by the O(n log* n) cost of computing the parse tree. Hashes need to be computed for each level of the parse tree. However, we only compare hashes of substrings corresponding to nodes in a parse tree, hence the hash for one node can be formed by a linear combination of the hashes of its children. Since each hash can be manipulated in time O(1) in the RAM model, and the time cost is O(n) hash manipulations, the total cost of making the hashes is O(n). To allow the searching, these can be stored in a hash table, so that the total cost of the whole operation is O(n). ✷
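As an illustration of such a composable hash (one possible choice, not necessarily the hash family used in the thesis), a polynomial hash lets a parent's hash be formed in O(1) from its children's hashes and lengths:

    # Sketch: Karp-Rabin style polynomial hash; a parent's hash is a simple
    # combination of its children's hashes, so one pass over the leaves suffices.
    P = (1 << 61) - 1      # a large prime modulus (chosen here for illustration)
    R = 1_000_003          # the radix; in practice picked at random

    def leaf_hash(ch):
        return ord(ch) % P

    def combine(hash_left, hash_right, len_right):
        # hash(xy) = hash(x) * R^|y| + hash(y)  (mod P)
        return (hash_left * pow(R, len_right, P) + hash_right) % P

    def string_hash(s):
        h = 0
        for ch in s:
            h = combine(h, leaf_hash(ch), 1)
        return h

    # The hash of a node covering "ab" followed by "age" equals the hash of "abage":
    assert combine(string_hash("ab"), string_hash("age"), 3) == string_hash("abage")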

As can be seen in Figure 6.4, it might be advantageous to use a slightly different protocol: a hash is sent on the substring "bag", which B is unable to identify. However, "bag" is a substring of b and so could be identified if B searched all substrings, not just those that are nodes. So some communication could be saved, at the expense of some extra computation. Finally, we comment that the number of rounds can be improved to log n with some minor modifications to the protocol: instead of a depth-first approach, we can proceed down one level of the parse trees in each round. Some extra bookkeeping is necessary, but from the prior results on the nature of the parsing, the number of hashes exchanged is the same.

[Figure: two parse trees whose leaves spell out a = bagcabagehead (left) and b = cabageheadbag (right).] On the left is the ESP tree for A's string, a; on the right is the parse tree for B's string b. The protocol performs a depth-first search of the tree of a, sending hashes on substrings from A to B. Nodes marked with a dark blob are present in a but not in b and so require a hash to be exchanged; nodes marked with a light blob are present in both a and b and so can be identified by B.

• A sends the length of a, 13, and a hash on the whole string, [0,12]
• B replies with 1
• A sends a hash on [0,4]
• B replies with 1
• A sends a hash on [0,2]
• B replies with 1
• A sends "bag", then sends a hash on [3,4]
• B replies with 0
• A sends a hash on [5,12]
• B replies with 1
• A sends a hash on [5,6]
• B replies with 0
• A sends a hash on [7,9]
• B replies with 0
• A sends a hash on [10,12]
• B replies with 1
• A sends "ead" and the protocol terminates

Figure 6.4: The protocol for exchanging strings based on Compression Distance

Edit Distance with Moves

We observe that since Compression distance includes all the operations of Edit Distance with moves, it follows that the Compression distance between any pair of strings is no more than their Edit Distance with Moves. So since c(a, b) ≤ d(a, b), it follows that any scheme that is efficient with respect to Compression distance is also efficient with respect to Edit Distance with moves. This yields a corollary to Theorem 6.4.6.

Corollary 6.4.2 If A runs Algorithm 6.4.6, and B runs Algorithm 6.4.7, this protocol is efficient with respect to the Edit Distance with Moves, d, using O(d log² n log* n) bits of communication.

6.4.6 Compression Distance with Unconstrained Deletes

The protocol described above for Compression distance also suffices for the case when deletes are unconstrained. Let the unconstrained Compression Distance between a and b be du.

Theorem 6.4.7 If A runs Algorithm 6.4.6 and B runs Algorithm 6.4.7 then this protocol uses O(du log² n log* n) bits of communication.

Proof. As shown in Theorem 4.4.2, du(a, b) ≤ |T(b)\T(a)| · O(log n log* n). If we follow the same argument as before, irrespective of which distance measure we are using, a can be constructed given b using at most 3|T(b)\T(a)| hashes, which is at most 3|T(a) Δ T(b)|. Putting these together shows that the number of hashes exchanged is no more than O(du log n log* n). Since the protocol is identical in operation to that outlined above, it follows that the total amount of communication is as before, with du substituting for d. ✷

Evfimievski [Evf00] discusses this distance, and gives a protocol which is shown to require at most 336du² log n hashes. The protocol is the same binary divide and conquer applied to LZ distance. The main result of the paper is a proof which relates the LZ distance, l, and the Compression Distance with unconstrained deletes, du. He shows that l ≤ 112du², which leads to the above bound. Our protocol, based on the ESP tree, exchanges 24du log n (log* n + 10) hashes. It therefore uses less communication than Evfimievski's in the case where 14du > log* n + 10. Note that if n represents the number of particles in the universe then log* n is 5. We therefore claim that the above protocol uses fewer bits of communication in all practical cases. Further, the computation cost of our protocol for Block Edit Distance with unconstrained deletes is O(n log* n), compared to the cost of O(n log² n) for the LZ distance protocol used in [Evf00], and the number of rounds can be made log n instead of Ω(du² log n).
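For a feel of the two bounds just quoted, the following sketch (illustrative values only; log* computed by repeated log2) evaluates the leading terms 24·du·log n·(log* n + 10) and 336·du²·log n; the crossover condition 14du > log* n + 10 from the text is visible in the output.

    import math

    def log_star(x):
        s = 0
        while x > 1:
            x = math.log2(x)
            s += 1
        return s

    for n in (10**6, 10**12):
        for du in (2, 10, 100):
            esp = 24 * du * math.log2(n) * (log_star(n) + 10)
            evf = 336 * du * du * math.log2(n)
            print(n, du, round(esp), round(evf), esp < evf)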

6.5 Computationally Efficient Protocols for Permutation Distances

Having shown efficient protocols based on divide and conquer for the string distances, we now show how the same approach can be applied to the permutation distances as well.

Swap Distance

We shall use the same divide-and-conquer approach as we did for the Hamming distance. A simple lemma suffices to show that this will be efficient in terms of the swap distance.

Lemma 6.5.1 h(a, b) ≤ 2swap(a, b)

Proof. Each swap interchanges two elements and leaves the rest as they were. Therefore, any single swap can alter the Hamming distance between the two strings by at most 2. So if the swap distance between a and b is swap(a, b), then the total change in Hamming distance can be at most twice this. ✷
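A tiny sketch of the fact used in the proof (the sequences here are made up): a single interchange moves the Hamming distance by at most 2.

    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    a = [0, 1, 2, 3, 4, 5]
    b = [0, 2, 1, 3, 5, 4]          # b is a at swap distance 2 (two interchanges)
    print(hamming(a, b))             # 4, which is at most 2 * swap(a, b)

    # applying one swap to a changes its Hamming distance to b by at most 2
    a2 = a[:]; a2[1], a2[2] = a2[2], a2[1]
    print(abs(hamming(a2, b) - hamming(a, b)))   # at most 2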

Therefore, any document exchange protocol which is efficient with respect to the Hamming distance will also be efficient with respect to the swap distance. This immediately leads to a corollary to Theorem 6.4.1, substituting twice the swap distance s for the Hamming distance h.

Corollary 6.5.1 The protocol for exchanging documents for Hamming distance can be run on permutations, with A running Algorithm 6.4.1 and B running Algorithm 6.4.2 on their respective sequences. This is efficient for the swap distance, and has a communication cost of no more than O(s log(n/s) log(n/δ)) bits. The computation cost is O(n log n) hash operations.

Permutation Edit Distance

From the point of view of exchanging documents, Permutation Edit Distance is virtually identical to String Edit Distance. Each move operation has the same effect on the binary parse tree as a string edit operation — it affects leaf and internal nodes and hence causes a limited number of differences at various levels in the tree.

Theorem 6.5.1 If A runs Algorithm 6.4.1 on permutation P , and B runs Algorithm 6.4.3 on permutation Q then this divide and conquer protocol for string edit distance is efficient with respect to permutation edit distance. It succeeds with probability 1−δ and communicates no more than 4d log (n/d) hash values (where d = d(P, Q)).

Proof. Each permutation can be treated as a string drawn from an alphabet of size n. Each move operation can be treated as a deletion followed by a re-insertion on a string. It therefore follows that the cost of exchanging permutations with a distance of d is no more than exchanging strings with a distance of 2d. The proof is almost identical to that of Theorem 6.4.3. It has the same computation cost, O(n log n). ✷

Transposition Distance

We firstly give a lemma relating the Transposition distance between two permutations, P and Q (t(P, Q)), and the Tichy distance of the permutations, tichy(P, Q).

Lemma 6.5.2 tichy(P, Q) ≤ 3t(P, Q)+1

Proof. Recall from Theorem 3.2.3 that the number of transposition breakpoints of P relative to Q is at most 3 times the Transposition Distance between the two sequences. Between two consecutive breakpoints in P, the subsequence is identical to a subsequence of Q. In other words, if the transposition distance between P and Q is t, then P can be parsed into at most 3t + 1 substrings of Q. If this is the case, then the Tichy distance between P and Q cannot be more than 3t + 1 for t ≥ 1, since we have shown a way to parse P into at most 3t + 1 substrings of Q. Also, tichy(P, Q) = 0 ⟺ t(P, Q) = 0 ⟺ P = Q. ✷
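The following sketch (with made-up permutations) counts transposition breakpoints and reads off the parsing of P into substrings of Q that the proof relies on.

    def transposition_breakpoints(P, Q):
        """Positions i where the pair (P[i], P[i+1]) is not adjacent, in order, in Q."""
        nxt = {Q[i]: Q[i + 1] for i in range(len(Q) - 1)}
        return [i for i in range(len(P) - 1) if nxt.get(P[i]) != P[i + 1]]

    def parse_into_blocks(P, Q):
        cuts = transposition_breakpoints(P, Q)
        blocks, start = [], 0
        for c in cuts:
            blocks.append(P[start:c + 1])
            start = c + 1
        blocks.append(P[start:])
        return blocks        # each block is a contiguous substring of Q

    P = [0, 2, 4, 3, 1, 7, 6, 5, 8]
    Q = [0, 2, 7, 6, 5, 4, 3, 1, 8]   # made-up example: one transposition apart
    print(len(transposition_breakpoints(P, Q)), parse_into_blocks(P, Q))
    # 3 breakpoints, so P parses into 4 blocks, each a substring of Q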

Since the Transposition distance is so closely related to the Tichy distance, the multi-round protocol for the Tichy distance, outlined above, is sufficient to exchange permutations in a fashion that is efficient in terms of the transposition distance. Applying the above lemma with Theorem 6.4.4 yields the following corollary.

Corollary 6.5.2 The protocol for exchanging two documents for Tichy distance, Algorithm 6.4.1 and Algo- rithm 6.4.3, is also efficient with respect to their transposition distance, t(P, Q). It has a communication cost of no more than 6t(P, Q) log(n/2t(P, Q)) hash values, and a computation cost of O(n log n).

Reversal Distance

The same approach used to deal with transposition distance will also work for Reversal distance, with a small alteration. Transposition breakpoints denoted the junctions between substrings of the original sequence; reversal breakpoints mark the junctions between substrings of the original sequence that may have been reversed. Thus the string can still be parsed into substrings of the original sequence — in this case, into at most 2r + 1 such substrings — but some of these may be reversed. So we can use the same protocol again for exchange, that for Tichy's distance, but here we make an alteration: when B is searching for substrings that match the received hashes, b will be considered both forwards and reversed. Algorithm 6.4.3 must be modified so that when hash matches are being searched for, the reversed string must also be searched; but other than that, the algorithm is unchanged. It follows that this scheme will use the same number of hashes as for a string whose Tichy distance is at most 2r, though we will have to make twice as many hash comparisons. We therefore gain another corollary to Theorem 6.4.4.
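A minimal sketch of the one change to B's search (b here is a toy sequence of our own): substrings of both b and its reverse are indexed, so a hash can match in either orientation.

    def index_substrings(b, length):
        table = {}
        for s in (b, b[::-1]):                       # forwards and reversed
            for i in range(len(s) - length + 1):
                table.setdefault(hash(s[i:i + length]), s[i:i + length])
        return table

    b = "021345678"                                   # hypothetical sequence held by B
    lookup = index_substrings(b, 3)
    print(lookup.get(hash("431")))                    # found: "431" is "134" reversed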

Corollary 6.5.3 The modified protocol for exchanging documents for Tichy distance using Algorithm 6.4.1 and the modified Algorithm 6.4.3 is efficient with respect to the reversal distance. It has a communication cost of no more than 4r log(2n/3r) hash values, and a computation cost of O(n log n).

Compound Permutation Distances

Lemma A.1.1 showed that if we allow combinations of permutation operations — reversals, transposi- tions and editing operations — then this induced distance between permutations can be approximated by counting the number of reversal breakpoints, and that this gives a 3-approximation. The above pro- tocol exchanges sequences on the basis of the number of reversal breakpoints, and so will solve this problem. We gain a further corollary to Theorem 6.4.4.

Corollary 6.5.4 If A runs Algorithm 6.4.1 and B runs the modified Algorithm 6.4.3 on their respective permutations then this modified protocol is efficient with respect to the compound permutation distance, τ. It has a communication cost of no more than 6τ log(n/2τ) hash values and computation cost of O(n log n) hash operations.

Allowing Indels

We have seen in Lemma A.1.2 that permutation distances with insertions and deletions, where one of the sequences is allowed to be a string (denoted τ′), can be embedded into the L1 distance with a distortion of at most 3. Each reversal breakpoint in a relative to b can be thought of as the junction in a between two (possibly reversed) substrings of b. It therefore follows that the same modified protocol will suffice, giving one final corollary to Theorem 6.4.4.

Corollary 6.5.5 The protocol of Algorithm 6.4.1 and the modified Algorithm 6.4.3 is efficient with respect to the compound permutation distances with indels. It has a communication cost of no more than 6τ′ log(2n/3τ′) hash values and a computation cost of O(n log² n).

6.6 Discussion

We have seen how two parties can communicate to exchange similar documents in a way that is much more efficient in terms of communication than sending the documents in full.

String Distance             Metric   Lower bound            Single round           Multi-round hashes     Rounds
Hamming Distance            Yes      h log((|σ|−1)n/h)      2h log(n(|σ|−1))       2h log(2n/h)           log n
Levenshtein Edit Distance   Yes      e log(2(|σ|−1)n/e)     2e log(|σ|(n+1)/e)     2e log(2n/e)           log n
LZ Distance                 No       2l log n               —                      2l log(2n/l)           log(n/l)
Compression Distance        Yes      9c log n               18c log(2n)            24c log n log* n       log n
Edit Distance with Moves    Yes      3d log(2n)             6d log(2n)             24d log n log* n       log n
Unconstrained Delete        No       9du log(2n)            —                      24du log n log* n      log n
Tichy's Distance            No       2l log n               —                      2l log(2n/l)           log n

Permutation Distance        Metric   Lower bound    Single round    Multi-round hashes    Rounds
Permutation Edit Distance   Yes      2d log n       4d log n        4d log(n/d)           log n
Reversal Distance           Yes      2r log n       4r log n        4r log(2n/3r)         log n
Transposition Distance      Yes      3t log n       6t log n       6t log(n/t)            log n
Swap Distance               Yes      swap log n     2 swap log n    4 swap log(n/swap)    log n
RITE Distances              Yes      3τ log(2n)     6τ log(2n)      6τ log(2n/3τ)         log n

For each distance, we give a lower bound on the number of bits to exchange sequences; if the measure is a metric, then we give the single round cost based on the colouring protocols of Section 6.3. For the multi-round protocols based on hashing, we give the leading terms in the number of bits exchanged, and the number of rounds required.

Figure 6.5: Main results on document exchange

• We have seen how arguments based on graph colouring can achieve an amount of communication that is a factor of two above the lower bound for a large class of metric distances.

• For several of our distances, we have seen how computationally efficient protocols can achieve the single round cost owing to the structure of these metrics.

• We have described a number of protocols which sacrifice some efficiency in communication for computational tractability. These are all based on ideas of divide-and-conquer techniques using hash functions, some of which have been described before in the literature. For the first time we analyse the cost of these and give tight bounds on the exact number of bits communicated by these protocols.

• For a number of important distances, such as Tichy’s distance, the LZ distance and the Block Editing distances, we give the first protocols or improved protocols to allow the efficient exchange of documents in terms of their distance. These draw on the analysis of these metrics and their embeddings from earlier chapters.

All the protocols described are efficient with respect to the distance measures employed — that is, the cost depends only linearly on the distance. The main results for this section in terms of the cost of communication are summarised in Figure 6.5.

Chapter 7

Stopping

Goodbye to you, my trusted friend,
We've known each other since we were nine or ten.
Together we climbed hills and trees,
Learned of love and ABCs,
Skinned our hearts and skinned our knees. [Jac73]

7.1 Discussion

By studying the use of embeddings we have made use of, and shown new results in, many areas of Computer Science, including Pattern Matching, Computational Biology, Data Compression, Information Theory, Databases, Coding Theory, Computational Geometry, Sublinear Algorithms, Graph Theory, Communication Complexity and Computational Statistics. In this final chapter we draw together some of the main themes of this thesis and discuss some possible extensions.

7.1.1 Nature of Embeddings

The majority of the novel embeddings presented in this thesis in Chapters 3 and 4 have been formed in a particular way: by recording the presence of certain features, or by counting certain features of sequences, and relating these to the distances of interest. In Chapter 3, looking at pairs of symbols was often sufficient to approximate permutation distances; in Chapter 4 a much more involved approach was necessary to perform the embedding, based on recording the quantity of certain substrings of the sequence. But both methods yield what we will informally term "combinatorial" embeddings: they rely on combinatorial properties and give distortion factors which are essentially constants (in some cases, the constant depends on the size of the sequence). This is to be contrasted with the "geometric" embeddings described in Chapter 2, where the distortion of the embedding is a parameter, and determines the dimensionality of the target space.

The huge variety of different embeddings and techniques is hard to survey in any comprehensive way, but we can further distinguish the approaches here from other embedding works. A major result of embedding theory is that of Bourgain [Bou88], reported in [Mat]: any metric space containing d points can be embedded into L2 with a distortion of at most O(log d). For our approach to sequence distances, such a result is not as helpful as it may seem: our embeddings for strings and permutations are valid for all possible sequences, so the metric space contains as many points as there are sequences of length n. For binary strings of length n, there are 2^n possible strings, and so log d = n. Since for our metrics the distance between any pair of sequences is at most n, a distortion factor of O(n) is trivial (report the distance between every non-identical pair as n). Our embeddings (including those in Chapter 2) are computable without foreknowledge of all of the sequences that will be seen; Bourgain-style embeddings require full knowledge of the d points to compute the embedding. An in-depth survey of distance embeddings is given by Indyk [Ind01].

Ideally, we would want to discover efficiently computable embeddings of string and permutation distances into spaces such as Euclidean and Manhattan space that have distortion factors of 1 ± ε. This seems a lot to ask for. Any such embedding would yield polynomial-time approximation schemes for distances, some of which are known to be MAX SNP-hard. Certainly, for reversal distance, constant distortion factors are the best that can be achieved, since this metric has been shown to be NP-hard to compute. This is also thought to be the case (but not proven) for transposition distance. Since permutations are a special case of strings, and have so far given tighter approximation factors than equivalent distances on strings, we would intuitively argue that this implies string distances are harder to deal with than permutations. However, the gap between small constant factors and factors logarithmic in the length of the sequence is significant, and there is still cause to believe that there may be tighter approximation factors possible for string distances. The negative results in this thesis do not entirely contradict this hope. The fact that permutation edit distance is hard to estimate says little about its hardness to approximate (Theorem 3.2.7).
The fact that the Longest Common Subsequence requires Ω(|σ|) bits of communication to approximate (Theorem 3.2.8) says nothing about the hardness of the dual measure of edit distance.

However, we argue that editing distances, and Levenshtein string edit distance in particular, are tough problems to deal with, and that any embedding to a vector distance or sketchable space is unlikely to have a small distortion factor. The result of Theorem 3.2.6 shows that there can exist no embedding of Permutation Edit distance (and also Levenshtein string edit distance [Mat02]) into L1 or Hamming space with a distortion of less than 4/3. With more advanced techniques, it seems likely that stronger lower bounds could be shown for these distances. It is still open whether constant factor embeddings are possible, or whether other (non-embedding) techniques could be used to address editing problems on permutations and strings. In the next section we discuss further the relation between strings and permutations.

7.1.2 Permutations and Strings

We have highlighted some of the gaps between the difficulty of dealing with permutations and that of dealing with strings: transposition and reversal distances can be found, sketched, and approximately matched against, up to a small constant factor, for permutations. For strings, comparable distances based on the same fundamental operations — moving blocks around — have been approximated up to a logarithmic factor. This is a large difference, and a major question is whether this is inherent because of the difference in structures, or whether methods for strings are simply underdeveloped. A concrete illustration of this is that String Edit Distance with Moves and Permutation Transposition Distance with Indels can be thought of as identical distance measures (they allow exactly the same operations: move a contiguous subsequence, insert and delete single symbols) but over different object domains (strings and permutations respectively). For permutations, we can make an embedding with a distortion factor of 2 (Section A.1.2), which holds when we allow one sequence to be a string. But as soon as we allow both sequences to be strings, the best approximation factor we have for an embedding is O(log n log* n), with large suppressed constant factors. The question arises, is such a large factor a necessity when dealing with string distances?

We would claim that the power of permutations is in the fact that each symbol appears at most once and can be easily identified. As we saw in Section 3.2, the technique of looking at adjacent pairs, which is powerful enough to uniquely define permutations, is insufficient for strings. This occurs even if we allow one character to occur at most a small constant number of times. For example, if B, H, M, S are permutations with no common character between them, and a is some extra character, then the sequences BaHaMaS and BaMaHaS are indistinguishable based on looking only at symbol adjacencies, even though the symbol a occurs at most three times.

Instead of symbol adjacencies, it seems that we need to adopt the method of building hierarchical structures on strings to be able to deal with editing operations. Any hierarchical structure with a constant branching degree at each level will have height at least log n, and we would expect this factor to affect any method predicated on examining this many levels, since any operation will affect between one and log n entries in such a structure. This does not deny the possibility of an alternative approach that does not suffer from these weaknesses.

The gap between the difficulty of finding character-based edit distance for strings and for permutations is a little easier to quantify. Traditional algorithms find the exact string edit distance between two strings (length n and m) in time O(nm) using dynamic programming, or O(nm/log m) with the Four Russians technique [MP80]. The exact permutation edit distance can be found in time O(n log log n) using an appropriate data structure. So both distances can be found in polynomial time, although string edit distance takes near-quadratic time, whereas permutation edit distance is near linear. For embeddings of these distances, the results are weaker. We have seen how the permutation edit distance can be embedded into intersection size with a logarithmic distortion, whereas there are no known comparable embeddings for string edit distance. We discuss this further in Section 7.3 below.
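A quick check of the example above (ordinary Python, added here just to confirm the claim that the adjacency multisets of the two sequences coincide):

    from collections import Counter

    def adjacency_pairs(s):
        return Counter(zip(s, s[1:]))

    print(adjacency_pairs("BaHaMaS") == adjacency_pairs("BaMaHaS"))   # True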
In Section 3.2.6 we have also seen that problems of finding the distance between a string and a permutation seem no more difficult than those between a pair of permutations, but that such problems become harder when both sequences are strings. This has useful implications for applications of string matching where the pattern string is known to have no repeated characters.

7.2 Extensions

Since this is a work on the distance between combinatorial structures, it is only reasonable to discuss problems of distance between structures other than those covered in detail.¹ Many objects will degenerate into those we have considered already, as instances of sets, vectors, permutations or strings. Other important categories are trees and graphs, where there is already some body of work on measuring the distance between them.

7.2.1 Trees

Trees are a very natural extension of strings (with added structure), and are a good candidate for further investigation. It is possible to think of strings as a special case of trees: they can be represented as trees where each character is a child of the root. In fact, we have already made implicit use of edit distances on trees in the proof of Lemma 4.3.5, when we show how to convert one tree into another using leaf insert and delete operations, and subtree moves. In general, we will treat trees as having labels from some finite alphabet on each node. Alternatively, one might imagine that only leaf nodes are labelled and that internal nodes exist only to give structure. We distinguish tree editing paradigms by focusing on whether the trees are thought of as ordered (the children of each node occur in a specified sequence) or unordered (no organisation of the children).

In many cases, documents will have tree structure: LaTeX documents are organised in chapters, sections and subsections, and this hierarchy can be mapped onto an ordered tree. Similarly, the organisation of files into directories under a computer operating system gives an unordered tree. XML documents can be treated as representing both ordered and unordered trees. In general, unordered tree problems will be more complex than their ordered equivalents: even testing whether two trees are equal, which is trivial in the ordered case, requires some ingenuity to design a polynomial-time algorithm for the unordered case. We now briefly survey some of the different editing distances that have been studied for ordered and unordered trees.

Ordered Trees

The simplest editing distance between ordered trees is to take the standard string edit distance operations, and apply them to the tree. A change operation then corresponds to changing the label on a node. Insert and delete operations are a little more complicated: the simplest thing to do is to only allow inserts and deletes to leaf nodes (so a leaf node may be deleted or a new leaf node added as a child to an existing node). This metric is studied in [Cha99], and a dynamic programming solution given to find the distance. In fact, this is almost identical to the dynamic programming solution for string edit distance, with added restrictions to respect the tree structure. In [CRGMW96], a powerful subtree move operation is added to the repertoire. Here, any subtree may be detached from its parent and reattached to a new parent node at some specified point in its child list. An algorithm is given to find this editing distance. An earlier paper allows only inserts, deletes and changes of nodes, but permits inserts and deletes to affect internal nodes [ZS89]. Inserted nodes take a contiguous subsequence of their parent’s children as their children, and the children of a deleted node become the children of its parent. A dynamic programming approach again finds the distance. A heuristic approach is taken in [BCD95], to allow a wide range of operations: inserts, deletes, splitting a node into two, merging two nodes and swapping two nodes.

¹The cynic might choose to argue that these have not been covered because they do not begin with the letter 'S'.

Unordered Trees

The basic editing distance on unordered trees, of deleting nodes, changing the label of nodes, or inserting nodes, has been shown to be NP-hard [ZSS92] and MAX SNP-Hard [ZJ94]. Here, deleting a node makes its children the children of its parent, and inserting a node allows a subset of the children of its parent to become its children (so every insertion can be reversed by a deletion and vice-versa). Hence heuristic approaches have been adopted to find editing distances without approximation guarantees [CGM97].

Tree Distance Embeddings

None of the approaches to finding tree distances mentioned above are amenable to an embedding approach. However, it is fairly straightforward to take the general approach of Section 4.2 to build approximate embeddings for trees of bounded degree (either ordered or unordered). The idea is to keep a histogram as before. This time, the histogram would record the number of copies of each node in the tree. Or, if copy and uncopy tree operations are allowed, then use bit flags to indicate whether a certain node is present or absent. It then follows that each permitted editing operation will affect only a limited number of entries in this histogram. The proof of Lemma 4.3.5 then says that one tree can be converted to another based on the size of the difference between their histograms. However the distortion factor of this embedding will be dh, where d is the bound on the degree, and h is the height of the tree. For balanced trees of n nodes, this will be O(log n), but in general this could be as large as O(n), which would not be very satisfactory. A somewhat different approach would be needed to give tighter bounds on the approximation, and for trees whose degree is not bounded.
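One way to realise the histogram idea sketched above (our illustration, with made-up labelled trees; a real instantiation would use the hierarchical naming of Chapter 4 rather than full subtree fingerprints) is the following, where the L1 difference of two histograms serves as the proxy distance:

    from collections import Counter

    def node_key(label, children_keys, ordered=True):
        kids = tuple(children_keys) if ordered else tuple(sorted(children_keys))
        return (label, kids)

    def histogram(tree, ordered=True):
        """tree = (label, [subtrees]); returns a Counter over node 'shapes'."""
        hist = Counter()
        def walk(t):
            label, kids = t
            key = node_key(label, [walk(k) for k in kids], ordered)
            hist[key] += 1
            return key
        walk(tree)
        return hist

    def l1(h1, h2):
        return sum(abs(h1[k] - h2[k]) for k in set(h1) | set(h2))

    t1 = ("r", [("a", []), ("b", [("c", [])])])    # small made-up labelled trees
    t2 = ("r", [("b", [("c", [])]), ("a", [])])
    print(l1(histogram(t1), histogram(t2)),
          l1(histogram(t1, ordered=False), histogram(t2, ordered=False)))   # 2 0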

7.2.2 Graphs

The main problem in dealing with graphs is to define a meaningful distance measurement between pairs of graphs. We discuss some of the possibilities. Suppose that a graph with n vertices has each vertex labelled with a unique element from {1 ...n}. Then between two such graphs we can define a distance based on, say, inserting and deleting edges between vertex pairs. This distance is then exactly the Hamming distance between the adjacency matrix representation of the graphs, and can be dealt with using the methods developed already. On the other hand, suppose graphs are unlabelled. Then determining whether a pair of graphs are actually the same is a non-trivial task: the problem of Graph Isomorphism, while widely conjectured to be solvable in polynomial time, has not been shown to be so. The problem of finding a matching of the nodes between the pair of graphs so that the editing distance between them is minimised is NP-Hard. This follows because the problem of Maximum Common Subgraph is NP-Hard [GJ79]. The smallest editing distance (based on inserting and deleting edges) is found by matching up the maximum common subgraphs, and then inserting and deleting edges as needed. A similar approach is motivated by Bunke and Shearer [BS98]. It is considered unlikely that there are efficient ways to solve this problem. Also, the existence of a polynomial time approximation algorithm for editing unlabelled graphs would imply the existence of a provably polynomial time test for isomorphism, which has so far eluded the algorithms community. We might also think about block operations on graphs, but it is by no means clear what is reasonable to allow. For example, what effect would a block move of a subgraph have on the graph? In summary then, graph editing problems appear to be either trivial (the labelled case) or to pose significant challenges that have not yet been dealt with (the unlabelled case). A new problem definition should lead to ‘good problems’ that fall between these extremes. Alternatively, we might consider restricted kinds of graphs, such as trees (described above), or bipartite graphs [CMNR97].
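For the labelled case the computation really is immediate; a short sketch with made-up graphs (the distance is just the size of the symmetric difference of the edge sets, equivalently the Hamming distance of the adjacency matrices):

    def edge_set(edges):
        return {frozenset(e) for e in edges}

    def labelled_graph_distance(edges1, edges2):
        return len(edge_set(edges1) ^ edge_set(edges2))

    G1 = [(1, 2), (2, 3), (3, 4)]          # made-up labelled graphs
    G2 = [(1, 2), (2, 4), (3, 4)]
    print(labelled_graph_distance(G1, G2))  # 2: delete edge (2,3), insert edge (2,4)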

Permutation Distance    Pairwise factor    Embedding factor
Reversal                11/8               2
Transposition           3/2                2
Swap                    1                  1
Permutation edit        1                  log n

Figure 7.1: Approximation factors for permutation distances

7.3 Further Work

Throughout this work, there are some gaps in the stated results which it would be desirable to shore up either with algorithms to solve the problems, or else prove that these problems are computationally hard or insoluble in the way that we might desire. We now highlight the gaps, and outline the kind of results it may be feasible to prove.

Vectors and Sets

From the point of view of embeddings, it is reasonable to say that vector distances are virtually a closed problem: efficient solutions to embed vectors into smaller dimensional spaces are known for all Lp distances of interest (0 < p ≤ 2).

Permutations

The main open problem for the permutation distances is to produce embeddings that meet the best known approximation factors for pairwise comparisons. This has been achieved for the swap distance, and the gap is small for reversal and transposition based distances (see Figure 7.1 for a summary) but is large for Permutation Edit distance. This is the most interesting distance to deal with because of its close relationship to string edit distance and the length of the Longest Common Subsequence. The obvious open questions are whether the factor of log n can be improved or conversely whether the lower bound of 4/3 on the factor can be raised and made more general. It is open whether any embedding of the Ulam metric can be made into a space which is sketchable, such as L1 or L2 space. Intuitively, it seems that improving on the log n distortion factor will be hard, but any embedding into a sketchable space would be of interest, even for a factor of order log n or more. Related to permutation edit distance, other permutation distances worthy of investigation include reversals and transpositions that are limited to moving blocks of at most some constant size [CS96, HV00].

Strings

From our point of view, the largest outstanding open problem for string distances is to find any embedding of the Levenshtein edit distance into some vector space — preferably a space that is sketchable.

A more general but closely related question is to find some way to approximate the edit distance between two strings which is faster than the quadratic O(mn) dynamic programming exact solution. The only bounds we currently have are one-sided: the edit distance with moves is a lower bound on the edit distance, and we have various approximation results for this distance. However, this does not lead to a bounded approximation for edit distance. Other open problems for strings are to improve the factors of approximation from O(log n log* n), if possible to constants. It has already been remarked that equivalent problems on permutations are embeddable into vector spaces with only small constant distortions.

The problem of approximate pattern matching under string edit distances has received a lot of attention from the string matching community. For Hamming distance, many efficient solutions have been described. For Edit distance, several solutions have been given for the k-differences version of the problem (where only occurrences with an edit distance of less than some parameter k are reported). We have described the first efficient solutions for edit distance with moves. However, these have entailed the O(log n log* n) approximation factors inherent from the embedding approach. It is open to improve the approximation factors for these distances, and to provide better solutions for the edit distance case.

Other Variations

Throughout this work, we have concerned ourselves with unit cost operations: every editing operation is considered to have cost 1. Throughout biology and other matching situations, it is common to assign varying weights to different operations, or to give a different cost depending on the symbols involved (so transforming an A into a T may have a different cost from transforming it into a C). To some extent, these can be absorbed into the current approaches, at the price of a weaker approximation factor, but in general a new approach needs to be taken to deal with weighted operation costs.

Appendix A

Supplemental Section on Sequence Similarity

“Eight weeks passed away like this, and I had written about Abbots and Archery and Armour and Architecture and Attica, and hoped with diligence that I might get on to the B’s before very long.” [CD92]

A.1 Combined Permutation Distances

The following results describe embeddings for permutation metrics made by allowing combinations of the operations studied individually in Chapter 3: transpositions, reversals, edits, and symbol insertion and deletions.

A.1.1 Combining All Operations

We consider the compound distance allowing the combination of reversals, transpositions and permutation editing (moving a single symbol). Denote this distance as τ(P, Q). We make use of the transformation R from Section 3.2.2.

Lemma A.1.1  τ(P, Q) ≤ ½||R(P) − R(Q)||_H ≤ 3τ(P, Q).

Proof. Imagine transforming P into Q. Our suite of operations are Transpositions, Reversals and symbol Moves — however, a move can be described as a special case of a transposition, hence we need only consider two types of operation. We shall use again the notion of reversal breakpoints, recalling that they are defined as pairs of elements which are adjacent in one sequence but not the other. It is clear that a transposition can remove at most 3 reversal breakpoints, since a reversal breakpoint is a special case of a transposition breakpoint. And we know that a reversal can remove no more than 2 reversal breakpoints. So any operation can remove no more than 3 breakpoints, which gives a lower bound on the number of operations necessary. We have already seen, in the proof of Theorem 3.2.2, how reversals alone can be used to transform a sequence using at most the number of reversal breakpoints. Therefore, counting reversal breakpoints gives a 3-approximation to the distance. The number of breakpoints can be found using the R(·) matrices of Section 3.2.2. ✷

Example.
P = 024317658
Q = 027643158
The number of reversal breakpoints between P and Q is 3. Note that a single transposition could transform P into Q. If we just use reversals, the transformation can be done in 3 operations:
024317658 → 026713458 → 027613458 → 027643158

Using just reversals gives a 3-approximation to the mixed distance.
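A short check of the example (ordinary Python; reversal breakpoints are counted here as unordered adjacent pairs of P that are not adjacent in Q):

    def reversal_breakpoints(P, Q):
        adj_q = {frozenset((Q[i], Q[i + 1])) for i in range(len(Q) - 1)}
        return sum(frozenset((P[i], P[i + 1])) not in adj_q for i in range(len(P) - 1))

    P = [0, 2, 4, 3, 1, 7, 6, 5, 8]
    Q = [0, 2, 7, 6, 4, 3, 1, 5, 8]
    print(reversal_breakpoints(P, Q))   # 3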

A.1.2 Transpositions, Insertions, Reversals, Edits and Deletions

We describe a distance, τ′(P, Q), which is defined over sequences P, Q both drawn from a fixed universe of ℓ symbols, labelled 1 to ℓ. Not only does this allow transpositions, reversals and moves, but additionally symbol insertions and deletions, all charged at unit cost. These allow us to compare arbitrary permutations with non-identical sets of elements. Similarly, define r′(P, Q) as the reversal distance between P and Q where inserts and deletes are allowed. Recall that we extend our sequences so that the first element is 0 and the last element is ℓ + 1. Define R′(P) as a binary matrix of size (ℓ + 2) × (ℓ + 2).

(i ∈ P) ∧ (j ∈ P) ∧ (i > j) ∧ |P⁻¹[i] − P⁻¹[j]| = 1  ⟹  R′(P)[i, j] = 1
(i ∈ P)  ⟹  R′(P)[i, i] = 1
otherwise  ⟹  R′(P)[i, j] = 0

Lemma A.1.2  τ′(P, Q) ≤ ½||R′(P) − R′(Q)||_H ≤ 3τ′(P, Q)

and  r′(P, Q) ≤ ½||R′(P) − R′(Q)||_H ≤ 2r′(P, Q)

Proof. We shall prove both of these claims together, since the important part is that each insert or delete has a limited effect on the number of breakpoints. Consider a canonical form of an optimal transformation in which all the deletions occur first, and all the insertions occur at the end. All characters in P but not in Q must be deleted, and all characters in Q but not in P must be inserted. Between these insertions and deletions, the start and end sequences are permutations of each other — call these P′ and Q′ — and can be transformed using the same techniques we have already seen. For τ′, we use the results of Lemma A.1.1: the number of reversal breakpoints of these sequences gives a 3-approximation to the distance; we need to extend the concept of a reversal breakpoint to cope with the different alphabets used. For r′ we use the results of Theorem 3.2.2, and we have a 2-approximation.

When calculating the number of reversal breakpoints, we shall count an additional breakpoint for every member of Q not in P: this modified count is φ′(P, Q). When we delete an element, this will always remove a reversal breakpoint (else, the operation is unnecessary and can be avoided), and can remove at most two breakpoints. When we insert an element i, this is paid for by the fact that i is in Q but not in P (this we count as a breakpoint). We may also remove an additional breakpoint if the symbol is inserted between both its neighbours in Q. Together, this means that every insertion and deletion removes at least one breakpoint and at most two. We know that P′ can be transformed into Q′ using at most φ′(P′, Q′) = φ(P′, Q′) operations. Therefore, counting the modified version of reversal breakpoints gives a 3-approximation to the combined distance and a 2-approximation to the reversal distance, with indels.

We now show that φ′(P, Q) + φ′(Q, P) = ||R′(P) − R′(Q)||_H. Clearly, since R′ is identical to R except on the leading diagonal, this captures the pairwise reversal breakpoints. We use the main diagonal of the R′(P) matrix to indicate the presence of element i with a 1 in entry R′(P)[i, i], and only consider pairs i, j when i ≥ j. So the Hamming distance of the main diagonals additionally counts all the elements of P not in Q and vice versa, as required. Since τ′(P, Q) ≤ φ′(P, Q) ≤ 3τ′(P, Q) and τ′(Q, P) ≤ φ′(Q, P) ≤ 3τ′(Q, P), it follows that τ′(P, Q) ≤ ½||R′(P) − R′(Q)||_H ≤ 3τ′(P, Q). Similarly, for r′, we know (from Theorem 3.2.2) that r(P′, Q′) ≤ φ(P′, Q′) ≤ 2r(P′, Q′) — that is, that we can find a 2-approximation to the reversal distance from the breakpoints of the permutations P′ and Q′. By the same argument as above, we know that we can reach P′ from P and Q′ from Q using a number of insertions and deletions that is at least, and at most twice, the number of breakpoints generated by symbols in P but not in Q and vice versa. Therefore r′(P, Q) ≤ φ′(P, Q) ≤ 2r′(P, Q) and r′(Q, P) ≤ φ′(Q, P) ≤ 2r′(Q, P). As noted above, φ′(P, Q) + φ′(Q, P) = ||R′(P) − R′(Q)||_H. Hence r′(P, Q) ≤ ½||R′(P) − R′(Q)||_H ≤ 2r′(P, Q). ✷

Example. Suppose P = 07653428 and Q = 0431568. Our transform matrices are as follows:

R′(P):
    012345678
0   100000010
1   000000000
2   001010001
3   000111000
4   000010000
5   000001100
6   000000110
7   000000010
8   000000001

R′(Q):
    012345678
0   100010000
1   010101000
2   000000000
3   000110000
4   000010000
5   000001100
6   000000101
7   000000000
8   000000001

Here ||R′(P) − R′(Q)||_H = 12 = φ′(P, Q) + φ′(Q, P) = 6 + 6.

To transform P into Q, we first remove elements of P not in Q, then apply reversals, then insert elements of Q not in P:

07653428 →(delete) 0653428 →(delete) 065348 →(reverse) 043568 →(insert) 0431568

Hence τ′(P, Q) = r′(P, Q) = 4, and r′(P, Q) ≤ ½||R′(P) − R′(Q)||_H ≤ 2r′(P, Q).
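A sketch that rebuilds the R′ matrices for this example directly from the definition above and confirms the Hamming distance of 12 (with ℓ = 7, so the matrices are 9 × 9; ordinary Python, added for illustration):

    def r_prime(P, universe):
        pos = {v: i for i, v in enumerate(P)}
        R = [[0] * (universe + 2) for _ in range(universe + 2)]
        for i in P:
            R[i][i] = 1                     # diagonal marks presence of i
        for i in P:
            for j in P:
                if i > j and abs(pos[i] - pos[j]) == 1:
                    R[i][j] = 1             # adjacent pair, ignoring orientation
        return R

    def hamming(A, B):
        return sum(a != b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

    P = [0, 7, 6, 5, 3, 4, 2, 8]
    Q = [0, 4, 3, 1, 5, 6, 8]
    print(hamming(r_prime(P, 7), r_prime(Q, 7)))   # 12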

Allowing one sequence to be a string

We relax the assumption that both sequences are permutations, and allow at most one of P, Q to be an arbitrary string of characters from the alphabet σ = {1, 2, ..., ℓ}. Denote these distances r″(P, Q) for reversals and indels, and t″(P, Q) for transpositions and indels. The following transformation is essentially the same as R′ and T, but counts the number of adjacent pairs and occurrences of each character, and we find the L1 distance between the transforms. Here ∗ is used to denote an extra row of the matrix used to keep frequency counts.

R″(P)[∗, i] = |{k | P[k] = i}|
R″(P)[i, j] = |{k | (i ≥ j) ∧ ((P[k] = i ∧ P[k+1] = j) ∨ (P[k] = j ∧ P[k+1] = i))}|

T″(P)[∗, i] = |{k | P[k] = i}|
T″(P)[i, j] = |{k | P[k] = i ∧ P[k+1] = j}|

If Q is the permutation, note that R″(Q) and T″(Q) will both be binary matrices under this modification.
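A sketch of the counting transform for a small made-up pair (P a string with one repeated symbol, Q a permutation over the same alphabet); halving the L1 difference brackets the transposition-with-indels distance as in Lemma A.1.3 below:

    from collections import Counter

    def t_double_prime(P):
        freq = Counter(P)                              # the extra '*' row
        adj = Counter(zip(P, P[1:]))                   # ordered adjacent pairs
        return freq, adj

    def l1(P, Q):
        fp, ap = t_double_prime(P)
        fq, aq = t_double_prime(Q)
        d = sum(abs(fp[k] - fq[k]) for k in set(fp) | set(fq))
        d += sum(abs(ap[k] - aq[k]) for k in set(ap) | set(aq))
        return d

    P = [0, 1, 2, 1, 3, 4]     # a string (symbol 1 repeated)
    Q = [0, 2, 1, 3, 4]        # a permutation
    print(l1(P, Q))            # 4; here one deletion turns P into Q, and 1 <= 4/2 <= 5/2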

Lemma A.1.3  r″(P, Q) ≤ ½||R″(P) − R″(Q)||_1 ≤ (5/2) r″(P, Q)
and  t″(P, Q) ≤ ½||T″(P) − T″(Q)||_1 ≤ (5/2) t″(P, Q)

Proof. We shall give the proof for reversals in detail; the proof for transpositions is mostly identical. We define a modified notion of reversal breakpoints, denoted φ″(P, Q), which counts the number of pairs (i, j) that are adjacent in P less the number of times this pair occurs in Q, and additionally counts half the difference in the number of times a symbol i occurs in P against the number of times it occurs in Q. We observe that this is defined so that φ″(P, Q) + φ″(Q, P) = ||R″(P) − R″(Q)||_1. We will use φ″ to approximate the distance between P and Q.

At the core of the transformation we want to use reversals only to transform P into Q. We consider an optimal transformation, and observe some facts about it. Firstly, we can ensure that we only delete items i that occur more times in P than in Q: suppose this were not the case. Then we must delete an item and then subsequently insert it. The net effect is to move the item to a new location. But this movement of a single symbol can be simulated using at most two reversals (one to move i, the second to restore the sequence). So there is an equivalent transformation that uses reversals only. A similar argument shows that we never need to insert an item and subsequently delete it, so we assume that we minimise the number of insertions and deletions. We can now reorder the operations so that all deletions occur first, then all reversals, then all insertions. This does not change the number of operations needed.

Let the sequence at the start of the reversals section be P′, and the sequence at the end of the reversals be Q′. P′ and Q′ are permutations of each other, and each symbol occurs at most once in P′. Let d(P, P′) denote the number of deletions used to transform P into P′, and let d(Q, Q′) be the number of deletions used to transform Q into Q′ (this is also the number of insertions used to transform Q′ into Q). By the above observation, d(P, P′) + d(Q, Q′) = ||R″(P)[∗] − R″(Q)[∗]||_1. We will consider the effect of performing the d(P, P′) deletes upon φ″. For any delete, we know that the number of breakpoints is reduced by at least one half: since we are removing a symbol i that occurs k times in P and at most once in Q, we have credit from R″(P)[∗, i] to do this.

On the other hand, a deletion could remove at most two and a half breakpoints — this occurs when we have the sequence abc in P and ac in Q: the breakpoints resulting from the pairs (a, b) and (b, c) are removed, as well as the half breakpoint from the removal of the extra b. A similar argument applies for insertions: every insertion removes at least half a breakpoint, symmetrically with deletions, and can remove at most one and a half breakpoints, when the correct symbol is inserted between its two neighbours in Q. As usual, reversals can remove at most two breakpoints, since they do not insert or delete symbols and affect only two adjacencies.

Next, we consider the central series of reversals to turn P′ into Q′. Observe that φ″(P, Q) ≤ (5/2)d(P, P′) + (3/2)d(Q, Q′) + 2r(P′, Q′), since this is the greatest effect each operation can have on the number of breakpoints. On the other hand, the number of (regular) breakpoints between P′ and Q′, φ(P′, Q′), is at most φ″(P, Q) − ½d(P, P′) − ½d(Q, Q′), since we know each deletion and insertion necessary to reach P′ and Q′ must remove at least half a breakpoint. We know that we can transform P′ into Q′ using at most φ″(P, Q) − ½d(Q, Q′) − ½d(P, P′) reversals: since P′ and Q′ are permutations of each other, the breakpoints that remain correspond to the "regular" breakpoints, and we can make the transformation using this many reversals (by Theorem 3.2.2). So r(P′, Q′) ≤ φ″(P, Q) − ½d(Q, Q′) − ½d(P, P′). Now, r″(P, Q) = d(P, P′) + r(P′, Q′) + d(Q, Q′), so

2r″(P, Q) ≤ 2(r(P′, Q′) + d(P, P′) + d(Q, Q′))
          ≤ φ″(P, Q) + φ″(Q, P)
          ≤ 3d(P, P′) + 2d(Q, Q′) + 2r(P′, Q′) + 3d(Q, Q′) + 2d(P, P′) + 2r(P′, Q′)
          ≤ 5(d(P, P′) + d(Q, Q′) + r(P′, Q′))
          = 5r″(P, Q)

The case for transpositions is identical, using breakpoints based on ordered pairs, because insertions and deletions have the same effect on these breakpoints. ✷

Bibliography

[Abi74] Walter Abish. Alphabetical Africa. New Directions, 1974.

[Abr87] K. Abrahamson. Generalized string matching. SIAM Journal on Computing, 16(6):1039–1051, 1987.

[ABR00] S. Alstrup, G. S. Brodal, and T. Rauhe. Pattern matching in dynamic texts. In Proceedings of the 11th Annual Symposium on Discrete Algorithms, pages 819–828, 2000.

[Ach01] D. Achlioptas. Database-friendly random projections. In Proceedings of Principles of Database Systems, pages 274–281, 2001.

[Ada79] D. N. Adams. The Hitch-hiker’s Guide to the Galaxy. Pan, 1979.

[AFS93] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In Proceedings of the 4th International Conference of Foundations of Data Organization and Algorithms, volume 730, pages 69–84. Lecture Notes in Computer Science, Springer, 1993.

[AGE94] K. A. S. Abdel-Ghaffar and A. El Abbadi. An optimal strategy for comparing file copies. IEEE Transactions on Parallel and Distributed Systems, 5(1):87–93, 1994.

[AHK01] C. Aggarwal, A. Hinneburg, and D. Keim. On the surprising behavior of distance metrics in high dimensional space. In Proceedings of the 8th International Conference on Database Theory, pages 420–434, 2001.

[ALP00] A. Amir, M. Lewenstein, and E. Porat. Faster algorithms for string matching with k-mismatches. In Proceedings of the 11th Annual Symposium on Discrete Algorithms, pages 794–803, 2000.

[AM91] R. J. Anderson and G. L. Miller. Deterministic parallel list ranking. Algorithmica, 6:859–868, 1991.

[AMS96] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, pages 20–29, 1996.

[AMS99] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. JCSS: Journal of Computer and System Sciences, 58:137–147, 1999.

[Bal87] J. Ball. Johnny Ball’s Second Thinks. Puffin Books, 1987.

[BCC+00] A. L. Buchsbaum, D. F. Caldwell, K. W. Church, G. S. Fowler, and S. Muthukrishnan. Engineering the compression of massive tables: an experimental approach. In Proceedings of the 11th Annual Symposium on Discrete Algorithms, pages 175–184, 2000.

[BCD95] D. T. Barnard, G. Clarke, and N. Duncan. Tree-to-tree correction for document trees. Technical Report 95-372, Department of Computing and Information Science, Queen's University, Ontario, Canada, 1995.

[BCFM98] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In Proceedings of the 30th Symposium on Theory of Computing, pages 327–336. ACM Press, 1998.

[BCH99] D. Belanger, K. Church, and A. Hume. Virtual data warehousing, data publishing, and call details. In Proceedings of Databases in Telecommunications, volume 1819, pages 106–117. Lecture Notes in Computer Science, Springer, 1999.

[BCL02] D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Physical Review Letters, 88(048702), 2002.

[BGR01] S. Babu, M. Garofalakis, and R. Rastogi. SPARTAN: a model-based semantic compression system for massive data tables. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 283–294, 2001.

[BHK01] P. Berman, S. Hannenhalli, and M. Karpinski. 1.375-approximation algorithm for sorting by reversals. Technical Report 2001-41, ECCCTR: Electronic Colloquium on Computational Complexity, 2001.

[BK98] P. Berman and M. Karpinski. On some tighter inapproximability results. Technical Report 85193-CS, Department of Computer Science, University of Bonn, 1998.

[BL91] D. Barbara and R. J. Lipton. A class of randomized strategies for low-cost comparison of file copies. IEEE Transactions on Parallel and Distributed Systems, 2(2):160–170, 1991.

[BMY01] D. A. Bader, B. M. E. Moret, and M. Yan. A linear-time algorithm for computing inversion distances between signed permutations with an experimental study. In Proceedings of the 7th Workshop on Algorithms and Data Structures, pages 365–376. Lecture Notes in Computer Science, Springer, 2001.

[Bou88] J. Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal Mathematics, 52:46–52, 1988.

[BP93] V. Bafna and P. A. Pevzner. Genome rearrangements and sorting by reversals. In Proceedings of the 34th Annual Symposium on Foundations of Computer Science, pages 148–157, 1993.

[BP98] V. Bafna and P. A. Pevzner. Sorting by transpositions. SIAM Journal on Discrete Mathematics, 11(2):224–240, 1998.

[Bro98] A. Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences (SEQUENCES’97), pages 21–29, 1998.

[BS98] H. Bunke and K. Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(2–3):255–259, 1998.

[BSW01] S. Babu, L. Subramanian, and J. Widom. A data stream management system for network traffic management. In Proceedings of Workshop on Network-Related Data Management, 2001.

[BYJK+02] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proceedings of RANDOM 2002, pages 1–10, 2002.

[Cap97] A. Caprara. Sorting by reversals is difficult. In Proceedings of the First International Conference on Computational Molecular Biology, pages 75–83, 1997.

[CCMN00] M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In Proceedings of the Nineteenth Symposium on Principles of Database Systems, pages 268–279, 2000.

[CD92] A. Conan-Doyle. The Redheaded League. The Adventures of Sherlock Holmes, 1892.

[CDIM02] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing data streams using Hamming norms. In Proceedings of 28th International Conference on Very Large Data Bases, pages 335–345, 2002.

[CGJS02] C. Cranor, L. Gao, T. Johnson, and O. Spatscheck. Gigascope: High performance network monitoring with an SQL interface. In Proceedings of the ACM SIGMOD International Conference on Management of Data, page 262, 2002.

[CGM97] S. S. Chawathe and H. Garcia-Molina. Meaningful change detection in structured data. In Proceedings of ACM SIGMOD International Conference on Management of Data, volume 26(2), pages 26–37, 1997.

[CH98] R. Cole and R. Hariharan. Approximate string matching: A simpler, faster algorithm. In Proceedings of the 9th Annual Symposium on Discrete Algorithms, pages 463–472, 1998.

[Cha99] S. S. Chawathe. Comparing hierarchical data in external memory. In Proceedings of the 25th International Conference on Very Large Databases, pages 90–101, 1999.

[Chr98a] D. A. Christie. A 3/2-approximation algorithm for sorting by reversals. In Proceedings of the 9th Annual Symposium on Discrete Algorithms, pages 244–252, 1998.

[Chr98b] D. A. Christie. Genome Rearrangement Problems. PhD thesis, University of Glasgow, UK, 1998.

[CIKM02] G. Cormode, P. Indyk, N. Koudas, and S. Muthukrishnan. Fast mining of tabular data via approximate distance computations. In Proceedings of the International Conference on Data Engineering, pages 605–616, 2002.

[CL99] E. Cohen and D. D. Lewis. Approximating matrix multiplication for pattern recognition tasks. Journal of Algorithms, 30(2):211–252, 1999.

[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.

[CM02] G. Cormode and S. Muthukrishnan. The string edit distance matching problem with moves. In Proceedings of the 13th Annual Symposium on Discrete Algorithms, pages 667–676, 2002.

[CMNR97] K. Cirino, S. Muthukrishnan, N. S. Narayanaswamy, and H. Ramesh. Graph editing to bipartite interval graphs: Exact and asymptotic bounds. Lecture Notes in Computer Science, 1346:37–53, 1997.

[CMŞ01] G. Cormode, S. Muthukrishnan, and S. C. Şahinalp. Permutation editing and matching via embeddings. In Proceedings of 28th International Colloquium on Automata, Languages and Programming, volume 2076, pages 481–492, 2001.

[Cor02] G. Cormode. Crossword set by mugwump. The Warwick Boar, February 2002.

[CPŞV00] G. Cormode, M. Paterson, S. C. Şahinalp, and U. Vishkin. Communication complexity of document exchange. In Proceedings of the 11th Symposium on Discrete Algorithms, pages 197–206, 2000.

[CR94] M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994.

[CRGMW96] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 493–504, 1996.

[CS96] T. Chen and S. Skiena. Sorting with fixed-length reversals. DAMATH: Discrete Applied Mathematics and Combinatorial Operations Research and Computer Science, 71:269–296, 1996.

[CV86] R. Cole and U. Vishkin. Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In Proceedings of the 18th Symposium on Theory of Computing, pages 206–219, 1986.

[Day] Daytona. AT&T research, Daytona Database Management System. Details at http://www.research.att.com/projects/daytona/.

[DG99] S. Dasgupta and A. Gupta. An elementary proof of the Johnson-Lindenstrauss lemma. Technical Report TR-99-006, International Computer Science Institute, Berkeley, 1999.

[DH98] M. Deza and T. Huang. Metrics on permutations, a survey. Journal of Combinatorics, Information and System Sciences, 23:173–185, 1998.

[DJK+98] B. DasGupta, T. Jiang, S. Kannan, M. Li, and E. Sweedyk. The complexity and approximation of syntenic distance. Discrete Applied Mathematics, 88:59–82, 1998.

[DJMS02] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structures or how to build a data quality browser. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 240–251, 2002.

[EEK+01] H. Eriksson, K. Eriksson, J. Karlander, L. Svensson, and J. Wästlund. Sorting a bridge hand. Discrete Mathematics, 241:289–301, 2001.

[EG81] S. Even and O. Goldreich. The minimum length generator sequence is NP-Hard. Journal of Algorithms, 2:311–313, 1981.

[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, page 226, 1996.

[Evf00] A. V. Evfimievski. A probabilistic algorithm for updating files over a communication link. Theoretical Computer Science, 233(1–2):191–199, 2000.

[Fal96] C. Faloutsos. Indexing Multimedia Databases. Kluwer, 1996.

[FGL+00] A. Feldmann, A. Greenberg, C. Lund, N. Reingold, and J. Rexford. Netscope: Traffic engineering for IP networks. IEEE Network Magazine, pages 11–19, 2000.

[Fiv70] The Jackson Five. ABC. Motown Records, 1970.

[FKSV99] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate L1-difference algorithm for massive data streams. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, pages 501–511, 1999.

[FM83] P. Flajolet and G. N. Martin. Probabilistic counting. In Proceedings of the 24th Annual Symposium on Foundations of Computer Science, pages 76–82, 1983.

[FM85] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for database applications. Journal of Computer and System Sciences, 31:182–209, 1985.

[FNS96] V. Ferretti, J. H. Nadeau, and D. Sankoff. Original synteny. In Proceedings of the 7th Annual Symposium Combinatorial Pattern Matching, volume 1075 of Lecture Notes in Computer Science, pages 159–167. Springer, 1996.

[FP74] M. Fischer and M. Paterson. String-matching and other products. In Proceedings of SIAM-AMS: Complexity of Computation, pages 113–125, 1974.

[FS00] J. Fong and M. Strauss. An approximate Lp-difference algorithm for massive data streams. In Proceedings of the Annual Symposium on Theoretical Aspects of Computer Science, pages 193–204, 2000.

[FY00] C. Faloutsos and B. Yi. Fast time sequence indexing for arbitrary Lp norms. In Proceedings of the 26th International Conference on Very Large Databases, pages 385–394, 2000.

[GG98] V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):170–231, 1998.

[GGP+99] L. A. Goldberg, P. W. Goldberg, M. Paterson, P. Pevzner, S. C. Şahinalp, and E. Sweedyk. The complexity of gene placement. In Proceedings of the 10th Annual Symposium on Discrete Algorithms, pages 386–395, 1999.

[GGR00] V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In Proceedings of the 16th International Conference on Data Engineering, pages 439–448, 2000.

[Gib85] A. Gibbons. Algorithmic Graph Theory. Cambridge University Press, 1985.

[Gib01] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proceedings of the 27th International Conference on Very Large Databases, pages 541–550, 2001.

[GIM99] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Databases, pages 518–529, 1999.

[GIV01] A. Goel, P. Indyk, and K. Varadarajan. Reductions among high dimensional proximity problems. In Proceedings of the 12th Annual Symposium on Discrete Algorithms, pages 769–778, 2001.

[GJ79] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.

[GKMS01a] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. QuickSAND: Quick summary and analysis of network data. Technical Report 2001-43, DIMACS, 2001.

[GKMS01b] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 27th International Conference on Very Large Data Bases, pages 79–88, 2001.

[GKPV01] M. Grossglauser, N. Koudas, Y. Park, and A. Varriot. Falcon: Fault management via alarm warehousing and mining. In Proceedings of the Workshop on Network-Related Data Management, 2001.

[GM99] P. Gibbons and Y. Matias. Synopsis structures for massive data sets. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, A, 1999.

[Gon85] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38(2-3):293–306, 1985.

[GP79] W. H. Gates and C. H. Papadimitriou. Bounds for sorting by prefix reversals. Discrete Mathematics, 27:47–57, 1979.

[GPS87] A. Goldberg, S. Plotkin, and G. Shannon. Parallel symmetry-breaking in sparse graphs. In Proceedings of the 19th Symposium on Theory of Computing, pages 315–324, 1987.

[GPS99] Q. Gu, S. Peng, and H. Sudborough. A 2-approximation algorithm for genome rearrangements by reversals and transpositions. Theoretical Computer Science, 210(2):327–339, 1999.

[GRS98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 73–84, 1998.

[GT01] P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proceedings of the 13th ACM Symposium on Parallel Algorithms and Architectures, pages 281–290, 2001.

[Gus97] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.

[Ham80] R. W. Hamming. Coding and Information Theory. Prentice-Hall, 1980.

[HNSS95] P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proceedings of the 21st International Conference on Very Large Databases, pages 311–322, 1995.

[HP95] S. Hannenhalli and P. A. Pevzner. Transforming men into mice (polynomial algorithm for genomic distance problem). In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 581–592, 1995.

[HRR98] M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report SRC 1998-011, DEC Systems Research Centre, 1998.

[HS77] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM, 20(5):350–353, 1977.

[HV00] L. S. Heath and J. P. C. Vergara. Sorting by short block-moves. Algorithmica, 28(3):323–354, 2000.

[IKM00] P. Indyk, N. Koudas, and S. Muthukrishnan. Identifying representative trends in massive time series data sets using sketches. In Proceedings of the 26th International Conference on Very Large Databases, pages 363–372, 2000.

[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Symposium on Theory of Computing, pages 604–613, 1998.

[Ind98] P. Indyk. On approximate nearest neighbors in non-Euclidean spaces. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science, pages 148–155, 1998.

[Ind00] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pages 189–197, 2000.

[Ind01] P. Indyk. Algorithmic aspects of geometric embeddings (invited tutorial). In Proceedings of the 42nd Annual Symposium on Foundations of Computer Science, pages 10–35, 2001.

[Jac73] T. Jacks. Seasons in the Sun, translated from ‘My Death’ by Jacques Brel. Goldfish, 1973.

[JD88] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.

[Jer85] M. R. Jerrum. The complexity of finding minimum-length generator sequences. Theoretical Computer Science, 36(2-3):265–289, 1985.

[JL84] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[Kar93] H. Karloff. Fast algorithms for approximately counting mismatches. Information Processing Letters, 48(2):53–60, 1993.

[KMR72] R. M. Karp, R. E. Miller, and A. L. Rosenberg. Rapid identification of repeated patterns in strings, trees and arrays. In Proceedings of the 4th Symposium on Theory of Computing, pages 125–136, 1972.

[KN97] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.

[Knu98] D. E. Knuth. The Art of Computer Programming, Vol 3, Sorting and Searching. Addison-Wesley, 2nd edition, 1998.

[KOR98] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proceedings of the 30th Symposium on Theory of Computing, pages 614–623, 1998.

[KR87] R. M. Karp and M. O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

[KR95] J. D. Kececioglu and R. Ravi. Of mice and men: Algorithms for evolutionary distances between genomes with translocation. In Proceedings of the 6th Annual Symposium on Discrete Algorithms, pages 604–613. ACM Press, 1995.

[KS92] B. Kalyanasundaram and G. Schnitger. The probabilistic communication complexity of set intersection. SIAM Journal on Discrete Mathematics, 5(4):545–557, 1992.

[KS95] J. Kececioglu and D. Sankoff. Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement. Algorithmica, 13(1/2):180–210, 1995.

[KSS99] H. Kaplan, M. Strauss, and M. Szegedy. Just the fax — differentiating voice and fax phone lines using call billing data. In Proceedings of the 10th Annual Symposium on Discrete Algorithms, pages 935–936, 1999.

[LCL+03] M. Li, X. Chen, X. Li, B. Ma, and P. Vitányi. The similarity metric. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 863–872, 2003.

[Lev66] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707–710, 1966.

[LT97] D. Lopresti and A. Tomkins. Block edit models for approximate string matching. Theoretical Computer Science, 181(1):159–179, 1997.

[LV86] G. M. Landau and U. Vishkin. Introducing efficient parallelism into approximate string matching and a new serial algorithm. In Proceedings of the 18th Symposium on Theory of Computing, pages 220–230, 1986.

[Mad89] T. Madej. An application of group testing to the file comparison problem. In Proceedings of the 9th International Conference on Distributed Computing Systems, pages 237–245. IEEE, 1989.

[Mat] J. Matoušek. Embedding finite metric spaces into Euclidean spaces. Available from http://www.ms.mff.cuni.cz/acad/kam/matousek/dg.html.

[Mat02] J. Matoušek. Workshop in discrete metric spaces and their algorithmic applications — open problems. Available from http://kam.mff.cuni.cz/∼matousek/haifaop.ps, 2002.

[McC76] E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, 1976.

[Met83] J. J. Metzner. A parity structure for large remotely located replicated data files. IEEE Transactions on Computers, 32(8):727–730, 1983.

[Met91] J. J. Metzner. Efficient replicated remote file comparison. IEEE Transactions on Computers, 40(5):651–659, 1991.

[MF02] S. Madden and M. J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In Proceedings of the 18th International Conference on Data Engineering, pages 555–566, 2002.

[MP80] W. J. Masek and M. S. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20:18–31, 1980.

[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[MS77] F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-Correcting Codes, volume 16 of North-Holland Mathematical Library. North-Holland, 1977.

[MŞ99] Y. Matias and S. C. Şahinalp. On the optimality of parsing in dynamic dictionary based data compression. In Proceedings of the 10th Annual Symposium on Discrete Algorithms, pages 943–944, 1999.

[MŞ00] S. Muthukrishnan and S. C. Şahinalp. Approximate nearest neighbors and sequence comparison with block operations. In Proceedings of the 32nd Symposium on Theory of Computing, pages 416–424, 2000.

[MŞ02] S. Muthukrishnan and S. C. Şahinalp. Simple and practical sequence nearest neighbors with block operations. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science, pages 262–278, 2002.

[MSU97] K. Mehlhorn, R. Sundar, and C. Uhrig. Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica, 17(2):183–198, 1997.

[MTZ01] Y. Minsky, A. Trachtenberg, and R. Zippel. Set reconciliation with nearly optimal communication complexity. In Proceedings of the IEEE International Symposium on Information Theory, page 232, 2001.

[MVS01] D. Moore, G. M. Voelker, and S. Savage. Inferring internet denial of service activity. In Proceedings of the Usenix Security Symposium, pages 9–22, 2001.

[Mye86] E. W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1:251–256, 1986.

[Nav01] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.

[Net] Cisco NetFlow. More details at http://www.cisco.com/warp/public/732/Tech/netflow/.

[NH94] R. T. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th International Conference on Very Large Databases, pages 144–155. Morgan Kaufmann, 1994.

[Nis92] N. Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12:449–461, 1992.

[NOA] NOAA. National Oceanic and Atmospheric Administration, U.S. National Weather Service. http://www.nws.noaa.gov/.

[Nol] J. Nolan. Stable distributions. Available from http://academic2.american.edu/∼jpnolan/stable/chap1.ps.

[NT84] J. H. Nadeau and B. A. Taylor. Lengths of chromosome segments conserved since divergence of man and mouse. Proceedings of the National Academy of Sciences of the USA, 81:814–818, 1984.

[OBCO00] G. Ozsoyoglu, N. H. Balkir, G. Cormode, and Z. M. Ozsoyoglu. Electronic books in digital libraries. In Proceedings of IEEE Advances in Digital Libraries (ADL), pages 5–14, 2000.

[Orl91] A. Orlitsky. Interactive communication: Balanced distributions, correlated files, and average-case complexity. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science, pages 228–238, 1991.

[PG86] K. F. Pang and A. El Gamal. Communication complexity of computing the Hamming distance. SIAM Journal on Computing, 15(4):932–947, 1986.

[PS85] F. P. Preparata and M. I. Shamos. Computational Geometry: An Introduction. Springer-Verlag, 2nd edition, 1985.

[PTVF92] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.

[PW72] W. W. Peterson and E. J. Weldon, Jr. Error-Correcting Codes. M.I.T. Press, 2nd edition, 1972.

[Raz92] A. A. Razborov. On the distributional complexity of disjointness. Theoretical Computer Science, 106(2):385–390, 1992.

[SBB90] T. Schwarz, R. W. Bowdidge, and W. A. Burkhard. Low-cost comparisons of file copies. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 196–202. IEEE, 1990.

[SK83] D. Sankoff and J. B. Kruskal. Time Warps, String Edits and Macromolecules: the Theory and Practice of Sequence Comparison. Addison Wesley, 1983.

[Sma] California PATH Smart-AHS. More details at http://path.berkeley.edu/SMART-AHS/index.html.

[SN96] D. Sankoff and J. Nadeau. Conserved synteny as a measure of genomic distance. DAMATH: Discrete Applied Mathematics and Combinatorial Operations Research and Computer Science, 71:247–257, 1996.

[SS02] D. Shapira and J. A. Storer. Edit distance with move operations. In Proceedings of the 13th Symposium on Combinatorial Pattern Matching, volume 2373 of Lecture Notes in Computer Science, pages 85–98, 2002.

[Sto88] J. A. Storer. Data Compression: Methods and Theory. Computer Science Press, 1988.

[SU98] J.-R. Sack and J. Urrutia, editors. Handbook on Computational Geometry. Elsevier Science, 1998.

[ŞV94] S. C. Şahinalp and U. Vishkin. Symmetry breaking for suffix tree construction. In Proceedings of the 26th Symposium on Theory of Computing, pages 300–309, 1994.

[ŞV96] S. C. Şahinalp and U. Vishkin. Efficient approximate and dynamic matching of patterns using a labeling paradigm. In Proceedings of the 37th Symposium on Foundations of Computer Science, pages 320–328, 1996.

[Tic84] W. F. Tichy. The string-to-string correction problem with block moves. ACM Transactions on Computer Systems, 2(4):309–321, 1984.

[TM96] A. Tridgell and P. Mackerras. The rsync algorithm. Technical Report TR-CS-96-05, Department of Computer Science, The Australian National University, 1996.

[Tri] A. Tridgell. rsync: File synchronisation utility. Available from http://rsync.samba.org.

[Tsa00] P. Tsaparas. Nearest neighbor search in multidimensional spaces. Technical Report 319-02, Department of Computer Science, University of Toronto, 2000.

[Ukk92] E. Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191–211, 1992.

[Uni] Unidata. Atmospheric data repository. http://www.unidata.ucar.edu/.

[V+01] J. C. Venter et al. The sequence of the human genome. Science, 291:1304–1351, 2001.

[Wag75] R. A. Wagner. On the complexity of the extended string-to-string correction problem. In Proceedings of the 7th ACM Symposium on the Theory of Computing, pages 218–223. ACM Press, 1975.

[WDM00] M. Walter, Z. Dias, and J. Meidanis. An approximate algorithm for transposition distance. In Proceedings of String Processing and Information Retrieval, pages 199–208, 2000.

[Wei73] P. Weiner. Linear pattern matching algorithms. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, pages 1–11, 1973.

[ZJ94] K. Zhang and T. Jiang. Some MAX SNP-hard results concerning unordered labeled trees. Information Processing Letters, 49(5):249–254, 1994.

[ZL77] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23:337–343, 1977.

[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. In Proceedings of ACM SIGMOD International Conference on Management of Data, pages 103–114, 1996.

[ZS89] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18:1245–1262, 1989.

[ZSS92] K. Zhang, R. Statman, and D. Shasha. On the editing distance between unordered labeled trees. Information Processing Letters, 42(3):133–139, 1992.
