Sequence Distance Embeddings

by Graham Cormode

Thesis submitted to the University of Warwick for the degree of Doctor of Philosophy

Computer Science
January 2003

Contents

List of Figures
Acknowledgments
Declarations
Abstract
Abbreviations

Chapter 1 Starting
  1.1 Sequence Distances
    1.1.1 Metrics
    1.1.2 Editing Distances
    1.1.3 Embeddings
  1.2 Sets and Vectors
    1.2.1 Set Difference and Set Union
    1.2.2 Symmetric Difference
    1.2.3 Intersection Size
    1.2.4 Vector Norms
  1.3 Permutations
    1.3.1 Reversal Distance
    1.3.2 Transposition Distance
    1.3.3 Swap Distance
    1.3.4 Permutation Edit Distance
    1.3.5 Reversals, Indels, Transpositions, Edits (RITE)
  1.4 Strings
    1.4.1 Hamming Distance
    1.4.2 Edit Distance
    1.4.3 Block Edit Distances
  1.5 Sequence Distance Problems
    1.5.1 Efficient Computation and Communication
    1.5.2 Approximate Pattern Matching
    1.5.3 Geometric Problems
    1.5.4 Approximate Neighbors
    1.5.5 Clustering for k-centers
  1.6 The Shape of Things to Come

Chapter 2 Sketching and Streaming
  2.1 Approximations and Estimates
    2.1.1 Sketch Model
    2.1.2 Streaming
    2.1.3 Equality Testing
  2.2 Vector Distances
    2.2.1 Johnson-Lindenstrauss lemma
    2.2.2 Frequency Moments
    2.2.3 L1 Streaming Algorithm
    2.2.4 Sketches using Stable Distributions
    2.2.5 Summary of Vector Lp Distance Algorithms
  2.3 Set Spaces and Vector Distances
    2.3.1 Symmetric Difference and Hamming Space
    2.3.2 Set Union and Distinct Elements
    2.3.3 Set Intersection Size
    2.3.4 Approximating Set Measures
  2.4 Geometric Problems
    2.4.1 Locality Sensitive Hash Functions
    2.4.2 Approximate Furthest Neighbors for Euclidean Distance
    2.4.3 Clustering for k-centers
  2.5 Discussion

Chapter 3 Searching Sequences
  3.1 Introduction
    3.1.1 Computational Biology Background
    3.1.2 Results
  3.2 Embeddings of Permutation Distances
    3.2.1 Swap Distance
    3.2.2 Reversal Distance
    3.2.3 Transposition Distance
    3.2.4 Permutation Edit Distance
    3.2.5 Hardness of Estimating Permutation Distances
    3.2.6 Extensions
  3.3 Applications of the Embeddings
    3.3.1 Sketching for Permutation Distances
    3.3.2 Approximating Pairwise Distances
    3.3.3 Approximate Nearest Neighbors and Clustering
    3.3.4 Approximate Pattern Matching with Permutations
  3.4 Discussion

Chapter 4 Strings and Substrings
  4.1 Introduction
  4.2 Embedding String Edit Distance with Moves into L1 Space
    4.2.1 Edit Sensitive Parsing
    4.2.2 Parsing of Different Metablocks
    4.2.3 Constructing ET(a)
  4.3 Properties of ESP
    4.3.1 Upper Bound Proof
    4.3.2 Lower Bound Proof
  4.4 Embedding for other block edit distances
    4.4.1 Compression Distance
    4.4.2 Unconstrained Deletes
    4.4.3 LZ Distance
    4.4.4 Q-gram distance
  4.5 Solving the Approximate Pattern Matching Problem for String Edit Distance with Moves
    4.5.1 Using the Pruning Lemma
    4.5.2 ESP subtrees
    4.5.3 Approximate Pattern Matching Algorithm
  4.6 Applications to Geometric Problems
    4.6.1 Approximate Nearest and Furthest Neighbors
    4.6.2 String Outliers
    4.6.3 Sketches in the Streaming model
    4.6.4 Approximate p-centers problem
    4.6.5 Dynamic Indexing
  4.7 Discussion

Chapter 5 Stables, Subtables and Streams
  5.1 Introduction
    5.1.1 Data Stream Comparison
    5.1.2 Tabular Data Comparison
  5.2 Sketch Computation
    5.2.1 Implementing Sketching Using Stable Distributions
    5.2.2 Median of Stable Distributions
    5.2.3 Faster Sketch Computation
    5.2.4 Implementation Issues
  5.3 Stream Based Experiments
  5.4 Experimental Results for Clustering
    5.4.1 Accuracy Measures
    5.4.2 Assessing Quality and Efficiency of Sketching
    5.4.3 Clustering Using Sketches
    5.4.4 Clustering Using Various Lp Norms
  5.5 Discussion

Chapter 6 Sending and Swapping
  6.1 Introduction
    6.1.1 Prior Work
    6.1.2 Results
  6.2 Bounds on communication
  6.3 Near Optimal Document Exchange
    6.3.1 Single Round Protocol
    6.3.2 Application to distances of interest
    6.3.3 Computational Cost
  6.4 Computationally Efficient Protocols for String Distances
    6.4.1 Hamming distance
    6.4.2 Edit Distance
    6.4.3 Tichy’s Distance
    6.4.4 LZ Distance
    6.4.5 Compression Distances and Edit Distance with Moves
    6.4.6 Compression Distance with Unconstrained Deletes
  6.5 Computationally Efficient Protocols for Permutation Distances
  6.6 Discussion

Chapter 7 Stopping
  7.1 Discussion
    7.1.1 Nature of Embeddings
    7.1.2 Permutations and Strings
  7.2 Extensions
    7.2.1 Trees
    7.2.2 Graphs
  7.3 Further Work

Appendix A Supplemental Section on Sequence Similarity
  A.1 Combined Permutation Distances
    A.1.1 Combining All Operations
    A.1.2 Transpositions, Insertions, Reversals, Edits and Deletions

Bibliography

List of Figures

  1.1 A map is an embedding of geographic data into the 2D plane with some loss of information
  1.2 Summary of the distances of interest
  2.1 Key features of the different methods described in Section 2.2.

Details

  • File Type
    pdf
  • Content Languages
    English
  • File Pages
    187 pages
