L Repository

University of Warwick institutional repository: http://go.warwick.ac.uk/wrap

A Thesis Submitted for the Degree of PhD at the University of Warwick
http://go.warwick.ac.uk/wrap/61310

This thesis is made available online and is protected by original copyright. Please scroll down to view the document itself. Please refer to the repository record for this item for information to help you to cite it. Our policy information is available from the repository home page.

Sequence Distance Embeddings
by Graham Cormode

Thesis Submitted to the University of Warwick for the degree of Doctor of Philosophy
Computer Science
January 2003

Contents

List of Figures
Acknowledgments
Declarations
Abstract
Abbreviations

Chapter 1 Starting
  1.1 Sequence Distances
    1.1.1 Metrics
    1.1.2 Editing Distances
    1.1.3 Embeddings
  1.2 Sets and Vectors
    1.2.1 Set Difference and Set Union
    1.2.2 Symmetric Difference
    1.2.3 Intersection Size
    1.2.4 Vector Norms
  1.3 Permutations
    1.3.1 Reversal Distance
    1.3.2 Transposition Distance
    1.3.3 Swap Distance
    1.3.4 Permutation Edit Distance
    1.3.5 Reversals, Indels, Transpositions, Edits (RITE)
  1.4 Strings
    1.4.1 Hamming Distance
    1.4.2 Edit Distance
    1.4.3 Block Edit Distances
  1.5 Sequence Distance Problems
    1.5.1 Efficient Computation and Communication
    1.5.2 Approximate Pattern Matching
    1.5.3 Geometric Problems
    1.5.4 Approximate Neighbors
    1.5.5 Clustering for k-centers
  1.6 The Shape of Things to Come

Chapter 2 Sketching and Streaming
  2.1 Approximations and Estimates
    2.1.1 Sketch Model
    2.1.2 Streaming
    2.1.3 Equality Testing
  2.2 Vector Distances
    2.2.1 Johnson-Lindenstrauss lemma
    2.2.2 Frequency Moments
    2.2.3 L1 Streaming Algorithm
    2.2.4 Sketches using Stable Distributions
    2.2.5 Summary of Vector Lp Distance Algorithms
  2.3 Set Spaces and Vector Distances
    2.3.1 Symmetric Difference and Hamming Space
    2.3.2 Set Union and Distinct Elements
    2.3.3 Set Intersection Size
    2.3.4 Approximating Set Measures
  2.4 Geometric Problems
    2.4.1 Locality Sensitive Hash Functions
    2.4.2 Approximate Furthest Neighbors for Euclidean Distance
    2.4.3 Clustering for k-centers
  2.5 Discussion

Chapter 3 Searching Sequences
  3.1 Introduction
    3.1.1 Computational Biology Background
    3.1.2 Results
  3.2 Embeddings of Permutation Distances
    3.2.1 Swap Distance
    3.2.2 Reversal Distance
    3.2.3 Transposition Distance
    3.2.4 Permutation Edit Distance
    3.2.5 Hardness of Estimating Permutation Distances
    3.2.6 Extensions
  3.3 Applications of the Embeddings
    3.3.1 Sketching for Permutation Distances
    3.3.2 Approximating Pairwise Distances
    3.3.3 Approximate Nearest Neighbors and Clustering
    3.3.4 Approximate Pattern Matching with Permutations
  3.4 Discussion

Chapter 4 Strings and Substrings
  4.1 Introduction
  4.2 Embedding String Edit Distance with Moves into L1 Space
    4.2.1 Edit Sensitive Parsing
    4.2.2 Parsing of Different Metablocks
    4.2.3 Constructing ET(a)
  4.3 Properties of ESP
    4.3.1 Upper Bound Proof
    4.3.2 Lower Bound Proof
  4.4 Embedding for other block edit distances
    4.4.1 Compression Distance
    4.4.2 Unconstrained Deletes
    4.4.3 LZ Distance
    4.4.4 Q-gram distance
  4.5 Solving the Approximate Pattern Matching Problem for String Edit Distance with Moves
    4.5.1 Using the Pruning Lemma
    4.5.2 ESP subtrees
    4.5.3 Approximate Pattern Matching Algorithm
  4.6 Applications to Geometric Problems
    4.6.1 Approximate Nearest and Furthest Neighbors
    4.6.2 String Outliers
    4.6.3 Sketches in the Streaming model
    4.6.4 Approximate p-centers problem
    4.6.5 Dynamic Indexing
  4.7 Discussion

Chapter 5 Stables, Subtables and Streams
  5.1 Introduction
    5.1.1 Data Stream Comparison
    5.1.2 Tabular Data Comparison
  5.2 Sketch Computation
    5.2.1 Implementing Sketching Using Stable Distributions
    5.2.2 Median of Stable Distributions
    5.2.3 Faster Sketch Computation
    5.2.4 Implementation Issues
  5.3 Stream Based Experiments
  5.4 Experimental Results for Clustering
    5.4.1 Accuracy Measures
    5.4.2 Assessing Quality and Efficiency of Sketching
    5.4.3 Clustering Using Sketches
    5.4.4 Clustering Using Various Lp Norms
  5.5 Discussion

Chapter 6 Sending and Swapping
  6.1 Introduction
    6.1.1 Prior Work
    6.1.2 Results
  6.2 Bounds on communication
  6.3 Near Optimal Document Exchange
    6.3.1 Single Round Protocol
    6.3.2 Application to distances of interest
    6.3.3 Computational Cost
  6.4 Computationally Efficient Protocols for String Distances
    6.4.1 Hamming distance
    6.4.2 Edit Distance
    6.4.3 Tichy’s Distance
    6.4.4 LZ Distance
    6.4.5 Compression Distances and Edit Distance with Moves
    6.4.6 Compression Distance with Unconstrained Deletes
  6.5 Computationally Efficient Protocols for Permutation Distances
  6.6 Discussion

Chapter 7 Stopping
  7.1 Discussion
    7.1.1 Nature of Embeddings
    7.1.2 Permutations and Strings
  7.2 Extensions
    7.2.1 Trees
    7.2.2 Graphs
  7.3 Further Work

Appendix A Supplemental Section on Sequence Similarity
  A.1 Combined Permutation Distances

Details

  • File Type
    pdf
  • Content Languages
    English
  • File Pages
    188 pages
