High-Dimensional Data Indexing with Applications


by Michael Arthur Schuh

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

MONTANA STATE UNIVERSITY
Bozeman, Montana

July 2015

© COPYRIGHT by Michael Arthur Schuh 2015. All Rights Reserved.

DEDICATION

To my parents... Who taught me that above all else, sometimes you just need to have a little faith.

ACKNOWLEDGEMENTS

It was six years ago that I first met Dr. Rafal Angryk in his office in Bozeman, Montana, and I have been thankful ever since for the opportunity and privilege to work as one of his students. As an advisor, he motivated me to pursue my goals and maintained the honesty to critique me along the way. As a mentor, his wisdom to focus my energy towards (albeit still many) worthwhile pursuits, while driving me to finish each task to completion, has impacted my life in immeasurable ways beyond academics. I also owe a deep thank you to Dr. Petrus Martens for his research guidance and motivational enthusiasm towards science.

I am grateful to the many people who have helped me during my time at Montana State University. A special thanks to: Dr. Rocky Ross, Ms. Jeanette Radcliffe, Mr. Scott Dowdle, Ms. Shelly Shroyer, as well as former graduate student labmates: Dr. Tim Wylie, Dr. Karthik Ganesan Pillai, and Dr. Juan Banda, all of whom made my life a little bit easier and much more enjoyable. I would be remiss if I did not thank my undergraduate advisors Dr. Tom Naps and Dr. David Furcy for recognizing interests in computer science and pushing me towards graduate school.

I will be forever grateful to my parents, Arthur L. Schuh Jr. and Cheryl A. Schuh, for the countless opportunities they gave me to experience the world and better myself. Last but not least, I wish to sincerely thank all of my friends and extended family for their endless support and encouragement over the years. Nobody should go it alone.

Funding Acknowledgement

This work was supported in part by two NASA Grant Awards: 1) No. NNX09AB03G, and 2) No. NNX11AM13A.

TABLE OF CONTENTS

1. INTRODUCTION
  1.1 Motivation
  1.2 Overview
2. BACKGROUND
  2.1 Notation
  2.2 Preliminaries
    2.2.1 Metric Spaces
    2.2.2 Types of Search Queries
      2.2.2.1 Point
      2.2.2.2 Range
      2.2.2.3 Nearest Neighbor
  2.3 High-Dimensional Data
    2.3.1 Curse of Dimensionality
    2.3.2 Similarity Search
    2.3.3 Query Selectivity
  2.4 Related Index Techniques
    2.4.1 Sequential Search
    2.4.2 Low Dimensions
      2.4.2.1 B-tree Family
      2.4.2.2 R-tree Family
    2.4.3 Higher Dimensional Approaches
    2.4.4 Distance-based Indexing
      2.4.4.1 M-tree
      2.4.4.2 iDistance
      2.4.4.3 SIMP
    2.4.5 Approximation Methods
3. THE ID* INDEX
  3.1 Foundations
    3.1.1 Building the Index
      3.1.1.1 Pre-processing
    3.1.2 Querying the Index
      3.1.2.1 Alternative Types of Search
  3.2 Establishing a Benchmark
    3.2.1 Motivation
    3.2.2 Partitioning Strategies
    3.2.3 Experiments
      3.2.3.1 Preliminaries
      3.2.3.2 Space-based Strategies in Uniform Data
      3.2.3.3 The Transition to Clustered Data
      3.2.3.4 Reference Points: Moving from Clusters
      3.2.3.5 Reference Points: Quantity vs. Quality
      3.2.3.6 Results on Real Data
    3.2.4 Remarks
  3.3 Segmentation Extensions
    3.3.1 Motivation
    3.3.2 Overview
      3.3.2.1 Global
      3.3.2.2 Local
      3.3.2.3 Hybrid Indexing
    3.3.3 Experiments
      3.3.3.1 Preliminaries
      3.3.3.2 First Look: Extensions & Heuristics
      3.3.3.3 Investigating Cluster Density Effects
      3.3.3.4 Results on Real Data
    3.3.4 Extended Experiments
      3.3.4.1 Curse of Dimensionality
      3.3.4.2 High-dimensional Space
      3.3.4.3 Tightly-clustered Space
      3.3.4.4 Partition Tuning
    3.3.5 Remarks
  3.4 Algorithmic Improvements
    3.4.1 Motivation
    3.4.2 Theoretical Analysis
      3.4.2.1 Index Creation
      3.4.2.2 Index Retrieval
      3.4.2.3 Nearest Neighbor Collection
    3.4.3 Experiments
      3.4.3.1 Preliminaries
      3.4.3.2 Creation Costs
      3.4.3.3 Reference Point Quality
      3.4.3.4 Optimizations
      3.4.3.5 Real World Data
    3.4.4 Remarks
  3.5 Bucketing Extension
