High-Dimensional Data Indexing with Applications
Total Page:16
File Type:pdf, Size:1020Kb
HIGH-DIMENSIONAL DATA INDEXING WITH APPLICATIONS by Michael Arthur Schuh A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science MONTANA STATE UNIVERSITY Bozeman, Montana July 2015 c COPYRIGHT by Michael Arthur Schuh 2015 All Rights Reserved ii DEDICATION To my parents... Who taught me that above all else, sometimes you just need to have a little faith. iii ACKNOWLEDGEMENTS It was six years ago that I first met Dr. Rafal Angryk in his office in Bozeman, Montana, and I have been thankful ever since for the opportunity and privilege to work as one of his students. As an advisor, he motivated me to pursue my goals and maintained the honesty to critique me along the way. As a mentor, his wisdom to focus my energy towards (albeit still many) worthwhile pursuits, while driving me to finish each task to completion, has impacted my life in immeasurable ways beyond academics. I also owe a deep thank you to Dr. Petrus Martens for his research guidance and motivational enthusiasm towards science. I am grateful to the many people who have helped me during my time at Mon- tana State University. A special thanks to: Dr. Rocky Ross, Ms. Jeanette Radcliffe, Mr. Scott Dowdle, Ms. Shelly Shroyer, as well as former graduate student labmates: Dr. Tim Wylie, Dr. Karthik Ganesan Pillai, and Dr. Juan Banda, all of whom made my life a little bit easier and much more enjoyable. I would be remiss if I did not thank my undergraduate advisors Dr. Tom Naps and Dr. David Furcy for recognizing interests in computer science and pushing me towards graduate school. I will be forever grateful to my parents, Arthur L. Schuh Jr. and Cheryl A. Schuh, for the countless opportunities they gave me to experience the world and better myself. Last but not least, I wish to sincerely thank all of my friends and extended family for their endless support and encouragement over the years. Nobody should go it alone. Funding Acknowledgement This work was supported in part by two NASA Grant Awards: 1) No. NNX09AB03G, and 2) No. NNX11AM13A. iv TABLE OF CONTENTS 1. INTRODUCTION ........................................................................................1 1.1 Motivation...............................................................................................2 1.2 Overview.................................................................................................3 2. BACKGROUND...........................................................................................5 2.1 Notation..................................................................................................5 2.2 Preliminaries ...........................................................................................6 2.2.1 Metric Spaces...................................................................................6 2.2.2 Types of Search Queries....................................................................8 2.2.2.1 Point ........................................................................................8 2.2.2.2 Range.......................................................................................9 2.2.2.3 Nearest Neighbor..................................................................... 10 2.3 High-Dimensional Data .......................................................................... 10 2.3.1 Curse of Dimensionality .................................................................. 11 2.3.2 Similarity Search ............................................................................ 12 2.3.3 Query Selectivity ............................................................................ 12 2.4 Related Index Techniques ....................................................................... 14 2.4.1 Sequential Search............................................................................ 15 2.4.2 Low Dimensions ............................................................................. 16 2.4.2.1 B-tree Family.......................................................................... 16 2.4.2.2 R-tree Family.......................................................................... 16 2.4.3 Higher Dimensional Approaches....................................................... 17 2.4.4 Distance-based Indexing.................................................................. 19 2.4.4.1 M-tree .................................................................................... 19 2.4.4.2 iDistance ................................................................................ 20 2.4.4.3 SIMP...................................................................................... 21 2.4.5 Approximation Methods.................................................................. 21 3. THE ID* INDEX........................................................................................23 3.1 Foundations........................................................................................... 24 3.1.1 Building the Index.......................................................................... 24 3.1.1.1 Pre-processing......................................................................... 25 3.1.2 Querying the Index......................................................................... 27 3.1.2.1 Alternative Types of Search ..................................................... 29 3.2 Establishing a Benchmark ...................................................................... 29 3.2.1 Motivation ..................................................................................... 30 3.2.2 Partitioning Strategies .................................................................... 30 v TABLE OF CONTENTS { CONTINUED 3.2.3 Experiments................................................................................... 32 3.2.3.1 Preliminaries........................................................................... 32 3.2.3.2 Space-based Strategies in Uniform Data.................................... 34 3.2.3.3 The Transition to Clustered Data............................................. 36 3.2.3.4 Reference Points: Moving from Clusters.................................... 38 3.2.3.5 Reference Points: Quantity vs. Quality..................................... 39 3.2.3.6 Results on Real Data............................................................... 42 3.2.4 Remarks ........................................................................................ 43 3.3 Segmentation Extensions ........................................................................ 44 3.3.1 Motivation ..................................................................................... 45 3.3.2 Overview........................................................................................ 45 3.3.2.1 Global .................................................................................... 46 3.3.2.2 Local...................................................................................... 47 3.3.2.3 Hybrid Indexing ...................................................................... 50 3.3.3 Experiments................................................................................... 50 3.3.3.1 Preliminaries........................................................................... 50 3.3.3.2 First Look: Extensions & Heuristics ......................................... 51 3.3.3.3 Investigating Cluster Density Effects ........................................ 53 3.3.3.4 Results on Real Data............................................................... 54 3.3.4 Extended Experiments.................................................................... 55 3.3.4.1 Curse of Dimensionality........................................................... 56 3.3.4.2 High-dimensional Space ........................................................... 57 3.3.4.3 Tightly-clustered Space............................................................ 57 3.3.4.4 Partition Tuning ..................................................................... 58 3.3.5 Remarks ........................................................................................ 60 3.4 Algorithmic Improvements...................................................................... 61 3.4.1 Motivation ..................................................................................... 62 3.4.2 Theoretical Analysis ....................................................................... 62 3.4.2.1 Index Creation ........................................................................ 63 3.4.2.2 Index Retrieval........................................................................ 64 3.4.2.3 Nearest Neighbor Collection..................................................... 67 3.4.3 Experiments................................................................................... 69 3.4.3.1 Preliminaries........................................................................... 69 3.4.3.2 Creation Costs ........................................................................ 70 3.4.3.3 Reference Point Quality........................................................... 72 3.4.3.4 Optimizations ......................................................................... 74 3.4.3.5 Real World Data ..................................................................... 76 3.4.4 Remarks ........................................................................................ 79 3.5 Bucketing Extension............................................................................... 80 vi TABLE OF CONTENTS