Algorithms for Analyzing Spatio-Temporal Data

by

Abhinandan Nath

Department of Computer Science
Duke University

Ph.D. Dissertation
2018

Copyright © 2018 by Abhinandan Nath
All rights reserved except the rights granted by the Creative Commons Attribution-Noncommercial Licence

Abstract

In today’s age, huge data sets are becoming ubiquitous. In addition to their size, these data sets are often noisy, have outliers, and are incomplete. Hence, analyzing such data is challenging. We look at applying geometric techniques to tackle some of these challenges, with an emphasis on designing provably efficient algorithms. Our work takes two broad approaches – distributed algorithms, and concise descriptors for big data sets.

With the massive amounts of data available today, it is common to store and process data using multiple machines. Parallel programming frameworks such as MapReduce and its variants are becoming popular for handling such large data. We present the first provably efficient algorithms to compute, store, and query data structures for range queries and approximate nearest neighbor queries in a popular parallel computing abstraction that captures the salient features of MapReduce and other massively parallel communication (MPC) models. Our algorithms are competitive in terms of running time and workload with their classical counterparts.

We propose parallel algorithms in the MPC model for processing large terrain elevation data (represented as a 3D point cloud) that is too big to fit on one machine. In particular, we present a simple randomized algorithm to compute the Delaunay triangulation of the xy-projections of the input points. Next we describe an efficient algorithm to compute the contour tree (a topological descriptor that succinctly encodes the contours of a terrain) of the resulting triangulated terrain.

We then look at comparing real-valued functions, by computing a distance function between their merge trees (a small-sized descriptor that succinctly captures the sublevel sets of a function). Merge trees are robust to noise in the data, and can be used effectively as proxies for the data sets themselves in some cases. We also use this distance to give an algorithm to compute the Gromov-Hausdorff distance, a natural way to measure the distance between two metric spaces. We give the first proof of hardness and the first non-trivial approximation algorithm for computing the Gromov-Hausdorff distance between metric spaces defined on trees, where the distance between two points is given by the length of the unique path between them in the tree.

Finally we look at the problem of capturing shared portions among a large number of input trajectories. We formulate this as a subtrajectory clustering problem – the clustering of subsequences of trajectories. We propose a new model for clustering subtrajectories. Each cluster of subtrajectories is represented as a pathlet, a sequence

of points that is not necessarily a subsequence of an input trajectory. We present a single objective function for finding the optimal collection of pathlets that best represents the trajectories, taking into account noise and other artifacts of the data. We show that the subtrajectory clustering problem is NP-hard and present fast approximation algorithms for it. We further improve the running time of our algorithm when the input trajectories are “well-behaved”. We also present experimental results on both real and synthetic data sets.

Dedicated to everyone who inspired me.

Contents

Abstract iii

List of Tables ix

List of Figures x

Acknowledgements xii

1 Introduction 1
  1.1 Broad Themes 2
  1.2 Our Contribution 3
  1.3 Prior Work 5

2 Querying Massive Data 9
  2.1 Introduction 9
    2.1.1 Our results 11
    2.1.2 Our techniques 13
    2.1.3 Related work 14
  2.2 MPC Primitives 14
    2.2.1 Range spaces and ε-approximations 14
    2.2.2 Geometric primitives and techniques 15
  2.3 Constructing kd-tree 19
    2.3.1 An MPC algorithm 19
    2.3.2 Query procedure 22
    2.3.3 Extension to simplex range searching 23
  2.4 NN Searching and BBD-tree 23
  2.5 Range Tree 26
    2.5.1 An MPC algorithm 26
    2.5.2 Query procedure 28
  2.6 Conclusion 29

3 Analyzing Massive Terrains 30
  3.1 Introduction 30
    3.1.1 Related work 31

    3.1.2 Our results 31
  3.2 Preliminaries 32
    3.2.1 Delaunay triangulations 32
    3.2.2 Terrains and contour trees 33
    3.2.3 ε-nets 35
  3.3 Computing Delaunay Triangulation 35
    3.3.1 Conflict lists and kernels 36
    3.3.2 Random sampling 37
    3.3.3 Algorithm 39
    3.3.4 Analysis 41
  3.4 Contour Tree 42
    3.4.1 Distributed contour trees 42
    3.4.2 Recursive construction 43
    3.4.3 Combining multiple contour trees 45
  3.5 Conclusion 47

4 Comparing Topological Descriptors and Metric Spaces 48
  4.1 Introduction 48
    4.1.1 Related work 49
    4.1.2 Our results 49
  4.2 Preliminaries 50
  4.3 Hardness of Approximation 53
  4.4 Gromov-Hausdorff and Interleaving Distances 56
  4.5 Computing the Interleaving Distance 60
  4.6 Conclusion 70

5 Subtrajectory Clustering 72
  5.1 Introduction 72
    5.1.1 Our contribution 73
    5.1.2 Related work 76
  5.2 The Clustering Model 77
  5.3 Hardness 78
  5.4 Pathlet-cover 79
    5.4.1 From pathlet-cover to set-cover 79
    5.4.2 Greedy algorithm 80
    5.4.3 Computing T_P 81
    5.4.4 83
    5.4.5 Analysis 83
  5.5 Subtrajectory Clustering 84
    5.5.1 Candidate pathlets 84
    5.5.2 Approximate distances 86
    5.5.3 Analysis 88
  5.6 Preprocessing 90

  5.7 Experiments 92
  5.8 Conclusion 98

6 Conclusion 99

Bibliography 101

Biography 113

List of Tables

2.1 Our results for building and querying distributed data structures 13

5.1 Speed-ups obtained using pathlet sparsification 96

List of Figures

2.1 The MPC model 10
2.2 Hierarchical partition induced by a kd-tree 13
2.3 The Partition primitive 16
2.4 Storing a kd-tree across multiple machines 20
2.5 Dividing a cell in a BBD-tree 24
2.6 A distributed range tree in 2 dimensions 28

3.1 The Delaunay triangulation and Voronoi diagram of a set of points 33
3.2 Terrain, contours, and the contour tree 33
3.3 The kernel and the conflict list of an edge of a Delaunay triangulation 36
3.4 Storing a contour tree across multiple machines 43
3.5 Triangles of a terrain intersecting a rectangle 45

4.1 Correspondence vs. bijection between metric spaces 49
4.2 Merge tree of a real-valued function 52
4.3 A map between two merge trees 52
4.4 Hardness reduction for the Gromov-Hausdorff distance 54
4.5 Nearest common ancestor of two points in a merge tree 59
4.6 Two merge trees 61
4.7 A merge tree illustration 62
4.8 Illustration for Lemma 44 64
4.9 A trivial approximation to the interleaving distance between merge trees 65
4.10 Removing short subtrees from a merge tree 66
4.11 Matching points of a merge tree 68
4.12 Computing the interleaving distance between merge trees using their matching points 68

5.1 Subtrajectory clusters capture shared movement patterns 73
5.2 Computing the maximum coverage-cost ratio for a subtrajectory cluster 82
5.3 Computing the distance between a pathlet and a subtrajectory 87
5.4 Left: interpolating sparse trajectories. Right: sparsifying candidate pathlets 91
5.5 The Beijing data set 93

5.6 The RTP data set 93
5.7 Gaps capture unique portions of a trajectory 94
5.8 Left: pathlets of the GeoLife data set. Right: a subtrajectory cluster and its pathlet 94
5.9 Left: reconstructing trajectories from pathlets. Right: measuring cluster quality 95
5.10 Pathlets capture shared portions among trajectories 95
5.11 Left: a self-intersecting trajectory from the Cycling data set. Right: the first two pathlets assigned to it 96
5.12 Left: the cumulative coverage of chosen pathlets with and without sparsification. Right: cumulative coverage plots for different values of c₃ 97
5.13 Plots showing the number of pathlets assigned per trajectory 97

Acknowledgements

This dissertation would not have been possible without help from a lot of people. First and foremost, I would like to thank my advisor Pankaj. More than anyone else, he has taught me what research is – how to read a paper, how to choose a problem to work on, and how to present my ideas to other people. Perhaps the most important lesson that I learned from him is that it is more important to properly formulate a problem than to actually solve it.

I would like to thank Kamesh for serving on my committee. He has been a long-time collaborator and a source of advice in our numerous meetings. Both he and Pankaj were extremely patient in answering all of my questions, and in slowing down their thought processes so that I could catch up.

Along with Pankaj, Kyle has been a constant collaborator on all my papers. It has been nothing short of inspiring to see his enthusiasm, work ethic, and dedication. I would also like to thank the other members of my committee – Rong and Yusu – for helpful discussions, comments, and feedback. Thanks are also due to my other wonderful collaborators – Yusu, Tasos, Jiangwei, and Erin.

Thanks to my fellow students – Aaron, Abhishek, Allen, Alex, Ergys, Ios, Jiangwei, Mayuresh, Stavros, Wuzhou – for making my time so fun and enjoyable. The awesome staff members at Duke CS – Marilyn, Pam, Alison, Kathleen, Joe, Jeff – made sure that we could focus entirely on our research without having to worry about anything else. I feel a deep sense of gratitude towards them. My friends outside of work – Aditi, Harish, Tushar, Seth, Preetish, and the members of my volleyball cohort – made sure that my sanity was maintained throughout these years.

Thanks to the Core Ranking team at A9, Palo Alto, where I spent the summer of 2017 as an intern, especially my manager Chris Severs and my mentor Nick Blumm. Thanks to the various funding sources that supported me over the past few years – Duke Computer Science, the Duke Graduate School, the National Science Foundation, the Army Research Office, and the US-Israel Binational Science Foundation.

Last but not least, I would like to thank my family – my parents and my younger brother – for all the sacrifices, and for always being there when I needed them the most.

1 Introduction

We are drowning in a world of data. To quote an article that appeared in 2014 [1]: “By 2020 the digital universe will contain nearly as many digital bits as there are stars in the universe. It is doubling in size every two years, and by 2020 the digital universe – the data we create and copy annually – will reach 44 trillion gigabytes.”

Data is being collected, knowingly or unknowingly, scrupulously or unscrupulously, from almost everywhere and all the time. Our cellphones are continuously sending location and other information to servers; websites track our activity constantly; sensors placed on the ocean floor are collecting and sending seismological data for analysis purposes [2]; weather satellites are continuously taking pressure and humidity measurements, and beaming them to weather stations for forecasting. The ethical and moral issues concerning data privacy and data abuse pose very important and interesting questions, but these are outside the purview of this dissertation.

Many of these data sets have a geometric flavor. For instance, terrain data obtained using sensing technologies such as LiDAR, and GPS traces of vehicles, consist of collections of points on the earth’s surface, and points are the most basic geometric object. Many problems in other domains can also be viewed through a geometric lens. For example, a SELECT query on a relational database can be viewed as asking which points lie inside a hyperrectangle in some high-dimensional space, an image can be mapped to a vector of pixel intensity values, and so on. The central theme of this thesis is to use geometric techniques to aid in the analysis of such data sets.

The data sets available today are huge, and have sizes that were inconceivable not very long ago. Many of these data sets also have a temporal aspect. For example, trajectories consist of locations along with timestamps; a video can be thought of as a sequence of images ordered temporally; physical processes are observed through measurements taken over time; and so on. In some cases, so much data is generated at such a rapid rate that it cannot be stored in its entirety, and we can only make one pass over the data stream. Often the data available is noisy, inaccurate, and incomplete, and contains

outliers. This may happen for multiple reasons, e.g., sensor equipment malfunction or lossy transmission of data. Such issues pose unique algorithmic challenges, and necessitate the development of new techniques for data analysis. This dissertation is mostly concerned with how to tackle some of the technical challenges associated with analyzing large spatio-temporal data.

A major focus of this thesis is to develop algorithms with provable performance guarantees. Huge input sizes mean that it is no longer enough to design polynomial-time algorithms; we need (near-)linear time algorithms to make them scalable (when possible). The presence of noise in the data requires that the algorithms be robust and handle noise in a principled manner. Furthermore, many of the problems encountered are computationally hard to solve exactly. Since the input data itself is often noisy and an approximation of the actual data points, it is often sufficient to settle for algorithms that provide approximate answers but are much faster.

1.1 Broad Themes

We briefly outline the two broad approaches taken in this dissertation to get a handle on big noisy data sets.

Distributed algorithms. Many data sets are so huge that they cannot fit on one machine and are distributed across multiple machines, with a communication network linking the machines together. Such data sets are processed in parallel using big data platforms such as MapReduce [70], where a variety of issues arise, ranging from communication latency to load balancing. Most algorithms for such platforms do not have any theoretical performance guarantees (see the survey [102] for a discussion of efficient algorithms for parallel query processing with provable guarantees on their performance). Designing provably efficient algorithms for such platforms demands renouncing traditional models of computation (such as the RAM model), and working with a model that abstracts away most of the implementation details of individual platforms yet captures their salient features. We look at how to build data structures to answer range and nearest neighbor queries, and at algorithms for analyzing massive terrain data, in a popular distributed model of computation.

Small-sized descriptors. For certain applications, maintaining the entire data set can be avoided; e.g., data sketching techniques such as Bloom filters and Count-Min sketches have been quite successful for summarizing sets and estimating item frequencies, respectively [3]. In our case, we construct small-sized descriptors so that certain geometric and topological analyses can be performed efficiently using these descriptors; e.g., certain kinds of queries can be answered more quickly using a contour tree (Section 3.2.2) than naively using the entire data set.

In many cases, we are interested in comparing two different data sets. One way to do this is to compute correspondences between the two. For instance, geometric point set matching in two and three dimensions is a well-studied family of problems with applications to areas such as computer vision [124], pattern recognition [63, 98], and computational chemistry [83, 126]. Each correspondence is associated with a cost function that measures its goodness; the goal is to compute the best correspondence. Such correspondences also reveal shared substructures between the two data sets. Comparing the data sets directly can be difficult in some cases; hence these descriptors can be used as proxies for the data sets themselves, and we compare the descriptors directly. Further, these descriptors are robust to noise in the data, and using them for comparison is a way to compare data sets while handling noise in a principled manner.

The previous paragraph discussed comparing two objects. What if we want to find commonalities among multiple (more than two) objects? One way to do this is by clustering. The primary goal of any clustering algorithm is to group similar objects together. In many cases, huge data sets tend to cluster into a small number of clusters. Uncovering the underlying clusters can reveal hidden structure in the data. Further, each cluster can sometimes be replaced by a representative that captures the entire cluster, thereby reducing the complexity of the data set. Such a representative can also lead to noise reduction, since it aggregates the noise present in each individual member of the cluster. Clustering can also help in finding outliers and detecting anomalies.

Clustering only tells us which objects are the most similar. What if we want a more fine-grained approach to discovering common patterns in data? For example, instead of looking at an object as a whole, we might want to look at sub-parts of it and compute similar sub-parts of other objects. This problem is different from clustering the objects in general, since two objects in different clusters may still share common portions. We specifically look at the problem of clustering subtrajectories of input trajectories.

1.2 Our Contribution

We describe our specific contributions briefly. This also serves as a guide to how the dissertation is organized.

1. We design algorithms for building and querying data structures to answer range and nearest-neighbor searching queries in cases where the data is stored across multiple machines (Chapter 2). Such queries form the bedrock of several kinds of data processing tasks. We use the Massively Parallel Communication (MPC) model of computation (see Section 2.1) to analyze the performance of our algorithms, and prove that our algorithms are near-optimal (Theorems 10, 15 and 17). Our algorithms are simple and based on partitioning the input

using a small-sized random sample. This chapter is based on joint work with Pankaj K. Agarwal, Kyle Fox and Kamesh Munagala [18].

2. We design MPC algorithms for two fundamental problems in terrain analysis – computing the Delaunay triangulation of a planar point set, and computing the contour tree of a triangulated terrain (Chapter 3). A triangulation of a planar point set is Delaunay if the circumcircle of each of its triangles contains no input point in its interior (a sketch of this empty-circumcircle test appears after this list). The contour tree is a data structure that succinctly encodes the contours of a real-valued function. We prove that our algorithm for the Delaunay triangulation is optimal (Theorem 28), and that our algorithm for the contour tree is optimal under certain reasonable assumptions on the input (Theorem 33). We use a divide-and-conquer approach that takes advantage of the underlying geometry to split the problem into constituent sub-problems, while reducing the interaction between neighboring sub-problems. This chapter is based on joint work with Pankaj K. Agarwal, Kyle Fox and Kamesh Munagala [125].

3. We explore the Gromov-Hausdorff (GH) distance between metric spaces (Chapter 4). The GH distance measures the smallest additive distortion that can be achieved while embedding one metric space into another. The problem is challenging because it is closely related to the graph isomorphism problem, which is not known to be either polynomial-time solvable or NP-complete. We prove hardness results (Theorem 36), and give approximation algorithms for computing the GH distance in the special case of tree metrics (Corollary 52). We reduce the problem to comparing merge trees of real-valued functions, another interesting problem in its own right. This chapter is based on joint work with Pankaj K. Agarwal, Kyle Fox, Anastasios Sidiropoulos and Yusu Wang [21].

4. We look at the problem of extracting common movement patterns from trajectory data (Chapter 5). We cast this as a subtrajectory clustering problem. We propose a new model for clustering subtrajectories, and provide hardness results (Theorems 53 and 54) and an approximation algorithm for the same (Theorem 56). We further show that our algorithm can be made much faster under some realistic input assumptions (Theorem 67). Our algorithms use a reduction to the set cover problem; however, the size of the set cover instance is exponential and hence it cannot be constructed explicitly. The main challenge is then to implement the greedy set cover algorithm efficiently without ever constructing the set cover instance explicitly. This chapter is based on joint work with Pankaj K. Agarwal, Kyle Fox, Kamesh Munagala, Jiangwei Pan and Erin Taylor [19].
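To make the empty-circumcircle condition from item 2 concrete, here is a minimal Python sketch (ours, not from the dissertation; the function names are ours, and a serious implementation would use exact arithmetic predicates rather than floating point):

    def ccw(a, b, c):
        # Twice the signed area of triangle (a, b, c); positive iff counterclockwise.
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

    def in_circumcircle(a, b, c, p):
        # Standard incircle determinant: for a counterclockwise triangle (a, b, c),
        # the determinant is positive iff p lies strictly inside the circumcircle.
        rows = []
        for q in (a, b, c):
            dx, dy = q[0] - p[0], q[1] - p[1]
            rows.append((dx, dy, dx * dx + dy * dy))
        (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
        return (a1 * (b2 * c3 - b3 * c2)
                - a2 * (b1 * c3 - b3 * c1)
                + a3 * (b1 * c2 - b2 * c1)) > 0

    def is_delaunay(points, triangles):
        # Brute-force check of the empty-circumcircle property; for testing only.
        for i, j, k in triangles:
            a, b, c = points[i], points[j], points[k]
            if ccw(a, b, c) < 0:      # orient counterclockwise
                b, c = c, b
            for t, p in enumerate(points):
                if t not in (i, j, k) and in_circumcircle(a, b, c, p):
                    return False
        return True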

1.3 Prior Work

We discuss previous work, and refer the reader to the respective later chapters for more details.

Big data platforms. The term “big data” has become ubiquitous to describe data that cannot be stored or processed on a single machine. The MapReduce platform [70], its open source implementation Hadoop [145], and related platforms (such as Pregel [114], Spark [148], and Google Cloud Dataflow [26]) have emerged as dominant computing platforms for big-data processing. At a high level, these systems focus on computation local to the data stored on individual machines, and have become popular due to their ability to abstract away the distributed nature of data storage and processing. The data itself is stored on disk in a distributed file system, for example, the Hadoop Distributed File System (HDFS) of the Hadoop platform [136]. A distributed file system can be treated as a cluster of machines, each with its own disk space where the data is stored. Spark [148] builds on MapReduce and attempts to keep data in memory to speed up machine learning applications that require multiple passes over the same data. Many of these platforms also incorporate streaming and real-time primitives. A high-level programming language such as Pig [4] is used to write a program that is pushed to the machines where the data is stored and processed.

Theoretical models for big data and parallel computation. Theoretical models of computation capture the salient features of real-world systems while abstracting away implementation details, thereby simplifying the process of designing and analyzing algorithms.

A common concern when designing data structures for massive amounts of data is the cost of reading and writing from disk, called an I/O operation. Aggarwal and Vitter [23] described the two-level I/O-memory model in an attempt to understand those costs. An alternative to the two-level I/O model is the cache-oblivious model introduced by Frigo et al. [84]. In essence, algorithms for the traditional RAM model are evaluated in terms of performance using the two-level I/O model with unknown internal memory size and unknown block size. The analysis assumes transfers between internal and external memory are done using an optimal caching strategy.

While the I/O models above are designed with the storage of massive data in mind, they are still inherently sequential. In contrast, the Parallel Random Access Machine (PRAM) model ignores I/O complexities and instead models parallel processors sharing a common memory pool. The PRAM model neglects issues such as synchronization and communication, and the cost of an algorithm is estimated using the overall running time as well as the sum of the running times of all processors.

The Bulk Synchronous Parallel (BSP) model [141] of Valiant models parallel processing, communication, and synchronization. A specialization of this model

to multiprocessors is termed coarse grained parallelism (CGP) [71]. This model is similar to MPC; there are p ≤ n^ε processors (where n is the input size and ε ≤ 0.5), each of which is allowed O(n/p) communication between rounds, and arbitrary sequential computation. The goal is to simultaneously optimize the number of rounds as well as the sequential computation done per processor.

More recently, many authors have focused on variations of the MapReduce model [105, 100, 87, 91, 41, 28]. All of these models involve multiple machines, with the caveat that no machine can hold all of the input at once. Computation proceeds in rounds, with each machine having access to only its own data in any round. The machines can communicate with each other between rounds. Briefly, let ε be such that 0 < ε < 1, and let n denote the input size. Karloff et al. [100] restrict both the number of machines and the memory on each machine to be O(n^{1−ε}), and they only consider the number of rounds taken by an algorithm as its measure of performance. Further, they restrict the total amount of data communicated between rounds to O(n^{2−2ε}). This model is used by Lattanzi et al. [105] to design certain graph-theoretic algorithms; however, they also take into account the total time taken by all the machines in each round when evaluating performance. The models of [87, 91, 41] are more or less similar – they assume that if there are p machines then each machine has space s = O(n/p); they are also interested only in the number of rounds of computation. Andoni et al. [28] further restrict the model of [41] so that s ≥ √n, the justification being that in practice the space on each machine is much larger than the number of machines.

There is extensive work in database systems on developing algorithms under MapReduce and its variants, e.g., algorithms for large graph processing, join operations, query processing, etc. See e.g. [132, 99, 107].

I/O-efficient and parallel geometric algorithms. Extensive work has been done on developing range-searching data structures in the I/O model [30] and the cache-oblivious model [31]. In addition, Agarwal et al. [13] describe I/O-optimal algorithms for constructing range and nearest-neighbor searching data structures. Many efficient geometric algorithms in the PRAM model have been described; see Goodrich [90] for a survey. Dehne et al. [71] present optimal algorithms for several geometric problems in the CGP model.

As far as practical implementations are concerned, there are several MapReduce implementations for analyzing and querying spatial and geometric data; see [77, 27, 79, 78, 12] and the references therein. For example, SpatialHadoop [79] is a full-fledged MapReduce framework that adapts traditional spatial index structures like the R-tree and R+-tree to form a two-level spatial index. It is also equipped with other operations, including range queries, k-nearest neighbors, and spatial joins.

Goodrich et al. [91] describe a connection between MapReduce and the BSP model mentioned above, which leads to efficient implementations of some geometric algorithms. Andoni et al. [28] develop MapReduce algorithms for approximating the minimum spanning tree and the earth-mover distance in a bounded doubling dimension metric space. There is some work on similarity search in high dimensions

in the distributed setting [36]. However, its focus is on reducing the amount of communication per round, and it is based on distributed locality-sensitive hashing.

Comparing metric spaces and topological descriptors. Most work on associating points between two metric spaces involves embedding a given high-dimensional metric space into an infinite host space of lower-dimensional metric spaces. However, there is some work on finding a bijection between points in two given finite metric spaces that minimizes (typically multiplicative) distortion of distances between points and their images, with some limited results on additive distortion.

Kenyon et al. [101] give an optimal algorithm for minimizing the multiplicative distortion of a bijection between two equal-sized finite metric spaces, and a parameterized polynomial-time algorithm that finds the optimal bijection between an arbitrary unweighted graph metric and a bounded-degree tree metric. Papadimitriou and Safra [129] show that it is NP-hard to approximate the multiplicative distortion of any bijection between two finite 3-dimensional point sets to within any additive constant or to a factor better than 3. Hall and Papadimitriou [94] discuss the additive distortion problem – given two equal-sized point sets S, T ⊂ R^d, find the smallest ∆ such that there exists a bijection f : S → T with d(x, y) − ∆ ≤ d(f(x), f(y)) ≤ d(x, y) + ∆. They show that it is NP-hard to approximate ∆ by a factor better than 3 in R³, and they also give a 2-approximation for R¹ and a 5-approximation for the more general problem of embedding an arbitrary metric space onto R¹.

The interleaving distance between merge trees [121] was proposed as a measure for comparing functions over topological domains that is stable under small perturbations of a function. The interleaving distance can be defined in more general settings, and is a popular distance measure in topological data analysis; see [43] for more recent results on its computational complexity. Distances for the more general Reeb graphs are given in [38, 69]. These concepts are related to the Gromov-Hausdorff distance between metric spaces [92].

The field of shape analysis involves data acquisition and shape modeling. Hence defining and computing meaningful notions of similarity between such digital models is very important. Such data is usually in the form of a point cloud endowed with a notion of distance between its points, and thus can be viewed as a metric space. In many cases, the objects are invariant under various deformations, and as such it is important to come up with an intrinsic distance measure between metric spaces that is independent of the ambient metric space in which the points reside; see [59, 48, 118, 119].

Discovering shared structure in trajectories and other sequence data. There has been a recent line of work on subtrajectory clustering [62, 93, 106, 137, 150, 50, 128, 134]. Apart from these, there is a fair bit of work on extracting common movement patterns. For example, given a set of trajectories and a query time interval, Pelekis

et al. [130, 131] cluster the subtrajectories having points lying in the query interval. This is equivalent to clustering the trajectories obtained by restricting each trajectory to the query interval. The algorithms in [82, 144, 149, 138] search for pre-defined patterns, such as groups or crowds, in trajectory data.

The work on multiple sequence alignment (MSA) in computational biology [127], on functional clustering in statistics [133], and on topic modeling/dictionary learning [44, 76] in machine learning and signal processing also focuses on identifying shared structures.

Finally, there is work on computing a mean or median trajectory as a representative of a given set of “similar” trajectories [51], computing basis trajectories to recover structure from motion [25], and segmenting individual trajectories using certain geometric criteria [53]. The work on trajectory clustering [85, 97, 146] focuses on partitioning a set of trajectories into clusters of similar trajectories, and possibly computing a representative trajectory for each cluster using the aforementioned work. This line of work, however, does not discover shared structures among otherwise dissimilar trajectories. There is some work on partial matching between a pair of trajectories, i.e., finding the most similar subtrajectories between two trajectories [52].

2 Querying Massive Data

2.1 Introduction

One of the important problems in databases, GIS, and computational geometry is to answer various queries on a set of points in a geometric space. One could conceivably scan the entire data to answer each query, but since several queries are answered on the same data, it is desirable to preprocess the data into a data structure so that a query can be answered quickly. A popular approach to cope with big data in the context of query processing is to work with a small summary of the data, such as random samples, coresets, sketches, etc. These methods are successful for answering aggregation queries, but they do not work well when queries involve analyzing the local structure of the data, such as nearest-neighbor queries or range-reporting queries (especially for small ranges). In such instances, one has to work with the entire data. This difficulty raises the problem of constructing data structures on big data platforms.

In this chapter, we study data structures for range-reporting and nearest-neighbor queries – two very popular queries on geometric data – on big data platforms. In particular, we develop efficient algorithms for constructing some of the classical data structures such as kd-trees, BBD-trees, and range trees.

Our model. To avoid dependence on the specifics of a particular platform, we present our algorithms in a simple but popular abstract model called the (basic) massively parallel communication (MPC) model, originally proposed by Beame et al. [41] (see also Andoni et al. [28]).

Let n be the size of an input instance, and let I be a set of m machines. For simplicity, we assume I to be {0, . . . , m − 1}. Set s = n/m, and for simplicity, assume s is an integer.¹ Each machine has O(s) memory (or space). As is standard,

¹ For simplicity, we will ignore floor and ceiling operators throughout this chapter and assume an appropriate choice of parameters so that the ratios are integers when needed.

we assume s ≥ n^α for some positive constant α < 1. Computation proceeds in rounds. In each round, each machine reads its input, does some computation, and emits some output, where each output item is marked with the ID of the machine for which it is input in the next round. The size of the input to and output from any machine is bounded by O(s) in each round. Communication across machines occurs only between rounds. See Figure 2.1.

Figure 2.1: The MPC model.

We explicitly allow data to persist on machines between rounds of computation and after all computation has been performed, as long as the total amount of data stored on each machine never exceeds O(s). By considering data storage as we do, we are able to build and store data structures for massive geometric data. The explicit persistence of data between rounds and our hard requirements on the space of each machine are the main differences between our model and the MPC model as described by Beame et al. [41]. Andoni et al. [28] also limit the space on each machine, but they do not explicitly consider persistent storage. In fact, some work in similar models (e.g., Goodrich et al. [91]) requires machines to communicate their own data to themselves in order for data to persist between rounds of computation.

Queries to our data structures behave as any other MPC computation. They simply take advantage of the distributed data structures already stored in the machines to reduce query time. When preprocessing a set of points P to build our data structures, we assume the points of P are distributed arbitrarily throughout the machines. Individual queries are sent to an arbitrary machine.

The efficiency of an algorithm is measured using three metrics: the number of rounds of computation R, the running time T, and the total work W. The first is obvious; we describe the latter two below. For machine β ∈ I, let t_{βr} denote the computation time spent by this machine in round r. The running time is defined as

    T = max_{β ∈ I} Σ_{r=1}^{R} t_{βr}.

Note that this definition does not count the time it takes to communicate between the machines, which is accounted for separately by the quantity R. The total work is defined as

    W = Σ_{r=1}^{R} Σ_{β ∈ I} t_{βr}.

In order to make guarantees about our total work, we assume a machine performs zero work in a single round of computation if it does not receive any input or communication at the beginning of that round. This assumption can be interpreted as each machine “sleeping” through a round of computation unless it is given a reason to wake up, and this makes it possible for W to be significantly smaller than m · T.

While most prior work on the MPC model focuses on minimizing the number of computation rounds R of an algorithm, e.g., [41, 100, 80], we must consider all three metrics in order to say anything interesting about geometric data structures. Consider processing a set of points P ⊆ R² so that one may report the points lying within an axis-aligned query rectangle. If the number of rounds were our only concern, we would build a trivial data structure in one round of computation by doing nothing. A query for a rectangle □_q would then be performed in one round of computation as well, by having each machine individually check which of its points lie inside □_q. However, queries will take O(s) time and require O(n) work. We could slightly improve our data structure by building kd-trees for each machine's set of points using one round, O(s log s) time, and O(n log n) total work (see Section 2.3). The runtime of a query would improve to O(√s) in the worst case, but queries would still require Ω(m) work in the best case and O(√s · m) = O(n/√s) work in the worst case. Unless s is nearly as large as n, these worst-case workloads are significantly higher than what is possible using classical sequential data structures. Therefore, we will focus on minimizing all three performance metrics.
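To make the three metrics concrete, the following toy simulator (our sketch; the bookkeeping interface is an assumption for illustration, not part of the model's definition) runs a sequence of rounds and tallies R, T, and W exactly as defined above, with a machine doing zero work in any round where it receives no input or communication:

    def run_mpc(num_machines, init_data, round_fns):
        # init_data: dict machine_id -> local input; round_fns: one function per
        # round, mapping (machine_id, data, inbox) to (time_spent, messages),
        # where messages is a dict dest_id -> payload.
        data = dict(init_data)
        time_spent = {b: 0 for b in range(num_machines)}
        inbox = {b: [] for b in range(num_machines)}
        for fn in round_fns:
            outbox = {b: [] for b in range(num_machines)}
            for b in range(num_machines):
                if not data.get(b) and not inbox[b]:
                    continue                  # machine "sleeps": zero work
                t, messages = fn(b, data, inbox[b])
                time_spent[b] += t
                for dest, payload in messages.items():
                    outbox[dest].append(payload)
            inbox = outbox                    # communication between rounds
        R = len(round_fns)                    # rounds
        T = max(time_spent.values())          # max over machines of total time
        W = sum(time_spent.values())          # sum of all machines' times
        return R, T, W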

2.1.1 Our results

Let P be a set of n points in R^d, where d is a constant. We present efficient algorithms for building and querying kd-trees, range trees, and BBD-trees on P. The first two are used for range queries and the last one is used for approximate nearest-neighbor (NN) queries. They are formally defined in later sections; see [68, 34] for details. A kd-tree uses linear space and O(n^{1−1/d} + k) query time to answer an orthogonal range query (i.e., reporting all k points of P lying in a query axis-parallel rectangle). It can be computed in O(n log n) time. In contrast, a range tree uses O(n log^{d−1} n) space and O(log^{d−1} n + k) query time; range trees with slightly smaller space are

also known [17]. A BBD-tree can answer an ε-approximate NN query in O(c_{d,ε} log n) time using O(n) space, where c_{d,ε} is a constant depending on d and ε.

For all of the data structures we consider, we will show a bound of R = polylog_s n = O(1) on the number of rounds of computation required both to build the data structure and to perform queries. Our running times T for building our data structures and performing queries will be comparable to the best sequential algorithms when performed on point sets of size s, and our total work W will be comparable to the best sequential algorithms for point sets of size n. More precisely, we obtain the following results; see Table 2.1 for a summary.

• A kd-tree on P can be built in O(1) rounds, O(s log s) time, and O(n log n) work, with probability at least 1 − 1/n^{Ω(1)}.² Queries can be answered using O(1) rounds and O(n^{1−1/d} + k) work, where k is the output size; the running time is O(s^{1−1/d} + k′), where k′ is the maximum number of points reported by a machine.

• A BBD-tree on P can be built in O(1) rounds, O(s log s) time, and O(n log n) work, with probability at least 1 − 1/n^{Ω(1)}. Queries can be answered in O(1) rounds, O((1/ε^d) log(1/ε) log n) time, and (1/ε)^{O(d)} log n work.

• A range tree on P can be built in O(1) rounds, O(s log^d s) time, and O(n log^d n) work.³ Queries can be answered in O(log^d n + k′) time and O(log^d n + k) work in O(1) rounds of computation, where k′ and k are the maximum number of points reported by a machine and the total number of points reported, respectively.

The algorithm for kd-trees can be extended to many variants of kd-trees such as hB-trees [113], hB^π-trees [81], and box trees [16]. It can also be extended to construct a partition tree [58] that is used for answering simplex range searching queries, i.e., reporting all points that lie inside a query simplex.

The algorithms for building kd-trees and BBD-trees use randomization. However, they can be derandomized while keeping the number of computation rounds constant by increasing the running time – we obtain a tradeoff between the number of rounds and the running time. Specifically, for a parameter β ≤ 1/5, the data structures can be constructed deterministically in O(1/β) rounds, s^{1+O(β)} time, and n^{1+O(β)} work.

We remark that there is extensive work on I/O-efficient data structures in databases. For example, the kdB-tree is an I/O-efficient version of a kd-tree [30]. Similarly, I/O-efficient range trees have been proposed [30]. Our algorithms can be modified easily to construct I/O-efficient versions of the data structures.

² The kd-tree we build is a slight variant of the standard kd-tree in that, unlike the latter, our tree is not exactly balanced in partitioning the points. Nevertheless, we show that the space, query, and preprocessing requirements are asymptotically identical.

³ In order to have sufficient space just to store the range tree, we slightly amend our model so that the memory and per-round input and output size of each machine is O(s log^{d−1} n).

Performance metric          kd-tree              BBD-tree     Range tree
-------------------------------------------------------------------------
Preprocessing  Rounds       O(1)                 O(1)         O(1)
               Time         O(s log s)           O(s log s)   O(s log^d s)
               Work         O(n log n)           O(n log n)   O(n log^d n)
-------------------------------------------------------------------------
Query          Rounds       O(1)                 O(1)         O(1)
               Time         O(s^{1−1/d} + k′)    O(log n)     O(log^d n + k′)
               Work         O(n^{1−1/d} + k)     O(log n)     O(log^d n + k)

Table 2.1: Our results for building and querying distributed data structures.


Figure 2.2: A kd-tree partitions space into a hierarchy of boxes. Each box is represented by a node in a tree, and nesting boxes are represented by ancestor-descendant relationships within the tree.

2.1.2 Our techniques

Our algorithms for constructing data structures depend upon the concept of ε-approximations, defined formally in Section 2.2.1. Intuitively, an ε-approximation is a small subset of points S ⊂ P such that any low-complexity region R of R^d contains about the same fraction of points of S as of P. One property common to most of the data structures we build is that they induce a hierarchical decomposition of R^d represented by a balanced tree. See Figure 2.2. In a constant number of rounds of computation, we sample an ε-approximation S of P. We choose S to be small enough to fit on a single machine, so we use a sequential algorithm to build the first few layers of the tree data structure using only points in S. The regions of R^d associated with the leaves of the partial tree partition S into approximately equal-sized subsets. Because S is an ε-approximation of P, these same regions partition P into almost equal-sized subsets as well. We recursively build the lower levels of the data structure on each piece of the partition.

Our algorithms are simple, and our analysis shows that a simple sampling-based technique gives essentially the best possible running time in the MPC model, without

resorting to additional features that some modern data processing techniques may enable. Sampling and the similar concept of filtering have been used before for the design of efficient algorithms for models based on MapReduce [80, 79, 105, 104]. However, these algorithms complete their work on a single machine after acquiring a small set of samples. In contrast, the partial trees our algorithms compute using their sample sets are not an entire solution; the algorithms must continue processing the remaining points in the input set, using the partial trees to aid in the remaining computation.

2.1.3 Related work

The I/O-optimal algorithms for constructing kd-trees and BBD-trees by Agarwal et al. [13] are not easily parallelizable. Although Dehne et al. [71] solve many geometric problems optimally in a model of computation similar to MPC, they use a simple deterministic partitioning of the point sets, and do not focus on the more challenging problem of constructing data structures. The geometric approximation algorithms of Andoni et al. [28] under MapReduce use a partitioning scheme based on randomly shifted grids, which is largely independent of the input point set. On the other hand, we use a small sample of the input points in a clever way to partition the points. The work on similarity search in high dimensions in the distributed setting [36] is based on distributed locality-sensitive hashing, whereas our approach is quite different and is tailored for low dimensions. The practical MapReduce implementations for analyzing and querying spatial and geometric data [77, 27, 79, 78, 12] do not have any provable bounds on their performance, which is the main focus of this chapter.

The rest of the chapter is organized as follows. We describe some primitive operations useful for building geometric data structures in Section 2.2. We discuss kd-trees and an extension of our techniques to partition trees [58] in Section 2.3. We discuss BBD-trees and range trees in Sections 2.4 and 2.5, respectively. Finally, we conclude in Section 2.6 by mentioning some directions for future research.

2.2 MPC Primitives

Before describing our MPC primitives, we introduce the notion of an ε-approximation, which will be constructed by one of the primitives and which will be used by many of our algorithms.

2.2.1 Range spaces and ε-approximations

A range space Σ is a pair (X, R), where X is a ground set and R is a family of subsets (ranges) of X. For example, X is a set of points in R² and

    R = {X ∩ □ | □ is a rectangle in R²}.

A subset X′ ⊆ X is shattered by Σ if {X′ ∩ R | R ∈ R} = 2^{X′}. The VC dimension of Σ is the size of the largest subset of X shattered by Σ. If there are arbitrarily large shattered subsets, the VC dimension of Σ is set to ∞. Given range spaces Σ₁ = (X, R₁) and Σ₂ = (X, R₂) with VC dimensions δ₁ and δ₂ respectively, the union of Σ₁ and Σ₂ is defined as

    (X, R) = (X, {R₁ ∪ R₂ | R₁ ∈ R₁, R₂ ∈ R₂}),

and has VC dimension O(δ₁ + δ₂). The complement of Σ is defined as (X, R̄) = (X, {X∖R | R ∈ R}) and has the same VC dimension as Σ. See Chazelle [60] for details.

Given a range space Σ = (X, R) and 0 ≤ ε ≤ 1, a subset X′ ⊆ X is called an ε-approximation of Σ if for any range R ∈ R, we have

    | |X′ ∩ R| / |X′| − |R| / |X| | ≤ ε.

The following bound on ε-approximations was proved by Li et al. [111].

Theorem 1 ([111]). There is a positive constant c such that if Σ = (X, R) is a range space with VC dimension δ, then a random subset of X of size (c/ε²)(δ + log(1/ψ)) is an ε-approximation of X with probability at least 1 − ψ.

For range spaces with constant VC dimension, a random subset of size O((1/ε²) log n) is an ε-approximation with probability at least 1 − 1/n^{Ω(1)}, where n is the size of the ground set.⁴

The ranges that we consider in this chapter are induced by axis-aligned boxes, simplices, and rectilinear regions defined by a constant number of rectangles. Since these regions can be formed by taking a Boolean combination of a constant number of halfspaces, the range spaces induced by these regions have constant VC dimension.
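As a quick illustration of Theorem 1, the following sketch (ours; the constant c, the choice δ = 4, and the restriction to rectangle ranges are assumptions made for the example) draws a random sample of the stated size and measures its discrepancy on one rectangle:

    import math
    import random

    def random_eps_approximation(X, eps, delta=4, psi=0.01, c=8):
        # Random subset of size (c / eps^2) * (delta + log(1 / psi)); by
        # Theorem 1 it is an eps-approximation with probability >= 1 - psi.
        size = min(len(X), int(c / eps**2 * (delta + math.log(1 / psi))) + 1)
        return random.sample(X, size)

    def discrepancy(X, S, rect):
        # | |S ∩ R| / |S| - |X ∩ R| / |X| | for an axis-aligned rectangle R.
        (x0, y0), (x1, y1) = rect
        inside = lambda p: x0 <= p[0] <= x1 and y0 <= p[1] <= y1
        return abs(sum(map(inside, S)) / len(S) - sum(map(inside, X)) / len(X))

    pts = [(random.random(), random.random()) for _ in range(100000)]
    S = random_eps_approximation(pts, eps=0.05)
    print(discrepancy(pts, S, ((0.2, 0.2), (0.7, 0.9))))  # typically <= 0.05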

2.2.2 Geometric primitives and techniques

We define a few primitive operations that we use to build our distributed data structures.

PrefixSum(P, I, rank, value): Given a set of points P stored on a contiguous subset of machines I = {i₀, i₀ + 1, . . .}, a rank function rank : P → Z⁺, and a value function value : P → R, compute the prefix sum of values for each point p ∈ P, where rank(p) is the rank of point p in some sorted order.

Broadcast(S, I, β): Given a set of words S, a contiguous set of machines I = {i₀, i₀ + 1, . . .}, and a machine β storing S, copy S to all machines in I. We require that S has size O(s^{1/2}).

⁴ Better bounds for the size of ε-approximations exist for range spaces with finite VC dimension [60], but the bounds given here suffice for our purposes.

Figure 2.3: Partition(Π, I, β) reorganizes the points stored on I as dictated by the partition Π.

Partition(P, I, Π, β): Given a spatial subdivision Π of R^d into simple regions (e.g., rectangles, simplices), reorganize the points of P so that all points lying in a cell π of Π lie on a distinct contiguous subset of machines I_π ⊆ I. See Figure 2.3. We assume I itself is also a contiguous subset of machines {i₀, i₀ + 1, . . .}. We require that |Π| = O(s^{1/2}), |Π| ≤ |I|, and each cell π of Π contains O(|P|/|Π|) points.

Sample(P, I, r, β): Given a set of points P stored on a contiguous subset of machines I = {i₀, i₀ + 1, . . .}, compute a (1/r)-approximation S ⊆ P of size O(r² log n) and send it to machine β. We require |S| = O(s^{1/2}).⁵

Before describing how to implement these primitives efficiently, we describe a procedure that will be used by these primitives.

To facilitate the transfer of information between a given contiguous subset of machines I = {i₀, i₀ + 1, . . .}, we construct a tree T, a complete s^{1/2}-ary tree with |I| leaves. Each node v of T is associated with a single machine β(v) ∈ I such that each machine is used at most once per level of T. We let r_T denote the root of T and set β(r_T) appropriately depending on the primitive. Let λ_T denote the depth of T. We have λ_T = O(log_{s^{1/2}} |I|) = O(1). Goodrich et al. [91] use a similar tree to compute prefix sums.

A number of simple operations can be completed in a constant number of rounds of computation guided by T, by passing information between machines belonging to parent and child nodes. For example, it is straightforward to compute the max and sum of several numbers stored across the machines. While sending information from parent to children, the parent node's machine sends O(s^{1/2}) data to each child node's machine. If information is sent from children to parent, each child's machine sends O(s^{1/2})-size data to the parent's machine. Since each node has

⁵ Strictly speaking, the algorithm for computing a (1/r)-approximation depends on the underlying range space. In our applications, the range space associated with P will be clear from the context and will have constant VC dimension.

s^{1/2} children, the input and output size for each machine is bounded by O(s) in one round.

We now describe an implementation of the above primitives.

PrefixSum(P, I, rank, value): This can be implemented in O(1) rounds with linear time and work using an algorithm of Goodrich et al. [91]. Without giving details here, we state the bounds:

Lemma 2. PrefixSum(P, I, rank, value) can be implemented by a deterministic algorithm in O(1) rounds, O(s) time, and O(n) total work.
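Since the details of [91] are omitted here, the following sketch shows only the standard two-pass pattern behind such a bound (our simplification: a single coordinator stands in for the tree T used elsewhere in this section):

    def distributed_prefix_sum(chunks):
        # chunks[i] holds the values on machine i, ordered by rank.
        # Pass 1: each machine computes its local total (local work O(s)).
        totals = [sum(chunk) for chunk in chunks]
        # Between rounds: machine i learns the total of machines 0..i-1.
        offsets, running = [], 0
        for t in totals:
            offsets.append(running)
            running += t
        # Pass 2: each machine offsets its local prefix sums (local work O(s)).
        result = []
        for off, chunk in zip(offsets, chunks):
            acc, out = off, []
            for v in chunk:
                acc += v
                out.append(acc)
            result.append(out)
        return result

    print(distributed_prefix_sum([[1, 2], [3, 4], [5]]))  # [[1, 3], [6, 10], [15]]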

Broadcast(S, I, β): We use the tree T over the machines in I. Our broadcast procedure takes exactly λ_T + 1 = O(1) rounds of computation. In round i, each machine β(v) for a node v at level i − 1 receives a copy of S. If it has not done so already, β(v) stores a copy of S. When the round of computation ends, β(v) sends a copy of S to each machine β(v′), where v′ is a child of v in T.

Lemma 3. Broadcast(S, I, β) can be implemented by a deterministic algorithm using O(1) rounds of computation, O(s) time, and O(n) total work.
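The round structure of this broadcast is easy to simulate; the sketch below (ours; machine IDs are assigned to tree nodes sequentially for simplicity, whereas the β(v) assignment in the text is more careful) computes the round in which each machine receives S in a complete f-ary tree with f = s^{1/2}:

    def broadcast_rounds(num_machines, fanout):
        # In round r, every machine at level r-1 forwards S to its (up to
        # `fanout`) children; depth is O(log_fanout m) = O(1) for fanout = s^(1/2).
        recv = {0: 0}                 # the source machine holds S at round 0
        level, r, nxt = [0], 0, 1
        while nxt < num_machines:
            r += 1
            next_level = []
            for _ in level:
                for _ in range(fanout):
                    if nxt >= num_machines:
                        break
                    recv[nxt] = r
                    next_level.append(nxt)
                    nxt += 1
            level = next_level
        return recv

    # broadcast_rounds(16, 4) -> machines 1-4 in round 1, machines 5-15 in round 2.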

Partition(P, I, Π, β): Order the cells of Π in an arbitrary way, and let {π₀, π₁, . . .} be the cells of Π. We let I_{π_k} = {i₀ + k(|I|/|Π|), . . . , i₀ + (k + 1)(|I|/|Π|) − 1}. We begin by running Broadcast(Π, I, β) to store Π on all the machines. In order to distribute the points of each partition cell π evenly across I_π, we use the PrefixSum primitive to count, for each machine-cell pair (i, k), the points that are either in cells {π₀, . . . , π_{k−1}} or in cell π_k and stored on machines {i₀, . . . , i − 1}. Using these counts, we can then quickly finish the partitioning procedure.

Each machine i ∈ I creates a set of |Π| counting points C_i to be given to the PrefixSum primitive. The point q_{i,k} ∈ C_i is associated with the cell π_k. Let rank(q_{i,k}) = k|I| + i − i₀ + 1, and let value(q_{i,k}) be equal to the number of points on machine i in cell π_k. Let C = ∪_{i∈I} C_i. We run PrefixSum(C, I, rank, value). The prefix sum of counting point q_{i,k} is the number of points in cell π_k stored on machines {i₀, . . . , i − 1} plus the number of points in cells {π₀, . . . , π_{k−1}}.

Now, let {p₀, p₁, . . .} ⊆ P be the set of points, in arbitrary order, belonging to some machine i and partition cell π_k. Let q_{|I|+1,k} = q_{|I|,k} + value(|I|, k). Each value q_{i+1,k} can be sent to machine i in one round of communication. Machine i then sends point p_j to machine i₀ + k(|I|/|Π|) + ⌊(q_{i+1,k} − q_{i,k} + j)/(|P|/|Π|)⌋.

Lemma 4. Partition(P, Π, I, β) can be implemented by a deterministic algorithm using O(1) rounds of computation, O(s log s) time, and O(n log n) total work.
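The final routing step amounts to placing each point by its global index within its cell; the sketch below is our paraphrase of that step (the exact offset bookkeeping in the formula above may differ slightly from this simplification):

    def partition_destination(i0, k, machines_per_cell, points_per_machine,
                              before_in_cell, j):
        # before_in_cell: points of cell k held on machines i0..i-1, obtained
        # from the PrefixSum primitive; the j-th point of the cell on machine i
        # then lands in slot pos of the machine block allotted to cell k.
        pos = before_in_cell + j
        return i0 + k * machines_per_cell + pos // points_per_machine

    # With blocks of 4 machines per cell and 100 points per machine, the point
    # of cell 2 with global in-cell index 250 goes to machine i0 + 2*4 + 2.
    print(partition_destination(0, 2, 4, 100, 250, 0))  # -> 10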

Sample(P, I, r, β): We first describe a simple Las Vegas algorithm for computing the sample S ⊆ P. Each machine i selects a subset of its points S_i by selecting points independently, each with probability p = c(r²/|P|) ln n for some constant c that depends on the VC dimension of the underlying range space. We then use the tree T to compute the total number of selected points in λ_T + 1 = O(1) rounds

of computation by summing the values |S_i| across the machines. If this number is more than 2cr² ln n or less than (c/2)r² ln n, an incorrect number of points have been selected and the procedure restarts. Otherwise, all the selected points from each machine are sent to β to form the set S. A standard Chernoff bound [120] guarantees that the probability of restarting even once is at most exp(−(c/8)r² ln n) ≤ 1/n² for large enough c. Each sample of size |S| is chosen with equal probability, so S is a (1/r)-approximation with probability at least 1 − 1/n² as well, by Theorem 1.

We can also choose S deterministically as follows. We use the tree T and let β(r_T) = β. Our sampling procedure still takes λ_T + 1 = O(1) rounds of computation. In round i, each machine β(v) for a node v at depth λ_T − i + 1 receives a set S_v of O(s) points from its children that may be used in the (1/r)-approximation. If v is a leaf of T, then β(v) simply uses the members of P initially found at β(v). Machine β(v) computes a (c/r)-approximation S′_v ⊆ S_v, for some sufficiently small constant c, of size O(r² log n) = O(s^{1/2}), using the deterministic algorithm of Matoušek [116] in O(s(r² log r)^δ) time. If v = r_T, then S′_v is the final (1/r)-approximation desired by the algorithm. When the round of computation ends, β(v) sends the set S′_v to β(p(v)), the machine corresponding to the parent of v.

Lemma 5. The above procedure computes a (1/r)-approximation of P.

Proof. Let P_v be the subset of points contained in the subtree of T rooted at v, and let λ_v be the depth of v. Fix a node v, and assume inductively that S′_{v′} is a (γ^{λ_{v′}−λ_T} c/r)-approximation of P_{v′} for each descendant v′ of v, for some constant γ. The children of v can be divided into three groups: the children where every descendant leaf has depth λ_T, children where every descendant leaf has depth λ_T − 1, and the solitary child with descendant leaves of both depths. Let V₁, V₂, and V₃ be these respective groups of child nodes. For any pair of children v₁, v₂ in a single group, there exist constants c′ and c″ such that c′|P_{v₂}| ≤ |P_{v₁}| ≤ c″|P_{v₂}| and c′|S′_{v₂}| ≤ |S′_{v₁}| ≤ c″|S′_{v₂}|. Let v′ be an arbitrary node of V_i, and let x_i = |S′_{v′}| and y_i = |P_{v′}|. Let R be an arbitrary range in our range space. We have

    | |S_v ∩ R|/|S_v| − |P_v ∩ R|/|P_v| |
      ≤ Σ_{i=1}^{3} | |(∪_{v′∈V_i} S′_{v′}) ∩ R| / |∪_{v′∈V_i} S′_{v′}| − |(∪_{v′∈V_i} P_{v′}) ∩ R| / |∪_{v′∈V_i} P_{v′}| |
      ≤ Σ_{i=1}^{3} | |(∪_{v′∈V_i} S′_{v′}) ∩ R| / (c′|V_i|x_i) − |(∪_{v′∈V_i} P_{v′}) ∩ R| / (c″|V_i|y_i) |
      ≤ c‴ Σ_{i=1}^{3} (1/|V_i|) Σ_{v′∈V_i} | |S′_{v′} ∩ R|/|S′_{v′}| − |P_{v′} ∩ R|/|P_{v′}| |
      ≤ γ^{λ_v−λ_T} c/(2r),

where c‴ is a constant.

Therefore, S_v is a (γ^{λ_v−λ_T} c/(2r))-approximation of P_v. The set S′_v is then a (γ^{λ_v−λ_T} c/r)-approximation [116, Observation 4.3]. T has depth O(1), so S′_{r_T} is a (1/r)-approximation of P_{r_T} = P when c is sufficiently small.⁶

Lemma 6. (i) Sample(P, I, r, β) can be implemented by a Las Vegas algorithm using O(1) rounds of computation, O(s) time, and O(|P|) total work, with probability at least 1 − 1/n^{Ω(1)}. (ii) The procedure can be implemented by a deterministic algorithm using O(1) rounds of computation, O(s(r² log r)^δ) time, and O(|P|(r² log r)^δ) total work.
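A compact sketch of the Las Vegas variant of Sample (ours; the constant c is illustrative, and the sketch assumes |P| is much larger than r² ln n so that the selection probability is below 1):

    import math
    import random

    def las_vegas_sample(machine_sets, r, n, c=16):
        # Each machine keeps each of its points independently with probability
        # p = c * r^2 * ln(n) / |P|; restart if the total count falls outside
        # [(c/2) r^2 ln n, 2c r^2 ln n]. A Chernoff bound makes restarts rare.
        total = sum(len(s) for s in machine_sets)
        target = c * r * r * math.log(n)
        p = min(1.0, target / total)
        while True:
            S = [q for s in machine_sets for q in s if random.random() < p]
            if target / 2 <= len(S) <= 2 * target:
                return S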

2.3 Constructing kd-tree

Given a set of n points P ⊂ R^d, a kd-tree on P can answer an orthogonal range-reporting query in O(n^{1−1/d} + k) time using O(n) space, where k is the number of points lying inside the query rectangle. Each node v of the tree is associated with a d-dimensional rectangle □_v, called the cell of v, with the root cell being large enough to contain the entire set P. Let P_v = P ∩ □_v. If |P_v| ≤ 1, v is a leaf and P_v is stored at v. If |P_v| > 1, then □_v is split into two cells □_w and □_z by an axis-parallel hyperplane h_v such that the interior of each cell contains at most |P_v|/2 points of P_v. If the depth of v is i, then h_v is parallel to the ((i mod d) + 1)-th axis. The cell □_w (resp. □_z) is associated with the child w (resp. z) of v. Since the partitions are balanced, the tree has height O(log n). The size of the tree is O(n). It is well known that the tree can be constructed in O(n log n) time by first sorting P along each axis and then constructing it level by level in a top-down manner, spending O(n) time at each level. See [68] for details.

Queries are performed recursively in a top-down manner starting with the root node. Given a query rectangle ρ and a node v of the tree, there are three cases: (a) □_v ⊆ ρ, in which case all points of P_v are reported; (b) □_v ∩ ρ = ∅, in which case there is nothing to be done; (c) □_v ∩ ∂ρ ≠ ∅, in which case we recurse on the children of v.
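For reference, here is a didactic sequential version of this construction and query (ours; it re-sorts at every level, so it runs in O(n log² n) time rather than the O(n log n) achievable by pre-sorting along each axis):

    def build_kdtree(points, depth=0):
        # Split on axis (depth mod d) at the median point, so each side holds
        # at most half the points; the height is O(log n).
        if len(points) <= 1:
            return {"leaf": True, "points": points}
        axis = depth % len(points[0])
        pts = sorted(points, key=lambda p: p[axis])
        mid = len(pts) // 2
        return {"leaf": False, "axis": axis, "split": pts[mid][axis],
                "left": build_kdtree(pts[:mid], depth + 1),
                "right": build_kdtree(pts[mid:], depth + 1)}

    def range_query(node, lo, hi, out):
        # Report all points p with lo <= p <= hi coordinate-wise.
        if node["leaf"]:
            out.extend(p for p in node["points"]
                       if all(l <= x <= h for x, l, h in zip(p, lo, hi)))
            return
        if lo[node["axis"]] <= node["split"]:
            range_query(node["left"], lo, hi, out)
        if hi[node["axis"]] >= node["split"]:
            range_query(node["right"], lo, hi, out)

    tree = build_kdtree([(1, 2), (3, 4), (5, 0), (2, 7)])
    out = []
    range_query(tree, (0, 0), (3, 5), out)
    print(sorted(out))  # [(1, 2), (3, 4)]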

2.3.1 An MPC algorithm

We now describe an algorithm for constructing the kd-tree, denoted by $T := T(P)$, of linear size that has $O(s^{1-1/d} + k)$ query time. Our algorithm uses only $O(1)$ rounds of computation to build the tree. $T$ is constructed and stored recursively in a distributed fashion across all machines. If $|P| = O(s)$, then $T$ is stored on a single machine and constructed using the sequential algorithm mentioned above. Otherwise we choose a parameter $r$, and the top subtree $\widetilde{T}$ of $T$ containing $\log_2 r$ levels and $\Theta(r)$ leaves is built and stored on one machine, say $\beta$. The leaves of $\widetilde{T}$ partition $P$ into subsets, each subset $P_i$ associated with a leaf $\ell_i$ and residing on its own consecutive set of machines $I_i$ (the sets of machines being pairwise disjoint).

Footnote 6: Our procedure and its proof require a fairly small constant $c$. One can increase the constant by guaranteeing that each set $P_{v'}$ has equal size and each set $S'_{v'}$ has equal size, by choosing $T$ carefully and adding $O(n)$ additional points. While possible, doing so would be very tedious in our model.

Figure 2.4: $\widetilde{T}$ is stored on a single machine. The leaves of $\widetilde{T}$ partition $P$ into smaller subsets, each represented by its own recursively constructed kd-tree. Shaded subtrees may be stored on a single machine; however, the space used on a single machine is still $O(s)$.

For each such set $P_i$, $T(P_i)$ is built and stored recursively using the machines in $I_i$. A machine may store multiple subtrees, one from each level of the recursion. However, the space used on a single machine is still $O(s)$. See Figure 2.4.

The key to our $O(1)$-round algorithm is the use of an $\varepsilon$-approximation to build the top subtree $\widetilde{T}$. While the sizes of the $P_i$'s, the subsets associated with the leaves of $\widetilde{T}$, will not be exactly the same as for the standard kd-tree, the use of the $\varepsilon$-approximation ensures that $|P_i| \le 2|P|/r$, which in turn guarantees that the height of $T$ is at most $\log_2 n + O(1)$.

We now describe the recursive procedure Build-kd-tree$(S, \Box, I)$, which, for a rectangle $\Box$, builds the tree $T(S)$ on $S = P \cap \Box$ using a contiguous subset $I$ of $\frac{|S|}{n}m$ machines, say $\{\beta, \beta+1, \ldots, \beta + \frac{|S|}{n}m - 1\}$. If $|I| = 1$, i.e., $|S| = O(s)$, then $T(S)$ is constructed on machine $\beta$ using the sequential algorithm. So assume $|I| > 1$. Set $r = s^{1/5}$. Let $\Sigma = (S, \mathcal{R})$ be the range space where the ranges are induced by rectangles, i.e., $\mathcal{R} = \{S \cap \Box \mid \Box \text{ is a rectangle}\}$. We compute a $(1/r)$-approximation $R$ of $\Sigma$ of size $cr^2\ln n$, for some constant $c > 0$, by calling the (randomized) procedure Sample$(S, r, \beta)$. We compute on machine $\beta$ the top subtree $\widetilde{T}$ of the standard kd-tree on $R$ so that $|R_v| \le cr\ln n$ for every leaf $v$ of $\widetilde{T}$; the height of $\widetilde{T}$ is $\log_2 r$. Let Partial-kd-tree$(R, r)$ denote this procedure. We assume this procedure returns the tree $\widetilde{T}$ as well as the partition $\Pi'$ of $\Box$ induced by the rectangles associated with the leaves of $\widetilde{T}$. By calling Partition$(S, I, \Pi', \beta)$, we partition $S$ so that for each rectangle $\Box_v$ of $\Pi'$, $S_v = S \cap \Box_v$ lies on a contiguous subset $I_v$ of machines of $I$, with $|I_v| = \frac{|S_v|}{n}\cdot m$. If $|S_v| > \frac{2|S|}{r}$ for some leaf $v$ of $\widetilde{T}$, we discard $R$ and $\widetilde{T}$ and repeat the above step. Otherwise, we recurse on $(S_v, \Box_v, I_v)$ for all leaves $v$ of $\widetilde{T}$. Algorithm 1 gives the pseudocode.

Algorithm 1 Build-kd-tree$(S, \Box, I)$
1: If $|I| = 1$, compute $T(S)$ sequentially.
2: $r \gets s^{1/5}$.
3: $R \gets$ Sample$(S, r, \beta)$.
4: $\widetilde{T}, \Pi' \gets$ Partial-kd-tree$(R, r)$.
5: Partition$(S, I, \Pi', \beta)$.
6: for all $\Box_v \in \Pi'$ in parallel do
7:   Build-kd-tree$(S_v, \Box_v, I_v)$
8: end for

Lemma 7. $\widetilde{T}$ has $\Theta(r)$ leaves. If $R$ is a $(1/r)$-approximation of $\Sigma$, then $|S_v| \le \frac{2|S|}{r}$ for all leaves of $\widetilde{T}$.

Proof. Since the depth of $\widetilde{T}$ is $\log_2 r$, it has $\Theta(r)$ leaves. For any leaf $v \in \widetilde{T}$, let $R_v = R \cap \Box_v$. If $R$ is a $(1/r)$-approximation of $\Sigma$, then

$$\left|\frac{|S_v|}{|S|} - \frac{|R_v|}{|R|}\right| \le \frac{1}{r}.$$

Therefore,

$$|S_v| \le |S|\left(\frac{1}{r} + \frac{|R_v|}{|R|}\right) \le |S|\left(\frac{1}{r} + \frac{cr\ln n}{cr^2\ln n}\right) = \frac{2|S|}{r}.$$

Since $R$ is a $(1/r)$-approximation with probability at least $1 - 1/n^{\Omega(1)}$, the algorithm succeeds in one attempt with probability at least $1 - 1/n^{\Omega(1)}$. The depth of the recursion is $O(\log_r n)$. Since $r = s^{1/5}$ and we assume $s = n^{\alpha}$ for some constant $\alpha > 0$, the depth of the recursion is $O(1)$. By Lemmas 4 and 6(i), the running time of the algorithm is $O(s\log s)$ and the total work performed is $O(n\log n)$, with probability at least $1 - 1/n^{\Omega(1)}$.

Lemma 8. The height of the tree constructed by the algorithm is at most $\log_2 n + O(1)$.

Proof. The base case, i.e., $|I| = 1$, constructs a subtree of height $\log_2 s$. Partial-kd-tree constructs a subtree of height at most $\log_2 r$. Since the depth of the recursion is at most $\log_{r/2}\left(\frac{n}{s}\right)$, the total height of the tree is at most

$$\log_2 s + \log_{r/2}\left(\frac{n}{s}\right)\cdot\log_2 r \le \log_2 s + \log_2\left(\frac{n}{s}\right)\left(1 + \frac{1}{\log_2 r - 1}\right) \le \log_2 n + \frac{\log_2(n/s)}{\log_2 r - 1} \le \log_2 n + \frac{(1-\alpha)\log_2 n}{(\alpha/5)\log_2 n - 1} = \log_2 n + O(1).$$

Alternatively, the deterministic version of Sample$(S, r, \beta)$ can be used to construct a $(1/r)$-approximation $R$. Since the deterministic procedure is more expensive, we choose $r = s^{\beta}$, where $\beta \le 1/5$ is a sufficiently small constant. The depth of the recursion remains $O(\log_r n) = O(1/\beta)$. By Lemma 6(ii), the running time of the algorithm is $s^{1+O(\beta)}$ and the total work performed by the algorithm is $n^{1+O(\beta)}$.

2.3.2 Query procedure

Given a query rectangle $\rho$ and the set of machines $I$ containing $T$, in the first round we perform the query locally on the machine $\beta$ containing the top subtree $\widetilde{T}$. This gives us a set of leaves of $\widetilde{T}$ whose cells intersect $\rho$. For each such leaf $v$, the query is performed recursively in parallel on the subtree rooted at $v$ contained in the machines $I_v$. The number of levels of recursion is the same as that of the construction algorithm, i.e., $O(1)$; this is also the number of rounds of computation required. The following lemma bounds the query procedure's work.

Lemma 9. The total work performed by the query procedure is $O(n^{1-1/d} + k)$, where $k$ is the number of points in the query rectangle.

Proof. For simplicity, we prove the lemma for $d = 2$; the proof is similar for higher values of $d$. Let $\rho$ be a query rectangle. The total work performed by the query procedure is $O(k)$ plus the number of nodes $v$ in $T$ such that $\partial\rho$ intersects $\Box_v$. Fix an edge $e$ of $\rho$. Since the splitting line alternates between being horizontal and vertical, $e$ intersects the cells associated with at most two grandchildren of a node $v$; see [68]. Let $\varphi(h)$ denote the number of nodes in a subtree of height $h$ whose cells intersect $e$. We obtain the following recurrence:

$$\varphi(h) \le 2\varphi(h-2) + 3.$$
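Unrolling the recurrence two levels at a time, with $\varphi(0), \varphi(1) = O(1)$, gives

$$\varphi(h) \le 2\varphi(h-2) + 3 \le 4\varphi(h-4) + 3(1 + 2) \le \cdots \le 2^{h/2}\varphi(0) + 3\left(2^{h/2} - 1\right) = O(2^{h/2}).$$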

Thus $\varphi(h) = O(2^{h/2})$. By Lemma 8, the height of $T$ is at most $\log_2 n + O(1)$. We obtain that $e$ intersects the cells associated with $O(\sqrt{n})$ nodes of $T$; hence $\partial\rho$ intersects $O(\sqrt{n})$ cells of $T$. This completes the proof of the lemma.

The time required by the query procedure can be similarly bounded by $O(s^{1-1/d} + k')$, where $k'$ is the maximum number of points reported by a single machine. We thus have the following.

Theorem 10. Let $P$ be a set of $n$ points in $\mathbb{R}^d$. A kd-tree on $P$ can be constructed in the MPC model so that an orthogonal range-reporting query can be answered using $O(1)$ rounds of computation and $O(n^{1-1/d} + k)$ work, where $k$ is the output size; the running time is $O(s^{1-1/d} + k')$, where $k'$ is the maximum number of points reported by a machine. The tree can be built in $O(1)$ rounds, $O(n\log n)$ work, and $O(s\log s)$ time with probability $1 - 1/n^{\Omega(1)}$. Alternatively, for a parameter $\beta \le 1/5$, it can be constructed deterministically in $O(1/\beta)$ rounds, $s^{1+O(\beta)}$ time, and $n^{1+O(\beta)}$ work.

2.3.3 Extension to simplex range searching

Given a set of $n$ points $P$ in $\mathbb{R}^d$, Chan [58] described a partition tree that can answer simplex range-reporting queries (i.e., reporting the points lying in a query simplex) in $O(n^{1-1/d} + k)$ time using $O(n)$ space. The expected time to construct the partition tree is $O(n\log n)$. Like kd-trees, partition trees represent a hierarchical decomposition of space into cells; however, each cell is a simplex. The number of points within a cell decreases by a constant factor at every level, so the tree has height $O(\log n)$. Any hyperplane intersects at most $O(n^{1-1/d})$ cells in the tree; hence a simplex range-reporting query can be answered in $O(n^{1-1/d} + k)$ time, where $k$ is the number of points reported. Our algorithm for constructing a kd-tree can be extended to build a partition tree in the MPC model in $O(1)$ rounds of computation. Omitting the details, we conclude the following:

Theorem 11. Let $P$ be a set of $n$ points in $\mathbb{R}^d$. A partition tree on $P$ can be constructed in the MPC model so that a simplex range-reporting query can be answered using $O(1)$ rounds of computation and $O(n^{1-1/d} + k)$ work, where $k$ is the output size; the running time is $O(s^{1-1/d} + k')$, where $k'$ is the maximum number of points reported by a machine. The tree can be built in $O(1)$ rounds in expected time $O(s\log s)$ and $O(n\log n)$ expected work. Alternatively, for a parameter $\beta \le 1/5$, it can be constructed deterministically in $O(1/\beta)$ rounds, $s^{1+O(\beta)}$ time, and $n^{1+O(\beta)}$ work.

2.4 NN Searching and BBD-tree

Given a set of points $P$ and a query point $q$ in $\mathbb{R}^d$, a $(1+\varepsilon)$-nearest neighbor ($\varepsilon$-NN) of $q$ is a point in $P$ whose distance from $q$ is within a factor of $(1+\varepsilon)$ of the distance between $q$ and its closest point in $P$. The goal is to process $P$ into a data structure so that for a query point $q$, an $\varepsilon$-NN of $q$ in $P$ can be reported quickly. Arya et al. [34] proposed the balanced-box decomposition (BBD) tree for answering $\varepsilon$-NN queries. This section describes how to build a BBD-tree in the MPC model.

Given a set $P$ of $n$ points in $\mathbb{R}^d$, the BBD-tree of $P$, denoted by $B(P) := B$, is a binary tree of height $O(\log n)$. A node $v \in B$ stores a cell $\Box_v$ and a representative point

$p_{\Box_v} \in P$ lying inside $\Box_v$. The cell $\Box_v$ is either a $d$-dimensional rectangle or the region between two nested rectangles. All rectangles have aspect ratio (the ratio between the longest and the shortest side) bounded by 3. The root cell is large enough to contain the entire set $P$. The cell $\Box_v$ (with possibly a hole inside) is split into two cells $\Box_w$ and $\Box_z$ in one of two ways:

Figure 2.5: Two ways to divide a cell (with one hole) into two cells in a BBD-tree.

(a) by an axis-parallel hyperplane not intersecting the hole (if any), or (b) by an axis-parallel rectangle containing the hole (if any).

See Figure 2.5. The cell $\Box_w$ (resp. $\Box_z$) is associated with the child $w$ (resp. $z$) of $v$. Like a kd-tree, a BBD-tree induces a hierarchical partition of $\mathbb{R}^d$. The size of a cell is the length of its longest side. Arya et al. show that the cell at each node can be split in such a way that the number of points inside the cells decreases by at least a factor of $2/3$ every 4 levels of $B$, and the size of the cells decreases by at least a factor of $2/3$ every $4d$ levels of the tree. Each leaf cell contains at most one input point.

The crucial observation is that the range space induced by the cells in a BBD-tree has constant VC-dimension (see Section 2.2.1), and we can adapt the kd-tree construction algorithm to build the BBD-tree. The only difference is that instead of building a kd-tree locally in the procedure Partial-kd-tree, we build a local BBD-tree. The tree is also stored in a manner similar to the kd-tree, i.e., $B(P)$ is stored in a distributed fashion across all machines.

Query procedure. We first describe a sequential query procedure that is slightly different from the one in [34], and then show how to implement it in the MPC model. Given a query point $q$, the query procedure proceeds top-down, level by level. At each step, it maintains an estimate $r_{curr}$ of the distance from $q$ to its nearest neighbor, and it stores the set of active nodes in a queue $Q$. Algorithm 2 summarizes the query procedure.

Lemma 12. The above query procedure returns an $\varepsilon$-NN of $q$.

Proof. Let $p^*$ be the actual nearest neighbor of $q$. We show that at the end of the procedure $r_{curr} \le (1+\varepsilon)\|qp^*\|$. Note that the value of $r_{curr}$ is non-increasing throughout the procedure. Let $v$ be the last node examined by the procedure such that $p^* \in \Box_v$. If $v$ is a leaf, then $r_{curr} = \|qp^*\|$. Otherwise, $v$ was discarded, in which case $d(q, \Box_v) \ge \frac{r_{curr}}{1+\varepsilon}$. However, $\|qp^*\| \ge d(q, \Box_v)$, and hence $r_{curr} \le (1+\varepsilon)\|qp^*\|$.

Algorithm 2 NN-Query$(q)$

1: $r_{curr} \gets \|q\,p_{root}\|$, $p_{curr} \gets p_{root}$.
2: $Q \gets \{root\}$.
3: while $Q \ne \emptyset$ do
4:   $v \gets$ Dequeue$(Q)$.
5:   if $\|qp_v\| < r_{curr}$ then
6:     $r_{curr} \gets \|qp_v\|$, $p_{curr} \gets p_v$.
7:   end if
8:   if $d(q, \Box_v) < \frac{r_{curr}}{1+\varepsilon}$ and $v$ is not a leaf then
9:     Enqueue$(w, Q)$ for each child $w$ of $v$.
10:  end if
11: end while
12: Return $p_{curr}$.
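A direct Python transliteration of Algorithm 2, assuming a hypothetical node type with fields rep (the representative point), lo and hi (the corners of the cell $\Box_v$), and children (empty for a leaf):

from collections import deque
import math

def dist_point_box(q, lo, hi):
    # Euclidean distance from q to the box [lo, hi]; 0 if q lies inside.
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def nn_query(root, q, eps):
    # Breadth-first traversal that prunes cells whose distance to q is
    # already at least r_curr / (1 + eps), exactly as in Algorithm 2.
    r_curr, p_curr = math.dist(q, root.rep), root.rep
    queue = deque([root])
    while queue:
        v = queue.popleft()
        d_rep = math.dist(q, v.rep)
        if d_rep < r_curr:
            r_curr, p_curr = d_rep, v.rep
        if dist_point_box(q, v.lo, v.hi) < r_curr / (1 + eps) and v.children:
            queue.extend(v.children)    # the cell may still contain an eps-NN
    return p_curr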

We use the following fact, proved as Lemma 4 in [34], to bound the number of cells examined at each level.

Fact 13. Given a BBD-tree for a set of data points in $\mathbb{R}^d$, the number of cells of size at least $\Delta > 0$ that intersect a ball of radius $r$ is at most $\lceil 1 + 6r/\Delta\rceil^d$.

The next lemma is then a simple variant of Lemma 5 in [34].

Lemma 14. The query procedure visits at most $\lceil 1 + 6d/\varepsilon\rceil^d$ cells at any level.

The MPC implementation of the query procedure is obtained by modifying Algorithm 2 as follows. Let $q$ be a query point, and let $I$ be the set of machines that store $B$. The first round runs Algorithm 2 on the machine $\beta$ containing the top subtree $\widetilde{B}$, which has $O(\log r)$ levels. When the procedure finishes traversing $\widetilde{B}$, by Lemma 14, $Q$ contains $O(\frac{1}{\varepsilon^d})$ leaves of $\widetilde{B}$. For each such leaf $v$, we pass the values $p_{curr}$ and $r_{curr}$ to the machines $I_v$ containing the subtree $B_v$ rooted at $v$, and Algorithm 2 is then run recursively in parallel on all $B_v$ for $v \in Q$. At the end, we return the point nearest to $q$ among all the points returned by these subtrees.

The query procedure performs $O(1)$ rounds of computation. The number of recursive subproblems increases by a factor of $O(\frac{1}{\varepsilon^d})$ at every level of recursion. The subproblems at each level of recursion run in parallel and take $O\left(\frac{1}{\varepsilon^d}\log\frac{1}{\varepsilon}\log n\right)$ time. Since the number of recursive subproblems is $\frac{1}{\varepsilon^{O(d)}}$, the total work done is $\frac{1}{\varepsilon^{O(d)}}\log n$. We thus have the following.

Theorem 15. Given a set $P$ of $n$ points in $\mathbb{R}^d$, a BBD-tree on $P$ can be built in the MPC model that can answer $(1+\varepsilon)$-nearest neighbor queries in $O(1)$ computation rounds, $O\left(\frac{1}{\varepsilon^d}\log\frac{1}{\varepsilon}\log n\right)$ time, and $\frac{1}{\varepsilon^{cd}}\log n$ work, for some constant $c \ge 2$. The tree can be built in $O(1)$ rounds of computation, $O(n\log n)$ work, and $O(s\log s)$ time with probability $1 - 1/n^{\Omega(1)}$. Alternatively, for a parameter $\beta \le 1/5$, it can be constructed deterministically in $O(1/\beta)$ rounds, $s^{1+O(\beta)}$ time, and $n^{1+O(\beta)}$ work.

2.5 Range Tree

Given a set $P$ of $n$ points in $\mathbb{R}^d$, a $d$-dimensional range tree on $P$, denoted by $T := T(P)$, can answer orthogonal range queries in $O(\log^{d-1} n + k)$ time using $O(n\log^{d-1} n)$ space, where $k$ is the number of points reported [68]. $T$ is defined recursively. For $d = 1$, $T$ is a sorted array or a balanced binary search tree. For $d > 1$, $T$ consists of a primary tree $T_0 := T_0(P)$, and each node of $T_0$ stores a $(d-1)$-dimensional range tree as a secondary structure. $T_0$ is a balanced binary search tree on the $x_1$-coordinates of $P$. Each node $v \in T_0$ stores an interval $I_v$. Let $P_v$ be the points whose $x_1$-coordinates lie in $I_v$ (the root has the set $P$ and the interval spanning all of $P$). The interval $I_v$ is split into $I_w$ and $I_z$ so that $|P_w|, |P_z| \le \lceil|P_v|/2\rceil$, where $w$ and $z$ are the children of $v$. Let $P_v^{\perp}$ denote the projection of $P_v$ onto the hyperplane $x_1 = 0$. Node $v$ also has a secondary data structure $T(P_v^{\perp})$, a $(d-1)$-dimensional range tree on $P_v^{\perp}$. A simple recursive argument shows that the size of $T$ is $O(n\log^{d-1} n)$, and that it can be constructed in $O(n\log^{d-1} n)$ time after sorting $P$ along each of its coordinates. Let $\rho = [a_1, b_1]\times\cdots\times[a_d, b_d]$ be a query rectangle, and let $\rho^{\perp} = [a_2, b_2]\times\cdots\times[a_d, b_d]$.

Let $w_a$ (resp. $w_b$) be the leaf of the primary tree such that $a_1 \in I_{w_a}$ (resp. $b_1 \in I_{w_b}$), and let $v^*$ be the lowest common ancestor of $w_a$ and $w_b$. Let $V_\rho$ be the set of nodes $v$ such that either $v$ is the right child of its parent and the left sibling of $v$ lies on the path from $v^*$ to $w_a$, or $v$ is the left child of its parent and its right sibling lies on the path from $v^*$ to $w_b$. It is known [68] that $P \cap \rho = \bigcup_{v\in V_\rho}(P_v \cap \rho)$, and a point $p \in P_v \cap \rho$, for $v \in V_\rho$, if and only if $p^{\perp} \in P_v^{\perp}\cap\rho^{\perp}$. Hence, we recursively query $P_v^{\perp}$ with $\rho^{\perp}$ for all $v \in V_\rho$.
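The selection of $V_\rho$ is the standard canonical-node search in a balanced binary search tree. A sketch under illustrative assumptions: each node carries a flag inner, internal nodes carry the splitting value key (the left subtree holds $x_1$-coordinates at most key), and leaves store a single coordinate in key.

def canonical_nodes(root, a, b):
    # Collect the O(log n) nodes of V_rho whose subtrees exactly cover the
    # points with x1-coordinate in [a, b].
    out, v = [], root
    while v.inner and (b <= v.key or a > v.key):  # paths for a and b agree
        v = v.left if b <= v.key else v.right
    if not v.inner:
        if a <= v.key <= b:
            out.append(v)
        return out
    u = v.left                        # walk toward a; right children hanging
    while u.inner:                    # off the path are canonical
        if a <= u.key:
            out.append(u.right)
            u = u.left
        else:
            u = u.right
    if a <= u.key <= b:
        out.append(u)
    u = v.right                       # walk toward b; left children hanging
    while u.inner:                    # off the path are canonical
        if b > u.key:
            out.append(u.left)
            u = u.right
        else:
            u = u.left
    if a <= u.key <= b:
        out.append(u)
    return out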

2.5.1 An MPC algorithm

Since the total space required by $T$ is $O(n\log^{d-1} n)$, we slightly amend our model so that the memory and I/O size for each machine per round is $O(s\log^{d-1} n)$. We still have $m$ machines with $s = n/m$ and $s \ge n^{\alpha}$ for some positive constant $\alpha < 1$. For each $\ell \in \{1, \ldots, d\}$, let $s_\ell = s\log^{\ell-1} n$. We assume that the input points are initially distributed so that each machine contains $O(s_1) = O(s)$ points. This assumption can be guaranteed by redistributing the points using an algorithm similar to the Partition procedure described in Section 2.2.2. Our construction procedure recursively builds primary trees for each coordinate $x_\ell$ using $O(s_\ell)$ memory and I/O size per machine. $T$ is stored in a distributed fashion.

Suppose we are building an $\ell$-th level structure of the range tree (initially, $\ell = 1$). The primary tree $T_0$ for this level is built and stored in a manner similar to the kd-tree, using a procedure we call Build-primary-tree$(P, I, \ell)$, which is nearly the same as the one used to build the kd-tree. The only difference is that the range space is induced by intervals over the real line, whose VC-dimension is a constant. Briefly, if $|P| = O(s_\ell)$, then $T_0$ is stored on a single machine and constructed using the sequential algorithm. Otherwise, we choose a parameter $r$, and the top subtree $\widetilde{T}_0$ of $T_0$ containing $\log_2 r$ levels and $\Theta(r)$ leaves is built and stored on a machine $\beta$. The leaves of $\widetilde{T}_0$ again partition $P$, and the subtrees of $T_0$ rooted at these leaves are built and stored recursively using disjoint sets of machines. We can also build $T_0$ deterministically in the same amount of time, work, and rounds by using a simpler deterministic Sample procedure, since our ranges are just intervals over the real line; we omit the details of the deterministic procedure.

Each node $v$ of $T_0$ has a pointer to a secondary structure. Each bottom-most subtree of $T_0$ of size $O(s_\ell)$ stored on a single machine has all the secondary structures of its nodes stored on the same machine. These secondary structures are built using the sequential algorithm in $O(s_\ell\log^{d-\ell} n)$ time and space each. The secondary structures for the remaining nodes are stored on disjoint subsets of $I$. See Figure 2.6. To build the secondary structures, each point $p$ is copied $\Theta(\log|P|)$ times, sending the copies to the disjoint sets of machines that will store $T(P_v^{\perp})$ for all $P_v$ containing $p$. We name our copying procedure Copy-points$(P, I, T_0)$.

Each machine of $I = \{i_0, i_0+1, \ldots\}$ contains several points lying in the leaves of $T_0$. To facilitate our copying procedure, we inform each machine about the nodes of $T_0$ that lie above its points. Let $\beta$ be the machine storing $\widetilde{T}_0$. We run the procedure Broadcast$(\widetilde{T}_0, I, \beta)$ and then recursively repeat the broadcast procedure with the child subtrees of $\widetilde{T}_0$ and their disjoint sets of machines. In $O(1)$ rounds, each machine will receive the ancestor subtrees for its points. Let $\lambda = \Theta(\log|P|)$ be the depth of $T_0$. Intuitively, the set $I$ is divided into $\lambda$ equal-sized sets of size $|I|/\lambda$, one for each level of $T_0$. The sets $P_v$ at a given depth are then distributed among disjoint subsets of the machines set aside for that level. Now, consider a node $v$ of $T_0$. Let $\lambda_v$ be the depth of node $v$ in $T_0$, let $\{v_0, v_1, \ldots\}$ be the nodes of $T_0$ lying at depth $\lambda_v$, and let $v_j = v$. Let $I_v = \{i_v, i_v+1, \ldots\}$ be the set of machines storing the points $P_v$. Let $p \in P_v$, and let $i$ be the machine storing $p$. Finally, let $c$ be a sufficiently small constant. Machine $i$ sends a copy of $p$ to machine $i_0 + \lambda_v\cdot cm/\lambda + cmj/(2^{\lambda_v}\lambda) + \lfloor c(i - i_v)/\lambda\rfloor - 1$. Let $I'_v$ be the set of machines that receive the points of $P_v$, the set that is used to construct the secondary structure for $v$. Each point is copied $\Theta(\log|P|)$ times, so the total communication out of $i$ while copying points is $O(s_\ell\log|P|)$.

Our construction algorithm concludes by recursively building the secondary structures (at the $(\ell+1)$-st level) for each node $v$ on the set of machines $I'_v$. We describe our construction procedure in Algorithm 3. The procedure Build-range-tree takes $\ell$ as one of its parameters to account for the storage required when building each level of the data structure.

Algorithm 3 Build-range-tree$(P, I, \ell)$
1: If $|I| = 1$, build $T(P)$ sequentially.
2: $T_0(P) \gets$ Build-primary-tree$(P, I, \ell)$.
3: Copy-points$(P, I, T_0)$.
4: for all $v \in T_0$ do
5:   Build-range-tree$(P_v^{\perp}, I'_v, \ell+1)$
6: end for

Figure 2.6: A distributed range tree in 2 dimensions. A balanced binary search tree over the $x$-coordinates is stored in a recursive manner. Each node points to a distributed binary search tree over the $y$-coordinates.

Lemma 16. Algorithm Build-range-tree$(P, I, 1)$ takes $O(1)$ rounds of computation, $O(n\log^d n)$ work, and $O(s\log^d n)$ time.

Proof. Consider running Build-range-tree$(P, I, \ell)$ for an arbitrary $\ell \in \{1, \ldots, d\}$. Step 1 can be done locally in $O(s_\ell\log^{d-\ell} n)$ time. Step 2 can be done in $O(1)$ rounds using $O(s_\ell\log n)$ time and $O(m\cdot s_\ell\log n)$ work. Copying points can also be done in one round using $O(s_\ell\log n)$ time and $O(m\cdot s_\ell\log n)$ work. There are $O(1)$ levels in the data structure, and each is built in $O(1)$ rounds, so the entire construction procedure takes $O(1)$ rounds of computation. The running time $T(\ell)$ for building the $\ell$-th and lower levels of a $d$-dimensional range tree, with $O(s_\ell)$ points initially stored on each machine, can be expressed using the recurrence $T(\ell) = O(s_\ell\log n) + T(\ell+1)$, with a base case of $T(d) = O(s_d\log n) = O(s\log^d n)$. This recurrence solves to $T(\ell) = O(s\log^d n)$. Similarly, the total work is $O(n\log^d n)$.
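For concreteness, the recurrence unrolls to a sum of $d - \ell + 1$ terms dominated by the last one:

$$T(\ell) = \sum_{j=\ell}^{d} O(s_j\log n) = \sum_{j=\ell}^{d} O(s\log^{j} n) = O(s\log^d n).$$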

2.5.2 Query procedure

Given a query $\rho$, we run it through the primary structure $T_0$ to get the $O(\log n)$ nodes whose secondary structures have to be probed further (as described at the beginning of Section 2.5). This takes $O(\log n)$ time and work and $O(1)$ rounds. Each secondary structure is then probed recursively. In total, we spend $O(\log n)$ time and $O(\log^{d-1} n)$ work over $O(1)$ rounds finding the secondary structures to probe, which are stored in the distributed manner described above. The actual bottleneck is the $O(\log^d n + k')$ time and $O(\log^d n + k)$ work required to query the bottom-most subtrees stored entirely within individual machines, where $k'$ and $k$ are the maximum number of points reported by a machine and the total number of points reported, respectively. We thus have the following theorem.

Theorem 17. Given a set $P$ of $n$ points in $\mathbb{R}^d$, a range tree $T(P)$ on $P$ of size $O(n\log^{d-1} n)$ can be constructed in the MPC model that can answer an orthogonal range-reporting query in $O(\log^d n + k')$ time and $O(\log^d n + k)$ work in $O(1)$ rounds of computation, where $k'$ is the maximum number of points reported by a machine and $k$ is the total number of points reported. The tree can be built in $O(1)$ rounds, $O(n\log^d n)$ work, and $O(s\log^d s)$ time.

2.6 Conclusion

We presented efficient algorithms to build and query kd-trees, range trees, and BBD-trees in the MPC model. Our algorithms are based on recursively partitioning a set of points in $\mathbb{R}^d$ by first sampling a small set of points and then computing the partition based on the sample. We believe our framework may be useful in designing efficient data structures in other domains or in the direct analysis of data.

We leave several questions open for future research. The one most closely related to our current work is whether our algorithms for kd-trees and BBD-trees can be made deterministic while remaining work-optimal. Our algorithms use $\mathrm{polylog}_s\, n = O(1)$ rounds; it would be interesting to see if this can be improved to $O(\log_s n)$, since a lower bound result from [89] says that $\Omega(\log_s n)$ rounds are needed to construct the data structures. Further afield, we ask what other data structures can be constructed efficiently in massively parallel models like MPC: can we efficiently build geometric data structures with more direct GIS applications, such as those supporting fast point-location queries? Can we efficiently parallelize geometric and topological data analysis methods such as persistent homology (see [40, 39, 108] for parallel and distributed algorithms for persistent homology; however, these do not have any provable performance guarantees)? Finally, can our techniques be used outside the geometric domain? In particular, it would be interesting to see if similar hierarchical data structures can be used to efficiently answer graph connectivity and related queries after some preprocessing.

3 Analyzing Massive Terrains

3.1 Introduction

The problem of modeling and analyzing massive terrain data sets has been studied extensively in many disciplines, such as GIS, spatial databases, computational geometry, and environmental sciences. Formally, a terrain $\Sigma$ is an $xy$-monotone surface in $\mathbb{R}^3$ that can be viewed as the graph of a height function $h: \mathbb{R}^2 \to \mathbb{R}$. It is often stored using a digital elevation model (DEM) such as a triangulated $xy$-monotone surface known as a triangulated irregular network (TIN). Here, the surface $\Sigma$ can be regarded as a triangulation $M$ of $\mathbb{R}^2$ together with a piecewise linear (within each triangle of $M$) height function $h: M \to \mathbb{R}$. Because of its various nice properties, $M$ is often the Delaunay triangulation of the points in the plane at which the height of the terrain is observed.

Suppose we have computed a suitable TIN using triangulation $M$ and height function $h$. Given a height value $\ell$, the $\ell$-level set of $h$ is the set of all points on $M$ whose height values are $\ell$. As one continuously varies $\ell$, the level sets of $h$ deform, and their topology changes at certain heights. The contour tree of $h$ encodes the changes to the level sets. The contour tree is used in a variety of applications, such as the prediction of water flow (including the computation of watershed hierarchies), the visualization of hydrothermal plumes, and the visualization of climate models [32, 42, 67, 117]. Therefore, it is natural to compute the contour tree of $h$ as a second step in analyzing the terrain it represents.

Recent advances in sensing technology provide both opportunities and obstacles for the aforementioned analysis of terrain data. On the one hand, data like the point sets created by LiDAR are being generated at increasingly higher resolutions, expanding the types and quality of data analyses that are even possible. On the other hand, classical algorithms that run on individual machines cannot keep up with the amount of data generated.

In this chapter, we study the problems of constructing the TIN DEM of such large elevation data (e.g., LiDAR data sets), given as a set of points in $\mathbb{R}^3$, and then constructing the contour tree of this TIN DEM. We use the Massively Parallel Communication (MPC) model of computation, discussed in more detail in Chapter 2.

3.1.1 Related work

Computing the Delaunay triangulation (and its dual, the Voronoi diagram) is one of the most studied problems in computational geometry; see [35]. Many of the algorithms for computing the Delaunay triangulation in the RAM model run in $O(n\log n)$ time. There are algorithms for computing the Delaunay triangulation in the I/O model [65] and the PRAM model (see [45] and the references therein). In addition, Goodrich [88] describes an algorithm for a bulk-synchronous parallel (BSP) computer that constructs convex hulls in 3D, which can be adapted to construct Delaunay triangulations in $O(1)$ rounds in the MPC model of computation, although the algorithm itself is rather involved. SpatialHadoop [79] includes a MapReduce algorithm to compute the Delaunay triangulation of a point set, though there are no provable bounds on its time complexity. Finally, there exist efficient algorithms for constructing the constrained Delaunay triangulation, which is required to contain a given subset of edges [35]; however, these algorithms are not for the MPC model.

Efficient algorithms for constructing contour trees were discovered more recently. Van Kreveld et al. [142] gave the first $O(n\log n)$-time algorithm for our setting of piecewise linear height functions over $\mathbb{R}^2$. This algorithm was later generalized by Tarasov and Vyalyi [139] and Carr et al. [56]. Agarwal et al. [15] gave an I/O-efficient algorithm for computing the contour tree. There has been a great deal of work on computing contour trees and similar structures using parallel and distributed computing, although these algorithms do not have any theoretical guarantees in the MPC model; see [9, 122, 123] and the references therein. Of particular relevance to our work are the distributed contour and merge trees of Morozov and Weber [122, 123]. Suppose one has a TIN $M$ and height function $h$ stored in a distributed manner. Morozov and Weber describe how to compute trees for the data local to each processor, and then merge simplified versions of these trees so that the simplification is known to all of the processors.

3.1.2 Our results

In this chapter, we describe new algorithms in the MPC model for both parts of the terrain analysis pipeline mentioned above. Given a point set $P$ in $\mathbb{R}^2$ with $|P| = n$, our first result is a randomized algorithm that computes the Delaunay triangulation of $P$ using, with high probability, $O(1)$ rounds of computation in $O(s\log^2 n)$ time and $O(n\log n)$ work. While Goodrich's [88] convex hull algorithm for BSP computers can be adapted for a similar result, our approach is more direct and considerably simpler.

Our second result is a deterministic algorithm that computes the contour tree for a height function $h$ over a triangulation $M$ using $O(1)$ rounds of computation, $O(s\log n)$ time, and $O(n\log n)$ work, provided that the following two assumptions hold:

1. $s = \Omega(n^{1/2+\varepsilon})$ for some constant $\varepsilon > 0$, and

2. every line in $\mathbb{R}^2$ intersects $O(\sqrt{n})$ edges of $M$.

We believe these are reasonable assumptions, because the number of bytes of memory available to individual machines in datacenters should be much higher than the number of available machines (justifying the first assumption). Our second assumption is guaranteed to hold when $M$ is the Delaunay triangulation of points lying on a grid, and it holds with probability at least $1 - 1/n^{\Omega(1)}$, for $n$ sufficiently large, when the points are chosen independently and uniformly at random from the unit square [46]. Furthermore, large data sets generated by LiDAR and other technologies have points sampled almost uniformly.

Both of our algorithms rely on a similar divide-and-conquer strategy. Namely, we distribute the input points (triangles) into subsets with small intersection. These subsets are then copied to disjoint sets of machines, where we recursively compute the Delaunay triangulation (contour tree) for each subset independently and in parallel. For our Delaunay triangulation algorithm, we begin by sampling a small subset of points $S$ and building the Delaunay triangulation of the sample. The input points are then distributed based on how they conflict with edges in the triangulation of $S$. This idea was used before in, for example, the constrained Delaunay triangulation algorithm of Agarwal et al. [14], but implementing the scheme in the MPC model requires additional ideas. We need high-probability bounds instead of simple expectation bounds on the size of the "conflict lists", and this involves adapting the exponential decay lemma [115] to our scenario using a two-level sampling procedure. For our algorithm to build contour trees, we combine recursively constructed contour trees for each subset of triangles, taking advantage of the subsets' small intersection. Unlike the algorithm of Morozov and Weber [122, 123], we do not rely on an initial distribution of the input triangles to induce small regions of the triangulation with limited intersection. Instead, we compute our own recursive hierarchy of triangle subsets that contain few boundary vertices. Further, we combine multiple trees simultaneously to minimize communication, instead of combining them in pairs.

3.2 Preliminaries

3.2.1 Delaunay triangulations

Given a set of $n$ points $P \subset \mathbb{R}^2$, a triangulation of $P$ is a maximal subdivision of the plane into triangles having vertices from $P$. A triangle belonging to a given triangulation is said to be Delaunay if its circumcircle does not contain any point of $P$ in its interior.

Figure 3.1: The Delaunay triangulation and Voronoi diagram (in blue dashed edges) of a set of points.

Figure 3.2: (a), (b) An example terrain depicted with contours through the saddle vertices, showing the critical vertices of the terrain: $\alpha_1, \alpha_3, \alpha_4, \alpha_5$ are saddles. (c) The contour tree of the terrain in (a). This figure is taken from [22].

A triangulation of $P$ is called a Delaunay triangulation, denoted by $DT(P)$, if all the triangles in $DT(P)$ are Delaunay. Given a planar point set $P$, the Voronoi cell of a point $p \in P$, $\mathrm{Vor}_P(p)$, is the set of all points $q \in \mathbb{R}^2$ that are closer to $p$ than to any other point of $P$, i.e.,

$$\mathrm{Vor}_P(p) = \{q \in \mathbb{R}^2 \mid d(p, q) \le d(p', q)\ \ \forall p' \in P\}.$$

The Voronoi diagram of $P$, $\mathrm{Vor}(P)$, is the subdivision of the plane into the Voronoi cells of the points of $P$. A classical result states that $DT(P)$ and $\mathrm{Vor}(P)$ are duals of each other, i.e., there exists an edge between two points $p, q \in P$ in $DT(P)$ iff $\mathrm{Vor}_P(p)$ and $\mathrm{Vor}_P(q)$ share an edge in $\mathrm{Vor}(P)$. See Figure 3.1.

3.2.2 Terrains and contour trees

Terrains. Let $M = (V, E, F)$ be a triangulation of $\mathbb{R}^2$, with vertex, edge, and face (triangle) sets $V$, $E$, and $F$, respectively, and let $n = |V|$. We assume that $M$ is finite and contains one boundary component (our algorithm can be modified easily to remove the boundary by assuming $V$ contains a vertex $v_\infty$ at infinity). Let $h: M \to \mathbb{R}$

be a height function. We assume that the restriction of $h$ to each triangle of $M$ is a linear map, and that the heights of all vertices are distinct. Given $M$ and $h$, the graph of $h$, called a terrain and denoted by $\Sigma_h$, is an $xy$-monotone triangulated surface whose triangulation is induced by $M$. See Figure 3.2. If $h$ is clear from the context, we denote $\Sigma_h$ by $\Sigma$. The vertices, edges, and faces of $\Sigma$ are in one-to-one correspondence with those of $M$. With a slight abuse of terminology, we refer to $V$, $E$, and $F$ as the vertices, edges, and triangles of both $\Sigma$ and $M$.

Critical points. For a vertex $v$ of $M$, the link of $v$, denoted $\mathrm{Lk}(v)$, is the cycle or path formed by the edges of $M$ that are not incident on $v$ but belong to the triangles incident on $v$. The lower (resp. upper) link of $v$, $\mathrm{Lk}^-(v)$ (resp. $\mathrm{Lk}^+(v)$), is the subgraph of $\mathrm{Lk}(v)$ induced by the vertices $u$ with $h(u) < h(v)$ (resp. $h(u) > h(v)$). A minimum (resp. maximum) of $M$ is a vertex $v$ for which $\mathrm{Lk}^-(v)$ (resp. $\mathrm{Lk}^+(v)$) is empty. A maximum or a minimum vertex is called an extremal vertex. A non-extremal vertex $v$ is regular if $\mathrm{Lk}^-(v)$ (and also $\mathrm{Lk}^+(v)$) is connected, and saddle otherwise. A vertex that is not regular is called a critical vertex.
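This classification is purely local and easy to compute. A small Python sketch, assuming the link of an interior vertex is given as a list in cyclic order with consecutive entries adjacent (for a boundary vertex the link is a path, and the run count below would need a minor adjustment):

def classify(v, link, h):
    # Count the maximal runs of lower-link vertices around the cycle.
    lower = [h(u) < h(v) for u in link]
    if not any(lower):
        return "minimum"
    if all(lower):
        return "maximum"
    runs = sum(1 for i in range(len(link)) if lower[i] and not lower[i - 1])
    return "regular" if runs == 1 else "saddle"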

Level sets and contours. Given any value $\ell \in \mathbb{R}$, the $\ell$-level set, the $\ell$-sublevel set, and the $\ell$-superlevel set of $M$, denoted by $M_\ell$, $M_{\le\ell}$, and $M_{\ge\ell}$, respectively, consist of the points $x \in \mathbb{R}^2$ with $h(x) = \ell$, $h(x) \le \ell$, and $h(x) \ge \ell$, respectively. A connected component of $M_\ell$ is called a contour. See Figure 3.2. Each point $v \in \mathbb{R}^2$ is contained in exactly one contour in $M_{h(v)}$, which we call the contour of $v$. The contour of a noncritical point is a simple polygonal cycle with non-empty interior or a polygonal boundary-to-boundary arc. The contour of a local minimum or maximum $v$ consists only of the single point $v$, and the contour of a saddle vertex $v$ consists of two or more simple cycles and/or arcs with $v$ being their only intersection point.

Contour trees. Consider raising $\ell$ from $-\infty$ to $\infty$. The contours continuously deform, but no changes happen to the topology of the level set as long as $\ell$ varies between two consecutive critical levels. A new contour appears as a single point at a minimum vertex, and an existing contour contracts into a single point and disappears at a maximum vertex. An existing contour splits into two new contours, or two contours merge into one, at a saddle vertex. The contour tree $T_h$ of $h$ is a tree on the critical vertices of $M$ that encodes these topological changes of the level set. An edge $(v, w)$ of $T_h$ represents the contour that appears at $v$ and disappears at $w$. Formally, $T_h$ is the quotient space in which each contour is represented by a point and connectivity is defined in terms of the quotient topology. Let $\rho: M \to T_h$ be the associated quotient map, which maps all points of a contour to a single point on an edge of $T_h$. Fix a point $p$ in $M$. If $p$ is not a critical vertex, $\rho(p)$ lies in the relative interior of an edge of $T_h$; if $p$ is an extremal vertex, $\rho(p)$ is a leaf node of $T_h$; and if $p$ is a saddle vertex, then $\rho(p)$ is a non-leaf node of $T_h$. See Figure 3.2. We use $h$ to denote the height function on the points of $T_h$ as well. The definition of contour

trees as a quotient map actually applies to any simply connected topological space $\Sigma$ with an associated height function $h$.

Join trees and split trees. Analogous to the contour tree of $\Sigma$, which encodes the topological changes to $M_\ell$ as we increase $\ell$ from $-\infty$ to $\infty$, the join tree $J_h$ (resp. split tree $S_h$) encodes the topological changes in $M_{\le\ell}$ (resp. $M_{\ge\ell}$). Its leaves are the minima (resp. maxima) of $\Sigma$, and its internal nodes are some of the saddle vertices of $\Sigma$. The contour tree, join tree, and split tree can all be augmented with additional vertices of $M$ by subdividing the edges representing the contours, sublevel-set components, or superlevel-set components containing these vertices. Carr et al. [56] describe how to build the contour tree $T_h$ in time linear in its size, given the join and split trees augmented with each other's nodes.
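The join tree itself admits the classic sequential sweep used by Carr et al.: process the vertices in increasing height order and track sublevel-set components with a union-find structure. A minimal sketch of that sweep (not the distributed algorithm developed later in this chapter), assuming verts is a list of vertex ids, neighbors(v) yields the neighbors of v in M, and height(v) returns h(v):

def join_tree(verts, neighbors, height):
    # Returns the edges of the join tree augmented with all vertices: each
    # vertex attaches to the current head of every sublevel-set component
    # among its already-processed (lower) neighbors.
    parent, head, edges = {}, {}, []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]         # path halving
            x = parent[x]
        return x

    for v in sorted(verts, key=height):
        parent[v], head[v] = v, v                 # v starts its own component
        for u in neighbors(v):
            if u not in parent:                   # u is higher: not processed
                continue
            ru = find(u)
            if ru != v:
                edges.append((head[ru], v))       # that component ends at v
                parent[ru] = v                    # merge it into v's component
    return edges

A minimum contributes a leaf (it merges nothing), while a saddle whose lower link touches two sublevel-set components receives two incoming edges, i.e., becomes an internal node of the join tree.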

3.2.3 $\varepsilon$-nets

We have already come across range spaces and VC dimension (see Section 2.2). The only ranges we consider in this chapter are those induced by the union of two open discs in $\mathbb{R}^2$; such range spaces have constant VC dimension. Given a range space $Y = (X, \mathcal{R})$ and $0 \le \varepsilon \le 1$, a subset $X' \subseteq X$ is called an $\varepsilon$-net of $Y$ if for any range $R \in \mathcal{R}$ with $|R| \ge \varepsilon|X|$, the set $X'$ intersects $R$. Similarly, given a range space $Y = (X, \mathcal{R})$ and $0 \le \varepsilon \le 1$, a subset $X' \subseteq X$ is called an $\varepsilon$-approximation of $Y$ if for any range $R \in \mathcal{R}$, we have

$$\left|\frac{|X' \cap R|}{|X'|} - \frac{|R|}{|X|}\right| \le \varepsilon.$$
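When the ranges can be materialized explicitly as subsets of $X$, the definition can be checked directly; a small, purely illustrative verifier in Python:

def is_eps_approximation(X, X1, ranges, eps):
    # ranges: an iterable of subsets of X (e.g., the ranges induced on a
    # finite point set by a family of discs).
    Xs, X1s = set(X), set(X1)
    return all(abs(len(X1s & set(R)) / len(X1s) - len(Xs & set(R)) / len(Xs)) <= eps
               for R in ranges)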

3.3 Computing Delaunay Triangulation

Let $P \subset \mathbb{R}^2$ be a set of $n$ points. We now describe our algorithm to compute the Delaunay triangulation $DT(P)$ of $P$. For simplicity of presentation, we assume that no four points in $P$ are co-circular, in which case $DT(P)$ is unique; the algorithm can be adapted to handle the case when four or more points are co-circular using standard techniques. We also add three points at infinity, and every convex-hull vertex of $P$ is connected to one of these points by an edge (which is a ray). These additional points and edges ensure that $DT(P)$ is a triangulation of the entire plane and that each edge is adjacent to two triangles. We refer to this augmented structure as the augmented Delaunay triangulation of $P$, and we will not distinguish between the standard and augmented Delaunay triangulations.

At a very high level, our algorithm works as follows: we randomly sample a small subset $S \subseteq P$ and compute $DT(S)$ locally on one machine. For each edge $e \in DT(S)$, we compute the points in $P$ that are in conflict with $e$. We then compute the Delaunay triangulation of each such set, and discard some of the invalid triangles to get $DT(P)$.

Figure 3.3: The kernel (in blue) for the edge $pq$. The points $c_1$ and $c_2$ are the circumcenters of the triangles $\triangle pqu$ and $\triangle pqv$, respectively, and $c_1c_2$ is the edge between $\mathrm{Vor}_S(p)$ and $\mathrm{Vor}_S(q)$. The solid points are in conflict with $pq$ but the hollow points are not.

3.3.1 Conflict lists and kernels

Consider a subset $S \subseteq P$. Let $e = pq$ be an edge of $DT(S)$, where $\triangle pqu$ and $\triangle pqv$ are the triangles adjacent to $e$. (Recall that each edge is adjacent to two triangles in the augmented $DT(S)$.) Abusing the notation a little, by a fixed edge $e$ we will often mean $e$ along with the two triangles adjacent to $e$. We identify the edge $pq$ of $DT(S)$ with the four-tuple $D(e) = \{p, q, u, v\}$, i.e., the two endpoints of $e$ and the two other vertices of the triangles adjacent to $e$. The conflict list of $e$, denoted by $P_{|e}$, is the set of all points in $P$ that lie in the circumcircle of $\triangle pqu$ or $\triangle pqv$. The following lemma is well known (see e.g. [68]):

Lemma 18. An edge $e = pq$ along with the triangles $\triangle pqu$, $\triangle pqv$ appears in $DT(S)$ if and only if (i) $D(e) \subseteq S$ and (ii) $P_{|e} \cap S = \emptyset$.

The kernel of $e$, $\ker(e)$, is the set of points $x \in \mathrm{Vor}_S(p)\cup\mathrm{Vor}_S(q)$ such that the ray $\overrightarrow{px}$ or $\overrightarrow{qx}$ intersects the edge between $\mathrm{Vor}_S(p)$ and $\mathrm{Vor}_S(q)$. See Figure 3.3. The following lemma from [14] is the basis of our algorithm.

Lemma 19. Let $S \subseteq P$. Then the following hold.

(a) The set $\{\ker(e) \mid e \in DT(S)\}$ partitions the plane.

(b) A triangle $\triangle$ is in $DT(P)$ iff there exists an edge $e \in DT(S)$ such that $\triangle \in DT(P_{|e})$ and the circumcenter of $\triangle$ lies in $\ker(e)$.

Lemma 19 suggests the following algorithm: compute the Delaunay triangulation of a small sample $S \subseteq P$ locally, compute the conflict list for each edge of $DT(S)$, recursively build the Delaunay triangulation for each list in parallel, and discard the triangles that do not satisfy condition (b) of Lemma 19. We formalize this algorithm later.

3.3.2 Random sampling

Consider the Delaunay triangulation of a uniformly random sample $R \subseteq P$ of size $r$. Consider the range space $Y = (P, \{P \cap D \mid D \text{ is an open disc}\})$. $Y$ has constant VC dimension, and hence $R$ is an $O(\log(n)/r)$-net of $Y$ with probability at least $1 - 1/n^{\Omega(1)}$ [95]. Any (open) circumcircle $D$ of a triangle of $DT(R)$ does not contain any point of $R$. Hence, by the definition of an $O(\log(n)/r)$-net, we have $|P \cap D| = O(n\log(n)/r)$. Thus, every conflict list has size $O(n\log(n)/r)$ with probability at least $1 - 1/n^{\Omega(1)}$. To get rid of the extra $O(\log n)$ factor in the size, we do a second level of sampling as follows. For each edge $e \in DT(R)$ with $|P_{|e}| = t_e\cdot n/r$, we sample a subset $R_{|e}$ of $O(t_e\ln t_e)$ points from $P_{|e}$ uniformly at random. We repeat both levels of sampling until $|P_{|e|e'}| = O(|P|/r)$ for every $e \in DT(R)$, $e' \in DT(R_{|e})$

and $\sum_{e\in DT(R)} |R_{|e}| = O(r)$. Applying Lemma 19 again gives us that a triangle $\triangle \in DT(P_{|e|e'})$ is in $DT(P)$ if and only if its circumcenter lies in $\ker(e)\cap\ker(e')$. We will now show that the procedure requires only $O(\log n)$ repetitions of the two-level sampling with high probability. For an integer $t$, let

$$F_t(R) = \{e \in DT(R) \mid t\cdot n/r \le |P_{|e}| < (t+1)\cdot n/r\}.$$

Lemma 20. For all $e \in DT(R)$ and all $e' \in DT(R_{|e})$, we have $|P_{|e|e'}| = O(n/r)$ with probability $\Omega(1)$.

Proof. For an edge $e \in F_{t_e}(R)$, we sample $O(t_e\ln t_e)$ points $R_{|e}$ from $P_{|e}$. Again, $R_{|e}$ is an $O(1/t_e)$-net of $P_{|e}$ with probability $1 - 1/t_e^{\Omega(1)}$. Applying the union bound, we see that each edge in $DT(R_{|e})$ has $O(|P_{|e}|/t_e) = O(n/r)$ points of $P_{|e}$ in conflict with it, with probability at least $1 - \sum_{e\in DT(R)} 1/t_e^{\Omega(1)}$, which is a positive constant.

We now show that we sample only a small number of additional points in each attempt at two-level sampling. The following exponential decay lemma, originally proved in [61, 115], is crucial for the analysis. For a parameter $r \le n$, let $\varphi_t(r) = \mathbb{E}[|F_t(R)|]$, where $R$ is a random sample of size $r$. We define $\varphi_{\ge t}(r) = \sum_{t' \ge t}\varphi_{t'}(r)$, and we set $\varphi(r) = \varphi_{\ge 0}(r)$.

Er|FtpRq|s, where R is a random sample of size r. We define φětprq “ t1ět φtprq. We set φprq “ φě0prq. ř Lemma 21. For any t ě 1,

Op1q φětprq ď t expp´pt ´ 2qq ¨ φpr{tq.

Proof. The proof is adapted from [115] to our context. Let $X$ be the set of all edges (along with their two adjacent triangles) that can appear in the Delaunay triangulation of a subset of $P$, and let $X_t = \{e \in X \mid |P_{|e}| \ge tn/r\}$. For simplicity, we prove the lemma under a slightly different probability model, where $R$ is not a random subset of size $r$ but is selected by choosing each point of $P$ independently with probability $p = r/n$. The bound holds for the desired probabilistic model by a straightforward modification.

For an edge $e \in X_t$, let $\pi(e) = \Pr[e \in DT(R)]$. By Lemma 18, $\pi(e) = p^4(1-p)^{\kappa_e}$, where $\kappa_e = |P_{|e}|$. Therefore,

$$\varphi_{\ge t}(r) = \sum_{e\in X_t}\pi(e) = \sum_{e\in X_t} p^4(1-p)^{\kappa_e}.$$

We choose another random sample $R'$ by choosing each point with probability $p' = p/t = \frac{r}{tn}$. Then

$$\varphi(r/t) = \sum_{e\in X}(p')^4(1-p')^{\kappa_e} \ge \sum_{e\in X_t} t^{-4}p^4(1-p)^{\kappa_e}\left(\frac{1-p/t}{1-p}\right)^{\kappa_e} = \sum_{e\in X_t} p^4(1-p)^{\kappa_e}\,\psi(e),$$

where

$$\psi(e) = t^{-4}\left(\frac{1-p/t}{1-p}\right)^{\kappa_e}.$$

Assuming $p \le 1/2$ and using the facts $1-x \le e^{-x}$, $1-x \ge e^{-2x}$, and $\kappa_e \ge tn/r$ for $e \in X_t$, we obtain

$$\psi(e) \ge t^{-4}\exp\left(\kappa_e\left(-\frac{2p}{t} + p\right)\right) \ge t^{-4}\exp(t-2).$$

Hence,

$$\varphi_{\ge t}(r) \le t^4\exp(-(t-2))\,\varphi(r/t).$$

We now show that Lemma 21 immediately implies that the expected number of additional points sampled is $O(r)$.

Lemma 22. We have $\mathbb{E}\big[\sum_{e\in DT(R)} |R_{|e}|\big] = O(r)$.

Proof. We have

$$\mathbb{E}\Bigg[\sum_{e\in DT(R)}|R_{|e}|\Bigg] \le \sum_{t\ge 0}(t+1)\ln(t+1)\cdot\varphi_t(r) \le \sum_{t\ge 0}(t+1)\ln(t+1)\cdot\varphi_{\ge t}(r) \le \varphi(r)\sum_{t\ge 0} t^{O(1)}\ln(t+1)\exp(-t+2) \le O(r).$$

Finally, we prove our claim.

Lemma 23. The above procedure repeats the two-level sampling at most $O(\log n)$ times with probability at least $1 - 1/n^{\Omega(1)}$.

Proof. Let $c$ be a constant lower bound on the probability given in Lemma 20. By Markov's inequality, we have $\sum_{e\in DT(R)} |R_{|e}| = O(r)$ with probability at least $c_1 > 1 - c$, by allowing the constant in the big-$O$ to be sufficiently high. Therefore, the conditions of both Lemmas 20 and 22 are satisfied simultaneously with some constant probability. Within $O(\log n)$ repetitions of the two-level sampling, both conditions will hold with probability $1 - 1/n^{\Omega(1)}$.

3.3.3 Algorithm

Sequential algorithm. We first briefly describe a sequential algorithm to compute the conflict lists of a sample. Given a subset $R \subseteq P$ with $|P| = n$ and $|R| = r$, we construct $DT(R)$ and a point-location data structure on the triangles of $DT(R)$. Using this data structure, for each $p \in P$, we determine the triangle $\triangle_p \in DT(R)$ containing $p$. We say that a triangle $\triangle$ is in conflict with $p$ if $p$ lies in the interior of the circumcircle of $\triangle$. The proof of the following well-known lemma can be found in [68].

Lemma 24. The triangles of $DT(R)$ that are in conflict with a point $p$ form a connected subgraph in the dual graph (in the planar-graph-duality sense) of $DT(R)$.

The triangles in conflict with $p$ can be found by performing a breadth-first search from $\triangle_p$. Once these triangles are found, the edges in conflict with $p$ can be found, since they are the edges of these triangles. The point $p$ is then added to the corresponding conflict lists.

Lemma 25. Given a subset $R \subseteq P$ with $|P| = n$ and $|R| = r$, we can compute the conflict lists for the triangles of $DT(R)$ using the above algorithm in $O(n\log r + k)$ time, where $k$ is the total size of the conflict lists.

Proof. Computing $DT(R)$ takes $O(r\log r)$ time. Since there are $O(r)$ triangles in $DT(R)$, the point-location data structure can be constructed in $O(r\log r)$ time, and it answers a point-location query in $O(\log r)$ time [68]. Thus, the point location and the subsequent breadth-first search for a point $p$ take $O(\log r + k_p)$ time, where $k_p$ is the number of edges that $p$ is in conflict with. Summing over all points, we get a total running time of $O(n\log r + k)$.
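A sketch of this walk in Python, with illustrative names: tris[i] is a triangle given by its three corner points, adj[i] lists the triangles sharing an edge with triangle i, and start is the index of the triangle containing p (found by the point-location structure, which is omitted here).

from collections import deque

def in_circumcircle(tri, p):
    # Standard 3x3 incircle determinant; positive iff p is strictly inside,
    # assuming the triangle's corners are listed counterclockwise.
    (ax, ay), (bx, by), (cx, cy) = tri
    px, py = p
    a, b, c = ax - px, ay - py, (ax - px) ** 2 + (ay - py) ** 2
    d, e, f = bx - px, by - py, (bx - px) ** 2 + (by - py) ** 2
    g, h, i = cx - px, cy - py, (cx - px) ** 2 + (cy - py) ** 2
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g) > 0

def conflict_triangles(tris, adj, start, p):
    # BFS over the dual graph from the triangle containing p; by Lemma 24
    # the conflicting triangles form a connected subgraph.
    seen, out, queue = {start}, [], deque([start])
    while queue:
        t = queue.popleft()
        out.append(t)
        for u in adj[t]:
            if u not in seen and in_circumcircle(tris[u], p):
                seen.add(u)
                queue.append(u)
    return out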

Two-level sampling in the MPC model. We describe an MPC procedure called Two-Level-Sample$(P, I)$, where $P$ is a set of points distributed evenly across the machines $I$. We need to compute the value of $t_e$, as defined in the previous section, for every edge $e$ in $DT(R)$ of a random sample $R \subseteq P$. Computing the exact values of $t_e$ for the sets $P_{|e}$ would require sending the sample $R$ to all the machines, since $P$ is distributed across all machines. However, the conditions of Lemmas 20 and 22 can be guaranteed even if we estimate the value of $t_e$ to within a constant factor of its actual value. In particular, to reduce the amount of communication, we first compute a $(1/r)$-approximation $S$ of $P$ using the Sample primitive (which succeeds with high probability), send it to a machine $\beta \in I$, and perform the sequential two-level sampling procedure described above on the set $S$. This produces $DT(R)$ for a random subset $R \subseteq S$ of size $r$, and a second-level triangulation $DT(S_{|e})$ for each edge $e \in DT(R)$. We send the second-level triangulations to all the machines containing the set $P$, using the Broadcast primitive.

Computing the conflict lists. Each machine then does the following in parallel: for each point $p \in P$ residing on that machine, it computes the number of edges in the second-level triangulations that $p$ is in conflict with; let this number be denoted by $k_p$. The value $k_p$ can be computed for all $p$ on the machine by modifying the algorithm of Lemma 25 slightly to report only the number of lists containing the point $p$, within the same time bound. We will show later that $\sum_{\beta\in I}\sum_{p\in P_\beta} k_p = O(|P|)$ with high probability; thus all the conflict lists will fit on the machines $I$. Let $K$ denote an array of the values $k_p$, indexed by the point $p$. Each machine divides the portion of the array $K$ indexed by its points into contiguous subarrays $K_j$, each summing up to $cs$, for some suitable constant $c$. For each such subarray $K_j$, the machine sends the points $P_j$ corresponding to the indices of the subarray to a machine $i_j$. Note that $i_{j_1} \ne i_{j_2}$ for any $j_1 \ne j_2$. The machine $i_j$ is chosen in a manner similar to the implementation of the Partition primitive (Section 2.2.2). Once the points have been distributed, we compute the set of all edges in the second-level triangulations that the points are in conflict with, using Lemma 25. Since for a machine $i_j$ we have $\sum_{p\in P_{i_j}} k_p \le cs$, all the edges computed for the points in $i_j$ fit on $i_j$. The algorithm then determines whether there exists a set $P_{|e|e'}$ such that $|P_{|e|e'}| > c_1|P|/r$ for some sufficiently large constant $c_1$; if so, the entire algorithm restarts. Otherwise, we use the Distribute primitive so that each set $P_{|e|e'}$ lies on a disjoint set of machines $I_{|e|e'}$. We show later that $|P_{|e|e'}| \le c_1|P|/r$ with high probability.

Overall MPC algorithm. We compute $DT(P)$ using a recursive procedure we call DelaunayTriangulation, which takes as input a set of points $P$, machines $I$, and a set of edges $E$ (along with their adjoining triangles), and computes $DT(P)$ but retains only those triangles whose circumcenters lie in $\bigcap_{e\in E}\ker(e)$. If $|I| = 1$, all of the above is done sequentially. If $|I| > 1$, we first shuffle the points of $P$ randomly across the machines in $I$ and then run Two-Level-Sample$(P, I)$ to get the sets $P_{|e|e'}$ on machines $I_{|e|e'}$. For each such set $P_{|e|e'}$, we recursively call DelaunayTriangulation$(P_{|e|e'}, I_{|e|e'}, E\cup\{e, e'\})$. The pseudocode is described in Algorithm 4.

Algorithm 4 DelaunayTriangulation$(P, I, E)$
1: If $|I| = 1$, compute $DT(P)$ locally and retain those triangles whose circumcenters lie in $\bigcap_{e\in E}\ker(e)$.
2: Randomly shuffle the points of $P$ across the machines in $I$.
3: $\{(P_{|e|e'}, I_{|e|e'})\} \gets$ Two-Level-Sample$(P, I)$.
4: for all $P_{|e|e'}$ in parallel do
5:   DelaunayTriangulation$(P_{|e|e'}, I_{|e|e'}, E\cup\{e, e'\})$.
6: end for

3.3.4 Analysis

We first show that the $O(r)$ subproblems produced are of almost equal sizes.

Lemma 26. For any pair of edges $e$ and $e'$ produced in the first and second levels of sampling and triangulation on the sets $S$ and $S_{|e}$, respectively, $|P_{|e|e'}| = O(|P|/r)$ with probability at least $1 - 1/n^{\Omega(1)}$.

Proof. The set $P_{|e|e'}$ (resp. $S_{|e|e'}$) is obtained by taking all those points of $P$ (resp. $S$) which lie in the conflict lists of both $e$ and $e'$. The region defined by these conflict lists is obtained by a finite union and intersection of discs (in particular, the union of two discs each for the conflict lists of $e$ and $e'$, followed by the intersection of these two regions). The VC dimension of the range space defined by such regions is a constant, and so $S$ is a $(1/r)$-approximation of the range space defined on the set $P$ with probability at least $1 - 1/n^{\Omega(1)}$. Thus, since $|S_{|e|e'}|/|S| = O(1/r)$ after the two-level sampling, we have $|P_{|e|e'}|/|P| = O(1/r)$ with probability at least $1 - 1/n^{\Omega(1)}$. Since there are $O(r)$ subproblems, the total size of the conflict lists is $O(r)\cdot O(|P|/r) = O(|P|)$, and hence the subproblems fit on the machines $I$.

Lemma 27. The procedure Two-Level-Sample$(P, I)$ takes $O(1)$ rounds of computation, $O(s\log^2 n)$ time, and $O(|P|\log|P|)$ work, with probability at least $1 - 1/n^{\Omega(1)}$.

Proof. Each of the time bounds below holds either deterministically or with probability at least $1 - 1/n^{\Omega(1)}$. Computing the set $S$ takes $O(1)$ rounds of computation, $O(s\log s)$ time, and $O(|P|\log|P|)$ work. Two-level sampling on the set $S$ takes expected time and work $O(|S|\log r\log n) = O(s\log^2 n)$, by Lemma 25. Sending the second level of triangulations to all the machines takes one round of computation, $O(s\log s)$ time, and $O(|P|\log|P|)$ work. Computing the array $K$ takes $O(|P|\log r) = O(|P|\log|P|)$ work, by Lemma 25 and the fact that the total size of the conflict lists is $O(|P|)$. The time taken is $O(s\log r + \max_\beta\sum_{p\in P_\beta} k_p)$. The value of $\max_\beta\sum_{p\in P_\beta} k_p$ can in the worst case be $O(sr)$. However, if we randomly shuffle the points of $P$ across all machines at the beginning, a simple Chernoff bound argument shows that $\max_\beta\sum_{p\in P_\beta} k_p = O(s\log|P|)$ with probability at least $1 - 1/n^{\Omega(1)}$. Goodrich et al. [91] show that random shuffling can be done in $O(1)$ rounds, $O(s\log s)$ time, and $O(|P|\log|P|)$ work. Distributing the points using the array $K$ takes $O(1)$ rounds, $O(s\log s)$ time, and $O(|P|\log|P|)$ work. After the distribution, each machine's points have conflict lists of total size $O(s)$, and thus computing the conflict lists takes $O(s\log s)$ time and $O(|P|\log|P|)$ work, by Lemma 25. Finally, distributing the sets $P_{|e|e'}$ using the Distribute primitive takes $O(1)$ rounds, $O(s\log s)$ time, and $O(|P|\log|P|)$ work.

Since $|P_{|e|e'}| = O(|P|/r)$ with probability at least $1 - 1/n^{\Omega(1)}$, the number of levels of recursion is $O(\log_r|P|) = O(1)$. By induction, each recursive level takes $O(1)$ rounds, $O(s\log^2 n)$ time, and $O(|P|\log|P|)$ work. The algorithm avoids ever restarting due to overly large conflict lists with high probability. The bounds in the statement of the lemma follow.

The correctness of our algorithm follows from Lemma 19 and the discussion in the subsections following it. We thus have the following.

Theorem 28. Given a set $P$ of $n$ points in $\mathbb{R}^2$, the Delaunay triangulation of $P$ can be built in the MPC model using $O(1)$ computation rounds, $O(s\log^2 n)$ time, and $O(n\log n)$ work, with probability at least $1 - 1/n^{\Omega(1)}$.

3.4 Contour Tree

We now describe our algorithm to compute the contour tree of a terrain. Let $M = (V, E, F)$ be a triangulation of $\mathbb{R}^2$, and let $n = |V|$. Let $h: M \to \mathbb{R}$ be a piecewise linear height function over $M$. We assume $s = \Omega(n^{1/2+\varepsilon})$ for some constant $\varepsilon > 0$, and that every line in $\mathbb{R}^2$ intersects $O(\sqrt{n})$ edges of $M$. Let $\Sigma$ be the terrain described by $M$ and $h$.

3.4.1 Distributed contour trees

The contour tree $T$ constructed by our algorithm will be stored in a distributed fashion, similar to the data structures of Chapter 2. Let $\beta \in I$ be an arbitrary machine. Given a degree-2 vertex of a tree, we say we splice the vertex by removing the vertex and adding an edge between its two neighbors. On $\beta$, we store a tree $T_0$ formed by taking a subtree of $T$ and iteratively splicing degree-2 vertices. For each vertex $v$ of $T$ that is a leaf in $T_0$, let $e_v$ be the edge incident to $v$ in $T$ separating $v$ from the other vertices of $T$ that are also in $T_0$. We define the rooted subtree of $v$ as the connected component of $T \setminus e_v$ that includes $v$. Let $u, w$ be a pair of vertices in $T$, and let $e_u$ and $e_w$ be the edges incident to $u$ and $w$, respectively, along the unique path between $u$ and $w$ in $T$. We define the vine of $uw$ to be the connected component containing the path between $u$ and $w$ after removing every edge incident to $u$ and $w$ except $e_u$ and $e_w$. Now, for each leaf $v$ of $T_0$, we recursively store the subtree $T_v$ rooted at $v$ on a subset of machines $I_v$ (this subtree may contain only $v$). In addition, for each edge $uw$ of $T_0$, we recursively store the vine of $uw$ on a subset of machines $I_{uw}$. See Figure 3.4. These subsets of machines are not necessarily disjoint. The subtrees are stored in such a way that for any vertex $r$ in $T_0$, the number of recursively stored subtrees encountered on any $r$-to-leaf path in $T$ is $O(1)$. In other words, $r$-to-leaf traversals can be performed in $O(1)$ rounds of computation and in time linear in the number of vertices along the path.

Figure 3.4: A possible top level subtree $T_0$ for the contour tree given in Figure 3.2 is given on the left. Emanating to the right are the rooted subtree of $\alpha_2$ and the vine of $\alpha_3 y_2$. The rooted subtree and vine may be stored on separate machines from $T_0$.

Our high-level idea for computing the tree $T$ is as follows. We split $M$ into pieces with approximately equal numbers of vertices. We then recursively compute distributed contour trees within each piece, independently and in parallel. Merging multiple contour trees directly is difficult, so we combine the contour trees by essentially sewing their join and split trees together along the piece boundaries and then running a procedure of Carr et al. [56]. Within a recursive computation, we trim excess vertices from the contour tree before sending them to the recursive parent call, in order to minimize the amount of communication necessary between rounds of computation.

3.4.2 Recursive construction

We now describe our recursive algorithm in more detail. Throughout our explanation, let $n$ be the number of triangles in the top level triangulation $M$. The input to a recursive call of our algorithm is a rectangle $l$, a triangulation $M_l = (V_l, E_l, F_l)$ with one boundary component such that $l \cap M$ fits entirely within $M_l$, a piecewise linear height function $h_l : M_l \to \mathbb{R}$, and a contiguous set of machines $I_l = \{i_0, i_0 + 1, \dots\}$ storing $M_l$. Our algorithm is summarized as Algorithm 5. We call a vertex $v$ of $M_l$ a boundary vertex of $l$ if $v$ lies on any triangle properly intersecting $l$. Our goal is to compute the distributed contour tree $T_l$ on machines $I_l$. We will also create a tree $T'$ which will be sent as the 'return value' of a recursive call. Tree $T'$ contains all boundary vertices of $M_l$, and all saddle vertices $v$ for which there exists a boundary vertex in at least three of the connected components of $M_l$ separated by the contour through $v$. Tree $T'$ is created from the top level structure of the distributed contour tree $T_l$ by augmenting the top level structure with any boundary vertices that are not already

Algorithm 5 ContourTree($l$, $M_l$, $h_l$, $I_l$)
1: if $|I_l| = 1$ then
2:   Compute contour tree $T$ locally.
3: else
4:   Divide $l$ into $k = \min\{n^\varepsilon, |I_l|\}$ rectangles $\{l_1, \dots, l_k\}$.
5:   for all $l_i$ in parallel do
6:     $M_{l_i}$ := triangles of $M_l$ intersecting $l_i$.
7:     $h_{l_i}$ := restriction of $h_l$ to $M_{l_i}$.
8:     $I_{l_i}$ := a set of $|I|/k$ machines.
9:     $T'_{l_i}$ := ContourTree($l_i$, $M_{l_i}$, $h_{l_i}$, $I_{l_i}$).
10:  end for
11:  Combine all trees $T'_{l_i}$ to make contour tree $T$.
12: end if
13: Trim $T$ to make tree $T'$, and return $T'$.

present, iteratively removing leaves that are not boundary vertices, and iteratively splicing degree-2 vertices that are not boundary vertices, in that order. The top level subtree given in Figure 3.4 shows the result of this procedure if $\alpha_2$, $\alpha_3$, and $y_2$ are boundary vertices. Let $\Sigma_l$ denote the terrain described by $M_l$ and $h_l$. The creation of $T'$ mirrors the changes to $T_l$ that take place as one iteratively sets the height of all points on $M_l$ mapping to a single leaf edge to be equal to the height of the edge's non-leaf vertex. Let $\Sigma'_l$ be the terrain created by this iterative procedure. Observe that the triangles of $M_l$ properly intersecting $l$ have their height values unchanged between $\Sigma_l$ and $\Sigma'_l$. The portions of $T_l$ removed from $T'$ can be represented as rooted subtrees and vines. The top level structures for these rooted subtrees and vines are permanently stored on machines in $I_l$. Note that $T'$ is still augmented with every boundary vertex, although these boundary vertices may have degree 2.

Lemma 29. The number of vertices in $T'$ is $O(\sqrt{n})$.

Proof. By our assumption that each line in $\mathbb{R}^2$ crosses $O(\sqrt{n})$ edges, the boundary of $M_l$ crosses $O(\sqrt{n})$ triangles, and there are therefore $O(\sqrt{n})$ boundary vertices. Every degree-1 or degree-2 vertex in $T'$ is a boundary vertex. The lemma follows.
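The trimming step also admits a short sequential sketch, assuming the same hypothetical adjacency-map representation as the splice sketch above; `boundary` is the set of boundary vertices.

```python
# A sketch (assumptions: `adj` is an adjacency map of the contour tree already
# augmented with boundary vertices; `boundary` is the set of boundary vertices)
# of the trimming that produces T': prune non-boundary leaves, then splice
# non-boundary degree-2 vertices, in that order.
def trim(adj, boundary):
    stack = [v for v in adj if len(adj[v]) == 1 and v not in boundary]
    while stack:
        v = stack.pop()
        if v not in adj or len(adj[v]) != 1 or v in boundary:
            continue
        (u,) = adj[v]                # v's single neighbor
        adj[u].discard(v)
        del adj[v]
        if len(adj[u]) == 1 and u not in boundary:
            stack.append(u)          # u may have become a removable leaf
    # Splice non-boundary degree-2 vertices (see splice_degree2 above).
    return splice_degree2(adj, keep=boundary)
```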

We now describe how to compute the distributed contour tree $T_l$. If $|I_l| = 1$, then we compute $T_l$ sequentially using the algorithm of Carr et al. [56]. Otherwise, we do so recursively. Let $n'$ be the number of vertices of $M$ strictly interior to $l$. We begin by dividing $l$ along vertical lines into $k = \min\{n^\varepsilon, |I_l|\}$ rectangles $\{l_1, \dots, l_k\}$. These lines are chosen so that there exist at most $\lceil n'/k \rceil$ vertices of $M$ in each rectangle $l_i$. Finding the lines can be done in $O(1)$ rounds, $O(s \log s)$ time, and $O(n' \log n')$ work by sorting the vertices inside $l$ by their $x$-coordinate using a procedure of Goodrich [89] and looking at the $i \cdot \lceil n'/k \rceil$-th vertex for each $i$ in this order. Now, let $\Pi$ be a function from $F_l$ to subsets of $\{1, \dots, k\}$ where, for each triangle $\Delta$ of $M_l$, $\Pi(\Delta)$ is the subset of rectangles $l_i$ that intersect $\Delta$. We run Distribute($F_l$, $I_l$, $\Pi$) to distribute copies of the triangles to disjoint sets of machines.

Figure 3.5: One rectangle $l_i$ and its associated triangles $M_{l_i}$.

Let $I_{l_i}$ for $i \in \{1, \dots, k\}$ denote the disjoint subsets of machines that receive triangles from the Distribute primitive. Let $M_{l_i}$ be the set of triangles sent to machines $I_{l_i}$. See Figure 3.5. We recursively compute distributed contour trees of each triangulation $M_{l_i}$ using the natural restrictions of $h$ to those triangulations. In the following lemma, we prove that the distribution step above is doable in the MPC model. In the sequel, we describe how to combine the trees returned by each recursive call to create the top level subtree for $T_l$.

Lemma 30. Our algorithm meets the preconditions of the Distribute primitive.

Proof. By assumption, $I_l$ is a contiguous set of machines, as required. We meet the requirement that $k = O(s^{1/2})$ as $s = \Omega(n^{1/2+\varepsilon})$. Similarly, we meet the requirement that $k \le |I|$. Each rectangle $l_i$ contains $O(n'/k)$ points, so it receives $O(n'/k)$ non-boundary triangles. Further, by the assumption that each line in $\mathbb{R}^2$ intersects at most $O(\sqrt{n})$ edges, there are at most $O(\sqrt{n}) = O(n'/k)$ boundary triangles for each $l_i$. Together, these facts imply that for each $i \in \{1, \dots, k\}$, there are at most $O(n'/k)$ triangles $\Delta$ for which $i \in \Pi(\Delta)$, as required. Finally, each triangle is copied at most $n^\varepsilon = O(s)$ times, as required.

3.4.3 Combining multiple contour trees

Let $T_{l_i}$ be the contour tree for $M_{l_i}$, and let $T'_{l_i}$ be the tree returned by the recursive procedure computing $T_{l_i}$. We send all of these trees to a single machine $\beta \in I_l$ in $O(1)$ rounds of computation. By Lemma 29, each such tree has size $O(\sqrt{n})$, so they all fit on machine $\beta$ simultaneously. The rest of the computation occurs sequentially on machine $\beta$. Recall that $T'_{l_i}$ is the contour tree for a surface $\Sigma'_{l_i}$. We create the top level structure for $T_l$ using the following procedure: let $\Sigma''_l$ be the terrain that results from replacing the portion of $\Sigma'_l$ within each rectangle $l_i$ with the portion of $\Sigma'_{l_i}$ inside $l_i$. We say vertex $v$ is important if it appears as a vertex of any tree $T'_{l_i}$.

We use the following procedure to compute the join tree $J''_l$ of $\Sigma''_l$. We compute the join tree $J'_{l_i}$ for each contour tree $T'_{l_i}$. Next, we sort the entire collection of important vertices in increasing order of height. We then use a set union-find data structure [64] to track the sublevel sets as we consider vertices of increasing height values. The sets of the union-find data structure will be collections of important vertices, and a find operation on an important vertex $v$ will return the highest important vertex $v'$ in the same sublevel set component as $v$. For each important vertex $v$ in increasing order by height, we look at its down neighbors from each of the join trees $J'_{l_i}$ that contain $v$ and do a find operation. The down neighbors of $v$ in $J''_l$ will be all the vertices returned by the find operations.
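The following Python sketch illustrates this union-find merge on a sequential encoding of the join trees; the input layout (`height` values and the collected down neighbors from the trees $J'_{l_i}$) is a hypothetical assumption for illustration only.

```python
# A minimal sketch (hypothetical data layout, not the dissertation's code) of
# the join-tree merge: important vertices are processed by increasing height,
# and a union-find structure returns the highest important vertex currently
# representing each sublevel-set component.
class UnionFind:
    def __init__(self):
        self.parent, self.top = {}, {}   # top: highest vertex per component
    def add(self, v):
        self.parent[v] = v
        self.top[v] = v
    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v
    def union(self, a, b, new_top):
        ra, rb = self.find(a), self.find(b)
        self.parent[ra] = rb
        self.top[rb] = new_top

def merge_join_trees(vertices, height, down_neighbors):
    # down_neighbors[v] collects v's down neighbors over all join trees J'_{l_i}.
    uf, down_in_join_tree = UnionFind(), {}
    for v in sorted(vertices, key=lambda u: height[u]):
        uf.add(v)
        reps = {uf.top[uf.find(u)] for u in down_neighbors[v]}
        down_in_join_tree[v] = reps      # down neighbors of v in J''_l
        for r in reps:
            uf.union(r, v, new_top=v)
    return down_in_join_tree
```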

Lemma 31. The previous algorithm correctly computes the join tree $J''_l$ of $\Sigma''_l$.

Proof. We prove the lemma by induction on the increasing heights of important vertices. Let $v$ be an important vertex, and consider any sublevel set component $C$ in $\Sigma''_l$ at height $h(v) - \delta$, for an arbitrarily small $\delta$, such that $C$ lies within the sublevel set component of $v$ (assuming $C$ exists). Vertex $v$ has a neighbor $u$ that lies in $C$. Let $M_{l_i}$ be a set of triangles containing the edge $(v, u)$. Let $v'$ be the highest important vertex of $\Sigma'_{l_i}$ with height strictly below $v$ that shares a sublevel set component of $\Sigma'_{l_i}$ with $v$ and $u$. Vertex $v'$ is a down neighbor of $v$ in $J'_{l_i}$ that lies in component $C$. By induction, the find operation on $v'$ returns the highest important vertex of $C$, which the algorithm correctly assigns as a down neighbor of $v$ in $J''_l$.

The split tree $S''_l$ of $\Sigma''_l$ can be computed using a similar procedure. Finally, we use the procedure of Carr et al. [56] to create the contour tree $T''_l$ of $\Sigma''_l$. Tree $T''_l$ is the top level structure used to store $T_l$ in a distributed manner.

Lemma 32. Procedure ContourTree takes $O(1)$ rounds of computation, $O(s \log s)$ time, and $O(n \log n)$ work.

Proof. The number of levels of recursion of the procedure is $O(\log_{n^\varepsilon} n) = O(1)$, and each level of recursion uses only $O(1)$ rounds of computation. For time and work, the most expensive operation is building the join and split trees to compute a contour tree $T''_l$. These operations require sorting $O(s)$ vertices in $O(s \log s)$ time. Summing over all the vertices stored on all machines, the total amount of work is $O(n \log n)$.

We thus have the following.

Theorem 33. Let $M = (V, E, F)$ be a triangulation in $\mathbb{R}^2$ such that every line in $\mathbb{R}^2$ intersects $O(\sqrt{n})$ edges of $M$, where $n = |V|$. Let $h : M \to \mathbb{R}$ be a piecewise linear height function on $M$. The distributed contour tree $T$ of $h$ can be built in the MPC model in $O(1)$ computation rounds, $O(s \log s)$ time, and $O(n \log n)$ work, assuming $s = \Omega(n^{1/2+\varepsilon})$ for some constant $\varepsilon > 0$.

3.5 Conclusion

We described two algorithms for the MPC model that aid in terrain analysis. The first was a Las Vegas algorithm to compute the Delaunay triangulation of an arbitrary point set in $\mathbb{R}^2$. With high probability, it runs in $O(1)$ rounds of computation, and it uses only $O(s \log^2 n)$ time and $O(n \log n)$ work. The second algorithm deterministically constructs a contour tree of a triangulation $M$, assuming the machines have sufficient memory and $M$ is well-formed. This algorithm requires only $O(1)$ rounds of computation, $O(s \log s)$ time, and $O(n \log n)$ work. We leave the removal of these restrictions from our contour tree construction algorithm as an interesting open problem. For an arbitrary triangulation $M$, a possible approach is to build a small, well-balanced planar separator with small boundary, build the contour tree for each piece in parallel, and sew the pieces together along the boundary. However, computing a separator in the MPC model appears difficult and will require new ideas.

4 Comparing Topological Descriptors and Metric Spaces

4.1 Introduction

The Gromov-Hausdorff distance (or GH distance for brevity) [92] is one of the most natural distance measures between metric spaces, and has been used, for example, for matching deformable shapes [119, 48] and for analyzing hierarchical clustering trees [55]. Informally, the Gromov-Hausdorff distance measures the additive distortion suffered when mapping one metric space to another using a correspondence between their points. Even when working with spaces of equal size, correspondences such as those computed for the GH distance are often preferable to, say, bijections, because the distance is less sensitive to slight stretching or to the exact choice of points sampled from a larger space (see Figure 4.1). Multiple approaches have been proposed to estimate the Gromov-Hausdorff distance [119, 48, 118]. Despite much effort, the problem of computing the GH distance, either exactly or approximately, has remained elusive. The problem is not known to be NP-hard, and computing the GH distance, even approximately, for graphic metrics¹ is at least as hard as the graph isomorphism problem. Indeed, the metrics for two graphs have GH distance 0 if and only if the two graphs are isomorphic. Motivated by this trivial hardness result, it is natural to ask whether the GH distance becomes easier in more restrictive settings such as geodesic metrics over trees, where efficient algorithms are known for checking isomorphism [24].

¹ A graphic metric measures the shortest path distance between vertices of a graph with unit length edges.

4.1.1 Related work

Most prior work has focused either on embedding one metric space into another, or on computing a bijection between two equal-sized finite metric spaces that minimizes the multiplicative distortion (see Section 1.3). The work closest to computing the GH distance is the additive distortion problem [94]: given two equal-sized point sets $S, T \subset \mathbb{R}^d$, find the smallest $\Delta$ such that there exists a bijection $f : S \to T$ with $|d(x, y) - d(f(x), f(y))| \le \Delta$. They show that it is NP-hard to approximate by a factor better than 3 in $\mathbb{R}^3$, and they also give a 2-approximation for $\mathbb{R}^1$ and a 5-approximation for the more general problem of embedding an arbitrary metric space onto $\mathbb{R}^1$. However, their setting differs from ours in two major ways: firstly, they consider finite metric spaces of equal size, whereas in our work the metric spaces may be uncountably infinite; secondly, they consider bijections between metric spaces, whereas we deal with correspondences between metric spaces, which are more general than bijections. In particular, an optimal correspondence between arbitrary spaces of equal size may not be a bijection (see Figure 4.1), so these hardness results do not apply to the GH distance. Thus, their approach cannot be easily extended to our setting.

Figure 4.1: Solid and hollow points sampled from distinct metric spaces; each metric space contains two distinct clusters with similar intra- and inter-cluster distances, but the points are sampled differently for each. Using correspondences between the two sets will lead to a small distortion. However, using bijections will map points from the same cluster to different clusters in the other metric space, leading to a much larger distortion.

The interleaving distance between merge trees [121] was proposed as a measure for comparing functions over topological domains that is stable to small perturbations of a function. Distances for the more general Reeb graphs are given in [38, 69]. These concepts are related to the GH distance (Section 4.4), a relation we leverage to design an approximation algorithm for the GH distance between metric trees.

4.1.2 Our results

In this chapter, we give the first non-trivial results on approximating the GH distance between metric trees. First, we prove (in Section 4.3) that the problem remains NP-hard even for metric trees, via a reduction from 3-Partition. In fact, we show that there exists no efficient algorithm with approximation ratio less than 3 unless

$P = NP$. As noted above, we are not aware of any result that shows the GH distance problem to be NP-hard even for general graphic metrics. To complement our hardness result, we give an $O(\sqrt{n})$-approximation algorithm for the GH distance between metric trees with $n$ nodes and unit length edges. Our algorithm works with arbitrary edge lengths as well; however, the approximation ratio becomes $O(\min\{n, \sqrt{rn}\})$, where $r$ is the ratio of the longest edge length in both trees to the shortest edge length. Even achieving the $O(n)$-approximation ratio presented here for arbitrary $r$ is a non-trivial task, and we hope this first non-trivial approximation result will stimulate further research on this and similar problems.

Our algorithm uses a reduction, described in Section 4.4, to the similar problem of computing the interleaving distance [121] between two merge trees. Given a function $f : X \to \mathbb{R}$ over a topological space $X$, the merge tree $T_f$ describes the connectivity between components of the sublevel sets of $f$ (see Section 4.2 for a more formal definition). Morozov et al. [121] proposed the interleaving distance as a way to compare merge trees and their associated functions². For us, the interleaving distance provides a convenient way to measure distance after rooting the input trees so that correspondences are more nicely structured. We describe, in Section 4.5, an $O(\min\{n, \sqrt{rn}\})$-approximation algorithm for the interleaving distance between merge trees, and our reduction provides a similar approximation for computing the GH distance between two metric trees.

Note on independent and follow-up work. After writing preliminary versions of the current work, we became aware of recent work by Schmiedl [135] in which he proves a lower bound of 3 on the approximation ratio of any polynomial-time approximation algorithm for the GH distance between metric trees. As in our proof, his proof ultimately comes down to a reduction from 3-Partition. He also gives an approximation algorithm for arbitrary metrics, but its running time is not guaranteed to be polynomial in the size of the input. There is also follow-up work [140] on a fixed-parameter tractable algorithm for approximating the GH distance between tree metrics to within a constant factor.

4.2 Preliminaries

Metric spaces and the Gromov-Hausdorff distance. A metric space $\mathcal{X} = (X, \rho)$ consists of a (potentially infinite) set $X$ and a function $\rho : X \times X \to \mathbb{R}_{\ge 0}$ such that the following hold: $\rho(x, y) = 0$ iff $x = y$; $\rho(x, y) = \rho(y, x)$; and $\rho(x, z) \le \rho(x, y) + \rho(y, z)$. Given sets $A$ and $B$, a correspondence between $A$ and $B$ is a set $C \subseteq A \times B$ such that: (i) for all $a \in A$, there exists $b \in B$ such that $(a, b) \in C$; and (ii) for all $b \in B$, there exists $a \in A$ such that $(a, b) \in C$. We use $\Pi(A, B)$ to denote the set of all correspondences between $A$ and $B$.

² In fact, our hardness result can be easily extended to the GH distance between graphic metrics for trees and the interleaving distance between merge trees.

Let $\mathcal{X}_1 = (X_1, \rho_1)$ and $\mathcal{X}_2 = (X_2, \rho_2)$ be two metric spaces. The distortion of a correspondence $C \in \Pi(X_1, X_2)$ is defined as:

$$\mathrm{Dist}(C) = \sup_{(x,y),(x',y') \in C} |\rho_1(x, x') - \rho_2(y, y')|.$$

The Gromov-Hausdorff distance [118], $d_{GH}$, between $\mathcal{X}_1$ and $\mathcal{X}_2$ is defined as:
$$d_{GH}(\mathcal{X}_1, \mathcal{X}_2) = \frac{1}{2} \inf_{C \in \Pi(X_1, X_2)} \mathrm{Dist}(C).$$

Intuitively, $d_{GH}$ measures how close we can get to an isometric (distance-preserving) embedding between two metric spaces. We note that there are different equivalent definitions of the Gromov-Hausdorff distance; see, e.g., Theorem 7.3.25 of [54] and Remark 1 of [118]. A tree $T = (V, E)$ consists of a set of nodes $V$ and edges $E$ connecting pairs of nodes such that some geometric realization of $T$ is simply connected. Note that our definition allows nodes of degree 2. The leaves of $T$ are the nodes of degree 1, and its branching nodes are the nodes of degree 3 or higher. Given a tree $T = (V, E)$ and a length function $l : E \to \mathbb{R}_{\ge 0}$, we associate a metric space $\mathcal{T} = (\mathbb{T}, d)$ with $T$ as follows. $\mathbb{T}$ is a geometric realization of $T$. For $x, y \in \mathbb{T}$, define $d(x, y)$ to be the length of the unique shortest path $\pi(x, y)$ in $\mathbb{T}$. It is clear that $d$ is a metric. The metric space thus obtained is a metric tree. We often do not distinguish between $T$ and $\mathbb{T}$ and write $\mathcal{T} = (T, d)$.
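For intuition, the following Python sketch evaluates the distortion of a given correspondence between two finite metric spaces, matching the definitions above; the distance functions and the example correspondence are illustrative assumptions, not part of our algorithms.

```python
# A small sketch (not from the dissertation) that evaluates Dist(C) for a
# correspondence C between two finite metric spaces; d1 and d2 are
# hypothetical distance functions.
def distortion(C, d1, d2):
    return max(abs(d1(x, xp) - d2(y, yp))
               for (x, y) in C for (xp, yp) in C)

# Example: two 3-point spaces on the real line; C pairs points in order.
d = lambda a, b: abs(a - b)
C = [(0, 0), (1, 1.5), (2, 3)]
print(distortion(C, d, d) / 2)  # Dist(C)/2 = 0.5, an upper bound on d_GH
```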

Merge trees and the interleaving distance. Let $f : X \to \mathbb{R}$ be a continuous function from a connected topological space $X$ to the set of real numbers. The sublevel set at a value $a \in \mathbb{R}$ is defined as $f_{\le a} = \{x \in X \mid f(x) \le a\}$. A merge tree $M_f$ captures the evolution of the topology of the sublevel sets as the function value is increased continuously from $-\infty$ to $+\infty$. Formally, it is obtained as follows. Let $\mathrm{epi}\, f = \{(x, y) \in X \times \mathbb{R} \mid y \ge f(x)\}$. Let $\bar{f} : \mathrm{epi}\, f \to \mathbb{R}$ be such that $\bar{f}((x, y)) = y$. We may say $\bar{f}((x, y))$ is the height of the point $(x, y) \in X \times \mathbb{R}$. For two points $(x, y)$ and $(x', y')$ in $X \times \mathbb{R}$ with $y = y'$, let $(x, y) \sim (x', y')$ denote that they lie in the same component of $\bar{f}^{-1}(y)\, (= \bar{f}^{-1}(y'))$. Then $\sim$ is an equivalence relation, and the merge tree $M_f$ is defined as the quotient space $(X \times \mathbb{R})/{\sim}$. The sublevel sets $f_{\le a}$ of $f$ merge as the function value $a$ increases, so we get a rooted tree where branching nodes represent the moments where two components merge and leaves represent the birth of a new component at a local minimum. Figure 4.2 shows an example of a merge tree for a 1-dimensional function. Note that the merge tree extends to a height of $\infty$. Our assumption that $X$ is connected implies that there is only one component in $f_{\le N}$ for some sufficiently large $N > 0$, and $M_f$ is, in fact, a single tree. We define the root of the merge tree $M_f$ to be the node with the highest function value. The ancestors of a point $x \in M_f$ are the points lying on the unique path from $x$ to the root of $M_f$, as well as all points with greater height than the root. We define the length of any edge in a merge tree (other than the edge to infinity) to be the height difference between its two endpoints. Ancestors of points in $M_g$ are defined similarly.

Figure 4.2: Merge tree $M_f$ (shown in red) for a function $f : X \to \mathbb{R}$, where $X = \mathbb{R}$. $\mathrm{epi}\, f$ is shown in grey, and $\bar{f}^{-1}(a)$ is the grey region below the blue horizontal lines in $\mathrm{epi}\, f$.

Since each point $x \in M_f$ represents a component of a sublevel set at a certain height, we can associate this height value with $x$, denoted by $\hat{f}(x)$. Given a merge tree $M_f$ and $\varepsilon \ge 0$, an $\varepsilon$-shift map $\sigma_f^\varepsilon : M_f \to M_f$ is the map that maps a point $x \in M_f$ to its ancestor at height $\hat{f}(x) + \varepsilon$, i.e., $\hat{f}(\sigma_f^\varepsilon(x)) = \hat{f}(x) + \varepsilon$. Given $\varepsilon \ge 0$ and merge trees $M_f$ and $M_g$, two continuous maps $\alpha : M_f \to M_g$ and $\beta : M_g \to M_f$ are said to be $\varepsilon$-compatible if they satisfy the following conditions:

$$\hat{g}(\alpha(x)) = \hat{f}(x) + \varepsilon \;\; \forall x \in M_f; \qquad \hat{f}(\beta(y)) = \hat{g}(y) + \varepsilon \;\; \forall y \in M_g; \qquad \beta \circ \alpha = \sigma_f^{2\varepsilon}; \qquad \alpha \circ \beta = \sigma_g^{2\varepsilon}. \tag{4.1}$$

Figure 4.3: Part of trees $M_f$ and $M_g$ showing $\alpha$ and $\beta$.

See Figure 4.3 for an example. The interleaving distance [121] is then defined as

$$d_I(M_f, M_g) = \inf\{\varepsilon \ge 0 \mid \text{there exist } \varepsilon\text{-compatible maps } \alpha \text{ and } \beta\}.$$

Remark. We can relax the requirements on $\alpha$ and $\beta$ from their normal definitions as follows. (i) Instead of requiring exact value changes, we require

$$\hat{f}(x) \le \hat{g}(\alpha(x)) \le \hat{f}(x) + \varepsilon, \;\; \forall x \in M_f; \qquad \hat{g}(y) \le \hat{f}(\beta(y)) \le \hat{g}(y) + \varepsilon, \;\; \forall y \in M_g.$$

(ii) If $x_1$ is an ancestor of $x_2$ in $M_f$, then $\alpha(x_1)$ is an ancestor of $\alpha(x_2)$ in $M_g$. A similar rule applies for $\beta$. (iii) $\beta(\alpha(x))$ must be an ancestor of $x$, and $\alpha(\beta(y))$ must be an ancestor of $y$.

Any pair of maps satisfying the original requirements also satisfies the relaxed requirements for the same value of $\varepsilon$. Conversely, for any pair of maps satisfying the relaxed requirements, we can stretch up the images of each map by instead mapping to the ancestors at the appropriate height of the original images, without changing the value of $\varepsilon$. Thus, both definitions of the interleaving distance are equivalent. For convenience, when two $\varepsilon$-compatible maps are given to us we assume that they satisfy (4.1), but we construct $\varepsilon$-compatible maps that satisfy the relaxed conditions, knowing that they can be "stretched" as just described to satisfy (4.1).

If we know $\alpha(x)$ for a point $x$ at height $h$, then we can compute $\alpha(y)$ for any ancestor $y$ of $x$ at height $h' \ge h$ by simply putting $\alpha(y) = \sigma_g^{h'-h}(\alpha(x))$. A similar claim holds for $\beta$. Thus specifying the maps for the leaves of the trees suffices, because any point in the tree is an ancestor of at least one of the leaves. Hence, these maps have a representation that requires space linear in the size of the trees.

Morozov et al. [121] prove several facts about the interleaving distance that will prove useful in this work. In particular, the interleaving distance is a metric [121, Lemma 1]. Further, it is a stable distance measure with respect to small function perturbations; specifically, given two functions $f : X \to \mathbb{R}$ and $g : X \to \mathbb{R}$, we have $d_I(M_f, M_g) \le \|f - g\|_\infty$ [121, Theorem 2].
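The following Python sketch illustrates this linear-space representation on a discretized merge tree: $\alpha$ is stored only at the leaves and recovered at ancestors by shifting. The node-based encoding (`parent_g`, `height_g`, `alpha_leaf`) is a simplifying assumption, and images are rounded down to tree nodes.

```python
# A sketch (hypothetical merge-tree encoding, not from the dissertation) of
# storing alpha only at leaves of M_f and recovering it elsewhere.
def shift_up(y, delta, parent_g, height_g):
    # sigma_g^delta discretized to nodes: climb to the highest node of M_g
    # whose height does not exceed height_g[y] + delta; the exact image lies
    # on the edge leaving this node upwards.
    target = height_g[y] + delta
    while parent_g.get(y) is not None and height_g[parent_g[y]] <= target:
        y = parent_g[y]
    return y

def alpha_at(x, leaf, height_f, alpha_leaf, parent_g, height_g):
    # alpha at an ancestor x of `leaf` is the leaf's image shifted up by the
    # height gap, mirroring alpha(y) = sigma_g^{h'-h}(alpha(x)) above.
    return shift_up(alpha_leaf[leaf], height_f[x] - height_f[leaf],
                    parent_g, height_g)
```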

4.3 Hardness of Approximation

We now show the hardness of approximating the GH distance by a reduction from the following decision problem, called balanced partition (or BAL-PART for brevity): given a multiset of positive integers $X = \{a_1, \dots, a_n\}$ and an integer $m$ with $1 \le m \le n$, is it possible to partition $X$ into $m$ multisets $\{X_1, \dots, X_m\}$ such that the elements in each multiset sum to the same quantity $\mu = (\sum_{i=1}^n a_i)/m$? We prove below that BAL-PART is strongly NP-complete, i.e., it remains NP-complete even if $a_i \le n^c$ for some constant $c \ge 1$ for all $1 \le i \le n$.

Lemma 34. BAL-PART is strongly NP-complete.

Proof. We reduce 3-PARTITION, a strongly NP-complete problem [86], to BAL-PART. Given a multiset of positive integers $Y = \{a_1, \dots, a_n\}$ with $n = 3m$, 3-PARTITION asks to partition $Y$ into $m$ multisets $\{Y_1, \dots, Y_m\}$ of size 3 each so that

the elements in each multiset sum to the same quantity. Given a 3-PARTITION instance, we construct an instance of BAL-PART as follows. Basically, we add a sufficiently large number to each $a_i$ so that if two multisets of the new numbers have the same sum, they have the same number of elements. In particular, let $\bar{a} = \sum_{i=1}^n a_i$. Then set $a'_i = a_i + \bar{a}$ and $X = \{a'_1, \dots, a'_n\}$. This reduction takes polynomial time, and the new numbers are polynomially larger than the original ones. We show that there exists an appropriate partition of $Y$ iff there exists an appropriate partition of $X$.

Suppose there exists an appropriate partition $\{Y_1, \dots, Y_m\}$ of $Y$. Then setting $X_i = \{a'_j \mid a_j \in Y_i\}$ for $i = 1, \dots, m$ gives us the desired partition of $X$. Conversely, suppose there exists an appropriate partition $\{X_1, \dots, X_m\}$ of $X$. Suppose $|X_i| = n_1 > |X_j| = n_2$ for some $i \ne j$. We thus have

$$\sum_{a'_k \in X_i} a_k + n_1 \bar{a} = \sum_{a'_k \in X_j} a_k + n_2 \bar{a} \;\Rightarrow\; (n_1 - n_2)\bar{a} = \sum_{a'_k \in X_j} a_k - \sum_{a'_k \in X_i} a_k \;\Rightarrow\; \sum_{a'_k \in X_j} a_k - \sum_{a'_k \in X_i} a_k \ge \bar{a}, \tag{4.2}$$

a contradiction since $\sum_{a'_k \in X_j} a_k < \bar{a}$. Thus, all the parts $X_i$ have equal size. Since $n = 3m$, the size of each $X_i$ is 3.
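The padding step of this reduction is simple enough to state as code; the following Python sketch mirrors the construction in the proof of Lemma 34.

```python
# A direct sketch of the padding step in the proof of Lemma 34: adding
# a-bar to every element forces equal-sum parts to have equal cardinality.
def balpart_instance_from_3partition(Y):
    m = len(Y) // 3          # a 3-PARTITION instance has n = 3m elements
    a_bar = sum(Y)
    X = [a + a_bar for a in Y]
    return X, m

# Example: Y = [1, 2, 3, 1, 2, 3] (m = 2) becomes X with every element + 12.
print(balpart_instance_from_3partition([1, 2, 3, 1, 2, 3]))
# ([13, 14, 15, 13, 14, 15], 2)
```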

Figure 4.4: The trees $T_{l,k}$, $T_1$, and $T_2$.

We now reduce an instance of BAL-PART, in which each $a_i \le n^c$ for some constant $c \ge 1$, to GH-distance computation. Consider an instance $X = \{a_1, \dots, a_n\}$ and $1 \le m \le n$ of BAL-PART. If $m = 1$, then a balanced partition into one multiset trivially exists. If $m > 1$, we construct two trees $T_1$ and $T_2$ as follows. Let $\lambda > 8$ and $\rho < \lambda - 6$ be two positive constants. Let $T_{l,k}$ denote a star graph having $k$ edges, each of length $l$. $T_1$ has a node $r_1$ incident to an edge $(r_1, r'_1)$ of length $\rho$ and to $n$ edges $\{(r_1, p_1), \dots, (r_1, p_n)\}$ of length 1. Each node $p_i$ is incident to another edge $(p_i, p'_i)$ of length 1, and each $p'_i$ is the center of a copy of $T_{\lambda, a_i}$. $T_2$ consists of a node $r_2$ incident to an edge $(r_2, r'_2)$ of length $\rho$ and to $m$ edges $\{(r_2, q_1), \dots, (r_2, q_m)\}$ of length 2, where each $q_i$ is the center of a distinct copy of $T_{\lambda+1, \bar{a}}$, and $\bar{a} = (\sum_{i=1}^n a_i)/m$. See Figure 4.4 for an illustration. We refer to the edges of each $T_{\lambda, a_i}$ in $T_1$ and the edges of the copies of $T_{\lambda+1, \bar{a}}$ in $T_2$ as bottom edges. Let $\mathcal{T}_1$ and $\mathcal{T}_2$ denote the metric trees associated with $T_1$ and $T_2$, respectively. Since $\lambda, \rho$ are constants and $a_i \le n^c$ for all $1 \le i \le n$, this construction can be done in polynomial time.

Lemma 35. If $(X, m)$ is a yes instance of BAL-PART, then $d_{GH}(\mathcal{T}_1, \mathcal{T}_2) \le 1$. Otherwise, $d_{GH}(\mathcal{T}_1, \mathcal{T}_2) \ge 3$.

Proof. Suppose $X$ can be partitioned into $m$ subsets $X_1, \dots, X_m$ of equal weight $\bar{a} = (\sum_{i=1}^n a_i)/m$. We construct a correspondence $C$ between $T_1$ and $T_2$ with distortion at most 2, implying that $d_{GH}(\mathcal{T}_1, \mathcal{T}_2) \le 1$. A linearly interpolated bijection between the points of the edges $(r_1, r'_1)$ and $(r_2, r'_2)$, with $r_1$ mapping to $r_2$ and $r'_1$ mapping to $r'_2$, is added to $C$. If $a_i \in X_j$, the linearly interpolated bijection between edges $(r_1, p_i)$ and $(r_2, q_j)$ is added to $C$, with $p_i$ mapping to $q_j$. Also, using linear interpolation, we map the path from $p_i$ to each leaf of $T_{\lambda, a_i}$ to distinct edges going from $q_j$ to leaves of its copy of $T_{\lambda+1, \bar{a}}$; this is possible since $T_{\lambda+1, \bar{a}}$ has $\bar{a}$ leaves, and $\sum_{a \in X_j} a = \bar{a}$. Note that each edge $(p_i, p'_i)$ may be mapped to a contiguous part of multiple edges in $T_2$. Overall, the distortion induced by $C$ is at most 2; this stems from the fact that $C$ is piecewise linear, and the difference between the length of any node-to-node path in one tree and its image under $C$ in the other tree is at most 2.

Suppose $d_{GH}(\mathcal{T}_1, \mathcal{T}_2) < 3$, and let $C$ be a correspondence between $T_1$ and $T_2$ with distortion $< 6$. Consider two leaves $l_1, l_2 \ne r'_2$ in $T_2$. Then $d(l_1, l_2) \ge 2\lambda + 2$. Let $l'_1, l'_2$ be their corresponding images in $T_1$ under $C$. We argue that $l'_1, l'_2$ lie on distinct bottom edges of $T_1$. Indeed, since $\mathrm{Dist}(C) < 6$, the distance between $l'_1$ and $l'_2$ is $d(l'_1, l'_2) > d(l_1, l_2) - 6 \ge 2\lambda - 4$. If $l'_1, l'_2$ lie on the same edge of $T_1$, then $d(l'_1, l'_2) \le \lambda < 2\lambda - 4$, so they have to lie on distinct edges of $T_1$. If either of $l'_1, l'_2$ lies on an edge $(r_1, p_i)$ for some $i \le n$, then by construction, $d(l'_1, l'_2) \le \lambda + 4 < 2\lambda - 4$ (recall that $\lambda > 8$). Finally, if either of $l'_1, l'_2$ lies on $(r_1, r'_1)$, then by the choice of $\rho$, $d(l'_1, l'_2) \le \rho + \lambda + 2 < 2\lambda - 4$. Thus, both $l'_1$ and $l'_2$ lie on distinct bottom edges of $T_1$. In particular, the corresponding image $l'_1$ in $T_1$ of any one leaf $l_1$ in $T_2$ must lie on a bottom edge of $T_1$, because there exists some other leaf $l_2$ in $T_2$ by the assumption that $m > 1$. Hence, $C$ induces a bijection $\chi$ between the leaves of $T_1$ and $T_2$, where $\chi(l) = l'$ for $l \in T_2$ and $l' \in T_1$ is the leaf whose incident edge contains the image(s) of $l$ under $C$. Note that if $l_i, l_j \in T_2$ are incident to $q_i, q_j$ with $q_i \ne q_j$, then $\chi(l_i)$ and $\chi(l_j)$ are incident to $p'_{i'}, p'_{j'}$ with $p'_{i'} \ne p'_{j'}$; otherwise $d(l_i, l_j) = 2\lambda + 6$ and $d(\chi(l_i), \chi(l_j)) \le 2\lambda$, thereby incurring a distortion of at least 6. Hence, the bijection $\chi$ can be used to partition $X$ into $m$ subsets $X_1, \dots, X_m$ of equal weight as follows: if $\chi(l) = l'$ for $l$ incident to $q_i$ and $l'$ incident to $p'_j$, then $a_j \in X_i$. Thus, $(X, m)$ is a yes instance of BAL-PART.

We may also apply the reduction to metric trees with unit edge lengths by subdividing longer edges with an appropriate number of vertices. We thus have the following theorem.

Theorem 36. Unless $P = NP$, there is no polynomial-time algorithm to approximate the Gromov-Hausdorff distance between two metric trees to a factor better than 3, even in the case of metric trees with unit edge lengths.

4.4 Gromov-Hausdorff and Interleaving Distances

In this section we show that the GH distance between two tree metric spaces $\mathcal{T}_1$ and $\mathcal{T}_2$, and the interleaving distance between two appropriately defined merge trees induced by the $\mathcal{T}_i$'s, are within constant factors of each other. Given a metric tree $\mathcal{T} = (T, d)$, let $V(T)$ denote the nodes of the tree. Given a point $s \in T$ (not necessarily a node), let $f_s : T \to \mathbb{R}$ be defined as $f_s(x) = -d(s, x)$. Equipped with this function, we obtain a merge tree $T^s$ from $T$. Intuitively, $T^s$ has the structure of $T$ rooted at $s$, with an extra edge incident to $s$ whose function value extends from 0 to $+\infty$. If $s$ is an internal node of $T$ or an interior point of an edge of $T$, $s$ remains the root of $T^s$. But if $s$ is a leaf of $T$, then $s$ gets merged with the infinite edge, and the node of $T$ adjacent to $s$ becomes the root of $T^s$. Let $\mathcal{T}_1 = (T_1, d_1)$ and $\mathcal{T}_2 = (T_2, d_2)$ be two metric trees. Define

$$\Delta = \min_{u \in V(T_1),\, v \in V(T_2)} d_I(T_1^u, T_2^v). \tag{4.3}$$

We prove that $\Delta$ is within a constant factor of $d_{GH}(\mathcal{T}_1, \mathcal{T}_2)$. We first prove a lower bound on $\Delta$.

Lemma 37. $\frac{1}{2} d_{GH}(\mathcal{T}_1, \mathcal{T}_2) \le \Delta$.

Proof. Suppose $\Delta = d_I(T_1^s, T_2^t)$ for some $s \in V(T_1)$ and $t \in V(T_2)$. Set $f := f_s$ and $g := f_t$. Let $\alpha : T_1^s \to T_2^t$ and $\beta : T_2^t \to T_1^s$ be $\Delta$-compatible maps. We define the functions $\alpha^* : T_1^s \to T_2^t$ and $\beta^* : T_2^t \to T_1^s$ as follows:

$$\alpha^*(x) = \begin{cases} \alpha(x) & \text{if } g(\alpha(x)) \le 0, \\ t & \text{otherwise;} \end{cases} \qquad \beta^*(y) = \begin{cases} \beta(y) & \text{if } f(\beta(y)) \le 0, \\ s & \text{otherwise.} \end{cases}$$

That is, if $\alpha(x)$ (resp. $\beta(y)$) is an ancestor of $t$ (resp. $s$), then $x$ (resp. $y$) is mapped to the root $t$ (resp. $s$). We note that

$$f(x) \le g(\alpha^*(x)) \le f(x) + \Delta, \qquad g(y) \le f(\beta^*(y)) \le g(y) + \Delta. \tag{4.4}$$

Indeed, if $\alpha^*(x) = \alpha(x)$ then $g(\alpha^*(x)) = f(x) + \Delta$. Otherwise $g(\alpha(x)) > 0$ and $g(\alpha^*(x)) = 0$. Since $f(x) \le 0$, we obtain $f(x) \le g(\alpha^*(x)) < g(\alpha(x)) = f(x) + \Delta$. The same argument implies the second set of inequalities.

Consider the correspondence $C \subseteq T_1 \times T_2$ induced by $\alpha^*$ and $\beta^*$, defined as:

$$C := \{(x, \alpha^*(x)) \mid x \in T_1\} \cup \{(\beta^*(y), y) \mid y \in T_2\}.$$

We prove that $\mathrm{Dist}(C) \le 4\Delta$. Indeed, consider any two pairs $(x_1, y_1), (x_2, y_2) \in C$. Let $u$ be the least common ancestor of $x_1$ and $x_2$ in $T_1$ (if $T_1$ is rooted at $s$), and $w$ the least common ancestor of $y_1$ and $y_2$ in $T_2$ (if $T_2$ is rooted at $t$). Note that since $T_1$ and $T_2$ are trees, there is a unique path $x_1 \rightsquigarrow u \rightsquigarrow x_2$ between $x_1$ and $x_2$, such that $x_1 \rightsquigarrow u$ and $u \rightsquigarrow x_2$ are each monotone in $f$-values. This also implies that $d_1(x_1, u) = d_1(s, x_1) - d_1(s, u) = f(u) - f(x_1)$; similarly, $d_1(x_2, u) = f(u) - f(x_2)$. Symmetric statements hold for $y_1 \rightsquigarrow w \rightsquigarrow y_2$. Hence

$$d_1(x_1, x_2) = d_1(x_1, u) + d_1(u, x_2) = 2f(u) - f(x_1) - f(x_2),$$
$$d_2(y_1, y_2) = d_2(y_1, w) + d_2(w, y_2) = 2g(w) - g(y_1) - g(y_2).$$

We then have,

$$|d_1(x_1, x_2) - d_2(y_1, y_2)| = |2f(u) - f(x_1) - f(x_2) - 2g(w) + g(y_1) + g(y_2)|$$
$$\le 2|f(u) - g(w)| + |f(x_1) - g(y_1)| + |f(x_2) - g(y_2)| \le 2|f(u) - g(w)| + 2\Delta \quad \text{(by (4.4))}.$$

We now wish to bound $|f(u) - g(w)|$. Consider the case where $y_1 = \alpha^*(x_1)$ and $y_2 = \alpha^*(x_2)$. As we traverse the path $x_1 \rightsquigarrow u \rightsquigarrow x_2$, the image of this path under $\alpha^*$ is a path $y_1 \rightsquigarrow \alpha^*(u) \rightsquigarrow y_2$, where $y_1 \rightsquigarrow \alpha^*(u)$ and $\alpha^*(u) \rightsquigarrow y_2$ are each monotone in $g$-values. Hence, $\alpha^*(u)$ is an ancestor of $y_1$ and $y_2$. Since $w$ is the least common ancestor of $y_1$ and $y_2$, we have that $\alpha^*(u)$ must be an ancestor of $w$. The same claim can be similarly proved for the remaining cases. Further, it can be similarly shown that $\beta^*(w)$ must be an ancestor of $u$. Thus, $f(u) - \Delta \le g(w) \le f(u) + \Delta$, so $|f(u) - g(w)| \le \Delta$. We thus have

$$|d_1(x_1, x_2) - d_2(y_1, y_2)| \le 4\Delta.$$

It then follows that $\mathrm{Dist}(C) \le 4\Delta$. Since $d_{GH}(\mathcal{T}_1, \mathcal{T}_2) \le \frac{1}{2}\mathrm{Dist}(C)$, the lemma follows.

Next, we prove an upper bound on ∆.

Lemma 38. $\Delta \le 14\, d_{GH}(\mathcal{T}_1, \mathcal{T}_2)$.

Proof. Set $\delta = d_{GH}(\mathcal{T}_1, \mathcal{T}_2)$ and let $C^* \subseteq T_1 \times T_2$ be an optimal correspondence that achieves $d_{GH}(\mathcal{T}_1, \mathcal{T}_2)$. Note that in general $d_{GH}(\mathcal{T}_1, \mathcal{T}_2)$ may only be achieved in the limit. In that case, our proof can be modified by considering a near-optimal correspondence of distortion $\delta = d_{GH}(\mathcal{T}_1, \mathcal{T}_2) + \varepsilon$ for some arbitrary $\varepsilon > 0$. Let $s$ be one of the endpoints of a longest simple path in $T_1$ (i.e., the length of this path realizes the diameter of $T_1$); $s$ is necessarily a leaf of $T_1$. Let $(s, t)$ be a

pair in $C^*$. Consider the merge trees $T_1^s$ and $T_2^t$ defined by the functions $f_s$ and $f_t$, respectively. A result by Dey, Shi, and Wang [72, Corollary 6] (see also [73]) implies that
$$d_I(T_1^s, T_2^t) \le 6\delta.$$

We prove below in Claim 39 that there is a vertex (in fact a leaf) $z \in V(T_2)$ such that $d_2(t, z) \le 8\delta$. It is easy to verify that

$$\|f_t - f_z\|_\infty \le d_2(t, z) \le 8\delta.$$

On the other hand, by the stability theorem of the interleaving distance,

$$d_I(T_2^t, T_2^z) \le \|f_t - f_z\|_\infty \le 8\delta.$$
By the triangle inequality,

$$d_I(T_1^s, T_2^z) \le d_I(T_1^s, T_2^t) + d_I(T_2^t, T_2^z) \le 6\delta + 8\delta = 14\delta.$$

This completes the proof of the lemma.

Claim 39. Let $s$ be an endpoint of a longest simple path in $T_1$, and let $(s, t)$ be a pair in $C^*$. Then there is a vertex $z \in V(T_2)$ such that $d_2(t, z) \le 8\delta$.

Proof. Assume that there is no tree node within distance $8\delta$ of $t$. In this case, $t$ must be in the interior of an edge $e \in E(T_2)$. Let $u_1$ and $u_2$ be the two points of $e$ on opposite sides of $t$ such that $d_2(t, u_1) = d_2(t, u_2) = 8\delta + \nu$, where $\nu > 0$ is an arbitrarily small value. Both $u_1$ and $u_2$ exist, as there is no tree node of $T_2$ within distance $8\delta$ of $t$, and

$$d_2(u_1, u_2) = d_2(t, u_1) + d_2(t, u_2) = 16\delta + 2\nu.$$

Let $\tilde{u}_1, \tilde{u}_2 \in T_1$ be any corresponding points for $u_1$ and $u_2$ under $C^*$, that is, $(\tilde{u}_1, u_1), (\tilde{u}_2, u_2) \in C^*$. Since $\mathrm{Dist}(C^*) \le 2\delta$, we have

$$d_1(\tilde{u}_1, \tilde{u}_2) \ge 14\delta + 2\nu. \tag{4.5}$$

On the other hand, since $d_2(t, u_1) = d_2(t, u_2) = 8\delta + \nu$, we have that

$$d_1(s, \tilde{u}_1),\; d_1(s, \tilde{u}_2) \in [6\delta + \nu,\, 10\delta + \nu]. \tag{4.6}$$

We now obtain an upper bound on $d_1(\tilde{u}_1, \tilde{u}_2)$. If $\tilde{u}_1$ and $\tilde{u}_2$ have an ancestor/descendant relation in $T_1^s$, then $d_1(\tilde{u}_1, \tilde{u}_2) = |d_1(s, \tilde{u}_1) - d_1(s, \tilde{u}_2)|$ and, by (4.6), we thus have that $d_1(\tilde{u}_1, \tilde{u}_2) \le 4\delta$, which contradicts (4.5). Otherwise, if $\tilde{u}_1$ and $\tilde{u}_2$ do not have an ancestor/descendant relation, let $w$ be the

Figure 4.5: $w$ is the nearest common ancestor of $\tilde{u}_1$ and $\tilde{u}_2$ in $T_1^s$.

nearest common ancestor of $\tilde{u}_1$ and $\tilde{u}_2$ in $T_1^s$ (see Figure 4.5). Let $c_0 = d_1(s, w)$. For simplicity, set $a = d_1(s, \tilde{u}_1)$ and $b = d_1(s, \tilde{u}_2)$. It then follows that

$$d_1(\tilde{u}_1, \tilde{u}_2) = a + b - 2c_0. \quad \text{Note that } a \ge c_0,\; b \ge c_0. \tag{4.7}$$

Since $s$ is an endpoint of the longest path in $T_1$, it follows that $c_0 \ge \min\{a - c_0, b - c_0\}$. Indeed, if this were not the case, then without loss of generality, suppose the other point $s'$ of the diameter pair is not in the subtree of $T_1^s$ rooted at $\tilde{u}_1$. We then have
$$d_1(\tilde{u}_1, s') = (a - c_0) + (b - c_0) > c_0 + (b - c_0) = d_1(s, s'), \tag{4.8}$$
a contradiction. By (4.6), $a, b \ge 6\delta + \nu$. Thus
$$c_0 \ge \min\{a - c_0, b - c_0\} \ge 6\delta + \nu - c_0 \;\Rightarrow\; c_0 \ge \tfrac{1}{2}(6\delta + \nu). \tag{4.9}$$
Combining (4.7) and (4.9), we have

$$d_1(\tilde{u}_1, \tilde{u}_2) \le a + b - 6\delta - \nu \le 20\delta + 2\nu - 6\delta - \nu = 14\delta + \nu, \tag{4.10}$$

contradicting (4.5). Thus, there exists $z \in V(T_2)$ such that $d_2(t, z) \le 8\delta$.

Remark. The proof of Claim 39 actually shows that $t$ lies within distance $8\delta$ of at least one leaf, as we never use the fact that $u_1$ and $u_2$ lie on the same edge of $T_2$. The only fact we use is that $u_1$ and $u_2$ lie on opposite sides of $t$ at distance $8\delta + \nu$ each.

From Lemmas 37 and 38, we get the following.

Theorem 40. Let $\Delta = \min_{u \in V(T_1), v \in V(T_2)} d_I(T_1^u, T_2^v)$. Then
$$\tfrac{1}{2}\, d_{GH}(\mathcal{T}_1, \mathcal{T}_2) \le \Delta \le 14\, d_{GH}(\mathcal{T}_1, \mathcal{T}_2).$$

In order to approximate $d_{GH}(\mathcal{T}_1, \mathcal{T}_2)$, we merely need to approximate the interleaving distance for each of the $O(n^2)$ pairs of merge trees obtained by rooting $T_1$ and $T_2$ at each of their vertices.

Corollary 41. If there is a polynomial time $c$-approximation algorithm for the interleaving distance between two merge trees, then there is a polynomial time $28c$-approximation algorithm for the Gromov-Hausdorff distance between two metric trees that runs the algorithm for the interleaving distance $O(n^2)$ times.
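At a high level, the resulting approximation algorithm is a double loop over rootings; in the Python sketch below, `approx_interleaving` and the tree interface (`nodes`, `rooted_at`) are hypothetical placeholders for the procedure of Section 4.5.

```python
# A high-level sketch of the reduction behind Corollary 41 (interface names
# are hypothetical): take the minimum approximate interleaving distance over
# all O(n^2) rootings.
def approx_gh(T1, T2, approx_interleaving):
    best = float('inf')
    for u in T1.nodes:                    # root T1 at u: merge tree T1^u
        for v in T2.nodes:                # root T2 at v: merge tree T2^v
            best = min(best, approx_interleaving(T1.rooted_at(u),
                                                 T2.rooted_at(v)))
    # By Theorem 40, (1/2) d_GH <= Delta <= 14 d_GH, so with a c-approximate
    # interleaving procedure, `best` approximates d_GH within a factor 28c.
    return best
```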

4.5 Computing the Interleaving Distance

Let $M_f$ and $M_g$ be the merge trees of two functions $f$ and $g$, respectively. For simplicity, we use $f$ and $g$ to denote the height functions on $M_f$ and $M_g$ as well. Let $n$ be the total number of nodes in $M_f$ and $M_g$, and let $r \ge 1$ be the ratio between the lengths of the longest and the shortest edges in $M_f$ and $M_g$. We describe an $O(\min\{n, \sqrt{rn}\})$-approximation algorithm for computing $d_I(M_f, M_g)$.

Candidate values and binary search. We first show that a candidate set $\Lambda$ of $O(n^2)$ values can be computed in $O(n^2)$ time such that $d_I(M_f, M_g) \in \Lambda$. Given $\Lambda$, we perform a binary search on $\Lambda$. At each step, we use a $c$-approximate decision procedure, for $c = c_1 \min\{n, \sqrt{rn}\}$ and some constant $c_1$, that given a value $\varepsilon > 0$ does the following: if $d_I(M_f, M_g) \le \varepsilon$, it returns a pair of $c\varepsilon$-compatible maps between $M_f$ and $M_g$; if $d_I(M_f, M_g) > \varepsilon$, it either returns a pair of $c\varepsilon$-compatible maps between $M_f$ and $M_g$ or reports that no pair of $\varepsilon$-compatible maps exists. We perform the binary search using the $c$-approximate decision procedure in the following way: if the procedure returns a pair of $c\varepsilon$-compatible maps for a value $\varepsilon \in \Lambda$, we continue the search using only smaller values of $\varepsilon$. Otherwise, we continue the search using only larger values of $\varepsilon$. When there are no more candidate values to search, we return the maps for the minimum value of $\varepsilon$ that yielded a pair of maps. The above procedure returns a pair of $c\varepsilon$-compatible maps where $d_I(M_f, M_g) \le c\varepsilon \le c\, d_I(M_f, M_g)$. Indeed, the procedure does not run out of candidate values $\varepsilon < d_I(M_f, M_g)$ until it has tried some $\varepsilon' \le d_I(M_f, M_g)$ for which the approximate decision procedure returned a pair of $c\varepsilon'$-compatible maps. We now describe the candidate set $\Lambda$. Let $V_f$ (resp. $V_g$) be the set of nodes in $M_f$ (resp. $M_g$). We define $\Lambda = \Lambda_{11} \cup \Lambda_{22} \cup \Lambda_{12}$, where

$$\Lambda_{11} = \{\tfrac{1}{2}|f(u) - f(v)| \mid u, v \in V_f\},$$
$$\Lambda_{22} = \{\tfrac{1}{2}|g(u) - g(v)| \mid u, v \in V_g\},$$
$$\Lambda_{12} = \{|f(u) - g(v)| \mid u \in V_f,\, v \in V_g\}.$$
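Enumerating $\Lambda$ is straightforward; the following Python sketch builds it from the lists of node heights (an assumed input format for illustration).

```python
# A direct sketch of building the candidate set Lambda from node heights;
# f_vals and g_vals are hypothetical lists of node heights of M_f and M_g.
def candidate_values(f_vals, g_vals):
    lam = {abs(a - b) / 2 for a in f_vals for b in f_vals}   # Lambda_11
    lam |= {abs(a - b) / 2 for a in g_vals for b in g_vals}  # Lambda_22
    lam |= {abs(a - b) for a in f_vals for b in g_vals}      # Lambda_12
    return sorted(lam)   # O(n^2) values, ready for the binary search
```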

Lemma 42. $d_I(M_f, M_g) \in \Lambda$.

Proof. Suppose to the contrary that $d_I(M_f, M_g) = \varepsilon \notin \Lambda$. Let $\alpha : M_f \to M_g$ and $\beta : M_g \to M_f$ be $\varepsilon$-compatible maps that realize $d_I(M_f, M_g) = \varepsilon$. We will obtain a contradiction by choosing $\varepsilon_0 > 0$ and constructing $(\varepsilon - \varepsilon_0)$-compatible maps $\hat{\alpha}, \hat{\beta}$. For any point $x \in M_f$, we define $\alpha_\downarrow(x) = \alpha(x)$ if $\alpha(x)$ is a node of $M_g$; otherwise $\alpha_\downarrow(x)$ is the lower endpoint of the edge of $M_g$ containing $\alpha(x)$. Similarly we define the function $\beta_\downarrow : M_g \to M_f$. For every node $v \in V_f$, $\alpha(v)$ (resp. $\beta(\alpha(v))$) lies in the interior of an edge of $M_g$ (resp. $M_f$), because $\varepsilon \notin \Lambda \supseteq \Lambda_{12}$ (resp. $\Lambda_{11}$). We define

$$\delta_v = \min\{\tfrac{1}{2}\big(f(\beta(\alpha(v))) - f(\beta_\downarrow(\alpha(v)))\big),\; g(\alpha(v)) - g(\alpha_\downarrow(v))\}.$$

Figure 4.6: Trees $M_f$ and $M_g$. Here $\delta_v = \min\{\tfrac{1}{2}\varepsilon_1, \varepsilon_2\}$.

See Figure 4.6. Similarly we define $\delta_w$ for all $w \in V_g$. We set

$$\varepsilon_0 = \min\{\varepsilon,\; \min_{v \in V_f \cup V_g} \delta_v\}.$$

Since $\varepsilon \notin \Lambda$, we have $\varepsilon_0 > 0$. We now construct $(\varepsilon - \varepsilon_0)$-compatible maps $\hat{\alpha} : M_f \to M_g$ and $\hat{\beta} : M_g \to M_f$ (note $\varepsilon_0 \le \varepsilon$, so $\varepsilon - \varepsilon_0$ is non-negative). We describe the construction of $\hat{\alpha}$; $\hat{\beta}$ is constructed similarly. By construction, for any node $u \in V_f$, $g(\alpha(u)) - g(\alpha_\downarrow(u)) \ge \varepsilon_0$, so we set $\hat{\alpha}(u)$ to be the point $w$ on the edge of $M_g$ containing $\alpha(u)$ such that $g(w) = f(u) + \varepsilon - \varepsilon_0$. Once we have defined $\hat{\alpha}(u)$ and $\hat{\alpha}(v)$ for an edge $uv \in M_f$ with $f(u) < f(v)$, we set $\hat{\alpha}(x)$, for a point $x \in uv$ with $f(x) = f(u) + \gamma$, to be
$$\hat{\alpha}(x) = \sigma_g^{\gamma}(\hat{\alpha}(u)).$$

That is, we set $\hat{\alpha}(x)$ to be the ancestor of $\hat{\alpha}(u)$ at height $f(u) + \varepsilon - \varepsilon_0 + \gamma = f(x) + \varepsilon - \varepsilon_0$. Now, it is not too hard to see that if $x_1$ is an ancestor of $x_2$ in $M_f$, then $\hat{\alpha}(x_1)$ is an ancestor of $\hat{\alpha}(x_2)$ (similarly for $\hat{\beta}$). Further, $\hat{\beta}(\hat{\alpha}(x_1))$ is a descendant of $\beta(\alpha(x_1))$ for all $x_1 \in M_f$ (a similar result holds for $\hat{\alpha} \circ \hat{\beta}$ and $\alpha \circ \beta$). We claim that $\hat{\alpha}, \hat{\beta}$ are $(\varepsilon - \varepsilon_0)$-compatible. Indeed, by construction, $g(\hat{\alpha}(x)) = f(x) + \varepsilon - \varepsilon_0$ for all $x \in M_f$, and $f(\hat{\beta}(y)) = g(y) + \varepsilon - \varepsilon_0$ for all $y \in M_g$. We now prove that
$$\hat{\beta} \circ \hat{\alpha} = \sigma_f^{2(\varepsilon - \varepsilon_0)}.$$

Suppose to the contrary that there is a point $x \in M_f$ such that $y = \hat{\beta}(\hat{\alpha}(x)) \ne \sigma_f^{2(\varepsilon - \varepsilon_0)}(x)$. Since $f(y) = f(x) + 2(\varepsilon - \varepsilon_0)$, $y$ must not be an ancestor of $x$. On the other hand, $\alpha, \beta$ are $\varepsilon$-compatible, so $y' = \beta(\alpha(x))$ is the ancestor of $x$ at height $f(x) + 2\varepsilon$. By construction of $\hat{\alpha}$ and $\hat{\beta}$, $y$ is a descendant of $y'$, in which case there is a node $u \in V_f$ that lies between $y$ and $y'$. (If $y$ and $y'$ lay on the same edge of $M_f$, then $y$ would also be an ancestor of $x$.) Let $u = \beta_\downarrow(\alpha(x))$. Let $v$ be the lower endpoint of the edge $e$ containing $x$. See Figure 4.7. Since $\varepsilon \notin \Lambda$, $f(u) \ne f(v) + 2\varepsilon$ (i.e., $u \ne \beta(\alpha(v))$). There are two cases to consider:

Figure 4.7: The points $u$, $v$, $x$, $y$, and $y'$.

(i) $f(u) > f(v) + 2\varepsilon$. Then let $u = \beta(\alpha(z))$ for the point $z$ lying between $x$ and $v$ at height $f(z) = f(u) - 2\varepsilon$. Furthermore, $f(x) \ge f(z) > f(x) - 2\varepsilon_0$ (if $f(x) - f(z) \ge 2\varepsilon_0$, then $f(y') - f(u) = f(x) - f(z) \ge 2\varepsilon_0$, contradicting the fact that $f(y') - f(y) = 2\varepsilon_0$). Therefore we can choose a point $w \ne v$ on $e$ such that $f(z) > f(w) > f(x) - 2\varepsilon_0$. We have that $\hat{\beta}(\hat{\alpha}(w))$ is a descendant of $y = \hat{\beta}(\hat{\alpha}(x))$ (since $w$ is a descendant of $x$). Moreover, $\beta(\alpha(w))$ is an ancestor of $\hat{\beta}(\hat{\alpha}(w))$. However, since $f(\beta(\alpha(w))) < f(u)$, $\beta(\alpha(w))$ lies between $y$ and $u$. Thus, $\beta(\alpha(w))$ is not an ancestor of $x$ (hence not of $w$), i.e., $\beta(\alpha(w)) \ne \sigma_f^{2\varepsilon}(w)$, contradicting the fact that $\alpha, \beta$ are $\varepsilon$-compatible.

(ii) $f(u) < f(v) + 2\varepsilon$. In this case

$$f(\beta(\alpha(v))) - f(\beta_\downarrow(\alpha(v))) \le f(\beta(\alpha(v))) - f(u) < f(y') - f(y) = 2\varepsilon_0 \le 2\delta_v,$$

which contradicts the definition of $\delta_v$.

Hence we conclude that $y$ is an ancestor of $x$, i.e., $\hat{\beta} \circ \hat{\alpha} = \sigma_f^{2(\varepsilon - \varepsilon_0)}$. Similarly, one can argue that $\hat{\alpha} \circ \hat{\beta} = \sigma_g^{2(\varepsilon - \varepsilon_0)}$, implying that $\hat{\alpha}, \hat{\beta}$ are $(\varepsilon - \varepsilon_0)$-compatible maps, as claimed. Putting everything together, we conclude that $\varepsilon \in \Lambda$.

We now describe the decision procedure that answers the question "is $d_I(M_f, M_g) \le \varepsilon$?" approximately. Given a parameter $\varepsilon > 0$, an edge is called $\varepsilon$-long, or long for brevity, if its length is strictly greater than $2\varepsilon$. We first describe an exact decision procedure for the case when all edges in both trees are long, and then describe an approximate decision procedure for the case when the two trees have short edges.

Trees with long edges. We remove all degree-two nodes in the beginning. A subtree rooted at a point $x$ in a merge tree $M$, denoted $M^x$, includes all the points in the merge tree that are descendants of $x$, together with an edge from $x$ that extends upwards to height $\infty$. For a node $u \in V$, let $C(u)$ denote the children of $u$ and let $p(u)$ denote its parent. Assume $d_I(M_f, M_g) \le \varepsilon$, and let $\alpha : M_f \to M_g$ and $\beta : M_g \to M_f$ be a pair of $\varepsilon$-compatible maps. As in the proof of Lemma 42, we define the functions $\alpha_\downarrow$ and $\beta_\downarrow$, but restricted only to the vertices of $M_f$ and $M_g$. That is, for a node $v \in V_f$, we define $\alpha_\downarrow(v)$ to be the lower endpoint of the edge containing $\alpha(v)$; if $\alpha(v)$ is a node, then $\alpha_\downarrow(v)$ is $\alpha(v)$ itself. Similarly we define $\beta_\downarrow(w)$ for a node $w \in V_g$. The following two properties of $\alpha_\downarrow$ and $\beta_\downarrow$ will be crucial for the decision procedure.

Lemma 43. If all edges in $M_f$ and $M_g$ are $\varepsilon$-long, then the following hold: (i) for a node $v \in V_f$, $|f(v) - g(\alpha_\downarrow(v))| \le \varepsilon$, and (ii) for a node $w \in V_g$, $|g(w) - f(\beta_\downarrow(w))| \le \varepsilon$.

Proof. We will prove part (i); part (ii) is similar. By definition, $g(\alpha_\downarrow(v)) \le f(v) + \varepsilon$. Suppose $g(\alpha_\downarrow(v)) < f(v) - \varepsilon$. Let $v'$ be a point in $M_g$ lying on the edge containing $\alpha(v)$ and $\alpha_\downarrow(v)$ with height $f(v) - \varepsilon - \varepsilon_0$, for some sufficiently small $\varepsilon_0$. Then $\beta(v')$ lies in one of the subtrees rooted at the children of $v$, say $M_1$. Consider a descendant $u$ of $v$ at height $g(v') - \varepsilon$ lying in a different subtree $M_2$ rooted at a child of $v$. Such a descendant exists because all edges are $\varepsilon$-long. Since, by definition and our choice of $v'$, there does not exist any node in $M_g$ between $\alpha(v)$ and $v'$, we have $\alpha(u) = v'$. But then $\beta(\alpha(u)) = \beta(v')$ lies in $M_1$, and hence is not an ancestor of $u \in M_2$; in other words, $\beta(\alpha(u)) \ne \sigma_f^{2\varepsilon}(u)$. This contradicts the fact that $\alpha$ and $\beta$ are $\varepsilon$-compatible. Thus, $g(\alpha_\downarrow(v)) \ge f(v) - \varepsilon$, and the claim follows.

Lemma 44. If all edges in $M_f$ and $M_g$ are $\varepsilon$-long, then $\alpha_\downarrow$ and $\beta_\downarrow$ are bijections with $\beta_\downarrow = \alpha_\downarrow^{-1}$ (and $\alpha_\downarrow = \beta_\downarrow^{-1}$).

Proof. We will first show that $\beta_\downarrow = \alpha_\downarrow^{-1}$. Suppose to the contrary that there exists a vertex $v \in V_f$ such that $\beta_\downarrow(\alpha_\downarrow(v)) = w \ne v$. Let $\alpha_\downarrow(v) = u$, for $u \in V_g$. From Lemma 43 we have $|f(v) - f(w)| \le 2\varepsilon$. Since all edges are longer than $2\varepsilon$ and $v \ne w$, $v$ cannot be an ancestor/descendant of $w$ in $M_f$. By definition of $\alpha_\downarrow$, $\alpha(v)$ is an ancestor of $\alpha_\downarrow(v) = u$. Thus $\beta(\alpha(v))$ is an ancestor of $\beta(u)$. Further, $|f(v) - g(u)| \le \varepsilon$ (Lemma 43) and $\beta(\alpha(v)) = \sigma_f^{2\varepsilon}(v)$ (since $\alpha, \beta$ are $\varepsilon$-compatible). Hence, $\beta(u)$ lies between $v$ and $\beta(\alpha(v))$ on the edge $e$ whose lower endpoint is $v$, as $e$ is $\varepsilon$-long. Thus, $\beta(u)$ is an ancestor of $v$. See Figure 4.8. Also, by definition of $\beta_\downarrow$, $\beta(u)$ is an ancestor of $\beta_\downarrow(u) = w$. Thus, $w$ is a descendant of $v$, a contradiction since $v$ cannot be an ancestor of $w$. We thus have $\beta_\downarrow = \alpha_\downarrow^{-1}$. Similarly, we can show that $\alpha_\downarrow = \beta_\downarrow^{-1}$. This also implies that $\alpha_\downarrow$ and $\beta_\downarrow$ are bijections.

We define an indicator function $\Phi : V_f \times V_g \to \{0, 1\}$ such that

$$\Phi(u, v) = \begin{cases} 1 & \text{if } d_I(M_f^u, M_g^v) \le \varepsilon, \\ 0 & \text{otherwise.} \end{cases}$$

The following lemma gives a recursive definition of $\Phi(u, v)$. For two integers $k$ and $\ell$ with $k \le \ell$, let $[k : \ell] = \{k, k+1, \dots, \ell\}$.

Figure 4.8: $\beta(u) = \beta(\alpha_\downarrow(v))$ is an ancestor of $v$.

Lemma 45. Suppose all the edges in $M_f$ and $M_g$ are $\varepsilon$-long. Let $(u, v) \in V_f \times V_g$, and let $u_i$ and $v_j$ denote the $i$-th and $j$-th child, respectively, of $u$ and $v$. We have $\Phi(u, v) = 1$ if and only if the following conditions hold: (i) $|f(u) - g(v)| \le \varepsilon$, (ii) $|C(u)| = |C(v)|$, and (iii) there exists a permutation $\pi$ of $[1 : |C(u)|]$ such that $\Phi(u_i, v_{\pi(i)}) = 1$ for all $i \in [1 : |C(u)|]$.

Proof. Suppose $\Phi(u, v) = 1$, and let $\alpha, \beta$ be the corresponding $\varepsilon$-compatible maps. Suppose property (i) does not hold, and let $f(u) > g(v) + \varepsilon$ without loss of generality. Thus, $\beta(v)$ maps to one of the multiple edges incident to $u$, and there exists at least one edge $e = (u, w)$ with $w \in C(u)$ such that none of $e$'s points (other than $u$) is in the image of $\beta$. However, $\beta(\alpha(w)) = \sigma_f^{2\varepsilon}(w)$ must lie in the interior of $e$ (since $e$ is $\varepsilon$-long), a contradiction. To prove that (ii) holds, note that by Lemma 44 and the monotonicity of $\alpha_\downarrow$ and $\beta_\downarrow$, there exist bijections $\alpha_\downarrow, \beta_\downarrow$ between the vertices of $M_f^u$ and $M_g^v$ such that if $u_1 \in C(u_2)$ for a vertex $u_2 \in M_f^u$, then $\alpha_\downarrow(u_1) \in C(\alpha_\downarrow(u_2))$ (a symmetric statement holds for $\beta_\downarrow$ and vertices in $M_g^v$). Thus, $\alpha_\downarrow, \beta_\downarrow$ induce bijections between $C(u)$ and $C(v)$, and hence $|C(u)| = |C(v)|$. Finally, for (iii), let $\alpha_\downarrow(u') = v'$ for some $u' \in C(u)$, $v' \in C(v)$. Then by the definitions of $\alpha_\downarrow$ and $\beta_\downarrow$, $\alpha(M_f^{u'}) \subseteq M_g^{v'}$ and $\beta(M_g^{v'}) \subseteq M_f^{u'}$. This means that the restrictions of the pair of $\varepsilon$-compatible maps $\alpha$ and $\beta$ to $M_f^{u'}$ and $M_g^{v'}$, respectively, remain $\varepsilon$-compatible for $M_f^{u'}$ and $M_g^{v'}$. Thus, $\Phi(u', v') = 1$, and the permutation $\pi$ is defined by $\alpha_\downarrow, \beta_\downarrow$.

We now prove the opposite direction. Suppose properties (i), (ii), and (iii) hold. Let $(\alpha_i, \beta_i)$ be the pair of $\varepsilon$-compatible maps between $M_f^{u_i}$ and $M_g^{v_{\pi(i)}}$. Then a pair of $\varepsilon$-compatible maps $(\alpha, \beta)$ between $M_f^u$ and $M_g^v$ is obtained as follows: $\alpha(x) = \{\alpha_i(x) \mid x \in M_f^{u_i}\}$ ($\beta$ is defined similarly). Note that points on the infinite edge from $u$ (resp. $v$) upwards are shared among all $M_f^{u_i}$ (resp. $M_g^{v_j}$), whereas all other points in $M_f^u$ (resp. $M_g^v$) are present in only one $M_f^{u_i}$ (resp. $M_g^{v_j}$). However, since $|f(u) - g(v)| \le \varepsilon$, shared points are mapped to shared points, and we have $|\alpha(x)| = 1$ (resp. $|\beta(y)| = 1$) for all $x \in M_f^u$ (resp. $y \in M_g^v$). Thus, $\alpha$ and $\beta$ are functions and satisfy all the required properties. Hence, $\Phi(u, v) = 1$.

Figure 4.9: A naive map.

Decision procedure. We compute $\Phi$ for all pairs of nodes in $V_f \times V_g$ in a bottom-up manner and return $\Phi(r_f, r_g)$, where $r_f$ (resp. $r_g$) is the root of $M_f$ (resp. $M_g$). Let $(u, v) \in V_f \times V_g$. Suppose we have computed $\Phi(u_i, v_j)$ for all $u_i \in C(u)$ and $v_j \in C(v)$. We compute $\Phi(u, v)$ as follows. If (i) or (ii) of Lemma 45 does not hold for $u$ and $v$, then we return $\Phi(u, v) = 0$. Otherwise we construct the bipartite graph $G_{uv} = (C(u) \cup C(v),\; E = \{(u_i, v_j) \mid \Phi(u_i, v_j) = 1\})$ and determine in $O(k^{5/2})$ time whether $G_{uv}$ has a perfect matching, using the algorithm of Hopcroft and Karp [96]. Here, $k = |C(u)| = |C(v)|$. If $G_{uv}$ has a perfect matching $M = \{(u_1, v_{\pi(1)}), \dots, (u_k, v_{\pi(k)})\}$, we set $\Phi(u, v) = 1$; else we set $\Phi(u, v) = 0$. If $\Phi(u, v) = 1$, we use the $\varepsilon$-compatible maps for $M_f^{u_i}, M_g^{v_{\pi(i)}}$, for $1 \le i \le k$, to compute a pair of $\varepsilon$-compatible maps between $M_f^u$ and $M_g^v$, as discussed in the proof of Lemma 45.

For a node $u \in V_f \cup V_g$, let $k_u$ be the number of its children. The total time taken for running Hopcroft and Karp [96] is:
$$O\Big(\sum_{u \in V_f} \sum_{v \in V_g} k_u k_v \sqrt{k_v}\Big) = \sum_{u \in V_f} k_u \cdot O\Big(\sum_{v \in V_g} k_v \sqrt{k_v}\Big) \le O\Big(n^{3/2} \sum_{u \in V_f} k_u\Big) \le O(n^{5/2}).$$
Hence we obtain the following.

Lemma 46. Given two merge trees $M_f$ and $M_g$ and a parameter $\varepsilon > 0$ such that all edges of $M_f$ and $M_g$ are $\varepsilon$-long, whether $d_I(M_f, M_g) \le \varepsilon$ can be determined in $O(n^{5/2})$ time. If the answer is yes, a pair of $\varepsilon$-compatible maps between $M_f$ and $M_g$ can be computed within the same time.
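A compact Python sketch of this bottom-up procedure is given below; it assumes trees encoded as child-lists with node heights, ignores the infinite root edges, and uses the Hopcroft-Karp implementation from networkx in place of [96].

```python
# A sketch (hypothetical tree encoding, not the dissertation's code) of the
# decision procedure of Lemma 46 for trees whose edges are all eps-long.
import networkx as nx
from functools import lru_cache

def decide_interleaving(children_f, height_f, rf, children_g, height_g, rg, eps):
    @lru_cache(maxsize=None)
    def phi(u, v):
        # Condition (i) of Lemma 45: heights within eps of each other.
        if abs(height_f[u] - height_g[v]) > eps:
            return False
        cu, cv = children_f[u], children_g[v]
        # Condition (ii): equal numbers of children.
        if len(cu) != len(cv):
            return False
        if not cu:          # both are leaves
            return True
        # Condition (iii): a perfect matching on child pairs with phi = 1.
        G = nx.Graph()
        G.add_nodes_from(('f', a) for a in cu)
        G.add_nodes_from(('g', b) for b in cv)
        G.add_edges_from((('f', a), ('g', b))
                         for a in cu for b in cv if phi(a, b))
        matching = nx.bipartite.hopcroft_karp_matching(
            G, [('f', a) for a in cu])
        return sum(1 for k in matching if k[0] == 'f') == len(cu)
    return phi(rf, rg)
```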

Trees with short edges. Given two merge trees, a naive map takes the lowest among all the leaves in both trees to a point at height equal to the height of the higher of the two roots of $M_f$ and $M_g$ (see Figure 4.9). Thus, all the points in one tree are mapped to the infinitely long edge of the other tree. This map incurs a distortion equal to the height of the trees, which can be arbitrarily larger than the optimum. Nevertheless, this simple idea leads to an approximation algorithm.

65 Here is an outline of the algorithm. After carefully trimming off short subtrees from the input trees, the algorithm decomposes the resulting trimmed trees into two kinds of regions – those with nodes and those without nodes. If the interleaving distance between the input trees is small, then there exists an isomorphism between trees induced by the regions without nodes. Using this isomorphism, the points in the nodeless regions are mapped without incurring additional distortion. Using a counting argument and the naive map described above, it is shown that the distortion incurred while mapping the regions with nodes and the trimmed regions is bounded.

Figure 4.10: Trimming a tree: (left) original tree, where red points have extent less than $2(\sqrt{2ns} + 1)\varepsilon$; (right) trimmed tree, with nodes added at the bottom (hollow nodes).

More precisely, given $M_f$, $M_g$, and $\varepsilon > 0$, define the extent $e(x)$ of a point $x$ (which is not necessarily a tree node) in $M_f$ or $M_g$ as the maximum height difference between $x$ and any of its descendants. Suppose each edge is at most $s\varepsilon$ long. Let $M'_f$ and $M'_g$ be the subsets of $M_f$ and $M_g$ consisting only of points with extent at least $2(\sqrt{2ns} + 1)\varepsilon$, adding nodes to the new leaves of $M'_f$ and $M'_g$ as necessary. Note that $M'_f$ and $M'_g$ are themselves trees; however, they might contain nodes of degree 2. See Figure 4.10 for an example.

Lemma 47. If $d_I(M_f, M_g) \le \varepsilon$, then $d_I(M'_f, M'_g) \le \varepsilon$.

Proof. Let $\alpha : M_f \to M_g$ and $\beta : M_g \to M_f$ be $\varepsilon$-compatible maps. Let $\alpha'$ and $\beta'$ be the restrictions of the functions' domains to $M'_f$ and $M'_g$, respectively. We argue that the ranges of $\alpha'$ and $\beta'$ lie in $M'_g$ and $M'_f$, respectively. Suppose otherwise. Then, without loss of generality, there is a point $x \in M'_f$ with $y = \alpha(x)$ not in $M'_g$. Because $x \in M'_f$, its extent in $M_f$ is at least $2(\sqrt{2ns} + 1)\varepsilon$. Therefore, there exists a descendant $x'$ of $x$ in $M_f$ with $f(x') = f(x) - 2(\sqrt{2ns} + 1)\varepsilon$. Because $y$ is not in $M'_g$, the extent of $y$ must be less than $2(\sqrt{2ns} + 1)\varepsilon$, and there exists no descendant $y'$ of $y$ with $g(y') = g(y) - 2(\sqrt{2ns} + 1)\varepsilon = f(x) - 2(\sqrt{2ns} + 1)\varepsilon + \varepsilon = f(x') + \varepsilon$. Since $g(\alpha(x')) = f(x') + \varepsilon$, $\alpha(x')$ is not a descendant of $\alpha(x)$, which contradicts the assumption that $\alpha, \beta$ are $\varepsilon$-compatible maps.

The above lemma can be easily generalized to say that removing points in both trees with extent less than or equal to any fixed value does not increase the distance between them.
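The extents of all tree nodes can be computed bottom-up in linear time; the following Python sketch (with an assumed node-based encoding) illustrates the computation used for trimming.

```python
# A sketch (hypothetical node-based encoding) of the extent computation used
# for trimming: the extent of a node is the largest height drop to any of its
# descendants, computable bottom-up in linear time.
def extents(children, height, root):
    ext, lowest = {}, {}
    def visit(v):                     # recursion depth = tree height
        lowest[v] = height[v]
        for c in children.get(v, []):
            visit(c)
            lowest[v] = min(lowest[v], lowest[c])
        ext[v] = height[v] - lowest[v]
    visit(root)
    return ext  # keep points with extent >= 2*(sqrt(2*n*s) + 1)*eps
```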

We now define matching points in $M'_f$ and $M'_g$. Let $H$ be the set of function values of leaves and branching nodes in $M'_f$ or $M'_g$, and let $H' \subseteq H$ be the subset of $H$ consisting of function values $h$ for which $(h, h + 2\varepsilon]$ does not intersect $H$. A point $x$ in $M'_f$ is a matching point if $f(x) \in H'$. Similarly, a point $y$ in $M'_g$ is a matching point if $g(y) \in H'$. By this definition, no two matching points have function values within $2\varepsilon$ of each other unless they share the exact same function value. Furthermore, if $x$ is a matching point, then all points with the same function value as $x$ on both $M'_f$ and $M'_g$ are matching points. There are $O(n^2)$ matching points.

Suppose $d_I(M'_f, M'_g) \le \varepsilon$, and let $\alpha' : M'_f \to M'_g$ and $\beta' : M'_g \to M'_f$ be a pair of $\varepsilon$-compatible maps. Call a matching point $x$ in $M'_f$ and a matching point $y$ in $M'_g$ with $f(x) = g(y)$ matched if $\alpha'(x)$ is an ancestor of $y$.

Lemma 48. Let $x$ be any matching point in $M'_f$. The matched relation between matching points in $M'_f$ at height $f(x)$ and matching points in $M'_g$ at height $f(x)$ is a bijective function.

Proof. No two distinct matching points $y_1$ and $y_2$ on $M'_g$ with $f(x) = g(y_1) = g(y_2)$ share the same ancestor at function value $f(x) + \varepsilon$, because they have no branching node ancestors with low enough function value. Therefore, a matching point in $M'_f$ can be matched to only one matching point in $M'_g$.

Let $x_1$ and $x_2$ be two distinct matching points from $M'_f$ with $f(x) = f(x_1) = f(x_2)$. If $\alpha'(x_1)$ and $\alpha'(x_2)$ are ancestors of a common matching point $y$, then $\alpha'(x_1) = \alpha'(x_2)$, and thus $x_1$ and $x_2$ must have a common ancestor $x'$ at height $f(x) + 2\varepsilon$. However, $x_1$ and $x_2$ have no branching node ancestor with low enough function value for $x'$ to exist. Hence, the matched relation must be injective from matching points in $M'_f$ to $M'_g$.

Finally, consider any matching point $y$ on $M'_g$ with $g(y) = f(x)$. Point $x'_1 = \beta'(y)$ is the ancestor of a matching point $x_1$ on $M'_f$ with $f(x_1) \ge g(y)$. (Note that, by the same argument as at the beginning of this proof, only one such matching point $x_1$ can exist.) Point $y' = \alpha'(x_1)$ is an ancestor of $y$ with $g(y') \le g(y) + 2\varepsilon$. Point $y$ is the only descendant of $y'$ with function value $f(x)$. Point $\alpha'(x_1)$ must be an ancestor of $y$, meaning $x_1$ and $y$ are matched. Thus, the matched relation is surjective.

We now define $\tilde{M}_f$ to be a rooted tree consisting of one node per matching point on $M'_f$. Let $p(v)$ be the matching point for node $v$. $\tilde{M}_f$ has node $v$ as an ancestor of node $u$ if $p(v)$ is an ancestor of $p(u)$ (see Figure 4.11). Define $\tilde{M}_g$ similarly. The size of $\tilde{M}_f$ and $\tilde{M}_g$ is $O(n^2)$. Intuitively, $\tilde{M}_f$ and $\tilde{M}_g$ represent the trees induced by matching points. By the definition of the interleaving distance and Lemma 48, $\tilde{M}_f$ and $\tilde{M}_g$ are isomorphic if $d_I(M'_f, M'_g) \le \varepsilon$.

Decision procedure. We are now ready to describe the decision procedure. We first construct the subtrees $M'_f$ and $M'_g$ of $M_f$ and $M_g$, respectively, consisting of points with extent at least $2(\sqrt{2ns} + 1)\varepsilon$. Next, we compute matching points on $M'_f$ and $M'_g$ and construct the trees $\tilde{M}_f$ and $\tilde{M}_g$ on these matching points, as defined above.

Figure 4.12: Figure showing points x, y, x1, y1, x2, y2.

Using the algorithm of [24, chap. 3, p. 85], we determine in time linear in the size of the trees whether M˜ f and M˜ g are isomorphic. If the answer is no, we return no. By Lemma 48, dI pMf , Mgq ą ε in this case. Otherwise we construct the following functions α : Mf Ñ Mg and β : Mg Ñ Mf and return them. Recall, it suffices to perform assignments where the function value increases by at most cε. For each pair of matching points x and y matched by the isomorphism, the algorithm sets αpxq “ y and βpyq “ x. Now, let pξ1, ξ2q be any maximal range of function values without any 1 1 1 1 branching nodes or leaves in Mf or Mg with ξ2 ´ ξ1 ą 2ε. Let x be any point in Mf 1 1 with fpx q P pξ1, ξ2q. Point x has a unique matching point descendant x at height ξ1, 1 1 1 by the definition of matching points. The algorithm sets αpx q to the point y in Mg where y1 is the ancestor of αpxq with gpy1q “ fpx1q, and it sets βpy1q “ x1. For every 2 1 2 remaining point x in Mf , the algorithm sets αpx q to αpxq where x is the lowest matching point ancestor of x2. βpy2q is defined similarly for remaining points y2 in 1 1 2 2 Mg that were not paired with some x . We call such points x and y lazily assigned. 1 See Figure 4.12. Finally, each point z in Mf ´ Mf has αpzq set to αpxq where x is 1 1 the lowest ancestor of z in Mf . Similar assignments are done for points in Mg ´ Mg.

Lemma 49. (i) For each lazily assigned point $x'' \in M'_f$,
$$g(\alpha(x'')) \le f(x'') + 2(\sqrt{2ns} + 1)\varepsilon.$$

(ii) For each lazily assigned point $y'' \in M'_g$,
$$f(\beta(y'')) \le g(y'') + 2(\sqrt{2ns} + 1)\varepsilon.$$

Proof. We only prove (i); (ii) is symmetric. The higher of the two roots of $M'_f$ and $M'_g$ is a matching point, and so are all the points at that height. Thus, all lazily assigned points have a matching point ancestor. We show that the nearest such ancestor cannot be too much higher up. Let $x$ be a matching point. We show that there exists a region $(\xi_1, \xi_2)$ as defined above with
$$f(x) - 2(\sqrt{2ns} + 1)\varepsilon \le \xi_2 \le f(x).$$

Consider sweeping over the function values downward starting at $f(x)$, and let $\xi_2$ be the largest function value possible for a region as defined above. If the sweep line ever descends a distance greater than $2\varepsilon$ without encountering a branching node or leaf in $M'_f$ or $M'_g$, then a $\xi_2$ is found. Therefore, there will be at least one branching node or leaf $x'$ in $M'_f$ or $M'_g$ per descent of $2\varepsilon$ until $\xi_2$ is found. Suppose $\xi_2 < f(x) - 2(\sqrt{2ns} + 1)\varepsilon$. Let $l = \sqrt{2ns} + 1$, and $f' = f(x) - 2l\varepsilon$.

Let $\{P_1, P_2, \dots\}$ be a set of paths from each branching node of $M'_f$ or $M'_g$ encountered during the sweep to leaves of $M_f$ and $M_g$, such that each pair of paths is disjoint except possibly at the higher endpoint of one of the two paths. Such a set of paths can be found by greedily selecting an arbitrary path for each branching node as it is encountered. In addition, let $\{Q_1, Q_2, \dots\}$ be a set of pairwise-disjoint paths from leaves of $M'_f$ and $M'_g$ to leaves of $M_f$ and $M_g$. Because each point in $M'_f$ and $M'_g$ has extent at least $2l\varepsilon$, the lower endpoint of each of these paths lies below height $f'$. Since edge lengths are at most $s\varepsilon$, there are at least $(f(x') - f')/(s\varepsilon)$ nodes in $M_f$ or $M_g$ on the path $P_i$ or $Q_j$ from each branching node or leaf $x'$ at height $f(x') \in [f', f(x))$, not counting $x'$ itself. In total, these paths contain at least
$$\sum_{i=1}^{l} \frac{(l - i)2\varepsilon}{s\varepsilon} = \frac{l(l-1)}{s}$$
nodes, not counting their higher endpoints. Each node counted above appears on at most one path $P_i$ and at most one path $Q_j$, for at most two paths total, so
$$\frac{l(l-1)}{s} \le 2n \;\Rightarrow\; l(l-1) \le 2ns,$$
a contradiction since $l = \sqrt{2ns} + 1$. Therefore, either $\xi_2 \ge f(x) - 2(\sqrt{2ns} + 1)\varepsilon$, or the trees $M'_f$ and $M'_g$ do not extend below height $f(x) - 2(\sqrt{2ns} + 1)\varepsilon$. In either case, the lemma follows.

Lemma 50. Let $M_f$ and $M_g$ be two merge trees and let $\varepsilon > 0$ be a parameter. There is an $O(n^2)$-time algorithm that returns a pair of $4(\sqrt{2ns} + 1)\varepsilon$-compatible maps between $M_f$ and $M_g$ if $d_I(M_f, M_g) \le \varepsilon$ and the maximum length of a tree edge is $s\varepsilon$. If $d_I(M_f, M_g) > \varepsilon$, then the algorithm may return no or return a pair of $4(\sqrt{2ns} + 1)\varepsilon$-compatible maps.

Proof. Constructing the trees $\tilde M_f$ and $\tilde M_g$, the corresponding isomorphism between them (if it exists), and the maps $\alpha$ and $\beta$ between $M_f$ and $M_g$ (if they exist) takes $O(n^2)$ time.

Except for the lazily assigned points, all the points in $M'_f$ and $M'_g$ are mapped by $\alpha$ and $\beta$, respectively, to points at the same function value. By Lemma 49, each point in $M'_f$ and $M'_g$ has its function value changed by at most $2(\sqrt{2ns} + 1)\varepsilon$. Points in $M_f - M'_f$ (resp. $M_g - M'_g$) have their nearest ancestors in $M'_f$ (resp. $M'_g$) at function value at most $2(\sqrt{2ns} + 1)\varepsilon$ away. Since $\alpha$ and $\beta$ map them to the images of their nearest ancestors, their function values change by at most $2 \cdot 2(\sqrt{2ns} + 1)\varepsilon = 4(\sqrt{2ns} + 1)\varepsilon$.

Remark. (i) Since the minimum edge length is at most $2\varepsilon$, the maximum edge length is $s\varepsilon$, and the ratio between the lengths of the longest and shortest edges is $r$, we have $r \ge s/2$. (ii) If $s = \Omega(n)$, we modify the above algorithm slightly: we skip the trimming step but keep the rest the same. It can be shown, as in Lemma 49, that the heights of a point and its image differ by at most $2n\varepsilon$. In particular, the proof no longer requires as complicated a counting argument, because any path contains at most $n$ nodes.

Putting it together. By Lemmas 46 and 50, the decision procedure takes $O(n^{5/2})$ time. If it returns no, then $d_I(M_f, M_g) > \varepsilon$. If it returns yes, then it also returns $O(\min\{n, \sqrt{rn}\}\,\varepsilon)$-compatible maps between them. Hence, we conclude the following.

Theorem 51. Given two merge trees $M_f$ and $M_g$ with a total of $n$ vertices, there exists an $O(n^{5/2}\log n)$-time algorithm with an approximation factor of $O(\min\{n, \sqrt{rn}\})$ for computing the interleaving distance between them, where $r$ is the ratio between the lengths of the longest and the shortest edge in both trees.

Combining Theorem 51 with Corollary 41, we have:

Corollary 52. Given two metric trees $T_1$ and $T_2$ with a total of $n$ vertices, there exists an $O(n^{9/2}\log n)$-time algorithm with an approximation factor of $O(\min\{n, \sqrt{rn}\})$ for computing the Gromov-Hausdorff distance between them, where $r$ is the ratio between the lengths of the longest and the shortest edge in both trees.

4.6 Conclusion

We have presented the first hardness results for computing the Gromov-Hausdorff distance between metric trees. We have also given a polynomial-time approximation algorithm for the problem. But the current gap between the lower and upper bounds on the approximation factor is polynomially large. While we would like to reduce this gap, doing so seems very difficult. On the algorithmic side in particular, aiming for anything less than an $O(\sqrt{n})$-approximation appears to preclude our use of algorithms for graph isomorphism, the strongest algorithmic tool used in the above

algorithm. We hope that our current investigation will stimulate more research on the theoretical and algorithmic aspects of embedding or matching under additive metric distortion.

5 Subtrajectory Clustering

5.1 Introduction

Trajectories arise in the description of any dynamic physical system. They are being recorded or inferred from millions of sensors at an unprecedented scale. To take full advantage of the enormous opportunities these datasets provide for extracting useful information and improving decision making, several computational challenges need to be addressed. This chapter focuses on one such computational problem, namely extracting high-level shared structure that encodes much of the information present in a large trajectory dataset. For specificity, we focus on GPS traces of moving objects, though much of the work here can be extended to many other domains. We assume that each trajectory is observed as a sequence of points. For simplicity, we refer to the point sequences themselves as trajectories. For such trajectories, common but unknown constraints and objectives generate patterns of subtrajectories, portions of trajectories that are commonly shared. See Figure 5.1 for an example. Our goal is to cluster such shared subtrajectories and represent each cluster as a pathlet, a sequence of points that is not necessarily a subsequence of an input trajectory. For example, a road network contributes to shared subtrajectories for vehicle trajectories. Intuitively, a pathlet is a portion of a road/path that tends to be traversed as a whole by many trajectories. The subtrajectory clustering problem is to partition shared subtrajectories into a small number of clusters such that the pathlet for each cluster is a high-quality representation of the subtrajectories in that cluster.

Subtrajectory clustering is an interesting trajectory segmentation problem from both a modeling and an algorithmic standpoint. Individual trajectories carry little information about shared structures, and it is only in the context of a collection of trajectories that the importance of certain subtrajectories becomes evident. This situation is different from some other cases where shared structures (motifs) are extracted from a set of curves, such as protein backbones or speech signals. In both

Figure 5.1: The middle black trajectory shares commonalities with different sets of trajectories, as shown in the boxed regions. The subtrajectories inside these boxes can be clustered together, and a representative pathlet for each cluster computed, so that the black trajectory can be (almost) obtained by a concatenation of these pathlets, with gaps in between.

of these cases, motifs are more structured, and domain knowledge provides enough information to identify common substructures in each curve, such as secondary structures (α-helices and β-sheets) in a protein backbone.

If trajectories correspond to traffic on a known road network, then these trajectories can be mapped to paths in a graph that models the road network. However, in many cases there might not be an underlying road network, or it might not be known, e.g., for trajectories corresponding to pedestrian movement in an urban environment or to pixels in video data. Therefore, we focus on unmapped trajectory point sequences in $\mathbb{R}^2$ sampled at discrete time stamps.

Not only do pathlets provide a compression of large trajectory data, effectively performing non-linear dimension reduction, but the semantic information provided by pathlets has also been useful for trajectory analysis applications such as similarity search and anomaly detection [137]. Furthermore, a pathlet representation of trajectories decreases uncertainty in individual trajectories; it reduces noise, fills in missing data, and identifies outliers (see Figure 5.8) [109, 110]. Pathlets have also been used for many other applications, such as reconstructing a road network from trajectory data [49, 109].

5.1.1 Our contribution

Following are our three main contributions: a simple model for subtrajectory clustering; efficient approximation algorithms under this model, with provable guarantees on the quality of the solution and the running time; and experimental results that demonstrate the efficacy of our model and the efficiency of our algorithms.

Modeling. Our first main contribution, given in Section 5.2, is a simple model for subtrajectory clustering. In our model, each cluster of subtrajectories is represented as a pathlet, a sequence of points. A subtrajectory clustering is a set $\mathcal{P}$ of pathlets along with an assignment of subtrajectories to pathlets in $\mathcal{P}$. We use a distance

function between two point sequences (e.g., the discrete Fréchet distance [33]) to measure how well a pathlet represents (covers) a subtrajectory. Such a model is faithful to the following three desiderata. As discussed in Section 5.1.2, though related work has considered some of these desiderata, we believe we are the first to address all of them simultaneously.

Robustness to variations. We allow trajectories to have gaps, i.e., we do not require that the entire trajectory be covered by pathlets. This serves two purposes: in addition to handling noisy portions of trajectories better (akin to how gaps in DNA sequence alignment model mutations [57]), these gaps also model those portions of a trajectory that may be unique to it and not shared with other trajectories (see Figure 5.1 and Section 5.7).

Theoretical guarantees. We define a single objective function that is a weighted sum of the number of pathlets chosen, the cluster quality for each pathlet, and the gap penalty for each trajectory. By varying the weights, one can obtain a trade-off between the three criteria. An optimal subtrajectory clustering that minimizes the objective should result in a small number of good-quality clusters (pathlets) and few gaps. Further, one should be able to cover the shared portions of an input trajectory by a short sequence of pathlets (see Figure 5.9). As we show, such an objective admits efficient approximation algorithms.

Data-driven clustering. Given the above objective, the number of pathlets chosen, the subtrajectory assignments, and the gaps in the trajectories are all decided by the algorithm in a data-driven fashion, largely independent of any user input. Further, we assume that the set $\mathbb{P}$ of candidate pathlets from which the set $\mathcal{P}$ of pathlets is constructed is the set of all possible point sequences, and we do not make any a priori restrictions on what the pathlets can look like. In this sense, our model is entirely data-driven.

In the subtrajectory-clustering problem, we assume that the set of candidate pathlets is the set of all possible point sequences. We also consider the variant in which $\mathcal{P}$ is chosen from a given finite set $\mathbb{P}$ of $b$ pathlets. We refer to this variant as the pathlet-cover problem. Besides being interesting in its own right, our subtrajectory-clustering algorithm will need an algorithm for this problem. For both variants, we are given a set $\mathcal{T}$ of $n$ trajectories containing $m$ points in total.

Algorithms. We show (Section 5.3) that both pathlet-cover and subtrajectory clustering are NP-hard by a simple reduction from the standard Facility-Location problem in Euclidean space. We then turn to our main algorithmic results: approximation algorithms for pathlet-cover and subtrajectory clustering. Our algorithm for pathlet-cover (Section 5.4) finds an $O(\log m)$-approximate pathlet dictionary in $\tilde O(bm^3)$ time$^1$. To

$^1$ We use the notation $\tilde O$ to hide polylogarithmic factors in $n$, $m$, $b$, and $\sigma$, the spread of the trajectories' points. The spread of a point set is defined as the ratio of the largest to the smallest pairwise distance. The exponent on the polylog may depend upon the dimension $d$ of the underlying space

obtain this result, we show that the construction of $\mathcal{P}$ can be formulated as an instance of the weighted Set-Cover problem, so that a greedy algorithm can be used to construct a pathlet dictionary of the appropriate cost. However, the size of the set system can be as large as $\Omega(2^n)$. The main technical contribution is to show how the greedy algorithm can be executed efficiently without constructing the set system explicitly; a similar idea was employed in inventory management [112].

Next we describe approximation algorithms for subtrajectory clustering (Section 5.5). Here we use the discrete Fréchet distance, a widely used method for measuring similarity between two point sequences, to measure the quality of a pathlet covering a subtrajectory; see Section 5.5 for the formal definition. By a simple reduction to the pathlet-cover problem, we obtain a polynomial-time $O(\log m)$-approximation. We simply use the set of all $O(m^2)$ contiguous subtrajectories of input trajectories as the candidate set $\mathbb{P}$. However, such a "brute-force" enumeration of candidate pathlets makes the running time prohibitive. In essence, this algorithm considers too many possible assignments between subtrajectories and candidate pathlets, and neither takes full advantage of the freedom in our choice of pathlets nor exploits the underlying geometry fully. We first show via a geometric grouping argument that, by paying an extra $\log m$ factor in the approximation quality, we can work with a set $\mathbb{P}$ of $O(m)$ candidate pathlets. We then describe how to expedite the algorithm further and prove that if the input trajectories are κ-packed$^2$ for a constant $\kappa > 0$, then the algorithm runs in $\tilde O(m^2)$ time. Intuitively, a trajectory is κ-packed if it is not a space-filling curve or a fractal. In practice, GPS trajectories are κ-packed with small values of $\kappa$; see Section 5.7 and [147]. The main technical result, which is quite surprising, is to show that for κ-packed curves with constant $\kappa$, the number of subtrajectories of any trajectory that can be mapped to a single pathlet is $\tilde O(1)$; this pruning of subtrajectories also leads to a more robust solution. By use of appropriate data structures, the corresponding steps of the greedy algorithm can therefore be significantly sped up.

Preprocessing and experiments. We briefly discuss two additional preprocessing steps that improve our algorithms' performance in practice (Section 5.6). The first step is designed to handle lossy or sparse data sets for which many trajectories are only partially observed. We describe a method to fill in the missing observations not through a simple linear interpolation, but instead using observations from other input trajectories. The second step is designed to further reduce the size of the set $\mathbb{P}$ of candidate pathlets. Our method sparsifies the set $\mathbb{P}$ by first projecting the pathlets to a Euclidean space and then applying a clustering technique to remove redundant candidate pathlets.

when our algorithms are generalized beyond $\mathbb{R}^2$.
$^2$ A trajectory is κ-packed if the length of its intersection with any disk of radius $r$ is at most $\kappa r$. See Section 5.5 for details.

We conclude with an empirical evaluation of our algorithms and the properties of our model in Section 5.7. We perform our experiments on both synthetic and real data sets, and show that the algorithm meets the desiderata of being robust to variations, being efficient and accurate, and being data-driven. In particular, it finds pathlets that cover the dense areas of the data sets, and individual trajectories can be recovered from a subset of the pathlets with few gaps that capture variations in individual trajectories. We also show that meaningful pathlets can be found even on instances where there is no underlying road network, and further, that the resulting pathlets gracefully handle noise in the individual trajectories. In addition to examining the quality of our algorithm's output visually, we also demonstrate that the parameters can be tuned for different balances between dictionary size, coverage of trajectories, and similarity between pathlets and their assigned subtrajectories.

5.1.2 Related work

Though there has been a recent line of work on subtrajectory clustering (see Section 1.3), all of it considers only a subset of the desiderata we consider. In particular, none of these algorithms provide any guarantee on their performance in the worst case. In contrast, our goal is to develop an algorithm with provable guarantees on its performance. Further, the prior works either aim for complete coverage without the notion of gaps, or are not data-driven, imposing particular structure on either the trajectories or the pathlets. In some more detail, the algorithm of Chen et al. [62] assumes the trajectories to be paths in a graph with no noise, and it does not consider gaps. Though the approach of Panagiotakis et al. [128] appears superficially similar to ours, their method segments trajectories based on the representativeness of individual trajectory edges, which is based on the density near the edge (irrespective of which trajectories contribute to the density). The algorithm of Lee et al. [106] requires pathlets to be line segments, and the algorithm of Sankararaman et al. [134] is effective only if the data is dense, not too noisy, and not too big. The algorithm of Buchin et al. [50] requires the user to specify two of the following three parameters: the size, length, or diameter of the cluster; their algorithm finds the single cluster that optimizes the third. Their approach is slightly modified in [49] to detect multiple subtrajectory clusters; however, they assume the trajectories are points sampled from an unknown, underlying road network, and they do not have a notion of gaps. In summary, our work encompasses the desiderata that previous work either partially covers or omits entirely.

Although there is a fair bit of work on extracting common movement patterns, as discussed earlier (Section 1.3), this line of work either pre-specifies the portion of trajectories where a pattern is to be found or pre-specifies the pattern itself. In contrast, our goal is to identify shared structures determined completely by the data, while being robust to noise, missing data, and non-uniform sampling.

The setting for multiple sequence alignment in computational biology [127] is much simpler, as it deals with one-dimensional sequences over a finite alphabet. The

work on functional clustering [133] assumes a (known) global parametrization over the data, and the range of the functions is one-dimensional, which makes the problem simpler. Topic modeling/dictionary learning [44, 76], if applied to trajectory data, will not return subtrajectories as "topics" or words in the dictionary, as these methods do not concern themselves with locality.

5.2 The Clustering Model

A trajectory or pathlet is a polygonal curve defined by a finite sequence of points $\langle p_1, p_2, \dots\rangle$ in $\mathbb{R}^2$ (again, our results can be generalized to trajectories in $\mathbb{R}^d$ for any constant $d$). Let $\mathcal{T} = \{T_1, \dots, T_n\}$ be a set of $n$ trajectories and $\mathbb{P} = \{P_1, \dots, P_b\}$ be a set of $b$ candidate pathlets. For simplicity, we assume the points in each trajectory are distinct, and let $X = \bigcup_{i=1}^{n} T_i$ be the set of all trajectory points. Set $m = |X|$. Let $T[p, q]$ denote the subtrajectory of $T$ lying between points $p$ and $q$ of $T$. For positive integers $i, j$, let $T(i, j)$ denote $T[p_i, p_j]$.

A subtrajectory clustering is a pathlet dictionary, that is, a (multi)subset $\mathcal{P} \subseteq \mathbb{P}$, along with an assignment of a subset $\mathcal{T}(P)$ of subtrajectories to each $P \in \mathcal{P}$ such that there is at most one subtrajectory $S \in \mathcal{T}(P)$ of each trajectory $T \in \mathcal{T}$. In turn, for each trajectory $T \in \mathcal{T}$, we let $T_P \in \mathcal{T}(P)$ denote the subtrajectory of $T$ assigned to $P$, assuming one exists. We say that each $S \in \mathcal{T}(P)$ is assigned to or covered by $P$.

We want to compute a multiset of pathlets $\mathcal{P}$ (and their assignments)$^3$ that is succinct and captures the shared portions between trajectories. In our model, a good clustering minimizes an objective function consisting of three terms. The first term is proportional to $|\mathcal{P}|$, the number of pathlets. The second term consists of the fraction of points of each trajectory that are not assigned to any pathlet in $\mathcal{P}$, summed over all trajectories in $\mathcal{T}$. In other words, the second term measures the size of the gaps left uncovered by $\mathcal{P}$. We focus on the fraction of uncovered points in each trajectory instead of the absolute number because we do not wish to optimize only for a small number of especially long or densely sampled trajectories. The third term captures how well the assigned subtrajectories resemble the pathlets in the dictionary.

To formally define the third term in our objective, we use a distance function, denoted by $d(T_1, T_2)$, to measure the distance between two point sequences $T_1$ and $T_2$. To eliminate the possibility of certain assignments, we may have $d(S, P) = \infty$ in some cases. We say an assignment is permissible if $d(S, P) \ne \infty$. For a trajectory $T \in \mathcal{T}$, let $\tau(T)$ be the fraction of the trajectory's points that are not covered by any pathlet. The cost of covering $\mathcal{T}$ by $\mathcal{P}$ is defined as:

$$\mu(\mathcal{T}, \mathcal{P}, d) = c_1 |\mathcal{P}| + c_2 \sum_{T \in \mathcal{T}} \tau(T) + c_3 \sum_{P \in \mathcal{P}} \sum_{S \in \mathcal{T}(P)} d(S, P),$$

$^3$ We allow $\mathcal{P}$ to contain the same point sequence multiple times so that each copy may be assigned to different portions of a single trajectory.

where $c_1$, $c_2$, and $c_3$ are user-defined parameters.

Given a tuple $(\mathcal{T}, \mathbb{P}, d)$, the pathlet-cover problem is to compute $\mathcal{P}^* \subseteq \mathbb{P}$ and permissible assignments of subtrajectories to pathlets that minimize $\mu(\mathcal{T}, \mathcal{P}, d)$. We let $\pi(\mathcal{T}, \mathbb{P}, d) = \mu(\mathcal{T}, \mathcal{P}^*, d)$. The subtrajectory-clustering problem is to solve the pathlet-cover instance $(\mathcal{T}, \mathbb{P}, d)$ with $\mathbb{P}$ being the (uncountably infinite) set of all point sequences and $d$ again being an arbitrary distance function between point sequences.

The subtrajectory-clustering problem is hard for arbitrary distance functions, even from an approximation point of view, because of the infinitely large set of candidate pathlets. It is helpful to consider distance functions that are metrics, i.e., distance functions satisfying the triangle inequality, as doing so allows us to restrict ourselves to a finite set of candidate pathlets for computing an approximate solution of the above objective function$^4$. In particular, we use the discrete Fréchet distance $\mathrm{fr}$, a widely used metric for point sequences. Given two point sequences $A$ and $B$, a correspondence is a subset $C \subseteq A \times B$ such that for all $a \in A$ (resp. $b' \in B$), there exists $b' \in B$ (resp. $a' \in A$) such that $(a, b') \in C$ (resp. $(a', b') \in C$). Let $T_1 = \langle p_1, p_2, \dots, p_k\rangle$ and $T_2 = \langle q_1, q_2, \dots, q_\ell\rangle$ be a pair of point sequences. A correspondence $C$ between $T_1$ and $T_2$ is monotone if for $(p_i, q_j), (p_{i'}, q_{j'}) \in C$ with $i \le i'$ we have $j \le j'$. The discrete Fréchet distance between $T_1$ and $T_2$ is defined as

$$\mathrm{fr}(T_1, T_2) = \min_{C \in \Xi} \;\max_{(p, q) \in C} \|p - q\|,$$
where $\Xi$ is the set of all monotone correspondences between $\langle p_1, p_2, \dots, p_k\rangle$ and $\langle q_1, q_2, \dots, q_\ell\rangle$.
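For concreteness, the discrete Fréchet distance of two sequences can be computed by a standard $O(k\ell)$ dynamic program over monotone correspondences. The following self-contained sketch follows the definition above directly; it is illustrative only, and not the faster decision procedure of [74] used later for κ-packed curves.

```python
import math

def discrete_frechet(T1, T2):
    """Discrete Fréchet distance between point sequences T1 and T2.

    dp[i][j] is the smallest achievable maximum pair distance over
    monotone correspondences of T1[:i+1] and T2[:j+1]."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    k, l = len(T1), len(T2)
    dp = [[math.inf] * l for _ in range(k)]
    for i in range(k):
        for j in range(l):
            d = dist(T1[i], T2[j])
            if i == 0 and j == 0:
                dp[i][j] = d
            else:
                best = min(
                    dp[i - 1][j] if i > 0 else math.inf,
                    dp[i][j - 1] if j > 0 else math.inf,
                    dp[i - 1][j - 1] if i > 0 and j > 0 else math.inf,
                )
                dp[i][j] = max(best, d)
    return dp[k - 1][l - 1]

# Example: two slightly shifted sequences.
A = [(0, 0), (1, 0), (2, 0)]
B = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(A, B))  # 1.0
```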

5.3 Hardness

In this section, we show that the pathlet-cover and subtrajectory-clustering problems are both NP-hard, even when we restrict the distance function $d$ to be the discrete Fréchet distance $\mathrm{fr}$. The decision version of the pathlet-cover problem can be formulated as follows: given an instance $(\mathcal{T}, \mathbb{P}, d)$ of pathlet-cover and a value $k$, determine whether $\pi(\mathcal{T}, \mathbb{P}, d) \le k$. We define the decision version of subtrajectory-clustering similarly.

We reduce the Facility-Location problem, known to be NP-complete [103], to the pathlet-cover problem. Given two sets of points $F$ and $C$, referred to as the facilities and customers, respectively, a facility cost $c > 0$, and a real number $k' > 0$, the Facility-Location problem asks if there exists a subset $F' \subseteq F$ of facilities and an assignment of customers to facilities $f : C \to F'$ such that $c|F'| + \sum_{p \in C} \|f(p) - p\| \le k'$.

$^4$ Using a metric as a distance function makes it possible to cover every point in $X$ by simply making each of the $n$ trajectories its own pathlet. Our goal, of course, is to find a much smaller pathlet dictionary that pulls pathlets from the commonly traversed portions of the trajectory collection.

Consider the following reduction from Facility-Location to the pathlet-cover problem. Initially, we set $\mathcal{T}$ and $\mathbb{P}$ to be empty. For each customer $p \in C$, we add a trajectory $T_p = \langle p\rangle$ to $\mathcal{T}$. For each facility $f \in F$, we add a candidate pathlet $P_f = \langle f\rangle$ to $\mathbb{P}$. Finally, we let $c_1 = c$, $c_2 > k'$, and $c_3 = 1$, and we solve the decision version of pathlet-cover over $(\mathcal{T}, \mathbb{P}, \mathrm{fr})$ with $k = k'$. Observe that $\mathrm{fr}(T_p, P_f) = \|p - f\|$. We derive the following theorem$^5$.

Theorem 53. The pathlet-cover problem is NP-complete.
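The reduction is mechanical. The sketch below builds the pathlet-cover instance from a Facility-Location instance as described above; the choice $c_2 = k' + 1$ is one concrete way to enforce $c_2 > k'$, and the function name is ours.

```python
# Sketch of the reduction from Facility-Location to pathlet-cover.
# Each customer becomes a one-point trajectory, each facility a
# one-point candidate pathlet; fr between two one-point sequences
# is just the Euclidean distance between the points.

def build_pathlet_cover_instance(customers, facilities, c, k_prime):
    trajectories = [[p] for p in customers]   # T_p = <p>
    candidates = [[f] for f in facilities]    # P_f = <f>
    c1 = c                # pathlet (facility) opening cost
    c2 = k_prime + 1.0    # > k': leaving a point uncovered never pays off
    c3 = 1.0              # assignment-cost multiplier
    k = k_prime           # decision threshold
    return trajectories, candidates, (c1, c2, c3), k
```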

Similarly, we can reduce the variant of Facility-Location where $F$ is implicitly defined as all points in $\mathbb{R}^2$ to the subtrajectory-clustering problem using essentially the same reduction$^6$.

Theorem 54. The subtrajectory-clustering problem is NP-hard.

5.4 Pathlet-cover

In this section, we describe an approximation algorithm for the pathlet-cover problem; the algorithm relies on a reduction to the standard Set-Cover problem. We define a set system as a pair $(X, \mathcal{S})$, where $X = \{e_1, \dots, e_\ell\}$ is the ground set of elements and $\mathcal{S}$ is a family of subsets of $X$. Given a set system $(X, \mathcal{S})$ and a weight function $w : \mathcal{S} \to \mathbb{R}^+$, the optimization version of the Set-Cover problem asks for a subfamily $\mathcal{C} \subseteq \mathcal{S}$ of minimum total weight such that $\bigcup \mathcal{C} = X$. Below, we describe an approximation-preserving reduction from the pathlet-cover problem to the Set-Cover problem. Running the classic greedy algorithm for Set-Cover as described below results in an $O(\log m)$-approximation for both the Set-Cover and the original pathlet-cover instances. Unfortunately, the reduction constructs an exponentially large set system. Our main algorithmic challenge is to implicitly run the greedy algorithm without having to explicitly create the whole set system.

5.4.1 From pathlet-cover to set-cover

Fix an instance $(\mathcal{T}, \mathbb{P}, d)$ of the pathlet-cover problem. Let $X$ be the set of points in $\mathcal{T}$ as defined earlier; $|X| = m$. We define a family of subsets $\mathcal{S}$ to create a weighted set system $(X, \mathcal{S})$. There are two types of subsets in our family; the first represents trajectory points being covered by pathlets, and the second represents uncovered trajectory points. Formally, the sets are defined as follows.

$^5$ The proof uses the discrete Fréchet distance for the distance function $d$ used to measure similarity between pathlets and subtrajectories. If we allow arbitrary distance functions instead, then we can show the stronger result that there exists no $o(\log m)$-approximation for pathlet-cover unless P = NP.
$^6$ While we are able to prove hardness for subtrajectory-clustering, we do not have a proof that the problem is in NP, because we do not know if the number of bits required to describe optimal dictionary pathlets is polynomial in the input size.

1. For every $P \in \mathbb{P}$ and for any set of input subtrajectories $R$ drawn from distinct trajectories of $\mathcal{T}$ such that $d(S, P) \ne \infty$ for all $S \in R$, family $\mathcal{S}$ contains a set

$$S(P, R) = \{p \in S \mid S \in R\} \quad\text{with}\quad w(S(P, R)) = c_1 + c_3 \sum_{S \in R} d(S, P).$$

2. For a point $p \in X$, let $T^{(p)} \in \mathcal{T}$ denote the trajectory containing $p$. Then for every $p \in X$, family $\mathcal{S}$ contains the singleton set $\{p\}$ with $w(\{p\}) = c_2/|T^{(p)}|$.

As mentioned above, the first step of this reduction constructs a set system of exponential size. The following lemma expresses the correctness of our reduction.

Lemma 55. There exists a bijection between set covers of $(X, \mathcal{S})$ and solutions to the pathlet-cover problem for $(\mathcal{T}, \mathbb{P}, d)$ such that cost and weight remain equal across the bijection.

Proof. Consider a solution to the pathlet-cover problem, consisting of the pathlet dictionary $\mathcal{P}$ and assignments $\mathcal{T}(P)$ for all $P \in \mathcal{P}$. We create a solution $\mathcal{C}$ to the Set-Cover instance $(X, \mathcal{S})$. For each uncovered trajectory point $p$, we add the singleton set $\{p\}$ to $\mathcal{C}$. For each $P \in \mathcal{P}$, we add the set $S(P, \mathcal{T}(P))$ to $\mathcal{C}$. One may easily verify that the total weight of $\mathcal{C}$ is equal to the cost of $\mathcal{P}$. Conversely, consider a solution $\mathcal{C}$ to the Set-Cover instance. For each set in $\mathcal{C}$ of the form $S(P, R)$, we add the pathlet $P$ to $\mathcal{P}$ and assign all the subtrajectories of $R$ to $P$. We leave any point $p$ such that $\{p\} \in \mathcal{C}$ uncovered by our pathlet dictionary. Again, the cost of our pathlet dictionary is the same as the total weight of $\mathcal{C}$.

5.4.2 Greedy algorithm

We would like to use the standard greedy algorithm to solve the Set-Cover instance described above. The greedy algorithm picks the set that maximizes the ratio of the number of newly covered elements to the weight of the set. We refer to these ratios as the sets' coverage-cost ratios. The algorithm continues picking sets in this manner until every element is covered. It is well known that the greedy algorithm has an approximation ratio of $O(\log m)$ when run on the set system $(X, \mathcal{S})$.

We now describe the process in more detail for our setting. In the Set-Cover instance above, each set is either a pathlet with corresponding subtrajectory assignments or a singleton set containing a trajectory point. Choosing a set $S(P, R)$ of the first type results in covering the trajectory points within the subtrajectories $R$ assigned to the pathlet $P$. Choosing a set $\{p\}$ of the second type is implemented as marking the trajectory point $p \in X$ as permanently uncovered. We refer to all trajectory points covered by some pathlet or marked permanently uncovered as processed.

Consider the state of the greedy algorithm immediately following an iteration. Let $\mathcal{C} \subseteq \mathcal{S}$ be the family of sets chosen so far by the greedy algorithm, and let $\hat X := X(\mathcal{C}) \subseteq X$ be the subset of points not processed by $\mathcal{C}$. For a subtrajectory $S$, let $\hat S = S \cap \hat X$ be the set of unprocessed points in $S$.

We now consider the next iteration of the greedy algorithm. For a family of subtrajectories $R$ and a pathlet $P$, define its coverage-cost ratio (with respect to $\mathcal{C}$) as
$$\rho(P, R) = \frac{\sum_{S \in R} |\hat S|}{c_1 + c_3 \sum_{S \in R} d(S, P)},$$
and set
$$\mathcal{T}_P = \arg\max_{R : S(P, R) \in \mathcal{S}} \rho(P, R), \qquad P^* = \arg\max_{P \in \mathbb{P}} \rho(P, \mathcal{T}_P).$$

Similarly, for each unprocessed point $p \in \hat X$, we define its coverage-cost ratio as $\rho(p) = |T^{(p)}|/c_2$ and set $p^* = \arg\max_{p \in \hat X} \rho(p)$. Note that $\rho(P, R)$ depends on $\hat X$ and thus on $\mathcal{C}$, while $\rho(p)$ is independent of $\mathcal{C}$. In the next iteration, the algorithm chooses the set $S(P^*, \mathcal{T}_{P^*})$ or $\{p^*\}$, whichever has the higher coverage-cost ratio. After adding the set to $\mathcal{C}$, we update the set $\hat X$ of unprocessed points, the values $\rho(P, R)$, and the sets $\mathcal{T}_P$.

To implement each step of the greedy algorithm, we store all pathlets and unprocessed points in a max-priority queue with their coverage-cost ratios as the keys. At each step, we delete newly processed points from the priority queue and update the priority queue as pathlets' keys get updated. Note that pathlets remain in the priority queue even when they are added to the dictionary, so that multiple copies of the same pathlet may be added to the dictionary with distinct assignments. The main challenge in implementing a step of the greedy algorithm efficiently is the computation of $\mathcal{T}_P$ for each pathlet $P$, since there is an exponential number of sets $S(P, R)$ in $\mathcal{S}$. Notwithstanding $|\mathcal{S}| = \Omega(2^n)$, we describe below an efficient procedure for computing $\mathcal{T}_P$.
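The overall greedy loop, abstracted away from the implicit computation of $\mathcal{T}_P$, is the standard weighted greedy set cover driven by a max-priority queue. The sketch below uses lazy re-evaluation of stale keys, which is valid here because a set's coverage-cost ratio can only decrease as more points get processed; in our actual algorithm the candidate sets $S(P, R)$ are never enumerated, and a pathlet's best ratio is instead recomputed via the $\gamma^*$ search of the next subsection. The code and names are a simplified stand-in, not the implementation analyzed in the text.

```python
import heapq

def greedy_set_cover(universe, sets, weight):
    """Greedy weighted set cover with a lazy max-priority queue.

    `sets` maps a set id to a frozenset of elements; `weight` maps a
    set id to its positive weight.  Returns the chosen set ids."""
    uncovered = set(universe)
    # Max-heap keyed by negated coverage-cost ratio; entries may go stale.
    heap = [(-len(s) / weight[sid], sid) for sid, s in sets.items()]
    heapq.heapify(heap)
    chosen = []
    while uncovered and heap:
        neg_ratio, sid = heapq.heappop(heap)
        gain = len(sets[sid] & uncovered)
        if gain == 0:
            continue
        fresh = -gain / weight[sid]
        if fresh > neg_ratio + 1e-12:   # stale key: reinsert with current ratio
            heapq.heappush(heap, (fresh, sid))
            continue
        chosen.append(sid)
        uncovered -= sets[sid]
    return chosen

universe = {1, 2, 3, 4, 5}
sets = {"a": frozenset({1, 2, 3}), "b": frozenset({3, 4}), "c": frozenset({4, 5})}
weight = {"a": 1.0, "b": 1.0, "c": 1.0}
print(greedy_set_cover(universe, sets, weight))  # ['a', 'c']
```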

5.4.3 Computing $\mathcal{T}_P$

Let $S_P(T)$ denote the set of subtrajectories $S$ of trajectory $T$ with $d(P, S) \ne \infty$. For a set $S(P, R)$, we define variables $x_{S,T} \in \{0, 1\}$ for each $S \in S_P(T)$ and $T \in \mathcal{T}$ that indicate whether $S \in R$. We also require $\sum_{S \in S_P(T)} x_{S,T} \le 1$ for all $T \in \mathcal{T}$, i.e., at most one subtrajectory of each trajectory can be in $R$.

Thus, for a fixed pathlet $P$, computing the set $\mathcal{T}_P$ is equivalent to solving the following optimization problem:
$$\max \ \frac{\sum_{T \in \mathcal{T}} \sum_{S \in S_P(T)} |\hat S|\, x_{S,T}}{c_1 + c_3 \sum_{T \in \mathcal{T}} \sum_{S \in S_P(T)} d(S, P)\, x_{S,T}}$$
$$\text{s.t.}\quad \sum_{S \in S_P(T)} x_{S,T} \le 1 \quad \forall T \in \mathcal{T}; \qquad x_{S,T} \in \{0, 1\} \quad \forall S \in S_P(T),\ T \in \mathcal{T}.$$

The maximum objective value is $\ge \gamma$ iff there exist feasible values of the variables $x$

Figure 5.2: Functions $f_{S,P}$ for subtrajectories $S$ of $T_1$ (dashed blue) and $T_2$ (dashed red), functions $F_{P,T_1}$ (blue) and $F_{P,T_2}$ (red), function $\zeta_P$ (green), and its intersection with the line of slope $c_1$ at $\gamma^*$.

such that
$$\sum_{T \in \mathcal{T}} \sum_{S \in S_P(T)} \left(|\hat S| - c_3 \gamma\, d(S, P)\right) x_{S,T} \ge c_1 \gamma. \tag{5.1}$$

Define $\gamma^*$ as the maximum value of $\gamma$ such that there exists a valid assignment of values to the variables $x$ satisfying (5.1). We have $\rho(P, \mathcal{T}_P) = \gamma^*$. For a fixed value of $\gamma$, in order to maximize the left-hand side of (5.1), for each trajectory $T$ we should pick the subtrajectory $S \in S_P(T)$ maximizing the quantity $|\hat S| - c_3 \gamma\, d(S, P)$, provided it is greater than $0$; we pick no subtrajectory of $T$ if all of them make the quantity less than $0$. For a subtrajectory $S$, let $f_{S,P}$ be the real-valued function of $\gamma$ defined as $f_{S,P}(\gamma) = |\hat S| - c_3 \gamma\, d(S, P)$. For each pathlet $P$ and trajectory $T$, we define the function $F_{P,T} : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ as
$$F_{P,T}(\gamma) = \max\Big\{0,\ \max_{S \in S_P(T)} f_{S,P}(\gamma)\Big\}.$$

Function $F_{P,T}$ is a monotonically non-increasing, piecewise-linear, convex function with at most $|S_P(T)| + 1$ linear pieces. Next, we define $\zeta_P(\gamma) = \sum_{T \in \mathcal{T}} F_{P,T}(\gamma)$. Function $\zeta_P$ is also a monotonically non-increasing, piecewise-linear, convex function with at most $\sum_{T \in \mathcal{T}} |S_P(T)| + m$ pieces. Since the graph of $\zeta_P$, which we also denote by $\zeta_P$, is a monotonically non-increasing convex chain, the optimal value $\gamma^*$ is given by the intersection point of $\zeta_P$ with the line of slope $c_1$ passing through the origin. See Figure 5.2. Furthermore,

$$\mathcal{T}_P = \Big\{\arg\max_{S \in S_P(T)} f_{S,P}(\gamma^*) \;\Big|\; T \in \mathcal{T},\ F_{P,T}(\gamma^*) > 0\Big\}.$$
We now describe a data structure that will aid us in a binary search for $\gamma^*$.
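Setting aside the data structure described next, $\gamma^*$ can already be found directly: $\zeta_P$ is non-increasing in $\gamma$ while $c_1\gamma$ is increasing, so a bisection over $\gamma$ locates their intersection. The sketch below evaluates $\zeta_P$ naively in time linear in $\sum_T |S_P(T)|$ per probe; the segment tree of Section 5.4.4 exists precisely to avoid this linear-time evaluation. Names and the iteration count are ours.

```python
def zeta(gamma, assignments, c3):
    """zeta_P(gamma) = sum_T max(0, max_{S in S_P(T)} |S_hat| - c3*gamma*d(S,P)).

    assignments[T] lists pairs (uncovered_count, dist) over the
    permissible subtrajectories S of trajectory T, dist = d(S, P)."""
    total = 0.0
    for cands in assignments:
        best = max((cnt - c3 * gamma * dist for cnt, dist in cands), default=0.0)
        total += max(0.0, best)
    return total

def gamma_star(assignments, c1, c3, hi=1e9, iters=100):
    """Largest gamma with zeta_P(gamma) >= c1*gamma, by bisection.
    Valid because zeta_P is non-increasing while c1*gamma is increasing."""
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if zeta(mid, assignments, c3) >= c1 * mid:
            lo = mid
        else:
            hi = mid
    return lo

# One trajectory with two candidate subtrajectories, another with one.
assignments = [[(5, 2.0), (3, 0.5)], [(4, 1.0)]]
print(gamma_star(assignments, c1=1.0, c3=1.0))  # about 2.8
```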

82 5.4.4 Data structure

The projection of $F_{P,T}$ onto the $\gamma$-axis partitions the axis into $|S_P(T)| + 1$ intervals. Let $I_{P,T}$ denote the set of these intervals, and let $I_P = \bigcup_T I_{P,T}$. We store $I_P$ in a segment tree $\Psi_P$; see [68] for details on the segment tree. Roughly speaking, $\Psi_P$ is a balanced binary tree. Each node $v$ of $\Psi_P$ is associated with an interval $\delta_v$: the leaves are associated with "atomic" intervals between two consecutive endpoints of intervals in $I_P$, and for an internal node $v$ with children $w$ and $z$, $\delta_v = \delta_w \cup \delta_z$. An interval $I \in I_P$ is stored at a node $v$ if $\delta_v \subseteq I$ and $\delta_{p(v)} \not\subseteq I$, where $p(v)$ is the parent of $v$. An interval is stored at $O(\log |I_P|)$ nodes. Let $I_{P,v} \subseteq I_P$ denote the subset of intervals stored at $v$. We define the function $\zeta_{P,v}(\gamma) = \sum_{T : I_{P,T} \cap I_{P,v} \ne \emptyset} F_{P,T}(\gamma)$. The restriction of $\zeta_{P,v}$ to the interval $\delta_v$ is a linear function; we store this linear function at $v$. Let $\alpha(w)$ denote the set of nodes encountered on the path in $\Psi_P$ to $w$. Finally, for a given leaf $w$, let $\zeta'_{P,w}(\gamma) = \sum_{v \in \alpha(w)} \zeta_{P,v}(\gamma)$. We have $\zeta'_{P,w}(\gamma) = \zeta_P(\gamma)$ for all $\gamma \in \delta_w$. Computing $\zeta'_{P,w}$ takes $O(\log |I_P|)$ time. An interval can be inserted into or deleted from $\Psi_P$ in $O(\log |I_P|)$ time.

Returning to the greedy algorithm, whenever a new point is processed for some trajectory $T$, we update $F_{P,T}$, the set $I_{P,T}$, and the segment tree $\Psi_P$. In the worst case, these updates may take $O(|S_P(T)| \log |I_P|)$ time. We then perform a binary search for $\gamma^*$ as mentioned earlier over the $O(|I_P|)$ leaves. When searching over a leaf $w$ of $\Psi_P$, we compute the intersection of $\zeta'_{P,w}$ with the line of slope $c_1$ passing through the origin. If the intersection occurs to the left (resp. right) of $\delta_w$, then we continue the search over leaves to the left (resp. right) of $w$. Otherwise, the intersection occurs in $\delta_w$. The binary search takes $O(\log^2 |I_P|)$ time.

5.4.5 Analysis

We now analyze the running time of our algorithm. We do so in terms of the value $\chi = \max_{P \in \mathbb{P}, T \in \mathcal{T}} |S_P(T)|$. While this value can be as high as $O(m^2)$, we show in the next section how to reduce it greatly when solving the subtrajectory-clustering problem for collections of κ-packed trajectories.

Selecting a pathlet or point from the priority queue takes $O(\log(b + m))$ time. Updating the pathlets in the queue at the end of each greedy step takes $O(b \log(b + m))$ time, and each point member of the queue is updated at most once, in $O(\log(b + m))$ time. Therefore, the algorithm spends $O(bm \log(b + m))$ time in total updating and searching the queue. We also have $|I_P| = O(n\chi)$ for any $P$. If a pathlet $P$ is selected at the beginning of a greedy step, the subtrajectories it covers can be computed in $O(\log(n\chi) + k)$ time, where $k$ is the number of newly covered subtrajectories. Therefore, finding these subtrajectories takes $O(m \log(n\chi) + m)$ time in total. It takes $O(b \log^2(n\chi))$ time to run the binary searches to recompute $\rho(P, \mathcal{T}_P)$ for each candidate pathlet $P$ at the end of a greedy step.

Finally, at the end of a greedy step, it takes $O(\chi \log \chi)$ time to recompute the function $F_{P,T}$ for each pathlet $P$ and trajectory $T$ that has a newly processed point. There are at most $m$ instances where a trajectory sees one or more of its points covered at the end of a greedy step, so the total time spent updating the $F_{P,T}$ functions is $O(bm\chi \log \chi)$. By the same argument, the total number of updates needed for each segment tree $\Psi_P$ over all iterations is $O(m\chi)$, for a total update time of $O(bm\chi \log(n\chi))$ over all pathlets. This last bound is the bottleneck of our algorithm. We thus have the following theorem.

Theorem 56. Let $\mathcal{T}$ be a set of $n$ trajectories having $m$ points in total, $\mathbb{P}$ a set of $b$ candidate pathlets, and $d$ any distance function between point sequences. Let $\chi$ be the maximum number of subtrajectories of any one trajectory $T \in \mathcal{T}$ for which there are permissible assignments to any one $P \in \mathbb{P}$. Then there is an $O(\log m)$-approximation algorithm for the pathlet-cover problem that runs in $\tilde O(bm\chi)$ time, provided there is an oracle that computes the distance function $d$ in constant time.

5.5 Subtrajectory Clustering

We now describe our approximation algorithm for the subtrajectory-clustering problem. As mentioned in the introduction, we assume the trajectories to be κ-packed and the underlying distance function to be the discrete Fréchet distance. A curve is said to be κ-packed if the length of its intersection with any disk of radius $r$ is at most $\kappa r$. A point sequence is κ-packed if the polygonal curve obtained by joining its points in sequence is κ-packed.

The algorithm relies on two main ideas. First, using the fact that the discrete Fréchet distance is a metric, we quickly construct a small set $\mathbb{S}$ of candidate pathlets so that the cost of the optimal pathlet cover of $\mathcal{T}$ with respect to $\mathbb{S}$ is close to that of an optimal subtrajectory clustering of $\mathcal{T}$ (Section 5.5.1)$^7$. Using properties more specific to the discrete Fréchet distance, we then reduce the set of candidate pathlets further to a subset $\mathbb{C}$. Next, we take advantage of our use of the Fréchet distance and of the input trajectories being κ-packed to quickly compute an approximation of the Fréchet distance for pathlet-cover's distance function (Section 5.5.2). This approximation enables us to consider only a small number of assignments between each remaining pathlet and trajectory, thereby greatly speeding up the algorithm without sacrificing the quality of the clustering.

5.5.1 Candidate pathlets

Let $\mathcal{T}$ be the set of $n$ input trajectories with a total of $m$ points, and let $\mathbb{P}$ be the set of all point sequences in $\mathbb{R}^2$. We first show how to construct a candidate set $\mathbb{S}$ of

$^7$ This step works for any distance function that is a metric, but for simplicity we focus on the discrete Fréchet distance.

$O(m^2)$ pathlets such that $\pi(\mathcal{T}, \mathbb{S}, \mathrm{fr}) \le 2\pi(\mathcal{T}, \mathbb{P}, \mathrm{fr})$, and then show how to construct a candidate set $\mathbb{C}$ of $O(m)$ pathlets such that $\pi(\mathcal{T}, \mathbb{C}, \mathrm{fr}) = O(\log m)\,\pi(\mathcal{T}, \mathbb{P}, \mathrm{fr})$.

Let $\mathcal{P} \subset \mathbb{P}$ be an optimal pathlet dictionary of $\mathcal{T}$, i.e., $\mathcal{P} = \arg\min_{\mathcal{P} \subset \mathbb{P}} \mu(\mathcal{T}, \mathcal{P}, \mathrm{fr})$. Recall that for any $P \in \mathcal{P}$, $\mathcal{T}(P)$ is the family of subtrajectories assigned to $P$. Let $d(P, \mathcal{T}(P))$ denote the cost of assigning the subtrajectories in $\mathcal{T}(P)$ to $P$, i.e.,

$$d(P, \mathcal{T}(P)) = c_3 \sum_{S \in \mathcal{T}(P)} \mathrm{fr}(S, P).$$
The lemma below states that $P$ can be replaced by a subtrajectory in $\mathcal{T}(P)$ while at most doubling the cost.

Lemma 57. There exists a subtrajectory $S' \in \mathcal{T}(P)$ such that $d(S', \mathcal{T}(P)) \le 2\, d(P, \mathcal{T}(P))$.

Proof. Let $S' = \arg\min_{S \in \mathcal{T}(P)} \mathrm{fr}(S, P)$. By the triangle inequality, we then have for any $S \in \mathcal{T}(P)$,

$$\mathrm{fr}(S, S') \le \mathrm{fr}(S, P) + \mathrm{fr}(S', P) \le 2\,\mathrm{fr}(S, P).$$

Thus replacing $P$ by $S'$ increases the cost by at most a factor of $2$.

In any solution to the pathlet-cover instance $(\mathcal{T}, \mathbb{P}, \mathrm{fr})$, we can replace all dictionary pathlets by input subtrajectories using Lemma 57, increasing the total cost by a factor of at most $2$.

Corollary 58. Let $\mathbb{P}$ be the set of all point sequences in $\mathbb{R}^2$, and let $\mathbb{S}$ be the set of all subtrajectories of $\mathcal{T}$. We have $|\mathbb{S}| = O(m^2)$ and $\pi(\mathcal{T}, \mathbb{S}, \mathrm{fr}) \le 2\pi(\mathcal{T}, \mathbb{P}, \mathrm{fr})$.

Corollary 58 immediately implies a polynomial-time approximation algorithm for subtrajectory-clustering. However, we can further reduce the number of candidate pathlets by another factor of $m$ with only an $O(\log m)$-factor increase in the approximation ratio.

Canonical pathlets. To further restrict the set of candidate pathlets beyond the set of all subtrajectories, we introduce the notion of canonical pathlets. Consider a trajectory $T \in \mathcal{T}$. For simplicity, we assume that $|T|$ is a power of $2$. Consider a balanced binary tree over the interval $[1, |T|]$. Each node in the tree corresponds to an interval $[i, j]$, with its left and right children associated with the intervals $[i, \lfloor(i + j)/2\rfloor]$ and $[\lfloor(i + j)/2\rfloor + 1, j]$, respectively. The leaves correspond to singleton intervals. The height of the tree is $\log_2 |T|$, and the total number of nodes is at most $2|T|$. Each interval induces a subtrajectory $T(i, j)$ of $T$. These subtrajectories of $T$ corresponding to the intervals of the tree nodes are called the canonical pathlets of $T$, which we denote by $C(T)$. Applying the same procedure to all the trajectories, we get a set $\mathbb{C}$ of $O(m)$ canonical pathlets. A short sketch of this construction is given below.
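The sketch enumerates the canonical intervals recursively, using the standard midpoint split of each interval; the function names are ours.

```python
def canonical_intervals(n):
    """Intervals (1-indexed, inclusive) of a balanced binary tree over
    [1, n]; each interval induces one canonical pathlet T(i, j).
    Assumes n is a power of 2, as in the text."""
    out = []
    def rec(i, j):
        out.append((i, j))
        if i == j:
            return
        mid = (i + j) // 2        # balanced split of [i, j]
        rec(i, mid)
        rec(mid + 1, j)
    rec(1, n)
    return out

def canonical_pathlets(T):
    """Canonical pathlets C(T) of a trajectory T (a list of points)."""
    return [T[i - 1:j] for i, j in canonical_intervals(len(T))]

T = [(0, 0), (1, 0), (2, 1), (3, 1)]
print(canonical_intervals(4))
# [(1, 4), (1, 2), (1, 1), (2, 2), (3, 4), (3, 3), (4, 4)]
```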

We can replace a subtrajectory pathlet $P$ by a set of $O(\log m)$ canonical pathlets $C(P)$ while increasing the assignment cost by at most a factor of $O(\log m)$. Recall that $d(P, \mathcal{T}(P)) = c_3 \sum_{S \in \mathcal{T}(P)} \mathrm{fr}(S, P)$. The lemma below formalizes the prior observation.

Lemma 59. For a pathlet $P$ and any set of subtrajectories $\mathcal{T}(P)$, there exists a set of $O(\log m)$ canonical pathlets $C(P) = \{P_1, \dots, P_{|C(P)|}\}$ and a family of subtrajectory sets $\{\mathcal{T}(P_1), \dots, \mathcal{T}(P_{|C(P)|})\}$, the union of whose points equals the points in $\mathcal{T}(P)$, such that
$$\sum_{P_i \in C(P)} d(P_i, \mathcal{T}(P_i)) = O(\log m) \cdot d(P, \mathcal{T}(P)).$$

Proof. Let $P$ be a subtrajectory of trajectory $T$. It is not hard to see that $P$ can be written as the concatenation of pathlets $P_1, P_2, \dots, P_{|C(P)|}$ in order, where each $P_i \in C(T)$ and $|C(P)| = O(\log |T|)$. Consider a subtrajectory $S \in \mathcal{T}(P)$. Since there is a monotone correspondence between $P$ and $S$ with discrete Fréchet distance $\mathrm{fr}(P, S)$, there exist monotone correspondences between each $P_i$ and some subtrajectory $S_i$ of $S$ such that $\mathrm{fr}(P_i, S_i) \le \mathrm{fr}(P, S)$ and the $S_i$'s concatenated in order cover $S$. We construct the family of sets $\mathcal{T}(P_i)$ by including the subtrajectory $S_i$ of $S$ in $\mathcal{T}(P_i)$, for all $S \in \mathcal{T}(P)$ and $i \in \{1, \dots, |C(P)|\}$. We then have

$$\sum_{P_i \in C(P)} d(P_i, \mathcal{T}(P_i)) = c_3 \sum_{S \in \mathcal{T}(P)} \sum_{i=1}^{|C(P)|} \mathrm{fr}(P_i, S_i) \le c_3\, |C(P)| \sum_{S \in \mathcal{T}(P)} \mathrm{fr}(P, S) = O(\log m) \cdot d(P, \mathcal{T}(P)).$$

As before, we can perform the above substitution for every subtrajectory pathlet in a (nearly) optimal dictionary with little cost to the approximation ratio.

Corollary 60. Let $\mathbb{P}$ be the set of all point sequences in $\mathbb{R}^2$ and let $\mathbb{C}$ be the set of all canonical pathlets of $\mathcal{T}$. We have $\pi(\mathcal{T}, \mathbb{C}, \mathrm{fr}) = O(\log m) \cdot \pi(\mathcal{T}, \mathbb{P}, \mathrm{fr})$.

5.5.2 Approximate distances

We use the set of canonical pathlets $\mathbb{C}$ as the set of candidate pathlets in our reduction to pathlet-cover. Rather than use the discrete Fréchet distance as our distance function directly, however, we instead compute a distance function $d$ that approximates the Fréchet distances between a subset of pairs of canonical pathlets and subtrajectories, without decreasing the cluster quality by much. Other pairs are given a distance of $\infty$ to mark them as not being permissible assignments. Using $d$ dramatically reduces the number of permissible assignments considered by the pathlet-cover approximation algorithm, and we are able to compute approximate Fréchet distances much faster than we can compute them exactly. Our algorithm for computing these distances is based upon a simple trajectory simplification scheme, described below.

For any $r > 0$, an $r$-simplification of a trajectory $T$ is a trajectory $S$ consisting of a subsequence of points from $T$ such that $\mathrm{fr}(T, S) \le r$. The following is a well-known linear-time greedy algorithm to compute an $r$-simplification of $T$. We include

Figure 5.3: Sets $Q^1_{T,P,r}$ and $Q^2_{T,P,r}$ shown as the thick red points inside the left and right balls, respectively.

the first point of $T$. We then iterate along the points of $T$ in order, and include in our simplification each point that is at distance at least $r$ from the previously included point. Finally, we include the last point of $T$. We denote the resulting trajectory by $\overrightarrow{T}_r$. We can similarly start from the last point of $T$ and proceed in reverse; we denote this trajectory by $\overleftarrow{T}_r$. Finally, let $T_r$ denote the merger of $\overrightarrow{T}_r$ and $\overleftarrow{T}_r$, obtained by including each point that appears in either trajectory in the order they originally appeared in $T$. A short sketch of this procedure is given below.
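This is a minimal sketch of the greedy simplification and the merge; it assumes points are given as tuples, and the names are ours.

```python
import math

def forward_simplify(T, r):
    """Greedy r-simplification: keep the first point, then every point at
    distance >= r from the last kept point, and always the last point."""
    kept = [T[0]]
    for p in T[1:-1]:
        q = kept[-1]
        if math.hypot(p[0] - q[0], p[1] - q[1]) >= r:
            kept.append(p)
    if T[-1] != kept[-1]:
        kept.append(T[-1])
    return kept

def simplify(T, r):
    """T_r: merge of the forward and backward greedy simplifications,
    keeping points in their original order along T."""
    fwd = set(forward_simplify(T, r))
    bwd = set(forward_simplify(T[::-1], r))
    return [p for p in T if p in fwd or p in bwd]

T = [(0, 0), (0.1, 0), (0.2, 0), (1, 0), (2, 0)]
print(simplify(T, 0.5))  # [(0, 0), (0.2, 0), (1, 0), (2, 0)]
```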

The following lemma follows from the construction:

Lemma 61. (i) Let $p'$ and $q'$ appear in $T_r$ in order. We have $\mathrm{fr}(T[p', q'], T_r[p', q']) \le r$.
(ii) Let $p$ and $q$ appear in $T$ in order. There exist $p', q' \in T_r$ such that $T[p, q]$ is a subsequence of $T_r[p', q']$ and $\mathrm{fr}(T[p, q], T_r[p', q']) \le r$.
(iii) The length of each edge in $\overrightarrow{T}_r$ (resp. $\overleftarrow{T}_r$), except possibly the last (resp. first) edge, is at least $r$.

As described below, computing $r$-simplifications of trajectories allows us to quickly compute Fréchet distances within an additive error of $O(r)$. In addition, for any pair $P \in \mathbb{C}$, $T \in \mathcal{T}$, the number of subtrajectories of $T_r$, the $r$-simplification of $T$, within distance $O(r)$ of $P$ is constant, assuming $\kappa$ is a constant. Let $\sigma$ denote the spread of the trajectories' point sets, i.e., the ratio between the maximum and the minimum pairwise point distances. (Recall our assumption that points are distinct; while the assumption aids our presentation, the algorithm can easily be adapted to the case where a point appears in multiple trajectories.)

We now construct a distance function $d$ that gives a $4$-approximation of the discrete Fréchet distance $\mathrm{fr}$ for some pathlet-subtrajectory pairs. For the others, their distance will be set so that the corresponding assignments are not permissible. Let $\underline{r}$ (resp. $\bar r$) be the minimum (resp. maximum) pairwise distance between trajectory points; $\bar r = \sigma \underline{r}$. For each $T \in \mathcal{T}$ and $r \in \langle \bar r, \bar r/2, \dots, \underline{r}\rangle$, we do the following. We compute $T_r$ in $O(|T|)$ time as described above. We want to efficiently decide which subtrajectories of $T_r$ are within distance $O(r)$ of each pathlet $P$. To do so, we preprocess the set of points in $T_r$ into an approximate spherical range query data structure of size $O(|T_r|)$ in $O(|T_r|)$ time; see Aronov et al. [33]. The data structure can answer queries of the

following form: Let $B(p, r) = \{x \in \mathbb{R}^2 \mid \|x - p\| \le r\}$ denote the ball of radius $r$ around $p$. Given a point $p$, the data structure returns a subset of the points of $B(p, 2r) \cap T_r$, including all points of $B(p, r) \cap T_r$. It returns no points outside $B(p, 2r) \cap T_r$. Each query takes constant time$^8$, plus additional time proportional to the number of points returned.

For each $P \in \mathbb{C}$, let $p$ and $q$ be the start and end points of $P$, respectively. Let $Q^1_{T,P,r} = B(p, 3r) \cap T_r$ and $Q^2_{T,P,r} = B(q, 3r) \cap T_r$ (see Figure 5.3). We compute $Q^1_{T,P,r}$ and $Q^2_{T,P,r}$ using the approximate spherical range query data structure, and then for every pair $(p', q')$ such that $p' \in Q^1_{T,P,r}$ and $q' \in Q^2_{T,P,r}$, we invoke a decision procedure that checks whether $\mathrm{fr}(T_r[p', q'], P)$ is at most $3r$; see [74, Lemma 3.1]. Suppose $\mathrm{fr}(T_r[p', q'], P) \le 3r$. We then set $d(T[p', q'], P) = 4r$. We do the above for all triples $(T, r, P)$, and we let all other values of $d(\cdot, \cdot)$ not assigned above be equal to $\infty$. For our algorithm, we return an $O(\log m)$-approximate pathlet dictionary for the pathlet-cover instance $(\mathcal{T}, \mathbb{C}, d)$ using Theorem 56.
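Putting the pieces together, the construction of $d$ loops over resolutions $r$, simplifies each trajectory, retrieves candidate endpoint pairs near the endpoints of each pathlet, and runs the Fréchet decision. The sketch below is self-contained but simplified: it uses only the forward simplification pass, brute-force ball scans in place of the approximate spherical range queries of [33], and a quadratic Fréchet decision in place of the faster procedure of [74]. All names are ours.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def simplify_indices(T, r):
    """Indices of an r-simplification of T (forward pass only, for brevity;
    the text merges forward and backward passes)."""
    kept = [0]
    for i in range(1, len(T) - 1):
        if dist(T[i], T[kept[-1]]) >= r:
            kept.append(i)
    if kept[-1] != len(T) - 1:
        kept.append(len(T) - 1)
    return kept

def frechet_at_most(A, B, eps):
    """Decide whether the discrete Fréchet distance of A and B is <= eps."""
    k, l = len(A), len(B)
    reach = [[False] * l for _ in range(k)]
    for i in range(k):
        for j in range(l):
            if dist(A[i], B[j]) > eps:
                continue
            if i == 0 and j == 0:
                reach[i][j] = True
            else:
                reach[i][j] = ((i > 0 and reach[i - 1][j])
                               or (j > 0 and reach[i][j - 1])
                               or (i > 0 and j > 0 and reach[i - 1][j - 1]))
    return reach[k - 1][l - 1]

def build_distance_function(trajectories, pathlets, r_lo, r_hi):
    """Sparse distance function d, keyed by ((tid, i, j), pathlet_id) -> 4r,
    where i, j index the endpoints p', q' within trajectory tid."""
    d = {}
    r = r_hi
    while r >= r_lo:
        for tid, T in enumerate(trajectories):
            Tr = simplify_indices(T, r)
            for pid, P in enumerate(pathlets):
                Q1 = [i for i in Tr if dist(T[i], P[0]) <= 3 * r]
                Q2 = [j for j in Tr if dist(T[j], P[-1]) <= 3 * r]
                for i in Q1:
                    for j in Q2:
                        if i > j:
                            continue
                        piece = [T[k] for k in Tr if i <= k <= j]
                        if frechet_at_most(piece, P, 3 * r):
                            key = ((tid, i, j), pid)
                            d[key] = min(d.get(key, math.inf), 4 * r)
        r /= 2.0
    return d

trajs = [[(0, 0), (0.5, 0), (1, 0), (2, 0), (3, 0)]]
paths = [[(1, 0), (2, 0)]]
d = build_distance_function(trajs, paths, r_lo=0.25, r_hi=2.0)
```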

5.5.3 Analysis

We now turn to analyzing the running time and approximation factor of our subtrajectory-clustering algorithm.

Running time. Our main goals are to bound the number of permissible assignments per pathlet-trajectory pair and the time it takes to compute their Fréchet distances. The following lemma helps with both goals.

Lemma 62. For each $P \in \mathbb{C}$, $T \in \mathcal{T}$, and $r \in \{\bar r, \bar r/2, \dots, \underline{r}\}$, we have $|Q^1_{T,P,r}|, |Q^2_{T,P,r}| \le O(\kappa)$.

Proof. Let $p$ be the starting point of $P$. By Lemma 61(iii), each pair of consecutive points along $\overrightarrow{T}_r$ (resp. $\overleftarrow{T}_r$) is at distance at least $r$ apart (except possibly at the endpoints). In particular, there is at least an $r$ length of curve from $\overrightarrow{T}_r$ (resp. $\overleftarrow{T}_r$) lying in $B(p, 4r)$ between every pair of consecutive points in $Q^1_{T,P,r} \cap \overrightarrow{T}_r$ (resp. $Q^1_{T,P,r} \cap \overleftarrow{T}_r$). By the definition of κ-packed curves, there are only $O(\kappa)$ such curve portions within that ball. A similar argument holds for $Q^2_{T,P,r}$.

As in the previous section, for each canonical pathlet $P \in \mathbb{C}$ and trajectory $T \in \mathcal{T}$, let $S_P(T)$ denote the set of subtrajectories $S$ of trajectory $T$ where $d(P, S) \ne \infty$.

Recall the value $\chi = \max_{P \in \mathbb{C}, T \in \mathcal{T}} |S_P(T)|$. The following lemma bounds $\chi$.

Lemma 63. For each $P \in \mathbb{C}$, $T \in \mathcal{T}$, we have $|S_P(T)| \le O(\kappa^2 \log \sigma)$.

Proof. By Lemma 62, for a fixed value $r \in \{\bar r, \bar r/2, \dots, \underline{r}\}$, the total number of subtrajectories of $T$ whose distance to $P$ is set to a finite value is $O(|Q^1_{T,P,r}| \cdot |Q^2_{T,P,r}|) = O(\kappa^2)$. There are $O(\log \sigma)$ values of $r$ considered by the algorithm.

$^8$ This constant is exponential in the ambient dimension.

The algorithm spends $O(m \log \sigma)$ time in total simplifying trajectories. Fix a canonical pathlet $P$, trajectory $T$, and resolution $r$. Let $p$ and $q$ be the first and last points of $P$. One can show that $B(p, 6r) \cap T_r$ and $B(q, 6r) \cap T_r$ both contain $O(\kappa)$ points. Therefore, it takes $O(\kappa)$ time to compute $Q^1_{T,P,r}$ and $Q^2_{T,P,r}$ using the approximate spherical range-query data structure. For any subtrajectory $S$ of $T_r$, testing whether $\mathrm{fr}(S, P) \le 3r$ can be done in $O(\kappa|P|)$ time [74]. There are $b = O(m)$ canonical pathlets, and their total length (i.e., number of points) is $O(m \log m)$. Summing over all permissible assignments between pathlets and subtrajectories, it takes $O(\kappa mn\chi \log m) = O(\kappa^3 mn \log \sigma \log m)$ time to compute the approximate Fréchet distances. Finally, it takes $\tilde O(bm\chi) = \tilde O(\kappa^2 m^2 \log \sigma)$ time to run our algorithm for pathlet-cover on the instance $(\mathcal{T}, \mathbb{C}, d)$. Assuming $\kappa$ is a small constant, we get the following lemma$^9$.

Lemma 64. The subtrajectory-clustering algorithm runs in $\tilde O(m^2)$ time.

Approximation factor. We now bound the approximation factor of our subtrajectory-clustering algorithm. In the following lemma, we claim we can replace an arbitrary assignment to a canonical pathlet with a permissible assignment offering the same coverage at approximately the same cost.

Lemma 65. Let $P \in \mathbb{C}$, $T \in \mathcal{T}$. For any subtrajectory $T[p, q]$, there exist $p', q' \in T$ such that $T[p, q]$ is a subsequence of $T[p', q']$ and $d(T[p', q'], P) \le 4\,\mathrm{fr}(T[p, q], P)$.

Proof. Let $r \in \{\bar r, \bar r/2, \dots, \underline{r}\}$ be such that $r \le \mathrm{fr}(T[p, q], P) < 2r$. By Lemma 61(ii), there exist $p', q' \in T_r$ such that $T[p, q]$ is a subsequence of $T[p', q']$ and $\mathrm{fr}(T[p, q], T_r[p', q']) \le r$. By the triangle inequality,

$$\mathrm{fr}(T_r[p', q'], P) \le \mathrm{fr}(T_r[p', q'], T[p, q]) + \mathrm{fr}(T[p, q], P) < 3r.$$

By the definition of the discrete Fréchet distance, $p' \in Q^1_{T,P,r}$ and $q' \in Q^2_{T,P,r}$. Therefore, when the algorithm creates $T_r$, it successfully verifies that $\mathrm{fr}(T_r[p', q'], P) \le 3r$ and sets $d(T[p', q'], P) = 4r$. We have $d(T[p', q'], P) \le 4\,\mathrm{fr}(T[p, q], P)$, as promised. Note that the algorithm may later set $d(T[p', q'], P)$ to a smaller value, but doing so does not invalidate the inequality.

We also have the following lemma, establishing that the function $d$ gives an upper bound on the discrete Fréchet distance.

Lemma 66. Let $P \in \mathbb{C}$, and let $S$ be a subtrajectory. We have $\mathrm{fr}(S, P) \le d(S, P)$.

Proof. Suppose $d(S, P) \ne \infty$. Let $T$ be the trajectory containing $S$. Let $r$ be the smallest value such that $S = T[p', q']$ and $\mathrm{fr}(T_r[p', q'], P) \le 3r$. By Lemma 61(i), $\mathrm{fr}(T[p', q'], T_r[p', q']) \le r$. Therefore, by the triangle inequality,

$$\mathrm{fr}(S, P) \le \mathrm{fr}(T[p', q'], T_r[p', q']) + \mathrm{fr}(T_r[p', q'], P) \le 4r = d(S, P).$$

$^9$ The constant inside $\tilde O$ depends exponentially on the ambient dimension, due to a similar dependence for the query time of the approximate spherical range query data structure of [33].

Now, let $\mathcal{P}^*$ be the optimal solution to the subtrajectory-clustering problem. By Corollary 60, there exists a pathlet dictionary $\bar{\mathcal{P}} \subseteq \mathbb{C}$ of cost at most $O(\log m)\,\mu(\mathcal{T}, \mathcal{P}^*, \mathrm{fr})$. By Lemma 65, we can further modify $\bar{\mathcal{P}}$ without changing the set of covered points so that every subtrajectory assignment is permissible according to $d$. Let $\hat{\mathcal{P}}$ be this new dictionary. We have $\mu(\mathcal{T}, \hat{\mathcal{P}}, d) \le O(\log m)\,\mu(\mathcal{T}, \mathcal{P}^*, \mathrm{fr})$. Let $\mathcal{P} \subseteq \mathbb{C}$ be the pathlet dictionary returned by the approximate pathlet-cover algorithm run on $(\mathcal{T}, \mathbb{C}, d)$. By Lemma 66,

$$\mu(\mathcal{T}, \mathcal{P}, \mathrm{fr}) \le \mu(\mathcal{T}, \mathcal{P}, d) \le O(\log m)\,\mu(\mathcal{T}, \hat{\mathcal{P}}, d) \le O(\log^2 m)\,\mu(\mathcal{T}, \mathcal{P}^*, \mathrm{fr}).$$

We reach our main theorem for this section.

Theorem 67. Let $\mathcal{T}$ be a set of $n$ trajectories in $\mathbb{R}^2$ with $m$ points in total such that each trajectory is κ-packed for some constant $\kappa$. There is an $O(\log^2 m)$-approximation algorithm for computing a subtrajectory clustering of $\mathcal{T}$ using the discrete Fréchet distance $\mathrm{fr}$ as the pathlet cost distance function that runs in $\tilde O(m^2)$ time.

5.6 Preprocessing

In this section, we describe two preprocessing steps that are applied before running the greedy algorithm for pathlet-cover. First, we describe a method to interpolate missing points from sparse or lossy trajectory data. Next, we describe an approach to reduce the number of candidate pathlets considered by our pathlet-cover algorithm beyond even the canonical pathlets.

Handling sparse data. A major challenge we face in clustering, particularly when dealing with GPS data, is that individual trajectories may contain long spans where the data contains no observations. As a consequence, we lose information on what route the trajectory takes between data points. A natural approach to filling in missing observations is to compute a simple interpolation, such as a line segment between pairs of consecutive data points, and then add additional points to the data set along that interpolation. Unfortunately, objects do not move along straight lines, even in a road network; e.g., the trajectory between two consecutive observed points may contain a turn.

Instead of computing simple interpolations based only on the pairs of consecutive trajectory data points, we use the locations of observed points throughout the entire collection of trajectories to produce reasonable routes for the unknown parts of individual trajectories. Intuitively, an unknown route between two consecutive data points likely passes close to many observed data points from other trajectories. Let $\delta$, $\Delta$, and $\epsilon$ be parameters. We create a graph $G = (V, E)$ as follows. We begin

Figure 5.4: Left: linear (dashed red) versus our graph-based (dotted blue) interpolation schemes for sparse trajectories. Right: cluster of pathlets and chosen representative pathlet in black.

by creating a grid over the trajectory points using cells of side length $\delta$. For each grid cell $v$ containing at least one observed trajectory point, we add $v$ to the vertex set $V$. Then, we add an edge $(u, v)$ to $E$ for each pair of cells $u$ and $v$ whose centers are within distance $\Delta$ of one another. We give each edge $(u, v)$ a weight equal to the inverse of the number of points lying in $u$ and $v$, so that edges between densely populated cells have low weight. Finally, for each pair of consecutive trajectory points $p$ and $q$ lying at distance at least $\epsilon$ apart, we compute a shortest path in $G$ between $p$'s and $q$'s cells, adding the cell centers encountered along the path to the trajectory between $p$ and $q$. See Figure 5.4 for a comparison of our method to the linear interpolation approach.
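A sketch of this interpolation scheme follows, assuming points are 2D tuples; the grid hashing, the Dijkstra implementation, and the fallback for disconnected cells are our own choices, not prescribed by the text.

```python
import heapq, math

def interpolate(trajectories, delta, Delta, eps):
    """Graph-based gap filling: grid cells of side delta become vertices,
    nearby cells are joined by edges weighted by the inverse point count,
    and gaps longer than eps are filled with shortest-path cell centers."""
    cell = lambda p: (int(p[0] // delta), int(p[1] // delta))
    center = lambda c: ((c[0] + 0.5) * delta, (c[1] + 0.5) * delta)

    counts = {}
    for T in trajectories:
        for p in T:
            counts[cell(p)] = counts.get(cell(p), 0) + 1

    # Edges between cells whose centers are within distance Delta.
    radius = int(math.ceil(Delta / delta)) + 1
    graph = {c: [] for c in counts}
    for c in counts:
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                nb = (c[0] + dx, c[1] + dy)
                if nb != c and nb in counts and \
                        math.dist(center(c), center(nb)) <= Delta:
                    graph[c].append((nb, 1.0 / (counts[c] + counts[nb])))

    def shortest_path(src, dst):
        dist_to, prev, pq = {src: 0.0}, {src: None}, [(0.0, src)]
        while pq:
            w, u = heapq.heappop(pq)
            if w > dist_to.get(u, math.inf):
                continue
            if u == dst:
                break
            for v, ew in graph.get(u, []):
                if w + ew < dist_to.get(v, math.inf):
                    dist_to[v], prev[v] = w + ew, u
                    heapq.heappush(pq, (w + ew, v))
        if dst not in prev:
            return [src, dst]   # disconnected cells: leave the gap as-is
        path, u = [], dst
        while u is not None:
            path.append(u)
            u = prev[u]
        return path[::-1]

    out = []
    for T in trajectories:
        newT = [T[0]]
        for p, q in zip(T, T[1:]):
            if math.dist(p, q) >= eps:
                newT += [center(c) for c in shortest_path(cell(p), cell(q))[1:-1]]
            newT.append(q)
        out.append(newT)
    return out
```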

Sparsifying candidate pathlets. The subtrajectory clustering problem allows us to select pathlets from the full space of point sequences. By focusing on just the subtrajectories, or even the canonical pathlets, we are sampling from the space of candidate pathlets. Recall that the primary motivation behind our work is that large groups of subtrajectories tend to cluster close to one another. Since we construct candidate pathlets from each input trajectory independently, we tend to oversample the neighborhood of a pathlet (a point in the pathlet space) that is shared by many trajectories. As our pathlet-clustering algorithm requires time proportional to the number of candidate pathlets, we perform a sparsifying step to remove redundant canonical pathlets from oversampled areas.

The high-level idea behind our procedure is to project each canonical pathlet to Euclidean space; similar canonical pathlets should project to nearby points. We then apply standard clustering techniques to find groups of similar points, of which only one pathlet's point will remain a candidate. Let $d$ and $\omega$ be parameters. For each canonical pathlet $P \in C$, we select $d$ points lying equidistant along the piecewise-linear curve through $P$'s points. Let these points be $\langle (x_1, y_1), (x_2, y_2), \ldots, (x_d, y_d) \rangle$. We create a $2d$-dimensional point $p_P = (x_1, y_1, x_2, y_2, \ldots, x_d, y_d)$. We then iterate over the pathlets again, marking some as removed as we do so. Initially, none of the pathlets are marked. Now, consider a pathlet $P \in C$ chosen during the iteration. If $P$ is already marked as removed, we ignore it and continue to the next pathlet in the iteration. Otherwise, we consider the ball of radius $\omega$ centered at $p_P$ and mark all other pathlets' points within the ball as removed. When we have finished iterating over the canonical pathlets, we remove the marked ones from the set of candidate pathlets used in the pathlet-cover algorithm. See Figure 5.4 for an example of a cluster of 32 similar pathlets and the one chosen in the sparsifying step. A minimal sketch of this procedure appears below.
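The sketch below assumes each pathlet is a list of (x, y) tuples and uses a brute-force scan over the embedded points; the helper names embed_pathlet and sparsify are ours. A grid or k-d tree over the $2d$-dimensional points would avoid the quadratic scan, but we omit that here for brevity.

import math

def embed_pathlet(pathlet, d):
    # Sample d points equidistant along the piecewise-linear curve through the
    # pathlet's points, flattened into a 2d-dimensional vector.
    if len(pathlet) < 2:
        return list(pathlet[0]) * d            # degenerate pathlet: repeat its point
    lengths = [math.dist(p, q) for p, q in zip(pathlet, pathlet[1:])]
    total = sum(lengths)
    coords = []
    for i in range(d):
        t = total * i / (d - 1) if d > 1 else 0.0   # arc-length position of sample i
        j, acc = 0, 0.0
        while j < len(lengths) - 1 and acc + lengths[j] < t:
            acc += lengths[j]
            j += 1
        frac = (t - acc) / lengths[j] if lengths[j] > 0 else 0.0
        (x0, y0), (x1, y1) = pathlet[j], pathlet[j + 1]
        coords.extend((x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)))
    return coords

def sparsify(pathlets, d, omega):
    points = [embed_pathlet(P, d) for P in pathlets]
    removed = [False] * len(pathlets)
    for i in range(len(pathlets)):
        if removed[i]:
            continue                           # already absorbed by an earlier pathlet
        for j in range(len(pathlets)):
            # Mark every *other* pathlet whose embedded point lies in the ball
            # of radius omega centered at points[i].
            if j != i and not removed[j] and math.dist(points[i], points[j]) <= omega:
                removed[j] = True
    return [P for P, r in zip(pathlets, removed) if not r]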

5.7 Experiments

We describe experimental results that highlight the desirable properties of our model mentioned earlier and demonstrate the efficiency of our algorithm. In particular, we show that our model can handle noisy and diverse data sets without compromising cluster quality. Further, our algorithm is fast, the pathlets it finds are purely data-driven (hence can be complex), and they capture the underlying shared structure. We also show the sensitivity of our results to varying parameters. All our experiments were run on an Intel Xeon 2.4 GHz computer.

Data sets. We used both real and synthetic data sets for our experiments. The real data sets are as follows. (i) The Beijing data set contains taxi trajectories in Beijing, China [7]. It has month-long trajectory data for 28,000 cabs in Beijing (about 60 million points). Each trajectory originally consists of multiple trips of a taxi; we break it up so that each trip is its own trajectory. We run experiments on a subset of this data having 9 million points, spanning four days. (ii) The GeoLife data set was collected by Microsoft Research Asia [6]. It tracks 182 users over four years, recording a broad range of outdoor movements such as walking, biking, and hiking. Although the data is spread over 30 cities in China and some cities in the USA and Europe, we only took trajectories in Beijing, which contains most of the data. We focus on pedestrian trajectories only, which results in 2,657 trajectories containing 1,473,115 points. (iii) The Cycling data set is a set of cycling trajectories in Durham, North Carolina. It consists of 37 trajectories with 106,791 points in total. The trajectories are quite complicated, containing self-intersections and repeated portions. Besides the above real data sets, we also used a synthetic data set produced by the University of Minnesota Web-based Traffic Generator [8], which generates traffic data from the underlying road network of any region of the world. The traffic is generated using the model of Brinkhoff [47], which accounts for real-world conditions such as varying speeds, road congestion, etc. Denoted RTP, this data set covers the Research Triangle region [5] of North Carolina. It has around 20,000 trajectories and around 1 million points.

Figure 5.5: The heatmap (left) and major pathlets (right) of the Beijing data set.

Figure 5.6: The heatmap (left) and major pathlets (right) of the RTP data set.

Visualization. We visually inspect the trajectories and the pathlets generated, demonstrating various desirable properties of our model and algorithm.10

Dense and popular portions. Our algorithm chooses pathlets from areas that have a high density of trajectory points. Figure 5.5 shows the Beijing data set and the top 100 of the roughly 8,000 pathlets picked. The first pathlet chosen for Beijing is located at the Beijing Capital International Airport (PEK), which has a high point density on the heat map. Other top pathlets include the S12 (the highway to the airport) and other highways leading into the city, as well as shorter pathlets closer to the city center. Figure 5.6 shows a portion of the RTP data set and the top 35 of the roughly 8,500 pathlets picked by our algorithm. The two long highlighted pathlets follow two popular highways that have a high point density on the heat map, yet they were chosen only 26th and 32nd. Most trips within the region use relatively short portions of each highway rather than traversing its entire length, so other, shorter pathlets were chosen earlier.

Robustness to variations. Our algorithm works well even if there is no underlying road network and the data is noisy. For instance, the GeoLife data set is much more varied and noisy. Figure 5.8 shows the top 50 pathlets for GeoLife; these are in general shorter than the pathlets for Beijing, probably because we focus on pedestrian data and the trajectories are not as long. Figure 5.8 also shows a cluster of very similar subtrajectories. The clean subtrajectory in the middle is picked as a pathlet and serves as a less noisy representative for the entire cluster.

10 The algorithm was run with the following parameters to generate the visualizations: $(c_1, c_2, c_3)$ values for Beijing: $(1, 1.5, 0.005)$; GeoLife, Cycling, and RTP: $(1, 1, 0.005)$.

Figure 5.7: A set of trajectories from RTP (left). The blue trajectory takes a detour from the remaining trajectories before merging again. Without gaps, the blue trajectory is picked as a pathlet on its own, with the remaining trajectories assigned to the red pathlet (center); with gaps, three pathlets are produced (right), and the detour is treated as a gap.

Figure 5.8: Major pathlets of the GeoLife data set (left); a cluster of subtrajectories (right) with the darker representative pathlet.

Our algorithm successfully finds common portions of trajectories that are otherwise diverse. Figure 5.10 shows trajectories from the RTP (left) and Beijing (right) data sets passing through a common region that is captured by the highlighted pathlet. The notion of gaps in our model gracefully handles individual variations in trajectories. We ran the algorithm on the RTP data set with a very large value of $c_2$ (highly penalizing gaps, and effectively removing them); see Figure 5.7.

Cluster quality. Our objective function measures cluster quality by taking the sum of distances of the subtrajectories in a cluster to the pathlet, rather than other measures such as the maximum distance to the pathlet. This results in tighter clusters (see the right picture of Figure 5.9).

Trajectory reconstruction. We can approximately reconstruct individual trajectories by simply concatenating the pathlets they are assigned to (see the left picture of Figure 5.9).

Data-driven pathlets. The pathlets found are completely data-driven and can be quite complex. Figure 5.11 shows a trajectory of the Cycling data set and the first two pathlets assigned to it. The trajectory contains loops, reflecting the cyclist going around a track and changing directions. Pathlets do indicate direction of travel; in particular, the first two pathlets assigned to the cycling trajectory go in each direction of the loop, as shown in the figure.

Figure 5.9: Left: trajectories (in black) from Beijing and RTP, respectively, and the pathlets (highlighted in color) to which they are assigned. Right: a single cluster found in RTP while using the max distance to measure cluster quality, picking out only the middle, darker pathlet. Using the sum of distances splits the above cluster into two: the middle pathlet's cluster and the second pathlet's cluster.

Figure 5.10: Full trajectories sharing a common portion, captured by pathlets with assigned subtrajectories.

Quantitative analysis. The running time of our algorithm depends on the number of points, the number of candidate pathlets, and the complexity of the trajectories. While RTP and GeoLife have a similar number of points, their runtimes were 1 hour and 2 hours, respectively; see Table 5.1. The GeoLife trajectories tend to be more complex and have more intersections. The much larger Beijing data set ran in 12 hours. It is possible to reduce runtimes by using a tighter bound on the maximum Fréchet distance, not penalizing the Fréchet distance (setting $c_3 = 0$), stopping the greedy algorithm when the increase in coverage becomes too small (see the sketch below), or sparsifying the candidate pathlets (Section 5.6).

Table 5.1 shows the effects of sparsifying the candidate pathlets. Intuitively, the sparsifying step is meant to remove redundant pathlets from the set of candidates. Accordingly, the number of candidates after sparsifying stayed roughly constant even as we took larger samples of the Beijing data set. In general, the speed-up increases as the proportion of pathlets remaining after sparsification decreases. The pathlets picked at the beginning cover more points than those picked later. The left plot in Figure 5.12 shows the cumulative coverage (as a fraction of the total points) of pathlets picked in iterations of the greedy algorithm for the Beijing data set, with and without sparsifying the candidate pathlets; one can see that the two plots are almost identical.

Figure 5.11: A self-intersecting trajectory from Cycling (left), and the first two pathlets assigned to it.

Data set    # points      Runtime (hours)   Runtime after sparsifying (hours)   # pathlets    # pathlets after sparsifying

RTP         1,084,257      1.02              0.54                                 150,142      10,217
GeoLife     1,473,115      2.05              0.83                                 213,478      11,456
Beijing     9,121,035     12.26              8.12                                 520,138      18,277
Beijing     20,012,380    50.39             21.08                               1,268,598      22,651

Table 5.1: Runtimes showing the speed-up with pathlet sparsification. $(c_1, c_2, c_3)$ values: GeoLife and RTP: $(1, 1, 0.005)$; smaller Beijing: $(1, 1.5, 0.005)$; larger Beijing: $(1, 1.5, 0)$.

Further, the marginal increase in coverage goes down drastically for later pathlets. We also investigated the number of pathlets assigned per trajectory. The left plot in Figure 5.13 shows the frequency of trajectories versus the number of pathlets assigned per trajectory for the Beijing data set. There is a very long tail, meaning that some trajectories have a large number of pathlets assigned to them. However, the tail mostly consists of pathlets added later by the algorithm; these are shorter, and many of them are assigned to only a single trajectory. The right plot in Figure 5.13 shows the same plot restricted to the top 50 and top 100 pathlets; the top pathlets are usually concatenated in smaller numbers per trajectory.
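The diminishing marginal coverage suggests a simple early-stopping rule for the greedy algorithm. The following Python sketch is a schematic of that rule only, not the full greedy pathlet-cover algorithm described earlier in the chapter; the names greedy_with_early_stopping and coverage_of, and the exact form of the stopping threshold, are ours, and coverage_of is assumed to return the set of point identifiers a candidate pathlet covers.

def greedy_with_early_stopping(candidates, total_points, coverage_of, min_gain=1e-4):
    covered = set()
    chosen = []
    remaining = list(candidates)
    while remaining:
        # Pick the candidate covering the most points not yet covered.
        best = max(remaining, key=lambda P: len(coverage_of(P) - covered))
        gain = len(coverage_of(best) - covered)
        # Stop once the marginal coverage, as a fraction of all points,
        # becomes too small to justify keeping another pathlet.
        if gain < min_gain * total_points:
            break
        chosen.append(best)
        covered |= coverage_of(best)
        remaining.remove(best)
    return chosen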

Parameter sensitivity. We conclude by considering the effect each parameter has on the pathlets chosen, as well as on the coverage of the trajectories' points. Keeping $c_1$ and $c_3$ fixed, we vary $c_2$. A lower value of $c_2$ means it is less expensive to leave points unassigned to any pathlet; thus increasing $c_2$ increases the percentage of assigned points, e.g., from around 64% ($c_2 = 1.5$) to 84% ($c_2 = 2.5$) for Beijing. However, the data itself also affects how many of its points get assigned. To get the same percentage of assigned points, around 74%, we need $c_2 = 2$ for Beijing but only $c_2 = 1$ for RTP and GeoLife.

Figure 5.12: The cumulative coverage of chosen pathlets with and without sparsification (left). Cumulative coverage plots for different values of $c_3$ (with $c_1 = 1$, $c_2 = 1.75$) (right). Both plots are for Beijing.

Figure 5.13: Plots showing the number of pathlets assigned per trajectory.

Urban centers lead to more route diversity (and somewhat less shared structure), hence we need larger $c_2$ values for Beijing to get the same percentage of assigned points, whereas traffic tends to cluster along the relatively few long highways in RTP. For GeoLife, although the $c_2$ values are similar to those of RTP, in general the number of pathlets picked is smaller than for RTP (perhaps because RTP has more shared portions, and GeoLife has far fewer trajectories and hence more points per trajectory). The average number of subtrajectories assigned per pathlet decreases as $c_2$ increases; thus the pathlets cover less shared structure.

We now vary $c_3$ while keeping $c_1$ and $c_2$ fixed. The right plot in Figure 5.12 shows the impact on the cumulative coverage in Beijing as pathlets are added, for different values of $c_3$. The number of pathlets chosen increases with $c_3$, from 8,169 for $c_3 = 0$, to 10,021 for $c_3 = 0.0025$, to 11,223 for $c_3 = 0.005$. Similar trends are observed for the other data sets. As our experiments suggest, increasing $c_3$ discourages the assignment of many long subtrajectories to individual pathlets, since the distance between these subtrajectories and their pathlets has an increasingly large influence on our objective function.
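To make the roles of the three parameters concrete, the following sketch evaluates an objective of the schematic form suggested by the discussion above: a $c_1$ charge per pathlet in the dictionary, a $c_2$ penalty per unassigned (gap) point, and a $c_3$-weighted sum of distances between subtrajectories and their assigned pathlets. This is an illustrative reading of the trade-offs, not the precise objective $\mu$ defined earlier in the chapter; the names objective and frechet are ours, and the distance routine is assumed to be supplied.

def objective(pathlets, assignments, num_unassigned, frechet, c1=1.0, c2=1.0, c3=0.005):
    # pathlets: the chosen dictionary; assignments: (subtrajectory, pathlet) pairs;
    # num_unassigned: points left in gaps; frechet: a supplied distance routine.
    dictionary_cost = c1 * len(pathlets)             # one charge per pathlet kept
    gap_cost = c2 * num_unassigned                   # charge per unassigned point
    distance_cost = c3 * sum(frechet(s, p) for s, p in assignments)
    return dictionary_cost + gap_cost + distance_cost

Under this reading, raising $c_2$ makes gaps expensive (more points get assigned), while raising $c_3$ makes loose assignments expensive (more, tighter clusters), matching the trends observed above.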

5.8 Conclusion

We presented a model for constructing a pathlet dictionary and considered two optimization problems based on our model: pathlet-cover and subtrajectory clustering. After presenting hardness results, we described polynomial-time polylogarithmic approximation algorithms for both problems. Finally, we evaluated our model and algorithms through experiments, visualizing and quantitatively analyzing the results. It would be interesting to see how our techniques could be generalized or applied to other settings. In particular, we would like to know whether our model can be applied in contexts such as motion capture, or whether our techniques could be used to fit generative models to trajectory data. It would also be good to know whether our algorithms can be implemented in parallel or I/O-efficiently without sacrificing much in solution quality.

6 Conclusion

We conclude by briefly summarizing the contributions of this dissertation and highlighting some important general research directions.

Parallel algorithms. We designed algorithms to compute, store, and query distributed data structures for answering orthogonal range-searching and nearest-neighbor queries in the MPC model of computation. We also proposed an MPC algorithm to compute the Delaunay triangulation of a point set in 2D, and the contour tree of a real-valued, piecewise-linear function defined on the triangulation. Our algorithms are the first to have theoretical guarantees on their performance, and they are asymptotically optimal or near-optimal. It would be interesting to see whether such provably efficient algorithms can also be designed for other geometric and topological problems, e.g., point-location queries, persistent homology, etc. Several basic problems are still open in the MPC model, e.g., graph connectivity in sub-logarithmic rounds. See also [102] for further open problems in parallel query processing in the MPC model.

Gromov-Hausdorff distance. We proved the first hardness results for computing the Gromov-Hausdorff distance between two metric spaces, showing that it is NP-hard to approximate to a factor better than 3, even for tree metrics. We also gave an $O(\min\{n, \sqrt{rn}\})$-factor approximation algorithm for tree metrics, where $r$ is the ratio of the length of the longest edge to that of the shortest edge. We also gave algorithms to compute the interleaving distance between merge trees with similar performance guarantees. An important open problem is to close the gap between the upper and lower bounds on the approximation factor. However, this seems challenging, since the key technique of using a graph isomorphism does not seem to lead to a better approximation. We hope our work further stimulates research in the area of computing correspondences between metric spaces having low additive distortion.

Subtrajectory clustering. We proposed a new model for clustering subtrajectories. Such clusters capture the shared movement patterns in the input trajectories in the form of a representative pathlet for each cluster. We proved hardness results for the subtrajectory clustering problem. We also gave approximation algorithms for it, which can be made faster for realistic input trajectories. We also demonstrated the practicality of our algorithm through extensive experiments. While our algorithms are provably efficient, we used a heuristic approach to further speed up our experiments: we reduced the number of candidate pathlets by running a single-linkage-type procedure to cluster the set of candidate pathlets. This gives rise to an important question concerning the clustering of curves (say, under the discrete Fréchet distance) for optimizing the k-median, k-means, and k-center objectives. Apart from results that apply to general metric spaces, very little is known about algorithms with provable performance guarantees that are specific to clustering curves. The closest work is that of Driemel et al. [75] on clustering time-series data (i.e., sequences of real numbers); however, their work does not seem to extend easily to curves. One important hurdle is the fact that the Fréchet distance has infinite doubling dimension, so the sampling framework of [11] does not work. Another approach is to look beyond worst-case instances, taking a cue from the "CDNM" thesis: Clustering is Difficult only when it does Not Matter [66]. Various notions of clusterability of data sets have been developed [10], and many clustering problems that are hard to solve in the worst case can be solved optimally for such instances [29, 37, 143].

Bibliography

[1] https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm.

[2] https://www.nature.com/news/the-fight-to-save-thousands-of-lives-with-sea-floor-sensors-1.22178.

[3] https://cacm.acm.org/magazines/2017/9/220427-data-sketching/fulltext.

[4] http://pig.apache.org/.

[5] Research Triangle, 2003.

[6] GeoLife. http://research.microsoft.com/en-us/projects/GeoLife/, 2009.

[7] Taxi trajectory open dataset, Tsinghua University, China, 2009. http://sensor.ee.tsinghua.edu.cn.

[8] MNTG: Minnesota web-based traffic generator, 2013. http://mntg.cs.umn.edu/tg/index.php.

[9] A. Acharya and V. Natarajan. A parallel and memory efficient algorithm for constructing the contour tree. In Proc. IEEE Pacific Vis. Symp., pages 271–278, 2015.

[10] M. Ackerman and S. Ben-David. Clusterability: A theoretical study. In Artif. Intell. Stat., pages 1–8, 2009.

[11] M. R. Ackermann, J. Blömer, and C. Sohler. Clustering for metric and non-metric distance measures. ACM Trans. Alg., 6(4):59, 2010.

[12] F. N. Afrati, A. D. Sarma, S. Salihoglu, and J. D. Ullman. Upper and lower bounds on the cost of a MapReduce computation. Proc. VLDB Endow., 6(4):277–288, 2013.

[13] P. K. Agarwal, L. Arge, O. Procopiuc, and J. S. Vitter. A framework for index bulk loading and dynamization. In Proc. Int. Colloq. Automata, Lang. Program., pages 115–127, 2001.

[14] P. K. Agarwal, L. Arge, and K. Yi. I/O-efficient construction of constrained Delaunay triangulations. In Proc. Euro. Symp. Algo., pages 355–366, 2005.

[15] P. K. Agarwal, L. Arge, and K. Yi. I/O-efficient batched union-find and its applications to terrain analysis. ACM Trans. Alg., 7(1):11:1–11:21, 2010.

[16] P. K. Agarwal, M. de Berg, J. Gudmundsson, M. Hammar, and H. J. Haverkort. Box-trees and R-trees with near-optimal query time. Disc. Computat. Geom., 28(3):291–312, 2002.

[17] P. K. Agarwal and J. Erickson. Geometric range searching and its relatives. In B. Chazelle, J. Goodman, and R. Pollack, editors, Adv. Disc. Computat. Geom., pages 1–56. American Mathematical Society, 1998.

[18] P. K. Agarwal, K. Fox, K. Munagala, and A. Nath. Parallel algorithms for constructing range and nearest-neighbor searching data structures. In Proc. ACM Symp. Princ. Database Sys., pages 429–440. ACM, 2016.

[19] P. K. Agarwal, K. Fox, K. Munagala, A. Nath, J. Pan, and E. Taylor. Subtrajectory clustering: Models and algorithms. In Proc. ACM Symp. Princ. Database Sys., pages 75–87. ACM, 2018.

[20] P. K. Agarwal, K. Fox, and A. Nath. Maintaining Reeb graphs of triangulated 2-manifolds. In LIPIcs-Leibniz Int. Proc. Informatics, volume 93. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

[21] P. K. Agarwal, K. Fox, A. Nath, A. Sidiropoulos, and Y. Wang. Computing the Gromov-Hausdorff distance for metric trees. ACM Trans. Alg., 14(2):24, 2018.

[22] P. K. Agarwal, T. Mølhave, M. Revsbæk, I. Safa, Y. Wang, and J. Yang. Maintaining contour trees of dynamic terrains. In Proc. Int. Symp. Computat. Geom., pages 796–811, 2015.

[23] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Comm. ACM, 31(9):1116–1127, 1988.

[24] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.

[25] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Trajectory space: A dual representation for nonrigid structure from motion. IEEE Trans. Pattern Analy. Mach. Intell., 33(7):1442–1456, 2011.

[26] T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernandez-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The Dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow., 8(12):1792–1803, 2015.

[27] L. Alarabi, A. Eldawy, R. Alghamdi, and M. F. Mokbel. TAREEG: a MapReduce-based system for extracting spatial data from OpenStreetMap. In Proc. ACM Int. Conf. Advances Geog. Inf. Sys., pages 83–92, 2014.

[28] A. Andoni, A. Nikolov, K. Onak, and G. Yaroslavtsev. Parallel algorithms for geometric graph problems. In Proc. ACM Symp. Theory Comput., pages 574–583, 2014.

[29] H. Angelidakis, K. Makarychev, and Y. Makarychev. Algorithms for stable and perturbation-resilient problems. In Proc. ACM Symp. Theory Comput., pages 438–451. ACM, 2017.

[30] L. Arge. External memory data structures. In J. Abello, P. M. Pardalos, and M. G. Resende, editors, Handbook of Massive Data Sets, pages 313–357. Springer, 2002.

[31] L. Arge, G. S. Brodal, and R. Fagerberg. Cache-oblivious data structures. In D. Mehta and S. Sahni, editors, Handbook of Data Structures and Applications. CRC Press, 2005.

[32] L. Arge, M. Revsbæk, and N. Zeh. I/O-efficient computation of water flow across a terrain. In Proc. ACM Symp. Computat. Geom., pages 403–412, 2010.

[33] B. Aronov, S. Har-Peled, C. Knauer, Y. Wang, and C. Wenk. Fréchet distance for curves, revisited. In Proc. Euro. Symp. Alg., pages 52–63, 2006.

[34] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM, 45(6):891–923, 1998.

[35] F. Aurenhammer, R. Klein, and D.-T. Lee. Voronoi Diagrams and Delaunay Triangulations. World Scientific, 2013.

[36] B. Bahmani, A. Goel, and R. Shinde. Efficient distributed locality sensitive hashing. In Proc. ACM Int. Conf. Info. Know. Manage., pages 2174–2178, 2012.

[37] M.-F. Balcan, A. Blum, and A. Gupta. Approximate clustering without the approximation. In Proc. ACM-SIAM Symp. Disc. Alg., pages 1068–1077. Society for Industrial and Applied Mathematics, 2009.

[38] U. Bauer, X. Ge, and Y. Wang. Measuring distance between Reeb graphs. In Proc. Symp. Computat. Geom., pages 464–473, 2014.

[39] U. Bauer, M. Kerber, and J. Reininghaus. Clear and compress: Computing persistent homology in chunks. In Topological Methods in Data Analysis and Vis. III, pages 103–117. Springer, 2014.

[40] U. Bauer, M. Kerber, and J. Reininghaus. Distributed computation of persistent homology. In Proc. Workshop Alg. Engg. Exp., pages 31–38. SIAM, 2014.

[41] P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. In Proc. ACM Symp. Princ. Database Sys., pages 273–284, 2013.

[42] K. G. Bemis, D. Silver, P. A. Rona, and C. Feng. Case study: a methodology for plume visualization with application to real-time acquisition and navigation. In Proc. IEEE Conf. Vis., pages 481–484, 2000.

[43] H. B. Bjerkevik and M. B. Botnan. Computational complexity of the interleaving distance. In Proc. Int. Symp. Computat. Geom., pages 13:1–13:15, 2018.

[44] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Research, 3:993–1022, 2003.

[45] G. E. Blelloch, J. C. Hardwick, G. L. Miller, and D. Talmor. Design and implementation of a practical parallel Delaunay algorithm. Algorithmica, 24(3-4):243–269, 1999.

[46] P. Bose and L. Devroye. On the stabbing number of a random Delaunay triangulation. Computat. Geom., 36(2):89–105, 2007.

[47] T. Brinkhoff. Generating network-based moving objects. In Proc. Int. Conf. Scien. Statis. Database Manag., pages 253–255. IEEE, 2000.

[48] A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Efficient computation of isometry-invariant distances between surfaces. SIAM J. Sci. Comp., 28(5):1812–1836, 2006.

[49] K. Buchin, M. Buchin, D. Duran, B. T. Fasy, R. Jacobs, V. Sacristán, R. I. Silveira, F. Staals, and C. Wenk. Clustering trajectories for map construction. In Proc. ACM Int. Conf. Advances Geog. Info. Sys. ACM, 2017.

[50] K. Buchin, M. Buchin, J. Gudmundsson, M. Löffler, and J. Luo. Detecting commuting patterns by clustering subtrajectories. Int. J. Computat. Geom. & Appl., 21(03):253–282, 2011.

[51] K. Buchin, M. Buchin, M. Van Kreveld, M. Löffler, R. I. Silveira, C. Wenk, and L. Wiratma. Median trajectories. Algorithmica, 66(3):595–614, 2013.

[52] K. Buchin, M. Buchin, M. Van Kreveld, and J. Luo. Finding long and similar parts of trajectories. Computat. Geom., 44(9):465–476, 2011.

[53] M. Buchin, A. Driemel, M. van Kreveld, and V. Sacristán. Segmenting trajectories: A framework and algorithms using spatiotemporal criteria. J. Spat. Inf. Sc., (3):33–63, 2014.

[54] D. Burago, Y. Burago, and S. Ivanov. A Course in Metric Geometry. American Mathematical Society, 2001.

[55] G. Carlsson and F. Mémoli. Characterization, stability and convergence of hierarchical clustering methods. J. Mach. Learn. Research, 11:1425–1470, 2010.

[56] H. Carr, J. Snoeyink, and U. Axen. Computing contour trees in all dimensions. Computat. Geom., 24(2):75–94, 2003.

[57] H. Carroll, M. J. Clement, P. Ridge, and Q. O. Snell. Effects of gap open and gap extension penalties. 2006.

[58] T. M. Chan. Optimal partition trees. Disc. Computat. Geom., 47(4):661–690, 2012.

[59] F. Chazal, D. Cohen-Steiner, L. J. Guibas, F. Mémoli, and S. Y. Oudot. Gromov-Hausdorff stable signatures for shapes using persistence. In Computer Graphics Forum, volume 28, pages 1393–1403. Wiley Online Library, 2009.

[60] B. Chazelle. The Discrepancy Method - Randomness and Complexity. Cambridge University Press, 2001.

[61] B. Chazelle and J. Friedman. A deterministic view of random sampling and its use in geometry. Combinatorica, 10(3):229–249, 1990.

[62] C. Chen, H. Su, Q. Huang, L. Zhang, and L. Guibas. Pathlet learning for compressing and planning trajectories. In Proc. ACM Int. Conf. Adv. Geog. Inf. Sys., pages 382–385. ACM, 2013.

[63] L. P. Chew, M. T. Goodrich, D. P. Huttenlocher, K. Kedem, J. M. Kleinberg, and D. Kravets. Geometric pattern matching under Euclidean motion. Computat. Geom., 7(1-2):113–124, 1997.

[64] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3. ed.). MIT Press, 2009.

[65] A. Crauser, P. Ferragina, K. Mehlhorn, U. Meyer, and E. A. Ramos. Randomized external-memory algorithms for line segment intersection and other geometric problems. Int. J. Computat. Geom. Appl., 11(3):305–337, 2001.

[66] A. Daniely, N. Linial, and M. Saks. Clustering is difficult only when it does not matter. arXiv preprint arXiv:1205.4891, 2012.

[67] A. Danner, T. Mølhave, K. Yi, P. K. Agarwal, L. Arge, and H. Mitásová. TerraStream: from elevation data to watershed hierarchies. In Proc. Int. Symp. Geog. Inf. Sys., page 28, 2007.

[68] M. de Berg, M. van Kreveld, M. Overmars, and O. Cheong. Computational Geometry. Springer, 3rd edition, 2000.

[69] V. De Silva, E. Munch, and A. Patel. Categorified Reeb graphs. Disc. Computat. Geom., 55(4):854–906, 2016.

[70] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Comm. ACM, 51(1):107–113, 2008.

[71] F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In Proc. Symp. Computat. Geom., pages 298–307, 1993.

[72] T. K. Dey, D. Shi, and Y. Wang. Comparing graphs via persistence distortion. CoRR, abs/1503.07414, 2015.

[73] T. K. Dey, D. Shi, and Y. Wang. Comparing graphs via persistence distortion. In Proc. Int. Symp. Computat. Geom., pages 491–506, 2015.

[74] A. Driemel, S. Har-Peled, and C. Wenk. Approximating the Fréchet distance for realistic curves in near linear time. Disc. Computat. Geom., 48(1):94–127, 2012.

[75] A. Driemel, A. Krivošija, and C. Sohler. Clustering time series under the Fréchet distance. In Proc. ACM-SIAM Symp. Disc. Alg., pages 766–785. Society for Industrial and Applied Mathematics, 2016.

[76] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process., 15(12):3736–3745, 2006.

[77] A. Eldawy, L. Alarabi, and M. F. Mokbel. Spatial partitioning techniques in SpatialHadoop. Proc. VLDB Endow., 8(12):1602–1613, 2015.

[78] A. Eldawy, Y. Li, M. F. Mokbel, and R. Janardan. CG Hadoop: computational geometry in MapReduce. In Proc. ACM Int. Conf. Advances Geog. Inf. Sys., pages 284–293, 2013.

[79] A. Eldawy and M. F. Mokbel. SpatialHadoop: A MapReduce framework for spatial data. In Proc. IEEE Int. Conf. Data Eng., pages 1352–1363, 2015.

[80] A. Ene, S. Im, and B. Moseley. Fast clustering using MapReduce. In Proc. ACM Int. Conf. Know. Discovery Data Mining, pages 681–689, 2011.

[81] G. Evangelidis, D. B. Lomet, and B. Salzberg. The hB-Pi-tree: A multi-attribute index supporting concurrency, recovery and node consolidation. VLDB J., 6(1):1–25, 1997.

[82] Q. Fan, D. Zhang, H. Wu, and K.-L. Tan. A general and parallel platform for mining co-movement patterns over large-scale trajectories. Proc. VLDB Endow., 10(4):313–324, 2016.

[83] P. W. Finn, L. E. Kavraki, J.-C. Latombe, R. Motwani, C. Shelton, S. Venkatasubramanian, and A. Yao. Rapid: randomized pharmacophore identification for drug design. In Proc. Symp. Computat. Geom., pages 324–333. ACM, 1997.

[84] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. IEEE Symp. Found. Comp. Sci., pages 285–298, 1999.

[85] S. Gaffney and P. Smyth. Trajectory clustering with mixtures of regression models. In Proc. ACM Int. Conf. Know. Discovery Data Mining, pages 63–72. ACM, 1999.

[86] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.

[87] A. Goel and K. Munagala. Complexity measures for MapReduce, and comparison to parallel computing. CoRR, abs/1211.6526, 2012.

[88] M. T. Goodrich. Randomized fully-scalable BSP techniques for multi-searching and convex hull construction (preliminary version). In Proc. 8th ACM-SIAM Annu. Symp. Disc. Alg., pages 767–776, 1997.

[89] M. T. Goodrich. Communication-efficient parallel sorting. SIAM J. Comput., 29(2):416–432, 1999.

[90] M. T. Goodrich. Parallel algorithms in geometry. In Handbook of Discrete and Computational Geometry, Second Edition., pages 953–967. 2004.

[91] M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, searching, and simulation in the MapReduce framework. In Proc. Int. Symp. Alg. Comput., pages 374–383, 2011.

[92] M. Gromov. Metric Structures for Riemannian and Non-Riemannian Spaces. Birkhäuser Basel, 2007.

[93] J. Gudmundsson, A. Thom, and J. Vahrenhold. Of motifs and goals: mining trajectory data. In Proc. ACM Int. Conf. Adv. Geog. Inf. Sys., pages 129–138. ACM, 2012.

[94] A. Hall and C. Papadimitriou. Approximating the distortion. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 111–122. Springer, 2005.

[95] D. Haussler and E. Welzl. ε-nets and simplex range queries. Disc. Computat. Geom., 2:127–151, 1987.

[96] J. E. Hopcroft and R. M. Karp. An $n^{5/2}$ algorithm for maximum matchings in bipartite graphs. SIAM J. Comp., 2(4):225–231, 1973.

[97] C.-C. Hung, W.-C. Peng, and W.-C. Lee. Clustering and aggregating clues of trajectories for mining trajectory patterns and routes. VLDB J., 24(2):169–192, 2015.

[98] D. P. Huttenlocher, K. Kedem, and J. M. Kleinberg. On dynamic Voronoi diagrams and the minimum Hausdorff distance for point sets under Euclidean motion in the plane. In Proc. Symp. Computat. Geom., pages 110–119. ACM, 1992.

[99] E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for MapReduce programs. Proc. VLDB Endow., 4(6):385–396, 2011.

[100] H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for MapReduce. In Proc. ACM-SIAM Symp. Disc. Alg., pages 938–948, 2010.

[101] C. Kenyon, Y. Rabani, and A. Sinclair. Low distortion maps between point sets. SIAM J. Comp., 39(4):1617–1636, 2009.

[102] P. Koutris, S. Salihoglu, D. Suciu, et al. Algorithmic aspects of parallel data processing. Foundations and Trends in Databases, 8(4):239–370, 2018.

[103] J. Krarup and P. M. Pruzan. The simple plant location problem: survey and synthesis. Euro. J. Operational Research, 12(1):36–81, 1983.

[104] R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in MapReduce and streaming. ACM Trans. Parallel Comput., 2(3):14:1–14:22, 2015.

[105] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: A method for solving graph problems in MapReduce. In Proc. ACM Symp. Parallel Alg. Arch., pages 85–94, 2011.

[106] J.-G. Lee, J. Han, and K.-Y. Whang. Trajectory clustering: a partition-and-group framework. In Proc. ACM Int. Conf. Manage. Data, pages 593–604. ACM, 2007.

[107] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon. Parallel data processing with MapReduce: a survey. SIGMOD Rec., 40(4):11–20, 2012.

[108] R. Lewis and D. Morozov. Parallel computation of persistent homology using the blowup complex. In Proc. ACM Symp. Parallel Alg. Arch., pages 323–331. ACM, 2015.

[109] Y. Li, Q. Huang, M. Kerber, L. Zhang, and L. Guibas. Large-scale joint map matching of GPS traces. In Proc. ACM Int. Conf. Adv. Geog. Info. Sys., pages 214–223. ACM, 2013.

[110] Y. Li, Y. Li, D. Gunopulos, and L. Guibas. Knowledge-based trajectory completion from sparse GPS samples. In Proc. ACM Int. Conf. Adv. Geog. Info. Sys., page 33. ACM, 2016.

[111] Y. Li, P. M. Long, and A. Srinivasan. Improved bounds on the sample complexity of learning. J. Comp. Sys. Sci., 62(3):516–527, 2001.

[112] Q. Liu and G. van Ryzin. On the choice-based linear programming model for network revenue management. Manufac. Serv. Op. Manage., 10(2):233–310, 2008.

[113] D. B. Lomet and B. Salzberg. The hB-tree: A multiattribute indexing method with good guaranteed performance. ACM Trans. Database Sys., 15(4):625–658, 1990.

[114] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proc. ACM Int. Conf. Manage. Data, pages 135–146, 2010.

[115] J. Matoušek. Reporting points in halfspaces. Computat. Geom. Theory Appl., 2(3):169–186, 1992.

[116] J. Matoušek. Approximations and optimal geometric divide-and-conquer. J. Comp. Sys. Sci., 50(2):203–208, 1995.

[117] N. Max, R. Crawfis, and D. Williams. Visualization for climate modeling. IEEE Comp. Graphics Appl., 13(4):34–40, 1993.

[118] F. Mémoli. On the use of Gromov-Hausdorff distances for shape comparison. In Proc. Symp. Point Based Graphics, pages 81–90, 2007.

[119] F. Mémoli and G. Sapiro. A theoretical and computational framework for isometry invariant recognition of point cloud data. Found. Computat. Math., 5(3):313–347, 2005.

[120] M. Mitzenmacher and E. Upfal. Probability and Computing - Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

[121] D. Morozov, K. Beketayev, and G. Weber. Interleaving distance between merge trees. Disc. Computat. Geom., 49(22-45):52, 2013.

[122] D. Morozov and G. H. Weber. Distributed merge trees. In Proc. ACM Symp. Princ. Practice Parallel Prog., pages 93–102, 2013.

[123] D. Morozov and G. H. Weber. Distributed contour trees. In Topological Methods in Data Analysis and Vis. III, pages 89–102. 2014.

[124] D. M. Mount, N. S. Netanyahu, and J. Le Moigne. Improved algorithms for robust point pattern matching and applications to image registration. In Proc. Symp. Computat. Geom., pages 155–164. ACM, 1998.

[125] A. Nath, K. Fox, P. K. Agarwal, and K. Munagala. Massively parallel algorithms for computing TIN DEMs and contour trees for large terrains. In Proc. ACM Int. Conf. Adv. Geog. Inf. Sys., page 25. ACM, 2016.

[126] R. Norel, D. Fischer, H. J. Wolfson, and R. Nussinov. Molecular surface recognition by a computer vision-based technique. Protein Engineering, Design and Selection, 7(1):39–46, 1994.

[127] C. Notredame. Recent evolutions of multiple sequence alignment algorithms. PLoS Computat. Bio., 3(8):e123, 2007.

[128] C. Panagiotakis, N. Pelekis, I. Kopanakis, E. Ramasso, and Y. Theodoridis. Segmentation and sampling of moving object trajectories based on representativeness. IEEE Trans. Know. Data Engg., 24(7):1328–1343, 2012.

[129] C. Papadimitriou and S. Safra. The complexity of low-distortion embeddings between point sets. In Proc. ACM-SIAM Symp. Disc. Algo., pages 112–118, 2005.

[130] N. Pelekis, P. Tampakis, M. Vodas, C. Doulkeridis, and Y. Theodoridis. On temporal-constrained sub-trajectory cluster analysis. Data Mining and Know. Disc., pages 1–37, 2017.

[131] N. Pelekis, P. Tampakis, M. Vodas, C. Panagiotakis, and Y. Theodoridis. In-DBMS sampling-based sub-trajectory clustering. In Proc. Int. Conf. Extending Database Tech., pages 632–643, 2017.

[132] L. Qin, J. X. Yu, L. Chang, H. Cheng, C. Zhang, and X. Lin. Scalable big graph processing in MapReduce. In Proc. ACM Int. Conf. Manage. Data, pages 827–838, 2014.

[133] J. O. Ramsay. Functional data analysis. Wiley Online Library, 2006.

[134] S. Sankararaman, P. K. Agarwal, T. Mølhave, J. Pan, and A. P. Boedihardjo. Model-driven matching and segmentation of trajectories. In Proc. ACM Int. Conf. Adv. Geog. Inf. Sys., pages 234–243. ACM, 2013.

[135] F. Schmiedl. Computational aspects of the Gromov-Hausdorff distance and its application in non-rigid shape matching. Disc. Computat. Geom., 57(4):854–880, 2017.

[136] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Proc. IEEE Symp. Mass Storage Sys. Tech., pages 1–10, 2010.

[137] C. Sung, D. Feldman, and D. Rus. Trajectory clustering for motion prediction. In Proc. IEEE/RSJ Int. Conf. Intell. Robots Sys., pages 1547–1552. IEEE, 2012.

[138] L.-A. Tang, Y. Zheng, J. Yuan, J. Han, A. Leung, W.-C. Peng, and T. L. Porta. A framework of traveling companion discovery on trajectory data streams. ACM Trans. Intell. Sys. Tech., 5(1):3, 2013.

[139] S. P. Tarasov and M. N. Vyalyi. Construction of contour trees in 3D in $O(n \log n)$ steps. In Proc. Symp. Computat. Geom., pages 68–75, 1998.

[140] E. F. Touli and Y. Wang. FPT-algorithms for computing Gromov-Hausdorff and interleaving distances between trees.

[141] L. G. Valiant. A bridging model for parallel computation. Comm. ACM, 33(8):103–111, 1990.

[142] M. van Kreveld, R. van Oostrum, C. Bajaj, V. Pascucci, and D. Schikore. Contour trees and small seed sets for isosurface traversal. In Proc. Symp. Computat. Geom., pages 212–220, 1997.

[143] A. Vijayaraghavan, A. Dutta, and A. Wang. Clustering stable instances of Euclidean k-means. In Proc. Adv. Neural Inf. Processing Sys., pages 6503– 6512, 2017.

[144] S. Wang, L. Wu, F. Zhou, C. Zheng, and H. Wang. Group pattern mining algorithm of moving objects’ uncertain trajectories. Int. J. Comp. Comm. Control, pages 428–440, 2015.

[145] T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2012.

[146] H. Xu, Y. Zhou, W. Lin, and H. Zha. Unsupervised trajectory clustering via adaptive multi-kernel-based shrinkage. In Proc. IEEE Int. Conf. Comp. Vis., pages 4328–4336, 2015.

[147] R. Ying, J. Pan, K. Fox, and P. K. Agarwal. A simple efficient approximation algorithm for dynamic time warping. In Proc. ACM Int. Conf. Adv. Geog. Info. Sys., page 21. ACM, 2016.

[148] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proc. USENIX Conf. Hot Topics Cloud Comput., pages 10–10, 2010.

[149] K. Zheng, Y. Zheng, N. J. Yuan, and S. Shang. On discovery of gathering patterns from trajectories. In Proc. Int. Conf. Data Engg., pages 242–253. IEEE, 2013.

[150] Y. Zhou and T. S. Huang. ‘Bag of segments’ for motion trajectory analysis. In Proc. Int. Conf. Image Process., pages 757–760. IEEE, 2008.

Biography

Abhinandan Nath was born on July 3, 1991 in the city of Guwahati in the state of Assam, India. He finished most of his early education in Guwahati, and spent the last two years of his high school at Delhi Public School, R.K. Puram, New Delhi. He obtained a bachelor's degree in Computer Science and Engineering from the Indian Institute of Technology, Guwahati in 2013. He joined the PhD program in the Department of Computer Science at Duke University in the fall of 2013, and he defended his PhD dissertation on July 12, 2018. His research interests lie in computational geometry and its applications to areas such as databases, data mining, and GIS. He has published and presented his work in the proceedings of peer-reviewed conferences and journals [21, 18, 125, 19, 20]. He has also reviewed papers for PODS, SIGMOD, VLDB, ICALP, and ACM-GIS. He will be joining Mentor Graphics in Fremont, California after graduation, where he will work on designing and implementing computational geometry and machine learning algorithms in Electronic Design Automation.
