
The Geodesic Distance of Data


Paul Beinat 1

Data is normally assumed to exist as a multidimensional cloud. Many of the methods that have been used to represent data, such as principal components analysis, make this implicit assumption. When we fit models to data we also make this implicit assumption, fitting the outcome to the overt predictor variables in the data. If data does have an inherent structure, rather than being just a cloud, then these assumptions are at best unsafe and questionable. Any inherent structure in the data will take the form of a manifold. A manifold is a multidimensional surface on which the data lies. Its dimensionality is less than that of the data. This makes it hard to detect, and that is why we make assumptions about data being a cloud. A new algorithm has been developed that follows and measures the geodesic of the manifold of data. Tests are detailed to determine whether this algorithm finds false positives, a geodesic where none exists, and whether it can find a geodesic when artificially prepared data is analysed. Finally, we use the algorithm on a large real world data set.

Introduction

The task of analytics, and data science in general, is to develop models from data. One of the key tasks involved is to reduce the dimensionality of the data to some form of key variables for the modelling. One common way of achieving this is by reducing the number of variables based on their correlation to the desired outcome. Of course these correlations take many values: some predictor variables are better than others, and variables can also be correlated with each other. Another often used method is to reduce the data dimensionality by using principal components analysis (PCA). PCA works by finding the direction of the largest variance in the data and fitting a line in that direction. It then finds the direction of largest variance perpendicular to the first line. This process can be repeated many times, until the variance is exhausted. Significantly, PCA creates new data dimensions, which are linear combinations of the overt data dimensions. In the age of big data, dimensionality reduction is a key objective.
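For reference, PCA can be sketched in a few lines via the singular value decomposition. This is a generic illustration (assuming numpy), not code from the paper:

```python
import numpy as np

def pca(X: np.ndarray, k: int) -> np.ndarray:
    """Project X onto its k directions of largest variance.
    Each new dimension is a linear combination of the overt ones."""
    Xc = X - X.mean(axis=0)  # centre each variable
    # Rows of Vt are the principal directions, ordered by variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T     # coordinates in the top-k directions
```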

Another key approach has been nonlinear dimensionality reduction. Most of these algorithms rely on inter point distances as the basis for structure discovery. The inter point distances are represented in a matrix, which is then used as the basis of the algorithms. Inter point distances are calculated as Euclidean distances. One of the most notable algorithms is the ISOMAP algorithm [1]. Another is Locally Linear Embedding [2]. Many more algorithms are detailed in [3]. Below are example results from ISOMAP.

[Fig. 3 from [1]: The "Swiss roll" data set, illustrating how Isomap exploits geodesic paths for nonlinear dimensionality reduction. (A) For two arbitrary points (circled) on a nonlinear manifold, their Euclidean distance in the high dimensional input space (length of dashed line) may not accurately reflect their intrinsic similarity, as measured by geodesic distance along the low-dimensional manifold (length of solid curve). (B) The neighborhood graph G constructed in step one of Isomap (with K = 7 and N = 1000 data points) allows an approximation (red segments) to the true geodesic path to be computed efficiently in step two, as the shortest paths in G. (C) The two-dimensional embedding recovered by Isomap in step three, which best preserves the shortest distances in the neighborhood graph (overlaid). Straight lines in the embedding (blue) now represent simpler and cleaner approximations to the true geodesic paths than do the corresponding graph paths (red).]

Significantly, while the data above has 3 dimensions, it actually lies on a two dimensional surface embedded in the 3 dimensional space. In more dimensions this is not overtly visualisable. In the case above, linear distance in the manifold equates to the geodesic distance in the data. Manifolds of significantly higher dimension (the one above is 2-dimensional) cannot be reduced to two dimensions without distortion. Therefore linear distances in higher order manifolds do not equate to geodesic distances.

If there are N points in the data set, then the distance matrix is N by N and it has N² entries. (Actually, distances are not directed, so we only need the triangular matrix.) If N is 10,000 then there are 100,000,000 entries. This is one of the key limitations of these approaches, and why they have not been applied to large data sets.
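As a rough illustration of that scaling, the sketch below (generic Python, assuming 8-byte double-precision entries) shows how quickly a dense distance matrix outgrows memory:

```python
def distance_matrix_bytes(n_points: int, bytes_per_entry: int = 8) -> int:
    """Memory needed for a dense N x N matrix of distances."""
    return n_points * n_points * bytes_per_entry

for n in (1_000, 10_000, 100_000):
    gb = distance_matrix_bytes(n) / 1e9
    print(f"N = {n:>7,}: {n * n:>18,} entries, ~{gb:,.1f} GB")
```

Even halving this for the triangular matrix, N = 100,000 already requires tens of gigabytes.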

New algorithms are produced from time to time; a modern example is UMAP [4].

1 Finity and University of Technology Sydney

Geodesic Distance of Data Algorithm

The Geodesic Distance of Data algorithm we implemented, and will examine here, calculates the geodesic distance between any two points embedded in d dimensions. It uses local Euclidean distances as the basis for following the geodesic of the manifold. If the data has d dimensions then the manifold has m dimensions, where m < d.

The algorithm has the following structure:

1. Start with 2 specified points from the d dimensional data. Select one as the start point A and the other as the goal B. Add A to a list of points to remember, L.

2. Find the closest point to A that reduces the distance to B. "Reducing the distance" means that the distance between the new point and the end point B is smaller than the distance between the original point A and the end point B. Mark this new point as C. Add C to L.

a. Find the closest point to C that reduces the distance to B. Mark that point as the new C and add it to L.

b. Repeat until there are no more points that are close to C and reduce the distance to B.

c. Add B to L.

3. L now contains all the points from A to B, including them. Now smooth the path, and subsequently the distance measure, from A to B.

a. Select a subset of contiguous points from L. Fit a linear regression to these points. Use the regression to derive a new location for the central point.

b. Move the point selection along by 1, selecting the same number of points. Repeat the linear regression and central point estimation.

c. Repeat b. until it reaches the end of L. Calculate the distance as the sum of the distances between subsequent points. This is the current geodesic distance.

d. Repeat a. to c. using the new points as L. Calculate the new sum of the distances. If the new distance is smaller than the previous distance by more than some small margin, repeat d.; otherwise stop.

4. The points in L are now the smooth, locally derived path between A and B, and its length is the estimate of the geodesic distance.

In a simplistic implementation of step 2, it can readily be seen that finding the point nearest the current point and closer to the goal point involves searching through virtually all points. If the number of points is large then this search can be prohibitive. This is the search problem discussed below.

It can also be seen that steps 3a to 3d represent a recursive algorithm, although it can be implemented as an iterative one. The series of points is locally smoothed. The smoothing algorithm, a local linear regression, is a type of kernel smoother; these kernel algorithms are well known in the literature. The smoothed points are then sent to the same smoothing algorithm again. It stops when the new geodesic distance is not significantly smaller than the previous one. One complication not discussed is what the smoothing does at the points closest to each end of the sequence, but this is also well described in the literature.
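To make the path-finding steps concrete, here is a minimal Python sketch of steps 1 to 2c, using the simplistic brute-force search just described. The names are ours and this is not the Finity implementation, which replaces the search as discussed under Implementation below:

```python
import numpy as np

def greedy_path(data: np.ndarray, start_idx: int, goal_idx: int) -> np.ndarray:
    """Steps 1-2c: hop to the nearest point that reduces the distance
    to the goal, recording the visited points in L."""
    A, B = data[start_idx], data[goal_idx]
    L = [A]
    current = A
    visited = {start_idx, goal_idx}
    while True:
        dist_to_goal = np.linalg.norm(current - B)
        # Brute force: scan virtually all points for candidates that
        # are strictly closer to the goal B (this is the search problem).
        closer = [i for i in range(len(data))
                  if i not in visited
                  and np.linalg.norm(data[i] - B) < dist_to_goal]
        if not closer:
            break  # step 2b: no remaining point reduces the distance
        # Of the candidates, take the one nearest the current point.
        nxt = min(closer, key=lambda i: np.linalg.norm(data[i] - current))
        visited.add(nxt)
        current = data[nxt]
        L.append(current)
    L.append(B)  # step 2c: finally add the goal point B
    return np.asarray(L)
```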
Implementation

The search problem has severe implications. As the data becomes more numerous, the number of points on the path between two given points increases. The search for the nearest point also grows with the number of data points. This means that the search time increases approximately as the square of the number of data points.

In order to alleviate the search problem, a Finity proprietary artificial neural network is used that models the joint probability distribution of the data. Points close to each other in the joint probability distribution are also closest to each other in the data space itself. This allows the algorithm to exploit the neural network to limit the search to only a local region of the whole data. It can then connect from one local region to another local region which is on the path to the goal point. This makes the search problem approximately linear in the number of data points. It allows the algorithm to be used on large data sets.

In all the experiments described below, the recursive local linear kernel smoother is set to terminate when it can no longer reduce the summed distance, using the current smoothed points, by 1% over the previous smoothed distance. It takes very few iterations to reach the termination condition.
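A minimal sketch of the recursive local linear kernel smoother with the 1% termination rule follows. The window size of five and the per-coordinate regression against path position are our assumptions (the paper does not specify them), and the points nearest each end are simply left unsmoothed here:

```python
import numpy as np

def path_length(points: np.ndarray) -> float:
    """Step 3c: sum of Euclidean distances between subsequent points."""
    return float(np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1)))

def smooth_once(points: np.ndarray, window: int = 5) -> np.ndarray:
    """Steps 3a-3c: slide a window along L, fit a linear regression to
    the windowed points and move the central point onto the fit."""
    smoothed = points.copy()
    half = window // 2
    t = np.arange(window)  # position within the window
    for i in range(half, len(points) - half):
        seg = points[i - half:i + half + 1]
        for d in range(points.shape[1]):  # one regression per coordinate
            slope, intercept = np.polyfit(t, seg[:, d], 1)
            smoothed[i, d] = slope * t[half] + intercept
    return smoothed

def smooth_path(points: np.ndarray, tol: float = 0.01) -> np.ndarray:
    """Step 3d: re-smooth until a pass no longer shrinks the summed
    distance by at least 1% of the previous smoothed distance."""
    prev = path_length(points)
    while True:
        points = smooth_once(points)
        new = path_length(points)
        if new > prev * (1.0 - tol):
            return points  # smoothing has stopped helping
        prev = new
```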
Examples

The first test of the algorithm is whether it can correctly determine geodesic distances for a known manifold. A 2 dimensional circular manifold is generated using random points from x² + y² = 1. 10,000 points are generated using a uniform random distribution for x, using this to derive y. In this case the data dimension d is 2 and the manifold dimension m is 1.

The algorithm is run repeatedly on two points selected at random from the data above. It calculates the length of the geodesic that lies between the two points.

[Figure 1: "Simple distance v ratio" – the ratio of geodesic to Euclidean distance (Y-axis) plotted against Euclidean distance (X-axis).]

Figure 1 shows the ratio of the geodesic distance to the Euclidean distance: the X-axis shows the Euclidean distance and the Y-axis shows the ratio of the geodesic distance to the Euclidean distance. When the distances are small, the geodesic is very close to the Euclidean distance; the length of the arc of the circle between the points is almost the same as a straight line. At the other extreme, when two points are on opposite sides of the circle, the geodesic, calculated as the length of a semicircle, is π/2 times the Euclidean distance. In between, the distance ratio varies along the circular arc. This is exactly what the algorithm has found.
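The circle test is easy to reproduce in outline. The sketch below (assuming numpy; the random sign on y is our assumption, so that both halves of the circle are covered) generates the data as described and computes the exact arc-to-chord ratio the algorithm should recover:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 points on x^2 + y^2 = 1: uniform random x, derive y.
x = rng.uniform(-1.0, 1.0, 10_000)
y = np.sqrt(1.0 - x**2) * rng.choice([-1.0, 1.0], 10_000)
data = np.column_stack([x, y])  # data dimension d = 2, manifold m = 1

def exact_ratio(p: np.ndarray, q: np.ndarray) -> float:
    """Arc length between two unit-circle points divided by their chord."""
    arc = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))  # geodesic distance
    chord = np.linalg.norm(p - q)                      # Euclidean distance
    return arc / chord

# Opposite points: arc = pi, chord = 2, so the ratio is pi/2 ~ 1.5708.
print(exact_ratio(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))
```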
The second test of the algorithm is whether it finds geodesic distances for data that contains no structure. For this case a data set of random data is generated. It has 25 variables, each with a uniform distribution of values between zero and one. There are 100,000 rows of data. The joint probability of this data shows no features, which is to be expected.
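Reproducing this structure-free data set is a one-liner; the sample-based sparsity check is our own illustration, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((100_000, 25))  # 100,000 rows, 25 uniform variables

# Nearest-neighbour distances in a random sample: in 25 dimensions
# even the closest points are far apart, i.e. the data is sparse.
sample = data[rng.choice(len(data), 500, replace=False)]
d = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)
print(d.min(axis=1).mean())
```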

The same process as the previous experiment is performed, selecting pairs of random points and finding the geodesic distance between them. The results were unsurprising. On many occasions it was impossible to smooth the path because not enough points existed on the path from the start to the end point. This is expected given the random nature of the data and its dimensionality – it is a sparse data set. On the few occasions when enough points could be found on the path, the smoothed path was the same as the Euclidean distance, to the 1% tolerance. This is a pleasing result: the algorithm fails to find any geodesic in data that does not have one.

The final experiment uses a real data set. It encompasses various attributes about cars, and contains a little over 128,000 rows of data. The attributes include the make, model and type of the car, the capacity and power of the engine, the number of cylinders, the wheel base and track, and other such attributes. For this experiment 31 attributes are chosen. This will potentially provide as much sparsity as the previous experiment.

When examining the joint probability of this data set, the result looks very different to the random data used before. The joint probability distribution is heavily skewed, with some areas of very thick data and many areas with little data. Lots of cars have similar attributes, but not many cars are like an Aston Martin. This is the expected joint distribution of real data, and has previously been observed with all real data sets.

The 31 variables used contain data on different scales. For instance, engine capacity is in cubic centimetres and typically in the thousands, as is wheelbase in millimetres; engine power is in kW, in the order of hundreds; and the number of wheels is very few in comparison. This is addressed by putting all the variables on the same scale, standardizing them by a whitening process: each variable is transformed by subtracting its mean from every observation, which makes the mean zero, and dividing by the standard deviation, which makes the standard deviation 1. This puts all the variables on the same scale, so they all make equal contributions to distance measures.

A random selection of start and end points is made 1,000 times. The Euclidean distance, the initial distance calculated from the points selected as the path, and the final smoothed distance are recorded. Unsurprisingly, the initial path involves points that do not form a smooth path. After smoothing, the path length is generally reduced by a factor of between 5 and 10. In a typical example, the Euclidean distance is 2.85, while the path along the unsmoothed points is 21.98 and, after smoothing, the path is 4.58.
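The standardization described above is a per-variable z-score transform (the paper calls it a whitening process). A minimal sketch, assuming numpy, with illustrative numbers:

```python
import numpy as np

def whiten(X: np.ndarray) -> np.ndarray:
    """Subtract each variable's mean (mean becomes 0) and divide by its
    standard deviation (std becomes 1), so all variables contribute
    equally to distance measures."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# e.g. engine capacity (cc, thousands), wheelbase (mm, thousands) and
# power (kW, hundreds) end up on a common scale:
X = np.array([[1600.0, 2700.0,  90.0],
              [4000.0, 2900.0, 250.0],
              [2000.0, 2600.0, 110.0]])
print(whiten(X).std(axis=0))  # -> [1. 1. 1.]
```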

[Figure 2: "Geodesic by Euclidean" – the smoothed geodesic distance (Y-axis, 0 to 35) plotted against the Euclidean distance (X-axis, 0 to 15), with the linear fit y = 1.9457x + 0.02, R² = 0.8708.]

Figure 2 shows the length of the smoothed geodesic (Y-axis) for given simple Euclidean distances (X-axis). It also shows the linear regression of this data. Notable is that the intercept is very close to zero, as it should be, and that the ratio between the geodesic and Euclidean distance is very close to 2. In fact the average ratio of geodesic to Euclidean distances is 1.938, while the minimum is 1.07 and the maximum is 3.69. The lower ratios tend to be for smaller Euclidean distances. These represent points that are close to each other on the manifold, in the same way that points close to each other on the circle have distances close to their Euclidean distance.

This demonstrates that the manifold is a complex shape. Recall that the longest geodesic for the circle data was approximately 1.57 times the diameter – the ratio of the geodesic to the Euclidean distance. In the case of the data for this experiment the average ratio is almost 2 – far greater than the maximum ratio for the circle. The implication is that the manifold is curved in the 31 dimensional data space to such a degree that ratios of distances attain these values. It is also the case that for any given Euclidean distance there are different geodesic distances. These observations involve different directions in the feature space. The implication is that the curvature of the manifold is not uniform, as one would expect given the complex relationships inherent in real data.
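The Figure 2 analysis can be reproduced from the recorded distances with a simple linear fit. In this sketch the arrays euclidean and geodesic are assumed to hold the 1,000 recorded distance pairs (the names are illustrative):

```python
import numpy as np

def analyse(euclidean: np.ndarray, geodesic: np.ndarray) -> None:
    """Fit geodesic ~ euclidean and report the statistics quoted in
    the text (slope, intercept, mean/min/max distance ratio)."""
    slope, intercept = np.polyfit(euclidean, geodesic, 1)
    ratio = geodesic / euclidean
    print(f"fit: y = {slope:.4f}x + {intercept:.2f}")
    print(f"ratio: mean={ratio.mean():.3f} "
          f"min={ratio.min():.2f} max={ratio.max():.2f}")
```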
Limitations

The algorithm discussed traces a path between a starting and an ending point. It does this by finding the closest point to the start, or to the current point used as the start, which also reduces the distance to the end point. If the manifold is not highly convoluted then this algorithm should faithfully trace out the manifold, once the path is smoothed. However, should the manifold be highly convoluted, such as a Swiss roll with a number of revolutions, it will trace out a path that crosses through the manifold. This occurs because the next point that is closer to the end point will be on the next layer of the roll. In the case of highly convoluted manifolds the real geodesic involves getting further away from the goal before approaching it again. In these cases the real geodesic distance will be longer than the algorithm reports.

Discussion

Some algorithms that attempt to measure manifold distances represent the data as an undirected acyclic graph, most simply constructed as mostly connected triangles. These algorithms then calculate the geodesic distance between any two points by following the connections in the graph. Although the graph is a physical representation of the data relationships, in terms of nearness, this is no different in principle from the algorithm described here. Our algorithm just follows the graph implicitly rather than explicitly. The greatest difference is that our algorithm encompasses a smoothing process. We can see that this smoothing reduces the geodesic distances calculated by adding the graph distances between adjacent points.

It is common to regard the overt d dimensions of some data set as a representation of the data itself. When fitting models the implied assumption is that each of those d dimensions can represent variables that are predictive of some outcome. Implicit in this approach is that nearness is related to each of these d dimensions. For instance, a person's age seems to be a natural ordinal variable, where each age is related to its neighbouring values.

The alternate formulation is that the data actually exists on a complex surface of m dimensions, where m < d. In this case nearness is a more holistic concept. Two observations are close because they exist near each other on the manifold. This nearness is multi-dimensional.

The implication is that while we are used to fitting models on the d dimensions of the data, the manifold will provide us with another m dimensions that embody very different relationships in the data. There will be potential benefits to having greater freedom to exploit modelling data in this way.

References

[1] Tenenbaum, J., de Silva, V. and Langford, J. – A Global Geometric Framework for Nonlinear Dimensionality Reduction
[2] Saul, L. and Roweis, S. – An Introduction to Locally Linear Embedding
[3] Lee, J. and Verleysen, M. – Nonlinear Dimensionality Reduction
[4] McInnes, L., Healy, J. and Melville, J. – UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
