
The Geodesic Distance of Data


Paul Beinat 1

Data is normally assumed to exist as a multidimensional cloud. Many of the methods that have been used to represent data, such as principal components analysis, make this implicit assumption. When we fit models to data we also make this implicit assumption, fitting the outcome to the overt predictor variables in the data. If data does have an inherent structure, rather than being just a cloud, then these assumptions are at best unsafe and questionable. Any inherent structure in the data will take the form of a manifold. A manifold is a multidimensional surface on which the data lies. Its dimensionality is less than that of the data. This makes it hard to detect, and that is why we make assumptions about data being a cloud. A new algorithm has been developed that follows and measures the geodesic of the manifold of data. Tests are detailed to determine whether this algorithm finds false positives, a geodesic where none exists, and whether it can find a geodesic when artificially prepared data is analysed. Finally, we use the algorithm on a large real world data set.

Introduction

The task of analytics, and data science in general, is to develop models from data. One of the key tasks involved is to reduce the dimensionality of the data to some form of key variables for the modelling. One common way of achieving this is by reducing the number of variables based on their correlation to the desired outcome. Of course these correlations take many values: some predictor variables are better than others, and variables can also be correlated with each other. Another often used method is to reduce the data dimensionality by using principal components analysis (PCA). PCA works by finding the direction of the largest variance in the data and fitting a line in that direction. It then finds the direction of largest variance perpendicular to the first line. This process can be repeated many times, until the variance is exhausted. Significantly, PCA creates new data dimensions, which are linear combinations of the overt data dimensions. In the age of big data, dimensionality reduction is a key objective.
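For reference, PCA can be sketched in a few lines via the singular value decomposition. This is a generic illustration (assuming numpy), not code from the paper:

```python
import numpy as np

def pca(X: np.ndarray, k: int) -> np.ndarray:
    """Project X onto its k directions of largest variance.
    Each new dimension is a linear combination of the overt ones."""
    Xc = X - X.mean(axis=0)  # centre each variable
    # Rows of Vt are the principal directions, ordered by variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T     # coordinates in the top-k directions
```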

Another key approach has been nonlinear dimensionality reduction. Most of these algorithms rely on inter point distances as the basis for structure discovery. The inter point distances are represented in a matrix, which is then used as the basis of the algorithms. Inter point distances are calculated as Euclidean distances. One of the most notable algorithms is the ISOMAP algorithm [1]. Another is Locally Linear Embedding [2]. Many more algorithms are detailed in [3]. Below are example results from ISOMAP.

[Fig. 3 from [1]: The "Swiss roll" data set, illustrating how Isomap exploits geodesic paths for nonlinear dimensionality reduction. (A) For two arbitrary points (circled) on a nonlinear manifold, their Euclidean distance in the high dimensional input space (length of dashed line) may not accurately reflect their intrinsic similarity, as measured by geodesic distance along the low-dimensional manifold (length of solid curve). (B) The neighborhood graph G constructed in step one of Isomap (with K = 7 and N = 1000 data points) allows an approximation (red segments) to the true geodesic path to be computed efficiently in step two, as the shortest paths in G. (C) The two-dimensional embedding recovered by Isomap in step three, which best preserves the shortest distances in the neighborhood graph (overlaid). Straight lines in the embedding (blue) now represent simpler and cleaner approximations to the true geodesic paths than do the corresponding graph paths (red).]

Significantly, while the data above has 3 dimensions, it actually lies on a two dimensional surface embedded in the 3 dimensional space. In more dimensions this is not overtly visualisable. In the case above, linear distance in the manifold equates to the geodesic distance in the data. Manifolds of significantly higher dimension (the one above is 2-dimensional) cannot be reduced to two dimensions without distortion. Therefore linear distances in higher order manifolds do not equate to geodesic distances.

If there are N points in the data set, then the distance matrix is N by N and it has N² entries. (Actually, distances are not directed, so we only need the triangular matrix.) If N is 10,000 then there are 100,000,000 entries. This is one of the key limitations of these approaches, and why they have not been applied to large data sets.
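As a rough illustration of that scaling, the sketch below (generic Python, assuming 8-byte double-precision entries) shows how quickly a dense distance matrix outgrows memory:

```python
def distance_matrix_bytes(n_points: int, bytes_per_entry: int = 8) -> int:
    """Memory needed for a dense N x N matrix of distances."""
    return n_points * n_points * bytes_per_entry

for n in (1_000, 10_000, 100_000):
    gb = distance_matrix_bytes(n) / 1e9
    print(f"N = {n:>7,}: {n * n:>18,} entries, ~{gb:,.1f} GB")
```

Even halving this for the triangular matrix, N = 100,000 already requires tens of gigabytes.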

New algorithms are produced from time to time; a modern example is UMAP [4].

1 Finity and University of Technology Sydney

Geodesic Distance of Data Algorithm

The Geodesic Distance of Data algorithm we implemented, and will examine here, calculates the geodesic distance between any two points embedded in d dimensions. It uses local Euclidean distances as the basis for following the geodesic of the manifold. If the data has d dimensions then the manifold has m dimensions, where m < d.

The algorithm has the following structure:

1. Start with 2 specified points from the d dimensional data. Select one as the start point A and the other as the goal B. Add A to a list of points to remember, L.

2. Find the closest point to A that reduces the distance to B. "Reducing the distance" means that the distance between the new point and the end point B is smaller than the distance between the original point A and the end point B. Mark this new point as C. Add C to L.

a. Find the closest point to C that reduces the distance to B. Mark that point as the new C and add it to L.

b. Repeat until there are no more points that are close to C and reduce the distance to B.

c. Add B to L.

3. L now contains all the points from A to B, including them. Now smooth the path, and subsequently the distance measure, from A to B.

a. Select a subset of contiguous points from L. Fit a linear regression to these points. Use the regression to derive a new location for the central point.

b. Move the point selection along by 1, selecting the same number of points. Repeat the linear regression and central point estimation.

c. Repeat b. until it reaches the end of L. Calculate the distance as the sum of the distances between subsequent points. This is the current geodesic distance.

d. Repeat a. to c. using the new points as L. Calculate the new sum of the distances. If the new distance is smaller than the previous distance by more than some small margin, repeat d.; otherwise stop.

4. The points in L are now the smooth, locally derived path between A and B, and its length is the estimate of the geodesic distance.

In a simplistic implementation of step 2, it can readily be seen that finding the point nearest the current point and closer to the goal point involves searching through virtually all points. If the number of points is large then this search can be prohibitive. This is the search problem discussed below.

It can also be seen that steps 3a to 3d represent a recursive algorithm, although it can be implemented as an iterative one. The series of points is locally smoothed. The smoothing algorithm, a local linear regression, is a type of kernel smoother; these kernel algorithms are well known in the literature. The smoothed points are then sent to the same smoothing algorithm again. It stops when the new geodesic distance is not significantly smaller than the previous one. One complication not discussed is what the smoothing does at the points closest to each end of the sequence, but this is also well described in the literature.
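To make the path-finding steps concrete, here is a minimal Python sketch of steps 1 to 2c, using the simplistic brute-force search just described. The names are ours and this is not the Finity implementation, which replaces the search as discussed under Implementation below:

```python
import numpy as np

def greedy_path(data: np.ndarray, start_idx: int, goal_idx: int) -> np.ndarray:
    """Steps 1-2c: hop to the nearest point that reduces the distance
    to the goal, recording the visited points in L."""
    A, B = data[start_idx], data[goal_idx]
    L = [A]
    current = A
    visited = {start_idx, goal_idx}
    while True:
        dist_to_goal = np.linalg.norm(current - B)
        # Brute force: scan virtually all points for candidates that
        # are strictly closer to the goal B (this is the search problem).
        closer = [i for i in range(len(data))
                  if i not in visited
                  and np.linalg.norm(data[i] - B) < dist_to_goal]
        if not closer:
            break  # step 2b: no remaining point reduces the distance
        # Of the candidates, take the one nearest the current point.
        nxt = min(closer, key=lambda i: np.linalg.norm(data[i] - current))
        visited.add(nxt)
        current = data[nxt]
        L.append(current)
    L.append(B)  # step 2c: finally add the goal point B
    return np.asarray(L)
```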
Implementation

The search problem has severe implications. As the data becomes more numerous, the number of points on the path between two given points increases. The search for the nearest point also grows with the number of data points. This means that the search time increases approximately as the square of the number of data points.

In order to alleviate the search problem, a Finity proprietary artificial neural network is used that models the joint probability distribution of the data. Points close to each other in the joint probability distribution are also closest to each other in the data space itself. This allows the algorithm to exploit the neural network to limit the search to only a local region of the whole data. It can then connect from one local region to another local region which is on the path to the goal point. This makes the search problem approximately linear in the number of data points. It allows the algorithm to be used on large data sets.

In all the experiments described below, the recursive local linear kernel smoother is set to terminate when it can no longer reduce the summed distance, using the current smoothed points, by 1% over the previous smoothed distance. It takes very few iterations to reach the termination condition.
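A minimal sketch of the recursive local linear kernel smoother with the 1% termination rule follows. The window size of five and the per-coordinate regression against path position are our assumptions (the paper does not specify them), and the points nearest each end are simply left unsmoothed here:

```python
import numpy as np

def path_length(points: np.ndarray) -> float:
    """Step 3c: sum of Euclidean distances between subsequent points."""
    return float(np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1)))

def smooth_once(points: np.ndarray, window: int = 5) -> np.ndarray:
    """Steps 3a-3c: slide a window along L, fit a linear regression to
    the windowed points and move the central point onto the fit."""
    smoothed = points.copy()
    half = window // 2
    t = np.arange(window)  # position within the window
    for i in range(half, len(points) - half):
        seg = points[i - half:i + half + 1]
        for d in range(points.shape[1]):  # one regression per coordinate
            slope, intercept = np.polyfit(t, seg[:, d], 1)
            smoothed[i, d] = slope * t[half] + intercept
    return smoothed

def smooth_path(points: np.ndarray, tol: float = 0.01) -> np.ndarray:
    """Step 3d: re-smooth until a pass no longer shrinks the summed
    distance by at least 1% of the previous smoothed distance."""
    prev = path_length(points)
    while True:
        points = smooth_once(points)
        new = path_length(points)
        if new > prev * (1.0 - tol):
            return points  # smoothing has stopped helping
        prev = new
```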
Examples

The first test of the algorithm is whether it can correctly determine geodesic distances for a known manifold. A 2 dimensional circular manifold is generated using random points from x² + y² = 1. 10,000 points are generated using a uniform random distribution for x, using this to derive y. In this case the data dimension d is 2 and the manifold dimension m is 1.

The algorithm is run repeatedly on two points selected at random from the data above. It calculates the length of the geodesic that lies between the two points.

[Figure 1: "Simple distance v ratio" – the ratio of geodesic to Euclidean distance (Y-axis) plotted against Euclidean distance (X-axis).]

Figure 1 shows the ratio of the geodesic distance to the Euclidean distance: the X-axis shows the Euclidean distance and the Y-axis shows the ratio of the geodesic distance to the Euclidean distance. When the distances are small, the geodesic is very close to the Euclidean distance; the length of the arc of the circle between the points is almost the same as a straight line. At the other extreme, when two points are on opposite sides of the circle, the geodesic, calculated as the length of a semicircle, is π/2 times the Euclidean distance. In between, the distance ratio varies along the circular arc. This is exactly what the algorithm has found.
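The circle test is easy to reproduce in outline. The sketch below (assuming numpy; the random sign on y is our assumption, so that both halves of the circle are covered) generates the data as described and computes the exact arc-to-chord ratio the algorithm should recover:

```python
import numpy as np

rng = np.random.default_rng(0)

# 10,000 points on x^2 + y^2 = 1: uniform random x, derive y.
x = rng.uniform(-1.0, 1.0, 10_000)
y = np.sqrt(1.0 - x**2) * rng.choice([-1.0, 1.0], 10_000)
data = np.column_stack([x, y])  # data dimension d = 2, manifold m = 1

def exact_ratio(p: np.ndarray, q: np.ndarray) -> float:
    """Arc length between two unit-circle points divided by their chord."""
    arc = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))  # geodesic distance
    chord = np.linalg.norm(p - q)                      # Euclidean distance
    return arc / chord

# Opposite points: arc = pi, chord = 2, so the ratio is pi/2 ~ 1.5708.
print(exact_ratio(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))
```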
The second test of the algorithm is whether it finds geodesic distances for data that contains no structure. For this case a data set of random data is generated. It has 25 variables, each with a uniform distribution of values between zero and one. There are 100,000 rows of data. The joint probability of this data shows no features, which is to be expected.
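Reproducing this structure-free data set is a one-liner; the sample-based sparsity check is our own illustration, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((100_000, 25))  # 100,000 rows, 25 uniform variables

# Nearest-neighbour distances in a random sample: in 25 dimensions
# even the closest points are far apart, i.e. the data is sparse.
sample = data[rng.choice(len(data), 500, replace=False)]
d = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)
print(d.min(axis=1).mean())
```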

The same process as the previous experiment is performed, selecting pairs of random points and finding the geodesic distance between them. The results were unsurprising. On many occasions it was impossible to smooth the path because not enough points existed on the path from the start to the end point. This is expected given the random nature of the data and its dimensionality – it is a sparse data set. On the few occasions when enough points could be found on the path, the smoothed path was the same as the Euclidean distance, to the 1% tolerance. This is a pleasing result: the algorithm fails to find any geodesic in data that does not have one.

The final experiment uses a real data set. It encompasses various attributes about cars, and contains a little over 128,000 rows of data. The attributes include the make, model and type of the car, the capacity and power of the engine, the number of cylinders, the wheel base and track, and other such attributes. For this experiment 31 attributes are chosen. This will potentially provide as much sparsity as the previous experiment.

When examining the joint probability of this data set, the result looks very different to the random data used before. The joint probability distribution is heavily skewed, with some areas of very thick data and many areas with little data. Lots of cars have similar attributes, but not many cars are like an Aston Martin. This is the expected joint distribution of real data, and has previously been observed with all real data sets.

The 31 variables used contain data on different scales. For instance, engine capacity is in cubic centimetres and typically in the thousands, as is wheelbase in millimetres; engine power is in kW, in the order of hundreds; and the number of wheels is very few in comparison. This is addressed by putting all the variables on the same scale, standardizing them by a whitening process: each variable is transformed by subtracting its mean from every observation, which makes the mean zero, and dividing by the standard deviation, which makes the standard deviation 1. This puts all the variables on the same scale, so they all make equal contributions to distance measures.

A random selection of start and end points is made 1,000 times. The Euclidean distance, the initial distance calculated from the points selected as the path, and the final smoothed distance are recorded. Unsurprisingly, the initial path involves points that do not form a smooth path. After smoothing, the path length is generally reduced by a factor of between 5 and 10. In a typical example, the Euclidean distance is 2.85, while the path along the unsmoothed points is 21.98 and, after smoothing, the path is 4.58.
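The standardization described above is a per-variable z-score transform (the paper calls it a whitening process). A minimal sketch, assuming numpy, with illustrative numbers:

```python
import numpy as np

def whiten(X: np.ndarray) -> np.ndarray:
    """Subtract each variable's mean (mean becomes 0) and divide by its
    standard deviation (std becomes 1), so all variables contribute
    equally to distance measures."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# e.g. engine capacity (cc, thousands), wheelbase (mm, thousands) and
# power (kW, hundreds) end up on a common scale:
X = np.array([[1600.0, 2700.0,  90.0],
              [4000.0, 2900.0, 250.0],
              [2000.0, 2600.0, 110.0]])
print(whiten(X).std(axis=0))  # -> [1. 1. 1.]
```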

[Figure 2: "Geodesic by Euclidean" – the smoothed geodesic distance (Y-axis, 0 to 35) plotted against the Euclidean distance (X-axis, 0 to 15), with the linear fit y = 1.9457x + 0.02, R² = 0.8708.]

Figure 2 shows the length of the smoothed geodesic (Y-axis) for given simple Euclidean distances (X-axis). It also shows the linear regression of this data. Notable is that the intercept is very close to zero, as it should be, and that the ratio between the geodesic and Euclidean distance is very close to 2. In fact the average ratio of geodesic to Euclidean distances is 1.938, while the minimum is 1.07 and the maximum is 3.69. The lower ratios tend to be for smaller Euclidean distances. These represent points that are close to each other on the manifold, in the same way that points close to each other on the circle have distances close to their Euclidean distance.

This demonstrates that the manifold is a complex shape. Recall that the longest geodesic for the circle data was approximately 1.57 times the diameter – the ratio of the geodesic to the Euclidean distance. In the case of the data for this experiment the average ratio is almost 2 – far greater than the maximum ratio for the circle. The implication is that the manifold is curved in the 31 dimensional data space to such a degree that ratios of distances attain these values. It is also the case that for any given Euclidean distance there are different geodesic distances. These observations involve different directions in the feature space. The implication is that the curvature of the manifold is not uniform, as one would expect given the complex relationships inherent in real data.
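The Figure 2 analysis can be reproduced from the recorded distances with a simple linear fit. In this sketch the arrays euclidean and geodesic are assumed to hold the 1,000 recorded distance pairs (the names are illustrative):

```python
import numpy as np

def analyse(euclidean: np.ndarray, geodesic: np.ndarray) -> None:
    """Fit geodesic ~ euclidean and report the statistics quoted in
    the text (slope, intercept, mean/min/max distance ratio)."""
    slope, intercept = np.polyfit(euclidean, geodesic, 1)
    ratio = geodesic / euclidean
    print(f"fit: y = {slope:.4f}x + {intercept:.2f}")
    print(f"ratio: mean={ratio.mean():.3f} "
          f"min={ratio.min():.2f} max={ratio.max():.2f}")
```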
Limitations

The algorithm discussed traces a path between a starting and an ending point. It does this by finding the closest point to the start, or to the current point used as the start, which also reduces the distance to the end point. If the manifold is not highly convoluted then this algorithm should faithfully trace out the manifold, once the path is smoothed. However, should the manifold be highly convoluted, such as a Swiss roll with a number of revolutions, it will trace out a path that crosses through the manifold. This occurs because the next point that is closer to the end point will be on the next layer of the roll. In the case of highly convoluted manifolds the real geodesic involves getting further away from the goal before approaching it again. In these cases the real geodesic distance will be longer than the algorithm reports.

Discussion

Some algorithms that attempt to measure manifold distances represent the data as an undirected acyclic graph, most simply constructed as mostly connected triangles. These algorithms then calculate the geodesic distance between any two points by following the connections in the graph. Although the graph is a physical representation of the data relationships, in terms of nearness, this is no different in principle from the algorithm described here. Our algorithm just follows the graph implicitly rather than explicitly. The greatest difference is that our algorithm encompasses a smoothing process. We can see that this smoothing reduces the geodesic distances calculated by adding the graph distances between adjacent points.

It is common to regard the overt d dimensions of some data set as a representation of the data itself. When fitting models the implied assumption is that each of those d dimensions can represent variables that are predictive of some outcome. Implicit in this approach is that nearness is related to each of these d dimensions. For instance, a person's age seems to be a natural ordinal variable, where each age is related to its neighbouring values.

The alternate formulation is that the data actually exists on a complex surface of m dimensions, where m < d. In this case nearness is a more holistic concept. Two observations are close because they exist near each other on the manifold. This nearness is multi-dimensional.

The implication is that while we are used to fitting models on the d dimensions of the data, the manifold will provide us with another m dimensions that embody very different relationships in the data. There will be potential benefits to having greater freedom to exploit modelling data in this way.

References

[1] Tenenbaum, J., de Silva, V. and Langford, J. – A Global Geometric Framework for Nonlinear Dimensionality Reduction
[2] Saul, L. and Roweis, S. – An Introduction to Locally Linear Embedding
[3] Lee, J. and Verleysen, M. – Nonlinear Dimensionality Reduction
[4] McInnes, L., Healy, J. and Melville, J. – UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
