
Nearest Neighbors in High-Dimensional Data: The Emergence and Influence of Hubs

Miloš Radovanović [email protected] Department of Mathematics and Informatics, University of Novi Sad, Trg D. Obradovića 4, 21000 Novi Sad, Serbia

Alexandros Nanopoulos [email protected] Institute of Computer Science, University of Hildesheim, Marienburger Platz 22, D-31141 Hildesheim, Germany

Mirjana Ivanović [email protected] Department of Mathematics and Informatics, University of Novi Sad, Trg D. Obradovića 4, 21000 Novi Sad, Serbia

Abstract

High dimensionality can pose severe difficulties, widely recognized as different aspects of the curse of dimensionality. In this paper we study a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set. We show that, as dimensionality increases, this distribution becomes considerably skewed and hub points emerge (points with very high k-occurrences). We examine the origin of this phenomenon, showing that it is an inherent property of high-dimensional vector space, and explore its influence on applications based on measuring distances in vector spaces, notably classification, clustering, and information retrieval.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

1. Introduction

It is widely recognized that high-dimensional spaces pose severe difficulties, regarded as different aspects of the curse of dimensionality (Bishop, 2006). One aspect of this curse is distance concentration, which directly affects machine learning applications. It refers to the tendency of distances between all pairs of points in high-dimensional data to become almost equal. Concentration of distances and the meaningfulness of finding nearest neighbors in high-dimensional spaces have been studied thoroughly (Beyer et al., 1999; Aggarwal et al., 2001; François et al., 2007).

There is another aspect of the curse of dimensionality that is related to nearest neighbors (NNs). Let D be a set of points and Nk(x) the number of k-occurrences of each point x ∈ D, i.e., the number of times x occurs among the k NNs of all other points in D. Under certain conditions, as dimensionality increases, the distribution of Nk becomes considerably skewed to the right, resulting in the emergence of hubs, i.e., points which appear in many more k-NN lists than other points. Unlike distance concentration, the skewness of Nk has not been studied in depth. As will be described in Section 2.2, the two phenomena are related but distinct. In this paper we study the causes and the implications of this aspect of the dimensionality curse.

1.1. Related Work

The skewness of Nk has recently started to be observed in fields like audio retrieval (Aucouturier & Pachet, 2007; Doddington et al., 1998) and fingerprint identification (Hicklin et al., 2005), where it is described as a problematic situation. Singh et al. (2003) notice possible skewness of N1 on real data and account for it in their reverse NN search algorithm. Nevertheless, these works neither analyze the causes of skewness nor generalize it to other applications.

The distribution of k-occurrences has been explicitly studied in the applied probability community (Newman et al., 1983; Yao & Simons, 1996). No skewness was observed because of the different properties of the settings studied, which will be explained in Section 2.2.

1.2. Motivation and Contributions

Since the skewness of k-occurrences has been observed in the contexts of specific applications, the question remains whether it is limited to them by being an artifact of the data or the modeling algorithms. In this paper we show that it is actually an inherent property of high-dimensional vector spaces under widely used assumptions. To the best of
our knowledge, there has been no study relating this phenomenon to the properties of vector spaces and the dimensionality curse. It is worth examining its origin and consequences because of its influence on applications based on distances in vector spaces, notably classification, clustering, and information retrieval.

We make the following contributions. First, we demonstrate and explain the emergence of skewness of k-occurrences (Section 2). We then study its implications on widely used techniques (Sections 3–5). As this is a preliminary examination of the problem, we provide a list of directions for future work (Section 6).

2. The Skewness of k-occurrences

In this section we first demonstrate the emergence of skewness in the distribution of Nk and then explain its causes.

2.1. A Motivating Example

We start with an illustrative experiment which demonstrates the changes in the distribution of Nk with varying dimensionality. Consider a random data set consisting of 10000 d-dimensional points drawn uniformly from the unit hypercube [0, 1]^d, and the following distance functions: Euclidean (l2), fractional l0.5 (proposed for high-dimensional data by Aggarwal et al. (2001)), and cosine. Figure 1 shows the empirically observed distributions of Nk, with k = 5, for (a) d = 3, (b) d = 20, and (c) d = 100.

For d = 3 the distributions of N5 for the three distance functions (Fig. 1(a)) are consistent with the binomial distribution. This is expected when considering k-occurrences as node in-degrees in the k-nearest neighbor digraph. For uniformly distributed points in low dimensions, this digraph follows the Erdős-Rényi (ER) random graph model, in which the degree distribution is binomial (Erdős & Rényi, 1959).

As dimensionality increases, the observed distributions of N5 depart from the random graph model and become more skewed to the right (Fig. 1(b, c)). We verified this by being able to fit the tails of the distributions with the log-normal distribution, which is highly skewed (fits were supported by the χ²-test at the 0.05 confidence level). We made similar observations with various k values, distance measures (lp norm for both p ≥ 1 and 0 < p < 1, Bray-Curtis, normalized Euclidean, and Canberra), and distributions, like the normal.[2] In all these cases, skewness exists and produces hubs, i.e., points with high Nk.

2.2. The Causes of Skewness

The skewness of k-occurrences appears to be related to the phenomenon of distance concentration, which is usually expressed as the ratio between some measure of spread and some measure of magnitude of distances of all points in a data set to some arbitrary reference point (Aggarwal et al., 2001; François et al., 2007). If this ratio converges to 0 as dimensionality goes to infinity, it is said that the distances concentrate.

To ease comprehension, consider again the iid uniform random data examined in the previous section and select as the reference point the mean of the distribution. Figure 2 plots, for each point x, its N5(x) against its Euclidean distance from the mean, for d = 3, 20, 100. As dimensionality increases, stronger correlation emerges, meaning that points closer to the mean tend to become hubs. We need to understand why some points tend to be closer to the mean and, thus, become hubs. Based on existing theoretical results (Beyer et al., 1999; Aggarwal et al., 2001), high-dimensional points approximately lie on a hypersphere centered at the data set mean. Moreover, the results by Demartines (1994) and François et al. (2007) specify that the distribution of distances to the data set mean has a non-negligible variance for any finite d.[1] Hence, the existence of a non-negligible number of points closer to the data set mean is expected in high dimensions. These points, by being closer to the mean, tend to be closer to all other points – a tendency which is amplified (in relative terms) by high dimensionality, giving points closer to the data set mean an increased inclusion probability into k-NN lists, even for small values of k.

Note that the non-negligible variance has an additional "side": we also expect points farther from the mean and, thus, with much lower Nk than the rest. Such points correspond to the bottom-right parts of Fig. 2(b, c), and can be regarded as outliers since they are also far away from all other points (Tan et al., 2005). Outliers will be analyzed further in Section 4.

Research in applied probability describes that, within the Poisson process setting, as d → ∞ the distribution of k-occurrences converges to the Poisson distribution with mean k (Newman et al., 1983; Yao & Simons, 1996), which implies no skewness. However, a Poisson process produces an unbounded infinite set of points for which no meaningful data set mean exists, and distances do not concentrate (their spread and magnitude are infinite). Through simulation of this setting we verified that, once boundaries are introduced (as in the majority of practical cases), skewness of Nk emerges.

[1] These results apply to lp distances, but our numerical simulations suggest that the other mentioned distance functions behave similarly.
[2] With the exception of combinations of (bounded) data distributions and distances without meaningful means, e.g., centered and cosine distance.
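As an illustration of the experiment above, the following sketch (not the authors' code; it assumes NumPy, SciPy, and scikit-learn, and uses Euclidean distance only) computes the k-occurrence counts Nk for iid uniform data and reports their skewness, which grows markedly with d.

```python
# Illustrative sketch of the Section 2.1 experiment (assumption: NumPy,
# SciPy, and scikit-learn are available; Euclidean distance only).
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

def k_occurrences(X, k):
    """N_k(x): how many times each point appears among the k NNs of other points."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point returns itself first
    _, idx = nn.kneighbors(X)
    return np.bincount(idx[:, 1:].ravel(), minlength=len(X))

rng = np.random.default_rng(0)
for d in (3, 20, 100):
    X = rng.uniform(size=(10000, d))
    N5 = k_occurrences(X, k=5)
    print(d, skew(N5))   # the sample skewness of N_5 increases with dimensionality
```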


Figure 1. Distribution of 5-occurrences for Euclidean, l0.5, and cosine distances on iid uniform random data sets with dimensionality (a) d = 3, (b) d = 20, and (c) d = 100 (log-log plot).

(Figure 2 panel titles: iid uniform, d = 3, C^N5_dm = −0.018; d = 20, C^N5_dm = −0.803; d = 100, C^N5_dm = −0.865. Vertical axis N5, horizontal axis distance from the data set mean.)

Figure 2. Scatter plots and Spearman correlation of N5(x) against the Euclidean distance of point x to the data set mean for iid uniform random data with (a) d = 3, (b) d = 20, and (c) d = 100.
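The correlation summarized in Figure 2 can be reproduced along the following lines. This is a sketch under the same assumptions as the previous one, reusing the k_occurrences() helper defined there.

```python
# Sketch: Spearman correlation C^N5_dm between N_5(x) and the Euclidean
# distance of x from the data set mean (the quantity shown in Figure 2).
# Assumes the k_occurrences() helper from the previous sketch.
import numpy as np
from scipy.stats import spearmanr

def corr_with_distance_from_mean(X, k=5):
    Nk = k_occurrences(X, k)
    dist_to_mean = np.linalg.norm(X - X.mean(axis=0), axis=1)
    rho, _ = spearmanr(Nk, dist_to_mean)
    return rho   # strongly negative in high dimensions: points near the mean become hubs

rng = np.random.default_rng(1)
for d in (3, 20, 100):
    print(d, corr_with_distance_from_mean(rng.uniform(size=(2000, d))))
```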

2.3. Skewness in Real Data

In this section we examine the skewness of Nk in real data which, unlike the previously examined random data, present two factors: (a) they usually have dependent attributes, therefore we need to consider their intrinsic dimensionality (measured by the maximum likelihood estimator (Levina & Bickel, 2005)); (b) they are usually clustered, hence we need to consider more than one group of points.

We examined 50 real data sets belonging to three categories: UCI multidimensional data, gene expression data, and textual data. Due to space considerations, Table 1 lists mainly the data sets used in later sections. Columns describe data set sources, basic statistics (whether attributes were standardized or not, the number of points (n), embedding dimensionality (d), estimated intrinsic dimensionality (dmle), number of classes), the number of groups (clusters) used for expressing correlations with Nk, and the distance metric (Euclidean or cosine). We took care to ensure that the choice of distance metric and preprocessing (i.e., standardization) corresponds to a realistic scenario for the particular data set. The value of k will be fixed at 10.

We characterize the asymmetry of Nk with the standardized third moment S_Nk = E(Nk − µ_Nk)³ / σ_Nk³ (µ_Nk and σ_Nk are the mean and standard deviation of Nk, respectively). The corresponding (10th) column of Table 1 shows that N10 of all examined data sets is skewed to the right.[3] Moreover, we define for each point x its standardized hubness score h(x) (used in later sections):

    h(x) = (Nk(x) − µ_Nk) / σ_Nk.    (1)

To examine the first factor (intrinsic dimensionality), for each data set we randomly permuted the elements within every attribute. This way, attributes preserve their individual distributions, but the dependencies between them are lost and intrinsic dimensionality increases (François et al., 2007). In Table 1 (11th column) we give the skewness, denoted S^S_N10, of the modified data. In most cases S^S_N10 is considerably higher than S_N10, implying that skewness depends on the intrinsic rather than the embedding dimensionality.

To examine the second factor (many groups), for every data set we measured: (i) the Spearman correlation, denoted C^N10_dm (12th column), of N10 and the distance from the data set mean, and (ii) the correlation, denoted C^N10_cm (13th column), of N10 and the distance to the closest group mean. Groups are determined using K-means clustering, where the number of clusters was computed for each data set by exhaustive search of values between 2 and ⌊√n⌋, in order to maximize C^N10_cm.[4] In most cases, C^N10_cm is much stronger than C^N10_dm. Consequently, in real data hubs are closer than other points to their respective cluster centers (which we verified by examining the actual scatter plots).

[3] If S_Nk = 0 there is no skewness; positive (negative) values signify skewness to the right (left).
[4] We report averages of C^N10_cm over 10 runs of K-means clustering with different random seeding, in order to reduce the effects of chance.
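The hubness score of Equation (1) and the attribute-shuffling check used above can be sketched as follows (assumptions: the k_occurrences() helper from the earlier sketch; X is a NumPy array of points).

```python
# Sketch of the standardized hubness score h(x) of Equation (1) and of the
# attribute-shuffling check for intrinsic dimensionality (assumes the
# k_occurrences() helper from the earlier sketch).
import numpy as np
from scipy.stats import skew

def hubness_scores(X, k=10):
    Nk = k_occurrences(X, k)
    return (Nk - Nk.mean()) / Nk.std()        # h(x) = (N_k(x) - mu_Nk) / sigma_Nk

def shuffled_skewness(X, k=10, seed=0):
    """Permute each attribute independently: marginal distributions are kept,
    dependencies are destroyed, so intrinsic dimensionality rises (giving S^S_Nk)."""
    rng = np.random.default_rng(seed)
    Xs = np.column_stack([rng.permutation(col) for col in X.T])
    return skew(k_occurrences(Xs, k))         # compare with skew(k_occurrences(X, k))
```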

Table 1. Real data sets (portion). Data sources are the UCI Machine Learning Repository and the Kent Ridge Bio-medical Repository (KR).

Name | Src. | Stan. | n | d | dmle | Cls. | Clu. | Dist. | S_N10 | S^S_N10 | C^N10_dm | C^N10_cm | CAV | BN~_10 | C^N10_BN10
ecoli | UCI | yes | 336 | 7 | 4.13 | 8 | 8 | l2 | 0.116 | 0.208 | −0.396 | −0.792 | 0.193 | 0.223 | 0.245
ionosphere | UCI | yes | 351 | 34 | 13.57 | 2 | 18 | l2 | 1.717 | 2.051 | −0.639 | −0.832 | 0.259 | 0.185 | 0.464
mfeat-factors | UCI | yes | 2000 | 216 | 8.47 | 10 | 44 | l2 | 0.826 | 5.493 | −0.113 | −0.688 | 0.145 | 0.063 | 0.001
mfeat-fourier | UCI | yes | 2000 | 76 | 11.48 | 10 | 44 | l2 | 1.277 | 4.001 | −0.350 | −0.596 | 0.415 | 0.272 | 0.436
musk1 | UCI | yes | 476 | 166 | 6.74 | 2 | 17 | l2 | 1.327 | 3.845 | −0.376 | −0.752 | 0.474 | 0.237 | 0.621
optdigits | UCI | yes | 5620 | 64 | 9.62 | 10 | 74 | l2 | 1.095 | 3.789 | −0.223 | −0.601 | 0.168 | 0.044 | 0.097
page-blocks | UCI | yes | 5473 | 10 | 3.73 | 5 | 72 | l2 | −0.014 | 0.470 | −0.063 | −0.289 | 0.068 | 0.049 | −0.046
pendigits | UCI | yes | 10992 | 16 | 5.93 | 10 | 104 | l2 | 0.435 | 0.982 | −0.062 | −0.513 | 0.156 | 0.014 | −0.030
segment | UCI | yes | 2310 | 19 | 3.93 | 7 | 48 | l2 | 0.313 | 1.111 | −0.077 | −0.453 | 0.332 | 0.089 | 0.074
sonar | UCI | yes | 208 | 60 | 9.67 | 2 | 8 | l2 | 1.354 | 3.053 | −0.550 | −0.771 | 0.461 | 0.286 | 0.632
spambase | UCI | yes | 4601 | 57 | 11.45 | 2 | 49 | l2 | 1.916 | 2.292 | −0.376 | −0.448 | 0.271 | 0.139 | 0.401
spectrometer | UCI | yes | 531 | 100 | 8.04 | 10 | 17 | l2 | 0.591 | 3.123 | −0.269 | −0.670 | 0.242 | 0.200 | 0.225
vehicle | UCI | yes | 846 | 18 | 5.61 | 4 | 25 | l2 | 0.603 | 1.625 | −0.162 | −0.643 | 0.586 | 0.358 | 0.435
vowel | UCI | yes | 990 | 10 | 2.39 | 11 | 27 | l2 | 0.766 | 0.935 | −0.252 | −0.605 | 0.598 | 0.313 | 0.691
lungCancer | KR | no | 181 | 12533 | 59.66 | 2 | 6 | l2 | 1.248 | 3.073 | −0.537 | −0.673 | 0.136 | 0.052 | 0.262
ovarian-61902 | KR | no | 253 | 15154 | 9.58 | 2 | 10 | l2 | 0.760 | 3.771 | −0.559 | −0.773 | 0.399 | 0.164 | 0.467
mini-newsgroups | UCI | no | 1999 | 7827 | 3226.43 | 20 | 44 | cos | 1.980 | 1.765 | −0.422 | −0.704 | 0.526 | 0.524 | 0.701
reuters-transcribed | UCI | no | 201 | 3029 | 234.68 | 10 | 3 | cos | 1.165 | 1.693 | −0.781 | −0.763 | 0.595 | 0.642 | 0.871

2.4. Skewness and Intrinsic Dimensionality

The results in Table 1 suggest that the skewness of Nk is strongly correlated with d (Spearman correlation over all 50 data sets is 0.62), and especially with the intrinsic dimensionality dmle (Spearman correlation over all 50 data sets is 0.80). We elaborate further on the interplay of skewness and intrinsic dimensionality by considering dimensionality reduction (DR) techniques. The main question is whether DR can alleviate the issue of the skewness of k-occurrences altogether.

We examined the widely used principal component analysis (PCA) method. Figure 3 depicts, for several real data sets and for iid uniform random data, the relationship between the percentage of features maintained by PCA and S_Nk (k = 10). For real data, S_Nk stays relatively constant until a small percentage of features is left, after which it suddenly drops. This is the point where the intrinsic dimensionality is reached, and further reduction incurs loss of information. Such behavior is in contrast with the case of iid uniform random data, where S_Nk steadily reduces with the decreasing number of (randomly) selected features (PCA is not meaningful in this case), because intrinsic and embedding dimensionalities are equal. These observations indicate that dimensionality reduction does not have a significant effect on the skewness of k-occurrences when the number of features is above the intrinsic dimensionality, a result that is useful in most practical cases since otherwise loss of significant information may occur.

Figure 3. Skewness of N10 in relation to the percentage of PCA features kept. (Plotted: musk1, mfeat-factors, spectrometer, and iid uniform data with d = 15 (no PCA); vertical axis S_N10, horizontal axis percentage of features.)

3. Influence on Classification

Starting with this section, we investigate possible implications of the skewness of Nk for widely used machine learning methods, beginning with supervised learning.

3.1. "Good" and "Bad" k-occurrences

When labels are present, k-occurrences can be distinguished based on whether the labels of neighbors match. We define the number of "bad" k-occurrences of x, BN_k(x), as the number of points from D for which x is among the first k NNs and the labels of x and the points in question do not match. Conversely, GN_k(x), the number of "good" k-occurrences of x, is the number of such points where the labels do match. Naturally, for every x ∈ D, Nk(x) = BN_k(x) + GN_k(x).

To account for labels, Table 1 includes BN~_10 (15th column), the sum of all "bad" 10-occurrences of a data set normalized by Σ_x N10(x) = 10n. This measure is intended to express the total amount of "bad" k-occurrences within a data set. Also, to compute the amount of information regular k-occurrences contain about "bad" k-occurrences in a data set, C^N10_BN10 (16th column) denotes the Spearman correlation between the BN_10 and N10 vectors. The motivation behind this measure is to express the degree to which BN_k and Nk follow a similar distribution.

"Bad" hubs, i.e., points with high BN_k, are of particular interest to supervised learning, since they carry more information about the location of the decision boundaries than other points, and affect classification algorithms (as will be described next).
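The split of k-occurrences into "good" and "bad" counts defined in Section 3.1 can be computed roughly as follows (my own helper, not the authors' code; X is a NumPy array of points and y an integer label array of the same length).

```python
# Sketch: splitting the k-occurrences of labeled data into "good" and "bad"
# counts, GN_k(x) and BN_k(x), so that N_k(x) = GN_k(x) + BN_k(x).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def good_bad_k_occurrences(X, y, k):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                 # idx[i, 0] is point i itself
    good = np.zeros(len(X), dtype=int)
    bad = np.zeros(len(X), dtype=int)
    for i, neighbors in enumerate(idx[:, 1:]):
        for j in neighbors:                   # x_j occurs in the k-NN list of x_i
            if y[j] == y[i]:
                good[j] += 1                  # GN_k(x_j): labels match
            else:
                bad[j] += 1                   # BN_k(x_j): labels do not match
    return good, bad
```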
To understand the origins of "bad" hubs in real data, we rely on the notion of the cluster assumption from semi-supervised learning (Chapelle et al., 2006), which roughly states that most pairs of points in a high-density region (cluster) should be of the same class. To measure the degree to which the cluster assumption is violated in a particular data set, we simply define the cluster assumption violation (CAV) coefficient as follows. Let a be the number of pairs of points which are in different classes but in the same cluster, and b the number of pairs of points which are in the same class and cluster. Then, we define CAV = a/(a + b), which gives a number in the range [0, 1], higher if there is more violation. To reduce the sensitivity of CAV to the number of clusters (too low and it will be overly pessimistic, too high and it will be overly optimistic), we choose the number of clusters to be 3 times the number of classes of a particular data set. Clustering is performed with K-means.

For all 50 examined real data sets, we computed the Spearman correlation between BN~_10 and CAV (14th column of Table 1), and found it strong (0.85). Another significant correlation (0.39) is observed between C^N10_BN10 and intrinsic dimensionality. In contrast, BN~_10 is correlated neither with intrinsic dimensionality nor with the skewness of N10 (both correlations are around 0.03). The latter fact indicates that high dimensionality and skewness of Nk are not sufficient to induce "bad" hubs. Instead, based on the former fact, we can argue that there are two mostly independent forces at work: violation of the cluster assumption on one hand, and high intrinsic dimensionality on the other. "Bad" hubs originate from putting the two together; i.e., the consequences of violating the cluster assumption can be more severe in high dimensions than in low dimensions, not in terms of the total amount of "bad" k-occurrences, but in terms of their distribution, since strong regular hubs are now more prone to "pick up" bad k-occurrences than non-hub points. This is supported by the positive correlation of C^N10_BN10 with intrinsic dimensionality, meaning that in high dimensions BN_k tends to follow a distribution more similar to Nk than in low dimensions.

3.2. Influence on Classification Algorithms

We now examine how the skewness of Nk and the existence of ("bad") hubs affect well-known classification techniques, focusing on the k-NN classifier, support vector machines (SVM), and AdaBoost.

k-NN classifier. The k-NN classifier is negatively affected by the presence of "bad" hubs, because they provide erroneous class information to many other points. To validate this assumption, we tried a simple weighting scheme. For each point x, we calculate its standardized "bad" hubness score, hB(x), by adapting its h(x) score (Equation 1) to consider BN_k(x) instead of Nk(x). Thus, during majority voting, when a point x participates in a k-NN list, its vote is weighted by e^(−hB(x)). Figure 4(a–d) compares the resulting accuracy of the k-NN classifier with and without this weighting scheme for some data sets of Table 1. Leave-one-out evaluation is performed using Euclidean distance, whereas the k value for Nk is naturally set to the k value used by the k-NN classifier. The reduced accuracy of the unweighted scheme signifies the negative influence of "bad" hubs.

Support vector machines. We consider SVM with the RBF (Gaussian) kernel, which is a smooth monotone function of the Euclidean distance between points. Therefore, Nk values in the kernel space are exactly the same as in the original space.[5] To examine the influence of "bad" hubs on SVM, Fig. 4(e, f) illustrates 10-fold cross-validation accuracy results when points are progressively removed from the training sets: (i) by decreasing BN_k (k = 5), and (ii) at random. Accuracy drops with removal by BN_k, indicating that bad hubs are important for SVMs.

The reason is that in high-dimensional data, points with high BN_k can comprise good support vectors. Table 2 exemplifies this point by listing, for several data sets, normalized average ranks of support vectors in the 10-fold cross-validation models with regard to decreasing BN_k. The ranks are in the range [0, 1], with the value 0.5 expected from a random set of points. Lower values of the ranks indicate that the support vectors, on average, tend to have high BN_k.

Table 2. Normalized average support vector ranks with regard to decreasing BN_5.

Data set | SV rank
mfeat-factors | 0.218
mfeat-fourier | 0.381
optdigits | 0.189
page-blocks | 0.267
segment | 0.272
vehicle | 0.464

[5] Centering the kernel matrix changes the Nk of points in the kernel space, but we observed that the overall distribution (i.e., its skewness) does not become radically different. Therefore, the preceding arguments still hold for centered kernels, provided Nk is calculated in the kernel space.

AdaBoost. Boosting algorithms take into account the "importance" of points in the training set for classification by weak learners, usually by assigning and updating weights of individual points – the higher the weight, the more attention is to be paid to the point by subsequent learners. We consider the classical AdaBoost algorithm in conjunction with CART trees of maximal depth 3, and set the initial weight of each point x in the training set to 1/(1 + |h(x)|), normalized by the sum over all points (h(x) was defined in Equation 1). The motivation behind the weighting scheme is to assign less importance to both hubs and outliers than to other points (this is why we take the absolute value of h(x)). Figure 4(g, h) illustrates on two data sets from Table 1 how the weighting scheme helps AdaBoost achieve better generalization in fewer iterations, showing the classification accuracy on a 2:1:1 training-validation-test set split (validation sets are used to determine the values of k). While it is known that AdaBoost is sensitive to outliers, this suggests that hubs should be regarded in an analogous manner, i.e., both hubs and outliers are intrinsically more difficult to classify correctly, and the attention of the weak learners should initially be focused on regular points.

To further support this claim, Fig. 5 depicts binned accuracies of unweighted AdaBoost trained in one fifth of the iterations shown in Fig. 4(g, h), for points sorted by decreasing k-occurrences. It exemplifies how in earlier phases of ensemble training the generalization power with hubs and/or outliers is worse than with regular points.
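To make the k-NN weighting scheme described earlier in this subsection concrete, the following sketch gives one possible reading of it (not the authors' implementation). It reuses good_bad_k_occurrences() from the previous sketch and uses a plain train/test split rather than the paper's leave-one-out protocol.

```python
# Rough sketch of the hubness-weighted k-NN vote from Section 3.2: a training
# point's vote is scaled by exp(-h_B(x)), where h_B standardizes BN_k the way
# Equation (1) standardizes N_k. Assumes good_bad_k_occurrences() from above.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_knn_predict(X_train, y_train, X_test, k=5):
    _, bad = good_bad_k_occurrences(X_train, y_train, k)
    h_bad = (bad - bad.mean()) / (bad.std() + 1e-12)
    weight = np.exp(-h_bad)                    # "bad" hubs receive small voting weight
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_test)
    classes = np.unique(y_train)
    preds = []
    for neighbors in idx:
        votes = {c: weight[neighbors][y_train[neighbors] == c].sum() for c in classes}
        preds.append(max(votes, key=votes.get))
    return np.array(preds)
```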

Figure 4. (a–d) Accuracy of the k-NN classifier with and without the weighting scheme. (e, f) Accuracy of SVM with RBF kernel and points being removed from the training sets by decreasing BN_5, and at random (averaged over 20 runs). (g, h) Accuracy of AdaBoost with and without the weighting scheme: (g) k = 20, (h) k = 40. (Data sets: (a, e) optdigits, (b, f) segment, (c, g) mfeat-factors, (d, h) vehicle.)

Figure 5. Binned accuracy of AdaBoost, by decreasing Nk. (Data sets: mfeat-factors and vehicle; bins ordered from hubs to outliers.)

4. Influence on Clustering

The main objectives of clustering algorithms are to minimize intra-cluster distance and maximize inter-cluster distance. The skewness of k-occurrences in high-dimensional data influences both objectives.

Intra-cluster distance may be increased due to points with low k-occurrences. As mentioned in Section 2.2, such points are far from all the rest, acting as outliers. A common outlier score of a point is its distance from its k-th nearest neighbor (Tan et al., 2005). Low Nk values and high outlier scores are correlated, as exemplified in Fig. 6(a, b) (in their lower-right parts) for two data sets from Table 1. Outliers and their influence on clustering are well-studied subjects (Tan et al., 2005): outliers do not cluster well because they have high intra-cluster distance, thus they are often discovered and eliminated beforehand. The existence of outliers is attributed to various reasons (e.g., erroneous measurements). Nevertheless, the skewness of Nk suggests that in high-dimensional data outliers are also expected due to inherent properties of vector space.

Inter-cluster distance may be reduced due to points with high k-occurrences, i.e., hubs. Like outliers, hubs do not cluster well, but for a different reason: they have low inter-cluster distance, because they are close to many points, thus also to points from other clusters. In contrast to outliers, the influence of hubs on clustering has not attracted significant attention.

To examine the influence of both outliers and hubs, we used the popular silhouette coefficients (SC) (Tan et al., 2005). For the i-th point, a_i is its average distance to all points in its cluster (a_i corresponds to intra-cluster distance), whereas b_i is the minimum average distance to points of other clusters (b_i corresponds to inter-cluster distance). The SC of the i-th point is (b_i − a_i)/max(a_i, b_i), in the range [−1, 1] (higher values are preferred). We examined several clustering algorithms and distance measures, but due to lack of space report results for the algorithm of Meilă and Shi (2001) and Euclidean distance.
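Anticipating the selection procedure described in the next paragraph (hubs as points with h(x) > 2, an equal number of lowest-Nk points as outliers), the comparison of relative silhouette coefficients can be sketched as follows. Assumptions: scikit-learn's KMeans and silhouette_samples stand in for the clustering algorithm and SC computation used in the paper, and k_occurrences() and hubness_scores() come from the earlier sketches.

```python
# Sketch of the hub/outlier silhouette comparison in Section 4.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def relative_silhouettes(X, k=10, n_clusters=10, seed=0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    sc = silhouette_samples(X, labels)        # (b_i - a_i) / max(a_i, b_i) for each point
    h = hubness_scores(X, k)
    hubs = np.where(h > 2)[0]                 # N_k more than 2 std. devs. above the mean
    outliers = np.argsort(k_occurrences(X, k))[:len(hubs)]   # lowest-N_k points
    rand = rng.choice(len(X), size=len(hubs), replace=False)
    # relative SC of hubs and of outliers, each divided by the mean SC of random points
    return sc[hubs].mean() / sc[rand].mean(), sc[outliers].mean() / sc[rand].mean()
```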

Following a standard method, we select as hubs those points x with h(x) > 2, i.e., Nk(x) more than 2 standard deviations higher than the mean (note that h(x) ignores labels). Let n_h be their number. Moreover, we select as outliers the n_h points with the lowest k-occurrences. Finally, we randomly select n_h points from the remaining points (we report averages for 100 different selections). To compare hubs and outliers against random points, we measure as the relative SC of hubs (outliers) the quotient of the mean SC of hubs (outliers) divided by the mean SC of random points. For several data sets from Table 1, Figure 6(c) plots with bars the relative SC. As expected, outliers have relative SC lower than one, meaning that they cluster worse than random points. Notably, the same holds for hubs, too.[6]

To gain further insight, the same figure plots with lines (referring to the right vertical axes) for hubs and outliers their relative mean values of a_i and b_i (dividing by those of randomly selected points). Outliers have high relative a_i values, indicating higher intra-cluster distance. Hubs, in contrast, have low relative b_i values, indicating reduced inter-cluster distance. In conclusion, when clustering high-dimensional data, hubs should receive analogous attention as outliers.

[6] The statistical significance of differences between the mean SC of hubs and randomly selected points has been verified with the double t-test at the 0.05 confidence level.

Figure 6. (a, b) Correlation between low Nk and outlier score (k = 20). (c) Relative silhouette coefficients for hubs (gray filled bars) and outliers (empty bars). Relative values for a and b coefficients are also plotted (referring to the right vertical axes). (Panels (a) and (b) show the ionosphere and sonar data sets; panel (c) covers several data sets from Table 1.)

5. Influence on Information Retrieval

The existence of hubs affects applications that are based on nearest neighbor queries. A typical example in information retrieval is finding documents that are most similar to a query document. Hub documents will be frequently included in the result list without necessarily being relevant to the query. This may harm the precision of results and the users' experience, by having "persistent" irrelevant results.

To reduce the inclusion probability of hubs in the result list, we can increase their distance from query documents. For this purpose we use a simple scheme. Given a collection D of documents, for each x ∈ D, let d(x, q) be the distance (in the range [0, 1]) between x and a query document q. Being consistent with the findings in the previous section, we ignore labels and consider as hubs those documents with h(x) > 2. For hubs, we increase d(x, q) as follows:

    d(x, q) ← d(x, q) + (1 − d(x, q)) · h(x) / max_{y∈D} h(y).

Thus, the higher the h(x), the more the distance is increased, remaining however in the range [0, 1].

We examine the influence of hubs by comparing this scheme with the regular case (no increment). We use cosine distance, leave-one-out cross validation, and measure precision (the fraction of results with the same class label as the query) versus the number of retrieved documents. Figure 7 reports (with bars) precision with and without using the increment scheme, for two text data sets given in Table 1 (selected for space considerations). The same figure depicts with lines (referring to the right axis) the probability of a hub being included in the result list. In the regular case (no increment), hubs have much higher inclusion probability, resulting in lower precision.

Figure 7. Precision with (solid bars) and without (dashed bars) the distance increment scheme: (a) k = 10, (b) k = 1. Inclusion probability of hubs is also plotted (against the right vertical axes) with (solid line) and without (dashed line) the increment scheme. (Panels (a) and (b) show the mini-newsgroups and reuters-transcribed data sets, respectively.)
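The increment applied to hub documents above can be written compactly as follows. This is a sketch of my rendering of the update formula; dist holds precomputed distances d(x, q) in [0, 1] from every document x to the query q, and h holds the hubness scores h(x).

```python
# Sketch of the distance-increment scheme of Section 5, applied to hubs only.
import numpy as np

def increment_hub_distances(dist, h, threshold=2.0):
    """d(x,q) <- d(x,q) + (1 - d(x,q)) * h(x) / max_y h(y), for points with h(x) > threshold."""
    adjusted = np.asarray(dist, dtype=float).copy()
    hubs = h > threshold
    adjusted[hubs] += (1.0 - adjusted[hubs]) * h[hubs] / h.max()
    return adjusted    # remains in [0, 1]; strong hubs are pushed away from the query
```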

6. Conclusion

We explored the emergence of skewness of k-occurrences in high-dimensional data, and possible implications for several important applications, namely classification, clustering, and information retrieval. Experimental results suggest these applications are affected by the skewness phenomenon, and that explicitly taking skewness into account can improve the accuracy of different methods. Although this was a preliminary examination, we hope we demonstrated that the phenomenon may be significant to many fields of machine learning and information retrieval, and that it warrants further investigation.

Possible directions for future work include a more formal and theoretical study of the interplay between the skewness of k-occurrences and various distance-based machine learning models, possibly leading to approaches that account for the phenomenon at a deeper level. Supervised learning methods may deserve special attention, as it was also observed in another study (Caruana et al., 2008) that the k-NN classifier and boosted decision trees can experience problems in high dimensions. Further directions of research may involve determining whether the phenomenon applies to probabilistic models, (unboosted) decision trees, and other techniques not explicitly based on distances between points, and also to algorithms based on general metric spaces. Since we determined a correlation between hubness and the proximity to cluster centers for K-means clustering of high-dimensional data, it would be interesting to explore how this can be used in seeding iterative clustering algorithms, like K-means or self-organizing maps. The interplay between dimensionality reduction and the skewness of Nk may also be worth further study. Other fields that could directly benefit from an investigation into the skewness of Nk include outlier detection and reverse k-NN queries. Finally, as we determined a high correlation between intrinsic dimensionality and the skewness of Nk, it would be interesting to see whether some measure of skewness of the distribution of Nk can be used for estimation of the intrinsic dimensionality of a data set.

Acknowledgments

The authors would like to thank the reviewers for their valuable comments. The presented work was supported by the project Abstract Methods and Applications in Computer Science (no. 144017A) of the Serbian Ministry of Science.

References

Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional spaces. Proc. Int. Conf. on Database Theory (pp. 420–434).

Aucouturier, J.-J., & Pachet, F. (2007). A scale-free distribution of false positives for a large class of audio similarity measures. Pattern Recognition, 41, 272–284.

Beyer, K. S., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is "nearest neighbor" meaningful? Proc. Int. Conf. on Database Theory (pp. 217–235).

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Caruana, R., Karampatziakis, N., & Yessenalina, A. (2008). An empirical evaluation of supervised learning in high dimensions. Proc. Int. Conf. on Machine Learning (pp. 96–103).

Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. The MIT Press.

Demartines, P. (1994). Analyse de données par réseaux de neurones auto-organisés. Doctoral dissertation, Institut Nat'l Polytechnique de Grenoble, France.

Doddington, G., Liggett, W., Martin, A., Przybocki, M., & Reynolds, D. (1998). SHEEP, GOATS, LAMBS and WOLVES: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. Proc. Int. Conf. on Spoken Language Processing. Paper 0608.

Erdős, P., & Rényi, A. (1959). On random graphs. Publicationes Mathematicae Debrecen, 6, 290–297.

François, D., Wertz, V., & Verleysen, M. (2007). The concentration of fractional distances. IEEE Transactions on Knowledge and Data Engineering, 19, 873–886.

Hicklin, A., Watson, C., & Ulery, B. (2005). The myth of goats: How many people have fingerprints that are hard to match? (Technical Report). National Institute of Standards and Technology.

Levina, E., & Bickel, P. J. (2005). Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems 17 (pp. 777–784).

Meilă, M., & Shi, J. (2001). Learning segmentation by random walks. Advances in Neural Information Processing Systems 13 (pp. 873–879).

Newman, C. M., Rinott, Y., & Tversky, A. (1983). Nearest neighbors and Voronoi regions in certain point processes. Advances in Applied Probability, 15, 726–751.

Singh, A., Ferhatosmanoğlu, H., & Tosun, A. S. (2003). High dimensional reverse nearest neighbor queries. Proc. Int. Conf. on Information and Knowledge Management (pp. 91–98).

Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduction to data mining. Addison Wesley.

Yao, Y.-C., & Simons, G. (1996). A large-dimensional independent and identically distributed property for nearest neighbor counts in Poisson processes. Annals of Applied Probability, 6, 561–571.
Publi- hubness and the proximity to cluster centers for K-means cationes Mathematicae Debrecen, 6, 290–297. clustering of high-dimensional data, it would be interesting Franc¸ois, D., Wertz, V., & Verleysen, M. (2007). The con- to explore how this can be used in seeding iterative clus- centration of fractional distances. IEEE Transactions on tering algorithms, like K-means or self-organizing maps. Knowledge and Data Engineering, 19, 873–886. The interplay between dimensionality reduction and skew- ness of Nk may also be worth further study. Other fields Hicklin, A., Watson, C., & Ulery, B. (2005). The myth that could directly benefit from an investigation into the of goats: How many people have fingerprints that are skewness of Nk include outlier detection and reverse k-NN hard to match? (Technical Report). National Institute of queries. Finally, as we determined high correlation be- Standards and Technology. tween intrinsic dimensionality and the skewness of Nk, it would be interesting to see whether some measure of skew- Levina, E., & Bickel, P. J. (2005). Maximum likelihood ness of the distribution of Nk can be used for estimation of estimation of intrinsic . Advances in Neural the intrinsic dimensionality of a data set. Information Processing Systems 17 (pp. 777–784). Meila,˘ M., & Shi, J. (2001). Learning segmentation by ran- Acknowledgments dom walks. Advances in Neural Information Processing The authors would like to thank the reviewers for their Systems 13 (pp. 873–879). valuable comments. The presented work was supported Newman, C. M., Rinott, Y., & Tversky, A. (1983). Nearest Abstract Methods and Applications in Computer by project neighbors and voronoi regions in certain point processes. Science (no. 144017A), of the Serbian Ministry of Science. Advances in Applied Probability, 15, 726–751. References Singh, A., Ferhatosmanoglu,˘ H., & Tosun, A. S. (2003). High dimensional reverse nearest neighbor queries. Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001). Proc. Int. Conf. on Information and Knowledge Manage- On the surprising behavior of distance metrics in high ment (pp. 91–98). dimensional spaces. Proc. Int. Conf. on Database Theory (pp. 420–434). Tan, P.-N., Steinbach, M., & Kumar, V. (2005). Introduc- tion to data mining. Addison Wesley. Aucouturier, J.-J., & Pachet, F. (2007). A scale-free distri- bution of false positives for a large class of audio simi- Yao, Y.-C., & Simons, G. (1996). A large-dimensional in- larity measures. Pattern Recognition, 41, 272–284. dependent and identically distributed property for near- est neighbor counts in Poisson processes. Annals of Ap- Beyer, K. S., Goldstein, J., Ramakrishnan, R., & Shaft, U. plied Probability, 6, 561–571. (1999). When is “nearest neighbor” meaningful? Proc. Int. Conf. on Database Theory (pp. 217–235).