arXiv:1605.06950v4 [stat.ML] 12 Apr 2017
A Sub-Quadratic Exact Medoid Algorithm

James Newling (Idiap Research Institute & EPFL)
François Fleuret (Idiap Research Institute & EPFL)

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).

Abstract

We present a new algorithm trimed for obtaining the medoid of a set, that is the element of the set which minimises the mean distance to all other elements. The algorithm is shown to have, under certain assumptions, expected run time $O(N^{3/2})$ in $\mathbb{R}^d$ where $N$ is the set size, making it the first sub-quadratic exact medoid algorithm for $d > 1$. Experiments show that it performs very well on spatial network data, frequently requiring two orders of magnitude fewer distance calculations than state-of-the-art approximate algorithms. As an application, we show how trimed can be used as a component in an accelerated K-medoids algorithm, and then how it can be relaxed to obtain further computational gains with only a minor loss in cluster quality.

1 Introduction

A popular measure of the centrality of an element of a set is its mean distance to all other elements. In network analysis, this measure is referred to as closeness centrality; we will refer to it as energy. Given a set $S = \{x(1), \ldots, x(N)\}$, the energy of element $i \in \{1, \ldots, N\}$ is thus given by,

$$E(i) = \frac{1}{N} \sum_{j \in \{1, \ldots, N\}} \mathrm{dist}(x(i), x(j)).$$

An element in $S$ with minimum energy is referred to as a 1-median or a medoid. Without loss of generality, we will assume that $S$ contains a unique medoid. The problem of determining the medoid of a set arises in the contexts of clustering, operations research, and network analysis. In clustering, the Voronoi iteration K-medoids algorithm (Hastie et al., 2001; Park and Jun, 2009) requires determining the medoid of each of $K$ clusters at each iteration. In operations research, the facility location problem requires placing one or several facilities so as to minimise the cost of connecting to clients. In network analysis, the medoid may represent an influential person in a social network, or the most central station in a rail network.

1.1 Medoid algorithms and our contribution

A simple algorithm for obtaining the medoid of a set of $N$ elements computes the energy of all elements and selects the one with minimum energy, requiring $\Theta(N^2)$ time. In certain settings $\Theta(N)$ algorithms exist, such as in 1-D where the problem is solved by Quickselect (Hoare, 1961), and more generally on trees. However, no general purpose $o(N^2)$ algorithm exists. An example illustrating the impossibility of such an algorithm is presented in Supplementary Material A (SM-A).

Related to finding the medoid of a set is finding the geometric median, which in vector spaces is defined as the point in the vector space with minimum energy. The relationship between the two problems is discussed in §2.1.

Much work has been done to develop approximate algorithms in the context of network analysis. The RAND algorithm of Eppstein and Wang (2004) can be used to estimate the energy of all nodes in a graph. The accuracy of RAND depends on the diameter of the network, which motivated Cohen et al. (2014) to use pivoting to make RAND more effective for large diameter networks. The work most closely related to ours is that of Okamoto et al. (2008), where RAND is adapted to the task of finding the $k$ lowest energy nodes, $k = 1$ corresponding to the medoid problem. The resulting TOPRANK algorithm of Okamoto et al. (2008) has run time $\tilde{O}(N^{5/3})$ under certain assumptions, and returns the medoid with probability $1 - O(1/N)$, that is with high probability (w.h.p.). Note that only their run time result requires any assumption; obtaining the medoid w.h.p. is guaranteed. TOPRANK is discussed in §2.2.
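The energy definition and the simple $\Theta(N^2)$ algorithm of §1.1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `energies` and `medoid_brute_force` and the toy point set are assumptions introduced here, and the Euclidean distance stands in for an arbitrary `dist`.

```python
import math


def energies(points, dist):
    """Return E(i) = (1/N) * sum_j dist(x(i), x(j)) for every element i."""
    n = len(points)
    return [sum(dist(p, q) for q in points) / n for p in points]


def medoid_brute_force(points, dist):
    """Return the index of the minimum-energy element.

    Evaluates dist for all N^2 ordered pairs, which is the quadratic
    baseline described in the text (hypothetical helper, for illustration).
    """
    e = energies(points, dist)
    return min(range(len(points)), key=e.__getitem__)


# Toy data (assumed): four scattered points and one near the centre of mass.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (0.4, 0.4)]
medoid_index = medoid_brute_force(points, math.dist)
```

Because `dist` is passed in as a function, the same sketch applies to any set endowed with a distance, for example shortest path lengths on a graph, at the cost of $N^2$ distance evaluations.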
In this paper we present an algorithm which has expected run time $O(N^{3/2})$ under certain assumptions and always returns the medoid. In other words, we present an exact medoid algorithm with improved complexity over the state-of-the-art approximate algorithm, TOPRANK. We show through experiments that the new algorithm works well for low-dimensional data in $\mathbb{R}^d$ and for spatial network data. Our new medoid algorithm, which we call trimed, uses the triangle inequality to quickly eliminate elements which cannot be the medoid. The $O(N^{3/2})$ run time follows from the surprising result that all but $O(N^{1/2})$ elements can be eliminated in this way.

The complexity bound on expected run time which we derive contains a term which grows exponentially in dimension $d$, and experiments show that in very high dimensions trimed often ends up computing $O(N^2)$ distances.

1.2 K-medoids algorithms and our contribution

The K-medoids problem is to partition a set into $K$ clusters, so as to minimise the sum over elements of dissimilarities with their nearest medoids. That is, to choose $M = \{m(1), \ldots, m(K)\} \subset \{1, \ldots, N\}$ to minimise,

$$L(M) = \sum_{i=1}^{N} \min_{k \in \{1, \ldots, K\}} \mathrm{diss}(x(i), x(m(k))).$$

We focus on the special case where the dissimilarity is a distance (diss = dist), which is still more general than K-means which only applies to vector spaces. K-medoids is used in bioinformatics where elements are genetic sequences or gene expression levels (Chipman et al., 2003) and has been applied to clustering on graphs (Rattigan et al., 2007). In machine vision, K-medoids is often preferred, as a medoid is more easily interpretable than a mean (Frahm et al., 2010).

The K-medoids problem is NP-hard, but there exist approximation algorithms. The Voronoi iteration algorithm, appearing in Hastie et al. (2001) and later in Park and Jun (2009), consists of alternating between updating medoids and assignments, much in the same way as Lloyd's algorithm works for the K-means problem. We will refer to it as KMEDS, and to Lloyd's K-means algorithm as lloyd.

One significant difference between KMEDS and lloyd is that the computation of a medoid is quadratic in the number of elements per cluster whereas the computation of a mean is linear. By incorporating our new medoid algorithm into KMEDS, we break the quadratic dependency of KMEDS, bringing it closer in performance to lloyd. We also show how ideas for accelerating lloyd presented in Elkan (2003) can be used in KMEDS.

It should be noted that algorithms other than KMEDS have been proposed for finding approximate solutions to the K-medoids problem, and have been shown to be very effective in Newling and Fleuret (2016b). These include PAM and CLARA of Kaufman and Rousseeuw (1990), and CLARANS of Ng et al. (2005). In this paper we do not compare cluster qualities of previous algorithms, but focus on accelerating the lloyd equivalent for K-medoids as a test setting for our medoid algorithm trimed.

2 Previous works

2.1 A related problem: the geometric median

A problem closely related to the medoid problem is the geometric median problem. In the vector space $\mathbb{R}^d$ the geometric median, assuming it is unique, is defined as,

$$g(S) = \arg\min_{v \in V} \left( \sum_{y \in S} \| v - y \| \right). \qquad (1)$$

While the medoid of a set is defined in any space with a distance measure, the geometric median is specific to vector spaces, where addition and scalar multiplication are defined. The convexity of the objective function being minimised in (1) has enabled the development of fast algorithms. In particular, Cohen et al. (2016) present an algorithm which obtains an estimate for the geometric median with relative error $1 + O(\epsilon)$ with complexity $O(nd \log^3(n/\epsilon))$. In $\mathbb{R}^d$, one may hope that such an algorithm can be converted into an exact medoid algorithm, but it is not clear how to do this. Thus, while it may be possible that fast geometric median algorithms can provide inspiration in the development of medoid algorithms, they do not work out of the box. Moreover, geometric median algorithms cannot be used for network data as they only work in vector spaces, thus they are useless for the spatial network datasets which we consider in §5.

2.2 Medoid Algorithms: TOPRANK and TOPRANK2

In Eppstein and Wang (2004), the RAND algorithm for estimating the energy of all elements of a set $S = \{x(1), \ldots, x(N)\}$ is presented. While RAND is presented in the context of graphs, where the $N$ elements are nodes of an undirected graph and the metric is shortest path length, it can equally well be applied to any set endowed with a distance. The simple idea of RAND is to estimate the energy of each element from a sample of anchor nodes $I$, so that for $j \in \{1, \ldots, N\}$,

$$\hat{E}(j) = \frac{1}{|I|} \sum_{i \in I} \mathrm{dist}(x(j), x(i)).$$

If assumption 3 holds, then the run time is $\tilde{O}(N^{5/3})$. A second algorithm presented in Okamoto et al. (2008)

An elegant feature of RAND in the context of sparse graphs is that Dijkstra's algorithm needs only be run from anchor nodes $i \in I$, and not from every node.
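The anchor-based estimate $\hat{E}(j)$ can be sketched for a generic set with a distance. This is an illustrative sketch rather than the implementation of Eppstein and Wang (2004): the function name `rand_energy_estimates`, its parameters, and the choice to sample anchors uniformly without replacement are assumptions made here.

```python
import math
import random


def rand_energy_estimates(points, dist, num_anchors, rng):
    """Estimate the energy of every element from a sample of anchors I.

    Returns (estimates, anchors), where estimates[j] is the mean distance
    from element j to the sampled anchors. Sampling without replacement
    is an assumption of this sketch.
    """
    n = len(points)
    anchors = rng.sample(range(n), num_anchors)
    totals = [0.0] * n
    # One pass over all elements per anchor. On a sparse graph, each such
    # pass would correspond to one Dijkstra run from that anchor node.
    for i in anchors:
        for j in range(n):
            totals[j] += dist(points[j], points[i])
    return [t / num_anchors for t in totals], anchors


# Toy data (assumed); 2 anchors give |I| * N = 10 distance evaluations
# instead of the N^2 = 25 needed for exact energies.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (0.4, 0.4)]
estimates, anchors = rand_energy_estimates(points, math.dist, 2, random.Random(0))
```

When every element is used as an anchor, the estimate coincides with the exact energy $E(j)$; with fewer anchors the cost drops to $|I| \cdot N$ distance evaluations at the price of estimation error.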