Fast Euclidean Minimum Spanning Tree: Algorithm, Analysis, and Applications

William B. March, Parikshit Ram, Alexander G. Gray
School of Computational Science & Engineering, Georgia Institute of Technology
266 Ferst Dr., Atlanta, GA 30332
{march@, p.ram@, agray@cc.}gatech.edu

ABSTRACT

The Euclidean Minimum Spanning Tree problem has applications in a wide range of fields, and many efficient algorithms have been developed to solve it. We present a new, fast, general EMST algorithm, motivated by the clustering and analysis of astronomical data. Large-scale astronomical surveys, including the Sloan Digital Sky Survey, and large simulations of the early universe, such as the Millennium Simulation, can contain millions of points and fill terabytes of storage. Traditional EMST methods scale quadratically, and more advanced methods lack rigorous runtime guarantees. We present a new dual-tree algorithm for efficiently computing the EMST, use adaptive algorithm analysis to prove the tightest (and possibly optimal) runtime bound for the EMST problem to date, and demonstrate the scalability of our method on astronomical data sets.

Categories and Subject Descriptors
F.2.0 [Analysis of Algorithms and Problem Complexity]: General; I.5.3 [Clustering]: Algorithms

General Terms
Algorithms, Theory

Keywords
Adaptive Algorithm Analysis, Euclidean Minimum Spanning Trees

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'10, July 25–28, 2010, Washington, DC, USA.
Copyright 2010 ACM 978-1-4503-0055-1/10/07 ...$10.00.

1. INTRODUCTION

We present a new algorithm for the fundamental and widely applied Euclidean Minimum Spanning Tree (EMST) problem. Given a set of points S in R^d, our goal is to find the lowest weight spanning tree in the complete graph on S with edge weights given by the Euclidean distances between points.

With references in the literature as early as 1926, the MST problem is one of the oldest and most thoroughly studied problems in computational geometry [36]. In addition to this long-standing theoretical and algorithmic interest, the MST is useful for many practical data analysis problems. Many optimization problems can be posed as the search for the MST in a network [36]. The MST is also used as an approximation for the traveling salesman problem [24], in document clustering [48], analysis of gene expression data [15], wireless network connectivity [46], percolation analyses [8], and modeling of turbulent flows [44], among other areas.

These problems are commonly solved in the Euclidean setting. In this case, the computational bottleneck in both traditional MST algorithms like Kruskal's [28] and Prim's [37] and more advanced methods is finding the nearest neighbor of components in a spanning forest. We propose a new method to overcome this obstacle and demonstrate its theoretical and experimental superiority.

In particular, we are interested in using the EMST to compute hierarchical clusterings [20, 53]. One such clustering is obtained by deleting all edges longer than a specified cutoff in the MST; the remaining connected components form the clusters. Varying the scale of the cutoff generates a hierarchical clustering. In the clustering literature, this is often referred to as a single-linkage clustering and is frequently represented by a dendrogram. While the single-linkage clustering is very simple and can be sub-optimal for many applications, it can form the basis of more insightful clusterings. The single-linkage clustering can be pruned to obtain more useful astronomical results [4]. MSTs also form the inner loop for methods to identify non-parametric clusters in noisy data [49]. Furthermore, theoretically optimal clusterings can be obtained efficiently from the single-linkage clustering [3].

In astronomy, EMST-based clustering is used to analyze deep-space surveys and simulations of the early universe. Each level of single-linkage clustering is known as a friends-of-friends clustering [40, 2]. The EMST is used to identify dark matter haloes in simulations, which are believed to be crucial to galaxy formation [29]. Clustering is also applied to sky surveys to identify the super-large scale structure of the universe, which sheds light on the conditions of the early universe and the mechanisms of galaxy formation [4].

The volume of data produced in the astronomy community has grown explosively in recent years. Recent large surveys include the Las Campanas Redshift Survey (26,418 objects) [42], the 6dF Galaxy Survey (125,071 galaxies) [25], the 2dF Galaxy Redshift Survey (382,323 objects) [13], and the Sloan Digital Sky Survey (over 230 million objects) [52].

In addition, our understanding of cosmology has benefitted from large-scale simulations of the formation of galaxies and conditions in the early universe. For example, the Millennium Simulation [43] contains over 1 billion points and produces terabytes of output. The analysis of the current structure of the universe as revealed in the large sky surveys and comparison to the predictions of theories in simulations are the keys to understanding the origins of the cosmos and validating new models, including verification of dark matter and dark energy. This in turn requires the ability to compute minimum spanning trees quickly and accurately for very large data from a variety of distributions.

Adaptive Analysis. Traditional approaches to algorithm analysis use the running time for the worst possible input as an upper bound for the running time of all instances. This often leads to overly pessimistic bounds due to a few pathological inputs. Adaptive analysis seeks to improve these results by considering properties of the inputs in the analysis. By bounding the runtime in terms of these properties, one can obtain tighter and more informative bounds. Adaptive analysis has been successfully applied to many fundamental problems including searching in lists [6], merging arrays [14], sorting [16], and the convex hull problem [27]. Despite these successes, the difficulty of characterizing the inputs in relation to the problem has limited the number of applications.

Our Contribution. We present a new Euclidean minimum spanning tree algorithm, DualTreeBoruvka. Using the dual-tree algorithmic framework [22], we can efficiently compute the shortest edge between components in a spanning forest, thus overcoming the bottleneck of most EMST methods. We show:

• The first application of adaptive algorithm analysis to the EMST problem in order to achieve the tightest and most precise runtime bounds to date.

• The asymptotically fastest EMST runtime:

    O(N log N α(N)) ≈ O(N log N)

  where α(N) is related to the functional inverse of Ackermann's function and α(10^80) ≤ 4. Our analysis, unlike some previous work, accounts for the complexity of tracking connected components in a partial MST and reduces the difference between upper and lower bounds to a factor of α(N).

2. RELATED WORK

Many MST algorithms rely on Tarjan's blue rule [45], which says the minimum weight edge across any edge cut is in the minimum spanning tree. This allows us to greedily form cuts in the graph and add the minimum weight edge across each. Algorithms using this rule include Kruskal's [28] and Prim's [37], which require O(m log n) and O(m + n log n) time, respectively, on a graph with n points and m edges. Both algorithms maintain one or more components in a spanning forest and use the cut between one component of the forest and the rest of the graph, adding the edges found in this way one at a time.

Borůvka's Algorithm. In this work, we focus on the earliest known minimum spanning tree algorithm, Borůvka's algorithm, which dates from 1926. See [32] for a translation and commentary on Borůvka's original papers. As in Kruskal's algorithm, a minimum spanning forest is maintained throughout the algorithm. Kruskal's algorithm adds the minimum weight edge between any two components of the forest at each step, thus requiring N − 1 steps to complete. Borůvka's algorithm finds the minimum weight edge incident with each component, and adds all such edges, thus requiring at most log N steps and a total running time of O(m log n). We define the nearest neighbor pair of a component C as the pair of points q ∈ C, r ∉ C that minimizes d(q, r). Finding the nearest neighbor pair for each component and adding the edges (q, r) to the forest is called a Borůvka step. Borůvka's algorithm then consists of forming an initial spanning forest with each point as a component and iteratively applying Borůvka steps until all components are joined.

General MST Algorithms. More recently, sophisticated algorithms have been developed for the MST problem on general graphs. Fredman & Tarjan [17] showed a bound of O(m log* n), which was soon improved to O(m log log* m) [19]. Yao further improved the bound to O(m log log n) [50]. Chazelle showed O(m α log α), where α(m, n) is a functional inverse of Ackermann's function [11]. Chazelle [12] and Pettie [34] improved this to O(m α). The current tightest bound, due to Pettie & Ramachandran [35] in 2002, is O(T*(m, n)), where T* is bounded from below by Ω(m) and above by O(m · α). All these general algorithms are insufficient for large, metric problems because they depend linearly on the
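As an illustration of the Borůvka steps described in Section 2, the following is a minimal sketch in Python. It is not the DualTreeBoruvka algorithm: it finds each component's nearest neighbor pair by brute force over all point pairs, so every step costs quadratic time. The function names `find` and `boruvka_emst` and the union-find representation of components are assumptions made for this example.

```python
import math
from itertools import combinations

def find(parent, i):
    # Follow parent links to the component root, halving the path as we go.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def boruvka_emst(points):
    """Return EMST edges (q, r, distance) for a list of coordinate tuples.

    Illustrative brute-force version: each Boruvka step scans all pairs.
    """
    n = len(points)
    parent = list(range(n))  # initial forest: every point is a component
    edges = []
    while len(edges) < n - 1:
        # best[root] = (distance, q, r), the nearest neighbor pair found so
        # far for the component with that root. Tuples compare by distance
        # first, then by indices, which breaks ties consistently.
        best = {}
        for q, r in combinations(range(n), 2):
            cq, cr = find(parent, q), find(parent, r)
            if cq == cr:
                continue  # same component: not a candidate edge
            cand = (math.dist(points[q], points[r]), q, r)
            for c in (cq, cr):
                if c not in best or cand < best[c]:
                    best[c] = cand
        # Boruvka step: add every component's shortest outgoing edge,
        # skipping an edge whose endpoints were already joined in this step.
        for d, q, r in best.values():
            cq, cr = find(parent, q), find(parent, r)
            if cq != cr:
                parent[cq] = cr
                edges.append((q, r, d))
    return edges
```

Calling `boruvka_emst([(0, 0), (1, 0), (3, 0), (3, 4)])` returns three edges. The pairwise scan inside the loop is the nearest-neighbor bottleneck that the paper's dual-tree approach is designed to avoid.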
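The cutoff-based single-linkage clustering described in the introduction can be sketched in the same spirit. The example assumes the MST is already available as a list of `(i, j, weight)` edges over points numbered `0` to `n - 1`. The function name `single_linkage_clusters` and this edge format are hypothetical choices for illustration, not taken from the paper.

```python
def single_linkage_clusters(n, mst_edges, cutoff):
    """Label n points by the components left after dropping long MST edges.

    mst_edges is a list of (i, j, weight) tuples; edges with weight greater
    than cutoff are deleted, and each remaining connected component becomes
    one cluster. Labels are integers assigned in order of first appearance.
    """
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, w in mst_edges:
        if w <= cutoff:  # keep only edges no longer than the cutoff
            parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

Re-running the function with a range of cutoff values gives the successive levels of the hierarchy. Since the MST is computed only once, each additional level costs time roughly linear in the number of points.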
