Programming the K-Means Clustering Algorithm in SQL
Programming the K-means Clustering Algorithm in SQL

Carlos Ordonez
Teradata, NCR
San Diego, CA, USA

(c) ACM, 2004. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in KDD Conference 2004, http://doi.acm.org/10.1145/1014052.1016921

ABSTRACT
Using SQL has not been considered an efficient and feasible way to implement data mining algorithms. Although this is true for many data mining, machine learning and statistical algorithms, this work shows it is feasible to get an efficient SQL implementation of the well-known K-means clustering algorithm that can work on top of a relational DBMS. The article emphasizes both correctness and performance. From a correctness point of view, the article explains how to compute Euclidean distance, nearest-cluster queries and the updating of clustering results in SQL. From a performance point of view, it explains how to cluster large data sets by defining and indexing tables to store and retrieve intermediate and final results, optimizing and avoiding joins, optimizing and simplifying clustering aggregations, and taking advantage of sufficient statistics. Experiments evaluate scalability with synthetic data sets of varying size and dimensionality. The proposed K-means implementation can cluster large data sets and exhibits linear scalability.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining

General Terms
Algorithms, Languages

Keywords
Clustering, SQL, relational DBMS, integration

1. INTRODUCTION
There exist many efficient clustering algorithms in the data mining literature. Most of them follow the approach proposed in [14], minimizing disk access and doing most of the work in main memory. Unfortunately, many of those algorithms are hard to implement inside a real DBMS, where the programmer needs to worry about storage management, concurrent access, memory leaks, fault tolerance, security and so on. On the other hand, SQL has grown over the years into a fairly comprehensive and complex query language in which the aspects mentioned above are, for the most part, handled automatically, or can be tuned by the database application programmer. Moreover, nowadays SQL is the standard way to interact with a relational DBMS. So can SQL, as it exists today, be used to get an efficient implementation of a clustering algorithm? This article shows the answer is yes for the popular K-means algorithm. It is worth mentioning that programming data mining algorithms in SQL has not received much attention in the database literature. This is because SQL, being a database language, is constrained to work with tables and columns, and thus does not provide the flexibility and speed of a high-level programming language like C++ or Java. Summarizing, this article presents an efficient SQL implementation of the K-means algorithm that can work on top of a relational DBMS to cluster large data sets.

The article is organized as follows. Section 2 introduces definitions and an overview of the K-means algorithm. Section 3 introduces several alternatives and optimizations to implement the K-means algorithm in SQL. Section 4 contains experiments to evaluate performance with synthetic data sets. Section 5 discusses related work. Section 6 provides general conclusions and directions for future work.

2. DEFINITIONS
The basic input for K-means [2, 6] is a data set Y containing n points in d dimensions, Y = {y1, y2, ..., yn}, and k, the desired number of clusters. The output consists of three matrices W, C, R, containing k weights, k means and k variances, respectively, one per cluster, together with a partition of Y into k subsets. Matrices C and R are d x k and W is k x 1. Throughout this work three subscripts are used to index matrices: i = 1, ..., n; j = 1, ..., k; l = 1, ..., d. To refer to one column of C or R we use the j subscript (e.g. Cj, Rj); Cj can be understood as a d-dimensional vector containing the centroid of the jth cluster, with the respective squared radii per dimension given by Rj. For transposition we use the T superscript; for instance, Cj refers to the jth centroid in column form and Cj^T is the jth centroid in row form. Let X1, X2, ..., Xk be the k subsets of Y induced by the clusters, such that Xj ∩ Xj' = ∅ for j ≠ j'. K-means uses Euclidean distance to find the nearest centroid to each input point, defined as

  d(yi, Cj) = (yi - Cj)^T (yi - Cj) = sum_{l=1}^{d} (y_li - C_lj)^2.

K-means can be described at a high level as follows. Centroids Cj are generally initialized with k points randomly selected from Y, or with an approximation when there is an idea about the potential clusters. The algorithm iterates, executing the E and the M steps starting from some initial solution, until the cluster centroids become stable. The E step determines the nearest cluster for each point and adds the point to it. That is, the E step determines cluster membership and partitions Y into k subsets. The M step updates all centroids Cj by averaging the points belonging to the same cluster. Then the k cluster weights Wj and the k diagonal variance matrices Rj are updated based on the new Cj centroids. The quality of a clustering solution is measured by the average quantization error q(C) (also known as squared assignment distance [6]). The goal of K-means is minimizing q(C), defined as

  q(C) = sum_{i=1}^{n} d(yi, Cj)/n, where yi ∈ Xj.

This quantity measures the average squared distance from each point to the cluster to which it was assigned, according to the partition into k subsets. K-means stops when q(C) changes by a marginal fraction (ε) in consecutive iterations. K-means is theoretically guaranteed to converge, decreasing q(C) at each iteration [6], but it is common to set a maximum number of iterations to avoid long runs.

3. IMPLEMENTING K-MEANS IN SQL
This section presents our main contributions. We explain how to implement K-means in a relational DBMS by automatically generating SQL code, given as input a table Y with d selected numerical columns and k, the desired number of clusters, as defined in Section 2. The SQL code generator dynamically creates SQL statements, monitoring the difference in the quality of the solution in consecutive iterations in order to stop. There are two main schemes presented here. The first one presents a simple implementation of K-means, explaining how to program each computation in SQL. We refer to this scheme as the Standard K-means implementation. The second scheme presents a more complex K-means implementation incorporating several optimizations that dramatically improve performance. We call this scheme the Optimized K-means implementation.

There are important assumptions behind our proposal from a performance point of view. Two tables with n rows having the same primary key can be joined in time O(n) using a hash-based join. So if a different DBMS does not provide hash-based indexing, joining tables may take longer than O(n). However, the proposed scheme should still provide the most efficient implementation even in such cases.

... to implement K-means. In general we omit Data Definition Language (DDL) statements and deletion statements to make the exposition more concise. Thus most of the SQL code presented involves Data Manipulation Language (DML) statements. The columns making up the primary key of a table are underlined. Tables are indexed on their primary key for efficient join access. Subscripts i, j, l (see Section 2) are defined as integer columns, while the d numerical dimensions of the points of Y, distances, and the matrix entries of W, C, R are defined as FLOAT columns in SQL. Before each INSERT statement it is assumed there is a "DELETE FROM ... ALL;" statement that leaves the table empty before insertion.

As introduced in Section 2, the input data set has d dimensions. In database terms this means there exists a table Y with several numerical columns, out of which d columns are picked for the clustering analysis. In practice the input table may have many more than d columns, but to simplify the exposition we will assume its definition is Y(Y1, Y2, ..., Yd). So the SQL implementation needs to build its own reduced version, projecting the desired d columns. This motivates defining the following "horizontal" table with d + 1 columns: YH(i, Y1, Y2, ..., Yd), having i as primary key. The first column is the i subscript for each point and then YH has the list of d dimensions. This table saves Input/Output (I/O) access since it may have fewer columns than Y and it is scanned several times as the algorithm progresses. In general it is not guaranteed that i (the point id) exists, because the primary key of Y may consist of more than one column, or it may not exist at all because Y is the result of some aggregations. In an implementation in an imperative programming language like C++ or Java the point identifier is immaterial, since Y is accessed sequentially, but in a relational database it is essential. Therefore it is necessary to automatically create i, guaranteeing a unique identifier for each point yi. The following statement computes a cumulative sum on one scan over Y to get i ∈ {1 ... n} and projects the desired d columns.

INSERT INTO YH
SELECT sum(1) over(rows unbounded preceding) AS i
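The E and M steps, together with the q(C)-based stopping criterion, can be cross-checked with a short in-memory sketch. The Python below is an illustration only (the article's actual implementation is in SQL); the function name `kmeans` and the synthetic two-blob data are our own, and the initialization uses k random points as described above.

```python
import random

def kmeans(Y, k, eps=1e-4, max_iters=100):
    """Plain K-means on a list of d-dimensional tuples.
    Returns (W, C, q): cluster weights, centroids, quantization error."""
    n, d = len(Y), len(Y[0])
    C = [list(p) for p in random.sample(Y, k)]  # init: k distinct random points
    prev_q = float("inf")
    for _ in range(max_iters):
        # E step: assign each point to the nearest centroid (squared Euclidean)
        X = [[] for _ in range(k)]
        q = 0.0
        for y in Y:
            dists = [sum((y[l] - C[j][l]) ** 2 for l in range(d)) for j in range(k)]
            j = min(range(k), key=dists.__getitem__)
            X[j].append(y)
            q += dists[j]
        q /= n  # average quantization error q(C)
        # M step: recompute each centroid as the mean of its assigned points
        for j in range(k):
            if X[j]:
                C[j] = [sum(y[l] for y in X[j]) / len(X[j]) for l in range(d)]
        if prev_q - q < eps * prev_q:  # stop when q(C) changes by a marginal fraction
            break
        prev_q = q
    W = [len(X[j]) / n for j in range(k)]  # cluster weights Wj
    return W, C, q
```

On two well-separated blobs the loop converges in a few iterations to a 50/50 weight split with a small q(C).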
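The construction of the horizontal table YH with a generated point id can be sketched end-to-end with Python's built-in sqlite3 module. This is an illustration under our own assumptions (a hypothetical d = 2 input table Y with one extra non-clustering column; the article targets a full relational DBMS), using the same cumulative-sum window expression; SQLite needs version 3.25 or later for window functions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Hypothetical input table Y: two numeric dimensions plus an unused column,
# and no single-column point identifier of its own.
cur.execute("CREATE TABLE Y (Y1 FLOAT, Y2 FLOAT, other TEXT)")
cur.executemany("INSERT INTO Y VALUES (?, ?, ?)",
                [(1.0, 2.0, "a"), (3.0, 4.0, "b"), (5.0, 6.0, "c")])
# Horizontal table YH(i, Y1, ..., Yd) with i as primary key.
cur.execute("CREATE TABLE YH (i INTEGER PRIMARY KEY, Y1 FLOAT, Y2 FLOAT)")
# One scan over Y: the cumulative sum generates i in {1..n} and the
# SELECT list projects only the d clustering columns.
cur.execute("""
    INSERT INTO YH
    SELECT sum(1) OVER (ROWS UNBOUNDED PRECEDING) AS i, Y1, Y2
    FROM Y
""")
rows = cur.execute("SELECT i, Y1, Y2 FROM YH ORDER BY i").fetchall()
```

The resulting ids are unique and dense (1..n), which is what makes the later O(n) hash joins on the primary key possible.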