Evaluation Metrics for Unsupervised Learning Algorithms

Julio-Omar Palacio-Niño, Dept. Computer Science and Artificial Intelligence, Universidad de Granada, Granada, Spain
Fernando Berzal, Dept. Computer Science and Artificial Intelligence, Universidad de Granada, Granada, Spain

arXiv:1905.05667v2 [cs.LG] 23 May 2019

Abstract: Determining the quality of the results obtained by clustering techniques is a key issue in unsupervised machine learning. Many authors have discussed the desirable features of good clustering algorithms. However, Jon Kleinberg established an impossibility theorem for clustering. As a consequence, a wealth of studies have proposed techniques to evaluate the quality of clustering results depending on the characteristics of the clustering problem and the algorithmic technique employed to cluster data.

Index Terms: clustering, unsupervised learning, evaluation metrics

I. INTRODUCTION

Machine learning techniques are usually classified into supervised and unsupervised techniques. Supervised machine learning starts from prior knowledge of the desired result in the form of labeled data sets, which makes it possible to guide the training process, whereas unsupervised machine learning works directly on unlabeled data. In the absence of labels to orient the learning process, these labels must be “discovered” by the learning algorithm. [1]

In this technical report, we discuss the desirable features of good clustering results, recall Kleinberg’s impossibility theorem for clustering, and describe a taxonomy of evaluation criteria for unsupervised machine learning. We also survey many of the evaluation metrics that have been proposed in the literature. We end our report by describing the techniques that can be used to adjust the parameters of clustering algorithms, i.e. their hyperparameters.

II. FORMAL LIMITATIONS OF CLUSTERING

From an intuitive point of view, the clustering problem has a very clear goal; namely, properly clustering a set of unlabeled data. Despite its intuitive appeal, the notion of “cluster” cannot be precisely defined, hence the wide range of clustering algorithms that have been proposed. [2]

A. Desirable Features of Clustering

Jon Kleinberg proposes three axioms that highlight the characteristics that a grouping problem should exhibit in order to be considered “good”, independently of the algorithm used to find the solution. These axioms are scale invariance, richness, and consistency [3], which are explained in more detail below.

A grouping (clustering) function f is defined on a set S of n ≥ 2 points and the distances between pairs of points: it takes a distance function d on S and returns a partition Γ of S. The set of points is S = {1, 2, ..., n} and the distance between points is given by the distance function d(i, j), where i, j ∈ S. The distance function d measures the dissimilarity between pairs of points. For instance, the Euclidean, Manhattan, Chebyshev, and Mahalanobis distances can be used, among many others. Alternatively, a similarity function might also be used.

1) Scale Invariance: The first of Kleinberg’s axioms states that f(d) = f(α · d) for any distance function d and any scaling factor α > 0. [3]

This simple axiom indicates that a clustering algorithm should not modify its results when all distances between points are scaled by a constant factor α.

2) Richness: A clustering process is considered to be rich when every partition of S is a possible result of the clustering process. If we use Range(f) to denote the set of all partitions Γ such that f(d) = Γ for some distance function d, then Range(f) is equal to the set of all partitions of S. [3]

This means that the clustering function must be flexible enough to produce any arbitrary partition/clustering of the input data set.

3) Consistency: Let d and d′ be two distance functions. If, for every pair (i, j) belonging to the same cluster, d(i, j) ≥ d′(i, j), and for every pair (i, j) belonging to different clusters, d(i, j) ≤ d′(i, j), then f(d) = f(d′). [3]

A clustering process is “consistent” when the clustering results do not change if the distances within clusters decrease and/or the distances between clusters increase.

B. An Impossibility Theorem for Clustering

Given the above three axioms, Kleinberg proves the following theorem: For every n ≥ 2, there is no clustering function f that satisfies scale invariance, richness, and consistency. [3]

Determining a “good” clustering is not a trivial problem. It is impossible for any clustering procedure to satisfy all three axioms. Practical clustering algorithms must trade off the desirable features of clustering results.

Since the three axioms cannot hold simultaneously, clustering algorithms can be designed to violate one of the axioms while satisfying the other two. Kleinberg illustrates this point by describing three variants of single-link clustering (an agglomerative hierarchical clustering algorithm): [3]

• k-cluster stopping condition: Stop merging clusters when we have k clusters (violates the richness axiom, since the algorithm would never return a number of clusters different from k).

• Distance-r stopping condition: Stop merging clusters when the nearest pair of clusters are farther than r (violates scale invariance given that every cluster will contain a single instance when α is large, whereas a single cluster will contain all data when α → 0).

• Scale-ε stopping condition: Stop merging clusters when the nearest pair of clusters are farther than a fraction ε of the maximum pairwise distance ∆ (scale invariance is satisfied, yet consistency is violated).

Clustering algorithms can often satisfy the properties of scale invariance and consistency by relaxing their richness (e.g. whenever the number of clusters is established beforehand). As we have seen, some algorithms can even be customized to satisfy two out of three axioms by relaxing the third one (e.g. single linkage with different stopping criteria).

III. METHODS FOR CLUSTER EVALUATION

Evaluating the results of a clustering algorithm is a very important part of the process of clustering data. In supervised learning, “the evaluation of the resulting classification model is an integral part of the process of developing a classification model and there are well-accepted evaluation measures and procedures” [4]. In unsupervised learning, because of its very nature, cluster evaluation, also known as cluster validation, is not as well-developed. [4]

In clustering problems, it is not easy to determine the quality of a clustering algorithm. This gives rise to multiple evaluation techniques. Quite often, the evaluation process includes a notorious particularity: the way the measurement is performed depends on the algorithm used to obtain the clustering results.

When analyzing clustering results, several aspects must be taken into account for the validation of the algorithm results [4]:

• Determining the clustering tendency in the data (i.e. whether non-random structure really exists).

• Determining the correct number of clusters.

• Assessing the quality of the clustering results without external information.

• Comparing the results obtained with external information.

• Comparing two sets of clusters to determine which one is better.

The first three issues are addressed by internal or unsupervised validation, because there is no use of external information. The fourth issue is resolved by external or supervised validation. Finally, the last issue can be addressed by both supervised and unsupervised validation techniques. [4]

Gan et al. [5] propose a taxonomy of evaluation techniques that comprises both internal and external validation approaches (see Figure 1).

Figure 1. Taxonomy of clustering evaluation methods (adaptation) [5]. (Diagram labels: cluster validity index, statistical testing, nonstatistical testing, external criteria, internal criteria, relative criteria.)

A. Null Hypothesis Testing

One of the desirable characteristics of a clustering process is to show whether data exhibits some tendency to form actual clusters. From a statistical point of view, a feasible approach consists of testing whether data exhibits random behavior or not [6]. In this context, null hypothesis testing can be used: a null hypothesis H0 is assumed to be true until evidence suggests otherwise. In this case, the null hypothesis is the randomness of data and, when the null hypothesis is rejected, we assume that the data is significantly unlikely to be random. [5]

One of the difficulties of null hypothesis testing in this context is determining the statistical distribution under which the randomness hypothesis can be rejected. Jain and Dubes propose three alternatives [7]:

• Random plot hypothesis H0: All proximity matrices of order n × n are equally likely.

• Random label hypothesis H0: All permutations

[...] used by the evaluation methods are the same metrics that the clustering algorithm tries to optimize, which can be counterproductive in determining the quality of a clustering algorithm and can deliver unfair validation results. On the other hand, in the absence of other sources of information, these metrics allow different algorithms to be compared under the same evaluation criterion [8], yet care must be taken not to report biased results.

Internal evaluation methods are commonly classified according to the type of clustering algorithm they are used with. For partitional algorithms, metrics based on the proximity matrix, as well as metrics of cohesion and separation, such as the silhouette coefficient, are often used. For hierarchical algorithms, the cophenetic coefficient is the most common (see Figure 3).
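The silhouette coefficient named above as a cohesion-and-separation metric is not defined in this part of the report. As a minimal sketch, assuming the standard definition s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other members of its cluster (cohesion) and b(i) is the smallest mean distance from i to the members of any other cluster (separation), it could be computed as follows. The function names, the choice of Euclidean distance, and the toy data are illustrative assumptions, not taken from the report.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_coefficient(points, labels, dist=euclidean):
    """Mean silhouette over all points (requires at least two clusters)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster contributes 0.
            continue
        # Cohesion a(i): mean distance to the other members of its cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # Separation b(i): smallest mean distance to any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


# Toy data (made up for illustration): two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]  # matches the visible structure
bad = [0, 1, 0, 1, 0, 1]   # mixes the two groups

score_good = silhouette_coefficient(points, good)
score_bad = silhouette_coefficient(points, bad)
print(f"good partition: {score_good:.3f}")
print(f"bad partition:  {score_bad:.3f}")
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlapping or misassigned points. Because the score depends only on the data and the partition, it can be used to compare the output of different algorithms, subject to the bias caveat raised above when an algorithm optimizes a closely related criterion.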
