Hierarchical Clustering with Prior Knowledge

Xiaofei Ma, Amazon.com Inc., Seattle, Washington, [email protected]
Satya Dhavala, Amazon.com Inc., Seattle, Washington, [email protected]

ABSTRACT
Hierarchical clustering is a class of algorithms that seeks to build a hierarchy of clusters. It has been the dominant approach to constructing embedded classification schemes since it outputs dendrograms, which capture the hierarchical relationships among members at all levels of granularity simultaneously. Being greedy in the algorithmic sense, a hierarchical clustering partitions data at every step solely based on a similarity / dissimilarity measure. The clustering results oftentimes depend not only on the distribution of the underlying data, but also on the choice of dissimilarity measure and clustering algorithm. In this paper, we propose a method to incorporate prior domain knowledge about entity relationships into hierarchical clustering. Specifically, we use a distance function in ultrametric space to encode the external ontological information. We show that popular linkage-based algorithms can faithfully recover the encoded structure. Similar to some regularized machine learning techniques, we add this distance as a penalty term to the original pairwise distance to regulate the final structure of the dendrogram. As a case study, we applied this method on real data in the building of a customer behavior based product taxonomy for an Amazon service, leveraging the information from a larger Amazon-wide browse structure. The method is useful when one wants to leverage relational information from external sources, or when the data used to generate the distance measure is noisy and sparse. Our work falls in the category of semi-supervised or constrained clustering.

CCS CONCEPTS
• Computing methodologies → Regularization; Semi-supervised learning settings;

KEYWORDS
hierarchical clustering, semi-supervised clustering, ultrametric distance, regularization

ACM Reference Format:
Xiaofei Ma and Satya Dhavala. 2018. Hierarchical Clustering with Prior Knowledge. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Hierarchical clustering is a prominent class of clustering algorithms. It has been the dominant approach to constructing embedded classification schemes [27]. Compared with partition-based methods (flat clustering) such as K-means, hierarchical clustering offers several advantages. First, there is no need to pre-specify the number of clusters. Hierarchical clustering outputs a dendrogram (tree), which the user can then traverse to obtain the desired clustering. Second, the dendrogram structure provides a convenient way of exploring entity relationships at all levels of granularity. Because of that, for some applications such as taxonomy building, the dendrogram itself, not any clustering found in it, is the desired outcome. For example, hierarchical clustering has been widely employed and explored within the context of phylogenetics, which aims to discover the relationships among individual species and reconstruct the tree of biological evolution. Furthermore, when a dataset exhibits multi-scale structure, hierarchical clustering is able to generate a hierarchical partition of the data at different levels of granularity, while a standard partition-based algorithm will fail to capture the nested data structure.

In a typical hierarchical clustering problem, the input is a set of data points and a notion of dissimilarity between the points, which can also be represented as a weighted graph whose vertices are data points and whose edge weights represent pairwise dissimilarities between the points. The output of the clustering is a dendrogram, a rooted tree where each leaf node represents a data point, and each internal node represents a cluster containing its descendant leaves.
As the internal nodes get deeper in the tree, the points within the clusters become more similar to each other, and the clusters become more refined. Algorithms for hierarchical clustering generally fall into two types. Agglomerative ("bottom up") approach: each observation starts in its own cluster, and at every step the pair of most similar clusters is merged. Divisive ("top down") approach: all observations start in one cluster, and splits are performed recursively, dividing a cluster into two clusters that will be further divided.

As a popular data analysis method, hierarchical clustering has been studied and used for decades. Despite its widespread use, it has rather been studied at a procedural level in terms of practical algorithms. There are many hierarchical algorithms, and oftentimes different algorithms produce dramatically different results on the same dataset. Compared with partition-based methods such as K-means and K-medians, hierarchical clustering has a relatively underdeveloped theoretical foundation. Very recently, Dasgupta [12] introduced an objective function for hierarchical clustering and justified it for several simple and canonical situations. A theoretical guarantee for this objective was further established [26] for some of the widely used hierarchical clustering algorithms. Their works give insight into what those popular algorithms are optimizing for. Another route of theoretical research is to study clustering schemes under an axiomatic view [2, 7, 11, 15, 25, 31], characterizing different algorithms by the significant properties they satisfy. One of the influential works is Kleinberg's impossibility theorem [23], where he proposed three axioms for partitional clustering algorithms, namely scale-invariance, richness and consistency. He proved that no clustering function can simultaneously satisfy all three. It is shown [10], however, that if a nested family of partitions instead of a fixed single partition is allowed, which is the case for hierarchical clustering, single linkage hierarchical clustering is the unique algorithm satisfying the properties. The stability and convergence theorems for the single link algorithm are further established. Ackerman [1] proposed two more desirable properties, namely locality and outer consistency, and showed that all linkage-based hierarchical algorithms satisfy the properties. Those property-based analyses provide a better understanding of the techniques, and guide users in choosing algorithms for their crucial tasks.

Based on similarity information alone, clustering is inherently an ill-posed problem where the goal is to partition the data into some unknown number of clusters so that within-cluster similarity is maximized while between-cluster similarity is minimized [19]. It is very hard for a clustering algorithm to recover data partitions that satisfy the various criteria of a concrete task. Therefore, any external or side information from other sources can be extremely useful in guiding clustering solutions. Clustering algorithms that leverage external information fall into the category of semi-supervised or constrained clustering [6]. There are many ways to incorporate external information [5, 24, 29, 30]. Starting from instance-level constraints such as must-link and cannot-link constraints, many approaches modify the objective function of the algorithms to incorporate pairwise constraints. Beyond pairwise constraints, external knowledge has been used as seeds for clustering, as cluster size constraints, or as prior probabilities of cluster assignment. However, the majority of existing semi-supervised clustering methods are based on partition-based clustering. Comparatively few methods for hierarchical clustering have been proposed. In fact, humans are very good at summarizing and extracting high-level relational information between entities. Human-built taxonomies, such as WordNet, Wikipedia, and the 20 newsgroups dataset, are high-quality sources of ontological information that a hierarchical clustering algorithm can leverage. Several factors contributed to the underdevelopment of semi-supervised hierarchical clustering algorithms. One is the lack of global objective functions; only very recently was an objective function for hierarchical clustering proposed [12]. Another reason is that the simple must-link and cannot-link constraints used in flat clustering are not suitable for hierarchical clustering, since entities are linked at different levels of granularity. Furthermore, the output of hierarchical clustering is a dendrogram, which is harder to represent than the result of a flat clustering.

In this paper, we focus on agglomerative hierarchical clustering algorithms since divisive algorithms can be considered as repeated partitional clustering (bisectioning). We describe a method of incorporating prior ontological knowledge into agglomerative hierarchical clustering by using a distance function in ultrametric space representing the complete or partial tree structure. The constructed ultrametric distance is combined with the original task-specific distance to form a new distance measure between the data points. The weight between the two distance components, which reflects the confidence in the prior knowledge, is a hyper-parameter that can be tuned in a cross-validation manner by optimizing an external task-specific metric. We then use a property-based approach to select algorithms to solve the semi-supervised clustering problem. We note that there are several pioneering works on constrained hierarchical clustering [4, 17, 18, 22]. Davidson [13] explored the feasibility problem of incorporating four different instance- and cluster-level constraints into hierarchical clustering. Zhao [32] studied hierarchical clustering with order constraints in order to capture ontological information. Zheng [34] represented triple-wise relative constraints in a matrix form, and obtained the ultrametric representation by solving a constrained optimization problem. Compared with previous studies, our goal is to recover a hierarchical structure of the data which resembles the existing ontology and yet provides new insight into entity relationships based on a task-specific distance measure. The external ontological knowledge serves as soft constraints in our approach, which is different from the hard constraints used in previous works. Our constructed distance measure also fits naturally with the global objective function [12] recently proposed for hierarchical clustering.

The paper is organized as follows. In Section 2, we state the problem and introduce the concepts used in this paper. In Section 3, we discuss our approach to solving the semi-supervised hierarchical clustering problem. In Section 4, we present a case study applying the proposed method on real data to the building of a customer behavior based product taxonomy for an Amazon service. Finally, we summarize the results in Section 5.

2 PROBLEM SETTING
In this section we define the context and the problem we want to solve, i.e. the semi-supervised hierarchical clustering problem.

Given a set of data points X = {x1, x2, ..., xn}, a pairwise dissimilarity measure D = {d(xi, xj) | xi, xj ∈ X}, a task-specific performance measure µ, and a complete or partial tree structure T containing external ontological information, whose leaf nodes are instances of X and whose internal nodes are clusters containing their descendant leaves, the goal of the semi-supervised hierarchical clustering problem is to output a dendrogram over X represented as a pair (X, θ), where X is the set of data points and θ : [0, ∞) → ℘(X), with ℘(X) a partition of X, such that the dendrogram θ resembles T and performs best in terms of µ.

Two important concepts related to the above problem setting are the notion of dissimilarity and the dendrogram.

Definition 2.1. A dissimilarity measure D is usually represented as a pair (X, d), where X is a set and d : X × X → R+ such that for any xi, xj ∈ X:
1. d(xi, xj) ≥ 0, non-negativity
2. d(xi, xj) = 0 if and only if i = j, identity
3. d(xi, xj) = d(xj, xi), symmetry

As an example, cosine dissimilarity is a commonly used dissimilarity measure in high-dimensional positive space.
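To make the problem setting concrete, the following minimal Python sketch (our illustration, not part of the paper; all data values and names are hypothetical) builds a toy dissimilarity measure over four points and treats a dendrogram as the map θ from a scale t to a partition of X, using scipy's hierarchical clustering utilities.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# toy dissimilarity matrix over X = {x1, ..., x4}; symmetric, zero diagonal
D = np.array([[0.0, 0.2, 0.9, 1.0],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.3],
              [1.0, 0.9, 0.3, 0.0]])

Z = linkage(squareform(D), method="single")   # dendrogram over X

def theta(t):
    """Partition of X at scale t, i.e. theta(t) in the notation above."""
    return fcluster(Z, t=t, criterion="distance")

print(theta(0.1))   # finest partition: every point is its own cluster
print(theta(0.5))   # {x1, x2} and {x3, x4} have merged
print(theta(1.0))   # one cluster containing all points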

Definition 2.2. If the dissimilarity also satisfies the following triangle inequality, for any xi, xj, xk ∈ X:

d(xi, xk) ≤ d(xi, xj) + d(xj, xk)    (1)

then we have a distance measure in a metric space.

Euclidean distance and Manhattan distance are popular metric space distances.

Definition 2.3. A dendrogram θ is a tree that satisfies the following conditions [10]:
1. θ(0) = {{x1}, ..., {xn}}
2. There exists t0 such that θ(t) contains only one cluster for t ≥ t0.
3. If r ≤ s, then θ(r) refines θ(s).
4. For all r, there exists ϵ > 0 such that θ(r) = θ(t) for t ∈ [r, r + ϵ].

Condition 1 ensures that the initial partition is the finest possible: each data point forms its own cluster. Condition 2 says that for large enough t the partition becomes trivial: the whole space is one cluster. Condition 3 ensures that the structure of the dendrogram is nested. Condition 4 requires that the partition is stable under small perturbations of size ϵ. The parameter of the dendrogram θ is a measure of scale, and is reflected in the heights of the different levels. The notion of resemblance between dendrograms will be discussed further in Section 3.3.

3 PROPOSED METHOD
In order to incorporate prior knowledge into hierarchical clustering, we need a way to faithfully represent prior relational information between entities. Relational knowledge, such as the hyponymy and synonymy relations in WordNet or the class taxonomy of the 20-newsgroups dataset, can usually be represented as a tree, which suggests that it is convenient to define a distance function that leverages the tree structure. In fact, Resnik's approach [3] to semantic similarity between words was the first attempt to bring together the ontological information in WordNet with corpus information. Figure 1 shows a fragment of the structured lexicons defined in WordNet.

Figure 1: Structured lexicons from WordNet

To encode a tree structure, we first introduce the concept of ultrametric space.

Definition 3.1. A metric space is an ultrametric space (X, u) if and only if, for any xi, xj, xk ∈ X,

d(xi, xk) ≤ max(d(xi, xj), d(xj, xk))    (2)

The ultrametric condition requires that every triangle formed by any three data points is an isosceles triangle whose two longest sides are equal, which is a stronger condition than the triangle inequality in Equation 1.

It is well known that dendrograms can be represented as ultrametrics. The relationship between dendrograms and ultrametrics has been discussed in several works [16, 20, 21, 28]. The equivalence between dendrograms and ultrametrics was further established by Carlsson in [10]. A hierarchical clustering algorithm essentially outputs a map from a finite metric space (X, d) into a finite ultrametric space (X, u).

3.1 An ultrametric function to encode prior relational information
We now propose an ultrametric distance function to encode the tree structure between entities.

Definition 3.2. Let T be a rooted tree of entity relationships. For any node v in T, let T[v] be the subtree rooted at v, leaves(T[v]) be the leaves of the subtree, and |leaves(T[v])| be the number of leaf nodes. For any leaf nodes xi, xj, the expression xi ∨ xj denotes their lowest common ancestor in T. We define a distance function between any leaf nodes xi, xj as follows:

uT(xi, xj) = |leaves(T[xi ∨ xj])| / |leaves(T[root])|    (3)

In the above definition, |leaves(T[root])| is the total number of leaf nodes in the tree T. It is a normalization constant that ensures the distance is within [0, 1]. As an example, Figure 2 shows a small tree consisting of 6 leaf nodes. According to Definition 3.2, the distances between the pairs {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6} are 2/6, 4/6, 4/6, 6/6 and 6/6, respectively. Although the tree structure in Figure 2 doesn't specify the exact distance values, it encodes the hierarchical relations between data points. It is easy to see that point 1 is more similar to point 2 than to point 3 or 5.

Figure 2: A small tree of 6 leaf nodes

Lemma 3.3. The distance function defined in Definition 3.2 is an ultrametric.

Proof. For any leaf nodes xi, xj, xk in T, xk is either in the subtree T[xi ∨ xj] or not in the subtree T[xi ∨ xj]. If xk is in the subtree T[xi ∨ xj], then uT(xi, xk) ≤ uT(xi, xj). If xk is not in the subtree T[xi ∨ xj], we have T[xi ∨ xk] = T[xj ∨ xk] = T[(xi ∨ xj) ∨ xk], and then uT(xi, xk) = uT(xj, xk). Therefore, in either case, uT(xi, xk) ≤ max(uT(xi, xj), uT(xj, xk)). □
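The following Python sketch (our illustration; the tree layout below is an assumption consistent with the distances quoted above, since Figure 2 itself is not reproduced here) computes uT from Equation 3 by walking up to the lowest common ancestor of two leaves, and spot-checks the ultrametric inequality of Lemma 3.3.

# u_T(xi, xj) = |leaves(T[xi v xj])| / |leaves(T[root])| for a 6-leaf tree
parent = {                      # child -> parent; "root" is the root of T
    1: "a", 2: "a",             # a = {1, 2}
    3: "b", 4: "b",             # b = {3, 4}
    "a": "c", "b": "c",         # c = {1, 2, 3, 4}
    5: "root", 6: "root", "c": "root",
}

def leaves(node):
    """Leaf set of the subtree T[node]."""
    kids = [c for c, p in parent.items() if p == node]
    return {node} if not kids else set().union(*(leaves(c) for c in kids))

def ancestors(node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def u_T(x, y):
    lca = next(a for a in ancestors(x) if a in set(ancestors(y)))
    return len(leaves(lca)) / len(leaves("root"))

print([u_T(1, j) for j in (2, 3, 4, 5, 6)])      # [2/6, 4/6, 4/6, 1.0, 1.0]
# Lemma 3.3: every triple satisfies u_T(i, k) <= max(u_T(i, j), u_T(j, k))
assert all(u_T(i, k) <= max(u_T(i, j), u_T(j, k))
           for i in range(1, 7) for j in range(1, 7) for k in range(1, 7))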

Because of the equivalence between dendrograms and ultrametrics, once we encode the tree using an ultrametric distance, there is a unique dendrogram corresponding to it.

A pairwise distance function quantifies the dissimilarity between any pair of points. However, it doesn't define the distance between clusters of points. Linkage-based hierarchical clustering algorithms calculate the distance between clusters based on different heuristics. Let ℓ(C, C′, d) be a linkage function that assigns a non-negative value to each pair of non-empty clusters {C, C′} based on a pairwise distance function d. Some choices of linkage functions are:
1. ℓSL(C, C′, d) = min_{x∈C, x′∈C′} d(x, x′), single linkage
2. ℓCL(C, C′, d) = max_{x∈C, x′∈C′} d(x, x′), complete linkage
3. ℓAL(C, C′, d) = Σ_{x∈C, x′∈C′} d(x, x′) / (|C| · |C′|), average linkage

All three linkage functions lead to a popular hierarchical clustering algorithm. However, it is known that the results from the average link and complete link algorithms depend on the ordering of points, while single link is exempt from this undesirable feature. The cause lies in the way an algorithm deals with the situation when more than two points are equally good candidates for merging next. Since we merge the data points two at a time, the merge order will determine the final structure of the dendrogram. However, it can be shown that when an ultrametric distance is used, all three linkage-based algorithms will output the same dendrogram.

Theorem 3.4. The dendrogram structure from a complete linkage or average linkage hierarchical algorithm is independent of the merge order of equally good candidates when the distance measure is an ultrametric. (The proof is in Appendix A.)

As an example, for the small tree defined in Figure 2 and the distance function defined in Equation 3, all three linkage-based algorithms produce the same dendrogram, presented in Figure 3. The dendrogram faithfully encodes all the grouping relations between leaf nodes from the original tree.

Figure 3: Dendrogram of the example tree

3.2 Combine the two distance components
To incorporate the external ontological information into the hierarchical clustering, we combine the as-defined ultrametric distance function with the problem-specific distance measure using a weighted sum of the two components. Let dP be the problem-specific distance (we normalize it so that its value is between [0, 1]), and uT be the ultrametric distance encoding the prior ontological knowledge. The new distance function to be fed into a hierarchical clustering algorithm can be constructed as follows:

d(xi, xj) = (1 − α) · dP(xi, xj) + α · uT(xi, xj)    (4)

Similar to some regularized machine learning techniques, the ultrametric distance is added as a penalty term to the original pairwise distance. When α = 0, we go back to the unregulated hierarchical clustering case, in which only the problem-specific distance is used. When α = 1, we recover the relational structure from the external source. Essentially, α · uT(xi, xj) measures the minimal effort that a hierarchical clustering algorithm needs to make in order to join xi and xj. The hyper-parameter α determines the proportion that the prior knowledge contributes to the clustering. It reflects our confidence in each component. Since the ultrametric term is added pairwise, the new distance function fits naturally with the global objective function proposed in [12]. In that context, the ultrametric term is a soft constraint added to the objective function.

The tuning of the hyper-parameter α can be achieved in different ways depending on the availability of external labels or a performance metric. Without an external gold standard, the tuning can be conducted by maximizing some internal quality measure such as the Davies-Bouldin index or the Dunn index. With the availability of external labels, the parameter α can be tuned in a cross-validation manner. Various performance measures have been proposed to evaluate clustering results given a gold standard [33]. It should be noted that some performance metrics require conversion of a dendrogram into a flat partition. In those cases, the number of clusters K is also a hyper-parameter to tune. If the dendrogram itself, not any clustering found in it, is the desired outcome, we can aggregate the performance metric across different K for a given α, and choose the dendrogram corresponding to the α with the best overall performance.
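As a concrete illustration of Equation 4 (a sketch under our own toy data, not the paper's experiments), the combined distance can be formed directly on the pairwise matrices and handed to any linkage-based algorithm; at α = 1 the three linkage choices agree, as Theorem 3.4 predicts.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# d_P: task-specific distances, u_T: tree-encoded ultrametric (Equation 3),
# both n x n, symmetric, normalized to [0, 1]. Toy values for n = 4.
d_P = np.array([[0.0, 0.6, 0.3, 0.9],
                [0.6, 0.0, 0.7, 0.5],
                [0.3, 0.7, 0.0, 0.8],
                [0.9, 0.5, 0.8, 0.0]])
u_T = np.array([[0.0, 0.5, 1.0, 1.0],
                [0.5, 0.0, 1.0, 1.0],
                [1.0, 1.0, 0.0, 0.5],
                [1.0, 1.0, 0.5, 0.0]])

def combined(alpha):
    """Equation 4: d = (1 - alpha) * d_P + alpha * u_T."""
    return (1.0 - alpha) * d_P + alpha * u_T

# Regulated dendrogram for one value of alpha, using single linkage.
Z = linkage(squareform(combined(0.5)), method="single")

# At alpha = 1 the distance is an ultrametric, so single, complete and
# average linkage produce the same merge heights (Theorem 3.4).
heights = {m: linkage(squareform(u_T), method=m)[:, 2]
           for m in ("single", "complete", "average")}
assert np.allclose(heights["single"], heights["complete"])
assert np.allclose(heights["single"], heights["average"])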

3.3 Property based approach for clustering algorithm selection
In Equation 4, the overall distance function is no longer an ultrametric if the problem-specific distance dP is not an ultrametric. To remediate the problem, one could convert the problem-specific distance function into an ultrametric distance. However, finding the closest ultrametric to noisy metric data is NP-complete, and we also need to specify a measure of distortion between the original metric and the approximated ultrametric [14]. One could instead feed the problem-specific distance into a hierarchical clustering algorithm and let the algorithm output an ultrametric for us. In fact, it is shown in [10] that single linkage hierarchical clustering produces ultrametric outputs exactly as those from the maximal sub-dominant ultrametric construction, which is a canonical construction from a metric to an ultrametric.

In addition to the above property, the single linkage algorithm also enjoys other properties that are important to applications such as taxonomy building. In [1], Ackerman shows that all linkage-based hierarchical algorithms satisfy the locality and outer consistency properties. However, it is observed that both complete linkage and average linkage are not stable under small perturbations, and are not invariant under permutation of data labels [10]. It is shown that only the single linkage algorithm is stable in the Gromov-Hausdorff sense and has nice convergence properties [15]. The Gromov-Hausdorff distance measures how far two finite spaces are from being isometric. The stability property is critical to our distance function defined in Equation 4 since we'd like a continuous map from metric spaces into dendrograms as we change the hyper-parameter α. Based on the stability property, we can define the structure resemblance discussed in the problem statement in Section 2: we'd like the dendrogram from our semi-supervised method to be similar to the dendrogram encoding prior domain knowledge, as measured by the Gromov-Hausdorff distance. It can be shown that for two dendrograms u, u′ generated from the single linkage algorithm defined on the same data set X, their Gromov-Hausdorff distance is bounded above by the L∞ norm of the difference between the two underlying metrics d, d′.

One drawback of the single linkage algorithm is that it is not sensitive to variations in the data density, which can cause a "chaining effect". However, we believe that this "chaining effect" is alleviated in our semi-supervised approach since we use a prior tree to regulate the dendrogram structure from clustering. Based on the above reasons, we choose to use the single linkage algorithm to solve our semi-supervised hierarchical clustering problem.
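The claim that single linkage outputs the maximal sub-dominant ultrametric can be checked numerically; the sketch below (ours, reading the ultrametric implied by a dendrogram off scipy's cophenet) verifies on random data that the single-link output satisfies the ultrametric inequality and never exceeds the input metric.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, cophenet

rng = np.random.default_rng(0)
points = rng.random((8, 3))                       # toy data: 8 points in R^3
d = squareform(np.linalg.norm(points[:, None] - points[None, :], axis=-1))

u = cophenet(linkage(d, method="single"))         # ultrametric output of single linkage

U = squareform(u)
n = U.shape[0]
# 1) it is an ultrametric: every triangle satisfies U[i,k] <= max(U[i,j], U[j,k])
assert all(U[i, k] <= max(U[i, j], U[j, k]) + 1e-12
           for i in range(n) for j in range(n) for k in range(n))
# 2) it is sub-dominant: it never exceeds the original metric
assert np.all(u <= d + 1e-12)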

3.4 Computational complexity
All agglomerative hierarchical clustering methods need to compute the distance between all pairs in the dataset. The complexity of this step, in general, is O(n^2), where n is the number of data points. In each of the subsequent n − 2 merging iterations, the algorithm needs to compute the distance between the most recently created cluster and all other existing clusters. Therefore, the overall complexity is O(n^3) if implemented naively. If done more cleverly, the complexity can be reduced to O(n^2 log n).

In our approach, the most computationally expensive step is the calculation of the pairwise ultrametric distance based on Equation 3, since it requires finding the lowest common ancestor of two leaf nodes within a tree. The complexity of finding the lowest common ancestor is O(h), where h is the height of the tree (the length of the longest path from a leaf to the root). In the worst case O(h) is equivalent to O(n), but if the tree is balanced, O(log n) can be achieved. It also requires O(h) space. Fast algorithms exist that can provide constant-time queries of the lowest common ancestor by first processing the tree in linear time.

For large datasets, one way to speed up the computation is to pre-cluster the data points into k clusters, either by leveraging the external ontological information (cutting the tree at high levels) or by using a partition-based clustering algorithm. Each of the k clusters is then treated separately, and the single-link hierarchical clustering algorithm is employed to build a dendrogram for each sub-cluster. Finally, the k dendrograms are combined into one dendrogram by applying the single-link algorithm, which treats each of the k dendrograms as an internal node. The overall complexity in this case is O(k(n/k)^2 log(n/k) + k^2 log k). For reasonably large k, the computation time can be greatly reduced. Within each sub-cluster, the search for the optimal α is conducted in a cross-validation manner by evaluating a task-specific metric. The full algorithm, including hyper-parameter tuning, is presented in Algorithm 1.

ALGORITHM 1: Semi-supervised hierarchical clustering
Input: dataset X = {x1, x2, ..., xn}, external tree structure T defined on X, task-specific performance metric µ
Output: dendrogram θ : [0, ∞) → ℘(X) that performs best in terms of µ
Pre-partition X into k sub-clusters;
for each sub-cluster do
    Calculate task-specific pairwise distance dP(xi, xj);
    Calculate ultrametric distance uT(xi, xj) based on tree T;
    for each (α, K) on the search grid do
        Build dendrogram using Single-Link;
        Convert the dendrogram into K flat partitions;
        Evaluate performance metric µ;
    end
    Find optimal α for each sub-cluster by aggregating µ across different K;
end
Combine the k sub-clusters into one dendrogram by Single-Link

4 CASE STUDY: A CUSTOMER BEHAVIOR BASED PRODUCT TAXONOMY
In this section, we apply the proposed method to the construction of a customer behavior based product taxonomy for an Amazon service. The goal here is to build a taxonomy that captures substitution effects among different products and product groups.

To achieve this goal, we could define a dissimilarity measure between products based on a customer behavior metric, and group products using a hierarchical clustering algorithm. However, due to the huge size of the Amazon selection and customer base, customer behavior data is usually sparse and noisy. Furthermore, for taxonomy building purposes, we'd like the grouping to be consistent across all levels, and the resulting hierarchy to be logical as perceived by a human reader. As discussed in the introduction, clustering with only a dissimilarity measure is an ill-posed problem: it is hard for a clustering algorithm to recover data partitions that satisfy the various criteria of a concrete task. On the other hand, human-designed taxonomies usually perform well in terms of consistency and human readability. In this work, we employ a semi-supervised approach to the building of a product taxonomy, leveraging the ontological information from the existing Amazon-wide browse hierarchy.

4.1 Amazon browse hierarchy
Amazon Browse enables customers' discovery experience by organizing Amazon's product selection into a discovery taxonomy. The browse hierarchy is loaded every time a customer visits the Amazon website. The leaf nodes of the Amazon browse hierarchy represent groups of products of the same type, such as coffee-mug, dvd-player, etc. The internal nodes represent higher levels of product groupings.

While being important in influencing customer searches, Amazon browse trees are not built to reflect program-specific product substitution effects. They determine what customers see, but not the decisions customers make after seeing the search results.

Due to the huge size of the Amazon browse hierarchy, we pre-clustered the data into segments as in Algorithm 1. Pairwise distances between leaf nodes are then calculated based on Equation 3 to incorporate the ontological structure of the browse hierarchy.

4.2 Customer behavior based dissimilarity measure
To construct a customer behavior based dissimilarity measure between leaf nodes, we first use Latent Dirichlet Allocation (LDA) [8, 9] to obtain an embedding for each leaf node based on customers' click, cart-add and purchase actions for the Amazon service. To apply LDA to customer searches, we treat each search keyword as a document, and each leaf node as a word in the vocabulary. Each element in the document-word matrix stores the frequency of certain customer actions, such as clicks, cart-adds, and purchases, performed on a particular leaf node within the context of a customer search for a given keyword. Provided with the number of topics, LDA outputs the probability of each word appearing in each topic. We use the vector of topic probabilities for each leaf node as its embedding. LDA is essentially used here as a method similar to matrix factorization.

We then calculate the cosine dissimilarity between pairs of leaf nodes using the embeddings. Since each element in the embedding is a probability, a positive number, the cosine dissimilarity is between 0 and 1. The cosine dissimilarity between two leaf nodes xi and xj is calculated as:

dcosine(xi, xj) = 1 − xi · xj / (∥xi∥2 · ∥xj∥2)    (5)
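One way to sketch this embedding step with scikit-learn is shown below (our illustration; the keyword-by-leaf-node count matrix `actions`, its dimensions, and the normalization used to read topic probabilities off the fitted model are all assumptions). The resulting topic vectors feed directly into the cosine dissimilarity of Equation 5.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_distances

# actions[k, w]: count of clicks / cart-adds / purchases on leaf node w within
# searches for keyword k (keywords act as documents, leaf nodes as vocabulary
# words). Random toy counts stand in for the real data.
rng = np.random.default_rng(0)
actions = rng.poisson(1.0, size=(500, 40))        # 500 keywords x 40 leaf nodes

lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(actions)

# lda.components_[t, w] is proportional to the topic-word weight; normalizing
# each column over topics gives every leaf node a vector that sums to one
# across topics, used here as its embedding.
word_topic = lda.components_ / lda.components_.sum(axis=0, keepdims=True)
embeddings = word_topic.T                          # 40 leaf nodes x 10 topics

# Equation 5: cosine dissimilarity between leaf nodes, guaranteed in [0, 1]
# because the embeddings are non-negative.
d_P = cosine_distances(embeddings)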
4.3 Hyper-parameter tuning by maximizing the performance of substitution groups
Given the problem-specific distance measure and the ultrametric distance encoding the Amazon browse node hierarchy, we can combine the two components to form the new distance measure in our semi-supervised hierarchical clustering problem. As discussed in Section 3, the weighting parameter α can be tuned in a cross-validation manner by optimizing a task-specific performance metric.

For evaluation, we optimize the performance of using the resulting clusters as substitution groups, within which products are substitutable with each other. It is reasonable to assume that customers who search for the same keyword share a similar type of demand. If all the customers who search for the same keyword end up purchasing items from the same substitution group, then our definition of the substitution group captures all the substitution effect for that demand. If customers who search for the same keyword end up purchasing items from many different substitution groups, then our grouping of products does a poor job of capturing product substitution. Based on the above rationale, we define three metrics to capture the substitution performance. 1. "Purity", defined as the average percentage of customer purchases falling within the top substitution group for each search keyword. 2. "Entropy": for each search keyword, there is a categorical distribution of customer purchases over the different substitution groups; the average entropy of this categorical distribution across search keywords defines the entropy metric. 3. "Weighted entropy": this metric is similar to the Entropy metric except that each keyword is weighted by its number of customer purchases. For the Purity metric, high values are preferred. For the Entropy metrics, low values are better.
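The three metrics are straightforward to compute from a keyword-by-group purchase count matrix; the sketch below is our reading of the definitions above, with a hypothetical `purchases[k, g]` array holding the number of purchases from substitution group g made under keyword k.

import numpy as np

def substitution_metrics(purchases):
    """purchases: (num_keywords, num_groups) array of purchase counts.
    Returns (purity, entropy, weighted_entropy) as defined above."""
    totals = purchases.sum(axis=1)                    # purchases per keyword
    keep = totals > 0
    p = purchases[keep] / totals[keep, None]          # per-keyword categorical distribution

    purity = np.mean(p.max(axis=1))                   # avg share of the top group

    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    ent = -plogp.sum(axis=1)                          # entropy per keyword
    entropy = ent.mean()                              # unweighted average
    weighted_entropy = np.average(ent, weights=totals[keep])
    return purity, entropy, weighted_entropy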

Based on the performance metrics, our experiment was conducted as follows. A full month of customers' search data was used as the training data to obtain the LDA embedding for the leaf nodes. A grid search over the hyper-parameter α and the number of clusters K was conducted using cross-validation on the data from the first half of the following month. Figure 4 shows the cross-validation result as a heat map of the normalized purity metric. The lighter the color, the higher the purity. Due to the discrete nature of the tree structure, certain numbers of flat clusters cannot be formed from the dendrograms; those cases are shown as black squares in the heatmap. As we can see from the figure, our semi-supervised approach achieves consistently better performance than both the pure customer behavior based dissimilarity (α = 0) and the pure browse taxonomy (α = 1). Similar trends can be observed for the entropy-based metrics (not shown in this paper). It can also be noted from the figure that the taxonomy based purely on the browse structure is not flexible in terms of the number of clusters. By mixing the two distance components, we can create hierarchies of leaf nodes at different levels of granularity. Based on the cross-validation result, we select the best α and test it on the data from the second half of the month. The test result is presented in Table 1. To facilitate the comparison with the pure browse node based taxonomy, we choose cluster numbers of 46 and 69 for testing. As we can see from the table, the semi-supervised approach performs best during the testing period across all three metrics (highest in Purity, lowest in the Entropy metrics).

Table 1: Testing results on substitution performance (Purity values are normalized against the best performance. Entropy values are normalized against the worst performance.)

Clusters | α    | Purity | Entropy | Weighted Entropy
46       | 0.0  | 0.93   | 1.0     | 1.0
46       | 0.85 | 1.0    | 0.68    | 0.72
46       | 1.0  | 0.96   | 0.72    | 0.80
69       | 0.0  | 0.92   | 1.0     | 1.0
69       | 0.7  | 1.0    | 0.69    | 0.71
69       | 1.0  | 0.96   | 0.77    | 0.79

Figure 5 presents the evolution of the dendrogram structure for the "Coffee, Tea and Cocoa" segment. As we decrease α (increase mixing), one can observe mixing of coffee and tea at the lower levels of the dendrograms, which reflects a notion of substitution between the two product groups. In another example, Figure 6 presents the dendrogram evolution for the Beans, Grains and Rice segment. In that case, we can observe a finer grouping of products within either the rice group or the beans group as we decrease α. However, products from the different groups don't mix, which means the substitution effect is not as significant as that between the Coffee and Tea products.

Figure 4: Heatmap of cross-validation result for normalized purity metric. The lighter the color, the higher the purity. Due to the discrete nature of the tree structure, certain numbers of clusters are not selectable, shown as black blocks in the heatmap.

Figure 5: Structure evolution of dendrograms for Coffee, Tea and Cocoa segment

Figure 6: Structure evolution of dendrograms for Beans, Grains and Rice segment

5 CONCLUSION
Hierarchical clustering is a prominent class of clustering algorithms. It has been the dominant approach to constructing embedded classification schemes. In this paper, we propose a novel method of incorporating prior domain knowledge about entity relations into hierarchical clustering. By encoding the prior relational information using an ultrametric distance function, we have shown that the popular linkage-based hierarchical clustering algorithms can faithfully recover the prior relational structure between entities. We construct the semi-supervised clustering problem by applying the ultrametric distance as a penalty term to the original task-specific distance measure. We choose to use the single link algorithm to solve the problem due to its favorable stability and convergence properties. As an example, we apply the proposed method to the construction of a customer behavior based product taxonomy for an Amazon service, leveraging an Amazon-wide browse structure. Our experimental results show that the semi-supervised approach achieves better performance than both the clustering based purely on the task-specific distance and the clustering based purely on the external ontological structure.

A COMPLETE LINKAGE HIERARCHICAL CLUSTERING WITH ULTRAMETRIC DISTANCE
It is known that in a metric space, when there are two or more equally good candidates for merging at a certain step, the results from complete link hierarchical clustering algorithms depend on the ordering of merging. In this section, we show that if the distance function is an ultrametric, the dendrogram structure from complete linkage does not depend on the merging order.

Proof. We first show that under the complete link and ultrametric assumptions, the ultrametric condition also holds among clusters. Let a, b, c represent three disjoint clusters (which can be singletons); we want to show D(c, a) ≤ max(D(a, b), D(b, c)).

Under complete linkage, without loss of generality, we assume x1, x6 ∈ a, x2, x3 ∈ b, x4, x5 ∈ c, and

D(a, b) = max_{x∈a, x′∈b} u(x, x′) = u(x1, x2)
D(b, c) = max_{x∈b, x′∈c} u(x, x′) = u(x3, x4)    (6)
D(c, a) = max_{x∈c, x′∈a} u(x, x′) = u(x5, x6)

Then we have

max(D(a, b), D(b, c)) = max(u(x1, x2), u(x3, x4))
                      ≥ max(u(x6, x2), u(x2, x5))
                      ≥ u(x5, x6)    (7)
                      = D(c, a)

The first inequality holds because u(x1, x2) and u(x3, x4) are the maxima over a × b and b × c, and the second follows from the ultrametric inequality applied to the triple x6, x2, x5.

We now show that for any disjoint clusters a, b, c, d, if at a certain stage D(a, b) = D(b, c) are smaller than the other cluster-cluster distances, so that (a, b) and (b, c) are equally good candidates to merge next, then regardless of the merging order between (a, b) and (b, c), cluster d will always merge last.

In fact, due to the ultrametric condition, if D(a, b) = D(b, c), then max(D(a, b), D(b, c)) = D(a, b) = D(b, c) ≥ D(a, c). It means (a, c) will merge before (a, b) or (b, c). Then there is no ambiguity about the merging order: (a, c) merges first, then (ac, b), and d will always merge last into the cluster. □

In a similar manner, we can show the same result for average link hierarchical clustering with an ultrametric distance.

REFERENCES
[1] Margareta Ackerman and Shai Ben-David. 2016. A Characterization of Linkage-Based Hierarchical Clustering. Journal of Machine Learning Research 17 (2016), 1–17.
[2] Margareta Ackerman, Shai Ben-David, and David Loker. 2010. Towards Property-Based Classification of Clustering Paradigms. NIPS 2010 (2010), 1–9. https://papers.nips.cc/paper/4101-towards-property-based-classification-of-clustering-paradigms.pdf
[3] Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. arXiv preprint cmp-lg/9511007. https://arxiv.org/abs/cmp-lg/9511007
[4] Korinna Bade and Andreas Nürnberger. 2007. Personalized hierarchical clustering. In Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), 181–187. https://doi.org/10.1109/WI.2006.131
[5] Eric Bair. 2013. Semi-supervised clustering methods. (2013), 1–28. https://doi.org/10.1002/wics.1270 arXiv:1307.0252
[6] Sugato Basu, Ian Davidson, and Kiri Wagstaff. 2008. Constrained Clustering: Advances in Algorithms, Theory, and Applications, 1st edition. Vol. 45. 961–970 pages. https://doi.org/10.1007/BF02884971
[7] Shai Ben-David and Margareta Ackerman. 2009. Measures of Clustering Quality: A Working Set of Axioms for Clustering. Advances in Neural Information Processing Systems 21 (2009), 121–128. http://books.nips.cc/nips21.html
[8] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[9] David Blei, Lawrence Carin, and David Dunson. 2010. Probabilistic Topic Models. IEEE Signal Processing Magazine 27, 6 (2010), 55–65. https://doi.org/10.1109/MSP.2010.938079 arXiv:1003.4916
[10] G. Carlsson and F. Mémoli. 2010. Characterization, Stability and Convergence of Hierarchical Clustering Methods. Journal of Machine Learning Research 11 (2010), 1425–1470.
[11] Gunnar Carlsson and Facundo Mémoli. 2013. Classifying Clustering Schemes. Foundations of Computational Mathematics 13, 2 (2013), 221–252. https://doi.org/10.1007/s10208-012-9141-9 arXiv:1011.5270
[12] Sanjoy Dasgupta. 2016. A cost function for similarity-based hierarchical clustering. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC 2016). https://doi.org/10.1145/2897518.2897527 arXiv:1510.05043
[13] Ian Davidson and S. S. Ravi. 2005. Agglomerative Hierarchical Clustering with Constraints: Theory and Empirical Results. In 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), 59–70. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.2314&rep=rep1&type=pdf
[14] Marco Di Summa, David Pritchard, and Laura Sanità. 2015. Finding the closest ultrametric. Discrete Applied Mathematics 180 (2015), 70–80. https://doi.org/10.1016/j.dam.2014.07.023
[15] Justin Eldridge, Mikhail Belkin, and Yusu Wang. 2015. Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering. In Proceedings of the 28th Conference on Learning Theory (COLT 2015). arXiv:1506.06422 http://arxiv.org/abs/1506.06422
[16] J. A. Hartigan. 1985. Statistical theory in clustering. Journal of Classification 2, 1 (1985), 63–76. https://doi.org/10.1007/BF01908064
[17] Katherine A. Heller and Zoubin Ghahramani. 2005. Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning (2005), 297–304. https://doi.org/10.1145/1102351.1102389
[18] Weiyu Huang and Alejandro Ribeiro. 2016. Hierarchical Clustering Given Confidence Intervals of Metric Distances. (2016), 1–13. arXiv:1610.04274 http://arxiv.org/abs/1610.04274
[19] Anil K. Jain. 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31, 8 (2010), 651–666. https://doi.org/10.1016/j.patrec.2009.09.011
[20] Anil K. Jain and Richard C. Dubes. 1988. Algorithms for Clustering Data. Prentice Hall. 320 pages. https://dl.acm.org/citation.cfm?id=46712
[21] Nicholas Jardine and Robin Sibson. 1971. Mathematical Taxonomy. Wiley. 286 pages. https://books.google.com/books/about/Mathematical_Taxonomy.html?id=ka4KAQAAIAAJ
[22] San Jose. 2000. Model-Based Hierarchical Clustering. (2000), 599–608.
[23] Jon Kleinberg. 2002. An impossibility theorem for clustering. Advances in Neural Information Processing Systems (2002), 446–453.
[24] Yi Liu, Rong Jin, and Anil K. Jain. 2007. BoostCluster: boosting clustering by pairwise constraints. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007), 450–459. https://doi.org/10.1145/1281192.1281242
[25] Marina Meilă. 2005. Comparing clusterings. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), 577–584. https://doi.org/10.1145/1102351.1102424
[26] Benjamin Moseley and Joshua R. Wang. 2017. Approximation Bounds for Hierarchical Clustering: Average Linkage, Bisecting K-means, and Local Search. NIPS (2017).
[27] Fionn Murtagh and Pedro Contreras. 2012. Algorithms for hierarchical clustering: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 1 (2012), 86–97. https://doi.org/10.1002/widm.53 arXiv:1105.0121
[28] Aurko Roy and Sebastian Pokutta. 2016. Hierarchical Clustering via Spreading Metrics. (2016), 1–35. arXiv:1610.09269 http://arxiv.org/abs/1610.09269
[29] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl. 2001. Constrained K-means Clustering with Background Knowledge. In Proceedings of the International Conference on Machine Learning (2001), 577–584.
[30] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. 2002. Distance metric learning, with application to clustering with side-information. Advances in Neural Information Processing Systems (2002).
[31] R. B. Zadeh and S. Ben-David. 2009. A uniqueness theorem for clustering. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (2009).
[32] Haifeng Zhao and Zi Jie Qi. 2010. Hierarchical agglomerative clustering with ordering constraints. In 3rd International Conference on Knowledge Discovery and Data Mining (WKDD 2010), 195–199. https://doi.org/10.1109/WKDD.2010.123
[33] Ying Zhao and George Karypis. 2002. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the Eleventh International Conference on Information and Knowledge Management (2002). https://dl.acm.org/citation.cfm?id=584877
[34] Li Zheng and Tao Li. 2011. Semi-supervised hierarchical clustering. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2011), 982–991. https://doi.org/10.1109/ICDM.2011.130