Coupled Attribute Similarity Learning on Categorical Data Can Wang, Xiangjun Dong, Fei Zhou, Longbing Cao, Senior Member, IEEE, and Chi-Hung Chi

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 4, APRIL 2015 781 Coupled Attribute Similarity Learning on Categorical Data Can Wang, Xiangjun Dong, Fei Zhou, Longbing Cao, Senior Member, IEEE, and Chi-Hung Chi Abstract— Attribute independence has been taken as a major Index Terms— Clustering, coupled attribute similarity, coupled assumption in the limited research that has been conducted on object analysis, similarity analysis, unsupervised learning. similarity analysis for categorical data, especially unsupervised learning. However, in real-world data sources, attributes are more or less associated with each other in terms of certain I. INTRODUCTION coupling relationships. Accordingly, recent works on attribute IMILARITY analysis has been a problem of great dependency aggregation have introduced the co-occurrence of attribute values to explore attribute coupling, but they only Spractical importance in several domains for decades, not present a local picture in analyzing categorical data similarity. least in recent work, including behavior analysis [1], document This is inadequate for deep analysis, and the computational analysis [2], and image analysis [3]. A typical aspect of these complexity grows exponentially when the data scale increases. applications is clustering, in which the similarity is usually This paper proposes an efficient data-driven similarity learning defined in terms of one of the following levels: 1) between approach that generates a coupled attribute similarity measure for nominal objects with attribute couplings to capture a global clusters; 2) between attributes; 3) between data objects; or picture of attribute similarity. It involves the frequency-based 4) between attribute values. The similarity between clusters is intra-coupled similarity within an attribute and the inter-coupled often built on top of the similarity between data objects, e.g., similarity upon value co-occurrences between attributes, as well centroid similarity. Further, the similarity between data objects as their integration on the object level. In particular, four is generally derived from the similarity between attribute measures are designed for the inter-coupled similarity to calculate the similarity between two categorical values by considering values, e.g., Euclidean distance and simple matching similarity their relationships with other attributes in terms of power set, (SMS) [4]. The similarity between attribute values assesses the universal set, joint set, and intersection set. The theoretical relationship between two data objects and even between two analysis reveals the equivalent accuracy and superior efficiency clusters. The more two objects or clusters resemble each other, of the measure based on the intersection set, particularly for the larger is the similarity [5]. The other similarity between large-scale data sets. Intensive experiments of data structure and clustering algorithms incorporating the coupled dissimilarity attributes [6] can also be converted into the difference of metric achieve a significant performance improvement on state- similarities between pairwise attribute values [7]. Therefore, of-the-art measures and algorithms on 13 UCI data sets, which the similarity between attribute values plays a fundamental is confirmed by the statistical analysis. The experiment results role in similarity analysis. show that the proposed coupled attribute similarity is generic, The similarity measures for attribute values are sensitive and can effectively and efficiently capture the intrinsic and global interactions within and between attributes for especially to the attribute types, which are classified as discrete and large-scale categorical data sets. In addition, two new coupled continuous. The discrete attribute is further typed as nominal categorical clustering algorithms, i.e., CROCK and CLIMBO are (categorical) or binary [5]. The nominal data, a special case of proposed, and they both outperform the original ones in terms the discrete type, has only a finite number of values, while the of clustering quality on UCI data sets and bibliographic data. binary variable has exactly two values. In this paper, we regard Manuscript received April 6, 2013; revised April 23, 2014 and May 5, the binary data as a special case of the nominal data. 2014; accepted May 11, 2014. Date of publication June 13, 2014; date of Compared with the intensive study on the similarity between current version March 16, 2015. This work was supported in part by the two numerical variables, such as Euclidean and Minkowski National Privacy Principles, Tasmania, Australia, in part by the Australian Research Council Discovery under Grant DP1096218, in part by the Aus- distance, and between two categorical values in supervised tralian Research Council Linkage under Grant LP100200774, in part by the learning, e.g., heterogeneous distance functions [8] and mod- National Natural Science Foundation of China under Grant 71271125, in part ified value distance matrix (MVDM) [9], the similarity for by the Natural Science Foundation of China under Grant 61301183, and in part by the China Post-Doctoral Science Foundation under Grant 2013M540947. nominal variables has received much less attention in unsuper- C. Wang and C.-H. Chi are with the Commonwealth Scientific and vised learning on unlabeled data. Only limited efforts [5] have Industrial Research Organisation, Sandy Bay, TAS 7005, Australia (e-mail: been made, including SMS, which uses 0s and 1s to distinguish [email protected]; [email protected]). X. Dong is with the School of Information, Qilu University of Technology, the similarity between distinct and identical categorical values, Ji’nan 250353, China (e-mail: [email protected]). occurrence frequency (OF) [10] and information-theoretical F. Zhou is with the Department of Electronic Engineering, Graduate similarity (Lin) [10], [11], to discuss the similarity between School at Shenzhen, Tsinghua University, Shenzhen 518055, China (e-mail: fl[email protected]). nominal values. The challenge is that these methods are L. Cao is with the Advanced Analytics Institute, University of Technology too rough to precisely characterize the similarity between at Sydney, Ultimo, NSW 2008, Australia (e-mail: [email protected]). categorical attribute values, and they only deliver a local Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. picture of the similarity and are not data-driven. In addition, Digital Object Identifier 10.1109/TNNLS.2014.2325872 none of them provides a comprehensive picture of similarity 2162-237X © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. 782 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 4, APRIL 2015 TABLE I as their global aggregation in unsupervised learning on INSTANCE OF THE MOVIE DATABASE nominal data. The key contributions are as follows. 1) We propose a coupled attribute similarity for objects (CASO) measure based on the coupled attribute similarity for values (CASV), by considering both the intra- coupled and inter-coupled attribute value similarities (IaASV and IeASV), which globally capture the attribute value frequency distribution and attribute dependency aggregation with high accuracy and relatively low complexity. between categorical attributes by combining relevant aspects. 2) We compare the accuracy and efficiency of the four Below, we illustrate the problem with SMS and the challenge proposed measures for IeASV in terms of four relation- of analyzing the categorical data similarity. ships: power set; universal set; joint set; and intersection As shown in Table I, six movie objects are divided into two set; and obtain the most efficient candidate based on classes with three nominal attributes: 1) director; 2) actor; and the intersection set (i.e., IRSI) from theoretical and 3) genre. The SMS measure between directors Scorsese and experimental aspects. Coppola is 0, but Scorsese and Coppola are very similar.1 3) A method is proposed to flexibly define the dissimilarity Another observation by following SMS is that the similarity metrics with the proposed similarity building blocks between Koster and Hitchcock is equal to that between Koster according to specific requirements. and Coppola; however, the similarity of the former pair should 4) The proposed measures are compared with the state-of- be greater because both directors belong to the same class l2. the-art metrics on various benchmark data sets in terms The above examples show that it is much more complex to of the internal and external clustering criteria. All the analyze the similarity between nominal variables than between results are statistically significant. continuous data. The SMS and its variants fail to capture a 5) We propose two new coupled categorical clustering global picture of the genuine relationship for nominal data. algorithms: CROCK and CLIMBO. With the exponential increase of categorical data, such as This paper is organized as follows. In Section II, we that derived from social networks, it is important to develop briefly review the related work. Preliminary definitions are effective and efficient measures for capturing the similarity specified in Section III. Section IV proposes the framework between nominal variables. of the coupled attribute similarity analysis. Section V defines

Coupled Attribute Similarity Learning on Categorical Data Can Wang, Xiangjun Dong, Fei Zhou, Longbing Cao, Senior Member, IEEE, and Chi-Hung Chi

Deep Similarity Learning for Sports Team Ranking

RELIEF Algorithm and Similarity Learning for K-NN

Deep Metric Learning: a Survey

Cross-Domain Visual Matching Via Generalized Similarity Measure and Feature Learning

Low-Rank Similarity Metric Learning in High Dimensions

Learning Mahalanobis Distance Metric: Considering Instance Disturbance Helps∗

Unsupervised Domain Adaptation with Similarity Learning

Algorithms for Similarity Relation Learning from High Dimensional Data Phd Dissertation

Supervised Learning with Similarity Functions

Unsupervised Graph-Based Similarity Learning Using Heterogeneous Features

Learning Similarities for Linear Classification: Theoretical

Arsknn-A K-NN Classifier Using Mass Based Similarity Measure