Coupled Attribute Similarity Learning on Categorical Data

Can Wang, Xiangjun Dong, Fei Zhou, Longbing Cao, Senior Member, IEEE, and Chi-Hung Chi

Abstract— Attribute independence has been taken as a major assumption in the limited research that has been conducted on similarity analysis for categorical data, especially unsupervised learning. However, in real-world data sources, attributes are more or less associated with each other in terms of certain coupling relationships. Accordingly, recent works on attribute dependency aggregation have introduced the co-occurrence of attribute values to explore attribute coupling, but they only present a local picture in analyzing categorical data similarity. This is inadequate for deep analysis, and the computational complexity grows exponentially when the data scale increases. This paper proposes an efficient data-driven similarity learning approach that generates a coupled attribute similarity measure for nominal objects with attribute couplings to capture a global picture of attribute similarity. It involves the frequency-based intra-coupled similarity within an attribute and the inter-coupled similarity upon value co-occurrences between attributes, as well as their integration on the object level. In particular, four measures are designed for the inter-coupled similarity to calculate the similarity between two categorical values by considering their relationships with other attributes in terms of the power set, universal set, joint set, and intersection set. The theoretical analysis reveals the equivalent accuracy and superior efficiency of the measure based on the intersection set, particularly for large-scale data sets. Intensive experiments of data structure and clustering algorithms incorporating the coupled dissimilarity achieve a significant performance improvement on state-of-the-art measures and algorithms on 13 UCI data sets, which is confirmed by the statistical analysis. The experiment results show that the proposed coupled attribute similarity is generic, and can effectively and efficiently capture the intrinsic and global interactions within and between attributes, especially for large-scale categorical data sets. In addition, two new coupled categorical clustering algorithms, i.e., CROCK and CLIMBO, are proposed, and they both outperform the original ones in terms of clustering quality on UCI data sets and bibliographic data.

Index Terms— Clustering, coupled attribute similarity, coupled object analysis, similarity analysis.

Manuscript received April 6, 2013; revised April 23, 2014 and May 5, 2014; accepted May 11, 2014. Date of publication June 13, 2014; date of current version March 16, 2015. This work was supported in part by the National Privacy Principles, Tasmania, Australia, in part by the Australian Research Council Discovery under Grant DP1096218, in part by the Australian Research Council Linkage under Grant LP100200774, in part by the National Natural Science Foundation of China under Grant 71271125, in part by the Natural Science Foundation of China under Grant 61301183, and in part by the China Post-Doctoral Science Foundation under Grant 2013M540947.

C. Wang and C.-H. Chi are with the Commonwealth Scientific and Industrial Research Organisation, Sandy Bay, TAS 7005, Australia (e-mail: [email protected]; [email protected]). X. Dong is with the School of Information, Qilu University of Technology, Ji'nan 250353, China (e-mail: [email protected]). F. Zhou is with the Department of Electronic Engineering, Graduate School at Shenzhen, Tsinghua University, Shenzhen 518055, China (e-mail: fl[email protected]). L. Cao is with the Advanced Analytics Institute, University of Technology at Sydney, Ultimo, NSW 2008, Australia (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2325872

I. INTRODUCTION

Similarity analysis has been a problem of great practical importance in several domains for decades, not least in recent work, including behavior analysis [1], document analysis [2], and image analysis [3]. A typical aspect of these applications is clustering, in which the similarity is usually defined in terms of one of the following levels: 1) between clusters; 2) between attributes; 3) between data objects; or 4) between attribute values. The similarity between clusters is often built on top of the similarity between data objects, e.g., centroid similarity. Further, the similarity between data objects is generally derived from the similarity between attribute values, e.g., Euclidean distance and simple matching similarity (SMS) [4]. The similarity between attribute values assesses the relationship between two data objects and even between two clusters. The more two objects or clusters resemble each other, the larger is the similarity [5]. The similarity between attributes [6] can also be converted into the difference of similarities between pairwise attribute values [7]. Therefore, the similarity between attribute values plays a fundamental role in similarity analysis.

The similarity measures for attribute values are sensitive to the attribute types, which are classified as discrete and continuous. The discrete attribute is further typed as nominal (categorical) or binary [5]. The nominal data, a special case of the discrete type, has only a finite number of values, while the binary variable has exactly two values. In this paper, we regard the binary data as a special case of the nominal data.

Compared with the intensive study on the similarity between two numerical variables, such as the Euclidean and Minkowski distances, and between two categorical values in supervised learning, e.g., heterogeneous distance functions [8] and the modified value distance matrix (MVDM) [9], the similarity for nominal variables has received much less attention in unsupervised learning on unlabeled data. Only limited efforts [5] have been made, including SMS, which uses 0s and 1s to distinguish the similarity between distinct and identical categorical values, and occurrence frequency (OF) [10] and information-theoretical similarity (Lin) [10], [11], to discuss the similarity between nominal values. The challenge is that these methods are too rough to precisely characterize the similarity between categorical attribute values; they only deliver a local picture of the similarity and are not data-driven. In addition, none of them provides a comprehensive picture of similarity between categorical attributes by combining relevant aspects. Below, we illustrate the problem with SMS and the challenge of analyzing the categorical data similarity.


TABLE I
INSTANCE OF THE MOVIE DATABASE

As shown in Table I, six movie objects are divided into two classes with three nominal attributes: 1) director; 2) actor; and 3) genre. The SMS measure between directors Scorsese and Coppola is 0, but Scorsese and Coppola are very similar.¹ Another observation by following SMS is that the similarity between Koster and Hitchcock is equal to that between Koster and Coppola; however, the similarity of the former pair should be greater because both directors belong to the same class l2. The above examples show that it is much more complex to analyze the similarity between nominal variables than between continuous data. The SMS and its variants fail to capture a global picture of the genuine relationship for nominal data. With the exponential increase of categorical data, such as that derived from social networks, it is important to develop effective and efficient measures for capturing the similarity between nominal variables.

¹A conclusion drawn from a well-informed cinematic source.

The similarity between categorical values is sensitive to the data characteristics. In general, two attribute values are expected to be similar if they present analogous frequency distributions within one attribute (e.g., OF and Lin) [10], [11]; this reflects the intra-coupled similarity within attributes. For example, two directors are very similar if they appear with almost the same frequency, such as Scorsese with Coppola and Koster with Hitchcock. However, the reality is that the former director pair is more similar than the latter. Ahmad and Dey [12] introduced the co-occurrence probability of categorical values from different attributes and compared this probability for two categorical values from the same attribute. This means that the similarity between directors relates to the dependency of director on other attributes, such as actor and genre over all the movie objects, namely, the inter-coupled similarity between attributes. These two approaches capture local pictures of the similarity from different perspectives. No work has been reported on systematically considering both intra-coupled similarity and inter-coupled similarity. The incomplete description of the categorical value similarity leads to tentative and less effective learning performance. In addition, it is usually very costly to consider the similarity between values in relation to the dependency between attributes and the aggregation of such dependency [12], which is verified in Section VI.

In this paper, we explicitly discuss the data-driven intra-coupled similarity and inter-coupled similarity, as well as their global aggregation in unsupervised learning on nominal data. The key contributions are as follows.

1) We propose a coupled attribute similarity for objects (CASO) measure based on the coupled attribute similarity for values (CASV), by considering both the intra-coupled and inter-coupled attribute value similarities (IaASV and IeASV), which globally capture the attribute value frequency distribution and attribute dependency aggregation with high accuracy and relatively low complexity.

2) We compare the accuracy and efficiency of the four proposed measures for IeASV in terms of four relationships: power set, universal set, joint set, and intersection set, and obtain the most efficient candidate based on the intersection set (i.e., IRSI) from theoretical and experimental aspects.

3) A method is proposed to flexibly define the dissimilarity metrics with the proposed similarity building blocks according to specific requirements.

4) The proposed measures are compared with the state-of-the-art metrics on various benchmark data sets in terms of the internal and external clustering criteria. All the results are statistically significant.

5) We propose two new coupled categorical clustering algorithms: CROCK and CLIMBO.

This paper is organized as follows. In Section II, we briefly review the related work. Preliminary definitions are specified in Section III. Section IV proposes the framework of the coupled attribute similarity analysis. Section V defines the intra-coupled similarity, inter-coupled similarity, and their aggregation. The theoretical analysis is given in Section VI. We describe the CASO algorithm in Section VII. The effectiveness of CASO is empirically studied in Section VIII, where two new categorical clustering methods (CROCK and CLIMBO) are introduced, and a flexible method to define dissimilarity metrics is also developed. Section IX discusses the coupled nominal similarity with open issues. Finally, we conclude this paper in Section X.

II. RELATED WORK

Some surveys, in particular [5] and [10], discuss the similarity between categorical attributes. The usual practice is to binarize the data and use binary similarity measures rather than directly considering nominal data. Cost and Salzberg [9] proposed MVDM based on labels, Wilson and Martinez [8] performed a detailed study of heterogeneous distance functions for instance-based learning, and Figueiredo et al. [2] introduced word co-occurrence features for text classification. Unlike our focus, their similarities are only designed for supervised approaches.

A. Nominal Similarity in Unsupervised Learning

There are a number of existing data mining techniques for the unsupervised learning of nominal data [10], [12]. Well-known metrics include SMS [4] and its diverse variants, such as Jaccard coefficients [13], which are all intuitively based on the principle that the similarity measure is 1 for identical values and 0 otherwise, and which are not data driven. More recently, the frequency distribution of attribute values has been considered for similarity measures [10], such as OF and Lin. Similarity computation has also been incorporated into the learning algorithm without explicitly defining general measures [14]. Neighborhood-based similarity [15], [16] was also explored to measure the proximity of objects using functions that operate on the intersection of two neighborhoods. These methods present the similarity between a pair of objects by considering only the relationships among data objects, which are built on the similarity between attribute values simply quantified by the variants of SMS. However, the couplings between attributes involve the similarity both between attribute values and between data objects. Such couplings are catered for in our proposed similarity measure between attribute values, which is incorporated with the neighborhood-based similarity between data objects to more precisely describe the neighborhood of an object. It represents the neighborhood-based metric as a meta-similarity measure [10] in terms of both the couplings between attributes and between objects.

All the above methods are attribute-independent, since similarity is calculated separately for two categorical values of individual attributes. However, an increasing number of researchers argue that the attribute value similarity is also dependent on the couplings of other attributes [1], [10]. The Pearson correlation coefficient [15] measures only the strength of linear dependence between two numerical variables. Das and Mannila [6] put forward the iterated contextual distances algorithm, believing that the attribute, object, and subrelation similarities are inter-dependent. They convert each object with binary attribute values to a continuous vector by a kernel smoothing function, and define the similarity between objects as the Manhattan distance between continuous vectors [6]. By contrast, we directly consider similarity for categorical values to maintain the least information loss. Andritsos et al. [17] introduced a context-sensitive dissimilarity measure between attribute values based on the Jensen-Shannon divergence. Similarly, Ahmad and Dey [12] proposed an algorithm ADD to compute the dissimilarity between attribute values by considering the co-occurrence probability between each attribute value and the values of another attribute. Though the dissimilarity metric leads to high accuracy, the computation is usually very costly [12], which limits its application in large-scale problems. In addition, Ahmad and Dey's [12] approaches only focus on the interactions among different attributes, whereas our proposed measure also considers the couplings within each attribute globally.

B. Categorical Clustering

Clustering algorithms [16], including partition-based methods, such as k-means, and hierarchy-based methods, like divisive approaches [5], are more suitable for clustering data with numerical attributes than categorical data. Clustering of categorical data (categorical clustering for short) is a difficult, yet important task. Many fields, from statistics to psychology, deal with categorical data. Despite this fact, categorical clustering has received limited attention, with only a handful of relevant publications. Guha et al. [16] proposed a robust hierarchical clustering algorithm ROCK, which uses a link-based similarity measure to quantify the similarity between two categorical data points and between two clusters. Gibson et al. [14] first constructed a hypergraph according to the database, and then clustered the hypergraph using a discrete dynamic system STIRR. Andritsos et al. [17] introduced a scalable hierarchical categorical clustering algorithm LIMBO that builds on the information bottleneck (IB) framework for quantifying the relevant information preserved when clustering. An incremental algorithm called COOLCAT [18] was proposed to cluster categorical attributes using entropy; however, it is based on the assumption of independence between the attributes. Clustering with sLOPE (CLOPE), presented in [19], uses a global criterion function instead of a local one defined by the pairwise similarity to cluster categorical data, especially transactional data. Rather than a measure of similarity, CLICKS [20] uses a graph-theoretic approach to find k disjoint sets of vertices in a graph constructed for a particular data set. The last three algorithms, i.e., COOLCAT, CLOPE, and CLICKS, have a different focus from our proposed coupled similarity. Therefore, in the experiments, we compare the clustering quality of ROCK, STIRR, and LIMBO with the coupled versions of them, i.e., when our proposed coupled similarity measure replaces the original similarity measure between attribute values in ROCK and LIMBO.

III. PRELIMINARY DEFINITIONS

A large number of data objects with the same attribute set can be organized by an information table S = <U, A, V, f>, where the universe U = {u_1, ..., u_m} is composed of a nonempty finite set of data objects; A = {a_1, ..., a_n} is a finite set of attributes; V = ∪_{j=1}^n V_j is a collection of attribute value sets, in which V_j is the set of attribute values of attribute a_j (1 ≤ j ≤ n); and f = ∪_{j=1}^n f_j, with f_j : U → V_j (1 ≤ j ≤ n), is an information function that assigns a particular value of attribute a_j to every object. For instance, Table II is an information table consisting of six objects {u_1, ..., u_6} and three attributes {a_1, a_2, a_3}; the attribute value of object u_1 for attribute a_2 is f_2(u_1) = B_1, and the set of all attribute values for a_2 is V_2 = {B_1, B_2, B_3}.

TABLE II
EXAMPLE OF INFORMATION TABLE

Generally speaking, the similarity between two objects u_x, u_y (∈ U) can be built on top of the similarities between their attribute values v_j^x, v_j^y (∈ V_j) for all attributes a_j ∈ A. Here, v_j^x and v_j^y indicate the respective attribute values of objects u_x and u_y for the attribute a_j.

For example, v_2^1 = B_1 and v_1^2 = A_2. By proposing a coupled attribute value similarity measure, we define a new object similarity for categorical data. The basic concepts below facilitate the formulation of a coupled attribute value similarity measure; they are exemplified by Table II. Below, an information table S is given, and |·| denotes the number of elements in a set.

TABLE III
LIST OF MAIN NOTATIONS

Definition 3.1 (SIF): Two set information functions (SIFs) are defined as

F_j : 2^U → 2^{V_j},  F_j(U') = { f_j(u_x) | u_x ∈ U' }   (1)

G_j : 2^{V_j} → 2^U,  G_j(V_j') = { u_i | f_j(u_i) ∈ V_j' }   (2)

where 1 ≤ j ≤ n, 1 ≤ i ≤ m, U' ⊆ U, and V_j' ⊆ V_j.

TABLE IV
LIST OF ABBREVIATIONS

These SIFs describe the relationships between objects and attribute values from different levels. Function F_j(U') assigns the associated value set of attribute a_j to the object set U'. Function G_j(V_j') maps the value set V_j' of attribute a_j to the dependent object set. For example, based on the attribute a_2, F_2({u_1, u_2, u_3}) = {B_1, B_2} collects the attribute values of u_1, u_2, and u_3; and G_2({B_1, B_2}) = {u_1, u_2, u_3, u_6} returns the objects whose attribute values are B_1 and B_2. In Table I on the movie data, G_1({Hitchcock}) = {Vertigo, N by NW} while F_2({Vertigo, N by NW}) = {Stewart, Grant}.

Note that in the two definitions below, the superscripts x and y of v_j are omitted, since any attribute value v_j ∈ V_j used here is independent of the objects u_x and u_y. However, v_j^x and v_j^y are reused when defining the similarity in the following sections.

Definition 3.2 (IIF): The inter-information function (IIF) obtains a value subset of attribute a_k for the corresponding objects, which are derived from the value v_j of attribute a_j. It is defined as

φ_{j→k} : V_j → 2^{V_k},  φ_{j→k}(v_j) = F_k(G_j({v_j})).   (3)

This IIF φ_{j→k} is the composition of F_k and G_j. The subscript j → k means that this mapping φ is performed from attribute a_j to attribute a_k. Intuitively, φ_{j→k}(v_j) computes the set of attribute values from attribute a_k that co-occur with a particular attribute value v_j from attribute a_j. In other words, φ_{j→k}(v_j) returns an attribute value subset of V_k that shares the same objects as v_j. For example, φ_{2→1}(B_1) = {A_1, A_2} specifies that the value B_1 of attribute a_2 and the values {A_1, A_2} of attribute a_1 are related by the corresponding objects u_1 and u_2. Likewise, in Table I, φ_{1→2}(Hitchcock) = {Stewart, Grant} due to the connected objects Vertigo and N by NW.

Definition 3.3 (ICP): Given the value subset V_k' (⊆ V_k) of attribute a_k and the value v_j (∈ V_j) of attribute a_j, the information conditional probability (ICP) of V_k' with respect to v_j is P_{k|j}(V_k'|v_j), defined as

P_{k|j}(V_k'|v_j) = |G_k(V_k') ∩ G_j({v_j})| / |G_j({v_j})|.   (4)

Intuitively, given all the objects with the value v_j of attribute a_j, ICP is the percentage of common objects whose values of attribute a_k fall in subset V_k' and whose values of attribute a_j are exactly v_j as well. For example, P_{1|2}({A_1}|B_1) = 0.5. Back in Table I, P_{2|1}({De Niro, Stewart}|Hitchcock) = 0.5, since the actor subset {De Niro, Stewart} and the director Hitchcock co-occur in only one movie (Vertigo), given that Hitchcock directs two movies: Vertigo and N by NW. Hence, ICP quantifies the relative (∈ [0, 1]) overlapping ratio of attribute values in terms of objects. Note that the use of the subset V_k' and the element v_j is to facilitate the definitions in Section V, and to make them mathematically solid and consistent.

All these concepts and functions form the foundation for formalizing the coupled interactions within and between categorical attributes, as presented below. The main notations in this paper are listed in Table III. In addition, several important abbreviations are defined in Table IV to facilitate the reading of this paper.

IV. FRAMEWORK OF THE COUPLED ATTRIBUTE SIMILARITY ANALYSIS

In this section, a framework for coupled attribute similarity analysis is proposed from a global perspective of the intra-coupled interaction within an attribute, the inter-coupled interaction among multiple attributes, and the integration of both. With respect to the intra-coupled interaction, the similarity between attribute values is considered by examining their occurrence frequencies within one attribute. For the inter-coupled interaction, the similarity between attribute values is captured by exposing their co-occurrence dependency on the values of other attributes. For example, the coupled value similarity between B_1 and B_2 (i.e., values of attribute a_2) concerns both the intra-coupled relationship specified by the occurrence counts of values B_1 and B_2 (2 and 2), and the inter-coupled interaction triggered by the other two attributes (a_1 and a_3). The coupled interaction is then derived by the integration of intra-coupling and inter-coupling. In this way, the couplings of attributes lead to a more accurate similarity (∈ [0, 1]) between attribute values, rather than a crude assignment of either 0 or 1.
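To make these building blocks concrete, the following Python sketch implements the SIFs F_j and G_j, the IIF, and the ICP. The four-object, two-attribute table is only a fragment consistent with the running example (the full Table II is not reproduced in this excerpt, so objects u_4, u_5 and attribute a_3 are omitted), and all function names are ours, chosen for illustration.

```python
# A fragment consistent with the running example; the complete Table II
# (objects u4, u5 and attribute a3) is not reproduced in this excerpt.
table = {
    "u1": {"a1": "A1", "a2": "B1"},
    "u2": {"a1": "A2", "a2": "B1"},
    "u3": {"a1": "A2", "a2": "B2"},
    "u6": {"a1": "A4", "a2": "B2"},
}

def F(j, objs):
    """SIF F_j, (1): value set of attribute a_j over an object subset."""
    return {table[u][j] for u in objs}

def G(j, values):
    """SIF G_j, (2): objects whose a_j value falls in the given subset."""
    return {u for u in table if table[u][j] in values}

def iif(j, k, v):
    """IIF phi_{j->k}, (3): a_k values co-occurring with value v of a_j."""
    return F(k, G(j, {v}))

def icp(k, j, vk_subset, vj):
    """ICP P_{k|j}(V_k'|v_j), (4): overlap ratio in terms of objects."""
    base = G(j, {vj})
    return len(G(k, vk_subset) & base) / len(base)

print(iif("a2", "a1", "B1"))          # {'A1', 'A2'}, as in the text
print(icp("a1", "a2", {"A1"}, "B1"))  # 0.5 = P_{1|2}({A1}|B1)
```

The two printed values reproduce the worked examples above; later sketches in this paper reuse these helpers.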

Fig. 1. Framework of coupled attribute similarity analysis, where dashed arrows indicate intra-coupling and ←→ refers to inter-coupling.

In the framework described in Fig. 1, the couplings of attributes are revealed via the similarity between attribute values v_j^x (e.g., Scorsese in Table I) and v_j^y (e.g., Coppola in Table I) of each attribute a_j (e.g., Director in Table I) by means of the intra-coupling and inter-coupling. Further, the coupled similarity for objects is built on top of the pairwise similarity between attribute values according to the integration of couplings. Finally, two learning tasks are explored for the data structure analysis and data clustering evaluation by incorporating the coupled interactions, revealing that the couplings of attributes are essential to applications in empirical studies. Given an information table S with a set of m objects U and a set of n attributes A, we specify those interactions and couplings in the following sections.

V. COUPLED ATTRIBUTE SIMILARITY

The attribute couplings are proposed in terms of both intra-coupled and inter-coupled similarities. Below, the intra-coupled and inter-coupled relationships, as well as the integrated coupling, are formalized and exemplified. Note that all the main notations are listed in Table III.

A. Intra-Coupled Interaction

According to [5], the discrepancy in attribute value occurrence times reflects the value similarity in terms of frequency distribution. It reveals that greater similarity is assigned to the attribute value pair that owns approximately equal frequencies. The higher these frequencies are, the closer the two values are. Different occurrence frequencies therefore indicate distinct levels of attribute value significance. These principles are also consistent with the similarity theorem presented in [11], in which the commonality corresponds to the product of frequencies and the full description relates to the total sum of individual frequencies and their product. In addition, a comparative evaluation of similarity measures for categorical data has been done in [10], delivering OF and Lin as the two best similarity measures among 14 existing measures on 18 data sets. Both these measures assign higher weights to mismatches or matches on frequent values, and the maximum similarity is attained when the attribute values exhibit approximately equal frequencies [10].

Thus, when calculating attribute value similarity (e.g., the similarity between Scorsese and Coppola in Table I), we consider the relationship between the attribute value frequencies of an attribute, proposed as intra-coupled similarity, to satisfy the above principles.

Definition 5.1 (IaASV): The intra-coupled attribute similarity for values (IaASV) between values v_j^x and v_j^y of attribute a_j is

δ_j^{Ia}(v_j^x, v_j^y) = (|G_j({v_j^x})| · |G_j({v_j^y})|) / (|G_j({v_j^x})| + |G_j({v_j^y})| + |G_j({v_j^x})| · |G_j({v_j^y})|).   (5)

Since 1 ≤ |G_j({v_j^x})|, |G_j({v_j^y})| ≤ m and 2 ≤ |G_j({v_j^x})| + |G_j({v_j^y})| ≤ m, we obtain δ_j^{Ia} ∈ [1/3, m/(m + 4)] according to Proof (a) in the Appendix. For example, in Table II, both B_1 and B_2 are observed twice, so δ_2^{Ia}(B_1, B_2) = 0.5. In Table I, we have δ_1^{Ia}(Scorsese, Coppola) = 1/3, since both directors Scorsese and Coppola appear once, i.e., |G_1({Scorsese})| = |G_1({Coppola})| = 1.

Note that there is still an issue in the above definition: if two attribute values v_j^x and v_j^y have the same frequency, then we have δ_j^{Ia}(v_j^x, v_j^x) = δ_j^{Ia}(v_j^x, v_j^y). This is somewhat intuitively problematic, but the inter-coupled similarity proposed in the next section remedies this issue, because the inter-coupled similarities between v_j^x, v_j^x and between v_j^x, v_j^y are overwhelmingly distinct.

By taking the frequency of attribute values into consideration, IaASV characterizes the value similarity in terms of attribute value occurrence times.

B. Inter-Coupled Interaction

The IaASV considers the interaction between attribute values within an attribute a_j. It does not involve the couplings between attributes [e.g., a_k (k ≠ j) and a_j] when calculating the attribute value similarity. For this, we discuss the dependency aggregation, i.e., inter-coupled interaction.

In 1993, Cost and Salzberg [9] presented a powerful new method, MVDM, for measuring the dissimilarity between categorical values of a given attribute. The MVDM considers the overall similarity of classification of all objects on each possible value of each attribute. The dissimilarity D_{j|L} between two attribute values v_j^x and v_j^y (e.g., Scorsese and Coppola in Table I) for a specific attribute a_j (e.g., Director in Table I) regarding labels L (e.g., Class in Table I) is

D_{j|L}(v_j^x, v_j^y) = Σ_{l∈L} |P_{l|j}({l}|v_j^x) − P_{l|j}({l}|v_j^y)|   (6)

where l (∈ L, e.g., l_1 and l_2 in Table I) is a label in the information table. P_{l|j} is the ICP defined in (4), obtained by replacing the attribute a_k with the label l and the attribute value subset V_k' with the label subset L' ⊆ L (here L' = {l}), in which G_l(L') refers to the set of objects whose labels fall in L'. D_{j|L} indicates that values are identified as being similar if they occur with the same relative frequency for all classes.
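As a minimal illustration of (5), the sketch below computes IaASV directly from the two value frequencies; the frequencies passed in the calls are the ones quoted in the running examples.

```python
def iaasv(freq_x, freq_y):
    """IaASV, (5), from the frequencies |G_j({v^x})| and |G_j({v^y})|."""
    return (freq_x * freq_y) / (freq_x + freq_y + freq_x * freq_y)

print(iaasv(2, 2))  # 0.5: B1 vs. B2 in Table II (each occurs twice)
print(iaasv(1, 1))  # 0.333...: Scorsese vs. Coppola in Table I
```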

TABLE V
EXAMPLE OF COMPUTING SIMILARITY USING IRSP

According to the principle [21] that, for the categorical data distribution, the sum of L1 dissimilarities and twice the total variation dissimilarity are equivalent, we have

D_{j|L}(v_j^x, v_j^y) = 2 · max_{L'⊆L} |P_{l|j}(L'|v_j^x) − P_{l|j}(L'|v_j^y)|.   (7)

The detailed proof of the equivalence of (6) and (7) is given in Proof (b) in the Appendix.

In the absence of labels, (7) is adapted to our target problem by replacing the class label information with other attribute knowledge to enable unsupervised learning. In Table I, for instance, we consider the attribute Actor rather than Class when calculating the similarity between Scorsese and Coppola for Director, since Class is invisible in the unsupervised learning process. We regard this interaction between attributes as inter-coupled similarity in terms of the co-occurrence comparisons of ICP. The most intuitive variant of (7) is IRSP.

Definition 5.2 (IRSP): The inter-coupled relative similarity based on power set (IRSP) between values v_j^x and v_j^y of attribute a_j based on another attribute a_k is defined as δ_{j|k}^P(v_j^x, v_j^y, V_k) (below δ_{j|k}^P for short)

δ_{j|k}^P = min_{V_k'⊆V_k} {2 − P_{k|j}(V_k'|v_j^x) − P_{k|j}(V_k''|v_j^y)}   (8)

where V_k'' = V_k \ V_k' is the complementary set of V_k' under the complete value set V_k of attribute a_k.

The main differences between (8) and (7) are: 1) the multiplier 2 in (7) is omitted; 2) labels are replaced with the values of a particular attribute a_k, i.e., V_k' and V_k are substituted for L' and L, respectively; 3) the complementary set V_k'' rather than the original set V_k' is considered for v_j^y in ICP, noting that P_{k|j}(V_k''|v_j^y) = 1 − P_{k|j}(V_k'|v_j^y); and 4) dissimilarity is converted to similarity: the new dissimilarity measure

D'_{j|k}(v_j^x, v_j^y) = max_{V_k'⊆V_k} |P_{k|j}(V_k'|v_j^x) + P_{k|j}(V_k''|v_j^y) − 1|   (9)

is obtained by following the previous three steps, and then we have δ_{j|k}^P = 1 − D'_{j|k}(v_j^x, v_j^y). The detailed conversion process is provided in Proof (c) in the Appendix. Two attribute values are closer to each other if they have more similar probabilities with other attribute value subsets in terms of co-occurrence object frequencies.

In Table II, by employing (8), we want to obtain δ_{2|1}^P(B_1, B_2, {A_i}_{i=1}^4), i.e., the similarity between the two attribute values B_1, B_2 of a_2 regarding attribute a_1 with its values {A_i}_{i=1}^4. As shown in Table V, the set of all attribute values of a_1 is V_1 = {A_1, A_2, A_3, A_4}. The number of all subsets of V_1 is 2^4, i.e., the number of combinations consisting of V_1' ⊆ V_1 and its complement V_1'' ⊆ V_1 is 2^4. In detail, for the second row, when we consider V_1' = {A_1} and V_1'' = {A_2, A_3, A_4}, we have P_{1|2}(V_1'|B_1) = |{u_1}|/|{u_1, u_2}| = 0.5 and P_{1|2}(V_1''|B_2) = |{u_3, u_6}|/|{u_3, u_6}| = 1. Therefore, 2 − P_{1|2}(V_1'|B_1) − P_{1|2}(V_1''|B_2) = 2 − 0.5 − 1 = 0.5. The other elements in the power set of V_1 follow the same rule. Accordingly, the minimal value among them is 0.5, which indicates that the corresponding similarity δ_{2|1}^P is 0.5.

This process shows that the combinational explosion brought about by the power set needs to be considered when calculating attribute value similarity by IRSP. For a given set of attribute values, the power set considers all the subsets. However, the universal set concerns only the individual elements involved, which effectively reduces the number of items considered. The joint and intersection sets focus on parts of the elements, which further optimizes the calculation time. We start with the power set-based IRSP, and proceed to the universal set-based IRSU, the joint set-based IRSJ, and the intersection set-based IRSI to see whether the problem can be reduced in this way. We therefore define three more similarity metrics, IRSU, IRSJ, and IRSI, based on IRSP.

Definition 5.3 (IRSU, IRSJ, IRSI): The inter-coupled relative similarities based on the universal set (IRSU), joint set (IRSJ), and intersection set (IRSI) between values v_j^x and v_j^y of attribute a_j based on another attribute a_k are defined as δ_{j|k}^U(v_j^x, v_j^y, V_k), δ_{j|k}^J(v_j^x, v_j^y, V_k), and δ_{j|k}^I(v_j^x, v_j^y, V_k) (below δ_{j|k}^U, δ_{j|k}^J, and δ_{j|k}^I for short), respectively

δ_{j|k}^U = 2 − Σ_{v_k∈V_k} max{P_{k|j}({v_k}|v_j^x), P_{k|j}({v_k}|v_j^y)}   (10)

δ_{j|k}^J = 2 − Σ_{v_k∈∪} max{P_{k|j}({v_k}|v_j^x), P_{k|j}({v_k}|v_j^y)}   (11)

δ_{j|k}^I = Σ_{v_k∈∩} min{P_{k|j}({v_k}|v_j^x), P_{k|j}({v_k}|v_j^y)}   (12)

where v_k ∈ ∪ and v_k ∈ ∩ denote v_k ∈ φ_{j→k}(v_j^x) ∪ φ_{j→k}(v_j^y) and v_k ∈ φ_{j→k}(v_j^x) ∩ φ_{j→k}(v_j^y), respectively.

In the above definition, the universal set, joint set, and intersection set are proposed to quantify the inter-coupled similarity in place of the power set. Let us explain the intuitive meaning behind these formulas. For example, in Table I, the IRSP similarity between directors Scorsese and Coppola is measured by examining their individual connections with all the actor subsets (V_2' ⊆ V_2), such as {De Niro}, {De Niro, Stewart}, and {Stewart, Grant}.²
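A deliberately naive rendering of (8) makes the combinational explosion explicit: it enumerates all 2^|V_k| subsets. The sketch reuses the icp helper and table fragment from the earlier sketch, so the printed value is computed over the fragment (which happens to reproduce the paper's result of 0.5); powerset and irsp are our own illustrative names.

```python
from itertools import chain, combinations

def powerset(values):
    """All 2^|V_k| subsets of V_k -- the source of IRSP's cost."""
    vals = list(values)
    return chain.from_iterable(
        combinations(vals, r) for r in range(len(vals) + 1)
    )

def irsp(j, k, vx, vy, Vk):
    """IRSP, (8): minimize over every subset V_k' and its complement."""
    return min(
        2 - icp(k, j, set(sub), vx) - icp(k, j, set(Vk) - set(sub), vy)
        for sub in powerset(Vk)
    )

# delta^P_{2|1}(B1, B2) over V_1 restricted to the fragment's values.
print(irsp("a2", "a1", "B1", "B2", {"A1", "A2", "A4"}))  # 0.5
```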

TABLE VI
COMPUTING SIMILARITY USING IRSU

TABLE VIII
COMPUTING SIMILARITY USING IRSI

Alternatively, the IRSU similarity between directors Scorsese and Coppola is defined upon all the actors (v_2 ∈ V_2), including De Niro, Stewart, and Grant. The IRSJ similarity between directors Scorsese and Coppola is further built on the subset of {De Niro, Stewart, Grant} given by the joint rule (v_2 ∈ ∪), which produces all the possible values that share objects. The IRSI similarity between directors Scorsese and Coppola is composed of another subset of {De Niro, Stewart, Grant} based on the intersection rule (v_2 ∈ ∩), which returns the common values that share objects. The joint and intersection rules correspond to different selection schemes of actors that co-occur with directors Scorsese and/or Coppola in movies: the former collects the actors that co-occur with either Scorsese or Coppola, while the latter takes the actors that co-occur with both Scorsese and Coppola. As discussed later in Section VI, these four options are actually equivalent to one another, though they present varying computational efficiency.

In detail, each value v_k (∈ V_k) of attribute a_k, rather than its value subsets V_k' ⊆ V_k, is considered to reduce the computational complexity. As shown in Table VI, we have P_{1|2}({A_1}|B_1) = |{u_1}|/|{u_1, u_2}| = 0.5 and P_{1|2}({A_1}|B_2) = |∅|/|{u_3, u_6}| = 0 when v_k takes A_1; thus, the maximum is 0.5. Accordingly, the similarity δ_{2|1}^U based on IRSU is δ_{2|1}^U(B_1, B_2, {A_i}_{i=1}^4) = 2 − 0.5 − 0.5 − 0 − 0.5 = 0.5. Since IRSU only concerns all the single attribute values rather than exploring the whole power set, it solves the combinational explosion issue to a great extent. In IRSU, ICP is calculated merely eight times compared with 32 times by IRSP, which leads to a substantial improvement in efficiency.

The IIF (3) is used to further reduce the time cost of ICP with two more similarity measures: IRSJ (11) and IRSI (12). With (11), the calculation of δ_{2|1}^J is further simplified, since A_3 ∉ φ_{2→1}(B_1) ∪ φ_{2→1}(B_2), where φ_{2→1}(B_1) = {A_1, A_2} and φ_{2→1}(B_2) = {A_2, A_4}. As shown in Table VII, we obtain δ_{2|1}^J(B_1, B_2, {A_i}_{i=1}^4) = 2 − 0.5 − 0.5 − 0.5 = 0.5 by calculating ICP with {A_1, A_2, A_4} rather than the whole value set {A_1, A_2, A_3, A_4} of attribute a_1. This reveals the fact that it is enough to compute ICP with the values v_k ∈ V_1 that belong to φ_{2→1}(B_1) ∪ φ_{2→1}(B_2), instead of all the elements in V_1. Thus, IRSJ further reduces the complexity compared with IRSU.

TABLE VII
COMPUTING SIMILARITY USING IRSJ

Based on IRSU, an alternative measure, IRSI, is considered. With (12), the calculation of δ_{2|1}^I is once again simplified, as in Table VIII, since only A_2 ∈ φ_{2→1}(B_1) ∩ φ_{2→1}(B_2), where φ_{2→1}(B_1) = {A_1, A_2} and φ_{2→1}(B_2) = {A_2, A_4}. In this case, it is sufficient to compute ICP with A_2 ∈ V_1, which alone belongs to φ_{2→1}(B_1) ∩ φ_{2→1}(B_2) (i.e., {A_1, A_2} ∩ {A_2, A_4}). In detail, we have P_{1|2}({A_2}|B_1) = |{u_2}|/|{u_1, u_2}| = 0.5 and P_{1|2}({A_2}|B_2) = |{u_3}|/|{u_3, u_6}| = 0.5, so the minimum is 0.5. Then, we easily obtain δ_{2|1}^I(B_1, B_2, {A_i}_{i=1}^4) = 0.5 as the final result, though ICP has been computed for only a single value of a_1. It is trivial that the cardinality of the intersection ∩ is always no larger than that of the joint set ∪. Thus, IRSI is more efficient than IRSU due to the reduction of the inter-coupled relative similarity complexity.

Intuitively, IRSI is the most efficient of all the proposed inter-coupled relative similarity measures: IRSP, IRSU, IRSJ, and IRSI. All four measures lead to the same similarity result, such as 0.5 in our example; in fact, these measures are mathematically equivalent to one another. This claim is proved in Section VI.

Accordingly, the similarity between the value pair (v_j^x, v_j^y) of attribute a_j can be calculated on top of these four optional measures by aggregating all the relative similarities on attributes other than a_j. In Table I, the inter-coupled similarity between Scorsese and Coppola can be naturally constructed by summarizing the co-occurrences from the attributes other than Director, namely, Actor and Genre.

Definition 5.4 (IeASV): The inter-coupled attribute similarity for values (IeASV) between attribute values v_j^x and v_j^y of attribute a_j is

δ_j^{Ie}(v_j^x, v_j^y, {V_k}_{k≠j}) = Σ_{k=1,k≠j}^n α_k δ_{j|k}(v_j^x, v_j^y, V_k)   (13)

where α_k is the weight parameter for attribute a_k, Σ_{k=1,k≠j}^n α_k = 1, α_k ∈ [0, 1], and δ_{j|k}(v_j^x, v_j^y, V_k) is one of the inter-coupled relative similarity candidates.

Therefore, δ_j^{Ie} ∈ [0, 1]. Intuitively, δ_j^{Ie} calculates the inter-coupled similarity of values by aggregating all the connections between the other attributes and the attribute a_j. Back in Table I, the inter-coupled similarity between Scorsese and Coppola (δ_1^{Ie})³ is determined by the respective co-occurrences of Actor (δ_{1|2}) and Genre (δ_{1|3}) with them. The parameter α_k specifies how strongly two attributes depend on each other. In this paper, we simply assign α_k = 1/(n − 1), which indicates that every two attributes uniformly connect with each other. Thus, different similarity values rest with distinct extents of co-occurrence between attributes. For example, in Table II, we have δ_2^{Ie}(B_1, B_2, {V_1, V_3}) = 0.5 · δ_{2|1}(B_1, B_2, {A_i}_{i=1}^4) + 0.5 · δ_{2|3}(B_1, B_2, {C_i}_{i=1}^3) = 0.25 if α_1 and α_3 both equal 0.5.
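Under the same assumptions as the previous sketches (the table fragment and the icp and iif helpers), IRSI in (12) and the IeASV aggregation in (13) reduce to a few lines; only the values in the intersection of the two IIF images are visited. The weight dictionary is a stand-in for α_k, and since the fragment lacks attribute a_3, the printed IeASV value differs from the paper's full-table 0.25.

```python
def irsi(j, k, vx, vy):
    """IRSI, (12): sum the smaller ICP over the shared co-occurring values."""
    shared = iif(j, k, vx) & iif(j, k, vy)   # intersection rule
    return sum(
        min(icp(k, j, {vk}, vx), icp(k, j, {vk}, vy)) for vk in shared
    )

def ieasv(j, vx, vy, attrs, alpha):
    """IeASV, (13): weighted aggregation of IRSI over the other attributes."""
    return sum(alpha[k] * irsi(j, k, vx, vy) for k in attrs if k != j)

print(irsi("a2", "a1", "B1", "B2"))                  # 0.5, as in Table VIII
print(ieasv("a2", "B1", "B2", ["a1"], {"a1": 1.0}))  # 0.5 on the fragment
```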

²This is just an illustration; the whole power set includes eight subsets.
³The symbols within brackets are omitted if no ambiguity arises.

TABLE IX
TIME COST OF ICP

The calculation of the first component has been displayed above, while the second component can be obtained by following the same approach. Alternatively, the parameter α_k, which reflects the coupling weight of categorical attributes, could be defined to capture the average connection degree of attribute values, inspired by [22], which proposes the support of attribute values. We will explore and analyze this strategy in our future work.

C. Coupled Interaction

So far, we have built formal definitions for both the IaASV and IeASV measures. The IaASV emphasizes the attribute value OF, while IeASV focuses on the co-occurrence comparison of ICP with four inter-coupled relative similarity options. The CASV is then naturally derived by simultaneously considering both measures.

Definition 5.5 (CASV): The CASV between attribute values v_j^x and v_j^y of attribute a_j is

δ_j^A(v_j^x, v_j^y, {V_k}_{k=1}^n) = δ_j^{Ia}(v_j^x, v_j^y) · δ_j^{Ie}(v_j^x, v_j^y, {V_k}_{k≠j})   (14)

where V_k (k ≠ j) is a value set of an attribute a_k different from a_j to enable the inter-coupled interaction, and δ_j^{Ia} and δ_j^{Ie} are the IaASV and IeASV defined above.

As indicated in (14), CASV gets larger by increasing either IaASV or IeASV. Here, we choose the multiplication of these two components. The rationale is twofold: 1) IaASV is associated with how often the value occurs, while IeASV reflects the extent of the value difference brought by other attributes; hence, intuitively, their multiplication indicates the total amount of attribute value difference; and 2) the multiplication method is consistent with the adapted simple matching distance introduced in [5]. Alternatively, in our future work, we could consider other combination forms of IaASV and IeASV according to the data structure, such as δ_j^A(v_j^x, v_j^y, {V_k}_{k=1}^n) = β · δ̃_j^{Ia}(v_j^x, v_j^y) + γ · δ_j^{Ie}(v_j^x, v_j^y, {V_k}_{k≠j}), where δ̃_j^{Ia} ∈ [0, 1] is the normalized intra-coupled similarity⁴ and 0 ≤ β, γ ≤ 1 (β + γ = 1) are the corresponding weights. Thus, IaASV and IeASV can be controlled flexibly to indicate in which cases the former is more significant than the latter, and vice versa.

⁴The normalization can be made by δ̃_j^{Ia} = (δ_j^{Ia} − min)/(max − min), where min and max denote the minimum and maximum values of δ_j^{Ia} for each attribute, respectively.

In addition, δ_j^A = δ_j^{Ia} · δ_j^{Ie} ∈ [0, m/(m + 4)], since we have δ_j^{Ia} ∈ [1/3, m/(m + 4)] (m ≥ 2) as well as δ_j^{Ie} ∈ [0, 1]. For example, in Table II, the CASV of attribute values B_1 and B_2 is δ_2^A(B_1, B_2, {V_1, V_2, V_3}) = δ_2^{Ia}(B_1, B_2) · δ_2^{Ie}(B_1, B_2, {V_1, V_3}) = 0.5 × 0.25 = 0.125. For the Movie data, the coupled similarity between Scorsese and Coppola (δ_1^A) is obtained by multiplying the intra-coupled similarity (δ_1^{Ia}) characterized by frequency with the inter-coupled similarity (δ_1^{Ie}) quantified by co-occurrence. For the other director pairs, we accordingly obtain δ_Director^A(Scorsese, Coppola) = δ_Director^A(Coppola, Coppola) = 0.33, and δ_Director^A(Koster, Coppola) = 0 while δ_Director^A(Koster, Hitchcock) = 0.25. These values correspond to the fact that Scorsese and Coppola are very similar directors, just as Coppola is to himself, and the similarity between Koster and Hitchcock is larger than that between Koster and Coppola, as clarified in Section I.

In the following theoretical analysis in Section VI, the computational accuracy and complexity of the four inter-coupled relative similarity options are analyzed.

VI. THEORETICAL ANALYSIS

This section compares the proposed four inter-coupled relative similarity measures (IRSP, IRSU, IRSJ, and IRSI) in terms of their computational accuracy and complexity.

A. Accuracy Equivalence

According to set theory, these four measures are equivalent to one another in calculating value similarity; we therefore have the following theorem, deduced by Proof (d) in the Appendix.

Theorem 6.1: The IRSP, IRSU, IRSJ, and IRSI are all equivalent to one another.

The above theorem indicates that IRSP, IRSU, IRSJ, and IRSI are equivalent to one another in terms of the information and knowledge they present. It also explains the similarity result in Section V-B. Thus, these measures induce exactly the same computational accuracy in different learning tasks, including classification and clustering.

B. Computational Complexity Comparison

When calculating the similarity between every pair of attribute values for all attributes, the computational complexity depends linearly on the time cost of ICP, which is quantified by the calculation counts of ICP. This reflects the efficiency difference between distinct similarity measures. Table IX summarizes the time costs of the four inter-coupled relative similarity measures.

Let |ICP_{j|k}^{(M)}| represent the time cost of ICP for δ_{j|k}^M(v_j^x, v_j^y) with the associated measure M ∈ {P, U, J, I}, whose elements correspond to IRSP, IRSU, IRSJ, and IRSI, respectively. From Table IX, |ICP_{j|k}^{(P)}| ≥ |ICP_{j|k}^{(U)}| ≥ |ICP_{j|k}^{(J)}| ≥ |ICP_{j|k}^{(I)}| holds constantly. This demonstrates the competitive efficiency of IRSI compared with the other three measures. In Table II, 32 calculation counts of ICP are required by IRSP, compared with only two calculation counts when using IRSI.

Suppose the maximal number of values for each attribute is R (= max_{j=1}^n |V_j|). In total, the number of value pairs for all the attributes is at most n · R(R − 1)/2, which is also the number of calculation steps. For each inter-coupled relative similarity, we calculate ICP |ICP_{j|k}^{(M)}| times. As we have n attributes, the total ICP time cost for CASV is 2 · |ICP_{j|k}^{(M)}| · (n − 1) flops per step for attribute a_j.
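To see the gap concretely, the sketch below counts ICP evaluations for one value pair under IRSP (two calls for each of the 2^|V_k| subsets) versus IRSI (two calls per value in the IIF intersection). It assumes the iif helper from the earlier sketches; the printed counts match the 32 versus 2 quoted above.

```python
def icp_calls_irsp(Vk):
    """IRSP evaluates ICP twice for each of the 2^|V_k| subsets."""
    return 2 * 2 ** len(Vk)

def icp_calls_irsi(j, k, vx, vy):
    """IRSI evaluates ICP twice per value in the IIF intersection."""
    return 2 * len(iif(j, k, vx) & iif(j, k, vy))

print(icp_calls_irsp({"A1", "A2", "A3", "A4"}))  # 32, as quoted for IRSP
print(icp_calls_irsi("a2", "a1", "B1", "B2"))    # 2, as quoted for IRSI
```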

The computational complexity for calculating all four options of CASV is shown in Table X.

TABLE X
COMPUTATIONAL COMPLEXITY FOR CASV

As indicated in Table X, all the measures take the same number of calculation steps, while their flops per step are sorted in descending order since 2^R > R ≥ R_∪ ≥ R_∩, in which R_∪ and R_∩ are the cardinalities of the joint and intersection sets of the corresponding IIFs, respectively. This evidences that the computational complexity essentially depends linearly on the time cost of ICP for given data. Specifically, IRSP has the largest complexity, O(n^2 R^2 2^R), compared with the smaller, equal ones, O(n^2 R^3), presented by the other three measures (IRSU, IRSJ, and IRSI). Of the latter three candidates, though they have the same computational complexity, IRSI is the most efficient due to R_∩ ≤ R_∪ ≤ R. The dissimilarity ADD that Ahmad and Dey [12] used for mixed data clustering corresponds to the worst measure, IRSP.

Considering both the accuracy analysis and the complexity comparison, we conclude that IRSI is the best performing measure, because it incurs the least complexity while maintaining equal accuracy in presenting couplings. Thus, we only consider IRSI in the following algorithmic design and experimental comparisons.

VII. COUPLED SIMILARITY ALGORITHM

In the previous sections, we discussed the construction of CASV and its theoretical comparison among the inter-coupled relative similarity candidates. In this section, a coupled similarity between objects is built based on CASV. Below, we consider the sum of all these CASV measures, following the Manhattan dissimilarity [5].

Definition 7.1 (CASO): Given an information table S, the CASO between objects u_x and u_y is CASO(u_x, u_y)

CASO(u_x, u_y) = Σ_{j=1}^n δ_j^A(v_j^x, v_j^y, {V_k}_{k=1}^n)   (15)

where δ_j^A is the CASV measure defined in (14), v_j^x and v_j^y are the attribute values of attribute a_j for objects u_x and u_y, respectively, and 1 ≤ x, y ≤ m, 1 ≤ j ≤ n.

For CASO, all the CASVs with each attribute are summed up for two objects. For example, the similarity between u_2 and u_3 in Table II is CASO(u_2, u_3) = Σ_{j=1}^3 δ_j^A(v_j^2, v_j^3, {V_k}_{k=1}^3) = 0.5 + 0.125 + 0.125 = 0.75.

The CASO has the property of non-negativity, since CASO(u_x, u_y) ∈ [0, mn/(m + 4)] and, in particular, CASO(u_x, u_x) ∈ [n/3, mn/(m + 4)], and the property of symmetry, i.e., CASO(u_x, u_y) = CASO(u_y, u_x), although it does not guarantee the triangle inequality. Therefore, CASO is a nonmetric similarity measure.

We then design an algorithm CASO(), given in Algorithm 1, to compute the coupled object similarity with IRSI (i.e., the best inter-coupled relative similarity candidate). The whole process of this algorithm is summarized as follows: 1) compute the IaASV for values v_j^x and v_j^y of attribute a_j; 2) compute the IeASV for attribute values v_j^x and v_j^y based on IRSI; 3) compute the CASV for attribute values v_j^x and v_j^y; and 4) compute the CASO for objects u_x and u_y.

Algorithm 1 Coupled Attribute Similarity for Objects
Data: Data set S_{m×n} with m objects and n attributes, objects u_x, u_y (x, y ∈ [1, m]), and weight α = (α_k)_{1×n}.
Result: Coupled similarity for objects CASO(u_x, u_y).
begin
  // Compute pairwise similarity for any two values of the same attribute.
  for attribute a_j, j = 1 : n do
    for every value pair (v_j^x, v_j^y), x, y ∈ [1, |V_j|] do
      U1 ← {i | v_j^i == v_j^x}, U2 ← {i | v_j^i == v_j^y};
      // Compute intra-coupled similarity for values v_j^x and v_j^y.
      δ_j^{Ia}(v_j^x, v_j^y) = (|U1| · |U2|) / (|U1| + |U2| + |U1| · |U2|);
      // Compute coupled similarity for attribute values v_j^x and v_j^y.
      δ_j^A(v_j^x, v_j^y, {V_k}_{k=1}^n) ← δ_j^{Ia}(v_j^x, v_j^y) · IeASV(v_j^x, v_j^y, {V_k}_{k≠j});
  // Compute coupled similarity between objects u_x and u_y.
  CASO(u_x, u_y) ← sum(δ_j^A(v_j^x, v_j^y, {V_k}_{k=1}^n));
end

Function IeASV(v_j^x, v_j^y, {V_k}_{k≠j})
begin
  // Compute inter-coupled similarity for attribute values v_j^x and v_j^y.
  for attribute k = 1 : n, k ≠ j do
    {v_k^z}_{z∈U3} ← {v_k^x}_{x∈U1} ∩ {v_k^y}_{y∈U2};
    for intersection z = U3(1) : U3(|U3|) do
      U0 ← {i | v_k^i == v_k^z};
      ICPx ← |U0 ∩ U1| / |U1|;
      ICPy ← |U0 ∩ U2| / |U2|;
      Min(x,y) ← min(ICPx, ICPy);
    // Compute IRSI for v_j^x and v_j^y.
    δ_{j|k}^I(v_j^x, v_j^y, V_k) = sum(Min(x,y));
  δ_j^{Ie}(v_j^x, v_j^y, {V_k}_{k≠j}) = sum[α(k) × δ_{j|k}^I(v_j^x, v_j^y, V_k)];
  return δ_j^{Ie}(v_j^x, v_j^y, {V_k}_{k≠j});
end

Before the similarity calculation is performed, some data preprocessing is conducted to enable this algorithm. In detail, all the categories of each attribute are encoded as integers, starting at one and increasing to the respective number of attribute values. To reduce unnecessary iterations, the pairwise CASV for any two values of the same attribute, rather than for only the two values involved for each attribute, is precalculated for reuse when computing the object similarity. Explicitly, this pseudocode also embodies the fact that the computational complexity for IRSI is indeed O(n^2 R^3). However, it might not be very attractive for extremely large data sets with attributes that take too many values. Thus, we are working on strategies of attribute reduction to effectively reduce the number of coupled attributes.

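Assembling the pieces, a compact Python sketch of CASV (14) and CASO (15) with IRSI and uniform weights α_k = 1/(n − 1) is given below, mirroring Algorithm 1. It relies on the table, G, iaasv, and ieasv helpers from the earlier sketches; on the two-attribute fragment the printed value happens to coincide with the paper's full-table CASO(u_2, u_3) = 0.75.

```python
def casv(j, vx, vy, attrs, alpha):
    """CASV, (14): intra-coupled similarity times inter-coupled similarity."""
    fx, fy = len(G(j, {vx})), len(G(j, {vy}))
    return iaasv(fx, fy) * ieasv(j, vx, vy, attrs, alpha)

def caso(ux, uy, attrs):
    """CASO, (15): sum of CASV over all attributes, alpha_k = 1/(n - 1)."""
    n = len(attrs)
    alpha = {k: 1.0 / (n - 1) for k in attrs} if n > 1 else {}
    return sum(
        casv(j, table[ux][j], table[uy][j], attrs, alpha) for j in attrs
    )

# On the two-attribute fragment this prints 0.75, which happens to agree
# with the paper's full-table value CASO(u2, u3) = 0.75.
print(caso("u2", "u3", ["a1", "a2"]))
```

In a production setting one would precompute casv for all value pairs of each attribute, as the preprocessing paragraph above suggests, rather than recompute it per object pair.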
VIII. EXPERIMENTS AND EVALUATION

In this section, extensive experiments are performed on several UCI and bibliographic data sets to show the effectiveness of our proposed coupled similarity measures. For simplicity, we assign the weight vector α = (α_k)_{1×n} with values α(k) = 1/(n − 1) in Definition 5.4.

In this part of our experiments, we focus on comparing our novel coupled attribute dissimilarity for objects (CADO), induced from CASO, with existing categorical dissimilarity measures. Four independent groups of experiments are conducted with extensive data sets based on applications. In the following, we evaluate the CADO, which is derived from (15)

CADO(u_x, u_y) = Σ_{j=1}^n h_1(δ_j^{Ia}(v_j^x, v_j^y)) · h_2(δ_j^{Ie}(v_j^x, v_j^y, {V_k}_{k≠j}))   (16)

where h_1(t) and h_2(t) are decreasing functions. Based on the intra-coupled and inter-coupled similarities, h_1(t) and h_2(t) can be flexibly chosen to build dissimilarity measures according to specific requirements. In terms of the capability of revealing the data relationship, the better the induced dissimilarity, the better is its similarity.

Here, we consider h_1(t) = 1/t − 1 and h_2(t) = 1 − t to reflect the complementarity between similarity and dissimilarity measures, since they are both decreasing functions of t. The rationale behind these two functions is as follows. The first conversion corresponds to the improved simple matching dissimilarity (SMD) with frequency [5] if only 0 and 1 are assigned to δ_j^{Ie} (i.e., SMD [23]: dissimilarity 0 for identical values, and 1 otherwise). The second transformation guarantees the consistency of CADO with the dissimilarity measure ADD [12] when a constant is fixed for δ_j^{Ia}. In addition, h_1(t) = 1/t − 1 is also consistent with the converted measures proposed in [11], and h_2(t) = 1 − t follows the way of converting OF to OFD [10] as well, presented in the next section. Both these functions are designed to include existing classical measures as special cases of our proposed coupled similarity. The detailed specializations to the improved SMD and the ADD are explained in Section IX.

A. Data Structure Analysis

This section performs experiments to explicitly specify the internal structures of the labeled data. Clusterings are normally evaluated by assigning the best score to the algorithm that produces clusters with the highest similarity within a cluster and the lowest similarity between clusters based on a certain similarity measure. We work in a different way: similarity measures are evaluated with clustering criteria and given labels. In this way, a better cluster structure can be clarified with a better similarity measure in terms of clustering internal descriptors, such as sum-square, the Davies-Bouldin index (DBI) [24], and the Dunn index (DI) [25].

To reflect the data cluster structure more clearly, the induced dissimilarity metrics are evaluated by four descriptors: 1) relative dissimilarity (RD); 2) DBI; 3) DI; and 4) sum-dissimilarity (SD). In detail, RD is the ratio of the average inter-cluster dissimilarity to the average intra-cluster dissimilarity over all cluster labels, and SD is the sum of object dissimilarities within all the clusters. Since internal criteria seek clusters with high intra-cluster similarity and low inter-cluster similarity, dissimilarity metrics that produce clusters with high RD or DI and low DBI or SD are more desirable.

Four object dissimilarity metrics are considered: 1) SMD [5] (i.e., Hamming distance [23]); 2) OF dissimilarity (OFD) [10]; 3) ADD introduced in [12]; and 4) our proposed CADO. The SMD is a simple, well-known measure for categorical data, while OFD considers matching in terms of the attribute value frequency distribution; both are formalized as the sum of value dissimilarities over all the attributes. The value dissimilarities are D_j^{SMD} = D_j^{OFD} = 0 if v_j^x = v_j^y; otherwise, they equal 1 for SMD and 1 − 1/(1 + log(m/|G_j({v_j^x})|) · log(m/|G_j({v_j^y})|)) for OFD. The dissimilarity measure ADD, derived from (15) with the worst inter-coupled relative similarity candidate IRSP, considers the sum of the inter-coupled interactions between all the corresponding attribute values. These three measures only concern local pictures, while CADO is globally formalized based on (16).

Fig. 2. Data structure index comparison.

The cluster structures produced by the above four dissimilarity metrics are then analyzed on 10 data sets of different scales.
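Before turning to the results, the conversion in (16) can be sketched under the stated choices h_1(t) = 1/t − 1 and h_2(t) = 1 − t; the 1/t term is safe because δ_j^{Ia} ≥ 1/3 by construction. The sketch assumes the table, G, iaasv, and ieasv helpers from the previous sketches.

```python
def cado(ux, uy, attrs):
    """CADO, (16), with h1(t) = 1/t - 1 and h2(t) = 1 - t."""
    n = len(attrs)
    alpha = {k: 1.0 / (n - 1) for k in attrs} if n > 1 else {}
    total = 0.0
    for j in attrs:
        vx, vy = table[ux][j], table[uy][j]
        fx, fy = len(G(j, {vx})), len(G(j, {vy}))
        h1 = 1.0 / iaasv(fx, fy) - 1.0   # safe: delta_Ia >= 1/3 > 0
        h2 = 1.0 - ieasv(j, vx, vy, attrs, alpha)
        total += h1 * h2
    return total

print(cado("u2", "u3", ["a1", "a2"]))  # 0.5 on the table fragment
```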

TABLE XI
CLUSTERING EVALUATION ON SIX DATA SETS

in Fig. 2, where the x-axis refers to the data sets Movie, KM with CADO, and SC with SMD, SC with OFDSC Balloon, Soybean-small, Zoo, Hayesroth, Voting, Breast- with ADD, SC with CADO. The clustering performance is cancer, Tic, Letter, and Mushroom, respectively. They are evaluated by comparing the obtained cluster of each object ordered according to the number of objects involved (i.e., m) with that provided by the data label in terms of accuracy to describe distinct data scales, ranging from 6 to 8124. (AC) and normalized mutual information (NMI) [27], which As discussed previously, larger RD, larger DI, smaller DBI, are essentially the external criteria compared with the internal and smaller SD indicate better characterization of the cluster criterion analysis in Section VIII-A. The AC ∈[0, 1] is a differentiation capability, which corresponds to a better dissim- degree of closeness between the obtained clusters and its actual ilarity metric being induced. From Fig. 2, we observe that, with data labels, while NMI ∈[0, 1] is a quantity that measures the the exception of a few items, the corresponding RD and DI mutual dependence of two variables: clusters and labels. The indexes on CADO are mostly the largest ones when compared larger AC or NMI is, the better the clustering is, and the better with those on SMD, OFD, and ADD; while the associated DBI the corresponding dissimilarity metric is. and SD index curves on CADO are mostly below the other Table XI reports the results on six data sets with differ- three curves. The results show that our proposed CADO is ent |U|, ranging from 15 to 699 in the increasing order. The better than SMD and OFD in terms of differentiating objects performance of the aforementioned eight schemes is evaluated in distinct clusters. ADD also seems to be slightly better than on AC and NMI individually. Followed by Laplacian Eigen- SMD and OFD in most cases. The degrees of improvement of maps, the subspace dimensions are determined by the number CADO upon SMD, OFD, and ADD mainly depend on data of labels in SC. For each data, the average performance structure rather than on data scale |U|(= m) alone. is computed over 100 tests for KM and SC with distinct start points. Note that the highest measure score of each B. Data Clustering Evaluation experimental setting is highlighted in boldface. To demonstrate the effectiveness of our proposed CADO As Table XI indicates, the clustering methods with CADO, and CASO in applications, we conduct two groups of exper- whether KM or SC, outperform those with SMD, OFD, iments: 1) k-modes (KM) and spectral clustering (SC) and and ADD on both AC and NMI. In addition, CADO is 2) ROCK and CROCK. The former compares two classical better than ADD for measuring clustering quality, ADD is clustering methods based on four dissimilarity metrics on six in general superior to OFD, OFD performs better than SMD data sets. The latter considers the clustering quality of the for most cases. These findings are consistent with the results adapted method CROCK by integrating our proposed CASO uncovered in Section VIII-A. In addition, Ahmad and Dey [12] with the categorical clustering algorithm ROCK [16]. also evidenced that their proposed metric ADD outperforms 1) KM and SC: One of the clustering approaches is the SMD in terms of KM clustering. Thus, we only analyze the KM algorithm [5], designed to cluster categorical data sets. performance improvement of CADO upon ADD in details. 
As Table XI indicates, the clustering methods with CADO, whether KM or SC, outperform those with SMD, OFD, and ADD on both AC and NMI. In addition, CADO is better than ADD for measuring clustering quality, ADD is in general superior to OFD, and OFD performs better than SMD in most cases. These findings are consistent with the results uncovered in Section VIII-A. Ahmad and Dey [12] also evidenced that their proposed metric ADD outperforms SMD in terms of KM clustering. Thus, we only analyze the performance improvement of CADO upon ADD in detail. The reason is that the effect of the inter-coupled interaction of categorical attributes is generally stronger than that of the intra-coupled interaction, and the consideration of a complete coupling relationship leads to the largest improvement in clustering accuracy, since it discloses the implicit whole structure hidden in the data.

For KM, the AC improving rate ranges from 5.56% (Balloon) to 16.50% (Zoo), while the NMI improving rate falls within 4.76% (Soybean-s, i.e., Soybean-small) and 37.38% (Breastcancer).

With regard to SC, the former rate takes its minimal and maximal ratios as 4.21% (Balloon) and 20.84% (Soybean-l, i.e., Soybean-large), respectively, while the latter rate belongs to [5.45% (Soybean-l), 38.12% (Shuttle)]. The AC and NMI evaluate clustering quality from different aspects; generally, they take their minimal and maximal ratios on distinct data sets. Statistical analysis, namely the t-test, has been performed on AC and NMI at a 95% significance level. The null hypothesis that CADO is better than ADD in terms of AC and NMI is accepted. Another significant observation is that SC mostly outperforms KM under the same dissimilarity metric; this is consistent with the finding in [26] that SC very often outperforms k-means for numerical data.

2) ROCK and CROCK: ROCK, proposed by Guha et al. [16], is a robust clustering algorithm for categorical attributes. A link-based similarity measure between two data points is defined based on the neighborhood relation of the two data points, rather than on their distance or similarity to other data points in the data set.

During the process of choosing neighbors for each data object, Guha et al. [16] simply considered the Jaccard coefficient to capture the closeness between each pair of data objects, followed by the determination of neighbors with a user-defined threshold parameter. Their algorithm mainly focuses on the coupled relationship among objects, without any concern for the coupled relationships among attributes and their values. Therefore, we propose to replace the Jaccard coefficient with our proposed coupled nominal similarity CASO and to construct a coupled ROCK (CROCK) algorithm by considering both coupled objects and coupled attribute values. Specifically, we regard two data objects $u_x$ and $u_y$ to be neighbors if $\mathrm{CASO}(u_x, u_y)/n \ge \theta$, instead of $|u_x \cap u_y|/|u_x \cup u_y| \ge \theta$ as presented in [16]; this change is isolated in the sketch below. The other procedures and functions remain the same as in [16].
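The only change CROCK makes at this stage is the neighbor test. In the following sketch, `caso` stands for an implementation of the coupled similarity CASO and is assumed available rather than defined here; `n_attrs` is the number of attributes.

```python
import numpy as np

def neighbor_matrix(objects, sim, theta):
    """Boolean neighbor matrix: i and j are neighbors when sim(i, j) >= theta.
    ROCK instantiates sim with the Jaccard coefficient; CROCK replaces it
    with the normalized coupled similarity CASO(ux, uy) / n."""
    m = len(objects)
    nbr = np.zeros((m, m), dtype=bool)
    for i in range(m):
        for j in range(i, m):
            nbr[i, j] = nbr[j, i] = sim(objects[i], objects[j]) >= theta
    return nbr

def jaccard(ux, uy):
    """Original ROCK closeness between two objects viewed as value sets."""
    return len(set(ux) & set(uy)) / len(set(ux) | set(uy))

# rock_nbr  = neighbor_matrix(data, jaccard, theta=0.75)
# crock_nbr = neighbor_matrix(data, lambda x, y: caso(x, y) / n_attrs, theta=0.75)
```

Everything downstream of the neighbor matrix (link counting and merging) is unchanged, which is why the comparison below attributes quality differences to the similarity alone.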
Below, we experiment with five real-life data sets, i.e., Movie, Hayesroth, SPECT, Voting, and Mushroom, to compare the cluster quality of ROCK and CROCK in terms of three measures: 1) precision (Pr); 2) recall (Re); and 3) specificity (Sp) [2], [17]. As described in [17], the larger these indexes, the better the clustering. Due to the high computational complexity, the number of runs for each experiment here is set to 20 to obtain average results for the evaluation measures.

TABLE XII: CROCK VERSUS ROCK ON UCI DATA SETS

Table XII shows the results of both algorithms on the quality measures. We choose the parameters to obtain the best results, such as θ = 0.75 for Voting. As this table indicates, the adapted CROCK with our proposed CASO outperforms the original ROCK on almost all the evaluation measures. Statistical testing at a 95% significance level also supports the finding on Pr, Re, and Sp that CROCK performs better than ROCK. Thus, CROCK's quality is verified to be superior to that of ROCK, owing to the fact that the former considers both the couplings between attributes with their values (through co-occurrence) and those between objects (by links).

C. Intra-Attribute Value Clustering

In this part, we present the results of applying CASO to the problem of intra-attribute value clustering. We use the bibliographic data taken from publicly accessible bibliographic databases with 720 research papers [14]. Some 190 papers focus on database research, and the remaining 530 papers are written on theoretical computer science and related fields. For each paper, we record the name of the first author, the name of the second author, the name of the conference/journal, and the year of publication. We are interested in clustering the first authors, as well as the conferences/journals.

As mentioned in Section II, STIRR applies an iterative method based on a linear dynamic system to assign and propagate weights on the categorical values [14] to conduct the intra-attribute value clustering, while LIMBO defines a distance between attribute values on the basis of the IB framework to quantify the degree of interchangeability of attribute values within a single attribute and group them [17]. LIMBO reveals that two attribute values are similar if the contexts in which they appear are similar; it is an alternative way to explicate the inter-coupled interactions among different attributes, but it lacks the consideration of the intra-coupled interactions within each attribute. So, we substitute our proposed CADO in (16) for the distance δI(c_i, c_j) described by the information loss (i.e., the Jensen–Shannon divergence) in [17]: CADO can be extended to measure the coupled distance between clusters by replacing an object u_i with a cluster c_i, and a coupled version of LIMBO, i.e., CLIMBO, is then naturally induced.

TABLE XIII: CLUSTERING QUALITIES FOR FIRST AUTHOR

Below, two experiments are conducted to compare these algorithms on the intra-attribute value clustering. The parameters are specified in the second column of Table XIII. For STIRR, M and N are the numbers of initial configurations and iterations, respectively. For LIMBO and CLIMBO, φ indicates the size bound, S refers to the accuracy bound, and the addition operator is used. Note that the experiments in this part run only 20 times to produce the average results, since the algorithms themselves are computationally costly.

The first experiment is designed to cluster the first authors of the 720 academic papers, and the labels for evaluation are the pre-known research fields: 1) database research (190 papers) and 2) theoretical computer science (530 papers).

Note that all authors are identified by their last names, so that, for instance, an attribute value Wang actually represents several Wangs taken together. In addition, the second author is regarded as being the same as the first author if the research paper has only one author. These two aspects lead to the overall modest clustering quality. STIRR, LIMBO, and CLIMBO are compared on the intra-attribute value clustering results of the attribute first author with regard to Pr, Sp [2], and AC [17]. Table XIII shows that CLIMBO is the best in terms of Pr and AC, and is comparable with LIMBO on Sp. All the results on Pr, Sp, and AC are supported by a statistical significance test at a 95% significance level.

Fig. 3. Clusterings for conferences/journals.

We now turn to the problem of clustering the conferences/journals. Fig. 3 shows the clusters produced by STIRR, LIMBO, and CLIMBO. The x-axis represents the academic papers, while the y-axis denotes the publishing venues. The thick horizontal line separates the clusters of conferences/journals, and the thick vertical line distinguishes between database-research-related papers (on the left) and theoretical-computer-science-related papers (on the right). If an author has published a paper in a particular venue, this is represented by a point. From this figure, it is clear that CLIMBO yields the best partition, followed by LIMBO, while STIRR performs worst. However, even the clustering of CLIMBO is slightly mistaken for the conferences/journals between indexes 50 and 60, which is due to the influence of their co-authors.

The above two experiments therefore reveal that CLIMBO is better than LIMBO and STIRR on the clustering quality of intra-attribute values. In addition, LIMBO can also be clearly observed to outperform STIRR, which is consistent with the conclusion drawn in [17].

In summary: 1) the inter-coupled relative similarity measures IRSP, IRSU, IRSJ, and IRSI all present the same learning accuracy, but IRSI is the most efficient, especially for large-scale data; 2) our proposed object dissimilarity metric CADO is better than the others, i.e., the traditional SMD, the frequency-distribution-only OFD, and the dependency-aggregation-only ADD, for categorical data in terms of data structure analysis and clustering quality; and 3) the incorporation of CASO or CADO into existing categorical clustering algorithms, such as overlap-based methods (e.g., KM and ROCK), context-based methods (e.g., STIRR), and information-theoretic methods (e.g., LIMBO), can greatly lift their performance.

IX. DISCUSSION

Below, we discuss the potential opportunities triggered by our proposed CASV, CASO, and CADO. The degenerative (first) aspect discusses the degeneration of CADO and CASV to special cases, while the extended (second) aspect focuses on the direct extension of CASO and CADO.

Degenerative Aspect: Many existing similarity measures for attribute values are special cases of our proposed CADO or CASV. On one hand, CADO degenerates to an intra-attribute-independence measure if the frequency functions $G_j(\{v_j^x\})$ and $G_j(\{v_j^y\})$ take a nonzero constant value $\xi$. In this way, the dissimilarity measure ADD between $v_j^x$ and $v_j^y$ proposed in [12] is exactly $\xi/2 \cdot \mathrm{CADO}$, which considers the interactions between attributes but lacks the couplings within each attribute. On the other hand, an inter-attribute-independence measure is produced by considering $\delta_j^{Ie}(v_j^x, v_j^y, \{V_k\}_{k \ne j})$ for IeASV, in which $\delta_{j|j}^{I}(v_j^x, v_j^y, V_j)$ replaces $\delta_{j|k}^{I}(v_j^x, v_j^y, V_k)$ $(k \ne j)$ for IRSI. Such an example is the improved SMD with frequency [5]. In addition, an intra-inter-attribute-independence measure is obtained by specializing both $G_j(\{v_j^x\}) = G_j(\{v_j^y\}) = \xi$ and $\delta_j^{Ie}(v_j^x, v_j^y, \{V_k\}_{k \ne j})$, which corresponds to the classical similarity measure SMS and its variants, such as the Jaccard coefficients [5]. So, our proposed measures generalize the existing similarity measures that assume independence or partial dependence among attributes.

Extended Aspect: The couplings or relationships between attribute values, attributes, objects, and even clusters should be considered to cater for the interactions among the data. We have already proposed a coupled discretization algorithm CD [28], which concerns both the information dependency and the deterministic relationship to disclose the couplings of uncertainty and certainty. A coupled framework for clustering ensembles has been reported in [29] by considering both the relationships within each base clustering and the interactions between distinct base clusterings, in which CASO or CADO is applied. Coupled attribute analysis [30] has also been carried out to quantify the relationships among continuous data. In addition, how to appropriately choose the weights $\alpha_k$ for IeASV defined in (13), rather than simply treating them as equal, is in great need of further exploration. Further, we are also working on a flexible way to control the respective importance of IaASV and IeASV by using corresponding weights $\beta$ and $\gamma$, according to the specific data structure. Other data mining and machine learning tasks, e.g., fraud detection [1] and relational learning [31], can also be considered to involve coupled interactions.

X. CONCLUSION

We have proposed CASO, a novel data-driven coupled attribute similarity measure for objects, incorporating both IaASV and the inter-coupled attribute similarity for values in unsupervised learning on nominal data. The measure involves both the attribute value frequency distribution (intra-coupling) and the attribute dependency aggregation (inter-coupling), as well as the interaction of the two, which captures a global picture of the similarity and has been shown to improve the learning accuracy of diverse similarity measures.

Theoretical analysis has shown that the inter-coupled relative similarity measure IRSI significantly outperforms the other options (IRSP, IRSU, and IRSJ) in terms of efficiency, while maintaining equal accuracy. In addition, our derived dissimilarity metric is more general and accurate in capturing the internal structures of the predefined clusters and the clustering quality, in accordance with intensive empirical results. Very substantial experiments on accuracy have been conducted on the data structure and the clustering performance by incorporating the proposed similarity. These have clearly shown that the proposed coupled nominal similarity leads to more accurate learning performance on large-scale categorical data sets, supported by statistical analysis. The reason is that our proposed measure is global, as a result of effectively integrating different aspects of the similarity.

We are currently applying the CASO measure with IRSI to attribute discretization, clustering ensembles, and other data mining and machine learning tasks. We are working on the assignment of attribute weights and on the flexible engagement of IaASV and IeASV. We are designing strategies of attribute reduction to fit extremely large data. In addition, the proposed concepts of the inter-information function and the information conditional probability have the potential to be used in other applications. Flexible dissimilarity measures can also be built on our fundamental similarity building blocks according to different requirements.

APPENDIX

Proof (a) [Theorem 1(a), Definition 5.1]: The intra-coupled attribute similarity for values (IaASV) between the values $v_j^x$ and $v_j^y$ of attribute $a_j$ satisfies $\delta_j^{Ia}(v_j^x, v_j^y) \in [1/3,\ m/(m+4)]$.

Proof 1: According to Definition 5.1, $1 \le |G_j(v_j^x)|, |G_j(v_j^y)| \le m$ holds; then

$$\delta_j^{Ia}(v_j^x, v_j^y) = \frac{|G_j(v_j^x)| \cdot |G_j(v_j^y)|}{|G_j(v_j^x)| + |G_j(v_j^y)| + |G_j(v_j^x)| \cdot |G_j(v_j^y)|} = \frac{1}{|G_j(v_j^x)|^{-1} + |G_j(v_j^y)|^{-1} + 1} \le \frac{1}{2\big(|G_j(v_j^x)| \cdot |G_j(v_j^y)|\big)^{-1/2} + 1}.$$

On one hand, $\delta_j^{Ia}(v_j^x, v_j^y)$ is a monotonically increasing function of $|G_j(v_j^x)|$ and of $|G_j(v_j^y)|$, so it takes its minimum value $1/3$ when $|G_j(v_j^x)| = |G_j(v_j^y)| = 1$. On the other hand, because $2 \le |G_j(v_j^x)| + |G_j(v_j^y)| \le m$ and by the above upper bound, $\delta_j^{Ia}(v_j^x, v_j^y)$ takes its maximum value $m/(m+4)$ when $|G_j(v_j^x)| = |G_j(v_j^y)| = m/2$. Thus, considering both aspects, we have $\delta_j^{Ia}(v_j^x, v_j^y) \in [1/3,\ m/(m+4)]$. ∎
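As a quick numerical companion to Proof (a), the brute-force check below verifies the stated range over all admissible frequency pairs; it is an illustration we added, with the set sizes |G_j(·)| abstracted to integer counts gx and gy.

```python
from itertools import product

def iaasv(gx, gy):
    """IaASV from the value frequencies gx = |G_j(v^x)|, gy = |G_j(v^y)|."""
    return gx * gy / (gx + gy + gx * gy)

m = 20
assert abs(iaasv(1, 1) - 1 / 3) < 1e-12                  # minimum: both occur once
assert abs(iaasv(m // 2, m // 2) - m / (m + 4)) < 1e-12  # maximum: an even split

# Exhaustive check of Theorem 1(a): 1/3 <= IaASV <= m/(m+4) whenever
# 1 <= gx, gy and gx + gy <= m (two distinct values cannot exceed m objects).
for gx, gy in product(range(1, m), repeat=2):
    if gx + gy <= m:
        assert 1 / 3 - 1e-12 <= iaasv(gx, gy) <= m / (m + 4) + 1e-12
```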
j j j j If there exists k > 0, such that Therefore, δIa vx , vy takes its minimum value 1/3when j j j x y x = y = Pl| j {li }|v ≥ Pl| j {li }|v G j v j G j v j 1. j j ≤ x + On the other hand, because of both 2 G j v j holds for 1 ≤ i ≤ k < n and y ≤ G j v m and the above function property, then x y j P | {l }|v < P | {l }|v δIa x , y /( + ) l j i j l j i j j v j vj takes its maximum value m m 4 when x = y = / k + ≤ i ≤ n F L G j v j G j v j m 2. holds for 1 ,then takes its maximal value: | x − | y Thus, considering both aspects above, we have l∈L Pl| j l v j Pl| j l v j . ≤ ≤ < If for all 1 i k n δIa x , y ∈ 1, m . j v j v j { }| x < { }| y 3 (m + 4) Pl| j li v j Pl| j li v j WANG et al.: COUPLED ATTRIBUTE SIMILARITY LEARNING ON CATEGORICAL DATA 795

Proof (c) [Definition 5.2]: The conversion is conducted from (7) to (8) via (9), i.e., from $D_{j|L}(v_j^x, v_j^y) = 2\max_{L'\subseteq L}|P_{l|j}(L'|v_j^x) - P_{l|j}(L'|v_j^y)|$ to $\delta^P_{j|k} = \min_{V_k'\subseteq V_k}\{2 - P_{k|j}(V_k'|v_j^x) - P_{k|j}(V_k''|v_j^y)\}$, where $V_k'' = V_k \setminus V_k'$.

Proof: The whole conversion procedure is divided into four steps.

1) The multiplier 2 in $D_{j|L}(v_j^x, v_j^y)$ is omitted:
$$D^{(1)}_{j|L}(v_j^x, v_j^y) = \max_{L'\subseteq L} \big|P_{l|j}(L'|v_j^x) - P_{l|j}(L'|v_j^y)\big|.$$

2) Labels are replaced with the values of a particular attribute $a_k$:
$$D^{(2)}_{j|k}(v_j^x, v_j^y) = \max_{V_k'\subseteq V_k} \big|P_{k|j}(V_k'|v_j^x) - P_{k|j}(V_k'|v_j^y)\big|.$$

3) The complementary set $V_k''$ rather than the original one is used for $v_j^y$ in ICP, based on $P_{k|j}(V_k''|v_j^y) = 1 - P_{k|j}(V_k'|v_j^y)$:
$$D^{(3)}_{j|k}(v_j^x, v_j^y) = \max_{V_k'\subseteq V_k} \big[P_{k|j}(V_k'|v_j^x) + P_{k|j}(V_k''|v_j^y) - 1\big],$$
which is $D'_{j|k}(v_j^x, v_j^y)$ as formalized in (9).

4) Dissimilarity is considered rather than similarity; we use $\delta^P_{j|k} = 1 - D'_{j|k}(v_j^x, v_j^y)$ for simplicity:
$$D^{(4.1)}_{j|k}(v_j^x, v_j^y) = 1 - D^{(3)}_{j|k}(v_j^x, v_j^y) = 1 - \max_{V_k'\subseteq V_k} \big[P_{k|j}(V_k'|v_j^x) + P_{k|j}(V_k''|v_j^y) - 1\big].$$

If $P_{k|j}(V_k'|v_j^x) + P_{k|j}(V_k''|v_j^y) - 1 \ge 0$, then we have
$$D^{(4.2)}_{j|k}(v_j^x, v_j^y) = \min_{V_k'\subseteq V_k} \big[2 - P_{k|j}(V_k'|v_j^x) - P_{k|j}(V_k''|v_j^y)\big],$$
according to the fact that $1 - \max(|f(x)|) = \min(1 - f(x))$ for all $f(x) \ge 0$ $(x \in \mathbb{R})$, where $f$ is a function and $\mathbb{R}$ is the real number field.

If $P_{k|j}(V_k'|v_j^x) + P_{k|j}(V_k''|v_j^y) - 1 < 0$, we alternatively use the complements $\widetilde{V}_k' = V_k \setminus V_k' = V_k''$ and $\widetilde{V}_k'' = V_k'$. Then we have
$$D^{(4.1')}_{j|k}(v_j^x, v_j^y) = 1 - \max_{\widetilde{V}_k'\subseteq V_k} \big[P_{k|j}(\widetilde{V}_k'|v_j^x) + P_{k|j}(\widetilde{V}_k''|v_j^y) - 1\big].$$
Since $P_{k|j}(\widetilde{V}_k'|v_j^x) = 1 - P_{k|j}(V_k'|v_j^x)$ and $P_{k|j}(\widetilde{V}_k''|v_j^y) = 1 - P_{k|j}(V_k''|v_j^y)$, we have
$$P_{k|j}(\widetilde{V}_k'|v_j^x) + P_{k|j}(\widetilde{V}_k''|v_j^y) - 1 > 0.$$
Hence, we have
$$D^{(4.2')}_{j|k}(v_j^x, v_j^y) = \min_{\widetilde{V}_k'\subseteq V_k} \big[2 - P_{k|j}(\widetilde{V}_k'|v_j^x) - P_{k|j}(\widetilde{V}_k''|v_j^y)\big],$$
according to the fact that $1 - \max(|f(x)|) = \min(1 + f(x))$ for all $f(x) \le 0$.

We can see that $D^{(4.1)}_{j|k}(v_j^x, v_j^y) = D^{(4.1')}_{j|k}(v_j^x, v_j^y)$. Therefore, we have obtained
$$D^{(4.1)}_{j|k}(v_j^x, v_j^y) = D^{(4.1')}_{j|k}(v_j^x, v_j^y) = D^{(4.2)}_{j|k}(v_j^x, v_j^y) = D^{(4.2')}_{j|k}(v_j^x, v_j^y).$$
By following the above four steps, we have successfully converted from (7) to (8) via (9): from $D_{j|L}(v_j^x, v_j^y)$ to $D^{(4.2)}_{j|k}(v_j^x, v_j^y)$ or $D^{(4.2')}_{j|k}(v_j^x, v_j^y)$ via $D^{(3)}_{j|k}(v_j^x, v_j^y)$, i.e., $D'_{j|k}(v_j^x, v_j^y)$. ∎
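Before the formal argument of Proof (d) below, the equivalence claimed by Theorem 6.1 can be spot-checked numerically: the sketch computes all four candidates from two conditional probability vectors, with IRSP enumerating the power set (exponential) while IRSI needs a single linear pass, which is the efficiency gap the theorem formalizes. The vectors are our own toy values, and the formulas follow the identities used in the proof.

```python
from itertools import chain, combinations

def irs_candidates(p, q):
    """IRSP, IRSU, IRSJ, IRSI for p = P(.|v^x), q = P(.|v^y) over V_k."""
    idx = range(len(p))
    power = chain.from_iterable(combinations(idx, r) for r in range(len(p) + 1))
    # IRSP: minimize over every subset, with the complement taken on q (Eq. (8)).
    irsp = min(2 - sum(p[i] for i in s) - (1 - sum(q[i] for i in s))
               for s in power)
    # IRSU: one max-term per value in the universal set V_k.
    irsu = 2 - sum(max(pi, qi) for pi, qi in zip(p, q))
    # IRSJ: restrict to values co-occurring with v^x or v^y (joint set).
    irsj = 2 - sum(max(pi, qi) for pi, qi in zip(p, q) if pi > 0 or qi > 0)
    # IRSI: min-terms vanish outside the intersection set, so one pass suffices.
    irsi = sum(min(pi, qi) for pi, qi in zip(p, q))
    return irsp, irsu, irsj, irsi

p, q = [0.5, 0.2, 0.2, 0.1, 0.0], [0.1, 0.4, 0.3, 0.0, 0.2]
assert len(set(round(v, 12) for v in irs_candidates(p, q))) == 1  # all four equal
```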

Proof (d) [Theorem 3(d), Theorem 6.1]: IRSP, IRSU, IRSJ, and IRSI are all equivalent to one another.

Proof: Part (I): IRSP ⟺ IRSU. Let $V_k^{*}$ be the value set of attribute $a_k$ that makes $P_{k|j}(V_k'|v_j^x) + P_{k|j}(V_k''|v_j^y)$ maximal over $V_k' \subseteq V_k$, where $V_k'' = V_k \setminus V_k'$, and let $\overline{V_k^{*}} = V_k \setminus V_k^{*}$. Below, we show that $P_{k|j}(\{v_k\}|v_j^x) \ge P_{k|j}(\{v_k\}|v_j^y)$ holds for every $v_k \in V_k^{*}$. If there existed $v_k^z \in V_k^{*}$ satisfying $P_{k|j}(\{v_k^z\}|v_j^x) < P_{k|j}(\{v_k^z\}|v_j^y)$, then setting $V_k^{**} = V_k^{*}\setminus\{v_k^z\}$ and $\overline{V_k^{**}} = \overline{V_k^{*}} \cup \{v_k^z\}$, it directly follows that

$$P_{k|j}(V_k^{**}|v_j^x) + P_{k|j}(\overline{V_k^{**}}|v_j^y) > P_{k|j}(V_k^{*}|v_j^x) + P_{k|j}(\overline{V_k^{*}}|v_j^y),$$

which contradicts the maximality of $V_k^{*}$. Similarly, for any $v_k \in \overline{V_k^{*}}$, $P_{k|j}(\{v_k\}|v_j^x) \le P_{k|j}(\{v_k\}|v_j^y)$ holds. Hence

$$\delta^P_{j|k}(v_j^x, v_j^y) = \min_{V_k'\subseteq V_k}\big[2 - P_{k|j}(V_k'|v_j^x) - P_{k|j}(V_k''|v_j^y)\big] = 2 - \big[P_{k|j}(V_k^{*}|v_j^x) + P_{k|j}(\overline{V_k^{*}}|v_j^y)\big]$$
$$= 2 - \Big[\sum_{v_k\in V_k^{*}} \max\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\} + \sum_{v_k\in \overline{V_k^{*}}} \max\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\}\Big]$$
$$= 2 - \sum_{v_k\in V_k} \max\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\} = \delta^U_{j|k}(v_j^x, v_j^y).$$

Part (II): IRSU ⟺ IRSJ. Note that in the following Parts (II) and (III), $v_k \in v^x\setminus v^y$ and $v_k \in v^y\setminus v^x$ are the abbreviated forms of $v_k \in \varphi_{j\to k}(v_j^x)\setminus\varphi_{j\to k}(v_j^y)$ and $v_k \in \varphi_{j\to k}(v_j^y)\setminus\varphi_{j\to k}(v_j^x)$, respectively, and $\cap$ abbreviates $\varphi_{j\to k}(v_j^x)\cap\varphi_{j\to k}(v_j^y)$. Given $v_k \notin \varphi_{j\to k}(v_j^x) \cup \varphi_{j\to k}(v_j^y)$, that is, $v_k \notin \varphi_{j\to k}(v_j^x)$ and $v_k \notin \varphi_{j\to k}(v_j^y)$: if $v_k \notin \varphi_{j\to k}(v_j^x)$, we then have $g_k^{*}(\{v_k\}) \cap g_j(v_j^x) = \emptyset$, so $P_{k|j}(\{v_k\}|v_j^x) = 0$; similarly, $P_{k|j}(\{v_k\}|v_j^y) = 0$. Therefore,

$$\delta^U_{j|k}(v_j^x, v_j^y) = 2 - \sum_{v_k\in V_k}\max\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\} = 2 - \sum_{v_k\in \varphi_{j\to k}(v_j^x)\cup\varphi_{j\to k}(v_j^y)}\max\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\} = \delta^J_{j|k}(v_j^x, v_j^y).$$

Part (III): IRSJ ⟺ IRSI. If $v_k \in \varphi_{j\to k}(v_j^x)\setminus\varphi_{j\to k}(v_j^y)$, then $P_{k|j}(\{v_k\}|v_j^y) = 0$; accordingly, $\max\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\} = P_{k|j}(\{v_k\}|v_j^x)$. Similarly, if $v_k \in \varphi_{j\to k}(v_j^y)\setminus\varphi_{j\to k}(v_j^x)$, it indicates $\max\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\} = P_{k|j}(\{v_k\}|v_j^y)$. Therefore,

$$\delta^J_{j|k}(v_j^x, v_j^y) = 2 - \Big[\sum_{v_k\in v^x\setminus v^y} P_{k|j}(\{v_k\}|v_j^x) + \sum_{v_k\in v^y\setminus v^x} P_{k|j}(\{v_k\}|v_j^y) + \sum_{v_k\in\cap}\max\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\}\Big]$$
$$= 2 - \Big[\Big(1 - \sum_{v_k\in\cap} P_{k|j}(\{v_k\}|v_j^x)\Big) + \Big(1 - \sum_{v_k\in\cap} P_{k|j}(\{v_k\}|v_j^y)\Big) + \sum_{v_k\in\cap}\max\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\}\Big]$$
$$= \sum_{v_k\in\cap}\big[P_{k|j}(\{v_k\}|v_j^x) + P_{k|j}(\{v_k\}|v_j^y) - \max\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\}\big]$$
$$= \sum_{v_k\in\cap}\min\{P_{k|j}(\{v_k\}|v_j^x), P_{k|j}(\{v_k\}|v_j^y)\} = \delta^I_{j|k}(v_j^x, v_j^y). \qquad ∎$$

ACKNOWLEDGMENT

The authors would like to thank Mr. Xin Cheng and Mr. Zhong She for their assistance in coding the algorithms ROCK, LIMBO, and STIRR. They would also like to thank the associate editor and all the anonymous reviewers for their invaluable comments on this paper.

REFERENCES

[1] L. Cao, Y. Ou, and P. S. Yu, "Coupled behavior analysis with applications," IEEE Trans. Knowl. Data Eng., vol. 24, no. 8, pp. 1378–1392, Aug. 2012.
[2] F. Figueiredo, L. Rocha, T. Couto, T. Salles, M. A. Gonçalves, and W. Meira, Jr., "Word co-occurrence features for text classification," Inform. Syst., vol. 36, no. 5, pp. 843–858, 2011.
[3] G. Wang, D. Hoiem, and D. Forsyth, "Learning image similarity from Flickr groups using fast kernel machines," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2177–2188, Nov. 2012.
[4] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New York, NY, USA: Wiley, 1990.
[5] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications. Philadelphia, PA, USA: SIAM, 2007.
[6] G. Das and H. Mannila, "Context-based similarity measures for categorical databases," in Proc. 4th Eur. Conf. Principles Pract. Knowl. Discovery Databases, Lyon, France, Sep. 2000, pp. 201–210.
[7] T. Li, M. Ogihara, and S. Ma, "On combining multiple clusterings: An overview and a new perspective," Appl. Intell., vol. 33, no. 2, pp. 207–219, 2009.

[8] D. R. Wilson and T. R. Martinez, "Improved heterogeneous distance functions," J. Artif. Intell. Res., vol. 6, no. 1, pp. 1–34, 1997.
[9] S. Cost and S. Salzberg, "A weighted nearest neighbor algorithm for learning with symbolic features," Mach. Learn., vol. 10, no. 1, pp. 57–78, 1993.
[10] S. Boriah, V. Chandola, and V. Kumar, "Similarity measures for categorical data: A comparative evaluation," in Proc. SIAM Int. Conf. Data Mining, Atlanta, GA, USA, Apr. 2008, pp. 243–254.
[11] D. Lin, "An information-theoretic definition of similarity," in Proc. 15th Int. Conf. Mach. Learn., Madison, WI, USA, Jul. 1998, pp. 296–304.
[12] A. Ahmad and L. Dey, "A k-mean clustering algorithm for mixed numeric and categorical data," Data Knowl. Eng., vol. 63, no. 2, pp. 503–527, 2007.
[13] L. A. Ribeiro and T. Harder, "Generalizing prefix filtering to improve set similarity joins," Inform. Syst., vol. 36, no. 1, pp. 62–78, 2011.
[14] D. Gibson, J. Kleinberg, and P. Raghavan, "Clustering categorical data: An approach based on dynamical systems," VLDB J., vol. 8, no. 3, pp. 222–236, 2000.
[15] M. E. Houle, V. Oria, and U. Qasim, "Active caching for similarity queries based on shared-neighbor information," in Proc. 19th Int. Conf. Inform. Knowl. Manage., Oct. 2010, pp. 669–678.
[16] S. Guha, R. Rastogi, and K. Shim, "ROCK: A robust clustering algorithm for categorical attributes," Inform. Syst., vol. 25, no. 5, pp. 345–366, 2000.
[17] P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik, "LIMBO: Scalable clustering of categorical data," in Proc. 9th Int. Conf. Extending Database Technol., Heraklion, Greece, Mar. 2004, pp. 123–146.
[18] D. Barbará, J. Couto, and Y. Li, "COOLCAT: An entropy-based algorithm for categorical clustering," in Proc. 11th Int. Conf. Inform. Knowl. Manage., McLean, VA, USA, Nov. 2002, pp. 582–589.
[19] Y. Yang, X. Guan, and J. You, "CLOPE: A fast and effective clustering algorithm for transactional data," in Proc. 8th ACM SIGKDD Conf. Knowl. Discovery Data Mining, Edmonton, AB, Canada, Jul. 2002, pp. 682–687.
[20] M. J. Zaki, M. Peters, I. Assent, and T. Seidl, "Clicks: An effective algorithm for mining subspace clusters in categorical datasets," in Proc. 11th ACM SIGKDD Conf. Knowl. Discovery Data Mining, Chicago, IL, USA, Aug. 2005, pp. 736–742.
[21] A. L. Gibbs and F. E. Su, "On choosing and bounding probability metrics," Int. Statist. Rev., vol. 70, no. 3, pp. 419–435, 2002.
[22] V. Ganti, J. Gehrke, and R. Ramakrishnan, "CACTUS—Clustering categorical data using summaries," in Proc. 5th ACM SIGKDD Conf. Knowl. Discovery Data Mining, San Diego, CA, USA, Aug. 1999, pp. 73–83.
[23] K. P. Hollingsworth, K. W. Bowyer, and P. J. Flynn, "Improved iris recognition through fusion of Hamming distance and fragile bit distance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 12, pp. 2465–2476, Dec. 2011.
[24] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-1, no. 2, pp. 224–227, Apr. 1979.
[25] J. C. Dunn, "Well-separated clusters and optimal fuzzy partitions," J. Cybern., vol. 4, no. 1, pp. 95–104, 1974.
[26] U. von Luxburg, "A tutorial on spectral clustering," Statist. Comput., vol. 17, no. 4, pp. 395–416, 2007.
[27] D. Cai, X. He, and J. Han, "Document clustering using locality preserving indexing," IEEE Trans. Knowl. Data Eng., vol. 17, no. 12, pp. 1624–1637, Dec. 2005.
[28] C. Wang, M. Wang, Z. She, and L. Cao, "CD: A coupled discretization algorithm," in Proc. 16th Pacific-Asia Conf. Knowl. Discovery Data Mining, Kuala Lumpur, Malaysia, May 2012, pp. 407–418.
[29] C. Wang, Z. She, and L. Cao, "Coupled clustering ensemble: Incorporating coupling relationships both between base clusterings and objects," in Proc. 29th Int. Conf. Data Eng., Brisbane, Australia, Apr. 2013, pp. 374–385.
[30] C. Wang, Z. She, and L. Cao, "Coupled attribute analysis on numerical data," in Proc. 23rd Int. Joint Conf. Artif. Intell., Beijing, China, Aug. 2013, pp. 1736–1742.
[31] L. Getoor and B. Taskar, Introduction to Statistical Relational Learning. Cambridge, MA, USA: MIT Press, 2007.

Can Wang received the B.Sc. and M.Sc. degrees from the Faculty of Mathematics and Statistics, Wuhan University, Wuhan, China, in 2007 and 2009, respectively, and the Ph.D. degree in computing sciences from the Advanced Analytics Institute, University of Technology, Sydney, NSW, Australia, in 2013. She is currently a Post-Doctoral Fellow with the Commonwealth Scientific and Industrial Research Organisation, Australia. Her current research interests include behavior analytics, data mining, machine learning, and knowledge representation.

Xiangjun Dong received the Ph.D. degree in computing sciences from the Beijing Institute of Technology, Beijing, China. He is currently a Professor with the School of Information, Qilu University of Technology, Jinan, China. His current research interests include data mining, machine learning, data integration, and database techniques.

Fei Zhou received the B.Eng. degree from the Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China, in 2007, and the Ph.D. degree from the Department of Electronics Engineering, Tsinghua University, Beijing, China, in 2013. He has been a Post-Doctoral Fellow with the Shenzhen Graduate School, Tsinghua University, since 2013. His current research interests include image processing and pattern recognition in video surveillance, image superresolution, image interpolation, image quality assessment, and object tracking.

Longbing Cao (SM'06) received the Ph.D. degrees in pattern recognition and intelligent systems, and in computing sciences. He is currently a Professor with the University of Technology, Sydney, NSW, Australia, where he is the Founding Director of the Advanced Analytics Institute, and is the Data Mining Research Leader of the Australian Capital Markets Cooperative Research Center, Sydney. His current research interests include big data analytics, data mining, machine learning, behavior informatics, complex intelligent systems, agent mining, and their applications.

Chi-Hung Chi received the Ph.D. degree from Purdue University, West Lafayette, IN, USA. He is currently a Science Leader with the Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia. Before joining CSIRO in 2012, he was involved in industry and universities for more than 20 years. He has authored more than 200 papers in international conferences and journals, and holds six U.S. patents. His current research interests include service engineering, cloud computing, big data analytics, social networking, and behavior informatics.