Clustering-based Binary-class Classification for Imbalanced Data Sets

Chao Chen and Mei-Ling Shyu
Department of Electrical and Computer Engineering
University of Miami
Coral Gables, FL 33124, USA
Email: [email protected], [email protected]

Abstract

In this paper, we propose a new clustering-based binary-class classification framework that integrates the clustering technique into a binary-class classification approach to handle imbalanced data sets. A binary-class classifier is designed to classify a set of data instances into two classes, while the clustering technique partitions the data instances into groups according to their similarity to each other. After applying a clustering algorithm, the data instances within the same group usually have a higher similarity, and the differences among the data instances in different groups should be larger. In our proposed framework, all negative data instances are first clustered into a set of negative groups. Next, the negative data instances in each negative group are combined with all positive data instances to construct a balanced binary-class data set. Finally, the subspace models trained on these balanced binary-class data sets are integrated with the subspace model trained on the original imbalanced data set to form the proposed classification model. Experimental results demonstrate that our proposed classification framework performs better than the comparative classification approaches as well as the subspace modeling method trained on the original data set alone.

Keywords: Binary classification, Subspace modeling, Imbalanced data sets, Clustering.

1 Introduction

Recently, with the prosperity of social networks and the advances of Internet techniques, the huge amount of multimedia sources, such as videos and images, has posed a great challenge on searching, indexing, and retrieving the data that the users are interested in from the multimedia sources. For example, soccer fans are very interested in fantastic goals made by the soccer players. Therefore, effective goal detection in a large collection of soccer videos is vital to meet the request of soccer fans [4][5][12]. Another example would be semantic indexing within large video collections. The TRECVID [13] semantic indexing task attracts attention from research institutions all over the world. The objective of the semantic indexing task is to detect and rank video shots that contain a target concept from a given video collection. Since different concepts are mixed together in the video shots, a popular preprocessing step is to generate a number of binary-class data sets where the target concept is regarded as the positive class and the rest of the concepts are regarded as the negative class.

However, one of the problems in the aforementioned detection task is that the size of the positive (concept) class within a binary-class data set is typically much smaller than that of the negative (non-concept) class, so that the positive class becomes the minority class and the negative class becomes the majority class. This is the so-called data imbalance issue [9]. As most of the popular learning algorithms develop their learning models based on the assumption that the classes are balanced, the performance of their learning models is usually not satisfactory when the data set is imbalanced [8].

In this paper, a novel binary-class classification framework is proposed to address such a data imbalance issue. First, the original (training) data set is divided into two subsets, one for the positive class and one for the negative class, based on the class labels. Then, the K-means clustering algorithm is applied to cluster the data instances in the negative class subset into K negative groups. Each of the K negative groups is combined with the positive class subset so that K new data groups are formed. For each data group, one subspace model is trained and optimized. Furthermore, these K subspace models are then integrated with the subspace model trained on the original data set to build an integrated model that is expected to render better performance than the subspace model trained on the original data set alone.

The paper is organized as follows. Section 2 introduces the related work. The details of the proposed framework are illustrated in Section 3. The setup and results of the comparative experiments are shown in Section 4. Finally, Section 5 concludes this paper and explores some future directions.

2 Related Work

Broadly, there are two major categories of techniques developed to address the data imbalance issue. One is data sampling and the other one is boosting [11]. Data sampling can be further divided into oversampling and undersampling [2]. The idea of oversampling is to add more new data instances to the minority class to balance a data set. These new data instances can either be generated by replicating the data instances of the minority class or by applying synthetic methods like the Synthetic Minority Over-sampling Technique (SMOTE) [3]. Undersampling differs from oversampling in that it removes data instances of the majority class to balance a data set. There are also approaches that combine both undersampling and oversampling together [10]. The problem with data sampling is two-fold: on one hand, the removal of data instances of the majority class causes information loss, although the time of model learning is reduced; on the other hand, duplicating or creating data instances of the minority class requires more time for model learning, but the improvement on performance may be negligible [6].

The other category of techniques to handle the data imbalance issue is boosting. Unlike data sampling, which directly copes with the data imbalance issue on the data level, boosting methods are designed to reduce the influence of data imbalance on the model level by improving the performance of weak and poor models. The most famous boosting method is AdaBoost [7]. In the training phase, AdaBoost reweighs the training data instances and the training models iteratively by minimizing the error of prediction produced by an ensemble of models. In the classification phase, a vote of these weighted ensemble models determines the label of each testing data instance. The boosting methods are proved to be effective, but their major drawback is that they usually require a time-consuming iteration process to find the optimal weights.
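As a concrete illustration of the two sampling directions discussed above, the following sketch balances a binary-class data set by random undersampling or random oversampling. This is our own minimal Python/NumPy example, not part of any cited method; a synthetic generator such as SMOTE [3] would replace the replication step in the oversampling case.

SKETCH: RANDOM UNDER- AND OVERSAMPLING (PYTHON)

import numpy as np

def random_undersample(X, y, seed=0):
    """Drop majority (negative) instances until both classes have equal size."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    keep = rng.choice(neg, size=len(pos), replace=False)  # information in the rest is lost
    idx = np.concatenate([pos, keep])
    return X[idx], y[idx]

def random_oversample(X, y, seed=0):
    """Replicate minority (positive) instances until both classes have equal size."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([pos, neg, extra])
    return X[idx], y[idx]

The comments mirror the trade-off noted above: undersampling discards majority-class information, while oversampling enlarges the training set and hence the model learning time.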

3 Proposed Framework

Our proposed framework is based on a few subspace models built on several data sets. Here, the framework of subspace modeling is first introduced.

Suppose that the training data set Tr contains R data instances and C attributes. Thus, it can be regarded as a 2-D matrix with R rows and C columns. Each row stands for a data instance, which is denoted as Tr(x,:), where x ∈ [1,R] (we use A(i,:) to denote the i-th row of matrix A, as will be seen later). Each column stands for one attribute, which is denoted as Tr(:,y), where y ∈ [1,C]. An element Tr(x,y) of the 2-D matrix is the value of attribute y for data instance x. The testing data set is denoted as Ts, which has the same attributes as the training data set, and Ts[i] means the i-th data instance in the testing data set. The training and testing phases of the subspace modeling are shown in Code 1 and Code 2, respectively. Section 3.1 illustrates the definitions of the operators and functions used in the Codes.

3.1 Definitions

Dot operators are defined in Definition 1 to facilitate the later expressions. These operators are commonly seen in Matlab.

Definition 1 (Dot operators). Dot division (./) between two vectors $a = \{a_1,\ldots,a_d\}$ and $b = \{b_1,\ldots,b_d\}$ is defined as $a./b = \{a_1/b_1,\ldots,a_d/b_d\}$. Dot multiplication (.*) between two vectors $a$ and $b$ is defined as $a.\!*b = \{a_1 \times b_1,\ldots,a_d \times b_d\}$.

Definition 2 (Mean vector). The mean vector of an $m \times n$ matrix $A = a(i,j)$ is defined as $\mu(A) = [\mu_1(A),\ldots,\mu_n(A)]$. Each element $\mu_j(A)$ is calculated by:

$$\mu_j(A) = \frac{1}{m}\sum_{i=1}^{m} a(i,j), \quad j = 1,2,\ldots,n \quad (1)$$

Definition 3 (Standard deviation vector). The standard deviation vector of an $m \times n$ matrix $A = a(i,j)$ is defined as $s(A) = [s_1(A),\ldots,s_n(A)]$. Each element $s_j(A)$ is calculated by:

$$s_j(A) = \sqrt{\frac{1}{m-1}\sum_{i=1}^{m} \left(a(i,j) - \mu_j(A)\right)^2}, \quad j = 1,2,\ldots,n \quad (2)$$

Definition 4 (Function Z). For an $m \times n$ matrix $A = a(i,j)$ and a $\rho \times n$ matrix $B$,

$$Z(B,A) = \begin{bmatrix} (B(1,:) - \mu(A))./s(A) \\ \vdots \\ (B(\rho,:) - \mu(A))./s(A) \end{bmatrix}$$

Here, Z(A,A) is the z-score normalization result of A.
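The dot operators and the function Z map directly onto elementwise array operations. The following sketch (our own illustration; all function names are ours) computes µ(A), s(A), and Z(B,A) exactly as in Definitions 2–4, with NumPy broadcasting playing the role of the row-wise dot division.

SKETCH: MEAN, STANDARD DEVIATION, AND FUNCTION Z (PYTHON)

import numpy as np

def mu(A):
    """Definition 2: column-wise mean vector of A (rows = data instances)."""
    return A.mean(axis=0)

def s(A):
    """Definition 3: column-wise standard deviation with the 1/(m-1) factor."""
    return A.std(axis=0, ddof=1)

def Z(B, A):
    """Definition 4: z-score normalize each row of B by the statistics of A."""
    return (B - mu(A)) / s(A)  # broadcasting applies ./ to every row

Z(A, A) is then the z-score normalization of A itself, as stated above.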

Definition 5 (Function SVD). The standard SVD (Singular Value Decomposition) is shown in Equation (3):

$$A = U\Sigma V^T \quad (3)$$

The SVD function of an $m \times n$ matrix $A$ defined here will further produce two important items, $\lambda(A)$ and $PC(A)$. $\lambda(A)$ consists of the positive diagonal elements of $\Sigma^T\Sigma$ sorted in a descending manner. In other words, $\lambda(A) = \{\lambda_1(A),\ldots,\lambda_\theta(A) \mid \lambda_1(A) \ge \lambda_2(A) \ge \cdots \ge \lambda_\theta(A) > 0\}$. The other item $PC(A)$ is the set of eigenvectors from $V$ that correspond to the sorted $\lambda(A)$.
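A minimal sketch of the SVD function in Definition 5 (again our own illustration; the function name and the positivity tolerance are our choices). NumPy returns the singular values in descending order, so the positive diagonal elements of Σ^T Σ, i.e., the squared singular values, are already sorted as required.

SKETCH: FUNCTION SVD, λ(A), AND PC(A) (PYTHON)

import numpy as np

def svd_items(A, tol=1e-10):
    """Definition 5: return lambda(A) and PC(A) from the SVD of A."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    lam = sigma ** 2              # diagonal of Sigma^T Sigma, in descending order
    keep = lam > tol              # retain only the strictly positive eigenvalues
    return lam[keep], Vt[keep].T  # columns of PC(A) are the matching eigenvectors of V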

Definition 6 (Function PCP). Suppose there is an $m \times n$ matrix $B = b(i,j)$, and eigenvectors $PC(A) = \{PC_1(A),\ldots,PC_\theta(A)\}$, where $PC_i(A)$ is an $n \times 1$ vector, $i = 1,\ldots,\theta$, as defined in Definition 5. The Principal Component Projection (PCP) of $B$ on $PC(A)$ is defined as:

$$PCP(B,A) = \{B \ast PC_1(A),\ldots,B \ast PC_\theta(A)\} \quad (4)$$

From Equation (4), it can be seen that $PCP(B,A)$ is an $m \times \theta$ matrix. If we use $PCP_{(x,y)}(B,A)$ to denote the element of $PCP(B,A)$ at the x-th row and y-th column, then $PCP_{(x,y)}(B,A)$ is the projection of the x-th row vector of $B$ on the y-th eigenvector of $PC(A)$.

Definition 7 (Score function). Based on Definitions 5 and 6, we can further define the score function $Score(B,A,pl) = [Score_1(B,A,pl),\ldots,Score_x(B,A,pl),\ldots,Score_m(B,A,pl)]^T$, where $Score_x(B,A,pl)$ is defined as:

$$Score_x(B,A,pl) = \sum_{y=1}^{pl} \frac{PCP_{(x,y)}(B,A) \times PCP_{(x,y)}(B,A)}{\lambda_y(A)} \quad (5)$$

where $pl$ can be any integer between 1 and $\theta$.

Definition 8 (FinalScore). $FinalScore(T,A,B,\alpha,pl)$ for a vector $T$ is defined as follows, where $\alpha$ is a scalar:

$$FinalScore(T,A,B,\alpha,pl) = Score(T,A,pl) - \alpha \ast Score(T,B,pl) \quad (6)$$

CODE 1: SUBSPACE MODELING: LEARNING PHASE
1 Input: (1) A set of training data instances Tr; (2) Training labels
2 Output: pl(opt), β(opt), µ(TrP), µ(TrN), s(TrP), s(TrN), λ(TrP), λ(TrN), PC(TrP), PC(TrN)
3 Divide the training data set Tr into positive class TrP and negative class TrN according to the training labels.
4 Calculate µ(TrP), µ(TrN), s(TrP), and s(TrN).
5 Calculate Z(TrP, TrP) and Z(TrN, TrN).
6 Perform the SVD function on Z(TrP, TrP) to derive the positive eigenvalues λ(TrP) and the corresponding eigenvectors PC(TrP) of Z(TrP, TrP)^T * Z(TrP, TrP). Likewise, derive λ(TrN) and PC(TrN).
7 Perform PCP(Tr, TrP) and PCP(Tr, TrN) to get the projected training data on the PC subspaces of the positive class and the negative class, respectively.
8 for pl ← 1 to θ
9   Calculate Score(Tr, TrP, pl) and Score(Tr, TrN, pl).
10  for β ← init value to end value with step s
11    Get FinalScore(Tr, TrP, TrN, β, pl) by subtracting β × Score(Tr, TrN, pl) from Score(Tr, TrP, pl).
12    The data instances with positive FinalScore are predicted as negative; otherwise, they are predicted as positive.
13    Compute the F1-score.
14  end
15 end
16 Output pl(opt) and β(opt) corresponding to the best F1-score.
17 Output µ(TrP), µ(TrN), s(TrP), and s(TrN).
18 Output λ(TrP), λ(TrN), PC(TrP), and PC(TrN).

CODE 2: SUBSPACE MODELING: CLASSIFICATION PHASE
1 Input: (1) Testing data instance Ts[i], i = 1 to ω (the total number of testing data instances); (2) Output from the learning phase: pl(opt), β(opt), µ(TrP), µ(TrN), s(TrP), s(TrN), λ(TrP), λ(TrN), PC(TrP), PC(TrN)
2 Output: FinalScore(Ts[i], TrP, TrN, β(opt), pl(opt)) and the predicted label of Ts[i]
3 Calculate Z(Ts[i], TrP) and Z(Ts[i], TrN).
4 Perform PCP(Ts[i], TrP) and PCP(Ts[i], TrN) to get the projected testing data on the PC subspaces of the positive class and the negative class, respectively.
5 Calculate Score(Ts[i], TrP, pl(opt)) and Score(Ts[i], TrN, pl(opt)).
6 Get FinalScore(Ts[i], TrP, TrN, β(opt), pl(opt)) by subtracting β(opt) × Score(Ts[i], TrN, pl(opt)) from Score(Ts[i], TrP, pl(opt)).
7 if FinalScore(Ts[i], TrP, TrN, β(opt), pl(opt)) ≤ 0
    Predict Ts[i] as positive.
8 else
    Predict Ts[i] as negative.
9 end
10 Output FinalScore(Ts[i], TrP, TrN, β(opt), pl(opt)) and Ts[i]'s predicted label.
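Codes 1 and 2 can be condensed into a short sketch built on the helpers above. This is our own reading of the pseudocode, not the authors' implementation: the names pcp_score, train_sumo, and classify_sumo are ours; mu, s, Z, and svd_items come from the earlier sketches; and the default β grid follows the values given later in Section 4.1. The grid over pl and β follows steps 8–16 of Code 1, and the sign rule follows steps 7–9 of Code 2.

SKETCH: SUMO LEARNING AND CLASSIFICATION (PYTHON)

import numpy as np
from sklearn.metrics import f1_score  # used only to rank (pl, beta) candidates

def pcp_score(Xz, lam, PC, pl):
    """Equation (5): 1/lambda-weighted squared projections on the first pl PCs."""
    proj = Xz @ PC[:, :pl]
    return (proj ** 2 / lam[:pl]).sum(axis=1)

def train_sumo(Tr, labels, betas=np.arange(-3.0, 3.0 + 1e-9, 0.02)):
    P, N = Tr[labels == 1], Tr[labels == 0]
    stats = [(mu(X), s(X)) for X in (P, N)]          # Code 1, step 4
    eigs = [svd_items(Z(X, X)) for X in (P, N)]      # Code 1, steps 5-6
    theta = min(len(eigs[0][0]), len(eigs[1][0]))
    best = (-1.0, 1, 0.0)
    for pl in range(1, theta + 1):                   # Code 1, steps 8-15
        sP = pcp_score((Tr - stats[0][0]) / stats[0][1], *eigs[0], pl)
        sN = pcp_score((Tr - stats[1][0]) / stats[1][1], *eigs[1], pl)
        for beta in betas:
            pred = (sP - beta * sN <= 0).astype(int) # FinalScore <= 0 -> positive
            f1 = f1_score(labels, pred, zero_division=0)
            if f1 > best[0]:
                best = (f1, pl, beta)
    return stats, eigs, best[1], best[2]

def classify_sumo(Ts, stats, eigs, pl, beta):
    """Code 2: FinalScore and predicted labels (1 = positive) for testing data."""
    sP = pcp_score((Ts - stats[0][0]) / stats[0][1], *eigs[0], pl)
    sN = pcp_score((Ts - stats[1][0]) / stats[1][1], *eigs[1], pl)
    final = sP - beta * sN                           # Equation (6)
    return final, (final <= 0).astype(int)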

3.2 Clustering-based subspace modeling

Figure 1 shows the overall framework of the CLUstering-based SUbspace MOdeling (CLU-SUMO). In this framework, the negative training data set TrN is clustered into K groups, namely TrN(1), ..., TrN(K). Each of the K groups is combined with TrP to form Group 1, ..., Group K, as shown in Figure 1. Each group is modeled by a SUbspace MOdel (SUMO). In the classification phase, the scores of a testing data instance from the SUMOs of all the groups and of the original data set are integrated together using different weights. The label of that testing data instance is then predicted by checking whether the integrated score is greater than zero or not. The learning and classification of CLU-SUMO are shown in Code 3.

[Figure 1. The clustering-based subspace modeling.]

CODE 3: CLU-SUMO: LEARNING & CLASSIFICATION
1 Learning Step:
2 Divide the training data set Tr into positive set TrP and negative set TrN.
3 Apply the K-means clustering method to cluster TrN into K clusters TrN(1), ..., TrN(K). Derive C(1), ..., C(K), which are the centroids of TrN(1), ..., TrN(K).
4 Build Group j by combining TrN(j) with TrP, j = 1, ..., K.
5 Apply subspace learning on the original data set as well as on Group 1 to Group K.
6 Classification Step:
7 For a testing data instance Ts[i], apply subspace testing using the parameters from the subspace learning models and get FinalScore(Ts[i], TrP, TrN, β(opt), pl(opt)) and FinalScore(Ts[i], TrP, TrN(j), β_j(opt), pl_j(opt)), j = 1, ..., K.
8 Calculate the weight for Group j using W_j = exp(−1 * ||Ts[i] − C(j)||), j = 1, ..., K.
9 Calculate WeightedFinalScore = FinalScore(Ts[i], TrP, TrN, β(opt), pl(opt)) · K + Σ_{j=1}^{K} FinalScore(Ts[i], TrP, TrN(j), β_j(opt), pl_j(opt)) · W_j.
10 if WeightedFinalScore ≤ 0
     Predict Ts[i] as positive.
11 else
     Predict Ts[i] as negative.
12 end
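The CLU-SUMO procedure in Code 3 then reduces to clustering TrN, training one SUMO per group plus one on the original data set, and combining the FinalScores with the distance-based weights W_j. A sketch under the same assumptions as above (train_sumo and classify_sumo are the hypothetical helpers from the previous sketch; scikit-learn's KMeans stands in for the K-means step, and the default K = 5 follows Section 4.1):

SKETCH: CLU-SUMO LEARNING AND CLASSIFICATION (PYTHON)

import numpy as np
from sklearn.cluster import KMeans

def train_clu_sumo(Tr, labels, K=5):
    P, N = Tr[labels == 1], Tr[labels == 0]
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(N)  # Code 3, step 3
    groups = []
    for j in range(K):
        Nj = N[km.labels_ == j]
        Gj = np.vstack([P, Nj])                                  # Group j = TrP + TrN(j)
        gl = np.r_[np.ones(len(P)), np.zeros(len(Nj))]
        groups.append(train_sumo(Gj, gl))                        # Code 3, steps 4-5
    base = train_sumo(Tr, labels)                                # model on the original data set
    return base, groups, km.cluster_centers_

def classify_clu_sumo(Ts, base, groups, centroids):
    K = len(groups)
    total, _ = classify_sumo(Ts, *base)
    total = total * K                                            # base FinalScore weighted by K
    for j, mdl in enumerate(groups):
        fs, _ = classify_sumo(Ts, *mdl)
        Wj = np.exp(-np.linalg.norm(Ts - centroids[j], axis=1))  # Code 3, step 8
        total += fs * Wj                                         # Code 3, step 9
    return (total <= 0).astype(int)                              # <= 0 -> positive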

4 Experiments

In order to evaluate the effectiveness of our proposed framework, experiments are conducted using publicly available data sources. The proposed framework is also compared with other existing approaches. The experiment setup and the results are shown in Section 4.1 and Section 4.2, respectively.

4.1 Experiment Setup

The data sets used in the experiments are from the MediaMill Challenge Problem [14], which uses 85 hours of video data from the 2005 NIST TRECVID training and testing sets [1]. There are 5 experiments in the challenge problem, and the training and testing data sets in Experiment 1 are used in our experiments. The training data set consists of 30993 data instances and 120 attributes, while the testing data set has 12914 data instances. Five concepts are selected with the positive to negative ratio between 0.043 and 0.074. Therefore, these data sets are very imbalanced and thus suitable to prove the effectiveness of our proposed framework.

In the experiments, all classifiers take the same training and testing data sets, and the performance of all classifiers is evaluated in terms of the F1-score, which is the harmonic mean of precision and recall, i.e., F1 = 2 × Precision × Recall / (Precision + Recall).

For SUMO and CLU-SUMO, the init value, end value, and step s of β for our framework are selected as −3, 3, and 0.02, respectively, in the learning phase of subspace modeling. In order to balance the positive and negative classes in the generated data groups, K is chosen to be 5 in our experiments so that the positive to negative ratio within each group is, on average, a little more than 1/5. With regard to the classification algorithms used for performance comparison, a list of popular approaches available in Weka [15] is used: Support Vector Machine (SVM), Naive Bayes (NB), Nearest Neighbor (NN), K-Nearest Neighbor (K-NN), AdaBoost with the C4.5 algorithm (Ada), the C4.5 algorithm (DTree), and Multilayer Perceptron (MP). These classifiers produce the probability that a testing data instance belongs to the positive class, which is defined as the probability of positiveness (PoP) in this paper. The classification rule based on the PoP is as follows:

IF PoP ≥ 0.5, THEN assign positive label
IF PoP < 0.5, THEN assign negative label

On account of the data imbalance issue, we utilize an adaptive threshold τ instead of 0.5 to achieve an equivalent effect as the "reweighting" method. Therefore, the classification rule is modified as follows:

IF PoP ≥ τ, THEN assign positive label
IF PoP < τ, THEN assign negative label

We search τ from 0.1 to 1 with a small step size of 0.02 for all the aforementioned comparative algorithms to get their best F1-score on the testing data. In this way, we believe it is more reasonable and fair to compare our proposed framework with these comparative methods using an adaptive threshold.
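The adaptive threshold search is a one-dimensional sweep over τ. A minimal sketch (our illustration; pop is assumed to hold a comparative classifier's probability of positiveness on the testing set):

SKETCH: ADAPTIVE THRESHOLD SELECTION (PYTHON)

import numpy as np
from sklearn.metrics import f1_score

def best_threshold(pop, y_true, taus=np.arange(0.1, 1.0 + 1e-9, 0.02)):
    """Return the tau in [0.1, 1] (step 0.02) that maximizes the F1-score."""
    scores = [f1_score(y_true, (pop >= t).astype(int), zero_division=0) for t in taus]
    i = int(np.argmax(scores))
    return taus[i], scores[i]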
4.2 Experimental Results and Analyses

The experimental results are shown in Table 1 to Table 5.

Table 1. Performance of classification on Concept Building
Classifier   Precision  Recall  F1
CLU-SUMO     0.28       0.56    0.37
SUMO         0.33       0.34    0.33
SVM          0.45       0.19    0.27
NB           0.20       0.54    0.30
NN           0.29       0.19    0.23
K-NN         0.24       0.38    0.29
Ada          0.33       0.17    0.22
DTree        0.29       0.28    0.28
MP           0.37       0.31    0.34

Table 2. Performance of classification on Concept Car
Classifier   Precision  Recall  F1
CLU-SUMO     0.25       0.32    0.28
SUMO         0.43       0.19    0.26
SVM          0.34       0.24    0.28
NB           0.08       0.56    0.14
NN           0.28       0.25    0.27
K-NN         0.18       0.36    0.24
Ada          0.41       0.18    0.25
DTree        0.23       0.24    0.24
MP           0.28       0.20    0.23

Table 3. Performance of classification on Concept Meeting
Classifier   Precision  Recall  F1
CLU-SUMO     0.28       0.29    0.29
SUMO         0.39       0.19    0.25
SVM          0.34       0.25    0.28
NB           0.07       0.84    0.12
NN           0.16       0.16    0.16
K-NN         0.13       0.33    0.19
Ada          0.25       0.17    0.20
DTree        0.22       0.16    0.19
MP           0.22       0.35    0.27

Table 4. Performance of classification on Concept Female
Classifier   Precision  Recall  F1
CLU-SUMO     0.15       0.22    0.18
SUMO         0.17       0.15    0.16
SVM          0.18       0.11    0.14
NB           0.03       0.68    0.06
NN           0.08       0.15    0.10
K-NN         0.06       0.32    0.11
Ada          0.18       0.08    0.11
DTree        0.08       0.14    0.11
MP           0.11       0.21    0.14

Table 5. Performance of classification on Concept Military
Classifier   Precision  Recall  F1
CLU-SUMO     0.26       0.35    0.30
SUMO         0.29       0.20    0.24
SVM          0.35       0.17    0.23
NB           0.11       0.70    0.20
NN           0.21       0.16    0.18
K-NN         0.17       0.30    0.22
Ada          0.28       0.08    0.13
DTree        0.18       0.25    0.21
MP           0.28       0.26    0.27

The results reveal that our proposed framework CLU-SUMO is better than all the comparative approaches with regard to all the concepts used in the experiments. Table 6 shows that, on average, CLU-SUMO is at least 3% better than the other comparative methods. On account of the low F1-scores in the experiments, the 3% improvement is quite valuable.

Table 6. Average F1 on all 5 Concepts
Classifier   Mean F1
CLU-SUMO     0.28
SUMO         0.25
SVM          0.24
NB           0.16
NN           0.19
K-NN         0.21
Ada          0.18
DTree        0.20
MP           0.25

Another contribution of CLU-SUMO is shown in Tables 1, 3, and 5. If the subspace model is trained on the original data set alone, its performance in terms of the F1-score may be inferior to that of the Multilayer Perceptron for some data sets. However, if the clustering-based subspace modeling is applied, the performance is the best among all the compared classification algorithms. This improvement obviously comes from the weighted voting of the K+1 subspace models. The K subspace models are created on the K new data groups, which are more balanced than the original data set. These learning models may capture the positive class patterns better than the model trained on only the original data set. However, this statement is true only if K is appropriately selected. If K is too small, then the improvement is not obvious. On the other hand, if K is too large, then within some new data groups the minority class could now be the negative class, and the subspace models may overfit to the positive class.

5 Conclusion and Future Work

This paper introduces a new clustering-based binary-class subspace modeling classification framework. Our proposed framework first utilizes the K-means clustering method to cluster the negative training data set into K different negative groups. Then, each negative group is combined with the positive training data to construct K new data groups which are more balanced, and each data group trains a subspace model. Our proposed framework is demonstrated to be effective according to the comparative experiments with other well-known classification algorithms. For the future work, several directions will be investigated to increase the robustness of the framework. First, experiments pertaining to the influence of the cluster size K on the performance of the proposed framework should be conducted. Second, boosting technology can be explored in the learning step of CLU-SUMO to further improve the performance.

References

[1] The MediaMill Challenge Problem, 2005.
[2] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1):20–29, June 2004.
[3] N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1):321–357, January 2002.
[4] S.-C. Chen, M.-L. Shyu, C. Zhang, and M. Chen. A multimodal data mining framework for soccer goal detection based on decision tree logic. International Journal of Computer Applications in Technology, Special Issue on Data Mining Applications, 27(4):312–323, October 2006.
[5] S.-C. Chen, M.-L. Shyu, C. Zhang, L. Luo, and M. Chen. Detection of soccer goal shots using joint multimedia features and classification rules. In Proceedings of the Fourth International Workshop on Multimedia Data Mining, pages 36–44, August 2003.
[6] C. Drummond and R. C. Holte. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Proceedings of the International Conference on Machine Learning (ICML 2003) Workshop on Learning from Imbalanced Data Sets II, pages 1–8, July 2003.
[7] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning, pages 148–156, July 1996.
[8] H. He and E. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, September 2009.
[9] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–450, November 2002.
[10] Y. Peng and J. Yao. AdaOUBoost: Adaptive over-sampling and under-sampling to boost the concept learning in large scale imbalanced data sets. In Proceedings of the International Conference on Multimedia Information Retrieval (MIR '10), pages 111–118, March 2010.
[11] C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 40(1):185–197, January 2010.
[12] M.-L. Shyu, Z. Xie, M. Chen, and S.-C. Chen. Video semantic event/concept detection using a subspace-based multimedia data mining framework. IEEE Transactions on Multimedia, 10(2):252–259, February 2008.
[13] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In Proceedings of the ACM International Workshop on Multimedia Information Retrieval (MIR '06), pages 321–330, October 2006.
[14] C. Snoek, M. Worring, J. Gemert, J. Geusebroek, and A. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of ACM Multimedia, pages 421–430, October 2006.
[15] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, second edition, June 2005.