CSMOUTE: Combined Synthetic Oversampling and Undersampling Technique for Imbalanced Data Classification
Michał Koziarski
Department of Electronics, AGH University of Science and Technology
Al. Mickiewicza 30, 30-059 Kraków, Poland
Email: [email protected]

arXiv:2004.03409v2 [cs.LG] 17 Apr 2021

Abstract—In this paper we propose a novel data-level algorithm for handling data imbalance in the classification task, the Synthetic Majority Undersampling Technique (SMUTE). SMUTE leverages the concept of interpolation of nearby instances, previously introduced in the oversampling setting in SMOTE. Furthermore, we combine both in the Combined Synthetic Oversampling and Undersampling Technique (CSMOUTE), which integrates SMOTE oversampling with SMUTE undersampling. The results of the conducted experimental study demonstrate the usefulness of both the SMUTE and the CSMOUTE algorithms, especially when combined with more complex classifiers, namely MLP and SVM, and when applied to datasets containing a large number of outliers. This leads us to the conclusion that the proposed approach shows promise for further extensions accommodating local data characteristics, a direction discussed in more detail in the paper.

I. INTRODUCTION

Data imbalance remains one of the open challenges of contemporary machine learning, affecting, to some extent, a majority of real-world classification problems. It occurs whenever one of the considered classes, the so-called majority class, consists of a higher number of observations than one of the other, minority classes. Data imbalance is especially prevalent in problem domains in which data acquisition for the minority class poses a greater difficulty, while majority class observations are more abundant. Examples of such domains include, but are not limited to, cancer malignancy grading [1], fraud detection [2], behavioral analysis [3] and cheminformatics [4].
Data imbalance poses a challenge for traditional learning algorithms, which are ill-equipped to handle uneven class distributions and tend to display a bias toward the majority class, accompanied by reduced discriminatory capabilities on the minority classes. A large variety of methods reducing the negative impact of data imbalance on classification performance can be found in the literature. They can be divided into two categories, based on the part of the classification pipeline that they modify. First, the data-level algorithms, in which the training data is manipulated prior to classification, either by reducing the number of majority observations (undersampling) or by increasing the number of minority observations (oversampling). Second, the algorithm-level methods, which adjust the training procedure of the learning algorithm to better accommodate the data imbalance. In this paper we focus on the former. Specifically, we propose a novel undersampling algorithm, the Synthetic Majority Undersampling Technique (SMUTE), which leverages the concept of interpolation of nearby instances, previously introduced in the oversampling setting in SMOTE [5]. Secondly, we propose the Combined Synthetic Oversampling and Undersampling Technique (CSMOUTE), which integrates SMOTE oversampling with SMUTE undersampling. The aim of this paper is to serve as a preliminary study of the potential usefulness of the proposed approach, with the final goal of extending it to utilize local data characteristics, a direction further discussed in the remainder of the paper. To this end, in the conducted experiments we not only compare the proposed method with state-of-the-art approaches, but also analyse the factors influencing its performance, with a particular focus on the impact of dataset characteristics.

II. RELATED WORK

Various approaches to imbalanced data undersampling can be distinguished in the literature. Perhaps the oldest techniques are heuristic cleaning strategies such as Tomek links [6], the Edited Nearest-Neighbor rule [7], Condensed Nearest Neighbour editing (CNN) [8] and, more recently, the Near Miss method (NM) [9]. They tend not to allow specifying a desired undersampling ratio, instead removing all of the instances meeting a predefined criterion. This can lead to undesired behavior in cases in which the imbalance level after undersampling still does not meet the user's expectations. As a result, contemporary methods tend to allow an arbitrary level of balancing. This can be achieved by introducing a scoring function and removing the majority observations according to the ordering it induces. For instance, Anand et al. [10] propose sorting the undersampled observations based on their weighted Euclidean distance from the positive samples. Smith et al. [11] advocate using the instance hardness criterion, with the hardness estimated based on the certainty of the classifier's predictions. Clustering-based approaches likewise allow specifying the desired level of undersampling: they reduce the number of original observations by replacing them with a specified number of representative prototypes [12], [13]. Finally, as originally demonstrated by Liu et al. [14], undersampling algorithms are well suited for forming classifier ensembles, an idea that was further extended in the form of evolutionary undersampling [15] and boosting [16].
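
To make the scoring-function strategy described above concrete, the following is a minimal, illustrative sketch rather than code from any of the cited works. It scores every majority observation by its mean Euclidean distance to the minority class, loosely in the spirit of the distance-based criterion of Anand et al. [10], and keeps only the highest-scoring (i.e. farthest) observations; the names undersample_by_score, X_maj, X_min and n_keep are hypothetical.

import numpy as np

def undersample_by_score(X_maj, X_min, n_keep):
    """Keep the n_keep majority observations with the highest score.

    The score used here is the mean Euclidean distance of a majority
    observation to all minority observations, so the majority points
    lying closest to the minority region are removed first.
    """
    # Pairwise distances between every majority and minority observation.
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    scores = dists.mean(axis=1)
    # Sort descending by score and retain the top n_keep observations.
    keep_idx = np.argsort(scores)[::-1][:n_keep]
    return X_maj[keep_idx]

# Example usage: reduce 100 majority observations down to 40.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(100, 2))
X_min = rng.normal(2.0, 1.0, size=(20, 2))
X_maj_reduced = undersample_by_score(X_maj, X_min, n_keep=40)

A model-derived score, such as the instance hardness criterion of Smith et al. [11], would slot into the same template in place of the distance-based one.
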
Over- and undersampling strategies for handling data imbalance pose unique challenges during the algorithm design process, and can lead to vastly different performance on any given dataset. Some research has been done on the factors affecting the relative performance of both of these approaches. First of all, some classification algorithms show a clear preference towards one of the resampling strategies, a notable example being decision trees, the overfitting of which was a motivation behind SMOTE [5]. Nevertheless, a later study found SMOTE itself to still be ineffective when combined with the C4.5 algorithm [17], for which applying undersampling led to a better performance. In another study [18] the authors focused on the impact of noise, concluding that especially for high levels of noise simple random undersampling produced the best results. Finally, in another study [19] the authors investigated the impact of the level of imbalance on the choice of the resampling strategy. Their results indicate that oversampling tends to perform better on severely imbalanced datasets, while for more modest levels of imbalance both over- and undersampling tend to achieve similar results. In general, neither of the approaches clearly outperforms the other, and both can be useful in specific cases.

As a result, a family of methods combining over- and undersampling emerged. One of the first such approaches involved combining SMOTE with subsequent cleaning of the complete dataset, using either Tomek links [6] or the Edited Nearest-Neighbor rule [7]. More recently, methods such as SMOUTE [20], which combines SMOTE oversampling with k-means based undersampling, as well as CORE [21], a technique whose goal is to strengthen the core of the minority class while simultaneously reducing the risk of incorrect classification of borderline instances, were proposed. In another study [22] the authors propose combining SMOTE with the Neighborhood Cleaning Rule (NCL) [23]. In [...] of the proposed Radial-Based Undersampling algorithm and identify the types of datasets on which it achieves the best results. Similar categorization was also used in the design of a number of over- and undersampling approaches, in which the resampling was focused on observations of a particular type. These include several extensions of SMOTE, such as Borderline-SMOTE [26], focusing on the borderline instances placed close to the decision border, Safe-Level-SMOTE [27] and LN-SMOTE [28], limiting the risk of placing synthetic instances inside regions belonging to the majority class, as well as MUTE [29], extending the concept of Safe-Level-SMOTE to the undersampling setting. However, these methods tend to be ad hoc, and in most cases their behavior on datasets of a specific type is not analysed. One important exception is a study conducted by Sáez et al. [30], in which the authors used the extracted knowledge about the imbalance distribution types to guide the oversampling process.

III. CSMOUTE ALGORITHM

To mitigate the negative impact of data imbalance on the performance of classification algorithms we propose a novel data-level approach, the Synthetic Majority Undersampling Technique (SMUTE), which leverages the concept of interpolation of nearby instances in the undersampling process. The idea of using interpolation between nearby instances was previously introduced in the oversampling setting as SMOTE [5], and has since become a cornerstone of numerous data-level strategies for handling data imbalance [31]. SMOTE was introduced as a direct response to the shortcomings of random oversampling, which was shown by Chawla et al. to cause overfitting in selected classification algorithms, such as decision trees. Instead of simply duplicating the minority instances, as is the case in random oversampling, SMOTE advocates generating synthetic observations via data interpolation. It is worth noting that this approach does not come without its own disadvantages. For instance, SMOTE does not take into account the positions of the majority class objects, and as a result can produce synthetic observations overlapping the majority class distribution. This was further discussed by Koziarski et al.
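
Since the remainder of the section builds on this interpolation mechanism, a minimal sketch of the standard SMOTE generation step is given below. It follows the textbook formulation, in which each synthetic observation is obtained as x + lam * (x_nn - x) for a minority instance x, one of its k nearest minority neighbours x_nn and a uniformly random lam in [0, 1]; it is an illustrative re-implementation rather than the exact code used in the paper's experiments, and the name smote_oversample is hypothetical.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority observations by interpolation.

    For each synthetic point: pick a random minority instance x, pick one
    of its k nearest minority neighbours x_nn, and return
    x + lam * (x_nn - x) with lam drawn uniformly from [0, 1].
    """
    rng = np.random.default_rng(seed)
    # Request k + 1 neighbours, since the nearest neighbour of each point
    # is the point itself, which is discarded below.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

SMUTE, as stated at the beginning of this section, carries the same interpolation idea over to the majority class in order to reduce, rather than grow, the number of observations; its exact procedure is described in the remainder of the paper.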