Comparing Oversampling Techniques to Handle the Class Imbalance Problem: a Customer Churn Prediction Case Study
Total Page:16
File Type:pdf, Size:1020Kb
Received September 14, 2016, accepted October 1, 2016, date of publication October 26, 2016, date of current version November 28, 2016. Digital Object Identifier 10.1109/ACCESS.2016.2619719 Comparing Oversampling Techniques to Handle the Class Imbalance Problem: A Customer Churn Prediction Case Study ADNAN AMIN1, SAJID ANWAR1, AWAIS ADNAN1, MUHAMMAD NAWAZ1, NEWTON HOWARD2, JUNAID QADIR3, (Senior Member, IEEE), AHMAD HAWALAH4, AND AMIR HUSSAIN5, (Senior Member, IEEE) 1Center for Excellence in Information Technology, Institute of Management Sciences, Peshawar 25000, Pakistan 2Nuffield Department of Surgical Sciences, University of Oxford, Oxford, OX3 9DU, U.K. 3Information Technology University, Arfa Software Technology Park, Lahore 54000, Pakistan 4College of Computer Science and Engineering, Taibah University, Medina 344, Saudi Arabia 5Division of Computing Science and Maths, University of Stirling, Stirling, FK9 4LA, U.K. Corresponding author: A. Amin ([email protected]) The work of A. Hussain was supported by the U.K. Engineering and Physical Sciences Research Council under Grant EP/M026981/1. ABSTRACT Customer retention is a major issue for various service-based organizations particularly telecom industry, wherein predictive models for observing the behavior of customers are one of the great instruments in customer retention process and inferring the future behavior of the customers. However, the performances of predictive models are greatly affected when the real-world data set is highly imbalanced. A data set is called imbalanced if the samples size from one class is very much smaller or larger than the other classes. The most commonly used technique is over/under sampling for handling the class-imbalance problem (CIP) in various domains. In this paper, we survey six well-known sampling techniques and compare the performances of these key techniques, i.e., mega-trend diffusion function (MTDF), synthetic minority oversampling technique, adaptive synthetic sampling approach, couples top-N reverse k-nearest neigh- bor, majority weighted minority oversampling technique, and immune centroids oversampling technique. Moreover, this paper also reveals the evaluation of four rules-generation algorithms (the learning from example module, version 2 (LEM2), covering, exhaustive, and genetic algorithms) using publicly available data sets. The empirical results demonstrate that the overall predictive performance of MTDF and rules- generation based on genetic algorithms performed the best as compared with the rest of the evaluated oversampling methods and rule-generation algorithms. INDEX TERMS SMOTE, ADASYN, mega trend diffusion function, class imbalance, rough set, customer churn, mRMR. ICOTE, MWMOTE, TRkNN. I. INTRODUCTION certain patterns that may be exploited for acquiring customer In many subscription-based service industries such as information on time for intelligent decision-making practice telecommunications companies, are constantly striving to of an industry [2]. recognize customers that are looking to switch providers (i.e., However, knowledge discovery in such rich CRM Customer churn). Reducing churn is extremely important in databases, which typically contains thousands or millions of competitive markets since acquiring new customers in such customers' information, is a challenging and difficult task. markets is very difficult (attracting non-subscribers can cost Therefore, many industries have to inescapably depend on up to six times more than what it costs to retain the current prediction models for customer churn if they want to remain customers by taking active steps to discourage churn behavior in the competitive market [1]. As a consequence, several [1]). Churn-prone industries such as the telecommunication competitive industries have implemented a wide range of sta- industry typically maintain customer relationship manage- tistical and intelligent machine learning (ML) techniques to ment (CRM) databases that are rich in unseen knowledge and develop predictive models that deal with customer churn [2]. 2169-3536 2016 IEEE. Translations and content mining are permitted for academic research only. 7940 Personal use is also permitted, but republication/redistribution requires IEEE permission. VOLUME 4, 2016 See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. A. Amin et al.: Comparing Oversampling Techniques to Handle the CIP: A Customer Churn Prediction Case Study Unfortunately, the performance of ML techniques is SMOTE [6] is commonly used as a benchmark for over- considerably affected by the CIP. The problem of imbal- sampling algorithm [7], [8]. ADASYN is also an important anced dataset appears when the proportion of majority oversampling technique which improves the learning about class has a higher ratio than minority class [3], [4]. the samples distribution in an efficient way [9]. MTDF was The skewed distribution (imbalanced) of data in the dataset first proposed by Li et al. [10] and reported improved per- poses challenges for machine learning and data mining algo- formance of classifying imbalanced medical datasets [11]. rithms [3], [5]. This is an area of research focusing on skewed CUBE is also another advanced oversampling technique [12] class distribution where minority class is targeted for classi- but we have not considered this approach since as noted fication [6]. Consider a dataset where the imbalance ratio is by Japkowicz [13], CUBE oversampling technique does not 1:99 (i.e., where 99% of the instances belong to the majority increase the predictive performance of the classifier. The class, and 1% belongs to the minority class). A classifier above-mentioned oversampling approaches are used to han- may achieve the accuracy up to 99% just by ignoring that dle the imbalanced dataset and improve the performance 1% of minority class instances—however, adopting such of predictive models for customer churn, particularly in an approach will result in the failure to correctly classify the telecommunication sector. Rough Set Theory (RST) is any instances of the class of interest (often the minority applied to the four different rule-generation algorithms— class). (i) Learning from Example Module, version 2 (LEM2), It is well known that churn is a rare object in service- (ii) Genetic (Gen), (iii) Covering (Cov) and (iv) Exhaustive based industries, and that misclassification is more costly for (Exh) algorithms—in this study to observe the behavior of rare objects or events in the case of imbalance datasets [7]. customer churn and all experiments are applied on the pub- Traditional approaches, therefore, can provide mislead- licly available dataset. It is specified here that this paper is ing results on the class-imbalanced dataset—a situation an extended version of our previous work [14], and makes that is highly significant and one that occurs in many the following contributions: (i) more datasets are employed domains [3], [6]. For instance, there are a significant num- to obtained more generalized results for the selected over- ber of real-world applications that are suffering from the sampling techniques and the rules-generation algorithms, class imbalance problem (e.g., medical and fault diagnosis, (ii) another well-known oversampling technique—namely, anomaly detection, face recognition, telecommunication, the ADASYN—is also used (iii) detailed analysis and discus- web & email classification, ecology, biology and financial sion on the performance of targeted oversampling tech- services [3], [4], [6], [85], [86]). niques (namely, ADASYN, MTDF, SMOTE, MWMOTE, Sampling is a commonly used technique for handling ICOTE and TRkNN) followed by the rules-generation algo- the CIP in various domains. Broadly speaking, sampling rithms—namely, Gen, Cov, LEM2 and Exh, and (iv) detailed can be categorized into oversampling and undersampling. performance evaluation—in terms of the balance accuracy, A large number of studies focused on handling the class the imbalance ratio, the area under the curve (AUC) and imbalance problem have been reported in literature [2], [4]. the McNemar's statistical test—is performed to validate the These studies can be grouped into the following approaches results and avoid any biases. Many comparative studies [2], based on their dealing with class imbalance issue: (i) the [7], [11], [15], [16] have already been carried out on the internal-level: construct or update the existing methods to comparison of oversampling and undersampling methods for emphasize the significance of the minority class and (ii) handling the CIP; however, the proposed study differs from the external level: adding data in preprocessing stage where the previous studies in that in addition to evaluating six over- the distribution of class is resampled in order to reduce sampling techniques (SMOTE, ADASYN, MTDF, ICOTE, the influence of imbalanced distribution of class in the MWMOTE and TRkNN), we also compare the performance classification process. The internal level approach is fur- of four rules-generation algorithms (Exh, Gen, Cov and ther divided into two groups [5]. Firstly, the cost-sensitive LEM2). The proposed study is also focused on considering approach, which falls between internal and external level the following research questions (RQ): approaches, is based on reducing incorrect classification • RQ1: What is the list of attributes that is highly symp- costs for minority class leading to reduction of the over- tomatic in the targeted data set for prediction of customer all