Systems Engineering — Theory & Practice Volume 29, Issue 12, February 2009 Online English edition of the Chinese language journal

Cite this article as: SETP, 2009, 29(12): 94–104

C5.0 Classification Algorithm and Application on Individual Credit Evaluation of Banks

PANG Su-lin, GONG Ji-zhang
Institute of Finance Engineering, School of Management, Jinan University, Guangzhou 510632

Abstract: This article focuses on individual credit evaluation in commercial banks. Individual credit records contain both numerical and non-numerical data, and the decision tree is a good solution for this kind of problem. In recent years the C4.5 decision tree algorithm has become popular, while the C5.0 algorithm is still under development. In this article we study the C5.0 algorithm in depth, embedding the "boosting" technology and using a cost matrix and a cost-sensitive tree to establish a new model for individual credit evaluation in commercial banks. We apply the new model to the individual credit records of a German bank and compare the results of the adjusted decision tree model with those of the original one. The comparison shows that the adjusted decision tree model is more precise.
Key words: credit evaluation; decision tree; C5.0 algorithm; boosting technology; cost-sensitive tree

1 Introduction

A lot of research shows that multivariate discrimination analysis, the logistic regression model, neural network technology, and the support vector machine are efficient in the research of company and individual credit evaluation, with a high accuracy rate of discrimination based on financial numerical data. But those methods cannot be applied to individual credit evaluation, because its record data are non-numerical except for the "income" index. Other individual credit record data such as "gender", "occupation", and "duty", which are non-numerical, are very important. These valuable data cannot be used as input variables in multivariate discrimination analysis, the logistic regression model, neural network technology, or the support vector machine. Thus, these methods cannot be applied to individual credit evaluation directly. As one of the most widely used algorithms in data mining, the decision tree algorithm has the advantages of high accuracy, simplicity, and high efficiency. It can handle not only numerical data such as "income", "loans", and "value of pledge", but also non-numerical data such as "gender", "degree", "career", and "duty". So it is very suitable for individual credit evaluation of commercial banks.

The first concept of the decision tree comes from the Concept Learning System (CLS) proposed by Hunt E. B. et al. in 1966[1-2]. Based on it, a lot of improved algorithms have emerged. Among these, the most famous is ID3, which chooses split attributes according to information gain and was proposed by Quinlan in 1986[3]. Based on the ID3 algorithm, many scholars proposed improved methods. Quinlan also invented the C4.5 algorithm in 1993[4]. C4.5 takes the information gain ratio as the selection criterion in order to overcome the defect of ID3, which is apt to prefer attributes that have more values. At the same time, C4.5 introduces new functions such as pruning technology. For the purpose of handling large-scale data sets, some scholars proposed further decision tree algorithms, such as SLIQ[5], SPRINT[6], PUBLIC[7], and so on. These algorithms have their own advantages and characteristics.

C5.0 is another new decision tree algorithm developed from C4.5 by Quinlan[8]. It includes all the functionality of C4.5 and applies a number of new technologies, among which the most important is the "boosting" technology[9-10] for improving the accuracy of identification on samples. But C5.0 with the boosting algorithm is still under development and unavailable in practice. We develop a practical C5.0 algorithm with boosting and apply it to pattern recognition. The iterative process of the boosting technology in C5.0 is programmed, and numerical experiments were done. The algorithm process is modified and optimized during the experiments for the model of individual credit evaluation of commercial banks. We apply the model to evaluate the individual credit data from a German bank.

In the second section, we introduce the construction of a decision tree and the pruning of a tree in C4.5, then the main ideas of C5.0. After that we provide the mathematical description of the boosting that we use in this article and introduce the cost matrix and the cost-sensitive tree, which we use in section 3. In the third section, by embedding the boosting technology and using the cost matrix and cost-sensitive tree, we construct the model of individual credit evaluation of commercial banks based on C5.0 and evaluate the individual credit data from a German bank using this model. We also compare the discrimination result of the modified decision tree with that of the original one. We provide the conclusion in the last section.

Received date: July 30, 2008
Corresponding author: Tel: +86-136-1006-1308; E-mail: [email protected]
Foundation item: Supported by the National Nature Science Foundation (No. 70871055) and the 2008 Ministry of Education of the People's Republic of China New Century Scholar Foundation (No. NCET-08-0615)
Copyright © 2009, Systems Engineering Society of China. Published by Elsevier BV. All rights reserved.

2 C5.0 algorithm

As mentioned earlier, C5.0 is a new decision tree algorithm developed from C4.5. The idea of constructing a decision tree in C5.0 is similar to that of C4.5. Keeping all the functions of C4.5, C5.0 introduces new technologies. One important technology is boosting, and another is the construction of a cost-sensitive tree. We introduce the construction of a decision tree based on C4.5, the pruning technology, boosting, the cost matrix, and the cost-sensitive tree as follows.

2.1 Decision tree and pruning technology in the C4.5 algorithm

First, we take the given sample set as the root of the decision tree. Second, we compute the information gain ratio of every attribute of the training samples and select the attribute with the highest information gain ratio as the split attribute. Then we create a node for the split attribute and use the attribute as the indicator of the node. Third, we create a branch for each value of the split attribute and, according to this, divide the data set into several subsets. Obviously, the number of subsets equals the number of values of the testing attribute. We then repeat the steps above for every subset we have created until every subset satisfies one of the following three conditions:

(1) All the samples in the subset belong to one class. In this case, we create a leaf for the subset.

(2) All the attributes of the subset have been processed and there are no attributes left that can be used to divide the data set. In this case, we label the subset with the class to which most samples in it belong and create a leaf for the subset.

(3) All the remaining testing attributes of the samples in the subset have the same value, but the samples still belong to different classes. In this case, we create a leaf whose class is the one to which most samples in the subset belong.

C4.5 mainly adopts the EBP algorithm as its pruning method[11]. The EBP algorithm is as follows. Assume that Tt is a subtree whose root is an inner node t, n(t) is the number of samples that fall into node t, e(t) is the number of samples that fall into node t but are misclassified by the class of this node, and r(t) is the misclassification rate of node t. From the viewpoint of probability, the rate of misclassified samples can be seen as the probability that some event happens e(t) times during n(t) experiments, and we can get a confidence interval [L_CF, U_CF] for it. Assume that CF is the confidence level of the confidence interval. We can control the degree of pruning by CF: the higher the value is, the fewer branches will be pruned; the lower the value is, the more branches will be pruned. In C4.5, the default value of CF is 0.25, and the misclassification rate is assumed to follow a binomial distribution.

2.2 Boosting technology

Boosting is one of the most important improvements in C5.0 compared with C4.5. The boosting algorithm sets a weight for each sample, which represents its importance: the higher the weight, the more the sample influences the decision tree. Initially, every sample has the same weight. In each trial, a new decision tree is constructed, and the weight of each sample is adjusted so that the learner focuses on the samples that were misclassified by the decision tree constructed in the last trial, which means these samples will have higher weight. In this article we take the following boosting iteration algorithm[9-10].

Assume a given sample set S consists of n samples and a learning system constructs decision trees from a training set of samples. Boosting constructs multiple decision trees from the samples. T is the number of decision trees that will be constructed, that is, the number of trials. C^t is the decision tree that the learning system generates in trial t, and C* is the final decision tree formed by aggregating the T decision trees from these trials. ω_i^t is the weight of sample i in trial t (i = 1, 2, ..., n; t = 1, 2, ..., T), P_i^t is the normalized weight of ω_i^t, and β^t is the factor that adjusts the weights. We also define the indicator function:

θ^t(i) = 1, if sample i is misclassified; θ^t(i) = 0, if sample i is classified rightly.

The main steps of boosting are as follows:

Step 1 Initialize the variables: set a value for T (usually 10); set t = 1 and ω_i^1 = 1/n.

Step 2 Calculate P_i^t = ω_i^t / Σ_{i=1}^{n} ω_i^t, so that Σ_{i=1}^{n} P_i^t = 1.

Step 3 Set P_i^t as the weight of each sample and construct C^t under this distribution.

Step 4 Calculate the error rate of C^t as ε^t = Σ_{i=1}^{n} P_i^t θ^t(i).

Step 5 If ε^t ≥ 0.5, the trials are terminated and we set T = t − 1; if ε^t = 0, the trials are terminated and we set T = t; otherwise (0 < ε^t < 0.5), go on to Step 6.

Step 6 Calculate β^t = ε^t / (1 − ε^t).

Step 7 Adjust the weights according to the error rate, that is,

ω_i^{t+1} = ω_i^t β^t, if sample i is classified rightly; ω_i^{t+1} = ω_i^t, if sample i is misclassified.

Since β^t < 1, the relative weight of the misclassified samples increases after normalization in Step 2.

Step 8 If t = T, the trials are terminated. Otherwise, set t = t + 1 and go to Step 2 to begin the next trial.

Finally, we obtain the boosted tree C* by summing the votes of the decision trees C^1, C^2, ..., C^T, where the vote of C^t is worth log(1/β^t) units. That is, when classifying a testing sample with the decision tree model, we first classify the sample by each C^t (1 ≤ t ≤ T) and get T results; then we count the total vote of each class, where the result of C^t contributes log(1/β^t) units to the class it predicts, and we select the class with the highest vote as the final result.
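To make the eight steps above concrete, here is a small Python sketch of the same weight-update and voting scheme. It is an illustration under assumptions, not C5.0 itself: scikit-learn's DecisionTreeClassifier stands in for the C4.5-style base tree, and the function names (`boost`, `boosted_predict`) are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in base learner (assumption)

def boost(X, y, T=10):
    """Boosting iteration following Steps 1-8 of section 2.2 (sketch)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # Step 1: omega_i^1 = 1/n
    trees, votes = [], []
    for t in range(T):
        p = w / w.sum()                          # Step 2: normalized weights P_i^t
        tree = DecisionTreeClassifier().fit(X, y, sample_weight=p)  # Step 3
        miss = (tree.predict(X) != y)            # indicator theta^t(i)
        eps = p[miss].sum()                      # Step 4: weighted error rate
        if eps == 0.0:                           # Step 5: perfect tree, keep it and stop
            trees.append(tree); votes.append(np.log(1e6))
            break
        if eps >= 0.5:                           # Step 5: too weak, discard and stop (T = t - 1)
            break
        beta = eps / (1.0 - eps)                 # Step 6
        w = np.where(miss, w, w * beta)          # Step 7: shrink weights of correct samples
        trees.append(tree)
        votes.append(np.log(1.0 / beta))         # vote of C^t is worth log(1/beta^t)
    return trees, votes

def boosted_predict(trees, votes, X, classes=(1, 2)):
    """C*: weighted vote of the boosted trees."""
    score = {c: np.zeros(len(X)) for c in classes}
    for tree, v in zip(trees, votes):
        pred = tree.predict(X)
        for c in classes:
            score[c] += v * (pred == c)
    stacked = np.vstack([score[c] for c in classes])
    return np.array(classes)[np.argmax(stacked, axis=0)]
```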
2.3 Cost-sensitive tree

In C4.5, all kinds of misclassification errors are treated equally, and the objective of the decision tree is to minimize the number of misclassification errors. But many practical classification problems have different costs associated with different types of error. For example, in the loan business of commercial banks, the error of recognizing a "bad" customer as a "good" customer is usually considered more serious than the opposite error of recognizing a "good" customer as "bad".

The cost of misclassification represents the seriousness of a misclassification error: the higher the cost is, the more serious the misclassification is. So during modelling we should pay more attention to this type of error and eliminate it as much as we can. Essentially, the cost of misclassification is the weight applied to a given result. This weight can be transformed into a factor in the model which changes the result of the model in application. Different costs of misclassification can be shown in a cost matrix, which lists the different combinations of predicted class and actual class.

In C5.0, the misclassification cost can be set to a certain value in the cost matrix. By minimizing the expectation of the total misclassification cost, we construct a decision tree called a cost-sensitive tree.
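As a small illustration of how a cost matrix changes a decision, the sketch below labels a sample with the class that minimizes the expected misclassification cost rather than the most probable class. The cost values and probabilities are placeholders for this example (they happen to match the matrix later chosen in section 3.4); C5.0 builds the cost into tree construction itself, which this sketch does not reproduce.

```python
import numpy as np

# Rows = actual class, columns = predicted class, order: (good, bad).
# cost[actual][predicted]; misclassifying "bad" as "good" is the expensive error here.
COST = np.array([[0.0, 1.0],    # actual good: predicted good costs 0, predicted bad costs 1
                 [2.0, 0.0]])   # actual bad:  predicted good costs 2, predicted bad costs 0

def min_cost_class(p_good, p_bad):
    """Return 'good' or 'bad' by minimizing the expected misclassification cost."""
    probs = np.array([p_good, p_bad])
    expected = probs @ COST          # expected cost of predicting each class
    return ("good", "bad")[int(np.argmin(expected))]

# A customer who is 65% likely to be good is still labelled "bad",
# because the expected cost of calling them "good" is 0.35*2 = 0.70 > 0.65*1 = 0.65.
print(min_cost_class(0.65, 0.35))
```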

Table 1. Sample set fields
Character    duration in month, credit history, purpose, credit amount, other installment plans, number of existing credits at this bank
Capacity     checking account, installment rate in percentage of disposable income, job, number of people being liable to provide maintenance for
Capital      savings account/bonds
Collateral   other debtors / guarantors, property
Condition    age in years, personal status and sex
Stability    present employment years, present residence years, housing, telephone registration, foreign worker or not

Table 2. Distribution of "good" and "bad" customers in respective samples
Sample sets           Number/good rate   Number/bad rate   Good : bad
Total sample set      700/70%            300/30%           2.33 : 1
Training sample set   558/69.84%         241/30.16%        2.32 : 1
Testing sample set    142/70.65%         59/29.35%         2.41 : 1

3 C5.0 application on individual credit evaluation of commercial banks

3.1 Data description

We use the individual credit data set of a German bank from the internet[12], which includes 1000 records. Each record consists of 21 fields, of which the first 20 describe the customer's credit information: existing checking account, duration in month, credit history, purpose, credit amount, savings account/bonds, present employment years, installment rate in percentage of disposable income, personal status and sex, other debtors / guarantors, present residence years, property, age in years, other installment plans, housing, number of existing credits at this bank, job, number of people being liable to provide maintenance for, telephone registration, and foreign worker or not. The last field is the class of the customer in the bank, which includes 2 classes: "good customer" and "bad customer". We categorize the first 20 fields as in Table 1.

These 20 fields are all important factors that affect individual credit in individual credit evaluation. Considering the comprehensiveness of index choosing and the characteristics of the decision tree algorithm, we adopt all these 20 fields as the indexes of the model to be built, which can be seen as the samples' characteristic attributes. The definitions of "good" customer and "bad" customer are as follows: a "good" customer is one whom the credit agent is willing to loan to, because the credit agent believes the customer can repay debt and interest in time; a "bad" customer is one whom the credit agent is not willing to loan to, because the expectation of repaying debt and interest in time is low. We randomly select about 80% (799) of the 1000 samples as the training sample set; the remaining 20% (201) will be used as the testing sample set to test the model. The distribution of the "good" and "bad" customers in the sample sets is shown in Table 2.

First, we compare the distribution of the attribute values in the training sample set with those in the total sample set and find that they are almost the same, so we can conclude that the training sample set reflects the characteristics of the total sample set. Second, according to the experience of previous research, the number of bad customers should reach a certain amount, and the number of good customers should not be smaller than the number of bad customers in the training sample set. We can see from Table 2 that the ratio of "good" customers to "bad" customers in the training sample set is 2.32:1, which is almost the same as in the total sample set, and the number of "bad" customers in the training sample set accounts for about 80% of the total number of "bad" customers. Thus, based on these two points, we can conclude that the current training set meets the requirements of modelling.

3.2 Data conversion and storage

We label each field name in the form of "Letter + number" and save them in the computer. The final information of the sample data set for modelling is shown in the table in the appendix.

3.3 Modelling of decision tree

According to the C5.0 algorithm, we create a new node, which is the root, for the original sample set. Then we calculate the gain ratio of each characteristic attribute (C1-C20). First, we introduce the following definitions.

Definition 1 (Information Entropy): Assume a given set S consists of n samples. The class attribute C has m different values Ci (i = 1, 2, ..., m), and each sample belongs to one of the m classes. Let ni be the number of samples of class Ci. Then the information entropy E(S) needed for classifying S is defined as follows:

E(S) = − Σ_{i=1}^{m} p_i log2(p_i)    (1)

where p_i denotes the proportion of the number of class Ci samples to the number of all samples in the training set. We can usually calculate it by p_i = n_i/|S|, where |S| denotes the number of samples in set S, |S| = n.
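A quick numerical check of Formula (1) on the training sample set of Table 2 (558 good and 241 bad customers) can be done as follows; the helper name `entropy` is ours, not part of C5.0.

```python
import math

def entropy(counts):
    """Information entropy E(S) from class counts, Formula (1)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Training sample set from Table 2: 558 good customers, 241 bad customers.
print(round(entropy([558, 241]), 3))   # ~0.883, the value used in Step 1 of section 3.3
```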

Table 4. Classification result of training sample set by primary decision tree model
Sample sets   Classified as good   Classified as bad   Accuracy rate   Error rate
Good          528                  30                  94.62%          5.38%
Bad           87                   154                 63.90%          36.10%
Total         615                  184                 85.36%          14.64%

Table 5. Classification result of testing sample set by primary decision tree model
Sample sets   Classified as good   Classified as bad   Accuracy rate   Error rate
Good          120                  22                  84.51%          15.49%
Bad           42                   17                  28.81%          71.19%
Total         162                  39                  68.16%          31.84%

Definition 2 (Conditional Information Entropy): Assume that attribute A has ν different values (a1, a2, ..., aν), which divide the set S into ν subsets (S1, S2, ..., Sν). Let nij be the number of class Ci samples in the subset Sj. Then the conditional entropy E(S|A) of attribute A is defined as follows:

E(S|A) = − Σ_{j=1}^{ν} p′_j Σ_{i=1}^{m} p_ij log2(p_ij)    (2)

where p′_j denotes the proportion of the number of samples whose value of attribute A is aj to the number of all samples in the training set, p′_j = |Sj|/|S| = Σ_{i=1}^{m} n_ij / n; p_ij is the conditional probability that a sample whose value of attribute A is aj belongs to class Ci, p_ij = n_ij/|Sj|; and |Sj| is the number of samples whose value of attribute A is aj, |Sj| = Σ_{i=1}^{m} n_ij.

Definition 3 (Information Gain): The information gain of attribute A is defined as:

Gain(A) = E(S) − E(S|A)    (3)

Definition 4 (Information Gain Ratio): The information gain ratio can be obtained further as follows:

GainRatio(A) = Gain(A)/Split(A)    (4)

where Split(A) = − Σ_{j=1}^{ν} p′_j log2(p′_j).

Then we calculate GainRatio(C1) of the characteristic attribute C1; the process is as follows.

Step 1 We calculate the information entropy of the training sample set S according to Formula (1). The training sample set comprises 799 samples, so n = 799. These samples can be classified into 2 classes, good customer and bad customer, so m = 2. The number of good customers is 558 and the number of bad customers is 241, so n1 = 558, n2 = 241, p1 = 558/799, p2 = 241/799. Then we get the following result:

E(S) = − Σ_{i=1}^{m} p_i log2(p_i) = −(558/799)log2(558/799) − (241/799)log2(241/799) = 0.883

Step 2 Calculate the conditional information entropy of C1 according to Formula (2). Attribute C1 has four discrete values A11, A12, A13, A14 in all, so ν = 4. There are 222 samples with C1 = A11. Of these samples, 115 have decision attribute C21 = 1 and 107 have decision attribute C21 = 2 (that is, 115 good customers and 107 bad customers). So p′_1 = 222/799, p_11 = 115/222, p_21 = 107/222. There are 216 samples with C1 = A12. Of these samples, 130 have decision attribute C21 = 1 and 86 have decision attribute C21 = 2. Thus p′_2 = 216/799, p_12 = 130/216, p_22 = 86/216. There are 49 samples with C1 = A13. Of these samples, 39 have decision attribute C21 = 1 and 10 have decision attribute C21 = 2. So p′_3 = 49/799, p_13 = 39/49, p_23 = 10/49. There are 312 samples with C1 = A14. Of these samples, 274 have decision attribute C21 = 1 and 38 have decision attribute C21 = 2. So p′_4 = 312/799, p_14 = 274/312, p_24 = 38/312. Then we can get the following result:

E(S|C1) = − Σ_{j=1}^{ν} p′_j Σ_{i=1}^{m} p_ij log2(p_ij) = 0.79324

Step 3 Calculate the information gain of C1 according to Formula (3):

Gain(C1) = E(S) − E(S|C1) = 0.08976

Step 4 Calculate the information gain ratio of C1 according to Formula (4):

GainRatio(C1) = Gain(C1)/Split(C1) = 0.0499

Similarly, we can calculate the information gain ratio of the other 19 characteristic attributes. The attribute C1 has the largest information gain ratio, so we select it as the split attribute and create a node with 4 branches according to the 4 possible values of C1. The original training sample set is thereby separated into 4 subsets, and we create a new node for each subset, whose value of C1 is A11, A12, A13, and A14, respectively. Then for each new node we repeat the above steps until all the nodes satisfy the 3 stopping conditions described in section 2.1. At this point the construction of the decision tree is complete. We then prune this tree using the EBP method described in section 2.1 and get a simpler decision tree. The modelling is done.
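The arithmetic in Steps 1-4 can be reproduced with a few lines of Python. The sketch below recomputes E(S), E(S|C1), Gain(C1), Split(C1), and GainRatio(C1) from the sample counts quoted above; the function name and data layout are ours, and the counts come from the text rather than from the raw data file.

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Class counts (good, bad) of the training set, and per-branch counts for C1 = A11..A14.
total = [558, 241]                                  # E(S) ~ 0.883
branches = {"A11": [115, 107], "A12": [130, 86],
            "A13": [39, 10], "A14": [274, 38]}

n = sum(total)
cond = sum(sum(b) / n * entropy(b) for b in branches.values())   # E(S|C1) ~ 0.793
gain = entropy(total) - cond                                     # Gain(C1) ~ 0.090
split = entropy([sum(b) for b in branches.values()])             # Split(C1) ~ 1.80
print(round(gain, 4), round(gain / split, 4))                    # ~0.0898, ~0.0499
```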
Considering that the information in real problems is massive, we use SPSS Clementine to construct the decision tree model. When modelling the decision tree, all parameters of the C5.0 model are left at their default values. Tables 4 and 5 show the classification results of the training sample set and the testing sample set by the decision tree model we built. From Tables 4 and 5 we can see that the model is acceptable, but it can be optimized further, as we do in the following.
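C5.0 and SPSS Clementine are proprietary, so their exact calls cannot be shown here. The sketch below uses scikit-learn's CART-based DecisionTreeClassifier as a rough stand-in to illustrate how a default-parameter baseline like the one behind Tables 4 and 5 might be fitted and evaluated; the variable names and the train/test split are assumptions, and a different tree learner will not reproduce the exact numbers in Tables 4 and 5.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# X: the 20 encoded characteristic attributes, y: class labels (1 = good, 2 = bad).
def baseline_report(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.8, stratify=y, random_state=seed)  # roughly the 80%/20% split of section 3.1
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)  # default parameters
    for name, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
        cm = confusion_matrix(ys, model.predict(Xs), labels=[1, 2])
        good_as_bad = cm[0, 1] / cm[0].sum()   # good customers classified as bad
        bad_as_good = cm[1, 0] / cm[1].sum()   # bad customers classified as good
        print(f"{name}: bad-as-good error {bad_as_good:.2%}, good-as-bad error {good_as_bad:.2%}")
    return model
```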

Table 6. Classification error rates of decision tree models with different values of COST(A)
COST(A) values    1        2        3        4        5
Training set
Total error       14.64%   14.52%   15.14%   25.41%   25.66%
Type A error      36.10%   11.62%   2.10%    1.66%    1.66%
Type B error      5.37%    15.77%   20.79%   35.66%   56.30%
Test set
Total error       31.84%   30.35%   39.80%   41.79%   43.28%
Type A error      71.20%   32.20%   40.68%   28.81%   30.50%
Type B error      15.50%   29.58%   39.44%   47.18%   48.60%

Table 7. Cost matrix
Sample sets   Classified as good   Classified as bad
Good          0                    1
Bad           2                    0

3.4 Cost matrix

We call classifying a bad customer as a good customer an "error of type A" and classifying a good customer as a bad customer an "error of type B". The two rates are calculated respectively by:

error rate of type A = number of type A errors / total number of bad customers,
error rate of type B = number of type B errors / total number of good customers.

We label the cost of an error of type A as COST(A) and the cost of an error of type B as COST(B). In the procedure of building the model, we should eliminate the errors of type A on the premise of ensuring high accuracy in classifying the total samples.

We obtain a suitable cost matrix by running many experiments and comparing the results. We set COST(B) to 1 and COST(A) to a value larger than 1, with all the other parameters of the model left at their defaults. Then we increase the value of COST(A) continually, build different decision tree models through several experiments, and select the best value of COST(A) according to the classification results of the models. Our selection criteria are as follows:

Condition 1 For the training sample set and testing sample set, the total error rate should not be higher than that of the models with other values of COST(A); the lower the total error rate is, the better the model is.

Condition 2 On the premise that the total error rate satisfies the criterion, the lower the error rate of type A is, the better the model is.

Condition 3 When two models have similar error rates for the training sample set, we give top priority to the error rate for the testing sample set, that is, we select the model that shows better performance on the testing sample set.

Table 6 shows the classification error rates of the decision tree models with different values of COST(A). Investigating the data in Table 6, we can see that along with the increase of COST(A), the total error rate of the model increases, both for the training sample set and for the testing sample set, whereas the error rate of type A shows a decreasing trend.

In the following, we find the best value of COST(A) according to the criteria mentioned above.

(1) If COST(A) > 3, the total error rate of the model is higher than 25% for the training sample set and higher than 40% for the testing sample set, which is apparently too high. So it does not satisfy us, and it is not reasonable to set COST(A) to a value bigger than 3.

(2) Comparing the situations when COST(A) is set to 1, 2, and 3, we can see that when COST(A) = 2, not only is the total error rate the lowest of the three, but the error rate of type A on the testing sample set is also the lowest. So COST(A) = 2 is our appropriate choice. Then we get the cost matrix shown in Table 7.

3.5 Pruning degree

In C4.5 and C5.0, the pruning degree of the decision tree is controlled by the value of CF. But in the C5.0 model of SPSS Clementine, it is controlled by another index called "gravity of pruning", which can be calculated as gravity of pruning = (1 − CF) × 100; the default value is 75 (which corresponds to the default CF value of 0.25).

We select the number of nodes as the measure of the complexity of a decision tree. On the basis of the cost matrix decided in the last section, we give different values to the "gravity of pruning" and observe the error rate and complexity of the sequence of constructed decision trees. To prevent the model from over-fitting, simplify the model, and ensure the accuracy, the selection criteria for the "gravity of pruning" should include not only the three conditions in section 3.4 but also the following rule: we should select the value of "gravity of pruning" that makes the constructed decision tree have fewer nodes (lower complexity) when the other conditions are similar.

Investigating the data in Table 8 below, we can find that the number of nodes of the decision tree decreases and the tree becomes less complex as the gravity of pruning increases. Meanwhile, the total error rate and the error rate of type A in the training sample set increase. We analyze the results for the testing sample set as follows:

(1) If gravity of pruning < 70, the total error rate and the error rate of type A both stay at a high level, in contrast with the very low total error rate in the training sample set. This means the decision tree model is over-fit because it is not pruned enough, which leads to low accuracy on the testing sample set in contrast with high accuracy on the training sample set.

(2) If gravity of pruning is between 70 and 80, the total error rate and the error rate of type A in the testing sample set decline to their lowest level, and the error rate in the training sample set stays at a middle level.

(3) If gravity of pruning > 85, the total error rate and the error rate of type A in the testing sample set begin to go back up, and the error rate in the training sample set ascends to the top. This means the decision tree is over-pruned, which gives it a low degree of fitting to the training sample set, so the accuracy cannot meet the requirement.

Based on the three points above, setting the "gravity of pruning" to 75 or 80 suits our selection criteria. Because the error rates of these two models on the training sample set and testing sample set are similar, we select the less complex decision tree model with 118 nodes according to our fourth rule. The "gravity of pruning" is set to 80.
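The searches in sections 3.4 and 3.5 are both plain grid searches: vary one setting, refit, and record the error rates and the tree size. COST(A) and the "gravity of pruning" have no exact open-source equivalents, so the sketch below uses scikit-learn's class_weight and the cost-complexity parameter ccp_alpha as rough analogues; the grids and names are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

def grid_report(X_tr, y_tr, X_te, y_te):
    """Grid over a COST(A) analogue and a pruning-strength analogue."""
    for cost_a in [1, 2, 3, 4, 5]:                  # analogue of COST(A) in Table 6
        for alpha in [0.0, 0.001, 0.005, 0.01]:     # analogue of the pruning degree
            model = DecisionTreeClassifier(
                class_weight={1: 1, 2: cost_a},     # class 2 = "bad"; make its errors cost more
                ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
            cm = confusion_matrix(y_te, model.predict(X_te), labels=[1, 2])
            total_err = 1 - np.trace(cm) / cm.sum()
            type_a_err = cm[1, 0] / cm[1].sum()     # type A: bad customers classified as good
            print(f"COST(A)~{cost_a} alpha={alpha}: nodes={model.tree_.node_count}, "
                  f"total error {total_err:.2%}, type A error {type_a_err:.2%}")
```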

Table 8. Number of nodes and classification error rates of decision trees with different pruning gravities
Gravity of pruning    60       65       70       75       80       85       90       95
Number of nodes       211      187      145      125      118      90       64       56
Training set
Total error rate      6.88%    8.76%    12.52%   14.52%   14.89%   18.40%   19.65%   21.03%
Type A error rate     4.15%    4.98%    6.64%    11.62%   14.11%   14.52%   24.48%   24.07%
Test set
Total error rate      36.82%   37.81%   36.82%   30.35%   29.85%   31.34%   30.35%   30.85%
Type A error rate     42.37%   42.37%   33.90%   32.20%   32.20%   32.20%   38.90%   39.98%

3.6 Model optimization by boosting

As an important technology in C5.0, boosting increases the accuracy of the decision tree model effectively. But in practice, modelling depends on the training sample set, and any method that increases the accuracy of the model carries the risk of making the model over-fit. To observe the effect of boosting, we investigate the performance of boosting in the modelling procedure with the model parameters determined above. The number of iterations is set to the default value 10.

According to the description of boosting in section 2.2, the training sample set comprises 799 samples, so n = 799. Here we set the number of iterations to 10, that is, T = 10. The first iteration procedure is as follows:

Step 1 In the first iteration, the original weights of all the samples are set as ω_i^1 = 1/n = 1/799 (i = 1, 2, ..., 799).

Step 2 Calculate the normalized weight P_i^1 = ω_i^1 / Σ_{i=1}^{n} ω_i^1 = 1/799 (i = 1, 2, ..., 799). So in the first iteration, all the samples' weights are the same.

Step 3 Set the normalized weight 1/799 to each sample, and construct the first decision tree model C^1 under this distribution.

Step 4 Calculate the error rate ε^1 of C^1. The number of samples classified correctly by C^1 is 600, while the number of samples misclassified by C^1 is 199, and all the samples have the same weight 1/799 in the first iteration. So ε^1 = 199/799 = 0.249.

Step 5 Because 0 < ε^1 < 0.5, go on to calculate β^1 = ε^1/(1 − ε^1) = 0.3316.

Step 6 Change the weights of the samples for the next iteration:

ω_i^2 = ω_i^1 β^1 = 4.15 × 10^-4, if sample i is classified rightly; ω_i^2 = ω_i^1 = 1.25 × 10^-3, if sample i is misclassified.

The first iteration ends. After finishing the first iteration, we use the same method to calculate the normalized weights in the next iteration. The whole boosting procedure includes 10 iterations and constructs 10 decision trees C^1, C^2, ..., C^10. When classifying a sample, we give the weight log(1/β^t) to each decision tree and then determine the final classification result through voting.

Likewise, in the practical calculation we use SPSS Clementine for the iteration and modelling. The classification result of the decision tree model constructed with boosting is shown in Table 9.
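The numbers in Steps 4-6 can be checked directly; the snippet below only redoes that arithmetic and is not part of the model.

```python
import math

n = 799
eps1 = 199 / n                     # Step 4: error rate of the first tree, ~0.249
beta1 = eps1 / (1 - eps1)          # Step 5: beta^1 = eps/(1 - eps), ~0.332
w_right = (1 / n) * beta1          # Step 6: new weight of a correctly classified sample, ~4.15e-4
w_wrong = 1 / n                    # weight of a misclassified sample stays at ~1.25e-3
vote1 = math.log(1 / beta1)        # voting weight of C^1 in the final model, ~1.10
print(round(eps1, 3), round(beta1, 4), f"{w_right:.2e}", f"{w_wrong:.2e}", round(vote1, 2))
```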

Table 9. Error rates of the decision tree model with boosting
Train set total error rate     4.63%
Train set type A error rate    0.80%
Train set type B error rate    6.27%
Test set total error rate      31.34%
Test set type A error rate     37.29%
Test set type B error rate     28.87%

Table 10. Classification results of the parameter-adjusted model on the training sample set
Sample sets   Classified as good   Classified as bad   Total   Accuracy rate   Error rate
Good          473                  85                  558     84.77%          15.23%
Bad           34                   207                 241     85.89%          14.11%
Total         507                  292                 799     85.11%          14.89%

Table 11. Classification results of the parameter-adjusted model on the testing sample set
Sample sets   Classified as good   Classified as bad   Total   Accuracy rate   Error rate
Good          101                  41                  142     71.13%          28.87%
Bad           19                   40                  59      67.80%          32.20%
Total         120                  81                  201     70.15%          29.85%

Compared with the data in the column of Table 8 where the "gravity of pruning" is 80, after boosting the total error rate and the error rate of type A for the training sample set decline clearly, from 14.89% to 4.63% and from 14.11% to 0.80%, respectively. But for the testing sample set, both the total error rate and the error rate of type A increase, from 29.85% to 31.34% and from 32.20% to 37.29%, respectively. This shows that boosting can effectively increase the fitting degree of the decision tree model, which improves the model's accuracy on the training sample set evidently, but the accuracy on the testing sample set is not improved and even declines slightly.

3.7 Classification results of the parameter-adjusted model

The classification results of the modified model on the training sample set and testing sample set are shown in Tables 10 and 11. We can compare the model after parameter adjustment with the primary model constructed in section 3.3 and get the following Table 12. The parameter-adjusted model is shown in Figure 1.

Table 12. Comparison of error rates between the primary model and the parameter-adjusted model
                              Primary model   Parameter-adjusted model   Difference
Train set total error rate    14.64%          14.89%                     -0.25%
Train set type A error rate   36.10%          14.11%                     21.99%
Train set type B error rate   5.38%           15.23%                     -9.85%
Test set total error rate     31.84%          29.85%                     1.99%
Test set type A error rate    71.19%          32.20%                     38.99%
Test set type B error rate    15.50%          28.87%                     -13.37%

Figure 1. Parameter-adjusted decision tree model

From Table 12 we can see that, on the training sample set, the total error rate of the adjusted model is similar to that of the primary model, but the error rate of type A of the adjusted model declines by 21.99 percentage points. On the testing sample set, compared with the primary model, the total error rate of the adjusted model declines by about 2 percentage points, and the error rate of type A declines by 38.99 percentage points. This result satisfies our modelling objective.

4 Conclusions

This article introduces the C5.0 algorithm based on C4.5, the boosting technology, and the construction of the decision tree, and constructs an individual credit evaluation model for commercial banks based on the C5.0 algorithm, then carries out an empirical test using the individual credit data from a German bank. By setting the misclassification cost, selecting the pruning degree, and analyzing the practical effect of boosting, we construct a decision tree model based on C5.0. We conclude that our model based on the C5.0 algorithm has high accuracy, low cost of risk, and high controllability compared with other models in practice.

References

[1] Hunt E B, Krivanek J. The effects of pentylenetetrazole and methyl-phenoxy propane on discrimination learning. Psychopharmacologia, 1966(9): 1–16.
[2] Ji G S, Chan P L, Song H. Study the survey into the decision tree classification algorithms rule. Science Mosaic, 2007(1): 9–12.
[3] Quinlan J R. Induction of decision trees. Machine Learning, 1986(4): 81–106.
[4] Quinlan J R. C4.5: Programs for machine learning. Machine Learning, 1994(3): 235–240.
[5] Mehta M, Agrawal R, Rissanen J. SLIQ: A fast scalable classifier for data mining. Proceedings of the International Conference on Extending Database Technology, 1996: 18–32.
[6] Shafer J, Agrawal R. A scalable parallel classifier for data mining. Proceedings of the 1996 International Conference on Very Large Data Bases, 1996(9): 544–555.

[7] Rastogi R, Shim K. PUBLIC: A decision tree classifier that integrates building and pruning. Proceedings of the 1998 International Conference on Very Large Data Bases, New York, 1998: 404–415.
[8] Quinlan J R. "C5". http://rulequest.com, 2007.
[9] Quinlan J R. Bagging, boosting and C4.5. Proceedings of the 14th National Conference on Artificial Intelligence, 1996: 725–730.
[10] Freund Y, Schapire R E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997(1): 119–139.
[11] Fan J, Yang Y X. Researches on methods for postpruning decision trees. Journal of Hunan Radio and Television University, 2005(1): 54–56.
[12] "UCI Machine Learning". http://www.ics.uci.edu/mlearn/MLRepository.html, 2007.

Appendix

Table 3. Information of fields in sample data set after data processing
No     Field name                        Type         Values                      Description
C1     checking account                  discrete     A11, A12, A13, A14          A11: < 0 DM; A12: >= 0 DM and < 200 DM; A13: >= 200 DM / salary assignments for at least 1 year; A14: no checking account
C2     duration in month                 continuous   [4, 72]
C3     credit history                    discrete     A30, A31, A32, A33, A34     A30: no credits taken / all credits paid back duly; A31: all credits at this bank paid back duly; A32: existing credits paid back duly till now; A33: delay in paying off in the past; A34: critical account / other credits existing (not at this bank)
C4     purpose                           discrete     A40, A41, A42, A43, A44, A45, A46, A48, A49, A410    A40: car (new); A41: car (used); A42: furniture/equipment; A43: electrical appliance; A44: domestic appliances; A45: repairs; A46: education; A48: retraining; A49: business; A410: others
C5     credit amount                     continuous   [250, 18424]
C6     savings account/bonds             discrete     A61, A62, A63, A64, A65     A61: < 100 DM; A62: >= 100 DM and < 500 DM; A63: >= 500 DM and < 1000 DM; A64: >= 1000 DM; A65: unknown / no savings account
C7     present employment years          discrete     A71, A72, A73, A74, A75     A71: unemployed; A72: < 1 year; A73: >= 1 year and < 4 years; A74: >= 4 years and < 7 years; A75: >= 7 years
C8     ratio of installment to income    discrete     A81, A82, A83, A84          A81: < 10%; A82: >= 10% and < 20%; A83: >= 20% and < 40%; A84: >= 40%
C9     personal status and sex           discrete     A91, A92, A93, A94          A91: male, divorced/separated; A92: female; A93: male, single; A94: male, married
C10    other debtors / guarantors        discrete     A101, A102, A103            A101: none; A102: co-applicant; A103: guarantor
C11    present residence years           discrete     A111, A112, A113, A114      A111: >= 10 years; A112: < 10 years and >= 6 years; A113: < 6 years and >= 2 years; A114: < 2 years
C12    property                          discrete     A121, A122, A123, A124      A121: real estate; A122: if not A121, building society savings agreement / life insurance; A123: if not A121/A122, car or other; A124: unknown / no property
C13    age in years                      continuous   [19, 75]
C14    other installment plans           discrete     A141, A142, A143            A141: bank; A142: stores; A143: none
C15    housing                           discrete     A151, A152, A153            A151: rent; A152: own; A153: for free
C16    present credits record            continuous   [1, 4]                      number of existing credits at this bank
C17    job                               discrete     A171, A172, A173, A174      A171: unemployed / unskilled, non-resident; A172: unskilled, resident; A173: skilled employee / official; A174: management / self-employed / highly qualified employee / officer
C18    number of people to support       continuous   [1, 2]
C19    telephone registration            discrete     A191, A192                  A191: none; A192: yes, registered under the customer's name
C20    foreign worker or not             discrete     A201, A202                  A201: no; A202: yes
C21    class                             discrete     1, 2                        1: good customer; 2: bad customer
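For readers who want to reproduce the experiments, the sketch below shows one way the data described in Table 3 might be loaded and encoded in Python. The file name german.data, its space-separated layout, and the one-hot encoding are assumptions about the UCI repository file cited in [12]; the article itself stores and processes the data in SPSS Clementine.

```python
import pandas as pd

# Assumed layout of the raw UCI file "german.data": 1000 rows, space-separated,
# 20 attribute columns coded A11..A410 etc. plus the class column (1 = good, 2 = bad),
# in the same order as C1..C21 in Table 3.
columns = [f"C{i}" for i in range(1, 22)]
df = pd.read_csv("german.data", sep=r"\s+", header=None, names=columns)

X = pd.get_dummies(df.drop(columns="C21"))   # one-hot encode the discrete A-codes
y = df["C21"]                                # class: 1 = good customer, 2 = bad customer
print(df.shape, y.value_counts().to_dict())  # expect (1000, 21) and {1: 700, 2: 300}
```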