Comparison of the CatBoost Classifier with Other Machine Learning Methods

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 11, 2020

Abdullahi A. Ibrahim (1), Raheem L. Ridwan (2), Muhammed M. Muhammed (3), Rabiat O. Abdulaziz (4), Ganiyu A. Saheed (5)

(1, 3) Department of Mathematical Sciences, Baze University, Abuja, Nigeria
(2) African Institute for Mathematical Sciences, Accra, Ghana
(4) Department of Energy Engineering, PAUWES, University of Tlemcen, Algeria
(5) Institute of Mathematics, University of Silesia, Katowice, Poland

Abstract: Machine learning and data-driven techniques have become popular and significant in several areas in recent times. In this paper, we discuss the performance of some machine learning methods, with a focus on the CatBoost classifier, on both loan approval and staff promotion. We compared the algorithm's performance with that of other classifiers. After some feature engineering on both datasets, the CatBoost algorithm outperforms the other classifiers implemented in this paper. In the first analysis, features such as loan amount, loan type, applicant income, and loan purpose are major factors in predicting mortgage loan approvals. In the second analysis, features such as division, foreign schooled, geopolitical zones, qualification, and working years had a high impact on staff promotion. Hence, based on the performance of CatBoost in both analyses, we recommend this algorithm for better prediction of loan approvals and staff promotion.

Keywords: Machine learning algorithms; data science; CatBoost; loan approvals; staff promotion

I. INTRODUCTION

Machine learning and data-driven techniques have become significant and popular in several areas. Some of the machine learning algorithms used in practice include support vector machine, logistic regression, CatBoost, random forest, decision tree, AdaBoost, extreme gradient boosting, gradient boosting, naive Bayes, K-nearest neighbor, and many more. In supervised machine learning, classifiers have been widely used in areas such as fraud detection, spam email filtering, and loan prediction. In this work, we look into the applications of some machine learning methods in the areas of loan prediction and staff promotion.

The issuance of loans is one of the many profit sources of financial institutions. However, default by applicants has been of major concern to credit-providing institutions [1]. Studies conducted in the past were mostly empirical, and as such the problem of default has not been definitively dealt with. The 21st century brought with it large volumes of archived data collected from years of loan applications. Statistical techniques have been developed to study past data and build models that can predict the possibility of default by loan applicants, thus providing a score of creditworthiness. The availability of such voluminous data, called Big data, necessitated the introduction of machine learning tools that can be used to discriminate among loan applicants based on creditworthiness. This study considered some of these machine learning techniques to classify loan applicants based on available data, to assess the probability of default, and to recommend the technique that yields the best performance.

Since the advent of machine learning, several studies have been conducted to discriminate among loan applicants. In Goyal and Kaur [2], the authors developed an ensemble model by aggregating Support Vector Machine (SVM), Random Forest (RF), and Tree Model for Genetic Algorithm (TMGA). The ensemble model was compared with each of these models individually and with eight other machine learning techniques, namely Linear Model (LM), Neural Network (NN), Decision Trees (DT), Bagged CART, Model Trees, Extreme Learning Machine (ELM), Multivariate Adaptive Regression Spline (MARS), and Bayesian Generalized Linear Model (BGLM), and it was concluded from the analysis that the ensemble algorithm provided an optimum result. Alomari and Fingerman [3] tried to discriminate among loan applicants by comparing six machine learning techniques. The study compared DT, RF, K-Nearest Neighbour (KNN), OneR (1R), Naïve Bayes (NB), and Artificial Neural Networks (ANN), of which Random Forest gave the best performance with an accuracy of 71.75%. In Ibrahim and Rabiat [1], four classifiers were used for prediction in a Titanic analysis and XGBoost achieved the highest accuracy. Also, Ulaga et al. [4] conducted exploratory research in which the suitability of RF was tested in classifying loan applicants, and an accuracy of 81.1% was achieved. In related research by Li [5], RF, BLR, and SVM were used to predict loan approvals, and RF outperformed the other techniques with an accuracy of 88.63%. Xia et al. [6] predicted approvals for a peer-to-peer lending system by comparing Logistic Regression (LR), Random Tree (RT), Bayesian Neural Network (BNN), RF, Gradient Boosted Decision Trees (GBDT), XGBoost, and CatBoost, and the results indicated that CatBoost gave the best performance among these classifiers. The review of past literature showed tremendous developments in the applications of machine learning classifiers and how ensemble classifiers outperform single classifiers. However, only a few studies considered the CatBoost classifier in the prediction of loan approvals; hence, this research seeks to compare eight machine learning methods, namely Binary Logistic Regression, Random Forest, AdaBoost, Decision Trees, Neural Network, Gradient Boost, Extreme Gradient Boosting, and CatBoost, in the prediction of loan approvals.

The application of machine learning to employee promotion is another area we look into. Employees play a significant role in the development of an enterprise. Employee promotion in an enterprise is a major concern to both the employer and the employee. In human resource management, staff promotion is vital for organizations to attract, employ, retain, and effectively utilize their employees' talents [7]. Promotion of staff in an organization is based on several factors, among which are age [8], gender [9], education [10], previous experience [11], and communication strategy or pattern [12]. In Long et al. [7], the authors applied some machine learning algorithms to Chinese data to predict employee promotion. It was discovered that, among all the available features in their dataset, the number of different positions occupied, the highest departmental level attained, and the number of working years affect staff promotion. In Sarkar et al. [13], joint data clustering and decision trees were used to evaluate staff promotion. Saranya et al. [14] studied why the best-performing employees quit prematurely and predicted which high-performing, valuable employees are likely to do so. The proposed algorithm was recommended to the human resource department to identify valuable employees likely to quit prematurely. Previous works showed tremendous developments in the applications of machine learning, but only a few researchers have considered the CatBoost classifier in staff promotion. This research seeks to compare four machine learning methods, namely Random Forest, Gradient Boost, Extreme Gradient Boosting, and CatBoost, in the prediction of staff promotion.

Some of this literature only discussed the applications, without emphasis on the mathematics behind the algorithms. This paper differs from others by highlighting the mathematics of the algorithm, the process of data cleaning, the application of the supervised learning algorithms, and the evaluation of these algorithms. This paper aims to develop a predictive machine learning model from supervised machine learning in the areas of loan prediction and staff promotion. To achieve this aim, we set some objectives, which are also our contributions:

• Perform the data science process, such as exploratory analysis, data cleaning, balancing, and transformation
• Develop a predictive model from machine learning methods
• Apply some model evaluation metrics to determine the performance of the ...

... or $Y = \{\text{'promoted'}, \text{'not promoted'}\}$. Logistic regression models the probability that $Y$ belongs to a specific category. One approach to predicting this probability is the linear model (1):

$$ p(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n \tag{1} $$

With (1), values $p(X) < 0$ and $p(X) > 1$ can be predicted for some values of $X$, unless the range of $X$ is limited. To avoid this, $p(X)$ must be modelled with a logistic function, which outputs values between 0 and 1. The function is defined as in (2):

$$ p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n}} \tag{2} $$

The 'maximum likelihood' method is used to fit (2). The unknown coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_n$ in (2) are estimated from the data available for training the model. The likelihood function can be expressed mathematically as in (3):

$$ \ell(\beta_0, \ldots, \beta_n) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr) \tag{3} $$

The estimates of $\beta_0, \ldots, \beta_n$ are selected to maximize this function [1]. More explanation can be obtained in [15].

Basic Assumptions of Binary Logistic Regression

(i) The response variable must be binary.
(ii) The relationship between the response feature and the independent features is not assumed to be linear.
(iii) A large sample size is usually required.
(iv) There must be little or no multicollinearity.
(v) The categories must be mutually exclusive and exhaustive.
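As a minimal sketch of how the logistic function in (2) and the likelihood in (3) fit together, the Python snippet below evaluates p(X), computes the logarithm of (3), and maximizes it with plain gradient ascent. This is an illustration only and not the authors' implementation: the function names, the learning rate, the number of steps, and the one-feature toy data are all assumptions made for the example, and gradient ascent is used simply because it is short to write.

```python
import math


def logistic_probability(beta, x):
    """Equation (2): p(X) for beta = [b0, b1, ..., bn] and x = [x1, ..., xn]."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    # Two algebraically equal forms of e^z / (1 + e^z), chosen to avoid overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_likelihood(beta, X, y):
    """Logarithm of equation (3): sum of log p(x_i) over y_i = 1
    plus sum of log(1 - p(x_i)) over y_i = 0."""
    total = 0.0
    for x, yi in zip(X, y):
        p = logistic_probability(beta, x)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


def fit_by_gradient_ascent(X, y, learning_rate=0.1, steps=2000):
    """Hypothetical helper: maximize the log-likelihood by gradient ascent."""
    n_features = len(X[0])
    beta = [0.0] * (n_features + 1)
    for _ in range(steps):
        grad = [0.0] * (n_features + 1)
        for x, yi in zip(X, y):
            err = yi - logistic_probability(beta, x)  # derivative w.r.t. the linear term
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        # Step along the average gradient.
        beta = [b + learning_rate * g / len(X) for b, g in zip(beta, grad)]
    return beta


# Invented toy data: one feature, binary response (1 = approved, 0 = not approved).
X_toy = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
y_toy = [0, 0, 1, 0, 1, 1]

beta_hat = fit_by_gradient_ascent(X_toy, y_toy)
print("estimated coefficients:", beta_hat)
print("log-likelihood at estimate:", log_likelihood(beta_hat, X_toy, y_toy))
```

Working with the logarithm of (3) turns the product into a sum, which is numerically safer and has the same maximizer. In practice a library routine would normally replace the hand-written loop.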
