Machine Learning, 59, 161–205, 2005
© 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

Logistic Model Trees∗

NIELS LANDWEHR [email protected] Institute for Computer Science, University of Freiburg, Freiburg, Germany

MARK HALL [email protected] EIBE FRANK [email protected] Department of Computer Science, University of Waikato, Hamilton, New Zealand

Editor: Johannes Fürnkranz

Abstract. Tree induction methods and linear models are popular techniques for supervised learning tasks, both for the prediction of nominal classes and numeric values. For predicting numeric quantities, there has been work on combining these two schemes into ‘model trees’, i.e. trees that contain linear regression functions at the leaves. In this paper, we present an algorithm that adapts this idea for classification problems, using logistic regression instead of linear regression. We use a stagewise fitting process to construct the logistic regression models that can select relevant attributes in the data in a natural way, and show how this approach can be used to build the logistic regression models at the leaves by incrementally refining those constructed at higher levels in the tree. We compare the performance of our algorithm to several other state-of-the-art learning schemes on 36 benchmark UCI datasets, and show that it produces accurate and compact classifiers.

Keywords: model trees, logistic regression, classification

1. Introduction

Two popular methods for classification are linear logistic regression and tree induction, which have somewhat complementary advantages and disadvantages. The former fits a simple (linear) model to the data, and the process of model fitting is quite stable, resulting in low variance but potentially high bias. The latter, on the other hand, exhibits low bias but often high variance: it searches a less restricted space of models, allowing it to capture nonlinear patterns in the data, but making it less stable and prone to overfitting. So it is not surprising that neither of the two methods is superior in general—earlier studies (Perlich, Provost, & Simonoff, 2003) have shown that their relative performance depends on the size and the characteristics of the dataset (e.g., the signal-to-noise ratio). It is a natural idea to try to combine these two methods into learners that rely on simple regression models if only little and/or noisy data is available and add a more complex tree structure if there is enough data to warrant such structure. For the case of predicting a numeric variable, this has led to ‘model trees’, which are decision trees with linear regression models at the leaves. These have been shown to produce good results (Quinlan,

∗This is an extended version of a paper that appeared in the Proceedings of the 14th European Conference on Machine Learning (Landwehr et al., 2003).

1992). Although it is possible to use model trees for classification tasks by transforming the classification problem into a regression task by binarizing the class (Frank et al., 1998), this approach produces several trees (one per class) and thus makes the final model harder to interpret. A more natural way to deal with classification tasks is to use a combination of a tree structure and logistic regression models resulting in a single tree. Another advantage of using logistic regression is that explicit class probability estimates are produced rather than just a classification. In this paper, we present a method, called LMT (Logistic Model Trees), that follows this idea. We discuss a new scheme for selecting the attributes to be included in the logistic regression models, and introduce a way of building the logistic models at the leaves by refining logistic models that have been trained at higher levels in the tree, i.e. on larger subsets of the training data. We evaluate the performance of LMT on 36 datasets taken from the UCI repository (Blake & Merz, 1998). Included in the experiments are the standard decision tree learners C4.5 (Quinlan, 1993) and CART (Breiman et al., 1984), linear logistic regression, and other tree-based classifiers, such as boosted C4.5, model trees fit to the class indicator variables (Frank et al., 1998), functional trees (Gama, 2004), naive Bayes trees (Kohavi, 1996), and a different algorithm for building logistic model trees: Lotus (Chan & Loh, 2004). The experiments show that LMT produces more accurate classifiers than C4.5, CART, logistic regression, model trees, functional trees, naive Bayes trees and Lotus. It is competitive with boosted decision trees, which are considered to be one of the best ‘off the shelf’ classification systems, while producing models that are easier to interpret. We also present empirical evidence that LMT smoothly adapts the tree size to the complexity of the data set. The rest of the paper is organized as follows. In Section 2 we briefly discuss the two learning methods that LMT is based on: tree induction and logistic regression. Section 3 discusses related work on tree-based learning. In Section 4 we present the LMT algorithm for learning logistic model trees. Section 5 describes our experimental study, followed by a discussion of results. Finally, we draw some conclusions in Section 6.

2. Tree induction and logistic regression

This section discusses the two basic approaches to learning that our method is based upon: tree induction and logistic regression. We briefly introduce the process of tree induction, discuss the application of regression to classification tasks, and then describe our implementation of logistic regression.

2.1. Tree induction

The goal of supervised learning is to find a subdivision of the instance space into regions labeled with one of the target classes. Top-down tree induction finds this subdivision by recursively splitting the instance space, stopping when the regions of the subdivision are reasonably ‘pure’ in the sense that they contain examples with mostly identical class labels. The regions are labeled with the majority class of the examples in that region.

Figure 1. The artificial ‘polynomial-noise’ dataset and the uncorrupted class boundary.

Figure 2. Subdivisions of increasing complexity for the ‘polynomial-noise’ dataset, generated by (from left to right) a decision stump learner, C4.5 with the ‘minimum instances’ parameter set to 20, and C4.5 with standard options. Colors (from light to dark) indicate class probability estimates in the different regions.

Important advantages of tree models (with axis-parallel splits) are that they can be constructed efficiently and are easy to interpret. A path in a decision tree basically corresponds to a conjunction of boolean expressions of the form ‘attribute = value’ (for nominal attributes) or ‘attribute ≤ value’ (for numeric attributes), so a tree can be seen as a set of rules that say how to classify instances. The goal of tree induction is to find a subdivision that is fine enough to capture the structure in the underlying domain but does not fit random patterns in the training data. As an example, Figure 1 shows a sample of 500 instances from an artificial domain, namely the sign-boundary of the function

f(x_1, x_2) = x_1^2 + x_1 + x_2 + e,

a polynomial of the two attributes x1, x2 that is corrupted by Gaussian noise e. The function was uniformly sampled in [−1, 1]². The original decision boundary of the polynomial (without noise) is also given (black/white region). We refer to this dataset as the ‘polynomial-noise’ dataset; it will be used again later. Figure 2 shows three subdivisions of the R² instance space for the ‘polynomial-noise’ dataset, generated by a decision stump learner (i.e. a one-level decision tree), C4.5 (Quinlan, 1993) with the ‘minimum instances’ parameter set to 20, and C4.5 with standard options. They are increasingly more complex; in this case, the center one would probably be adequate, while the rightmost one clearly overfits the examples.
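For concreteness, the following Python snippet (ours, not from the paper) generates a dataset of this kind. The sample size of 500 and the uniform sampling over [−1, 1]² follow the description above, while the noise standard deviation of 0.5 is an arbitrary assumption, since the paper does not state it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample 500 points uniformly from [-1, 1]^2, as described in the text.
n = 500
X = rng.uniform(-1.0, 1.0, size=(n, 2))
x1, x2 = X[:, 0], X[:, 1]

# Polynomial corrupted by Gaussian noise; the noise level is not given in
# the paper, so sigma = 0.5 is an arbitrary choice for illustration.
e = rng.normal(0.0, 0.5, size=n)
f = x1 ** 2 + x1 + x2 + e

# Class label = sign of the noisy polynomial; the uncorrupted boundary
# is the curve x1^2 + x1 + x2 = 0.
y = (f > 0).astype(int)
print(np.bincount(y))  # rough class balance
```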

The usual approach to the problem of finding the best number of splits is to first perform many splits (build a large tree) and afterwards use a ‘pruning’ scheme to undo some of these splits. Different pruning schemes have been proposed. For example, C4.5 uses a statistically motivated estimate for the true error given the error on the training data, while the CART (Breiman et al., 1984) method cross-validates a ‘cost-complexity’ parameter that assigns a penalty to large trees.

2.2. Classification via regression

The term ‘regression’ sometimes refers to a particular kind of parametric model for estimating a numeric target variable, and sometimes to the process of estimating a numeric target variable in general (as opposed to a discrete one). For the moment, we take the latter meaning—we explain how to solve a classification problem with a learner that can only produce estimates for a numeric target variable. Assume we have a class variable G that takes on values 1, . . . , J. The idea is to transform this class variable into J numeric ‘indicator’ variables G1, . . . , GJ to which the regression learner can be fit. The indicator variable Gj for class j takes on value 1 whenever class j is present and value 0 everywhere else. A separate model is then fit to every indicator variable Gj using the regression learner. When classifying an unseen instance, predictions u1, . . . , uJ are obtained from the numeric estimators fit to the class indicator variables, and the predicted class is

j^* = \arg\max_{j} u_j.

We will use this transformation process several times, for example when using model trees for classification. By transforming a classification task into a regression problem in this fashion, we can use a standard linear regression model for classification. Linear regression fits a parameter vector β to a numeric target variable to form a model

f(x) = \beta^T x,

where x is the vector of attribute values for the instance (we assume a constant component in the input vector to accommodate the intercept). The model is fit to minimize the squared error:

\beta^* = \arg\min_{\beta} \sum_{i=1}^{n} (f(x_i) - y_i)^2,

where we have n training instances xi that have target values yi. However, this approach has some disadvantages. Usually, the predictions given by the regression functions fit to the class indicator variables are not confined to [0, 1] and can even become negative. Besides, the approach is known to suffer from masking problems in the multiclass case: even if the class regions of the instance space are linearly separable, two classes can ‘mask’ a third one such that the learned model cannot separate it from the other two—see for example (Hastie, Tibshirani, & Friedman, 2001).
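To make the scheme concrete, here is a minimal numpy sketch of classification via regression on class indicator variables; the function names are ours, and nothing is done about the masking problem mentioned above.

```python
import numpy as np

def fit_indicator_regressions(X, y, n_classes):
    """Fit one least-squares linear model per class indicator variable.

    X is an (n, a) matrix of attribute values, y an array of class labels
    in {0, ..., n_classes - 1}.  A constant column is appended to model
    the intercept, as assumed in the text.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    betas = []
    for j in range(n_classes):
        g_j = (y == j).astype(float)              # indicator variable G_j
        beta_j, *_ = np.linalg.lstsq(Xb, g_j, rcond=None)
        betas.append(beta_j)
    return np.array(betas)                        # shape (J, a + 1)

def predict(betas, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    U = Xb @ betas.T                              # u_1, ..., u_J per instance
    return np.argmax(U, axis=1)                   # j* = argmax_j u_j
```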

2.3. Logistic regression

A better way to use regression for classification tasks is to use a logistic regression model that models the posterior class probabilities Pr(G = j | X = x) for the J classes. Given estimates for the class probabilities, we can classify unseen instances by

j^* = \arg\max_{j} \Pr(G = j \mid X = x).

Logistic regression models these probabilities using linear functions in x while at the same time ensuring they sum to one and remain in [0, 1]. The model is specified in terms of J −1 log-odds that separate each class from the ‘base class’ J:

\log \frac{\Pr(G = j \mid X = x)}{\Pr(G = J \mid X = x)} = \beta_j^T x, \qquad j = 1, \ldots, J - 1,

or, equivalently,

\Pr(G = j \mid X = x) = \frac{e^{\beta_j^T x}}{1 + \sum_{l=1}^{J-1} e^{\beta_l^T x}}, \qquad j = 1, \ldots, J - 1,

\Pr(G = J \mid X = x) = \frac{1}{1 + \sum_{l=1}^{J-1} e^{\beta_l^T x}}.
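As a small illustration of these formulas, the following sketch (ours) computes the posterior class probabilities from the parameter vectors; it assumes the constant component for the intercept is already included in x.

```python
import numpy as np

def class_probabilities(betas, x):
    """Posterior class probabilities of the linear logistic model.

    betas is a (J - 1, d) array of parameter vectors, one per non-base
    class; x is a feature vector that already contains the constant
    component.  Returns probabilities for classes 1, ..., J, with the
    base class J in the last position.
    """
    scores = betas @ x                      # beta_j^T x for j = 1, ..., J-1
    denom = 1.0 + np.exp(scores).sum()
    probs = np.append(np.exp(scores), 1.0) / denom
    return probs                            # sums to one by construction
```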

Note that this model still produces linear boundaries between the regions in the instance space corresponding to the different classes. For example, the points x lying on the boundary between a class j and the class J are those for which Pr(G = j | X = x) = Pr(G = J | X = x), which is equivalent to the log-odds being zero. Since the equation for the log-odds is linear in x, this class boundary is effectively a hyperplane. The formulation of the logistic model given here uses the last class as the base class in the odds-ratios; however, the choice of the base class is arbitrary in that the estimates are equivariant under this choice. Fitting a logistic regression model means estimating the parameter vectors βj. The standard procedure in statistics is to look for the maximum likelihood estimate: choose the parameters that maximize the probability of the observed data points. For the logistic regression model, there are no closed-form solutions for these estimates. Instead, we have to use numeric optimization algorithms that approach the maximum likelihood solution iteratively and reach it in the limit. In a recent paper that links boosting algorithms like AdaBoost to additive modeling in statistics, Friedman et al. propose the LogitBoost algorithm for fitting additive logistic regression models by maximum likelihood (Friedman, Hastie, & Tibshirani, 2000). These models are a generalization of the (linear) logistic regression models described above. Generally, they have the form

\Pr(G = j \mid X = x) = \frac{e^{F_j(x)}}{\sum_{k=1}^{J} e^{F_k(x)}}, \qquad \sum_{k=1}^{J} F_k(x) = 0,

where F_j(x) = \sum_{m=1}^{M} f_{mj}(x) and the fmj are (not necessarily linear) functions of the input variables. Indeed, the authors show that if regression trees are used as the fmj, the resulting algorithm has strong connections to boosting decision trees with algorithms like AdaBoost.

Figure 3 gives the pseudocode for the algorithm. The variables y∗ij encode the observed class membership probabilities for instance xi, i.e.

y_{ij}^* = \begin{cases} 1 & \text{if } y_i = j, \\ 0 & \text{if } y_i \neq j \end{cases} \qquad (1)

(recall that yi is the class label of example xi). The pj(x) are the estimates of the class probabilities for an instance x given by the model fit so far. LogitBoost performs forward stagewise fitting: in every iteration, it computes ‘response variables’ zij that encode the error of the currently fit model on the training examples (in

Figure 3. LogitBoost algorithm (Friedman, Hastie, & Tibshirani, 2000).

terms of probability estimates), and then tries to improve the model by adding a function fmj to the committee Fj, fit to the response by least-squares error. As shown in Friedman, Hastie, and Tibshirani (2000), this amounts to performing a quasi-Newton step in every iteration, where the Hessian matrix is approximated by its diagonal. Any class of functions fmj can be used as the ‘weak learner’ in the algorithm, as long as they are fit by a (weighted) least-squares regression. Depending on the class of functions, we get a more expressive or more restricted overall model. In the special case that the fmj(x) and so the Fj(x) are linear functions of the input variables, the additive logistic regression model is equivalent to the linear logistic model introduced above. Assuming that F_j(x) = \alpha_j^T x, the equivalence of the two models is established by setting \beta_j = \alpha_j - \alpha_J for j = 1, \ldots, J - 1. Note that the condition \sum_{k=1}^{J} F_k(x) = 0 is for stability only; adding a constant to all Fk(x) does not change the model.

This means we can use the LogitBoost algorithm to learn linear logistic regression models, by fitting a standard least-squares regression function as the fmj in step 2(a)ii. of the algorithm. In fact, in the two-class case this algorithm is equivalent to the standard ‘iterative reweighted least squares’ method used for fitting linear logistic regression models (Hastie, Tibshirani, & Friedman, 2001).

2.3.1. Attribute selection. Typical real-world data includes various attributes, only a few of which are actually relevant to the true target concept. If non-relevant attributes are included in, for example, a logistic regression model, they will usually allow the training data to be fitted with a smaller error, because there is by chance some correlation between the class labels and the values of these attributes for the training data. They will not, however, increase predictive power over unseen cases, and can sometimes even significantly reduce accuracy. Furthermore, including attributes that are not relevant will make it harder to understand the structure of the domain by looking at the final model, because it is ‘distorted’ by the influence of these attributes. Therefore, it is important to find some way to select the most relevant attributes to include in the logistic regression models.

When we say that we fit a linear regression function fmj by least-squares regression in a LogitBoost iteration, we may consider a multiple linear regression that makes use of all the attributes. However, it is also possible to use even simpler functions for the fmj: simple regression functions that perform a regression on only one attribute present in the training data. Fitting a simple regression by least-squares error means fitting a simple regression function to each attribute in the data using least-squares as the error criterion, and then selecting the attribute that gives the smallest squared error. Because every multiple linear regression can be expressed as a sum of simple linear regression functions, the general model does not change if we use simple instead of multiple regression for the fmj. Furthermore, the final model found by LogitBoost will be the same, because quasi-Newton stepping is guaranteed to actually find the maximum likelihood solution if the likelihood function is convex, which it is for linear logistic regression.
Using simple regression functions instead of multiple ones will basically slow down the learning process, building gradually more complex models that include more and more attributes. However, all this only holds provided the algorithm is run until convergence (i.e., until the likelihood does not change anymore between two successive iterations). If it is stopped before it converges to the maximum likelihood solution, using simple regression will result in automatic attribute selection, because the model will only include the most relevant attributes present in the data. The stopping criterion can be based on cross-validation: only perform more iterations (and include more attributes) if this actually improves predictive accuracy over unseen instances.

On the other hand, slowing down the model fitting process can lead to higher computational costs. Although fitting a simple regression is computationally more efficient than fitting a multiple regression model, it could be necessary to consider the same attribute multiple times if the overall model has changed because other attributes have been included. This means many iterations have to be performed before the algorithm converges to a reasonable model. The computational complexity of a simple linear regression on one attribute is O(n), so one iteration of LogitBoost would take O(n · a) because we have to build a simple regression model on all attributes in order to find out which one is the best (where n denotes the number of training examples and a the number of attributes present in the data). The computational complexity for performing a multiple regression is O(n · a² + a³).¹ The relative speed of the two methods depends on how many LogitBoost iterations are required when using simple regression functions, but it is reasonable to expect that using multiple regression does converge faster. We decided to use simple regression functions in our implementation because that approach improved predictive accuracy and significantly reduced the number of attributes included in the final model for some datasets (see Section 5.3 for an empirical comparison of this method to building a ‘full’ logistic model on all attributes). Note that we used simple regression in both the logistic model tree algorithm LMT that builds logistic regression functions at the nodes of a decision tree (see Section 4) and the standalone logistic regression learner we use as a benchmark in our experimental evaluation.

We determine the optimum number of LogitBoost iterations by a five-fold cross-validation: we split the data five times into training and test sets, run LogitBoost on every training set up to a maximum number of iterations (500) and log the classification error on the respective test set. Afterwards, we run LogitBoost again on all the data using the number of iterations that gave the smallest error on the test set, averaged over the five folds. We will refer to this implementation as SimpleLogistic.
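The following sketch illustrates this stagewise fitting with one-attribute regressions for the two-class case (labels 0/1, numeric attributes only). The function names, the fixed number of iterations and the clipping of the weights are our own simplifications; the actual SimpleLogistic implementation selects the number of iterations by the five-fold cross-validation described above and handles more than two classes.

```python
import numpy as np

def weighted_simple_regression(x, z, w):
    """Weighted least-squares fit of z on a single attribute x.

    Returns (intercept, slope, weighted squared error)."""
    sw = w.sum()
    xm, zm = (w * x).sum() / sw, (w * z).sum() / sw
    var = (w * (x - xm) ** 2).sum()
    slope = 0.0 if var == 0 else (w * (x - xm) * (z - zm)).sum() / var
    intercept = zm - slope * xm
    err = (w * (z - intercept - slope * x) ** 2).sum()
    return intercept, slope, err

def simple_logistic_two_class(X, y, n_iter=50):
    """Two-class LogitBoost with simple linear regression as the weak learner.

    y must contain 0/1 labels.  Returns a list of (attribute, intercept,
    slope) triples; the committee F(x) is half the sum of the selected
    simple regressions, and p(x) = 1 / (1 + exp(-2 F(x))).
    """
    n, a = X.shape
    F = np.zeros(n)
    committee = []
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-2.0 * F))
        w = np.clip(p * (1.0 - p), 1e-10, None)    # LogitBoost weights
        z = (y - p) / w                            # working response
        # pick the attribute whose simple regression gives the smallest
        # weighted squared error
        best = min((weighted_simple_regression(X[:, j], z, w) + (j,)
                    for j in range(a)), key=lambda t: t[2])
        b0, b1, _, j = best
        committee.append((j, b0, b1))
        F += 0.5 * (b0 + b1 * X[:, j])
    return committee
```

Because each iteration adds at most one new attribute, stopping this loop early yields exactly the automatic attribute selection discussed above.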

2.3.2. Handling nominal attributes and missing values. In real-world domains important information is often carried by nominal attributes whose values are not necessarily ordered in any way and thus cannot be treated as numeric (for example, the make of a car in the ‘autos’ dataset from the UCI repository). However, the regression functions used in the LogitBoost algorithm can only be fit to numeric attributes, so we have to convert those attributes to numeric ones. We followed the standard approach for doing this: a nominal attribute with k values is converted into k numeric indicator attributes, where the l-th indicator attribute takes on value 1 whenever the original attribute takes on its l-th value and value 0 everywhere else. Note that a disadvantage of this approach is that it can lead to a high number of attributes presented to the logistic regression if the original attributes each have a high number of distinct values. It is well-known that a high dimensionality of the input data (in relation to the number of training examples) increases the danger of overfitting. On such datasets, attribute selection techniques like the one implemented in SimpleLogistic will be particularly important.

Another problem with real-world datasets is that they often contain missing values, i.e. instances for which not all attribute values are observed. For example, an instance could describe a patient and attributes correspond to results of medical tests. For a particular patient results might only be available for a subset of all tests. Missing values can occur both during training and when predicting the class of an unseen instance. The regression functions that have to be fit in an iteration of LogitBoost cannot directly handle missing values, so one has to fill in the missing values for such instances. We used a simple global scheme for this: at training time, we calculate the mean (for numeric attributes) or the mode (for nominal attributes) of the values for each attribute and use these to replace missing values in the training data. When classifying unseen instances with missing values, the same mean/mode is used to fill in the missing value.
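A minimal sketch of these two preprocessing steps, indicator encoding of a nominal attribute and global mean/mode imputation; the helper names are ours, and missing values are represented as None for illustration.

```python
import numpy as np
from collections import Counter

def binarize_nominal(values, categories):
    """Convert one nominal attribute into k 0/1 indicator attributes."""
    return np.array([[1.0 if v == c else 0.0 for c in categories]
                     for v in values])

def impute_column(values, nominal):
    """Replace missing entries (None) by the mode (nominal attribute) or the
    mean (numeric attribute), computed on the training data as described
    in the text.  Returns the filled column and the fill value, which is
    reused when classifying unseen instances."""
    observed = [v for v in values if v is not None]
    fill = (Counter(observed).most_common(1)[0][0] if nominal
            else sum(observed) / len(observed))
    return [fill if v is None else v for v in values], fill
```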

3. Related tree-based learning schemes

Starting from simple decision trees, several advanced tree-based learning schemes have been developed. In this section we will describe some of the methods related to logistic model trees, to show what our work builds on and where we improve on previous solutions. Some of the related methods will also be used as benchmarks in our experimental study, described in Section 5.

3.1. Model trees

This section describes the ‘model tree’ algorithm developed by Quinlan, which combines regression and tree induction for tasks where the target variable to be predicted is numeric (Quinlan, 1992). The logistic model trees developed in this paper are an analogue to model trees for categorical target variables, so a description of model trees is a good starting point for understanding our method. Model trees, like ordinary regression trees, predict a numeric value for an instance that is defined over a fixed set of numeric or nominal attributes. Unlike ordinary regression trees, model trees construct a piecewise linear (instead of a piecewise constant) approximation to the target function. The final model tree consists of a tree with linear regression functions at the leaves (Frank et al., 1998), and the prediction for an instance is obtained by sorting it down to a leaf and using the prediction of the linear model associated with that leaf. The M5’ model tree algorithm (Wang & Witten, 1997), which is a ‘rational reconstruction’ of Quinlan’s M5 algorithm (Quinlan, 1992), constructs trees as follows. First, after all nominal attributes have been replaced by binary ones, an unpruned regression tree is grown, using variance reduction as the splitting criterion. Then, linear regression models are placed at every node of the tree, where the attributes considered in the regression are restricted to those that occur in the subtree rooted at the corresponding node. Further attribute selection in the linear models is performed by greedily dropping terms to minimize an error estimate that introduces a penalty for every parameter used in the model. Once all linear models are in place, subtrees are considered for replacement based on the final error estimate for each linear model.

At prediction time, the algorithm generates a ‘smoothed’ output by averaging the prediction of the linear model at a leaf node with the predictions obtained from the models on the path from that leaf to the root. The smoothing heuristic effectively performs a linear combination of linear models, which can be written as a linear model itself. Hence it is possible to achieve the same effect by replacing the original unsmoothed model at each leaf node with a smoothed version (Frank et al., 1998). Model trees have been shown to produce good results for numeric prediction problems (Wang & Witten, 1997). They have also been successfully applied to classification problems using the transformation described in Section 2.2 (Frank et al., 1998). In our experimental section, we will give results for this M5’ for classification algorithm and compare it to our method.

3.2. Stepwise model tree induction

In this section, we will briefly discuss a different algorithm for inducing (numeric) model trees called ‘Stepwise Model Tree Induction’ or SMOTI (Malerba et al., 2002) that builds on an earlier system called TSIR (Lubinsky, 1994). Although we are more concerned with classification problems, SMOTI uses a scheme for constructing the linear regression functions associated with the leaves of the model tree that is related to the way our method builds the logistic regression functions at the leaves of a logistic model tree. The idea is to construct the final multiple regression function at a leaf from simple regression functions that are fit at different levels in the tree, from the root down to that particular leaf. This means that the final regression function takes into account ‘global’ effects of some of the variables—effects that were not inferred from the examples at that leaf but from some superset of examples found on the path to the root of the tree. An advantage of this technique is that only simple linear regressions have to be fitted at the nodes of the tree, which is faster than fitting a multiple regression every time (that has to estimate the global influences again and again at the different nodes). The global effects should also smooth the predictions because there will be less extreme discontinuities between the linear functions at adjacent leaves if some of their coefficients have been estimated from the same (super)set of examples. To implement these ideas, SMOTI trees consist of two types of nodes: split nodes and regression nodes. Split nodes partition the sample space in the usual way, while regression nodes perform simple linear regression on one attribute. A regression node fits a simple regression to the examples passed down to it from the parent node, and passes on a modified version of the examples to its only child node, removing the linear effect of the attribute used in the simple regression. This means the model at a leaf of the tree is constructed incrementally, adding more and more variables to it at the different regression nodes on the path to the leaf while the tree is grown. Our method uses a similar scheme for constructing the logistic regression models at the leaves: the simple regression functions produced in the iterations of the LogitBoost algorithm are fit on the nested sequence of sets of examples associated with the nodes on the path from the leaf to the root of the tree. Note, however, that our method is not restricted to a single simple linear model at each node. We will give a detailed description of this in Section 4.
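The following fragment (ours) illustrates the idea of a regression node in this style; for brevity it residualizes only the target variable, whereas the paper describes passing on a modified version of the examples more generally.

```python
import numpy as np

def regression_node(X, y, attribute):
    """Sketch of a regression node: fit a simple linear regression of y on
    one attribute, then pass residual targets on to the child node so that
    later regressions only model what is left unexplained."""
    x = X[:, attribute]
    slope, intercept = np.polyfit(x, y, deg=1)
    residual = y - (intercept + slope * x)        # linear effect removed
    return (attribute, intercept, slope), residual
```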

3.3. Logistic regression trees with unbiased selection

Lotus (Chan & Loh, 2004) is a logistic regression tree learner for two-class problems that has come from the statistics community. The algorithm constructs (binary) logistic regression trees in a top-down fashion, emphasizes the importance of unbiased split variable selection through the use of a modified chi-square test, and uses only numeric attributes for constructing logistic models. Lotus can fit either multiple or simple logistic regressions at the nodes. After the initial tree is grown, it is pruned back using a pruning method similar to the one employed in the CART algorithm (Breiman et al., 1984). The idea is to use a ‘cost-complexity measure’ that combines the error of the tree on the training data with a penalty term for the model complexity, as measured by the number of terminal nodes. The cost-complexity measure in CART is based on the misclassification error of a (sub)tree, whereas in Lotus it is based on the deviance. The deviance of a set of instances M is defined as

\mathrm{deviance} = -2 \cdot \log P(M \mid T),

where P(M | T) denotes the probability of the data M as a function of the current model T (which is the tree being constructed).
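In code, the deviance of a node’s instances could be computed as below; model.prob is a hypothetical interface returning the tree’s probability estimate for the true class of an instance.

```python
import math

def deviance(instances, model):
    """Deviance of a set of labeled instances under the current tree model.

    `model.prob(x, y)` is assumed to return the model's probability
    estimate for the true class y of instance x."""
    log_likelihood = sum(math.log(model.prob(x, y)) for x, y in instances)
    return -2.0 * log_likelihood
```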

3.4. Functional trees

The LTree algorithm embodies a general framework for learning functional trees (Gama, 2004)—that is, multivariate classification or regression trees that can use combinations of attributes at decision nodes, leaf nodes, or both. The algorithm uses a standard top-down recursive partitioning strategy to construct a decision tree. Splitting at each node is univariate, but considers both the original attributes in the data and new attributes constructed using an attribute constructor function: multiple linear regression in the regression setting and linear discriminants or multiple logistic regression in the classification setting. The value of each new attribute is the prediction of the constructor function for each example that reaches the node. In the classification case, one new attribute is created for each class and the values are predicted probabilities. In the regression case, a single new attribute is created. In this way the algorithm considers oblique splits based on combinations of attributes in addition to standard axis-parallel splits based on the original attributes. For split point selection, information gain is used in the classification case and variance reduction in the regression case. Once a tree has been grown, it is pruned back using a bottom-up procedure. At each non-leaf node three possibilities are considered: performing no pruning (i.e., leaving the subtree rooted at the node in place), replacing the node with a leaf that predicts a constant, or replacing it with a leaf that predicts the value of the constructor function that was learned at the node during tree construction. C4.5’s error-based criterion (Quinlan, 1993) is used to make the decision.

Predicting a test instance using a functional tree is accomplished by traversing the tree from the root to a leaf. At each decision node the local constructor function is used to extend the set of attributes and the decision test determines the path that the instance will follow. Once a leaf is reached, the instance is classified using either the constant or the constructor function at that leaf (depending on what was put in place during the pruning procedure).

3.5. Naive Bayes trees

NBTree is an algorithm that constructs decision trees with naive Bayes models at the leaves (Kohavi, 1996). A tree is grown in a top-down fashion with univariate splits chosen according to information gain. A pre-pruning strategy is employed during the construction of the tree that considers, at each node, whether the data at that node should be split, or a leaf created that contains a local naive Bayes model trained on the data at that node. The pruning decision at each node is made by comparing the cross-validated accuracy (computed using discretized data) of the local naive Bayes model at a node to the weighted sum of the accuracy of the leaves resulting from splitting at the node. A split is considered if there are at least 30 instances at a node and the relative reduction in error gained by splitting is greater than five percent. It has been shown that naive Bayes trees often improve performance over standard decision trees or naive Bayes (Kohavi, 1996), although there are only a few cases where they improve significantly over both.

3.6. Boosting trees

A well-known technique to improve the classification accuracy of tree-based classifiers is the boosting procedure. The idea of boosting is to combine the predictions of many ‘weak’ classifiers to form a powerful ‘committee’. The weak classifiers are trained on reweighted versions of the training data, such that training instances that have been misclassified by the classifiers built so far receive a higher weight and the new classifier can concentrate on these ‘hard’ instances. Although a variety of boosting algorithms have been developed, we will here concentrate on the popular AdaBoost.M1 algorithm (Freund & Schapire, 1996). The algorithm starts with equal weights assigned to all instances in the training set. One weak classifier (for example, a C4.5 decision tree) is built and the data is reweighted such that correctly classified instances receive a lower weight: their weights are updated by

\mathrm{weight} \leftarrow \mathrm{weight} \cdot \frac{e}{1 - e},

where e is the weighted error of the classifier on the current data. In a second step, the weights are renormalized such that their sum remains unchanged. This is repeated until the error e of a classifier reaches zero or exceeds 0.5 (or some pre-defined maximum for the number of boosting iterations is reached).

This procedure yields a set of classifiers with corresponding error values, which are used to predict the class of an unseen instance at classification time by a weighted majority vote. The vote of a classifier with error e is weighted by

\alpha = -\log \frac{e}{1 - e}.

For each class, the weights of the classifiers that vote for it are summed up, and the class with the largest sum of votes is chosen as the predicted class. Boosting trees has received a lot of attention, and has been shown to outperform simple classification trees on many real-world domains. In fact, boosted decision trees are considered one of the best ‘off-the-shelf’ classifiers (learners that are not optimized with regard to a particular domain). On the other hand, boosted trees have some disadvantages compared to simple classification trees. One obvious disadvantage is the higher computational complexity, because the basic tree induction algorithm has to be run several times. But since basic tree induction is very fast, it is still feasible to build boosted models for most datasets. A more serious disadvantage is the reduced interpretability of a committee of trees as compared to a single tree. The interpretation of a tree as a set of rules does not translate to a whole set of trees which produce a classification by a weighted majority vote. However, information contained in the individual trees can still be used to yield some insight into the data; for example, the frequency of attributes occurring in the trees can tell us something about the relevance of that attribute for the class variable (see e.g. Hastie, Tibshirani, & Friedman, 2001).
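A compact sketch of the AdaBoost.M1 loop described above; build_classifier and classify are hypothetical stand-ins for the weak learner (e.g. a C4.5 tree) and its prediction method.

```python
import math

def adaboost_m1(train, build_classifier, max_iter=10):
    """Sketch of AdaBoost.M1.  `train` is a list of (x, y) pairs;
    `build_classifier(train, weights)` returns an object with a
    `classify(x)` method.  Returns (classifier, alpha) pairs for the
    weighted majority vote."""
    weights = [1.0 / len(train)] * len(train)
    committee = []
    for _ in range(max_iter):
        clf = build_classifier(train, weights)
        e = sum(w for (x, y), w in zip(train, weights) if clf.classify(x) != y)
        if e == 0 or e >= 0.5:
            break
        # lower the weight of correctly classified instances ...
        weights = [w * (e / (1 - e)) if clf.classify(x) == y else w
                   for (x, y), w in zip(train, weights)]
        # ... and renormalize so the weights keep their original sum
        total = sum(weights)
        weights = [w / total for w in weights]
        committee.append((clf, -math.log(e / (1 - e))))
    return committee

def vote(committee, x):
    """Weighted majority vote over the committee."""
    scores = {}
    for clf, alpha in committee:
        label = clf.classify(x)
        scores[label] = scores.get(label, 0.0) + alpha
    return max(scores, key=scores.get)
```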

4. Logistic model trees

In this section, we will present our Logistic Model Tree algorithm, or LMT for short. It combines the logistic regression models described in Section 2 with tree induction, and thus is an analogue of model trees for classification problems.

4.1. The model

A logistic model tree basically consists of a standard decision tree structure with logistic regression functions at the leaves, much like a model tree is a regression tree with regression functions at the leaves. As in ordinary decision trees, a test on one of the attributes is associated with every inner node. For a nominal (enumerated) attribute with k values, the node has k child nodes, and instances are sorted down one of the k branches depending on their value of the attribute. For numeric attributes, the node has two child nodes and the test consists of comparing the attribute value to a threshold: an instance is sorted down the left branch if its value for that attribute is smaller than the threshold and sorted down the right branch otherwise. More formally, a logistic model tree consists of a tree structure that is made up of a set of inner or non-terminal nodes N and a set of leaves or terminal nodes T. Let S denote the whole instance space, spanned by all attributes that are present in the data. Then the tree structure gives a disjoint subdivision of S into regions St, and every region is represented by a leaf in the tree:

S = \bigcup_{t \in T} S_t, \qquad S_t \cap S_{t'} = \emptyset \ \text{ for } t \neq t'.

Unlike ordinary decision trees, the leaves t ∈ T have an associated logistic regression function ft instead of just a class label. The regression function ft takes into account a subset Vt ⊆ V of all attributes present in the data (where we assume that nominal attributes have been binarized for the purpose of regression), and models the class membership probabilities as

\Pr(G = j \mid X = x) = \frac{e^{F_j(x)}}{\sum_{k=1}^{J} e^{F_k(x)}},

where

F_j(x) = \alpha_0^j + \sum_{v \in V_t} \alpha_v^j \cdot v,

or, equivalently,

F_j(x) = \alpha_0^j + \sum_{k=1}^{m} \alpha_{v_k}^j \cdot v_k

if \alpha_{v_k}^j = 0 for v_k \notin V_t. The model represented by the whole logistic model tree is then given by

f(x) = \sum_{t \in T} f_t(x) \cdot I(x \in S_t),

where I(x ∈ St) is 1 if x ∈ St and 0 otherwise.

Note that both standalone logistic regression and ordinary decision trees are special cases of logistic model trees: the first is a logistic model tree pruned back to the root, the second a tree in which Vt = ∅ for all t ∈ T. Ideally, we want our algorithm to adapt to the dataset in question: for small datasets where a simple linear model offers the best bias-variance tradeoff, the logistic model ‘tree’ should just consist of a single logistic regression model, i.e. be pruned back to the root. For other datasets, a more elaborate tree structure is adequate.

The same reasoning also applies to the subsets of the original dataset that are encountered while building the tree. Recall that tree induction works in a divide-and-conquer fashion: a classifier for a set of examples is built by performing a split and then building separate classifiers for the two resulting subsets. There is strong evidence that building trees for very small datasets is usually not a good idea; it is better to use simpler models (like logistic regression) (Perlich, Provost, & Simonoff, 2003).

Figure 4. Class probability estimates from a C4.5 and a LMT model for the ‘polynomial-noise’ dataset. Colors (from light to dark) indicate class probability estimates in the different regions.

Because the subsets encountered at lower levels in the tree become smaller and smaller, it can be preferable at some point to build a linear logistic model instead of calling the tree growing procedure recursively. This is one motivation for the logistic model tree algorithm.

Figure 4 visualizes the class probability estimates of a logistic model tree and a C4.5 decision tree for the ‘polynomial-noise’ dataset introduced in Section 2.1. The logistic model tree initially divides the instance space into 3 regions and uses logistic regression functions to build the (sub)models within the regions, while the C4.5 tree partitions the instance space into 12 regions. It is evident that the tree built by C4.5 overfits some patterns in the training data, especially in the lower-right region of the instance space. Figures 5 and 6 depict the corresponding models. At the leaves of the logistic model tree, the functions F1, F2 determine the class membership probabilities by

\Pr(G = 1 \mid X = x) = \frac{e^{F_1(x)}}{e^{F_1(x)} + e^{F_2(x)}}, \qquad \Pr(G = 2 \mid X = x) = \frac{e^{F_2(x)}}{e^{F_1(x)} + e^{F_2(x)}}.

The entire left subtree of the root of the ‘original’ C4.5 tree has been replaced in the logistic model tree by the linear model with

F_1(x) = -0.39 + 5.84 \cdot x_1 + 4.88 \cdot x_2

F_2(x) = 0.39 - 5.84 \cdot x_1 - 4.88 \cdot x_2 = -F_1(x)

Note that this logistic regression function models a similar influence of the attributes x1, x2 on the class variable as the subtree it replaced: if we follow the respective paths in the tree, we see that it mostly predicts class one if x1 and x2 are both large. However, the linear model is simpler than the tree structure, and so less likely to overfit.
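To make the model definition concrete, the following sketch (ours) represents a logistic model tree with numeric splits only and evaluates f(x) by sorting an instance to its leaf and applying that leaf’s logistic model.

```python
import math

class Leaf:
    """Terminal node with a logistic model F_j(x) = a0_j + sum_v a_vj * x_v."""
    def __init__(self, intercepts, coefficients):
        self.intercepts = intercepts        # one intercept per class
        self.coefficients = coefficients    # one {attribute: weight} dict per class

    def probabilities(self, x):
        F = [a0 + sum(w * x[v] for v, w in coef.items())
             for a0, coef in zip(self.intercepts, self.coefficients)]
        Z = sum(math.exp(f) for f in F)
        return [math.exp(f) / Z for f in F]

class Split:
    """Inner node testing a numeric attribute against a threshold."""
    def __init__(self, attribute, threshold, left, right):
        self.attribute, self.threshold = attribute, threshold
        self.left, self.right = left, right

    def probabilities(self, x):
        child = self.left if x[self.attribute] <= self.threshold else self.right
        return child.probabilities(x)       # f(x) = f_t(x) for the leaf with x in S_t
```

The linear model above would, for instance, correspond to Leaf([-0.39, 0.39], [{0: 5.84, 1: 4.88}, {0: -5.84, 1: -4.88}]), with attribute indices 0 and 1 standing for x1 and x2.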

Figure 5. Decision tree constructed by the C4.5 algorithm for the ‘polynomial-noise’ dataset.

4.2. Building logistic model trees

In the following we present our new algorithm for learning logistic model trees. We discuss how the initial tree is grown, the splitting and stopping criteria used, how the tree is pruned, and the handling of missing values and nominal attributes.

4.2.1. Growing the initial tree. There is a straightforward approach for growing logistic model trees that follows the way trees are built by M5’. This involves first building a standard classification tree, using, for example, the C4.5 algorithm, and afterwards building a logistic regression model at every node trained on the set of examples at that node (note that we need a logistic regression model at every node of the tree because every node is a ‘candidate leaf’ during pruning). In this approach, the logistic regression models are built in isolation on the local training examples at a node, not taking into account the surrounding tree structure. Instead, we chose a different approach for constructing the logistic regression functions, namely by incrementally refining logistic models already fit at higher levels in the tree. Assume we have split a node and want to build the logistic regression function at one of the child nodes. Since we have already fit a logistic regression at the parent node, it is reasonable to use it as a basis for fitting the logistic regression at the child.

Figure 6. Logistic model tree constructed by the LMT algorithm for the ‘polynomial-noise’ dataset.

We expect that the parameters of the model at the parent node already encode ‘global’ influences of some attributes on the class variable; at the child node, the model can be further refined by taking into account influences of attributes that are only valid locally, i.e. within the set of training examples associated with the child node.

The LogitBoost algorithm (discussed in Section 2.3) provides a natural way to do just that. Recall that it iteratively changes the linear class functions Fj(x) to improve the fit to the data by adding a simple linear regression function fmj to Fj, fit to the response variable. This means changing one of the coefficients in the linear function Fj or introducing a new variable/coefficient pair. After splitting a node we can continue running LogitBoost iterations, fitting the fmj to the response variables of the training examples at the child node only.

As an example, consider a tree with a single split at the root and two successor nodes. The root node n has training data Dn and one of its children t has a subset of the training data Dt ⊂ Dn. Fitting the logistic regression models in isolation means the model fn would be built by iteratively fitting simple regression functions to Dn and the model ft by iteratively fitting simple regression functions to Dt. In contrast, in the ‘iterative refinement’ approach, the tree is constructed as follows. We start by building a logistic model fn at n by running LogitBoost on Dn, including more and more variables in the model by adding simple regressions fmj to the Fjⁿ (the linear class function for each class j at node n). At some point, adding more variables does not increase the accuracy of the model,² but splitting the instance space and refining the logistic models locally in the two subdivisions created by the split might give a better model. So we split the node n and build refined logistic models at the child nodes by proceeding with the LogitBoost algorithm on the smaller set of examples Dt, adding more simple regression functions to the Fjⁿ to form the Fjᵗ. These simple linear regressions are fit to the response variables of the set of training examples Dt given the (partial) logistic regression already fit at the parent node. Figure 7 illustrates this scheme for building the logistic regression models.

An additional advantage of this approach is that it is computationally more efficient to build the logistic models at lower levels of the tree by extending models already built at higher levels, rather than building the models at lower levels from scratch.

Figure 7. Building logistic models by incremental refinement. The parameters a0, a1 are estimated from the training examples at n, the parameters a2, a3 and a2′, a3′ from the training examples at t and t′ respectively. Attribute x1 has a global influence, x2, x3 have a local influence.

Note that this approach of iteratively refining logistic regression models based on local structure in the data is related to the way SMOTI (discussed in Section 3.2) constructs linear models. In the terminology of the SMOTI system, our approach would amount to building a chain of ‘regression nodes’ as long as this improves the fit (as determined by the cross-validation), then a single ‘split node’ and again a chain of regression nodes. These ideas lead to the following algorithm for building logistic model trees:

– Tree growing starts by building a logistic model at the root using the LogitBoost algorithm to iteratively fit simple linear regression functions, using five-fold cross-validation to determine the appropriate number of iterations (i.e. using the SimpleLogistic algorithm described in Section 2.3).
– A split for the data at the root is constructed. Splits are either binary (for numeric attributes) or multiway (for nominal ones); the splitting criterion will be discussed in more detail below. Tree growing continues by sorting the appropriate subsets of data to the child nodes and building the logistic models at the child nodes in the following way: the LogitBoost algorithm is run on the subset associated with the child node, but starting with the committee Fj(x), weights wij and probability estimates pij of the last iteration performed at the parent node (it is ‘resumed’ at step 2.a of Figure 3; a small sketch of this warm-starting step is given after this list). Again, the optimum number of iterations to perform (the number of fmj to add to Fj) is determined by a five-fold cross-validation.

Figure 8. Pseudocode for the LMT algorithm.

– Splitting of the child nodes continues in this fashion until some stopping criterion is met (the stopping criterion is discussed in Section 4.2.2).
– Once the tree has been built, it is pruned using CART-based pruning.
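Here is the warm-starting step referred to in the list above, again restricted to the two-class case with numeric attributes; the names and the fixed number of extra iterations are ours, whereas LMT chooses that number by five-fold cross-validation.

```python
import numpy as np

def refine_at_child(F_parent, X_child, y_child, n_extra_iter):
    """Resume two-class LogitBoost at a child node (sketch).

    F_parent holds the value of the committee inherited from the parent
    node for every training instance of the child; the additional simple
    regressions are fit only to the child's data, so they add 'local'
    structure on top of the parent's 'global' model."""
    F = np.asarray(F_parent, dtype=float)        # start from the parent, not from 0
    additions = []
    for _ in range(n_extra_iter):
        p = 1.0 / (1.0 + np.exp(-2.0 * F))
        w = np.clip(p * (1.0 - p), 1e-10, None)   # LogitBoost weights
        z = (y_child - p) / w                     # working response
        best = None
        for jj in range(X_child.shape[1]):
            # weighted simple regression of z on attribute jj
            slope, intercept = np.polyfit(X_child[:, jj], z, 1, w=np.sqrt(w))
            err = np.sum(w * (z - intercept - slope * X_child[:, jj]) ** 2)
            if best is None or err < best[0]:
                best = (err, jj, intercept, slope)
        _, j, b0, b1 = best
        additions.append((j, b0, b1))
        F += 0.5 * (b0 + b1 * X_child[:, j])
    return additions                              # child model = parent committee + these
```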

Figure 8 gives the pseudocode for this algorithm, which we call LMT. The method LMT constructs the tree given the training data examples. It calls getCARTAlpha to cross-validate the ‘cost-complexity-parameter’ for the CART pruning scheme implemented in CARTPrune.³ The method buildTree grows the logistic model tree by recursively splitting the instance space. The argument initialLinearModels contains the simple linear regression functions already fit by LogitBoost at higher levels of the tree. The method initLogitBoost initializes the probabilities/weights for the LogitBoost algorithm as if it had already fitted the regression functions initialLinearModels (resuming LogitBoost at step 2.a). The method CV_Iterations determines the number of LogitBoost iterations to perform, and logitBoostIteration performs a single iteration of the LogitBoost algorithm (step 2), updating the probabilities/weights and adding a regression function to linearModels.

Some points in this sketch of the algorithm for growing logistic model trees need to be explained in more detail: how to select the attribute to split on, when to stop growing the tree, how to prune, and how to deal with missing values and nominal attributes.

4.2.2. Splitting and stopping. We implemented two different criteria to select the attribute to split on. One is the C4.5 splitting criterion that tries to improve the purity of the class variable. The other splitting criterion attempts to divide the training data according to the current values of the working responses in the LogitBoost procedure in Figure 3. In our experiments, we could not detect any significant differences in either classification accuracy or tree size between the two methods. A disadvantage of splitting on the response is that the final tree structure is less intelligible. Hence we made splitting on the class variable (using the C4.5 splitting criterion) the default option in our algorithm, and all experimental results reported for LMT in Section 5 refer to that version. Tree growing stops for one of three reasons:

– A node is not split if it contains fewer than 15 examples. This number is somewhat larger than for standard decision trees; however, the leaves in logistic model trees contain more complex models, which need more examples for reliable model fitting.
– A particular split is only considered if there are at least 2 subsets that contain 2 examples each. Furthermore, a split is only considered if it achieves a minimum information gain. This is a heuristic used by the C4.5 algorithm to avoid overly fragmented splits. When no such split exists, we stop growing the tree.
– A logistic model is only built at a node if it contains at least 5 examples, because we need 5 examples for the cross-validation to determine the best number of iterations for the LogitBoost algorithm. Note that this can lead to ‘partially expanded’ nodes, where for some branches no additional iterations of LogitBoost are performed and so the model at the child is identical to the model of the parent.

We have found that the exact stopping criterion is not very important because the final tree (after pruning) is usually much smaller than the tree that is initially grown anyway.

4.2.3. Pruning the tree. As for standard decision trees, pruning is an essential part of the LMT algorithm. Standard decision trees are usually grown to minimize the training error, which decreases monotonically as more and more splits are performed and the tree becomes larger and larger. Large trees, however, are complex models with many degrees of freedom, which means they can easily overfit random patterns in the training data that are not representative of the true structure of the domain.

The complexity of trees can be measured by the number of splits they contain: every new split further refines the subdivision of the instance space, and so makes the decision function represented by the tree more complex. Furthermore, splits usually introduce new parameters into the model (unless the same attribute has already been used in a different split). Pruning methods for decision trees make use of this fact by trading off tree size (which is roughly equivalent to the number of splits) versus training error.

For logistic model trees, the situation is slightly different compared to ordinary classification trees, because the logistic regression functions at the leaves are so much more complex than simple leaves (that just use the majority class in their set of examples for predictions). For logistic model trees, sometimes a single leaf (a tree pruned back to the root) leads to the best generalization performance, which is rarely the case for ordinary decision trees (there it would mean that the best thing to do is to predict the majority class for every unseen instance). So (correctly pruned) logistic model trees will usually be much smaller than ordinary classification trees, but the same principle still applies: every split and subsequent local refinement of the logistic regression models at the child nodes increases the complexity of the model. This means that pruning algorithms for decision trees that trade off tree size versus accuracy on the training set are still applicable. Note that if we had decided to build isolated logistic models at every node, it would not have been clear that a split really increases model complexity. Because the logistic models use variable selection, it could happen that a split plus two simple logistic models is actually less complex than a complex logistic model at the original node. This might lead to problems with a pruning algorithm that penalizes splits.

We spent a lot of time experimenting with different pruning schemes. Since our work was originally motivated by the model tree algorithm, we first tried adapting the pruning scheme used by this algorithm. However, we could not find a way to compute reliable estimates for the expected error rate (resulting in an unstable pruning algorithm), hence we abandoned that approach. Instead, we employed the pruning method from the CART algorithm (Breiman et al., 1984). Like M5’s method, the CART pruning method uses a combination of training error and a penalty term for model complexity to make pruning decisions. While estimating the cost-complexity parameter by cross-validation (or, if enough data is available, by using an independent test set) sacrifices some of the computational efficiency of M5’s method, it leads to much more reliable pruning. The reader is referred to Breiman et al. (1984) for details of the pruning procedure.

4.2.4. Handling of missing values and nominal attributes. As explained in Section 2.3.2, missing values must be filled in before fitting the logistic regression models with the LogitBoost algorithm. This is done globally before tree building commences, by replacing the missing values with the means/modes of the respective attribute. This means that, unlike the C4.5 algorithm, LMT does not do any ‘fractional splits’ for instances with missing values (Quinlan, 1993). On the datasets we looked at, this simple approach seemed to work reasonably well; however, more sophisticated techniques could be an interesting area for future work. Nominal attributes have to be converted to numeric ones in order to fit the logistic regression models. This is done locally at all nodes in our algorithm, i.e. the logistic model is fit to a local copy of the examples at a node where the nominal attributes have been transformed into binary ones. The procedure for this is the same as the one described in Section 2.3.2: a nominal attribute with k values is transformed into k binary indicator attributes that are then treated as numeric. The reason why this is not done globally is that splitting on nominal attributes (that often capture a specific characteristic of the domain) can be better than splitting on the binary ones that are the result of the conversion, both in terms of information gain and interpretability of the produced model.

4.3. Computational complexity

This section discusses the computational complexity of building logistic regression models with SimpleLogistic and of building logistic model trees with LMT. It also describes two heuristics used to speed up the LMT algorithm.

Generally speaking, the stagewise model fitting approach used in SimpleLogistic means that a potentially large number of LogitBoost iterations have to be performed, because it might be necessary to fit a simple linear function to the same variable many times. The optimum number of iterations to be performed is selected by a five-fold cross-validation. We used a maximum number of 200 iterations for the logistic regression at the nodes of the tree, as opposed to the 500 iterations used in stand-alone SimpleLogistic. The rationale for this is that the logistic regression functions in the tree do not have to be as complex as those for standalone logistic regression (because of the additional tree structure). If the maximum number of iterations is taken as a constant, the asymptotic complexity of building a single logistic regression model is of the order O(n · a) (recall that n is the number of training examples and a the number of attributes). However, this ignores the constant factor of the number of LogitBoost iterations and the cross-validation. It is reasonable to expect that the maximum number of iterations should at least be linear in the number a of attributes present in the data (after all, the number of attributes that can be included in the final model is bounded by the number of LogitBoost iterations that can be performed). It is not clear whether our hard-coded limit is really enough for all datasets, though it seemed to work fine in our experiments. Therefore, a more realistic estimate of the asymptotic runtime of SimpleLogistic is O(n · a²).

There is only a moderate increase in computational complexity from building logistic regression models to building logistic model trees. Using the more realistic estimate for the complexity of building the logistic models, the asymptotic complexity of the LMT algorithm is O(d · n · log n + n · a² · d + k²), where d is the depth of the initial unpruned tree and k the number of nodes in the tree. The first part of the sum derives from building an unpruned decision tree, the second one from building the logistic regression models, and the third one from the CART pruning scheme. In our experiments, the time for building the logistic regression models accounted for most of the overall runtime. Note that the initial depth d of the unpruned logistic model tree is usually smaller than the depth of an unpruned standard classification tree, because tree growing is stopped earlier. The cross-validation performed by the CART pruning algorithm constitutes another constant multiplying factor of about six if five-fold cross-validation is used.

The asymptotic complexity is not too high compared to other machine learning methods, although it is higher than for simple tree induction. However, the two nested cross-validations—one for determining the optimum number of boosting iterations and one for the pruning—increase the runtime by a large constant factor, which makes the algorithm quite slow in practice.

4.3.1. Heuristics to speed up the algorithm. In this section we discuss two simple heuristics to speed up the LMT algorithm. The first one concerns the ‘inner’ cross-validation that determines the number of LogitBoost iterations to perform at a particular node.
4.3.1. Heuristics to speed up the algorithm. In this section we discuss two simple heuristics to speed up the LMT algorithm. The first one concerns the 'inner' cross-validation that determines the number of LogitBoost iterations to perform at a particular node. In order to avoid cross-validating this number at every node, we tried performing just one cross-validation to determine the optimum number of iterations for LogitBoost in the beginning (at the root of the tree) and then used that number everywhere in the tree. Although it is not very intuitive, this approach worked surprisingly well. It never produced results that were significantly worse than those of the original algorithm. It seems that the LMT algorithm is not too sensitive to the number of LogitBoost iterations that are performed at every node, as long as the number is roughly in the right range for the dataset. We also tried using some fixed number of iterations for every dataset, but that gave significantly worse results in some cases. It seems that the best number of iterations for LogitBoost does depend on the domain, but that it does not change much for different subsets of a particular dataset (as encountered at lower levels in the tree).

As a second heuristic, we tried to stop performing LogitBoost iterations early in case it is obvious that the optimum number of iterations is relatively small. Recall that we run LogitBoost on the training set of a fold while monitoring the error on the test set, afterwards summing up the errors over the different folds to find the optimum number of iterations to perform. Examining the error curve on the test set produced by LogitBoost shows that the error usually decreases first, then reaches a minimum and later starts to increase again because the model is overfitting the training data. If the minimum is reached early, we can stop performing more iterations after a while because we know the best number must be relatively low. To account for spikes and irregularities in the error curve, we do not stop performing iterations immediately if the error increases, but instead keep track of the current minimum and stop if it has not changed for 50 iterations. This second heuristic does not change the behavior of the algorithm significantly, and can give a considerable speed-up on datasets where the optimum number of LogitBoost iterations is small. Note that it can be used together with the first heuristic, by speeding up the initial cross-validation that determines the best number of LogitBoost iterations. By default, both heuristics are used in our implementation of LMT. All results shown for the LMT algorithm in this paper refer to the default version that uses the heuristics.

Figure 9 plots the training time as a function of training set size (note the log scales) for our implementations of LMT (including the heuristics) and C4.5 on the letter dataset from the UCI repository (Blake & Merz, 1998). The graph shows that LMT is several orders of magnitude slower than C4.5 in practice, although the shapes of their growth functions are comparable.
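A minimal sketch of the second heuristic follows (an illustration only, assuming a callback cv_error(i) that returns the summed test-fold error after i LogitBoost iterations; this is not the actual LMT code):

```python
def best_iteration(cv_error, patience=50, max_iterations=200):
    """Pick the number of LogitBoost iterations from a summed CV error curve,
    stopping early once the running minimum has not improved for `patience`
    iterations."""
    best_err, best_iter = float("inf"), 0
    for i in range(1, max_iterations + 1):
        err = cv_error(i)
        if err < best_err:
            best_err, best_iter = err, i
        elif i - best_iter >= patience:
            break  # the error curve has not improved for `patience` iterations
    return best_iter
```

The patience of 50 iterations corresponds to the tolerance for spikes and irregularities in the error curve described above.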

5. Experiments

In order to evaluate the performance of our method and compare it against other state-of-the-art learning schemes, we applied it to several real-world problems. More specifically, we seek to answer the following questions in our experimental study:

Figure 9. Training time as a function of the number of training instances for C4.5 and LMT on the letter dataset. Note the logarithmic scale of the axes.

1. How does our version of logistic regression that uses parameter selection (SimpleLogistic) compare to a standard version of logistic regression that builds a full logistic model on all attributes present in the data?
2. How does LMT compare to the two algorithms that form its basis, i.e., logistic regression and C4.5? Ideally it should never perform worse than either of these algorithms.
3. How does LMT compare to other enhanced decision tree learners, i.e., those that make oblique splits or include other learning algorithms in the tree structure?
4. How does LMT compare to methods that build multiple trees? We include results for boosted C4.5 trees, where the final model is a 'voting committee' of trees, and for the M5' algorithm, building one tree per class.
5. How big are the trees constructed by LMT? We expect them to be much smaller than simple classification trees because the leaves contain more information. We also expect the trees to be pruned back to the root if a linear logistic model is the best solution for the dataset.

5.1. Algorithms included in experiments

The following algorithms are used in our experiments:4

– C4.5: The C4.5 classification tree inducer (Quinlan, 1993). C4.5 is run with the standard options: the confidence threshold for pruning is 0.25, the minimum number of instances per leaf is 2. For pruning, both subtree replacement and subtree raising are considered.
– CART: The CART tree inducer (Breiman et al., 1984). We used the implementation of CART provided in the R statistical package (Ihaka & Gentleman, 1996).5 The cost-complexity parameter for pruning is optimized using 10-fold cross-validation on the training data.

– LTree: The functional tree learning algorithm LTree (Gama, 2004). Linear discriminants or logistic regression can be used at interior nodes of the tree to provide oblique splits and at the leaves for prediction. We provide results for both types of discriminant functions. We used version 5.1 of LTree.6
– NBTree: The naive Bayes tree learning algorithm (Kohavi, 1996). NBTree is run using the settings recommended by Kohavi—a split is considered if there are at least 30 instances reaching a node and accepted if the relative reduction in error (computed by five-fold cross-validation) is greater than five percent.
– M5': The model tree learning algorithm M5' (Wang & Witten, 1997). The classification problems are transformed into regression problems as described in Section 2.2. M5' is run with the standard options: the minimum number of instances per leaf is four, smoothing is enabled.
– AdaBoost.M1: The AdaBoost.M1 algorithm as described in Section 3.6. C4.5 with standard options is used as the base learner, and the maximum number of iterations is set to 10 or 100.
– MultiLogistic: An implementation of logistic regression that finds a maximum-likelihood solution for a full logistic model. See Section 5.3 for more details.
– SimpleLogistic: The standalone logistic regression with attribute selection as described in Section 2.3.
– Lotus: The Lotus algorithm for inducing logistic regression trees with unbiased splits (Chan & Loh, 2004).7 The algorithm can only handle two-class datasets and can learn either simple or multiple logistic regression models at the leaves of the tree. We provide results for both cases. CART's cost-complexity pruning method is used, with "cost" being predicted deviance estimated using 10-fold cross-validation on the training data. All other options (aside from logistic model type) are the defaults computed by the Lotus software. We used version 1.03 of Lotus.
– LMT: The LMT algorithm, using the heuristics discussed in Section 4.3.1.

All algorithms except where noted are part of the Weka machine learning workbench (Witten & Frank, 2000) release 3.4.0.8

5.2. Datasets and methodology

For the experiments we used the 36 benchmark datasets from the UCI repository (Blake & Merz, 1998) given in Table 1. Their size ranges from under one hundred to a few thousand instances. They contain varying numbers of numeric and nominal attributes and some contain missing values. Note that the original zoo and splice datasets had an identifier attribute (that takes on a different value for every instance) which was removed for our experiments, because it leads to a sharp degradation in performance for M5' for classification.

Table 1. Datasets used for the experiments, sorted by size.

Dataset  Instances  Missing values (%)  Numeric attributes  Binary attributes  Nominal attributes  Classes

Labor 57 35.7 8 3 5 2
Zoo 101 0.0 1 15 0 7
Lymphography 148 0.0 3 9 6 4
Iris 150 0.0 4 0 0 3
Hepatitis 155 5.6 6 13 0 2
Glass (G2) 163 0.0 9 0 0 2
Autos 205 1.1 15 4 6 6
Sonar 208 0.0 60 0 0 2
Glass 214 0.0 9 0 0 6
Audiology 226 2.0 0 61 8 24
Heart-statlog 270 0.0 13 0 0 2
Breast-cancer 286 0.3 0 3 6 2
Heart-h 294 20.4 6 3 4 2
Heart-c 303 0.2 6 3 4 2
Primary-tumor 339 3.9 0 14 3 21
Ionosphere 351 0.0 33 1 0 2
Horse-colic 368 23.8 7 2 13 2
Vote 435 5.6 0 16 0 2
Balance-scale 625 0.0 4 0 0 3
Soybean 683 9.8 0 16 19 19
Australian 690 0.6 6 4 5 2
Breast-w 699 0.3 9 0 0 2
Pima-indians 768 0.0 8 0 0 2
Vehicle 846 0.0 18 0 0 4
Anneal 898 0.0 6 14 18 5
Vowel 990 0.0 10 2 1 11
German 1000 0.0 6 3 11 2
Segment 2310 0.0 19 0 0 7
Splice 3190 0.0 0 0 60 3
Kr-vs-kp 3196 0.0 0 35 1 2
Hypothyroid 3772 5.5 7 20 2 4
Sick 3772 5.5 7 20 2 2
Waveform 5000 0.0 40 0 0 3
Optdigits 5620 0.0 64 0 0 10
Pendigits 10992 0.0 16 0 0 10
Letter 20000 0.0 16 0 0 26

For every dataset and algorithm, we performed ten runs of ten-fold stratified cross-validation (using the same splits into training/test set for every method). This gives a hundred data points for each algorithm and dataset, from which the average classification accuracy and standard deviation are calculated. Furthermore, we used a corrected resampled t-test (Nadeau & Bengio, 2003) instead of the standard t-test to identify if one method significantly outperforms another, at a 5 percent significance level. This test corrects for the dependencies in the estimates of the different data points, and is less prone to false-positive significance results. The statistic of the "corrected resampled t-test" is

t = ( (1/N) · Σ_{j=1}^{N} x_j ) / sqrt( (1/N + n2/n1) · σ̂² )

where N is the number of data points (a hundred in our case), x_j is the difference in accuracy between two learning algorithms on one of the one hundred folds, σ̂² is the sample variance of these differences, n1 is the number of instances used for training (ninety percent of the data in a ten-fold cross-validation) and n2 is the number of test instances (ten percent of the data).
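The statistic above is simple to compute directly; the following is a small sketch (our own illustration, not the code used for the experiments):

```python
import math

def corrected_resampled_t(diffs, n1, n2):
    """Corrected resampled t statistic (Nadeau & Bengio, 2003).
    diffs: the N per-fold accuracy differences x_j between two learners,
    n1: training instances per fold, n2: test instances per fold."""
    N = len(diffs)
    mean = sum(diffs) / N
    var = sum((d - mean) ** 2 for d in diffs) / (N - 1)   # sigma-hat squared
    return mean / math.sqrt((1.0 / N + n2 / n1) * var)

# Example: 100 differences from ten runs of ten-fold CV on a dataset with
# 1000 instances (900 training / 100 test instances per fold).
# t = corrected_resampled_t(diffs, n1=900, n2=100)
```

In the usual formulation the resulting statistic is compared against a t distribution with N − 1 degrees of freedom at the 5 percent level.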

5.3. The impact of variable selection for logistic regression

This section explores the effectiveness of our parameter selection method for SimpleLogistic, comparing it to MultiLogistic. There are two motivations for doing variable selection in logistic regression: one is the hope that controlling the number of parameters that enter the model can decrease the risk of building overly complex models that overfit the training data. This means that attribute selection should increase the classification accuracy. The other motivation is that models with many parameters are usually harder to interpret than models with few parameters; in particular, parameters that have no real relation to the target variable can be misleading when interpreting the final model.

Table 2 gives the classification accuracy for the two methods, and indicates significant wins/losses according to the modified t-test discussed above. The test reports four significant accuracy differences in favor of SimpleLogistic. Ignoring the significance of the differences, SimpleLogistic is better than MultiLogistic on 24 datasets and worse on 12 datasets. According to a two-tailed sign test, SimpleLogistic is more accurate than MultiLogistic at the 10 percent significance level (p-value is 0.0652). Hence attribute selection leads to better predictive performance for some datasets and never significantly decreases performance.

Figure 10 gives the percentage of attributes (after converting nominal attributes to binary ones) included in the final model for SimpleLogistic. Although the fraction of attributes that are discarded varies wildly (from more than 95 percent for the breast-cancer dataset to none for balance-scale, vehicle, vowel, pendigits and letter), in most cases the number of attributes included in the final model is reduced substantially. On average, the biggest reduction takes place for datasets that have a high number of attributes but are not too large (datasets are sorted by size from left to right).
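As a quick check, the two-tailed sign test p-value quoted here can be reproduced with an exact binomial test on the 24/12 win/loss count; the sketch below assumes scipy 1.7 or later for binomtest:

```python
from scipy.stats import binomtest

# 24 wins for SimpleLogistic vs. 12 losses over 36 datasets, null p = 0.5
result = binomtest(24, n=36, p=0.5, alternative="two-sided")
print(round(result.pvalue, 4))   # approximately 0.065, matching the text
```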

Table 2. Mean classification accuracy and standard deviation for SimpleLogistic and MultiLogistic.

Dataset SimpleLogistic MultiLogistic

Labor 91.97 ± 10.54 94.07 ± 10.01
Zoo 94.69 ± 6.59 94.85 ± 6.34
Lymphography 84.37 ± 9.97 77.58 ± 10.59
Iris 95.93 ± 4.82 97.07 ± 4.77
Hepatitis 84.20 ± 8.20 83.89 ± 8.12
Glass (G2) 76.32 ± 8.89 69.56 ± 10.77
Autos 75.30 ± 9.97 69.99 ± 9.86
Sonar 75.93 ± 8.51 72.47 ± 8.90
Glass 65.29 ± 8.03 63.12 ± 9.16
Audiology 84.10 ± 7.35 79.71 ± 8.36
Heart-statlog 83.30 ± 6.48 83.67 ± 6.43
Breast-cancer 74.94 ± 6.25 67.77 ± 6.92 •
Heart-h 84.61 ± 6.03 84.23 ± 5.93
Heart-c 83.30 ± 6.35 83.70 ± 6.64
Primary-tumor 47.87 ± 6.04 41.56 ± 7.63 •
Ionosphere 87.78 ± 4.99 87.72 ± 5.57
Horse-colic 81.93 ± 5.80 80.87 ± 6.06
Vote 95.93 ± 2.58 95.65 ± 3.12
Balance-scale 88.74 ± 2.91 89.44 ± 3.29
Soybean 93.43 ± 2.54 92.91 ± 2.67
Australian 85.04 ± 3.97 85.33 ± 3.85
Breast-w 96.21 ± 2.19 96.50 ± 2.18
Pima-indian 77.10 ± 4.65 77.47 ± 4.39
Vehicle 80.45 ± 3.37 79.81 ± 4.04
Anneal 99.48 ± 0.70 99.24 ± 0.79
Vowel 84.31 ± 3.78 82.71 ± 3.96
German 75.34 ± 3.50 75.24 ± 3.54
Segment 95.40 ± 1.50 95.60 ± 1.59
Splice 95.86 ± 1.17 90.57 ± 1.85 •
Kr-vs-kp 97.46 ± 0.79 97.56 ± 0.76
Hypothyroid 96.78 ± 0.73 96.69 ± 0.72
Sick 96.74 ± 0.71 96.79 ± 0.69
Waveform 86.96 ± 1.58 86.73 ± 1.49
Optdigits 97.12 ± 0.68 93.89 ± 0.98 •
Pendigits 95.51 ± 0.64 95.50 ± 0.65
Letter 77.42 ± 0.80 77.40 ± 0.83

•, ◦ Statistically significant win or loss for SimpleLogistic.

Figure 10. Percentage of attributes used by SimpleLogistic.

As an example of parameter selection, consider the breast-cancer dataset. After transforming the nominal attributes into binary ones, the dataset has 48 numeric attributes (plus the class). SimpleLogistic builds a model including only two of the parameters. The function determining the class membership probability (cf. Section 2.3) for class 1, no-recurrence-events, is

F1(x) = 0.53 + [inv-nodes = 0–2] · 1.07 − [deg-malig = 3] · 1.32.

The function F2 for the class membership probability of class 2 is F2(x) = −F1(x) because there are only two classes. The binary attribute [inv-nodes = 0–2] has been generated from the nominal attribute 'inv-nodes' that can take on values '0–2', '3–5', and so on. The nominal attribute 'deg-malig' takes on values '1', '2' and '3'. The model is relatively easy to interpret: it basically says that no recurrence events are expected if the number of involved nodes is small and the degree of malignancy is not too high. In contrast to that, the model built by MultiLogistic has sizeable coefficients for 38 of the 48 attributes, which makes it a lot harder to interpret, and is at the same time less accurate (see Table 2).
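To make the model concrete, the sketch below (a hypothetical helper, not code from the paper) turns F1 into the probability of 'no-recurrence-events' using the multinomial logistic transformation p_j(x) = exp(F_j(x)) / Σ_k exp(F_k(x)) that the LogitBoost models are based on, and evaluates it for a patient with inv-nodes in 0–2 and deg-malig equal to 3:

```python
import math

def no_recurrence_probability(inv_nodes_0_2, deg_malig_3):
    """Class probability of 'no-recurrence-events' from the model above:
    F1(x) = 0.53 + 1.07*[inv-nodes=0-2] - 1.32*[deg-malig=3], F2(x) = -F1(x),
    p1(x) = exp(F1) / (exp(F1) + exp(F2))."""
    f1 = 0.53 + 1.07 * inv_nodes_0_2 - 1.32 * deg_malig_3
    return math.exp(f1) / (math.exp(f1) + math.exp(-f1))

# Few involved nodes but high malignancy grade:
print(no_recurrence_probability(inv_nodes_0_2=1, deg_malig_3=1))  # about 0.64
```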

5.4. Empirical evaluation of LMT

This section discusses the performance of LMT compared to other learning schemes, including logistic regression, C4.5, CART, LTree, Lotus, M5’ for classification, and boosted C4.5 trees.

5.4.1. Comparing LMT to logistic regression and tree induction. Table 3 gives the average classification accuracy and standard deviation for LMT, SimpleLogistic, MultiLogistic, C4.5 and CART.

Table 3. Mean classification accuracy and standard deviation for LMT vs. SimpleLogistic, MultiLogistic, C4.5 and CART.

Dataset LMT SimpleLogistic MultiLogistic C4.5 CART

Labor 91.37 ± 11.01 91.97 ± 10.54 94.07 ± 10.01 78.60 ± 16.58 • 81.73 ± 17.54
Zoo 94.89 ± 6.44 94.69 ± 6.59 94.85 ± 6.34 92.61 ± 7.33 91.81 ± 7.44
Lymphography 84.10 ± 10.00 84.37 ± 9.97 77.58 ± 10.59 75.84 ± 11.05 76.86 ± 9.87
Iris 95.80 ± 4.89 95.93 ± 4.82 97.07 ± 4.77 94.73 ± 5.30 96.00 ± 4.74
Hepatitis 84.21 ± 8.21 84.20 ± 8.20 83.89 ± 8.12 79.22 ± 9.57 80.21 ± 7.57
Glass (G2) 77.28 ± 9.90 76.32 ± 8.89 69.56 ± 10.77 • 78.15 ± 8.50 80.21 ± 9.51
Autos 76.13 ± 8.84 75.30 ± 9.97 69.99 ± 9.86 81.77 ± 8.78 77.66 ± 10.20
Sonar 76.32 ± 9.58 75.93 ± 8.51 72.47 ± 8.90 73.61 ± 9.34 74.77 ± 9.76
Glass 69.15 ± 8.99 65.29 ± 8.03 63.12 ± 9.16 67.63 ± 9.31 68.09 ± 10.49
Audiology 84.19 ± 7.17 84.10 ± 7.35 79.71 ± 8.36 77.26 ± 7.47 • 77.01 ± 7.75 •
Heart-statlog 83.22 ± 6.50 83.30 ± 6.48 83.67 ± 6.43 78.15 ± 7.42 • 78.00 ± 8.25 •
Breast-cancer 74.91 ± 6.29 74.94 ± 6.25 67.77 ± 6.92 • 74.28 ± 6.05 69.40 ± 5.25 •
Heart-h 84.00 ± 6.33 84.61 ± 6.03 84.23 ± 5.93 80.22 ± 7.95 76.70 ± 6.80 •
Heart-c 83.51 ± 6.67 83.30 ± 6.35 83.70 ± 6.64 76.94 ± 6.59 • 78.88 ± 6.87 •
Primary-tumor 47.63 ± 5.84 47.87 ± 6.04 41.56 ± 7.63 • 41.39 ± 6.94 • 41.42 ± 7.38 •
Ionosphere 92.99 ± 4.13 87.78 ± 4.99 • 87.72 ± 5.57 • 89.74 ± 4.38 • 89.80 ± 4.78
Horse-colic 83.66 ± 6.13 81.93 ± 5.80 80.87 ± 6.06 85.16 ± 5.91 84.63 ± 5.56
Vote 95.90 ± 2.67 95.93 ± 2.58 95.65 ± 3.12 96.57 ± 2.56 95.15 ± 3.08
Balance-scale 89.71 ± 2.68 88.74 ± 2.91 89.44 ± 3.29 77.82 ± 3.42 • 78.09 ± 3.97 •
Soybean 93.43 ± 2.59 93.43 ± 2.54 92.91 ± 2.67 91.78 ± 3.19 90.80 ± 3.15 •
Australian 85.04 ± 3.84 85.04 ± 3.97 85.33 ± 3.85 85.57 ± 3.96 84.55 ± 4.20
Breast-w 96.18 ± 2.20 96.21 ± 2.19 96.50 ± 2.18 95.01 ± 2.73 94.42 ± 2.70 •
Pima-indians 77.08 ± 4.65 77.10 ± 4.65 77.47 ± 4.39 74.49 ± 5.27 74.50 ± 4.70
Vehicle 83.04 ± 3.70 80.45 ± 3.37 79.81 ± 4.04 • 72.28 ± 4.32 • 72.37 ± 4.12 •
Anneal 99.52 ± 0.73 99.48 ± 0.70 99.24 ± 0.79 98.57 ± 1.04 • 98.24 ± 1.18 •
Vowel 93.94 ± 2.55 84.31 ± 3.78 • 82.71 ± 3.96 • 80.20 ± 4.36 • 81.54 ± 4.10 •
German 75.37 ± 3.53 75.34 ± 3.50 75.24 ± 3.54 71.25 ± 3.17 • 73.34 ± 3.66
Segment 96.28 ± 1.36 95.40 ± 1.50 • 95.60 ± 1.59 96.77 ± 1.29 96.54 ± 1.23
Splice 95.86 ± 1.17 95.86 ± 1.17 90.57 ± 1.85 • 94.17 ± 1.28 • 94.72 ± 1.34 •
kr-vs-kp 99.64 ± 0.35 97.46 ± 0.79 • 97.56 ± 0.76 • 99.44 ± 0.37 99.41 ± 0.39
Hypothyroid 99.54 ± 0.39 96.78 ± 0.73 • 96.69 ± 0.72 • 99.54 ± 0.36 99.66 ± 0.30
Sick 98.95 ± 0.60 96.74 ± 0.71 • 96.79 ± 0.69 • 98.72 ± 0.55 98.89 ± 0.57
Waveform 86.96 ± 1.58 86.96 ± 1.58 86.73 ± 1.49 75.25 ± 1.90 • 76.97 ± 1.56 •
Optdigits 97.28 ± 0.64 97.12 ± 0.68 93.89 ± 0.98 • 90.52 ± 1.20 • 90.78 ± 1.27 •
Pendigits 98.56 ± 0.32 95.51 ± 0.64 • 95.50 ± 0.65 • 96.54 ± 0.60 • 96.37 ± 0.62 •
Letter 92.13 ± 0.74 77.42 ± 0.80 • 77.40 ± 0.83 • 88.03 ± 0.71 • 87.62 ± 0.81 •

•, ◦ Statistically significant win or loss for LMT.

Table 4. Pairwise comparison of classification accuracy for LMT, SimpleLogistic, MultiLogistic, C4.5 and CART: number indicates how often method in column (significantly) outperforms method in row.

LMT SimpleLogistic MultiLogistic C4.5 CART

LMT – 12 (0) 8 (0) 7 (0) 6 (0)
SimpleLogistic 23 (8) – 12 (0) 13 (7) 12 (6)
MultiLogistic 28 (13) 24 (4) – 16 (10) 14 (7)
C4.5 29 (16) 23 (13) 20 (8) – 20 (1)
CART 30 (17) 24 (14) 22 (11) 16 (1) –

Table 4 shows how the different methods compare with each other. Each entry indicates the number of datasets for which the method associated with its column is (significantly) more accurate than the method associated with its row.

We observe that LMT indeed always reaches roughly the same classification accuracy as logistic regression and the other tree learners: there is no dataset where LMT is significantly outperformed by either SimpleLogistic, MultiLogistic, C4.5, or CART. It significantly outperforms SimpleLogistic on 8 datasets, MultiLogistic on 13 datasets, C4.5 on 16 datasets and CART on 17 datasets. We can also confirm the observation (see for example (Lim, Loh, & Shih, 2000)) that logistic regression performs surprisingly well compared to tree induction on most UCI datasets. This includes all small to medium-sized datasets except ionosphere. Only on some larger datasets (kr-vs-kp, sick, hypothyroid, pendigits and letter) is its performance not competitive with that of tree induction (and other methods, see below).

Ignoring the significance of individual differences, a two-tailed sign test confirms that LMT outperforms the other methods. LMT has a win/loss ratio of 23/12 against SimpleLogistic, 28/8 against MultiLogistic, 29/7 against C4.5 and 30/6 against CART. The ratio against SimpleLogistic is significant at the 10 percent level, while the others are significant at the one percent level (p-values of 0.0895, 0.0012, 0.0003 and <0.0001 respectively).

Table 5 gives the average tree size and standard deviation for LMT, C4.5 and CART. Table 6 summarizes the results. As expected, the trees constructed by LMT are much smaller than the standard classification trees built by C4.5. LMT produces significantly smaller trees on all datasets. Of course, logistic model trees contain part of their 'structure' in the leaves in the form of the logistic regression functions. Nevertheless, there are some datasets where the difference in tree size is so large that one advantage of C4.5, namely that the final models are potentially easier to understand, disappears. For the waveform dataset, for example, LMT builds a logistic regression model (no tree structure), while C4.5 builds a tree with almost 300 terminal nodes. About half of the 40 attributes in waveform are used in the logistic regression (see Table 10). It is probably easier to understand the influence of the attributes on the class variable from a logistic model with 20 parameters than from a tree with 300 terminal nodes. Looking at the results for CART, we can see that the smaller trees produced by LMT are not just an artifact of the cost-complexity pruning. LMT produces significantly smaller trees than CART on all but four of the datasets. Although CART generally produces smaller trees than C4.5 (due to its pruning mechanism), the difference in size is still dramatic on many of the datasets. For instance, on the letter dataset LMT builds a tree with 41 leaves, while CART (and C4.5) produces a tree with over a thousand leaves.

Table 5. Mean tree size (number of leaves) and standard deviation for LMT vs. C4.5 and CART.

Dataset LMT C4.5 CART

Labor 1.05 ± 0.26 4.16 ± 1.44 • 3.96 ± 0.98 •
Zoo 1.05 ± 0.26 8.35 ± 0.82 • 7.09 ± 0.40 •
Lymphography 1.18 ± 0.80 17.30 ± 2.88 • 9.84 ± 4.56 •
Iris 1.11 ± 0.53 4.64 ± 0.59 • 4.72 ± 0.81 •
Hepatitis 1.05 ± 0.30 9.33 ± 2.37 • 5.30 ± 3.02 •
Glass (G2) 5.05 ± 3.24 12.53 ± 2.64 • 6.70 ± 2.86
Autos 2.89 ± 5.20 44.77 ± 6.09 • 19.31 ± 1.91 •
Sonar 2.12 ± 1.51 14.45 ± 1.75 • 9.49 ± 3.09 •
Glass 7.46 ± 3.55 23.58 ± 2.29 • 19.29 ± 5.96 •
Audiology 1.01 ± 0.10 29.90 ± 1.95 • 17.51 ± 3.67 •
Heart-statlog 1.04 ± 0.32 17.82 ± 2.86 • 8.97 ± 4.43 •
Breast-cancer 1.12 ± 1.20 9.75 ± 8.16 • 6.29 ± 4.83 •
Heart-h 1.00 ± 0.00 6.32 ± 3.73 • 8.22 ± 6.93 •
Heart-c 1.03 ± 0.17 25.70 ± 5.53 • 8.50 ± 4.03 •
Primary-tumor 1.03 ± 0.17 43.81 ± 5.16 • 19.51 ± 13.55 •
Ionosphere 4.40 ± 1.86 13.87 ± 1.95 • 6.34 ± 3.80
Horse-colic 2.78 ± 2.58 5.91 ± 1.99 • 4.06 ± 1.77
Vote 1.08 ± 0.44 5.83 ± 0.38 • 6.45 ± 2.51 •
Balance-scale 4.73 ± 2.61 41.60 ± 4.97 • 44.28 ± 12.51 •
Soybean 2.44 ± 5.44 61.12 ± 6.01 • 40.76 ± 6.13 •
Australian 1.15 ± 0.59 22.49 ± 7.54 • 6.48 ± 5.33 •
Breast-w 1.24 ± 0.99 12.23 ± 2.77 • 11.17 ± 4.46 •
Pima-indians 1.03 ± 0.30 22.20 ± 6.55 • 14.92 ± 12.96 •
Vehicle 2.64 ± 1.97 69.50 ± 10.28 • 61.43 ± 20.93 •
Anneal 1.11 ± 0.31 37.98 ± 5.39 • 12.59 ± 1.99 •
Vowel 4.81 ± 1.23 123.28 ± 17.19 • 87.17 ± 4.27 •
German 1.00 ± 0.00 90.18 ± 15.67 • 16.58 ± 12.90 •
Segment 4.85 ± 2.05 41.20 ± 3.04 • 41.08 ± 4.30 •
Splice 1.00 ± 0.00 174.39 ± 12.21 • 25.82 ± 9.03 •
kr-vs-kp 7.92 ± 0.56 29.29 ± 1.83 • 30.70 ± 3.41 •
Hypothyroid 5.61 ± 0.93 14.44 ± 1.06 • 6.16 ± 0.42
Sick 13.56 ± 3.12 27.44 ± 3.88 • 20.63 ± 3.45 •
Waveform 1.00 ± 0.00 296.49 ± 12.16 • 87.88 ± 29.68 •
Optdigits 3.45 ± 1.66 205.69 ± 7.26 • 165.92 ± 17.10 •
Pendigits 11.30 ± 2.78 188.20 ± 5.45 • 182.65 ± 13.12 •
Letter 41.73 ± 8.68 1160.92 ± 16.79 • 1125.86 ± 19.20 •

•, ◦ statistically significantly smaller or larger trees for LMT.

Table 6. Pairwise comparison of tree size (number of leaves) for LMT, C4.5 and CART: number indicates how often method in column has (significantly) larger tree size than method in row.

LMT C4.5 CART

LMT – 36 (36) 36 (32)
C4.5 0 (0) – 5 (0)
CART 0 (0) 31 (23) –

There are several datasets for which the trees constructed by LMT are pruned back to the root. Let us call a tree "pruned back to the root" if the average tree size is less than 1.5, meaning that more than half of the data points correspond to 'trees' that were just a logistic regression function. It can be seen from Table 5 that this happens on exactly half of the 36 datasets. If the pruning process worked correctly, we would expect that for the 18 datasets where the tree is pruned back to the root the classification accuracy is better than that of the C4.5 algorithm—indicating that a linear logistic regression model is preferable to a tree structure—and for the other 18 datasets the result for LMT is better than that of SimpleLogistic. The first claim is true for all but two datasets (vote and australian, where there is a small win for C4.5). The second claim is true for 17 out of the 18 datasets where a tree structure is built; on the single exception ('soybean') the result for LMT is equal to that of SimpleLogistic. Allowing for small random effects, we can conclude that the adapted pruning method—the CART algorithm from (Breiman et al., 1984)—reliably makes the right choice between building a linear logistic model and a more elaborate tree structure.

To illustrate how the LMT algorithm scales model complexity depending on the information available for training, we ran it on increasingly larger training datasets sampled from an artificial domain and compared its results to the results for SimpleLogistic ('logistic model tree of size one') and C4.5 (standard classification tree). Consider the polynomial

f : R⁴ → R,   f(x) = 2 · x1² + x2 + x3 + x4

and the binary function

g : R⁴ → {1, −1},   g(x) = sign(f(x)).

The function g(x) gives rise to a two-class classification problem over the four numeric attributes x1, ..., x4. The attributes x2, x3, x4 have a linear influence on the target variable, while the influence of x1 is nonlinear. We sampled training datasets of size 25, 50, 100, 200, 400, 800, 1600, 3200, 6400 and 12800 instances from this domain, then used LMT, SimpleLogistic and C4.5 to build a model on the training set and evaluated its performance on a separate test set. More specifically, we generated 100 datasets of each size, sampling x uniformly from [−1, 1]⁴. This gives a-priori probabilities for the two classes of about 0.7 for class 1 and 0.3 for class −1. Samples are stratified, meaning that the distribution of the classes in the sample is the same as their a-priori probability (as far as possible), which helps getting stable estimates for small training set sizes. The accuracy achieved by each method was measured using a fixed test set of 10000 instances. From the 100 data points measured for each algorithm and training set size, we calculated the average accuracy on the test set for every method and the average tree size for LMT and C4.5.
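A minimal sketch of this sampling scheme follows (our own illustration; numpy and the helper names are assumptions, and stratification is only approximated by matching the class proportions of a large pool):

```python
import numpy as np

def sample_domain(n, rng):
    """Draw n points x ~ Uniform([-1, 1]^4) and label them with
    g(x) = sign(2*x1^2 + x2 + x3 + x4); ties at 0 are mapped to class 1."""
    X = rng.uniform(-1.0, 1.0, size=(n, 4))
    f = 2.0 * X[:, 0] ** 2 + X[:, 1] + X[:, 2] + X[:, 3]
    y = np.where(f >= 0.0, 1, -1)
    return X, y

def stratified_sample(size, rng, pool=100000):
    """Keep (approximately) the class proportions of a large pool, which
    reflect the roughly 0.7 / 0.3 priors mentioned above."""
    X, y = sample_domain(pool, rng)
    idx = []
    for cls in (1, -1):
        cls_idx = np.flatnonzero(y == cls)
        k = int(round(size * cls_idx.size / pool))
        idx.extend(rng.choice(cls_idx, size=k, replace=False))
    idx = np.asarray(idx)
    return X[idx], y[idx]

# Training sets of 25 ... 12800 instances, plus a 10000-instance test set:
# rng = np.random.default_rng(0)
# X_train, y_train = stratified_sample(200, rng)
# X_test, y_test = sample_domain(10000, rng)
```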

Figure 11. Accuracy (top) and tree size (bottom) as a function of the number of training instances. Note the log scale on the x axis.

Figure 11 shows the accuracy on the test set and the tree size (where applicable) for LMT, C4.5 and SimpleLogistic as a function of the training set size (note the log scale). It can be seen that the accuracy on the test set increases monotonically with the number of training examples for every method. The learning curves of SimpleLogistic and C4.5 cross at around 200 training examples. For small training sets, a bias towards simple models pays off because it allows more stable estimates and does not overfit, while the less biased model space of tree induction allows nonlinear patterns to be captured if enough data is available. More specifically, the learning curve for logistic regression levels off at a training set size of about 100 instances because the method cannot fit the nonlinear part of the underlying distribution, while tree induction continues to improve its estimate with more data. Looking at LMT, we see that the accuracy is almost identical to that of SimpleLogistic for 25, 50 and 100 instances, but continues to increase for larger datasets. The models built by LMT are more accurate than those of C4.5 even for large training sets. LMT reaches the same accuracy as C4.5 at about half the number of training instances.

Figure 12. Gain/loss in accuracy of LMT and C4.5 over SimpleLogistic. Datasets are ordered (left to right) by increasing size of LMT tree.

The graph visualizing the tree sizes shows that the LMT algorithm starts to build a sizeable tree structure around the point where the learning curves of SimpleLogistic and C4.5 cross, which indicates that the pruning method correctly detects at what point a tree structure is superior to a linear logistic model. Finally, we note that both tree induction methods build larger and larger trees asymptotically to fit the nonlinear distribution (note the data is noise-free), but the trees built by LMT are significantly smaller than the standard classification trees.

Figure 12 shows a similar study using the UCI datasets. We sorted the datasets according to the size of the trees produced by LMT and then plotted the difference in accuracy of LMT over SimpleLogistic and of C4.5 over SimpleLogistic. Although the curves are not as smooth as for the artificial problem, it is clear that LMT is adjusting its tree structure to match the complexity of the different domains. On the left-hand side of the graph, where LMT's models are the smallest, accuracy is equal to SimpleLogistic and generally better than C4.5. The domains on the right-hand side of the graph require more complex models to capture their structure. In all these cases LMT outperforms SimpleLogistic, and in most cases C4.5 as well.

These examples illustrate how the LMT algorithm smoothly scales model complexity from a simple linear model as produced by logistic regression to a more complex combination of tree structure and logistic regression models.

5.4.2. Comparing LMT to other enhanced tree learners. This section compares LMT to several other enhanced algorithms. In particular, we present results for NBTree, LTree (using linear discriminants for splitting and at the leaves) and two logistic tree algorithms: LTree (using logistic regression for splitting and at the leaves) and Lotus. When building logistic model trees, Lotus can build the logistic regression functions using either all the numeric attributes present in the data, or by just selecting one of the attributes. These two modes often gave very different results in our experiments. Lotus can only be applied to two-class problems. Furthermore, a logistic model tree can only be fit if the data contains at least one numeric attribute. On the heart-h dataset Lotus did not give a reasonable result initially—this was due to an attribute with 99 percent of its values missing. We removed this attribute from the dataset before running Lotus on it. On the labour and horse-colic datasets we were unable to get a result from Lotus using multiple logistic regression at all.

Table 7. Mean classification accuracy and standard deviation for LMT vs. NBTree, LTree with linear discriminants and LTree with logistic regression.

Dataset LMT NBTree LTreeLin LTreeLog

Labor 91.37 ± 11.01 91.70 ± 12.32 84.57 ± 14.78 86.47 ± 17.76
Zoo 94.89 ± 6.44 94.95 ± 6.24 93.79 ± 6.67 94.87 ± 6.79
Lymphography 84.10 ± 10.00 80.89 ± 8.77 80.30 ± 8.98 76.90 ± 10.07 •
Iris 95.80 ± 4.89 93.53 ± 5.64 98.00 ± 3.35 97.13 ± 4.57
Hepatitis 84.21 ± 8.21 81.36 ± 10.80 84.24 ± 8.85 83.43 ± 9.72
Glass (G2) 77.28 ± 9.90 80.75 ± 9.40 72.39 ± 10.30 72.85 ± 10.42
Autos 76.13 ± 8.84 77.18 ± 9.71 57.74 ± 10.40 • 60.34 ± 10.69 •
Sonar 76.32 ± 9.58 77.16 ± 10.46 73.50 ± 9.07 74.25 ± 9.04
Glass 69.15 ± 8.99 70.16 ± 9.67 65.27 ± 10.42 64.77 ± 9.90
Audiology 84.19 ± 7.17 76.82 ± 7.38 • 75.01 ± 7.89 • 75.01 ± 7.89 •
Heart-statlog 83.22 ± 6.50 80.59 ± 7.12 83.52 ± 6.28 83.00 ± 6.83
Breast-cancer 74.91 ± 6.29 70.99 ± 7.94 70.58 ± 6.90 70.45 ± 6.78 •
Heart-h 84.00 ± 6.33 81.33 ± 6.66 80.10 ± 7.51 82.46 ± 6.54
Heart-c 83.51 ± 6.67 80.60 ± 6.29 80.70 ± 7.80 80.66 ± 7.59
Primary-tumor 47.63 ± 5.84 47.50 ± 6.49 45.29 ± 6.33 45.29 ± 6.33
Ionosphere 92.99 ± 4.13 89.49 ± 5.12 88.95 ± 5.10 • 88.18 ± 5.06 •
Horse-colic 83.66 ± 6.13 81.88 ± 6.31 82.11 ± 6.23 81.68 ± 6.50
Vote 95.90 ± 2.67 95.03 ± 3.29 95.72 ± 2.97 95.51 ± 3.20
Balance-scale 89.71 ± 2.68 75.83 ± 5.32 • 92.86 ± 3.22 ◦ 92.78 ± 3.49 ◦
Soybean 93.43 ± 2.59 92.87 ± 3.07 90.73 ± 3.04 • 91.16 ± 3.09 •
Australian 85.04 ± 3.84 85.07 ± 4.03 84.99 ± 3.91 84.64 ± 4.09
Breast-w 96.18 ± 2.20 96.60 ± 2.04 96.88 ± 1.99 96.75 ± 2.04
Pima-indians 77.08 ± 4.65 75.18 ± 5.05 76.73 ± 4.83 76.64 ± 4.69
Vehicle 83.04 ± 3.70 71.03 ± 4.55 • 79.52 ± 3.81 • 79.32 ± 3.72 •
Anneal 99.52 ± 0.73 98.53 ± 1.23 • 95.88 ± 2.36 • 96.56 ± 1.70 •
Vowel 93.94 ± 2.55 92.33 ± 3.05 79.70 ± 4.09 • 78.72 ± 4.53 •
German 75.37 ± 3.53 74.07 ± 4.10 74.90 ± 3.47 74.94 ± 3.41
Segment 96.28 ± 1.36 95.22 ± 1.44 96.70 ± 1.27 96.71 ± 1.21
Splice 95.86 ± 1.17 95.43 ± 1.14 92.39 ± 1.63 • 92.85 ± 1.53 •
kr-vs-kp 99.64 ± 0.35 97.81 ± 2.05 • 98.15 ± 0.90 • 98.08 ± 0.93 •
Hypothyroid 99.54 ± 0.39 99.58 ± 0.38 98.98 ± 0.52 • 99.25 ± 0.40 •
Sick 98.95 ± 0.60 97.82 ± 0.74 • 98.40 ± 0.62 • 98.37 ± 0.65 •
Waveform 86.96 ± 1.58 80.12 ± 2.17 • 83.96 ± 1.81 • 84.55 ± 1.67 •
Optdigits 97.28 ± 0.64 89.27 ± 1.38 • 96.47 ± 0.75 • 95.94 ± 0.82 •
Pendigits 98.56 ± 0.32 95.26 ± 0.69 • 97.09 ± 0.49 • 97.43 ± 0.44 •
Letter 92.13 ± 0.74 86.69 ± 0.66 • 88.18 ± 0.68 • 88.07 ± 0.74 •

•, ◦ statistically significant win or loss for LMT.

Table 8. Pairwise comparison of classification accuracy for LMT, NBTree, LTree with linear discriminants and LTree with logistic regression: number indicates how often method in column (significantly) outperforms method in row.

LMT NBTree LTreeLin LTreeLog

LMT – 9 (0) 6 (1) 4 (1)
NBTree 27 (10) – 18 (9) 18 (9)
LTreeLin 30 (15) 18 (6) – 14 (0)
LTreeLog 32 (17) 18 (5) 20 (0) –

Table 7 shows the average classification accuracy of LMT compared to NBTree, LTree using linear discriminants (LTreeLin) and LTree using logistic regression (LTreeLog). Apart from the dataset balance-scale, LMT performs as well or better than the other methods. Compared to NBTree, LMT is significantly more accurate on 10 datasets, with most of those wins applying to the larger datasets. Compared to LTree using linear discriminants, LMT is significantly more accurate on 15 datasets and significantly less accurate on one (the aforementioned balance-scale). When logistic regression is used with LTree, LMT is significantly more accurate on 17 datasets and less accurate on balance-scale. As was the case with NBTree, LMT's superiority occurs more often than not on the larger datasets.

If the significance of the individual differences is not taken into account, Table 8 shows that the win/loss ratio for LMT compared to NBTree is 27/9. This difference is statistically significant according to a two-tailed sign test: the corresponding p-value is 0.004. Compared to LTree using linear discriminants and LTree using logistic regression, the win/loss ratio is 30/6 and 32/4 respectively. According to a two-tailed sign test there is very strong evidence that LMT is superior to both of these methods (p-values < 0.0001).

Table 9 compares the size of the trees produced by LMT with those produced by NBTree and LTree (using both linear discriminants and logistic regression). We can see that LMT never produces significantly larger trees than LTree or NBTree. Compared to NBTree, LMT produces smaller trees on all but five datasets (G2, glass, breast-w, german and splice); compared to LTree with linear discriminants, LMT produces smaller trees in all cases except for horse-colic; and compared with LTree with logistic regression, LMT produces smaller trees on all but five datasets (sonar, ionosphere, horse-colic, breast-w and pima-indians).

Looking at the summary of wins and losses in Table 10, and ignoring the significance of the individual differences, we can see that the ratio of wins (smaller trees) to losses (larger trees) for LMT to NBTree is 34/2. Against LTree using linear discriminants it is 35/1 and against LTree using logistic regression it is 34/2. All of these ratios are strongly significant according to a two-tailed sign test (p-values < 0.0001).

Table 9. Mean tree size (number of leaves) and standard deviation for LMT vs. NBTree, LTree with linear discriminants and LTree with logistic regression.

Dataset LMT NBTree LTreeLin LTreeLog

Labor 1.05 ± 0.26 2.81 ± 0.60 • 2.51 ± 0.73 • 2.00 ± 0.00 •
Zoo 1.05 ± 0.26 4.82 ± 0.96 • 7.13 ± 0.34 • 7.45 ± 0.50 •
Lymphography 1.18 ± 0.80 6.44 ± 2.11 • 2.26 ± 0.48 • 2.66 ± 1.24 •
Iris 1.11 ± 0.53 2.60 ± 1.42 • 3.00 ± 0.00 • 3.00 ± 0.00 •
Hepatitis 1.05 ± 0.30 6.04 ± 1.81 • 2.17 ± 0.64 • 2.31 ± 0.87 •
Glass (G2) 5.05 ± 3.24 2.94 ± 2.18 10.70 ± 2.72 • 10.04 ± 2.64 •
Autos 2.89 ± 5.20 18.67 ± 5.33 • 12.57 ± 5.53 • 11.81 ± 4.79 •
Sonar 2.12 ± 1.51 7.52 ± 1.23 • 7.02 ± 2.23 • 2.11 ± 0.49
Glass 7.46 ± 3.55 5.12 ± 4.01 21.67 ± 3.01 • 20.83 ± 2.54 •
Audiology 1.01 ± 0.10 13.50 ± 3.43 • 26.05 ± 2.19 • 26.05 ± 2.19 •
Heart-statlog 1.04 ± 0.32 5.16 ± 2.17 • 3.76 ± 3.07 • 2.69 ± 2.14 •
Breast-cancer 1.12 ± 1.20 9.17 ± 9.76 • 3.24 ± 1.27 • 3.26 ± 1.49 •
Heart-h 1.00 ± 0.00 7.13 ± 4.69 • 2.91 ± 2.37 • 3.34 ± 2.95 •
Heart-c 1.03 ± 0.17 9.42 ± 4.16 • 9.51 ± 4.98 • 10.03 ± 5.19 •
Primary-tumor 1.03 ± 0.17 5.15 ± 5.87 • 29.37 ± 6.27 • 29.37 ± 6.27 •
Ionosphere 4.40 ± 1.86 8.50 ± 1.71 • 8.73 ± 2.39 • 5.91 ± 2.96
Horse-colic 2.78 ± 2.58 15.92 ± 7.28 • 2.24 ± 1.04 2.16 ± 0.94
Vote 1.08 ± 0.44 9.88 ± 3.05 • 2.27 ± 0.71 • 2.00 ± 0.00 •
Balance-scale 4.73 ± 2.61 8.90 ± 5.37 • 14.35 ± 3.39 • 14.03 ± 3.42 •
Soybean 2.44 ± 5.44 21.27 ± 11.71 • 32.78 ± 2.21 • 32.48 ± 1.69 •
Australian 1.15 ± 0.59 10.88 ± 9.33 • 4.69 ± 4.36 • 4.20 ± 3.99 •
Breast-w 1.24 ± 0.99 3.14 ± 2.73 2.04 ± 0.28 • 2.12 ± 0.86
Pima-indians 1.03 ± 0.30 3.04 ± 2.38 • 4.59 ± 4.79 • 3.80 ± 4.45
Vehicle 2.64 ± 1.97 28.40 ± 5.92 • 29.88 ± 8.75 • 25.52 ± 9.37 •
Anneal 1.11 ± 0.31 30.00 ± 8.73 • 8.57 ± 2.63 • 15.38 ± 2.73 •
Vowel 4.81 ± 1.23 42.02 ± 1.78 • 66.00 ± 5.91 • 60.90 ± 5.18 •
German 1.00 ± 0.00 11.21 ± 15.74 8.64 ± 6.82 • 9.34 ± 7.96 •
Segment 4.85 ± 2.05 43.34 ± 6.08 • 28.12 ± 3.10 • 29.74 ± 2.44 •
Splice 1.00 ± 0.00 2.39 ± 7.29 38.49 ± 9.18 • 20.99 ± 8.98 •
kr-vs-kp 7.92 ± 0.56 24.59 ± 11.19 • 24.74 ± 3.95 • 24.17 ± 5.63 •
Hypothyroid 5.61 ± 0.93 26.58 ± 8.23 • 17.49 ± 2.52 • 14.17 ± 1.81 •
Sick 13.56 ± 3.12 52.44 ± 11.25 • 26.30 ± 3.10 • 22.02 ± 3.85 •
Waveform 1.00 ± 0.00 47.74 ± 18.34 • 104.74 ± 22.84 • 77.56 ± 27.53 •
Optdigits 3.45 ± 1.66 174.70 ± 14.37 • 39.22 ± 4.60 • 49.33 ± 6.38 •
Pendigits 11.30 ± 2.78 222.93 ± 20.09 • 119.90 ± 5.67 • 96.36 ± 9.29 •
Letter 41.73 ± 8.68 724.17 ± 23.50 • 836.16 ± 17.44 • 784.25 ± 31.49 •

•, ◦ statistically significant smaller or larger trees for LMT.

Table 10. Pairwise comparison of tree size (number of leaves) for LMT, NBTree, LTree with linear discriminants and LTree with logistic regression: number indicates how often method in column has (significantly) larger tree size than method in row.

LMT NBTree LTreeLin LTreeLog

LMT – 34 (31) 35 (35) 34 (31)
NBTree 2 (0) – 17 (11) 14 (11)
LTreeLin 1 (0) 19 (12) – 11 (0)
LTreeLog 2 (0) 22 (15) 22 (0) –

Table 11. Mean classification accuracy and standard deviation for LMT vs Lotus using simple logistic regression (LotusS) and Lotus using multiple logistic regression (LotusM).

Data Set LMT LotusS LotusM

Labor 91.37 ± 11.01 74.07 ± 18.18 • –
Hepatitis 84.21 ± 8.21 79.74 ± 9.18 82.08 ± 6.74
Glass (G2) 77.28 ± 9.90 75.77 ± 9.97 69.99 ± 10.49
Sonar 76.32 ± 9.58 72.38 ± 9.13 72.27 ± 7.92
Heart-statlog 83.22 ± 6.50 77.63 ± 7.16 • 83.67 ± 6.43
Heart-h 84.00 ± 6.33 80.25 ± 6.42 78.24 ± 6.75 •
Heart-c 83.51 ± 6.67 76.63 ± 6.72 • 77.49 ± 7.49 •
Ionosphere 92.99 ± 4.13 89.04 ± 4.57 • 87.72 ± 5.57 •
Horse-colic 83.66 ± 6.13 83.86 ± 5.67 –
Australian 85.04 ± 3.84 84.35 ± 4.18 85.14 ± 4.03
Breast-w 96.18 ± 2.20 94.61 ± 2.66 • 96.44 ± 2.13
Pima-indians 77.08 ± 4.65 75.08 ± 5.14 77.47 ± 4.39
German 75.37 ± 3.53 72.69 ± 3.11 72.55 ± 3.36 •
Sick 98.95 ± 0.60 97.84 ± 0.66 • 96.94 ± 1.00 •

•, ◦ statistically significant win or loss for LMT.

Table 11 compares LMT with Lotus using simple logistic regression (LotusS) and Lotus using multiple logistic regression (LotusM) on the two-class datasets. Table 12 summarizes the results. We can see that LMT is never significantly less accurate than either mode of Lotus and is significantly more accurate than LotusS on six datasets and significantly more accurate than LotusM on five datasets. Ignoring the significance of the individual differences, LMT is more accurate than LotusS on 13 datasets and less accurate on one. This ratio is significant at the one percent level according to a two-tailed sign test (p-value of 0.0018). Against LotusM, the ratio of wins to losses is 8/4 in favour of LMT. The p-value from the sign test in this case is 0.388. Consequently, there is only very weak evidence that LMT is superior to LotusM.

Table 12. Pairwise comparison of classification accuracy for LMT, Lotus using simple logistic regression (LotusS) and Lotus using multiple logistic regression (LotusM): number indicates how often method in column (significantly) outperforms method in row.

LMT LotusS LotusM

LMT – 1 (0) 4 (0)
LotusS 13 (6) – 6 (0)
LotusM 8 (5) 6 (0) –

Table 13. Mean tree size (number of leaves) and standard deviation for LMT vs. Lotus using simple logistic regression (LotusS) and Lotus using multiple logistic regression (LotusM).

Data Set LMT LotusS LotusM

Labor 1.05 ± 0.26 1.41 ± 0.51 –
Hepatitis 1.05 ± 0.30 1.21 ± 0.76 1.00 ± 0.00
Glass (G2) 5.05 ± 3.24 2.87 ± 1.40 1.14 ± 0.35 ◦
Sonar 2.12 ± 1.51 1.65 ± 0.95 1.00 ± 0.00 ◦
Heart-statlog 1.04 ± 0.32 3.57 ± 1.86 • 1.00 ± 0.00
Heart-h 1.00 ± 0.00 3.33 ± 1.80 • 1.99 ± 0.98 •
Heart-c 1.03 ± 0.17 3.66 ± 1.77 • 1.47 ± 0.63
Ionosphere 4.40 ± 1.86 3.59 ± 0.85 1.00 ± 0.00 ◦
Horse-colic 2.78 ± 2.58 3.90 ± 2.30 –
Australian 1.15 ± 0.59 2.58 ± 0.75 • 2.03 ± 0.30 •
Breast-w 1.24 ± 0.99 4.51 ± 1.89 • 1.02 ± 0.14
Pima-indians 1.03 ± 0.30 2.86 ± 0.97 • 1.00 ± 0.00
German 1.00 ± 0.00 2.37 ± 0.65 • 2.11 ± 0.40 •
Sick 13.56 ± 3.12 8.33 ± 2.73 ◦ 2.70 ± 1.19 ◦

•, ◦ statistically significantly smaller or larger trees for LMT.

Table 13 gives the average tree size and standard deviation for LMT, LotusS and LotusM. Table 14 summarizes the results. LMT builds significantly smaller trees than LotusS on seven out of the 14 datasets and significantly larger trees on one. On this single dataset (sick) LMT is more accurate than LotusS, so the extra structure is justified. Against LotusM, LMT produces significantly smaller trees on three of the datasets and significantly larger trees on four. In all of the cases where LMT builds larger trees it is more accurate than LotusM (significantly so on two of the four datasets: ionosphere and sick). Ignoring the significance of the individual differences, LMT builds smaller trees than LotusS on 10 datasets and larger trees on four. Against LotusM the ratio of wins to losses is 4/8 in favour of LotusM. In both these cases there is only weak evidence of the superiority of one method over the other according to a two-tailed sign test (p-values of 0.18 and 0.388 respectively).

5.4.3. Comparing LMT to multiple-tree models. This section compares LMT to methods that build models consisting of a set of trees: M5' for classification and boosted C4.5. In order to solve classification problems with M5'—a learner that estimates numeric target variables—we convert the classification problem into a regression problem as described in Section 2.2. The final model consists of a set of model trees, one for every class. When boosting trees using AdaBoost.M1, multiple C4.5 trees are built on reweighted versions of the training data, as explained in Section 3.6. For boosting, it is not clear a priori how many boosting iterations should be performed. AdaBoost.M1 was run with a maximum of 10 and 100 boosting iterations (though boosting can be stopped sooner, if the error of one of the classifiers built on the reweighted training data reaches zero or exceeds 0.5).
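The class binarization for M5' can be sketched as follows (hypothetical helper names; the model trees themselves, exposing a predict method, are assumed to be fitted elsewhere): each class gets a 0/1 indicator target, one model tree is fit per indicator, and a new instance is assigned the class whose tree predicts the largest value.

```python
import numpy as np

def class_indicator_targets(y, classes):
    """One 0/1 regression target per class: column j is 1 where the class
    label equals classes[j] and 0 elsewhere."""
    return np.column_stack([(y == c).astype(float) for c in classes])

def predict_from_model_trees(model_trees, X, classes):
    """Classify by taking the class whose (already fitted) model tree
    predicts the largest indicator value."""
    scores = np.column_stack([tree.predict(X) for tree in model_trees])
    return np.asarray(classes)[np.argmax(scores, axis=1)]
```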

Table 14. Pairwise comparison of tree size (number of leaves) for LMT, Lotus using simple logistic regression (LotusS) and Lotus using multiple logistic regression (LotusM): number indicates how often method in column has (significantly) larger tree size than method in row.

LMT LotusS LotusM

LMT – 10 (7) 4 (3)
LotusS 4 (1) – 0 (0)
LotusM 8 (4) 12 (0) –

Table 15 gives the classification accuracy of LMT, M5' for classification, and boosted C4.5 trees using AdaBoost.M1 with 10 and 100 boosting iterations. Table 16 summarizes the wins and losses. LMT outperforms M5' for classification: it is significantly more accurate on 11 datasets and significantly less accurate on only one. If the significance of the differences is not taken into account, LMT has a win/loss ratio of 31/5 against M5'. According to a two-tailed sign test there is very strong evidence that LMT is superior to M5' for classification (p-value < 0.0001).

Comparing LMT with boosted C4.5 trees, it can be seen that performing 100 boosting iterations is superior to performing only 10. We conclude that performing 10 boosting iterations is not enough and concentrate on AdaBoost(100). Looking at the number of significant wins and losses, we can see that LMT is comparable to AdaBoost(100) (with seven wins and nine losses). However, the relative performance of the two schemes really depends on the datasets. Interestingly, there are several datasets where boosting achieves a similar gain in accuracy compared to simple tree induction as LMT, i.e. there was a win for LMT against C4.5 and there is neither a loss nor a win of LMT against AdaBoost(100). This is the case for seven datasets. On the nine datasets where boosted trees outperform LMT, they achieve a higher accuracy than any other scheme, and the gain is quite impressive (up to seven percentage points higher than the next-best classifier). It is clear that the extra structure induced by boosting is justified on these datasets.

The seven datasets for which AdaBoost(100) is significantly less accurate than LMT can be split into two groups. For two datasets (breast-cancer, heart-h), boosting seems to have failed: there was no win of LMT over C4.5, but there is one over AdaBoost(100). It is reasonable to expect that using a more advanced boosting scheme (for example, controlling the number of boosting iterations by cross-validation) would make these losses disappear. For the other five datasets, boosting seems to have no impact on performance (balance-scale, primary-tumor) or increases performance compared to C4.5, but not as much as building logistic model trees (waveform, vehicle, splice).

Table 15. Mean classification accuracy and standard deviation for LMT vs. M5’ for classification and boosted C4.5.

Dataset LMT M5’ AdaBoost(10) AdaBoost(100)

Labor 91.37 ± 11.01 85.13 ± 16.33 87.17 ± 14.28 88.73 ± 14.27
Zoo 94.89 ± 6.44 94.48 ± 6.43 96.15 ± 6.11 96.35 ± 6.07
Lymphography 84.10 ± 10.00 80.35 ± 9.32 80.87 ± 8.63 84.72 ± 8.41
Iris 95.80 ± 4.89 94.93 ± 5.62 94.33 ± 5.22 94.53 ± 5.05
Hepatitis 84.21 ± 8.21 82.38 ± 8.79 82.38 ± 8.01 84.93 ± 7.79
Glass (G2) 77.28 ± 9.90 81.08 ± 8.73 85.17 ± 7.75 ◦ 88.72 ± 6.42 ◦
Autos 76.13 ± 8.84 76.03 ± 10.00 85.46 ± 7.23 ◦ 86.77 ± 6.81 ◦
Sonar 76.32 ± 9.58 78.37 ± 8.82 79.22 ± 8.70 85.14 ± 7.84 ◦
Glass 69.15 ± 8.99 71.30 ± 9.08 75.15 ± 7.59 ◦ 78.78 ± 7.80 ◦
Audiology 84.19 ± 7.17 76.83 ± 8.62 • 84.84 ± 7.46 84.70 ± 7.57
Heart-statlog 83.22 ± 6.50 82.15 ± 6.77 78.59 ± 7.15 80.44 ± 7.08
Breast-cancer 74.91 ± 6.29 70.40 ± 6.84 • 66.75 ± 7.61 • 66.36 ± 8.18 •
Heart-h 84.00 ± 6.33 82.44 ± 6.39 78.68 ± 7.43 78.42 ± 7.20 •
Heart-c 83.51 ± 6.67 82.14 ± 6.65 78.76 ± 7.09 • 80.00 ± 6.55
Primary-tumor 47.63 ± 5.84 45.26 ± 6.22 41.65 ± 6.55 • 41.65 ± 6.55 •
Ionosphere 92.99 ± 4.13 89.92 ± 4.18 • 93.05 ± 3.92 94.02 ± 3.83
Horse-colic 83.66 ± 6.13 83.23 ± 5.40 81.63 ± 6.19 81.85 ± 5.70
Vote 95.90 ± 2.67 95.61 ± 2.77 95.51 ± 3.05 95.26 ± 3.13
Balance-scale 89.71 ± 2.68 87.76 ± 2.23 78.35 ± 3.78 • 76.11 ± 4.09 •
Soybean 93.43 ± 2.59 92.90 ± 2.61 92.83 ± 2.85 93.32 ± 2.81
Australian 85.04 ± 3.84 85.39 ± 3.87 84.01 ± 4.36 86.43 ± 3.98
Breast-w 96.18 ± 2.20 95.85 ± 2.15 96.08 ± 2.16 96.70 ± 2.18
Pima-indians 77.08 ± 4.65 76.56 ± 4.71 71.81 ± 4.85 • 73.89 ± 4.75
Vehicle 83.04 ± 3.70 78.66 ± 4.38 • 75.59 ± 3.99 • 77.87 ± 3.58 •
Anneal 99.52 ± 0.73 98.64 ± 1.13 99.59 ± 0.70 99.63 ± 0.65
Vowel 93.94 ± 2.55 80.93 ± 4.68 • 92.89 ± 2.82 96.81 ± 1.93 ◦
German 75.37 ± 3.53 74.99 ± 3.31 70.91 ± 3.60 • 74.53 ± 3.26
Segment 96.28 ± 1.36 97.31 ± 1.05 ◦ 98.12 ± 0.92 ◦ 98.61 ± 0.69 ◦
Splice 95.86 ± 1.17 95.40 ± 1.17 94.58 ± 1.16 • 94.95 ± 1.25 •
kr-vs-kp 99.64 ± 0.35 99.21 ± 0.50 • 99.59 ± 0.31 99.62 ± 0.30
Hypothyroid 99.54 ± 0.39 99.44 ± 0.38 99.65 ± 0.31 99.69 ± 0.31
Sick 98.95 ± 0.60 98.41 ± 0.62 • 98.99 ± 0.50 99.05 ± 0.50
Waveform 86.96 ± 1.58 82.51 ± 1.60 • 81.32 ± 1.90 • 85.06 ± 1.71 •
Optdigits 97.28 ± 0.64 95.43 ± 0.93 • 97.38 ± 0.62 98.51 ± 0.47 ◦
Pendigits 98.56 ± 0.32 97.87 ± 0.47 • 98.99 ± 0.31 ◦ 99.41 ± 0.26 ◦
Letter 92.13 ± 0.74 91.05 ± 0.62 • 95.53 ± 0.43 ◦ 97.25 ± 0.37 ◦

•, ◦ statistically significant win or loss for LMT.

Table 16. Pairwise comparison of classification accuracy for LMT, M5’ for classification and boosted C4.5: number indicates how often method in column (significantly) outperforms method in row.

LMT M5’ AdaBoost(10) AdaBoost(100)

LMT – 5 (1) 14 (6) 19 (9)
M5' 31 (11) – 19 (11) 23 (15)
AdaBoost(10) 22 (9) 17 (4) – 30 (7)
AdaBoost(100) 17 (7) 13 (1) 5 (1) –

Applying a two-tailed sign test to the win/loss ratios (ignoring the significance of individual differences) for LMT against AdaBoost(10) and AdaBoost(100) shows that there is only weak evidence to suggest the superiority of one method over the other. Against AdaBoost(10) the win/loss ratio is 22/14 (p-value of 0.243) and against AdaBoost(100) the win/loss ratio is 17/19 (p-value of 0.868).

6. Conclusions

In this work we have introduced a new method for learning logistic model trees, called LMT, that builds on earlier work on model trees. This method employs the LogitBoost algorithm (Friedman, Hastie, & Tibshirani, 2000) for building the logistic regression functions at the nodes of a tree and uses the well-known CART algorithm for pruning. We have shown how LogitBoost can be used to select the most relevant attributes in the data when performing logistic regression by performing a simple regression in each iteration and stopping before convergence to the maximum likelihood solution. The optimal number of iterations is determined by cross-validation. Our experiments show that, when used as a standalone learning algorithm, this method yields final models that contain significantly fewer parameters than the ones generated by standard maximum likelihood logistic regression while never significantly decreasing accuracy and sometimes significantly increasing it. Another benefit obtained from using LogitBoost for building the logistic regression functions in a model tree is that a separate smoothing process is not required. This is because they can be constructed by incrementally refining logistic regression functions fit at higher levels in the tree.

Our experiments demonstrate that LMT produces models that are often more accurate than those produced by C4.5, CART and standalone logistic regression on real-world datasets. LMT also outperforms well-known enhanced decision tree learners such as LTree and NBTree and the logistic model tree algorithm Lotus. More surprisingly, LMT appears to be competitive with boosted C4.5 trees. LMT produces a single tree containing binary splits on numeric attributes, multiway splits on nominal ones, and logistic regression models at the leaves, and the algorithm ensures that only relevant attributes are included in the latter. The result is not quite as easy to interpret as a standard decision tree, but much more intelligible than a committee of multiple trees or more opaque classifiers like kernel-based estimators. Like other tree induction methods, LMT can be used 'off the shelf'—it does not require any tuning of parameters by the user.

6.1. Future work

There are several issues that provide directions for future work. Probably the most important drawback of logistic model tree induction is the high computational complexity compared to simple tree induction. Although the asymptotic complexity of LMT is acceptable compared to other methods (see Section 4.3), the algorithm is quite slow in practice. Most of the time is spent fitting the logistic regression models at the nodes with the LogitBoost algorithm. It would be worthwhile looking for a faster way of fitting the logistic models that achieves the same performance (including variable selection). The heuristic discussed in Section 4.3.1 significantly speeds up the algorithm without decreasing prediction accuracy, but it is admittedly ad-hoc and not very intuitive. Further research might yield a more principled way of determining the optimum number of LogitBoost iterations to be performed at a node without an additional cross-validation. A further issue is the handling of missing values. At present, LMT uses a simple global imputation scheme for filling in missing values. Although our experiments do not suggest a particular weakness of the algorithm on datasets with missing values, a more sophisticated scheme for handling them might improve accuracy for domains where missing values occur frequently.

Notes

1. We take the number of classes J as a constant here, otherwise there is another factor of J.
2. This has to be determined using cross-validation or an independent test set, of course, because the training error will continue to decrease.
3. Note that this involves growing multiple 'auxiliary' logistic model trees.
4. In the conference version of this paper (Landwehr, Hall, & Frank, 2003) we also compared against the PLUS algorithm for inducing logistic model trees. We do not report results for PLUS here as the software is no longer available at time of writing and there is no publication describing it.
5. R can be obtained from http://www.r-project.org.
6. Available from http://www.liacc.up.pt/∼jgama.
7. Lotus can be obtained from http://www.stat.nus.sg/∼kinyee/lotus.html.
8. Weka is available from http://www.cs.waikato.ac.nz/∼ml/weka.

References

Blake, C., & Merz, C. J. (1998). UCI repository of machine learning databases. [www.ics.uci.edu/∼mlearn/MLRepository.html].
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth.
Chan, K. Y., & Loh, W. Y. (2004). LOTUS: An algorithm for building accurate and comprehensible logistic regression trees. Journal of Computational and Graphical Statistics, 13:4, 826–852.
Frank, E., Wang, Y., Inglis, S., Holmes, G., & Witten, I. H. (1998). Using model trees for classification. Machine Learning, 32:1, 63–76.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proc 13th International Conference on Machine Learning (pp. 148–156). Morgan Kaufmann.

Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 38:2, 337–374.
Gama, J. (2004). Functional trees. Machine Learning, 55:3, 219–250.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:3, 299–314.
Kohavi, R. (1996). Scaling up the accuracy of naive Bayes classifiers: A decision-tree hybrid. In Proc 2nd International Conference on Knowledge Discovery and Data Mining (pp. 202–207). Menlo Park, CA: AAAI Press.
Landwehr, N., Hall, M., & Frank, E. (2003). Logistic model trees. In Proc 14th European Conference on Machine Learning (pp. 241–252). Springer-Verlag.
Lim, T.-S., Loh, W. Y., & Shih, Y. S. (2000). A comparison of prediction accuracy, complexity, and training time for thirty-three old and new classification algorithms. Machine Learning, 40:3, 641–666.
Lubinsky, D. (1994). Tree structured interpretable regression. In D. Fisher, & H. J. Lenz (Eds.), Learning from Data (pp. 387–398). Springer-Verlag.
Malerba, D., Appice, A., Ceci, M., & Monopoli, M. (2002). Trading-off local versus global effects of regression nodes in model trees. In M.-S. Hacid, Z. W. Ras, D. A. Zighed, & Y. Kodratoff (Eds.), Proc 13th International Symposium on Foundations of Intelligent Systems (pp. 393–402). Springer-Verlag.
Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52:3, 239–281.
Perlich, C., Provost, F., & Simonoff, J. (2003). Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research, 4, 211–255.
Quinlan, J. R. (1992). Learning with continuous classes. In Proc 5th Australian Joint Conference on Artificial Intelligence (pp. 343–348). World Scientific Publishing Company Incorporated.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Wang, Y., & Witten, I. (1997). Inducing model trees for continuous classes. In Proc of Poster Papers, European Conference on Machine Learning. Prague, Czech Republic.
Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann.

Received June 22, 2004 Revised December 21, 2004 Accepted December 22, 2004