Piecewise Regression through the Akaike Information Criterion using Mathematical Programming*

Ioannis Gkioulekas, Lazaros G. Papageorgiou

Centre for Process Systems Engineering, Department of Chemical Engineering, UCL (University College London), Torrington Place, London WC1E 7JE, UK

Corresponding author: [email protected]

* This work was supported by The Leverhulme Trust under Grant number RPG-2015-240.

Abstract: In machine learning, regression analysis is a tool for predicting output variables from a set of known independent variables. Through regression analysis, a function that captures the relationship between the variables is fitted to the data. Many methods from the literature tackle this problem with varying degrees of difficulty. Some are simple, such as linear regression and least squares; others are more complicated, such as support vector regression. Piecewise or segmented regression is a method of analysis that partitions the independent variables into intervals and fits a function to each interval. In this work, the Optimal Piecewise Linear Regression Analysis (OPLRA) model from the literature is used to tackle the problem of segmented analysis. This model is a mathematical programming approach, formulated as a mixed integer linear programming problem, that optimally partitions the data into multiple regions and calculates the regression coefficients, while minimising the mean absolute error of the fitting. However, the number of regions must be known a priori. In this work, an extension of the model is proposed that can optimally decide on the number of regions using information criteria. Specifically, the Akaike Information Criterion is used and the objective is to minimise its value. By using the criterion, the model no longer needs a heuristic approach to decide on the number of regions, and it also deals with the problems of over-fitting and model complexity.

Keywords: Mathematical programming, Regression analysis, Optimisation, Information criterion, Machine learning

1. INTRODUCTION

In statistics, regression analysis is a process for estimating the relationship between variables. Specifically, given a dataset with dependent and independent variables, regression aims to understand how the value of the dependent variable changes when the independent variables vary. The goal of the analysis is to retrieve a mathematical function that correlates the predictors (i.e. the independent variables) with the response (i.e. the dependent variable).

There are many examples of regression analysis in the literature, such as linear and least squares regression, Support Vector Regression (SVR) (Smola and Schölkopf, 2004), K-nearest neighbour (Korhonen and Kangas, 1997), Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991) and Random Forest (Breiman, 2001). More recent work includes the automated learning of algebraic models for optimisation (ALAMO) (Cozad et al., 2014) and an optimisation-based regression approach called Optimal Piecewise Linear Regression Analysis (OPLRA) (Yang et al., 2016).

1.1 Over-fitting and Information Criteria

One of the challenges in statistics and machine learning is to develop a model that fits a set of training data but is also able to make predictions on other datasets. During this process, the problem of over-fitting arises. In over-fitting, the proposed statistical model describes not only the set of training data but also its noise. That leads to a model that is tailored to fit the training dataset rather than reflecting the overall population, resulting in poor predictive performance (Hawkins, 2004). In the same sense, there is also the problem of under-fitting, where the constructed model is so simple that it is not able to capture the underlying relationship that might exist in the data.

Choosing a suitable model complexity can be achieved through model selection and assessment. In model selection, given a set of candidate models, the objective is to select the best model according to some criterion. Once the final model has been selected, the estimation of its prediction error on new data will assess the quality of the model. One such assessment method is cross validation.
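To make the over- and under-fitting behaviour concrete, the following is a minimal numerical sketch (not part of the paper; the synthetic data, train/test split and polynomial degrees are all hypothetical choices): as the polynomial degree grows, the training error keeps shrinking while the error on held-out data eventually deteriorates.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: a simple linear trend plus noise, split into
# training and held-out sets.
x = rng.uniform(0.0, 10.0, size=60)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=60)
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

# Increasing the degree keeps lowering the training error, but past
# some point the held-out error grows (over-fitting), while degree 0
# is too simple to capture the trend (under-fitting).
for degree in (0, 1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    mse_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"degree {degree}: train MSE {mse_train:.2f}, test MSE {mse_test:.2f}")
```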
Tackling the model selection problem can be achieved through the use of information criteria. Information criteria are measures of the relative goodness of fit of a statistical model. Akaike proposed that a model should be evaluated in terms of the goodness of its results by assessing the closeness between the predictive distribution defined by the model and the true distribution. The core idea is that introducing parameters to the model yields good results, but having a high degree of freedom will lead to an increase in the instability of the model, producing unreliable results (Konishi and Kitagawa, 2008).

The AIC has been established as one of the most frequently used information criteria for model selection problems, with a wide variety of applications. In human body composition prediction based on bioelectricity measurements, the AIC was included to reduce redundant influence factors (Chen et al., 2016); in cancer research, it was used to develop a prognostic model for patients with germ cell tumours who experienced treatment failure with chemotherapy (Group, 2010); and it has been applied to outlier detection (Lehmann and Lösler, 2016). Also, Carvajal et al. (2016) considered an optimisation approach for model selection using the AIC, incorporating the ℓ0-(pseudo)norm as a penalty function on the log-likelihood function.

1.2 Contribution of this work

The focus of this work is the topic of piecewise linear regression. Such regression methods partition the independent variable into multiple segments, also called regions, and a function is fitted to each one. The boundaries between the different regions are called break points, and identifying the position of the break points is a key task in piecewise regression.

In order to perform piecewise linear regression, the Optimal Piecewise Linear Regression Analysis (OPLRA) model from the literature is used (Yang et al., 2016). The main objective of that model is to receive a multivariate dataset as input, identify a single feature and partition the samples into multiple regions based on this feature. The partitioning of the samples into regions is solved as an optimisation problem, minimising the absolute error of the fitting. The number of regions has to be specified in advance, and the model then decides on the optimal position of the break points and fits a linear function to each region.

Knowing the number of regions in advance, however, is typically never the case, so an iterative approach was proposed by the authors to choose the optimal number of regions: an additional region is added with each iteration and the mathematical model is solved again, until a heuristic convergence rule is satisfied.

This work extends the OPLRA mathematical model and addresses the issue of selecting the appropriate number of regions. In an attempt to avoid over-fitting the data by generating a large number of regions, the Akaike Information Criterion is used. New binary variables that are able to 'activate' regions, so that samples can be allocated to them, are introduced to the mathematical model, and the AIC becomes the objective function of the optimisation problem. Minimising its value will lead to a model that balances complexity with predictive accuracy.

2. MATHEMATICAL FORMULATION

2.1 Akaike Information Criterion

The Akaike Information Criterion is used in this work for model selection. The objective of the AIC is to estimate the amount of information that is lost when a model is used to approximate the true process that generates the data, by using the Kullback-Leibler distance. The criterion calculates the maximised log-likelihood, but Akaike showed that this is a biased term. The bias is approximately equal to the number K of parameters of the tested model, hence this term is introduced to the formulation (Burnham and Anderson, 2003).

Since the criterion is based on the 'difference' between the model and the true underlying mechanism, it is important to understand that the AIC is simply a measure of the relative quality of statistical models and is not useful for assessing a model in absolute terms. As a result, if all of the candidate models are poor, the AIC will select the one that is estimated to be the best, even though this model might not perform well. When it comes to regression analysis, if all the candidate models assume normally distributed errors with a constant variance, then the AIC can be easily computed as (Burnham and Anderson, 2003):

    AIC = n · ln(RSS/n) + 2K    (1)

where:
    RSS  residual sum of squares
    n    number of observations
    K    number of regression parameters

With this formulation, K is the total number of estimated regression parameters, including the intercept. Since the criterion can only be used as a measure of relative quality, the absolute value of the AIC is not important; it is the relative value across candidate models that matters. The criterion chooses as the best model the one with the lowest AIC value. An easy interpretation of the criterion and equation (1) is that the AIC 'rewards' descriptive accuracy via the residuals and 'penalises' model complexity via the number of parameters.
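As a minimal sketch of how equation (1) drives model selection (illustrative only, not the paper's mathematical programming implementation; the synthetic data and polynomial candidate set are hypothetical), the snippet below fits each candidate model by least squares, computes its AIC from the residual sum of squares, and keeps the model with the lowest value:

```python
import numpy as np

def aic(rss: float, n: int, k: int) -> float:
    """Equation (1): AIC = n * ln(RSS / n) + 2K, valid under normally
    distributed errors with constant variance."""
    return n * np.log(rss / n) + 2 * k

# Hypothetical data generated by a quadratic relationship plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
y = 0.5 * x**2 - 2.0 * x + 1.0 + rng.normal(scale=1.0, size=50)

# Candidate models: polynomials of increasing degree. K counts all
# regression parameters, intercept included.
scores = {}
for degree in (1, 2, 3, 5, 8):
    coeffs = np.polyfit(x, y, deg=degree)
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    scores[degree] = aic(rss, n=len(y), k=degree + 1)

best = min(scores, key=scores.get)  # lowest AIC wins
for degree, score in sorted(scores.items()):
    print(f"degree {degree}: AIC = {score:.2f}")
print("selected degree:", best)
```

Higher-degree candidates reduce the residuals but pay the 2K penalty, so the criterion typically settles on the true quadratic here; only the differences between the scores carry meaning, not their absolute values.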
2.2 Mathematical Formulation of OPLRA

In this section, the OPLRA mathematical programming model is described as found in the literature (Yang et al., 2016).

Indices
    s    observation, s = 1, 2, ..., S
    m    independent input variable, m = 1, 2, ..., M
    r    region, r = 1, 2, ..., R
    m*   the variable on which sample partitioning takes place

Parameters
    A_{sm}   numeric value of observation s on variable m
    Y_s      output value of observation s
    U1, U2   arbitrary large positive numbers
    ε        very small number

Positive variables
    X^r_{m*}   break point r on partitioning variable m*
    D_s        training error between predicted output and real output for sample s

Variables

This literature work introduced a heuristic procedure to identify the partitioning feature and then find the optimal number of regions. The partitioning feature is identified by solving the optimisation problem defined above for each of the variables in turn, while fixing the number of regions to 2, and choosing the variable that yields the minimum fitting error.
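The following is a rough numerical sketch of that feature-selection heuristic (an assumption-laden simplification, not the paper's model: a grid search over quantile break points replaces the MILP, and ordinary least squares replaces the absolute-error objective; the data are hypothetical). Each feature is tried as the partitioning variable m* with the number of regions fixed to 2, and the feature with the smallest fitting error is selected:

```python
import numpy as np

def piecewise_fit_error(a, X, y):
    """Fit a 2-region piecewise linear model, splitting samples on the
    values of candidate feature `a`, and return the mean absolute
    training error. Grid search over quantile break points and OLS
    stand in for OPLRA's MILP (illustrative simplification)."""
    best = np.inf
    for bp in np.quantile(a, np.linspace(0.1, 0.9, 17)):
        mask = a <= bp
        if mask.sum() < 2 or (~mask).sum() < 2:
            continue  # keep both regions non-trivial
        abs_err = 0.0
        for region in (mask, ~mask):
            # Each region gets its own linear function of all features
            # plus an intercept, as in the OPLRA setting.
            Xr = np.column_stack([X[region], np.ones(region.sum())])
            coef, *_ = np.linalg.lstsq(Xr, y[region], rcond=None)
            abs_err += np.abs(y[region] - Xr @ coef).sum()
        best = min(best, abs_err / len(y))
    return best

# Hypothetical data: the response has a break point in feature 0.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(100, 3))
y = (np.where(X[:, 0] < 5.0, X[:, 0], 5.0 + 3.0 * (X[:, 0] - 5.0))
     + 0.2 * X[:, 1] + rng.normal(scale=0.3, size=100))

# Heuristic: fix R = 2, try every feature as partitioning variable m*,
# and keep the one with the minimum fitting error.
errors = {m: piecewise_fit_error(X[:, m], X, y) for m in range(X.shape[1])}
m_star = min(errors, key=errors.get)
print("fitting error per feature:", errors)
print("selected partitioning feature m* =", m_star)
```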