A Fuzzy-Statistical Tolerance Interval from Residuals of Crisp Linear Regression Models
Total Page:16
File Type:pdf, Size:1020Kb
mathematics Article A Fuzzy-Statistical Tolerance Interval from Residuals of Crisp Linear Regression Models Maryam Al-Kandari 1,*, Kingsley Adjenughwure 2 and Kyriakos Papadopoulos 1 1 Department of Mathematics, Kuwait University, P.O. Box 5969, Safat 13060, Khaldiyah City, Kuwait; [email protected] 2 Department of Civil Engineering, Democritus University of Thrace, 67100 Xanthi, Greece; [email protected] * Correspondence: [email protected] Received: 10 August 2020; Accepted: 21 August 2020; Published: 25 August 2020 Abstract: Linear regression is a simple but powerful tool for prediction. However, it still suffers from some deficiencies, which are related to the assumptions made when using a model like normality of residuals, uncorrelated errors, where the mean of residuals should be zero. Sometimes these assumptions are violated or partially violated, thereby leading to uncertainties or unreliability in the predictions. This paper introduces a new method to account for uncertainty in the residuals of a linear regression model. First, the error in the estimation of the dependent variable is calculated and transformed to a fuzzy number, and this fuzzy error is then added to the original crisp prediction, thereby resulting in a fuzzy prediction. The results are compared to a fuzzy linear regression with crisp input and fuzzy output, in terms of their ability to represent uncertainty in prediction. Keywords: tolerance interval; fuzzy linear regression; crisp linear regression; fuzzy-statistics 1. Introduction In classical linear regression models, assumptions like linearity, fixed independent variables, normality of residuals, uncorrelated errors are made to simplify model estimation procedures. Despite these assumptions, the results are often taken at face value with very little effort, to adequately represent the uncertainty in predictions made by the model. Uncertainties in linear regression models are often represented via confidence and prediction intervals, which may not be adequate, since only one interval is calculated; e.g., 95% confidence or the prediction interval. Fuzzy linear regression with crisp input can be used to better represent uncertainty in prediction. This fuzzy model, first proposed in [1], has been widely used as alternative to classical crisp linear regression models. Since then, there have been various modifications to the model to overcome certain limitations of the original model [2–5]. Although such fuzzy linear regression models can represent uncertainty, there has always been doubt in terms of their suitability for prediction of future values (see a discussion on this in [6]). The problem is that minimization of fuzziness of the model is done, such that model fits the available sample with a certain h-value. There is no connection with prediction of future values like those of classic regression. Recently, there has been an attempt by [7] to create fuzzy numbers from predictions made by classic linear regression models, without the need of optimization or assuming any fuzzy coefficient, or fuzzy input or output. In their work, confidence and prediction intervals from crisp linear regression are converted to fuzzy numbers, by superimposing intervals and deriving the equivalent membership functions using fuzzy estimators. In this paper, we use the technique that was introduced in [7], and we propose a new approach to fuzzify the outputs of crisp linear regression models, for use in the case of tolerance intervals that are required, instead of prediction intervals. In our proposed approach, we assume that errors are normally distributed and, as such, we can construct a tolerance interval Mathematics 2020, 8, 1422; doi:10.3390/math8091422 www.mdpi.com/journal/mathematics Mathematics 2020, 8, 1422 2 of 10 of normal distribution for the errors. This tolerance interval will contain at least a proportion of the errors, both those in the sample and those outside of the sample (future predictions). Using the method proposed in [8], we construct a fuzzy number by superimposing the tolerance intervals up to the mean error. We then use this fuzzy tolerance interval as a fuzzy estimate of the error in our model. This is similar to the error proposed in [9], which uses crisp coefficients, and estimates a fuzzy error using optimization. This is to avoid the issue of the increasing magnitude of spread with an increasing independent variable. To complete the process, we add the fuzzy tolerance error to our crisp estimate, thereby resulting in a fuzzy estimate. The advantage of the proposed approach is that all possible statistical errors in the model are represented in one interval. Statistical tolerance intervals are, by definition, different from prediction intervals, and they serve a different purpose. Tolerance intervals give the percentage of population coverage interval with some confidence level, while a prediction interval will give the coverage interval for a single prediction. The interpretation and calculation of both intervals are also different. This paper thus extends the work in [7,8], to produce fuzzy statistical tolerance intervals. The method proposed here is important for applications where a tolerance interval is needed, instead of a prediction interval. The tolerance interval covers both the confidence interval and the prediction interval. Thus, both in-sample, out-of-sample, and future errors are covered. It also uses the information that model errors are normally distributed, so the better the original crisp model, the better the fuzzy model. Finally, the proposed method produces a fuzzy output with only crisp input, and crisp parameters without the need for optimization. 2. Crisp Linear Regression A linear regression model with n independent variables and one dependent variable can be written as: Yk = α0 + α1Xk1+α2Xk2+ ::: + αnXkn+"k where Yk is the dependent variable, Xki, i = 1, 2, ::: , n are the independent variables, α0, α1, ::: ,αn are the coefficients which need to be estimated, and "k is the random error of the model. The linear regression model assumes that errors are normally distributed with zero mean and constant variance, i.e., " N(0, σ). k ∼ The method of least squares is the most common way of estimating the model parameters. The parameters are calculated as: 1 A = (XTX)− XTY where A is a vector of the parameters, X is a matrix of explanatory variables, and Y is a vector of the response variable. A thorough examination of the distribution of errors is usually done after model estimation, to check the validity of the model. To account for uncertainty in prediction due to random errors, prediction and confidence intervals are easily constructed both for the estimated parameters and for the predicted response. The (1 α)% − prediction intervals are given as follows: q Yk t1 α ,K p MSE 1 + h f ± − 2 − where X is a matrix of explanatory variables, Y is the estimated response variable, K is the sample size, MSE is the estimate of the mean-squared error of the model, t α is a t-distribution with K p 1 2 ,K p 1 − − − T − degrees of freedom, and h f = x f X X x f , where x f is the row-vector of that observation. 3. Fuzzy Linear Regression The classical fuzzy linear regression model proposed by [1] is similar to the crisp linear regression. The main difference is that the parameters are fuzzy numbers. This results in a fuzzy output for the response variable. The model is given below: Mathematics 2020, 8, 1422 3 of 10 Yfk = Ae0 + Ae1Xk1 + Ae2Xk2 + ::: + AenXkn where Aei(i = 1, 2, ::: , n) are symmetric triangular fuzzy numbers of the form (ci, ri), ci is the center of the triangular fuzzy number and ri is the spread. The objective of the fuzzy linear regression model is to minimize the uncertainty by minimizing the spreads of the fuzzy numbers. This results in the linear optimization problem below [1]: XK Xn J = mc0 + ci xij j=1 i=1 with the following constraints: 0 1 Xn Xn B C yj r0 + rixij (1 h)Bc0 + ci xij C ≥ − − @B AC i=1 i=1 0 1 Xn Xn B C yj r0 + rixij + (1 h)Bc0 + ci xij C ≤ − @B AC i=1 i=1 where c 0, i = 1, 2, ::: , n. The value 0 h 1 represents the confidence level of the model and the i ≥ ≤ ≤ membership value of all all responses in the sample should be at least h i.e., µ y h for j = 1, 2, ::: , K. j ≥ 4. Proposed Method Suppose that we have estimated a linear regression model and checked its validity with the necessary residual plots. That is, we assume that the model is well calibrated, and the distribution of errors are approximately normal. From every observation, our model produces an error: " = y yˆ k k − k where yk is the real value of the response variable and yˆk is the estimate from the model. We assume that all errors in the model come from a normal distribution, with an unknown mean and unknown standard deviation. To accommodate the uncertainty in our errors, we can construct confidence intervals for the mean, or standard deviation of errors. However, this does not give us a bound on all errors that the model can produce, but only a bound for the mean. Additionally, a prediction interval can only hold for a particular prediction and, thus, it is not valid for other predictions. To accommodate both sample errors and future prediction errors, we opt to use a tolerance interval to bound the errors; an interval which contains p% of all errors with confidence γ%. To simplify calculations, we do not focus on tolerance intervals for Y given a particular X (see for example [10]). Rather, we focus on calculating a general tolerance interval of a random sample originating from a normal distribution. We treat all errors in our model estimation as a random sample and try to find a tolerance interval for such errors.