Model Selection, Transformations and Variance Estimation in Nonlinear Regression

Olaf Bunke(1), Bernd Droge(1) and Jörg Polzehl(2)

(1) Institut für Mathematik, Humboldt-Universität zu Berlin, PSF 1297, D-10099 Berlin, Germany
(2) Konrad-Zuse-Zentrum für Informationstechnik, Heilbronner Str. 10, D-10711 Berlin, Germany

Abstract

The results of analyzing experimental data using a parametric model may depend heavily on the chosen model. In this paper we propose procedures for the adequate selection of nonlinear regression models if the intended use of the model is among the following: 1. prediction of future values of the response variable, 2. estimation of the unknown regression function, 3. calibration, or 4. estimation of some parameter with a certain meaning in the corresponding field of application. Moreover, we propose procedures for variance modelling and for selecting an appropriate nonlinear transformation of the observations, which may lead to improved accuracy. We show how to assess the accuracy of the parameter estimators by a "moment oriented bootstrap procedure". This procedure may also be used for the construction of confidence, prediction and calibration intervals. Programs written in Splus which realize our strategy for nonlinear regression modelling and parameter estimation are described as well. The performance of the selected model is discussed, and the behaviour of the procedures is illustrated by examples.

Key words: Nonlinear regression, model selection, bootstrap, cross-validation, variable transformation, variance modelling, calibration, mean squared error for prediction, computing in nonlinear regression.

AMS 1991 subject classifications: 62J99, 62J02, 62P10.

1 Selection of regression models

1.1 Preliminary discussion

Many papers and books discuss how to analyse experimental data by estimating the parameters in a linear or nonlinear regression model; see e.g. Bunke and Bunke [2], [3], Seber and Wild [15] or Huet, Bouvier, Gruet and Jolivet [11]. In the usual situation it is not known whether a certain regression model describes the unknown true regression function sufficiently well, and the results of the statistical analysis may depend heavily on the chosen model. Therefore the model should be selected carefully, based both on scientific and practical experience in the corresponding field of application and on statistical procedures. Moreover, a nonlinear transformation of the observations or an appropriate model of the variance structure can lead to improved accuracy.

In this paper we propose procedures for the adequate selection of regression models. This includes choosing suitable variance models and transformations. Additionally, we present methods to assess the accuracy of estimates, calibrations and predictions based on the selected model.

The general framework of this paper is as follows. We consider possibly replicated observations of a response variable Y at fixed values of explanatory variables (or nonrandom design points) x_1, ..., x_k, which follow the model

    Y_ij = f(x_i) + ε_ij,   i = 1, ..., k,   j = 1, ..., n_i,   Σ_{i=1}^k n_i = n.        (1.1)

In (1.1) the regression function f is unknown and real-valued, and the ε_ij are assumed to be uncorrelated random errors with zero mean and positive variances σ_i² = σ²(x_i). The usual assumption of a homogeneous error variance σ_i² = σ² is unrealistic in many applications, and one is often confronted with heteroscedasticity problems.
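To make the sampling scheme (1.1) concrete, the following minimal sketch in R (an S dialect, in the spirit of the Splus programs described later) generates replicated heteroscedastic observations. The particular regression function f and variance function σ²(x) are purely illustrative assumptions, not taken from this paper.

set.seed(1)
x  <- c(1, 2, 4, 8, 16)                       # design points x_1, ..., x_k
ni <- c(3, 3, 2, 2, 2)                        # replications n_1, ..., n_k
f      <- function(x) 10 - 8 * exp(-0.3 * x)  # hypothetical true regression function
sigma2 <- function(x) 0.05 * f(x)^2           # hypothetical variance function sigma^2(x)
xi <- rep(x, ni)                              # each x_i repeated n_i times (n = sum(ni))
Y  <- f(xi) + rnorm(sum(ni), sd = sqrt(sigma2(xi)))   # Y_ij = f(x_i) + eps_ij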
The analysis of the data requires in general the estimation of the regression function, which describes the dependence of the response variable on the explanatory variables. This is usually done by assuming that this dependence may be described by a parametric model

    {f(·, ϑ) | ϑ ∈ Θ},        (1.2)

where the function f(·, ϑ) is known up to a p-dimensional parameter ϑ ∈ Θ ⊆ R^p, so that the problem reduces to estimating this parameter from the data. As an estimate of the parameter we will use the ordinary least squares estimator (OLSE) ϑ̂, which is the minimizer of the sum of squares

    S(ϑ) = Σ_{i=1}^k Σ_{j=1}^{n_i} (Y_ij − f(x_i, ϑ))²        (1.3)

with respect to ϑ. In practice this is the most popular approach to estimating ϑ. Weighted LSE will be discussed later in Section 2. To simplify the presentation in what follows, we will preliminarily restrict ourselves to the case of real-valued design points, that is, we will assume to have only one explanatory variable. Note that the model (1.2) is called linear if it depends linearly on ϑ; otherwise it is called nonlinear. Our focus will be on the latter case.

A starting point in an approach to model selection is the idea that, even if a certain regression model is believed to be convenient for given experimental data, either because of theoretical reasoning or based on past experience with similar data, there is seldom sure evidence of the validity of a model of the form (1.2). Therefore a possible modification of the model could lead to a better fit or to more accurate estimates of the interesting parameters. A parameter with a certain physical or biological meaning may often be represented in different regression models as a function of the corresponding parameters. This will be the case if the parameter γ of interest is a function

    γ = γ[f(x_1), ..., f(x_k)]        (1.4)

of the values of the regression function at the values x_1, ..., x_k of its argument. Examples are one of these values, e.g. γ = f(x_1); the linear slope (or growth rate)

    γ = Σ_{i=1}^k f(x_i)(x_i − x̄) / Σ_{i=1}^k (x_i − x̄)²,   with  x̄ = (1/k) Σ_{i=1}^k x_i;        (1.5)

or the approximated (trapezoidal-rule) area under the curve

    γ = (1/2) Σ_{i=1}^{k−1} (f(x_{i+1}) + f(x_i))(x_{i+1} − x_i),        (1.6)

which is used to characterize rate and extent of drug absorption in pharmacokinetic studies.

Alternatively, theoretical reasoning as well as experience may lead to several models which preliminarily seem to be equally adequate, so that the ultimate choice of a specific model among them is left open for the analysis of the data.

1.2 Examples

The following examples illustrate the above discussion by describing situations with different objectives in analyzing the data.

Example 1 (Pasture regrowth). Ratkowsky [12] considered data (see Table 1) describing the dependence of the pasture regrowth yield on the time since the last grazing. The objective of the analysis is to model this dependence.

Table 1: Data of yield of pasture regrowth versus time

Time after pasture    9      14     21     28     42     57     63     70     79
Yield                 8.93   10.8   18.59  22.33  39.35  56.11  61.73  64.62  67.08

A careful investigation of a graph of the data (see Figure 1) suggested that the process produces a sigmoidally shaped curve, as is provided by the Weibull model

    f(x, ϑ) = ϑ_1 − ϑ_2 exp(−ϑ_3 x^{ϑ_4}).        (1.7)
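As an aside, a fit of the Weibull model (1.7) to the data of Table 1 by ordinary least squares, cf. (1.3), can be sketched in R with nls(); this is a minimal illustration, not the authors' Splus implementation, and the ad-hoc starting values below may need tuning.

time  <- c(9, 14, 21, 28, 42, 57, 63, 70, 79)                       # Table 1
yield <- c(8.93, 10.8, 18.59, 22.33, 39.35, 56.11, 61.73, 64.62, 67.08)
fit <- nls(yield ~ t1 - t2 * exp(-t3 * time^t4),                    # Weibull model (1.7)
           start = list(t1 = 70, t2 = 60, t3 = 1e-4, t4 = 2))       # ad-hoc starting values
coef(fit)                            # OLSE of (theta_1, ..., theta_4)
sum(resid(fit)^2) / length(yield)    # (1/n) S(theta-hat), cf. Table 2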
But there are other models of sigmoidal form which could fit data from growth experiments equally well, e.g.

    f(x, ϑ) = ϑ_1 exp[ϑ_2 / (x + ϑ_3)]        (1.8)

or

    f(x, ϑ) = ϑ_1 exp[−exp(ϑ_2 − ϑ_3 x)]        (1.9)

(three-parameter Gompertz model) or

    f(x, ϑ) = ϑ_1 + ϑ_2 / (1 + exp(ϑ_3 − ϑ_4 x))        (1.10)

(four-parameter logistic model). A listing and a description of even more alternative sigmoidal nonlinear models is given in Ratkowsky [13] (see also Seber and Wild [15] or Ross [14]).

The regression curves corresponding to the alternative models (1.8), (1.9) and (1.10) fit the observations differently well, as may be seen from the corresponding values of the sum of squared errors (1.3) in Table 2 or from the plot of the fitted models in Figure 1. Notice that the fitted curves of models (1.7) and (1.10) are hardly distinguishable in Figure 1.

Table 2: Sum of squared errors for alternative sigmoidal models in Example 1

model         (1.7)   (1.8)   (1.9)   (1.10)
(1/n) S(ϑ̂)    0.93    4.65    2.42    0.80

[Figure 1: Pasture regrowth: observed yield and fitted regression curves for models (1.7)-(1.10); yield plotted against time after pasture.]

Example 2 (Radioimmunological assay of cortisol: Calibration). In calibration experiments one first takes a training (calibration) sample and fits a model to the data, providing a calibration curve. However, in contrast to Example 1, the real interest lies here in estimating an unknown value of the explanatory variable corresponding to a value of the response variable which is independent of the training sample and may easily be measured. This is usually done by inverting the calibration curve.

We will illustrate the procedure by considering the radioimmunological assay (RIA) of cortisol. The corresponding data from the laboratoire de physiologie de la lactation (INRA) are reported in Huet, Bouvier, Gruet and Jolivet [11] and are reproduced in Table 3. The response variable is the quantity of a link complex of antibody and radioactive hormone, while the explanatory variable is the dose of some hormone.

Table 3: Data for the calibration problem of a RIA of cortisol

Dose in ng/.1 ml    Response in counts per minute
0                   2868  2785  2849  2805  2779  2588  2701  2752
0.02                2615  2651  2506  2498
0.04                2474  2573  2378  2494
0.06                2152  2307  2101  2216
0.08                2114  2052  2016  2030
0.1                 1862  1935  1800  1871
0.2                 1364  1412  1377  1304
0.4                  910   919   855   875
0.6                  702   701   689   696
0.8                  586   596   561   562
1                    501   495   478   493
1.5                  392   358   399   394
2                    330   351   343   333
4                    250   261   244   242
∞                    131   135   134   133
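The inversion idea can again be sketched in R. The following fits the logistic model (1.10) to the Table 3 data on the log10(dose) scale and inverts the fitted curve at a new response y0; it is a simplified illustration only, not the calibration procedure developed later in this paper. The zero and infinite-dose controls are omitted, and the starting values and y0 are illustrative assumptions.

dose <- rep(c(0.02, 0.04, 0.06, 0.08, 0.1, 0.2, 0.4,
              0.6, 0.8, 1, 1.5, 2, 4), each = 4)        # finite nonzero doses, Table 3
counts <- c(2615, 2651, 2506, 2498,  2474, 2573, 2378, 2494,
            2152, 2307, 2101, 2216,  2114, 2052, 2016, 2030,
            1862, 1935, 1800, 1871,  1364, 1412, 1377, 1304,
             910,  919,  855,  875,   702,  701,  689,  696,
             586,  596,  561,  562,   501,  495,  478,  493,
             392,  358,  399,  394,   330,  351,  343,  333,
             250,  261,  244,  242)
lx  <- log10(dose)                                      # calibration curve on the log-dose scale
cal <- nls(counts ~ t1 + t2 / (1 + exp(t3 - t4 * lx)),  # logistic model (1.10)
           start = list(t1 = 2700, t2 = -2600, t3 = -2, t4 = 3))
th <- coef(cal)
y0 <- 2000                                              # new, independently measured response
x0 <- (th["t3"] - log(th["t2"] / (y0 - th["t1"]) - 1)) / th["t4"]   # invert (1.10) at y0
10^x0                                                   # estimated dose in ng/.1 ml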