Introduction to Nonlinear Regression
Andreas Ruckstuhl
IDP Institut für Datenanalyse und Prozessdesign
ZHAW Zürcher Hochschule für Angewandte Wissenschaften
October 2010∗†
Contents
1. The Nonlinear Regression Model
2. Methodology for Parameter Estimation
3. Approximate Tests and Confidence Intervals
4. More Precise Tests and Confidence Intervals
5. Profile t-Plot and Profile Traces
6. Parameter Transformations
7. Forecasts and Calibration
8. Closing Comments
A. Gauss-Newton Method
∗ The author thanks Werner Stahel for his valuable comments.
† E-Mail Address: [email protected]; Internet: http://www.idp.zhaw.ch
Goals The nonlinear regression model block in the Weiterbildungslehrgang (WBL) in angewandter Statistik at the ETH Zurich should
1. introduce problems that are relevant to the fitting of nonlinear regression functions,
2. present graphical representations for assessing the quality of approximate confidence intervals, and
3. introduce some parts of the statistics software R that can help with solving concrete problems.
1. The Nonlinear Regression Model
a The Regression Model. Regression studies the relationship between a variable of interest Y and one or more explanatory or predictor variables x^(j). The general model is
\[ Y_i = h\big(x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(m)};\ \theta_1, \theta_2, \dots, \theta_p\big) + E_i . \]
Here, h is an appropriate function that depends on the explanatory variables and the parameters, which we summarize in the vectors $x_i = [x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(m)}]^T$ and $\theta = [\theta_1, \theta_2, \dots, \theta_p]^T$. The unstructured deviations from the function h are described by the random errors $E_i$, which are assumed to be normally distributed,
\[ E_i \sim \mathcal{N}(0, \sigma^2), \quad \text{independent.} \]
b The Linear Regression Model. In (multiple) linear regression, functions h are considered that are linear in the parameters θj,
\[ h\big(x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(m)};\ \theta_1, \theta_2, \dots, \theta_p\big) = \theta_1 \widetilde{x}_i^{(1)} + \theta_2 \widetilde{x}_i^{(2)} + \dots + \theta_p \widetilde{x}_i^{(p)} , \]
where the $\widetilde{x}^{(j)}$ can be arbitrary functions of the original explanatory variables $x^{(j)}$. (Here the parameters are usually denoted as βj instead of θj.)

c The Nonlinear Regression Model. In nonlinear regression, functions h are considered that cannot be written as linear in the parameters. Often such a function is derived from theory. In principle, there are unlimited possibilities for describing the deterministic part of the model. As we will see, this flexibility often entails a greater effort to make statistical statements.
Example d Puromycin. The speed of an enzymatic reaction depends on the concentration of a substrate. As reported in Bates and Watts (1988), an experiment examined how treatment of the enzyme with an additional substance called Puromycin influences this reaction speed. The initial speed of the reaction is chosen as the variable of interest; it is measured via radioactivity. (The unit of the variable of interest is counts/min²; the number of registrations on a Geiger counter per time period measures the quantity of the substance present, and the reaction speed is proportional to the change per time unit.)
Figure 1.d: Puromycin example. (a) Data (• treated enzyme; △ untreated enzyme) and (b) typical course of the regression function. [Plot omitted; axes: Concentration vs. Velocity.]
The relationship of the variable of interest with the substrate concentration x (in ppm) is described by the Michaelis-Menten function
\[ h(x; \theta) = \frac{\theta_1 x}{\theta_2 + x} . \]
An infinitely large substrate concentration (x → ∞) results in the "asymptotic" speed θ1. It has been suggested that this quantity is influenced by the addition of Puromycin. The experiment is therefore carried out once with the enzyme treated with Puromycin and once with the untreated enzyme. Figure 1.d shows the result. In this section the data of the treated enzyme is used.
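In R such a fit is obtained with nls(). A minimal sketch, using the Puromycin data set that ships with R (columns conc, rate, state); the start values are rough guesses read off Figure 1.d, and Section 2 discusses more systematic choices:

```r
## Michaelis-Menten fit for the treated enzyme. The Puromycin data
## frame is built into R; the start values are rough guesses from
## Figure 1.d.
treated <- subset(Puromycin, state == "treated")
fit <- nls(rate ~ theta1 * conc / (theta2 + conc),
           data = treated,
           start = list(theta1 = 200, theta2 = 0.1))
summary(fit)
```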
Example e Oxygen Consumption. To determine the biochemical oxygen demand, river water samples were enriched with dissolved organic nutrients, with inorganic materials, and with dissolved oxygen, and were filled into different bottles (Marske, 1967; see Bates and Watts, 1988). Each bottle was then inoculated with a mixed culture of microorganisms and sealed in a climate chamber with constant temperature. The bottles were periodically opened and their dissolved oxygen content was analyzed, from which the biochemical oxygen consumption [mg/l] was calculated. The model used to connect the cumulative biochemical oxygen consumption Y with the incubation time x is based on exponential decay, which leads to
\[ h(x; \theta) = \theta_1 \big(1 - e^{-\theta_2 x}\big) . \]
Figure 1.e shows the data and the regression function to be applied.
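This model can likewise be fitted with nls(). A sketch using R's built-in BOD data set (columns Time and demand, corresponding to this example), here with the self-starting model SSasympOrig so that no start values need to be supplied; SSasympOrig is parameterized as Asym(1 − e^(−e^lrc · x)), so θ1 = Asym and θ2 = e^lrc:

```r
## Exponential saturation fit to R's built-in BOD data (Time, demand).
## SSasympOrig fits Asym * (1 - exp(-exp(lrc) * input)), so
## theta1 = Asym and theta2 = exp(lrc) in the notation above.
fit <- nls(demand ~ SSasympOrig(Time, Asym, lrc), data = BOD)
c(theta1 = coef(fit)[["Asym"]], theta2 = exp(coef(fit)[["lrc"]]))
```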
Example f From Membrane Separation Technology (Rapold-Nydegger, 1994). The ratio of protonated to deprotonated carboxyl groups in the pores of cellulose membranes depends on the pH value x of the outer solution. The protonation of the carboxyl carbon atoms can be captured with ¹³C-NMR. We assume that the relationship can be written with the extended "Henderson-Hasselbach equation" for polyelectrolytes
\[ \log_{10}\!\left(\frac{\theta_1 - y}{y - \theta_2}\right) = \theta_3 + \theta_4 x , \]
Figure 1.e: Oxygen consumption example. (a) Data and (b) typical shape of the regression function. [Plot omitted; axes: Days vs. Oxygen Demand.]
Figure 1.f: Membrane Separation Technology. (a) Data and (b) a typical shape of the regression function. [Plot omitted; axes: x (pH) vs. y (chemical shift).]
where the unknown parameters are θ1, θ2, and θ3 > 0, θ4 < 0. Solving for y leads to the model
\[ Y_i = h(x_i; \theta) + E_i = \frac{\theta_1 + \theta_2 \, 10^{\theta_3 + \theta_4 x_i}}{1 + 10^{\theta_3 + \theta_4 x_i}} + E_i . \]
The regression function h(x_i; θ) for a reasonably chosen θ is shown in Figure 1.f next to the data.
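To check a candidate θ visually, the model curve can be drawn over the data range. A sketch in R; the parameter values below are illustrative guesses read off Figure 1.f, not fitted estimates:

```r
## Curve of the extended Henderson-Hasselbach model. The theta values
## are illustrative guesses read off Figure 1.f, not fitted estimates.
h <- function(x, theta) {
  z <- 10^(theta[3] + theta[4] * x)
  (theta[1] + theta[2] * z) / (1 + z)
}
curve(h(x, theta = c(163.7, 159.5, 2.6, -0.6)), from = 2, to = 12,
      xlab = "x (pH)", ylab = "y (chemical shift)")
```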
g A Few Further Examples of Nonlinear Regression Functions:
• Hill model (enzyme kinetics): $h(x_i; \theta) = \theta_1 x_i^{\theta_3} / (\theta_2 + x_i^{\theta_3})$.
For θ3 = 1 this is also known as the Michaelis-Menten model (1.d).

• Mitscherlich function (growth analysis): $h(x_i; \theta) = \theta_1 + \theta_2 \exp(\theta_3 x_i)$.

• From kinetics (chemistry) we get the function
\[ h\big(x_i^{(1)}, x_i^{(2)}; \theta\big) = \exp\big(-\theta_1 x_i^{(1)} \exp(-\theta_2 / x_i^{(2)})\big) . \]
• Cobb-Douglas production function
\[ h\big(x_i^{(1)}, x_i^{(2)}; \theta\big) = \theta_1 \big(x_i^{(1)}\big)^{\theta_2} \big(x_i^{(2)}\big)^{\theta_3} . \]

Since useful regression functions are often derived from the theory of the application area in question, a general overview of nonlinear regression functions is of limited benefit. A compilation of functions from publications can be found in Appendix 7 of Bates and Watts (1988).

h Linearizable Regression Functions. Some nonlinear regression functions can be linearized by transforming the variable of interest and the explanatory variables. For example, a power function
\[ h(x; \theta) = \theta_1 x^{\theta_2} \]
can be transformed into a function that is linear in the parameters:
\[ \ln h(x; \theta) = \ln \theta_1 + \theta_2 \ln x = \beta_0 + \beta_1 \widetilde{x} , \]
where $\beta_0 = \ln \theta_1$, $\beta_1 = \theta_2$, and $\widetilde{x} = \ln x$. We call the regression function h linearizable if we can transform it into a function that is linear in the (unknown) parameters via transformations of the arguments and a monotone transformation of the result.
Here are some more linearizable functions (also see Daniel and Wood, 1980):
\[
\begin{aligned}
h(x; \theta) &= 1/(\theta_1 + \theta_2 e^{-x}) &\longleftrightarrow\quad& 1/h(x; \theta) = \theta_1 + \theta_2 e^{-x} \\
h(x; \theta) &= \theta_1 x/(\theta_2 + x) &\longleftrightarrow\quad& 1/h(x; \theta) = 1/\theta_1 + (\theta_2/\theta_1) \cdot (1/x) \\
h(x; \theta) &= \theta_1 x^{\theta_2} &\longleftrightarrow\quad& \ln h(x; \theta) = \ln \theta_1 + \theta_2 \ln x \\
h(x; \theta) &= \theta_1 \exp(\theta_2\, g(x)) &\longleftrightarrow\quad& \ln h(x; \theta) = \ln \theta_1 + \theta_2\, g(x) \\
h(x; \theta) &= \exp\big(-\theta_1 x^{(1)} \exp(-\theta_2/x^{(2)})\big) &\longleftrightarrow\quad& \ln(-\ln h(x; \theta)) = \ln \theta_1 + \ln x^{(1)} - \theta_2/x^{(2)} \\
h(x; \theta) &= \theta_1 \big(x^{(1)}\big)^{\theta_2} \big(x^{(2)}\big)^{\theta_3} &\longleftrightarrow\quad& \ln h(x; \theta) = \ln \theta_1 + \theta_2 \ln x^{(1)} + \theta_3 \ln x^{(2)}
\end{aligned}
\]
The last one is the Cobb-Douglas model from 1.g.
i The Statistically Complete Model. A linear regression with the linearized regression function of the example above is based on the model
\[ \ln Y_i = \beta_0 + \beta_1 \widetilde{x}_i + E_i , \]
where the random errors $E_i$ all have the same normal distribution. If we transform this model back, we get
\[ Y_i = \theta_1 x^{\theta_2} \cdot \widetilde{E}_i \]
with $\widetilde{E}_i = \exp(E_i)$. The errors $\widetilde{E}_i$, i = 1, ..., n, now contribute multiplicatively and are lognormally distributed! The assumptions about the random deviations are thus drastically different from those for a model based directly on h,
\[ Y_i = \theta_1 x^{\theta_2} + E_i^* \]
with random deviations $E_i^*$ that, as usual, contribute additively and have a specific normal distribution.
A linearization of the regression function is therefore advisable only if the assumptions about the random deviations can be better satisfied; in our example, if the errors actually act multiplicatively rather than additively and are lognormally rather than normally distributed. These assumptions must be checked with residual analysis.
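A small simulated example in R may make the contrast concrete. The data-generating values below are assumptions chosen for illustration; the point is that each fit corresponds to one of the two error models:

```r
## Simulated power-law data with multiplicative lognormal errors
## (theta1 = 2, theta2 = 1.5, chosen only for illustration).
set.seed(1)
x <- seq(0.5, 5, length.out = 40)
y <- 2 * x^1.5 * exp(rnorm(40, sd = 0.1))

## Fit 1: linearized model, appropriate for multiplicative errors.
fit.log <- lm(log(y) ~ log(x))
start <- list(theta1 = exp(coef(fit.log)[[1]]),
              theta2 = coef(fit.log)[[2]])

## Fit 2: direct nonlinear fit, appropriate for additive errors;
## the linearized fit conveniently supplies the start values.
fit.nls <- nls(y ~ theta1 * x^theta2, start = start)

## Residual plots of both fits indicate which error model is adequate.
```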
j * Note: In linear regression it has been shown that the variance can be stabilized with certain transformations (e.g. log, square root). If this is not possible, in certain circumstances one can also perform a weighted linear regression. The process is analogous in nonlinear regression.

k The introductory examples so far: We have spoken almost exclusively of regression functions that depend on only one original variable. This was primarily because it was possible to illustrate the model fully in graphical form. The theory that follows also works well for regression functions h(x; θ) that depend on several explanatory variables x = [x^(1), x^(2), ..., x^(m)].
2. Methodology for Parameter Estimation

a The Principle of Least Squares. To obtain estimates for the parameters $\theta = [\theta_1, \theta_2, \dots, \theta_p]^T$, one applies, as in linear regression, the principle of least squares. The sum of squared deviations
\[ S(\theta) := \sum_{i=1}^{n} \big(y_i - \eta_i(\theta)\big)^2 , \quad \text{with } \eta_i(\theta) := h(x_i; \theta) , \]
should thus be minimized. The notation in which $h(x_i; \theta)$ is replaced by $\eta_i(\theta)$ is reasonable because $[x_i, y_i]$ is given by the measurement or observation of the data and only the parameters θ remain to be determined. Unfortunately, the minimum of the sum of squares, and thus the estimate, cannot be given explicitly as in linear regression; iterative numerical procedures are needed. The basic ideas behind the most common algorithm are sketched below. They also form the basis for the simplest way to derive tests and confidence intervals.
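To make the least-squares principle explicit, S(θ) can be written down directly in R and handed to a general-purpose minimizer. A sketch for the treated Puromycin data; in practice one would use nls(), and the start values are rough guesses:

```r
## S(theta) for the Michaelis-Menten model on the treated Puromycin
## data, minimized with a general-purpose optimizer. This only makes
## the least-squares principle explicit; nls() is the standard tool.
treated <- subset(Puromycin, state == "treated")
S <- function(theta) {
  eta <- theta[1] * treated$conc / (theta[2] + treated$conc)
  sum((treated$rate - eta)^2)
}
optim(c(theta1 = 200, theta2 = 0.1), S)$par   # start: guess from Fig. 1.d
```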
b Geometric Illustration. The observed values $Y = [Y_1, Y_2, \dots, Y_n]^T$ determine a point in n-dimensional space. The same holds for the "model values" $\eta(\theta) = [\eta_1(\theta), \eta_2(\theta), \dots, \eta_n(\theta)]^T$ for a given θ.

Take note! The usual geometric representation of data that is standard in, for example, multivariate statistics considers the observations, given by m variables x^(j), j = 1, 2, ..., m, as points in m-dimensional space. Here, though, we consider the Y- and η-values of all n observations as points in n-dimensional space. Unfortunately our geometric intuition stops at three dimensions, and thus at three observations. So we try it for a situation limited in this way, first for simple linear regression.

As stated, the observed values $Y = [Y_1, Y_2, Y_3]^T$ determine a point in 3-dimensional space. For given parameters β0 = 5 and β1 = 1 we can calculate the model values $\eta_i(\beta) = \beta_0 + \beta_1 x_i$ and represent the corresponding vector $\eta(\beta) = \beta_0 1 + \beta_1 x$ as a point. We now ask where all points lie that can be reached by varying the parameters. These are the possible linear combinations of the two vectors 1 and x, and thus form the
plane "spanned by 1 and x". In estimating the parameters according to the principle of least squares, geometrically speaking, the squared distance between Y and η(β) is minimized. So we seek the point on the plane that has the smallest distance to Y. This is also called the projection of Y onto the plane. The parameter values corresponding to this point $\widehat{\eta}$ are then the estimated parameter values $\widehat{\beta} = [\widehat{\beta}_0, \widehat{\beta}_1]^T$.
Now a nonlinear function, e.g. $h(x; \theta) = \theta_1 \exp(1 - \theta_2 x)$, should be fitted to the same three observations. We can again ask where all points η(θ) lie that can be reached by varying the parameters θ1 and θ2. They lie on a two-dimensional curved surface (called the model surface in the following) in three-dimensional space. The estimation problem again consists of finding the point $\widehat{\eta}$ on the model surface that lies nearest to Y. The parameter values corresponding to this point $\widehat{\eta}$ are then the estimated parameter values $\widehat{\theta} = [\widehat{\theta}_1, \widehat{\theta}_2]^T$.

c Solution Approach for the Minimization Problem. The main idea of the usual algorithm for minimizing the sum of squared deviations (see 2.a) is as follows: if a preliminary best value $\theta^{(\ell)}$ exists, we approximate the model surface by the plane that touches the surface at the point $\eta(\theta^{(\ell)}) = h(x; \theta^{(\ell)})$. Now we seek the point on this plane that lies closest to Y. This amounts to estimation in a linear regression problem. The new point lies on the plane but not on the surface corresponding to the nonlinear problem. However, it determines a parameter vector $\theta^{(\ell+1)}$, and with this we go into the next round of the iteration.

d Linear Approximation. To determine the approximating plane, we need the partial derivatives
\[ A_i^{(j)}(\theta) := \frac{\partial \eta_i(\theta)}{\partial \theta_j} , \]
which we can collect in an n × p matrix A. The approximation of the model surface η(θ) by the "tangent plane" at a parameter value θ* is
\[ \eta_i(\theta) \approx \eta_i(\theta^*) + A_i^{(1)}(\theta^*)\,(\theta_1 - \theta_1^*) + \dots + A_i^{(p)}(\theta^*)\,(\theta_p - \theta_p^*) \]
or, in matrix notation,
\[ \eta(\theta) \approx \eta(\theta^*) + A(\theta^*)\,(\theta - \theta^*) . \]
If we now add back in the random error, we get a linear regression model
\[ \widetilde{Y} = A(\theta^*)\, \beta + E \]
with the "preliminary residuals" $\widetilde{Y}_i = Y_i - \eta_i(\theta^*)$ as the variable of interest, the columns of A as regressors, and the coefficients $\beta_j = \theta_j - \theta_j^*$ (a model without intercept β0).
e Gauss-Newton Algorithm. The Gauss-Newton algorithm starts with an initial value $\theta^{(0)}$ for θ, solves the just-introduced linear regression problem for $\theta^* = \theta^{(0)}$ to find a correction $\widehat{\beta}$, and from this obtains an improved value $\theta^{(1)} = \theta^{(0)} + \widehat{\beta}$. For this, again, the approximated model is calculated, i.e. the "preliminary residuals" $Y - \eta(\theta^{(1)})$ and the partial derivatives $A(\theta^{(1)})$ are determined, and this yields $\theta^{(2)}$. This iteration step is repeated until the correction $\widehat{\beta}$ is negligible. (Further details can be found in Appendix A.)

It cannot be guaranteed that this procedure actually finds the minimum of the sum of squares. The chances are better the better the p-dimensional model surface can be locally approximated by a p-dimensional "plane" at the minimum $\widehat{\theta} = (\widehat{\theta}_1, \dots, \widehat{\theta}_p)^T$, and the closer the initial value $\theta^{(0)}$ is to the solution being sought.
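The following is a deliberately bare-bones sketch of this iteration in R, with a numeric forward-difference Jacobian and no step-size control, again for the Michaelis-Menten model and the treated Puromycin data; it is meant to illustrate the mechanism, not to replace nls():

```r
## Bare-bones Gauss-Newton iteration: numeric forward-difference
## Jacobian, no step-size control. Illustration only.
treated <- subset(Puromycin, state == "treated")
eta <- function(theta) theta[1] * treated$conc / (theta[2] + treated$conc)

gauss.newton <- function(theta, tol = 1e-8, maxit = 50) {
  for (l in seq_len(maxit)) {
    r <- treated$rate - eta(theta)                # "preliminary residuals"
    A <- sapply(seq_along(theta), function(j) {   # numeric Jacobian
      d <- 1e-6 * max(1, abs(theta[j]))
      th <- theta; th[j] <- th[j] + d
      (eta(th) - eta(theta)) / d
    })
    beta <- qr.coef(qr(A), r)                     # linear LS correction
    theta <- theta + beta
    if (sum(beta^2) < tol^2) break                # correction negligible
  }
  theta
}
gauss.newton(c(theta1 = 200, theta2 = 0.1))
```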
* Algorithms conveniently determine the derivative matrix A numerically. In more complex problems the numerical approximation can be insufficient and cause convergence problems. It is then advantageous if expressions for the partial derivatives can be arrived at analytically. With these, the derivative matrix can be determined reliably and the procedure is more likely to converge (see also Chapter 6).

f Initial Values. An iterative procedure requires a starting value in order to be applied at all. Good starting values help the iterative procedure to find a solution more quickly and more surely. Some possibilities for obtaining them more or less easily are briefly presented here.

g Initial Value from Prior Knowledge. As already noted in the introduction, nonlinear models are often based on theoretical considerations from the application area in question. Prior knowledge from similar experiments can be used to obtain an initial value. To be sure that the chosen start value fits, it is advisable to represent the regression function h(x; θ) graphically for various possible starting values θ = θ⁰ together with the data (e.g., as in Figure 2.h, right).

h Start Values via Linearizable Regression Functions. Often, because of the distribution of the errors, one is forced to stay with the nonlinear form of a model with a linearizable regression function. However, the linearized model can deliver starting values. In the Puromycin example the regression function is linearizable: the reciprocal values of the two variables fulfill
\[ \widetilde{y} = \frac{1}{y} \approx \frac{1}{h(x; \theta)} = \frac{1}{\theta_1} + \frac{\theta_2}{\theta_1} \cdot \frac{1}{x} = \beta_0 + \beta_1 \widetilde{x} . \]
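In R, this linearized fit and the back-transformation to starting values might look as follows (a sketch using the built-in Puromycin data):

```r
## Start values from the linearized (reciprocal) form
## 1/y ≈ 1/theta1 + (theta2/theta1) * (1/x), Puromycin data.
treated <- subset(Puromycin, state == "treated")
lin <- lm(I(1/rate) ~ I(1/conc), data = treated)
b <- coef(lin)
start <- c(theta1 = 1 / b[[1]], theta2 = b[[2]] / b[[1]])
start  # usable as start = as.list(start) in nls()
```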