Experimental Sequential Designs for Logistic Regression Models

Revista Colombiana de Estadística Diciembre 2008, volumen 31, no. 2, pp. 261 a 291 Experimental Sequential Designs for Logistic Regression Models Diseños experimentales secuenciales para modelos logísticos de regresión Arturo T. De Zana Departamento de Ingeniería, Facultad de Ingeniería, Universidad de la Sabana, Chía, Colombia Abstract When the usual hypotheses of normality and constant variance do not hold (e.g. in binomial or Bernoulli processes), the problem of choosing ap- propriate designs creates problems to researches when pursuing a sequential exploration of process. This paper is based on De Zan (2006), where the au- thor proposes two criteria to evaluate design strategies, that take the amount of information as the main evaluation tool. One into account the information of the fitted model, and the other explores the information that is contained on the approximation of a set of the best conditions of factors found on a fitted model. An example of how these strategies work is also given through a simulation using R software. Key words: Factorial Design, Response Surface Design, Sequential Design of Experiments, Generalized Linear Model, Logistic Regression, Fisher In- formation Matrix. Resumen Cuando los supuestos habituales de normalidad y varianza constante no se cumplen (e.g. en procesos de Bernoulli o binomiales), el problema de la elección de diseños adecuados ocasiona cierta dificultad a los experimenta- dores, especialmente cuando lo que se persigue es una exploración secuencial del proceso. Este artículo está basado en De Zan (2006), en donde se pro- ponen dos criterios para evaluar estrategias de diseño. Una de ellas toma en cuenta la cantidad de información contenida en el modelo ajustado, mientras que la otra explora la información contenida en las mejores condiciones de experimentación encontradas en el modelo ajustado. Se desarrolla un ejem- plo simulado con el paquete R acerca de cómo funcionan estas estrategias. Palabras clave: diseño factorial, metodología de superficie de respuesta, diseño de experimentos secuenciales, modelo lineal generalizado, regresión logística, matriz de información de Fisher. aAssistant Professor. E-mail: [email protected] 261 262 Arturo T. De Zan 1. Introduction 1.1. Classical Response Surface Methodology Since the very seminal work of Box & Wilson (1951), and for more than five decades, many researchers and practitioners of Response Surface Methodology (RSM) have widely contributed to the natural development of this matter. RSM has achieved special acceptance within industrial applications, quality improve- ment of processes and industrial processes in general, among others. Four complete references that reinforce the extensive use of these applications are: Box & Draper (1987), Box & Draper (2007), Khuri & Cornell (1996) and Myers & Montgomery (2002). Some of the most recent updated aspects on this topic are also described in Myers (1999), Myers et al. (2004) and Atkinson (2006). Future lines of research are well detailed there, as well. Although RSM problems have been developed in a very cohesive manner complete way, the scope of the majority of contributions are focused on the normal linear model1. The study of processes with abnormal distributions in the context of RSM is a relatively new problem, especially when some considerations related to the choosing of “good” designs arise. There are many contributions whose focus is on the analysis of experiments –strictly speaking– while the design does not seem to reach the same level of development (for instance Myers & Montgomery (2002), Myers et al. (2001a), Myers et al. (2004) and Khuri (2006) for more details). 1.2. Non-normal Models and GLMs When the classical hypotheses of statistical modeling do not apply, especially when the usual assumptions of normality and constant variance do not hold, many problems for which there are several solutions from the point of view of experimental designs arise. One of the first attempts to cope with this departure from the usual hypotheses is to transform the data (e.g. using Box-Cox family of transformations). These transformations work adequately when a specific hypothesis is tested, in the sense that a transformed variable could fit better to the classical approach of design and analysis than if it were not transformed. Nevertheless, some authors point out that there is no one transformation that can cope completely with all of the deviations from the usual scope at the same time. Also, it is said that sometimes many transformations do not make sense from a physical point of view, or may carry on some additional problems when analyzed statistically. For further details see Montgomery (2005) and Myers et al. (2001b). The seminal work of Nelder & Wedderburn (1972) was the beginning of a new well-integrated approach for the problem of non-normal distributions, namely the Generalized Linear Models (GLMs). Several authors have later contributed to extend the applications of these models. Among the most referenced works there 1See for example Khuri & Mukhopadhyay (2006). Revista Colombiana de Estadística 31 (2008) 261–291 Experimental Sequential Designs for Logistic Regression Models 263 are: Lindsey (1997), McCullagh & Nelder (1989), Dobson (2002) and Myers et al. (2001a). The result of this endeavor was successful. A new class of statistical models now can be considered and analyzed from a common and solid point of view, which is well accepted in the context of modern statistical modeling. In Myers et al. (2001a) the authors use the GLM approach to analyze given dataset for binomial and Poisson factorial designs. Other valuable contributions –see Lewis et al. (2001b) and Lewis et al. (2001a)– focus on some implications of analysis as well as, for example, how to obtain confidence intervals for the parameters on factorial designs when the response follows binomial and Poisson distributions. When the main goal of a study is not the analysis, but in the design of experiments, it is well known that the choice of design depends on the unknown parameters. See Cox & Reid (2000), Khuri & Mukhopadhyay (2006) and Atkinson (2006) for more details. Compared with classical regression models (e.g. normal linear models), the ran- dom component of GLMs is characterized by a certain relaxation of the assumption of additive error of the former. For a n-dimensional vector of independent obser- ′ vations of the response, say y = (y1,y2,...,yn) , the usual model has the form Y = x′β + ε, where x is a vector of known explanatory variables, β a vector of unknown parameters and ε the error term. In this case, the density function of Y ′ is fY (y)= fε(y − x β) (Firth 1991). In the case of GLM, errors do not necessarily follow a normal distribution but belong to the exponential family distributions. ′ So, the density function of the response can be written as fY (y)= f(y; x β), where x continues to appear only through the linear predictor, η = x′β (see Equation 8). As a result, the mean of Y is determined by η and it is conventional to write g(µ) = η, or µ = g−1(η). This function g(·) is known as link function, as it links the mean, µ, and the linear predictor, η. Thus, it is widely said that GLMs are a very suitable approach to deal with non-normality and non-constant models at the same time, and so the problem of the choice of designs is projected on this class of models too. According to what we mentioned before, GLMs can be completely characterized by their density function or by the likelihood function (Firth 1991). For example, let’s consider a response variable that belongs to the exponential family of distributions, whose generic likelihood function is: yθ − c(θ) L(θ, φ; y)= exp + h(y, φ) (1) φ for a single observation y, where c(·) and h(·) are taken as known functions. The quantities θ and φ are also known, and they are so called “natural” and “dispersion” parameters, respectively. Among some important properties GLMs have been related to the variance–covariance matrix of the response (McCullagh & Nelder 1989). The form of the variance of the observations for a given level of x, varies as a function of the mean of the distribution that data come from: V (y | xi)= φV (µ) (2) Revista Colombiana de Estadística 31 (2008) 261–291 264 Arturo T. De Zan where µ = E(y | x). The function V (µ) is the so called “variance function”, which is conditional to x as well as µ. This function is associated with the specific distribution considered (e.g., for the normal distribution, we have V (µ)=1 and φ = σ2). From Equation (2), we can see that V (µ) depends on the parameters of the model considered. Taking into account this aspect, iterative class of procedures have to be carried out on the way of estimating those parameters (e.g. Fisher scoring, Newton-Raphson, etc.). Thus, the estimation of β is usually expressed as: − ′ 1 ′ β = X WX X Wy (3) b c c where X is the usual design matrix and W is the estimated diagonal matrix of “weights” (Myers et al. 2001a), given by W = diag{w ,..., w }, whose i-th c 1 n element is given by: c−1 b b wi = V (yi | xi) (4) −1 ′ for i = 1,...,n. Recalling Equationb b (2) and provided that µ = g (x β), the dependence of the variance function on the parameters on the “weights” of the variance–covariance matrix of W can be seen: −1 −1 c −1 ′ wi = φV (µi) = φV g xiβ (5) b b b From Equation (4), the dependence of the “weights”, wi, on the unknown parameters is an obstacle when choosing the “proper” matrix design X that leads to “good designs”. This aspect is also clearly seen on the corresponding asymptotic variance–covariance matrix of β, which is approximated (Khuri & Mukhopadhyay 2006) by: − 1 ′ 1 V β ≈ X WX (6) φ b c From Equation (6), an implicit “structural” consequence also follows: although the design matrix X could be orthogonal, the presence of the matrix W does not imply that V(β) should be diagonal.

Experimental Sequential Designs for Logistic Regression Models

Logistic Regression, Dependencies, Non-Linear Data and Model Reduction

Logistic Regression Maths and Statistics Help Centre

Generalized Linear Models

An Introduction to Logistic Regression: from Basic Concepts to Interpretation with Particular Attention to Nursing Domain

Variance Partitioning in Multilevel Logistic Models That Exhibit Overdispersion

Randomization Does Not Justify Logistic Regression

Chapter 19: Logistic Regression

An Introduction to Biostatistics: Part 2

Lecture 9: Logistic Regression (V2)

Logistic Regression, Part I: Problems with the Linear Probability Model

Generalized Linear Models Link Function the Logistic Equation Is

Binary Logistic Regression