A Comparison of Generalized Linear Models for Insect Count Data
International Journal of Statistics and Analysis. ISSN 2248-9959, Volume 9, Number 1 (2019), pp. 1-9
© Research India Publications
http://www.ripublication.com

S.R Naffees Gowsar1*, M Radha1, M Nirmala Devi2
1*PG Scholar in Agricultural Statistics, Tamil Nadu Agricultural University, Coimbatore, Tamil Nadu, India.
1Faculty of Agricultural Statistics, Tamil Nadu Agricultural University, Coimbatore, Tamil Nadu, India.
2Faculty of Mathematics, Tamil Nadu Agricultural University, Coimbatore, Tamil Nadu, India.

Abstract
Statistical models are powerful tools that can capture the essence of many biological systems and help investigate ecological patterns whose stability depends on endogenous and exogenous factors. The generalized linear model is a flexible generalization of ordinary linear regression that allows for response variables whose error distribution is not normal. Count, binary and proportion data can be modelled with the general linear model (GLM) after the variables have been transformed. Without transforming the data, however, such models can be fitted directly using the generalized linear model (GzLM). In this study, an attempt has been made to compare generalized linear regression models for insect count data. The best model for the data has been identified through the Vuong test.

Keywords: Count data, Regression Models, Criteria, Vuong test.

INTRODUCTION:
Understanding the type of data before deciding on the modelling approach is the foremost step in data analysis. When predictors and response variables that follow non-normal distributions are modelled with an ordinary linear model, the fit suffers from methodological limitations and poor statistical properties. Hence, the generalized linear model is used. The generalized linear model is an extension of the general linear model.
The generalized linear model can be used to predict a response variable both when the dependent variable has a discrete distribution and when it is nonlinearly related to the predictors. It is a larger class of models that includes the Poisson regression model, which is well suited to count data. Poisson regression has one disadvantage, however: it requires the mean to be equal to the variance. When the conditional variance is greater than the conditional mean, Poisson regression becomes unsuitable, and a generalization of it, the negative binomial regression model, is used for such overdispersed data. Data with excess zeros can be modelled using the zero-inflated model and the hurdle model.

In this paper, the deaths of whitefly and Encarsia sp. are taken into account. The data were collected from the coconut farm at TNAU, Coimbatore-03. Whitefly is an invasive pest of coconut. The life cycle of whitefly is 3-4 weeks under favourable monsoon conditions. Yellow sticky cards are an essential tool for monitoring whitefly populations. Whiteflies are sap feeders that reduce the overall vigour of plants with their feeding. Encarsia sp. provides effective biological control against whitefly. Hence, the death of Encarsia sp. due to yellow sticky cards is modelled with the help of whitefly deaths over 4 weeks.

METHODOLOGY
To study the relation of count data with other variables, the count is treated as the dependent variable. Linear regression assumes a normally distributed dependent variable, which does not hold for count data. By contrast, the Poisson regression model recognizes the nonnegative nature of count data. The Poisson regression model is a special case of the class of generalized linear models (GLM) (McCullagh and Nelder, 1986).

1.
Poisson regression
When the outcome variable is a count (i.e. non-negative and discrete) and its mean is equal to its variance, the Poisson regression model can be used. A random variable Y that follows the Poisson distribution has the probability mass function

\[ \Pr\{Y = y\} = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad \lambda > 0,\; y = 0, 1, 2, \ldots \]

Since E(Y) = Var(Y), any factor that affects one will also affect the other. When the variance is greater than the mean, overdispersion occurs.

2. Negative Binomial regression
Negative binomial regression is a generalization of Poisson regression that deals well with overdispersed data. If Y is a random variable that follows the negative binomial distribution, then its probability mass function is

\[ P(Y_i = y_i \mid X_i) = \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})}\, r_i^{\alpha^{-1}} (1 - r_i)^{y_i}, \qquad \lambda_i = \exp(X_i'\beta), \quad r_i = \frac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}, \]

with conditional mean and variance

\[ E[y_i \mid x_i] = \lambda_i = \exp(X_i'\beta), \qquad \mathrm{Var}[y_i \mid x_i] = \lambda_i (1 + \alpha\lambda_i). \]

Here \alpha is the dispersion parameter. As this parameter tends to zero, the negative binomial distribution tends to the Poisson distribution.

3. Zero Inflated Poisson regression models
The zero-inflated model assumes that the data are generated by two different processes: one generates only zero counts, from the always-zero group, and the other generates both zero and non-zero counts, from the not-always-zero group. Hence it combines logistic regression (first stage) and Poisson regression (second stage):

\[ P(Y_{ij} = y_{ij}) = \begin{cases} p_{ij} + (1 - p_{ij})\, e^{-\lambda_{ij}}, & y_{ij} = 0 \\ (1 - p_{ij})\, \dfrac{e^{-\lambda_{ij}} \lambda_{ij}^{y_{ij}}}{y_{ij}!}, & y_{ij} > 0 \end{cases} \]

4. Hurdle models
Like the zero-inflated model, the hurdle model involves two processes. The first determines whether the event count falls below or above the hurdle. Once the hurdle has been crossed, the second stage determines the number of subsequent event occurrences above the hurdle. Hence it combines logistic regression as a first stage and negative binomial regression as a second stage. As written, the probabilities below have a zero-truncated Poisson count stage:

\[ P(Y) = \begin{cases} \pi, & Y = 0 \\ (1 - \pi)\, \dfrac{e^{-\lambda}\lambda^{Y}}{(1 - e^{-\lambda})\,\Gamma(Y + 1)}, & Y \ge 1 \end{cases} \]

MODEL EVALUATION AND SELECTION

i.
Akaike Information Criteria
Information criteria are built on the Kullback-Leibler divergence, which measures how one probability distribution differs from a reference probability distribution; it is also called the directed divergence between two distributions. AIC estimates this divergence for a fitted model:

\[ \mathrm{AIC}(j) = 2n\hat{K}_j = -2\, l_j(\hat{\theta}_j) + 2 d_j \]

Here l_j(\hat{\theta}_j) is the maximized log-likelihood of model j, d_j is its number of parameters, and 2d_j is the penalty for model complexity. The best model is the one with the smallest AIC(j).

ii. Bayesian Information Criteria
BIC is also known as the Schwarz criterion, after Gideon Schwarz. It is similar to the Minimum Description Length (MDL) criterion.

\[ \mathrm{BIC}_j = l_j(\hat{\theta}_j) - \frac{d_j}{2} \log n \]

Multiplying by -2 gives -2 l_j(\hat{\theta}_j) + d_j \log n, which is directly comparable with AIC. The two criteria have the same form, but the BIC penalty is harsher whenever \log n > 2. When the true model is among the candidates, BIC selects it with probability approaching one as n grows.

iii. Vuong test
The Vuong test is a likelihood-ratio-based test for model selection that makes probabilistic statements about two models. It tests the null hypothesis that the two models approximate the actual model equally well against the alternative hypothesis that one model represents the actual model more accurately (i.e., is preferred). It cannot decide that the "more accurate" model is the true model. For each observation, define

\[ m_i = \log\left[ \frac{P_1(Y_i \mid X_i)}{P_2(Y_i \mid X_i)} \right] \]

The Vuong test statistic tests the hypothesis E(m_i) = 0:

\[ V = \frac{\sqrt{n}\left( \frac{1}{n} \sum_{i=1}^{n} m_i \right)}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (m_i - \bar{m})^2}} \]

This statistic measures the superiority of model 1 over model 2. If V > 1.96, the first model is preferred; if V < -1.96, the second model is preferred; otherwise neither model is preferred at the 5% level.

ANALYSIS
The data are first checked for normality using Q-Q plots, with the two variables plotted separately. The plots show that the data are not normally distributed, and the data are overdispersed.

Figure 1: Normal Q-Q plot for Encarsia sp.
Figure 2: Normal Q-Q plot for Whitefly

Fitting Poisson model results:
Table 1: ANOVA for Poisson Model.
             Estimate     Standard error   t-value   Pr(>|t|)
Intercept     0.725887    0.721365          1.006    0.315668
Week B       -0.478387    0.402025         -1.190    0.235671
Week C       -0.773953    0.408301         -1.896    0.059659
Week D       -1.233354    0.459118         -2.686    0.007915 **
Whitefly      0.017175    0.004836          3.552    0.000491 ***
Significance codes: 0.1 (.), 0.05 (*), 0.01 (**), 0.001 (***)

Here Week D and Whitefly have significantly influenced the death of Encarsia sp. The exponentiated coefficients of the Poisson model are:

Table 2: Exponential coefficients of Poisson Model
             Exponential coefficient
Intercept    2.06
Week B       0.619
Week C       0.461
Week D       0.291
Whitefly     1.017

Because the model uses a log link, the exponentiated coefficients act multiplicatively on the expected count. Hence the fitted model is

\[ \hat{Y} = 2.06 \times 0.619^{\text{Week B}} \times 0.461^{\text{Week C}} \times 0.291^{\text{Week D}} \times 1.017^{\text{Whitefly}} \]

Negative binomial model results:

Table 3: ANOVA for Negative Binomial Model
             Estimate     Standard error   t-value   Pr(>|t|)
Intercept     0.334979    0.305461          1.097    0.2728
Week B       -0.269579    0.188447         -1.431    0.1526
Week C       -0.462654    0.199375         -2.321    0.0203 *
Week D       -1.080769    0.252210         -4.285    1.83e-05 ***
Whitefly      0.005579    0.001944          2.871    0.0041 **
Significance codes: 0.1 (.), 0.05 (*), 0.01 (**), 0.001 (***)

Here Week C, Week D and Whitefly significantly influence the death of Encarsia sp. The exponentiated coefficients of the negative binomial model are:

Table 4: Exponential Coefficient of Negative Binomial Model
             Exponential coefficient
Intercept    1.3979
Week B       0.7637
Week C       0.6296
Week D       0.3393
Whitefly     1.0055

The fitted model, on the same multiplicative scale, is

\[ \hat{Y} = 1.39 \times 0.763^{\text{Week B}} \times 0.629^{\text{Week C}} \times 0.339^{\text{Week D}} \times 1.005^{\text{Whitefly}} \]

Zero inflation model:
For this model, the coefficients are not the same for the two processes. Either set can be used to construct the model; based on the objective, the count model coefficients are considered here.

Table 5: Exponential Coefficients of Zero inflated model.
             Count model    Zero inflation model
Intercept    1.3044241      3.283438e-09
Week B       0.7914268      1.302208e-01
Week C       0.7589757      1.830644e+05
Week D       0.4708149      6.918792e+05
Whitefly     1.0060517      1.064084e+00

The count model, with exponentiated coefficients acting multiplicatively on the expected count, is

\[ \hat{Y} = 1.30 \times 0.79^{\text{Week B}} \times 0.76^{\text{Week C}} \times 0.47^{\text{Week D}} \times 1.006^{\text{Whitefly}} \]

Hurdle model:
This is similar to the zero inflation model, in that the coefficients are not the same for the two processes.
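The difference between the two-part structures of the zero-inflated and hurdle models can be sketched numerically. The Python sketch below is illustrative only: the function names and the parameter values (a zero-stage probability of 0.3 and a Poisson mean of 2.0) are assumptions chosen for the example, not estimates from this study, and a Poisson count stage is assumed for both models.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability of observing the count y with mean lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def zip_pmf(y, p, lam):
    """Zero-inflated Poisson: with probability p the observation is a
    structural zero; otherwise it is drawn from Poisson(lam), which can
    itself produce a zero."""
    if y == 0:
        return p + (1 - p) * math.exp(-lam)
    return (1 - p) * poisson_pmf(y, lam)


def hurdle_pmf(y, pi0, lam):
    """Hurdle model: all zeros come from the first stage (probability pi0);
    positive counts follow a zero-truncated Poisson(lam)."""
    if y == 0:
        return pi0
    return (1 - pi0) * poisson_pmf(y, lam) / (1 - math.exp(-lam))


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    p, lam = 0.3, 2.0
    for y in range(5):
        print(y, round(zip_pmf(y, p, lam), 4), round(hurdle_pmf(y, p, lam), 4))
```

With the same first-stage probability, the zero-inflated model assigns more mass to zero than the hurdle model, because its count stage can also generate zeros; in the hurdle model the zero probability is fixed entirely by the first stage.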