Piecewise Linear Regression for Leaf Appearance Rate Data
Iowa State University Capstones, Theses and Dissertations
Creative Components
Spring 2021

Part of the Biostatistics Commons

Recommended Citation
Quan, Lin, "Piecewise linear regression for leaf appearance rate data" (2021). Creative Components. 786.
https://lib.dr.iastate.edu/creativecomponents/786

Piecewise linear regression for leaf appearance rate data

by

Lin Quan

A Creative Component submitted to the graduate faculty in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE

Major: Statistics

Program of Study Committee:
Dan Nettleton, Major Professor
Lily Wang
Peng Liu

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this creative component. The Graduate College will ensure this creative component is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University
Ames, Iowa
2021

Copyright © Lin Quan, 2021. All rights reserved.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1. INTRODUCTION
CHAPTER 2. DATASET AND METHODS
    2.1 Dataset
    2.2 Methods
        2.2.1 Package segmented
        2.2.2 Least-squares
        2.2.3 Bootstrap Resampling
CHAPTER 3. RESULTS AND DISCUSSION
    3.1 Analysis of the first 396 genotypes
    3.2 Analysis of the last 99 genotypes
CHAPTER 4. SUMMARY
CHAPTER 5. ACKNOWLEDGEMENT
BIBLIOGRAPHY

LIST OF TABLES

3.1 ANOVA table for difference between slopes in two segments for each genotype
3.2 ANOVA table for estimated breakpoints for each genotype

LIST OF FIGURES

3.1 (a) Histogram of p-values of genotypes with one breakpoint. (b) Histogram of p-values of genotypes with two breakpoints.
3.2 Plot of genotype C103. The black points are the raw data. The green line is a linear regression model. The blue line is a one-breakpoint piecewise linear model; the blue triangle is its breakpoint. The red line is a two-breakpoint piecewise linear model; the red circles are its breakpoints.
3.3 (a) Plot comparing the estimators calculated by segmented and least-squares. (b) Plot comparing the p-values calculated by segmented and residual bootstrap. The red line represents y=x.
3.4 (a) Studentized residuals. (b) Histogram of residuals.
3.5 (a) Plot comparing the p-values calculated by segmented and wild bootstrap (Mammen's distribution). (b) Plot comparing the p-values calculated by residual bootstrap and wild bootstrap (Mammen's distribution). (c) Plot comparing the p-values calculated by wild bootstrap with the Rademacher distribution and Mammen's distribution. (d) Plot comparing the p-values calculated by wild bootstrap with the standard normal distribution and Mammen's distribution. The red line represents y=x.
3.6 The estimated breakpoints and confidence intervals. (a) The results calculated by segmented. (b) The results calculated by least-squares and bootstrap. Black points represent estimated breakpoints while red lines represent confidence intervals.
3.7 Plot of the width of CI against the p-value. (a) segmented method. (b) bootstrap method.
3.8 The estimated breakpoints and confidence intervals calculated by segmented. (a) genotypes in block1. (b) genotypes in block2. Black points represent estimated breakpoints while red lines represent confidence intervals.
3.9 The estimated breakpoints and confidence intervals calculated by bootstrap. (a) genotypes in block1. (b) genotypes in block2. Black points represent estimated breakpoints while red lines represent confidence intervals.
3.10 Histogram of residuals and normal Q-Q plot. (a), (b) correspond to difference between slopes in two segments for each genotype. (c), (d) correspond to estimated breakpoints for each genotype.
3.11 (a) Plot of residuals versus estimated breakpoints. (b) Plot of residuals versus estimated difference between slopes in two segments. The red line represents y=0.

ABSTRACT

Segmented regression models are generalizations of linear and generalized linear models that replace a linear predictor with a piecewise linear predictor. Breakpoints where the piecewise linear predictor changes slope are unknown and estimated from data. We use segmented regression to model the relationship between the number of plant leaves and thermal time for hundreds of maize genotypes. Slope estimates from fitted segmented regression models provide estimates of leaf appearance rate (LAR) for each genotype. Estimates of breakpoints provide insight into developmental time points when changes in LAR occur for each genotype. We compare inferences about slopes and breakpoints obtained from the segmented R package with inferences obtained by bootstrap techniques. Furthermore, we use the estimated difference between slopes in two segments for each genotype and the estimated single breakpoint for each genotype as LAR characteristics of interest. We then use each of these characteristics as a response variable in a linear model to test for genotype effects.

CHAPTER 1. INTRODUCTION

As an essential crop, maize plays an important role in small farm sustainability (Birch et al., 1998).
Models for estimating maize developmental stages can be used to assist growers in management practices, such as choosing the best sowing date, and as an aid for selecting cultivars better adapted to different regions (Alberto et al., 2009). One important plant development parameter is the leaf appearance rate (LAR), which is the number of visible leaf collars per day (Xue et al., 2004). The LAR is used to obtain models of growth and yield of agricultural crops (Streck, 2003). It is an important parameter in the production efficiency of agricultural produce and has been used in ecophysiological studies in plants. Temperature is a major factor driving leaf appearance in maize (Hesketh and Warrington, 1989; White, 2001). One approach to predicting individual leaf appearance is the phyllochron, defined as the time interval between the appearances of successive leaves (Klepper et al., 1982; Kirby, 1995). Time is often expressed as thermal time (TT), with units of °C day. The calculation of LAR is an essential part of many crop simulation models (Hodges, 1991), including the maize model (López-Cedrón et al., 2005).

A segmented model is a regression model in which the relationship between the response and one or more explanatory variables is piecewise linear, that is, represented by two or more straight line segments connected at unknown points. These points are usually referred to as breakpoints, change points, or join points. Segmented regression is a useful tool for assessing the effect of a quantitative covariate X on the response Y when there exists a value within the range Rx of the covariate where the effect of X changes. Piecewise linear regression models are widely applied in many fields, including climatology, environmental epidemiology, occupational medicine, toxicology, and ecology (Shao and Campbell, 2002; Toms and Lesperance, 2003; Betts et al., 2007; Muggeo, 2008a; Rhodes et al., 2008).
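
To make the definition concrete, the following is a minimal sketch of a one-breakpoint segmented mean function and a simple way to fit it. It is an illustration under stated assumptions, not the author's code and not the algorithm used by the segmented package: the hinge parametrization b0 + b1*x + b2*max(x - psi, 0), all function names, and the noiseless synthetic data are hypothetical choices made for the example. For a fixed candidate breakpoint psi the model is linear in its coefficients, so ordinary least squares applies; the sketch tries each candidate on a grid and keeps the one with the smallest residual sum of squares (RSS).

```python
from typing import List, Sequence, Tuple


def segmented_mean(x: float, b0: float, b1: float, b2: float, psi: float) -> float:
    """One-breakpoint piecewise linear mean.

    The slope is b1 to the left of psi and b1 + b2 to the right of it
    (hinge parametrization, assumed here for illustration).
    """
    return b0 + b1 * x + b2 * max(x - psi, 0.0)


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def fit_fixed_breakpoint(
    x: Sequence[float], y: Sequence[float], psi: float
) -> Tuple[List[float], float]:
    """Ordinary least squares for a fixed breakpoint psi; returns (coefficients, RSS)."""
    rows = [[1.0, xi, max(xi - psi, 0.0)] for xi in x]  # design matrix
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    beta = solve_linear(xtx, xty)  # normal equations
    rss = sum((yi - segmented_mean(xi, *beta, psi)) ** 2 for xi, yi in zip(x, y))
    return beta, rss


def grid_search_breakpoint(
    x: Sequence[float], y: Sequence[float], candidates: Sequence[float]
) -> Tuple[float, List[float], float]:
    """Return (psi, coefficients, RSS) for the candidate with the smallest RSS.

    Candidates must lie strictly inside the observed range of x;
    otherwise the hinge column is collinear with the others or all zero.
    """
    best = None
    for psi in candidates:
        beta, rss = fit_fixed_breakpoint(x, y, psi)
        if best is None or rss < best[2]:
            best = (psi, beta, rss)
    return best


# Synthetic, noiseless data (hypothetical values, not from the thesis dataset).
x = [float(i) for i in range(21)]
y = [segmented_mean(xi, 1.0, 0.5, -0.3, 8.0) for xi in x]
candidates = [2.0 + 0.5 * k for k in range(33)]  # 2.0, 2.5, ..., 18.0

psi_hat, beta_hat, rss_hat = grid_search_breakpoint(x, y, candidates)
print(psi_hat, beta_hat, rss_hat)
```

With real, noisy data the grid resolution limits the precision of the estimated breakpoint, and the sketch says nothing about standard errors or tests for the existence of a breakpoint.
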
In applications, the effect of a key factor on the response often changes at some threshold value. For instance, in mortality studies, the relationship between death and temperature is V-shaped, so it may be of interest to estimate the optimal temperature where the mortality reaches its minimum (Kunst et al., 1993). There is evidence that the risk of preterm delivery depends on the mother's stress only when it is above a specific threshold (Whitehead, 2002). Likewise, fertility patterns for cohabitant women in Italy have been modeled by exploiting segmented relationships of the hazard with respect to time since the start of cohabitation (Muggeo et al., 2009).

In the literature, several methods have been employed in regression models with unknown breakpoints. Some methods estimate the breakpoints separately: once the breakpoint is fixed, the model is usually linear, and estimation and inference pose no special problems. The breakpoint can be estimated by a simple inspection of the scatter smooth plot (Kunst et al., 1993; Vermont et al., 1991) or by specific algorithms, e.g., exact- or grid-search-type algorithms (Stasinopoulos and Rigby, 1992; Ulm, 1991; Ertel and Fowlkes, 1976; Hawkins, 1976). Other methods approximate the segmented relationship through a continuous differentiable function on the overall range of the explanatory variable (Tishler and Zang, 1981a; Muggeo, 2003; Bacon and Watts, 1971; Griffiths and Miller, 1973) or in a neighborhood of the unknown breakpoint (Tishler and Zang, 1981b). In addition, Molinari et al. (2001) use regression splines to estimate breakpoints (knots in spline terminology). Bayesian MCMC methods are an alternative for the same purpose. In a Bayesian approach, no differentiability assumption is needed, but on the other hand, the computational effort might