Iowa State University Capstones, Theses and Dissertations
Creative Components
Spring 2021
Piecewise linear regression for leaf appearance rate data
Lin Quan
Recommended Citation Quan, Lin, "Piecewise linear regression for leaf appearance rate data" (2021). Creative Components. 786. https://lib.dr.iastate.edu/creativecomponents/786
Piecewise linear regression for leaf appearance rate data
by
Lin Quan
A Creative Component submitted to the graduate faculty
in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE
Major: Statistics
Program of Study Committee:
Dan Nettleton, Major Professor
Lily Wang
Peng Liu
The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this creative component. The Graduate College will ensure this creative component is globally accessible and will not permit alterations after a degree is conferred.
Iowa State University
Ames, Iowa
2021
Copyright © Lin Quan, 2021. All rights reserved.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1. INTRODUCTION
CHAPTER 2. DATASET AND METHODS
2.1 Dataset
2.2 Methods
2.2.1 Package segmented
2.2.2 Least-squares
2.2.3 Bootstrap Resampling
CHAPTER 3. RESULTS AND DISCUSSION
3.1 Analysis of the first 396 genotypes
3.2 Analysis of the last 99 genotypes
CHAPTER 4. SUMMARY
CHAPTER 5. ACKNOWLEDGEMENT
BIBLIOGRAPHY
LIST OF TABLES
3.1 ANOVA table for difference between slopes in two segments for each genotype
3.2 ANOVA table for estimated breakpoints for each genotype
LIST OF FIGURES
3.1 (a) Histogram of p-values of genotypes with one breakpoint. (b) Histogram of p-values of genotypes with two breakpoints.
3.2 Plot of genotype C103. The black points are the raw data. The green line is a linear regression model. The blue line is a one breakpoint piecewise linear model, the blue triangle is a breakpoint. The red line is a two breakpoints piecewise linear model, the red circles are the breakpoints.
3.3 (a) Plot of comparing the estimators calculated by segmented and least-squares. (b) Plot of comparing the p-value calculated by segmented and residual bootstrap. The red line represents y=x.
3.4 (a) Studentized residuals. (b) Histogram of residuals.
3.5 (a) Plot of comparing the p-value calculated by segmented and wild bootstrap (Mammen’s distribution). (b) Plot of comparing the p-value calculated by residual bootstrap and wild bootstrap (Mammen’s distribution). (c) Plot of comparing the p-value calculated by wild bootstrap with the Rademacher distribution and Mammen’s distribution. (d) Plot of comparing the p-value calculated by wild bootstrap with the standard normal distribution and Mammen’s distribution. The red line represents y=x.
3.6 The estimated breakpoints and confidence intervals. (a) The results calculated by segmented. (b) The results calculated by least-squares and bootstrap. Black points represent estimated breakpoints while red lines represent confidence intervals.
3.7 Plot of the width of CI against the p-value. (a) segmented method. (b) bootstrap method.
3.8 The estimated breakpoints and confidence intervals calculated by segmented. (a) genotypes in block1. (b) genotypes in block2. Black points represent estimated breakpoints while red lines represent confidence intervals.
3.9 The estimated breakpoints and confidence intervals calculated by bootstrap. (a) genotypes in block1. (b) genotypes in block2. Black points represent estimated breakpoints while red lines represent confidence intervals.
3.10 Histogram of residuals and normal Q-Q plot. (a), (b) correspond to difference between slopes in two segments for each genotype. (c), (d) correspond to estimated breakpoints for each genotype.
3.11 (a) Plot of residuals versus estimated breakpoints. (b) Plot of residuals versus estimated difference between slopes in two segments. The red line represents y=0.
ABSTRACT
Segmented regression models are generalizations of linear and generalized linear models that replace a linear predictor with a piecewise linear predictor. The breakpoints where the piecewise linear predictor changes slope are unknown and are estimated from the data. We use segmented regression to model the relationship between the number of plant leaves and thermal time for hundreds of maize genotypes. Slope estimates from the fitted segmented regression models provide estimates of leaf appearance rate (LAR) for each genotype. Estimates of breakpoints provide insight into the developmental time points at which changes in LAR occur for each genotype. We compare inferences about slopes and breakpoints obtained from the segmented R package with inferences obtained by bootstrap techniques. Furthermore, we use the estimated difference between slopes in two segments for each genotype and the estimated single breakpoint for each genotype as LAR characteristics of interest. We then use each of these characteristics as a response variable in a linear model to test for genotype effects.
CHAPTER 1. INTRODUCTION
As an essential crop, maize plays an important role in small farm sustainability (Birch et al., 1998). Models for estimating maize developmental stages can be used to assist growers in management practices, such as choosing the best sowing date, and as an aid for selecting cultivars better adapted to different regions (Alberto et al., 2009). One important plant development parameter is the leaf appearance rate (LAR), which is the number of visible leaf collars per day (Xue et al., 2004). The LAR is used to build models of growth and yield of agricultural crops (Streck, 2003). It is an important parameter in the production efficiency of agricultural produce and has been used in ecophysiological studies in plants. Temperature is a major factor driving leaf appearance in maize (Hesketh and Warrington, 1989; White, 2001). One approach to predicting individual leaf appearance is the phyllochron, defined as the time interval between the appearances of successive leaves (Klepper et al., 1982; Kirby, 1995). Time is often expressed as thermal time (TT), with units of °C day. The calculation of LAR is an essential part of many crop simulation models (Hodges, 1991), including the maize model (López-Cedrón et al., 2005).
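As a rough illustration of these quantities, the sketch below accumulates thermal time from daily mean temperatures and converts a leaf appearance rate expressed per unit of thermal time into a phyllochron. The function names, the example temperatures, and the base temperature of 10 °C are assumptions made for illustration only; they are not taken from the data or models used in this study.

```python
def thermal_time(daily_mean_temps, base_temp=10.0):
    """Return cumulative thermal time (degree C day) after each day.

    base_temp=10.0 is an assumed base temperature, used here only
    for illustration; days colder than the base contribute zero.
    """
    total = 0.0
    cumulative = []
    for temp in daily_mean_temps:
        total += max(0.0, temp - base_temp)
        cumulative.append(total)
    return cumulative


def phyllochron_from_rate(leaf_appearance_rate):
    """Phyllochron (thermal time per leaf) as the reciprocal of the
    leaf appearance rate (leaves per unit of thermal time)."""
    return 1.0 / leaf_appearance_rate


# Hypothetical daily mean temperatures in degrees C.
temps = [8.0, 12.0, 20.0, 25.0]
print(thermal_time(temps))
print(phyllochron_from_rate(0.02))
```

Under these assumptions, a rate of 0.02 leaves per °C day corresponds to a phyllochron of about 50 °C day per leaf.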
A segmented model is a regression model where the relationship between the response and one or more explanatory variables is piecewise linear, that is, represented by two or more straight line segments connected at unknown points. These points are usually referred to as breakpoints, change points, or join points. Segmented regression is a useful tool for assessing the effect of a quantitative covariate X on the response Y when there exists a value within the range R_X of the covariate where the effect of X changes. Piecewise linear regression models are widely applied in many fields, including climatology, environmental epidemiology, occupational medicine, toxicology, and ecology (Shao and Campbell, 2002; Toms and Lesperance, 2003; Betts et al., 2007; Muggeo, 2008a; Rhodes et al., 2008). In applications, it is common for the effect of a key factor on the response to change at some threshold value. For instance, in mortality studies, the relationship between death and temperature is V-shaped, so it may be of interest to estimate the optimal temperature at which mortality reaches its minimum (Kunst et al., 1993). There is evidence that the risk of preterm delivery depends on the mother’s stress only when it is above a specific threshold (Whitehead, 2002). Likewise, fertility patterns for cohabitant women in Italy have been modeled by exploiting segmented relationships of the hazard with respect to time since the start of cohabitation (Muggeo et al., 2009).
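The piecewise linear relationship described above can be sketched for the simplest case of one covariate and one breakpoint. The function name and the parameter values below are hypothetical and chosen only to show how the slope changes at the breakpoint; this is a minimal illustration, not the fitting procedure used in this study.

```python
def segmented_mean(z, beta0, beta1, beta2, psi):
    """Mean response of a one-breakpoint piecewise linear model.

    beta1 is the slope to the left of the breakpoint psi and
    beta1 + beta2 is the slope to the right of it.
    """
    hinge = max(0.0, z - psi)  # zero left of psi, (z - psi) right of psi
    return beta0 + beta1 * z + beta2 * hinge


# Hypothetical parameter values, for illustration only.
beta0, beta1, beta2, psi = 1.0, 0.5, -0.3, 10.0

left = segmented_mean(4.0, beta0, beta1, beta2, psi)
at_break = segmented_mean(10.0, beta0, beta1, beta2, psi)
right = segmented_mean(20.0, beta0, beta1, beta2, psi)

# Slope over [10, 20] equals beta1 + beta2.
right_slope = (right - at_break) / (20.0 - 10.0)
print(left, at_break, right, right_slope)
```

The two segments meet at the breakpoint, so the fitted line is continuous even though its slope changes there.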
In the literature, several methods have been employed for regression models with unknown breakpoints. Some methods estimate the breakpoints separately. If the breakpoint is fixed, the model is usually linear and poses no problems of estimation and inference. The breakpoint can be estimated by simple inspection of a smoothed scatter plot (Kunst et al., 1993; Vermont et al., 1991) or by specific algorithms, e.g., exact- or grid-search-type algorithms (Stasinopoulos and Rigby, 1992; Ulm, 1991; Ertel and Fowlkes, 1976; Hawkins, 1976). Other methods approximate the segmented relationship through a continuous differentiable function over the whole range of the explanatory variable (Tishler and Zang, 1981a; Muggeo, 2003; Bacon and Watts, 1971; Griffiths and Miller, 1973) or in a neighborhood of the unknown breakpoint (Tishler and Zang, 1981b). In addition, regression splines are used to estimate breakpoints (knots in spline terminology) by Molinari et al. (2001). Bayesian MCMC methods are an alternative for the same purpose. In a Bayesian approach, no differentiability assumption is needed, but the computational effort can become rather heavy even with a simple model (Gössl and Küchenhoff, 2001).
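To make the grid-search idea concrete, the following sketch fits an ordinary least-squares model for each candidate breakpoint, where the model is linear once the breakpoint is fixed, and keeps the candidate with the smallest residual sum of squares. It is a minimal illustration under our own assumptions and does not reproduce any of the cited algorithms; all function names and the noise-free example data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a small dense system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_fixed_breakpoint(z, y, psi):
    """Least-squares fit with the breakpoint psi held fixed.

    Returns ([beta0, beta1, beta2], residual sum of squares).
    """
    rows = [[1.0, zi, max(0.0, zi - psi)] for zi in z]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    coefs = solve_linear_system(xtx, xty)
    rss = sum((yi - sum(c * v for c, v in zip(coefs, r))) ** 2
              for r, yi in zip(rows, y))
    return coefs, rss


def grid_search_breakpoint(z, y, candidates):
    """Return (psi, coefs, rss) minimizing the RSS over the candidates.

    Candidates should lie strictly inside the range of z, with several
    observations on each side; otherwise the fit is not identifiable.
    """
    best = None
    for psi in candidates:
        coefs, rss = fit_fixed_breakpoint(z, y, psi)
        if best is None or rss < best[2]:
            best = (psi, coefs, rss)
    return best


# Noise-free hypothetical data with a true breakpoint at 5.
z = [float(i) for i in range(11)]
y = [1.0 + 0.5 * zi - 0.3 * max(0.0, zi - 5.0) for zi in z]
candidates = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
best_psi, best_coefs, best_rss = grid_search_breakpoint(z, y, candidates)
print(best_psi, best_coefs, best_rss)
```

The cost of such a search grows with the number of candidates and with the number of breakpoints, which is one reason more efficient estimation schemes are of interest.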
While widely used, most of these methods seem inappropriate for the current study of LAR for the following reasons: (i) only single-breakpoint detection has been investigated, and only a few methods have been employed in the case of multiple parameters; (ii) the model is constrained to a particular probability distribution of the response variable (often Gaussian or even logistic). Furthermore, a common aspect of the methods mentioned above is that the computational cost increases with the complexity of the model. These issues may therefore be a practical limitation when modelling breakpoints in regression models (Muggeo, 2003).
In this study, we use a segmented regression model to fit LAR data and check whether the linear slope is constant or changes at some points. We use the segmented R package to estimate the breakpoints and to obtain inferences about the necessity and location of breakpoints. The main advantage of the package is its flexibility to model more than one segmented relationship and multiple, variable-specific breakpoints. We compare the segmented R package to a least-squares method for estimating breakpoints that does not require a normality assumption. Based on the least-squares results, we use bootstrap and wild bootstrap methods to perform breakpoint inferences and compare the results with those from the segmented R package. Finally, we test for differences in some LAR characteristics among the genotypes. We choose the estimated single breakpoint and the estimated difference between the slopes of adjacent segments for each genotype as the representative LAR characteristics.
CHAPTER 2. DATASET AND METHODS
2.1 Dataset
The LAR dataset we consider contains counts of maize leaves as a function of the date and accumulated growing degree days (GDD). The current data were collected from 596 rows of maize plants. In the first 396 rows, each row represents a unique genotype. For the last 200 rows, every two rows represent one genotype, and they are equally distributed into two blocks. Two genotypes, B73 x PHM49 and LH198 x PHN37, have two rows in the first block and one row in the second block. At each observation time, two maize plants are selected from each row and their leaves are counted.
The plants selected for counting within any row vary from time to time. Thus, it is not possible to track leaf counts for any one plant with the available data.
2.2 Methods
2.2.1 Package segmented
The segmented R package (Muggeo, 2003, 2008b, 2016, 2017) is employed to fit regression models with breakpoints. It provides facilities to estimate and summarize generalized linear models with segmented relationships. The primary advantage of the package is that it does not limit the number of segmented variables or the number of breakpoints for each variable (Muggeo, 2008b). The method is suitable for generalizing any regression model with a linear predictor.
The main idea of segmented is to fit a segmented regression and estimate the breakpoint (Muggeo, 2003). Here we use a single-breakpoint piecewise linear regression model as an example:
$$y = \beta_0 + \beta_1 Z + \beta_2 (Z - \psi)_+ + \epsilon, \tag{2.1}$$

where $\psi$ is the breakpoint, $(Z - \psi)_+ = (Z - \psi)\, I(Z > \psi)$ and $\epsilon \sim N(0, \sigma^2)$. According to the piecewise linear regression model, $|\beta_2| > 0$ if a breakpoint exists. $\beta_1$ is the slope of the left line segment, which corresponds to the condition $Z \le \psi$. $\beta_2$ describes the difference between the slopes of adjacent segments. Therefore, $(\beta_1 + \beta_2)$ is the slope of the right line segment. Using a first-order Taylor expansion around an initial value $\psi^0$, one can approximate the term $(Z - \psi)_+$ as,