Iowa State University Capstones, Theses and Creative Components Dissertations

Spring 2021

Piecewise linear regression for leaf appearance rate data

Lin Quan

Recommended Citation Quan, Lin, "Piecewise linear regression for leaf appearance rate data" (2021). Creative Components. 786. https://lib.dr.iastate.edu/creativecomponents/786

This Creative Component is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Creative Components by an authorized administrator of Iowa State University Digital Repository.

Piecewise linear regression for leaf appearance rate data

by

Lin Quan

A Creative Component submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Major: Statistics

Program of Study Committee:
Dan Nettleton, Major Professor
Lily Wang
Peng Liu

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this creative component. The Graduate College will ensure this creative component is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University

Ames, Iowa

2021

Copyright © Lin Quan, 2021. All rights reserved.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER 1. INTRODUCTION

CHAPTER 2. DATASET AND METHODS
2.1 Dataset
2.2 Methods
2.2.1 Package segmented
2.2.2 Least-squares
2.2.3 Bootstrap Resampling

CHAPTER 3. RESULTS AND DISCUSSION
3.1 Analysis of the first 396 genotypes
3.2 Analysis of the last 99 genotypes

CHAPTER 4. SUMMARY

CHAPTER 5. ACKNOWLEDGEMENT

BIBLIOGRAPHY

LIST OF TABLES

3.1 ANOVA table for difference between slopes in two segments for each genotype

3.2 ANOVA table for estimated breakpoints for each genotype

LIST OF FIGURES

3.1 (a) Histogram of p-values of genotypes with one breakpoint. (b) Histogram of p-values of genotypes with two breakpoints.

3.2 Plot of genotype C103. The black points are the raw data. The green line is a linear regression model. The blue line is a one breakpoint piecewise linear model, the blue triangle is a breakpoint. The red line is a two breakpoints piecewise linear model, the red circles are the breakpoints.

3.3 (a) Plot of comparing the estimators calculated by segmented and least-squares. (b) Plot of comparing the p-value calculated by segmented and residual bootstrap. The red line represents y=x.

3.4 (a) Studentized residuals. (b) Histogram of residuals.

3.5 (a) Plot of comparing the p-value calculated by segmented and wild bootstrap (Mammen’s distribution). (b) Plot of comparing the p-value calculated by residual bootstrap and wild bootstrap (Mammen’s distribution). (c) Plot of comparing the p-value calculated by wild bootstrap with the Rademacher distribution and Mammen’s distribution. (d) Plot of comparing the p-value calculated by wild bootstrap with the standard normal distribution and Mammen’s distribution. The red line represents y=x.

3.6 The estimated breakpoints and confidence intervals. (a) The results calculated by segmented. (b) The results calculated by least-squares and bootstrap. Black points represent estimated breakpoints while red lines represent confidence intervals.

3.7 Plot of the width of CI against the p-value. (a) segmented method. (b) bootstrap method.

3.8 The estimated breakpoints and confidence intervals calculated by segmented. (a) genotypes in block1. (b) genotypes in block2. Black points represent estimated breakpoints while red lines represent confidence intervals.

3.9 The estimated breakpoints and confidence intervals calculated by bootstrap. (a) genotypes in block1. (b) genotypes in block2. Black points represent estimated breakpoints while red lines represent confidence intervals.

3.10 Histogram of residuals and normal Q-Q plot. (a), (b) correspond to difference between slopes in two segments for each genotype. (c), (d) correspond to estimated breakpoints for each genotype.

3.11 (a) Plot of residuals versus estimated breakpoints. (b) Plot of residuals versus estimated difference between slopes in two segments. The red line represents y=0.

ABSTRACT

Segmented regression models are generalizations of linear and generalized linear models that replace a linear predictor with a piecewise linear predictor. Breakpoints where the piecewise linear predictor changes slope are unknown and estimated from data. We use segmented regression to model the relationship between the number of plant leaves and thermal time for hundreds of maize genotypes. Slope estimates from fitted segmented regression models provide estimates of leaf appearance rate (LAR) for each genotype. Estimates of breakpoints provide insight into developmental time points when changes in LAR occur for each genotype. We compare inferences about slopes and breakpoints obtained from the segmented R package with inferences obtained by bootstrap techniques. Furthermore, we use the estimated difference between slopes in two segments for each genotype and the estimated single breakpoint for each genotype as LAR characteristics of interest. We then use each of these characteristics as a response variable in a linear model to test for genotype effects.

CHAPTER 1. INTRODUCTION

As an essential crop, maize plays an important role in small farm sustainability (Birch et al., 1998). Models for estimating maize developmental stages can be used to assist growers in management practices, such as choosing the best sowing date, and as an aid for selecting cultivars better adapted to different regions (Alberto et al., 2009). One important plant development parameter is the leaf appearance rate (LAR), which is the number of visible leaf collars per day (Xue et al., 2004). The LAR is used to obtain models of growth and yield of agricultural crops (Streck, 2003). It is an important parameter in the production efficiency of agricultural produce and has been used in ecophysiological studies in plants. Temperature is a major factor driving leaf appearance in maize (Hesketh and Warrington, 1989; White, 2001). One approach to predict individual leaf appearance is the phyllochron, defined as the time interval between the appearances of successive leaves (Klepper et al., 1982; Kirby, 1995). Time is often expressed as thermal time (TT), with units of °C day. The calculation of LAR is an essential part of many crop simulation models (Hodges, 1991), including the maize model (López-Cedrón et al., 2005).

A segmented model is a regression model where the relationship between the response and one or more explanatory variables is piecewise linear, namely represented by two or more straight line segments connected at unknown points. These points are usually referred to as breakpoints, change points or join points. Segmented regression is a useful tool for assessing the effect of a quantitative covariate X on the response Y when there exists a value within the range Rx of the covariate where the effect of X changes. Piecewise linear regression models are widely applied in many fields, including climatology, environmental epidemiology, occupational medicine, toxicology, and ecology (Shao and Campbell, 2002; Toms and Lesperance, 2003; Betts et al., 2007; Muggeo, 2008a; Rhodes et al., 2008). In applications, it is common to see the effect of some key factor on the response change before and after some threshold value. For instance, in mortality studies, the relationship between death and temperature is V-shaped, so it may be of interest to estimate the optimal temperature where the mortality reaches its minimum (Kunst et al., 1993). There is evidence that the risk of preterm delivery depends on the mother’s stress only when it is above a specific threshold (Whitehead, 2002). Likewise, fertility patterns for cohabitant women in Italy have been modeled by exploiting segmented relationships of the hazard with respect to time since the start of cohabitation (Muggeo et al., 2009).

In the literature, several methods have been employed in regression models with unknown breakpoints. Some methods separately estimate the breakpoints. If the breakpoint is fixed, the models are usually linear, without any problems of estimation and inference. The breakpoint can be estimated by a simple inspection of the scatter smooth plot (Kunst et al., 1993; Vermont et al., 1991) or by specific algorithms, e.g., exact- or grid-search-type algorithms (Stasinopoulos and Rigby, 1992; Ulm, 1991; Ertel and Fowlkes, 1976; Hawkins, 1976). Some other methods approximate the segmented relationship through a continuous differentiable function on the overall range of the explanatory variable (Tishler and Zang, 1981a; Muggeo, 2003; Bacon and Watts, 1971; Griffiths and Miller, 1973) or in a neighborhood of the unknown breakpoint (Tishler and Zang, 1981b). In addition, regression splines are used to estimate breakpoints (knots in spline terminology) by Molinari et al. (2001). Bayesian MCMC methods are an alternative for the same purpose. In a Bayesian approach, no differentiability assumption is needed, but on the other hand, the computational effort might become rather heavy even with a simple model (Gössl and Küchenhoff, 2001).

While widely used, most of the methods seem to be inappropriate for the current study of LAR because of the following reasons. (i) Only single breakpoint detections have been investigated, and just a few have been employed in the case of multiple parameters; (ii) The model is constrained to a particular probability distribution of the response variable (often Gaussian or even logistic). Furthermore, a common aspect of the methods mentioned above is that the computational cost increases with the complexity of the model. Therefore, these issues might be a practical limitation of modelling breakpoints in regression models (Muggeo, 2003).

In this study, we use a segmented regression model to fit LAR data and check whether the linear slope is constant or changes at some points. We use the segmented R package to estimate the breakpoints and to obtain inferences about the necessity and location of breakpoints. The main advantage of the package is its flexibility to model more than one segmented relationship and multiple, variable-specific breakpoints. We compare the segmented R package to a least-squares method for estimating breakpoints that does not require a normality assumption. Based on the least-squares results, we use bootstrap and wild bootstrap methods to perform breakpoint inferences and compare the results with the segmented R package. Finally, we test some LAR characteristics among the genotypes. We choose the estimated single breakpoint and the estimated difference between the slopes of adjacent segments for each genotype as the representative LAR characteristics.

CHAPTER 2. DATASET AND METHODS

2.1 Dataset

The LAR dataset we consider contains counts of maize leaves as a function of the date and accumulated growing degree days (GDD). The current data were collected from 596 rows of maize plants. In the first 396 rows, each row represents a unique genotype. For the last 200 rows, every two rows represent one genotype, and they are equally distributed into two blocks. Two genotypes, B73 x PHM49 and LH198 x PHN37, have two rows in the first block and one row in the second block. At each time, two maize plants are selected from one row to count the leaves. The plants selected for counting within any row vary from time to time. Thus, it is not possible to track leaf counts for any one plant with the available data.

2.2 Methods

2.2.1 Package segmented

The segmented R package (Muggeo, 2003, 2008b, 2016, 2017) is employed to fit regression models with breakpoints. It provides facilities to estimate and summarize generalized linear models with segmented relationships. The primary advantage of the package is that it does not limit the number of segmented variables or the number of breakpoints for each variable (Muggeo, 2008b). The method is suitable for generalizing any regression model with a linear predictor.

The main idea of segmented is to fit a segmented regression and estimate the breakpoint (Muggeo, 2003). Here we use a single-breakpoint piecewise linear regression model as an example:

$$ y = \beta_0 + \beta_1 Z + \beta_2 (Z - \psi)_+ + \varepsilon, \tag{2.1} $$

where $\psi$ is the breakpoint, $(Z - \psi)_+ = (Z - \psi)\, I(Z > \psi)$ and $\varepsilon \sim N(0, \sigma^2)$. According to the piecewise linear regression model, $|\beta_2| > 0$ if a breakpoint exists. $\beta_1$ is the slope of the left line segment, which corresponds to $Z \le \psi$. $\beta_2$ is the difference between the slopes of the adjacent segments. Therefore, $\beta_1 + \beta_2$ is the slope of the right line segment. Using a first-order Taylor expansion around an initial value $\psi^0$, one can approximate the term $(Z - \psi)_+$ as

$$ (Z - \psi)_+ \approx (Z - \psi^0)_+ + (\psi - \psi^0)(-1)\, I(Z > \psi^0), \tag{2.2} $$

where $(-1)\, I(Z > \psi^0)$ is the first derivative of $(Z - \psi)_+$ with respect to $\psi$, evaluated at $\psi^0$. Then model (2.1) can be written as

$$ y \approx \beta_0 + \beta_1 Z + \beta_2 (Z - \psi^0)_+ + \beta_2 (\psi - \psi^0)(-1)\, I(Z > \psi^0) + \varepsilon. \tag{2.3} $$

Let $\gamma = \beta_2 (\psi - \psi^0)$, so that $\psi = \gamma / \beta_2 + \psi^0$. Estimates of $\gamma$ and $\beta_2$ can be obtained by maximum likelihood (ML) estimation when fitting model (2.3). They lead to

$$ \hat{\psi} = \frac{\hat{\gamma}}{\hat{\beta}_2} + \psi^0. \tag{2.4} $$

Here, $\hat{\psi}$ is the updated estimate of the breakpoint $\psi$. With the updated $\hat{\psi}$ as the new initial value, one can use model (2.3) to obtain the updated $\hat{\psi}$, $\hat{\gamma}$ and $\hat{\beta}_2$. This process is iterated until possible convergence (Box and Tidwell, 1962). Once the iteration converges, ML estimates of all the parameters in the model, including $\psi$, can be obtained. In each iteration, the parameter $\gamma$ measures the gap between the two fitted straight lines at $Z = \psi$. Since this gap should be null at $Z = \psi$, $\hat{\gamma}$ is expected to be approximately zero when the iteration converges. When the iteration stops and $\hat{\gamma} \approx 0$, equation (2.4) yields no further improvement in the breakpoint estimate. Thus, the final updated approximation is assumed to be the ML estimate.
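The iteration in (2.3) and (2.4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by the segmented package: the function name `fit_segmented`, the arguments `psi0`, `max_iter` and `tol`, and the use of ordinary least squares (which coincides with ML under the Gaussian model (2.1)) are all choices made here for illustration.

```python
import numpy as np

def fit_segmented(z, y, psi0, max_iter=50, tol=1e-8):
    """Estimate one breakpoint by iterating the linearized model (2.3).

    Hypothetical helper for illustration; returns (psi_hat, coefficients),
    where coefficients = [beta0, beta1, beta2, gamma] from the last fit.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    psi = float(psi0)
    coef = None
    for _ in range(max_iter):
        u = np.maximum(z - psi, 0.0)      # (Z - psi0)_+
        v = -(z > psi).astype(float)      # (-1) I(Z > psi0)
        X = np.column_stack([np.ones_like(z), z, u, v])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        beta2, gamma = coef[2], coef[3]
        if abs(beta2) < 1e-12:            # no slope change: update (2.4) undefined
            break
        psi_new = psi + gamma / beta2     # update (2.4)
        converged = abs(psi_new - psi) < tol
        psi = psi_new
        if converged:
            break
    return psi, coef
```

On noise-free data with a true breakpoint at 4, for example `z = np.linspace(0, 10, 101)` and `y = 1 + 0.5*z + 1.5*np.maximum(z - 4, 0)`, starting from `psi0=3.0` the estimate moves toward 4 and the fitted gap parameter shrinks toward zero, matching the description above.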

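In order to use the simplest Wald statistics, the variance of $\hat{\psi}$ is obtained by the delta method:

$$ \mathrm{var}(\hat{\psi}) \approx \left(\frac{\partial \hat{\psi}}{\partial \hat{\gamma}}\right)^2 \mathrm{var}(\hat{\gamma}) + \left(\frac{\partial \hat{\psi}}{\partial \hat{\beta}_2}\right)^2 \mathrm{var}(\hat{\beta}_2) + 2 \left(\frac{\partial \hat{\psi}}{\partial \hat{\gamma}}\right) \left(\frac{\partial \hat{\psi}}{\partial \hat{\beta}_2}\right) \mathrm{cov}(\hat{\gamma}, \hat{\beta}_2). \tag{2.5} $$

Combining it with equation (2.4), we get

$$ \mathrm{SE}(\hat{\psi}) = \sqrt{\left[\widehat{\mathrm{var}}(\hat{\gamma}) + \left(\frac{\partial \hat{\gamma}}{\partial \hat{\beta}_2}\right)^2 \widehat{\mathrm{var}}(\hat{\beta}_2) + 2 \left(\frac{\partial \hat{\gamma}}{\partial \hat{\beta}_2}\right) \widehat{\mathrm{cov}}(\hat{\gamma}, \hat{\beta}_2)\right] \Big/ \hat{\beta}_2^2}. \tag{2.6} $$

The confidence interval for the breakpoint can be calculated with the Wald statistic by assuming a normal distribution (McCullagh and Nelder, 1989). A confidence interval with coverage level approximately equal to 95% is $\hat{\psi} \pm 1.96 \times \mathrm{SE}(\hat{\psi})$.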

The segmented package can also test for a change in slope. There are two methods to test whether a breakpoint exists:

a. Davies’ test: given a (generalized) linear model, it tests for a non-constant regression parameter in the linear predictor.

b. Pscore (pseudo score) test: given a (generalized) linear model, it tests for the existence of one breakpoint.

Both tests target a non-zero difference-in-slope parameter of a segmented relationship. The null hypothesis is $H_0: \beta_2 = 0$, where $\beta_2$ is the difference-in-slope coefficient of the segmented term $\beta_2 (Z - \psi)_+$. Under $H_0$ there is no breakpoint. Simulation studies show that if the alternative hypothesis is one breakpoint, the Pscore test is more effective than Davies’ test, whereas if there are two or more breakpoints, Davies’ test performs better than the Pscore test (Muggeo, 2016).

2.2.2 Least-squares

The normality assumption of model (2.1) does not hold for our leaf-count data. Therefore, it may not be appropriate to use the segmented R package for inference. Here we also use a simple and common approach that estimates the model via a grid-search-type algorithm. Given a grid of candidate values of the breakpoint $\psi_k$, $k = 1, \ldots, K$, one fits $K$ linear models and selects the candidate whose model fits best (Muggeo, 2008b). The selection criterion is least squares: the best fit minimizes the sum of squared residuals. A simple dataset consists of $n$ pairs of points $(x_i, y_i)$, where $i = 1, \ldots, n$. The fitted model function has the form $f(x_i, \beta)$, where the vector $\beta$ holds $m$ adjustable parameters. The specific form in our model is $f(x_i, \beta) = \beta_0 + \beta_1 x_i + \beta_2 (x_i - \psi)_+$. The fit of a model to data is measured by its residuals, which are defined as the differences between the observed values of the response variable and the corresponding values provided by the fitted model,

$$ r_i = y_i - f(x_i, \hat{\beta}), \quad i = 1, \ldots, n. \tag{2.7} $$

The goal of the least-squares method is to find the optimal parameter values by minimizing the sum of squared residuals $S(\hat{\beta})$ over all possible values of $\hat{\beta}$, where

$$ S(\hat{\beta}) = \sum_{i=1}^{n} r_i^2. \tag{2.8} $$

For the piecewise linear regression model with $K$ candidate breakpoints, we get $K$ different sums of squared residuals. The model corresponding to the minimum sum of squared residuals is the best-fit breakpoint model.
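The grid search can be sketched as follows. This is a minimal illustration, assuming a single breakpoint and an ordinary least-squares fit at each candidate; the function name `grid_search_breakpoint` and its return layout are hypothetical and are not taken from the original analysis.

```python
import numpy as np

def grid_search_breakpoint(x, y, candidates):
    """Fit beta0 + beta1*x + beta2*(x - psi)_+ for each candidate psi.

    Hypothetical helper; returns (best_psi, best_sse, best_coef) for the
    candidate with the smallest sum of squared residuals, as in (2.8).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = None
    for psi in candidates:
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - psi, 0.0)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ coef) ** 2))
        if best is None or sse < best[1]:
            best = (float(psi), sse, coef)
    return best
```

For the LAR data, a grid such as `np.arange(350.0, 1386.0, 1.0)` would correspond to a step of 1.0 over the observed GDD range; those exact grid endpoints are an assumption made here, not a detail stated in the text.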

The least-squares method requires weaker assumptions than the segmented R package. However, this method can become dramatically time-consuming when the dataset is large or there are multiple breakpoints. For the LAR data, the x variable is the accumulated GDD, which ranges from 349.10 to 1385.40. If the grid step for the breakpoint is 1.0, more than 1000 sums of squared residuals must be calculated for each genotype.

2.2.3 Bootstrap Resampling

Bootstrapping is a statistical procedure that uses random sampling with replacement to ascertain the uncertainty of estimators or statistics. Mathematically, consider a random variable or statistic $T_n(Y, G)$, which is calculated from the data $Y = (Y_1, \ldots, Y_n)'$ with joint cumulative distribution function (CDF) $G(y_1, \ldots, y_n)$. In order to estimate the distribution of $T_n(Y, G)$, we use the bootstrap approach to estimate $T_n(Y^*, \hat{G})$. The statistic $T_n(Y^*, \hat{G})$ can be calculated based on the bootstrap data $Y^* = (Y_1^*, \ldots, Y_n^*)'$, which is generated from the fitted value $\hat{G}$ of the CDF $G$. In many cases, the conditional distribution of $T_n(Y^*, \hat{G})$ under $\hat{G}$ may differ from the distribution of $T_n(Y, G)$ under $G$. Shao and Tu (1995) summarized it as “the spirit of bootstrap is to use the sampling behavior of $(\hat{G}, Y^*, T_n(Y^*, \hat{G}))$ to mimic the behavior of $(F, Y, T_n(Y, G))$”. Bootstrap is an ideal procedure to estimate the distribution of a statistic (e.g., mean, variance, p-value) without using normal theory (e.g., z-statistics, t-statistics). For regression problems, various types of bootstrap methods are available; e.g., in small samples, a parametric bootstrap approach may be preferred. For the LAR data analysis, we use residual resampling and the wild bootstrap.

2.2.3.1 p-values calculated by resampling residuals

For the LAR data, we treat the piecewise linear regression model as the full model and the linear regression model as the reduced model. We use the test statistic

$$ F = \frac{SSE_{reduced} - SSE_{full}}{SSE_{full}} $$

to compare the reduced model and the full model.

Suppose the data contain $n$ pairs of points $(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_n, y_n)$, where $x_i$ is an explanatory variable and $y_i$ is a response variable. First, we fit the full model to get the fitted values $\hat{y}_{i,full}$ and calculate the residuals $\hat{r}_{i,full} = y_i - \hat{y}_{i,full}$, $i = 1, \ldots, n$. Then we get $SSE_{full} = \sum_{i=1}^{n} \hat{r}_{i,full}^2$. Similarly, we obtain $SSE_{reduced} = \sum_{i=1}^{n} \hat{r}_{i,reduced}^2$. We put the values of $SSE_{full}$ and $SSE_{reduced}$ into the formula for the test statistic $F$ to get the original $F_0$. Second, we generate a new variable $y_i^* = \hat{r}_{j,full}$, where $j$ is selected uniformly from the list $(1, \ldots, n)$ for every $i$, refit the full and reduced models using the synthetic data $\{(x_i, y_i^*)\}_{i=1}^{n}$, and obtain $F^*$, a bootstrap replication of the test statistic. After repeating this step 1000 times, we have 1000 $F^*$ values. The proportion of $F^*$ values that are greater than or equal to $F_0$ serves as a p-value. We use the p-value to check for evidence of a breakpoint.
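The residual-resampling steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the full model is fitted by a grid search over candidate breakpoints, the reduced model is a straight line, and all function and argument names (`fit_piecewise`, `f_stat`, `residual_bootstrap_pvalue`, `n_boot`, `seed`) are hypothetical rather than taken from the original analysis.

```python
import numpy as np

def sse_linear(x, y):
    """SSE of the reduced model (straight line)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ coef) ** 2))

def fit_piecewise(x, y, candidates):
    """Grid-search least squares; returns (SSE, residuals) of the best full model."""
    best_sse, best_res = np.inf, None
    for psi in candidates:
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - psi, 0.0)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        res = y - X @ coef
        sse = float(res @ res)
        if sse < best_sse:
            best_sse, best_res = sse, res
    return best_sse, best_res

def f_stat(x, y, candidates):
    """F = (SSE_reduced - SSE_full) / SSE_full, plus the full-model residuals."""
    sse_full, res_full = fit_piecewise(x, y, candidates)
    return (sse_linear(x, y) - sse_full) / sse_full, res_full

def residual_bootstrap_pvalue(x, y, candidates, n_boot=1000, seed=0):
    """Proportion of bootstrap F* values that are >= the observed F0."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    f0, res_full = f_stat(x, y, candidates)
    n = len(y)
    f_star = np.empty(n_boot)
    for b in range(n_boot):
        # y*_i = r_j with j drawn uniformly from 1..n, as described in the text
        y_star = res_full[rng.integers(0, n, size=n)]
        f_star[b], _ = f_stat(x, y_star, candidates)
    return float(np.mean(f_star >= f0))
```

Using resampled residuals alone as the bootstrap response is enough for this statistic because both models contain every straight line, so adding a fitted line back to $y^*$ would change neither SSE.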

2.2.3.2 p-values calculated by wild bootstrap

The wild bootstrap is an alternative bootstrap method that is suitable when model errors are heteroskedastic (Wu, 1986). The main idea is similar to the residual bootstrap, but the response variable is resampled based on the residual values multiplied by a random variable. That is, each bootstrapped response value is of the form

$$ y_i^* = \hat{r}_{j,full}\, \upsilon_i, \tag{2.9} $$

where $\upsilon_i$ is a random variable with mean 0 and variance 1. After we get the synthetic data $(x_i, y_i^*)$, we follow the steps of the residual bootstrap to compute the p-values.
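The only requirement on the multiplier is mean 0 and variance 1. The sketch below draws multipliers from three candidate distributions (standard normal, Rademacher, and Mammen's two-point distribution) and checks the two moments of Mammen's distribution exactly. The names `draw_multipliers`, `mammen_moments`, `MAMMEN_VALUES` and `MAMMEN_PROBS` are hypothetical, introduced here for illustration only.

```python
import numpy as np

SQRT5 = np.sqrt(5.0)
# Mammen's two-point distribution: support points and their probabilities
MAMMEN_VALUES = np.array([-(SQRT5 - 1) / 2, (SQRT5 + 1) / 2])
MAMMEN_PROBS = np.array([(SQRT5 + 1) / (2 * SQRT5), (SQRT5 - 1) / (2 * SQRT5)])

def draw_multipliers(n, kind, rng):
    """Draw n wild-bootstrap multipliers with mean 0 and variance 1."""
    if kind == "normal":
        return rng.standard_normal(n)
    if kind == "rademacher":
        return rng.choice([-1.0, 1.0], size=n)
    if kind == "mammen":
        return rng.choice(MAMMEN_VALUES, size=n, p=MAMMEN_PROBS)
    raise ValueError(f"unknown multiplier distribution: {kind}")

def mammen_moments():
    """Exact mean and variance of Mammen's two-point distribution."""
    mean = float(MAMMEN_VALUES @ MAMMEN_PROBS)
    var = float((MAMMEN_VALUES ** 2) @ MAMMEN_PROBS - mean ** 2)
    return mean, var
```

A call such as `draw_multipliers(len(residuals), "mammen", np.random.default_rng(0))` returns one multiplier per bootstrapped response; multiplying these into the residuals gives the synthetic responses of (2.9).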

Different distributions are used for the random variable $\upsilon_i$. There are three common choices:

1. The standard normal distribution:

$$ \upsilon_i \sim N(0, 1) \tag{2.10} $$

2. Mammen’s distribution (Mammen, 1993):

$$ \upsilon_i = \begin{cases} -\dfrac{\sqrt{5}-1}{2}, & \text{with probability } p = \dfrac{\sqrt{5}+1}{2\sqrt{5}} \\[2ex] \dfrac{\sqrt{5}+1}{2}, & \text{with probability } p = \dfrac{\sqrt{5}-1}{2\sqrt{5}} \end{cases} \tag{2.11} $$

Approximately, it can be written as:

$$ \upsilon_i = \begin{cases} -0.618, & \text{with probability } p = 0.7236 \\ 1.618, & \text{with probability } p = 0.2764 \end{cases} \tag{2.12} $$

3. The Rademacher distribution (Hitczenko and Kwapień, 1994):

$$ \upsilon_i = \begin{cases} +1, & \text{with probability } p = 0.5 \\ -1, & \text{with probability } p = 0.5 \end{cases} \tag{2.13} $$

CHAPTER 3. RESULTS AND DISCUSSION

3.1 Analysis of the first 396 genotypes

Using the pseudo score test in the segmented R package, we calculate p-values for the existence of a breakpoint. Here, we use α = 0.05 as the criterion to determine whether or not a piecewise linear regression model is needed. In Fig. 3.1(a), the result shows that the p-values of 203 genotypes are much smaller than this criterion. Therefore, the linear regression model is inadequate for these genotypes. When we conduct multiple tests, we need to control the rate of type I errors across the tests. We consider the false discovery rate (FDR). An FDR adjusted p-value (or q-value) of 0.05 implies a false discovery rate of 5%. We list the p-values and q-values of each genotype from smallest to largest. If we control FDR at 5%, we declare 225 genotypes as significant. For the genotypes with p-values smaller than α, we further check whether there exists another breakpoint. We again use the pseudo score test to calculate the p-value for the existence of two breakpoints. As shown in Fig. 3.1(b), most p-values are higher than α, which indicates that a one-breakpoint piecewise linear regression model is adequate for the data. Out of 396 genotypes, 15 show significant evidence for a two-breakpoint piecewise linear regression model. Next, we use one of these 15 genotypes as a case study to compare the regression models.
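FDR adjusted p-values can be computed as in the sketch below. The text does not state which adjustment procedure was used, so the Benjamini-Hochberg step-up procedure is an assumption here, and the function name `bh_adjust` is hypothetical.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # scale the k-th smallest p-value by m / k
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    return q
```

Genotypes whose adjusted value is at most 0.05 would then be declared significant at an FDR of 5%, for example via `np.sum(bh_adjust(pvals) <= 0.05)` for a hypothetical array `pvals` holding one p-value per genotype.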

Here we use genotype C103 as an example. Fig. 3.2 shows the results of three models: linear regression, piecewise linear regression with one breakpoint, and piecewise linear regression with two breakpoints. Based on the graph, the linear regression model fits the points well, and the slopes and breakpoints differ between the two piecewise linear regression models. We also check the p-value from the wild bootstrap, which is 0.061. This result shows that it is adequate to fit the points with a linear regression model, which is consistent with Fig. 3.2.

Figure 3.1 (a) Histogram of p-values of genotypes with one breakpoint. (b) Histogram of p-values of genotypes with two breakpoints.

We employ the least-squares method to calculate breakpoints and compare them to the results of the segmented method. Here we assume all the genotypes have only one breakpoint. In the scatter plot in Fig. 3.3(a), most estimated breakpoints calculated by the least-squares and segmented methods are the same. However, their p-values are quite different, as shown in Fig. 3.3(b).

Here we use the bootstrap method to calculate p-values for breakpoints estimated by the least-squares method. For most genotypes, the p-value calculated by the residual bootstrap is larger than the one from segmented. The analysis in Fig. 3.4 shows that residuals of the LAR data exhibit heteroskedasticity. Therefore, we further use the wild bootstrap method to calculate p-values, with the random variable $\upsilon_i$ following Mammen’s distribution, the Rademacher distribution and the standard normal distribution.

Figure 3.2 Plot of genotype C103. The black points are the raw data. The green line is a linear regression model. The blue line is a one breakpoint piecewise linear model, the blue triangle is a breakpoint. The red line is a two breakpoints piecewise linear model, the red circles are the breakpoints.

Figure 3.3 (a) Plot of comparing the estimators calculated by segmented and least-squares. (b) Plot of comparing the p-value calculated by segmented and residual bootstrap. The red line represents y=x.

Fig. 3.5(a) shows the comparison of p-values calculated by the wild bootstrap with Mammen’s distribution and by the segmented method. The p-values calculated by the wild bootstrap are still larger than the segmented results for most genotypes. Comparing results between the residual bootstrap and the wild bootstrap with Mammen’s distribution, one can see in Fig. 3.5(b) that they are quite similar. For some genotypes, p-values calculated by the wild bootstrap are slightly higher than those from the residual bootstrap. In addition to Mammen’s distribution in the wild bootstrap, we also use the Rademacher distribution and the standard normal distribution. As shown in Fig. 3.5(c) and (d), results are quite similar between Mammen’s distribution and the Rademacher distribution. The p-values calculated by the standard normal distribution are slightly lower than those calculated using Mammen’s distribution.

Figure 3.4 (a) Studentized residuals. (b) Histogram of residuals.

The p-values calculated by the residual bootstrap and the wild bootstrap with different distributions are similar, but they are quite different from results calculated by the segmented method. We also calculate the FDR adjusted p-values of the wild bootstrap method with the standard normal distribution, Mammen’s distribution and the Rademacher distribution. If we control FDR to be 5%, no genotypes show significance for the wild bootstrap with Mammen’s and the standard normal distribution. For the Rademacher distribution, we declare 35 genotypes to be significant. Compared to the result of the segmented R package, the wild bootstrap leads to fewer false positives. As mentioned in the Methods section, the segmented method is based on a normality assumption. However, the bootstrap method estimates statistics without using the normal theory, which is more suitable for the LAR data.

We use the segmented and the bootstrap methods to calculate confidence intervals (CIs) and sort them by increasing values of the estimated breakpoints. As shown in Fig. 3.6(a) and (b), the CIs calculated by the segmented R package are much narrower than those from the bootstrap. Based on this, we check the relationship between the widths of the CIs and the p-values in Fig. 3.7. For the segmented method, it is hard to see a trend. For the bootstrap method, except for p-values around zero, the p-value increases as the width of the CI becomes larger.

3.2 Analysis of the last 99 genotypes

Compared with the previous data, there is an extra variable, block, for the last 99 genotypes. Estimated breakpoints and CIs of these data are also calculated by the segmented R package. As can be seen from Fig. 3.8(a) and (b), the same genotypes in different blocks may have different estimated breakpoints. The widths of the CIs are also different. Similarly, the estimated breakpoints and CIs are also checked by the least-squares and the bootstrap methods, which are shown in Fig. 3.9. The CIs calculated by the segmented R package are again narrower than the ones calculated by the bootstrap method.

Figure 3.5 (a) Plot of comparing the p-value calculated by segmented and wild bootstrap (Mammen’s distribution). (b) Plot of comparing the p-value calculated by residual bootstrap and wild bootstrap (Mammen’s distribution). (c) Plot of comparing the p-value calculated by wild bootstrap with the Rademacher distribution and Mammen’s distribution. (d) Plot of comparing the p-value calculated by wild bootstrap with the standard normal distribution and Mammen’s distribution. The red line represents y=x.

Figure 3.6 The estimated breakpoints and confidence intervals. (a) The results calculated by segmented. (b) The results calculated by least-squares and bootstrap. Black points represent estimated breakpoints while red lines represent confidence intervals.

Figure 3.7 Plot of the width of CI against the p-value. (a) segmented method. (b) bootstrap method.

Figure 3.8 The estimated breakpoints and confidence intervals calculated by segmented. (a) genotypes in block1. (b) genotypes in block2. Black points represent estimated breakpoints while red lines represent confidence intervals.

Figure 3.9 The estimated breakpoints and confidence intervals calculated by bootstrap. (a) genotypes in block1. (b) genotypes in block2. Black points represent estimated breakpoints while red lines represent confidence intervals.

Table 3.1 ANOVA table for difference between slopes in two segments for each genotype

Model: $\hat{\beta}_{2ij} = \mu + b_i + G_j + \varepsilon_{ij}$

             d.f.   S.S.        M.S.         F Stat   p-value
block        1      0.0000248   2.4771e-05   0.6532   0.4209
genotype     98     0.0041399   4.2243e-05   1.1140   0.2971
residuals    98     0.0037162   3.7921e-05

Table 3.2 ANOVA table for estimated breakpoints for each genotype

Model: $\hat{\psi}_{ij} = \mu + b_i + G_j + \varepsilon_{ij}$

             d.f.   S.S.      M.S.     F Stat   p-value
block        1      186595    186595   5.7711   0.01818
genotype     98     3065019   31276    0.9673   0.56518
residuals    98     3168619   32333

Since these 99 maize genotypes are randomly distributed in two blocks, we have more residual degrees of freedom to check whether some LAR characteristics are significantly different among these genotypes. The estimated difference between slopes in two segments for each genotype and the estimated single breakpoint are chosen as response variables. ANOVA tables are shown in Tables 3.1 and 3.2. Neither the estimated difference between slopes in two segments nor the estimated breakpoints differ significantly among the genotypes. We use histograms of residuals and Q-Q plots to check the normality assumption. In Fig. 3.10, the Q-Q plots show that the error distributions may have heavier tails than normal. Based on the results in Fig. 3.10, our linear model analyses based on the assumption of independent and identically distributed normal errors may not be appropriate. We also plot residuals versus estimated breakpoints and estimated difference between slopes in two segments for each genotype in Fig. 3.11. Although we found no significant differences among genotypes, our residual analysis suggests that a more complex modeling strategy may be needed to test for differences among genotypes.
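As a quick arithmetic check on Table 3.1, each F statistic is the ratio of the corresponding mean square to the residual mean square. Using the mean squares reported in that table:

$$ F_{block} = \frac{2.4771 \times 10^{-5}}{3.7921 \times 10^{-5}} \approx 0.6532, \qquad F_{genotype} = \frac{4.2243 \times 10^{-5}}{3.7921 \times 10^{-5}} \approx 1.1140. $$

Neither ratio is large, which is in line with the reported p-values of 0.4209 and 0.2971.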

Figure 3.10 Histogram of residuals and normal Q-Q plot. (a), (b) correspond to difference between slopes in two segments for each genotype. (c), (d) correspond to estimated breakpoints for each genotype.

Figure 3.11 (a) Plot of residuals versus estimated breakpoints. (b) Plot of residuals versus estimated difference between slopes in two segments. The red line represents y=0.

CHAPTER 4. SUMMARY

In summary, for the first 396 genotypes of maize, the estimated breakpoints vary among the genotypes. Using the segmented R package, around 200 genotypes show one significant breakpoint, and only 15 genotypes show two significant breakpoints. Using the least-squares and the bootstrap methods, we obtain the estimated breakpoints together with their p-values and CIs. The estimated breakpoints are almost the same as those obtained by the segmented R package, while the CIs are much wider. Based on the p-values calculated by the bootstrap method, a linear regression model is adequate to fit the data for most genotypes. For the last 99 genotypes, we choose the estimated difference between slopes in two segments for each genotype and the estimated single breakpoint as the representative LAR characteristics and test for significant differences among the genotypes. Our work provides a suitable scheme to analyze whether LAR changes significantly at a certain developmental stage for each of many genotypes.

The error structure that we use is an independent structure with non-constant variance. The AIC of this structure is 16690.33. We also calculate the AIC of the general correlation structure and the AR(1) structure, which are 15216.69 and 16022.23, respectively. Because both dependent-error structures yield a lower AIC than the independent structure, we should explore the use of models with dependent errors in future work.
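The comparison can be made explicit by computing each structure's AIC difference from the smallest AIC. The short Python sketch below uses only the three AIC values reported above; the dictionary labels and variable names are chosen here for illustration.

```python
# AIC values reported in the text for the three error structures.
aic = {
    "independent, non-constant variance": 16690.33,
    "general correlation": 15216.69,
    "AR(1)": 16022.23,
}

# The structure with the smallest AIC is preferred.
best = min(aic, key=aic.get)

# Difference of each AIC from the smallest one.
delta = {name: round(value - aic[best], 2) for name, value in aic.items()}

for name, d in sorted(delta.items(), key=lambda item: item[1]):
    print(f"{name:36s} AIC = {aic[name]:9.2f}   delta = {d:8.2f}")
```

The general correlation structure has the smallest AIC, with AR(1) 805.54 units and the independent structure 1473.64 units above it.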

CHAPTER 5. ACKNOWLEDGEMENT

The author gratefully acknowledges Qiang Liu for providing the LAR dataset.

REFERENCES

Alberto, C. M., Streck, N. A., Walter, L. C., Rosa, H. T., de Menezes, N. L., and Heldwein, A. B. (2009). Modelagem do desenvolvimento de trigo considerando diferentes temperaturas cardinais e m´etodos de c´alculoda fun¸c˜aode resposta `atemperatura. Pesquisa Agropecu´ariaBrasileira, 44(6):545–553.

Bacon, D. W. and Watts, D. G. (1971). Estimating the Transition between Two Intersecting Straight Lines. Biometrika, 58(3):525.

Betts, M. G., Forbes, G. J., and Diamond, A. W. (2007). Thresholds in Songbird Occurrence in Relation to Landscape Structure. Conservation Biology, 21(4):1046–1058.

Birch, C., Rickert, K., and Hammer, G. (1998). Modelling leaf production and crop development in maize (Zea mays L.) after tassel initiation under diverse conditions of temperature and photoperiod. Field Crops Research, 58(2):81–95.

Box, G. E. and Tidwell, P. W. (1962). Transformation of the Independent Variables. Technometrics, 4(4):531–550.

Ertel, J. E. and Fowlkes, E. B. (1976). Some Algorithms for Linear Spline and Piecewise Multiple Linear Regression. Journal of the American Statistical Association, 71(355):640–648.

G¨ossl,C. and K¨uchenhoff, H. (2001). Bayesian analysis of with an unknown change point and covariate measurement error. Statistics in Medicine, 20(20):3109–3121.

Griffiths, D. and Miller, A. (1973). Hyperbolic regression - a model based on two-phase piecewise linear regression with a smooth transition between regimes. Communications in Statistics, 2(6):561–569.

Hawkins, D. M. (1976). Point Estimation of the Parameters of Piecewise Regression Models. Applied Statistics, 25(1):51.

Hesketh, J. D. and Warrington, I. J. (1989). Corn Growth Response to Temperature: Rate and Duration of Lead Emergence. Agronomy Journal, 81(4):696–701.

Hitczenko, P. and Kwapie´n,S. (1994). On the Rademacher Series. In Hoffmann-Jørgensen, J., Kuelbs, J., and Marcus, M. B., editors, Probability in Banach Spaces, 9, pages 31–36. Birkh¨auserBoston, Boston, MA.

Hodges, T. (1991). Predicting crop phenology. In Boston: CRC, page 233P. 27

Kirby, E. J. M. (1995). Factors Affecting Rate of Leaf Emergence in Barley and Wheat. Crop Science, 35(1):11–19.

Klepper, B., Rickman, R. W., and Peterson, C. M. (1982). Quantitative Characterization of Vegetative Development in Small Cereal Grains 1. Agronomy Journal, 74(5):789–792.

Kunst, A. E., Looman, C. W. N., and Mackenbach, J. P. (1993). Outdoor Air Temperature and Mortality in the Netherlands: A Time-Series Analysis. American Journal of Epidemiology, 137(3):331–341.

L´opez-Cedr´on,F. X., Boote, K. J., Ru´ız-Nogueira,B., and Sau, F. (2005). Testing CERES-Maize versions to estimate maize production in a cool environment. European Journal of Agronomy, 23(1):89–102.

Mammen, E. (1993). Bootstrap and Wild Bootstrap for High Dimensional Linear Models. The Annals of Statistics, 21(1):255–285.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. 2nd Edition. Chapman and Hall, London.

Molinari, N., Daurs, J.-P., and Durand, J.-F. (2001). Regression splines for threshold selection in survival data analysis. Statistics in Medicine, 20(2):237–247.

Muggeo, V. M. (2017). Interval estimation for the breakpoint in segmented regression: a smoothed score-based approach. Australian and New Zealand Journal of Statistics, 59(3):311–322.

Muggeo, V. M., Attanasio, M., and Porcu, M. (2009). A segmented regression model for event history data: An application to the fertility patterns in Italy. Journal of Applied Statistics, 36(9):973–988.

Muggeo, V. M. R. (2003). Estimating regression models with unknown break-points. Statistics in Medicine, 22(19):3055–3071.

Muggeo, V. M. R. (2008a). Modeling temperature effects on mortality: multiple segmented relationships with common break points. Biostatistics, 9(4):613–620.

Muggeo, V. M. R. (2008b). Segmented: An R Package to Fit Regression Models With Broken-Line Relationships. R News, 8:20–25.

Muggeo, V. M. R. (2016). Testing with a nuisance parameter present only under the alternative: a score-based approach with application to segmented modelling. Journal of Statistical Computation and Simulation, 86(15):3059–3067. 28

Rhodes, J., Callaghan, J., McAlpine, C., de Jong, C., Bowen, M., Mitchell, D., Lunney, D., and Possingham, H. (2008). Regional variation in habitat-occupancy thresholds: a warning for conservation planning. Journal of Applied Statistics, 45(2):549–557.

Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer Series in Statistics. Springer New York.

Shao, Q. and Campbell, N. A. (2002). Applications: Modelling trends in groundwater levels by segmented regression with constraints. Australian & New Zealand Journal of Statistics, 44(2):129–141.

Stasinopoulos, D. and Rigby, R. (1992). Detecting break points in generalised linear models. Computational Statistics & Data Analysis, 13(4):461–471.

Streck, N. A. (2003). Incorporating a Chronology Response into the Prediction of Leaf Appearance Rate in Winter Wheat. Annals of Botany, 92(2):181–190.

Tishler, A. and Zang, I. (1981a). A Maximum Likelihood Method for Piecewise Regression Models with a Continuous Dependent Variable. Applied Statistics, 30(2):116.

Tishler, A. and Zang, I. (1981b). A New Maximum Likelihood Algorithm for Piecewise Regression. Journal of the American Statistical Association, 76(376):980.

Toms, J. D. and Lesperance, M. L. (2003). Piecewise Regression: a tool for identifying ecological threshold. Ecology, 84(8):2034–2041.

Ulm, K. (1991). A statistical method for assessing a threshold in epidemiological studies. Statistics in Medicine, 10(3):341–349.

Vermont, J., Bosson, J., Fran¸cois,P., Robert, C., Rueff, A., and Demongeot, J. (1991). Strategies for graphical threshold determination. Computer Methods and Programs in Biomedicine, 35(2):141–150.

White, J. W. (2001). Modeling temperature response in wheat and maize. El Bat´an,Mexico: CIMMYT - International Maize and Wheat Improvement Center.

Whitehead, N. (2002). Exploration of Threshold Analysis in the Relation between Stressful Life Events and Preterm Delivery. American Journal of Epidemiology, 155(2):117–124.

Wu, C. F. J. (1986). Jackknife, Bootstrap and Other Resampling Methods in . The Annals of Statistics, 14(4):1261–1295.

Xue, Q., Weiss, A., and Baenziger, P. (2004). Predicting leaf appearance in field-grown winter wheat: evaluating linear and non-linear models. Ecological Modelling, 175(3):261–270.