EMPIRICAL LIKELIHOOD METHOD FOR SEGMENTED LINEAR REGRESSION

by Zhihua Liu

A Dissertation Submitted to the Faculty of The Charles E. Schmidt College of Science in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Florida Atlantic University
Boca Raton, FL
December 2011

Copyright by Zhihua Liu 2011


ACKNOWLEDGEMENTS

First of all, I have to confess that the thought of writing this dissertation was intimidating to me. As much as I wish to express my appreciation to all the people who have been there for me to make this dissertation possible, I know I can never include everyone, nor adequately express my immense gratitude to them in simple words. I would like to express my deepest gratitude to all my committee members, Dr. Lianfen Qian, Dr. Hongwei Long, Dr. Heinrich Niederhausen and Dr. Dragan Radulovic. I appreciate their time, interest, and valuable comments concerning my thesis. For the last few years, I had the opportunity and privilege to work with Dr. Lianfen Qian. She has given me ideas and suggestions that enlightened my understanding of this research and gave me a better perspective on my own work. I would also like to use this opportunity to thank many faculty members and my colleagues for all their inspiration and encouragement. I am also fortunate to be surrounded by sweet people from PenServ and ERISA Pension Systems. Without their support, I would never have completed this dissertation. In the course of writing this dissertation, I was running between a full-time job and actuarial exams. Most of the time, I wished I could have 40 hours a day. My family has always given me their unconditional love and constant support, which has helped me sail through the difficulties. I dedicate this dissertation to my family. Thanks for all your love and patience.

ABSTRACT

Author: Zhihua Liu
Title: Empirical Likelihood Method for Segmented Linear Regression
Institution: Florida Atlantic University
Dissertation Advisor: Dr. Lianfen Qian
Degree: Doctor of Philosophy
Year: 2011

For a segmented regression system with an unknown change-point over two domains of a predictor, a new empirical likelihood ratio test statistic is proposed to test the null hypothesis of no change. The proposed method is nonparametric and relaxes the assumption on the error distribution. Under the null hypothesis of no change, the proposed test statistic is shown empirically to be Gumbel distributed, with location and scale parameters that are robust across various parameter settings and error distributions. Under the alternative hypothesis with a change-point, comparisons with two other methods (Chen's SIC method and Muggeo's SEG method) show that the proposed method performs better when the slope change is small. A power analysis is conducted to illustrate the performance of the test. The proposed method is also applied to two real datasets: the plasma osmolality dataset and the gasoline price dataset.

EMPIRICAL LIKELIHOOD METHOD FOR SEGMENTED LINEAR REGRESSION

TABLES

FIGURES

1 Introduction
  1.1 Motivation and some examples
  1.2 Parametric Method
  1.3 Nonparametric Method
  1.4 Empirical Likelihood

2 Change-point Estimation Via Empirical Likelihood
  2.1 Assuming a Known change-point
  2.2 Assuming an Unknown change-point
  2.3 Main Results
  2.4 Algorithm

3 Asymptotic Property of $Z_n$
  3.1 Simulation I
  3.2 Simulation II
  3.3 Simulation III
  3.4 Simulation IV

4 Application

5 Conclusion

Bibliography

TABLES

3.1 Robustness analysis for the estimated location and scale parameters, $\mu$ and $\sigma$ respectively, of the limiting distribution of $Z_n$ under three types of error distributions.

3.2 The percentiles of $Z_n$ with $\alpha = 0.10$, $0.05$, and $0.01$.

3.3 Robustness analysis for the estimated location and scale parameters, $\mu$ and $\sigma$ respectively, of the limiting distribution of $Z_n$ with respect to the settings of $\gamma$ under the null hypothesis when $n = 100$.

3.4 The frequency distribution of $d = |\hat{k}^* - k^*|$ and the relative frequency $RF = \frac{\#\{d \le D\}}{10}\%$ for the ELR, SIC and SEG methods, when the sample size is $n = 50$ and the true time of change is $k^* = 25$.

3.5 The size and the power of $Z_n$ under two different error distributions: (i) normal distribution $N(0, 0.1^2)$ and (ii) centered log-normal distribution $\log N(0, 0.1^2)$.

FIGURES

3.1 The histograms and Q-Q plots of $Z_n$ under $H_0$ with normal errors (i) $N(0, 0.1^2)$ for four different sample size settings. The solid line represents the estimated Gumbel density and the dashed line represents the estimated kernel density.

3.2 The histograms and Q-Q plots of $Z_n$ under $H_0$ with centered log-normal errors (ii) centered $\log N(0, 0.1^2)$ for four different sample size settings. The solid line represents the estimated Gumbel density and the dashed line represents the estimated kernel density.

3.3 The histograms and Q-Q plots of $Z_n$ under $H_0$ with non-homogeneous errors (iii) $\{e_i\}_{i=1}^{[n/2]} \sim N(0, 0.1^2)$ and $\{e_i\}_{i=[n/2]+1}^{n} \sim N(0, 1.0^2)$ for four different sample size settings. The solid line represents the estimated Gumbel density and the dashed line represents the estimated kernel density.

4.1 (a) The scatter plot of AVP versus plasma osmolality with fitted segmented linear regression. (b) The plot of $-2\log$ of the empirical likelihood ratio versus all possible times of change $k$.

4.2 (a) The scatter plot of gasoline price versus year with fitted segmented linear regression. (b) The plot of $-2\log$ of the empirical likelihood ratio versus all possible times of change $k$.

CHAPTER 1 INTRODUCTION

1.1 MOTIVATION AND SOME EXAMPLES

In the classical regression setting, a regression model is usually assumed to have a single parametric form on the whole domain of the predictors. A piecewise regression model, by contrast, allows the parameters of the model to differ across domains of the predictors. In the last thirty years, a considerable body of techniques has been developed for hypothesis testing, parameter estimation and related computing programs for detecting the change-point in piecewise regression models. One special and commonly used piecewise regression model is the two-phase linear regression model, whose regression function is piecewise linear. One can define this more precisely as follows. Let $Y$ be the response variable, and $X$ be a univariate predictor such that $E\langle |Y| \mid X \rangle < \infty$. Suppose that $\{(X_i, Y_i)\}_{i=1}^{n}$ is a sequence of independent observations of $(X, Y)$ satisfying the following model:

\[ Y_i = (\alpha_0 + \alpha_1 X_i) I(X_i \le \tau) + (\beta_0 + \beta_1 X_i) I(X_i > \tau) + e_i, \tag{1.1.1} \]

where $\alpha_0, \alpha_1, \beta_0, \beta_1, \tau$ are unknown parameters, and $\{e_i\}_{i=1}^{n}$ are independent random errors with mean zero. Without loss of generality, we assume $X_1 \le X_2 \le \dots \le X_n$ throughout the rest of the dissertation. If there is an unknown time $k^*$ such that $X_{k^*} \le \tau < X_{k^*+1}$, then we shall call $k^*$ the time of change and $\tau$ the change-point.

Widespread applications of two-phase linear regression models have appeared in

diverse research areas. For example, in environmental sciences, Piegorsch and Bailer (Section 2.2 of [24]) illustrate the usefulness of two-phase linear regression models with a series of examples. In [27], Qian and Ryu fit a two-phase model with termite survival as $Y$ and tropical tree resin dosage as $X$, and found that tropical tree resin at a concentration of 10 mg was significantly effective in killing termites. In biological sciences, Vieth [35] applies model (1.1.1) to estimate the osmotic threshold by fitting arginine vasopressin concentration against plasma osmolality in the plasma of conscious dogs. In medical sciences, Smith and Cook [32] use a piecewise linear regression model to fit some renal transplant data. Lund and Reeves utilize the model to detect undocumented change-points for time series in [18]. Other applications can be found in epidemiology (Ulm [34], Pastor and Guallar [23]), software engineering (Qian and Yao [26]), econometrics (Chow [6], Koul and Qian [17], Fiteni [12], Zeileis [36]) and other fields.

Hawkins [14] classifies two-phase linear regression models into two types: continuous and discontinuous. By continuous, he means that the regression function is continuous at the change-point $\tau$; that is, the change-point $\tau$ satisfies the following equation:

α0 + α1τ = β0 + β1τ. (1.1.2)

If equation (1.1.2) is not satisfied, the model is discontinuous. Continuous models are also known as segmented linear regression models (Feder [10, 11]).

Before applying a two-phase linear regression model to a dataset, it is common practice to test for the existence of a change-point $\tau$ by hypothesis testing. The null hypothesis of only a single phase, or no change, is obtained by assuming $\alpha_0 = \beta_0$ and $\alpha_1 = \beta_1$. In the usual scenario, when the change-point is known, the asymptotic chi-squared theory can be used for the likelihood ratio test of one phase against two phases. When the change-point is unknown, the difficulty arises because we are testing a hypothesis in the presence of a nuisance parameter, $\tau$, which enters the model only under the alternative hypothesis and is meaningless and cannot be estimated under the null hypothesis. If $\tau$ is known and $S(\tau)$ is the appropriate test statistic, with large values corresponding to the alternative being true, then the test statistic we suggest for the case when $\tau$ is unknown is

\[ \Lambda = \max\{ S(\tau) : \mathcal{L} < \tau < \mathcal{U} \}, \tag{1.1.3} \]

where $\mathcal{L}$ and $\mathcal{U}$ are the boundary truncations.

There are two existing types of likelihood-based approaches: the classical parametric likelihood approach (Quandt [28, 29]) and the Schwarz Information Criterion (SIC) method (Chen [4]). In [18], Lund and Reeves conjecture that the limiting distribution of $\Lambda$ involves the Gumbel extreme value distribution in the case of the classical parametric likelihood approach.

In this dissertation, we address the asymptotic distribution problem using the empirical likelihood method. Empirical likelihood (EL) as a nonparametric method was first proposed by Owen [21]. EL employs the likelihood function without specifically assuming the distribution of the data. It incorporates side information, through constraints or a prior distribution, which maximizes the efficiency of the method (Owen [22]). We propose the empirical likelihood method for the continuous two-phase linear regression (segmented linear regression) model. Through simulation studies, we demonstrate that the limiting null distribution is related to the Gumbel extreme value distribution when the continuity constraint is imposed.

First, we propose an EL-based test statistic for testing the null hypothesis of no change. Our simulation studies provide empirical support for Lund and Reeves's conjecture [18] in the setting of the empirical likelihood based method, although the original conjecture is for the classical likelihood method. Then, if the null hypothesis is rejected, we construct an EL-based estimator of the change-point for model (1.1.1) with the continuity constraint (1.1.2), i.e. the segmented linear regression model.
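As a minimal illustration of model (1.1.1) and the continuity constraint (1.1.2), the following Python sketch evaluates the two-phase mean function, checks continuity at $\tau$, and locates the time of change for a sorted predictor. The function names and the numerical tolerance are assumptions made for this illustration only.

```python
def two_phase_mean(x, alpha0, alpha1, beta0, beta1, tau):
    """Mean function of model (1.1.1): one line up to tau, another after it."""
    if x <= tau:
        return alpha0 + alpha1 * x
    return beta0 + beta1 * x


def is_continuous(alpha0, alpha1, beta0, beta1, tau, tol=1e-12):
    """Check the continuity constraint (1.1.2) at tau (tol is an assumed tolerance)."""
    return abs((alpha0 + alpha1 * tau) - (beta0 + beta1 * tau)) <= tol


def time_of_change(xs, tau):
    """Return k* (1-based) with X_{k*} <= tau < X_{k*+1} for sorted xs, else None."""
    k = sum(1 for x in xs if x <= tau)
    return k if 0 < k < len(xs) else None
```

For example, with $\alpha = (1, 2)$, $\beta = (3, 1)$ and $\tau = 2$ both lines take the value 5 at $\tau$, so the model is segmented in the sense of (1.1.2).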

1.2 PARAMETRIC METHOD

The classical parametric likelihood approach was first proposed by Quandt [28, 29] to detect the presence of a change-point in a model. Quandt assumes that the error terms $\{e_i\}_{i=1}^{n}$ are normally and independently distributed with mean zero and standard deviations $\sigma_1$ if $i \le k^*$ and $\sigma_2$ if $i > k^*$, where $k^*$ is the true time of change. The null hypothesis of no change is expressed as $k^* \ge n - 1$ or $k^* \le 2$. The likelihood ratio test statistic is

\[ \Lambda = \max_{3 \le k \le n-3} \{S(k)\}, \quad \text{with} \quad S(k) = -2 \log\left( \frac{\hat\sigma_1^{k}(k)\, \hat\sigma_2^{n-k}(k)}{\hat\sigma^{n}} \right), \]

where $\hat\sigma$ is the estimator of the standard deviation of the errors for a simple linear regression based on all observations, while $\hat\sigma_1(k)$ and $\hat\sigma_2(k)$ are the estimators of $\sigma_1$ and $\sigma_2$ for a fixed $k$. Large values of $\Lambda$ suggest the existence of a change-point.

In [28], Quandt conjectures that the asymptotic distribution of $\Lambda$ is $\chi^2_4$ under the null hypothesis of no change ($H_0$) for model (1.1.1). In [10, 11], Feder points out that if the change-point is identified under the null hypothesis, then the asymptotic theory of Wilks and Chernoff is applicable. But in the case of an unidentified change-point, the parameter estimates are not asymptotically normal and the distribution of $\Lambda$ is not asymptotically $\chi^2$. The null distribution of $\Lambda$ is shown to be that of the maximum of a sequence of correlated chi-squared random variables and to vary with the configuration of the observation points of the independent variable. Furthermore, Feder indicates that there is evidence for the existence of a limiting distribution of the likelihood ratio.

In the classical parametric likelihood ratio approach, when there is no continuity

constraint on the regression function at the change-point, the uncropped supremum diverges to infinity unless an extreme value scaling is applied. Hence, one has the option of either scaling the maximum-type change-point statistic into a limiting Gumbel statistic (Csörgő and Horváth [8]) or cropping the endpoints to use Brownian bridges. The former approach gives an asymptotic distribution of $\Lambda$ related to the Gumbel extreme value distribution, while with the latter approach, assuming a known change-point, one can obtain

\[ \max_{l \le k/n \le u} \{S(k)\} \to \sup_{l \le t \le u} \left\{ \frac{B_1^2(t) + B_2^2(t)}{t(1-t)} \right\} \tag{1.2.1} \]

as $n \to \infty$. Here, $0 < l < u < 1$ represent the boundary truncations, while $\{B_1(t), 0 \le t \le 1\}$ and $\{B_2(t), 0 \le t \le 1\}$ are independent standard Brownian bridges.

As to the asymptotic behavior of the test with the continuity constraint at the change-point, Hinkley [16] was one of the first authors to investigate a maximum likelihood estimator of the change-point (the point of intersection $\tau$) under normally distributed errors. Through empirical studies, in [15, 16], Hinkley claims that the

asymptotic distribution of $\Lambda$ is $\chi^2_3$ under $H_0$, assuming $\sigma_1 = \sigma_2$, for the segmented linear regression model (1.1.1) with the constraint (1.1.2). A generalization of his model was considered by Feder [10], who showed that the estimates are consistent under suitable identifiability assumptions and that the asymptotic distribution of $\Lambda$ might involve the Gumbel extreme value distribution.

In [7], Davies applies the parametric likelihood method to the segmented linear regression model. For illustration and simplicity, Davies presents the simple segmented linear regression model as follows:

\[ Y_i = \alpha_0 + \alpha_1 X_i + \xi (X_i - \tau)^{+} + e_i, \tag{1.2.2} \]

where

\[ (X_i - \tau)^{+} = \begin{cases} 0 & \text{if } X_i \le \tau; \\ X_i - \tau & \text{if } X_i > \tau. \end{cases} \]

It can be easily seen that model (1.2.2) is equivalent to model (1.1.1) with the continuity constraint (1.1.2), where $\beta_0 = \alpha_0 - \xi\tau$ and $\beta_1 = \alpha_1 + \xi$.

With respect to model (1.2.2), it is desired to test the null hypothesis of no

change as $H_0 : \xi = 0$. In this case, the test statistic is defined as

\[ \Lambda = \max_{\mathcal{L} < \tau < \mathcal{U}} \{S(\tau)\}; \tag{1.2.3} \]

and

\[ S(\tau) = \frac{\sum_{i=1}^{n} (Y_i - \hat\alpha_0 - \hat\alpha_1 X_i)(X_i - \tau) I(X_i > \tau)}{V^{1/2}}; \tag{1.2.4} \]

where

\begin{align*}
\bar X &= \frac{1}{n} \sum_{i=1}^{n} X_i, \\
\bar Y &= \frac{1}{n} \sum_{i=1}^{n} Y_i, \\
\hat\alpha_0 &= \bar Y - \hat\alpha_1 \bar X, \\
\hat\alpha_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{S^{XX}}, \\
V &= \frac{S_1 S_2}{S^{XX}} + \frac{S_3 S_4}{n}, \\
S^{XX} &= \sum_{i=1}^{n} (X_i - \bar X)^2, \\
S_1 &= \sum_{i=1}^{n} (X_i - \bar X)(X_i - \tau) I(X_i > \tau), \\
S_2 &= \sum_{i=1}^{n} (X_i - \bar X)(X_i - \tau) I(X_i \le \tau), \\
S_3 &= \sum_{i=1}^{n} (X_i - \tau) I(X_i > \tau), \\
S_4 &= \sum_{i=1}^{n} (X_i - \tau) I(X_i \le \tau).
\end{align*}

In the paper, Davies lists critical values for different sample sizes. The simulation results show that the proposed test statistic $\Lambda$ is moderately robust against non-normality.

In this dissertation, for convenience, we present the discontinuous model as model (1.1.1), and the continuous model as model (1.2.2).
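The stated equivalence between model (1.2.2) and model (1.1.1) under the continuity constraint can be checked numerically. The sketch below compares the mean functions only (the error term is omitted); the helper names are hypothetical and chosen for this illustration.

```python
def hinge_mean(x, alpha0, alpha1, xi, tau):
    """Mean function of model (1.2.2): alpha0 + alpha1*x + xi*(x - tau)^+."""
    return alpha0 + alpha1 * x + xi * max(x - tau, 0.0)


def to_two_phase(alpha0, alpha1, xi, tau):
    """Second-phase coefficients implied by (1.2.2): beta0 = alpha0 - xi*tau, beta1 = alpha1 + xi."""
    return alpha0 - xi * tau, alpha1 + xi
```

For any $x > \tau$, `hinge_mean` should agree with $\beta_0 + \beta_1 x$ computed from `to_two_phase`, and at $x = \tau$ the two phases meet, which is exactly constraint (1.1.2).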

1.3 NONPARAMETRIC METHOD

In 1998, Chen [4] proposed the Schwarz Information Criterion (SIC) to locate the change-point in the simple linear regression model. According to the minimum information criterion principle, the model that minimizes the SIC is considered the most appropriate model.

Suppose $\alpha = (\alpha_0, \alpha_1)^T$ and $\beta = (\beta_0, \beta_1)^T$; then the hypothesis test in Chen's SIC method is expressed as

\[ H_0 : \alpha = \beta \quad \text{vs} \quad H_1 : \alpha \ne \beta. \]

Under $H_0$, denote $\gamma = (\gamma_0, \gamma_1)^T = \alpha = \beta$. Then the maximum likelihood estimates of $\gamma$ and $\sigma^2$ are obtained as

\begin{align*}
\hat\gamma_1 &= \frac{S^{XY}}{S^{XX}}, \\
\hat\gamma_0 &= \bar Y - \hat\gamma_1 \bar X, \\
\hat\sigma^2 &= \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat\gamma_0 - \hat\gamma_1 X_i)^2,
\end{align*}

where

\begin{align*}
S^{XX} &= \sum_{i=1}^{n} (X_i - \bar X)^2, \\
S^{XY} &= \sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y).
\end{align*}

Therefore, the SIC under $H_0$ is

\begin{align*}
SIC(n) &= -2 \log L_0(\hat\gamma, \hat\sigma^2) + 3 \log n \\
&= n \log\left[ \sum_{i=1}^{n} (Y_i - \hat\gamma_0 - \hat\gamma_1 X_i)^2 \right] + n(1 + \log 2\pi) + (3 - n) \log n,
\end{align*}

where $L_0(\hat\gamma, \hat\sigma^2)$ is the maximized likelihood function under $H_0$.

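Under $H_1$, the maximum likelihood estimates of $\alpha_0$, $\alpha_1$, $\beta_0$, $\beta_1$, and $\sigma^2$ are found to be

\begin{align*}
\hat\alpha_1 &= \frac{S_k^{XY}}{S_k^{XX}}, \\
\hat\alpha_0 &= \bar Y_k - \hat\alpha_1 \bar X_k, \\
\hat\beta_1 &= \frac{S_{n-k}^{XY}}{S_{n-k}^{XX}}, \\
\hat\beta_0 &= \bar Y_{n-k} - \hat\beta_1 \bar X_{n-k}, \\
\hat\sigma^2 &= \frac{1}{n} \left[ \sum_{i=1}^{k} (Y_i - \hat\alpha_0 - \hat\alpha_1 X_i)^2 + \sum_{i=k+1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2 \right],
\end{align*}

where

\begin{align*}
\bar X_k &= \frac{1}{k} \sum_{i=1}^{k} X_i, &
\bar Y_k &= \frac{1}{k} \sum_{i=1}^{k} Y_i, \\
\bar X_{n-k} &= \frac{1}{n-k} \sum_{i=k+1}^{n} X_i, &
\bar Y_{n-k} &= \frac{1}{n-k} \sum_{i=k+1}^{n} Y_i, \\
S_k^{XX} &= \sum_{i=1}^{k} (X_i - \bar X_k)^2, &
S_k^{XY} &= \sum_{i=1}^{k} (X_i - \bar X_k)(Y_i - \bar Y_k), \\
S_{n-k}^{XX} &= \sum_{i=k+1}^{n} (X_i - \bar X_{n-k})^2, &
S_{n-k}^{XY} &= \sum_{i=k+1}^{n} (X_i - \bar X_{n-k})(Y_i - \bar Y_{n-k}).
\end{align*}

Hence, the SIC under $H_1$, for $k = 2, \dots, n-2$, is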

\begin{align*}
SIC(k) &= -2 \log L_1(\hat\alpha, \hat\beta, \hat\sigma^2) + 5 \log n \\
&= n \log\left[ \sum_{i=1}^{k} (Y_i - \hat\alpha_0 - \hat\alpha_1 X_i)^2 + \sum_{i=k+1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2 \right] \\
&\quad + n(1 + \log 2\pi) + (5 - n) \log n,
\end{align*}

where $L_1(\hat\alpha, \hat\beta, \hat\sigma^2)$ is the maximized likelihood function under $H_1$. Therefore, the decision rule for selecting one of the $n - 3$ regression models is: select the model with no change if $SIC(n) \le SIC(k)$ for all $k$; select a model with a change-point at $\hat k^*$ if

\[ SIC(\hat k^*) = \min\{ SIC(k) : 2 \le k \le n-2 \} < SIC(n). \tag{1.3.1} \]

According to the principle of information criterion in , $H_0$ will be accepted if

\[ SIC(n) \le SIC(\hat k^*), \]

and $H_1$ will be accepted if

\[ SIC(\hat k^*) < SIC(n). \]

When $H_1$ is accepted, the estimated time of change is $\hat k^*$ in equation (1.3.1). Let $\Lambda_n = SIC(\hat k^*) - SIC(n)$ be the test statistic; then the asymptotic distribution of $\Lambda_n$ under the null hypothesis is a Gumbel extreme value distribution.

When the inference of a change-point is viewed as a model selection problem, the SIC provides a very effective way of locating the change-point. Note that the SIC still requires the error terms $\{e_i\}_{i=1}^{n}$ to be normally distributed with identical standard error $\sigma$. Therefore, we looked for a nonparametric method that relaxes the normality requirement on the error terms, which led us to empirical likelihood.
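The SIC decision rule described above can be sketched in a few lines of Python. This is an illustrative implementation under the formulas for $SIC(n)$ and $SIC(k)$ given in this section, not Chen's original code; the function names are assumptions, and the sketch assumes the predictor values within each segment are not all identical and that the fit is not exact (so the logarithms are finite).

```python
import math


def rss(xs, ys):
    """Residual sum of squares of a simple least-squares line fitted to (xs, ys)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))


def sic_no_change(xs, ys):
    """SIC(n): a single regression line with 3 parameters (two coefficients, sigma^2)."""
    n = len(xs)
    return n * math.log(rss(xs, ys)) + n * (1 + math.log(2 * math.pi)) + (3 - n) * math.log(n)


def sic_change(xs, ys, k):
    """SIC(k): separate lines on the first k and last n-k points (5 parameters)."""
    n = len(xs)
    total = rss(xs[:k], ys[:k]) + rss(xs[k:], ys[k:])
    return n * math.log(total) + n * (1 + math.log(2 * math.pi)) + (5 - n) * math.log(n)


def select_change(xs, ys):
    """Return the estimated time of change per rule (1.3.1), or None for no change."""
    n = len(xs)
    best_k = min(range(2, n - 1), key=lambda k: sic_change(xs, ys, k))
    if sic_change(xs, ys, best_k) < sic_no_change(xs, ys):
        return best_k
    return None
```

Because the data are assumed sorted by the predictor, slicing at `k` corresponds to splitting at the $k$-th smallest $X_i$. Ties in $SIC(k)$ are resolved in favor of the smallest $k$ by `min`, which is a design choice of this sketch.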

1.4 EMPIRICAL LIKELIHOOD

The empirical likelihood (EL) method proposed by Owen [21] is a nonparametric version of the classical maximum likelihood procedure. Compared to the parametric method, the EL method is data-driven and lets the data determine the shape of the confidence region. In comparison with the bootstrap method, it profiles a multinomial with one parameter per data point instead of resampling. Bartlett correction and location adjustment can be used to improve the accuracy of inferences.

To illustrate the EL method, we start with the simple case. Suppose $X_1, \dots, X_n$ is a random sample drawn from an unknown distribution $F$ having mean $\mu_0 \in R^s$ and finite variance-covariance matrix $V$. The atom of probability on $X = X_i$ is

\[ p_i = dF(X_i) = P(X = X_i), \]

where $p = (p_1, \dots, p_n)$ is a vector with each $p_i \ge 0$ and $\sum_{i=1}^{n} p_i = 1$. The empirical likelihood function (ELF) is

\[ L(F) = \prod_{i=1}^{n} dF(X_i) = \prod_{i=1}^{n} p_i; \]

the empirical distribution function (EDF) is

\[ F_n(x) = n^{-1} \sum_{i=1}^{n} I(X_i \le x); \]

the empirical likelihood ratio (ELR) is

\[ R(F) = L(F)/L(F_n) = \prod_{i=1}^{n} n p_i; \]

the corresponding empirical log-likelihood ratio is

\[ l(F) = -2 \log R(F) = -2 \sum_{i=1}^{n} \log(n p_i). \]

Let $\theta \in R^r$ be the parameter vector of interest such that

\[ E_F\{ g(X; \theta) \} = 0, \tag{1.4.1} \]

where $g(X; \theta) \in R^t$ is a proper vector-valued function of $X$ and $\theta$, called the estimating function. If $t < r$, the dimension of the estimating function is less than that of the parameter; this is the underdetermined case, in which there might exist more than one $\theta$ satisfying equation (1.4.1). If $t > r$, the dimension of the estimating function is greater than that of the parameter; this is the overdetermined case, in which there may be no $\theta$ that satisfies equation (1.4.1). We define the profile empirical likelihood ratio of $\theta$ as

\[ R(\theta) = \max_{p} \left\{ \prod_{i=1}^{n} n p_i \;\Big|\; \sum_{i=1}^{n} p_i g(X_i; \theta) = 0,\ p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1 \right\}. \tag{1.4.2} \]

The main technique of the empirical likelihood method is to find $R(\theta)$. We define the profile empirical log-likelihood ratio $l(\theta)$ as

\[ l(\theta) = -2 \log R(\theta). \tag{1.4.3} \]

The maximum can be found via Lagrange multipliers. Let

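\[ H(\theta) = \sum_{i=1}^{n} \log p_i + \zeta \left( 1 - \sum_{i=1}^{n} p_i \right) - n \lambda^T \sum_{i=1}^{n} p_i g(X_i; \theta), \tag{1.4.4} \]

where $\lambda$ and $\zeta$ are the multipliers. Then we have

\[ \frac{\partial H(\theta)}{\partial p_i} = \frac{1}{p_i} - \zeta - n \lambda^T g(X_i; \theta) = 0. \tag{1.4.5} \]

Multiplying both sides of equation (1.4.5) by $p_i$ gives

\[ 1 - \zeta p_i - n \lambda^T g(X_i; \theta) p_i = 0. \]

Since $\sum_{i=1}^{n} \lambda^T p_i g(X_i; \theta) = \lambda^T \sum_{i=1}^{n} p_i g(X_i; \theta) = 0$ and $\sum_{i=1}^{n} p_i = 1$, summing over $i$ yields

\[ n - \zeta \sum_{i=1}^{n} p_i - n \lambda^T \sum_{i=1}^{n} g(X_i; \theta) p_i = 0 \implies \zeta = n. \]

Therefore,

\[ p_i(\theta) = \frac{1}{n} \, \frac{1}{1 + \lambda^T g(X_i; \theta)}. \tag{1.4.6} \]

We can solve for $\lambda$ from the restriction

\[ \sum_{i=1}^{n} p_i g(X_i; \theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i; \theta)}{1 + \lambda^T g(X_i; \theta)} = 0. \tag{1.4.7} \]

Note that it is necessary that $0 \le p_i \le 1$, which implies that $\lambda$ and $\theta$ must satisfy $1 + \lambda^T g(X_i; \theta) \ge 1/n$ for each $i$.

The profile empirical likelihood function for $\theta$ is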

\[ R(\theta) = \prod_{i=1}^{n} \left\{ \frac{1}{1 + \lambda(\theta)^T g(X_i; \theta)} \right\}. \tag{1.4.8} \]

Empirical likelihood hypothesis tests reject

\[ H_0 : E_F\{ g(X; \theta_0) \} = 0 \quad \text{when} \quad R(\theta_0) < r_0, \]

where $r_0$ is a certain threshold value.

Suppose $Var\{ g(X_i; \theta_0) \}$ is finite with rank $q$. Owen [21] stated the empirical version of the well-known parametric Wilks' theorem, that is,

\[ -2 \log R(\theta_0) \xrightarrow{d} \chi^2_q, \]

and also

\[ \lambda(\theta) = \left[ \frac{1}{n} \sum_{i=1}^{n} g(X_i; \theta) g^T(X_i; \theta) \right]^{-1} \left[ \frac{1}{n} \sum_{i=1}^{n} g(X_i; \theta) \right] + O_p(n^{-1/2}). \]

The rest of the dissertation is organized as follows. In Chapter 2, we propose an empirical likelihood ratio test statistic for the segmented linear regression model and define the empirical likelihood based estimator of the change-point, if it exists. In Chapter 3, four simulations are presented. In these simulations, we observe that the location and scale parameters of the asymptotic null distribution of the test statistic are robust against various parameter settings and error distributions. Through simulations, we report percentile values for various significance levels. The size and the power of the test are then analyzed using the percentile table. We also compare the proposed EL-based method with Chen's Schwarz Information Criterion (SIC) method and Muggeo's [20] parametric method. In Chapter 4, we present two empirical examples, analyzing the plasma osmolality dataset and the gasoline price dataset with the proposed ELR method.
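To make the Lagrange-multiplier solution of Section 1.4 concrete, the following Python sketch computes $-2 \log R(\theta)$ for the simplest case of a scalar mean, taking $g(X; \theta) = X - \theta$. The choice of bisection to solve (1.4.7), the iteration count, and the function name are assumptions of this illustration; the bracket for $\lambda$ comes from the requirement $1 + \lambda g(X_i; \theta) \ge 1/n$ noted after (1.4.7).

```python
import math


def el_log_ratio_mean(xs, theta, iters=200):
    """Return -2 log R(theta) for a scalar mean, using g(X; theta) = X - theta."""
    g = [x - theta for x in xs]
    g_min, g_max = min(g), max(g)
    if not (g_min < 0.0 < g_max):
        # theta is not strictly inside the data range, so no strictly
        # positive weights p_i can satisfy sum p_i * g_i = 0.
        return math.inf
    n = len(g)

    def score(lam):
        # Left-hand side of (1.4.7), up to the factor 1/n; decreasing in lam.
        return sum(gi / (1.0 + lam * gi) for gi in g)

    # 1 + lam * g_i >= 1/n for every i brackets lam between these endpoints.
    lo = (1.0 / n - 1.0) / g_max
    hi = (1.0 / n - 1.0) / g_min
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    # From (1.4.8): -2 log R(theta) = 2 * sum log(1 + lam * g_i).
    return 2.0 * sum(math.log(1.0 + lam * gi) for gi in g)
```

At the sample mean the multiplier is zero and the statistic vanishes; it grows as $\theta$ moves away from the sample mean, and by the Wilks-type result above it would be compared against a $\chi^2_1$ quantile in this scalar case.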

CHAPTER 2 CHANGE-POINT ESTIMATION VIA EMPIRICAL LIKELIHOOD

2.1 ASSUMING A KNOWN CHANGE-POINT

Assuming a known change-point, Dong [9] derives an empirical likelihood type Wald (ELw) statistic to test the equality of two coefficient vectors from two linear regression models. To be more precise, let $\hat\alpha$ and $\hat\beta$ be the least squares estimators of the regression coefficient vectors of the two models, respectively. Under the normality assumption, Dong's ELw test statistic has the form

\[ ELw = (\hat\alpha - \hat\beta)' \left[ \tilde\sigma_1^2 (\mathbf{X}_1' \mathbf{X}_1)^{-1} + \tilde\sigma_2^2 (\mathbf{X}_2' \mathbf{X}_2)^{-1} \right]^{-1} (\hat\alpha - \hat\beta), \]

where $\mathbf{X}_i$ is the design matrix and $\tilde\sigma_i^2$ is the EL estimator of $\sigma_i^2$, the variance of the errors, for the $i$th regression model ($i = 1, 2$). Dong concludes that the ELw test statistic is asymptotically $\chi^2_p$ distributed under the null hypothesis $\alpha = \beta \in R^p$.

Instead of assuming a known change-point, in the next section, we first derive an empirical likelihood based test statistic for testing the null hypothesis of no change ($H_0$). If a change-point does exist, we construct the EL-based estimator for the change time $k^*$, and hence for the change-point $\tau$. Unlike the method used by Dong, we neither require the assumption of normality of errors nor need the known change time between the two phases. However, we do require that the two phases be continuous at $\tau$. We address the following two important research issues for model (1.1.1) with continuity constraint (1.1.2):

• To test simple linear regression versus two-phase linear regression with a single unknown change-point.

• To estimate the time of change and the change-point if it exists.

2.2 ASSUMING AN UNKNOWN CHANGE-POINT

For simplicity, we consider only the simple linear regression model in this section. Assuming an unknown change-point, the null hypothesis of no change is $H_0 : \alpha_0 = \beta_0$ and $\alpha_1 = \beta_1$. For a fixed $k$, define

\[ z_i = \begin{cases} \alpha_0 + \alpha_1 X_i + e_i - (\beta_0 + \beta_1 X_i), & i = 1, \dots, k; \\ \beta_0 + \beta_1 X_i + e_i - (\alpha_0 + \alpha_1 X_i), & i = k+1, \dots, n. \end{cases} \]

Then $E_{H_0}(z_i) = E(e_i) = 0$ for $i = 1, \dots, n$. Let

\[ R(k) = \frac{\max\left\{ \prod_{i=1}^{k} p_i \prod_{i=k+1}^{n} q_i \;\Big|\; \sum_{i=1}^{k} p_i z_i = m_1,\ \sum_{i=k+1}^{n} q_i z_i = m_2,\ m_1 = m_2 = 0 \right\}}{\max\left\{ \prod_{i=1}^{k} p_i \prod_{i=k+1}^{n} q_i \;\Big|\; \sum_{i=1}^{k} p_i z_i = m_1,\ \sum_{i=k+1}^{n} q_i z_i = m_2 \right\}}, \tag{2.2.1} \]

where $m_1$ and $m_2$ are unknown, $\sum_{i=1}^{k} p_i = 1$, $\sum_{i=k+1}^{n} q_i = 1$, $p_i \ge 0$ and $q_i \ge 0$. Note that

\[ \max\left\{ \prod_{i=1}^{k} p_i \prod_{i=k+1}^{n} q_i \;\Big|\; \sum_{i=1}^{k} p_i z_i = m_1,\ \sum_{i=k+1}^{n} q_i z_i = m_2 \right\} = k^{-k} (n-k)^{-(n-k)}. \]

Thus,

\[ R(k) = \max\left\{ \prod_{i=1}^{k} k p_i \prod_{i=k+1}^{n} (n-k) q_i \;\Big|\; \sum_{i=1}^{k} p_i z_i = m_1 = \sum_{i=k+1}^{n} q_i z_i = m_2 = 0 \right\}. \tag{2.2.2} \]

The corresponding empirical log-likelihood ratio is defined as

\begin{align*}
l(k) &= -2 \log R(k) \\
&= -2 \max\left\{ \sum_{i=1}^{k} \log(k p_i) + \sum_{i=k+1}^{n} \log[(n-k) q_i] \right\},
\end{align*}

which is subject to $\sum_{i=1}^{k} p_i z_i = 0$, $\sum_{i=k+1}^{n} q_i z_i = 0$, $\sum_{i=1}^{k} p_i = 1$, $\sum_{i=k+1}^{n} q_i = 1$, $0 \le p_i \le 1$, and $0 \le q_i \le 1$.

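Hence, the Lagrangian is as follows:

\begin{align*}
H(k) &= \sum_{i=1}^{k} \log(k p_i) + \sum_{i=k+1}^{n} \log((n-k) q_i) - n\lambda \left( \sum_{i=1}^{k} p_i z_i + \sum_{i=k+1}^{n} q_i z_i \right) \\
&\quad + \zeta_1 \left( 1 - \sum_{i=1}^{k} p_i \right) + \zeta_2 \left( 1 - \sum_{i=k+1}^{n} q_i \right).
\end{align*}

Taking the first derivative of $H(k)$ with respect to $p_i$ gives

\[ \frac{\partial H(k)}{\partial p_i} = \frac{1}{p_i} - n\lambda z_i - \zeta_1 = 0. \]

Multiplying both sides by $p_i$ and summing from $i = 1$ through $k$ gives

\[ \sum_{i=1}^{k} p_i \frac{\partial H(k)}{\partial p_i} = k - n\lambda \sum_{i=1}^{k} z_i p_i - \zeta_1 \sum_{i=1}^{k} p_i = 0. \]

Since $\sum_{i=1}^{k} p_i z_i = 0$ and $\sum_{i=1}^{k} p_i = 1$, we have $\zeta_1 = k$. Similarly, we get $\zeta_2 = n - k$. Therefore,

\[ \frac{\partial H(k)}{\partial p_i} = \frac{1}{p_i} - n\lambda z_i - k = 0, \quad i = 1, \dots, k; \]

and

\[ \frac{\partial H(k)}{\partial q_i} = \frac{1}{q_i} - n\lambda z_i - (n-k) = 0, \quad i = k+1, \dots, n. \]

Therefore,

\[ p_i = \frac{1}{k} \, \frac{1}{1 + \frac{n}{k} \lambda z_i}, \quad i = 1, \dots, k; \]

and

\[ q_i = \frac{1}{n-k} \, \frac{1}{1 + \frac{n}{n-k} \lambda z_i}, \quad i = k+1, \dots, n. \]

Let $\tilde z_i = \frac{1}{k} z_i$ for $i = 1, \dots, k$ and $\tilde z_i = \frac{1}{n-k} z_i$ for $i = k+1, \dots, n$. Thus,

\[ p_i = \frac{1}{k} \, \frac{1}{1 + n\lambda \tilde z_i}, \quad i = 1, \dots, k; \tag{2.2.3} \]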

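and

\[ q_i = \frac{1}{n-k} \, \frac{1}{1 + n\lambda \tilde z_i}, \quad i = k+1, \dots, n. \tag{2.2.4} \]

In order to get $p_i$ and $q_i$, we need to solve for $\lambda$, which is determined by

\[ \sum_{i=1}^{k} p_i z_i + \sum_{i=k+1}^{n} q_i z_i = 0, \]

that is,

\[ \sum_{i=1}^{k} \frac{\tilde z_i}{1 + n\lambda \tilde z_i} + \sum_{i=k+1}^{n} \frac{\tilde z_i}{1 + n\lambda \tilde z_i} = 0. \tag{2.2.5} \]

As we can see, $\lambda$ is a function of $k$, which can be written as $\lambda = \lambda(k)$. Hence,

\[ l(k) = 2 \left\{ \sum_{i=1}^{k} \log(1 + n\lambda(k) \tilde z_i) + \sum_{i=k+1}^{n} \log(1 + n\lambda(k) \tilde z_i) \right\}. \tag{2.2.6} \]

Therefore, we propose the following empirical log-likelihood ratio test statistic:

\[ M_n = \max_{3 \le k \le n-3} \{ l(k) \}. \tag{2.2.7} \]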

Intuitively, we should take the smallest $k$ that maximizes $l(k)$ as an estimator of the true time of change. But simulation studies show that $M_n$ is sensitive to outliers when $k$ is too small or too close to the sample size $n$. This phenomenon is similar to the results from the classical maximum likelihood ratio approach. To overcome this, under the classical parametric likelihood ratio approach, Csörgő and Horváth [8] suggest using a "trimmed" test statistic instead of $M_n$. We adopt their trimming idea and define our "trimmed" test statistic as

\[ M_n' = \max_{L \le k \le U} \{ l(k) \}, \tag{2.2.8} \]

where the choice of $L$ and $U$ is arbitrary. Values ranging from $[\ln n]$ to $[n^{1/2}]$ have been used in the literature for the classical parametric likelihood approach. In the empirical likelihood method, for the sample sizes used in the simulations, excessively small tail portions do not work well. After testing various trimming options, we choose $L = [\ln n]^2$ and $U = n - L$ in this dissertation, where $[x]$ refers to the smallest integer larger than $x$.
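The computation of $l(k)$ from (2.2.5) and (2.2.6), together with the trimming bounds $L$ and $U$, can be sketched as follows. This is a minimal sketch that assumes the values $z_i$ are already available (in practice they must be estimated) and that reads $[x]$ as the ceiling of $x$; the bisection solver and the function names are assumptions of this illustration.

```python
import math


def trimmed_range(n):
    """Bounds L = [ln n]^2 and U = n - L, reading [x] as the ceiling of x."""
    lower = math.ceil(math.log(n)) ** 2
    return lower, n - lower


def el_stat(z, k, iters=200):
    """l(k) from (2.2.6), with lambda solved from (2.2.5) by bisection."""
    n = len(z)
    # w_i = n * z~_i, with z~_i = z_i / k (first phase) or z_i / (n - k) (second),
    # so that 1 + n * lambda * z~_i = 1 + lambda * w_i.
    w = [n * zi / k for zi in z[:k]] + [n * zi / (n - k) for zi in z[k:]]
    w_min, w_max = min(w), max(w)
    if not (w_min < 0.0 < w_max):
        return math.inf  # zero is not strictly inside the range of the z~_i

    def score(lam):
        # Equation (2.2.5) multiplied by n; decreasing in lam.
        return sum(wi / (1.0 + lam * wi) for wi in w)

    # At the root every 1 + lam * w_i is at least 1/n, which brackets lam.
    lo = (1.0 / n - 1.0) / w_max
    hi = (1.0 / n - 1.0) / w_min
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log(1.0 + lam * wi) for wi in w)
```

Under these assumptions, the trimmed statistic (2.2.8) would be the maximum of `el_stat(z_k, k)` over `k` between the two values returned by `trimmed_range(n)`, where `z_k` denotes the residual sequence computed for that particular $k$.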

2.3 MAIN RESULTS

Lemma 2.1. Assume both $\{z_i\}_{i=1}^{k}$ and $\{z_i\}_{i=k+1}^{n}$ have finite second moments. Then, under the null hypothesis of no change, we have

\[ \lambda(k) = \epsilon_k O_p\left( m^{-1/2} \right), \]

where $\epsilon_k = \min\left\{ \frac{k}{n}, \frac{n-k}{n} \right\}$ and $m = n\epsilon_k$.

Proof. Without loss of generality, assume $\epsilon_k = \frac{k}{n}$; then $m = k$. Denote $b = |\hat\lambda(k)|$. Since $E_F |z|^2 < \infty$, we have

\[ \mu_1 = \max_{1 \le i \le k} |\tilde z_i| = \frac{n}{k} \, o_p(k^{1/2}). \]

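Besides, from the equation $\sum_{i=1}^{k} p_i z_i = 0$, we have

\begin{align*}
0 &= \sum_{i=1}^{k} \frac{\tilde z_i}{1 + nb\tilde z_i} \\
&= \sum_{i=1}^{k} \left( \tilde z_i - \frac{nb \tilde z_i^2}{1 + nb\tilde z_i} \right) \\
&= \sum_{i=1}^{k} \tilde z_i - b \sum_{i=1}^{k} \frac{n \tilde z_i^2}{1 + nb\tilde z_i}.
\end{align*}

Then,

\[ b \sum_{i=1}^{k} \frac{n \tilde z_i^2}{1 + nb\tilde z_i} = \sum_{i=1}^{k} \tilde z_i. \tag{2.3.1} \]

Since $0 \le p_i = \frac{1}{k} \, \frac{1}{1 + n\lambda(k) \tilde z_i} \le 1$, we have $1 + n\lambda(k) \tilde z_i \ge \frac{1}{k}$.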

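Therefore,

\[ \frac{bn}{k} \sum_{i=1}^{k} \tilde z_i^2 \le \frac{bn}{k} (1 + nb\mu_1) \sum_{i=1}^{k} \frac{\tilde z_i^2}{1 + nb\tilde z_i} = \frac{n}{k} (1 + nb\mu_1) \sum_{i=1}^{k} \tilde z_i. \]

Let $S_k^2 = \frac{1}{k} \sum_{i=1}^{k} \tilde z_i^2$ and $\bar z_k = \frac{1}{k} \sum_{i=1}^{k} \tilde z_i$. By the central limit theorem, $\bar z_k = \frac{k}{n} O_p(k^{-1/2})$. So

\[ b \left( S_k^2 - n\mu_1 \bar z_k \right) \le \bar z_k = \frac{k}{n} O_p(k^{-1/2}). \]

Thus, we obtain $b = \frac{k}{n} O_p\left( k^{-1/2} \right)$. Hence, $\hat\lambda(k) = \epsilon_k O_p\left( m^{-1/2} \right)$.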

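Theorem 2.1. Assume both $\{z_i\}_{i=1}^{k}$ and $\{z_i\}_{i=k+1}^{n}$ have finite second moments. Then $l(k)$ has an asymptotic $\chi^2_1$ distribution under the null hypothesis of no change, that is,

\[ P\left\{ l(k) < \chi^2_1(\alpha) \right\} = \alpha + O(n^{-1}), \]

where $\chi^2_1(\alpha)$ is the $\alpha$ percentile of the $\chi^2_1$ distribution.

Proof. Let $\bar z = \frac{1}{n} \sum_{i=1}^{n} \tilde z_i$ and $S_n^2 = \frac{1}{n} \sum_{i=1}^{n} \tilde z_i^2$. By equation (2.2.5), we have

\begin{align*}
0 &= \frac{1}{n} \sum_{i=1}^{n} \frac{\tilde z_i}{1 + n\lambda(k) \tilde z_i} \\
&= \frac{1}{n} \sum_{i=1}^{n} \tilde z_i \left[ 1 - n\lambda(k) \tilde z_i + \frac{(n\lambda(k) \tilde z_i)^2}{1 + n\lambda(k) \tilde z_i} \right] \\
&= \bar z - n S_n^2 \lambda(k) + \frac{1}{n} \sum_{i=1}^{n} \frac{\tilde z_i (n\lambda(k) \tilde z_i)^2}{1 + n\lambda(k) \tilde z_i}.
\end{align*}

The last term, $\frac{1}{n} \sum_{i=1}^{n} \frac{\tilde z_i (n\lambda(k) \tilde z_i)^2}{1 + n\lambda(k) \tilde z_i}$, is bounded by $o_p(n^{-1/2})$. Hence,

\[ \lambda(k) = \frac{\bar z}{n S_n^2} + \frac{1}{n} o_p(n^{-1/2}). \]

By Taylor expansion, we can write $\log(1 + x) = x - \frac{1}{2} x^2 + \eta$, where for some finite $B > 0$, $P(|\eta| \le B|x|^3) \to 1$.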

n

l(k) = 2 log(1 + nλ (k)˜zi) Xi=1 n n n = 2 nλ (k)˜z n2 (λ(k)˜z )2 + 2 η i − i i Xi=1 Xi=1 Xi=1 n = 2 n2λ(k)¯z n3λ(k)S2λ(k) + 2 η − n i Xi=1 n −2 2 = nS n z¯ + 2 ηi Xi=1

As we all know that nS −2z¯2 χ2. Therefore, l(k) χ2 in distribution. n −→ 1 −→ 1 We just proved l(k) is asymptotically chi-squared distributed for i = 3 ,...,n 3, − then l(k) n−3 is a sequence of dependent chi-squared distributed random variables. { }i=3

The proposed empirical log-likelihood ratio test statistic Mn is the maximum of this sequence. We conjecture that Zn, the square root of Mn, follows the Gumbel extreme value distribution. In Chapter 3, the simulation studies will illustrate this conjecture.
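The last equality in the expansion of $l(k)$ in the proof of Theorem 2.1 can be verified by direct substitution. The following worked step is a sketch that keeps only the leading term $\lambda(k) \approx \bar z / (n S_n^2)$ and drops the $o_p$ remainder, which is an assumption of this illustration:

```latex
\begin{align*}
2 n^2 \lambda(k) \bar z - n^3 S_n^2 \lambda(k)^2
  &\approx 2 n^2 \cdot \frac{\bar z}{n S_n^2} \cdot \bar z
     - n^3 S_n^2 \cdot \frac{\bar z^2}{n^2 S_n^4} \\
  &= \frac{2 n \bar z^2}{S_n^2} - \frac{n \bar z^2}{S_n^2}
   = n S_n^{-2} \bar z^2 .
\end{align*}
```

The linear term contributes twice the quadratic term, so exactly one copy of $n \bar z^2 / S_n^2$ survives, which is the quantity whose limit is $\chi^2_1$.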

2.4 ALGORITHM

Let $\alpha = (\alpha_0, \alpha_1)^T$ and $\beta = (\beta_0, \beta_1)^T$ be the regression coefficient vectors of the two phases in model (1.1.1). Then the test of no change is equivalent to the test of $H_0 : \alpha = \beta$. For a fixed $k$, we can separate the data into two groups: $\{(X_i, Y_i)\}_{i=1}^{k}$ and $\{(X_i, Y_i)\}_{i=k+1}^{n}$. For each group, we fit a simple linear regression to the data points using the ordinary least squares (OLS) method. Let $\hat\alpha(k) = (\hat\alpha_0(k), \hat\alpha_1(k))^T$ be the OLS estimator of $\alpha$ computed from $\{(X_i, Y_i)\}_{i=1}^{k}$ and $\hat\beta(k) = (\hat\beta_0(k), \hat\beta_1(k))^T$ be the OLS estimator of $\beta$ computed from $\{(X_i, Y_i)\}_{i=k+1}^{n}$. Then, the estimated errors are

\[ \hat e_i(k) = \begin{cases} Y_i - \left( \hat\alpha_0(k) + \hat\alpha_1(k) X_i \right), & i = 1, \dots, k; \\ Y_i - \left( \hat\beta_0(k) + \hat\beta_1(k) X_i \right), & i = k+1, \dots, n. \end{cases} \]

Under the null hypothesis $H_0 : \alpha = \beta$, $\hat\alpha_0(k)$ and $\hat\beta_0(k)$ should be close to each other, and similarly for $\hat\alpha_1(k)$ and $\hat\beta_1(k)$. Therefore, we propose to switch the roles of the estimated regression coefficient vectors in estimating the errors for these two phases. That is, the estimated errors under $H_0$ can be represented as follows:

\[ \hat z_i(k) = \begin{cases} Y_i - \left( \hat\beta_0(k) + \hat\beta_1(k) X_i \right), & i = 1, \dots, k; \\ Y_i - \left( \hat\alpha_0(k) + \hat\alpha_1(k) X_i \right), & i = k+1, \dots, n. \end{cases} \tag{2.4.1} \]

Hence,

\[ \hat{\tilde z}_i(k) = \begin{cases} \frac{1}{k} \hat z_i(k), & i = 1, \dots, k; \\ \frac{1}{n-k} \hat z_i(k), & i = k+1, \dots, n. \end{cases} \tag{2.4.2} \]

Therefore, $\hat\lambda$ can be determined by the following equation:

\[ \sum_{i=1}^{k} \frac{\hat{\tilde z}_i}{1 + n\hat\lambda \hat{\tilde z}_i} + \sum_{i=k+1}^{n} \frac{\hat{\tilde z}_i}{1 + n\hat\lambda \hat{\tilde z}_i} = 0. \]

Thus, by equation (2.2.6),

\[ \hat l(k) = 2 \left\{ \sum_{i=1}^{k} \log(1 + n\hat\lambda \hat{\tilde z}_i) + \sum_{i=k+1}^{n} \log(1 + n\hat\lambda \hat{\tilde z}_i) \right\}. \tag{2.4.3} \]

Notice that the true time of change $k^*$ is unidentifiable under the null hypothesis

H0. Large values of ˆl(k) correspond to a two-tailed alternative hypothesis being true.

When ˆl(k) is small for each possible k, Mn is small, as is the case under the null

∗ hypothesis. If a change occurs at k , then ˆl(k) and Mn should be statistically large, ˆ 2 thus we reject H0. Notice that l(k) is an asymptotic χ1 statistic for each fixed k and the components of ˆl(k) n−3 are not independent. { }k=3

In [8], Csörgő and Horváth show that the limiting null distribution of the square root of the parametric likelihood ratio test statistic is the Gumbel extreme value distribution. Motivated by their theory, we investigate the analogous property of the empirical likelihood based test statistic $Z_n = \sqrt{M_n}$. We now propose a computing algorithm consisting of the following five steps.

1. For a fixed $k$, $k = 3, 4, \ldots, n-3$, split the data into two groups referred to as the left-phase group $\{(X_i, Y_i)\}_{i=1}^{k}$ and the right-phase group $\{(X_i, Y_i)\}_{i=k+1}^{n}$.

2. For each group, fit the points to a linear model to obtain $\hat{\alpha}_0(k), \hat{\alpha}_1(k)$ from the left-phase group and $\hat{\beta}_0(k), \hat{\beta}_1(k)$ from the right-phase group. If the abscissa of the intersection of the two fitted lines, $\dfrac{\hat{\beta}_0(k) - \hat{\alpha}_0(k)}{\hat{\alpha}_1(k) - \hat{\beta}_1(k)}$, does not lie in $[X_k, X_{k+1})$, skip this $k$ and return to Step 1.

3. Calculate $\hat{z}_i(k) = Y_i - [\hat{\beta}_0(k) + \hat{\beta}_1(k)X_i]$ for $i = 1, \ldots, k$ and $\hat{z}_i(k) = Y_i - [\hat{\alpha}_0(k) + \hat{\alpha}_1(k)X_i]$ for $i = k+1, \ldots, n$.

4. Use $\{\hat{\tilde{z}}_i(k)\}_{i=1}^{n}$ as the input to the el.test function in the R package (emplik) to compute $\hat{l}(k) = -2\log\hat{R}(k)$.

5. For each possible $k$, repeat Steps 1 to 4 to obtain a sequence $\{\hat{l}(k)\}_{k=3}^{n-3}$. The maximum of this sequence is $M_n$. Taking the square root of $M_n$ gives $Z_n$. Note that some values of $k$ might be skipped.

Let $G_\alpha$ be the critical value of $Z_n = \sqrt{M_n}$ at significance level $\alpha$. That is, $P(Z_n \ge G_\alpha) = \alpha$. If $Z_n$ is larger than the critical value $G_\alpha$, then $H_0$ is rejected at significance level $\alpha$. If $H_0$ is rejected, we conclude that a change-point exists and proceed to estimate it.

If $H_0$ is rejected, then we define the smallest time that reaches the value $M_n'$, denoted by $\hat{k}^*$, as the EL estimator of $k^*$. That is,

$$
\hat{k}^* = \min\{k : M_n' = -2\log\hat{R}(k)\}
\tag{2.4.4}
$$

and hence, the empirical likelihood estimator of the change-point $\tau$ is defined as

$$
\hat{\tau} =
\begin{cases}
\dfrac{\hat{\beta}_0(\hat{k}^*) - \hat{\alpha}_0(\hat{k}^*)}{\hat{\alpha}_1(\hat{k}^*) - \hat{\beta}_1(\hat{k}^*)}, & \text{if } \dfrac{\hat{\beta}_0(\hat{k}^*) - \hat{\alpha}_0(\hat{k}^*)}{\hat{\alpha}_1(\hat{k}^*) - \hat{\beta}_1(\hat{k}^*)} \in [X_{\hat{k}^*}, X_{\hat{k}^*+1}), \\
X_{\hat{k}^*}, & \text{otherwise.}
\end{cases}
\tag{2.4.5}
$$

In this way, equation (2.4.5) guarantees the continuity constraint. For more details, see Hinkley [15].
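The five steps and the estimators (2.4.4) and (2.4.5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the results in this dissertation: the data are assumed sorted by X, the intersection of the two fitted lines is computed as $(\hat{\beta}_0 - \hat{\alpha}_0)/(\hat{\alpha}_1 - \hat{\beta}_1)$, the maximum is taken over all admissible $k$ in $3, \ldots, n-3$ rather than a trimmed range, and the quadratic approximation $nS_n^{-2}\bar{z}^2$ from the proof in Chapter 2 is swapped in as a stand-in for the el.test call of Step 4. All function names are hypothetical.

```python
import math

def ols(x, y):
    """Simple linear regression by least squares; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def quadratic_el(z):
    """ASSUMPTION: n * zbar^2 / S_n^2 replaces emplik::el.test in Step 4."""
    n = len(z)
    z_bar = sum(z) / n
    s2 = sum(v * v for v in z) / n
    return n * z_bar * z_bar / s2

def scan(x, y, el_stat=quadratic_el):
    """Steps 1-5: return {k: (l(k), (a0, a1), (b0, b1))} for admissible k."""
    n = len(x)
    results = {}
    for k in range(3, n - 2):                      # k = 3, ..., n-3
        a0, a1 = ols(x[:k], y[:k])                 # left-phase fit
        b0, b1 = ols(x[k:], y[k:])                 # right-phase fit
        if a1 == b1:
            continue                               # parallel lines never cross
        cross = (b0 - a0) / (a1 - b1)
        if not (x[k - 1] <= cross < x[k]):         # must lie in [X_k, X_{k+1})
            continue
        # Swapped residuals (2.4.1), scaled as in (2.4.2).
        z = [(yi - (b0 + b1 * xi)) / k for xi, yi in zip(x[:k], y[:k])]
        z += [(yi - (a0 + a1 * xi)) / (n - k) for xi, yi in zip(x[k:], y[k:])]
        results[k] = (el_stat(z), (a0, a1), (b0, b1))
    return results

def estimate(x, results):
    """Return (k_hat, tau_hat, Z_n) following (2.4.4) and (2.4.5)."""
    m_n = max(l for l, _, _ in results.values())
    k_hat = min(k for k, (l, _, _) in results.items() if l == m_n)
    _, (a0, a1), (b0, b1) = results[k_hat]
    cross = (b0 - a0) / (a1 - b1)
    tau_hat = cross if x[k_hat - 1] <= cross < x[k_hat] else x[k_hat - 1]
    return k_hat, tau_hat, math.sqrt(m_n)

# Toy data (hypothetical): noise-free segmented line with a kink at 9.5.
x = [float(i) for i in range(20)]
y = [1.0 + xi + 2.0 * max(xi - 9.5, 0.0) for xi in x]
k_hat, tau_hat, z_n = estimate(x, scan(x, y))
print(k_hat, tau_hat, z_n)
```

Passing a different callable as `el_stat` is the intended way to plug in an exact empirical likelihood routine. The resulting $Z_n$ would then be compared with a critical value $G_\alpha$ before the change-point estimate is reported.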

CHAPTER 3 ASYMPTOTIC PROPERTY OF THE PROPOSED TEST STATISTIC

The algorithm presented in Chapter 2 enables us to conduct simulation studies to investigate whether $Z_n$ under the null hypothesis has asymptotic Gumbel extreme value distribution, which is a subfamily of the Generalized Extreme Value (GEV) distribution. The GEV distribution has the following cumulative distribution function:

$$
F(x; \mu, \sigma, \zeta) = \exp\left\{-\left[1 + \zeta\,\frac{x-\mu}{\sigma}\right]^{-1/\zeta}\right\}, \quad \text{for } x \in \mathbb{R} \text{ and } 1 + \zeta(x-\mu)/\sigma > 0,
$$

where $\mu \in \mathbb{R}$ is the location parameter, $\sigma > 0$ is the scale parameter and $\zeta \in \mathbb{R}$ is the shape parameter. The shape parameter $\zeta$ dominates the tail behavior of the distribution. When $\zeta \to 0$, the limiting distribution of the GEV distribution is the Gumbel (G) extreme value distribution, given by

$$
F_G(x; \mu, \sigma) = \exp\left\{-\exp\left[-\frac{x-\mu}{\sigma}\right]\right\}.
\tag{3.0.1}
$$

For a Gumbel extreme value distribution, the mean and the variance are $\mu + \sigma a$ and $\sigma^2\pi^2/6$, respectively, where $a$ is the Euler-Mascheroni constant 0.57721.... We use the fgev function in the R package (evd) to estimate the location $\mu$, the scale $\sigma$ and the critical values of $F_G$. The function fgev uses maximum-likelihood fitting of the GEV distribution to estimate $\mu$, $\sigma$ and $\zeta$. We can obtain estimates of $\mu$ and $\sigma$ for the Gumbel extreme value distribution by setting $\zeta = 0$.

Four simulation studies are reported in this section. We simulate 1000 samples with the sample values of the random predictor $X$ generated from $N(0, 1)$. The sample size $n$ ranges from 30 to 500.
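The relation between the two distribution functions can be illustrated with a short Python sketch that evaluates the GEV distribution function for a small shape parameter and compares it with (3.0.1). The function names are illustrative assumptions; the fitting itself is done with fgev in R and is not reproduced here.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def gumbel_cdf(x, mu, sigma):
    """Gumbel distribution function (3.0.1)."""
    return math.exp(-math.exp(-(x - mu) / sigma))

def gev_cdf(x, mu, sigma, zeta):
    """GEV distribution function; zeta = 0 is treated as the Gumbel limit."""
    if zeta == 0.0:
        return gumbel_cdf(x, mu, sigma)
    arg = 1.0 + zeta * (x - mu) / sigma
    if arg <= 0.0:
        # Outside the support: below it when zeta > 0, above it when zeta < 0.
        return 0.0 if zeta > 0.0 else 1.0
    # [1 + zeta*s]^(-1/zeta) computed as exp(-log(1 + zeta*s) / zeta).
    return math.exp(-math.exp(-math.log1p(zeta * (x - mu) / sigma) / zeta))

def gumbel_mean_variance(mu, sigma):
    """Mean mu + sigma*a and variance sigma^2 * pi^2 / 6."""
    return mu + sigma * EULER_GAMMA, sigma ** 2 * math.pi ** 2 / 6.0

# As zeta shrinks, the GEV value approaches the Gumbel value.
for zeta in (0.5, 0.1, 0.01, 1e-6):
    print(zeta, gev_cdf(10.0, 8.5, 2.5, zeta), gumbel_cdf(10.0, 8.5, 2.5))
print(gumbel_mean_variance(0.0, 1.0))
```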

3.1 SIMULATION I

Simulation I tests the robustness of the proposed EL based test statistic and computes the critical values of $Z_n$ for the most popular nominal levels 0.10, 0.05 and 0.01, under the null hypothesis of no change. Data are generated from the simple linear regression

$$
Y_i = \gamma_0 + \gamma_1 X_i + e_i
\tag{3.1.1}
$$

with $\gamma = (\gamma_0, \gamma_1)^T = (1, 1)^T$ and three types of error distributions: (i) Normal errors: $N(0, 0.1^2)$; (ii) Centered log-normal errors (heavy-tailed): $\log N(0, 0.1^2) - E(\log N(0, 0.1^2))$; and (iii) Non-homogeneous errors: $N(0, 0.1^2)$ for the first $[n/2]$ observations and $N(0, 1.0^2)$ for the remaining ones.

Table 3.1 shows that the estimated location and scale parameters of the asymptotic distribution increase as the sample size increases. More importantly, one observes that both the location and the scale parameters are robust against changes to the distributions of the errors. This is consistent with the well-known nonparametric nature of the empirical likelihood method.
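One way to generate a single sample under the three error settings is sketched below in Python. It is a minimal sketch under assumptions: the function name and the string labels are hypothetical, the predictor values are sorted so that the index $k$ splits the data along X, and the centering constant for the log-normal errors uses $E(\log N(0, \sigma^2)) = \exp(\sigma^2/2)$.

```python
import math
import random

def simulate_null_sample(n, error_type, rng, gamma=(1.0, 1.0)):
    """Draw (x, y) from model (3.1.1) under H0 with one of three error types."""
    # ASSUMPTION: X ~ N(0, 1), sorted so that k splits the data along X.
    x = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    if error_type == "normal":                 # (i) N(0, 0.1^2)
        e = [rng.gauss(0.0, 0.1) for _ in range(n)]
    elif error_type == "lognormal":            # (ii) centered log-normal
        center = math.exp(0.1 ** 2 / 2.0)      # E[logN(0, 0.1^2)]
        e = [rng.lognormvariate(0.0, 0.1) - center for _ in range(n)]
    elif error_type == "nonhomogeneous":       # (iii) two variance regimes
        half = n // 2
        e = [rng.gauss(0.0, 0.1) for _ in range(half)]
        e += [rng.gauss(0.0, 1.0) for _ in range(n - half)]
    else:
        raise ValueError("unknown error type: " + error_type)
    y = [gamma[0] + gamma[1] * xi + ei for xi, ei in zip(x, e)]
    return x, y

rng = random.Random(2011)  # fixed seed so the run is repeatable
for kind in ("normal", "lognormal", "nonhomogeneous"):
    x, y = simulate_null_sample(30, kind, rng)
    print(kind, len(x), min(y), max(y))
```

Repeating the call 1000 times per setting and applying the algorithm of Section 2.4 to each sample would give the replicates of $Z_n$ summarized in the tables.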

The histograms and Q-Q plots of $Z_n$ for the 1000 replicates are shown in Figure 3.1 through Figure 3.3, corresponding to the three types of error distributions. The left panels show the histograms of $Z_n$, in which the solid line represents the estimated Gumbel density and the dashed line represents the kernel density estimate with a Gaussian kernel, computed using the density function in R. The right panels show the Q-Q plots against the Gumbel distribution.

The critical values of $Z_n$ at the nominal levels $\alpha = 0.10$, 0.05 and 0.01 can also be derived by computing the quantiles of the simulated Gumbel extreme value distribution. The critical values of $Z_n$ are reported in Table 3.2.
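Inverting (3.0.1) gives the Gumbel quantile in closed form, $x_p = \mu - \sigma\log(-\log p)$, so the critical value at level $\alpha$ is the quantile at $p = 1 - \alpha$. The following Python sketch applies this to the estimates reported in Table 3.1 for normal errors and $n = 30$ ($\mu = 8.511$, $\sigma = 2.453$). The function names are illustrative assumptions, and the closed-form quantile of the fitted distribution is used here in place of whatever routine produced the tabulated values.

```python
import math

def gumbel_quantile(p, mu, sigma):
    """Quantile of the Gumbel distribution (3.0.1): mu - sigma*log(-log(p))."""
    return mu - sigma * math.log(-math.log(p))

def critical_value(alpha, mu, sigma):
    """Upper-tail critical value G_alpha with P(Z_n >= G_alpha) = alpha."""
    return gumbel_quantile(1.0 - alpha, mu, sigma)

# Estimates for normal errors, n = 30, as reported in Table 3.1.
mu_hat, sigma_hat = 8.511, 2.453
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(critical_value(alpha, mu_hat, sigma_hat), 3))
```

Rounded to three decimals, these values are 14.031, 15.797 and 19.795, which agree with the first row of Table 3.2 for normal errors.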

Table 3.1 : Robustness analysis for the estimated location and scale parameters, µ and σ respectively, of the limiting distribution of Zn under three types of error distributions.

             (i) Normal        (ii) Centered log-normal    (iii) Non-homogeneous
    n        µ        σ            µ          σ                µ          σ
   30      8.511    2.453        8.529      2.816            8.299      3.181
   50     12.169    3.338       12.282      3.371           11.554      3.981
  100     18.625    4.535       18.474      4.442           17.674      5.744
  500     46.000    9.105       46.268      9.442           44.985     10.064

Table 3.2 : The percentiles of Zn with α = 0.10, 0.05, and 0.01.

               (i) Normal              (ii) Centered log-normal      (iii) Non-homogeneous
    n      α=0.10   0.05    0.01        α=0.10   0.05    0.01        α=0.10   0.05    0.01
   30      14.031  15.797  19.795       14.866  16.893  21.483       15.457  17.747  22.932
   50      19.681  22.084  27.524       19.867  22.294  27.789       20.513  23.378  29.867
  100      28.830  32.095  39.487       28.470  31.668  38.908       30.600  34.735  44.097
  500      66.490  73.044  87.884       67.516  74.313  89.703       67.632  74.877  91.280

[Figure 3.1: four rows of panels, one for each of n = 30, 50, 100 and 500. Left panels: histogram of Zn (horizontal axis Zn, vertical axis Density). Right panels: Gumbel Q-Q plot (horizontal axis Theoretical Quantile, vertical axis Sample Quantile).]

Figure 3.1 : The histograms and Q-Q plots of Zn under H0 with normal errors (i) $N(0, 0.1^2)$ for four different sample size settings. The solid line represents the estimated Gumbel density and the dashed line represents the estimated kernel density.

[Figure 3.2: four rows of panels, one for each of n = 30, 50, 100 and 500. Left panels: histogram of Zn (horizontal axis Zn, vertical axis Density). Right panels: Gumbel Q-Q plot (horizontal axis Theoretical Quantile, vertical axis Sample Quantile).]

Figure 3.2 : The histograms and Q-Q plots of Zn under H0 with centered log-normal errors (ii) centered $\log N(0, 0.1^2)$ for four different sample size settings. The solid line represents the estimated Gumbel density and the dashed line represents the estimated kernel density.

[Figure 3.3: four rows of panels, one for each of n = 30, 50, 100 and 500. Left panels: histogram of Zn (horizontal axis Zn, vertical axis Density). Right panels: Gumbel Q-Q plot (horizontal axis Theoretical Quantile, vertical axis Sample Quantile).]

Figure 3.3 : The histograms and Q-Q plots of Zn under H0 with non-homogeneous errors (iii) $\{e_i\}_{i=1}^{[n/2]} \sim N(0, 0.1^2)$ and $\{e_i\}_{i=[n/2]+1}^{n} \sim N(0, 1.0^2)$ for four different sample size settings. The solid line represents the estimated Gumbel density and the dashed line represents the estimated kernel density.

3.2 SIMULATION II

Simulation II is performed to show that the asymptotic distribution is robust against changes in the settings of the parameter vector $\gamma = (\gamma_0, \gamma_1)^T$ in model (3.1.1). For a sample size of $n = 100$ and $\{e_i\}_{i=1}^{n} \sim N(0, 0.1^2)$, five different settings of $\gamma$ are investigated. Table 3.3 indicates that the estimated location and scale parameters are robust to the settings of $\gamma$.

Table 3.3 : Robustness analysis for the estimated location and scale parameters, µ and σ respectively, of the limiting distribution of Zn with respect to the settings of γ under the null hypothesis when n = 100.

  γ = (γ0, γ1)^T     (-1,1)     (2,-1)    (-3,-3)     (8,3)     (-3,5)
  µ                  18.625     18.824     18.629    18.500     18.931
  σ                   4.535      4.157      4.263     4.140      4.789

3.3 SIMULATION III

In 2003, Muggeo [20] proposed a parametric method which approximates the segmented relationship through a first-order Taylor expansion around an initial guess for the change-point, $\tau^{(0)}$. Muggeo points out that the log-likelihood function for the change-point may not be concave, especially when the slope change is small. His simulations show that this method is very sensitive to the scale of the slope change. Muggeo provides the related R package (segmented). For convenience, we refer to Muggeo's method as the SEG method. Muggeo used the same model as model (1.2.2). Assume $\tau^{(0)}$ is the initial change-point; then

$$
(X_i - \tau)_+ \approx (X_i - \tau^{(0)})_+ + (\tau - \tau^{(0)})(-1)I(X_i > \tau^{(0)}),
\tag{3.3.1}
$$

where $(-1)I(X_i > \tau^{(0)})$ is the first derivative of $(X_i - \tau)_+$ with respect to $\tau$, evaluated at $\tau^{(0)}$. Therefore, the algorithm at step $k$ is:

• Fix $\tau^{(k)}$ and calculate $U_i^{(k)} = (X_i - \tau^{(k)})_+$ and $V_i^{(k)} = -I(X_i > \tau^{(k)})$.

• Fit the model with the additional variates $U^{(k)}$ and $V^{(k)}$:

$$
Y_i = \alpha_0 + \alpha_1 X_i + \xi U_i^{(k)} + \gamma V_i^{(k)}.
\tag{3.3.2}
$$

• Improve the change-point estimate by

$$
\tau^{(k+1)} = \frac{\hat{\gamma}}{\hat{\xi}} + \tau^{(k)}.
$$

• Repeat the process until convergence.

Simulation III is carried out to compare the performance of the proposed EL based estimator, Chen's SIC estimator and Muggeo's SEG parametric estimator of the true change time $k^*$. We examine the effect of two different values of the slope change under two error settings: $N(0, 0.5^2)$ and $N(0, 0.1^2)$. Let $\xi$ be the slope change between the two phases for the segmented linear regression model (1) with the continuity constraint (2). The two values of the slope change are $\xi = 2$ and $\xi = 4$.

When the X values are too close together, it is hard to detect the true time of change. The acceptable deviation of $\hat{k}^*$ from $k^*$ depends on the sample size and the range of the X values. In order to compare these three methods, we propose the following fine-tuned acceptable deviation $D$:

$$
D = \left[\frac{U - L}{A}\right],
$$

where $A$ is the range of the X values, and $L = [\ln n]^2$ and $U = n - L$ are the fine-tuning portions in the definition of the trimmed test statistic $M_n'$. For $X \sim N(0, 1.0^2)$, we take $A = 6$. Then, when $n = 50$, $D = 3$ with $L = 16$ and $U = 34$.
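The SEG iteration listed at the start of this section can be sketched in Python as follows. This is a minimal illustration of the update rule only, written from the description above and not taken from the segmented package; the helper names, the normal-equations solver and the toy data are assumptions.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            raise ValueError("singular design; tau may lie outside the data")
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for c in reversed(range(n)):
        tail = sum(m[c][j] * x[j] for j in range(c + 1, n))
        x[c] = (m[c][n] - tail) / m[c][c]
    return x

def least_squares(columns, y):
    """OLS coefficients from the normal equations (fine for small toy data)."""
    p = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j]))
            for j in range(p)] for i in range(p)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(p)]
    return solve_linear(xtx, xty)

def seg_iterate(x, y, tau0, max_iter=50, tol=1e-8):
    """Iterate tau <- tau + gamma_hat / xi_hat using model (3.3.2)."""
    tau = tau0
    for _ in range(max_iter):
        u = [max(t - tau, 0.0) for t in x]            # U = (X - tau)_+
        v = [-1.0 if t > tau else 0.0 for t in x]     # V = -I(X > tau)
        _, _, xi_hat, gamma_hat = least_squares([[1.0] * len(x), x, u, v], y)
        new_tau = tau + gamma_hat / xi_hat
        if abs(new_tau - tau) < tol:
            return new_tau
        tau = new_tau
    return tau

# Toy data (hypothetical): noise-free kink at 9.5, initial guess 9.2.
x = [float(i) for i in range(20)]
y = [1.0 + t + 2.0 * max(t - 9.5, 0.0) for t in x]
print(seg_iterate(x, y, tau0=9.2))
```

With noise-free data and no observation between the initial guess and the true change-point, the linearization (3.3.1) is exact on the sample, so the iteration should settle at the kink after the first update. Noisy data or a small slope change, the situation discussed above, can make the update unstable.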

For the purpose of illustration, we report the simulation results for the sample size $n = 50$ and $k^* = 25$ for the errors generated from $N(0, 0.5^2)$ and $N(0, 0.1^2)$. Let $d = |\hat{k}^* - k^*|$ be the absolute deviation of the estimate $\hat{k}^*$ from the true time of change $k^*$, and let RF be the relative frequency with which the deviation $d$ is less than or equal to the acceptable deviation $D$. Table 3.4 reports the frequency distribution of $d$ and the relative frequency of $d \le D$. The simulation results indicate that the proposed method works slightly better than the SIC method and the SEG method in capturing the true change time ($d = 0$). The relative frequency of all three methods increases as the standard deviation ($\sigma$) of the errors decreases. When $\sigma = 0.5$, the proposed method works much better than the SIC method and the SEG method for $\xi = 2$. When $\sigma = 0.1$, the three methods are comparable. One also notices that as the slope change increases, the absolute deviation gets smaller and the relative frequency increases. The simulation results for various sample sizes ranging from 30 to 500 show a similar pattern. From Table 3.4, one notices that both the error variance and the slope change influence the change-point estimation. Detection is easier for a larger slope change, as well as for a higher signal-to-noise ratio.
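The acceptable deviation $D$ and the relative frequency RF can be reproduced with a few lines of Python. In this sketch the bracket in $L = [\ln n]^2$ is read as rounding to the nearest integer, which is an assumption chosen because it reproduces $L = 16$ for $n = 50$, and the outer bracket in $D$ is read as the integer part. The frequencies are those reported in Table 3.4 for the ELR method with $N(0, 0.5^2)$ errors and $\xi = 2$; the function names are hypothetical.

```python
import math

def acceptable_deviation(n, x_range):
    """Return (L, U, D) with L = [ln n]^2, U = n - L, D = [(U - L) / A]."""
    # ASSUMPTION: [ln n] is rounded to the nearest integer (gives L = 16 at n = 50).
    lower = round(math.log(n)) ** 2
    upper = n - lower
    return lower, upper, math.floor((upper - lower) / x_range)

def relative_frequency(freq, tail_count, d_max):
    """Percentage of replicates with d <= d_max.

    freq[j] is the number of replicates with d = j; tail_count collects
    all replicates with d beyond the last tabulated value.
    """
    total = sum(freq) + tail_count
    return 100.0 * sum(freq[: d_max + 1]) / total

lower, upper, d_max = acceptable_deviation(50, 6)
print(lower, upper, d_max)

# ELR column of Table 3.4, N(0, 0.5^2) errors, slope change 2: d = 0, ..., 6 and d >= 7.
elr_freq = [57, 155, 118, 102, 84, 82, 79]
print(relative_frequency(elr_freq, tail_count=323, d_max=d_max))
```

The second printed value corresponds to the RF entry of 43.2% in that column of Table 3.4.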

3.4 SIMULATION IV

Table 3.4 : The frequency distribution of $d = |\hat{k}^* - k^*|$ and the relative frequency RF = #{d ≤ D}/10 % for the ELR, SIC and SEG methods, when the sample size n = 50 and the true time of change k∗ = 25.

                      N(0, 0.5^2)                          N(0, 0.1^2)
               ξ = 2            ξ = 4               ξ = 2            ξ = 4
  d         ELR  SIC  SEG    ELR  SIC  SEG       ELR  SIC  SEG    ELR  SIC  SEG
  0          57   51   21     91   81   89       193  189  176    301  299  301
  1         155   90   43    170  165  168       317  301  283    316  317  313
  2         118   99   56    203  198  201       198  217  186    187  196  189
  3         102  128   74    165  172  167       119  128  134     71   68   78
  4          84  113   97    103  128  101        78   89   88     64   55   53
  5          82   94  106     98  103  105        54   59   58     34   32   45
  6          79   88  109     89   64   76        30   11   45     12   24   15
  ≥ 7       323  337  494     81   89   93        11    6   30     15    9    6
  RF (%)   43.2 36.8 19.4   62.9 61.6 62.5      82.7 83.5 77.9   87.5   88 88.1

Simulation IV is performed to study the size and the power performance of the proposed test for two types of error distributions. Table 3.5 shows the size and the power of the proposed test for a variety of sample sizes and true times of change. The size is computed by using the estimated critical value of the Gumbel distribution. We simulated 1000 samples from model (3.1.1) with parameter vector $\gamma = (1, 1)^T$, and the size is estimated by the proportion of samples that result in a false rejection of the null hypothesis, that is, samples for which the test statistic is larger than the critical value. This study indicates that the proposed test statistic is able to control the size and attains a high power when the sample size is large.

Table 3.5 : The size and the power of Zn under two different error distributions: (i) normal distribution $N(0, 0.1^2)$ and (ii) centered log-normal distribution $\log N(0, 0.1^2)$.

                     (i) Normal                       (ii) Centered log-normal
  k∗/n      n=30    n=50    n=100   n=500        n=30    n=50    n=100   n=500
   0%       0.045   0.049   0.050   0.050        0.037   0.046   0.051   0.050
  20%       0.884   0.896   0.969   0.987        0.876   0.895   0.947   0.968
  30%       0.897   0.897   0.968   0.987        0.894   0.897   0.948   0.968
  40%       0.892   0.897   0.969   0.987        0.896   0.917   0.954   0.969
  50%       0.908   0.898   0.960   0.987        0.907   0.923   0.954   0.970
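The size and power entries are rejection proportions over the replicates, which can be computed as in the following Python sketch. The function names are illustrative assumptions, and the Monte Carlo standard error $\sqrt{p(1-p)/N}$ is included only as a standard yardstick for reading a proportion estimated from $N$ replicates; it is not reported in this dissertation.

```python
import math

def rejection_rate(z_values, critical_value):
    """Proportion of replicates whose statistic exceeds the critical value."""
    return sum(1 for z in z_values if z > critical_value) / len(z_values)

def monte_carlo_se(p, replicates):
    """Binomial standard error of an estimated proportion p."""
    return math.sqrt(p * (1.0 - p) / replicates)

# Hypothetical replicate values of Z_n, for illustration only.
z_values = [1.0, 2.0, 3.0, 4.0]
print(rejection_rate(z_values, critical_value=2.5))

# Yardstick for a nominal size of 0.05 estimated from 1000 replicates.
print(monte_carlo_se(0.05, 1000))
```

With 1000 replicates the standard error at a nominal level of 0.05 is roughly 0.007, which gives a scale for judging how far an estimated size such as those in the 0% row of Table 3.5 lies from the nominal level.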

CHAPTER 4 APPLICATION

This chapter applies the proposed ELR method for segmented linear regression to the plasma osmolality data set and the gasoline price data set.

In many biological systems, the response variable changes at a critical point. Usually, the critical point is located by inspection. However, many biologists now prefer to estimate this change-point statistically. The following data set is a good example. The plasma osmolality data set was collected to show arginine vasopressin (AVP) concentration in plasma as a function of plasma osmolality in conscious dogs. Using the parametric maximum likelihood method under a normality assumption for the errors, Vieth [35] uses the segmented linear model to fit AVP concentration against plasma osmolality in conscious dogs to determine the osmotic threshold. By its nature, our proposed ELR method does not require the normality assumption on the error terms.

Figure 4.1(a) is the scatter plot of the data, overlaid with the fitted segmented regression function with the estimated change-point at $\hat{\tau} = 302$, which corresponds to the osmotic threshold and is indicated by the vertical dashed line. Using the ELR method, Figure 4.1(b) plots the −2 logarithm of the empirical likelihood ratio for each possible time of change between 3 and $n - 3$. The estimated time of change is $\hat{k}^* = 42$, highlighted by the solid dot $(42, M_n')$.

[Figure 4.1: panel (a), AVP (pg/ml) versus Plasma osmolality (mOsm/kg); panel (b), −2logELR(k) versus k.]

Figure 4.1 : (a) The scatter plot of AVP versus plasma osmolality with fitted segmented linear regression. (b) The plot of the −2 logarithm of the empirical likelihood ratio versus all the possible times of change k.

With the estimated change-point at $\hat{\tau} = 302$, the least squares fitted segmented linear regression is

$$
\text{AVP} = -0.002 + 0.01 * \text{Plasma Osmolality} + 0.52 * (\text{Plasma Osmolality} - 302)_+,
$$

and the corresponding $R^2$ is 73% with the estimated standard deviation of 1.60. In Vieth's paper, $R^2$ for this data set is 73%, so our result is comparable to Vieth's result.

The next application is the gasoline price data set, which contains 26 years' gasoline prices. This data set was released on September 26, 2011 and can be obtained online from the Energy Information Administration. The link is as follows: http://www.eia.gov/oil_gas/petroleumdata_publications/wrgp/mogas_history.html. This data set represents the annual average gasoline prices from 1976 through 2011. From Figure 4.2, we can easily see that the gasoline prices changed dramatically after the year 2000. The ELR method also suggests that the change-point is the year 2000, and the least squares fitted segmented linear regression is

$$
\text{Gasoline Price} = -0.000236 + 0.000571 * \text{Year} + 0.223079 * (\text{Year} - 2000)_+,
$$

and the corresponding $R^2$ is 89.43% with the estimated standard deviation of 0.25.

[Figure 4.2: panel (a), Gas Price versus Year; panel (b), −2logELR versus k.]

Figure 4.2 : (a) The scatter plot of gasoline price versus year with fitted segmented linear regression. (b) The plot of the −2 logarithm of the empirical likelihood ratio versus all the possible times of change k.
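The two fitted equations share the form $a + b\,x + c\,(x - \tau)_+$, which the short Python sketch below evaluates. It is only an illustration: the function name is hypothetical, and the coefficients are the ones read from the fitted equations of this chapter, with both intercepts taken as negative (−0.002 and −0.000236), which is an assumption about the sign.

```python
def segmented_mean(x, intercept, slope, slope_change, tau):
    """Continuous two-phase line: intercept + slope*x + slope_change*(x - tau)_+."""
    return intercept + slope * x + slope_change * max(x - tau, 0.0)

# Coefficients as read from the fitted AVP equation (intercept sign assumed).
avp = dict(intercept=-0.002, slope=0.01, slope_change=0.52, tau=302.0)
for osmolality in (295.0, 302.0, 310.0):
    print(osmolality, segmented_mean(osmolality, **avp))

# Coefficients as read from the fitted gasoline price equation (intercept sign assumed).
gas = dict(intercept=-0.000236, slope=0.000571, slope_change=0.223079, tau=2000.0)
for year in (1990.0, 2000.0, 2010.0):
    print(year, segmented_mean(year, **gas))
```

The fitted mean is continuous at the change-point, and the slope to the right of it is the sum of the two slope coefficients, for example 0.01 + 0.52 = 0.53 for the AVP fit.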

CHAPTER 5 CONCLUSION

This dissertation proposes a nonparametric empirical likelihood based test statistic for the detection of potential change-points in segmented linear regression models. If a change-point is detected, then an EL-based change-point estimator is defined, along with the estimator of the regression coefficients.

Under the null hypothesis of no change, the simulation studies suggest that the proposed test statistic is asymptotically Gumbel extreme value distributed. The location and scale parameters are increasing functions of the sample size and are robust to various parameter settings and error distributions. The simulation under the alternative hypothesis shows that the proposed test is able to control the size and attain a high power when the sample size is large. Finally, it is shown that the proposed empirical likelihood method performs better than Chen's information criterion based method and Muggeo's parametric method in accurately detecting the true time of change. Also, the proposed empirical likelihood method performs better than the SEG method and the SIC method when the slope change is small. However, when the error variance is small, all three methods are comparable.

BIBLIOGRAPHY

[1] Bhattacharya, P.K. (1991). Weak convergence of the log likelihood process in the two-phase linear regression problem. Probab. Statist. Design Experiments, 145-156.

[2] Bhattacharya, P.K. (1994). Some aspects of change-point analysis. Change-Point Problems (E. Carlstein, H.-G. Müller and D. Siegmund, eds.), 28-56.

[3] Berman, N.G., et al. (1996). Applications of segmented regression models for biomedical studies. American Journal of Physiology, 270, 723-732.

[4] Chen, J. (1998). Testing for a change point in linear regression models. Communications in Statistics - Theory and Methods, 27:10, 2481-2493.

[5] Chen, J. and Gupta, A.K. (2000). Parametric Statistical Change Point Analysis, Birkhäuser, Boston.

[6] Chow, G. (1960). Tests of equality between two sets of coefficients in two linear regressions. Econometrica, 28, 591-605.

[7] Davies, Robert B. (1987). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika, 74, 33-43.

[8] Csörgő, M. and Horváth, L. (1997). Limit Theorems in Change-Point Analysis, Wiley Series in Probability and Statistics.

[9] Dong, L.B. (2004). Testing for structural change in regression: an empirical likelihood approach. Econometrics Working Paper, 0405.

[10] Feder, P.I. (1975a). Asymptotic distribution theory in segmented regression problems-identified case. The Annals of Statistics, 3, 49-83.

[11] Feder, P.I. (1975b). The log likelihood ratio in segmented regimes. The Annals of Statistics, 3, 84-97.

[12] Fiteni, I. (2004). τ-estimators of regression models with structural change of unknown location. Journal of Econometrics, 119, 19-44.

[13] Gbur, E.E., Thomas, G.L. and Miller, F.R. (1979). The use of segmented regression models in the determination of the base temperature in heat accumulation models. Agronomy Journal, 71, 949-953.

[14] Hawkins, D.M. (1980). A note on continuous and discontinuous segmented regressions. Technometrics, 22, 443-444.

[15] Hinkley, D.V. (1969). Inference about the intersection in two-phase regression. Biometrika, 56, 495-504.

[16] Hinkley, D.V. (1971). Inference in two-phase regression. Journal of the American Statistical Association, 66, 736-743.

[17] Koul, H.L. and Qian, L.F. (2002). Asymptotics of maximum likelihood estimator in a two-phase linear regression model. Journal of Statistical Planning and Inference, 108, 99-119.

[18] Lund, R. and Reeves, J. (2002). Detection of undocumented changepoints: A revision of the two-phase regression model. Journal of Climate, 15, 2547-2554.

[19] Luwel, K., Beem, A.L., Onghena, P. and Verschaffel, L. (2001). Using segmented linear regression models with unknown change points to analyze strategy shifts in cognitive tasks. Behavior Research Methods, Instruments, & Computers, 33, 470-478.

[20] Muggeo, V.M.R. (2003). Estimating regression models with unknown break-points. Statistics in Medicine, 22, 3055-3071.

[21] Owen, A.B. (1991). Empirical likelihood for linear models. The Annals of Statistics, 19, 1725-1747.

[22] Owen, A.B. (2001). Empirical Likelihood, Chapman & Hall/CRC.

[23] Pastor, R. and Guallar, E. (1998). Use of two-segmented logistic regression to estimate change-points in epidemiologic studies. American Journal of Epidemiology, 148, 631-642.

[24] Piegorsch, W.W. and Bailer, A.J. (1997). Statistics for Environmental Biology and Toxicology. Chapman and Hall.

[25] Qian, L.F. (1998). On maximum likelihood estimation for a threshold autoregression. Journal of Statistical Planning and Inference, 75, 21-46.

[26] Qian, L.F. and Yao, Q.C. (2002). Software project effort estimation using two-phase linear regression models. Proceedings of the 15th Annual Motorola Software Engineering Symposium (SES).

[27] Qian, L.F. and Ryu, S.Y. (2006). Estimating tree resin dose effect on termites. Environmetrics, 17, 183-197.

[28] Quandt, R.E. (1958). The estimation of the parameters of a linear regression system obeying two separate regimes. Journal of the American Statistical Association, 53, 873-880.

[29] Quandt, R.E. (1960). Tests of the hypothesis that a linear regression system obeys two separate regimes. Journal of the American Statistical Association, 55, 324-330.

[30] Robbins, M., Gallagher, C. and Lund, R. (2009). Mean shift testing with correlated data. Manuscript.

[31] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

[32] Smith, A.F.M. and Cook, D.G. (1980). Straight lines with a change point: A Bayesian analysis of some renal transplant data. Applied Statistics, 29, 180-189.

[33] Toms, J.D. and Lesperance, M.L. (2003). Piecewise regression: A tool for identifying ecological thresholds. Ecology, 84, 2034-2041.

[34] Ulm, K.W. (1991). A statistical method for assessing a threshold in epidemiological studies. Statistics in Medicine, 10, 341-349.

[35] Vieth, E. (1989). Fitting piecewise linear regression functions to biological responses. Journal of Applied Physiology, 67, 390-396.

[36] Zeileis, A. (2006). Implementing a class of structural change tests: an econometric computing approach. Computational Statistics & Data Analysis, 50, 2987-3008.