INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

Bell & Howell Information and Learning
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
800-521-0600

NOTE TO USERS

Page(s) not included in the original manuscript are unavailable from the author or university. The manuscript was microfilmed as received.

This reproduction is the best copy available.

UMI

Modeling Covariance in Multi-Path Changepoint Problems

Masoud Asgharian Dastenaei
Department of Mathematics and Statistics
McGill University, Montreal

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements of the degree of Doctor of Philosophy

© Masoud Asgharian Dastenaei 1998

National Library of Canada / Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services
395 Wellington Street
Ottawa ON K1A 0N4
Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

To the memory of my mother, Hagar, who constantly supported me up to the last day of her life and loved to see this moment, but it didn't come true.

And to my wife and dear friend Mojgan

RÉSUMÉ

Although it has been intensively studied in the single-path case, the changepoint problem has been largely ignored in the multi-path case.

In the multi-path setting, it is often useful to determine the impact of covariates on the changepoint itself as well as on the parameters before and after it. This thesis addresses the inclusion of covariates in the changepoint distribution, an aspect never before studied. The model we introduce is based on the hazard of change. It has features which allow one to establish the asymptotic results needed for estimation and testing. Indeed, we establish the consistency of the maximum likelihood estimators of the parameters of our model.

As the proposed model is a mixture, two difficulties associated with such models must be overcome, namely identifiability and positive definiteness of the information matrix. It is established, under suitable conditions, that the set of zeros of the determinant of the information matrix is nowhere dense, thereby compensating for the impossibility of a direct proof of positive definiteness.

Using the simulated annealing method, we carried out some simulations to assess the practicality of our estimation procedure. In the example treated, our estimator appears to be approximately normally distributed, even for moderate sample sizes. The maximum likelihood estimators also appear to approximate their parameters well.

ABSTRACT

Although the single-path changepoint problem has been extensively treated in the statistical literature, the multi-path changepoint problem has been largely ignored.

In the multi-path changepoint setting it is often of interest to assess the impact of covariates on the changepoint itself as well as on the parameters before and after the changepoint. This thesis is concerned with including covariates in the changepoint distribution, a topic never before addressed in the literature. The model we introduce, based on the hazard of change, enjoys features which allow one to establish asymptotic results needed for estimation and testing. Indeed, we establish consistency of the maximum likelihood estimators of the parameters of our model.

As the proposed model is a mixture model, two of the difficulties associated with such models are addressed: identifiability, and positive definiteness of the information matrix. It is shown that under suitable conditions the set of zeros of the determinant of the information matrix is nowhere dense, thus partially compensating for the impossibility of directly establishing positive definiteness.

A limited simulation, using simulated annealing, is carried out to assess how the estimation procedure works in practice. In the example presented, the estimators appear to follow an approximately normal distribution even for moderate sample sizes. The maximum likelihood estimators appear to approximate their parameter counterparts well.

Chapter 4: §3: Lemma 1, Lemma 2, Lemma 3 and Theorem 5, Lemma 5, Lemma 6, Proposition 4, Theorem 7, Lemma 7, and Theorem 8; §4: Lemma 8, Theorem 9 and Theorem 10; §3: establishing asymptotic normality of the maximum likelihood estimators of the unknown parameters in the model introduced in Chapter 2.

ACKNOWLEDGEMENT

David Wolfson has been much more than a supervisor for me. He has been a wise friend. His fastidiousness enormously improved the exposition of this thesis, and his unusual patience gave me the chance to work on a variety of problems and enjoy learning new things. He and his wife, Tina Wolfson of the Division of Clinical Epidemiology at the Jewish General Hospital (JGH), provided me the chance to learn about aspects of statistics not covered in the classroom. Indeed, working at the Jewish General Hospital forced me to understand many things which I had never questioned before. It was at the JGH that I was given the chance to work on survival analysis, my favourite topic in statistics, and where I was introduced to the notion of length-biased sampling. For all this I would like to express my sincere gratitude.

I thank Sanjo Zlobec for interesting lectures on parametric programming and inspiration for working on an ongoing problem. Jal Choksi always gave me the most relevant references to my questions.

Among my friends I should start with Enrique Reyes, whom I tortured pitilessly with my questions on differential geometry. He introduced me to the book by Abraham, Marsden and Ratiu (1988), which turned out to be my main reference in Chapter 3 of this thesis. Luc Lalond was a source of computer skills from which I personally benefited very much. When I was stuck for a long time with an error in my program, he devoted a considerable amount of time to finding the error, although he was very busy himself. Statistical discussion with my old friend Khalil Shafie has always been beneficial for me. I also benefited very much from his computer skills. I would also like to thank Lassina Dembele, who helped me with the translation of the abstract of the thesis. I am very grateful to the Ministry of Higher Education of Iran for supporting me through my education. I would also like to express my acknowledgment to the McGill Major Fellowship Foundation, which awarded me the "175th Anniversary of McGill University" fellowship.

When, for some baseless reason, I was labeled as somebody who had the right to continue his education neither abroad nor even in Iran, it was my Masters supervisor, Siamak Noorbaloochi, who helped me overcome this obstacle. Without him there would not be any thesis, nor even an education towards a PhD. I am most indebted to him for all he did for me.

My siblings have always been a great source of encouragement and inspiration. I don't think there are any words that can express my real gratitude and acknowledgment to them. I am also very grateful to my father and mother-in-law, who helped me and my wife very much.

Contents

Chapter I. INTRODUCTION 3

Chapter II. HAZARD APPROACH IN THE MULTI-PATH CHANGEPOINT PROBLEM 7

1. Introduction

2. Markovian Structure Of The Changepoint Problem

3. Principle of Maximum Entropy and Modeling 22

3.1. Synopsis Of History And Etymology Of The Word "Entropy" 22

3.2. Entropy And Modeling 24

4. Introducing Covariates into the Model

5. Mixture Distributions

Chapter III. CONSISTENCY OF THE MLE

1. Introduction 51

2. Identifiability Of The Model 52

3. Consistency In The Single Parameter Case 61

4. Consistency In The Multiparameter Case For Identifiable Models 72


5. Consistency In The Multiparameter Case For Quasi-Identifiable Models 84

Chapter IV. ASYMPTOTIC NORMALITY 91

1. Introduction 91

2. Preliminaries From Differential Geometry And Functional Analysis 93

3. On Positive-Definiteness Of The Information Matrix 104

4. On The Measure Of A and The Exponential Family 118

5. On The Smoothness And Boundedness Conditions For Asymptotic Normality 124

Chapter V. NUMERICAL ASPECTS AND SIMULATION 131

1. Introduction 131

2. Simulated Annealing Algorithm 132

3. Simulation Results

Chapter VI. FUTURE DIRECTIONS

CHAPTER I
INTRODUCTION

The study of changepoint problems dates back to the 1950's. Page (1954) addressed the identification of subsamples corresponding to different parameter values in a series of papers (1954, 55(a), 55(b), and 57). His main object was the detection of changes in a sequential setting, and he established the statistical foundation for modern quality control. Quandt (1958, 60, and 72) addressed the question in the different framework of so-called switching or two-phase regression. Shiryayev (1961(a), 61(b), 63(a), 63(b), 63(c), 65, 66(a), and 66(b)) studied changepoints (or "disorder", as he called it) from a Bayesian perspective for discrete and continuous time processes. Since these modest beginnings changepoints have permeated both the applied and theoretical literature. Shaban (1980) and Telksnys (1986) contain extensive references on the subject up until these relatively early times.

The recent IMS monograph edited by Carlstein, Muller and Siegmund (1994) shows considerable attention to this subject. For a more recent review of sequential changepoint problems see Lai (1995), while Picard (1985), Giraitis et al. (1996), Kim (1996), and Tang and MacNeil (1993) discuss dependent observations. Tartakovskii (1994) has discussed detection of a changepoint from a decision theoretic perspective. Hu and Rukhin (1995) have studied lower bounds for error probabilities in changepoint estimation. Asymptotic results in the nonparametric setting for changepoint estimation have been given by Carlstein (1988).

Confidence regions and tests for a changepoint in a sequence of independent random variables distributed according to members of an exponential family have been studied by Worsley (1986), Nagaraj and Reddy (1993), and Baron and Rukhin (1997). Smith (1975) has studied the problem in a Bayesian setting and discussed the result for some members of an exponential family. More recently Lee (1997) has considered estimating the number of changepoints in an exponential family. Chen and Gupta (1997) have examined changes in the variance for a Gaussian model using an information theoretic approach. They also applied their results to analyse stock prices. Their approach is based on a binary segmentation argument introduced by Vostrikova (1981) and the Schwarz information criterion (SIC). Brostrom (1997) took a martingale approach for a sequence of Bernoulli random variables. Changepoint problems have also received considerable attention in life testing problems (Luo, Turnbull and Clark (1997)).

The above references focus almost exclusively on so-called "single-path" changepoint analysis. However, when the data consist of several sample paths very little has been done. Joseph (1989) and Joseph and Wolfson (1992, 93, 96a, 96b, and 97) studied the so-called "multi-path" changepoint problem from both a frequentist and Bayesian perspective and applied their results to several different sets of data. Indeed the multi-path changepoint problem seems a natural setting for the assessment of the effects of a treatment when repeated observations are taken in time, on different patients, say.

A question that has so far not been addressed in the changepoint literature is how to introduce and assess the effects of covariates on the changepoint distribution itself. How to incorporate covariates into the before and after changepoint distributions, on the other hand, poses no difficulty and is not the subject of this thesis. With the introduction of covariates, we are drawn inevitably into the multi-path setting.

This thesis is organized as follows. In Chapter 2 we discuss modeling covariates in the multi-path changepoint problem. The Markovian structure of a changepoint problem is discussed in this chapter. This Markovian structure is key in our discussions, which lead to a hazard approach to changepoint problems. We also discuss modeling covariates from an information theoretic perspective. Chapter 3 deals with consistency of the MLEs of the unknown parameters in the model introduced in Chapter 2. The results are fairly general, being applicable, for instance, to mixture distributions with covariates, under certain conditions. In Chapter 4 we discuss asymptotic normality of the MLEs, noting that because of the strong form of identifiability that holds here, the usual difficulties associated with the asymptotics of mixtures do not occur. See Chen (1994) for discussion of the difficulties with the asymptotics of mixtures. In Chapter 5 we carry out a limited simulation whose purpose is mainly to demonstrate the implementation of the model and to examine the behaviour of the MLEs in a controlled setting. In the last chapter we discuss future directions for research.

CHAPTER II
HAZARD APPROACH IN THE MULTI-PATH CHANGEPOINT PROBLEM

1. Introduction

Clinical trials are the main method of assessing the effect of drugs. Typically, patients are monitored for a period of time, resulting in a set of measurements, indexed by time, on each patient. Such data, broadly known as repeated measurements, have been widely studied. (See Crowder and Hand (1990) and the references cited therein.)

In many cases the main concern of the study is to make inference about the time that the drug takes effect. In changepoint terminology, we wish to make inference about the changepoint. If the effect of the drug is immediate, then one can easily apply the standard methods of repeated measurements for statistical inference. In many cases, however, the change may be delayed for an unknown length of time or be gradual.

To clarify this point consider the effect of calcium supplementation on blood pressure. Grouchow et al. (1985), Harlan et al. (1984) and McCarron et al. (1986) all found an inverse relationship between dietary calcium and blood pressure. Sempos et al. (1986) supported this result only among black males but not among other racial groups or females.

Using a randomized clinical trial, Lyle et al. (1987) re-examined the effects of calcium supplementation on the blood pressure of 75 white and black males, aged 19 to 52 years. The subjects were followed for a period of 16 weeks. The first four weeks were taken as a baseline period. During this period, weekly blood pressure was recorded. After this period, within each racial group, the patients were randomly assigned to a calcium intake group (10 black and 27 white men) and a placebo group (21 black and 28 white men). The subjects were then given three calcium tablets per day, and blood pressure measurements were taken every other week for the next 12 weeks, resulting in 6 measurements after taking the tablets. Lyle et al. applied repeated measures methods to assess the effect of calcium intake on blood pressure in this study.

Joseph and Wolfson et al. (1996, p628) argued that the effect of dietary calcium supplementation on lowering blood pressure may not be immediate, but may be delayed until the metabolism adjusts to the increase in calcium. They therefore proposed using a multi-path changepoint model to analyse the results of the study. When comparing the various groups of the study they simply stratified and made pairwise comparisons between strata. A more efficient procedure would be to model group effects, especially when the covariates are polychotomous or continuous and where the simultaneous effect of covariates is of interest. In particular, it may be important to examine the time to response, taking into account, for instance, race, gender and type of blood pressure reading (diastolic or systolic).

We address such issues by modeling explicitly the effects of covariates on the changepoint distribution itself.

This chapter is organized as follows. In section 2 we establish the Markovian structure of the single-path changepoint problem. This result is the backbone of the model which we propose in this thesis. In section 3 we give a synopsis of entropy and Jaynes' Principle of ME (maximum entropy), leading to an information theoretic justification of our model. In section 4 we introduce our model, which allows for covariates in multi-path changepoint problems, and give some preliminaries which are needed in the sequel.

2. Markovian Structure Of The Changepoint Problem

As discussed in the Introduction, the changepoint problem has been treated in the literature extensively. Perhaps the best and most recent review is Carlstein, Muller and Siegmund (1994).

In this section we are mainly concerned with the single-path changepoint problem. Although, for the discussion of the Markovian structure which is presented in this section, it is not essential to refer to the multi-path changepoint setting, we introduce the latter topic because of the central role it plays in the sequel.

We start with the single-path changepoint problem setting. Suppose X_1, X_2, ..., X_τ, X_{τ+1}, ..., X_m is a sequence of random variables such that X_1, X_2, ..., X_τ are realizations from a process, say P_0, whose joint distribution is F_0, and that of X_{τ+1}, X_{τ+2}, ..., X_m is from P_1, with joint distribution F_1. If τ is unknown and τ < m we say that a change has occurred at τ, which is called the changepoint. If τ = m, we say no change has occurred.
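For concreteness, this single-path setup is easy to simulate. The choices P_0 = N(0, 1), P_1 = N(2, 1), m = 10 and τ = 4 below are purely illustrative and not taken from the thesis; they are only a sketch of the setup just described.

```python
import random

def simulate_single_path(m, tau, seed=0):
    """One path of length m with a change at tau.

    X_1, ..., X_tau are drawn from P0 (here N(0, 1)) and
    X_{tau+1}, ..., X_m from P1 (here N(2, 1)); tau == m means no change.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) if k <= tau else rng.gauss(2.0, 1.0)
            for k in range(1, m + 1)]

path = simulate_single_path(m=10, tau=4)
labels = [0 if k <= 4 else 1 for k in range(1, 11)]  # 0 before the change, 1 after
```

In practice only `path` would be observed; `labels` records which process generated each observation.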

In a multi-path changepoint problem we have several paths, and associated with each path we have a changepoint. The observations in a multi-path changepoint setting form the matrix

X_11  X_12  ...  X_1m
X_21  X_22  ...  X_2m
 ...   ...  ...   ...
X_n1  X_n2  ...  X_nm

In the above matrix each row corresponds to one path. There are n paths with m measurements on each. We say a change has taken place at τ_i for the i-th subject when the first τ_i measurements in the i-th path, X_i1, X_i2, ..., X_iτ_i, come from a process, say P_0, and the rest, X_i,τ_i+1, X_i,τ_i+2, ..., X_im, from a different process, P_1. We always assume that P_0 and P_1 are independent. It should be emphasized that associated with each path we have a possibly different changepoint.

We give a different formulation of the changepoint problem than is conventional. Let X_1, X_2, ..., X_m be the sequence of observable random variables. Associated with each X_k, the unobserved random variable θ_k is defined as

θ_k = 0, if X_k is an observation from P_0;  θ_k = 1, if X_k is an observation from P_1.

It should be noted that although the sequence of 0's and 1's is unobservable, there is a 1-1 correspondence between where the change occurs and every sequence of 0's and 1's. In the multi-path setting it is easily conceivable that different paths may generate different sequences of 0's and 1's. As discussed in the example of calcium supplementation, the time that a change occurs for different patients may not be the same. But in a single-path framework we only observe one path, so that in a frequentist setting, the single changepoint τ is regarded as fixed. Consequently the θ_k's are a deterministic sequence of 0's and 1's. From a Bayesian point of view, the changepoint(s) are assigned a prior distribution in both the single and multi-path settings.

The approach below, which assumes the θ_k's to be random, allows us to work with the hazard of change, and it is through the hazard that our modeling will be done.

Using the above discussion, a single-path changepoint problem can be represented by a sequence of random vectors (X_1, θ_1), (X_2, θ_2), ..., (X_m, θ_m). Thus, for the marginal distribution of X_1, ..., X_m, we may write

We can now proceed to model (θ_1, θ_2, ..., θ_m).

Without loss of generality assume that the sequence (θ_1, θ_2, ..., θ_m) starts with a zero. We then have a sequence of binary random variables which consists of a sequence of 0's followed by a sequence of 1's. The important feature of the sequence is that we have only one switch, from 0 to 1. In other words, as soon as the change takes place we can never return to 0. In the language of stochastic processes this means that 1 is an absorbing state. The following theorem shows that this sequence of random variables must be a two-state Markov chain.

THEOREM 1. Let {θ_k}_{k=0}^∞ be a sequence of binary random variables defined on a probability space P = (Ω, F, P) such that ∀k ∈ ℕ₀ (= ℕ ∪ {0}), θ_k assumes values in {0, 1} and that the following conditions are fulfilled:

1) P(θ_0 = 0) = 1

2) P(θ_{k+1} = 1 | θ_k = 1) = 1, ∀k ∈ ℕ.

Then there exists a modification of {θ_k}_{k=0}^∞, say {θ̃_k}_{k=0}^∞, and a sequence 0 < π_k ≤ 1 such that {θ̃_k}_{k=0}^∞ is a Markov chain with the following transition matrix

Π(k) = [ 1 − π_k   π_k ]
       [    0       1  ]

PROOF. Let A_k := {ω ∈ Ω | θ_k(ω) = 1, θ_{k+1}(ω) = 0}, ∀k ∈ ℕ. By definition P(A_k | θ_k = 0) = 0. Using assumption 2 we also have P(A_k | θ_k = 1) = 0, which implies P(A_k) = 0. Now define θ̃_0 ≡ 0, and for k ≥ 1

θ̃_k(ω) = θ_k(ω), if ω ∈ Ω \ ∪_{l=0}^{k−1} A_l;  θ̃_k(ω) = 1, if ω ∈ ∪_{l=0}^{k−1} A_l (which happens w.p. 0).

Using the fact that this holds for any Borel set B ⊂ B(ℝ), θ̃_k is measurable with respect to (Ω, F). Since P(A_k) = 0 ∀k ∈ ℕ, it is easily seen that {θ̃_k}_{k=0}^∞ is a modification of {θ_k}_{k=0}^∞. Consequently, conditions 1 and 2 are fulfilled for {θ̃_k}_{k=0}^∞.

Now define π_k = P(θ̃_k = 1 | θ̃_{k−1} = 0) ∀k ∈ ℕ and Ã_k = {ω ∈ Ω | θ̃_k(ω) = 1}. It is clear that Ã_k ↑. Let

E_k = E_k(t_1, ..., t_k) = {l : t_l = 1 for 0 ≤ l ≤ k}.

Since Ã_k ↑, it is easily seen that the conditional probability of {θ̃_{k+1} = 1} given {θ̃_l = t_l, l = 0, 1, 2, ..., k} depends only on t_k. This implies that, when t_k = 1, the conditional probability is equal to P(θ̃_{k+1} = 1 | θ̃_k = 1), by assumption 2. Otherwise t_k = 0, which means the conditional probability is π_{k+1}. □

The above theorem allows us to regard the changepoint as the instant of entry into the absorbing state of the possibly non-stationary Markov chain {θ̃_k}_{k=0}^∞, with transition matrices Π(k) defined in Theorem 1. That is,

τ = min{k ∈ ℕ | θ̃_k = 1}.

It is easy to see that π_k is the hazard of having a change at time k. For, by the definition of hazard,

P(τ = k | τ ≥ k) = [π_k ∏_{l=1}^{k−1} (1 − π_l)] / [1 − Σ_{l=1}^{k−1} π_l ∏_{u=1}^{l−1} (1 − π_u)] = π_k.
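This relationship between the hazard sequence and the changepoint distribution is easy to check numerically: build P(τ = k) from an arbitrary hazard sequence and recompute the hazard as P(τ = k)/P(τ ≥ k). The sketch below uses arbitrary illustrative hazard values, not quantities from the thesis.

```python
def changepoint_pmf(pi):
    """P(tau = k) = pi_k * prod_{l<k}(1 - pi_l), for k = 1..len(pi)."""
    pmf, surv = [], 1.0          # surv tracks P(tau >= k)
    for p in pi:
        pmf.append(p * surv)
        surv *= 1.0 - p
    return pmf

pi = [0.1, 0.3, 0.2, 0.5]        # arbitrary hazard sequence
pmf = changepoint_pmf(pi)

# recover the hazard: P(tau = k) / P(tau >= k) equals pi_k
surv = 1.0
for p, f in zip(pi, pmf):
    assert abs(f / surv - p) < 1e-12
    surv -= f
```

After the loop, `surv` is the probability that no change has occurred among the observations considered.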

REMARK 1. It should be noted that ∏_{l=1}^{∞} (1 − π_l) may not be zero. For example, if π_l = 2/((l+1)(l+2)), then ∏_{l=1}^{∞} (1 − π_l) = 1/3. This is easily seen by forming the partial products,

∏_{l=1}^{n} (1 − π_l) = (n + 3)/(3(n + 1)).
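Remark 1's example, π_l = 2/((l+1)(l+2)) with limiting product 1/3, can be verified directly; telescoping the partial products gives the closed form (n + 3)/(3(n + 1)). A quick numerical check:

```python
def partial_product(n):
    """Partial product prod_{l=1}^{n} (1 - pi_l) with pi_l = 2/((l+1)(l+2))."""
    prod = 1.0
    for l in range(1, n + 1):
        prod *= 1.0 - 2.0 / ((l + 1) * (l + 2))
    return prod

# telescoping gives (n + 3) / (3 (n + 1)), which tends to 1/3
for n in (1, 10, 100, 1000):
    assert abs(partial_product(n) - (n + 3) / (3.0 * (n + 1))) < 1e-12

assert abs(partial_product(10_000) - 1.0 / 3.0) < 1e-3  # limit is 1/3, not 0
```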

Now suppose that the chain is stopped at time n. Equation (2.3) suggests the following model for the probability of having a changepoint among the first n − 1 observations

One could model the distribution of the changepoint directly, i.e. by choosing some form for P(τ = k). Using the above result we have an alternative method for modeling the distribution of the instant of change through the hazard. Simple choices for the form of π_k as a function of k include

(I) π_k = π ∀k, and

(II) π_k is piecewise constant.

Either (I) or (II) may be justified by consideration of the situation under which the data are observed. For instance, in a clinical trial, it may be believed that the hazard for a change in response is approximately piecewise constant. Alternatively, one may have only limited prior knowledge about π_k that is quantified through one or more constraints. It may then be reasonable to construct a model of "greatest ignorance" subject to the given constraints, by giving an information theoretic justification. In this thesis we shall assume a constant form for π_k, and show what information theoretic argument leads to this choice.
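Under choice (I), the constant hazard π yields a truncated geometric changepoint distribution: P(τ = k) = π(1 − π)^{k−1} for k = 1, ..., n − 1, with the remaining mass on "no change among the first n − 1 observations". A minimal sketch (the values of π and n below are illustrative, not from the thesis):

```python
def constant_hazard_pmf(pi, n):
    """Changepoint distribution when the chain is stopped at time n.

    Returns (P(tau = 1), ..., P(tau = n - 1)) together with the leftover
    mass P(no change among the first n - 1 observations) = (1 - pi)^(n - 1).
    """
    pmf = [pi * (1.0 - pi) ** (k - 1) for k in range(1, n)]
    no_change = (1.0 - pi) ** (n - 1)
    return pmf, no_change

probs, no_change = constant_hazard_pmf(pi=0.2, n=10)
assert abs(sum(probs) + no_change - 1.0) < 1e-12  # a proper distribution
```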

Although the constant hazard model for the changepoint has been used by Shiryayev (1973) and Zacks (1983), our approach is completely different. The major distinction is in the use of Theorem 1. This point will be elucidated in the next section. In the next section we show that there exists a non-constant hazard model which "suggests" the following distribution for the instant of change:

P(τ = k) = π(1 − π)^{k−1}, k = 1, 2, ..., m − 1.    (2.5)

We show, however, that while there is a non-constant hazard model for which the distribution of the changepoint is proportional to (2.5) on 1, 2, ..., m − 1, (2.5) is the distribution of the changepoint if and only if π_i = π for i = 1, 2, ..., m. This implies that there is a one-to-one correspondence between the model (2.5) and the constant hazard model.

The most important benefit of a constructive approach is that the reader can evaluate the assumptions and the structure which is imposed on the problem. Also, a constructive approach may suggest ways for generalization. In many cases, however, it may not be possible to take a constructive approach at all.

We approach our model (2.3) through Theorem 1 and the hazard. An applied statistician may, after assessing past studies or through familiarity with the underlying mechanisms of the data, decide that the hazard is approximately constant. On the other hand, in an unfamiliar setting one may wish the hazard to reflect one's "ignorance" of the situation, apart from one or more constraints. In the next section, we use information theory to build our constant hazard model, as one of least information.

Without additional assumptions, Theorem 1 cannot be extended to the more general setting of changepoint problems with more than one changepoint. The following generalization of Theorem 1 is the first step in understanding the structure of multi-changepoint problems. The generalization below does not pertain to this thesis directly, which is concerned with paths containing at most one changepoint. We provide these additional results for completeness and as a possible starting point for future research.

Roughly speaking, if we assume that the system is subjected to d changes, there exists a modification of the governing stochastic process of the system, say {θ̃_k}_{k∈ℕ₀}, which can be decomposed into a sum of d binary Markov chains, {θ̃_k^s}_{k∈ℕ₀}, s = 1, 2, ..., d, each with one absorbing state. Using the decomposition we can obtain a model for the joint distribution of the instants of change and also answer another question which is of interest.

THEOREM 2. Let {θ_k}_{k=0}^∞ be a sequence of random variables defined on a probability space P = (Ω, F, P) such that ∀k ∈ ℕ₀, θ_k takes on a value in C = {0, 1, 2, ..., d}, and such that the following conditions are fulfilled:

where C_s = {s, s + 1, ..., d}, s = 1, 2, ..., d and C_0 = C. Then there exists a modification of {θ_k}_{k=0}^∞ which can be decomposed into a sum of d binary Markov chains, each with one absorbing state.

PROOF. Define f_s : C_{s−1} → {0, 1}. Let τ_s = min{k ∈ ℕ₀ | θ_k = s} for s = 1, 2, ..., d, define θ_k^1(ω) = f_1(θ_k(ω)), and

θ_k^s(ω) = f_s(θ_k(ω)) for s = 2, 3, ..., d, k ∈ ℕ₀.

It is easily seen that θ_k(·) = Σ_{s=1}^{d} θ_k^s(·). Moreover, using assumptions 1 and 2, the conditions of Theorem 1 are fulfilled by {θ_k^s}_{k∈ℕ₀} for s = 1, 2, ..., d. Now there exists a modification, say {θ̃_k^s}_{k∈ℕ₀}, for s = 1, 2, ..., d, such that {θ̃_k^s}_{k∈ℕ₀} is a two-state Markov chain. Utilizing assumption 2 it is easily seen that state 1 is an absorbing state of {θ̃_k^s}_{k∈ℕ₀} for s = 1, 2, ..., d. Let π_k^s = P(θ̃_k^s = 1 | θ̃_{k−1}^s = 0). Then θ̃_k(ω) = Σ_{s=1}^{d} θ̃_k^s(ω) is the desired modification. □

Using Theorem 2, the following corollary gives the joint distribution of the instants of change.

COROLLARY 1. Under the conditions of Theorem 2 we have

PROOF. Using conditioning we have

and by definition of θ̃_k^s,

On the other hand,

which implies

This completes the proof. □

The above results still do not prove that {θ_k}_{k∈ℕ₀} is a Markov chain. Nonetheless, we can obtain P(θ_k = s), which is of interest since it gives the probability of s changes by time k. If, for example, the changes represent important points in the natural history of a disease, or the progress of the impact of a drug, it may be of prime importance to estimate P(θ_k = s). For illustrative purposes we treat this question for d = 2. Since the method is recursive it can be used for other cases.

It is easily seen that P(θ_k = 0) = ∏_{l=1}^{k} (1 − π_l^1). Now defining W_s = τ_s − τ_{s−1} for s = 1, 2, ..., d, we have τ_s = Σ_{u=1}^{s} W_u. Since P(θ_k < s) = P(τ_s > k), to find P(θ_k < 2) it suffices to find P(τ_2 > k). Interchanging the order of the summations and letting v = l − t, the expression becomes

P(θ_k < 2) = ∏_{l=1}^{k} (1 − π_l^1) + Σ_{t=1}^{k} π_t^1 [∏_{l=1}^{t−1} (1 − π_l^1)] [∏_{l=1}^{k−t} (1 − π_l^2)].

Now

P(θ_k = 1) = P(θ_k < 2) − P(θ_k = 0) and P(θ_k = 2) = 1 − P(θ_k < 2).
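For constant hazards these probabilities are straightforward to compute and to check by simulation. The sketch below compares the closed form for P(θ_k = s) with a Monte Carlo estimate, assuming (as in the derivation above) that the waiting time to the second change starts after the first change; the values π = 0.2, φ = 0.3 and k = 8 are arbitrary illustrative choices.

```python
import random

def two_change_probs(pi, phi, k):
    """P(theta_k = 0), P(theta_k = 1), P(theta_k = 2) for constant hazards."""
    p0 = (1.0 - pi) ** k
    # P(theta_k < 2) = P(theta_k = 0) + sum_t P(tau_1 = t) * P(W_2 > k - t)
    p_lt2 = p0 + sum(pi * (1.0 - pi) ** (t - 1) * (1.0 - phi) ** (k - t)
                     for t in range(1, k + 1))
    return p0, p_lt2 - p0, 1.0 - p_lt2

def monte_carlo(pi, phi, k, n_sim=200_000, seed=1):
    """Estimate the same probabilities by simulating the two waiting times."""
    rng = random.Random(seed)
    counts = [0, 0, 0]
    for _ in range(n_sim):
        tau1 = 1
        while rng.random() >= pi:   # geometric waiting time, hazard pi
            tau1 += 1
        w2 = 1
        while rng.random() >= phi:  # second waiting time, hazard phi
            w2 += 1
        s = (tau1 <= k) + (tau1 + w2 <= k)  # number of changes by time k
        counts[s] += 1
    return [c / n_sim for c in counts]

exact = two_change_probs(0.2, 0.3, k=8)
approx = monte_carlo(0.2, 0.3, k=8)
assert all(abs(e - a) < 0.01 for e, a in zip(exact, approx))
```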

If π_k^s = π^s for s = 1, 2, the above expression can be simplified as follows:

Therefore

All the above results suggest a Markovian structure for multi-changepoint problems with the following probability transition matrix

We discuss the case of d = 2. In order to simplify our notation, we replace π^1 and π^2 by π and φ, respectively. By definition τ_1 = min{k ∈ ℕ | θ_k = 1} and τ_2 = min{k ∈ ℕ | θ_k = 2}. It was shown that

where k_1 < k_2, and k_1, k_2 ∈ ℕ. In this case, as in the case of one changepoint, either both changes may occur, or the first change may occur but the second may not, or neither may occur. The complete model for the distribution is:

Replacing ∞ by m we obtain a model for the distribution of changes that occur in the first m observations. For instance, assuming π_n = π and φ_n = φ ∀n ∈ ℕ, the model reduces to

Since we are concerned with discrete processes we do not discuss the continuous case. It is easily seen that the same idea can be employed for the continuous case, which yields a continuous version of the results given in this section. For example, an analogous version of (2.5) for the case of constant hazard would be

3. Principle of Maximum Entropy and Modeling

3.1. Synopsis Of History And Etymology Of The Word "Entropy".

The word " Entropy" was apparentiy introduced in the context of t hermodynamics by Clausius around 1857 who was one of the two (the other one was Carnot) pioneers of the second law of thrmodynamics. There are different viewpoints concerning the etymology of " Entropy" . According to Clausius it cornes from the

Greek word τροπή, which means "a turning point". According to Prigogine the Greek origin of the word is εντροπή, meaning "evolution". Clausius states that he added the prefix εν just to make the word sound like "energy". Clausius' entropy is a function of macroscopic quantities of a system (by macroscopic quantities we mean quantities such as volume and pressure) which can be measured in a laboratory. In fact, for a closed system, Clausius' entropy is the summation of the infinitesimal changes in heat energy times the reciprocal of the (absolute) temperature over a path of equilibrium states.

Thermodynamics is the study of the relationships between macroscopic features of a physical system. The many unanswered questions in thermodynamics necessitated the development of a new theory, which became known as statistical mechanics. It was initiated by the work of Krönig and Clausius around 1857 in a rather vague form, followed by Maxwell and Boltzmann, whose contributions were substantial. They introduced a new formulation for statistical mechanics through the famous H-Theorem and Ergodic Assumption of Boltzmann.

The Boltzmann H-function is essentially the entropy function known as Shannon's entropy function, except for the sign (it is the negative of the entropy). We discuss information entropy in the next subsection. The Boltzmann H-Theorem simply says that the H-function can only decrease with time. The Ergodic Assumption is motivated by the experimental fact that an isolated quantity of gas which is not in a state of equilibrium will go to thermal equilibrium and will remain there permanently.

Despite the fact that Boltzmann and Maxwell explained several ambiguities through their new formulations, there remained many questionable and sometimes even erratic concepts, such as the notion of probability. Gibbs tried to resolve these ambiguities and some internal contradictory features, which had already been pointed out by E. Zermelo (1896), using an axiomatic approach to statistical mechanics. But the issues were still not completely settled.

When Claude Shannon proposed his celebrated function, he wanted to term it a "measure of information". He hesitated, however, because of the broad meaning that information can cover. He therefore contacted his friend John von Neumann, and it was von Neumann who suggested naming it "Entropy", for two reasons:

"First, the expression is the same as the ezpression for entropy in thermodynamics and as such you should not use two different names for the same mathematical expression, and second, and more importantly, entropy, an spite of one hundreù years of history, as not very well understood yet and so, crs such, you will win euery time you use entropy in an argument!"

E. T. Jaynes understood very well the value of Shannon's measure and its potential, in the context of statistical mechanics, to provide an axiomatic approach to treat statistical mechanics and to resolve the contradictory features of the subject. In his celebrated papers of 1957 in Physical Review he introduced his approach and derived some of the most important results of statistical mechanics using his "Principle of Maximum Entropy". It was a breakthrough in statistical mechanics and opened the way to the application of maximum entropy methods to other disciplines.

REMARK 2. The following bibliography is pertinent to the subject: Landsberg (1990) for thermodynamics, especially the second law of thermodynamics, which is discussed in chapter 5 of the book; Jaynes (1983) (edited by Rosenkrantz), especially papers number 2, 3, and 14; Ehrenfest and Ehrenfest (1990), chapters 1 and 2; Kapur and Kesavan (1992), chapter 1, especially Von Neumann's statement; Hobson (1971), which gives a thorough account of the concepts of statistical mechanics.

3.2. Entropy And Modeling. There are many different sets of postulates from which Shannon's entropy can be derived. We discuss and give some references for these different sets of postulates at the end of this section. Here we use the postulates introduced by A. I. Khinchin (1957). This choice is based on our view that they have an intuitive appeal. We try to justify each postulate as far as we can. We also state the uniqueness theorem for Shannon's entropy function because of the important role that it plays in our exposition. Our attention is confined to finite schemes, by which we mean a set of finitely many mutually exclusive events A_k, k = 1, 2, ..., m, and a corresponding set of p_k, k = 1, 2, ..., m, where

p_k ≥ 0, Σ_{k=1}^{m} p_k = 1, and p_k is the probability of occurrence of A_k. A finite scheme is denoted by A,

and its entropy by H(p_1, p_2, ..., p_m), or briefly by H(A). We start with the postulates. Let the real-valued function H, with domain in ℝ^m, satisfy:

I. H(p_1, p_2, ..., p_m) is continuous with respect to all of its arguments.

II. H(p_1, p_2, ..., p_m) is maximized for p_i = 1/m, i = 1, 2, ..., m.

III. H(AB) = H(A) + H_A(B), where H_A(B) = Σ_{k=1}^{m} p_k H_k(B) and H_k(B) is the entropy of scheme B given that event A_k has occurred; that is, the entropy of the scheme B conditional on A_k.

IV. H(p_1, p_2, ..., p_m, 0) = H(p_1, p_2, ..., p_m).

THEOREM (Khinchin (1957)): Suppose H(p_1, p_2, ..., p_m) is a function which fulfils postulates I, II, III, IV, and for any m it is defined for all p_k, k = 1, 2, ..., m, where p_k ≥ 0 and Σ_{k=1}^{m} p_k = 1. Then there is a positive constant λ such that

H(p_1, p_2, ..., p_m) = −λ Σ_{k=1}^{m} p_k log p_k.

Condition I is a smoothness condition which says that small variations in the arguments result in small variations in the entropy. Condition II is in the spirit of Laplace's Principle of Insufficient Reason, condition III is a reflection of the definition of conditional probability, and finally condition IV says the entropy of a scheme does not change by adding impossible events.

Among the above conditions there is only one, namely III, which needs some more discussion. Indeed, condition III is a generalization of the following desired condition

III'. H(AB) = H(A) + H(B) for any two independent schemes A and B.

An immediate natural question which now arises is: "Is it possible to replace III by III' in the above theorem and still have the same result?" The answer is not affirmative. In fact, Rényi (1960) has shown that for all α > 0, α ≠ 1,

H_α(p_1, ..., p_m) = (1/(1 − α)) log Σ_{k=1}^{m} p_k^α

fulfils conditions I, II, III', and IV. Using L'Hospital's rule it is easily seen that lim_{α→1} H_α(p_1, ..., p_m) = −Σ_{k=1}^{m} p_k log p_k, Shannon's entropy.

I_α = −H_α is called Rényi's information, which plays an important role in statistical mechanics and nonlinear dynamics. It can easily be seen that I_0 = −log(m) (here we are assuming p_k ≠ 0 for k = 1, 2, ..., m) and |I_0| grows logarithmically with m. For more discussion on the properties of I_α see Beck and Schlögl (1993, chapter 2, section 5) and Rényi (1960).
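A small numerical sketch of H_α and its α → 1 limit (the function names are ours; recall that the text's I_α is −H_α, so I_0 = −log(m) when all p_k > 0):

```python
import math

def renyi_entropy(p, alpha):
    """H_alpha(p) = log(sum_k p_k^alpha) / (1 - alpha), for alpha != 1."""
    return math.log(sum(pk ** alpha for pk in p)) / (1 - alpha)

def shannon_entropy(p):
    """H(p) = -sum_k p_k log p_k, the alpha -> 1 limit of H_alpha."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)
```

Evaluating H_α at α slightly above 1 approximates Shannon's entropy, illustrating the L'Hospital limit stated in the text.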

Shannon's original postulates are slightly different from the ones introduced by Khinchin (1957). For Shannon's postulates we refer the reader to Shannon and Weaver (1949, p. 49). It should also be emphasized, though it is not very crucial, that

Shannon did not state all of his postulates. For his complete set of postulates see Kapur and Kesavan (1992, p. 24).

The postulates used by Khinchin (1957) indicate some intuitive features which any criterion for uncertainty is expected to have. In the light of this intuition and Laplace's Principle of Insufficient Reason (IR), a sensible generalization of

IR can be introduced using Shannon's entropy function. This was proposed by E. T. Jaynes.

In 1957 E. T. Jaynes proposed his Principle of Maximum Entropy (ME). According to this principle, the distribution which maximizes Shannon's entropy function under all the available information, represented as constraints, is the most "objective" distribution given the available information. In fact, ME gives the closest possible distribution to the uniform distribution among all the distributions which satisfy the constraints. For a thorough discussion of this principle see Jaynes (1983, paper #6).

In this subsection we discuss how one can base one's modeling approach on the ME principle, with emphasis on the derivation of the so-called constant hazard model (2.5). The discussion is divided into two subsubsections. In the first subsubsection we give a justification for our constant hazard model introduced in the last section. We start with proportions and then show how the conditions which lead us to (2.5) in the ME approach are meaningful when we work with hazards instead of proportions. In the second subsubsection we incorporate the Markovian structure of the changepoint problem to introduce an entropy criterion for non-homogeneous Markov chains. Consistency of this criterion is discussed and then it is employed in modeling the distribution of the changepoint. It is shown that this approach provides several different models. This shows that the information

provided through Theorem 1 on the structure of the changepoint problem can be crucial. We finally discuss the underlying assumptions which lead us to a constant hazard in this new formulation.

ME Approach and Hazards. We justify our constant hazard model (2.5) by invoking the principle of maximum entropy (ME). Let us begin by considering a rather general optimization problem:

(A)   max −Σ_{k=1}^{m} p_k log(p_k)

Subject to:

Σ_{k=1}^{m} k p_k = C,   Σ_{k=1}^{m} p_k = 1,   p_k ≥ 0,

where C is a constant. The p_k's will be given an interpretation in the context of our changepoint problem.

The solution of the above problem is p_k = e^{λk} / Σ_{j=1}^{m} e^{λj}, k = 1, 2, ..., m, where λ is the Lagrange multiplier determined by C, with corresponding hazard h(k) = p_k / Σ_{j=k}^{m} p_j.

It is clear that h(m) = 1.

In fact, h(k) is what is usually referred to as "the hazard at instant k". Suppose we interpret the p_k's as p_k = P(τ = k). Then the maximization problem (A) may be regarded as one of ascertaining the "least informative" family of models for P(τ = k) (and thus the π_k's) subject to a constraint on E(τ). Here, however, the constraining constant C (and hence λ) generates the members of the family. Once data are collected, the specific member of the family may then be estimated by using the model that we have selected by maximum entropy. From (2.5), writing

κ = (1 − e^λ), λ < 0, we see that in the case of constant transition probabilities, P(τ = k) = (1 − e^λ) e^{λ(k−1)} = p̃_k, say, for k = 1, 2, ..., m − 1, and p̃_m = e^{λ(m−1)}. Therefore

p_k = c_1 p̃_k, where c_1 = 1/(1 − e^{λm}), for k = 1, 2, ..., m − 1. For k = m, we have

p_m = c_2 p̃_m, where c_2 = (1 − e^λ)/(1 − e^{λm}). This argument shows that if we leave out k = m, the distribution of the changepoint, τ, obtained using a maximum entropy argument which does not restrict the hazard to be constant leads to values of P(τ = k) which are proportional to those that may be written down assuming a constant hazard model. That is, there are two approaches: (i) Start with a constant hazard model for the sequence of unobserved 0-1 indicators, which can be shown to define a Markov chain. This choice of hazard may be based on past data and/or theoretical considerations. Such a choice of hazard implies a specification of p̃_k = P(τ = k). (ii) Start with the set of probabilities p_k = P(τ = k), placing no restriction on them other than the specification of E(τ) and Σ_{k=1}^{m} p_k = 1. Here, there is no insistence on a constant hazard. Search for values of p_k, k = 1, 2, ..., m, under these broad constraints, that maximize the entropy. Such p_k's may be thought of as the "least informative" and, in this sense, may be regarded as objective. With this approach, we are led to p_k's that are proportional to the p̃_k's we obtain through approach (i). It is thus meaningful to seek further constraints on the p_k's of approach (ii) that will force the resulting hazard to be constant and p_k = p̃_k for k = 1, 2, ..., m. We shall then have achieved an ME justification for the constant hazard model to be used in the sequel.
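The proportionality argument can be sketched numerically, comparing the maximum entropy solution of (A) with the constant hazard model (2.5) for an illustrative λ (the normalization in `maxent_pmf` is our reading of the solution's form p_k ∝ e^{λk}):

```python
import math

def maxent_pmf(lam, m):
    """Entropy-maximizing pmf under a mean constraint: p_k proportional to e^{lam k}."""
    w = [math.exp(lam * k) for k in range(1, m + 1)]
    z = sum(w)
    return [wk / z for wk in w]

def constant_hazard_pmf(lam, m):
    """Model (2.5): p~_k = (1 - e^lam) e^{lam(k-1)} for k < m, p~_m = e^{lam(m-1)}."""
    p = [(1 - math.exp(lam)) * math.exp(lam * (k - 1)) for k in range(1, m)]
    p.append(math.exp(lam * (m - 1)))
    return p
```

For any λ < 0 the two vectors agree up to the constant c_1 = 1/(1 − e^{λm}) on k = 1, ..., m − 1, exactly as claimed in the text.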

To this end, we begin with the Markov chain structure of the state indicators, given by (2.4), viz.

r&";(i -r,)' ifk = 1.2.3...., m- i P(r = k) =

The optimization problem (A) is revisited as an optimization problem in the parameters π_1, π_2, ..., π_{m−1}. This approach allows us to more easily assess the sought-after constraints. Thus, consider (π_0 = 0) the problem (B) of maximizing the entropy −Σ_{k=1}^{m} P(τ = k) log P(τ = k) over π_1, ..., π_{m−1},

Subject to:

Σ_{k=1}^{m−1} k π_k ∏_{l=0}^{k−1} (1 − π_l) + m ∏_{l=0}^{m−1} (1 − π_l) = C.

We solve the minimization form of the above problem by considering the following function, where λ is a Lagrange multiplier. Then for i = 1, 2, ..., m − 2


and for i = m - 1

Solving the above equations we obtain

(3.1) and

(3.2) or equivalently

and note that λ is, of course, a function of C. Then for k = 1, 2, ..., m − 1

and using (3.1) and the fact that π_{m−1} = 1/(1 + e^λ), we obtain

and consequently,

Therefore, P(τ = m) = π_1 e^{λ(m−1)}, and since Σ_{k=1}^{m} P(τ = k) = 1, we get

which results in

where c is a constant. It should be noted that using P(τ = k) = π_1 e^{λ(k−1)} and the fact that 0 ≤ π_1 ≤ 1, it is easily seen that λ ≤ 0.

Now utilizing (3.1) and the fact that π_1 = (1 − e^λ)/(1 − e^{λm}), it is easy to see by induction that

π_i = (1 − e^λ)/(1 − e^{λ(m−i+1)}),   i = 1, 2, ..., m − 1.

It should be noted that 1/(1 + e^λ) = (1 − e^λ)/(1 − e^{2λ}), which implies the above result is also true for i = m − 1.

We see that while a solution to (3.1) alone is π_k = 1 − e^λ, for k = 1, 2, ..., m − 2, this is not a solution to the optimization problem, which must also take into account the condition (3.2). In order to obtain the solution π_k = 1 − e^λ for k = 1, 2, ..., m, we must impose a further constraint.
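As a sanity check, the hazards π_i = (1 − e^λ)/(1 − e^{λ(m−i+1)}) can be verified numerically to reproduce the truncated geometric maximum entropy distribution of problem (A); a sketch:

```python
import math

def hazard_solution(lam, m):
    """pi_i = (1 - e^lam) / (1 - e^{lam(m - i + 1)}), i = 1, ..., m - 1."""
    return [(1 - math.exp(lam)) / (1 - math.exp(lam * (m - i + 1)))
            for i in range(1, m)]

def pmf_from_hazard(pis):
    """P(tau = k) = pi_k * prod_{l<k}(1 - pi_l); the residual mass sits at m."""
    p, surv = [], 1.0
    for pi in pis:
        p.append(pi * surv)
        surv *= 1 - pi
    p.append(surv)  # P(tau = m) = prod_{l=1}^{m-1} (1 - pi_l)
    return p
```

The product of the survival factors telescopes, so every P(τ = k), including k = m, equals (1 − e^λ) e^{λ(k−1)} / (1 − e^{λm}).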

It is perhaps reasonable to restrict the probability of having no change. Thus, the additional constraint which we add is:

where a is a specified constant. Solving (B) under this extra new condition gives the following result,

which is (3.1) with a new boundary condition. The constants λ and μ are functions of C and a. It is easy to see that for this system of equations we have

P(τ = k) = π_1 e^{λ(k−1)}   for k = 1, 2, ..., m − 1

and P(τ = m) = e^{−μ} π_1 e^{λ(m−1)}, which implies that

Now choosing μ = −log(1 − e^λ) results in π_1 = 1 − e^λ, which implies that

P(τ = k) = (1 − e^λ) e^{λ(k−1)},   for k = 1, 2, ..., m − 1, and

P(τ = m) = e^{λ(m−1)}, which is the corresponding distribution of the constant hazard model.

The Lagrange multiplier λ can be considered as a criterion for measuring the role of the mean constraint. Indeed, the larger the absolute value of λ is, the more important the role played by the mean constraint. On the other hand, the larger

the absolute value of λ is, the closer the value of μ is to zero. This shows that more emphasis on the mean constraint results in less emphasis on the probability of having no change. In other words, we have the possibility of choosing one of these two conditions, making inference about the corresponding parameter, and then adjusting the other one by using the given relationship between λ and μ.

ME Approach and Markovian Structure. Up to now the distribution of the changepoint has been our main concern. Nevertheless, the above discussion shows that starting with the chain itself can be more fruitful, especially when the main concern is to make inference about the hazard function. Since our aim is to introduce covariates into changepoint problems, which will be achieved through the hazard function, we now examine this problem in more detail. By taking into account the Markovian structure of the chain we show that we may end up with different models though we have the same information.

Here, again, our main tool is the ME principle. To begin we first need a definition for the entropy of a non-homogeneous Markov chain. The definition of entropy for a homogeneous Markov chain {θ_k}_{k=0}^{∞} with (d + 1) states and an initial invariant distribution P is well known. It is as follows:

where P_s = P(θ_k = s) and p_st = P(θ_{k+1} = t | θ_k = s). Justification of the above measure is simple since it is essentially equal to Σ_i P_i H_i, where H_i is the entropy of the following scheme

The definition of the entropy for a non-homogeneous Markov chain is basically a simple extension of the above definition for the homogeneous case. We, however, show how it can be derived from the original definition of entropy, beginning with an axiomatic justification in the homogeneous case.

When the chain is homogeneous with an initial invariant distribution, the entropy of the chain can be derived from the entropy of any two successive steps of the chain, since the entire behaviour of the chain can be characterized by the characteristics of any two successive steps. Hence it suffices to confine our attention to the first two steps. The entropy of the joint distribution P(s, t) = P(θ_0 = s, θ_1 = t) is then

H = −Σ_s Σ_t P(s, t) log(P(s, t)),

which is equal to

= −Σ_s Σ_t P_s p_st log(P_s) − Σ_s Σ_t P_s p_st log(p_st) = H_P − Σ_s P_s Σ_t p_st log(p_st).

As can be seen, H_P is the entropy of the initial distribution. It is a quantity which is independent of the transition mechanism. The second term, however, is the part which is a property of the transition mechanism, for it depends on the probability transition matrix which governs the chain. It measures the mean value of the uncertainty in the chain and is called the entropy of the homogeneous Markov chain.
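The decomposition H = H_P + Σ_s P_s H_s can be illustrated with a small two-state example (the transition matrix below is an arbitrary illustration, not from the thesis):

```python
import math

def entropy(p):
    """Shannon entropy of a probability vector, skipping zero entries."""
    return -sum(x * math.log(x) for x in p if x > 0)

def joint_two_step_entropy(P, T):
    """Entropy of the joint distribution P(s, t) = P_s * p_st of two steps."""
    return entropy([P[s] * T[s][t] for s in range(len(P)) for t in range(len(T[0]))])

def chain_entropy(P, T):
    """The transition part sum_s P_s H_s, called the entropy of the chain."""
    return sum(P[s] * entropy(T[s]) for s in range(len(P)))
```

The identity holds exactly because Σ_t p_st = 1 collapses the first double sum to −Σ_s P_s log P_s.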

This definition is easily extended to the non-homogeneous case.

DEFINITION 1. Suppose that {θ_k}_{k=0}^{∞} is a non-homogeneous Markov chain with finite state space S = {0, 1, 2, ..., d}. The entropy of the subchain {θ_k}_{k=L}^{U} is denoted by H_{nhMc}^{L,U} and defined as follows:

H_{nhMc}^{L,U} = −Σ_{k=L+1}^{U} Σ_{s=0}^{d} Σ_{t=0}^{d} P_s^{k−1} p_st^k log(p_st^k),

where P_s^{k−1} = P(θ_{k−1} = s) and p_st^k = P(θ_k = t | θ_{k−1} = s). For L = 0 the above entropy function is denoted by H_{nhMc}^U.

Now using H_{nhMc}^U and applying the ME principle, we obtain a variety of models for the changepoint problem under suitable circumstances. For the single changepoint problem, which is our main concern, we have S = {0, 1}, P_0^{k−1} = ∏_{l=1}^{k−1} (1 − π_l), p_01^k = π_k and p_11^k = 1 for k = 1, 2, ..., m − 1, and since the chain is stopped at time m we have p_01^m = p_11^m = 1. These properties lead to a considerable simplification in the form of H_{nhMc}^m. Indeed, we obtain
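The simplification can be checked by evaluating Definition 1 directly for the stopped chain; the single-sum form in the sketch below matches the objective of problem (P) further on and is our reading of the simplified expression:

```python
import math

def nh_chain_entropy(pis):
    """H^m_nhMc for the stopped single-changepoint chain: state 1 is absorbing,
    so only the state-0 rows contribute, and the forced step at m contributes 0:
    H = -sum_k prod_{l<k}(1 - pi_l) * [pi_k log pi_k + (1 - pi_k) log(1 - pi_k)]."""
    h, surv = 0.0, 1.0
    for pi in pis:  # pis = (pi_1, ..., pi_{m-1})
        for q in (pi, 1 - pi):
            if q > 0:
                h -= surv * q * math.log(q)
        surv *= 1 - pi
    return h
```

Evaluating the triple sum of Definition 1 term by term gives the same value, since the absorbing row [0, 1] and the final forced transitions contribute nothing.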

For brevity we omit some of the details in the following discussion. To show that our definition is not contradictory we examine it for some known cases. We begin with the simplest case, in which we have no information at all; that is, there are no constraints. Solving

we obtain π_k = 1/(m − k + 1) for k = 1, 2, ..., m − 1,

and this results in P(τ = k) = 1/m for k = 1, 2, ..., m. Both results are in agreement with our expectation.
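A quick numerical check that this hazard indeed yields the uniform distribution; the closed form π_k = 1/(m − k + 1) used here is a reconstruction consistent with P(τ = k) = 1/m, not a quotation:

```python
def uniform_hazard(m):
    """Candidate no-constraint ME solution: pi_k = 1/(m - k + 1), k = 1, ..., m - 1."""
    return [1 / (m - k + 1) for k in range(1, m)]

def changepoint_pmf(pis):
    """P(tau = k) = pi_k * prod_{l<k}(1 - pi_l); residual mass at m."""
    p, surv = [], 1.0
    for pi in pis:
        p.append(pi * surv)
        surv *= 1 - pi
    p.append(surv)
    return p
```

The survival products telescope as (m−1)/m · (m−2)/(m−1) · ... so every atom, including the residual one at m, equals 1/m.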

The first condition which may be imposed is the specification of a constraint on the probability that a change does not take place by time m. The problem can then be formulated as follows

Subject to:

It turns out that the solution is

for k = 1, 2, ..., m − 2, and for k = m − 1 we have a separate boundary equation, which results in

and P(τ = m) = C.

We already mentioned that if the main concern is to construct a model for the π_k's we should focus on the chain itself. We now show that there are indeed differences between maximization of H_{nhMc}^m (U = m) and maximization of the entropy that was based on the distribution of the changepoint directly. In fact, we show that maximization of H_{nhMc}^m subject to (π_0 = 0)

the mean constraint Σ_{k=1}^{m−1} k π_k ∏_{l=0}^{k−1} (1 − π_l) + m ∏_{l=0}^{m−1} (1 − π_l) = C does not give the model that was derived from problems (A) and (B). We solve the minimization form of the problem:

(P)   min_{π_1, π_2, ..., π_{m−1}} −H_{nhMc}^m = Σ_{k=1}^{m−1} ∏_{l=0}^{k−1} (1 − π_l) [π_k log(π_k/(1 − π_k)) + log(1 − π_k)]

Subject to:

Then using Lagrange multipliers we must minimize the following function

L = Σ_{k=1}^{m−1} ∏_{l=0}^{k−1} (1 − π_l) [π_k log(π_k/(1 − π_k)) + log(1 − π_k)] + λ (E(τ) − C).

Taking partial derivatives for k = 1, 2, ..., m − 2, we obtain

∂L/∂π_k = log(π_k/(1 − π_k)) ∏_{l=0}^{k−1} (1 − π_l) − Σ_{t=k+1}^{m−1} ∏_{l=0, l≠k}^{t−1} (1 − π_l) [π_t log(π_t/(1 − π_t)) + log(1 − π_t)] + λ ∂E(τ)/∂π_k,

and for k = m − 1 we have

Solving the above equations we obtain

and for k = m − 1 we have log(π_{m−1}/(1 − π_{m−1})) = −λ. This results in the following hazard model

There are some noteworthy points. As is easily seen, (3.4) is different from (3.1), though the constraints are the same. The second point is that, from (3.4), a first-order autoregressive model seems appropriate for c_k = log(π_k/(1 − π_k)).

A natural question is now: "Are there any circumstances under which (3.1) holds in this new setting, in which the Markov chain structure is utilized and entropy is defined for the chain itself?" It is easy to show that (3.1) is the solution of the following optimization problem.

min_{π_1, π_2, ..., π_{m−1}} −H_{nhMc}^m = Σ_{k=1}^{m−1} ∏_{l=0}^{k−1} (1 − π_l) [π_k log(π_k/(1 − π_k)) + log(1 − π_k)]

Subject to:


Indeed, solving the above problem we obtain

Closing Remark. Information theory is a topic with a vast literature. For a good review we refer the reader to Soofi (1994) and the references cited therein, many of which emphasize the fundamental role information theory can play in many aspects of statistical theory, particularly in modeling.

As is evident, the straightforward extension of entropy to random variables which can take on uncountably many values is to replace Σ by ∫. Though this is exactly the way entropy is defined for these cases, we should note that there are some inconsistencies. For instance, the value of the entropy in this general case may be negative (consider the uniform distribution on [a, b] where b − a < 1), or the integral may not converge. In this connection we refer the reader to Rényi (1959) (especially Theorem 1), which gives a beautiful justification and detailed account of the theory (see also Csiszár (1971)). In connection with the convergence of the entropy we refer the reader to Rajski (1959), which is presumably the sharpest result of the type. For different sets of postulates for entropy see Rényi (1960) and the references cited therein. See also Shore and Johnson (1980), where they show that under intuitive assumptions for an inductive inference, and when new information is given in the form of expected values, Jaynes's principle of maximum entropy is the uniquely "correct" method of inductive inference. For a critical point of view see Seidenfeld (1987).

There are some interesting relationships between Fisher's information and Shannon's entropy. We refer the reader to Vajda (1971), Theorem 1, which establishes a relationship between Fisher's information and a class of statistical divergences which includes the I-divergence, where

I(P, Q) = ∫ p log(p/q) dμ if P ≪ Q, and +∞ otherwise, where P and Q are two probability measures and p and q are their Radon-Nikodym derivatives with respect to a σ-finite measure μ. This result basically generalizes the same result, which had already been proved by Kullback (1967) for the I-divergence.

As we already mentioned, Rényi's information has many applications in statistical mechanics and nonlinear dynamics. Some interesting results concerning this information criterion have been given by Zvárová (1971). In fact, she has shown (see Theorem 1) that under mild conditions the maximum likelihood estimator of Rényi's divergence is asymptotically normally distributed with a variance of the order of the reciprocal of the sample size. The variance is of the order of the reciprocal of the squared sample size under a mild condition.

One of the other interesting applications of entropy is in measuring dependence (Rényi (1959)). Rényi introduces some intuitive postulates which are natural for

a rational measure of dependence and mentions that Linfoot's measure of dependence (Linfoot (1957)),

which is an increasing function of the I-divergence, fulfills all the intuitive conditions. As in Rényi (1959), the particular form chosen by Linfoot ensures that L(ξ, η) = |Corr(ξ, η)| if the joint distribution of ξ and η is normal. A reader interested

in information theory is also recommended to see Csiszár (1967), where the f-divergence was originally introduced and some nice results in connection with it are given.

The way that information has been defined does not incorporate local features of the density function. This is viewed as a restriction on its application in some parts of the statistical literature. See Silver and Martz (1994), where they introduce the notion of quantum entropy and its application to density estimation.

Finally, for an insightful discussion of the notion of information the reader is referred to Kolmogorov (1968), which gives some interesting perspective, in §3 of his paper, on the definition of the amount of information. See also his concluding remarks. For a rigorous treatment of the theory of entropy the reader can consult Martin and England (1981).

4. Introducing Covariates into the Model

Our main concern in this section is to show how the principle of ME can be used to introduce covariates into the model. Up to now we have only discussed how to model the hazard as a function of time. Our main interests in this section are the constant hazard model and the logistic hazard function given by (3.4). Our principal reference in this brief section is Good (1963).

Of course, we could have introduced the two hazard functions in an ad hoc fashion. We chose, instead, to justify their introduction through an ME argument. Continuing in this spirit, we use ME to justify the manner in which we allow the hazard function to be a function of a vector of covariates, Z.

In order to introduce covariates into the model we adjust our notation to reflect the dependence of the model on the covariates. For instance, in the constant hazard model we have π_k(z) = π(z) for k = 1, 2, ..., m − 1, and in the logistic hazard model we may write π_k(z) = e^{−λ(z)(m−k)}/(1 + e^{−λ(z)(m−k)}) for k = 1, 2, ..., m − 1. This means that for any given set of covariates we know how the hazard changes with time. This may be called a conditional model for time given the covariates. Now suppose that we consider the hazard at a specific moment of time and we wish to specify it as a function of the covariates. This gives us a conditional model for covariates given time. These two conditional models will be seen to suggest a hazard function (of time) that allows for the effect of covariates.

Using Jaynes' principle of Maximum Entropy, Good (1963) derived the class of log-linear models. In the light of Good's results, and since logistic regression can be treated as a special case of a log-linear model, logistic regression models may be derived using Jaynes' principle of ME.

Suppose there are r discrete covariates and the response variable is a binary

variable. If we consider the response variable as a new classification, we obtain an (r + 1)-way contingency table. If all 2-dimensional marginal totals of the (r + 1)-way table are given, Good showed that consideration of the following maximization problem, subject to constraints on the marginals, leads to the form,

where p_{i_1, i_2, ..., i_r, K} is the probability that an observation is in the cell represented by the i_j-th level of the j-th covariate for j = 1, 2, ..., r, and K stands for the level of the response variable, which can take on values in the set

{0, 1}. Here, u_j(i_j) stands for the effect of the i_j-th level of the j-th covariate for j = 1, 2, ..., r; u_{jj'}(i_j, i_{j'}) stands for the interaction effect between j and j' when j is at level i_j and j' is at level i_{j'}; and u_{r+1}(K) and u_{j,r+1}(i_j, K) represent the corresponding effect and interaction for the response variable, respectively. As is evident, the above result can easily be extended to the case when higher-dimensional marginals are available. This extension results in the introduction of higher-order interaction factors into the model.

It should be noted that in the language of optimization the parameters introduced in the above model are the Lagrange multipliers corresponding to the constraints. For instance, u_j, which is conventionally interpreted as the main effect of the j-th covariate, can alternatively be regarded as the corresponding Lagrange multiplier of the constraint on the one-dimensional marginal of the j-th covariate. Similar arguments apply to the other parameters. With this perspective it is easy to show that the side conditions (sometimes called "estimability conditions") which we usually encounter in log-linear models, i.e.

Σ_{i_j} u_j(i_j) = 0, and also, for j = 1, 2, ..., r, Σ_{i_j} u_{j,r+1}(i_j, K) = 0,

are reflections of the linear dependence of the constraints. Indeed, summation of the 2-dimensional marginals over one of the indices gives the corresponding 1-dimensional constraints of the other index. This implies that the constraints are not linearly independent. Linear dependence of the constraints results in linear dependence of the Lagrange multipliers. To see this, it suffices to consider the Lagrangian and use the linear dependence among constraints, which simply results in the elimination of as many Lagrange multipliers as the number of linearly dependent constraints.

Now we have, and so:

Defining v_j(i_j) = u_{j,r+1}(i_j, 1) − u_{j,r+1}(i_j, 0), we obtain

and therefore

which is the logistic model.
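The cancellation that produces the logistic form can be seen numerically; in the sketch below the log-linear parameters are arbitrary illustrative values for two covariates with 3 and 2 levels, not estimates from any data:

```python
import math

# Hypothetical log-linear parameters (illustrative only)
u = 0.1
u1 = [0.2, -0.1, -0.1]                           # main effects of covariate 1
u2 = [0.3, -0.3]                                 # main effects of covariate 2
uy = [-0.4, 0.4]                                 # main effect of the response
u1y = [[0.0, 0.1], [0.2, -0.3], [-0.2, 0.2]]     # covariate-1 x response interactions
u2y = [[0.25, -0.25], [-0.25, 0.25]]             # covariate-2 x response interactions

def cell(i, j, y):
    """Unnormalized log-linear cell probability exp(u + u1 + u2 + uy + u1y + u2y)."""
    return math.exp(u + u1[i] + u2[j] + uy[y] + u1y[i][y] + u2y[j][y])

def p_response(i, j):
    """Conditional P(y = 1 | i, j) computed directly from the log-linear cells."""
    c0, c1 = cell(i, j, 0), cell(i, j, 1)
    return c1 / (c0 + c1)

def logistic(i, j):
    """The same conditional in logistic form: the purely-covariate terms cancel,
    leaving eta = [uy(1)-uy(0)] + [u1y(i,1)-u1y(i,0)] + [u2y(j,1)-u2y(j,0)]."""
    eta = (uy[1] - uy[0]) + (u1y[i][1] - u1y[i][0]) + (u2y[j][1] - u2y[j][0])
    return 1 / (1 + math.exp(-eta))
```

Conditioning on the covariates removes every term that does not involve the response, which is exactly why logistic regression emerges as a special case of the log-linear model.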

The above result suggests that for the constant hazard model we may choose

For the logistic hazard (as a function of time), the above discussion suggests that the form of the hazard that includes covariates be of the form,

for k = 1, 2, ..., m − 1. The side conditions for both (4.2) and (4.3) are

Σ_s v_j(s) = 0   for j = 1, 2, ..., r.

It is easy to see that if we assume that the l-dimensional marginals are available, then we would have (l − 1)-way interaction factors in (4.1), which would result in a logistic model with (l − 1)-way interaction factors with the usual side conditions. It is not hard to extend (4.1) to the continuous case. Indeed, as is expected, we obtain exponential families for the joint distribution of the covariates. It is customary in the continuous case that the constraints are given by expectations rather than marginal distributions. In the following we show how to construct models including both discrete and continuous covariates. Suppose, for illustrative purposes, that we have one continuous and two discrete covariates. The response variable, as before, is binary. The problem can be formulated as follows:

Subject to:

Σ_{k=1}^{K} ∫ p(i, j, k, z) dz = p(i, j)   for i = 0, 1 and j = 1, 2, ..., J;

Σ_{j=1}^{J} ∫ p(i, j, k, z) dz = p(i, k)   for i = 0, 1 and k = 1, 2, ..., K;

Σ_{i=0}^{1} ∫ p(i, j, k, z) dz = p(j, k)   for j = 1, 2, ..., J and k = 1, 2, ..., K;

Σ_{j=1}^{J} Σ_{k=1}^{K} ∫ p(i, j, k, z) dz = p(i)   for i = 0, 1;

Σ_{i=0}^{1} Σ_{k=1}^{K} ∫ p(i, j, k, z) dz = p(j)   for j = 1, 2, ..., J;

Σ_{i=0}^{1} Σ_{j=1}^{J} ∫ p(i, j, k, z) dz = p(k)   for k = 1, 2, ..., K.

In order to solve the above problem we consider the following Lagrangian

Classical variational methods may then be needed to carry out the optimization. It is easily seen that a necessary and sufficient condition (because of convexity) for the extremum is


It follows that

Using the same argument as for the discrete case we find,

Again we have the usual side conditions, which reflect the linear dependence among the constraints of the optimization problem. An analogous version of (4.3) can also be given, as follows.

for k = 1, 2, ..., m − 1.

For an elementary treatment of variational methods which suffices for the above purpose see Rustagi (1994). For a thorough treatment we refer the reader to Zeidler (1985), particularly §37.5 for multidimensional classical variational methods, which is the case of our concern when we have more than one continuous covariate.

Hereafter we confine our attention to (4.2) and (4.4). The other model, (4.3), which seems a reasonable alternative to (4.2), will not be discussed in this thesis.

5. Mixture Distributions

In this section we complete our modeling for the multi-path changepoint problem. The full likelihood of the observed data is written as a mixture of components whose mixing probabilities are expressed using the models (2.5) and (4.2) of sections 2 and 4.

Recall that in a multi-path setting we have n paths and m observations on each path. Thus, as discussed before, we have a matrix of observations, X = [x_ik]_{n×m}, where x_ik is the k-th observation on the i-th path. It should be noted that x_ik can be a vector, though this is not the concern of this thesis. Now let g_k^i(x_i1, x_i2, ..., x_im) be the density function of X̄_i = (x_i1, x_i2, ..., x_im), given that the change takes place at k (with respect to a σ-finite measure on the sample space). For instance, if we assume that the x_ik's are independent and identically distributed before and after the change takes place, with density functions h_1^i and h_2^i respectively, then conditional on a change at k,

and

The unconditional density function for X̄_i is then

where τ_i is the instant of change for the i-th path. The expression (5.1) reduces

under the assumption of independence (the product ∏_{j=k+1}^m h_2^i(x_ij) is defined to be equal to 1 for k = m).

Now incorporating (2.5), (4.2), and (5.1) we may write

Σ_{k=1}^m [ exp(ν_0 + Σ_{j=1}^r ν_j(z_j)) / (1 + exp(ν_0 + Σ_{j=1}^r ν_j(z_j)))^k ] g_k^i(X̄_i)

where

If the observations on the i-th path are independent we obtain the following model, which is of our prime concern,

for which

Although many of the results given in the sequel are applicable for (5.2), we are mainly concerned with model (5.3). Since (5.2) is a mixture distribution, mixture distributions will play a crucial role in the sequel.

The mixture distributions of this thesis have some special features which will be discussed in later chapters. For example, under reasonable conditions, we will show that the model is strongly identifiable in the language of Ghosh and Sen (1985, page 791, second paragraph). This allows us to use classical asymptotic results. Next, since our main concern is to make inference about the effect of the covariates rather than about the number of components of the mixture, this saves us from the complexities of such problems. For it is well known that in tests for the number of components of a mixture a crucial assumption for the asymptotic theory, namely interiority of the true value of the parameter, is violated. In fact, the true value will be on the boundary of the parameter space. This problem has been treated by Chernoff (1954) and Feder (1968). The above discussion, the results given by Chernoff and Feder, and the recent advances in the application of differential geometry in statistics suggest extension of the available asymptotic theory to the case where the parameter space is a differentiable manifold with boundary or even corners. To our knowledge the theory has so far been extended only for differentiable manifolds. For a discussion on manifolds with boundaries and corners see Michor (1980, §2). CHAPTER III CONSISTENCY OF THE MAXIMUM LIKELIHOOD ESTIMATORS

1. Introduction

Perhaps the best known method of estimation in the frequentist setting is the method of maximum likelihood, whose popularity stems from its large sample properties.

Our main concern in this chapter is the consistency of the maximum likelihood estimators (MLE's) of the unknown parameters introduced in the model given in chapter I (§1.4). It is shown that under suitable conditions the MLE's are strongly consistent. As is well known, for the consistency of maximum likelihood estimators two types of condition are needed. The heuristic argument for the consistency of MLE's, given by Cox and Hinkley (1974), reveals this fact. The first condition is continuity of the likelihood function. We refer to this condition as the smoothness condition. The second condition is identifiability of the parameters. Indeed, without identifiability of the parameters it is meaningless to talk about hitting the target in the long run. If identifiability fails to hold we can at most expect to have consistency in the quotient space, a fact that has been discussed


It is shown here that the unknown parameters in our model are identifiable. This very important feature is a key ingredient in the proof of consistency that we present. In the next chapter, where we discuss asymptotic normality, identifiability also plays the key role.

We take two different approaches to proving consistency. It is seen that the single parameter argument of the first approach fails in the multi-parameter case. Nevertheless, the first approach allows us to prove consistency under more general circumstances when there is only one unknown parameter. For a thorough discussion on maximum likelihood inference, see Norden (1972, 1973) and LeCam (1990b).

This chapter is organized as follows. In section 1 we discuss identifiability of our model. In section 2 we discuss the one parameter case, where we have only one unknown parameter. In section 3 we present the multi-parameter case. Finally, in section 4 we treat the general case of independent but non-identically distributed random variables and apply our result to the changepoint setting, which is our main concern. We introduce a notion for the type of identifiability we encounter, which is neither ordinary identifiability (strict identifiability) nor the usual identifiability for mixtures.

2. Identifiability Of The Model

Before stating our result we should make the distinction between what we call usual identifiability for mixtures (identifiability up to a permutation) and ordinary identifiability (strict identifiability). Our main concern here is strict identifiability for a collection of families of distributions indexed by covariates. The definition

of this type of identifiability, which we call quasi-identifiability, is given below. We shall establish this type of identifiability. We refer to Ghosh and Sen (1985) for more discussion on the difference between strict and usual identifiability for mixtures. Owing to the lack of strict identifiability in many finite mixture problems (see Pfanzagl (1994), Theorem 1) and the fact that strict identifiability is the core of most of the results stated in this and the next chapter, we emphasize this point in this section.

DEFINITION 1. A collection of families of probability measures {{P_θ^z : θ ∈ Θ} : z ∈ Z} is called quasi-identifiable with respect to θ if for any θ, θ′ ∈ Θ, θ ≠ θ′ implies ∃z ∈ Z such that P_θ^z ≠ P_θ′^z.

Before stating our result we first recall that the observations form the matrix

In the above matrix each row corresponds to one subject. There are n subjects with m measurements on each. We say a change has taken place at τ_i, for the i-th subject, when the first τ_i measurements on the i-th subject, X_i1, X_i2, ..., X_iτ_i, come from a process, say P_1, and the rest, X_i,τ_i+1, ..., X_im, from a different process, say P_2. We always assume that P_1 and P_2 are independent.

In this section we establish strict identifiability, first when the within-subject measurements are independent and then when they follow a Markovian structure.

We begin by assuming that observations before and after the change are independent, and follow the method of Joseph and Wolfson (1993), with slight modification, to take care of the possibility of no change.
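The data structure just described can be made concrete with a small simulation. The sketch below generates the n × m observation matrix X with a geometric-type change time τ_i per path; the per-step change probability, path dimensions, and normal before/after distributions are illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings: n paths, m observations per path.
n, m = 5, 10
pi = 0.3   # assumed per-step change probability

# Change-point distribution: P(tau = k) = pi*(1-pi)**(k-1) for k < m,
# and P(tau = m) = (1-pi)**(m-1) (no change observed within the window).
ks = np.arange(1, m + 1)
alpha = np.where(ks < m, pi * (1 - pi) ** (ks - 1), (1 - pi) ** (m - 1))
assert abs(alpha.sum() - 1.0) < 1e-12   # alpha is a probability vector

tau = rng.choice(ks, size=n, p=alpha)

# First tau_i observations from P1 = N(0,1), the rest from P2 = N(2,1).
X = np.empty((n, m))
for i in range(n):
    X[i, :tau[i]] = rng.normal(0.0, 1.0, tau[i])
    X[i, tau[i]:] = rng.normal(2.0, 1.0, m - tau[i])

print(tau)
print(X.shape)
```

Each row of X plays the role of one subject's path, with its own (latent) change instant τ_i.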

As discussed in Chapter I, our model is defined by the set of probability density functions

with respect to a reference measure ν, where

and

The densities h_1^z and h_2^z represent, respectively, the before- and after-the-change distributions, which we suppose are identifiable with respect to γ_1 and γ_2 respectively. Let

x̄ = (x_1, ..., x_m), z̄ = (z_1, ..., z_r), θ = (β, γ), γ = (γ_1, γ_2). The parameter β is the unknown regression parameter, while γ_1 and γ_2 are respectively the parameters of the before- and after-the-change distributions. For brevity we write α_k(β) (or α_k when there is no confusion) for α_k(β, z).

In the following we always assume the usual side conditions (introduced in §1.4) for the factor levels of qualitative covariates.

THEOREM 1. Suppose h_s^z for s = 1, 2 are, respectively, quasi-identifiable with respect to γ_s for s = 1, 2, and α_k(β, z) = π(β, z)[1 − π(β, z)]^{k−1} for k = 1, 2, ..., m − 1 and α_m(β, z) = (1 − π)^{m−1}, where

β = (β_0, β_1, ..., β_r) and γ = (γ_1, γ_2). Then f_z(x̄; θ), given by (3.1), is quasi-identifiable with respect to θ.

(2.4) Σ_{k=1}^m α_k(β; z) g_k^z(x̄; γ) = Σ_{k=1}^m α_k(β′; z) g_k^z(x̄; γ′) for almost all x̄ and ∀z.

Integrating both sides of equation (2.4) over x_2, ..., x_m we obtain

for almost all x_1 and ∀z. Using the quasi-identifiability of h_1^z, γ_1 = γ′_1.

By integrating over x_1, ..., x_{m−1} and x_1, ..., x_{m−2}, x_m respectively, we get

for ν = m and m − 1 respectively. Assuming the same range for x_k, k = 1, ..., m (or, generally, focusing on the line x_m = x_{m−1} in the x_m–x_{m−1} plane) and subtracting equation (2.5) for m − 1 from that for m results in

This implies

Similarly,

and therefore

Now using the fact that α_k = π(1 − π)^{k−1}, it follows that

Using the first two equations we have (1 − π) = (1 − π′), which implies π = π′. As the logistic function is monotone we obtain

To simplify our discussion we assume that we have only one continuous covariate,

Z, and one qualitative covariate, η, with S levels. Then (2.7) implies that

Summing over s we obtain

which implies β_0 = β′_0 and (Σ_{s=1}^S ζ_s) = (Σ_{s=1}^S ζ′_s). Equation (2.8) then reduces to

(2.10) η_s + ζ_s z = η′_s + ζ′_s z   ∀z and s = 1, 2, ..., S.

But two straight lines can at most intersect at one point, so that (2.10) implies that

η_s = η′_s and ζ_s = ζ′_s for s = 1, 2, ..., S.

Now using (2.6), the quasi-identifiability of h_2^z, and the facts that β = β′ and γ_1 = γ′_1, we obtain γ_2 = γ′_2. This completes the proof. □

EXAMPLE 1. Suppose, with a convenient abuse of notation, that h_1^z(x; μ_1, μ_2, σ) = N(μ_1 + μ_2 z, σ²) and h_2^z(x; λ_1, λ_2, τ) = N(λ_1 + λ_2 z, τ²), where (μ_1, μ_2, σ) ≠ (λ_1, λ_2, τ). Then γ_1 = (μ_1, μ_2, σ) and γ_2 = (λ_1, λ_2, τ), and as the Normal distribution is identifiable, Theorem 1 applies.
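The mixture density in this normal example can be sketched numerically. The function below is a minimal illustration of the unconditional path density (a change at k puts the first k observations under h_1 and the rest under h_2, as in the data-matrix description above); the constant change probability `pi` and all parameter values are illustrative assumptions, since in the thesis π depends on (β, z).

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def mixture_density(x, z, pi, mu, lam):
    """Sketch of the unconditional density of one path.

    x   : array of the m observations on the path
    z   : scalar covariate
    pi  : change probability (constant here for simplicity)
    mu  : (mu1, mu2, sigma) -> before-change N(mu1 + mu2*z, sigma^2)
    lam : (l1, l2, tau)     -> after-change  N(l1 + l2*z, tau^2)
    """
    m = len(x)
    h1 = normal_pdf(x, mu[0] + mu[1] * z, mu[2])
    h2 = normal_pdf(x, lam[0] + lam[1] * z, lam[2])
    total = 0.0
    for k in range(1, m + 1):
        # alpha_k = pi(1-pi)^{k-1} for k < m, and (1-pi)^{m-1} for k = m
        a = pi * (1 - pi) ** (k - 1) if k < m else (1 - pi) ** (m - 1)
        # given a change at k: first k observations before, the rest after
        total += a * np.prod(h1[:k]) * np.prod(h2[k:])
    return total

x = np.array([0.1, -0.2, 2.1, 1.9])
print(mixture_density(x, z=0.5, pi=0.3, mu=(0, 0.2, 1), lam=(2, 0.1, 1)))
```

A sanity check: when the before- and after-change parameters coincide, the mixture collapses to the plain product density, since the mixing weights sum to one.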

REMARK 1. From an inspection of the proof of Theorem 1, using the identifiability result of Joseph and Wolfson (1993), it can easily be seen that under the assumption of a change before the last observation we do not need to specify the form of the mixing distribution. Indeed, to conclude identifiability of the mixture with respect to θ we only need identifiability of the mixing distribution with respect to β.

Now let us suppose, more generally, that the observations in each path are dependent, but with a Markovian structure. Such a framework in the single-path setting has been discussed by Yakir (1994) for finite state space Markov chains (see also Telksnys (1986)). The basic assumption is that the processes before and after the change are independent, with different transition probabilities. In our notation this means

for k = 1, 2, ..., m − 1 and

It should be noted that for the Markov model γ_s = (λ_s, ρ_s) for s = 1, 2, where λ_s and ρ_s are respectively the unknown parameters of the marginal and transition probability functions for s = 1, 2. Theorem 2 extends Theorem 1 to stationary Markov processes, provided that λ_s and ρ_s are respectively identifiable from h_s^z(x; λ_s) and h_s^z(x; ρ_s | y).

THEOREM 2. Suppose the measurements before and after the change respectively follow independent stationary Markov processes P_1 and P_2 whose marginal and transition probability functions are identifiable. Then f_z(x̄; θ), given by (3.1), (2.11) and (2.12), is quasi-identifiable with respect to θ.

PROOF. Suppose f_z(x̄; θ) = f_z(x̄; θ′) for almost all x̄. As in the proof of Theorem 1 we work only with marginal distributions. Since P_s for s = 1, 2 are stationary processes with identifiable marginal and transition probability functions, the same proof applies here as for Theorem 1, to obtain λ_s = λ′_s for s = 1, 2 and β = β′.

To complete the proof we must show that ρ_s = ρ′_s for s = 1, 2. To show this it suffices to integrate x_3, ..., x_m out from both sides of equation (2.4). We obtain

But we already showed that λ_s = λ′_s for s = 1, 2 and β = β′. Therefore, (2.13) reduces to

Thus the identifiability of the transition probability function implies that ρ_1 = ρ′_1.

Similarly, by integrating over x_1, ..., x_{m−2} we obtain

Using (2.14), the identifiability of the transition probability function of P_2 implies ρ_2 = ρ′_2. This completes the proof. □

EXAMPLE 2. Suppose P_s is a stationary Markov process

X_{s,i}(t) = μ_s + η_{s,j(i)} + ζ_s z_i + ε_s(t),

where Z is continuous and η is a qualitative covariate. The process ε_s(t) is an AR(1) process defined by

(2.15) ε_s(t) = ρ_s ε_s(t − 1) + δ_s(t), where 0 < ρ_s < 1.

In the notation of Theorem 2, λ_s = (μ_s, η_s, ζ_s, σ_s) for s = 1, 2. Then it is not hard to see that all the conditions of Theorem 2 are fulfilled, and therefore we have identifiability of the mixture model given by (3.1) with respect to θ (the vector of parameters). Now using the fact that for general processes P_s, for s = 1, 2, we have

for k = 1, 2, ..., m − 1 and

The above results extend to fairly general circumstances. As the proof is similar to the proof of Theorem 2 it is omitted.
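The AR(1)-with-covariates structure of Example 2 can be sketched as follows. The simulation below generates one path whose mean shifts at a change point and whose errors follow (2.15); all numerical values (means, AR coefficients, change point, path length) are illustrative assumptions, not quantities from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1_path(T, mean, rho, noise_sd):
    """Simulate X(t) = mean + eps(t), with eps(t) = rho*eps(t-1) + delta(t)."""
    eps = np.empty(T)
    # start eps at its stationary distribution: Var = noise_sd^2 / (1 - rho^2)
    eps[0] = rng.normal(0.0, noise_sd / np.sqrt(1 - rho ** 2))
    for t in range(1, T):
        eps[t] = rho * eps[t - 1] + rng.normal(0.0, noise_sd)
    return mean + eps

# Illustrative parameter choices (mu_s + zeta_s * z, rho_s):
z = 0.7
mean1, mean2 = 1.0 + 0.5 * z, 3.0 + 0.2 * z
rho1, rho2 = 0.6, 0.3

tau, m = 12, 25                      # change point within m observations
before = ar1_path(tau, mean1, rho1, 1.0)
after = ar1_path(m - tau, mean2, rho2, 1.0)   # independent of `before`
path = np.concatenate([before, after])
print(path.shape)
```

Note that, as in the thesis's basic assumption, the pre- and post-change processes are generated independently of one another.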

COROLLARY 2. Suppose the measurements before and after the changepoint respectively follow independent stationary processes P_1 and P_2 whose marginal and conditional densities

and

are, respectively, identifiable with respect to γ_s and the sub-vector of γ_s which parametrizes (2.19). Then f_z(x̄; θ), given by (3.1), (2.16) and (2.17), is quasi-identifiable with respect to θ.

An extension of Example 2 to AR(p) processes serves as an application of the above corollary.

3. Consistency In The Single Parameter Case

In this section we discuss consistency in the single parameter case. By this we mean the consistency of θ̂ for θ ∈ ℝ in the following model

where g_k^z(x̄, θ) for k = 1, 2, ..., m are known distributions up to an unknown parameter θ. We always consider the following model for α_k(z, θ) as a concrete example for which the assumptions below are fulfilled:

(3.2) α_k(z, θ) = π(z, θ)(1 − π(z, θ))^{k−1} if k = 1, 2, ..., m − 1, and α_m(z, θ) = (1 − π(z, θ))^{m−1} if k = m, and

As mentioned by Kraft and LeCam (1956), there are two ways to attack the consistency of MLE's. The first approach is to prove directly the consistency of MLE's, as is done by Wald (1949) and Ibragimov and Hasminskii (1981). Alternatively, one may prove existence and consistency of a selected root of the likelihood equation, which is the method described in this section. We use the idea employed by Lehmann (1983) for proving consistency in the single parameter case. It should be mentioned that we treat consistency and asymptotic normality separately and therefore do not assume any condition on the second derivative. When we discuss the multi-parameter case, for which we take the first approach, we do not even assume differentiability, which is an unnecessary condition if consistency is the sole aim.
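The mixing distribution (3.2)–(3.3) is easy to compute once a form for π(z, θ) is chosen. The sketch below assumes a logistic link π(z, θ) = exp(β_0 + β_1 z)/(1 + exp(β_0 + β_1 z)), in the spirit of the covariate models of Chapter II; the coefficient names and values are illustrative.

```python
import numpy as np

def mixing_probs(z, beta0, beta1, m):
    """alpha_k(z, theta) of (3.2)-(3.3), sketched with a logistic pi(z, theta)."""
    eta = beta0 + beta1 * z                 # linear predictor in the covariate
    pi = np.exp(eta) / (1 + np.exp(eta))    # logistic change probability
    k = np.arange(1, m + 1)
    # pi(1-pi)^{k-1} for k < m; all remaining mass (1-pi)^{m-1} at k = m
    return np.where(k < m, pi * (1 - pi) ** (k - 1), (1 - pi) ** (m - 1))

a = mixing_probs(z=0.4, beta0=-1.0, beta1=0.8, m=8)
print(a, a.sum())
```

Since 0 < π < 1 for every finite linear predictor, each α_k is strictly positive, which is exactly the positivity required of the mixing distribution in assumption A3 below.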

There is a vast literature on the consistency of MLE's. Perhaps the most famous paper on the consistency of MLE's is Wald (1949). Among the assumptions listed in Wald (1949, page 596), assumption 5, stated as follows, is not fulfilled in our case.

Assumption 5: If lim_{i→∞} |θ_i| = ∞, then lim_{i→∞} f(x, θ_i) = 0 for any x, except perhaps on a fixed set (independent of the sequence θ_i) whose probability is zero according to the true parameter point θ_0.

Indeed, for the simplest case, where g_k^z(x̄) does not depend on θ, it can easily be seen that for z a positive (negative) scalar,

This shows that we cannot apply Wald's result, which is highly dependent on this assumption through his Lemma 3. We therefore proceed differently, starting with a simple lemma. Although the proof of the lemma may be found in any book on inequalities (see for example Mitrinović (1964, p. 14)), we give the short proof for the sake of completeness.

LEMMA 1. Suppose a_k for k = 1, 2, ..., m are m real numbers and b_k for k = 1, 2, ..., m are m positive real numbers. Then

min_{1≤k≤m} a_k/b_k ≤ (Σ_{k=1}^m a_k)/(Σ_{k=1}^m b_k) ≤ max_{1≤k≤m} a_k/b_k.

PROOF. It is clear that for any k we have

min_{1≤k≤m} a_k/b_k ≤ a_k/b_k ≤ max_{1≤k≤m} a_k/b_k,

hence

(3.5)  b_k min_{1≤k≤m} a_k/b_k ≤ a_k ≤ b_k max_{1≤k≤m} a_k/b_k.

Summing (3.5) over k results in

min_{1≤k≤m} a_k/b_k Σ_{k=1}^m b_k ≤ Σ_{k=1}^m a_k ≤ max_{1≤k≤m} a_k/b_k Σ_{k=1}^m b_k,

which implies

min_{1≤k≤m} a_k/b_k ≤ (Σ_{k=1}^m a_k)/(Σ_{k=1}^m b_k) ≤ max_{1≤k≤m} a_k/b_k.

This completes the proof.
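The inequality of Lemma 1 (a mediant inequality) is easy to check numerically; the sketch below verifies it on randomly drawn values, with the sample sizes and ranges chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(2)

# Lemma 1: for real a_k and positive b_k,
#   min_k a_k/b_k  <=  (sum_k a_k)/(sum_k b_k)  <=  max_k a_k/b_k.
a = rng.normal(size=10)          # arbitrary reals
b = rng.uniform(0.1, 2.0, 10)    # strictly positive

r = a / b
ratio = a.sum() / b.sum()
assert r.min() <= ratio <= r.max()
print(r.min(), ratio, r.max())
```

This is the elementary fact used below to sandwich the log-likelihood ratio of the mixture between the extreme component-wise log ratios.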

In the sequel we also need a simple auxiliary result, which we state in the following lemma.

LEMMA 2. Suppose f_n, g_n : (X, A, μ) → (ℝ, B(ℝ), dx) for n ∈ ℕ are two sequences of positive integrable functions such that f_n → f and g_n → g a.e., where f and g are integrable. If ||g_n||_∞ ≤ B ∀n ∈ ℕ and |f_n| ≤ h for some integrable function h, then

PROOF. First we notice that, using the Dominated Convergence Theorem (D.C.T.),

∫ f_n dμ → ∫ f dμ. As |f_n − f| ≤ |f_n| + |f|, it follows from the generalized D.C.T. that f_n → f in L_1. On the other hand,

the first term in the last inequality tends to zero as f_n → f in L_1. To show that the second term tends to zero we notice that g_n → g a.e. and f |g_n − g| ≤ 2Bf. Another use of the D.C.T. completes the proof. □

To prove consistency we need the following assumptions: Assumption A1: The random vector X̄_i : (Ω, A, P_θ) → (ℝ^m, B(ℝ^m), μ(dx̄)), where μ(dx̄) is a σ-finite measure, induces a probability measure P_{z,θ} on (ℝ^m, B(ℝ^m)) whose density with respect to μ is of the following form

where m is a fixed positive integer and Σ_{k=1}^m α_k(z, θ) = 1 for all z ∈ Z and

Assumption A2: The collection of families of probability measures {{P_{z,θ} : θ ∈ Θ} : z ∈ Z} is quasi-identifiable with respect to θ.

Assumption A3: The mixing distribution α_k(z, θ) is a positive continuous function for each θ ∈ Θ and k = 1, 2, ..., m.

Assumption A4: The set, Z, of possible values of the covariate, z, is a compact space.

Assumption A5: The densities g_k^z(x̄, θ) for k = 1, 2, ..., m have the same support for all θ ∈ Θ, z ∈ Z, and k = 1, 2, ..., m, and are bounded continuous functions of (x̄, z) for all θ ∈ Θ and k = 1, 2, ..., m.

Assumption A6: For all 1 ≤ k, l ≤ m, θ ∈ Θ and z ∈ Z,

and

where θ_0 is the true parameter value.

Assumption A7: The true parameter value θ_0 is an interior point of the parameter space Θ.

In the sequel we assume that the observations X̄_i, for i = 1, 2, ..., n, are independent.

To simplify notation, in the sequel, instead of f_{z_i}(X̄_i(ω), θ), α_k(z_i, θ) and g_k^{z_i}(X̄_i, θ) we respectively write f_i(X̄_i(ω), θ), α_{ik}(θ) and g_{ik}(X̄_i, θ).

THEOREM 3. Under assumptions A1–A7,

for any fixed θ ≠ θ_0.

To prove Theorem 3 we need the following lemmas. First recall the following result

LEMMA 3. If b_n ↑ ∞ and Σ_{i=1}^∞ E|X_i|^{r_i} / b_i^{r_i} < ∞, then (1/b_n) Σ_{i=1}^n (X_i − a_i) → 0 a.s., where a_i = 0 or E(X_i) according as 0 < r_i ≤ 1 or 1 < r_i ≤ 2.

PROOF. For the proof of this result see Loève (1977, page 253). □

LEMMA 4. Under assumptions A1–A7,

PROOF. Using Lemma 3 with r_i = 2 and b_i = i, it suffices to show that for any given θ,

for some positive C_θ. Defining g_i = log( f_i(X̄_i, θ)/f_i(X̄_i, θ_0) ) and using Lemma 1, together with the fact that "log" is an increasing function, we obtain

where

h_min(X̄_i, θ) = min_{1≤k≤m} log[ α_{ik}(θ) g_{ik}(X̄_i, θ) / (α_{ik}(θ_0) g_{ik}(X̄_i, θ_0)) ]

and

h_max(X̄_i, θ) = max_{1≤k≤m} log[ α_{ik}(θ) g_{ik}(X̄_i, θ) / (α_{ik}(θ_0) g_{ik}(X̄_i, θ_0)) ],

and

h_max(X̄_i, θ) ≤ max_{1≤k≤m} log( α_{ik}(θ)/α_{ik}(θ_0) ) + max_{1≤k≤m} log( g_{ik}(X̄_i, θ)/g_{ik}(X̄_i, θ_0) ),

where

c_min(i, θ) = min_{1≤k≤m} log( α_{ik}(θ)/α_{ik}(θ_0) ),  c_max(i, θ) = max_{1≤k≤m} log( α_{ik}(θ)/α_{ik}(θ_0) )

and

s_min(X̄_i, θ) = min_{1≤k≤m} log( g_{ik}(X̄_i, θ)/g_{ik}(X̄_i, θ_0) ),  s_max(X̄_i, θ) = max_{1≤k≤m} log( g_{ik}(X̄_i, θ)/g_{ik}(X̄_i, θ_0) ),

and therefore

On the other hand it is easy to see that

and hence

which imply that

Using Minkowski's inequality we obtain (3.10)

On the other hand

Using A6(2), for each fixed θ there exists a constant K_θ such that

Then (3.9), (3.10) and (3.11) imply that for each given θ there exists D_θ such that

To complete the proof we need to find upper bounds for c_min(i, θ) and c_max(i, θ). These are, however, easy to obtain, as

Using A3 and A4, for any given θ we have,

where

Now, using (3.8), (3.12), and (3.13) we obtain (3.7), which completes the proof. □

PROOF. By Lemma 1 we have

(3.14) | log[ f_i(X̄_i, θ) / f_i(X̄_i, θ_0) ] | ≤ max_{1≤k≤m} | log[ α_{ik}(θ)/α_{ik}(θ_0) ] | + max_{1≤k≤m} | log[ g_{ik}(X̄_i, θ)/g_{ik}(X̄_i, θ_0) ] |.

Next, by (3.14), A6(1), Lemma 2 and the D.C.T.,

is a continuous function. Using A2 we have

E_{θ_0}( log[ f_z(X̄, θ) / f_z(X̄, θ_0) ] ) < 0   ∀z ∈ Z and θ ≠ θ_0. Thus it follows from the continuity of f_z(·, θ) and A4 that

which implies

This completes the proof. □

We can now proceed to prove Theorem 3.

PROOF OF THEOREM 3. To prove the assertion we have

P_{θ_0}{ω ∈ Ω : lim sup_{n→∞} (1/n) Σ_{i=1}^n log( f_i(X̄_i(ω), θ) / f_i(X̄_i(ω), θ_0) )

≤ − lim inf_{n→∞} (1/n) Σ_{i=1}^n E_{θ_0} log( f_i(X̄_i, θ_0) / f_i(X̄_i, θ) )}. It should be noted that the continuity of f_z(·, θ) and compactness of Z imply that

lim inf_{n→∞} (1/n) Σ_{i=1}^n E_{θ_0} log( f_i(X̄_i, θ) / f_i(X̄_i, θ_0) ) and lim sup_{n→∞} (1/n) Σ_{i=1}^n E_{θ_0} log( f_i(X̄_i, θ) / f_i(X̄_i, θ_0) )

exist. Now using Lemma 4,

(3.17) lim sup_{n→∞} (1/n) Σ_{i=1}^n [ log( f_i(X̄_i(ω), θ) / f_i(X̄_i(ω), θ_0) ) − E_{θ_0} log( f_i(X̄_i, θ) / f_i(X̄_i, θ_0) ) ] = 0 a.s. On the other hand, using Lemma 5,

(3.18) lim inf_{n→∞} −(1/n) Σ_{i=1}^n E_{θ_0} log( f_i(X̄_i, θ) / f_i(X̄_i, θ_0) ) ≥ − sup_{n∈ℕ} E_{θ_0} log( f_n(X̄_n, θ) / f_n(X̄_n, θ_0) ) > 0. The assertion follows from (3.16), (3.17), and (3.18), which completes the proof. □

Theorem 3 plays the key role in consistency of the MLE as shown in Theorem 4.

THEOREM 4. Suppose A1–A7 are fulfilled and f_z(x̄, θ) is differentiable almost surely with respect to θ in a neighborhood N_{θ_0} of θ_0, with derivative f′_z(x̄, θ). Then the likelihood equation has, almost surely, a root θ̂_n = θ̂_n(X̄_1, X̄_2, ..., X̄_n) for large enough n which is strongly consistent for θ_0.

PROOF. Let ε > 0 be small enough that N_{θ_0}(ε) ⊆ N_{θ_0} and

By Theorem 3, P_{θ_0}(A_ε ∩ A_{−ε}) = 1. Therefore for almost every ω ∈ Ω there exists N(ω) ∈ ℕ such that for n ≥ N(ω) we have

On the other hand, ∏_{i=1}^n f_i(X̄_i(ω), θ) is a differentiable function of θ on the compact set [θ_0 − ε, θ_0 + ε], almost surely. There then exists a value θ̂_n ∈ N_{θ_0}(ε) at which the likelihood function has a local maximum and which is therefore a root of the likelihood equation.

As ε > 0 is arbitrary, θ̂_n → θ_0 P_{θ_0}-almost surely. □

REMARK 2. It should be noted that Theorem 4 only proves existence of a sequence θ̂_n which converges to θ_0 almost surely. When the maximizer is unique we can easily find the maximizing sequence. But if the likelihood equations have more than one solution, it may be difficult to find the consistent sequence. This has been pointed out by Kraft and LeCam (1956), who observe that the existence and consistency of suitably selected roots of the likelihood equation is not adequate except when the maximizer of the likelihood equations is unique, although such consistent roots will eventually be unique. For more details see Kraft and LeCam (1956) and Lehmann (1983, page 414).
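The behaviour described by Theorems 3 and 4 can be illustrated numerically. The sketch below simulates paths from the single-parameter changepoint mixture (unknown θ = π, fully known before/after densities N(0,1) and N(2,1), both illustrative assumptions) and maximizes the log-likelihood over a grid; this grid search stands in for solving the likelihood equation and is not the thesis's procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n, m, pi):
    """n paths; geometric-type change time; N(0,1) before, N(2,1) after."""
    k = np.arange(1, m + 1)
    alpha = np.where(k < m, pi * (1 - pi) ** (k - 1), (1 - pi) ** (m - 1))
    tau = rng.choice(k, size=n, p=alpha)
    X = rng.normal(0.0, 1.0, (n, m))
    for i in range(n):
        X[i, tau[i]:] += 2.0          # mean shift after the change
    return X

def log_lik(pi, X):
    """Mixture log-likelihood as in (3.1)-(3.3); normal constants cancel."""
    n, m = X.shape
    k = np.arange(1, m + 1)
    alpha = np.where(k < m, pi * (1 - pi) ** (k - 1), (1 - pi) ** (m - 1))
    ll = 0.0
    for x in X:
        h1 = np.exp(-0.5 * x ** 2)            # N(0,1) kernel
        h2 = np.exp(-0.5 * (x - 2.0) ** 2)    # N(2,1) kernel
        # density given a change at j: first j observations from h1
        gk = np.array([np.prod(h1[:j]) * np.prod(h2[j:]) for j in k])
        ll += np.log(np.dot(alpha, gk))
    return ll

X = simulate(n=200, m=8, pi=0.3)
grid = np.linspace(0.01, 0.99, 99)
pi_hat = grid[np.argmax([log_lik(p, X) for p in grid])]
print(pi_hat)
```

With the true value 0.3 in the interior of (0, 1) (assumption A7), the maximizer lands near the truth for moderate n, in line with the consistency results above.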

Among the assumptions A1–A7 the only one which needs further discussion is A6. Indeed, this is the only technical assumption we make. In the spirit of the original paper of Wald (1949), A6(1) is a modified version of his A6. We need A6(2) for the law of large numbers. It should be noted that when we have iid random samples we only need the existence of the moment, which is an automatic consequence of A6(1). That is why our A6(1) and A6(2) together are equivalent to A6 of Wald (1949) in our setting. Of course, when Z is finite, A6 is straightforward to check, as we do not need to introduce "sup".

The following example shows that A6 is not a restrictive assumption when Z is compact.

EXAMPLE 3. Suppose that Z = [z_1, z_2], where z_1 and z_2 are two known values. Let α(·, θ) be given by (3.2) and (3.3) and

g_k^z(x̄, θ) = (1 / ((2π)^{m/2} σ_1^k σ_2^{m−k})) exp{ −(1/2) Σ_{l=1}^k [x_l − (a_1 + b_1 z)]² / σ_1² − (1/2) Σ_{l=k+1}^m [x_l − (a_2 + b_2 z)]² / σ_2² },

where σ_s, a_s and b_s for s = 1, 2 are all known values. To check A6(1) it suffices to show

for all z ∈ Z and s, t = 1, 2. Suppose |a_s| + |b_s| (sup_{z∈Z} |z|) ≤ C for s = 1, 2. Existence of such a C follows from compactness of Z. It is easy to see that

for all z ∈ Z and s = 1, 2. Now A6(1) follows from the continuity of B(z), where

B(z) = σ_s² + (a_s + b_s z)² + C[σ_s² + (a_s + b_s z)²]^{1/2} + C²

and the compactness of Z. As we have

to check A6(2) it suffices to show sup_{z∈Z} E_s^z[X − (a_t + b_t z)]^r < ∞ for r = 1, 2, 3, 4 and s, t = 1, 2. This may be tedious to do directly. But we can use the fact that the moment generating function has continuous derivatives of all orders with respect

to z (i.e., it is a C^∞ function of z). Thus all the moments are continuous functions of z. Hence the continuity of moments and compactness of Z imply A6(2).

4. Consistency In The Multiparameter Case For Identifiable Models

The main point of the method employed in the last section is that it does not impose any condition on Θ. But inspection of the proof of Theorem 4 shows that the fact that θ is one-dimensional is crucial for the proof. Indeed, we can efficiently utilize Theorem 3 since the boundary of any interval consists of only two points. When θ is multi-dimensional the boundary is the circumference of a sphere, which consists of uncountably many points. The extension of Theorem 4 to the multi-dimensional case is, therefore, not straightforward.

In this section we establish consistency of the MLE in the multiparameter case. Here we take the first approach referred to at the beginning of this chapter, which deals directly with the maximum likelihood estimator rather than with the roots of the likelihood equation. We mainly focus on models which are strictly identifiable for each z ∈ Z. The model given below is an example of such models. Let

where θ = (γ, τ) and g_k(x̄, γ), not depending on τ, for k = 1, 2, ..., m, are known distributions up to a vector of unknown parameters γ. The mixing distribution α_k(z, β) is as follows:

For simplicity, in the sequel we omit the vector sign when there is no confusion. The following lemma, from Ibragimov and Hasminskii (1981, page 35), plays the key role in the proof of the main theorem of this section. As the proof of the lemma is short and easy, we present it for the sake of completeness. First we need a definition, which we partially adapt from Ibragimov and Hasminskii (1981, page 31).

DEFINITION 2. Let a sequence of experiments E_ε = (X^(ε), A^(ε), P_θ^(ε), θ ∈ Θ) be generated by observations X^ε : (Ω, F, P_θ) → (X^(ε), A^(ε)).

When we have countably many experiments, i.e., E_n = (X^(n), A^(n), P_θ^(n), θ ∈ Θ) is generated by X^n = (X_1, X_2, ..., X_n), where X_1, X_2, ... are observations with values in (X, A), we can utilize the following simple inequality to relate P_θ and P_θ^(n):

To prove uniform strong consistency one may proceed to show that the right-hand side of the last inequality tends to zero.

LEMMA 6. Let E_ε = (X^(ε), A^(ε), P_θ^(ε), θ ∈ Θ) be a family of experiments and let the likelihood functions p_ε(x^ε; θ) correspond to these experiments. Suppose that

R_{ε,θ}(u) = R_ε(u) = p_ε(x^ε; θ + u) / p_ε(x^ε; θ),

where u ∈ U = Θ − θ. Then in order that the maximum likelihood estimator θ̂_ε be consistent it is sufficient that for all θ ∈ Θ and γ > 0

If the last relation is uniform in θ ∈ K, then the estimator θ̂_ε is uniformly consistent in K.

PROOF. Set û_ε = θ̂_ε − θ, so that R_ε(û_ε) = sup_u R_ε(u). Since R_ε(0) = 1, we have

This completes the proof. □

The sufficient condition of the above lemma simply says that for any neighbourhood around the true value of θ, no matter how small, we can take n large enough that the maximum likelihood estimator is inside the neighbourhood. This, however, means that the maximizer is a consistent estimator, as we can take arbitrarily small neighbourhoods.

We are now ready to state and prove the main result of this section. The proof is similar to the proof of Theorem 4.3 of Ibragimov and Hasminskii (1981), with slight modification for handling covariates.

THEOREM 5. Suppose Θ is a bounded open set of ℝ^p and f_z(x; θ) is the density function of P_θ^z with respect to the σ-finite measure ν. Let f_z(x; θ) be a continuous function of θ on Θ for almost all x ∈ X and all z ∈ Z, and let the following conditions be fulfilled:

1. For all θ ∈ Θ and all γ > 0,

(4.4) inf_{z∈Z} inf_{||θ−θ′||>γ} r_z²(θ; θ′) = inf_{z∈Z} inf_{||θ−θ′||>γ} ∫ [f_z^{1/2}(x; θ) − f_z^{1/2}(x; θ′)]² dν = κ_θ(γ) > 0.

2. For all θ ∈ Θ,

(4.5)

Then for all θ ∈ Θ the maximum likelihood estimator θ̂_n → θ as n → ∞ in P_θ-probability, i.e., θ̂_n is a weakly consistent estimator of θ.

PROOF. In view of the above lemma and Markov's inequality it suffices to find an upper bound, tending to zero, for the expectation E_θ sup_u R_n(u). This is the core of the proof given below. Suppose θ is fixed and consider

as a function of u. Let Γ be a sphere of small radius δ situated in its entirety in the region ||u|| > γ. We shall bound the expectation E_θ sup_Γ R_n(u). If u_0 is the

Therefore we obtain

Now it is easy to see that

On the other hand, using the Cauchy-Schwarz inequality

∫ sup_{||t||≤δ} | f_z^{1/2}(x; θ + u_0 + t) − f_z^{1/2}(x; θ + u_0) | f_z^{1/2}(x; θ) dν ≤ ω_{θ+u_0}(δ), by (4.5).

Taking into account the elementary inequality 1 + a ≤ e^a, a ∈ ℝ, we obtain

It follows from (4.7) that to each point ζ of the set Ū \ {||u|| < γ} there corresponds a sphere Γ(ζ) with center ζ such that sup_{Γ(ζ)} R_n(u) → 0 as n → ∞ in P_θ-probability. Using compactness, select a finite cover Γ(ζ_q), q = 1, 2, ..., N, of the set Ū \ {||u|| < γ} from the collection {Γ(ζ)}. Then


This completes the proof. □

REMARK 3. In view of the heuristic argument given in the introduction concerning the nature of the conditions needed for consistency, we can see that (4.4) is an identifiability condition and (4.5) is a smoothness condition.

Under (4.4) and (4.5) it is possible to prove strong consistency. But before we proceed further we need to briefly discuss the Hellinger distance.

DEFINITION 3. Suppose P and Q are two positive measures defined on the σ-algebra A. The Hellinger distance between P and Q is defined as follows

It is easy to see that r is a metric. If P and Q are probability measures we have

where ρ(P, Q) = ∫ √(dP dQ) is called the affinity between P and Q. Suppose P and Q are dominated by μ (for example, one may consider μ = P + Q). Then we have

where f and g are respectively the Radon–Nikodym derivatives of P and Q with respect to μ.

The Hellinger distance and the L_1-norm,

are related by the following inequality (see LeCam (1990a, page 25))

This shows that the Hellinger distance and the L_1 norm both induce the same topology on the space of probability measures defined on the σ-algebra A. As pointed out by LeCam (1990a, page 25), working with the Hellinger distance is much simpler when we are dealing with the direct product of probability measures, which is the case when we have independent observations. The core of this simplicity is that Kakutani (1948) has shown

For a thorough discussion on the Hellinger distance see LeCam (1986, 1990a). The following lemma is crucial for the proof of strong consistency.
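The Hellinger distance, the affinity, and their relation to the L_1 norm are easy to verify numerically for discrete distributions. The sketch below uses the unsquared normalization r²(P, Q) = ∫(√dP − √dQ)² = 2(1 − ρ(P, Q)), matching the definition above, and checks the two-sided bound r² ≤ ||P − Q||_1 ≤ 2r; the two distributions are arbitrary illustrative choices.

```python
import numpy as np

# Two discrete probability distributions on a common support (illustrative).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

rho = np.sum(np.sqrt(p * q))                           # affinity rho(P, Q)
r = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))    # Hellinger distance
l1 = np.sum(np.abs(p - q))                             # L1 distance

# r^2 = 2(1 - rho), and r^2 <= L1 <= 2r
assert abs(r ** 2 - 2 * (1 - rho)) < 1e-12
assert r ** 2 <= l1 <= 2 * r
print(rho, r, l1)
```

The identity r² = 2(1 − ρ) is what makes product measures convenient: the affinity of a product is the product of the affinities, which is the content of Kakutani's result cited above.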

PROOF. Suppose sup_θ ω_θ(δ) does not tend to 0. Then there exists a sequence of numbers δ_n → 0 and θ_n ∈ Θ such that ω_{θ_n}(δ_n) > γ > 0. Using the compactness of Θ̄ we may assume that θ_n → θ* for some θ* ∈ Θ̄. Continuity of f_z(x; θ) as a function of θ for every x ∈ X and compactness of {t : ||t|| ≤ δ} imply that there exists a function t* : X → B_δ(0) = {θ ∈ Θ : ||θ|| ≤ δ} such that for every x,

On the other hand using Minkowski's inequality we have

which implies that

where

Taking the sup over z E Z we obtain

This contradiction completes the proof. □

We can now proceed to prove strong consistency.

PROOF OF STRONG CONSISTENCY. Suppose $\gamma$ is fixed. Using compactness, the exterior of the sphere $\|u\| \le \gamma$ can be covered by $N$ spheres $\Gamma_q$, $q = 1, 2, \ldots, N$, of radius $\delta$ with centers $c_q$. Using the above lemma, $\delta$ can be chosen to be sufficiently small so that all the spheres are located in the region $\|u\| > \frac{\gamma}{2}$ and $w_{\theta+c_q}(\delta) \le \frac{1}{4}\kappa_\theta(\frac{\gamma}{2})$ for all $q = 1, 2, \ldots, N$. Put $\hat{u}_n = \hat{\theta}_n - \theta$. Then using (4.7) we have

Therefore

$$\le \frac{N \exp\left\{-\frac{n}{4}\,\kappa_\theta\!\left(\frac{\gamma}{2}\right)\right\}}{1 - \exp\left\{-\frac{1}{4}\,\kappa_\theta\!\left(\frac{\gamma}{2}\right)\right\}}.$$

This completes the proof. □

REMARK 4. Using Lemma 7 and the above discussion on strong consistency, it is not hard to see that if $\inf_{\theta \in \Theta} \kappa_\theta(\gamma) > 0$ for all $\gamma > 0$, then $\hat{\theta}_n$ is a uniformly strongly consistent estimator of $\theta$. Indeed, in the proof of strong consistency it suffices to choose $\delta$ such that $w_{\theta+c_q}(\delta) < \frac{1}{4}\inf_\theta \kappa_\theta(\frac{\gamma}{2})$ for all $q$.

Conditions (4.4) and (4.5) may seem hard to check. Corollary 1 to Theorem 5 gives sufficient conditions for fulfillment of (4.4) and (4.5). The conditions provided in the corollary are also sufficient for uniform strong consistency.

COROLLARY 1. Let $\Theta \subset \mathbb{R}^p$ be a bounded open set, and let $Z$ be a compact subset of a normed linear space $E$ with norm $\|\cdot\|$. Let $f_z(x;\theta)$ be a continuous function on $\Theta \times Z$ for almost all $x$. If $f_z(x;\theta)$ is strictly identifiable with respect to $\theta \in \Theta$ for all $z \in Z$ and

$$\int_{\mathcal{X}} \sup_{\|t\| \le \delta,\ \|\zeta\| \le \delta} f_{z+\zeta}(x;\theta+t)\, d\nu < \infty, \tag{4.9}$$

then the maximum likelihood estimator $\hat{\theta}_n$ is a strongly consistent estimator of $\theta$.

PROOF. To verify condition 1 of Theorem 5 we first notice that

Continuity of $f$ on $\Theta \times Z$, the general convergence theorem (Royden (1988), page 270) and the fact that $f$ is a density function imply that the Hellinger distance $r_z^2(\theta, \cdot) = \int_{\mathcal{X}} (f_z^{1/2}(x;\theta) - f_z^{1/2}(x;\cdot))^2\, d\nu$ is a continuous function for each $\theta \in \Theta$. Then condition 1 follows from compactness of $Z \times (\bar{\Theta} \setminus B_\gamma(\theta))$ for any $\theta \in \Theta$ and the identifiability of $f$ for each $z$.

To verify the second condition we first notice that continuity of f implies that

$$\sup_{\|t\| \le \delta,\ \|\zeta\| \le \delta} \left[ f_z^{1/2}(x;\theta) - f_{z+\zeta}^{1/2}(x;\theta+t) \right]^2 \to 0 \quad \text{as } \delta \to 0$$

for almost all $x \in \mathcal{X}$. Using (4.9), (4.10) and the D.C.T. we obtain that $w_z(\delta) \to 0$ as $\delta \to 0$. Now suppose $\lim_{\delta \to 0} \sup_{z \in Z} w_z(\delta) \neq 0$. Then there exists a sequence of numbers $\delta_n \to 0$ and a sequence of points $z_n \in Z$ such that $w_{z_n}(\delta_n) > \gamma > 0$. Using the compactness of $Z$ we may assume that $z_n \to z \in Z$. Note that in metric spaces compactness and sequential compactness are equivalent. Thus, using Minkowski's inequality,

Using the D.C.T. and continuity of $f$, the first term in the last inequality tends to zero as $n \to \infty$. An application of the D.C.T., utilizing the continuity of $f$ and (4.11), shows that the second term tends to zero too. This implies that $w_{z_n}(\delta_n) \to 0$ as $n \to \infty$. This contradiction completes the proof. □

REMARK 5. As seen in the proof of Theorem 5, the openness assumption on $\Theta$ is not crucial. The boundedness is, however, a key assumption in the proof. The reason that in many proofs of consistency it is assumed that $\Theta$ is an open set is that for asymptotic normality this condition is needed for the validity of Taylor's expansion. These proofs combine consistency and asymptotic normality under one set of hypotheses. Openness is also needed to show that the MLE is a root of the likelihood equation, provided that the likelihood is a differentiable function of $\theta$.

Condition (4.9) may look cumbersome to verify. The following example shows that the condition is not that stringent. Indeed, this condition may be compared with Assumption 2 of Wald (1949). In a more general setting one can also compare this condition with Condition 3.2 of Pfaff (1982).

EXAMPLE 4. Consider the model (4.1) whose mixing distribution and mixing components, $a_k(z;\beta)$ and $g_k(x;\eta)$, are respectively given by (4.2) and (4.3).

Let (with an abuse of notation) $h_1(x;\mu,\sigma) = N(\mu,\sigma^2)$ and $h_2(x;\lambda,\tau) = N(\lambda,\tau^2)$, where $(\mu,\sigma) \neq (\lambda,\tau)$. It is clear that $\theta = (\beta,\eta)$ where $\eta = (\mu,\sigma,\lambda,\tau)$. As we have only one continuous covariate and the regression model is $\beta z$, strict identifiability of the model for each given $z$ is clear. To check (4.9) we first notice that

This implies that

it suffices to check condition (4.9) for the Normal distribution when both parameters are unknown. Define

It is not hard to see that

On the other hand one can easily see that

which implies (4.9).

5. Consistency In The Multiparameter Case For Quasi-Identifiable Models

Strict identifiability for each $z \in Z$ is a crucial assumption which restricts applications of our model. In this section we show how one can relax this condition. Starting with a general result, we then show how this leads to the cases of our interest.

THEOREM 6. Let $\Theta$ be a bounded open subset of $\mathbb{R}^p$, and let $f_i(x;\theta)$ be a continuous function of $\theta$, for almost all $x \in \mathcal{X}$ and all $i \in \mathbb{N}$. Suppose for all $\theta \in \Theta$ there exists a positive decreasing function $\Psi_\theta(n)$ such that $\Psi_\theta(n)$ is a continuous function on $\Theta$ for all $n \in \mathbb{N}$ and $\Psi_\theta(n) \downarrow 0$ as $n \to \infty$. If the following conditions are fulfilled

where

$$\kappa_{\theta,n}(\gamma) = \inf_{\|\theta-\theta'\| > \gamma} \sum_{i=1}^{n} r_i^2(\theta,\theta') \quad \text{and} \quad r_i^2(\theta,\theta') = \int_{\mathcal{X}} \left[ f_i^{1/2}(x;\theta) - f_i^{1/2}(x;\theta') \right]^2 d\nu,$$

and

where

$$\lambda_\theta(\delta) = \sum_{i=1}^{n} C_i(\delta) \quad \text{and} \quad C_i(\delta) = \left\{ \int_{\mathcal{X}} \sup_{\|h\| \le \delta} \left[ f_i^{1/2}(x;\theta) - f_i^{1/2}(x;\theta+h) \right]^2 d\nu \right\}^{1/2}.$$

Then the maximum likelihood estimator, $\hat{\theta}_n$, is strongly consistent for $\theta$.

It should be noted that (5.1) is an identifiability condition and (5.2) is a smoothness condition. These two conditions extend (4.4) and (4.5) of Theorem 5.

PROOF OF WEAK CONSISTENCY. Using the method employed for Theorem 5 we obtain

Using (5.1) and (5.2),

which implies weak consistency of $\hat{\theta}_n$.

To prove strong consistency we need the following modified form of Lemma 7.

PROOF. Suppose that $\lim_{\delta \to 0} \sup_{\theta \in \bar{\Theta}} \lambda_\theta(\delta) \neq 0$. Then there exists a sequence of numbers $\delta_n \to 0$ and $\theta_n \to \theta' \in \bar{\Theta}$ such that $\lambda_{\theta_n}(\delta_n) > \epsilon > 0$. Now, as in the proof of Lemma 7, we obtain

Summing both sides over i = 1,2, ..., n, we have

Multiplying by $\Psi_{\theta_n}(n)$, taking $\limsup$ as $n \to \infty$, and using the fact that

$\limsup (A + B) \le \limsup A + \limsup B$, where $A$ and $B$ are any two sequences bounded from above, we obtain

which is a contradiction.

We can now proceed to prove strong consistency.

PROOF OF STRONG CONSISTENCY. In view of Lemma 8 we can choose $\delta$ small enough such that $\lambda_{\theta+c_q}(\delta) \le \frac{1}{4} m(\frac{\gamma}{2})$. Then there exists $n_0 = n_0(\delta)$ such that for all $n \ge n_0$ we have

and

Therefore, in view of (4.7)

Thus for $n \ge n_0$ we obtain

This completes the proof of strong consistency. □

The following observation provides some intuition about (5.1) and (5.2). Suppose that $\Psi_\theta(n) = \frac{1}{n}$ for all $n \in \mathbb{N}$ and that we can interchange $\frac{1}{n}\sum_{i=1}^{n}$ and $\inf_{\|\theta-\theta'\| > \gamma}$. If, in addition, $Z$ is a random variable with distribution function $Q(z)$ such that $E^{Q(z)}[r_z^2(\theta,\theta')]$ exists, then

Similarly we obtain

provided that $E^{Q(z)}[C_z(\delta)]$ exists.

The following corollary shows how one can apply the above theorem to a quasi-identifiable collection of families.

COROLLARY 1. Let $\Theta \subset \mathbb{R}^p$ be a bounded open set and $Z$ be a finite set. Suppose that $f_z(x;\cdot)$ is a continuous function on $\Theta$ for almost all $x \in \mathcal{X}$ and all $z \in Z$, the collection of families $\{\{f_z(x;\theta) : \theta \in \Theta\} : z \in Z\}$ is quasi-identifiable with respect to $\theta$, and for each $z \in Z$ and $\theta \in \Theta$ the following conditions are fulfilled:

$$2.\quad \liminf_{n \to \infty} \frac{C_z(n)}{n} = \nu_z > 0, \tag{5.7}$$

where $C_z(n) = \#\{\text{observations from } f_z \text{ in a sample of size } n\}$. Then the maximum likelihood estimator is a strongly consistent estimator of $\theta$.

Condition (5.6) is a familiar condition which has already been discussed. But condition (5.7) is new here. This second condition simply says that we should have enough observations in each category. This is, of course, a familiar condition in the asymptotic theory of categorical data.
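Condition (5.7) can be illustrated with a toy sampling design (hypothetical, not from the thesis): if the covariate values cycle through a finite set $Z$, each category receives a fixed positive fraction of the sample, so $C_z(n)/n$ stays bounded away from zero.

```python
from collections import Counter

# Hypothetical design: covariate values cycle deterministically through Z,
# so C_z(n), the number of observations from f_z in a sample of size n,
# satisfies C_z(n)/n -> 1/|Z| > 0, as condition (5.7) requires.
Z = ["z1", "z2", "z3"]
n = 9000
design = [Z[i % len(Z)] for i in range(n)]

counts = Counter(design)
proportions = {z: counts[z] / n for z in Z}

assert min(proportions.values()) > 0           # every category is represented
assert abs(proportions["z1"] - 1 / 3) < 1e-12  # here nu_z = 1/3 for each z
```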

PROOF OF COROLLARY 1. To prove this corollary it suffices to check (5.1) and (5.2) for $\Psi_\theta(n) = \frac{1}{n}$. Using the finiteness of $Z$, the continuity of $f_z(x;\cdot)$ on $\Theta$ for all $z \in Z$ and almost all $x \in \mathcal{X}$, and (5.6), which allows for the application of the D.C.T., we obtain the continuity of $\sum_{z \in Z} r_z^2(\theta,\theta')$. Then quasi-identifiability and compactness of $\bar{\Theta}$ imply

As $Z$ is a finite set, there exists a $k \in \mathbb{N}$ and $0 < \epsilon < \min_{z \in Z} \nu_z$ such that $\frac{1}{n} C_z(n) > \nu_z - \epsilon$ for all $n \ge k$ and $z \in Z$. Thus

$$\ge \left( \min_{z \in Z} \nu_z - \epsilon \right) \inf_{\|\theta-\theta'\| > \gamma} \sum_{z \in Z} r_z^2(\theta,\theta') > 0 \quad \forall\, n \ge k.$$

This implies that

$$\liminf_{n \to \infty} \Psi_\theta(n)\, \kappa_{\theta,n}(\gamma) > 0 \quad \forall\, \theta \text{ and } \gamma > 0.$$

To check condition (5.2) we utilize the continuity of $f_z(x;\cdot)$, which implies

$$\sup_{\|t\| \le \delta} \left| f_z^{1/2}(x;\theta) - f_z^{1/2}(x;\theta+t) \right| \to 0 \quad \text{as } \delta \to 0, \quad \forall z \in Z.$$

This completes the proof of this corollary. □

It is easy to see that (5.7) can be replaced by

$$2'.\quad \liminf_{n \to \infty} \frac{C_z(n)}{A(n)} > 0 \quad \forall z \in Z$$

for some positive increasing function $A$ such that $A(n) \uparrow \infty$ as $n \to \infty$.

REMARK 6. Theorem 6 gives a general result for independent but non-identically distributed (INID) random variables. In the very extensive literature on consistency of MLE's there are two important papers whose main concern is INID random variables. The first is by Ibragimov and Hasminskii (1975), in which the single parameter case is addressed. Their conditions are different from ours and more complicated to check. Nevertheless, they established both consistency and asymptotic normality. The other paper is by Hoadley (1971), in which there are two separate sets of conditions, for consistency and for asymptotic normality. Under the first set of conditions he established weak consistency of the MLE's. He also pointed out that under his conditions Chao (1970) (this paper was not available) had established strong consistency. The difference between our approach and his is in both the conditions and the method of the proof. His approach is actually an extension of Wald's approach (Wald (1949)) to INID random variables.

Condition C2 of Hoadley (1971), which invokes uniform upper semi-continuity of $f_i(X_i;\theta)$ as a function of $\theta$, does not seem easy to check. There are also other boundedness conditions such as C4(i) which may be cumbersome. In addition, the condition imposed by Chao (1970) (based on Hoadley (1971)),

$$\frac{f_i(X_i;\theta)}{f_i(X_i;\theta_0)} \to 0 \quad \text{uniformly in } i\ \text{a.s.}[P_{\theta_0}] \text{ as } \|\theta\| \to \infty,$$

where $\theta_0$ is the true value of the parameter $\theta$, looks restrictive too. Indeed, it is not satisfied for a simple example given by Hoadley (1971). These uniformity conditions, however, are automatically satisfied when there are only finitely many different distributions and the parameter space is compact. As a result, among the conditions of Hoadley (1971), except for the boundedness condition, the other conditions are automatically satisfied for Corollary 1 of Theorem 6.

CHAPTER IV

ASYMPTOTIC NORMALITY OF THE MAXIMUM LIKELIHOOD ESTIMATORS

1. Introduction

In Chapter III we discussed consistency of the MLE's. In order to carry out tests of hypotheses we need to establish the large sample distribution of these estimators. The main concern of this chapter is, therefore, to establish asymptotic normality.

The asymptotic normality of a family of distributions follows from a quadratic approximation of the logarithm of the likelihood. In view of Morse's lemma (Milnor (1963), page 6) it is sufficient that the Hessian matrix be negative definite, provided that the second derivatives exist. Under mild conditions, however, the Fisher information matrix is equal to the negative of the Hessian matrix. Thus the crucial condition for a quadratic approximation to the logarithm of the likelihood is positive definiteness of the information matrix. Of course, this is not all as there

are other technical conditions involved too. Indeed, the conditions for asymptotic normality can be classified into three groups. The first are smoothness conditions which we usually guarantee by assuming the existence of the third derivative, although LeCam (1970) has pointed out that it is just the first derivative which is needed (see also Pollard (1994)). The second group of conditions involves boundedness of the derivatives. The third condition, which is of a different nature, is

positive definiteness of the information matrix.

Although there are a few cases in which the second or third derivative does not exist, smoothness is not really a troublesome assumption in most statistical applications. The boundedness conditions are not troublesome either, particularly when the parameter space is compact. The condition which is really hard to check is the positive definiteness of the information matrix. As mentioned by Amari (1982), positive definiteness of the information matrix is fulfilled for the exponential family under suitable conditions, which has been proved by Barndorff-Nielsen (1978). Checking positive definiteness of the information matrix for mixture distributions, however, seems virtually impossible.

There appears to be no result in the literature on positive definiteness of the information matrix for mixture distributions. An exception is the mixture distributions which are members of the exponential family (see Teicher (1960), Theorems 3 and 6.5). Because of the pivotal role positive definiteness of the information matrix plays in establishing asymptotic normality, the focus of this chapter is this topic. It is shown that in the presence of identifiability and some other smoothness conditions the set of zeros of the determinant of the information matrix is a nowhere dense (rare) set. Hence, even though it is virtually impossible to establish positive definiteness except for the most trivial of mixture problems, there is some consolation in the knowledge that this condition is "rarely" violated. For discussion on the difficulties with the asymptotics of mixtures we refer the reader to Chen (1994). See also Chen (1995) and Chernoff and Lander (1995).

This chapter is organized as follows. In section 2 some preliminaries on differential geometry and functional analysis are recalled. Section 3 concerns the set of zeros of the determinant of the information matrix. We first prove our main result, stated in Theorem 5, for IID random variables. Then Theorem 8 shows how it can be extended to INID random variables. As we are mainly interested in mixtures of an exponential family, we discuss the category of the set of zeros of the determinant of the information matrix for an exponential family in section 4. It is shown that this set is an isolated subset of the parameter space. This means that for any point of this set there is a neighborhood which does not contain any other point of the set. We also briefly discuss the measure of the set of zeros of the determinant of the information matrix in this section. Finally, in section 5 we argue how checking the smoothness and boundedness conditions for mixture distributions can be reduced to checking the same conditions for the components of the mixture and the mixing distributions, separately.

2. Preliminaries From Differential Geometry And Functional Analysis

In this section we recall some definitions and notions of differential geometry which are needed in the sequel. Our reference for the definitions and results given below is Abraham, Marsden and Ratiu (1988). In the following, $E$ and $F$ are Banach spaces and $\mathcal{L}(E,F)$ is the space of all bounded linear maps from $E$ to $F$.

DEFINITION 1. Suppose $f, g : U \subset E \to F$ where $U$ is open in $E$. We say $f$ and $g$ are tangent at the point $u_0 \in U$ if

$$\lim_{u \to u_0} \frac{\| f(u) - g(u) \|}{\| u - u_0 \|} = 0,$$

where $\|\cdot\|$ represents the norm (presumed to be defined) on the appropriate space.

PROPOSITION 1. For $f : U \subset E \to F$ and $u_0 \in U$ there is at most one $L \in \mathcal{L}(E,F)$ such that the map $g_L : U \subset E \to F$ given by $g_L(u) = f(u_0) + L(u - u_0)$ is tangent to $f$ at $u_0$.

PROOF. See Abraham, Marsden and Ratiu (1988), page 75. □

DEFINITION 2. If in the above proposition there is such an $L \in \mathcal{L}(E,F)$, we say $f$ is differentiable at $u_0$ and define the derivative of $f$ at $u_0$ to be $Df(u_0) = L$. The evaluation of $Df(u_0)$ on $e \in E$ will be denoted by $Df(u_0) \cdot e$. If $f$ is differentiable at each $u_0 \in U$, the map $Df : U \to \mathcal{L}(E,F)$ is called the derivative of $f$. Moreover, if $Df$ is a continuous map (where $\mathcal{L}(E,F)$ has the norm topology), we say $f$ is of class $C^1$ (or is continuously differentiable). Inductively we can define

$$D^r f = D(D^{r-1} f).$$

If $D^r f$ exists and is norm continuous, we say $f$ is of class $C^r$.

The directional derivative can be defined as follows.

DEFINITION 3. Let $f : U \subset E \to F$ and let $u \in U$. We say that $f$ has a derivative in the direction $e \in E$ at $u$ if

$$\lim_{t \to 0} \frac{f(u + te) - f(u)}{t}$$

exists. We call this element of $F$ the directional derivative of $f$ in the direction $e$ at $u$.

A function all of whose directional derivatives exist is called Gateaux differentiable, whereas a function differentiable in the sense we have defined earlier is called Frechet differentiable. The latter is, of course, stronger, as the following proposition shows.

PROPOSITION 2. If $f$ is differentiable at $u$, then the directional derivatives of $f$ exist at $u$ and are given by

$$Df(u) \cdot e = \lim_{t \to 0} \frac{f(u + te) - f(u)}{t}.$$

PROOF. See Abraham, Marsden and Ratiu (1988), page 86. □

As Gateaux derivatives are easier to calculate, one may start with the Gateaux derivative and then ask for circumstances under which the Gateaux derivative coincides with the Frechet derivative. The following result gives such circumstances. First we need a definition.

DEFINITION 4. If $f : U \subset E \to F$ is Gateaux differentiable and the Gateaux derivative is in $\mathcal{L}(E,F)$, i.e., for each $u \in U$ there exists $G_u \in \mathcal{L}(E,F)$ such that

$$G_u \cdot e = \lim_{t \to 0} \frac{f(u + te) - f(u)}{t} \quad \forall e \in E,$$

and if $u \mapsto G_u$ is continuous, we say $f$ is $C^1$-Gateaux.

THEOREM 1. If $f : U \subset E \to F$ is $C^1$-Gateaux, then it is $C^1$ and the two derivatives coincide.

PROOF. See Abraham, Marsden and Ratiu (1988), page 88. □

In the sequel we also need partial derivatives.

DEFINITION 5. Let $f : U \to F$ be a mapping defined on the open set $U \subset E_1 \oplus E_2$ and let $u_0 = (u_{01}, u_{02})$. If the derivatives of the mappings $v_1 \mapsto f(v_1, u_{02})$ and $v_2 \mapsto f(u_{01}, v_2)$ exist, where $v_i \in E_i$ for $i = 1, 2$, they are called the partial derivatives of $f$ at $u_0 \in U$ and are denoted by $D_i f(u_0) \in \mathcal{L}(E_i, F)$, for $i = 1, 2$.

PROPOSITION 3. Let $U \subset E_1 \oplus E_2$ be open and $f : U \to F$.
(i) If $f$ is differentiable, then the partial derivatives exist and are given by

$$D_1 f(u) \cdot e_1 = Df(u) \cdot (e_1, 0), \qquad D_2 f(u) \cdot e_2 = Df(u) \cdot (0, e_2).$$

(ii) If $f$ is differentiable, then

$$Df(u) \cdot (e_1, e_2) = D_1 f(u) \cdot e_1 + D_2 f(u) \cdot e_2.$$

(iii) $f$ is of class $C^r$ iff $D_i f : U \to \mathcal{L}(E_i, F)$, for $i = 1, 2$, both exist and are of class $C^{r-1}$.

PROOF. See Abraham, Marsden and Ratiu (1988), page 89. □
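Proposition 3(ii) can be checked numerically by finite differences for a smooth map on $\mathbb{R}^2 = E_1 \oplus E_2$; the function below is an arbitrary illustrative choice, not from the text.

```python
# Finite-difference sketch of Proposition 3(ii):
# Df(u).(e1, e2) = D1 f(u).e1 + D2 f(u).e2 for a smooth f on R^2.
def f(x, y):
    return x * x * y + 3.0 * y          # arbitrary smooth example

def directional(g, u, e, h=1e-6):
    """Central-difference approximation of Dg(u).e."""
    (x, y), (e1, e2) = u, e
    return (g(x + h * e1, y + h * e2) - g(x - h * e1, y - h * e2)) / (2 * h)

u, e = (1.2, -0.7), (0.5, 2.0)
full = directional(f, u, e)                    # Df(u).(e1, e2)
partials = (directional(f, u, (e[0], 0.0))     # D1 f(u).e1
            + directional(f, u, (0.0, e[1])))  # D2 f(u).e2

assert abs(full - partials) < 1e-6
```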

DEFINITION 6. Let $X$ and $Y$ be topological spaces and let $f : X \to Y$ be a bijection. If both the function $f$ and the inverse function $f^{-1} : Y \to X$ are continuous, then $f$ is called a homeomorphism.

Now we can state Brouwer's theorem.

Brouwer's Theorem of Invariance of Domain: There is no homeomorphism between an open subset $U \subset \mathbb{R}^n$ and an open subset of $\mathbb{R}^m$ when $m \neq n$.

PROOF. See Massey (1991), pages 216-217.

In connection with Brouwer's Theorem of Invariance of Domain one can also consult Chapter XVII, §3 of Dugundji (1966). Next we define diffeomorphism.

DEFINITION 7. A map $f : U \subset E \to V \subset F$ ($U$, $V$ open) is a $C^r$ diffeomorphism if $f$ is of class $C^r$, is a bijection (that is, one-to-one and onto $V$), and $f^{-1}$ is also of class $C^r$.

DEFINITION 8. Let $S$ be a set. A chart on $S$ is a bijection $\phi$ from a subset $U$ of $S$ to an open subset of a Banach space. We sometimes denote $\phi$ by $(U, \phi)$, to indicate the domain $U$ of $\phi$. A $C^k$ atlas on $S$ is a family of charts $\mathcal{A} = \{(U_i, \phi_i) \mid i \in I\}$ such that
1. $S = \bigcup\{U_i \mid i \in I\}$;
2. Any two charts in $\mathcal{A}$ are compatible in the sense that the overlap maps between members of $\mathcal{A}$ are $C^k$ diffeomorphisms: for two charts $(U_i, \phi_i)$ and $(U_j, \phi_j)$ with $U_i \cap U_j \neq \emptyset$, we form the overlap map $\phi_{ji} = \phi_j \circ \phi_i^{-1}|_{\phi_i(U_i \cap U_j)}$, where $\phi_i^{-1}|_{\phi_i(U_i \cap U_j)}$ means the restriction of $\phi_i^{-1}$ to the set $\phi_i(U_i \cap U_j)$. We require that $\phi_i(U_i \cap U_j)$ be open and that $\phi_{ji}$ be a $C^k$ diffeomorphism.

DEFINITION 9. Two $C^k$ atlases $\mathcal{A}_1$ and $\mathcal{A}_2$ are equivalent if $\mathcal{A}_1 \cup \mathcal{A}_2$ is a $C^k$ atlas. A $C^k$ differentiable structure $\mathcal{D}$ on $S$ is an equivalence class of atlases on $S$. The union of the atlases in $\mathcal{D}$, $\mathcal{A}_{\mathcal{D}} = \bigcup\{\mathcal{A} \mid \mathcal{A} \in \mathcal{D}\}$, is the maximal atlas of $\mathcal{D}$, and a chart $(U, \phi) \in \mathcal{A}_{\mathcal{D}}$ is an admissible local chart. If $\mathcal{A}$ is a $C^k$ atlas on $S$, the union of all atlases equivalent to $\mathcal{A}$ is called the $C^k$ differentiable structure generated by $\mathcal{A}$. A differentiable manifold $M$ is a pair $(S, \mathcal{D})$, where $S$ is a set and $\mathcal{D}$ is a $C^k$ differentiable structure on $S$. We shall often identify $M$ with the underlying set $S$ for notational convenience. If a covering by charts takes their values in a Banach space $E$, then $E$ is called the model space and we say $M$ is a $C^k$ Banach manifold modeled on $E$.

Having the above definitions we can define open subsets of a manifold.

DEFINITION 10. Let $M$ be a differentiable manifold. A subset $A \subset M$ is called open if for each $a \in A$ there is an admissible local chart $(U, \phi)$ such that $a \in U$ and $U \subset A$.

DEFINITION 11. A submanifold of a manifold $M$ is a subset $B \subset M$ with the property that for each $b \in B$ there is an admissible chart $(U, \phi)$ in $M$ with $b \in U$ which has the submanifold property, namely,

$$\phi : U \to E \times F \quad \text{and} \quad \phi(U \cap B) = \phi(U) \cap (E \times \{0\}).$$

An open subset $V$ of $M$ is a submanifold in this sense. It suffices to take $F = \{0\}$, and for $x \in V$ use any chart $(U, \phi)$ of $M$ for which $x \in U$.

DEFINITION 12. Suppose $f : M \to N$ where $M$ and $N$ are $C^k$ manifolds (that is, $f$ maps the underlying set of $M$ into that of $N$). We say $f$ is of class $C^r$, $0 \le r \le k$, if for each $m \in M$ and an admissible chart $(V, \psi)$ of $N$ with $f(m) \in V$, there is a chart $(U, \phi)$ of $M$ satisfying $m \in U$ and $f(U) \subset V$, and such that the local representative of $f$, $f_{\psi\phi} = \psi \circ f \circ \phi^{-1}$, is of class $C^r$.

DEFINITION 13. Let $M$ be a manifold and $m \in M$. A curve at $m$ is a $C^1$ map $c : I \to M$ from an interval $I \subset \mathbb{R}$ into $M$ with $0 \in I$ and $c(0) = m$. Let $c_1$ and $c_2$ be curves at $m$ and $(U, \phi)$ an admissible chart with $m \in U$. Then we say $c_1$ and $c_2$ are tangent at $m$ with respect to $\phi$ if and only if $(\phi \circ c_1)'(0) = (\phi \circ c_2)'(0)$.

Tangency of curves is a notion that is independent of the chart used (see Abraham, Marsden and Ratiu (1988), page 158). That is why we can say $c_1$ and $c_2$ are tangent at $m \in M$ without referring to the local chart $\phi$. Tangency at $m \in M$ is an equivalence relation which partitions curves at $m$ into equivalence classes $[c]_m$, where $c$ is a representative of the class.

DEFINITION 14. For a manifold $M$ and $m \in M$ the tangent space to $M$ at $m$ is the set of equivalence classes of curves at $m$:

$$T_m M = \{[c]_m : c \text{ is a curve at } m\}.$$

For a subset $A \subset M$, let $TM|_A = \bigcup_{m \in A} T_m M$ (disjoint union). We call $TM = TM|_M$ the tangent bundle of $M$.

If $M = U$ where $U$ is an open subset of a Banach space $E$, $TU$ is defined by $U \times E$. Lemma 3.3.4 of Abraham, Marsden and Ratiu (1988, page 158) shows how $TU$ as defined by the above definition can be identified with $U \times E$. If $M$ is a $C^{r+1}$ manifold, then $TM$ is a $C^r$ manifold. If $M$ is $n$-dimensional, i.e. $M$ is modeled on an $n$-dimensional Banach space, then $TM$ is a $2n$-dimensional Banach manifold (see Abraham, Marsden and Ratiu (1988)). The tangent space $T_m M$ at a point $m \in M$ is an $n$-dimensional Banach space. Note that we are dealing with simple manifolds. For manifolds with boundaries and corners the dimension of the tangent space $T_m M$ is different and depends on where $m$ is.

It should be noted that the approach taken by Abraham, Marsden and Ratiu (1988) for defining the tangent space to an abstract manifold is the curve approach. This means identifying a tangent vector to a surface with the velocity vector of a curve in the surface. However, the approach which is more related to our purposes is the derivative approach. First we need to define an algebra.

DEFINITION 15. $\mathfrak{A}$ is an algebra over the scalar field $\mathcal{K}$ if its elements admit the three operations of addition, multiplication, and scalar multiplication, subject to the following conditions. $\mathfrak{A}$ is a linear space with addition and scalar multiplication. The multiplication satisfies:
1. Every ordered pair of elements $x, y$ has a unique product $xy$;
2. Multiplication is associative: $(xy)z = x(yz)$.
Addition and multiplication are distributive:

$$x(y + z) = xy + xz, \qquad (y + z)x = yx + zx.$$

Multiplication and scalar multiplication commute: $\alpha x \cdot \beta y = \alpha\beta\, xy$ for all $\alpha, \beta \in \mathcal{K}$. Further conditions which may sometimes be imposed are: there exists a unit element $e$ such that $ex = xe = x$ for each $x$; multiplication is commutative: $xy = yx$. In these cases $\mathfrak{A}$ is respectively called an algebra with a unit element if the first condition holds and an abelian (or commutative) algebra if the second condition holds.

DEFINITION 16. Let $\mathfrak{F}(p)$ be the algebra of differentiable functions of class $C^1$, defined in a neighborhood of $p$, to $\mathbb{R}$. Let $c(t)$ be a curve of class $C^1$, $a \le t \le b$, such that $c(t_0) = p$. The vector tangent to the curve $c(t)$ at $p$ is a mapping $X : \mathfrak{F}(p) \to \mathbb{R}$ such that

$$Xf = \frac{d(f \circ c)}{dt}\bigg|_{t=t_0}.$$

In other words, $Xf$ is the derivative of $f$ in the direction of the curve $c(t)$ at $t = t_0$. The vector $X$ satisfies the following conditions:
1. $X$ is a linear mapping of $\mathfrak{F}(p)$ into $\mathbb{R}$;
2. $X(fg) = (Xf)\, g(p) + f(p)\,(Xg)$ for $f, g \in \mathfrak{F}(p)$.

The set of mappings $X$ of $\mathfrak{F}(p)$ into $\mathbb{R}$ satisfying the preceding two conditions forms a real vector space. This vector space is denoted by $T_p(M)$ or $T_p$ and called the tangent space of $M$ at $p$. For more discussion on this point of view see Kobayashi and Nomizu (1963).

DEFINITION 17. If $f : M \to N$ is of class $C^1$, $Tf : TM \to TN$ denotes the tangent of $f$.

As we are mainly interested in open subsets of Banach manifolds, we give the following definition of the tangent.

DEFINITION 18. Suppose $U \subset E$ is open and $f : U \to F$ is of class $C^1$. Define the tangent of $f$ to be the map

$$Tf : U \times E \to F \times F \quad \text{given by} \quad Tf(u, e) = (f(u), Df(u) \cdot e),$$

where $Df(u) \cdot e$ denotes $Df(u)$ applied to $e \in E$ as a linear map. If $f$ is of class $C^r$, we can define $T^r f = T(T^{r-1} f)$ inductively.

As seen from the above definition, for a fixed point $u_0$, $T_{u_0} f$ is identified with $Df(u_0)$. Then $T_{u_0} f$ is a linear map from $E$ to $F$. Indeed, for a general manifold $T_m f$ can be defined as follows. Suppose $c(t)$ is a curve at $m$. Then

$$T_m f \cdot [c]_m = [f \circ c]_{f(m)}.$$

We also need the following definitions before stating the subimmersion theorem.

DEFINITION 19. The closed subspace $F$ of a Banach space $E$ is called split if there is a closed subspace $G \subset E$ such that $E = F \oplus G$.

The notions of submersion and immersion defined below introduce local surjectivity and injectivity respectively. Recall that for a linear operator $A : E \to F$, where $E$ and $F$ are linear spaces, $K_A = \ker(A) = \{x \in E : A(x) = 0\} \subseteq E$, $R_A = \operatorname{range}(A) = \{y \in F : \exists x \in E \text{ such that } A(x) = y\} \subseteq F$, and $\operatorname{rank}(A)$ is the dimension of $R_A$.

DEFINITION 20. Suppose $M$ and $N$ are manifolds and $f : M \to N$ is of class $C^r$, $r \ge 1$. If for each $m \in S \subset M$, $T_m f$ is surjective with split kernel, we say $f$ is a submersion on $S$. If $T_m f$ is injective with closed split range in $T_{f(m)}N$, then $f$ is called an immersion at $m$. We say $f$ is an immersion if $f$ is an immersion at each $m \in M$.

Suppose we have a map $f : M \to N$. The notion of subimmersion introduces local constant rank for the tangent map of $f$.

DEFINITION 21. A $C^r$ map $f : M \to N$, $r \ge 1$, is called a subimmersion if for each point $m \in M$ there is an open neighborhood $U$ of $m$, a manifold $P$, a submersion $s : U \to P$, and an immersion $j : P \to N$ such that $f|_U = j \circ s$.

We are now ready to state the Subimmersion Theorem.

Subimmersion Theorem: Suppose $f : M \to N$ is $C^r$ for $r \ge 1$, $n_0 \in N$, and $f$ is a subimmersion in an open neighborhood of $f^{-1}(n_0)$. (If $M$ or $N$ are finite dimensional, this is equivalent to $T_m f$ having constant rank in a neighborhood of each $m \in f^{-1}(n_0)$.) Then $f^{-1}(n_0)$ is a submanifold of $M$ with

$$T_m\!\left(f^{-1}(n_0)\right) = \ker(T_m f).$$

PROOF. See Abraham, Marsden and Ratiu (1988), page 205. □

The above theorem is stated for $C^\infty$ in Abraham, Marsden and Ratiu (1988). This is presumably a misprint, as the same proof applies for any $C^r$ for $r \ge 1$.

To complete our preliminaries we need to recall some results from functional analysis. First we need to recall two standard results from measure theory which we use repeatedly in the sequel.

THEOREM 2 (Generalized Dominated Convergence Theorem). Suppose $g_n, g \in L^1$, $g_n \to g$ a.e., $\int g_n \to \int g$, $f_n \to f$ a.e. and $|f_n| \le g_n$ for all $n$. Then $\int f_n \to \int f$.

PROOF. To prove this result it suffices to rework the proof of the D.C.T. □

This theorem is sometimes called the Generalized Dominated Convergence Theorem (G.D.C.T.).

THEOREM 3. Suppose $f_n, f \in L^1$ and $f_n \to f$ a.e. Then $\int |f_n - f| \to 0$ (i.e. $f_n \xrightarrow{L^1} f$) iff $\int |f_n| \to \int |f|$.

PROOF. Using the triangle inequality, $f_n \xrightarrow{L^1} f$ implies that $\int |f_n| \to \int |f|$. Now suppose $\int |f_n| \to \int |f|$. Then using the G.D.C.T., the fact that $|f_n - f| \to 0$ a.e. and $|f_n - f| \le |f_n| + |f|$, we have $f_n \xrightarrow{L^1} f$. □
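Theorem 3 (a Scheffé-type result) can be illustrated on a grid; the densities below are illustrative and not from the text. The first pair has converging $L^1$ norms and hence $L^1$ convergence; the second (the classical spike $g_n = n\,\mathbf{1}_{[0,1/n]}$) converges to $0$ a.e. while its $L^1$ norm stays at $1$, so there is no $L^1$ convergence to the pointwise limit.

```python
# Grid approximation of integrals on [0, 1] (midpoint rule).
N = 100_000
h = 1.0 / N
xs = [(i + 0.5) * h for i in range(N)]

def integral(g):
    return sum(map(g, xs)) * h

f = lambda x: 2 * x                       # limit density on [0, 1]
fn = lambda x, n: 2 * x * (1 + 1 / n)     # f_n -> f pointwise, norms converge

# ||f_n||_1 -> ||f||_1, hence  integral |f_n - f| -> 0  (Theorem 3).
l1_gap = integral(lambda x: abs(fn(x, 1000) - f(x)))
assert l1_gap < 1e-2

# Counterexample direction: g_n -> 0 a.e. but ||g_n||_1 = 1 for every n.
gn = lambda x, n: float(n) if x <= 1 / n else 0.0
spike_norm = integral(lambda x: gn(x, 100))
assert abs(spike_norm - 1.0) < 1e-2
```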

Our reference in the following is Hille and Phillips (1957). First we need some definitions.

DEFINITION 22. An algebra $\mathfrak{A}$ is a topological algebra (or topological algebraic space) if $\mathfrak{A}$ is a topological linear space and to every $x, y \in \mathfrak{A}$ and every neighborhood $N(xy)$ of $xy$ there are neighborhoods $N(x)$ of $x$ and $N(y)$ of $y$ such that $xN(y) \subset N(xy)$ and $N(x)y \subset N(xy)$.

The condition imposed by the definition guarantees continuity of $xy$ in $x$ and $y$ separately. In other words, the maps $x \mapsto ax$ and $x \mapsto xa$ from $\mathfrak{A}$ to $\mathfrak{A}$ are continuous for each $a \in \mathfrak{A}$.

DEFINITION 23. $\mathfrak{B}$ is called a Banach algebra if $\mathfrak{B}$ is an algebra as well as a Banach space and if, in addition, $\|xy\| \le \|x\|\,\|y\|$. It is a real or a complex Banach algebra according as $\mathcal{K}$ is the real or complex number field.

DEFINITION 24. Suppose $\mathfrak{A}$ is an algebra with the unit element $e$. An element $x$ is called regular if there is an element $x^{-1}$, called the inverse of $x$, such that $xx^{-1} = e$. A non-regular element is called singular.

DEFINITION 25. The set of all values $\lambda$ for which $\lambda e - A$ is singular in the Banach algebra $\mathfrak{B}$ is called the spectrum of $A$ and denoted by $\sigma(A)$.

It should be noted that if $X$ is a finite dimensional Banach space and $A : X \to X$ is a linear operator, then $\sigma(A)$ is the set of characteristic values of $A$.

DEFINITION 26. The spectrum of $x$, $\sigma(x)$, is called upper semi-continuous at $x = a$ if for any open set $O$ containing $\sigma(a)$, there exists an $\epsilon > 0$ such that $\|x - a\| < \epsilon$ implies that $\sigma(x) \subset O$.

We are now ready to state the last result which is needed in the sequel.

THEOREM 4. The spectrum of $x$, $\sigma(x)$, is an upper semi-continuous function of $x$.

PROOF. See Hille and Phillips (1957), page 167. □

The Subimmersion Theorem and Theorem 1 stated above play the key role in the proof of the main result of this chapter.

3. On Positive-Definiteness Of The Information Matrix

As already discussed, checking for positive definiteness of the information matrix seems virtually impossible for mixture distributions even when the mixture has only a few components. It is then natural to ask how crucial positive definiteness of the information matrix is, in the presence of the other conditions, for establishing asymptotic normality.

As is well known, the information matrix is the variance-covariance matrix of the partial derivatives of the log-likelihood, and therefore is positive semi-definite. On the other hand, a positive semi-definite matrix is positive definite if and only if the determinant of the matrix is not zero. Therefore, for positive-definiteness of the information matrix we need only focus on the zeros of the determinant of the information matrix. In the following, the set of zeros of the determinant of the information matrix is denoted by $A$ and $\operatorname{int}(A)$ means the interior of $A$.
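As a concrete sketch (an assumed example, not from the text): the Fisher information of $N(\mu, \sigma^2)$ in the parameterization $(\mu, \sigma)$ is $\operatorname{diag}(1/\sigma^2,\, 2/\sigma^2)$, whose determinant is nonzero, so it is positive definite; a semi-definite matrix with zero determinant has a zero eigenvalue.

```python
import math

def eigvals_sym2(a, b, c):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]]."""
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr - 4 * det, 0.0))
    return (tr - disc) / 2, (tr + disc) / 2

# Fisher information of N(mu, sigma^2) in (mu, sigma): diag(1/s^2, 2/s^2).
s = 1.5
a, b, c = 1 / s**2, 0.0, 2 / s**2
det = a * c - b * b
lo_eig, hi_eig = eigvals_sym2(a, b, c)

# Positive semi-definite with nonzero determinant => positive definite.
assert det > 0 and lo_eig > 0

# A singular (determinant-zero) symmetric matrix has a zero eigenvalue:
lo2, _ = eigvals_sym2(1.0, 1.0, 1.0)
assert abs(lo2) < 1e-12
```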

DEFINITION 27. A set $A$ is called nowhere dense (rare) if $\operatorname{int}(\bar{A}) = \emptyset$.

A necessary and sufficient condition for $A$ to be a nowhere dense (rare) set is that any open sphere $S$ includes an open sphere $S_1$ such that $S_1 \cap A = \emptyset$ (Kolmogorov and Fomin (1970), page 61). The following conditions are needed to establish the main result of this section.

Assumption 1: The parameter space $\Theta \subset \mathbb{R}^p$ is a bounded open set.

Assumption 2: The family $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a family of identifiable measures on $(\mathcal{X}, \mathfrak{A})$ with respect to $\theta$.

Assumption 3: The family $\mathcal{P}$ is dominated by a $\sigma$-finite measure $\nu$ on $(\mathcal{X}, \mathfrak{A})$ and $f(\cdot, \theta)$ is the Radon–Nikodym derivative of $P_\theta$ with respect to $\nu$.

LEMMA 1. Suppose $f(x, \cdot)$ is continuous on $\Theta$ a.e.$[\nu]$ and Assumption 3 is fulfilled. Then $P_\theta(A)$ is a continuous function on $\Theta$ for all $A \in \mathfrak{A}$.

PROOF. First we notice that $f(x,\theta) I_A(x) \le f(x,\theta)$, where $I_A(x)$ is the indicator function of $A$. Using a.e. continuity of $f(x;\cdot)$ we have $f(x,\theta) \to f(x,\theta_0)$ a.e. as $\theta \to \theta_0$. On the other hand, $f(x,\theta)$ is a density function, so $\int f(x,\theta)\, d\nu \to \int f(x,\theta_0)\, d\nu$. Using the G.D.C.T. (Theorem 2) implies that $P_\theta(A) \to P_{\theta_0}(A)$, which completes the proof. □

LEMMA 2. Suppose $f(x, \cdot)$ is continuous on $\Theta$ a.e.$[\nu]$ and Assumptions 1-3 are fulfilled. Then there exists a finite measure $\mu$ which dominates the family $\mathcal{P}$.

PROOF. Define

$$\mu(A) = \int_{\Theta} P_\theta(A)\, m(d\theta),$$

where $m$ is the Lebesgue measure on $\Theta$. It is clear that $\mu$ is a finite measure. Now suppose that $\mu(A) = 0$ for some $A \in \mathfrak{A}$. Then continuity of $P_\theta(A)$ implies that $P_\theta(A) = 0$ for all $\theta \in \Theta$. □

It is, indeed, easy to see that μ is equivalent to P in the sense that if P_θ(A) = 0 for all θ ∈ Θ for some A ∈ 𝔄, then μ(A) = 0.

Assumption 4: The function φ(x,θ) = log f(x,θ) is of class C¹ on Θ a.e.[μ].

Assumption 5: There exists a function k(x) such that

(ii) ∫_X k(x) P_θ(dx) < ∞.

Assumption 6: The following equations hold.

Assumption 7: The map I : Θ → 𝔐_{p×p}, defined by

where

and 𝔐_{p×p} is the linear space of p × p real matrices, is a continuous map.

Assumptions 1-4 are standard. Assumption 5 is a weaker version of Assumption 3 of LeCam (1953, page 307), stated as follows:

(ii) &)fi(dz)

We discuss Assumptions 6 and 7 later and show how they are related to the usual assumptions for asymptotic normality.

It is easy to see that Assumptions 1-4 and 5(ii) imply

The following lemma shows how Assumptions 1-5 imply Fréchet differentiability.

LEMMA 3. Suppose Assumptions 1-5 are fulfilled. Then (i) π : Θ → L¹(μ) defined by π(θ) = φ(x,θ) is of class C¹ and Dπ(θ) = ∇φ(x,θ). (ii) If, in addition, φ(x,θ₀) ∈ L¹(μ) for some θ₀ ∈ Θ, then φ(x,θ) ∈ L¹(μ), ∀θ ∈ Θ.

PROOF. To prove (i) we first notice that

On the other hand, the Mean Value Theorem implies that

Using the D.C.T. we obtain

Now using Theorem 3

Therefore, the Gateaux derivative of φ exists and is equal to ∇φ(x,θ). To show that φ is C¹-Gateaux, we notice that ∇φ(x,θ) is continuous on Θ a.e.[μ]. If θ → θ₀, then pointwise continuity of ∇φ(x,θ) implies that

On the other hand

so that

Now Theorem 3 implies that

Therefore φ is C¹-Gateaux, and Theorem 1 implies that φ is C¹.

To prove (ii) we apply Taylor's expansion around θ₀:

By the boundedness of 8, there exists C such that

This implies that φ(x,θ) ∈ L¹(μ), ∀θ ∈ Θ. □

Lemma 3 plays a key role in the proof of the main result, stated in the following theorem, as it allows us to obtain rank(Dπ) using the rank of the information matrix. The theorem is proved in two steps. The proof is based on a simple observation: loosely speaking, if rank(Dπ) = k < p on an open set, then p − k variables are redundant. But this violates the identifiability of P, the family of probability measures on (X, 𝔄).

THEOREM 5. Under Assumptions 1-7, A, the set of zeros of det(I(θ)), is a nowhere dense (rare) set.

PROOF. Step 1. Assume A is an open set.

If rank(I(θ)) = 0 on A, then identifiability is violated, for all the partial derivatives must be zero on the open set A. Therefore, suppose there exists θ ∈ A such that 0 < rank(I(θ)) < p. Let

k = max{l : rank(I(θ)) = l, for some θ ∈ A}.

Suppose rank(I(θ*)) = k. Using the fact that I(θ) is a variance-covariance matrix of partial derivatives, there exist k linearly independent partial derivatives at θ*. Let Ĩ(θ*) be the variance-covariance matrix of these k linearly independent partial derivatives, and let λ_i(θ*), for i = 1, 2, ..., k, be the eigenvalues of Ĩ(θ*). Let D ⊂ ℝ be an open set containing λ_i(θ*) for i = 1, 2, ..., k. Then, using the upper semi-continuity of the spectrum (Theorem 4), there exists ε > 0 such that

‖Ĩ(θ) − Ĩ(θ*)‖ < ε implies that the eigenvalues of Ĩ(θ) belong to D. Using the continuity of Ĩ, there exists an open set O_ε ⊂ A such that ‖Ĩ(θ) − Ĩ(θ*)‖ < ε on O_ε. Thus rank(Ĩ(θ)) = k for all θ ∈ O_ε. On the other hand, k is the maximum rank of I. Hence rank(I(θ)) = k, ∀θ ∈ O_ε. Using the fact that there are only k linearly independent partial derivatives on O_ε, we have rank(T_θπ) = k, ∀θ ∈ O_ε. Now let π₀ = log f(x;θ₀) for some θ₀ ∈ O_ε. By the "Subimmersion Theorem", T_{θ₀}π⁻¹(π₀) = ker(T_{θ₀}π). But identifiability implies that π is injective, and so T_{θ₀}π⁻¹(π₀) is of dimension 0. On the other hand, dim(ker(T_{θ₀}π)) = p − k and k < p. This is a contradiction. Hence det(I(θ)) cannot be zero on an open subset of Θ.

Step 2. Assume that A is a general set.

This step is almost straightforward. In fact, it follows from the continuity of I(θ) and the fact that "det" is a smooth map. In other words, det(I(θ_n)) = 0, ∀n ∈ ℕ, and θ_n → θ imply that det(I(θ)) = 0. Therefore det(I(θ)) = 0 on A implies that det(I(θ)) = 0 on Ā. Using Step 1, int(Ā) = ∅, which means A is a nowhere dense set. This completes the proof. □
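The perturbation step in the proof of Theorem 5 — the eigenvalues of Ĩ(θ) staying inside the open set D once ‖Ĩ(θ) − Ĩ(θ*)‖ < ε — can be illustrated numerically. The sketch below (a hypothetical rank-2 matrix of our own choosing, not from the thesis) uses Weyl's inequality, a quantitative form of the continuity of the spectrum: a symmetric perturbation of spectral norm below ε moves each eigenvalue by less than ε.

```python
import numpy as np

rng = np.random.default_rng(1)

# A rank-2 positive semi-definite 3x3 matrix (two independent "scores").
B = rng.normal(size=(3, 2))
I0 = B @ B.T
lam0 = np.linalg.eigvalsh(I0)       # one (numerically) zero eigenvalue, two positive

# Perturb I0 by a small symmetric matrix of spectral norm exactly eps.
eps = 1e-6
E = rng.normal(size=(3, 3)); E = (E + E.T) / 2
E *= eps / np.linalg.norm(E, 2)
lam1 = np.linalg.eigvalsh(I0 + E)

# Weyl's inequality: each eigenvalue moves by at most ||E||_2 = eps, so the
# positive eigenvalues stay in any open set D chosen around them.
assert np.max(np.abs(lam1 - lam0)) <= eps + 1e-12
```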

Assumption 6 is easy to verify for mixture distributions, as the following argument shows: it is satisfied for mixtures whose mixing components fulfill the assumption and whose mixing distribution is differentiable.

Suppose

Integrating throughout, we obtain

But Σ_{k=1}^m α_k(θ) = 1, and the differentiability assumption on the mixing distribution therefore implies that Σ_{k=1}^m ∂α_k(θ)/∂θ = 0.

Assumption 7 may seem hard to check. But Ibragimov and Hasminskii (1981, Lemma 7.1, page 65) have shown that under mild conditions, such as a.e. differentiability of the density function and existence of the information matrix, Assumption 7 is fulfilled. For a thorough discussion see Ibragimov and Hasminskii (1981, Chapter 1, §7). See also the result and discussion given at the end of this section.

Constancy of the "rank" is key in our proof. There are several equivalent definitions of "rank". One definition is that the rank of an m × n matrix I is the maximum order of any nonvanishing minor determinant. Using this definition it is easy to prove constancy of the rank. The following lemma shows this fact.
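The minor-based definition of rank can be made concrete. The sketch below (with an arbitrary illustrative matrix, not one from the thesis) computes the rank as the maximum order of a nonvanishing minor determinant and checks it against a standard rank routine.

```python
import numpy as np
from itertools import combinations

def rank_via_minors(M, tol=1e-10):
    """Rank as the maximum order of a nonvanishing minor determinant."""
    m, n = M.shape
    for r in range(min(m, n), 0, -1):
        for rows in combinations(range(m), r):
            for cols in combinations(range(n), r):
                if abs(np.linalg.det(M[np.ix_(rows, cols)])) > tol:
                    return r
    return 0

# A 3x3 variance-covariance-style matrix of rank 2.
B = np.array([[1., 0.], [2., 0.], [0., 1.]])
M = B @ B.T
assert rank_via_minors(M) == np.linalg.matrix_rank(M) == 2
```

If a minor of order k is nonzero at a point, continuity of the determinant keeps it nonzero nearby, which is exactly the mechanism of Lemma 4 below.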

LEMMA 4. Suppose Assumption 7 is fulfilled and k is the maximum rank of I on A. Then there exists an open subset O ⊂ A such that rank(I(θ)) = k, ∀θ ∈ O.

PROOF. Suppose that rank(I(θ₀)) = k for some θ₀ ∈ A. Using Assumption 7 and the continuity of "det", any nonvanishing minor determinant of order k remains nonvanishing on an open neighbourhood O_{θ₀} around θ₀. On the other hand, since k is the maximum order, rank(I(θ)) = k on O_{θ₀}. □

In the following we discuss the geometry of P and show why we used infinite-dimensional geometry rather than finite-dimensional geometry.

Although P is not a subset of a finite-dimensional space, using Brouwer's Theorem of Invariance of Domain it is easy to show that under Assumptions 1-4 the family P can be described by a p-dimensional manifold. First we need a definition.

DEFINITION 28. Suppose X and Y are two metric spaces. A continuous map f : X → Y is called proper if for any sequence {x_n} such that f(x_n) → y there exists a subsequence {x_{n_i}} such that x_{n_i} → x, where f(x) = y.

First we need the following lemmas.

LEMMA 5. Suppose the density function f(x,θ) is continuous a.e.[ν] and Assumptions 1-3 are fulfilled. Then the map

is a proper map.

PROOF. Suppose {θ_n} ⊂ Θ is such that π̄(θ_n) → ξ. Using the compactness of Θ̄, there exists a subsequence {θ_{n_i}} which converges in Θ̄ to a point, say η. Now by continuity of π̄ we have π̄(θ_{n_i}) → π̄(η). But the limit of any subsequence must be equal to ξ. This implies that π̄(η) = ξ. To complete the proof we should show that η ∈ Θ. To do this we notice that ξ ∈ π̄(Θ). Thus there exists θ₀ ∈ Θ such that ξ = π̄(θ₀). Therefore π̄(θ₀) = π̄(η), which implies f(x,θ₀) = f(x,η) a.e. Now the result follows from identifiability (Assumption 2). □

We need the following definition for the next lemma.

DEFINITION 29. A map f : X → Y is called closed (respectively open) if for any closed (respectively open) subset C (respectively O) ⊂ X, f(C) (respectively f(O)) is closed (respectively open) in Y.

LEMMA 6. Under the conditions of Lemma 5, π̄ is a closed map.

PROOF. Suppose that C is a closed subset of Θ and θ_n ∈ C are such that π̄(θ_n) = ξ_n converges to ξ ∈ π̄(Θ). To prove the assertion we must show that ξ ∈ π̄(C). As π̄ is a proper map, there exists a subsequence {θ_{n_i}} of {θ_n} such that θ_{n_i} → θ ∈ Θ. Closedness of C implies that θ ∈ C. Now using the continuity of π̄ we have ξ = π̄(θ) ∈ π̄(C), which implies closedness of π̄(C). □

We also need to recall the following theorem.

THEOREM 6. Let f : X → Y be bijective. The following properties of f are equivalent:

(1) f is a homeomorphism. (2) f is continuous and open. (3) f is continuous and closed. (4) f(Ā) equals the closure of f(A), for each A ⊂ X.

PROOF. See Dugundji (1966), Theorem 12.2, page 89. □

PROPOSITION 4. Under Assumptions 1-4, the family P can be described by a p-dimensional manifold.

PROOF. It follows from Lemmas 5 and 6 and Theorem 6 that π̄ is a homeomorphism. Thus Brouwer's Theorem of Invariance of Domain implies that π̄(Θ) is a p-dimensional manifold. □

A question which arises here is: why do we use infinite-dimensional geometry rather than finite-dimensional geometry? Indeed, there is a vast literature on the geometry of statistical inference (see for example Kass (1989)), which is mainly concentrated on finite-dimensional geometry. In fact, when we are dealing with parametric models, finite-dimensional Riemannian geometry is the appropriate model. But the key assumption in the theory developed so far is positive-definiteness of the information matrix. This is so because the information matrix is the Riemannian metric and therefore must be positive-definite. Therefore it seems impossible to use the available theory to prove the above Theorem 5, as is shown below.

To equip π(Θ) with more structure, the positive-definiteness of the information matrix is needed. Indeed, if π(Θ) is required to be a C¹ manifold, then π̄ needs to be a diffeomorphism. Thus π̄ and π̄⁻¹ are both of class C¹, and hence π̄⁻¹ ∘ π̄ = ι, where ι is the identity map of Θ to Θ. Therefore, by the chain rule

but Dι(θ) is just the identity matrix for every θ ∈ Θ. Hence Dπ̄ is nonsingular at each θ ∈ Θ. On the other hand, using Lemma 3, Dπ̄(θ) = ∇φ(x,θ). This implies that {∂φ(x,θ)/∂θ_i, for i = 1, 2, ..., p} is a set of linearly independent vectors. Therefore the information matrix must be positive-definite.

As we saw in the proof of Theorem 5, there are two main steps. The first step was to show constancy of the rank on a neighborhood, and the second step was to use the "Subimmersion Theorem". As the "Subimmersion Theorem" is a corollary of the "Rank Theorem" (see Abraham, Marsden and Ratiu (1988), page

127), one may use the "Rank Theorem" instead of the "Subimmersion Theorem" after establishing the first step. However, one would need to incorporate the following fact: any finite-dimensional subspace F of a normed vector space E is split (see Folland (1984), page 152). Assumption 7 is one of the assumptions which may be hard to check. We already referred to Ibragimov and Hasminskii (1981, Lemma 7.1, page 65). Here we show that under easily checkable conditions Assumption 7 is fulfilled. Since we are mainly interested in mixtures of members of an exponential family, we confine our attention to these types of mixtures. First we need to impose a stronger version of Assumption 5(ii).

Assumption 5(ii)′: Suppose f(x,θ) = Σ_{k=1}^m α_k(θ) g_k(x,θ), where g_k is a member of an exponential family for k = 1, 2, ..., m, and the following conditions hold:

(3.1) ∫ [k(x)]² g_k(x,θ) ν(dx) ≤ M < ∞, ∀θ ∈ Θ and k = 1, 2, ..., m.

A stronger form of (3.1) has already been considered by Wald (see LeCam (1953), page 309, first paragraph, "In the papers ..., for θ ∈ Θ"). Indeed, Wald imposed the same condition on the second derivative.

THEOREM 7. If Assumptions 1-4, 5(i), 5(ii)′ and 6 are fulfilled, then Assumption 7 is satisfied.

PROOF. Using the fact that the information matrix I(θ) is a p × p matrix and p is finite, continuity of I(θ) reduces to the continuity of its elements. Suppose

θ_n → θ as n → ∞. Then it follows from Assumption 4 that B_{ij}(x,θ_n) f(x,θ_n) → B_{ij}(x,θ) f(x,θ) a.e.[ν]. On the other hand,

Next, using Theorem 9(i) of Lehmann (1986, page 59), we have

Now the G.D.C.T. implies that

which completes the proof. □

Among the conditions of Theorem 7 there is only one, namely Assumption 5, which needs to be discussed for mixture distributions. This will be discussed later, when we consider smoothness and boundedness conditions for asymptotic normality. It will be shown that, using Lemma 1 of Chapter 2, the boundedness and smoothness conditions for mixture distributions can be reduced to the same conditions for the mixing distributions and their mixing components.

We now extend Theorem 5 to independent but non-identically distributed (INID) random variables. In our case the heterogeneity is imposed by the covariates. Suppose that the covariate Z is a random vector distributed according to f_Z(z). In the following, f_X(x;θ) is written in full as f(x;θ | Z = z) to emphasize that Z is a random vector. We also write f_{X,Z}(x,z;θ) for the joint distribution of (X,Z). The set of possible values of Z is denoted by 𝒵.

In order to extend Theorem 5 we need the following modifications of Assumptions 2-7.

Assumption 2′: The family P = {P_θ : θ ∈ Θ} of probability measures on (X × 𝒵, 𝔄 ⊗ 𝔅) is dominated by a σ-finite measure ν on (X × 𝒵, 𝔄 ⊗ 𝔅), and f(x,z;θ) is the Radon-Nikodym derivative of P_θ with respect to ν.

Assumption 3′: The collection of conditional probabilities P′ = {{P_θ

The following lemma shows that Assumption 3' implies identifiability of P.

LEMMA 7. Suppose that f_Z(z) does not depend on θ and Assumption 3′ holds. Then the family P is identifiable with respect to θ.

PROOF. Suppose that f_{X,Z}(x,z;θ) = f_{X,Z}(x,z;θ′) for all (x,z) ∈ X × 𝒵. Since f_Z(z) does not depend on θ, we obtain f(x;θ | Z = z) = f(x;θ′ | Z = z), which, using quasi-identifiability, implies θ = θ′. □

As f_Z(z) does not depend on θ, it is clear that

In view of the above equation, we can confine our attention to the function φ_z(x,θ) = log f(x;θ | Z = z).

Assumption 4′: The function φ_z(x,θ) = log f(x;θ | Z = z) is of class C¹ on Θ a.e.[μ] for all z ∈ 𝒵.

Assumption 5′: For any z ∈ 𝒵 there exists a function k_z(x) such that

(i)

(ii)

Assumption 6′: The following equations hold.

Assumption 7′: The map I : Θ → 𝔐_{p×p}, defined by

where

and 𝔐_{p×p} is the linear space of p × p real matrices, is a continuous map.

THEOREM 8. Suppose that f_Z(z) does not depend on θ. Then under Assumptions 1 and 2′-7′, A, the set of zeros of det(I(θ)), where I(θ) is defined as in Assumption 7′, is a nowhere dense (rare) set.

PROOF. Using Lemma 7, it is evident that under Assumptions 1 and 2′-7′, f_{X,Z}(x,z;θ) fulfills Assumptions 1-7. The result then follows from Theorem 5. □

When 𝒵 is finite, Assumptions 2′-7′ are essentially Assumptions 2-7. In this case one can give a direct proof by easily reworking the proof of Theorem 5. The key difference is in the definition of the map π. Suppose that Z can take on only finitely many values z₁, z₂, ..., z_q with respective probabilities p₁, p₂, ..., p_q. Let θ̃ = (θ, θ′₁, ..., θ′_q), where θ = (θ₁, θ₂, ..., θ_p) is the vector of regression parameters common to all the distributions f_{z_l}(x; θ̃_l) for l = 1, 2, ..., q, and θ′_l = (θ_{l1}, ..., θ_{lr}) is the unknown vector of parameters which only appear in f_{z_l}(x; θ̃_l). Let us also denote all the unknown parameters of f_{z_l}(x; θ̃_l) by θ̃_l = (θ, θ′_l) for l = 1, 2, ..., q. Then π : Θ → L¹ × L¹ × ⋯ × L¹ is defined as follows:

Quasi-identifiability implies that π is injective. The rest of the proof is the same as that of Theorem 5.

It should be noted that although det(I_l(θ̃_l)) can be equal to zero on Θ for all l = 1, 2, ..., q, this does not imply that det(I(θ̃)) = 0, as Minkowski's inequality for determinants (Marcus and Minc (1964), page 117, 4.4.1) implies

Clearly I(θ̃) = Σ_{l=1}^q p_l I_l(θ̃_l), where I_l(θ̃_l) is the information matrix of the l-th distribution, as the matrix form of I(θ̃) is

and
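The point of the remark above — that each per-stratum information matrix I_l may be singular while the mixture I(θ̃) = Σ p_l I_l is not, consistently with Minkowski's determinant inequality — can be checked numerically. The matrices below are arbitrary random positive semi-definite matrices of our own making, not the thesis's model.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 3

def random_psd(rank):
    """A random p x p positive semi-definite matrix of the given rank."""
    B = rng.normal(size=(p, rank))
    return B @ B.T

# Two singular per-stratum information matrices (det = 0) ...
I1, I2 = random_psd(2), random_psd(2)
assert abs(np.linalg.det(I1)) < 1e-9 and abs(np.linalg.det(I2)) < 1e-9

# ... whose weighted sum I = p1*I1 + p2*I2 is nevertheless nonsingular,
# because the two column spaces together span R^3.
p1, p2 = 0.4, 0.6
I = p1 * I1 + p2 * I2

# Minkowski's determinant inequality for PSD matrices:
#   det(A + B)^(1/p) >= det(A)^(1/p) + det(B)^(1/p)
A, B = random_psd(p), random_psd(p)
lhs = np.linalg.det(A + B) ** (1 / p)
rhs = np.linalg.det(A) ** (1 / p) + np.linalg.det(B) ** (1 / p)
assert lhs >= rhs - 1e-9
```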

4. On The Measure Of A and The Exponential Family

The next step in the investigation of the properties of A is the measure of the set. Our main concern in this section is to show how one can find a reparametrization such that the set of zeros of the determinant of the information matrix is nowhere dense and of measure zero. We show that one can find such a transformation when an orthogonal reparametrization exists. However, for mixture distributions this approach does not seem fruitful. Thus the measure of the set of zeros of the determinant of the information matrix for mixture distributions remains an open question.

As is well known, there are nowhere dense sets of positive measure (see for example Gelbaum and Olmsted (1964), page 88). Nevertheless, the results of Oxtoby and Ulam (1938) imply that any set of the first category (meager) — that is, a countable union of nowhere dense sets — in ℝᵖ is equivalent to a set of p-dimensional measure zero under an automorphism (that is, a homeomorphism of the space onto itself) of ℝᵖ (Theorem 2 of Oxtoby and Ulam (1938)). However, these automorphisms are not easy to find for practical purposes.

Under the assumption of compactness, the set of automorphisms which carry a set A of the first category into a set of p-dimensional measure zero forms a residual set (that is, the complement of a set of the first category) in [H], where [H] denotes the space of all automorphisms of the unit cube Iᵖ in p-dimensional Euclidean space, with the uniform norm. This, roughly speaking, means that those automorphisms which do not carry A into a set of measure zero are negligible.

With the above-mentioned result of Oxtoby and Ulam, one hopes to find a reparametrization such that the set of zeros of the determinant of the information matrix for this new reparametrization is a nowhere dense set of measure zero. The transformation should be differentiable. We may then go further and ask for a transformation (in our terminology, a "reparametrization") such that we obtain a nowhere dense set of measure zero as the set of zeros of the determinant of the information matrix. Now, for nowhere dense sets in ℝ it is possible to prove the existence of such a transformation under certain circumstances, which are discussed below. As will be seen, however, in ℝᵖ the existence of such a transformation is tied to the existence of an orthogonal reparametrization. Establishing such an orthogonal reparametrization for mixture problems, and hence for our changepoint problem, is no doubt very difficult. We are able, nevertheless, to show that for members of an exponential family the set of zeros of det(I(θ)) forms an isolated set, which is a stronger property. We start our discussion with subsets of ℝ.

LEMMA 8. Suppose Θ ⊂ ℝ and Assumptions 1-7 are fulfilled. Then there exists a reparametrization η = ψ(θ) such that the set of zeros of det(I(η)) is a nowhere dense set of measure zero.

PROOF. Since we are mostly concerned with compact parameter spaces, without loss of generality we assume that Θ̄ = [0,1]. Therefore Ā is a closed, nowhere dense subset of Θ̄. Let R = Θ̄ − Ā; then ψ(x) = m(R ∩ [0,x])/m(R), where m is the Lebesgue measure, is a strictly increasing automorphism of Θ̄, and ψ(Ā) is a set of Lebesgue measure zero (see Oxtoby (1971, page 49)). Since ψ is increasing, it is differentiable a.e. (see Folland (1984), page 93, Theorem (3.2.3)). Indeed

and so,

On the other hand, it is easy to see

lim_{Δt→0} m(A ∩ [x, x + Δt])/Δt = 1 for almost all x ∈ A, and = 0 for almost all x ∉ A, which implies that

Now consider the reparametrization η = ψ(θ). Since ψ is strictly increasing, so is the inverse map ψ⁻¹(η) = θ. Hence ψ⁻¹ is a.e. differentiable and, using the chain rule, (ψ⁻¹)′ = m(R) for a.e. η ∈ ψ(Θ). But ψ(Ā) is a set of measure zero, so (ψ⁻¹)′ = m(R) for almost all η ∈ [0,1]. Therefore I(η) = [m(R)]² I(θ). Since [m(R)]² > 0 (Ā is nowhere dense and R = Θ̄ − Ā), we obtain I(η) = 0 if and only if I(θ) = 0, except perhaps on a set of measure zero. On the other hand, I(θ) = 0 on A and ψ(A) is a set of measure zero, and hence I(η) > 0 for almost all η ∈ [0,1]. It is clear that identifiability of the model is preserved under the transformation, since ψ is an automorphism. Consequently, the set of zeros of the information map is still a nowhere dense set. □
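The construction in the proof can be visualized on a grid. The sketch below is our own illustration, not from the thesis: it takes A to be (a grid approximation of) a "fat Cantor set", a closed nowhere dense subset of [0,1] of positive measure, builds ψ(t) = m(R ∩ [0,t])/m(R), and observes that ψ picks up no length while crossing A, so ψ(A) has essentially measure zero.

```python
import numpy as np

# Grid approximation of a fat Cantor set A in [0,1]: at stage k remove an
# open middle interval of length 4^-(k+1) from each remaining piece, so the
# total removed length is 1/2 and m(A) ~ 1/2, yet A contains no interval.
N = 100_000
x = (np.arange(N) + 0.5) / N
in_A = np.ones(N, dtype=bool)
pieces = [(0.0, 1.0)]
for k in range(8):
    nxt = []
    r = 0.25 ** (k + 1) / 2
    for a, b in pieces:
        c = (a + b) / 2
        in_A &= ~((x > c - r) & (x < c + r))
        nxt += [(a, c - r), (c + r, b)]
    pieces = nxt

mA = in_A.mean()          # roughly 0.5: nowhere dense but positive measure
mR = 1.0 - mA

# psi(t) = m(R ∩ [0,t]) / m(R): in the continuum a strictly increasing
# automorphism of [0,1] with psi' = 0 a.e. on A.
psi = np.cumsum(~in_A) / N / mR

# Length picked up by psi while crossing A: psi is flat there.
len_psi_A = np.diff(psi)[in_A[1:]].sum()
```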

The above argument can be extended to a multi-dimensional parameter space if an orthogonal reparametrization is possible. We give some remarks concerning orthogonal parametrization after the following theorem.

DEFINITION 30. We say an orthogonal reparametrization is possible if there exists an automorphism Φ on Θ such that Ĩ(φ) = (DΦ⁻¹)ᵀ I(θ) DΦ⁻¹, with φ = Φ(θ), is a diagonal matrix.

THEOREM 9. Suppose an orthogonal reparametrization is possible and Assumptions 1-7 are fulfilled. Then there exists an automorphism map Ψ on Θ ⊂ ℝᵖ such that for the reparametrization η = Ψ ∘ Φ, the set of zeros of det(I(η)) is a nowhere dense set of measure zero.

PROOF. As Φ is an automorphism, it is clear that identifiability under the orthogonal reparametrization is preserved, and also that det(Ĩ(φ)) = 0 on A if and only if there is at least one zero diagonal entry. Hence E[(∂ log f(x,φ)/∂φ_i)²] = 0 for some i ∈ {1, 2, ..., p}, which means ∂ log f(x,φ)/∂φ_i = 0 a.e. for some i ∈ {1, 2, ..., p}. If A were an open subset of ℝᵖ, then by using continuity of "det" there would exist an open subset O of A such that ∂ log f(x,φ)/∂φ_i = 0 a.e. for some i ∈ {1, 2, ..., p} and ∀φ ∈ O. This is a contradiction, because identifiability is violated.

Now consider the projection maps π_j : ℝᵖ → ℝ, π_j(x₁, x₂, ..., x_p) = x_j. A similar argument shows that π_j(A), for j = 1, 2, ..., p, cannot be open in ℝ either. Finally, using the continuity of the determinant, we conclude that π_j(A) is a nowhere dense set for j = 1, 2, ..., p.

Next, we can define Ψ = (ψ₁, ψ₂, ..., ψ_p) as the desired automorphism, where each ψ_j is defined as in Lemma 8. Then the set of zeros of det(I(η)) = ∏_{j=1}^p [m(R_j)]² det(I(φ)), where η = (η₁, ..., η_p), η_j = ψ_j(φ_j), φ = (φ₁, ..., φ_p), R_j = Θ̄_j − Ā_j, and A_j = π_j(A) for j = 1, 2, ..., p, is a nowhere dense set of measure zero. □

As mentioned by Huzurbazar (1950), finding orthogonal parameters is equivalent to finding p functions satisfying ½p(p−1) differential equations, which may not be possible, in general, for p ≥ 3. For p = 2 (as expected) the problem is usually solvable (Huzurbazar (1950)). However, for p ≥ 3 a condition which is expected to be involved is the existence of joint sufficient statistics of dimension p.

Distributions which admit such sufficient statistics are basically members of an exponential family (Koopman (1936)). For an exponential family, however, the information matrix has already been shown to be positive-definite under certain regularity conditions (see Barndorff-Nielsen (1978)). For a thorough discussion of orthogonal parametrization the reader can consult Cox and Reid (1987). Leaving the measure of A for mixture problems as an open question, we conclude this section with a result for members of an exponential family which suggests that for mixture problems A might be an isolated set. This property, which is stronger than being nowhere dense, would certainly be very convincing evidence of the rarity of the zeros of det(I(θ)).

We say A is an isolated subset of the parameter space if for any θ ∈ A there exists a neighborhood O_θ ⊂ Θ such that O_θ ∩ A = {θ}.

THEOREM 10. Suppose f(x,θ) = C(θ) exp{Σ_{j=1}^p θ_j T_j(x)} and

(4.1) ∫ T_j²(x) exp{Σ_{l=1}^p θ_l T_l(x)} ν(dx) < ∞ for j = 1, 2, ..., p.

Let the natural parameter space Θ be open. Then det(I(θ)) > 0 on Θ, except for a countable subset A ⊂ Θ.

PROOF. First we notice that the (i,j) entry of the information matrix is I_{ij}(θ) = E_θ[(∂ log f(X,θ)/∂θ_i)(∂ log f(X,θ)/∂θ_j)].

Now, using Lehmann (1986, Theorem 9(i), page 59), C(θ) is an analytic function.

On the other hand, using the Cauchy-Schwarz inequality and (4.1), we can apply Lehmann (1986, Theorem 9(i)) again, which implies that all the integral terms in the above equation are analytic functions of θ.

Since the set of analytic functions on a given domain forms an algebra, we conclude that E_θ[(∂ log f(X,θ)/∂θ_i)(∂ log f(X,θ)/∂θ_j)] is an analytic function on Θ. Since "det" is an analytic function and the composition of two analytic functions is analytic, it follows that I(θ) is an analytic map and therefore so is det(I(θ)).

Applying Lemma 7 of Lehmann (1986, page 57), we have convexity of Θ. On the other hand, any convex set is path connected and therefore, using Theorem 5.3 of Dugundji (1966, page 115), connected. It then follows from the "principle of analytic continuation" (Cartan (1995), pages 10-41 for one complex variable and page 122 for several complex variables; for the same result on manifolds see pages 190-191, and see also page 203) that A, the set of zeros of det(I(θ)), is a discrete set (i.e. the zeros are isolated). The σ-compactness of ℝᵖ then implies that A is countable. □

It is worth noting that if Θ ⊂ ℝᵖ is a bounded open set, then A is a finite set, for Θ̄ is compact when Θ is bounded.
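For a concrete instance of the positivity in Theorem 10: the information matrix of a regular exponential family is the Hessian of the log-partition function, so det(I(η)) can be evaluated numerically. The sketch below uses the Gaussian in its natural parametrization — our own choice of example, not one from the thesis — and finds no zeros on a grid, consistent with A being at most countable (here, in fact, empty).

```python
import numpy as np

# Log-partition function A(eta) of the Gaussian in natural parameters
# eta1 = mu/sigma^2, eta2 = -1/(2 sigma^2); the information matrix I(eta)
# is the Hessian of A (exact determinant: -1/(4 eta2^3) > 0 for eta2 < 0).
def A(e1, e2):
    return -e1**2 / (4 * e2) - 0.5 * np.log(-2 * e2)

h = 1e-4
def det_info(e1, e2):
    # central finite differences for the Hessian of A
    d11 = (A(e1 + h, e2) - 2 * A(e1, e2) + A(e1 - h, e2)) / h**2
    d22 = (A(e1, e2 + h) - 2 * A(e1, e2) + A(e1, e2 - h)) / h**2
    d12 = (A(e1 + h, e2 + h) - A(e1 + h, e2 - h)
           - A(e1 - h, e2 + h) + A(e1 - h, e2 - h)) / (4 * h**2)
    return d11 * d22 - d12**2

dets = [det_info(e1, e2)
        for e1 in np.linspace(-2, 2, 21)
        for e2 in np.linspace(-3, -0.5, 21)]
assert min(dets) > 0      # det(I(eta)) > 0 everywhere on the grid
```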

5. On The Smoothness And Boundedness Conditions For Asymptotic Normality

Our main concern in this section is to show how checking the smoothness and boundedness conditions for asymptotic normality for a mixture distribution can be reduced to checking the same conditions for the mixing distributions and the mixing components. As we are interested in non-identically distributed random variables, we start with the conditions for asymptotic normality for non-identically distributed random variables. Our main reference in the following discussion is Hoadley (1971). After introducing some notation we list the conditions given by him and show how they can be checked for mixture distributions.

Suppose that 𝒵 is a finite set. Then we have finitely many different distributions. The case of infinite 𝒵 will be discussed later. In what follows, θ̃ = (θ, τ), and φ_i(x;θ), φ̈_i(x;θ) are, respectively, a p × 1 vector and a p × p matrix whose components φ_{i,s}(x;θ) and φ̈_{i,st}(x;θ) are defined as follows:

We can now list Hoadley's conditions (Hoadley (1971, page 1982)).

N.1: Θ is an open subset of ℝᵖ. This condition is automatically fulfilled, as we assume Θ is a bounded open subset of ℝᵖ.

N.2: θ̂_n → θ₀ in probability. In Chapter 2 we established a stronger version of this condition. Indeed, we showed that the maximum likelihood estimators are strongly consistent.

N.3: φ_{i,s}(x;θ) and φ̈_{i,st}(x;θ) exist, a.s. Suppose that α_{i,k}(θ) and g_{i,k}(x;θ) are of class C². Then it is clear that N.3 is fulfilled.

N.4: φ̈_{i,st}(x;θ) is a continuous function of θ, uniformly in i, a.s., and is a measurable function of x_i.

If α_{i,k}(θ) and g_{i,k}(x;θ) are of class C², then N.4 is valid, as we have only finitely many different distributions.

N.5: E_{θ₀}[φ_{i,s}(X_i;θ₀)] = 0 for all s = 1, 2, ..., p and i = 1, 2, 3, .... We already showed that if the mixing distribution α_{i,k}(θ) is differentiable and the mixing components g_{i,k}(x;θ) fulfill this condition, then the condition is satisfied for f_i(x;θ). It has been mentioned by Hoadley (1971) that N.5 is fulfilled if N.5′, given below, holds.

This is indeed the core of N.5′ and shows why it is not a troublesome condition for mixture distributions.

N.6: Γ_i(θ) = E_θ[φ_i(X_i;θ) φ_i(X_i;θ)ᵀ] = −E_θ[φ̈_i(X_i;θ)] for all i = 1, 2, .... Again, as pointed out by Hoadley (1971), condition N.6 is satisfied if N.6′, given below, holds.

Suppose that α_{i,k}(θ) and g_{i,k}(x;θ) are of class C², and g_{i,k}(x;θ) for all k = 1, 2, ..., m fulfill N.6′; then

N.7: n⁻¹ Σ_{i=1}^n Γ_i(θ) → Γ̄(θ), and Γ̄(θ) is positive definite. This condition has been extensively treated in the last sections. Theorem 8 of §3 addresses this condition.

As the last two conditions are both boundedness conditions, we treat them together.

N.8: For some δ > 0, n^{−(2+δ)/2} Σ_{i=1}^n E_{θ₀}|λᵀ φ_i(X_i;θ₀)|^{2+δ} → 0 for all λ ∈ ℝᵖ.

N.9: There exist ε > 0 and random variables B_{i,st}(X_i) such that

As pointed out by Hoadley (1971), a stronger but easier to check replacement for N.8 is E_{θ₀}|φ_{i,s}(X_i;θ₀)|³ ≤ C, where C is a constant.

The key is to show how checking these two conditions can be reduced to checking the same conditions for the mixing distribution α_{i,k} and the mixing components g_{i,k}, by applying Lemma 1 of Chapter 2. Taking the derivative of log f_i(x;θ) we obtain

Using Lemma 1 of Chapter 2 we have

For the second derivative we obtain

Again, using Lemma 1 of Chapter 2, one can find a bound for the second derivative of log f_i(x;θ), as the bound consists of sums and products of

the min and max over 1 ≤ k ≤ m of (∂α_{i,k}(θ)/∂θ) / α_{i,k}(θ) and (∂g_{i,k}(x;θ)/∂θ) / g_{i,k}(x;θ),

and

the min and max over 1 ≤ k ≤ m of (∂²α_{i,k}(θ)/∂θ²) / α_{i,k}(θ) and (∂²g_{i,k}(x;θ)/∂θ²) / g_{i,k}(x;θ).
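The convex-combination structure behind these bounds can be verified numerically for the first derivative. In the sketch below — an illustrative two-component normal mixture with logistic weights, our own toy model rather than the thesis's — the mixture score is a convex combination, with weights α_k g_k / f, of the terms α_k′/α_k + g_k′/g_k, and hence lies between their pointwise min and max.

```python
import numpy as np

rng = np.random.default_rng(3)

def phi(x, m):                 # N(m, 1) density
    return np.exp(-(x - m)**2 / 2) / np.sqrt(2 * np.pi)

# f(x; theta) = a1(theta) g1 + a2(theta) g2, weights logistic in theta.
theta = 0.7
a1 = 1 / (1 + np.exp(-theta)); a2 = 1 - a1
da1 = a1 * a2; da2 = -da1      # derivatives of the weights

x = rng.normal(size=1000)
g1, g2 = phi(x, theta), phi(x, -theta)
dg1 = (x - theta) * g1         # d/dtheta of each component density
dg2 = -(x + theta) * g2

f = a1 * g1 + a2 * g2
score = (da1 * g1 + a1 * dg1 + da2 * g2 + a2 * dg2) / f

# score = sum_k (a_k g_k / f) * (da_k/a_k + dg_k/g_k): a convex combination,
# so it is sandwiched between the pointwise min and max of the two terms.
terms = np.stack([da1 / a1 + dg1 / g1, da2 / a2 + dg2 / g2])
assert np.all(score >= terms.min(axis=0) - 1e-12)
assert np.all(score <= terms.max(axis=0) + 1e-12)
```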

REMARK 1. As seen above, we only used the finiteness of 𝒵 for condition N.4. In the case of our interest, where covariates induce heterogeneity of distributions, it is not hard to check this condition. Indeed, it suffices that f_{X,Z}(x,z;θ) be jointly continuous and 𝒵 be compact.


CHAPTER V

NUMERICAL ASPECTS AND SIMULATION

1. Introduction

In Chapter 2, although we showed consistency of the MLE's, the rate of convergence was not considered. We also discussed asymptotic normality in Chapter 3. We proved that the set of zeros of the determinant of the information matrix is a nowhere dense set. Nevertheless, the information matrix may not be positive definite at some points, and hence the main condition for asymptotic normality may be violated. In this chapter we carry out limited simulations in order to examine the finite sample properties of the estimators, and to show how the maximization of the likelihood may actually be carried out.

The likelihood surface for mixture problems can be very bumpy, with many local optima (see for example Brooks and Morgan (1995)), so that finding the MLE's is not straightforward. Consequently, in order to increase our chances of finding the global maximum, we used Simulated Annealing (SA). In fact, we used a variant of Simulated Annealing, so-called Simulated Quenching, to obtain the MLE's for the unknown parameters of our model. We introduce a slightly different algorithm which is faster than the usual one. The program for the optimization has been

written in C++ (a C version is also available) and is flexible enough to handle other optimization problems as well.

2. Simulated Annealing Algorithm

Annealing means melting a physical substance and then cooling it slowly to reach a state of minimum energy, which corresponds to the most stable state of the substance. Simulated annealing, as the name implies, is simulation of the annealing process in order to minimize an objective function.

Simulated annealing as an optimization technique was introduced by Kirkpatrick, Gelatt and Vecchi (1983). The method depends heavily on a Markov Chain Monte Carlo method whose origins date back to the seminal paper by Metropolis et al. (1953). The paper by Kirkpatrick et al. was followed by the important work of Geman and Geman (1984), who introduced the Gibbs sampler algorithm. Simulated annealing has become a widely used tool for optimization since 1983, with a vast literature. For a thorough account see Aarts (1988), and for more recent reviews see Gidas (1995) and Bertsimas and Tsitsiklis (1993). Markov Chain Monte Carlo methods have developed both in conjunction with and separately from simulated annealing, playing a central role in modern Bayesian inference (see, for instance, Smith and Roberts (1993)).

111 thfollon-ing we follow Brooks and Morgan(1995). The algorithm given below is frorri Brooks and hlorgan(1995. 5 3.1). Algorithm Step 1: Beginning at an initial temperature To,we pick an intiai set of param- eter values with objective function value 8. Step 2: Riindomly select another point in the parameter space, within a neigh- 2. SBZULATED ANNEALING ALGORITHM 133 bourhood of the original, and calculate the corresponding objective function value. Step 3: Compare the tn-O points in terms of their function value, using the Metropoiis criterion as foilows. Let 4 = En,, - &&, and move the system to the new point if and only if a random variable U,distributed uniformly over (0, l), satisfies

where T is the current temperature, or equivalently if and only if

−T log U > Δ.

Note that we always move to the new point if its corresponding function value is lower than that of the old point, and that at any temperature there is a chance for the system to move "upwards". Accepting a point we equate with success.

Step 4: Whether the system has moved or not, repeat steps 2-3. At each stage compare the function value of new points with the function value of the present point, until the sequence of accepted points is judged, by some criterion, to have reached a state of equilibrium.

Step 5: Once an equilibrium state has been achieved for a given temperature, the temperature is lowered to a new temperature as defined by the annealing schedule. The process then begins again from step 2, taking as initial state the point following the last iteration of the algorithm, until some stopping criterion is met, and the system is considered to have frozen.
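The five steps above can be sketched in a few lines; this is a generic minimal implementation, not the thesis's C++ program, and the quadratic target, Gaussian neighbourhood, and geometric cooling used here are illustrative assumptions only.

```python
import math
import random

def simulated_annealing(f, x0, t0=10.0, n_temps=20, iters_per_temp=50,
                        step=0.5, seed=0):
    """Minimize f following steps 1-5 above (a sketch; the neighbourhood
    system, schedule, and stopping rule are illustrative choices)."""
    rng = random.Random(seed)
    x, fx = list(x0), f(x0)                   # Step 1: initial point and value
    t = t0
    for _ in range(n_temps):
        for _ in range(iters_per_temp):
            # Step 2: candidate point in a neighbourhood of the current one
            y = [xi + rng.gauss(0.0, step) for xi in x]
            fy = f(y)
            # Step 3: Metropolis criterion with Delta = E_new - E_old
            delta = fy - fx
            if delta < 0 or rng.random() < math.exp(-delta / t):
                x, fx = y, fy                 # accept (Step 4 repeats 2-3)
        t *= 0.5                              # Step 5: lower the temperature
    return x, fx

# Illustrative run on a quadratic bowl with minimum at (1, -2)
xhat, fhat = simulated_annealing(lambda v: (v[0] - 1)**2 + (v[1] + 2)**2,
                                 [8.0, 8.0])
```

Note that downhill moves (Δ < 0) are always accepted, matching the remark that the system always moves to a new point with a lower function value.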

There are two differences between the algorithm we used for our simulation and the one given above. The first difference concerns the cooling schedule. Indeed, simulated annealing can be painfully slow, for simulated annealing is a conservative approach. In other words, it guards against the worst possible case. Having a priori information about the objective function helps one to choose a faster cooling schedule. We have chosen an exponentially fast cooling schedule. Such an algorithm is usually called Simulated Quenching. Quenching is the converse of annealing, as in the quenching process the temperature of the heat bath is instantaneously lowered. Despite successful applications of simulated quenching in many cases (Ingber (1993)), there is no concrete proof of the almost sure convergence of this algorithm to the global minimum. We do not discuss this matter further but refer the reader to Ingber (1993).

Although simulated annealing usually considers two points, namely the new and the old, we use three points in our algorithm, namely the current, the new, and the old. Indeed, by retaining the point from which we have just departed, called the old point, we retain for consideration three points at the next step. At each step we compare the new with the old and the current. We accept the new if it is better than both the old and the current. The rest of our algorithm is the same as the one given above.
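One way to read this three-point rule is the following sketch: a new point that beats both the old and the current point is always taken, and otherwise the usual Metropolis test against the current point applies. The fallback is our assumption, since the text only states that the rest of the algorithm is unchanged.

```python
import math

def accept_new(f_new, f_current, f_old, t, u):
    """Three-point acceptance rule (one reading of the text above).
    u is a Uniform(0, 1) draw and t is the current temperature."""
    if f_new < f_current and f_new < f_old:
        return True                                  # better than both: accept
    return u < math.exp(-(f_new - f_current) / t)    # otherwise Metropolis
```

At high temperature this behaves like the two-point algorithm; at low temperature the extra comparison simply makes the always-accept branch more selective.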

3. Simulation Results

Recall that the model we are concerned with, model (5.3) of §5 of Chapter 1, is as follows:

We consider one covariate Z which can take on two values {0, 1}, say male and female. We also suppose that all the measurements before and after the changepoint are independent. The observations are assumed to be normally distributed with σ₁² = σ₂² = 1 and μ₁ = 0, μ₂ = 4 before and after the change respectively. The regression parameters β = (β₀, β₁) are set to be (0, 1). We generated 100 changepoints for each value of the covariate. If the changepoint τ = c, we generate c observations from N(0, 1) and 20 − c from N(4, 1). With these simulated data we found the MLE's for β using simulated quenching. The initial values are randomly chosen between −10 and 10.
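The data-generation step for one subject can be sketched as follows. This is a sketch only: how the changepoint c itself is drawn from model (5.3), and how the MLE search is initialized, are not shown; the sequence length 20 and the means 0 and 4 come from the text.

```python
import random

def simulate_subject(c, n=20, mu1=0.0, mu2=4.0, sigma=1.0, rng=random):
    """One sequence: c observations from N(mu1, sigma^2) before the change
    and n - c from N(mu2, sigma^2) after, as in the simulation above."""
    before = [rng.gauss(mu1, sigma) for _ in range(c)]
    after = [rng.gauss(mu2, sigma) for _ in range(n - c)]
    return before + after

seq = simulate_subject(7, rng=random.Random(1))
```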

The initial temperature, the number of iterations per temperature and the number of temperatures were chosen based on a preliminary analysis. The behaviour of the objective function was examined in several examples, using different combinations of algorithm parameters. When the objective function appeared to be essentially frozen at a particular value for different "temperatures", the algorithm was deemed to have converged. The final choice was T₀ = 2,000,000.00 as the initial temperature, and T(t) = (1/2)^t T₀ for t = 1, 2, ..., 20 as the cooling schedule. We chose the nearest neighbours as the neighbourhood system.

As the temperature is lowered, more iterations are required for the Markov chain at a particular temperature to reach equilibrium. Our algorithm adjusted for this phenomenon by allowing for an increasing number of iterations with decreasing temperature. In fact the number of iterations at the t-th step was set to N_t = 50 + 10t. In other words, we start by checking 50 points at the initial temperature and end up with checking 250. In most of the cases at the lowest temperature there were only a few acceptable moves among the new points.
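The quenching schedule and per-level iteration counts can be written down directly. The constants T₀ = 2,000,000 and N_t = 50 + 10t are read from the partly illegible text, so treat them as a best reading rather than exact values.

```python
def quenching_schedule(t0=2_000_000.0, n_levels=20):
    """Temperatures T(t) = (1/2)^t * T0 and iteration counts N_t = 50 + 10 t
    for t = 0, 1, ..., n_levels, as described above."""
    temps = [t0 * 0.5 ** t for t in range(n_levels + 1)]
    iters = [50 + 10 * t for t in range(n_levels + 1)]
    return temps, iters

temps, iters = quenching_schedule()
```

With these values the final temperature is T₀/2²⁰, roughly 1.9, and the total number of function evaluations across all levels is a few thousand.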

Each iteration of our simulation took about 5 minutes and 40 seconds to simulate the data and find the MLEs. In 100 iterations of the algorithm there were 7 cases for which the result was outside of a circle with radius 4 whose centre is the true value. For a circle with radius 2 and the true value as the centre there were 10 cases which were outside of the circle.

REMARK 1. Suppose that we have a fixed number of subjects. The closer the means of the before and after changepoint distributions are, the more observations are needed on each subject to reach a specific level of accuracy for the MLE's. The same is true when the variance is larger. It seems that when the variances before and after the change are equal, the number of observations needed on each subject to reach a specific level of accuracy for the MLE's is an increasing function of ξ, the difference between the coefficients of variation for the distributions before and after the changepoint.

The attached figures show the results of our simulation. Figure 1 shows the joint distribution of the MLE's. Although the front tail looks heavier, the joint distribution of the MLE's looks like a bivariate normal distribution, even for this moderate sample size. Figures 2 and 3 show the histograms for the marginals. Figures 4 and 5 show that the normal distribution is reasonable.

We also computed the likelihood surface for one iteration of our simulation when the mean values before and after the change were, respectively, μ₁ = 0 and μ₂ = 1. As mentioned in Remark 1, the closer the values of μ₁ and μ₂ are, the more observations are needed. Nevertheless, the likelihood surface looks almost concave and hence shows that the quadratic approximation is appropriate. The last picture, which shows rescaled values of the likelihood, indicates that the maximum likelihood is close to the true value.

The very limited simulation carried out in this chapter was just for illustrative purposes and to show how to implement the simulation and find the maximum likelihood estimators in our setting. Of course, a thorough examination of the finite sample behaviour of the estimators is needed before any real conclusion can be reached.

Figure 1: Joint distribution of the MLE's of β₀ and β₁
Figure 2: Marginal distribution of the MLE of β₀
Figure 3: Marginal distribution of the MLE of β₁
Figure 4: Normal probability plot of β₀
Figure 5: Normal probability plot of β₁

CHAPTER VI FUTURE DIRECTIONS

This thesis has introduced a topic in changepoint problems, the modeling of covariates in the changepoint distribution, never before addressed in the literature. We believe that there are, therefore, several directions that we may take starting from this initial point.

1) Our approach has been entirely frequentist and it is natural to reconsider the problem from a Bayesian perspective. This would entail modeling the regression parameters and the before and after changepoint parameters. With Markov Chain Monte Carlo methods it should be possible to calculate the posterior distributions.

2) The hazard approach to changepoint problems has been neglected, which is surprising given their relationship with survival analysis. From a Bayesian point of view, one might consider imposing a Dirichlet process prior on the hazard of change.

3) The before and after changepoint distributions could be allowed to depend on covariates and, while this should be routine, serious applications of such models, either alone or in combination with the model of this thesis, should be carried out. The medical field is a particularly rich source of changepoint problems.


4) In this thesis we have assumed that the hazard of change is constant. An obvious extension is to that of a piecewise constant hazard and to more general hazards.

5) Extensive simulations, which are beyond the scope of this thesis, should be carried out to examine the performance of our model under a variety of circumstances. The following questions should be addressed:
a) How does the discrepancy between the before and after changepoint distributions affect the finite sample inference?
b) How does the number of covariates affect our inference?
c) What are the advantages of increasing sequence size instead of the number of sequences?
d) How efficient is simulated annealing when several covariates are considered?

6) The model may be reformulated when more than one changepoint is possible in each sequence.

Bibliography

[1] Aarts, E. (1988). Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing. John Wiley and Sons, New York.
[2] Abraham, R., Marsden, J.E. and Ratiu, T. (1988). Manifolds, Tensor Analysis, and Applications. Springer-Verlag, New York.
[3] Amari, S.I. (1982). Differential Geometry of Curved Exponential Families: Curvatures and Information Loss. The Annals of Statistics, Vol. 10, No. 2, pp 357-385.
[4] Barndorff-Nielsen, O.E. (1978). Information and Exponential Families in Statistical Theory. Wiley, New York.
[5] Baron, M. and Rukhin, A. (1997). Asymptotic Behavior of Confidence Regions in the Change-point Problem. J. of Stat. Plan. and Inference, Vol. 58, pp 263-282.
[6] Beck, C. and Schlögl, F. (1993). Thermodynamics of Chaotic Systems. Cambridge University Press.
[7] Bertsimas, D. and Tsitsiklis, J. (1993). Simulated Annealing. Statistical Science, Vol. 8, No. 1, pp 10-15.
[8] Brooks, S. and Morgan, B. (1995). Optimization Using Simulated Annealing. The Statistician, Vol. 44, No. 2, pp 241-257.
[9] Broström, G. (1997). A Martingale Approach to the Changepoint Problem. JASA, Vol. 92, No. 439, pp 1177-1183.
[10] Carlstein, E. (1988). Nonparametric Change-Point Estimation. Ann. Statist., Vol. 16, pp 188-197.
[11] Carlstein, E., Müller, H.G. and Siegmund, D. (1994). Change Point Problems. IMS Lecture Notes-Monograph Series, Vol. 23.
[12] Cartan, H. (1995). Elementary Theory of Analytic Functions of One or Several Variables. Dover Publications.
[13] Chao, M.T. (1970). Strong Consistency of Maximum Likelihood Estimators When the Observations Are Independent But Not Identically Distributed. Dr. Y. W. Chen's 60-year Memorial Volume, Academia Sinica, Taipei.
[14] Chen, J. (1994). Generalized Likelihood-Ratio Test of the Number of Components of Finite Mixture Models. Can. J. of Statistics, Vol. 22, No. 3, pp 384-399.
[15] Chen, J. (1995). Optimal Rate of Convergence for Finite Mixture Models. Annals of Statistics, Vol. 23, No. 1, pp 221-233.
[16] Chen, J. and Gupta, A. (1997). Testing and Locating Variance Changepoints With Application to Stock Prices. JASA, Vol. 92, No. 438, pp 739-747.
[17] Chernoff, H. and Lander, E. (1995). Asymptotic Distribution of the Likelihood Ratio Test That a Mixture of Two Binomials Is a Single Binomial. J. Stat. Plan. Inference, Vol. 43, No. 1-2, pp 19-40.
[18] Chockalingam, A., Abbott, D., Bass, M., Battista, R. et al. (1990). Recommendations of the Canadian consensus conference on non-pharmacological approaches to the management of high blood pressure. Canadian Medical Association Journal, 142: pp 1397-1409.
[19] Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. Chapman and Hall.
[20] Cox, D. and Reid, N. (1987). Parameter Orthogonality and Approximate Conditional Inference. J.R.S.S. B, Vol. 49, No. 1, pp 1-39.
[21] Crowder, M.J. and Hand, D.J. (1990). Analysis of Repeated Measures. Chapman and Hall.
[22] Csiszár, I. (1971). Generalized Entropy and Quantization Problems. Trans. of the 6th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, Prague 1971, pp 139-174.
[23] Csiszár, I. (1967). Information-Type Measures of Difference of Probability Distributions and Indirect Observations. Studia Scientiarum Mathematicarum Hungarica, 2 (1967), pp 299-318.
[24] Ehrenfest, Paul and Tatiana Ehrenfest (1990). The Conceptual Foundations of the Statistical Approach in Mechanics. Dover Publications, Inc.
[25] Folland, G. (1984). Real Analysis. Wiley-Interscience, New York.
[26] Gelbaum, B.R. and J.M.H. Olmsted (1964). Counterexamples in Analysis. Holden-Day, Inc.
[27] Geman, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-6, No. 6, Nov.
[28] Ghosh, J.K. and P.K. Sen (1985). On the Asymptotic Performance of the Log Likelihood Ratio Statistic for the Mixture Model and Related Results. In Proceedings of the Berkeley Conference in Honor of J. Neyman and J. Kiefer, edited by Lucien M. Le Cam and Richard A. Olshen, pp 789-806.
[29] Giraitis, L. et al. (1996). The Change-point problem for dependent observations. J. of Stat. Plan. and Inference, Vol. 53, pp 297-310.
[30] Gidas, B. (1995). Metropolis-Type Monte Carlo Simulation Algorithms and Simulated Annealing. In "Topics in Contemporary Probability and Its Applications", edited by J. L. Snell.
[31] Good, I.J. (1963). Maximum Entropy for Hypothesis Formulation, Especially for Multidimensional Contingency Tables. Ann. of Math. Statist., Vol. 33, pp 911-934.
[32] Gruchow, H.W., Sobocinski, K.A., and Barboriak, J.J. (1985). Alcohol, nutrient intake, and hypertension in US adults. Journal of the American Medical Association, 253: 1567-1570.
[33] Harlan, W.R., Hull, A.L., Schmouder, R.L., Landis, J.R. et al. (1984). Blood pressure and nutrition in adults. American Journal of Epidemiology, 120: 17-28.
[34] Hille, E. and Phillips, R. (1957). Functional Analysis and Semi-Groups. American Mathematical Society, Providence, Rhode Island.
[35] Hobson, A. (1971). Concepts in Statistical Mechanics. Gordon and Breach Science Publications.
[36] Hoadley, B. (1971). Asymptotic Properties of Maximum Likelihood Estimators for the Independent Not Identically Distributed Case. The Annals of Mathematical Statistics, Vol. 42, No. 6, pp 1977-1991.
[37] Hu, I. and Rukhin, A. (1995). A Lower Bound for Error Probability in Change-Point Estimation. Statistica Sinica, Vol. 5, pp 319-331.
[38] Huzurbazar, V. (1950). Probability Distributions and Orthogonal Parameters. Proc. of the Camb. Phil. Soc., Vol. 46, pp 281-284.
[39] Ibragimov, I.A. and Has'minskii, R.Z. (1981). Statistical Estimation: Asymptotic Theory. Springer-Verlag.
[40] Ibragimov, I.A. and Has'minskii, R.Z. (1975). Properties of Maximum Likelihood and Bayes' Estimators for Non-Identically Distributed Observations. Theory of Probability and Its Applications, Vol. XX, No. 4, pp 689-694.
[41] Ingber, L. (1993). Simulated Annealing: Practice versus Theory. Mathl. Comput. Modelling, Vol. 18, No. 11, pp 29-57.
[42] Jaynes, E.T. (1983). Papers on Probability, Statistics and Statistical Physics. Edited by R. D. Rosenkrantz. D. Reidel Publishing Company.
[43] Johnson, R.W. and J.E. Shore (1983). Comments On and Correction to "Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy". IEEE Transactions on Information Theory, Vol. IT-29, No. 6, Nov. 1983.
[44] Joseph, L. (1989). The Multi-Path Change-Point. Ph.D. Thesis, Department of Mathematics and Statistics, McGill University, Montreal.
[45] Joseph, L. and D.B. Wolfson (1992). Estimation in Multi-Path Change-Point Problems. Commun. Statist. Theory and Methods, Vol. 21, pp 897-913.
[46] Joseph, L. and D.B. Wolfson (1993). Maximum Likelihood Estimation in the Multi-Path Change-Point Problem. Ann. Inst. Statist. Math., Vol. 45, No. 3, pp 511-530.
[47] Joseph, L., D.B. Wolfson et al. (1996(a)). Change-Point Analysis of a Randomized Trial on the Effects of Calcium Supplementation on Blood Pressure. In Bayesian Biostatistics, edited by D. A. Berry and D. K. Stangl.
[48] Joseph, L. and D.B. Wolfson (1996(b)). Estimation in the Multi-Path Change-Point Problem for Correlated Data. The Canadian Journal of Statistics, Vol. 24, No. 1, pp 37-53.
[49] Joseph, L. and D.B. Wolfson (1997). Analysis of Panel Data with Change-Points. Statistica Sinica, Vol. 7, No. 3, pp 687-703.
[50] Kapur, J.N. and H.K. Kesavan (1992). Entropy Optimization Principles with Applications. Academic Press.
[51] Kass, R. (1989). The Geometry of Asymptotic Inference. Statistical Science, Vol. 4, No. 3, pp 188-234.
[52] Khinchin, A.I. (1957). Mathematical Foundations of Information Theory. Dover Publications.
[53] Kim, H. (1996). Change-point Detection for Correlated Observations. Statistica Sinica, Vol. 6, pp 275-287.
[54] Kirkpatrick, S., Gelatt, C. and Vecchi, M. (1983). Optimization by Simulated Annealing. Science, Vol. 220, No. 4598, pp 671-680.
[55] Kobayashi, S. and Nomizu, K. (1963). Foundations of Differential Geometry, Vol. I. Wiley, New York.
[56] Kolmogorov, A.N. (1968). Three Approaches to the Definition of the Concept of the "Amount of Information". Selected Translations in Math. Stat. and Prob., Vol. 7, 1968.
[57] Kolmogorov, A. and Fomin, S. (1970). Introductory Real Analysis. Dover Publications.
[58] Koopman, B. (1936). On Distributions Admitting a Sufficient Statistic. Trans. AMS, Vol. 39, pp 399-409.
[59] Kraft, C. and L. LeCam (1956). A Remark on the Roots of the Maximum Likelihood Equation. Ann. of Math. Stat., Vol. 27, pp 1174-1177.
[60] Lai, T. (1995). Sequential Changepoint Detection in Quality Control and Dynamical Systems. JRSS B, Vol. 57, No. 4, pp 613-658.
[61] Landsberg, P.T. (1990). Thermodynamics and Statistical Mechanics. Dover Publications.
[62] LeCam, L.M. (1970). On the Assumptions Used to Prove Asymptotic Normality of Maximum Likelihood Estimates. The Annals of Mathematical Statistics, Vol. 41, No. 3, pp 802-828.
[63] LeCam, L.M. (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag.
[64] LeCam, L.M. (1990a). Asymptotics in Statistics. Springer-Verlag.
[65] LeCam, L.M. (1990b). Maximum Likelihood: An Introduction. Inter. Statist. Review, Vol. 58, No. 2, pp 153-171.
[66] Lee, C. (1997). Estimating the Number of Change Points in Exponential Families Distributions. Vol. 24, pp 201-210.
[67] Lehmann, E. (1986). Testing Statistical Hypotheses. John Wiley and Sons.
[68] Linfoot, E.H. (1957). An Informational Measure of Correlation. Information and Control, Vol. 1, pp 85-89.
[69] Luo, X., Turnbull, B. and Clark, L. (1997). Likelihood Ratio Tests for a Changepoint with Survival Data. Biometrika, Vol. 84, No. 3, pp 555-565.
[70] Marcus, M. and Minc, H. (1964). A Survey of Matrix Theory and Matrix Inequalities. Dover Publications.
[71] Martin, N.F.G. and J.W. England (1980). Mathematical Theory of Entropy. Encyclopedia of Mathematics and Its Applications.
[72] Massey, W. (1991). A Basic Course in Algebraic Topology. Springer-Verlag.
[73] McCarron, D.A., Morris, C.D., and Cole, C. (1984). Dietary calcium in human hypertension. Science, 217: 267-269.
[74] Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H. and Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys., Vol. 21, pp 1087-1091.
[75] Milnor, J. (1963). Morse Theory. Annals of Mathematics Studies, Number 51, Princeton University Press.
[76] Michor, P.W. (1980). Manifolds of Differentiable Mappings. Shiva Publishing Limited.
[77] Mitrinović, D.S. (1964). Elementary Inequalities. P. Noordhoff Ltd., Groningen.
[78] Mitrinović, D.S. (1970). Analytic Inequalities. Springer-Verlag.
[79] Nagaraj, N. and Reddy, C. (1993). Asymptotic Null Distributions of Tests for Change in Level in Correlated Data. Sankhya, Series A, Vol. 55, Pt. 1, pp 37-48.
[80] Norden, R.H. (1973). A Survey of Maximum Likelihood Estimation. Int. Stat. Rev., Vol. 41, No. 1, 1973, pp 39-58.
[81] Norden, R.H. (1972). A Survey of Maximum Likelihood Estimation. Int. Stat. Rev., Vol. 40, No. 3, 1972, pp 329-351.
[82] Oxtoby, J. (1971). Measure and Category. Springer-Verlag.
[83] Oxtoby, J. and Ulam, S. (1938). On the Equivalence of Any Set of First Category to a Set of Measure Zero. Fund. Math., Vol. 31, pp 201-206.
[84] Page, E. (1954). Continuous Inspection Schemes. Biometrika, Vol. 41, pp 100-114.
[85] Page, E. (1955(a)). Control Charts with Warning Lines. Biometrika, Vol. 42, pp 213-254.
[86] Page, E. (1955(b)). A Test for a Change in a Parameter Occurring at an Unknown Point. Biometrika, Vol. 42, pp 523-527.
[87] Page, E. (1957). On Problems in Which a Change in a Parameter Occurs at an Unknown Point. Biometrika, Vol. 44, pp 248-252.
[88] Pfaff, Thomas (1982). Quick Consistency of Quasi Maximum Likelihood Estimators. The Annals of Statistics, Vol. 10, No. 3, pp 990-1005.
[89] Picard, D. (1985). Testing and Estimating Change-Points in Time Series. Adv. Appl. Prob., Vol. 17, pp 841-867.
[90] Pollard, D. (1994). Another Look at Differentiability in Quadratic Mean. In Festschrift for L. Le Cam, edited by D. Pollard, E. Torgersen, G. Yang. Springer-Verlag.
[91] Quandt, R. (1958). The Estimation of the Parameters of a Linear Regression System Obeying Two Separate Regimes. JASA, Vol. 53, pp 873-880.
[92] Quandt, R. (1960). Tests of the Hypothesis That a Linear Regression System Obeys Two Separate Regimes. JASA, Vol. 55, pp 324-330.
[93] Quandt, R. (1972). A New Approach to Estimating Switching Regressions. JASA, Vol. 67, pp 306-310.
[94] Rajski, C. (1959). On the Existence of Entropy. Trans. of the Second Prague Conference, Prague (1959), pp 541-542.
[95] Redner, R.A. (1981). Note on the Consistency of the Maximum Likelihood Estimate for Nonidentifiable Distributions. Ann. of Statist., Vol. 9, pp 225-228.
[96] Rényi, Alfred (1960). On Measures of Entropy and Information. Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp 547-561.
[97] Rényi, Alfred (1959a). On the Dimension and Entropy of Probability Distributions. Acta Math. Acad. Sci. Hungar., Vol. 10, pp 193-215.
[98] Rényi, Alfred (1959b). On Measures of Dependence. Acta Math. Acad. Sci. Hungar., Vol. 10, pp 441-451.
[99] Royden, H.L. (1988). Real Analysis, 3rd ed. Macmillan, New York.
[100] Rustagi, Jagdish S. (1994). Optimization Techniques in Statistics. Academic Press.
[101] National Health and Nutrition Examination Surveys I and II. Hypertension, 8: 1067-1074.
[103] Shaban, S. (1980). Change Point Problem and Two-Phase Regression: an Annotated Bibliography. Inter. Stat. Review, Vol. 48, pp 83-93.
[104] Shiryayev, A.N. (1961(a)). The Detection of Spontaneous Effects. Soviet Mathematics, Vol. 2, No. 1, pp 740-743.
[105] Shiryayev, A.N. (1961(b)). The Problem of the Most Rapid Detection of a Disturbance in a Stationary Process. Soviet Mathematics, Vol. 2, No. 1, pp 795-799.
[106] Shiryayev, A.N. (1963(a)). On Optimum Methods in Quickest Detection Problems. Theory of Probability and Its Applications, Vol. VIII, No. 1, pp 22-46.
[107] Shiryayev, A.N. (1963(b)). On the Detection of Disorder in a Manufacturing Process I. Theory of Probability and Its Applications, Vol. VIII, No. 3, pp 247-265.
[108] Shiryayev, A.N. (1963(c)). On the Detection of Disorder in a Manufacturing Process II. Theory of Probability and Its Applications, Vol. VIII, No. 4, pp 402-413.
[109] Shiryayev, A.N. (1965). Some Exact Formulas in a "Disorder" Problem. Theory of Probability and Its Applications, Vol. X, pp 348-354.
[110] Shiryayev, A.N. (1966(a)). Detection of a Randomly Appearing Target in a Multi-Channel System. Selected Translations in Mathematical Statistics and Probability, Vol. 6, pp 157-161.
[111] Shiryayev, A.N. (1966(b)). Detection of a Randomly Appearing Target in a Multi-Channel System. Selected Translations in Mathematical Statistics and Probability, Vol. 6, pp 162-188.
[112] Shiryayev, A.N. (1973). Statistical Sequential Analysis (Optimal Stopping Rules). Translations of Mathematical Monographs, Volume 38. American Mathematical Society, Providence, Rhode Island.
[113] Shore, J.E. and R.W. Johnson (1980). Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy. IEEE Transactions on Information Theory, Vol. IT-26, No. 1, Jan. 1980.
[114] Silver, R.N. and H.F. Martz (1994). Applications of Quantum Entropy to Statistics. Invited Paper for ASA Meeting, Toronto, August 1994.
[115] Smith, A. (1975). A Bayesian Approach to Inference About a Change-point in a Sequence of Random Variables. Biometrika, Vol. 62, pp 407-416.
[116] Smith, A.F.M. and G.O. Roberts (1993). Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods. J. R. Statist. Soc. B, Vol. 55, No. 1, pp 3-23.
[117] Soofi, E.S. (1994). Capturing the Intangible Concept of Information. JASA, Vol. 89, No. 428, pp 1243-1254.
[118] Tang, S. and MacNeill, I. (1993). The Effect of Serial Correlation on Tests for Parameter Change at Unknown Time. Ann. Statist., Vol. 21, pp 552-575.
[119] Tartakovskii (1994). Asymptotic Minimax Multialternative Sequential Rule for Disorder Detection. Proceedings of the Steklov Institute of Mathematics, Issue 4, pp 229-236.
[120] Teicher, H. (1960). On the Mixture of Distributions. Ann. Math. Stat., Vol. 31, pp 55-73.
[121] Telksnys, L. (1986). Detection of Changes in Random Processes. Optimization Software, Inc., Publications Division, New York.
[122] Vajda, I. (1971). χα-Divergence and Generalized Fisher's Information. Trans. of the 6th Prague Conference on Information Theory, etc., Prague 1971, pp 873-886.
[123] Vostrikova, L. (1981). Detecting "Disorder" in Multidimensional Random Processes. Soviet Math. Dokl., Vol. 24, No. 1, pp 55-59.
[124] Wald, Abraham (1949). Note on the Consistency of the Maximum Likelihood Estimate. Ann. of Math. Stat., Vol. 20, pp 595-601.
[125] Worsley, K. (1986). Confidence Regions and Tests for a Change-point in a Sequence of Exponential Family Random Variables. Biometrika, Vol. 73, No. 1, pp 91-104.
[126] Yakir, B. (1994). Optimal Detection of a Change in Distribution when the Observations Form a Markov Chain with a Finite State Space. In Change-Point Problems, IMS Lecture Notes Vol. 23, edited by E. Carlstein et al. (1994).
[127] Zacks, S. (1983). Survey of classical and Bayesian approaches to the change-point problem: fixed sample and sequential procedures of testing and estimation. In Recent Advances in Statistics, edited by M.H. Rizvi, J.S. Rustagi and D. Siegmund. Academic Press, New York.
[128] Zeidler, Eberhard (1985). Nonlinear Functional Analysis and Its Applications, Vol. I and III. Springer-Verlag.
[129] Zvárová, Jana (1971). On Asymptotic Behaviour of a Sample Estimator of Rényi's Information of Order α. Trans. of the 6th Prague Conference, Prague 1971, pp 919-924.