Chapter 8

Likelihood Processes

This chapter describes the behavior of likelihood ratios for large sample sizes. We apply results from chapter 4 to show that derivatives of log-likelihoods are additive martingales and results from chapter 7 to show that likelihood ratio processes are multiplicative martingales. This chapter studies settings in which the state vector can be inferred from observed data, while chapter 9 studies settings in which states are hidden but can be imperfectly inferred from observed variables.

8.1 Warmup

This section introduces elementary versions of objects that will play key roles in this chapter. We simplify things by assuming that successive observations of a random vector y are independently and identically distributed draws from a statistical model ψ(y|θo). In this section, we study aspects of the “inverse problem” of inferring θo from a sequence of observed y’s. Remaining sections consider situations in which observations are not identically and independently distributed. As an instance of a chapter 1, section 1.1 setup, assume that an unknown parameter vector θ resides in a set Θ. Given θ, a vector Y with realizations in Y is described by a probability density ψ(y|θ) relative to a measure τ over Y. We also call the density ψ(y|θ) a likelihood function or statistical model. We assume that a sample y1, . . . , yT is a set of independent draws from ψ(y|θ). The likelihood function of this sample is

\[ L(y_1, \ldots, y_T \mid \theta) = \prod_{t=1}^{T} \psi(y_t \mid \theta). \]


We want to study the maximum likelihood estimator of θ as sample size T → +∞. For this purpose, we define the log density

\[ \lambda(y|\theta) = \log \psi(y|\theta) \]
and the log-likelihood function multiplied by T⁻¹,

\[ l_T(\theta) = T^{-1} \sum_{t=1}^{T} \lambda(y_t|\theta) = T^{-1} \log L(y_1, \ldots, y_T \mid \theta). \]

We are interested in the behavior of the random variable lT(θ) as T → +∞. For this purpose, let ψ(y|θo) be the density from which the data are actually drawn. Under ψ(y|θo), a law of large numbers implies that

\[ l_T(\theta) \to E[\lambda(y|\theta)], \]
where the mathematical expectation is evaluated with respect to the density ψ(y|θo), so that
\[ E[\lambda(y|\theta)] = \int \lambda(y|\theta)\, \psi(y|\theta_o)\, \tau(dy) \equiv l_\infty(\theta). \]

(Here it is understood that lT(θ) depends on θo.) We can regard l∞(θ) as the population value of the (scaled by T⁻¹) log-likelihood function of the parameter vector θ when the data are governed by successive independent draws from the statistical model ψ(y|θo). The following proposition describes a key desirable property of the maximum likelihood estimator of θ for infinite values of T.

Proposition 8.1.1.

\[ \operatorname*{argmax}_{\theta \in \Theta}\, l_\infty(\theta) = \theta_o. \]

Proof. Let r be the likelihood ratio

\[ r = \frac{\psi(y|\theta)}{\psi(y|\theta_o)}. \]

The inequality log r ≤ r − 1 (see figure 8.1) implies
\[ \int \log\left(\frac{\psi(y|\theta)}{\psi(y|\theta_o)}\right) \psi(y|\theta_o)\, \tau(dy) \le \int \left(\frac{\psi(y|\theta)}{\psi(y|\theta_o)} - 1\right) \psi(y|\theta_o)\, \tau(dy) = \int \psi(y|\theta)\, \tau(dy) - \int \psi(y|\theta_o)\, \tau(dy) = 1 - 1 = 0. \]
Therefore
\[ \int \log \psi(y|\theta)\, \psi(y|\theta_o)\, \tau(dy) \le \int \log \psi(y|\theta_o)\, \psi(y|\theta_o)\, \tau(dy), \]
or l∞(θ) ≤ l∞(θo).

Figure 8.1: The inequality (r − 1) ≥ log(r). [The figure plots the functions r − 1 and log(r) against r.]

Definition 8.1.2. The parameter vector θ is said to be identified if argmax over θ ∈ Θ of l∞(θ) is unique.

Definition 8.1.3. The time t element of the score process of a likelihood function is defined as
\[ s_t(\theta) = \frac{\partial \lambda(y_t|\theta)}{\partial \theta}. \]

Remark 8.1.4. The first-order necessary condition for the (population) maximum likelihood estimator is
\[ E\, s_t(\theta_o) = 0, \]
where again the mathematical expectation is taken with respect to the density ψ(y|θo).

Definition 8.1.5. The Fisher information matrix is defined as
\[ \mathcal{I}(\theta_o) = E\, s_t(\theta_o) s_t(\theta_o)' = \int s_t(\theta_o)\, s_t(\theta_o)'\, \psi(y|\theta_o)\, \tau(dy). \]

A necessary condition for the parameter vector θo to be identified is that the Fisher information matrix I(θo) be a positive definite matrix.

Kullback-Leibler discrepancy

Let
\[ m = \frac{\psi(y|\theta)}{\psi(y|\theta_o)} \]
be the likelihood ratio of the θ model relative to the θo model. Kullback-Leibler relative entropy is defined as^1
\[ \mathrm{ent} = E\,[m \log m] = \int \frac{\psi(y|\theta)}{\psi(y|\theta_o)} \log\left(\frac{\psi(y|\theta)}{\psi(y|\theta_o)}\right) \psi(y|\theta_o)\, \tau(dy) = \int \log\left(\frac{\psi(y|\theta)}{\psi(y|\theta_o)}\right) \psi(y|\theta)\, \tau(dy), \]
where the mathematical expectation is under the θo model. Kullback and Leibler’s relative entropy concept is often used to measure the discrepancy between two densities ψ(y|θ) and ψ(y|θo). This use exploits the important property that Em log m is nonnegative and that it is zero when θ = θo. To show this, first note the inequality

m log m ≥ m − 1.

Please see figure 8.2. After noting that Em = 1, it then follows that

\[ E\,[m \log m] \ge E\,[m] - 1 = 1 - 1 = 0. \]

^1 See Kullback and Leibler (1951).

So Kullback-Leibler relative entropy is nonnegative. Further, notice that Em log m can be represented as
\[ \int \log\left(\frac{\psi(y|\theta)}{\psi(y|\theta_o)}\right) \psi(y|\theta)\, \tau(dy) \ge 0. \]
Therefore
\[ \int \log \psi(y|\theta)\, \psi(y|\theta)\, \tau(dy) - \int \log \psi(y|\theta_o)\, \psi(y|\theta)\, \tau(dy) \ge 0, \]
and equals zero when θ = θo. Thus, setting θ = θo minimizes population Kullback-Leibler relative entropy.

Figure 8.2: The inequality (m − 1) ≤ m log(m). [The figure plots the functions m log(m) and m − 1 against m.]

In the remainder of this chapter and in chapter 9 we extend our study to situations in which observations of yt are not identically and independently distributed.
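Before turning to dependent processes, here is a minimal numerical sketch of the warmup, added for illustration and not part of the formal development. It assumes a Gaussian location model in which ψ(y|θ) is the N(θ, 1) density with θo = 0.5, simulates i.i.d. draws, and checks that the scaled log-likelihood lT(θ) is maximized near θo, in line with proposition 8.1.1; all parameter values and variable names are our own choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_o = 0.5                              # true parameter generating the data (illustrative choice)
T = 100_000                                # sample size
y = theta_o + rng.standard_normal(T)       # i.i.d. draws from psi(y|theta_o) = N(theta_o, 1)

theta_grid = np.linspace(-1.0, 2.0, 301)
# l_T(theta) = T^{-1} sum_t log psi(y_t | theta)
l_T = np.array([norm.logpdf(y, loc=th).mean() for th in theta_grid])

theta_hat = theta_grid[np.argmax(l_T)]
print(f"argmax of l_T over the grid: {theta_hat:.3f}   (theta_o = {theta_o})")

# Population counterpart: l_inf(theta) - l_inf(theta_o) = -KL( N(theta_o,1) || N(theta,1) )
#                                                       = -(theta - theta_o)^2 / 2 <= 0,
# which is maximized at theta = theta_o, matching proposition 8.1.1 and the nonnegativity
# of Kullback-Leibler relative entropy.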

8.2 Dependent Processes

As in chapter 4, we let {Wt+1} be a process of shocks satisfying

\[ E\left(W_{t+1} \mid \mathcal{F}_t\right) = 0. \]

We let {Xt : t = 0, 1, ...} = {Xt} be a discrete time stationary Markov process generated by

\[ X_{t+1} = \phi(X_t, W_{t+1}), \]
where φ is a Borel measurable function. We observe a vector of “signals” whose ith component evolves as an additive functional of {Xt}:
\[ Y_{t+1}^{[i]} - Y_t^{[i]} = \kappa_i(X_t, W_{t+1}) \quad \text{for } i = 1, 2, \ldots, k. \]
Stack these k signals into a vector Yt+1 − Yt and form

Yt+1 − Yt = κ(Xt,Wt+1).

In this chapter we impose:

Assumption 8.2.1. X0 is observed and there exists a function χ such that

Wt+1 = χ(Xt,Yt+1 − Yt).

Assumption 8.2.1 is an invertibility condition that asserts that states {X_t}_{t=1}^∞ can be recovered from signals. To verify this claim, recall that

\[ X_{t+1} = \phi(X_t, W_{t+1}) = \phi[X_t, \chi(X_t, Y_{t+1} - Y_t)] \equiv \zeta(X_t, Y_{t+1} - Y_t), \qquad (8.1) \]
which allows us to recover a sequence of states from the initial state X0 and a sequence of signals. In chapter 9, we will relax assumption 8.2.1 and treat states {X_t}_{t=0}^∞ as hidden. Let τ denote a measure over the space Y of admissible signals.

Assumption 8.2.2. Yt+1 − Yt has a conditional density ψ(·|x) with respect to τ conditioned on Xt = x.

We want to construct the density ψ from κ and the distribution of Xt+1 conditioned on Xt. One possibility is to construct ψ and τ as follows:

Example 8.2.3. Suppose that κ is such that Yt+1 − Yt = Xt+1 and that the Markov process {Xt} has a transition density p(x*|x) relative to a measure λ. Set ψ = p and τ = λ.

Another possibility is:

Example 8.2.4.

Xt+1 = AXt + BWt+1

Yt+1 − Yt = DXt + FWt+1,

where A is a stable matrix, {W_{t+1}}_{t=0}^∞ is an i.i.d. sequence of N(0, I) random vectors conditioned on X0, and F is nonsingular. So
\[ X_{t+1} = \left(A - BF^{-1}D\right) X_t + BF^{-1}\left(Y_{t+1} - Y_t\right), \]
which gives the function ζ(Xt, Yt+1 − Yt) = Xt+1 in equation (8.1) from assumption 8.2.1 as a linear function of Yt+1 − Yt and Xt. The conditional distribution of Yt+1 − Yt is normal with mean DXt and nonsingular covariance matrix FF′, which gives us ψ. This conditional distribution has a density against Lebesgue measure on R^m, a measure that we can use as τ.
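The following small Python sketch, which we add for illustration, simulates a version of example 8.2.4 with arbitrarily chosen matrices A, B, D, F (A stable, F nonsingular) and verifies the invertibility claim of assumption 8.2.1: applying the map ζ of equation (8.1) to the signal increments recovers the states exactly.

import numpy as np

rng = np.random.default_rng(1)
# Illustrative 2-state, 2-signal system; the matrices are our own choices.
A = np.array([[0.9, 0.0], [0.2, 0.7]])     # stable: eigenvalues 0.9 and 0.7
B = np.array([[1.0, 0.0], [0.3, 1.0]])
D = np.array([[1.0, 0.5], [0.0, 1.0]])
F = np.array([[0.8, 0.1], [0.0, 0.6]])     # nonsingular

T = 200
X = np.zeros((T + 1, 2))
dY = np.zeros((T, 2))
for t in range(T):
    W = rng.standard_normal(2)
    X[t + 1] = A @ X[t] + B @ W
    dY[t] = D @ X[t] + F @ W               # signal increment Y_{t+1} - Y_t

# W_{t+1} = F^{-1}(Y_{t+1} - Y_t - D X_t), so
# X_{t+1} = (A - B F^{-1} D) X_t + B F^{-1} (Y_{t+1} - Y_t), the function zeta in (8.1).
Finv = np.linalg.inv(F)
Xrec = np.zeros_like(X)                    # start from the observed initial state X_0 = 0
for t in range(T):
    Xrec[t + 1] = (A - B @ Finv @ D) @ Xrec[t] + B @ Finv @ dY[t]

print("maximum state-recovery error:", np.abs(Xrec - X).max())   # numerically zero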

8.3 Likelihood processes

Assumptions 8.2.1 and 8.2.2 imply that associated with a statistical model is a probability specification for (Xt+1, Yt+1 − Yt) conditioned on Xt. The only information about the distribution of future Y’s contained in current and past (Xt, Yt − Yt−1)’s and X’s is Xt. The joint density, or likelihood function, of a history of observations Y1, ..., Yt conditioned on X0 is

\[ L_t = \prod_{j=1}^{t} \psi(Y_j - Y_{j-1} \mid X_{j-1}), \]
so
\[ \log L_t = \sum_{j=1}^{t} \log \psi(Y_j - Y_{j-1} \mid X_{j-1}). \]

Because they are functions of signals {Yt − Yt−1} and states {Xt}, the likelihood function and log-likelihood function are both stochastic processes. Introducing a density for the initial state X0 only affects initial conditions for L0 and log L0.

Fact 8.3.1. A log-likelihood process {log L_t : t = 0, 1, 2, ...} is an additive functional with increment
\[ \log \psi(y^* \mid x) = \log \psi[\kappa(x, w^*) \mid x] \doteq \kappa_\ell(x, w^*), \]
where y* denotes a realized value of Yt+1 − Yt.

Fact 8.3.2. Because a log-likelihood process is an additive functional, a likelihood process is a multiplicative functional.

Example 8.3.3. Consider example 8.2.4 again. It follows from the formula for the normal density that
\[ \kappa_\ell(X_{t+1}, X_t) = -\frac{1}{2}\left(Y_{t+1} - Y_t - DX_t\right)'(FF')^{-1}\left(Y_{t+1} - Y_t - DX_t\right) - \frac{1}{2}\log\det(FF') - \frac{k}{2}\log(2\pi) \qquad (8.2) \]
and
\[ \log L_t = -\frac{1}{2}\sum_{j=1}^{t}\left(Y_j - Y_{j-1} - DX_{j-1}\right)'(FF')^{-1}\left(Y_j - Y_{j-1} - DX_{j-1}\right) - \frac{t}{2}\log\det(FF') - \frac{kt}{2}\log(2\pi). \]

If we know the transition density ψ only up to an unknown parameter vector θ, we have a set of transition density functions ψ(y*|x, θ) indexed by parameter θ in a space Θ. For each parameter vector θ we construct

\[ \log \psi(y^* \mid x, \theta) = \log \psi[\kappa(x, w^*) \mid x, \theta] \doteq \kappa_\ell(x, w^* \mid \theta). \]

Let log L0(θ) be the logarithm of an initial density function for X0 that we also allow to depend on θ. When it is unique, we sometimes use the stationary distribution as the distribution for X0. A log-likelihood process is
\[ \log L_t(\theta) = \sum_{j=1}^{t} \log \psi\left(Y_j - Y_{j-1} \mid X_{j-1}, \theta\right) + \log L_0(\theta). \]
The implied probability distributions for all t ≥ 1 along with a density for X0 constitute the statistical model indexed by parameters θ.
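As an illustration that we add here, the Python function below evaluates log Lt(θ) for the Gaussian model of example 8.3.3, taking the signal increments and states as data and θ as the pair (D, F); it implements the increment formula (8.2) and omits the initial-condition term log L0(θ). The function name and array conventions are our own.

import numpy as np

def log_likelihood(dY, X, D, F):
    """log L_t for example 8.3.3, with dY[j] = Y_{j+1} - Y_j and X[j] the state at date j;
    the initial term log L_0(theta) is omitted."""
    t, k = dY.shape
    Sigma = F @ F.T                                   # FF'
    Sigma_inv = np.linalg.inv(Sigma)
    resid = dY - X[:-1] @ D.T                         # Y_{j+1} - Y_j - D X_j
    quad = np.einsum('ti,ij,tj->t', resid, Sigma_inv, resid)
    return (-0.5 * quad.sum()
            - 0.5 * t * np.log(np.linalg.det(Sigma))
            - 0.5 * k * t * np.log(2.0 * np.pi))

Applied to data simulated as in the sketch following example 8.2.4, this log-likelihood is, up to sampling noise, largest near the (D, F) that generated the data.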

Because a log-likelihood process is an additive functional (see fact 8.3.1), Proposition 4.2.3 tells us that it has a representation

\[ \log L_t(\theta) = \nu(\theta)\, t + \sum_{j=1}^{t} \kappa_a(X_{j-1}, W_j \mid \theta) - g(X_t \mid \theta) + g(X_0 \mid \theta) + \log L_0(\theta). \qquad (8.3) \]

Let θo be the parameter value that generates the data. Applying a law of large numbers and the proposition 4.2.3 properties of its components to (8.3) establishes that under the statistical model with θ = θo,
\[ \lim_{t \to +\infty} \frac{1}{t} \log L_t(\theta_o) = \nu(\theta_o). \qquad (8.4) \]
We will use result (8.4) when we discuss maximum likelihood estimation of the parameter vector θ below.

8.4 Likelihood ratios

Ratios of likelihoods associated with two different parameter vectors, say θ and θo, can be constructed by multiplicatively cumulating a sequence of multiplicative increments of the form

\[ \exp\left[\kappa_\ell(x, w^* \mid \theta) - \kappa_\ell(x, w^* \mid \theta_o)\right] = \frac{\psi(y^* \mid x, \theta)}{\psi(y^* \mid x, \theta_o)}. \]

Under the θo probability model, the conditional expectation of a multiplicative increment to the likelihood ratio is
\[ \int_{\mathcal{Y}} \left(\frac{\psi(y^* \mid x, \theta)}{\psi(y^* \mid x, \theta_o)}\right) \psi(y^* \mid x, \theta_o)\, \tau(dy^*) = \int_{\mathcal{Y}} \psi(y^* \mid x, \theta)\, \tau(dy^*) = 1 \qquad (8.5) \]
for all x ∈ X and all θ ∈ Θ. This follows because ψ(y*|x, θ) is a density for every x and θ. We have established

Theorem 8.4.1. For each θ ∈ Θ, the likelihood ratio process {Lt(θ)/Lt(θo) : t = 0, 1, ...} is a multiplicative martingale under the θo probability model.

Thus, under the statistical model with parameter vector θo,
\[ E\left[\frac{L_t(\theta)}{L_t(\theta_o)} \,\Big|\, \mathcal{F}_{t-1}\right] = \frac{L_{t-1}(\theta)}{L_{t-1}(\theta_o)}. \]
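A quick Monte Carlo check of equation (8.5) and theorem 8.4.1, added for illustration under assumptions of our own: take the scalar transition density ψ(y*|x, θ) to be the N(θx, 1) density, fix a state x, draw y* under θo, and verify that the conditional expectation of the multiplicative increment ψ(y*|x, θ)/ψ(y*|x, θo) is one.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta_o, theta, x = 0.8, 0.3, 1.5          # illustrative values, our own choice
n = 1_000_000

# Draw y* from the theta_o model: y* | x ~ N(theta_o * x, 1)
y = theta_o * x + rng.standard_normal(n)

# Multiplicative increment psi(y*|x,theta) / psi(y*|x,theta_o); by (8.5) its conditional mean is 1.
ratio = np.exp(norm.logpdf(y, loc=theta * x) - norm.logpdf(y, loc=theta_o * x))
print("Monte Carlo estimate of the conditional mean of the increment:", ratio.mean())   # close to 1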

Figure 8.3: Jensen’s Inequality. The logarithmic function is the concave function that equals zero when evaluated at unity. By forming averages using the two endpoints of the straight line below the logarithmic function, we get a point on the line segment that depends on the weights used in the averaging. Jensen’s Inequality asserts that the line segment lies below the logarithmic function.

8.5 Log-Likelihoods

Because the logarithmic function is concave, Jensen’s Inequality implies that the expectation of the logarithm of a random variable cannot exceed the logarithm of the expectation. (See figure 8.3.) This reasoning implies the following inequality:
\[ \int_{\mathcal{Y}} \left[\log \psi(y^* \mid x, \theta) - \log \psi(y^* \mid x, \theta_o)\right] \psi(y^* \mid x, \theta_o)\, \tau(dy^*) \le \log \int_{\mathcal{Y}} \left(\frac{\psi(y^* \mid x, \theta)}{\psi(y^* \mid x, \theta_o)}\right) \psi(y^* \mid x, \theta_o)\, \tau(dy^*) = 0. \qquad (8.6) \]
Inequality (8.6) holds with equality only if

\[ \log \psi(y^* \mid x, \theta) = \log \psi(y^* \mid x, \theta_o) \]
with probability one under the measure ψ(y*|x, θo)τ(dy*). A consequence of inequality (8.6) is that

\[ E\left[\log L_{t+1}(\theta) - \log L_{t+1}(\theta_o) \mid \mathcal{F}_t\right] \le \log L_t(\theta) - \log L_t(\theta_o). \qquad (8.7) \]

Definition 8.5.1. An additive functional {Y_t}_{t=0}^∞ is said to be a supermartingale with respect to a filtration {F_t}_{t=0}^∞ if
\[ E\left(Y_{t+1} \mid \mathcal{F}_t\right) \le Y_t. \]

Inequality (8.7) implies

Theorem 8.5.2. For each θ, the log-likelihood ratio process

\[ \{\log L_t(\theta) - \log L_t(\theta_o) : t = 0, 1, \ldots\} \]
is an additive supermartingale under the θo probability model.

We now combine two facts: (a) that because each log-likelihood process is an additive process, each has a proposition 4.2.3 decomposition of the form (8.3), and (b) that under the θo model, a log-likelihood ratio process {log Lt(θ) − log Lt(θo): t = 0, 1, ... } is a supermartingale. It follows that a log-likelihood ratio process has an additive decomposition with a coefficient ν(θ) − ν(θo) on the linear time trend t, that this coefficient is less than or equal to zero, and that it equals zero only when θ = θo. Therefore, the parameter θo satisfies

\[ \theta_o = \operatorname*{argmax}_{\theta \in \Theta}\, \nu(\theta). \qquad (8.8) \]

The method of maximum likelihood appeals to equation (8.4) and the Law of Large Numbers to estimate the time trend of a log-likelihood by the sample average^2
\[ \hat{\nu}_t(\theta) = \frac{1}{t} \log L_t(\theta). \qquad (8.9) \]
The maximum likelihood estimator of the vector θ based on data up to date t maximizes ν̂t(θ). Heuristic justifications for this assertion come from combining insights from equations (8.4), (8.9), and (8.8).

The trend coefficient ν(θ) − ν(θo) in a proposition 4.2.3 decomposition of the log-likelihood ratio governs the large sample behavior of the likelihood ratio for discriminating statistical model θ from statistical model θo. We study this in section 8.7.

^2 Use of the Law of Large Numbers pointwise in θ is typically not sufficient to justify statistical consistency. Instead, one has to justify estimation of ν as a function of θ over an admissible parameter space.
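The next sketch, which we add for illustration, carries out the estimation strategy of equation (8.9) in a scalar version of example 8.2.4: Xt+1 = aXt + Wt+1 and Yt+1 − Yt = dXt + Wt+1 with a unit-variance shock, so that the only unknown parameter is the signal loading d. It maximizes ν̂t(d), dropping additive constants; the numbers and names are our own assumptions.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
a, d_o = 0.9, 1.2                          # illustrative true parameter values
T = 50_000

X = np.zeros(T + 1)
dY = np.zeros(T)
for t in range(T):
    w = rng.standard_normal()
    dY[t] = d_o * X[t] + w                 # signal increment driven by the same unit-variance shock
    X[t + 1] = a * X[t] + w

def nu_hat(d):
    # (1/t) log L_t(d), dropping terms that do not involve d
    return -0.5 * np.mean((dY - d * X[:-1]) ** 2)

res = minimize_scalar(lambda d: -nu_hat(d), bounds=(-5.0, 5.0), method='bounded')
print("maximum likelihood estimate of d:", res.x, "   true value:", d_o)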

8.6 Score processes

In this section, we study first-order necessary conditions for the maximum likelihood estimator. For simplicity suppose that Θ is an open interval of R containing θo. Inequality (8.6) implies that the parameter vector θo necessarily maximizes the objective
\[ \int_{\mathcal{Y}} \log \psi(y^* \mid x, \theta)\, \psi(y^* \mid x, \theta_o)\, \tau(dy^*). \qquad (8.10) \]
Suppose that we can differentiate under the integral sign in (8.10) to get the first-order condition
\[ \int_{\mathcal{Y}} \left[\frac{d}{d\theta} \log \psi(y^* \mid x, \theta_o)\right] \psi(y^* \mid x, \theta_o)\, \tau(dy^*) = 0. \qquad (8.11) \]

Definition 8.6.1. Where ℓt(θ) = log Lt(θ), the score process {St : t = 0, 1, ...} is defined as
\[ S_t = \frac{d}{d\theta} \ell_t(\theta)\Big|_{\theta_o} = \sum_{j=1}^{t} \frac{d}{d\theta} \log \psi(Y_j - Y_{j-1} \mid X_{j-1}, \theta_o) + \frac{d}{d\theta} \log L_0(\theta_o). \]

Theorem 8.6.2. E(St+1 − St | Xt) = 0, so the score process is an additive functional; more specifically, it is a martingale with increment (d/dθ) log ψ(Yj − Yj−1 | Xj−1, θo) under the θo probability model. So
\[ \frac{1}{\sqrt{t}}\, S_t \to \mathcal{N}(0, V), \]
where
\[ V = E\left[\left(\frac{d}{d\theta} \log \psi(Y_{t+1} - Y_t \mid X_t, \theta_o)\right)^2\right]. \]

Proof. This follows directly from equation (8.11) and corollary 4.7.1.^3

Theorem 8.6.2 justifies using the martingale central limit result stated in corollary 4.7.1 to characterize the large sample behavior of the score process. The associated central limit approximation yields a large sample characterization of the maximum likelihood estimator of θ in a Markov setting. Where θt maximizes the log-likelihood function ℓt(θ), the following result typically prevails under some additional regularity conditions:
\[ \sqrt{t}\,(\theta_t - \theta_o) \to \mathcal{N}(0, V^{-1}). \]
This kind of result motivates interpreting the variance of the martingale increment of the score process
\[ V = E\left[\left(\frac{d}{d\theta} \log \psi(Y_{t+1} - Y_t \mid X_t, \theta_o)\right)^2\right] \]
as a measure of the information in the data about the parameter θo; V is called the Fisher information matrix after the statistician R. A. Fisher.

^3 In the formula we use the notation a^2 to refer to the matrix aa' where a is an n × 1 vector.
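Continuing the scalar illustration used above (our own construction, not the text’s), the following fragment computes the score increments (d/dd) log ψ(Yj − Yj−1 | Xj−1, d_o) = (Yj − Yj−1 − d_o Xj−1) Xj−1, checks that their sample mean is near zero as theorem 8.6.2 requires, and estimates the Fisher information V, which for this model equals E[Xt²] = 1/(1 − a²).

import numpy as np

rng = np.random.default_rng(4)
a, d_o, T = 0.9, 1.2, 200_000              # illustrative values, as in the previous sketch

X = np.zeros(T + 1)
dY = np.zeros(T)
for t in range(T):
    w = rng.standard_normal()
    dY[t] = d_o * X[t] + w
    X[t + 1] = a * X[t] + w

# Score increments evaluated at the true parameter d_o
incr = (dY - d_o * X[:-1]) * X[:-1]

print("sample mean of score increments (martingale property, ~0):", incr.mean())
print("estimated Fisher information V = E[increment^2]           :", np.mean(incr ** 2))
print("theoretical value E[X_t^2] = 1/(1 - a^2)                   :", 1.0 / (1.0 - a ** 2))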

Nuisance parameters

Consider extending the notion of Fisher information to situations in which we want information about one parameter but to get it we have to estimate other parameters too. Assume that there is an unknown scalar parameter θo of interest and a vector θ̃o of “nuisance” parameters that are also unknown and that we must also estimate. Suppose that the likelihood is parameterized on an open set Θ in a finite dimensional Euclidean space and that the true parameter vector is (θo, θ̃o) ∈ Θ. Write the multivariate score process as
\[ \left\{ \begin{bmatrix} S_{t+1} \\ \tilde{S}_{t+1} \end{bmatrix} : t = 0, 1, \ldots \right\}, \]
where {St+1 : t = 0, 1, ...} is the partial derivative of the log-likelihood with respect to the parameter of interest θ and {S̃t+1 : t = 0, 1, ...} is the partial derivative with respect to the nuisance parameters θ̃.

Estimating θo simultaneously with θ̃o is more difficult than estimating θo conditional on knowing θ̃o. Fisher’s measure of information takes into account the additional uncertainty that comes from having to estimate θ̃o simultaneously with θo in order to make inferences about θo. In particular, Fisher’s measure of information about θo is the reciprocal of the (1, 1) component of the inverse of the covariance matrix of score increments, partitioned conformably with (θ, θ̃′)′:
\[ V = \begin{bmatrix} E(S_{t+1} - S_t)^2 & E(S_{t+1} - S_t)(\tilde{S}_{t+1} - \tilde{S}_t)' \\ E(\tilde{S}_{t+1} - \tilde{S}_t)(S_{t+1} - S_t) & E(\tilde{S}_{t+1} - \tilde{S}_t)(\tilde{S}_{t+1} - \tilde{S}_t)' \end{bmatrix}. \]

To represent the inverse of V, compute the population regression
\[ S_{t+1} - S_t = \beta'(\tilde{S}_{t+1} - \tilde{S}_t) + U_{t+1}, \qquad (8.12) \]
where β is the population regression coefficient and Ut+1 is the population regression residual that by construction is orthogonal to the regressor (S̃t+1 − S̃t). The population regression induces the representation
\[ \begin{bmatrix} S_{t+1} - S_t \\ \tilde{S}_{t+1} - \tilde{S}_t \end{bmatrix} = \begin{bmatrix} 1 & \beta' \\ 0 & I \end{bmatrix} \begin{bmatrix} U_{t+1} \\ \tilde{S}_{t+1} - \tilde{S}_t \end{bmatrix}, \]
which, because Ut+1 is orthogonal to S̃t+1 − S̃t, implies
\[ V = \begin{bmatrix} 1 & \beta' \\ 0 & I \end{bmatrix} \begin{bmatrix} E[(U_{t+1})^2] & 0 \\ 0 & E[(\tilde{S}_{t+1} - \tilde{S}_t)(\tilde{S}_{t+1} - \tilde{S}_t)'] \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta & I \end{bmatrix}. \]
Since
\[ \begin{bmatrix} 1 & 0 \\ \beta & I \end{bmatrix}^{-1} = \begin{bmatrix} 1 & 0 \\ -\beta & I \end{bmatrix} \]
and
\[ \begin{bmatrix} E[(U_{t+1})^2] & 0 \\ 0 & E[(\tilde{S}_{t+1} - \tilde{S}_t)(\tilde{S}_{t+1} - \tilde{S}_t)'] \end{bmatrix}^{-1} = \begin{bmatrix} \dfrac{1}{E[(U_{t+1})^2]} & 0 \\ 0 & \left(E[(\tilde{S}_{t+1} - \tilde{S}_t)(\tilde{S}_{t+1} - \tilde{S}_t)']\right)^{-1} \end{bmatrix}, \]
the (1, 1) component of the partition of matrix V⁻¹ is
\[ V^{-1}_{1,1} = \frac{1}{E[(U_{t+1})^2]}. \]

Thus, E[(Ut+1)²], the reciprocal of V⁻¹₁,₁, gives the Fisher information about θo in the presence of nuisance parameters.

The population regression equation (8.12) implies that the Fisher information measure E[(Ut+1)²] is no larger than E[(St+1 − St)²]:
\[ E[(U_{t+1})^2] \le E[(S_{t+1} - S_t)^2]. \qquad (8.13) \]

This inequality asserts that the likelihood function contains more information about θo when θ̃ is known to be θ̃o than when θo and θ̃o are both unknown. Inequality (8.13) offers reasons to be cautious about interpreting some calibration practices in economics. Pretending that you know θ̃o when you really don’t leads you to overstate the information that a data set contains about the parameter θo.

Since the multivariate score (St+1, S̃t+1′)′ is a vector of additive martingales, the score regression residual process {Ut+1 : t = 0, 1, ...} is itself an additive martingale.
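A small linear-algebra check of the partitioned-inverse calculation and of inequality (8.13), using an arbitrary positive definite covariance matrix of score increments that we made up for illustration.

import numpy as np

# Hypothetical covariance matrix V of (S_{t+1}-S_t, Stilde_{t+1}-Stilde_t),
# with the scalar parameter of interest first and two nuisance parameters after it.
V = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.4],
              [0.3, 0.4, 1.0]])

Vinv = np.linalg.inv(V)

# Population regression of the first score increment on the nuisance score increments
beta = np.linalg.solve(V[1:, 1:], V[1:, 0])
EU2 = V[0, 0] - V[1:, 0] @ beta            # residual variance E[(U_{t+1})^2]

print("1 / (1,1) element of V^{-1}:", 1.0 / Vinv[0, 0])     # equals E[(U_{t+1})^2]
print("residual variance E[U^2]   :", EU2)
print("inequality (8.13):", EU2, "<=", V[0, 0])              # information lost to nuisance parameters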

8.7 Limiting Behavior of Likelihood Ratios

In Chapter 1 we described the problem of selecting between two statistical models, say model θ1 and model θ2, based on observed data. As more data become available, the ex ante probability of making a mistake becomes smaller. To characterize this formally, we study the tail behavior of the likelihood ratio. We follow Chernoff (1952) and others by using large deviation theory to measure the difficulty of choosing between models θ1 and θ2. Specifically, we study the probability that the likelihood ratio exceeds a given threshold and the probability of making a mistaken model selection via a likelihood ratio test. In line with the decision-theoretic perspective described in chapter 1, the threshold is determined by prior probabilities assigned to the two models and the losses associated with misclassification. For simplicity we assign prior probabilities equal to one half, but our calculations allow for other choices of these probabilities. We use a large deviation calculation to characterize the limiting behavior of the probabilities of making type I (selecting model θ2 when model θ1 is true) and type II (selecting model θ1 when model θ2 is true) errors. Construct an additive functional defined by the log-likelihood ratio between two models, one indexed by θ1 and another by θ2:

Yt = log Lt(θ2) − log Lt(θ1).

We know that the likelihood ratio process is a multiplicative martingale under the θ1 probability measure. To study a model selection rule that chooses the model with a higher likelihood, we construct a large deviations inequality that is implied by two elementary ideas:

• Probabilities are expectations of indicator functions.

• Indicator functions are bounded above by exponential functions with positive exponents.

For a scalar α > 0, we use these two ideas to deduce the following inequalities:
\[ \frac{1}{t} \log \mathrm{Prob}\left\{\frac{1}{t} Y_t \ge 0 \,\Big|\, X_0 = x\right\} = \frac{1}{t} \log \mathrm{Prob}\left\{Y_t \ge 0 \mid X_0 = x\right\} \le \frac{1}{t} \log E\left[\exp(\alpha Y_t) \mid X_0 = x\right] = \frac{1}{t} \log E\left[(M_t)^\alpha \mid X_0 = x\right], \qquad (8.14) \]

where Mt = exp(Yt). Let η(M^α) be the exponential trend η̃ in the Proposition 7.3.2 decomposition of the multiplicative functional M_t^α. Taking limits of both sides of (8.14) as t gets large,
\[ \limsup_{t \to \infty} \frac{1}{t} \log \mathrm{Prob}\left\{\frac{1}{t} Y_t \ge 0 \,\Big|\, X_0 = x\right\} \le \eta\left(M^\alpha\right), \qquad (8.15) \]

provided that η(M^α) is well defined and finite. Since {M_t}_{t=0}^∞ is a multiplicative martingale, it follows from Jensen’s Inequality that {Y_t}_{t=0}^∞ is an additive supermartingale and that, for 0 ≤ α ≤ 1, {M_t^α}_{t=0}^∞ is a multiplicative supermartingale, and hence that

\[ \eta\left(M^\alpha\right) \le 0. \]

It can be shown that η(M^α) is convex in α. Different values of α > 0 give rise to different bounds. To get an informative bound we compute:

\[ -\varrho(M) = \min_{0 \le \alpha \le 1} \eta\left(M^\alpha\right). \qquad (8.16) \]

It follows from (8.15) and (8.16) that

\[ \limsup_{t \to \infty} \frac{1}{t} \log \mathrm{Prob}\left\{\frac{1}{t} Y_t \ge 0 \,\Big|\, X_0 = x\right\} \le -\varrho(M). \qquad (8.17) \]

Thus, a positive ϱ(M) bounds the decay rate of the probabilities on the left side of (8.17). Under some more stringent restrictions, the bound (8.17) can be shown to be sharp. In the theory of large deviations, ϱ(M) is called the rate function.

Suppose now that θ2 is the true parameter value. This induces us to use {M_t^{-1}} as the likelihood-ratio process. This process is evidently a martingale under the probability measure ψ(y*|x, θ2) implied by θ2. Following the preceding recipe, after raising M^{-1} to the power α and multiplying by M to change measure, we are led to study M^{1−α} for 0 < α < 1. Minimizing η(M^{1−α}) over α verifies that ϱ(M) is again the asymptotic decay rate of the probability of making a mistake. The identical outcomes of these two calculations imply the invariance of ϱ(M) across the two mental experiments, one that assumes that θ1 is true, the other that assumes that θ2 is true. That allowed Chernoff (1952) to interpret the rate bound ϱ(M) as a measure of statistical discrepancy between two models. While we have used zero as a threshold for the log-likelihood ratio, the analysis extends immediately to any constant threshold determined by the ratio of prior probabilities put on the two models. Both types of mistake probabilities converge to zero at an exponential rate that is independent of the threshold.

For example 8.7.1 to be described next, figure 8.4 depicts −η(M^α) and −η(M^{1−α}) as functions of α. The two functions overlap completely. Both functions are concave in α. After describing the example, we shall tell in detail how we computed −η(M^α) and −η(M^{1−α}).

Example 8.7.1. Statistical model 1 with parameter vector θ1 = (µ1, σ²) asserts that a scalar random variable has a univariate normal distribution with mean µ1 and variance σ². Statistical model 2 with parameter vector θ2 = (µ2, σ²) asserts that a scalar random variable has a univariate normal distribution with mean µ2 and variance σ². Under model 1, a draw yt from the statistical model can be represented as yt = µ1 + σWt, where Wt ∼ N(0, 1). For a sequence of independent draws yt, t = 1, 2, ..., from model 1, the log likelihood ratio Yt = log Lt(θ2) − log Lt(θ1) has increment
\[ Y_t - Y_{t-1} = -\frac{(y_t - \mu_2)^2}{2\sigma^2} + \frac{(y_t - \mu_1)^2}{2\sigma^2} = -\frac{(\mu_1 - \mu_2 + \sigma W_t)^2}{2\sigma^2} + \frac{(\sigma W_t)^2}{2\sigma^2} = -\frac{(\mu_1 - \mu_2)^2}{2\sigma^2} - \frac{(\mu_1 - \mu_2) W_t}{\sigma}. \]

The exponential trend term of the likelihood ratio itself is then exp(0) = 1, verifying a necessary condition for the likelihood ratio Mt = exp(Yt) to be a martingale.

The likelihood ratio Mt = exp(Yt) in example 8.7.1 is an instance of a multiplicative functional with a corresponding additive functional sharing the form of those from example 7.3.5. Notice that for such a multiplicative functional M to be a martingale of the example 8.7.1 form, we must have

log Mt+1 − log Mt = ν + F · Wt+1

with ν = −½F′F in order that the exponential trend term in the decomposition representation for Mt be identically unity. We want to compute E M_t^α for α ∈ [0, 1]. To do so, first multiply

\[ \log M_{t+1} - \log M_t = -\frac{1}{2} F'F + F \cdot W_{t+1} \]

\[ \alpha \log M_{t+1} - \alpha \log M_t = -\frac{\alpha}{2} F'F + \alpha F \cdot W_{t+1}, \]

the right side of which is the increment in the logarithm of M_{t+1}^α that we seek. It follows that the (negative) growth rate of M_{t+1}^α is ((α² − α)/2) F′F. The decay rate
\[ -\frac{(\alpha^2 - \alpha)}{2} F'F \]
is symmetric about its maximum value equal to (1/8) F′F that is attained at α = 1/2, as depicted in figure 8.4. Thus, the graphs of the exponential decay rates of M_t^α and M_t^{1−α} overlap completely.
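For concreteness, a few lines of Python (an illustration we add, with µ1, µ2, σ chosen arbitrarily) that evaluate η(M^α) = ((α² − α)/2) F′F for the likelihood ratio of example 8.7.1, minimize it over α ∈ [0, 1], and confirm that the Chernoff rate ϱ(M) equals F′F/8 and is attained at α = 1/2.

import numpy as np

mu1, mu2, sigma = 0.0, 1.0, 1.0            # illustrative parameter values, our own choice
FF = ((mu1 - mu2) / sigma) ** 2            # F'F for the log likelihood ratio increment

alphas = np.linspace(0.0, 1.0, 101)
eta = 0.5 * (alphas ** 2 - alphas) * FF    # growth rate of M_t^alpha, nonpositive on [0, 1]

rho = -eta.min()                           # Chernoff rate rho(M) from (8.16)
print("minimizing alpha    :", alphas[np.argmin(eta)])      # 0.5
print("Chernoff rate rho(M):", rho, "   F'F/8 =", FF / 8)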

Example for Dongchen

Recall example 8.7.1. Note that for a sequence of independent draws yt, t = 1, 2, ..., from model 2, the log likelihood ratio Ỹt = log Lt(θ1) − log Lt(θ2) has increment

\[ \tilde{Y}_t - \tilde{Y}_{t-1} = -\frac{(\mu_2 - \mu_1)^2}{2\sigma^2} - \frac{(\mu_2 - \mu_1) W_t}{\sigma}. \]

Figure 8.4: Chernoff bound. The functions −η(M^α) and −η(M^{1−α}) on α ∈ [0, 1] for the likelihood ratio Mt described in example 8.7.1.

For convenience, assume that σ = 1 and consider the two log-likelihood ratios
\[ Y_T = T^{-1} \sum_{t=1}^{T} \left[-\frac{(\mu_1 - \mu_2)^2}{2} - (\mu_1 - \mu_2) W_t\right], \qquad \tilde{Y}_T = T^{-1} \sum_{t=1}^{T} \left[-\frac{(\mu_2 - \mu_1)^2}{2} - (\mu_2 - \mu_1) W_t\right], \]
where in each case {Wt} is a sequence of independent draws from a univariate normal distribution with mean 0 and variance 1.

A likelihood ratio test selects model θ1 when YT < 0 and model θ2 when YT > 0. To calculate the probability that a likelihood ratio test makes a mistake, notice that
\[ Y_T \sim \mathcal{N}\left(-\frac{(\mu_1 - \mu_2)^2}{2},\; T^{-1}(\mu_1 - \mu_2)^2\right), \qquad \tilde{Y}_T \sim \mathcal{N}\left(-\frac{(\mu_2 - \mu_1)^2}{2},\; T^{-1}(\mu_2 - \mu_1)^2\right). \]

Therefore, when the data are generated by statistical model θ1, the probability that YT is greater than zero and that the likelihood ratio test makes a mistake is
\[ e_1 = 1 - \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{(\mu_1 - \mu_2)^2 / 2}{\sqrt{2\, T^{-1} (\mu_1 - \mu_2)^2}}\right)\right], \]
where erf is the “error function” that appears in the formula for the cumulative distribution function for a univariate normal random variable. When the data are generated by statistical model θ2, the probability that ỸT is greater than zero and that therefore the likelihood ratio test makes a mistake is
\[ e_2 = 1 - \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{(\mu_2 - \mu_1)^2 / 2}{\sqrt{2\, T^{-1} (\mu_2 - \mu_1)^2}}\right)\right]. \]

Instructions to Dongchen:

• Please check the above calculations leading to e1 and e2.

• Please write a program to compute e1 for various values of T , with inputs µ1, µ2 as parameters.

• For given µ1 ≠ µ2, please plot log e1 for various values of T, pushing T big.

• Give the program the ability to compare the rate at which log e1 declines with T with the value 1/8; we can fine-tune how we characterize the slope of log e1 against T once we see what it looks like. Maybe we’ll run a regression line.
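A minimal sketch along the lines requested above, which we add here as a starting point rather than a finished program: it computes e1 from the normal tail probability (equivalent to the erf expression), checks the erf formula at a moderate T, evaluates log e1 for large T with a numerically stable log survival function, and compares its slope against T with the Chernoff rate −(µ1 − µ2)²/8; the parameter values are placeholders.

import numpy as np
from scipy.special import erf
from scipy.stats import norm

def e1(T, mu1, mu2):
    """Mistake probability e1 = Prob(Y_T > 0) under model theta_1, with sigma = 1."""
    d2 = (mu1 - mu2) ** 2
    return norm.sf(0.0, loc=-d2 / 2.0, scale=np.sqrt(d2 / T))

def log_e1(T, mu1, mu2):
    """log e1 via the log survival function, stable even when e1 underflows."""
    d2 = (mu1 - mu2) ** 2
    return norm.logsf(0.0, loc=-d2 / 2.0, scale=np.sqrt(d2 / T))

mu1, mu2 = 0.0, 1.0                        # placeholder inputs; here (mu1 - mu2)^2 / 8 = 1/8

# Check the erf expression in the text at a moderate sample size.
T_check, d2 = 100, (mu1 - mu2) ** 2
via_erf = 1.0 - 0.5 * (1.0 + erf((d2 / 2.0) / np.sqrt(2.0 * d2 / T_check)))
print("erf formula vs direct tail probability:", via_erf, e1(T_check, mu1, mu2))

# Push T big and compare the slope of log e1 against T with -(mu1 - mu2)^2 / 8.
Ts = np.arange(100, 20001, 100)
slope = np.polyfit(Ts, [log_e1(T, mu1, mu2) for T in Ts], 1)[0]
print("regression slope of log e1 on T:", slope)
print("Chernoff rate prediction       :", -(mu1 - mu2) ** 2 / 8)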

8.8 More Properties of Likelihood Ratio Processes

Let {M̃t} be a multiplicative martingale like that defined in section 7.2.2. If we set EM̃0 = 1, then {M̃t} is a likelihood ratio that we can use to construct an alternative probability measure.

Proposition 8.8.1. Let Bt be a bounded random variable in the Ft information set. Let B̂t+τ be a bounded random variable in the Ft+τ information set. We can compute mathematical expectations Ẽ under an alternative probability model associated with a likelihood ratio process {M̃t} as follows:
\[ \widetilde{E}(B_t) = E\left(\widetilde{M}_t B_t\right), \]
\[ \widetilde{E}\left(\widehat{B}_{t+\tau} \mid \mathcal{F}_t\right) = E\left[\left(\frac{\widetilde{M}_{t+\tau}}{\widetilde{M}_t}\right) \widehat{B}_{t+\tau} \,\Big|\, \mathcal{F}_t\right]. \]

Proof. We must verify that probabilities assigned to date t events are consistent with those assigned to date t + τ events for τ ≥ 0. We do this as follows. Because Bt is also in Ft+τ, we must verify that
\[ E\left(\widetilde{M}_{t+\tau} B_t\right) = \widetilde{E}(B_t) = E\left(\widetilde{M}_t B_t\right). \qquad (8.18) \]

To accomplish this, apply the martingale property of {M̃t} to deduce
\[ E\left(\widetilde{M}_{t+\tau} B_t \mid \mathcal{F}_t\right) = E\left(\widetilde{M}_{t+\tau} \mid \mathcal{F}_t\right) B_t = \widetilde{M}_t B_t. \qquad (8.19) \]
Note that by the Law of Iterated Expectations,
\[ E\left(\widetilde{M}_{t+\tau} B_t\right) = E\left[E\left(\widetilde{M}_{t+\tau} B_t \mid \mathcal{F}_t\right)\right]. \]

Take unconditional expectations in (8.19) to confirm the second equality in equation (8.18).

To represent conditional expectation operators under the alternative probability measure associated with martingale {M̃t}, note that
\[ E\left(\widetilde{M}_{t+\tau} \widehat{B}_{t+\tau} B_t \mid \mathcal{F}_t\right) = E\left[\left(\frac{\widetilde{M}_{t+\tau}}{\widetilde{M}_t}\right) \widehat{B}_{t+\tau} \,\Big|\, \mathcal{F}_t\right] \widetilde{M}_t B_t, \]
which holds for any bounded random variable Bt in Ft. Therefore,
\[ \widetilde{E}\left(\widehat{B}_{t+\tau} \mid \mathcal{F}_t\right) = E\left[\left(\frac{\widetilde{M}_{t+\tau}}{\widetilde{M}_t}\right) \widehat{B}_{t+\tau} \,\Big|\, \mathcal{F}_t\right]. \]

To deduce a one-period transition probability implied by a likelihood ratio process {M̃_t}_{t=0}^∞, write:

\[ \frac{\widetilde{M}_{t+1}}{\widetilde{M}_t} = \exp\left[\tilde{\kappa}(X_t, W_{t+1})\right], \]
an object that in multiplicative functional 7.2.2 we assumed to have conditional expectation equal to unity. The function

\[ \exp\left[\tilde{\kappa}(x, w)\right] \]
gives the density for Wt+1 under the ( ˜· ) probability measure conditioned on Xt = x relative to the original conditional distribution for Wt+1.

Remark 8.8.2. Under the change of measure associated with the increment to the multiplicative martingale {M̃t}, the shock Wt+1 typically does not have conditional mean zero, but the conditional distribution for Wt+1 conditioned on Ft continues to depend only on Xt and the process {Xt} continues to be Markovian.

Peculiar Large Sample Property

Likelihood ratio processes {M̃t} have a peculiar sample path property.

Result 8.8.3. {M̃j : j = 1, 2, ...} converges almost surely to zero although EM̃j = 1 for each j.

Proof. From Jensen’s Inequality we know that

\[ E\left[\tilde{\kappa}(X_t, W_{t+1}) \mid X_t = x\right] \le 0. \]

Given Xt, if κ̃(Xt, Wt+1) cannot be predicted perfectly with probability one, then

\[ E\left[\tilde{\kappa}(X_t, W_{t+1})\right] < 0. \]
The Law of Large Numbers implies that (1/j) log M̃j converges almost surely to E[κ̃(Xt, Wt+1)] < 0. Therefore, {M̃j} converges almost surely to zero.

Figure 8.5 shows 100 sample paths along with 98 percent probability coverage regions for a multiplicative martingale {M̃t} associated with a log-normal multiplicative functional {Mt} that is an instance of the process described in example 7.3.5.
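The simulation behind a figure like 8.5 is easy to reproduce; the sketch below (our own illustration, with an arbitrarily chosen volatility) generates many paths of a log-normal multiplicative martingale and prints its cross-sectional mean, median, and .99 quantile at several dates, showing the mean anchored at one while the typical path decays toward zero, as result 8.8.3 predicts.

import numpy as np

rng = np.random.default_rng(5)
sigma, T, n_paths = 0.1, 400, 10_000       # illustrative volatility, horizon, number of paths

# log M increments: -sigma^2/2 + sigma * W, so that E[M_{t+1}/M_t | F_t] = 1
increments = -0.5 * sigma ** 2 + sigma * rng.standard_normal((n_paths, T))
M = np.exp(np.cumsum(increments, axis=1))

for t in (100, 200, 400):
    col = M[:, t - 1]
    print(f"t = {t}: mean = {col.mean():.3f}, median = {np.median(col):.3f}, "
          f".99 quantile = {np.quantile(col, 0.99):.3f}")
# The sample mean stays near 1 while the median, and eventually almost every path, heads toward zero.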

8.9 Robustness interpretation

The risk-sensitivity operator is the indirect utility function that emerges from the following minimization problem:
\[ \min_{m \ge 0,\; E(m \mid \mathcal{F}_t) = 1} E\left(m \log V_{t+1} \mid \mathcal{F}_t\right) + \frac{1}{\gamma - 1} E\left(m \log m \mid \mathcal{F}_t\right) = \frac{1}{1 - \gamma} \log E\left(\exp\left[(1 - \gamma) \log V_{t+1}\right] \mid \mathcal{F}_t\right), \]

Figure 8.5: Peculiar sample path property for a multiplicative martingale with EM̃t = 1 for all t. The shaded region depicts values between the .01 and .99 quantiles of the probability densities for M̃t for different values of t. [The underlying plot, titled “Martingale components for many paths of Y1,” shows the sample paths against time.]

where γ > 1. Here m is a relative probability density that distorts the one-period transition density between t and t + 1. The term E[m log m|Ft] is the conditional relative entropy of this distortion;^4 1/(γ − 1) parameterizes a relative entropy penalty. Large values of 1/(γ − 1) impose a big penalty on exploring alternative models. The minimizing m is
\[ \check{m}_{t+1} = \frac{\exp\left[(1 - \gamma) \log V_{t+1}\right]}{E\left(\exp\left[(1 - \gamma) \log V_{t+1}\right] \mid \mathcal{F}_t\right)}; \]

^4 Relative entropy is a concept that is used in a variety of literatures in applied mathematics.

Multiplying the conditional density of Wt+1 by m̌t+1 exponentially tilts it in directions that adversely affect the continuation value. By cumulatively multiplying successive m̌t+1’s, we can construct a likelihood ratio process for an implied worst-case model. Consider the value function established in Proposition 4.5.2:

log Vt − log Ct = U · Xt + u

where
\[ X_{t+1} = AX_t + BW_{t+1} \]
and
\[ \log C_{t+1} - \log C_t = \nu + D \cdot X_t + F \cdot W_{t+1}, \]
where Wt+1 is normally distributed with mean zero and covariance matrix I. Notice that

\[ \log V_{t+1} = U'AX_t + U'BW_{t+1} + \log C_t + \nu + D'X_t + F'W_{t+1}. \]

It can be shown that m̌t+1 is a likelihood ratio that transforms the distribution of Wt+1 to be normal with mean

\[ \mu = (1 - \gamma)\left(B'U + F\right) \]
and covariance matrix I. To verify this conclusion, recall that a standard normally distributed random vector has a density that is proportional to
\[ \exp\left(-\frac{1}{2} w'w\right). \]
Multiply this by exp(µ · w) to obtain a twisted density that is proportional to
\[ \exp\left(-\frac{1}{2} w'w + \mu \cdot w\right), \]
or equivalently to
\[ \exp\left(-\frac{1}{2}(w - \mu)'(w - \mu)\right). \]

This implies that, after the change of probability measure, Wt+1 is normally distributed with mean µ and covariance I.
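The exponential-tilting argument can be checked numerically; the fragment below, an illustration we add with an arbitrary tilt vector µ, reweights draws from N(0, I) by exp(µ · w) and confirms that the reweighted distribution has mean µ and covariance I.

import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.4, -0.2])                 # illustrative tilt vector, our own choice
n = 1_000_000

W = rng.standard_normal((n, 2))            # draws from the original N(0, I) distribution
weights = np.exp(W @ mu)                   # unnormalized density ratio exp(mu . w)
weights /= weights.mean()                  # normalize so the weights average to one

tilted_mean = (weights[:, None] * W).mean(axis=0)
tilted_cov = np.cov(W.T, aweights=weights)

print("tilted mean (should be near mu):", tilted_mean)
print("tilted covariance (should be near the identity):")
print(tilted_cov)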