Large Deviations and Applications for Markovian Hawkes Processes with a Large Initial Intensity

Submitted to Bernoulli

Large deviations and applications for Markovian Hawkes processes with a large initial intensity

XUEFENG GAO1 and LINGJIONG ZHU2 1Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T. Hong Kong. E-mail: [email protected] 2Department of Mathematics, Florida State University, 1017 Academic Way, Tallahassee, FL-32306, United States of America. E-mail: [email protected]

Hawkes process is a class of simple point processes that is self-exciting and has clustering effect. The intensity of this point process depends on its entire past history. It has wide applications in finance, insurance, neuroscience, social networks, criminology, seismology, and many other fields. In this paper, we study linear Hawkes process with an exponential kernel in the asymptotic regime where the initial intensity of the Hawkes process is large. We establish large deviations for Hawkes processes in this regime as well as the regime when both the initial intensity and the time are large. We illustrate the strength of our results by discussing the applications to insurance and queueing systems.

1. Introduction

−∞ Let N be a simple point process on R and let Ft := σ(N(C),C ∈ B(R),C ⊂ (−∞, t]) −∞ be an increasing family of σ-algebras. Any nonnegative Ft -progressively measurable process λt with

"Z b # −∞ −∞ E N(a, b]|Fa = E λsds Fa , almost surely, a

−∞ for all intervals (a, b] is called an Ft -intensity of N. We use the notation Nt := N(0, t] to denote the number of points in the interval (0, t]. −∞ A Hawkes process is a simple point process N admitting an Ft -intensity Z t− λt := λ φ(t − s)dNs , (1.1) −∞ where λ(·): R+ → R+ is locally integrable, left continuous, φ(·): R+ → R+ and we R ∞ R t− always assume that kφkL1 = φ(t)dt < ∞. In (1.1), φ(t − s)dNs stands for P 0 −∞ τ

φ(·) and λ(·) are usually referred to as exciting function (or sometimes kernel function) and rate function respectively. A Hawkes process is linear if λ(·) is linear and it is nonlinear otherwise. The linear Hawkes process was first introduced by A.G. Hawkes in 1971 [17, 18]. It naturally generalizes the Poisson process and it captures both the self–exciting1 property and the clustering effect. In addition, Hawkes process is a very versatile model which is amenable to statistical analysis. These explain why it has wide applications in insurance, finance, social networks, neuroscience, criminology and many other fields. For a list of references, we refer to [32]. Throughout this paper, we assume an exponential exciting function φ(t) := αe−βt where α, β > 0, and a linear rate function λ(z) := µ + z where the base intensity µ ≥ 0. That is, we restrict ourselves to the linear Markovian Hawkes process. To see the Markov property, we define

Z t Z t −β(t−s) −βt −β(t−s) Zt := αe dNs = Z0 · e + αe dNs. −∞ 0 Then, the process Z is Markovian and satisﬁes the dynamics:

dZt = −βZtdt + αdNt, where N is a Hawkes process with intensity λt = µ + Zt− at time t. In addition, the pair (Z,N) is also Markovian. For simplicity, we also assume Z0 = Z0−, i.e., there is no jump at time zero. + In this paper we consider an asymptotic regime where Z0 = n, and n ∈ R is sent to infinity. This implies the initial intensity λ0 = µ + Z0 is large for fixed µ. Our main contribution is to provide the large deviations analysis of Markovian Hawkes processes in this asymptotic regime as well as the regime when both Z0 and the time are large. The rate functions are found explicitly. Our large deviations analysis here complement our previous results in [14], where we establish functional law of large numbers and functional central limit theorems for Markovian Hawkes processes in the same asymptotic regimes. For simplicity, the discussions in our paper are restricted to the case when the exciting function φ is exponential, that is the Markovian case. Indeed, all the results can be extended to the case when the exciting function φ is a sum of exponential functions. And for the non–Markovian case, we know that any continuous and integrable function φ can be approximated by a sum of exponential functions, see e.g. [37]. In this respect, the Markovian setting in this paper is not too restrictive. From the application point of view, the exponential exciting function and thus the Markovian case, together with the linear rate function, is the most widely used due to the tractability of the theoretical analysis as well as the simulations and calibrations. See, e.g., [1, 2, 7, 17] and the references therein. To illustrate the strength of our results, we apply them to two examples. In the first example, we develop approximations for finite–horizon ruin probabilities in the insurance

1Self–exciting refers the phenomenon that the occurrence of one event increases the probability of the occurrence of further events.

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 3 setting where claim arrivals are modeled by Hawkes processes. Here, the initial arrival rate of claims could be high, say, right after a catastrophe event. In the second example, we rely on our large deviations results to approximate the loss probability in a multi– server queueing system where the traffic input is given by a Hawkes process with a large initial intensity. Such a queueing system could be relevant for modeling large scale service systems (e.g., server farms with thousands of servers) with high–volume traffic which exhibits clustering. We now explain the difference between our work and the existing literature on limit theorems of Hawkes processes, especially the large deviations. The large–time large deviations of Hawkes processes have been extensively studied in the literature, that is the large deviation principle for P(Nt/t ∈ ·) as t → ∞. Bordenave and Torrisi [6] derived the large deviations when λ(·) is linear and obtained a closed-form formula for the rate function. When λ(·) is nonlinear, the lack of immigration-birth representation ([17]) makes the study of large deviations much more challenging mathematically. In the case when φ(·) is exponential, the large deviations were obtained in Zhu [37] by using the Markovian property, and λ(·) is assumed to be sublinear so that a delicate application of minmax theorem can match the lower and upper bounds. For the general non-Markovian case, i.e., general φ(·), the large deviations was obtained at the process-level in Zhu [38]. The large deviations for extensions of Hawkes processes have also been studied in the literature, see e.g. Karabash and Zhu [24] for the linear marked Hawkes process, and Zhu [35] for the Cox–Ingersoll-Ross process with Hawkes jumps and also Zhang et al. [31] for affine point processes. Other than the large deviations, the central limit theorems for linear, nonlinear and extensions of Hawkes processes have been considered in, e.g., [4, 36, 35]. Recently, Torrisi [27, 28] studied the rate of convergence in the Gaussian and Poisson approximations of the simple point processes with stochastic intensity, which includes as a special case, the nonlinear Hawkes process. The moderate deviations for linear Hawkes processes were obtained in Zhu [34], that fills in the gap between the central limit theorem and large deviations. Also, the large–time limit theorems for nearly unstable, or nearly critical Hawkes processes have been considered in Jaisson and Rosenbaum [21, 22]. The large–time asymptotics for other regimes are referred to Zhu [32]. The limit theorems considered in Bacry et al. [4] hold for the multidimensional Hawkes process. Indeed, one can also consider the large dimensional asymptotics for the Hawkes process, that is, mean-field limit, see e.g. Delattre et al. [9]. We organize our paper as follows. In Section 2, we will state the main theoretical results in our paper, i.e., the large deviations for the linear Markovian Hawkes processes with a large initial intensity. We will then discuss the applications of our results to two examples in Section 3. We prove Theorems 1 and 2 in Section 4. Technical proofs for additional results will be presented in the online appendix due to space considerations.

2. Main results

In this section we state our main results. First, let us introduce the notation that will be used throughout the paper and introduce the deﬁnition and the contraction principle in

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 4 Gao and Zhu the large deviations theory that will be used repeatedly in the paper.

2.1. Notation and background of large deviations theory

+ We define R = {x ∈ R : x > 0} and R≥0 = {x ∈ R : x ≥ 0}. We fix T > 0 throughout this paper. Let us first define the following spaces:

• D[0,T ] is defined as the space of càdlàgfunctions from [0,T ] to R≥0. •AC x[0,T ] is defined as the space of absolutely continuous functions from [0,T ] to R≥0 that starts at x at time 0. + •AC x [0,T ] is defined as the space that consists of all the non-decreasing functions f ∈ ACx[0,T ].

We also deﬁne B(x) as the Euclidean ball centered at x with radius > 0. Before we proceed, let us give a formal deﬁnition of the large deviation principle and state the contraction principle. We refer readers to Dembo and Zeitouni [10] or Varadhan [29] for general background of large deviations and the applications.

A sequence (Pn)n∈N of probability measures on a topological space X satisfies the large deviation principle with the speed an and the rate function I : X → [0, ∞] if I lower semicontinuous and for any measurable set A, we have 1 1 − inf I(x) ≤ lim inf log Pn(A) ≤ lim sup log Pn(A) ≤ − inf I(x). x∈Ao n→∞ an n→∞ an x∈A Here, Ao is the interior of A and A is its closure. The rate function I is said to be good if for any m, the level set {x : I(x) ≤ m} is compact. The contraction principle concerns the behavior of large deviation principle under continuous mapping from one space to another. It states that if (Pn)n∈N satisfies a large deviation principle on X with a good rate function I(·), and F is a continuous mapping −1 from the Polish space X to another Polish space Y , then the family Qn = PnF satisfies a large deviation principle on Y with a good rate function J(·) given by J(y) = inf I(x). x:F (x)=y

2.2. Large deviation analysis for large initial intensity

In this section we state a set of results on large deviations behavior of Markovian Hawkes processes when Z0 = n is sent to inﬁnity. Note that processes Z and N both depend on n n the initial condition Z0 = n and we use Z ,N to emphasize the dependence on Z0 = n. We consider the process Zn ﬁrst.

1 n Theorem 1. P n Zt , 0 ≤ t ≤ T ∈ · satisﬁes a sample-path large deviation principle on D[0,T ] equipped with uniform topology with the speed n and the good rate function Z T βg(t) + g0(t) βg(t) + g0(t) βg(t) + g0(t) IZ (g) = log − − g(t) dt, (2.1) 0 α αg(t) α

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 5

0 1 n if g ∈ AC1[0,T ] and g ≥ −βg, and IZ (g) = ∞ otherwise. Moreover, P( n ZT ∈ ·) satisﬁes a scalar large deviation principle on R+ with the good rate function

J(x; T ) = inf IZ (g) (2.2) g(T )=x = sup {θx − A(T ; θ)} . (2.3) θ∈R where A(t; θ) satisﬁes the ODE (Ordinary Diﬀerential Equation):

A0(t; θ) = −βA(t; θ) + eαA(t;θ) − 1, (2.4) A(0; θ) = θ. (2.5)

Four remarks are in order. (α−β)t (a) When g(t) = e for t ∈ [0,T ], one immediately verifies from (2.1) that IZ (g) = 1 n 0. This is consistent with the functional law of large numbers for n Zt , 0 ≤ t ≤ T in [14]. 0 n n −βt −βt (b) Note that g (t) = −βg(t) for any 0 ≤ t ≤ T corresponds to Zt = Z0 e = ne n n for any 0 ≤ t ≤ T , which is equivalent to NT = 0. We can compute that P(NT = R T −βt n − 0 (µ+ne )dt 1 n −βt 0|Z0 = n) = e , which gives − limn→∞ n log P(Zt = ne , 0 ≤ t ≤ R T −βt R T −βt 0 T ) = 0 e dt which is consistent with IZ (g) = 0 e dt for g (t) = −βg(t) for any 0 ≤ t ≤ T . (c) We have used A(t; θ) instead of A(t) to emphasize that A takes value θ at time zero, and the derivative in (2.4) is taken with respect to t. (d) We have two equivalent expressions for the rate function J: the first expression (2.2) is directly implied by the sample–path large deviation principle together with the contraction principle, and the second expression (2.3) is obtained via Gärtner– Ellis Theorem. See Section 4 for more details. In general, there are no analytical formulas for A and the rate function J. But one can easily numerically solve the ODE for A (e.g., Runge–Kutta methods) and then solve the optimization problem in (2.3) to obtain the rate function J. An illustrative example is given in Figure 1. 1 n Next we proceed to state a large deviation principle for P n Nt , 0 ≤ t ≤ T ∈ · . To gain some intuition about the result, we note that

dZt = −βZtdt + αdNt, which implies that Z t Zt − Z0 β Nt = + Zsds. α α 0

Given Z0 = n, equivalently we have

n Z t n 1 n 1 Zt β Zs Nt = · − 1 + ds. (2.6) n α n α 0 n

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 6 Gao and Zhu

3.5 1.8

1.6 3

1.4

2.5 1.2

1 2

0.8

J(x;T) J(x;T) 1.5 0.6

0.4 1

0.2 0.5

0 -0.2 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 x T (a) J(x; T ) as a function of x. T = 5 is ﬁxed. (b) J(x; T ) as a function of T . x = 3 is ﬁxed.

Figure 1: This ﬁgure plots the rate function J(x; T ) in (2.3). The parameters are given by: α = β = 1.

Now if we deﬁne for t ∈ [0,T ],

g(t) − 1 β Z t h(t) = + g(s)ds, α α 0 then one readily veriﬁes that the map g 7→ h is a continuous map from D[0,T ] to D[0,T ] under the uniform topology. Therefore, by Theorem 1 and the contraction principle, we can obtain the following result. The details of the proof is left to Section 4.

1 n Theorem 2. P n Nt , 0 ≤ t ≤ T ∈ · satisﬁes a sample-path large deviation principle on D[0,T ] equipped with uniform topology with the speed n and the good rate function

Z T 0 ! 0 h (t) IN (h) = h (t) log (2.7) −βt −βt R t βs 0 0 e + e 0 αe h (s)ds Z t − h0(t) − e−βt − e−βt αeβsh0(s)ds dt, 0

+ n if h ∈ AC0 [0,T ], and IN (h) = ∞ otherwise. Moreover, P(NT /n ∈ ·) satisﬁes a scalar large deviation principle on R≥0 with the good rate function

H(x; T ) = inf IN (h) (2.8) h:h(T )=x θ θ = sup θx − C T ; + , (2.9) θ∈R α α

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 7

θ where C(t; α ) solves the ODE 0 θ θ α·C t; θ βθ C t; = −βC t; + e ( α ) − 1 + , (2.10) α α α θ θ C 0; = . (2.11) α α

Four remarks are in order. R T 0 −βt −βt R t βs 0 (a) Notice that IN (h) = 0 R(h (t), e + e 0 αe h (s)ds)dt, where R(x, y) := x x log y − x + y, for any x, y > 0. It is easy to see that R(x, y) ≥ 0 and 0 R(x, y) = 0 if and only if x = y. Therefore IN (h) = 0 if and only if h (t) = −βt −βt R t βs 0 0 e + e 0 αe h (s)ds for any 0 ≤ t ≤ T . Together with h (0) = 1, we get 0 (α−β)t R t (α−β)s h (t) = e . With the initial condition h(0) = 0, we get h(t) = 0 e ds = t e(α−β)t−1 if α = β and α−β if α 6= β. This is consistent with the functional law of large 1 n numbers for n Nt , 0 ≤ t ≤ T in [14]. n n n (b) Note that h ≡ 0 corresponds to NT = 0. We can compute that P(NT = 0|Z0 = n) = R T −βt T − 0 (µ+ne )dt 1 n n R −βt e , which gives − limn→∞ n log P(NT = 0|Z0 = n) = 0 e dt, R T −βt which is consistent with IN (h) = 0 e dt for h ≡ 0. θ (c) Similar as in Theorem 1, we use C(t; α ) instead of C(t) to emphasize that C takes θ value α at time zero. The derivative in (2.10) is taken with respect to t. (d) Similar as in Theorem 1, we have two equivalent expressions for the rate function H. In general, there is no analytical formula for H. But one can easily numerically solve the ODE for C (e.g., Runge–Kutta methods) and then solve the optimization problem in (2.9) to obtain the rate function H. An illustrative example is given in Figure 2.

2.2.1. Most likely paths In this section, we compute the most likely paths to rare events for Hawkes processes with large initial intensities. More precisely, we are interested to ﬁnd the minimizer to the variational problems in (2.2) and (2.8). + 2 Fix x ∈ R . Let θ∗ be the unique maximizer to the optimization problem (2.3).

Proposition 3. The minimizer to the variational problem (2.2) is given by Z t αA(s;θ∗) g∗(t) = exp αe ds − βt , (2.12) 0 for 0 ≤ t ≤ T , where A(s; θ∗) solves the ODE (2.4) with an initial condition A(0; θ∗) = θ∗.

2 1 θZT It will be clear from the Proof of Theorem 1 that A(T ; θ) = limn→∞ n log E[e |Z0 = n] if the limit exists. So one readily veriﬁes that A(T ; θ) is convex in θ, and in fact strictly convex in θ from (2.4). Hence, there is a unique optimal θ∗ for the optimization problem (2.3).

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 8 Gao and Zhu

0.12 10

0.1 8

7 0.08

0.06

H(x;T) H(x;T)

4 0.04

2 0.02

0 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 x T (a) H(x; T ) as a function of x. T = 5 is ﬁxed. (b) H(x; T ) as a function of T . x = 5 is ﬁxed.

Figure 2: This ﬁgure plots the rate function H(x; T ) in (2.9). The parameters are given by: α = β = 1.

ˆ Next we consider the variational problem (2.8). Let θ∗ be the unique maximizer to the optimization problem (2.9).3

Proposition 4. The minimizer to the variational problem (2.8) is given by ! ! Z t ˆ Z s ˆ θ∗ αC(T −u; θ∗ ) h∗(t) = exp α · C T − s; + α e α du − βs ds, (2.13) 0 α 0

θˆ∗ for any 0 ≤ t ≤ T , where C(s; α ) solves the ODE (2.10) with the initial condition θˆ∗ θˆ∗ C(0; α ) = α .

The proofs of these two propositions are deferred to the online appendix.

2.3. Large deviation analysis for large initial intensity and large time

This section is devoted to a set of results on large deviations behavior of Markovian Hawkes processes in the asymptotic regime where both Z0 = n and the time go to inﬁnity. The proofs of these results are deferred to the online appendix due to space considerations.

3 θ θ 1 θNT It will be clear from the Proof of Theorem 2 that C(T ; α ) − α = limn→∞ n log E[e |Z0 = n] is always convex in θ if the limit exists. Indeed, from the ODE (2.10), the limit must be strictly convex. Hence, there is a unique optimal θˆ∗ for the optimization problem (2.9).

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 9

When the time is sent to infinity, Hawkes processes behave differently depending on the value of kφkL1 (see, e.g., Zhu [32]). In our case, the exciting function is exponential: φ(t) = αe−βt. So we have the following three different cases: (1) critical: α = β; (2) super–critical: α > β; and (3) sub–critical: α < β. We study each case separately.

2.3.1. Critical case We ﬁrst consider the critical case, i.e., α = β > 0.

Theorem 5. Assume that α = β > 0. Let tn be a positive sequence that goes to infinity tn as n → ∞ and limn→∞ n = 0. Zn tnT (i) For any T > 0, P( n ∈ ·) satisfies a large deviation principle on R with the speed n and the rate function tn √ 2( x − 1)2 Iˆ (x) = , if x ≥ 0, Z α2T and +∞ otherwise. N n (ii) For any T > 0, ( tnT ∈ ·) satisfies a large deviation principle on with the P ntn R speed n and rate function tn IˆN (x) = sup {θx − Λ(θ)} , θ∈R where  √ √ −2θ tanh −√α −θT if θ ≤ 0,  α 2 Λ(θ) = √ √ 2θ tan √α θT if θ > 0.  α 2

The proof of this result relies on G¨artner-Ellistheorem and Gronwall’s inequality for nonlinear ODEs (see, e.g., [11, Theorem 42]) which arise from the characterization of the moment generating functions of Zt and Nt.

2.3.2. Super–critical case We next state the result for the super–critical case where α > β > 0. Below, we use the convention that ∞ · 0 = 0.

log n Theorem 6. Assume that α > β > 0 and 0 < T < 1. Let tn = α−β . Then, Zn tnT + T (i) P( n1+T ∈ ·) satisfies a large deviation principle on R with the speed n and the rate function I˜Z (x) = 0·1x=1 + ∞ · 1x6=1. N n tnT T (ii) P( n1+T ∈ ·) satisfies a large deviation principle on R≥0 with the speed n and the rate function IÑ (x) = 0·1 1 + ∞ · 1 1 . x= α−β x6= α−β

We remark that the sequence {tn} in Theorem 6 can be taken to be more general. We choose this particular {tn} for the simplicity of notation. Notice that when Z0 = n → ∞,

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 10 Gao and Zhu the initial intensity is µ + n which is of the same order as n, and assuming µ = 0, we can easily compute that [Zn] = ne(α−β)t. Thus choosing t = log n gives [Zn ] = n1+T , E t n α−β E tnT which is notation-wise concise.

2.3.3. Sub–critical case Finally, we state the large deviations results for the sub–critical case, i.e., β > α > 0. Given Z0 = z where z is a ﬁxed constant and under the assumption β > α > 0, it is well Nt µ Nt known that as t → ∞, → α almost surely and P( ∈ ·) satisﬁes a large deviation t 1− β t n NnT principle, see e.g. [6]. So for Z0 = n, it is natural to study the large deviations for n .

n NnT Theorem 7. Assume that β > α > 0. For any T > 0, P( n ∈ ·) satisﬁes a scalar large deviation principle on R with the speed n and the rate function βx αx + 1 + µβT I(x) = x log − x + , (2.14) αx + 1 + µβT β for x ≥ 0 and I(x) = +∞ otherwise.

The proof of this result rely on G¨artner-Ellistheorem and asymptotic behavior of the solutions of certain nonlinear ODEs which arise from the characterization of the moment generating function of Nt.

Remark 8. We discuss the connections with existing results on large–time large deviations of Hawkes processes here. Since the dependence on the initial condition should be self-evident here, we omit the superscript n for the processes Z and N. As we have (0) (1) (0) discussed in [14], when Z0 = n, we can decompose Nt = Nt + Nt , where N is a simple point process with intensity Z(0), where

(0) (0) (0) dZt = −βZt dt + αdNt ,

(0) (1) with Z0 = n and N is a simple point process with intensity Z t (1) −β(t−s) (1) λt := µ + αe dNs . 0

That is, we can decompose the Hawkes process N into the sum of N (0) and N (1), where N (0) is a linear Markovian Hawkes process with zero base intensity and initial intensity (0) (1) Z0 = n and N is a linear Markovian Hawkes process with nonzero base intensity µ > 0 and empty history, i.e., N (1)(−∞, 0] = 0. This decomposition is valid due to the immigration-birth representation of linear Hawkes processes [20]. One of the key results from the immigration-birth representation is that the two processes N (0) and N (1) are independent of each other.

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 11

(0) NnT By letting µ = 0 in Theorem 7, P( n ∈ ·) satisﬁes a large deviation principle with the rate function βx αx + 1 I(0)(x) = x log − x + . αx + 1 β

(1) NnT On the other hand, from Bordenave and Torrisi [6], P( n ∈ ·) satisﬁes a large deviation principle with the rate function

" x ! # (1) x T x x α I (x) = T log x α − + + µ . T µ + T β T T β

(0) (1) NnT Since N and N are independent, we conclude that P( n ∈ ·) satisﬁes a large deviation principle with the rate function I(x) = inf {I(0)(y) + I(1)(z)}. y+z=x

(1) (0) x 1 (0) Notice that I (x) = µT I µT + µT 1 − β and I (x) is convex in x. Hence, by Jensen’s inequality, we conclude that y 1 I(x) = inf I(0)(x − y) + µT I(0) + µT 1 − 0≤y≤x µT β 1 µT y 1 = (1 + µT ) inf I(0)(x − y) + I(0) + µT 1 − 0≤y≤x 1 + µT 1 + µT µT β 1 µT y 1 = (1 + µT )I(0) (x − y) + + µT 1 − 1 + µT 1 + µT µT β x 1 = (1 + µT )I(0) + µT 1 − , 1 + µT β which can be easily veriﬁed to be consistent with (2.14).

The next result is complementary to Theorem 7.

Theorem 9. Assume that β > α > 0 and µ > 0. Let tn be a positive sequence that goes to inﬁnity as n → ∞. N n tn tnT (i) If limn→∞ n = 0, then, for any T > 0, P( n ∈ ·) satisﬁes a large deviation principle on R≥0 with the speed n and the rate function βx αx + 1 I(0)(x) = x log − x + . αx + 1 β

N n tn tnT (ii) If limn→∞ = ∞, then, for any T > 0, ( ∈ ·) satisﬁes a large deviation n P tn principle on R≥0 with the speed tn and the rate function " x ! # (1) x T x x α I (x) = T log x α − + + µ . T µ + T β T T β

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 12 Gao and Zhu

Let us give some intuition behind the results of Theorem 9. Recall the decomposition N = N (0) + N (1) from Remark 8. Notice that N (1) is of order t and that is because t t t tnT n of the large–time law of large numbers of the linear Hawkes process with a ﬁxed initial intensity µ and empty history. Also notice that N (0) is of order n. Let us explain. Notice tnT that from Z(0) = n we obtain [N (0) ] = R tnT [Z(0)]ds = n R tnT e(α−β)sds. As n → ∞, 0 E tnT 0 E s 0 we have t T → ∞. But R ∞ e(α−β)sds = 1 < ∞ for β > α. Thus, N (0) is of order n 0 β−α tnT tn (0) n. Hence, when limn→∞ n = 0, N ‘dominates’ and we have result (i), and when tn (1) limn→∞ n = ∞, N ‘dominates’ and we obtain (ii). So far we have discussed the large deviations for the process N n in the sub-critical n case. We next consider the large deviations for the process Z in the regime where Z0 = n and the time are both sent to inﬁnity. Below, we use the convention that 0 · ∞ = 0.

log n Theorem 10. Assume that β > α > 0, 0 < γ < 1, and tn := β−α . For any 0 < Zn tnT + T < 1 − γ, P( n1−T ∈ ·) satisﬁes a scalar large deviation principle on R with the speed n1−γ−T and the rate function

I¯Z (x) = 0 · 1x=1 + ∞ · 1x6=1.

We remark that similar as in Theorem 6, here the sequence {tn} in Theorem 10 can be taken to be more general. We choose this particular {tn} for the simplicity of notation.

3. Examples and Applications

This section is devoted to two examples that apply the large deviations principle that we have developed in the previous sections. The first example is on ruin probabilities in the insurance setting, and the second example is on the finite–horizon maximum of queue lengths in an infinite–server queue. We assume Markovian Hawkes processes can adequately model the clustering behavior of events occurring in each application. While this assumption may not be completely realistic, it enables us to illustrate the potential strength of our large deviations analysis. Throughout this section, we write an = o(n) as n → ∞ if the sequence of numbers an satisfies limn→∞ an/n = 0.

3.1. Example 1: Ruin probability in insurance risk theory

In this example, we apply our large deviations results to approximate the ﬁnite horizon ruin probability in a risk model in insurance mathematics. Hawkes processes have been applied to insurance settings to accommodate the clustering arrival of claims observed in practice, see, e.g. [8, 23, 26, 33]. When a natural disaster such as an earthquake occurs, the claims typically will not be reported following a constant intensity as in a homogeneous Poisson process. Instead, we expect clustering eﬀect in the claim arrivals after a catastrophe. In addition, the arrival rate of claims is

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 13 typically high right after a catastrophe event. So one might use Hawkes processes with large initial intensities to model such claim arrival processes, and it is of interest to study the finite horizon ruin probability in a risk model where the claim arrivals are modeled by such Hawkes processes. To study the ruin probability, let us consider the surplus process of the insurance company: n Nt n n X Xt = X0 + ρt − Yi. i=1 Here, N n is the claim arrival process modeled as a Hawkes process with an initial intensity µ + n, and an exciting function φ(t) = αe−βt; the constant ρ > 0 is the premium rate, and we assume it is independent of n for simplicity; {Yi} are the non-negative claim sizes n which are independent and identically distributed, and {Yi} is independent of N and n n. Note that we use N to emphasize the dependence on Z0 = n. We are interested in approximating the finite horizon ruin probability P(τ n ≤ T ) for fixed T > 0 and large n, where τ n is the ruin time of an insurance company and it is defined as follows: n n τ := inf{t > 0 : Xt ≤ 0}. We assume that the initial surplus at time 0 is given by Xn(0) = nx, which is large, as n → ∞. In the usual setting of the finite horizon ruin probability problem for the classical risk model, the ruin probability is exponentially small when the initial surplus n is large, see e.g. [3]. In our example, because Nt is of the order n, the ruin will occur at a finite time with probability one. Notice that N n satisfies a functional law of large numbers, see [14],

n Nt sup − ψ(t) → 0, almost surely as n → ∞, 0≤t≤T n

e(α−β)t−1 where ψ(t) := α−β for α 6= β, and ψ(t) := t for α = β. Therefore, as n → ∞,

n ∞ τ → τ := inf{t > 0 : x − E[Y1]ψ(t) = 0}, almost surely. It is easy to compute that (assuming that (α − β) x + 1 > 0; otherwise τ ∞ will be ∞.) E[Y1]  log (α−β) x +1 [Y1] ∞  E for α 6= β, τ = α−β  x for α = β. E[Y1]

For any T > τ ∞, P(τ n ≤ T ) → 1 as n → ∞. For any T < τ ∞, this probability will go to zero exponentially fast as n → ∞, and falls into the large deviations regime. In the following we develop approximations for this probability P(τ n ≤ T ). Let us assume that E[eθY1 ] < ∞ for any θ < θ+ and E[eθY1 ] = ∞ otherwise, where θ+ > 0 and we allow it to be +∞. We deﬁne V++ as the subspace of D[0, ∞), consisting of

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 14 Gao and Zhu unbounded nonnegative increasing functions starting at zero at time zero with finite vari- ation over finite intervals equipped with the vague topology, see [25] . A Mogulskii-type n 1 Pbntc o theorem says that, see e.g. Lemma 3.2. [25], P n i=1 Yi, 0 ≤ t < ∞ ∈ · satisfies a large deviation principle on V++ with the good rate function Z ∞ 0 + ++ Λ(g1(t))dt + θ g2(∞), if g = g1 + g2 ∈ V , g1 ∈ AC0[0, ∞), 0 where θY1 Λ(x) := sup θx − log E[e ] , (3.1) θ∈R and g = g1 + g2 denotes the Lebesgue decomposition of g with respect to Lebesgue measure, where g2 is the singular component and g2(∞) = limt→∞ g2(t). Note that if + n θ = ∞, then g2 ≡ 0. Since {Yi} and N are independent, then Theorem 2 implies that

 bnsc     1 1 X  N n, Y , 0 ≤ t ≤ T, 0 ≤ s < ∞ ∈ · P  n t n i   i=1 

++4 satisfies a large deviation principle on D[0,T ]×V with the good rate function IN (h)+ R ∞ 0 + 0 Λ(g1(t))dt+θ g2(∞), where the rate function IN (h) is given in Theorem 2. It is easy n bn· 1 N nc 1 PNt 1 P n t to see that n i=1 Yi = n i=1 Yi. Hence, by the continuity of the first-passage-time map, and the contraction principle, for any fixed 0 < T < τ ∞, we have

R ∞ 0 + n −n·infh,g:x−g(h(T ))≤0{IN (h)+ Λ(g (t))dt+θ g2(∞)}+o(n) P(τ ≤ T ) = e 0 1 n o −n·inf I (h)+R h(T ) Λ(g0 (t))dt+θ+g (h(T )) +o(n) = e h,g:x−g(h(T ))≤0 N 0 1 2 , (3.2) as n → ∞. We can replace ∞ by h(T ) in (3.2) since Λ(x) ≥ 0 for any x ≥ 0 and it is zero for x = E[Y1] and g2 is also non-decreasing so that g2(∞) ≥ g2(h(T )), and thus 0 0 the optimal g satisfies g1(t) = E[Y1] for t > h(T ) so that Λ(g1(t)) = 0 for t > h(T ) and g2(∞) = g2(h(T )). The expression (3.2) is not very informative, so we next simplify it to obtain a more manageable expression which allows efficient numerical computations. We can first fix g2(h(T )) and then optimize over g2(h(T )). By the convexity of Λ(·) and using Jensen’s inequality, we obtain

Z h(T ) Z h(T ) ! 0 1 0 x − g2(h(T )) Λ(g1(t))dt ≥ h(T )Λ g1(t)dt ≥ h(T )Λ , 0 h(T ) 0 h(T ) where the second inequality is due to x − g1(h(T )) − g2(h(T )) ≤ 0 and Λ(x) is non- ∗ x−g2(h(T )) decreasing in x for x > E[Y1]. On the other hand, by considering g1 (t) = h(T ) t, we

4Here D[0,T ] is equipped with Skorokhod topology. In Theorem 1 and Theorem 2 we proved ﬁrst the large deviation principles hold in the Skorokhod topology.

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 15 have Z h(T ) ∗ 0 x − g2(h(T )) Λ((g1 ) (t))dt = h(T )Λ . 0 h(T ) This implies that (3.2) can be reduced to the following:

x−z + n −n·infh,z≤x{IN (h)+h(T )Λ( )+θ z}+o(n) P(τ ≤ T ) = e h(T ) , as n → ∞. Therefore, we have

x−z + n −n·infy>0,z≤x infh:h(T )=y {IN (h)+yΛ( )+θ z}+o(n) P(τ ≤ T ) = e y .

n To further simplify the above expression, we note from Theorem 2 that P(NT /n ∈ ·) satisﬁes a large deviation principle with the rate function

θ θ H(x; T ) = inf IN (h) = sup θx − C T ; + , h:h(T )=x θ∈R α α where C solves the nonlinear ODE given in (2.10) and (2.11). Hence, we conclude that

n P(τ ≤ T ) = exp(−n · Iτ (x; T ) + o(n)) as n → ∞, where x − z + Iτ (x; T ) := inf H(y; T ) + yΛ + θ z . y>0,z≤x y

x−z + We remark that the function H(y; T ) + yΛ y + θ z is convex in y. This is because H(y; T ) is convex in y and one can also verify directly from the convexity of Λ that x x−z + yΛ y in convex in y. It is also clear that H(y; T ) + yΛ y + θ z is convex in z. So we can numerically obtain Iτ (x; T ) eﬃciently. We now present a numerical example when Yi has a Poisson distribution with rate 1. Then it is easy to see from (3.1) that Λ(¯ v) = v log v − v + 1 for v > 0 and Λ(¯ v) = +∞ otherwise. Also in this case θ+ = ∞. Hence, we obtain

x x Iτ (x; T ) = inf H(y; T ) + y · − 1 − log . (3.3) y>0 y y

See Figure 3 for a numerical illustration.

3.2. Example 2: Finite–horizon maximum of the queue length process in an inﬁnite–server queue

In this example, we use our large deviations results to study certain tail probabilities in an inﬁnite–server queue in heavy traﬃc where the job arrival process is modeled by a

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 16 Gao and Zhu

0.7 0.6

0.6 0.5

0.5 0.4

0.4

0.3 (x;T)

(x;T) =

τ I I 0.3

0.2 0.2

0.1 0.1

0 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.2 0.4 0.6 0.8 1 1.2 1.4 T x

(a) Iτ (x; T ) as a function of T . x = 0.5 is ﬁxed. (b) Iτ (x; T ) as a function of x. T = 0.2 is ﬁxed.

Figure 3: This ﬁgure plots Iτ (x; T ) in (3.3). The parameters are given by: α = β = 1.

Hawkes process with a large initial intensity. Such a queueing system could be relevant for analyzing the performance of large scale service systems with high–volume traffic which exhibits clustering. For background on infinite–server queues, their engineering applications and related large deviation analysis, see, e.g., [15, 30, 5]. Consider a sequence of queueing systems indexed by n with infinite number of servers. Jobs arrive to the n-th system according to a Markovian Hawkes process N n with an initial intensity µ + n, and an exciting function φ(t) = αe−βt. We use N n to emphasize the dependence on Z0 = n. For simplicity, we assume that (a) n is large so the offered load in the system is high; (b) the system is initially empty; (c) the processing time of each job is deterministic given by a constant c > 0. We are interested in the finite–horizon maximum of queue length process in such an infinite–serve queue, similarly as in [5]. Mathematically, we want to develop large deviations approximations for the probability of the event

n max Qs ≥ nx (3.4) 0≤s≤T

n for fixed T > 0 and sufficiently large x, as n → ∞. Here Qs is number of jobs (or busy servers) in the n-th system at time s. For sufficiently large x, we note that (3.4) is a rare event. This event corresponds precisely to the event of observing a loss in a queue with nx servers, no waiting room, and starting empty. It is well known that (see, e.g., [16]) for the n-th system with deterministic processing time c, the queue length process Qn can be represented by

n n n Qt = Nt − Nt−c, n where Nt = 0 if t ≤ 0 by convention. It is easy to see that the function Φ mapping y to y˜ where y˜(t) := max{y(s) − y(s − c)}, s≤t

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 17

1 n is continuous under the uniform topology. Since Theorem 2 states that P( n N ∈ ·) satisﬁes a sample path large deviation principle with the good rate function IN , we can apply the contraction principle and obtain: 1 n lim log P max Qs ≥ nx n→∞ n 0≤s≤T 1 1 n n = lim log P max [Ns − Ns−c] ≥ x n→∞ n 0≤s≤T n = − inf IN (h; T ) : max[h(s) − h(s − c)] ≥ x , (3.5) h s≤T where we use the notation IN (h; T ) to emphasize the dependence of IN on T , as can be clearly seen in (2.7). n Therefore, to develop large deviations approximations for P (max0≤s≤T Q (s) ≥ nx), it remains to solve the optimization problem in (3.5). For T ≤ c, since h is a nondecreasing function, then the inﬁmum in (3.5) is simply

inf IN (h; T ) = H(x; T ). h(T )≥x

For T > c, the inﬁmum in (3.5) is equivalent to: min inf inf IN (h; s), inf inf IN (h; s) . 0≤s≤c h:h(s)≥x c≤s≤T h:h(s)−h(s−c)≥x Now, let us solve the optimization problem:

inf IN (h; t). h:h(t)−h(t−c)≥x Since

1 ny lim lim log P(Nt /n ∈ B(x)|Z0 = ny) = −yH(x/y; t), →0 n→∞ n 1 n lim lim log P(Zt /n ∈ B(y)|Z0 = n) = −J(y; t), →0 n→∞ n and by the Markov property, we get

1 n n n lim lim log P [Nt − Nt−c]/n ∈ B(x),Zt−c/n ∈ B(y)|Z0 = n (3.6) →0 n→∞ n = −yH(x/y; c) − J(y; t − c), and ﬁnally for suﬃciently large x, by (3.6) and the contraction principle, we obtain

1 n n inf IN (h; t) = − lim lim log P [Nt − Nt−c]/n ∈ B(x)|Z0 = n h:h(t)−h(t−c)≥x →0 n→∞ n = inf {yH(x/y; c) + J(y; t − c)} . y>0

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 18 Gao and Zhu

Hence we conclude that the inﬁmum in (3.5) is equivalent to the following expression: G(x; T ) := min inf H(x; s), inf inf {yH(x/y; c) + J(y; s − c)} , (3.7) 0≤s≤c c≤s≤T y>0 where H and J are given in Theorem 1 and 2, respectively. This implies the following approximation for T > c and suﬃciently large x: n P max Qs ≥ nx = exp(−n · G(x; T ) + o(n)), as n → ∞. 0≤s≤T

Since one can solve H and J numerically, we can then also obtain G by solving the optimization problem in (3.7) numerically. We present an example in Figure 4.

1.4 1.4

1.2 1.2

0.8

G(x;T) G(x;T) 0.6

0.6 0.4

0.4 0.2

0 0.2 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 11 x T (a) G(x; T ) as a function of x. T = 5 is ﬁxed. (b) G(x; T ) as a function of T . x = 5 is ﬁxed.

Figure 4: This ﬁgure plots G(x; T ) in (3.7). We use parameters α = β = 1, c = 1.

4. Proofs of Theorems 1 and 2

This section collects the proofs of Theorems 1 and 2.

4.1. Moment generating functions of Zt and Nt

In this section we discuss the moment generating functions of Zt and Nt for ﬁxed t, conditioned on knowing the value of Z0. These functions play a critical role in proving our large deviation results. First, recall from [14, Section 3.2.1] the moment generating function of Zt:

θZt A(t;θ)z+B(t;θ) u(t, z) := E[e |Z0 = z] = e , (4.1)

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 19 where A(t; θ),B(t; θ) satisfy the ODEs:

A0(t; θ) = −βA(t; θ) + eαA(t;θ) − 1, (4.2) B0(t; θ) = µ eαA(t;θ) − 1 , (4.3) with initial conditions A(0; θ) = θ and B(0; θ) = 0. As remarked earlier, we have used A(t; θ) instead of A(t) to emphasize that A takes value θ at time zero, and the derivative in (4.2) is taken with respect to t. We also write B(t; θ) instead of B(t) to stress that B depends on the initial condition of A. Zt−Z0 Next, we compute the moment generating function of Nt. Recall that Nt = α + β t θ R θNt − α z α 0 Zsds. Thus, E[e |Z0 = z] = e v(t, z), where

θ θβ R t Zt+ Zsds v(t, z) := E[e α α 0 |Z0 = z]. Recall that Z is a Markov process with the inﬁnitesimal generator ∂f Af(z) = −βz + (z + µ)[f(z + α) − f(z)]. ∂z By Feynman-Kac formula, v satisﬁes the equation: ∂v ∂v θβ = −βz + (µ + z)[v(t, z + α) − v(t, z)] + zv(t, z), ∂t ∂z α

θ z with an initial condition v(0, z) = e α . Therefore, by the aﬃne structure, see e.g. [12], θ θ C(t; α )z+D(t; α ) θ θ one deduces that v(t, z) = e , where C(t; α ),D(t; α ) satisfy the ODEs: 0 θ θ αC(t; θ ) θ C t; = −βC t; + e α − 1 + β · C 0; , (4.4) α α α 0 θ αC(t; θ ) D t; = µ e α − 1 , (4.5) α

θ θ θ with initial conditions C(0; α ) = α and D 0; α = 0. Thus we have θ θ θ [eθNt |Z = z] = exp C t; − C 0; · z + D t; . (4.6) E 0 α α α Finally, we remark that there exists some Θ > 0 such that the moment generating functions in (4.1) and (4.6) are both ﬁnite for all θ ≤ Θ. See [36].

4.2. Proofs of Theorems 1 and 2

We prove Theorems 1 and 2 in this section. For notational convenience, unless speciﬁed n n explicitly, we use Z and N for Z and N when Z0 = n. We also use E[·] to denote the conditional expectation E[·|Z0 = n], and P(·) for the conditional probability P(·|Z0 = n).

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 20 Gao and Zhu

Proof of Theorem 1. The proof is long, so we split it into four steps. 1 Step 1. We ﬁrst establish a scalar large deviation principle for P n ZT ∈ · , using G¨artner-Ellistheorem. From (4.1) we have

θZt A(t;θ)z+B(t;θ) u(t, z) := E[e |Z0 = z] = e .

It is easy to see that since Zt process is positive, u(t, z) is monotonically increasing in θ. Let us recall from Section 4.1 that A(t; θ),B(t; θ) satisfy the ODEs: A0(t; θ) = −βA(t; θ) + eαA(t;θ) − 1, B0(t; θ) = µ eαA(t;θ) − 1 , with initial conditions A(0; θ) = θ and B(0; θ) = 0. Let us ﬁrst consider the critical and super-critical case, that is, α ≥ β. When we have α ≥ β, for any A > 0, −βA + eαA − 1 > 0 and thus A(t; θ) is increasing in t. It is R ∞ dA clear that for any θ > 0, θ −βA+eαA−1 < ∞. On the other hand, it is easy to see that R ∞ dA 0 −βA+eαA−1 = ∞. Therefore, for any ﬁxed T > 0, there exists a unique positive value θc(T ) such that Z ∞ dA αA = T. (4.7) θc(T ) −βA + e − 1

Hence, we conclude that for any fixed T > 0, for any 0 < θ < θc(T ), A(T ; θ) is the unique positive value greater than θ, that satisfies the equation: Z A(T ;θ) dA αA = T. (4.8) θ −βA + e − 1 Now let us consider the case θ ≤ 0 . When α > β, −βA + eαA − 1 = 0 when A = 0 or when A = Ac, for some unique negative value Ac. For θ = 0 or θ = Ac, A(t; θ) = 0 for any t. For Ac < θ < 0, A(t; θ) is decreasing in t and A(T ; θ) satisfies the equation (4.8). For θ < Ac, A(t; θ) is increasing in t and A(T ; θ) < 0 and satisfies the equation (4.8). When α = β, −βA + eαA − 1 > 0 when A 6= 0. Thus, for any θ < 0, A(t; θ) is increasing in t and A(T ; θ) < 0 and satisfies the equation (4.8) and also A(t; 0) ≡ 0. Also, it is easy to see that for θ < θc(T ), A(t; θ) is continuous and finite in t, and Z T B(T ; θ) = µ (eαA(t;θ) − 1)dt 0 is finite. Therefore, for θ < θc(T )

1 θZ lim log E[e T ] = A(T ; θ). n→∞ n

When θ ≥ θc(T ), this limit is ∞. By diﬀerentiating the equation (4.8) with respect to θ, we get 1 1 d − + A(T ; θ) = 0. (4.9) −βθ + eαθ − 1 −βA(T ; θ) + eαA(T ;θ) − 1 dθ

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 21

It is clear from the equation (4.7) and (4.8) that as θ → θc(T ), we have A(T ; θ) → ∞. Therefore, from (4.9), we get

∂ −βA(T ; θ) + eαA(T ;θ) − 1 A(T ; θ) = → ∞, as θ → θ (T ). ∂θ −βθ + eαθ − 1 c

1 Hence, we verified the essential smoothness condition. By Gärtner-Ellistheorem, P n ZT ∈ · satisfies a large deviation principle on R+ with the rate function J(x; T ) = sup {θx − A(T ; θ)} . (4.10) θ∈R Next, let us consider the sub-critical case, that is, α < β. In this case, −βA+eαA−1 = 0 if and only if A = 0 or A = Ac, where Ac is a positive constant and it is unique. For θ = 0 or Ac, A(t; θ) = 0 for any t. For θ < 0, A(t; θ) is increasing in t, and A(T, θ) < 0 satisfies the equation (4.8). For 0 < θ < Ac, A(t, θ) is decreasing in t and satisfies the equation (4.8). For θ > Ac, A(t, θ) is increasing in t. For any fixed T > 0, there exists a unique θc(T ) > Ac satisfying the equation (4.7) so that for any Ac < θ < θc(T ), A(T, θ) is the unique positive value greater than θ that satisfies the equation (4.8) and for θ ≥ θc(T ), 1 A(T, θ) = ∞. We can proceed similarly as before and prove that, P n ZT ∈ · satisfies a large deviation principle on R+ with the rate function given in (4.10). Step 2. Next, we need to prove the exponential tightness before we proceed to establish the sample path large deviation principle. To be more precise, we will show that

1 lim sup lim sup log P sup Zt ≥ nK = −∞, (4.11) K→∞ n→∞ n 0≤t≤T and for any δ > 0, ! 1 lim sup lim sup log P sup |Zt − Zs| ≥ δn = −∞. (4.12) →0 n→∞ n |t−s|≤,0≤t,s≤T

We will also show that for any η > 0,

1 lim sup log P sup |Zt − Zt−| ≥ ηn = −∞. (4.13) n→∞ n 0

The superexponential estimates (4.11) and (4.12) will guarantee the exponential tightness on D[0,T ] equipped with the Skorokhod topology, see e.g. Theorem 4.1. in Feng and Kurtz [13]. Together with Step 3, it will prove the large deviation principle for 1 P n Zt, 0 ≤ t ≤ T ∈ · on D[0,T ] equipped with Skorokhod topology. Next, the equation (4.13), i.e. the so-called C-exponentially tightness, see e.g. Deﬁnition 4.12. in [13] 1 strengthens the large deviation principle for P n Zt, 0 ≤ t ≤ T ∈ · so that it holds on D[0,T ] equipped with uniform topology, see e.g. Theorem 4.14. in [13].

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 22 Gao and Zhu

Let us ﬁrst prove (4.11). Notice ﬁrst that Zt − Z0 ≤ αNt and Z0 = n. Therefore, for K > 1, P sup Zt ≥ nK = P sup (Zt − Z0) ≥ n(K − 1) 0≤t≤T 0≤t≤T K − 1 ≤ P sup Nt ≥ n 0≤t≤T α

= P (αNT ≥ n(K − 1)) θN −θ(K−1)n/α ≤ E[e T ]e , where the last inequality follows from Chebychev’s inequality. In conjunction with the moment generating function of NT in (4.6), we hence obtain

1 θ θ θ(K − 1) lim sup log P sup Zt ≥ nK ≤ C T ; − − , n→∞ n 0≤t≤T α α α which goes to −∞ as K → ∞. Hence, we proved (4.11). R t Next, let us prove (4.12). Note that for s < t, αN(s, t] = Zt − Zs + β s Zudu. Thus, for s < t, we have |Zt − Zs| ≤ αN(s, t] + β(t − s) sup Zu. s≤u≤t Therefore, ! P sup |Zt − Zs| ≥ δn |t−s|≤,0≤t,s≤T ! ≤ P sup αN(s, t] + β(t − s) sup Zu ≥ δn |t−s|≤,0≤s≤t≤T s≤u≤t ! ! δ δ ≤ P sup αN(s, t] ≥ n + P sup β(t − s) sup Zu ≥ n . |t−s|≤,0≤s≤t≤T 2 |t−s|≤,0≤s≤t≤T s≤u≤t 2

Note that ! δ δ P sup β(t − s) sup Zu ≥ n ≤ P β sup Zu ≥ n . |t−s|≤,0≤s≤t≤T s≤u≤t 2 0≤u≤T 2

By (4.11), we have

1 δ lim sup lim sup log P β sup Zu ≥ n = −∞. →0 n→∞ n 0≤u≤T 2

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 23

1 Next, notice that without loss of generality we can assume that ∈ N and ! δ δ P sup αN(s, t] ≥ n ≤ P ∃1 ≤ j ≤ T/ : αN(tj−1, tj] ≥ n |t−s|≤,0≤s≤t≤T 2 4 T/ X δ ≤ αN(t , t ] ≥ n , (4.14) P j−1 j 4 j=1 where 0 = t0 < t1 < ··· < tT/ = T , where tj − tj−1 = for any j. In addition, note that for θ > 0, h i h h ii θαN(tj−1,tj ] θαN(tj−1,tj ] E e = E E e |Ztj−1 (4.15)

h −θZ C(t −t ;θ)Z +D(t −t ;θ)i = E e tj−1 e j j−1 tj−1 j j−1 = exp D(; θ) + A(tj−1; C(; θ) − θ)n + B(tj−1; C(; θ) − θ) , where we have used the moment generating functions of Zt and Nt in Section 4.1. Hence, using Chebychev’s inequality and combining (4.14) and (4.15), we ﬁnd for ﬁxed > 0, ! 1 δ lim sup log P sup αN(s, t] ≥ n n→∞ n |t−s|≤,0≤s≤t≤T 2 δ ≤ sup A(tj−1; C(; θ) − θ) − θ 1≤j≤T/ 4 δ ≤ sup {A(t; C(; θ) − θ)} − θ . 0≤t≤T 4

So in order to prove (4.12), what remains is to choose θ that depends on so that (i) θ → ∞ as → 0; (ii) A(t; C(; θ) − θ) is uniformly bounded for t ∈ [0,T ] and → 0. To this end, let us deﬁne y(t) := C(t; θ) − C(0; θ) = C(t; θ) − θ. Then y satisﬁes the ODE:

y0(t) = −βy(t) + eαθeαy(t) − 1, y(0) = 0.

For θ > 0, we have y0(0) = eαθ − 1 > 0, which implies y is increasing on [0, γ] for some γ > 0. This suggests that

0 < y0(t) ≤ eαθeαy(t), for t ∈ [0, γ].

By Gronwall’s inequality for nonlinear ODEs, we obtain 1 0 ≤ y(t) ≤ − · log(1 − αeαθt), for t ∈ [0, γ]. (4.16) α

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 24 Gao and Zhu

αθ √1 Let us set αe = . Then it is clear that θ → ∞ as → 0. In addition, we deduce from (4.16) that for < γ, 1 √ 0 ≤ C(; θ) − θ = y() ≤ − · log(1 − ). (4.17) α Next we show {A(t; C(; θ) − θ)} is uniformly bounded for t ∈ [0,T ] and → 0. When α < β, it is clear that zero is a stable solution for the ODE of A in (4.2). Since A(0; C(; θ) − θ)) = y() → 0 as → 0, so the stability of zero solution implies that when → 0, {A(t; C(; θ) − θ)} is uniformly small and thus uniformly bounded for all t ≥ 0. When α ≥ β, since A(0; C(; θ) − θ)) = y() ≥ 0, one readily checks that A is non–decreasing with respect to time t. Hence we obtain

sup {A(t; C(; θ) − θ)} = A(T ; y()). 0≤t≤T ¯ ¯ ¯ We have shown in Step 1 that A(T ; θ) is ﬁnite when θ < θc(T ), and A(T ; θ) is continuous as a function of θ¯. Therefore we deduce from (4.17) that A(T ; y()) is uniformly bounded for → 0. Thus, we have proved (4.12). Finally, the claim in (4.13) trivially holds since for any 0 < t ≤ T , |Zt− − Zt| = 0 or α with probability 1. Step 3. Next, we establish the sample path large deviation principle. For any > 0, let B(x) denote the open ball centered at x with radius . For any + 0 =: t0 < t1 < t2 < ··· < tk−1 < tk := T and x1, . . . , xk ∈ R , by the Markov property of the process Z, we have 1 1 1 Z ∈ B (x ), Z ∈ B (x ),..., Z ∈ B (x ) P n t1 1 n t2 2 n tk k 1 1 1 = P Zt1 ∈ B(x1) P Zt2 ∈ B(x2) Zt1 ∈ B(x1) n n n 1 1 ··· P Ztk ∈ B(xk) Ztk−1 ∈ B(xk−1) . n n Hence, we have 1 1 1 1 lim lim log P Zt1 ∈ B(x1), Zt2 ∈ B(x2),..., Ztk ∈ B(xk) →0 n→∞ n n n n x2 xk = −J(x1; t1) − x1J ; t2 − t1 − · · · − xk−1J ; tk − tk−1 , x1 xk−1 where J is given in (4.10). Hence, for any g ∈ AC1[0,T ], 1 1 1 1 lim lim log P Zt1 ∈ B(g(t1)), Zt2 ∈ B(g(t2)),..., Ztk ∈ B(g(tk)) →0 n→∞ n n n n g(t2) g(tk) = −J(g(t1); t1) − g(t1)J ; t2 − t1 − · · · − g(tk−1)J ; tk − tk−1 . g(t1) g(tk−1)

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 25

For any given positive g ∈ AC1[0,T ], we have g(tj) J ; tj − tj−1 g(tj−1) g(tj) = sup θ − A(tj − tj−1; θ) θ∈R g(tj−1) 0 ∗ Z tj −tj−1 g (tj−1) αA(s;θ) = sup θ 1 + (tj − tj−1) − θ − (−βA(s; θ) + e − 1)ds θ∈R g(tj−1) 0 0 ∗ g (tj−1) ∗∗ ∗∗ αA(tj−1;θ) = (tj − tj−1) sup θ − −βA(tj−1; θ) + e − 1 , θ∈R g(tj−1)

∗ ∗∗ where tj−1 ∈ [tj−1, tj] is independent of θ and tj−1 ∈ [0, tj − tj−1] may depend on θ. 0 ∗ g (tj ) It is easy to see that for any given positive g ∈ AC1[0,T ], , is uniformly bounded g(tj−1) in j. To see this, notice that g is positive and continuous so inf0≤t≤T g(t) > 0, and since g is absolutely continuous, g0 exists almost surely and we can assume that g0 exist for ∗ ∗∗ any tj . And we can also see that A(tj−1; θ) is uniformly bounded in j. Therefore, there exists some constant K that may depend on the given g, such that, uniformly in j,

0 ∗ g (tj−1) ∗∗ ∗∗ αA(tj−1;θ) sup θ − −βA(tj−1; θ) + e − 1 θ∈R g(tj−1) 0 ∗ g (tj−1) ∗∗ ∗∗ αA(tj−1;θ) = sup θ − −βA(tj−1; θ) + e − 1 . |θ|≤K g(tj−1)

Therefore,

0 ∗ g (tj−1) ∗∗ ∗∗ αA(tj−1;θ) sup θ − −βA(tj−1; θ) + e − 1 θ∈R g(tj−1) 0 ∗ g (tj−1) αθ − sup θ − −βθ + e − 1 θ∈R g(tj−1) αA(t;θ) αθ ≤ sup sup −βA(t; θ) + e − 1 − −βθ + e − 1 → 0, |θ|≤K 0≤t≤tj −tj−1 as tj − tj−1 → 0. Hence, we conclude that

1 1 lim lim log P Zt ∈ B(g), 0 ≤ t ≤ T →0 n→∞ n n Z T g0(t) = − g(t) sup θ − (−βθ + eαθ − 1) dt 0 θ∈R g(t) Z T n o = − sup θ(t)g0(t) − (−βθ(t) + eαθ(t) − 1)g(t) dt. θ(t):0≤t≤T 0

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 26 Gao and Zhu

Together with the superexponential estimates (4.11) and (4.12), we have proved that, 1 P n Zt, 0 ≤ t ≤ T ∈ · satisﬁes a large deviation principle with the rate function Z T n 0 αθ(t) o IZ (g) = sup θ(t)g (t) − (−βθ(t) + e − 1)g(t) dt, θ(t):0≤t≤T 0 if g ∈ AC1[0,T ]. Note that the maximization problem

sup {xg0 − (−βx + eαx − 1)g} x

g0 1 β+ g 0 has its maximum achieved at x = α log α , provided that g ≥ −βg. Otherwise, the maximum is +∞. Therefore, we conclude that

Z T βg(t) + g0(t) βg(t) + g0(t) βg(t) + g0(t) IZ (g) = log − − g(t) dt, 0 α αg(t) α

0 for any g ∈ AC1[0,T ] and g ≥ −βg and IZ (g) = +∞ otherwise. Step 4. Finally let us show that the rate function IZ (g) is good. That is, we need to show that for any ﬁxed m > 0, the level set

Km := {g ∈ AC1[0,T ]: IZ (g) ≤ m} (4.18) is compact. −βt −βt −βt Since Zt ≥ Z0e , we have g(t) ≥ g(0)e = e for any t. Therefore, for any g ∈ Km, Z T βg(t) + g0(t) e−βT Λ∗ dt ≤ m, (4.19) 0 αg(t) ∗ where Λ (x) := x log x−x+1 is strictly convex and non-negative. Thus, for any g ∈ Km,

Z T β 1 g0(t) Λ∗ + dt ≤ meβT . (4.20) 0 α α g(t)

0 β 1 0 β 1 g (t) Let us deﬁne f(t) = α t + α log g(t). Then f(0) = 0 and f (t) = α + α g(t) . From the proof that the rate function for Mogulskii’s theorem is good, see e.g. Page 183 in Dembo and Zeitouni [10], it follows that the set

( Z T ) ∗ 0 αT f ∈ AC0[0,T ]: Λ (f (t)) dt ≤ me (4.21) 0 is a bounded set of equicontinuous functions. Since g(t) = eαf(t)−βt, it follows that the set Km is a bounded set of equicontinuous functions. By Arzel`a-Ascolitheorem, the set Km is compact. Hence, IZ (g) is a good rate function. The proof is complete.

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 27

Proof of Theorem 2. We apply Theorem 1 and the contraction principle. One then 1 readily obtains from (2.6) that P n Nt, 0 ≤ t ≤ T ∈ · satisﬁes a large deviation principle with the good rate function

IN (h) = inf IZ (g). (4.22) g(t)−1 β R t h(t)= α + α 0 g(s)ds,0≤t≤T

g(t)−1 β R t Observe that diﬀerentiating the integral equation h(t) = α + α 0 g(s)ds, we get 1 β h0(t) = g0(t) + g(t), α α which is a ﬁrst-order linear ODE for g(t) with initial condition g(0) = 1. Thus, we can solve this ODE and get Z t g(t) = e−βt + e−βt αeβsh0(s)ds. 0

Hence, we infer from (4.22) and the expression of IZ (g) in (2.1) that

Z T 0 0 h (t) 0 IN (h) = h (t) log − (h (t) − g(t))dt 0 g(t) ! Z T h0(t) = h0(t) log −βt −βt R t βs 0 0 e + e 0 αe h (s)ds Z t − h0(t) − e−βt − e−βt αeβsh0(s)ds dt. 0 Using this sample path large deviations result and applying the contraction principle, we + can also obtain that, P(NT /n ∈ ·) satisﬁes a scalar large deviation principle on R with the good rate function

H(x; T ) = inf IN (h). (4.23) h:h(T )=x

Next, we prove that the rate function H in (4.23) can be equivalently given by (2.9). Recall the moment generating function of Nt in (4.6), θ θ θ [eθNt |Z = n] = exp C t; − n + D t; , E 0 α α α where 0 θ θ αC t; θ βθ θ θ C t; = −βC t; + e ( α ) − 1 + ,C 0; = . α α α α α Let us ﬁrst consider the critical and super-critical case, that is, α ≥ β. When we have αC βθ θ α ≥ β, for any C > 0 and θ > 0, −βC + e − 1 + α > 0 and thus C(t; α ) is increasing

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 28 Gao and Zhu

R ∞ dC in t. It is clear that for any θ > 0, θ αC βθ < ∞. On the other hand, it is easy α −βC+e −1+ α R ∞ dC to see that 0 −βC+eαC −1 = ∞. Therefore, for any ﬁxed T > 0, there exists a unique positive value θd(T ) such that Z ∞ dC = T. (4.24) θd(T ) αC βθd(T ) α −βC + e − 1 + α

θ Hence, we conclude that for any ﬁxed T > 0, for any 0 < θ < θd(T ), C(T ; α ) is the θ unique positive value greater than α , that satisﬁes the equation:

C(T ; θ ) Z α dC βθ = T. (4.25) θ −βC + eαC − 1 + α α

θ The case for θ ≤ 0 is similar. Also, it is easy to see that for θ < θd(T ), C(t; α ) is continuous and ﬁnite in t, and

Z T θ αC(t; θ ) D T ; = µ (e α − 1)dt α 0 is ﬁnite. Therefore, for θ < θd(T ), 1 θN θ θ lim log E[e T ] = C T ; − . n→∞ n α α

When θ ≥ θd(T ), this limit is ∞. By diﬀerentiating the equation (4.25) with respect to θ, we get

C(T ; θ ) 1 β Z α dC − θ − βθ (4.26) e − 1 α θ (−βC + eαC − 1 + )2 α α 1 d θ + C T ; = 0. θ θ βθ αC(T ; α ) dθ α −βC(T ; α ) + e − 1 + α

θ It is clear from the equation (4.24) and (4.25) that as θ → θd(T ), we have C(T ; α ) → ∞. Therefore, from (4.26), we get ∂ θ θ αC(T ; θ ) βθ C T ; = −βC T ; + e α − 1 + ∂θ α α α C(T ; θ ) ! 1 β Z α dC · θ + βθ → ∞, e − 1 α θ (−βC + eαC − 1 + )2 α α as θ → θd(T ). Hence, we veriﬁed the essential smoothness condition. By G¨artner-Ellis theorem, we get the desired result. The proof for the sub–critical case is similar and is omitted here.

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 29 Acknowledgements

The authors are grateful to the editor and an anonymous referee for their helpful sugges- tions that greatly improved the quality of the paper. Xuefeng Gao acknowledges support from Hong Kong RGC ECS Grant 2191081 and CUHK Direct Grants for Research with project codes 4055035 and 4055054. Lingjiong Zhu is grateful to the support from NSF Grant DMS-1613164. We thank Xiang Zhou for help with the ﬁgures.

References

[1] Abergel, F. and A. Jedidi. (2015). Long time behaviour of a Hawkes process-based limit order book. SIAM J. Financial Math. 6, 1026-1043. [2] A¨ıt-Sahalia, Y., Cacho-Diaz, J. and R. J. A. Laeven. (2015). Modeling financial contagion using mutually exciting jump processes. Journal of Financial Economics. 117, 585-606. [3] Asmussen, S. and H. Albrecher. Ruin Probabilities. Second Edition. World Scientific, Singapore, 2010. [4] Bacry, E., Delattre, S., Hoffmann, M. and J. F. Muzy. (2013). Scaling limits for Hawkes processes and application to financial statistics. Stochastic Processes and their Applications 123, 2475-2499. [5] Blanchet, J., Chen, X. and H. Lam. (2014) Two-parameter sample path large deviations for infinite-server queues. Stochastic Systems. 1, 206-249. [6] Bordenave, C. and Torrisi, G. L. (2007). Large deviations of Poisson cluster processes. Stochastic Models, 23, 593-625. [7] Dassios, A. and Zhao, H. (2013). Exact simulation of Hawkes process with exponentially decaying intensity. Electronic Communications in Probability 18(62), pp.1-13. [8] Dassios, A. and H. Zhao. (2012). Ruin by dynamic contagion claims. Insurance: Mathematics and Economics. 51, 93-106. [9] Delattre, S., Fournier, N. and M. Hoffmann. (2016). Hawkes processes on large networks. Annals of Applied Probability. 26, 216-261. [10] Dembo, A. and O. Zeitouni. Large Deviations Techniques and Applications. 2nd Edition, Springer, New York, 1998. [11] Dragomir, S.S. and City, M.. 2000. Some Gronwall type inequalities and applications. RGMIA Monographs, Victoria University. [12] Errais, E., Giesecke, K. and Goldberg, L. (2010). Affine point processes and portfolio credit risk. SIAM J. Financial Math. 1, 642-665. [13] Feng, J. and T. G. Kurtz. Large Deviations for Stochastic Processes. American Math- ematical Society, 2006. [14] Gao, X. and L. Zhu. (2015). Limit theorems for Markovian Hawkes processes with a large initial intensity. arXiv:1512.02155. [15] Glynn, P.W. (1995). Large deviations for the infinite server queue in heavy traffic. Institute for Mathematics and its Applications. 71, p.387.

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 30 Gao and Zhu

[16] Glynn, P. W. and W. Whitt. (1991). A new view of the heavy-traffic limit theorem for infinite-server queues. Advances in Applied Probability. 188-209. [17] Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika 58, 83-90. [18] Hawkes, A.G. (1971). Point spectra of some mutually exciting point processes. Jour- nal of the Royal Statistical Society. Series B (Methodological). pp.438-443. [19] Hartman, P. Ordinary Differerential Equations, Second Edition, SIAM, 2002. [20] Hawkes, A. G. and Oakes, D. (1974). A cluster process representation of a self- exciting process. J. Appl. Prob. 11, 493-503. [21] Jaisson, T. and M. Rosenbaum. (2015). Limit theorems for nearly unstable Hawkes processes. Ann. Appl. Prob. 25, 600-631. [22] Jaisson, T. and M. Rosenbaum. (2016). Rough fractional diffusions as scaling limits of nearly unstable heavy tailed Hawkes processes. Ann. Appl. Prob. 26, 2860-2882. [23] Jang, J. and A. Dassios. (2013). A bivariate shot noise self-exciting process for insurance. Insurance: Mathematics and Economics. 53, 524-532. [24] Karabash, D. and L. Zhu. (2015). Limit theorems for marked Hawkes processes with application to a risk model. Stochastic Models. 31, 433-451. [25] Puhalskii, A. (1995). Large deviation analysis of the single server queue. Queueing Systems. 21, 5-66. [26] Stabile, G. and G. L. Torrisi. (2010). Risk processes with non-stationary Hawkes arrivals. Methodol. Comput. Appl. Prob. 12, 415-429. [27] Torrisi, G. L. (2016). Gaussian approximation of nonlinear Hawkes processes. Annals of Applied Probability. 26, 2106-2140. [28] Torrisi, G. L. Poisson approximation of point processes with stochastic intensity, and application to nonlinear Hawkes processes. to appear in AIHP. [29] Varadhan, S. R. S. Large Deviations and Applications, SIAM, Philadelphia, 1984. [30] Whitt, W. Stochastic-Process Limits: An Introduction to Stochastic-Process Limits and Their Application to Queues, Springer Science and Business Media, 2002. [31] Zhang, X., Blanchet, J., Giesecke, K., and P. W. Glynn. (2015). Affine point processes: Approximation and efficient simulation. Mathematics of Operations Research. 40, 797-819. [32] Zhu, L. (2013). Nonlinear Hawkes Processes. PhD thesis, New York University. [33] Zhu, L. (2013). Ruin probabilities for risk processes with non-stationary arrivals and subexponential claims. Insurance: Mathematics and Economics. 53 544-550. [34] Zhu, L. (2013). Moderate deviations for Hawkes processes. Statistics and Probability Letters. 83, 885-890. [35] Zhu, L. (2014). Limit theorems for a Cox-Ingersoll-Ross process with Hawkes jumps. Journal of Applied Probability. 51, 699-712. [36] Zhu, L. (2013). Central limit theorem for nonlinear Hawkes processes. Journal of Applied Probability. 50 760-771. [37] Zhu, L. (2015). Large deviations for Markovian nonlinear Hawkes Processes. Annals of Applied Probability. 25, 548-581. [38] Zhu, L. (2014). Process-level large deviations for nonlinear Hawkes point processes. Annales de l’Institut Henri Poincaré-Probabilitéset Statistiques. 50, 845-871.

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 31 Appendix A: Additional proofs

We prove results in Sections 2.2.1 and 2.3 in this appendix. For notational convenience, n n unless speciﬁed explicitly, we use Z and N for Z and N when Z0 = n. We also use E[·] to denote the conditional expectation E[·|Z0 = n], and P(·) for the conditional probability P(·|Z0 = n).

A.1. Proofs of Propositions 3 and 4

Proof of Proposition 3. Let

βg + g0 βg + g0 βg + g0 L(g, g0) := log − − g . α αg α

Then the variational problem (2.2) becomes

Z T inf L(g(t), g0(t))dt. g(0)=1,g(T )=x 0

∂L d ∂L Applying the Euler-Lagrange equation ∂g − dt ∂g0 = 0, we deduce that the optimal sample path g∗ satisﬁes the following equation: β βg + g0 βg + g0 d 1 βg + g0 log − + 1 − log = 0. (A.1) α αg αg dt α αg

Deﬁne 1 βg (t) + g0 (t) q(t) = log ∗ ∗ . (A.2) α αg∗(t)

Then the Equation (A.1) reduces to

d q(t) = βq(t) − (eαq(t) − 1). (A.3) dt Set q(T ) = θ, then we obtain from (A.3), (2.4) and the uniqueness of ODE solutions that

q(t) = A(T − t; θ), for t ∈ [0,T ].

Note from (A.2) and g∗(0) = 1, we have

Z t αq(s) g∗(t) = exp αe ds − βt . 0

So what is remaining is to ﬁnd the parameter θ such that g∗(T ) = x. We claim that the correct parameter θ is simply θ∗, the maximizer to the optimization problem (2.3). To

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 32 Gao and Zhu

∂ see this, we deﬁne γ(t) := ∂θ A(t; θ). That is, γ is the derivative of A with respect to the initial condition. Then it follows from (2.4) and [19, Chapter V] that d γ(t) = (−β + αeαA(t;θ)) · γ(t), dt γ(0) = 1, which immediately yields

Z T ! αA(t;θ) γ(T ) = exp (−β + αe )dt = g∗(T ) = x. 0

Now from (2.3), it is clear that the optimal θ∗ satisﬁes ∂ x = A(T ; θ)| . ∂θ θ=θ∗

Therefore, when θ = θ∗, we have g∗(T ) = x, and the path g∗ solves the Euler-Lagrange equation (A.1), which is a necessary condition for optimality. It remains to check g∗ given by (2.12) is indeed the optimal sample path for the variational problem (2.2). It suﬃces to note that

Z T 0 IZ (g∗) = L(g∗(t), g∗(t))dt 0 Z T n 0 αq(t) o = q(t)g∗(t) − (−βq(t) + e − 1)g∗(t) dt 0 Z T 0 0 = {q(t)g∗(t) + q (t)g∗(t)} dt 0 = q(T )x − q(0)

= θ∗ · x − A(T ; θ∗). = sup {θx − A(T ; θ)} , θ∈R where we have used the fact that q(t) = A(T −t; θ∗), for t ∈ [0,T ]. Therefore, g∗ is indeed the optimal sample path. The proof is complete.

Proof of Proposition 4. Let us recall that

H(x; T ) = inf IN (h), h(0)=0,h(T )=x

R t βs 0 where IN is given in (2.7). By considering f(t) = 0 αe h (s)ds, we get H(x; T ) 0 R T 1 −βt h 0 f (t) 0 i = inf f0(t) e f (t) log − f (t) + α + αf(t) dt. f(0)=0,R T dt=x 0 α α+αf(t) 0 αeβt

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 33

This is a constrained optimization problem, so we introduce the Lagrange multiplier λ and deﬁne 1 f 0(t) f 0(t) L(t, f, f 0) = e−βt f 0(t) log − f 0(t) + α + αf(t) − λ . α α + αf(t) αeβt We consider the modiﬁed problem

Z T inf L(t, f(t), f 0(t))dt. (A.4) 0 f(0)=0,R T f (t) dt=x 0 0 αeβt

∂L d ∂L The Euler-Lagrange equation ∂f − dt ∂f 0 = 0 yields that the optimal f∗ for (A.4) satisﬁes 1 −f 0 (t) d 1 f 0 (t) e−βt ∗ + α − e−βt log ∗ − λ = 0. (A.5) α 1 + f∗(t) dt α α + αf∗(t)

In addition, f∗ satisﬁes the following transversality condition: 0 ∂L 1 −βT f∗(T ) 0 = 0 = e log − λ . (A.6) ∂f t=T α α + αf∗(T ) Let us deﬁne 1 f 0 (t) p(t) = log ∗ . (A.7) α α + αf∗(t) Then, the Equations (A.5) and (A.6) become

dp(t) βλ = βp(t) − eαp(t) + 1 − , dt α λ p(T ) = . α Hence, p solves a ﬁrst order ODE with terminal constraint. Comparing with (2.10) and (2.11), we infer from the uniqueness of solutions of such ODEs that

λ p(t) = C T − t; . α

Note that we can deduce from (A.7) that

α R t eαp(s)ds f∗(t) = e 0 − 1.

0 βt 0 Recall that f∗(t) = αe h∗(t) and h∗(0) = 0. Thus, we get

t 0 t Z f (s) Z R s αp(u) ∗ αp(s) α 0 e du−βs h∗(t) = βs ds = e · e ds, (A.8) 0 αe 0 which depends on λ.

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 34 Gao and Zhu

Next, we claim that, to satisfy the constraint h∗(T ) = x, the correct Lagrange mul- ˆ tiplier λ is simply θ∗, the maximizer to the optimization problem (2.9). To see this, we set θ θ ∂ w (t) = C t; − , and r(t) = w (t). θ α α ∂θ θ

One readily checks from (2.10) and (2.11) that r solves the ODE d αC(t; θ ) θ r(t) = −β + α · e α r(t) + exp αC t; , dt α r(0) = 0, which implies

Z T Z T ! αC(t; θ ) αC(s; θ ) r(T ) = e α · exp αe α − β ds dt. 0 t

ˆ Since θ∗ is the maximizer to the optimization problem (2.9), we deduce that

Z T ˆ Z u ˆ αC(T −u; θ∗ ) αC(T −v; θ∗ ) x = r(T )| = e α · exp αe α − β dv du, (A.9) θ=θˆ∗ 0 0 where we have applied the change of variable formula. Then it is clear from (A.8) and

θˆ∗ (A.9) that when p(t) = C(T − t; α ), we have h∗(T ) = x. Finally, we verify that h∗ given by (2.13) is indeed optimal for the variational problem (2.8). It suﬃces to note that

Z T −βt 0 ˆ e 0 f∗(t) 0 ˆ 0 IN (h∗) = θ∗ · x + f∗(t) log − f∗(t) + α + αf∗(t) − θ∗f∗(t) dt 0 α α + αf∗(t) Z T −βt 0 ˆ e 0 f∗(t) ˆ = θ∗ · x + f∗(t) log − θ∗ dt 0 α α + αf∗(t) Z T −βt 0 e −f∗(t) + + α (1 + f∗(t))dt 0 α 1 + f∗(t) Z T −βt 0 ˆ d e f∗(t) ˆ = θ∗ · x + (1 + f∗(t)) log − θ∗ dt 0 dt α α + αf∗(t) Z T −βt 0 d e f∗(t) ˆ + log − θ∗ (1 + f∗(t))dt 0 dt α α + αf∗(t) −βT 0 ˆ e f∗(T ) ˆ = θ∗ · x + (1 + f∗(T )) log − θ∗ α α + αf∗(T ) 0 1 f∗(0) ˆ − (1 + f∗(0)) log − θ∗ , α α + αf∗(0)

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 35

0 βt 0 where the ﬁrst equality is due to f∗(t) = αe h∗(t), h∗(0) = 0 and h∗(T ) = x, the third equality is due to (A.5), and other equalities follow from direct computation. Now by

θˆ∗ (A.6), (A.7) and f∗(0) = 0, and the fact that p(t) = C(T − t; α ) for any 0 ≤ t ≤ T , we obtain

θˆ I (h ) = θˆ · x − p(0) + ∗ N ∗ ∗ α ! θˆ θˆ = θˆ · x − C T ; ∗ + ∗ ∗ α α θ θ = sup θx − C T ; + , θ∈R α α

Therefore, h∗ is indeed the optimal sample path. The proof is complete.

A.2. Proofs of results in Section 2.3.1

Proof of Theorem 5. We want to apply G¨artner-Ellistheorem to obtain the large deviations principle. Since the moment generating functions of Zt and Nt involves nonlinear ODEs, the key idea in the proof is to use Gronwall’s inequality for nonlinear ODEs to obtain estimates for the ODE solutions. Below we prove part (i) and (ii) separately. (i) We ﬁrst prove part (i). Suppose that we can show

( θ 2 h θ i 1 for any θ < 2 , tn Zt T 1− α2θT α T lim log E e tn n = 2 (A.10) n→∞ n ∞ otherwise.

θ 2 ∂ h θ i It is easy to check that 1 2 is differentiable in θ for any θ < α2T , and ∂θ 1 2 = 1− 2 α θT 1− 2 α θT 1 2 1 2 2 → ∞ as θ → α2T . Thus, we verified the essential smoothness condition. By (1− 2 α θT ) ZtnT Gärtner-EllisTheorem, P( n ∈ ·) satisfies a large deviation principle with the speed n and the good rate function tn √ θ 2( x − 1)2 Iˆ (x) = sup θx − = , Z 1 2 2 θ< 2 1 − α θT α T α2T 2 if x ≥ 0 and +∞ otherwise. Therefore, it suffices to prove (A.10). We focus on the case θ 6= 0 since the proof is trivial for the case θ = 0. From the moment generating function of Zt in (4.1), one readily obtains

h θ i tn Zt T θ tn θ log E e tn n = tn · A tnT ; + · B tnT ; , n tn n tn

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 36 Gao and Zhu where A are B are solutions to the ODEs in (4.2) and (4.3). Here the initial conditions are given by A 0; θ = θ and B 0; θ = 0. So in order to show (A.10), it suﬃces to tn tn tn show ( θ 2 θ 1− 1 α2θT , for θ < α2T , lim t · A t T ; = 2 (A.11) n n 2 n→∞ tn +∞, for θ ≥ α2T , and  tn θ 2 limn→∞ n · B tnT ; t = 0, for θ < α2T , n (A.12) tn θ 2 lim infn→∞ · B tnT ; ≥ 0, for θ ≥ 2 .  n tn α T

We ﬁrst prove (A.11). The idea is to use Gronwall’s inequality for nonlinear ODEs to obtain estimates for A. Write g(x) = eαx − αx − 1. Then the ODE in (4.2) becomes A0(t) = g(A(t)) in the critical case α = β. Given small , η > 0, there exists some δ > 0 α2 2 α2 2 such that ( 2 − )x ≤ g(x) ≤ ( 2 + )x and |g(x)| ≤ η|x| when |x| ≤ δ. If we write θ cn = sup t ≥ 0 : A s; ≤ δ, for all s ≤ t , tn then we obtain for n large,

2 2 α 2 θ 0 θ α 2 θ − A t; ≤ A t; ≤ + A t; , for all t ∈ [0, cn].(A.13) 2 tn tn 2 tn Note that the solution to the ODE α2 y0(t) = + y2(t), 2

y(0) = y0, is given by 1 α2 −1 y(t) = − + t , (A.14) y0 2 which is clearly an non-decreasing function when y is properly deﬁned. We next discuss three cases to prove (A.11). Case 1: θ < 0. When α = β, it is clear from (4.2) that A is non-decreasing in t. So for n large such that 0 > A 0; θ = θ > −δ, one readily checks that 0 > A(t; θ ) > −δ for all t ≥ 0. tn tn tn That is, cn = +∞. Hence, we deduce from (A.13)–(A.14) and Gronwall’s inequality for nonlinear ODEs that

t α2 −1 θ t α2 −1 n − − t ≤ A t; ≤ n − + t , for all t ≥ 0. θ 2 tn θ 2

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 37

This implies

1 α2 −1 θ − − T ≤ lim inf tn · A tnT ; θ 2 n→∞ tn θ 1 α2 −1 ≤ lim sup tn · A tnT ; ≤ − + T . (A.15) n→∞ tn θ 2 Letting → 0, we obtain that for any θ < 0,

θ 1 α2 −1 θ lim tnA tnT ; = − T = . (A.16) n→∞ 1 2 tn θ 2 1 − 2 α θT 2 Case 2: 0 < θ < α2T . α2 In this case, we have θT ( 2 + ) < 1 for small enough. This implies that given θ θ y(0) = A(0; ) = , the ODE solution y in (A.14) is properly deﬁned for t = tnT , and tn tn its value is given by

t α2 −1 1 θ y(t T ) = n − + t T = · . n θ 2 n t α2 n 1 − θT ( 2 + )

Hence from the non-decreasing property of y, we have 0 < y(t) < δ for all t ∈ [0, tnT ] when tn is large. This implies tnT ≤ cn. Following the proof of Case 1, we can deduce from (A.13)–(A.14) and Gronwall’s inequality for nonlinear ODEs that (A.16) holds for 2 0 < θ < α2T as well. 2 Case 3: θ ≥ α2T . 2 When θ ≥ α2T , we prove (A.11) by contradiction. Suppose θ lim inf tn · A tnT ; = M < +∞. n→∞ tn

Then there is a subsequence {nk} such that θ tnk · A tnk T ; ≤ 2M, for nk large. (A.17) tnk

θ This implies when nk is large, we have 0 < A tnk T ; ≤ δ, which further implies tnk cnk ≥ tnk T. Similar as before, we can use (A.13) and apply Gronwall’s inequality for nonlinear ODEs and obtain that θ 1 θ 1 1 A t T ; ≥ · ≥ · , nk t t α2 t T nk nk 1 − θT ( 2 − ) nk 2 where the last inequality is due to the fact that θ ≥ α2T . However, this is a contradiction 2 with (A.17) since we can choose arbitrarily small. Hence, for θ ≥ α2T , we have θ lim tnA tnT ; = +∞. n→∞ tn

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 38 Gao and Zhu

Therefore, we have proved (A.11). 2 We next prove (A.12). When θ < α2T , we have shown that tnT ≤ cn for n large and θ αx thus |A(t; )| ≤ δ for all t ∈ [0, tnT ]. together with the fact that |e − 1 − αx| ≤ η|x| tn 2 when |x| ≤ δ, we deduce from (4.3) that for θ < α2T ,

Z tnT θ αA s; θ ( tn ) B tnT ; = µ e − 1 ds tn 0 Z tnT θ ≤ µ(α + η) A s; ds. 0 tn

2 In conjunction with the inequality (A.15), it is readily veriﬁed that for θ < α2T , θ lim (tn/n) · B tnT ; = 0. n→∞ tn

2 For θ ≥ α2T , it is clear that A is always positive for all t. Thus we infer from (4.3) that B is nonnegative and

θ lim inf(tn/n) · B tnT ; ≥ 0. n→∞ tn

Therefore, we have proved (A.12) and thus (A.10) holds. The proof of part (i) is complete. (ii) We next prove part (ii). Recall the moment generating function of Nt in (4.6). Since Z0 = n, we infer that

θ tn 2 NtnT tn θ θ θ tn log E e = − 2 n + C tnT ; 2 n + D tnT ; 2 , (A.18) n n αtn αtn αtn

θ θ where C,D solve the ODEs in (4.4) and (4.5) with initial condition C(0; 2 ) = 2 and αtn αtn θ D(0; 2 ) = 0. αtn To study its limiting behavior as n → ∞, we ﬁrst prove √ √  −θ −α α tanh √ −θT if θ ≤ 0 θ  √ 2 √ 2 lim tn · C tnT ; 2 = √ . (A.19) n→∞ αt θ √α n  √α tan θT if θ > 0 2 2 We focus on θ 6= 0 since the case θ = 0 is trivial to prove. Similar as in the proof of part (i), we obtain for n large, and for all t ∈ [0, dn]

2 α 2 θ θ 0 θ − C t; 2 + αC 0; 2 ≤ C t; 2 (A.20) 2 αtn αtn αtn 2 α 2 θ θ ≤ + C t; 2 + αC 0; 2 , 2 αtn αtn

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 39 where θ dn = sup t ≥ 0 : C s; 2 ≤ δ, for all s ≤ t . αtn We again want to use Gronwall’s inequality for nonlinear ODEs to obtain estimates for C. To this end, we ﬁrst study the solution zk to the Riccati equation

0 2 z (t) = kz (t) + αz0, (A.21) θ z(0) = z0 = 2 , (A.22) αtn

α2 α2 where we are interested in k = 2 + and k = 2 − . We discuss the cases θ > 0 and θ < 0 separately. When θ > 0, the solution to the Riccati equation (A.21) and (A.22) is given by √ √ √ kθ kθ kθ kθ sin t + 2 cos t 1 tn tn αtn tn zk(t) = · √ √ √ . k cos kθ t − kθ sin kθ t tn tn tn It follows from (A.20) and Gronwall’s inequality for nonlinear ODEs that θ 2 2 z α −(t) ≤ C t; 2 ≤ z α +(t), for t ∈ [0, dn]. (A.23) 2 αtn 2

α2 α2 −1 For k = 2 + or k = 2 − , it is clear that |zk(tnT )| = O(tn ) as tn → ∞, from which one can verify that tnT ≤ dn. Together with (A.23), we obtain √ s ! θ θ α2 lim sup t · C t T ; ≤ lim t · z 2 (t T ) = tan + θT . n n 2 n α + n q n→∞ αtn n→∞ 2 α2 2 2 + Setting → 0, we ﬁnd √ θ θ α √ lim sup tn · C tnT ; ≤ tan √ θT . αt2 √α n→∞ n 2 2 A similar argument leads to √ θ θ α √ lim inf tn · C tnT ; ≥ tan √ θT . n→∞ αt2 √α n 2 2 So we have proved (A.19) when θ > 0. When θ < 0, we can similarly solve the Riccati equation (A.21) and (A.22) and obtain √ √ k|θ| k|θ| p exp − r · exp − 1 k|θ| tn tn zk(t) = − · · √ √ , k t k|θ| k|θ| n exp + r · exp − tn tn

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 40 Gao and Zhu

q k|θ| 1− 2 where r := αtn . Thus we infer from (A.23) that for k = α + , q k|θ| 2 1+ αtn

θ p|θ| √ p lim sup tn · C tnT ; 2 ≤ lim tn · zk(tnT ) = − tanh k|θ|T . n→∞ αtn n→∞ k Setting → 0, we ﬁnd for θ < 0, p θ |θ| α p lim sup tn · C tnT ; ≤ − tanh √ |θ|T . αt2 √α n→∞ n 2 2 A similar argument leads to p θ |θ| α p lim inf tn · C tnT ; ≥ − tanh √ |θ|T . n→∞ αt2 √α n 2 2 So we have proved (A.19) when θ < 0. We next show θ lim (tn/n) · D tnT ; 2 = 0. n→∞ αtn θ Since we have shown |C(t; 2 )| ≤ δ for all t ∈ [0, tnT ], together with the fact that αtn |eαx − 1 − αx| ≤ η|x| when |x| ≤ δ, we deduce that ! Z tnT αC s; θ θ αt2 D t T ; = µ e n − 1 ds n 2 αtn 0 Z tnT θ ≤ µ(α + η) C s; 2 ds 0 αtn Z T θ = µ(α + η) tn · C stn; 2 ds. 0 αtn Given (A.19) and (A.24), we obtain θ lim sup D tnT ; 2 ≤ K, for some constant K < ∞. n→∞ αtn

Since tn/n → 0, we then deduce that θ lim (tn/n) · D tnT ; 2 = 0. n→∞ αtn Hence, we infer from (A.18) that  √ √ −θ −√α θ √α tanh −θT if θ ≤ 0 tn 2 NtnT  2 Λ(θ) := lim log E e tn = √ 2 √ . n→∞ n θ √α  √α tan θT if θ > 0 2 2

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 41

It is easy to verify that Λ(θ) is diﬀerentiable in θ < 0 and

d 1 1 α √ T α √ Λ(θ) = √ tan √ θT + sec2 √ θT for θ < 0, dθ √α 2 2 θ 2 2 2 and Λ(θ) is diﬀerentiable in θ > 0 and

d −1 1 −α√ T −α√ Λ(θ) = √ tanh √ −θT + sech2 √ −θT for θ > 0. dθ √α 2 2 −θ 2 2 2 Thus, it is easy to see that Λ0(0+) = Λ0(0−) = T, which implies that Λ(θ) is also diﬀerentiable at θ = 0. An application of G¨artner-Ellis Theorem yields the result.

A.3. Proofs of results in Section 2.3.2

Proof of Theorem 6. The approach is similar as in the proof of Theorem 5: use Gron- wall’s inequality for nonlinear ODEs to obtain estimates, and then apply G¨artner-Ellis theorem to obtain the large deviations principle. (i) We ﬁrst prove part (i). We claim that for Z0 = n and any θ ∈ R, 1 h θ i 1 θ θ Zt T lim log E e n n = lim · nA tnT ; + B tnT ; = θ. n→∞ nT n→∞ nT n n

When θ = 0, the above holds trivially. So in the following we focus on θ 6= 0. We ﬁrst show that 1−T θ lim n · A tnT ; = θ. n→∞ n Write g(x) = eαx − αx − 1. Then given any small η > 0, there exists some δ > 0 and K > 0 such that |g(x)| ≤ min{η|x|, Kx2} when |x| < δ. Recall from (4.2) that A solves the ODE: θ θ θ A0 t; = (α − β)A t; + g A t; , (A.24) n n n θ θ A 0; = . n n

θ Suppose that for n large, we have A t; n ≤ δ for all t ∈ [0, tnT ]. Then we obtain θ θ θ θ θ (α − β)A t; − KA2 t; ≤ A0 t; ≤ (α − β)A t; + KA2 t; . n n n n n

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 42 Gao and Zhu

The solution to the Bernoulli equation

y0(t) = (α − β)y(t) + Ky2(t), θ y(0) = A(0) = , n is given by 1 K K −1 y(t) = + · e(β−α)t − . y(0) α − β α − β Hence we have −1 n1−T K K y(t T ) = + · n−T − . n θ α − β α − β It is clear that y is a monotone function, and we thus deduce that |y(t)| ≤ δ for all 2 t ∈ [0, tnT ] when n is large. Note that L(y) := (α − β)y + Ky is Lipshitz continuous when |y| ≤ δ. Then by Gronwall’s inequality for nonlinear ODEs, we obtain

A(t) ≤ y(t), for t ∈ [0, tnT ], which further implies that

−1 n1−T K K A(t T ) ≤ y(t T ) = + · n−T − . n n θ α − β α − β

Similarly, we can ﬁnd

−1 n1−T −K −K A(t T ) ≥ + · n−T − . n θ α − β α − β

Thus we get 1−T θ lim n · A tnT ; = θ. n→∞ n θ So the only remaining step is to show for n large, we have |A(t; n )| ≤ δ for all t ∈ [0, tnT ]. To this end, we ﬁrst deﬁne, with a slight abuse of notation (see also Section A.2) θ cn = sup t ≥ 0 : A s; ≤ δ, for all s ≤ t , n

θ θ and note that A 0; n = n < δ for all large n. So it suﬃces to show tnT ≤ cn. We note from the ODE in (A.24) that

θ θ Z t θ A t; = A 0; · e(α−β)t + e(α−β)(t−s)g A s; ds. n n 0 n

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 43

θ θ θ Note that A t; n < δ for all t ∈ [0, cn]. Then we have g A t; n ≤ η A(t; n ) for all t ∈ [0, cn]. Hence we obtain for t ∈ [0, cn] Z t −(α−β)t θ θ −(α−β)s θ e A t; ≤ A 0; + η e A s; ds. n n 0 n Gronwall’s inequality then implies θ θ (α−β)t ηt A t; ≤ A 0; · e · e , for t ∈ [0, cn]. (A.25) n n

θ Now notice that for A 0; n = θ/n and t = tnT where T < 1, the right–hand–side of the above inequality becomes T (1+ η )−1 |θ| · n α−β which is smaller than δ when n is large and η is set suﬃciently small. Hence when n is θ large, we have tnT ≤ cn and thus A t; n ≤ δ for all t ∈ [0, tnT ]. Next we prove 1 θ lim · B tnT ; = 0. (A.26) n→∞ nT n Recall from the ODE for B that

Z tnT θ αA(s; θ ) B tnT ; = µ e n − 1 ds. n 0

θ Note that for n large, we have |A(t; n )| ≤ δ for all t ∈ [0, tnT ]. Together with the fact that |eαx − 1 − αx| ≤ η|x| when |x| ≤ δ, we deduce that

Z tnT θ θ B tnT ; ≤ µ(α + η) A s; ds n 0 n |θ| Z tnT ≤ µ(α + η) · · e(α−β+η)tdt n 0 |θ| T (1+ η ) = µ(α + η) · · n α−β − 1 . n

θ θ where in the second inequality we have used (A.25) and A(0; n ) = n . Thus (A.26) readily follows. So the proof of part (i) is complete after applying the G¨artner-EllisTheorem. (ii) We next prove part (ii). The proof is similar to that of part (i), so we only outline the key steps. Recall that

h θ N i − θ n C(t T ; θ )n+D(t T ; θ ) E e n tnT = e αn e n αn n αn ,

θ θ where C,D solve the ODEs in (4.4) and (4.5) with initial condition C(0; αn ) = αn and θ D(0; αn ) = 0.

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 44 Gao and Zhu

It suﬃces to show 1−T θ 1 lim n C tnT ; = θ, (A.27) n→∞ αn α − β −T θ lim n D tnT ; = 0, (A.28) n→∞ αn

θ θ When C(0; nα ) = nα , we look at the Riccati equation below to obtain estimates for C: βθ z0(t) = (α − β)z(t) + Kz2(t) + , (A.29) nα θ z(0) = . nα

2 βθ When n is large, we have (α − β) > 4K · nα . This implies the constant function r ! 1 βθ z¯ := (α − β)2 − 4K · − (α − β) 2K nα is a particular solution to the Riccati equation (A.29). In addition, an application of Taylor expansion yields −β θ z¯ = · + O(n−2), as n → ∞. α − β nα Write v(t) = z(t) − z¯. Then it is clear that 1 θ v(0) = z(0) − z¯ = · + O(n−2). (A.30) α − β n Moreover, one readily veriﬁes that v satisﬁes the Bernoulli equation

v0(t) = (α − β + 2Kz¯)v(t) + Kv2(t), which implies that

1 K K −1 v(t) = + · e(β−α−2Kz¯)t − . (A.31) v(0) α − β + 2Kz¯ α − β + 2Kz¯

θ Suppose that for n large, we have |C(t; αn )| ≤ δ for all t ∈ [0, tnT ]. Then we obtain from Gronwall’s inequality for nonlinear ODEs that

C(tnT ) ≤ z(tnT ) = v(tnT ) +z. ¯ Together with (A.30) and (A.31), we deduce that 1−T θ 1−T 1 lim sup n C tnT ; ≤ lim sup n (v(tnT ) +z ¯) = θ. n→∞ αn n→∞ α − β

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 45

A similar argument leads to 1−T θ 1 lim inf n C tnT ; ≥ θ. n→∞ αn α − β

θ Hence the proof of (A.27) is complete if we can verify that |C(t; αn )| ≤ δ for all t ∈ [0, tnT ]. Similarly as before, this can be done using the facts that |g(x)| ≤ ηx for x small and applying Gronwall’s inequality to obtain the following bound on the function C: Z t θ θ (β−α)s (α−β)t ηt C t; ≤ C 0; · 1 + β e ds · e · e , for t ∈ [0, dn]. nα nα 0

θ where dn = sup{t ≥ 0 : |C(t; αn )| ≤ δ, for all s ≤ t}. In addition, the proof of (A.28) follows similarly as for the proof of (A.26). The proof is complete after applying the G¨artner-EllisTheorem.

A.4. Proofs of results in Section 2.3.3

Proof of Theorem 7. We apply G¨artner-Ellistheorem. The key idea is to study asymptotic behavior of the solutions of the ODEs (4.4) and (4.5) that characterize the moment generating function of Nt. Recall from (4.6) that for Z0 = n, we have 1 θ θ 1 θ log [eθNnT ] = C nT ; − + D nT ; , (A.32) n E α α n α

θ θ θ with initial condition C(0; α ) = α and D(0; α ) = 0. Hence to study the limit of (A.32) as n → ∞ and then apply G¨artner-Ellistheorem, we need to look at the asymptotic behavior of the ODE solutions C and D as t → ∞. To this end, let θβ F (x) := −βx + eαx − 1 + . α 0 θ θ Then Equation (4.4) becomes C (t; α ) = F (C(t; α )). It is clear that F is a convex 1 β function and F (±∞) = ∞. In addition, F (x) achieves its minimum at x = α log α at which F 0(x) = 0 and

1 β β β β θβ F log = − log + − 1 + . α α α α α α

α α Thus, minx F (x) ≤ 0 if and only if θ ≤ θc := β − log β − 1. When θ ≤ θc, since α < β, 0 θ θ one readily veriﬁes that F (C(0; α )) < 0. Hence the ODE solution C(t; α ) converges to the smaller solution x∗(θ) of the equation θβ F (x) = −βx + eαx − 1 + = 0, α

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 46 Gao and Zhu as t → ∞. Note from (4.4) and (4.5) we have

θ θ θ Z t θ µθβ D t; = µ · C t; − C 0; + µβ C s; ds − t. (A.33) α α α 0 α α

θ D(t; α ) ∗ µθβ Therefore, t → µβx (θ) − α as t → ∞ and for any θ ≤ θc, 1 θN θ θ 1 θ lim log E[e nT ] = lim C nT ; − + D nT ; n→∞ n n→∞ α α n α θ µθβ = − + x∗(θ) + µβx∗(θ) − T, α α

∗ 1 θNnT ∗ αx (θc) and limn→∞ n log E[e ] = +∞ otherwise. For θ = θc, we get −βx (θc) + e − β α β ∗ log(β/α) α log β − α = 0, which implies that x (θc) = α . By diﬀerentiating the equation ∗ αx∗(θ) θβ −βx (θ) + e − 1 + α = 0 with respect to θ, we get d (β/α) x∗(θ) = → ∞, as θ → θ . dθ β − αeαx∗(θ) c

NnT Thus, we verified the essential smoothness condition. By Gärtner-Ellistheorem, P( n ∈ ·) satisfies a large deviation principle with the rate function θ µθβ I(x) = sup θx + − x∗(θ) − µβx∗(θ) − T . (A.34) θ∈R α α We next solve the optimization problem in (A.34) and simplify the rate function above. At the optimal θ in (A.34), we have 1 d d µβ x + − x∗(θ) − µβ x∗(θ)T + T = 0, α dθ dθ α which implies that d x + 1 + µβ T x∗(θ) = α α . (A.35) dθ 1 + µβT Recall that x∗(θ) satisfies

∗ θβ −βx∗(θ) + eα·x (θ) − 1 + = 0. (A.36) α Diﬀerentiating with respect to θ, we get

d d ∗ β −β x∗(θ) + α x∗(θ)eαx + = 0. (A.37) dθ dθ α Plugging (A.35) into (A.37), we get 1 βx x∗(θ) = log . (A.38) α αx + 1 + µβT

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 Gao and Zhu 47

Plugging this into (A.36), we get the optimal θ in (A.34) is given by

βx αx α θ = log − + . (A.39) αx + 1 + µβT αx + 1 + µβT β

Finally, substituting (A.38) and (A.39) into (A.34), we get

βx αx + 1 + µβT I(x) = x log − x + . αx + 1 + µβT β The proof is therefore complete.

Proof of Theorem 9. The proof is similar to that of Theorem 7, so we only provide a sketch. From the proof of Theorem 7, we have

θ θ θ θNt T − n C(tnT ; )n+D(tnT ; ) E e n = e α e α α ,

α α θ ∗ where for θ ≤ θc := β − log β − 1, limn→∞ C(tnT ; α ) = x (θ), the smaller solution to αx θβ the equation −βx + e − 1 + α = 0. In addition, we infer from (A.33) that

θ θ θ Z tnT θ µθβ D tnT ; = µ C tnT ; − + µβ C s; ds − tnT. α α α 0 α α

tn Therefore, if limn→∞ n = 0, we have for θ ≤ θc, 1 θ θ 1 θ θNt T lim log E e n = lim C tnT ; − + D tnT ; n→∞ n n→∞ α α n α θ = − + x∗(θ), α

1 θNtnT tn and limn→∞ n log E e = ∞ otherwise. Similarly, if limn→∞ n = ∞, we have for θ ≤ θc, 1 n θ θ 1 θ θNt T lim log E e n = lim · C tnT ; − + D tnT ; n→∞ tn n→∞ tn α α tn α µθβ = µβx∗(θ)T − T, α

1 θNt T d ∗ and limn→∞ log e n = ∞ otherwise. We can also check that x (θ) → ∞ as tn E dθ θ → θc. Therefore, by G¨artner-Ellistheorem and following the proof of Theorem 7, we have proved the desired results.

Proof of Theorem 10. The proof is similar as the proof of Theorem 6, so we only provide a sketch. From (4.1) we have

h θ i θ θ γ Zt T A(tnT ; γ )n+B(tnT ; γ ) E e n n |Z0 = n = e n n .

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017 48 Gao and Zhu

Similar as in the proof of Theorem 6, we can show that

1 θ γ Zt T lim log E[e n n |Z0 = n] = θ. n→∞ n1−γ−T The result then follows from G¨artner-Ellistheorem.

imsart-bj ver. 2014/10/16 file: Bernoulli_LDP_Gao_Zhu_accept.tex date: April 12, 2017