IEEE TRANSACTIONS ON INFORMATION THEORY

Large deviation bounds for functionals of Viterbi paths

Arka P. Ghosh, Elizabeth Kleiman, and Alexander Roitershtein

Abstract—In a number of applications, the underlying stochastic process is modeled as a finite-state discrete-time Markov chain that cannot be observed directly and is represented by an auxiliary process. The maximum a posteriori (MAP) estimator is widely used to estimate states of this chain through available observations. The MAP path estimator based on a finite number of observations is calculated by the Viterbi algorithm, and is often referred to as the Viterbi path. It was recently shown in [2], [3] and [16], [17] (see also [12], [15]) that under mild conditions, the sequence of estimators of a given state converges almost surely to a limiting regenerative process as the number of observations approaches infinity. This in particular implies a law of large numbers for some functionals of hidden states and finite Viterbi paths. The aim of this paper is to provide the corresponding large deviation estimates.

Index Terms—hidden Markov models, maximum a posteriori path estimator, Viterbi algorithm, large deviations, regenerative processes.

I. INTRODUCTION AND STATEMENT OF RESULTS

Let $(X_n)_{n\ge 0}$ be a discrete-time irreducible and aperiodic Markov chain in a finite state space $D = \{1, \ldots, d\}$, $d \in \mathbb{N}$. We interpret $X = (X_0, X_1, \ldots)$ as an underlying state process that cannot be observed directly, and consider a sequence of accessible observations $Y = (Y_0, Y_1, \ldots)$ that serves to produce an estimator for $X$. For instance, $Y_n$ can represent the measurement of a signal $X_n$ disturbed by noise. Given a realization of $X$, the sequence $Y$ is formed by independent random variables $Y_n$ valued in a measurable space $(\mathcal{Y}, \mathcal{S})$, with $Y_n$ dependent only on the single element $X_n$ of the sequence $X$.

Formally, let $\mathbb{N}_0 := \mathbb{N} \cup \{0\}$ and assume the existence of a probability space $(\Omega, \mathcal{F}, P)$ that carries both the Markov chain $X$ as well as an i.i.d. sequence of $(\Omega, \mathcal{F})$-valued random variables $\omega_n$, $n \in \mathbb{N}_0$, independent of $X$, such that

$$Y_n = h(X_n, \omega_n) \tag{1}$$

for some measurable function $h : D \times \Omega \to \mathcal{Y}$. We assume that $X$ is stationary under $P$. Let $p_{ij}$, $i, j \in D$, be the transition matrix of the Markov chain $X$. Then the sequence of pairs $(X_n, Y_n)_{n \in \mathbb{N}_0}$ forms, under the law $P$, a stationary Markov chain with transition kernel defined for $n \in \mathbb{N}_0$, $i, j \in D$, $y \in \mathcal{Y}$, and $A \in \mathcal{S}$ by

$$K(i, y; j, A) = P(X_{n+1} = j,\, Y_{n+1} \in A \mid X_n = i,\, Y_n = y) = p_{ij}\, P\big((j, \omega_0) \in h^{-1}(A)\big). \tag{2}$$

Notice that $K(i, y; j, A)$ is in fact independent of the value of $y$.

The sequence $(X_n, Y_n)_{n \ge 0}$ constructed above is called a Hidden Markov Model (HMM). We refer to the monographs [4], [6], [7], [9], [10], [13], [19] for a general account of HMM. Hidden Markov Models have numerous applications in various areas such as communications engineering, bio-informatics, finance, applied statistics, and more. An extensive list of applications, for instance to coding theory, speech recognition, pattern recognition, satellite communications, and bio-informatics, can also be found in [2], [3], [16], [17].

One of the basic problems associated with HMM is to determine the most likely sequence of hidden states $(X_n)_{n=0}^m$ that could have generated a given output $(Y_n)_{n=0}^m$. In other words, given a vector $(y_n)_{n=0}^m \in \mathcal{Y}^{m+1}$, $m \in \mathbb{N}_0$, the problem is to find a feasible sequence of states $U_0^{(m)}, U_1^{(m)}, \ldots, U_m^{(m)}$ such that

$$P\big(X_k = U_k^{(m)},\, 0 \le k \le m \mid Y_k = y_k,\, 0 \le k \le m\big) = \max_{(x_k)_{k=0}^m} P\big(X_k = x_k,\, \forall k \mid Y_k = y_k,\, \forall k\big). \tag{3}$$

A vector $(U_n^{(m)})_{n=0}^m$ that satisfies (3) is called the maximum a-posteriori (MAP) path estimator. It can be efficiently calculated by the Viterbi algorithm [8], [19]. In general, the estimator is not uniquely defined even for a fixed outcome of observations $(y_n)_{n=0}^m$. Therefore, we will assume throughout that an additional selection rule (for example, according to the lexicographic order) is applied to produce $(U_n^{(m)})_{n=0}^m$. The MAP path estimator is used for instance in cryptanalysis, speech recognition, machine translation, and statistics, see [7], [8], [9], [19] and also the references in [2], [3], [16], [17].

In principle, for a fixed $k \in \mathbb{N}$, the first $k$ entries of $(U_n^{(m)})_{n=0}^m$ might vary significantly as $m$ increases and approaches infinity. Unfortunately, it seems that very little is known about the asymptotic properties of the MAP path estimator for general HMM. However, it was recently shown in [2], [3] that under certain mild conditions on the transition kernel of an HMM, which were further amended in [16], [17], there exists a strictly increasing sequence of integer-valued non-negative random variables $(T_n)_{n \ge 1}$ such that the following properties hold. (In fact, we use here a slightly strengthened version of the corresponding results in [16], [17]. The proof that the statement holds in this form under the assumptions introduced in [16], [17] is given in Lemma 5 below.)

A. P. Ghosh is with the Departments of Statistics and Mathematics, Iowa State University, Ames, IA 50011, USA; e-mail: [email protected]. E. Kleiman is with the Department of Computer Science, Mount Mercy University, Cedar Rapids, IA 52402, USA; e-mail: [email protected]. A. Roitershtein is with the Department of Mathematics, Iowa State University, Ames, IA 50011, USA; e-mail: [email protected]. MSC2000: primary 60J20, 60F10; secondary 62B15, 94A15. Manuscript received February 16, 2009; revised October 14, 2010.
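To make the objects above concrete, the following self-contained Python sketch implements the Viterbi recursion in the log domain and applies it to a simulated two-state chain observed through Gaussian noise, a special case of (1). Everything here (the function names, the parameter values, the tie-breaking rule) is our own illustrative choice, not code from the paper; ties are broken towards the smallest state index, which is one concrete instance of the selection rule mentioned above.

```python
import numpy as np

def viterbi(log_pi, log_P, log_emit):
    """MAP path estimator (3) computed by the Viterbi algorithm.

    log_pi   : (d,) log initial distribution of X_0
    log_P    : (d, d) log transition matrix, log_P[i, j] = log p_ij
    log_emit : (m+1, d) array with log_emit[n, i] = log f_i(y_n)

    np.argmax returns the smallest maximizing index, so ties are
    broken lexicographically (one possible selection rule).
    """
    m, d = log_emit.shape
    delta = log_pi + log_emit[0]          # best log-score of paths ending at each state
    back = np.zeros((m, d), dtype=int)    # argmax pointers for backtracking
    for n in range(1, m):
        scores = delta[:, None] + log_P   # scores[i, j]: extend a best path from i to j
        back[n] = np.argmax(scores, axis=0)
        delta = scores[back[n], np.arange(d)] + log_emit[n]
    path = np.zeros(m, dtype=int)
    path[-1] = np.argmax(delta)
    for n in range(m - 1, 0, -1):         # backtrack the optimal path
        path[n - 1] = back[n, path[n]]
    return path

# Demo (illustrative parameters): a sticky two-state chain observed in N(0,1) noise.
rng = np.random.default_rng(1)
d, m = 2, 400
P = np.array([[0.95, 0.05], [0.05, 0.95]])
means = np.array([0.0, 4.0])              # f_i = density of N(means[i], 1)
X = np.zeros(m, dtype=int)
for n in range(1, m):
    X[n] = rng.choice(d, p=P[X[n - 1]])
Y = rng.normal(means[X], 1.0)

log_emit = -0.5 * (Y[:, None] - means[None, :]) ** 2   # log f_i(y_n) up to a constant
U = viterbi(np.log([0.5, 0.5]), np.log(P), log_emit)
print("fraction of misestimated states:", np.mean(X != U))
```

With well-separated emission densities the printed fraction is small; this empirical fraction of misestimated states is exactly the kind of path functional whose law of large numbers and large deviations are studied below.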

Here and henceforth we make the convention that $T_0 = 0$.

Condition RT.

(i) $(T_n)_{n \ge 1}$ is a (delayed) sequence of renewal times, that is:
(a) The increments $\tau_n := T_{n+1} - T_n$, $n \ge 0$, are positive integers with probability one.
(b) $(\tau_n)_{n \ge 1}$ is an i.i.d. sequence which is furthermore independent of $\tau_0$.
(c) $E(\tau_n) < \infty$ for $n \ge 0$.

(ii) Furthermore, there exist positive constants $a > 0$ and $b > 0$ such that for all $t > 0$,

$$P(T_1 > t) \le a e^{-bt} \quad \text{and} \quad P(\tau_1 > t) \le a e^{-bt}. \tag{4}$$

In particular, both $T_1$ and $\tau_1$ have finite moments of every order (notice that this assumption includes the first-moment condition (i)-(c) stated above).

(iii) There exists an integer $r \ge 0$ such that for any fixed integer $k \ge 1$,

$$U_n^{(m)} = U_n^{(T_k + r)} \quad \text{for } m \ge T_k + r \text{ and } n \le T_k. \tag{5}$$

In particular, for any $n \ge 0$, the following limit exists and the equality holds with probability one:

$$U_n := \lim_{m \to \infty} U_n^{(m)} = U_n^{(T_{k_n} + r)},$$

where

$$k_n = \min\{m \in \mathbb{N} : T_m \ge n\}, \tag{6}$$

that is, $(k_n)$ is the unique sequence of positive integers such that $T_{k_n - 1} < n \le T_{k_n}$ for all $n \ge 1$.

The interpretation of the above property is that at a random time $T_n + r$, the blocks of estimators $(U_{T_{n-1}+1}^{(m)}, \ldots, U_{T_n}^{(m)})$ are fixed for all $m \ge T_n + r$ regardless of the future observations $(Y_j)_{j \ge T_n + r}$.

(iv) For $n \ge 0$, let

$$Z_n := (X_n, U_n). \tag{7}$$

Then the sequence $Z := (Z_n)_{n \ge 0}$ forms a regenerative process with respect to the embedded renewal structure $(T_n)_{n \ge 1}$. That is (recall the convention $T_0 = 0$), the random blocks

$$W_n = (Z_{T_n}, Z_{T_n + 1}, \ldots, Z_{T_{n+1} - 1}), \quad n \ge 0, \tag{8}$$

are independent and, moreover, $W_1, W_2, \ldots$ (but possibly not $W_0$) are identically distributed.

Following [2], [3], [16], [17] we refer to the sequence $U = (U_n)_{n \ge 0}$ as the infinite Viterbi path. Condition RT yields a nice asymptotic behavior of the sequence $U_n$; see Proposition 1 in [2] for a summary and for instance [11] for a general account of regenerative processes. In particular, it implies the law of large numbers that we describe below.

Let $\xi : D^2 \to \mathbb{R}$ be a real-valued function, and for all $n \ge 0$, $m \ge n$ define

$$\xi_n = \xi(X_n, U_n) \quad \text{and} \quad \xi_n^{(m)} = \xi\big(X_n, U_n^{(m)}\big), \tag{9}$$

and

$$S_n = \sum_{k=0}^{n-1} \xi_k \quad \text{and} \quad \widehat{S}_n = \sum_{k=0}^{n-1} \xi_k^{(n)}, \tag{10}$$

where we convene that $S_0 = \widehat{S}_0 = 0$. An important practical example, which we borrow from [2], is

$$\xi(x, y) = \mathbf{1}_{\{x \ne y\}} = \begin{cases} 1 & \text{if } x \ne y, \\ 0 & \text{if } x = y. \end{cases}$$

In this case $S_n$ and $\widehat{S}_n$ count the number of places where the realization of the state $X_k$ differs from the estimators $U_k$ and $U_k^{(n)}$, respectively.

If Condition RT holds, then (see [2], [16]) there is a unique probability measure $Q$ on the infinite product space $(D^2)^{\mathbb{N}_0}$ where the sequence $(Z_n)_{n \ge 0}$ is defined, which makes $(W_n)_{n \ge 0}$ introduced in (8) into an i.i.d. sequence having the same distribution as $(W_n)_{n \ge 1}$. Furthermore, under Condition RT, the following limits exist and the equalities hold:

$$\mu := \lim_{n \to \infty} \frac{S_n}{n} = \lim_{n \to \infty} \frac{\widehat{S}_n}{n} = \frac{E_Q\big(\widehat{S}_{T_1}\big)}{E_Q(T_1)}, \quad P\text{-a.s. and } Q\text{-a.s.}, \tag{11}$$

where $E_Q$ denotes the expectation with respect to the measure $Q$.

A set of specific conditions on HMM, which ensures the existence of a sequence $(T_n)_{n \ge 1}$ satisfying Condition RT, is given below in Assumption 3. The goal of this paper is to provide the following large deviation estimates, complementary to the above law of large numbers.

Theorem 1. Let Assumption 3 hold. Then either we have $Q\big(\widehat{S}_{T_1} = \mu T_1\big) = 1$ or there exist a constant $\gamma \in (0, \mu)$ and a function $I(x) : (\mu - \gamma, \mu + \gamma) \to [0, +\infty)$ such that

(i) $I(x)$ is lower semi-continuous and convex, $I(\mu) = 0$, and $I(x) > 0$ for $x \ne \mu$.

(ii) For $x \in (\mu, \mu + \gamma)$,
$$\lim_{n \to \infty} \frac{1}{n} \log P\big(\widehat{S}_n \ge nx\big) = -I(x).$$

(iii) For $x \in (\mu - \gamma, \mu)$,
$$\lim_{n \to \infty} \frac{1}{n} \log P\big(\widehat{S}_n \le nx\big) = -I(x).$$

The rate function $I(x)$ is specified in (17) below. The proof of Theorem 1 is included in Section II. First we show that Assumption 3 implies that Condition RT is satisfied for an appropriate sequence of random times $(T_n)_{n \ge 1}$. This is done in Lemma 5 below. Then we show that as long as the non-degeneracy condition $Q\big(\widehat{S}_{T_1} = \mu T_1\big) \ne 1$ is satisfied, the existence of the regeneration structure described by Condition RT implies the large deviation result stated in Theorem 1.

Related results for non-delayed regenerative processes (in which case $W_0$ defined in (8) has the same distribution as the rest of the random blocks $W_n$, $n \ge 1$) can be found for instance in [1], [14], [18]. We cannot apply general large deviation results for regenerative processes directly to our setting for the following two reasons. First, in our case the first block $W_0$ is in general distributed differently from the rest of the blocks

$W_n$, $n \ge 1$, defined in (8). Secondly, the estimators $U_i^{(n)}$ for $i > T_{k_n - 1} + r$ differ in general from the limiting random variables $U_i$. In principle, the contribution of these two factors to the tail asymptotics on the large deviation scale could be significant, because both $W_0$ and $\sum_{i = T_{k_n - 1} + r + 1}^{n} \xi_i^{(n)}$ (where the last sum is assumed to be empty if $T_{k_n - 1} + r + 1 > n$) have exponential tails. However, it turns out that this is not the case, and the large deviation result for our model holds with the same "classical" rate function $I(x)$ as it does for the non-delayed purely regenerative processes in [1], [14], [18]. In fact, only a small modification is required in order to adapt the proof of a large deviation result for regenerative processes in [1] to our model.

We next state our assumptions about the underlying HMM. The set of assumptions that we use is taken from [16], [17]. We assume throughout that the distribution of $Y_n$, conditioned on $X_n = i$, has a density with respect to a reference measure $\nu$ for all $i \in D$. That is, there exist measurable functions $f_i : \mathcal{Y} \to \mathbb{R}_+$, $i \in D$, such that for any $A \in \mathcal{S}$ and all $i \in D$,

$$P(Y_0 \in A \mid X_0 = i) = \int_A f_i(y)\, \nu(dy). \tag{12}$$

For each $i \in D$, let $G_i = \{y \in \mathcal{Y} : f_i(y) > 0\}$. That is, the closure of $G_i$ is the support of the conditional distribution $P(Y_0 \in \cdot \mid X_0 = i)$.

Definition 2. We call a subset $C \subset D$ a cluster if

$$\min_{j \in C} P\Big(Y_0 \in \bigcap_{i \in C} G_i \,\Big|\, X_0 = j\Big) > 0 \quad \text{and} \quad \max_{j \notin C} P\Big(Y_0 \in \bigcap_{i \in C} G_i \,\Big|\, X_0 = j\Big) = 0.$$

In other words, a cluster is a maximal subset of states $C$ such that $\nu(G_C) > 0$, where $G_C := \bigcap_{i \in C} G_i$.

Assumption 3. [16], [17] For $k \in D$, let $H_k^* = \max_{j \in D} p_{jk}$. Assume that for all $k \in D$,

$$P\Big(f_k(Y_0) H_k^* > \max_{i \ne k} f_i(Y_0) H_i^* \,\Big|\, X_0 = k\Big) > 0. \tag{13}$$

Moreover, there exist a cluster $C \subset D$ and a number $m \in \mathbb{N}$ such that the $m$-th power of the sub-stochastic matrix $H = (p_{ij})_{i,j \in C}$ is strictly positive.

Assumption 3 is taken from the conditions of Lemma 3.1 in [16] and [17], and it is slightly weaker than the one used in [2], [3]. This assumption is satisfied if $C = D$ and for all $k \in D$ and $a > 0$,

$$P\Big(f_k(Y_0) > a \max_{i \ne k} f_i(Y_0) \,\Big|\, X_0 = k\Big) > 0. \tag{14}$$

The latter condition holds for instance if $f_i$ is the density of a Gaussian random variable centered at $i$ with variance $\sigma^2 > 0$. For a further discussion of Assumption 3 and examples of HMM that satisfy this assumption see Section III in [17] and also the beginning of Section II below, where some results of [17] are summarized and their relevance to Condition RT is explained.

II. PROOF OF THEOREM 1

Recall the random variables $\xi_k^{(n)}$ defined in (9). For the rest of the paper we will assume, actually without loss of generality, that $\xi_k^{(n)} > 0$ for all $n \ge 0$ and integers $k \in [0, n]$. Indeed, since $D$ is a finite set, the range of the function $\xi$ is bounded, and hence if needed we can replace $\xi$ by $\bar{\xi} = B + \xi$ with some $B > 0$ large enough such that $\xi_k^{(n)} + B > 0$ for all $n \ge 0$ and $k \in [0, n]$.

We start with the definition of the sequence of random times $(T_n)_{n \ge 1}$ that satisfies Condition RT. A similar sequence was first introduced in [2] in a slightly less general setting. In this paper, we use a version of the sequence which is defined in Section 4 of [16]. The authors of [16] do not explicitly show that Condition RT is fulfilled for this sequence. However, only a small modification of technical lemmas from [16] or [17] is required to demonstrate this result.

Roughly speaking, for some integers $M \in \mathbb{N}$ and $r \in [0, M-1]$, the random times $\theta_n := T_n + r - M + 1$ are defined, with the additional requirement $\theta_{n+1} - \theta_n > M$, as successive hitting times of a set of the form $q \times \mathcal{A}$, $q \in D^M$, $\mathcal{A} \subset \mathcal{Y}^M$, for the Markov chain formed by the vectors of pairs $R_n = (X_i, Y_i)_{i=n}^{n+M-1}$.

More precisely, the following is shown in [16], [17]. See Lemmas 3.1 and 3.2 in either [16] or [17] for (i) and (ii), and Section 4 [p. 202] of [16] for (iii). We notice that a similar regeneration structure was first introduced in [2] under a more stringent condition than Assumption 3.

Lemma 4. [16], [17] Let Assumption 3 hold. Then there exist integer constants $M \in \mathbb{N}$, $r \in [0, M-1]$, a state $l \in D$, a vector of states $q \in D^M$, and a measurable set $\mathcal{A} \subset \mathcal{Y}^M$ such that the following hold:

(i) $P\big((X_n)_{n=0}^{M-1} = q,\, (Y_n)_{n=0}^{M-1} \in \mathcal{A}\big) > 0$.

(ii) If for some $k \ge 0$ we have $(Y_i)_{i=k}^{k+M-1} \in \mathcal{A}$, then $U_j^{(m)} = U_j^{(k+M-1)}$ for any $j \le k + M - 1 - r$ and $m \ge k + M - 1$. Furthermore, $U_{k+M-1-r}^{(m)} = U_{k+M-1-r}^{(k+M-1)} = l$ for any $m \ge k + M - 1$.

(iii) Let
$$\theta_1 = \inf\big\{k \ge 0 : (X_i)_{i=k}^{k+M-1} = q \text{ and } (Y_i)_{i=k}^{k+M-1} \in \mathcal{A}\big\}$$
and for $n \ge 1$,
$$\theta_{n+1} = \inf\big\{k \ge \theta_n + M : (X_i)_{i=k}^{k+M-1} = q \text{ and } (Y_i)_{i=k}^{k+M-1} \in \mathcal{A}\big\}.$$
Define
$$T_n = \theta_n + M - 1 - r. \tag{15}$$
For $n \ge 0$, let $L_n := (Y_n, U_n)$. Then the sequence $(L_n)_{n \ge 0}$ forms a regenerative process with respect to the embedded renewal structure $(T_n)_{n \ge 1}$. That is, the random blocks
$$J_n = (L_{T_n}, L_{T_n + 1}, \ldots, L_{T_{n+1} - 1}), \quad n \ge 0,$$
are independent and, moreover, $J_1, J_2, \ldots$ (but possibly not $J_0$) are identically distributed. Furthermore, there is a deterministic function $g : \bigcup_{n \in \mathbb{N}} \mathcal{Y}^n \to \bigcup_{n \in \mathbb{N}} D^n$ such that $(U_{T_n}, U_{T_n + 1}, \ldots, U_{T_{n+1} - 1}) = g\big(Y_{T_n}, Y_{T_n + 1}, \ldots, Y_{T_{n+1} - 1}\big)$.

In fact, using Assumption 3, the set $\mathcal{A}$ is designed in [16], [17] in such a way that once an $M$-tuple of observable variables $(Y_i)_{i=n}^{n+M-1}$ that belongs to $\mathcal{A}$ occurs, the value of the estimator $U_{n+M-1-r}^{(m)}$ is set to $l$ when $m = n + M - 1$, and, moreover, it remains unchanged for all $m \ge n + M - 1$ regardless of the values of the future observations $(Y_i)_{i \ge n+M}$. The elements of the set $\mathcal{A}$ are called barriers in [16], [17].

Using basic properties of the Viterbi algorithm, it is not hard to check that the existence of the barriers implies claims (ii) and (iii) of the above lemma. In fact, the Viterbi algorithm is a dynamic programming procedure which is based on the fact that for any integer $k \ge 2$ and states $i, j \in D$ there exists a deterministic function $g_{k,i,j} : \mathcal{Y}^k \to D^k$ such that $(U_n^{(m)}, U_{n+1}^{(m)}, \ldots, U_{n+k-1}^{(m)}) = g_{k,i,j}\big(Y_n, Y_{n+1}, \ldots, Y_{n+k-1}\big)$ for all $n \ge 0$ and $m \ge n + k - 1$ once it is decided that $U_n^{(m)} = i$ and $U_{n+k}^{(m)} = j$. See [16], [17] for details.

We have:

Lemma 5. Let Assumption 3 hold. Then Condition RT is satisfied for the random times $T_n$ defined in (15).

Proof: We have to show that properties (i)–(iv) listed in the statement of Condition RT hold for $(T_n)_{n \in \mathbb{N}}$.

(i)-(ii) Notice that $(\theta_n)_{n \ge 1}$ defined in (iii) of Lemma 4 is (besides the additional requirement that $\theta_{n+1} - \theta_n > M$) essentially the sequence of successive hitting times of the set $q \times \mathcal{A}$ for the Markov chain formed by the pairs of $M$-vectors $(X_i, Y_i)_{i=n}^{n+M-1}$, $n \ge 0$. Therefore, (i) and (ii) follow from Lemma 4-(i) and the fact that the $M$-vectors $(X_i)_{i=n}^{n+M-1}$, $n \ge 0$, form an irreducible Markov chain on the finite space $D_M = \big\{x \in D^M : P\big((X_i)_{i=0}^{M-1} = x\big) > 0\big\}$, while the distribution of the observation $Y_i$ depends on the value of $X_i$ only.

(iii) Follows from (ii) of Lemma 4 directly.

(iv) The definition of the HMM together with (15) imply that $(X_n, Y_n)_{n \ge 0}$ is a regenerative process with respect to the renewal sequence $(T_n)_{n \ge 1}$. The desired result follows from this property combined with the fact that $(U_{T_n}, U_{T_n+1}, \ldots, U_{T_{n+1}-1})$ is a deterministic function of $(Y_{T_n}, Y_{T_n+1}, \ldots, Y_{T_{n+1}-1})$ according to part (iii) of Lemma 4.

The proof of the lemma is completed.

We next define the rate function $I(x)$ that appears in the statement of Theorem 1. The rate function is standard in the large deviation theory of regenerative processes (see for instance [1], [14], [18]). Recall $\tau_n$ defined in Condition RT and let $(\sigma, \tau)$ be a random pair distributed under the ("block-stationary") measure $Q$ identically to any of $\big(\sum_{k=T_{n-1}}^{T_n - 1} \xi_k,\, \tau_n\big)$, $n \ge 1$. For any constants $\alpha \in \mathbb{R}$, $\Lambda \in \mathbb{R}$, and $x \ge 0$, set

$$\Gamma(\alpha, \Lambda) = \log E_Q\big(\exp(\alpha\sigma - \Lambda\tau)\big),$$

and define

$$\Lambda(\alpha) = \inf\{\Lambda \in \mathbb{R} : \Gamma(\alpha, \Lambda) \le 0\}, \tag{16}$$

$$I(x) = \sup_{\alpha \in \mathbb{R}}\{\alpha x - \Lambda(\alpha)\}, \tag{17}$$

using the usual convention that the infimum over an empty set is $+\infty$.

We summarize some properties of the above defined quantities in the following lemma. Recall $\mu$ from (11).

Lemma 6. The following hold:

(i) $\Lambda(\alpha)$ and $I(x)$ are both lower semi-continuous convex functions on $\mathbb{R}$. Moreover, $I(\mu) = 0$ and $I(x) > 0$ for $x \ne \mu$.

(ii) If there is no $c > 0$ such that $Q\big(\widehat{S}_{T_1} = cT_1\big) = 1$, then
(a) $\Lambda(\alpha)$ is strictly convex in a neighborhood of 0.
(b) $I(x)$ is finite in some neighborhood of $\mu$, and for each $x$ in that neighborhood we have $I(x) = \alpha_x x - \Lambda(\alpha_x)$ for some $\alpha_x$ such that $(x - \mu)\alpha_x > 0$ for $x \ne \mu$ and $\lim_{x \to \mu} \alpha_x = 0$.

Proof:

(i) The fact that $\Lambda(\alpha)$ and $I(x)$ are both lower semi-continuous convex functions is well-known, see for instance p. 2884 in [1]. Furthermore, (4) implies that $\Gamma\big(\alpha, \Lambda(\alpha)\big) = 0$ for $\alpha$ in a neighborhood of 0. Moreover, the dominated convergence theorem implies that $\Gamma$ has continuous partial derivatives in a neighborhood of $(0, 0)$. Therefore, by the implicit function theorem, $\Lambda(\alpha)$ is differentiable in this neighborhood. In particular, $\Lambda(0) = 0$, and

$$\Lambda'(0) = -\frac{\frac{\partial \Gamma}{\partial \alpha}(0, 0)}{\frac{\partial \Gamma}{\partial \Lambda}(0, 0)} = \mu.$$

Since $\Lambda$ is a convex function, this yields the second part of (i), namely establishes that $I(\mu) = 0$ and $I(x) > 0$ for $x \ne \mu$.

(ii) If there are no constants $b \in \mathbb{R}$ and $c \in \mathbb{R}$ such that $Q\big(\widehat{S}_{T_1} - cT_1 = b\big) = 1$, we can use the results of [1] for both (ii-a) and (ii-b), see Lemma 3.1 and the preceding paragraph on p. 2884 of [1]. It remains to consider the case when $Q\big(\widehat{S}_{T_1} - cT_1 = b\big) = 1$ with $b \ne 0$ but there is no $\tilde{c} > 0$ such that $Q\big(\widehat{S}_{T_1} - \tilde{c}T_1 = 0\big) = 1$. In this case, using the definition (16) and the estimate (4), we have for $\alpha$ in a neighborhood of 0,

$$1 = E_Q\big[\exp\big(\alpha \widehat{S}_{T_1} - \Lambda(\alpha)T_1\big)\big] = E_Q\big[\exp\big(\alpha(b + cT_1) - \Lambda(\alpha)T_1\big)\big] = e^{\alpha b}\, E_Q\big[\exp\big(T_1(\alpha c - \Lambda(\alpha))\big)\big].$$

That is, for $\alpha$ in a neighborhood of 0,

$$E_Q\big[\exp\big(T_1(\alpha c - \Lambda(\alpha))\big)\big] = e^{-\alpha b}. \tag{18}$$

Suppose $\Lambda(\alpha)$ is not strictly convex in any neighborhood of 0, that is, $\Lambda(\alpha) = k\alpha$ for some $k \in \mathbb{R}$ and all $\alpha$ in a one-sided (say positive) neighborhood of zero. Let $\alpha_1 > 0$ and $\alpha_2 > \alpha_1$ be two values of $\alpha$ in the aforementioned neighborhood of zero for which (18) is true. Using Jensen's inequality $E_Q\big(X^{\alpha_2/\alpha_1}\big) \ge \big(E_Q(X)\big)^{\alpha_2/\alpha_1}$ with

$$X = \exp\big(T_1(\alpha_1 c - \Lambda(\alpha_1))\big) = \exp\big(T_1 \alpha_1 (c - k)\big),$$

one gets that the identity (18) can hold only if $Q(T_1 = c_0) = 1$ for some $c_0 > 0$. But under this condition and the assumption $Q\big(\widehat{S}_{T_1} - cT_1 = b\big) = 1$, we would have

$$Q\big(\widehat{S}_{T_1} - T_1(c c_0 + b)/c_0 = 0\big) = 1$$

in contradiction to what we have assumed in the beginning of the paragraph. This completes the proof of part (a), namely shows that $\Lambda(\alpha)$ is a strictly convex function in a neighborhood of 0 provided that there is no constant $c > 0$ such that $Q\big(\widehat{S}_{T_1} - cT_1 = 0\big) = 1$.

To complete the proof of the lemma, it remains to show that part (b) of (ii) holds. Although the argument is standard, we give it here for the sake of completeness. We remark that in contrast to the "usual assumptions", we consider properties of $\Lambda(\alpha)$ and $I(x)$ only in some neighborhoods of 0 and $\mu$ respectively, and not in the whole domains where these functions are finite. Recall that $\Lambda(\alpha)$ is an analytic and strictly convex function of $\alpha$ in a neighborhood of zero (see for instance [1] or [18]). In particular, $\Lambda'(\alpha)$ exists and is strictly increasing in this neighborhood. Since $\Lambda'(0) = \mu$, it follows that for all $x$ in a neighborhood of $\mu$, there is a unique $\alpha_x$ such that $x = \Lambda'(\alpha_x)$. Since $\Lambda'(\alpha)$ is a continuous increasing function, we have $\lim_{x \to \mu} \alpha_x = 0$ and $\alpha_x(x - \mu) > 0$. Furthermore, since $\Lambda(\alpha)$ is convex, we obtain $\Lambda(\alpha) - \Lambda(\alpha_x) \ge x(\alpha - \alpha_x)$, and hence $x\alpha_x - \Lambda(\alpha_x) \ge x\alpha - \Lambda(\alpha)$ for all $\alpha \in \mathbb{R}$. This yields $I(x) = x\alpha_x - \Lambda(\alpha_x)$, completing the proof of the lemma.

Remark 7. Part (ii)-(b) of the above lemma shows that the rate function $I(x)$ is finite in a neighborhood of $\mu$ as long as there is no $c > 0$ such that $Q\big(\widehat{S}_{T_1} - cT_1 = 0\big) = 1$. On the other hand, if $Q\big(\widehat{S}_{T_1} = \mu T_1\big) = 1$, then the definition of $\Lambda(\alpha)$ implies $\Lambda(\alpha) = \mu\alpha$ for all $\alpha \in \mathbb{R}$, and hence $I(x) = +\infty$ for $x \ne \mu$.

Part of claim (i) of Theorem 1 is contained in part (i) of the above lemma. Therefore we now turn to the proof of parts (ii) and (iii) of Theorem 1. We start from the observation that the large deviation asymptotics for the lower tail $P\big(\widehat{S}_n \le nx\big)$ with $x < \mu$ can be deduced from the corresponding results for the upper tail $P\big(\widehat{S}_n \ge nx\big)$, $x > \mu$. Indeed, let $\bar{\xi}(i, j) = \mu - \xi(i, j)$ and $\bar{S}_n = \sum_{k=0}^{n-1} \bar{\xi}\big(X_k, U_k^{(n)}\big)$. Then

$$P\big(\widehat{S}_n \le nx\big) = P\big(\bar{S}_n \ge n(\mu - x)\big). \tag{19}$$

Furthermore, for the function $\bar{\xi}$ we have

$$\bar{\Gamma}(\alpha, \Lambda) = \log E_Q\big(\exp(\alpha(\mu\tau - \sigma) - \Lambda\tau)\big) = \log E_Q\big(\exp(-\alpha\sigma - (\Lambda - \alpha\mu)\tau)\big),$$

and hence $\bar{\Lambda}(\alpha) := \inf\{\Lambda \in \mathbb{R} : \bar{\Gamma}(\alpha, \Lambda) \le 0\} = \Lambda(-\alpha) + \mu\alpha$, which in turn implies

$$\bar{I}(x) := \sup_{\alpha \in \mathbb{R}}\{\alpha x - \bar{\Lambda}(\alpha)\} = \sup_{\alpha \in \mathbb{R}}\{-\alpha(\mu - x) - \Lambda(-\alpha)\} = I(\mu - x).$$

Therefore, part (iii) of Theorem 1 can be derived from part (ii) applied to the auxiliary function $\bar{\xi}$.

It remains to prove part (ii) of the theorem. For the upper bound we adapt the proof of Lemma 3.2 in [1].

Proposition 8. Let Assumption 3 hold and suppose in addition that $\xi(i, j) > 0$ for all $i, j \in D$ and $Q\big(\widehat{S}_{T_1} = \mu T_1\big) \ne 1$. Let $I(x)$ be as defined in (17). Then there exists a constant $\gamma > 0$ such that $\limsup_{n \to \infty} \frac{1}{n} \log P\big(\widehat{S}_n \ge nx\big) \le -I(x)$ for all $x \in (\mu, \mu + \gamma)$.

Proof: Recall $k_n$ from (6) and set $R_n = \widehat{S}_n - \widehat{S}_{T_{k_n - 1}}$. We use the following series of estimates, where $\alpha > 0$ and $\Lambda > 0$ are arbitrary positive parameters small enough to ensure that all the expectations below exist (due to the tail estimates assumed in (4)):

$$\begin{aligned}
P\big(\widehat{S}_n \ge nx\big) &= P\big(\widehat{S}_{T_{k_n - 1}} + R_n \ge nx\big) = \sum_{j=0}^{n-1} P\big(\widehat{S}_{T_j} + R_n \ge nx;\; T_j < n \le T_{j+1}\big) \\
&\le \sum_{j=0}^{n-1} P\big(\widehat{S}_{T_j} + R_n \ge nx;\; T_j < n\big) \\
&\le \sum_{j=0}^{n-1} P\big(\alpha(\widehat{S}_{T_j} + R_n - nx) + \Lambda(n - T_j) \ge 0\big) \\
&\le \sum_{j=0}^{n-1} E\big[\exp\big(\alpha(\widehat{S}_{T_j} - xn) - \Lambda(T_j - n) + \alpha R_n\big)\big] \\
&\le E\big[\exp\big(\alpha(\widehat{S}_{T_1} + R_n)\big)\big] \cdot \exp\big(-(\alpha x - \Lambda)n\big) \times \Big\{1 + \sum_{j=1}^{n-1} \exp\big((j-1)\Gamma(\alpha, \Lambda)\big)\Big\} \\
&\le E\big[\exp\big(\alpha(\widehat{S}_{T_1} + R_n)\big)\big] \cdot \exp\big(-(\alpha x - \Lambda)n\big) \times \Big\{1 + \frac{1 - e^{\Gamma(\alpha,\Lambda)n}}{1 - e^{\Gamma(\alpha,\Lambda)}}\Big\},
\end{aligned}$$

where we used the fact that the pairs $\big(\widehat{S}_{T_n} - \widehat{S}_{T_{n-1}},\, T_n - T_{n-1}\big)$ have the same distribution under $P$ and under $Q$ as long as $n \ge 2$.

By the Cauchy-Schwarz inequality,

$$E\big[\exp\big(\alpha(\widehat{S}_{T_1} + R_n)\big)\big] \le \sqrt{E\big[\exp\big(2\alpha \widehat{S}_{T_1}\big)\big] \cdot E\big[\exp\big(2\alpha R_n\big)\big]}.$$

Therefore, using any $C > 0$ such that $\xi(i, j) < C$ for all $i, j \in D$,

$$\limsup_{n \to \infty} \frac{1}{n} \log E\big[\exp\big(\alpha(\widehat{S}_{T_1} + R_n)\big)\big] = \limsup_{n \to \infty} \frac{1}{2n} \log E\big[\exp\big(2\alpha R_n\big)\big] \le \lim_{n \to \infty} \frac{1}{n} \log E\big[\exp\big(2\alpha C(n - T_{k_n - 1})\big)\big] = 0,$$

where the first equality is due to (4), while in the last step we used the renewal theorem, which shows that the law of $n - T_{k_n - 1}$ converges, as $n$ goes to infinity, to a (proper) limiting distribution. Therefore, if $\alpha > 0$ and $\Lambda > 0$ are small enough and $\Gamma(\alpha, \Lambda) \le 0$, then

$$\limsup_{n \to \infty} \frac{1}{n} \log P\big(\widehat{S}_n \ge nx\big) \le -(\alpha x - \Lambda). \tag{20}$$

The rest of the proof is standard. Let

$$A := \sup\big\{\alpha > 0 : E\big(e^{\alpha \widehat{S}_{T_1}}\big) < \infty,\; E\big(e^{\alpha \widehat{S}_{T_2}}\big) < \infty\big\}.$$

It follows from (20) that

$$\limsup_{n \to \infty} \frac{1}{n} \log P\big(\widehat{S}_n \ge nx\big) \le -\sup_{0 < \alpha}\big(\alpha x - \Lambda(\alpha)\big),$$

ACKNOWLEDGEMENT

We would like to thank the Associate Editor and anonymous referees for their comments and suggestions which allowed us