Verbitskiy Pacific Journal of Mathematics for Industry (2016) 8:2, DOI 10.1186/s40736-015-0021-5

ORIGINAL ARTICLE · Open Access

Thermodynamics of the binary symmetric channel

Evgeny Verbitskiy^{1,2}

^1 Mathematical Institute, Leiden University, Postbus 9512, 2300 RA Leiden, The Netherlands
^2 Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen, PO Box 407, 9700 AK Groningen, The Netherlands

Correspondence: [email protected]

Abstract

We study a hidden Markov process which is the result of a transmission of the binary symmetric Markov source over the memoryless binary symmetric channel. This process has been studied extensively in Information Theory and is often used as a benchmark case for so-called denoising algorithms. Exploiting the link between this process and the one-dimensional random field Ising model (RFIM), we are able to identify the Gibbs potential of the resulting hidden Markov process. Moreover, we obtain a stronger bound on the memory decay rate. We conclude with a discussion of the implications of our results for the development of denoising algorithms.

Keywords: Hidden Markov models, Gibbs states, Thermodynamic formalism, Denoising

2000 Mathematics Subject Classification: 37D35, 82B20

1 Introduction

We study the binary symmetric Markov source over the memoryless binary symmetric channel. More specifically, let $\{X_n\}$ be a stationary two-state Markov chain with values $\{\pm 1\}$ and

\[ P(X_{n+1} \ne X_n) = p, \]

where $0 < p < 1$; denote by $P = (p_{x,x'}) = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}$ the corresponding transition probability matrix. Note that $\pi = \bigl(\tfrac12, \tfrac12\bigr)$ is the stationary initial distribution for this chain.

The binary symmetric channel will be modelled as a sequence of Bernoulli random variables $\{Z_n\}$ with

\[ P_Z(Z_n = -1) = \varepsilon, \qquad P_Z(Z_n = 1) = 1 - \varepsilon, \qquad \varepsilon \in (0,1). \]

Finally, put

\[ Y_n = X_n \cdot Z_n \tag{1.1} \]

for all $n$. The process $\{Y_n\}$ is a hidden Markov process, because $Y_n \in \{-1,1\}$ is chosen independently for every $n$ from an emission distribution $\pi_{X_n}$ on $\{-1,1\}$: $\pi_1 = (\varepsilon, 1-\varepsilon)$ and $\pi_{-1} = (1-\varepsilon, \varepsilon)$. More generally, hidden Markov processes are random functions of discrete-time Markov chains, where the value $Y_n$ is chosen according to a distribution which depends on the value $X_n = x_n$ of the underlying Markov chain, independently for every $n$. The applications of hidden Markov processes include automatic character and speech recognition, information theory, and bioinformatics; see [4, 13]. The particular example (1.1) we consider in the present paper is probably one of the simplest examples, and it is often used as a benchmark for testing denoising algorithms.

In particular, this example has been studied rather extensively in connection with the computation of the entropy of the output process $\{Y_n\}$, see e.g. [8–10, 12].

The law $Q$ of the process $\{Y_n\}$ is the push-forward of $P \times P_Z$ under $\psi : \{-1,1\}^{\mathbb Z} \times \{-1,1\}^{\mathbb Z} \to \{-1,1\}^{\mathbb Z}$, with $\psi((x_n, z_n)) = x_n \cdot z_n$. We write $Q = (P \times P_Z) \circ \psi^{-1}$. For every $m \le n$ and $y_m^n = (y_m, \ldots, y_n) \in \{-1,1\}^{n-m+1}$, the probability of the corresponding cylindric set is given by

\[
\begin{aligned}
Q\bigl(y_m^n\bigr) &:= Q(Y_m = y_m, \ldots, Y_n = y_n)
= \sum_{x_m^n,\, z_m^n} P\bigl(x_m^n\bigr) P_Z\bigl(z_m^n\bigr) \prod_{k=m}^{n} \mathbb I\{ y_k = x_k \cdot z_k \} \\
&= \sum_{x_m^n \in \{-1,1\}^{n-m+1}} \frac 12 \prod_{i=m}^{n-1} p_{x_i, x_{i+1}} \cdot \varepsilon^{\#\{i \in [m,n] :\, x_i y_i = -1\}} (1-\varepsilon)^{\#\{i \in [m,n] :\, x_i y_i = 1\}}.
\end{aligned}
\tag{1.2}
\]
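Formula (1.2) is easy to probe numerically. The following sketch (our code, not part of the original paper; all names are ours) evaluates the cylinder probability of a short word by brute-force enumeration of the hidden configurations $x_m^n$:

```python
import itertools

def cylinder_prob(y, p, eps):
    """Brute-force evaluation of Q(Y_m = y_m, ..., Y_n = y_n) via (1.2)."""
    n = len(y)
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        w = 0.5  # stationary initial distribution pi = (1/2, 1/2)
        for i in range(n - 1):
            w *= (1 - p) if x[i] == x[i + 1] else p      # transition p_{x_i, x_{i+1}}
        for i in range(n):
            w *= eps if x[i] * y[i] == -1 else 1 - eps   # channel: flip with prob. eps
        total += w
    return total

# Sanity check: probabilities of all words of length 5 sum to one.
p, eps = 0.3, 0.1
assert abs(sum(cylinder_prob(y, p, eps)
               for y in itertools.product((-1, 1), repeat=5)) - 1.0) < 1e-12
```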

Two particular cases are easy to analyze. If $p = \tfrac12$, then $\{X_n\}$ is a sequence of independent identically distributed random variables with $P(X_n = -1) = P(X_n = +1) = \tfrac12$, and $\{Y_n\}$ has the same distribution. If $\varepsilon = \tfrac12$, then (1.2) implies that

\[ Q\bigl(y_m^n\bigr) = \sum_{x_m^n \in \{-1,1\}^{n-m+1}} \frac12 \prod_{i=m}^{n-1} p_{x_i, x_{i+1}} \cdot \left(\frac12\right)^{n-m+1} = \left(\frac12\right)^{n-m+1}, \]

and hence again $\{Y_n\}$ is a sequence of independent random variables with $Q(Y_n = -1) = Q(Y_n = +1) = \tfrac12$.

The paper is organized as follows. In Section 2 we exploit methods of Statistical Mechanics to derive expressions for probabilities of cylindric events (1.2). In Section 3, analyzing the derived expressions, we show that the measure $Q$ has nice thermodynamic properties; in particular, it falls into the class of $g$-measures with exponential decay of variation (memory decay). We also obtain a novel estimate of the decay rate, which is stronger than the estimates derived in previous works. In Section 4 we study two-sided conditional probabilities and show that $Q$ is in fact a Gibbs state in the sense of Statistical Mechanics. We also discuss the well-known denoising algorithm DUDE, and suggest that the Gibbs property of $Q$ explains why DUDE performs so well in this particular example. Furthermore, we argue that the development of denoising algorithms relying on thermodynamic Gibbs ideas can result in superior performance.

2 Random field Ising model

It was observed in [18] that the probability $Q(y_m, \ldots, y_n)$ of a cylindric event $\{Y_m = y_m, \ldots, Y_n = y_n\}$, $m \le n$, can be expressed via a partition function of a random field Ising model. We exploit this observation further. Assume $p, \varepsilon \in (0,1)$, and put

\[ J = \frac12 \log \frac{1-p}{p}, \qquad K = \frac12 \log \frac{1-\varepsilon}{\varepsilon}. \]

Then for any $(y_m, \ldots, y_n) \in \{-1,1\}^{n-m+1}$, the expression (1.2) for the cylinder probability can be rewritten as

\[ Q(y_m, \ldots, y_n) = \frac{c_J}{\lambda_{J,K}^{n-m+1}} \sum_{x_m^n \in \{-1,1\}^{n-m+1}} \exp\left( J \sum_{i=m}^{n-1} x_i x_{i+1} + K \sum_{i=m}^{n} x_i y_i \right), \]

where

\[ c_J = \cosh(J), \qquad \lambda_{J,K} = 2\bigl(\cosh(J+K) + \cosh(J-K)\bigr) = 4\cosh(J)\cosh(K). \]

The non-trivial part of the cylinder probability is the sum over all hidden configurations $(x_m, \ldots, x_n)$:

\[ Z_{m,n}\bigl(y_m^n\bigr) := \sum_{x_m^n \in \{-1,1\}^{n-m+1}} \exp\left( J \sum_{i=m}^{n-1} x_i x_{i+1} + K \sum_{i=m}^{n} x_i y_i \right); \]

it is in fact the partition function of an Ising model with the signs of the external random fields given by the $y_i$'s. Applying the recursive method of [14], the partition function can be evaluated in the following fashion [1]. Consider the functions

\[ A(w) = \frac12 \log \frac{\cosh(w+J)}{\cosh(w-J)}, \qquad B(w) = \frac12 \log\bigl[ 4\cosh(w+J)\cosh(w-J) \bigr] = \frac12 \log\bigl( e^{2w} + e^{-2w} + e^{2J} + e^{-2J} \bigr). \]

One readily checks that if $s = \pm 1$, then for all $w \in \mathbb R$

\[ \exp\bigl( sA(w) + B(w) \bigr) = 2\cosh(w + sJ). \tag{2.1} \]

Now the partition function can be evaluated by summing out the right-most spin. Namely, suppose $m < n$ and $y_m^n \in \{-1,1\}^{n-m+1}$; then

\[
\begin{aligned}
Z_{m,n}\bigl(y_m^n\bigr) &= \sum_{x_m^{n-1} \in \{-1,1\}^{n-m}} \exp\left( J \sum_{i=m}^{n-2} x_i x_{i+1} + K \sum_{i=m}^{n-1} x_i y_i \right) \sum_{x_n \in \{-1,1\}} e^{x_n (J x_{n-1} + K y_n)} \\
&= \sum_{x_m^{n-1} \in \{-1,1\}^{n-m}} \exp\left( J \sum_{i=m}^{n-2} x_i x_{i+1} + K \sum_{i=m}^{n-1} x_i y_i \right) \cdot 2\cosh( J x_{n-1} + K y_n ).
\end{aligned}
\]

By (2.1) with $s = x_{n-1}$,

\[ 2\cosh( J x_{n-1} + K y_n ) = \exp\left( x_{n-1} A\bigl(w_n^{(n)}\bigr) + B\bigl(w_n^{(n)}\bigr) \right), \qquad \text{where } w_n^{(n)} = K y_n. \]

Hence,

\[ Z_{m,n}\bigl(y_m^n\bigr) = \sum_{x_m^{n-1} \in \{-1,1\}^{n-m}} \exp\left( J \sum_{i=m}^{n-2} x_i x_{i+1} + K \sum_{i=m}^{n-2} x_i y_i + x_{n-1}\Bigl( K y_{n-1} + A\bigl(w_n^{(n)}\bigr) \Bigr) \right) \times \exp\left( B\bigl(w_n^{(n)}\bigr) \right), \]

and thus the new sum has exactly the same form, but instead of $K y_{n-1}$ we now have $w_{n-1}^{(n)} = K y_{n-1} + A\bigl(w_n^{(n)}\bigr)$. Continuing the summation over the remaining right-most $x$-spins, one gets

\[ Z_{m,n}\bigl(y_m^n\bigr) = 2\cosh\bigl(w_m^{(n)}\bigr) \prod_{i=m+1}^{n} \exp\left( B\bigl(w_i^{(n)}\bigr) \right), \]

where

\[ w_i^{(n)} = K y_i + A\bigl(w_{i+1}^{(n)}\bigr) \quad \text{for every } i < n; \]

equivalently, since $A(0) = 0$, we can define

\[ w_i^{(n)} = 0 \quad \forall\, i > n, \qquad w_i^{(n)} = K y_i + A\bigl(w_{i+1}^{(n)}\bigr) \quad \forall\, i \le n. \]
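The recursion is easily turned into code. The sketch below (our implementation, with our own names) computes $Z_{m,n}(y)$ by iterating $w_i^{(n)} = K y_i + A(w_{i+1}^{(n)})$ from the right, and checks the result against direct enumeration of all spin configurations:

```python
import itertools, math

p, eps = 0.3, 0.1
J = 0.5 * math.log((1 - p) / p)
K = 0.5 * math.log((1 - eps) / eps)

def A(w): return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))
def B(w): return 0.5 * math.log(4 * math.cosh(w + J) * math.cosh(w - J))

def Z_recursive(y):
    """Z_{m,n}(y) = 2 cosh(w_m) * prod_{i>m} exp(B(w_i)), w_i = K y_i + A(w_{i+1})."""
    ws = [0.0] * (len(y) + 1)              # sentinel w_{n+1} = 0; note A(0) = 0
    for i in reversed(range(len(y))):
        ws[i] = K * y[i] + A(ws[i + 1])
    return 2 * math.cosh(ws[0]) * math.exp(sum(B(w) for w in ws[1:len(y)]))

def Z_brute(y):
    """Direct summation over all hidden spin configurations."""
    n = len(y)
    return sum(math.exp(J * sum(x[i] * x[i + 1] for i in range(n - 1))
                        + K * sum(x[i] * y[i] for i in range(n)))
               for x in itertools.product((-1, 1), repeat=n))

y = (1, -1, -1, 1, 1, -1)
assert abs(Z_recursive(y) - Z_brute(y)) < 1e-9 * Z_brute(y)
# Cylinder probability: Q(y) = cosh(J) * Z / lambda^{len(y)}
lam = 4 * math.cosh(J) * math.cosh(K)
print(math.cosh(J) * Z_recursive(y) / lam ** len(y))
```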

Therefore, we obtain the following expressions for the cylinder and conditional probabilities:

\[
Q\bigl(y_0^n\bigr) = \frac{c_J}{\lambda_{J,K}^{n+1}} \cosh\bigl(w_0^{(n)}\bigr) \prod_{i=1}^{n} \exp\left( B\bigl(w_i^{(n)}\bigr) \right),
\qquad
Q\bigl(y_0 \mid y_1^n\bigr) = \frac{1}{\lambda_{J,K}} \frac{\cosh\bigl(w_0^{(n)}\bigr)}{\cosh\bigl(w_1^{(n)}\bigr)} \exp\left( B\bigl(w_1^{(n)}\bigr) \right).
\tag{2.2}
\]

3 Thermodynamic formalism

Let $\Omega = A^{\mathbb Z_+}$, where $A$ is a finite alphabet, be the space of one-sided infinite sequences $\omega = (\omega_0, \omega_1, \ldots)$ in the alphabet $A$ ($\omega_i \in A$ for all $i$). We equip $\Omega$ with the metric

\[ d(\omega, \tilde\omega) = 2^{-k(\omega, \tilde\omega)}, \]

where $k(\omega, \tilde\omega) = 0$ if $\omega_0 \ne \tilde\omega_0$, and $k(\omega, \tilde\omega) = \max\{ k \in \mathbb N : \omega_i = \tilde\omega_i\ \forall\, i = 0, \ldots, k-1 \}$ otherwise. Denote by $S : \Omega \to \Omega$ the left shift:

\[ (S\omega)_i = \omega_{i+1} \quad \text{for all } i \in \mathbb Z_+. \]

A Borel probability measure $P$ is translation invariant if $P(S^{-1}C) = P(C)$ for any Borel event $C \subseteq \Omega$.

Let us recall the following well-known definitions.

Definition 3.1. Suppose $P$ is a fully supported translation invariant measure on $\Omega = A^{\mathbb Z_+}$, where $A$ is a finite alphabet.

(i) The measure $P$ is called a $g$-measure if for some positive continuous function $g : \Omega \to (0,1)$ satisfying the normalization condition

\[ \sum_{\bar\omega_0 \in A} g(\bar\omega_0, \omega_1, \omega_2, \ldots) = 1 \quad \text{for all } \omega = (\omega_0, \omega_1, \ldots) \in \Omega, \]

one has

\[ P(\omega_0 \mid \omega_1, \omega_2, \ldots) = g(\omega_0, \omega_1, \ldots) \quad \text{for } P\text{-a.a. } \omega \in \Omega. \]

(ii) The measure $P$ is Bowen-Gibbs for a continuous potential $\phi : \Omega \to \mathbb R$ if there exist constants $P \in \mathbb R$ and $C \ge 1$ such that for all $\omega \in \Omega$ and every $n \in \mathbb N$

\[ \frac 1C \le \frac{ P(\{ \tilde\omega \in \Omega : \tilde\omega_0 = \omega_0, \ldots, \tilde\omega_{n-1} = \omega_{n-1} \}) }{ \exp\bigl( (S_n\phi)(\omega) - nP \bigr) } \le C, \]

where $(S_n\phi)(\omega) = \sum_{k=0}^{n-1} \phi(S^k\omega)$.

(iii) The measure $P$ is called an equilibrium state for a continuous potential $\phi : \Omega \to \mathbb R$ if $P$ attains the supremum of the following functional:

\[ h(P) + \int \phi\, dP = \sup_{\tilde P \in \mathcal M_1^*(\Omega)} \left( h(\tilde P) + \int \phi\, d\tilde P \right), \tag{3.1} \]

where $h(P)$ is the Kolmogorov-Sinai entropy of $P$ and the supremum is taken over the set $\mathcal M_1^*(\Omega)$ of all translation invariant Borel probability measures on $\Omega$.

It is known that every $g$-measure $P$ is also an equilibrium state for $\log g$, and that every Bowen-Gibbs measure $P$ for a potential $\phi$ is an equilibrium state for $\phi$ as well.

Theorem 3.1. Suppose $p, \varepsilon \in (0,1)$. Then the measure $Q = Q_{p,\varepsilon}$ on $\{-1,1\}^{\mathbb Z_+}$ (c.f. (2.2)) is a $g$-measure. Moreover, the corresponding function $g$ has exponential decay of variation: define the $n$-th variation of $g$ by

\[ \operatorname{var}_n(g) := \sup_{y, \tilde y :\, y_0^{n-1} = \tilde y_0^{n-1}} \bigl| g(y) - g(\tilde y) \bigr|, \qquad n \ge 1; \]

then

\[ \rho(p, \varepsilon) = \limsup_{n \to \infty} \bigl( \operatorname{var}_n(g) \bigr)^{1/n} < 1. \tag{3.2} \]

Furthermore, $\rho(p, \varepsilon) = 0$ if $p = \tfrac12$ or $\varepsilon = \tfrac12$; for all $p \ne \tfrac12$, $\varepsilon \ne \tfrac12$,

\[ \rho(p, \varepsilon) < |1 - 2p|. \tag{3.3} \]

Finally, the measure $Q$ is also a Bowen-Gibbs measure for a Hölder continuous potential $\phi : \{-1,1\}^{\mathbb Z_+} \to \mathbb R$.
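Before turning to the proof, the memory-decay claim (3.2) can be illustrated numerically: compute $Q(y_0 \mid y_1^n)$ from (2.2) for two observation sequences that agree on the first $n$ symbols and differ afterwards, and watch the difference shrink. A minimal sketch (our code; the parameter values are arbitrary):

```python
import math

p, eps = 0.3, 0.1
J = 0.5 * math.log((1 - p) / p)
K = 0.5 * math.log((1 - eps) / eps)
lam = 4 * math.cosh(J) * math.cosh(K)

def A(w): return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))
def B(w): return 0.5 * math.log(4 * math.cosh(w + J) * math.cosh(w - J))

def cond_prob(y):
    """Q(y_0 | y_1^n) via (2.2)."""
    w1 = 0.0
    for yi in reversed(y[1:]):
        w1 = K * yi + A(w1)          # w_1^{(n)}
    w0 = K * y[0] + A(w1)            # w_0^{(n)}
    return math.cosh(w0) / math.cosh(w1) * math.exp(B(w1)) / lam

for n in (2, 4, 8, 16):
    stem = [1, -1] * (n // 2)        # common prefix y_1^n
    ya = [1] + stem + [+1] * 12      # two different continuations
    yb = [1] + stem + [-1] * 12
    d = abs(cond_prob(ya) - cond_prob(yb))
    print(n, d, d ** (1 / n))        # n-th root approaches rho(p,eps) < |1-2p| = 0.4
```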

The result of Theorem 3.1 is actually true in much greater generality: namely, for distributions of functions of Markov chains $\{Y_n\}$, where the underlying Markov chain $\{X_n\}$ has a strictly positive transition probability matrix $P$; see [15] for a review of several results of this nature. However, the example considered in the present paper is rather exceptional, since one is able to identify the $g$-function and the Gibbs potential $\phi$ explicitly. Another interesting question is the estimate of the decay rate $\rho$. In [15] a number of previously known estimates of the rate of exponential decay in (3.2) have been compared; the best known estimate for $\rho$,

\[ \rho \le |1 - 2p|, \]

is due to [8] and [7]. Quite surprisingly, this estimate does not depend on $\varepsilon$, and in fact it was conjectured in [15] that the estimate could be improved, e.g., by incorporating dependence on $\varepsilon$. The proof of Theorem 3.1 shows that this is indeed the case, and one obtains the sharper estimate (3.3).

Let us start by introducing some notation and proving a technical result. Suppose $p, \varepsilon \in (0,1)$, $p \ne \tfrac12$ and $\varepsilon \ne \tfrac12$. Fix $y = (y_0, y_1, \ldots) \in \{-1,1\}^{\mathbb Z_+}$. For every $n \in \mathbb Z_+$, define the sequence $w_i^{(n)} = w_i^{(n)}(y)$, $i \in \mathbb Z_+$, as follows:

\[ w_i^{(n)} = 0 \quad \text{for every } i \ge n+1, \qquad w_i^{(n)} = K y_i + A\bigl(w_{i+1}^{(n)}\bigr) \quad \text{for } i \le n. \]

If we introduce the maps $F_{-1}, F_1 : \mathbb R \to \mathbb R$, given by

\[ F_{-1}(w) = -K + A(w), \qquad F_1(w) = K + A(w), \]

then for $i \le n$

\[ w_i^{(n)} = w_i^{(n)}(y) = F_{y_i}\bigl( F_{y_{i+1}}( \cdots ( F_{y_n}(0) ) \cdots ) \bigr) = F_{y_i} \circ F_{y_{i+1}} \circ \cdots \circ F_{y_n}(0). \tag{3.4} \]

As we will show, the maps $F_{-1}, F_1$ are strict contractions, and as a result, for every $i$, the sequence $w_i^{(n)}$ converges as $n \to \infty$, in fact with exponential speed.

Lemma 3.2. For every $i \in \mathbb Z_+$ and all $y \in \{-1,1\}^{\mathbb Z_+}$, the limit

\[ \lim_{n \to \infty} w_i^{(n)}(y) =: w_i(y) \]

exists. Moreover, there exist constants $\gamma \in (0,1)$ and $C > 0$, both independent of $y$, such that

\[ \bigl| w_i^{(n)}(y) - w_i(y) \bigr| \le C \gamma^{n-i} \tag{3.5} \]

for all $n \ge i$. Furthermore, $w_i(y) = w_0(S^i y)$ for all $i \in \mathbb Z_+$ and $y$, and $w_0 : \{-1,1\}^{\mathbb Z_+} \to \mathbb R$ (and hence every $w_i$) is Hölder continuous:

\[ \bigl| w_0(y) - w_0(\tilde y) \bigr| \le C \bigl( d(y, \tilde y) \bigr)^\theta \]

for some $C, \theta > 0$ and all $y, \tilde y \in \{-1,1\}^{\mathbb Z_+}$.

Proof. Suppose $i \le n \le m$. Then

\[ \bigl| w_i^{(n)} - w_i^{(m)} \bigr| = \bigl| A\bigl(w_{i+1}^{(n)}\bigr) - A\bigl(w_{i+1}^{(m)}\bigr) \bigr| \le \bigl| w_{i+1}^{(n)} - w_{i+1}^{(m)} \bigr| \cdot \sup_w \left| \frac{dA}{dw} \right|. \]

One has

\[ \frac{dA}{dw} = \frac{\sinh(2J)}{\cosh(2J) + \cosh(2w)}, \]

and hence

\[ \gamma := \sup_w \left| \frac{dA}{dw} \right| = \left| \frac{\sinh(2J)}{\cosh(2J) + 1} \right| = |1 - 2p| < 1. \tag{3.6} \]

Combined with the fact that for all $i \in \mathbb Z_+$

\[ \bigl| w_i^{(m)} \bigr| = \bigl| K y_i + A\bigl(w_{i+1}^{(m)}\bigr) \bigr| \le |K| + |\operatorname{arctanh}(1 - 2p)| = |K| + |J| =: C_1, \]

one therefore concludes that for $i \le n \le m$

\[ \bigl| w_i^{(n)} - w_i^{(m)} \bigr| \le \gamma^{n-i+1} \bigl| w_{n+1}^{(n)} - w_{n+1}^{(m)} \bigr| = \gamma^{n-i+1} \bigl| w_{n+1}^{(m)} \bigr| \le C_1 \gamma^{n-i+1}. \]

Hence $\lim_{n\to\infty} w_i^{(n)} =: w_i$ exists, and

\[ \bigl| w_i^{(n)} - w_i \bigr| \le \sum_{m=n}^{\infty} \bigl| w_i^{(m)} - w_i^{(m+1)} \bigr| \le \sum_{m=n}^{\infty} C_1 \gamma^{m-i+1} = \frac{C_1}{1 - \gamma}\, \gamma^{n-i+1} =: C \gamma^{n-i+1}. \]

From the representation (3.4) it is clear that for $n \ge i$

\[ w_i^{(n)}(y) = w_0^{(n-i)}(S^i y), \]

and hence $w_i(y) = w_0(S^i y)$.

Suppose $y = (y_0, y_1, \ldots)$, $\tilde y = (\tilde y_0, \tilde y_1, \ldots) \in \{-1,1\}^{\mathbb Z_+}$ are such that $d(y, \tilde y) = 2^{-k}$ for some $k \in \mathbb N$, i.e., $y_0 = \tilde y_0, \ldots, y_{k-1} = \tilde y_{k-1}$. Then

\[
\begin{aligned}
\bigl| w_0(y) - w_0(\tilde y) \bigr| &= \bigl| F_{y_0} \circ \cdots \circ F_{y_{k-1}}\bigl( w_k(y) \bigr) - F_{y_0} \circ \cdots \circ F_{y_{k-1}}\bigl( w_k(\tilde y) \bigr) \bigr| \\
&\le \sup_{w \in \mathbb R} \bigl| \bigl( F_{y_0} \circ \cdots \circ F_{y_{k-1}} \bigr)'(w) \bigr| \cdot \bigl| w_k(y) - w_k(\tilde y) \bigr| \le \Bigl( \sup_w |A'(w)| \Bigr)^k \cdot 2 C_1 = 2 C_1 \gamma^k,
\end{aligned}
\]

and hence $w_0 : \{-1,1\}^{\mathbb Z_+} \to \mathbb R$ is Hölder continuous. □
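The contraction stated in the Lemma can be observed directly. The following sketch (ours, not the author's) folds the maps $F_{y_i}$ as in (3.4) for deeper and deeper horizons $n$ and prints successive differences $|w_0^{(n)} - w_0^{(n-1)}|$, which decay at least like $\gamma^n = |1-2p|^n$:

```python
import math, random

p, eps = 0.3, 0.1
J = 0.5 * math.log((1 - p) / p)
K = 0.5 * math.log((1 - eps) / eps)

def A(w): return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))
def F(s, w): return s * K + A(w)         # F_s(w) = sK + A(w), s = +-1

random.seed(1)
y = [random.choice((-1, 1)) for _ in range(40)]

prev = None
for n in range(len(y)):
    w = 0.0
    for i in reversed(range(n + 1)):     # w_0^{(n)} = F_{y_0} o ... o F_{y_n}(0)
        w = F(y[i], w)
    if prev is not None:
        print(n, abs(w - prev))          # bounded by C * |1-2p|^n = C * 0.4^n
    prev = w
```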

The estimate of the contraction rate in the Lemma above can be improved. If $p = \tfrac12$, then $A(w) \equiv 0$, and hence $w_i^{(n)} \equiv K y_i$ for all $n \ge i$. We may thus assume that $p \ne \tfrac12$. Let us also assume that $\varepsilon \ne \tfrac12$, i.e., $K \ne 0$.

Let us now consider second iterations:

\[
\bigl| w_i^{(n)} - w_i^{(m)} \bigr| = \bigl| A\bigl( K y_{i+1} + A\bigl(w_{i+2}^{(n)}\bigr) \bigr) - A\bigl( K y_{i+1} + A\bigl(w_{i+2}^{(m)}\bigr) \bigr) \bigr| \le \sup_w \bigl| A'(K + A(w))\, A'(w) \bigr| \cdot \bigl| w_{i+2}^{(n)} - w_{i+2}^{(m)} \bigr| =: \rho^{(2)} \bigl| w_{i+2}^{(n)} - w_{i+2}^{(m)} \bigr|.
\]

We are going to show that for $p \ne \tfrac12$ and all $\varepsilon \ne \tfrac12$ one has

\[ \rho^{(2)} = \sup_w \bigl| A'(K + A(w))\, A'(w) \bigr| < (1 - 2p)^2. \tag{3.7} \]

Firstly, note that

\[
\bigl| A'(K + A(w))\, A'(w) \bigr| = \frac{\sinh^2(2J)}{\bigl(\cosh(2J) + \cosh(2w)\bigr)\bigl(\cosh(2J) + \cosh(2K + 2A(w))\bigr)} = \frac{(1-2p)^2}{\bigl(\alpha + (1-\alpha)\cosh(2K + 2A(w))\bigr)\bigl(\alpha + (1-\alpha)\cosh(2w)\bigr)},
\tag{3.8}
\]

where $\alpha = (1-p)^2 + p^2$ and $1 - \alpha = 2p(1-p)$. Let $\delta > 0$ be sufficiently small so that for all $w \in [-\delta, \delta]$ one has $\cosh(2K + 2A(w)) > \cosh(K)$; then for all $w \in [-\delta, \delta]$

\[ \bigl| A'(K + A(w))\, A'(w) \bigr| \le \frac{(1-2p)^2}{\bigl(\alpha + (1-\alpha)\cosh(K)\bigr) \cdot 1} < (1-2p)^2. \]

For $w \notin [-\delta, \delta]$, one has

\[ \bigl| A'(K + A(w))\, A'(w) \bigr| \le \frac{(1-2p)^2}{1 \cdot \bigl(\alpha + (1-\alpha)\cosh(2\delta)\bigr)} < (1-2p)^2. \]

Hence,

\[ \rho^{(2)} \le \max\left\{ \frac{(1-2p)^2}{\alpha + (1-\alpha)\cosh(K)},\ \frac{(1-2p)^2}{\alpha + (1-\alpha)\cosh(2\delta)} \right\} < (1-2p)^2, \]

and thus (3.5) holds for $\bar\gamma = \sqrt{\rho^{(2)}} < |1 - 2p|$ and some constant $\tilde C > 0$. In particular, we are now able to conclude that

\[ \bigl| w_0(y) - w_0(\tilde y) \bigr| \le C_2 \bar\gamma^{\,k(y,\tilde y)} = C_2 \bigl( d(y, \tilde y) \bigr)^\theta, \qquad \theta = -\log_2 \bar\gamma > 0. \tag{3.9} \]

Even sharper bounds can be achieved by studying the minimum of the denominator in (3.8), or higher iterates of the $F$'s.

Proof of Theorem 3.1. The cases $p = \tfrac12$ or $\varepsilon = \tfrac12$ are obvious: in these cases $Q$ is the Bernoulli $\bigl(\tfrac12, \tfrac12\bigr)$-measure on $\{-1,1\}^{\mathbb Z_+}$, and hence $\rho(p, \varepsilon) = 0$. Thus we will assume that $p, \varepsilon \ne \tfrac12$.

To show that $Q$ is a $g$-measure, it is sufficient to show that the conditional probabilities $Q(y_0 \mid y_1^n)$ converge uniformly as $n \to \infty$. Given that

\[ Q\bigl(y_0 \mid y_1^n\bigr) = \frac{1}{\lambda_{J,K}} \frac{\cosh\bigl(w_0^{(n)}\bigr)}{\cosh\bigl(w_1^{(n)}\bigr)} \exp\left( B\bigl(w_1^{(n)}\bigr) \right), \tag{3.10} \]

and using the result of Lemma 3.2, $w_i^{(n)}(y) \rightrightarrows w_i(y)$ as $n \to \infty$, we obtain uniform convergence of the conditional probabilities; hence $Q$ is a $g$-measure with $g$ given by

\[ g(y) = \frac{1}{\lambda_{J,K}} \frac{\cosh(w_0(y))}{\cosh(w_1(y))} \exp\bigl( B(w_1(y)) \bigr). \tag{3.11} \]

Taking into account that $w_0$, $w_1 = w_0 \circ S$ are Hölder continuous functions satisfying (3.9), and that $\cosh$, $\exp$, and $B$ are smooth functions, we conclude that $g$ is also Hölder continuous with the same decay of variation:

\[ \operatorname{var}_n(g) = \sup_{y, \tilde y :\, y_0^{n-1} = \tilde y_0^{n-1}} \bigl| g(y) - g(\tilde y) \bigr| \le C_3 \bigl| w_1(y) - w_1(\tilde y) \bigr| \le C_4 \bar\gamma^{\,n-1} \]

for some $C_3 > 0$ ($C_4 = C_2 C_3$, c.f. (3.9)), and hence

\[ \rho(p, \varepsilon) = \limsup_{n\to\infty} \bigl( \operatorname{var}_n(g) \bigr)^{1/n} \le \bar\gamma < |1 - 2p|. \]

Let us introduce the following functions: for $y \in \{-1,1\}^{\mathbb Z_+}$, put

\[ \phi(y) = B(w_0(y)), \qquad h(y) = \cosh(w_0(y)) \exp\bigl( -B(w_0(y)) \bigr). \]

Taking into account that $w_1(y) = w_0(Sy)$, one has

\[ g(y) = \frac{ e^{\phi(y)}\, h(y) }{ \lambda_{J,K}\, h(Sy) }. \]

Since every $g$-measure is also an equilibrium state for $\log g$, we conclude that $Q$ is an equilibrium state for

\[ \tilde\phi(y) = \phi(y) + \log h(y) - \log h(Sy) - \log \lambda_{J,K}. \]

The difference $\tilde\phi(y) - \phi(y)$ has a very special form: it is a sum of a so-called coboundary $(\log h(y) - \log h(Sy))$ and a constant $(-\log \lambda_{J,K})$. Two potentials whose difference is of such a form have identical sets of equilibrium states. The reason is that for any translation invariant measure $\tilde Q$ one has

\[ \int \bigl( \log h(y) - \log h(Sy) - \log \lambda_{J,K} \bigr)\, d\tilde Q = -\log \lambda_{J,K} = \text{const}. \]

Therefore, if $Q$ achieves the maximum in the right-hand side of (3.1) for $\tilde\phi$, then $Q$ achieves the maximum for $\phi$ as well. Thus $Q$ is also an equilibrium state for

\[ \phi(y) = B(w_0(y)) = \frac12 \log\left( 4\sinh^2(w_0(y)) + \frac{1}{p(1-p)} \right). \]

Any equilibrium measure for a Hölder continuous potential $\phi$ is also a Bowen-Gibbs measure [3]. In our particular case, a direct proof of the Bowen-Gibbs property for $Q$ is straightforward. Indeed, using the result (2.2) and the notation introduced above, for every $y = (y_0, y_1, \ldots)$ one has

\[
\begin{aligned}
Q\bigl(y_0^n\bigr) &= \frac{c_J}{\lambda_{J,K}^{n+1}} \cosh\bigl(w_0^{(n)}(y)\bigr) \prod_{i=1}^{n} \exp\left( B\bigl(w_i^{(n)}(y)\bigr) \right) \\
&= \frac{ c_J \cosh\bigl(w_0^{(n)}(y)\bigr) }{ \exp\bigl( B(w_0(y)) \bigr) } \prod_{i=1}^{n} \exp\left( B\bigl(w_i^{(n)}(y)\bigr) - B\bigl(w_i(y)\bigr) \right) \times \exp\left( \sum_{i=0}^{n} B\bigl(w_i(y)\bigr) - (n+1)\log \lambda_{J,K} \right).
\end{aligned}
\]

Therefore, for $P = \log \lambda_{J,K}$,

\[ \frac{ Q\bigl(y_0^n\bigr) }{ \exp\bigl( (S_{n+1}\phi)(y) - (n+1)P \bigr) } = \frac{ c_J \cosh\bigl(w_0^{(n)}(y)\bigr) }{ \exp\bigl( B(w_0(y)) \bigr) } \prod_{i=1}^{n} \exp\left( B\bigl(w_i^{(n)}(y)\bigr) - B\bigl(w_i(y)\bigr) \right). \]

It only remains to demonstrate that the right-hand side is uniformly bounded (both in $n$ and $y = (y_0, y_1, \ldots)$) from below and above by some positive constants $\underline C$, $\overline C$, respectively. Indeed, since $p, \varepsilon \in (0,1)$, $I = [\,-|K| - |J|,\ |K| + |J|\,]$ is a finite interval, and by the result of the previous Lemma, $w_i^{(n)}(y) \in I$ for all $i$ and $n$. Using (3.5), one readily checks that the following choice of constants suffices:

\[
\overline C = c_J\, \frac{ \sup_{w \in I} \cosh(w) }{ \inf_{w \in I} \exp(B(w)) } \exp\left( \frac{C}{1-\gamma} \sup_{w \in I} \left| \frac{dB}{dw} \right| \right) < \infty,
\qquad
\underline C = c_J\, \frac{ \inf_{w \in I} \cosh(w) }{ \sup_{w \in I} \exp(B(w)) } \exp\left( -\frac{C}{1-\gamma} \sup_{w \in I} \left| \frac{dB}{dw} \right| \right) > 0. \qquad \square
\]

We complete this section with a curious continued fraction representation of the $g$-function (3.11).

Proposition 3.3. For every $y = (y_0, y_1, \ldots) \in \{-1,1\}^{\mathbb Z_+}$, one has

\[ 2 g(y) = a_1 - \cfrac{b_1}{ a_2 - \cfrac{b_2}{ a_3 - \cfrac{b_3}{ a_4 - \cdots } } }, \tag{3.12} \]

where for $i \ge 1$

\[ q_i = (1 - 2p)\, y_{i-1} y_i, \qquad a_i = 1 + q_i, \qquad b_i = 4\varepsilon(1-\varepsilon)\, q_i. \]

Proof. Using elementary transformations, one can show that for every $y = (y_0, y_1, \ldots) \in \{-1,1\}^{\mathbb Z_+}$ one has

\[ g(y) = \frac{1}{\lambda_{J,K}} \frac{\cosh(w_0(y))}{\cosh(w_1(y))} \exp\bigl( B(w_1(y)) \bigr) = \frac12 + \frac12 (1-2p)(1-2\varepsilon)\, y_0 \tanh(w_1(y)). \tag{3.13} \]

Since for all $w \in \mathbb R$

\[ \tanh(A(w)) = \tanh(J)\tanh(w) = (1 - 2p)\tanh(w), \]

for every $i \in \mathbb Z_+$ one has

\[ \tanh(w_i) = \tanh\bigl( K y_i + A(w_{i+1}) \bigr) = \frac{ \tanh(K y_i) + \tanh(A(w_{i+1})) }{ 1 + \tanh(K y_i)\tanh(A(w_{i+1})) } = \frac{ y_i (1 - 2\varepsilon) + (1 - 2p)\tanh(w_{i+1}) }{ 1 + (1-2\varepsilon)(1-2p)\, y_i \tanh(w_{i+1}) }. \]

Therefore, if we let $z_i = (1-2p)(1-2\varepsilon)\, y_{i-1} \tanh(w_i)$, $i \in \mathbb N$, then

\[ z_i = (1-2p)\, y_{i-1} y_i - \frac{ 4\varepsilon(1-\varepsilon)(1-2p)\, y_{i-1} y_i }{ 1 + z_{i+1} } = q_i - \frac{b_i}{1 + z_{i+1}}. \]

Since $g(y) = \tfrac12 + \tfrac12 z_1$, we obtain the continued fraction expansion (3.12). □
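Proposition 3.3 invites a direct numerical check (our code): truncating both the recursion for $w_1$ and the continued fraction (3.12) at the same depth, i.e. setting $w_{n+1} = 0$ and $z_{n+1} = 0$, the two evaluations of $g$ agree to machine precision.

```python
import math

p, eps = 0.3, 0.1
J = 0.5 * math.log((1 - p) / p)
K = 0.5 * math.log((1 - eps) / eps)
lam = 4 * math.cosh(J) * math.cosh(K)

def A(w): return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))
def B(w): return 0.5 * math.log(4 * math.cosh(w + J) * math.cosh(w - J))

y = [1, -1, -1, 1, -1, 1, 1, -1] * 4    # finite stand-in for an infinite sequence

def g_direct(y):
    """g(y) via (3.11), with the w-recursion truncated at the end of y."""
    w1 = 0.0
    for yi in reversed(y[1:]):
        w1 = K * yi + A(w1)
    w0 = K * y[0] + A(w1)
    return math.cosh(w0) / math.cosh(w1) * math.exp(B(w1)) / lam

def g_cfrac(y):
    """g(y) via the continued fraction (3.12), truncated at the same depth."""
    z = 0.0
    for i in reversed(range(1, len(y))):
        q = (1 - 2 * p) * y[i - 1] * y[i]            # q_i
        z = q - 4 * eps * (1 - eps) * q / (1 + z)    # z_i = q_i - b_i / (1 + z_{i+1})
    return 0.5 + 0.5 * z                             # g = 1/2 + z_1 / 2

assert abs(g_direct(y) - g_cfrac(y)) < 1e-10
```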

4 Two-sided conditional probabilities and denoising

In the previous section we established that $Q$ is a Bowen-Gibbs measure. The notion of a Gibbs measure originates in Statistical Mechanics and is not equivalent to the Bowen-Gibbs definition. In Statistical Mechanics, one is interested in two-sided conditional probabilities

\[ Q\bigl( y_0 \mid y_{-m}^{-1}, y_1^n \bigr) \quad \text{or} \quad Q(y_0 \mid y_{<0}, y_{>0}) := Q\bigl( y_0 \mid y_{-\infty}^{-1}, y_1^{\infty} \bigr). \]

The method of Section 2 can be used to evaluate the conditional probabilities $Q\bigl( y_0 \mid y_{-m}^{-1}, y_1^n \bigr)$, $m, n > 0$, for $y = (\ldots, y_{-1}, y_0, y_1, \ldots) \in \{-1,1\}^{\mathbb Z}$. Indeed,

\[ Q\bigl( y_0 \mid y_{-m}^{-1}, y_1^n \bigr) = \frac{ Q\bigl( y_{-m}^{-1}, y_0, y_1^n \bigr) }{ Q\bigl( y_{-m}^{-1}, y_0, y_1^n \bigr) + Q\bigl( y_{-m}^{-1}, \bar y_0, y_1^n \bigr) }, \]

where $\bar y_0 = -y_0$. We can evaluate

\[ Q(y_{-m}, \ldots, y_{-1}, y_0, y_1, \ldots, y_n) = \frac{c_J}{\lambda_{J,K}^{n+m+1}} \sum_{x_{-m}^{n} \in \{-1,1\}^{n+m+1}} \exp\left( J \sum_{i=-m}^{n-1} x_i x_{i+1} + K \sum_{i=-m}^{n} x_i y_i \right) = \frac{c_J}{\lambda_{J,K}^{n+m+1}}\, Z_{-m,n}\bigl( y_{-m}^n \bigr) \]

by first summing over the spins on the right, $x_n, \ldots, x_1$, and then summing over the spins on the left, $x_{-m}, \ldots, x_{-1}$. One has

\[ Z_{-m,n}\bigl( y_{-m}^n \bigr) = \sum_{x_{-m}, \ldots, x_0} \exp\left( J \sum_{i=-m}^{-1} x_i x_{i+1} + K \sum_{i=-m}^{0} x_i y_i + x_0 A\bigl( w_1^{(n)} \bigr) \right) \prod_{i=1}^{n} \exp\left( B\bigl( w_i^{(n)} \bigr) \right) = \left( \prod_{j=-m}^{-1} \exp\left( B\bigl( w_j^{(-m)} \bigr) \right) \right) 2\cosh\bigl( w_0^{(-m,n)} \bigr) \prod_{i=1}^{n} \exp\left( B\bigl( w_i^{(n)} \bigr) \right), \]

where now $w_{-m}^{(-m)} = K y_{-m}$,

\[ w_{j+1}^{(-m)} = K y_{j+1} + A\bigl( w_j^{(-m)} \bigr), \qquad j = -m, \ldots, -2, \]

and

\[ w_0^{(-m,n)} = K y_0 + A\bigl( w_{-1}^{(-m)} \bigr) + A\bigl( w_1^{(n)} \bigr). \]

Therefore,

\[
Q\bigl( y_0 \mid y_{-m}^{-1}, y_1^n \bigr) = \frac{ Z_{-m,n}\bigl( y_{-m}^{-1}, y_0, y_1^n \bigr) }{ Z_{-m,n}\bigl( y_{-m}^{-1}, y_0, y_1^n \bigr) + Z_{-m,n}\bigl( y_{-m}^{-1}, \bar y_0, y_1^n \bigr) } = \frac{ \cosh\bigl( K y_0 + A\bigl( w_{-1}^{(-m)} \bigr) + A\bigl( w_1^{(n)} \bigr) \bigr) }{ \cosh\bigl( K y_0 + A\bigl( w_{-1}^{(-m)} \bigr) + A\bigl( w_1^{(n)} \bigr) \bigr) + \cosh\bigl( -K y_0 + A\bigl( w_{-1}^{(-m)} \bigr) + A\bigl( w_1^{(n)} \bigr) \bigr) }.
\]

Again, given this expression, one easily establishes uniform convergence and the existence of the limits,

\[ Q\bigl( y_0 \mid y_{-\infty}^{-1}, y_1^{\infty} \bigr) = \lim_{m,n \to \infty} Q\bigl( y_0 \mid y_{-m}^{-1}, y_1^n \bigr). \]

Thus the two-sided conditional probabilities are also regular, c.f. Theorem 3.1.
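As with the one-sided case, the two-sided formula is immediate to implement; in the sketch below (our code) the infinite past and future are truncated at finite depths $m$ and $n$:

```python
import math

p, eps = 0.3, 0.1
J = 0.5 * math.log((1 - p) / p)
K = 0.5 * math.log((1 - eps) / eps)

def A(w): return 0.5 * math.log(math.cosh(w + J) / math.cosh(w - J))

def two_sided(y_left, y0, y_right):
    """Q(y_0 | y_{-m}^{-1}, y_1^n); y_left = (y_{-m},...,y_{-1}), y_right = (y_1,...,y_n)."""
    wl = 0.0
    for yi in y_left:                # left recursion towards the origin: w_{-1}^{(-m)}
        wl = K * yi + A(wl)
    wr = 0.0
    for yi in reversed(y_right):     # right recursion, as in Section 2: w_1^{(n)}
        wr = K * yi + A(wr)
    t = A(wl) + A(wr)
    num = math.cosh(K * y0 + t)
    return num / (num + math.cosh(-K * y0 + t))

print(two_sided((1, 1, -1), 1, (-1, 1, 1)))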

Q( ...... ) y−m, , y−1, y0, y1, , yn 4.1 Denoising

Reconstruction of signals corrupted by noise during transmission is one of the classical problems in Information Theory. Suppose we observe a sequence $\{y_n\}$, $n = 1, \ldots, N$, given by (1.1), i.e.,

\[ y_n = x_n \cdot z_n, \]

where $\{x_n\}$ is some unknown realisation of the Markov chain, and $\{z_n\}$ is an unknown realisation of the Bernoulli sequence $\{Z_n\}$. The natural question is: given the observed data $y^N = (y_1, \ldots, y_N)$, what is the optimal choice of $\hat X_n = \hat X_n(y^N)$, the estimate of $X_n$, such that the empirical zero-one loss (bit error rate)

\[ L_N = \frac 1N \sum_{n=1}^{N} \mathbb I\bigl\{ \hat X_n \ne x_n \bigr\} \]

is minimal. The corresponding standard maximum a posteriori probability (MAP) estimator (denoiser) is given by

\[
\hat X_n = \hat X_n\bigl( y^N \bigr) =
\begin{cases}
-1, & \text{if } P\bigl( X_n = -1 \mid Y^N = y_1^N \bigr) \ge P\bigl( X_n = 1 \mid Y^N = y_1^N \bigr), \\
+1, & \text{if } P\bigl( X_n = -1 \mid Y^N = y_1^N \bigr) < P\bigl( X_n = 1 \mid Y^N = y_1^N \bigr),
\end{cases}
\qquad n = 1, \ldots, N.
\]

In case the parameters of the Markov chain (i.e., $p$) and of the channel (i.e., $\varepsilon$) are known, the conditional probabilities $P\bigl( X_n = x \mid Y^N = y^N \bigr)$ can be found using the backward-forward algorithm. Namely, one has

\[ P\bigl( X_n = x \mid Y^N = y^N \bigr) = \frac{ \alpha_n(x)\, \beta_n(x) }{ \sum_{\tilde x \in A} \alpha_n(\tilde x)\, \beta_n(\tilde x) }, \tag{4.1} \]

where

\[ \alpha_n(x) = P\bigl( Y_1^n = y_1^n, X_n = x \bigr), \qquad \beta_n(x) = P\bigl( Y_{n+1}^N = y_{n+1}^N \mid X_n = x \bigr) \]

    −1 n Z−m,n y−m, y0, y Q | −1 n =   1  y0 y−m, y1 −1 n + −1 ¯ n Z−m,n y−m, y0, y1 Z−m,n y−m, y0, y1      (−m) (n) cosh Ky0 + A w− + A w =      1  1     + (−m) + (n) + − + (−m) + (n) cosh Ky0 A w−1 A w1 cosh Ky0 A w−1 A w1 . Verbitskiy Pacific Journal of Mathematics for Industry (2016) 8:2 Page 8 of 9

The key observation of [16] is that the probability distribution $P\bigl( X_n = \cdot \mid Y^N = y^N \bigr)$, viewed as a column vector, can be expressed in terms of the two-sided conditional probabilities $Q\bigl( Y_n = \cdot \mid Y^{N \setminus n} = y^{N \setminus n} \bigr)$, with $N \setminus n = \{1, \ldots, N\} \setminus \{n\}$, as follows:

\[ P\bigl( X_n = \cdot \mid Y^N = y^N \bigr) = \frac{ \pi_{y_n} \odot \Pi^{-1} Q\bigl( Y_n = \cdot \mid Y^{N\setminus n} = y^{N\setminus n} \bigr) }{ \bigl\langle \pi_{y_n} \odot \Pi^{-1} Q\bigl( Y_n = \cdot \mid Y^{N\setminus n} = y^{N\setminus n} \bigr),\ \mathbb 1 \bigr\rangle }, \tag{4.2} \]

where $\Pi$ is the emission matrix and $\pi_{-1}, \pi_1$ are the columns of $\Pi$:

\[ \Pi = \begin{pmatrix} 1-\varepsilon & \varepsilon \\ \varepsilon & 1-\varepsilon \end{pmatrix}, \qquad \pi_{-1} = \begin{pmatrix} 1-\varepsilon \\ \varepsilon \end{pmatrix}, \qquad \pi_1 = \begin{pmatrix} \varepsilon \\ 1-\varepsilon \end{pmatrix}, \qquad \Pi^{-1} = \frac{1}{1 - 2\varepsilon} \begin{pmatrix} 1-\varepsilon & -\varepsilon \\ -\varepsilon & 1-\varepsilon \end{pmatrix}, \]

and $\odot$ is the componentwise product of vectors of equal length,

\[ u \odot v = (u_1 \cdot v_1, \ldots, u_d \cdot v_d). \]

Expression (4.2) opens up the possibility of constructing denoisers when the parameters of the underlying Markov chain are unknown; we continue to assume that the channel remains known. Indeed, the two-sided conditional probabilities $Q\bigl( Y_n = \cdot \mid Y^{N\setminus n} = y^{N\setminus n} \bigr)$ could be estimated from the data. The Discrete Universal Denoiser algorithm (DUDE) [16] estimates the conditional probabilities

\[ Q\bigl( Y_n = c \mid Y_{n-k_N}^{n-1} = a_{-k_N}^{-1},\ Y_{n+1}^{n+k_N} = b_1^{k_N} \bigr) \approx \frac{ m\bigl( a_{-k_N}^{-1}, c, b_1^{k_N} \bigr) }{ \sum_{\bar c}\, m\bigl( a_{-k_N}^{-1}, \bar c, b_1^{k_N} \bigr) }, \tag{4.3} \]

where $m\bigl( a_{-k_N}^{-1}, c, b_1^{k_N} \bigr)$ is the number of occurrences of the word $a_{-k_N}^{-1} c\, b_1^{k_N}$ in the observed sequence $y^N = (y_1, \ldots, y_N)$; the length of the right and left contexts is set to $k_N = c \log N$, $c > 0$.
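In code, the count estimate (4.3) is a single pass with a hash map of contexts. A sketch under our naming conventions (with a fixed context radius k rather than the $k_N = c \log N$ schedule):

```python
from collections import Counter
import random

def dude_estimator(y, k):
    """Empirical two-sided conditional probabilities (4.3) with context radius k."""
    counts = Counter()
    for n in range(k, len(y) - k):
        counts[tuple(y[n - k:n]), y[n], tuple(y[n + 1:n + 1 + k])] += 1
    def estimate(left, c, right):
        tot = sum(counts[left, cc, right] for cc in (-1, 1))
        return counts[left, c, right] / tot if tot else 0.5   # unseen context
    return estimate

# Example on synthetic data (i.i.d. here only for brevity; in the denoising
# setting y would be the observed channel output):
random.seed(0)
y = [random.choice((-1, 1)) for _ in range(100000)]
est = dude_estimator(y, 2)
print(est((1, -1), 1, (-1, 1)))    # ~0.5 for an unbiased i.i.d. sequence
```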

The DUDE has shown excellent performance in a number of test cases. In particular, in the case of the binary memoryless channel and the symmetric Markov chain considered in this paper, its performance is comparable to that of the backward-forward algorithm (4.1), which requires full knowledge of the source distribution, while DUDE is completely oblivious in that respect. In our opinion, the excellent performance of DUDE in this case is partially due to the fact that $Q$ is a Gibbs measure, admitting smooth two-sided conditional probabilities, which are well approximated by (4.3) and thus can be estimated from the data. It would be interesting to evaluate the performance in cases when the output measure is not Gibbs.

The invention of DUDE sparked great interest in two-sided approaches to information-theoretic problems. It turns out that, despite the fact that efficient algorithms for the estimation of one-sided models exist, the analogous two-sided problem is substantially more difficult. As alternatives to (4.3), other methods to estimate two-sided conditional probabilities have been suggested, see e.g. [6, 11, 17]. For example, Yu and Verdú [17] proposed a Backward-Forward Product (BFP) model,

\[ \hat Q(y_0 \mid y_{<0}, y_{>0}) \propto \hat Q(y_0 \mid y_{<0})\, \hat Q(y_0 \mid y_{>0}), \]

where the one-sided conditional probabilities $\hat Q(y_0 \mid y_{<0})$, $\hat Q(y_0 \mid y_{>0})$ can be estimated using standard one-sided algorithms. Note that in our model the normalized product of the exact one-sided conditional probabilities,

\[
\frac{ Q(y_0 \mid y_{<0})\, Q(y_0 \mid y_{>0}) }{ Q(y_0 \mid y_{<0})\, Q(y_0 \mid y_{>0}) + Q(\bar y_0 \mid y_{<0})\, Q(\bar y_0 \mid y_{>0}) } = \frac{ \cosh\bigl( K y_0 + A(w_{-1}) \bigr) \cosh\bigl( K y_0 + A(w_1) \bigr) }{ \cosh\bigl( K y_0 + A(w_{-1}) \bigr) \cosh\bigl( K y_0 + A(w_1) \bigr) + \cosh\bigl( -K y_0 + A(w_{-1}) \bigr) \cosh\bigl( -K y_0 + A(w_1) \bigr) },
\]

in general does not coincide with

\[ \frac{ \cosh\bigl( K y_0 + A(w_{-1}) + A(w_1) \bigr) }{ \cosh\bigl( K y_0 + A(w_{-1}) + A(w_1) \bigr) + \cosh\bigl( -K y_0 + A(w_{-1}) + A(w_1) \bigr) } = Q(y_0 \mid y_{<0}, y_{>0}). \]

Nevertheless, the BFP model seems to perform extremely well [17].

Among other alternatives, let us mention the possibility of extending standard one-sided algorithms to produce algorithms for estimating two-sided conditional probabilities from data. This approach is investigated in joint work with S. Berghout [2], where the denoising performance of the resulting Gibbsian models is evaluated. The Gibbsian algorithm performs better than DUDE: bit error rates are given in the table below for noise level $\varepsilon = 0.2$ and various values of $p$ (smaller rates are better).

  p      Gibbs     DUDE
  0.05   5.30 %    5.58 %
  0.10   9.91 %   10.48 %
  0.15  13.20 %   13.77 %
  0.20  18.34 %   18.77 %

One could also try to estimate the Gibbsian potential directly, e.g., using the estimation procedure proposed in [5]. This method showed promising performance in experiments on language classification and authorship attribution. In conclusion, let us also mention that direct two-sided Gibbs modeling of stochastic processes opens possibilities for applying semi-parametric statistical procedures, as opposed to the universal (parameter-free) approach of DUDE.

Acknowledgments

Part of the work described in this paper was completed during the author's visit to the Institute of Mathematics for Industry, Kyushu University. The author is grateful for the hospitality during his stay and for the support of the World Premier International Researcher Invitation Program.

Received: 27 August 2015 Revised: 29 November 2015 Accepted: 28 December 2015

References

1. Behn, U., Zagrebnov, V.A.: One-dimensional Markovian-field Ising model: physical properties and characteristics of the discrete stochastic mapping. J. Phys. A 21(9), 2151–2165 (1988)
2. Berghout, S., Verbitskiy, E.: On bi-directional modeling of information sources (2015)
3. Bowen, R.: Some systems with unique equilibrium states. Math. Systems Theory 8(3), 193–202 (1974/75)
4. Ephraim, Y., Merhav, N.: Hidden Markov processes. Special issue on Shannon theory: perspective, trends, and applications. IEEE Trans. Inform. Theory 48(6), 1518–1569 (2002). doi:10.1109/TIT.2002.1003838
5. Ermolaev, V., Verbitskiy, E.: Thermodynamic Gibbs formalism and information theory. In: Wakayama, M., et al. (eds.) The Impact of Applications on Mathematics, Mathematics for Industry, pp. 349–362. Springer Japan (2014)
6. Fernandez, F., Viola, A., Weinberger, M.J.: Efficient algorithms for constructing optimal bi-directional context sets. In: Data Compression Conference (DCC), pp. 179–188 (2010). doi:10.1109/DCC.2010.23
7. Fernández, R., Ferrari, P.A., Galves, A.: Coupling, renewal and perfect simulation of chains of infinite order. Ubatuba (2001)
8. Hochwald, B.M., Jelenković, P.: State learning and mixing in entropy of hidden Markov processes and the Gilbert-Elliott channel. IEEE Trans. Inform. Theory 45(1), 128–138 (1999)
9. Jacquet, P., Seroussi, G., Szpankowski, W.: On the entropy of a hidden Markov process. Theoret. Comput. Sci. 395(2–3), 203–219 (2008)
10. Ordentlich, E., Weissman, T.: Approximations for the entropy rate of a hidden Markov process. In: Proceedings of the International Symposium on Information Theory (ISIT 2005), pp. 2198–2202 (2005). doi:10.1109/ISIT.2005.1523737
11. Ordentlich, E., Weinberger, M.J., Weissman, T.: Multi-directional context sets with applications to universal denoising and compression. In: Proceedings of the International Symposium on Information Theory (ISIT 2005), pp. 1270–1274 (2005). doi:10.1109/ISIT.2005.1523546
12. Pollicott, M.: Computing entropy rates for hidden Markov processes. In: Entropy of Hidden Markov Processes and Connections to Dynamical Systems, London Math. Soc. Lecture Note Ser., pp. 223–245. Cambridge Univ. Press, Cambridge (2011)
13. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286 (1989)
14. Ruján, P.: Calculation of the free energy of Ising systems by a recursion method. Physica A 91(3–4), 549–562 (1978)
15. Verbitskiy, E.A.: Thermodynamics of hidden Markov processes. In: Marcus, B., Petersen, K., Weissman, T. (eds.) Entropy of Hidden Markov Processes and Connections to Dynamical Systems, London Math. Soc. Lecture Note Ser., pp. 258–272. Cambridge Univ. Press (2011). doi:10.1017/CBO9780511819407.010
16. Weissman, T., Ordentlich, E., Seroussi, G., Verdú, S., Weinberger, M.J.: Universal discrete denoising: known channel. IEEE Trans. Inform. Theory 51(1), 5–28 (2005)
17. Yu, J., Verdú, S.: Schemes for bidirectional modeling of discrete stationary sources. IEEE Trans. Inform. Theory 52(11), 4789–4807 (2006)
18. Zuk, O., Domany, E., Kanter, I., Aizenman, M.: From finite-system entropy to entropy rate for a hidden Markov process. IEEE Sig. Proc. Letters 13(9), 517–520 (2006)