Turbo Decoding as Iterative Constrained Maximum Likelihood Sequence Detection

John MacLaren Walsh, Member, IEEE, Phillip A. Regalia, Fellow, IEEE, and C. Richard Johnson, Jr., Fellow, IEEE

Abstract— The turbo decoder was not originally introduced as a solution to an optimization problem, which has impeded attempts to explain its excellent performance. Here it is shown, nonetheless, that the turbo decoder is an iterative method seeking a solution to an intuitively pleasing constrained optimization problem. In particular, the turbo decoder seeks the maximum likelihood sequence under the false assumption that the inputs to the encoders are chosen independently of each other in the parallel case, or that the output of the outer encoder is chosen independently of the input to the inner encoder in the serial case. To control the error introduced by the false assumption, the optimizations are performed subject to a constraint on the probability that the independent messages happen to coincide. When the constraining probability equals one, the global maximum of the constrained optimization problem is the maximum likelihood sequence detection, allowing for a theoretical connection between turbo decoding and maximum likelihood sequence detection. It is then shown that the turbo decoder is a nonlinear block Gauss-Seidel iteration that aims to solve the optimization problem by zeroing the gradient of the Lagrangian with a Lagrange multiplier of −1. Some conditions for the convergence of the turbo decoder are then given by adapting the existing literature for Gauss-Seidel iterations.

Index Terms— constrained optimization, maximum likelihood decoding, turbo decoder convergence analysis

I. INTRODUCTION

Along with being one of the most prominent communications inventions of the past decade, the introduction of turbo codes in [3] began a new era in communications systems achieving unprecedented performance. The creation of the turbo decoder introduced a new method of decoding these codes which brought the decoding of complex codes within the reach of computationally practical algorithms. The iterative decoding algorithm, while being suboptimal, performs well enough to bring turbo codes very close to theoretically attainable limits.

An accurate justification for the decoding strategy's performance is still incomplete. While it has been proven that turbo codes have good distance properties, which would be relevant for maximum likelihood decoding, researchers have not yet succeeded in developing a proper connection between the suboptimal turbo decoder and maximum likelihood decoding. This is exacerbated by the fact that the turbo decoder, unlike most of the designs in modern communications systems engineering, was not originally introduced as a solution to an optimization problem. This has made explaining just why the turbo decoder performs as well as it does very difficult. Together with the lack of formulation as a solution to an optimization problem, the turbo decoder is an iterative algorithm, which makes determining its convergence and stability behavior important. Much of the analytical work concerning the turbo decoder, then, has focussed on determining its convergence and stability properties. Significant progress along these lines has been made with EXIT style analysis [4] and density evolution [5], but these techniques ultimately appeal to results which become valid only when the block length grows rather large. Other attempts, such as connections to factor graphs [6] and belief propagation [7], have been hindered from showing convergence due to loops in the turbo coding graph. The information geometric attempts [8], [9], [10], [11], [12], in turn, provided an intriguing partial description of the problem in terms of information projections, which allowed for linearized local stability results in [9], [10], but complete results have been hampered by an inability to efficiently describe extrinsic information extraction as an information projection.

None of these convergence frameworks, so far, has identified the optimization problem that the decoder is attempting to solve. Within the context of belief propagation [13], a general family of algorithms which includes the turbo decoder [7], [6], several authors [14], [15], [16], [17], [18], [19] have shown that the turbo decoder is related to an approximation in statistical physics which is a constrained optimization, but the development there is based on ideas from statistical physics which are not essential for the analysis of the turbo decoder. In particular, [14], [15], [16], [17], [18] and [19] show that the stationary points of belief propagation, and thus turbo decoding, minimize the Bethe approximation to the variational free energy. While this approximation has a long and established history within the context of statistical physics, its intuitive meaning within the context of turbo decoding is less than transparent. Indeed, given that the Bethe approximation cannot be expected to be exact in factor graphs with loops (a class within which all turbo codes are bound to lie), it is not clear why minimizing it can yield such good performance in these cases. This paper takes a different approach, choosing a different objective and constraints which both have clear intuitive meaning within the context of the turbo decoder and still yield the stationary points of the turbo decoder as critical points of the constrained optimization's Lagrangian.¹

¹Although it is well beyond the scope of this paper, the reader familiar with the Bethe approximation to the variational free energy may wish to consult [20], [21] and [22] for the interplay between these two different optimization frameworks that yield the same critical points.

After introducing our notation and reviewing the turbo decoder in Section II, we will show in Section III that the turbo decoder admits an exact interpretation as a well known iterative method [1] attempting to find a solution to a particular intuitively pleasing constrained optimization problem. In our formulation of the constrained optimization problem, it will become clear that the turbo decoder is calculating a constrained maximum likelihood sequence detection. This is to be contrasted to the suggestion in [14], [15], [17], [18] that belief propagation decoding results in estimates of the maximum likelihood bitwise detection, partially owing to the fact that the existing proofs of convergence within that arena are primarily based on the loopless case, for which bitwise optimality is well established [6]. Of course, the constraints in the constrained maximum likelihood sequence detection bias the turbo decoder away from the exact maximum likelihood sequence detection, so that in general it can be expected to be neither exactly the maximum likelihood sequence detection nor the maximum likelihood bitwise detection. Properly identifying the iterative method that is being used then allows us to give some conditions for convergence of the turbo decoder by borrowing convergence theory for the nonlinear block Gauss-Seidel iteration in Section IV.

Manuscript received July 29, 2005, revised January 25, 2006, revised again July 14, 2006. J. M. Walsh and C. R. Johnson, Jr. were supported in part by Applied Signal Technology and NSF grants CCF-0310023 and INT-0233127. P. A. Regalia was supported in part by the CNRS of France, contract no. 14871, and the Network of Excellence in Wireless Communications (NEWCOM), E. C. Contract no. 507325, while at the Groupe des Écoles des Télécommunications, INT, 91011 Evry, France. This paper collects and builds on work presented in [1] and [2]. J. M. Walsh is with the Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA. P. A. Regalia is with The Catholic University of America, Washington, DC. C. R. Johnson, Jr. is with the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY. Contact J. M. Walsh at [email protected] for information regarding this paper.

II. PRELIMINARIES AND NOTATION

Before we get to answering some key questions about the turbo decoder, we will need to discuss some preliminary topics. In the following development, we will find it useful to consider families of probability measures on the possible binary words of length N. This will lead us in Section II-A to consider the geometric structure of this family of probability measures by finding parameterizations of it that will be useful in the sequel. Next, in Section II-B we will consider a formulation of maximum likelihood sequence decoding which is rather atypical, but bears important resemblance to the turbo decoder. We then briefly review the operation of the turbo encoder and decoder in Section II-C, which may be helpful for some readers, and should also reinforce the information geometric notation that we will use in the remainder of the development.

A. Information Geometry

Let B_i ∈ {0,1}^N for i ∈ {0, ..., 2^N − 1} denote the binary representation of the integer i. Then, by forming the matrix

    B = (B_0, B_1, ..., B_{2^N−1})^T ∈ {0,1}^{2^N × N}

we can create a matrix whose rows collectively are all the possible binary words of length N. Given a random binary word of length N, call it ξ, we will be interested in different probability mass functions (PMFs) on the outcomes {ξ = B_i}. Since there are a finite number of such outcomes, we can completely characterize such a measure q by simply listing the probabilities q(B_i) = Pr[ξ = B_i]. Furthermore, because q is a PMF, we must have q(B_i) ≥ 0 and

    Σ_i q(B_i) = 1

We are then content that, to parameterize the set F of all PMFs on the outcomes {ξ = B_i}, it is sufficient to consider the set of vectors η of the form

    η = (q(B_0), q(B_1), ..., q(B_{2^N−1}))^T

whose entries are nonnegative and sum to one. We shall also find it convenient later to work with the log coordinates for PMFs in F. Given a PMF q ∈ F, its log coordinates are the vector θ whose ith element is given by

    θ_i = log(q(B_i)) − log(q(B_0))

Given, then, a vector θ, we see that we can uniquely determine its corresponding wordwise PMF q by using the list of probabilities in the vector η, which can be written in terms of θ as

    η = exp(θ − ψ(θ)),   ψ(θ) := log(‖exp(θ)‖_1)    (1)

where ‖·‖_1 is the 1-norm (the sum of the absolute values of the components of a vector argument). In fact, one may show that ψ(θ) is actually the convex conjugate [23] and dual potential under the Legendre transformation [24] to the negative of the Shannon entropy, so that

    ψ(θ) + H(η) ≥ ⟨θ, η⟩

with equality iff θ and η are coordinates for the same PMF, where H is the negative of the Shannon entropy

    H(η) = ⟨η, log(η)⟩

It is often convenient to work with the log coordinates of PMFs, since if the random binary words ξ, χ, ζ satisfy Pr[ζ = B_i] ∝ Pr[χ = B_i] Pr[ξ = B_i] for all i, we have that

    θ_ζ = θ_ξ + θ_χ    (2)

We will find it useful to parameterize the subset P ⊂ F which contains those probability measures Pr on {B_i} that factor into the product of their bitwise marginals, so that

    Pr(x) = Π_i Pr(x_i)

One can show [9], [17], [10] that this set may be parameterized by the vectors λ ∈ R^N of bitwise log probability ratios, which have elements of the form

    λ_i = log( Pr(x_i = 1) / (1 − Pr(x_i = 1)) )

The log coordinates of a factorable measure Pr ∈ P then take the form

    θ = Bλ    (3)

By combining the facts (2) and (3), we may represent the log coordinates of a wordwise PMF which results by weighting a PMF whose log coordinates are θ with a factorable PMF with bitwise log probability ratios λ as Bλ + θ. A fact that we will use later is that the gradient of ψ(Bλ + θ) with respect to λ is in fact the vector whose ith element is the marginal probability that the ith bit is one according to the PMF with log coordinates Bλ + θ:

    ∇_λ ψ(Bλ + θ) = ∇_λ log(‖exp(Bλ + θ)‖_1) = B^T exp(Bλ + θ)/‖exp(Bλ + θ)‖_1 = B^T η_{Bλ+θ} = p_{Bλ+θ}    (4)

Here we have used the notation η_θ to represent the PMF vector of the measure with log coordinates θ, and p_θ to represent the vector whose ith element is the probability that the ith bit is one according to the measure with log coordinates θ.

We will also be interested in handling cases where zero and one bitwise probabilities are possible, which correspond to vectors of log probability ratios λ with some elements that are infinite. In particular, for the cases when λ contains some infinite values, decompose it into λ = λ_∞ + λ_R, where λ_∞ ∈ {±∞, 0}^N and λ_R ∈ R^N. Given any function f : R^N → R, we will define f(λ) by extending f(λ) = lim_{t→∞} f(tλ_s + λ_R), where λ_s is any vector of finite log probability ratios whose signs match those of λ_∞, so that sign(λ_s) = sign(λ_∞). In every case we shall encounter in this article, these limits will be the same regardless of the choice of λ_s, so long as sign(λ_s) = sign(λ_∞). This definition has the nice property that if we re-wrote f(λ) in terms of the vector of bitwise probabilities p ∈ [0,1]^N corresponding to λ, and then evaluated it at the probabilities corresponding to λ (with possibly infinite values and thus some probabilities in {0,1}), we would recover the same value for f(λ).

We will exploit (4) later when we are discussing the operation of the turbo decoder. First, we will reexamine maximum likelihood decoding for a generic encoder.
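The coordinate maps just introduced are easy to experiment with numerically. The following minimal numpy sketch is our illustration rather than part of the original development (all names in it are ours): it builds B for N = 3, computes ψ and η from (1), and confirms the gradient identity (4) by central finite differences.

```python
import numpy as np
from itertools import product

N = 3
B = np.array(list(product([0, 1], repeat=N)), dtype=float)  # rows are B_0 ... B_{2^N - 1}

def psi(theta):
    """psi(theta) = log ||exp(theta)||_1 of eq. (1), computed stably."""
    m = theta.max()
    return m + np.log(np.exp(theta - m).sum())

def eta(theta):
    """Wordwise PMF with log coordinates theta, eq. (1)."""
    return np.exp(theta - psi(theta))

rng = np.random.default_rng(0)
lam, theta = rng.normal(size=N), rng.normal(size=2**N)

p = B.T @ eta(B @ lam + theta)   # bitwise marginals p_{B lam + theta}
h = 1e-5                         # eq. (4): these equal grad_lam psi(B lam + theta)
grad = np.array([(psi(B @ (lam + h * e) + theta)
                  - psi(B @ (lam - h * e) + theta)) / (2 * h) for e in np.eye(N)])
print(np.abs(grad - p).max())    # tiny: finite differences agree with eq. (4)
```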

B. Maximum Likelihood Decoding

In a typical decoding situation, a binary message ξ is encoded and transmitted over a channel to create the received data r. Given r, we would like to reconstruct the original message ξ. There are two senses of "optimal" when it comes to decoding ξ in the situation where we have no prior probabilities for the outcomes {ξ = B_i}. In one situation, we wish to minimize the likelihood of selecting the wrong sequence ξ so as to minimize the block error rate. In another situation we wish to minimize the probability of selecting the wrong bit ξ_i so as to minimize the bit error rate. The former yields maximum likelihood sequence detection, the latter maximum likelihood bitwise detection. The maximum likelihood sequence detection is then

    ξ̂_MLSD = argmax_{ξ∈{0,1}^N} p(r|ξ)

and the maximum likelihood bitwise detection is

    ξ̂_{i,MLBD} = argmax_{ξ_i∈{0,1}} Σ_{x|x_i=ξ_i} p(r|x)

where p(r|ξ) is the likelihood function which results from concatenating the encoder with the channel.

One can set up maximum likelihood parameter estimation problems which yield these detectors as their solutions. In particular, consider a parameter estimation problem where we are trying to determine the factorable prior PMF on ξ which yields the maximum a posteriori probability of having received r. This problem then takes the form

    q_ML = argmax_{q∈P} Σ_i p(r|ξ = B_i) q(B_i)    (5)

We know from Section II-A that to parameterize the set P it is sufficient to use a vector of log probability ratios λ, thus we can set this problem as selecting

    λ̂_ML = argmax_{λ∈(R∪{±∞})^N} p(r|λ) = argmax_{λ∈(R∪{±∞})^N} Σ_i p(r|ξ = B_i) q(B_i|λ)

In particular, since we know that for any λ we must have

    Σ_i q(B_i|λ) = 1

we see that the q that maximizes (5) takes the form

    q(B_i|λ̂_ML) = 1 if B_i = ξ̂_MLSD, and 0 otherwise

since putting any probability mass on any other word would not yield as high a likelihood. This then implies that λ̂_ML are the infinite log probability ratios associated with the word ξ̂_MLSD.

One can also set up a set of maximum likelihood parameter estimation problems whose answers are the maximum likelihood bitwise detections. In particular, consider the set of estimation problems

    λ̂_{i,ML} = argmax_{λ∈R∪{±∞}} Σ_ξ p(r|ξ) Pr[ξ_i|λ]
             = argmax_{λ∈R∪{±∞}} ( Pr[ξ_i = 1|λ] Σ_{ξ|ξ_i=1} p(r|ξ) + Pr[ξ_i = 0|λ] Σ_{ξ|ξ_i=0} p(r|ξ) )

From the latter form it is evident that

    Pr[ξ_i = 1|λ̂_{i,ML}] = 1 if Σ_{ξ|ξ_i=1} p(r|ξ) > Σ_{ξ|ξ_i=0} p(r|ξ), and 0 otherwise

which shows that λ̂_{i,ML} is the (infinite) log probability ratio corresponding to the maximum likelihood bitwise detection.

With these preliminaries, we now review the turbo encoder and decoder, which are suboptimal, yet have been shown via simulation to have near optimal performance at a reasonable complexity.
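Both detectors admit a brute-force form by direct enumeration of all 2^N words. The sketch below is ours; the vector lik is an arbitrary positive stand-in for the likelihoods p(r|ξ = B_i), not a modeled channel. It computes both detections and shows that they need not coincide.

```python
import numpy as np
from itertools import product

N = 3
B = np.array(list(product([0, 1], repeat=N)))   # all 2^N candidate words

rng = np.random.default_rng(1)
lik = rng.random(2**N)                          # stand-in for p(r | xi = B_i)

mlsd = B[np.argmax(lik)]                        # sequence detection: best single word
mlbd = np.array([int(lik[B[:, i] == 1].sum() > lik[B[:, i] == 0].sum())
                 for i in range(N)])            # bitwise detection: compare marginal sums
print(mlsd, mlbd)                               # the two detectors may disagree
```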

Fig. 1 (block diagram). Parallel concatenation of two convolutional codes with interleavers.

Fig. 2 (block diagram). The parallel concatenated turbo decoder.

Fig. 3 (block diagram). The serially concatenated turbo encoder.

C. The Turbo Encoder

Consider the parallel turbo encoder depicted in Figure 1. A block of N bits ξ is interleaved to get χ̂ = Π(ξ). (We will denote the deinterleaved Π⁻¹(χ̂) by χ for convenience in notation.) Then, ξ and χ̂ are encoded with two, possibly different, systematic convolutional encoders. We then pass the systematic bits, parity check bits from the first encoder, and parity check bits from the second encoder over a noisy memoryless channel to get the channel outputs r_s, r_0, and r_1, respectively, which we collect into a large vector r = (r_s, r_0, r_1).

The optimal decoder, in the sense of minimizing block error probability, at the receiver would choose ξ̂ to be the message which maximizes the likelihood function

    ξ̂ = argmax_{ξ∈{0,1}^N} p(r_s, r_0, r_1|ξ) = argmax_{ξ∈{0,1}^N} p(r_s, r_0|ξ) p(r_1|χ̂ = Π(ξ))

where the second factorization is admitted by the memoryless nature of the channel. As discussed in Section II-B, this is equivalent to choosing the prior factorable PMF for ξ that maximizes the a posteriori likelihood function subject to the constraint that χ = ξ. To see this, select log probability ratios λ to parameterize the prior PMF for ξ, and write the a posteriori likelihood function

    p_true(r_s, r_0, r_1|λ) = Σ_i p(r_s, r_0, r_1|ξ = B_i) Pr(ξ = B_i|λ)

This is maximized when

    Pr(ξ = B_i|λ) = 1 if i = argmax_j p(r_s, r_0, r_1|ξ = B_j), and 0 otherwise

so that λ corresponds to the (infinite) log probability ratios corresponding to the maximum likelihood sequence detector.

Unfortunately, such a procedure is not computationally feasible at the receiver. To cope with this problem, the parallel turbo decoder makes use of an exchange of information between computationally efficient decoders for each of the component codes, as depicted in Figure 2. Denoting by

    [θ_0]_i = log(p(r_s, r_0|ξ = B_i)) − log(p(r_s, r_0|ξ = B_0))
    [θ_1]_i = log(p(r_1|χ = B_i)) − log(p(r_1|χ = B_0))

and denoting λ_U and λ_T as the vectors of information exchanged between the two decoders, one may write the parallel turbo decoder succinctly as iterating²

    λ_T^(k) = π(Bλ_U^(k) + θ_0) − λ_U^(k)
    λ_U^(k+1) = π(Bλ_T^(k) + θ_1) − λ_T^(k)

where π takes a log PMF to its bitwise marginal log probability ratios, and thus may be written as

    π(θ) = log(B^T exp(θ − ψ(θ))) − log((1 − B)^T exp(θ − ψ(θ)))

where we denoted by 1 the 2^N × N matrix whose entries are all one. If we denote by p_θ the vector of bitwise marginal probabilities of the bits being one according to the wordwise log PMF θ, we may also write the turbo decoder as

    p_{B(λ_U^(k)+λ_T^(k))} = p_{Bλ_U^(k)+θ_0}
    p_{B(λ_U^(k+1)+λ_T^(k))} = p_{Bλ_T^(k)+θ_1}

since the extrinsic information vectors λ_T^(k) are first chosen so that they match the a posteriori probabilities from the first decoder when they are added to its prior information λ_U^(k), and similarly for λ_U^(k+1) with respect to the second decoder and prior information λ_T^(k).

²For consistency with the references, we have chosen the original decoder formulation from [3], [25], which explicitly includes r_s in only one of the component decoders. The reader unfamiliar with this form of the parallel turbo decoder may verify that the systematic information from r_s is still used (via the extrinsic information vectors λ_T) in both decoders, although it is not explicitly included in θ_1.
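For intuition, the parallel iteration can be run end to end in a few lines. The sketch below is a demonstration under our own stand-in assumptions: exact marginalization over all 2^N words replaces the efficient component SISO decoders the decoder uses in practice, and random vectors replace the code- and channel-determined θ_0 and θ_1. The map π is implemented directly from its definition.

```python
import numpy as np
from itertools import product

N = 3
B = np.array(list(product([0, 1], repeat=N)), dtype=float)

def psi(theta):
    m = theta.max()
    return m + np.log(np.exp(theta - m).sum())

def pi(theta):
    """Bitwise marginal log probability ratios of the wordwise log PMF theta."""
    p1 = B.T @ np.exp(theta - psi(theta))
    return np.log(p1) - np.log(1.0 - p1)

def parallel_turbo(theta0, theta1, iters=100):
    """lam_T = pi(B lam_U + theta0) - lam_U, then lam_U = pi(B lam_T + theta1) - lam_T,
    iterated from the uninformative initialization lam_U = 0."""
    lam_U = np.zeros(N)
    for _ in range(iters):
        lam_T = pi(B @ lam_U + theta0) - lam_U
        lam_U = pi(B @ lam_T + theta1) - lam_T
    return lam_U, lam_T

rng = np.random.default_rng(2)
lam_U, lam_T = parallel_turbo(rng.normal(size=2**N), rng.normal(size=2**N))
bits = (lam_U + lam_T > 0).astype(int)   # decisions from the posterior log ratios
```

The exponential enumeration over words is exactly what the forward-backward recursions of the component decoders avoid in a practical implementation.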

Fig. 4 (block diagram). The serially concatenated turbo decoder.

Fig. 5 (block diagram). The system which the parallel turbo decoder assumes.

Fig. 6 (block diagram). The system which the serial turbo decoder assumes.

The serial turbo encoder is depicted in Figure 3. A block ζ of M information bits is encoded with a systematic convolutional encoder to get ξ, which is then interleaved to get χ̂ = Π(ξ), and then encoded with another encoder, which may not be systematic, and passed over a memoryless channel to give the received vector r.

At the receiver, the optimal decoder, in the sense of minimizing the block error probability, would select the decoded message ζ̂ to be the message which maximizes the likelihood function

    ζ̂_ML = argmax_{ζ∈{0,1}^M} p(r|ζ)

In fact, one can show, following a technique similar to that from Section II-B, that ζ̂_ML can be related to an estimation problem. In particular, consider the problem of estimating the factorable PMF for ξ, parameterized by a vector of log probability ratios λ, which maximizes the likelihood function

    λ_ML = argmax_{λ∈A} Σ_ζ p(r|ξ = C_0(ζ)) Pr[ξ = C_0(ζ)|λ] = argmax_{λ∈A} Σ_i p(r|ξ = B_i) φ(B_i) Pr[ξ = B_i|λ]    (6)

where A := (R ∪ {±∞})^N and in the second line we have used the indicator function for encoder 0:

    φ(B_i) = 1 if B_i is in encoder 0's codebook, and 0 otherwise

and we have used C_0 to represent the function which takes inputs to encoder 0 to their encoded codewords. Once again, realizing that we must have

    Σ_i Pr[ξ = B_i|λ] = 1

we see that the λ_ML which maximizes (6) is the one which satisfies

    Pr[B_i|λ_ML] = 1 if B_i = C_0(ζ̂_ML), and 0 otherwise

so that λ_ML are the infinite log probability ratios corresponding to the encoded version of the maximum likelihood detection.

Unfortunately, as in the parallel case, this optimal decoder is not feasible. To cope with this problem, the serial turbo decoder makes use of an exchange of information between computationally efficient decoders for each of the component codes, as depicted in Figure 4. Denoting by

    [θ_0]_i = log(p(r|ξ = B_i)) − log(p(r|ξ = B_0))
    [θ_1]_i = log(φ(B_i)) − log(φ(B_0)) = 0 if B_i is in encoder 0's codebook, and −∞ otherwise

where we will assume that B_0 (the all zero codeword) is in encoder 0's codebook, which is satisfied, for instance, if encoder 0's code is linear, and denoting λ_U and λ_T as the vectors of information exchanged between the two decoders, one may write the serial turbo decoder succinctly as iterating

    λ_T^(k) = π(Bλ_U^(k) + θ_0) − λ_U^(k)    (7)
    λ_U^(k+1) = π(Bλ_T^(k) + θ_1) − λ_T^(k)    (8)

or, equivalently,

    p_{B(λ_U^(k)+λ_T^(k))} = p_{Bλ_U^(k)+θ_0}    (9)
    p_{B(λ_U^(k+1)+λ_T^(k))} = p_{Bλ_T^(k)+θ_1}    (10)

since the extrinsic information vectors λ_T^(k) are first chosen so that they match the a posteriori probabilities from the first decoder when they are added to its prior information λ_U^(k), and similarly for λ_U^(k+1) with respect to the second decoder and prior information λ_T^(k). Here, we have repeated the same notation for the serial decoder and the parallel decoder to emphasize the fact that they have the same form when written on this abstract level. The only important difference, aside from the fact that θ_0 and θ_1 are determined in different ways for the serial and parallel concatenated cases, is that in the serial decoder it is probabilities for the output of the first encoder C_0(ζ) which are being exchanged, as opposed to the original message bits ξ in the parallel turbo decoder. For this reason, we have deliberately chosen ξ to denote both the input to the parallel encoder in the parallel concatenated case and the output of the outer encoder in the serial concatenated case, so that it is always probabilities for bits in ξ that we are exchanging in either the serial or parallel concatenated case.
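One attractive feature of the log coordinate convention is that the serial decoder's code membership indicator θ_1 needs no special handling. A small sketch of this (ours, with a single parity check serving as a hypothetical stand-in for encoder 0's codebook): the −∞ entries make exp(θ_1) vanish off the codebook, so the π map in (8) automatically marginalizes over codewords only.

```python
import numpy as np
from itertools import product

N = 3
B = np.array(list(product([0, 1], repeat=N)), dtype=float)

# Hypothetical codebook for encoder 0: the even-weight words of a single
# parity check.  It contains the all-zero word B_0, matching the assumption above.
in_code = (B.sum(axis=1) % 2 == 0)
theta1 = np.where(in_code, 0.0, -np.inf)    # [theta_1]_i = log(phi(B_i)) - log(phi(B_0))

def psi(theta):
    m = theta[np.isfinite(theta)].max()
    return m + np.log(np.exp(theta - m).sum())

def pi(theta):
    p1 = B.T @ np.exp(theta - psi(theta))   # exp(-inf) = 0 removes non-codewords
    return np.log(p1) - np.log(1.0 - p1)

lam_T = np.array([0.5, -0.3, 0.2])
lam_U = pi(B @ lam_T + theta1) - lam_T      # one half-iteration of eq. (8)
```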

III. THE TURBO DECODER AS AN ITERATIVE SOLUTION TO A CONSTRAINED OPTIMIZATION PROBLEM

Here we will somewhat demystify the turbo decoder by showing both the sense in which its stationary points are optimal, as well as by identifying the iterative method which is being used to find these optimal points. We must first consider a system which, although it is not exactly equal to that in which the turbo decoder operates, is the system which the turbo decoder assumes in its sense of optimality. We will then provide the optimization interpretation of the turbo decoder, followed by some commentary and conclusions.

A. A Convenient Independence Assumption

In describing the operation of the turbo decoder, it will be advantageous to describe a system like that in which the parallel turbo decoder operates, but in which the messages ξ and χ are chosen completely independently of one another, as depicted in Figure 5 for the parallel case and in Figure 6 for the serial case. Under the false assumption that ξ and χ were chosen independently of one another according to factorable PMFs with (deterministic) bitwise log probability ratios λ_U and λ_T, the likelihood function for the received data would be

    p(r_s, r_0, r_1|λ_U, λ_T) = ( Σ_i p(r_s, r_0|ξ = B_i) Pr(ξ = B_i|λ_U) ) ( Σ_j p(r_1|χ = B_j) Pr(χ = B_j|λ_T) )
                              = ( ‖exp(Bλ_U + θ_0)‖_1 / ‖exp(Bλ_U)‖_1 ) ( ‖exp(Bλ_T + θ_1)‖_1 / ‖exp(Bλ_T)‖_1 )

in the case of parallel concatenation, and

    p(r|λ_U, λ_T) = ( Σ_i p(r|ξ = B_i) Pr(ξ = B_i|λ_U) ) ( Σ_j φ(B_j) Pr(χ = B_j|λ_T) )
                  = ( ‖exp(Bλ_U + θ_0)‖_1 / ‖exp(Bλ_U)‖_1 ) ( ‖exp(Bλ_T + θ_1)‖_1 / ‖exp(Bλ_T)‖_1 )

in the case of serial concatenation. Thus, recalling the definition of ψ from (1), the log likelihood function in either (serial or parallel) case is

    log(p(r|λ_U, λ_T)) = −ψ(Bλ_U) − ψ(Bλ_T) + ψ(Bλ_U + θ_0) + ψ(Bλ_T + θ_1)

Furthermore, the decision statistics which the parallel turbo decoder uses to make its decisions, which are the bitwise posterior probabilities at the output of one of the component decoders and have the log probability ratio vector λ_U + λ_T, may be interpreted under this notation as

    [Pr(ξ_i = χ_i = 1|ξ = χ, λ_U, λ_T)]_i = p_{B(λ_U+λ_T)}

or the collection over all the bits of the probability that the ith bit of ξ and χ are equal to 1, given that ξ = χ and that ξ and χ are chosen with factorable PMFs with log probability ratios λ_U and λ_T, respectively. Since we have assumed that the first component encoder in the serial turbo encoder is systematic, the serial turbo decoder uses the same statistics for its decisions, but only the elements in p_{B(λ_U+λ_T)} that correspond to systematic bits from that first encoder are needed.

To control this false assumption of independently choosing ξ and χ, we will be interested in choosing λ_U and λ_T so that the two densities so chosen have a large probability of selecting the same word. This is because we know a priori that ξ and χ are the same, and thus, although we are relaxing the constraint that they be exactly the same in doing the decoding, we want to enforce that they be similar. Thus, at the decoder, we will be interested in the constraint set

    C = { (λ_U, λ_T) | Pr(ξ = χ|λ_U, λ_T) = c }

which is the set such that the probability of choosing the same message for ξ and χ, given that we chose them independently according to the factorable densities whose bitwise log probability ratios are λ_U and λ_T respectively, is fixed to some constant c. We will now use the notation

    Pr(ξ = B_i|λ_U) = exp(B_i λ_U)/‖exp(Bλ_U)‖_1,   Pr(χ = B_i|λ_T) = exp(B_i λ_T)/‖exp(Bλ_T)‖_1

in order to rewrite

    Pr(ξ = χ|λ_U, λ_T) = Σ_i Pr(ξ = B_i|λ_U) Pr(χ = B_i|λ_T) = ‖exp(B(λ_U + λ_T))‖_1 / ( ‖exp(Bλ_U)‖_1 ‖exp(Bλ_T)‖_1 )

so that

    log(Pr(ξ = χ|λ_U, λ_T)) = log(‖exp(B(λ_U + λ_T))‖_1) − log(‖exp(Bλ_U)‖_1) − log(‖exp(Bλ_T)‖_1)
                            = ψ(B(λ_U + λ_T)) − ψ(Bλ_U) − ψ(Bλ_T)

Thus, we can write the constraint set as

    C = { (λ_U, λ_T) | ψ(B(λ_U + λ_T)) − ψ(Bλ_U) − ψ(Bλ_T) = log(c) }

B. An Exact Characterization of the Turbo Decoder

In the next theorem we show that the turbo decoder is an iterative attempt to find the maximum likelihood estimate for bitwise log probability ratios λ_U and λ_T under the false assumption that ξ and χ were chosen independently of one another according to the bitwise factorable PMFs with log probability ratios λ_U and λ_T, respectively. Furthermore, this optimization is performed subject to the constraint that the probability of selecting the same message when selecting ξ and χ in this manner is held constant.
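Both the objective and the constraint in this estimate are differences of ψ evaluations, so they take two lines each to evaluate. The sketch below is ours (reusing the conventions of the earlier sketches); as a sanity check, pushing λ_U = λ_T toward a large common value drives the coincidence probability toward one.

```python
import numpy as np
from itertools import product

N = 3
B = np.array(list(product([0, 1], repeat=N)), dtype=float)

def psi(theta):
    m = theta.max()
    return m + np.log(np.exp(theta - m).sum())

def log_coincidence(lam_U, lam_T):
    """log Pr(xi = chi | lam_U, lam_T): the constraint function defining C."""
    return psi(B @ (lam_U + lam_T)) - psi(B @ lam_U) - psi(B @ lam_T)

def log_lik(lam_U, lam_T, theta0, theta1):
    """log p(r | lam_U, lam_T) under the false independence assumption."""
    return (-psi(B @ lam_U) - psi(B @ lam_T)
            + psi(B @ lam_U + theta0) + psi(B @ lam_T + theta1))

lam = 10.0 * np.ones(N)                     # both PMFs concentrate on the same word,
print(np.exp(log_coincidence(lam, lam)))    # so Pr(xi = chi) approaches 1
```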

Theorem 1 (What is turbo decoding?): The turbo decoder is exactly a nonlinear block Gauss-Seidel iteration bent on finding the solution to the constrained optimization problem

    (λ_U*, λ_T*) = argmax_{(λ_U,λ_T)∈C} log(p(r|λ_U, λ_T))

In particular, the turbo decoder stationary points are in a one to one correspondence with the critical points of the Lagrangian of this optimization problem with a Lagrange multiplier of −1. The turbo decoder is then a nonlinear block Gauss-Seidel iteration on the gradient of this Lagrangian.

Proof: For the proof, form the Lagrangian

    L(λ_U, λ_T) = log(p(r|λ_U, λ_T)) + μ (log(Pr[ξ = χ|λ_U, λ_T]) − log(c))
                = −ψ(Bλ_U) − ψ(Bλ_T) + ψ(Bλ_U + θ_0) + ψ(Bλ_T + θ_1) − μ(ψ(Bλ_U) + ψ(Bλ_T)) + μψ(B(λ_U + λ_T)) − μ log(c)

Taking its gradient gives

    ∇_{λ_U} L = −p_{Bλ_U} + p_{Bλ_U+θ_0} + μ( −p_{Bλ_U} + p_{B(λ_U+λ_T)} )
    ∇_{λ_T} L = −p_{Bλ_T} + p_{Bλ_T+θ_1} + μ( −p_{Bλ_T} + p_{B(λ_U+λ_T)} )

Selecting a Lagrange multiplier of μ = −1, we have

    ∇_{λ_U} L = p_{Bλ_U+θ_0} − p_{B(λ_U+λ_T)}
    ∇_{λ_T} L = p_{Bλ_T+θ_1} − p_{B(λ_U+λ_T)}

If we break this system of equations into two parts,

    F_0(λ_U, λ_T) = −∇_{λ_T} L = p_{B(λ_U+λ_T)} − p_{Bλ_T+θ_1}    (11)
    F_1(λ_U, λ_T) = −∇_{λ_U} L = p_{B(λ_U+λ_T)} − p_{Bλ_U+θ_0}    (12)

then we can see that the turbo decoder solves F_1 for λ_T given a fixed λ_U = λ_U^(k) to get λ_T^(k), and then solves F_0 for λ_U given a fixed λ_T = λ_T^(k) to get λ_U^(k+1); that is,

    λ_T^(k) = λ_T such that F_1(λ_U^(k), λ_T) = 0
    λ_U^(k+1) = λ_U such that F_0(λ_U, λ_T^(k)) = 0

This is exactly the form of a nonlinear block Gauss-Seidel iteration. Furthermore, the system of equations it is trying to solve comprises the necessary conditions for finding a solution of the provided constrained optimization problem, which are

    ∇_{λ_U} L = 0
    ∇_{λ_T} L = 0

subject to a Lagrange multiplier of μ = −1. ∎

An important fact which sets this formulation apart from most constrained optimization problems is that one does not get to specify the value of c = Pr[ξ = χ|λ_U, λ_T] before decoding; rather, a Lagrange multiplier μ = −1 is selected, which then results in a value of the constraint. If the turbo decoder converges, one may evaluate the constraint function −ψ(Bλ_U) − ψ(Bλ_T) + ψ(B(λ_U + λ_T)) at the convergent λ_U and λ_T to get the log(c) for which the turbo decoder stationary points are critical points of the optimization problem. Note also that if we have c = Pr[ξ = χ|λ_U, λ_T] = 1, so that log(c) = 0, then the globally optimal solution to the optimization problem is actually the maximum likelihood sequence detection. This then suggests that for convergent values of log(c) close to 0, the turbo decoder will provide a detection close to a critical point of the wordwise likelihood function, provided continuity in log(c) of the solutions to the optimization problem.

While it may seem odd to select a Lagrange multiplier μ = −1 rather than a value of the constraint log(c), we note that this too has a possible intuitive interpretation within the context of the turbo decoder, although it is less rigorous than other developments in this paper. In particular, selecting μ = −1 requires that the gradient of the constraint ∇log(c) be equal to the gradient of the approximate likelihood function ∇log(p(r|λ_U, λ_T)) at a turbo decoder stationary point (λ_U*, λ_T*). Thus, the change in the likelihood function from its value at a turbo decoder stationary point, log(p(r|λ_U, λ_T)) − log(p(r|λ_U*, λ_T*)), is, to first order, equal to the change in the constraint from its value at that turbo decoder stationary point, log(c(λ_U, λ_T)) − log(c(λ_U*, λ_T*)), within a neighborhood of that stationary point. When the turbo decoder is used in practice, one takes decisions on its convergent values λ_U* + λ_T* to get the bits (ξ̂ = χ̂ ∈ {0,1}^N) which are passed on to the rest of the receiver. We use λ_U* + λ_T* because the ith element of this vector is the log likelihood ratio associated with the bitwise probability Pr[ξ_i = χ_i = 1|ξ = χ, λ_U, λ_T], and we condition here on ξ = χ because we know it to be true and wish to use that fact before passing our decisions to the rest of the receiver. Taking these decisions (and thus setting the new λ_U and λ_T equal to the likelihood ratios associated with these decisions, λ̂_U, λ̂_T = sign(λ_U* + λ_T*)∞ ∈ {±∞}^N) increases the value of the constraint probability to one, so that log(c(λ̂_U, λ̂_T)) = 0. Choosing the Lagrange multiplier μ = −1, then, ensures that this largest possible increase in the constraint is accompanied by an equally large increase in the approximated likelihood function to first order. The sign of the Lagrange multiplier ensures that increasing the constraint by taking decisions increases the approximate likelihood to first order, and the magnitude reflects that it is equally important to have a large change in the constraint as it is to have a large change in the likelihood. Although other interpretations of selecting μ = −1 could be offered, the theorem and the intuitive nature of the objective functions and constraints remain true.
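The characterization in Theorem 1 is easy to confirm numerically: run the iteration to (numerical) convergence and evaluate the two gradient blocks of L at μ = −1, along with the attained log(c). The sketch below is ours, under the same stand-in assumptions as before (exact marginalization, random θ_0 and θ_1); convergence for a particular random draw is observed here, not guaranteed.

```python
import numpy as np
from itertools import product

N = 3
B = np.array(list(product([0, 1], repeat=N)), dtype=float)

def psi(t):
    m = t.max()
    return m + np.log(np.exp(t - m).sum())

def p(t):
    """Bitwise marginals p_t = B^T eta_t."""
    return B.T @ np.exp(t - psi(t))

def pi(t):
    q = p(t)
    return np.log(q) - np.log(1.0 - q)

rng = np.random.default_rng(3)
theta0, theta1 = rng.normal(size=2**N), rng.normal(size=2**N)

lam_U = np.zeros(N)
for _ in range(200):                        # the turbo iteration (7)-(8)
    lam_T = pi(B @ lam_U + theta0) - lam_U
    lam_U = pi(B @ lam_T + theta1) - lam_T

g_U = p(B @ lam_U + theta0) - p(B @ (lam_U + lam_T))   # grad_{lam_U} L at mu = -1
g_T = p(B @ lam_T + theta1) - p(B @ (lam_U + lam_T))   # grad_{lam_T} L at mu = -1
log_c = psi(B @ (lam_U + lam_T)) - psi(B @ lam_U) - psi(B @ lam_T)
print(np.abs(g_U).max(), np.abs(g_T).max(), log_c)     # gradients ~ 0 at a fixed point
```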
It is also interesting to note that, with μ = −1 in the constrained Lagrangian, we recover the function used in Theorem 4 of [26], which was obtained without recourse to likelihood concepts. An investigation of other values for the Lagrange multiplier, as well as other justifications for choosing −1, would be an interesting topic for further research. Along these lines, for readers familiar with the Bethe approximation to the Gibbs free energy from statistical physics [18], [17], it may be noted that when evaluated at the turbo decoder stationary points, L|_{μ=−1} attains the same values as the Bethe approximation to the Gibbs free energy (see [17] page 1795), although for message values that are not stationary points the equality does not hold. In fact, a pseudo-duality relationship between the two optimality frameworks, that is, the constrained optimization problem from Theorem 1 and the Bethe free energy, can be shown [21].

Given the importance of selecting the Lagrange multiplier μ = −1 in order to get the turbo decoder stationary points, we henceforth select μ = −1 when we refer to the Lagrangian. Because the condition that the gradient of the Lagrangian is equal to zero is necessary, but not always sufficient, for a point to be the global optimum, we must characterize the type of critical points which are possible. In particular, we wish to know whether or not the critical points of the Lagrangian are at least local maxima of the constrained optimization problem. Generally speaking, one can converge to either a maximum, minimum, or saddle point, although if one replaces the Lagrangian with the expectation of the Lagrangian over the received data one can guarantee that there is only a global maximum.

Theorem 2 (Critical Point Characterization): The expected Lagrangian has only one critical point, which is a maximum of the expected constrained optimization problem. Here, the expectation is taken over the received information r using the approximate likelihood function for r given λ_U, λ_T under the false independence assumption.

Proof: To see this, consider the value of the Lagrangian within the constraint space

    L = log(c) − ψ(B(λ_U + λ_T)) + ψ(Bλ_U + θ_0) + ψ(Bλ_T + θ_1)
      = −ψ(Bλ_U) − ψ(Bλ_T) + ψ(Bλ_U + θ_0) + ψ(Bλ_T + θ_1)

where in the latter equation we substitute in the constraint log(c) − ψ(B(λ_U + λ_T)) = −ψ(Bλ_U) − ψ(Bλ_T). Now, note from this that L thus is the sum of two log likelihood functions within the constraint space:

    L = ( −ψ(Bλ_U) + ψ(Bλ_U + θ_0) ) + ( −ψ(Bλ_T) + ψ(Bλ_T + θ_1) ) = log(p(r_s, r_0|λ_U)) + log(p(r_1|λ_T))

This then implies that the second derivative of the Lagrangian within the constraint set has a mean which is the negative of the Fisher information matrix, call it I, since the Fisher information matrix is defined as

    I = − ( E  0
            0  V )

    E := ∫ ∇²_{λ_U,λ_U}{ log(p(r|λ_U, λ_T)) } p(r|λ_U, λ_T) dr
    V := ∫ ∇²_{λ_T,λ_T}{ log(p(r|λ_U, λ_T)) } p(r|λ_U, λ_T) dr

where we have used ∇²{·} here to denote the operator which takes a function to its Hessian matrix of second order partial derivatives. The well known fact, then, that the Fisher information matrix is positive semi-definite shows that the expectation of the Hessian matrix of the Lagrangian L is negative semi-definite. This implies, by Theorem 4.5 of [23], that the function E[L] is concave, where E denotes expectation with respect to p(r|λ_U, λ_T), and thus has a unique maximum. ∎

Two important distinctions are made in the previous theorem. First of all, while the expected value of the Lagrangian is concave within the constraint space, and thus has a unique maximum, we are not guaranteed that for a particular sample r_s, r_0, r_1 there is only one solution to ∇L = 0. In fact, Figure 7 shows that, even for N = 2, it is possible, depending on r_s, r_0, r_1, for this convergent point to be either a maximum or minimum. Another important distinction is that while the expected value of the Lagrangian is concave within the constraint space, it is not necessarily concave outside of the constraint space. In fact, one can show that outside of the constraint space the Lagrangian has

    ∇²_{λ_U,λ_U} L = ( P_{Bλ_U+θ_0} − p_{Bλ_U+θ_0} p_{Bλ_U+θ_0}^T ) − ( P_{B(λ_U+λ_T)} − p_{B(λ_U+λ_T)} p_{B(λ_U+λ_T)}^T )
    ∇²_{λ_U,λ_T} L = −( P_{B(λ_U+λ_T)} − p_{B(λ_U+λ_T)} p_{B(λ_U+λ_T)}^T )
    ∇²_{λ_T,λ_T} L = ( P_{Bλ_T+θ_1} − p_{Bλ_T+θ_1} p_{Bλ_T+θ_1}^T ) − ( P_{B(λ_U+λ_T)} − p_{B(λ_U+λ_T)} p_{B(λ_U+λ_T)}^T )

where P_θ is the matrix whose i, jth entry is the probability that both the ith and jth bits are one according to the wordwise measure whose log coordinates are θ. From this we can see that at a turbo decoder stationary point, the diagonal elements of ∇²L are all zero. Thus, the trace of the Hessian matrix is zero, and, provided the Hessian is not identically zero, the stationary points are saddle points of the Lagrangian when one does not restrict oneself to the constraint space.
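The saddle point observation can also be checked numerically. The sketch below (ours) assembles the Hessian blocks above for given message and likelihood vectors; fed the converged (λ_U, λ_T) from the previous sketch, its diagonal, and hence its trace, is numerically zero.

```python
import numpy as np
from itertools import product

N = 3
B = np.array(list(product([0, 1], repeat=N)), dtype=float)

def psi(t):
    m = t.max()
    return m + np.log(np.exp(t - m).sum())

def cov(t):
    """P_t - p_t p_t^T: the covariance matrix of the bit indicators under t."""
    eta = np.exp(t - psi(t))
    p = B.T @ eta
    return B.T @ (eta[:, None] * B) - np.outer(p, p)

def hessian(lam_U, lam_T, theta0, theta1):
    """The (unconstrained) Hessian of L at mu = -1, assembled from its blocks."""
    J = cov(B @ (lam_U + lam_T))
    return np.block([[cov(B @ lam_U + theta0) - J, -J],
                     [-J, cov(B @ lam_T + theta1) - J]])

# At a turbo fixed point the diagonal vanishes, so the trace is zero and the
# stationary point is a saddle of L once one leaves the constraint set.
```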

Fig. 7 (surface plots). L for two different sample values of θ_0 and θ_1 for N = 2 bits and simple parity check codes. Here, the line indicates C, and x marks the spot to which the turbo decoder converges given an initialization at (1/2, 1/2). The p_0 and p_1 axes are the bitwise marginal probabilities associated with λ_U, and we are always selecting λ_T = π(Bλ_U + θ_0) − λ_U. In one instance we have converged to a local minimum along C, and in another to a local maximum along C.

To conclude our remarks concerning the optimality of the turbo decoder stationary points, we wish to emphasize the new perspective that the constrained optimization problem can provide. Many researchers believe that the turbo decoder approximates the maximum likelihood bitwise solutions (see, e.g., [17], [18] or other literature about loopy belief propagation), partially because the elegant theory for belief propagation considers the special case when the graph has no loops [6], for which the belief propagation does calculate the maximum likelihood bitwise solutions. The use of the forward backward algorithm [27], which would calculate the MLBD given only r_s, r_0 or r_s, r_1, further confuses the issue. These results do not apply to the turbo decoder, or even to the soft decoding of finite block length LDPC codes, due to loops in the factor graphs of both of these decoders. Here, we have shown that the turbo decoder may be interpreted as a constrained maximum likelihood sequence detector. The constraints generally bias the stationary points away from the exact maximum likelihood sequence detector.

IV. CONVERGENCE ANALYSIS

In Theorem 1, we observed that the turbo decoder may be understood as an iterative procedure to seek a critical point of the Lagrangian L with a Lagrange multiplier μ = −1:

    Choose λ_T^(k) such that ∇_{λ_U} L(λ_U^(k), λ_T^(k)) = 0    (13)
    Choose λ_U^(k+1) such that ∇_{λ_T} L(λ_U^(k+1), λ_T^(k)) = 0    (14)

We then noted that such an iterative procedure may be connected with the Gauss-Seidel iteration for solving the system ∇L(λ_U, λ_T) = 0. From here on, to compact notation, and to continue the relation with the F_0 and F_1 notation used in the proof of Theorem 1, we will denote ∇L by F. Here we will elaborate on the connection to the Gauss-Seidel iteration, and use it to gain some convergence conditions for the turbo decoder. Before we do so, it is important to note that there have been numerous previous works which discuss local stability of the turbo decoder via first order Taylor approximation techniques; see for instance the information geometry based work [9], [26], [10]. Global stability, in turn, has been discussed in the context of belief propagation via either convexity of the Bethe free energy or loop-less factor graph cases; see e.g. [6], [28]. These and other expositions discussing convergence have already been mentioned in the introduction. The conditions for global stability that we shall discuss here, however, are derived uniquely via a connection to the nonlinear block Gauss-Seidel iteration, and in fact are mathematically different from those mentioned previously and in the introduction.

The Gauss-Seidel procedure has been discussed most widely for linear systems of equations [29, pp. 480–483], [30, pp. 251–257], [31, pp. 510–511], but can be generalized to nonlinear systems of equations as well [32, pp. 131–133; pp. 185–197], [33, p. 225]. In particular, the convergence of nonlinear Gauss-Seidel methods received some significant attention in the numerical analysis literature during the early 1970s. Relevant references include [34], [35], and [36], from which our major result is adapted.

One of the major difficulties encountered when applying the typical convergence theorems for nonlinear block Gauss-Seidel iterations to the iteration (13) and (14), as in [1], is that the parameters are drawn from a different domain than the range of ∇L. In particular, ∇L is a difference between bitwise probabilities and thus is necessarily bounded within [−1, 1]^N. The variables that are being solved for, λ_U and λ_T, however, are drawn from R^N. This foiled unmodified application of theorems from [36], since those theorems needed surjectivity of ∇L onto R^N. The approach taken in [1] was then to modify the theorems from [36], adding extra conditions so that the surjectivity was no longer needed. Unfortunately, this resulted in conditions which required the user of the theorem to check a containment relation of two parametrically defined sets in N dimensions, where N was the block size of the turbo decoder. It seemed doubtful that the theory would be useful without extra theory to determine when the conditions would be satisfied. Still, the reader interested in characterizing the regions of convergence for the case when the turbo decoder admits multiple fixed points may consult [1].

We take a different approach here, focussing instead on conditions on the likelihood functions of the two component decoders that yield global convergence of the turbo decoder from any initialization to a single fixed point. We transform the domain of the update equation ∇L = 0 from that of the difference between two bitwise probabilities to the difference between two log bitwise probability ratios. In other words, rather than considering the system (11) and (12), whose nonlinear block Gauss-Seidel iteration is embodied by the recursion (9) and (10), we use the system

    λ_U + λ_T − π(Bλ_U + θ_0) = 0    (15)
    λ_U + λ_T − π(Bλ_T + θ_1) = 0    (16)

whose nonlinear block Gauss-Seidel iteration is embodied by (7) and (8) and is thus equivalent to the recursion defined by (9) and (10). Since the range of the system in consideration is now R^{2N}, it is easier to apply the results from [36], since surjectivity can now hold. In particular, we have the following theorem, in which we use the componentwise ordering, so that x ≤ y ⟺ x_i ≤ y_i ∀i.

Theorem 3 (Global Convergence of the Turbo Decoder): Define the measures q whose log coordinates are Bλ_U + θ_0 and r whose log coordinates are Bλ_T + θ_1. Define the matrices C and G whose elements are

    [C]_{i,j} = q[ξ_j = 1|ξ_i = 1] − q[ξ_j = 1|ξ_i = 0]

and

    [G]_{i,j} = r[ξ_j = 1|ξ_i = 1] − r[ξ_j = 1|ξ_i = 0]

and define the matrix

    Λ := ( I      I − C
           I − G  I     )

Define the set D ⊆ R^{2^N} × R^{2^N} to be the set of (θ_0, θ_1) for which Λ is an M-matrix (for every choice of λ_U and λ_T), or equivalently has non-positive off-diagonal elements and all real eigenvalues positive [37], [38], [39], [40].³ Then, if (θ_0(r), θ_1(r)) ∈ D, the turbo decoder converges to a unique solution from any initialization (λ_U, λ_T) ∈ R^{2N}.

³See [37], [38], [39], [40] for conditions equivalent to the definition of an M-matrix.

Proof: The outline of our proof is as follows.
• The fact that Λ is an M-matrix shows by Thm. 3.6 of [36] that F is an M-function everywhere in R^{2N}.
• We can linearly lower bound F componentwise, and thus use Thm. 4.4 of [36] to prove surjectivity of F.
• We finally have that F is a surjective M-function. This allows us to apply Thm. 6.4 of [36] to prove convergence.

First, calculate the Jacobian

    ∇F = ( I      I − C
           I − G  I     )

where [C]_{i,j} = q[ξ_j = 1|ξ_i = 1] − q[ξ_j = 1|ξ_i = 0] and [G]_{i,j} = r[ξ_j = 1|ξ_i = 1] − r[ξ_j = 1|ξ_i = 0]. We now see that the conditions in the theorem force ∇F to be an M-matrix for all λ_U, λ_T ∈ R^N. Then Thm. 3.6 of [36] shows that F is an M-function. Next, consider that

    log( B^T exp(Bλ_U + θ_0) / (1 − B)^T exp(Bλ_U + θ_0) ) ≥ log( B^T exp(Bλ_U) / (1 − B)^T exp(Bλ_U) ) + 1 log(α_0/β_0) = λ_U + 1 log(α_0/β_0)

where

    α_0 = min_i [exp(θ_0)]_i,   β_0 = max_i [exp(θ_0)]_i

and we used the componentwise division notation, so that a/b for a, b ∈ R^N is the vector whose ith element is a_i/b_i. Similarly, we have

    log( B^T exp(Bλ_T + θ_1) / (1 − B)^T exp(Bλ_T + θ_1) ) ≥ λ_T + 1 log(α_1/β_1)

where

    α_1 = min_i [exp(θ_1)]_i,   β_1 = max_i [exp(θ_1)]_i

together with the matching upper bounds obtained by interchanging the roles of the minimum and the maximum. Putting these facts together, we have componentwise

    F(λ_U, λ_T) ≥ ( λ_T + 1 log(α_0/β_0) ; λ_U + 1 log(α_1/β_1) )

with the analogous upper bound holding with α and β interchanged, which shows that

    lim_{k→∞} ‖F(λ_k)‖ = ∞

whenever ‖λ_k‖ → ∞ and either λ_k ≥ λ_{k+1} or λ_{k+1} ≥ λ_k. This then implies that F is surjective by Thm. 4.4 of [36]. Thus, F is a surjective M-function, and Thm. 6.4 of [36] proves that the Gauss-Seidel block iteration on F is globally convergent. ∎

To illustrate the practical applicability of this theorem, we consider now a very simple case in which the globally convergent region can be computed exactly. In particular, consider the case that N = 2. Then Λ takes the form

    Λ = (  1   0   0  −α
           0   1  −β   0
           0  −γ   1   0
          −σ   0   0   1 )

where

    α = q[ξ_1 = 1|ξ_0 = 1] − q[ξ_1 = 1|ξ_0 = 0]
    β = q[ξ_0 = 1|ξ_1 = 1] − q[ξ_0 = 1|ξ_1 = 0]
    γ = r[ξ_1 = 1|ξ_0 = 1] − r[ξ_1 = 1|ξ_0 = 0]
    σ = r[ξ_0 = 1|ξ_1 = 1] − r[ξ_0 = 1|ξ_1 = 0]

We must calculate the eigenvalues of Λ. The characteristic function of Λ is

    det(Λ − λI) = ((1 − λ)² − ασ)((1 − λ)² − βγ)

Setting this equal to zero allows us to solve for the eigenvalues

    λ = 1 ± √(ασ),  1 ± √(βγ)

Now, we see that as long as α, σ, β, γ ≥ 0, we have that all of the eigenvalues are both real and positive, since α, σ, β, γ are the difference between two probabilities and thus have modulus less than one. It remains only to require that the off-diagonal elements of this matrix are non-positive. If we define

    η = (η_0, η_1, η_2, η_3)^T := exp(θ_0)/‖exp(θ_0)‖_1

then we see that the condition that α, β ≥ 0 is equivalent to

    η_3/(η_1 + η_3) ≥ η_2/(η_0 + η_2),   η_3/(η_2 + η_3) ≥ η_1/(η_0 + η_1)

which holds if and only if

    η_1 η_2 ≤ η_0 η_3    (17)

Denoting the components of the first decoder's log likelihood vector by θ_0 = (0, θ_1, θ_2, θ_3)^T (with a slight abuse of notation), we see that (17) holds if and only if

    θ_1 + θ_2 − θ_3 = (0, 1, 1, −1) θ_0 ≤ 0    (18)

The same requirement (18), applied to the components of θ_1 in place of θ_0, is necessary and sufficient for γ and σ to be nonnegative.

Now, suppose θ_0 has elements which satisfy (18), and consider all log domain PMFs of the form Bλ_U + θ_0 for λ_U ∈ R². These will satisfy

    (0, 1, 1, −1)(Bλ_U + θ_0) = (0, 0)λ_U + (0, 1, 1, −1)θ_0 ≤ 0

We can thus see that D is exactly characterized as

    D = { (θ_0, θ_1) ∈ R⁴ × R⁴ | (0, 1, 1, −1)θ_0 ≤ 0, (0, 1, 1, −1)θ_1 ≤ 0 }

Thus, as long as r and the structure of the code yield a (θ_0, θ_1) ∈ D, the turbo decoder converges to a unique fixed point from any initialization (λ_U, λ_T) ∈ R^N × R^N.

Continuing the example, suppose that we are using a parallel concatenated code, with each component code a simple parity check code with two systematic bits and one parity check bit, and we are transmitting using binary phase shift keying (BPSK) over an additive white Gaussian noise (AWGN) channel, and assume the all zero codeword was transmitted. This yields

    θ_0 = ( 0 0 0
            0 1 1
            1 0 1
            1 1 0 ) λ_0,   θ_1 = (0, 1, 1, 0)^T λ_1

where λ_0 are the raw channel log likelihood ratios for the systematic bits and the parity check bit of the first code (r_s, r_0), and λ_1 is the raw channel log likelihood ratio for the parity check bit of the second code. The condition that θ_0 and θ_1 lie in D is then equivalent to the requirement that

    (0, 1, 1, −1) ( 0 0 0
                    0 1 1
                    1 0 1
                    1 1 0 ) λ_0 = (0, 0, 2)λ_0 ≤ 0

and 2λ_1 ≤ 0. This establishes that the turbo decoder for this two bit parallel concatenated parity check code will converge to a fixed point from any initialization as long as the channel does not cause any errors in the parity check bits (i.e. as long as the log likelihood ratios associated with the parity check bits are less than or equal to zero).

It may seem odd that this result does not depend on the systematic bits, but this has a perfectly reasonable explanation. This result does not depend on the systematic bits because we have required the turbo decoder to converge from any initialization of the a priori log likelihood ratios for the systematic bits. This is equivalent to requiring the turbo decoder to converge regardless of the received raw channel log likelihood ratios for the systematic bits. This shows that, generally speaking, any set of sufficient conditions for global convergence of the parallel turbo decoder must thus be insensitive to the received channel likelihood values for the systematic bits. Thus it is natural that the raw channel log likelihood ratios for the systematic bits did not appear when checking (θ_0, θ_1) ∈ D.

Given the wide range of nonlinear dynamical phenomena that the turbo decoder has been demonstrated to possess (e.g. limit cycles [41] and even chaotic behavior [42]), it is important to characterize the cases in which the turbo decoder avoids this bad behavior and instead converges to a single fixed point. The theorem we have adapted from nonlinear block Gauss-Seidel iterations gives us sufficient conditions for this robust behavior, which we have demonstrated to be intuitively meaningful for a particularly simple, yet nontrivial, two systematic bit case. It remains to be seen whether or not the conditions of the theorem can be interpreted in a similar manner for realistic codeword lengths.
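The membership test for this example reduces to two scalar sign checks, as the following sketch (ours; the numeric LLR values are arbitrary illustrations) makes explicit.

```python
import numpy as np

M0  = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
par = np.array([0.0, 1.0, 1.0, 0.0])          # theta_1 = par * lam_1
t   = np.array([0.0, 1.0, 1.0, -1.0])         # the test vector of (18)

def converges_from_anywhere(lam0, lam1):
    """(theta_0, theta_1) in D; algebraically this is lam0[2] <= 0 and lam1 <= 0,
    i.e. correct-sign channel LLRs on both parity bits (all-zero codeword sent)."""
    return t @ (M0 @ lam0) <= 0 and t @ (par * lam1) <= 0

print(converges_from_anywhere(np.array([3.1, -0.4, -1.2]), -0.7))  # True
print(converges_from_anywhere(np.array([-2.0, -2.0, 0.5]), -0.7))  # False: parity flipped
```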

V. CONCLUSIONS

The turbo decoder is a suboptimal heuristic method which was developed through simulation. Although researchers have been able to characterize its convergence behavior and performance in the asymptotically large block length and cycle-free factor graph cases, up until now the sense of optimality of its stationary points relative to the desired maximum likelihood sequence detector design, the mechanism behind its convergence, and conditions under which it converged all remained less understood. We have shown that the turbo decoder stationary points are critical points of the constrained optimization problem of maximizing the log likelihood function for the received data under a false independence assumption for the messages for which the turbo decoder is exchanging soft information, subject to the constraint that the probability that the messages so chosen are the same is fixed. If this probability is 1, then the global maximum of this constrained optimization problem is the maximum likelihood sequence detector. Furthermore, we have shown that the turbo decoder is actually a nonlinear block Gauss-Seidel iteration on the system of necessary equations for this constrained optimization problem specified by Lagrange with a Lagrange multiplier of −1. Finally, by identifying the iterative method that was being used to find the solution to the necessary conditions, we identified the mechanism behind the turbo decoder's convergence. Using the existing theory for this convergence mechanism, we were able to determine sufficient conditions on the component likelihood functions for the convergence of the turbo decoder from any initialization, which we demonstrated to be intuitively reasonable for a particular simple parallel concatenated code.

REFERENCES

[1] J. M. Walsh, P. A. Regalia, and C. R. Johnson, Jr., "A convergence proof for the turbo decoder as an instance of the Gauss-Seidel iteration," in IEEE International Symposium on Information Theory, Adelaide, Australia, Sept. 2005.
[2] ——, "Turbo decoding as constrained optimization," in 43rd Allerton Conference on Communication, Control, and Computing, Sept. 2005.
[3] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," in ICC 93, vol. 2, Geneva, May 1993, pp. 1064–1070.
[4] S. ten Brink, "Convergence behavior of iteratively decoded parallel concatenated codes," IEEE Trans. Commun., vol. 49, pp. 1727–1737, Oct. 2001.
[5] H. E. Gamal and A. R. Hammons, Jr., "Analyzing the turbo decoder using the Gaussian approximation," IEEE Trans. Inform. Theory, vol. 47, pp. 671–686, Feb. 2001.
[6] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inform. Theory, vol. 47, pp. 498–519, Feb. 2001.
[7] R. J. McEliece, D. J. C. MacKay, and J.-F. Cheng, "Turbo decoding as an instance of Pearl's belief propagation algorithm," IEEE J. Select. Areas Commun., vol. 16, pp. 140–152, Feb. 1998.
[8] M. Moher and T. A. Gulliver, "Cross-entropy and iterative decoding," IEEE Trans. Inform. Theory, vol. 44, pp. 3097–3104, Nov. 1998.

[9] T. J. Richardson, "The geometry of turbo-decoding dynamics," IEEE Trans. Inform. Theory, vol. 46, pp. 9–23, Jan. 2000.
[10] S. Ikeda, T. Tanaka, and S. Amari, "Information geometry of turbo and low-density parity-check codes," IEEE Trans. Inform. Theory, vol. 50, pp. 1097–1114, June 2004.
[11] B. Muquet, P. Duhamel, and M. de Courville, "Geometrical interpretations of iterative 'turbo' decoding," in Proceedings ISIT, June 2002.
[12] J. M. Walsh, P. A. Regalia, and C. R. Johnson, Jr., "A refined information geometric interpretation of turbo decoding," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, Mar. 2005.
[13] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, 1988.
[14] P. Pakzad and V. Anantharam, "Belief propagation and statistical physics," in Proceedings of the Conference on Information Sciences and Systems, Princeton University, Mar. 2002.
[15] ——, "Estimation and marginalization using Kikuchi approximation methods," Neural Computation, pp. 1836–1876, Aug. 2005.
[16] ——, "Kikuchi approximation method for joint decoding of LDPC codes and partial-response channels," IEEE Trans. Commun., vol. 54, no. 7, July 2006.
[17] S. Ikeda, T. Tanaka, and S. Amari, "Stochastic reasoning, free energy and information geometry," Neural Computation, no. 16, pp. 1779–1810, 2004.
[18] J. Yedidia, W. Freeman, and Y. Weiss, "Constructing free-energy approximations and generalized belief propagation algorithms," IEEE Trans. Inform. Theory, no. 7, pp. 2282–2312, July 2005.
[19] A. Montanari and N. Sourlas, "The statistical mechanics of turbo codes," Eur. Phys. J. B, no. 18, pp. 107–109, 2000.
[20] J. M. Walsh and P. A. Regalia, "Connecting belief propagation with maximum likelihood detection," in Fourth International Symposium on Turbo Codes, Munich, Germany, Apr. 2006.
[21] J. M. Walsh, "Dual optimality frameworks for expectation propagation," in IEEE Conference on Signal Processing Advances in Wireless Communications (SPAWC), Cannes, France, June 2006.
[22] ——, "Distributed iterative decoding and estimation via expectation propagation: performance and convergence," Ph.D. dissertation, Cornell University, 2006.
[23] R. T. Rockafellar, Convex Analysis. Princeton University Press, 1970.
[24] S. Amari, Methods of Information Geometry. AMS Translations of Mathematical Monographs, 2004, vol. 191.
[25] C. Berrou and A. Glavieux, "Near optimum error correction coding and decoding: turbo codes," IEEE Trans. Commun., vol. 44, pp. 1262–1271, Oct. 1996.
[26] S. Ikeda, T. Tanaka, and S. Amari, "Information geometrical framework for analyzing belief propagation decoder," in Advances in Neural Information Processing Systems 14. MIT Press, 2002.
[27] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inform. Theory, vol. 20, Mar. 1974.
[28] S. C. Tatikonda, "Convergence of the sum-product algorithm," in Proceedings of the 2003 Information Theory Workshop, Paris, France, pp. 222–225.
[29] F. S. Acton, Numerical Methods that Work. Harper & Row Publishers, 1970.
[30] J. L. Buchanan and P. R. Turner, Numerical Methods and Analysis. McGraw Hill, Inc., 1992.
[31] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. The Johns Hopkins University Press, 1996.
[32] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.
[33] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, 1970.
[34] W. C. Rheinboldt, "On M-functions and their application to nonlinear Gauss-Seidel iterations and network flows," Journal of Mathematical Analysis and Applications, no. 32, pp. 274–307, 1971.
[35] J. J. Moré, "Nonlinear generalizations of matrix diagonal dominance with application to Gauss-Seidel iterations," SIAM Journal on Numerical Analysis, vol. 9, no. 2, pp. 357–378, June 1972.
[36] W. C. Rheinboldt, "On classes of n-dimensional nonlinear mappings generalizing several types of matrices," in Proc. Symposium on the Numerical Solution of Partial Differential Equations, II. Academic Press, 1970, pp. 501–546.
[37] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis. Cambridge University Press, Cambridge, 1991.
[38] M. Fiedler, Special Matrices and Their Applications in Numerical Mathematics. Martinus Nijhoff, Dordrecht, 1986.
[39] O. Axelsson, Iterative Solution Methods. Cambridge University Press, 1994.
[40] Anonymous, "M-matrix," entry at PlanetMath.org. [Online]. Available: http://planetmath.org/encyclopedia/MMatrix.html
[41] D. Agrawal and A. Vardy, "The turbo decoding algorithm and its phase trajectories," IEEE Trans. Inform. Theory, vol. 47, pp. 699–722, Feb. 2001.
[42] Z. Tasev, L. Kocarev, and G. Maggio, "Bifurcations and chaos in the turbo decoding algorithm," in Circuits and Systems, 2003 (ISCAS '03), Proceedings of the 2003 International Symposium on, May 2003, pp. III-120–III-123.

John MacLaren Walsh was born in Carbondale, IL in 1981. He received his B.S. (magna cum laude), M.S., and Ph.D. degrees from Cornell University under the supervision of C. Richard Johnson, Jr. in 2002, 2004, and 2006, respectively. After completing a short postdoc at the University of British Columbia with Vikram Krishnamurthy in the summer of 2006, he joined the Department of Electrical and Computer Engineering at Drexel University, where he is currently an assistant professor. He is a member of the IEEE, HKN, and TBP. His current research interests involve: (a) the development of a design theory for distributed composite adaptive systems, (b) the performance, convergence, and application of expectation propagation, a family of algorithms for distributed iterative decoding and estimation, and, most recently, (c) models for permeation in biological ion channels.

Phillip A. Regalia was born in Walnut Creek, CA. He received the B.Sc. (highest honors), M.Sc., and Ph.D. degrees in electrical and computer engineering from the University of California at Santa Barbara in 1985, 1987, and 1988, respectively, and the Habilitation à Diriger des Recherches from the University of Paris-Orsay, France, in 1994. Currently he is with the Department of Electrical Engineering and Computer Science of the Catholic University of America, Washington, DC, and is an Adjunct Professor with the Institut National des Télécommunications in Evry, France. His research interests include adaptive filtering and wireless communications.

C. Richard Johnson, Jr. was born in Macon, GA in 1950. He is currently a Professor of Electrical and Computer Engineering and a member of the Graduate Field of Applied Mathematics at Cornell University, Ithaca, NY. Since joining the Cornell faculty in 1981, he has received school, college, university, and national teaching awards. In 1983, Eta Kappa Nu selected him as the C. Holmes MacDonald Outstanding Teacher among young professors of electrical engineering in the USA. In 2004, he was named a Stephen H. Weiss Presidential Fellow, only three of which are chosen per year from the tenured faculty at Cornell for their "effective, inspiring, and distinguished teaching of undergraduate students". Professor Johnson received a PhD in Electrical Engineering from Stanford University, along with the first PhD minor in Art History granted by Stanford, in 1977. In 1983, he was selected as the Outstanding Young American Electrical Engineer by Eta Kappa Nu for "outstanding contributions to the field of control technology, his cultural achievements, and his involvement in professional activities". In 1989, he was elected a Fellow of the IEEE "for contributions to adaptive parameter estimation theory with applications in digital control and signal processing". During the past decade and a half, his primary research interest has been blind adaptive equalization for communication systems. His recent books are Johnson and Sethares, Telecommunication Breakdown: Concepts of Communication Transmitted via Software-Defined Radio, Pearson Prentice Hall, 2004, and Treichler, Johnson, and Larimore, Theory and Design of Adaptive Filters, Prentice Hall, 2001. In the past decade, Professor Johnson has taught courses on adaptive signal processing in communication systems as a visiting professor at Stanford University, University of California Berkeley, Princeton University, École Nationale Supérieure des Télécommunications (France), University College Dublin (Ireland), and Technische Universiteit Delft (the Netherlands). For the latter half of 2005, he visited the Conservatoire National des Arts et Métiers (Paris, France), the École Nationale Supérieure de l'Électronique et de ses Applications (Cergy-Pontoise, France), and the École Supérieure d'Électricité (Gif-sur-Yvette, France) as a Fulbright Research Scholar. He is currently organizing the first international workshop on image processing for artist identification, to be held at the Van Gogh Museum (Amsterdam) in May 2007.