EURASIP Journal on Applied Signal Processing 2003:6, 530–542. © 2003 Hindawi Publishing Corporation

An FPGA Implementation of (3, 6)-Regular Low-Density Parity-Check Code Decoder

Tong Zhang Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA Email: [email protected]

Keshab K. Parhi Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA Email: [email protected]

Received 28 February 2002 and in revised form 6 December 2002

Because of their excellent error-correcting performance, low-density parity-check (LDPC) codes have recently attracted a lot of attention. In this paper, we are interested in practical LDPC code decoder hardware implementations. The direct fully parallel decoder implementation usually incurs too high hardware complexity for many real applications, so partly parallel decoder design approaches that can achieve appropriate trade-offs between hardware complexity and decoding throughput are highly desirable. Applying a joint code and decoder design methodology, we develop a high-speed (3,k)-regular LDPC code partly parallel decoder architecture, based on which we implement a 9216-bit, rate-1/2 (3,6)-regular LDPC code decoder on a Xilinx FPGA device. This partly parallel decoder supports a maximum symbol throughput of 54 Mbps and achieves BER 10^-6 at 2 dB over the AWGN channel while performing a maximum of 18 decoding iterations.

Keywords and phrases: low-density parity-check codes, error-correcting coding, decoder, FPGA.

1. INTRODUCTION

In the past few years, the recently rediscovered low-density parity-check (LDPC) codes [1, 2, 3] have received a lot of attention and have been widely considered as next-generation error-correcting codes for telecommunication and magnetic storage. Defined as the null space of a very sparse M x N parity-check matrix H, an LDPC code is typically represented by a bipartite graph, usually called a Tanner graph, in which one set of N variable nodes corresponds to the set of codeword bits, another set of M check nodes corresponds to the set of parity-check constraints, and each edge corresponds to a nonzero entry in the parity-check matrix H. (A bipartite graph is one in which the nodes can be partitioned into two sets, X and Y, so that the only edges of the graph are between the nodes in X and the nodes in Y.) An LDPC code is known as a (j,k)-regular LDPC code if each variable node has degree j and each check node has degree k, or equivalently, if each column and each row of its parity-check matrix contain j and k nonzero entries, respectively. The code rate of a (j,k)-regular LDPC code is 1 - j/k provided that the parity-check matrix has full rank. The construction of LDPC codes is typically random. LDPC codes can be effectively decoded by the iterative belief-propagation (BP) algorithm [3] that, as illustrated in Figure 1, directly matches the Tanner graph: decoding messages are iteratively computed on each variable node and check node and exchanged through the edges between the neighboring nodes.

Recently, tremendous efforts have been devoted to analyzing and improving the error-correcting capability of LDPC codes, see [4, 5, 6, 7, 8, 9, 10, 11] and so forth. Besides their powerful error-correcting capability, another important reason why LDPC codes attract so much attention is that the iterative BP decoding algorithm is inherently fully parallel, so a great potential decoding speed can be expected.

The high-speed decoder hardware implementation is obviously one of the most crucial issues determining the extent of LDPC applications in the real world. The most natural solution for the decoder architecture design is to directly instantiate the BP decoding algorithm in hardware: each variable node and check node is physically assigned its own processor, and all the processors are connected through an interconnection network reflecting the Tanner graph connectivity. By completely exploiting the parallelism of the BP decoding algorithm, such a fully parallel decoder can achieve very high decoding speed; for example, a 1024-bit, rate-1/2 LDPC code fully parallel decoder with a maximum symbol throughput of 1 Gbps has been physically implemented using ASIC technology [12].

Figure 1: Tanner graph representation of an LDPC code and the decoding message flow: check-to-variable messages pass from the check nodes to the variable nodes, and variable-to-check messages pass in the opposite direction.

The main disadvantage of such a fully parallel design is that, with the increase of code length (typically the LDPC code length is very large, at least several thousands), the incurred hardware complexity becomes more and more prohibitive for many practical purposes; for example, for 1-K code length, the ASIC decoder implementation [12] consumes 1.7M gates. Moreover, as pointed out in [12], the routing overhead for implementing the entire interconnection network becomes quite formidable due to the large code length and the randomness of the Tanner graph. Thus high-speed partly parallel decoder design approaches that achieve appropriate trade-offs between hardware complexity and decoding throughput are highly desirable.

For any given LDPC code, due to the randomness of its Tanner graph, it is nearly impossible to directly develop a high-speed partly parallel decoder architecture. To circumvent this difficulty, Boutillon et al. [13] proposed a decoder-first code design methodology: instead of trying to conceive the high-speed partly parallel decoder for any given random LDPC code, use an available high-speed partly parallel decoder to define a constrained random LDPC code. We may consider it as an application of the well-known "think in the reverse direction" methodology. Inspired by the decoder-first code design methodology, we proposed a joint code and decoder design methodology in [14] for (3,k)-regular LDPC code partly parallel decoder design. By jointly conceiving the code construction and the partly parallel decoder architecture design, we presented a (3,k)-regular LDPC code partly parallel decoder structure in [14], which not only defines very good (3,k)-regular LDPC codes but also could potentially achieve high-speed partly parallel decoding.

In this paper, applying the joint code and decoder design methodology, we develop an elaborate (3,k)-regular LDPC code high-speed partly parallel decoder architecture, based on which we implement a 9216-bit, rate-1/2 (3,6)-regular LDPC code decoder using a Xilinx Virtex FPGA (Field Programmable Gate Array) device. In this work, we significantly modify the original decoder structure [14] to improve the decoding throughput and simplify the control logic design. To achieve good error-correcting capability, the LDPC code decoder architecture has to possess randomness to some extent, which makes FPGA implementations more challenging since an FPGA has fixed and regular hardware resources. We propose a novel scheme to realize the random connectivity by concatenating two routing networks, where all the random hardwire routings are localized and the overall routing complexity is significantly reduced. Exploiting the good minimum distance property of LDPC codes, this decoder employs the parity check as an early decoding stopping criterion to achieve adaptive decoding for energy reduction. With a maximum of 18 decoding iterations, this FPGA partly parallel decoder supports a maximum of 54 Mbps symbol throughput and achieves BER (bit error rate) 10^-6 at 2 dB over the AWGN channel.

This paper begins with a brief description of the LDPC code decoding algorithm in Section 2. In Section 3, we briefly describe the joint code and decoder design methodology for (3,k)-regular LDPC code partly parallel decoder design. In Section 4, we present the detailed high-speed partly parallel decoder architecture design. Finally, an FPGA implementation of a (3,6)-regular LDPC code partly parallel decoder is discussed in Section 5.

2. DECODING ALGORITHM

Since the direct implementation of the BP algorithm would incur too high hardware complexity due to the large number of multiplications, we introduce some logarithmic quantities to convert these complicated multiplications into additions, which leads to the Log-BP algorithm [2, 15].

Before the description of the Log-BP decoding algorithm, we introduce some definitions as follows. Let H denote the M x N sparse parity-check matrix of the LDPC code and H_{i,j} denote the entry of H at the position (i, j). We define the set of bits n that participate in parity check m as N(m) = {n : H_{m,n} = 1}, and the set of parity checks m in which bit n participates as M(n) = {m : H_{m,n} = 1}. We denote the set N(m) with bit n excluded by N(m)\n, and the set M(n) with parity check m excluded by M(n)\m.

Algorithm 1 (iterative Log-BP decoding algorithm).

Input
The prior probabilities p_n^0 = P(x_n = 0) and p_n^1 = P(x_n = 1) = 1 - p_n^0, n = 1, ..., N.

Output
Hard decisions x^ = {x^_1, ..., x^_N}.

Procedure
(1) Initialization: for each n, compute the intrinsic (or channel) message gamma_n = log(p_n^0 / p_n^1), and for each (m, n) in {(i, j) | H_{i,j} = 1}, compute

    alpha_{m,n} = sign(gamma_n) * log[(1 + e^{-|gamma_n|}) / (1 - e^{-|gamma_n|})],    (1)

where

    sign(gamma_n) = +1 if gamma_n >= 0, and -1 if gamma_n < 0.    (2)

(2) Iterative decoding
(i) Horizontal (or check node computation) step: for each (m, n) in {(i, j) | H_{i,j} = 1}, compute

    beta_{m,n} = log[(1 + e^{-alpha}) / (1 - e^{-alpha})] * prod_{n' in N(m)\n} sign(alpha_{m,n'}),    (3)

where alpha = sum_{n' in N(m)\n} |alpha_{m,n'}|.
(ii) Vertical (or variable node computation) step: for each (m, n) in {(i, j) | H_{i,j} = 1}, compute

    alpha_{m,n} = sign(gamma_{m,n}) * log[(1 + e^{-|gamma_{m,n}|}) / (1 - e^{-|gamma_{m,n}|})],    (4)

where gamma_{m,n} = gamma_n + sum_{m' in M(n)\m} beta_{m',n}. For each n, update the pseudoposterior log-likelihood ratio (LLR) lambda_n as

    lambda_n = gamma_n + sum_{m in M(n)} beta_{m,n}.    (5)

(iii) Decision step:
(a) perform a hard decision on {lambda_1, ..., lambda_N} to obtain x^ = {x^_1, ..., x^_N} such that x^_n = 0 if lambda_n > 0 and x^_n = 1 if lambda_n <= 0;
(b) if H * x^ = 0, the algorithm terminates; otherwise, go to the horizontal step until the preset maximum number of iterations has occurred.
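For concreteness, the following is a minimal floating-point sketch of Algorithm 1 in Python (our own illustration, not the hardware implementation: the decoder described below quantizes all messages to five bits, and the names and the dense-matrix data layout are ours). Note that f(x) = log[(1 + e^{-x})/(1 - e^{-x})] is its own inverse, which is why the same transform appears in (1), (3), and (4).

```python
import numpy as np

def log_bp_decode(H, gamma, max_iter=18):
    """Floating-point sketch of Algorithm 1 for a dense 0/1 matrix H (M x N)
    and intrinsic messages gamma[n] = log(p_n^0 / p_n^1). The parity check
    is tested between the horizontal and vertical steps, which is
    equivalent to the ordering in Algorithm 1."""
    sgn = lambda v: 1.0 if v >= 0 else -1.0                        # (2)
    f = lambda x: np.log((1 + np.exp(-x)) / (1 - np.exp(-x) + 1e-12))
    edges = [(m, n) for m, n in np.argwhere(H == 1)]
    alpha = np.zeros(H.shape)
    beta = np.zeros(H.shape)
    for m, n in edges:                                             # (1)
        alpha[m, n] = sgn(gamma[n]) * f(abs(gamma[n]))
    for _ in range(max_iter):
        for m, n in edges:                                         # (3)
            others = [n2 for n2 in np.flatnonzero(H[m]) if n2 != n]
            mags = np.abs(alpha[m, others]).sum()
            signs = np.prod([sgn(a) for a in alpha[m, others]])
            beta[m, n] = f(mags) * signs
        lam = gamma + beta.sum(axis=0)                             # (5)
        x_hat = (lam <= 0).astype(int)                             # decision
        if not ((H @ x_hat) % 2).any():                            # H x = 0
            break
        for m, n in edges:                                         # (4)
            g = lam[n] - beta[m, n]        # = gamma_n + sum over M(n)\m
            alpha[m, n] = sgn(g) * f(abs(g))
    return x_hat
```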
We call alpha_{m,n} and beta_{m,n} in the above algorithm extrinsic messages, where alpha_{m,n} is delivered from variable node to check node and beta_{m,n} is delivered from check node to variable node.

Each decoding iteration can be performed in fully parallel fashion by physically mapping each check node to one individual check node processing unit (CNU) and each variable node to one individual variable node processing unit (VNU). Moreover, by delivering the hard decision x^_i from each VNU to its neighboring CNUs, the parity check H * x^ can be easily performed by all the CNUs. Thanks to the good minimum distance property of LDPC codes, such an adaptive decoding scheme can effectively reduce the average energy consumption of the decoder without performance degradation.

In partly parallel decoding, the operations of a certain number of check nodes or variable nodes are time-multiplexed, or folded [16], to a single CNU or VNU. For an LDPC code with M check nodes and N variable nodes, if its partly parallel decoder contains M_p CNUs and N_p VNUs, we denote M/M_p as the CNU folding factor and N/N_p as the VNU folding factor.
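As a quick sanity check, the folding factors for the decoder developed in Sections 4 and 5 (where k = 6 and L = 256) work out as follows; the variable names are ours:

```python
# Folding factors for the Section 4 architecture with k = 6, L = 256:
# M = 3*L*k check nodes are folded onto 3*k CNUs and N = L*k*k variable
# nodes onto k*k VNUs, so both folding factors equal L.
k, L = 6, 256
M, N = 3 * L * k, L * k * k        # 4608 check nodes, 9216 variable nodes
cnu_folding = M // (3 * k)         # = 256 = L
vnu_folding = N // (k * k)         # = 256 = L
```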

Figure 2: Joint design flow diagram.

3. JOINT CODE AND DECODER DESIGN

In this section, we briefly describe the joint (3,k)-regular LDPC code and decoder design methodology [14]. It is well known that the BP (or Log-BP) decoding algorithm works well if the underlying Tanner graph is 4-cycle free and does not contain too many short cycles. Thus the motivation of this joint design approach is to construct an LDPC code that not only fits a high-speed partly parallel decoder but also has an average cycle length as large as possible in its 4-cycle-free Tanner graph. This joint design process is outlined as follows, and the corresponding schematic flow diagram is shown in Figure 2.

(1) Explicitly construct two matrices H_1 and H_2 in such a way that H~ = [H_1^T, H_2^T]^T defines a (2,k)-regular LDPC code C_2 whose Tanner graph has a girth(1) of 12.
(2) Develop a partly parallel decoder that is configured by a set of constrained random parameters and defines a (3,k)-regular LDPC code ensemble, in which each code is a subcode of C_2 and has the parity-check matrix H = [H~^T, H_3^T]^T.
(3) Select a good (3,k)-regular LDPC code from the code ensemble based on the criteria of large Tanner graph average cycle length and computer simulations. Typically the parity-check matrix of the selected code has only a few redundant checks, so we may assume that the code rate is always 1 - 3/k.

(1) Girth is the length of a shortest cycle in a graph.

Figure 3: Structure of H~ = [H_1^T, H_2^T]^T: both H_1 and H_2 are k x k arrays of L x L blocks, with identity blocks I_{x,y} in H_1 and cyclically shifted identity blocks P_{x,y} in H_2; the overall width is N = L * k^2.

Construction of H~ = [H_1^T, H_2^T]^T

The structure of H~ is shown in Figure 3, where both H_1 and H_2 are L*k by L*k^2 submatrices. Each block matrix I_{x,y} in H_1 is an L x L identity matrix, and each block matrix P_{x,y} in H_2 is obtained by a cyclic shift of an L x L identity matrix. Let T denote the right cyclic shift operator, where T^u(Q) represents right cyclic shifting of matrix Q by u columns; then P_{x,y} = T^u(I), where u = ((x - 1) * y) mod L and I represents the L x L identity matrix. For example, if L = 5, x = 3, and y = 4, we have u = ((x - 1) * y) mod L = 8 mod 5 = 3, and then

    P_{3,4} = T^3(I) =
    [0 0 0 1 0]
    [0 0 0 0 1]
    [1 0 0 0 0]    (6)
    [0 1 0 0 0]
    [0 0 1 0 0]

Notice that in both H_1 and H_2, each row contains k 1's and each column contains a single 1. Thus, the matrix H~ = [H_1^T, H_2^T]^T defines a (2,k)-regular LDPC code C_2 with L*k^2 variable nodes and 2L*k check nodes. Let G denote the Tanner graph of C_2; we have the following theorem regarding the girth of G.

Theorem 1. If L cannot be factored as L = a * b, where a, b in {0, ..., k - 1}, then the girth of G is 12 and there is at least one 12-cycle passing each check node.
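A small NumPy sketch of this deterministic construction is given below (our own code; the block-row ordering of H_2 follows the (x, y) indexing rather than the exact row layout drawn in Figure 3, which only permutes the rows of H~):

```python
import numpy as np

def build_H_tilde(L, k):
    """Construct H~ = [H1^T, H2^T]^T: H1 places the L x L identity I at
    block row x of column group (x, y); H2 places P_{x,y} = T^u(I) with
    u = ((x - 1) * y) mod L at block row y of the same column group."""
    I = np.eye(L, dtype=int)
    T = lambda u: np.roll(I, u, axis=1)          # right cyclic shift by u
    H1 = np.zeros((k * L, k * k * L), dtype=int)
    H2 = np.zeros((k * L, k * k * L), dtype=int)
    for x in range(1, k + 1):
        for y in range(1, k + 1):
            col = ((x - 1) * k + (y - 1)) * L    # column group (x, y)
            H1[(x - 1) * L:x * L, col:col + L] = I
            H2[(y - 1) * L:y * L, col:col + L] = T(((x - 1) * y) % L)
    return np.vstack([H1, H2])

# Example (6): with L = 5, x = 3, y = 4, the shift is (2 * 4) % 5 = 3.
assert (np.roll(np.eye(5, dtype=int), 3, axis=1)[0] == [0, 0, 0, 1, 0]).all()
```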
Partly parallel decoder

Based on the specific structure of H~, a principal (3,k)-regular LDPC code partly parallel decoder structure was presented in [14]. This decoder is configured by a set of constrained random parameters and defines a (3,k)-regular LDPC code ensemble. Each code in this ensemble is essentially constructed by inserting extra L*k check nodes into the high-girth (2,k)-regular LDPC code C_2 under the constraint specified by the decoder. Therefore, it is reasonable to expect that the codes in this ensemble are unlikely to contain too many short cycles, and we may easily select a good code from it. For real applications, we can select a good code from this code ensemble as follows: first, in the code ensemble, find several codes with relatively high average cycle lengths; then select the one leading to the best result in the computer simulations.

The principal partly parallel decoder structure presented in [14] has the following properties:

(i) It contains k^2 memory banks, each of which consists of several RAMs to store all the decoding messages associated with L variable nodes.
(ii) Each memory bank associates with one address generator that is configured by one element in a constrained random integer set R.
(iii) It contains a configurable random-like one-dimensional shuffle network S with the routing complexity scaled by k^2.
(iv) It contains k^2 VNUs and k CNUs so that the VNU and CNU folding factors are L*k^2/k^2 = L and 3L*k/k = 3L, respectively.
(v) Each iteration completes in 3L clock cycles, in which only CNUs work in the first 2L clock cycles and both CNUs and VNUs work in the last L clock cycles.

Over all the possible R and S, this decoder defines a (3,k)-regular LDPC code ensemble in which each code has the parity-check matrix H = [H~^T, H_3^T]^T, where the submatrix H_3 is jointly specified by R and S.

4. PARTLY PARALLEL DECODER ARCHITECTURE

In this paper, applying the joint code and decoder design methodology, we develop a high-speed (3,k)-regular LDPC code partly parallel decoder architecture, based on which a 9216-bit, rate-1/2 (3,6)-regular LDPC code partly parallel decoder has been implemented using a Xilinx Virtex FPGA device. Compared with the structure presented in [14], this partly parallel decoder architecture has the following distinct characteristics:

(i) It employs a novel concatenated configurable random two-dimensional shuffle network implementation scheme to realize the random-like connectivity with low routing overhead, which is especially desirable for FPGA implementations.
(ii) To improve the decoding throughput, both the VNU folding factor and the CNU folding factor are L, instead of L and 3L in the structure presented in [14].
(iii) To simplify the control logic design and reduce the memory bandwidth requirement, this decoder completes each decoding iteration in 2L clock cycles, in which CNUs and VNUs work in the first and second L clock cycles, alternately.

Following the joint design methodology, we have that this decoder should define a (3,k)-regular LDPC code ensemble in which each code has L*k^2 variable nodes and 3L*k check nodes and, as illustrated in Figure 4, the parity-check matrix of each code has the form H = [H_1^T, H_2^T, H_3^T]^T, where H_1 and H_2 have the explicit structures shown in Figure 3 and the random-like H_3 is specified by certain configuration parameters of the decoder.

Figure 4: The parity-check matrix H = [H_1^T, H_2^T, H_3^T]^T: the L*k^2 columns pass through the identity blocks of H_1 and the shifted-identity blocks of H_2, while H_3 is random-like; the L consecutive columns through block (x, y) form the submatrix H^(x,y), with columns h_1^(x,y) (leftmost) to h_L^(x,y) (rightmost).

To facilitate the description of the decoder architecture, we introduce some definitions as follows: we denote the submatrix consisting of the L consecutive columns in H that go through the block matrix I_{x,y} as H^(x,y), in which, from left to right, each column is labeled as h_i^(x,y) with i increasing from 1 to L, as shown in Figure 4. We label the variable node corresponding to column h_i^(x,y) as v_i^(x,y), and the L variable nodes v_i^(x,y) for i = 1, ..., L constitute a variable node group VG_{x,y}. Finally, we arrange the L*k check nodes corresponding to all the L*k rows of submatrix H_i into check node group CG_i.

Figure 5 shows the principal structure of this partly parallel decoder. It mainly contains k^2 PE blocks PE_{x,y}, for 1 <= x, y <= k, three bidirectional shuffle networks pi_1, pi_2, and pi_3, and 3*k CNUs. Each PE_{x,y} contains one memory bank RAMs_{x,y} that stores all the decoding messages, including the intrinsic and extrinsic messages and hard decisions, associated with all the L variable nodes in the variable node group VG_{x,y}, and contains one VNU to perform the variable node computations for these L variable nodes. Each bidirectional shuffle network pi_i realizes the extrinsic message exchange between all the L*k^2 variable nodes and the L*k check nodes in CG_i. The k CNU_{i,j}, for j = 1, ..., k, perform the check node computations for all the L*k check nodes in CG_i.

This decoder completes each decoding iteration in 2L clock cycles; during the first and second L clock cycles, it works in check node processing mode and variable node processing mode, respectively. In the check node processing mode, the decoder not only performs the computations of all the check nodes but also completes the extrinsic message exchange between neighboring nodes. In variable node processing mode, the decoder only performs the computations of all the variable nodes.

The intrinsic and extrinsic messages are all quantized to five bits, and the iterative decoding datapaths of this partly parallel decoder are illustrated in Figure 6, in which the datapaths in check node processing and variable node processing are represented by solid lines and dash-dot lines, respectively. As shown in Figure 6, each PE block PE_{x,y} contains five RAM blocks: EXT RAM i for i = 1, 2, 3, INT RAM, and DEC RAM. Each EXT RAM i has L memory locations, and the location with the address d - 1 (1 <= d <= L) contains the extrinsic messages exchanged between the variable node v_d^(x,y) in VG_{x,y} and its neighboring check node in CG_i. The INT RAM and DEC RAM store the intrinsic message and hard decision associated with node v_d^(x,y) at the memory location with the address d - 1 (1 <= d <= L). As we will see later, such a decoding message storage strategy greatly simplifies the control logic for generating the memory access addresses. For the purpose of simplicity, in Figure 6 we do not show the datapath from INT RAM to the EXT RAM i's for extrinsic message initialization, which can be easily realized in L clock cycles before the decoder enters the iterative decoding process.

4.1. Check node processing

During the check node processing, the decoder performs the computations of all the check nodes and realizes the extrinsic message exchange between all the neighboring nodes. At the beginning of check node processing, in each PE block the memory location with address d - 1 in EXT RAM i contains the 6-bit hybrid data that consist of the 1-bit hard decision and the 5-bit variable-to-check extrinsic message associated with the variable node v_d^(x,y) in VG_{x,y}. In each clock cycle, this decoder performs the read-shuffle-modify-unshuffle-write operations to convert one variable-to-check extrinsic message in each EXT RAM i to its check-to-variable counterpart. As illustrated in Figure 6, we may outline the datapath loop in check node processing as follows (a behavioral sketch in code follows this description):

(1) read: one 6-bit hybrid data h_{x,y}^(i) is read from each EXT RAM i in each PE_{x,y};
(2) shuffle: each hybrid data h_{x,y}^(i) goes through the shuffle network pi_i and arrives at CNU_{i,j};
(3) modify: each CNU_{i,j} performs the parity check on the 6 input hard-decision bits and generates the 6 output 5-bit check-to-variable extrinsic messages beta_{x,y}^(i) based on the 6 input 5-bit variable-to-check extrinsic messages;
(4) unshuffle: send each check-to-variable extrinsic message beta_{x,y}^(i) back to the PE block via the same path as its variable-to-check counterpart;
(5) write: write each beta_{x,y}^(i) to the same memory location in EXT RAM i as its variable-to-check counterpart.

All the CNUs deliver the parity-check results to a central control block that, at the end of check node processing, determines whether all the parity-check equations specified by the parity-check matrix have been satisfied; if yes, the decoding of the current code frame terminates.
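The following Python fragment is a behavioral sketch of one such clock cycle for a single network pi_i (the container shapes and function names are hypothetical; the actual design is pipelined RTL with the forward and backward routes on distinct wires, as described next):

```python
def check_node_cycle(ext_ram_i, addr, shuffle, unshuffle, cnus):
    """One read-shuffle-modify-unshuffle-write cycle for one network pi_i.
    ext_ram_i[(x, y)] is the L-entry EXT RAM i of PE block (x, y);
    shuffle/unshuffle route the k*k words between PE blocks and k CNUs."""
    hybrid = {pe: ram[addr] for pe, ram in ext_ram_i.items()}       # read
    per_cnu = shuffle(hybrid)              # {j: k hybrid words} for CNU_{i,j}
    beta = {j: cnus[j](words) for j, words in per_cnu.items()}      # modify
    for pe, msg in unshuffle(beta).items():     # back along the same route
        ext_ram_i[pe][addr] = msg                                   # write
```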
To achieve higher decoding throughput, we implement the read-shuffle-modify-unshuffle-write loop operation with five-stage pipelining as shown in Figure 7, where each CNU is one-stage pipelined. To make this pipelining scheme feasible, we realize each bidirectional I/O connection in the three shuffle networks by two distinct sets of wires with opposite directions, which means that the hybrid data from PE blocks to CNUs and the check-to-variable extrinsic messages from CNUs to PE blocks are carried on distinct sets of wires. Compared with sharing one set of wires in time-multiplexed fashion, this approach has higher wire routing overhead but obviates the logic gate overhead due to the realization of time-multiplexing and, more importantly, makes it feasible to directly pipeline the datapath loop for higher decoding throughput.

Figure 5: The principal (3,k)-regular LDPC code partly parallel decoder structure: k^2 PE blocks PE_{x,y} (each containing a memory bank RAMs_{x,y} and a VNU, active during variable node processing) connect through the bidirectional shuffle networks pi_1, pi_2 (regular and fixed), and pi_3 (random-like and configurable) to the 3*k CNU_{i,j} (active during check node processing).

Figure 6: Iterative decoding datapaths within each PE block PE_{x,y} (INT RAM, EXT RAM 1-3, DEC RAM, and the VNU) and through the shuffle networks pi_1, pi_2 (regular and fixed), and pi_3 (random-like and configurable): 6-bit hybrid data {h_{x,y}^(i)} flow to the CNU_{i,j}'s and 5-bit check-to-variable messages {beta_{x,y}^(i)} flow back.

Figure 7: Five-stage pipelining of the check node processing datapath.

In this decoder, one address generator AG_{x,y}^(i) associates with one EXT RAM i in each PE_{x,y}. In the check node processing, AG_{x,y}^(i) generates the address for reading the hybrid data and, due to the five-stage pipelining of the datapath loop, the address for writing back the check-to-variable message is obtained via delaying the read address by five clock cycles. It is clear that the connectivity among all the variable nodes and check nodes, or the entire parity-check matrix, realized by this decoder is jointly specified by all the address generators and the three shuffle networks. Moreover, for i = 1, 2, 3, the connectivity among all the variable nodes and the check nodes in CG_i is completely determined by AG_{x,y}^(i) and pi_i. Following the joint design methodology, we implement all the address generators and the three shuffle networks as follows.

4.1.1. Implementations of AG_{x,y}^(1) and pi_1

The bidirectional shuffle network pi_1 and the AG_{x,y}^(1) realize the connectivity among all the variable nodes and all the check nodes in CG_1 as specified by the fixed submatrix H_1. Recall that node v_d^(x,y) corresponds to the column h_d^(x,y) as illustrated in Figure 4 and that the extrinsic messages associated with node v_d^(x,y) are always stored at address d - 1. Exploiting the explicit structure of H_1, we easily obtain the implementation schemes for AG_{x,y}^(1) and pi_1 as follows:

(i) each AG_{x,y}^(1) is realized as a ceil(log2 L)-bit binary counter that is cleared to zero at the beginning of check node processing;
(ii) the bidirectional shuffle network pi_1 connects the k PE_{x,y} with the same x-index to the same CNU.

Figure 8: Forward path of pi_3 (Stage I: intrarow shuffle, each Psi_x^(r) selecting the fixed random permutation R_x or the identity Id according to the control bits stored in ROM R; Stage II: intracolumn shuffle, each Psi_y^(c) selecting C_y or Id according to ROM C).

4.1.2. Implementations of AG_{x,y}^(2) and pi_2

The bidirectional shuffle network pi_2 and the AG_{x,y}^(2) realize the connectivity among all the variable nodes and all the check nodes in CG_2 as specified by the fixed matrix H_2. Similarly, exploiting the extrinsic message storage strategy and the explicit structure of H_2, we implement AG_{x,y}^(2) and pi_2 as follows:

(i) each AG_{x,y}^(2) is realized as a ceil(log2 L)-bit binary counter that only counts up to the value L - 1 and is loaded with the value of ((x - 1) * y) mod L at the beginning of check node processing;
(ii) the bidirectional shuffle network pi_2 connects the k PE_{x,y} with the same y-index to the same CNU.

Notice that the counter load value for each AG_{x,y}^(2) comes directly from the construction of each block matrix P_{x,y} in H_2 as described in Section 3.

4.1.3. Implementations of AG_{x,y}^(3) and pi_3

The bidirectional shuffle network pi_3 and the AG_{x,y}^(3) jointly define the connectivity among all the variable nodes and all the check nodes in CG_3, which is represented by H_3 as illustrated in Figure 4. In the above, we showed that by exploiting the specific structures of H_1 and H_2 and the extrinsic message storage strategy, we can directly obtain the implementations of each AG_{x,y}^(i) and pi_i, for i = 1, 2. However, the implementations of AG_{x,y}^(3) and pi_3 are not easy because of the following requirements on H_3:

(1) the Tanner graph corresponding to the parity-check matrix H = [H_1^T, H_2^T, H_3^T]^T should be 4-cycle free;
(2) to make H random to some extent, H_3 should be random-like.

As proposed in [14], to simplify the design process, we separately conceive AG_{x,y}^(3) and pi_3 in such a way that the implementations of AG_{x,y}^(3) and pi_3 accomplish the above first and second requirements, respectively.

Implementations of AG_{x,y}^(3)

We implement each AG_{x,y}^(3) as a ceil(log2 L)-bit binary counter that counts up to the value L - 1 and is initialized with a constant value t_{x,y} at the beginning of check node processing. Each t_{x,y} is selected at random under the following two constraints:

(1) given x, t_{x,y1} != t_{x,y2}, for all y1 != y2 in {1, ..., k};
(2) given y, t_{x1,y} - t_{x2,y} is not congruent to ((x1 - x2) * y) mod L, for all x1 != x2 in {1, ..., k}.

It can be proved that the above two constraints on t_{x,y} are sufficient to make the entire parity-check matrix H always correspond to a 4-cycle-free Tanner graph no matter how we implement pi_3.
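A rejection-sampling sketch of this parameter selection is shown below (our own code with 0-based indices, so the paper's y becomes y + 1 in constraint (2); the sampling strategy is ours, not from the paper):

```python
import random

def make_t(L, k, rng=random.Random(1)):
    """Draw the AG^(3) initialization constants t[x][y] until both
    4-cycle-free constraints of Section 4.1.3 hold."""
    while True:
        t = [[rng.randrange(L) for _ in range(k)] for _ in range(k)]
        ok1 = all(t[x][y1] != t[x][y2]                       # constraint (1)
                  for x in range(k)
                  for y1 in range(k) for y2 in range(y1 + 1, k))
        ok2 = all((t[x1][y] - t[x2][y]) % L != ((x1 - x2) * (y + 1)) % L
                  for y in range(k)                          # constraint (2)
                  for x1 in range(k) for x2 in range(k) if x1 != x2)
        if ok1 and ok2:
            return t   # AG^(3)_{x,y} then outputs (t[x][y] + c) mod L at clock c
```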

Implementation of pi_3

Since each AG_{x,y}^(3) is realized as a counter, the pattern of shuffle network pi_3 cannot be fixed; otherwise the shuffle pattern of pi_3 would be regularly repeated in H_3, which means that H_3 would always contain very regular connectivity patterns no matter how random-like the pattern of pi_3 itself is. Thus we should make pi_3 configurable to some extent. In this paper, we propose the following concatenated configurable random shuffle network implementation scheme for pi_3.

Figure 8 shows the forward path (from PE_{x,y} to CNU_{3,j}) of the bidirectional shuffle network pi_3. In each clock cycle, it realizes the data shuffle from a_{x,y} to c_{x,y} by two concatenated stages: intrarow shuffle and intracolumn shuffle. Firstly, the a_{x,y} data block, where each a_{x,y} comes from PE_{x,y}, passes an intrarow shuffle network array in which each shuffle network Psi_x^(r) shuffles the k input data a_{x,y} to b_{x,y} for 1 <= y <= k. Each Psi_x^(r) is configured by a 1-bit control signal s_x^(r), leading to the fixed random permutation R_x if s_x^(r) = 1, or to the identity permutation (Id) otherwise. The reason why we use the Id pattern instead of another random shuffle pattern is to minimize the routing overhead, and our simulations suggest that there is no gain in error-correcting performance from using another random shuffle pattern instead of the Id pattern. The k-bit configuration word s^(r) changes every clock cycle, and all the L k-bit control words are stored in ROM R. Next, the b_{x,y} data block goes through an intracolumn shuffle network array in which each Psi_y^(c) shuffles the k data b_{x,y} to c_{x,y} for 1 <= x <= k. Similarly, each Psi_y^(c) is configured by a 1-bit control signal s_y^(c), leading to the fixed random permutation C_y if s_y^(c) = 1, or to Id otherwise. The k-bit configuration word s^(c) changes every clock cycle, and all the L k-bit control words are stored in ROM C. As the output of the forward path, the k data c_{x,y} with the same x-index are delivered to the same CNU_{3,j}. To realize the bidirectional shuffle, we only need to implement each configurable shuffle network Psi_x^(r) and Psi_y^(c) as bidirectional so that pi_3 can unshuffle the k^2 data backward from CNU_{3,j} to PE_{x,y} along the same route as the forward path on distinct sets of wires. Notice that, due to the pipelining of the datapath loop, the backward path control signals are obtained via delaying the forward path control signals by three clock cycles.

To make the connectivity realized by pi_3 random-like and changing each clock cycle, we only need to randomly generate the control words s_x^(r) and s_y^(c) for each clock cycle and the fixed shuffle patterns of each R_x and C_y. Since most modern FPGA devices have multiple metal layers, the implementations of the two shuffle arrays can be overlapped from the bird's-eye view. Therefore, the above concatenated implementation scheme confines all the routing wires to a small area (in one row or one column), which significantly reduces the possibility of routing congestion and reduces the routing overhead.
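A compact Python model of one forward-path clock cycle of pi_3 is given below (names are ours; R[x] and C[y] stand for the fixed random permutations, and s_r, s_c for the k-bit ROM words of the current clock cycle):

```python
def pi3_forward(a, R, C, s_r, s_c):
    """Two-stage concatenated shuffle: a[x][y] is the message leaving
    PE_{x+1,y+1}; row x is permuted by R[x] when s_r[x] = 1 (else passed
    through the identity), then column y is permuted by C[y] when
    s_c[y] = 1."""
    k = len(a)
    ident = list(range(k))
    # Stage I: intrarow shuffle a -> b
    b = [[a[x][(R[x] if s_r[x] else ident)[y]] for y in range(k)]
         for x in range(k)]
    # Stage II: intracolumn shuffle b -> c
    c = [[b[(C[y] if s_c[y] else ident)[x]][y] for y in range(k)]
         for x in range(k)]
    return c   # row x of c feeds CNU_{3,x+1}
```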
4.2. Variable node processing

Compared with the above check node processing, the operations performed in the variable node processing are quite simple, since the decoder only needs to carry out all the variable node computations. Notice that at the beginning of variable node processing, the three 5-bit check-to-variable extrinsic messages associated with each variable node v_d^(x,y) are stored at the address d - 1 of the three EXT RAM i in PE_{x,y}. The 5-bit intrinsic message associated with variable node v_d^(x,y) is also stored at the address d - 1 of INT RAM in PE_{x,y}. In each clock cycle, this decoder performs the read-modify-write operations to convert the three check-to-variable extrinsic messages associated with the same variable node to three hybrid data consisting of variable-to-check extrinsic messages and hard decisions. As shown in Figure 6, we may outline the datapath loop in variable node processing as follows:

(1) read: in each PE_{x,y}, three 5-bit check-to-variable extrinsic messages beta_{x,y}^(i) and one 5-bit intrinsic message gamma_{x,y} associated with the same variable node are read from the three EXT RAM i and INT RAM at the same address;
(2) modify: based on the input check-to-variable extrinsic messages and intrinsic message, each VNU generates the 1-bit hard decision x_{x,y} and three 6-bit hybrid data h_{x,y}^(i);
(3) write: each h_{x,y}^(i) is written back to the same memory location as its check-to-variable counterpart, and x_{x,y} is written to DEC RAM.

The forward path from memory to VNU and the backward path from VNU to memory are implemented by distinct sets of wires, and the entire read-modify-write datapath loop is pipelined by three-stage pipelining as illustrated in Figure 9. Since all the extrinsic and intrinsic messages associated with the same variable node are stored at the same address in different RAM blocks, we can use only one binary counter to generate all the read addresses. Due to the pipelining of the datapath, the write address is obtained via delaying the read address by three clock cycles.

Figure 9: Three-stage pipelining of the variable node processing datapath.

4.3. CNU and VNU architectures

Each CNU carries out the operations of one check node, including the parity check and the computation of check-to-variable extrinsic messages. Figure 10 shows the CNU architecture for a check node with degree 6. Each input x^(i) is a 6-bit hybrid data consisting of a 1-bit hard decision and a 5-bit variable-to-check extrinsic message. The parity check is performed by XORing all six 1-bit hard decisions. Each 5-bit variable-to-check extrinsic message is represented in sign-magnitude format with a sign bit and four magnitude bits. The architecture for computing the check-to-variable extrinsic messages is directly obtained from (3). The function f(x) = log[(1 + e^{-|x|})/(1 - e^{-|x|})] is realized by a LUT (lookup table) that is implemented as a combinational logic block in the FPGA. Each output 5-bit check-to-variable extrinsic message y^(i) is also represented in sign-magnitude format.


Figure 10: Architecture for CNU with k = 6.
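A Python sketch of this CNU behavior is given below (our own model: the LUT quantization step is an assumption, since the paper does not list the LUT contents):

```python
import math

def f_lut(mag, step=0.25, mag_bits=4):
    """The involution f(x) = log((1 + e^-x)/(1 - e^-x)) on 4-bit magnitudes.
    The quantization step is an assumption; the paper implements the LUT
    as combinational logic without listing its contents."""
    x = max(mag, 1) * step                       # avoid f(0) = infinity
    y = math.log((1 + math.exp(-x)) / (1 - math.exp(-x)))
    return min((1 << mag_bits) - 1, int(round(y / step)))

def cnu6(inputs):
    """Degree-6 CNU sketch: inputs are six (hard_bit, sign, magnitude)
    triples unpacked from the 6-bit hybrid words; returns the parity-check
    bit and six (sign, magnitude) check-to-variable messages per (3)."""
    parity = 0
    for hard, _, _ in inputs:
        parity ^= hard                           # XOR of the hard decisions
    outputs = []
    for i in range(6):
        others = inputs[:i] + inputs[i + 1:]
        sign = 1
        for _, s, _ in others:                   # product of the other signs
            sign *= s
        outputs.append((sign, f_lut(sum(m for _, _, m in others))))
    return parity, outputs
```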


Figure 11: Architecture for VNU with j = 3 (S-to-T: sign-magnitude to two's complement; T-to-S: two's complement to sign-magnitude).

Each VNU generates the hard decision and all the variable-to-check extrinsic messages associated with one variable node. Figure 11 shows the VNU architecture for a variable node with degree 3. With the input 5-bit intrinsic message z and the three 5-bit check-to-variable extrinsic messages y^(i) associated with the same variable node, the VNU generates three 5-bit variable-to-check extrinsic messages and the 1-bit hard decision according to (4) and (5), respectively. To enable each CNU to receive the hard decisions to perform the parity check as described above, the hard decision is combined with each 5-bit variable-to-check extrinsic message x^(i) to form the 6-bit hybrid data as shown in Figure 11.

Since each input check-to-variable extrinsic message y^(i) is represented in sign-magnitude format, we need to convert it to two's complement format before performing the additions. Before going through the LUT that realizes f(x) = log[(1 + e^{-|x|})/(1 - e^{-|x|})], each data word is converted back to the sign-magnitude format.
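A matching Python sketch of the degree-3 VNU is shown below (again our own model; f_lut is the LUT sketched for the CNU above, and the sign-magnitude/two's-complement conversions are folded into plain signed integers):

```python
def vnu3(z, y, f_lut):
    """Degree-3 VNU sketch per (4) and (5): z and y[0..2] are the signed
    values recovered by the S-to-T converters from the 5-bit inputs."""
    total = z + sum(y)                       # pseudoposterior LLR, (5)
    hard = 0 if total > 0 else 1             # decision rule of Algorithm 1
    outputs = []
    for i in range(3):
        g = total - y[i]                     # gamma + sum over the other checks
        sign = 1 if g >= 0 else -1
        outputs.append((hard, sign, f_lut(abs(g))))   # 6-bit hybrid word
    return hard, outputs
```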

4.4. Data input/output

This partly parallel decoder works simultaneously on three consecutive code frames in two-stage pipelining mode: while one frame is being iteratively decoded, the next frame is loaded into the decoder, and the hard decisions of the previous frame are read out from the decoder. Thus each INT RAM contains two RAM blocks to store the intrinsic messages of both the current and the next frame. Similarly, each DEC RAM contains two RAM blocks to store the hard decisions of both the current and the previous frame.

The design scheme for intrinsic message input and hard decision output is heavily dependent on the floor planning of the k^2 PE blocks. To minimize the routing overhead, we develop a square-shaped floor planning for the PE blocks as illustrated in Figure 12, and the corresponding data input/output scheme is described in the following.

(1) Intrinsic data input. The intrinsic messages of the next frame are loaded, one symbol per clock cycle. As shown in Figure 12, the memory location of each input intrinsic datum is determined by the input load address that has a width of ceil(log2 L) + ceil(log2 k^2) bits, in which ceil(log2 k^2) bits specify which PE block (or which INT RAM) is being accessed and the other ceil(log2 L) bits locate the memory location in the selected INT RAM. As shown in Figure 12, the primary intrinsic data and load address inputs directly connect to the k PE blocks PE_{1,y} for 1 <= y <= k, and from each PE_{x,y} the intrinsic data and load address are delivered to the adjacent PE block PE_{x+1,y} in pipelined fashion.

(2) Decoded data output. The decoded data (or hard decisions) of the previous frame are read out in pipelined fashion. As shown in Figure 12, the primary ceil(log2 L)-bit read address input directly connects to the k PE blocks PE_{x,1} for 1 <= x <= k, and from each PE_{x,y} the read address is delivered to the adjacent block PE_{x,y+1} in pipelined fashion. Based on its input read address, each PE block outputs one hard-decision bit per clock cycle. Therefore, as illustrated in Figure 12, the width of the pipelined decoded data bus increases by 1 after going through each PE block, and at the rightmost side we obtain k k-bit decoded outputs that are combined together as the k^2-bit primary decoded data output.

Figure 12: Data input/output structure.

5. FPGA IMPLEMENTATION

Applying the above decoder architecture, we implemented a (3,6)-regular LDPC code partly parallel decoder for L = 256 using the Xilinx Virtex-E XCV2600E device with the package FG1156. The corresponding LDPC code length is N = L * k^2 = 256 * 6^2 = 9216 and the code rate is 1/2. We obtain the constrained random parameter set for implementing pi_3 and each AG_{x,y}^(3) as follows: first generate a large number of parameter sets, from which we find a few sets leading to relatively high Tanner graph average cycle length; then we select the one set leading to the best performance based on computer simulations.

The target XCV2600E FPGA device contains 184 large on-chip block RAMs, each one a fully synchronous dual-port 4K-bit RAM. In this decoder implementation, we configure each dual-port 4K-bit RAM as two independent single-port 256 x 8-bit RAM blocks so that each EXT RAM i can be realized by one single-port 256 x 8-bit RAM block. Since each INT RAM contains two RAM blocks for storing the intrinsic messages of both the current and the next code frame, we use two single-port 256 x 8-bit RAM blocks to implement one INT RAM. Due to the relatively small memory size requirement, the DEC RAM is realized by distributed RAM that provides shallow RAM structures implemented in CLBs. Since this decoder contains k^2 = 36 PE blocks, each one incorporating one INT RAM and three EXT RAM i's, we utilize in total 180 single-port 256 x 8-bit RAM blocks (or 90 dual-port 4K-bit RAM blocks), as the tally below confirms. We manually configured the placement of each PE block according to the floor-planning scheme shown in Figure 12. Notice that such a placement scheme exactly matches the structure of the configurable shuffle network pi_3 as described in Section 4.1.3; thus the routing overhead for implementing pi_3 is also minimized in this FPGA implementation.
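The block RAM budget works out as follows (a restatement of the numbers above in code form):

```python
# Block RAM budget on the XCV2600E: each of the 184 dual-port 4K-bit RAMs
# splits into two single-port 256 x 8 blocks; every PE block needs three
# EXT RAMs (one block each) and one INT RAM (two blocks, current + next frame).
pe_blocks = 6 * 6
single_port_blocks = pe_blocks * (3 + 2)     # = 180 single-port blocks
dual_port_rams = single_port_blocks // 2     # = 90 of the 184 on-chip RAMs
```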

Table 1: FPGA resource utilization statistics.

    Slices:          11,792  (46%)
    Slice Registers: 10,105  (19%)
    4-input LUTs:    15,933  (31%)
    Bonded IOBs:         68   (8%)
    Block RAMs:          90  (48%)
    DLLs:                 1  (12%)


Figure 13: The placed and routed decoder implementation (the 6 x 6 array of PE blocks PE_{1,1} through PE_{6,6}).

From the architecture description in Section 4, we know that, during each clock cycle in the iterative decoding, this decoder needs to perform both read and write operations on each single-port RAM block EXT RAM i. Therefore, supposing the primary clock frequency is W, we must generate a 2 x W clock signal as the RAM control signal to achieve the read-and-write operation in one clock cycle. This 2 x W clock signal is generated using the delay-locked loop (DLL) in the XCV2600E.

To facilitate the entire implementation process, we extensively utilized the highly optimized Xilinx IP cores to instantiate many function blocks, that is, all the RAM blocks, all the counters for generating addresses, and the ROMs used to store the control signals for shuffle network pi_3. Moreover, all the adders in CNUs and VNUs are implemented as ripple-carry adders, which are exactly suitable for Xilinx FPGA implementations thanks to the on-chip dedicated fast arithmetic carry chain.

This decoder was described in VHDL (hardware description language), and SYNOPSYS FPGA Express was used to synthesize the VHDL implementation. We used the Xilinx Development System tool suite to place and route the synthesized implementation for the target XCV2600E device with the speed option -7. Table 1 shows the hardware resource utilization statistics. Notice that 74% of the total utilized slices, or 8691 slices, were used for implementing all the CNUs and VNUs. Figure 13 shows the placed and routed design, in which the placement of all the PE blocks is constrained based on the on-chip RAM block locations.

Based on the results reported by the Xilinx static timing analysis tool, the maximum decoder clock frequency can be 56 MHz. If this decoder performs s decoding iterations for each code frame, the total clock cycle number for decoding one frame will be 2s * L + L, where the extra L clock cycles are due to the initialization process, and the maximum symbol decoding throughput will be 56 * k^2 * L / (2s * L + L) = 56 * 36 / (2s + 1) Mbps. Here, we set s = 18 and obtain the maximum symbol decoding throughput of 54 Mbps. Figure 14 shows the corresponding performance over the AWGN channel with s = 18, including the BER, FER (frame error rate), and the average iteration numbers.
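The throughput figure can be reproduced with a few lines of Python (the paper quotes the result as 54 Mbps):

```python
# Symbol throughput at f_clk = 56 MHz: a frame of L * k^2 = 9216 symbols
# takes (2*s + 1) * L clock cycles (the extra L is initialization).
f_clk, L, k, s = 56e6, 256, 6, 18
throughput = f_clk * (L * k ** 2) / ((2 * s + 1) * L)   # = 56*36/(2s+1) Mbps
print(round(throughput / 1e6, 1), "Mbps")               # -> 54.5, i.e., ~54 Mbps
```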
stopping criterion to effectively reduce the average energy Based on the results reported by the Xilinx static timing consumption. An FPGA Implementation of (3, 6)-Regular LDPC Code Decoder 541

Figure 14: Simulation results on BER, FER, and the average number of iterations versus Eb/N0 (dB).

REFERENCES

[1] R. G. Gallager, "Low-density parity-check codes," IRE Transactions on Information Theory, vol. IT-8, no. 1, pp. 21–28, 1962.
[2] R. G. Gallager, Low-Density Parity-Check Codes, MIT Press, Cambridge, Mass, USA, 1963.
[3] D. J. C. MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Transactions on Information Theory, vol. 45, no. 2, pp. 399–431, 1999.
[4] M. C. Davey and D. J. C. MacKay, "Low-density parity check codes over GF(q)," IEEE Communications Letters, vol. 2, no. 6, pp. 165–167, 1998.
[5] M. Luby, M. Mitzenmacher, M. Shokrollahi, and D. Spielman, "Improved low-density parity-check codes using irregular graphs and belief propagation," in Proc. IEEE International Symposium on Information Theory, p. 117, Cambridge, Mass, USA, August 1998.
[6] T. Richardson and R. Urbanke, "The capacity of low-density parity-check codes under message-passing decoding," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 599–618, 2001.
[7] T. Richardson, M. Shokrollahi, and R. Urbanke, "Design of capacity-approaching irregular low-density parity-check codes," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, 2001.
[8] S.-Y. Chung, T. Richardson, and R. Urbanke, "Analysis of sum-product decoding of low-density parity-check codes using a Gaussian approximation," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 657–670, 2001.
[9] M. Luby, M. Mitzenmacher, M. Shokrollahi, and D. A. Spielman, "Improved low-density parity-check codes using irregular graphs," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 585–598, 2001.
[10] S.-Y. Chung, G. D. Forney, T. Richardson, and R. Urbanke, "On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit," IEEE Communications Letters, vol. 5, no. 2, pp. 58–60, 2001.
[11] G. Miller and D. Burshtein, "Bounds on the maximum-likelihood decoding error probability of low-density parity-check codes," IEEE Transactions on Information Theory, vol. 47, no. 7, pp. 2696–2710, 2001.
[12] A. J. Blanksby and C. J. Howland, "A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder," IEEE Journal of Solid-State Circuits, vol. 37, no. 3, pp. 404–412, 2002.
[13] E. Boutillon, J. Castura, and F. R. Kschischang, "Decoder-first code design," in Proc. 2nd International Symposium on Turbo Codes and Related Topics, pp. 459–462, Brest, France, September 2000.
[14] T. Zhang and K. K. Parhi, "VLSI implementation-oriented (3,k)-regular low-density parity-check codes," in IEEE Workshop on Signal Processing Systems (SiPS), pp. 25–36, Antwerp, Belgium, September 2001.
[15] M. Chiani, A. Conti, and A. Ventura, "Evaluation of low-density parity-check codes over block fading channels," in Proc. IEEE International Conference on Communications, pp. 1183–1187, New Orleans, La, USA, June 2000.
[16] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, New York, USA, 1999.

Tong Zhang received his B.S. and M.S. degrees in electrical engineering from Xian Jiaotong University, Xian, China, in 1995 and 1998, respectively. He received the Ph.D. degree in electrical engineering from the University of Minnesota in 2002. Currently, he is an Assistant Professor in the Electrical, Computer, and Systems Engineering Department at Rensselaer Polytechnic Institute. His current research interests include the design of VLSI architectures and circuits for digital signal processing and communication systems, with emphasis on error-correcting coding and multimedia processing.

Keshab K. Parhi is a Distinguished McKnight University Professor in the Department of Electrical and Computer Engineering at the University of Minnesota, Minneapolis. He was a Visiting Professor at Delft University and Lund University, a Visiting Researcher at NEC Corporation, Japan (as a National Science Foundation Japan Fellow), and a Technical Director of DSP Systems at Broadcom Corp. Dr. Parhi's research interests have spanned the areas of VLSI architectures for digital signal and image processing, adaptive digital filters and equalizers, error control coders, cryptography architectures, high-level architecture transformations and synthesis, low-power digital systems, and computer arithmetic. He has published over 350 papers in these areas, authored the widely used textbook VLSI Digital Signal Processing Systems (Wiley, 1999), and coedited the reference book Digital Signal Processing for Multimedia Systems (Wiley, 1999). He has received numerous best paper awards, including the 2001 IEEE W. R. G. Baker Prize Paper Award. He is a Fellow of IEEE and the recipient of a Golden Jubilee medal from the IEEE Circuits and Systems Society in 1999. He is the recipient of the 2003 IEEE Kiyo Tomiyasu Technical Field Award.