
Article

Flexible 5G New Radio LDPC Encoder Optimized for High Hardware Usage Efficiency

Vladimir L. Petrović 1,*, Dragomir M. El Mezeni 1 and Andreja Radošević 2

1 School of Electrical Engineering, University of Belgrade, Bulevar kralja Aleksandra 73, 11120 Belgrade, Serbia; [email protected]
2 Tannera LLC, 401 Wilshire Blvd, 12th Floor, Santa Monica, CA 90401, USA; [email protected]
* Correspondence: [email protected]

Abstract: Quasi-cyclic low-density parity-check (QC–LDPC) codes are introduced as a physical channel coding solution for data channels in 5G new radio (5G NR). Depending on the use case scenario, this standard proposes the usage of a wide variety of codes, which imposes the need for high encoder flexibility. LDPC codes from 5G NR have a convenient structure and can be efficiently encoded using forward substitution and without computationally intensive multiplications with dense matrices. However, the state-of-the-art solutions for encoder hardware implementation can be inefficient since many hardware processing units stay idle during the encoding process. This paper proposes a novel partially parallel architecture that can provide high hardware usage efficiency (HUE) while achieving encoder flexibility and support for all 5G NR codes. The proposed architecture includes a flexible circular shifting network, which is capable of shifting a single large bit vector or multiple smaller bit vectors depending on the code. The encoder architecture was built around the shifter in a way that multiple parity check matrix elements can be processed in parallel for short codes, thus providing almost the same level of parallelism as for long codes. The processing schedule was optimized for minimal encoding time using the genetic algorithm. The optimized encoder provided high throughputs, low latency, and the best HUE reported to date.

Keywords: channel coding; 5G new radio; low-density parity-check (LDPC) codes; encoder architecture; hardware usage efficiency; circular shifter; genetic algorithm optimization

Citation: Petrović, V.L.; El Mezeni, D.M.; Radošević, A. Flexible 5G New Radio LDPC Encoder Optimized for High Hardware Usage Efficiency. Electronics 2021, 10, 1106. https://doi.org/10.3390/electronics10091106

Academic Editors: Byeong Yong Kong and Hoyoung Yoo

Received: 12 April 2021; Accepted: 6 May 2021; Published: 8 May 2021

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

The fifth-generation wireless technology standard for broadband cellular networks (5G NR) [1] introduces LDPC codes [2] for data channel coding. Due to their excellent error-correcting performance and the possibility of achieving high parallelization in decoding procedures, LDPC codes outperform turbo codes [3], their predecessors in LTE [4,5]. The main channel coding challenges posed by novel 5G-centric applications are multigigabit throughputs, low latency (in certain use cases), and high flexibility while keeping high error-correcting performance. In addition to that, the channel coding solution should support incremental redundancy hybrid automatic repeat request (IR–HARQ) procedures. LDPC codes can face all those challenges better than turbo codes. Most importantly, they have higher coding gain and better error-correcting performance in the error floor region [5].

LDPC codes are linear block codes, meaning that they are completely defined by their parity check matrices (PCM). One of the reasons for their increasing adoption in many communication standards is the computational efficiency of the decoding procedure, which is achieved by designing a sparse PCM. In addition, the PCM of practical LDPC codes is always structured in a way that can achieve high parallelization of both the encoder and the decoder. On the other hand, turbo code decoding is inherently serial. Although it can be parallelized to some extent, high degrees of parallelism are hardly achievable. Moreover, the throughput of turbo decoders is usually not dependent on the code rate, whereas


LDPC decoder throughput increases along with the code rate, since the PCM for a higher code rate is usually smaller than the PCM for a low code rate. This property enables achieving higher peak throughput at high spectral efficiencies and high values of signal-to-noise ratio (SNR) [5].

The most common structured LDPC family is known as quasi-cyclic (QC) LDPC codes [6–8]. Their PCM consists of circularly shifted identity submatrices or square zero submatrices. Such submatrices are frequently called circulant permutation matrices (CPM) [8,9]. The size of CPMs is referred to as the lifting size or lifting factor (Z) [5]. QC–LDPC codes are usually designed based on the so-called base graph (BG) matrix, which defines the macroscopic structure of a code. The final PCM is constructed by placing CPMs at appropriate positions defined by the BG matrix. Since CPMs are either circularly shifted identity matrices or zero matrices, the PCM can be stored as a matrix of integer numbers that represent the corresponding shift values or zero matrix identification. Such a description is frequently called a permutation matrix [7] or exponent matrix [9] and is very convenient for 5G NR since the number of supported codes is very high. The great diversity of code rates and codeword lengths additionally requires a unification of code descriptions; therefore, the representation using exponent matrices is essential [5,9].

Another important property of an LDPC code is its regularity. An LDPC code is regular if all columns in the PCM have the same number of nonzero entries, i.e., if they all have the same weights, and if all rows in the PCM have the same weights. However, the error-correcting performance of such LDPC codes is usually not as good as the performance of irregular LDPC codes [10]. Hardware that supports irregular LDPC codes is usually more challenging to design since irregular code structure can significantly affect HUE [11].
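The expansion of an exponent matrix into a full PCM can be sketched in a few lines; the exponent matrix below is a toy example with made-up shift values, not one of the standardized matrices, and the shift direction follows the `np.roll` convention chosen here:

```python
import numpy as np

def expand_exponent_matrix(P, Z):
    """Expand an exponent matrix P into a binary PCM.

    Entry -1 maps to a Z x Z zero submatrix; entry s >= 0 maps to the
    identity matrix circularly shifted by s positions (a CPM).
    """
    mb, nb = P.shape
    H = np.zeros((mb * Z, nb * Z), dtype=np.uint8)
    I = np.eye(Z, dtype=np.uint8)
    for j in range(mb):
        for l in range(nb):
            if P[j, l] >= 0:
                # place the circularly shifted identity submatrix
                H[j*Z:(j+1)*Z, l*Z:(l+1)*Z] = np.roll(I, P[j, l], axis=1)
    return H

# toy exponent matrix (shift values are illustrative only)
P = np.array([[0, 2, -1],
              [1, -1, 3]])
H = expand_exponent_matrix(P, Z=4)
```

Only the small integer matrix P needs to be stored; the full PCM is implicit, which is exactly what makes the exponent matrix representation attractive for hardware.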
Nevertheless, 5G NR introduces a large number of irregular LDPC codes with various codeword lengths and code rates. Therefore, hardware that supports 5G NR codes must provide a high level of flexibility [11].

Multiple encoding methods for LDPC codes have been developed [12–35]. A straightforward method is to use a generator matrix for encoding [12–16]. Such a method is computationally very expensive since, in general, the generator matrix is not sparse. Reduction of complexity can be obtained for systematic codes if the PCM is divided into two parts [17–19]. This way, the encoding can be performed by multiplication with two smaller matrices, of which the first one is sparse, but the second one is dense. An approximately linear complexity can be obtained by using the Richardson and Urbanke (RU) method [20]. This method is based on an offline preprocessing algorithm that permutes columns and rows of the PCM so that the permuted matrix has a specific structure that is convenient for encoding. If the proper restructuring is achieved, the encoding can be performed mainly by multiple sequential forward substitutions and one multiplication with a small matrix, which is why this is a widely used approach [21–27]. A hybrid method has been used in [28,29], which proposes the calculation of one part of the parity bits using the generator matrix and another part using the RU method. However, many practical codes can be encoded using only forward substitutions, due to their convenient code structure [30–35]. Of note, 5G NR codes belong to this class of codes.

This paper focuses on the hardware implementation of a flexible 5G NR encoder that supports all codes defined by the standard and achieves high throughput for a wide spectrum of codes while keeping hardware resources as low as possible. The encoding process is optimized for high hardware usage efficiency, expressed in terms of achieved throughput divided by used hardware resources.
Optimizing HUE can minimize the required hardware resources for an arbitrary target throughput. As long as the latency of the system is acceptable, it is always more efficient to employ multiple optimized encoders in parallel and achieve the required throughput than to use a highly parallel implementation that takes significantly more resources. In general, the encoding latency is usually not an issue since soft-decision LDPC decoding is usually the bottleneck of the entire physical layer [28]. In addition to that, resource-efficient hardware design is strongly linked to energy efficiency. Idle processing blocks consume leakage energy, which is an increasingly significant energy component in modern technology processes [36]. Therefore, optimizing HUE can lead to a more energy-efficient design.

In general, 5G NR LDPC codes are highly irregular and their base graph matrices are mostly sparse. This property drastically reduces the efficiency of highly parallel architectures since many hardware processing units frequently remain idle. Therefore, in this work, a partially parallel architecture is proposed in order to obtain high HUE. The main contributions of this paper are (1) high flexibility of the LDPC encoder that supports all codes from 5G NR, primarily obtained by the proper design of a circular shifting network, which is the key hardware processing unit in the system; (2) by exploiting partially parallel processing, the circular shifter and other hardware processing units are rarely idle, which highly increases the hardware usage efficiency; and (3) the encoder is capable of changing the encoding schedule at runtime so as to exploit high parallelism both for codes with high values of lifting size and for codes with smaller lifting sizes. Encoding schedules are optimized for minimum encoding time, which provides the minimum latency and the maximum throughput for the observed architecture. Such optimization significantly increases the HUE.

The rest of the paper is organized as follows. Section 2 presents an overview of commonly used methods for LDPC encoding. In Section 3, 5G NR LDPC encoding is discussed, first by introducing the structure of 5G NR codes and then by describing the efficient encoding algorithm. Section 4 introduces a critical comparison of various approaches for the implementation of the aforementioned algorithm. Afterward, an optimization method for the selected implementation strategy is presented. The proposed encoder architecture is described in Section 5, whereas results are shown in Section 6. Section 7 concludes the paper.

2. Encoding of LDPC Codes

Depending on the code structure and properties, various methods have been developed for encoding LDPC codes [12–35]. This section briefly introduces the most widely used methods that have been applied in LDPC encoders.

2.1. Straightforward Encoding with Generator Matrix Multiplication

As previously mentioned, LDPC codes are linear block codes. They are completely defined by their parity check matrices, which are designed to be sparse so as to obtain smaller decoding complexity. The parity check matrix H and the codeword vector x = [x_1 x_2 … x_n] must satisfy the following equation:

xH^T = 0. (1)

Most practical LDPC codes are systematic, meaning that their codeword can be represented as follows:

x = [i p], (2)

where the vector i = [i_1 i_2 … i_k] represents a sequence of information bits and the vector p = [p_1 p_2 … p_m] represents a sequence of parity bits. The encoding process maps the vector i to the vector x. However, in general, the code does not have to be systematic, in which case information and parity bits are not divided into separate groups. In any case, the straightforward encoding method is to multiply the information bit vector with the generator matrix G as

x = iG. (3)

The generator matrix can be obtained by solving Equation (4) using Gaussian elimination.

GH^T = 0 (4)

The obtained generator matrix is generally dense, which causes quadratic encoding complexity (O(n^2)) [20]. However, when the generator matrix is precalculated, the required matrix multiplication for specific codes can be efficiently achieved in hardware. Therefore, several previous encoders have been based on this method [12–16].
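As a toy illustration of (3), the following sketch encodes over GF(2) with a small, made-up systematic generator matrix (identity part followed by a dense parity part); it is not a matrix from any standard:

```python
import numpy as np

def encode_with_G(i_bits, G):
    """Straightforward encoding x = iG from (3), with arithmetic in GF(2)."""
    return (i_bits @ G) % 2

# Toy systematic generator matrix (values illustrative only)
G = np.array([[1, 0, 0, 1, 1],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1]], dtype=np.uint8)

i_bits = np.array([1, 0, 1], dtype=np.uint8)
x = encode_with_G(i_bits, G)
```

For a real code, G would be precalculated offline from (4); in hardware the dense multiplication maps to wide XOR trees, which is what makes this method expensive for long codes.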

2.2. Partitioned PCM Two-Step Encoding

In cases when the LDPC code is systematic, the PCM can be partitioned into two submatrices (H = [H_1 H_2]), where the first submatrix is of size m × k, and the second submatrix is of size m × m. By transposing both the left-hand side and the right-hand side of the equation, Equation (1) becomes

Hx^T = 0, (5)

which can further be written as

[H_1 H_2] [i p]^T = H_1 i^T + H_2 p^T = 0. (6)

Parity bits can now be calculated as follows:

p^T = H_2^{-1} H_1 i^T. (7)

This way, the calculation is simplified since the submatrix H_1 is sparse, and the vector H_1 i^T can be calculated with linear complexity (O(n)). However, the matrix H_2^{-1} is dense and therefore leads to significantly high hardware cost in terms of computational complexity and memory requirements. Nevertheless, when properly exploiting the matrix structure, it is possible to achieve significant hardware cost reduction for certain codes [17–19].
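A minimal sketch of the two-step encoding in (7), with toy matrices (not from any standard); a naive GF(2) Gaussian elimination stands in for the precomputed H_2^{-1}:

```python
import numpy as np

def gf2_solve(A, b):
    """Solve A x = b over GF(2) by Gaussian elimination (A square, invertible)."""
    A = A.copy() % 2
    b = b.copy() % 2
    n = A.shape[0]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r, col])
        A[[col, pivot]] = A[[pivot, col]]          # row swap to bring in a pivot
        b[[col, pivot]] = b[[pivot, col]]
        for r in range(n):
            if r != col and A[r, col]:
                A[r] ^= A[col]                     # eliminate the column bit
                b[r] ^= b[col]
    return b

# Toy partitioned PCM H = [H1 | H2] (values illustrative only)
H1 = np.array([[1, 0, 1],
               [0, 1, 1]], dtype=np.uint8)
H2 = np.array([[1, 1],
               [0, 1]], dtype=np.uint8)
i_bits = np.array([1, 1, 0], dtype=np.uint8)

s = (H1 @ i_bits) % 2            # sparse multiplication H1 i^T
p = gf2_solve(H2, s)             # parity bits from H2 p^T = H1 i^T
x = np.concatenate([i_bits, p])
H = np.hstack([H1, H2])
assert ((H @ x) % 2 == 0).all()  # check (1): x H^T = 0
```

In a real design H_2^{-1} would be precomputed offline, so only the dense multiplication remains at runtime, which is exactly the hardware cost discussed above.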

2.3. Richardson–Urbanke LDPC Encoding Method

Richardson and Urbanke [20] proposed an LDPC encoding method that provides an approximately linear computational complexity. The proposed method consists of two steps: (1) offline preprocessing and (2) encoding. During the preprocessing step, the PCM is transformed to the approximate lower triangular (ALT) form in such a way that codewords can be represented in a systematic form as follows:

x = [i p_1 p_2], (8)

where p_1 represents a vector of g parity bits, whereas p_2 represents a vector of the remaining (m − g) parity bits. After the preprocessing step, the PCM has the following form:

H = [A B T; C D E], (9)

where the submatrices A (of size (m − g) × (n − m)), B (of size (m − g) × g), C (of size g × (n − m)), D (of size g × g), and E (of size g × (m − g)) are sparse matrices, whereas the submatrix T (of size (m − g) × (m − g)) is a sparse lower triangular matrix. The preprocessing transformation is performed by permuting rows and columns inside the PCM. Appropriate permutations can be obtained by greedy algorithms described in [20], which provide g to be as small as possible. In addition, it is required that the matrix (−ET^{-1}B + D) stays nonsingular in the Galois field GF(2). If the PCM transformation is successfully completed, parity bits can be calculated using the following equations:

p_1^T = −(−ET^{-1}B + D)^{-1}(−ET^{-1}A + C)i^T, (10)

p_2^T = −T^{-1}(Ai^T + Bp_1^T). (11)

Multiplications with A, B, C, and E are multiplications with sparse matrices, whose computational complexity is linear (O(n)). Additions also have linear complexity. Since T is a lower triangular matrix, all multiplications with T^{-1} can be replaced with a forward substitution, given that T^{-1}a = b is equivalent to the system a = Tb. Since T is also sparse, the performed forward substitution is a process of linear complexity. The only quadratic complexity calculation is the multiplication by the dense matrix (−ET^{-1}B + D)^{-1} (O(g^2)), which is why the transformation algorithm searches for permutations that will provide small g, preferably g ∼ √n. Such approximately linear encoding complexity is the main reason why many LDPC encoders are essentially based on the RU method [20–27]. The reported weaknesses of this method are the necessity for multiple consequent operations, which generally leads to long critical paths in hardware implementations [28], and low potential for encoding flexibility [14].
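The forward substitution that replaces multiplication by T^{-1} can be sketched as follows; the lower-triangular T below is a toy example (over GF(2) an invertible triangular matrix necessarily has a unit diagonal, which the sweep assumes):

```python
import numpy as np

def forward_substitution_gf2(T, a):
    """Solve T b = a over GF(2) for lower-triangular T with a unit diagonal.

    This replaces the multiplication b = T^{-1} a in the RU method with a
    sweep whose cost is linear in the number of nonzeros of T.
    """
    n = len(a)
    b = np.zeros(n, dtype=np.uint8)
    for row in range(n):
        # XOR the already-computed bits selected by this row of T with a[row]
        b[row] = (int(a[row]) + int(T[row, :row] @ b[:row])) % 2
    return b

# Toy sparse lower-triangular T (values illustrative only)
T = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1]], dtype=np.uint8)
a = np.array([1, 0, 1], dtype=np.uint8)
b = forward_substitution_gf2(T, a)
assert ((T @ b) % 2 == a).all()
```

The sequential dependence visible in the loop (each bit needs the previous ones) is precisely the source of the long critical paths reported for hardware RU implementations.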

2.4. Straightforward and RU Hybrid Encoding

In order to avoid the multiplication with the dense matrix (−ET^{-1}B + D)^{-1} from (10), Cohen and Parhi [28] proposed a hybrid encoding of IEEE 802.3an codes, which instead uses the generator matrix for the calculation of the first g parity bits (p_1) and then uses RU’s relation from (11) to calculate the remaining parity bits (p_2). Such an approach allows easier transformation of the PCM since the constraint for nonsingularity of the matrix (−ET^{-1}B + D) is removed. Additionally, the critical path can be shortened, which can lead to faster hardware implementation. The hybrid encoding has also been used for LDPC codes in space applications [29].

2.5. Forward Substitution-Based Encoding

Many standardized LDPC codes have a structure that can facilitate even more efficient encoding than any of the previously described methods. For example, codes from Wi-Fi 802.11n/ac/ax have a double diagonal structure of the right-hand side of the PCM, whereas the left-hand side has a nonspecific conventional quasi-cyclic structure [37], as shown in (12).

 ··· (1) ··· ···  h1,1 h1,2 h1,kb I I 0 0 0 0  h h ··· h 0 II ··· 0 0 ··· 0   2,1 2,2 2,kb   ......   ......   ......   h h ··· h 0 0 0 ··· I 0 ··· 0  H =  r−1,1 r−1,2 r−1,kb  (12)  h h ··· h I ··· II ···   r,1 r,2 r,kb 0 0 0   ··· ··· ···   hr+1,1 hr+1,2 hr+1,kb 0 0 0 0 I 0     ......   ......  ··· (1) ··· ··· hmb,1 hmb,2 hmb,kb I 0 0 0 0 I

In (12), h_{j,l} represents a single CPM in the QC–LDPC code’s PCM, which can be either a circularly shifted identity matrix or a zero matrix; k_b is the number of information bit groups, each of which has Z bits, whereas m_b is the number of parity check equation groups, each of which contains Z parity check equations. I is an identity matrix. In general, W^{(s)} represents any matrix W circularly shifted by s; a vector is circularly shifted by circularly shifting each of its Z-bit subvectors. The described deterministic PCM form can be exploited to implement low-area encoders based on forward substitution (FS) [30–34]. Since the code is quasi cyclic, the codeword can be grouped into k_b information bit groups and m_b parity bit groups as follows:

x = [i_1 i_2 ⋯ i_{k_b} p_1 p_2 ⋯ p_{m_b}]. (13)

From (5), (12), and (13), it is simple to derive the following system of equations in GF(2):

∑_{l=1}^{k_b} h_{1,l} i_l^T + p_1^{T(1)} + p_2^T = 0, (1st row)

∑_{l=1}^{k_b} h_{j,l} i_l^T + p_j^T + p_{j+1}^T = 0, for j ∈ {2, 3, …, r − 1} ∪ {r + 1, r + 2, …, m_b − 1}, (14)

∑_{l=1}^{k_b} h_{r,l} i_l^T + p_1^T + p_r^T + p_{r+1}^T = 0, (r-th row)

∑_{l=1}^{k_b} h_{m_b,l} i_l^T + p_1^{T(1)} + p_{m_b}^T = 0. (m_b-th row)

Parity group p_1 can be calculated by adding up all the equations from (14), which gives the following expression:

p_1^T = ∑_{j=1}^{m_b} ∑_{l=1}^{k_b} h_{j,l} i_l^T. (15)

In hardware, each of the m_b sums ∑_{l=1}^{k_b} h_{j,l} i_l^T is usually calculated by circularly shifting the information group vectors i_l^T and by calculating the sum of the shifted vectors using XOR gates. If these sums are denoted with λ_j, then the rest of the parity bit vectors can be calculated as follows:

p_2^T = λ_1 + p_1^{T(1)};  p_{j+1}^T = λ_j + p_j^T, for j ∈ {2, 3, …, r − 2};

p_{m_b}^T = λ_{m_b} + p_1^{T(1)};  p_j^T = λ_j + p_{j+1}^T, for j ∈ {m_b − 1, m_b − 2, …, r + 1}; (16)

p_r^T = λ_r + p_1^T + p_{r+1}^T, if r > 2.

An almost identical approach has recently been used for encoding 5G NR LDPC codes since their structure is essentially similar to the described structure of Wi-Fi codes [35]. Many codes from other standards have similar structural features, which can be similarly used for achieving low complexity encoding [38,39]. The disadvantages of FS-based methods are that code information should be stored in some storage memory and that there is still some sequential processing, which can limit the achievable parallelism. However, since FS is dominantly used for structured codes, the memory requirements in the previously described architectures [30–35] come down to only the CPM positions inside the base graph matrix and the corresponding circulant shift values, i.e., there is no need for storage of the entire PCM. Additionally, these methods do provide high flexibility, which is increasingly required in modern communication standards [1].
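A group-level sketch of the FS encoder defined by (15) and (16); the lifting size Z, the row index r, and the exponent values below are toy choices (not from any standard), and the shift direction is the convention fixed by `csh`. The result is cross-checked against a PCM assembled per the structure in (12):

```python
import numpy as np

Z, r = 4, 3        # toy lifting size and special row index (illustrative only)
kb, mb = 2, 5      # toy numbers of information and parity bit groups

def cpm(s):
    """Z x Z CPM: identity circularly shifted by s, or zero matrix for s = -1."""
    if s < 0:
        return np.zeros((Z, Z), dtype=np.uint8)
    return np.roll(np.eye(Z, dtype=np.uint8), s, axis=1)

def csh(v, s):
    """Circular shift of a Z-bit group, chosen so that cpm(s) @ v == csh(v, s)."""
    return np.zeros(Z, dtype=np.uint8) if s < 0 else np.roll(v, -s)

# Toy exponent matrix for the information part of (12) (values illustrative)
A_exp = np.array([[0, 1],
                  [2, -1],
                  [1, 0],
                  [-1, 3],
                  [2, 2]])

rng = np.random.default_rng(0)
i_groups = [rng.integers(0, 2, Z, dtype=np.uint8) for _ in range(kb)]

# lambda_j: XOR of circularly shifted information groups (the mb row sums)
lam = []
for j in range(mb):
    acc = np.zeros(Z, dtype=np.uint8)
    for l in range(kb):
        acc ^= csh(i_groups[l], A_exp[j, l])
    lam.append(acc)

# Parity groups per (15) and (16), 1-indexed for readability
p = [None] * (mb + 1)
p[1] = np.bitwise_xor.reduce(lam)              # (15)
p[2] = lam[0] ^ csh(p[1], 1)                   # 1st row
for j in range(2, r - 1):                      # rows 2 .. r-2 (empty when r = 3)
    p[j + 1] = lam[j - 1] ^ p[j]
p[mb] = lam[mb - 1] ^ csh(p[1], 1)             # mb-th row
for j in range(mb - 1, r, -1):                 # rows mb-1 .. r+1
    p[j] = lam[j - 1] ^ p[j + 1]
p[r] = lam[r - 1] ^ p[1] ^ p[r + 1]            # r-th row (r > 2)

# Cross-check: build the full PCM per (12) and verify H x^T = 0
left = np.vstack([np.hstack([cpm(A_exp[j, l]) for l in range(kb)])
                  for j in range(mb)])
right = np.zeros((mb * Z, mb * Z), dtype=np.uint8)
for j in range(1, mb + 1):
    if j == 1:
        blocks = {1: cpm(1), 2: cpm(0)}
    elif j == r:
        blocks = {1: cpm(0), r: cpm(0), r + 1: cpm(0)}
    elif j == mb:
        blocks = {1: cpm(1), mb: cpm(0)}
    else:
        blocks = {j: cpm(0), j + 1: cpm(0)}
    for c, m in blocks.items():
        right[(j - 1)*Z:j*Z, (c - 1)*Z:c*Z] = m
H = np.hstack([left, right])
x = np.concatenate(i_groups + p[1:])
assert ((H @ x) % 2 == 0).all()
```

Note that only the row sums λ_j and a handful of Z-bit XORs are needed at runtime; the full PCM is built here purely for verification.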

3. Encoding of 5G NR LDPC Codes

The structure of 5G NR LDPC codes is illustrated in Section 3.1. After that, Section 3.2 gives a description of the efficient encoding algorithm that is used in this paper.

3.1. Description of LDPC Codes in 5G NR

All 5G NR LDPC codes are quasi cyclic. As for every QC–LDPC code, the PCM consists of submatrices (CPMs) of size Z × Z, each of which can be a circularly shifted identity matrix or a zero matrix. The general structure of a 5G NR LDPC code’s PCM is

H = [A B 0; C D I], (17)

where A (of size 4Z × k_bZ), C (of size (m_b − 4)Z × k_bZ), and D (of size (m_b − 4)Z × 4Z) have a conventional nonspecific quasi-cyclic structure, and B (of size 4Z × 4Z) has a form similar

to the right-hand side of the IEEE 802.11n/ac/ax PCMs described in Section 2.5. Depending on the base graph, the matrix B can have one of the following structures:

 (1) (1)    I II00I 0 0 II00II 0 0 III0 0 II 0 =  III0 =0II0  BBG1 BB ==; BBG2  (1) . .(18) (18) BG1 000II 0 II  BG2 I0II(1) I 0 II  (1) I (1) 0 0 I I 0 0 I I00I I00I

A scatter diagram of the base graph for both BG1 and BG2 matrices is shown in Figure 1.

[Figure 1 shows scatter diagrams of Base graph 1 (k_b,BG1 = 22, m_b,BG1 = 46) and Base graph 2 (k_b,BG2 = 10, m_b,BG2 = 42).]

Figure 1. Base graph matrix structures for codes in 5G NR.

The codeword (of length n_bZ, where n_b = k_b + m_b) can be represented in systematic form as

x = [i p_c p_a], (19)

where p_c represents a vector of so-called core parity bits and p_a is a vector of so-called additional parity bits [40]. The information bit vector is of length k_bZ, whereas the parity bit vectors are of length 4Z and (m_b − 4)Z, respectively. Since the upper right submatrix of the PCM is a zero matrix, all core parity bits can be calculated from information bits using submatrices A and B. Additional parity bits can be calculated from information and core parity bits using submatrices C and D. The lifting size Z can take values from 2 to 384 and must have the form

Z = a · 2^j, (20)

where a can take values from the set {2, 3, 5, 7, 9, 11, 13, 15}, which gives a total of 51 different lifting size values. There are two base graph matrices (BG1 and BG2) available from which all codes are derived. Each base graph matrix has a corresponding set of base exponent matrices. The set includes one matrix for each possible value of the parameter a from (20), defined for the corresponding maximum lifting size (256, 384, 320, 224, 288, 352, 208, or 240). Every exponent matrix contains nonnegative and negative entries. Nonnegative entries represent the values by which the corresponding identity submatrices are circularly shifted. All other entries are equal to −1 and represent zero submatrices. The structure of a base exponent matrix is

V = \begin{bmatrix} V_{1,1} & V_{1,2} & \cdots & V_{1,n_b} \\ V_{2,1} & V_{2,2} & \cdots & V_{2,n_b} \\ \vdots & \vdots & \ddots & \vdots \\ V_{m_b,1} & V_{m_b,2} & \cdots & V_{m_b,n_b} \end{bmatrix}. (21)
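The lifting sizes permitted by (20) can be enumerated directly; the sanity check below reproduces the count of 51 values stated above:

```python
# All lifting sizes Z = a * 2^j permitted by (20), capped at Z = 384
lifting_sizes = sorted(
    a * 2**j
    for a in (2, 3, 5, 7, 9, 11, 13, 15)
    for j in range(8)
    if a * 2**j <= 384
)
print(len(lifting_sizes), lifting_sizes[0], lifting_sizes[-1])
```

The cap at 384 reproduces exactly the per-a maximum lifting sizes listed above (e.g., 13 · 2^4 = 208 and 3 · 2^7 = 384).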

Base exponent matrices define exponent matrices for every other lifting size, whose structure is the same:

P = \begin{bmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,n_b} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,n_b} \\ \vdots & \vdots & \ddots & \vdots \\ P_{m_b,1} & P_{m_b,2} & \cdots & P_{m_b,n_b} \end{bmatrix}. (22)

The shift values P_{j,l} are calculated as follows:

P_{j,l} = \begin{cases} -1, & \text{if } V_{j,l} = -1 \\ V_{j,l} \bmod Z, & \text{if } V_{j,l} > -1 \end{cases}, (23)

where mod represents the modulo operation [1,5,9]. To date, the 5G NR standard introduces the highest number of possible LDPC codes of any standard. A practical system must support all of the codes since the choice of which code is used is made at runtime. Therefore, the encoder and the decoder need to be flexible, which poses a challenge for designers of practical systems.
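The derivation of an exponent matrix from a base exponent matrix per (23) is a one-liner; the base exponent matrix below is a toy example, not taken from the standard:

```python
import numpy as np

def exponent_matrix_for_Z(V, Z):
    """Exponent matrix P for lifting size Z, derived from V per (23)."""
    V = np.asarray(V)
    # -1 entries (zero submatrices) are kept; shift values are reduced mod Z
    return np.where(V == -1, -1, V % Z)

# Toy base exponent matrix (entries illustrative, not from the standard)
V = np.array([[307, -1, 50],
              [-1, 12, 383]])
P = exponent_matrix_for_Z(V, Z=64)
```

Because only the modulo reduction depends on Z, one stored base exponent matrix per value of a serves all 51 lifting sizes, which is what keeps the code description unified.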

3.2. Efficient Algorithm for Flexible Encoding

As mentioned earlier, the 5G NR communication standard imposes the necessity for high LDPC encoder flexibility. There are a total of 102 codes derived just from the two base graph matrices by lifting each with every one of the 51 lifting sizes. The standard also introduces the shortening of base graph matrices, which additionally increases the number of possible codes by tens of times [1]. This flexibility cannot be achieved by any method that requires either multiplying with a generator matrix or any matrix inversion without extremely high usage of memory or logic resources. Therefore, flexible encoding must be achieved based on the previously described forward substitution method.

The forward substitution-based encoding for BG1 codes has been presented in [35]. It has been shown that this method is multiple times more efficient than the straightforward and RU methods and that fast encoding can be achieved. The first step is the calculation of core parity bits. After that, the additional parity bits can be calculated using only the lower part of the PCM. Similar to Wi-Fi codes, the encoding process solves the following system of equations:

Hx^T = 0 ⇒ \begin{bmatrix} A & B & 0 \\ C & D & I \end{bmatrix} \begin{bmatrix} i^T \\ p_c^T \\ p_a^T \end{bmatrix} = 0. (24)

This system can be split into a smaller system of equations from which the core parity bits can be calculated (25) and a set of equations from which the additional parity bits are afterward calculated (26).

\begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,k_b} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,k_b} \\ a_{3,1} & a_{3,2} & \cdots & a_{3,k_b} \\ a_{4,1} & a_{4,2} & \cdots & a_{4,k_b} \end{bmatrix} \times \begin{bmatrix} i_1^T \\ i_2^T \\ \vdots \\ i_{k_b}^T \end{bmatrix} + \begin{bmatrix} b_{1,1} & b_{1,2} & b_{1,3} & b_{1,4} \\ b_{2,1} & b_{2,2} & b_{2,3} & b_{2,4} \\ b_{3,1} & b_{3,2} & b_{3,3} & b_{3,4} \\ b_{4,1} & b_{4,2} & b_{4,3} & b_{4,4} \end{bmatrix} \times \begin{bmatrix} p_{c,1}^T \\ p_{c,2}^T \\ p_{c,3}^T \\ p_{c,4}^T \end{bmatrix} = 0, (25)

p_a^T = \begin{bmatrix} c_{1,1} & \cdots & c_{1,k_b} & d_{1,1} & d_{1,2} & d_{1,3} & d_{1,4} \\ c_{2,1} & \cdots & c_{2,k_b} & d_{2,1} & d_{2,2} & d_{2,3} & d_{2,4} \\ \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \vdots \\ c_{m_b-4,1} & \cdots & c_{m_b-4,k_b} & d_{m_b-4,1} & d_{m_b-4,2} & d_{m_b-4,3} & d_{m_b-4,4} \end{bmatrix} \times \begin{bmatrix} i^T \\ p_c^T \end{bmatrix}. (26)

For BG1 codes, the system (25) is just a special case of the system (12), where r = 2. Therefore, if the intermediate results are denoted as

λ_j^T = ∑_{l=1}^{k_b} a_{j,l} i_l^T,  j ∈ {1, 2, 3, 4}, (27)

core parity bits can be calculated using the following expressions:

p_{c,1} = ∑_{j=1}^{4} λ_j;  p_{c,2} = λ_1 + p_{c,1}^{(1)};  p_{c,4} = λ_4 + p_{c,1}^{(1)};  p_{c,3} = λ_3 + p_{c,4}. (28)

In [35], the system (25) is solved only for BG1 codes. In this paper, flexibility was a key requirement; therefore, the analysis is extended to the codes defined by BG2 as well. For BG2, the system (25) can be written as follows:

∑_{l=1}^{k_b} a_{1,l} i_l^T + p_{c,1}^T + p_{c,2}^T = 0,

∑_{l=1}^{k_b} a_{2,l} i_l^T + p_{c,2}^T + p_{c,3}^T = 0,

∑_{l=1}^{k_b} a_{3,l} i_l^T + p_{c,1}^{T(1)} + p_{c,3}^T + p_{c,4}^T = 0, (29)

∑_{l=1}^{k_b} a_{4,l} i_l^T + p_{c,1}^T + p_{c,4}^T = 0.

By adding up all the above equations, it is simple to obtain the first core parity bit group circularly shifted by 1 position as

p_{c,1}^{(1)} = ∑_{j=1}^{4} λ_j. (30)

This vector should be shifted back by 1 to obtain the first core parity bit group in the appropriate arrangement. With the first group available, all other groups are directly calculated from equations in (29) as follows:

p_{c,2} = λ_1 + p_{c,1};  p_{c,3} = λ_2 + p_{c,2};  p_{c,4} = λ_4 + p_{c,1}. (31)

The described algorithm is used in this paper for efficient hardware implementation.
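Given the λ vectors from (27), the core parity computations of (28) for BG1 and (30)–(31) for BG2 can be sketched as follows; the lifting size and λ values are toy inputs, and the shift direction of `sh` is a convention assumed here. The BG2 result is checked by substituting it back into the system (29):

```python
import numpy as np

Z = 8                                              # toy lifting size
rng = np.random.default_rng(1)
lam = rng.integers(0, 2, (4, Z), dtype=np.uint8)   # toy lambda vectors (27)

def sh(v, s):
    """Circular shift of a Z-bit group by s; the roll direction is a convention."""
    return np.roll(v, -s)

def core_parity_bg1(lam):
    """Core parity bit groups per (28)."""
    p1 = np.bitwise_xor.reduce(lam)
    p2 = lam[0] ^ sh(p1, 1)
    p4 = lam[3] ^ sh(p1, 1)
    p3 = lam[2] ^ p4
    return p1, p2, p3, p4

def core_parity_bg2(lam):
    """Core parity bit groups per (30) and (31): the XOR of the lambdas gives
    p1 circularly shifted by 1, so it is shifted back before further use."""
    p1 = sh(np.bitwise_xor.reduce(lam), -1)        # undo the shift by 1
    p2 = lam[0] ^ p1
    p3 = lam[1] ^ p2
    p4 = lam[3] ^ p1
    return p1, p2, p3, p4

# Sanity check: substitute the BG2 result back into the system (29)
p1, p2, p3, p4 = core_parity_bg2(lam)
assert not (lam[0] ^ p1 ^ p2).any()
assert not (lam[1] ^ p2 ^ p3).any()
assert not (lam[2] ^ sh(p1, 1) ^ p3 ^ p4).any()
assert not (lam[3] ^ p1 ^ p4).any()
```

Note how the only structural difference between the two base graphs is where the extra shift by one position appears: applied to p_{c,1} before reuse in BG1, undone once after the XOR accumulation in BG2.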

4. Optimal Encoding Schedules in Partially Parallel Encoding

The encoding algorithm described in Section 3.2 can be implemented with various levels of parallelism. A fully parallel implementation would require extremely high logic usage and a relatively long critical path, since the codeword can be up to 26,112 bits long and since flexibility is one of the main requirements in 5G NR. Partially parallel implementations can provide a compromise between achievable throughput and utilization of hardware resources. This section describes various approaches for partially parallel encoding and analyzes the corresponding hardware utilization for each of the approaches.

4.1. Partially Parallel Processing in 5G NR LDPC Encoding

The encoding process mainly consists of circularly shifting Z-bit vectors and calculating the XOR of all the shifted values. In partially parallel processing, the hardware that is used for shifting, later in the text called the circular shifter (CS), must be generic, i.e., it needs to support all shift values and all shift (lifting) sizes that are defined by the standard. The requirement for the flexibility of CSs drastically increases their hardware resource requirements to the point where shifters take the most hardware resources in the entire

encoder. Therefore, if hardware usage efficiency is the key requirement, it is important that these components are utilized as much as possible, i.e., that they are rarely idle.

Partially parallel processing can be conducted in multiple ways. If only one CS is used, the encoding is performed serially by processing one CPM at a time. Note that this encoding is still partially parallel since the CS takes the entire vector of Z bits. Figure 2 shows such an architecture’s encoding schedule. The schedule is shown for BG1 codes since BG2 codes have a similar structure but a smaller number of CPMs in the PCM. In Figure 2, the numbers at CPM positions represent the ordinals of the clock cycles in which the corresponding CPMs are processed. In the beginning, the encoder takes the information bit groups that correspond to the first four base graph rows, necessary for the calculation of vectors λ_1, λ_2, λ_3, and λ_4. Based on the calculated λ vectors, core parity bits are computed as in (28) or as in (30) and (31). This can be performed in parallel with further information bit group shifting for later rows and their XOR sum calculation. Due to this parallel computation, all CPMs that are used for core parity bits calculation, based on vectors λ_1, λ_2, λ_3, and λ_4, are placed in the same clock cycle in Figure 2.

[Figure 2 shows the BG1 base graph with each CPM position labeled by its clock-cycle ordinal; the serial schedule spans 265 cycles.]

Figure 2. Encoding schedule for BG1 codes when serially processing a single CPM at a time. Numbers represent the ordinal of the clock cycle in which the corresponding CPM is processed.

The serial processing schedule requires 265 clock periods for the generation of all parity bits for BG1 codes and 150 clock periods for BG2 codes. Although the throughput of such an approach is the smallest of all partially parallel schedules, the HUE is the highest since only one CS is used and it is never idle. A similar schedule has been presented in [32] for Wi-Fi codes, but its order of processing was column oriented, meaning that the processing was performed column by column with the storage of intermediate results in local memory.

The encoding throughput can be increased if a larger number of parallel processing units is used, i.e., by exploiting higher parallelism. Figure 3 shows the encoding schedule proposed in [34] for Wi-Fi codes but here applied to 5G NR codes. All CPMs that belong to a single column are processed in parallel. After the first kb columns are processed, all λ vectors are ready and the core parity bits can be calculated. It is convenient to wait at least one clock period for the core parity bits to be ready before the remaining four columns are processed. This is carried out because of the pipelining, which makes the critical path much shorter. The schedule requires at least 27 (kb,BG1 + 1 + 4) clock periods for the generation of all parity bits for BG1 codes and at least 15 (kb,BG2 + 1 + 4) clock periods for BG2 codes. The described method requires a large number of CSs: 46 for BG1 codes and 42 for BG2 codes. Since the encoder should be flexible, it is necessary to use the larger of these two numbers.

Figure 3. Encoding schedule for BG1 codes when processing the entire base graph matrix column at the same time. Numbers represent the ordinal of the clock cycle in which the corresponding CPM is processed.

In order to reduce the hardware resources used, row-based parallel encoding can be executed [31,35]. Such a schedule is shown in Figure 4. All CPMs from a single row are processed in a single clock period. Instead of shifting one bit group at a time, this schedule requires that all information bits, and later core parity bits, are available in registers. Consequently, it is often required to wait for the information bits to be written to registers from the serial input stream [35]. However, assuming that double buffering is available at the data input, this stall time can be avoided. The GF(2) sum of the shifted vectors should be calculated using XOR gates placed in a tree-like structure.


Figure 4. Encoding schedule for BG1 codes when processing the entire base graph matrix row at the same time. Numbers represent the ordinal of the clock cycle in which the corresponding CPM is processed.

The required number of clock cycles for one codeword encoding is at least 46 (mb,BG1) for BG1 codes and at least 42 (mb,BG2) for BG2 codes, which is significantly higher when compared with the column-based parallel encoding. However, the required number of CSs is reduced: 26 for BG1 codes and 14 for BG2 codes [35]. Again, it is necessary to use the larger of these two numbers because of the flexibility.

The encoder information throughput can be calculated as follows:

Thr = (kb · Z · fCLK)/NCPC, (32)

where fCLK is the encoder's operating frequency and NCPC is the number of clock cycles necessary for a single codeword encoding (clocks per codeword). Since CSs take the largest amount of logic resources, hardware usage can be approximately expressed in the number of CSs. For a better comparison of the approximate hardware usage efficiencies of the described encoding schedules, a circular shifter's efficiency is defined as follows:

EffCS = Thr/NCS, (33)

where NCS is the number of used CSs. The values of this parameter calculated for the longest codewords (kmax,BG1 = kb,BG1 · Zmax = 8448 and kmax,BG2 = kb,BG2 · Zmax = 3840) are summarized in Table 1. As can be inferred from the results, the most efficient schedule is serial processing, although it gives the lowest throughput. Highly parallel encoding schedules are less efficient because a large number of shifters are unused, which is a consequence of the fact that the base graph matrix's rows and columns are not fully filled.

Table 1. Comparison of approximated hardware usage efficiencies of a flexible encoder for various encoding schedules and for the longest codeword (Z = 384), in b/cycle per circular shifter used.

Encoding Schedule        NCPC (BG1)   NCPC (BG2)   NCS   EffCS (BG1)   EffCS (BG2)
Serial                          265          150     1          31.9          25.6
Column-based parallel            27           15    46           6.8           5.6
Row-based parallel               46           42    26           7.1           3.5
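The table entries follow directly from (32) and (33). As a quick check, the following Python snippet recomputes the EffCS columns with fCLK set to 1, so that throughput is expressed in bits per clock cycle (the dictionary layout is ours, not the paper's):

```python
# Longest-codeword information block sizes from the text
K_BG1 = 22 * 384   # k_b,BG1 * Z_max = 8448
K_BG2 = 10 * 384   # k_b,BG2 * Z_max = 3840

schedules = {
    # name: (N_CPC for BG1, N_CPC for BG2, N_CS)
    "Serial":                (265, 150,  1),
    "Column-based parallel": ( 27,  15, 46),   # 27 = kb,BG1 + 1 + 4; 15 = kb,BG2 + 1 + 4
    "Row-based parallel":    ( 46,  42, 26),   # 46 = mb,BG1; 42 = mb,BG2
}

for name, (n_bg1, n_bg2, n_cs) in schedules.items():
    # Eff_CS = Thr / N_CS with Thr = k * Z * f_CLK / N_CPC and f_CLK = 1
    eff_bg1 = K_BG1 / n_bg1 / n_cs
    eff_bg2 = K_BG2 / n_bg2 / n_cs
    print(f"{name:24s} {eff_bg1:5.1f} {eff_bg2:5.1f}")
```

Running the snippet reproduces the EffCS values of Table 1 (31.9/25.6, 6.8/5.6, and 7.1/3.5 b/cycle per CS).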

For reasons of efficiency, in this paper, serial CPM processing is implemented for codes with a high lifting size (Z > 192). Only one CS is used. However, the CS is designed in such a way that it can be reconfigured to reuse logic resources and to work as two or more independent CSs if codes with lower lifting size values are used. The CS design and the entire encoder architecture are described in detail in Section 5.

Since more than one CS is available for shorter codes, the serial encoding schedule can then be parallelized to increase the HUE. The proposed encoding schedule processes more than one row at a time. An example of a situation when four shifters are available is shown in Figure 5. The encoder takes the information bit groups, shifts each of them, and calculates intermediate results for all λ vectors. During the processing of the next four rows, the core parity bits are calculated and prepared for the calculation of additional parity bits.

Using the described multirow serial schedule with four CSs, the encoding time can be reduced to 180 clock cycles for BG1 codes and 94 clock cycles for BG2 codes. However, during encoding, even three out of four available CSs are frequently idle due to the PCM's structure. This is easily noticeable in Figure 5 for all rows used in the additional parity bits calculation. The reason for such a PCM design is an attempt to avoid the pipeline conflicts that happen in layered decoding [41] architectures because of read-after-write (RAW) data hazards [5,11].

Figure 5. Proposed encoding schedule for BG1 codes when four CSs are available. Numbers represent the ordinal of the clock cycle in which the corresponding CPM is processed.


Although the PCM has a structure designed to support decoder efficiency, it can be processed in a way that is optimal for the encoder. For example, the sixth-row group shown in Figure 5 uses almost the same information and core parity bits as the ninth-row group. A significantly higher efficiency would be obtained if the sixth and ninth rows were processed together. Consequently, instead of conventionally processing adjacent row groups in parallel, the encoding can be accomplished by processing the rows that use the same, or at least almost the same, information and core parity bits. Therefore, the proposed encoding schedule can be optimized for better hardware efficiency. A method for obtaining the optimal schedule is described in Section 4.2.

4.2. Multirow Serial Encoding Schedule Optimization

Finding the optimal processing order belongs to the traveling salesman problem class and can be efficiently solved using a genetic algorithm (GA) [42,43]. The GA has previously been used for finding the optimal layered decoding computation schedule [11,43]. There, the goal was to find the processing schedule that would provide the smallest number of pipeline RAW data hazards. A schedule that provides the smallest number of data hazards should have minimal data dependency between adjacent rows. Therefore, that optimization procedure results in a schedule whose adjacent rows have CPMs in different columns.

However, in this work, the schedule is optimized in a completely opposite manner; the cost function is written to favor those processing orders whose adjacent rows have as many CPMs as possible in the same columns. The cost function returns the number of clock cycles needed for encoding completion, and it depends exclusively on the processing order. This criterion should result in a schedule that maximizes the utilization of the available CSs and therefore gives the highest HUE.

In Section 4.1, all core parity bits are treated in the same way as information bits in the process of additional parity bits calculation. Whenever a core parity bits vector participates in the GF(2) sum for the calculation of additional parity bits, the encoder allocates a separate clock cycle for each of the core parity groups, as shown in Figure 5. However, if there are free CSs available, it is possible to use them to circularly shift the core parity bits in parallel with the information bits shifting. In this way, there is no need for additional clock cycles for processing the core parity-related CPMs, and hence, the throughput is increased. The illustration of repositioning the CPM processing is shown in Figure 6. The described repositioning is included in the optimization process.
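The paper specifies only the cost function's output (the number of encoding clock cycles for a given row order). One plausible way to model it, consistent with the schedule in Figure 5, is to charge a group of up to NCS concurrently processed rows one cycle per distinct base graph column its rows touch, so that rows sharing columns are cheap to schedule together. The function below is a sketch under that assumption, not the authors' implementation:

```python
def schedule_cost(order, rows, n_cs=4):
    """Clock-cycle estimate for a processing order of the rows used for
    additional parity bits.

    Assumed model: rows are taken in groups of up to n_cs; each group costs
    one cycle per distinct base graph column its rows touch, since a fetched
    bit group can feed all CSs whose rows need it in the same cycle.
    rows: {row index: set of column indices holding CPMs in that row}.
    """
    cost = 0
    for i in range(0, len(order), n_cs):
        group = order[i:i + n_cs]
        cols = set()
        for r in group:
            cols |= rows[r]   # union of columns used by the grouped rows
        cost += len(cols)
    return cost
```

Under this model, grouping rows with overlapping columns (e.g., the sixth and ninth rows of Figure 5) directly lowers the cycle count, which is exactly what the GA's cost criterion rewards.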


Figure 6. Illustration of processing repositioning for core parity bits related CPMs.

Finally, it is important to mention that the results of the first four rows in the base graph matrix are always calculated in the beginning since the core parity bits are necessary for later calculations. Reordering is performed only for the processing of the rows used for the additional parity bits calculation.

The GA optimization consists of the following phases: (1) generation of the initial population, after which the later phases are iteratively repeated; (2) selection of the best individuals; (3) crossover of the selected individuals; and (4) mutation. The initial population contains a large set of vectors, which represent the row processing orders. Each individual is a random permutation of the original processing order, which is a vector of natural numbers valued from 5 to mb. After the population is generated, the set of processing orders that give the smallest numbers of encoding clock cycles is selected for reproduction. The illustration of the crossover of two vectors (parents) is shown in Figure 7. For the sake of simplicity, the example from Figure 7 uses individuals generated from a vector of natural numbers from 1 to 20, but it demonstrates a procedure applicable to any vectors. A subvector is cut from the first parent at random positions, after which it is placed in the child vector at the same position where it stood in the parent vector. The remainder of the child vector is filled with elements from the second parent vector while keeping the original order of elements. Those elements which are the same as already placed elements originating from the first vector are skipped. This is carried out to keep the uniqueness of each element in a vector.
The crossover is repeated until a new population is formed. A mutation is performed by randomly switching two elements of some randomly selected vectors. The newly generated population is used for another selection, and the process is repeated until the maximal number of generations is reached.
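The crossover and mutation operators described above translate directly into code. The sketch below reproduces the worked example of Figure 7; the signatures (optional explicit cut points and swap positions, used here to make the example deterministic) are our own choices:

```python
import random

def crossover(p1, p2, lo=None, hi=None):
    """Order-preserving crossover: a randomly cut subvector of parent 1 keeps
    its position; the rest is filled from parent 2 in order, skipping
    elements already placed (keeps every element unique)."""
    n = len(p1)
    if lo is None or hi is None:
        lo, hi = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[lo:hi + 1] = p1[lo:hi + 1]
    taken = set(child[lo:hi + 1])
    fill = (x for x in p2 if x not in taken)
    for i in range(n):
        if child[i] is None:
            child[i] = next(fill)
    return child

def mutate(v, i=None, j=None):
    """Mutation: switch two (by default randomly chosen) elements."""
    if i is None or j is None:
        i, j = random.sample(range(len(v)), 2)
    v = list(v)
    v[i], v[j] = v[j], v[i]
    return v

# The Figure 7 example: the subvector [17, 14, 8, 5, 19, 15] is cut from
# parent 1 (positions 5-10) and kept in place in the child
p1 = [6, 3, 16, 11, 7, 17, 14, 8, 5, 19, 15, 1, 2, 4, 18, 13, 9, 20, 10, 12]
p2 = [12, 20, 2, 14, 15, 10, 13, 18, 8, 9, 1, 5, 17, 11, 6, 16, 3, 4, 19, 7]
child = crossover(p1, p2, 5, 10)
# child == [12, 20, 2, 10, 13, 17, 14, 8, 5, 19, 15, 18, 9, 1, 11, 6, 16, 3, 4, 7]
```

Swapping the elements at positions 1 and 12 of `child` then yields the mutated child shown in Figure 7.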


Figure 7. Genetic algorithm crossover and mutation examples for 20-element vectors.

5. Proposed Hardware Realization of Flexible 5G NR LDPC Encoder

This section presents the hardware implementation of the proposed encoding algorithm and scheduling. The flexible scheduling described in Section 4 requires a thorough design of the circular shifting network. Therefore, the proposed design for such a shifting network is presented in Section 5.1. The architecture of the proposed encoder is presented in Section 5.2.

5.1. Circular Shifting Network for High Flexibility

The circular shifting network is designed to work in one of the following three modes: (1) as a single CS when the lifting size is in the range 192 < Z ≤ 384, (2) as two independent CSs when the lifting size is in the range 96 < Z ≤ 192, and (3) as four independent CSs when the lifting size is in the range 2 ≤ Z ≤ 96. To support such functionality, the chosen basic component for the entire network is a CS that supports all 5G NR lifting sizes of up to 96

(CS96). Four CS96 components are used as either independent CSs or as partial CSs for higher lifting sizes.

A CS96 shifter is designed as a two-stage shifter as in [11,44], where the first stage is a pre-rotator network and the second stage is a QSN (abbreviated from QC–LDPC Shift Network) circular shifter [45]. As stated in (20), the 5G NR lifting size can take values of the form Z = a · 2^j, where j goes from 0 for the smallest to 7 for the largest lifting sizes. For the lifting sizes supported by the CS96 shifter, j is always less than or equal to 5. Given that j can take values as small as 0, the pre-rotator network must have outputs for multiple rotation sizes [11], which is obtained as in [46].

With CS96 available, a CS for higher lifting sizes is realized using appropriate bit permutations and reordering before and after the CS96 blocks. A rationale for the permutation and reordering network design is based on the possibility of rearranging CPMs into a group of smaller submatrices, of which some are also circularly shifted identity matrices of size Z/D × Z/D, where D is a natural number. Such rearranging has previously been used for the design of LDPC decoders that support lower parallelism than Z and for preventing pipeline RAW data hazards [43], but here, it is applied to the design of the CS. The CPM splitting relies on row and column permutations. Both row and column indexes in the new matrix are calculated using the following formula:

idx_new = (idx_old mod D) · (Z/D) + ⌊idx_old/D⌋. (34)

Examples of the presented permutations for D = 2 and D = 4 are shown in Figure 8. For the sake of simplicity and clearer presentation, the original lifting size is 16. A generalized illustration of CPM splitting for any lifting size is shown in Figure 9. One exponent matrix entry P gives 2 × 2 = 4 new entries if D = 2, and 4 × 4 = 16 new entries if D = 4.

The described property proves that a single circular shift for a large lifting size can be split into multiple circular shifts for a smaller lifting size. Therefore, the circular shifting network for large lifting sizes can be designed by using independent CSs of smaller lifting sizes. This can be performed using the following steps. Input bits are first permuted using the relation from (34) to obtain independent groups of data. Each group of data is circularly shifted in shifters of smaller lifting sizes. The shifted data are in the permuted order and hence need an inverse permutation to obtain the proper output bit arrangement. The CS architecture that implements this method, suitable for 5G NR codes, is shown in Figure 10.
architecture In addition tothat the implements already mentioned this method perm suitableutations, for the 5G shifting NR codes network is shown needs in aFigure group 10.reordering InExamples addition stage, to of the presentedwhich already depends mentioned permutations on the perm remain forutations,Dder= 2after the and shiftingtheD =division 4 are network shown of the needs inoriginal Figure a group shift8. For value with D. For example, for D = 4, if the remainder is 0, there is no need for reordering, thereordering sake of stage, simplicity which and depends clearer on presentation, the remainder the after original the division lifting size of the is 16. original A generalized shift which can be concluded from positions of newly generated CPMs in Figure 9. However, illustrationvalue with D of. For CPM example, splitting for for D any= 4, if lifting the remainder size is shown is 0, there in Figure is no9 need. One for exponent reordering, matrix if the remainder is 1, the first bit group must be placed last, the second must be placed entrywhichP cangives be concluded 2 × 2 = 4 new from entries positions if D of= newly 2, and generated 4 × 4 = 16 CPMs new entriesin Figure if D9. =However, 4. iffirst, the remainderthe third must is 1, be the placed first bit second, group th muste fourth be placedmust be last, placed the third,second etc. must be placed first, the third must be placed second, the fourth must be placed third, etc. 
Figure 8. CPM splitting examples: (a) the original CPM, (b) original CPM with columns permuted for D = 2, (c) final permuted CPM for D = 2, and (d) final permuted CPM for D = 4.

Figure 9. General circulant permutation matrix splitting results for D = 2 and D = 4. P is the shift value from the exponent matrix. Exponent matrix entry value −1 represents a zero submatrix.


The described property proves that a single circular shift for a large lifting size can be split into multiple circular shifts for a smaller lifting size. Therefore, the circular shifting network for large lifting sizes can be designed by using independent smaller lifting size CSs. This can be performed using the following steps. Input bits are firstly permuted using the relation from (34) to obtain independent groups of data. Each group of data is circularly shifted in shifters of smaller lifting sizes. The shifted data are in the permuted order and hence need an inverse permutation to obtain the proper output bit arrangement. The CS architecture that implements this method, suitable for 5G NR codes, is shown in Figure 10. In addition to the already mentioned permutations, the shifting network needs a group reordering stage, which depends on the remainder after the division of the original shift value by D. For example, for D = 4, if the remainder is 0, there is no need for reordering, which can be concluded from the positions of the newly generated CPMs in Figure 9. However, if the remainder is 1, the first bit group must be placed last, the second must be placed first, the third must be placed second, the fourth must be placed third, etc.


Figure 10. Shifting network architecture for the proposed 5G NR encoder. The network is capable of circularly shifting four independent groups of data if Z ≤ 96, two independent groups of data if 96 < Z ≤ 192, and one group of data if 192 < Z ≤ 384.

The described architecture provides another important feature. The shifted bits for lifting sizes other than 96, 192, or 384 are aligned to one side of the data block. Several examples are shown in Figure 10. Light green and dark green areas represent groups of Z valid bits that should be shifted, whereas the hatched area represents nonvalid data. This property is of crucial importance since such data alignment drastically simplifies the remaining encoder blocks: the core parity bits calculator, the input data interface, and the output data interface.

Finally, the proposed procedure can easily be applied to a more generic CS network that can shift more than four independent blocks of data. However, in such cases, due to the high irregularity of check nodes in 5G NR LDPC codes, the ratio of the throughput gain to the additional complexity in this particular application would not be as high as it is for the former cases. Additionally, the processing of the first four rows in the base graph matrix cannot be further parallelized since core parity bits are necessary for later processing. Nevertheless, other LDPC encoder or decoder architectures may benefit more from additional shifting network splitting. For example, in order to obtain better HUE, row-based parallel architectures could use a shifter that processes all inputs in parallel for small lifting size values but combines parallel and serial processing for higher lifting sizes. Again, for 5G NR, the PCM has a broad range of row weight values, and such a shifter would not always be used at its full capacity; however, for codes that have relatively high row weights for all rows, not just for the first 4Z as in 5G NR, the benefits to the HUE would be significant. Similar conclusions apply to codes whose column regularity is higher.

5.2. LDPC Encoder Architecture
The hardware architecture of the proposed flexible encoder is shown in Figure 11. Information bits are stored in the input buffer dual-port memory. In order to implement double buffering, the input buffer can receive information bits for two adjacent codewords. One half of the memory is used for reads during the current codeword encoding, while the other half is used for information bits storage for the next codeword. Consequently, there is no need for any information bits repackaging when the encoding starts, which removes the need for additional clock cycles and increases the encoding speed.
Depending on the lifting size, the encoder can work in one of three modes determined by the circular shifter. If the lifting size is in the range 192 < Z ≤ 384, the encoder serially processes one CPM per clock cycle, from one row of the base graph matrix. The information bits group of maximum bit length 384 is read from the input buffer. Info bits are shifted and used for the calculation of GF(2) sums for λ vectors. All GF(2) additions are realized using the XOR operation. If the lifting size is in the range 96 < Z ≤ 192, the encoder


serially processes up to two CPMs per clock cycle, from two adjacent rows defined by the optimized schedule. In this case, a word that is read from the input buffer memory consists of an information bits group (of maximum bit length 192), and its copy is placed in the word's high bits. This is provided by the input interface, which is not shown in Figure 11. The circular shifting network works as two independent CSs with shift sizes equal to the given code lifting size.


Figure 11. Proposed 5G NR LDPC encoder architecture.

If the lifting size is in the range 2 ≤ Z ≤ 96, the encoder serially processes up to four CPMs per clock cycle, from four adjacent rows defined by the corresponding optimized schedule. The input buffer now contains words that consist of four identical copies of information bits groups. They are independently shifted in the CS and then used for the calculation of four independent GF(2) sums. During the process of λ vectors calculation, information bits are also passed to the output codeword generator that writes the data to the output buffer.
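The three lifting-size ranges above amount to a simple mode table. A minimal sketch (the range boundaries come from the text; the function name is ours):

```python
def cpms_per_cycle(Z: int) -> int:
    """Number of CPMs the encoder processes per clock cycle for lifting size Z."""
    if Z <= 96:
        return 4   # four independent CS96 shifters
    if Z <= 192:
        return 2   # CS96 pairs act as two larger shifters
    return 1       # all four CS96s form a single shifter for Z up to 384

assert [cpms_per_cycle(z) for z in (96, 192, 384)] == [4, 2, 1]
```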

Since, as described, the parallelism of the λ vectors calculation is different for different lifting sizes, there is a need for a separate block, marked as "Merge/Select λ" in Figure 11, which arranges the λ vectors data in the appropriate order for later core parity bits calculation.
After the calculation of all λ vectors, the process continues in the same manner for the next rows in the base graph matrix. During that time, core parity bits are calculated in parallel. BG1 and BG2 have different equations for the calculation of core parity bits. However, most of the logic can be shared by inserting a few additional multiplexers that rearrange the data paths depending on the selected base graph. The circular shifter marked as CS(1) shifts data only by 1, to the left or right depending on the base graph. Since this shifter should support only two shift values, it can be implemented using just a fraction of the resources that are used for the main CS, although it should still support all lifting sizes defined in the 5G NR standard. When calculated, the core parity bits are also passed to the output codeword generator.
Calculated core parity bits are returned to the main CS input since they are used for additional parity bits calculation. One of the four core parity bit groups is selected based on the control CPM data. Depending on the lifting size, the selected core parity bits group is copied in the same way as it has been for information bits in the input buffer. Finally, four sets of multiplexers determine whether the information bits group or the core parity bits group should be passed to the CS. Outputs of the "GF(2) Sum" block are now additional parity bits, which are passed to the output codeword generator. The entire architecture is pipelined to obtain a high operating frequency. Therefore, it is necessary that the encoding schedule does not use core parity bits for the calculation of additional parity bits until the former are ready.
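The per-row GF(2) accumulation described above can be illustrated with a toy model (made-up sizes and row format, not the 5G NR matrices; a λ vector is simply the XOR of circularly shifted bit groups selected by one PCM row):

```python
def rotl(v, n):
    """Left-rotate a list by n positions."""
    n %= len(v)
    return v[n:] + v[:n]

def lam(row, info_groups):
    """XOR-accumulate circularly shifted info-bit groups for one PCM row.

    `row` lists (column index, CPM shift value) pairs for the non-zero CPMs;
    zero submatrices (exponent entry -1) are simply omitted from the list.
    """
    Z = len(info_groups[0])
    acc = [0] * Z
    for col, shift in row:
        for k, b in enumerate(rotl(info_groups[col], shift)):
            acc[k] ^= b  # GF(2) addition is realized as XOR
    return acc

# Toy example with Z = 4: lambda = rot(g0, 1) XOR rot(g1, 0)
# lam([(0, 1), (1, 0)], [[1, 0, 0, 0], [0, 1, 0, 0]]) yields [0, 1, 0, 1]
```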
Ideally, the first rows that are used for additional parity bits calculation should not contain CPMs in the columns related to the core parity bits. This is provided by adding the corresponding constraint to the processing order optimization.
In the end, it should be mentioned that an almost identical architecture could be applied to the encoding of IEEE 802.16e (WiMAX) [38] LDPC codes. There are two main differences. The first is that WiMAX lifting sizes are between 24 and 96. Therefore, the main CS would be designed using four CS24 components instead of CS96s. The second difference is based on the fact that WiMAX has only core parity bits, which can be calculated recursively in a way similar to (15) and (16). The proposed 5G NR encoder calculates all core parity bits in parallel, but for WiMAX, it may be more efficient to do so serially because the number of parity bits is not the same for all codes. The additional CS needed for parity bits calculation should support slightly more shift values than the two required in the 5G NR encoder.

6. Results and Discussion

6.1. Encoding Schedule Optimization Results
Optimization of the processing order provided a strong decrease in the encoding time. Examples of optimized encoding schedules are shown in Figure 12. Examples are shown for a processing order when four CSs are available. It is obvious that rows are grouped in a way that the same information bits are used for multiple CPMs at the same time. This results in a significant increase in the encoding speed. Moreover, the core parity bits processing is merged with the processing of some information bits when free CSs are available. This repositioning is of great importance for the additional encoding speed increase.


Figure 12. A scatter diagram of the optimized processing order for BG1 and BG2 codes when four CSs are available. Circles represent CPMs that are used for information bits processing, whereas other shapes represent CPMs that are used for core parity bits processing. All core parity bits processing is merged with some information bits processing when free CSs are available.

Table 2 summarizes the results of GA optimization. The improvement metric describes how much faster the optimized schedule is, compared to the unoptimized. A

large reduction in the number of clock cycles necessary for one codeword encoding can be observed. The BG2 optimization provides better improvements because a larger percentage of total CPMs are located in the core parity columns and since all these CPMs are processed together with other CPMs from information bit columns.

Table 2. Results of the GA optimization.

Encoding Schedule | N_CPC (BG1/BG2) | N_CPC, Optimized (BG1/BG2) | Improvement wrt Unoptimized, % (BG1/BG2) | Improvement wrt Serial, % (BG1/BG2)
Serial | 265/150 | -/- | -/- | 0/0
Serial (2 CSs) | 223/136 | 165/86 | 35.15/58.14 | 60.61/74.42
Serial (4 CSs) | 180/94 | 107/53 | 68.22/77.35 | 147.66/183.02
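The improvement figures in Table 2 follow directly from the cycle counts; e.g., for BG1 with four CSs (a quick numerical cross-check, not part of the original results):

```python
def improvement_pct(cycles_ref: int, cycles_new: int) -> float:
    """How much faster a schedule needing cycles_new is vs. cycles_ref, in %."""
    return round((cycles_ref / cycles_new - 1.0) * 100, 2)

# BG1, four CSs: 180 cycles unoptimized, 107 optimized, 265 fully serial
assert improvement_pct(180, 107) == 68.22   # wrt unoptimized
assert improvement_pct(265, 107) == 147.66  # wrt serial
```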

Moreover, the improvement is calculated with respect to the original serial processing. The results show that the proposed architecture granted almost three times faster encoding for codes with lifting sizes of 96 and smaller, and almost twice as fast encoding for codes with lifting sizes between 96 and 192. The hardware overhead is expected to be small and comes

mainly because of the shifting network's flexibility, which will be shown in Section 6.2. Therefore, it is expected that the HUE will be significantly increased.

6.2. Throughput-, Latency-, and Hardware-Usage Efficiency Results
The proposed encoder was implemented on the Xilinx ZCU111 development board with the Zynq UltraScale+ RF-SoC device (XCZU28DR). The obtained maximum clock frequency was fmax = 580 MHz. Throughput results for both BG1 and BG2 codes (the entire PCM) are summarized in Figure 13a. Results are shown for all available lifting size values. The information throughput is calculated as in (32). The peak BG1 throughput is 18.51 Gb/s, whereas the peak throughput for BG2 is 14.87 Gb/s. Figure 13a shows a significant improvement of the throughput results for lifting sizes smaller than or equal to 192 when the proposed flexible partially parallel encoder is compared to serial processing. High throughput values are achievable for much smaller lifting size values since CS resources are better used for parallelism increase. The benefits of processing order optimization are also easily noticeable.

Encoder's latency was calculated as the time between the last transfer of the input information bits data and the last output codeword data transfer. The obtained latency is smaller than 1 µs for all codes. Results for all available lifting size values are shown in Figure 13b.

In order to compare the obtained results with the state of the art, three additional 5G NR LDPC encoder cores were implemented as follows: (1) an encoder that incorporates serial CPM processing, whose architecture is presented in this paper but would be a conventional way to encode the 5G NR codes, (2) an encoder that incorporates column-based parallel processing (a method described in [34]), and (3) an encoder that incorporates row-based parallel processing (described in [31,35]). A detailed explanation of all the processing schedules used in the abovementioned architectures is given in Section 4.1.


Figure 13. Performance of the proposed 5G NR LDPC encoder for three different encoding schedules and for various lifting sizes: (a) information throughput and (b) latency.

Encoder’sTo the best latencyof the author was calculateds’ knowledge, as thethe only time architecture between the applied last transfer to 5G NR of theLDPC input informationencoding so bitsfar was data row and parallel the last architecture output codeword [35]. As described data transfer. in Section The 4.1, obtained all rows latency of isthe smaller exponent than matrix 1 µs forareall processed codes.Results in parallel. for allSuch available an approach lifting requires size values the usage are shown of 26 in FigureCSs. In 13 contrastb. to [35], all shifters in the present implementation were flexible (the same as inIn serial order and to column compare parallel obtained architecture) results with, i.e. the, they state supported of the art, all threelifting additional sizes from 5Gthe NR standard. Shifters’ outputs are summed in parallel using a tree-structured XOR summa- LDPC encoder cores were implemented as follows: (1) encoder that incorporates serial tion. The PCM matrix was also stored in the block RAM memories in the same form as for CPM processing whose architecture is presented in this paper but would be a conventional column parallel architecture. However, since a smaller number of shift values is necessary way to encode the 5G NR codes, (2) encoder that incorporates column-based parallel in parallel, the number of used BRAMs is also smaller. Core parity bits were also calcu- processing (a method described in [34]), and (3) encoder that incorporates row-based lated from λ vectors as in former designs. parallel processing (described in [31,35]). A detailed explanation of all the processing Implementation results and comparison are given in Table 3. Given resources are scheduleswithout input used and in theoutput abovementioned buffers. Therefore, architectures utilized BRAMs is given are in only Section used 4.1 for. PCM stor- age. The obtained maximum operating frequency differed depending on the chosen ar- chitecture as shown in Table 3. 
Although they are heavily pipelined, larger designs have smaller maximal operating frequencies due to routing congestion. However, in order to make a better architectural comparison, the information throughput was normalized with the operating frequency as follows: Thr Thrnorm  . (35) fCLK Normalized throughput values are calculated for both BG1 and BG2 and for all lifting sizes. The average normalized throughput values and the average throughputs are shown in Table 3. Finally, the HUE is calculated as throughput, both normalized and achieved, divided by the number of utilized hardware resources. Results show that, on average, the proposed architecture provides by far the highest HUE.



The serial encoder has the same architecture as the proposed encoder. However, all redundant hardware components were removed. The CS was implemented as in [11], whereas blocks for merging and selection of λ values and core parity bits were removed. Additional simplification was made in the storage of PCM matrix data since there was no need for the storage of optimized schedules. As described in Section 4.1, the column-based parallel architecture has been previously used for encoding Wi-Fi LDPC codes [34]. For the sake of comparison, the same architecture was applied to 5G NR LDPC codes. All columns in the exponent matrix are processed in parallel, which implies the usage of 46 CSs and GF(2) accumulation circuits. CSs were the same as for the serial architecture. Due to the pipeline, it was necessary to wait for two clock cycles for core parity bits to be ready, which increased the total number of clock cycles estimated in Section 4.1 to 28 for BG1 and 16 for BG2. This LDPC encoder core also calculates core parity bits using the same hardware as the proposed partially parallel architecture. PCM matrix was stored in the block RAM memories (BRAMs) in the form of eight exponent matrices for the highest lifting size, whereas the shift values for smaller lifting sizes were calculated in logic. It is necessary to load up to 46 shift values in parallel; hence multiple BRAMs were necessary to store these values. To the best of the authors’ knowledge, the only architecture applied to 5G NR LDPC encoding so far was row parallel architecture [35]. As described in Section 4.1, all rows of the exponent matrix are processed in parallel. Such an approach requires the usage of 26 CSs. In contrast to [35], all shifters in the present implementation were flexible (the same as in serial and column parallel architecture), i.e., they supported all lifting sizes from the standard. Shifters’ outputs are summed in parallel using a tree-structured XOR summation. 
The PCM matrix was also stored in the block RAM memories in the same form as for column parallel architecture. However, since a smaller number of shift values is necessary in parallel, the number of used BRAMs is also smaller. Core parity bits were also calculated from λ vectors as in former designs. Implementation results and comparison are given in Table3. Given resources are without input and output buffers. Therefore, utilized BRAMs are only used for PCM storage. The obtained maximum operating frequency differed depending on the chosen architecture as shown in Table3. Although they are heavily pipelined, larger designs have smaller maximal operating frequencies due to routing congestion. However, in order to make a better architectural comparison, the information throughput was normalized with the operating frequency as follows:

Thrnorm = Thr / fCLK. (35)
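Equation (35) and the HUE column of Table 3 can be cross-checked numerically; the values below are taken from the table for the proposed optimized encoder, and the helper names are ours:

```python
def thr_norm(thr_gbps: float, f_clk_mhz: float) -> float:
    """Eq. (35): information bits per clock cycle = (Gb/s) / (MHz) * 1000."""
    return thr_gbps / f_clk_mhz * 1000

# Proposed optimized encoder (Table 3): 5.97 Gbps at 580 MHz on 2215 slices
norm = thr_norm(5.97, 580)   # about 10.3 b/cycle
hue = norm / (2215 / 1000)   # about 4.65 b/cycle/kSlice

assert round(norm, 1) == 10.3
assert round(hue, 2) == 4.65
```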

Normalized throughput values are calculated for both BG1 and BG2 and for all lifting sizes. The average normalized throughput values and the average throughputs are shown in Table 3. Finally, the HUE is calculated as throughput, both normalized and achieved, divided by the number of utilized hardware resources. Results show that, on average, the proposed architecture provides by far the highest HUE.

Finally, Figure 14 shows the normalized HUE depending on the lifting size for all the above architectures. It is obvious that highly parallel architectures are very inefficient when compared to the serial and proposed partially parallel architectures. Considering the fact that the latency of the serial processing architecture and the proposed partially parallel architecture is lower than 1 µs for all codeword lengths, and the fact that the air interface requirement for the 5G ultra-reliable low-latency communication (URLLC) use case is 1 ms [47], which is 1000 times higher, there is no remaining reason to use highly parallel architectures for 5G LDPC encoding. By placing multiple encoders to work in parallel, the proposed architecture can be used to obtain a throughput that is even higher than that of the highly parallel architectures but with much fewer hardware resources.

Table 3. Implementation results for various architectures of the 5G NR LDPC encoder and average throughput and hardware usage efficiency comparison.

| Architecture | Slices | LUTs | FFs | 36k BRAMs | Thr_norm,avg (b/cycle) | f_max (MHz) | Thr_avg (Gbps) | HUE (b/cycle/kSlice) | HUE (b/cycle/kLUT) | HUE (b/cycle/kFF) | HUE (Gbps/kSlice) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Column-based parallel [34] | 42,257 | 248,641 | 246,960 | 6 | 61.9 | 330 | 20.43 | 1.46 | 0.25 | 0.25 | 0.48 |
| Row-based parallel [31,35] | 26,180 | 126,986 | 131,185 | 3.5 | 30.1 | 470 | 14.15 | 1.15 | 0.24 | 0.23 | 0.54 |
| Serial | 1729 | 8712 | 9975 | 3 | 6.6 | 580 | 3.83 | 3.82 | 0.76 | 0.66 | 2.22 |
| Proposed optimized | 2215 | 11,156 | 9635 | 5 | 10.3 | 580 | 5.97 | 4.65 | 0.92 | 1.07 | 2.70 |

Figure 14. Normalized hardware usage efficiency for various 5G NR LDPC encoder architectures depending on the lifting size (left panel: BG 1; right panel: BG 2).

7. Conclusions

In this paper, a novel architecture for 5G NR LDPC encoding is presented. The designed encoder can be configured at runtime to encode all LDPC codes from the 5G NR standard. The architecture is based on a serial encoding schedule that processes one CPM of the PCM per clock cycle. However, it incorporates a novel approach of reusing the available hardware resources to process multiple CPMs per clock cycle for codes with shorter codewords. This is achieved by the innovative design of the circular shifting network, which can be configured to work as a single shifter for long input data vectors or as multiple independent CSs for shorter inputs. The proposed CS architecture is also applicable in other hardware designs for both encoding and decoding of LDPC codes when flexibility is a requirement.

In addition to the architectural contributions, the paper also presented a method for offline encoding schedule optimization when multiple CPMs are processed per clock cycle. The GA-based optimization provided a great improvement in the encoding throughput (up to 183%) and latency (up to 92%). The implemented encoder provided multi-gigabit throughput and showed a significant improvement in hardware usage efficiency when compared with other architectures used for encoding of QC–LDPC codes. On average, the proposed encoder implementation has five times higher HUE than highly parallel implementations and about 22% higher HUE than the serial implementation. The obtained throughput and latency match the required specifications for all currently proposed 5G use cases. The flexibility of the proposed encoder makes it applicable in any of those scenarios.

Some contributions described in the paper can be applied in encoders or decoders for other QC–LDPC codes. The GA-based optimization can be performed for any architecture

that requires the processing of multiple rows for a possible throughput increase. In addition, the proposed cyclic shifter design can be used in any encoder or decoder design that needs multiple shifting modes. A good example would be the encoder for WiMAX LDPC codes, whose lifting sizes vary from 24 to 96.
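The behavior of such a multi-mode shifting network can be illustrated with a small software model. The sketch below is only a behavioral reference under assumed conventions (function names are illustrative, not from the paper; the hardware realization is a configurable shifting network): the same W-bit datapath either rotates one vector of length z = W, or is split into W / z independent circular shifters, one CPM shift value per sub-vector.

```python
# Behavioral model of a multi-mode circular shifter (illustrative names).

def rotate(bits: list, shift: int) -> list:
    """Circularly shift a vector left by `shift` positions."""
    shift %= len(bits)
    return bits[shift:] + bits[:shift]

def multi_mode_shift(bits: list, z: int, shifts: list) -> list:
    """Treat `bits` as len(bits) // z independent sub-vectors of length z
    and rotate the i-th sub-vector by shifts[i] (one CPM per sub-vector).
    With z == len(bits) this degenerates to a single large shifter."""
    assert len(bits) % z == 0 and len(shifts) == len(bits) // z
    out = []
    for i, s in enumerate(shifts):
        out += rotate(bits[i * z:(i + 1) * z], s)
    return out

# Single-shifter mode: one vector of length 8, shifted by 3.
print(multi_mode_shift(list(range(8)), z=8, shifts=[3]))
# prints [3, 4, 5, 6, 7, 0, 1, 2]

# Split mode: two independent length-4 shifters with different shift values.
print(multi_mode_shift(list(range(8)), z=4, shifts=[1, 2]))
# prints [1, 2, 3, 0, 6, 7, 4, 5]
```

In hardware, the split mode is what allows several small-lifting-size CPMs to be processed in one clock cycle, keeping the datapath busy for short codes.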

Author Contributions: Conceptualization, V.L.P., D.M.E.M., and A.R.; methodology, V.L.P. and D.M.E.M.; validation, V.L.P.; formal analysis, V.L.P. and D.M.E.M.; investigation, V.L.P.; resources, A.R.; writing—original draft preparation, V.L.P., D.M.E.M., and A.R.; visualization, V.L.P.; software, V.L.P. and D.M.E.M.; supervision, A.R.; project administration, A.R.; funding acquisition, A.R. and V.L.P. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Acknowledgments: The authors are thankful to Miloš Marković from Tannera LLC and to Lazar Saranovac, Srđan Brkić, Đorđe Sarač, and Predrag Ivaniš from the University of Belgrade, School of Electrical Engineering, for valuable comments and discussions.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. 3rd Generation Partnership Project. Technical Specification Group Radio Access Network; NR; Multiplexing and Channel Coding (Release 16), 3GPP TS 38.212 V16.5.0 (2021-03); 3GPP: Valbonne, France, 2021.
2. Gallager, R. Low-density parity-check codes. IRE Trans. Inform. Theory 1962, 8, 21–28. [CrossRef]
3. Berrou, C.; Glavieux, A.; Thitimajshima, P. Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1. In Proceedings of the IEEE International Conference on Communications (ICC '93), Geneva, Switzerland, 23–26 May 1993. [CrossRef]
4. Barzegar, H.R.; Reggiani, L. Channel Coding for Multi-Carrier Wireless Partial Duplex. Wirel. Commun. Mob. Comput. 2019, 2019, 1–13. [CrossRef]
5. Richardson, T.J.; Kudekar, S. Design of low-density parity check codes for 5G new radio. IEEE Commun. Mag. 2018, 56, 28–34. [CrossRef]
6. Gallager, R.G. Analysis of Number of Independent Decoding Iterations. In Low-Density Parity-Check Codes; MIT Press: Cambridge, MA, USA, 1963; pp. 81–88.
7. Tanner, R.M.; Sridhara, D.; Sridharan, A.; Fuja, T.E.; Costello, D.J. LDPC block and convolutional codes based on circulant matrices. IEEE Trans. Inform. Theory 2004, 50, 2966–2984. [CrossRef]
8. Fossorier, M.P.C. Quasi-Cyclic Low-Density Parity-Check Codes from Circulant Permutation Matrices. IEEE Trans. Inform. Theory 2004, 50, 1788–1793. [CrossRef]
9. Li, H.; Bai, B.; Mu, X.; Zhang, J.; Xu, H. Algebra-Assisted Construction of Quasi-Cyclic LDPC Codes for 5G New Radio. IEEE Access 2018, 6, 50229–50244. [CrossRef]
10. Richardson, T.; Shokrollahi, M.; Urbanke, R. Design of capacity approaching irregular low-density parity-check codes. IEEE Trans. Inform. Theory 2001, 47, 619–637. [CrossRef]
11. Petrović, V.L.; Marković, M.M.; El Mezeni, D.M.; Saranovac, L.V.; Radošević, A. Flexible High Throughput QC-LDPC Decoder with Perfect Pipeline Conflicts Resolution and Efficient Hardware Utilization. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 67, 5454–5467. [CrossRef]
12. Andrews, K.; Dolinar, S.; Thorpe, J. Encoders for block-circulant LDPC codes. In Proceedings of the International Symposium on Information Theory (ISIT), Adelaide, Australia, 4–9 September 2005; pp. 2300–2304. [CrossRef]
13. Li, Z.; Chen, L.; Zeng, L.; Lin, S.; Fong, W. Efficient Encoding of Quasi-cyclic Low-density Parity-check Codes. IEEE Trans. Commun. 2006, 54, 71–81. [CrossRef]
14. Yasotharan, H.; Carusone, A.C. A Flexible Hardware Encoder for Systematic Low-density Parity-check Codes. In Proceedings of the 52nd IEEE International Midwest Symposium on Circuits and Systems, Cancun, Mexico, 2–5 August 2009; pp. 54–57. [CrossRef]
15. Theodoropoulos, D.; Kranitis, N.; Paschalis, A. An efficient LDPC encoder architecture for space applications. In Proceedings of the 22nd International Symposium on On-Line Testing and Robust System Design (IOLTS), Sant Feliu de Guixols, Spain, 4–6 July 2016; pp. 149–154. [CrossRef]
16. Chen, D.; Chen, P.; Fang, Y. Low-complexity high-performance low density parity-check encoder design for China digital radio standard. IEEE Access 2017, 5, 20880–20886. [CrossRef]
17. Mahdi, A.; Paliouras, V. Simplified Multi-Level Quasi-Cyclic LDPC Codes for Low-Complexity Encoders. In Proceedings of the IEEE Workshop on Signal Processing Systems, Quebec City, QC, Canada, 17–19 October 2012. [CrossRef]
18. Mahdi, A.; Paliouras, V. A Low Complexity-High Throughput QC-LDPC Encoder. IEEE Trans. Signal Process. 2014, 62, 2696–2708. [CrossRef]
19. Mahdi, A.; Kanistras, N.; Paliouras, V. A Multirate Fully Parallel LDPC Encoder for the IEEE 802.11n/ac/ax QC-LDPC Codes Based on Reduced Complexity XOR Trees. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 51–64. [CrossRef]
20. Richardson, T.J.; Urbanke, R.L. Efficient Encoding of Low-density Parity-check Codes. IEEE Trans. Inf. Theory 2001, 47, 638–656. [CrossRef]
21. Eleftheriou, E.; Olcer, S. Low-density parity-check codes for digital subscriber lines. In Proceedings of the IEEE International Conference on Communications (ICC), New York, NY, USA, 28 April–2 May 2002; pp. 1752–1757. [CrossRef]
22. Zhang, T.; Parhi, K. Joint (3, k)-regular LDPC code and decoder/encoder design. IEEE Trans. Signal Process. 2004, 52, 1065–1079. [CrossRef]
23. Zhong, H.; Zhang, T. Block-LDPC: A practical LDPC coding system design approach. IEEE Trans. Circuits Syst. I Reg. Pap. 2005, 52, 766–775. [CrossRef]
24. Zhang, H.; Zhu, J.; Shi, H.; Wang, D. Layered approx-regular LDPC: Code construction and encoder/decoder design. IEEE Trans. Circuits Syst. I Reg. Pap. 2008, 55, 572–585. [CrossRef]
25. Kim, J.K.; Yoo, H.; Lee, M.H. Efficient encoding architecture for IEEE 802.16e LDPC codes. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2008, E91-A, 3607–3611. [CrossRef]
26. Zhang, P.; Jiang, Q.; Du, S.; Liu, C. Fast encoding of quasi-cyclic low-density parity-check codes in IEEE 802.15.3c. Electron. Lett. 2015, 51, 1713–1715. [CrossRef]
27. Wang, X.; Ge, T.; Li, J.; Su, C.; Hong, F. Efficient multi-rate encoder of QC-LDPC codes based on FPGA for WIMAX standard. Chin. J. Electron. 2017, 26, 250–255. [CrossRef]
28. Cohen, A.E.; Parhi, K.K. A Low-Complexity Hybrid LDPC Code Encoder for IEEE 802.3an (10GBase-T) Ethernet. IEEE Trans. Signal Process. 2009, 57, 4085–4094. [CrossRef]
29. Theodoropoulos, D.; Kranitis, N.; Tsigkanos, A.; Paschalis, A. Efficient architectures for multigigabit CCSDS LDPC encoders. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2020, 28, 1118–1127. [CrossRef]
30. Sun, Y.; Karkooti, M.; Cavallaro, J.R. High Throughput, Parallel, Scalable LDPC Encoder/Decoder Architecture for OFDM Systems. In Proceedings of the IEEE Dallas/CAS Workshop on Design, Applications, Integration and Software, Richardson, TX, USA, 29–30 October 2006; pp. 39–42. [CrossRef]
31. Cai, Z.; Hao, J.; Tan, P.; Sun, S.; Chin, P. Efficient encoding of IEEE 802.11n LDPC codes. Electron. Lett. 2006, 42, 1471–1472. [CrossRef]
32. Perez, J.; Fernandez, V. Low-cost encoding of IEEE 802.11n. Electron. Lett. 2008, 44, 307–308. [CrossRef]
33. Chen, Y.H.; Hsiao, J.H.; Siao, Z.Y. Wi-Fi LDPC encoder with approximate lower triangular diverse implementation and verification. In Proceedings of the IEEE 11th International Multi-Conference on Systems, Signals & Devices (SSD14), Barcelona, Spain, 11–14 February 2014. [CrossRef]
34. Jung, Y.; Chung, C.; Jung, Y.; Kim, J. 7.7 Gbps Encoder Design for IEEE 802.11ac QC-LDPC Codes. J. Semicond. Technol. Sci. 2014, 14, 419–425. [CrossRef]
35. Nguyen, T.T.B.; Nguyen Tan, T.; Lee, H. Efficient QC-LDPC Encoder for 5G New Radio. Electronics 2019, 8, 668. [CrossRef]
36. Marković, D.; Brodersen, R. Energy and Delay Models. In DSP Architecture Design Essentials; Springer: New York, NY, USA, 2012; pp. 11–27.
37. IEEE. Standard for Information Technology—Local and Metropolitan Area Networks—Specific Requirements—Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications; IEEE Standard 802.11-2016; IEEE: New York, NY, USA, 2016.
38. IEEE. Standard for Local and Metropolitan Area Networks—Part 16: Air Interface for Fixed Broadband Wireless Access Systems; IEEE Standard 802.16-2004; IEEE: New York, NY, USA, 2004.
39. Digital Video Broadcasting (DVB). Second Generation Framing Structure, Channel Coding and Modulation Systems for Broadcasting, Interactive Services, News Gathering and other Broadband Satellite Applications; Part 2: DVB-S2 Extensions (DVB-S2X), ETSI EN 302 307-2 V1.1.1 (2014-10); ETSI: Sophia Antipolis, France, 2014.
40. Petrović, V.L.; El Mezeni, D.M. Reduced-Complexity Offset Min-Sum Based Layered Decoding for 5G LDPC Codes. In Proceedings of the 28th Telecommunications Forum (TELFOR), Belgrade, Serbia, 24–25 November 2020; pp. 109–112. [CrossRef]
41. Hocevar, D.E. A reduced complexity decoder architecture via layered decoding of LDPC codes. In Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS), Austin, TX, USA, 13–15 October 2004; pp. 107–112. [CrossRef]
42. Larranaga, P.; Kuijpers, C.M.H.; Murga, R.H.; Inza, I.; Dizdarevic, S. Genetic algorithms for the travelling salesman problem: A review of representations and operators. Artif. Intell. Rev. 1999, 13, 129–170. [CrossRef]
43. Marchand, C.; Dore, J.; Conde-Canencia, L.; Boutillon, E. Conflict resolution for pipelined layered LDPC decoders. In Proceedings of the IEEE Workshop on Signal Processing Systems, Tampere, Finland, 7–9 October 2009; pp. 220–225. [CrossRef]
44. Kang, H.-J.; Yang, B.-D. Low-complexity, high-speed multi-size cyclic-shifter for quasi-cyclic LDPC decoder. Electron. Lett. 2018, 54, 452–454. [CrossRef]
45. Chen, X.; Lin, S.; Akella, V. QSN-A simple circular shift network for reconfigurable quasi-cyclic LDPC decoders. IEEE Trans. Circuits Syst. II Express Briefs 2010, 57, 1549–7747. [CrossRef]
46. Jung, Y.; Jung, Y.; Lee, S.; Kim, J. Low-complexity multi-way and reconfigurable cyclic shift network of QC-LDPC decoder for WiFi/WIMAX applications. IEEE Trans. Consum. Electron. 2013, 59, 467–475. [CrossRef]
47. Siddiqi, M.A.; Yu, H.; Joung, J. 5G Ultra-Reliable Low-Latency Communication Implementation Challenges and Operational Issues with IoT Devices. Electronics 2019, 8, 981. [CrossRef]