Tamkang Journal of Science and Engineering, Vol. 10, No. 3, pp. 253-264 (2007)

Finite Field Polynomial Multiplier with Linear Feedback Shift Register

Che-Wun Chiou¹*, Chiou-Yng Lee² and Jim-Min Lin³

¹ Department of Computer Science and Information Engineering, Ching Yun University, Chung-Li, Taiwan 320, R.O.C.
² Department of Computer Information and Network Engineering, Lung Hwa University of Science & Technology, Taoyuan, Taiwan 333, R.O.C.
³ Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan 407, R.O.C.

Abstract

We present a one-dimensional polynomial array multiplier for performing multiplications in $GF(2^m)$. A linear feedback shift register is employed in the proposed multiplier to reduce space complexity. Compared with existing two-dimensional polynomial basis multipliers, the proposed linear array multiplier reduces the space complexity from $O(m^2)$ to $O(m)$. A new two-dimensional systolic array version of the proposed array multiplier is also included in this paper. The proposed two-dimensional systolic array multiplier saves about 30% of space complexity and 27% of time complexity compared with other two-dimensional systolic array multipliers.

Key Words: Finite Field, Multiplication, Polynomial Basis, Systolic Array, Cryptography

1. Introduction

Arithmetic operations in a finite field play an increasingly important role in error-correcting codes [1], cryptography [2], digital signal processing [3,4], and pseudorandom number generation [5]. The two principal arithmetic operations over finite fields are addition and multiplication. Addition is simple, whereas multiplication requires more computational time and higher circuit complexity. Many other complex arithmetic operations, such as exponentiation, division, and multiplicative inversion, can be performed by applying multiplication repeatedly. Hence, it is important in a practical sense to develop fast multiplication algorithms for these complex arithmetic operations. In recent years, the realization of multiplication in finite fields has received wide attention, and several approaches have been presented [6-36]. The complexity of implementing multiplication depends on the representation of the field elements. There are three main types of bases over $GF(2^m)$ fields, namely, normal basis (NB), dual basis (DB), and polynomial basis (PB). The major advantage of the NB multipliers [6-8] is that the squaring of an element can be computed simply by a cyclic shift of the binary representation. Thus, normal basis multipliers can be applied very effectively to inversion, squaring, and exponentiation. The DB multipliers [9-13] require less chip area than the other two types. However, the former two multipliers need basis conversion, while the latter type does not [36]. The polynomial basis representation has been widely used and leads to many efficient implementations of multipliers. Compared with multipliers in the other two bases, polynomial basis multipliers have lower design complexity, and their sizes can easily be extended to the desired scale to meet various applications owing to their simplicity, regularity, and modularity in architecture.

*Corresponding author. E-mail: [email protected]

Numerous architectures for PB multipliers have been presented [14-35]. The first parallel PB multiplier was suggested by Bartee and Schneider [14]. The PB multiplication operation for $GF(2^m)$ is often accomplished in two steps: polynomial multiplication and modular reduction. In practice, both steps are usually combined for performance reasons. Mastrovito [15,16] first proposed an architecture for performing such combined operations. Recently, several bit-parallel PB multipliers have been proposed for VLSI implementation by using some specific classes of polynomials, such as trinomials [17-23], all-one polynomials (AOP) and equally spaced polynomials (ESP) [24-26], and composite fields [27,28]. Yet these architectures still have certain shortcomings as regards cryptographic applications due to their high circuit complexity and long latency. As the size of the finite field grows, the design of modular multipliers requires much more attention. To alleviate the long latency problem, most existing PB multipliers employ XOR trees to minimize time complexity. Unfortunately, these circuits are not suitable for VLSI systems, due to the irregular and non-modular structure of XOR trees. To overcome this problem, Lee [22] has proposed a regular and modular PB multiplier using irreducible trinomials with the space complexity of $O(m^2)$ and the time complexity of $O(m)$. This multiplier can be easily extended and implemented using VLSI technologies.

In this article, we present a linear parallel-in parallel-out PB array multiplier using general irreducible polynomials with a linear feedback shift register. The proposed PB multiplier requires the space complexity of $O(m)$. In order to demonstrate that our proposed multiplier is superior to other existing two-dimensional systolic array multipliers, a new two-dimensional systolic array version of the multiplier is also presented. We will show that the proposed two-dimensional systolic array multiplier also saves both space and time complexities compared with other existing two-dimensional systolic array multipliers.

The organization of this paper is as follows. In Section 2, we provide some basic definitions and preliminaries. In Section 3, we derive the one-dimensional parallel-in parallel-out PB multiplication algorithm using general irreducible polynomials and a linear feedback shift register. The two-dimensional systolic array version of the proposed algorithm is then described in Section 4. The space and time complexities are discussed in Section 5. Finally, a brief conclusion is given in Section 6.

2. Preliminaries

It is assumed that the reader is familiar with the basic concepts of finite fields. The properties of finite fields are covered in detail in [1,2] and are reviewed briefly as required in the following paragraphs.

The finite field $GF(2^m)$ can be viewed as a vector space of dimension $m$ over $GF(2)$. Suppose that the finite field $GF(2^m)$ is generated by the irreducible polynomial $P(x) = p_0 + p_1x + \dots + p_{m-1}x^{m-1} + x^m$ of degree $m$ over $GF(2)$, where $p_0 = 1$. Then any element $A$ in the Galois field $GF(2^m)$ can be represented as $A(x) = a_0 + a_1x + a_2x^2 + \dots + a_{m-1}x^{m-1}$, where $x$ is an indeterminate over $GF(2)$. The basis $\{1, x, x^2, \dots, x^{m-1}\}$ is known as the standard basis and is often referred to as the polynomial basis, conventional basis, or canonical basis. Since $P(x) = 0$, the relation $x^m = p_0 + p_1x + \dots + p_{m-1}x^{m-1}$ can be used to reduce a high-order term $x^p$, $p \ge m$, to a polynomial of degree less than $m$. Thus, $xB(x) \bmod P(x)$ can be reduced by

$$
\begin{aligned}
xB(x) \bmod P(x) &= \left(b_0x + b_1x^2 + \dots + b_{m-1}x^m\right) \bmod P(x) \\
&= b_{m-1}p_0 + (b_{m-1}p_1 + b_0)x + \dots + (b_{m-1}p_{m-1} + b_{m-2})x^{m-1}.
\end{aligned}
$$

Let

$$B(x)^{(1)} = xB(x) \bmod P(x). \tag{1}$$

Therefore, $x^iB(x) \bmod P(x)$ can be obtained by the following formula:

$$B(x)^{(i)} = xB(x)^{(i-1)} \bmod P(x). \tag{2}$$

Note that $B(x)^{(0)} = B(x)$.

Let the PB representation of $B(x)^{(i)}$ be

$$B(x)^{(i)} = b_{i,0} + b_{i,1}x + b_{i,2}x^2 + b_{i,3}x^3 + \dots + b_{i,m-1}x^{m-1},$$

where $b_{i,j} \in \{0,1\}$ for $0 \le j \le m-1$.

According to $x^m = p_0 + p_1x + p_2x^2 + \dots + p_{m-1}x^{m-1}$, the relation between $B(X)^{(i+1)}$ and $B(X)^{(i)}$ is as follows:

$$
\begin{aligned}
B(X)^{(i+1)} &= b_{i+1,0} + b_{i+1,1}x + b_{i+1,2}x^2 + \dots + b_{i+1,m-1}x^{m-1} \\
&= xB(X)^{(i)} \\
&= x\left(b_{i,0} + b_{i,1}x + b_{i,2}x^2 + b_{i,3}x^3 + \dots + b_{i,m-1}x^{m-1}\right) \\
&= b_{i,0}x + b_{i,1}x^2 + b_{i,2}x^3 + b_{i,3}x^4 + \dots + b_{i,m-2}x^{m-1} + b_{i,m-1}x^m \\
&= b_{i,0}x + b_{i,1}x^2 + b_{i,2}x^3 + b_{i,3}x^4 + \dots + b_{i,m-2}x^{m-1} + b_{i,m-1}\left(p_0 + p_1x + p_2x^2 + \dots + p_{m-1}x^{m-1}\right) \\
&= b_{i,m-1}p_0 + (b_{i,0} + b_{i,m-1}p_1)x + (b_{i,1} + b_{i,m-1}p_2)x^2 + \dots + (b_{i,m-2} + b_{i,m-1}p_{m-1})x^{m-1}
\end{aligned}
\tag{3}
$$

3. The Proposed Multiplier with Linear Feedback Shift Register

Let $A(x)$ and $B(x)$ be any two elements in $GF(2^m)$, and let the element $C(x)$ be the product of $A(x)$ and $B(x)$ in $GF(2^m)$, i.e., $C(x) = A(x) \times B(x) \bmod P(x)$. Following Horner's rule, the product $C(x) = A(x) \times B(x) \bmod P(x)$ can be obtained as:

$$
\begin{aligned}
C(x) &= A(x) \times B(x) \bmod P(x) \\
&= a_0B(x) + a_1xB(x) + a_2x^2B(x) + a_3x^3B(x) + \dots + a_{m-1}x^{m-1}B(x) \\
&= a_0B(x) + a_1(xB(x)) + a_2(x(xB(x))) + a_3(x(x^2B(x))) + \dots + a_{m-1}(x(x^{m-2}B(x))) \\
&= a_0B(x)^{(0)} + a_1B(x)^{(1)} + a_2B(x)^{(2)} + a_3B(x)^{(3)} + \dots + a_{m-1}B(x)^{(m-1)}
\end{aligned}
\tag{4}
$$

Let us define a linear feedback shift register S with $m$ bits (in binary representation: $S_{m-1}S_{m-2} \dots S_1S_0$) for realizing Eq. (4) as follows:

$$S_0(i+1) = S_{m-1}(i),$$

and for $1 \le j \le m-1$,

$$S_j(i+1) = S_{j-1}(i) \quad \text{if } p_j = 0,$$

$$S_j(i+1) = S_{j-1}(i) \oplus S_{m-1}(i) \quad \text{if } p_j = 1,$$

where $S_j(i)$ denotes the content of bit $j$ of S (i.e., $S_j$) at clock cycle $i$. Based on Eq. (2), each coefficient of $C(X)$ is computed as follows:

$$c_j = \sum_{i=0}^{m-1} a_ib_{i,j}, \quad \text{for } 0 \le j \le m-1,$$

or

$$c_j = \sum_{i=0}^{m-1} a_iS_j(i), \quad \text{for } 0 \le j \le m-1.$$

The cell $U_j$ is responsible for accumulating the coefficient $c_j$ ($0 \le j \le m-1$). Another shift register E with $m$ bits, $E_{m-1}E_{m-2} \dots E_1E_0$, is used for storing and rotating $A(X)$ and is defined as follows:

$$E_{m-1}(i+1) = E_0(i), \quad \text{and} \quad E_j(i+1) = E_{j+1}(i) \quad \text{for } 0 \le j \le m-2.$$

The notation $E_j(i)$ denotes the value of the bit $E_j$ of the register E at clock cycle $i$. Both registers S and E are initially loaded in parallel with $B(X)$ and $A(X)$ in the following manner:

$$S_j(0) = b_j, \quad \text{and} \quad E_j(0) = a_j, \quad \text{for } 0 \le j \le m-1.$$

The following example is used to describe the hardware implementation of the proposed linear array multiplier structure.

Example 1:

An example with the irreducible polynomial $P(X) = 1 + x + x^3 + x^4 + x^8$ is given here to describe the hardware implementation of the proposed array multiplier. The hardware implementation is shown in Figure 1. The linear feedback shift register S with 8 bits is right shifted one bit for each clock. The shift register E with 8 bits is rotated clockwise one bit for each clock. Their functions are defined as follows:

(1) S

$S_0(i+1) = S_7(i)$,
$S_1(i+1) = S_0(i) \oplus S_7(i)$,
$S_2(i+1) = S_1(i)$,
$S_3(i+1) = S_2(i) \oplus S_7(i)$,
$S_4(i+1) = S_3(i) \oplus S_7(i)$,
$S_5(i+1) = S_4(i)$,
$S_6(i+1) = S_5(i)$,
$S_7(i+1) = S_6(i)$.

(2) E

$E_7(i+1) = E_0(i)$, and for $0 \le j \le 6$,
$E_j(i+1) = E_{j+1}(i)$.
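As a minimal sketch of the recurrence in Eqs. (1)-(3) and of the register S in Example 1, the following Python fragment models one clock of the linear feedback shift register. The function name `lfsr_step`, the constant `P_EXAMPLE`, and the list-of-bits representation (index $j$ holds $S_j$) are illustrative assumptions and are not part of the original design.

```python
# Illustrative sketch; all names here are hypothetical, not taken from the paper.
# p holds the low coefficients p_0..p_{m-1} of P(x); the leading x^m term is implicit.
P_EXAMPLE = [1, 1, 0, 1, 1, 0, 0, 0]  # P(x) = 1 + x + x^3 + x^4 + x^8


def lfsr_step(s, p):
    """One clock of register S: returns the coefficients of x*B(x) mod P(x)."""
    m = len(s)
    fb = s[m - 1]                           # feedback bit S_{m-1}(i)
    nxt = [fb & p[0]]                       # S_0(i+1) = S_{m-1}(i), since p_0 = 1
    for j in range(1, m):
        nxt.append(s[j - 1] ^ (fb & p[j]))  # feedback is XORed in only where p_j = 1
    return nxt


# B(x) = x^7: one shift must give x^8 mod P(x) = 1 + x + x^3 + x^4.
b = [0, 0, 0, 0, 0, 0, 0, 1]
print(lfsr_step(b, P_EXAMPLE))  # [1, 1, 0, 1, 1, 0, 0, 0]
```

With the feedback bit set, the taps at positions 1, 3, and 4 receive an XOR, which matches the update rules listed for $S_1$, $S_3$, and $S_4$ in Example 1.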

Figure 1. Hardware implementation of $C(X) = A(X)B(X) \bmod (1 + X + X^3 + X^4 + X^8)$.

The shift registers S and E are initially loaded with $B(X)$ and $A(X)$ as follows:

$$S_j(0) = b_j \quad \text{and} \quad E_j(0) = a_j \quad \text{for } 0 \le j \le 7.$$

The detailed circuit of each cell $U_j$ is shown in Figure 2. The cell $U_j$ realizes the following function:

$$v_{out} = h_{in} \cdot v1_{in} \oplus v2_{in}, \quad \text{and} \quad h_{out} = h_{in}.$$

The output $v_{out}$ is computed and then the result is latched in the 1-bit latch for each clock. All 1-bit latches in U cells are initially reset to 0s. The symbol L in Figure 2 represents a 1-bit latch. The detailed circuits for cells $S_j$ and $E_j$ can be found in Figure 3 and Figure 4, respectively. The shift registers S and E can be loaded in parallel. The procedure for computing $C(X) = A(X)B(X)$ in Figure 1 is described in the Appendix.

Figure 2. The detailed circuit of the cell $U_j$.

Figure 3. The detailed circuit of the cell $S_j$.
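The data path built from the registers S and E and the cells $U_j$ can be sketched in software as follows. This is only a behavioral model under stated assumptions: the function names `multiply_lfsr` and `multiply_reference`, the list-of-bits encoding, and the sample operands are hypothetical and chosen for illustration. The second function is a plain schoolbook multiply-and-reduce used only as a cross-check.

```python
# Illustrative behavioral model; all names are hypothetical, not from the paper.
def multiply_lfsr(a, b, p):
    """Return the coefficients of A(x)*B(x) mod P(x).

    a, b, p are lists of m bits; index j holds a_j, b_j, p_j (x^m of P is implicit).
    """
    m = len(p)
    s = list(b)          # register S, loaded with B(x)
    e = list(a)          # register E, loaded with A(x)
    u = [0] * m          # latches of cells U_0..U_{m-1}, reset to 0
    for _ in range(m):   # one iteration per clock cycle
        h = e[0]                                    # E_0(i) = a_i feeds every U cell
        u = [u[j] ^ (h & s[j]) for j in range(m)]   # vout = hin * v1in XOR v2in
        fb = s[m - 1]
        s = [fb & p[0]] + [s[j - 1] ^ (fb & p[j]) for j in range(1, m)]  # S <- x*S mod P
        e = e[1:] + e[:1]                           # E_j <- E_{j+1}, E_{m-1} <- E_0
    return u


def multiply_reference(a, b, p):
    """Schoolbook product, then reduction with x^m = p_0 + ... + p_{m-1}x^{m-1}."""
    m = len(p)
    prod = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            prod[i + j] ^= a[i] & b[j]
    for k in range(2 * m - 2, m - 1, -1):           # clear x^k for k >= m, highest first
        if prod[k]:
            prod[k] = 0
            for j in range(m):
                prod[k - m + j] ^= p[j]
    return prod[:m]


P8 = [1, 1, 0, 1, 1, 0, 0, 0]                       # 1 + x + x^3 + x^4 + x^8
A = [1, 1, 1, 0, 1, 0, 1, 0]                        # 1 + x + x^2 + x^4 + x^6
B = [1, 1, 0, 0, 0, 0, 0, 1]                        # 1 + x + x^7
print(multiply_lfsr(A, B, P8) == multiply_reference(A, B, P8))  # True
```

The loop body mirrors one clock cycle: every cell $U_j$ accumulates $a_i S_j(i)$, then S advances by one multiplication by $x$ and E rotates so that $a_{i+1}$ appears at $E_0$.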

Figure 4. The detailed circuit of the cell $E_j$.

4. Implementation with Semi-Systolic Two-Dimensional Array

A semi-systolic two-dimensional systolic array implementation of the proposed array multiplier structure is discussed in this section. As mentioned above, the results in Eqs. (2) and (4) are rewritten as follows:

$$C(X) = a_0B(X)^{(0)} + a_1B(X)^{(1)} + a_2B(X)^{(2)} + a_3B(X)^{(3)} + \dots + a_{m-1}B(X)^{(m-1)},$$

and $B(X)^{(i)}$ for $0 \le i \le m-1$ is represented by

$$B(X)^{(i)} = b_{i,0} + b_{i,1}x + b_{i,2}x^2 + b_{i,3}x^3 + \dots + b_{i,m-1}x^{m-1},$$

where $b_{i,j} \in \{0,1\}$ for $0 \le j \le m-1$.

The initial value of $B(X)^{(i)}$ is assigned as follows:

$$B(X)^{(0)} = B(X),$$

and the relation between the coefficients of $B(X)^{(i+1)}$ and $B(X)^{(i)}$ is as follows:

$$b_{i+1,0} = b_{i,m-1}p_0, \qquad b_{i+1,j} = b_{i,j-1} + b_{i,m-1}p_j \quad \text{for } 1 \le j \le m-1. \tag{5}$$

Each coefficient of the product $C(X) = \sum_{j=0}^{m-1} c_jx^j$ is computed as follows:

$$c_j = \sum_{i=0}^{m-1} a_ib_{i,j}, \quad \text{for } 0 \le j \le m-1,$$

or

$$
\begin{aligned}
c_j &= a_0b_{0,j} + a_1b_{1,j} + a_2b_{2,j} + a_3b_{3,j} + a_4b_{4,j} + \dots + a_{m-1}b_{m-1,j} \\
&= a_0b_{0,j} + a_1(b_{0,j-1} + b_{0,m-1}p_j) + a_2(b_{1,j-1} + b_{1,m-1}p_j) + a_3(b_{2,j-1} + b_{2,m-1}p_j) + \dots + a_{m-1}(b_{m-2,j-1} + b_{m-2,m-1}p_j) \\
&= a_0b_{0,j} + (a_1b_{0,j-1} + a_1b_{0,m-1}p_j) + (a_2b_{1,j-1} + a_2b_{1,m-1}p_j) + (a_3b_{2,j-1} + a_3b_{2,m-1}p_j) + \dots + (a_{m-1}b_{m-2,j-1} + a_{m-1}b_{m-2,m-1}p_j)
\end{aligned}
\tag{6}
$$

The following algorithm can be used for computing the coefficient $c_j$ based on Eq. (6).

Algorithm A: (Using traditional method)

    c_j := 0;
    b_{-1,j-1} := b_j;
    b_{-1,m-1} := 0;
    For i = 0 to m-1
    Begin
        c_j := c_j + a_i b_{i-1,j-1};
        c_j := c_j + a_i b_{i-1,m-1} p_j;
    End

If Algorithm A is realized with a hardware circuit, a propagation delay of one AND gate delay and two XOR gate delays is needed. To shorten this propagation delay, a parallel version of Algorithm A, Algorithm B, is given as follows.

Algorithm B: (Using parallel method)

    z_{-1} := b_j;
    b_{-1,m-1} := b_{m-1};
    For i = 0 to m-1
    Begin
        Cobegin
            z_i := z_{i-1} + b_{i-1,m-1} p_j;
            c_j := c_j + a_i z_{i-1};
        Coend
    End
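The two-dimensional formulation of Eqs. (5) and (6) can be illustrated with a short software sketch that first builds the full $m \times m$ table of coefficients $b_{i,j}$ and then forms each $c_j$ as a column sum. The names `b_table` and `multiply_2d` and the small $GF(2^3)$ example are assumptions made for illustration; the sketch models only the arithmetic, not the timing or the cell structure of the array.

```python
# Illustrative sketch of Eqs. (5)-(6); function and variable names are hypothetical.
def b_table(b, p):
    """Rows i = 0..m-1 hold b_{i,0..m-1}, built with the recurrence of Eq. (5)."""
    m = len(p)
    rows = [list(b)]                      # B(X)^(0) = B(X)
    for _ in range(m - 1):
        prev = rows[-1]
        top = prev[m - 1]                 # b_{i,m-1}
        # b_{i+1,0} = b_{i,m-1} p_0 and b_{i+1,j} = b_{i,j-1} + b_{i,m-1} p_j
        rows.append([top & p[0]] + [prev[j - 1] ^ (top & p[j]) for j in range(1, m)])
    return rows


def multiply_2d(a, b, p):
    """c_j = sum over i of a_i * b_{i,j}, with addition in GF(2) done by XOR."""
    m = len(p)
    rows = b_table(b, p)
    c = []
    for j in range(m):
        acc = 0
        for i in range(m):
            acc ^= a[i] & rows[i][j]
        c.append(acc)
    return c


# GF(2^3) with P(x) = 1 + x + x^3: (1 + x) * x^2 = x^2 + x^3 = 1 + x + x^2.
print(multiply_2d([1, 1, 0], [0, 0, 1], [1, 1, 0]))  # [1, 1, 1]
```

Storing all $m$ rows at once is what makes this formulation cost $O(m^2)$ storage, in contrast to the single $m$-bit register S of Section 3, which holds one row at a time.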

Based on Eqs. (5)-(6) and Algorithm B, the semi-systolic two-dimensional systolic array for realizing the product $C(X) = A(X) \times B(X)$ is shown in Figure 5. The circuit for the processing element $V_{i,j}$ is shown in Figure 6.

Figure 5. The proposed semi-systolic two-dimensional systolic array over $GF(2^m)$.

Figure 6. The detailed circuit for the cell $V_{i,j}$.

5. Complexity

In the CMOS VLSI technology, 2-input AND, 2-input XOR, and 1-bit latch are composed of 6, 6, and 8 transistors, respectively [37]. Suppose that a 3-input XOR gate and a 4-input XOR gate are constructed from two 2-input XOR gates and three 2-input XOR gates, respectively. Thus, the propagation delays of going through a 3-input XOR gate and a 4-input XOR gate would be the same. A comparison of space and area-time complexities of various PB bit-parallel multipliers is given in Table 1.

Table 1. Comparison of various PB bit-parallel multipliers (gate count and transistor count together describe the space complexity)

Multiplier              Generating    Function  Gate count                                      Transistor     Latency  Area-time
                        polynomial                                                              count                   complexity (ns)
Yeh et al. [34]         General form  AB + C    #AND2: 2m^2; #XOR2: 2m^2; #L: 7m^2              80m^2          3m       7680m^3
Wang-Lin [31]           General form  AB + C    #AND2: 2m^2; #XOR3: m^2; #L: 7m^2               76m^2          3m       10032m^3
Wei [33]                General form  AB + C    #AND2: 3m^2; #XOR2: m^2; #XOR3: m^2; #L: 4m^2   68m^2          m        2992m^3
Lee [22]                Trinomials    AB + C    #AND2: m^2; #XOR2: m^2+m-1; #L: 3m^2-2m-2       36m^2+24m-24   2m-1     2304m^3
Our proposal in Fig. 1  General form  AB + C    #AND2: 7m; #XOR2: m+k; #L: 3m                   72m+6k         m        2304m^2
Our proposal in Fig. 5  General form  AB + C    #AND2: 2m^2; #XOR2: 2m^2; #L: 3m^2              48m^2          m        1536m^3

Suppose that the generating polynomial $P(X)$ has $k$ terms. Most existing PB multipliers using XOR binary trees require the space complexity of $O(m^2)$ and take time complexity of $O(\log_2 m)$ [24,25]. However, such multipliers are not regular and hence are not suitable for VLSI implementation due to their tree structures. To overcome this problem, many systolic array structures, which have features of regularity and modularity and are well suited to VLSI implementation, have been presented. However, most existing systolic array multipliers need the space complexity of $O(m^2)$. Our proposed linear systolic array multiplier in Figure 1 using an irreducible polynomial only requires the space complexity of $O(m)$. However, two-dimensional systolic array multipliers are useful when there are many successive multiplication operations to be performed, as in the case of exponentiation. Thus, a two-dimensional systolic array version of the proposed multiplier is shown in Figure 5. Compared with the multiplier proposed by Wei [33], the proposed two-dimensional semi-systolic array multiplier in Figure 5 saves about 30% of space complexity.

Comparisons of time complexities of various PB bit-parallel multipliers are given in Table 2. Let $T_A$, $T_X$, $T_L$, and $T_{3X}$ represent the gate delays of a 2-input AND gate, a 2-input XOR gate, a 1-bit latch, and a 3-input XOR gate, respectively. We assume that real circuits such as M74HC86 (STMicroelectronics, XOR gate, tPD = 12ns (TYP.)) [38], M74HC08 (STMicroelectronics, AND gate, tPD = 7ns (TYP.)) [39], and M74HC279 (STMicroelectronics, Latch, tPD = 13ns (TYP.)) [40] are employed. The proposed multiplication architectures in Figure 1 and Figure 5 save about 27% of time complexity as compared to the multiplier in [33]. Although the developed multiplier in Figure 1 increases the space complexity as compared to Lee's multiplier [22] for all trinomials, it saves about 33% of area-time complexity.

Table 2. Comparisons of time complexities of various PB bit-parallel multipliers

Multiplier              Generating    Function  Latency         Throughput  Propagation through  Total propagation delay
                        polynomial              (clock cycles)              one cell             (unit = ns)
Yeh et al. [34]         General form  AB + C    3m              1           TA+TX+TL             3m(TA+TX+TL) (96m ns)
Wang-Lin [31]           General form  AB + C    3m              1           TA+T3X+TL            3m(TA+T3X+TL) (132m ns)
Wei [33]                General form  AB + C    m               1           TA+T3X               m(TA+T3X+TL) (44m ns)
Lee [22]                Trinomials    AB + C    2m-1            1           TA+TX+TL             (2m-1)(TA+TX+TL) (64m ns)
Our proposal in Fig. 1  General form  AB + C    m               1/m         TA+TX+TL             m(TA+TX+TL) (32m ns)
Our proposal in Fig. 5  General form  AB + C    m               1           TA+TX+TL             m(TA+TX+TL) (32m ns)

6. Conclusion

In this study, we have presented a one-dimensional array multiplier for performing multiplications in the finite field $GF(2^m)$ with the PB representation. A linear feedback shift register is employed in our proposed multiplier. Our proposed linear array multiplier requires only $O(m)$ space complexity while other existing two-dimensional systolic array multipliers need $O(m^2)$ space complexity. Such a low-complexity multiplier is very attractive for mobile platforms such as PDAs and smart phones. A new two-dimensional systolic array version of the proposed multiplier has also been included. The proposed two-dimensional systolic array multiplier saves about 30% of space complexity and 27% of time complexity compared with the existing two-dimensional systolic array multiplier in [33].

Appendix: Procedure-A

Procedure-A:
/* Let C(X) = A(X) × B(X) mod P(X), and */
/* C(X) = c0 + c1X + c2X^2 + c3X^3 + c4X^4 + c5X^5 + c6X^6 + c7X^7, */
/* A(X) = a0 + a1X + a2X^2 + a3X^3 + a4X^4 + a5X^5 + a6X^6 + a7X^7, */
/* B(X) = b0 + b1X + b2X^2 + b3X^3 + b4X^4 + b5X^5 + b6X^6 + b7X^7, */
/* P(X) = 1 + X + X^3 + X^4 + X^8. */

Step 0: Initial condition;
(a) B(X) is loaded into the linear feedback shift register S as follows: Si(0) = bi for 0 ≤ i ≤ 7.
(b) A(X) is loaded into the shift register E as follows: Ei(0) = ai for 0 ≤ i ≤ 7.
(c) All 1-bit latches in cells Ui for 0 ≤ i ≤ 7 are initially reset to zeros.

Step 1: At clock cycle 0; Cells U0~U7 do the following operations:
U0 = 0 ⊕ E0(0)S0(0) = a0b0,
U1 = 0 ⊕ E0(0)S1(0) = a0b1,
U2 = 0 ⊕ E0(0)S2(0) = a0b2,
U3 = 0 ⊕ E0(0)S3(0) = a0b3,
U4 = 0 ⊕ E0(0)S4(0) = a0b4,
U5 = 0 ⊕ E0(0)S5(0) = a0b5,
U6 = 0 ⊕ E0(0)S6(0) = a0b6,
U7 = 0 ⊕ E0(0)S7(0) = a0b7.

Step 2: At clock cycle 1; Cells U0~U7 do the following operations:
U0 = U0 ⊕ E0(1)S0(1) = a0b0 ⊕ a1b7,
U1 = U1 ⊕ E0(1)S1(1) = a0b1 ⊕ a1(b0 ⊕ b7),
U2 = U2 ⊕ E0(1)S2(1) = a0b2 ⊕ a1b1,
U3 = U3 ⊕ E0(1)S3(1) = a0b3 ⊕ a1(b2 ⊕ b7),
U4 = U4 ⊕ E0(1)S4(1) = a0b4 ⊕ a1(b3 ⊕ b7),
U5 = U5 ⊕ E0(1)S5(1) = a0b5 ⊕ a1b4,
U6 = U6 ⊕ E0(1)S6(1) = a0b6 ⊕ a1b5,
U7 = U7 ⊕ E0(1)S7(1) = a0b7 ⊕ a1b6.

Step 3: At clock cycle 2; Cells U0~U7 do the following operations:
U0 = U0 ⊕ E0(2)S0(2) = a0b0 ⊕ a1b7 ⊕ a2b6,
U1 = U1 ⊕ E0(2)S1(2) = a0b1 ⊕ a1(b0 ⊕ b7) ⊕ a2(b7 ⊕ b6),
U2 = U2 ⊕ E0(2)S2(2) = a0b2 ⊕ a1b1 ⊕ a2(b0 ⊕ b7),
U3 = U3 ⊕ E0(2)S3(2) = a0b3 ⊕ a1(b2 ⊕ b7) ⊕ a2(b1 ⊕ b6),
U4 = U4 ⊕ E0(2)S4(2) = a0b4 ⊕ a1(b3 ⊕ b7) ⊕ a2(b2 ⊕ b7 ⊕ b6),
U5 = U5 ⊕ E0(2)S5(2) = a0b5 ⊕ a1b4 ⊕ a2(b3 ⊕ b7),
U6 = U6 ⊕ E0(2)S6(2) = a0b6 ⊕ a1b5 ⊕ a2b4,
U7 = U7 ⊕ E0(2)S7(2) = a0b7 ⊕ a1b6 ⊕ a2b5.

Step 4: At clock cycle 3; Cells U0~U7 do following op- U3 = U3 Å E0 (5)S3(5)=a0 b3 Å a1(b2 Å b7)

erations: Å a2(b1 Å b6) Å a3(b0 Å b7 Å b5)

U0 = U0 Å E0 (3)S0 (3) = a0 b0 Å a1b7 Å a2b6 Å a3b5, Å a4(b7 Å b6 Å b4) Å a5(b6 Å b5 Å b3 Å b7),

U1 = U1 Å E0 (3)S1(3)=a0 b1 Å a1(b0 Å b7) U4 = U4 Å E0 (5)S4(5)=a0 b4 Å a1(b3 Å b7) Å a2(b2

Å a2(b7 Å b6) Å a3(b6 Å b5), Å b7 Å b6) Å a3(b1 Å b6 Å b5) Å a4(b0 Å b7 Å b5

U2 = U2 Å E0 (3)S2(3)=a0 b2 Å a1b1 Å a2(b0 Å b7) Å b4) Å a5(b7 Å b6 Å b4 Å b3 Å b7),

Å a3(b7 Å b6), U5 = U5 Å E0 (5)S5(5)=a0 b5 Å a1b4 Å a2(b3 Å b7)

U3 = U3 Å E0 (3)S3(3)=a0 b3 Å a1(b2 Å b7) Å a3(b2 Å b7 Å b6) Å a4(b1 Å b6 Å b5)

Å a2(b1 Å b6) Å a3(b0 Å b7 Å b5), Å a5(b0 Å b7 Å b5 Å b4),

U4 = U4 Å E0 (3)S4(3)=a0 b4 Å a1(b3 Å b7) U6 = U6 Å E0 (5)S6(5)=a0 b6 Å a1b5 Å a2b4 Å a3(b3

Å a2(b2 Å b7 Å b6) Å a3(b1 Å b6 Å b5), Å b7) Å a4(b2 Å b7 Å b6) Å a5(b1 Å b6 Å b5),

U5 = U5 Å E0 (3)S5(3)=a0 b5 Å a1b4 Å a2(b3 Å b7) U7 = U7 Å E0 (5)S7(5)=a0 b7 Å a1b6 Å a2b5 Å a3b4

Å a3(b2 Å b7 Å b6), Å a4(b3 Å b7) Å a5(b2 Å b7 Å b6).

U6 = U6 Å E0 (3)S6(3)=a0 b6 Å a1b5 Å a2b4

Å a3(b3 Å b7), Step 7: At clock cycle 6; Cells U0~U7 do following op-

U7 = U7 Å E0 (3)S7(3)=a0 b7 Å a1b6 Å a2b5 Å a3b4. erations:

U0 = U0 Å E0 (6)S0 (6)=a0 b0 Å a1b7 Å a2b6 Å a3b5

Step 5: At clock cycle 4; Cells U0~U7 do following op- Å a4b4 Å a5(b3 Å b7) Å a6(b2 Å b7 Å b6),

erations: U1 = U1 Å E0 (6)S1(6)=a0 b1 Å a1(b0 Å b7)

U0 = U0 Å E0 (4)S0 (4)=a0 b0 Å a1b7 Å a2b6 Å a3b5 Å a2(b7 Å b6) Å a3(b6 Å b5) Å a4(b5 Å b4)

Å a4b4, Å a5(b4 Å b3 Å b7)+a6(b3 Å b7 Å b2 Å b7 Å b6),

U1 = U1 Å E0 (4)S1(4)=a0 b1 Å a1(b0 Å b7) U2 = U2 Å E0 (6)S2(6)=a0 b2 Å a1b1 Å a2(b0 Å b7)

Å a2(b7 Å b6) Å a3(b6 Å b5) Å a4(b5 Å b4), Å a3(b7 Å b6) Å a4(b6 Å b5) Å a5(b5 Å b4)

U2 = U2 Å E0 (4)S2(4)=a0 b2 Å a1b1 Å a2(b0 Å b7) Å a6(b4 Å b3 Å b7),

Å a3(b7 Å b6) Å a4(b6 + b5), U3 = U3 Å E0 (6)S3(6)=a0 b3 Å a1(b2 Å b7) Å a2(b1

U3 = U3 Å E0 (4)S3(4)=a0 b3 Å a1(b2 Å b7) Å a2(b1 Å b6) Å a3(b0 Å b7 Å b5) Å a4(b7 Å b6 Å b4)

Å b6) Å a3(b0 Å b7 Å b5) Å a4(b7 Å b6 Å b4), Å a5(b6 Å b5 Å b3 Å b7) Å a6(b5 Å b4 Å b2

U4 = U4 Å E0 (4)S4(4)=a0 b4 Å a1(b3 Å b7) Å b7 Å b6),

Å a2(b2 Å b7 Å b6) Å a3(b1 Å b6 Å b5) U4 = U4 Å E0 (6)S4(6)=a0 b4 Å a1(b3 Å b7) Å a2(b2

Å a4(b0 Å b7 Å b5 Å b4), Å b7 Å b6) Å a3(b1 Å b6 Å b5) Å a4(b0 Å b7 Å b5

U5 = U5 Å E0 (4)S5(4)=a0 b5 Å a1b4 Å a2(b3 Å b7) Å b4) Å a5(b7 Å b6 Å b4 Å b3 Å b7) Å a6(b6 Å b5

Å a3(b2 Å b7 Å b6) Å a4(b1 Å b6 Å b5), Å b3 Å b7 Å b2 Å b7 Å b6),

U6 = U6 Å E0 (4)S6(4)=a0 b6 Å a1b5 Å a2b4 U5 = U5 Å E0 (6)S5(6)=a0 b5 Å a1b4 Å a2(b3 Å b7)

Å a3(b3 Å b7) Å a4(b2 Å b7 Å b6), Å a3(b2 Å b7 Å b6) Å a4(b1 Å b6 Å b5) Å a5(b0

U7 = U7 Å E0 (4)S7(4)=a0 b7 Å a1b6 Å a2b5 Å a3b4 Å b7 Å b5 Å b4) Å a6(b7 Å b6 Å b4 Å b3 Å b7),

Å a4(b3 Å b7). U6 = U6 Å E0(6)S6(6)=a0 b6 Å a1b5 Å a2b4 Å a3(b3

Å b7) Å a4(b2 Å b7 Å b6) Å a5(b1 Å b6 Å b5)

Step 6: At clock cycle 5; Cells U0~U7 do following op- Å a6(b0 Å b7 Å b5 Å b4),

erations: U7 =U7 Å E0 (6)S7(6)=a0 b7 Å a1b6 Å a2b5 Å a3b4

U0 = U0 Å E0 (5)S0 (5)=a0 b0 Å a1b7 Å a2b6 Å a3b5 Å a4(b3 Å b7) Å a5(b2 Å b7 Å b6) Å a6(b1 Å b6 + b5).

Å a4b4 Å a5(b3 Å b7),

U1 = U1 Å E0 (5)S1(5)=a0 b1 Å a1(b0 Å b7) Step 8: At clock cycle 7; Cells U0~U7 do following op-

Å a2(b7 Å b6) Å a3(b6 Å b5) Å a4(b5 Å b4) erations:

Å a5(b4 Å b3 Å b7), U0 = U0 Å E0 (7)S0 (7)=a0 b0 Å a1b7 Å a2b6 Å a3b5

U2 = U2 Å E0 (5)S2(5)=a0 b2 Å a1b1 Å a2(b0 Å b7) Å a4b4 Å a5(b3 Å b7) Å a6(b2 Å b7 Å b6)

Å a3(b7 Å b6) Å a4(b6 Å b5) Å a5(b5 Å b4), Å a7(b1 Å b6 Å b5), 262 Che-Wun Chiou et al.

U1 = U1 ⊕ E0(7)S1(7) = a0b1 ⊕ a1(b0 ⊕ b7) ⊕ a2(b7 ⊕ b6) ⊕ a3(b6 ⊕ b5) ⊕ a4(b5 ⊕ b4) ⊕ a5(b4 ⊕ b3 ⊕ b7) ⊕ a6(b3 ⊕ b7 ⊕ b2 ⊕ b7 ⊕ b6) ⊕ a7(b2 ⊕ b7 ⊕ b6 ⊕ b1 ⊕ b6 ⊕ b5),
U2 = U2 ⊕ E0(7)S2(7) = a0b2 ⊕ a1b1 ⊕ a2(b0 ⊕ b7) ⊕ a3(b7 ⊕ b6) ⊕ a4(b6 ⊕ b5) ⊕ a5(b5 ⊕ b4) ⊕ a6(b4 ⊕ b3 ⊕ b7) ⊕ a7(b3 ⊕ b7 ⊕ b2 ⊕ b7 ⊕ b6),
U3 = U3 ⊕ E0(7)S3(7) = a0b3 ⊕ a1(b2 ⊕ b7) ⊕ a2(b1 ⊕ b6) ⊕ a3(b0 ⊕ b7 ⊕ b5) ⊕ a4(b7 ⊕ b6 ⊕ b4) ⊕ a5(b6 ⊕ b5 ⊕ b3 ⊕ b7) ⊕ a6(b5 ⊕ b4 ⊕ b2 ⊕ b7 ⊕ b6) ⊕ a7(b4 ⊕ b3 ⊕ b7 ⊕ b1 ⊕ b6 ⊕ b5),
U4 = U4 ⊕ E0(7)S4(7) = a0b4 ⊕ a1(b3 ⊕ b7) ⊕ a2(b2 ⊕ b7 ⊕ b6) ⊕ a3(b1 ⊕ b6 ⊕ b5) ⊕ a4(b0 ⊕ b7 ⊕ b5 ⊕ b4) ⊕ a5(b7 ⊕ b6 ⊕ b4 ⊕ b3 ⊕ b7) ⊕ a6(b6 ⊕ b5 ⊕ b3 ⊕ b7 ⊕ b2 ⊕ b7 ⊕ b6) ⊕ a7(b5 ⊕ b4 ⊕ b2 ⊕ b7 ⊕ b6 ⊕ b1 ⊕ b6 ⊕ b5),
U5 = U5 ⊕ E0(7)S5(7) = a0b5 ⊕ a1b4 ⊕ a2(b3 ⊕ b7) ⊕ a3(b2 ⊕ b7 ⊕ b6) ⊕ a4(b1 ⊕ b6 ⊕ b5) ⊕ a5(b0 ⊕ b7 ⊕ b5 ⊕ b4) ⊕ a6(b7 ⊕ b6 ⊕ b4 ⊕ b3 ⊕ b7) ⊕ a7(b6 ⊕ b5 ⊕ b3 ⊕ b7 ⊕ b2 ⊕ b7 ⊕ b6),
U6 = U6 ⊕ E0(7)S6(7) = a0b6 ⊕ a1b5 ⊕ a2b4 ⊕ a3(b3 ⊕ b7) ⊕ a4(b2 ⊕ b7 ⊕ b6) ⊕ a5(b1 ⊕ b6 ⊕ b5) ⊕ a6(b0 ⊕ b7 ⊕ b5 ⊕ b4) ⊕ a7(b7 ⊕ b6 ⊕ b4 ⊕ b3 ⊕ b7),
U7 = U7 ⊕ E0(7)S7(7) = a0b7 ⊕ a1b6 ⊕ a2b5 ⊕ a3b4 ⊕ a4(b3 ⊕ b7) ⊕ a5(b2 ⊕ b7 ⊕ b6) ⊕ a6(b1 ⊕ b6 ⊕ b5) ⊕ a7(b0 ⊕ b7 ⊕ b5 ⊕ b4).

The final result C(X) is obtained from the outputs of Ui for 0 ≤ i ≤ 7.

Acknowledgments

The authors would like to thank anonymous referees and the editor for carefully reading the paper and for their great help in improving the paper.

References

[1] MacWilliams, F. J. and Sloane, N. J. A., The Theory of Error-Correcting Codes, Amsterdam: North-Holland (1977).
[2] Lidl, R. and Niederreiter, H., Introduction to Finite Fields and Their Applications, New York: Cambridge Univ. Press, U.S.A. (1994).
[3] Blahut, R. E., Fast Algorithms for Digital Signal Processing, Reading, Mass.: Addison-Wesley (1985).
[4] Reed, I. S. and Truong, T. K., “The Use of Finite Fields to Compute Convolutions,” IEEE Trans. Information Theory, Vol. IT-21, pp. 208-213 (1975).
[5] Wang, C. C. and Pei, D., “A VLSI Design for Computing Exponentiation in $GF(2^m)$ and its Application to Generate Pseudorandom Number Sequences,” IEEE Trans. Computers, Vol. 39, pp. 258-262 (1990).
[6] Omura, J. and Massey, J., “Computational Method and Apparatus for Finite Field Arithmetic,” U.S. Patent Number 4,587,627 (1986).
[7] Wang, C. C., Truong, T. K., Shao, H. M., Deutsch, L. J., Omura, J. K. and Reed, I. S., “VLSI Architectures for Computing Multiplications and Inverses in $GF(2^m)$,” IEEE Trans. Computers, Vol. C-34, pp. 709-717 (1985).
[8] Reyhani-Masoleh, A. and Hasan, M. A., “A New Construction of Massey-Omura Parallel Multiplier Over $GF(2^m)$,” IEEE Trans. Computers, Vol. 51, pp. 511-520 (2002).
[9] Berlekamp, E. R., “Bit-Serial Reed-Solomon Encoder,” IEEE Trans. Inform. Theory, Vol. IT-28, pp. 869-874 (1982).
[10] Wu, H., Hasan, M. A. and Blake, I. F., “New Low-Complexity Bit-Parallel Finite Field Multipliers Using Weakly Dual Bases,” IEEE Trans. Computers, Vol. 47, pp. 1223-1234 (1998).
[11] Wu, H. and Hasan, M. A., “Low Complexity Bit-Parallel Multipliers for a Class of Finite Fields,” IEEE Trans. Computers, Vol. 47, pp. 883-887 (1998).
[12] Lee, C. Y., Chiou, C. W. and Lin, J. M., “Low-Complexity Bit-Parallel Dual Basis Multipliers Using the Modified Booth’s Algorithm,” Computers & Electrical Engineering, Vol. 31, pp. 444-459 (2005).
[13] Lee, C. Y. and Chiou, C. W., “Efficient Design of Low-Complexity Bit-Parallel Systolic Hankel Multipliers to Implement Multiplication in Normal and Dual Bases of $GF(2^m)$,” IEICE Trans. on Fundamentals of Electronics, Communications and Computer Science, Vol. E88-A, pp. 3169-3179 (2005).
[14] Bartee, T. C. and Schneider, D. J., “Computation with Finite Fields,” Information and Computing, Vol. 6, pp. 79-98 (1963).
[15] Mastrovito, E. D., “VLSI Architectures for Multiplication Over Finite Field $GF(2^m)$,” Applied Algebra, Algebraic Algorithms, and Error-Correcting Codes, Proc. Sixth Int’l Conf., AAECC-6, T. Mora, ed., Rome, pp. 297-309 (1988).
[16] Mastrovito, E. D., “VLSI Architectures for Computations in Galois Fields,” Ph.D. thesis, Linköping Univ., Dept. of Electrical Eng., Linköping, Sweden (1991).
[17] Koç, Ç. K. and Sunar, B., “Low-Complexity Bit-Parallel Canonical and Normal Basis Multipliers for a Class of Finite Fields,” IEEE Trans. Computers, Vol. 47, pp. 353-356 (1998).
[18] Sunar, B. and Koç, Ç. K., “Mastrovito Multiplier for All Trinomials,” IEEE Trans. Computers, Vol. 48, pp. 522-527 (1999).
[19] Wu, H., “Bit-Parallel Finite Field Multiplier and Square Using Polynomial Basis,” IEEE Trans. Computers, Vol. 51, pp. 750-758 (2002).
[20] Elia, M., Leone, M. and Visentin, C., “Low Complexity Bit-Parallel Multipliers for $GF(2^m)$ with Generator Polynomial $x^m + x^k + 1$,” Electronics Letters, Vol. 35, pp. 551-552 (1999).
[21] Wu, H., “Montgomery Multiplier and Squarer for a Class of Finite Fields,” IEEE Trans. Computers, Vol. 51, pp. 521-529 (2002).
[22] Lee, C. Y., “Low Complexity Bit-Parallel Systolic Multiplier Over $GF(2^m)$ Using Irreducible Trinomials,” IEE Proc.-Comput. Digit. Tech., Vol. 150, pp. 39-42 (2003).
[23] Chiou, C. W., Lin, L. C., Chou, F. H. and Shu, S. F., “Low Complexity Finite Field Multiplier Using Irreducible Trinomials,” IEE Electronics Letters, Vol. 39, pp. 1709-1711 (2003).
[24] Itoh, T. and Tsujii, S., “Structure of Parallel Multipliers for a Class of Fields $GF(2^m)$,” Information and Computation, Vol. 83, pp. 21-40 (1989).
[25] Hasan, M. A., Wang, M. and Bhargava, V. K., “Modular Construction of Low Complexity Parallel Multipliers for a Class of Finite Fields $GF(2^m)$,” IEEE Trans. Computers, Vol. 41, pp. 962-971 (1992).
[26] Lee, C. Y., Lu, E. H. and Lee, J. Y., “Bit-Parallel Systolic Multipliers for $GF(2^m)$ Fields Defined by All-One and Equally-Spaced Polynomials,” IEEE Trans. Computers, Vol. 50, pp. 385-393 (2001).
[27] Paar, C., “A New Architecture for a Parallel Finite Field Multiplier with Low Complexity Based on Composite Fields,” IEEE Trans. Computers, Vol. 45, pp. 856-861 (1996).
[28] Paar, C., Fleischmann, P. and Roelse, P., “Efficient Multiplier Architectures for Galois Fields $GF(2^{4n})$,” IEEE Trans. Computers, Vol. 47, pp. 162-170 (1998).
[29] Kim, N.-Y., Kim, H.-S. and Yoo, K.-Y., “Computation of $AB^2$ Multiplication in $GF(2^m)$ Using Low-Complexity Systolic Architecture,” IEE Proc.-Circuits, Devices and Systems, Vol. 150, pp. 119-123 (2003).
[30] Drolet, G., “A New Representation of Elements of Finite Fields $GF(2^m)$ Yielding Small Complexity Arithmetic Circuits,” IEEE Trans. Computers, Vol. 47, pp. 938-946 (1998).
[31] Wang, C.-L. and Lin, J.-L., “Systolic Array Implementation of Multipliers for Finite Fields $GF(2^m)$,” IEEE Trans. Circuits and Systems, Vol. 38, pp. 796-800 (1991).
[32] Wei, S.-W., “A Systolic Power-Sum Circuit for $GF(2^m)$,” IEEE Trans. Computers, Vol. 43, pp. 226-229 (1994).
[33] Wei, S.-W., “VLSI Architectures for Computing Exponentiations, Multiplicative Inverses, and Divisions in $GF(2^m)$,” IEEE Trans. Circuits and Systems-II: Analog and Digital Signal Processing, Vol. 44, pp. 847-855 (1997).
[34] Yeh, C.-S., Reed, I. S. and Truong, T. K., “Systolic Multipliers for Finite Fields $GF(2^m)$,” IEEE Trans. Computers, Vol. C-33, pp. 357-360 (1984).
[35] Wang, C. L. and Guo, J. H., “New Systolic Arrays for $C + AB^2$, Inversion, and Division in $GF(2^m)$,” IEEE Trans. Computers, Vol. 49, pp. 1120-1125 (2000).
[36] Hsu, I. S., Truong, T. K., Deutsch, L. J. and Reed, I. S., “A Comparison of VLSI Architecture of Finite Field Multipliers Using Dual, Normal, or Standard Bases,” IEEE Trans. Computers, Vol. 37, pp. 735-739 (1988).
[37] Weste, N. and Eshraghian, K., Principles of CMOS VLSI Design: A System Perspective, Reading, Mass.: Addison-Wesley (1985).
[38] http://www.st.com/stonline/books/pdf/docs/2006.pdf
[39] http://www.st.com/stonline/books/pdf/docs/1885.pdf
[40] Http://www.stm.com/stonline/products/literature/ds/1937/m74hc279.pdf

Manuscript Received: Dec. 1, 2005
Accepted: Oct. 2, 2006