
IEEE TRANSACTIONS ON COMPUTERS, VOL. 52, NO. 11, NOVEMBER 2003, p. 1379

Fast Normal Basis Multiplication Using General Purpose Processors

Arash Reyhani-Masoleh, Member, IEEE, and M. Anwar Hasan, Senior Member, IEEE

Abstract: For cryptographic applications, normal bases have received considerable attention, especially for hardware implementation. In this article, we consider fast software algorithms for normal basis multiplication over the extended binary field $GF(2^m)$. We present a vector-level algorithm which essentially eliminates the bit-wise inner products needed in the conventional approach to the normal basis multiplication. We then present another algorithm which significantly reduces the dynamic instruction counts. Both algorithms utilize the full width of the data-path of the general purpose processor on which the software is to be executed. We also consider composite fields and present an algorithm which can provide further speed-ups and an added flexibility toward hardware-software codesign of processors for very large finite fields.

Index Terms: multiplication, normal basis, software algorithms, ECDSA, composite fields.

1 INTRODUCTION

The extended binary finite field $GF(2^m)$ of degree $m$ is used in important cryptographic operations, such as key exchange, signing, and verification. For today's security applications, the minimum values of $m$ are considered to be 160 in the elliptic curve cryptography and 1,024 in the standard discrete log-based cryptography. Elliptic curve crypto-systems, which were proposed by Koblitz [14] and Miller [22] independently, use relatively smaller field sizes, but require a considerable amount of field arithmetic for each group operation (i.e., addition of two points). In such crypto-systems, often the most complicated and expensive module is the finite field arithmetic unit. As a result, it is important to develop suitable finite field arithmetic algorithms and architectures that can meet the constraints of various implementation technologies, such as hardware and software.

For cryptographic applications, the most frequently used $GF(2^m)$ arithmetic operations are addition and multiplication. Compared to the former, the latter is a much more complicated and time-consuming operation. The complexity of $GF(2^m)$ multiplication very much depends on how the field elements are represented. For hardware implementation of a multiplier, the use of normal bases has received considerable attention and a number of hardware architectures and implementations have been reported (see, for example, [1], [2], [19], [10], [35]). Unlike hardware, so far software implementation of a $GF(2^m)$ multiplier using normal bases has not been that popular. This is mainly due to a number of practical considerations. Most importantly, normal basis multiplication algorithms require inner products or matrix multiplications over the ground field $GF(2)$. Such computations are not directly supported by most of today's general purpose processors. These computations require bit-by-bit logical AND and XOR operations, which are not efficiently implemented using the instruction set supported by the processors. Also, when a high-level programming language such as C is used, the cyclic shifts needed for field squaring operations are not as efficient as they are in hardware.

In this paper, we consider algorithms for fast software normal basis (NB) multiplication on general purpose processors. We discuss how the conventional bit-level algorithm for normal basis multiplication fails to utilize the full data-path of the processor and makes its software implementation inefficient. In view of this, a vector-level normal basis multiplication algorithm is presented which eliminates the matrix multiplication over $GF(2)$ and significantly reduces the number of dynamic instructions. We then derive another scheme for normal basis multiplication to further improve the speed. We present implementation results of these schemes for the five fields recommended by NIST for NB multiplication in ECDSA (elliptic curve digital signature algorithm) [25], i.e., $m = 163, 233, 283, 409, 571$. For example, it takes 99 μs and requires only 322 bytes of memory for a $GF(2^{163})$ multiplication using NB on Pentium III 533 MHz PC. We also consider normal basis multiplication over certain special classes of composite fields. We show that normal basis multipliers over such composite fields can provide an additional speed-up. For example, it requires 114 μs and 25 bytes of memory for multiplication over $GF(2^{299})$. Composite fields also offer a great deal of flexibility toward hardware-software codesign of very large finite field processors.

The organization of this article is as follows: In Section 2, the NB representation and its conventional multiplication algorithm are presented. This algorithm has been proposed by NIST for NB multiplication in ECDSA [25] where all recommended values of $m$ are prime. Then, in Section 3, we extend NIST's algorithm to a vector-level multiplication algorithm which uses the full width of the data-path of the processor. In Section 4, a more efficient algorithm for NB multiplication is presented. In this section, this algorithm is also compared with other algorithms. Then, we present an algorithm for multiplication over finite fields $GF(2^m)$ with composite values of $m$ in Section 5. Finally, a few concluding remarks are given in Section 6.

A. Reyhani-Masoleh is with the Centre for Applied Cryptographic Research, Department of Combinatorics and Optimization, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1. E-mail: [email protected].
M.A. Hasan is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1. E-mail: [email protected].
Manuscript received 26 July 2001; revised 16 Sept. 2002; accepted 24 Jan. 2003. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 114607.
0018-9340/03/$17.00 © 2003 IEEE. Published by the IEEE Computer Society.

2 PRELIMINARIES

2.1 Normal Basis Representation

[…]

Below, we give the conventional algorithm for normal basis multiplication. This algorithm is for $t$ even only (the reader is referred to [11] for an algorithm with $t$ odd). The case of $t$ even is of particular interest for implementing high speed crypto-systems based on Koblitz curves. Such curves with points over $GF(2^m)$ exist for $m = 163, 233, 283, 409, 571$, where normal bases have $t$ even. Let $A = (a_0, a_1, \ldots, a_{m-1})$ and $B = (b_0, b_1, \ldots, b_{m-1})$ be the elements of $GF(2^m)$, then the $i$th coordinate of the product $C = AB$ is computed as [13]:

$$c_i = \sum_{n=1}^{p-2} a_{F(n+1)+i}\, b_{F(p-n)+i}, \quad 0 \le i \le m-1. \qquad (4)$$

In (4), $p = tm + 1$ and

$$F(2^i u^j \bmod p) = i, \quad 0 \le i \le m-1, \quad 0 \le j \le t-1. \qquad (5)$$

1. These are dynamic instructions which the underlying processor needs to execute.
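As a hedged illustration of (4) and (5), the following Python sketch builds the sequence $F(n)$ for a type $t$ GNB and evaluates (4) one coordinate at a time, i.e., in the bit-by-bit manner that the conventional approach implies. The function names (`mult_order`, `gnb_sequence`, `nb_mul_bitlevel`), the list-of-bits representation, and the choice of $u$ as the first integer of multiplicative order $t$ modulo $p$ are assumptions made for this sketch and are not taken from the authors' code. It further assumes that $p = tm + 1$ is prime, that a type $t$ GNB exists for the chosen $m$, and that $t$ is even.

```python
def mult_order(x, p):
    """Multiplicative order of x modulo the prime p (x not divisible by p)."""
    k, y = 1, x % p
    while y != 1:
        y = (y * x) % p
        k += 1
    return k


def gnb_sequence(m, t):
    """Return (p, F) with F[n] defined by (5): F[2^i * u^j mod p] = i.

    F[0] is unused. Assumes p = t*m + 1 is prime and a type-t GNB exists.
    """
    p = t * m + 1
    # Assumption of this sketch: take the first integer of order t modulo p.
    u = next(x for x in range(1, p) if mult_order(x, p) == t)
    F = [None] * p
    w = 1                          # w runs through u^j mod p
    for _ in range(t):
        n = w
        for i in range(m):         # n runs through 2^i * u^j mod p
            F[n] = i
            n = (2 * n) % p
        w = (u * w) % p
    return p, F


def nb_mul_bitlevel(a, b, t):
    """Coordinates of C = AB from (4); a and b are lists of m bits."""
    m = len(a)
    p, F = gnb_sequence(m, t)
    c = []
    for i in range(m):
        bit = 0
        for n in range(1, p - 1):  # n = 1, ..., p - 2
            bit ^= a[(F[n + 1] + i) % m] & b[(F[p - n] + i) % m]
        c.append(bit)
    return c


if __name__ == "__main__":
    p, F = gnb_sequence(5, 2)      # type 2 GNB of GF(2^5), so p = 11
    print("F(1..p-1) =", F[1:])
    print("C =", nb_mul_bitlevel([0, 1, 1, 1, 0], [1, 0, 1, 0, 1], 2))
```

Every coordinate of the product costs a separate inner loop of $p - 2$ single-bit AND and XOR operations here, which is the inefficiency that motivates working on whole vectors instead.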

TABLE 1 Values of F of Type 2 GNB for $GF(2^5)$

TABLE 2 Contents of Variables in Algorithm 2 for Multiplication of $A = (01110)$ and $B = (10101)$
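A trace of the kind summarized in Table 2 can be regenerated with a short Python sketch of Algorithm 2. This is only a sketch under stated assumptions: field elements are held in Python integers whose most significant bit is coordinate 0 (so that the binary string reads $(a_0 a_1 \ldots a_{m-1})$), the helper names `rotl`, `delta_F` and `nb_mul_vector` are hypothetical, and the shift in the algorithm is taken to be a cyclic left shift in this coordinate order. It assumes $t$ even and that a type $t$ GNB exists.

```python
def rotl(x, s, m):
    """Cyclic left shift of an m-bit vector: coordinate i takes coordinate i + s."""
    s %= m
    mask = (1 << m) - 1
    return ((x << s) | (x >> (m - s))) & mask


def delta_F(m, t):
    """Shift amounts dF[n] = F(n+1) - F(n) mod m for n = 1..p-2 (index 0 unused)."""
    p = t * m + 1
    # u: an integer of multiplicative order exactly t modulo p
    u = next(x for x in range(1, p)
             if pow(x, t, p) == 1 and all(pow(x, d, p) != 1 for d in range(1, t)))
    F = [0] * p
    w = 1
    for _ in range(t):
        n = w
        for i in range(m):
            F[n] = i               # F(2^i * u^j mod p) = i
            n = (2 * n) % p
        w = (u * w) % p
    return [None] + [(F[n + 1] - F[n]) % m for n in range(1, p - 1)]


def nb_mul_vector(A, B, m, t, trace=None):
    """Vector-level NB product following the steps of Algorithm 2 (t even)."""
    dF = delta_F(m, t)
    SA, SB, C = A, B, 0            # line 1
    for n in range(1, len(dF)):    # n = 1, ..., p - 2
        SA = rotl(SA, dF[n], m)    # line 3
        R = SA & SB                # line 4: coordinate-wise AND
        C ^= R                     # line 5: addition over GF(2)
        SB = rotl(SB, dF[n], m)    # line 6
        if trace is not None:
            trace.append((n, dF[n], SA, R, C, SB))
    return C


if __name__ == "__main__":
    rows = []
    C = nb_mul_vector(0b01110, 0b10101, 5, 2, trace=rows)
    for n, d, SA, R, Cn, SB in rows:
        print(n, d, *(format(x, "05b") for x in (SA, R, Cn, SB)))
    print("C =", format(C, "05b"))
```

Each loop iteration touches whole vectors with one AND and one XOR, so on a real processor the work per iteration scales with the number of machine words per element rather than with $m$.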

[…] are done on vectors instead of bits. This enables us to use the full data-path of the processor on which the software is executed. The assumption of Algorithm 1 that $t$ is even is also used in the remaining discussion of this section. The case of $t$ odd is considered in Appendix A.

Lemma 1. For GNB of type $t$, where $t$ is even, the sequence $F(n)$ of $p - 1$ integers as defined above is mirror symmetric around the center, i.e., $F(n) = F(p - n)$, $1 \le n \le p - 1$.

Proof. In (5), $t$ is the smallest nonzero integer such that $u^t \bmod p = 1$. Then, $u^{t/2} \bmod p$ must be equal to $-1$. For $0 \le i \le m-1$ and $0 \le j \le t-1$, let $n = 2^i u^j \bmod p$. Then, $F(n) = F(2^i u^j \bmod p) = i$. Also, $F(2^i u^{t/2+j} \bmod p) = i$. Thus,

$$F(n) = F(2^i u^{t/2+j} \bmod p) = F(-2^i u^j \bmod p) = F(p - n). \qquad \square$$

From (5) and Lemma 1, one has $F(1) = F(p-1) = 0$. For $1 \le n \le p-2$, let us define

$$\Delta F(n) = F(n+1) - F(n) \bmod m. \qquad (6)$$

Now, we have the following corollary.

Corollary 1. For $\Delta F(n)$ as defined above and for $t$ even, the following holds:

$$\Delta F(p - n) = m - \Delta F(n - 1) \bmod m, \quad 1 \le n \le p - 2.$$

Proof. Using (6), one obtains $F(n+1) = \sum_{i=1}^{n} \Delta F(i)$. Applying Lemma 1 into (6), one can also write

$$\Delta F(p - n) = -\Delta F(n - 1) \bmod m, \quad 2 \le n \le p - 1,$$

which results in

$$\Delta F(p - n) = m - \Delta F(n - 1), \quad 2 \le n < \frac{p-1}{2}, \quad \text{and} \quad \Delta F\!\left(\frac{p-1}{2}\right) = 0. \qquad \square$$

Using Lemma 1 and (6), one can write (4) as follows:

$$c_i = \sum_{n=1}^{p-2} a_{F(n+1)+i}\, b_{F(n)+i}, \quad 0 \le i \le m-1, \qquad (7)$$

$$\phantom{c_i} = \sum_{n=1}^{p-2} a_{F(n)+\Delta F(n)+i}\, b_{F(n)+i}, \quad 0 \le i \le m-1. \qquad (8)$$

For a particular GNB, the values of $\Delta F(n)$, $1 \le n \le p-2$, are fixed and are to be determined only once, i.e., at the time of choosing the basis. Additionally, Corollary 1 implies that it is sufficient to store only half (i.e., $\frac{p-1}{2}$) of these $\Delta F(n)$s.

We now state the vector-level algorithm for $t$ even as follows:

Algorithm 2. (Vector-Level NB Multiplication for t even)
Input: $A, B \in GF(2^m)$, $\Delta F(n) \in [0, m-1]$, $1 \le n \le p-2$
Output: $C = AB$
1. Initialize $S_A := A$; $S_B := B$; $C := 0$
2. For $n = 1$ to $p - 2$ {
3.   $S_A \ll \Delta F(n)$
4.   $R := S_A \odot S_B$
5.   $C := C + R$
6.   $S_B \ll \Delta F(n)$
7. }

In lines 3 and 6, $\ll$ denotes a cyclic left shift of the coordinate vector. In line 4 of Algorithm 2, for $X, Y \in GF(2^m)$, $X \odot Y$ denotes the bit-wise AND operation between coordinates of $X$ and $Y$, i.e., $X \odot Y = (x_0 y_0, x_1 y_1, \ldots, x_{m-1} y_{m-1})$. The following example can be used to illustrate the operation of this algorithm.

Example 1. For the type 2 GNB in $GF(2^5)$, one has $p = 11$ and $u = 10$. Thus, the values of $F(n)$, $1 \le n \le 10$, using (5) are given in Table 1. By using (4), the coordinates of $C$ are obtained as:

$$c_i = a_{i+1}b_i + a_{i+3}b_{i+1} + a_{i+2}b_{i+3} + a_{i+4}b_{i+2} + a_{i+4}b_{i+4} + a_{i+2}b_{i+4} + a_{i+3}b_{i+2} + a_{i+1}b_{i+3} + a_i b_{i+1}, \quad 0 \le i \le m-1. \qquad (9)$$

Also, using (6), one can obtain the values of $\Delta F(n)$. The results are also shown in Table 1. Here, the multiplication of $A = (01110)$ and $B = (10101)$ is shown using Algorithm 2. Table 2 shows contents of various variables of the algorithm as they are updated. The row with $n$ being "-" is for the initialization step (i.e., line 1) of the algorithm.

In order to obtain an overall computation time for a $GF(2^m)$ multiplication using Algorithm 2, the coordinates of the field elements can be divided into $\lceil m/\omega \rceil$ units where $\omega$ corresponds to the data-path width of the processor. Here (and in the rest of the paper), we assume that the processor can perform bit-wise XOR and AND of two $\omega$-bit operands using one single XOR instruction and one single AND instruction, respectively. Since the loop in Algorithm 2 has $p - 2$ iterations, the total number of bit-wise AND and bit-wise XOR instructions are the same and equal to $(p-2)\lceil m/\omega \rceil$.

[…] the cost of this efficient algorithm in terms of dynamic instruction counts and memory requirements is analyzed and then they are compared with those of similar other algorithms.

4.1 Algorithm

For the normal basis $\{\beta, \beta^2, \ldots, \beta^{2^{m-1}}\}$, let

$$\delta_j = \beta^{1+2^j}, \quad j = 1, \ldots, v,$$

where $v = \lceil \frac{m-1}{2} \rceil$. Then, one has the following result from [32].

Lemma 2. Let $A$ and $B$ be two elements of $GF(2^m)$ and $C$ be their product. Then,

$$C = \begin{cases} \displaystyle\sum_{i=0}^{m-1} \Big( a_i b_i \beta^{2^{i+1}} + \sum_{j=1}^{v} x_{i,j}\, \delta_j^{2^i} \Big), & \text{for } m \text{ odd}, \\[2ex] \displaystyle\sum_{i=0}^{m-1} \Big( a_i b_i \beta^{2^{i+1}} + \sum_{j=1}^{v-1} x_{i,j}\, \delta_j^{2^i} + a_i b_{v+i}\, \delta_v^{2^i} \Big), & \text{for } m \text{ even}, \end{cases}$$

where $a_i$s and $b_i$s are the NB coordinates of $A$ and $B$, respectively. Also, indices and exponents are reduced mod $m$ and

$$x_{i,j} = a_i b_{i+j} + a_{i+j} b_i, \quad 1 \le j \le v, \quad 0 \le i \le m-1. \qquad (10)$$

Let $h_j$, $1 \le j \le v$, be the number of 1s in the normal basis representation of $\delta_j$. Let $w_{j,1}, w_{j,2}, \ldots, w_{j,h_j}$ denote the positions of 1s in the normal basis representation of $\delta_j$, i.e.,

$$\delta_j = \sum_{k=1}^{h_j} \beta^{2^{w_{j,k}}}, \quad 1 \le j \le v, \qquad (11)$$

where $0 \le w_{j,1} < w_{j,2} < \cdots < w_{j,h_j} \le m-1$. […] Substituting (11) into the expression of Lemma 2, one obtains

$$C = \begin{cases} \displaystyle\sum_{i=0}^{m-1} a_i b_i \beta^{2^{i+1}} + \sum_{j=1}^{v} \sum_{k=1}^{h_j} \sum_{i=0}^{m-1} x_{i,j}\, \beta^{2^{i+w_{j,k}}}, & \text{for } m \text{ odd}, \\[2ex] \displaystyle\sum_{i=0}^{m-1} a_i b_i \beta^{2^{i+1}} + \sum_{j=1}^{v-1} \sum_{k=1}^{h_j} \sum_{i=0}^{m-1} x_{i,j}\, \beta^{2^{i+w_{j,k}}} + F, & \text{for } m \text{ even}, \end{cases} \qquad (14)$$

where

$$F = \sum_{k=1}^{h_v/2} \sum_{i=0}^{v-1} x_{i,v} \left( \beta^{2^{i+w_{v,k}}} + \beta^{2^{i+w_{v,k}+v}} \right), \quad \text{and} \quad v = \frac{m}{2}.$$

Note that, for a normal basis, the representation of $\delta_j$ is fixed and so is $w_{j,k}$, $1 \le j \le v$, $1 \le k \le h_j$. Now, define

$$\Delta w_{j,k} \triangleq w_{j,k} - w_{j,k-1}, \quad 1 \le j \le v, \quad 1 \le k \le h_j, \quad w_{j,0} = 0, \qquad (15)$$

where $w_{j,k}$s are as given in (11). For a particular normal basis, all $w_{j,k}$s are fixed. Hence, all $\Delta w_{j,k}$s need to be determined only at the time of choosing the basis. Using $\Delta w_{j,k}$s, below we present an efficient NB (ENB) multiplication algorithm over $GF(2^m)$ for odd values of $m$. The corresponding algorithm for even values of $m$ is shown in Appendix B. An efficient scheme to compute $w_{j,k}$s can be found in [29].

Algorithm 3. (ENB Multiplication for m Odd)
Input: $A, B \in GF(2^m)$, $\Delta w_{j,k} \in [0, m-1]$, $1 \le j \le v$, $1 \le k \le h_j$, $v = \frac{m-1}{2}$
Output: $C = AB$
1. Initialize $C := A \odot B$, $S_A := A$, $S_B := B$
2. $C \gg 1$
3. For $j = 1$ to $v$ {
4.   $S_A \ll 1$, $S_B \ll 1$
5.   $T_A := A \odot S_B$, $T_B := B \odot S_A$
6.   $R := T_A + T_B$
7.   For $k = 1$ to $h_j$ {
8.     $R \gg \Delta w_{j,k}$
9.     $C := C + R$
10.   }
11. }

In the above algorithm, $\ll$ and $\gg$ denote cyclic left and right shifts, and shifted values of $A$ and $B$ are stored in $S_A$ and $S_B$, respectively. In line 6, $R \in GF(2^m)$ contains $(x_{0,j}, x_{1,j}, \ldots, x_{m-1,j})$, i.e., $\sum_{i=0}^{m-1} x_{i,j} \beta^{2^i}$. Also, the right cyclic shift of $R$ in line 8 corresponds to $\sum_{i=0}^{m-1} x_{i,j} \beta^{2^{i+w_{j,k}}}$. After the final iteration, $C$ is the normal basis representation of the required product $AB$. To illustrate the operation of the above algorithm and compare it with the previous algorithm, we revisit the multiplication of Example 1 as follows.

Example 2. Consider the finite field $GF(2^5)$ generated by the irreducible polynomial $F(z) = z^5 + z^2 + 1$ and let $\alpha$ be its root, i.e., $F(\alpha) = 0$. We choose $\beta = \alpha^5$, then $\{\beta, \beta^2, \beta^4, \beta^8, \beta^{16}\}$ is a type 2 GNB. Here, $m = 5$ and $v = \frac{5-1}{2} = 2$. Using Table 2 in [24], one has

$$\delta_1 = \beta^3 = \beta + \beta^8, \quad h_1 = 2, \quad [w_{1,k}]_{k=1}^{h_1} = [0, 3],$$
$$\delta_2 = \beta^5 = \beta^8 + \beta^{16}, \quad h_2 = 2, \quad [w_{2,k}]_{k=1}^{h_2} = [3, 4].$$

Let

$$A = \beta^2 + \beta^4 + \beta^8 = (01110)$$

and

$$B = \beta + \beta^4 + \beta^{16} = (10101),$$

which are the same two field elements we used in Example 1. Table 3 shows contents of various variables of the algorithm as they are updated (see Table 2 for comparison). The row with $j$ being "-" is for the initialization step (i.e., line 1) of the algorithm.

TABLE 3 Contents of Variables in Algorithm 3 for Multiplication of $A = (01110)$ and $B = (10101)$

As can be seen in Algorithm 3, all $\Delta w_{j,k}$s have to be precomputed and it is done only once when the basis is chosen. In the above example, they are determined by calculating $\delta_j$s, which is essentially a multiplication process all by itself. For this multiplication, one can use either Algorithm 1 or Algorithm 2.

4.2 Cost and Comparison

In an effort to determine the cost of Algorithm 3, we give the dynamic instruction counts for its software implementation. We also consider the number of memory accesses to read the precomputed values of $\Delta w_{j,k}$. For software implementation of the above algorithm, one would heavily rely on instructions, such as, XOR, AND, and others which can be used to emulate cyclic shifts (in the C like programming language). XOR instructions are needed in lines 6 and 9, which are repeated $v$ and $\sum_{j=1}^{v} h_j$ times, respectively. Since $v = \frac{m-1}{2}$ and $\sum_{j=1}^{v} h_j = \frac{C_N - 1}{2}$ [18], the total number of XOR instructions is $\frac{1}{2}(C_N + m - 2)\lceil m/\omega \rceil$. Because of the operations in lines 1 and 5, one can also see that the above algorithm requires $m\lceil m/\omega \rceil$ AND instructions. We assume that each $i$-fold cyclic shift, $1 \le i \le m-1$, in lines 2, 4, and 8 needs $\rho\lceil m/\omega \rceil$ instructions, where $\rho$ is as defined earlier. In Algorithm 3, the number of cyclic shifts in lines 2, 4, and 8 are 1, $2v$, and $\sum_{j=1}^{v} h_j$, respectively. Thus, the total number of cyclic shifts in this algorithm is $1 + 2v + \sum_{j=1}^{v} h_j = \frac{1}{2}(C_N + 2m - 1)$ and, so, the total number of instructions to emulate cyclic shifts used in Algorithm 3 is $\frac{\rho}{2}(C_N + 2m - 1)\lceil m/\omega \rceil$. Based on the above discussion, we have the following.

Proposition 2. The dynamic instruction count for Algorithm 3 is given by

$$\#\text{Instructions} \approx \left( \frac{1+\rho}{2}\, C_N + \frac{3+2\rho}{2}\, m - \frac{2+\rho}{2} \right) \left\lceil \frac{m}{\omega} \right\rceil. \qquad (16)$$

It is noted that the type I ONB only exists when $m$ is even. For this type of bases, an algorithm and its instruction count are presented in Appendix B. For type II ONBs, where the value of $m$ is odd, one can obtain the total instruction count for Algorithm 3 by substituting $C_N = 2m - 1$ into (16). However, for the even values of $m$, where type II ONBs exist, one can use an algorithm presented in Appendix B to further reduce the instruction count. Now, we can state the following.

Remark 2. The dynamic instruction count for ENB multiplication algorithm when the finite field is defined for type II optimal normal bases is at most $\left((2.5 + 2\rho)m - (1.5 + \rho)\right)\lceil m/\omega \rceil$.

For software implementation of Algorithm 3, if the loops are not unrolled and the values of $\Delta w_{j,k}$s are not hard-coded, one needs to store $\Delta w_{j,k}$, $1 \le j \le v$, $1 \le k \le h_j$. Since the total number of $\Delta w_{j,k}$s is $\sum_{j=1}^{v} h_j$ and each $\Delta w_{j,k} \in [0, m-1]$ needs $\lceil \log_2 m \rceil$ bits of memory, a total of about $\frac{C_N - 1}{2}\lceil \log_2 m \rceil$ bits of memory is needed to store the precomputed $\Delta w_{j,k}$s.

Table 4 compares the number of dynamic instructions of the three algorithms we have described so far. As can be seen in Table 4, both our proposed schemes (i.e., Algorithms 2 and 3) are superior to the conventional bit-level multiplication scheme (i.e., Algorithm 1). Also, this

TABLE 4 Generic Comparison of Multiplication Algorithms in Terms of Number of Instructions and Memory Requirements
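Since Table 4 sets Algorithm 3 against the other schemes, a small executable version of Algorithm 3 may help when checking that the schemes agree on a given input. The Python sketch below follows the steps of Algorithm 3 for odd $m$; it is a sketch under stated assumptions rather than the authors' C code. Elements are Python integers with coordinate 0 as the most significant bit, the names `rotl`, `rotr`, `enb_mul` and `W_GF32` are hypothetical, and the positions $w_{j,k}$ for $GF(2^5)$ are the ones quoted in Example 2. The differences $\Delta w_{j,k}$ of (15) are formed on the fly from the positions.

```python
def rotl(x, s, m):
    """Cyclic left shift of an m-bit vector: coordinate i takes coordinate i + s."""
    s %= m
    mask = (1 << m) - 1
    return ((x << s) | (x >> (m - s))) & mask


def rotr(x, s, m):
    """Cyclic right shift: the value at coordinate i moves to coordinate i + s."""
    return rotl(x, (m - s) % m, m)


def enb_mul(A, B, m, w):
    """ENB product for odd m following the steps of Algorithm 3.

    w[j] lists the positions w_{j,k} of the 1s in the NB representation of
    delta_j = beta^(1 + 2^j), in increasing order, for j = 1..(m-1)/2.
    """
    v = (m - 1) // 2
    C = rotr(A & B, 1, m)                         # lines 1-2
    SA, SB = A, B
    for j in range(1, v + 1):
        SA, SB = rotl(SA, 1, m), rotl(SB, 1, m)   # line 4
        R = (A & SB) ^ (B & SA)                   # lines 5-6: R = (x_{0,j}, ..., x_{m-1,j})
        prev = 0
        for wk in w[j]:
            R = rotr(R, wk - prev, m)             # line 8: shift by the difference in (15)
            prev = wk
            C ^= R                                # line 9
    return C


# Positions w_{j,k} quoted in Example 2 for the type 2 GNB of GF(2^5).
W_GF32 = {1: [0, 3], 2: [3, 4]}

if __name__ == "__main__":
    C = enb_mul(0b01110, 0b10101, 5, W_GF32)
    print("C =", format(C, "05b"))
```

The inner loop runs once per 1 in the representation of $\delta_j$, which is why the total work depends on $C_N$ rather than on $p - 2$.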

table gives memory sizes and numbers of memory accesses of these algorithms. The final row of this table gives approximate improvement factors of Algorithm 3 to Algorithm 2. A more detailed comparison of these two algorithms is given in Table 5 for the five binary fields recommended by NIST for ECDSA (elliptic curve digital signature algorithm) [26].

These algorithms have been coded in software using the C programming language. Table 5 also shows timing (in μs) for these codes executed on Pentium III 533 MHz PC. The PC has 64 M bytes of RAM, 32 K bytes of L1 cache, and 512 K bytes of L2 cache. Our codes are parameterized in the sense that they can be used for various $m$ and $t$ without major modifications. For high speed implementation, the codes can be optimized for special values of $m$ and $t$.

Agnew et al. in [1] have proposed a bit-serial architecture for the NB multiplication. Although their work has been targeted to hardware implementation, the main idea can be used for software implementation similar to the vector-level method proposed here. For such a software implementation of [1], one would require $(C_N - 1)\lceil m/\omega \rceil$ XOR instructions, $m\lceil m/\omega \rceil$ AND instructions, and $\rho(C_N + m - 1)\lceil m/\omega \rceil$ other instructions. Thus, the dynamic instruction count would be $(\rho + 1)(C_N + m - 1)\lceil m/\omega \rceil$, which is about twice that in Algorithm 3 (see Proposition 2). In [32], one can find software implementation of the NB multiplication for two special cases, namely, two optimal normal bases. The method used in [32] is similar to that of the NB multiplication of [1].

Very recently, another NB multiplication algorithm suitable for software implementation has been proposed [24]. This algorithm is quite different from the ones discussed here. It uses much more memory, but, with respect to computational time, it is expected to be better than the algorithm presented here and it is possible to combine the work of [24] with that of this article to further improve the performance of NB multiplication in software.

Some of the recently proposed multiplication algorithms, for example, [9], [17], create a look-up table on the fly based on one of the inputs (say $A$) and yield significant speed-ups by processing a group of bits of the other input (i.e., $B$) at a time. At this point, it is not clear whether such a group-level processing of $B$ can be incorporated into our Algorithm 3. However, if $m$ is a composite number, then one can essentially achieve a similar kind of group-level processing by performing computations in the subfields. This idea is explored in the following section.

5 EFFICIENT COMPOSITE FIELD NB MULTIPLICATION ALGORITHM

In this section, we consider multiplications in the finite field $GF(2^m)$, where $m$ is a composite number. These fields are referred to as composite fields and have been used in the recent past to develop efficient multiplication schemes [26], [27]. When these fields are to be used for elliptic curve crypto-systems, one must choose $m$ such that its cofactors

TABLE 5 Comparison of the Proposed Algorithms for Binary Fields Recommended by NIST for ECDSA Applications ($\omega = 32$, $\rho = 4$)
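The per-instruction counts behind Proposition 2 can be evaluated for the parameters used in Table 5 ($\omega = 32$, $\rho = 4$). The Python sketch below is only an arithmetic aid: the function names are hypothetical, the choice $m = 233$ with $C_N = 2m - 1$ (a type II ONB) is an assumption made for the example, and the printed numbers are evaluations of the formulas, not entries copied from Table 5.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def enb_instruction_counts(m, c_n, omega, rho):
    """XOR, AND and shift-emulation counts of Algorithm 3 (m odd, C_N odd)."""
    words = ceil_div(m, omega)                    # machine words per field element
    xor = (c_n + m - 2) // 2 * words              # lines 6 and 9
    and_ = m * words                              # lines 1 and 5
    shift = rho * (c_n + 2 * m - 1) // 2 * words  # lines 2, 4 and 8
    return {"xor": xor, "and": and_, "shift": shift,
            "total": xor + and_ + shift}


def type2_onb_bound(m, omega, rho):
    """((2.5 + 2*rho) * m - (1.5 + rho)) * ceil(m / omega), in integer arithmetic."""
    return ((5 + 4 * rho) * m - (3 + 2 * rho)) // 2 * ceil_div(m, omega)


if __name__ == "__main__":
    m, omega, rho = 233, 32, 4
    counts = enb_instruction_counts(m, 2 * m - 1, omega, rho)
    print(counts)
    print("Remark 2 bound:", type2_onb_bound(m, omega, rho))
```

With $C_N = 2m - 1$ the total from the three separate counts coincides with the closed form of Remark 2, which is a quick consistency check on the constants in (16).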

are large enough to resist the attack described by Galbraith and Smart [6], [34]. The composite fields which are not vulnerable against the known attacks for elliptic curve crypto-systems are presented in [20], [5].

5.1 Algorithm

Let us introduce the following lemma from [21] which is used for constructing a composite field normal basis.

Lemma 3 [21]. Let $\gcd(m_1, m_2) = 1$. Let $N_1 = \{\beta_1^{2^j} \mid 0 \le j \le m_1 - 1\}$ be a normal basis of $GF(2^{m_1})$ over $GF(2)$. Then, $N_1$ is also a normal basis of $GF(2^{m_1 m_2})$ over $GF(2^{m_2})$.

Here, we consider composite fields with only two prime factors (i.e., both $m_1$ and $m_2$ are prime). This is particularly important for elliptic curve crypto-systems. For such systems in today's security applications, the values of $m$ appear to be in the range of 160 to several hundred only (571 as given in [25]). To avoid the attack of [6], one, however, may like to choose $m$ such that it has no small factors such as 2, 3, 5, 7, 11 (see [20] for secure composite values of $m \in [160, 600]$). This basically makes one choose $m$ as the product of two primes. Thus, in the following, we give all equations and algorithms for odd degrees. The reader can easily extend it for even degrees using the results of the previous section. Also, the parameters, namely, $\delta_j$, $h_j$, $v$, $\beta$, and $w_{j,k}$ of the previous section are used here in the context of the subfields $GF(2^{m_1})$ and $GF(2^{m_2})$ by putting an extra sub/superscript, for example, $\delta_j^{(1)}$ for $GF(2^{m_1})$ and $\delta_j^{(2)}$ for $GF(2^{m_2})$.

Let $A$ and $B$ be two elements of $GF(2^{m_1})$ over $GF(2)$ and $C$ be their product. Then, we have the following from [28]:

$$C = \sum_{i=0}^{m_1-1} a_i b_i \beta_1^{2^i} + \sum_{j=1}^{v_1} \sum_{k=1}^{h_j^{(1)}} \left( \sum_{i=0}^{m_1-1} y_{i,j}\, \beta_1^{2^{i+w_{j,k}^{(1)}}} \right), \quad \text{for } m_1 \text{ odd}, \qquad (17)$$

where

$$y_{i,j} = (a_i + a_{i+j})(b_i + b_{i+j}), \quad 1 \le j \le v_1, \quad 0 \le i \le m_1 - 1,$$
$$v_1 = \frac{m_1 - 1}{2}, \qquad \delta_j^{(1)} = \beta_1^{2^j+1} = \sum_{k=1}^{h_j^{(1)}} \beta_1^{2^{w_{j,k}^{(1)}}}.$$

By combining Lemma 3 with (17), the following is obtained:

Lemma 4. Let $A = (A_0, A_1, \ldots, A_{m_1-1})$ and $B = (B_0, B_1, \ldots, B_{m_1-1})$ be two elements of $GF(2^{m_1 m_2})$ over $GF(2^{m_2})$ and $C$ be their product. Then,

$$C = \sum_{i=0}^{m_1-1} A_i B_i \beta_1^{2^i} + \sum_{j=1}^{v_1} \sum_{k=1}^{h_j^{(1)}} \left( \sum_{i=0}^{m_1-1} Y_{i,j}\, \beta_1^{2^{i+w_{j,k}^{(1)}}} \right), \quad \text{for } m_1 \text{ odd}, \qquad (18)$$

where

$$Y_{i,j} = (A_i + A_{i+j})(B_i + B_{i+j}), \quad 1 \le j \le v_1, \quad 0 \le i \le m_1 - 1, \qquad (19)$$

and

$$A_i = (a_{i,0}, a_{i,1}, \ldots, a_{i,m_2-1}), \quad B_i = (b_{i,0}, b_{i,1}, \ldots, b_{i,m_2-1}) \in GF(2^{m_2})$$

are subfield coordinates of $A$ and $B$.

Lemma 4 leads to an algorithm for multiplication in composite fields using normal bases. The algorithm is stated below.

Algorithm 4. (ECFNB Multiplication of $GF(2^{m_1 m_2})$ over $GF(2^{m_2})$)
Input: $A, B \in GF(2^m)$, $\Delta w_{j,k}^{(1)} \in [0, m_1 - 1]$, $1 \le j \le v_1$, $v_1 = \frac{m_1 - 1}{2}$, $1 \le k \le h_j^{(1)}$
Output: $C = AB$
1. Initialize $C := A \otimes B$, $S_A := A$, $S_B := B$
2. For $j = 1$ to $v_1$ {
3.   $S_A \ll m_2$, $S_B \ll m_2$
4.   $T_A := A + S_A$, $T_B := B + S_B$
5.   $R := T_A \otimes T_B$
6.   For $k = 1$ to $h_j^{(1)}$ {
7.     $R \gg m_2 \Delta w_{j,k}^{(1)}$
8.     $C := C + R$
9.   }
10. }

In lines 1 and 5 of Algorithm 4,

$$A \otimes B = (A_0 B_0, A_1 B_1, \ldots, A_{m_1-1} B_{m_1-1})$$

denotes parallel subfield multiplications of $A$ and $B$. This subfield multiplication can be implemented with an extension of Algorithm 3 such that it produces $m_1$ subfield multiplications over $GF(2^{m_2})$. This is shown in Algorithm 5, where $A \triangleright i$ (resp. $A \triangleleft i$), $0 \le i \le m_2 - 1$, denotes an $i$-fold right (resp. left) subfield cyclic shift of all subfield elements of $A$, i.e., $A_0, A_1, \ldots, A_{m_1-1}$, respectively.

Algorithm 5. (Parallel Subfield Multiplication over $GF(2^{m_2})$)
Input: $A, B \in GF(2^m)$, $\Delta w_{j,k}^{(2)} \in [0, m_2 - 1]$, $1 \le j \le v_2$, $1 \le k \le h_j^{(2)}$, $v_2 = \frac{m_2 - 1}{2}$
Output: $C = A \otimes B$
1. Initialize $C := A \odot B$, $S_A := A$, $S_B := B$
2. $C \triangleright 1$
3. For $j = 1$ to $v_2$ {
4.   $S_A \triangleleft 1$, $S_B \triangleleft 1$
5.   $T_A := A \odot S_B$, $T_B := B \odot S_A$
6.   $R := T_A + T_B$
7.   For $k = 1$ to $h_j^{(2)}$ {
8.     $R \triangleright \Delta w_{j,k}^{(2)}$
9.     $C := C + R$
10.   }
11. }

5.2 Cost

In order to obtain the cost of Algorithm 4, we need to evaluate the cost of Algorithm 5, which is called $1 + v_1 = \frac{m_1 + 1}{2}$ times by the former. Like Algorithm 3, one can determine the dynamic instruction counts of Algorithm 5 to be $\frac{1}{2}(C_2 + m_2 - 2)$ XOR, $m_2$ AND, and $\frac{1}{2}(C_2 + 2m_2 - 1)$ others to emulate cyclic shifts. The total cost of Algorithm 4 also depends on how subfield elements, each of $m_2$ bits, are

TABLE 6 Cost of Algorithm 4

TABLE 7 Cost of Algorithm 4 for Certain Composite Fields ($\rho = 4$)

TABLE 8 Comparison of Algorithm 4 of This Paper with Algorithm 6 of [3] for $m_1 = 2^e$ and $m_2 = n$, $\gcd(2, n) = 1$

stored in registers. For the sake of simplicity, we assume that an element of $GF(2^{m_2})$ is stored in one $\omega$-bit register (for software implementation of elliptic curve crypto-systems with both $m_1$ and $m_2$ being prime, most general purpose processors would have $\omega$ bit registers where $\omega \ge m_2$). For $\omega = 24$ and 32, the best values of $m_2$ are those which have ONBs, i.e., 23 and 29, respectively. Thus, each element of $GF(2^m)$ needs $m_1$ registers and the cyclic shifts in lines 3 and 7 of Algorithm 4 are almost free of cost (or at best register renaming). Based on this assumption, we give the dynamic instruction counts of Algorithm 4 in Table 6. In this table, $\rho$ is the number of instructions needed for one subfield cyclic shift in each register and it is four in the C programming language.

Table 7 shows the number of instructions and memory requirements of Algorithm 4 for six different composite fields. These six fields are obtained by combining three $m_1$s and two $m_2$s. Based on Table 1 of [20], these fields are not vulnerable against the known attacks. Algorithm 4 is also coded for these composite fields using the C programming language. The actual timing (in μs) of Algorithm 4 executed on Pentium III 533 MHz PC is also shown in Table 7. In order to obtain parameters of the finite fields $GF(2^{m_1})$ and $GF(2^{m_2})$ used in Algorithm 4, we have used Table 10 and Table 11 of [29]. The results are presented in Table 9 of Appendix C.

5.3 Comparison and Comments

5.3.1 Normal Basis Multipliers

In order to compare the cost of Algorithm 4 with that of Algorithm 6 of [3] which uses $m = n 2^e$, one can let $m_1 = 2^e$ and $m_2 = n$, $\gcd(2, n) = 1$. The condition that $\gcd(2, n) = 1$ enables us to use Algorithm 4; however, it is not needed for Algorithm 6 of [3]. In [3], the complexity of $GF(2^m)$ multiplication is given in terms of the number of subfield, i.e., $GF(2^n)$, operations. Table 8 compares the number of subfield operations required in Algorithm 4 and that in [3]. In this table, our composite field multiplication algorithm requires $\frac{m_1(m_1+1)}{2} = \frac{4^e + 2^e}{2}$ and $\frac{m_1}{2}(C_1 + 2m_1 - 3) = 4^e + 2^{e-1}(C_1 - 3)$ subfield multiplications and subfield additions, respectively. Since the subfield multiplication is more costly than the subfield addition, our proposed algorithm has about half of the computational complexity of the algorithm presented in [3].
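The two subfield operation counts quoted above can be checked numerically. The short Python sketch below evaluates them and confirms the closed forms for $m_1 = 2^e$; the function name `subfield_op_counts` is hypothetical, the value $C_1 = 2m_1 - 1$ used in the check is an assumption chosen only to have a concrete number, and the reading of $GF(2^{299})$ as $m_1 = 13$, $m_2 = 23$ is an assumption based on $299 = 13 \cdot 23$ and the remark that $m_2 = 23$ suits $\omega = 24$.

```python
def subfield_op_counts(m1, c1):
    """Subfield multiplications and additions of Algorithm 4.

    m1 is the extension degree over the subfield and c1 the complexity C_1
    of the normal basis of GF(2^m1).
    """
    mults = m1 * (m1 + 1) // 2            # (m1 + 1)/2 calls, m1 products each
    adds = m1 * (c1 + 2 * m1 - 3) // 2    # lines 4 and 8 of Algorithm 4
    return mults, adds


if __name__ == "__main__":
    # Closed forms for m1 = 2^e.
    for e in range(1, 6):
        m1 = 2 ** e
        c1 = 2 * m1 - 1                   # assumed complexity, for illustration only
        mults, adds = subfield_op_counts(m1, c1)
        assert mults == (4 ** e + 2 ** e) // 2
        assert adds == 4 ** e + 2 ** (e - 1) * (c1 - 3)
        print(e, mults, adds)

    # Assumed split 299 = 13 * 23 with m1 = 13.
    print("subfield multiplications for m1 = 13:", subfield_op_counts(13, 25)[0])
```

The multiplication count grows like $m_1^2/2$, which is the quantity the comparison with [3] in Table 8 turns on.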

5.3.2 Multipliers Based on Polynomial and Dual Bases

In addition to normal bases, in the literature at least two other types of bases, namely, polynomial and dual bases, have received considerable attention for implementation of multipliers over $GF(2^m)$ [33], [16], [8], [15]. Multiplication algorithms that use these two types of bases usually have regular structures and can easily take advantage of fixed irreducible polynomials, which define the fields. In addition, they can also be made to operate in digit serial fashion to provide improved performance (perhaps at the cost of some memory [9]). As a result, dual and/or polynomial basis multipliers can be faster than a normal basis multiplier (e.g., Algorithm 4). For the purpose of illustration, below we mention three pertinent multipliers where the values of $m$ are composite.

Schroeppel et al. in [33] obtained 112.6 μs and 7.1 μs for implementing a multiplication over $GF(2^{155})$ on Sun SPARC IPC at 25 MHz and DEC Alpha 3000 at 175 MHz, respectively. They used polynomial basis representation for their operation. Guajardo and Paar in [8] have used the Karatsuba-Ofman algorithm for field multiplication over $GF((2^{16})^{11})$. Their $176 \times 176$ bit multiplication uses polynomial basis representation and requires 38.56 μs on the DEC Alpha 3000 platform at 175 MHz. Also, Lee and Lim have presented a $GF(2^m)$ multiplication algorithm in [16] using a special type of dual basis called circular dual basis. The circular dual basis of $GF(2^m)$ exists only for the values of $m$ where a type I optimal normal basis exists. Their multiplication method has been implemented on Pentium 133 MHz and requires 50 μs for the multiplication in $GF(2^{178})$.

In each of the three cases above, the field dimension has a small prime factor (i.e., 5 for [33], 2 for both [8] and [16]). The algorithm of [33] does not make use of the small factor 5, but yields a high speed multiplier by carefully choosing the irreducible polynomial (namely, $x^{155} + x^{62} + 1$) to define the field. In our implementation, we have restricted the smallest factor to be 13 or more. A smaller factor is expected to improve the timing results reported in Table 7 (perhaps at the expense of reduced level of security for certain fields). Also, for $m = 178$, one can use type 1 optimal normal basis to speed up the multiplication for which an algorithm is included in Appendix B.

As can be seen from the above discussions, various […]

[…] executed and they provide significant speed-ups compared to the conventional bit-level multiplication scheme (i.e., Algorithm 1). Algorithms 2 and 3 are particularly suitable if $m$ is a prime. Such values of $m$ are of importance, especially for designing high-speed crypto-systems based on Koblitz curves and for protecting elliptic curve crypto-systems against the attack of Galbraith and Smart [6]. Both Algorithms 2 and 3 have been coded for software implementation using C and our timing results show that Algorithm 3 is about 200 percent faster than Algorithm 2. These results are for those five Gaussian normal bases over the binary fields which NIST has described in their ECDSA document [25]. For the purpose of using NIST parameters, although we have presented our results for Gaussian normal bases, our algorithms are quite generic and can be used for any normal bases of $GF(2^m)$ over $GF(2)$.

We have also considered composite fields with $m = m_1 m_2$. To avoid the attack of [6] on elliptic curve crypto-systems defined over these composite fields, we choose both $m_1$ and $m_2$ to be prime. We have presented an algorithm (i.e., Algorithm 4) for normal basis multiplication for $GF(2^m)$ over $GF(2^{m_2})$. Our results show that, for similar values of $m$, Algorithm 4 can be much more efficient than Algorithm 3. For example, the actual timing of Algorithm 3 is 318 microseconds for $GF(2^{283})$, whereas the timing of Algorithm 4 is 114 microseconds for $GF(2^{299})$. Composite fields also provide an added flexibility to hardware-software codesign of finite field processors. For example, Algorithm 5, which is called by Algorithm 4 a total of $\frac{m_1 + 1}{2}$ times, can be implemented in hardware for small values of $m_2$ and Algorithm 4 can be embedded in a microcontroller, which would give us a high speed, yet quite flexible, normal basis multiplier over very large fields.

APPENDIX A
VECTOR-LEVEL NB MULTIPLICATION FOR t ODD

In this appendix, we discuss the vector-level NB multiplication algorithm when $t$ is odd. This algorithm, which is similar to Algorithm 2, requires the input sequence $F(1), F(2), \ldots, F(p-1)$ to be precomputed using (5), i.e.,

$$F(2^i u^j \bmod p) = i, \quad 0 \le i \le m-1, \quad 0 \le j \le t-1.$$

[…]

TABLE 9
(a) Parameters of Type t GNBs Used in Algorithm 4
(b) Parameters of Type 2 GNBs Used in Algorithm 5 ($t = 2$, $h_j^{(2)} = 2$)

F 0ðnÞ¼Fðp nÞFðp n þ 1Þ mod m; #Instructions lm lmlm m FðpÞ¼0 m m ! where . It is easy to see that ð1 þ ÞðÞ2tm 1 þ log þ 3de log ! þ : ! 2 ! 2 2 F 0ðnÞ¼Fðp n þ 1Þ mod m: We now state the vector-level algorithm for t odd as Remark 3. For type I optimal normal bases, the average dynamic follows. instruction counts for Algorithm 6 is t lm lmlm m Algorithm 6. (Vector-Level NB Multiplication for odd) m m ! m 0 ð1 þ Þð2m 1Þ þ log þ 3deþ log ! : Input: A; B 2 GFð2 Þ, FðnÞ; F ðnÞ2½0;m 1, ! 2 ! 2 2 1 n p 1 Output: C ¼ AB 1. Initialize SA :¼ A, SB :¼ B, C :¼ 0, f :¼ 0 m APPENDIX B 2. SB 2 3. R :¼ SA SB, R ¼ðr0;r1; ;rm1Þ EFFICIENT NB MULTIPLICATION ALGORITHM 4. For i ¼ 0 to m 1 { FOR m EVEN 5. f :¼ f þ ri 6. } Algorithm 7. (ENB Multiplication for m even) m 7. SB :¼ B Input: A; B 2 GFð2 Þ, wj;k, 1 j v, 1 k hj 8. For n ¼ 1 to p 2 { Output: C ¼ AB 9. SA FðnÞ 1. Initialize C :¼ A B, SA :¼ A, SB :¼ B 0 10. SB F ðnÞ 2. C 1 11. R :¼ SA SB 3. For j ¼ 1 to v 1 { 12. C :¼ C þ R 4. SA 1, SB 1 13. } 5. TA :¼ A SB, TB :¼ B SA 14. If f is 1, C :¼ C þð1; 1; ; 1; 1Þ 6. R :¼ TA þ TB 7. For k ¼ 1 to hj { 8. R wj;k Proposition 3. The average dynamic instruction count for 9. C :¼ C þ R Algorithm 6 is given by 10. } 11. } REYHANI-MASOLEH AND HASAN: FAST NORMAL BASIS MULTIPLICATION USING GENERAL PURPOSE PROCESSORS 1389

12. Cyclically shift $S_A$ by 1 position and $S_B$ by 1 position
13. $T_A := A \odot S_B$, $T_B := B \odot S_A$
14. $R := T_A + T_B$
15. For $k = 1$ to $h_v/2$ {
16. Cyclically shift $R$ by $w_{v,k}$ positions
17. $C := C + R$
18. }

One of the special cases of Algorithm 7 is multiplication using a type I optimal normal basis, where $m$ is always even. For such an NB, there exists $\delta_j = \beta^{2^{w_j}}$, $1 \le j \le v-1$, $v = m/2$, and $\delta_v = 1$ [32], i.e., $w_{j,1} = w_j$, $1 \le j \le v-1$. This simplifies Algorithm 7 as follows:

Algorithm 7a. (ENB multiplication for type I ONB)
Input: $A, B \in GF(2^m)$, $w_j$, $1 \le j \le v-1$
Output: $C = AB$
1. Initialize $C := A \odot B$, $S_A := A$, $S_B := B$, $f := 0$
2. Cyclically shift $C$ by 1 position
3. For $j = 1$ to $v-1$ {
4. Cyclically shift $S_A$ by 1 position and $S_B$ by 1 position
5. $T_A := A \odot S_B$, $T_B := B \odot S_A$
6. $R := T_A + T_B$
7. Cyclically shift $R$ by $w_j$ positions
8. $C := C + R$
9. }
10. }
11. Cyclically shift $S_A$ by 1 position
12. $R := S_A \odot B$, $R = (r_0, r_1, \ldots, r_{m-1})$
13. For $i = 0$ to $m-1$ {
14. $f := f + r_i$
15. }
16. If $f$ is 1, $C := C + (1, 1, \ldots, 1, 1)$

Proposition 4. The average dynamic instruction count for Algorithm 7a is given by

#Instructions lm lmlm m m m ð3 þ 4Þð þ 1:5Þ þ log2 þ 3de log2 ! : 2 ! !

APPENDIX C
PARAMETERS USED IN ALGORITHMS 4 AND 5

Parameters used in Algorithms 4 and 5 are shown in Table 9.

ACKNOWLEDGMENTS

The authors would like to thank Z. Zhang for his help with implementing the algorithms and getting their timing results. The authors would also like to thank the reviewers for their comments. A preliminary version of this article appeared in [31].

REFERENCES

[1] G.B. Agnew, R.C. Mullin, I.M. Onyszchuk, and S.A. Vanstone, “An Implementation for a Fast Public-Key Cryptosystem,” J. Cryptology, vol. 3, pp. 63-79, 1991.
[2] G.B. Agnew, R.C. Mullin, and S.A. Vanstone, “An Implementation of Elliptic Curve Cryptosystems over $F_{2^{155}}$,” IEEE J. Selected Areas in Comm., vol. 11, no. 5, pp. 804-813, June 1993.
[3] K. Aoki and K. Ohta, “Fast Arithmetic Operations over $F_{2^n}$ for Software Implementation,” Proc. Fourth Ann. Workshop Selected Areas in Cryptography (SAC ’97), 1997.
[4] D.W. Ash, I.F. Blake, and S.A. Vanstone, “Low Complexity Normal Bases,” Discrete Applied Math., vol. 25, pp. 191-210, 1989.
[5] M. Ciet, J.-J. Quisquater, and F. Sica, “A Secure Family of Composite Finite Fields Suitable for Fast Implementation of Elliptic Curve Cryptography,” Proc. Indocrypt 2001, pp. 108-116, Dec. 2001.
[6] S.D. Galbraith and N. Smart, “A Cryptographic Application of Weil Descent,” Proc. Seventh IMA Conf. Cryptography and Coding, pp. 191-200, 1999.
[7] S. Gao and H.W. Lenstra Jr., “Optimal Normal Bases,” Designs, Codes and Cryptography, vol. 2, pp. 315-323, 1992.
[8] J. Guajardo and C. Paar, Efficient Algorithms for Elliptic Curve Cryptosystems, pp. 342-356. Springer-Verlag, 1997.
[9] M.A. Hasan, “Look-Up Table-Based Large Finite Field Multiplication in Memory Constrained Cryptosystems,” IEEE Trans. Computers, vol. 49, no. 7, pp. 749-758, July 2000.
[10] M.A. Hasan, M.Z. Wang, and V.K. Bhargava, “A Modified Massey-Omura Parallel Multiplier for a Class of Finite Fields,” IEEE Trans. Computers, vol. 42, no. 10, pp. 1278-1280, Oct. 1993.
[11] IEEE Std 1363-2000, “IEEE Standard Specifications for Public-Key Cryptography,” Jan. 2000.
[12] D. Johnson, A. Menezes, and S. Vanstone, “The Elliptic Curve Digital Signature Algorithm (ECDSA),” Int’l J. Information Security, vol. 1, pp. 36-63, 2001.
[13] E. Knudsen, “Elliptic Scalar Multiplication Using Point Halving,” Proc. ASIACRYPT 1999, pp. 135-149, 1999.
[14] N. Koblitz, “Elliptic Curve Cryptosystems,” Math. Computation, vol. 48, pp. 203-209, 1987.
[15] C.K. Koc and T. Acar, “Montgomery Multiplication in $GF(2^k)$,” Designs, Codes, and Cryptography, vol. 14, pp. 57-69, 1998.
[16] C. Lee and J. Lim, “A New Aspect of Dual Basis for Efficient Field Arithmetic,” Proc. Int’l Workshop Practice and Theory in Public Key Cryptography (PKC ’99), pp. 12-28, 1999.
[17] J. Lopez and R. Dahab, “High Speed Software Multiplication in $F_{2^m}$,” Proc. Indocrypt 2000, pp. 203-212, 2000.
[18] C.-C. Lu, “A Search of Minimal Key Functions for Normal Basis Multipliers,” IEEE Trans. Computers, vol. 46, no. 5, pp. 588-592, May 1997.
[19] J.L. Massey and J.K. Omura, “Computational Method and Apparatus for Finite Field Arithmetic,” US Patent No. 4,587,627, 1986.
[20] M. Maurer, A. Menezes, and E. Teske, “Analysis of the GHS Weil Descent Attack on the ECDLP over Characteristic Two Finite Fields of Composite Degree,” Proc. Indocrypt 2001, pp. 195-213, Dec. 2001.
[21] A.J. Menezes, I.F. Blake, X. Gao, R.C. Mullin, S.A. Vanstone, and T. Yaghoobian, Applications of Finite Fields. Kluwer Academic, 1993.
[22] V.S. Miller, “Use of Elliptic Curves in Cryptography,” Proc. Crypto ’85, pp. 417-426, 1986.
[23] R.C. Mullin, I.M. Onyszchuk, S.A. Vanstone, and R.M. Wilson, “Optimal Normal Bases in $GF(p^n)$,” Discrete Applied Math., vol. 22, pp. 149-161, 1988/1989.
[24] P. Ning and Y.L. Yin, “Efficient Software Implementation for Finite Field Multiplication in Normal Basis,” Proc. Information and Commu. Security (ICICS 2001), pp. 177-181, Nov. 2001.
[25] Nat’l Inst. of Standards and Technology, Digital Signature Standard, FIPS Publication 186-2, 2000.
[26] S. Oh, C.H. Kim, J. Lim, and D.H. Cheon, “Efficient Normal Basis Multipliers in Composite Fields,” IEEE Trans. Computers, vol. 49, no. 10, pp. 1133-1138, Oct. 2000.
[27] C. Paar, P. Fleishmann, and P. Soria-Rodriguez, “Fast Arithmetic for Public-Key Algorithms in Galois Fields with Composite Exponents,” IEEE Trans. Computers, vol. 48, no. 10, pp. 1025-1034, Oct. 1999.
[28] A. Reyhani-Masoleh and M.A. Hasan, “On Efficient Normal Basis Multiplication,” Proc. Indocrypt 2000, pp. 213-224, Dec. 2000.
[29] A. Reyhani-Masoleh and M.A. Hasan, “Fast Normal Basis Multiplication Using General Purpose Processors,” Technical Report CORR 2001-25, Dept. of C & O, Univ. of Waterloo, Canada, Apr. 2001.
[30] A. Reyhani-Masoleh and M.A. Hasan, “Fast Normal Basis Multiplication Using General Purpose Processors,” Proc. Selected Areas in Cryptography (SAC 2001), pp. 230-244, Aug. 2001.
[31] A. Reyhani-Masoleh and M.A. Hasan, “A New Construction of Massey-Omura Parallel Multiplier over $GF(2^m)$,” IEEE Trans. Computers, vol. 51, no. 5, pp. 511-520, May 2002.
[32] M. Rosing, Implementing Elliptic Curve Cryptography. Manning Publications, 1999.
[33] R. Schroeppel, H. Orman, S.W. O’Malley, and O. Spatscheck, “Fast Key Exchange with Elliptic Curve Systems,” Proc. CRYPTO ’95, pp. 43-56, 1995.
[34] N.P. Smart, “How Secure Are Elliptic Curves over Composite Extension Fields?” Proc. Eurocrypt 2001, pp. 30-39, 2001.
[35] B. Sunar and C.K. Koc, “An Efficient Optimal Normal Basis Type II Multiplier,” IEEE Trans. Computers, vol. 50, no. 1, pp. 83-88, Jan. 2001.

Arash Reyhani-Masoleh received the BSc degree from Iran University of Science and Technology in 1989, the MSc degree from the University of Tehran in 1991, both with the first rank in electrical and electronic engineering, and the PhD degree in electrical and computer engineering from the University of Waterloo, Canada, in 2001. From 1991 to 1997, he was with the Department of Electrical Engineering, Iran University of Science and Technology. Since June 2001, he has been a postdoctoral fellow with the Centre for Applied Cryptographic Research, University of Waterloo. His current research interests include algorithms and VLSI architectures for computations in finite fields, fault-tolerant computing, and error-control coding. He was awarded an NSERC (Natural Sciences and Engineering Research Council of Canada) postdoctoral fellowship in 2002. He is a member of the IEEE and the IEEE Computer Society.

M. Anwar Hasan received the BSc degree in electrical and electronic engineering, the MSc degree in computer engineering, both from the Bangladesh University of Engineering and Technology, in 1986 and 1988, respectively, and the PhD degree in electrical engineering from the University of Victoria in 1992. Since 1993, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, Canada, where he is now a professor. At the University of Waterloo, he is also a member of the Centre for Applied Cryptographic Research and the Center for Wireless Communications. His current research interests include algorithms and architectures for computations in Galois fields, data security and reliability, and digital communication networks. From January to December of 1999, he was on sabbatical with Motorola Labs., Schaumburg, Illinois. He is a recipient of the Raihan Memorial Gold Medal. At the University of Victoria, he was awarded the President’s Research Scholarship four times. He has served on the program and executive committees of several conferences and, currently, he is an associate editor of the IEEE Transactions on Computers. He is a senior member of the IEEE, a member of the IEEE Computer Society, and a licensed professional engineer of Ontario.
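As a companion to Algorithm 7a in Appendix B, the following is a minimal bit-level reference multiplier for a type I optimal normal basis, of the kind one might use to check a vector-level implementation on small fields. It is a sketch under stated assumptions and not the algorithm of this paper: it assumes the basis is ordered as $(\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}})$ with $\beta$ a primitive $(m+1)$th root of unity, so that $\beta^0 = 1$ is the all-ones vector. The names `onb1_log_table` and `nb_multiply_reference` are hypothetical. The variable `f` plays the same role as in Algorithm 7a: it is the parity of the partial products that equal 1, and when it is 1 the vector $(1, 1, \ldots, 1, 1)$ is added to the result.

```python
def onb1_log_table(m):
    """Return (p, F) with F[2**i % p] == i, assuming a type I ONB exists."""
    p = m + 1
    F = {}
    e = 1
    for i in range(m):
        F[e] = i
        e = (2 * e) % p
    # A type I ONB needs 2 to generate all nonzero residues modulo p = m + 1.
    if e != 1 or len(F) != m:
        raise ValueError("no type I optimal normal basis for this m")
    return p, F


def nb_multiply_reference(a, b):
    """Multiply coefficient lists a, b given w.r.t. (beta**(2**i)), bit by bit."""
    m = len(a)
    p, F = onb1_log_table(m)
    c = [0] * m
    f = 0  # parity of the partial products that equal the field element 1
    for i in range(m):
        for j in range(m):
            if a[i] & b[j]:
                # beta**(2**i) * beta**(2**j) = beta**e with e taken modulo p
                e = (pow(2, i, p) + pow(2, j, p)) % p
                if e == 0:
                    f ^= 1          # beta**0 = 1, handled after the loops
                else:
                    c[F[e]] ^= 1    # beta**e is the basis element with index F[e]
    if f:
        c = [bit ^ 1 for bit in c]  # add (1, 1, ..., 1, 1)
    return c


if __name__ == "__main__":
    beta = [1, 0, 0, 0]             # m = 4, p = 5
    one = [1, 1, 1, 1]              # the field element 1 in this basis
    print(nb_multiply_reference(beta, one))                   # [1, 0, 0, 0]
    print(nb_multiply_reference([1, 1, 0, 0], [1, 1, 0, 0]))  # [0, 1, 1, 0]
```

The second call squares $\beta + \beta^2$ and returns the input cyclically shifted by one position, which is the expected behavior of squaring in a normal basis. The double loop costs on the order of $m^2$ bit operations, so this sketch is only suitable as a correctness check, not as a replacement for the word-level algorithms.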