
Hardware Architectures of Elliptic Curve Based Cryptosystems over Binary Fields

Chang Shu Doctoral Dissertation Defense Feb. 8, 2007

Advisor: Dr. Kris Gaj Dept. of Electrical & Computer Engineering George Mason University

Acknowledgements

• Dr. Kris Gaj (Dissertation Director)
• Dr. Soonhak Kwon (Dept. of Mathematics, Sungkyunkwan University, Korea)
• Dr. Shih-Chun Chang (Committee Member)
• Dr. Brian L. Mark (Committee Member)
• Dr. Ravi Sandhu (Committee Member)
• Dr. Andre Manitius (Chair of ECE)
• Dr. Yariv Ephraim (Ph.D. Coordinator)
• Dr. Tarek El-Ghazawi (Dept. of ECE at The George Washington University)

Overview

• Introduction
  – Elliptic Curve Cryptography
  – Tate Pairing Based Cryptography

• Architectures for Finite Field Arithmetic
  – Polynomial basis multiplier
  – Normal basis multiplier
  – Composite field arithmetic

• Architectures for Elliptic Curve Cryptosystems
  – Optimizations for a single FPGA device
  – Reconfigurable computing approach

• Architectures for Tate Pairing Based Cryptosystems
  – Optimizations for a single FPGA device
  – Reconfigurable computing approach

• Summary

Elliptic Curve Cryptosystems

• Family of public key cryptosystems
• Invented in 1985 by Miller and Koblitz independently
• Used primarily for digital signatures & key exchange
• Included in multiple industry, government, and banking standards, such as IEEE P1363, ANSI X9.62, and FIPS 186-2
• Part of standard security protocols, such as IPSec and SSL (proposed extension)

Why Elliptic Curve Cryptography?

ECC vs. RSA comparison:

Security level (bits)   80         112          128    192    256
Symmetric cipher        SKIPJACK   Triple-DES   AES    AES    AES
ECC n (bits)            160        224          256    384    512
RSA n (bits)            1024       2048         3072   8192   15360

Hardware implementation consideration: Less area, less memory, narrower bandwidth, and more efficient underlying arithmetic

Flexibility: There exists a family of cryptosystems for ECC

Why Hardware Implementations of Cryptography

SOFTWARE
• security of data during transmission
• low cost
• flexibility (new cryptoalgorithms, protection against new attacks)

HARDWARE
• speed
• random key generation
• access control to keys
• tamper resistance (viruses, internal attacks)

Why Hardware Accelerators for Elliptic Curve Cryptosystems?

• Hardware accelerators for web servers
  – SSL (Secure Socket Layer), high speed requirements for a large number of key exchanges
• Hardware accelerators for Virtual Private Networks (VPNs)
  – IPSec (Secure Internet Protocol), establishment of a large number of security associations
• Hardware accelerators for wireless gateways
  – IEEE 802.11, secure key exchange, achieving low power
• Secure smart cards
  – Need to shorten latency, due to limitations such as low power, low frequency, and low cost embedded microprocessors
• Selected cryptographic chip manufacturers

What is Elliptic Curve Cryptography?

• Elliptic Curve Cryptosystems (ECC) are a class of public key cryptosystems

• The security of ECC is based on the hardness of the elliptic curve discrete logarithm problem (ECDLP).

• Let $E$ be an elliptic curve over a finite field $F_q$. Let $P$ be a point in $E(F_q)$, and suppose that $P$ has a prime order $n$. Then the cyclic subgroup of $E(F_q)$ generated by $P$ is $\langle P \rangle = \{\infty, P, 2P, \ldots, (n-1)P\}$.

• Private key: an integer $d$ chosen randomly from the interval $\{1, 2, \ldots, n-1\}$

• Public key: $Q = dP$

• Encryption: $C = (V, U) = (kP, M + kQ)$, for a randomly chosen integer $k$

• Decryption: $M = U - dV = U - d \cdot kP = U - kQ$

Elliptic Curve Arithmetic – Group Law

Point addition: P + Q
Point doubling: 2P = P + P

Scalar multiplication: $kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}$

Pairing Based Cryptography

• New family of public key cryptosystems
• First proposed by Menezes, Okamoto, and Vanstone in 1993 as an attack against ECC (the MOV reduction, built on the Weil pairing)
• Applied to identity based cryptography and key exchange, among others, by Boneh, Joux, Sakai, et al.
• Not a part of any standard yet
• Very limited number of software and hardware implementations
• Believed to be slower than elliptic curve cryptography

Mathematical Basics of Pairing Based Cryptography

• A pairing is a map between groups, $e: G_1 \times G_1 \rightarrow G_2$, where $G_1 = E(F_q)$ and $G_2 = F_{q^k}$

• The most important property of this map is bilinearity:

$e(aP, bQ) = e(P, Q)^{ab}$

where $a, b$ are integers and $P, Q$ are points on the elliptic curve

• In practice, the Tate or Weil pairing is used.
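The bilinearity property can be checked on a toy map. The sketch below is not a Tate or Weil pairing and offers no security: as an assumption made purely for illustration, the "point" xP is represented by the integer x modulo a small prime n = 11, and the target group is the order-11 subgroup of the integers modulo 23 generated by g = 2. All names (`e`, `scalar_mul`, `is_bilinear`) are hypothetical.

```python
# Toy bilinear map for illustration only (assumed parameters, not a real pairing).
N = 11        # prime order of the toy group G1 = (Z_11, +)
P_MOD = 23    # prime modulus; 11 divides 23 - 1
G = 2         # 2 has order 11 modulo 23, since 2**11 % 23 == 1


def e(x: int, y: int) -> int:
    """Toy pairing: e(xP, yP) = g^(x*y) in the multiplicative group modulo 23."""
    return pow(G, (x * y) % N, P_MOD)


def scalar_mul(k: int, x: int) -> int:
    """Scalar multiplication k * (xP) in the toy group G1."""
    return (k * x) % N


def is_bilinear() -> bool:
    """Exhaustively check e(a*P1, b*Q1) == e(P1, Q1)^(a*b) over the toy groups."""
    for p1 in range(N):
        for q1 in range(N):
            for a in range(N):
                for b in range(N):
                    lhs = e(scalar_mul(a, p1), scalar_mul(b, q1))
                    rhs = pow(e(p1, q1), a * b, P_MOD)
                    if lhs != rhs:
                        return False
    return True


if __name__ == "__main__":
    print(is_bilinear())
```

In a real pairing the discrete logarithm in $G_1$ is hard, which this toy group deliberately ignores; only the algebraic identity carries over.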

Identity-Based Encryption

Trusted Authority (TA):
• $s$: secret value
• $P$: public value
• $P_{TA} = sP$: public key of the TA

Bob:
• Public key: $P_{ID(Bob)} = H_1(ID(Bob))$
• Private key, issued by the TA: $S_{ID(Bob)} = s P_{ID(Bob)}$

Encryption (Alice), with $r$ a random number:

$C = (U, V) = (rP, \; M + H_2(e(P_{ID(Bob)}, P_{TA})^r))$

Decryption (Bob):

$M = V + H_2(e(S_{ID(Bob)}, U))$

By bilinearity, $e(S_{ID(Bob)}, U) = e(sP_{ID(Bob)}, rP) = e(P_{ID(Bob)}, sP)^r = e(P_{ID(Bob)}, P_{TA})^r$

Major Contributions of this Thesis

• Finite field arithmetic
  – A novel large extension field multiplier architecture for Tate pairing based cryptosystems
  – A novel hybrid multiplier architecture for composite fields
  – A new mathematical scheme for basis conversion for selected field degrees
• Elliptic curve cryptosystems
  – Latency optimization scheme for a single FPGA device
  – Analysis of several partitioning schemes for a reconfigurable computer, SRC 6
  – Extensive library of over 25 hardware macros for SRC 6 and SGI Altix-4700
• Tate pairing based cryptosystems
  – Comparative analysis of two novel algorithms from the point of view of hardware efficiency
  – First published implementations via a single FPGA device
  – Porting the IP core of pairing over 8 binary fields to SGI Altix-4700
  – Comparative analysis of Tate pairing based cryptosystems vs. elliptic curve cryptosystems in hardware

Architectures for Finite Field Arithmetic

Basis Choices in Finite Fields

• Polynomial basis: the subsequent powers $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ of the root $\alpha$ of an irreducible polynomial $f_m(x)$.
  – Low Hamming weight irreducible polynomial, e.g., trinomial or pentanomial
  – Maximum Hamming weight irreducible polynomial, e.g., All-One-Polynomial

• Normal basis: the conjugates $\{\beta, \beta^2, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$, where $\beta$ is the root of an irreducible polynomial $f_m(x)$.
  – Type I or Type II optimal normal basis

• Hybrid basis for composite fields

Polynomial Basis Multiplier (1)

The bit-serial multiplier is area efficient, at the cost of operational speed. Example field polynomial: $f_9(x) = x^9 + x + 1$

Linear feedback shift registers (LFSRs) are adopted in both architectures.
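A software model may help to see what the two bit-serial architectures compute. The sketch below is an illustration under assumptions, not the hardware design itself: polynomials over GF(2) are stored as Python integers (bit i is the coefficient of $x^i$), one loop iteration stands for one clock cycle, and the function names are hypothetical.

```python
M = 9
F9 = (1 << 9) | (1 << 1) | 1   # f_9(x) = x^9 + x + 1


def msb_serial_mul(a: int, b: int) -> int:
    """MSB-first (left-to-right) product a(x) * b(x) mod f_9(x).

    Each iteration multiplies the accumulator by x (an LFSR shift with
    feedback) and adds b(x) when the current bit of a is 1. The operand b
    is never modified.
    """
    c = 0
    for i in reversed(range(M)):
        c <<= 1                  # c(x) * x
        if (c >> M) & 1:         # degree reached 9: subtract f_9(x)
            c ^= F9
        if (a >> i) & 1:         # AND-XOR step: add a_i * b(x)
            c ^= b
    return c


def lsb_serial_mul(a: int, b: int) -> int:
    """LSB-first (right-to-left) product: here b(x) is shifted every cycle."""
    c = 0
    for i in range(M):
        if (a >> i) & 1:         # add a_i * (b(x) * x^i mod f_9)
            c ^= b
        b <<= 1                  # b(x) * x
        if (b >> M) & 1:
            b ^= F9
    return c


if __name__ == "__main__":
    # x * x^8 = x^9 = x + 1 in this field
    print(bin(msb_serial_mul(0b10, 1 << 8)))
```

The model mirrors the remark about registers: the MSB-first loop only updates the accumulator, while the LSB-first loop has to update both the accumulator and the shifted copy of b(x).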

[Figure: least significant bit-serial multiplier based on the right-to-left algorithm]

[Figure: most significant bit-serial multiplier based on the left-to-right algorithm]

The registers of b(x) can be saved in the MSB-serial multiplier because only the partial products need to be updated in each clock cycle. Less power is consumed in the second architecture because the value of b(x) is fixed during computations.

Polynomial Basis Multiplier (2)

The bit-parallel multiplier can complete one multiplication in one clock cycle. It is impractical to implement for large field sizes, but it can be applied to the ground field arithmetic of the composite field multiplier. Example field polynomial: $f_5(x) = x^5 + x^2 + 1$

Two steps to derive the bit-parallel multiplier:

1. Use Mastrovito’s method to compute the partial product with 2m-1 bits

2. Perform the reduction exploiting the standard technique for low Hamming weight irreducible polynomials.
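The two steps can be sketched in software for the small field GF(2^5) with $f_5(x) = x^5 + x^2 + 1$. This is only a behavioral model under assumptions: the loops stand in for what would be a purely combinational AND-XOR network in hardware, polynomials are stored as integers (bit i is the coefficient of $x^i$), and the function names are hypothetical.

```python
M = 5   # GF(2^5) defined by f_5(x) = x^5 + x^2 + 1


def partial_product(a: int, b: int) -> int:
    """Step 1: unreduced product a(x) * b(x), at most 2m - 1 = 9 bits."""
    d = 0
    for i in range(M):
        if (a >> i) & 1:
            d ^= b << i
    return d


def reduce_f5(d: int) -> int:
    """Step 2: reduction using x^5 = x^2 + 1, i.e. x^k = x^(k-3) + x^(k-5)."""
    for k in range(2 * M - 2, M - 1, -1):          # k = 8, 7, 6, 5
        if (d >> k) & 1:
            d ^= (1 << k) | (1 << (k - 3)) | (1 << (k - 5))
    return d


def bit_parallel_mul(a: int, b: int) -> int:
    """Product in GF(2^5): partial product followed by reduction."""
    return reduce_f5(partial_product(a, b))


if __name__ == "__main__":
    # x^4 * x = x^5 = x^2 + 1
    print(bin(bit_parallel_mul(0b10000, 0b10)))
```

Because the trinomial has only three terms, each high-order bit of the partial product touches just two low-order positions, which is why low Hamming weight polynomials keep the reduction network small.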

Polynomial Basis Multiplier (3)

The digit-serial multiplier is a parallel version of the bit-serial one. Instead of computing one bit of the product, the digit-serial multiplier computes multiple bits in each clock cycle. This allows a tradeoff between area and latency.
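The area/latency tradeoff can be illustrated with a most-significant-digit-first model. The sketch below assumes the field and digit size quoted on this slide ($F_{2^{239}}$, $f(x) = x^{239} + x^{36} + 1$, $D = 4$); the integer representation of polynomials and all function names are assumptions made for illustration, and one outer loop iteration stands for one clock cycle.

```python
M = 239
F = (1 << 239) | (1 << 36) | 1     # f(x) = x^239 + x^36 + 1
D = 4                              # digit size


def reduce_mod_f(v: int) -> int:
    """Reduce a polynomial over GF(2) modulo f(x)."""
    while v.bit_length() > M:
        v ^= F << (v.bit_length() - 1 - M)
    return v


def clock_cycles(d: int) -> int:
    """Number of digits of a(x), i.e. loop iterations: ceil(m / d)."""
    return -(-M // d)


def msd_serial_mul(a: int, b: int, d: int = D) -> int:
    """MSD-first multiplication: c <- c * x^d + digit(x) * b(x) (mod f)."""
    c = 0
    for j in reversed(range(clock_cycles(d))):
        digit = (a >> (j * d)) & ((1 << d) - 1)
        acc = c << d                       # c(x) * x^d
        for i in range(d):                 # AND-XOR array: d partial products
            if (digit >> i) & 1:
                acc ^= b << i
        c = reduce_mod_f(acc)
    return c


def reference_mul(a: int, b: int) -> int:
    """Reference: full carry-less product followed by one reduction."""
    prod = 0
    for i in range(M):
        if (a >> i) & 1:
            prod ^= b << i
    return reduce_mod_f(prod)


if __name__ == "__main__":
    for d in (1, 4, 16, 32):
        print(d, clock_cycles(d))
```

A larger digit size shortens the loop (fewer clock cycles) but widens the inner AND-XOR stage, which is the software counterpart of trading area for latency.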

[Figure: MSD-serial multiplier in $F_{2^{239}}$, where the digit size is $D = 4$ and $f_{239}(x) = x^{239} + x^{36} + 1$]

$c(x) \leftarrow c(x)\,x^D + \sum_{i=0}^{D-1} a_{n-D+i}\, x^i\, b(x) \bmod f_m(x)$

Two parts:
1. LFSRs
2. AND-XOR arrays

Normal Basis Multiplier (1)

Massey-Omura's architecture for the normal basis multiplier uses the same combinational circuit together with rotate registers to compute the product serially.

Example: $\gamma = \theta + \theta^{-1}$ is the normal basis generator of $F_{2^5}$, where $\theta \in F_{2^{10}}$ and $\theta^{11} = 1$

Normal Basis Multiplier (2)

Agnew et al. improved the original Massey-Omura architecture by shortening the critical path.

[Figure: architecture of Agnew et al.]

Kwon et al. improved the architecture of Agnew et al. by decreasing the circuit complexity.

[Figure: architecture of Kwon et al.]

A Novel Normal Basis Hybrid Multiplier for Composite Binary Fields (1)

1. Kwon's bit-serial structure is applied to the tower field multiplication in $GF(2^{3 \times 5})$.

2. A special irreducible trinomial is used to construct the ground field, so that the bit-parallel structure can be efficient.

A Novel Normal Basis Hybrid Multiplier for Composite Binary Fields (2)

Squarer:
• Computation at the top level is free and equivalent to a cyclic shift.
• The standard technique for polynomial basis can be applied to the ground field.

Inverter:

$a^{-1} = (a^r)^{-1} \cdot a^{r-1}, \quad r = \frac{2^{nm} - 1}{2^n - 1}$

• Obviously, $A = a^r$ is an element in $F_{2^n}$.
• Since $r - 1$ can be represented as a sum of powers, $r - 1 = 2^n + 2^{2n} + \cdots + 2^{(m-1)n}$, $a^{r-1}$ can be computed using an addition chain; the method requires $\lfloor \log_2(m-1) \rfloor + HW(m-1) + 1$ general multiplications.

A Novel Normal Basis Hybrid Multiplier for Composite Binary Fields (3)

To apply hybrid multipliers in cryptography properly, another issue must be taken into account: the matrix for basis conversion must be obtainable within a reasonable amount of time.

Special irreducible trinomials of the form $f(x) = x^{2^g} + x + 1$ or $f(x) = x^{2^g - 1} + x^{2^t - 1} + 1$ can be used to construct the ground field, so that computing such a conversion matrix is equivalent to solving a set of linear equations.

Field size n   Trinomial        Field size n   Trinomial
2              x^2 + x + 1      15             x^15 + x + 1
3              x^3 + x + 1      31             x^31 + x^3 + 1
4              x^4 + x + 1      63             x^63 + x + 1
7              x^7 + x + 1      127            x^127 + x + 1
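As a sanity check on the table, the listed trinomials can be tested for irreducibility over GF(2). The sketch below uses the Ben-Or test, which is a technique chosen here for illustration and is not taken from the slides; polynomials are stored as integers (bit i is the coefficient of $x^i$), and all function names are hypothetical.

```python
def poly_mod(a: int, f: int) -> int:
    """Remainder of a(x) divided by f(x) over GF(2)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def poly_gcd(a: int, b: int) -> int:
    """Greatest common divisor of two polynomials over GF(2)."""
    while b:
        a, b = b, poly_mod(a, b)
    return a


def poly_sqr_mod(a: int, f: int) -> int:
    """a(x)^2 mod f(x); squaring over GF(2) interleaves zeros between bits."""
    sq, i = 0, 0
    while a:
        if a & 1:
            sq |= 1 << (2 * i)
        a >>= 1
        i += 1
    return poly_mod(sq, f)


def is_irreducible(f: int) -> bool:
    """Ben-Or test: f of degree n is irreducible iff
    gcd(x^(2^i) + x, f) = 1 for every i = 1 .. n // 2."""
    n = f.bit_length() - 1
    t = 0b10                                  # the polynomial x
    for _ in range(n // 2):
        t = poly_sqr_mod(t, f)                # t = x^(2^i) mod f
        if poly_gcd(f, t ^ 0b10) != 1:
            return False
    return True


def trinomial(n: int, k: int) -> int:
    """x^n + x^k + 1 as an integer."""
    return (1 << n) | (1 << k) | 1


# (n, k) pairs from the table above
TABLE = [(2, 1), (3, 1), (4, 1), (7, 1), (15, 1), (31, 3), (63, 1), (127, 1)]

if __name__ == "__main__":
    for n, k in TABLE:
        print(n, k, is_irreducible(trinomial(n, k)))
```

The check only confirms irreducibility; it says nothing about how the basis conversion matrix itself is derived.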

Summary:
1. Circuit complexity can be decreased considerably due to the efficient bit-serial architecture at the top level.
2. Compared with the straightforward method of parallelizing a normal basis multiplier, this hybrid multiplier is more regular since the bit-parallel components all have the same structure. Therefore it is easy for EDA tools to place and route.
3. The bit-parallel multiplier of the ground field is very efficient in terms of timing and area due to the chosen trinomial.

Architectures for Elliptic Curve Cryptosystems

Lopez-Dahab Algorithm

Scalar multiplication
Input: an integer $k \ge 0$ and a point $P = (x, y) \in E$.
Output: $Q = kP$.
1. If $k = 0$ or $x = 0$ then $Q = (0, 0)$ and stop.
2. Set $k \leftarrow (k_{L-1} \ldots k_1 k_0)_2$.
3. Set $X_1 \leftarrow x$, $Z_1 \leftarrow 1$, $X_2 \leftarrow x^4 + b$, $Z_2 \leftarrow x^2$.
4. For $i$ from $L-2$ downto 0 do
     if $k_i = 1$ then Madd$(X_1, Z_1, X_2, Z_2)$, Mdouble$(X_2, Z_2)$
     else Madd$(X_2, Z_2, X_1, Z_1)$, Mdouble$(X_1, Z_1)$.
5. Return $Q = $ Mxy$(X_1, Z_1, X_2, Z_2)$.

Madd
Input: $P_1 = (X_1, Z_1)$, $P_2 = (X_2, Z_2)$, with $P_2 = P_1 + P$ and $P = (x, 1)$.
Output: $Q = P_1 + P_2 = (X_3, Z_3)$.
$Z_3 = (X_1 Z_2 + X_2 Z_1)^2$
$X_3 = x Z_3 + (X_1 Z_2)(X_2 Z_1)$

[The slide also shows the Mdouble and Mxy routines]
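The ladder can be exercised on a toy field. The sketch below is an illustration under several assumptions that do not come from the slides: the field is GF(2^5) with $f(x) = x^5 + x^2 + 1$, the curve coefficient is $b = 1$, only the x-coordinate is recovered (the Mxy step is replaced by a single division $X_1 / Z_1$), and Mdouble uses the standard Lopez-Dahab formulas $X \leftarrow X^4 + bZ^4$, $Z \leftarrow X^2 Z^2$. All function names are hypothetical.

```python
M = 5
F = 0b100101     # toy field polynomial x^5 + x^2 + 1 (assumed)
B = 0b00001      # curve coefficient b in y^2 + xy = x^3 + a*x^2 + b (assumed)


def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^5), MSB-first shift-and-add."""
    c = 0
    for i in reversed(range(M)):
        c <<= 1
        if (c >> M) & 1:
            c ^= F
        if (a >> i) & 1:
            c ^= b
    return c


def gf_sqr(a: int) -> int:
    return gf_mul(a, a)


def gf_inv(a: int) -> int:
    """Inverse by Fermat's little theorem: a^(2^m - 2)."""
    r = 1
    for _ in range(2 ** M - 2):
        r = gf_mul(r, a)
    return r


def madd(x, x1, z1, x2, z2):
    """(X3, Z3) of P1 + P2, where the difference P2 - P1 has x-coordinate x."""
    t1 = gf_mul(x1, z2)
    t2 = gf_mul(x2, z1)
    z3 = gf_sqr(t1 ^ t2)
    x3 = gf_mul(x, z3) ^ gf_mul(t1, t2)
    return x3, z3


def mdouble(x1, z1):
    """(X, Z) of 2*P1: X = X1^4 + b*Z1^4, Z = X1^2 * Z1^2."""
    xs, zs = gf_sqr(x1), gf_sqr(z1)
    return gf_sqr(xs) ^ gf_mul(B, gf_sqr(zs)), gf_mul(xs, zs)


def ladder_x(k: int, x: int):
    """x-coordinate of kP for a point P with x-coordinate x != 0.

    Returns None for the point at infinity.
    """
    if k == 0:
        return None
    x1, z1 = x, 1                                    # P
    x2, z2 = gf_sqr(gf_sqr(x)) ^ B, gf_sqr(x)        # 2P
    for i in reversed(range(k.bit_length() - 1)):    # bits L-2 .. 0
        if (k >> i) & 1:
            x1, z1 = madd(x, x1, z1, x2, z2)
            x2, z2 = mdouble(x2, z2)
        else:
            x2, z2 = madd(x, x2, z2, x1, z1)
            x1, z1 = mdouble(x1, z1)
    if z1 == 0:
        return None
    return gf_mul(x1, gf_inv(z1))
```

Every iteration performs one Madd and one Mdouble regardless of the key bit, and the two calls do not depend on each other, which is the property that parallel multipliers can exploit.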


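Scalar Multiplication: Montgomery Ladder Concept

The ladder keeps a pair (jP, (j+1)P). It starts from (P, 2P) for the leading bit k_i = 1; each following bit k_i = 0 leads to (2jP, (2j+1)P) and each bit k_i = 1 leads to ((2j+1)P, (2j+2)P).

Start:           (P, 2P)
After 2nd bit:   (2P, 3P)  (3P, 4P)
After 3rd bit:   (4P, 5P)  (5P, 6P)  (6P, 7P)  (7P, 8P)
After 4th bit:   (8P, 9P)  (9P, 10P)  (10P, 11P)  (11P, 12P)  (12P, 13P)  (13P, 14P)  (14P, 15P)  (15P, 16P)

Example: $11 = 1011_2$ follows the path (P, 2P) → (2P, 3P) → (5P, 6P) → (11P, 12P).

Latency Optimization Scheme for ECC Processor in a Single FPGA Device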

[Figure: block diagram]

Choices of optimal digit sizes of multipliers for Xilinx XCV2000E

Digit size D   mul_1   mul_2   mul_3   mul_4   mul_5   mul_6
GF(2^163)      32      32      32      8       8       8
GF(2^233)      32      32      32      8       8       8
GF(2^283)      16      16      16      4       4       4

Features: Lopez-Dahab algorithm / parallel computation

Timing Diagram

$L = \lceil \log_2 k \rceil$

T1: Latency of one multiplication done by mul_1, mul_2 and mul_3

T2: Latency of one multiplication done by mul_4 and mul_5

Comparisons with Software and Previous Work

Timing, resource utilization and performance comparison vs. software implementation based on LiDIA, run on Intel Xeon 2.8 GHz

Hardware: Xilinx XC2V6000; software: LiDIA

Field       FFs      LUTs     Clock period (ns)   FPGA latency (us)   LiDIA latency (us)   Speedup vs. LiDIA
GF(2^163)   10,918   26,999   9.971               33                  5290                 159
GF(2^233)   14,215   36,582   9.949               57                  9680                 171
GF(2^283)   24,664   35,196   11.096              146                 14060                96

Performance comparison vs. design by Gura et al. (target device: Xilinx XCV2000E)

                          Gura et al.             Our design, D=16        Our design, D=32
                          GF(2^163)   GF(2^233)   GF(2^163)   GF(2^233)   GF(2^163)   GF(2^233)
FFs                       6,442       NA          7,425       10,474      7,467       10,637
LUTs                      19,508      NA          18,749      25,838      25,768      35,800
Frequency (MHz)           66.5        66.5        69.6        65.2        62.7        60.3
Latency (us)              143         225         75          144         53          100
Speed-up vs. Gura et al.  1           1           1.91        1.56        2.70        2.25

Reconfigurable Computing Systems

• Definition: Microprocessors + reconfigurable components (FPGAs)

• Advantages: faster and more flexible than conventional computing technology

• Two selected systems: SRC-6 & SGI Altix-4700

[Figures: SRC-6 architecture; SGI RASC Blade architecture]

• SRC-6 chosen for our experiments with ECC and SGI Altix-4700 chosen for our experiments with Tate pairing

Hierarchy of Elliptic Curve Operations

SRC Program Partitioning

[Figure: the µP system runs a C function for the µP; the FPGA system runs a C function for MAP (HLL) and an HDL macro (VHDL)]

Partitioning Schemes of ECC in SRC 6

[Figure: four partitioning schemes between the µP and the FPGA: 0HL1, 0HL2, 0HM, 00H]

Results and Comparisons for Scalar Multiplication over GF(2^233)

Results of the timing measurements for several investigated partitioning schemes and implementation approaches

System level     End-to-end   DMA data-in   FPGA computation   DMA data-out   Total overhead   Speedup vs.   Slowdown vs.
architecture     time (us)    time (us)     time (us)          time (us)      (us)             software      VHDL macro
H00 (Software)   9710         NA            NA                 NA             NA               1             30.2
0HL1             524.2        14.7          258.9              7.6            265.3            18.5          1.43
0HL2             519.7        14.5          254.4              7.5            265.3            18.7          1.43
0HM              366.3        14.7          101.1              7.4            265.2            26.5          1.00
00H (VHDL)       365.7        14.6          100.5              7.6            265.2            26.6          1.00

Resource utilization for several investigated partitioning schemes and implementation approaches

System level   % of CLB slices   CLB slice increase   % of LUTs         LUT increase    % of FFs          FF count increase
architecture   (out of 33,792)   vs. pure VHDL        (out of 67,584)   vs. pure VHDL   (out of 67,584)   vs. pure VHDL
0HL1           99                1.86                 55                1.25            66                3.30
0HL2           99                1.86                 57                1.30            62                3.10
0HM            68                1.28                 51                1.16            32                1.60
00H            53                1.00                 44                1.00            20                1.00

Library of Hardware Macros for Reconfigurable Computers

• 25 hardware macros for operations in the Galois Fields, and elliptic curve cryptography

• Developed as a part of the DoD-sponsored project, Library Development and Experiments using Prototype Reconfigurable Parallel Computers (LUCITE), in 2002-2006

• Ported to two high-performance reconfigurable computers, SRC 6 & SGI Altix 4700

• Made available to other groups for research regarding reconfigurable computers and cryptography

• Thoroughly tested using reference software implementations based on the public-domain mathematical package, LiDIA

Macro Library for Elliptic Curve Cryptosystems

GF2m_PB:
• Trinomial squarer, Trinomial multiplier, Trinomial inverter
• Pentanomial squarer, Pentanomial multiplier, Pentanomial inverter
• Special PB squarer, Special PB multiplier, Special PB inverter
• NIST PB squarer, NIST PB multiplier, NIST PB inverter

GF2m_NB:
• NIST NB squarer, NIST NB multiplier, NIST NB inverter

ECC_GF2m:
• PB point adder, PB point doubler, PB coordinate converter, PB scalar multiplier
• NB point adder, NB point doubler, NB coordinate converter, NB scalar multiplier

Three kinds of macro libraries for SRC 6, 25 macros in total, written in VHDL and thoroughly tested

Architectures for Tate Pairing Cryptosystems

Hierarchy of Tate Pairing Operations

Analysis of Two New Algorithms Optimized by Kwon

[Two-column pseudocode: Algorithm 1 and Algorithm 2]

Common to both algorithms:
Input: $P = (x, y)$, $Q = (\alpha, \beta)$, with $x, y, \alpha, \beta \in F_{2^m}$
Output: $C = \tau(P, Q)$, where $C \in F_{2^{4m}}$

Both algorithms consist of three phases:
1. Initialization: $C \leftarrow 1$, setup of $u$ and $v$, and $\theta \leftarrow \alpha \cdot v$
2. Accumulative multiplication: a loop computing $C \leftarrow C^2$ and $C \leftarrow C \cdot A$, followed by the updates $\alpha \leftarrow \alpha^4$, $\beta \leftarrow \beta^4$, $v \leftarrow v + 1$, $\theta \leftarrow \alpha \cdot v$, and an update of $u$. One algorithm loops for $i = 0$ to $m-1$, the other for $i = 0$ to $(m-1)/2$.
3. Final powering of $C$

Top Architecture of Pairing Processor

Features:

• Hardwired logic instead of a stored-program machine

• Iterative structure

• Register files for intermediate results

• Main controller designed as a finite state machine

• The extension field multiplier CA and Multiplier 1 are used in both stages

Two Architectures Investigated for CA

6 multipliers:
1. lower latency
2. larger area
3. lower product of latency by area

3 multipliers:
1. higher latency
2. smaller area
3. higher product of latency by area

Timing Diagram for Algorithm 1

[Figure: timing diagram for Algorithm 1, in clock cycles from start to done, with three phases: initialization of θ, accumulative multiplications (m times), and final exponentiation. The last time marker is $(m+3)(T_1+2) + 3T_2 + T_3 + 8$.]

Notations:
T1: Latency of CA and Multiplier 1
T2: Latency of Multiplier 2
T3: Latency of Inverter
MUL 1: Multiplier 1
MUL 2: Multiplier 2
CA: Special multiplier over $F_{2^{4m}}$
INV: Inverter
REG: Storing results to registers
$A \cdot B$: Multiplication over $F_{2^{2m}}$

Timing Diagram for Algorithm 2

[Figure: timing diagram for Algorithm 2, in clock cycles, with the phases initialization of θ, accumulative multiplications, and final exponentiation. The last time marker is $\frac{m+9}{2}(T_1+2) + 3T_2 + T_3 + 8$.]

Implementation Results for GF(2^239)

Target device: Xilinx XC2VP100-6FF-1704

[Figure: # CLB slices (×10^4) vs. latency (us) for GF(2^239); design points: Algorithm 1 (D=16), Algorithm 1 (D=32), Algorithm 2 (D=16); an arrow marks the direction of lower product of latency by area]

Implementation Results for GF(2^283)

Target device: Xilinx XC2VP100-6FF-1704

[Figure: # CLB slices (×10^4) vs. latency (us) for GF(2^283); design points: Algorithm 1 (D=16), Algorithm 1 (D=32), Algorithm 2 (D=16); an arrow marks the direction of lower product of latency by area]

Speed-up over Software

Software Platform: Intel Xeon 2.8 GHz; C++ library, LiDIA, for subfield arithmetic

Hardware Platform: Xilinx XC2VP100-6FF-1704

[Figure: bar chart of speed-up over software for Algorithm 1 (D = 32) and Algorithm 2 (D = 16) over GF(2^239) and GF(2^283); bar labels: 297, 251, 183, 156]

Comparison with Hardware Implementation of Comparable Schemes (1)

MOV Security

Elliptic Curve Discrete Logarithm Problem over E(GF(q))
  → Menezes-Okamoto-Vanstone algorithm →
Discrete Logarithm Problem over GF(q^k)

Curve                  Field F_q   MOV security
Binary elliptic        q = 2^m     k·m = 4m
Binary hyperelliptic   q = 2^m     k·m = 12m
Cubic elliptic         q = 3^m     k·(log2 3)·m = 9.5m
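The MOV security figures used in the comparisons that follow can be reproduced from this table. The sketch below is a small calculator under one stated assumption: the embedding degree k = 6 for the cubic (characteristic 3) case is inferred from the table's 9.5m, since 6·log2(3) ≈ 9.51. The function name and the curve labels are hypothetical.

```python
import math


def mov_security_bits(curve: str, m: int) -> float:
    """Bit size of the extension field GF(q^k) reached by the MOV reduction."""
    if curve == "binary elliptic":          # q = 2^m, k = 4
        return 4 * m
    if curve == "binary hyperelliptic":     # q = 2^m, k = 12
        return 12 * m
    if curve == "cubic elliptic":           # q = 3^m, k = 6 (inferred, see lead-in)
        return 6 * math.log2(3) * m
    raise ValueError(f"unknown curve type: {curve}")


if __name__ == "__main__":
    for curve, m in [("binary elliptic", 239), ("binary elliptic", 283),
                     ("binary hyperelliptic", 103), ("cubic elliptic", 97)]:
        print(curve, m, round(mov_security_bits(curve, m)))
```

This is what makes designs over different fields comparable: GF(2^239) and GF(3^97) land at roughly the same MOV security level even though their field degrees differ widely.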

Comparison with Hardware Implementation of Comparable Schemes (2)

               Our Alg. 2        Kerin             Grabher
Curves         Elliptic          Elliptic          Elliptic
Fields         GF(2^239)         GF(3^97)          GF(3^97)
MOV security   956               922               922
FPGA device    XC2VP100          XC2VP125          XC2VP4 FF672
Controller     Hardwired logic   Hardwired logic   Microprocessor

[Figure: # CLB slices (×10^4) vs. latency (us) for Our Alg. 2, Kerin, and Grabher; an arrow marks the direction of lower product of latency by area]

Comparison with Hardware Implementation of Comparable Schemes (3)

               Our Alg. 1        Ronan
Curves         Elliptic          Hyperelliptic
Fields         GF(2^283)         GF(2^103)
MOV security   1132              1236
FPGA device    XC2VP100          XC2VP125
Controller     Hardwired logic   Hardwired logic

[Figure: # CLB slices (×10^4) vs. latency (us) for Our Alg. 1 and Ronan; an arrow marks the direction of lower product of latency by area]

Porting Tate Pairing to SGI Altix-4700

Features:
1. Serial-in-parallel-out registers
2. Two SRAMs for input and output
3. Computations and communications between SRAMs and FPGAs scheduled by the controller

Performance and Cost of Tate Pairing on SGI Altix 4700

Underlying   Digit size of      Frequency   Algorithm block   Total resource   Latency of        Latency of SGI    Speed-up,
field        multiplier in CA   (MHz)       utilization,      utilization,     software, LiDIA   Altix 4700 (us)   Altix vs.
                                            # slices (%)      # slices (%)     (ms)                                software
GF(2^239)    32                 100         29,920 (34%)      41,641 (46%)     11.00             36.47             302
GF(2^241)    32                 100         30,286 (34%)      41,989 (47%)     11.10             37.23             298
GF(2^283)    32                 100         36,481 (41%)      48,202 (54%)     18.20             46.14             394
GF(2^353)    32                 100         45,543 (51%)      57,264 (64%)     26.90             67.26             400
GF(2^367)    32                 100         47,271 (53%)      58,992 (66%)     28.40             70.47             403
GF(2^379)    32                 100         49,819 (56%)      61,540 (69%)     30.07             72.70             414
GF(2^457)    32                 100         58,956 (66%)      70,677 (79%)     42.70             100.76            424
GF(2^557)    8                  66          37,931 (43%)      49,652 (55%)     83.8              675.50            124
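The columns of this table are tied together by simple arithmetic, which the sketch below recomputes for a few rows. It assumes the figures given in the note on the comparison slide (11,721 slices for core services and 89,088 slices in a Virtex-4 LX200), and it assumes that the total utilization is the algorithm block plus core services, which is an inference from the numbers rather than a statement in the slides. Function names are hypothetical.

```python
CORE_SERVICES_SLICES = 11_721   # slices used by core services (from the slide note)
LX200_SLICES = 89_088           # slices available in one Virtex-4 LX200

# (field degree m, algorithm block slices, LiDIA latency in ms, Altix latency in us)
ROWS = [
    (239, 29_920, 11.00, 36.47),
    (283, 36_481, 18.20, 46.14),
    (457, 58_956, 42.70, 100.76),
    (557, 37_931, 83.80, 675.50),
]


def total_slices(block_slices: int) -> int:
    """Assumed relation: total = algorithm block + core services."""
    return block_slices + CORE_SERVICES_SLICES


def utilization_percent(slices: int) -> float:
    """Share of the LX200 occupied by the given number of slices."""
    return 100.0 * slices / LX200_SLICES


def speedup(software_ms: float, hardware_us: float) -> float:
    """Software latency over hardware latency, both converted to microseconds."""
    return software_ms * 1000.0 / hardware_us


if __name__ == "__main__":
    for m, block, sw_ms, hw_us in ROWS:
        total = total_slices(block)
        print(f"GF(2^{m}): {total} slices, "
              f"{utilization_percent(total):.1f}%, "
              f"speed-up {speedup(sw_ms, hw_us):.0f}")
```

The drop in speed-up for GF(2^557) follows directly from the smaller digit size (8 instead of 32) and the lower clock frequency (66 MHz instead of 100 MHz) listed in that row.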

8 binary fields ranging from GF(2^239) to GF(2^557) are selected for our experiments

Performance and Cost Comparisons between ECC and Tate Pairing

Software performance comparisons for one operation between pairing and ECC, implemented via LiDIA and run on Intel Xeon 2.8 GHz

Tate Pairing                 ECC                          Speed-up,
Field size   Latency (ms)    Field size   Latency (ms)    pairing vs. ECC
239          11.00           233          11.36           1.03
283          18.20           283          21.85           1.20
457          42.70           409          45.19           1.06
557          83.80           571          116.80          1.39

Hardware performance/cost comparisons for one operation between pairing and ECC, targeted on SGI Altix-4700

Tate Pairing                                             ECC                                                      Speed-up,
Field   Digit   f       CLB            Latency           Field   Digit   f       CLB            Latency           pairing
size    size    (MHz)   # slices (%)   (us)              size    size    (MHz)   # slices (%)   (us)              vs. ECC
239     32      100     41641 (46%)    36.47             233     64      100     41284 (46%)    43.84             1.20
283     32      100     48202 (54%)    46.14             283     64      100     49518 (55%)    58.87             1.28
457     32      100     70677 (79%)    100.76            409     64      100     64249 (72%)    101.39            1.10
557     8       66      49652 (55%)    675.50            571     32      66      62522 (70%)    388.90            0.58

Note:
1. Virtex-4 LX200 FPGAs are used in Altix-4700.
2. 11,721 slices for core services, and 89,088 slices in LX200

Conclusions for Tate Pairing

• First published FPGA implementation of the Tate pairing schemes for binary elliptic curves
• Two algorithms improved, implemented and compared
• Algorithm 2 is faster, but its implementation takes more area
• Speed-ups in the range 150-300 demonstrated for Xilinx XC2VP100 vs. Xeon 2.8 GHz
• Our designs outperform existing implementations of comparable schemes in terms of the execution time by a factor of 10-20, and in terms of the product of latency by area by a factor of 12-46.
• The first complete investigation of pairing cryptosystems over binary elliptic curves on a reconfigurable system
• Tate pairing cryptosystems are comparable with ECC in terms of software/hardware performance, against the common belief that Tate pairings are slower than traditional elliptic curve cryptosystems

Summary of Contributions

Future Work

• Comparison of three schemes of Tate pairing by the same research group
  – Identical assumptions
  – Design techniques
  – Optimization schemes
  – Tools and coding style

• Development of a general processor for elliptic curve based cryptosystems, including ECC, HECC, and pairing
  – FPGAs for underlying field arithmetic
  – ASICs for a stored-program machine

Publications

1. S. Bajracharya, C. Shu, K. Gaj, and T. El-Ghazawi, Implementation of Elliptic Curve Cryptosystems over GF(2^n) in Optimal Normal Basis on a Reconfigurable Computer, 14th International Conference on Field Programmable Logic and Applications FPL’04, Antwerp, Belgium, Aug. 2004.
2. C. Shu, K. Gaj, and T. El-Ghazawi, Low Latency Elliptic Curve Cryptography Accelerator for NIST Curves over Binary Fields, IEEE International Conference on Field Programmable Technology FPT’05, Singapore, Dec. 2005.
3. C. Shu, S. Kwon, and K. Gaj, FPGA Accelerated Tate Pairing Based Cryptosystems over Binary Fields, IEEE International Conference on Field Programmable Technology FPT’06, Thailand, Dec. 2006.
4. C. Shu, S. Kwon, and K. Gaj, FPGA Accelerated Multipliers over Binary Composite Fields Constructed via Low Hamming Weight Irreducible Polynomials, submitted to IEE Proceedings of Computers & Digital Techniques.
5. C. Shu, S. Kwon, and K. Gaj, A Hybrid Multiplier of Binary Composite Field with Efficient Basis Conversion, submitted to IEEE Transactions on Computers.
6. C. Shu, S. Kwon, and K. Gaj, Reconfigurable Computing Approach for Tate Pairing Cryptosystems over Binary Fields, submitted to IEEE Transactions on Computers.

Questions?

Thank you!