Implementing ECC – 1/110

April 2005 Darrel Hankerson Campinas, Brazil Auburn University

(a narrow survey)

Institute of Computing – UNICAMP

Implementing Elliptic Curve Cryptography Overview Implementing ECC – 2/110 sample selected topics of practical interest. Software solutions on general-purpose processors rather than dedicated hardware. Techniques with broad applicability. Methods targeted to standardized curves. Present proposed methods in context. Limit coverage of technical detailsnecessarily (but involves “implementing” platform considerations). I I I I I Objective: Talk will favor: Goals: Implementing ECC – 3/110

Focus: higher-performance processors 256 MB – 8 GB 2 MB “disk”, 304 KB RAM 0.5 GHz – 3 GHz 10 MHz, single AA battery heats entire building fits in shirt pocket Sun and IBM workstations RIM pager circa 1999 SPARC or (Pentium) Intel x86 (custom 386) “Higher-performance” includes processors commonly associated with workstations, but alsosmall found portable in devices. surprisingly Implementing ECC – 4/110

Optimizing ECC q F field arithmetic Curve arithmetic Big number and modular arithmetic Elliptic Curve Digital Signature Algorithm (ECDSA) generation Random number 1. Field-level optimizations. 2. Curve-level optimizations. 3. Protocol-level optimizations. General categories of optimization efforts: Constraints: security requirements, hardware limitations, bandwidth, interoperability, and patents. Implementing ECC – 5/110

Optimizing ECC... Choose fields with fast multiplication andUse inversion. special-purpose hardware (cryptographic coprocessors, DSP, floating-point, SIMD). Reduce the number of point additionsReduce (windowing). the number of fieldcoords). inversions (projective Replace point doubles (endomorphism methods). Develop efficient protocols. Choose methods and protocols thatcomputations favor or your hardware. I I I I I I I 1. Field-level optimizations. 2. Curve-level optimizations. 3. Protocol-level optimizations. Implementing ECC – 6/110

Context: Protocols such as ECDSA IEEE 1363-2000 ANSI X9.62. dard – FIPS 186-2. but unclassified data: ECDSAcation. selected for authenti- Jan 1999 ECDSA formally approved asJan an 2000 ANSI standard ECDSA – formally approved as a US Federal Gov stan- Feb 2005 NSA recommends algs for securing US Gov sensitive Aug 2000 ECDSA formally approved as an IEEE standard – The Elliptic Curve Digital Signaturethe Algorithm elliptic (ECDSA) curve is analogue of the DSA. Provides data origin authentication, datanon-repudiation. integrity, and . . n . ] ) 1 s Implementing ECC – 7/110 , r − . ( mod n } , Q 1 . G is 1 x n [ + m = , and public key and r ··· d (consisting of the curve, mod D from {z times + } D k . k and ) G dr ) m + 1 does the following: ( + y which is known a priori. , G | A e 1 , {

ECDSA Signature Generation x G 1 = m − SHA-1 k =( kG = = kG e s , etc.), private key G has authentic copies of ’s signature for the message B has domain parameters A . A 3. Compute 4. Compute 2. Compute 5. 1. Select a random integer To sign a message The computationally expensive operation ismultiplication the scalar in step 2, for a point dG = I I field, base point Signer Q is G . n . ] 1 . mod r − Implementing ECC – 8/110 n = rw , 1 v [ = does: 2 . B u ) ,

ECDSA Verification 1 y m , 1 and . in step 5, where only x . on n n ) . ) Q n m s 2 =( ( , u r ( Q mod mod are integers in 2 1 mod s u and − ew 1 s + x SHA-1 G = and 1 = G = = 1 1 u r e w u u v ’s signature A 6. Compute 1. Verify that 2. Compute 5. Compute 4. Compute 3. Compute 7. Accept the signature if and only if known a priori. To ver ify The computationally expensive operation ismultiplications the scalar I I . p Implementing ECC – 9/110

Deployment Notes... for 256-bit or larger p F NSA presents strategy and recommendationssecuring for US Gov sensitive andcommunications. unclassified “Suite B” protocols for keyauthentication agreement are and ECC only: ECMQVkey agreement, and and ECDH ECDSA for for auth. NSA obtains licensing rights to(MQV). Menezes-Qu-Vanstone Covers curves over February 2005, at the RSA conference US National Security Agency (NSA) and ECC Fall 2003 2048 3072 8192 1024 length 15360 RSA/DL key Implementing ECC – 10/110 384 160 224 256 512 length ECC key Deployment Notes... Example algorithm SKIPJACK Triple-DES AES-Small AES-Large AES-Medium 80 112 128 192 256 key length Symmetric cipher Marketing demands 256-bit AES. Key size comparison. ECC especially attractive here. I I I Example: BlackBerry 2-way pager (ECDH-512) a Implementing ECC – 11/110

Deployment Notes... ECC (256) RSA (3072) DH (3072) Simple Password Exponential Key Exhange. Key generationEncrypt or verify 166 msDecript or sign 150 msSource: Herb Little, Too long RIM. 168 ms 52 ms 38 s 8 s 74 s 74 s FacilityOver the Air (OTA) ProvisioningPolicy Authentication ECSPEKE OTA Re-keya ECDSA Protocol ECMQV Timings on BlackBerry 7230 for 128-bit security. BlackBerry EC algorithms. RSA-1024 used for code signing (for verify speed). I I I Topic I

Prerequisites

Implementing ECC – 12/110 . } ∞ }∪{ 3 , . 2 2 6 = = Implementing ECC – 13/110 into an abelian ) K K } K P ( + E char char as the identity. ··· Elliptic curve groups , ∞ , {z times b + solves the equation b k P ) + + y 2 , + x ax ax ( P | | + + = K 3 3 x x . × kP K K = =

point at infinity 2 ∈ y xy ) y , + x 2 ( is an integer. y { k )= K ( over finite fields E where group with the forms 3. The chord-and-tangent rule make 2. 4. Scalar (or point) multiplication is the operation 1. Interested in elliptic curves in simplified Weierstrass ) 3 y , 3 . x ( x R = R = P + P Implementing ECC – 14/110 y ) 1 y , 1 x ( = (b) Doubling:

Chord-and-tangent rule P ) 3 y , 3 . x ( x R = R = Q + P y ) ) 1 2 y y , , 1 2 x x ( ( = = P Q (a) Addition: Geometric addition and doubling of elliptic curve points ) 3 y , 3 x ( x 1 = R y 1 1 y y + x x − − 1 + ) ) y 1 x y x x y ) )+ y 1 λ + y , x 1 − − − x 1 ( + = 1 1 x + P x x 2 1 1 ( ( x x λ λ ( ) 3 Implementing ECC – 15/110 y ,

3 λ x ( x = R

Point Arithmetic 2 1 a b x + + 2 2 2 1 x x y ) ) x 1 1 2 y y , , 1 x 2 + x x − ( ( = 2 = = 1 b 1 1 1 P Q x x x a b x x − + 2 + − + 2 +

λ 2 λ λ ax .

ax λ ) + + + 2 + 2 2 3 y 3 λ

x λ , x 2 = x = ) ) 1 2 2 1 xy x x y y =( 1 − + + 2 x 2 1 / 2 / x x Q ) y 1 ( ( , a y λ / / ) ) ) 1 + 1 2 ,curve: + y 2 1 y y } 1 , x 3 x − + 1 3 , 2 1 ( ,curve: x 2 y y 2 ( ( / ∈{ = =( Q Q K K P P P P P + + op 2 2 − − P P Addition and Doubling Let char char kP Implementing ECC – 16/110

Calculating additions. 2 / t . } 1 . , P 0 + ∈{ Q i k do ← 0 Q for i to 2 i 1 k then 1 0 . − 1 ∑ − = t Q i t . 2 = = i Q k ← . k from Q ∞ i ← Q point doubles and expected 3.2 If 3.1 t 4. Return 2. 1. Write 3. For Reduce the number of pointReduce additions the (windowing). number of fieldcoordinates). inversions (projective Replace point doubles (efficient endomorphisms). Use curves over fields where fast arithmetic is available. I I I I Improving the performance: Simple double-and-add approach: Analysis: ... . kP 0 = i k i additions. + i (NAF): k 3 Implementing ECC – 17/110 / t and Calculating } . point doublings and 1 ∞ t  . ← , P . 0 Q P − + Q ∈{ Q i .Set ← i k 2 Q ←

non-adjacent form i k Q 0 in t ∑ do = i then k 0 1 then where . to )= 1 − , k i t Q ( . 2 2 = = i i i Q k k k additions. ← 0 from t Q 2 = i ∑ i / t point doublings and an expected 2.2 If 2.1 2.3 If )= t k 1. Find NAF 3. Return 2. For ( NAF an expected Reduce adds by writing Simple double-and-add approach has Analysis: . ) } 3 − . NAF , } 0 1 , w {z 0 , − 0 width-3 NAF 1 , 1 − | w 2 =(

Width- Implementing ECC – 18/110 0  2 · ) 3 ,..., − doublings and an ex- 3 t  +( , 4 2 1 ·  1 , 0 . = 10 NAF has ∈{ . A NAF is a width-2 NAF. 13

NAF i ) k w 1 = w ) } + 1 . , 1 0 w , ( additions. {z + 1 / width- binary , ) , where c 1 1 i consecutive digits, at most one is nonzero. 1 k | by width- 2 i 2 w + =( k 0 ) } kP w t log 1 = ∑ ( i , / 0 t , = ≤b 1 k {z t − NAF , 0 , 2. Among 3. 4. Density is 1. 1 | pected Example: 13 (base 10) has expansions NAF reduces the number ofgeneralized required to point a additions. Can be ( Analysis: 0 k NAF... w Implementing ECC – 19/110

Width- produced right-to-left ( i 2 i k ∑ -NAF are stored. = w k -NAF is inexpensive. -NAF w w Solinas [DCC 2000] gives algoperations. using only simple bit Digits of first). Some point mult algs wanttypically to digits process of left-to-right, so I I I -NAF variant. See also Muir and Stinson [CACR Tech Avanzi [SAC 2004] gives (optimal) left-to-right calculation of Calculting the w Rep]. . 2 Q  . . 2 ) 4 . 6 = Q y 2 1 , − 4  Implementing ECC – 20/110 Q x 1 2 x Q 2 =( + x with b 2 can be computed 1 ) x and 2 2 + Q y 2 2  Q , − 2 Q 1 ax

Montgomery’s method + + x , 1 1 2 Q + x Q Q 3 =( 2 x + x 2 and 1 Q = x ) 3 y xy + , 4 and 3 + x x ) 2 1 y = y , =( 3 1 x -coordinate of 2 x -coordinates of x Q x =( + 1 1 Q Q from the Let Let Then Thus, the I I I I Montgomery’s method is a usefulmultiplication benchmark algorithms for for point curves over binary fields. (Due toan López idea and of Dahab Montgomery.) and based on 0 1 = = ) ) 1 1 + + j j ( ( − − t t k k Implementing ECC – 21/110 if if P , ] 2 ) P 0 ) k 1 1 , k ] + l P ( ··· ) ) 2 1 2 Montgomery’s method... , P + + j ) l ( 1 − t + +( k l ) } -coordinate of the result, sufficient 1 lP x , + +( j ( {z lP − 2 lP t [ [ k | j } − → t ] k P ) ··· 1 2 ↓↓ {z + − t l k ( 1 , − for ECDSA. Each iteration requires one doubling andPossesses one natural addition. resistance to someattacks. side-channel No extra storage. Produces only the t lP [ | k ( I I I I ), . of y b 2 x + ]+ 2 1 y ; otherwise -coordinate 1 ax + ˜ y y

-coordinate .) -coordinate of 2 ) -coordinates y + y Implementing ECC – 22/110 x -coord y x = 2 1 . If this agrees . , x x ) ) 1 x 2 y y )+ , y = x , x =( y 2 1 x + P x 2 x )+( + =( 1 ˜ )( 2 then set y and x P y , ) 1 ) P x 1 1 + ) ( Obtaining the y 1 1 , + x 1 + k x ( k )[( ( x =( -coordinate from a Montgomery + and y . 1 -coord of 1 ) kP x x x 1 ( y 1 . , + 1 − 1 ˜ 1 y x -coord of x from ˜ y x = P = ) (e.g., solve quadratic =( 1 1 1 y y can be recovered as: kP kP + k of obtaining kP (Derived using the addition formula( for of with the set Compute the 2. (Point compression method) Guess at the 1. (Direct) After the last iteration, have the multiplication: Two methods to find ) 2 ∼ Z

λ , 2 Y

d projective Name λ Jacobian Jacobian projective projective , 2 X Implementing ECC – 23/110 c López-Dahab (LD)

λ ) ) ) ) ) 3 3 2 Z Z Z Z Z )=( / / / / / 1 Y Y Y Y Y Z , , , , , , 2 2 Z Z Z 1 Z Z Projective coordinates / / / Y Relation / / , X X X 1 ( ( X X ( ( ( X ( 4 3 6 if bZ bZ bZ 3 b 6 + ) + + 2 2 bZ 2 + Z bZ Z Z 2 2 Z 2 , + 2 + 2 2 b ax 4 aX Y by aX , aX + + . An equivalence class is called a + 2 } + 3 ∗ + 3 aXZ ) X x ax aXZ Z 3 ( 0 K X 3 + , + X + = 3 X ∼ = 0 3 ∈ 3 , x X = ) xy X = 0 1 λ Projective form = = ( + Z = be positive integers. Define equivalence relation , 2 2 XYZ Z 2 1 y y XYZ \{ 2 d XYZ + Y Y 3 , . Y , + c Z + 1 K 2 2 2 X Y Y ( Y

Let on for some point Curve: Curve: , ) 2 : 1 ) 1 y aZ , :1 Z + + 2 2 x C X Y ( : 2 . + B 2 )+ G X x Implementing ECC – 24/110 X ← + ← + D F F 1 ) ) , x 2 )+( , Z ( B x 1 1 E λ + Z + Z 1 + : E ← x ( ← 1 D y ( C : Y / ← + m ) : , Y 2 , 2 2 Projective coordinates... 1 1 a y A F , X X 2 + + ← + Z 1 ) 2 1 y X 2 x ( Z Y , 2 )=( + X ← + Z 1 AC λ 2 x ← : X ← B ( + Y E , : ← λ 1 , G Y 2 X + ( + C 2 2 1

λ ← Z Z 2 ← Y x ← A Projective coordinates reduce the numberinversions of in field point arithmetic. Point addition in affine for Projective I I I Field inversion is typically expensivemultiplication. relative to field } M S 8 1 by S 3 1 , M 3 2 , + M M , kP 8 2 I M M , 1 8 I 4 1 Implementing ECC – 25/110 = {z proj projective A S 3 1 2 S , M 4 + 2 , squaring M M } , 4 2 I 1 M proj = , , in binary case: 1 4 I Doubling Addition (mixed) 0 | D S

Projective coordinates... 1 M ≤ ∈{ / } A I 3 a 3 , 1 − b ) + 2 = + Z D a 2 ) /

, 3 Y b ax Z , multiplication, )= / Z + + Y = / M 3 , 2 x ax X 2 M ( {z Z + affine + = / I 3 ( . X x xy ( 1 3 M = + + 3 2 2 y y M inversion, ≤ 2 I = + Affine López-Dahab Affine Jacobian I | I

Curve Curve non-adjacent form Affine vs projective field operation counts: Rough estimate of threshold gives 8 = . 8 M 954 990 a 2430 1306 2160 1090 1980 2928 / I = 5 M = / I 939 987 M -NAFs. 1701 1512 1386 1303 1087 1953 / I w and I 1 1 5 1 5 243 216 198 325 Implementing ECC – 26/110 Field operations = M M 486 432 396 914 328 982 1298 1082 . / I ) d d 1 D + 162 162 163 162 162 163 162 162 3 z c c + -coordinate only. A +32 81 81 54 54 35 6 x b Point multiplication costs 162 162 z d EC operations 3 + 7 0 3 0 0 0 0 3 0 z Points stored + 0 4 – 0 0 0 4 – w 163 z ( / ]

z , on-line precomputation) [ for the NIST random binary curve B-163 over 2 kP F Projective projective projective projective kP = Addition for Montgomery. c 163 2 F Right columns give costs inAffine. terms of field mults for a b

MethodUnknown point ( Coordinates BinaryBinary NAF affine affine Window NAF affine Montgomery affine field Example Required point doubles limit the improvement via , cofactor 2 Implementing ECC – 27/110 163 2 , cofactor 4 , cofactor 4 , cofactor 4 , cofactor 4 1 1 F 1 + + 233 283 409 571 + 2 2 2 2 5 2 3 x x F F F F over x 1 + + + 7 5 + 6 1 x 1 x over over over over 2 x x 1 1 1 1 + + + + + + + + + + 7 74 12 87 10 3 3 3 3 3 x x x x x x x x x x . + + + + + b = = = = = + 163 233 283 409 571 xy xy xy xy xy x x x x x 2 x + + + + + = = = = = 2 2 2 2 2 + f f f f f y y y y y 3 x = Appendix: NIST curves over binary fields xy K-163 K-233 K-283 K-409 K-571 + 2 y 571} over each of these fields, each with cofactor 2: 1. Koblitz curves: 2. Randomly-generated curves B-{163, 233, 283, 409, 1 1 − − Implementing ECC – 28/110 96 32 2 2 + + 192 96 2 2 1 1 + − − + 64 96 224 128 p 2 2 2 2 1 − − − − − randomly generated and have 192 224 256 384 521 2 2 2 2 2 b + x 3 − 3 x Appendix: NIST curves over prime fields = CurveP-192 Prime P-224 P-256 P-384 P-521 2 y Curves Special form of the prime speeds reduction. prime order. . p F . p ; that is, p √ divides both 2 √ 2 . 2 Implementing ECC – 29/110 + n ∞ 1 |≤ + t | p where ≤ 2 n ) p Z where F t ⊕ ( 1 is replaced by any finite field. E n − p # Z 1 F ≤ + different elliptic curves over . p p . p ) The number of points on the elliptic p √ 2 2 ≈ )= ) p − p . F 1 F ( 1 ( E + − E # can be computed in polynomial time using p # p is an abelian group with identity is isomorphic to ) p ) ) p p F ( and F F ( ( 1 E Hasse’s Theorem E Hence, # E n curve is Schoof’s algorithm 3. ( 2. 4. 1. There are about 5. Appendix: Basic facts for curves over prime fields Corresponding facts hold if Topic II

Endomorphism methods

Implementing ECC – 30/110 Implementing ECC – 31/110 is large. M / I

Reducing the cost of doubling Eliminates most field inversions. (e.g., in signature generation) andmemory the requirement additional is acceptable. computable maps. 2. Use precomputation if the point is known in advance 3. Replace doubling by halving in the4. binary case. Replace (some) doublings by other efficiently 1. Use projective coordinates when defined b defined over + . Implementing ECC – 32/110 E replace point 3 2 . Special case: x

φ = for = mP ) ) q 2 y q y , y 7→ x : ,

q β P , then E x ( : ( 3 ] 7→ m 7→ ) [ y ) , y x , ( x : ( map)

: . Consider φ is of order )

m φ p F ∈ mod 3

β ( . 1 .If P p ≡ F , the map in 2 is inexpensive compared with field p 2 . Particular case: Koblitz curves, 7→ − Replace doublings: efficient endomorphisms th power map) q q = over P F is an endomorphism. q 3. Let 1. (multiplication by 2. ( mult, and the map in 3 isKoblitz a curves: single inexpensive field applications mult. of doubles. (Point halving has similar goal.) Example 3 allows the number of doublings to be reduced. For . P 0 P a 0 % u K K K . K 0 0 K )+ K a K P + ( 0 1 z 1 a a

 τ a 9 9 1 by elements 9 9 + 9 u ) 1 Implementing ECC – 33/110 0 a m ... +

Koblitz curves 2 + F 1 ( ··· . − and curve points ··· ··· a ) m z 2 a 1 E 1 y − )+ − − , m 0 1 P m a 2 Squaring a binary polynomial Õ ) a ( Õ x Õ l Õ 1 1 Ò Õ ( . τ − l 7 − m u a − 7→ 2 √ = ) =( + y µ P , µ ) 1 x 0 ( = u for : + τ 2 + τ ) : 1 x τ P 2 1 ( ⇒ F + +

τ u = 3 3 µ x x +

τ = µ = = ··· P : + = 2 xy xy (field squaring) is inexpensive in comparison to ] l

2 τ τ τ + + + [ l 2 2 u + Z P ( y y 2 2 of (Frobenius map) τ τ Makes sense to multiply points in : : 0 1 I I I I E E Define curves over Applying field multiplication. . i 3

τ P i 3 k = α 0 ∑ w

τ − ) P as 1 . 1 .If k w α α +

τ 5

α τ τ ( Implementing ECC – 34/110 − 7→ − mod 1 = α τ u : , and P 0 on a Koblitz curve 1 5 = N + u + 2 kP α τ τ , the element of the

τ 0 = 1 ) is subtracted, and the . Then + r 3 w 5 3 α τ + τ , 0 . 0 1 r w

τ + -NAF is analogous to ordinary = 4 τ

1 τ -NAF of α 0 Strategy for w τ + 5

τ , then − 0 NAF: for odd = = w 5 is Euclidean with respect to a ] small and sparse (to reduce the number of required

τ | [ i k Example: representatives are width- Finding a width- and gives a width-3 result is divisible by equivalence class (mod Z | I I point additions). Basic idea: since field squaringwith is cheap, expand , 0 . k : P ) u m of

. α 2 ] Strategy... i τ F

[ τ i ( point Z c Implementing ECC – 35/110 ) E 0 in 1 = i t ) ∑ 1 + . − w } ( τ 0 ( . / 1 / ) m − 1 }∪{ w u 2 − α < m

τ u ( ∈{ -adic expansion i τ c mod do w k for odd 0 then add or subtract appropriate and P = . to in the main subgroup of u 0 0 m ). t Q α k

P τ 6 = Q ≈ i t c ← . for from Q ∞ i ← kP Q where 5.1 5.2 If 1. Compute 2. Compute 4. 3. Find the width- 5. For 6. Return ( J. Solinas, Efficient arithmetic on Koblitz curves. Designs, Codes additions. and Cryptography, 2000. To compute No point doublings. Expect roughly for a kP 78 78 116 adds 3+47 3+47 Implementing ECC – 36/110

Operation counts 0 0 232 232 1+232 doubles . -NAF

τ 4 233 -adic NAF) is not free.

τ = -NAF m NAF width-4 NAF width- method τ double and add requires two or three field squarings, each (for the with τ 0 m k 2 F curve random Koblitz Applying Finding costing roughly 15% of a field multiplication. I I curve over Point doubles and additions expected in finding .) 229233 Implementing ECC – 37/110 − 160 2 be an element of p = F p ∈

β is an endomorphism. . (In P-160, , and let

Using efficient endomorphisms ) ) b y , + x requires only 1 field multiplication. 3 β x (

mod 3 φ ( = 7→ 2 1 ) . y y 1 : ≡ , (WAP) x p E = ( | :

φ elliptic curves with efficient endomorphisms, CRYPTO 2001. for faster point multiplication on certain elliptic curves, PKC 2002. order 3. Let Let φ Computing | 1. Gallant, Lambert, and Vanstone. Faster point multiplication on 2. Park, Jeong, and Kim. An alternate decomposition of an integer I I I I I GLV observed that an endomorphismreduce may the be number used of to doublesexpansion (even is if not a efficient). Koblitz-like Example . is a

φ λ . (This n 1 2 √ k k |≈ i , where Implementing ECC – 38/110 k P |

λ NAF of NAF of , which can be ) . w w = n P ( P

where φ φ 2 width- width- ) k n 0 0 + , , 1 2 k k P 1 mod k in the example.) 1 1 ( , , ) 1 2 = k k n λ 2 P k λ ··· ··· 2 +

Using efficient endomorphisms... mod k 1 ( : t t ) of the characteristic polynomial of by multiplication: , , k + 1 2 n i 1 k k kP P ≡ G 1 be a point of prime order h k k ) ≡− p =

λ F ( can be done efficiently.) computed via interleaving. Write kP + E 2 negligible if domain parameters are set in advance. acts on • • i ∈ λ root (modulo ( φ To compute Approx half the doubles arek eliminated. Cost of finding G I I Let 1 ) k P y

, λ x 0 2

β 2 k k 3 |{z} + )=( P y ) } , prime 0 1 x k ( r Implementing ECC – 39/110 1

λ + k {z 0 2 , ) k r p 2 ⇒ | F 63 =

A very special case =( = p P mod 3 7 ) ( over 0 1 ) . Then k y + 1 , 2 + x 195 ) 0 ≡ 2 195 1mod 2 1 − k 2 3 ) 3 + + 2 < x − is less than a field multiplication. + 0 1 + 194 194 x k =

λ 390 2 2 390

2 β 3 2 2 y + + for : 0 = 1 389 =(( E )= k 389 p 2 p 2 P + (( ) F 2 0 1 ( = k k E 195 β # + 2 , )+ 0 2 0 2 y k 2 = k , − x n 3 195 ( = 1 2 is free. 195 k 2 k 2 k =( = =

λ kP Solinas (CORR 2001-41) gives anand example where finding Write The cost of calculating . 2 ) x y , a + x 2 + x

λ =( λ + P . Basic idea: Implementing ECC – 40/110 + ) 2 2 2

λ x y

, λ x 2 from = = x ) 2 2 for for 2 x (2 mul, 1 inv) y y =( , 2 2 x P x a 2 + or 2 + =( x

λ P λ from 2 + ) + y 2 2 ,

λ x x 2 ,1inv) x = = b =( + 2 2 2 x y 2 x P x / . Calculate:

λ b x / + + y

Point halving for curves over binary fields 2 2 + x x x = = = 2 2 Asiacrypt ’99. and method, patent application, 2000. x y (2 mul, 1 mul by

λ Doubling in affine: seek Halving: seek 1. E. Knudsen, Elliptic scalar multiplication using point halving, 2. R. Schroeppel, Elliptic curve point ambiguity resolution apparatus Let solve I I . 1 ) 2 y )= + a 2 ( x ) Tr 1 Implementing ECC – 41/110

Point halving... +

2 λ b . y . a } (( 1 + G + , Tr 2 0 2 x . x ) 1 1 ,so ∈{ = 1 + + . 1 2 − λ b , consider λ m λ b y 2 2 + x )= c = 2 a =( )+ + ( + λ b λ for generator b 1 2 2 b x x ) Tr ··· or + a λ ( λ

+ λ + ( )= 2 Tr 2 x 2 c = x ( x .

+ λ b λ Tr p = c )) = 2 = y kG )= x 2 )= ( c x x ( ( ( Tr Tr obtaining Tr identifies 2. The NIST random binary curves all have 2. Since 3. Find 1. Solve 1. Halving for the trace 1 case Facts ; ) 2 y , 2 x . ) . field mult 2 ) )=( y 2 2 y , y / , 2 Cost , 1 field mult x x Implementing ECC – 42/110 2 ( ( free x 3 ≈ Point halving... 2 / 1 field mult) ≈ or 1 field mult 2 1 field mult. ) ≈ 2 ≈ ≈ Tr root )=( (

λ ≈ y , , where 2 is x 2 T ) x ( x ) ( x 2 √ y . / : , + ) 2 . y x = 1 x x λ b ( x

+ λ + + , x 2 for NPUT →

λ T b where x = 4 field mults.) ) a 2 ) = √ y + = y x λ + 2 ≈

λ , / λ = 2

, λ x y ⇒ x x )+ x ( , ( + = 1 + 1 =

λ x b then Steps + . y ( →

λ ) + b 2 λ 1 b = ) + λ x

( λ b + , 2 2 2 2

λ x y x x = = )= ( , , λ b (point halving) I 2 x T T

= λ = ( x ( ( : x T Tr

λ else or 4. Return 1. Solve 3. If 2. Find UTPUT may be recovered via O Algorithm (Doubling in projective Halving: y Conversion to affine w ) n . x mod √ ( 0 0 Implementing ECC – 43/110 k + 2

by halve-and-add / . 0 1 0 . t k k kP + kP + : ··· ··· . + + t t P UTPUT 2 2 / 0 0 + 0 t k small. k Q .O = = k ← M n 0 / Q k

Calculating I do + t mod then . ··· )Solve k t 2 k 1 . 2 + / ) and scalar t = P 2 Q 0 i t P ( k k ← . (halve-and-add, right-to-left) from 0 to ; i.e., P 0 ∞ i = k k ← : point ements for B -163). Build table of 16 multiples of for Q 4.2 4.1 If 1. (Precomputation) Solve quadratic equations (21–30 field el- 4. For 5. Return 3. 2. (Transform Looks best when Multiple accumulators allow right-to-left withNAF width- variant. NPUT I Algorithm I I is M / I on binary Implementing ECC – 44/110 kP on Koblitz curves is

τ

Summary: Efficient endomorphisms curves in on-line precomp case, especially if small. significantly better than halving. Koblitz-like expansions to other curves.the Less endomorphism useful costs if more than 1/2 point double. speedup for on-line precomp case. 3. Frobenius endomorphism 4. Siet, Lange, Sica, Quisquater [SAC 2002] extend 1. If curve can be selected, GLV offers a dramatic 2. Halving is a significant improvement for Implementing ECC – 45/110

Summary: Efficient endomorphisms... Practical interest may be limited, sinceunlikely it that seems there’s room for halvingfor code, an but additional not point of storage and 3-TNAF. recoding that combines a halvingmethod step on with Koblitz Frobenius curves. Tradesfor some a point halving. additions 5. Avanzi, Siet, and Sica [PKC 2004] give a scalar 8 = . a . 7 M 954 990 705 687 386 365 1980 2928 8 / / I . = M 5 M = = M + / 939 987 598 681 284 341 M S I 1386 1953 / A I with I and 5 1 2 8 35 34 198 325

Cost τ 5 e Field operations Implementing ECC – 46/110 = M 396 914 328 982 423 671 114 301 M f f . / d d I ) g g 1 0 0 D 163 163 162 162 + 1+163 3+163 3 e e c c z A +32 +27 35 34 + b b 162 162 -coordinate only. 7+27 EC operations 3 6+30 7 6 x z d

Appendix: Operation count + 3 0 7 7 3 0 3 7 7 Points z stored 4 – 5 4 – 5 4 5 w + Field ops include applications of g 163 z . ( / M ] 2

z , on-line precomputation) [ for the NIST random binary curve B-163 over 2 kP projective projective projective projective F kP = Addition for Montgomery. c -NAF affine 163 w 2 F Right columns give costs inAffine. terms of field mults for Halvings; est. cost a b f

MethodUnknown point ( Window NAF Coordinates affine Montgomery affine Halving Window TNAF affine field Example s µ Implementing ECC – 47/110

) lQ ) 565 1749 1000 5 + , 6 ) 718 718 325 kP = 4 )) 2306 1154 2306 3572 1150 1800 5 4 w , , = 6 6 )) 365 1130 263 625 814 475 ) 687 2126 1050 w in the prime and binary fields. Normalization gives 5 6 )) 2016 954 2016 2953 975 1475 4 = = 8 5 4

, on-line precomputation) = = w w = = = = w w w kP w w , off-line precomputation) M / I kP and 80 = Appendix: Timings (800 MHz Intel Pentium III) NIST Field Pentium III (800 MHz) M curve Method mult M normalized M

Unknown point ( P-192 5-NAF ( B-163 4-NAF ( B-163 Halving ( K-163 5-TNAF ( Fixed base ( P-192 Comb 2-table ( B-163 CombK-163 6-TNAF ( Multiple point multiplication ( P-192 Interleave ( B-163 Interleave ( K-163 Interleave TNAF ( 386 1195 575 / I Timings using general-purpose registers. Mwith is the estimated field multiplications equivalent P-192 field mults for this implementation. Implementing ECC – 48/110

Appendix: Timings... Only general-purpose registers used. Pentium III has floating-point registersprime which field can arithmetic, accelerate and single-instruction(SIMD) multiple-data registers that are easily harnessedInteger for multiplication binary with fields. general-purpose registersis on faster P-III than on earlier orP-192 newer may Pentium be family less processors. competitive ifoperates hardware on mult fewer is bits. weaker or I I I random binary or prime in on-line precomp case. for off-line precomp case. for precomp in known-point methods is not addressed. 1. Timings for Koblitz curves significantly faster than for 2. Faster prime field multiplication gives P-192 the edge 3. Results depend on processor and implementation. 4. The case where a large amount of storage is available Topic III

Normal Basis Arithmetic

Implementing ECC – 49/110 . f can be . c } 1 Implementing ECC – 50/110 − = m 2 x

β + 2 , reduction poly x } ,..., 1 2 2 −

m β x , 1 2

β , ,..., 0 x 2 ,

β 1 { {

Normal bases in characteristic 2 field elements: m 2 F Normal basis: Polynomial basis: solved bitwise. Squaring, square root are shifts. Performance of square root andfundamental quadratic to solver methods is based on pointLow halving. complexity (Gaussian) NB basesnice have arithmetic. especially I I I Representing Motivation for use of normal basis reps: Methods in 1990s seem toslow confirm in that software NB compared mult to will PB. beNing very & Yin [ICICS 2001]improve precompute NB shifts mult and (at significantly the cost of data-dependent storage). . 0 . ) 1 1 − − m + Implementing ECC – 51/110 m . s : 2 b M } i . C 2 = ) ... 0 NB Multiplication β , ( ij 1 s M { 2 λ + C s s β b if + ) j s s ( ij b b s λ ( 1 + 0 i M :

− complexity = ] a ) ∑ s ) m 1 1 0 0 optimal ( − − ij = j ∑ = m m λ j 1 + 0 2 is the s − = a ∑ =[ β i m i M 2 M ... β = 1 s . Basis is c + 1 s a s − a matrix is given by m 2 m =( ab × s ≥ c = m M c Number of 1s in Optimal bases introduced by Mullin,Vanstone, Onyszchuk, and Wilson to reduce hardware complexity. C I I I Obtain the multplication table for basis Then Define . . p ∗ has 1 Z p ∗ Z for in )= j ∈ , K m α } u , i in the NIST e K T ( ∈ j Implementing ECC – 52/110 < 233 ∑ j gcd th root of unity where = p = ≤ i i u m

0 β h , = m are satisfies K ) < p are the cosets of

Gaussian normal bases ∗ i T Z , m ≤ m in < ( 0 i i | 2 j h ≤ u i there is a primitive of 0 2 beaprime. e of type { 1 for ⇒ = i + = 2 p ∗ . 1 K Z . mT m . − = mT i 2 T = < F mT K i p 2 ∈ | ≤ p Then order α 0 Let Suppose index and Gauss periods I I I I recommended fields has an ONB. Generalization: Gaussian normal bases. Optimal bases are relatively rare. Only . } 2 , 1 Implementing ECC – 53/110 is a normal basis ∈{ } i T 2

β { and type of GNB. m 2 . F , and

Gaussian normal bases... i 2

β GNB = T i

β is a measure of the complexity of the 426410

type 163 233 283 409 571 T . 1 . Then m − 0 Type

β called a mT = m 2 β ≤ F NIST recommended fields M for mult. Let ONBs are precisely the GNBs with C I I I ik           w 1 1 − m + iT . . . + iT w w b iT b w b      . s , and let i + Implementing ECC – 54/110

δ ik ⊕···⊕ w      b 1 1 1 − T 1 = m + i ∑ 1 . . . k i + w 1 w b i s b w + Vector multiplication b i      a      1 1

− = ∑ i      m 1 1 − i + . . . m + i a + s a i is given by . a + ik      1 w 1 1 ab b 2 − = s ∑ i m β a = ⊕ 1      c i = = 1 2 0 . . . n k b b b . . s i      2 c ∑ T

     = ≤

ββ 1 i i 0 1 − . . . a δ a m n be the number of 1s in NB rep of a = i i      δ n =      1 0 1 − . . . Let Let satisfy Facts: c c m c      I I I Basic idea in Ning and Yin is precomputation of shifts. Rosing, Ning and Yin, andobserved Reyhani-Masoleh that and the Hasan computation can be written: . m only) a . b Implementing ECC – 55/110 and words. a operations on field m 4 AND

Vector multiplication... m and is even.) XOR m required shifts for m mT : ab = c Copy the precomputation to simplifyevaluation indexing phase. in Total storage: versions. elements. (But type 1 means Precompute all Evaluation has Efficient one-table (e.g., precomputation for For type 1, use matrixand decompostion Bhargava of to Hasan, reduce Wang, complexity to essentially I I I I Variations (Dahab, H, Hu, Long, López, Menezes): To find 2.7 Implementing ECC – 56/110 ∗ DHHLLM 5.0 8.4 10.4 11.7 7.1 1 table Ring map

Vector multiplication... 6.7 9.6 11.4 2table Ning&Yin s), 800 MHz Intel Pentium III. Other µ 1.3 1.3 2.3 L-D comb words of dynamic storage. Type m 4 or Uses matrix decomposition. m 162163 1 233 4 ∗ 2 m Significantly faster than preceding methodsfor in NB. software Easy to code and relatively easy to optimize. 2 Still much slower than methods for polynomial basis. field multiplication (in m I I I I 2 than L-D, input and output are in NB. The good news F Some bad news Implementing ECC – 57/110

Ring mapping method expansion, a significant T hurdle. Technique is well-known for themapping type is 1 a case, permutation. where the Sunar and Koç [ToC 2001]approach is for a type basis 2 conversion that exploits properties of the map. Arithmetic in the ring is(and modulo so a has cyclotomic especially polynomial simple reduction). “Palindromic representation” of Blake, Roth,Seroussi and is the special case forGeneral type case 2. has a factor I I I I I

Basic idea for GNB: mapmethods to for an polynomial associated basis ring reps. and useBasis fast conversion approach Ring mapping approach . } . i ) K 1 ∈ − k x , ( j j / ) α 1 j 0 for a Implementing ECC – 58/110 k − . 1 c and is a normal j p p , and = x ) mT j ∑ x j R 0 Φ = T a j , = 1 to c j m = m mT j ∑ )=( (

α 2 x ∈| i F ( K p 7→ ∈ ∑ mT i j Ring mapping method... Φ 2 x i

a β i 1 mT 0 a c − = 1 ∑ i 0 m − + = where ∑ i m = ) : i ··· p . 2 i φ : Φ

β + K benefits from form of m i ( x 2 a / 1 ∈ R ] F 1 0 c j x − [ = { ∈ ∑ i 2 m if a F i = a )= is a Gauss period of type = m a 2 = β R j F 0 and its inverse are relatively inexpensive, and ( a

φ Let is a ring homomorphism from φ arithmetic in I I element. For Assume where 2 = T areamirror . Implementing ECC – 59/110 ) } } m m 2 m F ( ≤ <

φ i i is odd), then the last ≤ ≤ m 1 0 | | i Ring mapping method... i [WHBG]. 2 − ) 1 2 1 + / − m 2 α

mT α expansion, which can be significant. + +

T α i (

α { { is at least 4. T , coefficients for elements in 163 2 2 is even (always the case if F / T reflection of the first For There is a factor If mT Symmetry property is well-known inwhere the Gauss case periods produce abasis type of 2 the optimal form normal and there is an associated basis I I I Naïve approach: map into thepolynomial-based ring arithmetic. and then exploit fast . . 13 ) φ 0 a = , 1 1 a ) , + 1 1 a , − 2 mT x is a ( , Implementing ECC – 60/110 = / 0 -element coeffs ∗ 13 ) a p R Z , 1 2 in a − , 4 2 13 a x , = 0 a and hence the first T , ) coeffs will allow recovery 2 2 6 )=( a / gives prime x , p ( 1 = 4 p a 2 , Φ 1 = / ) a , T 1 for 0 . a ) − 3 ( p p . The mapping into ( = Φ 7→ ( } ) / m 5 2 ] coeffs of the ring element suffice to invert a x − [ , , for 2 2 1 5 / a m F , ) , 2 is symmetric about 0 1 1 F = a ) − − a , R ( p 1 =( that permits recovery of the associated field element. of the field element. φ ( Fewer coefficients may suffice: in4 the (rather example, than the first Wu, Hasan, Blake and Gaominima [ToC for 2002] the give number sample of consecutive { a : =

Gauss periods and mapping for small parameters I I I φ Consider and the unique subgroup ofK order is given by . m 2 F in i 2 . 0

β b i b . 1 R 0 Implementing ECC – 61/110 − = i m in j 0 ∑ x b j 0 0 = a a 1 1 b − = = j ∑ p 0 c and . Do similarly for i i )= 2 . a K β i ( 2 i

φ ∈ a β i 1 j 0 c = − . 1 if = 0 i m 0 ) i − 0 a = ∑ c a i m ( ∑ 1 = = − j 0 a = φ a = ab c =

Algorithm: Multiplication via ring mapping c where : R : elements reps to find half the coeffs of in UTPUT 2. Apply a fast multiplication method for polynomial-based 1. Calculate 3. Return NPUT O I 1 + x + times the cost Implementing ECC – 62/110 ··· 2 / + T 2 − p x + 1 − p x = 1 1 is approximately − − φ p x x )= x ( p . Φ 1 −

Observations on the ring mapping algorithm φ can be optimized with per-field precomp. Each output of φ is fast. coefficient is obtained with aOnly word the shift first and half mask. areremainder calculated obtained in by this symmetry. fashion; multiplication for polynomials and canfind be only adapted some to of the output coefficients. 4. Only half the output5. coefficients are Cost required. of applying 2. The López-Dahab “comb” method provides fast 3. Reduction via 1. 163 . 0 = c ). m

φ Implementing ECC – 63/110 coeffs of the of the complete of product 30 326 0 309 c = is approx 10% of the ,..., 2

/ φ 19 ,..., , 0 1 c mT 10 costs approx half of 1 −

φ ,..., 0 normal basis, and hence the mapping 4 = . With 32-bit words, field elements require 6 0 T b 0 a

Example: ring mapping method for product 42-word polynomial product. Small optimization: word 10 iselt not can required be since recovered from field total field mult time ( Modified comb mult finds Comb finds words The cost of an application of words and ring elements require 21. has a type 163 I I I 2 Experimentally, times on an Intelslower Pentium than III field are mult a for factor a 7 polynomial basis rep. Competitive with the best methods with a NB rep. F gives a factor 4 expansion. 233 163 = = m m 2.7 Implementing ECC – 64/110 (where 233 ∗ DHHLLM 233 5.0 8.4 10.4 11.7 7.1 = 1 table Ring map m 6.7 9.6 11.4 2table Ning&Yin s), 800 MHz Intel Pentium III. Other µ normal basis. 2 1.3 1.3 2.3 L-D comb = T Type

Example: ring mapping method for Uses matrix decomposition. m 162163 1 233 4 ∗ 2 has a type field multiplication (in 233 m 2 2 than L-D, input and output are in NB. coefficients in the ring product are found) than for F (where 309 coefficients are found). Method gives the fastest multiplicationcase, times and for is the approx type apolynomial 2 factor basis. 3 slower than mult in a F As expected, algorithm is faster for 2 = T , 233 Implementing ECC – 65/110 = m 938 3250 3432 146 510 260 792 2740 3172 Poly rep Direct Ring map

Memory consumption 4 = T , 163 = consume a significant portion of the m . 1 for NB will likely have significantly smaller m 652 2452 4504 544108 2092 360 4144 360 − 2 c

Poly rep Direct Ring map φ F = x and +

φ 2 x Total Storage type mem requirements than their counterparts for a PB. Code for Algorithms for NB reps have significantlyrequirements larger than memory the comb method for PB. If total mem consumed by fieldmeasurement arithmetic of is interest, the then squaring, squaresolving root, and total mem requirement. For typeconsumption 2, is however, total comparable mem to direct method. automatic (stack) data object code & static data I I I Approximate code and storage requirements (infield 32-bit multiplication words) in for Implementing ECC – 66/110 . c = x + 2 x

Summary: Normal basis arithmetic than previously reported. for low-complexity Gaussian normal basessuperior (and for ONB). cited as especially desirable: Koblitzhalving. curves and point 1. Multiplication for normal basis reps significantly faster 2. Ring mapping method is competitive with best methods 3. Not sufficiently fast to help in two cases where NB are For software implementations, Dahab, H,and Hu, Menezes Long, conclude: López, Penalty in NB mult appearsthe to advantages be of sufficient fast in and tosquaring, elegant overwhelm operations square of root, trace, and solving Journal of Implementing ECC – 67/110

Information and , LNCS 2229:177–189. , 29:879–889, 2000.

References: Normal basis arithmetic through palindromic representation. Technical ) n 2 ( GF finite field multiplication in normal basis. in Report HPL-98-134, Hewlett-Packard, Aug. 1998.via Available http://www.hpl.hp.com/techreports/. Menezes. Software multiplication using normal bases. Technical Report CACR 2004-12, University of2004. Waterloo, Algorithms for exponentiation in finite fields. Symbolic Computation Communications Security 2001 4. P. Ning and Y. Yin. Efficient software implementation for 1. I. F. Blake, R. M. Roth, and G. Seroussi. Efficient arithmetic 2. R. Dahab, D. Hankerson, F. Hu, M. Long, J. López, and A. 3. S. Gao, J. von zur Gathen, D. Panario, and V. Shoup. , Implementing ECC – 68/110

IEEE , 51(11):1306–1316, 2002.

IEEE Transactions on Computers

References: Normal basis arithmetic... 50(1):83–87, Jan. 2001. multiplier using redundant representation. for field multiplication using Gaussian normalTechnical Report bases. CACR 2004-04, University ofCanada, Waterloo, http://www.cacr.math.uwaterlo.ca, 2004. type II multiplier. Transactions on Computers 7. H. Wu, A. Hasan, I. F. Blake, and S. Gao. Finite field 5. A. Reyhani-Masoleh. Efficient algorithms and architectures 6. B. Sunar and Ç. K. Koç. An efficient optimal normal basis Topic IV

Inversion and affine arithmetic

Implementing ECC – 69/110 . m 2 F Implementing ECC – 70/110

Inversion revisited -coord in intermediate y 7–9 multiplications in ≈ by omitting the Q + P 2 . Proposal is specific to affinesignificant coords, number and of will inversions. have Related papers by Ciet, Joye,and Lauter Dimitrov, and Imbert, Montgomery, and Mishrarequirement. have similar For the fastest multiplication times,have [FHLM inversion ToC cost 2004] Inversion in prime fields significantly more expensive. Q + I I I I Motivation: optimizations for affine-coordinate arithmetic. Example. Eisenträger, Lauter, and Montgomery2003] [CT-RSA speed P Problem: inversion seems to bemultiplication. expensive compared with bit scan Implementing ECC – 71/110

via EEA-like methods m 2 F

Inversion in Can efficently track size of (some)Requires variables. explicit degree calculations. Fastest in our tests oninstruction Pentium aids III degree (where calc). Can efficently track size of (some)No variables. explicit degree calculations required. Same cost for inversion as for division. Slower in our tests on Pentium and SPARC family. I I I I I I I 1. Euclidean algorithm. 2. Binary Euclidean algorithm. Implementing ECC – 72/110

via EEA-like methods m 2 F ). k z

Inversion in and then divides by Similar to BEA, but can trackTracking lengths lengths of efficiently all seems variables. toexpansion. require code Variants allow fast 2nd stagereduction even poly. for non-favorable Fastest in our tests onPentium). SPARC (and competitive on k z 1 I I I I − 3. Almost inverse algorithm (2-stagea method that first finds squarings. 1 1 Implementing ECC – 73/110 − − ) . m s in binary rep. . 1 2 2 1 / ) via multiplication ) 1 1 − − m − has been evaluated. m 1 ( 2 m 2 − ( b in m F b 2 w 1 · , then a − 1 b + a − 2 c / = =( ) ) 1 1 1 2 − m − − ( 1 − m 2 − 2 a m m a 2 ( a = 2 = Inversion in gives the number of b squarings once 1 can be computed with one multiplication log − w 2 1 b a / − 1 ) − 1 m 2 − a m ( is odd and m and If Hence Recursive procedure finds mults, where I I If squarings are free, speedmethods. is (But in squarings ballpark are of not EEA-like free in our context.) For the NIST fields, this is 9–13 mults and Inversion by multiplication uses . . ) . 4 Q y a 2 2 y x , 1 M + + y 4 ) 2 1 − − 1 2 xy 1 1 x x P 1 y x 3 · x − =( = = k and 2 1 has cost = ( Q λ λ 3 1 Implementing ECC – 74/110 P Q + − 2 + 1 1 y P + I y y , ) P − − 3 , 3 y 1 2 , 1 λ λ xy 3 ) ) · x 3 4 y x x

Simultaneous Inversion =( − − = 1 1 P 1 x x 2 − , x ) =( =( 2 y 3 4 , , y y 2 inverses have cost 1 x xy 2 k x 1 =( 7→ x − 2 y Q 1 · , x − ) x by simultaneous inversion for 1 2 1 − y λ 2 , 2 7→ M 1 λ are obtained with a single inversion. x = ) 9 y 2 3 = , + x λ 4 =( x S x ( Useful whenever an inverse costs more thanGeneralization: 3 mults. P 4 and + I I I I 1 I Example. For curves over prime fields, “Montgomery’s Trick” to simultaneously findbased inverses on: is 2 λ } P 64 8 2 t kP 2 DD + A 4 → log 2 P Implementing ECC – 75/110 + 16 3 {z 2 DD A + P 1 %- 8 = DD k A Saved point doubles for Simultaneous Inversion %- %- | P additions, have t      to exploit simultaneous inversion. kP on each row Simultaneous inversion inversions. corresponding to required point additions. each level. For a total of 1. Calculate all point doubles, retaining those 2. Use binary tree to sum, with simultaneous inversion at The technique is applied widely.increases Efficiency with (and the storage) number simultaneous inverses. Schroeppel and Beaver [2003] proposeadditions delaying in point to I + .) M 2 M 8 Implementing ECC – 76/110 (trinomial reduction points (depends on , and NAF is used. 233 3 2 M / 8 F m ≈ .

Simultaneous Inversion... I , M 3 . M 1 2 /add gives 25% improvement. ≈ ≈ M 6 H H ). k . (Addition in mixed coords is approx M 5 , cost of affine point additions reduced from First round additions more expensiverequired if (e.g., conversions in methods based on pointAdditional halving). storage is approx Adapts to windowing via multiple accumulators. rep used for m 2 F I I I approx poly). Uses estimate Sanity test: suppose Attractive when point additions arethe a point significant mult. portion Examples: of Koblitz curvesIn and point halving. Their cost estimate of Schroeppel and Beaver estimate 30%methods improvement based for on point halving in Implementing ECC – 77/110

Simultaneous Inversion... . Improvement is decreased to 20%. M ) 1 + 8 ( Some side-channel information is eliminatedoccur if after additions all the point doubles. Similar strategy has been appliedMishra in and parallel Sarkar processing; [PKC e.g. 2004]. Comparison of interest: in mixed≈ coords, the additions cost References Implementing ECC – 78/110 (LNCS 2612), Inversions for Multiplications in Elliptic CurveDesigns, Cryptography. Codes and Cryptography. Cryptology ePrint Archive http://eprint.iacr.org/2003/257/. multiplication using double-base chains. Cryptology ePrint Archive http://eprint.iacr.org/2005/069. curve arithmetic and improved Weil pairing evaluation. In Topics in Cryptology—CT-RSA 2003 343–354, 2003. calculations with the reciprocal sharing trick.Public-Key Cryptography Mathematics (MPKC), of University of IllinoisChicago, at November 2003. 1. M. Ciet, M. Joye, K. Lauter and P. Montgomery. Trading 2. V. Dimitrov, L. Imbert, and P. Mishra. Fast elliptic curve point 3. K. Eisenträger, K. Lauter, and P. Montgomery. Fast elliptic 4. R. Schroeppel and C. Beaver. Accelerating elliptic curve Topic V

Friendlier fields

Implementing ECC – 79/110 . . 2 7 − . − 5

ω 6 . x x m − = p = m f f Implementing ECC – 80/110 x ≈ arithmetic can , , 5 q m = 1 p . f q − F − for F 32 q 31 2 F 2 and = c = p p  .

Optimal extension fields , n , p case. 5 2 6 F q = F = = m p m for ) ; e.g., f ; e.g., 2 ( 1 / ] = x = [ p c ω F = m can be chosen to fit in a register; p p Type 2: be performed via ops in Type 1: F I I I 2. Fast field inversion compared with 1. Inversion still expensive for small2. extensions. Fastest mult looks like 1. Arithmetic more elegant than in Bailey and Paar, CRYPTO ’98 and J. Cryptology 2001. The bad news Attractions ,find 1 + Implementing ECC – 81/110 p + . ··· 1 + − r 1 A − 1 . m − p p ) + r ··· , which is relatively fast. = A p + 1 1 1 . F − − p =( − m m p p F p 1 Inversion in OEFs (Itoh & Tsujii) A − . in = p A = 1 r F − 1 . ) ∈ − r 1 r − A A and A r 1 m − p cA r =( F A c = ∈ 1 = r − A A A Steps 1 and 3 appear toStep be 2 the is expensive calculations. not a fullStep field 3 multiplication. is inversion in Step 1 can be done with a few field multiplications. I I I I 2. 4. 3. Find 1. Compute Cost Given Steps 2 = m . 2 1 a

ω . Implementing ECC – 82/110 −  2 0 1 0 a  . Want p = = F ∆ ), all in the subfield. ∈  0 0 0 1 ω , a ω a , ∆

Direct inversion for / ω )  1 1 − 0 a a 2 a −

x ω gives , 0 ) 0 1 a f )= a a x  ( (and a mult by =( f 1 S , mod − 2 ( x . a 1 x + a 1 0 1 a M + ≡ 2 0 + 1 a 0 0 − + a I = aa Then = a 1 I I − Let a Cost: 1 − i

ω − t x Implementing ECC – 83/110 subfield )= x ( i . ⇒ f 0 = )= i

ω ( i f

Extending OEF: tower fields and ) 1 − i t q is expected to be less expensive. ( q F GF inversion in Main attraction: faster inversion than inOEF OEF. subfield inversion may requireinversion 40% time. of Recursion the in total OTF favor method. analysis. Baktir and Sunar [IEEE ToC 2004]. I I 1. Multiplication slightly slower. 2. Unclear if inversion is fast3. enough to Analysis favor is affine coords. limited. Parameters chosen (on ARM) may 4. Security of curves over OEFs and OTFs needs more Idea: do a sequence of extensionsirreducible with over The bad news . p F . m 2 F Implementing ECC – 84/110 . p F may be any prime subject to size p

Processor Adequate Finite Fields s, and Koç, for ˘ ailescu [SAC 2003] discuss PAFFs. Gura, Eberle, and Chang Shantz, for Yanik, Sava¸ I I restriction (size chosen to cooperate with hardware). Similar to OEF, but Use “partial reduction” to reduce cost. Redundant rep. I I 2. Fields with parameters not-of-special-form. 1. Bernstein’s fast floating-point implementations for Techniques have some application forparameters. special-form Avanzi and Mih Redundant reps and partial reductiontechnique. is a widely-used . , but with 1 T 1 − 1 − + Implementing ECC – 85/110 x m x . = 1 f − 1 + prime. Type 1 ONBs m 1 Move to a friendlier ring x where + ) f m ( / ] x [ 2 F expansion. T relatively rare. Map field elts to Do most arithmetic modulo Limitations: must have I I I factor Technique generalizes to Gaussian bases of type Idea: move to a ringpleasant. where where arithmetic is more Specialized case with fields havingwell-known. a type 1 ONB is in c = x . 8 + 2 = Implementing ECC – 86/110 x m where there is no m 2 F . 1

Move to a friendlier ring + x that has an irreducible factor has irreducible factor + 1 2 1 x + + k + 5 x 4 x x + 10 in 95% of cases. For 32-bit words, + n + x 11 5 ≤ x x . + m 6 x + 8 Trinomial Example: No irreducible trinomial for x expansion does not require morecases. words in 86% of of degree 2. Do most arithmetic in3. the larger Expansion ring. is 1. Find trinomial point halving, and square root. Doche gives “redundant trinomials” for irreducible trinomial. Trinomials are desirable in reduction, solving Implementing ECC – 87/110 has irreducible factor 1 +

Move to a friendlier ring... 103 x + 27 + 197 x redundant trinomials have degree divisible by of degree 197. Example: 32. Optimal Faster, even though extension isquadrinomials usually more larger. common. Optimal than 5% for multiplications. 5. 20% improvement for reductions and squarings. Less 6. Inversion 15% slower. 4. Implementation was with NTL. Moredesirable. experimental data , but curve } Implementing ECC – 88/110 3 , 2 ) of interest. 3 ∈{ ) . : ≤ g u Hyperelliptic curves g q ( g F f ≤ h (or = v 4 over ) deg g u ≤ , ( 1 g h + + g ⇒ 2 v 2 = = f elliptic curve. ⇒ deg = , ] u 1 [ q = F g Known attacks Arithmetic is in smaller fields for Significant improvements in explicit formulaePezl, by Wollinger, Lange, Guarjardo, Paar. operations are more complicated. ∈ f I I I I , Hyperelliptic curve of genus h Implementing ECC – 89/110

Hyperelliptic vs elliptic curves are competitive or faster than EC for curves 1 = h deg over binary fields. Limitations: implementation was with NTL.coords Affine only. curves over prime fields compared with EC. Limitations: “not interested in...primeform” moduli but of this special is aelliptic comparison curves. of interest and may favor 1. Avanzi [CHES 2004]: roughly 15% penalty with genus 2 2. Lange and Stevens [SAC 2004]: genus 2 curves with References Implementing ECC – 90/110 ˘ ailescu. Generic efficient arithmetic fields in software implementations. CHES3156:148–162. 2004, LNCS algorithms for PAFFs (Processor Adequate Finiteand Fields) related algebraic structures. SAC 2003,3006:320-334, 2004. LNCS Transactions on Computers 53:1231–1243, 2004. characteristic 2. Cryptology ePrint Archive http://eprint.iacr.org/2004/055/. 1. R. Avanzi. Aspects of hyperelliptic curves over large prime 2. Avanzi and P. Mih 3. S. Baktir and B. Sunar. Optimal tower fields. IEEE 4. C. Doche. Redundant trinomials for finite fields of References... Implementing ECC – 91/110 s, and Ç. Koç. Incomplete reduction in elliptic curve algorithm combining frobenius mapreference and to table adapt to higher characteristic.’99, LNCS EUROCRYPT 1592:176-189. [Koblitz-like speedups. Downside: curve is over large subfield.] special optimal extension fields. Public-Key Cryptography and Computational Number Theory, pages 197–207,Gruyter, de 2001. [Combines the ideas of GLV and OEF.] modular arithmetic. IEE Proceedings: ComputersDigital and Techniques, 149(2):46-52, March 2002. 5. T. Kobayashi, H. Morita, K. Kobayashi, F. Hoshino. Fast 6. V Müller. Efficient point multiplication for elliptic curves over 7. T. Yanik, E. Sava¸ Topic VI

Using special-purpose hardware

Implementing ECC – 92/110 Implementing ECC – 93/110

Case study: the Intel x86 (Pentium) family general-purpose registers. hardware. 1. Declining performance of integer arithmetic with 2. The floating-point approach. 3. “Multimedia” single-instruction multiple-data (SIMD) 4. Wish list: special instructions. Outline Implementing ECC – 94/110 -ops executed per cycle. Im- µ

Intel IA-32 processors proved branch prediction,larger but than misprediction on penalty Pentium.extensions much on Integer P-III have multiplication 128-bit faster.vectors registers of supporting SSE single-precision ops on floating-point values. allel stages. instruction per clock cycle. speeds, but manythan instructions P6 family have processors.precision worse SSE2 extensions floating-point cycle have counts double- andtypes. 128-bit packed integer data Dual-pipeline: optimalclock pairing cycle. MMX gives added twotimedia” eight registers, special-purpose instructions supporting 64-bit operations “mul- per 4, on or vectors 8-byte of integers. 1, 2, P6 architecture has moreof-order sophisticated execution. pipelining Up and out- to 3 1993 1997 1995 1997 1998 1999 Processor386 Year Selected features 486 1985Pentium First IA-32 family processor withPentium 32-bit MMX operations 1989 and par- 5 pipelined stages in the 486; processorPentium is Pro capable of one Pentium II Celeron Pentium III Pentium 4 2000 NetBurst architecture runs at significantly higher clock bits. Implementing ECC – 95/110 64 → 32 × 32

Multiplication on the Pentium registers for the mult of interest. d and a Can do integer multiplication Pentium II/III have faster multiplicationPentium than and original MMX. Must use Multiplication on Pentium 4 withregisters general-purpose is slower than on earlierPentium processors. II/III have better branchoriginal prediction Pentium, than but the mispredictions areexpensive. more I I I I I The good news The bad news Implementing ECC – 96/110

Instruction latency/throughput InstructionInteger add, xor,...Integer add, sub with carryInteger multiplicationFloating-point 1 multiply / 1 Pentium II/IIIMMX 1 ALU / 1MMX Pentium multiply 4 4 / 1 5 6–8 / / 2 2–3 .5 14–18 / / .5 3–5 7 / 2 1 3 / / 1 1 2 8 / / 2 2 Latency: number of clock cyclesresult required of before an the operation may beThroughput: used. number of cycles whichthe must instruction pass may before be executed again. I I Latency/throughput for Pentium II/III vs Pentium 4 Small latency and small throughputThroughput are can desirable. be significantly less than latency. . f . 1 × Implementing ECC – 97/110 1023 increases effective − e f . 2 1 (52-bit fraction) × f s ) 1 − =( z

Faster multiplication: floating-point (11-bit exponent) e s 63 62 52 51 0 precision to 53 bits. Potential for broad applicability: DECPentium, Alpha, Sun Intel SPARC, AMD Athlon. IEEE double-precision floating-point format represents numbers Normalization of the significand 80-bit double-extended format. Pentium has 8 floating-point registers.significand Length is of selected the in a control register. I I I I I Idea: use floating-point hardware toarithmetic. implement fast integer Implementing ECC – 98/110

Floating-point tests for equality are more ⇒ = expensive. Floating-point addition operates on moreaddition bits with than integer instructions on 32-bitMore hardware. registers and not restrictedCan to do specific useful registers. things during the latency period. Expensive to move between integerformats. and floating-point Bit operations which are convenient(e.g., in extraction integer of format specific bits,clumsy division on by values 2) in are floating pointRedundant registers. rep I I I I I Some good news Some bad news . 1 + 96 2 − 224 2 Implementing ECC – 99/110 = p for prime p F

Scalar multiplication for P-224 P-224: NIST curve over Performance improvements are in fieldin arithmetic the (and organization of fieldaddition). ops in point doubling and Field multiplication will require more(floating-point) than multiplications, 64 compared with 49classical in method. the On the positive side, morecan registers occur are on available, any mult register,accumulated and in products a may register be without directly handling carry. I I I I Bernstein: minimize conversions to/from floating point. a 1.7 GHz P4 224 p Implementing ECC – 100/110 F (to 8 FP values ab = c

P-224 field multiplication Excludes canonical form conversions. Multiplication in Classical integerKaratsuba-OfmanFloating-pointa 0.62 0.82 0.20 of roughly 28 bits) very fast. and register allocation. extended-double to 64-bit doubles. Alignment. commit to FP across curve operations. 1. Wide applicability. No coding2. in assembly. Excluding conversions, can do 1. No assembly required, but have to manage scheduling 2. Can’t allow compiler to unexpectedly spill 80-bit 3. Conversion to canonical form is expensive, so must 4. Algorithm verification. The good news The bad news ≤ for 8 ) 3 − Z : 2 Z : Z where : kP i Y : 4 2 X i ( Implementing ECC – 101/110 k 0 = 55 i ∑ Cycles for = k point additions. )

P-224 Curve arithmetic 4 / 224 )( in Chudnovsky coords 16 / iP 15 +( 3 Methodgeneral-purpose registersfloating-point 1,200,000 registers 2,700,000 730,000 830,000 Pentium III Pentium 4 . ) 8 , 8 Expected Scalar multiplication timings: 8. Precomp stores Reference implementation processes − [ . < I I ∈ i i k Bernstein uses a width-4 windowkP method (without sliding) for 2

γ 8 − ) ) 2

δ x − + 1 β x 4 ( )( is δ α ) − ← 1 2 1 z Implementing ECC – 102/110 y x , ( 1 3 y , , δ ← 1

x α − (

γ 2 ,

γ − 1 2 ) x )= 1 2

P-224 Curve arithmetic... z ← z

, β + 2 1 y y 2 1 , ( y 2 x ← ← 2 (

γ z , , 2 1 β z 8 ← −

δ 2

α ← 2 x Point doubling 3 field mults, 5 squarings, 7 reductions. • • Most of the improvement mayscheduling be only obtained field by multiplication and squaring. Bernstein organized point arithmetic socould that be operations efficiently folded into field multiplication. Expensive conversion to canonical formthe done end only of at scalar multiplication. I I I Implementing ECC – 103/110 share space with floating-pointisters reg- floating-point ing point and 64-bit integers

Single-instruction multiple-data (SIMD) Extension RegistersMMX (PII) Features added 8 64-bit vector ops onSSE 1,2,4,8 (PIII) byte integers; 8 128-bitSSE2 vector (P4) ops 8 128-bit on vector ops on double-precision single-precision float- Initially “MMX Technology” for multimedia. MMX suitable for binary fieldinversion. multiplication and SSE2 provides integer alternative tomultiplication. floating-point I I I Perform operations in parallel onexcept vectors. original All and Intel Pentium Pentiums Pro. Implementing ECC – 104/110 ). d and

Single-instruction multiple-data (SIMD) a multiply with general-purpose registers has 32 × 32 output in Advanced Micro Devices (AMD) K6MMX. processor has Sun has Visual Instruction Set (VIS). Implement fast 64-bit operations onmachines. primarily 32-bit Gives more registers on register-poor machine. Integer multiplication (SSE2) can use( 8 registers SSE2 integer multiply has latencyCan 8 do and useful throughput things 2. in the latency period. I I I I I I Basic idea applied to Intel and AMD processors: 28 = w 2 1 0 + + + i i i B B B (e.g., 2 1 0 b b b w Implementing ECC – 105/110 i i i 2 a a a ∑ ∑ ∑ = + + + B c c c → → → 2 1 0 2 b b b 2 2 2 a a a a for some i B ··· ··· ··· i a . 2 1 0 1 b b b ∑ 1 1 1 a ab a a a = = a ··· ··· ··· c SIMD integer multiplication methods 2 1 0 0 b b b 0 0 0 a a a a 2 1 0 ). Want b b b 32 Control code is simple. Can use full 32-bit multiplications. More memory accesses if few registersMust available. shift after each multiplication. = w I I I I 1. Operand scanning approach. Notation: integers or Advantages Downsides 4 3 2 B B B j j j b b b i i i Implementing ECC – 106/110 a a a 4 3 2 = = = j j j ∑ ∑ ∑ + + + i i i & & 1 B j 2 1 0 2 b b b b i 2 2 2 a a 1 a a a = j . “Wastes” part of the ∑ + i ...... 32 0 B < j 2 1 0 1 b b b b i w 1 1 1 a a 0 a a a = j ∑ + i ...... &&& 2 1 0 0 SIMD integer multiplication methods b b b 0 0 0 a a a a 2 1 0 b b b Possibly fewer memory accesses. Less shifting. Control code more complicated (unless fullyTo unrolled). avoid carry, take multiplier. I I I I

2. Product scanning method. Advantages Downsides : uv low low −→ Implementing ECC – 107/110 uv uv 64 bits 0 in product scanning, high ··· 32 uv pshufd 0 < ↓ w 0 −→ ←− high ··· ) simultaneously. Roughly uv 0 29 = 64 bits 0 0 w ··· ··· 0 0 ←−

SIMD integer multiplication methods... operand scanning, but carry is handled in 2nd stage. use shuffle instruction to split 64-bit product and then accumulate. Downside:every have mult. to shuffle on products (with 3. To avoid the requirement 1. GNU mp uses operand2. scanning. Moore uses vector ops in 128-bit SSE2 to compute two Experiments suggest product scanning with(even though scalar input ops must wins be split and output reassembled). a Implementing ECC – 108/110 Pentium 4 (1.7 GHz) 224 p F

Multiplication with SSE2 integer ops Excludes conversion to/from canonical form. Multiplication in Classical integer (product scanning)Karatsuba-Ofman (depth 2)SIMD (SSE2; product scanning) 0.62 Floating-pointa 0.27 0.82 0.20 Floating-point includes partial reduction tofloating-point 8 values (each roughly 28include bits); expensive does conversion not to canonicalform. reduced Other times include reduction. Classical and Karatsuba would benefittuning from specific additional to Pentium 4;inferior regardless, to both SIMD will and be floating-point. SIMD does not require thefloating-point commitment approach. of the I I I Implementing ECC – 109/110 s [CHES 2004] propose

Summary: special-purpose hardware integer mult with general-purpose registers. arithmetic. Relatively easy to code. speeds prime field arithmetic.across Must curve commit operations. to coding locally, but not as fastavailable as with floating-point Pentium approach. II/III. Not 1. Designs such as Pentium 4 and UltraSPARC have slow 2. Common MMX subset suitable for binary field 3. Floating-point hardware common on workstations 4. Integer SSE2 extensions on Pentium 4 easy to insert Wish list: Großschädl andinstruction Sava¸ set extensions to speed field ops. Amusements: use graphics card as[CT-RSA 2005]. a crypto co-processor . CHES )

Guide to References m 2 ( Implementing ECC – 110/110 GF and ) p ( GF . Springer-Verlag, 2004. s. Instruction set extensions for Perform Big Multiplications. Application Note AP-941,Corporation, Intel Version 2.0, Order Number 248606-001, 2000. Presentation at the 5th Workshop onCryptography Elliptic (ECC Curve 2001), University of Waterloo,29-31, October 2001. Slides available from http://cr.yp.to/talks.html fast arithmetic in finite fields 2004, LNCS 3156:133–147. Elliptic Curve Cryptography 4. S. Moore. Using Streaming SIMD Extensions (SSE2) to 1. D. Bernstein. A software implementation of NIST P-224. 2. J. Großschädl and E. Sava¸ 3. D. Hankerson, A. Menezes, and S. Vanstone.