Efficient Algorithms in Software

Julio López [email protected]

Institute of Computing, University of Campinas

September 2017, Habana, Cuba.

ASCrypto 2017

Agenda

1 Efficient Software Implementations
  • Software Efficiency
  • Parallel Computation - SIMD

2 Symmetric-Key Cryptography
  • Data Encryption
  • Hash Functions
  • SHA2 Implementation
  • SHA3 Implementation

3 Elliptic Curve Cryptography
  • Elliptic Curves
  • Elliptic Curve Diffie-Hellman
  • Digital Signatures
  • EdDSA Scheme

Julio López (IC-UNICAMP) Efficient Algorithms in Software ASCrypto 2017

Section 1
Efficient Software Implementations

1.1 Software Efficiency

The optimization of a software implementation of a cryptographic algorithm is a task with several goals:

• Ensure security.
• Running time.
• Code size.
• Memory consumption.
• Computer platform characteristics.
• Energy consumption.

Sometimes these goals conflict with each other. For example, accelerating an operation with look-up tables increases code size and, if not implemented carefully, can leave the implementation vulnerable to cache-based side-channel attacks.

How Is Performance Measured?

• Elapsed time cannot be compared across different computers; clock cycles are measured instead.
• Use the RDTSC instruction to read the processor's Time-Stamp Counter.

    #include <stdint.h>

    /* Returns the current Time-Stamp Counter value (x86, GCC/Clang inline assembly). */
    uint64_t get_cycles(void) {
        uint32_t lo, hi;
        /* RDTSC leaves the low 32 bits in EAX and the high 32 bits in EDX. */
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

• To reduce noise in the measurements, it is recommended to turn off technologies such as Turbo Boost and Hyper-Threading.
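Even with those technologies disabled, individual cycle counts still vary from run to run. A common way to summarize repeated measurements is to take their median; the helper below is a minimal sketch of that idea. The name `median_cycles` and the choice of the median are assumptions of this example, not part of the original slides.

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort over uint64_t values. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: sorts the samples in place and returns the median
 * (the upper median when n is even). Returns 0 for an empty array. */
uint64_t median_cycles(uint64_t *samples, size_t n) {
    if (n == 0) return 0;
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[n / 2];
}
```

Each sample would be the difference between two get_cycles() readings taken around the code under test; the median discards outliers caused by interrupts or cache misses.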

1.2 Parallel Computation - SIMD

Single Instruction Multiple Data

• Single Instruction Multiple Data (SIMD) is a class of computers in which a single instruction is applied simultaneously to a set of data.
• Recent processors support SIMD through a bank of wide registers, also known as vector registers.

Vector Instructions

Instructions that operate on vector registers are known as vector instructions. They operate on words packed in vector registers.

Releases of Vector Instructions

[Figure: timeline from 1994 to 2020 of vector instruction set releases, in order of appearance: MMX, SSE, SSE2, SSE3, SSE4, AES-NI + CLMUL, AVX, AVX2, BMI, SHA1-SHA2, and AVX-512. The legend groups them by purpose: integer arithmetic, floating-point arithmetic, string manipulation, cryptography, and bit manipulation. Register widths: MMX (64 bits), XMM (128 bits), YMM (256 bits), ZMM (512 bits).]

Relevant AVX2 Instructions

Integer arithmetic for 64-bit words:
• 1 cycle for add/sub.
• 5 cycles for multiplications.

Variable logic shifts:
• 1 cycle for fixed shifts.
• 2 cycles for variable shifts.

Permutation of words:
• 3 cycles for permutations.

Combination/selection of registers:
• Up to 3 instructions per cycle without dependencies.

Each operation acts on registers holding four 64-bit words, A = [a3, a2, a1, a0]:
• C = ADD(A, B): c_i = a_i + b_i.
• C = VSHL(A, B): c_i is a_i shifted left by b_i bits.
• C = PERM(A, M): c_i = a_{m_i}, with m_i ∈ {0, 1, 2, 3}.
• C = BLEND(A, B, M): c_i is either a_i or b_i, selected by the mask bit m_i ∈ {0, 1}.
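The four operations on this slide can be modeled in plain C, one 64-bit lane at a time. This is only a scalar sketch of their semantics, not the AVX2 intrinsics themselves; the type `vec4` and the function names are hypothetical, and the blend convention (a set mask bit selects the word from B) is an assumption of this example.

```c
#include <stdint.h>

/* Scalar model of a 256-bit register holding four 64-bit words. */
typedef struct { uint64_t w[4]; } vec4;

/* C = ADD(A, B): lane-wise addition modulo 2^64. */
vec4 vec_add(vec4 a, vec4 b) {
    vec4 c;
    for (int i = 0; i < 4; i++) c.w[i] = a.w[i] + b.w[i];
    return c;
}

/* C = VSHL(A, B): each lane of A is shifted left by the count in the same lane of B. */
vec4 vec_vshl(vec4 a, vec4 b) {
    vec4 c;
    for (int i = 0; i < 4; i++) c.w[i] = (b.w[i] < 64) ? (a.w[i] << b.w[i]) : 0;
    return c;
}

/* C = PERM(A, M): lane i of the result is lane m[i] of A, with m[i] in {0,1,2,3}. */
vec4 vec_perm(vec4 a, const int m[4]) {
    vec4 c;
    for (int i = 0; i < 4; i++) c.w[i] = a.w[m[i] & 3];
    return c;
}

/* C = BLEND(A, B, M): bit i of mask picks b_i when set, a_i otherwise (assumed convention). */
vec4 vec_blend(vec4 a, vec4 b, unsigned mask) {
    vec4 c;
    for (int i = 0; i < 4; i++) c.w[i] = ((mask >> i) & 1u) ? b.w[i] : a.w[i];
    return c;
}
```

A real implementation would replace each loop with a single vector instruction, which is where the cycle counts listed above apply.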

Vector Instruction Guide

Full documentation available at: http://software.intel.com/sites/landingpage/IntrinsicsGuide

Skylake Execution Engine

The Skylake processor has eight execution ports for instructions, which increases instruction-level parallelism (ILP).

Section 2
Symmetric-Key Cryptography

2.1 Data Encryption

Secure Communication

• Alice and Bob would like to communicate through an insecure channel.
• Charles is a malicious third party who also has access to the channel.
• Charles should not be able to read the messages exchanged by Alice and Bob.

Symmetric Data Encryption

Using a shared secret key k, Alice and Bob can exchange encrypted messages. Charles cannot read the messages without knowing the key k.

• Key generation: Alice and Bob both hold the key k.
• Encryption: C = E_k(M).
• Decryption: M = D_k(C).

Advanced Encryption Standard (AES)

• AES was designed by Daemen and Rijmen (1998).
• AES (2000) is the current NIST standard for encrypting data using a symmetric key.
• AES is a block cipher that encrypts a 128-bit plaintext M, producing a 128-bit ciphertext C, using a key k.
• AES supports three key sizes, |k| ∈ {128, 192, 256}, leading to three algorithms:
  • AES-128.
  • AES-192.
  • AES-256.

AES State Representation

AES keeps track of a 128-bit state, which can be seen as a 4 × 4 matrix of bytes. The state starts as the plaintext M, is combined with the round keys k_0, ..., k_{Nr}, and ends as the ciphertext C.

In each round, AES applies a series of transformations to the matrix. The number of rounds Nr depends on the key size:

\[ N_r = \begin{cases} 10 & \text{if } |k| = 128 \\ 12 & \text{if } |k| = 192 \\ 14 & \text{if } |k| = 256 \end{cases} \]

After Nr rounds, the last state is returned as the ciphertext.

AES State Transformations

• SubBytes

• ShiftRows

• MixColumns

• AddRoundKey

For decryption, transformations are inverted and applied in reverse order.

AES MixColumns: Encryption

\[ p_e(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\} \]

\[ c' = p_e \otimes c = M_e \cdot c \]

\[ \begin{pmatrix} c'_0 \\ c'_1 \\ c'_2 \\ c'_3 \end{pmatrix} = \begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} \]
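The matrix product above can be written directly in C, since multiplying by {02} in GF(2^8) is a shift followed by a conditional reduction, and {03}·a = {02}·a ⊕ a. The sketch below applies M_e to a single column; the function names `xtime` and `mix_column` are choices of this example.

```c
#include <stdint.h>

/* Multiply by {02} in GF(2^8), reducing modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* Applies the MixColumns matrix M_e to one column c[0..3], in place. */
void mix_column(uint8_t c[4]) {
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    c[0] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);   /* 02 03 01 01 */
    c[1] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);   /* 01 02 03 01 */
    c[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));   /* 01 01 02 03 */
    c[3] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));   /* 03 01 01 02 */
}
```

For instance, the column (db, 13, 53, 45) is mapped to (8e, 4d, a1, bc), and a column of four equal bytes is left unchanged because each row of M_e sums to {01}.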

AES MixColumns: Decryption

\[ p_d(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\} \]

\[ c' = p_d \otimes c = M_d \cdot c \]

\[ \begin{pmatrix} c'_0 \\ c'_1 \\ c'_2 \\ c'_3 \end{pmatrix} = \begin{pmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix} \]

The AES-NI Instruction Set

In 2010, Intel released a set of instructions that implement the AES round transformations.

Encryption (plaintext to ciphertext):
• Initial AddRoundKey.
• AESENC, repeated Nr − 1 times: SubBytes, ShiftRows, MixColumns, AddRoundKey.
• AESENCLAST: SubBytes, ShiftRows, AddRoundKey.

Decryption (ciphertext to plaintext):
• Initial AddRoundKey.
• AESDEC, repeated Nr − 1 times: InvShiftRows, InvSubBytes, InvMixColumns, AddRoundKey.
• AESDECLAST: InvShiftRows, InvSubBytes, AddRoundKey.

AES-128 Encryption

Encrypting a 128-bit block (held in xmm15) using the key schedule (stored in xmm0-xmm10), with Nr = 10.

    MOVDQA     xmm15, [rsi]    ; Load message block
    PXOR       xmm15, xmm0     ; AddRoundKey
    AESENC     xmm15, xmm1     ; Round 1
    AESENC     xmm15, xmm2     ; Round 2
    AESENC     xmm15, xmm3     ; Round 3
    AESENC     xmm15, xmm4     ; Round 4
    AESENC     xmm15, xmm5     ; Round 5
    AESENC     xmm15, xmm6     ; Round 6
    AESENC     xmm15, xmm7     ; Round 7
    AESENC     xmm15, xmm8     ; Round 8
    AESENC     xmm15, xmm9     ; Round 9
    AESENCLAST xmm15, xmm10    ; Round 10
    MOVDQA     [rdi], xmm15    ; Store cipher block

Analogously, for decryption use AESDEC and AESDECLAST, and derive the decryption key schedule using AESIMC.

Modes of Operation

Splitting a long message into 128-bit blocks and encrypting each one independently (ECB mode) is not secure!

Modes of operation are used to encrypt arbitrary-length messages using a block cipher as a building block.
• CBC: cipher block chaining.
• CTR: counter mode.
• GCM: Galois/counter mode (authenticated encryption).

Cipher Block Chaining (CBC)

Encryption (sequential execution), shown for four blocks P1, ..., P4:

\[ C_0 = IV, \qquad C_i = E_k(P_i \oplus C_{i-1}) \]

Decryption (parallel execution):

\[ P_i = D_k(C_i) \oplus C_{i-1} \]
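The difference between the two directions shows up clearly in code. In the sketch below, `toy_encrypt` and `toy_decrypt` are insecure placeholders standing in for E_k and D_k (they are not AES and are assumptions of this example); only the chaining structure is the point. Encryption must finish block i−1 before starting block i, while every decryption iteration depends only on ciphertext that is already available.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK 16

/* Toy, insecure stand-in for E_k (NOT AES): XOR with the key, rotate each byte left. */
static void toy_encrypt(const uint8_t k[BLOCK], const uint8_t in[BLOCK], uint8_t out[BLOCK]) {
    for (int i = 0; i < BLOCK; i++) {
        uint8_t t = (uint8_t)(in[i] ^ k[i]);
        out[i] = (uint8_t)((t << 1) | (t >> 7));
    }
}

/* Inverse of toy_encrypt, standing in for D_k. */
static void toy_decrypt(const uint8_t k[BLOCK], const uint8_t in[BLOCK], uint8_t out[BLOCK]) {
    for (int i = 0; i < BLOCK; i++) {
        uint8_t t = (uint8_t)((in[i] >> 1) | (in[i] << 7));
        out[i] = (uint8_t)(t ^ k[i]);
    }
}

/* C_i = E_k(P_i xor C_{i-1}) with C_0 = IV: each block waits for the previous one. */
void cbc_encrypt(const uint8_t k[BLOCK], const uint8_t iv[BLOCK],
                 const uint8_t *p, uint8_t *c, size_t nblocks) {
    const uint8_t *prev = iv;
    for (size_t j = 0; j < nblocks; j++) {
        uint8_t x[BLOCK];
        for (int i = 0; i < BLOCK; i++) x[i] = (uint8_t)(p[j * BLOCK + i] ^ prev[i]);
        toy_encrypt(k, x, c + j * BLOCK);
        prev = c + j * BLOCK;
    }
}

/* P_i = D_k(C_i) xor C_{i-1}: iterations are independent of each other.
 * The output buffer p must not overlap the input buffer c. */
void cbc_decrypt(const uint8_t k[BLOCK], const uint8_t iv[BLOCK],
                 const uint8_t *c, uint8_t *p, size_t nblocks) {
    for (size_t j = 0; j < nblocks; j++) {
        const uint8_t *prev = (j == 0) ? iv : c + (j - 1) * BLOCK;
        uint8_t x[BLOCK];
        toy_decrypt(k, c + j * BLOCK, x);
        for (int i = 0; i < BLOCK; i++) p[j * BLOCK + i] = (uint8_t)(x[i] ^ prev[i]);
    }
}
```

Because the decryption loop has no dependency between iterations, several block decryptions can be in flight at once.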

Counter Mode (CTR)

Encryption:

\[ C_i = P_i \oplus E_k(IV + i) \]

Decryption:

\[ P_i = C_i \oplus E_k(IV + i) \]

Both encryption and decryption can be executed in parallel, and only the encryption direction of the block cipher is used.
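A minimal sketch of counter mode follows. As before, `toy_encrypt` is an insecure placeholder for E_k (not AES), and the way the counter is added to the IV (a big-endian 128-bit addition) is an assumption of this example. The same routine performs encryption and decryption, and no inverse cipher is needed.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK 16

/* Toy, insecure stand-in for E_k (NOT AES): XOR with the key, rotate each byte left. */
static void toy_encrypt(const uint8_t k[BLOCK], const uint8_t in[BLOCK], uint8_t out[BLOCK]) {
    for (int i = 0; i < BLOCK; i++) {
        uint8_t t = (uint8_t)(in[i] ^ k[i]);
        out[i] = (uint8_t)((t << 1) | (t >> 7));
    }
}

/* Writes IV + ctr into out, treating the block as a big-endian integer (assumed convention). */
static void counter_block(const uint8_t iv[BLOCK], uint64_t ctr, uint8_t out[BLOCK]) {
    unsigned carry = 0;
    for (int i = BLOCK - 1; i >= 0; i--) {
        unsigned sum = iv[i] + (unsigned)(ctr & 0xff) + carry;
        out[i] = (uint8_t)sum;
        carry = sum >> 8;
        ctr >>= 8;
    }
}

/* out_i = in_i xor E_k(IV + i) for i = 1, 2, ...; the last block may be partial. */
void ctr_crypt(const uint8_t k[BLOCK], const uint8_t iv[BLOCK],
               const uint8_t *in, uint8_t *out, size_t len) {
    for (size_t off = 0; off < len; off += BLOCK) {
        uint8_t cb[BLOCK], ks[BLOCK];
        counter_block(iv, (uint64_t)(off / BLOCK) + 1, cb);
        toy_encrypt(k, cb, ks);                       /* keystream block */
        size_t n = (len - off < BLOCK) ? len - off : BLOCK;
        for (size_t i = 0; i < n; i++) out[off + i] = (uint8_t)(in[off + i] ^ ks[i]);
    }
}
```

Calling ctr_crypt twice with the same key and IV returns the original data, and every keystream block can be computed independently of the others.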

Performance of AES-128-CBC Encryption

The performance is determined by the latency of the AESENC instruction.

[Figure: timeline in clock cycles (0 to 24) showing consecutive AESENC instructions, each one taking the full instruction latency.]

µ-arch          Latency (cycles)   CBC-ENC (cycles per byte)
Intel Haswell   7                  4.49
Intel Skylake   4                  2.71
AMD Zen         4                  2.44

Pipelined AES Implementation

The execution of an AESENC instruction can be overlapped with other instructions of the same type.

[Figure: timeline in clock cycles (0 to 24) showing w = 4 streams of AESENC instructions overlapped in the pipeline, contrasting latency with throughput.]

The processor's pipeline improves the performance of the CBC-DEC and CTR modes.

Performance of AES-128-CBC Decryption

[Figure: bar chart of running time (cycles per byte, 0.0 to 1.4) on Haswell, Skylake, and Zen for w = 1, 2, 4, and 8.]

Scheduling w = 4 AES-NI instructions improves the performance of decryption.

Can we do better? Yes! Zen has two execution units for AES-NI instructions.

µ-arch          Latency (cycles)   CBC-ENC (cycles per byte)   CBC-DEC (cycles per byte)
Intel Haswell   7                  4.49                        0.63
Intel Skylake   4                  2.71                        0.62
AMD Zen         4                  2.44                        0.37

Performance of AES-128-CTR Mode

[Figure: bar chart of running time (cycles per byte, 0.0 to 1.4) on Haswell, Skylake, and Zen for the sequential implementation and for w = 2, 4, and 8.]

µ-arch          Latency (cycles)   CBC-ENC   CBC-DEC   CTR     (cycles per byte)
Intel Haswell   7                  4.49      0.63      0.74
Intel Skylake   4                  2.71      0.62      0.62
AMD Zen         4                  2.44      0.37      0.39

2.2 Hash Functions

Hash Function

A hash function maps an arbitrary-length bit string to an n-bit string.

\[ h : \{0,1\}^* \to \{0,1\}^n \]

The output of a hash function is called a digest or hash value.

Cryptographic Properties

1st pre-image. Given a hash value r, it should be difficult to find any message M such that r = h(M).

2nd pre-image. Given an input M1, it should be difficult to find a different input M2 such that h(M1) = h(M2).

Collision resistant. It should be difficult to find two different messages M1 and M2 such that h(M1) = h(M2).

Applications of Hash Functions

Cryptographic hash functions have a large number of applications:
• Verifying the integrity of files or messages.
• Password verification.
• Pseudo-random number generation.
• Key derivation functions.
• Digital signatures.

NIST Hash Functions

• 1993 SHA-0: Secure Hash Algorithm, output: 160 bits.
• 1995 SHA-1: output: 160 bits.
• 2001 SHA-2: output: 224, 256, 384, or 512 bits.
• 2015 SHA-3 (Keccak): output: 224, 256, 384, or 512 bits.
• 2015 SHA-3 (SHAKE128, SHAKE256): output: m bits (arbitrary).

The SHA-1 and SHA-2 functions are specified in FIPS 180-4.

2.3 SHA2 Implementation

SHA2 Algorithm

SHA2-256 operates as follows.

• Initialize the state S_0 with constant values.
• After padding, the message is split into n 512-bit blocks: M_1, ..., M_n.
• For each block M_j:

\[ S_j = \mathrm{Update}(S_{j-1}, M_j) \quad \text{for } 1 \le j \le n \]

• The digest of M is H(M) = S_n.

Update consists of two phases:
1 Message Schedule.
2 State Update.

Update Phase 1: Message Schedule

Let w_0, ..., w_15 be the message block M_j split into 16 words of 32 bits; the message schedule then computes 48 new words:

\[ w_i \leftarrow \sigma_0(w_{i-15}) + \sigma_1(w_{i-2}) + w_{i-7} + w_{i-16}, \quad \text{for } 16 \le i < 64, \]

where

\[ \sigma_0(x) = \mathrm{Rot}(x, 7) \oplus \mathrm{Rot}(x, 18) \oplus \mathrm{Shr}(x, 3) \]
\[ \sigma_1(x) = \mathrm{Rot}(x, 17) \oplus \mathrm{Rot}(x, 19) \oplus \mathrm{Shr}(x, 10) \]

Here Rot is a rotation to the right and Shr is a logical right shift.
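The recurrence translates almost literally into C, since unsigned 32-bit arithmetic already wraps modulo 2^32. The sketch below expands a 16-word block into the 64 schedule words; the function names are choices of this example.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits, 0 < n < 32. */
static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

static uint32_t sigma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
static uint32_t sigma1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }

/* Fills w[16..63] from the message block stored in w[0..15]. */
void sha256_message_schedule(uint32_t w[64]) {
    for (int i = 16; i < 64; i++)
        w[i] = sigma0(w[i - 15]) + sigma1(w[i - 2]) + w[i - 7] + w[i - 16];
}
```

As a quick check, with w_0 = 1 and all other input words zero, the first new words are w_16 = 1, w_17 = 0, and w_18 = σ1(1) = 0x0000a000.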

Update Phase 2: State Update

(a_0, b_0, c_0, d_0, e_0, f_0, g_0, h_0) ← S
for i ← 0 to 63 do
    T1_i ← h_i ⊞ Σ1(e_i) ⊞ Ch(e_i, f_i, g_i) ⊞ k_i ⊞ w_i
    T2_i ← Σ0(a_i) ⊞ Maj(a_i, b_i, c_i)
    h_{i+1} ← g_i,    g_{i+1} ← f_i
    f_{i+1} ← e_i,    e_{i+1} ← d_i ⊞ T1_i
    d_{i+1} ← c_i,    c_{i+1} ← b_i
    b_{i+1} ← a_i,    a_{i+1} ← T1_i ⊞ T2_i
end for
S' ← (a_0 ⊞ a_64, ..., h_0 ⊞ h_64)

⊞ is addition modulo 2^32.

SHA New Instructions (SHA-NI)

In 2013, Intel released the specification of the SHA New Instructions (SHA-NI).
• Intel has supported them since 2016, starting with the Goldmont micro-architecture.
• AMD's Zen micro-architecture added support in 2017.

SHA1:
• SHA1MSG1
• SHA1MSG2
• SHA1NEXTE
• SHA1RNDS4

SHA2-256 (and SHA2-224):
• SHA256MSG1
• SHA256MSG2
• SHA256RNDS2

Implementation of Phase 1a: Message Schedule

The SHA256MSG1 instruction performs the following operation:

\[ x_i = \sigma_0(w_{i+1}) + w_i, \quad \text{for } 0 \le i < 4. \]

[Figure: the inputs xmm0 and xmm1 hold the words w_7, ..., w_0; σ0 is applied to w_4, ..., w_1 and the results are added to w_3, ..., w_0, producing x_3, ..., x_0 in xmm2.]

Implementation of Phase 1b: Message Schedule

The SHA256MSG2 instruction performs the following operation:

\[ w_{i+16} = \sigma_1(w_{i+14}) + y_i, \quad \text{for } 0 \le i < 4. \]

[Figure: the inputs xmm0 and xmm1 hold y_3, ..., y_0 and w_15, ..., w_12; σ1 is applied to w_14 and w_15 and then to the newly computed w_16 and w_17; the results w_19, ..., w_16 are written to xmm2.]
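The two instructions can be modeled in scalar C to see how they split the schedule recurrence. This is a sketch of their arithmetic only, not of the real register layout; the function names and argument conventions are hypothetical. The relation y_i = x_i + w_{i+9} is not stated on the slides: it is derived here from the Phase 1 recurrence, and in an implementation that addition would be done with an ordinary vector add between the two instructions.

```c
#include <stdint.h>

static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}
static uint32_t sigma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
static uint32_t sigma1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }

/* Model of SHA256MSG1: x_i = sigma0(w_{i+1}) + w_i for 0 <= i < 4.
 * The input holds the five consecutive words w_0 .. w_4. */
void sha256msg1_model(const uint32_t w[5], uint32_t x[4]) {
    for (int i = 0; i < 4; i++) x[i] = sigma0(w[i + 1]) + w[i];
}

/* Model of SHA256MSG2: w_{i+16} = sigma1(w_{i+14}) + y_i for 0 <= i < 4.
 * w14_15 holds {w_14, w_15}; lanes 2 and 3 reuse the fresh w_16 and w_17. */
void sha256msg2_model(const uint32_t y[4], const uint32_t w14_15[2], uint32_t out[4]) {
    out[0] = sigma1(w14_15[0]) + y[0];   /* w_16 */
    out[1] = sigma1(w14_15[1]) + y[1];   /* w_17 */
    out[2] = sigma1(out[0]) + y[2];      /* w_18 needs w_16 */
    out[3] = sigma1(out[1]) + y[3];      /* w_19 needs w_17 */
}
```

The dependency of w_18 and w_19 on w_16 and w_17 is why σ1 appears twice in the figure for SHA256MSG2.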

Implementation of Phase 2: Two Iterations

Let A_i = [a_i, b_i, e_i, f_i] and C_i = [c_i, d_i, g_i, h_i] be the state at the i-th iteration. Then it holds that:

\[ C_{i+2} = A_i \]

The remaining values A_{i+2} = [a_{i+2}, b_{i+2}, e_{i+2}, f_{i+2}] are calculated by the SHA256RNDS2 instruction:

\[ A_{i+2} = \mathrm{SHA256RNDS2}(C_i, A_i, X) \]

where X = [w_i + k_i, w_{i+1} + k_{i+1}].
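The identity C_{i+2} = A_i can be checked with a scalar model of the state update. The sketch below implements one iteration and then chains two of them, which is the work done by one SHA256RNDS2 instruction; it is not the instruction's register-level behavior. The definitions of Σ0, Σ1, Ch, and Maj do not appear on the slides and are taken from the SHA-256 specification; the function names are hypothetical.

```c
#include <stdint.h>

static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}
static uint32_t big_sigma0(uint32_t x) { return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22); }
static uint32_t big_sigma1(uint32_t x) { return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25); }
static uint32_t ch(uint32_t e, uint32_t f, uint32_t g) { return (e & f) ^ (~e & g); }
static uint32_t maj(uint32_t a, uint32_t b, uint32_t c) { return (a & b) ^ (a & c) ^ (b & c); }

/* One iteration of the state update; s = {a, b, c, d, e, f, g, h}, kw = k_i + w_i. */
void sha256_round(uint32_t s[8], uint32_t kw) {
    uint32_t t1 = s[7] + big_sigma1(s[4]) + ch(s[4], s[5], s[6]) + kw;
    uint32_t t2 = big_sigma0(s[0]) + maj(s[0], s[1], s[2]);
    s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; s[4] = s[3] + t1;
    s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; s[0] = t1 + t2;
}

/* Two consecutive iterations with x = {w_i + k_i, w_{i+1} + k_{i+1}}. */
void sha256_two_rounds(uint32_t s[8], const uint32_t x[2]) {
    sha256_round(s, x[0]);
    sha256_round(s, x[1]);
}
```

After sha256_two_rounds, the entries c, d, g, h of the state equal the old a, b, e, f, so only the four values in A_{i+2} require real computation.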

Implementation of Phase 2: Two Iterations

[Figure: two consecutive iterations of the state update, with inputs (k_i, w_i) and (k_{i+1}, w_{i+1}) and intermediate values T1_i, T2_i, T1_{i+1}, T2_{i+1}.]

After two iterations, half of the state is a copy of earlier values:

\[ c_{i+2} = a_i, \quad d_{i+2} = b_i, \quad g_{i+2} = e_i, \quad h_{i+2} = f_i \]

Implementation of Phase 2: Four Iterations

Using two SHA256RNDS2 instructions, one can compute four iterations of the Update function:

\[ C_{i+2} = A_i \]
\[ A_{i+2} = \mathrm{SHA256RNDS2}(C_i, A_i, X) \]
\[ C_{i+4} = A_{i+2} \]
\[ A_{i+4} = \mathrm{SHA256RNDS2}(C_{i+2}, A_{i+2}, Y) \]

where X = [w_i + k_i, w_{i+1} + k_{i+1}] and Y = [w_{i+2} + k_{i+2}, w_{i+3} + k_{i+3}].

This is equivalent to:

\[ C_{i+4} = \mathrm{SHA256RNDS2}(C_i, A_i, X) \]
\[ A_{i+4} = \mathrm{SHA256RNDS2}(A_i, C_{i+4}, Y) \]

Performance of SHA2-256 using SHA-NI

SHA-NI is 4-5× faster than 64-bit implementations of SHA2-256.

[Figure: two plots against message size (1 byte to 1 MB) comparing sphlib (supercop), OpenSSL, and SHA-NI: running time in cycles per byte, and speedup (1× to 5×).]

Can we do better?

Pipelined Implementation of SHA-NI

Like AES-NI instructions, SHA-NI instructions can be pipelined.

[Figure: timeline in clock cycles (0 to 26) showing w = 4 streams of SHA256RNDS2 instructions overlapped in the pipeline, contrasting latency with throughput.]

Target scenario: multiple hashing ⇒ hash-based signatures (PQ-Crypto).

Performance of Pipelined Implementation of SHA-NI

Example: computing four hashes in a pipelined fashion is 20% faster than a sequential implementation.

[Figure: running time (cycles per byte, 1.0 to 2.5) against message size (256 bytes to 1 MB) on Zen (Ryzen 7 1800X processor) for 1, 2, 4, and 8 messages.]

2.4 SHA3 Implementation

The SHA-3 Family of Functions

SHA-3 is composed of four hash functions and two XOFs, called SHAKE.

Function    Output size (n)   Bit-rate (r)   Security level
SHA3-224    224               1,152          112
SHA3-256    256               1,088          128
SHA3-384    384               832            192
SHA3-512    512               576            256
SHAKE128    n                 1,344          min(n/2, 128)
SHAKE256    n                 1,088          min(n/2, 256)

The input of a SHA-3 function is split into blocks of r bits. The larger the bit-rate, the faster the execution.

Extendable-Output Function

An extendable-output function (XOF) maps an arbitrary-length bit string to a digest of variable length.

\[ \mathrm{XOF} : \{0,1\}^* \times \mathbb{N} \to \{0,1\}^*, \qquad (a, n) \mapsto \mathrm{XOF}(a, n) \in \{0,1\}^n \]

The SHA-3 Design

SHA-3 was designed using the sponge construction proposed in 2009 by Bertoni et al. The construction has three phases: initializing, absorbing, and squeezing.

Sponge Construction

Initializing: The state has 1,600 bits, all initialized to 0; the input is then split into blocks of r bits.

Absorbing: Each block is added to the first r bits of the state; the state is then processed by a permutation function P.

Squeezing: After the input has been consumed, the function P is used to produce ⌊n/r⌋ output blocks of r bits, concatenated with n (mod r) bits taken from the last state.
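The three phases map onto a short loop structure. The sketch below shows only that structure: `toy_permutation` is an invertible placeholder and NOT the Keccak permutation P, padding is omitted (the input is assumed to be already padded to a multiple of the rate), sizes are in bytes rather than bits, and "adding" a block is implemented as XOR. All names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define STATE_BYTES 200   /* 1,600 bits */

/* Placeholder for P (NOT Keccak-f): a simple invertible byte-mixing loop. */
static void toy_permutation(uint8_t s[STATE_BYTES]) {
    for (int round = 0; round < 4; round++)
        for (int i = 0; i < STATE_BYTES; i++)
            s[i] = (uint8_t)(s[i] + 31 * s[(i + 1) % STATE_BYTES] + i + round);
}

/* Absorbs len bytes in blocks of rate bytes (0 < rate <= STATE_BYTES, len a
 * multiple of rate), then squeezes outlen bytes into out. */
void toy_sponge(const uint8_t *in, size_t len, size_t rate, uint8_t *out, size_t outlen) {
    uint8_t s[STATE_BYTES];
    memset(s, 0, sizeof s);                                   /* initializing */
    for (size_t off = 0; off + rate <= len; off += rate) {    /* absorbing */
        for (size_t i = 0; i < rate; i++) s[i] ^= in[off + i];
        toy_permutation(s);
    }
    while (outlen > 0) {                                      /* squeezing */
        size_t n = (outlen < rate) ? outlen : rate;
        memcpy(out, s, n);
        out += n;
        outlen -= n;
        if (outlen > 0) toy_permutation(s);
    }
}
```

Requesting a longer output only adds more squeezing steps, so a short digest is always a prefix of a longer one computed from the same input; this is what makes the construction usable as an XOF.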

Permutation Function P

The state has 1,600 bits and is represented by a 5 × 5 matrix S, where each entry is a 64-bit word.

\[ S = \begin{pmatrix} s_0 & s_1 & s_2 & s_3 & s_4 \\ s_5 & s_6 & s_7 & s_8 & s_9 \\ s_{10} & s_{11} & s_{12} & s_{13} & s_{14} \\ s_{15} & s_{16} & s_{17} & s_{18} & s_{19} \\ s_{20} & s_{21} & s_{22} & s_{23} & s_{24} \end{pmatrix}; \qquad S[x, y] = s_{5x+y} \text{ for } 0 \le x, y < 5. \]

The permutation P consists of 24 rounds, each applying the transformations θ, ρ, π, χ, and ι in sequence.

Using 256-bit instructions

The SHA-3 state is stored in seven 256-bit registers.

Register   Contents
Y0         s0    s1    s2    s3
Y1         s5    s6    s7    s8
Y2         s10   s11   s12   s13
Y3         s15   s16   s17   s18
Y4         s20   s21   s22   s23
Y5         s24   s24   s24   s24
Y6         s4    s9    s14   s19

Pros:
• It uses just a few 256-bit vector registers.

Cons:
• The permutation instructions of AVX2 are expensive.

• Yi: 256-bit vector registers.
• si: 64-bit words.

Using 128-bit instructions

• State representation.

Register   Contents      Register   Contents
X0         s0    s1      X7         s15   s16
X1         s2    s3      X8         s17   s18
X2         s5    s6      X9         s14   s19
X3         s7    s8      X10        s20   s21
X4         s4    s9      X11        s22   s23
X5         s10   s11     X12        s24   s24
X6         s12   s13

• The state uses 13 variables of 128 bits.
• Pros:
  • The permutation instructions of SSE4 are cheaper than those of AVX2.
• Cons:
  • It uses more variables.

• Xi: 128-bit vector registers.

4-way implementation

• State representation: register Y_j holds the word s_j of four independent states, Y_j = [s_j^1, s_j^2, s_j^3, s_j^4] for 0 ≤ j ≤ 24.
• The state uses 25 variables of 256 bits.
• Pros:
  • No 64-bit permutations are needed.
• Cons:
  • It uses many variables, and the processor has only 16 registers.

• Yi: 256-bit vector registers.

Performance of SHA3-128 Function

Cycles per byte taken to hash a message of 4096 bytes.

[Figure: bar chart of running time (cycles per byte, 0 to 18) on Haswell, Skylake, and Zen for the implementations x64, x64shld, AVX2, generic64, 2M-SSE, and 4M-AVX2.]

Measurements were taken using the official Keccak Code Package.

SHA3 Parallel Hashing: Two and Four Messages

• (1M) 64-bit native instructions.
• (2M) 128-bit vector instructions [SSE2/AVX].
• (4M) 256-bit vector instructions [AVX2].

[Figure: speedup (1 to 4) against number of messages (1 to 4) on Haswell, Skylake, and Zen.]

The performance of Zen does not scale well when hashing 4 messages.

Section 3
Elliptic Curve Cryptography

3.1 Elliptic Curves

ECC: Software Implementation

• Introduction
• Point Multiplication kP
• Elliptic Curve Diffie-Hellman (X25519, X448)
• Digital Signature (EdDSA)
• Performance (vector instructions on Intel Haswell/Skylake)

Elliptic Curve Cryptography (ECC)

• In 1985, Koblitz [8] and Miller [9] independently suggested the use of elliptic curves for cryptographic purposes.
• ECC achieves the same security as RSA-based protocols using shorter key sizes. For example, at the 128-bit security level:
  • RSA uses keys of 3,072 bits.
  • ECC uses keys of 256 bits.
• Applications of ECC:
  • Key-agreement protocols.
  • Digital signatures.
  • Bitcoin.
  • End-to-end encryption.
  • Smart card security.

Mathematical Aspects of Elliptic Curves

• An elliptic curve is defined by the following equation:

\[ E/\mathbb{F}_p : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 \]

where a_1, a_2, a_3, a_4, a_6 ∈ F_p and p is a prime number.
• The points of an elliptic curve form a commutative group, with the point O as identity:

\[ (E, +) = \{ (x, y) \in E \} \cup \{ \mathcal{O} \} \]

• The addition of two different points (x_3, y_3) = (x_1, y_1) + (x_2, y_2) is calculated as follows (for a curve in the simplified form y^2 = x^3 + ax + b):

\[ x_3 = \left( \frac{y_2 - y_1}{x_2 - x_1} \right)^2 - x_1 - x_2 \]
\[ y_3 = \left( \frac{y_2 - y_1}{x_2 - x_1} \right) (x_1 - x_3) - y_1 \]
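The addition formulas can be tried out on a toy curve. The sketch below works over a small prime field so that plain 64-bit integers suffice; the test curve y^2 = x^3 + 2x + 2 over F_17, the type `point`, and the function names are assumptions of this example. It handles only two distinct points with different x-coordinates, exactly the case covered by the formulas, and it is not constant-time.

```c
#include <stdint.h>

typedef struct { int64_t x, y; } point;

/* Reduces a into the range [0, p). */
static int64_t mod_norm(int64_t a, int64_t p) {
    return ((a % p) + p) % p;
}

/* Computes b^e mod p by square-and-multiply (small p only, to avoid overflow). */
static int64_t mod_pow(int64_t b, int64_t e, int64_t p) {
    int64_t r = 1;
    b = mod_norm(b, p);
    while (e > 0) {
        if (e & 1) r = r * b % p;
        b = b * b % p;
        e >>= 1;
    }
    return r;
}

/* Adds two affine points with x1 != x2 on y^2 = x^3 + ax + b over F_p.
 * The division is a multiplication by the inverse, a^(p-2) mod p. */
point ec_add(point p1, point p2, int64_t p) {
    int64_t inv = mod_pow(p2.x - p1.x, p - 2, p);
    int64_t lambda = mod_norm(p2.y - p1.y, p) * inv % p;
    point r;
    r.x = mod_norm(lambda * lambda - p1.x - p2.x, p);
    r.y = mod_norm(lambda * (p1.x - r.x) - p1.y, p);
    return r;
}
```

On the test curve, adding (5, 1) and (6, 3) gives (10, 6), and adding (5, 1) and (10, 6) gives (3, 1).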

Point Addition

Let P and Q be two points on the curve; then P + Q can be computed using a geometric construction:

• Trace a line passing through P and Q.
• This line will intersect the curve at a third point R.
• Trace a vertical line passing through R.
• The point where this vertical line intersects the curve again is defined as the sum P + Q.

Point Doubling

The addition of a point P with itself can be computed as follows:

• Trace the line tangent to the curve at the point P.
• This line will intersect the curve at a point R.
• Trace a vertical line passing through R.
• The point where this vertical line intersects the curve again is defined as 2P.
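The tangent construction can also be written algebraically. As a sketch, assuming the short Weierstrass form $y^2 = x^3 + a x + b$ (that is, $a_1 = a_2 = a_3 = 0$, $a = a_4$, $b = a_6$ in the general equation) and $y_1 \neq 0$, the standard doubling formulas for $(x_3, y_3) = 2(x_1, y_1)$ are:

```latex
\lambda = \frac{3x_1^2 + a}{2y_1}, \qquad
x_3 = \lambda^2 - 2x_1, \qquad
y_3 = \lambda\,(x_1 - x_3) - y_1 .
```

Here $\lambda$ is the slope of the tangent at P; when $y_1 = 0$ the tangent is vertical and $2P = \mathcal{O}$.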

Point Multiplication kP

Given an integer $k$ and a point $P \in E$, point multiplication is defined as:

$$kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}$$

For example:

$$15P = (1111)_2 P = (2^3 + 2^2 + 2^1 + 1)P = 2^3 P + 2^2 P + 2^1 P + P$$

In general, for $k = (k_{n-1}, \ldots, k_1, k_0)_2$:

$$kP = k_{n-1} 2^{n-1} P + \cdots + k_1 2P + k_0 P$$

Point Multiplication: Double-and-Add algorithm

Input: $P \in E$ and $k \in \mathbb{Z}^+$.
Output: $kP$
  $(k_{n-1}, \ldots, k_1, k_0)_2 \leftarrow k$
  $Q \leftarrow \mathcal{O}$
  for $i \leftarrow n - 1$ to $0$ do
    $Q \leftarrow 2Q$
    $Q \leftarrow Q + k_i P$
  end for
  return $Q$
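The algorithm above can be sketched in Python on a toy curve. The curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$, the point $(0, 10)$ and the function names are illustrative assumptions, not parameters from the slides. The doubling case uses the standard tangent-slope formula for the short Weierstrass form. The code branches on the bits of $k$, so it is not side-channel protected.

```python
# Toy short Weierstrass curve y^2 = x^3 + a*x + b over F_p.
# Illustrative parameters only: far too small for real use.
p, a, b = 97, 2, 3
O = None  # point at infinity (group identity)

def add(P, Q):
    """Affine group law covering identity, inverse, doubling and addition."""
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O                                            # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p    # tangent slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p         # chord slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

def double_and_add(k, P):
    """Left-to-right double-and-add: scans the bits k_{n-1}, ..., k_0."""
    Q = O
    for bit in bin(k)[2:]:
        Q = add(Q, Q)          # Q <- 2Q
        if bit == "1":
            Q = add(Q, P)      # Q <- Q + k_i * P
    return Q

P = (0, 10)                    # on the curve: 10^2 = 100 = 3 (mod 97)
print(double_and_add(15, P))   # 15P = 2^3 P + 2^2 P + 2 P + P
```

The modular inverse `pow(x, -1, p)` requires Python 3.8 or later.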

Techniques for kP

The operation kP can be performed using different techniques:
• Double-and-Add Algorithm (right-to-left).
• Montgomery Algorithm.
• w-NAF representations.
• Fixed recoding representations.
• Elliptic curves with endomorphism, GLV/GLS curves.

Elliptic Curve Discrete Logarithm Problem (ECDLP)

Given two points P and Q, the problem of finding an integer k such that Q = kP is known as the elliptic curve discrete logarithm problem.

• Pollard's rho algorithm is the best known algorithm that solves the ECDLP on general curves. The complexity of this algorithm is

$$O\left(\sqrt{\#E(\mathbb{F}_p)}\right),$$

where $\#E(\mathbb{F}_p) \approx p$ is the number of points on the curve.
• For example, for an elliptic curve defined over a prime field with $p \approx 2^{256}$, about $2^{128}$ operations are required to solve the ECDLP.
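The square-root estimate can be checked numerically. The sketch below uses the NIST P-256 prime as a stand-in for the group order, relying on $\#E(\mathbb{F}_p) \approx p$. The helper name `rho_cost_bits` is an assumption made for illustration.

```python
import math

# NIST P-256 prime, used here as an approximation of the group order.
p256 = 2**256 - 2**224 + 2**192 + 2**96 - 1

def rho_cost_bits(group_order):
    """log2 of sqrt(group_order): rough Pollard rho cost in group operations."""
    return math.log2(math.isqrt(group_order))

print(round(rho_cost_bits(p256)))
```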

The Standardized Elliptic Curves by NIST

• In 1999, NIST standardized a set of elliptic curves for digital signatures (ECDSA) and key agreement (ECDH) [10].
• NIST's curves have the following equation:

$$E/\mathbb{F}_p : y^2 = x^3 - 3x + b$$

• Prime curves: P-256 and P-384

            P-256                                        P-384
Security    128-bit                                      192-bit
p           $2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$   $2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$
b           0x5ac635d...27d2604b                         0xb3312fa...d3ec2aef
#E          $2^{256} - 2^{224} + 2^{192} - 2^{128} + t$  $2^{384} - t$
t           0xbce6faa...fc632551                         0x389cb27...333ad68d

RFC 7748: Edwards/Montgomery Elliptic Curves

In January 2016, RFC 7748 recommended the use of Curve25519 and Curve448 in two elliptic curve models:
• Edwards curves: $\mathcal{E} : ax^2 + y^2 = 1 + dx^2y^2$.
• Montgomery curves: $E : v^2 = u^3 + Au^2 + u$.

            Curve25519 (Bernstein [1, 2])                  Curve448 (Hamburg [5])
Security    128-bit                                        224-bit
p           $2^{255} - 19$                                 $2^{448} - 2^{224} - 1$
(a, d, A)   $(-1, -\frac{121665}{121666}, 486662)$         $(1, -39081, 156326)$
#E          $8\ell$                                        $4\ell$
ℓ           $2^{252}$ + 0x14def9dea2f79cd65812631a5cf5d3ed   $2^{446}$ − 0x8335dc163bb124b65129c96fde933d8d723a70aadc873d6d54a7bb0d

3.2 Elliptic Curve Diffie-Hellman

Diffie-Hellman Protocol using Montgomery Curves

RFC 7748 recommends two functions to compute a shared secret:
X25519  Keys of 32 bytes.
X448    Keys of 56 bytes.

Party A                                   Party B
$a \xleftarrow{\$} \{0, 1\}^{256}$        $b \xleftarrow{\$} \{0, 1\}^{256}$
$K_A \leftarrow X25519(9, a)$             $K_B \leftarrow X25519(9, b)$
$K = X25519(K_B, a)$                      $K = X25519(K_A, b)$

Each party sends its public value ($K_A$ or $K_B$) to the other; K is the shared secret.

The X Function

Internally, X is the calculation of an elliptic curve point multiplication kP.

Montgomery ladder algorithm.
Input: $P \in E$ and $k \in \mathbb{Z}^+$.
Output: $kP$
 1: $(k_{n-1}, \ldots, k_0)_2 \leftarrow k$, with $k_n = 0$
 2: $Q_0 \leftarrow \mathcal{O}$
 3: $Q_1 \leftarrow P$
 4: for $i \leftarrow n - 1$ to $0$ do
 5:   $b \leftarrow k_i \oplus k_{i+1}$
 6:   $Q_0, Q_1 \leftarrow \text{cswap}(b, Q_0, Q_1)$
 7:   $Q_0, Q_1 \leftarrow 2Q_0, Q_0 + Q_1$
 8: end for
 9: $Q_0, Q_1 \leftarrow \text{cswap}(k_0, Q_0, Q_1)$
10: return $Q_0$

Example: 22P, with $22 = (10110)_2$. After each bit $k_i$, and with any pending swap undone, the pair $(Q_0, Q_1)$ holds:

k_i    Q0     Q1
-      O      P
1      P      2P
0      2P     3P
1      5P     6P
1      11P    12P
0      22P    23P
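A minimal Python sketch of the ladder, instantiated for Curve25519, is given below. The x-only step formulas, the constant 121665 and the scalar clamping follow RFC 7748. The argument order (u-coordinate first, scalar second) follows the slides' `X25519(9, a)` notation. Function and variable names are illustrative assumptions, and Python integers give no constant-time guarantee.

```python
import secrets

P = 2**255 - 19
A24 = 121665                      # (486662 - 2) / 4, used in the doubling step

def cswap(bit, x, y):
    """Conditional swap via masking; real code must avoid branching on `bit`."""
    t = -bit & (x ^ y)            # -bit is 0 or an all-ones mask
    return x ^ t, y ^ t

def x25519(u_bytes, k_bytes):
    """X25519(u, k): x-coordinate of k*P, where P has x-coordinate u."""
    k = int.from_bytes(k_bytes, "little")
    k = (k & ~7 & ((1 << 254) - 1)) | (1 << 254)    # clamp as in RFC 7748
    x1 = int.from_bytes(u_bytes, "little") & ((1 << 255) - 1)
    x2, z2 = 1, 0                 # Q0 = O in projective (X : Z) form
    x3, z3 = x1, 1                # Q1 = P
    swap = 0
    for i in reversed(range(255)):
        bit = (k >> i) & 1
        swap ^= bit               # b = k_i xor k_{i+1}
        x2, x3 = cswap(swap, x2, x3)
        z2, z3 = cswap(swap, z2, z3)
        swap = bit
        # Ladder step: (Q0, Q1) <- (2*Q0, Q0 + Q1)
        a, b = (x2 + z2) % P, (x2 - z2) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        da, cb = d * a % P, c * b % P
        x3 = (da + cb) ** 2 % P
        z3 = x1 * (da - cb) ** 2 % P
        x2 = aa * bb % P
        z2 = e * (aa + A24 * e) % P
    x2, x3 = cswap(swap, x2, x3)  # final cswap(k_0, Q0, Q1)
    z2, z3 = cswap(swap, z2, z3)
    return (x2 * pow(z2, P - 2, P) % P).to_bytes(32, "little")

# Key exchange as in the protocol slide.
base = (9).to_bytes(32, "little")
a_key, b_key = secrets.token_bytes(32), secrets.token_bytes(32)
K_A, K_B = x25519(base, a_key), x25519(base, b_key)
print(x25519(K_B, a_key) == x25519(K_A, b_key))
```

The loop keeps the swap pending between iterations, so each bit costs one conditional swap instead of two.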

Representation of Prime Field Elements

Elements of $\mathbb{F}_p$ are split into words of size $w$:

$$a \in \mathbb{F}_p, \quad a = \sum_{i=0}^{t-1} a_i 2^{wi} = a_0 + a_1 2^{w} + a_2 2^{2w} + \ldots, \quad \text{where } t = \left\lceil \frac{|p|}{w} \right\rceil$$

and $|p|$ is the bit length of $p$.

Let W be the machine's word size; then there are two cases:
w = W  Full-radix or saturated arithmetic.
w < W  Reduced-radix, redundant representation, unsaturated arith...

E.g. for $p = 2^{255} - 19$ and a W = 64 instruction set, use an array of t = 5 words storing coefficients of w = 51 bits.
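The reduced-radix example can be sketched in Python as follows. Python integers stand in for 64-bit words and 128-bit products, and the function names are illustrative assumptions. The multiplication folds the high half of the product back using $2^{255} \equiv 19 \pmod p$, which is what makes the 5 × 51-bit layout convenient.

```python
P = 2**255 - 19
W_BITS, T = 51, 5                  # w = 51 bits per limb, t = ceil(255 / 51) = 5
MASK = (1 << W_BITS) - 1

def to_limbs(a):
    """Split a into t limbs so that a = sum(limbs[i] * 2**(51*i))."""
    return [(a >> (W_BITS * i)) & MASK for i in range(T)]

def from_limbs(limbs):
    """Recombine limbs into a single integer (not necessarily reduced mod p)."""
    return sum(x << (W_BITS * i) for i, x in enumerate(limbs))

def mul(f, g):
    """Schoolbook product of two limb vectors, reduced modulo 2^255 - 19."""
    h = [0] * T
    for i in range(T):
        for j in range(T):
            term = f[i] * g[j]
            if i + j >= T:
                term *= 19         # 2^(51*5) = 2^255 = 19 (mod p)
            h[(i + j) % T] += term
    # Two carry passes bring every limb back to about 51 bits;
    # the carry out of the top limb wraps around multiplied by 19.
    for _ in range(2):
        carry = 0
        for i in range(T):
            h[i] += carry
            carry, h[i] = h[i] >> W_BITS, h[i] & MASK
        h[0] += 19 * carry
    return h

x, y = 2**254 + 12345, 2**255 - 20
print(from_limbs(mul(to_limbs(x), to_limbs(y))) % P == x * y % P)
```

The result is congruent to the product but may not be the unique representative below $p$; this redundancy is what the reduced-radix representation permits.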

X25519 Shared Secret Computation

Full-radix: using MULX+ADCX/ADOX gives an 11-14% reduction in running time with respect to the fastest implementation reported in SUPERCOP.
Reduced-radix: an additional 8-10% is obtained by using AVX2.

[Figure: bar chart of running time (10^3 cycles) on Haswell and Skylake, with a 100 Kcc reference mark, for: Moon (floodyberry), x64; Tung, SAC 2015, x64+SSE2; Oliveira et al., SAC 2017, x64 (MULX/ADCX); Our code, AVX2, MULX/ADCX.]

X448 Shared Secret Computation

We reduce the timings reported by Hamburg by 13% on Haswell and by 17% on Skylake.

[Figure: bar chart of running time (10^3 clock cycles) on Haswell and Skylake for: eBACS (supercop), x64; Hamburg, x64; Our code, AVX2.]

3.3 Digital Signatures

Digital Signatures

• They are used to verify both the integrity and the authenticity of a message.
• Basic operations:
  Sign    Given a message, an algorithm computes a bit string, called the signature, which is associated with the private key of the signer.
  Verify  This step determines whether a signature is valid, i.e. whether the signature for the message was created using the private key corresponding to the referenced public key.

Signature Generation

[Figure: signature generation diagram with the blocks Hash and Signing; the Private Key is an input to Signing.]

• The message is processed through a cryptographic hash function H to obtain a digest value.
• The digest and the private key are used together to generate a signature.
• Both the message and the signature must be sent together for later verification.

Signature Verification

[Figure: signature verification diagram; the Verification block takes the Public Key as input and outputs Valid or Reject.]

• Using the signer's public key, the verification algorithm determines whether a signature is valid.
• A valid signature ensures the authenticity of the signer and the integrity of the message.

Digital Signatures

1991  PKCS#1: Rivest-Shamir-Adleman scheme (RSA).
1993  FIPS 186: Digital Signature Algorithm (DSA).
1999  ANSI X9.62: Elliptic Curve Digital Signature Algorithm (ECDSA).
2011  Bernstein et al. proposed the Edwards Digital Signature Algorithm (EdDSA).
2015  EdDSA is in a draft of the IETF for discussion [6].
2017  EdDSA is described in RFC 8032 [7].

The use of EdDSA is increasing; for instance, OpenSSH now supports Ed25519 signatures.

3.4 EdDSA Scheme

Edwards Digital Signature Algorithm

• This is a novel signature scheme based on Edwards curves.
• RFC 8032 [7] describes the usage of two instances of EdDSA (Ed25519 and Ed448).
• EdDSA delivers digital signatures faster than ECDSA.
• It consists of three primitive operations:
  • Key Generation.
  • Signing.
  • Verification.

EdDSA: Domain parameters

• Public key of $b$ bits and signature size of $2b$ bits.
• $\mathcal{E}_d(\mathbb{F}_p)$, an Edwards curve over a prime field.
• $\ell \cdot h = \#\mathcal{E}_d(\mathbb{F}_p)$, the number of points on the curve.
• $B \neq (0, 1)$, a generator point.
• $c \in \{2, 3\}$ and $n = \log_2(\ell)$, two constants, with $c \le n < b$.
• $s = \text{Encode}(P)$ converts a point $P = (x, y)$ into a string $s$: $s = (x \bmod 2) \,\|\, y$.
• $(x, y) = \text{Decode}(s)$ converts a string $s$ into a pair $(x, y)$:

$$y = s \bmod 2^{b-1}, \qquad x = \sqrt{\frac{y^2 - 1}{dy^2 - a}} \quad \text{such that } x \equiv s_{b-1} \pmod 2.$$

• $H$, a hash function producing $2b$ bits.
• Ex: use of the SHAKE128 function, which is part of the SHA3 standard.

EdDSA: Key Generation

Computing the secret and public keys, (sk, pk):

1: $sk \in_R [0, \ell)$
2: $h = (h_{2b-1}, \ldots, h_1, h_0)_2 \leftarrow H(sk)$
3: $a \leftarrow 2^n + \sum_{c \le i < n} 2^i h_i$   ($a$ has $n + 1$ bits, with the bottom $c$ bits cleared)
4: $pk \leftarrow aB$
5: return $(sk, pk)$

EdDSA: Signing

Given a message M and the pair of keys (sk, pk), compute the signature (R, S) as:

1: $h = (\underbrace{h_{2b-1}, \ldots, h_b}_{h_H}, \underbrace{h_{b-1}, \ldots, h_0}_{h_L})_2 \leftarrow H(sk)$
2: $a \leftarrow 2^n + \sum_{c \le i < n} 2^i h_i$
3: $r \leftarrow H(h_H \,\|\, M) \pmod{\ell}$
4: $R' \leftarrow rB$
5: $R \leftarrow \text{Encode}(R')$
6: $S \leftarrow r + H(R \,\|\, pk \,\|\, M) \cdot a \pmod{\ell}$
7: return $(R, S)$

EdDSA: Verification

Given a message M, a signature (R, S) and a public key pk:

$P \leftarrow \text{Decode}(pk)$
$h \leftarrow H(R \,\|\, pk \,\|\, M) \pmod{\ell}$

Accept the signature if the following is true:

$$P \in \mathcal{E}_d(\mathbb{F}_p) \quad \text{and} \quad S \in [0, \ell) \quad \text{and} \quad SB = R + hP$$

In the last equation, R denotes the curve point obtained by decoding the string R of the signature.
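The three operations can be sketched for the Ed25519 instance in Python. Several choices below are assumptions taken from RFC 8032 rather than from the slides: $H$ is SHA-512, $b = 256$, $c = 3$, $n = 254$, the base point has $y = 4/5$, the secret key is a random 32-byte string, and the public key is stored encoded. $S$ is kept as an integer, as on the slides. The affine arithmetic is slow and not constant-time, and the function names are illustrative.

```python
import hashlib
import secrets

p = 2**255 - 19
ell = 2**252 + 0x14def9dea2f79cd65812631a5cf5d3ed
d = -121665 * pow(121666, -1, p) % p
c, n = 3, 254

def sha512(data):
    return hashlib.sha512(data).digest()

def add(P, Q):
    """Unified affine addition on -x^2 + y^2 = 1 + d*x^2*y^2 (a = -1)."""
    (x1, y1), (x2, y2) = P, Q
    t = d * x1 * x2 * y1 * y2 % p
    x3 = (x1 * y2 + y1 * x2) * pow((1 + t) % p, -1, p) % p
    y3 = (y1 * y2 + x1 * x2) * pow((1 - t) % p, -1, p) % p
    return (x3, y3)

def mul(k, P):
    """Plain double-and-add; (0, 1) is the neutral element."""
    Q = (0, 1)
    for bit in bin(k)[2:]:
        Q = add(Q, Q)
        if bit == "1":
            Q = add(Q, P)
    return Q

def encode(P):
    x, y = P
    return (y | ((x & 1) << 255)).to_bytes(32, "little")

def decode(s):
    v = int.from_bytes(s, "little")
    y, sign = v & ((1 << 255) - 1), v >> 255
    if y >= p:
        return None
    x2 = (y * y - 1) * pow((d * y * y + 1) % p, -1, p) % p   # (y^2-1)/(d*y^2-a)
    x = pow(x2, (p + 3) // 8, p)                 # square-root candidate
    if (x * x - x2) % p:
        x = x * pow(2, (p - 1) // 4, p) % p      # multiply by sqrt(-1)
    if (x * x - x2) % p or (x == 0 and sign):
        return None                              # not a valid encoding
    return (p - x if (x & 1) != sign else x, y)

B = decode((4 * pow(5, -1, p) % p).to_bytes(32, "little"))

def expand(sk):
    """Return (a, h_H): a = 2^n + sum of 2^i * h_i for c <= i < n."""
    h = sha512(sk)
    a = int.from_bytes(h[:32], "little")
    return (a & ((1 << n) - (1 << c))) | (1 << n), h[32:]

def keygen():
    sk = secrets.token_bytes(32)
    return sk, encode(mul(expand(sk)[0], B))

def sign(sk, pk, M):
    a, h_high = expand(sk)
    r = int.from_bytes(sha512(h_high + M), "little") % ell
    R = encode(mul(r, B))
    h = int.from_bytes(sha512(R + pk + M), "little") % ell
    return R, (r + h * a) % ell

def verify(pk, M, sig):
    R, S = sig
    P, R_point = decode(pk), decode(R)
    if P is None or R_point is None or not 0 <= S < ell:
        return False
    h = int.from_bytes(sha512(R + pk + M), "little") % ell
    return mul(S, B) == add(R_point, mul(h, P))   # SB = R + hP

sk, pk = keygen()
print(verify(pk, b"hello", sign(sk, pk, b"hello")))
```

Because $r$ is derived from the key and the message, signing needs no fresh randomness.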

Optimization Techniques for EdDSA

Focus on the optimization of two main operations:
• kP, when P is known.
• kP + lQ, when P is known and Q is an arbitrary point.

Fixed-point mult: computing kP when P is known

Input: $k$, an $n$-bit integer; $w$, an integer window size; $P$, a fixed point of order $\ell$.
Output: $Q$, a point such that $Q = kP$.

Off-line computation:
1: Compute the look-up tables $T_i \leftarrow \{ d\,2^{wi} P \}$ for odd $d \in [1, 2^{w-1}]$ and all $i \in [0, t)$.

On-line computation:
1: $t \leftarrow \lceil n/w \rceil$
2: $Q \leftarrow \mathcal{O}$
3: Let $(K_0, K_1, \ldots, K_{t-1})_w$ be the signed radix-$w$ representation of $k$.
4: for $i \leftarrow 0$ to $t - 1$ do
5:   $P \leftarrow \text{Query}(T_i, K_i)$
6:   $Q \leftarrow Q + P$
7: end for
8: return $Q$

Query must be protected against side-channel attacks.
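The table-based method can be sketched in Python under some loudly stated assumptions. To stay short, the group is not an elliptic curve but the multiplicative group modulo the prime $2^{127} - 1$, so "$P + Q$" is a modular product and "$dP$" is a modular power. The recoding uses signed radix-$2^w$ digits in $(-2^{w-1}, 2^{w-1}]$, and the tables hold all multiples $d \in [1, 2^{w-1}]$ rather than only odd ones. One extra bit of headroom is added to $t$ so that the last carry fits. All names are illustrative.

```python
W = 4                                   # window size w
N = 64                                  # scalar bit length n (toy size)
T = -(-(N + 1) // W)                    # number of digits, with one spare bit
Q_MOD = 2**127 - 1                      # stand-in group: integers mod a prime
G = 3                                   # the fixed "point" P
IDENTITY = 1

def gadd(x, y):
    return x * y % Q_MOD                # group operation ("point addition")

def gneg(x):
    return pow(x, -1, Q_MOD)            # group inverse ("point negation")

# Off-line: TABLE[i][d - 1] = d * 2^(w*i) * P for d in [1, 2^(w-1)].
TABLE = [[pow(G, d << (W * i), Q_MOD) for d in range(1, 2**(W - 1) + 1)]
         for i in range(T)]

def signed_digits(k):
    """Signed radix-2^w digits K_0..K_{t-1}, each in (-2^(w-1), 2^(w-1)]."""
    digits = []
    for _ in range(T):
        d = k & (2**W - 1)
        k >>= W
        if d > 2**(W - 1):
            d -= 2**W                   # use a negative digit and carry one
            k += 1
        digits.append(d)
    return digits

def query(table_i, digit):
    """Scan the whole table so the access pattern does not depend on the digit."""
    idx, result = abs(digit), IDENTITY
    for d, entry in enumerate(table_i, start=1):
        hit = int(d == idx)
        result = hit * entry + (1 - hit) * result   # arithmetic select
    return gneg(result) if digit < 0 else result

def fixed_base_mul(k):
    """On-line phase: one table query and one addition per digit, no doublings."""
    acc = IDENTITY
    for table_i, digit in zip(TABLE, signed_digits(k)):
        acc = gadd(acc, query(table_i, digit))
    return acc

print(fixed_base_mul(22) == pow(G, 22, Q_MOD))
```

Python cannot guarantee constant-time behaviour; the full-table scan only illustrates the access pattern a protected Query would use.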

Double-point mult: computing kP + lQ when P is known and Q is an arbitrary point

One efficient algorithm is the interleaving method using ω-NAF.

• Obtain the ω-NAF of $k$ and $l$: $\{k_i\} \leftarrow k$ and $\{l_i\} \leftarrow l$.
• There exists a pair $(\omega_k, \omega_l)$ that minimizes the number of operations.
• Precompute $T_d = dP$ for odd $d \in [1, 2^{\omega_k - 1}]$.
• Compute $U_d = dQ$ for odd $d \in [1, 2^{\omega_l - 1}]$.

$R \leftarrow \mathcal{O}$
for $i \leftarrow n - 1$ to $0$ do
  $R \leftarrow 2R$
  if $k_i \neq 0$ then $R \leftarrow R + T_{k_i}$
  if $l_i \neq 0$ then $R \leftarrow R + U_{l_i}$
end for

• $R$ is the required point $kP + lQ$.
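The interleaving method can be sketched in Python. As an assumption made to keep the example short, the group is again the multiplicative group modulo the prime $2^{127} - 1$ instead of an elliptic curve, so "$P + Q$" is a modular product. The window widths and all names are illustrative choices.

```python
Q_MOD = 2**127 - 1                      # stand-in group: integers mod a prime
IDENTITY = 1

def gadd(x, y):
    return x * y % Q_MOD                # group operation ("point addition")

def gneg(x):
    return pow(x, -1, Q_MOD)            # group inverse ("point negation")

def smul(d, P):
    return pow(P, d, Q_MOD)             # d * P, used only to fill the tables

def wnaf(k, w):
    """Width-w NAF of k, least significant digit first.

    Nonzero digits are odd with absolute value below 2^(w-1)."""
    digits = []
    while k > 0:
        d = 0
        if k & 1:
            d = k % (1 << w)
            if d >= 1 << (w - 1):
                d -= 1 << w
            k -= d
        digits.append(d)
        k >>= 1
    return digits

def lookup(table, d):
    """Return d * P from a table of odd positive multiples."""
    return table[d] if d > 0 else gneg(table[-d])

def double_scalar_mul(k, P, l, Q, wk=5, wl=3):
    """Compute kP + lQ with one shared chain of doublings."""
    kd, ld = wnaf(k, wk), wnaf(l, wl)
    n = max(len(kd), len(ld))
    kd += [0] * (n - len(kd))
    ld += [0] * (n - len(ld))
    T = {d: smul(d, P) for d in range(1, 1 << (wk - 1), 2)}   # off-line for fixed P
    U = {d: smul(d, Q) for d in range(1, 1 << (wl - 1), 2)}   # on-line for Q
    R = IDENTITY
    for i in reversed(range(n)):
        R = gadd(R, R)                  # R <- 2R
        if kd[i]:
            R = gadd(R, lookup(T, kd[i]))
        if ld[i]:
            R = gadd(R, lookup(U, ld[i]))
    return R

print(double_scalar_mul(22, 5, 15, 7) == pow(5, 22, Q_MOD) * pow(7, 15, Q_MOD) % Q_MOD)
```

A wider window is affordable for the fixed point because its table is computed once, while the table for the arbitrary point is rebuilt on every call.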

Improvements on Ed25519 Signature Generation

The synergy between AVX2, MULX, and ADCX/ADOX instructions increases the performance of the signing operation.

[Figure: bar chart of running time (10^3 cycles) on Haswell and Skylake for: Moon (floodyberry), SSE2, 24 KB; Moon (floodyberry), x64, 24 KB; Schwabe (supercop), x64+SSE2, 30 KB; Our code, AVX2, MULX/ADCX, 12 KB; Our code, AVX2, MULX/ADCX, 24 KB.]

Improvements on Ed448 Signature Generation

Running time was reduced by around 16-18% on the Haswell and Skylake platforms.

[Figure: bar chart of running time (10^3 cycles) on Haswell and Skylake for: supercop, x64; Hamburg, x64; Our code, AVX2.]

Thanks for your attention! [email protected]

References

[1] Daniel J. Bernstein. Curve25519: New Diffie-Hellman Speed Records. In Moti Yung, Yevgeniy Dodis, Aggelos Kiayias, and Tal Malkin, editors, Public Key Cryptography, volume 3958 of Lecture Notes in Computer Science, pages 207–228. Springer, 2006.

[2] Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin Yang. High-speed high-security signatures. Journal of Cryptographic Engineering, 2(2):77–89, 2012.

[3] Joppe W. Bos, J. Alex Halderman, Nadia Heninger, Jonathan Moore, Michael Naehrig, and Eric Wustrow. Elliptic Curve Cryptography in Practice. In Nicolas Christin and Reihaneh Safavi-Naini, editors, Financial Cryptography and Data Security: 18th International Conference, FC 2014, Christ Church, Barbados, March 3-7, 2014, Revised Selected Papers, pages 157–175, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.

[4] Intel Corporation. Intel Instruction Set Architecture Extensions. Available at https://software.intel.com/en-us/intel-isa-extensions, July 2013.

[5] Mike Hamburg. Ed448-Goldilocks, a new elliptic curve. Cryptology ePrint Archive, Report 2015/625, 2015. http://eprint.iacr.org/.

[6] Simon Josefsson and Niels Moeller. EdDSA and Ed25519 draft-josefsson-eddsa-ed25519-03. Available on https://tools.ietf.org/html/draft-josefsson-eddsa-ed25519-03, May 2015.

[7] Simon Josefsson and Ilari Liusvaara. Edwards-Curve Digital Signature Algorithm (EdDSA). RFC 8032, January 2017.

[8] Neal Koblitz. Elliptic Curve Cryptosystems. Mathematics of Computation, 48(177):203–209, January 1987.

[9] Victor S. Miller. Use of Elliptic Curves in Cryptography. In Hugh C. Williams, editor, Advances in Cryptology - CRYPTO '85 Proceedings, volume 218 of Lecture Notes in Computer Science, pages 417–426. Springer Berlin Heidelberg, 1986.

[10] National Institute for Standards and Technology. Digital Signature Standard (DSS). http://csrc.nist.gov/publications/fips/archive/fips186-2/fips186-2.pdf, January 2000.