SoK: A Performance Evaluation of Cryptographic Instruction Sets on Modern Architectures

A. Faz-Hernández, Julio López, Ana Karina D. S. de Oliveira

Institute of Computing, University of Campinas, Brazil

June 4, 2018. Incheon, Republic of Korea.

5th ACM ASIA Public-Key Cryptography Workshop AsiaCCS/Asia PKC 2018

Instruction Sets for Cryptography

Processors have extensions to the Instruction Set Architecture (ISA) that aid the execution of cryptographic algorithms.

[Figure: timeline of ISA extensions and associated micro-architectures, oldest first.]
• SSE2: x64
• CRC32: Nehalem
• AES-NI & CLMUL: Westmere
• MULX & AVX2: Haswell
• ADX & RDSEED
• SHA-NI: Zen

Faz, López, de Oliveira (IC-UNICAMP) Performance Eval Crypto ISA Modern Arch AsiaCCS/Asia-PKC 2018 1 / 33

Motivation

Goals of this work:
• Performance evaluation of algorithms based on SHA-256 and AES.
• Look for optimizations in the use of the SHA New Instructions.

Optimized Implementations:
• Multiple-Message Hashing.
• XMSS Digital Signature.
• AES Modes of Operation.
• AEGIS Authenticated Encryption.

Deliverables:
• Source code available.

/armfazh/flo-shani-aesni

Outline

1 SHA New Instructions
   • Performance Comparison
2 Multiple-Message Hashing
   • SIMD Instructions
   • Pipelining SHA-NI
3 Hash-Based Digital Signatures
4 AES Modes of Operation
5 Final Remarks

SHA New Instructions

SHA New Instructions (SHA-NI)

In 2013, Intel [1] released the specification of the SHA New Instructions (SHA-NI), which consists of:

SHA1:
• SHA1MSG1
• SHA1MSG2
• SHA1NEXTE
• SHA1RNDS4

SHA-256:
• SHA256MSG1
• SHA256MSG2
• SHA256RNDS2

Processors that support SHA-NI:
• 2016: an Intel low-power micro-architecture.
• 2017: AMD Zen, a mid-range and high-end micro-architecture.

The SHA-256 Hashing Algorithm

1. Initialize the state:

\[
S_0 = \begin{bmatrix}
\texttt{0x6a09e667} & \texttt{0xbb67ae85} & \texttt{0x3c6ef372} & \texttt{0xa54ff53a} \\
\texttt{0x510e527f} & \texttt{0x9b05688c} & \texttt{0x1f83d9ab} & \texttt{0x5be0cd19}
\end{bmatrix}
\]

2. Pad the message M and split it into n 512-bit blocks m1, ..., mn.

3. For each block, process the state using the Update function U:

\[
S_i = U(S_{i-1}, m_i), \qquad 1 \le i \le n.
\]

4. The digest of M is SHA2(M) = Sn.
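The padding step can be sketched in C as follows. This is a minimal illustration assuming the standard SHA-256 padding rule (a 0x80 byte, zero bytes, then the 64-bit big-endian bit length); the function names `sha256_padded_blocks` and `sha256_pad` are hypothetical and are not taken from the authors' code.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Number n of 512-bit blocks after padding a message of len bytes:
 * the message, one 0x80 byte, and an 8-byte length field must fit. */
size_t sha256_padded_blocks(size_t len) {
    return (len + 1 + 8 + 63) / 64;
}

/* Writes the padded message into out, which must hold
 * 64 * sha256_padded_blocks(len) bytes, and returns n. */
size_t sha256_pad(uint8_t *out, const uint8_t *msg, size_t len) {
    size_t n = sha256_padded_blocks(len);
    uint64_t bits = (uint64_t)len * 8;

    memset(out, 0, 64 * n);
    if (len > 0)
        memcpy(out, msg, len);
    out[len] = 0x80;                       /* single 1 bit after the message */
    for (int i = 0; i < 8; i++)            /* big-endian bit length at the end */
        out[64 * n - 1 - i] = (uint8_t)(bits >> (8 * i));
    return n;
}
```

A message of up to 55 bytes fits in one block; from 56 bytes on, the length field spills into a second block.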

SHA-256 Update Function

The Update function consists of two phases:
1. Message Schedule.
2. State Update.

Update Phase 1: Message Schedule

• Split the message block into sixteen 32-bit words w0, ..., w15 and calculate w16:

\[
w_{16} = w_0 + \sigma_0(w_1) + w_9 + \sigma_1(w_{14})
\]

where

\[
\sigma_0(x) = \mathrm{rot}(x, 7) \oplus \mathrm{rot}(x, 18) \oplus \mathrm{shr}(x, 3), \qquad
\sigma_1(x) = \mathrm{rot}(x, 17) \oplus \mathrm{rot}(x, 19) \oplus \mathrm{shr}(x, 10).
\]

• Rename the words (the window now covers w1, ..., w16) to calculate w17:

\[
w_{17} = w_1 + \sigma_0(w_2) + w_{10} + \sigma_1(w_{15})
\]

• Repeat this procedure to calculate the words w16, ..., w63.
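The recurrence above can be written as a short portable C routine. This is a sketch, assuming that rot denotes a right rotation (as in the SHA-256 standard) and that message words are loaded in big-endian order; the helper names are illustrative.

```c
#include <stdint.h>

/* Right rotation of a 32-bit word, valid for 0 < r < 32. */
uint32_t rotr32(uint32_t x, unsigned r) { return (x >> r) | (x << (32 - r)); }

uint32_t sigma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
uint32_t sigma1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }

/* Expands one 64-byte block into the schedule words w[0..63]. */
void sha256_schedule(uint32_t w[64], const uint8_t block[64]) {
    for (int i = 0; i < 16; i++)           /* big-endian load of w0..w15 */
        w[i] = ((uint32_t)block[4 * i] << 24) | ((uint32_t)block[4 * i + 1] << 16) |
               ((uint32_t)block[4 * i + 2] << 8) | (uint32_t)block[4 * i + 3];
    for (int i = 16; i < 64; i++)          /* w16..w63 from the sliding window */
        w[i] = w[i - 16] + sigma0(w[i - 15]) + w[i - 7] + sigma1(w[i - 2]);
}
```

Each new word depends on the word produced two steps earlier, which is why the schedule cannot be fully vectorized within a single message.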

Update Phase 1: SHA-NI Implementation

Sequential implementation.

\[
\begin{aligned}
w_{16} &= w_0 + \sigma_0(w_1) + w_9 + \sigma_1(w_{14}) \\
w_{17} &= w_1 + \sigma_0(w_2) + w_{10} + \sigma_1(w_{15}) \\
w_{18} &= w_2 + \sigma_0(w_3) + w_{11} + \sigma_1(w_{16}) \\
w_{19} &= w_3 + \sigma_0(w_4) + w_{12} + \sigma_1(w_{17})
\end{aligned}
\]

SHA-NI implementation.

\[
\begin{bmatrix} w_{16} \\ w_{17} \\ w_{18} \\ w_{19} \end{bmatrix}
=
\underbrace{
  \underbrace{
    \underbrace{
      \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \end{bmatrix}
      + \sigma_0 \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ w_4 \end{bmatrix}
    }_{\text{SHA256MSG1}}
    +
    \underbrace{
      \begin{bmatrix} w_9 \\ w_{10} \\ w_{11} \\ w_{12} \end{bmatrix}
    }_{\text{PALIGNR}}
  }_{\text{PADD}}
  + \sigma_1 \begin{bmatrix} w_{14} \\ w_{15} \\ w_{16} \\ w_{17} \end{bmatrix}
}_{\text{SHA256MSG2}}
\]

SHA256MSG2 takes the register holding w12, ..., w15 as its second operand; w16 and w17 are produced inside the instruction and fed back into the last two lanes.
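To make the data flow of the four-word step concrete, the following portable C sketch models the two message-schedule instructions lane by lane and checks nothing against hardware. The type `vec4` and the functions `msg1`, `msg2`, and `next_four` are hypothetical stand-ins for an XMM register, SHA256MSG1, SHA256MSG2, and the combined step; real code would use the corresponding intrinsics instead.

```c
#include <stdint.h>

typedef struct { uint32_t v[4]; } vec4;    /* v[0] is the lowest lane */

uint32_t rotr32(uint32_t x, unsigned r) { return (x >> r) | (x << (32 - r)); }
uint32_t sigma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
uint32_t sigma1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }

/* Model of SHA256MSG1: lane i = a[i] + sigma0(a[i+1]); the fifth word comes from b[0]. */
vec4 msg1(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) {
        uint32_t next = (i < 3) ? a.v[i + 1] : b.v[0];
        r.v[i] = a.v[i] + sigma0(next);
    }
    return r;
}

/* Model of SHA256MSG2: adds sigma1 of the word two positions back.
 * Lanes 2 and 3 use the results just computed for lanes 0 and 1. */
vec4 msg2(vec4 y, vec4 w12_15) {
    vec4 r;
    r.v[0] = y.v[0] + sigma1(w12_15.v[2]); /* sigma1(w14) */
    r.v[1] = y.v[1] + sigma1(w12_15.v[3]); /* sigma1(w15) */
    r.v[2] = y.v[2] + sigma1(r.v[0]);      /* sigma1(w16) */
    r.v[3] = y.v[3] + sigma1(r.v[1]);      /* sigma1(w17) */
    return r;
}

/* Computes w16..w19 from w0..w15 following the MSG1, PALIGNR, PADD, MSG2 sequence. */
vec4 next_four(const uint32_t w[16]) {
    vec4 w0_3   = {{w[0], w[1], w[2], w[3]}};
    vec4 w4_7   = {{w[4], w[5], w[6], w[7]}};
    vec4 w9_12  = {{w[9], w[10], w[11], w[12]}};   /* what PALIGNR extracts */
    vec4 w12_15 = {{w[12], w[13], w[14], w[15]}};
    vec4 x = msg1(w0_3, w4_7);
    for (int i = 0; i < 4; i++)
        x.v[i] += w9_12.v[i];                      /* PADD */
    return msg2(x, w12_15);
}
```

The model produces the same four words as the sequential recurrence, which is a convenient way to validate an intrinsic-based implementation on a machine without SHA-NI.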

Update Phase 2: State Update

The state is split into eight 32-bit words and is processed for i from 0 to 63:

[Figure: one iteration of the state update, mapping (ai, bi, ci, di, ei, fi, gi, hi) to (ai+1, ..., hi+1) through the intermediate values T1i and T2i, with inputs ki and wi.]

Update Phase 2: Two Iterations

Some values are only moved, not modified, across two consecutive iterations:

\[
c_{i+2} = a_i, \qquad d_{i+2} = b_i, \qquad g_{i+2} = e_i, \qquad h_{i+2} = f_i.
\]

[Figure: two consecutive iterations of the state update, from (ai, ..., hi) to (ai+2, ..., hi+2), with inputs ki, wi and ki+1, wi+1.]

Update Phase 2: Four Iterations

By storing the state as follows

\[
A_i = \begin{bmatrix} a_i \\ b_i \\ e_i \\ f_i \end{bmatrix}, \qquad
C_i = \begin{bmatrix} c_i \\ d_i \\ g_i \\ h_i \end{bmatrix},
\]

two invocations of the SHA256RNDS2 instruction, each performing two iterations, calculate four iterations of the Update function:

\[
\begin{aligned}
C_{i+4} &= \text{SHA256RNDS2}(C_i, A_i, X) \\
A_{i+4} &= \text{SHA256RNDS2}(A_i, C_{i+4}, Y)
\end{aligned}
\]

where

\[
X = \begin{bmatrix} w_i + k_i \\ w_{i+1} + k_{i+1} \end{bmatrix}, \qquad
Y = \begin{bmatrix} w_{i+2} + k_{i+2} \\ w_{i+3} + k_{i+3} \end{bmatrix}.
\]
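The two-call pattern can be checked with a portable model of the instruction. The sketch below is an assumption-laden illustration: `quad`, `rnds2`, and `four_rounds` are hypothetical names, the lane order inside `quad` follows the A = (a, b, e, f) and C = (c, d, g, h) notation used here rather than the hardware register layout, and the round function is the standard SHA-256 one.

```c
#include <stdint.h>

typedef struct { uint32_t v[4]; } quad;    /* A = {a, b, e, f} or C = {c, d, g, h} */

uint32_t rotr32(uint32_t x, unsigned r) { return (x >> r) | (x << (32 - r)); }
uint32_t big_sigma0(uint32_t x) { return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22); }
uint32_t big_sigma1(uint32_t x) { return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25); }
uint32_t ch(uint32_t e, uint32_t f, uint32_t g) { return (e & f) ^ (~e & g); }
uint32_t maj(uint32_t a, uint32_t b, uint32_t c) { return (a & b) ^ (a & c) ^ (b & c); }

/* Model of SHA256RNDS2: runs two rounds and returns the new {a, b, e, f}.
 * The new {c, d, g, h} is simply the abef argument, so it needs no output. */
quad rnds2(quad cdgh, quad abef, const uint32_t wk[2]) {
    uint32_t a = abef.v[0], b = abef.v[1], e = abef.v[2], f = abef.v[3];
    uint32_t c = cdgh.v[0], d = cdgh.v[1], g = cdgh.v[2], h = cdgh.v[3];
    for (int j = 0; j < 2; j++) {
        uint32_t t1 = h + big_sigma1(e) + ch(e, f, g) + wk[j];
        uint32_t t2 = big_sigma0(a) + maj(a, b, c);
        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }
    return (quad){{a, b, e, f}};
}

/* Four rounds with two calls; wk[j] holds w + k for round i + j. */
void four_rounds(quad *A, quad *C, const uint32_t wk[4]) {
    quad C4 = rnds2(*C, *A, wk);           /* C_{i+4}, which equals A_{i+2} */
    quad A4 = rnds2(*A, C4, wk + 2);       /* A_{i+4}; *A plays the role of C_{i+2} */
    *A = A4;
    *C = C4;
}
```

No register shuffling is needed between the two calls because the words that are only moved across two rounds already sit in the right operand.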

Performance of SHA-256 using SHA-NI

The SHA-NI implementation is 4-5× faster than 64-bit implementations of SHA-256.

[Figure: two panels plotting running time (cycles per byte) and speedup against message size (1 byte to 1 MB) for sphlib (supercop), OpenSSL, and SHA-NI.]

Can we do better?

Multiple-Message Hashing

• Task: Hashing several messages of the same length.
• Parallel strategies:
   • Multi-core processing.
   • SIMD instructions, a.k.a. vectorization.
   • Pipelined instruction scheduling.
• Applications:
   • Hash-based signatures: XMSS and XMSS^MT.

SIMD Instructions

SIMD Vectorization of SHA-256

The SHA-256 algorithm operates on 32-bit words, so a vector register can hold the same word of several messages:
• Single message: 32-bit words.
• 4-way (SSE): 128-bit registers.
• 8-way (AVX2): 256-bit registers.

Kaby Lake and Zen support both the SSE and AVX2 vector instruction sets. So, what performance do these platforms deliver?

Performance of SIMD Multiple-Message Hashing

[Figure: performance (cycles per byte) against message size (256 bytes to 1 MB) on Kaby Lake and on Zen for single-message, 4-way (SSE), and 8-way (AVX2) hashing.]

• Single-message hashing performs similarly on both platforms, ∼9.67 cpb.
• The vector implementation is 2.35× faster using SSE, and 4.53× faster using AVX2.
• On Zen, AVX2 instructions have twice the latency they have on Intel's processors.

Pipelining SHA-NI

Instruction Pipelining

If no data dependencies occur, the execution of an instruction can be overlapped with other instructions of the same type.

[Figure: cycle timeline (0 to 18) of overlapped SHA256RNDS2 instructions, marking the latency of one instruction and the reciprocal throughput between consecutive ones.]

Instruction    Latency   RT
SHA256MSG1     2         0.5
SHA256MSG2     3         2
SHA256RNDS2    4         2

Latency and reciprocal throughput (RT) are given in clock cycles.
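The interplay between latency and reciprocal throughput can be illustrated with a deliberately idealized model. The function below is an assumption made for illustration, not a measurement or the authors' analysis: it considers a single instruction type, `streams` independent dependency chains (one per message), and no other instruction competing for the execution unit.

```c
/* Idealized average cost, in cycles, of one instruction when `streams`
 * independent chains of the same instruction are interleaved.
 * Issuing one instruction per stream takes streams * rthroughput cycles,
 * but each stream must also wait `latency` cycles for its own result. */
double cycles_per_instruction(double latency, double rthroughput, int streams) {
    double step = streams * rthroughput;
    if (step < latency)
        step = latency;
    return step / streams;
}
```

With the SHA256RNDS2 figures (latency 4, RT 2), the model gives 4 cycles per instruction for one message and 2 cycles for two or more, so interleaving beyond two messages cannot help this instruction further. The model ignores every other instruction in the loop, so it should not be read as a prediction of end-to-end speedup.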

Performance of SHA-NI Pipelined Implementation

Multiple-message hashing using a pipelined instruction scheduling of SHA-NI instructions.

[Figure: performance (cycles per byte) on Zen against message size (256 bytes to 1 MB) when hashing 1, 2, 4, and 8 messages.]

The performance of hashing is improved by 18% for two messages and by 21% for four messages.

Hash-Based Digital Signatures

Post-quantum hash-based digital signatures can be categorized into:
• Stateful algorithms
   • XMSS.
   • XMSS^MT.
  RFC 8391 [2] describes their usage (released five days ago).
• Stateless algorithms
   • SPHINCS.
   • SPHINCS+ (submitted to NIST's call for post-quantum algorithms).

All of these schemes rely on the computation of a hashing tree.

Hashing Tree Computation

In a hashing tree:
• Leaf nodes store public keys of a one-time signature (OTS).
• Every internal node contains the hash value of its children.
• Hashes at the same level are pairwise independent ⇒ a higher degree of parallelism.

For eight leaves p0, ..., p7:

\[
\begin{aligned}
pk &= H(r_0 \,\|\, r_1) \\
r_0 &= H(q_0 \,\|\, q_1), \quad r_1 = H(q_2 \,\|\, q_3) \\
q_0 &= H(p_0 \,\|\, p_1), \quad q_1 = H(p_2 \,\|\, p_3), \quad q_2 = H(p_4 \,\|\, p_5), \quad q_3 = H(p_6 \,\|\, p_7)
\end{aligned}
\]
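The level-by-level structure of the tree can be sketched as below. The function `toy_hash2` is a hypothetical, non-cryptographic stand-in for H(left ‖ right), used only so the example is self-contained; a real implementation would call SHA-256 and would batch the hashes of one level into a multiple-message routine.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for H(left || right). NOT a cryptographic hash. */
uint64_t toy_hash2(uint64_t left, uint64_t right) {
    uint64_t x = (left * 0x9e3779b97f4a7c15ULL) ^ (right + 0xbf58476d1ce4e5b9ULL);
    x ^= x >> 31;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 29;
    return x;
}

/* Reduces n leaves (n a power of two) to the root, one level at a time.
 * The hash computations within a level do not depend on each other;
 * the in-place update used here is only valid when done in this order. */
uint64_t tree_root(uint64_t *nodes, size_t n) {
    for (size_t width = n; width > 1; width /= 2)
        for (size_t j = 0; j < width / 2; j++)
            nodes[j] = toy_hash2(nodes[2 * j], nodes[2 * j + 1]);
    return nodes[0];
}
```

With eight leaves, the first level offers four independent hashes, the second two, and the last one, so the available parallelism halves at each level.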

XMSS Signature Generation (h = 20)

[Figure: signature generation time (10^6 cycles) on Kaby Lake and Zen for sphlib, SSE, AVX2, SHA-NI, and 4-SHA-NI.]

Kaby Lake
• Signatures are 2.35× and 4.53× faster using SSE and AVX2, respectively.

Zen
• Timings were improved using SSE, but not using AVX2.
• However, hashing with SHA-NI is slightly faster than using AVX2.

XMSS^MT Signature Generation (h = 60 and d = 6)

[Figure: signature generation time (10^6 cycles) on Kaby Lake and Zen for sphlib, SSE, AVX2, SHA-NI, and 4-SHA-NI. Caption: Performance of XMSS^MT signatures for h = 60 and d = 6.]

AES Modes of Operation

The AES-NI Instruction Set

In 2010, Intel released a set of instructions to perform the AES algorithm.

Encryption (Plaintext → Ciphertext):
• AddRoundKey.
• AESENC, Nr − 1 times: SubBytes, ShiftRows, MixColumns, AddRoundKey.
• AESENCLAST: SubBytes, ShiftRows, AddRoundKey.

Decryption (Ciphertext → Plaintext):
• AddRoundKey.
• AESDEC, Nr − 1 times: InvShiftRows, InvSubBytes, InvMixColumns, AddRoundKey.
• AESDECLAST: InvShiftRows, InvSubBytes, AddRoundKey.

Cipher Block Chaining (CBC)

[Figure: CBC encryption and decryption of four blocks P1, ..., P4.]

Encryption (sequential execution):

\[
C_i = E_k(P_i \oplus C_{i-1}), \qquad C_0 = IV.
\]

Decryption (parallel execution):

\[
P_i = D_k(C_i) \oplus C_{i-1}, \qquad C_0 = IV.
\]
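The difference between the two directions shows up directly in the loop structure. The sketch below uses a hypothetical invertible 64-bit toy cipher (`toy_enc` and `toy_dec`) in place of AES purely to keep the example self-contained; it offers no security and only illustrates the dependencies.

```c
#include <stdint.h>
#include <stddef.h>

/* Left rotation of a 64-bit word, valid for 0 < r < 64. */
uint64_t rotl64(uint64_t x, unsigned r) { return (x << r) | (x >> (64 - r)); }

/* Hypothetical toy block cipher standing in for E_k and D_k. NOT secure. */
uint64_t toy_enc(uint64_t k, uint64_t x) { return rotl64(x + k, 17) ^ k; }
uint64_t toy_dec(uint64_t k, uint64_t y) { return rotl64(y ^ k, 64 - 17) - k; }

/* CBC encryption: block i needs ciphertext i-1, so the loop is inherently serial. */
void cbc_encrypt(uint64_t *c, const uint64_t *p, size_t n, uint64_t k, uint64_t iv) {
    uint64_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        prev = toy_enc(k, p[i] ^ prev);
        c[i] = prev;
    }
}

/* CBC decryption: every iteration reads only ciphertext blocks, which are all
 * known up front, so the iterations are independent (p and c must not alias). */
void cbc_decrypt(uint64_t *p, const uint64_t *c, size_t n, uint64_t k, uint64_t iv) {
    for (size_t i = 0; i < n; i++)
        p[i] = toy_dec(k, c[i]) ^ (i == 0 ? iv : c[i - 1]);
}
```

Because the decryption iterations are independent, several block-cipher calls can be in flight at once, whereas encryption must wait for each call to finish before starting the next.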

Counter mode (CTR)

[Figure: CTR encryption and decryption of four blocks using the counters IV+1, ..., IV+4.]

Encryption:

\[
C_i = P_i \oplus E_k(IV + i)
\]

Decryption:

\[
P_i = C_i \oplus E_k(IV + i)
\]

Both encryption and decryption can be executed in parallel. Only the block cipher's encryption function is used.
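A minimal C sketch of counter mode follows. As an assumption for illustration, a hypothetical 64-bit toy function `toy_enc` replaces AES and the counter is a plain 64-bit addition; neither reflects a real deployment.

```c
#include <stdint.h>
#include <stddef.h>

/* Left rotation of a 64-bit word, valid for 0 < r < 64. */
uint64_t rotl64(uint64_t x, unsigned r) { return (x << r) | (x >> (64 - r)); }

/* Hypothetical toy block cipher standing in for E_k. NOT secure. */
uint64_t toy_enc(uint64_t k, uint64_t x) { return rotl64(x + k, 17) ^ k; }

/* CTR mode: block i (counting from 1) is XORed with E_k(IV + i).
 * The same routine encrypts and decrypts, and no iteration depends
 * on another, so all keystream blocks can be computed in parallel. */
void ctr_crypt(uint64_t *out, const uint64_t *in, size_t n, uint64_t k, uint64_t iv) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] ^ toy_enc(k, iv + i + 1);
}
```

Applying `ctr_crypt` twice with the same key and IV returns the original data, which is why the decryption function of the block cipher is never needed.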

Performance of AES-128-CBC Encryption

The performance is mainly determined by the latency of the AESENC instruction.

[Figure: cycle timeline (0 to 17) of a chain of dependent AESENC instructions, each starting only after the latency of the previous one has elapsed.]

µ-arch            Latency (cycles)   CBC-ENC (cycles per byte)
Intel Haswell     7                  4.49
Intel Skylake     4                  2.71
Intel Kaby Lake   4                  2.33
AMD Zen           4                  2.44
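As a rough cross-check of the latency argument (an illustrative estimate, not a figure from the slides): in CBC encryption the Nr = 10 rounds of each 16-byte block form a single dependent chain. Assuming every round costs about one AESENC latency ℓ, the cost per byte is approximately

\[
\text{cpb} \approx \frac{N_r \cdot \ell}{16}, \qquad
\ell = 7:\ \frac{70}{16} \approx 4.38, \qquad
\ell = 4:\ \frac{40}{16} = 2.5.
\]

These values land near the measured 4.49 cpb on Haswell and the 2.33 to 2.71 cpb on the processors with latency 4. The estimate ignores the initial XOR, memory accesses, and measurement conditions, so it is only indicative.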

AES Pipelined

The execution of an AESENC instruction can be overlapped with other instructions of the same type.

[Figure: cycle timeline (0 to 14) of w overlapped AESENC instructions, marking the latency of one instruction and the reciprocal throughput between consecutive ones.]

The processor's pipeline improves the performance of CTR mode and of CBC decryption.

AES Pipelined Execution Units on Zen

Zen extends the capabilities for data encryption by including a second AES execution unit.

[Figure: cycle timeline (0 to 14) of sixteen AESENC instructions split across two units. Unit 1 executes AESENC(m0), ..., AESENC(m3) followed by AESENC(m8), ..., AESENC(m11); Unit 2 executes AESENC(m4), ..., AESENC(m7) followed by AESENC(m12), ..., AESENC(m15).]

Performance of AES-128-CBC Decryption

[Figure: bar chart of performance (cycles per byte, 0.0 to 1.4) on Haswell, Skylake, Kaby Lake, and Zen for w = 1, 2, 4, and 8.]

µ-arch            Latency (cycles)   CBC-ENC (cpb)   CBC-DEC (cpb)
Intel Haswell     7                  4.49            0.63
Intel Skylake     4                  2.71            0.62
Intel Kaby Lake   4                  2.33            0.53
AMD Zen           4                  2.44            0.37

Performance of AES-128-CTR Mode

[Figure: bar chart of performance (cycles per byte, 0.0 to 1.4) on Haswell, Skylake, Kaby Lake, and Zen for w = 1, 2, 4, and 8.]

µ-arch            Latency (cycles)   CBC-ENC (cpb)   CBC-DEC (cpb)   CTR (cpb)
Intel Haswell     7                  4.49            0.63            0.74
Intel Skylake     4                  2.71            0.62            0.62
Intel Kaby Lake   4                  2.33            0.53            0.53
AMD Zen           4                  2.44            0.37            0.39

Final Remarks

Performance of SHA-256.
• Improvement of 4× with respect to the sequential implementation.

Performance of SHA-256 (multiple-message hashing).
• An additional improvement of 21% using pipelining.
• Vectorization achieves a similar performance on Intel's processors, but not on Zen.

XMSS and XMSS^MT signature schemes.
• 4.30× and 4.64× speedup for signature generation.

AES Modes of Operation.
• Zen provides two AES units ⇒ faster CTR and CBC decryption.
• CBC encryption can run faster in the multiple-message encryption scenario.

Forthcoming Instruction Sets

[Figure: timeline of ISA extensions and associated micro-architectures, oldest first.]
• SSE2: x64
• CRC32: Nehalem
• AES-NI & CLMUL: Westmere
• MULX & AVX2: Haswell
• ADX & RDSEED: Kaby Lake
• SHA-NI: Zen
• AVX-512: Skylake X
• VAES & GFMUL: Ice Lake

Next Steps

AVX-512
• A 16-way parallelization of SHA-256 can be performed using AVX-512.
• The number of instructions is reduced ⇒ better performance.

Other platforms
• ARMv8 contains the Scalable Vector Extension (SVE).
• Vector registers of variable size (multiples of 128 bits).

Other algorithms
• SPHINCS+ for hash-based signatures.
• Deoxys and COLM for authenticated encryption.

Thanks for your attention.

References

[1] Sean Gulley, Vinodh Gopal, Kirk Yap, Wajdi Feghali, Jim Guilford, and Gil Wolrich. Intel® SHA Extensions: New Instructions Supporting the Secure Hash Algorithm on Intel® Architecture Processors. Technical report, Intel Corporation, July 2013.

[2] Andreas Huelsing, Denis Butin, Stefan-Lukas Gazdag, Joost Rijneveld, and Aziz Mohaisen. XMSS: eXtended Merkle Signature Scheme. RFC 8391, May 2018.

The SHA256MSG1 Instruction

The SHA256MSG1 instruction performs the following operation:

xi = σ0(wi+1) + wi , for 0 ≤ i < 4.

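[Figure: the words w0, ..., w7 are supplied in two source registers (labelled xmm0 and xmm1); σ0 is applied to w1, ..., w4 and the results are added lane by lane to w0, ..., w3, giving x3 x2 x1 x0 in the destination register (labelled xmm2).]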

The SHA256MSG2 Instruction

The SHA256MSG2 instruction performs the following operation:

wi+16 = σ1(wi+14) + yi , for 0 ≤ i < 4.

[Figure: the source registers (labelled xmm0 and xmm1) hold y3 y2 y1 y0 and w15 w14 w13 w12; σ1 is applied to w14 and w15, and then to the freshly computed w16 and w17, and the results are added lane by lane to y0, ..., y3, giving w19 w18 w17 w16 in the destination register (labelled xmm2).]

How Is Performance Measured?

• Elapsed time cannot be compared directly between different computers; instead, clock cycles are measured.
• Use the RDTSC instruction to read the processor's Time-Stamp Counter.

    #include <stdint.h>

    /* Reads the 64-bit Time-Stamp Counter (x86, GCC/Clang inline assembly). */
    uint64_t get_cycles(void) {
        uint32_t lo, hi;
        /* RDTSC returns the low half in EAX and the high half in EDX. */
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

• To reduce certain sources of randomness during measurements, it is recommended to turn off technologies such as Turbo Boost or Hyper-Threading.

AES-128 Encryption

Encrypting a 128-bit block (stored in xmm15) using the key schedule (stored in xmm0-xmm10). Nr = 10.

    MOVDQA     xmm15, [rsi]     ; Load message block
    PXOR       xmm15, xmm0      ; AddRoundKey
    AESENC     xmm15, xmm1      ; Round 1
    AESENC     xmm15, xmm2      ; Round 2
    AESENC     xmm15, xmm3      ; Round 3
    AESENC     xmm15, xmm4      ; Round 4
    AESENC     xmm15, xmm5      ; Round 5
    AESENC     xmm15, xmm6      ; Round 6
    AESENC     xmm15, xmm7      ; Round 7
    AESENC     xmm15, xmm8      ; Round 8
    AESENC     xmm15, xmm9      ; Round 9
    AESENCLAST xmm15, xmm10     ; Round 10
    MOVDQA     [rdi], xmm15     ; Store cipher block

Analogously, for decryption use AESDEC and AESDECLAST, and invert the key schedule using AESIMC.
