SIMD Instruction Set Extensions for KECCAK with Applications to SHA-3, Keyak and Ketje

Hemendra K. Rawat and Patrick Schaumont
Virginia Tech, Blacksburg, USA
{hrawat, schaum}@vt.edu

Motivation

- Secure systems and protocols rely on a suite of cryptographic applications
  - Hashing, MAC, encryption, PRNG, AEAD, etc.
- Traditionally, different algorithms: SHA-1, SHA-2, AES, GMAC, HMAC
- Different instruction set extensions: AES-NI, SHA and carry-less multiplication
  - Vector extensions in Intel, ARM, SPARC and PowerPC
- KECCAK is the SHA-3
  - Successor of SHA-2
- Subset of the cryptographic primitive family KECCAK SPONGE
- Not just a hash
- A flexible sponge for all symmetric cryptography

“A flexible SIMD instruction set for the flexible KECCAK sponge”

Outline

- KECCAK background
- Novelty claims
- Design principles
- Proposed instruction set extensions
- Results
  - Performance
  - Hardware overhead
- Conclusion

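KECCAK HASH (SHA-3)

[Figure: Hashing using the KECCAK sponge construction. The cryptographic state consists of a rate part (r bits) and a capacity part (c bits), both initialized to 0. The variable-length input is padded and absorbed in r-bit blocks, with the permutation f applied between blocks (absorbing phase). The output digest is then read out in r-bit blocks, again with f applied between blocks (squeezing phase).]

KECCAK-f in a Nutshell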

b = (r + c) bits, where b = 1600, 800, 400, 200, 100, 50 or 25 bits

KECCAK-f[b] permutation on state[0 : b] (pseudocode):

    def keccak_f(state, max_rounds):
        for round_index in range(max_rounds):
            theta(state)              # add column parities
            rho(state)                # rotate lanes
            pi(state)                 # transpose (rearrange) lanes
            chi(state)                # add non-linearity
            iota(state, round_index)  # L00 = L00 XOR K_round

[Figure: the state as a 5 x 5 array of lanes Lxy (x = 0..4, y = 0..4) extending along the z axis, with a lane, a plane and a slice highlighted.]

KECCAK Applications

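[Figure: four constructions built on the permutation f, each starting from a state with rate r and capacity c initialized to 0:
(a) Message in, Digest out
(b) Key and Message in, MAC out
(c) Key and IV in, Keystream out
(d) Key and Message in, Keystream and MAC out]

Constructions with sponge: (a) hash, (b) message authentication code, (c) keystream generation. Construction with duplex: (d) authenticated encryption.

Application Stack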

Application layer:           Hash, MAC, PRNG, AEAD
Cryptographic construction:  Sponge, Duplex
API / primitives:            KECCAK-f[1600], KECCAK-p[800,12], KECCAK-f[400], KECCAK-f[200]
Custom instructions:
  KECCAK-f[1600]:    rl1x, kxorr64, chi1, chi2
  KECCAK-p[800,12]:  rl1x, xorr, chi1, chi3
  KECCAK-f[400]:     rl1x, xorr, chi1, chi3
  KECCAK-f[200]:     rl1x, xorr, chi1, chi3

Novelty Claims

Previous work
- Custom-instruction designs for a 16-bit microcontroller for SHA-3 finalists, Constantin et al.
- Integration of a 64-bit KECCAK (SHA-3) data path into a 32-bit LEON3, Wang et al.

This work
- Six new custom instructions for a 128-bit SIMD unit.
- Multipurpose KECCAK: 1600, 800, 400 and 200 bits.
- Five KECCAK algorithms: the SHA3-512 hash and the LakeKEYAK, RiverKEYAK, KetjeSR and KetjeJR authenticated ciphers.
- Compatible with the ARM NEON instruction set.

Design Principles

Portability
- Easy integration into different processor architectures
- RISC-like instruction format (2 input operands, 1 output operand)
- No non-standard architectural features

Flexibility
- Support for multiple symmetric cryptographic applications
  - Hashing
  - MACing
  - Stream ciphers
  - AEAD
  - PRNG

Simplicity
- Small set of instructions
- Low hardware overhead
- Simple operations like XOR, AND, NOT and rotations
- Short critical path

Performance
- Optimize the computation-intensive part: the KECCAK-{f,p} primitives
- Partition the dataflow graph into SIMD-like instruction patterns
  - 2 x 128-bit inputs and one 128-bit output
- Minimize the schedule length of the graph
- Optimize instruction shapes and functionality for
  - High ILP (instruction-level parallelism)
  - Minimum register spills
  - Minimum MOV/VEXT for register rearrangements
  - In-place round computation

Mapping KECCAK to Instructions

NEON register naming convention
- Qi = quad word (128 bit)
- Di = double word (64 bit)
- Si = single word (32 bit)

NEON instruction specifier
- VADD.I32 q1, q2, q3 specifies that q1, q2 and q3 each hold 4 x 32-bit integers

Instruction rl1x
- rl1x.u64 d2, d0, d1
- rl1x.u32 s2, s0, s1
- rl1x.u16 s2, s0, s1
- rl1x.u8 s2, s0, s1

[Figure: data dependencies of the θ-effect. The lanes of each column of the state are XORed into the column parities C0..C4. Each θ-effect value D0..D4 is the XOR of one column parity with another column parity rotated left by 1 (ROL 1). Instruction rl1x takes the inputs d0/s0 and d1/s1, rotates one of them left by 1 and XORs the two into d2/s2.]

KECCAK Custom Instructions

Instruction  Step             Description            Target primitive     Syntax
rl1x         theta            Rotate left 1 and XOR  1600, 800, 400, 200  rl1x.u64 d2, d0, d1
                                                                          rl1x.u32 s2, s0, s1
                                                                          rl1x.u16 s2, s0, s1
                                                                          rl1x.u8 s2, s0, s1
kxorr64      theta, rho & pi  XOR, rotate & assign   1600                 kxorr64 d2, d0, d1, #i
xorr         theta, rho & pi  XOR, rotate & assign   800, 400, 200        xorr.u32 s2, s0, s1, #i
                                                                          xorr.u16 s2, s0, s1, #i
                                                                          xorr.u8 s2, s0, s1, #i
chi1         chi              chi step               1600, 800, 400, 200  chi1.u32 q2, q0, q1
                                                                          chi1.u64 q2, q0, q1
chi2         chi              chi step (last lane)   1600                 chi2.u64 d4, q0, q1
chi3         chi              chi step (last lane)   800, 400, 200        chi3.u32 s4, d0, d1

Implementation & Validation

- Reference implementations
  - KeccakCodePackage: http://keccak.noekeon.org/files.html
- Hand-optimized custom-instruction implementations
  - KECCAK-{f,p}[1600, 800, 400, 200]
- Cross-compiled GCC with KECCAK instruction support
- Timing simple CPU model in GEM5
- Instructions described in GEM5's ISA description language
- Simulation
  - Single-core ARM CPU @ 1 GHz
  - 32 KB L1 I and D caches
  - L2 cache of 2 MB

[Figure: tool flow. The crypto applications (SHA3, KetjeJR, KetjeSR, LakeKeyak, RiverKeyak) and the KECCAK instructions pass through the GNU compiler framework (compiler, assembler, linker) to produce ARM native binaries, which run on the GEM5 architectural simulator. GEM5 comprises a front end, CPU models (Atomic Simple, Timing Simple, InOrder, O3) and the ISA (decoder, TLB, faults, interrupts), extended with the KECCAK instructions.]
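When checking hand-written kernels against a reference implementation, a lane-level software model of the custom instructions can be handy. The following is a minimal sketch in Python, not the authors' GEM5 description. The function names mirror the instruction mnemonics, but the operand order, the treatment of `#i` as a plain rotation amount, and the modelling of a single lane (or one plane for chi) rather than a full 128-bit register are all assumptions made for illustration.

```python
def rol(value, amount, width):
    """Rotate a width-bit lane left by amount bits."""
    amount %= width
    mask = (1 << width) - 1
    return ((value << amount) | (value >> (width - amount))) & mask


def rl1x(a, b, width=64):
    """Assumed rl1x behaviour: a XOR (b rotated left by 1)."""
    return a ^ rol(b, 1, width)


def xorr(a, b, amount, width=64):
    """Assumed kxorr64/xorr behaviour: (a XOR b) rotated left by amount."""
    return rol(a ^ b, amount, width)


def chi_plane(plane, width=64):
    """chi on one plane of five lanes: out[x] = a[x] ^ (~a[x+1] & a[x+2])."""
    mask = (1 << width) - 1
    return [
        plane[x] ^ (~plane[(x + 1) % 5] & plane[(x + 2) % 5] & mask)
        for x in range(5)
    ]


def theta_effect(parities, width=64):
    """theta-effect D[x] = C[x-1] ^ ROL(C[x+1], 1), one rl1x per column."""
    return [
        rl1x(parities[(x - 1) % 5], parities[(x + 1) % 5], width)
        for x in range(5)
    ]


if __name__ == "__main__":
    # 8-bit lanes, as in KECCAK-f[200], keep the numbers small.
    print(theta_effect([1, 0, 0, 0, 0], width=8))
    print(chi_plane([0, 0, 0xFF, 0, 0], width=8))
```

The `width` argument plays the role of the `.u64`, `.u32`, `.u16` and `.u8` suffixes in the instruction syntax.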

Results: Performance

[Figure: performance in instructions/byte for various KECCAK modes. Bar chart (y axis: instructions/byte, 0 to 350) comparing optimized C, 32-bit ASM, NEON and this work (CI) for SHA3 (c=1024) (hash); LakeKeyak and RiverKeyak encryption (E) and decryption (D) (AEAD, encryption + MAC); and KetjeSR and KetjeJR (E) and (D) (lightweight AEAD). The bars are annotated with the underlying primitives KECCAK-f[1600], KECCAK-f[800,nr], KECCAK-f[400,nr] and KECCAK-p[200,nr].]

Performance gain between 1.4x and 2.6x over hand-optimized assembly on ARMv7.

Results: Hardware Cost

Gate equivalent (GE) estimates with UMC 90 nm: 4658 GE in total

[Figure: share of the 4658 GE per instruction: rl1x 310, kxorr64 1869, xorr 1221, chi1 894, chi2 242, chi3 122.]

[Figure: KECCAK instructions compared with the Cortex-A9 core: 0.1% overhead.]

Conclusion

- Analysis of the instruction set design space for KECCAK primitives
- Six custom instructions based on the NEON instruction set in ARMv7
- Five different KECCAK applications: SHA3, LakeKEYAK, RiverKEYAK, KetjeSR and KetjeJR
- Performance gain between 1.4x and 2.6x over hand-optimized assembly on ARMv7, at a hardware overhead of just 4658 GE
- Portability aspects (Intel AVX, generic 64/32-bit architectures) are covered in the paper

Thank you

Questions

Appendix

KECCAK (SHA-3)

[Figure: a hash function H maps a message M of arbitrary length to a hash value of fixed length.]

Instance         Definition
SHA3-224(M)      Keccak[448](M||01, 224)
SHA3-256(M)      Keccak[512](M||01, 256)
SHA3-384(M)      Keccak[768](M||01, 384)
SHA3-512(M)      Keccak[1024](M||01, 512)
SHAKE128(M, d)   Keccak[256](M||1111, d)
SHAKE256(M, d)   Keccak[512](M||1111, d)
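The instance definitions can be read as parameter choices for one sponge: the bracketed number is the capacity c, so the rate is r = 1600 - c, and the appended bits are a domain-separation suffix. As a small illustration, the sketch below tabulates those parameters and uses Python's standard `hashlib` module to confirm the output lengths. The `INSTANCES` list and the `describe` helper are hypothetical names introduced here, not part of the slides.

```python
import hashlib

STATE_BITS = 1600  # width b of KECCAK-f[1600]

# (hashlib name, capacity c in bits, suffix bits, fixed output bits or None)
INSTANCES = [
    ("sha3_224", 448, "01", 224),
    ("sha3_256", 512, "01", 256),
    ("sha3_384", 768, "01", 384),
    ("sha3_512", 1024, "01", 512),
    ("shake_128", 256, "1111", None),
    ("shake_256", 512, "1111", None),
]


def describe(message=b"abc", xof_bytes=32):
    """Return (name, rate, capacity, suffix, output bits) for each instance."""
    rows = []
    for name, capacity, suffix, out_bits in INSTANCES:
        rate = STATE_BITS - capacity
        hasher = hashlib.new(name, message)
        # SHAKE instances are extendable-output: the caller picks the length d.
        digest = hasher.digest() if out_bits else hasher.digest(xof_bytes)
        rows.append((name, rate, capacity, suffix, len(digest) * 8))
    return rows


if __name__ == "__main__":
    for row in describe():
        print(row)
```

For the four fixed-length instances the capacity is twice the digest length, which is why SHA3-512 has the smallest rate (576 bits) of the set.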

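Keccak

[Figure not recoverable from the extracted text. Source: http://keccak.noekeon.org/]

KECCAK-f Permutation (2)

[Figure: the five steps of a round: θ step, ρ step, π step, χ step and ι step (L00 ⊕ K_round).]

R = ι ∘ χ ∘ π ∘ ρ ∘ θ (θ is applied first, ι last)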
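To make the round function concrete, here is a minimal Python sketch of KECCAK-f operating on a 5 x 5 array of lanes, written along the lines of the publicly available compact reference code rather than the optimized kernels discussed in this talk. The names `rol`, `keccak_f` and `sha3_256` are chosen for this example. The `width` parameter is meant to cover the smaller permutations as well, but only the 64-bit case is cross-checked here, by building SHA3-256 on top and comparing with `hashlib`.

```python
import hashlib


def rol(value, amount, width):
    """Rotate a width-bit lane left by amount bits."""
    amount %= width
    return ((value << amount) | (value >> (width - amount))) & ((1 << width) - 1)


def keccak_f(lanes, width=64):
    """Apply KECCAK-f[25 * width] in place to lanes[x][y] (5 x 5 integers)."""
    log2_width = width.bit_length() - 1
    mask = (1 << width) - 1
    lfsr = 1  # generates the bits of the iota round constants
    for _ in range(12 + 2 * log2_width):
        # theta: add column parities
        parity = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
                  for x in range(5)]
        effect = [parity[(x - 1) % 5] ^ rol(parity[(x + 1) % 5], 1, width)
                  for x in range(5)]
        for x in range(5):
            for y in range(5):
                lanes[x][y] ^= effect[x]
        # rho and pi: rotate each lane and move it to its new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol(current, (t + 1) * (t + 2) // 2, width)
        # chi: the non-linear step, applied to each plane
        for y in range(5):
            plane = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = plane[x] ^ (~plane[(x + 1) % 5] & plane[(x + 2) % 5] & mask)
        # iota: XOR the round constant into lane L00
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2 and j <= log2_width:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def sha3_256(message):
    """SHA3-256 = Keccak[512](M || 01, 256) on top of keccak_f, 64-bit lanes."""
    rate_bytes = (1600 - 512) // 8
    # Suffix bits 01 plus the first padding bit give the byte 0x06.
    padded = bytearray(message) + b"\x06"
    padded += b"\x00" * (-len(padded) % rate_bytes)
    padded[-1] |= 0x80  # final bit of the pad10*1 rule
    lanes = [[0] * 5 for _ in range(5)]
    for offset in range(0, len(padded), rate_bytes):
        block = padded[offset:offset + rate_bytes]
        for i in range(rate_bytes // 8):
            lanes[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        keccak_f(lanes)
    return b"".join(lanes[i % 5][i // 5].to_bytes(8, "little") for i in range(4))


if __name__ == "__main__":
    print(sha3_256(b"abc") == hashlib.sha3_256(b"abc").digest())
```

The loop body follows the order θ, ρ, π, χ, ι, which is the order in which the composition R is applied to the state.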

Figure source: http://keccak.noekeon.org/

Proposed Instruction Set Extensions

Instruction kxorr64/xorr
- kxorr64 d2, d0, d1, #i
- xorr.u32 s2, s0, s1, #i
- xorr.u16 s2, s0, s1, #i
- xorr.u8 s2, s0, s1, #i

[Figure: data dependencies of θ, ρ, π, χ, ι (one plane). In the θ and π step, the lanes L00, L11, L22, L33, L44 are XORed with the θ-effect values D0..D4 to give the intermediate values B0..B4. The ρ step rotates these to give the intermediate values P0..P4, and the χ step then produces the plane L00, L10, L20, L30, L40. The instruction XORs d0/s0 with d1/s1 and rotates the result left (ROL, selected by i) into d2/s2.]

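Instructions chi1, chi2 and chi3
- chi1.u32 q2, q0, q1
- chi1.u64 q2, q0, q1
- chi2.u64 d4, q0, q1
- chi3.u32 s4, d0, d1

[Figure: data dependencies of the χ step (one plane). (a) chi1 takes q0 and q1 and applies NOT, AND and XOR across the lanes to produce q2, in both the .u32 and .u64 variants. (b) chi2 takes q0 and q1 and applies NOT, AND and XOR to produce the single lane d4. (c) chi3 takes d0 and d1 and applies NOT, AND and XOR to produce the single lane s4. Source: http://keccak.noekeon.org/]

Portability Aspects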

- Feasibility depends on the instruction format, the register width and the instruction encoding width.

Table 2: Feasibility of the proposed instructions on different platforms

Instruction   Intel AVX   64-bit arch   32-bit arch
rl1x          yes         yes*          yes*
kxorr64       yes         yes           no
xorr          yes         yes           yes
chi1          yes         no            no
chi2          yes         no            no
chi3          yes         yes           no

* Not all variants supported

Results: Hardware Cost

Table 1: Area, gate equivalent (GE) and transistor count estimates for the proposed custom instructions

Instruction   Area (µm²)   Gate equivalent   Transistor equivalent
rl1x          1238         310               1238
kxorr64       7474         1869              7474
xorr          4884         1221              4884
chi1          3576         894               3576
chi2          966          242               966
chi3          486          122               486
Total         18624        4658              18624
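The total in Table 1 and the overhead figure quoted on this slide can be re-derived with a few lines of arithmetic. The sketch below copies the gate-equivalent column from the table and uses the 3.8 million gate figure given for a typical Cortex-A9. Treating gate equivalents and that gate count as directly comparable is an assumption of this example.

```python
# Gate-equivalent estimates per instruction, copied from Table 1.
GATE_EQUIVALENTS = {
    "rl1x": 310,
    "kxorr64": 1869,
    "xorr": 1221,
    "chi1": 894,
    "chi2": 242,
    "chi3": 122,
}

# Typical Cortex-A9 CPU size as quoted on the slide (3.8 million gates).
CORTEX_A9_GATES = 3_800_000

total_ge = sum(GATE_EQUIVALENTS.values())
overhead_percent = 100 * total_ge / CORTEX_A9_GATES

# Share of the added logic contributed by each instruction, in percent.
shares = {name: 100 * ge / total_ge for name, ge in GATE_EQUIVALENTS.items()}

if __name__ == "__main__":
    print(f"total: {total_ge} GE, overhead: {overhead_percent:.2f}%")
    for name, share in shares.items():
        print(f"{name:8s} {share:5.1f}%")
```

The ratio comes out slightly above 0.12%, which the slide rounds to 0.1%.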

- Typical Cortex-A9 CPU: 3.8 million gates
- KECCAK instructions have 0.1% overhead

Motivation

Cryptographic instruction sets
- Intel AES-NI, SHA-1/SHA-2 and carry-less multiplication extensions
- ARMv8 Crypto instructions

Processor architectures
- ARM (soft IP)
- RISC-V (open architecture from UC Berkeley)

Symmetric cryptography applications
- Hash, MAC, encryption/decryption, AEAD, PRNG

Universal sponge construction
- Hash: SHA-3 (winner of the NIST SHA-3 competition)
- AEAD: LakeKeyak, Ketje
- PRNG, stream ciphers, etc.