
Session: Card-based Protocol, Implementation, and Authentication for IoT APKC’18, June 4, 2018, Incheon, Republic of Korea

SoK: A Performance Evaluation of Cryptographic Instruction Sets on Modern Architectures

Armando Faz-Hernández, Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil ([email protected])
Julio López, Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil ([email protected])
Ana Karina D. S. de Oliveira, Federal University of Mato Grosso do Sul (FACOM-UFMS), Campo Grande, MS, Brazil ([email protected])

ABSTRACT
The latest processors have included extensions to the instruction set architecture tailored to speed up the execution of cryptographic algorithms. Like the AES New Instructions (AES-NI), which target the AES encryption algorithm, the release of the SHA New Instructions (SHA-NI), designed to support the SHA-256 hash function, introduces a new scenario for optimizing cryptographic software. In this work, we present a performance evaluation of several cryptographic algorithms, hash-based signatures and data encryption, on platforms that support AES-NI and/or SHA-NI. In particular, we revisited several optimization techniques targeting multiple-message hashing, and as a result, we reduce by 21% the running time of this task by means of a pipelined SHA-NI implementation. In public-key cryptography, multiple-message hashing is one of the critical operations of the XMSS and XMSS^MT post-quantum hash-based digital signatures. Using SHA-NI extensions, signatures are computed 4× faster; however, our pipelined SHA-NI implementation increases this speedup factor to 4.3×. For symmetric cryptography, we revisited the implementation of AES modes of operation and reduced by 12% and 7% the running time of CBC decryption and CTR encryption, respectively.

CCS CONCEPTS
• Security and privacy → Digital signatures; Hash functions and message authentication codes; • Computer systems organization → Pipeline computing; Single instruction, multiple data.

KEYWORDS
AES-NI; SHA-NI; Vector Instructions; Hash-based Digital Signatures; Data Encryption

ACM Reference Format:
Armando Faz-Hernández, Julio López, and Ana Karina D. S. de Oliveira. 2018. SoK: A Performance Evaluation of Cryptographic Instruction Sets on Modern Architectures. In APKC'18: 5th ACM ASIA Public-Key Cryptography Workshop, June 4, 2018, Incheon, Republic of Korea. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3197507.3197511

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
APKC'18, June 4, 2018, Incheon, Republic of Korea
© 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery.
ACM ISBN 978-1-4503-5756-2/18/06...$15.00
https://doi.org/10.1145/3197507.3197511

1 INTRODUCTION
The omnipresence of cryptographic services has influenced modern processor designs. One proof of that is the support given to the Advanced Encryption Standard (AES) [29] by means of extensions to the instruction set architecture known as the AES New Instructions (AES-NI) [15]. Other examples in the same vein are the CRC32 instructions, which aid in error-detection codes, and the carry-less multiplier (CLMUL), which is used to accelerate the AES-GCM authenticated encryption algorithm [16]. All of these extensions enhance the performance of cryptographic implementations.

The latest processors have added instructions that support cryptographic hash functions. In 2013, Intel [21] announced the specification of the SHA New Instructions (SHA-NI), a set of new instructions for speeding up the execution of the SHA1 and SHA2 hash functions [31]. However, SHA-NI-ready processors became available only several years later. In 2016, Intel released a micro-architecture that targets low power-consumption devices; and in 2017, AMD introduced Zen, a micro-architecture that supports both AES-NI and SHA-NI.

Single Instruction Multiple Data (SIMD) parallel processing is an optimization strategy increasingly used to develop efficient software implementations. The Advanced Vector eXtensions (AVX2) [26] are SIMD instructions that operate over 256-bit vector registers. The latest generations of Intel processors, such as Haswell, Broadwell, Skylake, and Kaby Lake, have support for AVX2; meanwhile, Zen is the first AMD architecture that implements AVX2.

The main contribution of this work is to provide a performance benchmark report of cryptographic algorithms accelerated through SHA-NI and AES-NI. We focus on mainstream processors, like the ones found in laptops and desktops, and also those used in cloud computing environments. To that end, we revisited known optimization techniques targeting the capabilities (and limitations) of four micro-architectures: Haswell (an Intel Core i7-4770 processor), Skylake (an Intel Core i7-6700K processor), Kaby Lake (an Intel Core i5-7400 processor), and Zen (an AMD Ryzen 7 1800X processor).

As part of our contributions, we improve the performance of multiple-message hashing, which refers to the task of hashing several messages of the same length. We followed two approaches. First, we developed a pipelined implementation using SHA-NI that saves around 18-21% of the running time. Second, we review SIMD vectorization techniques and report timings for the parallel hashing of four and eight messages.

In public-key cryptography, multiple-message hashing serves as a building block in the construction of hash-based digital signatures, such as XMSS [8] and XMSS^MT [24]. These signature schemes are strong candidates to be part of a new cryptographic portfolio of quantum-resistant algorithms [23].



For this reason, we evaluate the impact on the performance of these signature schemes carried by optimizations of multiple-message hashing.

Finally, we revisited software implementation techniques for the AES modes of operation. As a result, we show optimizations for Zen that reduce the running time of AES-CBC decryption and AES-CTR encryption by 12% and 9%, respectively. As a side result, we also improve the implementation of AEGIS, an AES-based algorithm classified in the final round of the authenticated-encryption competition CAESAR [4], reducing its running time by 11%.

All the source code derived from this research work is released under a permissive software license and can be retrieved from http://github.com/armfazh/FLO-shani-aesni.

2 THE SHA-256 HASH FUNCTION
In this section, we briefly describe the SHA-256 algorithm and its implementation considering several application scenarios. Following the SHA specification [31], Appendix A contains explicit definitions of some auxiliary functions and parameters of the SHA-256 algorithm.

The SHA-256 algorithm keeps track of a 256-bit state. This state is initialized with constant values and is denoted by S0 (see Equation (18) in Appendix A). The input message M is padded and then split into n 512-bit blocks. Each of these blocks is used to generate a new state according to the following recurrence: S_j = Update(S_{j-1}, M_j) for j = 1 to n. Thus, the SHA-256 algorithm defines the hash of M as SHA-256(M) = S_n. The Update function consists of two phases, the message schedule phase and the update state phase, both presented in Algorithm 1.

Algorithm 1 Definition of the Update SHA-256 function.
Domain Parameters: The functions σ0, σ1, Σ0, Σ1, Ch, Maj, and the values k0, ..., k63 are defined in Appendix A.
Input: S is the current state, and M is a 512-bit message block.
Output: S = Update(S, M).
Message Schedule Phase
 1: (w0, ..., w15) ← M            // Split block into sixteen 32-bit words.
 2: for t ← 16 to 63 do
 3:     w_t ← σ0(w_{t-15}) + σ1(w_{t-2}) + w_{t-7} + w_{t-16}
 4: end for
Update State Phase
 5: (a, b, c, d, e, f, g, h) ← S  // Split state into eight 32-bit words.
 6: for i ← 0 to 63 do
 7:     t1 ← h + Σ1(e) + Ch(e, f, g) + k_i + w_i
 8:     t2 ← Σ0(a) + Maj(a, b, c)
 9:     h ← g, g ← f, f ← e, e ← d + t1
10:     d ← c, c ← b, b ← a, a ← t1 + t2
11: end for
12: (a', b', c', d', e', f', g', h') ← S   // Split the input state into eight 32-bit words.
13: return S ← (a + a', b + b', c + c', d + d', e + e', f + f', g + g', h + h')

2.1 Implementing SHA-256 using SHA-NI
The SHA-NI instruction set offers native support for the most critical operations of the SHA1 and SHA-256 hash functions. In particular, the instructions SHA256MSG1, SHA256MSG2, and SHA256RNDS2 aid in the calculation of the Update function of the SHA-256 algorithm.

In this section, we describe the implementation of the Update function using the SHA-NI set (see Algorithm 2). First of all, the message block (the values w0, ..., w15) must be stored into four 128-bit vector registers W0, W4, W8, and W12, such that Wi = [w_i, w_{i+1}, w_{i+2}, w_{i+3}]. After that, the SHA256MSG1 and SHA256MSG2 instructions help in the message schedule phase for computing the values w16, ..., w63.

Algorithm 2 The Update SHA-256 function using the SHA-NI set.
Input: S is the current state, and M is a 512-bit message block.
Output: S = Update(S, M).
Message Schedule Phase
 1: Load M into 128-bit vector registers: W0, W4, W8, and W12.
 2: for i ← 0 to 11 do
 3:     X ← SHA256MSG1(W_{4i}, W_{4i+4})
 4:     W_{4i+9} ← PALIGNR(W_{4i+12}, W_{4i+8}, 4)
 5:     Y ← PADDD(X, W_{4i+9})
 6:     W_{4i+16} ← SHA256MSG2(Y, W_{4i+12})
 7: end for
Update State Phase
 8: (a, b, c, d, e, f, g, h) ← S, A ← [a, b, e, f], and C ← [c, d, g, h].
 9: for i ← 0 to 15 do
10:     X ← PADDD(W_{4i}, K_{4i})
11:     Y ← PSRLDQ(X, 8)
12:     C ← SHA256RNDS2(C, A, X)
13:     A ← SHA256RNDS2(A, C, Y)
14: end for
15: (a, b, c, d, e, f, g, h) ← S, A' ← [a, b, e, f], and C' ← [c, d, g, h]
16: [a, b, e, f] ← PADDD(A, A')
17: [c, d, g, h] ← PADDD(C, C')
18: return S ← (a, b, c, d, e, f, g, h)

To have a better understanding of the design rationale of the SHA-NI extensions, we exemplify how to calculate W16, i.e., the first iteration of the for-loop in lines 2-7 of Algorithm 2. First, the SHA256MSG1 instruction updates the vector register X with four 32-bit words calculated as follows:

    X ← SHA256MSG1(W0, W4)
      = SHA256MSG1([w0, w1, w2, w3], [w4, w5, w6, w7])
      = [σ0(w1) + w0, σ0(w2) + w1, σ0(w3) + w2, σ0(w4) + w3].        (1)

In the next line, the PALIGNR instruction obtains W9 from a 32-bit shift applied to the concatenation of W12 and W8:

    W9 ← PALIGNR(W12, W8, 4)
       = PALIGNR([w12, w13, w14, w15], [w8, w9, w10, w11], 4)
       = [w9, w10, w11, w12].                                        (2)

After that, the vector Y stores the word-wise addition of the vector registers X and W9:

    Y ← PADDD(X, W9)
      = [σ0(w1) + w0 + w9, σ0(w2) + w1 + w10,
         σ0(w3) + w2 + w11, σ0(w4) + w3 + w12].                      (3)
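As a concrete baseline for Algorithm 1 and the padding rule described above, the following is a short Python model of scalar SHA-256; it is an illustrative sketch checked against Python's hashlib, not the optimized code benchmarked in this paper.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF  # all additions are modulo 2^32

def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK

# Auxiliary functions of the SHA-256 specification (cf. Appendix A).
def sigma0(x): return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
def sigma1(x): return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)
def Sigma0(x): return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)
def Sigma1(x): return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)
def Ch(e, f, g): return ((e & f) ^ (~e & g)) & MASK
def Maj(a, b, c): return (a & b) ^ (a & c) ^ (b & c)

# Round constants k0..k63 and initial state S0 (Equations (18)-(19), Appendix A).
K = [0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
     0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
     0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
     0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
     0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
     0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
     0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
     0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2]
S0 = (0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19)

def update(S, block):
    """Algorithm 1: message schedule phase, then update state phase."""
    w = list(struct.unpack(">16L", block))
    for t in range(16, 64):
        w.append((sigma0(w[t - 15]) + sigma1(w[t - 2]) + w[t - 7] + w[t - 16]) & MASK)
    a, b, c, d, e, f, g, h = S
    for i in range(64):
        t1 = (h + Sigma1(e) + Ch(e, f, g) + K[i] + w[i]) & MASK
        t2 = (Sigma0(a) + Maj(a, b, c)) & MASK
        h, g, f, e = g, f, e, (d + t1) & MASK
        d, c, b, a = c, b, a, (t1 + t2) & MASK
    return tuple((x + y) & MASK for x, y in zip(S, (a, b, c, d, e, f, g, h)))

def sha256(msg):
    """Pad, then fold Update over 512-bit blocks: S_j = Update(S_{j-1}, M_j)."""
    msg = bytes(msg)
    padded = msg + b"\x80" + b"\x00" * ((55 - len(msg)) % 64) + struct.pack(">Q", 8 * len(msg))
    S = S0
    for i in range(0, len(padded), 64):
        S = update(S, padded[i:i + 64])
    return b"".join(struct.pack(">L", x) for x in S)

assert sha256(b"abc") == hashlib.sha256(b"abc").digest()
```

The final assertion cross-checks the model against the library implementation; the same `update` function is the unit that SHA-NI replaces with the three instructions discussed next.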



Finally, the SHA256MSG2 instruction produces W16 from Y and W12:

    W16 ← SHA256MSG2(Y, W12) = [w16, w17, w18, w19]
        = [σ0(w1) + σ1(w14) + w0 + w9,
           σ0(w2) + σ1(w15) + w1 + w10,
           σ0(w3) + σ1(w16) + w2 + w11,
           σ0(w4) + σ1(w17) + w3 + w12].                             (4)

The remaining values W20, ..., W60 are calculated applying the same strategy. The calculation of W_{4i}, for i = 4, ..., 15, depends on W_{4(i-1)}, W_{4(i-2)}, W_{4(i-3)}, and W_{4(i-4)}; the latter register can be used to store the new value, allowing the vector register to be reused.

In the update state phase, the SHA256RNDS2 instruction assumes that the state, say S = (a, b, c, d, e, f, g, h), is stored into two 128-bit vector registers as follows: A = [a, b, e, f] and C = [c, d, g, h]. The reasoning behind this representation relies on the following observation: after processing two iterations of the second for-loop of Algorithm 2, some words of the state remain unmodified. Hence, let A_i and C_i be the values of the state at the i-th iteration of the for-loop (lines 9-14 of Algorithm 2); then C_{i+2} = A_i for i ≥ 0. Figure 1 illustrates this property.

Figure 1: Every two consecutive iterations of the update state phase, it holds that the values (c, d, g, h) at the (i+2)-th iteration are exactly the values (a, b, e, f) of the i-th iteration. The values (a, b, e, f) of the (i+2)-th iteration are calculated using the SHA256RNDS2 instruction.

Based on this property, the SHA256RNDS2 instruction performs two iterations of the update state phase as follows:

    C_{i+2} = A_i,   A_{i+2} = SHA256RNDS2(C_i, A_i, X),             (5)

where X is a vector register containing [w_i + k_i, w_{i+1} + k_{i+1}, ∅, ∅], k_i and k_{i+1} are constant values defined in Equation (19) (see Appendix A), and ∅ is an unused value. Four iterations can be computed applying once again the same property; thus we have

    C_{i+4} = A_{i+2},   A_{i+4} = SHA256RNDS2(C_{i+2}, A_{i+2}, Y),  (6)

which is equivalent to

    C_{i+4} = SHA256RNDS2(C_i, A_i, X),   and
    A_{i+4} = SHA256RNDS2(A_i, C_{i+4}, Y),                           (7)

such that Y is a register containing [w_{i+2} + k_{i+2}, w_{i+3} + k_{i+3}, ∅, ∅]. The registers X and Y can be computed as

    X ← PADDD(K_{4i}, W_{4i})
      = [w_i + k_i, w_{i+1} + k_{i+1}, w_{i+2} + k_{i+2}, w_{i+3} + k_{i+3}],
    Y ← PSRLDQ(X, 8)
      = [w_{i+2} + k_{i+2}, w_{i+3} + k_{i+3}, 0, 0],                 (8)

where K_{4i} = [k_{4i}, k_{4i+1}, k_{4i+2}, k_{4i+3}]. As Algorithm 2 shows, this execution pattern is repeated sixteen times to perform the 64 iterations of the update state phase.

2.2 Performance Impact of SHA-NI
In this section, we contrast the performance of several implementations of SHA-256: the sphlib library¹, the OpenSSL library², and the implementation using SHA-NI instructions. In the benchmark program, we measured the number of clock cycles taken to hash a message; from these measurements, we calculate the cycles-per-byte (cpb) ratio, which is conventionally used as a metric of performance. Figure 2 shows the performance behavior of the three implementations on the Zen processor.

Figure 2: Performance of SHA-256 on the Zen processor. The top plot shows the cycles-per-byte taken for hashing a message; the bottom plot shows the speedup factor yielded by using the SHA-NI set.

2.2.1 Analysis. For messages larger than 256 bytes, the sphlib library takes around 9.6 cpb, whereas the OpenSSL library offers a better performance, taking 7.7 cpb, as shown on the top of Fig. 2. On the other hand, the SHA-NI implementation takes 1.8 cpb; this is approximately 5.1× faster than sphlib and 4.2× faster than OpenSSL. The speedup factor yielded by using SHA-NI is plotted on the bottom of Fig. 2. We want to remark that for short-length messages the improvement in performance is also significant, since it achieves a 3.0× speedup factor. These performance numbers show the relevance of including specialized instruction sets for supporting cryptographic algorithms.

¹ The sphlib library by Thomas Pornin was taken from SUPERCOP [6].
² We measured the 64-bit implementation of SHA-256 provided by OpenSSL; more details about the optimizations of such implementation can be found at [17, 20].

3 MULTIPLE-MESSAGE HASHING
This task is also known in the literature as multi-buffer hashing, simultaneous hashing, or n-way hashing [1, 13, 17].
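As a quick consistency check between the scalar recurrence of Algorithm 1 and the vector formulation of Equations (1)-(4), the following Python sketch emulates the three message-schedule instructions on lists of 32-bit words; it is an illustrative model of the lane arithmetic described in Section 2.1, not vendor-verified behavior of the opcodes.

```python
MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sigma0(x): return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
def sigma1(x): return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)

w = [(0x01234567 * (i + 1)) & MASK for i in range(16)]  # arbitrary message words

# Reference: the scalar recurrence of Algorithm 1 for w16..w19.
ref = list(w)
for t in range(16, 20):
    ref.append((sigma0(ref[t - 15]) + sigma1(ref[t - 2]) + ref[t - 7] + ref[t - 16]) & MASK)

# Equation (1): X <- SHA256MSG1(W0, W4) = [sigma0(w1)+w0, ..., sigma0(w4)+w3].
X = [(sigma0(w[i + 1]) + w[i]) & MASK for i in range(4)]
# Equation (2): W9 <- PALIGNR(W12, W8, 4) = [w9, w10, w11, w12].
W9 = w[9:13]
# Equation (3): Y <- PADDD(X, W9), a lane-wise 32-bit addition.
Y = [(x + y) & MASK for x, y in zip(X, W9)]
# Equation (4): W16 <- SHA256MSG2(Y, W12); the last two lanes reuse
# the freshly computed w16 and w17.
w16 = (Y[0] + sigma1(w[14])) & MASK
w17 = (Y[1] + sigma1(w[15])) & MASK
w18 = (Y[2] + sigma1(w16)) & MASK
w19 = (Y[3] + sigma1(w17)) & MASK

assert [w16, w17, w18, w19] == ref[16:20]
```

The assertion confirms that one pass of lines 3-6 of Algorithm 2 yields exactly the four words that four iterations of the scalar schedule would produce.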


This workload can be efficiently performed using parallel computing techniques, such as multi-core processing, SIMD vectorization, and pipelined instruction scheduling. In this section, we focus on the latter two approaches, aiming to improve the performance of SHA-256 in the multiple-message scenario.

3.1 SIMD Vectorization of SHA-256
Vectorization refers to the support given by the processor to the parallel computing paradigm known as Single Instruction Multiple Data (SIMD). Since 2000, commodity processors have included extensions to the instruction set architecture enabling vectorization. One of the first instruction sets is the Streaming SIMD Extensions (SSE) [25], a vector unit that operates over 128-bit registers; thus, it is possible to compute four 32-bit operations simultaneously. Later, SSE was improved by the Advanced Vector eXtensions (AVX/AVX2) [26], an instruction set that operates over a bank of sixteen 256-bit registers. More recently, Intel launched the Core-X processors, which support the AVX512 [27] instruction set and contain a bank of thirty-two 512-bit registers.

To hash n messages of the same length using vector instructions, one can modify Algorithm 1 as follows. First, replace each 32-bit word by a vector register; thus, a set of eight vector registers A, ..., H will represent the state, where A = [a_1, ..., a_n] is a vector register containing the word a of each message, and the same rationale applies to the rest of the state. Since now all variables are vectors, all the binary (scalar) operations must be replaced by the corresponding vector instructions, which execute simultaneous operations in each lane of a vector register. At the end of the algorithm, the digest of the i-th message is recovered by concatenating the i-th lane of each vector register A, ..., H. Refer to Appendix B for a vectorized implementation of Algorithm 1.

In the literature, there exist several works that use SIMD instructions for implementing multiple-message hashing. For instance, Gueron [17] showed the n-SMS technique, which parallelizes the message schedule phase, leading to faster single-message hashing. Following the work of Aciicmez [1], Gueron and Krasnov [18] reported an implementation that uses the SSE unit to compute four SHA-256 digests simultaneously; as a result, their implementation runs 2.2× faster than the n-SMS single-message implementation of Gueron [17]. They also extended the parallelization to 256-bit registers; although AVX2 was not available at the time their work was published, their estimations accurately matched the performance exhibited by the AVX2-ready processors available nowadays. Recently, Intel [12] contributed to the OpenSSL project optimized code³ that computes either four digests using SSE or eight digests using AVX2 instructions.

We want to determine the impact on the running time when vectorization is applied to the multiple-message hashing scenario. To do that, we conducted a performance benchmark of the vectorized implementation provided by the OpenSSL library. Figure 3 shows the result of our measurements.

Figure 3: Performance (cycles-per-byte) of multiple-message hashing on Kaby Lake (top) and on Zen (bottom) for one message (64-bit code), four messages (SSE), and eight messages (AVX2). The baseline single-message implementation is the sphlib library [6].

3.1.1 Analysis. On Kaby Lake, computing four hashes simultaneously is 2.35× faster than four consecutive invocations of the single-message implementation. Moreover, the AVX2 implementation is 4.5× faster than sphlib for computing eight hashes in a row. The use of vector instructions on Kaby Lake shows a noticeable improvement in the performance of multiple-message hashing. However, the story is different on Zen.

The graphs in Figure 3 show that Zen offers a performance similar to that of Kaby Lake for single-message hashing. The SSE implementation, which computes four hashes simultaneously, renders a better performance on Kaby Lake. On Zen, the AVX2 implementation shows the same performance as the SSE implementation; i.e., no benefits were observed from running AVX2 code on the Zen micro-architecture.

This performance downgrade on Zen occurs because the latency of the AVX2 instructions is twice that of the SSE instructions. The micro-architectural design of Zen emulates a 256-bit vector instruction by splitting the workload into two parts and then executing them sequentially in a 128-bit vector unit. Therefore, the expected performance of AVX2 code is reduced by half on Zen.

3.2 Pipelining SHA-NI Instructions
This section presents optimization strategies that rely on efficient instruction scheduling to leverage the capabilities of the processor's pipeline. First, we briefly describe how the processor's pipeline operates; then, we detail the pipelined implementation of the SHA-256 algorithm.

In former computer architectures, before the introduction of the Reduced Instruction Set Computer (RISC) processor design, instructions did not start their execution until the previous instructions were executed completely. In such a design, the processor stalls whenever it executes high-latency instructions. To remedy this issue, RISC introduced the pipelined execution model, a hardware optimization for increasing the throughput of instruction execution. In this model, once an instruction has been issued, it is processed by several (and simpler) stages until it is completely executed.

³ The OpenSSL library (v1.0.2) contains the sha256_multi_block function, which is located in the file http://github.com/openssl/openssl/blob/OpenSSL_1_0_2-stable/crypto/sha/asm/sha256-mb-x86_64.pl.
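The gain from overlapped execution can be captured by a toy timing model (not a cycle-accurate simulator): if an execution unit accepts a new instruction every r cycles (the reciprocal throughput) and each instruction retires l cycles after issue (the latency), then n independent instructions complete in r(n-1)+l cycles instead of the n*l cycles of fully serial execution.

```python
def completion_time(n, latency, recip_throughput):
    """Cycles for one pipelined execution unit to finish n independent
    instructions: a new instruction is issued every recip_throughput
    cycles, and the last one retires latency cycles after issue."""
    return 0 if n == 0 else recip_throughput * (n - 1) + latency

# With latency 4 and reciprocal throughput 2 (the figures measured for
# SHA256RNDS2 on Zen), two independent instructions finish in 6 cycles
# instead of the 8 cycles a fully serial execution would need.
assert completion_time(2, latency=4, recip_throughput=2) == 6
assert completion_time(4, latency=4, recip_throughput=2) == 10  # vs. 16 serial
```

The model only applies when the instructions carry no data dependencies, which is exactly the condition analyzed for SHA-NI in the remainder of this section.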



Figure 4: Pipelined execution of SHA256RNDS2 instructions. (The original plot shows eight SHA256RNDS2 instructions issued over cycles 0-17, annotated with their latency and reciprocal throughput.)

Figure 5: Performance of SHA-256 multiple-message hashing measured on Zen for 1, 2, 4, and 8 messages. The single-message implementation refers to the SHA-NI implementation (cf. Section 2.1).

Inside the pipeline, the execution of instructions is overlapped; this means that at a given time, the execution of stage t of an instruction happens at the same time as the execution of stage t + 1 of the previous instruction, and also at the same time as the execution of stage t - 1 of the next instruction.

In addition to the latency, there exist other metrics to determine the performance of instructions on a RISC architecture. According to Fog's definitions [11], the throughput of an instruction is "the maximum number of instructions of the same kind that are executed per clock cycle", and the reciprocal throughput refers to "the number of cycles to wait until an execution unit starts processing an instruction of the same type". There is a lack of information about these metrics in the Zen documentation; however, accurate timings can be found using experimental measurements. The following table shows the timings measured by Fog for the SHA-NI instructions on Zen:

                        SHA256MSG1   SHA256MSG2   SHA256RNDS2
    Latency                  2            3             4
    Recip. Throughput       0.5           2             2

In Figure 4, we show a timeline of the pipelined execution of a series of SHA256RNDS2 instructions. The SHA256RNDS2 instruction takes four clock cycles to be completed; however, its reciprocal throughput is only two cycles. This means that once a SHA256RNDS2 instruction is issued to the execution unit, a second SHA256RNDS2 instruction can be issued two clock cycles after the first one. Therefore, by issuing two SHA256RNDS2 instructions consecutively, the processor takes only six clock cycles to compute them instead of eight.

Observe that to achieve a pipelined execution, these instructions must have no data dependencies; otherwise, the processor waits until the data dependency is resolved. A dependency analysis of Algorithm 2 shows that, in the message schedule phase, the SHA256MSG2 instruction takes as input the value produced by the SHA256MSG1 instruction. Another data dependency is found in the update state phase, where the two SHA256RNDS2 instructions depend on each other. Hence, these data dependencies limit the use of the pipelining technique for single-message hashing. However, the hashing of multiple messages is a suitable scenario for leveraging the capabilities of the processor's pipeline.

The central idea of our pipelined implementation resembles the work of Gueron [14], who implemented the AES-CTR encryption algorithm using AES-NI instructions. In the case of SHA-256, every two clock cycles a SHA256RNDS2 instruction is issued to the pipeline, such that each instruction operates over the state of a different message; thus, consecutive instructions have no data dependencies at all, and as a consequence, their execution can be pipelined. Let k be the number of messages to be hashed; we developed pipelined implementations of SHA-256 multiple-message hashing for k ∈ {2, 4, 8}. Appendix B shows the pipelined implementation of Algorithm 2 for k = 2. The results of the performance benchmark are shown in Figure 5.

3.2.1 Analysis. To hash two messages, it is more convenient to use the pipelined implementation (k = 2), which is 18% faster than performing two consecutive hashes using the single-message SHA-NI implementation. Similarly, for k = 4, a 21% reduction of the running time is obtained. However, note that for k = 8 the performance degrades, which can be explained because hashing eight messages requires a larger space to store all the states, causing vector registers to be spilled to memory more often. These results show that a significant improvement for multiple-message hashing is achieved using an efficient instruction scheduling of SHA-NI instructions.

4 HASH-BASED DIGITAL SIGNATURES
We present a performance analysis of two hash-based signature schemes that are currently considered works in progress towards standardization by the Internet Engineering Task Force (IETF) [23].

The first scheme is the eXtended Merkle Signature Scheme (XMSS) [8], which uses a Merkle tree (a binary hash-tree) in which the leaves store a public key of a one-time signature scheme, such as WOTS+ [22], and every internal node stores the hash of the values of its child nodes. Thus, the XMSS public key is the root of the Merkle tree. Let h be the height of the tree; then XMSS can sign at most 2^h messages using 2^h different key pairs, where the one-time private keys are derived from a secret seed using a pseudo-random number generator.

The second scheme is the Multi-Tree XMSS (XMSS^MT) [24], an extension of XMSS that provides a larger number of signatures. Assuming that d | h, XMSS^MT can generate at most 2^h signatures using a d-height hypertree such that each of its nodes is an (h/d)-height XMSS tree.
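As an illustration of why Merkle-tree operations benefit from multiple-message hashing, the following Python sketch (a generic binary hash-tree, not the full XMSS construction, which additionally mixes in bitmasks and hash addresses) computes a tree root level by level; every node within a level can be hashed independently of the others.

```python
import hashlib

def merkle_root(leaves):
    """Compute the root of a binary hash-tree over 2^h leaves.
    All nodes of the same height are mutually independent, so each
    `level` below is exactly a multiple-message hashing workload."""
    n = len(leaves)
    assert n > 0 and n & (n - 1) == 0, "number of leaves must be a power of two"
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        # Pair up adjacent nodes; each pair is hashed independently.
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# A toy tree of height h = 3 (eight leaves standing in for one-time public keys).
root = merkle_root([bytes([i]) * 32 for i in range(8)])
assert len(root) == 32
```

Since every list comprehension over `level` is embarrassingly parallel, it maps directly onto the 4-way/8-way vectorized and pipelined SHA-256 implementations discussed in Section 3.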



4.1 Implementation of XMSS and XMSS^MT
In both schemes, the calculation of Merkle trees is the performance-critical operation, since a large number of hash calculations is required. Note that the nodes at the same height can calculate their hash values without dependencies among them; hence, this is a suitable scenario for applying the optimized implementations of multiple-message hashing presented above.

To implement the signature schemes, we use as building blocks the following software implementations of the SHA-256 algorithm:
• The sphlib [6] implementation that uses 64-bit instructions.
• The sequential SHA-NI implementation from Section 2.1.
• Vectorized implementations using 128- and 256-bit instructions from [9].
• The pipelined SHA-NI implementation from Section 3.2.

Optimal parameters for these schemes are provided in the IETF draft [23]. For example, to achieve a 128-bit post-quantum security level, it is recommended to use a hash function with an output of n = 32 bytes, such as SHA-256 or SHAKE [32]. For XMSS, it is suggested to set h ∈ {10, 16, 20}; and for XMSS^MT, to set (h, d) ∈ {(20, 2), (20, 4), (40, 2), (40, 4), (40, 8), (60, 3), (60, 6), (60, 12)}. From these parameters, we selected h = 20 for XMSS and h = 60, d = 6 for XMSS^MT, since they allow the largest number of signatures. We report the performance measurements in Table 1.

Table 1: Performance comparison of hash-based signature schemes on Zen. For each signature operation, the first column lists the latency of the operation and the second column shows the acceleration factor (AF) with respect to the 64-bit sphlib implementation.

(a) XMSS for h = 20.
                          Key Gen.          Sign              Verify
  Impl.    Parallel     10^12 cc   AF     10^6 cc   AF      10^6 cc   AF
  sphlib   No              4.50  1.00×      21.56  1.00×      2.16  1.00×
  SSE      4-way           2.60  1.72×      12.76  1.68×      1.78  1.21×
  AVX2     8-way           3.81  1.18×      19.56  1.10×      3.65  0.59×
  SHA-NI   No              1.12  4.01×       5.39  3.99×      0.61  3.51×
  SHA-NI   4-Pipelined     1.01  4.46×       5.00  4.30×      0.74  2.89×

(b) XMSS^MT for h = 60 and d = 6.
                          Key Gen.          Sign              Verify
  Impl.    Parallel      10^9 cc   AF     10^6 cc   AF      10^6 cc   AF
  sphlib   No             51.63  1.00×      46.35  1.00×     27.97  1.00×
  SSE      4-way          27.44  1.88×      25.27  1.83×     18.94  1.48×
  AVX2     8-way          40.11  1.29×      38.31  1.21×     36.81  0.76×
  SHA-NI   No             11.87  4.35×      10.69  4.34×      6.45  4.33×
  SHA-NI   4-Pipelined    10.78  4.79×       9.99  4.64×      7.64  3.66×

4.1.1 Analysis of the Zen Platform. Taking the sphlib implementation as a baseline, it can be noted that multiple-message hashing has a higher impact on key generation and the signing operation and a lesser impact on the verification procedure. Note that the running time of key generation and the signing operation can be reduced by almost half using SSE vector instructions, whereas the AVX2 implementation renders a poor performance on Zen; this was already expected from the analysis presented in Section 3.1.1.

It can be observed that for XMSS the SHA-NI implementation yields a speedup factor between 3.5× and 4× in contrast to the 64-bit sphlib implementation. Moreover, extra savings were achieved by using the pipelined SHA-NI implementation, which increased the speedup factor to 4.3× and 4.6× for XMSS and XMSS^MT, respectively, improving key generation by 10% and the signature operation by 7%.

Figure 6: The faster the implementation of SHA-256, the better the performance achieved for XMSS signature generation. (The plot compares the running time, in 10^6 cycles, of XMSS signature generation (h = 20) on Zen and Kaby Lake for the sphlib, SSE, AVX2, SHA-NI, and 4-SHA-NI implementations.)

4.1.2 Analysis of the Kaby Lake Processor. To have a better panorama of the performance of hash-based signatures, we reproduce the experiments using Kaby Lake. Recall that although Kaby Lake does not support the SHA-NI instructions, it contains a faster vector unit. Figure 6 shows the running time to generate an XMSS signature (h = 20) measured on Zen and Kaby Lake.

First of all, it can be observed that Zen delivers better performance for signing using the sequential sphlib implementation. However, the vectorized implementations offer a significant acceleration when running on Kaby Lake, but not on Zen. For example, by using AVX2, an XMSS signature can be computed 4× faster than with the sequential implementation. On the other hand, Zen renders a similar performance using the sequential SHA-NI implementation; however, the pipelined SHA-NI implementation (from Section 3.2) offers the fastest timings. These results show the relevance of revisiting optimization techniques for the SHA-NI set.

5 IMPLEMENTATION OF THE AES
The Advanced Encryption Standard (AES) is a family of block ciphers standardized by NIST [29] that encrypts a 128-bit message using a secret key k to produce a 128-bit ciphertext. The algorithm keeps track of a 128-bit state initially containing the message input. Then, the state is updated by the application of the functions AddRoundKey, SubBytes, ShiftRows, and MixColumns to finally produce the ciphertext. The AES algorithm admits three key sizes: 128, 192, and 256 bits; this size defines the number of transformation rounds applied to the state. A description of the AES encryption algorithm is shown in Algorithm 3.

In 2010, Intel released the AES-NI set [15], which contains six instructions dedicated to calculating parts of the AES algorithm. In particular, the AESENC and AESENCLAST instructions encapsulate the operations performed in each round. Thus, Algorithm 3 can be implemented by replacing lines 4-7 with an AESENC instruction and lines 9-11 with an AESENCLAST instruction.


Algorithm 3 The AES block cipher encryption algorithm.
Input: M is a 128-bit message, and k is a secret key such that (|k|, nr) ∈ {(128, 10), (192, 12), (256, 14)}.
Output: C is a 128-bit ciphertext such that C = AES_k(M).
 1: (K_0, ..., K_nr) ← KeyExpansion(k)
 2: S ← AddRoundKey(M, K_0)
 3: for i ← 1 to nr − 1 do
 4:    S ← SubBytes(S)
 5:    S ← ShiftRows(S)
 6:    S ← MixColumns(S)
 7:    S ← AddRoundKey(S, K_i)        ▷ lines 4-7 ≡ AESENC(S, K_i)
 8: end for
 9: S ← SubBytes(S)
10: S ← ShiftRows(S)
11: C ← AddRoundKey(S, K_nr)          ▷ lines 9-11 ≡ AESENCLAST(S, K_nr)
12: return C

The resultant AES-NI implementation renders a faster performance since a large part of its execution is performed by hardware.

A mode of operation is an algorithm to encrypt arbitrarily large messages using a block cipher. The most used modes of operation for AES are the Cipher Block Chaining (CBC) mode and the Counter (CTR) mode [30]. In the CBC mode, encryption is a sequential process, whereas decryption exhibits a large degree of parallelism. Likewise, the CTR mode is an embarrassingly parallel workload.

The Zen micro-architecture offers a new capability for improving the execution of the AES algorithm: it includes two AES execution units, one more than Intel processors. This fact motivated us to revisit implementation techniques of the CBC and CTR modes and to show some optimizations that take advantage of these units.

5.1 Multiple-Message Encryption

To determine the impact on performance due to the inclusion of a second AES-NI unit, we benchmark multiple-message encryption using the AES-128-CBC mode. In this task, each message is independent of the others; hence, no data dependencies occur at all. Figure 7 shows our measurements.

Figure 7: The table entries represent the cpb taken for encrypting k messages of 1 MB using the AES-128-CBC mode.

    k    Haswell    Skylake    Kaby Lake    Zen
    1      4.49       2.71        2.33      2.44
    2      2.25       1.37        1.17      1.24
    4      1.13       0.69        0.60      0.63
    6      0.81       0.65        0.56      0.43
    8      0.76       0.63        0.54      0.36

5.1.1 Analysis. For encrypting one message, the fastest timing was obtained by Kaby Lake (2.33 cpb), whereas Zen is slightly slower, taking 2.44 cpb. On the other hand, as the number of messages increases, Zen executes AESENC instructions more efficiently than the other processors; for instance, Zen is 42% faster than Skylake when encrypting k = 8 messages. However, although Zen has two units, Kaby Lake narrows that advantage to 33%.

5.2 Pipelining CBC Decryption and CTR Modes

It is well known that the running time of these modes of operation can be accelerated through proper usage of the processor's pipeline. Gueron [14] showed that encrypting a set of w > 1 consecutive blocks leads to a better performance. His idea relies on calculating the first round of each block, i.e., a series of w AESENC instructions is issued; after that, the second round of each block is processed; and this strategy is repeated until the last round. Since each AESENC instruction operates on a different state, no data dependencies occur, and the processor's pipeline is able to execute them overlapped.

A detailed analysis of pipelined implementations was presented by Bogdanov et al. [7]. They refer to w as the parameter that expresses the parallelism degree of the instruction, and its optimal value is given as the product of the latency and the throughput of the instruction. Thus, according to the measurements obtained by Fog [11], the optimal values for the AESENC instruction are as follows:

                              Haswell    Skylake    Kaby Lake    Zen
    Latency                      7          4           4         4
    Reciprocal throughput        1          1           1        0.5
    Parallelism degree (w)       7          4           4         8

Like Kaby Lake, Zen executes the AESENC instruction in four clock cycles; however, since Zen has two AES units, two independent AESENC instructions can start their execution at the same time, which explains the (fractional) reciprocal throughput of 0.5 cycles. See in Figure 8 the pipelined execution on two units.

Figure 8: Instruction scheduling of AESENC instructions on Zen. Four AESENC instructions can be issued to each unit. [Diagram: over cycles 0-19, Unit 1 executes AESENC(m0)-AESENC(m3) and AESENC(m8)-AESENC(m11), while Unit 2 executes AESENC(m4)-AESENC(m7) and AESENC(m12)-AESENC(m15).]

Note that the optimal value of w is composed of the optimal value of each unit; thus, w_Zen = number of units × latency × throughput-per-unit = 8. In Figure 9, we report measurements of several pipelined implementations of AES-CBC decryption and AES-CTR encryption.


5.2.1 Analysis. Both Skylake and Kaby Lake obtain their fastest timing using w = 2 for CBC decryption; however, the timings do not improve for larger values of w. Similarly, the CTR encryption reaches its fastest timing using w = 4. On these two platforms, no benefits were observed by encrypting w = 8 consecutive blocks. However, Haswell saves additional clock cycles setting w = 8. For example, the running time of CBC decryption decreases from 0.72 to 0.63 cpb (12.5% faster), and the running time of CTR encryption reduces from 0.82 to 0.74 cpb (9.75% faster).

On the other hand, significant savings were observed for Zen. The w = 4 pipelined implementation is twice as fast as the sequential implementation for both modes of operation, leading to 0.42 cpb for both CBC decryption and CTR encryption. Moreover, we observed extra savings by using the w = 8 pipelined implementation, which reduces the running time of CBC decryption to 0.37 cpb (11.9% faster) and the running time of CTR encryption to 0.39 cpb (7.1% faster).

Figure 9: Performance of AES-128 encryption of a 1 MB message using CBC decryption and CTR encryption on Haswell, Skylake, Kaby Lake, and Zen. The w parameter indicates the number of independent AESENC instructions issued to the pipeline (sequential, w = 2, w = 4, and w = 8); the running time is given in cycles-per-byte.

5.3 Improving the Implementation of AEGIS

AEGIS is an authenticated encryption algorithm proposed by Wu and Preneel [35] that is currently classified in the final round of the CAESAR competition [4]. This cipher uses the AESENC instruction as a building block to update an 80-byte state. Thus, given a 128-bit message block m_i and the state S_i = {a_i, b_i, c_i, d_i, e_i}, the Update function is defined as S_{i+1} = Update(S_i, m_i) such that:

    a_{i+1} = AESENC(e_i, a_i ⊕ m_i),    b_{i+1} = AESENC(a_i, b_i),
    c_{i+1} = AESENC(b_i, c_i),          d_{i+1} = AESENC(c_i, d_i),    (9)
    e_{i+1} = AESENC(d_i, e_i).

As can be noted, the five calls to the AESENC instruction are pairwise independent, which allows benefiting from the processor's pipeline. Based on the AEGIS implementation available in SUPERCOP [6], we modified it by grouping blocks of four AESENC instructions corresponding to the updates of the b, c, d, and e blocks. The invocation of the last AESENC instruction is slightly modified: instead of passing a_i ⊕ m_i as input, we pass a memory address that points to the message block m_i; thus, the a block is computed as

    a_{i+1} = AESENC(e_i, a_i ⊕ m_i) = AESENC(e_i, m_i) ⊕ a_i.    (10)

This equivalence works because the AddRoundKey function is an XOR between the state and the round key. This subtle change accelerates the execution since the latency of the memory access is covered by the latency of AESENC. The original implementation takes 0.34 cpb on Zen, whereas our optimization leads to 0.30 cpb, which is 11% faster.
6 FINAL REMARKS

Part of this study determined the performance improvements carried by the use of the SHA-NI, the AES-NI, and the SIMD instruction sets available on modern computer architectures.

Regarding the SHA-NI extensions, we looked for optimizations that allow efficient use of these instructions. Consequently, we revisited known software optimization techniques, such as vectorization and pipelining, to provide faster implementations of both public-key and symmetric-key cryptographic algorithms.

In the case of public-key cryptography, we optimized the execution of multiple-message hashing, since it is a critical operation of hash-based digital signatures. We evaluated the performance of XMSS and XMSS^MT using several optimization variants. For example, a pipelined SHA-NI implementation yields a speedup factor of 4.3x in the calculation of XMSS signatures and 4.6x in the calculation of XMSS^MT signatures with respect to an optimized 64-bit implementation of SHA-256.

For symmetric-key encryption, we measured the impact of two AES units on the performance of data encryption. For example, Kaby Lake can decrypt data using the CBC mode at a rate of 0.53 cpb using a w = 8 pipelined implementation, whereas Zen is 30% faster, taking 0.37 cpb.

We want to highlight that the architectural design of the AVX2 vector unit of Zen affects the performance of AVX2 code negatively. This poor performance was observed not only for hashing but also on AVX2 code supporting the X25519 Diffie-Hellman function [10].

Future works. Note that SIMD parallel implementations can also be extended to use AVX512 instructions [27]; thus, multiple-message hashing will be able to process sixteen messages simultaneously. Following an analysis analogous to the one performed by Gueron and Krasnov [19], a preliminary evaluation of an implementation of SHA-256 using AVX512 reveals that the number of instructions does not increase significantly, which could lead to performance improvements on forthcoming AVX512-ready processors.

It would be interesting to reproduce our experiments on ARM processors that support the ARMv8 Cryptographic Extensions [3] and/or the Scalable Vector Extension [34]; however, the performance evaluation on this platform turns out to be more complex due to the wide variety of processor implementations.

Finally, the optimizations presented in this work are also applicable to other algorithms, such as the SPHINCS+ [5] hash-based signature scheme, which was recently submitted to NIST's Post-Quantum call for proposals [33], and to Deoxys [28] and COLM [2], AES-based authenticated encryption algorithms also contending in the CAESAR competition.


ACKNOWLEDGMENTS

The authors would like to thank the anonymous referees for their valuable comments. The authors acknowledge support during the development of this research from Intel and FAPESP under project "Secure Execution of Cryptographic Algorithms" under Grant No. 14/50704-7.

A DEFINITIONS USED ON SHA-256 ALGORITHM

The SHA-256 algorithm [31] requires some auxiliary functions and constants, which are listed in this section.

Notation: entries of the matrices are given in base 16; the variables x, y, and z denote 32-bit words; the operators ∧, ⊕, and ¬ denote, respectively, the AND, XOR, and NOT boolean operations; and the symbols ≪ and ≫ denote, respectively, a left and a right 32-bit shift.

    Rot(x, n) = (x ≫ n) ⊕ (x ≪ (32 − n))                  (11)
    σ0(x) = Rot(x, 7) ⊕ Rot(x, 18) ⊕ (x ≫ 3)              (12)
    σ1(x) = Rot(x, 17) ⊕ Rot(x, 19) ⊕ (x ≫ 10)            (13)
    Σ0(x) = Rot(x, 2) ⊕ Rot(x, 13) ⊕ Rot(x, 22)           (14)
    Σ1(x) = Rot(x, 6) ⊕ Rot(x, 11) ⊕ Rot(x, 25)           (15)
    Ch(x, y, z) = (x ∧ y) ⊕ (¬x ∧ z)                      (16)
    Maj(x, y, z) = (x ∧ y) ⊕ (x ∧ z) ⊕ (y ∧ z)            (17)

To initialize the state, use the following values:

         [ a b c d ]   [ 6a09e667 bb67ae85 3c6ef372 a54ff53a ]
    S0 = [ e f g h ] = [ 510e527f 9b05688c 1f83d9ab 5be0cd19 ]    (18)

The following matrix contains the 64 constant values k_0, ..., k_63, listed row by row:

        428a2f98 71374491 b5c0fbcf e9b5dba5
        3956c25b 59f111f1 923f82a4 ab1c5ed5
        d807aa98 12835b01 243185be 550c7dc3
        72be5d74 80deb1fe 9bdc06a7 c19bf174
        e49b69c1 efbe4786 0fc19dc6 240ca1cc
        2de92c6f 4a7484aa 5cb0a9dc 76f988da
        983e5152 a831c66d b00327c8 bf597fc7
        c6e00bf3 d5a79147 06ca6351 14292967
        27b70a85 2e1b2138 4d2c6dfc 53380d13
        650a7354 766a0abb 81c2c92e 92722c85
        a2bfe8a1 a81a664b c24b8b70 c76c51a3
        d192e819 d6990624 f40e3585 106aa070
        19a4c116 1e376c08 2748774c 34b0bcb5
        391c0cb3 4ed8aa4a 5b9cca4f 682e6ff3
        748f82ee 78a5636f 84c87814 8cc70208
        90befffa a4506ceb bef9a3f7 c67178f2                (19)
B SOURCE CODE OF THE UPDATE FUNCTION

We show a fragment of C-language code that shows the usage of SSE/AVX2 and SHA-NI instructions in the implementation of the Update function of SHA-256. Add the compiler flags -mssse3 and -msha to enable the SSE and SHA-NI instruction sets. (The header names of the two #include directives were lost in extraction; <stdint.h> and <immintrin.h> are restored here, as required by the integer types and intrinsics used. The SHR and ROT macros are corrected to use a right shift, matching Equation (11).)

    /* @author: Armando Faz (2017) - Public Domain */
    #include <stdint.h>
    #include <immintrin.h>

    uint32_t K[64] = {0x428a2f98, 0x71374491, ..., 0xc67178f2};

    #define VEC         __m128i
    #define LOAD(X)     _mm_load_si128((VEC *)X)
    #define STORE(X,Y)  _mm_store_si128((VEC *)X,Y)
    #define ADD(X,Y)    _mm_add_epi32(X,Y)
    #define SHL(X,Y)    _mm_slli_epi32(X,Y)
    #define SHR(X,Y)    _mm_srli_epi32(X,Y)
    #define HIGH(X)     _mm_srli_si128(X,8)
    #define AND(X,Y)    _mm_and_si128(X,Y)
    #define ANDN(X,Y)   _mm_andnot_si128(X,Y)
    #define OR(X,Y)     _mm_or_si128(X,Y)
    #define XOR(X,Y)    _mm_xor_si128(X,Y)
    #define BROAD(X)    _mm_set1_epi32(X)
    #define ALIGNR(X,Y) _mm_alignr_epi8(X,Y,4)
    #define L2B(X)      _mm_shuffle_epi8(X,_mm_set_epi32( \
                        0x0c0d0e0f,0x08090a0b,0x04050607,0x00010203))
    #define SHA(X,Y,Z)  _mm_sha256rnds2_epu32(X,Y,Z)
    #define MSG1(X,Y)   _mm_sha256msg1_epu32(X,Y)
    #define MSG2(X,Y)   _mm_sha256msg2_epu32(X,Y)
    #define CVLO(X,Y,Z) _mm_shuffle_epi32(_mm_unpacklo_epi64(X,Y),Z)
    #define CVHI(X,Y,Z) _mm_shuffle_epi32(_mm_unpackhi_epi64(X,Y),Z)
    #define ROT(X,N)    OR(SHR(X,N),SHL(X,32-N))
    #define sigma0(X)   XOR(ROT(X,7),XOR(ROT(X,18),SHR(X,3)))
    #define sigma1(X)   XOR(ROT(X,17),XOR(ROT(X,19),SHR(X,10)))
    #define Sigma0(X)   XOR(ROT(X,2),XOR(ROT(X,13),ROT(X,22)))
    #define Sigma1(X)   XOR(ROT(X,6),XOR(ROT(X,11),ROT(X,25)))
    #define Ch(X,Y,Z)   XOR(AND(X,Y),ANDN(X,Z))
    #define Maj(X,Y,Z)  XOR(AND(X,Y),XOR(AND(X,Z),AND(Y,Z)))
    #define calcWi(W,t) ADD(sigma0(W[t-15]),\
                        ADD(sigma1(W[t-2]),ADD(W[t-7],W[t-16])))
    #define calcT1(E,F,G,H,Ki,Wi) \
                        ADD(H,ADD(Sigma1(E),ADD(Ch(E,F,G),ADD(Ki,Wi))))
    #define calcT2(A,B,C) ADD(Sigma0(A),Maj(A,B,C))

    /* 1) Vectorized Implementation */
    void Update_AVX2(VEC * state, VEC * msg_block) {
      int i=0; VEC a,b,c,d,e,f,g,h,ki,T1,T2,M[64];
      a = state[0]; b = state[1]; c = state[2]; d = state[3];
      e = state[4]; f = state[5]; g = state[6]; h = state[7];
      for(i=0;i<16;i++) {
        ki = BROAD(K[i]); M[i] = msg_block[i];
        T1 = calcT1(e,f,g,h,ki,M[i]); T2 = calcT2(a,b,c);
        h=g; g=f; f=e; e=ADD(d,T1); d=c; c=b; b=a; a=ADD(T1,T2);
      }
      for(i=16;i<64;i++) {
        ki = BROAD(K[i]); M[i] = calcWi(M,i);
        T1 = calcT1(e,f,g,h,ki,M[i]); T2 = calcT2(a,b,c);
        h=g; g=f; f=e; e=ADD(d,T1); d=c; c=b; b=a; a=ADD(T1,T2);
      }
      state[0] = ADD(state[0],a); state[1] = ADD(state[1],b);
      state[2] = ADD(state[2],c); state[3] = ADD(state[3],d);
      state[4] = ADD(state[4],e); state[5] = ADD(state[5],f);
      state[6] = ADD(state[6],g); state[7] = ADD(state[7],h);
    }

    /* 2) Pipelined SHA-NI Implementation (k=2) */
    void Update_SHANI(uint32_t state0[8], uint8_t * msg0,
                      uint32_t state1[8], uint8_t * msg1) {
      int i=0,j=0,i1=0,i2=0,i3=0;
      VEC X0,Y0,A0,C0,W0[4],X1,Y1,A1,C1,W1[4],Ki;

      X0 = LOAD(state0+0); X1 = LOAD(state1+0);
      Y0 = LOAD(state0+1); Y1 = LOAD(state1+1);
      A0 = CVLO(X0,Y0,0x1B); A1 = CVLO(X1,Y1,0x1B);
      C0 = CVHI(X0,Y0,0x1B); C1 = CVHI(X1,Y1,0x1B);

      for(i=0; i<4; i++) {
        Ki = LOAD(K+i);
        W0[i] = L2B(LOAD(msg0+i)); W1[i] = L2B(LOAD(msg1+i));
        X0 = ADD(W0[i],Ki); X1 = ADD(W1[i],Ki);
        Y0 = HIGH(X0); Y1 = HIGH(X1);
        C0 = SHA(C0,A0,X0); C1 = SHA(C1,A1,X1);
        A0 = SHA(A0,C0,Y0); A1 = SHA(A1,C1,Y1);
      }
      for(j=1; j<4; j++) {
        for(i=0, i1=1, i2=2, i3=3; i<4; i++) {
          Ki = LOAD(K+4*j+i);
          X0 = MSG1(W0[i],W0[i1]); X1 = MSG1(W1[i],W1[i1]);
          Y0 = ALIGNR(W0[i3],W0[i2]); Y1 = ALIGNR(W1[i3],W1[i2]);
          X0 = ADD(X0,Y0); X1 = ADD(X1,Y1);
          W0[i] = MSG2(X0,W0[i3]); W1[i] = MSG2(X1,W1[i3]);
          X0 = ADD(W0[i],Ki); X1 = ADD(W1[i],Ki);
          Y0 = HIGH(X0); Y1 = HIGH(X1);


          C0 = SHA(C0,A0,X0); C1 = SHA(C1,A1,X1);
          A0 = SHA(A0,C0,Y0); A1 = SHA(A1,C1,Y1);
          i1 = i2; i2 = i3; i3 = i;
        }
      }
      X0 = CVHI(A0,C0,0xB1); X1 = CVHI(A1,C1,0xB1);
      Y0 = CVLO(A0,C0,0xB1); Y1 = CVLO(A1,C1,0xB1);
      X0 = ADD(X0,LOAD(state0+0)); X1 = ADD(X1,LOAD(state1+0));
      Y0 = ADD(Y0,LOAD(state0+1)); Y1 = ADD(Y1,LOAD(state1+1));
      STORE(state0+0,X0); STORE(state1+0,X1);
      STORE(state0+1,Y0); STORE(state1+1,Y1);
    }

Listing 1: Implementation of the Update function using: 1) SSE vectors; 2) pipelined SHA-NI extensions.

REFERENCES
[1] Onur Aciicmez. 2005. Fast hashing on SIMD architecture. Master's thesis. Oregon State University. http://ir.library.oregonstate.edu/concern/graduate_thesis_or_dissertations/mk61rk723
[2] Elena Andreeva, Andrey Bogdanov, Atul Luykx, Bart Mennink, Elmar Tischhauser, and Kan Yasuda. 2013. Parallelizable and Authenticated Online Ciphers. In Advances in Cryptology - ASIACRYPT 2013, Kazue Sako and Palash Sarkar (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 424-443. https://doi.org/10.1007/978-3-642-42033-7_22
[3] ARM. 2017. ARM Architecture Reference Manual. ARMv8, for ARMv8-A architecture profile. ARM. https://static.docs.arm.com/ddi0487/ca/DDI0487C_a_armv8_arm.pdf
[4] Daniel J. Bernstein (Ed.). 2013. CAESAR: Competition for Authenticated Encryption: Security, Applicability, and Robustness. Cryptographic competitions. https://competitions.cr.yp.to/caesar-submissions.html
[5] Daniel J. Bernstein, Daira Hopwood, Andreas Hülsing, Tanja Lange, Ruben Niederhagen, Louiza Papachristodoulou, Michael Schneider, Peter Schwabe, and Zooko Wilcox-O'Hearn. 2015. SPHINCS: practical stateless hash-based signatures. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer, Berlin, Heidelberg, 368-397. https://doi.org/10.1007/978-3-662-46800-5_15
[6] Daniel J. Bernstein and Tanja Lange. 2017. eBACS: ECRYPT Benchmarking of Cryptographic Systems. (Dec. 2017). http://bench.cr.yp.to/supercop.html Accessed on 20 December 2017.
[7] Andrey Bogdanov, Martin M. Lauridsen, and Elmar Tischhauser. 2015. Comb to Pipeline: Fast Software Encryption Revisited. In Fast Software Encryption: 22nd International Workshop, FSE 2015, Istanbul, Turkey, March 8-11, 2015, Revised Selected Papers, Gregor Leander (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 150-171. https://doi.org/10.1007/978-3-662-48116-5_8
[8] Johannes Buchmann, Erik Dahmen, and Andreas Hülsing. 2011. XMSS - A Practical Forward Secure Signature Scheme Based on Minimal Security Assumptions. In Post-Quantum Cryptography, Bo-Yin Yang (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 117-129. https://doi.org/10.1007/978-3-642-25405-5_8
[9] Ana Karina D. S. de Oliveira and Julio López. 2015. An Efficient Software Implementation of the Hash-Based Signature Scheme MSS and Its Variants. In Progress in Cryptology - LATINCRYPT 2015, Kristin Lauter and Francisco Rodríguez-Henríquez (Eds.). Springer International Publishing, Guadalajara, Mexico, 366-383. https://doi.org/10.1007/978-3-319-22174-8_20
[10] Armando Faz-Hernández and Julio López. 2015. Fast Implementation of Curve25519 Using AVX2. In Progress in Cryptology - LATINCRYPT 2015 (Lecture Notes in Computer Science), Kristin Lauter and Francisco Rodríguez-Henríquez (Eds.), Vol. 9230. Springer International Publishing, Guadalajara, Mexico, 329-345. https://doi.org/10.1007/978-3-319-22174-8_18
[11] Agner Fog. 2017. Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. Technical University of Denmark. http://www.agner.org/optimize/instruction_tables.pdf
[12] Vinodh Gopal, Sean Gulley, Wajdi Feghali, Dan Zimmerman, and Ilya Albrekht. 2015. Improving OpenSSL Performance. Technical Report. Intel Corporation. https://software.intel.com/en-us/articles/improving-openssl-performance
[13] Vinodh Gopal, Jim Gullford, Wajdi Feghali, Erdinc Ozturk, Gil Wolrich, and Martin Dixon. 2010. Processing Multiple Buffers in Parallel to Increase Performance on Intel® Architecture Processors. Technical Report 324101. Intel Corporation. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/communications-ia-multi-buffer-paper.pdf
[14] Shay Gueron. 2009. Intel's New AES Instructions for Enhanced Performance and Security. In Fast Software Encryption: 16th International Workshop, FSE 2009, Leuven, Belgium, February 22-25, 2009, Revised Selected Papers, Orr Dunkelman (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 51-66. https://doi.org/10.1007/978-3-642-03317-9_4
[15] Shay Gueron. 2010. Intel® Advanced Encryption Standard (AES) New Instructions Set. Technical Report. Intel Corporation. http://www.intel.com/content/dam/doc/white-paper/advanced-encryption-standard-new-instructions-set-paper.pdf
[16] Shay Gueron and Michael Kounavis. 2010. Efficient implementation of the Galois Counter Mode using a carry-less multiplier and a fast reduction algorithm. Inform. Process. Lett. 110, 14 (2010), 549-553. https://doi.org/10.1016/j.ipl.2010.04.011
[17] Shay Gueron and Vlad Krasnov. 2012. Parallelizing message schedules to accelerate the computations of hash functions. Journal of Cryptographic Engineering 2, 4 (Nov. 2012), 241-253. https://doi.org/10.1007/s13389-012-0037-z
[18] Shay Gueron and Vlad Krasnov. 2012. Simultaneous Hashing of Multiple Messages. Journal of Information Security 3, 4 (Oct. 2012), 319-325. https://doi.org/10.4236/jis.2012.34039
[19] Shay Gueron and Vlad Krasnov. 2016. Accelerating Big Integer Arithmetic Using Intel IFMA Extensions. In 2016 IEEE 23rd Symposium on Computer Arithmetic (ARITH). IEEE, Santa Clara, CA, USA, 32-38. https://doi.org/10.1109/ARITH.2016.22
[20] Jim Guilford, Kirk Yap, and Vinodh Gopal. 2012. Fast SHA-256 Implementations on Intel® Architecture Processors. Technical Report 327457-001. Intel Corporation. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/sha-256-implementations-paper.pdf
[21] Sean Gulley, Vinodh Gopal, Kirk Yap, Wajdi Feghali, Jim Gullford, and Gil Wolrich. 2013. Intel® SHA Extensions: New Instructions Supporting the Secure Hash Algorithm on Intel® Architecture Processors. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/402097/intel-sha-extensions-white-paper.pdf
[22] Andreas Hülsing. 2013. W-OTS+ - Shorter Signatures for Hash-Based Signature Schemes. In Progress in Cryptology - AFRICACRYPT 2013, Amr Youssef, Abderrahmane Nitaj, and Aboul Ella Hassanien (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 173-188. https://doi.org/10.1007/978-3-642-38553-7_10
[23] Andreas Hülsing, Denis Butin, Stefan-Lukas Gazdag, Joost Rijneveld, and Aziz Mohaisen. 2018. XMSS: Extended Hash-Based Signatures. Internet-Draft draft-irtf-cfrg-xmss-hash-based-signatures-12. Internet Engineering Task Force. https://datatracker.ietf.org/doc/draft-irtf-cfrg-xmss-hash-based-signatures Work in Progress.
[24] Andreas Hülsing, Lea Rausch, and Johannes Buchmann. 2013. Optimal Parameters for XMSS^MT. In Security Engineering and Intelligence Informatics, Alfredo Cuzzocrea, Christian Kittl, Dimitris E. Simos, Edgar Weippl, and Lida Xu (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 194-208. https://doi.org/10.1007/978-3-642-40588-4_14
[25] Intel Corporation. 2009. Define SSE2, SSE3 and SSE4. (Jan. 2009). http://www.intel.com/support/processors/sb/CS-030123.htm
[26] Intel Corporation. 2011. Intel® Advanced Vector Extensions Programming Reference. (June 2011). https://software.intel.com/sites/default/files/m/f/7/c/36945
[27] Intel Corporation. 2016. Intel® Architecture Instruction Set Extensions Programming Reference. Intel Corporation. https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf
[28] Jérémy Jean, Ivica Nikolić, and Thomas Peyrin. 2014. Tweaks and Keys for Block Ciphers: The TWEAKEY Framework. In Advances in Cryptology - ASIACRYPT 2014: 20th International Conference on the Theory and Application of Cryptology and Information Security, Kaoshiung, Taiwan, R.O.C., December 7-11, 2014, Proceedings, Part II, Palash Sarkar and Tetsu Iwata (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 274-288. https://doi.org/10.1007/978-3-662-45608-8_15
[29] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). Technical Report FIPS PUB 197. NIST, Gaithersburg, MD, USA. https://doi.org/10.6028/NIST.FIPS.197
[30] National Institute of Standards and Technology. 2001. Recommendation for Block Cipher Modes of Operation. Technical Report NIST SP 800-38A. NIST, Gaithersburg, MD, USA. https://doi.org/10.6028/NIST.SP.800-38A
[31] National Institute of Standards and Technology. 2002. Secure Hash Standard. Technical Report FIPS PUB 180-2. NIST, Gaithersburg, MD, USA. https://doi.org/10.6028/NIST.FIPS.180-4
[32] National Institute of Standards and Technology. 2015. FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions. Technical Report. NIST, Gaithersburg, MD, USA. https://doi.org/10.6028/NIST.FIPS.202
[33] National Institute of Standards and Technology. 2016. Post-Quantum Cryptography Standardization. Technical Report. NIST, Gaithersburg, MD, USA. https://www.nist.gov/pqcrypto
[34] N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli, M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico, and P. Walker. 2017. The ARM Scalable Vector Extension. IEEE Micro 37, 2 (Mar. 2017), 26-39. https://doi.org/10.1109/MM.2017.35
[35] Hongjun Wu and Bart Preneel. 2014. AEGIS: A Fast Authenticated Encryption Algorithm. In Selected Areas in Cryptography - SAC 2013: 20th International Conference, Burnaby, BC, Canada, August 14-16, 2013, Revised Selected Papers, Tanja Lange, Kristin Lauter, and Petr Lisoněk (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 185-201. https://doi.org/10.1007/978-3-662-43414-7_10
