BYTE SLICING GRØSTL Optimized Intel AES-NI and 8-Bit Implementations of the SHA-3 Finalist Grøstl

BYTE SLICING GRØSTL Optimized Intel AES-NI and 8-bit Implementations of the SHA-3 Finalist Grøstl Kazumaro Aoki1∗, Günther Roland2, Yu Sasaki1 and Martin Schläffer2 1NTT Corporation, Tokyo, Japan 2IAIK, Graz University of Technology, Graz, Austria Keywords: Hash function, SHA-3 competition, Grøstl, Software implementation, Byte slicing, Intel AES new instructions, 8-bit AVR. Abstract: Grøstl is an AES-based hash function and one of the 5 finalists of the SHA-3 competition. In this work we present high-speed implementations of Grøstl for small 8-bit CPUs and large 64-bit CPUs with the recently introduced AES instructions set. Since Grøstl does not use the same MDS mixing layer as the AES, a direct application of the AES instructions seems difficult. In contrast to previous findings, our Grøstl implementations using the AES instructions are currently by far the fastest known. To achieve optimal performance we parallelize each round of Grøstl by taking advantage of the whole bit width of the used processor. This results in implementations running at 12.2 cylces/byte for Grøstl-256 and 18.6 cylces/byte for Grøstl-512. 1 INTRODUCTION lement Grøstl very efficiently on both 8-bit and 128- bit platforms. We achieve very good performance for In 2007, NIST has initiated the SHA-3 competi- larger bit widths by optimizing the MDS mixing mation (National Institute of Standards and Technology, trix computation of Grøstl and by computing mul- 2007) to find a new cryptographic hash function stan- tiple columns in parallel. The parallel computation dard. 51 interesting hash functions with different de- of the whole Grøstl round is possible and if parallel sign strategies have been accepted for the first round. AES S-box table lookups (using AES-NI or the vpaes Many of these SHA-3 candidates are AES-based and implementation of (Hamburg, 2009)) are available. might benefit from the Intel AES new instructions set The paper is organized as follows. In Section 2, (AES-NI) (Gueron and Intel Corp., 2010) to speed we give a short description of Grøstl. In Section 3, up their implementations. In (Benadjila et al., 2009) we describe requirements and general optimization those candidates which use the AES round transfor- techniques of our byte sliced implementations. In mation as a main building block have been analyzed Section 4, we show how to minimize the computa- and implemented using AES-NI. In that work, the au- tional requirements for MixBytes, the MDS mixing thors claim that algorithms which use a very different layer of Grøstl. In Section 5, we present the spe- MDS mixing matrix (than AES) are too distant from cific details of the 8-bit and 128-bit implementations. AES and that there is no easy way to benefit from Finally, we conclude in Section 6. AES-NI. Since December 2010, Grøstl(Gauravaram et al., 2011) is one of 5 finalists of the SHA-3 competition 2 DESCRIPTION OF GRØSTL and uses the same S-box as AES but a very different MDS mixing matrix. In this work we show that The hash function Grøstl was designed by Gau- it is still possible to efficiently implement Grøstl ravaram et al. as a candidate for the SHA-3 compe- using AES-NI. Moreover, our AES-NI implementa- tition (Gauravaram et al., 2011). In January 2011, tion of Grøstl is the fastest known implementation Grøstl has been tweaked for the final round of the of Grøstl so far. Furthermore, we present a self-byte competition and we only consider this variant here. It sliced implementation strategy which allows to imp- is an iterated hash function with a compression func- ∗Parts of this work were done while the author stayed at tion built from two distinct permutations P and Q, TU Graz. which are based on the same principles as the AES 124 Aoki K., Roland G., Sasaki Y. and Schläffer M.. BYTE SLICING GRØSTL - Optimized Intel AES-NI and 8-bit Implementations of the SHA-3 Finalist Grøstl. DOI: 10.5220/0003515701240133 In Proceedings of the International Conference on Security and Cryptography (SECRYPT-2011), pages 124-133 ISBN: 978-989-8425-71-3 Copyright c 2011 SCITEPRESS (Science and Technology Publications, Lda.) BYTE SLICING GRØSTL - Optimized Intel AES-NI and 8-bit Implementations of the SHA-3 Finalist Grøstl round transformation (National Institute of Standards 2.2.1 MixBytes and Technology, 2001). Grøstl is a wide pipe de- sign with security proofs for the collision and preim- As the MixBytes transformation is the most run-time age resistance of the compression function (Fouque intensive part of Grøstl in our case, we will describe et al., 2009). In the following, we describethe Grøstl this transformation in more detail here. The MixBytes hash function and the permutations of Grøstl-256 transformation is a matrix multiplication performed and Grøstl-512 in more detail. on the state matrix as follows: 2.1 The Grøstl Hash Function A ← B × A, where A is the state matrix and B is a The input message M is padded and split into blocks circulant MDS matrix specified as B = , , , , , , , M1,M2,...,Mt of ℓ bits with ℓ = 512 for Grøstl-256 circ(02 02 03 04 05 03 05 07) or by the following and ℓ = 1024 for Grøstl-512. The initial value H0, matrix: the intermediate hash values Hi, and the permutations 02 02 03 04 05 03 05 07 P and Q are of size ℓ as well. The message blocks 07 02 02 03 04 05 03 05 are processed via the compression function f , which 05 07 02 02 03 04 05 03 accepts two inputs of size ℓ bits and outputs an ℓ-bit 03 05 07 02 02 03 04 05 value. The compression function f is defined via the B = . 05 03 05 07 02 02 03 04 permutations P and Q as follows: 04 05 03 05 07 02 02 03 f (H,M)= P(H ⊕ M) ⊕ Q(M) ⊕ H. 03 04 05 03 05 07 02 02 02 03 04 05 03 05 07 02 The compression function is iterated with H0 = IV and Hi ← f (Hi−1,Mi) for 1 ≤ i ≤ t. The output The multiplication is performed in a finite field 8 4 Ht of the last call of the compression function is F256 defined by the irreducible polynomial x ⊕ x ⊕ processed by an output transformation g defined as x3 ⊕ x ⊕ 1 (0x11B). As the multiplication by 2 only g(x)= truncn(P(x) ⊕ x), where n is the output size of consists of a shift and a conditional XOR in binary the hash function and truncn(x) discards all but the arithmetic, we will calculate all multiplications by least significant n bits of x. Hence, the digest of the combining multiplications by 2 and additions (XOR), message M is defined as h(M)= g(Ht ). e.g. 7 · x = (2 · (2 · x)) ⊕ (2 · x) ⊕ x. For more details on the round transformations we 2.2 The Grøstl-256 Permutations refer to the Grøstl specification (Gauravaram et al., 2011). As mentioned above, two permutations P and Q are defined for Grøstl-256. Both permutations operate 2.3 The Grøstl-512 Permutations on a 512-bit state, which can be viewed as an 8 × 8 matrix of bytes. Each permutation of Grøstl-256 The permutations used in Grøstl-512 are of size ℓ = consists of 10 rounds, where the following four AES- 1024 bits and the state is viewed as an 8 × 16 matrix like round transformations are applied to the state in of bytes. The permutations use the same round trans- the given order: formations as in Grøstl-256 except for ShiftBytes: • AddRoundConstant (AC) XORs a constant to one Since the permutations are larger, the rows are shifted row of the state for P and to the whole state for Q. by {0,1,2,3,4,5,6,11} positions to the left in P. In Q The constant changes for every round. the rows are shifted by {1,3,5,11,0,2,4,6} positions to the left. The number of rounds is increased to 14. • SubBytes (SB) applies the AES S-box to each byte of the state. • ShiftBytes (SH) cyclically rotates the bytes of rows to the left by {0,1,2,3,4,5,6,7} positions 3 BYTESLICED in P and by {1,3,5,7,0,2,4,6} positions in Q. IMPLEMENTATIONS OF • MixBytes (MB) is a linear diffusion layer, which GRØSTL multiplies each column with a constant 8 × 8 circulant MDS matrix. In this section, we describe some requirements for the efficient parallel computation of the Grøstl round transformations. Due to the fact that MixBytes applies the same algorithm to every column of the state 125 SECRYPT 2011 - International Conference on Security and Cryptography we can ’byte slice’ Grøstl. In other words, we apply Another approach for parallel AES S-box table the same computations for every byte-wise column of lookups is to use small Log tables to efficiently com- the Grøstl state. On platforms with register sizes pute the inverse of the AES S-box using the vpaes larger than 8-bit we can parallelize every transforma- implementation presented in (Hamburg, 2009). tion by placing several bytes of one row (of the state) inside one register. One column of the state is then 3.4 ShiftBytes distributed over 8 different registers (see Figure 1). ShiftBytes is generally simple to implement on any P Q platform if the state is stored in row ordering.

BYTE SLICING GRØSTL Optimized Intel AES-NI and 8-Bit Implementations of the SHA-3 Finalist Grøstl

Improved Cryptanalysis of the Reduced Grøstl Compression Function, ECHO Permutation and AES Block Cipher

Grøstl – a SHA-3 Candidate∗

Security Analysis for MQTT in Internet of Things

A (Second) Preimage Attack on the GOST Hash Function

Improved Cryptanalysis of the Reduced Grøstl Compression Function, ECHO Permutation and AES Block Cipher

Advanced Meet-In-The-Middle Preimage Attacks: First Results on Full Tiger, and Improved Results on MD4 and SHA-2

Research Intuitions of Hashing Crypto System

Avalanche Handbook

Preimages for Reduced SHA-0 and SHA-1

SMASH Implementation Requirements

Randomness Analysis in Authenticated Encryption Systems

Evaluation of Cryptographic Packages