Compact Implementation of Threefish and Skein on FPGA

Compact Implementation of Threefish and Skein on FPGA Nuray At∗, Jean-Luc Beuchaty, and Ismail˙ San∗ ∗Department of Electrical and Electronics Engineering, Anadolu University, Eskis¸ehir, Turkey Email: fnat, [email protected] yFaculty of Engineering, Information and Systems, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan Email: [email protected] Table I Abstract—The SHA-3 finalist Skein is built from the tweakable NUMBER OF ROUNDS FOR DIFFERENT KEY SIZES (REPRINTED FROM [1]). Threefish block cipher. In order to have a better understanding of the computational efficiency of Skein (resource sharing, memory access scheme, scheduling, etc.), we design a low-area coprocessor Key size # words # rounds for Threefish and describe how to implement Skein on our [bits] Nw Nr architecture. We harness the intrinsic parallelism of Threefish 256 4 72 to design a pipelined ALU and interleave several tasks in order 512 8 72 to achieve a tight scheduling. From our point of view, the main 1024 16 80 advantage of Skein over other SHA-3 finalists is that the same coprocessor allows one to encrypt or hash a message. I. INTRODUCTION T = (t0; t1). K and T are extended with one parity word (Algorithm 1, lines 1 and 2). Each subkey is a combination In this article, we propose a novel hardware implementation of N words of the extended key, two words of the extended of the SHA-3 finalist Skein [1]. As emphasized by Kerckhof et w tweak, and a counter s (Algorithm 1, lines 5 to 9). Note that al., “fully unrolled and pipelined architectures may sometimes the extended key and the extended tweak are rotated by one hide a part of the algorithms’ complexity that is better revealed word position between two consecutive subkeys. in compact implementations” [2]. In order to have a deeper understanding of the computational efficiency of Skein (resource Algorithm 1 Key schedule. sharing, memory access scheme, scheduling, etc.), we decided Input: K = (k ; k ; : : : ; k ) to design a low-area coprocessor on a Field-Programmable A block cipher key 0 1 Nw−1 ; Gate Array (FPGA). Furthermore, such an implementation is a tweak T = (t0; t1); the constant C240 = valuable for constrained environments, where some security 1BD11BDAA9FC1A22. Output: N =4 + 1 k k k protocols mainly rely on cryptographic hash functions (see r subkeys s;0, s;1,..., s;Nw−1, where for instance [3]). 0 ≤ s ≤ Nr=4. Nw−1 Skein is built from the tweakable Threefish block cipher, M 1. k C ⊕ k ; defined with a 256-, 512-, and 1024-bit block size. The main Nw 240 i i=0 contribution of this article is a lightweight implementation of 2. t2 t ⊕ t1; Threefish (Section II). Then, we describe how to implement 3. for s 0 to Nr=4 do Skein on our coprocessor (Section III). We have prototyped 4. for i 0 to Nw − 4 do our architecture on a Xilinx Virtex-6 device and discuss our 5. ks;i k(s+i) mod (Nw+1); results in Section IV. 6. end for 7. ks;Nw−3 k(s+Nw−3) mod (Nw+1) ts mod 3; II. THE THREEFISH BLOCK CIPHER 8. ks;Nw−2 k(s+Nw−2) mod (Nw+1) t(s+1) mod 3; A. Algorithm Specification 9. ks;Nw−1 k(s+Nw−1) mod (Nw+1) s; Threefish operates entirely on unsigned 64-bit integers and 10. end for 11. return k , k ,..., k , where 0 ≤ s ≤ N =4; involves only three operations: rotation of k bits to the left s;0 s;1 s;Nw−1 r (denoted by n k), bitwise exclusive OR (denoted by ⊕), and 64 addition modulo 2 (denoted by ). Therefore, the plaintext A series of Nr rounds (Figure 1 and Algorithm 2, lines 4 P and the cipher key K are converted to Nw 64-bit words. to 19) and a final subkey addition (Algorithm 2, line 21) are Note that the number of words Nw and the number of rounds applied to produce the ciphertext. The core of a round is Nr depend on the key size (Table I). the simple non-linear mixing function Mixd;j (Algorithm 2, The key schedule generates the subkeys from a block lines 13 and 14). It consists of an addition, a rotation by a cipher key K = (k0; k1; : : : ; kNw−1) and a 128-bit tweak constant Rd mod 8;j (repeated every eight rounds and defined in [1, Table 4]), and a bitwise exclusive OR. A word permu- B. The UBI Chaining Mode tation π(i) (defined in [1, Table 3]) is then applied to obtain Let E(K; T; P ) be a tweakable encryption function. The the output of the round (Algorithm 2, line 17). Furthermore, Unique Block Iteration (UBI) chaining mode allows one to a subkey is injected every four rounds (Algorithm 2, line 7). build a compression function out of E. Each block Mi of the message is processed with a unique tweak value Ti encoding v4;0 v4;1 v4;2 v4;3 how many bytes have been processed so far, a type field (see [1] for details), and two bits specifying whether it is the k k k k 1;0 1;1 1;2 1;3 first and/or last block. The UBI chaining mode is computed e4;0 e4;1 e4;2 e4;3 as: H0 G, Hi+1 Mi ⊕ E(Hi;Ti;Mi), 0 1 ; ; 4 4 R4;0 R4;1 nn where G is a starting value of Nw words. Mix Mix C. Hardware Implementation This short description of Threefish gives us the first hints on f f f f 4;0 4;1 4;2 4;3 designing a dedicated coprocessor (Figure 2). Our architecture consists of a register file implemented by means of dual- ported memory, an ALU, and a control unit. The register file Permute is organized into 64-bit words, and stores a plaintext block, v5;0 v5;1 v5;2 v5;3 an internal state (ed;i, where 0 ≤ i ≤ Nw − 1), an extended Figure 1. One of the 72 rounds of Threefish-256. block cipher key, an extended tweak, the constant C240, and all possible values of s involved in the key schedule. Thanks to this approach, the word permutation π(i) and the word rotation of the key schedule are conveniently implemented Algorithm 2 Encryption with the Threefish block cipher. by addressing the register file accordingly. Since the round constants repeat every eight rounds (Algorithm 2, line 14), Input: A plaintext block P = (p ; p ; : : : ; p ); N =4 + 1 0 1 Nw−1 r we decided to unroll eight iterations of the main loop of subkeys k , k ,..., k , where 0 ≤ s ≤ N =4; s;0 s;1 s;Nw−1 r Threefish (Algorithm 2, lines 4 to 19). The rotation constants 4N rotation constants R , where 0 ≤ i ≤ 7 and 0 ≤ w i;j R are included in the microcode executed by the control j ≤ N =2. d;i w unit. Note that our register file is designed for Threefish-1024 Output: A ciphertext block C = (c ; c ; : : : ; c ). 0 1 Nw−1 (i.e. N = 16 and N = 80). It is therefore straightforward 1. for i 0 to N − 1 do w r w to implement the two other variants of the algorithm on our 2. v p ; 0;i i architecture. 3. end for 4. for d 0 to Nr − 1 do Register file (dual-ported memory block) 5. for i 0 to N − 1 do w WE 0 6. if d mod 4 = 0 then A Data p0,..., p15 7. ed;i vd;i kd=4;i; (Key injection) Port 16 Address ei;0,..., ei;15 8. else Port A Port B 32 9. ed;i vd;i; (Rename) k0,..., k15 Arithmetic and 48 k16 Logic Unit 10. end if 64 WE t0, t1, t2, C240 (Figure 3b) 11. end for 96 Address B 0, 1,..., 14, 15 12. for j 0 to Nw=2 − 1 do 16 17 18 19 20 112 Port , , , , Data 13. fd;2j ed;2j ed;2j+1; (Mixd;j) 14. fd;2j+1 fd;2j ⊕ (ed;2j+1 n Rd mod 8;j); 15. end for ctrl10:0 16. for i 0 to N − 1 do w StartControl Idle unit 17. vd+1;i fd,π(i); (Permute) Nw 18. end for 19. end for Figure 2. Architecture of the Threefish coprocessor. 20. for i 0 to Nw − 1 do The next step consists in defining the architecture of the 21. ci vNr ;i kNr =4;i; (Key injection) 22. end for ALU and the instruction set of our coprocessor. In the fol- 23. return C = (c0; c1; : : : ; cNw−1); lowing, Ri denotes a 64-bit register. Figure 3a illustrates our scheduling of the two mixing functions Mix4;0 and Mix4;1 of the fifth round of Threefish-256: LUT6 2 1 bit ctrl1:0 ctrl3:2 ctrl5:4 1 48 3 64 12 bits or or or 6 b carryj+1 2 1 j 32 8 a _ b R j j , , , 1 R4 4 R5 R6 , 0 2 6 2 6 16 , 0 , R R R R 0 ⊕ ⊕ 0 1 3 1 3 From 0 1 n R R R R n n ctrl7 2 0 1 3 3 3 3 3 R R R R R R B 1 aj 0 o (aj ^ bj) ⊕ Ctrl10 T From Port A e4;1 e4;0 e4;3 e4;2 port 1 1 1 0 R 0 0 b R1 e4;1 e4;1 e4;3 e4;3 From R2 1 B 1 bj A aj _ bj R2 e4;0 e4;2 From or R3 port 0 3 o 1 1 port n n R R4 e e 0 T 4;1 4;3 R1 0 a carryj R5 en9 en1 1 0 4;1 4;3 From 1 From (carry0 = Ctrl10) n25 n33 1 aj R6 e4;1 e4;3 10 3 f f f f R 4;0 4;1 4;2 4;3 Ctrl ctrl6 ctrl8 ctrl9 ctrl10 (a) Computation of Mix4;0 and Mix4;1 (Threefish-256) (b) Architecture of the ALU (c) Computation of R3 R1 R2 or R3 R3 ⊕ R6 on a Virtex-6 device Figure 3.

Compact Implementation of Threefish and Skein on FPGA

IMPLEMENTATION and BENCHMARKING of PADDING UNITS and HMAC for SHA-3 CANDIDATES in FPGAS and ASICS by Ambarish Vyas a Thesis Subm

The Boomerang Attacks on the Round-Reduced Skein-512 *

An Analytic Attack Against ARX Addition Exploiting Standard Side-Channel Leakage

SHA-3 Conference, March 2012, Skein: More Than Just a Hash Function

Authenticated Encryption with Small Stretch (Or, How to Accelerate AERO) ⋆

Bicliques for Preimages: Attacks on Skein-512 and the SHA-2 Family

Darxplorer a Toolbox for Cryptanalysis and Cipher Designers

Modes of Operation for Compressed Sensing Based Encryption

High-Speed Hardware Implementations of BLAKE, Blue

An Efficient Parallel Algorithm for Skein Hash Functions

Rotational Rebound Attacks on Reduced Skein

Authenticated Encryption on Fpgas from the Reconfigurable Part to the Static Part Karim Moussa Ali Abdellatif