Compact Implementation of Threefish and Skein on FPGA
Total Page:16
File Type:pdf, Size:1020Kb
Compact Implementation of Threefish and Skein on FPGA Nuray At∗, Jean-Luc Beuchaty, and Ismail˙ San∗ ∗Department of Electrical and Electronics Engineering, Anadolu University, Eskis¸ehir, Turkey Email: fnat, [email protected] yFaculty of Engineering, Information and Systems, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan Email: [email protected] Table I Abstract—The SHA-3 finalist Skein is built from the tweakable NUMBER OF ROUNDS FOR DIFFERENT KEY SIZES (REPRINTED FROM [1]). Threefish block cipher. In order to have a better understanding of the computational efficiency of Skein (resource sharing, memory access scheme, scheduling, etc.), we design a low-area coprocessor Key size # words # rounds for Threefish and describe how to implement Skein on our [bits] Nw Nr architecture. We harness the intrinsic parallelism of Threefish 256 4 72 to design a pipelined ALU and interleave several tasks in order 512 8 72 to achieve a tight scheduling. From our point of view, the main 1024 16 80 advantage of Skein over other SHA-3 finalists is that the same coprocessor allows one to encrypt or hash a message. I. INTRODUCTION T = (t0; t1). K and T are extended with one parity word (Algorithm 1, lines 1 and 2). Each subkey is a combination In this article, we propose a novel hardware implementation of N words of the extended key, two words of the extended of the SHA-3 finalist Skein [1]. As emphasized by Kerckhof et w tweak, and a counter s (Algorithm 1, lines 5 to 9). Note that al., “fully unrolled and pipelined architectures may sometimes the extended key and the extended tweak are rotated by one hide a part of the algorithms’ complexity that is better revealed word position between two consecutive subkeys. in compact implementations” [2]. In order to have a deeper un- derstanding of the computational efficiency of Skein (resource Algorithm 1 Key schedule. sharing, memory access scheme, scheduling, etc.), we decided Input: K = (k ; k ; : : : ; k ) to design a low-area coprocessor on a Field-Programmable A block cipher key 0 1 Nw−1 ; Gate Array (FPGA). Furthermore, such an implementation is a tweak T = (t0; t1); the constant C240 = valuable for constrained environments, where some security 1BD11BDAA9FC1A22. Output: N =4 + 1 k k k protocols mainly rely on cryptographic hash functions (see r subkeys s;0, s;1,..., s;Nw−1, where for instance [3]). 0 ≤ s ≤ Nr=4. Nw−1 Skein is built from the tweakable Threefish block cipher, M 1. k C ⊕ k ; defined with a 256-, 512-, and 1024-bit block size. The main Nw 240 i i=0 contribution of this article is a lightweight implementation of 2. t2 t ⊕ t1; Threefish (Section II). Then, we describe how to implement 3. for s 0 to Nr=4 do Skein on our coprocessor (Section III). We have prototyped 4. for i 0 to Nw − 4 do our architecture on a Xilinx Virtex-6 device and discuss our 5. ks;i k(s+i) mod (Nw+1); results in Section IV. 6. end for 7. ks;Nw−3 k(s+Nw−3) mod (Nw+1) ts mod 3; II. THE THREEFISH BLOCK CIPHER 8. ks;Nw−2 k(s+Nw−2) mod (Nw+1) t(s+1) mod 3; A. Algorithm Specification 9. ks;Nw−1 k(s+Nw−1) mod (Nw+1) s; Threefish operates entirely on unsigned 64-bit integers and 10. end for 11. return k , k ,..., k , where 0 ≤ s ≤ N =4; involves only three operations: rotation of k bits to the left s;0 s;1 s;Nw−1 r (denoted by n k), bitwise exclusive OR (denoted by ⊕), and 64 addition modulo 2 (denoted by ). Therefore, the plaintext A series of Nr rounds (Figure 1 and Algorithm 2, lines 4 P and the cipher key K are converted to Nw 64-bit words. to 19) and a final subkey addition (Algorithm 2, line 21) are Note that the number of words Nw and the number of rounds applied to produce the ciphertext. The core of a round is Nr depend on the key size (Table I). the simple non-linear mixing function Mixd;j (Algorithm 2, The key schedule generates the subkeys from a block lines 13 and 14). It consists of an addition, a rotation by a cipher key K = (k0; k1; : : : ; kNw−1) and a 128-bit tweak constant Rd mod 8;j (repeated every eight rounds and defined in [1, Table 4]), and a bitwise exclusive OR. A word permu- B. The UBI Chaining Mode tation π(i) (defined in [1, Table 3]) is then applied to obtain Let E(K; T; P ) be a tweakable encryption function. The the output of the round (Algorithm 2, line 17). Furthermore, Unique Block Iteration (UBI) chaining mode allows one to a subkey is injected every four rounds (Algorithm 2, line 7). build a compression function out of E. Each block Mi of the message is processed with a unique tweak value Ti encoding v4;0 v4;1 v4;2 v4;3 how many bytes have been processed so far, a type field (see [1] for details), and two bits specifying whether it is the k k k k 1;0 1;1 1;2 1;3 first and/or last block. The UBI chaining mode is computed e4;0 e4;1 e4;2 e4;3 as: H0 G, Hi+1 Mi ⊕ E(Hi;Ti;Mi), 0 1 ; ; 4 4 R4;0 R4;1 nn where G is a starting value of Nw words. Mix Mix C. Hardware Implementation This short description of Threefish gives us the first hints on f f f f 4;0 4;1 4;2 4;3 designing a dedicated coprocessor (Figure 2). Our architecture consists of a register file implemented by means of dual- ported memory, an ALU, and a control unit. The register file Permute is organized into 64-bit words, and stores a plaintext block, v5;0 v5;1 v5;2 v5;3 an internal state (ed;i, where 0 ≤ i ≤ Nw − 1), an extended Figure 1. One of the 72 rounds of Threefish-256. block cipher key, an extended tweak, the constant C240, and all possible values of s involved in the key schedule. Thanks to this approach, the word permutation π(i) and the word rotation of the key schedule are conveniently implemented Algorithm 2 Encryption with the Threefish block cipher. by addressing the register file accordingly. Since the round constants repeat every eight rounds (Algorithm 2, line 14), Input: A plaintext block P = (p ; p ; : : : ; p ); N =4 + 1 0 1 Nw−1 r we decided to unroll eight iterations of the main loop of subkeys k , k ,..., k , where 0 ≤ s ≤ N =4; s;0 s;1 s;Nw−1 r Threefish (Algorithm 2, lines 4 to 19). The rotation constants 4N rotation constants R , where 0 ≤ i ≤ 7 and 0 ≤ w i;j R are included in the microcode executed by the control j ≤ N =2. d;i w unit. Note that our register file is designed for Threefish-1024 Output: A ciphertext block C = (c ; c ; : : : ; c ). 0 1 Nw−1 (i.e. N = 16 and N = 80). It is therefore straightforward 1. for i 0 to N − 1 do w r w to implement the two other variants of the algorithm on our 2. v p ; 0;i i architecture. 3. end for 4. for d 0 to Nr − 1 do Register file (dual-ported memory block) 5. for i 0 to N − 1 do w WE 0 6. if d mod 4 = 0 then A Data p0,..., p15 7. ed;i vd;i kd=4;i; (Key injection) Port 16 Address ei;0,..., ei;15 8. else Port A Port B 32 9. ed;i vd;i; (Rename) k0,..., k15 Arithmetic and 48 k16 Logic Unit 10. end if 64 WE t0, t1, t2, C240 (Figure 3b) 11. end for 96 Address B 0, 1,..., 14, 15 12. for j 0 to Nw=2 − 1 do 16 17 18 19 20 112 Port , , , , Data 13. fd;2j ed;2j ed;2j+1; (Mixd;j) 14. fd;2j+1 fd;2j ⊕ (ed;2j+1 n Rd mod 8;j); 15. end for ctrl10:0 16. for i 0 to N − 1 do w StartControl Idle unit 17. vd+1;i fd,π(i); (Permute) Nw 18. end for 19. end for Figure 2. Architecture of the Threefish coprocessor. 20. for i 0 to Nw − 1 do The next step consists in defining the architecture of the 21. ci vNr ;i kNr =4;i; (Key injection) 22. end for ALU and the instruction set of our coprocessor. In the fol- 23. return C = (c0; c1; : : : ; cNw−1); lowing, Ri denotes a 64-bit register. Figure 3a illustrates our scheduling of the two mixing functions Mix4;0 and Mix4;1 of the fifth round of Threefish-256: LUT6 2 1 bit ctrl1:0 ctrl3:2 ctrl5:4 1 48 3 64 12 bits or or or 6 b carryj+1 2 1 j 32 8 a _ b R j j , , , 1 R4 4 R5 R6 , 0 2 6 2 6 16 , 0 , R R R R 0 ⊕ ⊕ 0 1 3 1 3 From 0 1 n R R R R n n ctrl7 2 0 1 3 3 3 3 3 R R R R R R B 1 aj 0 o (aj ^ bj) ⊕ Ctrl10 T From Port A e4;1 e4;0 e4;3 e4;2 port 1 1 1 0 R 0 0 b R1 e4;1 e4;1 e4;3 e4;3 From R2 1 B 1 bj A aj _ bj R2 e4;0 e4;2 From or R3 port 0 3 o 1 1 port n n R R4 e e 0 T 4;1 4;3 R1 0 a carryj R5 en9 en1 1 0 4;1 4;3 From 1 From (carry0 = Ctrl10) n25 n33 1 aj R6 e4;1 e4;3 10 3 f f f f R 4;0 4;1 4;2 4;3 Ctrl ctrl6 ctrl8 ctrl9 ctrl10 (a) Computation of Mix4;0 and Mix4;1 (Threefish-256) (b) Architecture of the ALU (c) Computation of R3 R1 R2 or R3 R3 ⊕ R6 on a Virtex-6 device Figure 3.