<<

Compact Implementation of Threefish and on FPGA

Nuray At∗, Jean-Luc Beuchat†, and Ismail˙ San∗ ∗Department of Electrical and Electronics Engineering, Anadolu University, Eskis¸ehir, Turkey Email: {nat, isan}@anadolu.edu.tr †Faculty of Engineering, Information and Systems, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan Email: [email protected]

Table I Abstract—The SHA-3 finalist Skein is built from the tweakable NUMBER OF ROUNDS FOR DIFFERENT SIZES (REPRINTEDFROM [1]). Threefish . In order to have a better understanding of the computational efficiency of Skein (resource sharing, memory access scheme, scheduling, etc.), we design a low-area coprocessor # words # rounds for Threefish and describe how to implement Skein on our [bits] Nw Nr architecture. We harness the intrinsic parallelism of Threefish 256 4 72 to design a pipelined ALU and interleave several tasks in order 512 8 72 to achieve a tight scheduling. From our point of view, the main 1024 16 80 advantage of Skein over other SHA-3 finalists is that the same coprocessor allows one to encrypt or hash a message.

I.INTRODUCTION T = (t0, t1). K and T are extended with one parity word (Algorithm 1, lines 1 and 2). Each subkey is a combination In this article, we propose a novel hardware implementation of N words of the extended key, two words of the extended of the SHA-3 finalist Skein [1]. As emphasized by Kerckhof et w tweak, and a counter s (Algorithm 1, lines 5 to 9). Note that al., “fully unrolled and pipelined architectures may sometimes the extended key and the extended tweak are rotated by one hide a part of the algorithms’ complexity that is better revealed word position between two consecutive subkeys. in compact implementations” [2]. In order to have a deeper un- derstanding of the computational efficiency of Skein (resource Algorithm 1 . sharing, memory access scheme, scheduling, etc.), we decided Input: K = (k , k , . . . , k ) to design a low-area coprocessor on a Field-Programmable A block cipher key 0 1 Nw−1 ; Gate Array (FPGA). Furthermore, such an implementation is a tweak T = (t0, t1); the constant C240 = valuable for constrained environments, where some security 1BD11BDAA9FC1A22. Output: N /4 + 1 k k k protocols mainly rely on cryptographic hash functions (see r subkeys s,0, s,1,..., s,Nw−1, where for instance [3]). 0 ≤ s ≤ Nr/4. Nw−1 Skein is built from the tweakable Threefish block cipher, M 1. k ← C ⊕ k ; defined with a 256-, 512-, and 1024-bit block size. The main Nw 240 i i=0 contribution of this article is a lightweight implementation of 2. t2 ← t ⊕ t1; Threefish (Section II). Then, we describe how to implement 3. for s ← 0 to Nr/4 do Skein on our coprocessor (Section III). We have prototyped 4. for i ← 0 to Nw − 4 do our architecture on a Xilinx Virtex-6 device and discuss our 5. ks,i ← k(s+i) mod (Nw+1); results in Section IV. 6. end for

7. ks,Nw−3 ← k(s+Nw−3) mod (Nw+1)  ts mod 3; II.THE BLOCK CIPHER 8. ks,Nw−2 ← k(s+Nw−2) mod (Nw+1)  t(s+1) mod 3;

A. Algorithm Specification 9. ks,Nw−1 ← k(s+Nw−1) mod (Nw+1)  s; Threefish operates entirely on unsigned 64-bit integers and 10. end for 11. return k , k ,..., k , where 0 ≤ s ≤ N /4; involves only three operations: rotation of k bits to the left s,0 s,1 s,Nw−1 r (denoted by ≪ k), bitwise exclusive OR (denoted by ⊕), and 64 addition modulo 2 (denoted by ). Therefore, the plaintext A series of Nr rounds (Figure 1 and Algorithm 2, lines 4 P and the cipher key K are converted to Nw 64-bit words. to 19) and a final subkey addition (Algorithm 2, line 21) are Note that the number of words Nw and the number of rounds applied to produce the . The core of a round is Nr depend on the key size (Table I). the simple non-linear mixing function Mixd,j (Algorithm 2, The key schedule generates the subkeys from a block lines 13 and 14). It consists of an addition, a rotation by a cipher key K = (k0, k1, . . . , kNw−1) and a 128-bit tweak constant Rd mod 8,j (repeated every eight rounds and defined in [1, Table 4]), and a bitwise exclusive OR. A word permu- B. The UBI Chaining Mode tation π(i) (defined in [1, Table 3]) is then applied to obtain Let E(K,T,P ) be a tweakable function. The the output of the round (Algorithm 2, line 17). Furthermore, Unique Block Iteration (UBI) chaining mode allows one to a subkey is injected every four rounds (Algorithm 2, line 7). build a compression function out of E. Each block Mi of the message is processed with a unique tweak value Ti encoding v4,0 v4,1 v4,2 v4,3 how many bytes have been processed so far, a type field (see [1] for details), and two bits specifying whether it is the k k k k 1,0 1,1 1,2 1,3 first and/or last block. The UBI chaining mode is computed e4,0 e4,1 e4,2 e4,3 as: H0 ← G,

Hi+1 ← Mi ⊕ E(Hi,Ti,Mi), 0 1 , , 4 4 R4,0 R4,1 ≪≪ where G is a starting value of Nw words. Mix Mix C. Hardware Implementation This short description of Threefish gives us the first hints on f f f f 4,0 4,1 4,2 4,3 designing a dedicated coprocessor (Figure 2). Our architecture consists of a register file implemented by means of dual- ported memory, an ALU, and a control unit. The register file

Permute is organized into 64-bit words, and stores a plaintext block, v5,0 v5,1 v5,2 v5,3 an internal state (ed,i, where 0 ≤ i ≤ Nw − 1), an extended Figure 1. One of the 72 rounds of Threefish-256. block cipher key, an extended tweak, the constant C240, and all possible values of s involved in the key schedule. Thanks to this approach, the word permutation π(i) and the word rotation of the key schedule are conveniently implemented Algorithm 2 Encryption with the Threefish block cipher. by addressing the register file accordingly. Since the round constants repeat every eight rounds (Algorithm 2, line 14), Input: A plaintext block P = (p , p , . . . , p ); N /4 + 1 0 1 Nw−1 r we decided to unroll eight iterations of the main loop of subkeys k , k ,..., k , where 0 ≤ s ≤ N /4; s,0 s,1 s,Nw−1 r Threefish (Algorithm 2, lines 4 to 19). The rotation constants 4N rotation constants R , where 0 ≤ i ≤ 7 and 0 ≤ w i,j R are included in the microcode executed by the control j ≤ N /2. d,i w unit. Note that our register file is designed for Threefish-1024 Output: A ciphertext block C = (c , c , . . . , c ). 0 1 Nw−1 (i.e. N = 16 and N = 80). It is therefore straightforward 1. for i ← 0 to N − 1 do w r w to implement the two other variants of the algorithm on our 2. v ← p ; 0,i i architecture. 3. end for 4. for d ← 0 to Nr − 1 do Register file (dual-ported memory block) 5. for i ← 0 to N − 1 do w WE 0

6. if d mod 4 = 0 then A Data p0,..., p15

7. ed,i ← vd,i kd/4,i; (Key injection) Port 16  Address ei,0,..., ei,15 8. else Port A Port B 32 9. ed,i ← vd,i; (Rename) k0,..., k15 Arithmetic and 48 k16 Logic Unit 10. end if 64 WE t0, t1, t2, C240 (Figure 3b) 11. end for 96

Address B 0, 1,..., 14, 15 12. for j ← 0 to Nw/2 − 1 do 16 17 18 19 20 112 Port , , , , Data 13. fd,2j ← ed,2j  ed,2j+1; (Mixd,j) 14. fd,2j+1 ← fd,2j ⊕ (ed,2j+1 ≪ Rd mod 8,j); 15. end for ctrl10:0 16. for i ← 0 to N − 1 do w StartControl Idle unit 17. vd+1,i ← fd,π(i); (Permute) Nw 18. end for 19. end for Figure 2. Architecture of the Threefish coprocessor. 20. for i ← 0 to Nw − 1 do The next step consists in defining the architecture of the 21. ci ← vNr ,i  kNr /4,i; (Key injection) 22. end for ALU and the instruction set of our coprocessor. In the fol-

23. return C = (c0, c1, . . . , cNw−1); lowing, Ri denotes a 64-bit register. Figure 3a illustrates our scheduling of the two mixing functions Mix4,0 and Mix4,1 of the fifth round of Threefish-256: otA Port efr h olwn prto Fgr 3b): a (Figure of operation rotation following any out the ALU carry perform our to summary, able In busy. be continuously must pipeline the keep to Mix schedule We irre rvddb iix.Ltu en two define on a us relying Let description Xilinx). by VHDL approach provided low-level libraries general a more require a not here does propose that (we the gates and XOR adder of the between array resources hardware Beuchat share to by strategy proposed one SHA- the to similar Ctrl where ti elkonthat well-known is It b prns epciey[] hs qain()cnb rewritten be can follows: (1) as Equation modulo Thus, subtraction [5]. the respectively and operands, AND, bitwise the OR, ovsol v aibe,adi datgosyimplemented advantageously is by and variables, five only select volves to a and on operation arithmetic (2) the Equation of implementation Virtex- the describes 3c Figure ( = and • • • • R R R R R R 64 3 6 2 1 5 4 a optto fMix of Computation (a) eeeueteisrcinR instruction the execute We signal). ( modified not intermediate R and Then, in cycles stored clock are of results three computation requires the operation start we time, operand The h ntuto R instruction The R a 3 e b 3 LUT 6 4 ∨ , 1 nls LK,adw a aeavnaeo their of advantage take can we and BLAKE, finalist 64 1 uhthat: such eie ic hr sasnl oto inlt choose to signal control single a is there Since device. n R and R bit b bits e e e ) 3 4 ≪ 4 4 10 , ( 6 e 1 , , 1 0

,b a, R 1 ← 4 , 2 3 0 e ( eoe oto i.OrAUi therefore is ALU Our bit. control a denotes e e e 4 ≪ 6 4 4 4 , a ( 1 , , , = ) ← rmtvsadddctdcrylogic. carry dedicated and primitives 1 3 0 slae nrgse R register in loaded is 9 a contain 4 ∧ e , 4 e ∨ f e e , 4 ≪ 1 e 0 4 ≪ ( , 4 4 4 , 1 b n Mix and 3 , , , R i.e. ( 4 3 2 0 25

1 3 sso as soon as ) b , R R ← 1 where , ) ( ( 3 e R 3 1 f e e R R 4 ≪ 1  a 4 4 4 , slae nrgse R register in loaded is 3 , , , R  R 2 3 1 ← ⊕  1 4 3 3 1 R ,

1 ←  f 2 (( 1 , , (Threefish- e R 4 f 4 ≪ R R R R 3 , R 4 4 3 , a ⊕

, R utb otoldb nenable an by controlled be must b 0 2 33 6 2 6) 2) 3 R , R 3 ← 6 ∧ ∨ and ⊕ ( = R f

e 1 4 b , 5

, R  otherwise, Ctrl when 4 3 256 R 3 ) R n R and , ∧ otherwise. Ctrl when , ← 2 0 6 ⊕ ) 3

a R e and , 3 a enra,admanage and read, been has ⊕ 4 losu ocompute to us allows ← Ctrl ∨ R , 1

a 6 iue3 rtmtcadlgcui o hefihencryption. Threefish for unit logic and Arithmetic 3. Figure 2 b R ≪ and h otn fR of content the ;

) From port A From port B 10 6

1 .  e 10 )  4 10 64 eoetebitwise the denote R b tal. et  ctrl ctrl , ( 1 qain()in- (2) Equation , 0 = 1 0 0 1 R 4 ctrl a btwr n to and word -bit 0 = 6 7 , Ctrl 64 2 1 0 ≪ ∧ 1:0 . respectively. , ttesame the at ; , btoperands -bit , b 10 4 o the for [4] ) 2 b rhtcueo h ALU the of Architecture (b) R

(2) . 0, 1, 2 or 3 64 ≪ and 4 ctrl ctrl , R 0 ftwo of 1 0 4 this ; 3:2 8 f a 1 4 ctrl (1) R R , ⊕ 1 is 2 1 0, 4, 8 or 12 9 . ≪ ctrl ctrl R 1 0 0 1 5 yEuto 1.Since we (1). However, Equation frequency. by compute clock to the able decrease are would the LUTs in of located is coprocessor our iue4dsrbshww ceueteisrcin of instructions the schedule we how describes Threefish- 4 Figure con- compute multiplexer R the register to into Ctrl Thanks two by file. simultaneously register trolled read the can from we values that assumes approach This instructions: following the execute to the on adder not the is and it LUT and multiplexers operator, same the arithmetic implement the to of inputs possible the select to bit to necessary however is It on-the-fly. compute subkeys us allows the approach generate and This to 1) 2). (Algorithm (Algorithm process schedule encryption key the the interleave we file, register lo st yastergse l n ouetecontent Threefish- the of use injection to p key and first file register the the R bypass of to Ctrl by us controlled multiplexers allow the that Note ward. R R compute to way 10 5:4 2 1 1 a b nodrt euetenme foead trdi the in stored operands of number the reduce to order In h mlmnaino h e neto smr straightfor- more is injection key the of implementation The  ⊕ R R R R n R and

3 or 0 16 32 48

R , , or 3 3 1 4 ≪ k 2 0 sa nu oteAU e scnie o instance for consider us Let ALU. the to input an as ← ← ← ← , R R notntl,ti ouinrqie n oecontrol more one requires solution this Unfortunately, . 2 3 6 t k 2 = 2 256 R R t t N 6 epciey n oeeueteisrcinR instruction the execute to and respectively, , 0 1 3 1 and R R R R R , w p ≪ 2 ⊕  . 3 1 3 1 . 2

2 To port B 7  ← ← ← ← rmtv nmr.Sneteciia ahof path critical the Since anymore. primitive R R k ecnla aafo otAo otB port or A port from data load can we , 0 Fgr b.Asmlrsrtg losu to us allows strategy similar A 3b). (Figure N 6 2 t k , 2 R R R k . R , w 2 c optto fR of Computation (c) Ctrl10 From R3 From R1 From R2 From R6 1 2 1 3 1 ol et load to be would  R , t eoetefis e neto.Teeasiest The injection. key first the before R , 2   LUT6 t t sn nytefntoaiisdefined functionalities the only using 1 6 2 R R 2 n scmue sfollows: as computed is and 2 2 2 ← ← ( = . 1 0 0 1 1 0 0 1 R 0 t R , 0 a b a b 5 3 j j j j ←  ≪ ( a j R 64 ∧ 1 0) b  j a a ) btadr netalevel extra an adder, -bit 0 j j ⊕ R 256 ∨ ∨ ⊕ , 2 Ctrl b b rR or t j j 10 0 ( : t 2 2 3 1 and ← e ← ← 5 0 ≪ R 0 1 , 3 2 ← ⊕ p t t 1 R 2 ( sdfie as defined is 1 0) carry 6 , R , carry naVirtex- a on 0 nregisters in tsuffices it , 4 6 carry 0 j 1 = ≪ n Ctrl and +1 Ctrl j 10 6 0 ) 3 device , ← 8 To R3 Computation of t2 and k4, and first key injection

Address A @t1 @t0 @k0 @0 @k1 @k2 @k3 @k0 @k3 @k2 @p2 @k1 @p1 @e0,0 @e0,3 @e0,2

Address B @0 @C240 @t2 @p0 @p3 @t1 @k4 @e0,0 @e0,3 @t0 @e0,2 @e1,0 File

Input B t2 k4 e0,0 e0,3 e0,2 e1,0 gister Output A t1 t0 k0 0 k1 k2 k3 k0 k3 k2 p2 k1 p1 e0,0 e0,3 e0,2 Re Output B 0 C240 p0 p3 t1 t0

Port A t1 t0 k0 0 k1 k2 k3 k0 k3 k2 p2 k1 p1 e0,0 e0,3 e0,2

Port B 0 C240 p0 p3 t1 t0

R1 t0 k0 0 k1 k2 k3 k0 k3 k2 k0,2 k1 k0,1 e0,1 e0,1 e0,3 e0,3

R2 0 C240 p0 p3 t1 p2 t0 p1 e0,0 e0,2 ≪14 R6 t1 k0 0 k1 k2 k3 e0,1 ALU

R3 t0 t2 C240 k4 e0,0 e0,3 k0,2 e0,2 k0,1 e0,1 f0,0 f0,1

Permute v1,0 v1,3

Rename e0,1 e1,0 e1,3

R3 ← R1  R2 R3 ← R3 ⊕ R6 First round Second key injection

Address A @e1,0 @,2 @e3,0 @k1 @k4 @k3 @v4,2 @k2 @v4,1 @e4,0 @e4,3 @e4,2

Address B @e2,2 @e3,0 @v4,2 @v4,1 @t2 @1 @e4,0 @t1 @e4,3 @e4,2 @e5,0 File

Input B e2,2 e3,0 v4,2 v4,1 e4,0 e4,3 e4,2 e5,0 gister Output A e1,0 e2,2 e3,0 k1 k4 k3 v4,2 k2 v4,1 e4,0 e4,3 e4,2 Re Output B t2 1 t1

Port A e1,0 e2,2 e3,0 k1 k4 k3 v4,2 k2 v4,1 e4,0 e4,3 e4,2

Port B t2 1 t1

R1 e1,3 e1,3 e1,1 e1,1 e2,1 e2,1 e2,3 e2,3 e3,3 e3,3 e3,1 e3,1 k1 k4 k3 e4,3 − 1 v4,2 k2 v4,1 e4,1 e4,1 e4,3 e4,3

R2 e1,2 e1,0 e2,0 e2,2 e3,2 e3,0 v4,0 v4,3 t2 1 k1,2 t1 k1,1 e4,0 e4,2 ≪16 ≪57 ≪52 ≪23 ≪40 ≪37 ≪5 ≪25 R6 e0,3 e1,3 e1,1 e2,1 e2,3 e3,3 e3,1 e4,1 ALU

R3 f0,2 f0,3 f1,2 f1,3 f1,0 f1,1 f2,0 f2,1 f2,2 f2,3 f3,2 f3,3 f3,0 f3,1 e4,0 e4,3 − 1 k1,2 e4,3 e4,2 k1,1 e4,1 f4,0 f4,1

Permute v1,2 v1,1 v2,2 v2,1 v2,0 v2,3 v3,0 v3,3 v3,2 v3,1 v4,2 v4,1 v4,0 v4,3 v5,0 v5,3

Rename e1,2 e1,1 e2,2 e2,1 e2,0 e2,3 e3,0 e3,3 e3,2 e3,1 e5,0 e5,3

Second round Third round Fourth round Fifth round

Figure 4. Scheduling of Threefish-256.@d denotes the address of the 64-bit word d in the register file. “Rename” and “Permute” refer to lines 9 and 17 of Algorithm 2, respectively.

The UBI chaining mode can be combined with the final key this process with an example: how to hash a message of 170 injection of Threefish encryption. It suffices to modify line 21 bytes M = (M0,M1,M2) with Skein-512-512 (Figure 5): e ← v k c ← of Algorithm 2 as follows: Nr ,i Nr ,i  Nr /4,i and i 1) Configuration block. The 32-byte string C encodes e ⊕ p Nr ,i i. The only difference between this operation and the the desired output length and several parameters de- mixing function MIX is that no permutation is applied to d,j fined in [1]. In the simple hashing mode, G0 = the second operand of the bitwise exclusive OR. 120 UBI(0,C,Tcfg2 ) depends only on the output size, and The control unit consists of an instruction memory, a small can be precomputed (see [1, Appendix B]). table to generate the address of k during the (s+i) mod (Nw+1) 2) Message. Note that M0 and M1 contain 64 bytes of data key schedule (Algorithm 1), and a simple finite-state machine. each, and M2 is the padded final block with 42 bytes of Since we unroll only eight rounds, we manage to keep the data. The tweak encodes whether Mi is the first or last instruction memory compact. In the case of Threefish-512, block of M, and the number of bytes processed so far. In we need for instance 145 instructions and 881 clock cycles to 120 this example, UBI(G0,M,Tmsg2 ) requires three calls encrypt a plaintext block in UBI mode. to Threefish-512. 120 3) Output transform. The call to UBI(G1, 0,Tout2 ) III.SKEIN HASHING achieves hashing-appropriate randomness. If a single In this work, we focus on the simple Skein hash computation output block is not sufficient, several output transforms and refer the reader to [1] for a description of the other modes can be run in parallel [1]. of operation. Three UBI invocations allow one to compute the In order to process a b-block message M, we load the 99 digest of a message M = (M0,M1,...,Mb−1) of up to 2 −8 precomputed value of G0 in the register file of our coprocessor. bits, where each Mi is a block of Nw 64-bit word. We explain Then, b + 1 calls to Threefish allow us to process the message and to perform the output transform. In the case of Skein-512- [4] J.-L. Beuchat, E. Okamoto, and T. Yamazaki, “Compact implementations 512, the throughput is given by T = 512·b·f bits/s, where f of BLAKE-32 and BLAKE-64 on FPGA,” in Proceedings of the (b+1)·881 2010 International Conference on Field-Programmable Technology– denotes the clock frequency of our architecture. FPT 2010, J. Bian, . Zhou, and K. Zhao, Eds. IEEE Press, 2010, pp. 170–177. Configuration block Output transformMessage [5] H. Warren, Hacker’s Delight. Addison-Wesley, 2002. 120 120 120 G0 ← UBI(0,C,Tcfg2 ) G1 ← UBI(G0,M,Tmsg2 ) H ← UBI(G1, 0,Tout2 ) [6] J.-P. Aumasson, L. Henzen, W. Meier, and R. Phan, “SHA-3 proposal BLAKE (version 1.4),” Jan. 2011. CMM M 0 0 1 2 [7] T. Yamazaki, J.-L. Beuchat, and E. Okamoto, “BLAKE-256, BLAKE- 512のコンパクトな統合実装,” IEICE暗号と情報セキュリティ実装

G0 G1 技術小特集号, vol. J-95A, no. 5, 2012. 0 H [8] B. Jungk, “Compact implementations of Grøstl, JH and Skein for FPGAs,” in Proceedings of the ECRYPT II Hash Workshop, 2011. [9] G. Bertoni, J. Daemen, M. Peeters, G. Van Assche, and R. Van Keer, Tweak: Tweak: Tweak: Tweak: Tweak: “Keccak implementation overview (version 3.1),” Sep. 2011. Type: Cfg Type: Msg Type: Msg Type: Msg Type: Out [10] I.˙ San and N. At, “Compact Keccak hardware architecture for data Len: 64 Len: 128 Len: 170 integrity and authentication on FPGAs,” Information Security Journal: First: 1 First: 0 First: 0 Last: 0 Last: 0 Last: 1 A Global Perspective, 2012. [11] K. Latif, M. Tariq, A. Aziz, and A. Mahboob, “Efficient hardware implementation of secure hash algorithm (SHA-3) finalist - Skein,” in Figure 5. Processing a 3-block message using Skein-512-512 in the simple Proceedings of the International Conference on Computer, Communica- hashing mode. tion, Control and Automation–3CA2011, 2011. [12] “The SHA-3 zoo,” http://ehash.iaik.tugraz.at/wiki/The SHA-3 Zoo. IV. RESULTS AND PERSPECTIVES We captured our architecture in the VHDL language and prototyped a fully autonomous implementation of Skein-512- 512 on a Xilinx Virtex-6 FPGA. Table II summarizes our results and the figures published by other researchers focusing on compact coprocessors (we refer the reader to the SHA-3 Zoo [12] for an overview of high-speed designs). Note that we considered the least favorable case, where the message consists of a single block, to compute the throughput. If we increase the size of the message, the throughput of our coprocessor converges asymptotically to 160 Mbits/s. The other hardware architectures of Skein reported in Table II make a single call to Threefish-512 and do not perform the output transform. Let us assume that all SHA-3 finalists provide the levels of security expected by the NIST. Then, according to Table II, BLAKE, Keccak, and Skein seem to be the best candidates for compact implementations on FPGA. Beuchat et al. [4] designed a low-area ALU for BLAKE on Xilinx devices. However, the datapath depends on the level of security one wishes to achieve. In order to overcome this drawback, Yamazaki et al. [7] proposed a unified coprocessor for the BLAKE family. Their ALU is built around a 64-bit datapath, and can process a 512-bit block (BLAKE-512) or two 256-bit blocks in parallel (BLAKE-256). From our point of view, the main advantage of Skein over other SHA-3 finalists is that the same coprocessor allows one to encrypt or hash a message. We plan to improve our architecture in order to support Threefish decryption, Skein-MAC, and tree hashing with Skein. REFERENCES [1] N. Ferguson, S. Lucks, B. Schneier, D. Whiting, M. Bellare, T. Kohno, J. Callas, and J. Walker, “The skein hash function family (version 1.3),” Oct. 2010. [2] S. Kerckhof, F. Durvaux, N. Veyrat-Charvillon, F. Regazzoni, G. Meurice de Dormale, and F.-X. Standaert, “Compact FPGA imple- mentations of the five SHA-3 finalists,” in Proceedings of the ECRYPT II Hash Workshop, 2011. [3] J. Zhai, C. Park, and G.-N. Wang, “Hash-based RFID security protocol using randomly key-changed identification procedure,” in Computational Science and Its Applications–ICCSA 2006, ser. Lecture Notes in Com- puter Science, no. 3983. Springer, 2006, pp. 296–305. Table II COMPACT IMPLEMENTATIONS OF THE FIVE SHA-3 FINALISTS ON VIRTEX-5 AND VIRTEX-6 FPGAS.

Area 36k memory Frequency Throughput Algorithm FPGA [slices] blocks [MHz] [Mbits/s] Aumasson et al. [6] BLAKE-256 xc5vlx110 390 – 91 412 Beuchat et al. [4]† BLAKE-256 xc6vlx75t-2 52 2 456 194 Aumasson et al. [6] BLAKE-512 xc5vlx110 939 – 59 468 Beuchat et al. [4]† BLAKE-512 xc6vlx75t-2 81 3 374 280 Kerckhof et al. [2] BLAKE-512 xc6vlx75t-1 192 – 240 183 2 × 150 (BLAKE-256) Yamazaki et al. [7] BLAKE (unified coprocessor) xc5vlx50-2 138 3 342 264 (BLAKE-512) Jungk [8] Grøstl-256 xcv5 470 – 354 1132 Kerckhof et al. [2] Grøstl-512 xc6vlx75t-1 260 – 280 640 Jungk [8] JH-256 xcv5 205 – 341 27 Kerckhof et al. [2] JH-512 xc6vlx75t-1 240 – 288 214 Bertoni et al. [9] Keccak[r = 1024, c = 576] xc5vlx50-3 448 – 265 52 Kerckhof et al. [2] Keccak[r = 1024, c = 576] xc6vlx75t-1 144 – 250 68 San & At [10] Keccak[r = 1024, c = 576] xc5vlx50-2 151 3 520 501 This Work Skein-512-512 xc6vlx75t-1 132 2 276 80 Jungk [8]‡ Skein-512-256 xcv5 555 – 271 237 Kerckhof et al. [2]‡ Skein-512-512 xc6vlx75t-1 240 – 160 179 Latif et al. [11]‡ Skein-256-256 xc5vlx110-3 821 Not specified 119 1610 †Modified to implement the tweaked version submitted for the final round of the SHA-3 competition. ‡Single call to Threefish-512.