<<

Compact Hardware Implementations of ChaCha, BLAKE, Threefish, and on FPGA

Nuray At∗, Jean-Luc Beuchat†, Eiji Okamoto‡, Ismail˙ San∗, and Teppei Yamazaki‡ ∗Department of Electrical and Electronics Engineering, Anadolu University, Eskis¸ehir, Turkey Email: {nat, isan}@anadolu.edu.tr †ELCA Informatique SA, Av. de la Harpe 22–24, Case postale 519, 1001 Lausanne, Switzerland Email: [email protected] ‡Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan Email: [email protected], [email protected]

Abstract—The cryptographic hash functions BLAKE and • ≪ k: rotation by k bits to the left. Skein are built from the ChaCha and the tweakable The rest of the article is organized as follows: after a brief Threefish , respectively. Interestingly enough, they are based on the same arithmetic operations, and the same overview of Threefish (Section II), Skein (Section III), ChaCha design philosophy allows one to design lightweight coprocessors (Section IV), and BLAKE (Section V), we describe our for hashing and . The element of our approach design philosophy and compact hardware implementations is to take advantage of the parallelism of the algorithms to (Section VI). We discuss our implementation results on several deeply pipeline our Arithmetic an Logic Units, and to avoid data Xilinx Field-Programmable Gate Arrays (FPGAs) in Sec- dependencies by interleaving independent tasks. We show for instance that a fully autonomous implementation of BLAKE and tion VII and conclude in Section VIII. ChaCha on a Xilinx Virtex-6 device occupies 144 slices and three memory blocks, and achieves competitive throughputs. In order II.THE BLOCK CIPHER to offer the same features, a coprocessor implementing Skein and Threefish requires a substantial higher slice count. The design philosophy of Threefish is that “a larger num- ber of simple rounds is more secure than fewer complex I.INTRODUCTION rounds” [2]. The can be computed in a few clock The cryptographic hash functions BLAKE [1] and Skein [2] cycles, which is an important consideration in order to build are built from the ChaCha stream cipher [3] and the tweakable a compression function from a block cipher. Threefish block cipher [2], respectively. It is therefore tempting Threefish operates entirely on unsigned 64-bit integers and to design compact unified hardware architectures able to hash involves only three operations: rotation of k bits to the left, 64 and encrypt a message. Such processors are for instance bitwise exclusive OR, and addition modulo 2 . Therefore, the valuable for constrained environments, where some security plaintext P and the cipher key K are converted to Nw 64-bit protocols mainly rely on cryptographic hash functions [4]. words. Note that the number of words Nw and the number Furthermore, as emphasized by Kerckhof et al., “fully unrolled of rounds Nr depend on the key size (Table I). The size of a and pipelined architectures may sometimes hide a part of the plaintext block is given by Nb = 8 · Nw bytes. algorithms’ complexity that is better revealed in compact im- Table I plementations” [5]. In order to have a deeper understanding of NUMBEROFROUNDSOF THREEFISH FOR DIFFERENT KEY SIZES. the computational efficiency of ChaCha, BLAKE, Threefish, and Skein, we extend here our work presented in [6], [7] Key size # 64-bit words # rounds Block size and propose novel lightweight coprocessors. The key element [bits] Nw Nr Nb [bytes] of our approach is to take advantage of the parallelism of 256 4 72 32 the algorithms to deeply pipeline our Arithmetic an Logic 512 8 72 64 1024 16 80 128 Units (ALUs), and to avoid data dependencies by interleaving independent tasks. Throughout this article, all operands are w-bit unsigned The key schedule generates the subkeys from a block integers and the following notation is adopted: cipher key K = (k0, k1, . . . , kNw−1) and a 128-bit tweak w • : addition modulo 2 ; T = (t0, t1). K and T are extended with one parity word w • : subtraction modulo 2 ; (Algorithm 1, lines 1 and 2). Each subkey is a combination • ∧: bitwise AND; of Nw words of the extended key, two words of the extended • ∨: bitwise OR; tweak, and a counter s (Algorithm 1, lines 5 to 9). Note that • ⊕: bitwise exclusive OR; the extended key and the extended tweak are rotated by one • ≫ k: rotation by k bits to the right; word position between two consecutive subkeys. Table II PERMUTATIONS USED BY THE SKEINFUNCTIONS (REPRINTEDFROM [2]).

i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Nw = 4 0 3 2 1 π(i) Nw = 8 2 1 4 7 6 5 0 3 Nw = 16 0 9 2 13 6 11 4 15 10 7 12 3 14 5 8 1

v v v v Algorithm 1 Key schedule of Threefish. 4,0 4,1 4,2 4,3

Input: A block cipher key K = (k0, k1, . . . , kNw−1); k1,0 k1,1 k1,2 k1,3 a tweak T = (t0, t1); the constant C240 = e e e e 1BD11BDAA9FC1A22. 4,0 4,1 4,2 4,3

Output: Nr/4 + 1 subkeys ks,0, ks,1,..., ks,Nw−1, where 0 1

0 ≤ s ≤ Nr/4. , , 4 4 Nw−1 M R4,0 R4,1 ≪≪

1. kNw ← C240 ⊕ ki; Mix Mix i=0 2. t2 ← t ⊕ t1; 3. for s ← 0 to Nr/4 do f4,0 f4,1 f4,2 f4,3 4. for i ← 0 to Nw − 4 do

5. ks,i ← k(s+i) mod (Nw+1); 6. end for Permute

7. ks,Nw−3 ← k(s+Nw−3) mod (Nw+1)  ts mod 3; v5,0 v5,1 v5,2 v5,3

8. ks,Nw−2 ← k(s+Nw−2) mod (Nw+1)  t(s+1) mod 3; Figure 1. One of the 72 encryption rounds of Threefish-256. 9. ks,Nw−1 ← k(s+Nw−1) mod (Nw+1)  s; 10. end for v5,0 v5,1 v5,2 v5,3 11. return ks,0, ks,1,..., ks,Nw−1, where 0 ≤ s ≤ Nr/4; Permute A series of Nr rounds (Figure 1 and Algorithm 2, lines 4 f4,0 f4,1 f4,2 f4,3 to 19) and a final subkey addition (Algorithm 2, line 21) are

applied to produce the . The core of a round is 0 1 , , 4 4 the simple non-linear mixing function Mixd,j (Algorithm 2, lines 13 and 14). It consists of an addition, a rotation by a R4,0 ≫ R4,1 ≫ vMix vMix constant Rd mod 8,j (repeated every eight rounds and defined In In in [2, Table 4]), and a bitwise exclusive OR. A word permu- tation π(i) (defined in Table II) is then applied to obtain the e4,0 e4,1 e4,2 e4,3 output of the round (Algorithm 2, line 17). Furthermore, a subkey is injected every four rounds (Algorithm 2, line 7). k1,0 k1,1 k1,2 k1,3

Figure 2 describes a decryption round of Threefish-256. v4,0 v4,1 v4,2 v4,3 It consists of the inverse word permutation followed by the inverse MIX functions. Note that subkeys are injected in Figure 2. One of the 72 decryption rounds of Threefish-256. reverse order.

III.THE SKEIN FAMILY OF HASH FUNCTIONS two bits specifying whether it is the first and/or last block. The Unique Block Iteration (UBI) chaining mode allows one The UBI chaining mode is computed as: to build a compression function out of a tweakable encryption H0 ← G, M 299 −8 function. Let be a message of arbitrary length up to H ← M ⊕ E(H ,T ,M ), bits. If the number of bits in M is not a multiple of 8, we i+1 i i i i append a bit 1 followed by a (possibly empty) string of 0’s. where G is a starting value of Nb bytes. This step guarantees that M contains NM bytes. Then, we pad In this work, we consider the normal hashing mode and M with p zero bytes so that NM +p is a multiple of the block refer the reader to [2] for a description of Skein-MAC and size Nb. We can now split M into Nb-byte blocks M0,..., tree hashing with Skein. Skein is built on three invocations of Mk−1, where k = (NM +p)/Nb. Each block Mi is processed UBI (Figure 3): with a unique tweak value Ti encoding how many bytes have • Define a 32-byte configuration string C that contains the been processed so far, a type field (see [2] for details), and length of the digest size (in bits), a schema identifier, Algorithm 2 Encryption with the Threefish block cipher. Configuration block Output transformMessage 120 120 120 G0 ← UBI(0,C,Tcfg2 ) G1 ← UBI(G0,M,Tmsg2 ) H ← UBI(G1, 0,Tout2 ) Input: A plaintext block P = (p0, p1, . . . , pNw−1); Nr/4 + 1 CMM0 M1 2 0 subkeys ks,0, ks,1,..., ks,Nw−1, where 0 ≤ s ≤ Nr/4; 4Nw rotation constants Ri,j, where 0 ≤ i ≤ 7 and 0 ≤ G0 G1 j ≤ Nw/2. 0 H

Output: A ciphertext block C = (c0, c1, . . . , cNw−1). 1. for i ← 0 to Nw − 1 do Tweak: Tweak: Tweak: Tweak: Tweak: 2. v0,i ← pi; Type: Cfg Type: Msg Type: Msg Type: Msg Type: Out 3. end for Len: 64 Len: 128 Len: 170 First: 1 First: 0 First: 0 4. for d ← 0 to Nr − 1 do Last: 0 Last: 0 Last: 1 5. for i ← 0 to Nw − 1 do 6. if d mod 4 = 0 then Figure 3. Skein in normal hashing mode. 7. ed,i ← vd,i  kd/4,i; (Key injection) 8. else then encrypted (or decrypted) by XORing it with the first b 9. e ← v ; (Rename) d,i d,i bytes of the stream. 10. end if ChaCha generates the stream by blocks of 64 bytes. In order 11. end for to process the ith block, ChaCha acts on a 4 × 4 matrix M of 12. for j ← 0 to Nw/2 − 1 do 32-bit integers defined as follows: 13. fd,2j ← ed,2j  ed,2j+1; (Mixd,j)     14. fd,2j+1 ← fd,2j ⊕ (ed,2j+1 ≪ Rd mod 8,j); m0 m1 m2 m3 c0 c1 c2 c3 15. end for  m4 m5 m6 m7  k0 k1 k2 k3    =   , 16. for i ← 0 to Nw − 1 do  m8 m9 m10 m11 k4 k5 k6 k7  17. vd+1,i ← fd,π(i); (Permute) m12 m13 m14 m15 t0 t1 IV0 IV1 18. end for where 19. end for • 61707865 3320646E 79622D32 20. for i ← 0 to Nw − 1 do c0 = , c1 = , c2 = , and c3 = 6B206574 are predefined constants; 21. ci ← vNr ,i  kNr /4,i; (Key injection) 22. end for • t = (t0,t1) is a 64-bit counter encoding the index i (i.e. 32 i = 2 t1 + t0). 23. return C = (c0, c1, . . . , cNw−1); ChaCha transforms the matrix M through a series of Nr rounds (Algorithm 3). The algorithm is based on a nonlinear operation called quarter-round function and described by Al- and a version number [2, Table 7]. Compute the Nb-byte gorithm 4. Matrix M is copied into matrix V . Then, the even- block G0: and odd-numbered rounds of ChaCha apply the quarter-round G ← UBI(0,C,T 2120). 0 cfg function to each row and northwest-to-southeast diagonal of V , Note that G0 only depends on the digest size and can respectively. Eventually, a new block of the stream is generated easily be precomputed. by adding V to the original matrix M (Algorithm 3, line 15), • The message is then processed as follows: and the block counter is incremented (Algorithm 3, lines 17 to 20). 120 G1 ← UBI(G0,M,Tmsg2 ). Bernstein proposed 8-, 12-, and 20-round variants of ChaCha. Aumasson et al. introduced a novel method for • A third call to UBI is required to achieve hashing- appropriate randomness: differential of ChaCha and broke the 7-round variant [9]. Ishiguro et al. [10], [11] improved the attack and 120 H ← UBI(G1, 0,Tout2 ). concluded that and Chacha “are not presently under threat”. This transform allows one to produce arbitrary digest sizes (up to 264 bits). If a single output block H is V. THE BLAKEFAMILY OF HASH FUNCTIONS not enough, one can use Threefish in counter mode to The BLAKE family combines three previously studied produce the digest. components, chosen by Aumasson et al. for their complemen- tarity [1]: the iteration mode HAIFA, the internal structure of IV. THE CHACHA STREAM CIPHER the LAKE, and a modified version of Bernstein’s The ChaCha family of stream ciphers was designed by stream cipher ChaCha as compression function. BLAKE is a Bernstein [3] to improve the diffusion per round of Salsa20 [8], family of four hash functions, namely BLAKE-224, BLAKE- while preserving the encryption rate. ChaCha operates on 32- 256, BLAKE-384, and BLAKE-512 (Table III). The main bit words, and expands a 256-bit key (k0, . . . , k7) and a 64-bit differences lie in the length of words w, the number of rounds 70 nonce (IV0, IV1) into a 2 -byte stream. A b-byte message is Nr, and in some constants involved in the algorithm. In the Table III PROPERTIES OF THE BLAKE HASHFUNCTIONS.

Word size w Message Block Digest # rounds Rotation distances Algorithm [bits] [bits] [bits] [bits] [bits] Nr δ0 δ1 δ2 δ3 BLAKE-224 32 < 264 512 224 128 14 16 12 8 7 BLAKE-256 32 < 264 512 256 128 14 16 12 8 7 BLAKE-384 64 < 2128 1024 384 256 16 32 25 16 11 BLAKE-512 64 < 2128 1024 512 256 16 32 25 16 11

Algorithm 3 Computation of a 64-byte block of the stream the bitwise exclusive OR of two w-bit words. The latter is of ChaCha. sometimes followed by a rotation of δj bits to the right. The Input: A key, a nonce, and a block counter stored in a matrix four possible rotation distances depend on the digest size and M. are defined in Table III. The compression function of BLAKE- Output: A 64-byte block of the stream. 0 0 0 n produces a new chain value h = h0, . . . , h7 from a message 1. for i ← 0 to 15 do block m = m0, . . . , m15, a chain value h = h0, . . . , h7, a 2. v[i] ← m[i]; salt s = s0, . . . , s3, a counter t = t0, t1, and 16 constants ci 3. end for defined in [1, p. 8]. This process consists of three steps. First, a 4. for i ← 0 to N /2 − 1 do r 16-word internal state v = v0,..., v15 is initialized as follows: 5. QUARTERROUND(v0, v4, v8, v12);   v0 v1 v2 v3 6. QUARTERROUND(v1, v5, v9, v13);  v4 v5 v6 v7  7. QUARTERROUND(v2, v6, v10, v14);    v8 v9 v10 v11 8. QUARTERROUND(v3, v7, v11, v15); v12 v13 v14 v15 9. QUARTERROUND(v0, v5, v10, v15);  h h h h  10. QUARTERROUND(v1, v6, v11, v12); 0 1 2 3 h h h h 11. QUARTERROUND(v2, v7, v8, v13); ←  4 5 6 7  . s ⊕ c s ⊕ c s ⊕ c s ⊕ c  12. QUARTERROUND(v3, v4, v9, v14);  0 0 1 1 2 2 3 3 13. end for t0 ⊕ c4 t0 ⊕ c5 t1 ⊕ c6 t1 ⊕ c7 14. for i ← 0 to 15 do Then, a series of Nr rounds is performed. Each of them 15. v[i] ← v[i]  m[i]; consists of a transformation of the internal state v based on 16. end for the Gi function described by Algorithm 5, where σr denotes a 17. m12 ← m12  1; permutation of {0,..., 15} parametrized by the round index r 18. if m12 = 0 then (see Table IV). A column step updates the four columns of 19. m13 ← m13  1; matrix v as follows: G0(v0, v4, v8, v12), G1(v1, v5, v9, v13), 20. end if G2(v2, v6, v10, v14), and G3(v3, v7, v11, v15). Note that each 21. Return M and V ; call to Gi updates a distinct column of matrix v. Since we focus on compact implementations of BLAKE in this work, we interleave the computation of G , G , G , and Algorithm 4 The ChaCha quarter-round function. 0 1 2 G3. This approach allows us to design an ALU with Input: Four 32-bit integers a, b, c, and d. four pipeline stages and to achieve high clock frequencies. Output: QUARTERROUND(a, b, c, d). Then, a diagonal step updates the four diagonals of v: 1. a ← a b;  G4(v0, v5, v10, v15), G5(v1, v6, v11, v12), G6(v2, v7, v8, v13), 2. d ← (d ⊕ a) 16; ≪ and G7(v3, v4, v9, v14). Here again, each call to Gi modifies 3. c ← c  d; a distinct diagonal of the matrix, allowing us to interleave the 4. b ← (b ⊕ c) 12; ≪ computation of G4, G5, G6, and G7. 5. a ← a  b; At the end of the last round, a new chain value h0 = 6. d ← (d ⊕ a) 8; 0 0 ≪ h0, . . . , h7 is computed from the internal state v and the 7. c ← c  d; previous chain value h (finalization step): 8. b ← (b ⊕ c) 7; ≪ h0 ← h ⊕ s ⊕ v ⊕ v , h0 ← h ⊕ s ⊕ v ⊕ v , 9. Return a, b, c, and d; 0 0 0 0 8 4 4 0 4 12 0 0 h1 ← h1 ⊕ s1 ⊕ v1 ⊕ v9, h5 ← h5 ⊕ s1 ⊕ v5 ⊕ v13, 0 0 h2 ← h2 ⊕ s2 ⊕ v2 ⊕ v10, h6 ← h6 ⊕ s2 ⊕ v6 ⊕ v14, 0 0 following, we denote by BLAKE-n the algorithm with an n-bit h3 ← h3 ⊕ s3 ⊕ v3 ⊕ v11, h7 ← h7 ⊕ s3 ⊕ v7 ⊕ v15. digest. In order to guarantee that the length ` of a message is BLAKE-n involves only two arithmetic operations: the a multiple of the block size b, Aumasson et al. define the addition modulo 2w of two w-bit unsigned integers and following padding scheme [1]: Table IV PERMUTATIONS OF {0,..., 15} USEDBYTHE BLAKE FUNCTIONS (REPRINTEDFROM [1]).

i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

σ0(i) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 σ1(i) 14 10 4 8 9 15 13 6 1 12 0 2 11 7 5 3 σ2(i) 11 8 12 0 5 2 15 13 10 14 3 6 7 1 9 4 σ3(i) 7 9 3 1 13 12 11 14 2 6 5 10 4 0 15 8 σ4(i) 9 0 5 7 2 4 10 15 14 1 11 12 6 8 3 13 σ5(i) 2 12 6 10 0 11 8 3 4 13 7 5 15 14 1 9 σ6(i) 12 5 1 15 14 13 4 10 0 7 6 3 9 2 8 11 σ7(i) 13 11 7 14 12 1 3 9 5 0 15 4 8 6 2 10 σ8(i) 6 15 14 9 11 3 0 8 12 2 13 7 1 4 10 5 σ9(i) 10 2 8 4 7 6 1 5 15 11 9 14 3 12 13 0

Algorithm 5 The Gi function. VI.HARDWARE IMPLEMENTATION Input: A function index i and four w-bit integers a, b, c, and All of our architectures consist of a register file organized d. into w-bit words and implemented by means of dual-ported Output: Gi(a, b, c, d). memory, an ALU, and a control unit (Figure 4). The user 1. a ← a  b; loads messages, plaintext blocks or ciphertext blocks into 2. a ← a (m ⊕ c );  σr (2i) σr (2i+1) port A. A few control bits allows her to select the algorithm 3. d ← (d ⊕ a) ≫ δ0; and the desired level of security. When the coprocessors are 4. c ← c  d; hashing or encrypting a message, the intermediates results are 5. b ← (b ⊕ c) ≫ δ1; always written to port B. In the following, we assume that our 6. a ← a  b; coprocessors are provided with padded messages. A hardware 7. a ← a (m ⊕ c );  σr (2i+1) σr (2i) wrapper interface for BLAKE, Skein, and several other hash 8. d ← (d ⊕ a) ≫ δ2; functions comprising communication and padding is described 9. c ← c  d; in [12]. 10. b ← (b ⊕ c) ≫ δ3; User Interface

• append a bit 1 followed by a sufficient number of 0 bits Control Unit such that the length is congruent to b − 2w − 1 modulo Register file (dual-ported memory) b; ctrlk−1:0 WeA • a padding bit followed by the 2w-bit unsigned big-endian representation of ` is then added; in the case of BLAKE- AddrA A

256 and BLAKE-512, the padding bit is equal to 1; Port otherwise, it is set to 0. Data Pipelined The hash can now be computed iteratively (Algorithm 6): Arithmetic and the padded message is divided into N 16-word blocks Logic Unit We m(0), . . . , m(N−1) and the chain value h(0) is set to the same B n t(i) AddrB initial value as SHA- . The counter denotes the number of B message bits in m(0), . . . , m(i) (i.e. excluding padding bits). Port Note that, if the last block contains only padding bits, then t(N−1) is set to zero. The message digest consists of the n least significant bits of the output h(N). Figure 4. General architecture of our coprocessors. Algorithm 6 Iterated hash. Input: A padded message split into N 16-word blocks and a We follow here the design strategy outlined in [6], [7], [13]– salt s. [15]. The first step consists in defining the minimal instruction Output: A n-bit digest. set to implement a block cipher and a hash function. Then, an  (0) (0) in-depth study of the scheduling allows us to build the ALU 1. h , . . . , h ← (IV ,..., IV ); 0 7 0 7 and organize the data in the register file. During this step, we 2. for i ← 0 to N − 1 do (i+1) (i) (i) (i) • try to minimize the number of control bits to keep the 3. h ← compress h , m , s, t ; instruction memory as compact as possible; 4. end for • take advantage of FPGA specifics to optimize the slice 5. return h(N); count; • identify the available parallelism and pipeline the ALU It is well-known that a  b = (a ∨ b)  (a ∧ b) and a ⊕ b = accordingly. (a ∨ b) (a ∧ b) [16]. Thus, Equation (1) can be rewritten as Eventually, we design the control unit. The instruction memory follows: is automatically generated by a C program. In order to keep the R3 ← (a ∨ b)  ((a ∧ b) ⊕ ctrl10)  ctrl10. (2) instruction ROM as compact as possible, our C program is able to compress the code, and to generate the VHDL description Figure 7 describes the implementation of Equation (2) on a of the decompression unit. Virtex-6 device. Since there is a single control signal to choose the arithmetic operation and to select a and b, Equation (2) in- A. Arithmetic and Logic Units for Threefish and Skein volves only five variables, and is advantageously implemented by 64 LUT6 2 primitives and dedicated carry logic. Our first ALU implements Threefish encryption and Skein.

In the following, Ri denotes a 64-bit register. Figure 5 illus- ctrl1:0 ctrl3:2 ctrl5:4 48 3 trates our scheduling of the two mixing functions Mix4,0 and 12 or or or 2

256 32 Mix4,1 of the fifth round of Threefish- : 8 , , , 1

R4 4 R5 R6 , 16 ,

• The operand e is loaded in register R1; at the same 0

4,1 , 0 time, we start the computation of e R ; this 0 4,1 ≪ 4,0 ≪ ≪ operation requires three clock cycles and intermediate ctrl7 ≪

results are stored in R4,R5, and R6. B

• Then, e is loaded in register R2; the content of R1 is port 4,0 1 1 not modified (i.e. R1 must be controlled by an enable 0 0 0

From R2 signal). 1 B • We execute the instruction R3 ← R1 R2 and obtain A  or R3 port o f4,0. port 0 T R1 0 • R3 and R6 contain f4,0 and e4,1 ≪ R4,0, respectively. 1 The instruction R3 ← R3⊕R6 allows us to compute f4,1. From 1

Time [clock cycles] ctrl6 ctrl8 ctrl9 ctrl10

Port A e e e e 4,1 4,0 4,3 4,2 Figure 6. Arithmetic and logic unit for Threefish encryption. R1 e4,1 e4,1 e4,3 e4,3 R2 e e 4,0 4,2 LUT6 2 ≪1 ≪1 R4 e4,1 e4,3 1 ≪9 ≪1 R5 e4,1 e4,3 5-input LUT

6 carry 1 bj a ∨ b j+1 R6 e≪25 e≪33 R j j 4,1 4,3 0 R3 f f f f 4,0 4,1 4,2 4,3 From 10

2 6 2 6 2 0

R R R R 1 3 R

⊕ ⊕ a R 1  3 1  3 1 j R R R R 0 O6 o

(aj ∧ bj) ⊕ ctrl10 T ← ← ← ← 3 3 3 3 From

R R R R 1 R 5-input LUT 256 Figure 5. Computation of Mix4,0 and Mix4,1 (Threefish- ). 1 bj aj ∨ bj From 0 3

We schedule Mix4,1 as soon as e4,0 has been read, and R O5 carryj manage to keep the pipeline continuously busy. In summary, 0 From (carry0 = ctrl10) our ALU must be able to carry out any rotation of a 64-bit 1 aj word and to perform the following operation (Figure 6): 10 ctrl ( R1  R2 when ctrl10 = 0, R3 ← (1) Figure 7. Computation of R3 ← R1  R2 or R3 ← R3 ⊕ R6 on a Virtex-6 R3 ⊕ R6 otherwise, device. where ctrl10 denotes a control bit. Let us define two 64-bit In order to reduce the number of operands stored in the operands a and b such that: register file, we interleave the key schedule (Algorithm 1) and ( the encryption process (Algorithm 2). This approach allows us (R1, R2) when ctrl = 0, (a, b) = 10 to generate the subkeys on-the-fly. It is however necessary to

(R3, R6) otherwise. compute t2 and kNw before the first key injection. The easiest way to compute t2 would be to load t0 and t1 in registers Thus, the slice count and the critical path are expected R1 and R2, respectively, and to execute the instruction R3 ← to increase. R1⊕R2. Unfortunately, this solution requires one more control • The output of the inverse Mix function is provided either bit to select the inputs of the arithmetic operator, and it is not by the arithmetic operator (e.g. e4,0 on Figure 2) or the possible to implement the multiplexers and the adder on the rotation unit (e.g. e4,1 on Figure 2). The multiplexer same LUT6 2 primitive anymore. Since the critical path of controlled by ctrl17 allows us to select the word we store our coprocessor is located in the 64-bit adder, an extra level in the register file. of LUTs would decrease the clock frequency. However, we • Since the inverse of the Mix function is sequential, we are able to compute t2 using only the functionalities defined have to perform the rotation in a single clock cycle. by Equation (1). Since t2 = (t0  0) ⊕ (t1 ≪ 0), it suffices We suggest to take advantage of the SRL16E primitive to execute the following instructions: available on Xilinx devices to implement a FIFO whose depth is dynamically adjusted according to the algorithm R4 ← t1 ≪ 0, selected by the user: one and three stages for decryption R1 ← t0,R2 ← 0,R5 ← R4 ≪ 0, and encryption, respectively. R3 ← R1  R2,R6 ← R5 ≪ 0, 3- or 1-stage FIFO (encryption or decryption) R3 ← R3 ⊕ R6. i R4 R5 R6

This approach assumes that we can read simultaneously two ≪ values from the register file. Thanks to the multiplexer con- trolled by ctrl7, we can load data from port A or port B into ctrl8 ctrl5:0 ctrl9 ctrl11 ctrl6 B register R2 (Figure 6). A similar strategy allows us to compute port kNw . 1 1 0 The implementation of the key injection is more straight- 0 0 B

From R2 1 1 or forward. Note that the multiplexers controlled by ctrl6 and port 0 A ctrl 14 R3 o ctrl8 allow us to bypass the register file and to use the content T 1 or of R3 as an input to the ALU. Let us consider for instance port 0 R1 0 0 the first key injection of Threefish-256: e0,2 is defined as 1 1 From p2  k0,2 = p2  k2  t1 and is computed as follows: ctrl ctrl ctrl ctrl ctrl ctrl R1 ← k2,R2 ← t1, 7 10 12 13 16:15 17 R3 ← R1 R2  Figure 10. Arithmetic and logic unit for Threefish encryption and decryption. R1 ← R3,R2 ← p2, R3 ← R1 R2.  B. Arithmetic and Logic Units for BLAKE and ChaCha Figures 8 and 9 describe how we schedule the instructions of Let us consider the Gi function of BLAKE-n to define the Threefish-256. instruction set of our coprocessors. Since we focus on compact The UBI chaining mode can be combined with the final key coprocessors for the BLAKE family in this article, we perform injection of Threefish encryption. It suffices to modify line 21 a single step of Algorithm 5 at each clock cycle. We will of Algorithm 2 as follows: show later that the input operand b is already stored in an internal register of our ALU when we start the computation eNr ,i ← vNr ,i  kNr /4,i; of Gi(a, b, c, d). Therefore, each operation involves the result c ← e ⊕ p . i Nr ,i i of the previous one, and our ALU will include a feedback The only difference between this operation and the mixing mechanism to bypass the register file of the coprocessor. function MIXd,j is that no permutation is applied to the second Assume that the w-bit word computed by the ALU is stored operand of the bitwise exclusive OR. in register R5, and denote by RFA and RFB the operands The inverse of the MIX function being purely sequential, provided by the register file. From the data flow diagram of Threefish decryption has less parallelism than encryption. We Algorithm 5, we easily identify three operations (Figure 11): suggest to modify our ALU as follows to fully support both 1) Save the content of R5 in the register file and compute encryption and decryption (Figure 10): R5 ← R5  RFA. • The inverse of the Mix function and the inverse of the key 2) Compute R5 ← R5  (RFA ⊕ RFB). injection require a subtraction modulo 264. Our modified 3) Save the content of R5 in the register file and compute ALU is therefore able to perform a new operation: R3 ← R5 ← (R5 ⊕ RFA) ≫ δj. R1 R2. Because of the additional control bit required to Recall now that the four calls to Gi in a column step or a select the operation, it is not possible to implement our diagonal step can be computed in parallel. In order to keep the arithmetic operator by means of 64 LUT6 2 anymore. critical path as short as possible, we suggest to design an ALU Computation of t2 and k4, and first key injection

Address A @t1 @t0 @k0 @0 @k1 @k2 @k3 @k0 @k3 @k2 @p2 @k1 @p1 @e0,0 @e0,3 @e0,2

Address B @0 @C240 @t2 @p0 @p3 @t1 @k4 @e0,0 @e0,3 @t0 @e0,2 @e1,0 File

Input B t2 k4 e0,0 e0,3 e0,2 e1,0 gister Output A t1 t0 k0 0 k1 k2 k3 k0 k3 k2 p2 k1 p1 e0,0 e0,3 e0,2 Re Output B 0 C240 p0 p3 t1 t0

Port A t1 t0 k0 0 k1 k2 k3 k0 k3 k2 p2 k1 p1 e0,0 e0,3 e0,2

Port B 0 C240 p0 p3 t1 t0

R1 t0 k0 0 k1 k2 k3 k0 k3 k2 k0,2 k1 k0,1 e0,1 e0,1 e0,3 e0,3

R2 0 C240 p0 p3 t1 p2 t0 p1 e0,0 e0,2 ≪14 R6 t1 k0 0 k1 k2 k3 e0,1 ALU

R3 t0 t2 C240 k4 e0,0 e0,3 k0,2 e0,2 k0,1 e0,1 f0,0 f0,1

Permute v1,0 v1,3

Rename e0,1 e1,0 e1,3

R3 ← R1  R2 R3 ← R3 ⊕ R6 First round Second key injection

Address A @e1,0 @e2,2 @e3,0 @k1 @k4 @k3 @v4,2 @k2 @v4,1 @e4,0 @e4,3 @e4,2

Address B @e2,2 @e3,0 @v4,2 @v4,1 @t2 @1 @e4,0 @t1 @e4,3 @e4,2 @e5,0 File

Input B e2,2 e3,0 v4,2 v4,1 e4,0 e4,3 e4,2 e5,0 gister Output A e1,0 e2,2 e3,0 k1 k4 k3 v4,2 k2 v4,1 e4,0 e4,3 e4,2 Re Output B t2 1 t1

Port A e1,0 e2,2 e3,0 k1 k4 k3 v4,2 k2 v4,1 e4,0 e4,3 e4,2

Port B t2 1 t1

R1 e1,3 e1,3 e1,1 e1,1 e2,1 e2,1 e2,3 e2,3 e3,3 e3,3 e3,1 e3,1 k1 k4 k3 e4,3 − 1 v4,2 k2 v4,1 e4,1 e4,1 e4,3 e4,3

R2 e1,2 e1,0 e2,0 e2,2 e3,2 e3,0 v4,0 v4,3 t2 1 k1,2 t1 k1,1 e4,0 e4,2 ≪16 ≪57 ≪52 ≪23 ≪40 ≪37 ≪5 ≪25 R6 e0,3 e1,3 e1,1 e2,1 e2,3 e3,3 e3,1 e4,1 ALU

R3 f0,2 f0,3 f1,2 f1,3 f1,0 f1,1 f2,0 f2,1 f2,2 f2,3 f3,2 f3,3 f3,0 f3,1 e4,0 e4,3 − 1 k1,2 e4,3 e4,2 k1,1 e4,1 f4,0 f4,1

Permute v1,2 v1,1 v2,2 v2,1 v2,0 v2,3 v3,0 v3,3 v3,2 v3,1 v4,2 v4,1 v4,0 v4,3 v5,0 v5,3

Rename e1,2 e1,1 e2,2 e2,1 e2,0 e2,3 e3,0 e3,3 e3,2 e3,1 e5,0 e5,3

Second round Third round Fourth round Fifth round

Figure 8. Scheduling of Threefish-256 encryption. @d denotes the address of the 64-bit word d in the register file. “Rename” and “Permute” refer to lines 9 and 17 of Algorithm 2, respectively.

with four pipeline stages and to interleave the computation s0 ⊕ c0 involves for instance only two words stored in of four Gi functions (Figure 12). The heart of the ALU is the register file. An array of AND gates controlled by the arithmetic operator performing the addition or the bitwise ctrl1 allows us to force the second operand to zero in XOR of two w-bit words described in Section VI-A. Our such cases. operator computes: If needed, the content of register R3 is then rotated to the left ( in two steps. Our implementation is based on the following R1  R2 when ctrl2 = 0, R3 ← observation: R1 ⊕ R2 otherwise R3 δi = (R3 (δi − δ3)) δ3, = (R1 ∨ R2)  ((R1 ∧ R2) ⊕ ctrl2)  ctrl2, ≫ ≫ ≫ where where 0 ≤ i ≤ 3. At first glance, this design choice may look • R1 stores the data provided by the register file. Since awkward. However, it will allow us to easily build a unified a flip-flop is always associated with a LUT, we can processor for the BLAKE family. The key point is that the perform some simple pre-processing without increasing content of R3 is copied into R5 when the three control bits the number of slices of the ALU: a control bit ctrl0 ctrl5:3 are equal to 0 (Figure 12). selects either RFA or RFA ⊕ RFB. This allows us to Note that the pipeline has three possible configurations, denoted by Œ, , and Ž in Figure 12: compute mσr (2i) ⊕ cσr (2i+1) and mσr (2i+1) ⊕ cσr (2i) for free (Algorithm 5, lines 2 and 7). Œ In order to minimize the area of our ALU, we can insert • R2 almost always stores the result of a previous operation. a w-bit register after the first stage of the rotation. Since However, we have to disable the feedback mechanism the latter involves w LUTs, there is no hardware overhead during the initialization step: the computation of v8 ← on a Virtex-6 device. Computation of t2 and k4, and first key injection

Address A @t0 @k0 @k1 @k2 @k3 @k0 @k1 @c2 @c3 @k4 @c0 @c1 @e71,2

Address B @t1 @C240 @t2 @t1 @18 @k4 @t0 @k3 @e71,2 @e71,1 @e71,0 @e71,2 @e70,3 @e70,0 @e70,2 @e69,3 @e69,0 @e69,2 @e68,3 File

Input B t2 e71,2 e71,1 e71,0 e70,3 e70,0 e70,2 e69,3 e69,0 e69,2 e68,3 gister Output A t0 k0 k1 k2 k3 k0 k1 c2 c3 k4 c0 c1 e71,2 Re Output B t1 C240 t1 18 t0 k3 e71,1 e71,0 e71,2 e70,3 e70,0 e70,2 e69,3 e69,0 e69,2

Port A t0 k0 k1 k2 k3 k0 k1 c2 c3 k4 c0 c1 e71,2

Port B t1 C240 t1 18 t0 k3 e71,1 e71,0 e71,2 e70,3 e70,0 e70,2 e69,3 e69,0 e69,2

R1 t0 k0 k1 k2 k3 k0 k1 c2 c3 k4 c0 c1 e71,0 e71,2 e71,0 e71,2 e70,0 e70,2 e70,0 e70,2 e69,0 e69,2 e69,0

R2 t1 C240 t1 18 k0,2 k0,3 t0 k3 k0,1 e71,1 e71,3 e70,3 e69,3

R6 f71,1 f71,3 e70,1 f70,1 f70,3 e69,1 f69,1 ALU

R3 t2 k4 k0,2 k0,3 f72,2 f72,3 k0,1 f72,0 f72,1 f∗71,1 f∗71,3 f71,0 f71,2 f∗70,1 f∗70,3 f70,0 f70,2 f∗69,1 f∗69,3 Inverse v71,2 v71,1 v71,0 v71,3 v70,3 v70,0 v70,2 v69,3 v69,0 v69,2 v68,3 Permute

Rename e71,2 e71,1 e71,0 e71,3 e70,3 e70,0 e70,2 e69,3 e69,0 e69,2 e68,3

R3 ← R1 ⊕ R2 R3 ← R1 ⊕ R3 First round Second round Third round Second key injection

Address A @k2 @k4 @k0 @k3 @v68,3 @v68,1 @e67,0 @e67,2

Address B @e68,0 @e68,2 @t0 @v68,1 @v68,3 @17 @t2 @e67,0 @e67,2 @e67,1 @e67,3 @e67,0 @e67,2 @e66,3 @e66,0 @e66,2 @e65,3 @e65,0 @e65,2 @e64,3 @e64,0 File

Input B e68,0 e68,2 v68,1 v68,3 e67,0 e67,2 e67,1 e67,3 e66,3 e66,0 e66,2 e65,3 e65,0 e65,2 e64,3 e64,0 gister Output A k2 k4 k0 k3 v68,3 v68,1 e67,0 e67,2 Re Output B e68,3 e68,0 e68,2 t0 17 t2 e67,1 e67,3 e67,0 e67,2 e66,3 e66,0 e66,2 e65,3 e65,0 e65,2

Port A k2 k4 k0 k3 v68,3 v68,1 e67,0 e67,2

Port B e68,3 e68,0 e68,2 17 t2 e67,1 e67,3 e67,0 e67,2 e66,3 e66,0 e66,2 e65,3 e65,0 e65,2

R1 e69,2 e68,0 e68,2 e68,0 v68,2 v68,0 v68,0 temp2 k0 k3 v68,3 v68,1 e67,0 e67,2 e67,0 e67,2 e66,0 e66,2 e66,0 e66,2 e65,0 e65,2 e65,0 e65,2

R2 e68,3 t0 k2 k4 17 t2 k17,3 k17,1 e67,1 e67,3 e66,3 e65,3

R6 f69,3 e68,1 f68,1 f68,3 f67,1 f67,3 e66,1 f66,1 f66,3 e65,1 f65,1 f65,3 ALU R3 f69,0 f69,2 f∗68,1 f∗68,3 f68,0 f68,2 temp2 v67,0 v67,2 k17,3 k17,1 v67,3 v67,1 f∗67,1 f∗67,3 f67,0 f67,2 f∗66,1 f∗66,3 f66,0 f66,2 f∗65,1 f∗65,3 f65,0 Inverse v68,0 v68,2 e67,0 e67,2 e67,1 e67,3 v66,0 v66,0 v66,2 v65,3 v65,0 v65,2 v64,3 v64,0 Permute

Rename e68,0 e68,2 v68,1 v68,3 v68,2 e66,0 e66,0 e66,2 e65,3 e65,0 e65,2 e64,3 e64,0

Fourth round Seventh roundSixthroundFifthround

Figure 9. Scheduling of Threefish-256 decryption.

w  The addition modulo 2 can be computed in two clock v4 (last instruction of G0) at time τ +3. If we load v3 from the w w cycles. Let a = alow +2 2 ahigh and b = blow +2 2 bhigh. We register file, we can start the computation of G7 at time τ + 4 w store ahigh and bhigh in two -bit registers, and compute (Figure 13). We easily check that this scheduling also avoids 2 w a sum word slow and a carry bit such that 2 2 c + slow = pipeline bubbles between a diagonal step and a column step w alow + blow. A flip-flop and a 2 -bit register store c and (Figure 14). Since each call to Gi involves ten instructions, slow, respectively. The most significant bits of the sum we need 80 clock cycles to perform a round of BLAKE-n. are then given by shigh = ahigh + bhigh + c. This approach Our first architecture can be modified to support the four allows us to reduce the worst-case carry path at the price algorithms of the BLAKE family (Figure 15). The 64-bit w of three 2 -bit registers and a flip-flop. datapath is built out of two 32-bit datapaths, thus allowing Ž Routing a signal from a memory block to a slice is us to perform a single 64-bit operation or two 32-bit opera- sometimes expensive in terms of wire delay. If the critical tions at each clock cycle. The mode of operation is selected path is located between the register file and register R1, according to an additional control bit ctrl6, the latter being this pipeline configuration will help boosting the clock provided by the user. The ALU includes two 32-bit adders. frequency. The output data path of a Virtex-6 memory Let alow, ahigh, blow, bhigh, slow, and shigh denote unsigned 32-bit block has an optional internal pipeline register. Therefore, integers. When the user chooses BLAKE-224 or BLAKE-256 the only hardware overhead is the w-bit register between (ctrl6 = 0), two messages are processed in parallel and the R5 and the array of AND gates controlled by ctrl1. ALU performs two 32-bit additions:

In order to avoid pipeline bubbles between column and slow ← alow + blow + ctrl2, diagonal steps, it suffices to process the four calls to Gi of shigh ← ahigh + bhigh + ctrl2. the diagonal step in the following order: G7, G4, G5, and G6. We check for instance that the ALU outputs the new value of When the coprocessor executes BLAKE-384 or BLAKE-512 From port A To R5

From port B

From R5 2

mσr(2i) cσr(2i+1) mσr(2i+1) cσr(2i) 2 2 a aa a 1 3 1 3 b bb≫ δ1 b b ≫ δ3 b 1 c c c c 3 1 d ≫ δ0 d d ≫ δ2 d 3

From port A To R5 1 5 To port BFromR

From R5 To port B From port A ≫ δi To R5 3

Figure 11. Implementation of the Gi function of BLAKE-n by means of three instructions. R5 denotes an internal register of the ALU. )

ctrl1 3 ctrl3 δ − 0 δ ( ≫ Ž

LUT6 2 ) LUT6 3 R2 δ 1 −

1 0 3 δ (  Œ δ ≫ Ž 1 ≫ 1 A R3 R4 R5 0 0 B

0 ) 3 port

R1 δ 1 − port 2 o

δ 1 From ( T 0 ≫

B Ž port

From ctrl0 ctrl2 ctrl4 ctrl5

Figure 12. Arithmetic and Logic Unit for BLAKE-n.

(ctrl6 = 1), the ALU carries out a 64-bit addition. The first BLAKE-384 and BLAKE-512. When ctrl5 is equal to one, adder generates the least significant bits of the sum and a carry ctrl4:3 encodes the index i of the rotation distance δi (Table V). bit c such that: Consequently, we can use the same instruction flow for all

32 algorithms and select the width of the datapath according to 2 · c + slow = alow + blow + ctrl2. ctrl6. Note that the three pipeline configurations defined for The second adder computes the most significant bits of the our first coprocessor are also available here. sum: The QUARTERROUND function of ChaCha requires only shigh ← ahigh + bhigh + c. two of the instructions we defined for the Gi function. Thus, We use the rotation unit of our first processor to deal with the design of a ChaCha coprocessor is rather straightforward BLAKE-224 and BLAKE-256. Note that the content of R3 is (Figure 16). Since it is not necessary to compute RFA ⊕ RFB always copied into R6 when ctrl5:3 = (000)2. Thus, we share anymore, the ALU has a single 32-bit input. The only difficulty 32 this datapath between all algorithms of the BLAKE family, is to increment the 64-bit counter 2 m13 +m12 (Algorithm 3, and need only 64 LUTs to implement the rotation unit of lines 17 to 20). Assume that the constant 1 is stored in the Time [clock cycles] τ τ + 1 τ + 2 τ + 3 τ + 4 τ + 5 τ + 6

10th step of G0: 10th step of G2: 1st step of G7: 1st step of G4: 1st step of G5:

v4 ← (v4 ⊕ v8) ≫ 7 v6 ← (v6 ⊕ v10) ≫ 7 v3 ← v3  v4 v0 ← v0  v5 v1 ← v1  v6

R1 ← v4 R1 ← v5 R1 ← v6 R1 ← v7 R1 ← v3 R1 ← v0 R1 ← v1

Œ R2 ← R5 = v8 R2 ← R5 = v9 R2 ← R5 = v10 R2 ← R5 = v11 R2 ← R5 = v4 R2 ← R5 = v5 R2 ← R5 = v6

stages  R3 ← R1  R2 R3 ← R1 ⊕ R2 R3 ← R1 ⊕ R2 R3 ← R1 ⊕ R2 R3 ← R1 ⊕ R2 R3 ← R1  R2 R3 ← R1  R2 Ž R4 ← R3 ≫ 0 R4 ← R3 ≫ 0 R4 ← R3 ≫ 0 R4 ← R3 ≫ 0 R4 ← R3 ≫ 0 R4 ← R3 ≫ 0 R4 ← R3 ≫ 0

Pipeline  R5 ← v9 = R4 R5 ← v10 = R4 R5 ← v11 = R4 R5 ← v4 = R4 ≫ 7 R5 ← v5 = R4 ≫ 7 R5 ← v6 = R4 ≫ 7 R5 ← v7 = R4 ≫ 7

9th step of G1: 9th step of G2: 9th step of G3: 10th step of G1: 10th step of G3:

v9 ← v9  v13 v10 ← v10  v14 v11 ← v11  v15 v5 ← (v5 ⊕ v9) ≫ 7 v7 ← (v7 ⊕ v11) ≫ 7

Figure 13. Avoiding pipeline bubbles between a column step and a diagonal step.

Time [clock cycles] τ τ + 1 τ + 2 τ + 3 τ + 4 τ + 5 τ + 6

10th step of G7: 10th step of G5: 1st step of G0: 1st step of G1: 1st step of G2:

v4 ← (v4 ⊕ v9) ≫ 7 v6 ← (v6 ⊕ v11) ≫ 7 v0 ← v0  v4 v1 ← v1  v5 v2 ← v2  v6

R1 ← v4 R1 ← v5 R1 ← v6 R1 ← v7 R1 ← v0 R1 ← v1 R1 ← v2

Œ R2 ← R5 = v9 R2 ← R5 = v10 R2 ← R5 = v11 R2 ← R5 = v8 R2 ← R5 = v4 R2 ← R5 = v5 R2 ← R5 = v6

stages  R3 ← R1  R2 R3 ← R1 ⊕ R2 R3 ← R1 ⊕ R2 R3 ← R1 ⊕ R2 R3 ← R1 ⊕ R2 R3 ← R1  R2 R3 ← R1  R2 Ž R4 ← R3 ≫ 0 R4 ← R3 ≫ 0 R4 ← R3 ≫ 0 R4 ← R3 ≫ 0 R4 ← R3 ≫ 0 R4 ← R3 ≫ 0 R4 ← R3 ≫ 0

Pipeline  R5 ← v10 = R4 R5 ← v11 = R4 R5 ← v8 = R4 R5 ← v4 = R4 ≫ 7 R5 ← v5 = R4 ≫ 7 R5 ← v6 = R4 ≫ 7 R5 ← v7 = R4 ≫ 7

9th step of G4: 9th step of G5: 9th step of G6: 10th step of G4: 10th step of G6:

v10 ← v10  v15 v11 ← v11  v12 v8 ← v8  v13 v5 ← (v5 ⊕ v10) ≫ 7 v7 ← (v7 ⊕ v8) ≫ 7

Figure 14. Avoiding pipeline bubbles between a diagonal step and a column step.

Table V ROTATION DISTANCES OF THE UNIFIED BLAKE COPROCESSOR. necessary:

ctrl5:3 Rot. dist. BLAKE-224/256 BLAKE-384/512 (F, R3) ← R1  R2 R1 ← m13 R2 ← 0, (000)2 0 R6 ← R3 (common datapath) (100)2 δ0 R6 ← R3 ≫ 7 R6 ← R3 ≫ 11 R3 ← R1  R2  FR4 ← R3, (101) δ R ← R 8 R ← R 16 2 1 6 3 ≫ 6 3 ≫ Register file ← R4 (m ) R4 ← R3, (110)2 δ2 R6 ← R3 ≫ 12 R6 ← R3 ≫ 25 12 (111)2 δ3 R6 ← R3 16 R6 ← R3 32 ≫ ≫ Register file ← R4 (m13). Three pipeline configurations are again available. The second one needs specific attention: since the adder is pipelined, register file. The control bit ctrl0 allows us to disable the the computation of m12  1 requires two clock cycles. It is feedback mechanism and to load the constant 0 in register therefore mandatory to introduce a NOP before loading m13 R2. Execute the following instructions: into register R1. It is of course possible to build a unified coprocessor for R1 ← 1 R2 ← 0, ChaCha and the BLAKE family (Figure 17). A new control R3 ← R1 ⊕ R2, bit ctrl7 allows the user to select the mode of operation of R4 ← R3, the ALU: encryption or hashing. Since the coprocessor has a 64-bit datapath to support BLAKE-384 and BLAKE-512, it is R5 ← R4, possible to encrypt two messages in parallel with ChaCha. R1 ← m12 R2 ← R5. C. Register Files and Control Units

Registers R1 and R2 store m12 and the constant 1, respectively. We will consider our unified coprocessor for the BLAKE Note that the output carry of the 32-bit adder can now be stored and ChaCha algorithms to describe how we design our control in a flip-flop F. Furthermore, when ctrl2 is set to one, our ALU units. The same approach can easily be applied to the other performs an “add with carry” instruction. We can now compute coprocessors considered in this work. Virtex-6 FPGAs embed m12 + 1, save the output carry in F , and increment m13 if several configurable memory blocks that can for instance nt-tt ahn losu ojm otefiaiainstep finalization the to jump BLAKE- to of us code allows machine the finite-state store to this suffices loop. follow 384 it we round that if by the Note coprocessor required our approach. unroll instructions by to of supported algorithms number is the the family summarizes BLAKE VI Table the in volved addresses memory that a of counter means program by implemented block. a memory of instruction an consists mainly unit store tagtowr a oda ihtepruain in- permutations the with deal to way straightforward A / 512 0436 1024 From port B From port A and Ž Ž Ž 20 btwrsor words -bit LUT rudCah ( ChaCha -round From port A 6 2 Ž Ž ctrl ctrl 1 ctrl 1 0 1 bit 0 0 R R 0818 2048 2 1 R R 2039 1 2 iue1.Uie rtmtcadLgcUi o h LK family. BLAKE the for Unit Logic and Arithmetic Unified 15. Figure ctrl 32 btwrs u control Our words. -bit LUT bits ntutos:asimple a instructions): 2 ctrl iue1.Aihei n oi ntfrChaCha. for Unit Logic and Arithmetic 16. Figure 6 2 1 ahigh bhigh  

64 alow blow Output bits 0 1 carry F 0 1 c 

slow shigh ctrl   6 ctrl R 2 R 3 3 ir-oefisit igeboko eoy hr sn edt r to try to need no the is as there long memory, As of unit. instructions. block of control single number complex the a more reduce into a fits require micro-code would approach of this addresses the generating by of permutation performed been has rounds of number desired the when N 1 ME FISRCIN FTEAGRTM FTHE OF ALGORITHMS THE OF INSTRUCTIONS OF UMBER ti osbet euetesz ftecd ysoigtetbedfiigthe defining table the storing by code the of size the reduce to possible is It ≪ 1 ≪ 5 ≪ 9 ≫ 11 ≫ 16 ≫ 25 ≫ 32 ≫ 1 ≫ 5 ≫ 9 ctrl 0 1 0 1 3 ctrl { 0

ctrl 0 1 0 1 0 1 0 1 , . . . 3 0 1 20 12 BLAKE- BLAKE- 8 ctrl 4 rudChaCha -round 15 rudChaCha -round ChaCha -round Œ loih instructions # Algorithm R 0 1 0 1 4 } 4 aaerzdb h on index round the by parametrized Œ Œ R R C 5 4 384 224 HA m C LUT / / σ

al VI Table 7 1 1344 1184 512 256

AFAMILIES HA ≪ r LUT (2 6 ctrl i 7

) ≫ 0 1 6 and 5 ctrl c R σ r 695 439 311 . 5 1 0 0 1 5 (2 i +1) ctrl 1 0 6

ntefl.However, on-the-fly. To port B R r 6 BLAKE TbeI)and IV) (Table

To port B 1 The . AND ctrl1 ctrl6 ctrl4 ctrl5 ctrl7 9 ≫ 1 7 0 5 Œ ≫ ≫ 1 1

R4 0 0 1 ≫ 1 0

 0

LUT6 2 ctrl3 0 1 R7 32 Ž 1 B ≫ 1 or

R2 port o 0 T 25 1 0 Œ 0 ≫ 1

 R3 R5 1

 0  1 0 16

Ž ≫ A 1 0 port 0 R1 or

1 11 From 1 0  ≫

B Ž 16 port ≪ 1 From 0 12 Œ ≪ 0 1

R6 1 0 8

ctrl0 ctrl2 ctrl8 ≪ ctrl6 1 0 1 bit 32 bits 64 bits 7 LUT6 ≪

Figure 17. Unified Arithmetic and Logic Unit for the BLAKE and ChaCha families.

main challenge is therefore to define control words of at most A. Note that a6 is equal to one only when we read an 18 bits in order to implement our instruction memory by means initial vector and assume that the digest size is selected of a single memory block. A clever organization of the register according to an additional control bit ctrl8. The address file (Figure 18) and a simple compression algorithm allows bit a5 is computed as follows: us to achieve this goal. Two blocks of dual-ported memory ( configured as 256 entries of 32 bits store the message, the a5 when a6 = 0, a5 ← chaining value, the constants, and all the intermediate variables ctrl8 otherwise. of BLAKE and ChaCha. Thus, our coprocessor requires 26 control bits (Figure 19a): Thanks to this simple mechanism, the instruction flow does not depend on the digest size. Initial vectors are • 8 address bits and a write enable signal for port A of the always read from port A. register file; • Since the initial vectors are neither modified nor read • 8 address bits and a write enable signal for port B of the from port B, the second most significant address bit is register file; always equal to zero. • 8 control bits for the ALU. Consequently, we can store 20-bit words in the instruction Two control bits are provided by the user: ctrl7 allows her memory (Figure 19b). We designed a simple compression to select between BLAKE and ChaCha, and ctrl6 specifies algorithm to encode the write enable signal of port B and the the configuration of the datapath (2 × 32 bits or 64 bits). Our six control bits ctrl5:0 by means of five bits. A C program organization of the data in the register file enables us to define generates the content of the instruction memory and the a 20-bit instruction: VHDL description of the decompression circuit. The latter • The most significant address bit depends on the algorithm involves only seven 5-input LUTs, and stores the control being executed, and is therefore provided by the user. bits of the ALU and the write enable signal of port B in • We use ports A and B to load new data (message, salt, and a register. Because of this pipeline stage, it is necessary to counter) and save the intermediate variables computed generate the write enable signal one clock cycle in advance by the ALU, respectively. Consequently, the write enable when we have to store a word in the register file. Our C signal of port A is also given by the user. program takes this parameter into account and organizes the • Let us denote by a7:0 the eight address bits of ports control bits in the instruction memory according to the pipeline 32 bits 32 bits 64 bits

0 128

m0,..., m15 m0,..., m15 m0,..., m15 15 143 16 144 h0,..., h7 h0,..., h7 h0,..., h7 23 151 24 152

v0,..., v15 v0,..., v15 v0,..., v15 39 167 40 168

c0,..., c15 c0,..., c15 c0,..., c15 55 183 56 184 s0,..., s3 s0,..., s3 s0,..., s3 59 187 60 188 t0 and t1 t0 and t1 t0 and t1 61 189 62 190 0 and 1 (32-bit constants) 0 and 1 (32-bit constants) Unused 63 191 64 192 IV0,..., IV7 (BLAKE-224) IV0,..., IV7 (BLAKE-224) IV0,..., IV7 (BLAKE-384) 71 199 72 200 Unused Unused 95 223 96 224 IV0,..., IV7 (BLAKE-256) IV0,..., IV7 (BLAKE-256) IV0,..., IV7 (BLAKE-256) 103 231 104 232 Unused Unused 127 256

Figure 18. Register file of the unified coprocessor for the BLAKE and ChaCha families.

Port A Port B Arithmetic and logic unit

ctrl6 AddrA (8 bits) WeA ctrl6 0 AddrB (8 bits) WeB ctrl7 ctrl6 ctrl5 ctrl4 ctrl3 ctrl2 ctrl1 ctrl0

(a) Address and control bits of our unified coprocessor for BLAKE and ChaCha

Port A Port B Arithmetic and logic unit

AddrA (7 bits) AddrB (6 bits) WeB ctrl5 ctrl4 ctrl3 ctrl2 ctrl1 ctrl0

(b) Address and control bits provided by the control unit

Figure 19. From 26- to 18-bit instructions. Shaded cells denote control bits provided by the user. configuration. Then, it generates the compressed instruction the two other variants of the algorithm on our architecture. memory. Figure 20 describes the instruction flow for the first The number of clock cycles required for Threefish encryption pipeline configuration of our coprocessor: and decryption according to the key size is summarized in Table VII. Because of the output transform, k + 1 invocations • As explained above, the write enable signal is generated of the tweakable encryption function are necessary to hash a one clock cycle in advance to take the internal pipeline k-block message with Skein. There is a latency of 5 clock stage of the decompression unit into account. cycles between two consecutive Threefish encryption. Thus, • All inputs of the register file are registered, and the two the throughput of Skein is given by: control bits ctrl0 and ctrl1 must therefore be generated one clock cycle after the addresses. We take advantage 8 · Nb · k · f throughput = , of the latency of our decompression unit to synchronize (k + 1) · latency of Threefish with UBI + 5 · k the control signals. where f denotes the clock frequency. We followed the same approach to build our control units 64 0 0 0 for Threefish and Skein. The register file is organized into 64- ei,8,..., ei,15 p0,..., p15 71 bit words, and stores a plaintext block, an internal state (ed,i, 72 15 t0, t1, t2, C240 where 0 ≤ i ≤ Nw − 1), an extended block cipher key, an 7516 C 76 extended tweak, the constant 240, and all possible values of ei,0,..., ei,15 Unused s involved in the key schedule (Figure 21). Thanks to this 31 approach, the word permutation π(i) and the word rotation of 32 95 96 the key schedule are conveniently implemented by addressing k0,..., k16 0, 1,..., 19, 20 the register file accordingly. Since the round constants repeat 48 49 116 every eight rounds (Algorithm 2, line 14), we decided to unroll Unused 117 55 eight iterations of the main loop of Threefish (Algorithm 2, 56 0 0 Unused lines 4 to 19). The rotation constants R are included in ei,0,..., ei,7 d,i 63 127 the microcode executed by the control unit. Note that our register file is designed for Threefish-1024 (i.e. Nw = 16 Figure 21. Register file of our Threefish and Skein architectures. and Nr = 80). It is therefore straightforward to implement Port A Port B Arithmetic and logic unit

WeB

AddrA (7 bits) AddrB (6 bits) WeB ctrl1 ctrl0

AddrA (7 bits) AddrB (6 bits) ctrl2 ctrl1 ctrl0

ctrl4 ctrl3 ctrl2

ctrl5 ctrl4 ctrl3

ctrl5

Compression algorithm Compressed opcode

Figure 20. Generation of the compressed instruction memory.

Table VII Throughput [Mbit/s] NUMBEROFINSTRUCTIONSOFTHEALGORITHMSOFTHE THREEFISH FAMILY. 500 Algorithm # instructions SHA-3-512

Threefish-256 encryption 490 400 Threefish-256 encryption with UBI 501 Threefish-256 decryption 469 Threefish-512 encryption 860 300 Threefish-512 encryption with UBI 882 Threefish-512 decryption 1092 BLAKE-512 200 Threefish-1024 encryption 1874 JH-512 Threefish-1024 encryption with UBI 1912 Skein 512-512 Threefish-1024 decryption 2351 100 SHA-2-512 Grøstl-512

0 0 50 100 150 200 250 300 VII.RESULTS AND COMPARISONS Area [Virtex-6 slices]

We captured our architecture in the VHDL language and Figure 22. Compact implementations of several cryptographic hash functions prototyped our coprocessors on a Xilinx Virtex-6 FPGA with on Virtex-6 FPGAs (512-bit digests). average speedgrade. Tables VIII and IX summarize our place- and-route results measured with ISE 14.2. Note that we considered the least favorable case, where the message consists is also an excellent candidate for cryptographic coprocessors of a single block, to compute the throughput of Skein. supporting several levels of security. Most of the architectures described in the open literature We already proposed lightweight implementations of ECHO focus on a single level of security (Table X). We took & AES [15] and Grøstl & AES2 [14] (Table XI). According to advantage of the intrinsic parallelism of BLAKE to interleave our results, the unified coprocessor for BLAKE and ChaCha the computation of four instances of the Gi function. Thanks to offers the best area–time trade-off. However, given that all this approach, we designed an ALU with four pipeline stages symmetric cryptographic functions (including authenticated and achieved higher clock speeds than the coprocessors listed encryption) can be efficiently implemented with Keccak, we in Table X. A careful scheduling allowed us to totally avoid would get the following figures with a unified architecture pipeline bubbles and memory collisions. We also addressed based on [13]: FPGA-specific issues and described how to share slices be- tween addition and bitwise exclusive OR of two operands. • hashing with arbitrary length at a security level of 256 We followed the same strategy to design our coprocessors bits: 501 Mbits/s; for Threefish and Skein. As a consequence, our coprocessors • at a security level of 256 bits: provide the end-user with hashing and encryption at all levels more than 501 Mbits/s (the generic security of keyed of security, while offering a better area–time trade-off. sponges allows one to use less capacity than for hashing, We report in Figure 22 the latest lightweight implementation hence a larger rate and a proportionally larger through- results of several cryptographic hash functions. Besides our put) [24]. coprocessors for BLAKE-512 and Skein-512-512, we selected Grøstl [14], JH [5], SHA-2-512 [21], and SHA-3-512 (Keccak 2Note that Jarvinen¨ [22] proposed the first unified coprocessor for AES- [r = 1024, c = 576]) [13]. In this context, BLAKE is 128 (encryption and key expansion) and Grøstl-256. Recently, Rogawski obviously the best choice for lightweight implementations on & Gaj [23] designed a parallel coprocessor for Grøstl-based HMAC and AES in the counter mode. Both architectures are optimized for high-speed FPGA. Since our unified architecture for the BLAKE family implementations, and it is therefore difficult to make a comparison with our (Figure 15) requires less than 100 Virtex-6 slices, BLAKE lightweight coprocessors. Table VIII PLACE-AND-ROUTE RESULTS FOR OUR THREEFISH AND SKEINCOPROCESSORSONA VIRTEX-6 FPGA (XC6VLX75T-2). ALLDESIGNSREQUIRETHREE MEMORY BLOCKS.

Throughput [Mbits/s] Supported Area Freq. Threefish-256 Threefish-512 Threefish-1024 Algorithms [Slices] [MHz] Skein-256-256 Skein-512-512 Enc. Dec. Enc. Dec. Enc. Dec. Threefish encryption 145 294 – – 153 – 175 – 160 – Threefish 277 267 – – 139 145 158 125 145 116 Skein and Threefish encryption 150 295 75 85 154 – 175 – 161 – Skein and Threefish 292 279 70 80 145 152 166 130 152 121

Table IX PLACE-AND-ROUTE RESULTS FOR OUR CHACHAAND BLAKE COPROCESSORSONA VIRTEX-6 FPGA (XC6VLX75T-2).

Throughput [Mbits/s] Supported Pipeline Area # block Freq. BLAKE-224 BLAKE-384 algorithms config. [slices] RAMs [MHz] and and 8-round 12-round 20-round BLAKE-256 BLAKE-512 ChaCha ChaCha ChaCha x 49 2 362 – – 595 422 266 ChaCha y 77 2 316 – – 520 368 232 z 77 2 345 – – 569 403 254 x 47 2 338 146 – – – – BLAKE-224 and y 49 2 341 147 – – – – BLAKE-256 z 50 2 349 150 – – – – x 79 3 331 – 252 – – – BLAKE-384 and y 91 3 331 – 252 – – – BLAKE-512 z 91 3 329 – 250 – – – x 94 3 312 2 × 134 237 – – – BLAKE y 126 3 332 2 × 143 252 – – – (all levels of security) z 129 3 343 2 × 148 261 – – – x 144 3 335 2 × 144 255 2 × 551 2 × 390 2 × 246 BLAKE and ChaCha y 156 3 289 2 × 124 220 2 × 475 2 × 337 2 × 212 (all levels of security) z 168 3 304 2 × 131 231 2 × 500 2 × 354 2 × 223

VIII.CONCLUSION to the preliminary results reported in [13], SHA-3 could The stream cipher ChaCha, the block cipher Threefish, outperform BLAKE and ChaCha. and the hash functions BLAKE and Skein are based on the same arithmetic operations. In this work, we showed that ACKNOWLEDGEMENTS the same design philosophy allows one to design lightweight coprocessors for hashing and encryption. The key element of The authors would like to thank Daniel J. Bernstein and Ray our approach is to take advantage of the parallelism of the Cheung for their valuable comments. This work was partially algorithms to: supported by the Japanese Society of Promotion of Science • deeply pipeline the ALU to achieve a high clock fre- (JSPS) through the A3 Foresight Program (Research on Next quency; Generation Internet and Network Security). Additionally the • avoid data dependencies by interleaving independent authors would like to acknowledge Xilinx and the Xilinx tasks. University Program for its generous donation of materials in Furthermore, we described how to design compact control terms of design tools. units thanks to a careful organization of the register file, loop unrolling, and a simple compression algorithm. Our archi- tectures are mainly designed for embedded systems. Thus, REFERENCES it would be interesting to conduct side-channel and fault [1] J.-P. Aumasson, L. Henzen, W. Meier, and R. Phan, “SHA-3 proposal injection attacks in future work. BLAKE (version 1.4),” Jan. 2011, available at http://www.131002.net/ Our results show that BLAKE and ChaCha are excellent . candidates for lightweight coprocessors. However, since all [2] N. Ferguson, S. Lucks, B. Schneier, D. Whiting, M. Bellare, T. Kohno, symmetric cryptographic functions can be implemented by J. Callas, and J. Walker, “The skein hash function family (version 1.3),” Oct. 2010, available at http://www.skein-hash.info. means of keyed sponges, we are planning to design hardware [3] D. Bernstein, “ChaCha, a variant of Salsa20,” Jan. 2008, available at architectures based on the new SHA-3 algorithm. According http://cr.yp.to/papers.html#chacha. Table X COMPACT IMPLEMENTATIONS OF BLAKE AND SKEINON VIRTEX-5 AND VIRTEX-6 FPGAS.THETHROUGHPUTISCOMPUTEDFORAONE-BLOCK MESSAGE.

Area 36k memory Frequency Throughput Supported algorithm(s) FPGA [slices] blocks [MHz] [Mbits/s] Latif et al. [17]‡ Skein-256-256 xc5vlx110-3 821 Not specified 119 1610 Jungk [18]‡ Skein-512-256 xc5v 555 – 271 237 Jungk [19]‡ Skein-512-256 xc6v 406 – 318 277 Kaps et al. [20] Skein-512-256 xc6vlx75t-1 207 1 166 17 Kaps et al. [20] Skein-512-256 xc6vlx75t-1 193 – 193 21 Kerckhof et al. [5]‡ Skein-512-512 xc6vlx75t-1 240 – 160 179 Aumasson et al. [1] BLAKE-256 xc5vlx110 390 – 91 412 Jungk [19] BLAKE-256 xc6v 235 – 231 518 Jungk [19] BLAKE-256 xc6v 404 – 185 823 Kaps et al. [20] BLAKE-256 xc6vlx75t-1 163 1 197 327 Kaps et al. [20] BLAKE-256 xc6vlx75t-1 166 – 268 445 Aumasson et al. [1] BLAKE-512 xc5vlx110 939 – 59 468 Kerckhof et al. [5] BLAKE-512 xc6vlx75t-1 192 – 240 183 †Without output transformation ‡Single call to Threefish-512

Table XI PLACE-AND-ROUTERESULTSFORHASHINGAND AES ENCRYPTION ON A VIRTEX-6 FPGA (XC6VLX75T-2).

Area Frequency Throughput [Mbits/s] FPGA [slices] [MHz] AES-128 AES-192 AES-256 256-bit digest 512-bit digest AES & ECHO [15] 155 397 219 186 161 92 48 AES & Grøstl [14] 169 393 217 184 159 92 69

[4] J. Zhai, C. Park, and G.-N. Wang, “Hash-based RFID security protocol [13] I.˙ San and N. At, “Compact Keccak hardware architecture for data using randomly key-changed identification procedure,” in Computational integrity and authentication on FPGAs,” Information Security Journal: Science and Its Applications–ICCSA 2006, ser. Lecture Notes in Com- A Global Perspective, vol. 21, no. 5, pp. 231–242, 2012. puter Science, M. Gavrilova, O. Gervasi, V. Kumar, C. K. Tan, D. Taniar, [14] N. At, J.-L. Beuchat, E. Okamoto, I.˙ San, and T. Yamazaki, “A low-area A. Lagana,` Y. Mun, and H. Choo, Eds., no. 3983. Springer, 2006, pp. unified hardware architecture for the AES and the cryptographic hash 296–305. function Grøstl,” 2012, cryptology ePrint Archive, Report 2012/535. [5] S. Kerckhof, F. Durvaux, N. Veyrat-Charvillon, F. Regazzoni, [15] J.-L. Beuchat, E. Okamoto, and T. Yamazaki, “A low-area unified G. Meurice de Dormale, and F.-X. Standaert, “Compact FPGA imple- hardware architecture for the AES and the cryptographic hash function mentations of the five SHA-3 finalists,” in Proceedings of the ECRYPT ECHO,” Journal of Cryptographic Engineering, vol. 1, no. 2, pp. 101– II Hash Workshop, 2011. 121, 2011. [6] N. At, J.-L. Beuchat, and I.˙ San, “Compact implementation of Threefish [16] H. Warren, Hacker’s Delight. Addison-Wesley, 2002. and Skein on FPGA,” in Proceedings of the Fifth IFIP International [17] K. Latif, M. Tariq, A. Aziz, and A. Mahboob, “Efficient hardware Conference on New Technologies, Mobility and Security–NTMS 2012, implementation of secure hash algorithm (SHA-3) finalist - Skein,” in A. Levi, M. Badra, M. Cesana, M. Ghassemian, O. Gurb¨ uz,¨ N. Jabeur, Proceedings of the International Conference on Computer, Communica- M. Klonowski, A. Mana,˜ S. Sargento, and S.Zeadally, Eds. IEEE tion, Control and Automation–3CA2011, 2011. eXpress Conference Publishing, 2012. [18] B. Jungk, “Compact implementations of Grøstl, JH and Skein for [7] J.-L. Beuchat, E. Okamoto, and T. Yamazaki, “Compact implementations FPGAs,” in Proceedings of the ECRYPT II Hash Workshop, 2011. of BLAKE-32 and BLAKE-64 on FPGA,” in Proceedings of the [19] ——, “Evaluation of compact FPGA implementations for all SHA-3 2010 International Conference on Field-Programmable Technology– finalists,” in The Third SHA-3 Candidate Conference, Mar. 2012. FPT 2010, J. Bian, Q. Zhou, and K. Zhao, Eds. IEEE Press, 2010, pp. [20] J.-P. Kaps, P. Yalla, K. Surapathi, B. Habib, S. Vadlamudi, and S. Gu- 170–177. rung, “Lightweight implementations of SHA-3 finalists on FPGAs,” in The Third SHA-3 Candidate Conference, Mar. 2012. [8] D. Bernstein, “The Salsa20 family of stream ciphers,” Dec. 2007, [21] H. Technology, “FULL DATASHEET–Tiny hash core family for Xilinx available at http://cr.yp.to/snuffle/salsafamily-20071225.pdf. FPGA,” revision 2.0 (11/06/2010). [9] J.-P. Aumasson, S. Fischer, S. Khazaei, W. Meier, and C. Rechberger, [22] K. Jarvinen,¨ “Sharing resources between AES and the SHA-3 second “New features of Latin dances: Analysis of Salsa, ChaCha, and Rumba,” round candidates and Grøstl,” in The Second SHA-3 Candidate in Fast Software Encryption–FSE 2008, ser. Lecture Notes in Computer Conference, Aug. 2010. Science, K. Nyberg, Ed., vol. 5086. Springer, 2008, pp. 470–488. [23] M. Rogawski and K. Gaj, “A high-speed unified hardware architecture [10] T. Ishiguro, S. Kiyomoto, and Y. Miyake, “Latin dances revisited: for AES and the SHA-3 candidate Grøstl,” in Proceedings of the 15th New analytic results of Salsa20 and ChaCha,” in Information and Euromicro Conference on Digital System Design, Sep. 2012. Communications Security–ICICS 2011, ser. Lecture Notes in Computer [24] The Keccak Team, Personal communication, Sep. 2012. Science, S. Qing, W. Susilo, G. Wang, and D. Liu, Eds., vol. 7043. Springer, 2011, pp. 255–266. [11] T. Ishiguro, “Modified version of ”Latin dances revisited: New analytic results of Salsa20 and ChaCha”,” 2012, cryptology ePrint Archive, Report 2012/065. [12] B. Baldwin, A. Byrne, L. Lu, M. Hamilton, N. Hanley, M. O’Neill, and W. Marnane, “A hardware wrapper for the SHA-3 hash algorithms,” 2010, cryptology ePrint Archive, Report 2010/124.