
2015 Euromicro Conference on Digital System Design

CLEFIA Implementation with Full Expansion

João Carlos Bittencourt†‡, João Carlos Resende‡, Wagner Luiz de Oliveira† and Ricardo Chaves‡
†Polytechnic Institute, Federal University of Bahia – Bahia, Brazil
‡INESC-ID, IST, Universidade de Lisboa – Lisbon, Portugal

Abstract: In this paper a compact and high throughput hardware structure is proposed, allowing for the computation of the novel 128-bit CLEFIA algorithm and its associated full key expansion. In the existing state of the art only the 128-bit key expansion is supported, given the modification needed to the CLEFIA Feistel network. This work shows that with a small area cost and with no performance impact, full key expansion can be supported. This is achieved by using addressable shift registers, available in modern FPGAs, and adaptable scheduling, which allow the 4 and 8 branch CLEFIA Feistel networks to be computed within the same structure. The obtained experimental results suggest that throughputs above 1 Gbps can be achieved with a low area cost, while achieving efficiency metrics above those of the restricted state of the art.

Keywords: CLEFIA, Encryption, Cipher, Full key expansion, FPGA

I. INTRODUCTION

The market of embedded systems has experienced substantial growth in the last decades. Currently, the use of mobile and embedded systems already exceeds the use of personal computer systems. Likewise, the need for security and privacy services has also increased. To meet this need, efficient and compact implementations of cryptographic primitives are required. One such primitive is the CLEFIA symmetric block cipher, proposed and developed by SONY Corporation [1]. This algorithm supports 128, 192, and 256-bit keys and provides improved cryptographic security through the use of Diffusion Switch Mechanisms and whitening keys, among others, in order to ensure immunity against differential and linear attacks [1].

Recent works on CLEFIA have highlighted its performance, particularly in hardware implementations for both ASIC and FPGA technologies. Many of these approaches strive for compact structures while maintaining high performance, leading to the optimization of the computational resources and the exploitation of possible parallelism between operations. However, given the need for an 8-branch Feistel network when computing the key expansion for 192 and 256-bit keys, most existing structures that provide key expansion only do so for 128-bit keys, using a 4-branch Feistel network [2], [3].

The main goal of the work herein presented is to show that a CLEFIA ciphering structure, capable of supporting the computation of both 4 and 8 branch CLEFIA Feistel networks, can be designed within the same hardware architecture at very low added cost. To validate this, a fully functional compact hardware structure is proposed, supporting both the encryption computation of the CLEFIA algorithm and the respective key expansion for all key sizes.

II. CLEFIA 128-BIT BLOCKCIPHER

The CLEFIA cipher is a 128-bit symmetric block ciphering algorithm supporting cipher key sizes of 128, 192, and 256 bits. This algorithm is based on the well known and commonly used Feistel network structure. As in most block ciphers, the input data is processed over several rounds, in which it is combined with key material derived from the input key. In this particular algorithm the data and key are processed over 18, 22, or 26 rounds, depending on the key size. The round computation is exactly the same for each iteration.

A. Data Processing

The encryption process takes a 128-bit input data block $P = P_0 | P_1 | P_2 | P_3$, four 32-bit whitening keys $WK = WK_0 | WK_1 | WK_2 | WK_3$, and several 32-bit round keys $RK_i$ as data inputs. The resulting output is a 128-bit cryptogram.

The first step of the encryption process is to XOR the second and fourth words of the plaintext ($P_1$ and $P_3$) with the first and second 32 bits of the original key ($WK_0$ and $WK_1$), performing the first key whitening procedure. After this operation the rounds are executed. Each round is computed by a 4-branch Feistel structure, defined by $GFN_{4,n}$, where $n$ is the number of rounds to be computed [1].

The round computation contains two parallel non-linear $F$ functions per round, whose respective inputs are a copy of the first and third words and two round keys. In the final round the second and fourth final words are XORed with the last two whitening keys.

Besides the round key addition, the $F_0$ and $F_1$ functions employ two different types of 8-bit S-Boxes ($S_0$ and $S_1$) and two distinct diffusion matrices ($M_0$ and $M_1$) [1].

B. Key Scheduling

Since each round uses two 32-bit round keys, a total of 36, 44, or 52 round keys (depending on the number of rounds) are needed, plus 4 additional whitening keys [1]. These round keys are obtained using the specified key schedule algorithm [1].
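The data flow of Section II-A and the round-key counts above can be sketched in software as follows. This is a minimal illustration and not the hardware structure proposed herein: `toy_f` is a hypothetical stand-in for $F_0$ and $F_1$ (the actual S-Boxes and diffusion matrices are defined in [1] and are not reproduced here), and all function names are assumptions introduced for the example.

```python
MASK32 = 0xFFFFFFFF
ROUNDS = {128: 18, 192: 22, 256: 26}  # rounds per key size, as stated in Section II


def round_key_count(key_bits):
    """Two 32-bit round keys are consumed per round."""
    return 2 * ROUNDS[key_bits]


def toy_f(rk, x):
    """Hypothetical stand-in for F0/F1; NOT the CLEFIA S-Box/diffusion layer."""
    y = (x ^ rk) & MASK32
    return ((y << 8) | (y >> 24)) & MASK32  # rotate left by 8 bits


def gfn4(words, rks, n, f0=toy_f, f1=toy_f):
    """4-branch generalized Feistel network GFN_{4,n} over four 32-bit words."""
    t0, t1, t2, t3 = words
    for i in range(n):
        t1 ^= f0(rks[2 * i], t0)        # F0 on the first word, added to the second
        t3 ^= f1(rks[2 * i + 1], t2)    # F1 on the third word, added to the fourth
        t0, t1, t2, t3 = t1, t2, t3, t0  # word swap between rounds
    return [t3, t0, t1, t2]              # the last round has no word swap


def encrypt_block(p, wk, rks, key_bits=128, f0=toy_f, f1=toy_f):
    """Initial whitening, n rounds, final whitening."""
    n = ROUNDS[key_bits]
    t = [p[0], p[1] ^ wk[0], p[2], p[3] ^ wk[1]]
    y = gfn4(t, rks, n, f0, f1)
    return [y[0], y[1] ^ wk[2], y[2], y[3] ^ wk[3]]
```

Passing the $F$ functions as parameters keeps the Feistel skeleton separate from the non-linear layer, which mirrors the way the same network is reused for key expansion with constants in place of round keys.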

The whitening key (WK) generation depends on the size of the input key. For a 128-bit input key, the four 32-bit whitening keys are obtained directly from the input key, by:

$$WK_0 | WK_1 | WK_2 | WK_3 \leftarrow K. \tag{1}$$

For the 192 or 256-bit input keys, the value is divided into two 128-bit blocks, $K_L$ and $K_R$, as shown by:

$$K_L \,\|\, K_R \leftarrow K_0 | K_1 | K_2 | K_3 \,\|\, K_4 | K_5 | \overline{K_0} | \overline{K_1} \quad (K^{192}) \tag{2}$$

$$K_L \,\|\, K_R \leftarrow K_0 | K_1 | K_2 | K_3 \,\|\, K_4 | K_5 | K_6 | K_7 \quad (K^{256}) \tag{3}$$

where the overline denotes the bitwise complement, as specified in [1]. The corresponding whitening key is then computed by:

$$WK = K_L \oplus K_R. \tag{4}$$

The key expansion of a 128-bit key uses the same 4-branch GFN network used for the CLEFIA main encryption process. The difference in the 128-bit key expansion is that the input data of the GFN structure is now the input key itself, and the round keys are replaced with predefined constants [1].

When considering the key schedule for the 192 and 256-bit keys, the GFN network becomes an 8-branch structure ($GFN_{8,n}$). In this case, the input value is the combination $K = K_L \| K_R$ [1]. The 8-branch Feistel structure uses the same two non-linear $F$ functions, twice per round, and processes eight input words on each round.

Instead of a ciphered text, the output of the GFN structure in the key expansion process is either a 128-bit block ($L$), for 128-bit input keys, or two 128-bit blocks ($L_L$ and $L_R$) for the remaining key sizes. After the GFN computation is completed, the result ($L$, or $L_L$ and $L_R$) is expanded in an iterative way using a double swap ($\Sigma$) function, as:

$$L = \Sigma(L); \quad L_L = \Sigma(L_L); \quad L_R = \Sigma(L_R) \tag{5}$$

The $\Sigma$ function swaps several bits of its 128-bit input and returns an equally sized output, specified by:

$$\Sigma(X) = X[7-63] \,|\, X[121-127] \,|\, X[0-6] \,|\, X[64-120] \tag{6}$$

With this, the 32-bit round keys are obtained by alternately adding the $L$, $K$, and $\Sigma(X)$ values with another predefined set of constants [1], [3].

III. PROPOSED ARCHITECTURE

The main goal of the work herein proposed is to design a compact structure capable of computing both the CLEFIA encryption and the key scheduling for all possible key sizes. As stated in Section II-B, the key expansion of 128-bit keys can be processed by the same $GFN_{4,n}$ structure used for encryption. On the other hand, for 192 and 256-bit keys a $GFN_{8,n}$ structure is required. This requirement is the main difficulty in supporting full key expansion in a compact CLEFIA hardware structure.

The first step towards supporting the expansion of all key sizes is the ability to compute a $GFN_{8,n}$ function. For this purpose, the folded structure proposed in [4] is considered. This structure uses a T-Box based implementation within a 32-bit datapath, a design choice shown to result in compact and efficient structures, particularly when considering FPGAs as the target technology [4], [5].

This section starts by describing the proposed structure for the $GFN_{4/8,n}$ computation, particularly considering the Xilinx VIRTEX FPGAs as the target technology. To conclude, the proposed key expansion module is also detailed.

A. GFN Blockcipher Structure

The CLEFIA encryption structure herein proposed is based on the work presented in [4]. However, the $GFN_{8,n}$ Feistel network imposes a larger datapath, due to the need to store and multiplex additional intermediate values. This storage and multiplexing can be performed by extra registers and wider multiplexers, resulting in higher area costs.

One of the main optimizations herein considered, in order to reduce the area cost of the proposed $GFN_{8,n}$ supporting structure, is related to the needed word swap. Implemented as a chain of individual registers, this swap imposes a high cost. However, when considering the target technology, these individual registers can be replaced by addressable shift registers. Such an addressable shift register can be mapped into Look Up Tables (LUTs) operating in either SRL16 or SRL32 LUT mode. Each LUT is able to implement a 1-bit wide addressable shift register, capable of storing up to 16 or 32 bits. The full value storage and swapping operation can thus be implemented using 32 LUTs, as depicted in block 2 of Figure 1, with an additional register placed after the shift register in order to reduce the critical path.

Figure 1. Proposed CLEFIA GFN4/GFN8 structure.

The input of data into the structure can also be optimized, taking into account that external data buses of small embedded systems are typically narrower than 128 bits. Herein we also consider that the data is fed to the structure using a 32-bit data bus ($P_i$), which simplifies the input selection, as depicted in stages 1 and 6 of Figure 1.

Once more, considering the target technology, two embedded dual-port RAM blocks (BRAMs) are considered for the T-Box implementation [4].

After each iteration of an $F$ function within a round (and Feistel word addition), the result is immediately fed back to the beginning of stage 6. With the 32-bit pipelined datapath and the pipeline register placed into the T-Box BRAM, a CLEFIA Feistel round can be computed in two or four clock cycles, depending on the operation mode.

The last operation performed by the CLEFIA algorithm is the XORing of the last two whitening keys in the final round, performed in stage 5. This last XOR is followed by a register in order not to influence the critical path. Since the output stage 5 is separated from the input stage 6, a new data block can be provided at the same time the last one is concluded.

Although the functional description stated above operates properly for the block cipher process, the key scheduling generation of 128 and 256-bit wide $L$ keys needs to be treated carefully. During a key expansion process of a 128, 192 or 256-bit cipher key, the $L$ value described in Section II-B is collected directly from the shift register L.

B. Key Scheduling Structure

The actual key expansion is performed after the $GFN_{4,n}$ or $GFN_{8,n}$ computation. The proposed structure for this function is based on a 32-bit datapath, depicted in Figure 2. The first step, where the cipher key $K$ and the whitening key $WK$ are computed, is presented in stage 7. The 128, 192 or 256-bit cipher key is obtained through a 32-bit input bus and stored into an SRL16 LUT. Once the key is stored into the SRL, it is sent to the $GFN_{4/8,n}$ structure, for the computation of the $L$ values, through the K connection, depicted at the end of stage 7. This K output is also used to provide the whitening keys to the ciphering structure, as depicted in Figure 1.

Figure 2. Proposed key scheduling structure.

For a 128-bit cipher key, the whitening keys are obtained directly from the input key. On the other hand, the whitening keys derived from 192 and 256-bit keys are obtained by XORing the lower and higher parts of the key. This is achieved using the 32-bit XOR, depicted at the end of stage 7, while the bottom two registers are used to store the targeted values. The resulting value is stored back in the SRL16.

The last step of the key expansion itself is the XORing between the corresponding constants $CON_{ij}$ (stored in the BRAM), the $K$ (or $K_L$ and $K_R$) value, and the $\Sigma$ transformation of the $L$ (or $L_L$ and $L_R$) value [1], [3]. The XOR is performed iteratively, one 32-bit sub-value at a time. The $\Sigma$ transformation computes the bit swapping described in (6). To achieve this in a compact way, an intermediate register (at the end of stage 8) is used to store the part of the $L$ value that is to be combined with the other part of $L$ output directly by the SRL32.

IV. RESULT ANALYSIS

In this section, experimental results for the proposed structures are presented and compared with the related state of the art. The results, presented in Table I, were obtained using the Xilinx ISE Design Suite (v14.5) with the software default parameters.

For the proposed CLEFIA core without key expansion, operating frequencies in the order of 360 MHz are achieved, resulting in a throughput of 1.3 Gbps when considering 128-bit keys. This is achieved at a cost of 80 Slices and 3 BRAMs. Considering a throughput per slice efficiency metric, an efficiency of 16.2 Mbps/Slice can be achieved.

When considering the existing state of the art, the structure proposed in [4] only considers the data encryption process, with no regard for key scheduling. When compared with the proposed structure, disregarding the key expansion module, the proposed improvements achieve identical throughputs with 2.5 times fewer resources. The structure proposed in [6] achieves a lower operating frequency but is able to achieve a throughput of 21.376 Gbps. However, it can only be used in non-feedback modes, given its unfolded structure. This comes at a cost of 2479 Slices, resulting in a throughput per slice efficiency of 8.6 Mbps/Slice, about half of the efficiency herein achieved. When considering its usage with feedback modes, this unfolded structure becomes highly inefficient, only achieving a throughput of 1.18 Gbps and an efficiency of 0.48 Mbps/Slice.

Given the achieved throughput and the low area resources required, it can be concluded that the proposed CLEFIA encryption core implementation results in a highly efficient structure. This is particularly valid since it can compute both $GFN_{4,n}$ and $GFN_{8,n}$.

Table I. Performance comparison of compact CLEFIA hardware implementations.

| Structure | Design | Device | Slices | BRAM | Frequency (MHz) | Throughput (Gbps) | Efficiency (Mbps/Slice) |
|---|---|---|---|---|---|---|---|
| CLEFIA without Key Expansion | Kryjak [6] | xc5vlx30 | 2479 | 0 | 167 | 21.376 | 8.6 |
| CLEFIA without Key Expansion | Resende [5] | xc5vlx30 | 123 | 2+1 | 352 | 1.073 | 8.7 |
| CLEFIA without Key Expansion | Proença [4] | xc5vlx30 | 205 | 2+1 | 374 | 1.329 | 6.5 |
| CLEFIA without Key Expansion | Ours | xc5vlx30 | 80 | 2+1 | 365 | 1.3 | 16.2 |
| Key Expansion Structure | Chaves [3]† | xc4lx200 | 171 | 1 | 285 | n.a. | n.a. |
| Key Expansion Structure | Chaves [3]† | xc5vlx30 | 100 | 1 | 387 | n.a. | n.a. |
| Key Expansion Structure | Ours | xc5vlx30 | 128 | 1 | 400 | n.a. | n.a. |
| CLEFIA with Key Expansion | Chaves [3]† | xc4lx200 | 481 | 3 | 287 | 1.020 | 2.7 |
| CLEFIA with Key Expansion | Chaves [3]† | xc5vlx30 | 295 | 3 | 374 | 1.330 | 4.5 |
| CLEFIA with Key Expansion | Ours | xc5vlx30 | 200 | 3 | 375 | 1.333 | 6.7 |

† Supporting only 128-bit keys.

Regarding the key expansion itself, independent unfolded structures for the computation of each of the 3 possible key sizes are proposed in [6]. Unless the cipher key changes for every data block, this unfolded approach to the key scheduling is not very practical. Given the different approach and implementation technology, a fair comparison cannot be made. In [3], a compact structure for the key expansion computation is proposed; however, it only supports 128-bit keys. The expansion structure proposed in [3] has an area cost of 100 Slices. The key expansion structure herein proposed has been designed to support the key scheduling for all 3 key sizes, at a cost of 128 Slices, which does not represent a significant area increase when compared to [3].

The proposed key expansion structure requires 62 or 77 clock cycles to perform the expansion computation of the 128, 192, and 256-bit input keys. However, since the key expansion is performed only once for the same input key, and it is able to operate at a higher frequency than the main core, there is no overall impact on the system performance.

Overall, the proposed structure, computing the full CLEFIA algorithm including all key expansions, allows for a throughput above 1.3 Gbps at a cost of 200 Slices and 3 BRAMs. When compared to the existing CLEFIA compact structures with key expansion, it achieves the same throughput with 30% less area resources, while supporting the expansion of all possible key sizes.

V. CONCLUSION

Herein a compact hardware structure for the computation of the CLEFIA block cipher algorithm, capable of performing data encryption and the key expansion for all key sizes, is proposed. This work shows that, by properly exploiting FPGA technology features such as addressable shift registers and RAM blocks, the multiple branch Feistel network can be efficiently implemented. With this, and by properly scheduling the key expansion operations, full key expansion can be achieved at a cost of 128 Slices. Experimental results suggest that the proposed optimizations allow the CLEFIA encryption and key expansion to be implemented at a cost of 200 Slices and 3 BRAMs, achieving a throughput of 1.3 Gbps.

ACKNOWLEDGMENTS

This work was supported by the ARTEMIS Joint Undertaking under grant agreement n° 621429 and by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013.

REFERENCES

[1] T. Shirai, K. Shibutani, T. Akishita, S. Moriai, and T. Iwata, “The 128-bit blockcipher CLEFIA,” in Fast Software Encryption. Springer, 2007, pp. 181–195.

[2] T. Sugawara, N. Homma, T. Aoki, and A. Satoh, “High-performance ASIC implementations of the 128-bit block cipher CLEFIA,” in 2008 IEEE International Symposium on Circuits and Systems. IEEE, May 2008, pp. 2925–2928.

[3] R. Chaves, “Compact CLEFIA Implementation on FPGAs,” in Embedded Systems Design with FPGAs. Springer New York, 2013, pp. 225–243.

[4] P. Proença and R. Chaves, “Compact CLEFIA implementation on FPGAs,” in Field Programmable Logic and Applications (FPL), 2011 International Conference on. IEEE, 2011, pp. 512–517.

[5] J. C. Resende and R. Chaves, “Dual CLEFIA/AES Cipher Core on FPGA,” in Applied Reconfigurable Computing (ARC), Bochum, Germany, 2015, p. 12.

[6] T. Kryjak and M. Gorgoń, “Pipeline implementation of the 128-bit block cipher CLEFIA in FPGA,” in Field Programmable Logic and Applications (FPL), 2009 International Conference on. IEEE, 2009, pp. 373–378.

978-1-4673-8035-5/15 $31.00 © 2015 IEEE. DOI 10.1109/DSD.2015.55
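As a software cross-check of the double swap function in (6), the bit selection can be sketched as below. The sketch assumes the bit numbering of the CLEFIA specification [1], in which bit 0 is the most significant bit of the 128-bit value; the helper names `bits` and `sigma` are introduced only for this illustration and do not model the SRL32-based hardware described in Section III-B.

```python
def bits(x, a, b):
    """Return X[a-b]: bits a..b of a 128-bit value, with bit 0 as the MSB (assumed)."""
    width = b - a + 1
    return (x >> (127 - b)) & ((1 << width) - 1)


def sigma(x):
    """Double swap: X[7-63] | X[121-127] | X[0-6] | X[64-120]."""
    return (
        (bits(x, 7, 63) << 71)       # 57 bits moved to the top
        | (bits(x, 121, 127) << 64)  # last 7 bits
        | (bits(x, 0, 6) << 57)      # first 7 bits
        | bits(x, 64, 120)           # 57 bits moved to the bottom
    )
```

Since the four fields have widths 57, 7, 7, and 57, the output is again 128 bits wide and the function is a pure bit permutation, which is why it maps well onto shift-register storage with a small intermediate register.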

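The throughput and efficiency figures reported in Section IV and Table I can be approximately reproduced with the arithmetic below. The sketch assumes that a 128-bit block with a 128-bit key takes 36 clock cycles (18 rounds at two cycles per round); this cycle count is inferred from the text and is not stated explicitly, and the function names are hypothetical.

```python
def throughput_gbps(freq_mhz, cycles_per_block, block_bits=128):
    """Throughput of a folded core processing one block at a time."""
    return block_bits * freq_mhz / cycles_per_block / 1000.0


def efficiency_mbps_per_slice(throughput_gbps_value, slices):
    """Throughput per slice, in Mbps/Slice."""
    return throughput_gbps_value * 1000.0 / slices


# Assumed: 18 rounds x 2 cycles per round for 128-bit keys.
CYCLES_128 = 18 * 2

core_only = throughput_gbps(365, CYCLES_128)       # core without key expansion
with_key_exp = throughput_gbps(375, CYCLES_128)    # core with key expansion
```

Under this assumption the two values come out close to the 1.3 Gbps and 1.333 Gbps listed in Table I, and dividing the reported throughputs by 80 and 200 Slices gives efficiencies near the listed 16.2 and 6.7 Mbps/Slice.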