<<

A HIGH THROUGHPUT FPGA IMPLEMENTATION Daniel Denning, James Irvine, Malachy Devlin

Institute of System Level Integration, Alba Centre, Alba Campus, Livingston, EH54 7EG, UK E-mail: [email protected]

evaluation projects such as ISO/IEC JTC 1/SC 27, ABSTRACT IETF, and TV-Anytime Forum. Previous Camellia In this paper we a Field Programmable implementations have been published in [4,5] but Gate Array (FPGA) implementation of the do not investigate a sub-pipelining architecture. Camellia algorithm. Our implementation deeply sub-pipelines the algorithm 2. OVERVIEW OF CAMELLIA for the FPGA architecture. Camellia has been ALGORITHM included in both portfolios of the New European The Camellia algorithm processes data blocks of Schemes for Signatures, Integrity, and Encryption 128-bits with secret keys of lengths 128, 192, or 256 (NESSIE) for Europe and the bits. Note that Camellia has the same interface as Research and Evaluation Committee (CRYPTREC) the AES (Advanced Encryption Standard). In our in Japan. The implementation is the fastest implementation we focus on the algorithm using a published throughput for the entire block length of 128-bits that is key agile. A key agile recommended in both portfolios for NESSIE and core requires that on each cycle new data and CRYPTREC, and runs at a throughput of key must be accepted. 33.25Gbit/sec. A key length of 128-bits results in an 18 round Feistel structure. After the 6th and 12th rounds -1 1. INTRODUCTION FL/FL function layers are inserted to provide some non-regularity across rounds. There are also two The Camellia [1] symmetric-key has 64-bit XOR operations before the first round and been recognised by an open call from NESSIE [2] after the last, also known as pre- and post- and CRYPTREC [3] as a cryptographic algorithm whitening. The top-level structure of the algorithm to help protect the current and future information can be seen in Figure 1, as well as the inner 6 round society. FPGAs provide a very beneficial platform structure. The key schedule, discussed later in this for implementing cryptographic systems because of section, generates subkeys for each round, FL/FL-1 the inherent parallelism of the device. layers, and pre- and post-whitening. NESSIE was a European Union (EU) initiative The FL-function is defined by: and a project with in the Information Society YR(32) = ((XL(32) ∩ klL(32)) <<< 1) ⊕ XR(32), (1) Technologies (IST) of the EU. Camellia was ⊕ selected with three other block ciphers from the 42- YL(32) = (YR(32) ∪ klR(32)) XL(32), (2) encryption algorithms that were submitted. The where YL(32) are the 32 most significant bits of the other two block ciphers being MISTY1 and 64-bit output and YR(32) are the 32 least significant SHACAL-2. The AES algorithm was also selected bits. The FL-1 function is just the inverse of FL. but selected on its evaluation from the National Each round can be composed of an F-function with Institute of Standards and Technology (NIST). a XOR. A different subkey is applied to each F- Other algorithms included digital signatures, function and the output is XORed with the previous identification schemes, public-key encryption, MAC but one result. The F-function is defined as algorithms, and hash functions. ⊕ The Camellia algorithm is a 128-bit block Y(64) = P(S(X(64) k(64)). (3) cipher jointly developed by NTT and Mitsubishi The P-function is constructed only of XOR Electric Corporation. The algorithm has also been components and is a linear transformation from 8 submitted to other standardisation organisations and input bytes to 8 output bytes. The S-function M (128) x(64)

kw -1 -1 -1 -1 -1 -1 -1 -1 1 kw2 64 64 z z z z z z z z 64 64 k1 k1, k2, k3, k4, F z-1 z-1 z-1 z-1 z-1 z-1 z-1 z-1 k5, k6 k 2 S1 S2 S3 S4 S2 S3 S4 S1

F z-1 z-1 z-1 z-1 z-1 z-1 z-1 z-1 -1 kl1 FL FL kl2 k3

k7, k8, F k9, k10, k11, k12 k4 z-1 F z-1 k z-1 kl -1 5) 3 FL FL kl4 z-1 F -1 k13, k14, z z-1 k15, k4, k6 z-1 k , k 5 6 z-1 F

kw3 kw4

C(128) z-1 z-1 z-1 z-1 z-1 z-1 z-1 z-1 Figure 1. Top-level procedure of Camellia (left-hand side) showing inner structure of 6 Rounds (right-hand side)

K(128) y(64)

Σ1(64) F Figure 3. Sub-Pipelined F-Function

Σ2(64) F where f and h are linear mappings, g is an inverse over GF(28), and a, b are fixed constants.

K(128) The Camellia key schedule for a 128-bit key produces twenty-six 64-bit subkeys, for use in the Σ3(64) 18 rounds, pre- and post-whitening and FL/FL-1 F function layers. Figure 2 shows the first step that

Σ4(64) involves deriving a 128-bit variable K from the F A(128) original secret key K. Then the second step involves generating further round keys by cyclic rotating either K (now K ) or K by 15 or 17. K L A A(128) The Camellia decryption procedure is exactly the same as the encryption, it does not have any Figure 2. Key schedule architecture for inverse functions like with in the AES algorithm, Camellia but the keys are needed in reverse order. So the represents a substitution using one of 4 s-boxes that subkey that was needed to encrypt the data block in are defined by: round 18 is now needed to decrypt in round 1. This S1(x) = h(g(f(x ⊕ a))) ⊕ b, (4) can cause a delay in the overall decryption process

S2(x) = S1(x) <<< 1, (5) if the subkeys need to be produced first, before decryption can take place. This can be overcome S3(x) = S1(x) >>> 1, (6) when the decryption receives the pre-computed S(x) = S (x <<< 1), (7) 4 1 subkeys from the sender. Table 1. Summary of CRYPTREC and NESSIE 128-bit Block Cipher Winners and Finalists on FPGAs Algorithm Authors Device Area Key Throughput Efficiency Slices agility Gbits/sec Mbps/slices (BRAMS) Camellia1 Authors XC2Vp50 11,287 YES 33.25 - (88) AES-1281 Zambreno XC2V4000 16938 YES 23.57 1.39 et al. [6] MISTY11 Rouvroy et XCV1000 6,322 YES 19.34 3.06 al. [7] SHACAL-12 McLoone et XC2V4000 13,729 NO 17.02 1.24 al. [8] IDEA2 Gonzalez et XCV600 12,026 NO 8.3 0.69 al. [9] (estimated) RC62 Beuchat XC2V3000 8,554 NO 15.2 - [10] (80) (assumed) -3 3 Rogawski EPF10K130V 25811 YES 0.397 - [11] LE (assumed) CIPHERUNICORN- NEC [12] EP20K1500E 7072 LE YES 0.044 - A3 (66 ESB) SC20004 Shimoyama [0.25µm 65K YES 0.397 - et al. [13] CMOS ASIC gates (assumed)

cycles but does not have an effect on throughput. 3. SUB-PIPELINING ARCHITECTURE Further improvements might be made with some In this section we describe the sub-pipelined algorithm optimisations or floor planning. Adding architecture but it includes previous classical the registers into the P-Function means that the encryption optimisation techniques to increase the function now has a branch number of 3 or 2 instead algorithms throughput at the higher level within the of 5 depending on the data path. The architecture algorithm. These include such optimisations as of F-Function can be seen in Figure 3 on the unrolling, round pipelining, and transformation previous page. The registers can be seen in- partitioning. The implementation is key agile thus between each of the F-function operations. providing very high security and throughput as new One other important optimisation techniques data blocks can be encrypted on each clock cycle specific for the FPGA architecture is the use of with a different key. dual-port block-RAM. Every round has its own For the sub-pipelining architecture we have fully associated F-function, as described earlier, and each unrolled and added pipelining registers between function utilises 4 different s-boxes for byte each encryption round. We have then also added 5 substitution. Each F-function makes 2 calls to the stages of sub-pipelining shift registers into the F- same s-box. For this implementation we have Function. The registers have been added between chosen to use dual ported block-RAMs for the s- the P-Function, S-Function and the XOR. With a boxes. pipeline stage also added into the P-Function and a 4. RESULTS register at the output of the F-function. This architecture produces a pipeline delay of 108 clock We have implemented the sub-pipelined architecture on a Virtex-II pro XC2Vp50 device.

1 The core was designed with Xilinx’s System Included in both CRYPTREC and NESSIE. Generator, which produced the VHDL and 2 Finalists not included in NESSIE portfolio. 3 associated files. The VHDL was synthesised with Implemented on a Altera FPGA device ISE 5.2 and verified with the test vectors that were 4 Only able to find ASIC implementation submitted to NESSIE. The core runs at a frequency [2] NESSIE, “NESSIE Project Announces Final of 259MHz that results in a throughput of Selection of Crypto Algorithms”, IST-199- 33.25GBits/sec. The implementation requires 12324, February 2003. 11,287 slices (47%) of the FPGA and uses 88 block- [3] CRYPTREC, Information-Technology RAMs (37%). Generally comparisons against other Promotion Agency, Japan, “CRYPTREC encryption implementations can be difficult and can Report 2002”, depend on architecture, device, embedded cores [4] Denning, D., Irvine, J., Delvin, M., “A Key such as block-RAM, key agility, on-chip key Agile 17.4Gbit/sec Camellia Implementation”, scheduling, and other such criteria. Table 1 shows proceedings of FPL’04, Antwerp, Belgium, a list of other known high throughput FPGA Aug. 2004. architectures for NESSIE and CRYPTREC. For [5] Ichikawa, T., Sorimachi, T., Kasuya, T., implementations that have used block-RAMs we Matsui, M., “On the Criteria of Hardware have not included the efficiency. It is possible to Evalution of Block Ciphers(1)”, Techn report calculate the extra number of slices that the block- of IEICE, ISEC2001-53, September 2001. RAMs would occupy if distributed memory was to [6] Zambreno, J., Nguyen, D., Choudhary, A., be used, but this sort of calculation doesn’t take into “Exploring Area/Delay Tradeoffs in an AES account the extra routing, which would have an FPGA Implementation”, proceedings of effect on the total throughput and therefore the FPL’04, Antwerp, Belguim, Aug. 2004. efficiency. [7] Rouvroy. G., Standaert, F.X., “Efficient FPGA Implementation of Blcok Cipher MISTY1”, 5. CONCLUSION proceedings IPDPS’03, France, April 2003. [8] McLoone, M., McCanny, J. V., “Very High In this paper we have presented a sub-pipelined Speed 17 Gbps SHACAL Encryption Camellia implementation that has a throughput of Architecture”, in Proc. Of the 13th Int’l 33.25Gbit/sec. Compared to other published FPGA Conference on Field-Programmable Logic and implementations this is the fastest known its Applications (FPL), Portugal, September throughput of all block ciphers (128-bit) 2003. recommended by NESSIE and CRYPTREC. The [9] Gonzalez, I., Lopez-Buebo, S., Gomez, F. J., importance of this algorithm has been proved in the Martinez, J., “Using Partial Reconfiguration in open-call by NESSIE and CRYTREC. Cryptographic Applications: An The core is able to receive new secret keys and Implementation of the IDEA Algorithm”, in data on every clock cycle. We have utilised the Proc. Of the 14th Int’l Conference on Field- available dual-port block-RAMs for use in the F- Programmable Logic and its Applications Function and been able to pipeline the algorithm to (FPL), Portugal, September 2003. the lowest level and then apply this to the FPGA [10] Beuchat, J. L., “FPGA Implementations of the architecture. The core has a very high throughput RC6 Block Cipher”, in Proc. Of the 13th Int’l and thus could be used in a secure high Conference on Field-Programmable Logic and performance computing application. As the core its Applications (FPL), Portugal, September takes less then 50% of the Virtex 2 pro it might be 2003. possible to put two cores down thus being able to [11] Rogawski, M., “Analysis of Implementation obtain a throughput on one FPGA device of around Hierocrypt-3 Algorithm (and its comparison to 66Gbits/sec. This will be part of the work for the Camellia algorithm) using ALTERA devices”, future. Electronic Edition, CoRR cs CR/0312035, 6. REFERENCES 2003. [12] NEC Corporation, “Self Evaluation Report – [1] Aoki, K., Ichikawa, T., Kanda, M., Matsui, M., CIPHERUNICORN-A”, unpublished date. Moriai, S., Nakajima, J., Tokita, T., [13] Shimoyama, T., Hitoshi, Y., Kazuhiro, Y., “Specification of Camellia – 128-bit Block Takenaka, M., Itoh, K., Yjima, J., Torii, N., Cipher”, URL: Tanaka, H., “Specification and Supporting http://info.isl.ntt.co.jp/camellia/#Spec, Document of the Block Cipher SC2000”, IST- September 2001. 199-12324, February 2003.