Unified Hardware Architecture for 128-bit Block Ciphers AES and Camellia
A. Satoh and S. Morioka Tokyo Research Laboratory IBM Japan Ltd. Contents
Unified S-Box
Unified Permutation Layer
Unified Data Path Architecture
ASIC Implementation Results
Conclusion Unified S-Box AES S-Box
Combinations of GF(28) inverter and affine transformations Inverter followed by affine transformation for encryption and inverter follows affine transformation for decryption
SubBytes InvSubBytes
x0 y0 x0 y0 x1 y1 x1 y1 x2 y2 x2 y2 x3 GF(28 ) y3 x3 GF(2 8 ) y3 x4 Inverter y4 x4 Inverter y4 x5 y5 x5 y5 x6 y6 x6 y6 x7 y7 x7 y7 Affine Transformation A Affine Transformation A-1
1 0 0 0 1 1 1 1 a 1 0 0 1 0 0 1 0 1 a ⊕ 1 0 0 1 1 0 0 0 1 1 1 a1 1 1 0 0 1 0 0 1 0 a1 ⊕ 1 1 1 1 0 0 0 1 1 a 0 0 1 0 0 1 0 0 1 a 2 2 1 1 1 1 0 0 0 1 a3 0 1 0 1 0 0 1 0 0 a3 ⊕ 1 1 1 1 1 0 0 0 a 0 4 0 1 0 1 0 0 1 0 a4 0 1 1 1 1 1 0 0 a 1 0 0 1 0 1 0 0 1 a ⊕ 1 5 5 0 0 1 1 1 1 1 0 1 a6 1 0 0 1 0 1 0 0 a6 ⊕ 1 0 0 0 1 1 1 1 1 a 0 7 0 1 0 0 1 0 1 0 a7 Camellia S-Box
GF((24)2) inverter is placed between two affine transformations Four S-Boxes S1~S4 (I/O ordering is differed) are used Feistel-type cipher Camellia uses same S-Box in encryption and decryption s1 s2 x0 y0 8 8 x >>1 y x1 y1 s1 x2 y2 x3 GF((242 ) ) y3 s3 8 8 x4 Inverter y4 x s1 <<1 y x5 y5 x6 y6 s4 8 8 x7 y7 x <<1 s1 y
Affine TransformationFH Affine Transformation
0 1 0 0 0 1 0 0 a ⊕ 1 0 1 0 0 1 1 0 0 a 0 0 0 1 0 0 0 0 0 1 0 a1 ⊕ 1 0 1 0 0 0 1 0 0 a1 1 0 0 1 0 1 0 0 1 a 0 0 0 1 0 0 1 0 a 1 2 2 0 0 1 0 0 0 0 1 a3 0 1 0 0 0 0 0 1 a3 0 ⊕ 0 0 0 1 0 0 1 0 a 4 0 0 1 0 0 0 1 0 a4 1 0 1 0 0 1 0 0 0 a5 ⊕ 1 1 0 0 0 0 0 0 1 a 1 5 1 0 0 0 0 0 0 1 a 6 1 0 0 0 1 0 0 0 a6 1 0 0 0 1 0 1 0 0 a ⊕ 1 7 0 0 1 0 0 1 0 0 a7 0 Unified S-Box
(1) AES S-Boxes (4) Combine affine and isomorphism
0 -1% 0 (28 ) (28 ) δ 4 2 δ A GF A A-1 GF GF((2 ) ) Inverter Inverter Inverter A-1% δ 1 δ -1 1 SubBytes InvSubBytes
(2) Share GF(28) inverter (5) Share common terms
0 A 0 0 0 GF(28 ) δ GF((24 )2) δ -1%A -1 Inverter -1% Inverter -1 A 1 1 A δ 1 δ 1
(3) Use GF((24)2) inverter (6) Merge Camellia S-Box
0 4 2 A 0 δ 0 δ -1%A 0 GF((2 ) ) GF((24 )2) δ δ -1 A-1% δ 1 δ -1 1 -1 Inverter Inverter A 1 1 FH2 2 Modifying affine transformations
In order to share common terms, bit inverse operations on input are converted to the operations on output
-1 δ 0 4 2 δ %A 0 -1 GF((2 ) ) A-1% δ 1 δ -1 1 A Inverter F 2 H 2
1 1 0 0 0 1 1 0 a 0 1 0 0 0 1 0 0 a ⊕ 1 A-1×δ 0 F 0 0 1 1 1 0 0 0 1 a1 ⊕ 1 1 0 0 0 0 0 1 0 a1 ⊕ 1 0 1 1 1 1 0 0 0 a ⊕ 1 0 0 1 0 1 0 0 1 a 2 2 1 1 1 1 0 1 1 1 a3 0 0 1 0 0 0 0 1 a3 1 0 1 0 0 0 0 0 a4 0 0 0 1 0 0 1 0 a4 0 0 1 0 1 0 1 0 a 0 1 0 0 1 0 0 0 a ⊕ 1 5 5 0 1 1 0 1 1 0 0 a6 ⊕ 1 1 0 0 0 0 0 0 1 a6 0 0 1 0 0 0 1 0 a7 ⊕ 1 0 0 0 1 0 1 0 0 a7 ⊕ 1
1 1 0 0 0 1 1 0a 0 0 1 0 0 0 1 0 0 a0 0 0 0 1 1 1 0 0 0 1a1 1 1 0 0 0 0 0 1 0 a1 1 0 1 1 1 1 0 0 0a 0 0 0 1 0 1 0 0 1 a 1 2 2 1 1 1 1 0 1 1 1a3 0 0 0 1 0 0 0 0 1 a3 1 ⊕ ⊕ 0 0 0 1 0 0 1 0 a 0 1 0 1 0 0 0 0 0a4 1 4 0 0 1 0 1 0 1 0a 0 0 1 0 0 1 0 0 0 a 1 5 5 0 1 1 0 1 1 0 0a6 0 1 0 0 0 0 0 0 1 a6 0 0 0 0 1 0 1 0 0 a 1 0 0 1 0 0 0 1 0a7 0 7 Common term sharing
δ 0 δ -1%A 0 GF((24 )2) -1% δ 1 δ -1 1 A Inverter Merging 6 matrices FH2 2
into 2 matrices 1 0 1 0 0 0 0 0 a 1 0 0 0 0 1 1 0 a 0 0 0 1 0 1 0 1 1 0 0 a1 1 1 0 1 0 0 0 0 a1 1 1 1 0 1 0 0 1 0 a 1 0 0 0 1 1 1 0 a 1 XORs are reduced 2 2 0 1 1 1 0 0 0 0 a3 0 1 1 1 1 0 1 1 a3 0 -1 ⊕ 1 1 0 0δ 0 1 1 0 a 0 0 δ0 0 ×0A1 0 1 a 0 by 40% 4 4 0 1 0 1 0 0 1 0 a 0 1 0 1 1 0 0 1 a 0 5 5 0 0 0 0 1 0 1 0 a6 1 0 0 0 1 1 1 1 a6 1 1 1 0 1 1 1 0 1 a7 0 1 1 0 0 1 0 1 a7 1
δ 20 XORs 1 1 0 0 0 1 1 0a 0 0 0 1 0 0 1 0 0 a 0 0 -1 0 1 1 1 0 0 0 1a1 1 1 1 1 0 1 1 1 0 a1 δ 21 XORs 0 1 1 1 1 0 0 0a 0 1 0 1 0 0 1 0 0 a 2 2 -1 1 1 1 1 0 1 1 1a3 0 0 1 0 1 1 0 1 0 a3 A ×δ 22 XORs -1 ⊕ -1 A ×δ 1 0 1 δ1 0 0 1 0 a 1 0 1 0 0 0 0 0a4 1 4 -1 0 0 1 0 1 0 1 0 a 0 0 1 1 1 0 0 1 0 a5 Original δ ×A 21 XORs 5 1 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0a6 0 a6 0 1 0 1 0 0 0 1 a F 9 XORs 0 0 1 0 0 0 1 0a7 0 7
0 1 0 0 0 1 0 0 a 0 0 1 0 0 1 1 0 0 a 0 H 9 XORs 0 0 1 0 0 0 0 0 1 0 a1 1 0 1 0 0 0 1 0 0 a1 1 0 0 1 0 1 0 0 1 a 1 0 0 0 1 0 0 1 0 a 1 Total 102 XORs 2 2 0 0 1 0 0 0 0 1 a3 1 0 1 0 0 0 0 0 1 a3 0 ⊕ ⊕ AES+ 0 0 0 1F 0 0 1 0 a 0 0 0 1 0H0 0 1 0 a 1 Merged 56~60 XORs 4 4 0 1 0 0 1 0 0 0 a 1 1 0 0 0 0 0 0 1 a 1 S1~S4 5 5 1 0 0 0 0 0 0 1 a6 0 1 0 0 0 1 0 0 0 a6 1 0 0 0 1 0 1 0 0 1 a7 0 0 1 0 0 1 0 0 a7 0 Performance of unified S-Box
1/2 size in comparison with discretely implemented S-Boxes Speed degraded by 20% due to selectors for component sharing
((24 )2) 280 gates -1% GF δ InvSubBytes A δ Inverter 3.65 ns (AES)
((24 )2) 280 gates Total δ GF -1% SubBytes Inverter δ A 3.56 ns 816 gates (AES)
((24 )2) 256 gates F GF H S1~S4 Inverter 3.45 ns (Camellia)
δ 0 δ -1 %A 0 SubBytes+ GF((24 )2) 411~414 gates -1% δ 1 δ -1 1 InvSubBytes+ A Inverter S1~S4 FH2 2 4.29~4.65 ns (AES+Camellia) Unified Permutation Layer Merging MixColumns and InvMixColumns
Break permutation layers of AES (MixColumns and InvMixColumns) into {01, 02, 04, 08} elements All elements of MixColumns are included in InvMixColumns
MixColumns InvMixColumns
z 02 03 01 01 x yy 0E0E 0B0B 0D0D 0909 xx 0 0 Same 0 0 00 z1 01 02 03 01 x1 yy11 0909 0E0E 0B0B 0D0D xx11 = ⋅⋅ == ⋅⋅ z 01 01 02 03 x yy 0D0D 0909 0E0E 0B0B xx 2 22 Matrices 22 22 z 03 01 01 02 x z3 03 01 01 02 x3 yy33 0B0B 0D0D 0909 0E0E xx33 00 01 01 01 x 02 02 00 00 x 00 01 01 01 x 02 02 00 00 x 0 0 0 0 01 00 01 01 x1 00 02 02 00 x1 01 00 01 01 x 00 02 02 00 x = ⋅ + ⋅ = ⋅ 1 + ⋅ 1 01 01 00 01 x 00 00 02 02 x 01 01 00 01 x 00 00 02 02 x 2 2 2 2 01 01 01 00 x3 02 00 00 02 x3 01 01 01 00 x3 02 00 00 02 x3 04 00 04 00 x 08 08 08 08 x 0 0 00 04 00 04 x 08 08 08 08 x + ⋅ 1 + ⋅ 1 04 00 04 00 x 08 08 08 08 x 2 2 00 04 00 04 x3 08 08 08 08 x3 Merging P-function and InvMixColumns
Generate 64-bit matrix contains two InvMixColumns Factor common terms between P-function and 64-bit InvMixColumns
P - function InvMixColumns
w 0 1 0 1 1 0 1 1 1 x 0 y 0 E B D 9 x0 w1 1 1 0 1 1 0 1 1 x1 y1 9 E B D x1 0 w 2 1 1 1 0 1 1 0 1 x 2 y 2 D 9 E B x 2 w 3 0 1 1 1 1 1 1 0 x 3 y3 B D 9 E x3 = ⋅ = ⋅ w 1 1 0 0 0 1 1 1 x y E B D 9 x 4 4 4 4 9 E B D w 5 0 1 1 0 1 0 1 1 x 5 y5 x5 0 w 0 0 1 1 1 1 0 1 x D 9 E B 6 6 y 6 x6 w 1 0 0 1 1 1 1 0 x B D 9 E 7 7 y 7 x7 1 0 0 0 0 1 1 1 x x x x 0 0 0 1 1 0 E A E 8 0 0 1 1 1 0 1 0 1 1 0 1 0 0 x1 1 0 0 1 x1 8 E A E x1 1 0 1 1 x1 1 1 0 1 0 0 0 0 0 1 0 x 2 1 1 0 0 x 2 E 8 E A x 2 1 1 0 1 x2 1 1 1 0 0 0 0 1 x 3 0 1 1 0 x 3 A E 8 E x3 1 1 1 0 x3 = ⋅ + ⋅ = ⋅ + ⋅ 1 1 0 0 x 0 1 1 1 x E A E 8 x 0 1 1 1 x 4 4 4 4 1 0 1 1 8 E A E 1 0 1 1 0 1 1 0 x 5 x 5 x5 x5 0 0 0 0 0 1 1 x 1 1 0 1 x E 8 E A 1 1 0 1 0 6 6 x6 x6 1 0 0 1 x 1 1 1 0 x A E 8 E 1 1 1 0 7 7 x7 x7 Performance of unified permutation layer
Number of XOR terms is reduced to 1/3 of the original matrices with two additional XOR-gates of delay x
{08,04,00} {02,01,00} {01,00} {01,00} -element -element -element -element Matrices Matrices Matrices Matrices
y z w InvMixColumns MixColumns P-function
Original Matrices Shared MixCol InvMixCol P-funk Total Matrix XORs 304 880 288 1,482 476 Delay (gates) 3 5 3 5 7 Unified Data Path Architecture AES Algorithm
SPN structure using 4 primitive functions takes 11 rounds
Cipherihg Block a a a a 00 0102 03 k 00kkk01 02 03 b00bbb01 02 03 128-bit plain text 1-bit a a a a kkkk bbbb 10 1112 13 1011 12 13 = 1011 12 13 88 8 XOR a a a a kkkk bbbb 128 20 2122 23 2021 22 23 2021 22 23 AddRoundKey a30 a31a32 a 33 kkkk3031 32 33 bbbb3031 32 33 SubBytes a a a a b bbb ShiftRows 8-bit 00 0102 03 S-Box 0001 02 03 a a a a bbbbb MixColumns S-Box 10 11ij 13 1011 12ij 13 a a a a bbbb AddRoundKey 16 20 2122 23 2021 22 23 a30 a31a32 a 33 bbbb3031 32 33
a a a a no shift a a a a 32-bit 0001 02 03 0001 02 03 SubBytes a a a a left rotation by 1 a Rotation 10 1112 13 10 ShiftRows a a a a left rotation by 2 a a 4 2021 22 23 20 21 MixColumns a a a a left rotation by 3 a a a 3031 32 33 30 31 32 AddRoundKey
a b SubBytes a a 0j a b bb0j 32-bit 00 01 03 c(x ) 00 01 03 ShiftRows a a a1j a bbb1j b Permutation 10 11 13 10 11 13 AddRoundKey a20 a21a a 23 bb20 21b b 23 4 2j 2j 88 8 a30 a31a 33 bb30 31 b 33 a3j b3j 128-bit cipher text Camellia Algorithm
2 FL/FL-1 functions are inserted between 3 Feistel network blocks It takes 22 rounds by processing 64 bits in each round
k Plain Text Feistel network P function 8 64 64 64 64 64 64 z8 k1 x8 8 8 8 S1 z'8 kw1 kw2 F z7 x7 8 8 8 S4 z'7 z6 Feistel x6 8 8 z'6 k k2 8 S3 1~6 Network z5 F x5 8 8 8 S2 z'5 z4 -1 x4 8 8 kl1 FL FL kl2 8 S4 z'4 k3 z3 x3 8 8 z'3 F 8 S3 Feistel z2 k x2 8 8 7~12 Network 8 S2 z'2 8 z1 8 k4 x1 S1 z'1 -1 F kl3 FL FL kl4 FL function FL-1 function Feistel k5 k13~18 64 64 Network F 32 32 32 32 kl64 32 32 64 kl k6 F AND <<1 OR kw3 kw4 32 32
Cipher Text OR AND <<1 32 64 32 32 64 32 Ciphering block
Data/Key Input 8 (64-bit) S-Boxes and a 64-bit 32 32 32 32 permutation layer in data path 2:1 2:1 2:1 2:1 FL/FL-1 functions and key Switching whitening are also merged for Matrix Data Camellia Output
AES reuses S-Box for key 128 32 64 scheduling block 32 32
Key 64 2:1 2:1 2:1 AES takes 31clocks Scheduler 128 FL 64 FL-1 S S Camellia takes 22 clocks 32 P P
FL64 FL-1 64 32 32 3:1 3:1 32 3232 32 32 32 En klL <<1 Sel_FL_Kadd 32 32 klH 32 >>1 32 32 32 >>1 klH Key Whitening Sel_FL_Kadd <<1 klL En 2:1 2:1 3:1 3:1 64 64 Key scheduler
Camellia block AES block
128 Rcon 2:1
AES repeats XOR 32 32
operations and 128 128 2:1
Camellia repeats bit 32 32 2:1 2:1 rotations 2:1 128 128 Only registers can be >>16 <<16 32 32 shared because key 2:1 2:1 32 32 scheduling methods are >>1 <<1
64 completely different 3:1 2:1 128 32 To AES <<8 InvMix S-Box Col.
2:1 128
Σ1~4 0 64 64 32 From AES 4:1 S-Box ASIC Implementation Results ASIC Implementation Results
Two circuits (area optimized and speed optimized) were synthesized by using 0.13um CMOS ASIC library 15~25 Kgates of unified hardware are smaller by 30% compared with 22~35 Kgates of discrete designs Throughputs are lower by 9~14% for AES,and by 31~40% for Camellia Gate Max. Freq. Throughput Synthesis Algorithm Cycle Counts (MHz) (Mbps) Optimization AES 31 469.22 14,918 113.64 Area Camellia 22 661.18 This Work AES 31 794.05 24,730 192.31 Speed Camellia 22 1,118.89 ASIACRYPT 7,998 153.85 548.68 Area AES 32 2001 14,777 218.82 875.28 Speed 13,557 153.85 1,094.04 Area 3rd NESSIE Camellia 18 19,783 227.27 1,616.14 Speed Conclusions Conclusions
Unified hardware architecture for AES and Camellia Share GF((24)2) inverters in all S-Boxes Merge affine transformations and isomorphism functions Factorize common terms in permutation layers Combine FL/FL-1 functions and key whitening Reuse S-Boxes in ciphering block for key scheduler Share key registers Implemented by using 0.13-um CMOS ASIC library 15 Kgates with 459 Mbps (AES) and 647 Mbps (Camellia) Smaller than discrete implementations by 30%