<<

Unified Hardware Architecture for 128-bit Block Ciphers AES and Camellia

A. Satoh and S. Morioka Tokyo Research Laboratory IBM Japan Ltd. Contents

Š Unified S-Box

Š Unified Permutation Layer

Š Unified Data Path Architecture

Š ASIC Implementation Results

Š Conclusion Unified S-Box AES S-Box

Š Combinations of GF(28) inverter and affine transformations Š Inverter followed by affine transformation for and inverter follows affine transformation for decryption

SubBytes InvSubBytes

x0 y0 x0 y0 x1 y1 x1 y1 x2 y2 x2 y2 x3 GF(28 ) y3 x3 GF(2 8 ) y3 x4 Inverter y4 x4 Inverter y4 x5 y5 x5 y5 x6 y6 x6 y6 x7 y7 x7 y7 Affine Transformation A Affine Transformation A-1

1 0 0 0 1 1 1 1 a  1  0 0 1 0 0 1 0 1 a ⊕ 1    0       0  1 1 0 0 0 1 1 1 a1  1  1 0 0 1 0 0 1 0 a1 ⊕ 1   1 1 1 0 0 0 1 1 a 0 0 1 0 0 1 0 0 1 a     2       2  1 1 1 1 0 0 0 1 a3  0 1 0 1 0 0 1 0 0 a3      ⊕     1 1 1 1 1 0 0 0 a 0      4    0 1 0 1 0 0 1 0 a4  0 1 1 1 1 1 0 0 a  1  0 0 1 0 1 0 0 1 a ⊕ 1    5       5  0 0 1 1 1 1 1 0 1   a6    1 0 0 1 0 1 0 0 a6 ⊕ 1 0 0 0 1 1 1 1 1 a  0        7    0 1 0 0 1 0 1 0 a7  Camellia S-Box

Š GF((24)2) inverter is placed between two affine transformations Š Four S-Boxes S1~S4 (I/O ordering is differed) are used Š Feistel-type cipher Camellia uses same S-Box in encryption and decryption s1 s2 x0 y0 8 8 x >>1 y x1 y1 s1 x2 y2 x3 GF((242 ) ) y3 s3 8 8 x4 Inverter y4 x s1 <<1 y x5 y5 x6 y6 s4 8 8 x7 y7 x <<1 s1 y

Affine TransformationFH Affine Transformation

0 1 0 0 0 1 0 0 a ⊕ 1 0 1 0 0 1 1 0 0 a  0    0     0    1 0 0 0 0 0 1 0 a1 ⊕ 1 0 1 0 0 0 1 0 0 a1  1    0 0 1 0 1 0 0 1 a 0 0 0 1 0 0 1 0 a  1     2     2    0 0 1 0 0 0 0 1 a3  0 1 0 0 0 0 0 1 a3  0       ⊕ 0 0 0 1 0 0 1 0 a        4  0 0 1 0 0 0 1 0 a4  1        0 1 0 0 1 0 0 0 a5 ⊕ 1 1 0 0 0 0 0 0 1 a 1         5    1 0 0 0 0 0 0 1 a    6  1 0 0 0 1 0 0 0 a6  1  0 0 0 1 0 1 0 0 a ⊕ 1          7  0 0 1 0 0 1 0 0 a7  0 Unified S-Box

(1) AES S-Boxes (4) Combine affine and isomorphism

0 -1% 0 (28 ) (28 ) δ 4 2 δ A GF A A-1 GF GF((2 ) ) Inverter Inverter Inverter A-1% δ 1 δ -1 1 SubBytes InvSubBytes

(2) Share GF(28) inverter (5) Share common terms

0 A 0 0 0 GF(28 ) δ GF((24 )2) δ -1%A -1 Inverter -1% Inverter -1 A 1 1 A δ 1 δ 1

(3) Use GF((24)2) inverter (6) Merge Camellia S-Box

0 4 2 A 0 δ 0 δ -1%A 0 GF((2 ) ) GF((24 )2) δ δ -1 A-1% δ 1 δ -1 1 -1 Inverter Inverter A 1 1 FH2 2 Modifying affine transformations

Š In order to share common terms, bit inverse operations on input are converted to the operations on output

-1 δ 0 4 2 δ %A 0 -1 GF((2 ) ) A-1% δ 1 δ -1 1 A Inverter F 2 H 2

1 1 0 0 0 1 1 0 a  0 1 0 0 0 1 0 0 a ⊕ 1 A-1×δ    0  F    0  0 1 1 1 0 0 0 1 a1 ⊕ 1 1 0 0 0 0 0 1 0 a1 ⊕ 1 0 1 1 1 1 0 0 0 a ⊕ 1 0 0 1 0 1 0 0 1 a     2     2  1 1 1 1 0 1 1 1 a3  0 0 1 0 0 0 0 1 a3          1 0 1 0 0 0 0 0 a4  0 0 0 1 0 0 1 0 a4  0 0 1 0 1 0 1 0 a  0 1 0 0 1 0 0 0 a ⊕ 1    5     5  0 1 1 0 1 1 0 0 a6 ⊕ 1 1 0 0 0 0 0 0 1 a6          0 0 1 0 0 0 1 0 a7 ⊕ 1 0 0 0 1 0 1 0 0 a7 ⊕ 1

1 1 0 0 0 1 1 0a  0 0 1 0 0 0 1 0 0 a0  0   0          0 1 1 1 0 0 0 1a1  1  1 0 0 0 0 0 1 0 a1  1    0 1 1 1 1 0 0 0a  0 0 0 1 0 1 0 0 1 a 1    2       2    1 1 1 1 0 1 1 1a3  0 0 0 1 0 0 0 0 1 a3  1    ⊕     ⊕       0 0 0 1 0 0 1 0 a 0 1 0 1 0 0 0 0 0a4  1     4    0 0 1 0 1 0 1 0a  0 0 1 0 0 1 0 0 0 a  1    5       5    0 1 1 0 1 1 0 0a6  0 1 0 0 0 0 0 0 1 a6  0      0 0 0 1 0 1 0 0 a  1  0 0 1 0 0 0 1 0a7  0    7    Common term sharing

δ 0 δ -1%A 0 GF((24 )2) -1% δ 1 δ -1 1 A Inverter Š Merging 6 matrices FH2 2

into 2 matrices 1 0 1 0 0 0 0 0 a  1 0 0 0 0 1 1 0 a  0    0     0    1 0 1 0 1 1 0 0 a1  1 1 0 1 0 0 0 0 a1  1  1 1 0 1 0 0 1 0 a  1 0 0 0 1 1 1 0 a  1  Š XORs are reduced    2     2    0 1 1 1 0 0 0 0 a3  0 1 1 1 1 0 1 1 a3  0      -1    ⊕   1 1 0 0δ 0 1 1 0 a 0 0 δ0 0 ×0A1 0 1 a 0 by 40%    4     4    0 1 0 1 0 0 1 0 a  0 1 0 1 1 0 0 1 a  0    5     5    0 0 0 0 1 0 1 0 a6  1 0 0 0 1 1 1 1 a6  1            1 1 0 1 1 1 0 1 a7  0 1 1 0 0 1 0 1 a7  1 

δ 20 XORs 1 1 0 0 0 1 1 0a  0 0 0 1 0 0 1 0 0 a    0       0  -1 0 1 1 1 0 0 0 1a1  1  1 1 1 0 1 1 1 0 a1    δ 21 XORs 0 1 1 1 1 0 0 0a  0 1 0 1 0 0 1 0 0 a   2       2  -1 1 1 1 1 0 1 1 1a3  0 0 1 0 1 1 0 1 0 a3  A ×δ 22 XORs -1   ⊕  -1     A ×δ    1 0 1 δ1 0 0 1 0 a 1 0 1 0 0 0 0 0a4  1     4  -1       0 0 1 0 1 0 1 0 a 0 0 1 1 1 0 0 1 0 a5 Original δ ×A 21 XORs   5        1 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0a6  0   a6       0 1 0 1 0 0 0 1 a  F 9 XORs 0 0 1 0 0 0 1 0a7  0    7 

0 1 0 0 0 1 0 0 a  0 0 1 0 0 1 1 0 0 a  0 H 9 XORs    0       0    1 0 0 0 0 0 1 0 a1  1  0 1 0 0 0 1 0 0 a1  1  0 0 1 0 1 0 0 1 a  1  0 0 0 1 0 0 1 0 a  1  Total 102 XORs    2       2    0 0 1 0 0 0 0 1 a3  1  0 1 0 0 0 0 0 1 a3  0     ⊕       ⊕   AES+ 0 0 0 1F 0 0 1 0 a 0 0 0 1 0H0 0 1 0 a 1 Merged 56~60 XORs    4       4    0 1 0 0 1 0 0 0 a  1  1 0 0 0 0 0 0 1 a  1  S1~S4    5       5    1 0 0 0 0 0 0 1 a6  0 1 0 0 0 1 0 0 0 a6  1  0 0 0 1 0 1 0 0   1          a7    0 0 1 0 0 1 0 0 a7  0 Performance of unified S-Box

Š 1/2 size in comparison with discretely implemented S-Boxes Š Speed degraded by 20% due to selectors for component sharing

((24 )2) 280 gates -1% GF δ InvSubBytes A δ Inverter 3.65 ns (AES)

((24 )2) 280 gates Total δ GF -1% SubBytes Inverter δ A 3.56 ns 816 gates (AES)

((24 )2) 256 gates F GF H S1~S4 Inverter 3.45 ns (Camellia)

δ 0 δ -1 %A 0 SubBytes+ GF((24 )2) 411~414 gates -1% δ 1 δ -1 1 InvSubBytes+ A Inverter S1~S4 FH2 2 4.29~4.65 ns (AES+Camellia) Unified Permutation Layer Merging MixColumns and InvMixColumns

Š Break permutation layers of AES (MixColumns and InvMixColumns) into {01, 02, 04, 08} elements Š All elements of MixColumns are included in InvMixColumns

MixColumns InvMixColumns

 z  02 03 01 01  x  yy  0E0E 0B0B 0D0D 0909 xx   0     0  Same 0 0     00  z1   01 02 03 01  x1  yy11  0909 0E0E 0B0B 0D0D xx11    = ⋅⋅  == ⋅⋅  z   01 01 02 03 x  yy  0D0D 0909 0E0E 0B0B xx   2     22  Matrices 22     22  z  03 01 01 02  x         z3   03 01 01 02  x3  yy33  0B0B 0D0D 0909 0E0E xx33 00 01 01 01  x  02 02 00 00  x  00 01 01 01  x  02 02 00 00  x     0     0     0     0   01 00 01 01  x1  00 02 02 00  x1   01 00 01 01  x  00 02 02 00  x  = ⋅ + ⋅ = ⋅ 1 + ⋅ 1  01 01 00 01  x  00 00 02 02  x   01 01 00 01  x  00 00 02 02  x     2     2     2     2                   01 01 01 00  x3  02 00 00 02  x3   01 01 01 00  x3  02 00 00 02  x3  04 00 04 00  x  08 08 08 08  x     0     0  00 04 00 04  x  08 08 08 08  x  + ⋅ 1 + ⋅ 1 04 00 04 00  x  08 08 08 08  x     2     2          00 04 00 04  x3  08 08 08 08  x3  Merging P-function and InvMixColumns

Š Generate 64-bit matrix contains two InvMixColumns Š Factor common terms between P-function and 64-bit InvMixColumns

P - function InvMixColumns

 w 0  1 0 1 1 0 1 1 1  x 0   y 0  E B D 9   x0               w1  1 1 0 1 1 0 1 1  x1   y1  9 E B D   x1           0    w 2 1 1 1 0 1 1 0 1 x 2 y 2 D 9 E B x 2              w 3   0 1 1 1 1 1 1 0  x 3   y3  B D 9 E   x3    =   ⋅     = ⋅  w 1 1 0 0 0 1 1 1 x y  E B D 9 x  4     4   4     4          9 E B D   w 5 0 1 1 0 1 0 1 1 x 5 y5   x5          0    w 0 0 1 1 1 1 0 1 x D 9 E B  6     6   y 6     x6     w  1 0 0 1 1 1 1 0  x    B D 9 E    7     7   y 7     x7  1 0 0 0 0 1 1 1  x   x   x   x     0   0 0 1 1   0   E A E 8   0  0 1 1 1   0  1 0 1 1        0 1 0 0   x1  1 0 0 1   x1  8 E A E   x1  1 0 1 1   x1   1 1 0 1    0     0     0    0 0 1 0 x 2 1 1 0 0 x 2 E 8 E A x 2 1 1 0 1 x2                  1 1 1 0 0 0 0 1  x 3   0 1 1 0   x 3   A E 8 E   x3  1 1 1 0   x3  =   ⋅   + ⋅   = ⋅  + ⋅  1 1 0 0 x  0 1 1 1  x  E A E 8 x  0 1 1 1  x    4     4     4     4      1 0 1 1   8 E A E   1 0 1 1   0 1 1 0 x 5   x 5   x5   x5      0     0     0     0 0 1 1  x 1 1 0 1 x E 8 E A 1 1 0 1 0  6     6     x6     x6          1 0 0 1  x  1 1 1 0  x  A E 8 E   1 1 1 0      7     7     x7     x7  Performance of unified permutation layer

Š Number of XOR terms is reduced to 1/3 of the original matrices with two additional XOR-gates of delay x

{08,04,00} {02,01,00} {01,00} {01,00} -element -element -element -element Matrices Matrices Matrices Matrices

y z w InvMixColumns MixColumns P-function

Original Matrices Shared MixCol InvMixCol P-funk Total Matrix XORs 304 880 288 1,482 476 Delay (gates) 3 5 3 5 7 Unified Data Path Architecture AES Algorithm

Š SPN structure using 4 primitive functions takes 11 rounds

Cipherihg Block a a a a 00 0102 03 k 00kkk01 02 03 b00bbb01 02 03 128-bit plain text 1-bit a a a a kkkk bbbb 10 1112 13 1011 12 13 = 1011 12 13 88 8 XOR a a a a kkkk bbbb 128 20 2122 23 2021 22 23 2021 22 23 AddRoundKey a30 a31a32 a 33 kkkk3031 32 33 bbbb3031 32 33 SubBytes a a a a b bbb ShiftRows 8-bit 00 0102 03 S-Box 0001 02 03 a a a a bbbbb MixColumns S-Box 10 11ij 13 1011 12ij 13 a a a a bbbb AddRoundKey 16 20 2122 23 2021 22 23 a30 a31a32 a 33 bbbb3031 32 33

a a a a no shift a a a a 32-bit 0001 02 03 0001 02 03 SubBytes a a a a left rotation by 1 a Rotation 10 1112 13 10 ShiftRows a a a a left rotation by 2 a a 4 2021 22 23 20 21 MixColumns a a a a left rotation by 3 a a a 3031 32 33 30 31 32 AddRoundKey

a b SubBytes a a 0j a b bb0j 32-bit 00 01 03 c(x ) 00 01 03 ShiftRows a a a1j a bbb1j b Permutation 10 11 13 10 11 13 AddRoundKey a20 a21a a 23 bb20 21b b 23 4 2j 2j 88 8 a30 a31a 33 bb30 31 b 33 a3j b3j 128-bit cipher text Camellia Algorithm

Š 2 FL/FL-1 functions are inserted between 3 Feistel network blocks Š It takes 22 rounds by processing 64 bits in each round

k Plain Text Feistel network P function 8 64 64 64 64 64 64 z8 k1 x8 8 8 8 S1 z'8 kw1 kw2 F z7 x7 8 8 8 S4 z'7 z6 Feistel x6 8 8 z'6 k k2 8 S3 1~6 Network z5 F x5 8 8 8 S2 z'5 z4 -1 x4 8 8 kl1 FL FL kl2 8 S4 z'4 k3 z3 x3 8 8 z'3 F 8 S3 Feistel z2 k x2 8 8 7~12 Network 8 S2 z'2 8 z1 8 k4 x1 S1 z'1 -1 F kl3 FL FL kl4 FL function FL-1 function Feistel k5 k13~18 64 64 Network F 32 32 32 32 kl64 32 32 64 kl k6 F AND <<1 OR kw3 kw4 32 32

Cipher Text OR AND <<1 32 64 32 32 64 32 Ciphering block

Data/ Input Š 8 (64-bit) S-Boxes and a 64-bit 32 32 32 32 permutation layer in data path 2:1 2:1 2:1 2:1 Š FL/FL-1 functions and key Switching whitening are also merged for Matrix Data Camellia Output

Š AES reuses S-Box for key 128 32 64 scheduling block 32 32

Key 64 2:1 2:1 2:1 Š AES takes 31clocks Scheduler 128 FL 64 FL-1 S S Š Camellia takes 22 clocks 32 P P

FL64 FL-1 64 32 32 3:1 3:1 32 3232 32 32 32 En klL <<1 Sel_FL_Kadd 32 32 klH 32 >>1 32 32 32 >>1 klH Sel_FL_Kadd <<1 klL En 2:1 2:1 3:1 3:1 64 64 Key scheduler

Camellia block AES block

128 Rcon 2:1

Š AES repeats XOR 32 32

operations and 128 128 2:1

Camellia repeats bit 32 32 2:1 2:1 rotations 2:1 128 128 Š Only registers can be >>16 <<16 32 32 shared because key 2:1 2:1 32 32 scheduling methods are >>1 <<1

64 completely different 3:1 2:1 128 32 To AES <<8 InvMix S-Box Col.

2:1 128

Σ1~4 0 64 64 32 From AES 4:1 S-Box ASIC Implementation Results ASIC Implementation Results

Š Two circuits (area optimized and speed optimized) were synthesized by using 0.13um CMOS ASIC library Š 15~25 Kgates of unified hardware are smaller by 30% compared with 22~35 Kgates of discrete designs Š Throughputs are lower by 9~14% for AES,and by 31~40% for Camellia Gate Max. Freq. Throughput Synthesis Algorithm Cycle Counts (MHz) (Mbps) Optimization AES 31 469.22 14,918 113.64 Area Camellia 22 661.18 This Work AES 31 794.05 24,730 192.31 Speed Camellia 22 1,118.89 ASIACRYPT 7,998 153.85 548.68 Area AES 32 2001 14,777 218.82 875.28 Speed 13,557 153.85 1,094.04 Area 3rd NESSIE Camellia 18 19,783 227.27 1,616.14 Speed Conclusions Conclusions

Š Unified hardware architecture for AES and Camellia Š Share GF((24)2) inverters in all S-Boxes Š Merge affine transformations and isomorphism functions Š Factorize common terms in permutation layers Š Combine FL/FL-1 functions and key whitening Š Reuse S-Boxes in ciphering block for key scheduler Š Share key registers Š Implemented by using 0.13-um CMOS ASIC library Š 15 Kgates with 459 Mbps (AES) and 647 Mbps (Camellia) Š Smaller than discrete implementations by 30%