THE ADVANCED COMPUTING SYSTEMS ASSOCIATION
The following paper was originally published in the USENIX Workshop on Smartcard Technology
Chicago, Illinois, USA, May 10–11, 1999
Efficient Block Ciphers for Smartcards
Joan Daemen Proton World International
Vincent Rijmen K.U. Leuven
© 1999 by The USENIX Association All Rights Reserved
Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.
For more information about the USENIX Association: Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: [email protected] WWW: http://www.usenix.org
Efficient Block Ciphers for Smartcards
Joan Daemen
Proton World Int'l
Zweefvliegtuigstraat 10
B-1130 Brussel, Belgium
Vincent Rijmen
K.U.Leuven, Dept. ESAT
Kard. Mercierlaan 94
B-3001 Heverlee, Belgium
April 1, 1999
Abstract

We present a family of block ciphers that can be implemented very efficiently on cheap Smartcard processors. The ciphers use a very small amount of RAM and a reasonable amount of ROM. Both cipher execution and key setup/key change are very fast. The ciphers resist theoretical and practical cryptanalytic attacks, and timing and power analysis attacks have been taken into account in their design.

1 Introduction

In many applications Smartcards are used as portable secure devices. For their security the applications make use of MAC generation/verification and encryption/decryption using a block cipher. We present a family of block ciphers that are suited for this purpose. Additionally, all these ciphers can be used as efficient one-way functions, and the variants with a block size of 192 bits or higher are efficient compression functions to form an iterated cryptographic hash function. The family is named after its first member that was designed and published: Square [1]. Currently, the Square family consists of three ciphers: Square, with a block length and a key length of 128 bits; BKSQ, with a block length of 96 bits and a variable key length (96, 144 or 192 bits); and Rijndael, with a variable block length and key length (both can be independently specified at 128, 192 or 256 bits). The three ciphers are designed to be secure against all known cryptographic attacks. For a treatment of cryptographic security and the design rationale we refer to [1,2,3]. This paper treats implementation aspects and in particular those specific for Smartcards.

In Section 2 we present the common cipher structure of the Square family. Section 3 discusses the implementation of the ciphers on typical Smartcard processors. Section 4 treats the features of the presented ciphers to thwart attacks that exploit typical weaknesses of cipher implementations on Smartcards. Section 5 lists concrete performance figures.
F.W.O. Postdoctoral researcher, sponsored by the Fund for Scientific Research - Flanders (Belgium).
2 Cipher structure

A Square cipher is an iterated block cipher: it consists of the repeated application of a round transformation that is parameterized by a round key. The round keys are derived from the cipher key by means of a key schedule. The block length is indicated by n, the cipher key length by m and the number of rounds by r.

2.1 The round transformation

The round transformation is composed of four invertible uniform transformations, called steps. These steps can be described most easily by thinking of the input as a rectangular byte array. The dimensions of this byte array vary for the different members of the family, and depend on the block size. The four steps are described as follows (cf. Figure 1).

The diffusion step: Every byte is replaced by a linear combination of the bytes within the same column. The bytes are considered as elements in the field $GF(2^8)$.

The dispersion step: A permutation of the bytes over different columns. This is done by shifting the rows of the byte array over different amounts, or by a transposition of the byte array (for Square).

The nonlinear step: A substitution of the bytes by means of a nonlinear lookup table.

The round key addition: The bytes are EXORed with an n-bit round key.

Figure 1: Graphical illustration of some basic operations of the Square ciphers (diffusion step, nonlinear step, dispersion step).

The choice for the operations in the different steps has been influenced by our wish to make the cipher efficiently implementable on Smartcards. The key addition, the dispersion and the nonlinear step all can be implemented using operations on individual bytes, the natural "unit" on an 8-bit processor. In the diffusion step, inter-byte diffusion has to be realised. On a 32-bit processor, this can be done by using operations like 32-bit rotations, multiplications, ..., but the use of these operations complicates Smartcard implementations. In the Square family, the diffusion step can be described as a matrix multiplication (cf. Section 3). The coefficients of the multiplication matrix have been selected carefully to provide diffusion that is optimal in a very definite, mathematical sense, while at the same time allowing very efficient implementation on standard Smartcard processors.

2.2 The key schedule

The round keys have length n and r + 1 round keys are required: one for every round and a final key addition. The key schedule can be thought to occur in two phases.

Generation of the expanded key: The expanded key is initialized by taking the m-bit cipher key. It is expanded by iteratively attaching m-bit blocks that are computed from the previously attached block by means of an LFSR-like computation. This is repeated until the expanded key has length n(r + 1).

Extraction of round keys: Round key i is taken from the expanded key by taking the i-th n-bit block.

The LFSR computations in the key expansion ensure that any pair of different cipher keys results in a pair of expanded keys with a large Hamming difference. The addition of round constants removes symmetry between the rounds. This is necessary in order to provide resistance against related-key attacks and attacks where the cipher key is known, e.g., if the cipher is used as the compression function of a hash function.

3 Specific Smartcard implementation aspects

In this section we discuss the implementation of the cipher on 8-bit processors with a limited amount of RAM and ROM available, i.e., typical Smartcard processors.
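To make the memory constraint concrete, the sizes implied by the two-phase key schedule of Section 2.2 can be worked out with a little bookkeeping. The sketch below is only an illustration: the function names are hypothetical, the round count r = 10 is an assumed value (this paper does not fix r), round keys are assumed to be counted from 0, and the LFSR-like expansion itself is not modelled because it is not specified here.

```python
def expanded_key_bits(n: int, r: int) -> int:
    """Bits in the expanded key: r + 1 round keys of n bits each."""
    return n * (r + 1)


def key_blocks_needed(n: int, m: int, r: int) -> int:
    """Number of m-bit blocks (the cipher key plus attached blocks) that
    cover the expanded key; the last block may be only partly used."""
    return -(-expanded_key_bits(n, r) // m)  # ceiling division


def round_key(expanded: bytes, i: int, n: int) -> bytes:
    """Round key i (counted from 0, an assumption) is the i-th n-bit block."""
    size = n // 8
    return expanded[i * size:(i + 1) * size]


# Illustrative parameters only: n = m = 128 bits and an assumed r = 10 rounds.
n, m, r = 128, 128, 10
total_bytes = expanded_key_bits(n, r) // 8   # 176 bytes
blocks = key_blocks_needed(n, m, r)          # 11 blocks of 128 bits

# Placeholder data standing in for a real expanded key.
dummy_expanded = bytes(range(total_bytes))
last_round_key = round_key(dummy_expanded, r, n)
```

Under these assumed parameters a fully expanded key would occupy 176 bytes, several times the total RAM figures reported in Table 1, which is why Section 3.3 avoids storing it.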
3.1 The round transformation

The round transformation can be implemented by serially performing the different steps.

The nonlinear step consists of a table lookup operation that is the same for all input bytes. The 256-byte lookup table is hard-coded in the cipher program and the table lookup can be implemented with a simple load accumulator instruction in indexed mode. The round key addition is implemented with the EXOR instruction. The byte dispersion step does not take dedicated instructions but is embodied in the way input bytes are loaded and stored in the preceding/succeeding steps. These three steps can be implemented in the following way: the byte is loaded into the index register, an indexed load accumulator instruction is performed, the round key byte is EXORed and the accumulator is stored to the hard-coded location.

Implementing the diffusion step is less straightforward. It takes the computation of additions and multiplications in the field $GF(2^8)$. Addition over this field corresponds to the readily available EXOR instruction. The multiplication factors are the elements of $GF(2^8)$ represented by byte values 1, 2 and 3. The multiplication by these factors can be done as follows:

1 is the identity in $GF(2^8)$ and multiplication by it does not require any computation at all.

Multiplication by 2 in the finite field could be implemented as a left shift, followed by a reduction. However, the execution time and/or the power consumption pattern of a reduction depend on the value of the operand. If the MSB of the operand is 1, the reduction takes place; if it is 0, the reduction can be skipped. This can be done in constant time by executing dummy instructions (e.g., NOP) in the case the reduction is skipped. However, this gives rise to two different sequences of operations. The operation can be implemented with a fixed series of instructions by implementing the multiplication by 2 as a table lookup with a dedicated lookup table 2mult[], that is defined as

\[ \mathrm{2mult}[a] = 2 \cdot a. \]

The fact that the execution time is independent of the argument makes this table lookup implementation timing attack resistant. We explain in Section 4 how it can be protected against power analysis.

Multiplication by 3 can be obtained by performing multiplication with 2 and adding (EXORing) the argument itself: $3 \cdot a = (2 \oplus 1) \cdot a = \mathrm{2mult}[a] \oplus a$.

In Rijndael and Square the columns consist of 4 bytes each and the diffusion step applied to a column can be described by a matrix multiplication, that is given by:

\[
\begin{bmatrix} \mathit{out}_0 \\ \mathit{out}_1 \\ \mathit{out}_2 \\ \mathit{out}_3 \end{bmatrix}
=
\begin{bmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{bmatrix}
\begin{bmatrix} \mathit{in}_0 \\ \mathit{in}_1 \\ \mathit{in}_2 \\ \mathit{in}_3 \end{bmatrix}.
\]

This can be efficiently implemented as:

\begin{align*}
p &= \mathit{in}_0 \oplus \mathit{in}_1 \oplus \mathit{in}_2 \oplus \mathit{in}_3; \\
\mathit{out}_0 &= \mathrm{2mult}[\mathit{in}_0 \oplus \mathit{in}_1]; & \mathit{out}_0 &= \mathit{out}_0 \oplus p \oplus \mathit{in}_0; \\
\mathit{out}_1 &= \mathrm{2mult}[\mathit{in}_1 \oplus \mathit{in}_2]; & \mathit{out}_1 &= \mathit{out}_1 \oplus p \oplus \mathit{in}_1; \\
\mathit{out}_2 &= \mathrm{2mult}[\mathit{in}_2 \oplus \mathit{in}_3]; & \mathit{out}_2 &= \mathit{out}_2 \oplus p \oplus \mathit{in}_2; \\
\mathit{out}_3 &= \mathrm{2mult}[\mathit{in}_3 \oplus \mathit{in}_0]; & \mathit{out}_3 &= \mathit{out}_3 \oplus p \oplus \mathit{in}_3.
\end{align*}

This implementation takes only 15 EXORs and 4 table lookups per column. It requires temporary storage for 2 bytes: the variables $p$ and $\mathit{in}_0$, if the output buffer is equal to the input buffer.

In BKSQ the columns consist of 3 bytes and the diffusion step is given by

\[
\begin{bmatrix} \mathit{out}_0 \\ \mathit{out}_1 \\ \mathit{out}_2 \end{bmatrix}
=
\begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & 2 \\ 2 & 2 & 3 \end{bmatrix}
\begin{bmatrix} \mathit{in}_0 \\ \mathit{in}_1 \\ \mathit{in}_2 \end{bmatrix}.
\]

This can be implemented with only 5 EXORs and a single table lookup per column, and only one byte of temporary storage is needed.

3.2 The inverse round transformation

The Square ciphers do not have the Feistel structure of, e.g., the DES. Whereas for Feistel ciphers the operation of the cipher can be inverted by simply reordering the round keys, this is not possible for the Square ciphers. Therefore, if the inverse operation of the cipher has to be implemented, it is necessary to implement the inverse round operation separately.

The inverse of the round key addition is the same as the round key addition: EXOR with the round key. The inverse of the nonlinear step is implemented like the nonlinear step, but uses a different lookup table. The byte dispersion step is again embodied in the way input bytes are loaded and stored in the other steps.

The inverse of the diffusion step consists of a multiplication with the inverse matrix. For Rijndael and Square, the inverse of the diffusion step is given by:

\[
\begin{bmatrix} \mathit{out}_0 \\ \mathit{out}_1 \\ \mathit{out}_2 \\ \mathit{out}_3 \end{bmatrix}
=
\begin{bmatrix} E & B & D & 9 \\ 9 & E & B & D \\ D & 9 & E & B \\ B & D & 9 & E \end{bmatrix}
\begin{bmatrix} \mathit{in}_0 \\ \mathit{in}_1 \\ \mathit{in}_2 \\ \mathit{in}_3 \end{bmatrix}.
\]

Here B, D and E denote hexadecimal values. It is easy to see that this multiplication will require more operations than the original diffusion step, namely 21 EXORs and 8 table lookups using only the previously defined table 2mult[]. If additional tables are used, the performance loss can be reduced. The storage requirements do not increase.

Note that most applications do not require the inverse operation of the cipher to be executed on the Smartcard. First of all, most of the applications use the cipher for the generation and verification of MACs and as a one-way function in the generation of session keys. In the cases where encipherment is actually used, the CFB or OFB mode can be used, where the inverse cipher is not used.
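The table-based column computations of Section 3.1 can be sketched in a few lines and checked against the plain matrix products. This is a minimal illustration, not the authors' code: the reduction polynomial 0x11B is the one used by Rijndael and is an assumption here (this paper does not state it, and other family members may use a different one), the table is named MULT2 because 2mult is not a valid Python identifier, and gmul and matmul are hypothetical reference helpers used only for checking.

```python
def xtime(a: int, poly: int = 0x11B) -> int:
    """Multiply a by 2 in GF(2^8): left shift, then reduce if bit 8 is set."""
    a <<= 1
    return a ^ poly if a & 0x100 else a


# Built once; on a card this would be a 256-byte table in ROM, so that
# multiplication by 2 is a single lookup with no data-dependent branch.
MULT2 = bytes(xtime(a) for a in range(256))


def diffuse4(col):
    """4-byte column diffusion (Rijndael, Square): 15 EXORs, 4 lookups."""
    in0, in1, in2, in3 = col
    p = in0 ^ in1 ^ in2 ^ in3
    return [
        MULT2[in0 ^ in1] ^ p ^ in0,
        MULT2[in1 ^ in2] ^ p ^ in1,
        MULT2[in2 ^ in3] ^ p ^ in2,
        MULT2[in3 ^ in0] ^ p ^ in3,
    ]


def diffuse3(col):
    """3-byte column diffusion (BKSQ): 5 EXORs, 1 lookup."""
    in0, in1, in2 = col
    p = MULT2[in0 ^ in1 ^ in2]
    return [in0 ^ p, in1 ^ p, in2 ^ p]


def gmul(a: int, b: int) -> int:
    """Reference GF(2^8) product by shift-and-add, for checking only."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = MULT2[a]
        b >>= 1
    return result


def matmul(matrix, col):
    """Reference matrix-times-column product over GF(2^8)."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gmul(coeff, byte)
        out.append(acc)
    return out


FORWARD4 = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
FORWARD3 = [[3, 2, 2], [2, 3, 2], [2, 2, 3]]

column = [0xDB, 0x13, 0x53, 0x45]
fast = diffuse4(column)
reference = matmul(FORWARD4, column)
```

For any column, `fast` and `reference` should agree, since each output row is the sum p of all input bytes, plus the byte itself, plus twice the EXOR of that byte and its successor. The BKSQ variant relies on the same idea: each output is the input byte plus twice the sum of all three bytes.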
3.3 The key schedule

In a Smartcard implementation, computing the expanded key in a single shot and storing it for use during the actual encryption would require too much RAM. Therefore, the key expansion has been defined in such a way that it can be implemented in a cyclic buffer with a size no larger than the size of the cipher key. The expansion operation has been kept very simple and efficient to make fast just-in-time key generation possible.

In the case that the cipher key length is equal to the block length, the round key is simply updated in between rounds. In the other cases, a small amount of additional sequence control is required. All operations in the key update can be efficiently implemented using EXORs and table lookups.

3.4 32-bit processors

On a 32-bit processor, an efficient implementation of the round transformation will use four large tables (1k each) that combine the effect of the nonlinear and the diffusion step. The tables will fit nicely in the cache of most modern 32-bit processors.

The performance is independent of the value of the multiplication factors in the diffusion step. The inverse operation of the cipher has the same performance as the forward operation, but uses a different set of tables.

4 Smartcard-specific attacks

Recently, several attacks have been demonstrated that exploit weaknesses of cipher implementations, rather than the inherent mathematical properties of the actual cipher [4]. These attacks exploit information such as timing and power consumption to obtain information on the cipher key or plaintext. In this section we will explain that the ciphers of the Square family lend themselves to implementations that provide resistance against this type of attack, typical for Smartcards.

If programmed as explained in the previous section, the cipher execution consists of a series of instructions that is completely fixed: there are no conditional branches whose execution depends on the cipher key and input. This thwarts the following attacks:

Timing attack: This attack extracts key/plaintext information from the CPU time consumed by the cipher. For the Square ciphers the CPU time is independent of the cipher key and plaintext.

Simple power analysis: This attack extracts key/plaintext information by observing the power consumption during the cipher computation. Different CPU instructions have different power consumption and this attack allows one to determine whether one or another conditional branch is taken in a computation. For the Square ciphers the series of instructions is fixed.

A more powerful attack is differential power analysis. This attack exploits the fact that the power consumed by the CPU depends not only on the instruction that is executed, but also on the values of the operands. It combines the application of cryptanalytic techniques, statistics and power analysis. Basically, it allows the determination of the value of individual bits of intermediary computation results of the cipher by an attacker that does not even need to know the sequence of instructions. The basic flaw that is exploited is that the power consumed by an instruction depends on the values of the bits that are handled by the command. To thwart these attacks, two mechanisms are proposed:

scrambling: To complicate the exploitation of power consumption bias, e.g., by using programming tricks, such as insertion of a variable amount of NOPs, or better, unnecessary instructions in between rounds. Scrambling can be applied reasonably independently of the cipher structure.

curing: To make the power consumption of the relevant instructions used in a cipher implementation much less dependent on the value of the treated bits. One plausible way to cure instructions is by the introduction of symmetry. For example: if a bit is stored in (loaded from) RAM, also store (load) its complement. If two bits are EXORed, execute all four different combinations: $a \oplus b$, $\bar{a} \oplus b$, $a \oplus \bar{b}$, $\bar{a} \oplus \bar{b}$. Typically, additional hardware has to be introduced for every sensitive instruction. Therefore it is an advantage to have few different instructions that handle key- or plaintext-dependent bits.

The instructions that handle key- or plaintext-dependent bits in our implementations of the ciphers of the Square family are:

EXOR with accumulator, direct addressing;
store accumulator, direct addressing;
load accumulator, direct addressing;
load accumulator, indexed addressing (offset + index register).

These are only four instructions. Most other modern ciphers [5] use an instruction set that is substantially larger, due to the use of arithmetic operations. Moreover, the balancing of arithmetic operations is likely to be more complicated than the balancing of the EXOR. It can be seen that there is arithmetic addition in the indexed addressing. However, if the lookup tables are positioned at physical addresses that are a multiple of 256, the address computation can be reduced to a mere concatenation of index and offset. Obviously, this implies a modification of the ALU hardware.

5 Performance

We implemented the Square ciphers on two different types of microprocessors that are representative for Smartcards in use today. These implementations have been optimized towards minimal RAM usage and execution time while guaranteeing a fixed execution time. Table 1 shows that besides the storage of the current round key and the intermediate ciphertext, only four bytes of RAM are used. The numbers compare very favorably with the figures for the other AES candidate algorithms.

The timings given include the key setup and algorithm setup time. Only the forward operation of the ciphers has been implemented; backwards operation is expected to be slower because the inverse of the diffusion step cannot be implemented as efficiently (cf. Section 3.2). The inverse diffusion step is between 1.5 and 2 times slower than the original diffusion step. Since the diffusion step is dominant in the execution of the ciphers on a Smartcard, about the same performance loss can be expected for the full cipher.

The implementations on the Motorola 68HC08 microprocessor have been done using the 68HC08 development tools by P&E Microcomputer Systems, Woburn, MA USA: the IASM08 68HC08 Integrated Assembler and the SIML8 68HC08 simulator. No optimization of code length has been attempted for this processor. Execution time, code size and required RAM for a number of implementations are given in Table 1 (1 cycle = 1 oscillator period = 0.25 μsec).

Rijndael has also been implemented on the Intel 8051 microprocessor, using 8051 development tools of Keil Elektronik: μVision IDE for Windows and dScope Debugger/Simulator for Windows. Execution time for several code sizes is given in Table 2 (1 cycle = 12 oscillator periods = 1 μsec).

6 Availability

Several implementations of Square in C and Java are available from the URL http://www.esat.kuleuven.ac.be/~rijmen/square.

More information on Rijndael and a reference implementation are available from the URL http://www.esat.kuleuven.ac.be/~rijmen/rijndael.
Cipher                 Code size   Required RAM   Number of   Execution time
(key, block length)    (bytes)     (bytes)        cycles      (msec)
BKSQ(96,96)            900         28             6500        1.6
Square(128,128)        919         36             6800        1.7
Rijndael(128,128)      919         36             8390        2.1
Rijndael(192,128)      1170        44             10780       2.7
Rijndael(256,128)      1135        52             12490       3.1

Table 1: Code size, required RAM and execution time for the Square ciphers in Motorola 68HC08 assembler.
Key, block length      Code size   Number of   Execution time
                       (bytes)     cycles      (msec)
(128,128)              768         4065        4.1
(128,128)              826         3744        3.7
(128,128)              1016        3168        3.2
(192,128)              1125        4512        4.5
(256,128)              1041        5221        5.2

Table 2: Code size and execution time for Rijndael in Intel 8051 assembler.
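The execution times in Tables 1 and 2 follow directly from the cycle counts and the cycle times quoted in Section 5 (0.25 μsec per cycle on the 68HC08, 1 μsec per cycle on the 8051). The short sketch below redoes that arithmetic for a few rows; the function and variable names are hypothetical and chosen only for this illustration.

```python
def execution_time_ms(cycles: int, cycle_time_us: float) -> float:
    """Execution time in milliseconds for a cycle count and a cycle time in microseconds."""
    return cycles * cycle_time_us / 1000.0


# Cycle counts copied from Table 1 (68HC08) and Table 2 (8051).
TABLE1_CYCLES = {"BKSQ(96,96)": 6500, "Square(128,128)": 6800, "Rijndael(128,128)": 8390}
TABLE2_CYCLES = {"Rijndael(128,128), 1016 bytes": 3168, "Rijndael(256,128)": 5221}

for name, cycles in TABLE1_CYCLES.items():
    print(f"68HC08 {name}: {execution_time_ms(cycles, 0.25):.1f} msec")
for name, cycles in TABLE2_CYCLES.items():
    print(f"8051   {name}: {execution_time_ms(cycles, 1.0):.1f} msec")
```

The same function can be used to compare implementations across the two processors, keeping in mind that the two tables were measured with different cycle times.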
References
[1] J. Daemen, L.R. Knudsen, V. Rijmen, "The Square encryption algorithm," Dr. Dobb's Journal, Vol. 22, No. 10, October 1997, pp. 54-56.
[2] J. Daemen and V. Rijmen, "The Rijndael block cipher," presented at the First Advanced Encryption Standard Conference, Ventura, California, 1998, available from URL http://www.nist.gov/aes.
[3] J. Daemen and V. Rijmen, "The Block Cipher BKSQ," Proc. of CARDIS'98, LNCS, Springer-Verlag, to appear.
[4] P.C. Kocher, "Timing attacks on implementations of Diffie-Hellman, RSA, DSS and other systems," Advances in Cryptology, Proceedings Crypto'96, LNCS 1109, N. Koblitz, Ed., Springer-Verlag, 1996, pp. 146-158.
[5] NIST's AES Homepage: http://www.nist.gov/aes.