Fast Hashing and Stream with

Panama

1 2

Joan Daemen Craig Clapp

1

Banksys, Haachtesteenweg 1442, B-1130 Brussel, Belgium

email: [email protected]

2

PictureTel Corp oration, 100 Minuteman Rd., Andover, MA 01810, USA

email: [email protected]

Abstract. We present a cryptographic mo dule that can b e used b oth as

a cryptographic and as a . High p erformance

is achieved through a combination of lowwork-factor and a high degree

of paralleli sm. Throughputs of 5.1 bits/cycle for the hashing mo de and

4.7 bits/cycle for the stream cipher mo de are demonstrated on a com-

mercially available VLIW micro-pro cessor.

1 Intro duction

Panama is a cryptographic mo dule that can b e used b oth as a cryptographic

hash function and a stream cipher. It is designed to b e very ecient in software

implementations on 32-bit architectures.

Its basic op erations are on 32-bit words. The hashing state is up dated bya

parallel nonlinear transformation, the bu er op erates as a linear feedback shift

register, similar to that applied in the compression function of SHA [6]. Panama

is largely based on the StepRightUp stream/hash mo dule that was describ ed

in [4].

Panama has a low p er-byte work factor while still claiming very high security.

The price paid for this is a relatively high xed computational overhead for every

execution of the hash function. This makes the Panama hash function less suited

for the hashing of messages shorter than the equivalentofatyp ewritten page. For

the stream cipher it results in a relatively long initialization pro cedure. Hence,

in applications where sp eed is critical, to o frequent resynchronization should b e

avoided.

Atypical application for Panama might b e the encryption or decryption of

video-rate data in conditional access applications e.g. pay-TV. Set-top b oxes

and future digital televisions will increasingly include media pro cessors for de-

co ding compressed video and for p erforming other computationally intensive

image pro cessing tasks. This is an application space where data rates are high,

high-p erformance pro cessors are increasingly likely to b e present, and decryption

must b e done yet must not unduly burden an already heavily loaded pro cessor.

After sp ecifying the Panama hash function and stream cipher, we discuss the

particular design strategy and the implementation asp ects. We don't attempt to

give a pro of of security.However, a motivation for the design choices is given.

A C reference implementation of Panama and PostScript and PDF versions

of [4] are available from http://www.esat.kuleuven.ac.be/~rijmen/daemen.

2 Basic design principles

Panama is based on a nite state machine with a 544-bit state and a 8192-

bit bu er. The state and bu er can b e up dated by p erforming an iteration.

There are two mo des for the iteration function. A Push mo de, that allows to

inject an input and generates no output, and a Pul l mo de that takes no input

and generates an output. A blank Pull iteration is a Pull iteration in which the

output is discarded.

The up dating transformation of the state has high di usion and distributed

nonlinearity. Its design is aimed at providing very high nonlinearity and fast dif-

fusion for multiple iterations. This is realised by the combination of four distinct

transformations each with its sp eci c contribution. There is one for nonlinearity,

one for bit disp ersion, one for inter-bit di usion, and one for injection of bu er

and input bits.

The bu er b ehaves as a linear feedback that ensures that input

bits are injected into the state over a wide interval of iterations. In the Push

mo de the input to the shift register is formed by the external input, in the Pull

mo de, by part of the state.

The Panama hash function is de ned as p erforming Push iterations with

message blo cks as input. If all message blo cks have b een injected, a number of

blank Pull iterations are p erformed to allow the last message blo cks b e di used

into the bu er and state. This is followed by a nal Pull iteration to retrieve the

hash result.

The Panama stream encryption scheme is initialised by doing two Push

iterations to inject the and diversi cation parameter followed bya number of

blank Pull iterations to allow the key and parameter to b e di used into the bu er

and state. After this initialisation, the scheme is ready to generate

bits at leisure by p erforming Pull iterations.

3 Sp eci cation

The state is denoted by a and consists of 17 32-bit words a to a . The bu er

0 16

b is a linear feedback shift register with 32 stages, each consisting of 8 words. An

j

j

8-word stage is denoted by b and its words by b . Both stages and words are

i

indexed starting from 0.

The three p ossible modes for the Panama mo dule are Reset, Push and Pull.

In Reset mo de the state and bu er are set to 0. In Push mo de an 8-word input

is applied and there is no output. In Pull mo de there is no input and an 8-word

output is delivered.

The bu er up date op eration is denoted by .Wehave with d = b:

j j 1

d = b if j 62 f0; 25g;

0 31

d = b  q; 1

25 24 31

d = b  b for 0  i<8 :

i i

i+2 mo d 8

In Push mo de q is the input blo ck p, in Pull mo de it is part of the state a,

with its 8 comp onentwords given by

q = a for 0  i<8 : 2

i i+1

The state up dating transformation is denoted by . It is comp osed of a num-

b er of sp eci c transformations:

 =      : 3

Here  denotes the asso ciative comp osition of transformations where the right-

most transformation is executed rst.

is an invertible linear transformation de ned by:

c = a , c = a  a  a for 0  i<17 ; 4

i i i+1 i+4

with the indices taken mo dulo 17. The invertibilityof follows from the fact

4 17

that 1  x  x is coprime to 1  x .

is an invertible nonlinear transformation de ned by:

a  for 0  i<17 ; 5 c = a , c = a  a OR

i+2 i i i+1

with the indices taken mo dulo 17. A pro of for the invertibilityof can b e found

in [4].

The p ermutation  combines cyclic word shifts and a p ermutation of the

word p ositions. If we de ne  to b e a rotation over k p ositions from LSB to

k

MSB, wehave:

c =  a , c =  a  ; 6

i k j

with

j =7i mod17 and

7

k = ii +1=2 mod32 :

The transformation  corresp onds with bitwise addition of bu er and input

words. It is given by let c =  a:

c = a  00000001 ;

0 0 hex

c = a  ` for 0  i<8 ;

8

i+1 i+1 i

16

for 0  i<8 : c = a  b

i+9 i+9

i

4

In the Push mo de ` corresp onds with the input p, in the Pull mo de ` = b .

In the Pull mo de the output z consists of 8 words given by

z = a for 0  i< 8 : 9

i i+9

The transformation  is illustrated in Fig. 1, the Push and Pull mo des of the

Panama mo dule are illustrated in Fig. 2. γ

π

θ

σ

Fig. 1. The state up dating transformation .

p 0 31

ρ

a

0 31

ρ

a z

Fig. 2. Push ab ove and Pull b elow mo des of Panama.

3.1 The Panama hash function

The Panama hash function maps a message of arbitrary length M to a hash

result of 256 bits. The Panama hash function is executed in two phases:

0

{Padding M is converted into a string M with a length that is a multiple

of 256 by app ending a single 1 followed byanumber d of 0-bits with 0 

d< 256.

0 1 2 V

{ Iteration The input sequence M = p p :::p is loaded into the Panama

mo dule according to Table 1.

After all input blo cks have b een loaded, an additional 32 blank Pull iterations

are p erformed. Then the Hash result h is returned. The numb er of Push and Pull

iterations to hash an V -blo ck input sequence is V + 33.

Time step t Mo de Input Output

0 reset { {

t

1;:::;V Push p {

V +1;:::;V +32 Pull { {

V +33 Pull { h

Table 1. The sequence diagram of the iteration phase of the Panama hash function.

The design goal for the Panama hash function is that it should b e hermetic.

For the de nition of this term we refer to [4]. In short, for a hermetic hash

function, the following statements are true.

Assume we take as hash result the value of a subset of n bits of the for

Panama 256-bit output:

n=2

{ the exp ected workload of generating a collision is of the order of 2 execu-

tions of the hash function,

{ given an n-bit value, the exp ected workload of nding a message that hashes

n

to that value is of the order of 2 executions of the hash function,

{ given a message and its n-bit hash result, the exp ected workload of nding a

n

second message that hashes to the same value is of the order of 2 executions

of the hash function.

Moreover, it is infeasible to detect systematic correlations b etween any linear

combination of input bits and any linear combination of bits of the hash result.

A hermetic hash function can b e turned into a secure MACby simply includ-

ing a secret key in the message input. Its security is indep endent of the p ositions

of the secret key bits in the message. It may b e app ended, pre-p ended, inserted

somewhere in the middle of the message or even b e distributed over bits spread

all over the message. Observe that this is not the case with hash functions of the

MD4 family [8].

3.2 The Panama stream encryption scheme

The stream cipher is initialized by rst loading the 256-bit key K , the 256-bit

diversi cation parameter Q and p erforming 32 additional blank Pull iterations.

During keystream generation an 8-word blo ck z is delivered at the output for

every iteration. In practice, the diversi cation parameter allows for frequent

resynchronisation without the need to change the key.

The sequence diagram of the Panama stream encryption scheme is given in

Table 2.

Time step t Mo de Input Output

34 reset { {

33 Push K {

32 Push Q {

31;:::;0 Pull { {

t

1;::: Pull { z

Table 2. The sequence diagram of the Panama stream encryption scheme.

The design goal for the Panama stream encryption scheme is that it should

b e hermetic and K-secure also de ned in [4]. K-security implies among other

things that, given part of the keystream outputs corresp onding with a given key

and for chosen values of Q, the most ecientway to gain knowledge on the key

or on the complementary part of the keystream output, is exhaustivekey search.

4 Discussion

4.1 The state up dating transformation

is the simplest nonlinear shift-invariant transformation. Its propagation prop-

erties are describ ed in detail in [4]. In short:

{ The maximum correlation b etween its input and output diminishes exp o-

nentially with the Hamming Weight of the output selection vectors.

{ The di erence propagation probability diminishes exp onentially with the

Hamming Weight of the input di erence vectors.

The transformation corresp onds with the multiplicationby a binary p olynomial

17

mo dulo 1  x .Itwas selected from the invertible p olynomials with Hamming

weight 3 on the basis of its go o d di usion prop erties. A single input di erence

gives rise to three output di erences. For the vast ma jority of input di erence

vectors with a small b elow 32 Hamming Weight, the Hamming Weight of the

corresp onding output vector is ab out three times higher.

The cyclic shift co ecients of  , describ ed by the simple expression in 7,

form an array of 17 di erent constants. The word p ermutation factor 7 is chosen

to let every comp onentof dep end on 9 state bits. For the chosen  parameters

it has b een veri ed that    has propagation and correlation prop erties that

are close to optimal with resp ect to the space of p ossible  parameters. On the

average, a di erence in a single bit di uses to 6 bits after one iteration, 36 bits

after two, 216 after three and all over the state after 4 iterations. Since , , 

and  are all invertible, the state up dating transformation  is invertible.

 includes the addition of a constanttoa to prevent symmetric prop erties.

0

For the value of the constant 00000001 was chosen for its simplicity.

hex

4.2 The hash function

The design of the Panama hash function di ers from the currently p opular

MD4-derived designs such as MD5 [10], SHA-1 [7] and RIPEMD-160 [5] in three

imp ortantways:

{Parallel iteration transformation: the MD4-derived designs haveanit-

eration function that consists in itself of a sequence of a large number of

simple rounds, in Panama the iteration function consists of a single more

complicated round with a parallel structure.

{ Large Chaining State: the MD4-derived hash functions haveachaining

variable that has the same size 16 or 20 bytes as the hash result by design,

the chaining variable of the Panama hash function comprising the internal

state and bu er is over1Kbyte.

{ Presence of nal transformation: In the MD4-derived designs, the hash

result is the nal value of the chaining variable. For Panama the 32 blank

Pull iterations form a nal transformation mapping the nal value of the

chaining variable state and bu er to the hash result.

These di erences are the consequence of a di erence in design strategy.In

the MD4-derived hash functions the iteration transformation is designed to b e

collision-resistant in itself. The iteration mechanism and the fact that the hash

result is the value of the chaining variable after the last message blo ck has

b een hashed, assure that the resulting hash function is collision-resistant. In

the Panama hash function, it is the di usion and nonlinearity realised by the

successive application of the iteration function that is exp ected to prevent cryp-

tographic weaknesses.

4.3 Collision resistance

In this section we explain the diculties faced when trying to generate collisions

for Panama .

The hash result is completely determined by the nal value of the chaining

V +1 V +1

variable: the state a and bu er contents b . The converse is however not

true, and pairs of messages may b e found with di erentvalues for the nal

chaining variable b oth consistent with the same hash result. In this case, the

collision actually results from the fact that the hash result is not equal to the

nal hashing state. We call this a terminal collision. Collisions in whichtwo

di erent messages give rise to equal chaining variable at a certain p oint are

called internal . In hash functions without a nal transformation, such as MD4

and its descendents, terminal collisions cannot o ccur.

Generating an internal collision implies two di erent messages that give rise

to a bitwise di erence pattern in the hashing state and bu er that dissolves, i.e.,

that ends up b eing all-zero for some iteration. Realising this, while theoretically

p ossible, is assumed to b e infeasible b ecause of the large di usion and distributed

nonlinearityof combined with the fact that every message blo ck a ects the

hashing state for a large numb er of iterations. Di erence patterns in the bu er

induce a so-called di erential characteristic or equivalently di erential trail [4]

in the hashing state.

A p ossible strategy for nding internal collisions would b e to lo ok for an

input di erence pattern that can give rise to a di erential trail that ends in a

zero di erence somewhere in the computation. Other strategies are

{ nding xed p oints of the Push mo de or more general, xed sequences of

the Push mo de,

{ meet-in-the-middle attacks exploiting the invertibility of the iteration func-

tion.

These attacks have b een considered in the design strategy and in the choice of

the state and bu er up dating transformations.

Because of the linear feedback in the bu er, a di erence pattern in a single

message blo ck gives rise to an in nite di erence propagation in the bu er. This

is illustrated in the right-hand side of Fig. 3. Only di erence patterns in the

input sequence that meet a particular condition give rise to a nite di erence

propagation in the bu er.

The simplest di erence pattern that gives rise to a nite di erence propaga-

tion in the bu er is illustrated in the left-hand side of Fig. 3. All other message

di erences that give rise to a nite di erence propagation in the bu er are su-

p erp ositions linear combinations of shifted in time and space instances of

this di erence pattern.

Even the simplest di erence pattern a ects the state a during 5 di erent

iterations, spread over a range of 32 iterations. Fig. 4 illustrates the di erence

propagation in  resulting from a similar di erence pattern in p that gives rise

0

1

to a nite di erence propagation and that has p = d ;d ;:::;d .Four of the

0 1 7

ve di erence vectors are in general di erent. For all other message di erence

patterns, the numb er of iterations with a non-zero di erence pattern and the

numb er of distinct di erence vectors in  are b oth larger.

If the di erence pattern in the hashing state and/or bu er is not equal to

zero after the loading of the last input blo ck, the di usion and nonlinearityof

the transformation formed by the 32 nal Pull iterations are assumed to make

the controlled generation of terminal collisions infeasible. 1

1

8

50 18

57

64

33

Fig. 3. Di erence propagation in the bu er of Panama.

t = 1 - d 0 d 1 d2 d3 d4 d5 d6 d7 ------

t = 8 - d 2 d 3 d4 d5 d6 d7 d0 d1 ------

t = 18 ------d0 d1 d2 d3 d4 d5 d 6 d 7

t = 25 ------d2 d3 d4 d5 d6 d7 d 0 d 1

t = 33 - d 0 d 1 d2 d3 d4 d5 d6 d7 ------

Fig. 4. Di erence patterns in  .

In internal collisions the di erence pattern in the state and bu er is zero

b efore the nal hashing state has b een reached. Clearly, this restricts the di er-

ence pattern in the input sequence to sup erp ositions of instances of the simple

di erence pattern given in the left-hand side of Fig. 3. These input di erence

patterns give rise to non-zero di erence vectors in  over a span of at least 32

iterations. We b elieve that controlling the di erential trails in the hashing state

over these large numb ers of iterations to obtain a collision is to o hard to b e a

threat to the security of the hash function.

One can avoid the need for this typ e of control over large numb ers of iter-

ations by eliminating di erences in the state immediately after they have b een

created. As can b e seen in Figure 4, this must however b e rep eated for 5 di erent

instances, with the same set of input di erences.

4.4 The stream encryption scheme

For every Pull iteration 16 words of the bu er are injected into the state and

8 state words are given at the output. In the short term, the numb er of bu er

bits that are injected into the state is twice as large as the numb er of bits that

are given at the output. It can b e checked that this is the case for anynumber

of iterations smaller than 12. The feedback from the state to the bu er causes

the bu er contents to b e renewed every 32 iterations. These factors cause the

correlations b etween output bits and linear combinations of state and bu er bits

to b e to o small to b e of practical use for .

Resynchronization attacks [4] should b e made infeasible by the 32 blank Pull

iterations after loading the initialization blo cks. Because of the feedback from

the state to the bu er in the Pull mo de, the state and almost all bu er stages

dep end in a complicated way on these blo cks at the end of the initialization

phase. We exp ect that there are no exploitable 32-round di erential trails.

5 Implementation asp ects

Panama 's heavy reliance on bitwise-logical op erations on 32-bit words makeit

well suited to implementation on 8, 16, or 32-bit pro cessors, except that its use

of 32-bit rotations do es somewhat favor 32-bit architectures.

Accesses to bu er b and op erations b etween bu er b and state a can all b e

p erformed 64-bits at a time on pro cessors that supp ort this word size. Since a is

not an integer multiple of 64-bits a dummy 32-bit word should b e pre-p ended to

a for prop erly aligning fa ;a g through fa ;a g to 64-bit b oundaries. Bu er

0 1 2 15 16

b should also b e 64-bit aligned.

So, and can b e p erformed eciently with word sizes up to 32-bits, while

 esp ecially favors 32-bit architectures.  and  can b oth b e p erformed e-

ciently with word sizes up to 64-bits. In the following analysis we concentrate

on Panama 's p erformance on 32-bit pro cessors.

5.1 Theoretical p erformance limits

To determine the maximum theoretical p erformance of Panama in its hash and

stream cipher mo des on a suitably parallel pro cessor architecture we seek to

identify the software critical path through the algorithm. This is the closed path

through the algorithm's computational owgraph that has the largest weighted

instruction count, the weighting b eing the numb er of cycles of latency asso ciated

with eachtyp e of instruction.

For instance, on most pro cessors the result of a simple op eration likean

addition or XOR can b e used in the subsequent cycle - these instructions are

said to have a one cycle latency. On mo dern high p erformance pro cessors it

is also common for shifts and rotates to b e single-cycle instructions. However

reading from memory takes several cycles. Even when the data is in the CPU's

lo cal cache it commonly su ers a two or three cycle latency on mo dern deeply

pip elined pro cessors.

We start by examining the software critical path through .For eachof ,

 , , and  , all seventeen 32-bit words a to a can in principle b e up dated

0 16

in parallel. The software critical path through these four routines is a total

of 7 cycles 3 cycles for , 1 cycle for  , 2 cycles for , and 1 cycle for , all

corresp onding to single-cycle instructions.

Many RISC pro cessors have enough registers to hold state a in its entirety,so

that when encrypting or hashing a long blo ck of data we need not keep accessing

anyofa through a from memory.However, bu er b is to o large to b e register

0 16

based and is most eciently implemented as a xed circular-bu er in memory,

with moving p ointers used to create the app earance of a shift-register.

Notably, since the accesses to b are not data-dep endent i.e. not table-lo ok-

ups all address calculations can b e done well in advance and do not contribute

to the software critical path. Also, since the stages which are read are several

stages delayed from those that are written, the read-data can in principle b e

fetched one or more up date cycle ahead of time, from which it b ecomes clear

that up dating the bu er is not on the software critical path.

Wenow explore the numb er of 32-bit instructions needed for each iteration

of Push and Pull. If state a is fully held in registers and the rotation amounts in

 are all hard-co ded, then  entails a total of 16 reads from bu er stages 4 and

16 for Pull, or input p and bu er stage 16 for Push and 17  7 logical op erations

actually one less than this b ecause the zero-rotation need not b e p erformed.

Up dating bu er b involves 16 reads bu er stages 24 and 31, 16 XOR op er-

ations, and 16 writes bu er stages 0 and 25, plus a numb er of op erations to

up date the p ointers to the accessed stages in order to simulate a shift register. As

a minimum these involve an increment and mask op eration p er p ointer. Reading

from stage 31 and writing to stage 0 takes only one p ointer, likewise for stages 24

and 25 b ecause of the way the circular-bu er is used to emulate a shift-register.

Consequently there are four p ointers to b e up dated for a Pull iteration, and

three for a Push. The masking op eration implements circular arithmetic on the

p ointers - a convenience arising from the bu er size b eing a p ower of two.

For each Pull iteration, applying the cipher output to the data stream in-

volves 8 reads from the plaintext bu er, 8 XOR op erations, and 8 writes to the

bu er, plus at least one additional instruction to up date a p ointer to

these bu ers. Each iteration ciphers 32 bytes of data.

In the case of Push wehave already accounted for reading the input p under

our discussion of , so all that is left is up dating the p ointer to the input data.

Thus, ignoring for the moment the few extra instructions necessary for main-

taining the lo op, wehaveaworkload of 189 instructions for each iteration of Push

and 215 instructions for each Pull. This is equivalent to ab out 5.9 instructions

per byte hashed or 6.7 instructions p er byte enciphered.

An estimate of how many fully pip elined execution units the algorithm might

usefully exploit can b e obtained by dividing the total numb er of op erations p er

iteration by the numb er of cycles in the critical path. This numb er ideally should

b e no less than the numb er of parallel execution paths in the target pro cessors

so that no CPU resources are left idle.

For Panama this works out to 189 or 215 instructions p er iteration divided by

7 cycles p er iteration, from whichwe estimate that the hashing and encryption

mo des might reasonably exploit pro cessors with up to 27 and 31 parallel 32-bit

datapaths resp ectively.

5.2 Benchmarked p erformance

The generous amount of parallelism in Panama lends itself naturally to e-

cient implementation on a pro cessor capable of a high degree of instruction-

level parallelism. To demonstrate the impressive throughputs achievable in such

cases a highly optimized implementationwas develop ed for the Philips TriMedia

TM-1000 pro cessor. The TriMedia pro cessor is a Very Long Instruction Word

VLIW CPU containing ve 32-bit pip elined execution units sharing a common

set of 128 32-bit registers. All ve execution units can p erform arithmetic and

logical op erations, but loads, stores, and shifts are each supp orted by only twoof

them. The two execution units that supp ort shifts are distinct from the two that

supp ort loads and stores. Given an appropriate instruction mix the pro cessor

can issue up to ve instructions p er clo ck cycle.

The op erations in Panama can b e eciently expressed in C-co de except for

the bitwise rotations. The optimized implementation was written completely in

C-co de except for resorting to a library call to access the pro cessor's native 32-bit

rotate instruction.

Since the parallelism present in the Panama algorithm is vastly more than is

available in the TriMedia CPU wewould hop e to b e able to completely saturate

the pro cessor, i.e. havevery few vacant instruction slots. However, there maybe

other resource constraints that prevent this. For instance,  calls for the intensive

use of rotate instructions, of which the TriMedia CPU can only issue twoper

cycle. Filling the other three instruction slots on each cycle requires overlapping

the execution of  with that of and/or .For the b enchmarked implementation

each of these routines was expressed as fully unrolled in-line co de, leaving the

TriMedia C-compiler to recognize and exploit the allowable overlap b etween ,

 , and .

The lo op for iterating Pull mo de compiled into 234 TriMedia assembly

instructions. The di erence b etween this and the 215 instructions previously

counted comes from the instructions for maintaining the lo op and from some

additional overhead involved in the p ointer up dates asso ciated with making the

circular bu er app ear as an LFSR. The scheduled co de was tightly packed by

the compiler into 47 instruction-issue cycles, i.e. a sustained 4.98 instructions

p er cycle were scheduled for issue, out of a theoretical maximum of 5. This is an

unusually high utilization of the TriMedia CPU even compared to its eciency

on media-pro cessing tasks for whichitwas designed.

Compiled co de for Push showed comparable overhead and co de density.

The optimized C-co de was b enchmarked on a 100 MHz TriMedia pro cessor

by encrypting or hashing a 128 Kbyte data bu er. Wecho ose this bu er size as

several times larger than the on-chip data cache so as to make the rep orted p er-

formance b e representative of the sustainable encryption or hashing p erformance

to external memory, in this case comprised of synchronous DRAM. At the level

of p erformance achieved by Panama external memory bandwidth can b ecome

a signi cant factor in the overall p erformance. For the encryption b enchmark

the data bu er was encrypted in-place so as to minimize the p erformance loss

arising from memory accesses. No o -chip cache was present the Trimedia chip

do es not actually supp ort o -chip cache.

An encryption throughput of 4.7 bits p er cycle was achieved, equivalentto

1.7 cycles p er byte, or 470 Mbps on a 100 MHz pro cessor. This includes all lo op-

overhead and cycles lost to cache misses, memory accesses, etc. This is uncom-

monly fast among stream ciphers. For comparison, two other acknowledged fast

software ciphers { RC4 [12], and SEAL [11], are rep orted as capable of 10.6 cycles

per byte and 3.5 cycles p er byte, or 75 Mbps and 230 Mbps resp ectively when

b enchmarked under these same conditions [3]. Panama is also slightly faster on

this pro cessor than the variants of WAKE describ ed in [3].

Notably, Panama achieves its sp eed advantage not byhaving the lowest

work-factor among these ciphers. For instance SEAL has a work factor of 4.25

instructions p er byte on the TriMedia pro cessor compared to Panama 's 6.7.

Rather, Panama 's sp eed advantage comes from the substantial degree of par-

allelism present in the algorithm, an attribute that can b e well exploited by

a VLIW pro cessor such as the TriMedia. Accordingly it should b e noted that

Panama 's advantage may b e diminished when running on pro cessors having less

instruction-level parallelism than the CPU rep orted here.

On the TriMedia pro cessor Panama achieves a hashing throughput of 5.1 bits

p er cycle, equivalent to 1.6 cycles p er byte, or 510 Mbps on a 100 MHz device.

We are not aware of published p erformance gures for implementations of other

currently p opular hash functions on the TriMedia pro cessor against whichto

directly compare Panama 's sp eed. Still, a simple comparison shows that the

p er-byte workload of Panama is similar to that of MD4 [9], the fastest member

of the family of hash functions to which MD5, SHA-1 and RIPEMD-160 all

b elong. Benchmarks for these p opular hashes have b een published for the Intel

Pentium pro cessor in [2] from whichwe can make comparisons to Panama .

Performance of an optimized C-co de implementation of Panama on a

200 MHz Pentium Pro using a library function for rotate was measured at

198 Mbps for ciphering and 214 Mbps for hashing, i.e. a throughput of 0.99 bits

p er cycle for ciphering and 1.07 bits p er cycle for hashing. This compares to

hashing sp eeds rep orted in [2] for SHA-1 and RIPEMD-160 of 0.24 bits p er cycle

3

and 0.21 bits p er cycle resp ectively for optimized C-co de .[2] also rep orts opti-

3

[2] rep orts p erformance on a 90 MHz Pentium, while here we rep ort p erformance on

a 200 MHz Pentium Pro. By converting all results into the normalized measure of

bits p er cycle we attempt to provide a uniform basis for comparison, however the

reader is cautioned that no allowance has b een made for the architectural di erences

mized assembly-co de versions of SHA-1 and RIPEMD-160 as achieving 0.54 bits

p er cycle and 0.44 bits p er cycle resp ectively. It is currently unknown what fur-

ther sp eed improvement could b e achieved for Panama by assembly co ding it,

but even without such improvement it shows ab out a 2 sp eed advantage over

assembly co ded versions of these other hashes.

Since the Pentium Pro can in principle issue two arithmetic or logical in-

structions p er cycle compared to ve for the TriMedia chip one maywonder

why the throughput p er cycle of Panama on the Pentium Pro is barely one

fth that achieved on the TM-1000. In part the reason is that Panama 's large

state cannot b e maintained in the small register set of the 'x86 architecture,

with the result that co de for the Pentium Pro requires massively more load and

store instructions than are required for the TriMedia, or for that matter other

RISC pro cessors with a generous complement of registers. Since b oth SHA-1

and RIPEMD-160 are substantially unhamp ered by the limited register set of

the 'x86 architecture, wewould exp ect Panama 's advantage over them to b e all

the greater on pro cessors not having this limitation.

In considering Panama 's suitability to an application it should b e b orne in

mind that the p erformance gures rep orted are for large blo ck sizes. When hash-

ing small blo cks, or encrypting with frequentkey changes or resynchronization,

the overhead of the accompanying 32 blank Pull iterations may signi cantly im-

pact the p erformance. A key change or resynchronization takes ab out as long as

encrypting 1000 bytes. Similarly, each hashed blo ck has a xed overhead equiv-

alent to hashing ab out 1000 bytes.

6 Conclusions

Wehave presented a new cryptographic mo dule capable of cryptographic hashing

and stream encryption suited for applications where large amounts of data must

b e protected. It has b een shown that the inherent parallelism allows extremely

fast software implementations on VLIW pro cessors.

References

[1] E. Biham and A. Shamir, \Di erential cryptanalysis of DES-like ,"

Journal of Cryptology,Vol. 4, No. 1, 1991, pp. 3{72.

[2] A. Bosselaers, R. Govaerts, J. Vandewalle, \Fast Hashing on the Pentium", Ad-

vances in Cryptology { Proceedings Crypto'96 LNCS 1109, N. Koblitz, Ed.,

Springer-Verlag, 1996, pp. 298{312 .

[3] C.S.K. Clapp, \Optimizing a fast stream cipher for VLIW, SIMD, and sup erscalar

pro cessors," Fast Software Encryption, LNCS 1267, E. Biham, Ed., Springer-

Verlag, 1997, pp. 273{287.

[4] J. Daemen, \Cipher and hash function design strategies based on linear and dif-

ferential cryptanalysis," Doctoral Dissertation, March 1995, K.U.Leuven.

between the Pentium and Pentium Pro.

[5] H. Dobb ertin, A. Bosselaers, B. Preneel, \RIPEMD-160: A Strengthened version

of RIPEMD," Fast Software Encryption, LNCS 1039, D. Gollmann, Ed., Springer-

Verlag, 1996, pp. 71{82.

[6] FIPS 180, Secure Hash Standard,Federal Information Pro cessing Standard FIPS,

Publicatio n 180, National Institute of Standards and Technology, US Department

of Commerce, Washington D.C., May 1993.

[7] FIPS 180-1, Secure Hash Standard,Federal Information Pro cessing Standard

FIPS, Publicatio n 180-1, National Institute of Standards and Technology,US

Department of Commerce, Washington D.C., April 1995.

[8] B. Preneel and P.C. van Oorschot, \On the SecurityofTwoMAC Algorithms",

Advances in Cryptology { Proceedings Eurocrypt'96 LNCS 1070, U.M. Maurer, Ed.,

Springer-Verlag, 1996, pp. 19{32 .

[9] R.L. Rivest, The MD4 message-digest algorithm, Request for comments RFC

1320, Internet Activities Board, Internet Privacy Task Force, April 1992.

[10] R.L. Rivest, The MD5 message-digest algorithm, Request for comments RFC

1321, Internet Activities Board, Internet Privacy Task Force, April 1992.

[11] P. Rogaway and D. Copp ersmith, \A Software-Optimized Encryption Algorithm,"

Fast Software Encryption, LNCS 809, R. Anderson, Ed., Springer-Verlag, 1994,

pp. 56{63.

[12] B. Schneier, Applied , Second Edition, John Wiley & Sons, 1996, pp. 397{398 .