
A Fast New DES Implementation in Software

Eli Biham

Computer Science Department

Technion Institute of Technology

Haifa, Israel

Email: biham@cs.technion.ac.il

WWW: http://www.cs.technion.ac.il/~biham/

Abstract. In this paper we describe a fast new DES implementation. This implementation is about five times faster than the fastest known DES implementation on a bit Alpha computer, and about three times faster than our new optimized DES implementation on bit computers. This implementation uses a non-standard representation, and views the processor as a SIMD computer, i.e., as parallel one-bit processors computing the same instruction. We also discuss the application of this implementation to other ciphers. We describe a new optimized standard implementation of DES on bit processors, which is about twice as fast as the fastest known standard DES implementation on the same processor. Our implementations can also be used for fast exhaustive search in software, which can find a key in only a few days or a few weeks on existing parallel computers and computer networks.

Introduction

In this paper we describe a new implementation of DES, which can be executed very efficiently in software. This implementation is best used with a non-standard order of the bits of the DES blocks. It does not suffer from the high overhead of computing permutations of bits. Instead, we view a processor with, for example, bit words as a SIMD parallel computer which can compute one-bit operations simultaneously, while the bits of each block are stored in different words: the first bit of every word always belongs to the first block, the second bit belongs to the second block, etc.
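The layout described above can be sketched as follows. This is a minimal illustration in Python, not the code measured in this paper. The 64-bit word size, the function names `to_bitslice` and `from_bitslice`, and the numbering of bits from the least significant end are all assumptions made for the example.

```python
WORD_BITS = 64    # assumed processor word size (number of blocks handled at once)
BLOCK_BITS = 64   # a DES block has 64 bits

def to_bitslice(blocks):
    """Word j collects bit j of every block; block i occupies bit position i."""
    assert len(blocks) <= WORD_BITS
    words = [0] * BLOCK_BITS
    for i, block in enumerate(blocks):
        for j in range(BLOCK_BITS):
            words[j] |= ((block >> j) & 1) << i
    return words

def from_bitslice(words, count):
    """Inverse of to_bitslice for the first `count` blocks."""
    blocks = [0] * count
    for j, word in enumerate(words):
        for i in range(count):
            blocks[i] |= ((word >> i) & 1) << j
    return blocks
```

Because bit position i of every word belongs to block i, a single word-wide XOR acts on the same bit of all packed blocks at once.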

The operations that DES uses are as follows. The XOR operation: in our view, the XOR operation of the processor computes one-bit XORs in parallel. The expansion and permutation operations: these operations do not cost any instruction, since instead of changing the order of words (or duplicating words) we can address the required word directly. We remain with the S boxes. Usual implementations of S boxes use table lookups. However, in our representation, table lookups are very inefficient, since we have to collect six bits, each bit from a different word, combine them into one index to the table, and after the table lookup, take the four resultant bits and put each of them in a different word.

Technion - Computer Science Department Technical Report CS0891 1997

Cipher                                   Speed
DES (Eric Young's libdes)
GOST
SAFER
Blowfish
Our DES Implementation
Our DES Implementation (triple DES)
Our fastest DES
Our fastest DES (Triple DES)
Estimation based on

Table. The speeds of our implementations and of various ciphers on a MHz Alpha processor (in Mbps).

We observed that there is a much faster implementation of the S boxes in our representation: they can be represented by their logical gate circuit. In such an implementation each S box is typically represented by about gates, and thus we can implement an S box by about instructions.

We actually view the whole cipher by its gate circuit, and apply it in software. In this implementation we compute the circuit as many times in parallel as the size of the processor word, and thus can gain a high speedup, even though we use very simple operations. On average, on bit processors, each S box costs about instructions for each encrypted block, while each instruction takes only one clock cycle.
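To make the idea of running a gate circuit in software concrete, the following Python sketch evaluates a small netlist once per gate on whole words. The netlist format, the name `run_circuit`, and the four-gate circuit `TOY` are hypothetical and chosen only for illustration; they are not a DES S box.

```python
import random

def run_circuit(gates, inputs, width):
    """Evaluate a netlist on `width`-bit words; each gate is one bitwise operation.

    Wires 0..len(inputs)-1 are the inputs; every gate appends one new wire.
    A gate is (op, x, y), where x and y are wire indices (y is ignored for "not").
    """
    mask = (1 << width) - 1
    wires = list(inputs)
    for op, x, y in gates:
        if op == "xor":
            value = wires[x] ^ wires[y]
        elif op == "and":
            value = wires[x] & wires[y]
        elif op == "or":
            value = wires[x] | wires[y]
        elif op == "not":
            value = ~wires[x] & mask
        else:
            raise ValueError(op)
        wires.append(value)
    return wires[-1]

# Hypothetical circuit on inputs a, b, c: ((a XOR b) AND NOT c) OR a
TOY = [("xor", 0, 1), ("not", 2, 2), ("and", 3, 4), ("or", 5, 0)]

def lane(word, i):
    """Extract the bit that belongs to the i-th parallel instance."""
    return (word >> i) & 1
```

With `width=64` the four operations evaluate the circuit for 64 independent inputs, so the amortized cost per instance is the gate count divided by the word size.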

The full circuit of DES contains about gates (including the key scheduling, which costs nothing), and thus we can compute DES times in about instructions on bit processors. On average, this results in about instructions for the encryption of each DES block. Conversion from and to the standard block representation takes together about instructions per block, and thus encryption of standard representations with our implementation takes about instructions. For comparison, our fast standard implementation of DES described in this paper requires about instructions for each block.

Table summarizes the speeds of our implementations, of a standard fast DES implementation (Eric Young's libdes), and of various fast ciphers.

The same idea can be applied to other ciphers. Our implementation of these ciphers is efficient especially when the cipher does not use all the power of the machine instructions, i.e., when each instruction mixes only a few of the bits (such as S boxes or eight-bit additions on bit processors), and when the word size of the processor is large (such as bits) while the cipher uses shorter registers. For example, our implementation of Feal is expected to be about times faster than direct implementations. Both variants of Lucifer and GOST can also be applied very efficiently using this implementation. Our implementation of ciphers which use more complex operations (such as multiplication or large S boxes) requires more instructions to simulate the complex operations, and is thus less efficient.

In Section we describe an optimized standard implementation on bit computers. It uses the bit registers of a bit processor, and runs almost twice as fast as the fastest implementation designed for bit architectures on the same processor. It even runs faster than fast ciphers such as GOST, SAFER and Blowfish. The speed is gained by using the long bit registers effectively; in all other respects this is a standard implementation. We suggest a new DES-like cipher, which we call WDES, based on the structure of this fast implementation, but which is about times faster.

In Section we discuss using these fast implementations for exhaustive search, and conclude that it is applicable even today using existing general purpose parallel computers and computer networks.

The New Non-Standard DES Implementation

This implementation uses a non-standard representation of the data in software, and in particular it does not have any table lookup. Instead of encrypting many bit words one at a time, we encrypt simultaneously words, and each operation encrypts one bit in each of the words.

Actually, we view a bit processor as a SIMD computer with one-bit processors. This implementation simulates a fast DES hardware whose number of gates is minimal, and computes each gate by a single instruction. In particular, the S boxes are computed by their gate circuit using the XOR, AND, OR and NOT operations, and the permutations and expansions do not require any instruction, since they can be viewed as only changing the naming of the registers. Although the S boxes are implemented in more instructions than in usual implementations, the parallelism of this implementation speeds up the implementation much more than the S box implementation slows it down. Moreover, some of the operations can be optimized out in some cases, such as when some parts of the S boxes are similar (the same or complements of each other).
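The claim that expansions cost nothing can be illustrated with a short Python sketch of the E expansion followed by key mixing. The table `E` is the standard DES expansion table (1-based, bit 1 first). The function name `expand_and_mix` and the representation of each half as a list of words, one word per bit position, are assumptions of this example.

```python
# Standard DES expansion table E: output bit i takes input bit E[i] (1-based).
E = [32, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9,
     8, 9, 10, 11, 12, 13, 12, 13, 14, 15, 16, 17,
     16, 17, 18, 19, 20, 21, 20, 21, 22, 23, 24, 25,
     24, 25, 26, 27, 28, 29, 28, 29, 30, 31, 32, 1]

def expand_and_mix(right, subkey):
    """right: 32 words (one per bit of R); subkey: 48 words (one per subkey bit).

    The expansion is only an index lookup (a renaming of registers);
    the 48 XORs of the key mixing are the only operations executed.
    """
    return [right[E[i] - 1] ^ subkey[i] for i in range(48)]
```

In compiled code the indices are constants, so the renaming disappears entirely and only the XORs remain.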

We represent the S boxes by their gate circuit using the best-known XOR, AND, OR and NOT operations, optimized to reduce the total number of gates. Although the problem of finding the best such circuit is still open, we found the following optimization, which requires at most gates per DES S box, and only gates on average. In the description we denote the six input bits by abcdef. We compute all the functions of d and e into registers (excluding the two constant functions). This requires two NOTs (the complements of d and e) and additional operations (d, e and their complements are already known). This computation is done only once for each S box. For each output bit of the S box we compute the result using these functions. We use six operations for each line of the S box, and six operations to combine the results together ( operations for each output bit). In total we use at most gates for each S box, but on average we need only about gates per S box.

Instructions
Expansion
Key mixing
P
XOR with the left half
S boxes                           (on average)
load/store                        (load, load, store)
Total per round

Table. The number of instructions in each round on Alpha.

                                  Total     Average per Block
IP, FP
rounds
gates per bit
Conversion of representation

Table. The number of instructions in DES on Alpha.

Each combination of four values (the four values of bc, or the four values of af; e.g., combining the quarters of each of the four lines, one quarter for each value of bc, or combining the four lines) is computed by (assuming the first case)

$$\bigl(b \wedge (\underline{f_{00} \oplus f_{10}})\bigr) \oplus \Bigl(c \wedge \bigl((\underline{f_{00} \oplus f_{01}}) \oplus (b \wedge (\underline{f_{00} \oplus f_{01} \oplus f_{10} \oplus f_{11}}))\bigr)\Bigr) \oplus \underline{f_{00}}$$

where the underlined values are known constants, and $f_{bc} = S(a'bcdef')$, where d and e are the actual values of the input ($f_{bc}$ is one of the values kept in registers above), and $a'$, $f'$ are the values assumed for a and f (to be instantiated in the next step). More accurately, in the intermediate steps we compute the combinations of S box entries as suggested by the above equation (e.g., $f_{00}$, $f_{00} \oplus f_{01}$, $f_{00} \oplus f_{10}$, $f_{00} \oplus f_{01} \oplus f_{10} \oplus f_{11}$) rather than the various values of the entries themselves.
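The selection scheme can be sketched in Python as follows. This is an illustrative reading of the description, not the optimized circuit of this paper. It assumes 64-bit words, takes output bit k = 0 as the least significant bit of an S box entry, and uses the standard DES S box S1 (row selected by a and f, column by bcde). The helper names are hypothetical, and `de_functions` builds the 16 functions naively rather than with a minimal number of operations.

```python
MASK = (1 << 64) - 1  # assumed 64-bit word

S1 = [  # standard DES S box S1
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]

def de_functions(d, e):
    """All 16 functions of (d, e), indexed by truth table t (bit 2d+e of t)."""
    nd, ne = ~d & MASK, ~e & MASK
    minterms = [nd & ne, nd & e, d & ne, d & e]
    funcs = []
    for t in range(16):
        word = 0
        for m in range(4):
            if (t >> m) & 1:
                word |= minterms[m]
        funcs.append(word)
    return funcs

def truth_table(sbox, k, rows, bq, cq):
    """Truth table over (d, e) of output bit k, XORed over the given rows."""
    t = 0
    for r in rows:
        for m in range(4):
            t ^= ((sbox[r][8 * bq + 4 * cq + m] >> k) & 1) << m
    return t

def line(sbox, k, rows, funcs, b, c):
    """Select among the four quarters of a line with six operations."""
    t00, t01, t10, t11 = (truth_table(sbox, k, rows, bq, cq)
                          for bq in (0, 1) for cq in (0, 1))
    return ((b & funcs[t00 ^ t10])
            ^ (c & (funcs[t00 ^ t01] ^ (b & funcs[t00 ^ t01 ^ t10 ^ t11])))
            ^ funcs[t00])

def sbox_output_bit(sbox, k, a, b, c, d, e, f):
    """Output bit k of the S box for 64 inputs at once."""
    funcs = de_functions(d, e)                    # once per S box in real code
    g0 = line(sbox, k, [0], funcs, b, c)          # line with a=0, f=0
    ga = line(sbox, k, [0, 2], funcs, b, c)       # XOR of lines 00 and 10
    gf = line(sbox, k, [0, 1], funcs, b, c)       # XOR of lines 00 and 01
    gall = line(sbox, k, [0, 1, 2, 3], funcs, b, c)
    return (a & ga) ^ (f & (gf ^ (a & gall))) ^ g0
```

The four calls to `line` plus the final combination use 4 x 6 + 6 = 30 word operations per output bit, since the truth tables are constants known when the code is written.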

Tables and describe the maximum number of gates per round and for the full DES. Therefore, we expect the speed to be about Mbps on MHz Alpha processors. In practice, we achieve speeds of about Mbps, since the processor can apply more than one instruction in each clock cycle.

Conversion between the standard and the non-standard representations can also be done in about instructions. Doing this twice (before and after encryption) takes about instructions, which are about instructions for each encrypted block.

This implementation can actually be applied to any cipher, but the efficiency of the implementation depends on many factors, such as the efficiency of the original cipher, the word size of the processor, and the complexity of the operations that the cipher uses. The implementation is especially attractive for ciphers whose operations are simple (no multiplication, for example), which use only small S boxes (thus their gate complexity is small), or which use small register sizes (and thus cannot use the full power of modern processors). Examples of such ciphers are Lucifer, GOST, and Feal.

In the case of Feal, standard implementations require about instructions for each application of the round function: loads (load), XORs for key mixing, for XOR, additions ($S_0$), additions with carry ($S_1$, each might take two operations), rotations, and XORs to mix with the left half of the data. The eight-round cipher thus takes about instructions, not counting the initial and final key mixing, which can take a few additional instructions.

Our implementation requires or instructions for an eight-bit addition: one or two for the LSB (depending on whether this is $S_0$ or $S_1$), for the carry, and for the second bit. We need three additional instructions for computing each additional carry, and two XORs for each additional bit. In total we need instructions for $S_0$ and for $S_1$. In total, the F function requires instructions ( XORs, key loads/mixings, four S boxes, loads, stores, mixings with the left half, and extra loads/stores if necessary). The eight-round Feal can then be implemented in instructions ( for each of the initial and final key mixings). On average, we get only about instructions per block, which is more than four times faster than standard implementations. Even if we do the conversion from/to the standard representation (which costs instructions per block), our implementation takes only about instructions, which is more than twice as fast as the standard implementations.
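The eight-bit addition can be sketched in Python as a ripple-carry adder on bitsliced bytes. The sketch assumes 64-bit words and stores each byte as a list of eight words, least significant bit first. The names `add8`, `feal_s`, `pack_bytes`, `unpack_bytes` and `rol2` are hypothetical. The Feal S boxes are taken as $S_x(a,b) = \mathrm{ROL2}((a + b + x) \bmod 256)$.

```python
import random

MASK = (1 << 64) - 1  # assumed 64-bit word

def add8(a, b, carry_in):
    """Add two bitsliced bytes modulo 256; carry_in is the constant 0 or 1."""
    if carry_in:
        total = [~(a[0] ^ b[0]) & MASK]     # LSB: two operations
        carry = a[0] | b[0]
    else:
        total = [a[0] ^ b[0]]               # LSB: one operation
        carry = a[0] & b[0]
    for i in range(1, 8):
        t = a[i] ^ b[i]
        total.append(t ^ carry)             # two XORs per additional bit
        if i < 7:                           # the last carry is never needed
            carry = (a[i] & b[i]) | (t & carry)   # three operations per carry
    return total

def feal_s(a, b, x):
    """S_x on bitsliced bytes; the rotation by two is only a renaming."""
    s = add8(a, b, x)
    return s[6:] + s[:6]

def pack_bytes(values):
    """Put byte number i into bit position i of eight words."""
    words = [0] * 8
    for i, v in enumerate(values):
        for j in range(8):
            words[j] |= ((v >> j) & 1) << i
    return words

def unpack_bytes(words, count):
    return [sum(((words[j] >> i) & 1) << j for j in range(8))
            for i in range(count)]

def rol2(v):
    """Reference rotation of a byte left by two bits."""
    return ((v << 2) | (v >> 6)) & 0xFF
```

Each call to `feal_s` handles 64 independent byte pairs, and the rotation costs no operation because only the order of the list changes.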

Both variants of Lucifer and GOST can also be applied very efficiently using this implementation.

This implementation can be used for fast encryption and decryption using the same key in all the parallel encryptions (i.e., each key word contains either all zeros or all ones), or for exhaustive search using the same plaintext but different keys. We can also use different plaintexts with different keys, if it is of an advantage to the application.

This implementation can be used in three ways:

1. Encryption/decryption in standard representations, compatible with other DES implementations.

2. Encryption/decryption of large blocks (such as disk clusters or large communication packets). In this case it is not important to use the standard representation, and thus our implementation is even faster, since no conversion is needed.

3. Application to exhaustive search.

It is easy to see that application of this implementation in the ECB mode is very fast, but as usual in ECB modes, it suffers from many disadvantages. It would be preferable to use the standard CBC, CFB and OFB modes with this implementation, but this is impossible due to their sequential order. However, it is possible to use this implementation for standard CBC decryption, since the whole data can be decrypted in parallel, and then each result can be mixed with the previous ciphertext block. It is also possible to apply CFB decryption in a similar way. Therefore, this implementation can be used for fast decryption in standard modes, even when encryption is done by usual standard implementations.
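The asymmetry between CBC encryption and decryption can be seen in a short Python sketch. The block cipher here is a hypothetical invertible toy function on 64-bit values, not DES. `decrypt_batch` stands in for one bitsliced batch decryption, and all names and constants are assumptions made for the example.

```python
MOD = 1 << 64
KEY = 0x0123456789ABCDEF            # arbitrary toy key
ODD = 0x9E3779B97F4A7C15            # any odd multiplier is invertible modulo 2**64
ODD_INV = pow(ODD, -1, MOD)         # modular inverse (Python 3.8 or later)

def toy_encrypt(x):
    """Toy invertible 64-bit map, used only as a stand-in for a block cipher."""
    return ((x ^ KEY) * ODD) % MOD

def toy_decrypt(y):
    return ((y * ODD_INV) % MOD) ^ KEY

def cbc_encrypt(plain, iv):
    """Sequential: each block needs the previous ciphertext block first."""
    out, prev = [], iv
    for p in plain:
        prev = toy_encrypt(p ^ prev)
        out.append(prev)
    return out

def decrypt_batch(blocks):
    """All blocks are independent here, so they could share one bitsliced batch."""
    return [toy_decrypt(b) for b in blocks]

def cbc_decrypt_parallel(cipher, iv):
    raw = decrypt_batch(cipher)             # one parallel pass over all blocks
    previous = [iv] + cipher[:-1]           # ciphertext blocks are already known
    return [r ^ p for r, p in zip(raw, previous)]
```

Decryption needs only ciphertext blocks as chaining values, and these are all available before any block cipher call, which is why the batch can be formed.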

Parallel CBC encryption modes can be applied in this implementation by choosing initial values for the blocks encrypted simultaneously, and applying CBC on the full bit blocks. In this case, we can also encrypt under a different key in each of the parallel CBC modes (this might be especially attractive when a server has to encrypt data to many clients in parallel).

This implementation is even faster when conversion from/to the standard representation is not applied. In this case, DES is applied, but with a non-standard order of the ciphertext bits. To protect against multiple occurrences of the same plaintexts (actually the bits that enter one real DES in the non-standard representation), we should use new modes.

The ECB mode of this implementation takes the bits of the data and encrypts them as is. A CBC-like mode can have an initial value of bits (which can be derived from a bit value), and apply CBC on the bit cipher. This mode actually applies standard CBC modes in parallel, one for each of the DES applications in the non-standard representation. An improvement of this mode can mix the bits of each register, for example by rotating register i (containing the ith bits of the standard blocks) by i bits after adding i to the value of the register. CFB-like and OFB-like modes can be designed in a similar way.


Operations                        Number of Instructions
key XOR                           (load, XOR)
EPS table lookups                 (extbl, add, lookup)
XORing L with the S boxes         (XORs)
Total

Table. The number of instructions in each round on Alpha.

Operations                        Number of Instructions
IP                                (times XORs, shifts, AND)
E (Initial Expansion)
rounds                            (each instructions)
Removal of expansion
FP (Final permutation)
Total

Table. The number of instructions in DES on Alpha.

A Fast Standard DES Implementation on bit Processors

DES can be applied very efficiently on bit processors. Unlike on processors with shorter words, on bit processors the right half, expanded to bits, can be stored in one word. Moreover, by placing every group of six bits entering an S box in a separate byte, we can directly access the S box table by referencing via a single byte.

We apply the initial and final permutations by lookup tables from each byte to bits, and by XORing the results of the various table lookups.

We apply each round by XORing the right half (represented as eight bytes, in each of which six bits are used) with a subkey represented in the same way. Then, eight table lookups apply the eight S boxes, and the results are XORed. Each S box already includes the P permutation and the E expansion in its bit result. Note that due to this representation, several duplicated bits of the two halves should be omitted by the final permutation.
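The structure of such a round can be sketched in Python as below. The real tables combine S, P and E and are not reproduced here. `PLACEHOLDER_TABLES` is a hypothetical stand-in that simply copies each six-bit index back to its own byte, so that only the indexing logic is exercised. The use of the low six bits of each byte is likewise an assumption of this example.

```python
def round_function(right, subkey, tables):
    """One round on a 64-bit word holding eight bytes of six used bits each.

    tables[i][v] is assumed to be a precomputed 64-bit word that already
    contains the output of S box i followed by P and E.
    """
    x = right ^ subkey                      # key mixing for all eight S boxes at once
    out = 0
    for i in range(8):
        index = (x >> (8 * i)) & 0x3F       # the six bits entering S box i
        out ^= tables[i][index]             # one table lookup per S box
    return out

# Placeholder tables (NOT DES): entry v of table i is v placed in byte i.
PLACEHOLDER_TABLES = [[v << (8 * i) for v in range(64)] for i in range(8)]
```

The eight lookups do not depend on each other, which is the property that lets a pipelined processor overlap them.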

Tables and describe the number of operations required by this implementation, with the number of instructions on an Alpha processor. We implemented this code in C on a MHz Alpha, and got an encryption speed of Mbps. Triple DES runs at Mbps, since some IPs and FPs can be discarded. On the same processor, Eric Young's libdes (single DES) runs at Mbps.


Some comments on this implementation:

- The eight S boxes are applied in parallel, and thus pipelining can be exploited without pipeline stalls. In other ciphers and hash functions, like Feal, Khufu, Khafre, and MD, MD, SHA, each operation depends on the output of the previous operation, and thus might result in pipeline stalls, especially on newer (or future) processors which can compute several instructions simultaneously.

- All the tables and the variables take together about Kbytes, and fit easily into the cache.

- Still, in DES the input of the next round depends on the output of the preceding one. Although in practice this does not slow the execution, we have another solution: in DES, the input of each S box depends on the output of only six S boxes in the previous round. Thus, the code can be optimized to start computing the next round while still computing the preceding one. This can speed up implementations on pipelined processors, where we can compute several instances in parallel.

- Unlike some (although not all) DES implementations, we implement each S box as one table lookup, rather than combining pairs of S boxes into one lookup. The latter is more than twice slower, since the tables become larger than the size of the on-chip cache.¹

WDES

We can use this fast code to design a new, even faster and more secure cipher, which we call WDES. We convert the code by removing IP and FP, and changing the EPS operations (S boxes followed by P followed by E, as used in this implementation) into S boxes from bits to bits. These S boxes can be much better than the original ones, since each S box affects all the bits of all the S boxes in the next round, rather than one bit in only six S boxes.

WDES has bit blocks, and it runs much faster than DES with the same number of rounds, since the block size is larger and the slow initial and final permutations are discarded (its speed is Mbps on the same processor as in Table ).

Exhaustive Search on Powerful Computers and Networks

In this section we study the possibilities of exhaustive search on several kinds of machines and networks. We assume using the fast implementation described in the previous section.

¹ On Pentium, however, the latter is twice faster using the same C code.

Note that results similar to the ones described here hold also for breaking UNIX passwords, which are chosen from up to eight printable characters. In this case the password space has passwords, while each password trial requires encryptions (the salt should not be taken into account since it is known to the attacker, and the encryption code can be adjusted to the specific value of the salt). Therefore, about $2^{57}$ passwords should be tried, or about $2^{56}$ on average.

Special Purpose Computers

We can build a special purpose computer with very long registers, without the expensive operations (such as multiplication and floating point operations), and only with simple instructions (such as XOR, AND, OR, NOT). Assume that in a Pentium processor we remove the expensive operations, and use the extra chip space to increase the size of the registers to bits. Then, we need only processors to search the keys exhaustively in one year on average, or six months on average using the attack based on the complementation property.

It is possible, theoretically, to build a machine with million-bit registers. Unexpectedly, we now know that such a machine was actually built, with the support of the NSA: Cray Computers announced in March a computer that can apply about $2^{45}$ bit operations every second on a million one-bit processors (see Figure ). This computer can compute $2^{45}$ bit operations every second, and thus can compute about $2^{45}/2^{14} = 2^{31}$ DES encryptions every second. Therefore, we can apply the searches on this machine with the following results:

Search of    Time                                        Notes
bits         sec, min ( min in av.)                      Exportable ciphers
bits         sec, an hour ( an hour in av.)              Linear Cryptanalysis
bits         $2^{16}$ sec, a day ( hours in av.)         Differential Cryptanalysis
bits         $2^{25}$ sec, a year ( a year in av.)       Full key search

Cray Computers has since gone bankrupt, as nobody had bought this computer. Probably the NSA had a faster machine.

General Purpose Parallel Computers

It is known that Sandia National Labs has a parallel computer of MHz PentiumPro processors. This parallel computer can compute about $2^{46}$ bit operations every second. Thus, it can compute about $2^{46}/2^{14} = 2^{32}$ DES encryptions in each second. Therefore, we can apply the searches on this machine with the following results:

Search of    Time                                        Notes
bits         sec, min ( min in av.)                      Exportable ciphers
bits         sec, an hour ( min in av.)                  Linear Cryptanalysis
bits         $2^{15}$ sec, hours ( hours in av.)         Differential Cryptanalysis
bits         $2^{24}$ sec, months ( months in av.)       Full key search

When we apply the attack using the complementation property, exhaustive search of the full key space takes on average only about six weeks.
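The arithmetic behind such estimates is simple enough to sketch in Python. The function name `search_seconds` and its parameters are hypothetical, and the rate of $2^{32}$ encryptions per second in the usage note is taken only as an example value.

```python
def search_seconds(key_bits, encryptions_per_second,
                   average=True, complementation=False):
    """Estimated time of an exhaustive key search, in seconds.

    average:         on average only half of the key space is searched.
    complementation: the DES complementation property lets one encryption
                     test a key and its complement, halving the work again.
    """
    trials = 2 ** key_bits
    if average:
        trials //= 2
    if complementation:
        trials //= 2
    return trials / encryptions_per_second

SECONDS_PER_DAY = 24 * 60 * 60
```

For example, `search_seconds(56, 2**32, average=False)` is $2^{24}$ seconds (about 194 days), and with both options enabled the estimate drops to $2^{22}$ seconds, about seven weeks.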

Internet and the DES Worm

We can use the Internet for our exhaustive search, just as RSA factorization teams are doing. Assume that an average computer on the Internet is a single bit MHz RISC processor. Such a processor can encrypt about $2^{18}$ blocks every second. Therefore:

- Searching bits takes about $2^{21}$ seconds on average (from $2^{40}$ keys at $2^{18}$ blocks per second, halved for the average case), which is about three weeks on a single processor. computers can do it in about half an hour.

- Searching bits takes about six months on average. computers can do it in six hours.

- Searching bits takes about years on average on a single processor. computers can do it in four days, and computers can do it in one day.

- Searching all the bits takes about years on average on a single processor. computers can do it in a year, or in six months using the complementation property. It is practical to have this number of computers participating legally over the Internet (this is about the same number of computers as the RSA factorizations use). A million computers can do it in two days on average, or in one day using the complementation property.

At this point it is possible in practice to achieve participation of several thousands of computers legally over the Internet. However, it is simpler and faster to do it illegally.² A worm, which we call the DES worm, can break into many computers over the Internet and use their idle cycles for exhaustive search. The worm verifies that only one copy of it is executed on each computer (of course, on computers with several processors it can execute several copies to increase performance). The DES worm makes sure it cannot be easily noticed: it does not need much memory anyway, and it is executed at the lowest possible priority, so it does not disturb other applications on the same computer.

² The author does not recommend doing it, but we should always be aware that such a threat exists.

If the DES worm can get hold of about a million computers over the Internet, and assuming that it gets at least half of their cycles (people are usually not working over nights), the DES worm can find a key in four days on average, or in two days using the complementation property. Moreover, since most computers over the Internet are not used on weekends, which last over hours from Friday evening to Monday morning, the DES worm can use all the cycles and find a key in one weekend.

Acknowledgements

We are grateful to Ross Anderson and the referees for their various remarks and suggestions that improved the results and exposition of this paper. Some of this work has been done while the author was visiting the computer laboratory at the University of Cambridge, and in particular using their Alpha computer. This research was supported by the fund for the promotion of research at the Technion.

References

H. Feistel, and Data Security, Scientific American, Vol., No., pp., May.

James L. Massey, SAFER K: A Byte-Oriented Block-Ciphering Algorithm, proceedings of Fast Software Encryption, Cambridge, Lecture Notes in Computer Science, pp.

Ralph C. Merkle, Fast Software Encryption Functions, Lecture Notes in Computer Science, Advances in Cryptology, proceedings of CRYPTO, pp.

National Bureau of Standards, U.S. Department of Commerce, FIPS pub., January.

National Institute of Standards and Technology, Secure Hash Standard, U.S. Department of Commerce, FIPS pub., May.

National Institute of Standards and Technology, Secure Hash Standard, U.S. Department of Commerce, FIPS pub., April.

Ronald L. Rivest, The MD Message Digest Algorithm, Lecture Notes in Computer Science, Advances in Cryptology, proceedings of CRYPTO, pp.

Ronald L. Rivest, The MD Message Digest Algorithm, Internet Request for Comments, RFC, April.

Michael Roe, Performance of Block Ciphers and Hash Functions: One Year Later, proceedings of Fast Software Encryption, Leuven, Lecture Notes in Computer Science, pp.

Bruce Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, second edition, John Wiley & Sons.

Akihiro Shimizu, Shoji Miyaguchi, Fast Data Encryption Algorithm FEAL, Lecture Notes in Computer Science, Advances in Cryptology, proceedings of EUROCRYPT, pp.

Arthur Sorkin, Lucifer, a Cryptographic Algorithm, Cryptologia, Vol., No., pp., January.


OTC CRAY COMPUTER CORP. COMPLETES INITIAL TESTING

COLORADO SPRINGS, Colo., March (PRNewswire). Cray Computer Corp. (Nasdaq: CRAY) reported today the successful test and demonstration on March of an array of single bit processors packaged using the company's multi-chip module technology. This array is a major technical component of the CRAY Super Scalable System (CRAY SSS) that is being jointly developed by the company, the National Security Agency, and the Supercomputing Research Center (SRC), which was originally announced on August. This test and demonstration completes the first of a number of major tasks required under the Development Contract.

Researchers from the SRC verified correctness of operation of the single bit processor array (approximately individual Integrated Circuits), which is the first half of a single bit processor array called for in the development contract. This array is coupled to a CRAY. The CRAY SSS utilizes the Processor-In-Memory (PIM) chips developed by the SRC. Both NSA and SRC are providing significant technical assistance in both the software and hardware aspects of the system.

Once completed, the high performance system will consist of a dual processor million byte CRAY and a single bit processor Single Instruction Multiple Data (SIMD) array with a million byte memory. This CRAY Super Scalable System will provide high-performance vector parallel processing, scalable parallel processing, and the combination of both in a hybrid mode featuring extremely high bandwidth between the PIM processor array and the CRAY. The current schedule for completion of the Development Contract is the end of July, including a day public Internet access demonstration.

For suitable applications, a SIMD processor array of million processors would provide up to Trillion Bit Operations per Second and price/performance unavailable today on any other high-performance platform. The CRAY system with the SSS option will be offered as an application specific product. The joint development contract is part of the Federal Government's High Performance Computing and Communications program.

Charles Breckenridge, executive vice president of Marketing at Cray Computer Corp., said: "The CRAY SSS will provide unparalleled performance for many promising applications. We are pleased to participate in this transfer of government technology, and we are eager to help potential customers explore and develop appropriate applications."

Cray Computer Corp. is engaged in the design, development, manufacture and marketing of the CRAY, CRAY SSS and CRAY high performance computer systems.

CONTACT: Charles Breckenridge, executive VP of Marketing, or Terry Willkom, president of Cray Computer, or David Gould of Chip Shots Inc.

Fig. Cray Computer Corp. press release of March.
