A Fast New DES Implementation in Software

Eli Biham
Computer Science Department, Technion - Israel Institute of Technology, Haifa, Israel
Email: biham@cs.technion.ac.il
WWW: http://www.cs.technion.ac.il/~biham/

Abstract

In this paper we describe a fast new DES implementation. This implementation is about five times faster than the fastest known DES implementation on a [...]-bit Alpha computer, and about three times faster than our new optimized DES implementation on [...]-bit computers. This implementation uses a non-standard representation, and views the processor as a SIMD computer, i.e., as [...] parallel one-bit processors computing the same instruction. We also discuss the application of this implementation to other ciphers. We describe a new optimized standard implementation of DES on [...]-bit processors, which is about twice as fast as the fastest known standard DES implementation on the same processor. Our implementations can also be used for fast exhaustive search in software, which can find a key in only a few days or a few weeks on existing parallel computers and computer networks.

Introduction

In this paper we describe a new implementation of DES, which can be executed very efficiently in software. This implementation is best used with a non-standard order of the bits of the DES blocks. It does not suffer from the high overhead of computing permutations of bits. Instead, we view a processor with, for example, [...]-bit words as a SIMD parallel computer which can compute [...] one-bit operations simultaneously, while the bits of each block are set in different words, of which the first bit is always of the first block, the second bit belongs to the second block, etc.

The operations that DES uses are as follows. The XOR operation: in our view, the XOR operation of the processor computes [...] one-bit XORs. The expansion and permutation operations: these operations do not cost any operation, since instead of changing the order of words, or duplicating words, we can address the
required word directly. We remain with the S boxes. Usual implementations of S boxes use table lookups. However, in our representation table lookups are very inefficient, since we have to collect six bits, each bit from a different word, combine them into one index to the table, and after the table lookup take the four resultant bits and put each of them in a different word.

Technion - Computer Science Department - Technical Report CS0891 - 1997

    Cipher                                    Speed (Mbps)
    ----------------------------------------  ------------
    DES (Eric Young's libdes)                 [...]
    GOST                                      [...]
    SAFER                                     [...]
    Blowfish                                  [...]
    Our DES implementation                    [...]
    Our DES implementation (triple DES)       [...]
    Our fastest DES                           [...]
    Our fastest DES (triple DES)              [...]

    Estimation based on [...]

Table 1: The speeds of our implementations and of various ciphers on a [...] MHz Alpha processor (in Mbps).

We observed that there is a much faster implementation of the S boxes in our representation: they can be represented by their logical gate circuit. In such an implementation each S box is typically represented by about [...] gates, and thus we can implement an S box with about [...] instructions. We actually view the whole cipher as its gate circuit, and apply it in software. In this implementation we actually compute the circuit [...] times in parallel (as the size of the processor word), and thus can gain a high speedup even though we use very simple operations. On average, on [...]-bit processors, each S box costs about [...] instructions for each encrypted block, while each instruction takes only one clock cycle. The full circuit of DES contains about [...] gates (including the key scheduling, which costs nothing), and thus we can compute DES [...] times in about [...] instructions on [...]-bit processors. On average, this comes to about [...] instructions for the encryption of each DES block. Conversion from and to the standard block representation takes together about [...] instructions per block, and thus encryption of standard representations with our implementation takes about [...] instructions. For comparison, our fast standard implementation of DES described in this paper requires about [...]
instructions for each block. Table 1 summarizes the speeds of our implementations, of a standard fast DES implementation (Eric Young's libdes), and of various fast ciphers.

The same idea can be applied to other ciphers. Our implementation of these ciphers is efficient especially when the cipher does not use all the power of the machine instructions, i.e., when each instruction mixes only a few of the bits (such as S boxes, or eight-bit additions on [...]-bit processors), and when the word size of the processor is large (such as [...] bits) while the cipher uses shorter registers. For example, our implementation of Feal is expected to be about [...] times faster than direct implementations. Both variants of Lucifer, and GOST, can also be implemented very efficiently using this approach. Our implementation of ciphers which use more complex operations (such as multiplication or large S boxes) requires more instructions to simulate the complex operations, and is thus less efficient.

In Section [...] we describe an optimized standard implementation on [...]-bit computers. It uses the [...]-bit registers of a [...]-bit processor, and runs almost twice as fast as the fastest implementation designed for [...]-bit architectures on the same processor. It even runs faster than fast ciphers such as GOST, SAFER and Blowfish. The speed is gained by using the long [...]-bit registers effectively; in all other respects this is a standard implementation. We suggest a new DES-like cipher, which we call WDES, based on the structure of this fast implementation, but which is about [...] times faster.

In Section [...] we discuss using these fast implementations for exhaustive search, and conclude that it is applicable even today using existing general-purpose parallel computers and computer networks.

The New Non-Standard DES Implementation

This implementation uses a non-standard representation of the data in software, and in particular it does not have any table lookup.
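The non-standard representation can be illustrated with a minimal sketch. It assumes a 64-bit word and numbers bits from the least significant end; the function names (`to_slices`, `from_slices`, `xor_slices`, `rename`, `permute_block`) are hypothetical and are not taken from the paper, and the permutation used in the usage example is arbitrary rather than one of the DES tables.

```python
WORD_BITS = 64    # assumed processor word size: one lane per block
BLOCK_BITS = 64   # DES block size in bits
MASK = (1 << WORD_BITS) - 1


def to_slices(blocks):
    """Transpose blocks so that slices[i] holds bit i of every block.

    Lane j (bit j) of each slice word belongs to blocks[j].
    """
    if len(blocks) > WORD_BITS:
        raise ValueError("at most WORD_BITS blocks fit in one batch")
    slices = [0] * BLOCK_BITS
    for j, block in enumerate(blocks):
        for i in range(BLOCK_BITS):
            slices[i] |= ((block >> i) & 1) << j
    return slices


def from_slices(slices, count):
    """Inverse of to_slices: rebuild `count` blocks from the slice words."""
    blocks = [0] * count
    for i, word in enumerate(slices):
        for j in range(count):
            blocks[j] |= ((word >> j) & 1) << i
    return blocks


def xor_slices(x, y):
    """One word-wide XOR per bit position handles every block at once."""
    return [a ^ b for a, b in zip(x, y)]


def rename(slices, order):
    """A bit permutation (or expansion) is only a re-indexing of the words."""
    return [slices[src] for src in order]


def permute_block(block, order):
    """Reference: the same permutation done bit by bit on a single block."""
    return sum(((block >> src) & 1) << dst for dst, src in enumerate(order))
```

For example, `xor_slices(to_slices(blocks), to_slices(keys))` mixes 64 blocks with 64 values using 64 word-wide XORs, and `rename` performs no bit operations at all, which matches the claim that permutations and expansions cost no instructions. The transposition loops here are deliberately naive and say nothing about the cost of the conversion in the paper.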
Instead of encrypting many [...]-bit words one at a time, we encrypt [...] words simultaneously, and each operation encrypts one bit in each of the words. Actually, we view a [...]-bit processor as a SIMD computer with [...] one-bit processors. This implementation simulates a fast DES hardware whose number of gates is minimal, and computes each gate by a single instruction. In particular, the S boxes are computed by their gate circuit using the XOR, AND, OR and NOT operations, and the permutations and expansions do not require any instruction, since they can be viewed as only changing the naming of the registers. Although the S boxes are implemented in more instructions than in usual implementations, the parallelism of this implementation speeds it up much more than the S box implementation slows it down. Moreover, some of the operations can be optimized out in some cases, such as when some parts of the S boxes are similar (the same or complements).

We represent the S boxes by their gate circuit using the best-known XOR, AND, OR and NOT operations, optimized to reduce the total number of gates. Although the problem of finding the best such circuit is still open, we found the following optimization, which requires at most [...] gates per DES S box, and only [...] gates on average.

                                  Instructions
    ----------------------------  ---------------------------------
    Expansion                     [...]
    Key mixing                    [...]
    P                             [...]
    XOR with the left half        [...]
    S boxes (on average)          [...]
    Load/store                    [...] load, [...] load, [...] store
    Total per round               [...]

Table 2: The number of instructions in each round on Alpha.

                                  Total    Average per block
    ----------------------------  -------  -----------------
    IP/FP                         [...]    [...]
    [...] rounds                  [...]    [...]
    ([...] gates per bit)
    Conversion of representation  [...]    [...]

Table 3: The number of instructions in DES on Alpha.

In the description we denote the six input bits by $abcdef$. We compute all the [...] functions of $d$ and $e$ into registers (excluding the constant 0 and the constant 1). It requires two NOTs ($\bar{d}$, $\bar{e}$) and [...] additional operations ($d$, $e$, $\bar{d}$, $\bar{e}$ are then already known). This computation is done only once for each S box. For each output bit of the S box we compute the result
using these functions. We use six operations for each line of the S box, and six operations to combine the results together: [...] operations for each output bit. In total we use at most [...] gates for each S box, but on average we need only about [...] gates per S box. Each combination of four values (the four values of $bc$, or the four values of $af$, e.g., combining the quarters of each of the four lines, $\bar{b}\bar{c}$, $\bar{b}c$, $b\bar{c}$, $bc$, or combining the four lines) is computed by (assuming the first case)

$$ b \cdot \underline{(f_{00} \oplus f_{10})} \;\oplus\; c \cdot \underline{(f_{00} \oplus f_{01})} \;\oplus\; b \cdot c \cdot \underline{(f_{00} \oplus f_{01} \oplus f_{10} \oplus f_{11})} \;\oplus\; \underline{f_{00}}, $$

where the underlined values are known constants and $f_{bc} = S(abcdef)$, where $d$, $e$ are the actual values of the input ($f_{bc}$ is one of the values kept in registers above) and the values of $a$, $f$ are assumed, to be instantiated in the next step. More accurately, in the intermediate steps we compute the combinations of S box entries as suggested by the above equation (e.g., $f_{00}$, $f_{00} \oplus f_{01}$, $f_{00} \oplus f_{10}$, $f_{00} \oplus f_{01} \oplus f_{10} \oplus f_{11}$), rather than the various values of the entries themselves.

Tables 2 and 3 describe the maximum number of gates per round and for the full DES. Therefore, we expect the speed to be about [...] Mbps on [...] MHz Alpha processors. In practice we achieve speeds of about [...] Mbps, since the processor can apply more than one instruction in each clock cycle. Conversion between the standard and the non-standard representations can also be done in about [...] instructions.
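The four-way combination used for the S boxes can be checked with a small sketch. It is a deliberately naive gate circuit built by applying the same selection rule three times (over $d,e$, then $b,c$, then $a,f$), and it recomputes values that the optimized circuit described above keeps in registers, so it uses many more operations than the counts in the text. The names `select4`, `sbox_output_bit`, `all_inputs` and `lookup` are hypothetical, the 64-bit word is an assumption, and `S1` is the standard DES S box S1.

```python
MASK = (1 << 64) - 1  # assumed 64-bit word: one lane per block

# Standard DES S box S1: four lines of 16 entries each.
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]


def select4(x, y, v00, v01, v10, v11):
    """Pick v_xy in every lane using only AND and XOR."""
    return (v00
            ^ (x & (v00 ^ v10))
            ^ (y & (v00 ^ v01))
            ^ (x & y & (v00 ^ v01 ^ v10 ^ v11)))


def sbox_output_bit(table, bit, a, b, c, d, e, f):
    """Compute one output bit of an S box for all lanes at once.

    a..f are words holding the six input bits (a is the first bit);
    lines are selected by (a, f) and columns by (b, c, d, e).
    """
    def quarter(line, bc):
        # A quarter of a line is a Boolean function of d and e.
        consts = [MASK if (table[line][(bc << 2) | de] >> bit) & 1 else 0
                  for de in range(4)]
        return select4(d, e, *consts)

    def whole_line(line):
        return select4(b, c, *(quarter(line, bc) for bc in range(4)))

    return select4(a, f, *(whole_line(line) for line in range(4)))


def all_inputs():
    """Six words (a..f) in which lane v carries the 6-bit input value v."""
    return tuple(sum(((v >> k) & 1) << v for v in range(64))
                 for k in (5, 4, 3, 2, 1, 0))


def lookup(table, v):
    """Reference table lookup for the 6-bit input value v = abcdef."""
    line = ((v >> 5) << 1) | (v & 1)
    column = (v >> 1) & 0xF
    return table[line][column]
```

Because an S box has exactly 64 possible inputs, `all_inputs()` places every input in its own lane, so a single call to `sbox_output_bit` per output bit evaluates the whole truth table in parallel. As a side remark on the operation counts, and as an inference from the description rather than a quoted figure: the 16 Boolean functions of $d$ and $e$, less the two constants and $d$, $e$ themselves, leave 12 values (two NOTs and 10 further operations), and four lines at six operations each plus six combining operations give 30 operations per output bit, hence at most 12 + 4 x 30 = 132 operations per S box.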
