<<

The Chinese Theorem and its Application

in a High-Speed RSA Crypto Chip£

Johann Großsch¨adl Graz University of Technology Institute for Applied Information Processing and Communications Inffeldgasse 16a, A–8010 Graz, Austria [email protected]

Abstract of two prime . This allows to utilize the Chinese Remainder Theorem (CRT) in to speed up the The performance of RSA hardware is primarily deter- private key operations. From a mathematical point of view, mined by an efficient implementation of the long the usage of the CRT for RSA decryption is well known. modular and the ability to utilize the Chinese But for a hardware implementation, a special multiplier ar- Remainder Theorem (CRT) for the private key operations. chitecture is necessary to meet the requirements for efficient

This paper presents the multiplier architecture of the RSA­ CRT-based decryption. crypto chip, a high-speed hardware accelerator for long in- This paper presents the basic algorithmic and architec-

teger modular arithmetic. The RSA­ multiplier datapath is

tural concepts of the RSA­ crypto chip, and describes how reconfigurable to execute either one 1024 bit modular expo- they were combined to provide optimum performance. A nentiation or two 512 bit modular exponentiations in paral- special focus is directed towards the implementation of lel. Another significant of the multiplier core the CRT-based decryption. The major design goal with

is its high degree of parallelism. The actual RSA­ prototype

the RSA­ was the maximization of performance on se- contains a 1056£16 bit word-serial multiplier which is op- veral levels, including the implemented hardware algo- timized for modular according to Barret’s rithms, the multiplier architecture, and the VLSI circuit modular reduction method. The multiplier core is dimen-

technique. The RSA­ crypto chip applies the FastMM al- sioned for a clock frequency of 200 MHz and requires 227 gorithm [MPPS95] for modular . FastMM clock cycles for a single 1024 bit modular multiplication. is based on Barret’s modular reduction method Pipelining in the highly parallel long integer unit allows to [Bar87] and performs a modular multiplication by three achieve a decryption rate of 560 kbit/s for a 1024 bit ex- simple multiplications and one . The - ponent. In CRT-mode, the multiplier executes two 512 bit free property of Barret’s modular reduction method makes modular exponentiations in parallel, which increases the FastMM algorithm very attractive for hardware implemen-

decryption rate by a factor of 3.5 to almost 2 Mbit/s. tation.

½¼56 £ ½6 The core of the RSA­ crypto chip is its bit word-serial multiplier, which schedules the multiplicand 1. Introduction fully parallel and the multiplier in 16 bit words. As a conse- quence of this direct processing of long , the com-

Many popular crypto-systems like the RSA putation time for an Ò bit is dra- scheme [RSA78], the Diffie-Hellman (DH) key agreement matically reduced. Furthermore, Booth recoding [Mac61] is scheme [DH76], or the Algorithm (DSA) implemented to halve the of partial-products. Thus [Nat94] are based on long integer modular exponentiation. the multiplication speed is almost doubled with low addi- A major difference between the RSA scheme and crypto- tional hardware effort. In addition, pipelining significantly systems based on the problem is the fact increases the throughput in RSA en- and decryption. Due

that the modulus used in the RSA encryption scheme is the to the high degree of parallelism in the multiplier core, the

­ £ RSA crypto chip executes a 1024 bit modular multiplica- The work described in this paper was supported by the Austrian Science Foundation (FWF) under grant number P12596–INF “Hochge- tion in 227 clock cycles. The multiplier core is dimensioned schwindigkeits-Langzahlen-Multiplizierer-Chip”. for a clock frequency of 200 MHz and designed in a full-

Ò=¾ Ò E custom methodology, which results in a very small silicon ¿ modular multiplications for an bit exponent , as- . suming the exponent contains roughly 50% ones. There- By taking advantage of the Chinese Remainder Theorem fore, the efficient implementation of the modular multipli- (CRT), the computational effort of the RSA decryption can cation is the key to high performance.

be reduced significantly. If the two prime numbers È and Æ

É of the modulus are known, it is possible to calculate 2.2 Barret’s modular reduction method

D

= C ÑÓ d Æ

the modular exponentiation Å separately

È ÑÓ d É ÑÓ d and with shorter exponents, as described in In 1987, Paul Barret introduced an algorithm for the

[QC82] and [SV93]. Since the length of the exponent is reduction operation which he used to implement

¾ ¿Ò=4 about Ò= , approximately modular multiplications are RSA encryption on a digital signal processor [Bar87]. At necessary for a single modular exponentiation. The RSA­ a first glance, a modular reduction is simply the computa- crypto chip is able to compute two exponentiations in par- tion of the remainder of an integer division :

allel, as the 1024 bit multiplier core can be split into two

  

512 bit multipliers. Running the two 512 bit multipliers 

Z Z

Z ÑÓ d Æ = Z Õ = = Z ÕÆ

in parallel allows both CRT related exponentiations to be Æ with (1) Æ computed simultaneously. Compared to the non-CRT based Æ

RSA decryption performed on an Ò bit hardware, utilizing But, compared to other operations, even to the multiplica- the CRT results in a speed-up factor of approximately 3.5. tion, a division is very costly to implement in hardware. The rest of the paper is structured as follows: In the next Barret’s basic idea was to replace the division with a mul- section, the implemented for exponentiation and tiplication by a precomputed constant which approximates

modular multiplication are presented. Section 3 focusses on the inverse of the modulus. Thus the calculation of the ex-

¤ ¥

Z =

the Chinese Remainder Theorem and explains how it can be act quotient Õ is avoided by computing the quotient

Æ Õ

used to speed up the RSA decryption. Section 4 presents the e instead:

kj k j

6 7

¾Ò

­

6 7 ¾

architecture of the RSA multiplier core and describes the Z

Ò ½

6 7

¾ Æ 4

execution of a simple multiplication. In section 5, imple- 5

Õ =

e (2)

Ò·½

mentation problems like floorplanning and clock distribu- ¾ tion are discussed. This topic is carried on in section 6, and

it is detailed how the multiplier architecture was adapted to Although equation (2) may look complicated, it can be cal-

Ò ½

comply with the requirements for efficient CRT-based RSA culated very efficiently, because the divisions by ¾ or

Ò·½

decryption. The paper finishes with a summary of results ¾ , respectively, are simply performed by truncating the

½ Ò ·½

and conclusions in section 7. least significant Ò or bits of the operands. The ex-

¤ ¥

¾Ò

¾ =Æ pression depends on the modulus Æ only and is constant as long as the modulus does not change. This con- 2 Implemented algorithms stant can be precomputed, whereby the modular reduction operation is reduced to two simple multiplications and some In order to develop high-speed RSA hardware, we not operand truncations. only need good and efficient algorithms for modular arith- When calculating a modular reduction according to Bar-

metic, but also a multiplier architecture which has to be op- ret’s method, the result may not be fully reduced, but it is

¿Æ ½ timized for those algorithms. This section presents hard- always in the range ¼ to . Therefore, one or two sub-

ware algorithms for exponentiation, modular reduction and tractions of Æ could be necessary to get the exact result. modular multiplication. 2.3 FastMM algorithm 2.1 Binary exponentiation method

A closer look at Barret’s algorithm shows that trunca-

½ Ò ·½ When applying the binary exponentiation method (also tion of operands at Ò or bit borders are necessary.

known as and multiply algorithm), a modular expo- For reasons of regularity, it would be advantageous to ap-

E

ÑÓ d Æ Û nentiation C is performed by successive modular ply truncations only at multiples of the word-size of the multiplications [Knu69]. The MSB to LSB version of the multiplier hardware (usually 16 or 32 bit) rather than at the algorithm is frequently preferred against the LSB to MSB original bit positions. Therefore, we modified Barret’s algo- one, because the latter requires storage of an additional in- rithm in order to apply the truncations only at multiples of

termediate result. Modular reduction after each multiplica- Û , whereby these truncations can be performed by succes-

tion step avoids the exponential growth in size of the inter- sive Û bit right-shift operations. In the modified Barret re- mediate results. The square and multiply algorithm needs duction, denoted as FastMM algorithm [MPPS95], the quo-

9

Z = A[Ò ·¾Û ½ :: ¼] ¡ B [Ò ·¾Û ½ :: ¼]

>

>

=

É = Z [¾Ò · Û ½ :: Ò Û ] ¡ Æ ½[Ò ·¾Û ½ :: ¼]

e ¾f¼; ½g = A ¡ B ÑÓ d Æ · eÆ

X with

ÆegÊ = É[¾Ò ·4Û ½ :: Ò ·¾Û ] ¡ ÆegÆ[Ò ·¾Û ½ :: ¼]

>

>

;

X = Z [Ò ·¾Û ½ :: ¼] · ÆegÊ[Ò ·¾Û ½ :: ¼]

Figure 1. Modular multiplication according to the FastMM algorithm. Õ

tient e is calculated as follows: 3 Chinese Remainder Theorem

j kj k

7 6

¾Ò·Û

7 6

Z ¾

Ò Û

7 6

D

¾ Æ

Å = C ÑÓ d Æ 5

4 The complexity of the RSA decryption

Õ =

e (3)

Ò·¾Û

¾ Æ depends directly on the size of D and . The decryption ex-

ponent D specifies the numbers of modular multiplications The FastMM algorithm combines multiplication and modi- necessary to perform the exponentiation, and the modulus fied Barret reduction to implement the modular multiplica- Æ determines the size of the intermediate results. A way of

tion by three multiplications and one addition according to Æ reducing the size of both D and is to take advantage of the formulas shown in figure 1. The values in the squared properties stated by the Chinese Remainder Theorem (CRT)

brackets indicate the bit positions of the operands. All three and Fermat’s Little Theorem.

·¾Û multiplications and the addition are performed with Ò

significant bits. The FastMM algorithm uses two constants, 3.1 Mathematical background

½ ÆegÆ Æ and , to calculate the (possibly not fully reduced) result of the modular multiplication. Since these two con- In the following, some basic facts and conclusions of stants depend on the modulus Æ only, they can be precom- puted according to the following equations: the CRT are summarized. This mathematical background

knowledge is of elementary importance for the efficient re-

 

¾Ò·Û

¾ alization of RSA decryption.

½ =

Æ (4)

Æ

·¾Û

Ò Theorem 1 (Chinese Remainder Theorem) Let the num-

= ¾ Æ

ÆegÆ (5)

Ò ;Ò ;:::;Ò

¾ k

bers ½ be positive integers which are relatively

Æ ½ ÆegÆ

Ò ;Ò µ =½ i 6= j

Precomputation of and has no significant influ- gcd ´ j

prime in pair, i.e. i when . Further-

Æ

= Ò Ò :::Ò Ü ;Ü ;:::;Ü

ence on the overall performance if the modulus changes Ò

¾ k ½ ¾ k more, let ½ and let be inte-

rarely compared to the data. gers. Then the system of congruences

¾Ò·Û

¾ ½

The constant Æ approximates the exact value of

Æ

Ü  Ü ÑÓ d Ò

½ ½

with limited accuracy, therefore some error eÆ is intro-

X

 Ü ÑÓ d Ò

duced when calculating the result . An exact analysis Ü

¾ ¾ of the FastMM algorithm according to [Dhe98] shows .

that the result of the modular multiplication is given as .

AB ÑÓ d Æ · eÆ e ¾f¼; ½g X

Ü  Ü ÑÓ d Ò k

with . This means that k

¾Æ ½ might not be fully reduced, but is in the range ¼ to , thus the error is at most once the modulus. has a simultaneous solution Ü to all of the congruences, and

When applying the square and multiply algorithm to- any two solutions are congruent to one another modulo Ò. ¼

gether with FastMM algorithm to calculate a modular ex- Furthermore there exists exactly one solution Ü between

½ ponentiation, no correction of the intermediate results is and Ò . necessary. Although the intermediate results are not always fully reduced, continuing the exponentiation with an incom- A constructive proof of the Chinese Remainder Theorem is

plete reduced intermediate result does not cause an error given in [Kob94]. The unique solution Ü of the simultane-

 Ü<Ò

bigger than once the modulus. An exact proof with detailed ous congruences satisfying ¼ can be calculated as

error estimation can be found in [Dhe98] or [PP89]. If a ! k

final modular reduction is necessary after the modular ex- X

Ü = Ü Ö × ÑÓ d Ò

i i

ponentiation has finished, it can be performed by adding i (6)

i=½

ÆegÆ to the result. Thus the final reduction requires no

= ´ Ü Ö × · Ü Ö × · ::: · Ü Ö × µÑÓdÒ

½ ½ ¾ ¾ ¾ k k k additional hardware effort or precomputed constants. ½

Ò

½

Ö = × = Ö ÑÓ d Ò i =½; ¾;:::;k Å Å Å

i i i È É

where i and for . The recombination of and to get can be Ò

In [MVV97], thisi method is termed as Gauss’s algorithm. done according to equation (6). For the special case

k =¾ Ò = È Ò = É Ò = Æ = ÈÉ ¾

of , ½ , and , we get

Ò ;Ò ;:::;Ò

¾ k

Corollary 1.1 If the integers ½ are pairwise

Ö = Æ=È = É Ö = Æ=É = È ¾

½ and . Moreover, equation

Ò = Ò Ò :::Ò

¾ k relatively prime and ½ , then for all inte-

(6) can be simplified by using Fermat’s little theorem:

a  b ÑÓ d Ò

gers a; b it is always valid that if and only

¡ ¡¡

a  b ÑÓ d Ò i =½; ¾;:::;k

if i for each .

½ ½

Å = Å É É ÑÓ d È · Å È È ÑÓ d É ÑÓ d Æ

È É

¡ ¡¡

a<Ò

¾ É ¾

As a consequence of the CRT, any positive integer È

= Å É É ÑÓ d È · Å È È ÑÓ d É ÑÓ d Æ

È É

[ a ;a ;:::;a ]

¡ ¡¡

¾ k

can be uniquely represented as a k-tuple ½

È ½ É ½

= Å É ÑÓ d Æ · Å È ÑÓ d Æ ÑÓ d Æ É

È (9)

a a ÑÓ d Ò i

and vice versa, whereby i denotes the residue

=½; ¾;:::;k a

for each i . The conversion of into the The last equality in equation (9) comes from the fact

Ò ;Ò ;:::;Ò

a ´b ÑÓ d cµ=´abµÑÓd´acµ

¾ k

residue system defined by ½ is simply done that for any nonnegative in-

È ½

a ÑÓ d Ò

a; b; c É ÑÓ d Æ

by modular reductions i . Conversion back from tegers . Note that the coefficients

É ½

ÑÓ d Æ

residue representation to “standard notation” is somewhat and È are constant and can be precomputed, Å

more difficult as it requires the calculation stated in equa- Å É thereby the effort for the recombination of È and is tion (6). reduced to two multiplications, one addition and one reduc-

tion modulo Æ .

Theorem 2 (Fermat’s Little Theorem) Let Ô be a prime.

D = D ÑÓ d ´È ½µ

È

½

Ô When assuming that the exponents

Ô a  ½ÑÓdÔ

Any integer a not divisible by satisfies .

D = D ÑÓ d ´É ½µ

and É , as well as the constants which

È ½

Ê = É ÑÓ d Æ

For a proof refer to [Kob94]. Fermat’s little theorem is very are needed for the recombination È and

É ½

Ê = È ÑÓ d Æ

useful for calculating the of an integer É have been precomputed, the CRT-

Ô ¾ ½

a  a ÑÓ d Ô a because . based RSA decryption can take place according to the fol-

lowing steps [SV93]: Ô

Corollary 2.1 If an integer a is not divisible by and if

Ò Ñ

Ò  Ñ ÑÓ d Ô ½ a  a ÑÓ d Ô

= C ÑÓ d È C = C ÑÓ d É

, then . C É

1. Calculate È and .

D

È

Å = C ÑÓ d È È

Collorally (2.1) states that when working modulo a prime 2. Calculate the exponentiations È

D

É

Å = C ÑÓ d É

Ô ÑÓ d ´Ô ½µ É

, the exponents can be reduced . This allows and É .

= Å Ê ÑÓ d Æ

to perform the RSA decryption with significantly shorter Ë

È È

3. Calculate the coefficients È and

= Å Ê ÑÓ d Æ

exponents. Ë

É É

É .

Å = Ë · Ë Å  Æ É 4. Calculate È .If then calculate

3.2 RSA decryption using the CRT

= Å Æ

Å .

É Å <Æ = ÈÉ

Since È and are primes, any message It is obvious to see that the initial reductions of the ci-

C

[ Å ;Å ] É

is uniquely represented by the tuple È , where phertext (step 1) and the Chinese recombination (steps 3

Å = Å ÑÓ d È Å = Å ÑÓ d É É

È and . Therefore, it is and 4) do not cause significant computational cost com-

Å Å ;Å É also possible to obtain by computation of È and pared to the calculations in step 2. The two exponentia-

recombining them according to equation (6), rather than tions (step 2) can be computed independent from each other

D

= C ÑÓ d Æ the usual computation of Å . By using the and in parallel. Compared to the non-CRT based decryption

corollary (2.1), the size of the exponent can be scaled down: performed on an Ò bit hardware, one of the CRT-based ex- ¾

ponentiations is about 4 times faster if an Ò= bit hardware

¡

D

= Å ÑÓ d È = C ÑÓ d Æ ÑÓ d È

Å is used. This dramatic speed improvement is due to the È

D 50% length reduction of both, the exponent and the mod-

C ÑÓ d È ´ Æ = Èɵ = since

(7) ulus. Performing the two exponentiations within step 2 in

D ÑÓ d ´È ½µ

= C ÑÓ d È ¾

parallel requires two Ò= bit modular multipliers, yielding a

D

È

4 ¯ ¯

D = D ÑÓ d ´È ½µ ÑÓ d È = C

with È speed up factor of , with accounting for the steps 1, 2 and 4. See section 6 for a suggested reconfigurable modular

Furthermore, it is easily observed that the ciphertext C

multiplier which is able to execute either one Ò bit exponen-

È Å

can be reduced modulo before computing È , so the ¾ tiation or two Ò= bit exponentiations in parallel.

lengths of all operands are scaled down by half. With the

C = C ÑÓ d È C = C ÑÓ d É É quantities È and , as well

4 Multiplier architecture

D = D ÑÓ d ´È ½µ D = D ÑÓ d ´É ½µ É

as È and ,we

Å Å É get the following equations for È and :

In section 2 it was explained how a modular exponen-

D D

È É

ÑÓ d È ÑÓ d É Å = C Å = C

È É É È and (8) tiation can be calculated by continued modular multiplica- Û

High Registers (S+C) Û CLA Û

Û CLA

Û bit hard-wired

right shift 4:2 Accumulator

Û Û Û

Carry Save Adder

PPw/2-1 PP2 PP1 PP0

Û=¾ ¾ Û= Booth

Part.Prod. Generator ¾ Û= Rec.

Multiplicand Register Û

Low Register

Figure 2. Basic architecture of the partial parallel multiplier. tions, and how three simple multiplications and one addi- rates a radix 4 encoding of the multiplier, it requires a more tion result in a modular multiplication. The multiplier hard- complex partial-product generator (PPG) and a Booth re- ware which will be introduced in this section is optimized coder circuit (BR). The Booth recoder circuit is needed to

for executing long integer multiplications according to the generate the appropriate control signals for the PPG. As-

=½6 Ò = ½¼¾4 FastMM algorithm. suming Û and , there are eight PPGs re- quired to calculate the eight partial-products, whereby each

4.1 Partial parallel multiplier PPG consists of 1024 Booth multiplexers. ¾

In order to reduce these Û= partial-products to a single

¾ ¾ redundant number, the array multiplier needs Û= carry Figure 2 illustrates the architecture of the high-speed save adders (CSA), assuming that the first three partial- word-serial multiplier of the RSA­ , in the following de- products are processed by one CSA. Each CSA in the mul- noted as partial parallel multiplier (PPM). According to

tiplier core consists of Ò half-cycle full adders presented in

Û=¾ the word-size Û , the PPM processes partial-products

[Sch96], which introduce only low latching overhead and Ò of Ò bits at once, whereby denotes the length of the mod- allow the maximum clock frequency to be kept high.

ulus. In order to reach a high degree of parallelism, a word-

=½6 ­

size of Û was chosen for the PPM. The actual RSA The output of the CSA is accumulated to the current in- = ½¼¾4

prototype is optimized for a modulus length of Ò , termediate sum in a carry save manner, too. This implies

Ò ·¾Û µ £ Û = ½¼56 £ ½6 thus the PPM has a dimension of ´ a carry save accumulator circuit performing a 4:2 reduc- bits. The PPM could be implemented in an array type ar- tion. The accumulator circuit also consists of half cycle full chitecture [Rab96] or a Wallace tree architecture [Wal64]. adders, thus the 4:2 reduction is finished after one clock cy-

For RSA­ the array architecture is the better choice since cle. Aligning the intermediate sum to the next CSA output it results in a more regular layout and less routing effort, is done by a Û bit hard-wired right shift operation. Be- especially when Booth recoding is applied. side the carry save adders, also two carry lookahead adders Booth recoding [Mac61] is implemented to halve the (CLA) are required to perform a redundant to binary con- number of partial-products, which almost doubles the mul- version of 16 bit words. Redundant to binary conversion is tiplication speed with low additional hardware effort. Ac- a 2:1 reduction, therefore a pipelined version of the CLA

proposed in [BK82] is used to overcome the carry delay.

Û=¾ cording to the word-size Û , the PPM processes partial-

products of Ò bits at once. Since Booth recoding incorpo- Additionally, the PPM also consists of seven registers, each of them is Ò bit wide: HighSum and HighCarry, Low, Subsequently, the binary words are shifted into the reg- Multiplicand, Data, N1 and NegN. Note that the registers ister Low, where they finally represent the complete Data, N1 and NegN are not shown in figure 2. The reg- lower part of the product. isters HighSum and HighCarry store the upper part of the product. Two registers are necessary since the upper part 6. After the last set of eight partial-products has been pro- is only available in redundant representation. Register Low cessed, the upper part of the result resides within the receives the lower part of the result and register Multipli- accumulator after three cycles and can be loaded into cand contains the actual multiplicand. The register Data the registers HighSum and HighCarry.

is commonly used for storing the Ò bit block of cipher- text/plaintext to become de/encrypted. The registers N1 Whenever the upper part of the product is needed as operand and NegN contain the two precomputed constants for the for the next multiplication, it can be used as the multiplier FastMM algorithm. All seven registers and the accumula- which is allowed to be redundant. The lower part of the product always appears in binary representation; therefore tor are connected by an Ò bit bus to enable parallel register transfers. it might be used as multiplicand or as multiplier. The fol- lowing table summaries the operand requirements and ap- pearance of the product. 4.2 Execution of a simple multiplication

operand representation schedule

In order to explain how the PPM executes a single mul- = ½¼¾4 Ò sequentially,

tiplication, let us assume a modulus length of bit multiplier bin. or red.

Û =½6 and a word-size of Û . This means that each shift opera- bit words tion of the registers results in a 16 bit right-shift of the stored parallel, begin multiplicand binary value. At the beginning of a multiplication, the multiplicand of multiplication resides within the register Multiplicand, and the multiplier lower part sequentially, (which is assumed to be available in redundant representa- binary

of product Û bit words tion) resides within the registers HighSum and HighCarry. upper part parallel, end A single multiplication takes place in the following way: redundant of product of multiplication 1. Within each step, a 16 bit word of the (redundant) mul-

tiplier is shifted out of the registers HighSum and High- The steps needed for a single multiplication depend on the Û Carry, starting with the least significant word. These length of the modulus Ò and the word-size on which the

16 bit words are converted from redundant into binary multiplier operates. The actual RSA­ prototype has a word- =½6 representation by a CLA, which requires three clock size of Û and needs exactly 80 clock cycles for a sin- cycles. gle 1024 bit multiplication.

2. When the 16 bit binary multiplier word reaches the 4.3 Execution of a modular multiplication Booth recoder circuit, it generates the control signals for the PPG. The PPG calculates a set of eight partial- products, which is propagated to the CSA. Booth re- When applying the square and multiply algorithm, a coding and the distribution of the control signals re- modular exponentiation is performed by successive square quires two clock cycles. and multiply steps. For a square step, the result of the previous modular multiplication, which resides within the 3. Within three clock cycles, the CSA reduces the set of register Low, acts as both, multiplicand as well as multi-

eight partial-products to a single redundant number. plier. For a multiply step, the Ò bit block of data to become However, as the CSA is a pipelined circuit, one set of de/encrypted is the multiplicand, and the result of the previ- eight partial-products can be processed each clock cy- ous modular multiplication is the multiplier. cle. According to the FastMM algorithm presented in sec- tion 2, a modular multiplication takes place in the following 4. The output of the CSA is then accumulated to the cur- way: rent intermediate sum within one cycle. Now, the least significant 16 bit of the intermediate sum already rep- 1. For multiplication 1 of the FastMM algorithm, the (re- resent a word of the lower part of the product. dundant) registers High are loaded from register Low and register Multiplicand is either loaded from regis- 5. A CLA is used again to convert the redundant 16 bit ter Low (square step) or from register Data (multiply words of the lower part into binary representation. step). After the multiplication has been performed as

Fold 3

Ò=4 : ::Ò ½ Slices: ¿

Fold 2

Ò=4 ½ :::Ò=¾ Slices: ¿

Fold 1

Peripheral Part

4 :::Ò=¾ ½ Slices: Ò= Control logic + Interface

Fold 0

4 ½ :::¼ Slices: Ò=

...bufferwith one cycle delay

Figure 3. The clocked folding principle.

described in section 4.2, the lower part of the result re- part which includes all Ò bit wide circuits (registers, carry sides within the register Low and the upper part resides save adders, partial-product generators, and the accumula- within the accumulator. tor), and a peripheral part which consists of all the other circuits (CLA, booth recoder, and control logic). Since the 2. For multiplication 2 of the FastMM algorithm, the (re- regular part is about 80% of the total chip area, its layout dundant) registers High are loaded from the accumu- is subject of detailed optimization. In order to exploit the lator and register Multiplicand is loaded from register regularity of the datapath structure, the regular part is built

N1. The multiplication is performed as described in

·¾Û of Ò identical copies of a one bit slice. This slice con- section 4.2, but without shifting the words of the lower sists of seven 1-bit register cells, eight 1-bit adder cells and part result into register Low. Note that only the upper eight 1-bit partial-product generator cells, including a uni- part of the result from multiplication 2 is needed for form inter-cell routing. the next multiplication. Register Low still contains the A slice-based layout for the regular part of the RSA­ crypto lower part result of multiplication 1. chip has two significant advantages:

3. For multiplication 3 of the FastMM algorithm, the (re- 1. The place–and–route procedure needs to be solved dundant) registers High are loaded from the accumu- only for a single slice. lator and register Multiplicand is loaded from register NegN. The multiplication is performed as described in 2. As all bit positions have a uniform layout, the verifica- section 4.2, but the accumulator is initialized with the tion process (parasitics extraction, timing simulation, lower part result of multiplication 1. Thus, multipli- back annotation) is simplified. If a particular timing is cation 3 is performed together with the addition of the verified within a single slice, it is also verified in all FastMM algorithm. The result of the modular multi- other slices. plication resides within register Low after multiplica- The slices are supposed to connect by abutment, as also tion 3. the routing between the slices is uniform. Since the wire A modular multiplication for 1024 bit operands requires length of control signals and inter-slice routing grows with 227 cycles when it is performed on a PPM with a word-size the width of a single slice, narrow slices reduce the total area

demand. Based on a 0.6 CMOS process, the slice width is =½6 of Û .

limited to about 35m with a corresponding slice length of about 1 mm, which results in a datapath layout size of about

5 Floorplanning and clocked folding

£ ½ ¿5 mm. Such an aspect ratio would be unacceptable, because chip packages are most frequently considered for

From the viewpoint of floorplanning, the RSA­ datapath square shapes. Not only packaging requires a fairly square shown in figure 2 can be divided into two parts: a regular shaped chip layout, also the distribution of global signals

Fold 3

Ò=4 : ::Ò ½ Slices: ¿

Fold 2

Ò=4 ½ :::Ò=¾ Slices: ¿

Fold 1

4 :::Ò=¾ ½ Slices: Ò= Control logic + Interface Part Part

Fold 0 Peripheral Peripheral

4 ½ :::¼ Slices: Ò=

...bufferwith one cycle delay

Figure 4. Reconfigured multiplier datapath for CRT-based RSA decryption.

(e.g. control signals, output signals of the booth recoder) is 6 Fast reconfigurable multiplier datapath much easier when the layout has an aspect ratio close to 1, since the corresponding wires are significantly shorter. Also For efficient CRT-based RSA decryption, a number of minimizing the clock skew is very important. Again, delays hardware options to the architecture presented in section 4 due to interconnection must be minimized. are required. As a consequence of the folding technique, In order to fulfill the requirement for a square shaped lay- the regular part of the multiplier is already divided into four

out, a special floorplanning technique termed folding is ap- parts. So it is obvious to add a second irregular part to get

­

¾ Ò=¾ plied to the regular part of the RSA datapath. The 1056 bit two Ò= bit multipliers. Each of this bit multipliers wide regular part is divided into four folds, whereby each consists of two folds and one irregular part. By adding mul- fold consists of 264 slices, as illustrated in figure 3. Note tiplexers to switch between the folds and the irregular parts,

that folding can only be applied if the direction of data sig- the multiplier is reconfigurable to perform either one Ò bit ¾ nal flow is restricted from more to less significant bit posi- exponentiation or two Ò= bit exponentiations in parallel. tions. It is not difficult to see that the components shown in The block diagram in figure 3 represents the multiplier ar- figure 2 can be arranged in a way to meet this restriction. chitecture if non-CRT based RSA decryptions is selected.

But folding has also a serious disadvantage: Transmit- In this case all four folds are needed to execute an Ò bit ting data signals between folds requires long interconnec- exponentiation. Alternatively, the architecture can be split tion wires, therefore buffers have to be inserted to maintain into to two multipliers, each consisting of two folds and the increased capacitive load. The additional delay caused one irregular part, as illustrated in figure 4. Note that the by the buffers and the interconnection wires would compro- multiplexers to switch between CRT-based and non CRT-

mise the overall performance. This pipeline bottleneck can based operating mode are not shown in figure 4. Each of ¾ be removed by inserting buffers with one cycle delay be- the Ò= bit multipliers is connected to the same control sig-

tween the folds. Since the data signal flow is limited from nals, causing them running synchronously, except for the

D ;D É more to less significant bit positions, the architecture allows exponents È controlling the square and multiply se- one cycle delay between subsequent folds. When all data quences (not shown in figure 4). Thus the clocked folding input signals for fold 2 are provided by fold 3, the control technique is also very advantageous for the realization of a signals can also be delayed by one cycle, causing calcu- reconfigurable multiplier datapath. lations taking place in fold 2 to be delayed by one cycle. Beside a second irregular part and a number of multi- Likewise calculations in fold 1 and fold 0 are delayed by plexers, also additional storage is required for CRT-based two and three cycles. The additional delay of three cycles RSA decryption. The precomputed exponents, the precom-

caused by this clocked folding does not compromise over- puted coefficients for Chinese recombination and the prime É all latency very much. But on the other hand, buffers with decomposition È , have to be stored on–chip. Embedded one cycle delay allow much higher clock frequencies than SRAM provides a low area solution for that purpose. Note ordinary buffers. that there is no necessity for additional registers in the mul- tiplier core, as in CRT mode the registers NegN and N1 are [Dhe98] J. F. Dhem. Design of an efficient public-

also split into two parts, whereby one part stores the pre- key cryptographic library for RISC-based smart È

computed constants for ÑÓ d calculations, and the other cards. Thesis (Ph.D.), Universite´ catholique de É part stores the constants for ÑÓ d calculations. Louvain, Louvain-la-Neuve, Belgium, 1998. [Knu69] D. E. Knuth. Seminumerical Algorithms, vol- 7 Summary of results and conclusions ume2ofThe Art of Computer Programming. Addison-Wesley, Reading, MA, USA, 1969. The subject of this paper was to present efficient algo- rithms for modular arithmetic and a multiplier architecture [Kob94] N. Koblitz. A Course in and which is optimized for these algorithms. The prototype of , volume 114 of Graduate texts in mathematics. Springer-Verlag, Berlin, Ger- the RSA­ crypto chip is designed for a modulus length

many, second edition, 1994.

=½¼¾4 Û =½6 of Ò bits and a multiplier word-size of .

Based on a 0.6 standard CMOS process, the silicon area [Mac61] O. L. MacSorley. High-Speed Arithmetic in Bi- ¾ of the multiplier core is about 70 mm and contains approx. nary Computers. Proceedings of the Institute of 1,000,000 transistors, whereby most parts of the chip were Radio Engineers, 49:67–91, 1961. implemented in full custom design methodology. The exe- cution of a 1024 bit modular multiplication needs 227 clock [MPPS95] W. Mayerwieser, K. C. Posch, R. Posch, and cycles. When the multiplier core is clocked with 200 MHz, V. Schindler. Testing a High-Speed Data Path:

this results in a decryption rate of 560 kbit/s. In CRT mode, The Design of the RSA¬ Crypto Chip. J.UCS: the decryption rate increases to 2 Mbit/s. This high perfor- Journal of Universal , 1(11) mance confirms the efficiency of the implemented hardware pp. 728–744, November 1995. algorithms and the multiplier architecture. Furthermore, the proposed design is highly scalable with respect to the mul- [MVV97] A. J. Menezes, P. C. Van Oorschot, and S. A. tiplier word-size as well as the modulus length. The only Vanstone. Handbook of applied cryptography. limiting factor is the available silicon area. A state-of-the- The CRC Press series on and its applications. CRC Press, Boca Raton, art 0.25 CMOS process would allow to increase the multi-

FL, USA, 1997. =¿¾ plier word-size to Û , which doubles the performance. The high regularity of the design makes it attractive for im- [Nat94] National Institute of Standards and Technology plementation in a full custom design methodology. Another (NIST). FIPS Publication 186: Digital Signa- significant advantage of the architecture is its reconfigura- ture Standard. National Institute for Standards bility to execute either one 1024 bit exponentiation or two and Technology, Gaithersburg, MD, USA, May 512 bit exponentiations in parallel. Thus the CRT-based 1994. RSA decryption increases the performance by a factor of 3.5, with only low additional hardware effort. [PP89] K. C. Posch and R. Posch. Approaching encryp- tion at ISDN speed using partial parallel mod- References ulus multiplication. IIG report 276, Institutes for Information Processing Graz, Graz, Austria, November 1989. [Bar87] P. Barrett. Implementing the Rivest, Shamir and Adleman public-key encryption algorithm [QC82] J.-J. Quisquater and C. Couvreur. Fast decipher- on a standard digital signal processor. In A. ment algorithm for RSA public-key cryptosys- M. Odlyzko, editor, Advances in cryptology: tem. IEE Electronics Letters, 18(21), pp. 905– CRYPTO ’86: proceedings, volume 263 of Lec- 907, October 1982. ture Notes in Computer Science, pp 311–323, Springer-Verlag, Berlin, Germany, 1987. [Rab96] J. M. Rabaey. Digital Integrated Circuits - A Design Perspective. Prentice Hall Electronics [BK82] R. P. Brent and H. T. Kung. A regular layout for and VLSI Series, Prentice Hall, Upper Saddle parallel adders. IEEE Transactions on Comput- River, NJ, USA, 1996. ers, C-31(3), pp. 260-264, 1982. [RSA78] R. L. Rivest, A. Shamir, and L. Adleman. A [DH76] W. Diffie and M. E. Hellman. New Directions in Method for Obtaining Digital Signatures and Cryptography. IEEE Transactions on Informa- Public Key Cryptosystems. Communications tion Theory, IT-22(6) pp. 644–654, November of the Association for Computing Machinery, 1976. 21(2) pp. 120–126, February 1978. [Sch96] V. Schindler. A low-power true single phase clocked (TSPC) full-adder. Proceedings of the 22nd ESSCIRC, Neuchatel,ˆ Switzerland, pp. 72–75, 1996. [SV93] M. Shand and J. Vuillemin. Fast Implementa- tion of RSA Cryptography. Proceedings of 11th Symposion on Computer Arithmetic, 1993. [Wal64] C. S. Wallace. A suggestion for a fast multiplier. IEEE Transactions on Electronic Computation, EC-13(1), pp. 14–17, 1964.

[YS89] J. Yuan and C. Svensson. High-speed CMOS circuit technique. IEEE Journal of Solid-State Circuits, 24(1), pp. 62–70, 1989.