The Chinese Remainder Theorem and Its Application in a High-Speed

The Chinese Remainder Theorem and its Application in a High-Speed RSA Crypto Chip£ Johann Großschädl Graz University of Technology Institute for Applied Information Processing and Communications Inffeldgasse 16a, A–8010 Graz, Austria [email protected] Abstract product of two prime numbers. This allows to utilize the Chinese Remainder Theorem (CRT) in order to speed up the The performance of RSA hardware is primarily deter- private key operations. From a mathematical point of view, mined by an efficient implementation of the long integer the usage of the CRT for RSA decryption is well known. modular arithmetic and the ability to utilize the Chinese But for a hardware implementation, a special multiplier ar- Remainder Theorem (CRT) for the private key operations. chitecture is necessary to meet the requirements for efficient This paper presents the multiplier architecture of the RSA CRT-based decryption. crypto chip, a high-speed hardware accelerator for long in- This paper presents the basic algorithmic and architec- teger modular arithmetic. The RSA multiplier datapath is tural concepts of the RSA crypto chip, and describes how reconfigurable to execute either one 1024 bit modular expo- they were combined to provide optimum performance. A nentiation or two 512 bit modular exponentiations in paral- special focus is directed towards the implementation of lel. Another significant characteristic of the multiplier core the CRT-based decryption. The major design goal with is its high degree of parallelism. The actual RSA prototype the RSA was the maximization of performance on se- contains a 1056£16 bit word-serial multiplier which is op- veral levels, including the implemented hardware algo- timized for modular multiplications according to Barret’s rithms, the multiplier architecture, and the VLSI circuit modular reduction method. The multiplier core is dimen- technique. The RSA crypto chip applies the FastMM al- sioned for a clock frequency of 200 MHz and requires 227 gorithm [MPPS95] for modular multiplication. FastMM clock cycles for a single 1024 bit modular multiplication. algorithm is based on Barret’s modular reduction method Pipelining in the highly parallel long integer unit allows to [Bar87] and performs a modular multiplication by three achieve a decryption rate of 560 kbit/s for a 1024 bit ex- simple multiplications and one addition. The division- ponent. In CRT-mode, the multiplier executes two 512 bit free property of Barret’s modular reduction method makes modular exponentiations in parallel, which increases the FastMM algorithm very attractive for hardware implemen- decryption rate by a factor of 3.5 to almost 2 Mbit/s. tation. ½¼56 £ ½6 The core of the RSA crypto chip is its bit word-serial multiplier, which schedules the multiplicand 1. Introduction fully parallel and the multiplier in 16 bit words. As a conse- quence of this direct processing of long integers, the com- Many popular crypto-systems like the RSA encryption putation time for an Ò bit modular exponentiation is dra- scheme [RSA78], the Diffie-Hellman (DH) key agreement matically reduced. Furthermore, Booth recoding [Mac61] is scheme [DH76], or the Digital Signature Algorithm (DSA) implemented to halve the number of partial-products. Thus [Nat94] are based on long integer modular exponentiation. the multiplication speed is almost doubled with low addi- A major difference between the RSA scheme and crypto- tional hardware effort. In addition, pipelining significantly systems based on the discrete logarithm problem is the fact increases the throughput in RSA en- and decryption. Due that the modulus used in the RSA encryption scheme is the to the high degree of parallelism in the multiplier core, the £ RSA crypto chip executes a 1024 bit modular multiplica- The work described in this paper was supported by the Austrian Science Foundation (FWF) under grant number P12596–INF “Hochge- tion in 227 clock cycles. The multiplier core is dimensioned schwindigkeits-Langzahlen-Multiplizierer-Chip”. for a clock frequency of 200 MHz and designed in a full- Ò=¾ Ò E custom methodology, which results in a very small silicon ¿ modular multiplications for an bit exponent , as- area. suming the exponent contains roughly 50% ones. There- By taking advantage of the Chinese Remainder Theorem fore, the efficient implementation of the modular multipli- (CRT), the computational effort of the RSA decryption can cation is the key to high performance. be reduced significantly. If the two prime numbers È and Æ É of the modulus are known, it is possible to calculate 2.2 Barret’s modular reduction method D = C ÑÓ d Æ the modular exponentiation Å separately È ÑÓ d É ÑÓ d and with shorter exponents, as described in In 1987, Paul Barret introduced an algorithm for the [QC82] and [SV93]. Since the length of the exponent is modulo reduction operation which he used to implement ¾ ¿Ò=4 about Ò= , approximately modular multiplications are RSA encryption on a digital signal processor [Bar87]. At necessary for a single modular exponentiation. The RSA a first glance, a modular reduction is simply the computa- crypto chip is able to compute two exponentiations in par- tion of the remainder of an integer division : allel, as the 1024 bit multiplier core can be split into two 512 bit multipliers. Running the two 512 bit multipliers Z Z Z ÑÓ d Æ = Z Õ = = Z ÕÆ in parallel allows both CRT related exponentiations to be Æ with (1) Æ computed simultaneously. Compared to the non-CRT based Æ RSA decryption performed on an Ò bit hardware, utilizing But, compared to other operations, even to the multiplica- the CRT results in a speed-up factor of approximately 3.5. tion, a division is very costly to implement in hardware. The rest of the paper is structured as follows: In the next Barret’s basic idea was to replace the division with a mul- section, the implemented algorithms for exponentiation and tiplication by a precomputed constant which approximates modular multiplication are presented. Section 3 focusses on the inverse of the modulus. Thus the calculation of the ex- ¤ ¥ Z = the Chinese Remainder Theorem and explains how it can be act quotient Õ is avoided by computing the quotient Æ Õ used to speed up the RSA decryption. Section 4 presents the e instead: kj k j 6 7 ¾Ò 6 7 ¾ architecture of the RSA multiplier core and describes the Z Ò ½ 6 7 ¾ Æ 4 execution of a simple multiplication. In section 5, imple- 5 Õ = e (2) Ò·½ mentation problems like floorplanning and clock distribu- ¾ tion are discussed. This topic is carried on in section 6, and it is detailed how the multiplier architecture was adapted to Although equation (2) may look complicated, it can be cal- Ò ½ comply with the requirements for efficient CRT-based RSA culated very efficiently, because the divisions by ¾ or Ò·½ decryption. The paper finishes with a summary of results ¾ , respectively, are simply performed by truncating the ½ Ò ·½ and conclusions in section 7. least significant Ò or bits of the operands. The ex- ¤ ¥ ¾Ò ¾ =Æ pression depends on the modulus Æ only and is constant as long as the modulus does not change. This con- 2 Implemented algorithms stant can be precomputed, whereby the modular reduction operation is reduced to two simple multiplications and some In order to develop high-speed RSA hardware, we not operand truncations. only need good and efficient algorithms for modular arith- When calculating a modular reduction according to Bar- metic, but also a multiplier architecture which has to be op- ret’s method, the result may not be fully reduced, but it is ¿Æ ½ timized for those algorithms. This section presents hard- always in the range ¼ to . Therefore, one or two sub- ware algorithms for exponentiation, modular reduction and tractions of Æ could be necessary to get the exact result. modular multiplication. 2.3 FastMM algorithm 2.1 Binary exponentiation method A closer look at Barret’s algorithm shows that trunca- ½ Ò ·½ When applying the binary exponentiation method (also tion of operands at Ò or bit borders are necessary. known as square and multiply algorithm), a modular expo- For reasons of regularity, it would be advantageous to ap- E ÑÓ d Æ Û nentiation C is performed by successive modular ply truncations only at multiples of the word-size of the multiplications [Knu69]. The MSB to LSB version of the multiplier hardware (usually 16 or 32 bit) rather than at the algorithm is frequently preferred against the LSB to MSB original bit positions. Therefore, we modified Barret’s algo- one, because the latter requires storage of an additional in- rithm in order to apply the truncations only at multiples of termediate result. Modular reduction after each multiplica- Û , whereby these truncations can be performed by succes- tion step avoids the exponential growth in size of the inter- sive Û bit right-shift operations. In the modified Barret re- mediate results. The square and multiply algorithm needs duction, denoted as FastMM algorithm [MPPS95], the quo- 9 Z = A[Ò ·¾Û ½ :: ¼] ¡ B [Ò ·¾Û ½ :: ¼] > > = É = Z [¾Ò · Û ½ :: Ò Û ] ¡ Æ ½[Ò ·¾Û ½ :: ¼] e ¾f¼; ½g = A ¡ B ÑÓ d Æ · eÆ X with ÆegÊ = É[¾Ò ·4Û ½ :: Ò ·¾Û ] ¡ ÆegÆ[Ò ·¾Û ½ :: ¼] > > ; X = Z [Ò ·¾Û ½ :: ¼] · ÆegÊ[Ò ·¾Û ½ :: ¼] Figure 1. Modular multiplication according to the FastMM algorithm. Õ tient e is calculated as follows: 3 Chinese Remainder Theorem j kj k 7 6 ¾Ò·Û 7 6 Z ¾ Ò Û 7 6 D ¾ Æ Å = C ÑÓ d Æ 5 4 The complexity of the RSA decryption Õ = e (3) Ò·¾Û ¾ Æ depends directly on the size of D and . The decryption exponent D specifies the numbers of modular multiplications The FastMM algorithm combines multiplication and modi- necessary to perform the exponentiation, and the modulus fied Barret reduction to implement the modular multiplica- Æ determines the size of the intermediate results.

Load more