Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE 2013)

Improvements of RSA for hardware encryption implementation based on FPGA

Li Youguo, Department of Computer Science, Xinyang Agricultural College, Xinyang, Henan, China

Abstract: In the field of information security, the encryption and decryption operations of the RSA cryptographic algorithm are both modular exponentiations. The direct way to implement them in software is to multiply $M$ by itself $e$ times with iteration statements and then take the result modulo $n$. This is feasible only when $M$, $e$ and $n$ are relatively small. To keep RSA secure, however, $M$, $e$ and $n$ must be at least 512 bits long. At that size a software implementation is slow, and its intermediate results take up a large amount of temporary storage, so software implementation is difficult. In contrast, hardware encryption offers high security, high speed and good real-time behavior. FPGA-based RSA hardware encryption and decryption is currently a new research direction, and this article presents an improvement of the existing low-radix Montgomery algorithm.

Keywords: Montgomery modular multiplication, exclusive-or arithmetic unit, RSA algorithm, shift arithmetic unit

I. INTRODUCTION TO HARDWARE COMPUTING IN RSA ENCRYPTION ALGORITHM

RSA security rests on the presumed difficulty of factoring large integers. The difficulty is only presumed because it has not been proved that integer factorization is intractable, so a polynomial-time factoring algorithm may exist that has not yet been found [1]. Factorization techniques available in 1995 suggested that breaking a 512-bit key would cost more than 1 million dollars and take about 10 months. In an attack experiment in 1999, factoring one specific 512-bit key took 7 months. The risk of key attacks grows mainly with the steady increase in computing power and with improvements in factorization algorithms. RSA with a key length between 1024 and 2048 bits is estimated to remain secure for a considerable time. RSA's creators also required $p$ and $q$ to have similar bit lengths, and the modulus $n$ itself could not be a […].

II. MATHEMATICAL FOUNDATIONS OF RSA ALGORITHM

A. Repeated modular squaring algorithm

In modular arithmetic we often need values of the form $m^e \bmod n$. When $e$ is relatively small, we can work out $m, m^2, m^3, \ldots, m^e \bmod n$ in turn. When $e$ is large this is unrealistic: the amount of calculation and the size of the intermediate data become enormous, which defeats the purpose of fast encryption and decryption.

In practice we use the repeated modular squaring algorithm, which reduces the number of modular multiplications needed for $m^e \bmod n$ to at most $2k$, where $k \approx \log_2 e$ is the bit length of $e$. If $|n| = l$, the complexity of $m^e \bmod n$ is $O(kl^2)$. Suppose the binary representation of $e$ is $(e_{k-1} \ldots e_1 e_0)_2$, namely

$$e = \sum_{i=0}^{k-1} e_i 2^i.$$

1) Compute in turn:

$$y_0 = m, \qquad y_i = y_{i-1}^2 \bmod n, \quad i = 1, 2, 3, \ldots, k-1.$$

2) Compute in turn:

$$x_0 = \begin{cases} y_0, & e_0 = 1 \\ 1, & e_0 = 0 \end{cases} \qquad x_i = \begin{cases} x_{i-1}\, y_i \bmod n, & e_i = 1 \\ x_{i-1}, & e_i = 0 \end{cases} \quad i = 1, 2, \ldots, k-1.$$

3) The value $x_{k-1}$ obtained at the end is the required $m^e \bmod n$.

B. Euler function

For a positive integer $n$, the Euler function $\varphi(n)$ is the number of integers $m$ with $0 \le m < n$ that are relatively prime to $n$, namely $\varphi(n) = |\{0 \le m < n \mid \gcd(m, n) = 1\}|$.

If $p$ is a prime number, then $\varphi(p) = p - 1$, or $\varphi(p) + 1 = p$.

If $p$ and $q$ are distinct prime numbers and $n = pq$, then

Published by Atlantis Press, Paris, France. © the authors

$$\varphi(n) = \varphi(p \cdot q) = \varphi(p)\,\varphi(q) = (p - 1)(q - 1).$$

C. Euler Theorem

For any positive integers $a$ and $n$, if $\gcd(a, n) = 1$, then $a^{\varphi(n)} \equiv 1 \pmod{n}$.

RSA is a block cipher driven by a secret key. The encryption and decryption process shows that the secret key, the plaintext and the ciphertext each correspond to a number, and that encryption and decryption are mathematical operations on these numbers. The RSA scheme uses an exponential expression. The plaintext is encrypted block by block, where each block is a binary value less than a number $n$. That is, the number of bits in each block should be less than or equal to $\log_2 n$.

D. Seeking the greatest common divisor

The Euclidean algorithm computes the greatest common divisor and is also known as the method of successive division.

Theorem: for any nonnegative integers $a$ and $b$, with $r_0 = a$ and $r_1 = b$, division with remainder gives $\gcd(a, b) = \gcd(b, a \bmod b)$.

Divide successively:

$$\begin{aligned}
r_0 &= q_1 r_1 + r_2, & 0 &\le r_2 < r_1 \\
r_1 &= q_2 r_2 + r_3, & 0 &\le r_3 < r_2 \\
&\;\;\vdots \\
r_{n-2} &= q_{n-1} r_{n-1} + r_n, & 0 &\le r_n < r_{n-1} \\
r_{n-1} &= q_n r_n + r_{n+1}, & r_{n+1} &= 0
\end{aligned}$$

In each step the divisor is divided by the remainder of the previous step, so the operands shrink steadily. After a finite number of steps $n$ we must reach $r_{n+1} = 0$. The last nonzero remainder, $r_n$, is $\gcd(a, b)$.

The encryption key $e$ can be any number co-prime to $\varphi(n)$. To generate it, select a number $e$ and check whether it is co-prime to $\varphi(n)$. If not, add 1 to $e$ and check again, until $e$ is co-prime to $\varphi(n)$. Two numbers are co-prime exactly when their gcd is 1. By the Euclidean algorithm, if $a = bk + c$ for an integer quotient $k$, then $\gcd(a, b) = \gcd(b, a \bmod b) = \gcd(b, c)$, so the method of successive division can be used to compute the greatest common divisor.

III. FAST MODULAR EXPONENTIATION

In the RSA algorithm, key bits 0 and 1 correspond to different sequences of modular multiplication and modular squaring, so information about the secret key is easy to obtain by power analysis. The most commonly used algorithm for modular exponentiation of large numbers is the BR (Binary Representation) algorithm, which decomposes a modular exponentiation into a series of modular multiplications [3]. First, $e$ is written in binary as $(e_{k-1} \ldots e_1 e_0)_2$, namely

$$e = \sum_{i=0}^{k-1} e_i 2^i.$$

This is then substituted into the modular exponentiation formula $m^e \bmod n$. Since $(A \cdot B) \bmod C = ((A \bmod C) \cdot B) \bmod C$, modular exponentiation of large numbers decomposes into a series of modular multiplications of large numbers:

$$m^e \bmod n = \Big(\big(\cdots\big((m^{e_{k-1}} \bmod n)^2 \cdot m^{e_{k-2}} \bmod n\big)^2 \cdots\big)^2 \cdot m^{e_0}\Big) \bmod n.$$

This simplifies the algorithm. However, the binary form of $e$ is very long, and computing with it takes many scans (iterations). The method is therefore inefficient and consumes considerable resources, so modular exponentiation needs to be sped up. In a hardware implementation (the binary scanning algorithm), modular exponentiation reduces to repeated modular squaring and modular multiplication. Depending on the direction in which the exponent $e$ is scanned, the binary scanning algorithm comes in two forms: the H algorithm, which scans from the high-order digit to the low-order digit, and the L algorithm, which scans from the low-order digit to the high-order digit.

The H algorithm needs one register to hold the intermediate result $C$, plus a modular multiplier; $m$ need not be stored because it can be fed in directly from the input port. In terms of registers, the two algorithms differ by only one register.

In the H algorithm the third step needs the value produced by the second step, so the two modular multiplication modules must run sequentially. In the L algorithm the two steps are independent of each other and can run simultaneously. The L scheme is therefore faster than the H scheme. If $e$ has $n$ binary digits, both algorithms need $n - 1$ modular squarings and about $(n - 1)/2$ modular multiplications.

Modular squaring and modular multiplication are serial operations, which limits computing efficiency and leaves room for improvement. The two operations can be computed in advance and saved in registers, then executed in parallel and called when needed, at the cost of one more register. Weighing chip area against computing speed, we use the H algorithm for modular exponentiation because it saves hardware area. Each digit of $e$ is scanned from the high-order digit to the low-order digit to check whether the bit $e_i$ is 1. In the subsequent modular exponentiation, the modular squaring and modular multiplication of this algorithm are carried out in parallel to improve computing efficiency.
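The key-generation check from Section II.D (method of successive division, then incrementing $e$ until it is co-prime to $\varphi(n)$) can be sketched as follows. This is a minimal illustration, not the paper's hardware design. The helper name `choose_public_exponent`, the starting value 3 and the toy primes 61 and 53 are assumptions made for the example.

```python
def gcd(a, b):
    """Method of successive division: gcd(a, b) = gcd(b, a mod b)."""
    while b != 0:
        a, b = b, a % b
    return a  # last nonzero remainder


def choose_public_exponent(phi, start=3):
    """Hypothetical helper: add 1 to e until gcd(e, phi) == 1."""
    e = start
    while gcd(e, phi) != 1:
        e += 1
    return e


# Toy primes chosen only for illustration; far too small for real use.
p, q = 61, 53
n = p * q
phi = (p - 1) * (q - 1)          # Euler function of n = p * q
e = choose_public_exponent(phi)

# Euler theorem check: a^phi(n) = 1 (mod n) whenever gcd(a, n) = 1.
a = 2
euler_holds = gcd(a, n) == 1 and pow(a, phi, n) == 1
```

With these toy values the search skips every candidate that shares a factor with $\varphi(n)$ and stops at the first co-prime one.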

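The two scanning directions of Section III can be compared in software with the sketch below. It assumes Python integers stand in for the hardware registers, and the function names `modexp_l` and `modexp_h` are made up for this example. `modexp_l` follows the $y_i$, $x_i$ recurrences of Section II.A, and `modexp_h` follows the nested-squaring form of the BR formula.

```python
def modexp_l(m, e, n):
    """L algorithm: scan the bits of e from low order to high order."""
    x = 1          # running product of the selected squares (x_i)
    y = m % n      # y_i = m^(2^i) mod n
    while e > 0:
        if e & 1:
            x = (x * y) % n   # multiply step, independent of the squaring
        y = (y * y) % n       # squaring step
        e >>= 1
    return x


def modexp_h(m, e, n):
    """H algorithm: scan the bits of e from high order to low order."""
    c = 1                     # the single intermediate register C
    for bit in bin(e)[2:]:
        c = (c * c) % n       # square first
        if bit == "1":
            c = (c * m) % n   # then multiply, using the squared value
    return c
```

In `modexp_l` the multiply and the squaring of one iteration read only values from the previous iteration, which is why they can run side by side in hardware. In `modexp_h` the multiply depends on the squaring that precedes it, but only `c` has to be stored.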

IV. THE IMPROVED MONTGOMERY MODULAR MULTIPLICATION ALGORITHM

The speed of modular multiplication is determined by the speed of the modular reduction, which is the remainder of a division. For large numbers, in software or in hardware, division takes much longer than addition, subtraction or multiplication. If division is avoided, used less, or replaced by other methods, modular multiplication becomes much faster, and RSA processes plaintexts and ciphertexts correspondingly faster. The common methods for modular multiplication of large numbers are [4]: division after multiplication, the Blakley algorithm, the Brickell algorithm, the Barrett algorithm and the Montgomery algorithm.

A. The original Montgomery algorithm

The Montgomery algorithm offers relatively fast modular multiplication and short encryption and decryption times. In 1985, Peter Montgomery proposed an algorithm in which the division needed for modular multiplication reduces to right shifts; it later became known as the Montgomery modular multiplication algorithm. Its principle [5] is to convert the multiplication of two numbers into a multiplication of residue classes modulo $n$, and then to replace reduction modulo $n$ by reduction modulo another modulus $r$ ($r > n$), where $r$ is chosen as a power of 2. Division by the modulus $n$ thus becomes division by a power of 2, which in the binary arithmetic of computers and electronic circuits is a simple shift operation, much faster than division by large numbers. This idea is the core of the Montgomery algorithm, and almost all later improvements build on it. For a single modular multiplication the method is not especially efficient, but its advantage shows when it is used for repeated modular multiplications.

The original Montgomery algorithm, which computes $ABS^{-1} \bmod N$, can be described as follows. Select the radix $S$ and the modulus $N$ ($N > 1$, $S > N$) such that $\gcd(S, N) = 1$; for any integer $T$ that meets $0 \le T = AB$ […]

Return R;

To some extent, the original Montgomery algorithm makes modular multiplication of large numbers easier. It needs $k \times k$-bit multiplication and $2k$-bit addition, and the intermediate results can reach $2k$ bits. In addition, every modular multiplication involves multiplications of large numbers such as $T = AB$, $TN^{-1}$ and $pN$. When $A$, $B$ and $N$ are longer than 1024 bits, the amount of data to store and process grows, so encrypting and decrypting directly with the Montgomery algorithm demands more hardware resources. Further improvements to the Montgomery algorithm are therefore needed.

B. Radix 2-Montgomery algorithm and its simplified algorithm

Hardware encryption is judged by both area and speed, so appropriate optimization is necessary. The original Montgomery algorithm can be optimized roughly along the following lines: 1) calculate only part of the value in each step, instead of computing the value of $R$ all at once as in the original algorithm; 2) select an appropriate radix $S$ ($S = 2^k$) so that the corresponding division is realized by a shift operation alone.

Depending on the size of $S$, the optimized algorithms fall into two categories. One is the high-radix optimized algorithm, with $S = 2^k$, which optimizes for speed. The other is the binary Montgomery modular multiplication algorithm, with $S = 2$, which optimizes for area. The choice between the two depends on which performance indexes matter most in the project at hand. The secure wireless serial design considered here processes only small amounts of data. Taking the cost to the user into account, the radix-2 algorithm, which optimizes circuit area, is preferred. The simplified radix-2 Montgomery algorithm is obtained by modifying it in the following 3 points:

(1) Improvement of the initial value in the algorithm. The radix-2 Montgomery algorithm is introduced to seek $p_i$ so […]


subtraction operation here clearly adds operational steps and hardware resource consumption, and it also lowers operation efficiency. The subtraction therefore needs to be replaced. For this, the multiplicand $A$ and the multiplier $B$ must both be less than $2N$, with $a_n = 0$, $b_{n+1} = 0$ and $n_{n+1} = n_n = 0$. The result then satisfies $R < 2N$. Adding the bit $a_n$ adds one more loop cycle, and adding $b_{n+1}$, $n_{n+1}$ and $n_n$ provides room to store the intermediate values of $R$ during the operation.

(3) Compression of the number of additions in the loop. The key operation in the Montgomery algorithm is the computation of $R$. The current values of $a_i$ and $p_i$ are used to compute $a_i \times b_j + p_i \times N$ and then the new value of $R$. The addition in this formula is processed once $a_i$ and $b_j$ are known, handling only one radix-2 bit at a time. In each cycle the computation of $R$ consists of one selection and one addition, which reduces the number of additions in the modular multiplication.

The simplified radix-2 Montgomery algorithm keeps the results correct while improving operation efficiency. The pseudocode can be expressed as follows:

    Simpled-Monts_2(A, B, N) = A × B × 2^(-n-1) mod N
    A = (a_n a_(n-1) a_(n-2) … a_2 a_1 a_0)_2,          a_n = 0
    B = (b_(n+1) b_n b_(n-1) b_(n-2) … b_2 b_1 b_0)_2,  b_(n+1) = b_0 = 0
    N = (n_(n+1) n_n n_(n-1) n_(n-2) … n_2 n_1 n_0)_2,  n_0 = 1, n_(n+1) = n_n = 0
    T_i = b_i + n_i;
    R_0 = 0;
    for (i = 0; i <= n; i++)
    {
        p_i = (R_0 + a_i b_0) mod 2;
        for (j = 0; j <= n+1; j++)
        {
            switch (a_i, p_i)
            {
                1 and 1: mux[j] ← T_j;
                1 and 0: mux[j] ← b_j;
                0 and 1: mux[j] ← n_j;
                0 and 0: mux[j] ← 0;
            }
            R_(j-1) = R_j + mux[j];
        }
    }
    output: A × B × 2^(-n-1) mod N

Here the values of $a_i$ and $p_i$ drive the select inputs of the selector, which realizes the multiple-input, single-output selection. Shifting $B$ left by one bit avoids the addition and multiplication at the third step, and the two extra cycles remove the comparison and subtraction operations.

V. CONCLUSION

The security of RSA encryption and decryption is based on the difficulty of factoring large numbers. At present, the bottleneck of the encryption and decryption algorithm is the slow modular exponentiation of large numbers, so improving the speed of modular exponentiation is the key question of this study. This article proposes an improved radix-2 Montgomery modular multiplication algorithm based on the existing radix-2 Montgomery modular multiplication algorithm. Modular exponentiation is converted into several modular multiplications, and storage freed in the modular multiplication can be reused to save hardware resources.

REFERENCES

[1] Hu Lei. Handbook of Applied Cryptography. Beijing: Publishing House of Electronics Industry, 2005, pp. 1-17.
[2] Zhang Huanguo, Wang Zhangyi. A General Introduction to Cryptography. Wuhan: Wuhan Press, 2009, pp. 20-35.
[3] Yang Bo. Modern Cryptography. Beijing: Tsinghua University Press, 2005, pp. 11-14.
[4] Zhou Yujie. Public Key Cryptographic Algorithms and Its Fast Implementation. Beijing: National Defence Industry Press, 2002, pp. 98-102.
[5] Chen Fenglin. The Improvement of Montgomery Algorithm and Its Application in RSA. Computer Applications and Software, 2006.6.
[6] Montgomery P. L. Modular Multiplication without Trial Division. Mathematics of Computation, 1985.
