Efficient Parallel Encryption/Decryption Information Algorithm

Erick Fredj Department of Computer Sciences, Jerusalem College of Technology (Machon Lev) 21 Havaad Haleumi St., P.O.B. 16031, 91160 Jerusalem, Israel

Abstract: This paper deals with the parallel implementation of the RSA algorithm for encryption and decryption on a network of workstations. We present a new algorithm based on a residue number system (RNS) and a hybrid of Montgomery’s method. Residue number systems provide a good means for arithmetic on extremely long integers, and their carry-free operations make parallel implementations feasible. This paper shows a new combination of RNS with modulo reduction methods. The algorithm complexity is of the order of O(n), with n denoting the amount of data.

Key-Words: Computer Arithmetic, Cryptography, Modular Multiplication, Residue Number System Arithmetic, Text Decryption, Text Encryption, Text Security

1 Introduction

Over the last years, concerns about the lack of security online and the potential loss of privacy have prevented many computer users from realizing the full potential of the Internet. Encryption systems, which scramble electronic communications and information [i][ii], allow users to communicate text on the Internet with confidence, knowing their security and privacy are protected. The commonly used Rivest, Shamir and Adleman (RSA) [iii],[iv] solution provides enhanced security. The RSA method is based on a series of operations involving very large integers: whole numbers usually at least 300 digits long. An ordinary 1024-bit RSA decryption involves about 3,000 multiplications and divisions with 310-digit numbers. The RSA algorithm has become widely used in industry and academia, so fast implementations are in high demand. There exist several ways of speeding up RSA [v]:

• Optimization of the sequential algorithm using special-purpose hardware,
• Faster clock rates,
• Parallel computers and algorithms [vi].

This article focuses on the last option, parallel computing, which seems to provide the greatest potential for speedup over the long term.

1.1 How RSA works?

Suppose that we have some plain-text message that we wish to encrypt, and the message is $M_1 M_2 \ldots M_n$. RSA encryption and decryption work on one letter at a time, so we are going to deal with a single letter $M_i$ from this sequence. RSA is an example of a public/private key cryptographic system. In such a system, encryption is done with a public key and decryption with a private key. RSA encryption relies on two numbers $N$ and $e$, so the public key is simply the set $\{N, e\}$. Similarly, the private key is the set $\{N, d\}$, since decryption relies on these two numbers.

1.2 Public and private keys

The public key $\{N, e\}$ consists of an exponent $e$ and a modulus $N$, and the encryption operation transforms a message $M < N$ into a cipher text $C = M^e \bmod N$. The private key $\{N, d\}$ also consists of an exponent $d$ and a modulus, and the decryption operation $M = C^d \bmod N$ converts the cipher text into the original message.

The modulus $N$ is a product of two large prime numbers $p$ and $q$. Since $e$ is usually small (on the order of 3), the public-key operations are relatively fast (about $O(n^2)$ operations, where $n$ is the size of the modulus). The private exponent $d$ is of the same order as $N$, so private-key operations require $O(n^3)$ time. Because of the huge speed gap between private-key and public-key operations, I will focus on speeding up private-key operations. The function that is most important for the actual encryption and decryption process is the modular exponentiation function. Consider trying to evaluate the expression $5^{1,000,000,000} \pmod{22}$. Attempting to calculate this directly using the built-in function of C++ or any other language will overflow when we raise 5 to such a large exponent. It would be possible to write a class to represent large integers; however, since the answer lies in the range [0, 21], we should be able to compute the answer without such a class. The basic modular exponentiation algorithm used by RSA loops over each bit $b_i$ of the exponent $b$, so it needs only about $\log_2 b$ iterations. Given a message $M$, a modulus $N$, and an exponent $b$ (written $a$, $m$, and $b$ in the table), the basic modular exponentiation algorithm used by RSA is shown in Table 1a.

Step 1: Find the base-2 representation of the exponent, $b = \sum_{i=0}^{n-1} b_i 2^i$.

Step 2: $ans := 1$; $T_0 := a$

Step 3: for $i := 0$ to $n - 1$, where $n = \mathrm{sizeof}(b) \times 8$ is the total number of bits (8 bits per byte):

Step 4: Case 1: $b_i = (b \gg i) \,\&\, 1 = 0$. In this case the factor $(a^{2^i})^{b_i} = 1$, so $ans$ is left unchanged.

Step 5: Case 2: $b_i = (b \gg i) \,\&\, 1 = 1$. In this case the factor $(a^{2^i})^{b_i} = a^{2^i} = T_i$, so $ans := ans \cdot T_i \bmod m$.

In both cases the next power is obtained by squaring, $T_{i+1} := T_i^2 \bmod m$.

Table 1a. The basic modular exponentiation algorithm for evaluating the expression $a^b \bmod m$.
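The loop of Table 1a can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper; the function name `mod_exp` and the toy key (N = 3233 = 61 × 53, e = 17, d = 2753) are assumptions chosen only for the example, and such a key is far too small for real use.

```python
def mod_exp(a: int, b: int, m: int) -> int:
    """Return a**b mod m by scanning the bits of b from low order to high order."""
    ans = 1 % m      # running product; "% m" covers the corner case m == 1
    t = a % m        # T_0 = a
    while b > 0:
        if b & 1:                # current bit is 1: fold T_i into the answer
            ans = (ans * t) % m
        t = (t * t) % m          # T_{i+1} = T_i^2 mod m
        b >>= 1                  # move on to the next bit
    return ans


# Hypothetical toy RSA key, for illustration only.
N, e, d = 3233, 17, 2753
M = 65
C = mod_exp(M, e, N)             # encryption: C = M^e mod N
assert mod_exp(C, d, N) == M     # decryption: M = C^d mod N

print(mod_exp(5, 1_000_000_000, 22))   # prints 1
```

No intermediate value ever exceeds $m^2$, which is why the expression $5^{1,000,000,000} \bmod 22$ can be evaluated without a large-integer class in a language with fixed-width integers.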

From Table 1a, the final value of ans is the result of $a^b \bmod m$. There does not seem to be any way to perform modular exponentiation faster than the above method. As a result, the main loop of RSA appears to be inherently sequential. We gain only a factor of about 3 by combining squaring in parallel with multiplication and exponentiation modulo the two relatively prime factors.

1.3 Opportunities to Parallelize the RSA Algorithm

There exist four different approaches to parallelizing the basic RSA algorithm.

• If there is a sequence of messages to be decrypted, each of these operations may be performed independently on a different processor. This does not speed up the elapsed time for a single private-key operation, but we expect it to speed up the overall performance.
• The squaring step can be performed in parallel with the multiplication step of the previous iteration, saving some 33%. This is only possible with the loop running from low-order exponent bits to high-order. Doing so removes some inherent parallelism.
• For private-key operations, we may assume that the factors $p$ and $q$ of $N$ are known. The modular exponentiation can be performed separately and in parallel mod $p$ and mod $q$, and then the two results can be combined by the Chinese remainder theorem.
• Finally, the multiplications inside the loop (steps 4 and 5) can themselves be performed, and their results summed, in parallel. Such schemes do not appear competitive until the numbers used are very large. Moreover, parallel multiplication is not practical if the communication overhead between processors is much larger than the multiplication time.
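The third approach, exponentiation mod $p$ and mod $q$ followed by a Chinese remainder recombination, can be sketched as follows. This is a hedged illustration, not the paper's code: the function name `rsa_decrypt_crt` is hypothetical, the recombination uses Garner's formula, and the thread pool only shows that the two half-size exponentiations are independent (threads in CPython will not give a real speedup for this arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor


def rsa_decrypt_crt(c: int, p: int, q: int, d: int) -> int:
    """Compute c**d mod (p*q) from two independent exponentiations mod p and mod q."""
    d_p = d % (p - 1)            # reduced exponent for the mod-p half
    d_q = d % (q - 1)            # reduced exponent for the mod-q half
    q_inv = pow(q, -1, p)        # q^{-1} mod p (needs Python 3.8+)

    # The two exponentiations share no data, so they can be dispatched
    # to different workers (different processors in the paper's setting).
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_p = pool.submit(pow, c, d_p, p)
        f_q = pool.submit(pow, c, d_q, q)
        m_p, m_q = f_p.result(), f_q.result()

    # Garner recombination: the unique value below p*q with the right residues.
    h = (q_inv * (m_p - m_q)) % p
    return m_q + h * q


# Hypothetical toy key, for illustration only: N = 61 * 53, e = 17, d = 2753.
cipher = pow(65, 17, 61 * 53)
print(rsa_decrypt_crt(cipher, 61, 53, 2753))   # prints 65
```

Each half works with numbers of about half the length of $N$, which is where the gain of this approach comes from even before any parallel execution.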

Figure 1. Parallel RSA implementation model. The message is split into blocks $M_0, M_1, M_2, \ldots, M_m$; each block is handled by its own RSA unit in three stages: Binary to RNS, RNS modular multiplication, and RNS to Binary.
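The three stages named in Figure 1 can be illustrated with a small Python sketch. It is a simplified assumption-laden example: the moduli in `MODULI` and the helper names are hypothetical, the multiplication shown is a plain product that must stay below $M$ (it is not yet reduced modulo the RSA modulus), and the conversion back to binary uses the textbook Chinese remainder formula rather than any particular conversion circuit.

```python
from math import prod

MODULI = [13, 17, 19, 23]        # pairwise coprime; M = 13*17*19*23 = 96577


def to_rns(x, moduli=MODULI):
    """Binary to RNS: one small residue per modulus."""
    return [x % m for m in moduli]


def rns_add(a, b, moduli=MODULI):
    """Carry-free addition: every channel is independent of the others."""
    return [(ai + bi) % m for ai, bi, m in zip(a, b, moduli)]


def rns_mul(a, b, moduli=MODULI):
    """Carry-free multiplication, again one independent step per channel."""
    return [(ai * bi) % m for ai, bi, m in zip(a, b, moduli)]


def from_rns(residues, moduli=MODULI):
    """RNS to binary via the Chinese remainder theorem."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        M_i = M // m                         # product of all other moduli
        x += r * M_i * pow(M_i, -1, m)       # pow(..., -1, m) needs Python 3.8+
    return x % M


a, b = to_rns(123), to_rns(456)
print(from_rns(rns_mul(a, b)))   # prints 56088, i.e. 123 * 456
print(from_rns(rns_add(a, b)))   # prints 579,   i.e. 123 + 456
```

Because each list element is updated without reference to its neighbours, the per-channel loops are the part that maps naturally onto separate processors.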

Our parallel implementation of RSA will incorporate the first three approaches, as shown in Figure 1. The first, performing several RSA operations independently on different processors, is the most scalable technique. Moreover, we also speed up the response time of a single operation, for which the other three techniques are needed, by using an RNS modular multiplication.

2 Modular multiplication

The most frequent operation we perform in RSA is the modular multiplication $x \cdot y \bmod m$. The sequential implementation of this operation requires about twice the time of a simple multiplication, since each multiplication is followed by one modular reduction step. Furthermore, each multiplication depends on the result of the previous one, so they must be done sequentially. Therefore, this method is inherently sequential; to my knowledge there are no obvious analogues in parallel multiplication. In the case of RSA, we need to use Montgomery’s method for modular multiplication combined with a modular arithmetic in which high-precision numbers are represented by their residues modulo a set of small relatively prime numbers. The Montgomery algorithm is a modular multiplication algorithm where one reduction is performed at each iteration of the multiplication. The advantage of this algorithm is that the modular reduction is performed by a shift instead of a division. Let $w = 2^n$ be at least $4m$ and choose $m'$ such that $m m' \equiv -1 \pmod{w}$. Notice that Montgomery’s method does not depend on $w$ being a power of 2. Montgomery’s method [vii] for modular multiplication works as follows:

Step 1: $s := x \cdot y$
Step 2: $q := m' s \bmod w$, where $m m' \equiv -1 \pmod{w}$
Step 3: $r := (s + q m) / w$

Table 1b. Montgomery modular multiplication.

The products $s$ and $qm$ have twice as many digits as $m$, $x$, $y$, or $r$. Since $qm \equiv -s \pmod{w}$, $s + qm$ will always be a multiple of $w$, and the result satisfies $r \equiv x y w^{-1} \pmod{m}$. Reducing modulo $w$ and dividing by $w$ are simple operations on binary numbers when $w$ is a power of 2. However, we still require three multiplications. To overcome this problem we implement Montgomery’s method [viii] in Residue Number System (RNS) arithmetic [ix].

3 Residue Number System

RNS have long been studied because of their potential for high-speed arithmetic processing, achieved by breaking long word-length numbers up into many short word-length numbers that may be operated on in parallel. RNS coding suffers from a number of serious drawbacks, all stemming from its inherent inability to perform magnitude comparison on pairs of numbers: converting a number from RNS to binary [x] is difficult; overflows are not easily detectable; scaling an RNS number by a constant is time consuming; and general division is only possible in practice by converting the operands out of RNS.

We now introduce our RNS terminology:

• The vector $(m_1, m_2, \ldots, m_n)$ forms a set of moduli, called the RNS-base, where the $m_i$’s are relatively prime.
• $M$ is the value of the product $\prod_{i=1}^{n} m_i$.
• The vector $(x_1, \ldots, x_n)$ is the RNS representation of $X$, an integer less than $M$, where $x_i = |X|_{m_i} = X \bmod m_i$.

Any $X$ less than $M$ has one and only one RNS representation, according to the Chinese Remainder Theorem. Addition $+_{RNS}$ and multiplication $\times_{RNS}$ can be implemented in parallel and performed in one single step:

Addition: $A +_{RNS} B \sim |a_j + b_j|_{m_j}$, for $j = 1, \ldots, n$
Multiplication: $A \times_{RNS} B \sim |a_j \times b_j|_{m_j}$, for $j = 1, \ldots, n$
Division: $R \div_{RNS} m_i \sim \hat{r}_j = \big| r_j \times |m_i^{-1}|_{m_j} \big|_{m_j}$, for $j = 1, \ldots, i-1, i+1, \ldots, n$

where $|X^{-1}|_{m_j}$ denotes the inverse of $X$ modulo $m_j$, for $X$ and $m_j$ relatively prime. The Mixed Radix System (MRS) associated with this RNS is defined using the same base of moduli. Assuming that $(x'_1, x'_2, x'_3, \ldots, x'_n)$, with $0 \le x'_i < m_i$, is the MRS representation of $X$, an integer less than $M$, then

$X = x'_1 + x'_2 m_1 + x'_3 m_1 m_2 + \cdots + x'_n m_1 \cdots m_{n-1}$.

Input: A residue base $(m_1, m_2, \ldots, m_n)$, where $M = \prod_{i=1}^{n} m_i$.
The modulus $N$ expressed in RNS, with $\gcd(N, M) = 1$ and satisfying $0 < N < \frac{M}{3 \max_{i=1,\ldots,n} m_i}$.
The integer $A$ given in MRS, $A = \sum_{i=1}^{n} a_i \prod_{j=1}^{i-1} m_j$, and the integer $B$ given in RNS.
Output: An integer $R < 2N$ expressed in RNS, such that $R \equiv A B M^{-1} \pmod{N}$.
Method:
$R := 0$
for $i = 1$ to $n$ do
$q_i := (r_i + a_i b_i) \, |(m_i - n_i)^{-1}|_{m_i} \bmod m_i$
$R := R +_{RNS} a_i \times_{RNS} B +_{RNS} q_i \times_{RNS} N$
$R := R \div_{RNS} m_i$
end for

Table 2. RNS method for modular multiplication.

The algorithm goes through $n$ iterations. At each iteration step, an MRS digit $q_i$ of a number $Q$ is computed, and a new value of $R$, using $q_i$ and $a_i$, is determined in RNS. At each step, $R$ is computed to be a multiple of $m_i$, and since the moduli are relatively prime numbers, dividing $R$ by $m_i$ is equivalent to multiplying each residue of $R$ by the modular inverse of $m_i$. However, this cannot be evaluated for the $i$th residue because $m_i$ is not relatively prime to itself. Therefore, the $i$th residue is lost. We propose two solutions for correctly expressing $R$:

1. Use of an auxiliary residue system for expressing the result.
2. Reconstruct the missing residue after it is lost.

This algorithm can be split into three kinds of tasks, see Table 3.

• Task I computes the MRS digit $q_i$ at the $i$th step of the algorithm with $a_i$.
• Task II computes the new value of $R$.
• Task III performs the conversion of operand $A$ from RNS to the MRS using the Szabo-Tanaka [xi] conversion algorithm.

Task I$_i$: $q_i = (r_i + a_i b_i) \, |(m_i - n_i)^{-1}|_{m_i} \bmod m_i$
Task II$_{i,j}$: $r_j = (r_j + a_i b_j + q_i n_j) \, |m_i^{-1}|_{m_j} \bmod m_j$, for $j \ne i$
Task III$_{i,j}$: $a_j = (a_j - a_i) \, |m_i^{-1}|_{m_j} \bmod m_j$, for $0 < i < j$, $j = 2, 3, \ldots, n$

Table 3. General Tasks.

Since at each step of the algorithm one residue is lost, the intermediate result $R$ cannot be correctly expressed after one step because $R < 3N < M / \max m_i$. Our solution presented here consists of extending the modular system with an auxiliary base $\tilde{B} = (\tilde{m}_1, \tilde{m}_2, \ldots, \tilde{m}_{\tilde{n}})$ with $\tilde{M} = \prod_{i=1}^{\tilde{n}} \tilde{m}_i$, $\frac{\tilde{M}}{3 \max \tilde{m}_i} \ge \frac{M}{3 \max m_i}$ and $\gcd(M, \tilde{M}) = 1$. This base extension can be computed with the Szabo-Tanaka algorithm. The algorithm computes $A B M^{-1} \bmod N$ in RNS, the result being obtained in the auxiliary base $\tilde{B}$.

Task I$_i$: $\tilde{q}_i = (\tilde{r}_i + \tilde{a}_i \tilde{b}_i) \, |(\tilde{m}_i - \tilde{n}_i)^{-1}|_{\tilde{m}_i} \bmod \tilde{m}_i$
Task II$_{i,j}$: Performed on all residues in the auxiliary system, with $a_i$ from the original system: $\tilde{r}_j = (\tilde{r}_j + \tilde{a}_i \tilde{b}_j + \tilde{q}_i \tilde{n}_j) \, |\tilde{m}_i^{-1}|_{\tilde{m}_j} \bmod \tilde{m}_j$, for $j \ne i$
Task III$_{i,j}$: The complementary computation of $R$ in the auxiliary system: $r_j = (r_j + \tilde{a}_i b_j + \tilde{q}_i n_j) \, |\tilde{m}_i^{-1}|_{m_j} \bmod m_j$, for $j \ne i$
Task IV$_{i,j}$: Conversion to MRS: $\tilde{x}_j = (\tilde{x}_j - \tilde{x}_i) \, |\tilde{m}_i^{-1}|_{\tilde{m}_j} \bmod \tilde{m}_j$
Task V$_{i,j}$: Computation of the residues in the auxiliary system from MRS digits: $\tilde{x}_j = (\tilde{x}_j + x'_i \cdot m_1 m_2 \cdots m_{i-1}) \bmod \tilde{m}_j$

Table 4. Tasks for the reverse multiplication RNS implementation.

4 Distributed RSA Implementation

The most straightforward approach to parallelizing the RSA algorithm using the message passing interface (MPI) consists in applying the general principle of space decomposition, so that each processor runs essentially the same program on its own data. The algorithm described has been implemented on a parallel Mosix machine using the MPI library. The parallel machine is a cluster of 8 identical Pentium II PCs with 128 MB of RAM, locally connected by a fast communication network such as Myrinet, each computer with 20 GBytes of local memory, as shown in Figure 2.

Figure 2. Mosix Parallel Computer (nodes linked by fast Myrinet communication).

For a 1024-bit RSA algorithm, if we use 32-bit processors then we need about 33 moduli for our RNS implementation. The timing results for the Mosix [xii] machine show that the parallel hybrid RSA algorithm runs 1.99 times faster on 2 processors and 3.99 times faster on 4 processors than on a single Intel Pentium II processor. The parallelization results therefore show a nearly linear speedup of about $p$, where $p$ denotes the number of processors, and a near-perfect efficiency close to 1.

5 Conclusions and Future Work

In this work we investigate a new strategy to implement an encryption/decryption information algorithm based on RSA. The use of the RNS allows the decomposition of a given dynamic range into slices of smaller sub-ranges on which the computation can be efficiently implemented in parallel. Using a parallel machine to do RSA public/private key operations seems realistic today. Our future work will focus on an interactive computer service that stores and transmits, with integrity, any text security measure associated with certified security technologies and used in connection with copyrighted material or other protected text content that the service transmits or stores.
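As a supplementary illustration of Tables 1b and 2, the following Python sketch checks the two Montgomery-style multiplications numerically. It is an assumption-based sketch, not the distributed implementation described in the paper: the names `montgomery`, `to_mrs`, and `mrs_montgomery` are hypothetical, the sign convention $m m' \equiv -1 \pmod{w}$ is assumed, and $R$ is kept as an ordinary integer so that the lost-residue problem and the auxiliary base do not arise. Only the digit $q_i$ is computed from residues modulo $m_i$, as in the first line of the loop in Table 2.

```python
from math import prod

MODULI = [13, 17, 19, 23]        # hypothetical pairwise coprime base, M = 96577


def montgomery(x: int, y: int, m: int, w: int) -> int:
    """Table 1b: returns r with r*w congruent to x*y mod m (gcd(m, w) = 1)."""
    m_prime = (-pow(m, -1, w)) % w    # m * m' = -1 mod w
    s = x * y                         # Step 1
    q = (m_prime * s) % w             # Step 2
    return (s + q * m) // w           # Step 3: exact division by w


def to_mrs(x: int, moduli=MODULI):
    """Mixed radix digits of x: x = a_1 + a_2*m_1 + a_3*m_1*m_2 + ..."""
    digits = []
    for m in moduli:
        digits.append(x % m)
        x //= m
    return digits


def mrs_montgomery(A: int, B: int, N: int, moduli=MODULI) -> int:
    """Table 2 on plain integers: returns R with R*M congruent to A*B mod N."""
    R = 0
    for a_i, m_i in zip(to_mrs(A, moduli), moduli):
        r_i, b_i, n_i = R % m_i, B % m_i, N % m_i
        # q_i = (r_i + a_i*b_i) * (m_i - n_i)^{-1} mod m_i
        q_i = (r_i + a_i * b_i) * pow(m_i - n_i, -1, m_i) % m_i
        # R + a_i*B + q_i*N is a multiple of m_i, so the division is exact.
        R = (R + a_i * B + q_i * N) // m_i
    return R


r = montgomery(45, 76, 97, 512)                # w = 2^9 >= 4 * 97
assert (r * 512) % 97 == (45 * 76) % 97

R = mrs_montgomery(77, 55, 101)                # N = 101 < M / (3 * 23)
assert (R * prod(MODULI)) % 101 == (77 * 55) % 101
```

In the RNS version of the paper, the update of `R` would be carried out residue by residue, which is exactly where the $i$th residue is lost and the auxiliary base becomes necessary.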

References:
[i] Ronald L. Rivest and Adi Shamir. CryptoBytes, volume 2, number 1, RSA Laboratories, Spring, 7-11 (1996).
[ii] Ronald L. Rivest, The RC5 Encryption Algorithm, in Fast Software Encryption, ed. Bart Preneel, Springer, pp. 86-96 (1995).
[iii] Ronald L. Rivest, Adi Shamir, and Leonard Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2): 120-126 (1978).
[iv] E.F. Brickell, A Survey of Hardware Implementations of RSA, Advances in Cryptology - CRYPTO’89, G. Brassard, ed., pp. 368-370, Springer-Verlag, 1990.
[v] Mark Shand and Jean Vuillemin. Fast implementation of RSA cryptography, in Proceedings, 11th Symposium on Computer Arithmetic, 252-259, IEEE, 1993.
[vi] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.
[vii] Peter L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170): 519-521 (1985).
[viii] S.E. Eldridge and C.D. Walter, Hardware Implementation of Montgomery’s Modular Multiplication Algorithm, IEEE Trans. Computers, vol. 42, no. 6, pp. 693-699, 1993.
[ix] M.A. Soderstrand, W.K. Jenkins, G.A. Jullien, and F.J. Taylor, Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. New York: IEEE Press, 1986.
[x] R.M. Capocelli and R. Giancarlo, Efficient VLSI Networks for converting an Integer from Binary to Residue Number System and Vice Versa, IEEE Trans. Circuits and Syst., vol. CAS-35, 1425-1431, 1988.
[xi] N.S. Szabo and R.I. Tanaka, Residue Arithmetic and its Applications to Computer Technology. New York: McGraw-Hill, 1967.
[xii] Barak A., Guday S. and Wheeler R., The MOSIX Distributed Operating System, Load Balancing for UNIX. Lecture Notes in Computer Science, Vol. 672, Springer-Verlag, 1993.

Acknowledgment: The author would like to thank Professor Joseph Steiner of the Department of Applied Mathematics at the Jerusalem College of Technology for his important comments.