<<

JOURNAL OF COMPUTERS, VOL. 8, NO. 1, JANUARY 2013 33

A TTA-like Processor for Fast RSA Key Generation Using RNS Wei Guo, Jingwei Hu, Jizeng Wei School of Computer Science and Technology Tianjin University, Tianjin, China 300072 Email: { weiguo, jingweihu, weijizeng}@tju.edu.cn

Abstract— RSA key generation is of great concern for processing element (PE) is designed. A new [5] implementation of RSA cryptosystem on embedded system which correctly identifies every positive integer tested as due to its long processing latency. In this paper, a novel being either prime or composite is considered. [6] rather architecture is presented to provide high processing speed to RSA key generation for embedded platform with limited shows a simple way to substantially reduce the value of processing capacity. In order to exploit more data level hidden constants to provide much more efficient prime parallelism, (RNS) is introduced generation . They apply our techniques to var- to accelerate RSA key pair generation, in which these ious contexts (DSA primes, safe primes, ANSI X9.31- independent elements can be processed simultaneously. A compliant primes, strong primes, etc.) and show how cipher processor based on Transport Triggered Architec- ture (TTA) is proposed to realized the parallelism at the to build fast implementations on appropriately equipped architecture level. In the meantime, division is avoided in smart-cards, thus allowing on-board key generation. A the proposed architecture, which reduces the expense of very efficient recursive algorithm [7] for generating nearly hardware implementation remarkably. The proposed design random provable primes is presented. The expected time is implemented by Verilog HDL and synthesized in a 0.18µm for generating a prime is only slightly greater than the CMOS process. A rate of 3 pairs per second can be achieved for 1024-bit RSA key generation at the frequency of 100 expected time required for generating a pseudo-prime of MHz. the same size that passes the Miller-Rabin test for only one base. possibilities of realization of Miller- Rabin big Index Terms— RSA, key generation, RNS, Montgomery , TTA-like processor number on assembler of Texas Instruments digital signal processors of TMS320C54x family are con- sidered [8]. Applying these modules, it could be achieved I.INTRODUCTION considerably higher level of the system security regarding Public key cryptography has gained extreme popularity to the software-only security systems. [9] moves on to in many applications such as smart cards, digital certifi- a very simple new deterministic test, then they discuss cate and so on. One of the most widely used public key various ways of constructing so-called strong primes. algorithm is RSA [1]. However, the process of RSA key However, these new algorithms generally focus on the generation is very complex and time-consuming. Some improvement of the computational complexity and how implementations [2] try to generate the RSA key pairs to apply them on chip remains to be solved. on a desktop and upload the pair, or only the private We propose an efficient solution to accelerate RSA key, into a smart card. Take communication security and key pair generation from both data and instruction level high-performance processing into consideration, the entire parallelism, in which RNS and TTA are combined close- procedure of RSA key generation is preferably performed ly. The advantage of RNS Montgomery multiplication totally inside cipher chip in order to guarantee the efficien- algorithm [10] is that large number can cy and the security of the applications. Nevertheless, the be divided into small elements and each independent limited computational power of embedded systems can elements can be processed simultaneously. The advan- not afford high speed RSA key generation. In this paper, tages of Transport Triggered Architecture (TTA) [11] is the issue about how to provide high processing speed to that it can be used as an application specific processor, RSA key generation for embedded platform with limited especially as a coprocessor for different DSP applications processing capacity is discussed. in SoC. Comparing with the traditional ASIC, TTA is An on-card implementation is presented in [3] which more flexible in application and consumes less silicon takes up to 6.82 seconds to create a key pair. A s- area. And comparing with traditional DSP processor, it calable hardware architecture is proposed in [4], they has more efficiency and better performance. In this paper, have presented the architecture applied for the multiple function unit (FU) ”MMAC” is elaborately designed, word Radix-2 Montgomery and a which meets the highly parallelism of RNS Montgomery multiplication and reduces redundant data transmission. This work is supported by the Key Project Foundation of Fun- So an optimized architecture, called TTA-like, is proposed damental and Frontier Technology Research of Tianjin under Grant No.11JCZDJC1580 and the Special Financing Project of Internet of to meet the desire for high-efficient performance. In Things by the National Development and Reform Commission. order to reduce the expense of hardware implementation,

© 2013 ACADEMY PUBLISHER doi:10.4304/jcp.8.1.33-40 34 JOURNAL OF COMPUTERS, VOL. 8, NO. 1, JANUARY 2013

division is discarded in the process due to an appropriate selection of algorithms. This improvement makes it possi- ble to fulfil RSA key pair generation with pure hardware design, which is quite different from prior methods that k Random Number use software programming to do the division. The rest of paper is organized as follows. Section 2 describes the RSA key pair generation and details how to P Prime Candidate If not pass reduce timings related to prime finding algorithm which is the kernel of the whole process. Section 3 presents Prime p, q our architecture for it. The implementation results in this paper are compared with what is given in [3][4] in this section. Finally, in section 4, the conclusions and future work are discussed. Select another e II.DESIGN FLOWAND ALGORITHM ANALYSIS No Yes Common RSA key pair generator generally include a stage of in the sieve function procedure [2] [3]. We investigate in this section a way of how to avoid d, private key this by utilizing invertible numbers which is quite suitable for the hardware implementation. Some algebraic tech- niques are introduced to speed up [10] [12], so the primality test, which is the most time- consuming section practically, can be optimized greatly Figure 1. A common flow of RSA key pair generation to achieve a satisfactory implementation; And the private key generation algorithm [13] is improved as well for the sake of limited computation resources in our embedded question: how to make sure that q is co-prime with processor. Figure 1 gives out a common procedure of small prime ? the RSA key pair generation. First, a random number For question 1, we introduce a concept in number the- [14] is generated and then it is thrown into the ”sieve ory: co-prime congruences mod n form a multiplication function” procedure. After that we get so-called ”prime group. candidate” which is not a prime for sure but very likely For example, [14][16]. The primality of these candidates are confirmed mod 4 has 2 co-prime congruences: 1 and 3; via primality test(here, Miller-Rabin test is adopted). The mod 8 has 4 co-prime congruences: 1, 3, 5 and 7 rest of the procedure is simple: public key and private key mod 16 has 6 co-prime congruences: 1, 3, 5, 7, 9, 11, is generated in successive order. 13 and 15 For the convenience of narrative, co-prime congruences Π ∗ A. Sieve Function mod is denoted as ZΠ and the gcd The purpose of the sieve function is to reduce the times is denoted as . of the primality test which is the most time-consuming So,theorem 2.1.1. part of RSA key pair generation. Unlike ordinary sieve ∀ ∈ ∗ ( Π) = 1 functions which use small primes to divide the candidate k ZΠ,iff gcd k, number, trial divisions is replaced by one-time modular ∈ ∗ exponentiation [6] as shown in Algorithm 1.This algorith- It means we just have to find k ZΠ, and here comes m starts with the following∏ consideration [7]: a easier way to get it called ”Carmichael Theorem”: • Π = ( ≠ we denote Carmichael function as λ(n) and ∀a ∈ construct i pi pi 2 and pi is a small prime). λ(n) • Then we can calculate k(k < Π) to make k is co- Z, a = 1(mod n) prime with Π. So,theorem 2.1.2. • A large integer L (usually L = νΠ, ν is a small ∀ ∈ ∗ λ(Π) = 1 integer). k ZΠ, k (mod n) • q = k + L, which q is co-prime with Π definitely or in another word, q is also co-prime with small primes Take notice∏ that [13], pi. So the function of sieve function is reached. λ(Π)=lcm( pi)=lcm(λ(p1), λ(p2), ..., λ(pt)), Here, in order to explain details of our algorithm , two (λ(p)=p − 1,p is a prime; least common multiplier is questions are presented to help: denoted as lcm) 1) How to find such k which is co-prime with Π? figure 2 provides a method to generate such k. 2) How to get the large integer L to ensure that q = For question 2,L should be set as an integral multiple k +L is co-prime with Π which is equivalent to the of Π

© 2013 ACADEMY PUBLISHER JOURNAL OF COMPUTERS, VOL. 8, NO. 1, JANUARY 2013 35

Algorithm 2 Miller-Rabin Test Input: Odd integer n ≥ 3 Output: Confirm the primality of n Find u and k so that n − 1 = u ∗ 2k Let a be randomly chosen from 2, 3, ..., n − 1 b ← aumod n

Figure 2. Generator g’ of Invertible Number of modulo Π if b ∈ 1, n − 1 then print "n is prime" end if Derivation: we have gcd(a, b)=gcd(b, a mod b)(a > b,a mod b ≠ for i = 1 to k − 1 do 0); b ← b2 mod n ( + Π) ( + Π Π) ( Π) so, gcd k L, =gcd k m , =gcd k, =1 which if b = n − 1 then means that such L meets the requirement [15]. print "n is prime" According to the architecture presented in this paper, end if modular exponentiation can be done very fast and no extra if b = 1 then hardware consuming is needed. So the processing time of the sieve function can be trivial and circuit area will be print "n is composite" saved as well. end if end for Algorithm 1 Invertible Number Generation ∏ print "n is composite" Π = 74 (Π) = Input: randomly selected k i=1 pi, λ lcm(p1 − 1, p2 − 1, ..., p74 − 1), pi = 3, 5, 7, ..., 379 L = 8 × Π,M = 3 × Π Output: q co-prime with the smallest 74 odd primes if kλ(Π) mod Π = 1 then q = k + L else k = k + 1, go to 1 end if if q is even then q = q + M end if

− Figure 3. Powers an 1 mod n calculated with intermediate steps, n = 325 B. Primality Test The candidate must be tested for primality in order to be useful for the generation of a RSA key pair [12][16]. roots of 1 modulo n. If n is a prime number, there are In this section, Miller-Rabin’s method [10][13] is used no other square roots of 1 modulo n. for our primality test. The algorithm is described in Algorithm 2 below. Lemma 2.2.1. We consider another property of arithmetic modulo p If p is a prime number and 1 ≤ a < p and a2 mod p for a prime number p that can be used as a certificate for = 1, then a = 1 or a = p − 1 . compositeness [14][17]. Proof. We have (a2 − 1) mod p = (a + 1)(a − 1) mod p = 0; that means, p divides (a + 1)(a − 1). Since p is Definition 2.2.1. prime, p divides a + 1 or a − 1, hence a = p − 1 or a = Let 1 ≤ a < n. Then a is called a square root of 1 1 modulo n if a2 mod n = 1. Thus, if we find some nontrivial square root of 1 In the situation of this definition, the numbers 1 and n-1 modulo n, then n is certainly composite. Figure 3 gives are always square roots of 1 modulo n (indeed, (n−1)2 ≡ us an example to show this principle. (−1)2 ≡ 1 (mod n)); they are called the trivial square In figure 3, calculating 201324 mod 325 with two

© 2013 ACADEMY PUBLISHER 36 JOURNAL OF COMPUTERS, VOL. 8, NO. 1, JANUARY 2013

notice of b ← aumod n which is the most timing- cost computation in Algorithm 2, we focus on this part and make use of RNS Montgomery multiplication to accelerate Miller-Rabin test. The technical details shall be discussed in section 2.4.

C. Calculation of Private Key The private key is the modular inverse of the public key as described in algorithm 3 [13]. − Figure 4. Powers an 1 mod n calculated with intermediate steps, possible cases Algorithm 3 Improved Stein’s method for modular in- verse Input: e(public key),Φ(n)(Φ(n) = (p−1)×(q−1),p, q intermediate steps leads us to detect that 51 is a nontrivial are two primes ) square root of 1, which proves that 325 is not prime. Output: d = e−1 mod Φ(n) What can the sequence b0, ..., bk look like in general? We first note that if bi = 1 or bi = n−1, then the remaining (X1,X2,X3) = (1, 0, e), (Y1,Y2,Y3) = (0, 1, Φ(n)) elements b , ..., b must all equal 1, since 12 = 1 and i+1 k while X3 ≠ 1 or Y3 ≠ 1 do (n − 1)2 mod n = 1. Thus in general the sequence starts if X is even then with zero or more elements ∈/ 1, n − 1, and ends with a 3 sequence of zero or more 1s. The two parts may or may if X1,X2 are even then − 1 not be separated by an entry n . All possible patterns (X1,X2,X3) = (X1/2,X2/2,X3/2) are depicted in Figure 4, where ”*” represents an arbitrary else element ∈/ 1, n − 1. We distinguish four cases: X1 = X1 + Φ(n),X2 = X2 − e • Case 1a: b0 = 1. end if • Case 1b: b0 ≠ 1, and there is some i ≤ k − 1 such that bi = n − 1. else if Y3 is even then ł In Cases 1 and 1 we certainly have that b = 1; a b k if Y1,Y2 are even then no information about n being prime or not is gained. (Y1,Y2,Y3) = (Y1/2,Y2/2,Y3/2) • Case 2: bk ≠ 1. ł Then n is composite. • Case 3: b0 ≠ 1, but bk = 1, and n−1 does not occur in else ≥ the sequence b0, ..., bk−1. ł Consider the minimal i Y1 = Y1 + Φ(n),Y2 = Y2 − e ∈ 1 − 1 1 with bi = 1. By the assumption, bi−1 / , n , end if hence bi−1 is a nontrivial square root of 1 modulo n. Thus n is composite in this case. else if X3 > Y3 then (X1,X2,X3) = (X1 − Y1,X2 − Y2,X3 − Y3) Definition 2.2.2. else Let n ≥ 3 be odd, and write n − 1 = µ · 2k, µ odd, (Y ,Y ,Y ) = (Y − X ,Y − X ,Y − X ) k ≥ 1. A number a, 1 ≤ a < n, is called an A-witness 1 2 3 1 1 2 2 3 3 i for n if aµ mod n ≠ 1 and aµ·2 mod n = n − 1 for all end if i, 0 ≤ i < k. If n is composite and a is not an A-witness if X3 = 1 then for n, then a is called an A-liar for n. −1 return X1 = e mod Φ(n) else Lemma 2.2.2. −1 return Y1 = e mod Φ(n) If a is an A-witness for n, then n is composite. end if Proof. If a is an A-witness for n, then to the sequencebi = au · 2i mod n, 0 ≤ i ≤ k, Case 2 or Case 3 of the end while preceding discussion applies, hence n is composite. The particular method chosen for computing modular Finally, We combine this observation with the idea inverse avoids trial division to make it easier to be − of choosing some a from 2, ..., n 2 at random into a implemented in hardware. strengthening of the Fermat test, called the Miller-Rabin test.The detail of this algorithm has been described in Algorithm 2. D. RNS Montgomery Modular Multiplication Since Miller-Rabin test is dominant in the processing Modular multiplication is the kernel operation of RSA time of RSA key pair generation, it is important to key pair generation which is called in quantity and takes improve this part for the sake of high performance. Take up the most time of the whole procedure, it is of great

© 2013 ACADEMY PUBLISHER JOURNAL OF COMPUTERS, VOL. 8, NO. 1, JANUARY 2013 37

importance to analyze this part and bring out the optimal Another concern about RNS Montgomery Multiplica- algorithm for the hardware we present in this paper. tion is about the proper selection of RNS base. In this The RNS Montgomery modular multiplication algorith- paper, an efficient RNS base is chosen in the form of n m proposed in [10] is a fast parallel algorithm for modular 2 − ci, where n is the length of RNS base which multiplication with large operands, which is the basic decides the data width of hardware implementation. In operation of public key cryptosystems. The algorithm is this form, modular and modular multiplication rewritten in algorithm 4. are efficiently realized, which are shown in Algorithm 6 and Algorithm 7. Because of the decrease of the Algorithm 4 RNS Montgomery Modular Multiplication complexity of multiplication, the cost for mod operation is considerably reduced in the design. Input: |X|a∪b, |Y |a∪b, (X, Y < 2N) [ ] ( = −1) 2 Output: r a∪b, r XYA modN,r < N Algorithm 6 Modular Addition Algorithm for i=1 to k do addmod(a, b, mi) = ( × ) n step1: zi xi yi modai Input: a, b, mi, 0 < a, b < m, mi = 2 − ci and ci < = ( × | − −1| ) t step2: qi zi N ai modai 2 , and1 < t < (n − 1)/2

end for Output: r = a + b mod mi step3: qj = BT (qi, 0) step1: P = a + b

for j=1 to k do step2: Q = P + ci step4: zj = (xj × yj) mod bj step3: n step5: wj = (zj + qj × Nj) mod bj if Q ≥ 2 then = ( × | −1| )( = ) n step6: rj wi Ai bj i j r ← Qmod 2 end for else

step7: ri = BT (rj, 0x80000000) r ← P end if Two bases a and b are introduced, and subscript i and j are used to indicate the elements related to base a and Algorithm 7 Modular Addition Algorithm b respectively. For example, ai represents the ith element of base a, and xj represents the jth element of X which mulmod(a, b, mi) [ ] n is represented by base b. In this paper, X a∪b means Input: a, b, mi, 0 < a, b < m, mi = 2 − ci and ci < X is represented in RNS by base a and b, and |A−1| i bj 2t, and1 < t < (n − 1)/2 represents the inverse of A modulo b . This algorithm ∏ i j × = k Output: r = a b mod mi chooses A i=1 ai as Montgomery constant. In Algo- rithm 4, step3 and step7 are base transformation (BT) step1: y = a × b n n between different base representation, which is shown step2: y1 = y ÷ 2 ; y0 = y mod 2 in Algorithm 5. In this case, step3 transforms q which i step3: y′ = c y + y is represented by base a to base b representation into i 1 0 y′ = y′ ÷ 2n y′ = y′ 2n qj. Derived from base transformation from [10], the BT step4: 1 ; 0 mod ′′ ′ ′ algorithm is reformulated to satisfy architecture designed step5: y = ciy1 + y0 in this paper. |X|m is used to represent X modulo mi ′′′ ′′ i step6: y = y + ci in the following algorithm. step7: ′′′ ≥ 2n Algorithm 5 Base Extension qj = BT (qi, α) if y then ← ′′′ 2n Input: [q]a r y mod

Output: [q]b, qr else ′′ for i=1 to k do r ← y = × | −1| end if li qi Ai mi end for ∑ = ( + k ) 32 w α i=1 li >> III.IMPLEMENTATION FOR RSAKEY PAIR for j=1 to k do ∑ GENERATION = | k | | × | qj n=1 An ai li mi In this section, a TTA-like architecture is presented to end for illustrate the fast RSA key pair generation in our way. return [q] The implementation results are compared with [3] [4] in b section 3.

© 2013 ACADEMY PUBLISHER 38 JOURNAL OF COMPUTERS, VOL. 8, NO. 1, JANUARY 2013

A. Transport Triggered Architecture Interconnection Move Bus Network Transport Triggered Architecture (TTA) [11] is statical- ly programmed ILP modular architecture which is similar PipeLine#1 to VLIW architecture. Instead of specifying operation General- typing and controlling the FUs directly, TTAs specify Purpose Registers the required data transports. These transports may trigger PipeLine#2 operations as side effect implicitly. TTA modular template is illustrated in Figure 5. TTA is organized as a set of Port Connectivity functional units (FUs) and register files (RFs) including PipeLine#3 general- purpose registers. The input and output ports Input socket of these resources are con- nected together with an PipeLine#4 interconnection network composed of move buses and input/output sockets. The data transports encoded in each Functional Unit Instruction Output socket instruction slot are carried out on each move bus, the Decoder Operand Port number of which thus determines the maximum parallel Triggered Port Result Port data moves that can be performed in each cycle. Input sockets contain multiplexers which feed data from the Figure 5. Template of Transport Triggered Architecture buses into the desti- nation registers in FUs. Output sockets contain de-multiplexers that put FU results from C Luta1 Data Data Data Table1 D source registers on the buses. The connectivity between RAM1 RAM2 RAM3 A B RFs and FUs is more complex in VLIW architecture as the number of input/output ports increase, leading to LDST1 LDST2 LDST3 MMAC1 MMAC2 MMAC3 MMAC4 larger area and critical path delay overhead. However, Decoder interconnections of TTAs are simpler. The interconnection %XV network of TTAs can be fully connected, in which case   all the ports in all the FUs are connected to all the buses. 

But usually they are partially connected and optimized %XV for specific application according to the traffic on the Conf JMP1 JMP2 JMP3 ALU1 ALU2 RF Reg move buses in practice. The functional units follow the Ifetch triggering strategy. Each FU triggers the operation when Instruction a certain operand is transported to the triggered operand RAM register in the FU, Therefore each FU must hold one trigger register at least to perform the corresponding Figure 6. The proposed architecture in this design operations. Additional operand registers may be required for multi-operands operations. The FUs can hold more than one result registers for occasional multiple results with RNS Montgomery multiplication algorithm, long bit requirement. Every FU can be pipelined to reduce the modular multiplication can be done very fast. In this critical path delay and improve the processing throughput. work, four MMAC units are designed according to the maximum number of function units used in the processing of executing the operations referring to step 1 step 3 in B. Architecture Design algorithm 4; The remaining operations after TIME E also The proposed architecture in this design is shown in depend on these four MMAC units in order to increase Figure 6. It mainly consists of five parts: FUs, RFs, the reusability of the architecture and save the valuable control logic, transport network and on-chip RAM/ROM. hardware resources. Similar to common processors, the control logic composes Two ALU units are included to implement the modular of instruction fetch, instruction decoder and PC control addition and modular subtraction operations. And some units. In this design, JMP unit can affect the PC value to controlling operations are needed such as logical right- realize jump, branch and loop operations. shift, arithmetic right-shift and case-select. They are in- The FUs are the key factors which decide the perfor- cluded in ALU units to help finish the extra operation in mance of this processor. According to different applica- the process of RSA key pair generation as well. tions, various of FUs can be designed and attached to the There are three Load-store units (LDST), one Look- transport network. To implement RNS Montgomery mul- up Table unit (LUT). LDSTs are connected with the tiplication algorithm, modular multiplication and modular independent Data RAM. LUT are used to store the multiplication-and-accumulation are the key operations in pre-computed data. There are direct data paths between n-bit level, where n is the base size. So, the MMAC units LDSTs and MMAC units and also between LUT and the are designed to accelerate the execution speed of these group of MMAC, which are used to reduce redundant key operations. MMAC units can only do 32-bit× 32- data transmission. bit modular multiplication, But combing them together The transport network is used to transport data from

© 2013 ACADEMY PUBLISHER JOURNAL OF COMPUTERS, VOL. 8, NO. 1, JANUARY 2013 39

Sieve Function Primality Test Private Key Calculation TABLE I. RSA KEY PAIR GENERATION COMPARISON

Ref [3] Ref [4] This work

Frequency (MHz) 32 10.52 100

Private Key Primality test (ms) 84.2 25.29 4.1

prime finding (ms) 3100 / 151

RSA key pair generation (ms) 6820 / 306

Figure 7. The operational process of RSA key pair generation from function unit level IV. CONCLUSIONAND FUTURE WORK source register to destination register. Four buses are This paper presents a novel RSA key pair genera- adopted in the transport network to fully exploit the tion hardware implementation based on TTA, and Mont- computation capacity of four MMAC units in parallel. gomery modular multiplication based on RNS is adopted The width of each bus is 32-bit which is decided by the in both sieve function and primality test to improve the selected base size. performance significantly. FUs suiting the algorithms is Figure 7 illustrates the operational process of RSA designed on TTA, and direct data paths are used to key pair generation from function unit level. The whole reduce redundant data transmission. Above all, pipeline process has three stages: and parallel technology to improve the computing speed 1) Sieve Function are introduced. At the frequency of 100 MHz, 1024- 2) Primality Test bit RSA key pair generation needs 306 ms in average, 3) Private Key Calculation the logic area of the proposed architecture consists of The arrows in this figure show the data flow of the 131k gates. This result shows that our proposed work can procedure: In the Sieve Function, MMAC is used to do achieve high performance and small area for RSA key modular exponentiation specially, ALU helps to do some pair generation. ordinary arithmetic operations such as mod-add or add, On-going and future developments include: (1) Prepa- JMP then controls the programming mode; Similar to ration for some pre-computed data especially in RNS Sieve Function, we use the same units to fulfil calcula- Montgomery multiplication can be optimized which affect tion (mainly modular exponentiation) in Primality Test the rate of RSA key pair generation significantly. (2) procedure; For the last stage, It is simple to generate The concept of scalable and reconfigurable architecture is private key with ALU while jump instructions occupy a introduced, in which not only 1024-bit RSA key pair but large proportion of it, so JMP is set up here to solve this also 2048, 4096-bit can be implemented in this platform. problem. According to the algorithm analysis in section 2, these V. ACKNOWLEDGMENT function units are arranged to fulfil the RSA key pair This work is supported by the Key Project Foundation generation: MMAC units do the modular exponentiation; of Fundamental and Frontier Technology Research of ALU units complete the operation of addition and sub- Tianjin under Grant No.11JCZDJC1580 and the Special straction and some other logical operations; And JMP Financing Project of Internet of Things by the National units serve as judge and jump instruction. Development and Reform Commission.

C. Implementation Result REFERENCES The design was implemented by Verilog HDL and synthesizes with 0.18 µm CMOS technology. At 100 [1] R. L. Rivest, A. Shamir, L. Adleman, ”A Method for Ob- MHz, the processing time of 1024-bit RSA key pair taining Digital Signatures and Public-key Cryptosystems,” generation is 306 ms in average, and the logic area is 131k Communications of the ACM, vol. 21, pp. 120-126, 1978 gates. To compare the performance of our implementation [2] Chenghuai Lu, Andre L. M. dos Santos, Francisco R. with other works [3] [4], the processing time consumed in Pimentel, ”Implementation of Fast RSA Key Generation on primality test, prime finding and RSA key pair generation is analyzed. Because of the architecture specified for Smart Cards,” Proceedings of the ACM Symposium on modular multiplication, the consuming time in primality Applied Computing, pp. 214-220, 2002 test is reduced greatly and numbers of times in prime [3] Bahadori M, Mali M. R, Sarbishei O, Atarodi M, Shar- finding are 36 which is a satisfactory compromise in both ifkhani M, ”A Novel Approach for Secure and Fast Gener- circuit area and processing time. As shown in Table 1, the ation of RSA Public and Private Keys on SmartCard,” 2010 proposed work requires less clock cycles than the other 8th IEEE International NEWCAS Conference (NEWCAS works. 2010), pp. 265-268, 2010

© 2013 ACADEMY PUBLISHER 40 JOURNAL OF COMPUTERS, VOL. 8, NO. 1, JANUARY 2013

[4] Cheung Ray C. C, Brown Ashley, Luk Wayne, Cheung Peter (Cat. No.99CH36303), pp. 363-7, 1999 Y K, ”A Scalable Hardware Architecture for Prime Number Wei Guo received the M.S. degree from Louisiana State Uni- Validation,” Proceedings - 2004 IEEE International Con- versity in 1991. After then, she joined Motorola Semiconductor ference on Field-Programmable Technology, FPT ’04, pp. Ltd. and worked in design for 12 years. Now 177-184, 2004 she is a professor and director of VLSI Research Lab at School [5] Da Silva, Joao Carlos Leandro, ”Carmichael numbers and of Computer Science and Technology, Tianjin University, Tian- a new primality test,” IEEE 26th Convention of Electrical jin, China. Her research interests are SoC design technology, and Electronics Engineers in Israel, pp. 631-633, 2010 computer architecture, and cryptographic processor. [6] Joye M, Paillier P, Vaudenay S, ”Efficient generation of prime numbers,” Cryptographic Hardware and Embedded Systems -CHES 2000. Second International Workshop. Pro- Jingwei Hu received the B.E. degree in Electronic Engineering ceedings (Lecture Notes in Computer Science Vol. 1965), from Maritime University, Dalian, China, in 2011. Now he is pp. 340-54, 2000 a master candidate at School of Computer Science and Tech- [7] Maurer U. M, ”Fast generation of prime numbers and nology, Tianjin University. His research interests are applied secure public-key cryptographic parameters,” Journal of cryptography and cryptographic processor. Cryptology, pp. 123-55, 1995 [8] Bordevic G, ”On optimization of Miller-Rabin primality test on TI TMS320C54x signal processors,” 14th International Jizeng Wei received the B.E. degree in computer science from Workship in Systems, Signals and Image Processing and Harbin Institute of Technology, Harbin, China, in 2004, the 6th EURASIP Conference focused on Speech and Image M.S. and the Ph.D. degree in computer science and technology Processing, Multimedia Communications and Services - from Tianjin University, Tianjin, China, in 2007 and in 2010, EC-SIPMCS 2007, pp. 229-32, 2007 respectively. Now he is an assistant professor at School of Com- [9] Landrock P, ”Primality tests and use of primes in public key puter Science and Technology, Tianjin University. His research systems,” Lectures on Data Security. Modern Cryptology interests include application-specific instruction-set processor in Theory and Practice, pp. 127-33, 1999 (ASIP), cryptographic processor, and electronic system level [10] Laurent Imbert, Jean-Claude Bajard, ”A Full RNS Imple- (ESL) modeling. mentation of RSA,” IEEE Transactions on Computers, vol. 53, 2004, pp. 769-774. [11] Wei Guo, Jizeng Wei, Yongbin Yao et al, ”Design of a configurable and extensible Tcore processor based on Transport Triggered Architecture,” World Congress on Computer Science and Information Engineering, pp. 536- 540, 2009 [12] Hanae Nozaki, Masahiko Motoyama, Atsushi Shimbo et al, ”Implementation of RSA Algorithm Based on RNS Montgomery Multiplication,” Cryptographic Hardware and Embedded Systems(CHES), pp. 364-376, 2001 [13] Kenneth R, Castleman, ”Digital image processing[M],” New Jersey:Prentice-Hall, 1996 [14] Igumnov V.S, ”Generation of the large random prime numbers,” J International Siberian Workshop on Electron Devices and Materials. Proceedings. 5th Annual (IEEE Cat. No.04EX813), pp. 117-18, 2004 [15] Adleman L. M, ”On distinguishing prime numbers from composite numbers,” 21st Annual Symposium on Foun- dations of Computer Science, pp. 387-406, 1980 [16] Yee L. P, ”Digital signatures and public key cryptosystems with multilayer perceptrons,” CONIP ’02. Proceedings of the 9th International Conference on Neural Information Processing. Computational Intelligence for the E-Age (IEEE Cat. No.02EX575), pp. 2308-11 vol.5, 2002 [17] Duran Diaz R, ”Safe primes density and cryptographic applications,” Proceedings IEEE 33rd Annual 1999 In- ternational Carnahan Conference on Security Technology

© 2013 ACADEMY PUBLISHER