
Hardware Implementation of the Baillie-PSW Primality Test

Carla Purdy, Yasaswy Kasarabada, and George Purdy
Department of Electrical Engineering and Computing Systems
University of Cincinnati, Cincinnati, OH, USA

Abstract – The need for large primes in major cryptographic algorithms has stirred interest in methods for prime generation. Recently, to improve confidence and security, prime generation in hardware is being considered as an alternative to software. Due to running-time and hardware implementation issues, probabilistic primality tests are generally preferred. The Baillie-PSW primality test is a strong probabilistic test; no known Baillie-PSW pseudoprime exists. In this paper, we discuss different types of cryptographic algorithms and primality tests, and review hardware implementations of the Miller-Rabin and Lucas tests. We also present the implementation of a Verilog-based design of the Baillie-PSW test on an Altera Cyclone IV GX FPGA. To our knowledge, this is the first hardware implementation of this test. The implementation takes an odd random number as input and returns the next probable prime as output. We analyze the results from our implementation and suggest methods to further improve our results in future.

I. INTRODUCTION

Currently acceptable cryptographic systems are of two major types: symmetric key and public key. In symmetric key cryptography, as the name suggests, the sender and receiver share the same key. It is usually implemented as block or stream ciphers. DES and AES are two examples. However, key management is a major issue. To overcome this obstacle, public-key cryptography was introduced in 1976 [1]. Two separate but mathematically related keys (the public key and the private key) are used. Given the public key, calculating the private key is computationally infeasible. The sender encrypts the data using the receiver's public key; the receiving party then uses their private key to decrypt the data. RSA, DSA and ElGamal are public-key cryptosystems.

When RSA is used, NIST's SP 800-57, published in 2016 [2], mandates a key length of 2048 bits (with 112 bits of strength) for all data that needs to stay secure through 2030. According to NIST's FIPS 186-4, published in 2013 [3], the primes used to generate a 2048-bit (semi-prime) key are at most 1024 bits long if probable primes are being used. Therefore, in our implementation we generate 1024-bit primes.

Many methods can be used to generate primes. A prime sieve is a simple way to generate all primes in a limited range; however, it is inefficient at finding a single prime in a large range. For this purpose, primality tests are generally used. There are three major types of primality tests: probabilistic, deterministic and heuristic. Probabilistic tests generally involve choosing a random number n from a sample space, followed by checking whether some equality involving n is true. If so, then, to a certain accuracy, n is probably prime. The Fermat test [4], the Miller-Rabin test [5] and the Solovay-Strassen test [6] are a few examples. On the other hand, deterministic tests determine with absolute certainty the primality of a number. However, these tests generally have high running times compared to probabilistic tests. Heuristic tests are a more "experimental" type of primality test, with low running times. These tests are generally unproven but work well in practice; Baillie-PSW [7] is an example of such a test. Due to its low running time and low rate of failure, we have chosen to implement the Baillie-PSW test in this work. Our work can be used in standalone embedded systems which need to generate prime numbers for security and cannot rely on software-based systems for generation. Our system can also be included as a prime-generation module in a large secure system.

II. BACKGROUND

A. Distribution of primes and their use in cryptography

The Prime Number Theorem [8] [9], which describes the asymptotic distribution of prime numbers, implies that primes become less common as the numbers get larger. Based on the calculations of Gauss, Legendre and other mathematicians, we can estimate the number of primes less than or equal to a real number N, $\pi(N)$, as:

\[ |\pi(N) - \mathrm{Li}(N)| < \frac{N^{1/2} \cdot \ln(N)}{8\pi} \qquad (1) \]

for all N > 2657, where $\mathrm{Li}(N) = \int_{2}^{N} \frac{1}{\ln(x)} \, dx$.

The security of many major public-key cryptosystems depends on how hard it is for an attacker to factor the public key to gain information about the private key. In RSA, this means factoring the semi-prime into its prime factors. In 2012, a major flaw in many RSA implementations was found. Researchers [10] noticed public keys being shared among unrelated parties. Many of the 1024-bit shared RSA moduli offered little to no security. Such "high-risk" keys could compromise the security of all concerned systems. To avoid such a scenario, large and unique primes must be used each time to generate RSA keys.

B. Primality tests

As discussed in Section I, deterministic primality tests have zero rate of failure, but very high running times. Probabilistic tests have low running times but higher rates of failure. The Baillie-PSW test has very high accuracy and a low running time, so it is the ideal choice for a practical primality test.

Algorithm 1 – Baillie-PSW primality test [7]
Input: N > 3, an odd integer to be tested for primality
Output: Probable Prime if N passes; otherwise Composite
1. Perform trial division of N with primes less than a convenient limit. If N is divisible by any such prime, return composite and quit (performed by Divisibility module in software).
2. Perform a Miller-Rabin test (base 2) on N. If N is not a base-2 Miller-Rabin probable prime, declare N to be composite and quit (performed by Miller-Rabin module in hardware).
3. Perform a Lucas probable prime test on N. If N is not a Lucas probable prime, declare N to be composite and quit (performed by Lucas module in hardware).
4. If N has not been declared composite, declare N a probable prime (performed by the Baillie-PSW module in hardware).

C. Hardware Implementations

Currently, many primality testing algorithms are being implemented in hardware, on FPGAs and ASICs. Due to their flexibility, reconfigurable hardware like FPGAs and CPLDs are preferred. Several implementations of the Miller-Rabin test and a few implementations of the Lucas test exist. The fastest implementation of the Miller-Rabin test on a Xilinx FPGA [11] gives a 2.2x speedup over software. A Lucas test implementation on the same FPGA [12] was 30% slower but 3 times more energy efficient. The ASIC implementation [12] was 3.6 times faster and 400 times more energy efficient than the optimized software implementation. To our knowledge, this is the first hardware implementation of the Baillie-PSW test.

III. MODULE DESCRIPTIONS & IMPLEMENTATIONS

The implementation in this paper consists of two main modules with five additional sub-modules and a top-level module. The main modules (the Miller-Rabin and Lucas modules) perform steps 2 and 3 of Algorithm 1 respectively. The sub-modules perform step 1 and assist in the successful operation of the main modules. The sub-modules are the Modular Multiplication (MM) module, the Modular Exponentiation (ME) module, the Jacobi symbol calculator (JSC) module, the Divisibility module and the Random Number Generator (RNG) module. The last two sub-modules were implemented in software on a host computer; the rest of the sub-modules, the two main modules and the top-level module are hardware modules. We also discuss the role of the testbench.

Algorithm 2 – Miller-Rabin primality test [13, 5]
Input: N (odd), base a (a = 2 in this work)
Output: Probable Prime if N passes; otherwise Composite
1. Find s, m such that N - 1 = 2^s · m, with m odd.
2. Compute X = a^m mod N.
3. If X = 1 or N - 1, RETURN Probable Prime & EXIT.
4. FOR i = 1 to s-1 loop
   a. Compute X = X^2 mod N.
   b. If X = 1, RETURN Composite and EXIT.
   c. If X = N - 1, RETURN Probable Prime and EXIT.
5. RETURN Composite.

Algorithm 3 – Lucas probable prime test [14]
Input: N (generated by RNG module in software)
1. U_0 = 0, V_0 = 2.
2. Select D ∈ {5, -7, 9, -11, …} such that the Jacobi symbol J(D, N) = -1. Set P = 1, Q = (1 - D) / 4.
3. Calculate U_{N - J(D, N)}, i.e. U_{N+1}.
4. If U_{N+1} ≡ 0 mod N, RETURN Probable Prime; else RETURN Composite.

A. Miller-Rabin module

This module performs a base-2 Miller-Rabin test. Step 2 of Algorithm 2 uses the ME module. For step 4a, favoring speed over area, the MM module was used instead of the ME module.

B. Lucas module

This module performs the Lucas probable prime test. Step 2 of Algorithm 3 uses the JSC module. The values of U and V in Algorithm 3 are the corresponding Lucas sequences for the values of P and Q chosen in step 2. The recurrence relations used in this module are:

\[ U_k = \frac{U_{k-1} + V_{k-1}}{2}, \qquad V_k = \frac{D \cdot U_{k-1} + V_{k-1}}{2} \qquad (2.1) \]

\[ U_{2k} = U_k \cdot V_k, \qquad V_{2k} = \frac{V_k^2 + D \cdot U_k^2}{2} \qquad (2.2) \]

These relations were derived considering the values of P, Q and D in step 2. In addition to checking the congruence condition in step 4, the following congruence relation can be checked: $V_{N+1} \equiv 2Q \bmod N$. This step strengthens the test. N is more likely to be prime if both congruence conditions hold.

C. Modular Multiplication (MM) module

Both the Miller-Rabin module and the Lucas module perform many modular multiplications, which are sped up by using the Montgomery modular multiplication algorithm [15]. This algorithm adds an extra factor of 2^{-k} to the output. To counter this, both inputs X and Y are converted to a Montgomery form prior to multiplication. This changes the factor in the output to 2^k. This factor is then removed by performing another round of the multiplication algorithm while setting one input as the output from the previous step and the other input as 1, introducing a 2^{-k} factor to counter the 2^k factor.

D. Modular Exponentiation (ME) module

This module uses Montgomery modular exponentiation [16] to compute the modular exponentiations required in the Miller-Rabin module. To favor speed over area, two instances of the Modular Multiplication module are run in parallel in this module.

E. Jacobi symbol calculator (JSC) module

Instead of calculating just the value of the Jacobi symbol, this module performs the entire step 2 of Algorithm 3 and returns the value of D. It uses a binary algorithm [17] that is easily implemented in hardware and efficient in both speed and area.

F. Random Number Generator (RNG) module

Instead of using a typical PRNG, the principles described in the (unpublished) proof by G. Purdy [18] were used to implement a modified approach. A 1024-bit odd random number is generated and passed to the top-level module. If this module finds the number to be composite, subsequent numbers are generated by the top-level module by simply adding 2 to the previous number, until a prime number is found. To generate the initial 1024-bit random number, the RNG in [19] was used.

G. Divisibility module

This checks the divisibility of a number by the first 1000 primes. It was implemented in software to save space.

H. Baillie-PSW module

This module is the top-level module. It reads a 1024-bit random number from the RAM and returns the next number in the 1024-bit range that passes the Baillie-PSW test. The Miller-Rabin and Lucas modules are instantiated here. Outputs from both main modules are analyzed in this module: if both declare the number prime, the module outputs 1 and the prime number is written to the RAM; if either declares the number composite, both modules are stopped, the input number is incremented by 2 and the main modules are restarted. Since the Divisibility module works on a local copy of the number being tested, this module also communicates with the Divisibility module to signal that the number was found to be composite; this compels the Divisibility module to update the value of its local copy, thus avoiding synchronization issues between both copies of the number. Similarly, if the divisibility test finds the number to be composite, this module is informed by the Divisibility module. This module then updates its local copy of the number and restarts the main modules to work on the updated value.

I. Testbench

The testbench acts as the IN/OUT interface that reads the 1024-bit initial number from the RNG module, passes it to the Baillie-PSW module by writing it into the RAM, and signals the top-level module to start. After the top-level module finds a prime number, the testbench reads the value of the prime number from RAM. The testbench also handles synchronization between the Baillie-PSW and Divisibility modules by using a file to write the output of the Baillie-PSW module and to read the output of the Divisibility module; if either module finds the number is composite, the other module reads this from the file and updates the value of its local copy.

IV. RESULTS

Our Verilog design was synthesized using Altera Quartus software [20] for a Cyclone IV GX device. At the time of our implementation, Quartus Prime 15.1 had parallel computation capability, but did not support the larger Cyclone III LS device. Since parallel computation was essential to synthesize our design in a reasonable time, we chose the relatively smaller Cyclone IV GX device. This led to two design decisions: 1) since the time taken by the Divisibility and RNG modules combined is not significant when compared to the time taken by the entire Baillie-PSW system, both modules were implemented in software to save space; they do not add additional delay to the execution time of the entire system; and 2) the Miller-Rabin and Lucas subsystems were implemented individually and the time taken by the top-level module was computed by comparing the time taken by these subsystems, since the top-level module did not fit on the smaller device (112% of available system resources were needed). If the large device were chosen, the software-based modules could be moved onto it to reduce any communication delays caused by our current implementation.

A. Miller-Rabin subsystem

The Modular Multiplication, Modular Exponentiation, and Miller-Rabin modules were synthesized. The synthesized design utilized 37% of the device resources. In simulation, using a 1024-bit random number, the subsystem found a prime number in 47.86 ms. Testing 20 other 1024-bit inputs yielded an average time of 48.3 ms for finding a prime.

B. Lucas subsystem

The Modular Multiplication, Jacobi symbol calculator, and Lucas modules were synthesized. Simulation results of testing 20 inputs yielded an average time to find a prime of 118 ms while utilizing 47% of the available resources on the FPGA.

C. Baillie-PSW system

The top-level module and all its submodules were simulated to implement this system as shown in Figure 1. To declare a number prime, both main modules must declare it prime. Therefore, the slower of the two subsystems dictates the overall system time; so, it can be concluded that, on average, this system takes 118 ms to declare a 1024-bit number to be prime.

Figure 1 Baillie-PSW hardware implementation: schematic

Since the clock speed of the software implementations is much higher than the FPGA clock speed, we normalize the speeds to compare hardware and software implementations, effectively representing execution times as number of clock cycles. Table I compares hardware and various software implementations. Figure 2 shows the comparison with normalized clock values. In clock cycles, the hardware implementation is almost an order of magnitude quicker than the fastest software implementation.

We also compare the two subsystems with prior hardware implementations. The Miller-Rabin subsystem is faster than the implementation in [21], even when the clocks are normalized. The Lucas subsystem is, however, slower than the implementation in [12]. A method to substantially improve the speed of the Lucas subsystem is discussed in [18].

V. CONCLUSIONS AND FUTURE WORK

We have discussed the various modules used in this work and the results obtained from their synthesis for a hardware implementation of the Baillie-PSW primality test. Since this was the first hardware implementation, the results were compared to prior software implementations. Our implementation performed well against all software implementations, when clocks were normalized. The flexibility of an FPGA is key and will help in adapting this work in future. To increase speed and reduce area, the design can be implemented in an ASIC. Also, using the REDC algorithm [22] for the modular multiplication module will improve speed and area.

Table I Execution times of various Baillie-PSW implementations

Baillie-PSW implementation          | FPGA        | Python      | Built-in Java function | Designed Java function
Execution time (ms)                 | 118         | 636.82      | 52                     | 180
Execution time (# of clock cycles)  | 1.18 x 10^7 | 1.46 x 10^9 | 1.14 x 10^8            | 3.96 x 10^8
Clock speed (MHz)                   | 100         | 2200        | 2200                   | 2200

Figure 2 Speed of the Baillie-PSW implementations with a normalized clock

REFERENCES

[1] W. Diffie and M. E. Hellman, "New Directions in Cryptography," IEEE Transactions on Information Theory, vol. 22, no. 6, pp. 644-654, November 1976.
[2] E. Barker, "Recommendation for Key Management - Part 1: General (Revision 4)," NIST Special Publication 800-57, January 2016.
[3] C. F. Kerry and P. D. Gallagher, "Digital Signature Standard," Federal Information Processing Standards Publication 186-4, July 2013.
[4] T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein, "Primality Testing," in Introduction to Algorithms, 3rd ed., The MIT Press, 2009, pp. 965-968.
[5] M. O. Rabin, "Probabilistic Algorithm for Testing Primality," Journal of Number Theory, vol. 12, no. 1, pp. 128-138, 1980.
[6] R. Solovay and V. Strassen, "A fast Monte-Carlo test for primality," SIAM Journal on Computing, vol. 6, no. 1, pp. 84-85, March 1977.
[7] C. Pomerance, J. L. Selfridge and S. S. Wagstaff Jr., "The pseudoprimes to 25 · 10^9," Mathematics of Computation, vol. 35, no. 151, pp. 1003-1026, July 1980.
[8] J. Hadamard, "Sur la distribution des zéros de la fonction ζ(s) et ses conséquences arithmétiques (On the distribution of zeros of the function ζ(s) and its arithmetic consequences)," Bulletin de la Société mathématique de France (Bulletin of the Mathematical Society of France), vol. 24, pp. 199-220, 1896.
[9] C. Vallée-Poussin, "Recherches analytiques de la théorie des nombres premiers (Analytical research on the theory of prime numbers)," Annales de la Société scientifique de Bruxelles (Annals of the Brussels Scientific Society), vol. 20, pp. 183-256, 1896.
[10] A. K. Lenstra, J. P. Hughes, M. Augier, J. W. Bos, T. Kleinjung and C. Wachter, "Ron was wrong, Whit is right," No. EPFL-REPORT-174943, IACR, 2012.
[11] A. Le Masle, W. Luk, J. Eldredge and K. Carver, "Parametric Encryption Hardware Design," Reconfigurable Computing: Architectures, Tools and Applications, pp. 68-79, 2010.
[12] A. Le Masle, W. Luk and C. A. Moritz, "Parametrized Hardware Architectures for the Lucas Primality Test," International Conference on Embedded Computer Systems (SAMOS), pp. 124-131, 2011.
[13] G. L. Miller, "Riemann's Hypothesis and Tests for Primality," Journal of Computer and System Sciences, vol. 13, no. 3, pp. 300-317, 1976.
[14] R. Baillie and S. S. Wagstaff Jr., "Lucas Pseudoprimes," Mathematics of Computation, vol. 35, no. 152, pp. 1391-1417, October 1980.
[15] J. Fry and M. Langhammer, "RSA & Public Key Cryptography in FPGAs," Altera Document, 2005.
[16] B. Schneier, "11.3 Number Theory," in Applied Cryptography: Protocols, Algorithms, and Source Code in C, 2nd ed., New York, J. Wiley & Sons, 1996, pp. 203-204.
[17] J. Shallit and J. Sorenson, "A Binary Algorithm for the Jacobi Symbol," ACM SIGSAM Bulletin, vol. 27, no. 1, pp. 4-11, 1993.
[18] Y. Kasarabada, A Verilog description and efficient hardware implementation of the Baillie-PSW primality test. MS thesis, University of Cincinnati: OhioLINK, 2016.
[19] M. E. O'Neill, "Download the PCG Library," 2015. [Online]. Available: http://www.pcg-random.org/. [Accessed 16 June 2016].
[20] Altera Corporation, "Design Software - Overview," Altera Corporation, [Online]. Available: https://www.altera.com/products/design-software/overview.html. [Accessed 6 July 2016].
[21] R. C. Cheung, A. Brown, W. Luk and P. Y. Cheung, "A Scalable Hardware Architecture for Prime Number Validation," International Conference on Field-Programmable Technology, pp. 177-184, December 2004.
[22] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519-521, 1985.
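
As a software illustration of Algorithms 1 to 3 and of the add-2 search performed by the top-level module, the following Python sketch may help in following the data flow. It is not the Verilog design described in this paper and makes several assumptions: all function names are hypothetical, the trial-division list is a short stand-in for the first 1000 primes used by the Divisibility module, Python's built-in `pow` replaces the Montgomery-based MM and ME modules, and only the U congruence of step 4 of Algorithm 3 is checked.

```python
from math import isqrt

# Assumption: a short list standing in for the first 1000 primes.
SMALL_PRIMES = (3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47)

def jacobi(a, n):
    """Jacobi symbol J(a, n) for odd n > 0, computed with a binary algorithm."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:              # strip factors of 2 from a
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                    # quadratic reciprocity
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

def miller_rabin_base2(n):
    """Algorithm 2 with a = 2, for odd n > 3."""
    s, m = 0, n - 1
    while m % 2 == 0:                  # n - 1 = 2^s * m with m odd
        s += 1
        m //= 2
    x = pow(2, m, n)
    if x in (1, n - 1):
        return True
    for _ in range(s - 1):
        x = x * x % n
        if x == 1:
            return False
        if x == n - 1:
            return True
    return False

def half(x, n):
    """Return x / 2 modulo the odd number n."""
    x %= n
    return (x + n) // 2 if x % 2 else x // 2

def lucas_probable_prime(n):
    """Algorithm 3 with P = 1, Q = (1 - D) / 4, for odd n > 3."""
    if isqrt(n) ** 2 == n:             # a perfect square never yields J(D, n) = -1
        return False
    d = 5
    while (j := jacobi(d, n)) != -1:   # step 2: D in 5, -7, 9, -11, ...
        if j == 0 and abs(d) % n != 0:
            return False               # D shares a proper factor with n
        d = -(d + 2) if d > 0 else 2 - d
    u, v = 1, 1                        # U_1 = 1, V_1 = P = 1
    for bit in bin(n + 1)[3:]:         # bits of n + 1 after the leading one
        u, v = u * v % n, half(v * v + d * u * u, n)      # relations (2.2)
        if bit == "1":
            u, v = half(u + v, n), half(d * u + v, n)     # relations (2.1)
    return u == 0                      # step 4: U_{N+1} = 0 mod N

def is_baillie_psw_probable_prime(n):
    """Algorithm 1: trial division, then base-2 Miller-Rabin, then Lucas."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for p in SMALL_PRIMES:
        if n % p == 0:
            return n == p
    return miller_rabin_base2(n) and lucas_probable_prime(n)

def next_probable_prime(n):
    """Mimic the top-level search: start at an odd number and add 2 until one passes."""
    n |= 1
    while not is_baillie_psw_probable_prime(n):
        n += 2
    return n
```

For a 1024-bit search in this sketch, one could call `next_probable_prime(secrets.randbits(1024) | (1 << 1023))` after importing the standard `secrets` module; this use of `secrets` is likewise an assumption and differs from the RNG in [19].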