On the Efficiency of Polynomial Multiplication for Lattice-Based

On the Efficiency of Polynomial Multiplication for Lattice-Based Cryptography on GPUs using CUDA Sedat Akleylek1, 2, Ozgur¨ Da˘gdelen3, and Zaliha YüceTok4 1Department of Computer Engineering, Ondokuz Mayıs University, Samsun, Turkey 2Cryptography and Computer Algebra Group, TU Darmstadt, Germany 3BridgingIT GmbH, Germany 4Institute of Applied Mathematics, METU, Ankara, Turkey [email protected],[email protected],[email protected] Abstract. Polynomial multiplication is the most time-consuming part of cryptographic schemes whose security is based on ideal lattices. Thus, any efficiency improvement on this building block has great impact on the practicability of lattice-based cryptography. In this work, we inves- tigate several algorithms for polynomial multiplication on a graphical processing unit (GPU), and implement them in both serial and parallel way on the GPU using the compute unified device architecture (CUDA) n platform. Moreover, we focus on the quotient ring (Z=pZ)[x]=(x + 1), where p is a prime number and n is a power of 2. We stress that this ring constitutes the most common setting in lattice-based cryptography for efficiency reasons. As an application we integrate the different implementations of polynomial multiplications into a lattice-based signature scheme proposed by Güneysuet al. (CHES 2012) and identify which algorithm is the prefer- able choice with respect to the ring of degree n. Keywords: lattice-based cryptography, GPU implementation, CUDA platform, polynomial multiplication, fast Fourier transform, cuFFT, NTT, Schönhage-Strassen 1 Introduction Lattice-based cryptography has gained a lot of attention since the breakthrough work by Gentry [12] who constructed the very first fully homomorphic encryption scheme, answering a problem posed in 1978 and thought by many to be impos- sible to resolve. The security of lattice-based cryptographic schemes depends on the hardness of lattice problems which are believed to be intractable even against quantum attacks, as opposed to classical cryptographic schemes which are based on the hardness of factoring or computing discrete logarithms. Indeed, Shor [28] has shown that quantum computers can solve the latter computational problems in polynomial time, and thus making it plain to the community that alternatives are necessary once efficient quantum computers are omnipresent. 2 Additional reasons for the current popularity of lattice-based cryptography lies in their asymptotic efficiency. While classical cryptographic schemes deal with very large finite fields and/or require to perform expensive exponentia- tions, the efficiency of lattice-based schemes is dominated mainly by \cheap" matrix-vector multiplications. Moreover, even more efficient instantiations of lattice-based cryptographic schemes are derived if they are based on ideal lattices corresponding to ideals in rings of the form Z[x]=(f), where f is a degree n irreducible polynomial (usually f = xn + 1). The most common ring used in lattice-based cryptography (e.g., [7,14,19]) is the quotient ring (Z=pZ)[x]=(xn+1) for which we know quasi-logarithmic multiplication algorithms, namely via the Fast Fourier Transform (FFT). We stress that polynomial multiplication still constitutes the bottleneck of lattice-based cryptography with respect to its performance. Hence, efficient implementations of the multiplication operation impact positively and directly the performance of lattice-based cryptosystems. In the past there has been a great interest for the implementation of lattice- based cryptography on FPGAs [5, 13{16, 23{25, 27]. The main inducement lies here in running the arithmetic operations in parallel. Pöppelmann et al. [22] showed how the arithmetic involved in lattice-based cryptography can be efficiently implemented on FPGAs. Specifically, they gave the details on the implementation of how to parallelize the Number Theoretic Transform (NTT) algorithm for the precomputed values. Due to the memory requirements of finding those values, the precomputation step was not performed on the FPGA but were given to the system directly. In [5], Aysu et al. presented a low-cost and area efficient hardware architecture for polynomial multiplication via NTT with applications in lattice-based cryptographic schemes. In contrast to [22], Aysu et al. computed all values for precomputation also on the FPGA. Interestingly, the literature offers far less studies on the efficiency of polynomial multiplication on graphical processing units (GPU) in the quotient ring (Z=pZ)[x]=(xn +1). While GPUs are already a tool to cryptanalyze lattice problems [17,18,26], only few works use GPUs to accelerate lattice-based cryptosystems [6, 30]. To the best of our knowledge, Emeliyanenko [10], Akleylek and Tok [1{3] are the only ones dealing with the efficiency of algorithms for polynomial multiplication on GPUs from which only the latter emphasizes on settings relevant for lattice-based cryptography. Our Contribution. The contribution of this paper is three-fold: We propose mod- ified FFT algorithm for sparse polynomial multiplication. With the proposed method we improve the complexity results almost 34% for the sparse polynomial multiplication in the lattice-based signature scheme by Güneysuet al. [14]. Second, we provide the first efficient implementation of the Schönhage-Strassen polynomial multiplication algorithm on GPU using the CUDA platform and discuss its efficiency in comparison with well-known polynomial multiplication algorithms, namely iterative/parallel NTT and cuFFT. Our implementations are designed to the given GPU, namely the NVIDIA Geforce GT 555M GPU. Last, as an application we have implemented the lattice-based signature scheme by Güneysuet al. [14] on GPU and measured running times of the signature and 3 verification algorithm while using different polynomial multiplication methods. Originally, this signature scheme was implemented on FPGAs [14]. Roadmap. In Section 2 we recall well-known algorithms for polynomial multiplication, show their efficient implementations in GPU and modify FFT algorithm to be used in sparse polynomial multiplication efficiently. In Section 3 we recall the lattice-based signature scheme by Güneysuet al.. We show the experimental results of our implementations in Section 4. We also show the performance of the selected lattice-based signature scheme with respect to the chosen algorithm for polynomial multiplication. In Section 5 we impose further discussion on our observation. We conclude in Section 6. 2 Multiplication over the Ring (Z=pZ)[x]=(xn + 1) In this section, we give an overview of selected algorithms for polynomial multiplication, namely NTT in serial and parallel type, and CUDA-based FFT (cuFFT). We also discuss improved version of FFT for sparse polynomial multiplication. We focus on the arithmetic over the quotient ring (Z=pZ)[x]=(xn + 1) whose importance in lattice-based cryptography that we emphasized in the previ- ous section. Recall that polynomial multiplication by using FFT mainly consists of three steps: { conversion of coefficients of polynomials to Fourier domain with FFT algorithm having O(n log(n)) complexity, { coefficient wise multiplication of elements in Fourier domain having O(n) complexity, { converting the result into the integer domain with inverse FFT algorithm having O(n log(n)) complexity. 2.1 Number Theoretic Transform The NTT algorithm was proposed in [21] to avoid rounding errors in FFT. NTT is a discrete Fourier transform defined over a ring or a finite field and is used to multiply two integers without requiring arithmetic operations on complex numbers. The multiplication complexity is quasi-linear (i.e., O(n log n)). The NTT algorithm can be applied if both of the following requirements are fulfilled: { The degree n of the quotient ring (Z=pZ)[x]=(xn + 1) must divide (p − 1). { 9w 2 (Z=pZ) such that wn ≡ 1 (mod p) and for each i < n it holds wn 6= 1 (mod p). In order to multiply two ring elements via NTT, those elements have to transformed into NTT form. Let w be the primitive n−th root of unity, and a(x) = 4 Pn−1 i i=0 aix 2 (Z=pZ)[x] be given. The ring element a is transformed into the NTT form by algorithm NTTw(a) which is defined as follows: n−1 X ij NTTw(a) := (A0;:::;An−1) where Ai = ajw (mod p); j=0 −1 for i 2 f0; 1; : : : ; n−1g. The inverse transform NTTw (A) for A = (A0;:::;An−1) is given as: n−1 −1 −1 X −ij NTTw (A) := (a0; : : : ; an−1) where ai = n Ajw (mod p); j=0 with i 2 f0; 1; : : : ; n−1g. By using the Convolution theorem, arbitrary polynomials can be multiplied and then reduced according to chosen reduction polynomial. However, appending n 0's to the inputs doubles the size of the transform. To use NTT for the multiplication of two elements in (Z=pZ)[x]=(xn + 1) the condition is that p ≡ 1 (mod 2n) due to wrapped convolution approach. In Algorithm 1 the parallel version of the iterative NTT method is given. To make it efficient, we make use of \for" loops in the algorithm. Parallelization is achieved by de- termining the required number of threads. This algorithm needs data transfer between CPU and GPU. This may cause a delay and potentially defined the bottleneck in the performance. In Algorithm 2 we show how the parallelism is performed. Algorithm 1 Parallelized iterative NTT method (CPU and GPU side) Pn−1 i Pn−1 i n Input: a(x) = i=0 aix , b(x) = i=0 bix 2 (Z=pZ)[x]=(x + 1) Pn−1 i Output: c(x) = a(x) · b(x) = i=0 cix 1: f = BitReverseCopy(a) 2: f = SumDiff(f) 3: rn = Element of order(n) 4: if reverse NTT then −1 5: rn = rn (mod p)

Load more