1 Lattice-Based Signatures: Optimization and Implementation on Reconfigurable Hardware

Tim Güneysu, Vadim Lyubashevsky, and Thomas Pöppelmann

Abstract—Nearly all of the currently used signature schemes, such as RSA or DSA, are based either on the factoring assumption or the presumed intractability of the discrete logarithm problem. As a consequence, the appearance of quantum computers or algorithmic advances on these problems may lead to the unpleasant situation that a large number of today's schemes will most likely need to be replaced with more secure alternatives. In this work we present such an alternative – an efficient signature scheme whose security is derived from the hardness of lattice problems. It is based on recent theoretical advances in lattice-based cryptography and is highly optimized for practicability and use in embedded systems. The public and secret keys are roughly 1.5 kB and 0.3 kB long, while the signature size is approximately 1.1 kB for a security level of around 80 bits. We provide implementation results on reconfigurable hardware (Spartan/Virtex-6) and demonstrate that the scheme is scalable, has low area consumption, and even outperforms classical schemes.

Index Terms—Public key cryptosystems, reconfigurable hardware, signature scheme, ideal lattices, FPGA.

1 INTRODUCTION

It has been known, ever since Shor's seminal result [58], that all asymmetric cryptosystems based on factoring and the (elliptic curve) discrete logarithm problem can be broken in polynomial time on a quantum computer. In recent years, there has been a big financial push, by both governments and private enterprises (c.f. [39]), to construct a fully-functioning quantum computer which would have the capability to immediately render virtually all currently-used public-key cryptography obsolete. In addition, recent breakthroughs in classical cryptanalysis [7], [37] have cast further doubts on the hardness of the discrete log problem by demonstrating almost-polynomial time algorithms for the problem in small-characteristic fields. These imminent threats have motivated the investigation of other fundamental problems upon which asymmetric cryptography can be based, and the proposal of several alternative cryptographic constructions as potential substitutes.

A promising alternative to number-theoretic constructions are lattice-based cryptosystems. They possess security proofs based on well-studied problems that currently cannot be solved by quantum algorithms. For a long time, however, lattice constructions have only been considered secure for inefficiently large parameters that are well beyond practicability¹ or were, like GGH [27] and NTRUSign [33], broken due to flaws in the ad-hoc design approaches [19], [51]. This has changed since the introduction of cyclic and ideal lattices [46] and related computationally hard problems like RING-SIS [42], [44], [52] and RING-LWE [45]. These problems have enabled the construction of a great variety of theoretically elegant and efficient cryptographic primitives.

In this work we try to further close the gap between the advances in theoretical lattice-based cryptography and real-world implementation by constructing a digital signature scheme based on ideal lattices that considers the constraints of embedded systems. For efficiency we use a variant of the Ring-LWE problem which is connected to hard problems on ideal lattices and which may be of independent interest. Our optimizations for practicability result in a scheme with moderate signature and key sizes as well as performance suitable for embedded and hardware systems. We point out that our instantiations do not use parameters that imply any meaningful security guarantees via the worst-case to average-case reductions of [42], [45], [49]. Like other practical constructions (e.g. [18], [44]), our schemes are based on the hardness of average-case lattice problems.

Related Work. Digital signatures are arguably the most used public-key cryptographic primitive in practical applications, and a lot of effort has gone into trying to construct such schemes from lattice assumptions. Due to the success of the NTRU encryption scheme, it was natural to try to design a signature scheme based on the same principles. Unlike the encryption scheme, however, the proposed NTRU signature scheme (and its subsequent modifications) [33], [35] has been completely broken [19], [51]. Provably-secure digital signatures were finally constructed in 2008, by Gentry, Peikert, and Vaikuntanathan [26], and, using different techniques, by Lyubashevsky and Micciancio [43]. The scheme in [26] was rather inefficient in practice, with outputs and keys being megabytes long, while the scheme in [43] was only a one-time signature that required the use of Merkle trees to become a full signature scheme. The work of [43] was later extended in [18], [40], [41], which finally gave a construction of a full-fledged signature scheme whose keys and outputs are currently on the order of 5000 bits each, for a 128-bit security level². The work of [26] was also extended by Micciancio and Peikert [47], where the size of the signatures and keys is roughly 100,000 bits. A software implementation of the improved signature scheme has recently been presented by Bansarkhani and Buchmann [6]. A software implementation of the signature scheme described in this paper can be found in [31].

A different way to construct signature schemes without known attacks by quantum computers is to use hard decoding problems. In 2001 Courtois, Finiasz, and Sendrier [16] proposed the code-based CFS signature scheme, which has been improved by Finiasz in [23] (now called Parallel-CFS). However, implementations in hardware and software [8], [9] are still very slow, yield extremely large key sizes, and are thus unsuitable for embedded devices. Another option are signature schemes based on Multivariate Quadratic (MQ) equations, which are fast in hardware [5], [10] and software [14] but suffer from large public key sizes. Hash-based signatures are far more practical, and implementations on smart cards and reconfigurable hardware are possible as shown in [22], [36]. Their biggest disadvantage is the requirement to keep a state and large signature sizes. A table comparing software performance and key sizes of selected signature schemes can be found in [31].

While the theoretical part of this work (i.e., the description and optimization of the signature scheme) is the same as in the conference version of the current paper [30], we provide a vastly improved implementation. In particular, the usage of FFT techniques for polynomial multiplication and parallelization allows much faster signing and verification with a reduced area footprint. There is also the newer BLISS signature scheme, given in [18], which uses some ideas from our work but requires high-precision discrete Gaussian sampling. Works like [11], [18], [20], [57] dealing with this problem have appeared recently. However, it is currently still unknown how to realize sampling for parameter sets as proposed in [18]³ efficiently in an embedded system with low memory requirements. As a consequence, our scheme is still very well suited for hardware implementations. Moreover, the structure of BLISS is very similar to our scheme, and implementation issues like fast and flexible polynomial multiplication in hardware are already dealt with in our work.

Our Contribution. The main contribution of this work is the implementation of a digital signature scheme based on the findings in [40], [41] which is optimized for embedded systems. In addition, we propose an improvement to the above-mentioned scheme which preserves the security proof, while lowering the signature size by approximately a factor of two (this improvement was subsequently used in the improved scheme of [18]). We demonstrate the practicability of our scheme by evaluating an implementation on reconfigurable hardware. We provide one hardware unit which supports only verification, one larger unit for signing and a combined unit for signing as well as verification. The implementation is fully functional, contains a Trivium-based PRNG as well as the lightweight hash QUARK and makes use of the extremely efficient Number Theoretic Transform (NTT). For example, on the low-cost Xilinx Spartan-6 we are 1.5 times faster and use only half of the resources of the optimized RSA implementation of Suzuki [63]. With 2385 signatures and 10899 signature verifications per second, we can satisfy even high-speed demands with a low area footprint using a Virtex-6 device.

Outline. We give a short overview of our hardness assumption in Section 2 and then introduce the highly efficient and practical signature scheme in Section 3. Based on this description, we present our implementation and the hardware architecture of the signing and signature verification engine in Section 4 and analyze its performance on different FPGAs in Section 5. In Section 6 we summarize our contribution and present an outlook for future work.

• Tim Güneysu and Thomas Pöppelmann are with the Horst Görtz Institute for IT-Security, Ruhr University Bochum, Universitaetsstr. 150, 44780 Bochum, Germany. E-mail: {Tim.Gueneysu, Thomas.Poeppelmann}@rub.de.
• Vadim Lyubashevsky is with INRIA and ENS in Paris, France. E-mail: [email protected].

1. One notable exception is the NTRU public-key encryption scheme [34], which has essentially remained unbroken since its introduction.

2 PRELIMINARIES

2.1 Notation

Throughout the paper, we will assume that n is an integer that is a power of 2, p is a prime number congruent to 1 modulo 2n, and R^n_p is the ring Z_p[x]/(x^n + 1). Elements in R^n_p can be represented by polynomials of degree n−1 with coefficients in the range [−(p − 1)/2, (p − 1)/2], and we will write R^n_k to be the subset of the ring R^n_p that consists of all polynomials with coefficients in the range [−k, k]. For a set S, we write s ←$ S to indicate that s is chosen uniformly at random from S.

2.2 Hardness Assumption

In a particular version of the RING-SIS problem, one is given an ordered pair of polynomials (a, t) ∈ R^n_p × R^n_p where a is chosen uniformly from R^n_p and t = as1 + s2, where s1 and s2 are chosen uniformly from R^n_k, and is asked to find an ordered pair (s1', s2') such that as1' + s2' = t. It can be shown that when k > √p, the solution is not unique, and finding any one of them, for √p < k ≪ p, was proven in [42], [52] to be as hard as solving worst-case lattice problems in ideal lattices.

2. To achieve a signature length of 5000 bits, the work of [18] used the main theoretical improvement found in the preliminary, conference version of the current paper [30].
3. Note that the required standard deviation for lattice-based encryption is much smaller compared to signatures, which enabled the extremely compact implementation of [57].

On the other hand, when k < √p, it can be shown that the only solution is (s1, s2) with high probability, and there is no classical reduction known from worst-case lattice problems to finding this solution. In fact, this latter problem is a particular instance of the RING-LWE problem. It was recently shown in [45] that if one chooses the si from a slightly different distribution (i.e., a Gaussian distribution instead of a uniform one), then solving the RING-LWE problem (i.e., recovering the si when given (a, t)) is as hard as solving worst-case lattice problems using a quantum algorithm. Furthermore, in that same work it was shown that solving the decision version of RING-LWE, that is, distinguishing ordered pairs (a, as1 + s2) from uniformly random ones in R^n_p × R^n_p, is still as hard as solving worst-case lattice problems.

In this paper, we implement our signature scheme based on the presumed hardness of the decision RING-LWE problem with particularly "aggressive" parameters. We define the DCK_{p,n} problem (Decisional Compact Knapsack problem) to be the problem of distinguishing between the uniform distribution over R^n_p × R^n_p and the distribution (a, as1 + s2), where a is uniformly random in R^n_p and the si are uniformly random in R^n_1. As of now, there are no known algorithms that take advantage of the fact that the distribution of the si is uniform (i.e., not Gaussian) and consists of only −1/0/1 coefficients for the parameter set we use to construct our signature scheme.⁴ So it is very reasonable to conjecture that this problem is still hard.⁵ In fact, this is essentially the assumption that the NTRU encryption scheme is based on. Due to lack of space, we direct the interested reader to Section 3 of the full version of [41] for a more in-depth discussion of the hardness of the different variants of the SIS and LWE problems.

4. For readers familiar with the Arora-Ge algorithm for solving LWE with small noise [2], we would like to point out that it does not apply to our problem because this algorithm requires polynomially-many samples of the form (ai, ais + ei), whereas in our problem, only one such sample is given.
5. Recently Micciancio and Peikert showed that by imposing a limit on the number of samples, the LWE problem can still be hard with smaller noise [48].

2.3 Cryptographic Hash Function H with Range D^n_32

Our signature scheme uses a hash function, and it is quite important for us that the output of this function is of a particular form. The range of this function, D^n_32, for n ≥ 512 consists of all polynomials of degree n − 1 that have all zero coefficients except for at most 32 coefficients that are ±1.

We denote by H the hash function that first maps {0, 1}* to a 160-bit string and then injectively maps the resulting 160-bit string r to D^n_32 via an efficient procedure we now describe. To map a 160-bit string into the range D^n_32 for n ≥ 512, split c into 32 partitions. For every partition of c we look at 5 bits of r at a time and determine the non-zero coefficient in that partition as follows: let r1r2r3r4r5 be the five bits we are currently looking at. If r1 is 0, then put a −1 in position number r2r3r4r5 (where we read the 4-digit string as a number between 0 and 15) of the 16-digit partition. If r1 is 1, then put a 1 in position r2r3r4r5. This converts a 160-bit string into a 512-digit string with at most 32 ±1's.⁶ We then convert the 512-digit string into a polynomial of degree at least 512 in the natural way by assigning the i-th coefficient of the polynomial the i-th digit of the string. If the polynomial is of degree greater than 512, then all of its higher-order terms will be 0.

6. There is a more "compact" way to do it (see for example [24] for an algorithm that can convert a 160-bit string into a 512-digit one with at most 24 ±1 coefficients), but the resulting transformation algorithm is quadratic rather than linear.

3 THE SIGNATURE SCHEME

In this section, we will present the lattice-based signature scheme whose hardware implementation we describe in Section 4. This scheme is a combination of the schemes from [40] and [41] as well as an additional optimization that allows us to reduce the signature length by almost a factor of two. In [40], Lyubashevsky constructed a lattice-based signature scheme based on the hardness of the RING-SIS problem, and this scheme was later improved in two ways [41]. The first improvement results in signatures that are asymptotically shorter, but unfortunately involves a somewhat more complicated rejection sampling algorithm during the signing procedure, involving sampling from the normal distribution and computing quotients to a very high precision, which would not be very well supported in hardware. We do not know whether the actual savings achieved in the signature length would justify the major slowdown incurred, and we leave the possibility of efficiently implementing this rejection sampling algorithm to future work. The second improvement from [41], which we do use, shows how the size of the keys and the signature can be made significantly smaller by changing the assumption from RING-SIS to a version of RING-LWE.

3.1 The Basic Signature Scheme

For ease of exposition, we first present the basic combination scheme of [40] and [41] in Figure 1, and sketch its security proof. Full security proofs are available in [40] and [41]. We then present our optimization in Sections 3.2 and 3.3.

The secret keys are random polynomials s1, s2 ←$ R^n_1 and the public key is (a, t), where a ←$ R^n_p and t ← as1 + s2. The parameter k in our scheme, which first appears in line 1 of the signing algorithm, controls the trade-off between the security and the runtime of our scheme. The smaller we take k, the more secure the scheme becomes (and the shorter the signatures get), but the time to sign will increase. We explain this as well as the choice of parameters below.

To sign a message µ, we pick two "masking" polynomials y1, y2 ←$ R^n_k and compute c ← H(ay1 + y2, µ)
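Since the hash output c ∈ D^n_32 plays a central role in the signing equation above, here is a short Python sketch of the Section 2.3 encoding from a 160-bit string into such a polynomial. The function name, the list representation of polynomials, and the use of SHA-1 as the (otherwise unspecified) 160-bit first stage of H are our own illustrative assumptions, not part of the reference design.

import hashlib

def encode_to_D32(r_bits, n=512):
    """Map a 160-bit string (list of 0/1) to a degree-(n-1) polynomial with at
    most 32 coefficients in {-1, +1}, following the Section 2.3 procedure.
    Assumes n >= 512; positions 512..n-1 stay zero."""
    assert len(r_bits) == 160 and n >= 512
    c = [0] * n
    # 32 partitions of 16 positions each; 5 bits of r are consumed per partition
    for part in range(32):
        r1, r2, r3, r4, r5 = r_bits[5 * part:5 * part + 5]
        pos = (r2 << 3) | (r3 << 2) | (r4 << 1) | r5      # number between 0 and 15
        c[16 * part + pos] = 1 if r1 == 1 else -1
    return c

# toy usage: derive the 160 input bits via SHA-1 (a stand-in for the first
# stage of H, which is not fixed in this section)
digest = hashlib.sha1(b"example message").digest()        # 20 bytes = 160 bits
bits = [(byte >> (7 - i)) & 1 for byte in digest for i in range(8)]
c = encode_to_D32(bits)
assert sum(abs(x) for x in c) <= 32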

Signing Key: s1, s2 ←$ R^n_1
Verification Key: a ←$ R^n_p, t ← as1 + s2
Cryptographic Hash Function: H : {0, 1}* → D^n_32

Sign(µ, a, s1, s2)
 1: y1, y2 ←$ R^n_k
 2: c ← H(ay1 + y2, µ)
 3: z1 ← s1c + y1,  z2 ← s2c + y2
 4: if z1 or z2 ∉ R^n_{k−32}, then goto step 1
 5: output (z1, z2, c)

Verify(µ, z1, z2, c, a, t)
 1: Accept iff z1, z2 ∈ R^n_{k−32} and c = H(az1 + z2 − tc, µ)

Fig. 1: The Basic Signature Scheme
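A minimal, insecure Python sketch of the scheme in Figure 1 may help to see the rejection-sampling loop and the verification identity az1 + z2 − tc = ay1 + y2 in action. It uses the Set I parameters from Section 3.4, plain schoolbook arithmetic instead of the NTT, and SHA-1 as a stand-in for the first stage of H; all helper names are our own assumptions, and the code is an illustration only, not a reference implementation.

import hashlib, random

n, p, k = 512, 8383489, 2**14          # parameter set I

def center(x):                          # canonical representative in [-(p-1)/2, (p-1)/2]
    x %= p
    return x - p if x > (p - 1) // 2 else x

def poly_mul(f, g):                     # schoolbook multiplication in Z_p[x]/(x^n + 1)
    res = [0] * n
    for i, fi in enumerate(f):
        if fi:
            for j, gj in enumerate(g):
                if i + j < n:
                    res[i + j] += fi * gj
                else:
                    res[i + j - n] -= fi * gj    # wrap with x^n = -1
    return [center(x) for x in res]

def poly_add(f, g):
    return [center(a + b) for a, b in zip(f, g)]

def H(v, mu):                           # hash (v, mu) into D_32^n (Section 2.3 encoding)
    data = b"".join(x.to_bytes(4, "big", signed=True) for x in v) + mu
    bits = [(byte >> (7 - i)) & 1
            for byte in hashlib.sha1(data).digest() for i in range(8)]
    c = [0] * n
    for part in range(32):
        r1, r2, r3, r4, r5 = bits[5 * part:5 * part + 5]
        c[16 * part + (r2 << 3 | r3 << 2 | r4 << 1 | r5)] = 1 if r1 else -1
    return c

def keygen():
    s1 = [random.randint(-1, 1) for _ in range(n)]
    s2 = [random.randint(-1, 1) for _ in range(n)]
    a = [random.randint(-(p - 1) // 2, (p - 1) // 2) for _ in range(n)]
    return (s1, s2), (a, poly_add(poly_mul(a, s1), s2))     # t = a*s1 + s2

def sign(mu, a, s1, s2):
    while True:                                             # rejection-sampling loop
        y1 = [random.randint(-k, k) for _ in range(n)]
        y2 = [random.randint(-k, k) for _ in range(n)]
        c = H(poly_add(poly_mul(a, y1), y2), mu)
        z1 = [u + v for u, v in zip(poly_mul(c, s1), y1)]   # small coeffs, no mod p needed
        z2 = [u + v for u, v in zip(poly_mul(c, s2), y2)]
        if all(abs(x) <= k - 32 for x in z1 + z2):          # step 4 of Figure 1
            return z1, z2, c

def verify(mu, z1, z2, c, a, t):
    if not all(abs(x) <= k - 32 for x in z1 + z2):
        return False
    u = poly_add(poly_add(poly_mul(a, z1), z2),
                 [-x for x in poly_mul(c, t)])              # a*z1 + z2 - t*c
    return c == H(u, mu)

(s1, s2), (a, t) = keygen()
sig = sign(b"test message", a, s1, s2)
assert verify(b"test message", *sig, a, t)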

and the potential signature (z1, z2, c) where z1 ← s1c + DCKp,n problem. The reason that these polynomials are 7 y1, z2 ← s2c + y2 . But before sending the signature, we non-zero (with a non-negligible probability) is that if we must perform a rejection-sampling step where we only pick the secret keys (s0 , s0 ) in our security reduction n n 1 2 p p 0 √ send if z1, z2 are both in Rk−32. This part is crucial for from Rk0 where k ≈ p, then with high probability 00 00 0 0 security and it is also where the size of k matters. If k there is another set of keys (s1 , s2 ) 6= (s1, s2) in that pn 0 0 00 00 is too small, then z1, z2 will almost never be in Rk−32, same domain such that as1 + s2 = as1 + s2 and so the whereas if its too big, it will be easy for the adversary adversary cannot know which ordered pair we know. 8 0 0 to forge messages . To verify the signature (z1, z2, c), Therefore, if it so happens that for the key (s1, s2) we pn 0 = u = z −cs0 −z0 +c0s0 z −cs00 −z0 +c0s00 the verifier simply checks that z1, z2 ∈ Rk−32 and that have 1 1 1 1 1, then 1 1 1 1 9 c = H(az1 + z2 − tc, µ). is necessarily non-zero. If they are both zero, then we 0 0 00 Our security proof follows that in [41] except that it have (c−c )(s1 −s1 ) = 0. Because the coefficients of both 0 0 00 uses the rejection sampling algorithm from [40]. Given a (c − c ) and (s1 − s1 ) are small, the preceding equality pn n random polynomial a ∈ R , we pick two polynomials holds not only over the ring Zp[x]/(x +1), but also over n $ p 0 √ [x]/(xn + 1). Since the latter is an integral domain and s1, s2 ← Rk0 for a sufficiently large k ≈ p and return Z pn 0 0 c0 6= c (c − c0)(s0 − s00) = 0 (a ∈ R , t = as +s ) as the public key. By the DCKp,n , the equality 1 1 implies that 1 2 0 00 assumption (and a standard hybrid argument), this looks s1 = s1 , which is a contradiction. like a valid public key (i.e., the adversary cannot tell We now explain the trick that we use to lower the size pn pn of the signature as returned by the optimized scheme that the si are chosen from Rk0 rather than from R1 ). When the adversary gives us signature queries, we presented in Section 3.3. Notice that if Equation (2) does az + z − appropriately program the hash function outputs so that not hold exactly, but only approximately (i.e., 1 2 tc − (az0 + z0 − tc0) = w w our signatures are valid even though we do not know 1 2 for some small polynomial ), u , u au +u = a valid secret key (in fact, a valid secret key does not then we can still obtain small 1 2 such that 1 2 0 u even exist). When the adversary successfully forges a , except that the value of 2 will be larger by at most w az + z − tc ≈ az0 + z0 − tc0 new signature, we then use the “forking lemma” [54] to the norm of . Thus if 1 2 1 2 , we will still be able to produce small u1, u2 such that produce two signatures of the message µ, (z1, z2, c) and 0 0 0 au1 + u2 = 0. This could make us consider only sending (z1, z2, c ), such that (z1, c) as a signature rather than (z1, z2, c), and the proof 0 0 0 H(az1 + z2 − tc, µ) = H(az1 + z2 − tc , µ), (1) will go through fine. The problem with this approach is that the verification algorithm will no longer work, which implies that because even though az1 +z2 −tc ≈ az1 −tc, the output 0 0 0 az1 + z2 − tc = az1 + z2 − tc (2) of the hash function H will be different. 
A way to go around the problem is to only evaluate H on the “high and because we know that t = as1 + s2, we can obtain order bits” of the coefficients comprising the polynomial 0 0 0 0 az1+z2−tc which we could hope to be the same as those a(z1 − cs1 − z1 + c s1) + (z2 − cs2 − z2 + c s2) = 0. of the polynomial az1 − tc. But in practice, too many 0 Because zi, si, c, and c have small coefficients, we bits would be different (because of the carries caused by have found two polynomials u1, u2 with small coeffi- z2) for this to be a useful trick. What we do instead is 0 0 cients such that au1 + u2 = 0. By [41, Lemma 3.7], send (z1, z2, c) as the signature where z2 only tells us the knowing such small non-zero ui allows us to solve the carries that z2 would have created in the high order bits 0 in the sum of az1 +z2 −tc, and so z2 can be represented 7. We would like to draw the reader’s attention to the fact that in with much fewer bits than z . In the next subsection, step 3, reduction modulo p is not performed since all the polynomials 2 involved have small coefficients. we explain exactly what we mean by “high-order bits” pn 8. The exact probability that z1, z2 will be in Rk−32 is 0 00 2n 9. Here we’re supposing that s1 6= s1 . The same argument holds if  64  0 00 1 − 2k+1 . s2 6= s2 . 5

0 pn and give an algorithm that produces a z2 from z2, and Lemma 3.1 states that given two vectors y, z ∈ R then provide an optimized version of the scheme in this where the coefficients of z are small, we can replace z section that uses the compression idea. by a much more compressed vector z0 while keeping the higher order bits of y + z and y + z0 the same. 3.2 The Compression Algorithm Lemma 3.1. There exists a linear-time algorithm Compress(y, z, p, k) that for any p, n, k where 2nk/p > 1 n n takes as inputs y ←$ Rp , z ∈ Rp , and with probability at Compress(y, z, p, k) k n pn 0 p 1: uncompressed ← 0 least .98 (over the choices of y ∈ R ), outputs a z ∈ Rk 2: for i=1 to n do such that p−1 (1) 0 (1) 3: if |y[i]| > 2 − k then 1) (y + z) = (y + z ) 0 0 6kn 4: z [i] ← z[i] 2) z can be represented with only 2n+dlog(2k +1)e· p 5: uncompressed ← uncompressed + 1 bits. 6: else (1) (0) Proof. For (1) we need to show that for all i, if we write 7: write y[i] = y[i] (2k + 1) + y[i] where −k ≤ (1) (0) (0) y[i] + z[i] mod p = a[i] (2k + 1) + a[i] where −k ≤ y[i] ≤ k (0) 0 (1) (0) (0) a[i] ≤ k and y[i] + z [i] mod p = b[i] (2k + 1) + b[i] 8: if y[i] + z[i] > k then (0) 0 where −k ≤ b[i] ≤ k then we will have 9: z [i] ← k (0) 10: else if y[i] + z[i] < −k then a[i](1) = b[i](1). (3) 0 11: z [i] ← −k y[i]+z[i] 12: else In line 3 of the algorithm, we check to see if will 0 p 13: z [i] ← 0 possibly need to be reduced modulo . If it does, then z0[i] = z[i] 14: end if we set , and in this case condition (3) will be 15: end if clearly satisfied. Notice that in the other cases, we now have y[i]+z[i] mod p = y[i]+z[i] since |z[i]| ≤ k. We now 16: end for (0) 6kn handle these other cases. Assume that y[i] + z[i] > k 17: if uncompressed ≤ p then 0 0 and we set z [i] = k. Then we have 18: return z 19: else a[i](1)(2k + 1) 20: return ⊥ = y[i](1)(2k + 1) + y[i](0) + z[i] 21: end if   − y[i](1)(2k + 1) + y[i](0) + z[i] mod (2k + 1) Fig. 2: The Compression Algorithm = y[i](1)(2k + 1) + y[i](0) + z[i] In Figure 2 we fully state our compression algorithm.  (0)  For two vectors y, z, the algorithm first checks whether − y[i] + z[i] mod (2k + 1) the coefficient y[i] of y is greater than (p − 1)/2 − k in = y[i](1)(2k + 1) + y[i](0) + z[i] absolute value. If it is, then there is a possibility that  (0)  y[i] + z[i] will need to be reduced modulo p and in this − y[i] + z[i] − (2k + 1) case we do not compress z[i]. Ideally there should not   = y[i](1) + 1 (2k + 1) be many such elements, and we can show that for the parameters used in the signature scheme, there will be and via a similar calculation at most 6 (out of n) with high probability. It’s possible (1)  (1)  to set the parameters so that there are no such elements, b[i] (2k + 1) = y[i] + 1 (2k + 1) but this decreases the efficiency and is not worth the And so in this case, a[i](1) = b[i](1). The case where very slight savings in the compression. (0) y − p−1 , p−1  y[i] + z[i] < −k is almost identical and we end up For every integer in the range 2 2 and (1) (1) (1)  k y with a[i] = b[i] = y[i] − 1 (2k + 1). The third any positive integer , can be uniquely written as (0) y = y(1)(2k+1)+y(0) y(0) case where −k ≤ y[i] + z[i] ≤ k is also similar and we where is an integer in the range (1) (1) (1) y−y(0) get a[i] = b[i] = y[i] (2k + 1). [−k, k] and y(1) = . Assuming that y[i] is in the 2k+1 We now show that the efficient encoding of (2) exists. range where z[i] can be compressed, we assign the value If z[i]0 = 0, we represent it with the bit string 0000. 
If of k to z0[i] if y[i](0) +z[i] > k, assign −k if y[i](0) +z[i] < z[i]0 = k, we represent it with the bit string 0010. z[i]0 = −k, and 0 otherwise. Thus y(0) are the “lower-order” −k, we represent it with the bit string 0100. If z[i]0 = bits of y, and y(1) are the “higher-order” ones10. For a n z[i] (in other words, it is uncompressed), we represent polynomial y = y[0] + y[1]x + ... + y[n − 1]xn−1 ∈ Rp , it with the string 011z[i]0 where z[i] can be represented we define y(1) = y[0](1) + y[1](1)x + ... + y[n − 1](1)xn−1 by 2 log k bits (the 0110 is necessary to signify that the and y(0) = y[0](0) + y[1](0)x + ... + y[n − 1](0)xn−1. following log 2k bits represent an uncompressed value). 10. Note that these only roughly correspond to the notion of most Thus uncompressed values use 2 + log 2k bits and the and least significant bits. other values use just 2 bits. Since there are at most 6kn/p 6 uncompressed values, the maximum number of bits that Theorem 3.3. For the signature scheme in Figure 3 such are needed is that n > 200 and k < p/16, suppose that there exists a 6kn  6kn 6kn polynomial-time forger F who makes at most θ queries to the (2+log 2k)· +2 n − = 2n+dlog(2k+1)e· . p p p signer, ψ queries to the random oracle H, and succeeds in forging with probability δ. Then there exists an algorithm of the same time-complexity as F that for a randomly-chosen n n n n p p p Finally, if y is uniformly distributed in Rp , then with a ∈ R finds u1 ∈ R2k+64k0 and a u2 ∈ R4k+64k0 such probability at least .98, the algorithm will not have more that au1 + u2 = 0 (and (u1, u2) 6= 0)) with probability at than 6 uncompressed elements. The proof of the below least      n  lemma follows by bounding the binomial distribution by 1 −160 1 δ − 1/|D32| 1 − 2 δ − n − n , the Poisson distribution. 2 |D32| ψ + θ |D32| Lemma 3.2. If y is uniformly distributed modulo p and where k0 = 280/n−1p1/2. 2nk/p ≥ 1, then the compression algorithm outputs ⊥ with 2% probability less than . 3.4 Concrete Instantiation We now give some concrete instantiations of our signa- 3.3 A Signature Scheme for Embedded Systems ture scheme from Figure 3. The security of the scheme We now present the version of the signature scheme depends on two things: the hardness of the underly- that incorporates the compression idea from Section 3.2 ing DCKp,n problem and the hardness of finding pre- (see Figure 3). We will use the following notation that is images in the random oracle H11. For simplicity, we similar to the notation in Section 3.2: every polynomial fixed the output of the random oracle to 160 bits and so n Y ∈ Rp can be written as finding pre-images is 160 bits hard. Judging the security (1) (0) of the lattice problem, on the other hand, is notoriously Y = Y (2(k − 32) + 1) + Y more difficult. For this part, we rely on the extensive n (0) p experiments performed by Gama and Nguyen [25] and where Y ∈ Rk−32 and k corresponds to the k in the signature scheme in Figure 3. Notice that there is a Chen and Nguyen [15] to determine the hardness of lat- bijection between polynomials Y and this representation tice reductions for certain classes of lattices. The lattices (Y(1), Y(0)) where that were used in the experiments of [25] were a little different than ours, but we believe that barring some Y(0) = Y mod (2(k − 32) + 1), unforeseen weakness due to the added algebraic struc- ture of our lattices and the parameters, the results should and Y − Y(0) be quite similar. We consider it somewhat unlikely that Y(1) = . 
2(k − 32) + 1 the algebraic structure causes any weaknesses since for certain parameters, our signature scheme is as hard Intuitively, Y(1) is comprised of the higher order bits of as RING-LWE (which has a quantum reduction from Y. worst-case lattice problems [45]), but we do encourage The secret key in our scheme consists of two poly- cryptanalysis for our particular parameters because they pn nomials s1, s2 sampled uniformly from R1 and the are somewhat smaller than what is required for the $ n public key consists of two polynomials a ← Rp and worst-case to average-case reduction in [45], [61] to go t = as1+s2. In step 1 of the signing algorithm, we choose through. pn The methodology for choosing our parameters is the the “masking polynomials” y1, y2 from Rk . In step 2, we let c be the hash function value of the high order bits same as in [41], and so we direct the interested reader of ay1 + y2 and the message µ. In step 3, we compute to that paper for a more thorough discussion. In short, one needs to make sure that the length of the secret z1, z2 and proceed only if they fall into a certain range. In √ [s |s ] p step 5, we compress the value z2 using the compression key 1 2 as a vector is not too much smaller than 0 and that the allowable length of the signature vector, algorithm implied in Lemma 3.1, and obtain a value z2 √ (1) 0 (1) which depends on k, is not much larger than p. Us- such that (az1 − tc + z2) = (az1 − tc + z2) and 0 send (z1, z , c) as the signature of µ. The verification ing these quantities, one can perform the now-standard 2 n 0 p calculation of the “root Hermite factor” that lattice re- algorithm checks whether z1, z2 are in Rk−32 and that 0 (1)  duction algorithms must achieve in order to break the c = H (az1 + z2 − tc) , µ . The running time of the signature algorithm depends scheme (see [25], [41], [50] for examples of how this is on the relationship of the parameter k with the param- done). According to experiments in [15], [25] a factor eter p. The larger the k, the more chance that z1 and pn 11. It is generally considered folklore that for obtaining signatures z2 will be in Rk−32 in step 4 of the signing algorithm, with λ bits of security using the Fiat-Shamir transform, one only needs but the easier the signature will be to forge. Thus it is random oracles that output λ bits (i.e., collision-resistance is not a requirement). While finding collisions in the random oracle does allow prudent to set k as small as possible while keeping the the valid signer to produce two distinct messages that have the same running time reasonable. signature, this does not constitute a break. 7
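Before moving to the optimized scheme in Figure 3, the compression algorithm of Figure 2 can be transcribed almost line by line into Python. The helper names below are our own; the decomposition into y^(0) ∈ [−k, k] and y^(1) follows the base-(2k+1) convention of Section 3.2, and the final check reproduces property (1) of Lemma 3.1 on random inputs.

import random

def decompose(v, k):
    """Write v = v1*(2k+1) + v0 with v0 in [-k, k]; return (v1, v0)."""
    v0 = v % (2 * k + 1)
    if v0 > k:
        v0 -= 2 * k + 1
    return (v - v0) // (2 * k + 1), v0

def compress(y, z, p, k):
    """Compress(y, z, p, k) from Figure 2; coefficients are assumed to be given
    as centered representatives in [-(p-1)/2, (p-1)/2].  Returns None for ⊥."""
    n, z_prime, uncompressed = len(y), [], 0
    for yi, zi in zip(y, z):
        if abs(yi) > (p - 1) // 2 - k:       # y[i] + z[i] may wrap around mod p
            z_prime.append(zi)
            uncompressed += 1
        else:
            _, y0 = decompose(yi, k)
            if y0 + zi > k:
                z_prime.append(k)            # only the positive carry is kept
            elif y0 + zi < -k:
                z_prime.append(-k)           # only the negative carry is kept
            else:
                z_prime.append(0)
    return z_prime if uncompressed <= 6 * k * n / p else None

def high_bits(poly, k):
    return [decompose(v, k)[0] for v in poly]

def centered(v, p):
    v %= p
    return v - p if v > (p - 1) // 2 else v

# sanity check of Lemma 3.1(1): the high-order bits of y+z and y+z' agree
p, k, n = 8383489, 2**14 - 32, 512
y = [random.randint(-(p - 1) // 2, (p - 1) // 2) for _ in range(n)]
z = [random.randint(-k, k) for _ in range(n)]
zp = compress(y, z, p, k)
if zp is not None:
    assert high_bits([centered(a + b, p) for a, b in zip(y, z)], k) == \
           high_bits([centered(a + b, p) for a, b in zip(y, zp)], k)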

Signing Key: s1, s2 ←$ R^n_1
Verification Key: a ←$ R^n_p, t ← as1 + s2
Cryptographic Hash Function: H : {0, 1}* → D^n_32

Sign(µ, a, s1, s2)
 1: y1, y2 ←$ R^n_k
 2: c ← H((ay1 + y2)^(1), µ)
 3: z1 ← s1c + y1,  z2 ← s2c + y2
 4: if z1 or z2 ∉ R^n_{k−32}, then goto step 1
 5: z2' ← Compress(az1 − tc, z2, p, k − 32)
 6: if z2' = ⊥, then goto step 1
 7: output (z1, z2', c)

Verify(µ, z1, z2', c, a, t)
 1: Accept iff z1, z2' ∈ R^n_{k−32} and c = H((az1 + z2' − tc)^(1), µ)

Fig. 3: Optimized Signature Scheme
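The signature stores z2' using the variable-length encoding from the proof of Lemma 3.1: roughly 2 bits per compressed coefficient and an escape prefix plus the raw value for each uncompressed one. The sketch below is our own illustrative packing (the concrete code assignment may differ from the one in the text), but it reproduces the roughly 1114-bit figure computed for parameter set I in Section 3.4.

from math import ceil, log2

def encode_z2prime(z2p, k):
    """Illustrative bit-packing for z2' matching the counts in Lemma 3.1(2):
    2 bits per compressed coefficient (0, +k or -k) and 2 + ceil(log2(2k+1))
    bits per uncompressed one.  The codes 00/01/10 and escape prefix 11 are
    our own choice for illustration."""
    width = ceil(log2(2 * k + 1))
    bits = []
    for v in z2p:
        if v == 0:
            bits.append("00")
        elif v == k:
            bits.append("01")
        elif v == -k:
            bits.append("10")
        else:                                        # uncompressed coefficient
            bits.append("11" + format(v + k, "0{}b".format(width)))
    return "".join(bits)

k, n = 2**14 - 32, 512
z2p = [0] * n
z2p[:6] = [123, -5000, 77, 9999, -1, 4321]           # six "uncompressed" entries
print(len(encode_z2prime(z2p, k)))                   # 506*2 + 6*(2+15) = 1114 bits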

TABLE 1: Signature Scheme Parameters ficient, (in software and in hardware), with all opera- Aspect Set I Set II tions taking quasi-linear time, as opposed to at least quadratic time for number-theory based schemes. The n 512 1024 p 8383489 16760833 most expensive operation of the signing algorithm is in k 214 215 step 2 where we need to compute ay1 + y2, which also Approximate signature bit size 8950 18800 could be done in quasilinear time using FFT. In step 3, Approximate secret key bit size 1620 3250 we also need to perform polynomial multiplication, but Approximate public key bit size 11800 25000 because c is a very sparse polynomial with only 32 non- Expected number of repetitions 7 7 zero entries, this can be performed with just 32 vector Approximate root Hermite factor 1.0066 1.0035 additions. And there is no multiplication needed in step Equivalent symmetric security in bits ≈ 80 ≥ 256 5 because az1 − tc = ay1 + y2 − z2. of 1.01 is achievable now, a factor of 1.007 seems to 4 IMPLEMENTATION have around 80 bits of security, and a factor of 1.005 In this section we describe the FPGA implementation has more than 256-bit security. In Figure 1, we present of the signing and verification procedures for parameter two sets of parameters. According to the aforementioned set I providing about 80 bits of equivalent symmet- methodology, the first has somewhere around 80 bits ric security. We present three implementation variants; of security, while the second has more than 256. This BOTH is a combined core for signing and signature has also been verified by experiments given in the full verification. The cores named SIGN and VER, however, version of the BLISS signature paper [18]. support signing and verification only, respectively. All We will now explain how the signature, secret key, three variants are built on top of a single code base and 12 and public key sizes are calculated. We will use the have been extensively tested. concrete numbers from set I as example. The signature 0 size is calculated by summing the bit lengths of z1, z , n 2 4.1 Preliminaries c z Rp and . Since 1 is in k−32, it can be represented by In this section we provide background on the required ndlog(2(k − 32) + 1)e ≤ n log k + n = 7680 bits. From 0 arithmetic by briefly revisiting classic polynomial multi- Lemma 3.1, we know that z2 can be represented with 6(k−32)n plication and the Number Theoretic Transform (NTT). 2n+dlog(2(k −32)+1)e· ≤ 2n+6 log(2k) = 1114 n p Polynomial Arithmetic in Zp[x]/hx + 1i. In ideal bits. And c can be represented with 160 bits, for a total lattice-based cryptography the most basic operations are signature size of 8954 bits. The secret key consists of pn addition, subtraction and multiplication of polynomials polynomials s1, s2 ∈ R1 , and so they can be represented characterized as ideals of the ring of modular polyno- 2dn log(3)e = 1624 n with bits, but a simpler representation mials in Zp[x]/hx + 1i with n integer coefficients such can be used that requires 2048 bits. The public key con- that f[x] = f[0] + f[1]x + f[2]x2 + ... + f[n − 1]xn−1 ∈ sists of the polynomials (a, t), but the polynomial a does Zp[x] [45]. Two elements in these rings can be easily not need to be unique for every secret key, and can in fact added coefficient-wise with complexity O(n) with regard be some randomness that is agreed upon by everyone to arithmetic operations in Zp. Since multiplication is who uses the scheme. 
Thus the public key can be just t, much more complicated, the product c = a·b can be which can be represented using dn log pe = 11776 bits. We point out that even though the signature and 12. Please note that we employed a pseudo-random number gener- ator in our implementation for minimal footprint. If required, it can key sizes are larger than in some number theory based be replaced or seeded by a true random number (see for example [17], schemes, the signature scheme in Figure 3 is quite ef- [28]). 8
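As a sanity check on the stated sizes, the following short computation (our own arithmetic, using n = 512, p = 8383489 and k = 2^14 from Table 1) reproduces the roughly 8954-bit signature, the 1624-bit secret key and the 11776-bit public key of parameter set I.

from math import ceil, log2

n, p, k = 512, 8383489, 2**14

width = ceil(log2(2 * (k - 32) + 1))            # 15 bits per coefficient of z1
z1_bits = n * width                             # 512 * 15 = 7680
uncompressed = round(6 * (k - 32) * n / p)      # at most ~6 uncompressed coefficients of z2'
z2p_bits = 2 * n + width * uncompressed         # 1024 + 15*6 = 1114 (Lemma 3.1)
c_bits = 160                                    # hash output

print(z1_bits + z2p_bits + c_bits)              # 8954  -> "approximately 8950" in Table 1
print(2 * ceil(n * log2(3)))                    # 1624  -> secret key s1, s2
print(ceil(n * log2(p)))                        # 11776 -> public key t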

n computed for Zp[x]/hx + 1i by considering the special improvements. The processor supports the most com- rule that xn ≡ −1. This leads to mon operations on polynomials like addition, subtrac- tion, multiplication by the NTT as well as sampling of n−1 n−1 uniformly distributed polynomials. Keys and temporary X X i+j i+j mod n ab = (−1)b n ca[i]b[j]x values can be stored in a variable number of internal i=0 j=0 registers implemented by block memories. For instantiation of the signature scheme, we used a where each coefficient is reduced modulo p. This datapath of 23-bit due to pipelined processing of coeffi- 2 classical ”schoolbook” algorithm has complexity O(n ) cients of width dlog pe. Four registers are already fixed 2 2 2 and generally requires n multiplications and (n − 1) where register R0 and R1 are part of the NTT block, R2 is additions or subtractions. The complexity of polynomial associated to the uniform random sampler while register log(3) multiplication can be lowered to O(n ) with the R3 is exported to upper layers as I/O port. We use R4 to Karatsuba method [38]. R7 to store temporary values (e.g., y1, y2) and the public Accelerating Polynomial Multiplication. The most constant a and the public key t. The processor supports time-consuming operation of the signature scheme is the commands in two-operand form like R1 ← R1 + R2 pn polynomial multiplication a · y1. Recall that a ∈ R has pn which provides a flexible method of implementing the 512 23-bit wide coefficients and y1 ∈ Rk consists of 512 polynomial multiplication. The supported instruction set 16-bit wide coefficients for parameter set I . with cycle counts for each instruction of the processor is In order to achieve quasi-linear runtime O(n log n) for provided in Table 213. A useful characteristic of the NTT polynomial-multiplication, we employ the Fast Fourier is that it allows to ”decompose” a multiplication into Transform (FFT) or more specifically the Number The- two forward transforms and one backward transform. oretic Transform (NTT) instead of the simpler methods Since the micro-code engine allows fine-grained access mentioned above. The NTT is defined in a finite field to NTT instructions, we can exploit that coefficients are or ring for a given primitive n-th root of unity ω. The fixed most of the time (e.g., constants) or needed twice. generic forward NTTω(a) of a sequence {a[0], .., a[n−1]} We therefore store them directly in NTT representation Z to {A[0],..., A[n − 1]} with elements in p and length n or transform them only once in order to save subsequent Pn−1 ij is defined as A[i] = j=0 a[j]ω mod p, i = 0, 1, . . . , n− transformations. In order to have access to randomly- −1 −1 1 with the inverse NTTω (A) using ω instead of ω. chosen polynomials, we have implemented a uniform For lattice-based cryptography it is also convenient random sampler on port R3 as described in Section 4.4. n that most schemes are specifically defined in Zp[x]/hx + It uses random bits supplied by a PRNG based on a 1i with modulus xn +1 where n is a power of two and it stream cipher that are fed to a rejection sampler to return holds that p ≡ 1 mod 2n. Let ω be a primitive n-th root of coefficients from a specific range (i.e., for the signature 2 unity in Zp and ψ = ω. Then, when a = (a[0],... a[n−1]) scheme between −k and k). Note that the sampler is and b = (b[0],... b[n − 1]) are vectors of length n with treated like a general purpose register so that no specific elements in Zp, define d = (d[0],... 
d[n − 1]) as the sampling instructions are necessary. An instruction is negative wrapped convolution of a and b so that d = a∗ provided to stop the program flow until the buffer of b mod hxn +1i. We then define the vector representation the sampler is full in order to support continuous read- x¯ = (x[0], ψx[1], . . . , ψn−1x[n − 1]) for use as a¯, b¯ and d¯. out of values for n cycles after invocation. ¯ −1 ¯ It then holds that d = NTTw (NTTw(a¯)◦NTTw(b)) [64], where ◦ means coefficient-wise multiplication. This en- 4.2 Signing Messages ables standard polynomial multiplication using the NTT, The general idea of our implementation is to separate preventing the undesired doubling of the input length the signing process into three independent blocks which as it provides a complimentary modular reduction by are executed in parallel (see Figure 4). These three main n x + 1. blocks are the Lattice Processor core, the Hash unit im- Design of the Lattice Processor. The ability to achieve plementing the random oracle and the Sparse Multipli- high performance using the FFT/NTT has recently been cation and compression component. Parallel operations shown by a hard- and software implementation of an are supported by pipelining. As a consequence, input ideal lattice-based public key encryption scheme by polynomials to the last two blocks are stored in buffers, Gottert¨ et al. [29]. However, their implementation is realized as BRAMs. Thus, we can avoid latencies for in- costly due to its unrolled design and thus not suited for puts r = ay1 +y2 to the hash engine and the sparse mul- low-cost devices. As a consequence, we have built our tiplication. This allows to achieve high throughput and implementation based on an efficient and configurable ideal lattice processor described in detail in [56]. This 13. As an example, the commands NTT REV A(R3), NTT REV B(R3), NTT NTT A, NTT NTT B, NTT PW MUL, processor was originally designed for an ideal lattice- NTT INTT, NTT INV PSI, NTT INV N, NTT GP MODE, based public-key encryption scheme but it could be MOV(R2,R1) would instruct the processor to read-in two polynomial easily adapted for its use in digital signature processing. from the I/O port (R3), apply the NTT on both, do point-wise multiplication and then apply the inverse NTT and output the result. The core-component, a serial NTT multiplier, has been The needed cycles depend mostly on n and are in this case approx. 3 described in [55] to which Aysu et al. [4] proposed some 2 n log n + 6n. 9
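The negative wrapped convolution can be illustrated with a small Python sketch (our own, with toy parameters n = 8 and p = 17, and a brute-force search for a 2n-th root of unity ψ with ψ² = ω). The transforms are written as naive O(n²) sums purely for readability; the hardware of course uses the O(n log n) butterfly structure.

import random

# d = NTT^-1(NTT(a_bar) o NTT(b_bar)) with x_bar = (x[0], psi*x[1], ..., psi^(n-1)*x[n-1])
n, p = 8, 17                                   # p = 17 satisfies p = 1 mod 2n

psi = next(g for g in range(2, p)              # primitive 2n-th root of unity mod p
           if pow(g, 2 * n, p) == 1 and pow(g, n, p) != 1)
omega = pow(psi, 2, p)                         # primitive n-th root of unity

def ntt(a, w):                                 # A[i] = sum_j a[j] * w^(i*j) mod p
    return [sum(a[j] * pow(w, i * j, p) for j in range(n)) % p for i in range(n)]

def negacyclic_mul_ntt(a, b):
    a_bar = [a[i] * pow(psi, i, p) % p for i in range(n)]
    b_bar = [b[i] * pow(psi, i, p) % p for i in range(n)]
    d_bar = [x * y % p for x, y in zip(ntt(a_bar, omega), ntt(b_bar, omega))]
    d = ntt(d_bar, pow(omega, p - 2, p))       # inverse transform with omega^-1
    n_inv, psi_inv = pow(n, p - 2, p), pow(psi, p - 2, p)
    return [d[i] * n_inv * pow(psi_inv, i, p) % p for i in range(n)]

def negacyclic_mul_naive(a, b):                # schoolbook reduction by x^n = -1
    res = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                res[i + j] = (res[i + j] + a[i] * b[j]) % p
            else:
                res[i + j - n] = (res[i + j - n] - a[i] * b[j]) % p
    return res

a = [random.randrange(p) for _ in range(n)]
b = [random.randrange(p) for _ in range(n)]
assert negacyclic_mul_ntt(a, b) == negacyclic_mul_naive(a, b)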

TABLE 2: The basic instruction set of the ideal lattice processor described in [56]. Note that between every instruction a certain number of wait cycles ε is required in order to clear the pipeline and reconfigure the switch matrix (the depth of the pipeline depends, e.g., on the width of p). The cycle counts are given for the parameter set p = 8383489 and n = 512.

Command | Op 1 | Op 2 | Cycles (measured) | Cycles (theory) | Explanation
NTT REV {A/B} | R{2..x} | - | 571 | n + ε | Loads a polynomial into register R{0/1} of the NTT engine, performs the bit reversal step and multiplies with NTT constants.
NTT NTT {A/B} | - | - | 2358 | (n/2) log n + ε | Executes the NTT on register R{0/1}.
NTT PW MUL | - | - | 792 | (3/2) n + ε | Point/coefficient-wise multiplication of registers R0 and R1. The bit-reversed result is stored in register R1.
NTT INTT | - | - | 2358 | (n/2) log n + ε | Executes the inverse NTT on register R1.
NTT INV PSI | - | - | 555 | n + ε | Multiplies the coefficients in R1 with NTT constants.
NTT INV N | - | - | 554 | n + ε | Multiplies the coefficients in R1 with n^(-1).
ADD | R{0..x} | R{0..x} | 558 | n + ε | Adds two polynomials (R(op1) ← R(op1) + R(op2)).
SUB | R{0..x} | R{0..x} | 558 | n + ε | Subtracts two polynomials (R(op1) ← R(op1) − R(op2)).
MOV | R{0..x} | R{0..x} | 530 | n + ε | Moves a polynomial from one register to another (R(op1) ← R(op2)).
WAIT SAMPLER | - | - | - | ε | Waits until the sampler has buffered more than n coefficients.
NTT GP MODE | - | - | 6 | (ε) | Exports special-purpose NTT registers as general-purpose registers until the next NTT operation.
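Summing the measured cycle counts of the instructions used for one complete polynomial multiplication (the sequence sketched in footnote 13) and comparing them with the (3/2)·n·log n + 6n estimate makes the remaining overhead visible; the lines below are our own back-of-the-envelope arithmetic on the numbers from Table 2.

from math import log2

# measured per-instruction cycle counts from Table 2 (n = 512, p = 8383489);
# underscores in the keys are only a naming convenience
measured = {
    "NTT_REV_A": 571, "NTT_REV_B": 571,     # load + bit reversal for both operands
    "NTT_NTT_A": 2358, "NTT_NTT_B": 2358,   # two forward transforms
    "NTT_PW_MUL": 792,                      # point-wise multiplication
    "NTT_INTT": 2358,                       # one inverse transform
    "NTT_INV_PSI": 555, "NTT_INV_N": 554,   # post-scaling
}

n = 512
print(sum(measured.values()))               # 10117 cycles for the NTT-related instructions
print(int(1.5 * n * log2(n) + 6 * n))       # 9984 cycles from the 3/2*n*log(n) + 6n estimate
# the gap (plus MOV and the per-instruction wait cycles) accounts for the overhead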

function [3] implementing the random oracle H. This Lattice Sparse variant of QUARK offers 224-bit preimage and 112-bit Hash Processor Multiplication collision security. We employ Quark since it supports high clock frequencies and consumes only few resources. The relatively low throughput just matches the speed of the other pipelining stages of the engine. Fig. 4: Simplified block diagram of the signing engine In order to allow future extensions and clock do- showing the main blocks Lattice Processor, Hash, and main separation using a FIFO, we have implemented Sparse Multiplication. a wrapper around the hash function. Moreover, we have extended the hash in order to be able to save the current state. This is beneficial in the context of rejection optimal resource and block utilization. A direct benefit sampling. Since on average the hash function has to be of this approach is that we can choose a moderately fast restarted seven times, this would imply re-hashing of hash component that just matches the performance of the message to be signed. In case messages are long the other two blocks. This enables the use of a resource- this would require significant additional effort. As a efficient implementation of the lightweight hash function consequence, we first hash the message µ, save the state, Quark [3]. hash a binary representation of r and reload the saved The detailed architecture of our signing engine is given state in case the signature was rejected. This approach in Figure 5. We use the lattice processor to sample values is also straightforward in terms of state management, as $ pn the message has just to be fed into the hash function once y1, y2 ← Rk and directly compute r = ay1 + y2 using the NTT. Note that for this operation a is already stored and does not have to be temporarily stored in a RAM in NTT representation. As a consequence, the workload or FIFO during the signing process. As S-Quark×16 is consists of sampling 2n uniformly random values, one a sponge-based construction, we abort the squeezing forward and one backward NTT call as well as some phase after obtaining 160 output bits to generate c. additional overhead for point-wise multiplication and The polynomial c computed by the Hash unit and data movement. In signing mode the processor actually a triple (r, y1, y2) are then processed by the sparse starts processing independently of the message or secret multiplication and compression block. Due to the par- (r, y , y ) allelism of the three main blocks, the hash engine can key and precomputes triples 1 2 for subsequent 0 use. These triples are stored in a temporary buffer directly request a new triple (r, y1, y2) to generate a new pn and accessed during signing by the hashing module. hash/polynomial c. In order to compute z1, z2 ∈ Rk−32 Note also that before r = ay1 + y2 can be hashed we have to multiply the sparse polynomial c ∈ Dn pn we have to transform every coefficient into the higher- with the secret key polynomials s1, s2 ∈ R1 that have order representation r(0). This operation is basically an coefficients in the range [−1, 1]. For this, we implemented integer division by the constant 32705 which makes a column-wise polynomial multiplier and compute z1 use of the specific bit-layout of the divisor. The hash and z2 in parallel. More precisely, we store the secret module is realized by the S-Quark×16 lightweight hash key polynomials s1, s2 in one block RAM and merged 10

[Figure 5 shows the detailed datapath: an instruction decoder, the lattice processor (NTT multiplier with registers R0/R1, general-purpose registers R4–R7 holding temporary values and the secret key, and the Trivium-based uniform sampler on R3), a temporary RAM, a message FIFO, the QUARK hash unit with current and saved state, and the sparse multiplication block with MAC, Compress, encoder and signature output FIFO.]

Fig. 5: Detailed architecture of the signing engine.

them so that each address holds a coefficient of s1 and Processor. With the global constant a¯ = NTT(a) and one of s2 (s1[i]||s2[i], for 0 ≥ i ≤ n). Since the coefficients the verification key ¯t = NTT(t) stored directly in NTT of s1, s2 and c are in [−1, 0], the multiplication can be representation, we finally compute the input to the hash simply realized by an adder so that the parallel com- function as follows: putation of z1, z2 is actually not very resource-intensive. −1(a¯◦ (z ) − ¯t◦ (c)) + z0 . Since the result is returned coefficient by coefficient, this NTT NTT 1 NTT 2 (4) allows immediate rejection when an out-of-range value 3 As a consequence, three NTT operations ( 2 n log n cycles), is detected (i.e., after adding the corresponding coeffi- two point-wise multiplications (2n cycles) and two ad- cient of y1 or y2). In this case the signing procedure is dition/subtractions (2n cycles) represent the arithmetic directly restarted. In order to prevent memory-intensive operations to be performed. The time-consuming decod- expansion of the 160-bit binary hash c into a polynomial, ing of the signature is performed concurrently to the op- we perform this transformation on-the-fly in the sparse eration of the Lattice Processor to avoid further latencies. multiplier. The Compress component extracts the carry One design alternative would have been to implement information needed to be able to perform the trans- the relatively cheap multiplication tc in a separate unit formation into the higher-order representation during operating in parallel to the lattice processor. This is not the signature verification. The component implementing too costly as c is sparse with only 32 coefficients either Algorithm 2 maintains a counter to track the number minus one or one. Thus a schoolbook multiplier taking of uncompressed values but does not yield further state this sparseness into account would only require roughly information. Returned values are directly encoded and 32 · 512 cycles. However, even then the multiplication written into a output FIFO. When the signing operation would be not much faster compared to the NTT, harder needs to be restarted the FIFO is reset. When successful, to manage by the state machine, and also add resource however, access to this FIFO is granted to the external consumption. interface for retrieving the valid signature.
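The column-wise sparse multiplication with early rejection can be sketched in software as follows; the function below is our own rendering of the operation z ← sc + y for a c with at most 32 coefficients in {−1, +1}, aborting as soon as one output coefficient leaves [−(k−32), k−32], as the hardware unit does.

import random

def sparse_mac_with_rejection(s, y, c_support, n, k):
    """Compute z = s*c + y in Z[x]/(x^n + 1), where c is given by its at most 32
    non-zero positions and signs.  Abort as soon as a coefficient of z leaves
    [-(k-32), k-32]; the caller then restarts the signing attempt."""
    bound = k - 32
    z = []
    for j in range(n):                       # one output coefficient ("column") at a time
        acc = y[j]
        for pos, sign in c_support:
            i = j - pos
            if i >= 0:
                acc += sign * s[i]
            else:
                acc -= sign * s[i + n]       # wrapped term: x^n = -1
        if abs(acc) > bound:
            return None                      # early rejection
        z.append(acc)
    return z

# toy usage: ternary secret, masking polynomial and a sparse c
n, k = 512, 2**14
s = [random.randint(-1, 1) for _ in range(n)]
y = [random.randint(-k, k) for _ in range(n)]
c_support = [(random.randrange(n), random.choice((-1, 1))) for _ in range(32)]
z = sparse_mac_with_rejection(s, y, c_support, n, k)   # usually None: retry with fresh y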

4.4 Implementation Aspects 4.3 Signature Verification In this section we provide details for the optimization of The verification algorithm is simpler compared to sign- our implementation to reduce resource consumption or ing which also results in a smaller resource consumption. improve flexibility. Especially, the Sparse Multiplication block is obsolete (see Modular Reduction. The butterfly operator of the Figure 4) so that the verification core only consists of the NTT requires a log2 p × log2 p-bit multiplication and a Lattice Processor and Hash block. The simpler verification subsequent reduction by p = 8383489. The multiplication directly translates into less computation cycles so that is simply implemented in the ideal lattice processor we did not implement pipelining for the two blocks. We using Digital Signal Processor (DSP) blocks instantiated 0 first run the Lattice Processor to compute az1 +z2 −tc and by the IP Core generator. For the reduction modulo p then activate the hash engine in a next step. Therefore, we further rely on an idea by Solinas [60]. Obviously, the runtime for verification consists of the amount of the value 223 mod 8383489 = 5119 is small. By applying time to perform this computation and the time to hash Solina’s idea to reduce a binary number u = x45..0 mod p the result. c = 223x + x mod p = 5119 · x + n we write 45..23 22..0 45..23 0 p 0 The validity test if z1, z2 ∈ Rk−32 is simply being x22..0 = x35..0 + x22..0. Using this we reduced the result performed during signature decoding from the input of the multiplication by 10 bits (see Figure 6 for the FIFO. The main workload is due to the three polyno- complete block diagram). Applying this trick iteratively 0 mial multiplications az1, z2, tc performed by the Lattice for three times and after some subtractions we finally 11

5119 yield the reduced value u. In addition to that, the 5119 multiplication by the constant 5119 = 212 + 210 − 1 can [35:23] be implemented very efficiently with two simple shifts, [45:23] [22:0] one addition and one subtraction. The delay of the full x 46 [22:0] log2 p × log2 p multiplication and subsequent reduction mod p is 26 cycles. However, as even the fast point- 5119 p wise multiplication has to be performed sequentially in [25:23] 1 23 approx. 512 cycles (n = 512 coefficients), the necessary >p u setup delay of the pipeline does not have much impact [22:0] 0 on the overall performance. 1 Flexible Instantiation. In order to allow flexible usage and maintainability of the design we developed one Fig. 6: Pipelined reduction modulo p = 8383489. The code base that can be configured by VHDL generic input is the result of a 23 × 23-bit multiplication and statements to generate a core supporting either signing the output is c = x mod p. Note that we always multiply (SIGN), verification (VER) or both operations (BOTH). In by the constant 5119 which is realized by shift-and-adds. this case if ... generate statements are used to remove complete unnecessary blocks or just parts of a component (e.g., FSM states). Between the BOTH and Detailed Performance Evaluation. Table 3 provides the SIGN cores, the small savings in resources stems the actual runtime of several core components of our mainly from removal of the signature decoder and a implementation for a small message. Additionally, we buffer BRAM. The verification core VER is significantly outline theoretical extrapolations based on the parameter smaller than BOTH or SIGN since the Sparse Multiplication n. A difference of roughly five to ten percent between component and the resources for the pipelining stages estimation and measurement is due to the need of setup are not required. phases, pipeline stalls, or instruction decoding which are Uniform Sampling. In order to generate uniformly not included in our model. random distributed coefficients in the range [−k, k] for The runtime of the Lattice Processor in signing mode the polynomials y1, y2 we use rejection sampling in is not constant due to a small amount of wait cycles in order to obtain a value from the range [0,..., (2k + 1)]. the uniform sampler unit. Moreover, the overall runtime We therefore compute a coefficient by c = (r mod (2k + of our pipelined implementation is finally limited by 1)) − k, resembling the approach that was used for the the Lattice Processor (10505 cycles) and in case of no software implementation in [31]. The necessary pseudo- early abort, by the Sparse Multiplication (worst-case 16950 random input bits are generated using an implementa- cycles). The Hash component requires 10192 cycles to tion of the Trivium stream cipher [13]14 which outputs transform and hash r. On average one signing attempt one pseudo-random bit in every clock cycle. In the with one input triple takes 15908 cycles of which about rejection sampling step, in order to obtain a value from seven attempts are necessary in order to generate a the range [0,..., (2k +1)], we check that a 16-bit random valid signature. The computation of the valid signature value interpreted as integer is not larger or equal to requires a total of 114865 cycles on average. Signature (2k +1) = 32769. As a consequence, 50% of the inputs to verification involves polynomial arithmetic performed the sampler are rejected. 
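The reduction trick is easy to check in software: since 2^23 ≡ 5119 (mod 8383489), the bits above position 23 can be folded back with one multiplication by 5119 per round. The sketch below (our own, following the description of the pipelined reduction above) applies three folding rounds and a final conditional subtraction.

import random

P = 8383489                      # prime used in parameter set I
FOLD = 2**23 % P                 # 2^23 = 8388608, so 2^23 mod p = 5119

def reduce_mod_p(u):
    """Reduce the result of a 23x23-bit multiplication modulo p = 8383489
    by repeatedly folding the bits above position 23 (2^23 = 5119 mod p)."""
    for _ in range(3):           # three folding rounds, as in the pipelined design
        hi, lo = u >> 23, u & (2**23 - 1)
        u = hi * FOLD + lo       # 5119*x = (x << 12) + (x << 10) - x in hardware
    while u >= P:                # at most one subtraction is needed after three rounds
        u -= P
    return u

# check against the reference reduction for random 23-bit operands
for _ in range(10000):
    a, b = random.randrange(2**23), random.randrange(2**23)
    assert reduce_mod_p(a * b) == (a * b) % P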
To still provide sufficient perfor- by the processor (13851 cycles) and hashing of the result mance, we instantiate three Trivium cores in parallel to (10192 cycles). The overall verification runs at constant extract 3 bits at a time and perform a rejection sampling time and takes 25138 cycles, where the additional 1095 l m in dlog2 32769e = 6 clock cycles on average. cycles are accounted for initial decoding of the first 3 polynomial and I/O. A significant acceleration of the scheme could be still achieved by using a faster (but 5 RESULTS AND COMPARISON larger) hash function as the time to hash contributes roughly 40% to the overall runtime of the verification All results below were obtained post place-and-route process. (PAR) and generated with Xilinx ISE 14.2. We imple- Resource Consumption and Performance. In Table 4 mented the signing and verification engine in VHDL we give Post-PAR results for a low-cost Spartan-6 LX25. on the low-cost Spartan-6 (S6SLX25-3) and on the high- The figures were obtained after 12 runs of the Smart speed Virtex-6 (V6LX75T-3) device family. Detailed per- Explorer to achieve the smallest timing score. Based on formance results for parameter set I are given in Ta- the maximum clock frequency of 187 MHz, one signing ble 3 and the corresponding resource consumption is operations takes on average 615 µs while verification is addressed in Table 4 for Spartan-6 and Table 5 for Virtex- constant time and requires 134 µs on the BOTH core. 6, respectively. Compared to the combined core for signing and veri- fication, the core supporting signing only saves just a 14. The used implementation of Trivium is based on [62]. However, we removed asynchronous reset signals to improve resource utiliza- few logic elements while the verification core is much tion. smaller with only 65% LUT and 74% BRAM usage. 12

5 RESULTS AND COMPARISON

All results below were obtained post place-and-route (PAR) and generated with Xilinx ISE 14.2. We implemented the signing and verification engine in VHDL on the low-cost Spartan-6 (S6SLX25-3) and on the high-speed Virtex-6 (V6LX75T-3) device family. Detailed performance results for parameter set I are given in Table 3 and the corresponding resource consumption is addressed in Table 4 for Spartan-6 and Table 5 for Virtex-6, respectively.

Detailed Performance Evaluation. Table 3 provides the actual runtime of several core components of our implementation for a small message. Additionally, we outline theoretical extrapolations based on the parameter n. A difference of roughly five to ten percent between estimation and measurement is due to setup phases, pipeline stalls, and instruction decoding, which are not included in our model.

The runtime of the Lattice Processor in signing mode is not constant due to a small number of wait cycles in the uniform sampler unit. Moreover, the overall runtime of our pipelined implementation is ultimately limited by the Lattice Processor (10505 cycles) and, in case of no early abort, by the Sparse Multiplication (worst-case 16950 cycles). The Hash component requires 10192 cycles to transform and hash r. On average, one signing attempt with one input triple takes 15908 cycles, and about seven attempts are necessary in order to generate a valid signature, so that the computation of a valid signature requires a total of 114865 cycles on average. Signature verification involves polynomial arithmetic performed by the processor (13851 cycles) and hashing of the result (10192 cycles). The overall verification runs in constant time and takes 25138 cycles, where the additional 1095 cycles are accounted for by initial decoding of the first polynomial and I/O. A significant acceleration of the scheme could still be achieved by using a faster (but larger) hash function, as the time to hash contributes roughly 40% to the overall runtime of the verification process.
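The Sparse Multiplication mentioned above exploits the sparsity of the challenge polynomial c. As a rough software sketch, under the assumption that c has only a few nonzero coefficients, all equal to +1 or −1 (an assumption about the scheme that is not restated in this section), the products s1·c and s2·c reduce to a small number of signed, rotated additions in Z_p[x]/(x^n + 1); the function below is illustrative only and does not model the hardware datapath.

    #include <stdint.h>

    #define N 512                    /* ring dimension, arithmetic in Z_p[x]/(x^N + 1) */
    #define P 8383489u

    /* Sketch of z = s*c + y for a sparse c with coefficients in {-1, 0, +1}
       (assumed, see the text above). Every nonzero c[j] contributes one
       rotated, signed copy of s; the sign flips on wrap-around since x^N = -1. */
    static void sparse_mul_add(uint32_t z[N], const uint32_t s[N],
                               const int8_t c[N], const uint32_t y[N]) {
        for (int i = 0; i < N; i++)
            z[i] = y[i];
        for (int j = 0; j < N; j++) {
            if (c[j] == 0)
                continue;
            for (int i = 0; i < N; i++) {
                int idx = i + j;
                int neg = (c[j] < 0);
                if (idx >= N) {          /* wrap-around: x^N = -1 */
                    idx -= N;
                    neg = !neg;
                }
                uint64_t acc = z[idx];
                acc += neg ? (uint64_t)(P - s[i]) : (uint64_t)s[i];
                z[idx] = (uint32_t)(acc % P);
            }
        }
    }

If c has, say, 32 nonzero entries, the work amounts to roughly 32·n coefficient additions, which is consistent with the "32 to 32n" cycle estimate for this component in Table 3.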

TABLE 3: Detailed performance evaluation of the main components of our implementation for a short message. Whenever possible we provide actual cycle counts as absolute numbers and also the theoretical estimation based on the dimension n as well as worst-case and average-case values. Small differences between estimation and measurement are due to wait cycles, pipeline stalls, or setup phases.

Aspect | Description | Cycles
Lattice Processor (sign) | Average number of cycles to sample y1, y2 with coefficients in [−k, k] and to compute r ← a·y1 + y2 (400 signatures) | 10505 (n log n + 12n + 23)
Lattice Processor (verify) | Computation of a·z1 + z2' − t·c | c_ver = 13851 (n log n + 10n + 18)
Hash | Higher-order transformation of r and evaluation of the QUARK hash function on r^(1) and a short message µ: H(r^(1), µ) | c_hash = 10192
Sparse Multiplication | Computation of z1 ← s1·c + y1, z2 ← s2·c + y2, compression, rejection sampling, and signature encoding | 50 – 16950 (32 to 32n)
One signing attempt | Average number of cycles required for one signing attempt (400 signatures) | 15908
Signing | Average number of cycles required to successfully generate a signature (measured by generating 400 signatures) | 114865
Verification | Number of cycles required for the verification of one message | 25138 (c_ver + c_hash)

Resource Consumption and Performance. In Table 4 we give post-PAR results for the low-cost Spartan-6 LX25. The figures were obtained after 12 runs of SmartXplorer to achieve the smallest timing score. Based on the maximum clock frequency of 187 MHz, one signing operation takes on average 615 µs, while verification is constant-time and requires 134 µs on the BOTH core. Compared to the combined core for signing and verification, the core supporting only signing saves just a few logic elements, while the verification core is much smaller with only 65% LUT and 74% BRAM usage.
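As a rough cross-check (our own arithmetic, not an additional measurement), the reported throughput follows directly from the cycle counts in Table 3 and the clock frequency in Table 4:

    \[
      \frac{114865}{187\,\mathrm{MHz}} \approx 614\,\mu\mathrm{s}, \qquad
      \frac{25138}{187\,\mathrm{MHz}} \approx 134\,\mu\mathrm{s}, \qquad
      \frac{187 \cdot 10^{6}}{114865} \approx 1628\ \mathrm{Sig/s}, \qquad
      \frac{187 \cdot 10^{6}}{25138} \approx 7439\ \mathrm{Sig/s},
    \]

which matches the 615 µs, 134 µs, 1627 Sig/s and 7438 Sig/s reported for the Spartan-6 BOTH core up to rounding.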

TABLE 4: Performance and resource consumption of all three design targets on a Xilinx Spartan-6 LX25 (speed-grade -3).

Aspect | BOTH | SIGN-only | VER-only
Slices | 2045 | 2010 | 1586
LUT/FF | 6088/6804 | 5614/6188 | 3966/4318
18K BRAM | 19.5 | 18.5 | 14.5
DSP48A1 | 4 | 4 | 4
Max. clock freq. | 187 MHz | 197 MHz | 187 MHz
Sign (Sig/s) | 1627 | 1715 | -
Verify (Sig/s) | 7438 | - | 7438

In Table 5 we provide results for a more expensive but also much faster (in the sense that it supports higher clock frequencies) and larger Virtex-6 LX75. The resource consumption is similar to the Spartan-6 implementation, but the achievable clock frequency is nearly 100 MHz higher. As a consequence, we can verify more than 10000 signatures per second with the BOTH and VER cores. Note that the Spartan-6 provides 9/18K BRAMs while the Virtex-6 has been designed with larger 18/36K BRAMs. Unfortunately, this leads to some wasted space in BRAMs on the Virtex-6, as our design contains a large number of 9K BRAMs which are mapped to 18K BRAMs on the Virtex-6.

TABLE 5: Performance and resource consumption of all three design targets on a Xilinx Virtex-6 LX75T (speed-grade -3).

Aspect | BOTH | SIGN-only | VER-only
Slices | 2083 | 2086 | 1745
LUT/FF | 6591/6791 | 6073/6183 | 4571/4338
18K BRAM | 36 | 34 | 26
DSP48E1 | 4 | 4 | 4
Max. clock freq. | 274 MHz | 286 MHz | 280 MHz
Sign (Sig/s) | 2385 | 2489 | -
Verify (Sig/s) | 10899 | - | 11138

5.1 Comparison with Previous Implementations

A previous implementation of the signature scheme was proposed in the conference version of this work [30]; the respective results are listed in Table 6. The main difference to this work is the use of the NTT instead of a schoolbook approach for polynomial multiplication (i.e., for the computation of a·y1 + y2). The schoolbook multiplier of [30] was realized efficiently using a DSP, and with clock domain separation high clock frequencies were achieved (416 MHz on a Virtex-6). In order to fully utilize the signing engine, multiple multipliers are used, requiring quite a few DSP blocks. Another difference is that we now compute z1, z2 in parallel and not sequentially as in [30].

For a fair comparison with [30] we compare only the separate signing and verification cores. However, our BOTH core provides support for both operations at almost no additional cost compared to a core supporting only the signing operation. On the Spartan-6 platform we need for signing only 75% of the LUTs, 92% of the BRAMs and 14% of the DSPs (see Table 4) compared with the results of [30] given in Table 6. With respect to performance, the cores of this work can compute 1715 signatures per second compared to 931, which is an improvement by a factor of 1.8. Since in [30] the computation of a·z1 + z2' − t·c is performed with just one schoolbook multiplier and not in parallel, the verification performance of this work improves even more, due to the usage of the NTT: our verification core consumes almost the same amount of BRAMs but only 67% of the LUTs and 14% of the DSPs, and increases the performance by a factor of 7.4.

Besides the aforementioned hardware implementation, the signature scheme has also been implemented on Intel's Sandy Bridge and Ivy Bridge CPUs targeting the Advanced Vector Extensions (AVX) [31]. The reported (average) cycle count for a successful signing operation is 634988 cycles, while verification takes 45036 cycles and key generation 31140 cycles. On a 2.5 GHz Intel Core i5-3210M processor (Ivy Bridge) this results in about 3900 signatures per second or more than 55000 verification operations per second.

TABLE 6: Implementation results for comparable signature schemes (signing).

Operation | Algorithm | Device | Resources | Ops/s
GLP [sign] [30] | - | XC6SLX16-3 | 7465 LUT / 8993 FF / 28 DSP / 29.5 BRAM | 931 @ 270/162 MHz
GLP [ver] [30] | - | XC6SLX16-3 | 6225 LUT / 6663 FF / 8 DSP / 15 BRAM | 998 @ 272/158 MHz
GLP [sign] [30] | - | XC6VLX130-3 | 67027 LUT / 95511 FF / 216 DSP / 234 BRAM | 12627 @ 416/204 MHz
GLP [ver] [30] | - | XC6VLX130-3 | 61360 LUT / 57903 FF / 60 DSP / 120 BRAM | 14580 @ 402/156 MHz
RSA signature [63] | RSA-1024; private key | XC4VFX12-10 | 3937 LS / 17 DSP | 548
ECDSA [32] | NIST-P224; point mult. | XC4VFX12-12 | 1580 LS / 26 DSP | 2739
ECDSA [1] | NIST-B163; point mult. | XC4VLX200 | 7719 LUT / 1502 FF | 47619
UOV signature [10] | UOV(60,20) | XC5VLX50-3 | 13437 LUT | 170940

5.2 Comparison with Other Implementations

When comparing our results to other work given in Table 6, we conservatively assume that RSA signatures (one modular exponentiation) with a key size of 1024 bit and ECDSA signatures (one point multiplication) with a key size of 160 bit are comparable to our scheme in terms of security (see Section 3.4 for details on the parameters). In comparison with RSA, our implementation on the low-cost Spartan-6 is 1.5 times faster than the high-speed implementation of Suzuki [63], which still needs twice as many device resources and runs on the more expensive Virtex-4 device. Note, however, that ECC over binary curves is very well suited for hardware, and even implementations on old FPGAs like the Virtex-2 [1] are faster than our lattice-based scheme. For the NTRUSign lattice-based signature scheme (introduced in [33] and broken by Nguyen and Regev [51]) and the XMSS [12] hash-based signature scheme we are not aware of any implementation results for FPGAs. Hardware implementations of Multivariate Quadratic (MQ) cryptosystems [5], [10] show that these schemes are faster (by a factor of 2-50) than ECC but also suffer from impractical key sizes for the private and public key, e.g., 80 kB for Unbalanced Oil and Vinegar (UOV) [53]. While implementations of the McEliece encryption scheme offer good performance [21], [59], the only implementation of a code-based signature scheme [9] is extremely slow with a runtime of 830 ms for signing.

6 CONCLUSION

In this paper we presented a provably secure lattice-based digital signature scheme and its implementation on different reconfigurable hardware devices. With moderate resource requirements our implementations can even outperform classical and alternative cryptosystems in terms of signature size and performance.

Future work includes the investigation of and comparison with other lattice-based signature schemes like [18], [26], [47] as well as the implementation of the presented signature scheme on other platforms like microcontrollers or graphics cards.

REFERENCES

[1] B. Ansari and M. A. Hasan. High-performance architecture of elliptic curve scalar multiplication. IEEE Trans. Computers, 57(11):1443–1453, 2008.
[2] S. Arora and R. Ge. New algorithms for learning in presence of errors. In ICALP (1), pages 403–415, 2011.
[3] J.-P. Aumasson, L. Henzen, W. Meier, and M. Naya-Plasencia. Quark: A lightweight hash. J. Cryptology, 26(2):313–339, 2013.
[4] A. Aysu, C. Patterson, and P. Schaumont. Low-cost and area-efficient FPGA implementations of lattice-based cryptography. In HOST, pages 81–86, 2013.
[5] S. Balasubramanian, H. W. Carter, A. Bogdanov, A. Rupp, and J. Ding. Fast multivariate signature generation in hardware: The case of Rainbow. In ASAP, pages 25–30. IEEE Computer Society, 2008.
[6] R. E. Bansarkhani and J. Buchmann. Improvement and efficient implementation of a lattice-based signature scheme. IACR Cryptology ePrint Archive, 2013:297, 2013.
[7] R. Barbulescu, P. Gaudry, A. Joux, and E. Thomé. A quasi-polynomial algorithm for discrete logarithm in finite fields of small characteristic. CoRR, abs/1306.4244, 2013.
[8] D. J. Bernstein, T. Chou, and P. Schwabe. McBits: Fast constant-time code-based cryptography. In CHES, pages 250–272, 2013. Full version available at http://cryptojedi.org/papers/mcbits-20130616.pdf.
[9] J. Beuchat, N. Sendrier, A. Tisserand, G. Villard, et al. FPGA implementation of a recently published signature scheme. Rapport de Recherche RR LIP 2004-14, 2004.
[10] A. Bogdanov, T. Eisenbarth, A. Rupp, and C. Wolf. Time-area optimized public-key engines: MQ-cryptosystems as replacement for elliptic curves? In CHES, pages 45–61, 2008.
[11] J. Buchmann, D. Cabarcas, F. Göpfert, A. Hülsing, and P. Weiden. Discrete ziggurat: A time-memory trade-off for sampling from a Gaussian distribution over the integers. In Selected Areas in Cryptography, pages 402–417, 2013.
[12] J. Buchmann, E. Dahmen, and A. Hülsing. XMSS - a practical forward secure signature scheme based on minimal security assumptions. In PQCrypto, pages 117–129, 2011.
[13] C. D. Cannière. Trivium: A stream cipher construction inspired by block cipher design principles. In ISC, pages 171–186, 2006.
[14] A. I.-T. Chen, M.-S. Chen, T.-R. Chen, C.-M. Cheng, J. Ding, E. L.-H. Kuo, F. Y.-S. Lee, and B.-Y. Yang. SSE implementation of multivariate PKCs on modern x86 CPUs. In CHES, pages 33–48, 2009.
[15] Y. Chen and P. Q. Nguyen. BKZ 2.0: Better lattice security estimates. In ASIACRYPT, pages 1–20, 2011.
[16] N. Courtois, M. Finiasz, and N. Sendrier. How to achieve a McEliece-based digital signature scheme. In ASIACRYPT, pages 157–174, 2001.
[17] M. Dichtl and J. D. Golic. High-speed true random number generation with logic gates only. In CHES, pages 45–62, 2007.
[18] L. Ducas, A. Durmus, T. Lepoint, and V. Lyubashevsky. Lattice signatures and bimodal Gaussians. In CRYPTO (1), pages 40–56, 2013. Full version available at https://eprint.iacr.org/2013/383.
[19] L. Ducas and P. Q. Nguyen. Learning a zonotope and more: Cryptanalysis of NTRUSign countermeasures. In ASIACRYPT, pages 433–450, 2012.
[20] N. Dwarakanath and S. Galbraith. Sampling from discrete Gaussians for lattice-based cryptography on a constrained device. Applicable Algebra in Engineering, Communication and Computing, pages 1–22, 2014.
[21] T. Eisenbarth, T. Güneysu, S. Heyse, and C. Paar. MicroEliece: McEliece for embedded devices. In CHES, pages 49–64, 2009.
[22] T. Eisenbarth, I. von Maurich, and X. Ye. Faster hash-based signatures with bounded leakage. In Selected Areas in Cryptography, pages 223–243, 2013.
[23] M. Finiasz. Parallel-CFS - strengthening the CFS McEliece-based signature scheme. In Selected Areas in Cryptography, pages 159–170, 2010.

[24] J.-B. Fischer and J. Stern. An efficient pseudo-random generator provably as secure as syndrome decoding. In EUROCRYPT, pages 245–255, 1996.
[25] N. Gama and P. Q. Nguyen. Predicting lattice reduction. In EUROCRYPT, pages 31–51, 2008.
[26] C. Gentry, C. Peikert, and V. Vaikuntanathan. Trapdoors for hard lattices and new cryptographic constructions. In STOC, pages 197–206, 2008.
[27] O. Goldreich, S. Goldwasser, and S. Halevi. Public-key cryptosystems from lattice reduction problems. In CRYPTO, pages 112–131, 1997.
[28] J. D. Golic. New methods for digital generation and postprocessing of random data. IEEE Trans. Computers, 55(10):1217–1229, 2006.
[29] N. Göttert, T. Feller, M. Schneider, J. Buchmann, and S. A. Huss. On the design of hardware building blocks for modern lattice-based encryption schemes. In CHES, pages 512–529, 2012.
[30] T. Güneysu, V. Lyubashevsky, and T. Pöppelmann. Practical lattice-based cryptography: A signature scheme for embedded systems. In CHES, pages 530–547, 2012.
[31] T. Güneysu, T. Oder, T. Pöppelmann, and P. Schwabe. Software speed records for lattice-based signatures. In PQCrypto, pages 67–82, 2013.
[32] T. Güneysu and C. Paar. Ultra high performance ECC over NIST primes on commercial FPGAs. In CHES, pages 62–78, 2008.
[33] J. Hoffstein, N. Howgrave-Graham, J. Pipher, J. H. Silverman, and W. Whyte. NTRUSIGN: Digital signatures using the NTRU lattice. In CT-RSA, pages 122–140, 2003.
[34] J. Hoffstein, J. Pipher, and J. H. Silverman. NTRU: A ring-based public key cryptosystem. In ANTS, pages 267–288, 1998.
[35] J. Hoffstein, J. Pipher, and J. H. Silverman. NSS: An NTRU lattice-based signature scheme. In EUROCRYPT, pages 211–228, 2001.
[36] A. Hülsing, C. Busold, and J. Buchmann. Forward secure signatures on smart cards. In Selected Areas in Cryptography, pages 66–80, 2012.
[37] A. Joux. A new index calculus algorithm with complexity L(1/4 + o(1)) in very small characteristic. IACR Cryptology ePrint Archive, 2013:95, 2013.
[38] A. Karatsuba and Y. Ofman. Multiplication of multidigit numbers on automata. In Soviet Physics Doklady, volume 7, page 595, 1963.
[39] J. Koetsier. An inside look at the world's newest quantum computing and nanotechnology center, 2013. http://venturebeat.com/2013/05/15/an-inside-look-at-the-worlds-newest-quantum-computing-and-nanotechnology-center/.
[40] V. Lyubashevsky. Fiat-Shamir with aborts: Applications to lattice and factoring-based signatures. In ASIACRYPT, pages 598–616, 2009.
[41] V. Lyubashevsky. Lattice signatures without trapdoors. In EUROCRYPT, pages 738–755, 2012. Full version available at http://eprint.iacr.org/2011/537.
[42] V. Lyubashevsky and D. Micciancio. Generalized compact knapsacks are collision resistant. In ICALP (2), pages 144–155, 2006.
[43] V. Lyubashevsky and D. Micciancio. Asymptotically efficient lattice-based digital signatures. In TCC, pages 37–54, 2008.
[44] V. Lyubashevsky, D. Micciancio, C. Peikert, and A. Rosen. SWIFFT: A modest proposal for FFT hashing. In FSE, pages 54–72, 2008.
[45] V. Lyubashevsky, C. Peikert, and O. Regev. On ideal lattices and learning with errors over rings. In EUROCRYPT, pages 1–23, 2010.
[46] D. Micciancio. Generalized compact knapsacks, cyclic lattices, and efficient one-way functions. Computational Complexity, 16(4):365–411, 2007.
[47] D. Micciancio and C. Peikert. Trapdoors for lattices: Simpler, tighter, faster, smaller. In EUROCRYPT, pages 700–718, 2012.
[48] D. Micciancio and C. Peikert. Hardness of SIS and LWE with small parameters. In CRYPTO (1), 2013.
[49] D. Micciancio and O. Regev. Worst-case to average-case reductions based on Gaussian measures. SIAM J. Comput., 37(1):267–302, 2007.
[50] D. Micciancio and O. Regev. Lattice-based cryptography. In D. J. Bernstein, J. Buchmann, and E. Dahmen, editors, Post-Quantum Cryptography, pages 147–191. Springer, 2009.
[51] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. J. Cryptology, 22(2):139–160, 2009.
[52] C. Peikert and A. Rosen. Efficient collision-resistant hashing from worst-case assumptions on cyclic lattices. In TCC, pages 145–166, 2006.
[53] A. Petzoldt, E. Thomae, S. Bulygin, and C. Wolf. Small public keys and fast verification for Multivariate Quadratic public key systems. In CHES, pages 475–490, 2011.
[54] D. Pointcheval and J. Stern. Security arguments for digital signatures and blind signatures. J. Cryptology, 13(3):361–396, 2000.
[55] T. Pöppelmann and T. Güneysu. Towards efficient arithmetic for lattice-based cryptography on reconfigurable hardware. In LATINCRYPT, pages 139–158, 2012.
[56] T. Pöppelmann and T. Güneysu. Towards practical lattice-based public-key encryption on reconfigurable hardware. In Selected Areas in Cryptography, pages 68–85, 2013.
[57] S. S. Roy, F. Vercauteren, and I. Verbauwhede. High precision discrete Gaussian sampling on FPGAs. In Selected Areas in Cryptography, pages 383–401, 2013.
[58] P. W. Shor. Algorithms for quantum computation: Discrete logarithms and factoring. In FOCS, pages 124–134. IEEE Computer Society, 1994.
[59] A. Shoufan, T. Wink, H. G. Molter, S. A. Huss, and E. Kohnert. A novel cryptoprocessor architecture for the McEliece public-key cryptosystem. IEEE Trans. Computers, 59(11), 2010.
[60] J. Solinas. Generalized Mersenne numbers. Faculty of Mathematics, University of Waterloo, 1999.
[61] D. Stehlé, R. Steinfeld, K. Tanaka, and K. Xagawa. Efficient public key encryption based on ideal lattices. In ASIACRYPT, pages 617–635, 2009.
[62] R. Stern. Hardware implementations of ECRYPT stream ciphers. VHDL code available from http://eeweb.poly.edu/faculty/karri/stream ciphers/trivium.html, accessed July 25, 2013.
[63] D. Suzuki. How to maximize the potential of FPGA resources for modular exponentiation. In CHES, pages 272–288, 2007.
[64] F. Winkler. Polynomial Algorithms in Computer Algebra (Texts and Monographs in Symbolic Computation). Springer, 1st edition, 1996.

Tim Güneysu is assistant professor and head of the research group on Hardware Security at Ruhr-University Bochum in Germany. He received a Diploma in Information Technology International from the University of Cooperative Education of Mannheim (2003) and a Diploma (2006) and Ph.D. (2009) in IT-Security from Ruhr-University Bochum. His research topics are cryptographic implementations for embedded and hardware-based systems, including aspects such as long-term secure cryptography, lightweight and hardware-entangled cryptographic properties, as well as cryptanalytic machines.

Vadim Lyubashevsky is currently an INRIA researcher in the Cryptography Group at the École Normale Supérieure in Paris, France. He received his Bachelor of Science degree from Columbia University in 2002, and his Ph.D. from the University of California, San Diego in 2008. His main area of research is lattice-based cryptography, with a particular focus on constructing practical lattice-based schemes that stem from solid theoretical foundations.

Thomas Pöppelmann is a PhD student at the research group on Hardware Security, Ruhr University Bochum, Germany. In 2011 he received his Dipl.-Ing. degree in IT-security from the Ruhr University Bochum. His research interests include efficient implementations of cryptographic algorithms, lattice-based cryptography and FPGA security.