IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS

Securing M2M with post-quantum public-key cryptography

Jie-Ren Shih, Yongbo Hu, Ming-Chun Hsiao, Ming-Shing Chen, Wen-Chung Shen, Bo-Yin Yang, An-Yeu Wu, and Chen-Mou Cheng
Intel-NTU Connected Context Computing Center, National Taiwan University, Taipei, Taiwan

Abstract—In this paper, we present an ASIC implementation of two post-quantum public-key cryptosystems (PKCs), NTRUEncrypt and TTS. It represents a first step toward securing machine-to-machine (M2M) systems using strong, hardware-assisted PKC. In contrast to the conventional wisdom that PKC is too "expensive" for M2M sensors, it can actually lower the total cost of ownership because of cost savings in provisioning, deployment, operation, maintenance, and general management. Furthermore, PKC can be more energy-efficient because PKC-based security protocols usually involve less communication than their symmetric-key-based counterparts, and communication is becoming increasingly expensive relative to computation. More importantly, recent algorithmic advances have brought several new PKCs, NTRUEncrypt and TTS included, that are orders of magnitude more efficient than traditional PKCs such as RSA. It is therefore our primary goal in this paper to demonstrate the feasibility of using hardware-based PKC to provide general data security in M2M applications.

Index Terms—Lattice-based Cryptography, Multivariate Cryptography, Bluespec SystemVerilog

Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE.

I. INTRODUCTION

Cryptography is the foundation of data security. There are mainly two kinds of cryptography in use today: symmetric-key and public-key cryptography. In the former, the communicating parties are assumed to share one or more secret keys a priori. How they can establish such a shared secret is often referred to as the key-exchange problem. This problem is challenging not only from a technical but also from a managerial point of view, as we need to manage $O(n^2)$ keys in a network of size $n$.

Public-key cryptography (PKC), on the other hand, provides an elegant solution to the key-exchange problem. With PKC, key management becomes straightforward. A user can encrypt a short-lived session key using the communicating party's public key and simply send out the encrypted key. PKC ensures that only the holder of the corresponding private key can decrypt and obtain the session key. Furthermore, PKC can provide digital signatures, which, like a person's signature, provide an efficient means of authentication. As a result, PKC proliferates in today's Internet age and permeates many aspects of our daily life, ranging from communication to electronic commerce.

However, there is an emerging threat to the prevailing PKCs due to the recent development of quantum computers. The security of RSA, currently the most popular PKC, depends on the difficulty of the integer factorization problem, while that of ECC, the runner-up PKC, depends on the difficulty of the discrete logarithm problem. As Shor has shown, both of them would be solved by large quantum computers in polynomial time [1]. Such a threat is more relevant in the machine-to-machine (M2M) context, as these systems tend to operate over a long period of time, and we certainly should take precautions against such a catastrophic attack, even though it might only happen in the distant future.

There are four main kinds of approaches composing the so-called "post-quantum cryptography:" lattice-based cryptography, multivariate cryptography, hash-based signatures, and code-based cryptography. In this paper, we focus on NTRUEncrypt, a lattice-based cryptosystem, and TTS, a multivariate cryptosystem, as candidates in system development.

A. Previous attempts of securing M2M systems

As networked machines become more common in our daily lives, information security on these devices becomes an important issue. Traditionally, PKC is regarded as too expensive to deploy in M2M systems. Typical M2M systems have only limited computational power, making the deployment of strong cryptography on them extremely challenging.

There have been numerous proposals from the academic research community on how to secure M2M systems [2], [3]. Most of them use software-based symmetric-key cryptography. For example, TinySec provides link-layer security for sensor networks using a software implementation of symmetric-key cryptosystems [4]. In many proposals, more bits need to be sent over the air to achieve a certain level of security, so using hardware accelerators may not necessarily help in these cases [2]. The same functionality can be achieved by PKC in a more communication-efficient way. This is becoming more attractive because computation is getting cheaper in terms of hardware cost and energy consumption, while wireless communication is not improving at the same pace. As a result, communication is becoming more expensive compared with computation, not to mention that spectrum will become one of the scarcest resources when billions of M2M sensors are deployed and trying to send out their readings over the air. In this case, it is advantageous to use PKC on sensors for the sake of reducing communication cost.

Lastly, there have been several attempts at employing software-based PKC to secure inter-sensor communication [5],

[6]. People have demonstrated that it is possible to run PKC on sensors with acceptable performance. We believe that this is the right direction to pursue, and we plan to take it further by hardware acceleration.

B. Contributions

Our approach is to provide a foundation for information security using hardware-assisted PKC. Specifically, we plan to design and implement a complete, proof-of-concept PKC-based system. We choose two types of PKCs to support. First, multivariate cryptosystems enjoy the benefit of executing much faster than traditional cryptosystems on the same hardware, making them ideal for securing sensors in M2M systems [7]. Specifically, we support the (24, 20, 20) variant of TTS over $\mathbb{F}_{31}$, which takes a 200-bit message digest and produces a 320-bit signature, providing a security level of about 80 bits. Second, we include lattice-based cryptosystems such as NTRUEncrypt to provide for key exchange [8]. Specifically, we support the ees397ep1 variant of NTRUEncrypt, which encrypts a plaintext of up to 397 bits and produces a ciphertext 3573 bits long, providing a security level of about 128 bits. These are also future-proof in the sense that they can defend against attacks by thousand-qubit quantum computers, which might emerge in the next few decades. Based on these primitives, we can implement security protocols and services like multi-way authentication, etc.

The main contributions of this paper include the following.
• We present an efficient hardware design that supports two post-quantum PKCs, namely, NTRUEncrypt and the TTS signature scheme. Our approach allows reuse of not only sequential but also combinational circuits, resulting in a much more compact design than if the two were implemented separately.
• By using the high-level design tool Bluespec SystemVerilog, we are able to extensively explore the architectural design space, including experimenting with an iterative linear system solver, which, to the best of our knowledge, has not been investigated for solving such small systems.
• We identify the designs that provide the best trade-off between time, area, and total cycle count in order to minimize total energy consumption. This is especially important for M2M sensors, for many of them run on limited energy sources such as batteries.

C. Organization

The rest of this paper is organized as follows. In Sections II and III, we give the details of the implemented algorithms, NTRUEncrypt and TTS, respectively. The hardware design of these algorithms is also described at the end. We show our implementation strategy for designing the ASIC in Section IV and compare the implementation results in Section V. Specifically, we show and compare side-by-side the results obtained by a high-level synthesis tool, Bluespec SystemVerilog, against those obtained by the more traditional hand-optimized RTL-based design. Finally, we conclude this paper by giving a few future directions of work in Section VI.

II. NTRUENCRYPT

NTRUEncrypt is a lattice-based cryptosystem whose security is based on the hardness of the shortest vector problem in high-dimensional Euclidean lattices [9]. The main operations in NTRUEncrypt involve arithmetic in a polynomial ring $R = \mathbb{Z}[X]/(X^N - 1)$. The addition in this ring is straightforward polynomial addition, while the multiplication in this ring is convolutional, as shown in Figure 1. All polynomials in the ring have integral coefficients (modulo some integers), and their degrees are at most $N - 1$, so a typical element can be represented as $a = a_0 + a_1 X + \cdots + a_{N-1} X^{N-1}$.

[Figure: schoolbook multiplication of $a_4 a_3 a_2 a_1 a_0$ by $b_4 b_3 b_2 b_1 b_0$, with the partial products $a_i b_j$ wrapped around to produce $c_4 c_3 c_2 c_1 c_0$.]
Fig. 1. Convolutional polynomial multiplication in NTRUEncrypt ring

NTRUEncrypt is parameterized by three parameters, $N$, $P$, and $Q$, which satisfy the following conditions.
• $N$ is a prime number such that the maximal degree for all polynomials in the ring $R$ is $N - 1$.
• $P$ and $Q$ are two possible moduli for the coefficients of the polynomials in $R$, with $P \ll Q$ and $\gcd(P, Q) = 1$.
After arithmetic operations in $R$, the coefficients of the polynomials need to be reduced either modulo $P$ or modulo $Q$.

A. Operations

NTRUEncrypt consists of three parts: key generation, encryption, and decryption. In this paper, we only focus on the implementation of encryption and decryption; key generation does not happen frequently and hence is often done off-line. Here we focus on accelerating the on-line operations that are mostly executed on M2M systems. To make this paper self-contained, we still include a brief description of the key-generation operation below.

1) Key generation: A public key $h$ and a private key $(f, f_p)$ are generated as follows.
• Randomly choose a polynomial $f \in R$ with coefficients reduced modulo $P$.
• Randomly choose a polynomial $g \in R$ with coefficients reduced modulo $P$.
• Compute $f_p$, the inverse polynomial of $f$ modulo $P$, which is part of the private key.
• Compute $f_q$, the inverse polynomial of $f$ modulo $Q$.
• Compute the public key $h = f_q * g \bmod Q$.
Here the $*$ symbol stands for the multiplication in the NTRUEncrypt ring, i.e., polynomial multiplication modulo $X^N - 1$. If any of the required inverses does not exist, we simply start over and repeat until we succeed.
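The ring multiplication $*$ used above can be sketched in a few lines of Python. This is only an illustrative software model under stated assumptions, not the hardware design of this paper: the function name `ring_mul` and the coefficient-list representation (index $i$ holds the coefficient of $X^i$) are hypothetical choices made for the example.

```python
def ring_mul(a, b, n, q):
    """Multiply a and b in Z_q[X]/(X^n - 1).

    a and b are coefficient lists of length at most n, lowest degree first.
    Exponents wrap around modulo n because X^n = 1 in this ring.
    """
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = (i + j) % n              # X^i * X^j = X^((i + j) mod n)
            c[k] = (c[k] + ai * bj) % q  # reduce coefficients modulo q
    return c


if __name__ == "__main__":
    # (1 + 2X) * (3 + 4X^4) in Z_32[X]/(X^5 - 1): the X^5 term wraps to X^0.
    print(ring_mul([1, 2, 0, 0, 0], [3, 0, 0, 0, 4], 5, 32))
```

Each output coefficient is an independent sum of $N$ products, which is why the operation parallelizes naturally in hardware.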

2) Encryption: The ciphertext $e$ is computed from the public key $h$, a random polynomial $r \in R \bmod P$, and the message $m \in R \bmod P$ as

$$e = P \times h * r + m \bmod Q.$$

3) Decryption: The decryption procedure has three steps.
• Compute $a = f * e \bmod Q$.
• Shift the coefficients of $a$ to the range $[-Q/2, Q/2]$ and then reduce them modulo $P$.
• Compute $d = a * f_p \bmod P$.

It is easy to see why decryption works. By appropriate rewriting, we can see that we are performing the following computation.

$$\begin{aligned}
a &= f * e \bmod Q \\
  &= f * (P \times h * r + m) \bmod Q \\
  &= f * (P \times f_q * g * r + m) \bmod Q \\
  &= P \times g * r + f * m \bmod Q, \\
d &= a * f_p \bmod P \\
  &= P \times g * r * f_p + f * m * f_p \bmod P \\
  &= m \bmod P.
\end{aligned}$$

The fourth line uses $f * f_q = 1 \bmod Q$; in the last step, the first term is a multiple of $P$ and vanishes, while $f * f_p = 1 \bmod P$. The coefficients of $a$ are mapped to the interval $[-Q/2, Q/2]$ rather than $[0, Q - 1]$ so that the original message can be recovered.

B. Previous attempts

We have found several software and hardware implementations of NTRUEncrypt in the literature; here we only discuss hardware implementations. O'Rourke presented a hardware design for accelerating NTRUEncrypt's core operation of polynomial multiplication in his master's thesis [10]. It has a gate count of at least 1483 gates, but the design is not optimized for low cost, and no power consumption is reported. Bailey et al. presented an FPGA design that targets low-cost systems [8]. The design uses approximately 60000 gates on a Xilinx Virtex 1000 EFG860 FPGA. Atici et al. presented an implementation geared toward RFID (radio-frequency identification) and low-end sensors [11]. We note that although there have been several attempts that target compact implementation, which is quite suitable for use in M2M systems, our design in this paper targets not only a low gate count but also circuit reuse to support signature schemes such as TTS, which we describe in the next section.

III. MULTIVARIATE CRYPTOGRAPHY

Multivariate PKC is a kind of PKC whose trapdoor one-way function takes the form of a multivariate polynomial map over a finite field [7]. In such constructions, the public key is given by a set of polynomials

$$P = (p_1(w_1, \ldots, w_n), \ldots, p_m(w_1, \ldots, w_n)),$$

where each $p_i$ is a nonlinear polynomial in $w = (w_1, \ldots, w_n)$ with all coefficients and variables in $K = \mathbb{F}_p$:

$$p_k(w) := \sum_i P_{ik} w_i + \sum_i Q_{ik} w_i^2 + \sum_{i>j} R_{ijk} w_i w_j.$$

The evaluation of these polynomials at any given value corresponds to either the encryption or the verification procedure.

A generic way to construct trapdoors for multivariate cryptosystems is to compose two affine maps before and after a quadratic polynomial map with a special structure. In this case, the trapdoor information consists of the affine maps as well as the central map, with which it is easy to invert the composed public map that otherwise looks random. The difficulty of the one-way function is based on inverting a multivariate quadratic map, which is equivalent to solving a set of quadratic equations over a finite field.

Definition 1 (Multivariate Quadratic Problem): Solve the system $p_1(x) = p_2(x) = \cdots = p_m(x) = 0$, where each $p_i$ is quadratic in $x = (x_1, \ldots, x_n)$. All coefficients and variables are in $K = \mathbb{F}_p$.

The Multivariate Quadratic Problem has been proved to be NP-complete. The central map of a multivariate PKC maps elements from $K^n$ to $K^m$, while the affine transforms map elements from $K^n$ to $K^n$ and from $K^m$ to $K^m$. The designs of central maps can mainly be categorized into two classes. One class consists of small-field schemes, including rather conservative schemes such as Unbalanced Oil and Vinegar (UOV) as well as more aggressively designed proposals such as Rainbow and TTS [12]. The other class is big-field schemes, such as Hidden Field Equations (HFE) and Matsumoto-Imai (MIA).

A. General structure

Extant multivariate PKCs almost always hide the private map $Q$ via composition with two affine maps $T$ and $S$. So, $P = T \circ Q \circ S : K^n \to K^m$, or

$$P : w \in K^n \overset{S}{\mapsto} x = M_S w + c_S \overset{Q}{\mapsto} y \overset{T}{\mapsto} z = M_T y + c_T \in K^m.$$

In any scheme, the central map $Q$ belongs to a class of quadratic maps whose inverse can be computed relatively easily. The maps $S$ and $T$ are affine and of full rank. The key to a multivariate PKC is the design of the central map.

To sign a block, one computes $y = T^{-1}(z)$, $x = Q^{-1}(y)$, $w = S^{-1}(x)$. To verify a signature, one simply computes $z = P(w)$. The public key contains only the composed quadratic multivariate polynomials. The private key contains the details of $S$ and $T$, as well as the central map $Q$. Different multivariate PKC schemes usually differ only in $Q$.

B. TTS and related schemes

TTS uses the UOV scheme as the main building block for constructing the central map. By wrapping the input multiple times using UOV, we can construct a multi-layer nonlinear map. Such a construction is known as the Rainbow scheme, of which TTS is a special case. In particular, the central maps of TTS are constructed by using two layers of UOV and have a special sparse form.

1) Oil and Vinegar schemes: Suppose $n$ is an integer and $o = n - v$, $v < n$. The variables $x_1, \ldots, x_v$ are termed "vinegar variables," while $x_{v+1}, \ldots, x_n$ are termed "oil variables."

Take the map $Q : K^n \to K^m$ of the form $y = Q(x) = (q_1(x), \ldots, q_o(x))$, where

$$q_l(x) = \sum_{i=1}^{v} \sum_{j=i}^{n} \alpha_{ij}^{(l)} x_i x_j, \quad l = 1, \ldots, o.$$

The original Oil and Vinegar scheme has $m = o = v = n/2$. When $o < v$, it becomes the Unbalanced Oil and Vinegar signature scheme. If we have a UOV structure, then the quadratic part of each component $q_i$ in the central map from $x$ to $y$, when expressed as a symmetric matrix, looks like

$$M_i := \begin{pmatrix}
\alpha^{(i)}_{1,1} & \cdots & \alpha^{(i)}_{1,v} & \alpha^{(i)}_{1,v+1} & \cdots & \alpha^{(i)}_{1,n} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\alpha^{(i)}_{v,1} & \cdots & \alpha^{(i)}_{v,v} & \alpha^{(i)}_{v,v+1} & \cdots & \alpha^{(i)}_{v,n} \\
\alpha^{(i)}_{v+1,1} & \cdots & \alpha^{(i)}_{v+1,v} & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\alpha^{(i)}_{n,1} & \cdots & \alpha^{(i)}_{n,v} & 0 & \cdots & 0
\end{pmatrix},
\quad \text{or for short,} \quad
\begin{pmatrix} * & * \\ * & 0 \end{pmatrix}.$$

The shape of the matrix shows that every quadratic term involves at least one vinegar variable; no term is a product of two oil variables. Once the values of the vinegar variables are fixed, the quadratic terms of the system therefore vanish. We can then solve the resulting system of linear equations to find the values of the oil variables.

2) Rainbow scheme: By stacking several layers of UOV together for an invertible central map, we arrive at Rainbow-type constructions. For $0 < v_1 < v_2 < \cdots < v_{u+1} = n$, let

$$\begin{aligned}
S_l &:= \{1, 2, \ldots, v_l\}, \\
O_l &:= \{v_l + 1, \ldots, v_{l+1}\}, \\
o_l &:= v_{l+1} - v_l = |O_l|.
\end{aligned}$$

The central map is

$$Q : X = (x_1, \ldots, x_n) \mapsto Y = (y_{v_1+1}, \ldots, y_n),$$

where each $y_k := q_k(X)$ has the following form if $v_l < k \le v_{l+1}$:

$$q_k = \sum_{i \le j \le v_l} \alpha_{ij}^{(k)} x_i x_j + \sum_{i \le v_l \,\ldots} \alpha_{ij}^{(k)} x_i x_j + \sum_{\ldots} \beta_i^{(k)} x_i.$$

…

… signature schemes using systolic arrays [17] are presented. Last but not least, Tang et al. tried to minimize cycle counts for Rainbow signature schemes [18]. Similar to the motivation mentioned in Section II-B, we target not only a compact design but also a reusable design that supports encryption schemes such as NTRUEncrypt.

D. Solving systems of linear equations

For TTS and related schemes, decryption is usually much slower than encryption. This is because in decryption, the bottleneck operations involve solving a system of linear equations. There are several algorithms that can efficiently solve such problems, all of which have time complexity $O(n^3)$ for general cases.

1) Systolic Gaussian elimination: Gaussian elimination is perhaps the most well-known algorithm for solving a system of linear equations. In addition to solving, Gaussian elimination can also be used to find the rank of a matrix, to calculate the determinant of a matrix, and to calculate the inverse of an invertible square matrix.

Algorithm 1 Gaussian elimination
  for k = 1 → m do
    Find pivot for column k:
    i_max = argmax(i = k → m, abs(A[i, k]))
    if A[i_max, k] = 0 then
      return Error
    end if
    Swap rows(k, i_max)
    Do for all rows below pivot:
    for i = k + 1 → m do
      Do for all remaining elements in current row:
      for j = k + 1 → n do
        A[i, j] := A[i, j] - A[k, j] * (A[i, k] / A[k, k])
      end for
      Fill lower triangular matrix with zeros:
      A[i, k] := 0
    end for
  end for

Systolic Gaussian elimination is a fully pipelined parallel hardware design optimized for solving systems of linear …
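As a point of comparison with Algorithm 1, the following is a minimal software sketch of Gaussian elimination over a prime field such as $\mathbb{F}_{31}$, the field of the TTS variant considered in this paper. It is an assumption-laden illustration rather than the systolic hardware design: the name `solve_mod_p` is hypothetical, any nonzero entry is accepted as pivot (the largest-magnitude rule of Algorithm 1 only matters for numerical stability over the reals), division is replaced by multiplication with a modular inverse, and entries above the pivot are cleared as well (Gauss-Jordan), so no back-substitution is needed.

```python
def solve_mod_p(A, b, p=31):
    """Solve A y = b over F_p (p prime) by Gauss-Jordan elimination.

    A is a list of n rows of length n, b a list of length n.
    Raises ValueError if A is singular modulo p.
    """
    n = len(A)
    # Augmented matrix [A | b] with all entries reduced modulo p.
    M = [[x % p for x in row] + [bi % p] for row, bi in zip(A, b)]
    for k in range(n):
        # In a finite field any nonzero entry can serve as the pivot.
        pivot = next((i for i in range(k, n) if M[i][k] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular modulo p")
        M[k], M[pivot] = M[pivot], M[k]
        inv = pow(M[k][k], -1, p)  # modular inverse (Python 3.8+)
        M[k] = [x * inv % p for x in M[k]]
        # Eliminate column k from every other row.
        for i in range(n):
            if i != k and M[i][k] != 0:
                factor = M[i][k]
                M[i] = [(x - factor * y) % p for x, y in zip(M[i], M[k])]
    return [row[n] for row in M]


if __name__ == "__main__":
    A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
    b = [1, 2, 3]
    print(solve_mod_p(A, b))
```

In the UOV setting described earlier, this is the kind of solve that recovers the oil variables once the vinegar variables have been fixed.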

Fig. 2. Architecture for systolic Gaussian elimination

Definition 2 (Linear recurrence): A sequence $a = (a_i)_{i \in \mathbb{N}}$ is said to be linear recurrent over $F$ if there exist $n \in \mathbb{N}$ and $f_0, \ldots, f_n \in F$ with $f_n \ne 0$ such that

$$\sum_{0 \le j \le n} f_j a_{i+j} = 0, \quad \forall i \in \mathbb{N}.$$

The polynomial $f(x) = \sum_{0 \le j \le n} f_j x^j \in F[x]$ of degree $n$ is called a characteristic polynomial of $a$.

Definition 3 (Minimal polynomials): A characteristic polynomial is called a minimal polynomial of the sequence $a$ if it is of the least degree.

The Berlekamp-Massey algorithm finds the shortest linear feedback shift register (LFSR) that produces a given binary sequence [20]. The algorithm also finds a minimal polynomial of a linear recurrent sequence over a field. It is used in the Wiedemann algorithm to find the minimal polynomial of a sequence.

Let $F$ be a finite field. Given a nonsingular $A \in F^{n \times n}$ and $b \in F^n$, we would like to find a $y \in F^n$ such that $Ay = b$. The idea of the Wiedemann algorithm is to consider the sequence $a = (A^i b)$, $i \in \mathbb{N}$. Let $f$ be a minimal polynomial of $a$, and let $d$ be its degree. According to the definition of minimal polynomial, we have

$$f(A) \cdot b = 0 = \sum_{0 \le j \le d} f_j A^j \cdot b = f_0 b + \sum_{1 \le j \le d} f_j A^j \cdot b.$$

Therefore,

$$-f_0 b = \sum_{1 \le j \le d} f_j A^j \cdot b.$$

Hence,

$$A \cdot y = b = -f_0^{-1} \sum_{1 \le j \le d} f_j A^j \cdot b,$$

and therefore,

$$y = -f_0^{-1} \sum_{1 \le j \le d} f_j A^{j-1} \cdot b.$$

Algorithm 2 Berlekamp-Massey [20]
  Require: $(a_0, \ldots, a_{2d-1})$, a linear recurrent sequence whose minimal polynomial is of degree $\le d$
  Ensure: $f(x) = \sum_{0 \le j \le n} f_j x^j$ such that $f(x)$ is a minimal polynomial of $a$
  Let $\Lambda(x) = 1$, $K(x) = 0$, $l = 1$, $\delta = 1$.
  for j from 1 to 2d do
    Set $\gamma = \Lambda_0 a_{j-1} + \cdots + \Lambda_n a_{j-(n+1)}$
    if $\gamma = 0$ then
      $K(x) = xK(x)$
    else
      $\Theta(x) = \Lambda(x)$
      $\Lambda(x) = \Lambda(x) - \frac{\gamma}{\delta} xK(x)$
      if $2l < j$ then
        $K(x) = \Theta(x)$
        $\delta = \gamma$
        $l = j - l$
      else
        $K(x) = xK(x)$
      end if
    end if
  end for

Algorithm 3 Wiedemann [19]
  Require: A non-singular matrix $A \in F^{n \times n}$ and a vector $b \in F^n$
  Ensure: $y = A^{-1} b$, $y \in F^n$
  Generate the recurrent sequence $a = (A^i b)_{i \in \mathbb{N}}$
  Using $a$ as input, compute the minimal polynomial $m$ by Berlekamp-Massey
  $h = -\frac{m - m_0}{m_0 x}$, compute $y = h(A) b$

IV. IMPLEMENTATION

Having described the cryptographic algorithms to support, we describe in this section the tools and the strategy used for our ASIC design. The main design challenge here is to combine the circuits that support two different kinds of algebraic operations needed in NTRUEncrypt and TTS such that the resulting circuit is optimal in terms of time-area-cycle product. Such a challenge inevitably requires extensive architectural exploration, which is best done using a high-level design tool like SystemVerilog. In this work, we use Bluespec SystemVerilog to generate parameterized designs. Finally, we present our processor-based design that supports several post-quantum PKCs.

A. Bluespec SystemVerilog

Bluespec SystemVerilog is an EDA tool for ASIC and FPGA design [21]. It is a hardware description language that compiles into either Verilog RTL or a cycle-accurate C simulation program. It provides high productivity through a radically different approach to high-level synthesis [21].

1) Bluespec basics: Bluespec SystemVerilog uses a model of hardware structure similar to that used by Verilog. It supports modules and interfaces. Interfaces are used for

communication between modules, and module hierarchies can be obtained by static elaboration. Modularity helps improve productivity and allows rapid design-verification cycles. Like SystemC, it provides a software model and hence is friendly to software engineers. It also generates more efficient RTL than most SystemC implementations [21].

[Figure: a module containing rules, connected to its environment through interfaces.]
Fig. 3. Modules, interfaces, and methods in Bluespec SystemVerilog

[Figure: a high-level description (.bsv) is compiled, via a list of rules, into a hardware description for synthesis (.v) or for simulation (.{c,h}).]
Fig. 4. Bluespec SystemVerilog compiler structure

2) Modules and interfaces: In Bluespec SystemVerilog, a module's behavior is specified using a collection of rules. A module's interface is specified with a collection of methods. A rule in one module may invoke methods in other modules. Although rules may span multiple modules, they may be composed in a modular way with conditions and actions that are localized and encapsulated within each module. External users of a module only need to deal with the specified input and output. With such a design, a divide-and-conquer style can be easily supported.

3) Atomic actions and rules: Rules and action atomicity are a powerful tool for ensuring correctness. The behavior of a module can be understood as some sequential composition of rule firings. Each module's designer can locally encapsulate the correctness conditions for use of its interface and be assured that it cannot be used incorrectly from any context. The compiler checks the conditions statically to ensure the correctness of the design.

4) Fast compilation and simulation: The compiler of Bluespec SystemVerilog statically verifies user-specified assertions to ensure that the whole design is correct. It also supports abstract data types to preserve representation invariants, enhanced overloading, and bit-width constraints.

The Bluespec compiler generates both datapath and control hardware to implement rule behavior in parallel hardware. The core of the generated control hardware is a dynamic scheduler that governs the rule firings. The generated RTL can be sent to standard netlist synthesis and physical design tools. The timing and area are competitive with hand-written RTL for most designs. The compiler can also generate C code for fast simulation. Both kinds of output are cycle-accurate with respect to each other and can dump to standard VCD files. The C-based simulation is faster than Verilog simulation because it can exploit rule semantics to optimize the execution.

B. Atom-based design method

As in designing any complex system, we analyze and decompose the system into atomic operations. We then design units of combinational circuit, called atoms, for those different operations in a time-area-minimized way so as to get a better understanding of how far we can go. In addition to being the unit of optimization, atoms can be seen as the basic building blocks of our ASIC design.

Under such a design strategy, atoms are the main combinational part of the whole ASIC. After defining the atom, we then increase parallelism by increasing the number of atoms. Often, this can decrease the time-area-cycle product. The intuition is that there are inevitably some fixed costs, such as registers, that one needs to include at all levels of parallelism, and these parts consume energy during the whole course of the computation. Therefore, we can reduce total energy consumption if we can finish the computation earlier via, e.g., increasing parallelism.

By using Bluespec SystemVerilog, we can code once and generate many ASIC designs with different parameters. It lets us quickly see the trend of the time-area-cycle product and find the best trade-off point, a process usually referred to as architectural exploration.

C. Array-based design

Compared with atom-based design, array-based design can be seen as the alternative in which we increase parallelism inside an atom. In contrast to atom-based design, we first fill the system with a fixed number of atoms, called an array, and then increase the computational power of each atom. The idea is still to get the job done sooner and decrease the total number of cycles needed to finish the computation, thereby reducing the total energy consumption. However, the critical path may lengthen, resulting in a prolonged clock cycle and hence an increase in time-area-cycle product. We hope to find an optimal point before the marginal return begins to diminish.

D. Processor-based design

Eventually, we plan to have a programmable cryptographic processor that can support various cryptosystems via firmware. It can be separated into three parts: combinational logic, memory, and control logic. The control logic decides which function to perform and moves the data from memory to the combinational logic at the appropriate times. For different functions, the control logic is reconfigured appropriately.

We have implemented a preliminary processor that supports both the NTRUEncrypt and TTS schemes. Our strategy works as follows. First, we find the part in each atom-based ASIC that can be shared and combine them into a combinational logic

that can compute the basic operations of both PKCs. We then parallelize the atoms to form arrays of appropriate sizes that are suitable for both systems. Finally, we increase each atom's computational power until we find the best trade-off point in terms of time-area-cycle product.

Fig. 5. Processor-based design

V. RESULTS AND COMPARISON

We present the implementation results in this section. First, we show the implementation results of the ASIC designs of NTRUEncrypt and TTS, respectively. We then move to the processor-based design that supports both cryptosystems and compare its time-area-cycle product with that of the combined ASIC design. All results are obtained by synthesis with Synopsys Design Compiler in a 90 nm process. We note that we report all area information in the unit used by the synthesis tool, i.e., $\mu m^2$. To convert this number to gate count, one needs to divide the number by four.

A. NTRUEncrypt ASIC

In the NTRUEncrypt ASIC design, we compare the results of two approaches, atom-based and array-based designs. Atom-based design scales in one dimension (degree of parallelism), while array-based design adds an additional dimension of computational power. We compare the timing, area, and total cycle count under different parameter settings.

1) Atom-based approach: The bottleneck operation in encryption and decryption of NTRUEncrypt is the convolutional polynomial multiplication. We decompose the resources used to compute polynomial multiplication into sequential and combinational parts. The sequential part includes, e.g., registers for temporary storage and control. The combinational part is the part that actually carries out the computation. It can be parallelized to reduce the total cycle count, but doing so increases the circuit size and might also increase the length of the critical path.

The resulting circuit of the atom-based design is shown in Figure 6. We note that the modulo circuitry is not shown here for brevity. Also, the mux controls whether the registers will be updated with new values. An atom computes one term of the product of the two input polynomials in $N$ cycles. The whole polynomial multiplication needs $N^2/P$ cycles if $P$ atoms are used. We are interested in the total energy consumption of the encryption and decryption processes, so we use the time-area-cycle product as a figure of merit when comparing the resulting performance under different parameter settings.

[Figure: inputs $a_0, \ldots, a_4$ and $b_0, \ldots, b_4$ feed an adder and a MUX that produce $c_0$; further atoms produce $c_1$ and $c_2$.]
Fig. 6. Atom-based design diagram

[Plot: area ($\mu m^2$, logarithmic axis from 1e+06 to 1e+09) versus degree of parallelism (logarithmic axis from 1 to 1000).]
Fig. 7. Performance under different levels of parallelism for atom-based design

TABLE I
TIME-AREA-CYCLE PRODUCTS FOR ATOM-BASED DESIGN

#atom   Area ($\mu m^2$)   Timing   #cycle   Time-area-cycle product
2       275224             4.12     200      226,784,576
4       276908             4.22     101      118,023,728
32      303685             4.12     14       17,516,551
128     401246             4.14     5        8,305,792
199     432121             4.15     3        5,379,906

Figure 7 and Table I show that, as the number of atoms increases, the time-area-cycle product decreases rapidly. This is because the combinational part only occupies a small percentage of the total area, and as a result, a higher degree of parallelism helps reduce idle circuitry and hence lower total

1e+08

b0 b1 b2 ... bN-1 ) a0 2 1e+07 a1 Area (um a2 Adder c0 Tree ... 1e+06 aN-1 0.01 0.1 1 10 100 Degree of parallelism

energy consumption. Also, increasing the number of atoms does not affect the timing of the whole design because it does not increase the length of the critical path. We conclude that one should use as high a degree of parallelism as possible to obtain a better time-area-cycle product in atom-based design.

2) Array-based design: In atom-based design, more atoms lead to a better time-area-cycle product. So in array-based design, we first use a maximal number of atoms and then try to increase the computational power of each atom. We are interested to know whether this would improve the time-area-cycle figure of merit. Indeed, as we add P terms in a cycle, the total number of cycles needed reduces to 1/P. However, timing gets worse because we increase the height of the adder tree at the output of the circuit. There should be an optimal point beyond which increasing the computational power no longer leads to an improved time-area-cycle product.

Fig. 8. Array-based design diagram for an NTRUEncrypt ASIC design

With the help of Bluespec SystemVerilog, we generate four designs corresponding to adding 1, 3, 7, and 15 terms in a single cycle and summarize the results in Figure 9 and Table II. As expected, the combinational part grows as the computational power of each atom increases, and the length of the critical path also grows. As a result, the product increases even as the cycle count decreases. Overall, as the per-atom computational power increases, the length of the critical path grows more rapidly than the cycle count shrinks, so increasing the computational power does not help the time-area-cycle product.

Fig. 9. Performance under different levels of parallelism for array-based design

TABLE II
TIME-AREA-CYCLE PRODUCTS FOR ARRAY-BASED DESIGN

Depth   Area      Timing   #cycle   Time-area-cycle product
1       579721    4.27     1        2475409
3       1230304   6.83     1/3      2800992
7       2342892   10.78    1/7      3608054
15      4658949   14.42    1/15     4478803

Finally, we compare the performance of atom-based and array-based designs and find that the optimal trade-off point occurs with a maximal number of atoms, each having the minimal computational power of a single-layer adder.

B. TTS ASIC

As evaluation of a quadratic map can be computed using linear-system evaluation as a basic building block, the bottleneck computation of TTS is linear-system evaluation and solving. In this section, we first present the ASIC implementation of Gaussian elimination, the state-of-the-art linear system solver, as well as our attempt at using the Wiedemann algorithm to solve linear systems. To support both encryption and decryption, we use matrix-vector multiplication to maximize the degree of resource reuse. Finally, as in the previous section, we discuss the performance under various parameter settings such as parallelism in terms of time-area-cycle product and find the optimal trade-off point in the design space.

1) Linear system solvers: Systolic Gaussian elimination has been widely used for solving systems of linear equations over the reals or a finite field. It is an optimal design in terms of time-area product. However, it requires a dedicated combinational circuit that cannot be easily reused for other computations.

We investigate the Wiedemann algorithm for solving linear systems of equations. With Wiedemann, linear-system solving can be decomposed into matrix-vector multiplication, which is ideal for reuse in other parts of the computation of a multivariate PKC. However, such reuse comes at a price: the computational complexity is twice as high as that of Gaussian elimination. We are interested in finding out how much we can save by using a reuse-friendly algorithm for linear-system solving.

2) Matrix-based design: The design of a scalable matrix-vector multiplication is similar to NTRUEncrypt’s array-based design. We use 20 atoms, which is dictated by the number of variables in the target TTS algorithm. We then scale the number of multipliers in each atom to control the degree of parallelism.
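The scaling of multipliers per atom can be sketched in software as follows. This is a minimal model under stated assumptions, not the Bluespec design: the names `atom_inner_product` and `matvec` are hypothetical, arithmetic modulo 31 is assumed from the datapath description of the processor-based design, and each atom is assumed to hold d multipliers whose outputs feed an adder tree.

```python
P = 31  # assumed modulus, taken from the datapath description
N = 20  # number of atoms, equal to the number of TTS variables

def atom_inner_product(row, vec, d):
    """Model one atom: accumulate an inner product, d terms per cycle."""
    acc, cycles = 0, 0
    for start in range(0, len(row), d):
        # d multipliers work in parallel; an adder tree sums their outputs.
        partial = sum(a * b for a, b in zip(row[start:start + d],
                                            vec[start:start + d]))
        acc = (acc + partial) % P
        cycles += 1
    return acc, cycles

def matvec(matrix, vec, d):
    """Model the array of atoms: one atom per matrix row, run in lock-step."""
    results = [atom_inner_product(row, vec, d) for row in matrix]
    out = [value for value, _ in results]
    cycles = max(count for _, count in results)
    return out, cycles

if __name__ == "__main__":
    # Hypothetical test data; any 20x20 matrix over GF(31) would do.
    matrix = [[(7 * i + 3 * j + 1) % P for j in range(N)] for i in range(N)]
    vec = [(5 * j + 2) % P for j in range(N)]
    for d in (1, 2, 4, 5, 10):
        _, cycles = matvec(matrix, vec, d)
        print(d, cycles)
```

With d = 1, 2, 4, 5, and 10 this model needs 20, 10, 5, 4, and 2 accumulation cycles. Table III lists one more cycle in each case, which the sketch does not attempt to model.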

We generate five designs corresponding to five different levels of computational power, capable of computing the inner product of two vectors of dimensions 1, 2, 4, 5, and 10. The inputs are multiplied in parallel before being sent to the final adder tree for summation. The critical path consists of that of the multiplier as well as the final adder tree.

Fig. 10. Performance under different levels of parallelism for matrix-based design (axes: area in um^2 versus degree of parallelism)

TABLE III
TIME-AREA-CYCLE PRODUCT FOR MATRIX-BASED DESIGN

Dimension   Area     Timing   #cycle   Time-area-cycle product
1           222013   9.07     21       60616125
2           254413   9.45     11       36470858
4           330818   10.31    6        26473853
5           363404   11.26    5        25955876
10          564705   14.22    3        28300857

As shown in Figure 10 and Table III, the result is quite different from that of NTRUEncrypt. Recall that the performance improvement of the array-based NTRUEncrypt design decreases as the computational power of each atom increases. However, for the matrix-based TTS design, the optimal trade-off point occurs when each atom has a medium level of computational power. This is because the delay of the multiplier is larger than that of the final adder tree, so it makes sense to fill the gap by adding more partial products in each atom. If we used a Wallace adder tree in our atom design, this ratio might change, and so would the optimal trade-off point.

C. Processor-based design

In the processor-based design, we first classify circuits into combinational logic, registers, and control logic, as in a typical processor design. We use the standard general-purpose register file construction to implement the registers, while control logic is implemented as a finite-state machine. We then combine the NTRUEncrypt and TTS ASIC designs and reuse the common components as much as possible. As shown in Figure 11, for combinational logic, we implement a datapath unit that can compute either a multiplication and an addition modulo 31, or six additions modulo Q. We duplicate 20 such datapath units; this is the maximal degree of parallelism allowed by the target TTS scheme.

Fig. 11. Processor-based design that supports both NTRUEncrypt and TTS (diagram labels: mod 31, 9 bits, MUX, adder tree, mod 307)

TABLE IV
COMBINATIONAL LOGIC COMPARISON

                  Combinational area   Sequential area   Timing
NTRUEncrypt       56280                                  7.16
TTS               41520                532041            7.16
Processor-based   72660                                  7.97

Table IV shows that by slightly increasing the combinational logic, and therefore sacrificing some timing performance, we can obtain a circuit that efficiently supports two cryptosystems. The resulting circuit is about 18% smaller than the two ASIC designs combined while delivering satisfactory performance in terms of power consumption.

D. Verilog versus Bluespec SystemVerilog

Finally, we evaluate the quality of the RTL code generated by Bluespec SystemVerilog by comparing its performance to hand-optimized RTL code. From Table V, we can see that the RTL code generated by Bluespec SystemVerilog is in general only 2 to 3 times worse than hand-optimized RTL code for the same designs. Therefore, we conclude that Bluespec SystemVerilog is suitable for early design and automatic architectural exploration. After preliminary studies find an optimal architecture, an experienced RTL designer can then come in and further optimize the design by hand.

TABLE V
BLUESPEC GENERATED VS. HAND-OPTIMIZED RTL CODE

              Bluespec SystemVerilog   Hand-optimized RTL
NTRUEncrypt   579721                   176389
TTS           338245                   172000

Lastly, we compare our implementation results with those found in the literature in Table VI.

VI. CONCLUSION

In this paper, we report our experience implementing a scalable ASIC design for both NTRUEncrypt and TTS. We also use Bluespec SystemVerilog to generate parameterized designs and find the optimal trade-off points under different parameter settings. By analyzing the time-area-cycle products of these designs, the optimal degree of parallelism can be determined from an architectural design viewpoint.
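Selecting the optimal degree of parallelism from time-area-cycle products can be illustrated with the array-based figures of Table II. The following is a minimal sketch rather than the authors' tooling: the function names are hypothetical, and the depth-7 timing is assumed to be 10.78, the value that reproduces the product reported for that row.

```python
from fractions import Fraction

# Rows of Table II: (depth, area, timing, relative cycle count).
# Assumption: the depth-7 timing is taken as 10.78, the value that
# reproduces the time-area-cycle product reported for that row.
TABLE_II = [
    (1, 579721, 4.27, Fraction(1, 1)),
    (3, 1230304, 6.83, Fraction(1, 3)),
    (7, 2342892, 10.78, Fraction(1, 7)),
    (15, 4658949, 14.42, Fraction(1, 15)),
]

def time_area_cycle(area, timing, cycles):
    """Figure of merit used in the text: area x timing x cycle count."""
    return area * timing * float(cycles)

def best_depth(rows):
    """Return the depth whose time-area-cycle product is smallest."""
    return min(rows, key=lambda r: time_area_cycle(*r[1:]))[0]

if __name__ == "__main__":
    for depth, area, timing, cycles in TABLE_II:
        print(depth, round(time_area_cycle(area, timing, cycles)))
    print("best depth:", best_depth(TABLE_II))
```

Under these inputs the smallest product belongs to depth 1, which agrees with the observation that a single-layer adder per atom is the best trade-off for the array-based design.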

TABLE VI
COMPARISON OF RELATED WORKS (HERE CYCLE COUNTS FOR ENCRYPTION ARE GIVEN)

Cryptosystem   Implementation                Gate count   Clock (ns)   Cycle count   Time-area-cycle product
RSA            Kwon et al. [22]              156000       20           1100000       3.43×10^12
NTRUEncrypt    Atici et al. [11]             10500        2000         28390         5.96×10^11
NTRUEncrypt    This work                     144930       4.27         398           2.46×10^8
TTS            Yang et al. [14]              17000        10000        4400          7.48×10^11
TTS            Balasubramanian et al. [16]   63593        14.9         804           7.63×10^8
TTS            Tang et al. [18]              150000       20           198           5.94×10^8
TTS            This work                     90851        11.25        550           5.62×10^8
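As a worked check of the last column of Table VI, the following arithmetic assumes that the time-area-cycle product is the plain product of the three preceding columns, which the listed values are consistent with.

```latex
\[
  \text{time-area-cycle product}
  = \text{gate count} \times \text{clock (ns)} \times \text{cycle count}
\]
\begin{align*}
  \text{RSA, Kwon et al. [22]:}  &\quad 156000 \times 20 \times 1100000 \approx 3.43 \times 10^{12} \\
  \text{NTRUEncrypt, this work:} &\quad 144930 \times 4.27 \times 398 \approx 2.46 \times 10^{8} \\
  \text{TTS, this work:}         &\quad 90851 \times 11.25 \times 550 \approx 5.62 \times 10^{8}
\end{align*}
```

By this measure, both designs in this work are roughly four orders of magnitude below the RSA entry.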

We also present a combined design that supports both NTRUEncrypt and TTS. Its design is similar to that of a processor. In such a processor-based design, not only registers but also combinational circuits are shared among the supported cryptosystems to achieve a maximal degree of resource reuse. The resulting design is more similar to a multi-core 8-bit processor than to a single-core 16- or 32-bit processor, a conclusion that would not have been as straightforward to reach without the knowledge gained from our architectural exploration.

After combining NTRUEncrypt and TTS into a single system, the logical next step is to add support for more cryptosystems. Our goal is to secure M2M systems with strong cryptography by means of hardware acceleration. Also, to support a larger number of cryptosystems, a compiler is needed to generate the complex firmware. In the end, we hope that we will be able to demonstrate the feasibility of using hardware-based PKC to provide general data security in M2M applications.

ACKNOWLEDGMENT

This work was supported by National Science Council, National Taiwan University and Intel Corporation under Grants NSC101-2911-I-002-001 and NTU102R7501.

REFERENCES

[1] P. W. Shor, “Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer,” SIAM J. Comput., vol. 26, no. 5, pp. 1484–1509, Oct. 1997. [Online]. Available: http://dx.doi.org/10.1137/S0097539795293172
[2] A. Perrig, J. Stankovic, and D. Wagner, “Security in wireless sensor networks,” Commun. ACM, vol. 47, no. 6, pp. 53–57, Jun. 2004. [Online]. Available: http://doi.acm.org/10.1145/990680.990707
[3] Y. Zhou, Y. Fang, and Y. Zhang, “Securing wireless sensor networks: a survey,” Commun. Surveys Tuts., vol. 10, no. 3, pp. 6–28, Jul. 2008. [Online]. Available: http://dx.doi.org/10.1109/COMST.2008.4625802
[4] C. Karlof, N. Sastry, and D. Wagner, “TinySec: a link layer security architecture for wireless sensor networks,” in Proceedings of the 2nd international conference on Embedded networked sensor systems, ser. SenSys ’04. New York, NY, USA: ACM, 2004, pp. 162–175. [Online]. Available: http://doi.acm.org/10.1145/1031495.1031515
[5] R. Watro, D. Kong, S.-f. Cuti, C. Gardiner, C. Lynn, and P. Kruus, “TinyPK: securing sensor networks with public key technology,” in Proceedings of the 2nd ACM workshop on Security of ad hoc and sensor networks, ser. SASN ’04. New York, NY, USA: ACM, 2004, pp. 59–64. [Online]. Available: http://doi.acm.org/10.1145/1029102.1029113
[6] D. J. Malan, M. Welsh, and M. D. Smith, “Implementing public-key infrastructure for sensor networks,” ACM Trans. Sen. Netw., vol. 4, no. 4, pp. 22:1–22:23, Sep. 2008. [Online]. Available: http://doi.acm.org/10.1145/1387663.1387668
[7] J. Ding and B.-Y. Yang, “Multivariate public key cryptography,” in Post-Quantum Cryptography, D. J. Bernstein, J. Buchmann, and E. Dahmen, Eds. Springer-Verlag, 2009, pp. 193–241.
[8] D. V. Bailey, D. Coffin, A. Elbirt, J. H. Silverman, and A. D. Woodbury, “NTRU in constrained devices,” in Proceedings of the Third International Workshop on Cryptographic Hardware and Embedded Systems, ser. CHES ’01. London, UK: Springer-Verlag, 2001, pp. 262–272. [Online]. Available: http://dl.acm.org/citation.cfm?id=648254.752549
[9] J. Hoffstein, J. Pipher, and J. H. Silverman, “NTRU: A ring-based public key cryptosystem,” in Proceedings of the Third International Symposium on Algorithmic Number Theory, ser. ANTS-III. London, UK: Springer-Verlag, 1998, pp. 267–288. [Online]. Available: http://dl.acm.org/citation.cfm?id=648184.749737
[10] C. M. O’Rourke, “Efficient NTRU implementations,” Master’s thesis, Worcester Polytechnic Institute, Worcester, MA, USA, 2002.
[11] A. C. Atici, L. Batina, J. Fan, I. Verbauwhede, and S. Berna Ors Yalcin, “Low-cost implementations of NTRU for pervasive security,” in Proceedings of the 2008 International Conference on Application-Specific Systems, Architectures and Processors, ser. ASAP ’08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 79–84. [Online]. Available: http://dx.doi.org/10.1109/ASAP.2008.4580158
[12] J. Ding and D. Schmidt, “Rainbow, a new multivariable polynomial signature scheme,” in Proceedings of the Third international conference on Applied Cryptography and Network Security, ser. ACNS’05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 164–175. [Online]. Available: http://dx.doi.org/10.1007/11496137_12
[13] B.-Y. Yang, J.-M. Chen, and Y.-H. Chen, “TTS: High-speed signatures on a low-cost smart card,” in Cryptographic Hardware and Embedded Systems - CHES 2004: 6th International Workshop Cambridge, MA, USA, August 11-13, 2004. Proceedings, ser. Lecture Notes in Computer Science, vol. 3156. Springer, 2004, pp. 371–385.
[14] B.-Y. Yang, C.-M. Cheng, B.-R. Chen, and J.-M. Chen, “Implementing minimized multivariate PKC on low-resource embedded systems,” in Proceedings of the Third international conference on Security in Pervasive Computing, ser. SPC’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 73–88. [Online]. Available: http://dx.doi.org/10.1007/11734666_7
[15] A. I.-T. Chen, M.-S. Chen, T.-R. Chen, C.-M. Cheng, J. Ding, E. L.-H. Kuo, F. Y.-S. Lee, and B.-Y. Yang, “SSE implementation of multivariate PKCs on modern x86 CPUs,” in Proceedings of the 11th International Workshop on Cryptographic Hardware and Embedded Systems, ser. CHES ’09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 33–48. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-04138-9_3
[16] S. Balasubramanian, A. Bogdanov, A. Rupp, J. Ding, and H. W. Carter, “Fast multivariate signature generation in hardware: The case of Rainbow,” in Proceedings of the 2008 16th International Symposium on Field-Programmable Custom Computing Machines, ser. FCCM ’08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 281–282. [Online]. Available: http://dx.doi.org/10.1109/FCCM.2008.52
[17] A. Bogdanov, T. Eisenbarth, A. Rupp, and C. Wolf, “Time-area optimized public-key engines: MQ-cryptosystems as replacement for elliptic curves?” in Proceedings of the 10th international workshop on Cryptographic Hardware and Embedded Systems, ser. CHES ’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 45–61. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-85053-3_4
[18] S. Tang, H. Yi, J. Ding, H. Chen, and G. Chen, “High-speed hardware implementation of Rainbow signature on FPGAs,” in Proceedings of the 4th international conference on Post-Quantum Cryptography, ser. PQCrypto’11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 228–243. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-25405-5_15
[19] D. H. Wiedemann, “Solving sparse linear equations over finite fields,” IEEE Trans. Inf. Theor., vol. 32, no. 1, pp. 54–62, Jan. 1986. [Online]. Available: http://dx.doi.org/10.1109/TIT.1986.1057137

[20] J. Massey, “Shift-register synthesis and BCH decoding,” Information Theory, IEEE Transactions on, vol. 15, no. 1, pp. 122–127, Jan. 1969.
[21] R. Nikhil, “Bluespec System Verilog: efficient, correct RTL from high level specifications,” in Formal Methods and Models for Co-Design, 2004. MEMOCODE ’04. Proceedings. Second ACM and IEEE International Conference on, Jun. 2004, pp. 69–70.
[22] T.-W. Kwon, C.-S. You, W.-S. Heo, Y.-K. Kang, and J.-R. Choi, “Two implementation methods of a 1024-bit RSA cryptoprocessor based on modified Montgomery algorithm,” in Circuits and Systems, 2001. ISCAS 2001. The 2001 IEEE International Symposium on, vol. 4, May 2001, pp. 650–653.

Ming-Shing Chen is a PhD student in Electronic Engineering, National Taiwan University. His work was mainly on fast arithmetic over finite fields applied to error-correcting codes and public-key cryptography. He is now interested in solving multivariate equations over finite fields and hardware architecture design for arithmetic over finite rings/fields.

Jie-Ren Shih received his BS and MS in Electrical Engineering from National Taiwan University. His main research interests include electronic system-level design and the design of countermeasures for security circuits.

Bo-Yin Yang received his PhD in Mathematics from Massachusetts Institute of Technology in 1991. After teaching mathematics at Tamkang University in Taiwan, he started working with cryptography in 2002. He moved to the Institute of Information Science at Academia Sinica in 2006. Bo-Yin is known for his work on efficient crypto implementations, algebraic cryptanalysis, and post-quantum public-key cryptography.

Yongbo Hu was born in 1988 and received his BA in Microelectronics from Fudan University. Since 2010, he has been an MA candidate in Microelectronics at Fudan University. His main research interests include the design of attacks on and countermeasures for security circuits, as well as post-quantum cryptography.

Ming-Chun Hsiao was born in Taiwan in 1987. He received his BS degree in Electrical Engineering from National Taiwan University, Taipei, Taiwan, in 2010. He is currently working toward his MS degree at the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan. His research interests are in the areas of VLSI implementation of cryptography algorithms and digital communication systems.

An-Yeu (Andy) Wu (SM’12) is a Professor of the Department of Electrical Engineering and the Graduate Institute of Electronics Engineering at National Taiwan University. Dr. Wu is now serving as an Associate Editor for the Journal of Signal Processing Systems (JSPS), and acted as the Lead Guest Editor of the Special Issue of “2010 IEEE Workshop on Signal Processing Systems (SiPS)” in JSPS, which was published in Nov. 2011. He also served on the technical program committees of many major IEEE International Conferences, such as SiPS, AP-ASIC, ISCAS, ISPACS, ICME, SOCC, and A-SSCC. He is now serving as the Chair of the VLSI Systems and Architectures (VSA) Technical Committee in the IEEE Circuits and Systems (CAS) Society. From August 2007 to Dec. 2009, he was on leave from NTU and served as the Deputy General Director of the SoC Technology Center (STC), Industrial Technology Research Institute (ITRI), Hsinchu, Taiwan, supervising the Parallel Core Architecture (PAC) VLIW DSP Processor and Multicore/Android SoC platform projects. In 2010, Dr. Wu received the “Outstanding EE Professor Award” from The Chinese Institute of Electrical Engineering (CIEE), Taiwan. Starting from Aug. 2012, Dr. Wu is serving as the Deputy Director of the Graduate Institute of Electronics Engineering (GIEE) of National Taiwan University.

Wen-Chung Shen was born in Taiwan, R.O.C., in 1980. He received his BS degree from National Taiwan University, Taiwan, in Electronic Engineering in 2002. He received an MS degree from National Taiwan University of Science and Technology, Taiwan, in Electronic Engineering in 2004. He is currently pursuing a PhD degree at the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan. His research fields include architecture and algorithm design for digital signal processing. His research interests are in the areas of system-on-chip, security, and low power designs.

Chen-Mou Cheng received his BS and MS in Electrical Engineering from National Taiwan University in 1996 and 1998, respectively, and his PhD in Computer Science from Harvard University in 2007. He joined the Department of Electrical Engineering of National Taiwan University in 2007, where he is currently an Assistant Professor. His main research area is in cryptographic hardware and embedded systems (CHES), as well as electronic system-level (ESL) design. Currently, his main research activities focus on the design and analysis of efficient algorithms to solve several important problems arising from cryptology, as well as the development and implementation of these algorithms on massively parallel computers. These problems include solving systems of polynomial equations over finite fields, integer factorization, elliptic-curve discrete logarithm, and lattice reduction.