IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 3, NO. 1, MARCH 2013

Securing M2M With Post-Quantum Public-Key Cryptography

Jie-Ren Shih, Yongbo Hu, Ming-Chun Hsiao, Ming-Shing Chen, Wen-Chung Shen, Bo-Yin Yang, An-Yeu Wu, Senior Member, IEEE, and Chen-Mou Cheng

Abstract—In this paper, we present an ASIC implementation of two post-quantum public-key cryptosystems (PKCs): NTRUEncrypt and TTS. It represents a first step toward securing machine-to-machine (M2M) systems using strong, hardware-assisted PKC. In contrast to the conventional wisdom that PKC is too “expensive” for M2M sensors, it actually can lower the total cost of ownership because of cost savings in provision, deployment, operation, maintenance, and general management. Furthermore, PKC can be more energy-efficient because PKC-based security protocols usually involve less communication than their symmetric-key-based counterparts, and communication is getting relatively more and more expensive compared with computation. More importantly, recent algorithmic advances have brought several new PKCs, NTRUEncrypt and TTS included, that are orders of magnitude more efficient than traditional PKCs such as RSA. It is therefore our primary goal in this paper to demonstrate the feasibility of using hardware-based PKC to provide general data security in M2M applications.

Index Terms—Bluespec SystemVerilog, lattice-based cryptography, multivariate cryptography.

Manuscript received August 06, 2012; revised December 31, 2012; accepted January 05, 2013. Date of current version March 07, 2013. This work was supported by the National Science Council, National Taiwan University and Intel Corporation under Grant NSC101-2911-I-002-001 and Grant NTU102R7501. This paper was recommended by Guest Editor F. Koushanfar.
The authors are with Intel-NTU Connected Context Computing Center, National Taiwan University, Taipei 10617, Taiwan.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JETCAS.2013.2244772
2156-3357/$31.00 © 2013 IEEE

I. INTRODUCTION

CRYPTOGRAPHY is the foundation of data security. There are mainly two kinds of cryptography in use today, symmetric-key and public-key cryptography. In the former, the communicating parties are assumed to share one or more secret keys a priori. How they can establish such a shared secret is often referred to as the key-exchange problem. This problem is challenging not only from a technical but also from a managerial point of view, as we will need to manage $O(n^2)$ keys in a network of size $n$.

Public-key cryptography (PKC), on the other hand, provides an elegant solution to the key-exchange problem. With PKC, key management becomes straightforward. A user can encrypt a short-lived session key using the communicating party’s public key and simply send out the encrypted key. PKC ensures that only the holder of the corresponding private key can decrypt and obtain the session key. Furthermore, PKC can provide digital signatures, which, like a person’s signature, provide an efficient means of authentication. As a result, PKC proliferates in today’s Internet age and permeates many aspects of our daily life, ranging from communication to electronic commerce.

However, there is an emerging threat to the prevailing PKCs due to the recent development of quantum computers. The security of RSA, currently the most popular PKC, depends on the difficulty of the integer factorization problem, while that of ECC, the runner-up PKC, depends on the discrete logarithm problem. As Shor has shown, both of them would be solved by large quantum computers in polynomial time [1]. Such a threat is more relevant in the machine-to-machine (M2M) context, as these systems tend to operate over a long period of time, and we certainly should take precautions against such a catastrophic attack, even though it might only happen in the distant future.

Four main families of approaches make up the so-called “post-quantum cryptography:” lattice-based cryptography, multivariate cryptography, hash-based signatures, and code-based cryptography. In this paper, we focus on NTRUEncrypt, a lattice-based cryptosystem, and TTS, a multivariate cryptosystem, as candidates in system development.

A. Previous Attempts of Securing M2M Systems

As networked machines become more common in our living environment, information security on these devices becomes an important issue. Traditionally, PKC is regarded as too expensive to deploy in M2M systems. Typical M2M systems only have limited computational power, making deploying strong cryptography on them extremely challenging.

There have been numerous proposals from the academic research community on how to secure M2M systems [2], [3]. Most of them use software-based symmetric-key cryptography. For example, TinySec provides link-layer security for sensor networks using a software implementation of symmetric-key cryptosystems [4]. In many proposals, more bits need to be sent over the air to achieve a certain level of security, so using hardware accelerators may not necessarily help in these cases [2]. The same functionality can be achieved by PKC in a more communication-efficient way. This is becoming more attractive as computation is getting cheaper in terms of hardware cost and energy consumption, while wireless communication is less so at the same time. As a result, communication is becoming more expensive compared with computation, not to mention that the spectrum will become one of the scarcest resources when billions of M2M sensors are deployed and trying to send out their readings over the air. In this case, it is advantageous to use PKC on sensors for the sake of reducing communication cost.

Lastly, there have been several attempts at employing software-based PKC to secure inter-sensor communication [5], [6]. People have demonstrated that it is possible to run PKC on sensors with acceptable performance. We believe that this is the right direction to pursue, and we plan to take it further by hardware acceleration.

B. Contributions

Our approach is to provide a foundation for information security using hardware-assisted PKC. Specifically, we plan to design and implement a complete, proof-of-concept PKC-based system. We choose two types of PKCs to support. First, multivariate cryptosystems execute much faster than traditional cryptosystems on the same hardware, making them ideal for securing sensors in M2M systems [7]. Specifically, we support the (24,20,20) variant of TTS over a finite field, which takes a 200-bit message digest and produces a 320-bit signature, providing a security level of about 80 bits. Second, we include lattice-based cryptosystems such as NTRUEncrypt to provide for key exchange [8]. Specifically, we support the ees397ep1 variant of NTRUEncrypt, which encrypts a plaintext of up to 397 bits and produces a ciphertext 3573 bits long, providing a security level of about 128 bits. These cryptosystems are also future-proof in the sense that they can defend against attacks by thousand-qubit quantum computers, which might emerge in the next few decades. Based on these primitives, we can implement security protocols and services like multi-way authentication, etc.

The main contributions of this paper include the following.
• We present an efficient hardware design that supports two post-quantum PKCs, namely, NTRUEncrypt and the TTS signature scheme. Our approach allows reuse not only of sequential but also of combinational circuits, resulting in a much more compact design than if the two were implemented separately.
• By using the high-level design tool Bluespec SystemVerilog, we are able to extensively explore the architectural design space, including experimenting with an iterative linear system solver, which, to the best of our knowledge, has not been investigated for solving such small systems.
• We identify the designs that provide the best trade-off between time, area, and total cycle count in order to minimize total energy consumption. This is especially important for M2M sensors, for many of them run on limited energy sources such as batteries.

Fig. 1. Convolutional polynomial multiplication in NTRUEncrypt.

Fig. 2. Architecture for systolic Gaussian elimination.

C. Organization

The rest of this paper is organized as follows. In Sections II and III, we give the details of the implemented algorithms, NTRUEncrypt and TTS, respectively. The hardware design of these algorithms is also described at the end. We show our implementation strategy for designing the ASIC in Section IV and compare the implementation results in Section V. Specifically, we will show and compare side-by-side the results obtained by a high-level synthesis tool, Bluespec SystemVerilog, against those obtained by the more traditional hand-optimized RTL-based design. Finally, we conclude this paper by giving a few future directions of work in Section VI.

II. NTRUENCRYPT

NTRUEncrypt is a lattice-based cryptosystem whose security is based on the hardness of the shortest vector problem in high-dimensional Euclidean lattices [9]. The main operations in NTRUEncrypt involve arithmetic in a polynomial ring $R = \mathbb{Z}[X]/(X^N - 1)$. The addition in this ring is straightforward polynomial addition, while the multiplication in this ring is convolutional, as shown in Fig. 1. All polynomials in the ring have integral coefficients (modulo some integers), and their degrees are at most $N - 1$, so a typical element can be represented as $a = a_0 + a_1 X + \cdots + a_{N-1} X^{N-1}$. NTRUEncrypt is parameterized by three parameters, $N$, $p$, and $q$, which satisfy the following conditions.
• $N$ is a prime number such that the maximal degree for all polynomials in the ring $R$ is $N - 1$.
• $p$ and $q$ are two possible moduli for the coefficients of the polynomials in $R$, with $\gcd(p, q) = 1$ and $p \ll q$.
After arithmetic operations in $R$, the coefficients of the polynomials need to be reduced either modulo $p$ or modulo $q$.

A. Operations

NTRUEncrypt consists of three parts: key generation, encryption, and decryption. In this paper, we only focus on the implementation of encryption and decryption; key generation does not happen frequently and hence is often done offline. Here, we focus on accelerating online operations that are mostly executed on M2M systems. To make this paper self-contained, we still include a brief description of the key-generation operation below.

Key Generation: A public key $h$ and a private key $(f, f_p)$ are generated as follows.
• Randomly choose a polynomial $f$ with coefficients reduced modulo $p$.
• Randomly choose a polynomial $g$ with coefficients reduced modulo $p$.
• Compute the private key $f_p$ as the inverse polynomial of $f$ modulo $p$.
• Compute the private key $f_q$ as the inverse polynomial of $f$ modulo $q$.
• Compute the public key $h = p f_q * g \bmod q$.
Here, the symbol $*$ stands for the multiplication in the NTRUEncrypt ring, i.e., polynomial multiplication modulo $X^N - 1$. If any of the polynomials is not invertible, then we just start over and repeat until we succeed.

1) Encryption: The ciphertext $e$ is computed from the public key $h$, a random polynomial $r$, and the message $m$ as

$$e = r * h + m \bmod q.$$

2) Decryption: The decryption procedure has three steps.
• Compute $a = f * e \bmod q$.
• Shift the coefficients of $a$ to the range $[-q/2, q/2)$ and then reduce them modulo $p$.
• Compute $m = f_p * a \bmod p$.
It is easy to see why decryption works. By appropriate rewriting, we can see that we are performing the following computation:

$$a = f * e = f * (r * h + m) = f * (r * p f_q * g + m) = p\, r * g + f * m \pmod{q}.$$

Instead of choosing the coefficients of $a$ in $[0, q - 1]$, they are mapped to the interval $[-q/2, q/2]$ so that the original message can be recovered: reducing modulo $p$ then removes the term $p\, r * g$, and multiplying by $f_p$ yields $m$.

Fig. 3. Modules, interfaces, and methods in Bluespec SystemVerilog.

Fig. 4. Bluespec SystemVerilog compiler structure.

Fig. 5. Processor-based design.

B. Previous Attempts

We have found several software and hardware implementations of NTRUEncrypt in the literature; here we only discuss hardware implementations. O’Rourke presented a hardware design for accelerating NTRUEncrypt’s core operation of polynomial multiplication in his master’s thesis [10]. It has a gate count of at least 1483 gates, but the design is not optimized for low cost, and no power consumption is reported. Bailey et al. presented an FPGA design that targets low-cost systems [8]. The design uses approximately 60 000 gates on a Xilinx Virtex 1000EFG860 FPGA. Atici et al. presented an implementation geared toward radio-frequency identification (RFID) and low-end sensors [11]. We note that although there have been several attempts that target compact implementation, which is quite suitable for use in M2M systems, our design in this paper targets not only a low gate count but also circuit reuse to support signature schemes such as TTS, which we will describe in the next section.

III. MULTIVARIATE CRYPTOGRAPHY

Multivariate PKC is a kind of PKC whose trapdoor one-way function takes the form of a multivariate polynomial map over a finite field [7]. In such constructions, the public key is given by a set of polynomials

$$P = \bigl(p_1(w_1, \dots, w_n), \dots, p_m(w_1, \dots, w_n)\bigr),$$

where each $p_k$ is a nonlinear polynomial in $w = (w_1, \dots, w_n)$ with

$$p_k(w) = \sum_{1 \le i \le j \le n} a^{(k)}_{ij} w_i w_j + \sum_{i=1}^{n} b^{(k)}_{i} w_i + c^{(k)},$$

all coefficients and variables in $K = \mathbb{F}_q$.

The evaluation of these polynomials at any given value corresponds to either the encryption or the verification procedure. A generic way to construct trapdoors for multivariate cryptosystems is to compose two affine maps before and after a quadratic polynomial map with a special structure. In this case, the trapdoor information consists of the affine maps as well as the central map, with which it is easy to invert the composed public map that otherwise looks random. The difficulty of the one-way function is based on inverting a multivariate quadratic map, which is equivalent to solving a set of quadratic equations over a finite field.

Definition 1 (Multivariate Quadratic Problem): Solve the system $p_1(x) = p_2(x) = \cdots = p_m(x) = 0$, where each $p_i$ is quadratic in $x = (x_1, \dots, x_n)$. All coefficients and variables are in $K = \mathbb{F}_q$.

The multivariate quadratic problem has been proved to be NP-complete.

The central map of a multivariate PKC maps elements from $K^n$ to $K^m$, while the affine transforms map elements from $K^n$ to $K^n$ and from $K^m$ to $K^m$.

The designs of central maps can mainly be categorized into two classes. One class consists of small-field schemes, including rather conservative schemes such as unbalanced Oil and Vinegar (UOV), as well as more aggressively designed proposals such as Rainbow and TTS [12]. The other class consists of big-field schemes, such as hidden field equations (HFEs) and Matsumoto-Imai (MIA).

A. General Structure

Extant multivariate PKCs almost always hide the private map $Q$ via composition with two affine maps $S$ and $T$. So, $P = T \circ Q \circ S : K^n \to K^m$, or

$$P : w = (w_1, \dots, w_n) \xrightarrow{\;S\;} x \xrightarrow{\;Q\;} y \xrightarrow{\;T\;} z = (z_1, \dots, z_m).$$

In any scheme, the central map $Q$ belongs to a class of quadratic maps whose inverse can be computed relatively easily. The maps $S$ and $T$ are affine and of full rank. The key to a multivariate PKC is the design of the central map.

To sign a block $z$, one computes $w = P^{-1}(z)$ step by step: $y = T^{-1}(z)$, $x = Q^{-1}(y)$, and $w = S^{-1}(x)$. To verify a signature, one simply computes $z = P(w)$. The public key contains only the composed quadratic multivariate polynomials. The private key contains the details of $S$ and $T$, as well as the central map $Q$. Different multivariate PKC schemes usually differ only in $Q$.

B. TTS and Related Schemes

The design of TTS uses the UOV scheme as the main operation to construct the central map. By wrapping the input multiple times using UOV, we can construct a multi-layer nonlinear map. Such a construction is known as the Rainbow scheme, of which TTS is a special case. In particular, the central maps of TTS are constructed by using two layers of UOV and have a special sparse form.

1) Oil and Vinegar Schemes: Suppose $v$ is an integer with $0 < v < n$, and let $o = n - v$. The variables $x_1, \dots, x_v$ are termed “vinegar variables,” while $x_{v+1}, \dots, x_n$, “oil variables.” Take the map $Q$ with form $y = (q_1(x), \dots, q_o(x))$, where

$$q_k(x) = \sum_{i=1}^{v} \sum_{j=i}^{n} \alpha^{(k)}_{ij} x_i x_j .$$

The original Oil and Vinegar scheme has $o = v$. When $v > o$, it becomes the Unbalanced Oil and Vinegar (UOV) signature scheme. If we have a UOV structure, then the quadratic part of each component in the central map from $x$ to $y$, when expressed as a symmetric matrix, looks like

$$M_k = \begin{pmatrix}
\alpha^{(k)}_{11} & \cdots & \alpha^{(k)}_{1v} & \alpha^{(k)}_{1,v+1} & \cdots & \alpha^{(k)}_{1n} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\alpha^{(k)}_{v1} & \cdots & \alpha^{(k)}_{vv} & \alpha^{(k)}_{v,v+1} & \cdots & \alpha^{(k)}_{vn} \\
\alpha^{(k)}_{v+1,1} & \cdots & \alpha^{(k)}_{v+1,v} & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\alpha^{(k)}_{n1} & \cdots & \alpha^{(k)}_{nv} & 0 & \cdots & 0
\end{pmatrix},$$

or for short, $\begin{pmatrix} * & * \\ * & 0 \end{pmatrix}$.

From the shape of the matrix, we can see that every quadratic term of the system involves at least one variable from a subset of the variables rather than an arbitrary pair of variables. These variables are called the vinegar part. By fixing the values of these variables, the quadratic terms of the system vanish. We can then solve the resulting system of linear equations to find the values of the oil variables.

2) Rainbow Scheme: By stacking several layers of UOV together for an invertible central map, we arrive at Rainbow-type constructions. For $0 < v_1 < v_2 < \cdots < v_{u+1} = n$, the central map is

$$Q : x \mapsto y = \bigl(q_{v_1+1}(x), \dots, q_n(x)\bigr),$$

where each $q_k$ has the following form if $v_l < k \le v_{l+1}$:

$$q_k(x) = \sum_{i \le j \le v_l} \alpha^{(k)}_{ij} x_i x_j + \sum_{i \le v_l < j \le v_{l+1}} \alpha^{(k)}_{ij} x_i x_j .$$

Given all $y_i$ with $v_l < i \le v_{l+1}$, and all $x_j$ with $j \le v_l$, we can compute $x_{v_l+1}, \dots, x_{v_{l+1}}$ via elimination. For historical reasons, a Rainbow signature scheme is said to be a TTS scheme if the coefficients of $Q$ are sparse.

C. Previous Attempts

There are several works on efficient implementation of multivariate PKCs. People have implemented TTS on low-cost smartcards [13] and investigated minimized multivariate PKC on low-resource embedded systems [14]. On modern x86 CPUs, Chen et al. use SSE instructions to implement multivariate PKCs [15]. Recently, a parallel hardware implementation of the Rainbow signature scheme [16] and a time-area optimized hardware implementation of multivariate signature schemes using systolic arrays [17] have been presented. Last but not least, Tang et al. tried to minimize cycle counts for Rainbow signature schemes [18].
Similar to the motivation mentioned in Section II-B, we target not only a compact design but also a reusable design that supports encryption schemes such as NTRUEncrypt.
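
To make the oil-and-vinegar trapdoor of Section III-B concrete, the following is a minimal sketch of inverting a single UOV layer: fix random vinegar values, collect the resulting linear system in the oil variables, and solve it by elimination. It assumes a small prime field GF(31) purely for illustration (not necessarily the field of the implemented TTS variant), and the names `solve_mod_p`, `central_map`, and `invert_central_map` are hypothetical helpers, not part of the design described in this paper.

```python
import random

P = 31  # assumption: a small prime field, chosen only for illustration


def solve_mod_p(M, rhs, p=P):
    """Gauss-Jordan elimination over GF(p); returns None if M is singular."""
    n = len(M)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for k in range(n):
        pivot = next((i for i in range(k, n) if aug[i][k] % p), None)
        if pivot is None:
            return None
        aug[k], aug[pivot] = aug[pivot], aug[k]
        inv = pow(aug[k][k], -1, p)          # modular inverse (Python 3.8+)
        aug[k] = [a * inv % p for a in aug[k]]
        for i in range(n):
            if i != k and aug[i][k]:
                f = aug[i][k]
                aug[i] = [(a - f * c) % p for a, c in zip(aug[i], aug[k])]
    return [row[n] for row in aug]


def central_map(alpha, x, v, p=P):
    """q_k(x) = sum_{i<v} sum_{j>=i} alpha[k][i][j] * x_i * x_j (no oil*oil terms)."""
    n = len(x)
    return [sum(a[i][j] * x[i] * x[j] for i in range(v) for j in range(i, n)) % p
            for a in alpha]


def invert_central_map(alpha, y, v, rng, p=P):
    """Find x with central_map(alpha, x, v) == y by fixing the vinegar variables."""
    n = len(alpha[0][0])
    while True:
        vin = [rng.randrange(p) for _ in range(v)]
        # Vinegar*vinegar terms become constants once the vinegar values are fixed.
        const = [sum(a[i][j] * vin[i] * vin[j]
                     for i in range(v) for j in range(i, v)) % p for a in alpha]
        # Vinegar*oil terms become linear coefficients of the oil variables.
        M = [[sum(a[i][j] * vin[i] for i in range(v)) % p for j in range(v, n)]
             for a in alpha]
        rhs = [(yk - c) % p for yk, c in zip(y, const)]
        oil = solve_mod_p(M, rhs, p)
        if oil is not None:                  # retry with new vinegar values if singular
            return vin + oil
```

With `o` equations and `o` oil variables, the linear system is square, so a random choice of vinegar values yields a solvable system with high probability; the retry loop covers the remaining cases.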

D. Solving Systems of Linear Equations

For TTS and related schemes, decryption is usually much slower than encryption. This is because in decryption, the bottleneck operations involve solving a system of linear equations. There are several algorithms that can efficiently solve such problems, all of which have time complexity $O(n^3)$ for general cases.

1) Systolic Gaussian Elimination: Gaussian elimination is perhaps the most well-known algorithm to solve a system of linear equations. In addition to solving, Gaussian elimination can also be used to find the rank of a matrix, to calculate the determinant of a matrix, and to calculate the inverse of an invertible square matrix.

Algorithm 1 Gaussian elimination

    for k = 1 to m do
        Find pivot for column k:
        i_max ← argmax (i = k, ..., m; abs(A[i, k]))
        if A[i_max, k] = 0 then
            return Error
        end if
        Swap rows k and i_max
        Do for all rows below pivot:
        for i = k + 1 to m do
            Do for all remaining elements in current row:
            for j = k + 1 to n do
                A[i, j] ← A[i, j] − A[k, j] · (A[i, k] / A[k, k])
            end for
            Fill lower triangular matrix with zeros:
            A[i, k] ← 0
        end for
    end for

Systolic Gaussian elimination is a fully pipelined parallel hardware design optimized for solving systems of linear equations. It is widely used in multivariate PKC for inverting the central map and is regarded as optimal in terms of trade-offs between speed and area [17].

Systolic Gaussian elimination is the space-time minimal ASIC implementation for solving linear systems of equations, in which an architecture of simple processors is used as systolic cells connected in a triangular network. A systolic implementation can reduce the combinational area in the total design. At the same time, the timing delay can be minimized via pipelining.

2) The Wiedemann Algorithm: The Wiedemann algorithm is a randomized algorithm for solving systems of linear equations over a finite field [19]. It is usually used for solving a sparse system because there it only involves $O(n^2 \omega)$ operations, where $\omega$ is the number of nonzero coefficients in each equation.

Definition 2 (Linear Recurrence): A sequence $a = (a_i)_{i \ge 0}$ is said to be linear recurrent over $K$ if there exist $d \in \mathbb{N}$ and $c_0, \dots, c_d \in K$ with $c_d \neq 0$ such that

$$\sum_{j=0}^{d} c_j a_{i+j} = 0 \quad \text{for all } i \ge 0.$$

The polynomial $c(x) = \sum_{j=0}^{d} c_j x^j$ of degree $d$ is called a characteristic polynomial of $a$.

Definition 3 (Minimal Polynomials): A characteristic polynomial is called a minimal polynomial of the sequence if it is of the least degree.

The Berlekamp–Massey algorithm finds the shortest linear feedback shift register (LFSR) that produces a given binary sequence [20]. The algorithm will also find a minimal polynomial of a linear recurrent sequence in a field. It is used in the Wiedemann algorithm to find the minimal polynomial of a sequence.

Algorithm 2 Berlekamp–Massey [20]

    Require: s = (s_0, s_1, ..., s_{2d−1}), a linear recurrent sequence whose minimal polynomial is of degree at most d
    Ensure: C(x) whose reciprocal x^L C(1/x) is a minimal polynomial of s
    Let C(x) = 1, B(x) = 1, L = 0, m = 1, b = 1.
    for n from 1 to 2d do
        Set δ = s_{n−1} + Σ_{i=1}^{L} c_i s_{n−1−i}, where C(x) = 1 + c_1 x + ... + c_L x^L
        if δ = 0 then
            m ← m + 1
        else
            T(x) ← C(x)
            C(x) ← C(x) − δ b^{−1} x^m B(x)
            if 2L ≤ n − 1 then
                L ← n − L, B(x) ← T(x), b ← δ, m ← 1
            else
                m ← m + 1
            end if
        end if
    end for

Let $K$ be a finite field. Given a nonsingular $A \in K^{n \times n}$ and $b \in K^n$, we would like to find an $x \in K^n$ such that $Ax = b$. The idea of the Wiedemann algorithm is to consider the sequence $(A^i b)_{i \ge 0}$. Let $m(x) = \sum_{j=0}^{d} m_j x^j$ be a minimal polynomial of this sequence. According to the definition of minimal polynomial, we have

$$m_d A^d b + m_{d-1} A^{d-1} b + \cdots + m_1 A b + m_0 b = 0.$$

Therefore,

$$A \left( m_d A^{d-1} b + \cdots + m_2 A b + m_1 b \right) = -m_0 b.$$

Hence,

$$x = -m_0^{-1} \left( m_d A^{d-1} b + \cdots + m_2 A b + m_1 b \right).$$

Algorithm 3 Wiedemann [19]

    Require: A nonsingular matrix A ∈ K^{n×n} and a vector b ∈ K^n
    Ensure: x ∈ K^n such that Ax = b
    Generate recurrent sequence s_i = u^T A^i b, 0 ≤ i < 2n, for a random u ∈ K^n
    Using s as input, compute minimal polynomial m(x) by Berlekamp–Massey
    Given m(x), compute x = −m_0^{−1} (m_d A^{d−1} b + ... + m_1 b)

IV. IMPLEMENTATION

After describing the cryptographic algorithms to support, we will describe the tools and the strategy used for our ASIC design in this section. The main design challenge here is to combine the circuits that support two different kinds of algebraic operations needed in NTRUEncrypt and TTS such that the resulting circuit is optimal in terms of time-area-cycle product. Such a challenge inevitably requires extensive architectural exploration, which is best done using a high-level design tool like SystemVerilog. In this work, we use Bluespec SystemVerilog to generate parameterized designs. Finally, we present our processor-based design that supports several post-quantum PKCs.

A. Bluespec SystemVerilog

Bluespec SystemVerilog is an EDA tool for ASIC and FPGA design [21]. It is a hardware description language that compiles into either Verilog RTL or a cycle-accurate C simulation program. It provides high productivity through a radically different approach to high-level synthesis [21].

1) Bluespec Basics: Bluespec SystemVerilog uses a model of hardware structure similar to that used by Verilog. It supports modules and interfaces. Interfaces are used for communication between modules, and module hierarchies can be obtained by static elaboration. Modularity helps improve productivity and allows rapid design-verification cycles. Like SystemC, it provides a software model and hence is friendly to software engineers. It also generates more efficient RTL than most SystemC implementations [21].

2) Modules and Interfaces: In Bluespec SystemVerilog, a module’s behavior is specified using a collection of rules. A module’s interface is specified with a collection of methods. A rule in one module may invoke methods in other modules. Although rules may span multiple modules, they may be composed in a modular way with conditions and actions that are localized and encapsulated within each module. Hence, external users of a module only need to deal with the specified input and output. With such a design, a divide-and-conquer style can be easily supported.

3) Atomic Actions and Rules: Rules and actions are atomic and are therefore a powerful tool for ensuring correctness. The behavior of a module can be understood as some sequential composition of rule firings. Each module’s designer can locally encapsulate the correctness conditions for use of its interface and be assured that it cannot be used incorrectly from any context. The compiler will check the conditions statically to ensure the correctness of the design.

4) Fast Compilation and Simulation: The compiler of Bluespec SystemVerilog statically verifies user-specified assertions to ensure that the whole design is correct. It also supports abstract data types to preserve representation invariants, enhanced overloading, and bit-width constraints.

The Bluespec compiler generates both datapath and control hardware to implement rule behavior in parallel hardware. The core of the generated control hardware is a dynamic scheduler that governs the rule firings. The generated RTL can be sent to standard netlist synthesis and physical design tools. The timing and area are competitive with hand-written RTL for most designs.

It can also compile into C code for fast simulation. Both kinds of output are cycle-accurate with respect to each other and can dump to standard VCD files. The C-based simulation is faster than Verilog simulation because it can exploit rule semantics to optimize the execution.

B. Atom-Based Design

As in designing any complex system, we analyze and decompose the system into atomic operations. We then design units of combinational circuit, called atoms, for those different operations in a time-area minimized way so as to get a better understanding of how far we can go. In addition to being the unit of optimization, atoms can be seen as a basic building block in building our ASIC design.

Under such a design strategy, atoms are the main combinational part of the whole ASIC. After defining the atom, we then increase parallelism by increasing the number of atoms. Often, this can decrease the time-area-cycle product. The intuition behind this is that there are inevitably some fixed costs such as registers that one needs to include at all levels of parallelism, and these parts consume energy during the whole course of the computation. Therefore, we can reduce total energy consumption if we can finish the computation earlier via, e.g., increasing parallelism.

By using Bluespec SystemVerilog, we can code once and generate many ASIC designs with different parameters. It helps us have a quick look at the trend of the time-area-cycle product and find the best trade-off point, a process usually referred to as architectural exploration.
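
The intuition behind the atom-based strategy can be illustrated with a toy cost model. The sketch below is only an assumption-laden illustration: the fixed area, per-atom area, and clock period are made-up numbers rather than synthesis results, the polynomial length 397 is taken from the name of the ees397ep1 parameter set, and the product is read here as clock period times area times cycle count.

```python
def time_area_cycle(atoms, n=397, fixed_area=20000.0, atom_area=150.0, clock_ns=2.0):
    """Toy time-area-cycle product for a polynomial multiplier with `atoms` atoms.

    All default values are hypothetical and serve only to show the trend.
    """
    rounds = -(-n // atoms)           # ceil(n / atoms) batches of output coefficients
    cycles = n * rounds               # each batch needs n multiply-accumulate cycles
    area = fixed_area + atoms * atom_area   # registers and control do not scale with atoms
    return clock_ns * area * cycles


for k in (1, 2, 4, 8):
    print(k, time_area_cycle(k))
```

Because the fixed area dominates in this model, doubling the number of atoms roughly halves the cycle count while adding little area, so the product keeps falling. In a real exploration, the corresponding numbers would come from synthesis reports of each generated design.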

C. Array-Based Design

Compared with atom-based design, array-based design can be seen as the alternative in which we increase parallelism inside an atom. In contrast to atom-based design, we first fill the system with a fixed number of atoms, called an array, and then increase the computational power of each atom. The idea is still to get the job done sooner and decrease the total number of cycles needed to finish the computation, thereby reducing the total energy consumption. However, the critical path may lengthen, resulting in a prolonged clock cycle and hence an increase in the time-area-cycle product. We hope to find an optimal point before the marginal return begins to diminish.

D. Processor-Based Design

Eventually, we plan to have a programmable cryptographic processor that can support various cryptosystems via firmware. It can be separated into three parts: combinational logic, memory, and control logic. The control logic decides which function to perform and moves the data from memory to combinational logic at the appropriate times. For different functions, the control logic is reconfigured appropriately.

We have implemented a preliminary processor that supports both NTRUEncrypt and TTS schemes. Our strategy works as follows. First, we find the part in each atom-based ASIC that can be shared and combine them into a combinational logic block that can compute the basic operations of both PKCs. We then parallelize the atoms to form arrays of appropriate sizes that are suitable for both systems. Finally, we increase each atom’s computational power until we find the best trade-off point in terms of time-area-cycle product.

Fig. 6. Atom-based design diagram.

Fig. 7. Performance under different levels of parallelism for atom-based design.

V. RESULTS AND COMPARISON

We present the implementation results in this section. First, we show the implementation results of ASIC designs of NTRUEncrypt and TTS, respectively. We then move to the processor-based design that supports both cryptosystems and compare the time-area-cycle product with the combined ASIC design. All results are obtained by synthesis with Synopsys Design Compiler at a 90 nm process. We note that we report all area information in the unit used by the synthesis tool. To convert this number to gate count, one needs to divide the number by four.

A. NTRUEncrypt ASIC

In NTRUEncrypt ASIC design, we compare the results of two approaches, atom-based and array-based designs. Atom-based design scales in one dimension (degree of parallelism), while array-based design adds an additional dimension of computational power. We compare the performance in terms of timing, area, and total cycle count under different parameter settings.

1) Atom-Based Approach: The bottleneck operation in encryption and decryption of NTRUEncrypt is the convolutional polynomial multiplication. We decompose the resources used to compute polynomial multiplication into sequential and combinational parts. The sequential part includes, e.g., registers for temporary storage and control. The combinational part is the part that actually carries out the computation. It can be parallelized to reduce the total cycle count, but doing so increases the circuit size and might increase the length of the critical path.

The resulting circuit of the atom-based design is shown in Fig. 6. We note that the modulo circuitry is not shown here for brevity. Also, the mux controls whether the registers will be updated with new values. Given the two input polynomials, an atom can compute one term of the product of the polynomial multiplication in $N$ cycles. The whole polynomial multiplication needs $N^2/k$ cycles if $k$ atoms are used. We are interested in the total energy consumption of the encryption and decryption processes, so we use the time-area-cycle product as a figure of merit when comparing the resulting performance under different parameter settings.

Fig. 7 and Table I show that, as the number of atoms increases, the time-area-cycle product decreases rapidly. This is because the combinational part only occupies a small percentage of the total area, and as a result, a higher degree of parallelism helps reduce idle circuitry and hence lowers total energy consumption. Also, increasing the number of atoms does not affect the timing of the whole design because it does not increase the length of the critical path. We conclude that one should use as high a degree of parallelism as possible to obtain a better time-area-cycle product in atom-based design.
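
The convolutional multiplication and its scheduling onto atoms can be sketched in software as follows. This is a behavioral model under stated assumptions, not the RTL of the design: the function name `convolve_with_atoms` is hypothetical, each atom is assumed to perform one multiply-accumulate per cycle for one output coefficient, and the inner loop over a batch stands in for atoms that run in parallel in hardware.

```python
def convolve_with_atoms(a, b, q, atoms):
    """Cyclic convolution c = a * b in Z_q[X]/(X^N - 1), scheduled on `atoms` units.

    Returns the product coefficients and the number of cycles the schedule takes.
    """
    n = len(a)
    c = [0] * n
    cycles = 0
    for start in range(0, n, atoms):              # one batch of output coefficients
        batch = range(start, min(start + atoms, n))
        for j in range(n):                        # one multiply-accumulate per atom per cycle
            for k in batch:                       # parallel in hardware, sequential here
                c[k] = (c[k] + a[j] * b[(k - j) % n]) % q
            cycles += 1
    return c, cycles


# (1 + 2X) * (X + X^2) = 2 + X + 3X^2 modulo X^3 - 1
print(convolve_with_atoms([1, 2, 0], [0, 1, 1], 32, 1))
print(convolve_with_atoms([1, 2, 0], [0, 1, 1], 32, 3))
```

Both calls produce the same coefficients, while the cycle count drops from 9 to 3 when three atoms are used, which mirrors the cycle-count argument made for the atom-based approach.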

Fig. 9. Performance under different levels of parallelism for array-based design.

Fig. 8. Array-based design diagram.

TABLE II TIME-AREA-CYCLE PRODUCTS FOR ARRAY-BASED DESIGN

TABLE I TIME-AREA-CYCLE PRODUCTS FOR ATOM-BASED DESIGN

2) Array-Based Design: In the atom-based design, more atoms lead to a better time-area-cycle product. In the array-based design, we therefore first use the maximal number of atoms and then try to increase the computational power of each atom. We are interested in whether this improves the time-area-cycle figure of merit. Indeed, as we add terms in a cycle, the total number of cycles needed will reduce to […]. However, timing gets worse because we increase the height of the adder tree at the output of the circuit. There should be an optimal point beyond which increasing the computational power no longer improves the time-area-cycle product.

With the help of Bluespec SystemVerilog, we generate four designs corresponding to adding 1, 3, 7, and 15 terms in a single cycle and summarize the results in Fig. 9 and Table II. As expected, the combinational part grows as the computational power of each atom increases, and the critical path lengthens as well. This makes the product increase as the cycle count decreases. Overall, as the per-atom computational power increases, the critical path grows more rapidly than the cycle count shrinks, so increasing the computational power does not help the time-area-cycle product.

Finally, we compare the performance of the atom-based and array-based designs and find that the optimal trade-off point for an NTRUEncrypt ASIC design is the one with a maximal number of atoms, each having the minimal computational power of a single-layer adder.

B. TTS ASIC

As evaluation of a quadratic map can be computed using linear-system evaluation as a basic building block, the bottleneck computation of TTS is linear-system evaluation and solving. In this section, we first present the ASIC implementation of Gaussian elimination, the state-of-the-art linear-system solver, as well as our attempt at using the Wiedemann algorithm to solve linear systems. To support both encryption and decryption, we use matrix-vector multiplication to maximize the degree of resource reuse. Finally, as in the previous section, we discuss the performance under various parameter settings, such as parallelism, in terms of the time-area-cycle product and find the optimal trade-off point in the design space.

1) Linear System Solvers: Systolic Gaussian elimination has been widely used for solving systems of linear equations over the reals or a finite field. It is an optimal design in terms of time-area product. However, it requires a dedicated combinational circuit that cannot be easily reused for other computations.

We investigate the Wiedemann algorithm for solving linear systems of equations. With Wiedemann, linear-system solving can be decomposed into matrix-vector multiplications, which are ideal for reuse in other parts of the computation of a multivariate PKC. However, such reuse comes at a price: the computational complexity is twice as high as that of Gaussian elimination. We are interested in finding out how much we can save by using a reuse-friendly algorithm for linear-system solving.

2) Matrix-Based Design: The design of a scalable matrix-vector multiplier is similar to NTRUEncrypt's array-based design. We use 20 atoms, a number dictated by the number of variables in the target TTS algorithm. We then scale the number of multipliers in each atom to control the degree of parallelism.

We generate five designs corresponding to five different levels of computational power, capable of computing the inner product of two vectors of dimensions 1, 2, 4, 5, and 10. The inputs are multiplied in parallel before being sent to the final adder tree for summation. The critical path consists of that of the multiplier as well as the final adder tree.
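To make the decomposition of linear-system solving into matrix-vector products concrete, the following is a minimal software sketch of a Wiedemann-style solver over GF(31). It is an illustration under stated assumptions, not the hardware design reported here: the function names (`matvec`, `berlekamp_massey`, `wiedemann_solve`) are hypothetical, the matrix is assumed to be square and nonsingular with a nonzero right-hand side, the Krylov vectors are stored rather than recomputed, and a random projection vector is retried if a candidate solution fails verification.

```python
import random

P = 31  # illustrative modulus; the datapath in this paper works modulo 31


def matvec(A, v, p=P):
    """Matrix-vector product over GF(p); each row is one inner product."""
    return [sum(a * x for a, x in zip(row, v)) % p for row in A]


def berlekamp_massey(s, p=P):
    """Return (C, L) with s[j] + sum_{i=1..L} C[i]*s[j-i] == 0 (mod p) for j >= L."""
    C, B = [1], [1]      # current and previously saved connection polynomials
    L, m, b = 0, 1, 1    # recurrence length, shift since B was saved, old discrepancy
    for n in range(len(s)):
        d = s[n] % p     # discrepancy between s[n] and the current prediction
        for i in range(1, L + 1):
            d = (d + C[i] * s[n - i]) % p
        if d == 0:
            m += 1
            continue
        coef = d * pow(b, -1, p) % p   # modular inverse needs Python 3.8+
        T = C[:]
        C = C + [0] * (len(B) + m - len(C))
        for i, bi in enumerate(B):     # C(x) -= coef * x^m * B(x)
            C[i + m] = (C[i + m] - coef * bi) % p
        if 2 * L <= n:
            L, B, b, m = n + 1 - L, T, d, 1
        else:
            m += 1
        C = C + [0] * (L + 1 - len(C))
    return C[:L + 1], L


def wiedemann_solve(A, b, p=P, tries=20, seed=0):
    """Solve A x = b over GF(p) using only matrix-vector and inner products."""
    n = len(b)
    b = [bi % p for bi in b]
    rng = random.Random(seed)
    # Krylov vectors b, A b, ..., A^(2n-1) b (stored here for simplicity).
    krylov = [b]
    for _ in range(2 * n - 1):
        krylov.append(matvec(A, krylov[-1], p))
    for _ in range(tries):
        u = [rng.randrange(p) for _ in range(n)]   # random projection vector
        s = [sum(ui * vi for ui, vi in zip(u, v)) % p for v in krylov]
        C, L = berlekamp_massey(s, p)
        if L == 0 or C[L] == 0:
            continue
        # If sum_k C[k] A^(L-k) b = 0, then x = -(1/C[L]) sum_{k<L} C[k] A^(L-1-k) b.
        x = [0] * n
        for k in range(L):
            x = [(xi + C[k] * vi) % p for xi, vi in zip(x, krylov[L - 1 - k])]
        scale = (-pow(C[L], -1, p)) % p
        x = [scale * xi % p for xi in x]
        if matvec(A, x, p) == b:   # verify; retry with a new u on failure
            return x
    return None
```

Apart from the short scalar recurrence in `berlekamp_massey`, every step is a matrix-vector or inner product modulo 31, which is the kind of operation a shared array of multiply-accumulate atoms can serve. The price is the roughly 2n matrix-vector products needed to build the sequence.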

Fig. 11. Processor-based design that supports both NTRUEncrypt and TTS.

TABLE IV COMBINATIONAL LOGIC COMPARISON

Fig. 10. Performance under different levels of parallelism for matrix-based design.

TABLE III TIME-AREA-CYCLE PRODUCT FOR MATRIX-BASED DESIGN

TABLE V BLUESPEC GENERATED VERSUS HAND-OPTIMIZED RTL CODE

As shown in Fig. 10 and Table III, the result is quite different from that of NTRUEncrypt. Recall that the performance improvement of the array-based NTRUEncrypt design decreases as the computational power of each atom increases. For the matrix-based TTS design, however, the optimal trade-off point occurs when each atom has a medium level of computational power. This is because the delay of the multiplier is larger than that of the final adder tree, so it makes sense to fill the gap by adding more partial products in each atom. If we used a Wallace adder tree in our atom design, this ratio might change, and so would the optimal trade-off point.

C. Processor-Based Design

In the processor-based design, we first classify circuits into combinational logic, registers, and control logic, as in a typical processor design. We use the standard general-purpose register file construction to implement the registers, while the control logic is implemented as a finite-state machine. We then combine the NTRUEncrypt and TTS ASIC designs and reuse the common components as much as possible. As shown in Fig. 11, for combinational logic, we implement a datapath unit that can compute either a multiplication and an addition modulo 31, or six additions modulo […]. We duplicate 20 such datapath units; this is the maximal degree of parallelism allowed by the target TTS scheme.

Table IV shows that by slightly increasing the combinational logic, and therefore sacrificing some timing performance, we can obtain a circuit that efficiently supports two cryptosystems. The resulting circuit is about 18% smaller than the two ASIC designs combined while delivering satisfactory performance in terms of power consumption.

D. Verilog Versus Bluespec SystemVerilog

Finally, we evaluate the quality of the RTL code generated by Bluespec SystemVerilog by comparing its performance to that of hand-optimized RTL code. From Table V, we can see that the RTL code generated by Bluespec SystemVerilog is in general only 2–3 times worse than hand-optimized RTL code for the same designs. Therefore, we conclude that Bluespec SystemVerilog is suitable for early design and automatic architectural exploration. After preliminary studies find an optimal architecture, an experienced RTL designer can then come in and further optimize the design by hand. Lastly, we compare our implementation results with those found in the literature in Table VI.

VI. CONCLUSION

In this paper, we report our experience implementing a scalable ASIC design for both NTRUEncrypt and TTS. We also use Bluespec SystemVerilog to generate parameterized designs and find the optimal trade-off points under different parameter settings. By analyzing the time-area-cycle products of these designs, the optimal degree of parallelism can be determined from an architectural design viewpoint.

We also present a combined design that supports both NTRUEncrypt and TTS. Its design is similar to that of a processor. In such a processor-based design, not only registers but also combinational circuits are shared among the supported cryptosystems to achieve the maximal degree of resource reuse. The resulting design is more similar to a multi-core 8-bit processor than to a single-core 16- or 32-bit processor, a conclusion that would not have been obvious without the knowledge gained from our architectural exploration.

TABLE VI COMPARISON OF RELATED WORKS (HERE CYCLE COUNTS FOR ENCRYPTION ARE GIVEN)

After combining NTRUEncrypt and TTS into a single system, the logical next step is to add support for more cryptosystems. Our goal is to secure M2M systems with strong cryptography by means of hardware acceleration. Also, to support a larger number of cryptosystems, a compiler is needed to generate the complex firmware. In the end, we hope to demonstrate the feasibility of using hardware-based PKC to provide general data security in M2M applications.

REFERENCES

[1] P. W. Shor, "Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer," SIAM J. Comput., vol. 26, no. 5, pp. 1484–1509, Oct. 1997.
[2] A. Perrig, J. Stankovic, and D. Wagner, "Security in wireless sensor networks," Commun. ACM, vol. 47, no. 6, pp. 53–57, Jun. 2004.
[3] Y. Zhou, Y. Fang, and Y. Zhang, "Securing wireless sensor networks: A survey," Commun. Surveys Tuts., vol. 10, no. 3, pp. 6–28, Jul. 2008.
[4] C. Karlof, N. Sastry, and D. Wagner, "TinySec: A link layer security architecture for wireless sensor networks," in Proc. 2nd Int. Conf. Embed. Netw. Sensor Syst., New York, 2004, pp. 162–175.
[5] R. Watro, D. Kong, S.-F. Cuti, C. Gardiner, C. Lynn, and P. Kruus, "TinyPK: Securing sensor networks with public key technology," in Proc. 2nd ACM Workshop Security ad hoc Sensor Netw., New York, 2004, pp. 59–64.
[6] D. J. Malan, M. Welsh, and M. D. Smith, "Implementing public key infrastructure for sensor networks," ACM Trans. Sensor Netw., vol. 4, no. 4, pp. 22:1–22:23, Sep. 2008.
[7] J. Ding and B.-Y. Yang, "Multivariate public key cryptography," in Post-Quantum Cryptography, D. J. Bernstein, J. Buchmann, and E. Dahmen, Eds. New York: Springer-Verlag, 2009, pp. 193–241.
[8] D. V. Bailey, D. Coffin, A. Elbirt, J. H. Silverman, and A. D. Woodbury, "NTRU in constrained devices," in Proc. 3rd Int. Workshop Cryptogr. Hardware Embed. Syst., London, U.K., 2001, pp. 262–272.
[9] J. Hoffstein, J. Pipher, and J. H. Silverman, "NTRU: A ring-based public key cryptosystem," in Proc. 3rd Int. Symp. Algor. Number Theory, London, U.K., 1998, pp. 267–288.
[10] C. M. O'Rourke, "Efficient NTRU implementations," M.S. thesis, Worcester Polytech. Inst., Worcester, MA, 2002.
[11] A. C. Atici, L. Batina, J. Fan, I. Verbauwhede, and S. Berna Ors Yalcin, "Low-cost implementations of NTRU for pervasive security," in Proc. 2008 Int. Conf. Appl.-Specific Syst., Archit. Process., Washington, DC, 2008, pp. 79–84.
[12] J. Ding and D. Schmidt, "Rainbow, a new multivariable polynomial signature scheme," in Proc. 3rd Int. Conf. Appl. Cryptogr. Netw. Secur., Berlin, Heidelberg, 2005, pp. 164–175.
[13] B.-Y. Yang, J.-M. Chen, and Y.-H. Chen, "TTS: High-speed signatures on a low-cost smart card," in Proc. 6th Int. Workshop Cryptogr. Hardware Embed. Syst., Cambridge, MA, Aug. 11–13, 2004, vol. 3156, pp. 371–385.
[14] B.-Y. Yang, C.-M. Cheng, B.-R. Chen, and J.-M. Chen, "Implementing minimized multivariate PKC on low-resource embedded systems," in Proc. 3rd Int. Conf. Security Pervasive Comput., Berlin, Germany, 2006, pp. 73–88.
[15] A. I.-T. Chen, M.-S. Chen, T.-R. Chen, C.-M. Cheng, J. Ding, E. L.-H. Kuo, F. Y.-S. Lee, and B.-Y. Yang, "SSE implementation of multivariate PKCs on modern x86 CPUs," in Proc. 11th Int. Workshop Cryptogr. Hardware Embed. Syst., Berlin, Germany, 2009, pp. 33–48.
[16] S. Balasubramanian, A. Bogdanov, A. Rupp, J. Ding, and H. W. Carter, "Fast multivariate signature generation in hardware: The case of Rainbow," in Proc. 2008 16th Int. Symp. Field-Programmable Custom Comput. Mach., Washington, DC, USA, 2008, pp. 281–282.
[17] A. Bogdanov, T. Eisenbarth, A. Rupp, and C. Wolf, "Time-area optimized public-key engines: MQ-cryptosystems as replacement for elliptic curves?," in Proc. 10th Int. Workshop Cryptogr. Hardware Embed. Syst., Berlin, Germany, 2008, pp. 45–61.
[18] S. Tang, H. Yi, J. Ding, H. Chen, and G. Chen, "High-speed hardware implementation of Rainbow signature on FPGAs," in Proc. 4th Int. Conf. Post-Quantum Cryptogr., Berlin, Germany, 2011, pp. 228–243.
[19] D. H. Wiedemann, "Solving sparse linear equations over finite fields," IEEE Trans. Inf. Theory, vol. 32, no. 1, pp. 54–62, Jan. 1986.
[20] J. Massey, "Shift-register synthesis and BCH decoding," IEEE Trans. Inf. Theory, vol. 15, no. 1, pp. 122–127, Jan. 1969.
[21] R. Nikhil, "Bluespec system verilog: Efficient, correct RTL from high level specifications," in Proc. 2nd ACM IEEE Int. Conf. Formal Methods Models Co-Design, Jun. 2004, pp. 69–70.
[22] T.-W. Kwon, C.-S. You, W.-S. Heo, Y.-K. Kang, and J.-R. Choi, "Two implementation methods of a 1024-bit RSA cryptoprocessor based on modified Montgomery algorithm," in Proc. 2001 IEEE Int. Symp. Circuits Syst., May 2001, vol. 4, pp. 650–653.

Jie-Ren Shih received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan.
His main research interests include electronic system-level design and the design of countermeasures for security circuits.

Yongbo Hu was born in 1988. He received the B.A. degree in microelectronics from Fudan University, Shanghai, China, where he has been working toward the M.A. degree in microelectronics since 2010.
His main research interests include the design of attacks and countermeasures for security circuits, as well as post-quantum cryptography.

Ming-Chun Hsiao was born in Taiwan, in 1987. He received the B.S. degree in electrical engineering, in 2010, from National Taiwan University, Taipei, Taiwan, where he is currently working toward the M.S. degree in the Graduate Institute of Electronics Engineering.
His research interests are in the areas of VLSI implementation of cryptography algorithms and digital communication systems.

Wen-Chung Shen was born in Taiwan, in 1980. He received the B.S. degree in electronic engineering from National Taiwan University, Taipei, Taiwan, in 2002, and the M.S. degree in electronic engineering from National Taiwan University of Science and Technology, Taipei, Taiwan, in 2004. He is currently working toward the Ph.D. degree at the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan.
His research fields include architecture and algorithm design for digital signal processing. His research interests are in the areas of system-on-chip, security, and low-power designs.

Ming-Shing Chen is a Ph.D. degree student in electronic engineering at National Taiwan University, Taipei, Taiwan.
His work has mainly been on fast arithmetic of finite fields applied to error-correcting codes and public-key cryptography. He is now interested in solving multivariate equations over finite fields and in hardware architecture design for arithmetic over finite rings/fields.

Bo-Yin Yang received the Ph.D. degree in mathematics from Massachusetts Institute of Technology, Cambridge, MA, USA, in 1991.
After teaching mathematics at Tamkang University, Taiwan, he started working with cryptography in 2002. He moved to the Institute of Information Science at Academia Sinica in 2006. He is known for his work on efficient crypto implementations, algebraic […], and post-quantum public-key cryptography.

An-Yeu Wu (SM'12) is a Professor of the Department of Electrical Engineering and the Graduate Institute of Electronics Engineering, National Taiwan University (NTU), Taipei, Taiwan. He is now serving as an Associate Editor for the Journal of Signal Processing Systems (JSPS), and acted as the Lead Guest Editor of the Special Issue of "2010 IEEE Workshop on Signal Processing Systems (SiPS)" in JSPS, which was published in November 2011. From August 2007 to December 2009, he was on leave from NTU and served as the Deputy General Director of SoC Technology Center (STC), Industrial Technology Research Institute (ITRI), Hsinchu, Taiwan, supervising Parallel Core Architecture (PAC) VLIW DSP Processor and Multicore/Android SoC platform projects. Starting from August 2012, he is serving as the Deputy Director of Graduate Institute of Electronics Engineering (GIEE) of NTU.
Dr. Wu received the "Outstanding EE Professor Award" from The Chinese Institute of Electrical Engineering (CIEE), Taiwan, in 2010. He also served on the technical program committees of many major IEEE International Conferences, such as SiPS, AP-ASIC, ISCAS, ISPACS, ICME, SOCC, and A-SSCC. He is now serving as the Chair of VLSI Systems and Architectures (VSA) Technical Committee in IEEE Circuits and Systems (CAS) Society.

Chen-Mou Cheng received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1996 and 1998, respectively, and the Ph.D. degree in computer science from Harvard University, Cambridge, MA, USA, in 2007.
He joined the Department of Electrical Engineering, National Taiwan University, in 2007, where he is currently an Assistant Professor. His main research area is in cryptographic hardware and embedded systems (CHES), as well as electronic system-level (ESL) design. Currently, his main research activities focus on the design and analysis of efficient algorithms to solve several important problems arising from cryptology, as well as the development and implementation of these algorithms on massively parallel computers. These problems include solving systems of polynomial equations over finite fields, integer factorization, elliptic-curve […], and lattice reduction.