<<

EFFICIENT IMPLEMENTATION OF ELLIPTIC CURVE CRYPTOGRAPHY IN RECONFIGURABLE HARDWARE

by

E-JEN LIEN

Submitted in partial fulfillment of the requirements

for the degree of Master of Science

Thesis Advisor: Dr. Swarup Bhunia

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

May, 2012

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

E-Jen Lien, candidate for the Master of Science degree*.

(signed) Swarup Bhunia (chair of the committee)

Christos Papachristou

Frank Merat


(date) 03/19/2012

*We also certify that written approval has been obtained for any proprietary material contained therein.

To my family

Contents

List of Tables iii

List of Figures v

Acknowledgements vi

List of Abbreviations vii

Abstract viii

1 Introduction 1
  1.1 Research objectives 1
  1.2 Thesis Outline 3
  1.3 Contributions 4

2 Background and Motivation 6
  2.1 MBC Architecture 6
  2.2 Application Mapping to MBC 7
  2.3 FPGA 9
  2.4 Mathematical Preliminary 10
  2.5 Elliptic Curve Cryptography 10
  2.6 Motivation 16

3 Design Principles and Methodology 18
  3.1 Curves over Prime Field 18
  3.2 Curves over Binary Field 25
  3.3 Software Code for ECC 31
  3.4 RTL code for FPGA design 31
  3.5 Input Data Flow Graph (DFG) for MBC 31

4 Implementation of ECC 32
  4.1 Software Implementation 32
    4.1.1 Prime Field 33
    4.1.2 Binary Field 34
  4.2 Implementation in FPGA 35
    4.2.1 Prime Field 36
    4.2.2 Binary Field 40
  4.3 Implementation in MBC 44
    4.3.1 Prime Field 45
    4.3.2 Binary Field 47

5 Test Results 48
  5.1 Test Patterns and Methodology 49
  5.2 Test Results 50

6 Conclusion and Future Work 56

A Simulation Results 58
  A.1 Prime field 58
  A.2 Binary field 59

Bibliography 61

List of Tables

2.1 Instruction set ...... 8

5.1 Number of each operation from the data provided by NIST 50
5.2 Number of each operation in GF (p) from the data provided by NIST 50
5.3 Number of each operation in GF (2m) from the data provided by NIST 50
5.4 Power, Performance and Size Comparison 50
5.5 Comparison of 192-bit Point Multiplication in different papers 54
5.6 Comparison of 192-bit Scalar Multiplication in different papers 54
5.7 Comparison of Point Multiplication in different papers 55

List of Figures

1.1 2011 ITRS ASIC Scaling trend prediction ...... 2

2.1 Memory Logic Block Diagram ...... 7

3.1 Squaring in Binary Field ...... 31

4.1 ECC hardware addition module 36
4.2 ECC hardware subtraction module 37
4.3 ECC hardware Montgomery module 38
4.4 ECC hardware Inversion module 39
4.5 ECC hardware Point Addition module 40
4.6 ECC hardware Point Doubling module 41
4.7 ECC hardware kp module 42
4.8 ECC hardware Right-to-left Shift-and-Add Multiply module 42
4.9 Modified ECC hardware Right-to-left Shift-and-Add Multiply module 43
4.10 ECC hardware inversion module in GF (2m) 44
4.11 ECC hardware Itoh-Tsujii inversion module 44
4.12 ECC hardware Point Addition module in GF (2m) 45
4.13 ECC hardware Point Doubling module in GF (2m) 46

5.1 Energy comparison in prime field 51
5.2 Energy comparison in binary field 52
5.3 Energy comparison in all fields 52
5.4 Performance comparison in prime field 53
5.5 Performance comparison in binary field 53
5.6 Performance comparison in all fields 54

A.1 Functional simulation of ECC scalar multiplication in GF (p) 58
A.2 Functional simulation of ECC scalar multiplication in GF (2m) 59
A.3 ECC scalar multiplication (with Itoh-Tsujii) in GF (2m) 60

Acknowledgements

There are many people to whom I must express my sincere thanks. First, I want to thank my family. My parents gave me a lot of support when I needed it. My wife and daughter always cheered me up and boosted my confidence. My younger brother takes care of my parents and handles many things for me. Second, I want to express my sincere gratitude to my advisor, Dr. Swarup Bhunia, from whom I learned a passion for work and the right attitude towards research. I also want to show my heartfelt appreciation to Professor Christos Papachristou and Professor Francis Merat for serving as my thesis committee members. Finally, I want to thank all the members of the nanoscape laboratory, whose advice continuously helped me improve my work.

List of Abbreviations

ACP Average CPU Power
ANSI American National Standards Institute
ASIC Application Specific Integrated Circuit
CPU Central Processing Unit
DFG Data Flow Graph
ECC Elliptic Curve Cryptography
FPGA Field Programmable Gate Array
FSM Finite State Machine
IC Integrated Circuit
ITRS International Technology Roadmap for Semiconductors
LUT Look-Up Table
MBC Memory Based Computing
MLB Memory Based Logic Block
MSB Most Significant Bit
NIST National Institute of Standards and Technology
RSA Rivest-Shamir-Adleman
TDP Thermal Design Power
VLSI Very Large Scale Integration

Efficient Implementation of Elliptic Curve Cryptography in

Reconfigurable Hardware

Abstract

by E-JEN LIEN

Elliptic curve cryptography (ECC) has emerged as a promising public-key cryptography approach for data protection. It is based on the algebraic structure of elliptic curves over finite fields. Although ECC provides a high level of information security, it involves a computationally intensive encryption/decryption process, which negatively affects its performance and energy efficiency. Software implementations of ECC are often not amenable to resource-constrained embedded applications. Alternatively, hardware implementation of ECC has been investigated, in both application specific integrated circuit (ASIC) and field programmable gate array (FPGA) platforms, in order to achieve the desired performance and energy efficiency. Reconfigurable computing platforms such as FPGAs are particularly attractive for hardware acceleration of ECC in diverse applications, since they involve significantly less design cost and time than ASICs. In this work, we investigate efficient implementation of ECC in reconfigurable hardware platforms. In particular, we focus on implementing different ECC encryption algorithms in FPGA and in a promising memory-array-based reconfigurable computing framework, referred to as MBC. MBC leverages the benefits of nanoscale memory, namely high bandwidth, large density and small wire delay, to drastically reduce the overhead of programmable interconnects. We evaluate the performance and energy efficiency of these platforms and compare them with a purely software implementation. We use a pseudo-random curve in the prime field and a Koblitz curve in the binary field for the ECC scalar multiplication operation. We perform functional validation with data recommended by NIST. Simulation results show that, in general, MBC provides better energy efficiency than FPGA, while FPGA provides better latency.

Chapter 1

Introduction

In this chapter, we describe the research objectives, the contributions of the thesis, and the outline of the thesis.

1.1 Research objectives

Energy efficiency during computation has emerged as a major design parameter in diverse applications and computing platforms [1][2][3][4][5][6][7][8]. According to the 2011 report from the International Technology Roadmap for Semiconductors (ITRS), the technology scaling trend for application specific integrated circuits (ASICs) is depicted in Figure 1.1. It shows that although technology scaling provides consistent exponential improvement (following Moore's law) in integration density, operating power is not scaling as desired. Consequently, addressing the power issue at the circuit, architecture and application mapping levels has been a major research area in the nanoscale technology regime. The energy issue can be even more prominent for compute-intensive tasks. Conventional software implementations of these tasks can be too power hungry or too slow to meet the requirements of many real-time and embedded applications. There is a growing trend to map such complex compute-intensive applications to reconfigurable hardware, such as field programmable gate arrays (FPGAs). The FPGA is an attractive computing platform since it can drastically reduce hardware development/test cost and time. Alternative reconfigurable hardware platforms, such as memory based computing (MBC) platforms [9] [10], are also very promising at nanoscale technology nodes. The MBC platform relies on a dense two-dimensional memory array to perform computing in a spatio-temporal manner. Applications are decomposed into partitions, which can be mapped as large look-up tables (LUTs) in the memory, and a function is evaluated by accessing the LUT contents over multiple cycles. Multiple MLBs interact in a spatial manner to perform complex operations. The objective of the research presented in this thesis is to explore implementations of the elliptic curve cryptography (ECC) algorithm in reconfigurable hardware and evaluate their performance and energy efficiency. In order to analyze the potential benefit over traditional software-based implementation, we also compare these design parameters with an alternative implementation in software. We study different variants of ECC algorithms proposed in earlier works and analyze the relative merits and demerits of these algorithms on three alternative platforms.

Figure 1.1: 2011 ITRS ASIC Scaling trend prediction

1.2 Thesis Outline

This thesis is dedicated to analyzing and evaluating the power, performance and resource usage (referred to as size) of Elliptic Curve Cryptography (ECC) on three different platforms, namely CPU, FPGA and MBC. In Chapter 1, we describe the research objectives and the contributions of our work. The background and motivation are presented in Chapter 2. There we introduce the hardware descriptions of the different platforms onto which ECC is mapped; the chapter describes in detail the programming techniques and the normal-mode operating principle of the proposed MBC framework, and gives similar short descriptions of a commercially available FPGA and its underlying hardware. Finally, some mathematical background in field theory, number theory and ECC is introduced, which will help the reader understand the actual algorithm that has to be mapped onto the hardware framework. Chapter 3 deals with the main algorithms that make up elliptic curve cryptography; they are listed and described in detail there. There are multiple variants of the same algorithm that can be mapped to the proposed framework, and we also describe which algorithms are the most suitable choices in terms of resource usage and power consumption. In Chapter 4, we describe how to implement ECC on each platform, including the details and structure of each design, as well as the functional validation of the implemented designs. The detailed implementation results are listed in Chapter 5. Finally, in Chapter 6, we present the conclusions and the future work that can potentially improve the proposed work.

1.3 Contributions

The key contributions of this thesis are as follows: 1. In order to evaluate the performance and energy efficiency of ECC implementations in reconfigurable hardware, we have mapped the ECC algorithm to the FPGA and MBC platforms. To compare with a traditional software implementation, we have also mapped it to software. The mapping is separately optimized for performance on each of the three platforms.

2. We have implemented three different variants of the ECC algorithm on the MBC platform, namely Prime Field, Binary Field (Binary Inversion) and Binary Field (Itoh-Tsujii Inversion). Our purpose is to show that the proposed MBC structure can handle complex algorithms such as ECC and to evaluate ECC performance on MBC. The hardware resources of MBC are severely limited by its simple and regular structure. We adjust the input data flow graph representing ECC in the MBC mapping framework to improve performance and minimize resource requirements. All three versions of ECC have also been mapped to software and FPGA.

3. We designed a novel fast ECC in GF (2m) on FPGA for the Binary Field Binary Inversion algorithm and optimized the design in terms of performance. For all the software and hardware implementations in the proposed work, the inversion step is applied and the coordinates are not pre-calculated. The applications mapped to MBC and FPGA have been highly optimized so as to achieve competitive mapping performance in both frameworks.

4. After extensive functional validation of all the ECC implementations (three platforms and three different algorithms), we compare the performance, area and energy requirements of ECC on the three different platforms. We show that MBC is superior in performance and energy efficiency to software, and in energy efficiency to a state-of-the-art commercially available FPGA device (Altera Stratix-IV).

Chapter 2

Background and Motivation

In this chapter, we introduce background knowledge relevant to this thesis. ECC is a well-investigated topic, and this chapter describes multiple existing implementations of ECC. The background related to the MBC hardware and the MBC programming techniques is also explained here, since it is useful to understand the structure of the hardware platform before writing the code that configures it. We also introduce the FPGA structure so as to compare it with the MBC hardware architecture. It is important to understand the distinctions between these two kinds of reconfigurable devices so that one can map ECC, or any other application, efficiently onto the actual hardware.

2.1 MBC Architecture

The malleable hardware accelerator used in this thesis was proposed by Dr. Somnath Paul and Professor Swarup Bhunia [9], [10]. The inner structure of the MBC and its operating principle are shown in Figure 2.1, which illustrates the basic process of initializing the MBC hardware. The configuration code is compiled and loaded into the memory; some of the memory is used for storing data, while the rest serves as Look-Up Tables (LUTs) configured to implement particular logic. Memory is accessed over multiple clock cycles to evaluate complex functions. A sequence of operations is stored as microcode in the schedule table. An application is mapped to an array of MLBs, which communicate in a spatial manner.

Figure 2.1: Memory Logic Block Diagram

2.2 Application Mapping to MBC

The first thing we have to do before programming the MBC is to understand the available instruction set. In this case, there are thirteen basic instructions on the MBC. We can use these instructions in our programs and combine them to perform complicated operations. The instruction set is shown in Table 2.1.

Table 2.1: Instruction set
Type     Subtype      Inputs               Outputs
bitswC   2inadd       a0 b0 cin            sum count
bitswC   2insub       a1 b1 cin            diff borrow
mult     rand         a2 b2                prod
delay    rand         a3                   a3 delay
shift    left/right   a4 #                 a4 shift
rot      left/right   a5 #                 a5 rot
sel      rand         a6 b6 c6 d6          sel out
complex  rand         a7 b7 c7             lut out
load     #width       addr                 loadVal
store    #width       addr Val
loadPR   #width       PRaddr en            loadVal
storePR  #width       PRaddr storeval en

Consider the case where one executes an XOR operation on two 163-bit numbers. Since the 163-bit input width exceeds the maximum computation bit-width supported by a single LUT, we must divide the operation into smaller pieces. For convenience and homogeneity, we always use 8 bits or 16 bits as our basic operation unit. We first divide the 163-bit number into eleven 16-bit arrays. Then we write the program to store the data into memory. When the program is executed, the data loaded from the memory depends on the memory address base, and the memory address is incremented after each load. We perform the XOR operation between the two arrays and store the results into a temporary register. For applications whose computations involve large numbers, such as AES, RSA or ECC, significant power and latency are incurred in these load/store operations. However, if we code the program carefully and pre-compute the output memory address or variable, we can reduce the number of memory load/store operations significantly. A sample data flow graph (DFG) which can be run on the MBC hardware is as follows:

CDFG sample
name: v0000 type: complex subtype: rand inputs: baa00 g00 outputs: addraa00 en aa00 bitwidth: 4 4
name: v0001 type: complex subtype: rand inputs: bbb00 g00 outputs: addrbb00 en bb00 bitwidth: 4 4
name: v0002 type: complex subtype: rand inputs: bpp00 g00 outputs: addrpp00 en pp00 bitwidth: 4 4
name: v0003 type: complex subtype: rand inputs: bcc00 g00 outputs: addrcc00 en cc00 bitwidth: 4 4
name: v0004 type: loadPR subtype: 16 inputs: addraa00 en aa00 outputs: aa00 in bitwidth: 4 1
name: v0005 type: loadPR subtype: 16 inputs: addrbb00 en bb00 outputs: bb00 in bitwidth: 4 1
name: v0006 type: loadPR subtype: 16 inputs: addrpp00 en pp00 outputs: pp00 in bitwidth: 4 1
name: v0007 type: bits subtype: xor inputs: aa00 in bb00 in pp00 in outputs: cc00 bitwidth: 16 16 16
name: v0008 type: storePR subtype: 16 inputs: addrcc00 cc00 en cc00 outputs: bitwidth: 4 16 1
name: v0009 type: bitswC subtype: 2inadd inputs: loop00 one zero outputs: loop01 uloop00 bitwidth: 4 1 1
name: v0010 type: delay subtype: rand inputs: loop01 outputs: loop00 bitwidth: 4
name: v0011 type: complex subtype: rand inputs: loop00 outputs: g00 bitwidth: 4
endCDFG

This is the standard file format of an application given as an input to the software, which maps the input application to the actual hardware depending on the mapping and routing resources available.
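The limb-wise decomposition described above can be sketched in ordinary software (a Python sketch for illustration only; the MBC realizes the same loop through the DFG's load/xor/store nodes, and the 16-bit limb width follows the text):

```python
# Split a 163-bit value into eleven 16-bit limbs, XOR limb by limb, and
# reassemble -- mirroring how the MBC breaks a wide XOR into LUT-sized ops.

LIMB_BITS = 16
NUM_LIMBS = 11  # ceil(163 / 16) = 11

def to_limbs(x):
    """Split an integer into NUM_LIMBS little-endian 16-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & 0xFFFF for i in range(NUM_LIMBS)]

def from_limbs(limbs):
    """Reassemble an integer from its 16-bit limbs."""
    return sum(l << (LIMB_BITS * i) for i, l in enumerate(limbs))

def xor_163(a, b):
    """XOR two 163-bit values one 16-bit limb at a time."""
    return from_limbs([x ^ y for x, y in zip(to_limbs(a), to_limbs(b))])
```

Because XOR has no carries between limbs, the limb-wise result matches a full-width XOR exactly; the same decomposition for addition would additionally need a carry chain between limbs.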

2.3 FPGA

FPGA is a widely used reconfigurable device [11]. It is composed of a sea of configurable logic blocks and programmable interconnects. It has many features as described in [12] and multiple advantages, which are listed below:
1. Rapid prototyping.
2. Easy migration of a design to a different IC process.
3. An integrated tool and design flow from coding to hardware implementation.
4. Powerful tools for timing and power analysis.
FPGAs are widely used as hardware design and validation platforms, and we use SRAM-based reconfigurable FPGAs in this work. In this thesis, we use Xilinx and Altera FPGAs as our platforms to be consistent with previous work, so that we can compare our results with it; this also allows us to compare the results on all platforms at the same technology node. The FPGA on which we perform the power analysis is from the Stratix IV series. Since we develop the MBC model for a 45nm technology node, we look for an FPGA with a close IC process: for Altera, the Stratix IV is a 40nm-process FPGA [13], so we can use it to compare against our design. FPGA performance is usually limited by routing delays [14], and the routing delay becomes an increasingly large portion of the total delay as technology scales down. In our design, we always need to mitigate the delays caused by routing.

2.4 Mathematical Preliminary

ECC involves knowledge of group theory, number theory and elliptic curves. We introduce the basic knowledge needed in this thesis. There are two kinds of fields described in this thesis:
• Prime field GF (p): a field containing p elements, where p is a prime. The elements of this field are the integers modulo p, and the outputs of arithmetic operations are also integers modulo p; the element set is { 0, 1, 2, . . . , p − 1 }.
• Binary field GF (2m): a field containing 2^m elements for some m > 0; m is called the degree of the field. The elements of this field are bit strings of length m, and the outputs of field arithmetic are also bit strings of length m; interpreted as integers, the element set is { 0, 1, 2, . . . , 2^m − 1 }.
The main difference between GF (p) and GF (2m) lies in the addition and subtraction operations. In GF (p), these operations require dedicated implementations; in GF (2m), both operations can be implemented using XORs.
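The difference noted above can be made concrete with a small sketch (Python for illustration): addition in GF (p) needs a modular reduction, while addition and subtraction in GF (2m) both reduce to a bitwise XOR of the coefficient strings, since each polynomial coefficient is added modulo 2.

```python
def add_gfp(a, b, p):
    """Addition in GF(p): ordinary integer addition, reduced mod p."""
    return (a + b) % p

def add_gf2m(a, b):
    """Addition (and subtraction) in GF(2^m) on bit-string elements:
    coefficients are added mod 2, i.e. a bitwise XOR; no carries occur."""
    return a ^ b
```

Note that in GF (2m) every element is its own additive inverse (x + x = 0), which is why a single XOR serves as both addition and subtraction.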

2.5 Elliptic Curve Cryptography

ECC can be implemented over many fields; here we only discuss the implementations in GF (p) and GF (2m). The notation K is used as a general symbol for the field, standing for either GF (p) or GF (2m). A few theorems and definitions will now be introduced; as mentioned before, they are the essential tools needed in this thesis.

Definition 2.1. (Weierstrass equation) An elliptic curve E over a field K is defined by an equation

E : y^2 + a1xy + a3y = x^3 + a2x^2 + a4x + a6    (2.1)

We define some quantities related to E as follows:

b2 = a1^2 + 4a2
b4 = 2a4 + a1a3
b6 = a3^2 + 4a6
b8 = a1^2 a6 + 4a2a6 − a1a3a4 + a2a3^2 − a4^2
c4 = b2^2 − 24b4
c6 = −b2^3 + 36b2b4 − 216b6
∆ = −b2^2 b8 − 8b4^3 − 27b6^2 + 9b2b4b6
if ∆ ≠ 0, j = c4^3/∆    (2.2)

where a1, a2, a3, a4, a6 ∈ K and ∆ ≠ 0. ∆ is called the discriminant of E. When ∆ ≠ 0, the quantity j defined above is called the j-invariant, or simply the invariant, of E.

Definition 2.2. (E over L) If L is any extension field of K, then the set of L-rational points on E is

E(L) = {(x, y) ∈ L × L : y^2 + a1xy + a3y − x^3 − a2x^2 − a4x − a6 = 0} ∪ {∞}

Here rational means that the coordinates lie in the designated field.

Equation 2.1 is important because it is the original form of the curves used in ECC. We can use the Weierstrass equation to perform coordinate projections and analyze the characteristics of a curve; the most important transformations and derivations are based on Equation 2.1.
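The quantities of Definition 2.1 can be evaluated mechanically. The sketch below (Python, illustrative) computes b2 through j from the coefficients and checks them against the classical identity c4^3 − c6^2 = 1728∆, which holds for every Weierstrass curve:

```python
# Compute the b-, c-quantities, discriminant and j-invariant of a curve
# y^2 + a1*x*y + a3*y = x^3 + a2*x^2 + a4*x + a6 (Equation 2.2).
from fractions import Fraction

def curve_quantities(a1, a2, a3, a4, a6):
    b2 = a1*a1 + 4*a2
    b4 = 2*a4 + a1*a3
    b6 = a3*a3 + 4*a6
    b8 = a1*a1*a6 + 4*a2*a6 - a1*a3*a4 + a2*a3*a3 - a4*a4
    c4 = b2*b2 - 24*b4
    c6 = -b2**3 + 36*b2*b4 - 216*b6
    disc = -b2*b2*b8 - 8*b4**3 - 27*b6*b6 + 9*b2*b4*b6
    j = Fraction(c4**3, disc) if disc != 0 else None  # j-invariant
    return b2, b4, b6, b8, c4, c6, disc, j
```

For the short form y^2 = x^3 + ax + b this reduces to the familiar ∆ = −16(4a^3 + 27b^2); e.g. the curve y^2 = x^3 − x has ∆ = 64 and j = 1728.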

Now we discuss ECC in the prime field GF (p). The group law for E/GF (p): y^2 = x^3 + ax + b, where p ≠ 2, 3, is as follows: 1. Identity. P + ∞ = ∞ + P = P for all P ∈ E(GF (p)). 2. Negatives. If P = (x, y) ∈ E(GF (p)), then (x, y) + (x, −y) = ∞. The point (x, −y) is denoted by −P and is called the negative of P, i.e.,

If P = (x, y) ∈ E(GF (p)), then −P ∈ E(GF (p)) is (x, −y)    (2.3)

3. Point Addition.

Let P = (x1, y1) ∈ E(GF (p)) and Q = (x2, y2) ∈ E(GF (p)), where P ≠ ±Q. Then P + Q = (x3, y3), where

x3 = ((y2 − y1)/(x2 − x1))^2 − x1 − x2  and  y3 = ((y2 − y1)/(x2 − x1))(x1 − x3) − y1    (2.4)

4. Point Doubling.

Let P = (x1, y1) ∈ E(GF (p)). Then 2P = (x3, y3), where

x3 = ((3x1^2 + a)/(2y1))^2 − 2x1  and  y3 = ((3x1^2 + a)/(2y1))(x1 − x3) − y1    (2.5)

The group law defines the operations on the points of a curve E over a field. With the group law we can perform point addition and point doubling. We now explain some terminology that is useful for the related topics.
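A quick numeric check of the group law (a Python sketch on a small illustrative curve, not one of the NIST curves): on E: y^2 = x^3 + 2x + 2 over GF(17), apply the point-addition formula (2.4) to P = (5, 1) and Q = (6, 3) and verify that the result still lies on E.

```python
# Straight-line evaluation of Equation (2.4); pow(x, -1, p) (Python 3.8+)
# computes the modular inverse used for the division by (x2 - x1).
p, a, b = 17, 2, 2
x1, y1 = 5, 1          # P lies on E: 1 == (125 + 10 + 2) mod 17
x2, y2 = 6, 3          # Q lies on E
lam = (y2 - y1) * pow(x2 - x1, -1, p) % p     # slope of the chord PQ
x3 = (lam*lam - x1 - x2) % p                  # Equation (2.4)
y3 = (lam*(x1 - x3) - y1) % p
assert (y3*y3 - (x3**3 + a*x3 + b)) % p == 0  # (x3, y3) is again on E
```

Here P + Q works out to (10, 6), which indeed satisfies the curve equation modulo 17.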

Definition 2.3. (Order of E over GF(p)) Let E be an elliptic curve defined over GF(p). The number of points on curve E over GF(p), denoted #E(GF(p)), is called the order of E over GF(p).

Theorem 2.4. (Hasse's Theorem) The number of points on an elliptic curve E over GF(p), denoted #E(GF(p)), is bounded by:

p + 1 − 2√p ≤ #E(GF (p)) ≤ p + 1 + 2√p

Definition 2.5. (Hasse Interval) The interval [p + 1 − 2√p, p + 1 + 2√p] is called the Hasse interval.

Definition 2.6. (The trace of E over GF (p)) If E is defined over GF (p), then #E(GF (p)) = p + 1 − t, where |t| ≤ 2√p; t is called the trace of E over GF (p). Since 2√p is small relative to p, we have #E(GF (p)) ≈ p.

While the order, interval and trace are introduced here for the prime field, they can be derived for GF (2m) as well.
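Hasse's theorem can be checked by brute force on a small curve (a Python sketch; the curve y^2 = x^3 + 2x + 2 over GF(17) is illustrative only): count all affine solutions plus the point at infinity and compare against the Hasse interval.

```python
# Naive point count of E: y^2 = x^3 + a*x + b over GF(p), including the
# point at infinity. Only feasible for tiny p, but enough to see the bound.
def count_points(a, b, p):
    n = 1                                  # the point at infinity
    for x in range(p):
        rhs = (x*x*x + a*x + b) % p
        for y in range(p):
            if (y*y) % p == rhs:
                n += 1
    return n
```

For p = 17, a = b = 2 the count is 19, which sits comfortably inside the Hasse interval [17 + 1 − 2√17, 17 + 1 + 2√17] ≈ [9.8, 26.2]; the trace is t = p + 1 − #E = −1.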

Now we discuss the characteristics of curves over GF (2m). The group law for E/GF (2m): y^2 + xy = x^3 + ax^2 + b is as follows: 1. Identity. P + ∞ = ∞ + P = P for all P ∈ E(GF (2m)). 2. Negatives. If P = (x, y) ∈ E(GF (2m)), then (x, y) + (x, x + y) = ∞. The point (x, x + y) is denoted by −P and is called the negative of P, i.e.,

If P = (x, y) ∈ E(GF (2m)), then −P ∈ E(GF (2m)) is (x, x + y)    (2.6)

3. Point Addition.

Let P = (x1, y1) ∈ E(GF (2m)) and Q = (x2, y2) ∈ E(GF (2m)), where P ≠ ±Q. Then P + Q = (x3, y3), where

x3 = ((y1 + y2)/(x1 + x2))^2 + (y1 + y2)/(x1 + x2) + x1 + x2 + a    (2.7)

and  y3 = ((y1 + y2)/(x1 + x2))(x1 + x3) + x3 + y1    (2.8)

4. Point Doubling.

Let P = (x1, y1) ∈ E(GF (2m)). Then 2P = (x3, y3), where

x3 = (x1 + y1/x1)^2 + (x1 + y1/x1) + a  and  y3 = x1^2 + (x1 + y1/x1)x3 + x3    (2.9)

In the binary field, we can choose different bases to build different systems. Here we introduce two kinds of basis: 1. Polynomial basis. 2. Normal basis.

A polynomial basis is specified by an irreducible polynomial, called the field polynomial. The bit string (am−1 . . . a2a1a0) is taken to represent the polynomial

am−1 t^(m−1) + . . . + a2 t^2 + a1 t + a0

over GF (2). The field arithmetic is implemented as polynomial arithmetic modulo p(t), where p(t) is the field polynomial.

Definition 2.7. (Normal Basis [15]) A normal basis of GF (q^m) (q = 2^n) over GF (q) is a basis of the form

{a, a^q, a^(q^2), . . . , a^(q^(m−1))},    (2.10)

where a ≠ 0, a ∈ GF (q^m).

Both the polynomial and the normal basis are used in the NIST-recommended test patterns [16].

Lemma 2.8. (Itoh and Tsujii [17]) Let an element x in GF (q^m) (q = 2^n) be represented in the normal basis of Equation (2.10) in the form

x = x0 a + x1 a^q + . . . + xm−1 a^(q^(m−1)) = [x0, x1, . . . , xm−1],    (2.11)

where {a, a^q, . . . , a^(q^(m−1))} is a normal basis over GF (q). Then x^(q^k) can be computed by k cyclic shifts of Equation (2.11), such that

x^(q^k) = [xm−k, xm−k+1, . . . , xm−1, x0, . . . , xm−k−1].    (2.12)

We call Equation (2.12) a cyclic shift over GF (q).
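Lemma 2.8 is what makes normal bases attractive in hardware: raising to the power q^k costs no arithmetic at all, only a rotation of the coefficient vector. A minimal Python sketch of Equation (2.12):

```python
# Normal-basis exponentiation by q^k: k right-cyclic shifts of the
# coefficient vector [x0, ..., x_{m-1}] (Equation 2.12).
def qk_power(coords, k):
    """Return the normal-basis coordinates of x^(q^k)."""
    m = len(coords)
    k %= m                       # x^(q^m) = x (Lemma 2.9)
    return coords[m - k:] + coords[:m - k]
```

Taking k = m rotates the vector all the way around, which is exactly the identity x^(q^m) = x of Lemma 2.9 below.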

Lemma 2.9. (MacWilliams and Sloane [15]) Every element x ∈ GF (q^m) (q = 2^n) satisfies the identity

x^(q^m) = x    (2.13)

Definition 2.10. (Wang, 1985 [18]) Every x ≠ 0, x ∈ GF (2^m), has a unique multiplicative inverse x^(−1).

Since x ≠ 0 also satisfies Equation (2.13), i.e., x^(2^m) = x, the inverse is given by x^(−1) = x^(2^m − 2). Here the exponent can be written as 2^m − 2 = 2 + 2^2 + . . . + 2^(m−1), hence x^(−1) can be computed as

x^(−1) = (x^2)(x^(2^2)) . . . (x^(2^(m−1)))    (2.14)
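Equation (2.14) can be demonstrated on a tiny binary field (a Python sketch over GF(2^4) with the illustrative field polynomial t^4 + t + 1, chosen small so every element can be checked; it is not one of the thesis's NIST fields):

```python
# Fermat-style inversion in GF(2^4): x^{-1} = x^2 * x^{2^2} * ... *
# x^{2^{m-1}} (Equation 2.14), built from repeated squaring.
M = 4
POLY = 0b10011  # p(t) = t^4 + t + 1, irreducible over GF(2)

def mul(a, b):
    """Shift-and-add multiplication mod POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def inv(x):
    """Accumulate the product of Equation (2.14): m-2 squarings."""
    t = mul(x, x)        # x^2
    result = t
    for _ in range(M - 2):
        t = mul(t, t)    # x^{2^2}, x^{2^3}, ...
        result = mul(result, t)
    return result
```

Every nonzero element then satisfies x · x^(−1) = 1, matching Definition 2.10.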

2.6 Motivation

The main motivation of this thesis is to perform a detailed analysis of the MBC structure via a complex algorithm. As we know, most security algorithms are computation-intensive. In this thesis, we choose ECC as our test algorithm. The rationale is as follows: 1. ECC is a complicated algorithm, for which a software implementation is not well suited to real-time tasks. As mentioned in many papers, the security strength of 160-bit ECC over GF (2m) is the same as that of 1024-bit RSA [19] [20] [21] [22]. ECC and RSA are widely used in authentication and signature protocol applications. 2. ECC appears in many standards, such as ANSI X9.63, IEEE 1363a [23] and FIPS PUB 186-3 [24]. 3. ECC is composed of elementary operations that can be efficiently implemented in reconfigurable hardware, from which we can obtain useful information for analyzing the performance of each device. The simulation results will be analyzed and compared with those obtained from the other platforms.

Chapter 3

Design Principles and Methodology

In this chapter, we present the implementation details for ECC. Based on the NIST-recommended ECC curves [16] and different types of inversion modules, we have three ECC designs: 1. 192-bit ECC in GF (p). 2. 163-bit ECC in GF (2m) with binary inversion. 3. 163-bit ECC in GF (2m) with Itoh-Tsujii inversion.

We use a pseudo-random curve in GF (p) and a Koblitz curve in GF (2m). Pseudo-random curves have coefficients generated from the output of a hash function [16].

3.1 Curves over Prime Field

The equation of the elliptic curve over GF (p) is given as

y^2 ≡ x^3 − 3x + b (mod p)    (3.1)

and the curve has prime order r. (Thus, for these curves, the cofactor is always f = 1.) The following parameters are given:
• Prime modulus p
• Order r
• 160-bit input seed to the SHA-1 based algorithm, s
• Output of the SHA-1 based algorithm, c
• Coefficient b (satisfying b^2 c ≡ −27 (mod p))
• The base point x coordinate, Gx
• The base point y coordinate, Gy

In the ANSI X9.62 and IEEE P1363 standards, the pseudo-random curves are generated using the SHA-1 based algorithm [?]. The parameter values for the 192-bit curve P-192 are given below: Curve P-192

p = (FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF)16

r = (FFFFFFFFFFFFFFFFFFFFFFFF99DEF836146BC9B1B4D22831)16

s = (3099D2BBBFCB2538542DCD5FB078B6EF5F3D6FE2C745DE65)16

b = (64210519E59C80E70FA7E9AB72243049FEB8DEECC146B9B1)16

c = (3099D2BBBFCB2538542DCD5FB078B6EF5F3D6FE2C745DE65)16

Gx = (188DA80EB03090F67CBF20EB43A18800F4FF0AFD82FF1012)16

Gy = (07192B95FFC8DA78631011ED6B24CDD573F977A11E794811)16

−P = (Gx, −Gy)

Gx = (188DA80EB03090F67CBF20EB43A18800F4FF0AFD82FF1012)16

−Gy = p − Gy (mod p)

= (F8E6D46A003725879CEFEE1294DB32298C06885EE186B7EE)16

Before discussing the algorithms used in the design, we present the four basic field operations underlying the elliptic group operations. 1. Addition: the addition of two field elements A and B to obtain another element C of GF (p). Since p is a large prime (192 bits), we have used the GNU Multiple Precision Arithmetic Library [?] to deal with the large numbers instead of dividing them into smaller numbers. Since the result may be larger than p, a subtraction step may be needed to obtain the final value. For convenience, we use ECC ADD to represent this addition operation in this thesis.

Algorithm 3.1 Addition in GF (p)
Input: A, B ∈ GF (p)
Output: C = A + B ∈ GF (p)
1: C ← A + B
2: if C ≥ p then
3:   C ← C − p
4: end if
5: return(C)

2. Subtraction: Subtraction in GF (p) is similar to addition. After subtracting, we check whether the result is within GF (p), which requires adding p back if the result is negative.

Algorithm 3.2 Subtraction in GF (p)
Input: A, B ∈ GF (p), k = ⌊log2 p⌋ + 1
Output: C = A − B ∈ GF (p)
1: C ← A − B
2: if C[k + 1] == 1 then
3:   C ← C + p
4: end if
5: return(C)

In order to avoid producing a wrong result, we always use (k + 1) bits for C. Here C[k + 1] means the (k + 1)th bit of C, with k = ⌊log2 p⌋ + 1; this bit acts as the sign of the intermediate result. We use ECC SUB to represent this operation.
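Algorithms 3.1 and 3.2 translate to a few lines of software (a Python sketch; Python's signed integers stand in for the (k + 1)-bit hardware register, so the sign-bit test becomes a comparison with zero):

```python
def ecc_add(a, b, p):
    """Algorithm 3.1: modular addition with one conditional subtraction."""
    c = a + b
    if c >= p:
        c -= p
    return c

def ecc_sub(a, b, p):
    """Algorithm 3.2: C[k+1] == 1 in hardware corresponds to c < 0 here,
    and the correction is a single addition of p."""
    c = a - b
    if c < 0:
        c += p
    return c
```

Both results are always in [0, p − 1], so no general modular reduction is needed; a single conditional add or subtract suffices because the inputs are already reduced.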

3. Multiplication: While there are several multiplication algorithms in the literature [25] [26] [27] [28], two widely used algorithms for large-number systems are Karatsuba and Montgomery multiplication [29] [30]. The multiplication algorithm used in this thesis is the binary add-and-shift Montgomery algorithm [31].

Algorithm 3.3 Binary Add-and-Shift Montgomery Product
Input: A, B and p, where p[0] = 1
Output: u = A · B · 2^(−k) (mod p), k = ⌊log2 p⌋ + 1
1: u ← 0
2: for i = 0 to (k − 1) do
3:   u ← u + A[i] · B
4:   if u[0] == 1 then
5:     u ← u + p
6:   end if
7:   u ← u >> 1
8: end for
9: return(u)
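Algorithm 3.3 and the two-pass scheme of Equations (3.2)-(3.3) below can be transcribed directly (a Python sketch; a final conditional subtraction is added so the returned value is fully reduced, which the hardware statement of the algorithm leaves implicit):

```python
def mont(a, b, p, k):
    """Binary add-and-shift Montgomery product: a*b*2^(-k) mod p (p odd)."""
    u = 0
    for i in range(k):
        u += ((a >> i) & 1) * b   # u <- u + A[i]*B
        if u & 1:
            u += p                # make u even so the shift is exact
        u >>= 1
    return u if u < p else u - p  # final reduction: u < 2p is guaranteed

def mul_mont(x, y, p, k):
    """ECC_MUL_MONT: a full modular multiply via two Montgomery passes."""
    n = pow(2, 2 * k, p)          # pre-computed 2^(2k) mod p
    z = mont(x, y, p, k)          # x*y*2^(-k) mod p        (Equation 3.2)
    return mont(z, n, p, k)       # z*2^(2k)*2^(-k) = x*y   (Equation 3.3)
```

The second pass with the precomputed constant n = 2^(2k) mod p cancels the 2^(−k) factor, which is exactly why two ECC MONT operations make up one ECC MUL MONT.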

It is important to note that the correct result cannot be obtained by a single Montgomery operation. This can be understood from the following:

mont(x · y) ≡ (x · y) · 2^(−k) ≡ Z (mod p)    (3.2)

mont(Z · 2^(2k)) ≡ (Z · 2^(2k)) · 2^(−k) ≡ (((x · y) · 2^(−k)) · 2^(2k)) · 2^(−k) ≡ x · y (mod p)    (3.3)

We can pre-calculate the value 2^(2k) ≡ n (mod p); in this case n ≡ (100000000000000020000000000000001)16 (mod p).

In other words, we need two Montgomery operations to complete a single multiplication in GF (p). We use the notation ECC MONT for a single binary add-and-shift Montgomery operation, and ECC MUL MONT for the whole multiplication (Equation (3.2) followed by Equation (3.3)). 4. Multiplicative inversion:

Algorithm 3.4 Binary algorithm for Inversion in GF (p)
Input: A prime p and a ∈ [1, p − 1].
Output: a^(−1) mod p.
1: u ← a, v ← p
2: x1 ← 1, x2 ← 0
3: while (u ≠ 1 and v ≠ 1) do
4:   while (u[0] == 0) do
5:     u ← u >> 1
6:     if (x1[0] == 0) then x1 ← x1 >> 1;
7:     else x1 ← (x1 + p) >> 1
8:     end if
9:   end while
10:  while (v[0] == 0) do
11:    v ← v >> 1
12:    if (x2[0] == 0) then x2 ← x2 >> 1;
13:    else x2 ← (x2 + p) >> 1
14:    end if
15:  end while
16:  if (u ≥ v) then u ← u − v, x1 ← x1 − x2;
17:  else v ← v − u, x2 ← x2 − x1
18:  end if
19: end while
20: if (u == 1) then return (x1 mod p);
21: else return (x2 mod p)
22: end if
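Algorithm 3.4 transcribes almost line for line into software (a Python sketch; x1 and x2 are kept as plain signed integers and reduced mod p at the end, exactly as the algorithm states):

```python
def ecc_inv(a, p):
    """Binary inversion (Algorithm 3.4): a^(-1) mod p for odd prime p."""
    u, v = a, p
    x1, x2 = 1, 0
    while u != 1 and v != 1:
        while u % 2 == 0:                 # halve u, keeping a*x1 = u (mod p)
            u //= 2
            x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2
        while v % 2 == 0:                 # halve v, keeping a*x2 = v (mod p)
            v //= 2
            x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2
        if u >= v:                        # subtract the smaller pair
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return x1 % p if u == 1 else x2 % p
```

The nested while loops here are precisely the ones the text discusses below: easy in software, but requiring an FSM (or a pre-computed iteration count) when mapped to Verilog or the MBC.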

In both group laws, multiplicative inversion is the most difficult operation to implement. This is why so many papers prefer projective coordinates over affine ones [32] [33]. When projective coordinates are used, the multiplicative inversion needs to be performed only once, to obtain the proper initial points. There are two benefits to this approach: 1) the total design area can be reduced; and 2) the computation time of the whole operation can be reduced.

There are many while loops and branch conditions in the algorithm. The while loops are not easy to synthesize in either the Verilog or the MBC code. Fortunately, we can analyze the algorithm and rewrite the code to avoid the loops. In Verilog, we can use a finite state machine (FSM) and if...else branch conditions to eliminate the while loops. In the MBC framework, we use the C language to obtain the real iteration count of each loop and then rewrite the code as simple counted loops. We use ECC_INV to represent this algorithm. Now that we have these four basic algorithms, we can combine them into PointAddition and PointDoubling.

Algorithm 3.5 ECC Point Addition in GF (p)
Input: E/GF (p): y² = x³ + ax + b, p ≠ 2, 3; P, Q ∈ GF (p) × GF (p), P = (x1, y1), Q = (x2, y2), P ≠ ±Q
Output: (x3, y3) = P + Q

1: temp00 ← ECC_SUB(x2, x1)
2: temp01 ← ECC_SUB(y2, y1)
3: temp02 ← ECC_INV(temp00)
4: temp03 ← ECC_MUL_MONT(temp02, temp01)
5: temp04 ← ECC_MUL_MONT(temp03, temp03)
6: temp05 ← ECC_ADD(x1, x2)
7: x3 ← ECC_SUB(temp04, temp05)
8: temp06 ← ECC_SUB(x1, x3)
9: temp07 ← ECC_MUL_MONT(temp03, temp06)
10: y3 ← ECC_SUB(temp07, y1)

Note that we use ECC_Point_Add and ECC_Point_Double to represent these two algorithms respectively. The NIST scalar multiplication can be implemented using these two algorithms.

Algorithm 3.6 ECC Point Doubling in GF (p)
Input: E/GF (p): y² = x³ + ax + b, p ≠ 2, 3; P = (x1, y1) ∈ GF (p) × GF (p)
Output: (x3, y3) = 2P

1: temp00 ← ECC_MUL_MONT(x1, x1)
2: temp01 ← ECC_MUL_MONT(3, temp00)
3: temp02 ← ECC_ADD(temp01, a)
4: temp03 ← ECC_MUL_MONT(2, y1)
5: temp04 ← ECC_INV(temp03)
6: temp05 ← ECC_MUL_MONT(temp02, temp04)
7: temp06 ← ECC_MUL_MONT(temp05, temp05)
8: temp07 ← ECC_MUL_MONT(2, x1)
9: x3 ← ECC_SUB(temp06, temp07)
10: temp08 ← ECC_SUB(x1, x3)
11: temp09 ← ECC_MUL_MONT(temp05, temp08)
12: y3 ← ECC_SUB(temp09, y1)

Algorithm 3.7 Test NIST patterns using Point Scalar Multiplication
Input: The prime modulus p, the NIST recommended curve E/GF (p): y² = x³ + ax + b, p ≠ 2, 3, the order r of the curve, and the initial point Q = (x0, y0) ∈ GF (p) × GF (p)
Output: (x1, y1) = (r − 1)Q
1: k ← (r − 1)
2: (x1, y1) ← (x0, y0)
3: i ← ⌊log2 k⌋ + 1
4: for (j = i down to 2) do
5:   (x2, y2) ← ECC_Point_Double(x1, y1)
6:   if (k[j − 2] == 1) then
7:     (x1, y1) ← ECC_Point_Add((x0, y0), (x2, y2))
8:   else
9:     (x1, y1) ← (x2, y2)
10:  end if
11: end for
12: return (x1, y1)
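Algorithms 3.5 to 3.7 can be exercised end-to-end on a small textbook curve. The sketch below uses y² = x³ + 2x + 2 over GF(17), whose point (5, 1) has order 19; the curve and all names are illustrative stand-ins for the 192-bit NIST data, and the inversion uses Fermat's little theorem rather than Algorithm 3.4 for brevity:

```c
#include <stdint.h>

/* Double-and-add scalar multiplication on the toy curve
 * y^2 = x^3 + 2x + 2 over GF(17); the point (5,1) has order 19. */
enum { PRIME = 17, CURVE_A = 2 };

static int64_t md(int64_t v) { return ((v % PRIME) + PRIME) % PRIME; }

/* Inverse via Fermat: v^(p-2) mod p (stand-in for Algorithm 3.4). */
static int64_t inv(int64_t v) {
    int64_t r = 1, b = md(v);
    for (int e = PRIME - 2; e > 0; e >>= 1) {
        if (e & 1) r = md(r * b);
        b = md(b * b);
    }
    return r;
}

/* Algorithm 3.5: P + Q, precondition P != +-Q. */
static void padd(int64_t x1, int64_t y1, int64_t x2, int64_t y2,
                 int64_t *x3, int64_t *y3) {
    int64_t s = md(md(y2 - y1) * inv(md(x2 - x1)));   /* slope */
    *x3 = md(s * s - x1 - x2);
    *y3 = md(s * md(x1 - *x3) - y1);
}

/* Algorithm 3.6: 2P, precondition y1 != 0. */
static void pdbl(int64_t x1, int64_t y1, int64_t *x3, int64_t *y3) {
    int64_t s = md(md(3 * x1 * x1 + CURVE_A) * inv(md(2 * y1)));
    *x3 = md(s * s - 2 * x1);
    *y3 = md(s * md(x1 - *x3) - y1);
}

/* Algorithm 3.7's MSB-first loop: always double, add when the bit is set. */
static void smul(int64_t k, int64_t x0, int64_t y0, int64_t *x, int64_t *y) {
    int i = 62;
    while (i > 0 && !((k >> i) & 1)) i--;   /* locate the MSB of k */
    *x = x0; *y = y0;
    for (i--; i >= 0; i--) {                /* process the remaining bits */
        int64_t xt, yt;
        pdbl(*x, *y, &xt, &yt);
        if ((k >> i) & 1) padd(x0, y0, xt, yt, x, y);
        else { *x = xt; *y = yt; }
    }
}
```

With k = r − 1 = 18 this returns (5, 16) = −(5, 1), the same negative-point check the thesis applies to the NIST patterns; with k = 2 it returns the doubling 2P = (6, 3).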

3.2 Curves over Binary Field

An elliptic curve E over the finite field GF (2^m), known as a Koblitz curve, is given by the following equation:

E : y2 + xy = x3 + ax2 + 1, (3.4)

where x, y, a ∈ GF (2^m). For the 163-bit binary field, the NIST recommended parameters are as follows: field polynomial p(t) = t^163 + t^7 + t^6 + t^3 + 1, or in hexadecimal format (t = 2):

p = (800000000000000000000000000000000000000C9)16 (3.5)

Curve K-163:
a = 1
r = (5846006549323611672814741753598448348329118574063)10
or in hexadecimal format
r = (4000000000000000000020108A2E0CC0D99F8A5EF)16
Polynomial basis:

Gx = (2FE13C0537BBC11ACAA07D793DE4E6D5E5C94EEE8)16

Gy = (289070FB05D38FF58321F2E800536D538CCDAA3D9)16

From Equation (2.6), for a point P = (x, y) in E(GF (2^m)) we have −P = (x, x + y). To perform the point scalar multiplication with k = (order − 1), we can compute the point −P:

−P = (Gx, Gx + Gy)

Gx = (2FE13C0537BBC11ACAA07D793DE4E6D5E5C94EEE8)16. Here Gx + Gy = Gx ⊕ Gy

= (07714CFE32684EEF49818F913DB78B866904E4D31)16. From the group law over GF (2^m), we find that we need three basic operations over GF (2^m).

1. Addition: In GF (2^m), addition and subtraction are the same operation, implemented as exclusive-or (XOR).

Algorithm 3.8 Addition in GF (2^m)
Input: Binary polynomials A and B of degrees at most m − 1
Output: C = A + B
1: for (i from 0 to m − 1) do
2:   C[i] ← A[i] ⊕ B[i]
3: end for
4: return(C)

2. Multiplication: The multiplication operation in the prime field is different from that in the binary field. For example, 7 × 3 = 21 in GF (p), but in GF (2^m), (t² + t + 1) × (t + 1) = t³ + 1, since

(111) × (11):    (3.6)

      0111
    ⊕ 1110
    ------
      1001

Here we apply the right-to-left shift-and-add field multiplication in our design.

Algorithm 3.9 Right-to-left shift-and-add field multiplication in GF (2^m)
Input: Binary polynomials A(t) and B(t) of degrees at most m − 1; p(t) is the field polynomial, p(t) = t^163 + t^7 + t^6 + t^3 + 1
Output: C(t) = A(t) · B(t) mod p(t)
1: if (A[0] == 1) then C ← B
2: else C ← 0
3: end if
4: for (i from 1 to m − 1) do
5:   B ← B << 1 (mod p(t))
6:   if (A[i] == 1) then C ← C ⊕ B
7:   end if
8: end for
9: return(C)

In Algorithm 3.9, observe line 5, B ← B << 1 (mod p(t)). Because this operation only shifts B one bit to the left, we can rewrite it as:

B ← B << 1
if (B[ECC_BITS] == 1) then B ← B ⊕ p(t)

This eliminates the need to apply the full fast reduction after B is shifted left by one bit: only an XOR with the field polynomial is required. This reduces the design time and design area.
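The modified shift step can be seen in a toy C version of Algorithm 3.9 over GF(2^4) with p(t) = t^4 + t + 1 (the small field, constants and name are illustrative; the thesis uses m = 163):

```c
#include <stdint.h>

/* Right-to-left shift-and-add multiplication (Algorithm 3.9) with the
 * modified one-bit reduction, demonstrated in GF(2^4). */
enum { M = 4 };
static const uint32_t POLY = 0x13;        /* 1 0011 = t^4 + t + 1 */

static uint32_t gf2_mul(uint32_t a, uint32_t b) {
    uint32_t c = (a & 1) ? b : 0;         /* if A[0] == 1 then C <- B */
    for (int i = 1; i < M; i++) {
        b <<= 1;                          /* B <- B << 1              */
        if (b & (1u << M)) b ^= POLY;     /* if B[m] == 1, B ^= p(t)  */
        if ((a >> i) & 1) c ^= b;         /* conditional accumulate   */
    }
    return c;
}
```

This reproduces the worked example from the text: gf2_mul(0b0111, 0b0011) returns 0b1001, i.e., (t² + t + 1)(t + 1) = t³ + 1.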

3. Multiplicative inversion: Multiplicative inversion is also a complex operation in GF (2^m). By comparing Algorithm 3.10 with Algorithm 3.4, we can observe the subtle differences between these two algorithms. Since the conventional multiplicative inversion module requires a lot of hardware

Algorithm 3.10 Binary algorithm for Inversion in GF (2^m)
Input: A nonzero binary polynomial a of degree at most m − 1
Output: a⁻¹ (mod p(t)), p(t) = t^163 + t^7 + t^6 + t^3 + 1
1: u ← a, v ← p
2: x1 ← 1, x2 ← 0
3: while (u ≠ 1 and v ≠ 1) do
4:   while (u[0] == 0) do
5:     u ← u >> 1
6:     if (x1[0] == 0) then x1 ← x1 >> 1
7:     else x1 ← (x1 ⊕ p) >> 1
8:     end if
9:   end while
10:  while (v[0] == 0) do
11:    v ← v >> 1
12:    if (x2[0] == 0) then x2 ← x2 >> 1
13:    else x2 ← (x2 ⊕ p) >> 1
14:    end if
15:  end while
16:  if (u ≥ v) then u ← u ⊕ v, x1 ← x1 ⊕ x2
17:  else v ← v ⊕ u, x2 ← x2 ⊕ x1
18:  end if
19: end while
20: if (u == 1) then return (x1)
21: else return (x2)
22: end if

resources, we also try an alternative inversion module: the algorithm proposed by Itoh and Tsujii [17], commonly known as the Itoh-Tsujii inversion algorithm.

Algorithm 3.11 Itoh-Tsujii algorithm for Inversion in GF (2^m)
Input: A nonzero element a ∈ GF (2^m), m is odd
Output: B = a⁻¹ in GF (2^m)
1: A ← a², B ← 1, x ← (m − 1) >> 1
2: while (x ≠ 0) do
3:   A ← A · A^(2^x)
4:   if (x[0] == 0) then x ← x >> 1
5:   else B ← B · A, A ← A², x ← (x − 1) >> 1
6:   end if
7: end while
8: return (B)

Although the algorithm appears smaller than Algorithm 3.10, it requires more clock cycles.

The Itoh-Tsujii inversion scheme requires ⌊log2(m − 1)⌋ + Hw(m − 1) − 1 multiplications in GF (2^m), where Hw denotes the Hamming weight, i.e., the number of non-zero elements in a string. In our case, m = 163, so m − 1 = 162 = (10100010)2. Thus, the required number of multiplications is 7 + 3 − 1 = 9. In total, we need 163 square operations and 9 multiplication operations to finish the whole Itoh-Tsujii inversion.
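A toy C sketch of Algorithm 3.11 in GF(2^7) with the irreducible polynomial t^7 + t + 1 (m must be odd, as the algorithm requires; the field, polynomial and helper names are illustrative stand-ins for the 163-bit design). A^(2^x) is computed by x repeated squarings, which is why the cost of squaring matters so much:

```c
#include <stdint.h>

/* Itoh-Tsujii inversion (Algorithm 3.11) demonstrated in GF(2^7). */
enum { M = 7 };
static const uint32_t POLY = 0x83;        /* 1000 0011 = t^7 + t + 1 */

static uint32_t gf_mul(uint32_t a, uint32_t b) {
    uint32_t c = 0;
    for (int i = 0; i < M; i++) {
        if ((a >> i) & 1) c ^= b;         /* accumulate B * t^i */
        b <<= 1;
        if (b & (1u << M)) b ^= POLY;     /* one-bit reduction  */
    }
    return c;
}

static uint32_t gf_sqr(uint32_t a) { return gf_mul(a, a); }

static uint32_t itoh_tsujii_inv(uint32_t a) {
    uint32_t A = gf_sqr(a), B = 1;
    int x = (M - 1) >> 1;
    while (x != 0) {
        uint32_t t = A;                   /* t = A^(2^x) via x squarings */
        for (int i = 0; i < x; i++) t = gf_sqr(t);
        A = gf_mul(A, t);                 /* A <- A * A^(2^x)            */
        if ((x & 1) == 0) x >>= 1;
        else { B = gf_mul(B, A); A = gf_sqr(A); x = (x - 1) >> 1; }
    }
    return B;                             /* B = a^(2^m - 2) = a^-1 */
}
```

For example, itoh_tsujii_inv(0b10) returns 0b1000001 = t^6 + 1, and indeed t · (t^6 + 1) = t^7 + t ≡ 1 (mod t^7 + t + 1).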

In another representation format, we can compute the inverse of A in GF (2^163) using the following order of exponents [34]:

A⁻¹ = A^(2(2^81+1)[2(2^40+1)(2^20+1)(2^10+1)(2^5+1){2(2^2+1)(2+1)+1}+1])

4. Fast reduction in GF (2^163): The field polynomial p(t) chosen for GF (2^163) is very simple, so the modulo operation with p(t) can be sped up by fast reduction. That is, when we perform the modulo operation with p(t), we can apply a special method to reduce the total operation time. This is called fast reduction modulo.

Algorithm 3.12 Fast reduction modulo p(t) = t^163 + t^7 + t^6 + t^3 + 1 (with W = 32)
Input: A binary polynomial c = C[10]·2^(32×10) + ... + C[0]·2^(32×0) of degree at most 324; each C[i] is a 32-bit word; p(t) is the field polynomial
Output: c(t) mod p(t)
1: for (i from 10 downto 6) do
2:   T ← C[i]
3:   C[i − 6] ← C[i − 6] ⊕ (T << 29)
4:   C[i − 5] ← C[i − 5] ⊕ (T << 4) ⊕ (T << 3) ⊕ T ⊕ (T >> 3)
5:   C[i − 4] ← C[i − 4] ⊕ (T >> 28) ⊕ (T >> 29)
6: end for
7: T ← C[5] >> 3
8: C[0] ← C[0] ⊕ (T << 7) ⊕ (T << 6) ⊕ (T << 3) ⊕ T
9: C[1] ← C[1] ⊕ (T >> 25) ⊕ (T >> 26)
10: C[5] ← C[5] & 0x7
11: return (C[5], C[4], C[3], C[2], C[1], C[0])
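Algorithm 3.12 translates almost line for line into C. In this sketch the shift directions follow the standard word-level reduction for this polynomial, and clearing the folded high words is an added convenience:

```c
#include <stdint.h>

/* Fast reduction modulo p(t) = t^163 + t^7 + t^6 + t^3 + 1, W = 32
 * (Algorithm 3.12).  Input c occupies eleven 32-bit words C[10..0]
 * (degree <= 324); the result fits in C[5..0], with C[5] holding only
 * bits 160..162 of the reduced polynomial. */
static void fast_reduce_163(uint32_t C[11]) {
    for (int i = 10; i >= 6; i--) {
        uint32_t T = C[i];
        C[i - 6] ^= (T << 29);
        C[i - 5] ^= (T << 4) ^ (T << 3) ^ T ^ (T >> 3);
        C[i - 4] ^= (T >> 28) ^ (T >> 29);
        C[i] = 0;                /* word fully folded away */
    }
    uint32_t T = C[5] >> 3;      /* bits 163 and above of the last word */
    C[0] ^= (T << 7) ^ (T << 6) ^ (T << 3) ^ T;
    C[1] ^= (T >> 25) ^ (T >> 26);
    C[5] &= 0x7;                 /* keep only bits 160..162 */
}
```

A quick sanity check: reducing c = t^163 (C[5] = 8, all other words zero) yields C[0] = 0xC9, i.e., t^7 + t^6 + t^3 + 1, exactly as the field polynomial dictates.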

5. Squaring in GF (2^m): In the binary field, squaring is a very simple operation in hardware, since the symmetric cross terms in the product cancel: squaring a number only requires inserting a zero between each pair of adjacent bits. After squaring, fast reduction can be applied to bring the result back into GF (2^m).

Algorithm 3.13 Polynomial squaring
Input: A binary polynomial A(t) of degree at most m − 1
Output: C(t) = A(t)²
1: for (i from 0 to m − 1) do
2:   C[2i] ← A[i]
3:   C[2i + 1] ← 0
4: end for
5: return (C)
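Algorithm 3.13 in C, for operands of up to 16 bits (in hardware this bit-spreading is pure wiring; the function name is illustrative):

```c
#include <stdint.h>

/* Polynomial squaring (Algorithm 3.13): interleave a zero after every
 * bit of the operand, so C[2i] = A[i] and C[2i+1] = 0. */
static uint32_t gf2_square_spread(uint32_t a) {
    uint32_t c = 0;
    for (int i = 0; i < 16; i++)
        c |= ((a >> i) & 1u) << (2 * i);
    return c;
}
```

For example, (t² + t + 1)² spreads 0b111 into 0b10101 = t^4 + t² + 1; a fast reduction pass would then be applied for results of degree m or higher.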

For convenience, the square operation in the binary field is illustrated in Figure 3.1.

Figure 3.1: Squaring in Binary Field

3.3 Software Code for ECC

The software implementation of ECC was written in the C language on CentOS 4.8. The code was written in the NetBeans 7.0.1 IDE using gcc 3.4.6-11 as the compiler. In order to gain better performance, external libraries were used to deal with the large numbers. Specifically, the GNU Multiple Precision Arithmetic Library (GMP 5.0.2), which contains optimized code for multiplication, division and other basic operations, was used.

3.4 RTL code for FPGA design

Verilog was chosen as the hardware description language to describe the ECC design. Verification tools include VCS (for pre-layout simulation) and ModelSim (for post-PnR simulation). In order to compare with other designs, the ECC RTL was implemented on both Altera and Xilinx platforms. We use Altera Quartus 11.0sp1 and Xilinx ISE 9sp2 as our FPGA synthesis and simulation platforms.

3.5 Input Data Flow Graph (DFG) for MBC

The MBC code compiler is developed under Linux. We write two versions of the code for each version of ECC: one loads/stores 8 bits at a time, the other 16 bits. Since some instructions cannot operate in 16-bit mode, those parts of the code are kept in 8-bit mode.

Chapter 4

Implementation of ECC

In this chapter, we describe the details of each design. First, we developed a C program to verify each algorithm. On completion of each subprogram, we combined them and used test patterns from NIST to verify the top-level program. From this C program, we developed the Verilog behavioral model and RTL model. Finally, we developed the Data Flow Graph files that are needed for the MBC.

4.1 Software Implementation

In order to measure the power consumption and execution time correctly, we added some instrumentation to the design. To measure the time, we counted the real number of clock cycles needed by a program [35]. We then read the parameter CLOCKS_PER_SEC and divide the clock-cycle count by CLOCKS_PER_SEC to get the final execution time:

Execution time = (Total # of CLOCK_CYCLES) / CLOCKS_PER_SEC

We run the total scalar multiplication operation 1,000 times to get the total run time and divide it by 1,000 to get a single run time.
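The timing method above can be sketched with the standard C clock() interface (the placeholder workload and the function name are illustrative; the thesis times the actual scalar multiplication runs):

```c
#include <time.h>

/* Average per-run wall time of a dummy workload, measured the way the
 * text describes: total clock() ticks divided by CLOCKS_PER_SEC, then
 * divided by the number of runs.  volatile keeps the loop from being
 * optimized away. */
static double measure_seconds(unsigned runs) {
    volatile unsigned long acc = 0;
    clock_t start = clock();
    for (unsigned r = 0; r < runs; r++)
        for (unsigned long i = 0; i < 100000UL; i++)
            acc += i;                          /* placeholder workload */
    clock_t ticks = clock() - start;           /* total clock cycles   */
    return ((double)ticks / CLOCKS_PER_SEC) / runs;  /* single-run time */
}
```

Calling measure_seconds(1000) mirrors the 1,000-run averaging used in the thesis.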

The other problem is how to calculate the power consumption. Sophisticated CPU power measurement is described in [36] [37] [38]. The best method is to use a power meter to measure the real current and voltage of the CPU [39] [40] and then apply Joule's law: P = I · V (4.1)

Instead, we use a coarse way to estimate the CPU power consumption, based on the average CPU power (ACP) [41], which can be derived from the TDP. TDP stands for Thermal Design Power: the maximum amount of power the cooling system in a computer is required to dissipate. The CPU of our workstation is an Intel Q8200 (2.33 GHz) with a TDP of 95 watts, so the average CPU power of our workstation is

TDP 95 Watt → ACP 75 Watt

There is another factor that dominates the power: the CPU usage. We use the Glibtop [42] library to calculate the user time and obtain the CPU usage. Finally, the CPU energy consumption is close to

CPU Energy ≈ 75 · (CPU usage) · (Execution time)

The results in the next chapter show that software consumes a lot of power. The difference between using ACP, TDP, or even a power meter will not change the order of magnitude of the power. In other words, even though the exact power cannot be calculated, the final conclusion will not change much.

4.1.1 Prime Field

Following the algorithms above, we design the C subprograms as follows:
1. ECC_ADD
2. ECC_SUB
3. ECC_BIT_MONT
4. ECC_MUL_MOD_MONT
5. ECC_INV
6. ECC_PA
7. ECC_PD
8. ECC_TEST

The subprogram ECC_MUL_MOD_MONT is the combination of the two Montgomery functions. From the beginning of the design, we use the subprogram ECC_TEST to verify that a point is still on the curve after an operation. The code is easy to write using the GMP library [43]: it handles large numbers very well, so we do not need to spend time implementing large-number arithmetic ourselves.

4.1.2 Binary Field

It is easier to write the code for the binary field. Keep in mind that we have two kinds of inversion operations in the binary field. The subprograms in the binary field are as follows:
1. ECC_R2L_SAA_MUL_G2
2. ECC_SQUARE_G2
3. ECC_FR_163
4. ECC_ADD_G2
5. ECC_PA_G2
6. ECC_PD_G2
7. ECC_TEST_G2
8. ECC_INV_G2
9. ECC_INV_G2_ITOH

The functions can be understood from the names of the subprograms. Here we write two multiplicative inversions: one is the binary algorithm for inversion (ECC_INV_G2); the other is the Itoh-Tsujii algorithm for inversion (ECC_INV_G2_ITOH). The while loops in C compile very easily, so we can complete the C implementation without any problem.

4.2 Implementation in FPGA

We designed the Verilog code in two versions. The first version is the behavioral model for algorithm verification. It is like C programming, but we face problems when we want to synthesize the code: the while loops, or the if...else constructs rewritten from them, cannot be synthesized correctly, and the synthesizer generates a huge number of gates. The second version is a datapath combined with an FSM (finite state machine) that performs the real operation. In our design, the FSMs are all Moore machines; in other words, the outputs of the FSM are determined only by its current state. After verifying with the VCS tools, we implemented the designs on FPGA. After post-simulation, we obtain the timing report of the design and try to optimize the critical path. One solution is to use a better structure in the datapath. The other is, when we find the critical path in a design module, to add one or more stages and replace the original design with small, fast modules. Because the total design is based on an FSM, the running time of each stage limits the final performance: if one stage needs a longer running time, the other stages must slow down their operating frequency in order to avoid setup or hold time violations. The best FSM design makes sure that each stage has an equal running time. In order to get a better result, we ran many iterations to improve the performance.

Another way to increase the overall operating frequency is to add delay buffers before the flip-flop enable signals. This improves the setup-time margin easily, without removing a pipeline stage.

4.2.1 Prime Field

Just as in C, we divide the total design into small parts. In order to design each part well, all the modules are built as an FSM plus a datapath. Each module has an output signal called "ready", which rises to "1" to show that the operation in the module is finished. We have divided the total design into four basic modules: 1. ecc_hw_add: Although addition is a very simple operation, we still use an FSM and a datapath to implement the module. The module is shown in Figure 4.1.

Figure 4.1: ECC hardware addition module

2. ecc hw sub

In this design, we reserve one extra MSB to determine the sign of the operand. This makes sure that the data input and output are all positive integers; in other words, the data stays in GF (p). The hardware design is shown in Figure 4.2.

Figure 4.2: ECC hardware subtraction module

3. ecc_hw_bit_mont: There are many variants of Montgomery multipliers. In this thesis, we use the binary shift-and-add Montgomery multiplier. The detailed structure is shown in Figure 4.3. Keep in mind that we need two Montgomery passes to perform a whole multiplication. The operations are composed only of shifts and additions, which means the module can run at a very high frequency; but since it only deals with one bit at a time, it takes many clock cycles to finish the whole operation. 4. ecc_hw_inv: This is the most complex part of the whole design. From [44], we can re-organize

Figure 4.3: ECC hardware Montgomery module

this module and design the datapath clearly. We can roughly divide the full module into four main datapaths and two adders. We then eliminate the while loops by using branch conditions and multiplexers. The structure is shown in Figure 4.4. 5. With these four basic modules, we can build the PointAddition and PointDoubling modules easily. Because we want to increase the total operation throughput, we do not use complex multiplexers to reuse the modules. This increases the total design area, but avoids sacrificing the performance and the simplicity of the design. The Point Addition module is sketched in Figure 4.5, and the Point Doubling module is drawn in Figure 4.6. 6. The top module is kplus. This module performs the whole scalar multiplication operation. For convenience, the input signals of the top module are only (i) the reset signal, (ii) the clock signal, and (iii) the counter value to be run. The other inputs, such as the initial points, are set to fixed numbers. The output signals are only

Figure 4.4: ECC hardware Inversion module

(i) the final x value, (ii) the final y value, and (iii) the ready signal. When we set the input counter to (order − 1), the design stops and shows the final point coordinates. If we verify that the output coordinate is the negative point in GF (p), then the computation is correct. Some papers [33] only test the Point Doubling operation. To compare with those papers, we simply change the input counter to 2, which performs a single point doubling in this design; we do not need to change the whole set of test patterns.

Figure 4.5: ECC hardware Point Addition module

4.2.2 Binary Field

Again, we have divided the total design into four basic modules:

1. ecc_hw_add_g2: In GF (2^m), the addition and subtraction operations are replaced by the XOR operation. We do not have to check the carry-in or borrow bit after an add/sub operation in GF (2^m). 2. ecc_hw_sqr_g2:

Figure 4.6: ECC hardware Point Doubling module

The square operation in GF (2^m) is very special: it is a simple bit operation combined with a fast reduction module. Many designers use this special characteristic to simplify the whole ECC operation; we should use the square operation instead of multiplication as much as possible. This is also why the normal basis is popular in GF (2^m). 3. ecc_hw_r2l_xor_mul_g2: The right-to-left shift-and-add multiplier is shown in Figure 4.8. Since the shifter only changes one bit at a time, we can modify the original algorithm into a new one.

Figure 4.7: ECC hardware kp module

Figure 4.8: ECC hardware Right-to-left Shift-and-Add Multiply module

Here we do not need the complex reduction operation: we just check the most significant bit after shifting B left by one bit and perform the exclusive-or operation. The whole operation is shown in Figure 4.9. This nuance improves the module's performance and decreases its size. 4. ecc_hw_inv: As mentioned before, inversion is the most complex module of the whole

Figure 4.9: Modified ECC hardware Right-to-left Shift-and-Add Multiply module

design. However, we can analyze the whole algorithm and get a better solution. Here we provide a good way to do the inversion. This binary inversion is very close to the one used in the prime field, but simpler. The operation is shown in Figure 4.10. 5. ecc_hw_inv_itoh_g2: The original idea of the Itoh-Tsujii module is to reduce the area occupied by the binary inversion, at the cost of many more operation cycles. Here we confirm the original designers' intention and expectation. The structure is shown in Figure 4.11. We separate the squaring and multiplication operations carefully because they have different costs in computation and area; this saves a lot of clock cycles in the total operation. 6. The Point Addition and Point Doubling modules are totally different from those used in the prime field. The connections of these two designs are shown in Figure 4.12 and Figure 4.13 respectively. 7. The top modules of the scalar multiplication in GF (2^m) are called kplus_g2 and kplus_g2_itoh respectively. The algorithm used to calculate the scalar multiplication is the same in both fields.

Figure 4.10: ECC hardware inversion module in GF (2^m)

Figure 4.11: ECC hardware Itoh-Tsujii inversion module

4.3 Implementation in MBC

As on the previous platforms, we design the small modules and combine them together. However, since we cannot see the real result from the mapper, we can only verify the parser and the structure of the code. Since it is hard to deal with large numbers, we write the code in two versions: one is based on 8-bit memory load/store, the other on 16-bit memory load/store. These two versions bring about a great difference in power consumption and delay time.

Figure 4.12: ECC hardware Point Addition module in GF (2^m)

4.3.1 Prime Field

Figure 4.13: ECC hardware Point Doubling module in GF (2^m)

In the prime field, there are many obstacles to overcome. In GCC, we have the GMP library to deal with the large-number system, and Verilog handles large numbers easily; however, it is a totally different story in MBC programming. For instance, if we want to add two 192-bit integers, the first thing we have to do is divide them into 16-bit or 8-bit arrays. We first calculate the memory base addresses at which the data are stored and store the initial arrays into memory according to these addresses. When we want to do the addition operation, we load the arrays from memory and add them in parts. Remember that we need to deal with the carry bit when doing addition in GF (p). Because the MBC has no branch conditions, we rewrite the if/while loops as simple counted loops. The number of iterations each loop executes is obtained from our C implementation and converted to a weight when we calculate the total power and operating time. The loop count is determined by the bit length of the counter or its Hamming weight, so we have to write the code very precisely: different counter sizes give different power consumption and delay.
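The limb-by-limb addition with explicit carry propagation described above can be sketched in C (add_192 and the 8-bit limb layout are illustrative of the MBC code, not taken from it):

```c
#include <stdint.h>

/* Multi-word addition: a 192-bit operand is split into 8-bit limbs
 * (least-significant limb first) and added limb by limb while
 * propagating the carry, the step the MBC code must do explicitly. */
enum { LIMBS = 24 };   /* 192 bits / 8 bits per limb */

static uint8_t add_192(const uint8_t a[LIMBS], const uint8_t b[LIMBS],
                       uint8_t sum[LIMBS]) {
    unsigned carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        unsigned t = (unsigned)a[i] + b[i] + carry;
        sum[i] = (uint8_t)t;       /* low 8 bits           */
        carry  = t >> 8;           /* carry into next limb */
    }
    return (uint8_t)carry;         /* carry out of bit 191 */
}
```

In GF(p) a nonzero carry out (or a sum exceeding p) would then trigger the conditional subtraction of p, exactly as in the ECC_ADD datapath.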

We can divide the ECC program in GF (p) into four main subprograms:
1. ECC_ADD
2. ECC_SUB
3. ECC_MONT (Montgomery)
4. ECC_INV

4.3.2 Binary Field

It is easier to write the program in GF (2^m). The addition and subtraction operations in the binary field are both the exclusive-or operation, so in this field we do not care about carry-in or borrow bits. In the binary field, we use the same memory access method as in the prime field to deal with the large-number arithmetic. We write the subprograms and combine them into one piece; the iteration counts are obtained from the C implementation.
1. ECC_ADD_G2
2. ECC_R2L_SAA_MUL (Right-to-Left Shift-And-Add Multiplication)
3. ECC_INV_G2 (Binary Algorithm Inversion)
4. ECC_INV_G2_ITOH (Itoh-Tsujii Algorithm Inversion)
5. ECC_SQUARE_G2 (Squaring in GF (2^m))
6. ECC_FAST_REDUCTION_G2 (Fast Reduction in GF (2^m))

The most important thing is to deal with the counter bits. The other factor that affects the result is the number of memory load/store operations.

Chapter 5

Test Results

In this chapter, we describe the test setup and the validation techniques used to validate the functionality of the mapped hardware. There are very few references with verified ECC test patterns and methods to validate the functionality of such a design. However, as explained in Chapter 4, the Hamming weight of the patterns strongly influences the number of clock cycles of the particular ECC module under examination. As a result, selecting the test patterns from NIST provides uniform input patterns, which in turn allows comparison of the modules at iso-latency. In GF (p), the Hamming weight affects the following algorithms: 1. the Montgomery multiplier; 2. the binary algorithm for inversion. In GF (2^m), the Hamming weight affects the following algorithms: 1. the right-to-left shift-and-add multiplier; 2. the binary algorithm for inversion. The operational clock-cycle latency is decided entirely by the input patterns; in other words, different input patterns result in different performance. In our design, the NIST patterns are used to perform the point scalar multiplication, and the same input patterns are also used to validate the other operations

so as to have a fair comparison of energy consumption at iso-delay.

5.1 Test Patterns and Methodology

Previous papers such as [45] [46] compare individual modules of the ECC application, such as inversion or point doubling, but do not provide suitable comparison results for the entire ECC module, due to the mapping complexity of the entire application. In this thesis, we compare the performance and energy consumption of the entire NIST scalar multiplication. From Section 3.1 and Section 3.2, we obtain the orders of the curves, which define two different mapping schemes of ECC, namely the 163-bit binary field and the 192-bit prime field. After the (r − 1)P scalar multiplication operation, the remaining scalar multiplications of both fields can be determined. The power consumption methodology for the three chosen platforms, namely software, FPGA and the proposed MBC framework, is as follows:
1. In PC software, we monitor the CPU clock cycles and user time to find the running time and power consumption.
2. In FPGA, we dump the VCD file through EDA simulation and use it as input to the power analysis tool.
3. In MBC, the software mapping tool can only report the individual vertex delay and power consumption; the energy consumption for running an entire application is the sum of the energy consumption of the individual vertices.

The experimental results are listed in Tables 5.1, 5.2 and 5.3, which give the number of individual operations for the 192-bit prime field and 163-bit binary field ECC modules, obtained by analyzing the NIST test patterns.

Table 5.1: Number of each operation from the data provided by NIST

                      Prime field (192-bit)   Binary field (163-bit)
# of Point Addition          141                      35
# of Point Doubling          191                     162

Table 5.2: Number of each operation in GF (p) from the data provided by NIST

Operation in Prime field (192-bit)   Number
Addition                                372
Subtraction                            1278
Binary Montgomery                      3520
Binary Inversion                        332

Table 5.3: Number of each operation in GF (2^m) from the data provided by NIST

Operation in Binary field (163-bit)            Number
Addition                                         1090
Square*                                           359
Right-To-Left Shift-And-Add Multiplication        394
Inversion                                         197

*Square is the total number of square operations needed in the entire GF (2^m) design.

5.2 Test Results

The experimental results for power, performance and area requirements for the prime field, binary field and binary field (Itoh) designs are shown in Table 5.4. The size in the

Table 5.4: Power, Performance and Size Comparison

Condition                   S/W (Q8200)   Stratix IV FPGA           MBC 8 bits     MBC 16 bits
prime field    Joule        1.20E+00      6.97E-03                  8.19E-04       3.94E-04
192 bits       Second       6.40E-02      2.82E-02                  5.16E-01       2.50E-01
               size         164 mm²       17513/12724 ALUTs/REGs    3.24E+10 µm²   3.24E+10 µm²
binary field   Joule        3.00E-01      6.52E-04                  2.23E-04       2.10E-04
163 bits       Second       1.60E-02      1.32E-03                  1.40E-01       1.32E-01
               size         164 mm²       6928/7520 ALUTs/REGs      1.14E+10 µm²   1.14E+10 µm²
binary field   Joule        3.89E+00      2.00E-03                  4.57E-04       4.39E-04
Itoh           Second       2.07E-01      3.81E-03                  2.82E-01       2.71E-01
               size         164 mm²       6928/7520 ALUTs/REGs      1.03E+10 µm²   1.03E+10 µm²

table above is explained as follows:
1. For the Intel processor chip, size indicates the processor die size.

2. For the Stratix FPGA, size indicates the resource requirements in ALUTs/REGs.
3. For the proposed framework (MBC), size indicates the area of a single MLB.
The Intel Q8200 CPU and the MBC both use a 45 nm process; the Altera Stratix IV uses a 40 nm process.

Figure 5.1: Energy comparison in prime field

We compare our FPGA design with other proposed designs in Table 5.5. The performance of the design mapped on FPGA in GF (p) is inferior to the previously proposed designs listed in Table 5.5. The main reason for this degradation in performance is the use of the traditional binary add-and-shift Montgomery multiplier [31]: there is a 192-bit adder in the multiplier, and it can only process one bit per clock cycle. We could reduce the total number of clock cycles by using a word-level add-and-shift Montgomery multiplier [31], since the word-level multiplier can handle 32 bits in a single clock cycle. The other comparatively slow modules are the adders, subtractors and

51 Figure 5.2: Energy comparison in binary field

Figure 5.3: Energy comparison in all fields

comparators. These modules are all 192 bits long. For the adders/subtractors, we can use carry look-ahead adders to decrease the latency of the system. We need both

Figure 5.4: Performance comparison in prime field

Figure 5.5: Performance comparison in binary field

magnitude and equality comparators in order to implement the whole design. Here we compare our proposed design with other proposed designs in Table 5.7 in

Figure 5.6: Performance comparison in all fields

Table 5.5: The Comparison of 192 bit Point Multiplication in different Papers

Reference Paper      LUTs    Slices   MHz    cycles   µs       FPGA
Al-Khaleel07 [32]    15739   -        83.3   240      2.88     Virtex 4
Al-Khaleel07 [32]    15739   -        83.3   142      1.705    Virtex 4
Proposed             30275   12935    85.2   13799    161.99   Virtex 4

Table 5.6: The Comparison of 192 bit Scalar Multiplication in different Papers

Reference Paper      LUTs    Slices   MHz    cycles      ms      FPGA
Al-Khaleel07 [32]    15739   -        83.3   ≈ 89833     1.078   Virtex 4
Al-Khaleel07 [32]    15739   -        83.3   ≈ 52000     0.624   Virtex 4
Proposed             30275   12935    85.2   3806514     44.69   Virtex 4

terms of resource usage, latency cycles and absolute latency. The data for the existing designs have been obtained from [33].

Table 5.7: The Comparison of Point Multiplication in different Papers

Design            LUTs    Slices   MHz      cycles   µs      FPGA
Gura02 [47]       19508   -        66.5     9495     143     Virtex-E
Shu05 [48]        25763   -        68.9     -        48.00   Virtex-E
Chelton06 [49]    -       15020    77.0     2831     36.77   Virtex-E
Chelton08 [50]    -       15368    91.1     3010     33.05   Virtex-E
Sutter12 [33]     12218   6432     123.5    5743     46.50   Virtex-E
Sutter12 [33]     20088   10585    102.0    2463     24.50   Virtex-E
Sutter12 [33]     29631   15645    87.7     1699     19.38   Virtex-E
Proposed          12496   7681     135.17   3605     26.69   Virtex-E
Proposed (Itoh)   12106   8379     135.17   9525     70.49   Virtex-E
Zhang10 [51]      -       16209    153.9    3010     19.55   Virtex 4
Chelton08 [50]    7719    4080     197.0    4050     20.56   Virtex 4
Ansari08 [52]     -       20807    185.0    1428     7.72    Virtex 4
Proposed          13264   7898     377.93   3605     9.61    Virtex 4
Proposed (Itoh)   11740   8245     369.8    9525     25.91   Virtex 4
Sutter12 [33]     22936   6150     250.0    1371     5.48    Virtex 5
Proposed          8916    7847     546.15   3605     6.65    Virtex 5
Proposed (Itoh)   8770    8549     539.34   9525     17.72   Virtex 5

Chapter 6

Conclusion and Future Work

We have presented a study on the efficient implementation of ECC algorithms on reconfigurable hardware platforms. We have optimized the application mapping process to improve performance and energy, and evaluated the performance, energy efficiency and area requirements of two different reconfigurable platforms, namely FPGA and MBC. These design parameters are compared with an alternative software-based implementation. We have considered the implementation of ECC in both the prime field and the binary field, and incorporated algorithm optimizations for performance improvement proposed in earlier works. The designs have been optimized separately for each platform.

Our study shows that implementation of ECC algorithms in hardware provides orders of magnitude improvement in performance and energy over the software counterpart. The fine-grained spatial computing framework of FPGA provides better performance, while the spatio-temporal computing framework of MBC provides better energy efficiency; the MBC framework is also well suited to nanoscale technologies. The inversion module is identified as the largest block, and we have considered alternative implementations of it to improve energy efficiency. The performance and energy are evaluated for all three platforms and for several variants of the ECC algorithms. Our analysis would enable designers to choose the right ECC implementation based on a design

target.

The work presented in this thesis can be extended in many ways. First, experimental measurement of performance/energy on all platforms, specifically for MBC at a nanoscale process technology, would be very helpful to establish the energy benefit. The ECC algorithm can be further improved to better map the important steps onto the MBC platform. One option is to employ a better fusion algorithm, which can fuse multiple logic operations into one lookup operation. The datapath structures used in FPGA and MBC can also be optimized for increased energy efficiency. Finally, one can explore the implementation of ECC on other platforms, such as the graphics processing unit (GPU), and compare their performance and energy efficiency with those of the platforms studied here.
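The operation-fusion idea mentioned above can be sketched in software (Python, purely illustrative; the actual MBC fusion operates on the mapped dataflow graph, and the function below is an arbitrary example): two cascaded two-input logic operations collapse into a single three-input lookup table, so one memory read replaces two logic evaluations.

```python
# Fuse the two-gate function f(a, b, c) = (a AND b) XOR c into a single
# 8-entry lookup table, indexed by the concatenated input bits.
fused_lut = [(a & b) ^ c
             for a in (0, 1) for b in (0, 1) for c in (0, 1)]

def fused(a, b, c):
    """One lookup operation replaces two logic operations."""
    return fused_lut[(a << 2) | (b << 1) | c]

# The fused lookup agrees with the original two-operation network.
assert all(fused(a, b, c) == (a & b) ^ c
           for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

In an MBC-style mapping, deeper fusion of this kind reduces the number of sequential lookup cycles at the cost of exponentially larger tables, which is the trade-off a fusion algorithm must balance.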

Appendix A

Simulation Results

A.1 Prime field

Figure A.1: Functional simulation of ECC scalar multiplication in GF(p)

A.2 Binary field

Figure A.2: Functional simulation of ECC scalar multiplication in GF(2^m)

Figure A.3: ECC scalar multiplication (with Itoh-Tsujii) in GF(2^m)

Bibliography

[1] S. Kestur, J. Davis, and O. Williams, “BLAS comparison on FPGA, CPU and GPU,” in VLSI (ISVLSI), 2010 IEEE Computer Society Annual Symposium on, July 2010, pp. 288–293.

[2] X. Tian and K. Benkrid, “Mersenne Twister on FPGA, CPU and GPU,” in Adaptive Hardware and Systems, 2009. AHS 2009. NASA/ESA Conference on, July 29–Aug. 1, 2009, pp. 460–464.

[3] B. Duan, W. Wang, X. Li, C. Zhang, P. Zhang, and N. Sun, “Floating-point mixed-radix FFT core generation for FPGA and comparison with GPU and CPU,” in Field-Programmable Technology (FPT), 2011 International Conference on, Dec. 2011, pp. 1–6.

[4] B. Betkaoui, D. Thomas, and W. Luk, “Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing,” in Field-Programmable Technology (FPT), 2010 International Conference on, Dec. 2010, pp. 94–101.

[5] Y. Abhyankar, C. Sajish, Y. Agarwal, C. Subrahmanya, and P. Prasad, “High performance power spectrum analysis using an FPGA based reconfigurable computing platform,” in Reconfigurable Computing and FPGA’s, 2006. ReConFig 2006. IEEE International Conference on, Sept. 2006, pp. 1–5.

[6] R. Lin, “A reconfigurable low-power high-performance matrix multiplier design,” in Quality Electronic Design, 2000. ISQED 2000. Proceedings. IEEE 2000 First International Symposium on, 2000, pp. 321–328.

[7] J. Noguera and R. Badia, “Power-performance trade-offs for reconfigurable computing,” in Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004. International Conference on, Sept. 2004, pp. 116–121.

[8] R. Sangireddy, H. Kim, and A. Somani, “Low-power high-performance reconfigurable computing cache architectures,” Computers, IEEE Transactions on, vol. 53, no. 10, pp. 1274–1290, Oct. 2004.

[9] S. Paul and S. Bhunia, “MBARC: A scalable memory based reconfigurable computing framework for nanoscale devices,” in Design Automation Conference, 2008. ASP-DAC 2008. Asia and South Pacific, March 2008, pp. 77–82.

[10] ——, “A scalable memory-based reconfigurable computing framework for nanoscale crossbar,” Nanotechnology, IEEE Transactions on, vol. PP, no. 99, p. 1, 2010.

[11] P. Dillien, “Electrically reconfigurable arrays - ERAs,” in New Directions in VLSI Design, IEE Colloquium on, Nov. 1989, pp. 6/1–6/6.

[12] [Online]. Available: http://www.altera.com/products/fpga.html

[13] “The world’s fastest 40-nm FPGA.” [Online]. Available: http://www.altera.com/devices/fpga/stratix-fpgas/stratix-iv/overview/performance/stxiv-performance.html

[14] M. J. Alexander and G. Robins, “New performance-driven FPGA routing algorithms,” in Design Automation, 1995. DAC ’95. 32nd Conference on, 1995, pp. 562–567.

[15] F. J. MacWilliams and N. J. A. Sloane, “The theory of error-correcting codes.” [Online]. Available: http://www2.research.att.com/~njas/doc/ms77.html

[16] “Recommended elliptic curves for federal government use.” [Online]. Available: http://csrc.nist.gov/groups/ST/toolkit/documents/dss/NISTReCur.pdf

[17] T. Itoh and S. Tsujii, “A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases,” Information and Computation, vol. 78, no. 3, pp. 171–177, 1988. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0890540188900247

[18] C. Wang, T. Truong, H. Shao, L. Deutsch, J. Omura, and I. Reed, “VLSI architectures for computing multiplications and inverses in GF(2^m),” Computers, IEEE Transactions on, vol. C-34, no. 8, pp. 709–717, Aug. 1985.

[19] A. Tungar, “Review paper: On comparative study of embedded system architectures for implementation of ECC,” in Internet, 2009. AH-ICI 2009. First Asian Himalayas International Conference on, Nov. 2009, pp. 1–3.

[20] S. Bakhtiari, A. Baraani, and M.-R. Khayyambashi, “MobiCash: A new anonymous mobile payment system implemented by elliptic curve cryptography,” in Computer Science and Information Engineering, 2009 WRI World Congress on, vol. 3, March 31–April 2, 2009, pp. 286–290.

[21] Certicom Research, “Standards for efficient cryptography. SEC 1: Elliptic curve cryptography,” Working draft, version 1.9, p. 78, August 2008.

[22] O. Ponomarev, A. Khurri, and A. Gurtov, “Elliptic curve cryptography (ECC) for host identity protocol (HIP),” in Networks (ICN), 2010 Ninth International Conference on, April 2010, pp. 215–219.

[23] “IEEE 1363.” [Online]. Available: http://grouper.ieee.org/groups/1363/

[24] “Information technology laboratory.” [Online]. Available: http://csrc.nist.gov/publications/fips/fips186-3/fips_186-3.pdf

[25] Z. Dyka and P. Langendoerfer, “Area efficient hardware implementation of elliptic curve cryptography by iteratively applying Karatsuba’s method,” in Design, Automation and Test in Europe, 2005. Proceedings, March 2005, pp. 70–75, Vol. 3.

[26] E.-H. Wajih, M. Mohsen, Z. Medien, and B. Belgacem, “Efficient hardware architecture of recursive Karatsuba-Ofman multiplier,” in Design and Technology of Integrated Systems in Nanoscale Era, 2008. DTIS 2008. 3rd International Conference on, March 2008, pp. 1–6.

[27] C.-L. Wu, D.-C. Lou, and T.-J. Chang, “An efficient Montgomery exponentiation algorithm for public-key cryptosystems,” in Intelligence and Security Informatics, 2008. ISI 2008. IEEE International Conference on, June 2008, pp. 284–285.

[28] N. Nedjah and L. de Macedo Mourelle, “Reconfigurable hardware implementation of Montgomery modular multiplication and parallel binary exponentiation,” in Digital System Design, 2002. Proceedings. Euromicro Symposium on, 2002, pp. 226–233.

[29] C. McIvor, M. McLoone, and J. McCanny, “FPGA Montgomery modular multiplication architectures suitable for ECCs over GF(p),” in Circuits and Systems, 2004. ISCAS ’04. Proceedings of the 2004 International Symposium on, vol. 3, May 2004, pp. III-509–III-512.

[30] ——, “Hardware elliptic curve cryptographic processor over GF(p),” Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 53, no. 9, pp. 1946–1957, Sept. 2006.

[31] F. Rodríguez-Henríquez, N. A. Saqib, A. Díaz-Pérez, and Ç. K. Koç, “Cryptographic algorithms on reconfigurable hardware (Signals and Communication Technology),” Secaucus, NJ, USA, 2006.

[32] O. Al-Khaleel, C. Papachristou, F. Wolff, and K. Pekmestzi, “An elliptic curve cryptosystem design based on FPGA pipeline folding,” in On-Line Testing Symposium, 2007. IOLTS 07. 13th IEEE International, July 2007, pp. 71–78.

[33] G. Sutter, J. Deschamps, and J. Imaña, “Efficient elliptic curve point multiplication using digit serial binary field operations,” Industrial Electronics, IEEE Transactions on, vol. PP, no. 99, p. 1, 2012.

[34] H. M. Choi, C. P. Hong, and C. H. Kim, “High performance elliptic curve cryptographic processor over GF(2^163),” in Electronic Design, Test and Applications, 2008. DELTA 2008. 4th IEEE International Symposium on, Jan. 2008, pp. 290–295.

[35] [Online]. Available: http://www.cplusplus.com/reference/clibrary/ctime/clock/

[36] C. Isci and M. Martonosi, “Runtime power monitoring in high-end processors: methodology and empirical data,” in Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on, Dec. 2003, pp. 93–104.

[37] D. Molka, D. Hackenberg, R. Schöne, and M. Müller, “Characterizing the energy consumption of data transfers and arithmetic operations on x86-64 processors,” in Green Computing Conference, 2010 International, Aug. 2010, pp. 123–133.

[38] “Multicore power, area, and timing.” [Online]. Available: http://www.hpl.hp.com/research/mcpat/

[39] “Measuring processor power: TDP vs. ACP.” [Online]. Available: http://www.intel.com/content/dam/doc/white-paper/resources-xeon-measuring-processor-power-paper.pdf

[40] “ACP - The truth about power consumption starts here.” [Online]. Available: http://www.amd.com/us/Documents/43761D-ACP_PowerConsumption.pdf

[41] “Introducing average CPU power - ACP.” [Online]. Available: http://images.dailytech.com/nimage/5925_large_amd_explains_acp.png

[42] [Online]. Available: http://developer.gnome.org/libgtop/stable/libgtop-GlibTop.html

[43] “The GNU multiple precision arithmetic library.” [Online]. Available: http://gmplib.org/

[44] C. Chen and Z. Qin, “Fast algorithm and hardware architecture for modular inversion in GF(p),” in Intelligent Networks and Intelligent Systems, 2009. ICINIS ’09. Second International Conference on, Nov. 2009, pp. 43–45.

[45] M. Khalil-Hani, A. Irwansyah, and Y. Hau, “A tightly coupled finite field arithmetic hardware in an FPGA-based embedded processor core for elliptic curve cryptography,” in Electronic Design, 2008. ICED 2008. International Conference on, Dec. 2008, pp. 1–6.

[46] M. Simka, J. Pelzl, T. Kleinjung, J. Franke, C. Priplata, C. Stahlke, M. Drutarovsky, V. Fischer, and C. Paar, “Hardware factorization based on elliptic curve method,” in Field-Programmable Custom Computing Machines, 2005. FCCM 2005. 13th Annual IEEE Symposium on, April 2005, pp. 107–116.

[47] N. Gura, S. C. Shantz, H. Eberle, S. Gupta, V. Gupta, D. Finchelstein, E. Goupy, and D. Stebila, “An end-to-end systems approach to elliptic curve cryptography,” in Cryptographic Hardware and Embedded Systems (CHES). Springer-Verlag, 2002, pp. 349–365.

[48] C. Shu, K. Gaj, and T. El-Ghazawi, “Low latency elliptic curve cryptography accelerators for NIST curves over binary fields,” in Field-Programmable Technology, 2005. Proceedings. 2005 IEEE International Conference on, Dec. 2005, pp. 309–310.

[49] W. Chelton and M. Benaissa, “High-speed pipelined ECC processor on FPGA,” in Signal Processing Systems Design and Implementation, 2006. SIPS ’06. IEEE Workshop on, Oct. 2006, pp. 136–141.

[50] ——, “Fast elliptic curve cryptography on FPGA,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 16, no. 2, pp. 198–205, Feb. 2008.

[51] Y. Zhang, D. Chen, Y. Choi, L. Chen, and S.-B. Ko, “A high performance ECC hardware implementation with instruction-level parallelism over GF(2^163),” Microprocess. Microsyst., vol. 34, pp. 228–236, October 2010. [Online]. Available: http://dx.doi.org/10.1016/j.micpro.2010.04.006

[52] B. Ansari and M. Hasan, “High-performance architecture of elliptic curve scalar multiplication,” Computers, IEEE Transactions on, vol. 57, no. 11, pp. 1443–1453, Nov. 2008.
