<<

EFFICIENT IMPLEMENTATION OF ELLIPTIC CURVE CRYPTOGRAPHY IN RECONFIGURABLE HARDWARE

by

E-JEN LIEN

Submitted in partial fulfillment of the requirements

for the degree of Master of Science

Thesis Advisor: Dr. Swarup Bhunia

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

May, 2012

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

E-Jen Lien, candidate for the Master of Science degree*.

(signed) Swarup Bhunia (chair of the committee)

Christos Papachristou

Frank Merat


(date) 03/19/2012

*We also certify that written approval has been obtained for any proprietary material contained therein.

To my family

Contents

List of Tables iii

List of Figures v

Acknowledgements vi

List of Abbreviations vii

Abstract viii

1 Introduction 1
  1.1 Research objectives 1
  1.2 Thesis Outline 3
  1.3 Contributions 4

2 Background and Motivation 6
  2.1 MBC Architecture 6
  2.2 Application Mapping to MBC 7
  2.3 FPGA 9
  2.4 Mathematical Preliminary 10
  2.5 Elliptic Curve Cryptography 10
  2.6 Motivation 16

3 Design Principles and Methodology 18
  3.1 Curves over Prime Field 18
  3.2 Curves over Binary Field 25
  3.3 Software Code for ECC 31
  3.4 RTL code for FPGA design 31
  3.5 Input Data Flow Graph (DFG) for MBC 31

4 Implementation of ECC 32
  4.1 Software Implementation 32
    4.1.1 Prime Field 33
    4.1.2 Binary Field 34
  4.2 Implementation in FPGA 35
    4.2.1 Prime Field 36
    4.2.2 Binary Field 40
  4.3 Implementation in MBC 44
    4.3.1 Prime Field 45
    4.3.2 Binary Field 47

5 Test Results 48
  5.1 Test Patterns and Methodology 49
  5.2 Test Results 50

6 Conclusion and Future Work 56

A Simulation Results 58
  A.1 Prime field 58
  A.2 Binary field 59

Bibliography 61

List of Tables

2.1 Instruction set ...... 8

5.1 Number of each operation from the data provided by NIST 50
5.2 Number of each operation in GF (p) from the data provided by NIST 50
5.3 Number of each operation in GF (2m) from the data provided by NIST 50
5.4 Power, Performance and Size Comparison 50
5.5 Comparison of 192-bit Point Multiplication in different papers 54
5.6 Comparison of 192-bit Scalar Multiplication in different papers 54
5.7 Comparison of Point Multiplication in different papers 55

List of Figures

1.1 2011 ITRS ASIC Scaling trend prediction ...... 2

2.1 Memory Logic Block Diagram ...... 7

3.1 Squaring in Binary Field ...... 31

4.1 ECC hardware addition module 36
4.2 ECC hardware subtraction module 37
4.3 ECC hardware Montgomery module 38
4.4 ECC hardware Inversion module 39
4.5 ECC hardware Point Addition module 40
4.6 ECC hardware Point Doubling module 41
4.7 ECC hardware kp module 42
4.8 ECC hardware Right-to-left Shift-and-Add Multiply module 42
4.9 Modified ECC hardware Right-to-left Shift-and-Add Multiply module 43
4.10 ECC hardware inversion module in GF (2m) 44
4.11 ECC hardware Itoh-Tsujii inversion module 44
4.12 ECC hardware Point Addition module in GF (2m) 45
4.13 ECC hardware Point Doubling module in GF (2m) 46

5.1 Energy comparison in prime field 51
5.2 Energy comparison in binary field 52
5.3 Energy comparison in all fields 52
5.4 Performance comparison in prime field 53
5.5 Performance comparison in binary field 53
5.6 Performance comparison in all fields 54

A.1 Functional simulation of ECC scalar multiplication in GF (p) 58
A.2 Functional simulation of ECC scalar multiplication in GF (2m) 59
A.3 ECC scalar multiplication (with Itoh-Tsujii) in GF (2m) 60

Acknowledgements

There are many people to whom I must express my sincere thanks. First, I want to thank my family. My parents gave me a lot of support when I needed it. My wife and daughter always cheered me up and boosted my confidence. My younger brother takes care of my parents and handles many things for me. Second, I want to express my sincere gratitude to my advisor, Dr. Swarup Bhunia, from whom I learned a passion for work and the right attitude towards research. I also want to show my heartfelt appreciation to Professor Christos Papachristou and Professor Francis Merat for serving as my thesis committee members. Finally, I want to thank all the members of the nanoscape laboratory, whose advice continuously helped me improve my work.

List of Abbreviations

ACP Average CPU Power
ANSI American National Standards Institute
ASIC Application Specific Integrated Circuit
CPU Central Processing Unit
DFG Data Flow Graph
ECC Elliptic Curve Cryptography
FPGA Field Programmable Gate Array
FSM Finite State Machine
IC Integrated Circuit
ITRS International Technology Roadmap for Semiconductors
LUT Look-Up Table
MBC Memory Based Computing
MLB Memory Based Logic Block
MSB Most Significant Bit
NIST National Institute of Standards and Technology
RSA Rivest-Shamir-Adleman
TDP Thermal Design Power
VLSI Very Large Scale Integration

Efficient Implementation of Elliptic Curve Cryptography in

Reconfigurable Hardware

Abstract

by E-JEN LIEN

Elliptic curve cryptography (ECC) has emerged as a promising public-key cryptography approach for data protection. It is based on the algebraic structure of elliptic curves over finite fields. Although ECC provides a high level of information security, it involves a computationally intensive encryption/decryption process, which negatively affects its performance and energy efficiency. Software implementations of ECC are often not amenable to resource-constrained embedded applications. Alternatively, hardware implementation of ECC has been investigated, in both application specific integrated circuit (ASIC) and field programmable gate array (FPGA) platforms, in order to achieve the desired performance and energy efficiency. Reconfigurable computing platforms such as FPGAs are particularly attractive for hardware acceleration of ECC in diverse applications, since they involve significantly less design cost and time than ASICs. In this work, we investigate efficient implementation of ECC in reconfigurable hardware platforms. In particular, we focus on implementing different ECC encryption algorithms in FPGA and in a promising memory-array-based reconfigurable computing framework, referred to as MBC. MBC leverages the benefits of nanoscale memory, namely high bandwidth, large density and small wire delay, to drastically reduce the overhead of programmable interconnects. We evaluate the performance and energy efficiency of these platforms and compare them with a purely software implementation. We use a pseudo-random curve in the prime field and a Koblitz curve in the binary field for the ECC scalar multiplication operation. We perform functional validation with data recommended by NIST. Simulation results show that, in general, MBC provides better energy efficiency than FPGA, while FPGA provides better latency.

Chapter 1

Introduction

In this chapter, we describe the research objectives, the contributions of the thesis, and the outline of the thesis.

1.1 Research objectives

Energy efficiency during computation has emerged as a major design parameter in diverse applications and computing platforms [1][2][3][4][5][6][7][8]. According to the 2011 report from the International Technology Roadmap for Semiconductors (ITRS), the technology scaling trend for application specific integrated circuits (ASICs) is depicted in Figure 1.1. It shows that although technology scaling provides consistent exponential improvement (following Moore's law) in integration density, operating power is not scaling as desired. Consequently, addressing the power issue at the circuit, architecture and application mapping levels has been a major research area in the nanoscale technology regime. The energy issue can be even more prominent for compute-intensive tasks. Conventional software implementations of these tasks can be too power hungry or too slow to meet the requirements of many real-time and embedded applications. There is a growing trend to map such complex compute-intensive applications to reconfigurable hardware, such as field programmable gate arrays (FPGAs). The FPGA is an attractive computing platform since it can drastically reduce hardware development/test cost and time. Alternative reconfigurable hardware platforms, such as memory based computing (MBC) platforms [9] [10], are also very promising at nanoscale technology nodes. The MBC platform relies on a dense two-dimensional memory array to perform computing in a spatio-temporal manner. Applications are decomposed into partitions, which can be mapped as large look-up tables (LUTs) in the memory, and a function is evaluated by accessing the LUT contents over multiple cycles. Multiple MLBs interact in a spatial manner to perform complex operations. The objective of the research presented in this thesis is to explore implementations of the elliptic curve cryptography (ECC) algorithm in reconfigurable hardware and evaluate their performance and energy efficiency. In order to analyze the potential benefit over traditional software-based implementation, we also compare these design parameters with an alternative implementation in software. We study different variants of ECC algorithms proposed in earlier works and analyze the relative merits and demerits of these algorithms on three alternative platforms.

Figure 1.1: 2011 ITRS ASIC Scaling trend prediction

1.2 Thesis Outline

This thesis is dedicated to analyzing and evaluating the power, performance and resource usage (referred to as size) of Elliptic Curve Cryptography (ECC) on three different platforms, namely CPU, FPGA and MBC. In Chapter 1, we describe the research objectives and the contributions of our work. The background and motivation are presented in Chapter 2. There we introduce the hardware descriptions of the different platforms onto which ECC is mapped; the chapter describes in detail the programming techniques and the normal-mode operating principle of the proposed MBC framework, and gives similar short descriptions of a commercially available FPGA and its underlying hardware. Finally, some mathematical background in field theory, number theory and ECC is introduced, which will help the reader understand the actual algorithm that has to be mapped onto the hardware framework. Chapter 3 deals with the main algorithms that make up elliptic curve cryptography; they are listed and described in detail there. There are multiple variants of the same algorithm that can be mapped to the proposed framework, and we also describe which algorithms are the most suitable choices in terms of resource usage and power consumption. In Chapter 4, we describe how to implement ECC on each platform, including the details and structure of each design, as well as the functional validation of the implemented designs. The detailed implementation results are listed in Chapter 5. Finally, in Chapter 6, we present the conclusions and the future work that can potentially improve the proposed work.

1.3 Contributions

The key contributions of this thesis are as follows: 1. In order to evaluate the performance and energy efficiency of ECC implementations in reconfigurable hardware, we have mapped the ECC algorithm to the FPGA and MBC platforms. To compare with a traditional software implementation, we have also mapped it to software. The mapping is separately optimized for performance on each of the three platforms.

2. We have implemented three different variants of the ECC algorithm on the MBC platform, namely Prime Field, Binary Field (Binary Inversion) and Binary Field (Itoh-Tsujii Inversion). Our purpose is to show that the proposed MBC structure can handle complex algorithms such as ECC and to evaluate ECC performance on MBC. The hardware resources of MBC are severely limited by its simple and regular structure. We adjust the input data flow graph representing ECC in the MBC mapping framework to improve performance and minimize resource requirements. All three versions of ECC have also been mapped to software and FPGA.

3. We designed a novel fast ECC in GF (2m) on FPGA for the Binary Field Binary Inversion algorithm and optimized the design in terms of performance. For all the software and hardware implementations in the proposed work, the inversion step is applied and the coordinates are not pre-calculated. The applications mapped to MBC and FPGA have been highly optimized so as to achieve competitive mapping performance in both frameworks.

4. After extensive functional validation of all the ECC implementations (three platforms and three different algorithms), we compare the performance, area and energy requirements of ECC on the three different platforms. We show that MBC is superior in performance and energy efficiency to software, and in energy efficiency to a state-of-the-art commercially available FPGA device (Altera Stratix-IV).

Chapter 2

Background and Motivation

In this chapter, we introduce background knowledge relevant to this thesis. ECC is a well-investigated topic, and this chapter describes multiple existing implementations of ECC. The background related to the MBC hardware and the MBC programming techniques is also explained here, since it is useful to understand the structure of the hardware platform before writing the code that configures it. We also introduce the FPGA structure so as to compare it with the MBC hardware architecture. It is important to understand the distinctions between these two kinds of reconfigurable devices so that one can map ECC, or any other application, efficiently onto the actual hardware.

2.1 MBC Architecture

The malleable hardware accelerator used in this thesis was proposed by Dr. Somnath Paul and Professor Swarup Bhunia [9], [10]. The inner structure of the MBC and its operating principle are shown in Figure 2.1, which illustrates the basic process of initializing the MBC hardware. The configuration code is compiled and loaded into the memory; some of the memory is used for storing data, while the rest serves as Look-Up Tables (LUTs) configured to implement particular logic. Memory is accessed over multiple clock cycles to evaluate complex functions. A sequence of operations is stored as microcode in the schedule table. An application is mapped to an array of MLBs, which communicate in a spatial manner.

Figure 2.1: Memory Logic Block Diagram

2.2 Application Mapping to MBC

The first thing we have to do before programming the MBC is to understand the available instruction set. In this case, there are thirteen basic instructions on the MBC. We can use these instructions in our programs and combine them to perform complicated operations. The instruction set is shown in Table 2.1.

Table 2.1: Instruction set
Type     Subtype      Inputs               Outputs
bitswC   2inadd       a0 b0 cin            sum count
bitswC   2insub       a1 b1 cin            diff borrow
mult     rand         a2 b2                prod
delay    rand         a3                   a3 delay
shift    left/right   a4 #                 a4 shift
rot      left/right   a5 #                 a5 rot
sel      rand         a6 b6 c6 d6          sel out
complex  rand         a7 b7 c7             lut out
load     #width       addr                 loadVal
store    #width       addr Val
loadPR   #width       PRaddr en            loadVal
storePR  #width       PRaddr storeval en

Consider the case where one executes an XOR operation on two 163-bit numbers. Since the 163-bit input width exceeds the maximum computation bit-width supported by a single LUT, we must divide the operation into smaller pieces. For convenience and homogeneity, we always use 8 bits or 16 bits as our basic operation unit. We first divide the 163-bit number into eleven 16-bit arrays. Then we write the program to store the data into memory. When the program is executed, the data loaded from the memory depends on the memory address base, and the memory address is incremented after each load. We perform the XOR operation between the two arrays and store the results into a temporary register. For applications whose computations involve large numbers, such as AES, RSA or ECC, significant power and latency are incurred in these load/store operations. However, if we code the program carefully and pre-compute the output memory address or variable, we can reduce the number of memory load/store operations significantly. A sample data flow graph (DFG) which can be run on the MBC hardware is as follows:

CDFG sample
name: v0000 type: complex subtype: rand inputs: baa00 g00 outputs: addraa00 en aa00 bitwidth: 4 4
name: v0001 type: complex subtype: rand inputs: bbb00 g00 outputs: addrbb00 en bb00 bitwidth: 4 4
name: v0002 type: complex subtype: rand inputs: bpp00 g00 outputs: addrpp00 en pp00 bitwidth: 4 4
name: v0003 type: complex subtype: rand inputs: bcc00 g00 outputs: addrcc00 en cc00 bitwidth: 4 4
name: v0004 type: loadPR subtype: 16 inputs: addraa00 en aa00 outputs: aa00 in bitwidth: 4 1
name: v0005 type: loadPR subtype: 16 inputs: addrbb00 en bb00 outputs: bb00 in bitwidth: 4 1
name: v0006 type: loadPR subtype: 16 inputs: addrpp00 en pp00 outputs: pp00 in bitwidth: 4 1
name: v0007 type: bits subtype: xor inputs: aa00 in bb00 in pp00 in outputs: cc00 bitwidth: 16 16 16
name: v0008 type: storePR subtype: 16 inputs: addrcc00 cc00 en cc00 outputs: bitwidth: 4 16 1
name: v0009 type: bitswC subtype: 2inadd inputs: loop00 one zero outputs: loop01 uloop00 bitwidth: 4 1 1
name: v0010 type: delay subtype: rand inputs: loop01 outputs: loop00 bitwidth: 4
name: v0011 type: complex subtype: rand inputs: loop00 outputs: g00 bitwidth: 4
endCDFG

This is the standard file format of an application given as an input to the software, which maps the input application to the actual hardware depending on the mapping and routing resources available.
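The limb-wise decomposition described above can be sketched in ordinary software (a Python sketch for illustration only; the MBC realizes the same loop through the DFG's load/xor/store nodes, and the 16-bit limb width follows the text):

```python
# Split a 163-bit value into eleven 16-bit limbs, XOR limb by limb, and
# reassemble -- mirroring how the MBC breaks a wide XOR into LUT-sized ops.

LIMB_BITS = 16
NUM_LIMBS = 11  # ceil(163 / 16) = 11

def to_limbs(x):
    """Split an integer into NUM_LIMBS little-endian 16-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & 0xFFFF for i in range(NUM_LIMBS)]

def from_limbs(limbs):
    """Reassemble an integer from its 16-bit limbs."""
    return sum(l << (LIMB_BITS * i) for i, l in enumerate(limbs))

def xor_163(a, b):
    """XOR two 163-bit values one 16-bit limb at a time."""
    return from_limbs([x ^ y for x, y in zip(to_limbs(a), to_limbs(b))])
```

Because XOR has no carries between limbs, the limb-wise result matches a full-width XOR exactly; the same decomposition for addition would additionally need a carry chain between limbs.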

2.3 FPGA

FPGA is a widely used reconfigurable device [11]. It is composed of a sea of configurable logic blocks and programmable interconnects. It has many features as described in [12] and multiple advantages, which are listed below:
1. Rapid prototyping.
2. Easy migration of a design to a different IC process.
3. An integrated tool and design flow from coding to hardware implementation.
4. Powerful tools for timing and power analysis.
FPGAs are widely used as hardware design and validation platforms, and we use SRAM-based reconfigurable FPGAs in this work. In this thesis, we use Xilinx and Altera FPGAs as our platforms to be consistent with previous work, so that we can compare our results with it; this also allows us to compare the results on all platforms at the same technology node. The FPGA on which we perform the power analysis is from the Stratix IV series. Since we develop the MBC model for a 45nm technology node, we look for an FPGA with a close IC process: for Altera, the Stratix IV is a 40nm-process FPGA [13], so we can use it to compare against our design. FPGA performance is usually limited by routing delays [14], and the routing delay becomes an increasingly large portion of the total delay as technology scales down. In our design, we always need to mitigate the delays caused by routing.

2.4 Mathematical Preliminary

ECC involves knowledge of group theory, number theory and elliptic curves. We introduce the basic knowledge needed in this thesis. There are two kinds of fields described in this thesis:
• Prime field GF (p): a field containing p elements, where p is a prime. The elements of this field are the integers modulo p, and the outputs of arithmetic operations are also integers modulo p; the element set is { 0, 1, 2, . . . , p − 1 }.
• Binary field GF (2m): a field containing 2^m elements for some m > 0; m is called the degree of the field. The elements of this field are bit strings of length m, and the outputs of field arithmetic are also bit strings of length m; interpreted as integers, the element set is { 0, 1, 2, . . . , 2^m − 1 }.
The main difference between GF (p) and GF (2m) lies in the addition and subtraction operations. In GF (p), these operations require dedicated implementations; in GF (2m), both operations can be implemented using XORs.
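The difference noted above can be made concrete with a small sketch (Python for illustration): addition in GF (p) needs a modular reduction, while addition and subtraction in GF (2m) both reduce to a bitwise XOR of the coefficient strings, since each polynomial coefficient is added modulo 2.

```python
def add_gfp(a, b, p):
    """Addition in GF(p): ordinary integer addition, reduced mod p."""
    return (a + b) % p

def add_gf2m(a, b):
    """Addition (and subtraction) in GF(2^m) on bit-string elements:
    coefficients are added mod 2, i.e. a bitwise XOR; no carries occur."""
    return a ^ b
```

Note that in GF (2m) every element is its own additive inverse (x + x = 0), which is why a single XOR serves as both addition and subtraction.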

2.5 Elliptic Curve Cryptography

ECC can be implemented over many fields; here we only discuss the implementations in GF (p) and GF (2m). The notation K is used as a general symbol for the field, standing for either GF (p) or GF (2m). A few theorems and definitions will now be introduced; as mentioned before, they are the essential tools needed in this thesis.

Definition 2.1. (Weierstrass equation) An elliptic curve E over a field K is defined by an equation

E : y^2 + a1xy + a3y = x^3 + a2x^2 + a4x + a6    (2.1)

We define some quantities related to E as follows:

b2 = a1^2 + 4a2
b4 = 2a4 + a1a3
b6 = a3^2 + 4a6
b8 = a1^2 a6 + 4a2a6 − a1a3a4 + a2a3^2 − a4^2
c4 = b2^2 − 24b4
c6 = −b2^3 + 36b2b4 − 216b6
∆ = −b2^2 b8 − 8b4^3 − 27b6^2 + 9b2b4b6
if ∆ ≠ 0, j = c4^3/∆    (2.2)

where a1, a2, a3, a4, a6 ∈ K and ∆ ≠ 0. ∆ is called the discriminant of E. When ∆ ≠ 0, the quantity j defined above is called the j-invariant, or simply the invariant, of E.

Definition 2.2. (E over L) If L is any extension field of K, then the set of L-rational points on E is

E(L) = {(x, y) ∈ L × L : y^2 + a1xy + a3y − x^3 − a2x^2 − a4x − a6 = 0} ∪ {∞}

Here rational means that the coordinates lie in the designated field.

Equation 2.1 is important because it is the original form of the curves used in ECC. We can use the Weierstrass equation to perform coordinate projections and analyze the characteristics of a curve; the most important transformations and derivations are based on Equation 2.1.
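The quantities of Definition 2.1 can be evaluated mechanically. The sketch below (Python, illustrative) computes b2 through j from the coefficients and checks them against the classical identity c4^3 − c6^2 = 1728∆, which holds for every Weierstrass curve:

```python
# Compute the b-, c-quantities, discriminant and j-invariant of a curve
# y^2 + a1*x*y + a3*y = x^3 + a2*x^2 + a4*x + a6 (Equation 2.2).
from fractions import Fraction

def curve_quantities(a1, a2, a3, a4, a6):
    b2 = a1*a1 + 4*a2
    b4 = 2*a4 + a1*a3
    b6 = a3*a3 + 4*a6
    b8 = a1*a1*a6 + 4*a2*a6 - a1*a3*a4 + a2*a3*a3 - a4*a4
    c4 = b2*b2 - 24*b4
    c6 = -b2**3 + 36*b2*b4 - 216*b6
    disc = -b2*b2*b8 - 8*b4**3 - 27*b6*b6 + 9*b2*b4*b6
    j = Fraction(c4**3, disc) if disc != 0 else None  # j-invariant
    return b2, b4, b6, b8, c4, c6, disc, j
```

For the short form y^2 = x^3 + ax + b this reduces to the familiar ∆ = −16(4a^3 + 27b^2); e.g. the curve y^2 = x^3 − x has ∆ = 64 and j = 1728.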

Now we discuss ECC in the prime field GF (p). The group law for E/GF (p): y^2 = x^3 + ax + b, where p ≠ 2, 3, is as follows: 1. Identity. P + ∞ = ∞ + P = P for all P ∈ E(GF (p)). 2. Negatives. If P = (x, y) ∈ E(GF (p)), then (x, y) + (x, −y) = ∞. The point (x, −y) is denoted by −P and is called the negative of P, i.e.,

If P = (x, y) ∈ E(GF (p)), then −P ∈ E(GF (p)) is (x, −y)    (2.3)

3. Point Addition.

Let P = (x1, y1) ∈ E(GF (p)) and Q = (x2, y2) ∈ E(GF (p)), where P ≠ ±Q. Then P + Q = (x3, y3), where

x3 = ((y2 − y1)/(x2 − x1))^2 − x1 − x2  and  y3 = ((y2 − y1)/(x2 − x1))(x1 − x3) − y1    (2.4)

4. Point Doubling.

Let P = (x1, y1) ∈ E(GF (p)). Then 2P = (x3, y3), where

x3 = ((3x1^2 + a)/(2y1))^2 − 2x1  and  y3 = ((3x1^2 + a)/(2y1))(x1 − x3) − y1    (2.5)

The group law defines the operations on the points of a curve E over a field. With the group law we can perform point addition and point doubling. We now explain some terminology that is useful for the related topics.
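A quick numeric check of the group law (a Python sketch on a small illustrative curve, not one of the NIST curves): on E: y^2 = x^3 + 2x + 2 over GF(17), apply the point-addition formula (2.4) to P = (5, 1) and Q = (6, 3) and verify that the result still lies on E.

```python
# Straight-line evaluation of Equation (2.4); pow(x, -1, p) (Python 3.8+)
# computes the modular inverse used for the division by (x2 - x1).
p, a, b = 17, 2, 2
x1, y1 = 5, 1          # P lies on E: 1 == (125 + 10 + 2) mod 17
x2, y2 = 6, 3          # Q lies on E
lam = (y2 - y1) * pow(x2 - x1, -1, p) % p     # slope of the chord PQ
x3 = (lam*lam - x1 - x2) % p                  # Equation (2.4)
y3 = (lam*(x1 - x3) - y1) % p
assert (y3*y3 - (x3**3 + a*x3 + b)) % p == 0  # (x3, y3) is again on E
```

Here P + Q works out to (10, 6), which indeed satisfies the curve equation modulo 17.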

Definition 2.3. (Order of E over GF(p)) Let E be an elliptic curve defined over GF(p). The number of points on curve E over GF(p), denoted #E(GF(p)), is called the order of E over GF(p).

Theorem 2.4. (Hasse's Theorem) The number of points on an elliptic curve E over GF(p), denoted #E(GF(p)), is bounded by:

p + 1 − 2√p ≤ #E(GF (p)) ≤ p + 1 + 2√p

Definition 2.5. (Hasse Interval) The interval [p + 1 − 2√p, p + 1 + 2√p] is called the Hasse interval.

Definition 2.6. (The trace of E over GF (p)) If E is defined over GF (p), then #E(GF (p)) = p + 1 − t, where |t| ≤ 2√p; t is called the trace of E over GF (p). Since 2√p is small relative to p, we have #E(GF (p)) ≈ p.

While the order, interval and trace are introduced here for the prime field, they can be derived for GF (2m) as well.
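Hasse's theorem can be checked by brute force on a small curve (a Python sketch; the curve y^2 = x^3 + 2x + 2 over GF(17) is illustrative only): count all affine solutions plus the point at infinity and compare against the Hasse interval.

```python
# Naive point count of E: y^2 = x^3 + a*x + b over GF(p), including the
# point at infinity. Only feasible for tiny p, but enough to see the bound.
def count_points(a, b, p):
    n = 1                                  # the point at infinity
    for x in range(p):
        rhs = (x*x*x + a*x + b) % p
        for y in range(p):
            if (y*y) % p == rhs:
                n += 1
    return n
```

For p = 17, a = b = 2 the count is 19, which sits comfortably inside the Hasse interval [17 + 1 − 2√17, 17 + 1 + 2√17] ≈ [9.8, 26.2]; the trace is t = p + 1 − #E = −1.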

Now we discuss the characteristics of curves over GF (2m). The group law for E/GF (2m): y^2 + xy = x^3 + ax^2 + b is as follows: 1. Identity. P + ∞ = ∞ + P = P for all P ∈ E(GF (2m)). 2. Negatives. If P = (x, y) ∈ E(GF (2m)), then (x, y) + (x, x + y) = ∞. The point (x, x + y) is denoted by −P and is called the negative of P, i.e.,

If P = (x, y) ∈ E(GF (2m)), then −P ∈ E(GF (2m)) is (x, x + y)    (2.6)

3. Point Addition.

Let P = (x1, y1) ∈ E(GF (2m)) and Q = (x2, y2) ∈ E(GF (2m)), where P ≠ ±Q. Then P + Q = (x3, y3), where

x3 = ((y1 + y2)/(x1 + x2))^2 + (y1 + y2)/(x1 + x2) + x1 + x2 + a    (2.7)

and  y3 = ((y1 + y2)/(x1 + x2))(x1 + x3) + x3 + y1    (2.8)

4. Point Doubling.

Let P = (x1, y1) ∈ E(GF (2m)). Then 2P = (x3, y3), where

x3 = (x1 + y1/x1)^2 + (x1 + y1/x1) + a  and  y3 = x1^2 + (x1 + y1/x1)x3 + x3    (2.9)

In the binary field, we can choose different bases to build different systems. Here we introduce two kinds of basis: 1. Polynomial basis. 2. Normal basis.

A polynomial basis is specified by an irreducible polynomial, called the field polynomial. The bit string (am−1 . . . a2a1a0) is taken to represent the polynomial

am−1 t^(m−1) + . . . + a2 t^2 + a1 t + a0

over GF (2). The field arithmetic is implemented as polynomial arithmetic modulo p(t), where p(t) is the field polynomial.

Definition 2.7. (Normal Basis [15]) A normal basis of GF (q^m) (q = 2^n) over GF (q) is a basis of the form

{a, a^q, a^(q^2), . . . , a^(q^(m−1))},    (2.10)

where a ≠ 0, a ∈ GF (q^m).

Both the polynomial and the normal basis are used in the NIST-recommended test patterns [16].

Lemma 2.8. (Itoh and Tsujii [17]) Let an element x in GF (q^m) (q = 2^n) be represented in the normal basis of Equation (2.10) in the form

x = x0 a + x1 a^q + . . . + xm−1 a^(q^(m−1)) = [x0, x1, . . . , xm−1],    (2.11)

where {a, a^q, . . . , a^(q^(m−1))} is a normal basis over GF (q). Then x^(q^k) can be computed by k cyclic shifts of Equation (2.11), such that

x^(q^k) = [xm−k, xm−k+1, . . . , xm−1, x0, . . . , xm−k−1].    (2.12)

We call Equation (2.12) a cyclic shift over GF (q).
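Lemma 2.8 is what makes normal bases attractive in hardware: raising to the power q^k costs no arithmetic at all, only a rotation of the coefficient vector. A minimal Python sketch of Equation (2.12):

```python
# Normal-basis exponentiation by q^k: k right-cyclic shifts of the
# coefficient vector [x0, ..., x_{m-1}] (Equation 2.12).
def qk_power(coords, k):
    """Return the normal-basis coordinates of x^(q^k)."""
    m = len(coords)
    k %= m                       # x^(q^m) = x (Lemma 2.9)
    return coords[m - k:] + coords[:m - k]
```

Taking k = m rotates the vector all the way around, which is exactly the identity x^(q^m) = x of Lemma 2.9 below.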

Lemma 2.9. (MacWilliams and Sloane [15]) Every element x ∈ GF (q^m) (q = 2^n) satisfies the identity

x^(q^m) = x    (2.13)

Definition 2.10. (Wang, 1985 [18]) Every x ≠ 0, x ∈ GF (2^m), has a unique multiplicative inverse x^(−1).

Since x ≠ 0 also satisfies Equation (2.13), i.e., x^(2^m) = x, the inverse is given by x^(−1) = x^(2^m − 2). Here the exponent can be written as 2^m − 2 = 2 + 2^2 + . . . + 2^(m−1), hence x^(−1) can be computed as

x^(−1) = (x^2)(x^(2^2)) . . . (x^(2^(m−1)))    (2.14)
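Equation (2.14) can be demonstrated on a tiny binary field (a Python sketch over GF(2^4) with the illustrative field polynomial t^4 + t + 1, chosen small so every element can be checked; it is not one of the thesis's NIST fields):

```python
# Fermat-style inversion in GF(2^4): x^{-1} = x^2 * x^{2^2} * ... *
# x^{2^{m-1}} (Equation 2.14), built from repeated squaring.
M = 4
POLY = 0b10011  # p(t) = t^4 + t + 1, irreducible over GF(2)

def mul(a, b):
    """Shift-and-add multiplication mod POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def inv(x):
    """Accumulate the product of Equation (2.14): m-2 squarings."""
    t = mul(x, x)        # x^2
    result = t
    for _ in range(M - 2):
        t = mul(t, t)    # x^{2^2}, x^{2^3}, ...
        result = mul(result, t)
    return result
```

Every nonzero element then satisfies x · x^(−1) = 1, matching Definition 2.10.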

2.6 Motivation

The main motivation of this thesis is to perform a detailed analysis of the MBC structure via a complex algorithm. As we know, most security algorithms are computation-intensive. In this thesis, we choose ECC as our test algorithm. The rationale is as follows: 1. ECC is a complicated algorithm, for which a software implementation is not well suited to real-time tasks. As mentioned in many papers, the security strength of 160-bit ECC over GF (2m) is the same as that of 1024-bit RSA [19] [20] [21] [22]. ECC and RSA are widely used in authentication and signature protocol applications. 2. ECC appears in many standards, such as ANSI X9.63, IEEE 1363a [23] and FIPS PUB 186-3 [24]. 3. ECC is composed of elementary operations that can be efficiently implemented in reconfigurable hardware, from which we can obtain useful information for analyzing the performance of each device. The simulation results will be analyzed and compared with those obtained from the other platforms.

Chapter 3

Design Principles and Methodology

In this chapter, we present the implementation details for ECC. Based on the NIST-recommended ECC curves [16] and different types of inversion modules, we have three ECC designs: 1. 192-bit ECC in GF (p). 2. 163-bit ECC in GF (2m) with binary inversion. 3. 163-bit ECC in GF (2m) with Itoh-Tsujii inversion.

We use a pseudo-random curve in GF (p) and a Koblitz curve in GF (2m). Pseudo-random curves have coefficients generated from the output of a hash function [16].

3.1 Curves over Prime Field

The equation of the elliptic curve over GF (p) is given as

y^2 ≡ x^3 − 3x + b (mod p)    (3.1)

and the curve has prime order r. (Thus, for these curves, the cofactor is always f = 1.) The following parameters are given:
• Prime modulus p
• Order r
• 160-bit input seed to the SHA-1 based algorithm, s
• Output of the SHA-1 based algorithm, c
• Coefficient b (satisfying b^2 c ≡ −27 (mod p))
• The base point x coordinate, Gx
• The base point y coordinate, Gy

In the ANSI X9.62 and IEEE P1363 standards, the pseudo-random curves are generated using the SHA-1 based algorithm [?]. The parameter values for the 192-bit curve P-192 are given below: Curve P-192

p = (FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF)16

r = (FFFFFFFFFFFFFFFFFFFFFFFF99DEF836146BC9B1B4D22831)16

s = (3099D2BBBFCB2538542DCD5FB078B6EF5F3D6FE2C745DE65)16

b = (64210519E59C80E70FA7E9AB72243049FEB8DEECC146B9B1)16

c = (3099D2BBBFCB2538542DCD5FB078B6EF5F3D6FE2C745DE65)16

Gx = (188DA80EB03090F67CBF20EB43A18800F4FF0AFD82FF1012)16

Gy = (07192B95FFC8DA78631011ED6B24CDD573F977A11E794811)16

−P = (Gx, −Gy)

Gx = (188DA80EB03090F67CBF20EB43A18800F4FF0AFD82FF1012)16

−Gy = p − Gy (mod p)

= (F8E6D46A003725879CEFEE1294DB32298C06885EE186B7EE)16

Before discussing the algorithms used in the design, we present the four basic field operations underlying the elliptic group operations. 1. Addition: the addition of two field elements A and B to obtain another element C of GF (p). Since p is a large prime (192 bits), we have used the GNU Multiple Precision Arithmetic Library [?] to deal with the large numbers instead of dividing them into smaller numbers. Since the result may be larger than p, a subtraction step may be needed to obtain the final value. For convenience, we use ECC ADD to represent this addition operation in this thesis.

Algorithm 3.1 Addition in GF (p)
Input: A, B ∈ GF (p)
Output: C = A + B ∈ GF (p)
1: C ← A + B
2: if C ≥ p then
3:   C ← C − p
4: end if
5: return(C)

2. Subtraction: Subtraction in GF (p) is similar to addition. After subtracting, we check whether the result is within GF (p), which requires adding p back if the result is negative.

Algorithm 3.2 Subtraction in GF (p)
Input: A, B ∈ GF (p), k = ⌊log2 p⌋ + 1
Output: C = A − B ∈ GF (p)
1: C ← A − B
2: if C[k + 1] == 1 then
3:   C ← C + p
4: end if
5: return(C)

In order to avoid producing a wrong result, we always use (k + 1) bits for C. Here C[k + 1] means the (k + 1)th bit of C, with k = ⌊log2 p⌋ + 1; this bit acts as the sign of the intermediate result. We use ECC SUB to represent this operation.
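Algorithms 3.1 and 3.2 translate to a few lines of software (a Python sketch; Python's signed integers stand in for the (k + 1)-bit hardware register, so the sign-bit test becomes a comparison with zero):

```python
def ecc_add(a, b, p):
    """Algorithm 3.1: modular addition with one conditional subtraction."""
    c = a + b
    if c >= p:
        c -= p
    return c

def ecc_sub(a, b, p):
    """Algorithm 3.2: C[k+1] == 1 in hardware corresponds to c < 0 here,
    and the correction is a single addition of p."""
    c = a - b
    if c < 0:
        c += p
    return c
```

Both results are always in [0, p − 1], so no general modular reduction is needed; a single conditional add or subtract suffices because the inputs are already reduced.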

3. Multiplication: While there are several multiplication algorithms in the literature [25] [26] [27] [28], two widely used algorithms for large-number systems are Karatsuba and Montgomery multiplication [29] [30]. The multiplication algorithm used in this thesis is the binary add-and-shift Montgomery algorithm [31].

Algorithm 3.3 Binary Add-and-Shift Montgomery Product
Input: A, B and p, where p[0] = 1
Output: u = A · B · 2^(−k) (mod p), k = ⌊log2 p⌋ + 1
1: u ← 0
2: for i = 0 to (k − 1) do
3:   u ← u + A[i] · B
4:   if u[0] == 1 then
5:     u ← u + p
6:   end if
7:   u ← u >> 1
8: end for
9: return(u)
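Algorithm 3.3 and the two-pass scheme of Equations (3.2)-(3.3) below can be transcribed directly (a Python sketch; a final conditional subtraction is added so the returned value is fully reduced, which the hardware statement of the algorithm leaves implicit):

```python
def mont(a, b, p, k):
    """Binary add-and-shift Montgomery product: a*b*2^(-k) mod p (p odd)."""
    u = 0
    for i in range(k):
        u += ((a >> i) & 1) * b   # u <- u + A[i]*B
        if u & 1:
            u += p                # make u even so the shift is exact
        u >>= 1
    return u if u < p else u - p  # final reduction: u < 2p is guaranteed

def mul_mont(x, y, p, k):
    """ECC_MUL_MONT: a full modular multiply via two Montgomery passes."""
    n = pow(2, 2 * k, p)          # pre-computed 2^(2k) mod p
    z = mont(x, y, p, k)          # x*y*2^(-k) mod p        (Equation 3.2)
    return mont(z, n, p, k)       # z*2^(2k)*2^(-k) = x*y   (Equation 3.3)
```

The second pass with the precomputed constant n = 2^(2k) mod p cancels the 2^(−k) factor, which is exactly why two ECC MONT operations make up one ECC MUL MONT.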

It is important to note that the correct result cannot be obtained by a single Montgomery operation. This can be understood from the following:

mont(x · y) ≡ (x · y) · 2^(−k) ≡ Z (mod p)    (3.2)

mont(Z · 2^(2k)) ≡ (Z · 2^(2k)) · 2^(−k) ≡ (((x · y) · 2^(−k)) · 2^(2k)) · 2^(−k) ≡ x · y (mod p)    (3.3)

We can pre-calculate the value 2^(2k) ≡ n (mod p); in this case n ≡ (100000000000000020000000000000001)16 (mod p).

In other words, we need two Montgomery operations to complete a single multiplication in GF (p). We use the notation ECC MONT for a single binary add-and-shift Montgomery operation, and ECC MUL MONT for the whole multiplication (Equation (3.2) followed by Equation (3.3)). 4. Multiplicative inversion:

Algorithm 3.4 Binary algorithm for Inversion in GF (p)
Input: A prime p and a ∈ [1, p − 1].
Output: a^(−1) mod p.
1: u ← a, v ← p
2: x1 ← 1, x2 ← 0
3: while (u ≠ 1 and v ≠ 1) do
4:   while (u[0] == 0) do
5:     u ← u >> 1
6:     if (x1[0] == 0) then x1 ← x1 >> 1;
7:     else x1 ← (x1 + p) >> 1
8:     end if
9:   end while
10:  while (v[0] == 0) do
11:    v ← v >> 1
12:    if (x2[0] == 0) then x2 ← x2 >> 1;
13:    else x2 ← (x2 + p) >> 1
14:    end if
15:  end while
16:  if (u ≥ v) then u ← u − v, x1 ← x1 − x2;
17:  else v ← v − u, x2 ← x2 − x1
18:  end if
19: end while
20: if (u == 1) then return (x1 mod p);
21: else return (x2 mod p)
22: end if
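Algorithm 3.4 transcribes almost line for line into software (a Python sketch; x1 and x2 are kept as plain signed integers and reduced mod p at the end, exactly as the algorithm states):

```python
def ecc_inv(a, p):
    """Binary inversion (Algorithm 3.4): a^(-1) mod p for odd prime p."""
    u, v = a, p
    x1, x2 = 1, 0
    while u != 1 and v != 1:
        while u % 2 == 0:                 # halve u, keeping a*x1 = u (mod p)
            u //= 2
            x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2
        while v % 2 == 0:                 # halve v, keeping a*x2 = v (mod p)
            v //= 2
            x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2
        if u >= v:                        # subtract the smaller pair
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return x1 % p if u == 1 else x2 % p
```

The nested while loops here are precisely the ones the text discusses below: easy in software, but requiring an FSM (or a pre-computed iteration count) when mapped to Verilog or the MBC.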

In both group laws, multiplicative inversion is the most difficult operation to implement. This is why so many papers prefer projective coordinates over affine ones [32] [33]. When projective coordinates are used, the multiplicative inversion needs to be performed only once, to obtain the proper initial points. There are two benefits to this approach: 1) the total design area can be reduced; and 2) the computation time of the whole operation can be reduced.

There are many while loops and branch conditions in the algorithm. The while loops are not easy to synthesize in either the Verilog or the MBC code. Fortunately, we can analyze the algorithm and rewrite the code to avoid the loops. In Verilog, we can use a finite state machine (FSM) and if...else branch conditions to eliminate the while loops. In the MBC framework, we use the C language to obtain the real iteration count of each loop and then rewrite the code as simple counted loops. We use ECC_INV to represent this algorithm. Now that we have these four basic algorithms, we can combine them into PointAddition and PointDoubling.

Algorithm 3.5 ECC Point Addition in GF (p)
Input: E/GF (p): y² = x³ + ax + b, p ≠ 2, 3; P, Q ∈ GF (p) × GF (p), P = (x1, y1), Q = (x2, y2), P ≠ ±Q
Output: (x3, y3) = P + Q

1: temp00 ← ECC_SUB(x2, x1)
2: temp01 ← ECC_SUB(y2, y1)
3: temp02 ← ECC_INV(temp00)
4: temp03 ← ECC_MUL_MONT(temp02, temp01)
5: temp04 ← ECC_MUL_MONT(temp03, temp03)
6: temp05 ← ECC_ADD(x1, x2)
7: x3 ← ECC_SUB(temp04, temp05)
8: temp06 ← ECC_SUB(x1, x3)
9: temp07 ← ECC_MUL_MONT(temp03, temp06)
10: y3 ← ECC_SUB(temp07, y1)

Note that we use ECC_Point_Add and ECC_Point_Double to represent these two algorithms respectively. The NIST scalar multiplication can be implemented using these two algorithms.

Algorithm 3.6 ECC Point Doubling in GF (p)
Input: E/GF (p): y² = x³ + ax + b, p ≠ 2, 3; P = (x1, y1) ∈ GF (p) × GF (p)
Output: (x3, y3) = 2P

1: temp00 ← ECC_MUL_MONT(x1, x1)
2: temp01 ← ECC_MUL_MONT(3, temp00)
3: temp02 ← ECC_ADD(temp01, a)
4: temp03 ← ECC_MUL_MONT(2, y1)
5: temp04 ← ECC_INV(temp03)
6: temp05 ← ECC_MUL_MONT(temp02, temp04)
7: temp06 ← ECC_MUL_MONT(temp05, temp05)
8: temp07 ← ECC_MUL_MONT(2, x1)
9: x3 ← ECC_SUB(temp06, temp07)
10: temp08 ← ECC_SUB(x1, x3)
11: temp09 ← ECC_MUL_MONT(temp05, temp08)
12: y3 ← ECC_SUB(temp09, y1)

Algorithm 3.7 Test NIST patterns using Point Scalar Multiplication
Input: The prime modulus p, the NIST recommended curve E/GF (p): y² = x³ + ax + b, p ≠ 2, 3, the order r of the curve, and the initial point Q = (x0, y0) ∈ GF (p) × GF (p)
Output: (x1, y1) = (r − 1)Q
1: k ← (r − 1)
2: (x1, y1) ← (x0, y0)
3: i ← ⌊log2 k⌋ + 1
4: for (j = i down to 2) do
5:   (x2, y2) ← ECC_Point_Double(x1, y1)
6:   if (k[j − 2] == 1) then
7:     (x1, y1) ← ECC_Point_Add((x0, y0), (x2, y2))
8:   else
9:     (x1, y1) ← (x2, y2)
10:  end if
11: end for
12: return (x1, y1)
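Algorithms 3.5 to 3.7 can be exercised end-to-end on a small textbook curve. The sketch below uses y² = x³ + 2x + 2 over GF(17), whose point (5, 1) has order 19; the curve and all names are illustrative stand-ins for the 192-bit NIST data, and the inversion uses Fermat's little theorem rather than Algorithm 3.4 for brevity:

```c
#include <stdint.h>

/* Double-and-add scalar multiplication on the toy curve
 * y^2 = x^3 + 2x + 2 over GF(17); the point (5,1) has order 19. */
enum { PRIME = 17, CURVE_A = 2 };

static int64_t md(int64_t v) { return ((v % PRIME) + PRIME) % PRIME; }

/* Inverse via Fermat: v^(p-2) mod p (stand-in for Algorithm 3.4). */
static int64_t inv(int64_t v) {
    int64_t r = 1, b = md(v);
    for (int e = PRIME - 2; e > 0; e >>= 1) {
        if (e & 1) r = md(r * b);
        b = md(b * b);
    }
    return r;
}

/* Algorithm 3.5: P + Q, precondition P != +-Q. */
static void padd(int64_t x1, int64_t y1, int64_t x2, int64_t y2,
                 int64_t *x3, int64_t *y3) {
    int64_t s = md(md(y2 - y1) * inv(md(x2 - x1)));   /* slope */
    *x3 = md(s * s - x1 - x2);
    *y3 = md(s * md(x1 - *x3) - y1);
}

/* Algorithm 3.6: 2P, precondition y1 != 0. */
static void pdbl(int64_t x1, int64_t y1, int64_t *x3, int64_t *y3) {
    int64_t s = md(md(3 * x1 * x1 + CURVE_A) * inv(md(2 * y1)));
    *x3 = md(s * s - 2 * x1);
    *y3 = md(s * md(x1 - *x3) - y1);
}

/* Algorithm 3.7's MSB-first loop: always double, add when the bit is set. */
static void smul(int64_t k, int64_t x0, int64_t y0, int64_t *x, int64_t *y) {
    int i = 62;
    while (i > 0 && !((k >> i) & 1)) i--;   /* locate the MSB of k */
    *x = x0; *y = y0;
    for (i--; i >= 0; i--) {                /* process the remaining bits */
        int64_t xt, yt;
        pdbl(*x, *y, &xt, &yt);
        if ((k >> i) & 1) padd(x0, y0, xt, yt, x, y);
        else { *x = xt; *y = yt; }
    }
}
```

With k = r − 1 = 18 this returns (5, 16) = −(5, 1), the same negative-point check the thesis applies to the NIST patterns; with k = 2 it returns the doubling 2P = (6, 3).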

3.2 Curves over Binary Field

An elliptic curve E over the finite field GF (2^m), known as a Koblitz curve, is given by the following equation:

E : y2 + xy = x3 + ax2 + 1, (3.4)

where x, y, a ∈ GF (2^m). For the 163-bit binary field, the NIST recommended parameters are as follows: field polynomial p(t) = t^163 + t^7 + t^6 + t^3 + 1, or in hexadecimal format (t = 2):

p = (800000000000000000000000000000000000000C9)16 (3.5)

Curve K-163:
a = 1
r = (5846006549323611672814741753598448348329118574063)10
or in hexadecimal format
r = (4000000000000000000020108A2E0CC0D99F8A5EF)16
Polynomial basis:

Gx = (2FE13C0537BBC11ACAA07D793DE4E6D5E5C94EEE8)16

Gy = (289070FB05D38FF58321F2E800536D538CCDAA3D9)16

From Equation (2.6), for a point P = (x, y) in E(GF (2^m)) we have −P = (x, x + y). To perform the point scalar multiplication with k = (order − 1), we can compute the point −P:

−P = (Gx, Gx + Gy)

Gx = (2FE13C0537BBC11ACAA07D793DE4E6D5E5C94EEE8)16. Here Gx + Gy = Gx ⊕ Gy

= (07714CFE32684EEF49818F913DB78B866904E4D31)16. From the group law over GF (2^m), we find that we need three basic operations over GF (2^m).

1. Addition: In GF (2^m), addition and subtraction are the same operation, implemented as exclusive-or (XOR).

Algorithm 3.8 Addition in GF (2^m)
Input: Binary polynomials A and B of degrees at most m − 1
Output: C = A + B
1: for (i from 0 to m − 1) do
2:   C[i] ← A[i] ⊕ B[i]
3: end for
4: return(C)

2. Multiplication: The multiplication operation in the prime field is different from that in the binary field. For example, 7 × 3 = 21 in GF (p), but in GF (2^m), (t² + t + 1) × (t + 1) = t³ + 1, since

(111) × (11):    (3.6)

      0111
    ⊕ 1110
    ------
      1001

Here we apply the right-to-left shift-and-add field multiplication in our design.

Algorithm 3.9 Right-to-left shift-and-add field multiplication in GF (2^m)
Input: Binary polynomials A(t) and B(t) of degrees at most m − 1; p(t) is the field polynomial, p(t) = t^163 + t^7 + t^6 + t^3 + 1
Output: C(t) = A(t) · B(t) mod p(t)
1: if (A[0] == 1) then C ← B
2: else C ← 0
3: end if
4: for (i from 1 to m − 1) do
5:   B ← B << 1 (mod p(t))
6:   if (A[i] == 1) then C ← C ⊕ B
7:   end if
8: end for
9: return(C)

In Algorithm 3.9, observe line 5, B ← B << 1 (mod p(t)). Because this operation only shifts B one bit to the left, we can rewrite it as:

B ← B << 1
if (B[ECC_BITS] == 1) then B ← B ⊕ p(t)

This eliminates the need to apply the full fast reduction after B is shifted left by one bit: only an XOR with the field polynomial is required. This reduces the design time and design area.
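The modified shift step can be seen in a toy C version of Algorithm 3.9 over GF(2^4) with p(t) = t^4 + t + 1 (the small field, constants and name are illustrative; the thesis uses m = 163):

```c
#include <stdint.h>

/* Right-to-left shift-and-add multiplication (Algorithm 3.9) with the
 * modified one-bit reduction, demonstrated in GF(2^4). */
enum { M = 4 };
static const uint32_t POLY = 0x13;        /* 1 0011 = t^4 + t + 1 */

static uint32_t gf2_mul(uint32_t a, uint32_t b) {
    uint32_t c = (a & 1) ? b : 0;         /* if A[0] == 1 then C <- B */
    for (int i = 1; i < M; i++) {
        b <<= 1;                          /* B <- B << 1              */
        if (b & (1u << M)) b ^= POLY;     /* if B[m] == 1, B ^= p(t)  */
        if ((a >> i) & 1) c ^= b;         /* conditional accumulate   */
    }
    return c;
}
```

This reproduces the worked example from the text: gf2_mul(0b0111, 0b0011) returns 0b1001, i.e., (t² + t + 1)(t + 1) = t³ + 1.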

3. Multiplicative inversion: Multiplicative inversion is also a complex operation in GF (2^m). By comparing Algorithm 3.10 with Algorithm 3.4, we can observe the subtle differences between these two algorithms. Since the conventional multiplicative inversion module requires a lot of hardware

Algorithm 3.10 Binary algorithm for Inversion in GF (2^m)
Input: A nonzero binary polynomial a of degree at most m − 1
Output: a⁻¹ (mod p(t)), p(t) = t^163 + t^7 + t^6 + t^3 + 1
1: u ← a, v ← p
2: x1 ← 1, x2 ← 0
3: while (u ≠ 1 and v ≠ 1) do
4:   while (u[0] == 0) do
5:     u ← u >> 1
6:     if (x1[0] == 0) then x1 ← x1 >> 1
7:     else x1 ← (x1 ⊕ p) >> 1
8:     end if
9:   end while
10:  while (v[0] == 0) do
11:    v ← v >> 1
12:    if (x2[0] == 0) then x2 ← x2 >> 1
13:    else x2 ← (x2 ⊕ p) >> 1
14:    end if
15:  end while
16:  if (u ≥ v) then u ← u ⊕ v, x1 ← x1 ⊕ x2
17:  else v ← v ⊕ u, x2 ← x2 ⊕ x1
18:  end if
19: end while
20: if (u == 1) then return (x1)
21: else return (x2)
22: end if

resources, we also try an alternative inversion module: the algorithm proposed by Itoh and Tsujii [17], commonly known as the Itoh-Tsujii inversion algorithm.

Algorithm 3.11 Itoh-Tsujii algorithm for Inversion in GF (2^m)
Input: A nonzero element a ∈ GF (2^m), m is odd
Output: B = a⁻¹ in GF (2^m)
1: A ← a², B ← 1, x ← (m − 1) >> 1
2: while (x ≠ 0) do
3:   A ← A · A^(2^x)
4:   if (x[0] == 0) then x ← x >> 1
5:   else B ← B · A, A ← A², x ← (x − 1) >> 1
6:   end if
7: end while
8: return (B)

Although the algorithm appears smaller than Algorithm 3.10, it requires more clock cycles.

The Itoh-Tsujii inversion scheme requires ⌊log2(m − 1)⌋ + Hw(m − 1) − 1 multiplications in GF (2^m), where Hw denotes the Hamming weight, i.e., the number of non-zero elements in a string. In our case, m = 163, so m − 1 = 162 = (10100010)2. Thus, the required number of multiplications is 7 + 3 − 1 = 9. In total, we need 163 square operations and 9 multiplication operations to finish the whole Itoh-Tsujii inversion.
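A toy C sketch of Algorithm 3.11 in GF(2^7) with the irreducible polynomial t^7 + t + 1 (m must be odd, as the algorithm requires; the field, polynomial and helper names are illustrative stand-ins for the 163-bit design). A^(2^x) is computed by x repeated squarings, which is why the cost of squaring matters so much:

```c
#include <stdint.h>

/* Itoh-Tsujii inversion (Algorithm 3.11) demonstrated in GF(2^7). */
enum { M = 7 };
static const uint32_t POLY = 0x83;        /* 1000 0011 = t^7 + t + 1 */

static uint32_t gf_mul(uint32_t a, uint32_t b) {
    uint32_t c = 0;
    for (int i = 0; i < M; i++) {
        if ((a >> i) & 1) c ^= b;         /* accumulate B * t^i */
        b <<= 1;
        if (b & (1u << M)) b ^= POLY;     /* one-bit reduction  */
    }
    return c;
}

static uint32_t gf_sqr(uint32_t a) { return gf_mul(a, a); }

static uint32_t itoh_tsujii_inv(uint32_t a) {
    uint32_t A = gf_sqr(a), B = 1;
    int x = (M - 1) >> 1;
    while (x != 0) {
        uint32_t t = A;                   /* t = A^(2^x) via x squarings */
        for (int i = 0; i < x; i++) t = gf_sqr(t);
        A = gf_mul(A, t);                 /* A <- A * A^(2^x)            */
        if ((x & 1) == 0) x >>= 1;
        else { B = gf_mul(B, A); A = gf_sqr(A); x = (x - 1) >> 1; }
    }
    return B;                             /* B = a^(2^m - 2) = a^-1 */
}
```

For example, itoh_tsujii_inv(0b10) returns 0b1000001 = t^6 + 1, and indeed t · (t^6 + 1) = t^7 + t ≡ 1 (mod t^7 + t + 1).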

In another representation format, we can compute the inverse of A in GF (2^163) using the following order of exponents [34]:

A⁻¹ = A^(2(2^81+1)[2(2^40+1)(2^20+1)(2^10+1)(2^5+1){2(2^2+1)(2+1)+1}+1])

4. Fast reduction in GF (2^163): The field polynomial p(t) chosen for GF (2^163) is very simple, so the modulo operation with p(t) can be sped up by fast reduction. That is, when we perform the modulo operation with p(t), we can apply a special method to reduce the total operation time. This is called fast reduction modulo.

Algorithm 3.12 Fast reduction modulo p(t) = t^163 + t^7 + t^6 + t^3 + 1 (with W = 32)
Input: A binary polynomial c = C[10]·2^(32×10) + ... + C[0]·2^(32×0) of degree at most 324; each C[i] is a 32-bit word; p(t) is the field polynomial
Output: c(t) mod p(t)
1: for (i from 10 downto 6) do
2:   T ← C[i]
3:   C[i − 6] ← C[i − 6] ⊕ (T << 29)
4:   C[i − 5] ← C[i − 5] ⊕ (T << 4) ⊕ (T << 3) ⊕ T ⊕ (T >> 3)
5:   C[i − 4] ← C[i − 4] ⊕ (T >> 28) ⊕ (T >> 29)
6: end for
7: T ← C[5] >> 3
8: C[0] ← C[0] ⊕ (T << 7) ⊕ (T << 6) ⊕ (T << 3) ⊕ T
9: C[1] ← C[1] ⊕ (T >> 25) ⊕ (T >> 26)
10: C[5] ← C[5] & 0x7
11: return (C[5], C[4], C[3], C[2], C[1], C[0])
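Algorithm 3.12 translates almost line for line into C. In this sketch the shift directions follow the standard word-level reduction for this polynomial, and clearing the folded high words is an added convenience:

```c
#include <stdint.h>

/* Fast reduction modulo p(t) = t^163 + t^7 + t^6 + t^3 + 1, W = 32
 * (Algorithm 3.12).  Input c occupies eleven 32-bit words C[10..0]
 * (degree <= 324); the result fits in C[5..0], with C[5] holding only
 * bits 160..162 of the reduced polynomial. */
static void fast_reduce_163(uint32_t C[11]) {
    for (int i = 10; i >= 6; i--) {
        uint32_t T = C[i];
        C[i - 6] ^= (T << 29);
        C[i - 5] ^= (T << 4) ^ (T << 3) ^ T ^ (T >> 3);
        C[i - 4] ^= (T >> 28) ^ (T >> 29);
        C[i] = 0;                /* word fully folded away */
    }
    uint32_t T = C[5] >> 3;      /* bits 163 and above of the last word */
    C[0] ^= (T << 7) ^ (T << 6) ^ (T << 3) ^ T;
    C[1] ^= (T >> 25) ^ (T >> 26);
    C[5] &= 0x7;                 /* keep only bits 160..162 */
}
```

A quick sanity check: reducing c = t^163 (C[5] = 8, all other words zero) yields C[0] = 0xC9, i.e., t^7 + t^6 + t^3 + 1, exactly as the field polynomial dictates.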

5. Squaring in GF (2^m): In the binary field, squaring is a very simple operation in hardware, since the symmetric cross terms in the product cancel: squaring a number only requires inserting a zero between each pair of adjacent bits. After squaring, fast reduction can be applied to bring the result back into GF (2^m).

Algorithm 3.13 Polynomial squaring
Input: A binary polynomial A(t) of degree at most m − 1
Output: C(t) = A(t)²
1: for (i from 0 to m − 1) do
2:   C[2i] ← A[i]
3:   C[2i + 1] ← 0
4: end for
5: return (C)
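Algorithm 3.13 in C, for operands of up to 16 bits (in hardware this bit-spreading is pure wiring; the function name is illustrative):

```c
#include <stdint.h>

/* Polynomial squaring (Algorithm 3.13): interleave a zero after every
 * bit of the operand, so C[2i] = A[i] and C[2i+1] = 0. */
static uint32_t gf2_square_spread(uint32_t a) {
    uint32_t c = 0;
    for (int i = 0; i < 16; i++)
        c |= ((a >> i) & 1u) << (2 * i);
    return c;
}
```

For example, (t² + t + 1)² spreads 0b111 into 0b10101 = t^4 + t² + 1; a fast reduction pass would then be applied for results of degree m or higher.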

For convenience, the square operation in the binary field is illustrated in Figure 3.1.

Figure 3.1: Squaring in Binary Field

3.3 Software Code for ECC

The software implementation of ECC was written in the C language on CentOS 4.8. The code was written in the NetBeans 7.0.1 IDE using gcc 3.4.6-11 as the compiler. In order to gain better performance, external libraries were used to deal with the large numbers. Specifically, the GNU Multiple Precision Arithmetic Library (GMP 5.0.2), which contains optimized code for multiplication, division and other basic operations, was used.

3.4 RTL code for FPGA design

Verilog was chosen as the hardware description language to describe the ECC design. Verification tools include VCS (for pre-layout simulation) and ModelSim (for post-PnR simulation). In order to compare with other designs, the ECC RTL was implemented on both Altera and Xilinx platforms. We use Altera Quartus 11.0sp1 and Xilinx ISE 9sp2 as our FPGA synthesis and simulation platforms.

3.5 Input Data Flow Graph (DFG) for MBC

The MBC code compiler is developed under Linux. We write two versions of the code for each version of ECC: one loads/stores 8 bits at a time, the other 16 bits. Since some instructions cannot operate in 16-bit mode, those parts of the code are kept in 8-bit mode.

Chapter 4

Implementation of ECC

In this chapter, we describe the details of each design. First, we developed a C program to verify each algorithm. On completion of each subprogram, we combined them and used test patterns from NIST to verify the top-level program. From this C program, we developed the Verilog behavioral model and RTL model. Finally, we developed the Data Flow Graph files that are needed for the MBC.

4.1 Software Implementation

In order to measure the power consumption and execution time correctly, we added some instrumentation to the design. To measure the time, we counted the real number of clock cycles needed by a program [35]. We then read the parameter CLOCKS_PER_SEC and divide the clock-cycle count by CLOCKS_PER_SEC to get the final execution time:

Execution time = (Total # of CLOCK_CYCLES) / CLOCKS_PER_SEC

We run the total scalar multiplication operation 1,000 times to get the total run time and divide it by 1,000 to get a single run time.
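The timing method above can be sketched with the standard C clock() interface (the placeholder workload and the function name are illustrative; the thesis times the actual scalar multiplication runs):

```c
#include <time.h>

/* Average per-run wall time of a dummy workload, measured the way the
 * text describes: total clock() ticks divided by CLOCKS_PER_SEC, then
 * divided by the number of runs.  volatile keeps the loop from being
 * optimized away. */
static double measure_seconds(unsigned runs) {
    volatile unsigned long acc = 0;
    clock_t start = clock();
    for (unsigned r = 0; r < runs; r++)
        for (unsigned long i = 0; i < 100000UL; i++)
            acc += i;                          /* placeholder workload */
    clock_t ticks = clock() - start;           /* total clock cycles   */
    return ((double)ticks / CLOCKS_PER_SEC) / runs;  /* single-run time */
}
```

Calling measure_seconds(1000) mirrors the 1,000-run averaging used in the thesis.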

The other problem is how to calculate the power consumption. Sophisticated CPU power measurement is described in [36] [37] [38]. The best method is to use a power meter to measure the real current and voltage of the CPU [39] [40] and then apply Joule's law: P = I · V (4.1)

Instead, we use a coarse way to estimate the CPU power consumption, based on the average CPU power (ACP) [41], which can be derived from the TDP. TDP stands for Thermal Design Power: the maximum amount of power the cooling system in a computer is required to dissipate. The CPU of our workstation is an Intel Q8200 (2.33 GHz) with a TDP of 95 watts, so the average CPU power of our workstation is

TDP 95 Watt → ACP 75 Watt

There is another factor that dominates the power: the CPU usage. We use the Glibtop [42] library to calculate the user time and obtain the CPU usage. Finally, the CPU energy consumption is close to

CPU Energy ≈ 75 · (CPU usage) · (Execution time)

The results in the next chapter show that software consumes a lot of power. The difference between using ACP, TDP, or even a power meter will not change the order of magnitude of the power. In other words, even though the exact power cannot be calculated, the final conclusion will not change much.

4.1.1 Prime Field

Following the algorithms above, we design the C subprograms as follows:
1. ECC_ADD
2. ECC_SUB
3. ECC_BIT_MONT
4. ECC_MUL_MOD_MONT
5. ECC_INV
6. ECC_PA
7. ECC_PD
8. ECC_TEST

The subprogram ECC_MUL_MOD_MONT is the combination of the two Montgomery functions. From the beginning of the design, we use the subprogram ECC_TEST to verify that a point is still on the curve after an operation. The code is easy to write using the GMP library [43]: it handles large numbers very well, so we do not need to spend time implementing large-number arithmetic ourselves.

4.1.2 Binary Field

It is easier to write the code for the binary field. Keep in mind that we have two kinds of inversion operations in the binary field. The subprograms in the binary field are as follows:
1. ECC_R2L_SAA_MUL_G2
2. ECC_SQUARE_G2
3. ECC_FR_163
4. ECC_ADD_G2
5. ECC_PA_G2
6. ECC_PD_G2
7. ECC_TEST_G2
8. ECC_INV_G2
9. ECC_INV_G2_ITOH

The functions can be understood from the names of the subprograms. Here we write two multiplicative inversions: one is the binary algorithm for inversion (ECC_INV_G2); the other is the Itoh-Tsujii algorithm for inversion (ECC_INV_G2_ITOH). The while loops in C compile very easily, so we can complete the C implementation without any problem.

4.2 Implementation in FPGA

We designed the Verilog code in two versions. The first version is the behavioral model for algorithm verification. It is like C programming, but we face problems when we want to synthesize the code: the while loops, or the if...else constructs rewritten from them, cannot be synthesized correctly, and the synthesizer generates a huge number of gates. The second version is a datapath combined with an FSM (finite state machine) that performs the real operation. In our design, the FSMs are all Moore machines; in other words, the outputs of the FSM are determined only by its current state. After verifying with the VCS tools, we implemented the designs on FPGA. After post-simulation, we obtain the timing report of the design and try to optimize the critical path. One solution is to use a better structure in the datapath. The other is, when we find the critical path in a design module, to add one or more stages and replace the original design with small, fast modules. Because the total design is based on an FSM, the running time of each stage limits the final performance: if one stage needs a longer running time, the other stages must slow down their operating frequency in order to avoid setup or hold time violations. The best FSM design makes sure that each stage has an equal running time. In order to get a better result, we ran many iterations to improve the performance.

Another way to increase the overall operating frequency is to add delay buffers before the flip-flop enable signals. This improves the setup-time margin easily, without removing a pipeline stage.

4.2.1 Prime Field

Just as in C, we divide the total design into small parts. In order to design each part well, all the modules are built as an FSM plus a datapath. Each module has an output signal called "ready", which rises to "1" to show that the operation in the module is finished. We have divided the total design into four basic modules: 1. ecc_hw_add: Although addition is a very simple operation, we still use an FSM and a datapath to implement the module. The module is shown in Figure 4.1.

Figure 4.1: ECC hardware addition module

2. ecc hw sub

In this design, we reserve one extra MSB to determine the sign of the operand. This makes sure that the data input and output are all positive integers; in other words, the data stays in GF (p). The hardware design is shown in Figure 4.2.

Figure 4.2: ECC hardware subtraction module

3. ecc_hw_bit_mont: There are many variants of Montgomery multipliers. In this thesis, we use the binary shift-and-add Montgomery multiplier. The detailed structure is shown in Figure 4.3. Keep in mind that we need two Montgomery passes to perform a whole multiplication. The operations are composed only of shifts and additions, which means the module can run at a very high frequency; but since it only deals with one bit at a time, it takes many clock cycles to finish the whole operation. 4. ecc_hw_inv: This is the most complex part of the whole design. From [44], we can re-organize

Figure 4.3: ECC hardware Montgomery module

this module and design the datapath clearly. We can roughly divide the full module into four main datapaths and two adders. We then eliminate the while loops by using branch conditions and multiplexers. The structure is shown in Figure 4.4. 5. With these four basic modules, we can build the PointAddition and PointDoubling modules easily. Because we want to increase the total operation throughput, we do not use complex multiplexers to reuse the modules. This increases the total design area, but avoids sacrificing the performance and the simplicity of the design. The Point Addition module is sketched in Figure 4.5, and the Point Doubling module is drawn in Figure 4.6. 6. The top module is kplus. This module performs the whole scalar multiplication operation. For convenience, the input signals of the top module are only (i) the reset signal, (ii) the clock signal, and (iii) the counter value to be run. The other inputs, such as the initial points, are set to fixed numbers. The output signals are only

Figure 4.4: ECC hardware Inversion module

(i) the final x value, (ii) the final y value, and (iii) the ready signal. When we set the input counter to (order − 1), the design stops and shows the final point coordinates. If we verify that the output coordinate is the negative point in GF (p), then the computation is correct. Some papers [33] only test the Point Doubling operation. To compare with those papers, we simply change the input counter to 2, which performs a single point doubling in this design; we do not need to change the whole set of test patterns.

Figure 4.5: ECC hardware Point Addition module

4.2.2 Binary Field

Again, we have divided the total design into four basic modules:

1. ecc_hw_add_g2: In GF (2^m), the addition and subtraction operations are replaced by the XOR operation. We do not have to check the carry-in or borrow bit after an add/sub operation in GF (2^m). 2. ecc_hw_sqr_g2:

Figure 4.6: ECC hardware Point Doubling module

The square operation in GF (2^m) is very special: it is a simple bit operation combined with a fast reduction module. Many designers use this special characteristic to simplify the whole ECC operation; we should use the square operation instead of multiplication as much as possible. This is also why the normal basis is popular in GF (2^m). 3. ecc_hw_r2l_xor_mul_g2: The right-to-left shift-and-add multiplier is shown in Figure 4.8. Since the shifter only changes one bit at a time, we can modify the original algorithm into a new one.

Figure 4.7: ECC hardware kp module

Figure 4.8: ECC hardware Right-to-left Shift-and-Add Multiply module

Here we do not need the complex reduction operation: we just check the most significant bit after shifting B left by one bit and perform the exclusive-or operation. The whole operation is shown in Figure 4.9. This nuance improves the module's performance and decreases its size. 4. ecc_hw_inv: As mentioned before, inversion is the most complex module of the whole

Figure 4.9: Modified ECC hardware Right-to-left Shift-and-Add Multiply module

design. However, we can analyze the whole algorithm and get a better solution. Here we provide a good way to do the inversion. This binary inversion is very close to the one used in the prime field, but simpler. The operation is shown in Figure 4.10. 5. ecc_hw_inv_itoh_g2: The original idea of the Itoh-Tsujii module is to reduce the area occupied by the binary inversion, at the cost of many more operation cycles. Here we confirm the original designers' intention and expectation. The structure is shown in Figure 4.11. We separate the squaring and multiplication operations carefully because they have different costs in computation and area; this saves a lot of clock cycles in the total operation. 6. The Point Addition and Point Doubling modules are totally different from those used in the prime field. The connections of these two designs are shown in Figure 4.12 and Figure 4.13 respectively. 7. The top modules of the scalar multiplication in GF (2^m) are called kplus_g2 and kplus_g2_itoh respectively. The algorithm used to calculate the scalar multiplication is the same in both fields.

Figure 4.10: ECC hardware inversion module in GF (2^m)

Figure 4.11: ECC hardware Itoh-Tsujii inversion module

4.3 Implementation in MBC

As on the previous platforms, we design the small modules and combine them together. However, since we cannot see the real result from the mapper, we can only verify the parser and the structure of the code. Since it is hard to deal with large numbers, we write the code in two versions: one is based on 8-bit memory load/store, the other on 16-bit memory load/store. These two versions bring about a great difference in power consumption and delay time.

Figure 4.12: ECC hardware Point Addition module in GF (2^m)

4.3.1 Prime Field

Figure 4.13: ECC hardware Point Doubling module in GF (2^m)

In the prime field, there are many obstacles to overcome. In GCC, we have the GMP library to deal with the large-number system, and Verilog handles large numbers easily; however, it is a totally different story in MBC programming. For instance, if we want to add two 192-bit integers, the first thing we have to do is divide them into 16-bit or 8-bit arrays. We first calculate the memory base addresses at which the data are stored and store the initial arrays into memory according to these addresses. When we want to do the addition operation, we load the arrays from memory and add them in parts. Remember that we need to deal with the carry bit when doing addition in GF (p). Because the MBC has no branch conditions, we rewrite the if/while loops as simple counted loops. The number of iterations each loop executes is obtained from our C implementation and converted to a weight when we calculate the total power and operating time. The loop count is determined by the bit length of the counter or its Hamming weight, so we have to write the code very precisely: different counter sizes give different power consumption and delay.
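The limb-by-limb addition with explicit carry propagation described above can be sketched in C (add_192 and the 8-bit limb layout are illustrative of the MBC code, not taken from it):

```c
#include <stdint.h>

/* Multi-word addition: a 192-bit operand is split into 8-bit limbs
 * (least-significant limb first) and added limb by limb while
 * propagating the carry, the step the MBC code must do explicitly. */
enum { LIMBS = 24 };   /* 192 bits / 8 bits per limb */

static uint8_t add_192(const uint8_t a[LIMBS], const uint8_t b[LIMBS],
                       uint8_t sum[LIMBS]) {
    unsigned carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        unsigned t = (unsigned)a[i] + b[i] + carry;
        sum[i] = (uint8_t)t;       /* low 8 bits           */
        carry  = t >> 8;           /* carry into next limb */
    }
    return (uint8_t)carry;         /* carry out of bit 191 */
}
```

In GF(p) a nonzero carry out (or a sum exceeding p) would then trigger the conditional subtraction of p, exactly as in the ECC_ADD datapath.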

We can divide the ECC program in GF (p) into four main subprograms:
1. ECC_ADD
2. ECC_SUB
3. ECC_MONT (Montgomery)
4. ECC_INV

4.3.2 Binary Field

It is easier to write the program in GF (2^m). The addition and subtraction operations in the binary field are both the exclusive-or operation, so in this field we do not care about carry-in or borrow bits. In the binary field, we use the same memory access method as in the prime field to deal with the large-number arithmetic. We write the subprograms and combine them into one piece; the iteration counts are obtained from the C implementation.
1. ECC_ADD_G2
2. ECC_R2L_SAA_MUL (Right-to-Left Shift-And-Add Multiplication)
3. ECC_INV_G2 (Binary Algorithm Inversion)
4. ECC_INV_G2_ITOH (Itoh-Tsujii Algorithm Inversion)
5. ECC_SQUARE_G2 (Squaring in GF (2^m))
6. ECC_FAST_REDUCTION_G2 (Fast Reduction in GF (2^m))

The most important thing is to deal with the counter bits. The other factor that affects the result is the number of memory load/store operations.

Chapter 5

Test Results

In this chapter, we describe the test setup and the validation techniques used to validate the functionality of the mapped hardware. There are very few references with verified ECC test patterns and methods to validate the functionality of such a design. However, as explained in Chapter 4, the Hamming weight of the patterns strongly influences the number of clock cycles of the particular ECC module under examination. As a result, selecting the test patterns from NIST provides uniform input patterns, which in turn allows comparison of the modules at iso-latency. In GF (p), the Hamming weight affects the following algorithms: 1. the Montgomery multiplier; 2. the binary algorithm for inversion. In GF (2^m), the Hamming weight affects the following algorithms: 1. the right-to-left shift-and-add multiplier; 2. the binary algorithm for inversion. The operational clock-cycle latency is decided entirely by the input patterns; in other words, different input patterns result in different performance. In our design, the NIST patterns are used to perform the point scalar multiplication, and the same input patterns are also used to validate the other operations

so as to have a fair comparison of energy consumption at iso-delay.

5.1 Test Patterns and Methodology

Previous papers such as [45] [46] compare individual modules of the ECC application, such as inversion or point doubling, but do not provide suitable comparison results for the entire ECC module, due to the mapping complexity of the entire application. In this thesis, we compare the performance and energy consumption of the entire NIST scalar multiplication. From Section 3.1 and Section 3.2, we obtain the orders of the curves, which define two different mapping schemes of ECC, namely the 163-bit binary field and the 192-bit prime field. After the (r − 1)P scalar multiplication operation, the remaining scalar multiplications of both fields can be determined. The power consumption methodology for the three chosen platforms, namely software, FPGA and the proposed MBC framework, is as follows:
1. In PC software, we monitor the CPU clock cycles and user time to find the running time and power consumption.
2. In FPGA, we dump the VCD file through EDA simulation and use it as input to the power analysis tool.
3. In MBC, the software mapping tool can only report the individual vertex delay and power consumption; the energy consumption for running an entire application is the sum of the energy consumption of the individual vertices.

The experimental results are listed in Tables 5.1, 5.2 and 5.3, which give the number of individual operations for the 192-bit prime field and 163-bit binary field ECC modules, obtained by analyzing the NIST test patterns.

Table 5.1: Number of each operation from the data provided by NIST

                      Prime field (192-bit)   Binary field (163-bit)
# of Point Addition          141                      35
# of Point Doubling          191                     162

Table 5.2: Number of each operation in GF (p) from the data provided by NIST

Operation in Prime field (192-bit)   Number
Addition                                372
Subtraction                            1278
Binary Montgomery                      3520
Binary Inversion                        332

Table 5.3: Number of each operation in GF (2^m) from the data provided by NIST

Operation in Binary field (163-bit)            Number
Addition                                         1090
Square*                                           359
Right-To-Left Shift-And-Add Multiplication        394
Inversion                                         197

*Square is the total number of square operations needed in the entire GF (2^m) design.

5.2 Test Results

The experimental results for power, performance and area requirements for the prime field, binary field and binary field (Itoh) designs are shown in Table 5.4. The size in the

Table 5.4: Power, Performance and Size Comparison

Condition                   S/W (Q8200)   Stratix IV FPGA           MBC 8 bits     MBC 16 bits
prime field    Joule        1.20E+00      6.97E-03                  8.19E-04       3.94E-04
192 bits       Second       6.40E-02      2.82E-02                  5.16E-01       2.50E-01
               size         164 mm²       17513/12724 ALUTs/REGs    3.24E+10 µm²   3.24E+10 µm²
binary field   Joule        3.00E-01      6.52E-04                  2.23E-04       2.10E-04
163 bits       Second       1.60E-02      1.32E-03                  1.40E-01       1.32E-01
               size         164 mm²       6928/7520 ALUTs/REGs      1.14E+10 µm²   1.14E+10 µm²
binary field   Joule        3.89E+00      2.00E-03                  4.57E-04       4.39E-04
Itoh           Second       2.07E-01      3.81E-03                  2.82E-01       2.71E-01
               size         164 mm²       6928/7520 ALUTs/REGs      1.03E+10 µm²   1.03E+10 µm²

table above is explained as follows:
1. For the Intel processor chip, size indicates the processor die size.

2. For the Stratix FPGA, size indicates the resource requirements in ALUTs/REGs.
3. For the proposed framework (MBC), size indicates the area of a single MLB.
The Intel Q8200 CPU and the MBC both use a 45 nm process; the Altera Stratix IV uses a 40 nm process.

Figure 5.1: Energy comparison in prime field

We compare our FPGA design with other proposed designs in Table 5.5. The performance of the design mapped on FPGA in GF (p) is inferior to the previously proposed designs listed in Table 5.5. The main reason for this degradation in performance is the use of the traditional binary add-and-shift Montgomery multiplier [31]: there is a 192-bit adder in the multiplier, and it can only process one bit per clock cycle. We could reduce the total number of clock cycles by using a word-level add-and-shift Montgomery multiplier [31], since the word-level multiplier can handle 32 bits in a single clock cycle. The other comparatively slow modules are the adders, subtractors and

51 Figure 5.2: Energy comparison in binary field

Figure 5.3: Energy comparison in all fields

comparators. These modules are all 192 bits long. For the adders/subtractors, we can use carry look-ahead adders to decrease the latency of the system. We need both

Figure 5.4: Performance comparison in prime field

Figure 5.5: Performance comparison in binary field

magnitude and equality comparators in order to implement the whole design. Here we compare our proposed design with other proposed designs in Table 5.7 in

Figure 5.6: Performance comparison in all fields

Table 5.5: The Comparison of 192 bit Point Multiplication in different Papers

Reference Paper      LUTs    Slices   MHz    cycles   µs       FPGA
Al-Khaleel07 [32]    15739   -        83.3   240      2.88     Virtex 4
Al-Khaleel07 [32]    15739   -        83.3   142      1.705    Virtex 4
Proposed             30275   12935    85.2   13799    161.99   Virtex 4

Table 5.6: The Comparison of 192 bit Scalar Multiplication in different Papers

Reference Paper      LUTs    Slices   MHz    cycles      ms      FPGA
Al-Khaleel07 [32]    15739   -        83.3   ≈ 89833     1.078   Virtex 4
Al-Khaleel07 [32]    15739   -        83.3   ≈ 52000     0.624   Virtex 4
Proposed             30275   12935    85.2   3806514     44.69   Virtex 4

terms of resource usage, latency cycles and absolute latency. The data for the existing designs have been obtained from [33].

Table 5.7: The Comparison of Point Multiplication in different Papers

Design            LUTs    Slices   MHz      cycles   µs      FPGA
Gura02 [47]       19508   -        66.5     9495     143     Virtex-E
Shu05 [48]        25763   -        68.9     -        48.00   Virtex-E
Chelton06 [49]    -       15020    77.0     2831     36.77   Virtex-E
Chelton08 [50]    -       15368    91.1     3010     33.05   Virtex-E
Sutter12 [33]     12218   6432     123.5    5743     46.50   Virtex-E
Sutter12 [33]     20088   10585    102.0    2463     24.50   Virtex-E
Sutter12 [33]     29631   15645    87.7     1699     19.38   Virtex-E
Proposed          12496   7681     135.17   3605     26.69   Virtex-E
Proposed (Itoh)   12106   8379     135.17   9525     70.49   Virtex-E
Zhang10 [51]      -       16209    153.9    3010     19.55   Virtex 4
Chelton08 [50]    7719    4080     197.0    4050     20.56   Virtex 4
Ansari08 [52]     -       20807    185.0    1428     7.72    Virtex 4
Proposed          13264   7898     377.93   3605     9.61    Virtex 4
Proposed (Itoh)   11740   8245     369.8    9525     25.91   Virtex 4
Sutter12 [33]     22936   6150     250.0    1371     5.48    Virtex 5
Proposed          8916    7847     546.15   3605     6.65    Virtex 5
Proposed (Itoh)   8770    8549     539.34   9525     17.72   Virtex 5

Chapter 6

Conclusion and Future Work

We have presented a study on the efficient implementation of ECC algorithms on reconfigurable hardware platforms. We have optimized the application mapping process to improve performance and energy, and evaluated the performance, energy efficiency and area requirements of two different reconfigurable platforms, namely FPGA and MBC. These design parameters are compared with an alternative software-based implementation. We have considered the implementation of ECC in both the prime field and the binary field, and incorporated algorithm optimizations for performance improvement proposed in earlier works. The designs have been optimized separately for each platform.

Our study shows that implementation of ECC algorithms in hardware provides orders of magnitude improvement in performance and energy over the software counterpart. The fine-grained spatial computing framework of FPGA provides better performance, while the spatio-temporal computing framework of MBC provides better energy efficiency; the MBC framework is also well suited to nanoscale technologies. The inversion module is identified as the largest block, and we have considered alternative implementations of it to improve energy efficiency. The performance and energy are evaluated for all three platforms and for several variants of the ECC algorithms. Our analysis would enable designers to choose the right ECC implementation based on a design

target.

The work presented in this thesis can be extended in many ways. First, experimental measurement of performance/energy on all platforms, specifically for MBC at a nanoscale process technology, would be very helpful to establish the energy benefit. The ECC algorithm can be further improved to better map the important steps onto the MBC platform. One option is to employ a better fusion algorithm, which can fuse multiple logic operations into one lookup operation. The datapath structures used in FPGA and MBC can also be optimized for increased energy efficiency. Finally, one can explore the implementation of ECC on other platforms, such as the graphics processing unit (GPU), and compare their performance and energy efficiency with those of the platforms studied here.
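The operation-fusion idea mentioned above can be sketched in software (Python, purely illustrative; the actual MBC fusion operates on the mapped dataflow graph, and the function below is an arbitrary example): two cascaded two-input logic operations collapse into a single three-input lookup table, so one memory read replaces two logic evaluations.

```python
# Fuse the two-gate function f(a, b, c) = (a AND b) XOR c into a single
# 8-entry lookup table, indexed by the concatenated input bits.
fused_lut = [(a & b) ^ c
             for a in (0, 1) for b in (0, 1) for c in (0, 1)]

def fused(a, b, c):
    """One lookup operation replaces two logic operations."""
    return fused_lut[(a << 2) | (b << 1) | c]

# The fused lookup agrees with the original two-operation network.
assert all(fused(a, b, c) == (a & b) ^ c
           for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

In an MBC-style mapping, deeper fusion of this kind reduces the number of sequential lookup cycles at the cost of exponentially larger tables, which is the trade-off a fusion algorithm must balance.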

Appendix A

Simulation Results

A.1 Prime field

Figure A.1: Functional simulation of ECC scalar multiplication in GF(p)

A.2 Binary field

Figure A.2: Functional simulation of ECC scalar multiplication in GF(2^m)

Figure A.3: ECC scalar multiplication (with Itoh-Tsujii) in GF(2^m)

Bibliography

[1] S. Kestur, J. Davis, and O. Williams, “BLAS comparison on FPGA, CPU and GPU,” in VLSI (ISVLSI), 2010 IEEE Computer Society Annual Symposium on, July 2010, pp. 288–293.

[2] X. Tian and K. Benkrid, “Mersenne Twister on FPGA, CPU and GPU,” in Adaptive Hardware and Systems, 2009. AHS 2009. NASA/ESA Conference on, July 29–Aug. 1, 2009, pp. 460–464.

[3] B. Duan, W. Wang, X. Li, C. Zhang, P. Zhang, and N. Sun, “Floating-point mixed-radix FFT core generation for FPGA and comparison with GPU and CPU,” in Field-Programmable Technology (FPT), 2011 International Conference on, Dec. 2011, pp. 1–6.

[4] B. Betkaoui, D. Thomas, and W. Luk, “Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing,” in Field-Programmable Technology (FPT), 2010 International Conference on, Dec. 2010, pp. 94–101.

[5] Y. Abhyankar, C. Sajish, Y. Agarwal, C. Subrahmanya, and P. Prasad, “High performance power spectrum analysis using an FPGA based reconfigurable computing platform,” in Reconfigurable Computing and FPGA’s, 2006. ReConFig 2006. IEEE International Conference on, Sept. 2006, pp. 1–5.

[6] R. Lin, “A reconfigurable low-power high-performance matrix multiplier design,” in Quality Electronic Design, 2000. ISQED 2000. Proceedings. IEEE 2000 First International Symposium on, 2000, pp. 321–328.

[7] J. Noguera and R. Badia, “Power-performance trade-offs for reconfigurable computing,” in Hardware/Software Codesign and System Synthesis, 2004. CODES + ISSS 2004. International Conference on, Sept. 2004, pp. 116–121.

[8] R. Sangireddy, H. Kim, and A. Somani, “Low-power high-performance reconfigurable computing cache architectures,” Computers, IEEE Transactions on, vol. 53, no. 10, pp. 1274–1290, Oct. 2004.

[9] S. Paul and S. Bhunia, “MBARC: A scalable memory based reconfigurable computing framework for nanoscale devices,” in Design Automation Conference, 2008. ASP-DAC 2008. Asia and South Pacific, March 2008, pp. 77–82.

[10] ——, “A scalable memory-based reconfigurable computing framework for nanoscale crossbar,” Nanotechnology, IEEE Transactions on, vol. PP, no. 99, p. 1, 2010.

[11] P. Dillien, “Electrically reconfigurable arrays - ERAs,” in New Directions in VLSI Design, IEE Colloquium on, Nov. 1989, pp. 6/1–6/6.

[12] [Online]. Available: http://www.altera.com/products/fpga.html

[13] “The world’s fastest 40-nm FPGA.” [Online]. Available: http://www.altera.com/devices/fpga/stratix-fpgas/stratix-iv/overview/performance/stxiv-performance.html

[14] M. J. Alexander and G. Robins, “New performance-driven FPGA routing algorithms,” in Design Automation, 1995. DAC ’95. 32nd Conference on, 1995, pp. 562–567.

[15] F. J. MacWilliams and N. J. A. Sloane, “The theory of error-correcting codes.” [Online]. Available: http://www2.research.att.com/~njas/doc/ms77.html

[16] “Recommended elliptic curves for federal government use.” [Online]. Available: http://csrc.nist.gov/groups/ST/toolkit/documents/dss/NISTReCur.pdf

[17] T. Itoh and S. Tsujii, “A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases,” Information and Computation, vol. 78, no. 3, pp. 171–177, 1988. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0890540188900247

[18] C. Wang, T. Truong, H. Shao, L. Deutsch, J. Omura, and I. Reed, “VLSI architectures for computing multiplications and inverses in GF(2^m),” Computers, IEEE Transactions on, vol. C-34, no. 8, pp. 709–717, Aug. 1985.

[19] A. Tungar, “Review paper: On comparative study of embedded system architectures for implementation of ECC,” in Internet, 2009. AH-ICI 2009. First Asian Himalayas International Conference on, Nov. 2009, pp. 1–3.

[20] S. Bakhtiari, A. Baraani, and M.-R. Khayyambashi, “MobiCash: A new anonymous mobile payment system implemented by elliptic curve cryptography,” in Computer Science and Information Engineering, 2009 WRI World Congress on, vol. 3, March 31–April 2, 2009, pp. 286–290.

[21] Certicom Research, “Standards for efficient cryptography. SEC 1: Elliptic curve cryptography,” Working draft, version 1.9, p. 78, August 2008.

[22] O. Ponomarev, A. Khurri, and A. Gurtov, “Elliptic curve cryptography (ECC) for host identity protocol (HIP),” in Networks (ICN), 2010 Ninth International Conference on, April 2010, pp. 215–219.

[23] “IEEE 1363.” [Online]. Available: http://grouper.ieee.org/groups/1363/

[24] “Information technology laboratory.” [Online]. Available: http://csrc.nist.gov/publications/fips/fips186-3/fips_186-3.pdf

[25] Z. Dyka and P. Langendoerfer, “Area efficient hardware implementation of elliptic curve cryptography by iteratively applying Karatsuba’s method,” in Design, Automation and Test in Europe, 2005. Proceedings, March 2005, pp. 70–75, Vol. 3.

[26] E.-H. Wajih, M. Mohsen, Z. Medien, and B. Belgacem, “Efficient hardware architecture of recursive Karatsuba-Ofman multiplier,” in Design and Technology of Integrated Systems in Nanoscale Era, 2008. DTIS 2008. 3rd International Conference on, March 2008, pp. 1–6.

[27] C.-L. Wu, D.-C. Lou, and T.-J. Chang, “An efficient Montgomery exponentiation algorithm for public-key cryptosystems,” in Intelligence and Security Informatics, 2008. ISI 2008. IEEE International Conference on, June 2008, pp. 284–285.

[28] N. Nedjah and L. de Macedo Mourelle, “Reconfigurable hardware implementation of Montgomery modular multiplication and parallel binary exponentiation,” in Digital System Design, 2002. Proceedings. Euromicro Symposium on, 2002, pp. 226–233.

[29] C. McIvor, M. McLoone, and J. McCanny, “FPGA Montgomery modular multiplication architectures suitable for ECCs over GF(p),” in Circuits and Systems, 2004. ISCAS ’04. Proceedings of the 2004 International Symposium on, vol. 3, May 2004, pp. III-509–III-512.

[30] ——, “Hardware elliptic curve cryptographic processor over GF(p),” Circuits and Systems I: Regular Papers, IEEE Transactions on, vol. 53, no. 9, pp. 1946–1957, Sept. 2006.

[31] F. Rodríguez-Henríquez, N. A. Saqib, A. Díaz-Pérez, and Ç. K. Koç, “Cryptographic algorithms on reconfigurable hardware (Signals and Communication Technology),” Secaucus, NJ, USA, 2006.

[32] O. Al-Khaleel, C. Papachristou, F. Wolff, and K. Pekmestzi, “An elliptic curve cryptosystem design based on FPGA pipeline folding,” in On-Line Testing Symposium, 2007. IOLTS 07. 13th IEEE International, July 2007, pp. 71–78.

[33] G. Sutter, J. Deschamps, and J. Imaña, “Efficient elliptic curve point multiplication using digit serial binary field operations,” Industrial Electronics, IEEE Transactions on, vol. PP, no. 99, p. 1, 2012.

[34] H. M. Choi, C. P. Hong, and C. H. Kim, “High performance elliptic curve cryptographic processor over GF(2^163),” in Electronic Design, Test and Applications, 2008. DELTA 2008. 4th IEEE International Symposium on, Jan. 2008, pp. 290–295.

[35] [Online]. Available: http://www.cplusplus.com/reference/clibrary/ctime/clock/

[36] C. Isci and M. Martonosi, “Runtime power monitoring in high-end processors: methodology and empirical data,” in Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on, Dec. 2003, pp. 93–104.

[37] D. Molka, D. Hackenberg, R. Schöne, and M. Müller, “Characterizing the energy consumption of data transfers and arithmetic operations on x86-64 processors,” in Green Computing Conference, 2010 International, Aug. 2010, pp. 123–133.

[38] “Multicore power, area, and timing.” [Online]. Available: http://www.hpl.hp.com/research/mcpat/

[39] “Measuring processor power: TDP vs. ACP.” [Online]. Available: http://www.intel.com/content/dam/doc/white-paper/resources-xeon-measuring-processor-power-paper.pdf

[40] “ACP - The truth about power consumption starts here.” [Online]. Available: http://www.amd.com/us/Documents/43761D-ACP_PowerConsumption.pdf

[41] “Introducing average CPU power - ACP.” [Online]. Available: http://images.dailytech.com/nimage/5925_large_amd_explains_acp.png

[42] [Online]. Available: http://developer.gnome.org/libgtop/stable/libgtop-GlibTop.html

[43] “The GNU multiple precision arithmetic library.” [Online]. Available: http://gmplib.org/

[44] C. Chen and Z. Qin, “Fast algorithm and hardware architecture for modular inversion in GF(p),” in Intelligent Networks and Intelligent Systems, 2009. ICINIS ’09. Second International Conference on, Nov. 2009, pp. 43–45.

[45] M. Khalil-Hani, A. Irwansyah, and Y. Hau, “A tightly coupled finite field arithmetic hardware in an FPGA-based embedded processor core for elliptic curve cryptography,” in Electronic Design, 2008. ICED 2008. International Conference on, Dec. 2008, pp. 1–6.

[46] M. Simka, J. Pelzl, T. Kleinjung, J. Franke, C. Priplata, C. Stahlke, M. Drutarovsky, V. Fischer, and C. Paar, “Hardware factorization based on elliptic curve method,” in Field-Programmable Custom Computing Machines, 2005. FCCM 2005. 13th Annual IEEE Symposium on, April 2005, pp. 107–116.

[47] N. Gura, S. C. Shantz, H. Eberle, S. Gupta, V. Gupta, D. Finchelstein, E. Goupy, and D. Stebila, “An end-to-end systems approach to elliptic curve cryptography,” in Cryptographic Hardware and Embedded Systems (CHES). Springer-Verlag, 2002, pp. 349–365.

[48] C. Shu, K. Gaj, and T. El-Ghazawi, “Low latency elliptic curve cryptography accelerators for NIST curves over binary fields,” in Field-Programmable Technology, 2005. Proceedings. 2005 IEEE International Conference on, Dec. 2005, pp. 309–310.

[49] W. Chelton and M. Benaissa, “High-speed pipelined ECC processor on FPGA,” in Signal Processing Systems Design and Implementation, 2006. SIPS ’06. IEEE Workshop on, Oct. 2006, pp. 136–141.

[50] ——, “Fast elliptic curve cryptography on FPGA,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 16, no. 2, pp. 198–205, Feb. 2008.

[51] Y. Zhang, D. Chen, Y. Choi, L. Chen, and S.-B. Ko, “A high performance ECC hardware implementation with instruction-level parallelism over GF(2^163),” Microprocess. Microsyst., vol. 34, pp. 228–236, October 2010. [Online]. Available: http://dx.doi.org/10.1016/j.micpro.2010.04.006

[52] B. Ansari and M. Hasan, “High-performance architecture of elliptic curve scalar multiplication,” Computers, IEEE Transactions on, vol. 57, no. 11, pp. 1443–1453, Nov. 2008.
