
IMPROVED CRYPTOGRAPHIC PROCESSOR DESIGNS FOR

SECURITY IN RFID AND OTHER UBIQUITOUS SYSTEMS

LAWRENCE LEINWEBER

Submitted in partial fulfillment of the requirements

For the degree of

Doctor of Philosophy: Engineering

Major Field: Computer Engineering

Dissertation Adviser: Dr. Christos Papachristou

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

May, 2009

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Lawrence Leinweber, candidate for the Ph.D. in Computer Engineering degree *.

Dr. Christos Papachristou (signed)______(chair of the committee)

Dr. Francis L. Merat ______

Dr. Swarup Bhunia ______

Dr. Xinmiao Zhang ______

Dr. Francis G. Wolff ______

______

(date) ______4/1/09

*We also certify that written approval has been obtained for any proprietary material contained therein.

Table of Contents

I. Introduction

A. Motivation

B. Problem Definition

1. Secure Protocol

2. Cryptographic Processors

3. Power Management

C. Outline of the Dissertation

II. Background

A. Security

1. Product Identification

2. Privacy and Business Intelligence Risks

3. Consideration of Other Methods

4. Network Security

5. Security Through Cryptography

B. Elliptic Curve Cryptography

1. Galois Fields

2. Elliptic Curves

3. Coordinate Systems and Security

C. Cryptographic Algorithms and Architectures

1. Galois Field Operations

2. Elliptic Curve Operations

3. Cryptographic Processors

III. Processor Designs

A. Data Flows

B. Arithmetic Logic Unit (ALU)

C. Key Control Logic

D. Inversion Control Logic

E. High-Level Organization

F. Summary of Contributions

G. Comparison with Other Works

IV. Simulation Experiments

A. Test Setup

B. Results

C. Comparison with Other Works

V. Secure Protocol

A. Requirements of a Minimal Protocol

1. Minimum Cost Tags

2. Minimal Back-End Support

3. Concept of Ownership

4. Minimal Operations

5. Minimum Message Words and Encryptions

B. Description of a Minimal Protocol

1. Operations

2. Tag Memory Requirements

3. Lower-Layer Support

4. Infrastructure for Key Management

5. Other Capabilities

C. Evaluation of the Protocol

1. Benefits of the Protocol

2. Defenses Against Various Attacks

3. Drawbacks of the Protocol

4. Moore’s Law

5. Consumer Applications

VI. Power Management

A. Analog Front-End

B. Subthreshold Logic

C. Self-Timed Circuits

D. Power vs. Impedance

1. Motivation

2. Test Setup

3. Results

E. Recommendations

VII. Conclusion and Future Work

VIII. Bibliography

List of Tables

Table 1: Cryptographic Processor Programs

Table 2: Approximate Area Coefficients, 3 ≤ m ≤ 256, w ≤ 16

Table 3: Approximate Energy Coefficients, 3 ≤ m ≤ 256, w ≤ 16

Table 4: Exact Time and D Flip-Flop Coefficients

Table 5: Area for Proposed and Reference Processors

Table 6: Time for Proposed and Reference Processors

Table 7: Energy (µJ) for Proposed and Reference Processors

Table 8: Tag Memory Contents during Read and Change-Owner Operations

List of Figures

Figure 1: Lopez-Dahab Data Flows for the R6 Processor

Figure 2: Lopez-Dahab Data Flows for the R5 Processor

Figure 3: Arithmetic Logic Unit (ALU)

Figure 4: Dedicated Squarer, m = 11

Figure 5: XOR Gates per Degree for Dedicated Squarers

Figure 6: Digit-Serial Most-Significant-Digit-First Multiplier

Figure 7: Key Control Logic

Figure 8: Inverter Control Logic

Figure 9: Datapaths of the R6 Processor

Figure 10: Datapaths of the R5 Processor

Figure 11: High Level Organization

Figure 12: Registers and Multiplies for Proposed and Reference Processors

Figure 13: Processor Area (NAND Gates)

Figure 14: Processor and Divide Time (Cycles)

Figure 15: Processor Area × Time (Million Gates × Cycles)

Figure 16: 0.25 µm Processor Area (mm²)

Figure 17: 0.25 µm Processor Dynamic Energy (µJ)

Figure 18: 0.25 µm Processor Leakage Energy × Frequency (mW)

Figure 19: Area and Time for Proposed and Reference Processors

Figure 20: Initial State

Figure 21: After tag responds to read command; ID secret

Figure 22: After tag responds to read command; ID not secret

Figure 23: After owner sends change-owner command; ID not secret

Figure 24: Frequency vs. Power for 0.25 µm Chain of Four Inverters

Figure 25: Frequency vs. Power for 0.25 µm R6 Elliptic Curve Processor

Figure 26: 0.25 µm Chain of Four Inverters

Figure 27: 0.25 µm R6 Elliptic Curve Processor, m = 11, w = 1

Figure 28: 0.25 µm Ring Oscillators with Frequency Dividers

Figure 29: Mirrored Pair of Dickson Voltage Multipliers

Improved Cryptographic Processor Designs for

Security in RFID and Other Ubiquitous Systems

Abstract

LAWRENCE LEINWEBER

In order to provide security in ubiquitous, passively powered systems, especially RFID

tags in the supply chain, improved asymmetric key cryptographic processors are

presented, tested and compared with others from the literature. The proposed processors

show a 12%-20% area and a 31%-45% time improvement. A secure protocol is also

presented to minimize cryptographic effort and communication between tag and reader.

A set of power management techniques is also presented to match processor performance to available power, resulting in greater range and responsiveness of RFID tags.

I. Introduction

A. Motivation

The motivation of this research was to improve cryptographic processors for low-power systems such as radio frequency identification (RFID). Interest in this problem is strong because security becomes more important as these processors become ubiquitous. The goal was to survey what was already available for this application in terms of security protocols, cryptography and power management, and to find ways to make improvements. As it turned out, a great deal of information was available, especially about cryptography, because lightweight cryptography was already used in smartcards.

There is a gap, though, between what has been achieved for RFID and what is possible.

The direction in the near future will be guided by research done now to find ways to reduce the computational effort to provide security on RFID tags. Tags are unusual because they must be very inexpensive and because they have no power source of their own.

There were three approaches to the problem of improving RFID security. The simplest was to find a secure protocol to communicate with a tag. The security flaw in tags is that they are promiscuous: they communicate with anyone, and this can lead to problems for a person carrying tags. The second approach was to find the cheapest possible asymmetric key cryptography to run on a tag. This is a very important problem, so there is a lot of background material and a lot of effort is required to make improvements. The third approach was to find ways to manage the available power in order to get the greatest computational effect in a tag. Unlike ordinary computer systems, tags get their energy from an unreliable source.

B. Problem Definition

RFID tags are the next generation of barcodes for products in the supply chain. Unlike barcodes, which are read with light, RFID tags are read with radio waves, so they do not need to be oriented to face the reader. RFID and other ubiquitous systems are very small computers that are inexpensive, have relatively little computational power and typically get their power coupled through an antenna. The basic security problem is that because tags are readable by radio signals, they can be tracked, so if a person is carrying tags, his movements can be monitored. The goal is to use the tag’s computational capability to solve this problem.

The problem has been attacked on three fronts and progress has been made in all three.

1. Secure Protocol

A number of techniques have been developed by others to provide some security against unauthorized tag reading. None of these techniques provide the level of security needed to protect indefinitely against a determined intruder without either destroying the tag or requiring a great deal of data processing infrastructure.

With the advent of asymmetric key cryptography on RFID tags, a protocol is needed to securely communicate tag identification only to authorized parties. Since resources, especially power, are scarce on a tag, a minimal protocol is needed. Such a protocol has been developed and proven to be minimal in the number of cryptographic operations and message words communicated between reader and tag.

2. Cryptographic Processors

A great deal of research has been carried out by others to find the minimal asymmetric key cryptographic system. This has applications beyond RFID, of course. While many components have already been developed, it remains to assemble these in the most efficient way possible to produce a processor with minimal area and execution time.

The focus of this research was integrating components for a small, fast asymmetric key cryptographic processor. The result of this research was a processor that was 12%-20% smaller and 31%-45% faster than those previously presented in the literature. This is a very competitive field and some of the techniques used in this research have already been discovered independently by others.

3. Power Management

Efficient security systems on RFID tags depend on scarce power. Inexpensive tags have no power source except what is coupled from the antenna. The amount of power available may fluctuate greatly during a cryptographic operation. Many techniques have been researched by others to deliver as much power as possible from antenna to processor, but the best techniques need to be assembled in a comprehensive approach for power management.

This research includes a set of techniques and improvements to match processor

performance to available power, which results in greater range and responsiveness of

RFID tags.

C. Outline of the Dissertation

Since improved cryptographic processors are the most important topic, this dissertation is

organized around it, with the others presented in later chapters.

After this brief introduction, this dissertation proceeds with chapter II, the background

material on security, cryptography, algorithms and architectures for existing

cryptographic processors. Next, two versions of an improved cryptographic processor are

described in detail in chapter III. Then in chapter IV, those processors are tested and

compared with others from the literature. Then developing from the background on

security, the discussion turns to the secure protocol in chapter V. Finally in chapter VI,

power management is discussed before closing with a conclusion and bibliography.

II. Background

Extensive research in security for Radio Frequency Identification (RFID) has already been carried out by others. The study of security and cryptography has been important for millennia, but its application to ubiquitous systems, such as RFID, is relatively recent.

Still, the study of security for RFID was preceded by the study of security for smartcards, such as credit cards with a memory component linked electrically to a reader.

In this chapter, the background on security and cryptography for RFID is discussed to

provide the context for the cryptographic processor designs to come in following

chapters. The discussion begins with the general security problems for RFID tags in the

consumer environment, and then establishes the need for a secure protocol and a cryptographic solution. The chapter continues with the mathematical background of

elliptic curve cryptography for small systems and concludes with a summary from the

literature of recent work in computer architecture for cryptographic processors suitable

for RFID tags.

A. Security

1. Product Identification

In the 1970s, barcode technology was developed to make product identification more efficient [1]. One of the most important code systems was the Universal Product Code (UPC). Today barcodes based on UPC adorn the labels on almost all retail products in the developed world.

RFID is a technology that has evolved with falling semiconductor prices and power requirements. Low-cost RFID tags now include modest computing capability using power coupled by antenna from the reader. When tags reach a sufficiently low price, perhaps $0.05, RFID systems will become a viable replacement for barcodes.

In the past decade, important contributions to the application of product identification

with RFID systems were made at the Auto-ID Lab at Massachusetts Institute of

Technology, including the development of the Electronic Product Code (EPC). In 2003,

the Lab was spun off into the industry organization EPCglobal.

Optically read barcodes require a line of sight because light cannot pass through most packaging. The moniker “RFID” designates systems that operate over a wide range of frequencies, from MHz to GHz [2]. The lower frequencies are less directional and have a greater ability to pass through packaging material. Higher frequency RFID systems operate on the principles of radar, using reflected waves. Optically read barcode systems operate at the frequencies of visible light, ~$10^{15}$ Hz. A typical low-cost RFID system operates at ~$10^{7}$ Hz, which is far less directional and more easily passes through packaging material.

Because barcode systems require a line of sight, they require more handling. Items must

be separated and turned to find the barcode and put it in view of the reading device. In

point-of-sale applications, this is manual labor. RFID systems promise to avoid much of

this although the seller will still need evidence that items are properly tagged. If items can

be easily identified at point of sale, they can be identified as easily all the way up the

supply chain. Shipments transferred between manufacturer, retailer, and reseller can be

verified easily and accurately.

As the cost of semiconductor fabrication continues to fall, RFID systems will overtake and replace barcodes as the principal method of product identification.

2. Privacy and Business Intelligence Risks

Unfortunately, RFID introduces privacy risks for consumers and intelligence risks for

businesses [3]. Simple RFID tags can be read from anywhere within the interrogation

range of a reader. The consumer risks divulgence of personal information such as the

products, clothing, books and videos he or she is carrying. The consumer also risks being

tracked. Businesses risk divulgence of quantities and suppliers of materials purchased and

customers of products sold. All risk targeting via RFID tags.

Targeting is the problem in which a thief can identify more valuable items and therefore devote greater effort to stealing them. For example, a thief might rob a home if an RFID tag on an expensive television can be read through an outside wall. Of course, the thief would have to know how to interpret the product code read from the tag.

Tracking is the problem in which a spy or stalker can monitor the movements of a person

from the products the person carries. In this case, the spy does not know or care about the

product code, but knows the tag will transmit the same code every time. The spy can

position a reader to indicate when a person is near the reader by identifying the tag or

constellation of tags the person usually carries.

These risks to consumers and businesses are inherent problems of RFID technology. The ability of tag and reader to communicate through materials, without a line of sight, is a double-edged sword. The benefits and risks of RFID cannot be easily separated; for example, limiting range to reduce risk also reduces the system’s effectiveness.

A technological solution to these privacy problems is beyond the current state of the art.

Present tags lack the computing capability to simply implement a command to disable the tag. This and other solutions are being developed but the cryptographic capabilities

necessary for robust security will require an increase of orders of magnitude in circuit

complexity.

Without a technological solution, society must solve these problems through public

policy. These solutions include guidelines by industry and government groups for RFID

system deployment and use of information.

Since the technology has entered the public arena in a flawed form, consumer confidence

has suffered. In 2003, at a Wal-Mart store in Broken Arrow, Oklahoma, customers were

viewed by hidden camera while they used Procter & Gamble products with RFID tags.

Press accounts of the ill-conceived market research study embarrassed both companies,

causing them to suspend RFID development.

Privacy advocates have formulated a bill of rights for consumers including rights of (1) notice of the presence of RFID tags, (2) removal or deactivation of tags at point of sale, (3) alternatives to tagged products without economic penalty, (4) notice of uses of RFID information, and (5) timely notice of tag reading.

These privacy problems will be resolved by legislation unless and until RFID technology

evolves to correct them.

3. Consideration of Other Methods

For the most part, RFID security today relies on the unsound principle of security by obscurity. Most of this obscurity is unintentional. There are a variety of tag frequencies, modulation and coding techniques for a relatively small number of tags (mere billions today). Gathering RFID information is impractical now, but this will change as tag technology becomes standardized and tags become more widespread.

RFID security has also depended on limited read range but this is based on unsound assumptions of radio frequency noise around a tag.

RFID systems have also relied on the physical difficulty of reverse engineering a tag but

with a sufficiently large effort, all tags of the same architecture can be compromised.

If radio waves can pass through some materials, they can also be blocked and interfered

with. The simplest way to prevent RFID tags from communicating is to enclose them in a

Faraday cage such as a foil envelope or metalized tape. This low-tech solution can be

impractical for a tag embedded in clothing and other materials or when the location of the

tag is unknown.

A novel solution is the blocker tag, designed to transmit an interfering signal especially to

confound the singulation process, in which a reader isolates one of many tags in its

interrogation zone [4]. But if a reader does not follow the singulation protocol, the

blocker tag’s strategy may be defeated. Also the blocker needs to be designed to work

with the same frequency and modulation method of the tag to be blocked.

Both the Faraday cage and blocker tags have the weakness of being negative solutions. There is no evidence that they are working unless the tag is brought within interrogation range of a suitable reader and the reader fails to read the tag, which still does not prove that another reader in another orientation would not read the tag.

The processing capabilities of RFID tags can be exploited to implement a simple command which effectively destroys the tag [5]. Although it can be difficult to prove that a tag has been disabled, a tag can send confirmation to the issuer of the kill command.

The kill command requires additional processing capability in the tag to implement password protection against unauthorized use of the command that effectively destroys the tag. The kill command password must be a carefully guarded secret, though typically the same password is embedded in many tags. Killing is a one-shot operation, at point of sale. It offers no security against business intelligence risks upstream in the supply chain, and no RFID benefit at all after the sale.

More sophisticated software solutions include novel methods of implementing password protection on tags with very little computing power. One scheme uses a set of pairs of pseudonyms (IDs) and keys (passwords) [6].

Hash-locking is a scheme that uses a one-way hash function to produce a metaID that obscures the tag’s real ID while providing an index to find the tag’s ID in a database [7].

The tag challenges with the metaID and the reader must respond with the ID to unlock the tag. The ID acts as a password. The metaIDs are vulnerable to tracking, unless they are randomized, which defeats their usefulness as database indices.
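The hash-locking exchange described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-256 stands in for the unspecified one-way hash, and the `Tag` class, the `database` dictionary and the ID `EPC-0001` are hypothetical names introduced for the example.

```python
import hashlib


def meta_id(tag_id: bytes) -> bytes:
    """One-way hash of the real ID (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256(tag_id).digest()


class Tag:
    """Hypothetical hash-locked tag: it reveals only its metaID while locked."""

    def __init__(self, tag_id: bytes):
        self._id = tag_id
        self.locked = True

    def challenge(self) -> bytes:
        # The tag challenges the reader with its metaID.
        return meta_id(self._id)

    def unlock(self, candidate_id: bytes) -> bool:
        # The real ID acts as the password.
        if meta_id(candidate_id) == meta_id(self._id):
            self.locked = False
        return not self.locked


# Back-end database indexed by metaID (hypothetical contents).
database = {meta_id(b"EPC-0001"): b"EPC-0001"}

tag = Tag(b"EPC-0001")
first = tag.challenge()
second = tag.challenge()
real_id = database[first]        # the metaID serves as the database index
unlocked = tag.unlock(real_id)   # the reader answers with the real ID
```

Because `first` and `second` are identical, an eavesdropper can track the tag by its metaID even without learning the real ID, which is the weakness noted above.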

Protocols based on pseudonyms and hash-locking depend on shared secrets between tags

and databases. Information about every tag must be maintained indefinitely. In order to

prevent tracking, exhaustive searches of tag records are required, so these schemes do not

scale well. The burden cannot be alleviated by delegation unless the delegate is given a copy of the tag secrets [8].

The identity of a tag can be held in a central database rather than the tag itself. If a tag responds only with encrypted, nonced messages, its identity will remain hidden from the reader. Unlike a low-power, isolated tag, the central database enjoys plenty of power, storage, and network bandwidth and so can detect and refuse communication with unauthorized (rogue) readers. In effect, the problem of privacy can be delegated to the central database if the tag has the minimum capabilities of cryptography and random number generation.

A central database, however, has some drawbacks: (1) there is a communication bottleneck to the database, with some substantial fraction of the world’s tags contacting it; (2) the database is a potential single point of failure, so needs to be redundant, distributed and secure; and (3) a central database is the antithesis of consumer privacy, replacing fear of strangers reading tags with fear of Big Brother reading tags.

A nonce is a random number that is encrypted into a message to prevent tracking and replay attacks. If a tag generates the same message every time it is queried, it is easily tracked. Even if the tag generates a sequence of messages, a malevolent reader could query the tag repeatedly, so the sequence needs to be long. A tag needs the ability to generate a series of random bits that can be formed into a random number so that it is not practical to read out all nonced messages.

Nonces are also needed to ensure that communication is fresh and not a replay attack. A tag or reader generates such a nonce and insists that the other use it in encrypted form in subsequent messages to demonstrate the capability to include it in an encrypted message.

An encrypted nonce can be used as the basis of a session key.
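The freshness check can be illustrated with a short Python sketch. It rests on assumptions that are not in the text: a keyed hash (HMAC-SHA-256) over the nonce stands in for "using the nonce in encrypted form", and the shared key, the `Tag` class and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets

SHARED_KEY = b"demo-key"  # hypothetical secret shared by tag and reader


def respond(key: bytes, nonce: bytes) -> bytes:
    """Prove knowledge of the key by binding it to the nonce (assumed HMAC stand-in)."""
    return hmac.new(key, nonce, hashlib.sha256).digest()


class Tag:
    def __init__(self, key: bytes):
        self._key = key
        self._nonce = None

    def new_challenge(self) -> bytes:
        # A fresh random nonce for every exchange.
        self._nonce = secrets.token_bytes(16)
        return self._nonce

    def accept(self, response: bytes) -> bool:
        if self._nonce is None:
            return False
        expected = respond(self._key, self._nonce)
        self._nonce = None  # each nonce is accepted at most once
        return hmac.compare_digest(expected, response)


tag = Tag(SHARED_KEY)
nonce_1 = tag.new_challenge()
recorded = respond(SHARED_KEY, nonce_1)  # legitimate reply, recorded by an intruder
fresh_ok = tag.accept(recorded)

tag.new_challenge()                      # a later session uses a different nonce
replay_ok = tag.accept(recorded)         # the recorded reply no longer matches
```

The recorded reply is accepted only in the session whose nonce it contains; replaying it against a later challenge fails.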

A certificate is a message signed by a trusted party or certificate authority (CA). A certificate contains a plaintext and a signature which is the plaintext (or one-way hash of the plaintext) encrypted backwards, using the CA’s private key. Anyone can decrypt this message using the CA’s public key, proving the CA endorsed the plaintext message, which might say that the public key for a particular reader is trustworthy. Without trusting the reader in advance and without the means to communicate with any other reader, a tag receiving such a certificate would know that the reader had gotten approval from the CA.

A certificate establishes trust but cannot revoke it. If a reader is stolen and operated by an unauthorized party, the reader will still have the certificate from the CA. In a typical network, a node could get a list of revoked certificates or periodically contact the CA via a different network route; however, because a tag is an isolated node, these alternatives may be unavailable or the tag may be spoofed.

In order to make certificates revocable, they could be time limited if tags have their own time sources, which would preclude passive tags. Measuring time with any accuracy requires a crystal oscillator, which consumes too much power (~1 µW). Quartz crystals may also be too brittle for product tags or require too much space to protect them from physical shock.

4. Network Security

The problem of privacy in RFID is similar to that of a laptop computer in a Wi-Fi

network, except that the resources available to an RFID tag are orders of magnitude

smaller. A battery-less (passive) tag has energy coupled to it from a reader, but the tag is

an independent digital circuit. Assuming the tag has sufficient capability to be called a

processor, the tag is an independent processor. When a tag is interrogated by a reader, the reader and tag share a communication link. A reader typically is linked to a local area

network (LAN), though not necessarily contemporaneously with the link to the tag.

Therefore, a tag fits the model of a network node, so ideas of network security directly apply [9]. But to be economical for the application of product identification, RFID tags must be network nodes of extremely low computational power.

A second disadvantage of product identification tags, compared to larger networked systems, is that tags are isolated network nodes. A tag’s access to the outside world is usually at the pleasure of a single reader. Moreover, there is no human operator who might balk at a suspicious contact from a reader. A tag without an operator is expected to

communicate with readers that the tag cannot independently authenticate. A tag is,

therefore, vulnerable to spoofing and replay attacks.

Spoofing is the problem in which an impostor creates a simulated environment in order to gain the victim’s confidence so the victim reveals a secret. To protect the privacy of the holder of an RFID tag, the tag must not reveal its ID to an untrusted party, an unauthenticated reader. But the tag is dependent on the reader for all the tag’s communication. A reader can spoof a tag if the reader can send legitimate looking messages.

A replay attack is a simple trick in which a message from a trusted party is recorded by an intruder and later played back to mimic the trusted party. A tag password, for example, might be replayed by a reader to defeat tag security. Readers are also vulnerable to replay attacks from tags. For example, one tag could pretend to be many by transmitting a series of IDs.

5. Security Through Cryptography

Modern computer security is based on cryptography. A plaintext message along with an encryption key are input to an encryption algorithm that produces a ciphertext, which is sent on the channel to the rightful receiver and possibly to an eavesdropper, who also has the decryption algorithm. Kerckhoffs's principle states that security should come from the secrecy of the keys, not the secrecy of the algorithm, because keys are much easier to replace.

Simpler encryption algorithms use symmetric keys, where the encryption and decryption keys are related; however, in an asymmetric key cryptographic system, a decryption key cannot be deduced from an encryption key. An asymmetric key system is like giving open lock-boxes to your friends. They can put secret messages in the boxes and lock them (encryption) and send them to you. Only you can open the boxes using your key (decryption). Your enemies cannot open the boxes. In fact, your friends cannot open them either. The encryption key is public and openly disseminated. The decryption key is private and never disseminated.

Rivest-Shamir-Adleman (RSA) cryptography is the original practical algorithm for asymmetric key cryptography [10]. It is based on multiplication and specifically on the difficulty of factoring the product of two large prime numbers. RSA is the de facto standard asymmetric system for internet security. For example, it is the basis of the handshaking stage of the secure socket layer (SSL) protocol. Once two parties share a secret, however, symmetric key systems are more efficient.

Elliptic curve cryptography (ECC) is an asymmetric key system based on elliptic curves

in finite fields. The mathematical operations for ECC are simpler than integer

multiplication. Consequently ECC implementations are more efficient than RSA in terms

of key length and circuit area [11]. Both of these systems depend on the difficulty of

certain mathematical operations that have been studied for centuries. There is no

mathematical proof that there is no shortcut solution to either of these problems.

The RSA system encrypts a plaintext, $m$, with an encryption exponent, $e$, to produce a ciphertext, $c$, which in turn can be decrypted with a decryption exponent, $d$, so $c = m^e$ and $m = c^d$. But $d$ cannot be deduced from $e$. So $e$ is disseminated and $d$ is used to decode messages that are received. In ECC, a shared secret can be established by a combination of sender and receiver keys using Diffie-Hellman key exchange. The parties select exponents, $a$ and $b$, which are applied to a publicly known generator, $g$. The parties disseminate $g^a$ and $g^b$, respectively. Upon receipt, each applies his own exponent, resulting in the shared secret, $g^{ab}$. Typically, an ECC message needs two encryptions to pass one secret. In this system, $a$ and $b$ cannot be disseminated because $a^{-1}$ and $b^{-1}$ can be deduced from each of them, respectively. Although RSA is more efficient in terms of the number of cryptographic operations, the resources required for each operation are much larger and therefore ECC is more economical overall. So herein the discussion proceeds with the assumption that an asymmetric cryptographic system is required but decryption exponents can be deduced from encryption exponents.
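As a minimal numeric sketch of the two schemes, the following Python fragment uses textbook-sized numbers that are far too small for real security. The specific primes, exponents and variable names are illustrative assumptions. The Diffie-Hellman part works with integers modulo a prime rather than with elliptic curve points, and the RSA arithmetic is done modulo a public modulus `n`, which the prose leaves implicit.

```python
# Toy RSA (illustrative values only).
p, q = 61, 53                  # secret primes
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)
e = 17                         # public encryption exponent
# Computing d needs phi, i.e. the factors of n; e and n alone do not suffice.
d = pow(e, -1, phi)            # private decryption exponent (Python 3.8+)

m = 65                         # plaintext
c = pow(m, e, n)               # c = m^e (mod n)
recovered = pow(c, d, n)       # m = c^d (mod n)

# Toy Diffie-Hellman key exchange modulo a small prime.
prime, g = 23, 5               # public modulus and generator
a, b = 4, 3                    # private exponents of the two parties
g_a = pow(g, a, prime)         # disseminated by the first party
g_b = pow(g, b, prime)         # disseminated by the second party
secret_1 = pow(g_b, a, prime)  # first party applies its own exponent
secret_2 = pow(g_a, b, prime)  # second party applies its own exponent
```

Both parties arrive at the same value $g^{ab}$ without ever transmitting $a$ or $b$.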

The difficulty, or computational complexity, of decoding encrypted messages without benefit of the decryption key is due to the existence of an efficient squaring operation in these cryptographic systems. For example, to get the tenth code in the sequence, $x^n$, given $x$, requires three squaring operations (which yield $x^2$, $x^4$, and $x^8$) and one multiply, $x^8 x^2$. Without knowledge that the tenth code was used, a cryptanalyst would have to try $x^1, x^2, \ldots, x^{10}$, requiring nine multiplies. The difference is more dramatic with longer keys. The strength of these cryptographic systems is that they are exponentially harder to break than they are to use as intended. The underlying hard problem is known as the discrete logarithm problem.
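The operation counts in the example above can be checked with a short sketch. It uses ordinary integer multiplication as a stand-in for the group operation; the function names are hypothetical and `power` assumes an exponent of at least 1.

```python
import operator


def power(x, n, mul=operator.mul):
    """Square-and-multiply for n >= 1; returns (x**n, squarings, multiplies)."""
    result = None
    square = x
    squarings = multiplies = 0
    while n:
        if n & 1:
            if result is None:
                result = square          # first factor costs no multiply
            else:
                result = mul(result, square)
                multiplies += 1
        n >>= 1
        if n:
            square = mul(square, square)
            squarings += 1
    return result, squarings, multiplies


def brute_force_multiplies(x, target, mul=operator.mul):
    """Count the multiplies needed to reach target by trying x^1, x^2, ... in turn."""
    current, count = x, 0
    while current != target:
        current = mul(current, x)
        count += 1
    return count


value, squarings, multiplies = power(3, 10)
search_cost = brute_force_multiplies(3, value)
```

For the exponent 10 the legitimate user spends three squarings and one multiply, while the exhaustive search spends nine multiplies; the gap widens exponentially with the bit length of the exponent.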

Hereinafter, cryptographic operations will be written in the customary ECC notation, in which encryption is represented in the notation of multiplication, rather than RSA notation, in which exponentiation is the natural choice. The ECC notation represents point multiplication and has the multiplier, in lower-case, to the left of the generator, in upper-case: $a \cdot G$, $b \cdot G$ and $a \cdot b \cdot G$, as this is less cumbersome. For example, the weakness of ECC, that $a^{-1}$ can be deduced from $a$, implies that $x \cdot G$ can be deduced from $a \cdot x \cdot G$ and $a$ because $x \cdot G = a^{-1} \cdot a \cdot x \cdot G$.
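The identity $x \cdot G = a^{-1} \cdot a \cdot x \cdot G$ can be tried out numerically. The sketch below does not implement an elliptic curve; as a labeled stand-in it uses the subgroup of order 11 generated by 2 modulo the prime 23, so that `point_mul(k, P)` plays the role of $k \cdot P$. All constants and names are illustrative assumptions.

```python
# Stand-in group: powers of 2 modulo 23, a subgroup of prime order 11.
P = 23       # toy field prime
ORDER = 11   # order of the generator
G = 2        # generator of the subgroup


def point_mul(k: int, point: int) -> int:
    """Analogue of k.Point in the additive ECC notation."""
    return pow(point, k % ORDER, P)


a, x = 7, 5
axG = point_mul(a, point_mul(x, G))   # a.x.G, the value seen on the channel
a_inv = pow(a, -1, ORDER)             # a^-1, deduced from a (Python 3.8+)
recovered = point_mul(a_inv, axG)     # a^-1.a.x.G
xG = point_mul(x, G)                  # x.G, computed directly for comparison
```

Anyone who knows $a$ can therefore strip it from $a \cdot x \cdot G$, which is why the multipliers themselves must stay private.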

B. Elliptic Curve Cryptography

1. Galois Fields

Galois field mathematics is written in a notation and is carried out with operations

distinct from ordinary math and elliptic curve math.

A Galois, or finite, field is a finite set and two operations, + and ×, each of which forms an abelian group over the set, except that there is no x such that 0 × x = 1, if 0 and 1 are the + and × identities. Also, the distributive law of × over + holds [12]. The extension field, GF($p^m$), can be represented as the polynomials of degree at most m − 1, a(x), where each coefficient, $a_i \in \mathbb{Z}_p$, and the operations + and × are performed modulo a reduction polynomial, f(x), and modulo p for each $a_i$.

Of particular interest are fields of characteristic two, p = 2, in which an element, $a = \{a_i \in \{0, 1\},\ 0 \le i < m\}$, is a bit vector. Polynomial addition is addition of the coefficients of similar degree:

$$a(x) + b(x) = \sum_{i=0}^{m-1} (a_i + b_i)\, x^i$$

In characteristic two, this is equivalent to bit-wise addition with no carries, which can be implemented as bit-wise exclusive disjunction (XOR). This efficiency is the motivation for this approach. Subtraction is the same as addition because $1 \equiv -1 \pmod 2$.

Multiplication also benefits from the absence of carries, but requires reduction.

Any irreducible polynomial of degree m can be used as the reduction polynomial. The

most efficient are sparse polynomials, which have only a few terms. These are trinomials

or pentanomials for practical cases, m ≤ 500 [13].

Polynomial multiplication requires multiplying terms and summing results of similar degree:

$$a(x) \cdot b(x) = \sum_{k=0}^{2m-2} \Bigl( \sum_{i=\max(0,\,k-m+1)}^{\min(k,\,m-1)} a_i\, b_{k-i} \Bigr) x^k$$

In characteristic 2, the product of coefficients, $a_i b_{k-i}$, can be implemented with bit-wise conjunction (AND). The inner sum can be implemented with bit-wise exclusive disjunction (XOR). The degree 2m − 2 result must be reduced.
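As a minimal sketch of these operations, field elements can be held as Python integers whose bit i is the coefficient $a_i$. The field GF($2^{11}$) with f(x) = $x^{11} + x^2 + 1$ is chosen here only because it is small; the function names are illustrative assumptions. Multiplication and reduction are done one after the other, so the unreduced product of degree up to 2m − 2 is visible.

```python
M = 11
F = (1 << 11) | (1 << 2) | 1            # f(x) = x^11 + x^2 + 1


def gf_add(a, b):
    """Addition (and subtraction) of coefficients modulo 2 is a bit-wise XOR."""
    return a ^ b


def poly_mul(a, b):
    """Unreduced polynomial product: AND selects partial products, XOR sums them."""
    product = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            product ^= a << i
    return product


def poly_reduce(c, f=F, m=M):
    """Reduce c modulo f by cancelling each term of degree m or more, highest first."""
    for k in range(c.bit_length() - 1, m - 1, -1):
        if (c >> k) & 1:
            c ^= f << (k - m)
    return c


def gf_mul(a, b):
    """Field multiplication: full product followed by reduction."""
    return poly_reduce(poly_mul(a, b))
```

For example, multiplying $x^{10}$ by x gives $x^{11}$, which reduces to $x^2 + 1$.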

Squaring, a special case of multiplication, is particularly efficient in GF($2^m$) because every product, $a_i a_j x^{i+j}$, can be paired with $a_j a_i x^{j+i}$, if $i \ne j$. For p = 2, these pairs sum to 0 mod 2. So the square consists only of terms where $i = j$, that is, the degree $2i$ is even and the products are $a_i a_i = a_i$:

$$a(x)^2 = \sum_{i=0}^{m-1} a_i\, x^{2i}$$

This serves only to spread out the coefficients over an even, degree 2m − 2 polynomial, which must then be reduced.
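A short sketch can confirm that spreading the coefficients and reducing gives the same result as a general multiply. The field GF($2^{11}$) with f(x) = $x^{11} + x^2 + 1$ and all function names are assumptions made for illustration.

```python
M = 11
F = (1 << 11) | (1 << 2) | 1            # f(x) = x^11 + x^2 + 1


def spread_bits(a):
    """Move coefficient a_i to position 2i, leaving zeros at odd positions."""
    s = 0
    for i in range(a.bit_length()):
        if (a >> i) & 1:
            s |= 1 << (2 * i)
    return s


def poly_reduce(c, f=F, m=M):
    """Reduce c modulo f, cancelling the highest-degree terms first."""
    for k in range(c.bit_length() - 1, m - 1, -1):
        if (c >> k) & 1:
            c ^= f << (k - m)
    return c


def gf_square(a):
    """Squaring: spread the bits, then reduce. No AND array is needed."""
    return poly_reduce(spread_bits(a))


def gf_mul_reference(a, b):
    """General multiply (shift, reduce, conditionally add), used only for comparison."""
    p = 0
    for i in range(M - 1, -1, -1):
        p <<= 1
        if (p >> M) & 1:
            p ^= F
        if (b >> i) & 1:
            p ^= a
    return p
```

In hardware the spreading step is only wiring, so the cost of a dedicated squarer is the XOR gates of the reduction.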

Division is defined in terms of multiplication. Given elements a and b, and a reduction polynomial, f, the quotient, a ÷ b (mod f), is equivalent to $a \times b^{-1}$ (mod f), which requires finding $b^{-1}$ such that $b \times b^{-1} \equiv 1$ (mod f). There is no simple implementation of this. One approach is to use an extension of Euclid’s Algorithm for finding a greatest common factor. A second approach is to use exponentiation.

Exponentiation is defined as repeated multiplication. Of course, $a^1 = a$ and there is a distributive law, $a^{i+j} = a^i \cdot a^j$, where $i, j \in \mathbb{Z}$. Because GF($p^m$) − {0} is a multiplicative group and f is an irreducible polynomial, $a^{p^m - 1} \equiv 1 \pmod f$, where $a \ne 0$, which is the Galois field version of Fermat’s Little Theorem. Then $a^{-1} \equiv a^{p^m - 2} \pmod f$ and so the second approach to division is to obtain $a^{-1}$ by raising a to the power, $p^m - 2$. With p = 2, the power is $2^m - 2$.

2. Elliptic Curves

Elliptic curve mathematics is written in a notation and is carried out with operations

distinct from ordinary math and Galois field math.

Defined over the characteristic two Galois fields discussed in section 1, an elliptic curve is the graph of an equation of the form of the generalized Weierstrass equation [14]:

$$y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$$

If a1 = 0 (assuming a3 ≠ 0), the equation is supersingular and cryptographically weak. If a1 ≠ 0, and by change of variable, these curves isomorphically simplify to:

$$y^2 + x y = x^3 + a'_2 x^2 + a'_6$$

The parameters a′2 and a′6 are also known as a and b.

The Group Law defines a way of adding two points, P1 and P2, that satisfy the elliptic

curve equation, to produce a third point, P3, also on the curve. The Group Law defines

the additive inverse, −P = −(x, y) = (x, x + y). Graphically, P1, P2, and −P3 are collinear.

A point at infinity is also needed (and is the additive identity). When rational numbers are

maintained to postpone division, a denominator of zero indicates infinity.

To get P3 = P1 + P2, observe that the slope of the line through P1 and P2 is λ = (y1 + y2) / (x1 + x2), if x1 ≠ x2, or by differentiating the simplified equation, it is λ = x1 + y1 / x1, if x1 = x2. To get x3, note that the coefficient of $x^2$ is the sum of the roots of the elliptic, $\lambda^2 + \lambda + a'_2 = x_1 + x_2 + x_3$. For y3, note that λ is the slope of the line through P1 and –P3, therefore:

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a'_2$$

$$y_3 = \lambda (x_1 + x_3) + x_3 + y_1$$

When x1 ≠ x2, the above are known as the point addition formulas, P3 = P1 + P2, P1 ≠ P2. When x1 = x2, these cancel each other in the first formula and they simplify to the point doubling formulas, P3 = 2P1.
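The Group Law can be exercised on a toy curve as a minimal sketch. The field GF($2^4$) with f(x) = $x^4 + x + 1$, the curve parameters a′2 = a′6 = 1 and all names below are assumptions chosen so that every point can be enumerated; they are far too small for cryptographic use. The point at infinity is represented by `None`.

```python
M, F = 4, 0b10011                       # toy field GF(2^4), f(x) = x^4 + x + 1 (assumed)
A2, A6 = 1, 1                           # assumed curve parameters a'2 and a'6


def gf_mul(a, b):
    """Multiply in GF(2^M), most significant bit first with interleaved reduction."""
    p = 0
    for i in range(M - 1, -1, -1):
        p <<= 1
        if (p >> M) & 1:
            p ^= F
        if (b >> i) & 1:
            p ^= a
    return p


def gf_inv(a):
    """Inverse by exponentiation to the power 2^M - 2 (naive repeated multiply)."""
    result = 1
    for _ in range((1 << M) - 2):
        result = gf_mul(result, a)
    return result


def on_curve(point):
    """Check y^2 + xy = x^3 + a'2 x^2 + a'6."""
    x, y = point
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A2, x2) ^ A6


def negate(point):
    x, y = point
    return (x, x ^ y)


def add(p1, p2):
    """Affine point addition and doubling; None is the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 != x2:
        lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))      # slope of the chord
    elif y1 != y2 or x1 == 0:
        return None                                 # p2 = -p1
    else:
        lam = x1 ^ gf_mul(y1, gf_inv(x1))           # slope of the tangent
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A2
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


points = [(x, y) for x in range(1 << M) for y in range(1 << M) if on_curve((x, y))]
```

Every sum of two enumerated points is again on the curve or is the point at infinity, which is the closure property the Group Law promises.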

Repeated addition is known as scalar point multiplication, P = k·P1, $k \in \mathbb{Z}$. There is a distributive law, (i + j)·P1 = i·P1 + j·P1, where $i, j \in \mathbb{Z}$.

3. Coordinate Systems and Security

Getting λ requires Galois field division, which can be postponed by maintaining rational numbers. In elliptic curves, the projective coordinate system is used to represent a point $P = (x, y, z) \in \mathbb{Z}^3$, which is equivalent to a point in the affine coordinate system P = (x/z, y/z). Other coordinate systems include Jacobian, ($x/z^2$, $y/z^3$), and Lopez-Dahab, ($x/z$, $y/z^2$).

In order to provide privacy for the RFID tag’s owner, the tag must not reveal information

that is not secured cryptographically. In particular, there is no known polynomial time

method for recovering k given kP and P, though this, the elliptic curve discrete logarithm

problem (ECDLP), has been studied extensively [15].

A result in projective coordinates can be weak because it is not supported by the study of ECDLP. In the extreme case, an affine (x, y) could be represented as projective (k⋅x, k⋅y, k), revealing k. This and subtler cases can be corrected by multiplying the projective result by a random r, producing (r⋅x, r⋅y, r⋅z) to represent (x, y, z) [16]. This relieves the RFID tag of the need to perform division, but requires transmitting an additional r⋅z value to the reader. If the division is performed on the tag, this transmission can be avoided.

Side-channel attacks are attempts to discover the key based on observations of the cryptographic processor. A timing attack analyzes the execution time. A simple power attack (SPA) analyzes the power consumption as the algorithm is executed. A differential power attack (DPA) adds statistical analysis to SPA. The Montgomery ladder, discussed in section C2, is regarded as immune to timing attacks and SPA. To protect against DPA, it is recommended that a random r be generated and used in the conversion from affine (x, y) to projective (r⋅x, r⋅y, r) [17]. This would require an extra register in each of the architectures proposed in chapter III.

For privacy on an RFID tag, it is not necessary to allow a protocol in which an attacker

can choose the generator. As discussed in chapter V, if an owner had private key, k, and

public key, kG, a tag might answer a query for its ID with r⋅G and r⋅kG ⊕ ID, (where r is

a random number, point multiplication produces an affine x result, and ⊕ indicates

exclusive disjunction). Although an attacker would know G and kG, he could not choose them and with r changing with each execution, he would have nothing consistent to statistically analyze.

C. Cryptographic Algorithms and Architectures

1. Galois Field Operations

Galois field multiplication can be performed much as school children perform long-hand

multiplication, the schoolbook algorithm; however the Galois field version is complicated

by the need to perform reduction. There are several design issues. The operation can be

performed one bit at a time or several bits at once. The multiplication and reduction can

be done simultaneously, or one after the other. Finally, the operation can be performed

from right to left or from left to right [18].

In each cycle of the schoolbook algorithm, the first operand is multiplied by one digit of the second operand, resulting in a partial product. Ordinarily a “digit” refers to base ten, but customarily in this context, a digit is several bits. Hereinafter, w is the number of bits of the second operand that are used in one machine cycle. From the discussion in section B1, m is the degree of the polynomials in the Galois field, GF($2^m$), and m is also the width of most data registers and datapaths in these cryptographic processors. Thus an m-bit Galois field multiplier with w = 1 multiplies all m bits of the first operand by one bit of the second operand in each machine cycle and requires m cycles to complete; it is called bit-serial. A fully parallel multiplier has w = m and performs the operation in one cycle.

In general, a digit-serial multiplier requires $\lceil m/w \rceil$ cycles to complete. If reduction is performed simultaneously, no additional cycles are required.

If the multiplication and reduction are not done simultaneously, then the unreduced product must be stored temporarily. This requires a double length (2m-bit) register, although it can be shared with the second operand. As the partial products are accumulated, the product grows from m bits to 2m bits while the number of operand bits still needed falls from m to zero. But this analysis assumes that the operand needs to be copied into a register in the multiplier at the beginning of the operation. At best this approach is inflexible and so is only of academic interest. Simultaneous multiplication and reduction is more efficient because it can be performed in one m-bit product register.

The ordinary schoolbook algorithm is always performed from right to left, that is, from least significant to most significant digit because more significant partial products have no influence on the lowest order digits of the product. So as the sum of partial products is accumulated, not more than m + w bits need be summed. But if reduction, a modulus operation, is performed simultaneously, the combined operation can be performed most significant digit first while operating on only m + w bits. In each cycle the accumulated product is shifted up one digit and added to the partial product of the first operand and the next most significant digit of the second operand. The sum is reduced and stored. The benefit of this approach comes at the end because the result is a reduced product. If least-significant-digit-first multiplication is used, an additional reduction is required.

The best solution is to perform multiplication and reduction simultaneously, most significant digit first. There is a trade-off between delay and area in selecting w. A complete multiply requires a total of O($m^2$) bit multiply and add operations. In each of $\lceil m/w \rceil$ cycles, O(mw) bit multiply and add operations are required, implemented as mw two-input AND and XOR gates. So as w increases, delay decreases but area increases.
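The digit-serial, most-significant-digit-first scheme can be modeled in software as a minimal sketch, one loop iteration per machine cycle. The field GF($2^{11}$) with f(x) = $x^{11} + x^2 + 1$, the digit width w = 3 and the function name are assumptions for illustration; the model says nothing about gate-level area or timing.

```python
M = 11
F = (1 << 11) | (1 << 2) | 1            # f(x) = x^11 + x^2 + 1
W = 3                                   # assumed digit width in bits


def digit_serial_mul(a, b, m=M, w=W, f=F):
    """Return (a*b mod f, cycles), consuming w bits of b per cycle, high digit first."""
    cycles = -(-m // w)                 # ceiling of m / w
    mask = (1 << w) - 1
    p = 0                               # the single m-bit product register
    for d in range(cycles - 1, -1, -1):
        digit = (b >> (d * w)) & mask   # high digit is implicitly zero-padded
        acc = p << w                    # shift the accumulated product up one digit
        for j in range(w):              # add the partial product a * digit
            if (digit >> j) & 1:
                acc ^= a << j
        for k in range(m + w - 1, m - 1, -1):   # reduce the (m + w)-bit sum
            if (acc >> k) & 1:
                acc ^= f << (k - m)
        p = acc
    return p, cycles
```

Calling the function with w = 1 models the bit-serial multiplier (m cycles) and with w = m the fully parallel one (one cycle); the product is the same in every case.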

As noted in section B1, multiplying a number with itself is especially efficient in GF($2^m$). In fact, it requires only a small number of XOR gates to do the reduction. For this small cost in area, squaring can be carried out in one cycle, so its computational cost is comparable to Galois field addition. Of course, this area could be saved by omitting the dedicated squarer. The multiplier could be used to perform squaring, but would require many machine cycles and significantly increase total energy for elliptic curve operations.

Galois field division is inefficient, so it is best to maintain a rational number during

intermediate calculations and perform the division as late as possible. There are two

approaches, the Yan algorithm and the Itoh-Tsujii algorithm. Division is implemented as

an algorithm whereas the other operations are implemented more directly as architecture.

The Yan algorithm [19] is based on the Extended Euclidean Algorithm, which is based on Euclid’s algorithm for finding the greatest common divisor of two numbers. The observation is that if two numbers have a common divisor, it will appear in the remainder after one number is divided by the other. This can be done repeatedly, dividing the larger of the two by the smaller and keeping the smaller and the remainder, until the remainder is zero. The last divisor is the greatest common divisor. If the last divisor is one, the numbers are relatively prime. The Extended Euclidean Algorithm keeps track of the quotients used during the algorithm so that in the end it can report not only the greatest common divisor, but also a linear combination of the original numbers that will produce the greatest common divisor.

Given two numbers, a and b, the algorithm maintains the following relationships:

$$r_1 = s_1 a + t_1 b$$

$$r_2 = s_2 a + t_2 b$$

Initially, $s_1 = t_2 = 1$ and $s_2 = t_1 = 0$, therefore $r_1 = a$ and $r_2 = b$. Performing Euclid’s Algorithm on $r_1$ and $r_2$, with, for example, $r_1 > r_2$, let $q = \lfloor r_1 / r_2 \rfloor$. The algorithm requires subtracting $q r_2$ from $r_1$. So to maintain the relationships, this requires also subtracting $q s_2$ from $s_1$ and subtracting $q t_2$ from $t_1$. If instead $r_1 < r_2$, swap the two equations first, then perform the subtractions. Eventually after a number of iterations, $r_2 = 0$, leaving the results:

$$r_1 = \gcd(a, b) = s_1 a + t_1 b$$

The important observation for Galois field division is to let b be the irreducible polynomial, so that by definition $\gcd(a, b) = 1$ for all nonzero a. And since $b \equiv 0 \pmod b$, at the end of the algorithm, the first relation can be simplified to $s_1 a \equiv 1 \pmod b$. Therefore $s_1$ is the multiplicative inverse of a.

If initially $s_1 = c$, then the relation is $s_1 a / c \equiv 1 \pmod b$ at the end of the algorithm and therefore $s_1 \equiv c / a \pmod b$. So this gives a procedure to divide, $c / a$, maintaining four values, $r_1$, $r_2$, $s_1$, and $s_2$. But $t_1$ and $t_2$ are not needed and so can be omitted from the algorithm. Yan’s algorithm takes advantage of the fact that b is the reduction polynomial for the field, so the multiplications are not needed for each step but bit shifts are sufficient and 2m − 1 steps are required.
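A minimal sketch of division by the Extended Euclidean Algorithm follows. It is the plain algorithm with full polynomial quotients, not Yan’s shift-based variant, and it keeps only two remainders and two multipliers, each pair satisfying remainder ≡ multiplier · a / c (mod f). The field GF($2^{11}$) with f(x) = $x^{11} + x^2 + 1$ and all names are assumptions for illustration.

```python
M = 11
F = (1 << 11) | (1 << 2) | 1            # f(x) = x^11 + x^2 + 1


def poly_mul(a, b):
    """Unreduced product of two polynomials over GF(2)."""
    product = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            product ^= a << i
    return product


def poly_divmod(n, d):
    """Quotient and remainder of polynomial division over GF(2); d must be nonzero."""
    q = 0
    while n.bit_length() >= d.bit_length():
        shift = n.bit_length() - d.bit_length()
        q ^= 1 << shift
        n ^= d << shift
    return q, n


def gf_div(c, a, f=F):
    """Return c / a in the field, for a reduced and nonzero (0 < a < 2^M)."""
    if a == 0:
        raise ZeroDivisionError("division by zero in GF(2^m)")
    r1, s1 = f, 0                       # f is congruent to 0 * a / c
    r2, s2 = a, c                       # a equals c * a / c
    while r2 != 1:                      # gcd(a, f) = 1 because f is irreducible
        q, remainder = poly_divmod(r1, r2)
        r1, r2 = r2, remainder
        s1, s2 = s2, s1 ^ poly_mul(q, s2)
    return poly_divmod(s2, f)[1]        # final reduction of the multiplier


def gf_mul(a, b):
    """Field multiply, used here only to check the quotients."""
    return poly_divmod(poly_mul(a, b), F)[1]
```

Passing c = 1 yields the multiplicative inverse of a.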

The Itoh-Tsujii algorithm [20] is based on Fermat’s Little Theorem. The algorithm raises a number, a, to the power $2^m - 2$, as discussed in section B1. This can be broken down into a sequence of squarings and multiplies. Define $h(n) = a^{2^n - 1}$. So h(1) = a. And,

$$h(m) = a^{2^m - 1} = 1 \text{ in GF}(2^m).$$

Therefore $[h(m - 1)]^2$ is the inverse of h(1) in GF($2^m$).

Given h(n), if this is squared n times then multiplied by h(n), this produces h(2n):

$$[h(n)]^{2^n} \cdot h(n) = a^{(2^n - 1)\,2^n} \cdot a^{2^n - 1} = a^{2^{2n} - 1} = h(2n).$$

If h(2n) is squared and multiplied by a, this gives h(2n + 1):

$$[h(2n)]^2 \cdot a = a^{2\,(2^{2n} - 1)} \cdot a = a^{2^{2n+1} - 1} = h(2n + 1).$$

Initially, h(1) = a, so this provides a procedure to get h(n) for any n > 0, by shifting the binary form of n, to get 2n or by shifting n and adding one, to get 2n + 1, building up n from 1 to the desired final value, m – 1.

The operation to get h(2n) from h(n) requires n squarings and one multiply. The operation to get h(2n+1) requires n+1 squarings and two multiplies. One further squaring is required to get the inverse and a final multiply is required to complete a division operation. This algorithm requires m − 1 squarings and $\lfloor \log_2 (m - 1) \rfloor + \mathrm{HW}(m - 1)$ multiplies total, where HW denotes Hamming weight.
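The doubling and doubling-plus-one steps can be traced with a minimal sketch that also counts operations. The field GF($2^{11}$) with f(x) = $x^{11} + x^2 + 1$ and the function names are assumptions for illustration. The counts returned cover inversion only; the final multiply that completes a division is not included.

```python
M = 11
F = (1 << 11) | (1 << 2) | 1            # f(x) = x^11 + x^2 + 1


def gf_mul(a, b):
    """Bit-serial, most-significant-bit-first multiply with interleaved reduction."""
    p = 0
    for i in range(M - 1, -1, -1):
        p <<= 1
        if (p >> M) & 1:
            p ^= F
        if (b >> i) & 1:
            p ^= a
    return p


def gf_square(a):
    return gf_mul(a, a)


def itoh_tsujii_inverse(a):
    """Return (inverse of a, squarings, multiplies), building h(n) up to n = m - 1."""
    h, n = a, 1                         # h(1) = a
    squarings = multiplies = 0
    for bit in bin(M - 1)[3:]:          # bits of m - 1 after the leading one
        t = h
        for _ in range(n):              # square h(n) n times
            t = gf_square(t)
            squarings += 1
        h = gf_mul(t, h)                # h(2n)
        multiplies += 1
        n *= 2
        if bit == "1":
            h = gf_mul(gf_square(h), a) # h(2n + 1)
            squarings += 1
            multiplies += 1
            n += 1
    return gf_square(h), squarings + 1, multiplies   # final squaring gives a^(2^m - 2)
```

For m = 11, m − 1 = 10 is 1010 in binary, so n runs through 1, 2, 4, 5, 10 and the inversion uses ten squarings and four multiplies.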

Combination multiplier/inverters have been proposed. One of these, which divides based on Yan’s algorithm, requires five registers of m + 1 bits each [21]. To justify such special purpose hardware for division, elliptic curve points must be maintained in affine coordinates, requiring division for every elliptic curve operation. Yet this architecture

still requires cycles to invert, making affine elliptic curve operations no bargain.

2. Elliptic Curve Operations

Using affine coordinates, the point adding formulas discussed in section B2 require one Galois field multiply, one squaring, one divide and several addition operations. Point doubling in affine coordinates also requires one Galois field multiply, one squaring, one divide and several addition operations.

Since the cost of division is high, projective coordinates are preferred. Using projective coordinates, point addition requires 16 Galois field multiply and two squaring operations. Point doubling requires eight multiply and four squaring operations. Point adding and doubling also require several additions, but no divisions.

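More efficient projective coordinate adding and doubling formulas were found by Lopez and Dahab [22]:

$$x_{2k} = \bigl( x_k^2 + (a'_6)^{1/2} \cdot z_k^2 \bigr)^2, \qquad z_{2k} = x_k^2 \cdot z_k^2$$

$$x_{2k+1} = x_1 \cdot z_{2k+1} + (x_k \cdot z_{k+1}) \cdot (x_{k+1} \cdot z_k)$$

$$z_{2k+1} = (x_k \cdot z_{k+1} + x_{k+1} \cdot z_k)^2$$

where $(x_k, y_k, z_k) = kP$ and $x_1$ is the x coordinate of P. Properly arranged, this can be done in six multiplies, four squarings and three additions in the Galois field. The y coordinates are not required, but if desired $y_{2k}$ can be recovered from $x_{2k}$, $x_{2k+1}$, $x_1$ and $y_1$. The Lopez-Dahab formulas provide 2kP and (2k+1)P given kP and (k+1)P.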

Montgomery performed point multiplication by maintaining two points, kP and (k+1)P

and either doubling the first and adding the two to get 2kP and (2k+1)P or adding the two

and doubling the second to get (2k+1)P and (2k+2)P. The effect on k is to replace it with either 2k or 2k + 1, which is equivalent to shifting a 0 or 1 bit into a binary number. So starting with 0P and 1P, the value k can be built bit by bit until the desired kP is reached.

The algorithm is Montgomery’s ladder [23].

To perform an elliptic curve cryptographic operation, P is a generator. The Lopez-Dahab formulas do not require the y coordinate, so let g be the m-bit x coordinate of P. The key, k, is also an m-bit value. Let the encrypted result, e, be the x coordinate of the result of the point multiplication, kP. Lopez-Dahab adding and doubling can be performed for each of the m bits of k, according to the Montgomery ladder. The algorithm starts with 0P and 1P, so let x0 = 1, z0 = 0, x1 = g, and z1 = 1. Because point adding and doubling are both performed for each bit of k regardless of the value of k, the cryptographic key, the algorithm is well suited to resisting side-channel attacks. On completion, e = xk/zk. This is the only Galois field division required for the entire elliptic curve point multiplication.
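The whole point multiplication can be sketched in a few lines: a Montgomery ladder that performs one Lopez-Dahab add and one double per key bit, followed by a single division. The field GF($2^{11}$) with f(x) = $x^{11} + x^2 + 1$, the curve parameter value and all names are assumptions for illustration, and the model ignores registers and timing. A result of `None` stands for the point at infinity.

```python
M = 11
F = (1 << 11) | (1 << 2) | 1            # f(x) = x^11 + x^2 + 1
B = 0b101                               # assumed value of the curve parameter a'6


def gf_mul(a, b):
    """Multiply in GF(2^M), most significant bit first with interleaved reduction."""
    p = 0
    for i in range(M - 1, -1, -1):
        p <<= 1
        if (p >> M) & 1:
            p ^= F
        if (b >> i) & 1:
            p ^= a
    return p


def gf_sq(a):
    return gf_mul(a, a)


def gf_inv(a):
    """Inverse as a^(2^M - 2): build a^(2^(M-1) - 1), then square once."""
    r = a
    for _ in range(M - 2):
        r = gf_mul(gf_sq(r), a)
    return gf_sq(r)


SQRT_B = B
for _ in range(M - 1):                  # the square root of b is b^(2^(M-1))
    SQRT_B = gf_sq(SQRT_B)


def ladder(k, g):
    """Return the affine x coordinate of kP, where g is the x coordinate of P."""
    xa, za, xb, zb = 1, 0, g, 1         # 0P and 1P
    for i in range(M - 1, -1, -1):
        bit = (k >> i) & 1
        if bit:                         # select which point is doubled
            xa, za, xb, zb = xb, zb, xa, za
        t1, t2 = gf_mul(xa, zb), gf_mul(xb, za)
        z_add = gf_sq(t1 ^ t2)
        x_add = gf_mul(g, z_add) ^ gf_mul(t1, t2)
        xs, zs = gf_sq(xa), gf_sq(za)
        x_dbl = gf_sq(xs ^ gf_mul(SQRT_B, zs))
        z_dbl = gf_mul(xs, zs)
        xa, za, xb, zb = x_dbl, z_dbl, x_add, z_add
        if bit:                         # route the results back
            xa, za, xb, zb = xb, zb, xa, za
    if za == 0:
        return None                     # kP is the point at infinity
    return gf_mul(xa, gf_inv(za))       # the only division
```

Each loop iteration uses six multiplies, four squarings and three additions whatever the key bit, and applying the ladder twice, `ladder(k, ladder(j, g))`, agrees with `ladder(j * k, g)`, which is the property a Diffie-Hellman style exchange relies on.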

3. Cryptographic Processors

A compact elliptic curve processor was proposed that operated in affine coordinates, requiring Galois field division for each elliptic curve operation [24]. The division algorithm was based on the Extended Euclidean Algorithm. It performed bit-serial least-significant-bit-first Galois field multiplication. It had no dedicated squarer. It claimed only five registers but these may have been for the ALU alone. It did not use the Montgomery ladder. The point multiplication algorithm timing depended on the distribution of one bits in the key.

Another processor operated in affine coordinates but used the Montgomery ladder [25]. It performed bit-serial most-significant-bit-first Galois field multiplication. The division algorithm was based on the Extended Euclidean Algorithm. It had 13 registers.

The state of the art in cryptographic processors was discussed in an overview paper in 2006 [26]. Most of the background and design decisions of elliptic curve processors were presented including the Group Law, the elliptic curve discrete logarithm problem, EC Diffie-Hellman key exchange, most-significant-bit-first Galois field multiplication, projective coordinates, the Lopez-Dahab formulas, the Montgomery ladder, the Itoh-Tsujii algorithm and inversion based on the Extended Euclidean Algorithm. In the area of low-power cryptography for RFID, the basic architecture of a processor was presented as a control unit, a Galois field arithmetic logic unit (ALU) and a number of m-bit registers.

The elliptic curve processor presented in that paper included an ALU that could perform GF($2^m$) and GF(p) operations in the Montgomery domain, which requires an additional conversion step. The ALU had a rich set of operations including clear, load, add, shift and multiply. It performed point multiplication by the Montgomery ladder in projective coordinates, then performed division based on the Extended Euclidean Algorithm to produce an affine result.

The overview paper was followed by a series of processors proposed by the Computer Security and Industrial Cryptography (COSIC) research group. These processors would use projective coordinates, most-significant-bit-first Galois field multiplication, versions of the Lopez-Dahab formulas and the Montgomery ladder. They would omit inversion.

Subsequently, researchers from this group proposed a processor with modified Lopez-Dahab formulas to save registers [27]. To perform elliptic curve point addition and doubling, it required seven and eight Galois field multiply operations, respectively. The multiplier was bit-serial. It had no dedicated squarer. It required at least six registers for elliptic operations, a key register and three additional registers in the ALU. The final conversion to affine coordinates was suggested to be carried out with an inversion algorithm based on Fermat’s Little Theorem.

Another processor from the same group required only the six Galois field multiplies required by the Lopez-Dahab formulas for point adding and doubling [28]. It required only five registers for elliptic curve operations and two more registers for ALU operands and an eighth register, possibly for the key. The multiplier design was digit-serial rather than just bit-serial. It had no dedicated squarer. Although it was reported that one inversion would be required to convert the final projective result to affine coordinates, there were no other details or measurements regarding the final inversion.

Another processor directly related to [27] used the same modified version of the Lopez-Dahab formulas but required only five and three Galois field multiply operations, respectively, using the dedicated squarer option [29] [30]. It required at least six registers for elliptic operations, a key register and three ALU registers, but claimed only seven total. The multiplier was digit-serial. The point multiplication result was provided in projective coordinates so no inversion was performed.

A somewhat different processor performed modified Lopez-Dahab formulas in seven Galois field multiplies and required a total of only six registers [31] [32]. To do this, a modified version of the projective coordinate system was used in which the two Montgomery ladder points shared one z coordinate. The ALU had no dedicated registers, but shared the processor’s elliptic curve registers. Register datapaths were reduced to a bare minimum to save area and interconnect. There was no key shift register. Instead, bits in the key were addressed. There was no dedicated squarer. The multiplier was digit-serial. The result was provided in projective coordinates so no inversion was performed.

III. Processor Designs

Because this application is very important, the components for asymmetric key cryptographic processors suitable for RFID have been studied previously and in detail. The merits of the elliptic curve discrete logarithm problem, the characteristic two extended Galois Field, Diffie-Hellman key exchange, digit-serial most-significant-bit-first multiplication, the Lopez-Dahab formulas, the Montgomery ladder and the Itoh-Tsujii and Yan inversion algorithms are well-established.

Although the components of these processors were studied earlier, selecting the best components and optimizing them for the application can lead to significantly improved processors. A number of optimizations and improvements were developed, the combination of which is the contribution of this research.

This chapter discusses two versions of an improved elliptic curve processor. The two processor versions are designated R6 and R5 because they require 6 and 5 full-width registers, respectively. The discussion begins with the data flow analysis, which motivates the major design decisions of the processors, and leads to the organization of the arithmetic logic unit (ALU) and the implementation of the key control logic to support the Montgomery ladder, and of the inversion control logic to support the Itoh-Tsujii algorithm. The description of the processors is completed with a discussion of the high-level organization of datapaths and registers, control logic, microcode and register initialization. The chapter closes with a summary of the optimizations and techniques that led to these improved processor designs.

A. Data Flows

Figure 1 and Figure 2 illustrate the data flows required to carry out the Lopez-Dahab formulas for simultaneous point adding and doubling, as given in chapter II, section C2. Point doubling is shown on the left side of each data flow diagram; point adding, on the right side. Point multiplication is carried out by repeatedly performing simultaneous point adding and doubling, according to the Montgomery ladder, which calls for swapping register pairs according to the key. There are two versions of the processor, R6 and R5. Figure 1 gives the Lopez-Dahab data flow diagram for the R6 processor, which requires 6 m-bit registers. Figure 2 gives that of the R5 processor, which uses 5 registers.

In Figure 1, the top of the diagram shows the initial register contents and the bottom, the final. The diagram reveals that no more than five variables are live at any one time. The diagram also shows that additions and multiplications are usually between one x and one z variable. It can be arranged so that this is always true by introducing a third z variable for temporary storage and by grouping $(a'_6)^{1/2}$ and x1 with the x variables.

Figure 1: Lopez-Dahab Data Flows for the R6 Processor

Figure 2: Lopez-Dahab Data Flows for the R5 Processor

The program for the R6 processor is shown on the left side of Table 1, on page 50.

Specifically, the code for point adding and doubling is shown in the upper left box. The five m-bit registers are xA, xB, zA, zB and zC. The sixth register is in the ALU. There are also O(log m) flip-flops in the control logic.

Figure 2 shows the point adding and doubling operations for the R5 version of the processor. This alternative uses the multiplier product register in the ALU as a temporary variable for the adding and doubling algorithm. To accomplish this, the register must not be used as a temporary variable during Galois field multiply operations. That poses no difficulty during point adding, whose flow is shown in the right half of the diagram, but during point doubling, in the left half, an extra Galois field multiply is required. In this case, the operations for point doubling are arranged as follows:

$$x_{2k} = \bigl( (a'_6)^{-1/2} \cdot x_k^2 + z_k^2 \bigr)^2$$

$$z_{2k} = (a'_6)^{-1/2} \cdot \bigl( (a'_6)^{-1/2} \cdot x_k^2 \bigr) \cdot z_k^2$$

Note that this represents different values for $x_{2k}$ and $z_{2k}$ than used in the R6 processor or the Lopez-Dahab formulas, but the ratio, $x_{2k}/z_{2k}$, is the same. This version is the same as the original except that this has the projective coordinates divided by a′6.

The program for the R5 processor is shown on the right side of Table 1 on page 50. The code for adding and doubling is in the upper right box. The five m-bit registers are xA, xB, zA, zB and p, the ALU multiplier product. There are no other flip-flops in the R5 processor except O(log m) flip-flops in the control logic.

The data flow for Itoh-Tsujii inversion requires maintaining two registers for the projective result while operating in two others. So this is not the critical requirement for registers.

B. Arithmetic Logic Unit (ALU)

Figure 3 is a block diagram of the Galois field ALU, which takes two operands, a and b, and can square either operand, or add or multiply the two together, resulting in $a^2$, $b^2$, a + b, or a⋅b. The ALU exposes the multiplier p register, which is also used as temporary storage for elliptic curve operations in the R5 version of the processor.

Addition and squaring are combinatorial logic only and produce results in the same cycle that operands are provided. Addition is a simple bit-wise XOR. Squaring is simple but requires reduction. The proposed processor uses minimum sparse irreducible polynomials. Figure 4 shows an eleven bit squarer. The degree 11 polynomial is f(x) = $x^{11} + x^2 + 1$. For degrees with irreducible trinomials, the number of 2-input XOR gates required for a dedicated squarer is about 0.6 XOR gates per degree. For pentanomials, about 1.55 gates per degree are required. For example, the minimum irreducible polynomial for degree 163 has five terms and its dedicated squarer requires 252 XOR gates. The numbers of XOR gates required for squarers for m up to 250 are shown in Figure 5.

The multiplier is digit serial, most-significant-digit-first. Figure 6 illustrates a six bit multiplier (m = 6) with three bit digits (w = 3). After a reset cycle to clear the p register, this multiplier would run in two cycles. In the first cycle, the selector, s = 1, thus passing b3-b5 through the multiplexers to d0-d2 and on to be combined with a0-a5 in AND gates in array t. Columns of array t are combined in XOR gates to g0-g8 and then to the reduction section, mod f.

The output of the reduction is y, which is stored in the register, p. After the next step, in which s = 0, selecting b0-b2, the y output is the output of the multiplier, and the counter asserts a done signal to indicate the output is valid. In the R5 processor, the p register is used as temporary storage as the hold signal is asserted, so the register can be read later.

Figure 3: Arithmetic Logic Unit (ALU)

Figure 4: Dedicated Squarer, m = 11

The multiplier needs no registers for the operands. The a input is read directly from the ALU’s a bus into the AND gate array. The b input is read from the ALU’s b bus into w multiplexers that select w bits from b. Each multiplexer selects one of $\lceil m/w \rceil$ inputs. If the high order digit of b is incomplete, it must be padded with zeroes. There is a register to count $\lceil m/w \rceil$ steps for selector s. That register has $\lceil \log_2 \lceil m/w \rceil \rceil$ bits. The multiplier has only one m-bit register, p.

C. Key Control Logic

The Montgomery ladder serves well to resist side-channel attacks because the key has no influence on the order of operations and very little influence on datapaths. Multiplexers select either xA or xB and either zA or zB for input and output from the ALU during the point adding and doubling operations, minimizing the key’s influence on the processor’s power profile. The multiplexer select signals, sx and sz for the ALU inputs, and ss for the ALU output, are connected via XOR gates to the key control logic, shown in Figure 7.

Figure 5: XOR Gates per Degree for Dedicated Squarers (horizontal axis: Degree, m; vertical axis: XOR Gates per Degree)

A counter of $\lceil \log_2 m \rceil$ bits is initialized at reset with the constant m – 1. During each loop to perform point adding and doubling, indicated by the upper boxes in Table 1, the counter output forms the selector of an m-to-1 multiplexer that selects a key bit. At the end of each iteration of the loop, the counter is decremented.

Figure 6: Digit-Serial Most-Significant-Digit-First Multiplier

If the registers contain elliptic curve points kG and (k+1)G at the beginning of an iteration and the key bit is clear, the registers will contain 2kG and (2k+1)G at the end of the iteration because the points are summed into the second register pair and the first point is doubled into the first register pair. If the key bit is set, the registers will contain (2k+1)G and (2k+2)G at the end because the multiplexers reverse the ALU input and output selections, so the points are summed into the first register pair and the second point is doubled into the second pair.

The difference of the two points is 1G, as required by the Lopez-Dahab formulas. And the first point is either 2kG or (2k+1)G. That is, the point corresponds to k shifted left with a zero or one bit added, as required. After m iterations, the first register pair has the result of a point multiplication by the full key.

D. Inversion Control Logic

After point multiplication is complete, the projective x and z coordinates are converted to the affine x coordinate by Galois field division of the projective coordinates. This consists of inverting the projective coordinate in the zA register and multiplying the result by the xA register.

Shifting and selectively adding one to form m – 1 while performing repeated squarings and multiplies will produce zA to the power $2^{m-1} - 1$, and a final squaring produces the power $2^m - 2$, which is the Galois field multiplicative inverse of zA, as discussed in chapter II, section C1. The control logic for implementing the Itoh-Tsujii algorithm is shown in Figure 8 and the program steps are shown as the lower boxes in Table 1. In each iteration of the algorithm, the zB register holds the h(n) value as the xB register accumulates the h(2n) or h(2n+1) result, the latter of which requires the h(n) value.

A shift register of 2⌈log2 m⌉ bits is initialized at reset with the constant m – 1 in the lower half of the register. The loop begins with a left shift. If a one bit is shifted into the upper half, then the accumulated result is squared and multiplied by a, the value being inverted. Then the upper half is loaded into a counter of ⌈log2 m⌉ bits, which counts squarings of a copy of the accumulated result, which is finally multiplied with the accumulated value. The loop is repeated for each bit of the constant m – 1. A final squaring gives the inverse of the initial value.
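The inversion can be illustrated in software with a minimal Python sketch. Several things here are assumptions made for the example: the small field GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x + 1 (chosen so that every element can be checked exhaustively), the function names `gf_mul` and `gf_inv`, and the textbook left-to-right ordering of the chain, which is not necessarily the exact sequencing of the control logic in Figure 8.

```python
def gf_mul(a: int, b: int, m: int, poly: int) -> int:
    """Multiply a and b in GF(2^m); poly includes the x^m term."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:  # reduce as soon as degree m is reached
            a ^= poly
    return result


def gf_inv(a: int, m: int, poly: int) -> int:
    """Invert a in GF(2^m) with an addition chain on m - 1.

    h holds a^(2^n - 1); each step forms h(2n) and, when the next bit
    of m - 1 is set, h(2n + 1). A final squaring gives a^(2^m - 2).
    """
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    h, n = a, 1  # h(1) = a
    for bit in bin(m - 1)[3:]:  # bits of m - 1 after the leading one
        t = h
        for _ in range(n):  # n squarings of a copy of the result
            t = gf_mul(t, t, m, poly)
        h = gf_mul(t, h, m, poly)  # h(2n) = h(n)^(2^n) * h(n)
        n *= 2
        if bit == "1":
            h = gf_mul(gf_mul(h, h, m, poly), a, m, poly)  # h(2n+1)
            n += 1
    return gf_mul(h, h, m, poly)  # final squaring


M, POLY = 8, 0x11B  # assumed example field, not the field of the design
print(all(gf_mul(a, gf_inv(a, M, POLY), M, POLY) == 1 for a in range(1, 256)))
```

The number of multiplies in the chain grows with the bit length and Hamming weight of m – 1, which is why only a small amount of control logic is needed to sequence it.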

E. High-Level Organization

Figure 9 and Figure 10 illustrate the bus organization for the R6 and R5 versions of the elliptic curve processor. These are m-bit registers and datapaths. To support Lopez-Dahab adding and doubling, there are four registers, xA, xB, zA, and zB. The R6 version has an additional register, zC. The processors can store ALU results to any of these through the s bus. To minimize the number of datapaths, inputs to the ALU are separated into an x bus and a z bus. In the R6 version, the x bus is selected among xA and xB registers, processor input g and constant, a′6^½. The z bus is selected among zA, zB, and zC registers. The R5 version has 0 on the x bus, a′6^(-½) on the z bus, and the ALU’s p register instead of zC on the z bus.

Figure 7: Key Control Logic

Figure 8: Inverter Control Logic

Figure 11 illustrates the processor high level organization. A five bit program counter,

PC, selects control signals from the program ROM, which is synthesized as a finite state machine (FSM). The programs are given in Table 1 on page 50. The control signals include the register selection for the x, z, and s buses, the ALU operation and special flow of control operations including key and inversion control.

In Figure 11, the buses g and k are the generator and key inputs to the processor. The generator, g, can be selected onto the x bus. The key, k, is an input into the key control logic. The output is bus e, which is the output of the xA register. The processor performs the point multiply, E = k⋅G, where E and G are points with affine x coordinates e and g.

Figure 9: Datapaths of the R6 Processor

Figure 10: Datapaths of the R5 Processor

At the beginning of each iteration, the Montgomery ladder requires that kG and (k+1)G values are stored as projective coordinates in registers xA/zA and xB/zB, respectively.

Initially, k = 0, so the registers must be initialized with 0G and 1G. The 0G point is the

additive identity, the point at infinity, represented in projective coordinates by a zero z

coordinate, and a non-zero x. The 1G point is the generator whose affine x coordinate is

g, which is available on the x bus.

Figure 11: High Level Organization

So for proper initialization, it is required that xA ≠ 0, zA = 0, and xB/zB = g. To avoid additional control logic or multiplexers to load reset values into these registers, it is preferable to use only a sequence of ROM program instructions that result in proper FSM gates. Such sequences are possible for these processors using only g and the ALU’s adding and squaring operations, by noting that x + x = 0 in GF(2^m) and x^2 / x = x, if x ≠ 0.

If g is selected from the x bus and the a^2 operation is selected from the ALU, it will output g^2 and this can be used to set xB = g^2 and zB = g^2 in two instructions. Next, using these equal values as inputs to the ALU and selecting the addition operation, a zero value will result and this can be stored in zA. Finally, g can be selected from the x bus again and added with the zero in zA, to set xA = g and zB = g. That leaves xA/zA = g / 0 = ∞ and xB/zB = g^2 / g = g as required. That is the five instruction sequence used by the R6 processor. Since the R5 processor has a zero available from the x bus, it can load xA/zA in two instructions and complete initialization in four.
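The five-instruction R6 reset sequence can be traced with a small Python sketch. The field GF(2^8) with polynomial x^8 + x^4 + x^3 + x + 1 and the names `gf_mul` and `r6_reset` are assumptions made for illustration only; the sketch models register contents, not the hardware.

```python
M, POLY = 8, 0x11B  # assumed small example field, not the field of the design


def gf_mul(a: int, b: int) -> int:
    """Multiply a and b in GF(2^M); addition in this field is XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY
    return result


def r6_reset(g: int):
    """Trace the R6 reset program using only g, squaring and addition."""
    xB = gf_mul(g, g)  # xB = g^2
    zB = gf_mul(g, g)  # zB = g^2
    zA = xB ^ zB       # zA = xB + zB = 0, since x + x = 0
    zB = g ^ zA        # zB = g + 0 = g
    xA = g ^ zA        # xA = g + 0 = g
    return xA, zA, xB, zB


xA, zA, xB, zB = r6_reset(0x53)
# xA/zA is the point at infinity; xB/zB = g because xB = g * zB
print(xA != 0, zA == 0, xB == gf_mul(0x53, zB))
```

The check `xB == g * zB` avoids a division while confirming that the projective pair xB/zB represents the affine value g.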

F. Summary of Contributions

The contributions of this work are in the selection and integration of components of an

elliptic curve processor. The guiding principle in designing these processors for minimal

power and area was to give the most frequent case the most resources while giving the

rarest case only enough resources to function. From this principle, the Galois field multiplier is the most important and includes O(mw) components. The dedicated squarer justifies its cost in area by its frequent use, saving 4m Galois field multiply operations during the elliptic curve point multiplication.

On the other hand, register initialization, which is done once, has no resources dedicated to it except a few micro-instructions. The final inversion, an algorithm that is run once, though it includes O(log m) multiplies, has only a small amount of dedicated control logic. Features of the processor that never change, such as the reduction polynomial, are hard-coded into the designs.

Each full-width m-bit register requires a large amount of area, about 4m gate equivalents.

The technique of dataflow analysis performed on the original Lopez-Dahab formulas revealed the minimum number of registers required. The dataflow graph can be drawn so not more than five edges need be cut while partitioning the graph into a sequence of operations. Therefore no more than five registers are required to perform the original

Lopez-Dahab formulas, which need six Galois field multiply operations.

A sixth m-bit register was required in the Galois field multiplier in the R6 version of the proposed processor. In the R5 version, the Lopez-Dahab formulas were modified to share a register with the multiplier, saving a register, but requiring a seventh multiply operation.

A dedicated squarer was chosen for the proposed design. Although it requires a slight increase in area, the power savings were very large. The Lopez-Dahab formulas require six multiply and four squaring operations. Without a squarer, all ten of these operations would be multiplies. Since squaring requires negligible time compared with multiplying, the dedicated squarer gives a 40% improvement. The squarer is also required in order to make the Itoh-Tsujii inversion algorithm economical.

Two m-bit shift registers were replaced with O(log m) bit counters and m input multiplexers. Shift registers would be used to select one of m bits from the key and to select w of m bits in the multiplier. Whereas shifting requires a full register, a counter can count the bits and the current value can be used as the selector of a multiplexer. The counter requires only O(log m) flip-flops instead of the shift register’s m flip-flops. The multiplexer has no net cost since its equivalent was required for the shift register to select among load, hold, and shifted inputs.
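The saving can be made concrete with a brief Python sketch. It assumes the counter needs just enough bits to hold values from m – 1 down to 0; the function names are hypothetical and the model is behavioral rather than gate-level.

```python
def counter_flip_flops(m: int) -> int:
    """Bits needed for a counter that holds values 0 .. m-1."""
    return (m - 1).bit_length()


def key_bits_msb_first(key: int, m: int):
    """Select key bits with a down-counter used as a multiplexer selector."""
    counter = m - 1
    while counter >= 0:
        yield (key >> counter) & 1  # m-to-1 multiplexer on the key
        counter -= 1


# For degree 163, the counter needs 8 flip-flops where a shift register needs 163.
print(counter_flip_flops(163), list(key_bits_msb_first(0b1011, 4)))
```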

In addition to minimizing m-bit wide registers, m-bit datapaths were reduced by analyzing the dataflow. The R6 processor requires not more than seven inputs: the five variables, the generator and the a′6^½ constant. The R5 processor requires eight, including the multiplier output and zero. These can be divided into two groups of four for adding and multiplying, requiring only two 4-to-1 multiplexers for the m-bit inputs.

Control logic is a relatively small cost in these processors compared with the Galois field

registers and datapaths, but the most economical use of resources in control logic requires

a suitably sophisticated design. A minimal finite state machine would be too simple since

there are several well defined counter-controlled loops. An FSM description would be

difficult to synthesize optimally. On the other hand, a general purpose processor would

contribute unnecessary overhead that might not be synthesized away. The best choice of

control logic for these designs is a specialized microcode processor supported by

hardware to implement the key control and inversion algorithms.

An important source of area and power savings lies in hard-coding as much of the processor as possible. Reconfigurability is not important in ubiquitous systems. The cost

of these systems is such that they are much less expensive to replace than to reconfigure.

In the elliptic curve application, the only inputs required are g and k. Other parameters of the curve can be hard-coded into the silicon: the field itself, GF(2^m), the reduction polynomial, the digit size, w, and the elliptic curve constant, a′6^½ or a′6^(-½). The capability to change any of these with the product in the field would require a processor design with additional registers and control logic. Also, the reduction polynomials are sparse.

No control logic or multiplexers are required to load initial values in the m-bit registers at

reset. The existing processor inputs and properties of Galois fields were exploited to

initialize registers in a few cycles of microcode.

These processors perform the final inversion to convert the projective x and z coordinates

into the affine x coordinate. The projective pair has no more cryptographic value than the affine one. By providing a single result instead of two, the tag does not need to transmit a

second number back to the reader. Although the inversion could be done faster with a

Yan algorithm inverter, it would require a more complicated ALU with a shifting capability. By using the Itoh-Tsujii algorithm, an ALU with addition, squaring and multiplication is sufficient.

G. Comparison with Other Works

Figure 12 shows a comparison of the reference and proposed designs in terms of full-width m-bit register requirements and Galois field multiply operations for point addition and doubling. Note that these comparisons are for the number of Galois field multiplies required assuming a dedicated squarer. In many cases, the reference processors implemented squaring as multiplication.

The proposed processor designs and reference [28] fit the Lopez-Dahab formulas onto five m-bit registers, although the algorithms for point doubling differ slightly. The algorithm quoted in [28] was wrong, but could be corrected by removing step 5. Also, as noted in [13], the sum of squares is the same as the square of sums in GF(2^m), saving a squaring operation in the proposed designs. The R5 processor uses a different doubling algorithm, eliminating a register at the cost of a multiply operation.

The processor in reference [28] required eight m-bit registers in total, according to reference [31]. In reference [28], three registers were indicated in the ALU, although one of these had the same name as one of the five used in its Lopez-Dahab algorithm. The eighth register may have been the key, unless, as in [32], the key bits were addressable.

Figure 12: Registers and Multiplies for Proposed and Reference Processors (horizontal axis: m-bit Registers; vertical axis: Galois Field Multiplies per Point Add & Double; points plotted: R6, R5, [25], [27], [28], [30], [32])

The improvement in m-bit registers in the proposed designs is due, in part, to the technique of using an O(log m) bit counter as the selector for an m-input multiplexer. The proposed processors use this for selecting bits from the key and the second multiplier operand. Reference [28] used an m-bit shift register in its multiplier. Another register was saved in the proposed processors by reading the first multiplier operand from the ALU input bus without copying it into an internal register, as in the reference.

Like the proposed R6 processor, reference [32] required a total of six m-bit registers: five to implement the Lopez-Dahab formulas and one for the multiplier product. Of the five, the R6 processor uses one for temporary storage and four for the x and z coordinates of the two projective points. The processor in the reference used two for temporary storage and three for the projective points, which shared a common z coordinate. To arrange this, the reference required seven multiply operations. The R6 processor design uses the original Lopez-Dahab formulas, which require six multiplies. The alternative R5 processor uses the multiplier product as the temporary storage, reducing the total number of registers to five while requiring a seventh multiplication.

Reference [32] was fitted to a relatively high-level eight-bit processor with ROM. The proposed processors were designed at the microcode level and synthesized into finite state machines, resulting in more efficient area usage.

The designs in references [28] and [32] did not have dedicated squarers but used the multiplier. The proposed processors use dedicated squarers, which require a small amount of area but provide a large improvement in performance over the multiplier. The R6 processor requires six multiply operations per key bit whereas the references required eleven each. Reference [28] required six conventional multiplies and an additional five squarings implemented as multiplies. Reference [32] had seven conventional multiplies and four squarings implemented as multiplies.

The proposed processors perform inversion and produce an affine abscissa, a single m-bit word to be transmitted over the channel. Reference [28] indicated one inversion was required but gave no details. Reference [32] produced a projective abscissa, in the form of two m-bit words, which would double the amount of information that would need to be transmitted with precious power and bandwidth.

Table 1: Cryptographic Processor Programs

       R6 Program                       R5 Program
reset  xB = g^2             (0)  reset  xA = 0 + a′6^(-½)    (0)
       zB = g^2             (0)         zA = xA + a′6^(-½)   (0)
       zA = xB + zB         (0)         xB = g^2             (0)
       zB = g + zA          (0)         zB = g + zA          (0)
       xA = g + zA          (0)  kloop  xB = xB × zA         (1)
kloop  xB = xB × zA         (1)         zB = xA × zB         (1)
       zB = xA × zB         (1)         p = xB × zB          (2)
       xA = xA^2            (1)         zB = xB + zB         (1)
       zA = zA^2            (1)         xB = 0 + p           (1)
       zC = a′6^½ × zA      (2)         zB = zB^2            (1)
       zA = xA × zA         (1)         p = g × zB           (2)
       xA = xA + zC         (1)         xB = xB + p          (1)
       xA = xA^2            (1)         xA = xA^2            (1)
       zC = xB + zB         (2)         zA = zA^2            (1)
       xB = xB × zB         (1)         xA = xA × a′6^(-½)   (1)
       zB = zC^2            (1)         p = xA × zA          (2)
       zC = g × zB          (2)         zA = xA + zA         (1)
       xB = xB + zC         (3)         xA = 0 + p           (1)
       zC = xB + zB         (2)         p = xA × a′6^(-½)    (2)
       xB = xB + zB         (1)         xA = zA^2            (1)
       zC = xB + zC         (2)         zA = 0 + p           (3)
       xB = xB + zB         (1)         zB = 0 + zA          (4)
       zB = xA + zA         (0)  nloop  xB = 0 + zB          (5)
       xB = xA + zB         (4)         xB = xB^2            (0)
nloop  zB = xB + zC         (5)         xB = xB × zA         (0)
       xB = xB^2            (0)  nskip  xB = xB^2            (6)
       xB = xB × zA         (0)         zB = xB × zB         (4)
nskip  xB = xB^2            (6)         zA = zB^2            (0)
       xB = xB × zB         (4)         xA = xA × zA         (0)
       zA = xB^2            (0)         zB = 0 + zB          (7)
       xA = xA × zA         (0)
       xB = xB + zC         (7)

Special Operations
(0) Normal
(1) Key swap: xA↔xB and zA↔zB if key[kcnt] bit set
(2) Key swap, store in zC (R6) or p (R5)
(3) Key swap, shift key, if kcnt ≥ 0, goto kloop; decr. kcnt
(4) Shift n, load ncnt, if b[2l-2] bit clear, goto nloop
(5) if b[l-1] bit clear, goto nskip; decr. ncnt
(6) Repeat while ncnt ≥ 0; decr. ncnt
(7) Done; repeat forever

IV. Simulation Experiments

Having developed the theory behind small, fast elliptic curve processors and having

proposed two improved designs, experiments must be carried out to prove the viability of the theory. As is typical of computer engineering academic study, these tests were carried out in simulation. No chips were fabricated.

This chapter discusses tests performed with Synopsys Design Compiler and a 0.25 µm standard cell library. The discussion begins with a description of the simulation of the proposed processors, the tools and test vectors used and the data collected, for R6 and R5

processors, 3 ≤ m ≤ 256 and 1 ≤ w ≤ 16. Then the results are presented for delay, area and

power including measurements for energy of the entire cryptographic operation. These

results are presented in graphical form and as coefficients for formulas accurate to ±5%

(for area measurements), to ±20% (for energy measurement), and exactly (for time and

memory measurements). Finally, the chapter concludes with a comparison of the

improved processors proposed in this dissertation and the processors from the literature.

A. Test Setup

The processor was modeled at the gate level and tested with a simple digital logic

simulator written in C++. Classes for signals, gates, flip-flops and ROMs were

developed, tested and built up hierarchically into full-scale processors of up to m = 256

bits. The program generated Verilog code which was synthesized and simulated with Synopsys Design Compiler at the Synopsys/HP EDA Laboratory in the Electrical Engineering and Computer Science Department at Case Western Reserve University.

The simulated processor was tested at the class level and as a complete system, in many

cases using 11 bit examples from [13]. Full scale tests were carried out with the degree

163 elliptic curve NIST B-163 [33], also known as sect163r2 [34], using twenty-six

vectors from COSIC [35]. Verilog code of processor versions R6 and R5 for digit sizes

1 ≤ w ≤ 16 were verified in simulation.

Both versions of the processor, R6 and R5, were synthesized for 3 ≤ m ≤ 256 and

1 ≤ w ≤ 16 (and w ≤ m) totaling 7,946 processors, using Synopsys Design Compiler and a

TSMC 0.25 µm standard cell library, excluding cells with large leakage current. Results were obtained for area, delay, dynamic power and static (leakage) power. For dynamic power, activity was measured by simulating the synthesized circuits with hard-wired random a′6 parameters and random g and k vectors. Each activity test was run for one complete encryption operation, an elliptic curve point multiplication. There were 16 runs with different g and k vectors for m ≤ 200 and one run for larger circuits. For each circuit, numbers of clock cycles and D flip-flops were also determined.
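The total of 7,946 synthesized processors follows directly from the stated ranges, as the short Python sketch below illustrates. The function name `configurations` is hypothetical; the ranges 3 ≤ m ≤ 256, 1 ≤ w ≤ 16 and w ≤ m are those given in the text.

```python
def configurations(m_min: int = 3, m_max: int = 256, w_max: int = 16):
    """Enumerate (version, m, w) for both processors with w <= min(w_max, m)."""
    return [
        (version, m, w)
        for version in ("R6", "R5")
        for m in range(m_min, m_max + 1)
        for w in range(1, min(w_max, m) + 1)
    ]


print(len(configurations()))
```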

B. Results

Propagation delay was less than 15 ns for all designs, allowing for a clock speed of at

least 66 MHz at the normal supply voltage of 2.5 V for this technology.

Figure 13 through Figure 18 graph area, time, energy and power vs. the degree, m, up to 250. Each graph includes plots for the two proposed versions of the processor, R6 and R5, and for each, three multiplier digit sizes, w = 1, 4, and 16. The differences between the two processors are much less pronounced than the differences among the three digit sizes, so the graphs contain three pairs of curves. In the area graphs, Figure 13 and Figure 16, the R5 results are smaller than R6 and w = 1 results are smaller than w = 4, which in turn are smaller than w = 16. In the other four figures, the results are reversed, so the smallest results are for R6 and w = 16 and the largest results for R5 and w = 1.

Figure 13 graphs area measured in an equivalent number of two-input, single drive strength, NAND gates. The graph also includes lines labeled “DFFs,” which is the area required for D flip-flops, independent of w. Area goes as the first order of degree because of the m-bit wide registers, datapaths and ALU logic. The curves are somewhat irregular largely because the minimum sparse irreducible polynomial has three terms or five terms in an irregular pattern relative to the degree (cf. Figure 5, page 37). Digit size, w, contributes an approximately linear increase in area because of the m×w logic array in the multiplier.

Figure 13: Processor Area (NAND Gates) (horizontal axis: Degree, m; vertical axis: Area (NAND Gates))

Figure 14 graphs time in machine cycles to complete one cryptographic operation, which is a function of the algorithms the processors run and independent of logic synthesis. Time goes as the second order of degree because the multiplier processes ⌈m/w⌉ digits, one per cycle, for each of m bits of the key. Digit size is inversely proportional to time for the same reason. The curves are not entirely smooth because of the ⌈m/w⌉ factor, so there are steps across the graph each w degrees. There are smaller steps for the logarithmic term and HW(m-1).

Figure 14 includes a very flat curve labeled “÷” for the time required for the division that converts the projective result to an affine value. Division is carried out the same way in the R6 and R5 versions of the processor, independent of w, so this division curve may be viewed as being stacked with each of the other curves on the graph, indicating the portion of the total time required by each processor configuration. Division is a very small portion of the total time required for any processor, unless w is extremely large.

Figure 14: Processor and Divide Time (Cycles) (horizontal axis: Degree, m; vertical axis: Time (Cycles))

As is evident from Figure 13 and Figure 14, there is a trade-off between area and execution time in choosing the digit size, w. Although time increases with the second order of the degree, m, time decreases as the first order of digit size, w, and area increases as the first order of digit size, w. As m becomes very large, fighting it with a larger w becomes more important, but any time savings must be paid for directly in area.

Figure 15 graphs the product of area and time, in millions of NAND gate equivalents × cycles. This is an estimate of energy needed to perform one cryptographic operation. The values are somewhat independent of the technology library used for synthesis. This is a third order function of degree. The high order effects of digit size for area and time cancel, but the largest remaining effect has digit size inversely proportional to the area-time product.

Figure 15: Processor Area × Time (Million Gates × Cycles) (horizontal axis: Degree, m)

Figure 16 through Figure 18 give measurements for this 0.25 µm technology. Figure 16 graphs area results in mm². Since Figure 13 graphs area in equivalent NAND gates, the only difference between the two figures is the scaling factor for the area of a NAND gate in mm². The observations concerning Figure 13 apply to Figure 16 as well.

Figure 17 graphs total dynamic energy in µJ. Figure 18 graphs leakage energy × frequency, a quantity in units of mW. Both energy graphs are given for the complete cryptographic operation. Dynamic energy can be given for the entire operation independent of clock frequency, but leakage is continuous, so the results in Figure 18 must be divided by clock frequency to get total leakage energy during a complete cryptographic operation. At 1 MHz, the numbers in the figure represent nJ, illustrating that for this design the leakage is negligible relative to dynamic energy.

Figure 16: 0.25 µm Processor Area (mm²) (horizontal axis: Degree, m)

The curves in Figure 17 are smoother than those predicted by Figure 15, implying that the reduction logic does not switch as much as the circuit as a whole. Both energy curves go as the third order of degree because the multiplier operates on each of m bits of its first operand with each of m bits of its second operand, and this must be done for each of m bits of the key, requiring 6m^3 such operations for the R6 processor and 7m^3 for the R5.

They also go inversely as w, but not strongly, because while there is w times as much logic in parts of the multiplier, the multiplier runs w times faster. But overall, when w is larger, other parts of the system are used more efficiently, so energy decreases.

Figure 17: 0.25 µm Processor Dynamic Energy (µJ) (horizontal axis: Degree, m)

The synthesis results for 3 ≤ m ≤ 256 and 1 ≤ w ≤ 16 in general are summarized in Table 2 and Table 3, from a least-squares fit of the given functions of m, w, and unity. The approximate formula for a measurement given in a row is the sum of products of the coefficients in the table and functions given at the top of the columns. The worst case percent error of the formulas is given for m ≥ 72 and m ≥ 132. For example, the area of the R6 processor in NAND gate equivalents is 2.99⋅m⋅w + 49.7⋅m + 415 and accurate to ±4.78% when m ≥ 132. For m = 163 and w = 4, this estimates 10,466, but the actual measurement was 10,815.09, a 3.23% underestimate.
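The worked example above can be reproduced with a few lines of Python. The coefficients and the measured value are those quoted in the text; the function name `r6_area_estimate` is hypothetical.

```python
def r6_area_estimate(m: int, w: int) -> float:
    """Fitted R6 area in NAND gate equivalents: 2.99*m*w + 49.7*m + 415."""
    return 2.99 * m * w + 49.7 * m + 415


measured = 10815.09  # synthesized R6 area for m = 163, w = 4
estimate = r6_area_estimate(163, 4)
underestimate_pct = 100 * (measured - estimate) / measured
print(round(estimate), round(underestimate_pct, 2))
```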

Table 2 gives three sets of approximate values for circuit area. The first set is overall in NAND gates as in Figure 13. The second set is area excluding memory (D flip-flops), whose approximate values are less accurate than overall area because memory is a large proportion of these processors, 30-45% for R6, and 28-40% for R5, with w = 1. The third set is overall area in µm², as in Figure 16 except the units there are mm².

Figure 18: 0.25 µm Processor Leakage Energy × Frequency (mW) (horizontal axis: Degree, m)

Table 3 gives three sets of approximate values for energy related quantities. The first set gives the area-time product in NAND gate equivalents × cycles, as in Figure 15 except the units there are in millions of NAND gate equivalents × cycles. The second set gives dynamic energy in pJ, as in Figure 17 except the units there are µJ. The third set gives leakage energy × frequency in nW, as in Figure 18 except the units there are mW.

Table 2: Approximate Area Coefficients, 3 ≤ m ≤ 256, w ≤ 16

Measurement       Processor   m·w     m      1       Worst Case Error
                                                     m ≥ 72     m ≥ 132
Area              R6          2.99    49.7   415     ±13.20%    ±4.78%
(NAND Gates)      R5          2.93    46.7   414     ±12.98%    ±4.98%
Area (no RAM)     R6          2.99    26.6   305     ±19.32%    ±7.54%
(NAND Gates)      R5          2.93    27.4   304     ±17.87%    ±7.36%
Area              R6          124     2061   17200   ±13.21%    ±4.78%
(µm²)             R5          121     1940   17200   ±12.93%    ±4.91%

Table 3: Approximate Energy Coefficients, 3 ≤ m ≤ 256, w ≤ 16

Measurement              Processor  Function coefficients (last column: 1)                        Worst Case Error
                                                                                                  m ≥ 72     m ≥ 132
Area×Time                R6         19.6   290   3040  -22.5  1150  4060  -9510  -310000          ±12.54%    ±4.74%
(NAND Gates×Cycles)      R5         22.7   320   3190  -1.66  984   2220  35800  -883000          ±11.73%    ±5.10%
Dynamic Energy           R6         0.641  6.42  41.2  -5.58  108   429   -5350  43600            ±15.87%    ±11.38%
(pJ)                     R5         0.735  7.04  22.5  -5.7   125   396   -5360  55500            ±15.27%    ±12.24%
Leakage Energy × Freq.   R6         0.351  4.17  39    -2.1   28.2  201   -962   -11000           ±10.47%    ±10.47%
(nW)                     R5         0.394  4.66  29    -1.78  23.3  163   323    -29700           ±10.67%    ±10.67%

Table 4: Exact Time and D Flip-Flop Coefficients

Measurement              Processor  m⌈m/w⌉  ⌈m/w⌉log  ⌈m/w⌉HW(m-1)  ⌈m/w⌉  m    log  HW(m-1)  1    log
Total Time               R6         6       1         1             -1     14   2    1        10   0
(Cycles)                 R5         7       1         1             -1     18   2    1        4    0
Projective Result Time   R6         6       0         0             0      13   0    0        6    0
(Cycles)                 R5         7       0         0             0      17   0    0        5    0
Division Time            R6         0       1         1             -1     1    2    1        4    0
(Cycles)                 R5         0       1         1             -1     1    2    1        -1   0
D Flip-Flops             R6         0       0         0             0      6    4    0        6    1
(Count)                  R5         0       0         0             0      5    4    0        6    1

Table 4 gives four sets of numbers, three sets of machine cycles and one set of counts of D flip-flops. The numbers are exact, functions of the design, independent of synthesis. As in Figure 14, time is given in cycles as the total time and broken down into the time to get the result in projective coordinates and the time to convert this to the affine coordinate by division. Here the function HW(m-1) is the Hamming weight of m-1. The count of D flip-flops is exact and the comprehensive total for the elliptic curve processor.
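As an illustration, the total-time coefficients of Table 4 can be evaluated in Python and compared with the cycle counts reported in Table 6. The assignment of coefficients to functions is an assumption inferred for this sketch: the first eight columns are taken to multiply m⌈m/w⌉, ⌈m/w⌉·L, ⌈m/w⌉·HW(m-1), ⌈m/w⌉, m, L, HW(m-1) and 1, with L = ⌈log2 m⌉. The function name `total_cycles` is hypothetical.

```python
def total_cycles(version: str, m: int, w: int) -> int:
    """Total machine cycles for one point multiplication including division.

    Assumed column meanings (see lead-in); L is taken as ceil(log2(m)).
    """
    digits = -(-m // w)                 # ceil(m / w) multiplier cycles
    L = (m - 1).bit_length()            # assumed meaning of the log term
    hw = bin(m - 1).count("1")          # Hamming weight of m - 1
    mults, steps, const = {"R6": (6, 14, 10), "R5": (7, 18, 4)}[version]
    return (mults * m * digits          # multiplies in the key loop
            + digits * (L + hw - 1)     # multiplies in the inversion
            + steps * m                 # single-cycle instructions
            + 2 * L + hw + const)


for version, m, w in [("R6", 163, 1), ("R6", 163, 4), ("R5", 163, 1), ("R5", 163, 4)]:
    print(version, m, w, total_cycles(version, m, w))
```

Under these assumptions the sketch gives 163,355 and 42,819 cycles for R6 and 190,570 and 50,148 cycles for R5 at m = 163 with w = 1 and w = 4, matching the values listed for the proposed processors in Table 6.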

C. Comparison with Other Works

Figure 19 compares the proposed processor versions with reference processors from the literature. Each point represents a processor with area, time (to complete the cryptographic operation) and area-time product as shown on the axes. Reference processors are indicated with blue diamonds and the reference number in square brackets.

Some references did not include area measurements for memory devices. These are indicated with an asterisk (*) in the chart and are compared with the proposed processors excluding memory area. The proposed R6 and R5 are indicated with red squares and green triangles, respectively. The degree, m, and the digit size, w, values are indicated on

the chart. Lines connect the reference processors with their proposed processor

counterparts in terms of degree, digit size and inclusion of memory area.

The information in Figure 19 is also given in tabular form in Table 5 for area and Table 6 for time except the figure does not include [25]. Note that in all cases the proposed processors give significant improvement over the references.

Reference [32] gives measurements of energy required for the cryptographic operation for a processor with degree 163 and digit size one through four. Table 7 gives the energy for the reference processor and the dynamic energy of the two proposed processors (with leakage energy negligible). The proposed processors used about four times as much energy as the reference although earlier tests indicate the proposed processors used 12%-20% less area and 31%-40% fewer cycles than the reference. The difference may lie in the technology libraries. The proposed processors were tested with a TSMC 0.25 µm library excluding cells with large leakage power. The reference used a low leakage power library of UMC’s 0.13 µm. If power goes as the square of the technology scale [36], then the proposed processors are roughly the equivalent of the reference processor in terms of power, but this does not account for the difference in area and time results. This again may be due to the cell library which was not available to compare for this test.

Figure 19: Area and Time for Proposed and Reference Processors (horizontal axis: Area (1,000 NAND Gates); vertical axis: Time (1,000 Cycles))
* Area comparison does not include memory devices

Table 5: Area for Proposed and Reference Processors

                    Area (NAND Gates)            Improvement
Ref.    m     w     Ref.      R6       R5        R6     R5
[24]    192   1     16,847    10,433   9,856     38%    41%
[25]    251   1     56,000    13,643   12,877    76%    77%
[28]*   131   1     6,718     4,337    4,434     35%    34%
[28]*   131   4     8,104     5,744    6,138     29%    24%
[28]*   163   1     8,214     5,042    5,177     39%    37%
[28]*   163   4     9,926     6,931    6,986     30%    30%
[30]*   131   1     8,582     4,337    4,434     49%    48%
[30]*   131   2     8,603     4,787    4,837     44%    44%
[30]*   163   1     10,122    5,042    5,177     50%    49%
[30]*   163   2     10,933    5,667    5,830     48%    47%
[32]    163   1     10,106    8,933    8,449     12%    16%
[32]    163   4     12,863    10,815   10,250    16%    20%
* Area comparison does not include memory devices.

Table 6: Time for Proposed and Reference Processors

                    Time (Cycles)                Improvement
Ref.    m     w     Ref.      R6       R5        R6     R5
[24]    192   1     296,383   226,593  264,219   24%    11%
[25]    251   1     550,000   384,815  448,814   30%    18%
[28]    131   1     210,600   106,007  123,686   50%    41%
[28]    131   4     57,720    28,097   32,938    51%    43%
[28]    163   1     353,710   163,355  190,570   54%    46%
[28]    163   4     95,159    42,819   50,148    55%    47%
[30]    131   1     159,250   106,007  123,686   33%    22%
[30]    131   2     84,000    54,332   63,496    35%    24%
[30]    163   1     241,500   163,355  190,570   32%    21%
[30]    163   2     124,250   83,327   97,339    33%    22%
[32]    163   1     275,816   163,355  190,570   41%    31%
[32]    163   4     78,544    42,819   50,148    45%    36%

Comparison with proprietary cell libraries is problematic because the intellectual property is difficult to obtain.

Table 7: Energy (µJ) for Proposed and Reference Processors

                    0.13 µm   0.25 µm           Scaled to 0.13 µm
Ref.    m     w     Ref.      R6       R5       R6      R5
[32]    163   1     8.94      33.10    36.18    8.95    9.78
[32]    163   2     5.29      18.87    20.64    5.10    5.58
[32]    163   3     3.88      15.39    16.85    4.16    4.56
[32]    163   4     2.94      12.58    13.92    3.40    3.76
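The "Scaled to 0.13 µm" columns of Table 7 are consistent with scaling the 0.25 µm energies by the square of the ratio of feature sizes, following the assumption stated in the text that power goes as the square of the technology scale [36]. A minimal Python sketch, with the hypothetical function name `scale_energy`:

```python
def scale_energy(energy_uj: float, from_um: float = 0.25, to_um: float = 0.13) -> float:
    """Scale energy by the square of the technology ratio (assumed scaling law)."""
    return energy_uj * (to_um / from_um) ** 2


# 0.25 µm dynamic energies (R6, R5) for m = 163 and w = 1..4, from Table 7
measured = {1: (33.10, 36.18), 2: (18.87, 20.64), 3: (15.39, 16.85), 4: (12.58, 13.92)}
for w, (r6, r5) in measured.items():
    print(w, round(scale_energy(r6), 2), round(scale_energy(r5), 2))
```

This is a first-order comparison only; as the text notes, the cell libraries differ and the scaling does not account for the differences in area and time.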

V. Secure Protocol

Although the need for a secure protocol for RFID has been presented and designs of cryptographic processors have been discussed, it remains to present a protocol that best suits this application by minimizing the number of cryptographic operations and amount of communication between reader and tag.

In this chapter, a minimal protocol is presented for security and privacy in RFID and other ubiquitous systems. Having discussed the need for such a protocol in chapter II, section A, this chapter begins by describing its requirements including minimizing costs, identifying the owner on the tag, and the operations required for identification. The discussion turns to establishing the minimum number of cryptographic operations required and the minimum number of message words communicated between reader and tag. Then a protocol is proposed which performs the identification operations in the minimum number of cryptographic operations and message words. The memory requirements, tag low level support, key infrastructure and other capabilities of the proposed protocol are discussed before the chapter closes with an evaluation of the protocol, including defenses against various attacks, its general benefits and drawbacks and how its adoption may affect RFID applications in the future.

A. Requirements of a Minimal Protocol

1. Minimum Cost Tags

Any solution based on sound principles of security requires cryptography and random number generation onboard the tags, and the lion’s share of this cost will be encryption hardware. Since the quantity of tags will dwarf the number of readers and other infrastructure, tag cost has to be minimized. The goal is to get functionality off tags while maintaining privacy by transmitting tag IDs only in encrypted, nonced messages.

2. Minimal Back-End Support

A system that depends for its security on a database record for each tag would require a large, fast and secure database back-end. The tags required for a large retailer could number in the billions. The database would have to be accessed from perhaps hundreds of thousands of readers. The information has to be accessed at the point of sale at an acceptable rate, probably in the tenths of a second.

The simpler alternative, pursued here, is to put public key cryptography on the tag [37].

Elliptic curve processors for 131 bit keys have been designed with approximately 15,000 gates and one second delay running a 175 kHz clock [29]. This low clock speed implies a low power requirement. In 2002, in response to the ongoing Certicom ECC Challenge, a 109 bit elliptic curve key was broken using 10,000 computer-years [38]. The 131 bit challenge is unsolved to date.

3. Concept of Ownership

The central feature of this approach to privacy is the concept of ownership. A product identification tag has an owner that is the same as the owner of the product to which the tag is affixed. Ownership of private property is a fundamental legal concept. An owner may be an individual, corporation, partnership, or escrow agent. A tag owner may also be a surrogate, such as a credit card company, appointed by a legal owner who does not care to assert his ownership rights. For completeness, define an owner of newly made tags, who freely gives up ownership, and define an owner of killed tags, who never gives up ownership and prevents tags from transmitting.

The owner is a variable in the tag’s algorithms. Specifically, the tag owner’s public key, a⋅G, is stored in rewritable memory in the tag. When the tag changes owners, the value stored on the tag is changed. The owner retains the private key, a. The legal owner may have many keys used for different purposes or at different times, although these must be maintained to read any old tags still in the field.

Protocols have been proposed which use transfer of ownership. Reference [39] discusses changing ownership between two parties using symmetric key cryptography but requires database records for every tag.

4. Minimal Operations

Given that public key cryptography and owner identification are on the tag, we would like to establish the minimum requirements for using the tag to identify itself while maintaining privacy.

One operation is required to communicate the tag ID to the owner and one operation is required to communicate a new owner to the tag. The message for the read operation must be encrypted so that only the owner can read the tag ID. The operation to change to a new owner must occur only with the agreement of the interested parties. Since the only party suffering a loss by the transaction is the old owner, the tag merely needs proof that the change is approved by the old owner. That is, the tag must authenticate the old owner as the source of the change-owner operation.

Since we require that the owner need not maintain a shared secret with the tag, there is no way for the owner to unilaterally send a message to the tag that the tag can prove came from the owner; however, the tag can authenticate the owner by requiring the owner to prove he can decrypt the message from the read operation. Therefore, the change-owner operation can be implemented as a reply to the read operation.

5. Minimum Message Words and Encryptions

For the read operation, the tag must send an encrypted message based on the public key, a⋅G, stored on the tag. This value is not by itself a secret since it is used by the owner for many tags and we require that the security of one tag is not dependent on the security of others. Rather a⋅G is the basis of a shared secret, specifically the first part of a Diffie-Hellman key exchange, i.e., ElGamal encryption [40]. The tag can generate a nonce, b, to produce b⋅G, which it can send to the owner, establishing a secret, a⋅b⋅G, shared with the owner. Then the tag can send ID ⊕ a⋅b⋅G, where ⊕ is the “exclusive or” operation. This requires two encryption operations and two message words.
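The simple two-word read described above can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's implementation: elliptic curve scalar multiplication k⋅G is replaced by modular exponentiation in a stand-in group (hypothetical modulus P and generator G), and the names mul, tag_simple_read and owner_simple_recover are introduced here only for the example.

```python
import secrets

P = 2**127 - 1  # hypothetical prime modulus of the stand-in group
G = 3           # hypothetical generator

def mul(k, point):
    # Stand-in for scalar multiplication k⋅point (assumption: modular exponentiation).
    return pow(point, k, P)

def tag_simple_read(tag_id, owner_pub):
    b = secrets.randbelow(P - 3) + 2       # fresh nonce for this read
    word1 = mul(b, G)                      # first encryption: b⋅G
    word2 = tag_id ^ mul(b, owner_pub)     # second encryption, then ID ⊕ a⋅b⋅G
    return word1, word2

def owner_simple_recover(a, word1, word2):
    # a⋅(b⋅G) equals b⋅(a⋅G), so the same mask is removed.
    return word2 ^ mul(a, word1)

a = secrets.randbelow(P - 3) + 2           # owner's private key
owner_pub = mul(a, G)                      # a⋅G, stored on the tag
tag_id = 0x1234ABCD                        # example ID, made up for the sketch

word1, word2 = tag_simple_read(tag_id, owner_pub)
recovered_id = owner_simple_recover(a, word1, word2)
```

The tag performs two group operations and sends two words, matching the count given in the text.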

Diffie-Hellman key exchange needs a separate encryption for the key. As discussed in chapter II, section A5, the RSA system is too costly even if it requires only half as many operations. Nor is it possible to communicate the ID in a single message word without first completing the key exchange, sending b⋅G in a separate word. Again, RSA would not require key exchange, but as the keys are much larger, they are too costly. Finally, the “exclusive or” operation, or other operation to securely combine ID and a⋅b⋅G, is assumed to be available cost-free.

There is nothing in this read operation protocol to prevent a fraudulent tag or reader from giving false identification to the owner. The goal here is to provide identification of the product. Proving the product is actually present must be performed by other means. The goal is also to maintain privacy so that a fraudulent reader cannot identify the product through the RFID tag alone. Of course, the reader might already have identified the product by other means. So the ID must not be revealed to the reader by the read operation; nevertheless, the ID may not be secret.

For the change-owner operation, the owner needs to send a message containing the new owner’s public key, anew⋅G. This can be done with the standard elliptic curve digital signature algorithm (ECDSA) [15]. It requires large multiply operations, similar to RSA, and so is not economical in the RFID tag environment. The old owner needs to send proof that he knows his private key, a, which he can demonstrate by using the shared secret a⋅b⋅G from the previous read operation. Unfortunately, he cannot do this based on the simple read operation described above since ID may already be known to the reader, who can easily recover a⋅b⋅G from ID ⊕ a⋅b⋅G.
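The weakness noted above is a one-line computation. The sketch below, again using modular exponentiation as a hypothetical stand-in for elliptic curve scalar multiplication, shows that a reader who already knows the ID unmasks the shared secret directly; all names and parameters are assumptions made for the illustration.

```python
import secrets

P = 2**127 - 1  # hypothetical prime modulus of the stand-in group
G = 3           # hypothetical generator

def mul(k, point):
    # Stand-in for scalar multiplication k⋅point (assumption: modular exponentiation).
    return pow(point, k, P)

a = secrets.randbelow(P - 3) + 2     # owner's private key
b = secrets.randbelow(P - 3) + 2     # tag's nonce
shared = mul(b, mul(a, G))           # a⋅b⋅G

tag_id = 0x1234ABCD                  # example ID, assumed known to the reader
word2 = tag_id ^ shared              # second message word of the simple read

# The reader needs neither a nor b: XOR with the known ID removes the mask.
leaked = word2 ^ tag_id
```

Because the secret is exposed whenever the ID is known, it cannot also serve to authenticate the owner.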

Alternatively, the read operation could be designed to produce the code words a⋅b⋅G and ID ⊕ b⋅G, so that the owner would compute a⁻¹⋅a⋅b⋅G = b⋅G to recover ID. Then the shared secret would have to be b⋅G which is no help if the reader knows ID.

There is no simpler protocol than the Diffie-Hellman key exchange to get the ID to the owner, yet that provides only one secret, which is used up masking the ID. The solution is to use the secret from the asymmetric key cryptosystem as the key for a symmetric key cryptosystem which is used twice: to encrypt the ID and then to authenticate the owner for the change-owner operation. Therefore at least three encryption operations are required.

Unfortunately, a second cryptosystem would be an additional expense on the tag, so it is more economical to make the best of the asymmetric cryptosystem already on the tag.

This system cannot generate a series of encrypted messages from one key, so three encryption operations are insufficient for the combination read and change-owner operations. A minimum of four encryption operations is required.

The change-owner message from the owner to the tag cannot be formed in a single word message. The new owner’s public key, anew⋅G, must be conveyed to the tag in some form along with a shared secret, s, to prove that it came from the owner. The message cannot be simply s ⊕ anew⋅G, since anew⋅G may be known to the reader, which could replace it with another owner’s public key. Therefore anew⋅G must be encrypted into the message. The only way for this to be done in a way that allows it to be recovered is to use it as the generator, using a message of the form s⁻¹⋅(anew⋅G), which the tag can decrypt: s⋅s⁻¹⋅anew⋅G = anew⋅G. However, the tag lacks the facilities to verify that anew⋅G is an element of the group. A malicious reader might be able to send a degenerate codeword, causing the tag to set its owner’s public key to a cryptographically weak value. Therefore it is impossible to send the change-owner message in a single word. At least two words are required.

B. Description of a Minimal Protocol

1. Operations

A protocol can be implemented to provide identification and privacy under the aforementioned conditions which requires the minimum number of encryption operations (four), the minimum number of message words for reading (two), and the minimum number of message words for reading and changing owner (four).

Initially, a tag contains the generator, G, its ID and the public key of its owner, a⋅G. In preparation for a read command, the tag generates a nonce, b, and performs three encryptions, b⋅G, b⋅(a⋅G), and (b⋅a⋅G)⋅G, and one “exclusive or” to form two message words, the nonced generator, b⋅G, and the encrypted ID, ID ⊕ (b⋅a⋅G)⋅G.

Upon receiving a read command from the reader, the tag transmits these two words. When the owner receives them, he uses his private key, a, and performs two encryptions, a⋅(b⋅G) and (a⋅b⋅G)⋅G, and one “exclusive or” to recover the tag ID. We assume the tag ID contains sufficient redundancy to validate the ID.
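The read operation just described can be sketched in Python. The sketch makes two assumptions that are not in the source: modular exponentiation in a stand-in group (hypothetical P and G) replaces elliptic curve scalar multiplication, and a group element is used directly as a scalar, whereas on an elliptic curve some mapping from a point to an integer would be needed. The function names are illustrative.

```python
import secrets

P = 2**127 - 1  # hypothetical prime modulus of the stand-in group
G = 3           # hypothetical generator

def mul(k, point):
    # Stand-in for scalar multiplication k⋅point (assumption: modular exponentiation).
    return pow(point, k, P)

def tag_read(tag_id, owner_pub):
    b = secrets.randbelow(P - 3) + 2     # nonce, regenerated for every read
    nonced_generator = mul(b, G)         # first encryption: b⋅G
    secret = mul(b, owner_pub)           # second encryption: b⋅(a⋅G)
    mask = mul(secret, G)                # third encryption: (b⋅a⋅G)⋅G
    encrypted_id = tag_id ^ mask         # ID ⊕ (b⋅a⋅G)⋅G
    # The two words are transmitted; the secret is retained on the tag.
    return nonced_generator, encrypted_id, secret

def owner_read(a, nonced_generator, encrypted_id):
    secret = mul(a, nonced_generator)    # a⋅(b⋅G)
    mask = mul(secret, G)                # (a⋅b⋅G)⋅G
    return encrypted_id ^ mask, secret

a = secrets.randbelow(P - 3) + 2         # owner's private key
owner_pub = mul(a, G)                    # a⋅G, stored on the tag
tag_id = 0x1234ABCD                      # example ID, made up for the sketch

word1, word2, tag_secret = tag_read(tag_id, owner_pub)
recovered_id, owner_secret = owner_read(a, word1, word2)
```

A reader that knows the ID can recover only the mask (b⋅a⋅G)⋅G from the second word; the value b⋅a⋅G held by tag and owner stays one further encryption away.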

Figure 20: Initial State

Figure 20 illustrates the initial state of the cryptographic information, with circles indicating information known to the owner and tag. Only the owner knows his private key, a. Only the tag knows its ID. At this point, the owner and tag have no shared secrets.

The generator, G, and the owner’s public key, a⋅G, are shown outside both circles, indicating everyone, including an intruder, knows the information. Although a⋅G is not secret, it must not be revealed that it is associated with the tag.

Figure 21 indicates the state after the tag responds to the read command if the tag ID had been secret, as would be the case if the reader were attempting to read a hidden tag. Here the two tag message words are made public. These result in two shared secrets, a⋅b⋅G and (a⋅b⋅G)⋅G, which allow the owner to decrypt the ID. It is not possible to recover the ID from the public information only.

Figure 21: After tag responds to read command; ID secret

Figure 22: After tag responds to read command; ID not secret

Figure 22 indicates the state if the tag ID were not secret, which must be assumed for the change-owner operation. Knowledge of the ID must not allow the tag’s owner’s public key to be changed. Although (a⋅b⋅G)⋅G is compromised, a⋅b⋅G remains a shared secret which can be used as the basis of the signature of the change-owner operation.

The change-owner operation is a continuation of the read operation. The owner who wishes to change the ownership of the tag to a new owner with public key, anew⋅G, performs one “exclusive or” and one encryption to form the signature word (anew⋅G ⊕ a⋅b⋅G)⋅a⋅G. The owner sends the new owner’s public key, anew⋅G, and the signature word.

Figure 23: After owner sends change-owner command; ID not secret

When the tag receives these words, it attempts to validate the signature by performing the “exclusive or” operation on the anew⋅G it just received and the b⋅a⋅G secret it retained from the read operation, to get anew⋅G ⊕ b⋅a⋅G. Then this is encrypted with owner’s public key, a⋅G, stored in the tag. If the result matches the received signature word, the tag’s owner’s public key is updated to be the public key of the new owner, anew⋅G.

Together with the three encryptions performed for the read operation, this requires a total of four encryption operations for read and change-owner.

A tag can verify a change-owner operation by answering another read operation, in which the ID could be read only by the new owner.

Figure 23 indicates the state of the cryptographic information after the change-owner operation. The tag can verify the signature, which could not be contrived from the public information, even if the tag ID had been compromised.
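The change-owner exchange can be sketched along the same lines. As before, this is a hedged illustration: modular exponentiation in a hypothetical stand-in group replaces elliptic curve scalar multiplication, group elements are used directly as scalars, and the names owner_sign and tag_verify are invented for the example.

```python
import secrets

P = 2**127 - 1  # hypothetical prime modulus of the stand-in group
G = 3           # hypothetical generator

def mul(k, point):
    # Stand-in for scalar multiplication k⋅point (assumption: modular exponentiation).
    return pow(point, k, P)

def random_scalar():
    return secrets.randbelow(P - 3) + 2

def owner_sign(owner_pub, shared, new_pub):
    # Signature word (anew⋅G ⊕ a⋅b⋅G)⋅a⋅G: one exclusive or, one encryption.
    return mul(new_pub ^ shared, owner_pub)

def tag_verify(owner_pub, retained, new_pub, signature):
    # Fourth tag encryption: recompute the signature from the retained b⋅a⋅G.
    return mul(new_pub ^ retained, owner_pub) == signature

# State left behind by the preceding read operation.
a, b = random_scalar(), random_scalar()
owner_pub = mul(a, G)                     # a⋅G stored on the tag
tag_secret = mul(b, owner_pub)            # b⋅a⋅G retained by the tag
owner_secret = mul(a, mul(b, G))          # a⋅b⋅G computed by the owner

new_pub = mul(random_scalar(), G)         # anew⋅G
signature = owner_sign(owner_pub, owner_secret, new_pub)

accepted = tag_verify(owner_pub, tag_secret, new_pub, signature)
tampered = tag_verify(owner_pub, tag_secret, new_pub, (signature + 1) % P)
swapped = tag_verify(owner_pub, tag_secret, new_pub ^ 1, signature)

if accepted:
    owner_pub = new_pub                   # tag now belongs to the new owner
```

Here tampered models a corrupted signature word and swapped models a reader substituting a different key without being able to re-sign it.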

2. Tag Memory Requirements

The tag would require four word registers to implement this protocol. Other registers are needed within the cryptographic processor. An additional register, presumably of one-time programmable memory, would hold the ID. The generator, G, is hard-wired into the cryptographic processor.

Of the four registers needed to implement the protocol, one holds the tag’s owner’s public key and so must be of non-volatile memory. Two registers are used as scratch registers that hold the two message words as they are constructed during the read operation and verified during the change-owner operation. The fourth register, which contains the shared secret, a⋅b⋅G, must be maintained between the read and change-owner operations, but this value is overwritten during the verification. Table 8 shows tag register contents during the read and change-owner operations.

3. Lower-Layer Support

When more than one tag is in the interrogation zone of a reader, they must submit to a process of singulation so that the reader can communicate with each tag individually, avoiding interference. In the binary tree-walking scheme, unique identifiers are used to distinguish tags. With simple RFID tags, the tag IDs are used but this reveals tag IDs, violating privacy.

Table 8: Tag Memory Contents during Read and Change-Owner Operations

Operation             R0      R1      R2                      R3
Initial               a⋅G
R2 ← Random           a⋅G             b
R1 ← R2 × G           a⋅G     b⋅G     b
R3 ← R2 × R0          a⋅G     b⋅G     b                       b⋅a⋅G
R2 ← R3 × G           a⋅G     b⋅G     (b⋅a⋅G)⋅G               b⋅a⋅G
R2 ← ID ⊕ R2          a⋅G     b⋅G     ID ⊕ (b⋅a⋅G)⋅G          b⋅a⋅G
Transmit R1 and R2    a⋅G                                     b⋅a⋅G
Receive R1 and R2     a⋅G     anew⋅G  (anew⋅G ⊕ a⋅b⋅G)⋅a⋅G    b⋅a⋅G
R3 ← R1 ⊕ R3          a⋅G     anew⋅G  (anew⋅G ⊕ a⋅b⋅G)⋅a⋅G    anew⋅G ⊕ b⋅a⋅G
R3 ← R3 × R0          a⋅G     anew⋅G  (anew⋅G ⊕ a⋅b⋅G)⋅a⋅G    (anew⋅G ⊕ b⋅a⋅G)⋅a⋅G
R0 ← R1, if R2 = R3   anew⋅G

A binary tree-walking scheme can be used with the proposed protocol by using the nonced generator, b⋅G, as the unique identifier for each tag. To prevent tracking, the number must be generated for each interrogation cycle, which could be a limited number of processor cycles while receiving power from the reader.

Error control codes provide for error correction to overcome noise in the communication channel and to provide for the validation of encrypted messages. A simple coding system, such as Hamming codes, at the message level corrects errors and improves tag communication. The same hardware can be used for Hamming codes within the encrypted messages for validation.
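As an illustration of the two uses of a Hamming code mentioned above, the following Python sketch implements the classic Hamming(7,4) code. The choice of the (7,4) parameters is an assumption made for brevity; the text does not specify which Hamming code or word length a tag would use. Decoding with correction models the message-level use, and the syndrome check without correction models validation of a decrypted word.

```python
def hamming74_encode(data_bits):
    # data_bits: [d1, d2, d3, d4]; codeword positions 1..7 = p1 p2 d1 p4 d2 d3 d4
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_syndrome(codeword):
    c = codeword
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    return s1 + 2 * s2 + 4 * s4  # 1-based position of a single flipped bit, 0 if none

def hamming74_correct(codeword):
    # Message-level use: repair a single-bit channel error and return the data bits.
    c = list(codeword)
    syndrome = hamming74_syndrome(c)
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

def hamming74_is_valid(codeword):
    # Validation use: accept a decrypted word only if its redundancy checks out.
    return hamming74_syndrome(codeword) == 0

codeword = hamming74_encode([1, 0, 1, 1])
noisy = list(codeword)
noisy[4] ^= 1                      # flip the bit at position 5
repaired = hamming74_correct(noisy)
```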

4. Infrastructure for Key Management

A consumer can take ownership of tags by presenting a public key stored on the consumer’s smartcard, which for this purpose can be a simple memory card, since no processing capability is required.

Secure transactions, such as relinquishing ownership, require secure smartcards with an onboard encryption capability and safe storage of the private keys. The card would require a personal identification number (PIN) from the owner in order to reproduce the private key before decryption could be performed on the card. The interface between human owner, smartcard and RFID tag would be a secure terminal provided by a trusted retailer.

Naturally, the smartcard would be integrated with the consumer’s credit or debit card. To avoid a second scan of tags at the completion of a sale, the consumer would need to present the smartcard in advance. This will not introduce any new inconvenience since consumers are often asked to present coupons and loyalty cards in advance.

For consumers with little concern for privacy, credit card companies can provide a system of keys. Small retailers can also transfer tag ownership to a surrogate when they purchase products from their suppliers. The customers of small retailers could obtain tag ownership from the surrogate if desired. Of course they are always welcome to kill the tags and pay cash.

5. Other Capabilities

A special code for the tag’s owner’s public key would indicate the initial state of a tag after manufacture, so that the tag’s first owner could be assigned. Another special code would indicate the terminal state of a tag from which no read or change-owner operation is possible, implementing a “kill” command.

Depending on cost and functionality desired, tags may include more information and commands.

In addition to the tag’s ID and owner’s public key, a tag can contain information to describe and classify the product, although the tag’s ID can be used to find the product in a database. A tag may also contain multiple owners or a hierarchy of owners. Using more complex protocols, logical combinations of owners could be required in order to read a tag or perform other operations.

A tag might accept additional commands. A special command might be used to configure one-time programmable memory, such as the tag’s ID, so tags do not need to be physically serialized. After proper authentication, a broadcast change-owner command could allow a group of tags to change hands without singulation. A sleep mode command could act as a temporary kill command and would be useful during singulation especially in places crowded with tags.

C. Evaluation of the Protocol

1. Benefits of the Protocol

The authority to read a tag remains with the owner in order to minimize tag cost and prevent tags from compromising the owner’s privacy.

The proposed system prevents many security problems including targeting, tracking, spoofing and replay attacks. Tags send only encrypted messages that require the owner to decrypt. The RFID reader need not be trusted. In fact, the reader is regarded as part of the insecure channel. A stolen reader has no capability to decrypt messages containing tag ID information. If an owner is represented by a secure smartcard in the reader, the card must be designed to erase any private key information after a time or if tampered with.

A tag contains no secrets except its ID and the identity of its owner. The encryption algorithm is not a secret. The owner’s public key, apart from the identity of the owner, is not a secret. If a tag is reverse-engineered, nothing is lost except a single ID and the ownership of a single tag. All other tags remain secure including those belonging to that owner.

An owner could verify a tag to another party in any particular instance without compromising the tag’s identity in general. The owner could reveal a⋅b⋅G to a third party, allowing the ID to be recovered, and proving ownership. This could not be done immediately, however, since it would allow a change-owner operation. There would have to be proof that the tag had been re-read (changing b) or had been reset (clearing b from memory).

The proposed protocol does not depend on a central database and therefore affords true privacy to the owner without destroying tags. Each owner can operate his own database as the owner sees fit. An owner can also use no database at all and retrieve tag IDs only when needed.

The ownership model is firmly rooted in the legal concept of private property. Ownership concepts have evolved to include corporations, partnerships, etc. Tag ownership can also change with time, application, or possession within a company or household.

2. Defenses Against Various Attacks

The tag ID cannot be decrypted from the read operation message words, b⋅G and ID ⊕ (b⋅a⋅G)⋅G, without the owner’s private key, a, the tag’s private key, b, or the shared secret, a⋅b⋅G. Since an intruder cannot obtain these without the cooperation of the tag or owner, the ID is secure. Even if b is compromised, this is only a nonce, a session key, and is no help decrypting another tag.

Targeting and tracking of the tag is prevented because the tag ID is secure and because the message words from the tag change with each session.

The tag must not be spoofed by a malicious reader into believing incorrectly that the tag is authorized by the owner to change the tag’s owner’s public key. This would occur if the change-owner operation message words, the new owner’s public key, anew⋅G, and the signature, (anew⋅G ⊕ a⋅b⋅G)⋅a⋅G, had a false value for the new public key, Y, with false signature, (Y ⊕ a⋅b⋅G)⋅a⋅G. But this can only be constructed if a⋅b⋅G is known, which requires knowledge of a, b, or a⋅b⋅G and therefore the cooperation of tag or owner.

Similarly, it is impossible for a malicious reader to simply perform a replay attack on the tag, by replaying a change-owner message to a tag since it is protected by the changing session key, b.

Finally, an owner must not be spoofed into revealing information about his private key, a. A read operation cannot by itself reveal this, since it requires no information from the owner. A reader should send to the owner the read operation message words b⋅G and ID ⊕ (b⋅a⋅G)⋅G. Instead a malicious reader could send an element of a degenerate group, Z, and ID ⊕ (c⋅Z)⋅G, where c is small, hoping that c⋅Z ≡ a⋅Z in the degenerate group. To prevent this, the owner should either:

1) authenticate the reader and tag by other means, or,

2) test if Z is an element of a very small group, or,

3) investigate repeatedly failed read operations, or,

4) change the private key, a, periodically.
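Defense 2 in the list above can be sketched as a bounded order check: the owner rejects a received element Z if some small multiple of Z is the group identity. The Python below is a hedged illustration in a stand-in multiplicative group (hypothetical modulus P, identity 1), not an elliptic curve implementation; the bound max_order and the function name are assumptions.

```python
P = 2**127 - 1  # hypothetical prime modulus of the stand-in group

def in_very_small_group(z, max_order=64):
    # Returns True if k⋅Z is the identity for some 1 <= k <= max_order,
    # i.e. Z generates a group of at most max_order elements.
    acc = 1                      # identity element of the stand-in group
    for _ in range(max_order):
        acc = (acc * z) % P      # next multiple of Z (a power, in this stand-in)
        if acc == 1:
            return True
    return False

degenerate = P - 1               # element of order 2 in the stand-in group
ordinary = 3                     # its first 64 powers stay below P, so none equals 1

reject_degenerate = in_very_small_group(degenerate)
reject_ordinary = in_very_small_group(ordinary)
```

The cost of the check grows with max_order, so on constrained hardware it would presumably be combined with the other defenses listed rather than used alone.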

3. Drawbacks of the Protocol

The tags themselves are not authenticated so a reader can fool an owner about the existence of a tag or a fake tag can fool a reader and owner. The cure for this would be a secret on every tag, but this would require extra information on the tag and in the product database. Nevertheless, proving a tag does not prove the product is intact or even present. So the goal of this protocol is to efficiently identify the attached product, but leave to other means determining the condition of the product.

Infrastructure for tag keys is required for all tag owners. For large retailers this is an incremental change, which will occur as RFID performance exceeds its cost. For small retailers and consumers, the infrastructure may not be worth the effort, but credit card companies can act as surrogates, according to the wishes of the legal owners. With the proposed protocol, all stakeholders can exercise their ownership rights as they see fit.

4. Moore’s Law

The cost of integrated circuit area has remained constant as the size of individual semiconductor devices has shrunk exponentially for many decades, a phenomenon predicted as Moore’s Law. This has resulted in RFID tags of increasing capability at an acceptably low cost.

The strength of a public key cryptographic system is a function of key length. Short keys provide security for a short time against adversaries with few resources. Long keys protect for the foreseeable future against malevolent governments. In order to carry out tracking, a spy might have to decrypt every tag he encounters, which may be prohibitive even if the keys are short.

The proposed protocol can be implemented with smaller key lengths at first. As semiconductor fabrication technology advances, key length can be increased and security improved. In a relatively short time, this will evolve to provide essentially unbreakable security in RFID tags and other ubiquitous systems.

5. Consumer Applications

When an owner-centric RFID privacy protocol is implemented, consumer confidence will be restored because consumers will own the information on the tags attached to the products they buy. This will result in applications that cause consumers to prefer tagged products. Applications include smart refrigerators and medicine cabinets for convenience and safety. More promising perhaps, a consumer could buy a reader, insert a secure card and use it to find things in his house or to help pack a suitcase.

VI. Power Management

RFID is a severely resource-constrained environment. Power is perhaps the scarcest resource of all. While the aforementioned protocol and processor designs minimize demands on silicon area and execution time to carry out security operations, a complete solution needs to consider the hardware environment so the available power is managed frugally.

In this chapter, methods are considered to convert as much power as possible from the antenna to run the cryptographic processor. Of course these techniques can apply more broadly, but this type of processor operates under unusual conditions where the amount of available power varies greatly over time and a large but fixed amount of processing is required, to perform a small number of encryption operations. The goal is to minimize the time required to perform the operations for the given power that is available from the environment.

A. Analog Front-End

A starting point for the design of the circuit blocks needed for a useful EPC tag was put forth in a detailed reference design [41]. This design included a rectifier to convert AC power from the antenna to DC used by the tag. The rectifier was a Dickson voltage multiplier, used to match the antenna impedance to the much higher digital circuit impedance. The reference design also included a voltage regulator and current limiter to provide well-conditioned power to the digital state machine. When such power was not available, reset circuitry prevented the digital section from operating. The digital logic was timed by a ring oscillator.

Impedance problems begin with matching the antenna to free space. Objects in the environment, metals and dielectrics, change the resonant frequencies of antenna systems. Earlier designs attempted to match the antenna directly to the digital circuit [42]. The efficiency of the antenna is quite low due to its small size relative to the wavelength of the transmitted power, the need to place the antenna flat on packaging material and obstacles in the environment. Since silicon area is very limited, RF sections must be designed to avoid coupling between components, which can be implemented by introducing additional design rules. Unwanted coupling between antenna and circuit components can be minimized by using balanced circuits and placing the silicon in a direction where the antenna is already weak [43].

Antenna/chip combinations are initially modeled, but may require iterative testing and refinement due to their complexity and that of their operating environment. Antenna tuning must be broad enough to accommodate process variation in the chip and manufacturing variation in assembling chip, antenna and packaging [44]. Such variations can lead to dead-zones within the communication range of tag and reader [45].

A complete design for the analog front-end of an RFID transponder included a Dickson voltage multiplier with Schottky diodes, to match antenna impedance to the digital circuitry with minimal loss [46]. Schottky diodes were fabricated in a CMOS process and measured, so SPICE models could be developed [47]. Schottky diodes have very low “turn on” voltages. Still lower threshold voltages can be obtained by using two MOSFETs, one configured as a diode in parallel with another operating in the triode region, or by biasing the gate of a MOSFET to its threshold voltage. Loss of high frequencies from the AC supply through a multi-stage Dickson multiplier can be reduced by using a low frequency VCO controlled multiplier for most stages. All these techniques can benefit from an LC tank at the antenna [48]. For an ideal Dickson voltage multiplier, the ratio of output to input impedances is twice the number of diodes, when all diodes and capacitors are identical [49]. If MOSFETs are used, gate voltages can control source-drain current so that the impedance ratio is a function of available power.

Methods have been proposed for adaptive impedance matching. Tag and reader antenna matching depends on their geometry and relative positions and orientations. This can be corrected with an adaptive network in the reader [50]. Matching of tag antenna and digital circuitry can be improved using a reconfigurable array of series and parallel capacitors, under the control of a digital control unit [51].

The MOSFETs in the digital section of an RFID tag can operate below their threshold voltage, consuming less power. Methods have been developed for determining the number of stages of the voltage multiplier and the size of the capacitors and diodes for maximum power transfer from the antenna to digital logic with a fixed impedance [52].

B. Subthreshold Logic

Digital logic can be run at very low voltages, below the threshold voltage of the transistors. Subthreshold logic benefits from extremely low power consumption, but operates more slowly, so performance suffers. Overall, as voltage is lowered, the PDP (power-delay product) improves. Circuits consume less energy because the reduction in power outweighs the increase in delay [53].

Subthreshold logic operates with better static noise margin than superthreshold (strong inversion) logic. Two modified versions of subthreshold logic have been proposed to improve robustness, despite process and temperature variation. These methods provide body biasing to control these effects. One method used a monitoring circuit to sense leakage current which controls biasing. Another method used the gate voltage for biasing, providing better performance [54].

Because drain-induced barrier lowering, body punch-through and the threshold voltage itself are not important in subthreshold operation, halo and retrograde doping could be eliminated, simplifying the CMOS process and lowering junction capacitance. Pseudo-NMOS logic, in which PMOS transistors are configured as pull-ups, requires less power than ordinary CMOS when activity is greater than 5%, in the subthreshold regime [55].

Subthreshold logic is suitable for full-sized processors. An FFT processor of ½ million transistors was fabricated and tested. At very low clock frequencies, static leakage becomes significant. A methodology was proposed to balance static leakage and dynamic switching power in order to find the clock frequency that requires minimal energy to accomplish the same computational task [56].

C. Self-Timed Circuits

Classical self-timed systems use control signals to indicate when data signals become valid rather than relying on a global clock. This approach allows the system to operate more quickly, in a pipelined architecture, or across a greater distance [57]. Differential logic can be designed to include its own completion signal. For a combinatorial network, the completion signal can be produced from a replica of the network’s critical path. This method of producing a timing signal tracks with process, temperature and power supply variations [36].

Self-timed circuits have been proposed to provide maximum throughput in energy harvesting systems, in which supply voltage varies. A ring oscillator including a replica of the critical path was proposed as a clocking circuit. Because the clock was made from the same process as the circuit it timed, the clock automatically tracked with power supply, temperature and process variations [58].

D. Power vs. Impedance

1. Motivation

In order to marry self-timed subthreshold logic with a power-supplying antenna, an impedance matching network is required; however, the impedance of the logic is not fixed because of the varying supply voltage and clock rate. Supply voltage affects average switching current of CMOS logic in a non-trivial way, and therefore affects impedance. Also, because the circuit is self-timed, the ratio of time spent in dynamic to static operation is independent of the supply voltage. This needed to be studied in the subthreshold regime.

2. Test Setup

Two circuits were studied in simulation. One circuit was a chain of four inverters. This circuit was driven with a step input. The propagation delay was averaged from the response to step up and step down. The circuit was tested with a range of supply voltages. Average current was measured during the propagation time. The circuit was chosen as a simple representation of a digital circuit. The circuit has 25% activity and the chain is long enough to mitigate some of the artificiality of the step input. The propagation delay of the circuit gives a direct measure of its maximum clocking frequency.

The other circuit studied was a small R6 elliptic curve processor, described in chapter III. The circuit was an 11-bit processor (m = 11) with bit-serial multiplier (w = 1). The 11-bit processor was operated for an entire cryptographic operation, 955 cycles. This circuit was tested with a series of supply voltages. Average current was measured during the operation. Clocking was provided by a square wave voltage source adjusted to the maximum frequency for which the circuit would produce the correct digital result. Clock frequency values were of the form 1, 2, or 5 × 10^n cycles per second, for integer n.

Hereinafter, frequency of this circuit refers to operations per second, which is the clock frequency divided by 955 cycles per operation. Although orders of magnitude simpler than a full-scale cryptographic processor, the studied circuit bridges the gap between the chain of four inverters and the full-sized processors in terms of electrical behavior.
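The clocking convention above can be illustrated with a short sketch. It assumes only what the text states (clock values of the form 1, 2, or 5 × 10^n and 955 cycles per operation); the function names are hypothetical and are not taken from the simulation setup.

```python
CYCLES_PER_OPERATION = 955  # cycles for one cryptographic operation (from the text)


def candidate_clock_frequencies(min_exp, max_exp):
    """Clock frequencies of the form 1, 2, or 5 x 10^n Hz for n in [min_exp, max_exp]."""
    return [m * 10.0 ** n
            for n in range(min_exp, max_exp + 1)
            for m in (1, 2, 5)]


def operations_per_second(clock_hz):
    """Convert a clock frequency into cryptographic operations per second."""
    return clock_hz / CYCLES_PER_OPERATION


if __name__ == "__main__":
    for clock in candidate_clock_frequencies(3, 4):
        print(clock, "Hz clock ->", operations_per_second(clock), "operations/s")
```

Because only three clock values are available per decade, several supply voltages may share the same maximum working frequency.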

Synopsys HSPICE was used to perform the simulation. Transistor models and gate subcircuits were provided by a TSMC 0.25 µm CMOS process and cell libraries. The circuits were tested with a range of supply voltages. Average current was measured during the test time interval. The chain of four inverters was tested with supply voltages from 0.100 to 3.162 V, in logarithmic increments of 100 values per voltage decade, for a total of 150 values. The elliptic curve processor was tested from 0.200 to 2.500 V, in linear increments of 0.100 V, for a total of 24 values. From the voltage, current and time interval, other measurements were derived, for power, impedance, energy and frequency.
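The derived measurements can be sketched as follows. The formulas are the usual definitions (P = V·I, Z = V/I, E = P·t, f = 1/t) and are assumed here rather than quoted from the simulation scripts; the function names and the sample operating point are hypothetical.

```python
def derive_measurements(voltage, avg_current, interval):
    """Derive power, impedance, energy and frequency from one simulated point.

    voltage     -- supply voltage in volts
    avg_current -- average supply current over the interval, in amperes
    interval    -- measured time interval (delay or operation time), in seconds
    """
    power = voltage * avg_current  # watts
    return {
        "power": power,
        "impedance": voltage / avg_current,  # ohms
        "energy": power * interval,          # joules per interval
        "frequency": 1.0 / interval,         # intervals per second
    }


def linear_sweep(start, stop, step):
    """Supply voltages from start to stop inclusive, in equal steps."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 3) for i in range(count)]


if __name__ == "__main__":
    # Hypothetical operating point, not a value from the dissertation.
    print(derive_measurements(voltage=1.0, avg_current=1e-6, interval=1e-8))
    # The processor sweep described in the text: 0.200 V to 2.500 V in 0.100 V steps.
    print(len(linear_sweep(0.2, 2.5, 0.1)), "supply voltages")
```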

3. Results

Figure 24 and Figure 25 show the relationship of frequency vs. power for the chain of four inverters and the elliptic curve processor. The curves indicate that the throughput of digital circuits depends on the power provided, and this relationship extends from the normal, high-power, high-performance superthreshold regime down to very low-power and low-performance subthreshold voltages. For this 0.25 µm technology, the NMOS threshold is 0.45 V, and for PMOS, about 0.60 V. These curves are similar to the power-throughput graph of a five-tap FIR filter in reference [55].

Figure 24: Frequency vs. Power for 0.25 µm Chain of Four Inverters

Figure 25: Frequency vs. Power for 0.25 µm R6 Elliptic Curve Processor

There is an approximate mathematical power law relationship between physical power and throughput from about half of the normal supply to the threshold voltage, so that $\log f = a \cdot \log P + \log b$, or $f = b P^{a}$, where $f$ is the maximum operating frequency and $P$ is the power. The values for a and b depend on details of the circuit. The figures and the reference generally agree, indicating that circuits from a short chain of inverters to full-scale processors are similar in their relationship of voltage, current and maximum operating frequency. The curve for the chain of inverters is especially useful because its frequency is measured directly.
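A power-law relationship of this kind appears as a straight line on a log-log plot, so the constants can be estimated by ordinary least squares on the logarithms. The sketch below assumes the form f = b·P^a, with f the frequency and P the power (symbols chosen here for illustration). The data are synthetic and are not measurements from this work.

```python
import math


def fit_power_law(powers, frequencies):
    """Least-squares fit of log10(f) = a * log10(P) + log10(b); returns (a, b)."""
    xs = [math.log10(p) for p in powers]
    ys = [math.log10(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                       # slope on the log-log plot
    b = 10.0 ** (mean_y - a * mean_x)   # intercept converted back to linear scale
    return a, b


if __name__ == "__main__":
    # Synthetic points generated from f = 3 * P^0.8 (illustrative constants only).
    powers = [1e-9, 1e-8, 1e-7, 1e-6]
    frequencies = [3.0 * p ** 0.8 for p in powers]
    a, b = fit_power_law(powers, frequencies)
    print("a =", a, "b =", b)
```

The fit is meaningful only over the region where the curve is nearly straight, which the text places between about half of the normal supply and the threshold voltage.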

For the chain of four inverters, Figure 26 shows the relationship of current vs. voltage on the main pair of axes. The figure shows power vs. impedance on the diagonal set of axes. Also, frequency is labeled to the right and energy to the left of the curve. The curve shows that as voltage is dropped 1½ decades, current drops 7½ decades. The drop in current is most precipitous near the technology’s threshold voltage. The curve is nearly straight in this area of the log-log plot, indicating a mathematical power relationship between current, voltage, power and impedance.

The curve indicates that as power decreases across nine decades, impedance increases by over six decades. This is an important result because the circuit can use power efficiently only if it is provided in the required ratio of voltage and current. Unlike a resistive element or ideal antenna that has a fixed impedance, this CMOS logic operating at its maximum, self-timed frequency and from super- to subthreshold voltages, has a wide dynamic range of impedances.
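The decade figures quoted for Figure 26 are mutually consistent, as a small calculation shows. Since P = V·I and Z = V/I, changes expressed in decades (powers of ten) simply add or subtract. The sketch below uses the rounded figures from the text; the function name is hypothetical.

```python
def decade_changes(voltage_decades, current_decades):
    """Change in power and impedance, in decades, for given changes in V and I.

    On a logarithmic scale P = V * I becomes a sum and Z = V / I a difference.
    """
    power_decades = voltage_decades + current_decades
    impedance_decades = voltage_decades - current_decades
    return power_decades, impedance_decades


if __name__ == "__main__":
    # Voltage drops 1.5 decades while current drops 7.5 decades (rounded, from the text).
    power, impedance = decade_changes(-1.5, -7.5)
    print("power change:", power, "decades")
    print("impedance change:", impedance, "decades")
```

With these rounded inputs the power falls nine decades and the impedance rises about six, matching the values read from the curve.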

As in Figure 24, the middle of the curve in Figure 26 shows that as power increases, operating frequency changes almost as quickly for this chain of inverters. As power increases by five decades, frequency increases four decades. Consequently, total energy for each cycle increases one decade in the same part of the curve. This indicates that energy can be used to the same effect in a digital circuit almost independent of the rate that it is provided. If the impedance of antenna and digital logic is matched, the power can be used productively to do computational work.

Figure 26: 0.25 µm Chain of Four Inverters (axes: Current vs. Voltage)
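The energy observation follows from the definition of energy per cycle as power divided by frequency. The sketch below assumes that definition; the two operating points are hypothetical values chosen only to be five decades apart in power and four decades apart in frequency.

```python
import math


def energy_per_cycle(power, frequency):
    """Energy consumed in one cycle, in joules, given power (W) and frequency (Hz)."""
    return power / frequency


if __name__ == "__main__":
    # Hypothetical operating points, not read from Figure 26.
    low = energy_per_cycle(power=1e-9, frequency=1e4)
    high = energy_per_cycle(power=1e-4, frequency=1e8)
    print("energy ratio in decades:", math.log10(high / low))
```

A five-decade change in power therefore costs only about one decade in energy per cycle, which is why the rate at which energy arrives matters far less than the total amount.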

For the 11-bit elliptic curve processor, Figure 27 shows a similar relationship of current vs. voltage on the main pair of axes. The figure shows power vs. impedance on the diagonal set of axes. And, frequency is labeled to the right and energy to the left of the curve. Here, frequency represents the speed of the entire cryptographic operation, which is a much larger time scale than that of the chain of four inverters. The processor is much larger, so it draws much more current. Therefore, the power is larger, and the impedance is smaller. Since the current and time are larger, the energy is very large, compared to the chain of inverters; however, the curves have essentially the same shape. Many of the points in Figure 27 are clustered together because of the coarse maximum frequency adjustment; consequently, many measurements were at the same frequency.

Figure 27: 0.25 µm R6 Elliptic Curve Processor, m = 11, w = 1 (axes: Current vs. Voltage)

While the processor plot has far fewer measurements and an imprecise frequency setup compared to the chain of inverters plot, the similarity of the results confirms that large and small CMOS circuits have similar characteristics in their relationship of voltage and current when operated at their maximum, self-timed frequencies and at voltages from super- to subthreshold. For these circuits, if impedance is matched as a function of power, then the power can be used most productively to do computational work.

E. Recommendations

To realize optimum performance of a cryptographic processor in an RFID tag, the unique requirements of the system must be recognized. A tag must scavenge energy from the environment, perform cryptographic operations and respond. Important performance criteria are the range of the tag from the reader, the response time, and the level of security. If the latter is not to be compromised, and range cannot be controlled by the tag, the only apparent performance characteristic is response time. A simple tag system with fixed clock frequency and fixed impedance matching will simply fail to respond if power requirements are not met. The most important power management recommendation, therefore, is flexibility to tailor response time to available power.

In a processor for this application, the response time is proportional to the clock period. A ring oscillator provides clocking that tracks with available power, as well as process and temperature variation. While a replica of the critical path would give a more exact match, an adequate representation of the needed time delay can be provided by a series of inverters, as in a ring oscillator. The circuit must be tested in extremes of temperature, power and input vectors to determine the inverter chain’s minimum length, and a safety margin.

The logic for these processors can be many levels of complex gates, especially XORs, so an inordinate number of inverters would be required. Since these are only used to get their characteristic propagation delay under the environmental conditions, their sheer number can be a poor use of area. Each inverter consumes an incremental amount of area while providing an incremental amount of delay; however, a frequency divider, in the form of a flip-flop configured to toggle, consumes an incremental amount of area while providing a doubling of delay. So a shorter ring oscillator followed by a small number of frequency dividers is a more economical use of area; however, the circuits with dividers consume somewhat more power than those without. For longer delays, the exponential improvement in area outweighs the incremental cost in power.

Figure 28 shows the results of a study of ring oscillators with frequency dividers, analyzed using Synopsys HSPICE with TSMC 0.25 µm technology and cell libraries. The figure shows the results in power vs. cycle time for ring oscillators of 9, 17, 33, 65 and 129 inverters and 0, 1, 2 and 3 flip-flops (frequency dividers) operated at 0.5, 1, and 2 V. At any particular voltage, an oscillator can be built for any point marked on the graph by using the number of inverters and flip-flops indicated and the oscillator will have the time delay and power consumption given on the axes. The area of these D flip-flops with Q and Q̄ outputs is about five times the area of an inverter.

Figure 28: 0.25 µm Ring Oscillators with Frequency Dividers (axes: Power vs. Cycle Time; curves at 0.5 V, 1 V and 2 V)

For example, an oscillator of 129 inverters and 2 flip-flops has about the same delay as an oscillator of 65 inverters and 3 flip-flops. However, with the flip-flop area equal to 5 inverter areas, the first oscillator requires 139 inverter areas while the second requires only 80 inverter areas, a reduction of about 42%. Figure 28 indicates that the second oscillator requires only a negligible increase in power (about 1%).
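The area and delay bookkeeping in this example can be sketched with a first-order model. The model assumes that the oscillation period is proportional to the number of inverters and doubles with each toggle flip-flop, and it takes the flip-flop area as five inverter areas as stated above. It ignores the flip-flops' own delay and all power effects, and the function names are hypothetical.

```python
FLIP_FLOP_AREA = 5  # area of one D flip-flop, in inverter areas (from the text)


def relative_period(inverters, flip_flops):
    """Clock period in units of one inverter's contribution to the ring period.

    Each toggle flip-flop divides the frequency by two, doubling the period.
    """
    return inverters * 2 ** flip_flops


def area(inverters, flip_flops):
    """Total oscillator area in inverter areas."""
    return inverters + FLIP_FLOP_AREA * flip_flops


if __name__ == "__main__":
    for inverters, flip_flops in [(129, 2), (65, 3)]:
        print(inverters, "inverters,", flip_flops, "flip-flops:",
              "relative period", relative_period(inverters, flip_flops),
              "area", area(inverters, flip_flops))
```

Under this model the two configurations differ in period by less than 1% (516 vs. 520 units) while their areas are 139 and 80 inverter areas.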

With an oscillator built of the same technology as the processor, the logic can operate efficiently over a wide range of voltages, even far below the threshold voltage of the transistors, albeit with poorer performance. At the minimum, the circuit will nearly hibernate. Below that, the circuit will fail to operate, the charges on the gates will be lost and the circuit will require resetting when power becomes available again. For RFID security, this is not so much a problem as the expected behavior: the tag loses responsiveness completely when the range and power limits are reached. The most energy efficient subthreshold logic family should be used. Pseudo-NMOS is recommended [55].

An adaptive voltage multiplier is required. Figure 29 shows a mirrored pair of Dickson voltage multipliers. The Schottky diodes can be implemented as self-biasing CMOS transistors [59]. These transistors can also be controlled from later stages of the voltage multiplier. An early stage transistor can be linked through a Zener diode to a later stage node in the mirror image multiplier. The Zener diode can be implemented as a series of CMOS transistors configured as ordinary diodes. When the voltage at the later stage becomes higher than required for the impedance match, indicating the current should increase, the earlier stage transistor is cut-off, and the following capacitor shorted by another transistor, effectively eliminating the earlier stage from the voltage multiplier. Alternatively, later stage nodes could control early stage capacitors only, opening the capacitors of two adjacent stages. A third alternative would disconnect the first two stages from the chain by opening the diode in the third stage and having another diode bypass all three stages. Since the first two stages would have no load, they would have little power consumption.

With appropriately configured CMOS transistors in the voltage multiplier circuit, the transistors can be controlled for impedance matching and biased to avoid the voltage drop across them. A small tank circuit formed with the antenna raises the impedance for the voltage multiplier [48], but its resonance must not be sharper than process variation will allow [44]. Finally, the antenna must be designed with consideration for packaging and likely obstacles in the environment. The antenna should be oriented relative to the RF circuitry to avoid unwanted coupling [43]. Of course, the antenna needs to be designed with the highest impedance that is practical.

Figure 29: Mirrored Pair of Dickson Voltage Multipliers. Later stage nodes control earlier stages by opening diodes and shorting capacitors.

VII. Conclusion and Future Work

Three topics have been discussed: a secure protocol, improved cryptographic processors and power management. It has been shown that all of these solutions are important and contribute to more efficient RFID and similar systems. Improvements have been made in these three areas and their efficacy has been demonstrated. The focus has been on the cryptographic processor designs, which show a 12%-20% area improvement and a 31%-45% time improvement compared to previous results in the literature.

There are a number of areas for future study, of course. The elliptic curve cryptography processor might be analyzed with a newer technology library. Techniques from other elliptic curve papers might be incorporated, such as shuffling registers to save datapaths. The power management techniques can be integrated and developed into a complete methodology for operating digital logic from an unreliable power source.

VIII. Bibliography

[1] S. Garfinkel & B. Rosenberg (Eds.), RFID Applications, Security, and Privacy, Addison-Wesley, 2005.

[2] K. Finkenzeller, RFID Handbook: Fundamentals and Applications in Contactless Smart Cards and Identification, 2nd Ed., John Wiley & Sons Ltd., 2003.

[3] T. Karygiannis, B. Eydt, G. Barber, L. Bunn, and T. Phillips, “Guidelines for Securing Radio Frequency Identification (RFID) Systems,” NIST Special Publication 800-98, 2007.

[4] A. Juels, R.L. Rivest, and M. Szydlo, “The Blocker Tag: Selective Blocking of RFID Tags for Consumer Privacy,” ACM Conf. on Computer and Communications Security, 2003.

[5] EPCglobal, “EPC Radio-Frequency Identity Protocols Class-1 Generation-2 UHF RFID Protocol for Communications at 860 MHz - 960 MHz Version 1.0.9,” January 2005.

[6] A. Juels, “Minimalist Cryptography for Low-Cost RFID Tags,” Conf. on Security in Commun. Networks, 2004.

[7] S.A. Weis, S.E. Sarma, R.L. Rivest, and D.W. Engels, “Security and Privacy Aspects of Low-Cost Radio Frequency Identification Systems,” Security in Pervasive Computing, LNCS, Vol. 2802, Springer-Verlag, 2004.

[8] S. Fouladgar and H. Afifi, “A Simple Delegation Scheme for RFID Systems (SiDeS),” IEEE Int. Conf. on RFID, 2007.

[9] A.S. Tanenbaum, “Network Security,” Computer Networks, 4th Ed., Prentice-Hall, 2002.

[10] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, “The RSA Public-Key Cryptosystem,” Introduction to Algorithms, 2nd Ed., McGraw-Hill, 2001.

[11] U.S. National Security Agency, “The Case for Elliptic Curve Cryptography,” http://www.nsa.gov/business/programs/elliptic_curve.shtml

[12] R.J. McEliece, Finite Fields for Computer Scientists and Engineers, Kluwer Academic Publishers, 1987.

[13] R. Avanzi, H. Cohen, C. Doche, G. Frey, T. Lange, K. Nguyen and F. Vercauteren, Handbook of Elliptic and Hyperelliptic Curve Cryptography, Chapman & Hall/CRC, 2005.

[14] L.C. Washington, Elliptic Curves: Number Theory and Cryptography, Chapman & Hall/CRC, 2003.

[15] I. Blake, G. Seroussi and N. Smart, Elliptic Curves in Cryptography, Cambridge University Press, 1999.

[16] D. Naccache, N.P. Smart and J. Stern, “Projective Coordinates Leak,” Advances in Cryptology, LNCS, Vol. 3027, Springer-Verlag, 2004.

[17] K. Okeya and K. Sakurai, “Power Analysis Breaks Elliptic Curve Cryptosystems Even Secure Against the Timing Attack,” Progress in Cryptology, LNCS, Vol. 1977, Springer-Verlag, 2000.

[18] A.P. Fournaris and O. Koufopavlou, “Hardware Design Issues in Elliptic Curve Cryptography,” Wireless Security and Cryptography, Specifications and Implementations, N. Sklavos and X. Zhang (Eds.), CRC Press, 2007.

[19] Z. Yan and D.V. Sarwate, “New Systolic Architectures for Inversion and Division in GF(2^m),” IEEE Trans. Comput., Vol. 52, 2003.

[20] T. Itoh and S. Tsujii, “A Fast Algorithm for Computing Multiplicative Inverses in GF(2^m) Using Normal Bases,” Inf. Comput., 1988.

[21] J. Fan and I. Verbauwhede, “A Digit-Serial Architecture for Inversion and Multiplication in GF(2^m),” IEEE Workshop on Signal Process. Syst., 2008.

[22] J. Lopez and R. Dahab, “Fast Multiplication on Elliptic Curves over GF(2^m) without Precomputation,” Workshop on Cryptographic Hardware and Embedded Systems (CHES), LNCS, Vol. 1717, Springer-Verlag, 1999.

[23] P.L. Montgomery, “Speeding the Pollard and Elliptic Curve Methods of Factorization,” Math. of Computation, Vol. 48, 1987.

[24] J.H. Kim and D.H. Lee, “A Compact Finite Field Processor over GF(2^m) for Elliptic Curve Cryptography,” IEEE Int. Symp. on Circuits and Systems, 2002.

[25] C. Huang, J. Lai, J. Ren and Q. Zhang, “Scalable Elliptic Curve Encryption Processor for Portable Application,” Int. Conf. on ASIC, 2003.

[26] L. Batina, G.M. de Dormale, E. Oswald and J. Wolkerstorfer, “State of the Art in Hardware Implementations of Cryptographic Algorithms,” Information Society Technologies Publication IST-2002-507932, 2006.

[27] P. Tuyls and L. Batina, “RFID-Tags for Anti-Counterfeiting,” Cryptographers' Track of RSA Conference (CT-RSA), LNCS, Vol. 3860, Springer-Verlag, 2006.

[28] L. Batina, N. Mentens, K. Sakiyama, B. Preneel and I. Verbauwhede, “Low-Cost Elliptic Curve Cryptography for Wireless Sensor Networks,” European Workshop on Security and Privacy in Ad hoc and Sensor Networks, LNCS, Vol. 4357, Springer-Verlag, 2006.

[29] L. Batina, J. Guajardo, T. Kerins, N. Mentens, P. Tuyls and I. Verbauwhede, “An Elliptic Curve Processor Suitable for RFID-Tags,” Int. Assoc. for Cryptologic Research ePrint Archive, 2006.

[30] L. Batina, J. Guajardo, T. Kerins, N. Mentens, P. Tuyls and I. Verbauwhede, “Public Key Cryptography for RFID-Tags,” IEEE Int. Workshop on Pervasive Computing and Commun. Security, 2007.

[31] Y.K. Lee and I. Verbauwhede, “A Compact Architecture for Montgomery Elliptic Curve Scalar Multiplication Processor,” Int. Workshop in Inform. Security Applicat. (WISA), LNCS, Vol. 4867, Springer-Verlag, 2007.

[32] Y.K. Lee, K. Sakiyama, L. Batina, and I. Verbauwhede, “Elliptic-Curve-Based Security Processor for RFID,” IEEE Trans. Comput., 2008.

[33] U.S. National Institute for Standards and Technology, “Recommended Elliptic Curves for Federal Government Use,” 1999.

[34] Certicom Research, “SEC 2: Recommended Elliptic Curve Domain Parameters,” 2000.

[35] Nessie, B. Preneel (coor.), ECDSA, GF(2^163)r2, Test Vector Set 1, Project, COSIC, K.U. Leuven, 2003, https://www.cosic.esat.kuleuven.be/nessie/testvectors

[36] J.M. Rabaey, A. Chandrakasan and B. Nikolić, Digital Integrated Circuits, 2nd Ed., Prentice-Hall, 2003.

[37] S.E. Sarma, S.A. Weis, and D.W. Engels, “RFID Systems and Security and Privacy Implications,” 2002 Workshop on Cryptographic Hardware and Embedded Systems (CHES 2002), LNCS, Vol. 2523, Springer-Verlag, 2003.

[38] Press Release, Certicom Corp., 2002.

[39] J. Saito, K. Imamoto, and K. Sakurai, “Reassignment Scheme of an RFID Tag’s Key for Owner Transfer,” EUC Workshops, LNCS, Vol. 3823, Springer-Verlag, 2005.

[40] A.J. Menezes, P.C. van Oorschot, and S.A. Vanstone, Handbook of Applied Cryptography, CRC Press, 1997.

[41] D. Wu, L.I. Williams and M. Mi, “RFID Radio Circuit Design in CMOS,” DesignCon 2007 Symposium Digest, January 2007.

[42] P.R. Foster and R.A. Burberry, “Antenna Problems in RFID Systems,” IEE Colloq. on RFID Tech., 1999.

[43] S. Brebels, J. Ryckaert, B. Come, S. Donnay, W. De Raedt, E. Beyne and R.P. Mertens, “SOP Integration and Codesign of Antennas,” IEEE Trans. Adv. Packag., Vol. 27, 2004.

[44] K.V.S. Rao, P.V. Nikitin and S.F. Lam, “Antenna Design for UHF RFID Tags: A Review and a Practical Application,” IEEE Trans. Antennas Propag., Vol. 53, 2005.

[45] V. Pillai, “Impedance Matching in RFID Tags: To Which Impedance to Match?,” IEEE Antennas and Propagation Soc. Int. Symp., 2006.

[46] U. Karthaus and M. Fischer, “Fully Integrated Passive UHF RFID Transponder IC with 16.7-µW Minimum RF Input Power,” IEEE J. Solid-State Circuits, Vol. 38, 2003.

[47] W. Jeon, J. Melngailis and R.W. Newcomb, “CMOS Schottky Diode Microwave Power Detector Fabrication, SPICE Modeling and Applications,” IEEE Int. Workshop on Electronic Design, Test and Applications, 2006.

[48] F. Yuan and N. Soltani, “Design Techniques for Power Harvesting of Passive Wireless Microsensors,” Midwest Symp. on Circuits and Systems, 2008.

[49] R. Barnett, S. Lazar and J. Liu, “Design of Multistage Rectifiers with Low-Cost Impedance Matching for Passive RFID Tags,” IEEE Radio Frequency Integrated Circuits Symp., 2006.

[50] B. Jiang, J.R. Smith, M. Philipose, S. Roy, K. Sundara-Rajan and A.V. Mamishev, “Energy Scavenging for Inductively Coupled Passive RFID Systems,” IEEE Trans. Instrum. Meas., Vol. 56, 2007.

[51] D. Maurath, M. Ortmanns and Y. Manoli, “High Efficiency, Low-Voltage and Self-Adjusting Charge Pump with Enhanced Impedance Matching,” Midwest Symp. on Circuits and Systems, 2008.

[52] G. De Vita and G. Iannaccone, “Design Criteria for the RF Section of UHF and Microwave Passive RFID Transponders,” IEEE Trans. Microw. Theory Tech., Vol. 53, 2005.

[53] H. Soeleman and K. Roy, “Ultra-Low Power Digital Subthreshold Logic Circuits,” Int. Symp. on Low Power Electronics and Design (ISLPED), 1999.

[54] H. Soeleman, K. Roy and B.C. Paul, “Robust Subthreshold Logic for Ultra-Low Power Operation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Vol. 9, 2001.

[55] A. Raychowdhury, B.C. Paul, S. Bhunia and K. Roy, “Computing with Subthreshold Leakage: Device/Circuit/Architecture Co-Design for Ultralow-Power Subthreshold Operation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Vol. 13, 2005.

[56] A. Wang and A. Chandrakasan, “A 180-mV Subthreshold FFT Processor Using a Minimum Energy Design Methodology,” IEEE J. Solid-State Circuits, Vol. 40, 2005.

[57] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, 1979.

[58] J. Siebert, J. Collier and R. Amirtharajah, “Self-Timed Circuits for Energy Harvesting AC Power Supplies,” Int. Symp. on Low Power Electronics and Design (ISLPED), 2005.

[59] H. Nakamoto, D. Yamazaki, D. Yamamoto, H. Kurata, S. Yamada, K. Mukaida, T. Ninomiya, T. Ohkawa, S. Masui and K. Gotoh, “A Passive UHF RF Identification CMOS Tag IC Using Ferroelectric RAM in 0.35 µm Technology,” IEEE J. Solid-State Circuits, Vol. 42, 2007.