
Tamper proof certification system based on secure non-volatile FPGAs

Diogo Alcoforado da Gama de Oliveira Parrinha

Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering

Supervisor(s): Prof. Ricardo Jorge Fernandes Chaves Prof. Leonel Augusto Pires Seabra de Sousa

Examination Committee Chairperson: Prof. Gonçalo Nuno Gomes Tavares Supervisor: Prof. Ricardo Jorge Fernandes Chaves Member of the Committee: Prof. Fernando Manuel Duarte Gonçalves

November 2017

Acknowledgments

I would like to start by thanking the constant support from my family and everything they did for me, which allowed me to close this chapter of my life. Without them, this would have been much harder. A special thanks to my mother Marina and my father Ricardo.

Throughout the years I spent in IST, I have enjoyed working with a lot of people, from colleagues to professors. I have made some great friends and I am happy to realize that we have spent amazing moments together. However, I would like to offer a particular thanks to Diogo Prata for being a good friend throughout the degree and for overcoming many common adversities together.

Finally, I would like to extend my sincere thanks to my supervisor Prof. Ricardo Chaves, for his continuous support and guidance throughout this project. His technical expertise and constant motivation have helped me to conclude this thesis. May this be the start of a new beginning.

Thank you!

Resumo

Embedded systems supported by FPGAs play an increasingly important role in critical and security systems. A particular example of these systems is the Hardware Security Module (HSM), which provides management and use of private keys in a secure and reliable way. However, commercially available systems are too expensive and limited in the functionality they offer. On the other hand, the solutions based on volatile FPGAs that exist to date are not suitable for creating a Hardware Security Module, since they lack the necessary security characteristics, such as anti-tampering features, secure internal key management and the ability to prevent cloning. In this work, an HSM that is open-source, low-cost, reconfigurable and highly flexible is proposed. The system is supported by a System-on-Chip that contains a non-volatile FPGA, with several security services and characteristics. The presented solution operates as a versatile certification system, capable of providing secure key management and digital signatures and of issuing trustworthy digital certificates, supporting a PKCS#11 interface with additional functions. To better illustrate the flexibility of the proposed solution, a use case named Log-Chain is also proposed and implemented. The Log-Chain consists of a chain of logs that can be extended and verified, but cannot be modified or repudiated. Experimental results suggest that the system can compute up to 2 signature/certification operations per second, with a low-cost, adaptable and secure approach.

Keywords: Non-volatile FPGA, Hardware Security Module, Certification System, Microsemi Smartfusion2 SoC

Abstract

Embedded systems supported by FPGAs are playing an increasingly important role in safety-critical areas. A particular example of such safety-critical systems is the Hardware Security Module (HSM), which provides private key management and usage in a secure and reliable way. However, commercially available systems are too expensive and limited in the provided functionality. On the other hand, existing volatile FPGA solutions do not adequately provide the needed security characteristics, such as anti-tampering features, secure internal key management and anti-cloning capabilities. Herein, an open-source, low-cost and highly flexible reconfigurable HSM is proposed, supported by a System-on-Chip with a non-volatile FPGA that contains several security characteristics and services. The presented solution operates as a versatile certification system that provides secure key management and digital signature services and is able to issue trustworthy certificates, using an extended PKCS#11 interface. To further illustrate the flexibility of the proposed solution, a Log-Chain certification use-case is also presented, which consists of a chain-of-logs that can be incremented and verified, but cannot be repudiated or modified. Experimental results suggest that the system is able to compute up to 2 sign/certification operations per second with a low-cost, adaptable, and secure approach.

Keywords: Non-volatile FPGA, Hardware Security Module, Certification System, Microsemi Smartfusion2 SoC

Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
List of Acronyms

1 Introduction
1.1 Objectives and Requirements
1.2 Main contributions
1.3 Thesis Outline

2 Background
2.1 Cryptographic Services and Mechanisms
2.1.1 Symmetric Key Cryptography
2.1.2 Asymmetric Key Cryptography
2.1.3 Hashing Function
2.1.4 Secret Key Establishment
2.1.5 Digital Signatures
2.1.6 Key Certification and PKI
2.1.7 Physically Unclonable Function
2.2 Secure Computing Platforms
2.3 Implementation Technologies
2.4 Smartfusion2 SoC
2.4.1 Device Description
2.4.2 Security Features
2.5 Summary

3 State of the Art
3.1 FPGA as Secure Platform
3.2 Key Generation and Storage
3.3 Full Security Systems
3.4 Discussion

4 Proposed Solution
4.1 Users and Key Management
4.2 Communication and Session Establishment
4.3 Log-Chain
4.4 Conclusions

5 Implementation
5.1 Device Configuration and Setup
5.2 Cryptographic Operations
5.3 Key Generation and Management
5.4 Memory
5.5 Log-Chain
5.6 Communication Channel
5.7 Middleware
5.8 Simple Time Service
5.9 Conclusions

6 Results
6.1 Cryptographic Operations
6.1.1 SHA-256
6.1.2 AES-256
6.1.3 EC Scalar Multiplication
6.2 System Operations
6.3 Communication Channel
6.4 Comparison with the State of the Art
6.5 Conclusions

7 Conclusions
7.1 Future Work

Bibliography

A Communication Protocol

List of Tables

2.1 X.509v3 certificate fields
2.2 Single-threaded performance (signatures/second) for different HSMs [12]
2.3 HSM Key Storage capacity [12]
2.4 Protection mechanisms for FPGA configuration data
2.5 Key Features for Secure Hardware [4]

3.1 Comparison of Security Features of the different system proposals

4.1 Key generation and storage
4.2 Secure session establishment
4.3 Available Device commands

5.1 Non-volatile memory usage requirements for the implemented system
5.2 Additional API functions
5.3 Supported official PKCS#11 API functions

6.1 Operation times for the three SHA-256 implementations
6.2 Operation times for the two AES-256 implementations
6.3 Operation times for the three versions conceived
6.4 Operation times for the three versions conceived

List of Figures

2.1 An example of an elliptic curve. Example equation: $y^2 = x^3 + ax + b$
2.2 SmartFusion2 SoC FPGA Block Diagram [39]
2.3 Detailed security and settings model diagram [41]. The green segments in the middle are stored in non-volatile memory. The COMBLK performs the communication between the MSS (software) and security services (System Controller)

3.1 FPGA as a Trusted Machine [7]
3.2 Setup: TA generates a public/private RSA key pair and transfers the private key privatekf into the bootstrapping binary (in blue: data is encrypted so it doesn’t need to go through a secure channel) [11]
3.3 Block diagram of the Amuet architecture [10]. The embedded application is actually the user application, protected by the proposed wrapping system

4.1 The proposed overall system architecture. The light blue rectangle represents the secure System-on-Chip. The contents of the external Flash are encrypted
4.2 Log file example. Each horizontal line represents a line break. Hashes and signatures are Base 64 encoded
4.3 Structure of a Log-Chain. Each hash is computed using the previous hash of the log. The first hash is defined by the device administrator (UID=0) and set as the root hash value
4.4 The structure of a log chain with grouped log entries
4.5 The structure of log folders and their files. The current year and month folders are highlighted in dark grey and the current day log file is highlighted in yellow

5.1 The first system architecture, which uses the mbedTLS algorithms to perform cryptographic operations. The unused modules are greyed out
5.2 The second system architecture, which uses the SoC embedded cores for additional security and possible performance. The unused modules are greyed out
5.3 The third system architecture, which uses the FPGA to accelerate the SHA-256 algorithm. The unused modules are greyed out
5.4 SRAM-PUF core example for Key Code 2 and 3
5.5 Development internal mode memory map
5.6 Production mode internal memory map
5.7 The scheme for time synchronization via STS

6.1 SHA-256 throughput for the three tested implementations
6.2 AES-256 throughput for the two tested implementations
6.3 System operation times for the three tested implementations
6.4 Throughputs for open-channel and secure-channel communications

A.1 Flowchart describing the process of receiving a message through the created communication protocol

Acronyms

AES Advanced Encryption Standard.

ASIC Application-Specific Integrated Circuit.

DPA Differential Power Analysis.

EC Elliptic Curve.

ECC Elliptic Curve Cryptography.

ECDH Elliptic Curve Diffie-Hellman.

ECDSA Elliptic Curve Digital Signature Algorithm.

ECIES Elliptic Curve Integrated Encryption Scheme.

eNVM embedded Non-Volatile Memory.

eSRAM embedded Static Random Access Memory.

FPGA Field-Programmable Gate Array.

HMAC Hash-based Message Authentication Code.

HSM Hardware Security Module.

IV Initialization Vector.

MSS Microcontroller Subsystem.

NTP Network Time Protocol.

PC Personal Computer.

PKI Public Key Infrastructure.

PUF Physically Unclonable Function.

RSA Rivest-Shamir-Adleman.

SoC System-on-Chip.

STS Simple Time Service.

TPM Trusted Platform Module.

TRNG True Random Number Generator.

UID User Identification.

Chapter 1

Introduction

Modern reconfigurable systems, such as Field-Programmable Gate Arrays (FPGAs), provide increasing programming possibilities, high flexibility and growing hardware capabilities. For these reasons, there has been an expanding variety of applications for these devices, in areas such as Data Centers, Medical, Aerospace, Defense, Security, Transportation and Automotive [1]. Along with this, the increasing need for data protection and system reliability, especially for safety-critical systems, has urged FPGA manufacturers to develop more secure and reliable devices, rather than solely focusing on power consumption and system performance.

FPGAs are low-cost general-purpose devices that provide high flexibility and performance. They are composed of a configurable logic block array, connected through programmable interconnections. Their configuration is usually described using a hardware description language, such as VHDL or Verilog, and they can be configured for the desired application after manufacturing.

FPGAs can be categorized into four different categories depending on their configuration storage: SRAM-based, SRAM-based with internal flash, Flash-based and Antifuse-based. SRAM-based FPGAs have their logic cell configuration data stored in static memory cells. Since SRAM is volatile, these FPGAs must be re-programmed on each start: they read the configuration from an external source (e.g. Flash memory) when the device is booted. When an internal flash memory is present (SRAM-based with internal flash), the bitstream is stored internally, which prevents unauthorized bitstream copying. Most modern volatile FPGAs come with a secure boot process, in which the device attempts to load the binary from the configuration memory when powered on. The binary is decrypted and authenticated using the onboard dedicated decryption logic and the AES key programmed by the hardware manufacturer. This key can only be read by the internal decryption logic and cannot be accessed from the outside. If the configuration bitstream is not authenticated, the device enters an error state and will not function until provided with a valid bitstream.

On the other hand, Flash-based FPGAs use an internal flash memory for the configuration storage, rather than static memory cells. Non-volatile FPGAs provide higher security, faster logic availability after power-on and, of course, non-volatile storage, which is of key importance for safety-critical applications [2, 3]. Furthermore, non-volatile FPGAs tend to consume less power and are more tolerant to radiation effects. Because they are non-volatile, the bitstream is not at risk of being probed during start-up [3]. Finally, Antifuse-based FPGAs are configured by “fuse-burning”, which means they can only be programmed once.

Existing commercial security-oriented devices provide cryptographic operations and secure key management either with adequate performance but at a high cost and low flexibility (e.g. Hardware Security Module), or at a lower price and higher flexibility but with low performance, because of small computation power and memory storage (e.g. SmartCard). Unlike these, new FPGA technologies are starting to provide great flexibility and security at a low price [4, 5, 3], allowing for the creation of cheaper systems that resemble a Hardware Security Module (HSM) with much greater flexibility and the ability to be easily reconfigured. Moreover, FPGAs are being integrated in System-on-Chip (SoC) designs, merging embedded CPUs, memories and security modules with the FPGA fabric, allowing for the creation of more robust and security-oriented systems.

Over the last years, several authors have proposed FPGA-based architectures as secure computing platforms [6, 7], as well as PUF-based key generation and re-keying mechanisms on FPGAs [8, 9]. Moreover, full systems have been proposed, which use FPGAs to perform security operations for safety-critical applications, such as a Secure Application Wrapper which performs secure system authentication and data transfer with an external memory [10], and an FPGA-based architecture that allows users to offload sensitive computations to the cloud [11]. However, the State of the Art solutions cannot be used as Hardware Security Modules, as they solve very specific problems and do not meet certain requirements that are mandatory for an HSM [12], such as anti-cloning mechanisms (e.g. PUF-based key generation), secure channels, internal non-volatile memories for master key storage, anti-tamper mechanisms, internal clock freshness (e.g. through a Timestamping Authority) and a common developer interface, such as PKCS#11. Furthermore, these works use SRAM-based FPGAs, which are subject to several attacks, such as probing when the configuration bitstream is loaded at boot time [3, 5]. Non-volatile FPGA technologies, in contrast, provide lower power consumption and faster boot times, and do not need to be re-configured on each power-on.

1.1 Objectives and Requirements

The main objective of this work is to create a re-configurable and flexible HSM, supported by a low-cost non-volatile security-oriented FPGA, as opposed to the State of the Art. The low price implies certain limited hardware specifications, such as reduced internal memory storage and short endurance, meaning that it is not possible to install an operating system that is stored and executed only inside the device. Moreover, the expected low CPU processing speed suggests a relatively low performance, while the lack of an internal battery implies that an internal Real Time Clock cannot be relied on, unless it is securely initialized. Therefore, the work herein considered aims to overcome these limitations, providing a fully working solution that tries to compete with existing commercial ones in terms of security features and standards, while maintaining the same flexibility and low cost as the academic re-configurable proposals.

The requirements for the considered solution, in order to achieve the aforementioned objectives, are as follows:

1. Secure key management and storage should be guaranteed through the use of a device dependent key generation mechanism, enhancing the anti-cloning characteristics. Additionally, internal key storage should be supported.

2. The selected device should guarantee a secure boot process.

3. As the system may be used in insecure environments, the system should be capable of establishing a secure channel with the PC, guaranteeing confidentiality, integrity, freshness and authentication of the exchanged data.

4. Internal clock freshness and synchronization should be maintained with the outside world through the use of a reliable external time provider.

5. Externally stored data should be properly protected.

6. The selected device should have several tamper detection mechanisms and anti-tamper protection features.

7. The developed system should be flexible, open-source and be available at a low cost.

1.2 Main contributions

In order to achieve the objectives of the work, a solution was proposed and implemented, which satisfies the requirements highlighted above. The proposed solution, considering the Smartfusion2 SoC as the supporting non-volatile technology, consists of creating an open-source Hardware Security Module supported by a reconfigurable technology, as opposed to existing commercial solutions. The system itself consists of a low-cost tamper-proof and unclonable secure certification system capable of generating and managing keys securely, while still providing high flexibility and adaptability. The system has the ability to issue digital certificates and generate key pairs for its users, as well as generate digital signatures upon request by an authenticated user. As the system provides high flexibility and because of the existing need for a secure logging system, a complementary novel feature is proposed, which consists of creating a non-repudiable and certified chain-of-logs with the secure computation system (e.g. for Linux Syslog messages, Transaction logs or Medical Receipts [13]).

Regarding the existing State of the Art [7, 9, 8, 10, 11], the proposed solution contributes with improved key management supported by a PUF-based mechanism, secure and authenticated external data storage, a secure communication channel that assures confidentiality, integrity and authentication of exchanged data, as well as the ability for developers to integrate applications with the system through an extended PKCS#11 interface. Moreover, the proposed solution considers the use of a non-volatile device as opposed to a volatile one, and more specifically, a security-oriented device that contains several characteristics that make it possible to create a flexible and reconfigurable Hardware Security Module at a low cost.

Additionally, this work also contributes with a thorough analysis of the cryptographic operations performed by the device’s embedded cores, such as AES-256, SHA-256 and Elliptic Curve scalar multiplication. The results show that the performance of the system is primarily influenced by the Elliptic Curve scalar multiplication operation, with 70% to 95% of the operation time being spent on scalar multiplications. Additionally, a SHA-256 core was deployed in the FPGA fabric to understand the performance impact over the existing embedded cores and the software implementation. The conducted tests suggest that the FPGA-accelerated version of SHA-256 is faster than the embedded device SHA-256 core and software-based implementations, while consuming only 5% of the FPGA fabric. On the other hand, the software-based version of AES-256 is faster than the embedded AES core provided by the device.

Overall, the system is able to perform up to 2 signature/certification operations per second, on a non-volatile device, at a much lower cost than existing commercial HSMs, while providing the needed security and reliability features.

An article discussing the different implementations of the proposed solution and their results has been submitted to and accepted at the International Conference on Reconfigurable Computing and FPGAs (ReConFig 2017).

• Diogo Parrinha and Ricardo Chaves, “Flexible and Low-Cost HSM based on Non-Volatile FPGAs”, International Conference on Reconfigurable Computing and FPGAs (ReConFig’17), September 2017.

1.3 Thesis Outline

The thesis is organized as follows. In Chapter 2, a background study on cryptographic services is provided, along with a review of existing secure computing platforms and their implementation technologies. Afterwards, in Chapter 3, the relevant State of the Art is presented, with a major focus on secure FPGA computing platforms, including a comparative analysis of the various solutions. Chapters 4 and 5 detail the proposed solution and the resulting implementation, respectively. The result analysis and performance comparison are presented in Chapter 6. Finally, Chapter 7 concludes this document with some final remarks and future work directions.

Chapter 2

Background

In this section, a brief introduction to a variety of concepts used throughout the dissertation is provided, which includes the presentation of cryptographic services and mechanisms, such as symmetric and asymmetric cryptography, Physically Unclonable Function (PUF), Public Key Infrastructure (PKI), key exchange protocols (such as ECDH) and data signing mechanisms (such as ECDSA). Furthermore, several Secure Computing Platforms (Smart Card, TPM, HSM) and Implementation Technologies (ASIC, FPGA) are introduced and compared, along with a thorough description of the Microsemi Smartfusion2 SoC.

2.1 Cryptographic Services and Mechanisms

Cryptographic services and mechanisms include symmetric cryptography (AES), asymmetric cryptography (RSA, ECC-based), hashing functions (SHA-256), key exchange protocols (ECDH) and data signing algorithms (ECDSA), as well as Public Key Infrastructures. Symmetric cryptography involves the use of a shared common key between multiple parties, and is faster than asymmetric cryptography. The latter involves a pair of keys (public and private) and is usually used to provide authentication or encryption between two parties, giving the ability to ensure non-repudiation¹ if used correctly. These keys can be used in key exchange protocols, which let two parties establish a symmetric key for secure communication, or by data signing algorithms, which allow a user to sign a piece of data using a private key and another user to validate it using the signer’s public key. Since public keys must be published and verified in a trusted manner, the PKI, a framework for managing the digital certificates that bind public keys to users, is also addressed. The PKI provides mechanisms for distributing public keys, verifying them, and revoking them when their private counterpart is compromised.

¹ The ability to ensure that a party to a contract cannot deny the authorship of a document.

2.1.1 Symmetric Key Cryptography

Symmetric Key Cryptography is composed of symmetric-key algorithms that cipher and decipher data using the same cryptographic key. This key represents a shared secret between two or more parties, which is used to maintain a secure and private link. Although faster than Asymmetric Key Cryptography algorithms, these algorithms require that the parties share the same secret.

Currently, the main symmetric encryption standard is the Advanced Encryption Standard (AES) algorithm, an iterative symmetric block cipher with a 128-bit block size, which supports key sizes of 128, 192 or 256 bits with 10, 12 or 14 rounds respectively [14]. A round consists of multiple processing steps including substitution, transposition, mixing of the plain-text and transformation into the final output, i.e. the cipher-text [14]. AES can be used with different block cipher modes of operation, which include ECB, CBC, OFB, OCB and CTR [15]. Since OFB, OCB and CTR allow data to be encrypted bit by bit, they can be used as stream ciphers, in which a cryptographic key and algorithm are applied to each individual bit of the plaintext. Usually, the cipher modes require an Initialization Vector (IV) that is mixed with the data to achieve semantic security². The IV is a fixed-size value that should be non-repeating and randomly generated.

2.1.2 Asymmetric Key Cryptography

Asymmetric Key Cryptography, also known as Public Key Cryptography, includes any system that uses a pair of keys: a public key and a private key. The private key is only known to or usable by the owner, while the public key can be known to everyone. This provides two possible features: authentication, which is when someone uses the public key to verify the sender of a message, and confidentiality, which is when someone uses the public key to ensure that only the owner of the private key is able to decipher the message.

Rivest-Shamir-Adleman (RSA) is one of the oldest but most used public key cryptography algorithms. It is based on the assumption that factoring the product of two large prime numbers is a computationally hard task, meaning that an attacker with realistic computational resources and time is not expected to be able to obtain the private key. To create a public and a private key, it is first necessary to generate two different random prime numbers $p$ and $q$. Then, $n$ is computed such that $n = p \cdot q$, followed by Euler’s totient $\varphi(n) = (p - 1)(q - 1)$. With them, it is possible to choose a random integer $e$ that satisfies $1 < e < \varphi(n)$ and is coprime with $\varphi(n)$, and finally to compute the integer $d$ which verifies $e \cdot d \equiv 1 \pmod{\varphi(n)}$. The public key is $(e, n)$ and the private key is $(d, n)$. With these, it is possible to compute the cipher-text $c$ of the message $m$ using the public key (2.1), and to decrypt it with the private key (2.2):

$c \equiv m^{e} \pmod{n}$ (2.1)

$m \equiv c^{d} \pmod{n}$ (2.2)
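The key generation steps and Equations (2.1) and (2.2) can be illustrated with a minimal Python sketch. The concrete numbers ($p = 61$, $q = 53$, $e = 17$, $m = 65$) are illustrative assumptions chosen only to keep the arithmetic readable; they are far too small to offer any security and are not taken from the implemented system.

```python
from math import gcd

# Toy parameters (assumed for illustration); real RSA primes are
# hundreds of digits long.
p, q = 61, 53
n = p * q                      # modulus n = p * q
phi = (p - 1) * (q - 1)        # Euler's totient of n

e = 17                         # public exponent, 1 < e < phi(n)
assert 1 < e < phi and gcd(e, phi) == 1   # e must be coprime with phi(n)

d = pow(e, -1, phi)            # private exponent: e * d = 1 (mod phi(n))

m = 65                         # message encoded as an integer smaller than n
c = pow(m, e, n)               # encryption with the public key, Eq. (2.1)
recovered = pow(c, d, n)       # decryption with the private key, Eq. (2.2)

print("public key:", (e, n), "private key:", (d, n))
print("ciphertext:", c, "recovered message:", recovered)
```

The three-argument form of `pow` performs modular exponentiation without building the full power, and a negative exponent (Python 3.8 or later) yields the modular inverse.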

An alternative to RSA is Elliptic Curve Cryptography (ECC), based on the mathematics of elliptic curves (see Figure 2.1) over finite fields, which can be applied to encryption, digital signatures and pseudo-random generators. An elliptic curve is represented as a looping line intersecting two axes, and ECC hinges on a particular type of equation created from a mathematical group derived from points where the line intersects those axes [16][17]. By multiplying a point on the curve by a number, another point on the curve is obtained. However, it is computationally infeasible to find which number was used, even if the original point and the result are known. An example equation of a possible curve is $y^2 = x^3 + ax + b$.

² If an attacker possesses the ciphertext of a message and the message’s length, it cannot determine any partial information on the message with higher probability than if it only possesses the message length.

Figure 2.1: An example of an elliptic curve. Example equation: $y^2 = x^3 + ax + b$

When using elliptic curves for public key encryption, a public and private key pair must be generated: $d$ and $Q$, in which $d$ (a random integer chosen from $\{1, \ldots, n - 1\}$, where $n$ is the order of the subgroup) represents the private key and $Q$ its public counterpart, which is generated from $Q = dG$ (where $G$ is the base point of the subgroup). The definition and selection of the curve parameters, such as the base point $G$, the elliptic curve coefficients $(a, b)$ and the order of the subgroup $n$, go beyond the scope of this work, but each curve has its own parameters and, for many curves, they are defined in FIPS 186-4 [18]. Most importantly, ECC uses two main operations: point addition and point multiplication, in which the former involves the public key on most algorithms while the latter involves mostly the private key. These can then be used to establish a common shared secret between two parties and to sign data using the private key.

Moreover, ECC is considered to be faster than RSA and its keys are shorter than their RSA equivalents for the same security level [16][17]. This also makes ECC more suitable for embedded systems and systems with lower performance and memory capacity.
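Point addition, point multiplication and the key pair relation $Q = dG$ can be sketched in a few lines of Python. The curve used here, $y^2 = x^3 + 2x + 2$ over the integers modulo 17 with base point $G = (5, 1)$ of order 19, is a small textbook curve assumed purely for illustration; it is not one of the FIPS 186-4 curves, and the function names are hypothetical.

```python
import secrets

# Toy curve (assumed for illustration): y^2 = x^3 + 2x + 2 over F_17.
P_MOD, A, B = 17, 2, 2
G = (5, 1)        # base point
N = 19            # order of the subgroup generated by G

def on_curve(point):
    """Check the curve equation; None stands for the point at infinity."""
    if point is None:
        return True
    x, y = point
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0

def point_add(p1, p2):
    """Add two points using the chord-and-tangent rule."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                      # p2 is the inverse of p1
    if p1 == p2:                         # tangent slope (point doubling)
        s = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                # chord slope
        s = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (s * s - x1 - x2) % P_MOD
    y3 = (s * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mult(k, point):
    """Compute k * point with the double-and-add method."""
    result, addend = None, point
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result

private_d = secrets.randbelow(N - 1) + 1   # random d in {1, ..., n - 1}
public_q = scalar_mult(private_d, G)       # Q = dG
print("d =", private_d, "Q =", public_q)
```

Recovering `private_d` from `public_q` is trivial on a 19-point group; the hardness argument in the text only applies to subgroups of cryptographic size.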

2.1.3 Hashing Function

A hashing function maps an arbitrarily long message to a fixed-size hash value. Ideally, it has four main properties: hashes are quickly generated; it is infeasible to generate a message from its hash value except by trying all possibilities; a small change in the message should change the hash extensively; and it is infeasible to find two different messages with the same hash value (collision resistance). Digital signatures and message authentication codes (MAC) make use of hashing functions.

With SHA-256, the message is first padded so that its length is a multiple of 512 bits and then parsed into 512-bit message blocks, $M^{(1)}, M^{(2)}, \ldots, M^{(N)}$. Then each block is processed one at a time, beginning with a fixed initial hash value $H^{(0)}$:

$H^{(i)} = H^{(i-1)} + C_{M^{(i)}}(H^{(i-1)})$, (2.3)

where $C$ is the SHA-256 compression function and $+$ represents the word-wise mod $2^{32}$ addition. $H^{(N)}$ is the hash of $M$.

Hashes can typically be used to provide integrity and authentication, such as in a Hash-based Message Authentication Code (HMAC), which involves a hash function and a secret key.
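As a small illustration of the padding rule and of HMAC, the following Python sketch uses the standard `hashlib` and `hmac` modules. The helper `padded_length_bits` is a hypothetical name introduced here; it assumes the usual SHA-256 padding (a single 1 bit, zero bits, then the 64-bit message length) and only counts how many 512-bit blocks the compression function would process. The key and message values are placeholders.

```python
import hashlib
import hmac

def padded_length_bits(message: bytes) -> int:
    """Bit length after SHA-256 padding: message, one '1' bit,
    zero bits up to 448 mod 512, then the 64-bit message length."""
    bit_len = 8 * len(message)
    zero_bits = (448 - (bit_len + 1)) % 512
    return bit_len + 1 + zero_bits + 64

message = b"abc"
blocks = padded_length_bits(message) // 512      # number of 512-bit blocks N
digest = hashlib.sha256(message).hexdigest()     # final hash value H^(N)

# HMAC combines the hash function with a secret key (placeholder key here).
tag = hmac.new(b"example secret key", message, hashlib.sha256).hexdigest()

print("blocks:", blocks)
print("SHA-256:", digest)
print("HMAC-SHA-256:", tag)
```

A 55-byte message still fits in one block, while a 56-byte message needs two, because the 1 bit and the 64-bit length field no longer fit in the first block.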

2.1.4 Secret Key Establishment

A secure communication can be established with several techniques, with the goal of creating a communication channel that provides confidentiality and authentication of information, as well as authentication of parties. The following presents the Diffie-Hellman (DH) Key Exchange, followed by an ECC-based DH protocol (ECDH) and finally ECIES, which uses ECDH as part of its hybrid scheme.

Diffie-Hellman

One of the most common secret key establishment protocols is the Diffie-Hellman Key Exchange (D-H), which establishes a cryptographic secret over a public channel. The idea behind this method is that the public information exchanged between two parties is used to create a secret key without compromising it. The algorithm relies on the difficulty of solving the discrete logarithm problem. First, Alice and Bob agree to use a certain modulus $p$ (prime) and a base $g$ (a primitive root modulo $p$), both of which can be public. Afterwards, Alice and Bob each generate a random value, $a$ and $b$, calculate $A$ and $B$ respectively, and finally exchange $A$ and $B$ between them:

$A = g^{a} \bmod p$ (2.4)

$B = g^{b} \bmod p$ (2.5)

Finally, they compute:

$s = B^{a} \bmod p = A^{b} \bmod p$ (2.6)

where s is the resulting shared secret. The protocol can provide perfect forward secrecy³. However, unless the key pairs are long-term (non-ephemeral), the exchange does not authenticate either party and is therefore subject to Man-in-the-Middle attacks, where an attacker intercepts the exchanged public keys, generates its own key pair, and exchanges its public key with both parties, pretending to be the other party. The attacker therefore ends up with two shared secrets, one to communicate with each party.

³ If any long-term key is compromised, it does not compromise all past session keys. In the D-H key exchange, if the private values are obtained randomly for each session (thus being ephemeral), compromising one of them will not compromise any of the other previously exchanged secrets.

In some cases, it is possible to define a previously shared password that is used to cipher the exchanged public keys, guaranteeing that only those parties will be able to use them to generate the shared secret, therefore providing authentication. This may not be feasible in all cases, as the password needs to be known, and securely stored, before establishing the communication.

However, in some situations, only one of the parties, such as a server, needs to be authenticated before starting a secure connection. In that case, the server can have a non-ephemeral key pair, whose public key has been distributed correctly via a PKI (more details in Section 2.1.6). The connecting party, or client, can then authenticate the server because it already knows and trusts its public key. As long as the client generates a new key pair for each session, perfect forward secrecy is still ensured. Because the server can be authenticated, Man-in-the-Middle attacks can no longer be performed, as the client rejects any shared secrets that do not match the one calculated using the known server public key.
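The exchange of Equations 2.4 to 2.6 can be sketched with modular exponentiation. The parameters below (p = 23, g = 5) are toy values chosen for readability and are far too small for real use; the function names are illustrative assumptions.

```python
import secrets

# Toy public parameters, assumed for illustration only.
P = 23  # prime modulus p
G = 5   # base g, a primitive root modulo 23

def dh_keypair(p: int, g: int):
    """Return (private, public), with the private value in [2, p - 2]."""
    private = 2 + secrets.randbelow(p - 3)
    return private, pow(g, private, p)

def dh_shared(their_public: int, my_private: int, p: int) -> int:
    """Compute s = (their_public ** my_private) mod p."""
    return pow(their_public, my_private, p)

a, A = dh_keypair(P, G)      # Alice: A = g^a mod p
b, B = dh_keypair(P, G)      # Bob:   B = g^b mod p
s_alice = dh_shared(B, a, P)  # s = B^a mod p
s_bob = dh_shared(A, b, P)    # s = A^b mod p
```

Both parties obtain the same value because B^a = g^(ab) = A^b (mod p). Nothing in this sketch authenticates A or B, which is exactly the gap a Man-in-the-Middle attacker exploits.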

ECDH

The Elliptic Curve Diffie-Hellman (ECDH) protocol, a variant of the DH protocol based on elliptic curves, allows two parties to establish a shared secret over an insecure channel by simply exchanging their Elliptic Curve public keys. First, both parties have to agree on the same domain parameters (i.e. the same curve) and then generate a key pair (d, Q) accordingly, in which d and Q represent the private and public keys respectively. Once both parties have generated their private and public keys, they can exchange their public elliptic curve keys. Party A calculates $S = d_A Q_B$ and party B calculates $S = d_B Q_A$, in which S is the shared secret, which cannot be obtained by an attacker who only knows the public keys. The shared secret is hashed using SHA-256 to obtain a 256-bit key.
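A minimal sketch of ECDH follows, under stated assumptions: it uses a textbook toy curve, y^2 = x^3 + 2x + 2 over F_17 with base point G = (5, 1) of order 19, which is not one of the curves used in this work and offers no security. Hashing only the x coordinate of S is also an assumption made here for illustration.

```python
import hashlib
import secrets

# Toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative only, not secure).
P_MOD, A_COEF = 17, 2
G, N = (5, 1), 19  # base point and its (prime) order

def ec_add(p1, p2):
    """Add two points; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # P + (-P) is the point at infinity
    if p1 == p2:
        slope = (3 * x1 * x1 + A_COEF) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        slope = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    slope %= P_MOD
    x3 = (slope * slope - x1 - x2) % P_MOD
    y3 = (slope * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def ec_mul(k, point):
    """Scalar multiplication k * point by double-and-add."""
    result, addend = None, point
    while k:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result

def keypair():
    """Return (d, Q) with d in [1, N - 1] and Q = d * G."""
    d = 1 + secrets.randbelow(N - 1)
    return d, ec_mul(d, G)

def session_key(my_private, their_public):
    """Hash the x coordinate of S = d * Q with SHA-256 (256-bit key)."""
    shared_point = ec_mul(my_private, their_public)
    x_bytes = shared_point[0].to_bytes((P_MOD.bit_length() + 7) // 8, "big")
    return hashlib.sha256(x_bytes).digest()

d_a, q_a = keypair()
d_b, q_b = keypair()
key_a = session_key(d_a, q_b)  # party A: S = d_A * Q_B
key_b = session_key(d_b, q_a)  # party B: S = d_B * Q_A
```

Both keys match because d_A Q_B = d_A d_B G = d_B Q_A. The modular inverse via `pow(x, -1, m)` requires Python 3.8 or later.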

ECIES

Elliptic Curve Integrated Encryption Scheme (ECIES) is a hybrid encryption scheme that provides semantic security [19] and uses the following functions: key agreement, key derivation, hashing, encryption and message authentication. According to [20], there are several versions of ECIES. A simple one consists of a key agreement protocol, such as ECDH, followed by a key derivation function (KDF), whose resulting key is used in a symmetric encryption scheme (AES). The key derivation function can be PBKDF2, defined in PKCS#5⁴, which supports key expansion, which may be required when large keys are needed.
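The key derivation step can be sketched with the PBKDF2 implementation in the Python standard library. The salt, iteration count, output split and input value below are assumptions made for illustration, not parameters taken from this work.

```python
import hashlib

def derive_keys(shared_secret: bytes, salt: bytes, iterations: int = 10_000):
    """Expand a shared secret into 64 bytes and split it into two 256-bit keys.

    The split into an encryption key and a MAC key is an illustrative
    assumption; the salt and iteration count are hypothetical values.
    """
    material = hashlib.pbkdf2_hmac("sha256", shared_secret, salt,
                                   iterations, dklen=64)
    return material[:32], material[32:]

# Hypothetical input: the output of a key agreement such as ECDH.
enc_key, mac_key = derive_keys(b"ecdh-shared-secret", b"session-salt")
```

Requesting 64 bytes from a 32-byte hash shows the key expansion property: the KDF produces more key material than a single SHA-256 digest would.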

2.1.5 Digital Signatures

A Digital Signature is a mathematical technique used to validate the authenticity and integrity of a digital message or document, assuring the recipient that the sender cannot deny its authorship (non-repudiation) and that the message was not altered while in transit (integrity).

⁴ PKCS#5 is a password-based cryptography specification [21] which covers key derivation functions, encryption schemes and message authentication schemes.

An example of a Digital Signature algorithm is the Elliptic Curve Digital Signature Algorithm (ECDSA), which allows an entity to digitally sign a piece of data using its private elliptic curve key. The data is first hashed using a hashing function, such as SHA-256. The resulting message digest must be truncated, so that the length of the hash is the same as the bit length of n (the order of the EC subgroup). The truncated digest value is an integer denoted as z. The algorithm works as follows:

1. Choose a random integer k from {1, ..., n − 1}.

2. Calculate the point $P = kG$, where G is the base point.

3. Calculate the number $r = x_P \bmod n$ (where $x_P$ is the x coordinate of the point P).

4. If r = 0, choose another k and try again.

5. Calculate $s = k^{-1}(z + r d_A) \bmod n$, where $d_A$ is the private key of the signer (A).

6. If s = 0, choose another k and try again.

If successful, the pair (r, s) is the signature. Other parties can verify the signature by using the signer’s public key ($Q_A$) to:

1. Calculate integer $u_1 = s^{-1} z \bmod n$.

2. Calculate integer $u_2 = s^{-1} r \bmod n$.

3. Calculate point $P = u_1 G + u_2 Q_A$, and obtain $x_P$.

4. The signature is valid if $r = x_P \bmod n$.
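The signing and verification steps above can be sketched as follows. This is an illustration under stated assumptions: it runs on a textbook toy curve (y^2 = x^3 + 2x + 2 over F_17, base point (5, 1), order n = 19) that offers no security, and the range check on (r, s) in `verify` is a standard precaution added here rather than a step listed above.

```python
import hashlib
import secrets

# Toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative only, not secure).
P_MOD, A_COEF = 17, 2
G, N = (5, 1), 19  # base point G and subgroup order n

def ec_add(p1, p2):
    """Add two points; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        slope = (3 * x1 * x1 + A_COEF) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        slope = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    slope %= P_MOD
    x3 = (slope * slope - x1 - x2) % P_MOD
    return (x3, (slope * (x1 - x3) - y1) % P_MOD)

def ec_mul(k, point):
    """Scalar multiplication by double-and-add."""
    result, addend = None, point
    while k:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result

def truncated_digest(message: bytes) -> int:
    """SHA-256 digest truncated to the bit length of n (the integer z)."""
    digest = int.from_bytes(hashlib.sha256(message).digest(), "big")
    return digest >> (256 - N.bit_length())

def sign(message: bytes, d_a: int):
    z = truncated_digest(message)
    while True:
        k = 1 + secrets.randbelow(N - 1)        # step 1
        r = ec_mul(k, G)[0] % N                 # steps 2 and 3
        if r == 0:
            continue                            # step 4
        s = pow(k, -1, N) * (z + r * d_a) % N   # step 5
        if s != 0:                              # step 6
            return r, s

def verify(message: bytes, signature, q_a) -> bool:
    r, s = signature
    if not (1 <= r < N and 1 <= s < N):         # added range check
        return False
    z = truncated_digest(message)
    s_inv = pow(s, -1, N)
    u1, u2 = s_inv * z % N, s_inv * r % N
    point = ec_add(ec_mul(u1, G), ec_mul(u2, q_a))
    return point is not None and point[0] % N == r

d_a = 1 + secrets.randbelow(N - 1)  # private key of the signer
q_a = ec_mul(d_a, G)                # public key Q_A = d_A * G
signature = sign(b"certificate body", d_a)
```

Verification succeeds because u_1 G + u_2 Q_A = s^{-1}(z + r d_A) G = kG. With such a small n, forged signatures can pass by chance, which is why real deployments use subgroup orders of about 256 bits.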

2.1.6 Key Certification and PKI

Public Key Cryptography provides non-repudiation to secure communications. When Alice sends data signed with her private key, Bob knows that only Alice could have signed it, because only Alice holds that private key. However, Bob must trust that the received public key is in fact the public key of Alice. To solve this problem, Digital Certificates are used. Digital certificates are electronic credentials used to verify the identities of individuals and machines, by guaranteeing that the identification of a user and data is bound to a certain public key. In general, digital certificates consist of three main parts: user/device information, public key, and digital signature. A frequently used certificate format is the X.509v3 profile. Some of the fields present in an X.509v3 certificate include [22]:

Table 2.1: X.509v3 certificate fields.

Version Number                         Serial Number
Signature Algorithm ID                 Issuer Name
Validity period                        Subject name
Subject Public Key Info                Issuer Unique Identifier (optional)
Subject Unique Identifier (optional)   Extensions (optional)
Certificate Signature Algorithm        Certificate Signature

Because there must be services to issue, validate and revoke these certificates, the PKI was created. A PKI is a framework of roles, policies and procedures that allow the generation, management, distribution, revocation and storage of digital certificates [23, 24, 25]. A PKI normally has the following components: Security Policy, Certification Authority, Registration Authority and Certificate Repository.

The Security Policy is essential to state how the organisation handles keys and valuable information. The Certification Authority (CA) is the entity which issues certificates (binding the identity of a user to a public key with a digital signature) and revokes them. To ensure a certain level of trust, users that wish to receive a certificate for a public key must first register with a Registration Authority, which is the interface between the user and the CA: it authenticates the user and submits the certificate request to the CA. Finally, a Certificate Repository is required in order to store the issued certificates and the Certificate Revocation Lists (CRLs), which contain certificates that have been revoked. Certificates can be revoked for several reasons, e.g. validity date expired, private key compromised, failure to comply with policy requirements and misrepresentation.

2.1.7 Physically Unclonable Function

A Physically Unclonable Function (PUF) is a challenge-response mechanism in which the response to a given challenge depends on a variable physical material [8]. A PUF receives an input challenge (or stimulus) $C_i \in C$, where C is the set of all possible challenges, and outputs a response $R_i \in R$, where R is the set of all possible responses. PUFs are based on the natural randomness that exists in the integrated circuit (IC) used to generate the response, which cannot be controlled. This randomness arises from random variations during the IC fabrication process, i.e. two PUFs with the same layout result in two different functions, so it is impossible to make two PUFs behave identically.

A PUF has four main characteristics. The same challenge produces approximately the same response (error correction codes are used to remove noise). Given a response, it must be difficult to find its challenge (input). Two different challenges must produce two different responses. Two different PUFs must produce two different responses for the same challenge.

2.2 Secure Computing Platforms

Several platforms have been developed over the years to perform secure computing, including Hardware Security Modules, Trusted Platform Modules and Smart Cards. A brief introduction to some of these platforms is given in this section.

HSMs are application-specific devices which provide secure cryptographic key management and accelerated cryptographic operations with those keys. They have the following main characteristics [12]:

1. Secure key management.

2. Secure internal and external data storage.

3. Support cryptographic operations with internal keys (such as ciphering/deciphering data and generating digital signatures).

4. Include anti-tampering and anti-cloning features at the physical level.

5. May include side-channel analysis protection mechanisms.

6. Contain True Random Number Generators.

7. Guarantee internal clock freshness.

8. Support secure communication with the outside world.

9. Provide a standard developer API that allows for easy integration with software applications.

However, their cost is usually higher than that of general-purpose devices (the price of a regular HSM can go up to 35,000 € [26, 27]) and they do not provide the same flexibility for the programmer. HSMs are usually connected to a network through TCP/IP or to a computer via USB, which makes it easy to remove or add them back.

Smart Cards are security tokens that have an embedded chip. They are designed to be tamper resistant and provide security services. Although considered secure, smart cards have slow input/output communication, low computational processing power and limited memory storage. Some advantages of using Smart Cards as Secure Computing Platforms include their low price when compared to other alternatives, their portability and their flexibility (being often used as credit cards, ID cards or repositories for personal information).

A Trusted Platform Module (TPM) is a secure cryptographic chip that integrates a secure microprocessor with cryptographic keys and functionalities, and is normally embedded on a computer’s motherboard. It includes capabilities like Binding, Sealing and Attestation. Binding allows data encryption using the TPM’s unique RSA key. Sealing works similarly to Binding, except that it requires the TPM to be in a certain state in order to decrypt the data. Attestation allows a third party to verify that the software has not been changed, by checking the unforgeable authenticated digest of the hardware and software configuration. The most recent version of the TPM specification (2.0) provides more cryptography algorithms, such as ECC, AES, SHA-256 and HMAC. However, this version is still relatively new (approved in 2015) and therefore not many vendors support it.

Recently, ARM and Intel have developed two different technologies that provide system-wide hardware isolation for trusted software. The ARM TrustZone creates an isolated area that can be used to guarantee confidentiality and integrity to the system, by providing code and data isolation [28]. Most ARM platforms today have this security technology implemented. On the other hand, the Intel SGX is a technology which allows developers to protect certain code and data from disclosure or modification [29]. Both technologies are susceptible to several attacks as described in [30, 31].

By comparing the advantages and disadvantages of each platform described above with the objectives and requirements of this work (Section 1.1), it is possible to identify that, ideally, the desired system would consist of a low-cost and flexible Hardware Security Module.

There are several Hardware Security Modules in the market for a variety of goals, coming with different prices and characteristics. The criteria for choosing the right HSM for a given task include performance, scalability, redundancy, API support, security, supported algorithms, authentication options and cost.

A review on HSMs [12] shows that models like the AEP Keyper v2, SafeNet Luna SA 4.4, Thales nShield Connect 6000 and Utimaco CryptoServer Se1000 support several algorithms, such as AES, RSA, ECDSA, ECDH and SHA-2, which are available through a PKCS#11 interface. While only two of them support elliptic curve operations, all support RSA. The single-threaded performance results for RSA signature generation can be found in Table 2.2.

Table 2.2: Single-threaded performance (signatures/second) for different HSMs [12].

Key Size (bits)   Keyper v2   Luna SA 4.4   nShield   CryptoServer
1,024             310         800           950       1160
2,048             110         420           570       710
4,096             13          35            150       230

Table 2.3: HSM Key Storage capacity [12].

HSM                    Key storage capacity
AEP Keyper v2          8000 1024-bit RSA keys
SafeNet Luna SA 4.4    1200 2048-bit RSA keys
Thales nShield         Limited on board NVRAM storage.
Utimaco CryptoServer   5000 1024-bit key pairs

Concerning key storage capacity, Table 2.3 depicts the capacity for each HSM listed previously. In regard to backups, the listed HSMs either provide backups to dedicated external cards or remote backup functionality. Furthermore, they all support time synchronization and administrator authentication via a PKCS#11 interface.

2.3 Implementation Technologies

There are three major implementation technologies that can be used to create HSM-like systems: CPUs, ASICs and FPGAs.

A Central Processing Unit (CPU) is a general purpose electronic chip (such as the one inside Smart Cards) that performs basic arithmetic, logical, control and input/output operations specified by program instructions. However, CPUs do not contain internal memory (apart from possible cache memories) and are not closed systems, providing no security to the application.

An Application-Specific Integrated Circuit (ASIC) is an integrated circuit built for a specific application (in this case, secure computing). ASICs usually include microprocessors, memory blocks (ROM, RAM, Flash Memory) and other building blocks. Because of their specific use, their manufacturing cost is quite high when compared to general purpose devices. They strive to meet the EAL 4+ assurance level regarding Common Criteria / FIPS 140-2 certification [32, 33, 34].

On the other hand, as described in Section 1, FPGAs are low-cost general-purpose devices that provide high flexibility and performance. They can be divided into four categories depending on their configuration storage: SRAM-based, SRAM-based with internal flash, Flash-based and Antifuse-based.

Many FPGA vendors (e.g. Achronix, Altera, Lattice, Microsemi and Xilinx) compete with each other to provide the best FPGAs to the market. Each vendor provides different FPGA devices with distinct technologies and equips them with its own security mechanisms against attacks such as Bitstream Probing, Decryption Key Stealing, Readback attacks, Side-channel attacks, as well as FPGA counterfeiting and cloning [35, 36]. Table 2.4 lists the protection mechanisms available for the configuration data of the major FPGA models in the market.

Table 2.4: Protection mechanisms for FPGA configuration data.

Manufacturer    Device                     Bitstream Encryption   Authentication   Technology   Key Storage
Microsemi [4]   IGLOO2, Smartfusion2 SoC   AES-256                SHA-256          Flash        PUF
Xilinx [37]     Spartan-6                  AES-256                Device-DNA⁵      SRAM         eFUSE⁶, volatile
Xilinx [37]     Virtex-6                   AES-256                SHA-256 HMAC     SRAM         eFUSE, volatile
Xilinx [37]     Virtex-7                   AES-256                SHA-256          SRAM         eFUSE, volatile
Xilinx [37]     Zynq-7000                  AES-256                SHA-256          SRAM         eFUSE, volatile
Altera [38]     Stratix II/II GX           AES-256 CBC            -                SRAM         NVM
Altera [38]     Stratix III/IV/V           AES-256 CBC            -                SRAM         volatile, NVM
Altera [38]     Cyclone III LS             AES-256 CBC            -                SRAM         volatile

As detailed previously, non-volatile FPGAs are better suited for safety-critical applications. In particular, the IGLOO2 and Smartfusion2 SoC (which integrates a processor and FPGA logic) provide the best variety of design and data security features when compared to other FPGA models from different vendors [3, 39]. In fact, among the devices highlighted in Table 2.4, the IGLOO2 and the Smartfusion2 are the only devices capable of generating and storing keys through a PUF-based mechanism, allowing for greater anti-cloning capabilities. Additionally, they are the only ones which include embedded cryptographic cores that are protected against Differential Power Analysis (DPA)⁷.

Regarding the volatile FPGAs, while several ad-hoc solutions have been proposed [5] to strengthen the native technology, these devices are still vulnerable to several attacks, particularly when loading the initial configuration. On the other hand, the non-volatile bitstream storage of the IGLOO2 and the Smartfusion2 SoC makes them less susceptible to probing attacks on device boot [3].

2.4 Smartfusion2 SoC

Given that the Smartfusion2 SoC from Microsemi is considered to be the best option available, the following focuses on this device. Although the IGLOO2 provides similar security, it does not contain a microprocessor, which is necessary to run the software that controls the system. In this section, a brief description of the device is given, followed by an introduction to its main hardware components and its design and data security features, which together form a Root-of-Trust that is essential to ensure a Secure Boot.

⁷ A type of side-channel attack, in which the attacker studies the power consumption of a cryptographic hardware device in order to extract cryptographic secrets. A side-channel attack is any attack based on information retrieved from the physical level of a cryptographic system.

2.4.1 Device Description

The Microsemi Smartfusion2 SoC contains a non-volatile FPGA, integrated with an ARM Cortex-M3 in a Microcontroller Subsystem (MSS), including an embedded Non-Volatile Memory (eNVM) and an embedded Static Random Access Memory (eSRAM), along with several cryptographic services [39], as depicted in Figure 2.2. The board on which the SoC is mounted provides an external SPI flash, a 10/100 Ethernet PHY and one external RAM memory.

Figure 2.2: SmartFusion2 SoC FPGA Block Diagram [39].

There are two available power modes: full-power (normal), in which the MSS and FPGA fabric are fully operational and the Cortex-M3 is running application code with all memory controllers enabled; low-power, in which the Smartfusion2 is considered to be in an idle state but ready to respond to an interrupt sourced from the MSS and the FPGA - this mode disables the majority of the Cortex-M3 logic.

2.4.2 Security Features

Microsemi defines three different abstraction layers: Secure Hardware, Design Security and Data Security. Data Security builds on top of Design Security, which is built on top of Secure Hardware. While a full list of secure hardware features is described in [4], the most important features for secure hardware are listed in Table 2.5, namely key management and the validation of digital signatures, which include the encrypted loading of user and factory keys, as well as device certificates (which are bound to a specific device) and a revocation list of stolen or scrapped devices.

Table 2.5: Key Features for Secure Hardware [4].

Services                       Features
Key Management                 Encrypted loading of user secret key material (both Symmetric and Asymmetric encryption supported).
                               Authenticated/encrypted loading of all factory keys.
                               Factory keys and passcodes generated and loaded by Hardware Security Modules (HSMs).
Digital Signature Validation   X.509 certificate bound to device serial number, device grading information, and device secret keys.
                               Certificates digitally signed by factory HSMs.
                               Certificate revocation list for scrapped or stolen devices.

Design Security is defined as protecting the intent of the design owner, i.e. keeping the design and bitstream keys confidential and protecting against design changes. It can also be referred to as Intellectual Property (IP) protection. The list of available Design Security features can be found in [4]. The Smartfusion2 SoC provides a True Random Number Generator (TRNG) for nonces and private ECC key generation, and all bitstreams are encrypted with AES-256 based encryption and fully authenticated with a 256-bit tag. To prevent back-tracking attacks, versioning is provided, disallowing the loading of obsolete bitstreams.

For DPA protection, all security keys, protocols and ECC point multiplication services have countermeasures with technology from Cryptography Research Inc. Microsemi states that the other cryptographic services are not protected against DPA (although they are safe from timing analysis and simple power analysis), and therefore does not recommend the use of repeated keys when the adversary can choose the ciphertext [40].

The Smartfusion2 SoC has several security and access control policies which can be configured by setting flash-lock bits, which are control bits that enable or disable certain features (see the circles with crosses in Figure 2.3). The security segments in green, depicted in the middle section of Figure 2.3, are stored in the internal non-volatile memory and are the heart of the system security, as they contain the set of keys (e.g. UEK1, UPK1, DPK) used to encrypt the bitstream configuration, unlock operations such as read, write, and verify eNVM, enter Factory Test Mode, erase, write and verify fabric, enable versioning updates, and restrict JTAG and SPI access. Additionally, readback of the bitstream is always disabled. For tamper protection, the device comes with configurable zeroization options to clear and verify volatile and non-volatile memories. It also provides redundancy in the security flash array to allow detection and reporting of faults.

Figure 2.3: Detailed security and settings model diagram [41]. The green segments in the middle are stored in non-volatile memory. The COMBLK performs the communication between the MSS (software) and security services (System Controller).

As for Data Security, which Microsemi explains as protecting the information that is stored, processed or communicated in the application executing on the FPGA, a list of features is available in [4], which includes SHA-256, HMAC-SHA-256, SRAM-PUF, ECC and AES operations.

Another security feature of these FPGAs is the Root-of-Trust. It is described as an entity that can be trusted to always behave in the expected manner. It provides the verification of the system, software and data integrity and confidentiality, as well as the extension of trust to internal and external entities. It is the foundation upon which all security layers are built. In an embedded system, the Root-of-Trust works with other system elements to ensure the main processor boots securely using only authorized code, which extends the trusted zone to the processor and its applications. By providing the aforementioned hardware, design and data security features, Microsemi considers the Smartfusion2 SoC to provide a Root-of-Trust, which is essential to the Secure Boot.

A Secure Boot process (controlled by the System Controller displayed on the left of Figure 2.3) initializes an embedded system from rest by executing trusted code, free from tampering by an attacker. If this level of trust does not exist, another boot image could replace the original one and allow an attacker to hijack the whole system. The validation of each stage must be performed by the previous successful phase to ensure a chain-of-trust up to the application layer. The first phase (Phase 0), or Immutable Boot Loader, is embedded within the Smartfusion2 SoC and validated by the Root-of-Trust, which ensures the integrity and authenticity of the code. Then, each phase is validated by the previously trusted phase before execution is transferred to it.

The first pages of the eNVM are reserved for the System Controller and cannot be accessed by the user. They store, among other things, the Device Certificate and the digest of the User portion of the eNVM. This digest allows the Immutable Boot Loader, in Phase 0, to know whether or not Phase 1 contents have been modified. If not modified, the booting process proceeds to the next phase, otherwise it gets halted. All the aforementioned features, when configured and used properly, allow us to deploy the Smartfusion2 SoC as a Secure Computing Platform.
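The chain-of-trust idea, in which each phase checks a stored digest of the next one before handing over execution, can be sketched as follows. This is a simplified model and not the Smartfusion2 boot code: the `Stage` structure, the stage names and the way each stage carries the digest of its successor are assumptions made for illustration.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Stage:
    name: str
    code: bytes
    next_digest: Optional[bytes]  # expected digest of the following stage

def measure(stage: Stage) -> bytes:
    """Digest over the stage code and the digest it holds for its successor."""
    return hashlib.sha256(stage.code + (stage.next_digest or b"")).digest()

def build_chain(images: List[Tuple[str, bytes]]):
    """Link stages back to front; return (stages, root_digest)."""
    stages, next_digest = [], None
    for name, code in reversed(images):
        stage = Stage(name, code, next_digest)
        stages.insert(0, stage)
        next_digest = measure(stage)
    return stages, next_digest

def secure_boot(stages: List[Stage], root_digest: bytes) -> List[str]:
    """Return the names of the stages that were validated and executed."""
    booted, expected = [], root_digest
    for stage in stages:
        if expected is None or not hmac.compare_digest(measure(stage), expected):
            break  # halt: the stage does not match the trusted digest
        booted.append(stage.name)
        expected = stage.next_digest
    return booted

# Hypothetical images; root_digest plays the role of the digest kept in
# the protected eNVM pages and checked by the immutable Phase 0 loader.
images = [("phase1", b"first-stage loader"),
          ("phase2", b"second-stage loader"),
          ("application", b"user application")]
stages, root_digest = build_chain(images)
```

With untouched images, `secure_boot(stages, root_digest)` walks the whole chain. Modifying the code of any stage makes its measurement differ from the digest held by the previous stage, so the boot halts before the modified code runs.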

2.5 Summary

In this chapter, a range of concepts used throughout the dissertation was introduced. More specifically, several cryptographic services and mechanisms were discussed, such as Symmetric and Asymmetric Key Cryptography, which include AES, RSA and ECC. Additionally, an overview of the SHA-256 hashing function was given, followed by a description of secret key establishment protocols (DH, ECDH and ECIES). Afterwards, the concepts of digital signatures, Public Key Infrastructures and Physically Unclonable Functions were introduced.

Moreover, the existing types of Secure Computing Platforms were discussed, including Hardware Security Modules, Trusted Platform Modules and Smart Cards. After analysing the characteristics of each platform, it was concluded that the platform that fulfils most of the requirements of this dissertation is the HSM, although HSMs are usually expensive and have very fixed designs. To understand how an HSM can be built, a variety of implementation technologies were introduced, including CPUs, ASICs and FPGAs, with more detail being given to non-volatile FPGAs due to their low price and their more suitable characteristics for security applications.

Finally, the Smartfusion2 SoC (which integrates a microprocessor, internal memories and non-volatile FPGA fabric) was thoroughly described, with a major highlight being given to its design and data security features, which differentiate it from its competitors. The next chapter presents the State of the Art proposals which attempt to create secure computation systems supported by FPGA technologies.

Chapter 3

State of the Art

In this chapter, the State of the Art is discussed in regard to secure systems based on FPGAs, by analysing how the different solutions configure their devices to act as secure modules, how their systems establish communication channels with outside parties and how key management and data storage are performed. This chapter is divided into four sections: FPGA as Secure Platform, Key Generation and Storage, Related Full Systems and Comparison Analysis. The first section presents the systems which consider FPGAs to be secure platforms, discussing the works that propose schemes for dynamic reconfiguration of volatile FPGAs and the implementation of the Cipherbase secure hardware on a volatile FPGA. Secondly, we present works that perform key generation and storage with FPGAs, such as a PUF-based approach and a rekeying management scheme for Storage Area Networks which uses a master enveloping key that never changes. The third section discusses two architectures whose requirements are very similar to ours: the first one uses volatile FPGAs to perform trusted cloud computing and the second one creates a secure wrapper that embeds an FPGA application. Finally, a comparison analysis of the major works is performed, depicting a summary of the discussed architectures.

3.1 FPGA as Secure Platform

Gaj et al. [6] consider the use of embedded microprocessor cores within the FPGA to achieve bitstream security, specifically for reconfiguration of a Xilinx Virtex-II Pro on a Xilinx ML310 board. Nonetheless, they consider the entire board as a secure device and not just the FPGA, meaning the path between the FPGA and the external memory is susceptible to tampering.

Arasu et al. [7] present the design of the Cipherbase secure hardware and its implementation using FPGAs. The Cipherbase system incorporates customized trusted hardware, extending Microsoft’s SQL Server for efficient execution of queries using both secure hardware and commodity servers, allowing for the secure storage of data. They choose a volatile FPGA to implement the Trusted Machine (TM). Since the logic is built from volatile configuration memories and the binary which defines the computation is loaded at power-on from external non-volatile memories, the bitstream configuration can be intercepted.

Figure 3.1: FPGA as a Trusted Machine [7].

Figure 3.1 depicts how the setup of an FPGA as a TM is performed in [7]. There is a Trusted Authority (TA), trusted by both the clients and the cloud operator, that is responsible for generating and maintaining FPGA binary encryption keys. Additionally, it vets and compiles the hardware code associated with the TM and creates the encrypted and signed binaries for each device. The authors assume that the FPGA is secure and that an adversary does not have access to its internal state, because they believe that the TM is only vulnerable to side-channel analysis. The session establishment algorithm is not mentioned, and the secure bootstrapping and operation of FPGAs are not described in their work.

3.2 Key Generation and Storage

Arasu et al. [7] present two scenarios for the generation of encryption keys for database operations in the Cipherbase system. The simplest one consists of embedding a master key into a Trusted Machine (TM) (programmable region of FPGA) binary and distributing to the database client. The TM would use this key to encrypt their data with AES. The drawbacks of this approach include the need of the Trusted Authority to generate separate binaries (each loaded onto the device) for every potential database clients and the fact that the client is locked to using this key for all their databases. A more sophisticated approach is suggested by the authors, in which the TM embeds an RSA pub- lic/private master key pair. The public key is published via standard public key infrastructure techniques, allowing clients to uniquely identify a certain FPGA. This would allow clients to negotiate AES session keys for different database fields or different database applications. In case the keys are cached by the TM, in a key vault, they would be encrypted using the master key (or another key defined by the TA), to guarantee that only the TM can recover the contents of the vault. Nabeel et. al propose [8] an approach based on Physically Unclonable Function technology to pro- vide strong hardware authentication of smart meters and efficient key management for Advanced Me- tering Infrastructures. They utilize the PUF on the devices to generate and re-generate the symmetric

20 keys and access level passwords for smart meters. The PUF based secret generation provides strong protection against key leakage since the master key is never stored in memory. They implemented the PUF feedback loop using the Xilinx’s Spartan-6 FPGA board, which is connected to a PC through a serial port. The error correction of the PUF mechanism and cryptographic operations are done on the PC. Wang et al. [9] propose an FPGA based flexible and low-cost rekeying management scheme to improve the security and reduce the processing time of rekeying processes. They claim that their system must not only provide secure, high performance, flexible, open and standard based storage infrastructure but also prevent the data from all kind of attacks, such as Side-Channel Attacks, eavesdropping and man-in-the-middle attacks. To prevent the system from being attacked, they perform key management at hardware level (FPGA), with the software simply sending commands to the key management module (hardware). The software sends commands, which include key backup, key recovery, key revocation and key generation, to the key management module (hardware). The hardware, implemented on a FPGA, communicates with the software through a PCIe interface. The internal memory stores current “active” key pairs, while the outside flash memory backs up the keys. All involving keys are encrypted before being stored. The needed keys are generated by RNG and digested through SHA-256 to be aligned with 256-bit length. In a typical scheme, when re-keying takes place, all the encrypted data must be decrypted using the old key and encrypted using the new key. They propose a new FPGA flexible and low-cost re-keying process to avoid the decryption of stored data using the old key and encryption using the new key. Their scheme proposes the use of a long-term enveloping key that encrypts the user’s access key, which is used to encrypt a data encryption/decryption key, known as LUN. 
When the re-keying process occurs, a new access key is generated, and the LUN key is decrypted using the old access key and encrypted again with the new access key. Finally, the generated access key is encrypted using the enveloping key. This ensures that the stored data does not need to be decrypted and encrypted again, because the LUN key remains the same. The following is what is stored in memory: $E_{K_{access}}(k_{LUN}),\ E_{K_{env}}(k_{access})$. The software only stores 32-bit indexes for maintenance, which are extracted at hardware level from the encrypted private key and the user's access keys, and are sent to the software by the FPGA. Even though the authors claim that their system prevents physical attacks, their explanation is quite vague. They state that, because only a 32-bit index (instead of the key) is used at the software level, their design is protected from physical attacks, since the attacker would need to obtain the contents of the internal memories. However, if an attacker has access to the devices, it is still possible to perform side-channel analysis unless the devices are truly protected against it.
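The re-keying idea of Wang et al. can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_wrap` is a hypothetical stand-in cipher (a SHA-256 keystream XOR, not AES), and the `store` layout and all names are assumptions made only for illustration. The point it shows is that re-keying rewrites the two wrapped keys while the stored data and the LUN key stay untouched.

```python
import hashlib
import secrets


def toy_wrap(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream (toy stand-in, NOT AES)."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


toy_unwrap = toy_wrap  # XOR with the same keystream is its own inverse


def rekey(store: dict, k_env: bytes) -> None:
    """Replace the access key without touching the data or the LUN key."""
    old_access = toy_unwrap(k_env, store["wrapped_access"])
    k_lun = toy_unwrap(old_access, store["wrapped_lun"])
    new_access = secrets.token_bytes(32)
    store["wrapped_lun"] = toy_wrap(new_access, k_lun)      # E_{K_access}(k_LUN)
    store["wrapped_access"] = toy_wrap(k_env, new_access)   # E_{K_env}(k_access)


k_env, k_access, k_lun = (secrets.token_bytes(32) for _ in range(3))
store = {
    "wrapped_lun": toy_wrap(k_access, k_lun),
    "wrapped_access": toy_wrap(k_env, k_access),
    "data": toy_wrap(k_lun, b"stored user data"),  # encrypted once with the LUN key
}
data_before = store["data"]
rekey(store, k_env)

recovered_access = toy_unwrap(k_env, store["wrapped_access"])
recovered_lun = toy_unwrap(recovered_access, store["wrapped_lun"])
```

After `rekey`, the access key has changed, yet `recovered_lun` still decrypts the unchanged `store["data"]`.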

3.3 Full Security Systems

This section presents two works that fulfil requirements similar to the ones proposed by this thesis. The first one [11] consists of using volatile FPGAs to perform trusted cloud computing, in which protected bitstreams are used to create a Root-of-Trust for cloud computing clients. The second one [10] proposes a system architecture that wraps an embedded application on a volatile FPGA. The wrapper includes a secure user authentication interface and cryptographic services which secure all of the embedded application’s data transfer interfaces.

Eguro et al. describe [11] how protected bitstreams can be used to create a Root-of-Trust for the clients of cloud computing servers. Their hardware-based approach solves the following problem: how to secure client data and computation from both potential external attackers and an untrusted system administrator. The system which addresses this problem uses volatile FPGAs. They are programmed to form a flexible, independent trusted third party computing platform within the cloud infrastructure. Their proposed system allows clients to upload their configuration data to the cloud and since cloud administrators do not have low-level access to computation within the FPGA, it allows clients to offload sensitive parts of their applications to these devices, avoiding potential vulnerabilities in the software stack.

The deployment of the trusted computing nodes begins with a trusted authority (TA), which is trusted by all clients and by the cloud operator. The TA generates a random symmetric encryption key symkfb and copies it into the onboard key memory of the FPGA. Once the key has been written, the platform can be delivered to the cloud operator and installed. Since the FPGA comes with a secure boot process (which uses the symmetric key symkfb to decrypt and authenticate the bitstream configuration), the authors believe the FPGA can be used as a “virtual” HSM.

The authors, however, strive to support a more sophisticated operational model that does not require direct TA involvement for each and every bitstream. The idea is that the TA provides a single generic bootstrapping binary for each FPGA, which acts as an onboard infrastructure that receives and loads client applications. Figure 3.2 depicts how the TA generates a private/public RSA key pair and places the private key into the bootstrapping bitstream. The public key is published via a standard PKI. Once the TA encrypts the bootstrapping bitstream with AES using symkfb, it transfers the configuration into the flash memory on the FPGA.

Once the bootstrap configuration is running on an FPGA in the cloud, the client can create an application for the FPGA to handle sensitive data. The client connects to this device to load their application securely using standard PKI, like an SSH session, in which the client uses the public key of the device to exchange a symmetric session key sessionkf. The attack model proposed by the authors assumes that the following operations are sufficiently difficult and that they are effectively impossible in practice: breaking the cryptography used; loading a binary that cannot be decrypted and authenticated properly; retrieving binary or state information on the device from outside; altering the behaviour of the loaded binary; altering data currently on the device. Furthermore, the authors also explain the problems that would arise from keys being compromised or lost. They also emphasize that the immutable bootstrapping logic forms the initial Root-of-Trust for the clients of cloud computing servers.

Figure 3.2: Setup: TA generates a public/private RSA key pair and transfers the private key privatekf into the bootstrapping binary (in blue: data is encrypted so it doesn’t need to go through a secure channel) [11].

Graf proposes a system [10] that acts as a secure wrapper around an embedded application on an FPGA (depicted in Figure 3.3). This wrapper (known as Amuet) creates a secure user authentication interface and cryptographically secures the data interfaces accessible to the embedded application, effectively rendering the FPGA as a black box capable of performing the task for which it was designed. The architecture introduces a secure token-based authentication scheme (using Java’s iButton [42]) and an FPGA-based encrypted memory controller. It is important to note that the user application which runs within the FPGA (protected by the proposed wrapper) can only be re-programmed at the factory, as there is no interface in the proposed system to perform reconfiguration of user applications.

For this thesis, the relevant part of this work is the way Amuet performs the secure embedding of the application, which includes the process of retrieving the user identification information (UID) from the iButton through a secure user interface, the modification of the UID to form a DES key and finally to use that DES key as the key source for the encrypted memory controller (EMC).

The Authentication Control Unit (ACU) is responsible for establishing a secure communication channel for the UID transfer from the iButton. It has at its disposal an $x^e \bmod n$ calculator, a set of RSA secret keys, a SHA-1 unit and a table of authorized certificates. The protocol used to negotiate a secure channel between the device and the iButton is similar to an RSA key exchange. However, the public key is never made public; instead, it is used as the encryption key (e), while the private RSA key is used as the decryption key (d). The two key pairs and the two moduli (n) used in the RSA-based authentication scheme are generated prior to the programming of the iButton and the creation of the FPGA bitstream.

The FPGA stores $n_i$, $n_f$, $e_i$, $d_f$ and the iButton stores $n_i$, $n_f$, $e_f$, $d_i$.
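The key distribution above can be made concrete with a toy example. This is a sketch only, under the assumptions that the subscript i denotes the iButton pair and f the FPGA pair, and that each side encrypts with the other's exponent e and decrypts with its own exponent d. The primes are deliberately tiny and offer no security.

```python
def make_pair(p: int, q: int, e: int) -> tuple:
    """Return (n, e, d) for the toy primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse (Python 3.8+)
    return n, e, d


n_i, e_i, d_i = make_pair(61, 53, 17)  # iButton key pair (toy values)
n_f, e_f, d_f = make_pair(89, 97, 5)   # FPGA key pair (toy values)

# Each side holds both moduli, the other side's e and its own d.
fpga = {"n_i": n_i, "n_f": n_f, "e_i": e_i, "d_f": d_f}
ibutton = {"n_i": n_i, "n_f": n_f, "e_f": e_f, "d_i": d_i}

# FPGA -> iButton: encrypt with e_i, decrypt with d_i (x^e mod n, then c^d mod n).
challenge = 1234
to_ibutton = pow(challenge, fpga["e_i"], fpga["n_i"])
seen_by_ibutton = pow(to_ibutton, ibutton["d_i"], ibutton["n_i"])

# iButton -> FPGA: encrypt with e_f, decrypt with d_f.
uid = 4321
to_fpga = pow(uid, ibutton["e_f"], ibutton["n_f"])
seen_by_fpga = pow(to_fpga, fpga["d_f"], fpga["n_f"])
```

Because neither e nor d is published, only the holder of the matching exponent can produce or read a message in each direction, which is what the scheme relies on for mutual authentication.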

Figure 3.3: Block diagram of the Amuet architecture [10]. The embedded application is actually the user application, protected by the proposed wrapping system.

Additionally, the stored certificates are hashes of the iButtons’ UIDs, which makes the original UIDs computationally difficult to recover. Since the table resides inside the FPGA, every time a certificate is added or revoked, the bitstream must be modified. To prevent man-in-the-middle attacks, authentication is mutual: the iButton authenticates itself as an authorized user to the FPGA, and the FPGA also authenticates itself as an authorized host to the iButton. After a successful authentication, the FPGA sends the UID to the EMC to create the final key for the DES engine.

The Encrypted Memory Controller (EMC) encrypts and decrypts every transaction between the embedded application within the FPGA and the external memory. On startup, the EMC uses a secret DES key, which is unique to and known only to the FPGA, to create the final DES key (the secret key itself is never used for ciphering/deciphering operations). This final key is formed by passing the UID from the iButton through the DES engine using the secret DES key. From the 64-bit result, the first 56 bits are the final DES key.

3.4 Discussion

In this chapter, the State of the Art was presented with regard to existing solutions that strive to create secure systems on FPGAs, each with different motivations, as seen in the previous sections. The main focus of these works is the ability to perform secure key management, establish secure communication channels and store data securely (internally and externally). Due to the lack of non-volatile solutions, the works herein presented focused on volatile FPGAs, which are subject to several attacks [3,5]. To create a Hardware Security Module, a system must have a series of characteristics [12], such as anti-cloning mechanisms (e.g. PUF-based key generation), secure communication channels, internal non-volatile memories for master key storage, anti-tamper mechanisms, internal clock freshness (e.g. through a Timestamping Authority) and a common developer interface, such as PKCS#11. However, none of the works provides a robust solution that is able to address all of the above characteristics. In fact, the proposed solutions do not consider security-oriented devices, which contain anti-cloning, anti-tamper and side-channel analysis protection mechanisms, nor do they consider the necessity of internal clock freshness. Moreover, they do not consider the performance and security of the encryption software/hardware used. Our study finds that only Nabeel et al. [8] consider the use of a PUF-based mechanism to increase the security and authentication of the overall system. Additionally, the freshness of data exchanged with external parties is largely overlooked by all solutions, which makes them susceptible to replay attacks. They also lack any kind of developer interface (e.g. PKCS#11) that would allow an easy integration of their systems with external applications. Table 3.1 depicts the key characteristics of each solution and the main features they provide, with regard to device security, the ability to establish a secure communication channel and how they perform key management.

Table 3.1: Comparison of Security Features of the different system proposals.

Category                  Arasu et al. [7]  Wang et al. [9]     Nabeel et al. [8]  Graf [10]                 Eguro et al. [11]

Device Security
Device                    N/D(1)            Xilinx Virtex-6     Xilinx Spartan-6   Xilinx Virtex-E, iButton  Xilinx Virtex-6
Bitstream Encryption      AES               AES-256             AES-256            3DES                      AES-256
Bitstream Storage         External          External            External           External                  External

Secure Channel
Algorithm                 RSA               N/A(2)              N/A                RSA-based                 RSA

Key Management
Master Keys Generation    Factory           Factory             PUF                Factory                   Factory
Master Keys               RSA               AES                 N/D                DES                       RSA, AES
Master Keys Encryption    No                No                  N/D                No                        No
Master Keys Storage       External          Internal            N/D                Internal                  Internal
Session Keys Generation   RSA-KE(3)         RNG                 PUF                RSA-KE                    RSA-KE
Session Keys              AES               AES                 AES                DES                       AES
Session Keys Storage      Internal          Internal, External  N/D                Internal                  Internal

(1) Not Disclosed (2) Not Available (3) RSA-based Key Exchange protocol

Considering the above discussion, it can be concluded that the State of the Art works cannot be used to create a Hardware Security Module or a system that resembles one. Considering that HSMs are expensive and non-reconfigurable, as mentioned in Section 2.2, the next chapter proposes a solution, supported by the SmartFusion2 SoC, which creates a multi-user, flexible HSM, capable of performing secure key management, communicating securely with external parties, guaranteeing internal clock freshness, signing data and issuing digital certificates. To demonstrate the flexibility of the HSM, a use-case called Log-Chain is presented and integrated in the HSM.

Chapter 4

Proposed Solution

The main goal of this work is to create a re-configurable and flexible HSM, supported by a low-cost non-volatile security-oriented FPGA, as opposed to the State of the Art. As stated previously, existing commercial HSMs are expensive and fixed in terms of design. On the other hand, the State of the Art proposals that attempt to create secure systems on FPGAs focus purely on volatile devices without security-oriented characteristics, such as anti-tampering, anti-cloning and side-channel analysis protection. Additionally, they lack most of the requirements of an HSM, which were described in Section 2.2. The solution herein proposed, supported by the SmartFusion2 SoC, creates a multi-user, low-cost and highly flexible secure computation system that performs secure key management, stores data securely internally and externally, can establish a secure communication with outside parties, can maintain internal clock freshness and is capable of computing digital signatures as well as issuing digital certificates. To demonstrate the flexibility of the proposed architecture, the implementation of a novel log certification scheme is also proposed and developed. This scheme consists of creating a signed Log-Chain from logged messages, such as Linux Syslog messages, transaction logs or even a log of medical receipts. Each message/command signature guarantees the authenticity of that message and all previously logged messages/commands, therefore creating a chain-of-logs that can be verified and cannot be repudiated or modified. The integration of the provided features with applications is done through an extended PKCS#11 middleware (device driver), abstracting the users from the inner workings of the system. The generated certificates are formatted according to the X.509v3 standard.
Moreover, this work overcomes the limited hardware specifications inherent to low-cost devices, such as small memories, low CPU speeds and the lack of an Operating System, providing a fully working solution that tries to compete with existing commercial ones in terms of security features and standards, while maintaining the same flexibility and low cost as the academic re-configurable proposals. An overview of the proposed system is depicted in Figure 4.1, illustrating the system’s main components. With these functionalities and security features, the system can be perceived as a Hardware Security Module with added flexibility for customization at a low cost.

Figure 4.1: The proposed overall system architecture. The light blue rectangle represents the secure System-on-Chip. The contents of the external Flash are encrypted.

The following describes how user and key management is performed, including key storage and validation, the used communication protocol, and details the proposed Log-Chain scheme.

4.1 Users and Key Management

In the proposed system, we consider the need for multiple users to interact with it. Therefore, there are two types of accounts to be considered, namely a single administrative account and multiple (non-privileged) user accounts, each associated with an asymmetric key pair and a login PIN. The administrator is responsible for initializing the system and managing user accounts, i.e. creating and deleting users from the internal users database, through the provided extended PKCS#11 interface. Additionally, the administrator is also able to generate certificates for external public keys. Users are allowed to request data signature operations using their (non-extractable) private key, update their PIN, request the generation of key pairs to be exported (not stored internally), and perform other available user operations, such as the addition and signing of Log-Chain entries (which will be described in Section 4.3). Besides the user keys, the system uses two asymmetric key pairs and one symmetric key. The key pairs are used for the certificate issuer signature and for device authentication in the secure session establishment. The symmetric key is used for external data wrapping. An additional asymmetric key pair is generated and stored for the Log-Chain signatures. During the first device initialization, these keys are generated and stored internally using a PUF-based mechanism, providing additional protection against cloning and malicious private key retrieval. The public keys must be properly published via a PKI by the device administrator before they can be used. Since most embedded systems only have small internal memories, the users’ data (such as ID, PIN, keys and certificates) is stored on the external Flash memory, protected by the wrapping key. This data is encrypted using AES-256 in CBC mode. To assure data integrity and authenticity, the SHA-256 digest of the encrypted data, along with the IV used in the CBC encryption (randomly generated for each block, guaranteeing semantic security), is stored in the internal Flash memory. This way, the system can ensure that whenever a block is accessed or updated, its contents are valid and have not been tampered with.
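The integrity check on external Flash blocks can be sketched as follows. This is an illustrative model rather than the device firmware: the two dictionaries stand in for the external Flash and the internal non-volatile memory, the function names are hypothetical, and the ciphertext is treated as opaque bytes assumed to have been produced by AES-256-CBC elsewhere.

```python
import hashlib
import hmac
import secrets

external_flash = {}  # untrusted storage: holds only ciphertext
internal_nvm = {}    # trusted storage: holds (IV, SHA-256 digest) per block


def store_block(block_id: int, ciphertext: bytes, iv: bytes) -> None:
    """Write the ciphertext outside and keep its IV and digest inside."""
    external_flash[block_id] = ciphertext
    internal_nvm[block_id] = (iv, hashlib.sha256(ciphertext).digest())


def load_block(block_id: int) -> tuple:
    """Return (iv, ciphertext) only if the external block is unmodified."""
    ciphertext = external_flash[block_id]
    iv, expected = internal_nvm[block_id]
    actual = hashlib.sha256(ciphertext).digest()
    if not hmac.compare_digest(actual, expected):
        raise ValueError("external Flash block failed the integrity check")
    return iv, ciphertext


iv = secrets.token_bytes(16)          # fresh random IV for this block
ciphertext = secrets.token_bytes(64)  # placeholder for AES-256-CBC output
store_block(7, ciphertext, iv)
```

A plain digest is sufficient here only because it is kept in memory that the attacker is assumed unable to modify. If the digest were stored next to the ciphertext, a keyed MAC would be needed instead.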

Each user’s asymmetric key pair is generated using a True Random Number Generator and stored in an external Flash block along with other user data. For symmetric encryption, the AES algorithm is used, whereas for asymmetric encryption, Elliptic Curve Cryptography is preferred, as it is recommended for embedded systems with low computational power and memory storage [16][17]. Table 4.1 lists the stored keys and how they are generated and stored.

Table 4.1: Key generation and storage.

Usage               Type     Generation  Storage

Sign Certificates   Private  PUF         internal NVM
Sign Log-Chain      Private  PUF         internal NVM
Secure Session      Private  PUF         internal NVM
Encrypt Ext. Flash  AES      PUF         internal NVM
User signatures     Private  TRNG        external Flash

Since certificates must have a validity period, correct timekeeping is required. However, as most embedded systems cannot provide trustworthy timestamps (as their Real-Time-Clock is reset on power-on), a valid time must be obtained from an external source.

There are a few options to achieve this. One would be to send the time from the connecting PC (likely obtained from an external network) to the device, but since the connecting network and the PC cannot be trusted, this is not a favourable option. A second alternative would consist of using an external Timestamping Authority to sign and timestamp a device-generated nonce. Unfortunately, the latter is also not suitable for the requirements of this work, due to the fact that most Timestamping Authorities use RSA over ECC, and therefore their public key certificates require more internal memory. Moreover, most of them are not free [43, 44] or have restrictions for free use, such as a limited number of requests per day [45, 46]. A third alternative would consist of creating a trusted ECC-based Time Service, which receives a device-generated nonce, appends a trusted timestamp to it and signs both together (this is better described in Section 5.8).

Considering the last option as the most viable, when the device is initialized and a session is established, the device sends a challenge to the PC. The challenge must be signed, together with the current time, by a pre-defined Time Service, and the signed challenge and time are then sent back to the device. In order to verify the response, the device must be shipped/configured with the public key certificate of the pre-defined Time Service.
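The challenge-response exchange with the Time Service can be sketched as below. Note the substitution: the proposed system relies on an ECC signature checked against the Time Service public key certificate, whereas this sketch uses HMAC-SHA-256 with a shared key purely so that it runs with the standard library. All function names and the message layout (nonce followed by an 8-byte timestamp) are assumptions.

```python
import hashlib
import hmac
import secrets

# Stand-in for the Time Service signing key (the real scheme is asymmetric).
TIME_SERVICE_KEY = secrets.token_bytes(32)


def time_service_respond(nonce: bytes, now: int) -> tuple:
    """Time Service: bind the device nonce to a timestamp and 'sign' both."""
    message = nonce + now.to_bytes(8, "big")
    tag = hmac.new(TIME_SERVICE_KEY, message, hashlib.sha256).digest()
    return now, tag


def device_accept_time(expected_nonce: bytes, timestamp: int, tag: bytes) -> int:
    """Device: accept the timestamp only if it is bound to the fresh nonce."""
    message = expected_nonce + timestamp.to_bytes(8, "big")
    expected = hmac.new(TIME_SERVICE_KEY, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("timestamp rejected: bad signature or stale nonce")
    return timestamp


nonce = secrets.token_bytes(32)  # device-generated challenge
timestamp, tag = time_service_respond(nonce, 1460293820)
device_time = device_accept_time(nonce, timestamp, tag)
```

Because the nonce is generated by the device for each request, a response recorded earlier cannot be replayed to set an old time.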

4.2 Communication and Session Establishment

For the connection between the device and the PC, two scenarios are presented, namely one where the connection is considered secure and the other where the connection cannot be considered secure. When configured to communicate through a secure channel, a session key must be established before both parties can exchange data. This could be done in a variety of ways, but considering ECC is better suited for embedded devices, this work focuses on ECC-based algorithms. As seen in Section 2.1.4, ECIES can be used to establish a common shared secret through ECDH, and derive two symmetric keys from it, one used for message confidentiality and the other one for message authentication. Herein, the secure channel is created through the use of ECIES, as depicted in Table 4.2.

Table 4.2: Secure session establishment

PC/Middleware                                                 Device

Obtain S with ECDH
Derive KC and KH
                     ---- {N1}_KC, IV, P_PC, HMAC ---->
                                                              Obtain S with ECDH
                                                              Derive KC and KH
                     <---- {N2, N1'}_KC, HMAC ----
Verify N1'
                     ---- {N2'}_KC, HMAC ---->
                                                              Verify N2'

                     ---- {Data, Ctr}_KC, HMAC ---->
                     <---- {Data, Ctr+1}_KC, HMAC ----
                     ---- {Data, Ctr+2}_KC, HMAC ---->

In this case, the middleware starts by generating an asymmetric Elliptic Curve (EC) key pair and uses the known device Secure Session public key (PDevice) to obtain a shared secret (S) through ECDH. After applying the Key Derivation Function PBKDF2 (defined in PKCS#5) to the shared secret, it possesses two keys: one for ciphering (KC) and another to compute the HMAC-SHA-256 (KHMAC).

From this point on, the middleware generates a random 256-bit nonce (N1) and a 128-bit IV, ciphers the nonce (using AES-256-CBC) and computes the HMAC-SHA-256 of the ciphertext, IV and its public key. It then sends the four parts together to the device. On the other side, the device uses the received public key to compute the shared secret through ECDH. After applying the KDF PBKDF2, like the middleware, it possesses two keys. The device deciphers the nonce and modifies it (N1'). Additionally, it generates a new 256-bit nonce (N2) and ciphers it together with the modified nonce, using the ciphering key obtained from the KDF. At the same time, it computes the HMAC of the ciphertext and sends both. The middleware verifies the modified nonce, modifies N2 and sends it back to the device, which performs the final verification. Upon successful validation, both are securely connected and can exchange data in an encrypted and authenticated form. This secure session establishment is depicted in Table 4.2.
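The key derivation and authentication steps of the handshake can be sketched with the standard library. ECDH and AES-256-CBC are outside its scope, so the shared secret S, the ciphertext and the public key are random placeholders here. The salt, the iteration count and the split of the PBKDF2 output into two 32-byte keys are assumptions, not values taken from the implementation.

```python
import hashlib
import hmac
import secrets


def derive_session_keys(shared_secret: bytes, salt: bytes,
                        iterations: int = 1000) -> tuple:
    """Derive (KC, KHMAC) from the ECDH shared secret S with PBKDF2."""
    material = hashlib.pbkdf2_hmac("sha256", shared_secret, salt,
                                   iterations, dklen=64)
    return material[:32], material[32:]


def tag_message(k_hmac: bytes, *parts: bytes) -> bytes:
    """HMAC-SHA-256 over the concatenation of the given message parts."""
    return hmac.new(k_hmac, b"".join(parts), hashlib.sha256).digest()


shared_secret = secrets.token_bytes(32)  # placeholder for S obtained via ECDH
salt = b"session-salt"                   # hypothetical fixed salt

# Both sides run the same derivation on the same shared secret.
pc_kc, pc_kh = derive_session_keys(shared_secret, salt)
dev_kc, dev_kh = derive_session_keys(shared_secret, salt)

# First handshake message: {N1}_KC, IV, P_PC, HMAC.
ciphertext = secrets.token_bytes(32)     # placeholder for the ciphered nonce
iv = secrets.token_bytes(16)
pc_public_key = secrets.token_bytes(64)  # placeholder for P_PC
tag = tag_message(pc_kh, ciphertext, iv, pc_public_key)

# Device side: recompute the tag before deciphering anything.
tag_ok = hmac.compare_digest(tag, tag_message(dev_kh, ciphertext, iv, pc_public_key))
```

Using separate keys for ciphering and for the HMAC keeps the two primitives independent, and checking the tag before decryption rejects modified messages early.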

From this point onwards, all messages are encrypted with the session key (KC) and authenticated with the HMAC key (KHMAC). To guarantee the freshness of exchanged messages, they are accompanied by a counter of the sender, which is validated by the receiver so that the receiving party invalidates an existing session should it become unsynchronized. Once the session is established, the device accepts a range of commands, which allows the middleware and the application using it to take advantage of the developed features. Table 4.3 lists the commands accepted by the system, including commands to start a secure session, generate digital signatures, request digital certificates and manage registered users.

Table 4.3: Available Device commands.

Command       Parameters                         Description

DVC CHECK     -                                  Check if device is connected.
SESS START    Encrypted session key              Initiate secure session.
SESS END      -                                  Terminate secure session.
SEND TIME     Timestamp                          Send timestamp to the device.
DVC INIT      PIN                                Initiates the system.

User Commands
SESS LOGIN    User ID and PIN                    Authenticate user.
SESS LOGOUT   -                                  Log a user out.
DTSN SIGN     Data                               Generate digital signature.
USER CERT     User ID                            Retrieve user certificate.
USER MODIFY   New PIN                            Modify user PIN.
LOGS ADD      Message                            Add a new entry to the Log-Chain.
USER GENKEYS  -                                  Generate and extract key pair.

Administrator Commands
CRT REQUEST   Public Key, Key Usage and Subject  Generate public key certificate for a given key.
USER NEW      New PIN                            Add new user.
USER DELETE   User ID                            Delete a user.
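The counter-based freshness check described before Table 4.3 can be sketched as follows. The sketch assumes a single counter shared by both directions that advances by one per message, as the Ctr, Ctr+1, Ctr+2 sequence in Table 4.2 suggests. The class and method names are hypothetical.

```python
class Endpoint:
    """One side of an established session, tracking the message counter."""

    def __init__(self, start: int = 0):
        self.counter = start
        self.session_valid = True

    def next_outgoing(self) -> int:
        """Return the counter value to attach to an outgoing message."""
        value = self.counter
        self.counter += 1
        return value

    def check_incoming(self, received: int) -> bool:
        """Accept an in-sequence counter, otherwise invalidate the session."""
        if not self.session_valid or received != self.counter:
            self.session_valid = False
            return False
        self.counter += 1
        return True


pc, device = Endpoint(), Endpoint()

first = pc.next_outgoing()               # PC sends {Data, Ctr}
ok_first = device.check_incoming(first)
reply = device.next_outgoing()           # device answers with {Data, Ctr+1}
ok_reply = pc.check_incoming(reply)

replayed = device.check_incoming(first)  # an old message is replayed
```

Once a replayed or out-of-order counter is seen, the session stays invalid, so a new secure session has to be established before further commands are accepted.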

It is important to note that the secure session establishment protocol, described above, does not guarantee authentication of the PC towards the device. However, the device authenticates itself towards the PC because the PC has knowledge of the device’s public key, as it has been published by the device administrator upon initialization. Therefore, some commands require an actual user or an administrator to be authenticated towards the device. To perform the authentication, an ID and PIN must be sent during the login process. The administrator PIN is set the first time the device is initiated and is associated with ID 0. To add new users, the administrator can send the ’USER NEW’ command with the initial PIN, which is later changed by the users with the ’USER MODIFY’ command. The PC middleware extends the PKCS#11 interface, allowing developers to integrate the proposed system in their applications. The implementation of the interface is comprehensively described in Section 5.7. It implements methods that send out the commands supported by the device, as listed in Table 4.3.
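The privilege check implied by the grouping of Table 4.3 can be sketched as a lookup table. The mapping below is an assumption derived from the table headings, not the firmware logic: the login command itself is treated as available without authentication, and the administrator (ID 0) is assumed to be allowed to issue user commands as well.

```python
ADMIN_ID = 0

# None: no login needed, "user": any logged-in account, "admin": ID 0 only.
REQUIRED_ROLE = {
    "DVC CHECK": None, "SESS START": None, "SESS END": None,
    "SEND TIME": None, "DVC INIT": None,
    "SESS LOGIN": None,  # assumption: logging in cannot itself require a login
    "SESS LOGOUT": "user", "DTSN SIGN": "user", "USER CERT": "user",
    "USER MODIFY": "user", "LOGS ADD": "user", "USER GENKEYS": "user",
    "CRT REQUEST": "admin", "USER NEW": "admin", "USER DELETE": "admin",
}


def is_allowed(command: str, logged_in_id) -> bool:
    """Decide whether the currently logged-in ID (or None) may run a command."""
    role = REQUIRED_ROLE[command]
    if role is None:
        return True
    if logged_in_id is None:
        return False
    if role == "admin":
        return logged_in_id == ADMIN_ID
    return True  # assumption: any authenticated account may run user commands


anonymous_sign = is_allowed("DTSN SIGN", None)
user_sign = is_allowed("DTSN SIGN", 2)
user_new = is_allowed("USER NEW", 2)
admin_new = is_allowed("USER NEW", ADMIN_ID)
```

Keeping the policy in a single table makes it straightforward to add a command such as the Log-Chain entry without touching the dispatch logic.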

4.3 Log-Chain

One of the main advantages of the solution herein presented over existing commercial HSMs is its high flexibility and reconfigurability. To demonstrate that, and without modifying the system core, a Log-Chain feature was added, which consists of a scheme that provides a reliable way to log messages or commands into a signed chained log (e.g. Linux Syslog entries). The Log-Chain requires a counter value to be stored within the device (which counts the total number of logged entries), along with a key pair that is used for the Log-Chain signatures. For the device to accept the addition of a log entry (described below) and perform its signing, a valid user must establish a session with the device and authenticate using their PIN. The Log-Chain ensures that an entry cannot be removed without breaking the cryptographic validity of the chain. This is done by concatenating each entry with the hash of the previous entry and the current counter value, and signing the result with the Log-Chain private key. Note that the counter is managed and kept inside the device, as well as the Log-Chain private key, which is only located and used in the device and cannot be extracted. Each log entry is composed of the message (or command), the date and time, the User Identification (UID) of the requesting user, the internal counter, and the hash of the previous entry, as:

Log Entry

$Message_i$ | date time | UID | counter | $Hash_{i-1}$

For example, if user 2 requests the addition of message “[670164.882] rm *.tex”, the log entry will be:

[670164.882] rm *.tex | 10-04-2016,13:10:20 | 2 | 27 | FB···71

where this would correspond to the 27th log entry and FB···71 would be the hash of the previous entry. Using this operation, it is possible to create a Log-Chain file, which is managed by the middleware at the PC side using dedicated calls. This middleware is responsible for the creation, management, and storage of the log files. To start a new Log-Chain, the administrator must define an initial hash value (as there is no $Hash_{i-1}$ for the first entry) and a message. Following this, the file will contain each log entry with the respective system date, hash, and resulting signature, as depicted in Figure 4.2.

10-04-2016,10:54:10:{log root} [hash0][signature0]
10-04-2016,10:54:15:{log entry 1} [hash1][signature1]
10-04-2016,10:54:17:{log entry 2} [hash2][signature2]
10-04-2016,10:54:18:{log entry 3} [hash3][signature3]

Figure 4.2: Log file example. Each horizontal line represents a line break. Hashes and signatures are Base64 encoded.

To optimize the communication between the PC and the device, the hash values are computed in the middleware at the PC side and sent to the device. Herein, it is assumed that the PC is able to securely acquire each message/command (for example by having a daemon running in kernel mode) and is able to store the resulting signature and associated log entry in a file. The device is only responsible for signing the log entries and ensuring the signing order (by adding the device time and internal counter value). The resulting Log-Chain is depicted in Figure 4.3. Should a malicious user delete an entry from the log and provide an incorrect $Hash_{i-1}$ to the device (for example, pointing to 10 entries before), the counters will be out of synchronization and the verification of the Log-Chain will fail when requested.

Figure 4.3: Structure of a Log-Chain. Each hash is computed using the previous hash of the log. The first hash is defined by the device administrator (UID=0) and set as the root hash value.
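The chaining of entries and the detection of a deleted entry can be sketched as below. Signatures are left out to keep the example focused on the chain itself. The use of SHA-256 over the UTF-8 text of the entry, the hexadecimal encoding of hashes and all function names are assumptions for illustration.

```python
import hashlib


def make_entry(message: str, date_time: str, uid: int, counter: int,
               prev_hash: str) -> str:
    """Format: Message | date time | UID | counter | hash of previous entry."""
    return " | ".join([message, date_time, str(uid), str(counter), prev_hash])


def entry_hash(entry: str) -> str:
    return hashlib.sha256(entry.encode("utf-8")).hexdigest()


def build_chain(root_hash: str, messages: list, uid: int = 2,
                date_time: str = "10-04-2016,13:10:20") -> list:
    entries, prev = [], root_hash
    for counter, message in enumerate(messages, start=1):
        entry = make_entry(message, date_time, uid, counter, prev)
        entries.append(entry)
        prev = entry_hash(entry)
    return entries


def verify_chain(root_hash: str, entries: list) -> bool:
    """Check that every entry links to its predecessor and counters are in order."""
    prev = root_hash
    for expected_counter, entry in enumerate(entries, start=1):
        _message, _date, _uid, counter, linked = entry.rsplit(" | ", 4)
        if linked != prev or int(counter) != expected_counter:
            return False
        prev = entry_hash(entry)
    return True


root = hashlib.sha256(b"log root").hexdigest()  # root hash set by the administrator
chain = build_chain(root, ["rm *.tex", "ls", "shutdown"])
tampered = chain[:1] + chain[2:]                # second entry deleted
```

Removing the second entry leaves the third one pointing at a hash that no longer matches its new predecessor, and its counter is also out of sequence, so verification fails on both grounds.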

Thousands of log entries may be requested per minute, making it computationally infeasible to sign each one, especially on a low-cost embedded system. To cope with this, the proposed approach allows the application using the API to decide when to request a signature. Instead of requesting the signing of every message/command, the application can request a signature every X messages, and always requests the signature of the last one, as depicted in Figure 4.4. This increases the system efficiency and allows it to be adjusted to the performance of the device. Security-wise, this is equivalent to signing each entry, but in order to validate a log entry belonging to a batch, the entire batch needs to be verified. To facilitate the management of the log files, the middleware separates them into a two-level folder hierarchy, in which the top level represents the year and the second level represents the month. The days are represented by each log file, as depicted in Figure 4.5. The first entry of a given day triggers the creation of a new log file, and the creation of the respective year and month folders if they do not exist already.

Figure 4.4: The structure of a log chain with grouped log entries.

Figure 4.5: The structure of log folders and their files. The current year and month folders are highlighted in dark grey and the current day log file is highlighted in yellow.

To validate any part of the chain, the middleware only needs to use the starting hash value for that part of the chain, compute the intermediate hash values up to the chain point that needs to be validated, and validate the signature of that ending point. Should the whole chain need to be validated, the initial hash value needs to be supplied so that the root entry can be validated. Validation is a relatively light process, since other than the hashing of each entry, only one signature needs to be verified. The middleware extends the PKCS#11 interface to provide this functionality, allowing the validation of one day, one month, or the entire chain, to ensure that the chain has not been broken. The signature verification process is performed solely on the PC side, without the need for the device, except to retrieve the most recent log counters.
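Batched signing and the validation of a chain segment can be sketched as follows. Two simplifications are assumed: each hash is computed as SHA-256 of the previous hash concatenated with the raw entry, and the device signature is replaced by an HMAC stand-in so the sketch runs with the standard library (the proposed system uses an asymmetric Log-Chain key pair, so the PC would verify with the public key only). All names are hypothetical.

```python
import hashlib
import hmac

DEVICE_KEY = b"demo-log-chain-key"  # stand-in for the Log-Chain private key


def sign(digest: bytes) -> bytes:
    """Stand-in for the device signature over a chain hash."""
    return hmac.new(DEVICE_KEY, digest, hashlib.sha256).digest()


def verify(digest: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(sign(digest), signature)


def chain_hashes(start_hash: bytes, entries: list) -> list:
    """Recompute the hash of every entry, starting from a known hash."""
    hashes, prev = [], start_hash
    for entry in entries:
        prev = hashlib.sha256(prev + entry).digest()
        hashes.append(prev)
    return hashes


def sign_in_batches(start_hash: bytes, entries: list, batch_size: int) -> dict:
    """Sign every batch_size-th entry and always the last one."""
    hashes = chain_hashes(start_hash, entries)
    return {i: sign(h) for i, h in enumerate(hashes)
            if (i + 1) % batch_size == 0 or i == len(hashes) - 1}


def validate_up_to(start_hash: bytes, entries: list, end_index: int,
                   signature: bytes) -> bool:
    """Hash every entry up to end_index, then verify a single signature."""
    hashes = chain_hashes(start_hash, entries[:end_index + 1])
    return verify(hashes[-1], signature)


root = hashlib.sha256(b"log root").digest()
entries = [f"entry {i}".encode() for i in range(7)]
signatures = sign_in_batches(root, entries, batch_size=3)

forged = list(entries)
forged[1] = b"entry X"  # modify an unsigned entry inside the first batch
```

Only the entries at indexes 2, 5 and 6 carry a signature, yet changing the unsigned entry at index 1 still invalidates the signature at index 2, because every later hash depends on it.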

4.4 Conclusions

This chapter presented the proposed solution to create a tamper-proof certification system based on a low-cost but secure non-volatile device. More specifically, the key features of the system are highlighted, including the capability of signing data and generating digital certificates for its users, which are managed by an administrator. To demonstrate the flexibility of the system, the Log-Chain scheme is proposed and detailed, allowing users to add data to a chain that cannot be repudiated, and can be validated at any point in time, without using the device. To increase the efficiency, applications are given the ability to choose when a Log-Chain signing operation is to be performed.

In terms of security, a detailed explanation is given as to how keys are generated and managed internally and externally. In particular, the master keys are generated through a PUF-based mechanism and stored internally, in a secure region, while user keys are generated through a TRNG and stored encrypted in an external Flash memory. Moreover, the system supports open-channel and secure-channel communication. For the latter, ECIES is used, which combines ECDH with a KDF that provides the encryption and authentication keys, used to encrypt the messages with AES-256-CBC and to authenticate them with HMAC-SHA-256. As most embedded systems have limited processing power and small internal memories, Elliptic Curve cryptography is favoured over RSA. The next chapter provides the implementation details of the proposed solution, describing the technical aspects of the chosen device, how the device limitations are overcome, how the cryptographic operations are performed, how the communication channel is created, the developed PC middleware and the considered optimizations.

Chapter 5

Implementation

This chapter presents the implementation of the solution proposed in the previous chapter, to create an HSM with a non-volatile FPGA. To implement the proposed solution, the SmartFusion2 SoC is used, given its design and data security features for safety-critical applications, as highlighted in Section 2.4. More specifically, the SmartFusion2 Security Evaluation Kit with a SmartFusion2 90-TS SoC model was chosen.

While the device supports Operating Systems such as uClinux and FreeRTOS, the amount of memory they require would imply the use of external volatile and non-volatile memories to increase the overall system memory. The OS would then run externally, which would expose it to several possible attacks, and thus could not be considered secure. To overcome this problem, it would be possible to deploy an extra core in the FPGA fabric to encrypt and decrypt program data on-the-fly, at the cost of increased system latency. Alternatively, the device supports BareMetal applications, which can fit within the device's memories. This alternative also has the advantage of allowing direct access to the inner components of the device, such as the embedded security cores, the FPGA fabric and the internal memories. This is possible because Microsemi provides BareMetal software drivers that take care of the communication with the cores. On the other hand, as there is no Operating System or kernel to manage multiple threads, it is not possible to have a multi-threaded system. Therefore, the system only supports one session at a time.

Because the device has low hardware specifications, including small memories and low CPU performance, Elliptic Curve cryptography was chosen over RSA. Among the libraries that support ECC, the mbedTLS C library (or ARM® mbed) [47] is the one that copes best with the limitations of embedded systems, as it has a modular structure, making it possible to disable most of the unused features and provide alternatives to the implemented algorithms (e.g. hardware implementations). Among other algorithms, the library provides software implementations of AES-256, SHA-256, ECDH, ECDSA and X.509v3 certificate generation, which are required for the development of the proposed system.

To assess the performance of the system and possibly increase its security, three different architectures were created. The first one consists of using the mbedTLS algorithms as provided by the library. In the second architecture, three new modules were created, which call the AES-256, SHA-256 and Elliptic Curve scalar multiplication cores embedded in the device. They were then configured as alternative implementations to the existing algorithms in mbedTLS. The final architecture consists of a modified version of the second, in which the SHA-256 algorithm is implemented by an open-source core that is deployed in the FPGA.

To provide a means for developers to integrate the system with their applications, an extended PKCS#11 API was created. Since there must be a communication channel between the device and the PC, a USB connection was made possible by creating a communication protocol that is capable of providing data integrity, authentication and confidentiality.

Because some features require a valid date and time (e.g. Log-Chain and digital certificates), and the device's Real Time Clock is reset on shutdown due to the lack of a battery, it is necessary that the system maintains internal clock freshness and synchronization with a trusted outside party. Three solutions are presented in this chapter, but only one fulfils the system needs while fitting within its limitations. For that, a new external service was created, which provides a trusted and authenticated timestamp for a device-generated nonce.

Considering the above, the system configuration is detailed in Section 5.1, followed by a detailed explanation of the three architectures in Section 5.2. Afterwards, key management is described in Section 5.3 and the system memory structure is depicted in Section 5.4. Subsequently, in Section 5.5, the Log-Chain implementation is presented. Section 5.6 details the implementation of the communication channel and, right after, in Section 5.7, the PC Middleware API is described, followed by an explanation of the implemented Simple Time Service in Section 5.8. The chapter ends with a summary of the aforementioned sections.

5.1 Device Configuration and Setup

As described in Section 2.4.2, the SmartFusion2 SoC has several configurations that can be enabled to increase the security of the overall system. First, a 256-bit key is generated and used to encrypt and authenticate bitstreams and application data. This key is stored in a secure region within the device, assuring that only authenticated and protected bitstreams are installed. To ensure that the system boots securely, power-on digest and clock frequency checks have been enabled, guaranteeing that the internal memories and FPGA fabric have the correct configuration when the system boots. Moreover, the device was configured to lock read/write/verify operations on the FPGA fabric and eNVM with a randomly generated 256-bit key, disallowing any kind of access to them without providing the secret key. Similarly, JTAG access and debugging features are also locked with a different 256-bit key, preventing access to them unless the key is provided. Additionally, the device supports versioning, i.e. the ability to include versions in bitstreams, which forces configuration bitstream updates to have a greater version than the one currently installed. This feature has been enabled, which prevents back-tracking attacks, in which a malicious user installs an older version of the configuration that is vulnerable to certain attacks in order to exploit it.

In order to make the most out of the tamper-detection features provided by the SmartFusion2 SoC, the software was developed to integrate with the tamper-detection mechanisms, which include the detection of physical changes to the hardware, clock de-synchronization and invalid digest checks, having the possibility of zeroization when those events happen.

5.2 Cryptographic Operations

As introduced at the beginning of this chapter, the software was developed with the support of the mbedTLS C library (or ARM® mbed) [47], which was created with embedded systems in mind. This library provides methods to generate Elliptic Curve key pairs, generate X.509v3 certificates, perform AES-256 encryption and decryption, and compute several hashing algorithms, as well as implementations of ECDH and ECDSA. Given its modular structure, it is possible to disable most of the unused features and provide alternatives to the available implementations.

In order to develop a better understanding of the device and the overall system, three architectures were designed and implemented. The first one, depicted in Figure 5.1, uses the mbedTLS algorithms as provided by the library, meaning that all operations run in the software stack and may be subject to certain side-channel attacks [48], especially in the scalar multiplication taking place in ECC, which is used by ECDH and ECDSA.

Figure 5.1: The first system architecture, which uses the mbedTLS algorithms to perform cryptographic operations. The unused modules are greyed out.

To increase the security of the system and to assess possible performance optimizations, a second version was created, which provides three new modules to mbedTLS that call the AES-256, SHA-256 and Elliptic Curve scalar multiplication cores embedded in the device. The AES-256 and SHA-256 modules were easily integrated with mbedTLS because it supports direct integration of alternatives to those algorithms. However, for Elliptic Curve scalar multiplication, the existing routines had to be modified directly to call the EC scalar multiplication core. This is not as straightforward as it sounds, because the points need to be converted from the mbedTLS data structures into the data format supported by the core, and then the other way around, when converting the result back into the format supported by mbedTLS. This architecture is depicted in Figure 5.2.

Figure 5.2: The second system architecture, which uses the SoC embedded cores for additional security and possible performance. The unused modules are greyed out.

Finally, the last architecture, which is depicted in Figure 5.3, was created to understand whether or not a cryptographic core running in the FPGA fabric can accelerate the system. The main cryptographic operations are AES-256, SHA-256 and EC scalar multiplication. Due to the lack of Elliptic Curve scalar multiplication cores for the NIST-P384 curve (the one supported by the device) and because AES-256 cores are more complex and require more input data than SHA-256 cores (e.g. cipher mode and IV), it was decided to deploy an open-source SHA-256 core [49] in the FPGA fabric. This also made it possible to assess how easy or hard it is to integrate FPGA-based cores with the software running on the Cortex-M3. The core has a throughput of 580 Mbps at 74 MHz and is fully compliant with the NIST FIPS-180-4 SHA-256 approved algorithm. It is implemented as single-cycle combinational logic for one iteration of the hash core.

Figure 5.3: The third system architecture, which uses the FPGA to accelerate the SHA-256 algorithm. The unused modules are greyed out.

The connection between the Microcontroller Subsystem (MSS), which contains the CPU, and the FPGA fabric is done via an AMBA¹ Advanced High-performance Bus (AHB), which is mapped to a specific memory region that the software writes to and reads from in blocks of 32 bits. On the FPGA fabric, it was necessary to create and deploy an AHB Slave that works the same way, i.e. it writes to and reads from the same specific address. The slave connects to a Core Controller that was created to read input from the AHB Slave and process it. The Core Controller, as the name implies, controls what the SHA-256 core does, which information is sent to it and when. When the hash computation finishes, the controller sends data to the AHB Slave so that the software can read the final result.

¹ The Advanced Microcontroller Bus Architecture (AMBA) is a standard for the connection of blocks within a System-on-Chip [50].

Besides the SHA-256 input data, the software needs to send additional data to the Core Controller, so that it knows, for example, when to direct the SHA-256 core to start the hash computation or whether the last block has already been sent. On the other hand, the Core Controller also needs to be able to tell the software that the computation has finished or that the software can send more data. This is done with the device's General Purpose Input/Output pins, which have been configured to connect the MSS directly with the Controller. They are used as flags whose meaning is known to both parties.

Finally, because one of the requirements for an HSM (as mentioned in Section 2.2) is to contain true random number generation, the SmartFusion2 SoC SRAM-PUF service is used to obtain a high-entropy seed for the mbedTLS routines that generate random numbers internally. Moreover, the developed software uses the True Random Number Generator core of the SmartFusion2 SoC whenever it must generate a random number itself, such as IVs and nonces.
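The exchange between the software and the Core Controller described in this section can be modelled in software. The sketch below is only an illustration: the class `MockCoreController`, its methods and the single "done" flag are hypothetical, and it assumes that the software applies the FIPS 180-4 padding before streaming the 512-bit blocks as 32-bit words, which is not stated in the text. The mock recovers the message from the padding and calls `hashlib` instead of running the compression function block by block, as a real core would.

```python
import hashlib
import struct


def pad_sha256(message: bytes) -> bytes:
    """FIPS 180-4 padding: 0x80, zero bytes, then the 64-bit bit length."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    return padded + struct.pack(">Q", len(message) * 8)


class MockCoreController:
    """Software stand-in for the fabric-side controller (hypothetical)."""

    def __init__(self) -> None:
        self.buffer = bytearray()
        self.done = False          # models a GPIO flag: digest is ready
        self.digest_words = []

    def write_word(self, word: int) -> None:
        """Models one 32-bit write to the memory-mapped AHB address."""
        self.buffer += struct.pack(">I", word)

    def set_last_block(self) -> None:
        """Models the GPIO flag raised by the software after the last block."""
        bit_len = struct.unpack(">Q", bytes(self.buffer[-8:]))[0]
        message = bytes(self.buffer[: bit_len // 8])
        digest = hashlib.sha256(message).digest()   # shortcut, see lead-in
        self.digest_words = list(struct.unpack(">8I", digest))
        self.done = True

    def read_word(self, index: int) -> int:
        """Models one 32-bit read of the result through the AHB Slave."""
        return self.digest_words[index]


def hash_via_core(core: MockCoreController, message: bytes) -> bytes:
    """Stream a padded message to the core and read back the digest."""
    padded = pad_sha256(message)
    for offset in range(0, len(padded), 64):
        for word in struct.unpack(">16I", padded[offset:offset + 64]):
            core.write_word(word)
    core.set_last_block()
    if not core.done:
        raise RuntimeError("core did not signal completion")
    return b"".join(struct.pack(">I", core.read_word(i)) for i in range(8))
```

On the device, the flag would be polled through the GPIO pins and the words would be written to the AHB Interface address rather than to a Python object.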

5.3 Key Generation and Management

As detailed in Section 4.1, the system contains three master asymmetric key pairs and one symmetric key. The key pairs are used for the certificate issuer signature, the device authentication in the secure session establishment and, finally, the Log-Chain signatures. The symmetric key is used as the wrapping key for the contents stored in the external Flash memory. To generate these keys, it would be possible to use the True Random Number Generator directly and store the keys in the eNVM. However, because the SmartFusion2 SoC contains a secure PUF-based key generation service, called SRAM-PUF, the four keys are generated and stored internally through it. This allows the creation of device-dependent keys, providing an additional anti-cloning mechanism to the system. Moreover, all the data generated through the SRAM-PUF service is stored in a protected region of the eNVM, which is inaccessible to the programmer except through the service itself and is completely isolated from the outside world.

The SRAM-PUF service combines the passive zeroization feature of volatile memories with tamper-resistant non-volatile key storage without requiring batteries [40]. More specifically, it uses the random start-up behaviour of the SRAM block to determine a static secret for each device. The first time the SRAM-PUF service is used, a particular intrinsic secret is obtained through a process called enrolment, which creates an activation code (used for key code generation and reconstruction) from the start-up state of the SRAM-PUF and stores it in the protected region of the eNVM. Instead of storing keys in the non-volatile memory, a key code (KC) is generated from each key and stored in the protected eNVM region. Because keys are never actually stored in memory, an attacker cannot use the key codes even after gaining access to them.

Although it is possible to supply an extrinsic key to the service and generate a KC from it, it is more secure to have the service generate keys at hardware level, using its high-entropy seed and the True Random Number Generator. The public keys are computed at boot time through the Elliptic Curve cores embedded in the device. Since keys are not stored in memory, they have to be reconstructed, which is why activation and key codes are required. Figure 5.4 exemplifies how keys 2 and 3 would be obtained through the SRAM-PUF service by providing the activation code and the respective key codes.

Figure 5.4: SRAM-PUF core example for Key Code 2 and 3.

In addition to the master keys, the system also generates and stores individual keys for each user to perform data signing. These are generated with the mbedTLS library (which has been configured to use an SRAM-PUF seed as a high-entropy source) and then stored in the external SPI Flash, encrypted with AES using the symmetric master key. The memory management, which includes key storage, is described in more depth in the following section.

5.4 Memory

Due to its security features, as described in Section 2.4.2, the SmartFusion2 SoC is considered to be secure and sensitive data should be stored within it if possible. However, the internal non-volatile memory (eNVM) is rather small (512KB), considering that there are several pieces of information that need to be stored, such as the software, master and user keys as well as administrator and user PINs. Therefore, it was decided to split the data between the internal non-volatile memory and the external SPI Flash (8MB) provided by the SmartFusion2 Security Evaluation Kit.

To support a large number of users, it was decided that user data should be stored externally, including user IDs, PIN hashes and key pairs. Since the external memory is divided into sectors of 4KB, which is enough to store the user data, and because each sector can be erased individually, every user is associated with a sector (or block). Moreover, to assure confidentiality of the information stored externally, each sector is encrypted through AES-256-CBC, with a unique IV per user, ensuring that encrypting the same block content twice yields two completely different ciphertexts, which is required to guarantee semantic security.

This allows the most sensitive data to be stored internally, namely the master keys, admin PIN, Simple Time Service public key (more details in Section 5.8), Log-Chain counter as well as the hashes and IVs of each external Flash sector. The sector hashes are used to guarantee the integrity of the external data (which can be modified), by comparing the internally stored hash with the hash computed when reading an external user block. They must be stored internally because the external memory may be tampered with, whereas the internal one cannot. For this implementation, the maximum number of supported users was chosen to be 255, allowing a user to be identified with a single byte. This allows for a low but realistic number of hashes and IVs to be stored internally, which is crucial because of memory size limitations. It is important to note that the ID 0 represents the administrator, which is why the maximum number of users is 255 and not 256. Table 5.1 lists all the data storage requirements inside the internal and external non-volatile memories.

Table 5.1: Non-volatile memory usage requirements for the implemented system.

Memory                      Data                              Size
eNVM                        Program Instructions              86KB
                            Certificate Issuer Private Key    48B
                            Secure Session Private Key        48B
                            Log-Chain Private Key             48B
                            STS Public Key                    221B
                            255 16B IVs                       4,080B
                            255 32B hashes                    8,160B
                            Admin PIN                         32B
                            Log-Chain Counter                 2B
External SPI Flash Sector   ID                                1B
                            PIN hash                          32B
                            Private Key                       48B
                            Public Key                        48B
                            Public Key Certificate            800B
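The integrity check on external sectors described before Table 5.1 can be sketched as follows. This is a minimal model, not the device code: the internal and external memories are represented by Python dictionaries, the names `InternalStore`, `write_sector` and `read_sector` are hypothetical, and the hash is assumed to be computed over the sector bytes exactly as stored in the external Flash (i.e. over the encrypted data), which the text does not specify. AES-256-CBC encryption is left out.

```python
import hashlib
import hmac

SECTOR_SIZE = 4096   # external SPI Flash sector size (4KB)
MAX_USERS = 255      # user IDs 1..255; ID 0 is the administrator


class InternalStore:
    """Models the per-sector SHA-256 hashes kept inside the SoC."""

    def __init__(self) -> None:
        self.sector_hashes = {}


def write_sector(store: InternalStore, flash: dict, user_id: int, sector: bytes) -> None:
    """Write a (previously encrypted) user sector and record its hash internally."""
    if not 1 <= user_id <= MAX_USERS:
        raise ValueError("user ID out of range")
    if len(sector) > SECTOR_SIZE:
        raise ValueError("user data does not fit in one sector")
    flash[user_id] = sector
    store.sector_hashes[user_id] = hashlib.sha256(sector).digest()


def read_sector(store: InternalStore, flash: dict, user_id: int) -> bytes:
    """Read a user sector, rejecting it if it no longer matches the internal hash."""
    sector = flash[user_id]
    computed = hashlib.sha256(sector).digest()
    if not hmac.compare_digest(computed, store.sector_hashes[user_id]):
        raise ValueError("external sector failed the integrity check")
    return sector
```

With 255 users, the internal table of 32B hashes occupies 255 × 32 = 8,160B, which matches the figure listed in Table 5.1.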

During development, the amount of memory available is even more limited, as the eNVM is not used, to avoid wearing it out with constant writes. Therefore, the code must be stored and executed in the internal RAM. Given the limited volatile memory (eSRAM), a new memory architecture had to be set up. By re-mapping the FPGA LSRAMs to an address available to the MSS, it was possible to extend the RAM size by an extra 64KB. This memory architecture is depicted in Figure 5.5. In particular, the eSRAM held the .text and .data program sections, while the FPGA LSRAM memories held the .bss, .heap and .stack. This scheme provided up to 128KB of available volatile memory contained within the secure SoC while developing and debugging the system, without decreasing the life of the non-volatile memory. Had this not been possible, it would have been extremely difficult to develop and test the system, because the eSRAM contains only 64KB. As described in Section 5.2, the AHB Interface is mapped to a specific memory address, so that the software can communicate with the FPGA. When in production mode, the eNVM can finally be used to store the .text and .data sections, while the .bss, .heap and .stack are stored in the eSRAM. This scheme, which is depicted in Figure 5.6, no longer needs the LSRAMs from the FPGA fabric.

The eNVM of the considered device is limited to 1000 programming cycles per page of 128B [51]. Because of this limitation, the implemented solution stores the user sector hashes, IVs and Log-Chain counter in RAM, whereas the admin PIN is stored in the eNVM. This is purely for Proof-of-Concept (PoC), demonstrating that it is possible to write and read from the eNVM, and that the proposed system architecture works as expected (although the information kept in RAM is lost when power is turned off). On the other hand, the external SPI Flash memory has a stated endurance of about 100,000 cycles per sector, therefore requiring no special attention.

Address       Region          Size      Data
0x20000000    eSRAM           64 KB     .text, .data
0x30000000    LSRAM           64 KB     .bss, .heap, .stack
0x50000000    AHB Interface   -         -

Figure 5.5: Development mode internal memory map.

Address       Region          Size      Data
0x00000000    eNVM            512 KB    .text, .data
0x20000000    eSRAM           64 KB     .bss, .heap, .stack
0x50000000    AHB Interface   -         -

Figure 5.6: Production mode internal memory map.

5.5 Log-Chain

To demonstrate the flexibility of the system and its adaptability to new features, the Log-Chain was implemented. To achieve that, a new module was added to the system, without modifying existing components. The module provides two functionalities: initialize and increment. The former, as the name implies, allows the administrator to initialize the Log-Chain to be created on the PC. The latter allows a user to add a new entry to the Log-Chain. In our implementation, only one Log-Chain is permitted to exist at a time, otherwise multiple Log-Chain counters would need to exist and IDs would have to be assigned to each Log-Chain.

As mentioned in Section 5.4, for Proof-of-Concept, the Log-Chain counter is stored in RAM, otherwise the internal eNVM would wear out quite easily, because of its limited endurance. Therefore, whenever the device is rebooted, the Log-Chain counter is lost and a new Log-Chain needs to be initialized. The counter is actually made of two 1B counters, supporting up to 8.5 billion entries. A possible alternative would be to store the counter in the external SPI Flash (which has a higher endurance), but it would still be required to maintain the integrity of the data, so its hash would need to be stored internally and updated whenever the counter is incremented, rendering this solution useless. Another alternative consists of connecting a battery to the HSM, and storing both in a secure place. The system would store the counter in RAM until the battery detected that it needed to be charged. At that time, the battery would send out a signal to the HSM, which in turn would store the counter value in the internal eNVM as backup. This requires the battery and the HSM to be stored in a secure environment, as a malicious user could easily unplug the battery or tamper with the signal cable.

For simplicity, we opted to keep the counter in RAM until the system is rebooted. We believe this exhibits the capabilities of the Log-Chain, while demonstrating that it is possible to add new features without interfering with the core functionalities of the system.

5.6 Communication Channel

For our system to be used by an external application running on a PC, a communication channel must be established between the PC and the HSM. For instance, a communication protocol can be built on top of a USB or Ethernet link, allowing for the communication between both parties. As the device integrates an Ethernet interface, and because mbedTLS makes it possible to create a TCP/IP server and client, it would be interesting to create a server on the HSM and allow a client, on the PC, to communicate with it. However, this is not feasible, because it is not possible to store the required algorithms within the device due to limited memory capacity.

On the other hand, the device also supports USB communication. However, according to our tests, the device's USB hardware module does not work as expected. As a matter of fact, the internal buffer is not managed properly, disrupting the communication when the buffer is filled, which leads to unexpected issues, such as information not being received properly or packet de-synchronization. To overcome this problem, the communication is done via UART², which uses the USB physical link to perform serial communication with the PC. Although the internal buffer continues to experience problems, it is possible to create a protocol that does not fill the whole buffer. Our tests showed that the communication works as intended if the buffer is allowed to fill up to a maximum of 16B.

² The SmartFusion2 SoC provides a Universal Asynchronous Receiver/Transmitter module for serial communication on top of a USB link.

Considering the above, it was necessary to create a protocol which essentially splits data into blocks with a maximum size of 16B. To prevent the receiving party's buffer from filling up, each block must be sent out separately and acknowledged by the other side once it has been read. Once the sender receives the acknowledgement, it sends the next block. This guarantees that the receiving buffer has been fully read before more data is sent by the other party, avoiding data loss.

To ensure that the exchanged messages follow the intended flow (i.e. to prevent command B from being executed before A), each message is accompanied by a counter that is stored by both parties. If a received message's counter differs from the expected one, the communication session is terminated.

Because the HSM may be installed in an insecure environment, it may be necessary that the communication is encrypted and authenticated. As explained in the previous chapter, the proposed solution supports secure communication, should the HSM be shipped with that configuration. In order to guarantee semantic security, every time a new message is sent, a new IV is generated for the encryption of that specific message, ensuring that if two messages are equal, the resulting ciphertexts are different. Additionally, each message is also sent with the HMAC-SHA-256 of the ciphertext, guaranteeing data integrity.

The protocol, as seen by the receiving party, is depicted in Figure A.1, found in Appendix A. The process starts by verifying the received message's counter, which should match the internally stored counter plus 1. Then, if the communication is secure, the receiver expects the 16B IV generated by the sender. Afterwards, it reads the data size and calculates the total number of 16B blocks to process, as each block must be processed individually due to the internal buffer's issue. For each 16B block, it reads the data from the buffer and, if on a secure channel, decrypts it. Once all 16B blocks have been processed, it reads the remaining bytes (if any). If on a secure channel, the receiving party also expects the HMAC of the ciphertext, as computed by the sender, and compares the received HMAC with the one computed on the receiving side.

The process fails if the counters are out of synchronization, if the receiver waits more than 10s for an input block of 16B or if the HMACs do not match. In those cases, the session is automatically terminated. The timeout is set to 10s because some operations may require additional processing time and external connections, as will be discussed in Section 5.8. The keys used for encryption and authentication are the ones resulting from the secure session establishment protocol, presented in Section 4.2.

The implemented communication protocol successfully overcomes the issue present in the USB's internal buffer, while guaranteeing the communication flow through message counters. Additionally, as the system must support secure communication, the implemented protocol also provides a way to exchange data in a secure and authenticated way.
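The receiving side of the protocol described above can be sketched as follows. This is a simplified model under several assumptions: the class `Receiver` and the function `split_blocks` are hypothetical names, the per-block acknowledgements, the 10s timeout and the IV are not modelled, and on a secure channel the payload is treated as an opaque ciphertext because AES-256-CBC decryption is left out. Only the block splitting, the counter check and the HMAC-SHA-256 comparison are shown.

```python
import hashlib
import hmac

BLOCK_SIZE = 16  # largest amount of data the buffer handles reliably


def split_blocks(data: bytes) -> list:
    """Split a message into 16B blocks; the last block holds the remaining bytes."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


class Receiver:
    """Receiving party; mac_key is None on an open channel."""

    def __init__(self, mac_key=None) -> None:
        self.counter = 0
        self.mac_key = mac_key

    def receive(self, counter: int, blocks: list, tag=None) -> bytes:
        # The received counter must match the stored counter plus 1.
        if counter != self.counter + 1:
            raise ConnectionError("counter out of synchronization, session terminated")
        # Each block would be read and acknowledged individually on the device.
        payload = b"".join(blocks)
        if self.mac_key is not None:
            expected = hmac.new(self.mac_key, payload, hashlib.sha256).digest()
            if tag is None or not hmac.compare_digest(tag, expected):
                raise ConnectionError("HMAC mismatch, session terminated")
        self.counter = counter
        return payload
```

A sender following the same sketch would call `split_blocks` on the ciphertext, wait for an acknowledgement after each block, and append the HMAC computed over the whole ciphertext.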

5.7 Middleware

Given that the system provides a way for a communication channel to be established with a PC, there must be an API that allows developers to integrate the HSM in their applications. This is a key requirement of our work. One of the most widely known APIs is the PKCS#11 interface [52], which provides a standard for method usage, arguments, return values and error codes.

The implemented middleware was built from an empty PKCS#11 skeleton in C, and was compiled as a Dynamic-link Library (DLL) that can be included by other applications. The DLL implements and extends the PKCS#11 API [52], and provides the additional functions specified in Table 5.2. These allow applications to manage the users within the HSM, to use the Log-Chain feature and to generate and extract public key certificates. The communication with the device is done through the commands listed in Section 4.2. Besides these additional functions, Table 5.3 lists the supported official PKCS#11 API functions.

To use the API functions, the device must first be initialized with C_InitToken, which sends out a "DVC_INIT" command to the device (described in Section 4.2), in order to generate the master keys and define the admin PIN (which is an argument to the function). This function only works the first time it is executed, as the system can only be initialized once. Afterwards, the usage is the one expected from a PKCS#11 API, i.e. a session must be opened before most functions are available. As detailed in the last column of Table 5.2, some functions also require user or admin authentication. If the device has been configured to require secure sessions, the C_OpenSession function establishes a secure session with the device. As mentioned at the beginning of this chapter, the system only supports one session at a time, as it is single-threaded.

Function                Description                                       Requirements
HSM_C_UserAdd           Add a new user.                                   Session, Admin Login
HSM_C_UserModify        Modify an existing user.                          Session, User Login
HSM_C_UserDelete        Delete an existing user.                          Session, Admin Login
HSM_C_LogInit           Initialize the log-chain.                         Session, Admin Login
HSM_C_LogAdd            Add a new log entry to the log-chain.             Session, User Login
HSM_C_LogVerifyDay      Verify a certain day in the log-chain.            -
HSM_C_LogVerifyMonth    Verify a certain month in the log-chain.          -
HSM_C_LogVerifyYear     Verify a certain year in the log-chain.           -
HSM_C_LogVerifyChain    Verify the log-chain.                             -
HSM_C_LogCounter        Verify the log-chain counters.                    Session
HSM_C_CertGen           Generate a certificate for a given public key.    Session, Admin Login
HSM_C_CertGet           Get a user certificate.                           Session
HSM_C_CertDevice        Get a device certificate.                         Session

Table 5.2: Additional API functions.

C_Initialize          C_Finalize            C_GetInfo             C_GetFunctionList
C_GetSlotList         C_GetSlotInfo         C_GetTokenInfo        C_GetMechanismList
C_GetMechanismInfo    C_InitToken           C_SetPIN              C_OpenSession
C_CloseSession        C_CloseAllSessions    C_Login               C_Logout
C_CreateObject        C_DestroyObject       C_GetAttributeValue   C_SignInit
C_Sign                C_SignUpdate          C_SignFinal           C_VerifyInit
C_Verify              C_VerifyUpdate        C_VerifyFinal         C_GenerateKeyPair

Table 5.3: Supported official PKCS#11 API functions.

5.8 Simple Time Service

One of the requirements of our work (and of HSMs in general) is to guarantee internal clock freshness and synchronization with a trusted third party. To achieve this in our implementation, three solutions were considered.

The simplest one consists of using the Network Time Protocol (NTP)³ on the PC to obtain a valid time and send it to the device. This is the default way of retrieving a date and time on Windows, but it has recently been found to contain vulnerabilities [54], allowing for Denial of Service attacks and buffer overflows. Therefore, it was ruled out as a trusted way of obtaining the current time. Furthermore, it would still be required to provide a way to authenticate the time sent from the PC to the device, as the PC cannot be trusted, and the device needs to be assured that the time received can actually be trusted and used.

³ NTP is a protocol designed to synchronize the clocks of computers over a network [53].

A second alternative requires the use of an external Timestamping Authority (TSA) to provide a timestamping service. The device generates a random nonce and sends it to the PC middleware, which takes care of requesting a timestamp from the TSA, which is responsible for timestamping the nonce and signing it. Once the middleware possesses the TSA response, it forwards it to the system, allowing it to update its internal clock after validating the digital signature. However, the requests and responses must follow the formats specified by RFC 3161 [55], which can be quite challenging, as the code to parse and validate the responses must fit within the device memory, in addition to the Timestamping Authority public key certificate that is used to validate the digital signature. An RSA public key certificate from a regular Timestamping Authority usually occupies slightly more than 4KB of memory, including the fields described in Section 2.1.6. In addition to this limitation, Timestamping Authority services are normally paid [43, 44], which poses a financial limitation to the usage of the system. Although there are free alternatives, they are usually subject to a limited number of requests per day and only support RSA [45, 46], whose certificates are larger than ECC certificates, as we have seen previously. Finally, it would require the system to also include code to parse and validate RSA signatures. Since there is not enough memory inside the device to fulfil the requirements above, this option cannot be considered.

The last alternative consists of using an external ECC-based Time Service, similar to a Timestamping Authority, which simply receives a nonce, concatenates it with a UNIX timestamp, hashes the result and signs it.
Unlike regular TSAs, this Time Service is supported by Elliptic Curve Cryptography and contains a simple interface that waits for a given nonce (as opposed to a specially formatted request) and replies with the signature and the concatenation of the nonce with the UNIX timestamp. Since the reply is in an extremely simple format, the device does not need additional parsing code. Moreover, since it is supported by ECC, the public key certificate can easily fit within the device. As an example, the size of an ECC certificate generated by the system is not larger than 800B. Since no services like this have been found, a Simple Time Service (STS) was created from scratch, as a Proof-of-Concept.

This scheme is depicted in Figure 5.7. Whenever the middleware sends a command to the device (1), the system checks how much time has passed since the last time synchronization. For this implementation, a 10h window was chosen, as it is considered long enough to include a regular work day but short enough to make up for possible synchronization losses of the internal Real Time Clock. If more than 10h have passed, the device asks the middleware to provide a timestamp and signature for a given nonce (2, 3, 4). The middleware is then responsible for requesting a timestamp and signature from the STS (5). Once the STS receives a nonce, it sends (6) a Timestamp Request following the RFC 3161 format [55] to the SafeCreative TSA [46] and obtains (7) a Timestamp Response. OpenSSL and cURL are used to facilitate the creation of requests and the parsing of responses. After parsing the response and verifying its signature, the timestamp is extracted and converted into a UNIX timestamp. Afterwards, it is concatenated with the nonce and hashed with SHA-256. Finally, the STS signs the hash and sends

the signature and concatenated data back to the PC middleware (8), which in turn sends both back to the device for validation (9). Once the system receives the nonce, the UNIX timestamp and the signature, it verifies whether the nonce matches the one sent by the device. If there is a mismatch or if more than 5 seconds have passed since the request was performed, the time synchronization does not complete successfully (10). The 5s window ensures that it is not possible to have a considerable de-synchronization between the device and the STS, which may be caused by network delays or a malicious user. Because the STS has been created specifically for this system, it was designed with memory limitations in mind and therefore ECC is used instead of RSA, so that the device does not need additional code and only needs to store the ECC public key. This alternative allows for better memory usage when compared to the second option, while guaranteeing a trusted and authenticated time synchronization with an external third party.

Figure 5.7: The scheme for time synchronization via STS.
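The device-side checks of this scheme can be illustrated with a minimal sketch. It is not the implemented firmware: the function names, the 8-byte big-endian timestamp encoding and the shared key are hypothetical, and HMAC-SHA-256 is used only as a stand-in for the ECDSA signature of the STS, since the Python standard library offers no elliptic curve signatures.

```python
import hashlib
import hmac
import os
import struct

# Hypothetical shared key: a stand-in for the STS ECC key pair (assumption).
STS_KEY = b"demo-sts-key"
RESYNC_WINDOW_S = 10 * 60 * 60  # 10 h between time synchronizations
MAX_DELAY_S = 5                 # the reply must arrive within 5 s


def needs_resync(now, last_sync):
    """Device side: have more than 10 h passed since the last sync?"""
    return now - last_sync > RESYNC_WINDOW_S


def sts_reply(nonce, unix_time):
    """STS side: concatenate nonce and timestamp, hash, then 'sign'."""
    data = nonce + struct.pack(">Q", unix_time)
    digest = hashlib.sha256(data).digest()
    # HMAC stands in for the ECDSA signature over the SHA-256 hash.
    tag = hmac.new(STS_KEY, digest, hashlib.sha256).digest()
    return data, tag


def device_accepts(sent_nonce, sent_at, received_at, data, tag):
    """Device side: check signature, nonce match and the 5 s window.

    Returns the UNIX timestamp on success and None otherwise.
    """
    digest = hashlib.sha256(data).digest()
    expected = hmac.new(STS_KEY, digest, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        return None
    nonce = data[:-8]
    unix_time = struct.unpack(">Q", data[-8:])[0]
    if nonce != sent_nonce or received_at - sent_at > MAX_DELAY_S:
        return None
    return unix_time


if __name__ == "__main__":
    nonce = os.urandom(16)
    data, tag = sts_reply(nonce, 1_500_000_000)
    print(device_accepts(nonce, 0.0, 1.2, data, tag))  # within the window
    print(device_accepts(nonce, 0.0, 6.0, data, tag))  # reply too late
```

Because the reply is just the nonce followed by a fixed-width timestamp, the device needs no parser beyond a slice and an integer conversion, which is the memory argument made above.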

5.9 Conclusions

In this chapter, the process of implementing the tamper-proof certification system on the Smartfusion2 SoC is detailed. The device is properly configured to be deployed in an insecure environment by enabling several security features, such as anti-tampering detection, bitstream and configuration encryption, as well as versioning and locking of debugging features. Due to the lack of a secure deployment of an Operating System, a bare-metal implementation is chosen, allowing for smaller memory usage and direct access to the internal cores through the available drivers. To perform the needed cryptographic operations, the mbedTLS library has been chosen, due to its suitability for embedded devices. Three architectures were constructed. First, the mbedTLS algorithms were used as provided. Then, because the Smartfusion2 SoC provides embedded

hardware cores which are protected against side-channel attacks, the library has been modified to use the following cores instead of software implementations: SHA-256, AES-256 and EC Point Multiplication. In order to increase the performance even further, a third architecture was implemented based on the second, in which a SHA-256 core was deployed in the FPGA fabric and connected to the system.

Additionally, a detailed explanation was given as to how key management is implemented, with particular attention given to how the SRAM-PUF service works and how it is used. Afterwards, memory management is discussed and an explanation is given as to how the memory size limitations are overcome. Moreover, the communication protocol architecture and implementation were presented, emphasising the internal USB buffer limitation and describing the protocol implemented to overcome the problem. As the system provides a PC interface for developers, a description of the PC middleware was given, including the list of available commands and their functions. Because an external date and time provider is needed, three different alternatives were described, with a major focus being given to the Simple Time Service, which was considered the best alternative and was therefore the one implemented.

In the next chapter, the performance of the cryptographic operations is evaluated in software and hardware, to better understand where the performance bottlenecks are and what actually optimizes the system. Moreover, the performance of the main operations is also evaluated (i.e. signing, generating certificates and adding entries to the log-chain). Finally, the solution is compared with the existing State of the Art.

Chapter 6

Results

In order to evaluate the performance of the proposed system, experimental results were obtained for the implemented prototype on a Smartfusion2 Security Evaluation Kit, containing a Smartfusion2 SoC device, using Libero 11.7 SoC, SoftConsole 4.0 and Microsoft Visual Studio Community Edition 14.0 tools. The mbedTLS 2.4.2 C library [47] was used to provide support for ECDH, ECDSA, AES-256, SHA-256, HMAC-SHA-256 and X.509v3 certificate generation.

Three architectures of the system were developed. The first consists of a software-only version, which makes use of the aforementioned algorithms as provided by mbedTLS, while the second architecture makes use of the Smartfusion2 SoC cryptographic cores (EC scalar operations, SHA-256 and AES-256) to provide higher security and possible acceleration. Finally, the last architecture consists of a modified version of the second, in which the SHA-256 algorithm runs entirely on the FPGA fabric, supported by an open-source SHA-256 core [49], to understand whether it is possible to accelerate the operations with FPGA-based cores.

The results presented in this chapter contribute to a better knowledge of the embedded cores of the Smartfusion2 SoC, as well as a better understanding of the system's performance bottleneck. The experimental performance results of the main cryptographic operations (SHA-256, AES-256 and EC Point Multiplication) are discussed in Section 6.1, while the results obtained for the system operations are presented and analysed in Section 6.2. Afterwards, Section 6.3 discusses the results obtained for the communication channel and, in Section 6.4, a qualitative comparison with the State of the Art is performed. The chapter concludes with a discussion and summary of the previous sections.

6.1 Cryptographic Operations

In order to better understand the impact of each basic cryptographic operation, this section presents and analyses the performance results of the following operations: SHA-256, AES-256 and Elliptic Curve scalar multiplication. The results were obtained on the device-side using the internal Real Time Clock, meaning that the acquired values do not include the communication time between the device and the PC.

6.1.1 SHA-256

To evaluate the impact of the different implementation options, three versions of the SHA-256 algorithm were conceived and evaluated (as detailed in Section 5.2), namely using the mbedTLS implementation, the Smartfusion2 SoC embedded hardware core and, finally, a custom hardware core deployed on the FPGA fabric. The obtained results for the SHA-256 computation, depicted in Figure 6.1 and Table 6.1, suggest that when computing large amounts of data, the embedded core is able to achieve a higher throughput (10Mbps) than the software-only version (3.5Mbps), considering only the processing time for a 512-bit input block. However, for small amounts of data, the software-based implementation provided by mbedTLS can actually achieve better results than the computation using the embedded SHA-256 core. This is due to the fixed cost of 423µs to start up the computation process, i.e. invoking and setting up the embedded core (e.g. field initialization and pre-computation). In the software call, invoking the algorithm takes a negligible amount of time, but there is still a fixed cost of 268µs for the algorithm start-up routines.

Figure 6.1: SHA-256 throughput for the three tested implementations.

Table 6.1: Operation times for the three SHA-256 implementations.

Data Size (B)    FPGA-accelerated    Embedded    Software
64               31 µs               476 µs      416 µs
128              57 µs               529 µs      564 µs
256              109 µs              634 µs      860 µs
512              214 µs              847 µs      1,452 µs
1,024            425 µs              1,252 µs    2,636 µs

In terms of processing time, which includes data transfer, the experimental data indicates that the embedded SHA-256 core requires about 53µs to process and transfer each 512-bit input block, whereas the software-only version takes 148µs. When processing a small amount of data, the achievable throughput for the embedded core is relatively low, requiring about 529µs to calculate the hash of 1kbit of data,

which is almost the same as when computed in software (564µs). On the other hand, the FPGA-based implementation has a starting overhead of just 5µs, and a block processing time of 26µs (including data transfer). Therefore, the maximum throughput is about 20Mbps, which is approximately 2 and 6 times larger than the embedded core and software version throughputs respectively, at a cost of 5% FPGA usage. To compute the processing times, in Equation 6.1, it is necessary to subtract the operation time of the previous data size from the current operation time and multiply the result by 64 (the SHA-256 input block size in bytes). Finally, the result is divided by the total data size.

\[
t_{\mathrm{process}_i} = \frac{\left(t_{\mathrm{operation}_i} - t_{\mathrm{operation}_{i-1}}\right) \times \mathrm{block\ size}}{\mathrm{data\ size}} \tag{6.1}
\]

The results clearly suggest that it is a better option to use the SHA-256 core deployed on the FPGA over the other two alternatives, given that only 5% of the FPGA fabric is used and with performance that can be 10 times better. In fact, for larger data, the FPGA-based core can process 1,024 bytes of data almost 3 times faster than the embedded core and approximately 6 times faster than the software-based version. Consequently, the maximum throughput of the FPGA core is several times larger than the other two implementations. However, this implies the use of the FPGA and its resources.
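As a worked check of Equation 6.1 against Table 6.1, the short sketch below recomputes the per-block and start-up times. It is only an illustration: the helper names are hypothetical, and it assumes that the divisor in Equation 6.1 is the number of bytes added between two consecutive rows of the table, which is the reading that reproduces the 53µs, 148µs and 26µs figures quoted in the text.

```python
BLOCK_SIZE = 64  # SHA-256 input block size in bytes

# Operation times from Table 6.1 in microseconds, keyed by data size in bytes.
TIMES_US = {
    "fpga":     {64: 31, 128: 57, 256: 109, 512: 214, 1024: 425},
    "embedded": {64: 476, 128: 529, 256: 634, 512: 847, 1024: 1252},
    "software": {64: 416, 128: 564, 256: 860, 512: 1452, 1024: 2636},
}


def per_block_us(times, size, prev_size):
    """Equation 6.1, with the divisor read as the added bytes (assumption)."""
    return (times[size] - times[prev_size]) * BLOCK_SIZE / (size - prev_size)


def startup_us(times):
    """Fixed cost: time for one block minus the per-block processing time."""
    return times[64] - per_block_us(times, 128, 64)


def break_even_blocks(a, b, limit=1000):
    """Smallest block count at which implementation a beats b, else None."""
    for n in range(1, limit + 1):
        cost_a = startup_us(a) + n * per_block_us(a, 128, 64)
        cost_b = startup_us(b) + n * per_block_us(b, 128, 64)
        if cost_a < cost_b:
            return n
    return None


if __name__ == "__main__":
    for name, times in TIMES_US.items():
        print(name, per_block_us(times, 128, 64), startup_us(times))
    print(break_even_blocks(TIMES_US["embedded"], TIMES_US["software"]))
```

Under this linear model the embedded core overtakes the software version from the second 64-byte block onwards, which agrees with the 64B and 128B rows of Table 6.1.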

6.1.2 AES-256

Regarding AES-256, the algorithm was tested in the software and hardware layers, using mbedTLS for the software implementation and the embedded core provided by the Smartfusion2 SoC for the hardware one. The acquired operation times for different data sizes are available in Table 6.2 and the maximum obtained throughputs are depicted in Figure 6.2. The processing times, which include data transfer, are calculated with Equation 6.1, identically to the SHA-256 algorithm.

When using the embedded AES core, there is a start-up time of approximately 505µs. To encrypt each 128-bit data block, 175µs of processing time are required, whereas the software version only needs 71µs. Given this start-up time and encryption time, the embedded core alternative needs 1.9ms to encrypt 1kbit of data, while the software implementation requires 0.55ms. It is important to highlight that the fixed start-up cost of the software version is negligible, which is demonstrated by the fact that the operation times in Table 6.2 are multiples of the block processing time (71µs), which does not happen in the embedded core version. In terms of throughput, the embedded core and software versions achieve a maximum of about 730kbps and 1.9Mbps respectively, meaning that the latter is approximately 2.5 times faster. This is explained by the numerous optimizations and pre-computations that are available in the AES-256 module provided by mbedTLS.

Considering the results above, the software-based implementation can be seen as a faster alternative to the embedded SoC core. However, the core has been designed to be protected against several side-channel analysis attacks [40]. Therefore, there must be a trade-off between performance and security. Nonetheless, the AES algorithm is only used for encrypting externally stored data and to perform

Table 6.2: Operation times for the two AES-256 implementations.

Data Size (B)    Embedded     Software
16               680 µs       71 µs
32               850 µs       142 µs
64               1,200 µs     283 µs
128              1,900 µs     548 µs
256              3,300 µs     1,094 µs
512              6,100 µs     2,184 µs
1,024            11,710 µs    4,366 µs

secure data transfer with the PC. The impact of the implementation choice is evaluated and discussed in Section 6.3, in which the communication channel performance is analysed.

Figure 6.2: AES-256 throughput for the two tested implementations.
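The start-up and per-block figures quoted for the embedded AES core can be checked against Table 6.2 with a simple linear cost model. The sketch below is illustrative only; the function names are hypothetical and the throughput helper assumes 1 kbit = 1,000 bits.

```python
BLOCK_BYTES = 16    # AES block size (128 bits)
STARTUP_US = 505    # embedded-core start-up time reported in the text
PER_BLOCK_US = 175  # embedded-core processing time per 128-bit block

# Embedded-core operation times from Table 6.2 (data size in bytes -> µs).
EMBEDDED_US = {16: 680, 32: 850, 64: 1200, 128: 1900,
               256: 3300, 512: 6100, 1024: 11710}


def predicted_us(size_bytes):
    """Linear cost model: fixed start-up plus a constant cost per block."""
    return STARTUP_US + (size_bytes // BLOCK_BYTES) * PER_BLOCK_US


def throughput_kbps(size_bytes, time_us):
    """Throughput in kbit/s (1 kbit = 1,000 bits) for one measurement."""
    return size_bytes * 8 / time_us * 1000


if __name__ == "__main__":
    for size, measured in EMBEDDED_US.items():
        print(size, measured, predicted_us(size))
    # Per-block limit of the embedded core, and software at 1,024 B.
    print(round(throughput_kbps(BLOCK_BYTES, PER_BLOCK_US)))
    print(round(throughput_kbps(1024, 4366)))
```

The model stays within a few microseconds of every embedded-core measurement, and the two printed throughputs correspond to the roughly 730kbps and 1.9Mbps quoted above.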

6.1.3 EC Scalar Multiplication

Elliptic Curve operations take place in almost all of the features provided by the system, considering that most of the algorithms used are based on Elliptic Curve Cryptography. Moreover, the Smartfusion2 SoC has built-in cores that perform scalar multiplication and addition in an attempt to increase the security and performance of those operations, when compared to software alternatives. Regarding these two operations, Elliptic Curve scalar multiplication is much slower than scalar addition and subtraction [56], playing a very important role in the efficiency of systems that use elliptic curve cryptography algorithms, such as ECDSA, ECDH and ECIES [57].

To evaluate the impact of Elliptic Curve scalar multiplication, the operation was tested on the software-only version and on a modified version of mbedTLS, in which the scalar multiplication routine was modified to perform the operation through the embedded SoC core. The obtained results were the following: 0.566s per multiplication when performed with the embedded hardware core, and 28.4s per multiplication when computed in the software-only version. The difference is considerably large, with the hardware-based

implementation being approximately 50 times faster than when performed without the embedded core. Considering these results, the fact that mbedTLS is susceptible to side-channel analysis attacks [48] and that the Elliptic Curve cores provided by the Smartfusion2 SoC are protected against side-channel analysis (including DPA) [40], the natural choice is to use the embedded core.

6.2 System Operations

Now that the main cryptographic operations have been analysed, a performance evaluation of the primary system operations is performed and discussed. They include key pair generation and computation of digital signatures, which allow the creation of digital certificates and the addition of entries to the Log-Chain. The performance tests for these operations are depicted in Figure 6.3, for the three implementations considered.

Figure 6.3: System operation times for the three tested implementations.

Given the results obtained in the previous sections, the actual performance gain provided by the embedded security core comes from the Elliptic Curve operations used in the embedded and FPGA-based versions. As discussed above, the scalar multiplication is much slower when performed in software than when performed with the embedded security core. The impact can be easily noticed by analysing Figure 6.3, in which the software version of each operation takes between 35 and 48 times longer than when performed using the embedded cores. The results are also available in Table 6.3 for the three implemented versions.

Table 6.3: Operation times for the three versions conceived.

Operation           FPGA-accelerated    Embedded    Software
Gen. Key Pair       580 ms              590 ms      28.4 s
Sign                784 ms              788 ms      28.5 s
Gen. Certificate    809 ms              816 ms      28.6 s
Add log entry       791 ms              794 ms      28.5 s

As seen in Section 6.1.3, a scalar multiplication takes about 0.57s and 28.4s to compute when done in the embedded core and in software respectively. In fact, the embedded core and FPGA-accelerated versions take no more than 30ms longer to generate a key pair than to perform a single scalar multiplication. The operation that takes the most time, excluding the time spent performing the multiplication, is generating a digital certificate, with 243ms and 250ms being spent on the remaining code for the FPGA-accelerated and embedded core versions respectively. The performance increase of the FPGA-accelerated version varies between 3 and 10 milliseconds, which results from the usage of the SHA-256 core deployed in the FPGA. The operation where the impact is most noticeable is the key pair generation, which is explained by the repeated use of the SHA-256 operation. Software-wise, since the units are in seconds, the operation time varies little from the 28.4s spent performing scalar multiplication, emphasizing the fact that Elliptic Curve scalar multiplication is the critical operation.

The operation times are not affected by AES, as these operations do not use it. The impact of the AES implementation choice is analysed in the next section, which discusses the performance results of the communication channel, where AES-256 and SHA-256 are heavily used.
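The accounting used in this section can be reproduced directly from Table 6.3. The sketch below follows the text in charging a single 566ms scalar multiplication to each operation (an assumption carried over from the text, not a measurement of the actual call count); the helper names are hypothetical.

```python
EC_MULT_MS = 566  # one scalar multiplication on the embedded core (Section 6.1.3)

# Operation times from Table 6.3, in milliseconds.
OPERATIONS_MS = {
    "Gen. Key Pair":    {"fpga": 580, "embedded": 590},
    "Sign":             {"fpga": 784, "embedded": 788},
    "Gen. Certificate": {"fpga": 809, "embedded": 816},
    "Add log entry":    {"fpga": 791, "embedded": 794},
}


def remaining_ms(total_ms):
    """Time left once one scalar multiplication is subtracted."""
    return total_ms - EC_MULT_MS


def ec_share(total_ms):
    """Fraction of the operation time spent in the scalar multiplication."""
    return EC_MULT_MS / total_ms


def fpga_gain_ms(times):
    """Saving of the FPGA-accelerated version over the embedded version."""
    return times["embedded"] - times["fpga"]


if __name__ == "__main__":
    for name, times in OPERATIONS_MS.items():
        print(name,
              remaining_ms(times["fpga"]),
              remaining_ms(times["embedded"]),
              fpga_gain_ms(times),
              round(ec_share(times["embedded"]), 2))
```

For certificate generation this yields the 243ms and 250ms of remaining code mentioned above, and the FPGA saving stays in the 3 to 10 millisecond range for every operation.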

6.3 Communication Channel

As described in Section 4.2, the system supports either open-channel or secure-channel communica- tions, depending on the usage and initial configuration. To better understand the impact of the operations on the data transfer throughputs, a transfer of 128KB was executed for the three versions of the system in open and secure channel. The choice of 128KB lies in the fact that data is transferred in 16B blocks, meaning that a total of 8,192 blocks are transferred for each test, which is a sufficiently large number to get rid of any possible fluctuations in block transfer times. The obtained results are depicted in Table 6.4.

Table 6.4: Operation times for the three versions conceived.

Operation         FPGA-accelerated    Embedded    Software
Open-channel      134.14 s            134.14 s    134.14 s
Secure-channel    140.28 s            140.80 s    140.73 s

As expected, the open-channel times are equal for the three versions, as no cryptographic operations are performed on the transferred data. On the other hand, when transferring data through a secure channel, the FPGA-accelerated version is the fastest, the software-only version comes second, and the version using only the embedded cores is the slowest. The software-only version is the second fastest option because the software AES algorithm is faster than its embedded counterpart. However, since SHA-256 is used for the computation of the HMAC of each exchanged message, the FPGA-based version still manages to be faster than the software-only version.

The throughputs of each version, depicted in Figure 6.4, reflect the aforementioned analysis. While the open-channel throughput is constant, at 7.6 kbps, the secure-channel throughputs vary slightly between versions. More specifically, the FPGA-accelerated variant has a maximum throughput of 7.3 kbps, the embedded version reaches 7.272 kbps and, finally, the software-only implementation achieves a throughput of 7.276 kbps.
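The throughput figures follow from Table 6.4 by dividing the transferred data by the transfer time. The sketch below is a worked check only; the names are hypothetical, and it assumes 1 kbit = 1,024 bits, which is the convention that reproduces the quoted values to within rounding.

```python
TRANSFER_BYTES = 128 * 1024  # 128KB test transfer
BLOCK_BYTES = 16             # data is sent in 16B blocks

# Transfer times from Table 6.4, in seconds.
OPEN_S = 134.14
SECURE_S = {"fpga": 140.28, "embedded": 140.80, "software": 140.73}


def block_count():
    """Number of 16B blocks in one test transfer."""
    return TRANSFER_BYTES // BLOCK_BYTES


def throughput_kbps(seconds):
    """Throughput in kbit/s, taking 1 kbit = 1,024 bits (assumption)."""
    return TRANSFER_BYTES * 8 / 1024 / seconds


def secure_overhead_ms_per_block(seconds):
    """Extra time per block spent on the secure channel, in milliseconds."""
    return (seconds - OPEN_S) / block_count() * 1000


if __name__ == "__main__":
    print(block_count(), round(throughput_kbps(OPEN_S), 1))
    for name, seconds in SECURE_S.items():
        print(name,
              round(throughput_kbps(seconds), 3),
              round(secure_overhead_ms_per_block(seconds), 2))
```

The per-block overhead printed in the last column shows how small the cryptographic cost is relative to the block transfer time itself, which explains why the three secure-channel throughputs differ only in the second decimal place.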

Figure 6.4: Throughputs for open-channel and secure-channel communications.

The results highlighted above suggest that a faster AES core could be deployed in the FPGA fabric, providing better results for the communication channel, when a secure channel is required for the PC connection. In terms of security, Perfect Forward Secrecy is ensured, since the session key depends on both private keys (PC and device) and the key pair from the PC is randomly generated for each connection. Once the session is terminated, the private key from the PC is destroyed, thus the session key can no longer be recovered. Additionally, Eavesdropping and Man-in-the-Middle attacks are successfully prevented through encryption and authentication of the exchanged messages, therefore preventing an attacker from listening and modifying the exchanged data successfully. Given that each message is accompanied by a counter, verified by both parties, the system is also protected against Replay attacks, i.e. an attacker cannot re-send previously sent packages without breaking the synchronization of the communication.
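The authentication and replay protection described above can be sketched as follows. This is not the implemented protocol: the class name, the 8-byte counter header and the message layout are hypothetical, and the AES-256 encryption step is omitted because the Python standard library does not provide it. Only the HMAC-SHA-256 tag and the counter check are shown.

```python
import hashlib
import hmac
import struct

TAG_BYTES = 32     # HMAC-SHA-256 output size
COUNTER_BYTES = 8  # hypothetical counter width


class Channel:
    """One endpoint of an authenticated channel with message counters."""

    def __init__(self, mac_key):
        self.mac_key = mac_key
        self.send_counter = 0
        self.recv_counter = 0

    def _tag(self, data):
        return hmac.new(self.mac_key, data, hashlib.sha256).digest()

    def protect(self, payload):
        """Prefix the current counter and append an HMAC over both."""
        header = struct.pack(">Q", self.send_counter)
        self.send_counter += 1
        return header + payload + self._tag(header + payload)

    def accept(self, message):
        """Return the payload, or None if forged, modified or replayed."""
        header = message[:COUNTER_BYTES]
        payload = message[COUNTER_BYTES:-TAG_BYTES]
        tag = message[-TAG_BYTES:]
        if not hmac.compare_digest(self._tag(header + payload), tag):
            return None
        if struct.unpack(">Q", header)[0] != self.recv_counter:
            return None
        self.recv_counter += 1
        return payload


if __name__ == "__main__":
    key = b"session-mac-key"  # stand-in for a key derived from the session key
    pc, device = Channel(key), Channel(key)
    message = pc.protect(b"SIGN request")
    print(device.accept(message))  # first delivery is accepted
    print(device.accept(message))  # replayed copy is rejected
```

Since the counter is covered by the tag, an attacker can neither alter it nor re-send an old message without the receiver noticing the mismatch.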

6.4 Comparison with the State of the Art

This section compares the implemented solution with the related State of the Art in terms of security features and design choices. All the works considered use volatile FPGAs, which are subject to several attacks during the booting process [3,5] and are not oriented towards security applications, i.e. they lack several security characteristics, for instance, anti-tampering mechanisms, anti-cloning features, true random number generators and internal non-volatile memories. Moreover, none of them consider that the selected devices may be subject to side-channel analysis. In fact, the solution in [9] goes even further, by considering that simply performing the key management at hardware level is enough to protect against all kinds of attacks. Additionally, none of the works consider the existence of a developer API that can be used to integrate applications with their works (e.g. the extended PKCS#11 API implemented for the proposed solution).

The approach proposed by Arasu et al. [7] does not describe the internal key storage and master key generation and storage mechanism. Additionally, Perfect Forward Secrecy is not assured in the communication channel. On the other hand, our solution guarantees Perfect Forward Secrecy, and thoroughly describes the use of a PUF-based mechanism to generate and store keys.

Nabeel et al. [8] consider the use of PUF technology to generate AES keys, but perform the error correction and cryptographic operations at software level on the PC, which is subject to several attacks. Our solution considers the PC to be insecure, meaning that all the sensitive computations are made on the device-side and the middleware only uses the authenticated results.

Graf [10] proposes an architecture which makes use of DES to encrypt and decrypt external data, which is outdated and susceptible to several attacks. Moreover, users are hardcoded into the device, which means that if a user needs to be added or revoked, the FPGA needs to be entirely re-programmed. Our solution uses AES-256-CBC for encryption and decryption of external data, and generates a random IV for each block of user data. Additionally, it allows the administrator to manage the registered users, without the need for re-programming.

The solution proposed by Eguro et al. [11] does not consider freshness and authentication of the uploaded data. Our implemented solution successfully allows the PC to establish a secure communication channel with the device, in which all messages are authenticated with HMAC-SHA-256 and their freshness assured with a message counter.

6.5 Conclusions

In this chapter, the implementation choices of the system were analysed and compared to the State of the Art, where possible. To better understand the performance of low-level cryptographic operations required by the system, several tests were executed for SHA-256, AES-256 and EC Multiplication operations. After analysing these results, performance tests were done on the system operations to evaluate the impact of the low-level cryptographic operations on the overall system efficiency.

This evaluation suggests that the FPGA-accelerated version of SHA-256 is faster than the embedded SoC SHA-256 core and mbedTLS implementations. On the other hand, the software-based version of AES-256 is faster than the embedded AES core provided by the Smartfusion2 SoC. This difference is caused by the several optimizations and pre-computations that are available in the mbedTLS AES-256 module. Nonetheless, the biggest performance impact comes from the Elliptic Curve scalar multiplication, which takes approximately 28.4s to perform in software and 0.566s when done with the embedded SoC EC core, consuming between 70% and 95% of the time spent performing system operations, such as generating key pairs and digital signatures.

The results above are clearly reflected in the performance and efficiency of the main system operations,

such as generating key pairs and digital signatures, as well as creating digital certificates and adding entries to the log-chain. In fact, the software version of each operation takes between 35 and 48 times longer than when performed using the hardware cores.

Since all tests were performed on the device-side, it was necessary to evaluate the USB communication channel as well, as a low data transfer rate can limit the system performance perceived at the PC-side. According to the results, the bottleneck of the secure data transfer speeds, when compared to open-channel, resides in the use of the embedded AES-256 core, suggesting that a faster AES core, deployed in the FPGA, could allow for a throughput closer to the open-channel throughput.

Due to the lack of numerical results in the academic State of the Art proposals, the comparison is mostly qualitative. The existing academic proposals (Section 3) lack mandatory requirements to be used as HSMs, such as a secure communication channel, anti-cloning mechanisms (e.g. PUF-based key generation), internal non-volatile memories for master key storage, high entropy random number generators, anti-tamper mechanisms, internal clock freshness (e.g. through a Timestamping Authority) and a standard developer interface, such as PKCS#11. In contrast, the proposed work presents a fully functional, open source, and adaptable HSM system with customization capabilities, as demonstrated by the added Log-Chain functionality. Moreover, in terms of pricing, a regular HSM can cost up to 35,000 € [26], while the presented implementation, on a Smartfusion2 90-TS SoC Security Evaluation kit, has an approximate cost of 400 €.

Chapter 7

Conclusions

In this thesis, a low-cost and highly flexible open-source HSM is proposed and implemented on the Smartfusion2 SoC device, as per the requirements detailed in Section 1.1. The device contains a non-volatile FPGA, non-volatile and volatile internal memories, a CPU and several security cores, such as AES, SHA-256, ECC, TRNG and PUF.

Unlike the State of the Art, the developed system takes advantage of non-volatile technologies, with built-in security features, such as side-channel analysis protected embedded cores, tamper detection mechanisms, PUF-based key management and an internal real-time clock.

The developed system consists of a tamper-proof certification system capable of generating digital signatures, issuing certificates for public keys and generating and extracting asymmetric key pairs for users. For easy application integration, a PC driver is provided, which implements an extended PKCS#11 interface. Because the system may be used under insecure environments, the system is capable of establishing a secure channel with the PC and maintaining internal clock freshness and synchronization with the outside world through the use of an external time provider. Furthermore, as the system provides high flexibility and because of the existing need for a secure logging system, a complementary novel feature was designed, which creates a non-repudiable and certified chain-of-logs for Linux Syslog messages.

Performance-wise, a thorough analysis of the cryptographic operations shows that the system's performance is primarily influenced by the Elliptic Curve scalar multiplication operation. In fact, in the main system operations, between 70% and 95% of the time is spent performing an EC scalar multiplication. Moreover, the conducted tests suggest that the FPGA-accelerated version of SHA-256 is faster than the embedded SoC SHA-256 core and mbedTLS implementations, while consuming only 5% of the FPGA fabric. On the other hand, the software-based version of AES-256 is faster than the embedded AES core provided by the Smartfusion2 SoC, but less secure.

Overall, the system is able to perform up to 2 signature/certificate operations per second and can add 1 log entry to the Log-Chain, every second. Due to its adaptability, the Log-Chain supports batch additions, meaning that depending on the developer application, it can support much higher log-entry signature rates. All of this is done on a Smartfusion2 SoC, at a much lower cost than existing commercial

HSMs, while providing the needed security and reliability features.

7.1 Future Work

Among the State of the Art, the proposed solution is the first to use a non-volatile and security-oriented device to create a secure, low-cost and reconfigurable computation system that provides a series of features that allow it to be considered an HSM. To improve the implemented solution, additional changes and features can be considered.

As seen in Section 6.2, the overall system performance is limited by the amount of time spent performing Elliptic Curve scalar multiplications. Although the Smartfusion2 SoC EC core provides a speed-up of 48 times compared to the mbedTLS implementation, the SHA-256 results suggest that a dedicated EC scalar multiplication core running in the FPGA fabric could provide an even higher performance increase. Nonetheless, the existing embedded hardware core is considered secure by Microsemi and a switch to an FPGA-based core should be carefully analysed. Moreover, since the AES core is slower than the mbedTLS implementation, it is also suggested to use a dedicated core in the FPGA fabric to provide higher throughput and additional security (e.g. Differential Power Analysis protection).

Another performance limiter resides in the communication channel, which has a maximum throughput of 7.6 kbps. This happens primarily due to the conceived communication protocol and the underlying UART communication. As seen in Section 5.6, the sending party can only send blocks of 16B at a time, and must wait for an acknowledgement before sending the next block, causing a bottleneck in the data transmission feed. Ideally, it should be possible to send larger data blocks at once, allowing for instant processing by the receiving party. This suggests that a different communication protocol could be used to increase the performance of the communication channel.

Finally, in terms of memory, the eNVM has a memory endurance of about 1,000 writing cycles per page of 128B.
This is clearly too low for the kind of data that needs to be stored, which needs to be frequently updated (e.g. Log-Chain counter and SPI Flash block hashes/IVs). A possible solution could involve relying on an external battery to assure that the values stored in RAM will not be lost. Should a loss of power happen, the values would be stored in the eNVM and a signal sent out to a remote administrator. However, this would require measures to guarantee that an attacker could not unplug the battery or block the alarm signal.
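To give a sense of scale for this endurance limit, the sketch below estimates how long a single eNVM page would last under a constant write rate. It is a back-of-the-envelope illustration only: the function name is hypothetical, and the write rates are assumptions rather than measured update rates of the implemented system.

```python
ENDURANCE_CYCLES = 1_000  # approximate write cycles per eNVM page (from the text)


def page_lifetime_seconds(writes_per_second):
    """Time until one page reaches its endurance limit at a constant rate."""
    return ENDURANCE_CYCLES / writes_per_second


if __name__ == "__main__":
    # Assumed rate: one counter update per second, written to the same page.
    print(round(page_lifetime_seconds(1.0) / 60, 1), "minutes")
    # Assumed rate: one update per hour.
    print(round(page_lifetime_seconds(1.0 / 3600) / 86400, 1), "days")
```

Under the first assumption a page would be worn out in well under an hour, which illustrates why frequently updated values are better kept in battery-backed RAM and written to the eNVM only on power loss.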

Bibliography

[1] Global Market Insights, Inc. FPGA Market size worth $9.98bn by 2022, 2016. ://www.gminsights.com/pressrelease/field-programmable-gate-array-fpga-market.

[2] Lattice Semiconductor. Third Generation Non-Volatile FPGAs Enable System on Chip Functionality. Technical report, Lattice Semiconductor, 2007.

[3] R. Druyer, L. Torres, P. Benoit, P.-V. Bonzom, and P. Le-Quere. A survey on security features in modern FPGAs. In Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2015 10th International Symposium on, pages 1–8. IEEE, 2015.

[4] Microsemi. Smartfusion2 SoC and IGLOO2 FPGAs Security Features. Technical report, Microsemi, 2016.

[5] H. Kashyap and R. Chaves. Compact and on-the-fly secure dynamic reconfiguration for volatile FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 9(2):11, 2016.

[6] A. S. Zeineddini and K. Gaj. Secure Partial Reconfiguration of FPGAs. Proceedings of the 2005 IEEE International Conference on Field-Programmable Technology, pages 155–162, 2005.

[7] A. Arasu, S. Blanas, K. Eguro, R. Kaushik, D. Kossmann, R. Ramamurthy, and R. Venkatesan. Orthogonal security with cipherbase. In CIDR, 2013.

[8] M. Nabeel, S. Kerr, X. Ding, and E. Bertino. Authentication and key management for advanced metering infrastructures utilizing physically unclonable functions. In Smart Grid Communications (SmartGridComm), 2012 IEEE Third International Conference on, pages 324–329. IEEE, 2012.

[9] Y. Wang and Y. Ha. FPGA based Rekeying for cryptographic key management in Storage Area Network. In Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on, pages 1–6. IEEE, 2013.

[10] J. Graf and P.Athanas. A key management architecture for securing off-chip data transfers. In Field Programmable Logic and Applications (FPL), pages 33–42, 2004.

[11] K. Eguro and R. Venkatesan. FPGAs for trusted cloud computing. In Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on, pages 63–70. IEEE, 2012.

63 [12] J. Ivarsson, A. Nilsson, and A. Certezza. A Review of Hardware Security Modules. Technical report, opendnssec.org, 2010.

[13] B. Tulu, H. Li, S. Chatterjee, B. Hilton, D. Beranek-Lafky, and T. Horan. Design and Implementation of a Digital Signature Solution for a Healthcare Enterprise. AMCIS 2004 Proceedings, page 43, 2004.

[14] FIPS. Announcing the ADVANCED ENCRYPTION STANDARD (AES). Technical report, National Institute of Standards and Technology (NIST), 2001.

[15] NIST. Recommendations for Block Cipher Modes of Operation. Technical report, Microsemi, 2001.

[16] S. Selvakumaraswamy and U. Govindaswamy. Efficient transmission of pki certificates using elliptic curve cryptography and its variants. International Arab Journal of Information Technology (IAJIT), 13(1), 2016.

[17] M.-D. Cano, R. Toledo-Valera, and F. Cerdan. A certification authority for elliptic curve X. 509v3 certificates. In Networking and Services, 2007. ICNS. Third International Conference on, pages 49–49. IEEE, 2007.

[18] FIPS. FIPS PUB 186-4 Digital Signature Standard (DSS). Technical report, National Institute of Standards and Technology (NIST), 2013.

[19] C. Research. SEC 1: Elliptic Curve Cryptography. http://www.secg.org/sec1-v2.pdf, Published: 2009-05-21.

[20] V. Mart´ınez, F. Alvarez,´ L. Encinas. A comparison of the standardized versions of ECIES. 2010.

[21] I. E. T. Force. KCS #5: Password-Based Cryptography Specification - Version 2.1. https://tools. ietf.org/html/rfc8018, Published: 2017-01-01.

[22] N. W. Group. X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile, . https://tools.ietf.org/html/rfc5280, Published: 2008-05-01.

[23] T. Mohsen and A. Shirazi. LPKI – A Lightweight Public Key Infrastructure for the Mobile Envi- ronments. Technical report, Department of Electrical Engineering Iran University of Science and Technology, 2008.

[24] K. N. C. Gehrmann and C. J. Mitchell. The personal CA – PKI for a Personal Area Network. Tech- nical report, Ericsson Mobile Platforms, Nokia Research Center and Group, 2002.

[25] R. S. J. Dankers, T. Garefalakis and T. Wright. Public key infrastructure in mobile systems. ELEC- TRONICS AND COMMUNICATION ENGINEERING JOURNAL, 10 2002.

[26] Thales nshield connect offers enterprise-class key management, 2009. http://www.networkworld.com/article/2246758/security/ thales-nshield-connect-offers-enterprise-class-key-management.html.

64 [27] Safenet network hsm, 2017. https://www.infosecservice.com/product/ safenet-network-hsm/.

[28] ARM. Trustzone. https://developer.arm.com/technologies/trustzone.

[29] Intel. Intel R Software Guard Extensions (Intel R SGX), 2017. https://software.intel.com/ en-us/sgx.

[30] N. Weichbrodt, A. Kurmus, P. Pietzuch, and R. Kapitza. AsyncShock: Exploiting synchronisation bugs in Intel SGX enclaves. In European Symposium on Research in , pages 440–457. Springer, 2016.

[31] M. Lipp, D. Gruss, R. Spreitzer, C. Maurice, and S. Mangard. ARMageddon: Cache attacks on mobile devices. In Proceedings of the 25th USENIX Security Symposium, pages 549–564, 2016.

[32] NIST. Validated fips 140-1 and fips 140-2 cryptographic modules, 2017. http://csrc.nist.gov/ groups/STM/cmvp/documents/140-1/140val-all.htm.

[33] THALES. Fips 140-2 certification, . https://www.thales-esecurity.com/company/ certifications/fips.

[34] THALES. Common criteria, . https://www.thales-esecurity.com/company/certifications/ common-criteria.

[35] T. Feller. Towards Trustworthy Cyber-Physical Systems. In Trustworthy Reconfigurable Systems, pages 85–136. Springer, 2014.

[36] I. Bate and P. Conmy. Certification of FPGAs - Current Issues and Possible Solutions, pages 149– 165. SPRINGER, 2009. ISBN 978-1-84882-348-8.

[37] Xilinx. Design Security Solutions. https://www.xilinx.com/products/technology/ design-security.html.

[38] Altera. Design Security. https://www.altera.com/products/fpga/features/ stx-design-security.html.

[39] Microsemi. SmartFusion2 System-on-Chip FPGAs Product Brief. Technical report, Microsemi, 2016.

[40] Microsemi. SmartFusion2 SoC FPGA and IGLOO2 FPGA Security Best Practices. Technical report, Microsemi, 2016.

[41] Microsemi. Specify and Program Security Settings and Keys with SmartFusion2 and IGLOO2 FPGAs. Technical report, Microsemi, 2013.

[42] Dallas Semiconductor. Technical report.

[43] DigiStamp, 2017. https://www.digistamp.com/.

65 [44] Ascertia, 2017. https://www.ascertia.com/.

[45] FreeTSA, 2017. https://www.freetsa.org/index_en.php.

[46] SafeCreative TSA, 2017. https://tsa.safecreative.org/.

[47] mbedtls, 2016. https://tls.mbed.org.

[48] Standaert, Franc¸ois-Xavier, Oswald, Elisabeth, editor. Constructive Side-Channel Analysis and Secure Design. Springer, 2016.

[49] SHA-256 HASH CORE, 2016. URL https://opencores.org/project,sha256_hash_core.

[50] Microsemi. CoreAHBLite v5.2. Technical report, Microsemi, 2014.

[51] Microsemi. IGLOO2 FPGA and SmartFusion2 SoC FPGA. Technical report, Microsemi, 2016.

[52] PKCS #11 v2.20: Cryptographic Token Interface Standard. RSA Laboratories, June 2004.

[53] NTP, 2017. http://www.ntp.org/.

[54] NTP Security Vulnerability Announcement , 2017. http://support.ntp.org/bin/view/Main/ SecurityNotice.

[55] N. W. Group. Internet X.509 Public Key Infrastructure Time-Stamp Protocol (TSP), . https:// tools.ietf.org/html/rfc3161, Published: 2001-08-01.

[56] A. Gutub. Remodeling of Elliptic Curve Cryptography Scalar Multiplication Architecture using Par- allel Jacobian Coordinate System. International Journal of Computer Science and Security, 10 2010.

[57] E. Karthikeyan. Survey of Elliptic Curve Scalar Multiplication Algorithms. International Journal of Advanced Networking and Application, 8 2012.

Appendix A

Communication Protocol

Figure A.1: Flowchart describing the process of receiving a message through the created communication protocol.
