Masaryk University Faculty of Informatics
Optimizing authenticated encryption algorithms
Master’s Thesis
Ondrej Mosnáček
Brno, Fall 2017
This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.
Declaration
Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.
Ondrej Mosnáček
Advisor: Ing. Milan Brož
Acknowledgement
I would like to thank my advisor, Milan Brož, for his guidance, patience, and helpful feedback and advice. Also, I would like to thank my girlfriend Ludmila, my family, and my friends for their support and kind words of encouragement.
If I had more time, I would have written a shorter letter.
— Blaise Pascal
Abstract
In this thesis, we look at authenticated encryption with associated data (AEAD), which is a cryptographic scheme that provides both confidentiality and integrity of messages within a single operation. We look at various existing and proposed AEAD algorithms and compare them both in terms of security and performance. We take a closer look at three selected candidate families of algorithms from the CAESAR competition. Then we discuss common facilities provided by the two most common CPU architectures – x86 and ARM – that can be used to implement cryptographic algorithms efficiently. Finally, we introduce our contribution of implementing the selected CAESAR candidates for the Linux kernel Crypto API.
Keywords
authenticated encryption, AEAD, CAESAR, GCM, Linux, cryptography, optimization, assembly, MORUS, AEGIS, OCB, AES-NI, SSE2, AVX2
Contents
1 Introduction
  1.1 Goals
  1.2 Chapter contents
2 Authenticated encryption
  2.1 Properties
  2.2 Generic composition
    2.2.1 Encrypt-then-MAC
    2.2.2 Encrypt-and-MAC
    2.2.3 MAC-then-Encrypt
    2.2.4 Properties of generic composition
  2.3 GCM
    2.3.1 Properties
  2.4 ChaCha20-Poly1305
    2.4.1 Properties
  2.5 CCM
    2.5.1 Properties
  2.6 SIV
    2.6.1 Properties
  2.7 GCM-SIV
    2.7.1 Properties
3 CAESAR competition
  3.1 What is CAESAR?
  3.2 Third round candidates
  3.3 MORUS
    3.3.1 Operation
    3.3.2 Properties
  3.4 AEGIS
    3.4.1 Operation
    3.4.2 Properties
  3.5 OCB
    3.5.1 Operation
    3.5.2 Properties
  3.6 Comparison
4 Linux Kernel Crypto API
  4.1 Architecture
    4.1.1 Cipher and driver names
    4.1.2 Templates
    4.1.3 Synchronous and asynchronous operations
    4.1.4 Priorities
    4.1.5 Input parameter sizes
    4.1.6 Scatter-gather lists
  4.2 AEAD interface
    4.2.1 Input/output data layout
    4.2.2 For users
    4.2.3 For implementations
5 Software optimization of cryptographic algorithms
  5.1 Intel/AMD (x86 architecture)
    5.1.1 SSE, AVX
    5.1.2 AES-NI
    5.1.3 SHA extensions
    5.1.4 CLMUL
  5.2 ARM
6 Implementation of selected CAESAR candidates
  6.1 Contents of the attached source code
    6.1.1 Implementation limitations
    6.1.2 Merging into the upstream Linux repository
  6.2 Performance measurements
    6.2.1 Direct speed comparison
    6.2.2 Comparison of Dm-crypt performance
    6.2.3 Summary of results
7 Conclusion
  7.1 Contribution
  7.2 Future work
Bibliography
1 Introduction
When cryptographically protecting data, we often use encryption to ensure confidentiality of the payload, which means that only authorized parties are able to read it [57, 27]. However, in practice, confidentiality alone is not sufficient to protect users from certain attacks. For example, when sending an encrypted message over the network, an attacker might be able to (depending on the encryption method used) modify the message in such a way that the decrypted message is still meaningful but different and the receiver is unable to detect the malicious modification.

To protect against such attacks, it is necessary to use additional cryptographic mechanisms that achieve integrity of the payload, which means that unauthorized parties cannot modify the payload without detection. When one aims to achieve both confidentiality and integrity of data, there are generally two possible approaches:

1. To use a traditional stream cipher to achieve confidentiality and to ensure integrity in some other way, e.g. using a message authentication code (MAC) or a digital signature.

2. To use a dedicated scheme for authenticated encryption with associated data (AEAD), which provides both confidentiality and integrity in a single package.

Since authenticated encryption is needed in many applications, especially in network protocols/applications and file encryption, it is a frequent target of research in cryptography. Since the introduction of the concept in 2000 [39], there have been many proposed schemes for authenticated encryption. However, the most widely adopted algorithm – AES-GCM¹ – has several drawbacks and there is generally a lack of consensus on the best alternative.

This situation has motivated the initiation of the CAESAR competition, which is an open competition with the goal of selecting a portfolio of the best authenticated encryption schemes in terms of security and both hardware and software performance.
1. AES-GCM = Advanced Encryption Standard in Galois-Counter Mode
1.1 Goals
The goals of this thesis are:
1. to compare some of the existing and proposed AEAD algo- rithms in terms of security and performance,
2. to produce implementations of selected CAESAR competition candidates for the Linux kernel cryptographic subsystem,
3. and to perform and analyze performance measurements of these implementations.
1.2 Chapter contents
The thesis consists of seven chapters. The first chapter is the introduction. In the second chapter we define authenticated encryption and describe some of the most common existing AEAD algorithms and modes. In the third chapter we introduce the CAESAR competition, briefly characterize the third-round candidates, and select three candidates that we describe in more detail. In the fourth chapter we briefly describe the Linux kernel Crypto API with a focus on its AEAD interface. In the fifth chapter we discuss the possibilities of software optimization of cryptographic algorithms with a focus on the x86 and ARM architectures. In the sixth chapter we describe implementations of the three selected CAESAR candidates that we developed as part of the thesis, targeting the Linux kernel Crypto API. At the end of that chapter we provide and analyze performance measurements of our implementations. The seventh chapter contains the conclusion.
2 Authenticated encryption
Authenticated encryption (also referred to as authenticated encryption with associated data – AEAD) in practice uses symmetric-key cryptography and is usually based on a stream cipher and a message authentication code (MAC), even though the computation of both is often merged into a single operation. Encryption using an AEAD scheme takes the following inputs:
∙ a symmetric key (K) of some fixed size,
∙ a nonce (N) of some fixed size¹,

∙ an optional stream of associated data (A) that is only authenticated, not encrypted,
∙ the message plaintext (P) of any length (subject to some practical constraints), which is both authenticated and encrypted.
The output of the AEAD encryption process is the encrypted message ciphertext (C) and a fixed-size authentication tag (T), which carries the information needed to verify the authenticity of the message and associated data. We will denote AEAD encryption as follows:

$$(C, T) = \text{AEAD-E}_K(N, A, P)$$
1. In some cases the nonce size can be variable or configurable.
Figure 2.1: AEAD encryption diagram (inputs: key, nonce, AD, plaintext; outputs: ciphertext, tag)
Figure 2.2: AEAD decryption diagram (inputs: key, nonce, AD, ciphertext, tag; outputs: plaintext, verification result)
Some AEAD schemes can also produce a truncated authentication tag, which is shorter but provides a weaker guarantee against message forgery. The decryption process takes the following inputs:
∙ the key (K) as used when encrypting,
∙ the nonce (N) as used when encrypting,
∙ an optional stream of associated data (A) as used when encrypting,
∙ the message ciphertext (C),
∙ the authentication tag (T).
The output of the AEAD decryption process is the decrypted message plaintext (P) or an indication that the integrity verification failed (⊥). We will denote AEAD decryption as follows:

$$P = \text{AEAD-D}_K(N, A, C, T)$$
Most AEAD schemes require that the decrypted message is not given to the user as output when integrity verification fails; otherwise, they are vulnerable to known-plaintext or chosen-ciphertext attacks – i.e. the key or some internal secret information may be recovered when the attacker has access to a certain set of plaintext-ciphertext pairs.
2.1 Properties
Since different AEAD algorithms and modes are designed with different use cases and design goals in mind, they have various properties, which need to be taken into account when choosing the right AEAD algorithm to use. In this section we list some of these properties along with a short description of each one. The list of properties is partly based on [58] and [22].
∙ Provable security – An AEAD algorithm is provably secure if there exists a mathematical proof of its security properties. Algorithms with provable security provide better assurance of their security against cryptanalysis since the confidence in their security lies within a mathematical proof and not in unproven assumptions.
∙ On-line input processing – An AEAD algorithm supports on-line input processing when the associated data followed by the message data can be supplied incrementally; or, more precisely, when the length of the associated data and message data does not have to be known before the processing of the data itself. Such an algorithm can be more efficient when the input data is supplied on the fly (e.g. when encrypting the payload of a network stream).

∙ Two-pass processing – An AEAD algorithm requires two-pass processing when (all or part of) the input data needs to be traversed twice, where the second pass depends on some intermediate result from the first pass. Algorithms with this property are not suitable for on-line processing of long inputs, since the whole input needs to be stored in memory during processing.

∙ Parallelizability – An AEAD algorithm may allow parallel processing of (all or part of) the input data, in order to make encryption/decryption faster on parallel machines, such as multi-core processors or graphics cards.
∙ Endian-dependency – An algorithm is endian-independent when its core can be implemented with the same efficiency on both little-endian and big-endian architectures, i.e. without byte swaps on either architecture; otherwise it is endian-dependent. Note that parts of the algorithm that are not critical for its overall performance (e.g. key pre-processing) can still be endian-dependent.

∙ Nonce misuse resistance – Most AEAD algorithms do not allow encrypting more than one message using the same key-nonce pair (such usage may allow an attacker to e.g. distinguish messages [44]). Nonce misuse resistance means that even when a nonce is reused under the same key, an attacker does not learn anything more than whether both messages (and their associated data) are equal (obviously, the ciphertexts and tags would also be equal in such a case). [54]

∙ Software performance – This property describes the performance of the algorithm when implemented in software. Since there are a number of different software platforms and architectures, it is difficult to judge this property in general. An AEAD algorithm might be efficient on one architecture, but slow on another. Therefore, it is necessary to assess software performance on more than just one type of platform and consider the need for a trade-off in the algorithm design. In general, we can divide the platforms that AEAD designs usually target into the following categories:

  – Vector extensions – Most modern architectures come with support for “vector” operations, i.e. instructions that can operate on large bit vectors or vectors of integers. Examples: SSE and AVX on x86, NEON on ARM².

  – AES-specific instructions – Due to the widespread standardization and use of the AES block cipher, some hardware architectures now come with special instructions that can be used to efficiently implement AES. These instructions can significantly speed up algorithms that are based on AES or the AES round transformation. Examples: AES-NI on x86, cryptography extensions on ARM, various cryptographic accelerators on other platforms.

  – Other special instructions – Some architectures offer other instructions that can be used to speed up certain cryptographic algorithms. The most notable example is the PCLMULQDQ instruction provided by many x86 processors, which can be used to efficiently implement GHASH, which is part of AES-GCM (see section 2.3).

∙ Hardware performance – This property describes the encryption and decryption throughput achievable when the algorithm is implemented directly in hardware, either as a special circuit or programmed onto a generic hardware component (such as an ASIC³ or FPGA⁴).

∙ Circuit area – This is another property related to hardware implementations and describes the physical area of the circuit implementing the given algorithm. It is desirable to have AEADs with small circuit area, as small circuits consume less physical space, consume less electric power, and dissipate less heat.

2. ARM is a reduced instruction set computing architecture used primarily in embedded devices and mobile phones.
2.2 Generic composition
A simple way to construct an AEAD scheme is to combine a stream cipher (e.g. AES in CBC mode⁵) and a message authentication code (MAC) (e.g. AES-CMAC⁶) using so-called generic composition. In this scheme, the stream cipher provides data confidentiality, while the MAC provides data integrity.

3. ASIC = application-specific integrated circuit
4. FPGA = field-programmable gate array
5. CBC = Cipher Block Chaining; a common block cipher mode of operation, which constructs a stream cipher from a block cipher [24]
6. CMAC = Cipher-based Message Authentication Code [26]

The key of the resulting AEAD construction then consists of two parts: the stream cipher key and the MAC key. The size of the nonce is the same as for the stream cipher and the tag size is at most the size of the MAC output (smaller tag sizes can be achieved by truncating the tag⁷). The stream cipher and MAC used in a generic AEAD composition must fulfill the following conditions:
1. The stream cipher is indistinguishable (i.e. semantically secure⁸) under a chosen-plaintext attack. [6]
2. The MAC is unforgeable under a chosen-message attack. [6]
There are three possible variants of generic AEAD composition, depending on the order of operations (encryption/decryption and MAC). A description of each variant can be found below. In all cases, we denote:
∙ the encryption of plaintext P and nonce N using the underlying stream cipher with key K as EK(N, P),
∙ the decryption of ciphertext C and nonce N using the underly- ing stream cipher with key K as DK(N, C),
∙ the computation of MAC of message M using the underlying MAC algorithm with key K as HK(M),
∙ the concatenation of bit strings A and B as A ‖ B.
2.2.1 Encrypt-then-MAC
In this variant the plaintext is first encrypted and the resulting ciphertext along with the associated data is then authenticated using the MAC algorithm. More formally:

$$\text{AEAD-E}_K(N, A, P) = \bigl(E_{K_2}(N, P),\ H_{K_1}(A \,\|\, E_{K_2}(N, P))\bigr),$$

$$\text{AEAD-D}_K(N, A, C, T) = \begin{cases} D_{K_2}(N, C) & \text{if } T = H_{K_1}(A \,\|\, C), \\ \bot & \text{otherwise,} \end{cases}$$

where K = K1 ‖ K2.

According to the analysis by Bellare and Namprempre, this is the only variant that fulfills all desired security properties, but only when the MAC algorithm is strongly unforgeable. The authors note that all pseudorandom functions (and thus also most practical MACs) are strongly unforgeable. [6]

7. Note that this practice reduces the security provided by the tag and should only be used when the consequences are fully considered.
8. Semantic security means that knowledge of the ciphertext (and length of the plaintext) does not reveal any additional information about the plaintext that can be feasibly extracted. [27]
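As a rough illustration of the encrypt-then-MAC formulas above, the following Python sketch composes a toy keystream cipher with HMAC-SHA256. Everything named here is an assumption made for illustration: `keystream_xor` is a hypothetical stand-in for E and D built from SHA-256 (it is not a vetted cipher), HMAC-SHA256 plays the role of H, and the function names are not taken from the thesis.

```python
import hashlib
import hmac


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stand-in for E_K2 / D_K2: XOR data with SHA-256(key || nonce || counter)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


def aead_encrypt(k1: bytes, k2: bytes, nonce: bytes, ad: bytes, plaintext: bytes):
    """Encrypt first, then authenticate A || C (literal reading of the formula)."""
    ciphertext = keystream_xor(k2, nonce, plaintext)
    tag = hmac.new(k1, ad + ciphertext, hashlib.sha256).digest()
    return ciphertext, tag


def aead_decrypt(k1: bytes, k2: bytes, nonce: bytes, ad: bytes,
                 ciphertext: bytes, tag: bytes):
    """Verify the tag before decrypting; return None (standing for ⊥) on failure."""
    expected = hmac.new(k1, ad + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        return None
    return keystream_xor(k2, nonce, ciphertext)
```

Note that the tag is checked before any decryption takes place, so unauthenticated plaintext is never released. The plain concatenation A ‖ C follows the formula literally; a practical scheme would also have to encode the boundary between A and C unambiguously.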
2.2.2 Encrypt-and-MAC
In this variant the MAC is produced from the associated data and the plaintext, and the ciphertext is produced by encrypting the plaintext. More formally:

$$\text{AEAD-E}_K(N, A, P) = \bigl(E_{K_2}(N, P),\ H_{K_1}(A \,\|\, P)\bigr),$$

$$\text{AEAD-D}_K(N, A, C, T) = \begin{cases} D_{K_2}(N, C) & \text{if } T = H_{K_1}(A \,\|\, D_{K_2}(N, C)), \\ \bot & \text{otherwise,} \end{cases}$$

where K = K1 ‖ K2.
2.2.3 MAC-then-Encrypt
In this variant the MAC is first produced from the plaintext and associated data and then the plaintext is encrypted together with the MAC to produce the resulting ciphertext and tag. More formally:

$$\text{AEAD-E}_K(N, A, P) = (C, T) \quad \text{where} \quad C \,\|\, T = E_{K_2}(N, P \,\|\, H_{K_1}(A \,\|\, P)),$$

$$\text{AEAD-D}_K(N, A, C, T) = \begin{cases} P & \text{if } D_{K_2}(N, C \,\|\, T) = P \,\|\, H_{K_1}(A \,\|\, P), \\ \bot & \text{otherwise,} \end{cases}$$

where K = K1 ‖ K2.
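The MAC-then-Encrypt formulas above can be sketched in the same spirit. This is only an illustration under stated assumptions: `keystream_xor` is a hypothetical toy cipher built from SHA-256 (not a vetted cipher), HMAC-SHA256 stands in for H, and all names are made up for this example. The point to notice is that, unlike in encrypt-then-MAC, the receiver has to decrypt before it can verify anything.

```python
import hashlib
import hmac

TAG_LEN = 32  # output size of HMAC-SHA256, the assumed MAC


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stand-in for E_K2 / D_K2: XOR data with SHA-256(key || nonce || counter)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


def mte_encrypt(k1: bytes, k2: bytes, nonce: bytes, ad: bytes, plaintext: bytes):
    """MAC the plaintext first, then encrypt P || MAC and split it into (C, T)."""
    mac = hmac.new(k1, ad + plaintext, hashlib.sha256).digest()
    sealed = keystream_xor(k2, nonce, plaintext + mac)
    return sealed[:-TAG_LEN], sealed[-TAG_LEN:]


def mte_decrypt(k1: bytes, k2: bytes, nonce: bytes, ad: bytes,
                ciphertext: bytes, tag: bytes):
    """Decrypt C || T first, then check the recovered MAC; None stands for ⊥."""
    opened = keystream_xor(k2, nonce, ciphertext + tag)
    plaintext, mac = opened[:-TAG_LEN], opened[-TAG_LEN:]
    expected = hmac.new(k1, ad + plaintext, hashlib.sha256).digest()
    return plaintext if hmac.compare_digest(mac, expected) else None
```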
2.2.4 Properties of generic composition
∙ Provable security – As long as the underlying primitives are also provably secure, the resulting AEAD scheme is also provably secure. Note that to achieve strong unforgeability, the MAC used must also be strongly unforgeable and the encrypt-then-MAC variant has to be used.

∙ On-line input processing – The generic composition supports on-line input processing if both the stream cipher and MAC support it.

∙ Two-pass processing – The generic composition does not require two-pass processing.

∙ Parallelizability – The parallelizability of generic composition depends on whether the cryptographic primitives used have this property.

∙ Endian-dependency – The endian-dependency of generic composition depends on whether the cryptographic primitives used are endian-dependent.

∙ Nonce misuse resistance – The generic composition is not nonce misuse resistant.

∙ Software performance – The software performance of generic composition depends on the primitives used. If the MAC used is based on a cryptographic hash function (e.g. HMAC-SHA256), the performance is usually much lower than that of dedicated algorithms.

∙ Hardware performance and circuit area – As with software performance, hardware performance also depends largely on the algorithms used. The circuit area will likely be relatively large, since there are two separate algorithms being used.
2.3 GCM
GCM, or Galois-Counter Mode, is a well-known AEAD mode, which is based on a block cipher that has a block size of 128 bits (typically AES). Individual instances are denoted as Cipher-GCM, e.g. AES-GCM, Serpent-GCM, etc.

In November 2007, NIST⁹ included GCM as one of the recommended authenticated encryption modes in NIST Special Publication 800-38D ([23]). In 2009, it was accepted as one of the AEAD modes recommended by ISO in ISO/IEC 19772:2009 ([36]). Due to its good performance in software and its standardization by NIST, it has gained widespread usage in cryptographic libraries (e.g. OpenSSL [60], GCrypt [59]) and network applications (e.g. TLS¹⁰ [56], IPsec [61], OpenSSH [35]).

GCM accepts a key of the same size as the underlying cipher and a 96-bit nonce¹¹ and produces a 32- to 128-bit authentication tag (depending on security requirements). Internally, it works with 128-bit blocks.

GCM uses the underlying block cipher in counter mode (denoted as Cipher-CTR) to encrypt the message and a MAC algorithm called GHASH to authenticate the message and associated data. In fact, GCM can be thought of as the encrypt-then-MAC variant of a generic composition (see 2.2.1) of Cipher-CTR and GHASH, with the following modifications:

∙ the GCM key is used directly as the key of the block cipher and the key for GHASH is derived from the GCM key as the result of encrypting an all-zero block of bits,

∙ the associated data and ciphertext are padded with zero bits to a multiple of 128 bits when passed to GHASH,

∙ a block containing the binary representation of the lengths of associated data and plaintext/ciphertext is appended to the input to GHASH,

∙ the initial counter block (ICB) for Cipher-CTR (its “nonce”) is produced by appending 30 ‘0’ bits followed by the bits ‘10’ (i.e. the 32-bit counter value 2) to the 96-bit GCM nonce,

∙ the final authentication tag is obtained from the GHASH output by encrypting it using Cipher-CTR with the ICB produced by appending 31 ‘0’ bits and one ‘1’ bit (i.e. the 32-bit counter value 1) to the GCM nonce.

9. NIST = National Institute of Standards and Technology, part of the United States Department of Commerce.
10. TLS = Transport Layer Security.
11. In fact, the original NIST specification also allows nonces of arbitrary length, but since this is not supported by most implementations (and discouraged in the NIST document), we do not take that into account here.
The internal authentication algorithm GHASH is based on polynomial multiplication in a Galois field, where the coefficients of a polynomial correspond to bits in a 128-bit block. If we denote the polynomial multiplication of blocks A and B as A ⊗ B and polynomial addition of blocks A and B (which is equivalent to the well-known “XOR” operation) as A ⊕ B, the authentication tag T produced by applying GHASH with key K to message M, composed of successive 128-bit blocks M0, M1, ..., Mn, can be defined as:

$$T = \text{GHASH}_K(M) = (\cdots(((M_0 \otimes K) \oplus M_1) \otimes K) \cdots \oplus M_n) \otimes K$$
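The GHASH recurrence above can be sketched in a few lines of Python. This is an unoptimized illustration, not production code: blocks are handled as Python integers, the bit ordering and the reduction constant follow the multiplication algorithm in NIST SP 800-38D, and the names `gf128_mul` and `ghash` are chosen for this example only.

```python
# Reduction constant for the field polynomial x^128 + x^7 + x^2 + x + 1,
# expressed in GCM's bit order (the leftmost bit is the coefficient of x^0).
R = 0xE1 << 120


def gf128_mul(x: int, y: int) -> int:
    """Multiply two 128-bit blocks (as integers) in GCM's Galois field."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:   # walk the bits of x from left to right
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1   # multiply v by x, then reduce
    return z


def ghash(key: bytes, blocks) -> bytes:
    """Evaluate (...((M0 * K) + M1) * K ... + Mn) * K over 16-byte blocks."""
    k = int.from_bytes(key, "big")
    y = 0
    for block in blocks:
        y = gf128_mul(y ^ int.from_bytes(block, "big"), k)
    return y.to_bytes(16, "big")
```

In GCM, `blocks` would be the zero-padded associated data, the zero-padded ciphertext, and the final length block, in that order. The bit-by-bit loop is easy to follow but slow; practical implementations use table lookups or carry-less multiplication instructions instead.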
A more detailed definition of GHASH and GCM can be found in the NIST document ([23]). Despite GCM being one of the two AEAD modes recommended by NIST and considered secure (if used properly), the mode has several practical disadvantages:
∙ Catastrophic effect of nonce reuse on confidentiality – If two messages are encrypted using the same key and the same nonce, the initial counter block for the CTR encryption will be the same for both messages, which means the XOR of their ciphertexts is the same as the XOR of their plaintexts. Thus, for any known plaintext-ciphertext pair it is possible to obtain the plaintext for any ciphertext encrypted using the same key and nonce, as long as the message is not longer than the known pair. [4]

∙ Severe effect of nonce reuse on integrity – As discovered by Antoine Joux in [38], nonce reuse leads to an attack that can also compromise message integrity. This means that a user of this algorithm can severely compromise both message confidentiality and integrity by making only a subtle mistake. The prevalence of such incorrect usage discovered in security applications in the past (see e.g. [20]) shows that this is not just a theoretical possibility but a weakness with real-world consequences.

∙ Small nonce size – GCM supports only a 96-bit nonce, which is a problem when one wants to use randomly generated nonces. The NIST specification ([23]) mandates that if the GCM nonce is generated randomly, only up to 2^32 messages can be encrypted with the same key; otherwise the probability of repeating the nonce becomes too high. This limitation may lead to implementation weaknesses, since not all implementors may realize that it even exists. It also causes practical limitations: for example, if we want to use GCM for disk encryption with random nonces and a single key for the whole disk, we can only safely encrypt 2^32 sectors (about 2 TiB of data with 512 bytes per sector), which is not sufficient even for drives smaller than 2 TiB, since sectors may get overwritten on average several times during the disk’s lifetime.
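The first drawback listed above (keystream reuse under a repeated nonce) can be demonstrated with a small Python sketch. It does not use AES-CTR: `ctr_keystream` is a hypothetical counter-mode keystream built from SHA-256, assumed here only so that the example runs with the standard library. The effect it shows is the same for any counter-mode cipher.

```python
import hashlib


def ctr_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy counter-mode keystream: SHA-256(key || nonce || counter) blocks."""
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return stream[:length]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


key, nonce = b"k" * 16, b"n" * 12          # the same nonce is (wrongly) used twice
p1 = b"transfer 100 EUR to Alice"          # plaintext known to the attacker
p2 = b"transfer 999 EUR to Mallo"          # plaintext the attacker wants to learn
c1 = xor(p1, ctr_keystream(key, nonce, len(p1)))
c2 = xor(p2, ctr_keystream(key, nonce, len(p2)))

# The keystream cancels out, so the ciphertext XOR equals the plaintext XOR.
leaked = xor(c1, c2)

# Knowing p1 and c1 is enough to decrypt c2 without ever learning the key.
recovered = xor(c2, xor(c1, p1))
```

Here `leaked` equals the XOR of the two plaintexts and `recovered` equals `p2`, even though the attacker never touches `key`.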
2.3.1 Properties
∙ Provable security – If a secure block cipher is used, GCM is provably secure under the assumptions mentioned above. The proof of security for GCM can be found in [42].

∙ On-line input processing – GCM does support on-line input processing.

∙ Two-pass processing – GCM does not require two-pass processing.

∙ Parallelizability – Since GCM uses the CTR mode for encryption, the en-/decryption part can be parallelized. The GHASH part must be computed sequentially.

∙ Endian-dependency – GCM is endian-dependent. It uses big-endian byte ordering for both the GHASH polynomial representation and for counters in CTR encryption.

∙ Nonce misuse resistance – As discussed further above, nonce misuse is fatal for the security of GCM.
∙ Software performance – The software performance of GCM depends largely on the underlying block cipher and on available CPU instructions. GHASH can be accelerated significantly on the x86 architecture using the CLMUL instruction set, which has been available on Intel and AMD processors since 2010/2011. [28, 63] Since GCM is used almost solely with the AES block cipher, AES-specific CPU instructions are also important for its performance. For example, with AES-NI and CLMUL on x86, AES-GCM is one of the fastest officially standardized AEAD algorithms. The ARM architecture (since ARMv8) also has equivalents for the AES-NI and CLMUL extensions as part of its optional Cryptography Extensions. [2]
∙ Hardware performance and circuit area – Hardware performance of GCM depends on the block cipher used. Although it is not designed specifically for hardware performance, due to the simplicity of the GHASH function it can achieve moderate hardware performance and circuit size. [33]
2.4 ChaCha20-Poly1305
Another widely known AEAD algorithm often used in place of AES-GCM is the combination of the ChaCha20 stream cipher and the Poly1305 message authentication code, which is usually denoted as ChaCha20-Poly1305. It was standardized in May 2015 in RFC 7539. [47]

ChaCha20 (2007, [8]) is an improved variant of the Salsa20 stream cipher (2008, [19]), both designed by Daniel J. Bernstein. ChaCha20 utilizes an internal 512-bit block cipher operating in counter mode, where the 512-bit counter block consists of a 128-bit constant, a 256-bit key, a 96-bit nonce, and a 32-bit block counter¹². We will denote ChaCha20 encryption/decryption using key K, nonce N and counter k as follows (encryption and decryption are the same operation in this case):

$$C = \text{ChaCha20}_K(N, k, P), \qquad P = \text{ChaCha20}_K(N, k, C).$$

12. In Bernstein’s original specification both the nonce and counter are 64-bit, but the AEAD construction as defined in [47] trades off 32 bits from the block counter for the nonce.
Poly1305¹³ (2004, [18]) is a MAC designed by Daniel J. Bernstein, which accepts a 256-bit key and produces a 128-bit tag. The Poly1305 key is transformed into two 128-bit integers r and s. The first part (r) is formed from the first 16 bytes of the key by clearing 22 bits at specific positions and converting the bytes to an integer according to the little-endian convention. The second part (s) is formed from the other 16 bytes of the key by converting them to an integer according to the little-endian convention. We will denote the Poly1305 transform with key K as follows:

$$T = \text{Poly1305}_K(M).$$

When computing a tag using Poly1305, the message is split into a sequence of 16-byte blocks (if the message length is not a multiple of 16, the last block may have fewer than 16 bytes). The blocks are then processed by iteratively transforming an accumulator, which is initially set to 0. Each block is appended with a byte with value 1, converted to an integer using the little-endian convention, and added to the accumulator, which is then multiplied by r modulo 2^130 − 5. After processing all blocks, the value s is added to the accumulator, whose low 128 bits are converted to 16 bytes using the little-endian convention, forming the final tag.

The ChaCha20-Poly1305 construction takes a 256-bit key and a 96-bit nonce and produces a 128-bit tag. The encryption/decryption procedures of ChaCha20-Poly1305 are defined as follows:
$$\text{PolyKey}(K, N) = \text{ChaCha20}_K(N, 0, Z_{32}),$$

$$\text{ComputeTag}(K, N, A, C) = \text{Poly1305}_{\text{PolyKey}(K, N)}(\text{Format}(A, C)),$$

$$\text{AEAD-E}_K(N, A, P) = \bigl(\text{ChaCha20}_K(N, 1, P),\ \text{ComputeTag}(K, N, A, C)\bigr), \quad \text{where } C = \text{ChaCha20}_K(N, 1, P),$$

$$\text{AEAD-D}_K(N, A, C, T) = \begin{cases} P & \text{if } T = \text{ComputeTag}(K, N, A, C), \\ \bot & \text{otherwise,} \end{cases} \quad \text{where } P = \text{ChaCha20}_K(N, 1, C).$$

Here Z_k stands for a k-byte block consisting of zero bytes and Format(A, C) stands for the concatenation of A and C (each zero-padded to a multiple of 16 bytes) and the byte lengths of A and C (each converted to 8 bytes using the little-endian convention).

13. The “1305” in “Poly1305” represents the prime number that defines the finite field used for polynomial multiplication inside the algorithm – 2^130 − 5.
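The Poly1305 computation and the Format function described above are short enough to sketch directly in Python, relying on its arbitrary-precision integers. This is a readability-oriented sketch, not a constant-time implementation; the clamping mask is the one given in RFC 7539, and the function names `poly1305`, `pad16` and `format_input` are chosen for this example.

```python
P1305 = (1 << 130) - 5                       # the prime 2^130 - 5
CLAMP = 0x0ffffffc0ffffffc0ffffffc0fffffff   # clears 22 bits of r (mask from RFC 7539)


def poly1305(key: bytes, msg: bytes) -> bytes:
    """Compute the 16-byte Poly1305 tag of msg under a 32-byte one-time key."""
    r = int.from_bytes(key[:16], "little") & CLAMP
    s = int.from_bytes(key[16:32], "little")
    acc = 0
    for i in range(0, len(msg), 16):
        block = msg[i:i + 16] + b"\x01"      # append a byte with value 1
        acc = (acc + int.from_bytes(block, "little")) * r % P1305
    return ((acc + s) % (1 << 128)).to_bytes(16, "little")


def pad16(data: bytes) -> bytes:
    """Zero-pad data to a multiple of 16 bytes."""
    return data + b"\x00" * (-len(data) % 16)


def format_input(ad: bytes, ciphertext: bytes) -> bytes:
    """Format(A, C): padded A, padded C, then both byte lengths as 8-byte LE."""
    return (pad16(ad) + pad16(ciphertext)
            + len(ad).to_bytes(8, "little")
            + len(ciphertext).to_bytes(8, "little"))
```

Given a one-time key derived as PolyKey(K, N), the tag would then be `poly1305(poly_key, format_input(ad, ciphertext))`. The ChaCha20 part is omitted here to keep the sketch short.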
2.4.1 Properties
∙ Provable security – A security analysis of ChaCha20-Poly1305 can be found in [48].

∙ On-line input processing – ChaCha20-Poly1305 does support on-line input processing.

∙ Two-pass processing – ChaCha20-Poly1305 does not require two-pass processing.

∙ Parallelizability – ChaCha20-Poly1305 uses ChaCha20 (which operates in counter mode) for encryption, so the en-/decryption part can be parallelized.

∙ Endian-dependency – ChaCha20-Poly1305 is endian-dependent. It uses little-endian byte ordering for both Poly1305’s polynomial representation and for the counters in ChaCha20.

∙ Nonce misuse resistance – With ChaCha20-Poly1305, repeating a nonce has a similar effect on message confidentiality as with GCM (see section 2.3), although no attack on message integrity similar to the one on GCM by Joux has been demonstrated. Note that ChaCha20-Poly1305 has the same nonce size as GCM, and thus the same limitation for randomly generated nonces applies to it (i.e. at most 2^32 encryptions are allowed under the same key).

∙ Software performance – The software performance of ChaCha20-Poly1305 is usually lower than that of AES-GCM where hardware support for AES is available, but without it ChaCha20-Poly1305 is significantly faster, including on mobile platforms. [21] Since ChaCha20-Poly1305 uses only simple bitwise operations (ChaCha20) and large-number arithmetic (Poly1305), its performance does not depend on special hardware instructions and it can utilize vector instructions already present on most CPU architectures.

∙ Hardware performance and circuit area – ChaCha20-Poly1305 is not designed for hardware performance, but the choice of simple operations in its core allows for reasonably efficient hardware implementations, especially on 32-bit platforms.
2.5 CCM
CCM (Counter with CBC-MAC) is another generic AEAD mode, which is based on a block cipher with block size of 128 bits (usu- ally AES). It was designed by Russ Housley, Doug Whiting, and Niels Ferguson. It is specified in NIST Special Publication 800-38C ([25]), ISO/IEC 19772:2009 ([36]), and a variant for use with the IPsec and TLS protocols is specified in RFC 3610 ([62]). Internally, CCM uses the underlying block cipher in CTR mode for message encryption and CBC-MAC (again with the same block cipher) for authentication. These primitives are combined using a variant of the encrypt-and-MAC construction (see section 2.2.2). CCM produces a 128-bit tag that can be optionally truncated to a shorter length. Due to certain weaknesses of CBC-MAC, CCM has strict limitations on formatting the input to CBC-MAC. The original NIST standard does not specify a single formatting scheme, but instead provides rules that such scheme must fulfill. RFC 3610 defines a specific formatting scheme, which allows for a nonce of configurable length (7–13 bytes).
In this case, the maximum length of the message depends on the nonce length L and is 2^{8(15−L)} bytes (i.e. 2^{64} bytes for the shortest nonce and 2^{16} bytes for the longest nonce). This scheme was criticized by Rogaway and Wagner in [49]. Another limitation of the RFC 3610 formatting scheme (and of any other legal scheme) is that the length of the message needs to be known before authentication, which means that CCM does not support on-line input processing.
Bellare, Rogaway and Wagner suggested an alternative mode – EAX – which has a similar design, but supports on-line input processing and nonces of arbitrary length. [5] Although also standardized in ISO/IEC 19772:2009, this AEAD mode was rejected by NIST and did not achieve such widespread use.
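The trade-off between nonce length and message length in the RFC 3610 formatting scheme can be tabulated with a few lines of code. The sketch below assumes only what is stated above (a 15-byte budget shared between the nonce and the message length field); the function name is hypothetical.

```python
def ccm_message_length_bound(nonce_len: int) -> int:
    """Bound 2^(8 * (15 - nonce_len)) on the message length in bytes
    for a CCM nonce of nonce_len bytes (RFC 3610 formatting)."""
    if not 7 <= nonce_len <= 13:
        raise ValueError("nonce length must be 7..13 bytes")
    length_field_bytes = 15 - nonce_len  # bytes left to encode the length
    return 2 ** (8 * length_field_bytes)


for nonce_len in (7, 12, 13):
    print(nonce_len, ccm_message_length_bound(nonce_len))
```

For instance, a 12-byte nonce (the size used by GCM) leaves three bytes for the length field, which limits messages to 2^{24} bytes.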
2.5.1 Properties
∙ Provable security – A formal security analysis of CCM can be found in [37]. However, as discussed above (and in more detail in [49]), CCM is a complex algorithm, and the trade-off between nonce length and message length may easily lead to implementation errors.
∙ On-line input processing – CCM does not support on-line input processing.
∙ Two-pass processing – CCM does not require two-pass processing.
∙ Parallelizability – The encryption part of CCM can be parallelized; the authentication part cannot.
∙ Endian-dependency – The formatting of input to CBC-MAC and the counter block used for encryption are endian-dependent and use the big-endian convention.
∙ Nonce misuse resistance – Due to the use of CTR mode for encryption, nonce reuse has a catastrophic effect on message confidentiality (as with GCM). Even though CCM supports nonces of a length similar to that of GCM and ChaCha20-Poly1305, longer nonces lead to a smaller limit on message length, so users are likely to choose shorter nonces anyway. This means that it is usually not acceptable to use CCM with randomly generated nonces. [49]
∙ Software performance – The software performance of CCM depends largely on the underlying block cipher and on available CPU instructions. Since CCM is used almost solely with the AES block cipher, AES-specific CPU instructions are crucial for its performance. For example, with the AES-NI instruction set on x86, AES-CCM can be even faster than AES-GCM. Without hardware support for AES, however, AES-CCM will likely be much slower than AES-GCM or ChaCha20-Poly1305. The ARM architecture (since ARMv8) also provides an equivalent of the AES-NI extensions as part of its optional Cryptography Extension. [2]
∙ Hardware performance and circuit area – Hardware performance of CCM depends on the block cipher used. The performance of AES-CCM should be similar to AES-GCM, although it is possible that GHASH is more hardware-friendly than CBC-MAC with AES. The circuit area of CCM will likely be smaller than for GCM (with the same cipher), since the same core algorithm is used for both encryption and authentication.
2.6 SIV
SIV (Synthetic Initialization Vector) is a nonce misuse resistant AEAD block cipher mode of operation designed by Phillip Rogaway and Thomas Shrimpton. For any block cipher Cipher, the resulting AEAD algorithm based on SIV is denoted Cipher-SIV. SIV has been standardized under RFC 5297. Although the main motivation for SIV was its use for key wrapping¹⁴, it can also be used as a regular AEAD algorithm for cases where nonce misuse resistance is necessary. [54, 32]
14. Key wrapping is the practice of encrypting a cryptographic key using another key.
Internally, Cipher-SIV uses the Cipher-CMAC message authentication code algorithm (see [26]) to process the nonce, associated data, and message into a bit string that is used both as the authentication tag and (with a minor tweak) as the initial counter block for Cipher-CTR, which is used to encrypt/decrypt the message.
SIV supports a nonce of arbitrary length and even an arbitrary number of associated data strings. The tag size is always the same as the block size of the underlying cipher. SIV accepts a key of double the cipher key size – the first half is used for CMAC and the second half for CTR encryption.
The exact definition of SIV is out of scope of this thesis. See [54] or [32] for a detailed description of the algorithm.
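The key split and the "minor tweak" mentioned above can be illustrated as follows. This is only a sketch under our reading of RFC 5297, in which the tweak clears the most significant bit of each of the last two 32-bit words of the tag before it is used as the counter block; the function names are ours and the bit positions should be verified against the RFC.

```python
def split_siv_key(key: bytes):
    """Split a double-length SIV key into (CMAC key, CTR key)."""
    if len(key) % 2 != 0:
        raise ValueError("SIV key must have an even length")
    half = len(key) // 2
    return key[:half], key[half:]


def siv_initial_counter(tag: bytes) -> bytes:
    """Derive the initial CTR block from the 128-bit tag by clearing
    bits 63 and 31 (counted from the least significant bit).
    Assumption: this matches the masking step of RFC 5297."""
    if len(tag) != 16:
        raise ValueError("tag must be 16 bytes")
    value = int.from_bytes(tag, "big")
    value &= ~((1 << 63) | (1 << 31))
    return value.to_bytes(16, "big")


mac_key, ctr_key = split_siv_key(bytes(range(32)))
print(mac_key.hex(), ctr_key.hex())
print(siv_initial_counter(b"\xff" * 16).hex())
```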
2.6.1 Properties
∙ Provable security – SIV is provably secure. For the security analysis, see Rogaway’s original paper. [54]
∙ On-line input processing – SIV supports on-line input processing (the lengths of inputs do not need to be known in advance), although the message needs to be processed twice (and thus retained in memory between the two passes).
∙ Two-pass processing – SIV requires two-pass processing for encryption and also for decryption, unless the authentication tag can be processed before the ciphertext.
∙ Parallelizability – The encryption part of SIV can be parallelized. The CMAC computation cannot be parallelized itself, although the separate additional data strings (and the nonce) can be processed in parallel.
∙ Endian-dependency – The counter mode used for encryption is endian-dependent and uses the big-endian convention.
∙ Nonce misuse resistance – SIV is nonce misuse resistant, as this is its main design goal.
∙ Software performance – Since SIV is a two-pass algorithm, its software performance is worse compared to other block-cipher-based AEAD algorithms (such as GCM or CCM). Thus, SIV is recommended only for situations where nonce misuse resistance is required and lower performance can be tolerated.
∙ Hardware performance and circuit area – For hardware performance of SIV the same remarks hold as for its software performance.
2.7 GCM-SIV
GCM-SIV is an AEAD block cipher mode of operation derived from GCM (see section 2.3), which is currently a draft close to being approved as a new RFC document. GCM-SIV, as opposed to GCM, is a fully nonce misuse resistant algorithm. GCM-SIV was designed by Shay Gueron in collaboration with Yehuda Lindell and Adam Langley. GCM-SIV is intended primarily for use with AES (AES-GCM-SIV), although other 128-bit block ciphers may be used in its place. [29, 30, 43]
Like GCM, GCM-SIV accepts a 96-bit nonce and a key of the same size as the underlying block cipher, and produces a tag of at most 128 bits. Unlike other AEAD modes, GCM-SIV does not use the key directly, but uses it to derive two keys from the nonce – a 128-bit authentication key for the internal MAC and a record-encryption key for the block cipher. This allows a larger number of messages to be encrypted under the same key, despite the relatively short nonce. See section 9 of [29] for a more detailed discussion.
Internally, GCM-SIV uses POLYVAL (an algorithm akin to GCM’s GHASH) for authentication and the underlying block cipher in counter mode for confidentiality. Similar to SIV, GCM-SIV uses the tag (which is the POLYVAL output encrypted by the block cipher with the record-encryption key) as the initial counter block for encryption.
Even though GCM-SIV provides a good trade-off between performance and resistance against nonce misuse, some design choices may make adoption difficult for certain systems.
For instance, as the counter in the CTR mode is interpreted as having little-endian byte ordering, crypto libraries that already have an implementation of the classic big-endian CTR mode cannot reuse the existing code for it. Of course, a fully optimized implementation usually has to implement the entire algorithm as a single unit (not modularly), but in generic implementations the code for common sub-algorithms is often shared to reduce redundancy. Also, cryptographic accelerators may support the classic CTR mode, but not the little-endian one, forcing platforms that rely heavily on cryptographic accelerators to use a slower software version.
Next, moving the underlying cipher key schedule into the per-message computation path may cause problems in cryptographic frameworks that do not allow for it. A notable example is the Linux kernel Crypto API, which has special requirements on the context from which the key-setting function may be called. This could make it impossible to write a correct generic implementation in such a framework.
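To illustrate why existing big-endian CTR code cannot simply be reused, the following sketch contrasts the two counter conventions on a 16-byte counter block. The position of the 32-bit counter within the block (last four bytes for the GCM style, first four bytes for the GCM-SIV style) reflects our reading of the respective specifications and should be checked against them; the function names are hypothetical.

```python
def inc32_big_endian(block: bytes) -> bytes:
    """GCM-style increment: the last 4 bytes form a big-endian counter."""
    if len(block) != 16:
        raise ValueError("counter block must be 16 bytes")
    counter = (int.from_bytes(block[12:], "big") + 1) % 2**32
    return block[:12] + counter.to_bytes(4, "big")


def inc32_little_endian(block: bytes) -> bytes:
    """GCM-SIV-style increment: the first 4 bytes form a little-endian
    counter (assumed layout, see the lead-in)."""
    if len(block) != 16:
        raise ValueError("counter block must be 16 bytes")
    counter = (int.from_bytes(block[:4], "little") + 1) % 2**32
    return counter.to_bytes(4, "little") + block[4:]


print(inc32_big_endian(bytes(15) + b"\xff").hex())
print(inc32_little_endian(b"\xff" + bytes(15)).hex())
```

The two functions touch different bytes of the block and propagate the carry in opposite directions, so neither a software routine nor an accelerator built for one convention can produce the key stream of the other.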
2.7.1 Properties
∙ Provable security – A security analysis of GCM-SIV is available in [30].
∙ On-line input processing – GCM-SIV supports on-line input processing (the lengths of inputs do not need to be known in advance), although the message needs to be processed twice (and thus retained in memory between the two passes).
∙ Two-pass processing – GCM-SIV requires two-pass processing for encryption and also for decryption, unless the authentication tag can be processed before the ciphertext.
∙ Parallelizability – The encryption part of GCM-SIV can be parallelized. The POLYVAL computation cannot be parallelized.
∙ Endian-dependency – The counter mode encryption and POLYVAL are endian-dependent and both use the little-endian convention.
∙ Nonce misuse resistance – GCM-SIV is nonce misuse resistant, as this is one of its main design goals.
∙ Software performance – Since GCM-SIV is a two-pass algorithm, its software performance is somewhat worse compared to other block-cipher-based AEAD algorithms (such as GCM or CCM). However, on little-endian machines (which form the vast majority) this is mitigated by a small performance gain from using the little-endian-friendly POLYVAL and counter mode. According to the performance measurements by Shay Gueron, on an Intel CPU with AES-NI and CLMUL instructions the performance of AES-GCM-SIV is comparable to AES-GCM when decrypting and 30–50% slower when encrypting (depending on AES key size). Also, the processing of small messages is slower due to the use of dynamically derived keys. [30]
∙ Hardware performance and circuit area – Even though AES-GCM-SIV was not designed for optimal hardware performance, it can be expected that it would be similar to AES-GCM, again with some overhead for small messages due to the dynamic key derivation.
3 CAESAR competition
3.1 What is CAESAR?
The CAESAR¹ competition is a public competition started in 2014 by Daniel J. Bernstein with the goal to identify a portfolio of authenticated encryption algorithms that both offer advantages over AES-GCM and are suitable for widespread adoption. Note that CAESAR does not evaluate generic AEAD modes (like GCM), but only concrete algorithms (like AES-GCM). [9, 10]
The competition is organized by a committee of experts in cryptography, including Daniel J. Bernstein (who is the CAESAR secretary and a non-voting member of the committee), Alexander Biryukov, Vincent Rijmen and Joan Daemen (the designers of the AES block cipher), Bart Preneel, and Phillip Rogaway. The committee members are allowed to submit their own algorithms, although they may not participate in committee discussions regarding their own algorithms. [12]
The competition strongly emphasizes that all submissions, analyses by the committee, and all responses to the analyses need to be public. The CAESAR competition does not aim to standardize the selected algorithms, only to provide public analysis that can be used by relevant institutions as a basis for standardization decisions. [10]
CAESAR is organized as a three-round competition with one final round. In the first round, initial candidate submissions are collected and in each subsequent round a subset of candidates is selected from the previous round. Between rounds, submission authors can submit tweaks to their specifications or code (to fix issues or improve their algorithms). [11]
Each submission specifies an AEAD algorithm family, which contains one or more algorithms, and a number of recommended parameter sets (i.e. algorithms with specific parameter configurations, each recommended for particular use cases). Each submission must provide a PDF specification with defined contents and code for software and hardware implementations (at least a reference implementation and an optional set of optimized implementations). [9]
1. CAESAR stands for “Competition for Authenticated Encryption: Security, Applicability, and Robustness.”
In July 2016, Bernstein posted a list of three use cases that a submission may choose to target; the list was posted to the public discussion forum that is used as the main communication channel for CAESAR. In August 2016, Bernstein added a rule for third round submissions, stating that each recommended parameter set of each submission must specify a prioritized list of use cases targeted by that parameter set. The use cases are specified as follows: [14, 15]
1. Lightweight applications – This is a use case that prioritizes small hardware circuit area, good hardware and 8-bit software performance, side-channel attack resistance, and is intended for mostly small messages.
2. High-performance applications – This use case prioritizes performance on 64-bit (and 32-bit) software architectures, time invariance of implementations, and is intended also for longer messages.
3. Defense in depth – This use case prioritizes nonce misuse resistance, resistance against attacks involving the release of unverified plaintexts, and robustness in corner cases (e.g. huge amounts of data).
At the time of writing, CAESAR is in its third round with 15 candidates. Based on the last message by Bernstein in the Cryptographic competitions discussion forum from 15 June 2017, the committee is currently reviewing hardware design and software performance of the third round candidates. [16]
3.2 Third round candidates
In the third round of CAESAR there are 15 candidates: [13]
∙ ACORN (Hongjun Wu) – a very lightweight algorithm based on linear-feedback shift registers, which is very small and reasonably fast in hardware, although quite slow in software.
∙ Ketje (Guido Bertoni, Joan Daemen, et al.) – another lightweight and tunable sponge-based algorithm family with variants for software (moderate performance) and hardware (very good performance); based on Keccak (a.k.a. SHA-3).
∙ Keyak (Guido Bertoni, Joan Daemen, et al.) – a lightweight algorithm family similar to Ketje, but focused on the defense in depth use case.
∙ Ascon (Christoph Dobraunig, Maria Eichlseder, et al.) – another lightweight hardware-focused algorithm with a tunable trade-off between circuit size and performance; uses a so-called sponge-based design.
∙ JAMBU (Hongjun Wu, Tao Huang) – a lightweight AEAD mode intended for hardware implementations; provides two variants: AES-JAMBU and a more lightweight SIMON-JAMBU.
∙ CLOC + SILC (Tetsu Iwata, Kazuhiko Minematsu, et al.) – two related AEAD modes based on CFB² and CBC-MAC that focus on hardware performance, instantiated with AES and other block ciphers.
∙ AEGIS (Hongjun Wu, Bart Preneel) – an algorithm family based on the AES round function; very fast in software with native AES instructions and in hardware.
∙ Tiaoxin (Ivica Nikolić) – an algorithm with design and properties very similar to AEGIS.
∙ Deoxys (Jérémy Jean, Ivica Nikolić, et al.) – an algorithm built on an AES-based block cipher, with optional nonce misuse resistance.
∙ MORUS (Hongjun Wu, Tao Huang) – an algorithm family targeting SIMD instructions; very fast on processors with SIMD support and also in hardware.
∙ NORX (Jean-Philippe Aumasson, Philipp Jovanovic, Samuel Neves) – another SIMD-focused algorithm with tunable word size and parallelism; sponge-based design.
2. CFB = Cipher Feedback mode; a block cipher mode of operation similar to CBC.
∙ AES-OCB (Ted Krovetz, Phillip Rogaway) – a very fast endian-independent AEAD mode of operation, instantiated with AES.
∙ AES-OTR (Kazuhiko Minematsu) – an AEAD mode of operation similar to OCB, but faster and more lightweight in hardware, instantiated with AES.
∙ COLM (Elena Andreeva, Andrey Bogdanov, et al.) – an AEAD mode focused on defense in depth and software performance, instantiated with AES.
∙ AEZ (Viet Tung Hoang, Ted Krovetz, Phillip Rogaway) – an AEAD algorithm based on AES with focus on defense in depth and software performance.
In the following sections we describe a selection of these candidates in more detail. We then implement these algorithms for the Linux kernel cryptographic API (see chapter 6).
3.3 MORUS
MORUS is a dedicated AEAD algorithm family designed by Hongjun Wu and Tao Huang. It is designed to easily take advantage of common SIMD instructions. It uses only four elementary operations: bitwise exclusive OR (XOR), bitwise AND, full-block bit rotation, and bit rotation of independent words in a block. At the time of writing, the MORUS submission is at version 2 and can be found in [64].
MORUS has two variants – MORUS-640 and MORUS-1280. The main difference is that MORUS-640 operates on 128-bit blocks and uses rotation over 32-bit words, whereas MORUS-1280 operates on 256-bit blocks and uses rotation over 64-bit words. Both MORUS-1280 and MORUS-640 primarily target CAESAR use case 2 (High-performance applications). MORUS-640 also secondarily targets use case 1 (Lightweight applications).
MORUS-640 has a state size of 640 bits (five 128-bit blocks) and accepts a 128-bit key. MORUS-1280 has a state size of 1280 bits (five 256-bit blocks) and accepts either a 256-bit or a 128-bit key. Both variants accept a 128-bit nonce and produce a 128-bit tag.
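The two rotation operations can be sketched for the 128-bit blocks of MORUS-640 as follows. The sketch assumes the little-endian conversion between bytes and words that MORUS uses (see section 3.3.1); the function names are ours, and the exact rotation amounts and directions used by MORUS are defined in [64] and are not reproduced here.

```python
MASK32 = 0xFFFFFFFF
MASK128 = (1 << 128) - 1


def rotl_words(block: bytes, bits: int) -> bytes:
    """Rotate each of the four 32-bit words of a 16-byte block left by
    the same number of bits (words are read as little-endian)."""
    bits %= 32
    out = bytearray()
    for i in range(0, 16, 4):
        word = int.from_bytes(block[i:i + 4], "little")
        word = ((word << bits) | (word >> (32 - bits))) & MASK32
        out += word.to_bytes(4, "little")
    return bytes(out)


def rotl_block(block: bytes, bits: int) -> bytes:
    """Rotate the whole 16-byte block, read as one little-endian
    128-bit integer, left by the given number of bits."""
    bits %= 128
    value = int.from_bytes(block, "little")
    value = ((value << bits) | (value >> (128 - bits))) & MASK128
    return value.to_bytes(16, "little")


block = bytes.fromhex("01000000" * 4)
print(rotl_words(block, 1).hex())
print(rotl_block(b"\x01" + bytes(15), 32).hex())
```

On SIMD hardware the word-wise rotation maps to a pair of vector shifts and an OR, while a full-block rotation by a multiple of the word size is a mere shuffle of the words, which is what makes these operations cheap.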
3.3.1 Operation
MORUS is built upon the core state update function, which transforms a MORUS state and an input block into a new state. The state update function uses a series of bitwise operations to mix the input block into the state. The state update function is assumed to be irreversible even if the input block is known.
MORUS converts between byte strings and blocks using the little-endian convention (the first byte forms the 8 least significant bits of a block; the last byte forms the 8 most significant bits).
MORUS operates in four stages:
1. Initialization – The first state block is initialized with the nonce (zero-padded), the second one with the formatted key (repeated if smaller than block size), one or two blocks are initialized with a constant value derived from the Fibonacci sequence, and the rest is initialized to trivial values (all ones or all zeros). Then, the state update function is applied 16 times with an all-zero input block and finally, the key is XOR-ed into the second block of state.
2. Processing the associated data – Next, the associated data is split into blocks and zero-padded. The blocks are then sequentially used to update the state using the state update function.
3. Processing the message data – Then, the plaintext/ciphertext is encrypted/decrypted by XOR-ing it with a key stream derived from the state and the plaintext. For each block a “key” block is first derived from the state. This block is XOR-ed with the plaintext/ciphertext block to form the resulting ciphertext/plaintext block. Then, the state is updated using the state update function with the plaintext block (which is padded with zeros if shorter than block size) as the input block.
4. Finalization – Finally, the first state block is XOR-ed into the fifth state block, the state is updated 10 times with a block containing the lengths of associated data and the message, and the authentication tag is derived from the final state.
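The message-processing pattern of stage 3 can be sketched in a few lines. The code below is a toy model only: SHA-256 stands in for both the state update function and the key stream derivation (the real MORUS functions are defined in [64]), and all names are hypothetical. It shows that encryption and decryption both feed the plaintext block into the state, so both sides arrive at the same final state.

```python
import hashlib

BLOCK = 16  # block size in bytes (as in MORUS-640)


def toy_update(state: bytes, block: bytes) -> bytes:
    """Stand-in for the state update function (NOT the MORUS one)."""
    return hashlib.sha256(b"update" + state + block).digest()


def toy_key_block(state: bytes) -> bytes:
    """Stand-in for deriving a key stream block from the state."""
    return hashlib.sha256(b"stream" + state).digest()[:BLOCK]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))  # truncates to the shorter


def process_message(state: bytes, data: bytes, decrypt: bool = False):
    """Encrypt or decrypt data block by block; returns (state, output)."""
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        chunk = data[i:i + BLOCK]
        result = xor(chunk, toy_key_block(state))
        plain = result if decrypt else chunk
        # the state always absorbs the zero-padded plaintext block
        state = toy_update(state, plain.ljust(BLOCK, b"\x00"))
        out += result
    return state, bytes(out)


initial = bytes(32)  # placeholder for the state after stages 1 and 2
message = b"authenticated encryption"
state_enc, ciphertext = process_message(initial, message)
state_dec, recovered = process_message(initial, ciphertext, decrypt=True)
print(recovered == message, state_enc == state_dec)
```

Because the final state depends on every plaintext block, a tag derived from it in stage 4 authenticates the whole message.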
3.3.2 Properties
∙ Provable security – Even though the MORUS CAESAR submission provides arguments for its security, it currently does not have a formal security proof. [64, 1]
∙ On-line input processing – MORUS supports on-line input processing.
∙ Two-pass processing – MORUS does not require two-pass processing.
∙ Parallelizability – The computation of MORUS is sequential and cannot be parallelized at block level. However, the state update function of MORUS provides some level of SIMD parallelism.
∙ Endian-dependency – MORUS is endian-dependent and uses exclusively the little-endian convention.
∙ Nonce misuse resistance – MORUS is not nonce misuse resistant by design, although due to its iterative nature nonce reuse does not have as large an effect on confidentiality as it does with CTR-based ciphers (like GCM), and it has no adverse consequences for authentication. Since all variants of MORUS support 128-bit nonces (and these are fully mixed into the state), the nonces can be safely generated at random.
∙ Software performance – MORUS is designed for SIMD architectures and its performance is very good on machines with SIMD instructions. [17] On x86, MORUS-640 and MORUS-1280 can be efficiently implemented using the SSE2 instruction set and MORUS-1280 can run even faster when implemented using AVX2 instructions. On modern ARM architectures, similar vector instructions are also available.
∙ Hardware performance and circuit area – MORUS has extremely good hardware performance. [33] The circuit area of MORUS is similar to that of AES-GCM. MORUS has the best throughput per area ratio out of all round 3 CAESAR candidates. [33]
3.4 AEGIS
AEGIS³ is a dedicated AEAD algorithm family designed by Hongjun Wu and Bart Preneel. The main cryptographic primitive used in AEGIS is the AES round function, which is nowadays implemented as a single (and fast) instruction in most x86 processors. Other primitive transformations used are bitwise AND and XOR. At the time of writing, the AEGIS submission is at version 1.1 and can be found in [65].
AEGIS has three variants – AEGIS-128L, AEGIS-128, and AEGIS-256. The differences of parameter and state sizes between the variants are summarized in table 3.1. All three variants of AEGIS target CAESAR use case 2 (High-performance applications).
                 | AEGIS-128L | AEGIS-128 | AEGIS-256
Input block size | 256 bits   | 128 bits  | 128 bits
Nonce size       | 128 bits   | 128 bits  | 256 bits
Key size         | 128 bits   | 128 bits  | 256 bits
Tag size         | 128 bits   | 128 bits  | 128 bits
State size       | 1024 bits  | 640 bits  | 768 bits
Table 3.1: Comparison of AEGIS variants.
3.4.1 Operation
AEGIS variants operate in the same way as MORUS (see section 3.3.1), apart from a different state update function and minor differences in state layout, initialization, and finalization. AEGIS-128L differs more, since it processes data two blocks at a time.
3. The name AEGIS was taken from Greek mythology, where “aegis” is the shield used by the god Zeus.
[Diagram: the state blocks S0–S4 before and after the update, the input block M, and the applications of R.]
(Figure adapted from [65].)
Figure 3.1: The state update function of AEGIS-128.
The state update function of AEGIS-128 cyclically applies the AES round function to the state blocks and mixes in the input block. Figure 3.1 shows a diagram of the AEGIS-128 state update function (M denotes the input block; R denotes the core part of the AES round function – without XOR with the round key).
The state update function of AEGIS-256 is analogous to that of AEGIS-128, only there is one additional state block.
The state update function of AEGIS-128L is analogous to that of AEGIS-128, with three more state blocks and an additional input block, which is XOR-ed into the state block S4.
Another important difference from MORUS is in how AEGIS-128L processes the input data (associated data and plaintext/ciphertext). Instead of processing one 128-bit block at a time, it processes two at once. When the state is updated, two consecutive blocks of associated data or plaintext are passed as input to the state update function. Analogously, when encrypting/decrypting, two different key stream blocks are derived from the state and XOR-ed with the corresponding plaintext/ciphertext blocks.
For a detailed definition of all AEGIS variants see the AEGIS CAESAR submission ([65]).
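The cyclic structure of the AEGIS-128 state update can be sketched as follows. This reflects our reading of figure 3.1 and of [65]: each new state block is the previous block passed through R and XOR-ed with the current block, and the input block M is additionally XOR-ed into S0. The function R is supplied by the caller; the code does not implement the AES round itself, and all names are ours.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def aegis128_style_update(state, message_block, r):
    """One state update over five 16-byte blocks.

    state         -- list of five 16-byte blocks [S0, ..., S4]
    message_block -- 16-byte input block M
    r             -- callable standing in for R (the AES round
                     without the round-key XOR)
    """
    if len(state) != 5:
        raise ValueError("AEGIS-128 uses five state blocks")
    # state[i - 1] wraps around to S4 for i == 0 (cyclic structure)
    new_state = [xor(r(state[i - 1]), state[i]) for i in range(5)]
    new_state[0] = xor(new_state[0], message_block)
    return new_state


# Usage with a placeholder R (identity), only to show the data flow.
state = [bytes([i]) * 16 for i in range(1, 6)]
new_state = aegis128_style_update(state, bytes([0x10]) * 16, lambda b: b)
print([block[0] for block in new_state])
```

Since each new block depends only on two blocks of the old state, the five applications of R are independent of each other and can be pipelined in hardware or on CPUs with AES instructions.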
3.4.2 Properties
∙ Provable security – Even though the AEGIS CAESAR submission provides arguments for its security, it currently does not have a formal security proof. [65, 1]
∙ On-line input processing – AEGIS supports on-line input processing.
∙ Two-pass processing – AEGIS does not require two-pass processing.
∙ Parallelizability – The computation of AEGIS is sequential and cannot be parallelized at block level. However, the state update function of AEGIS provides some level of internal parallelism.
∙ Endian-dependency – Apart from the formatting of message lengths (which uses the little-endian convention), AEGIS is endian-independent.
∙ Nonce misuse resistance – AEGIS is not nonce misuse resistant by design, although due to its iterative nature nonce reuse does not have as large an effect on confidentiality as it does with CTR-based ciphers (like GCM), and it has no adverse consequences for authentication. Since all variants of AEGIS support at least 128-bit nonces (and these are fully mixed into the state), the nonces can be safely generated at random.
∙ Software performance – On machines that have instruction support for the AES round (e.g. AES-NI on x86), AEGIS is even faster than MORUS. On machines without hardware AES support, however, it is much slower. [17]
∙ Hardware performance and circuit area – AEGIS has the best hardware performance out of all round 3 CAESAR candidates. [33] The circuit area of AEGIS is similar to that of AES-GCM. AEGIS has one of the best throughput per area ratios among round 3 CAESAR candidates. [33]
3.5 OCB
OCB is an AEAD mode of operation designed by Phillip Rogaway. It uses a 128-bit block cipher as a building block to create a concrete AEAD algorithm. The main advantages of OCB are endian-independence and good software performance (compared to other similar algorithms). It is standardized in RFC 7253 ([40]) and an earlier version in ISO/IEC 19772:2009 ([36]).
Some techniques used in OCB are covered by United States patents issued to Rogaway. Rogaway has granted a no-charge, royalty-free license to any open source software implementation of OCB and to any (non-military) software implementation, research use, or non-commercial use. [52, 53, 51, 50]
OCB accepts a nonce of 0–120 bits (or 0–15 bytes) and a key of the size accepted by the underlying block cipher, and produces a tag of up to 128 bits.
3.5.1 Operation
We shall use the following notation:
∙ Encrypt_K(X) – the encryption of block X using the underlying cipher with the OCB key K,
∙ Decrypt_K(X) – the decryption of block X using the underlying cipher with the OCB key K,
∙ ntz(i) – the number of trailing zeros in the binary representation of i.
OCB has several key-dependent 128-bit variables: L_*, L_$, and L_i for each i between 0 and k, where 128 · 2^k is the maximum length (in bits) of the associated data and message. L_* is obtained as the encryption of an all-zero block using the underlying cipher (the OCB key is used directly with all block cipher invocations). L_$ and L_0 through L_k are obtained by iteratively performing polynomial doubling⁴, starting with L_* as the initial value. These values can be precomputed for any fixed value of the key and reused for multiple encryptions/decryptions.
An OCB encryption/decryption consists of three phases:
1. Hashing the associated data – Blocks S and O are initialized to all zeros. For each i-th complete 128-bit block of the associated data A_i, we set O ← O ⊕ L_{ntz(i)} and then S ← S ⊕ Encrypt_K(A_i ⊕ O).
4. For an explanation of this transformation please see section 2 of the RFC ([40]).
If the length of the associated data is not a multiple of 128 bits, then the final partial block is padded with a single one bit followed by zero bits to form A_* and processed in a similar way: O ← O ⊕ L_* and S ← S ⊕ Encrypt_K(A_* ⊕ O). The final value of S is later XOR-ed into the final tag.
2. Nonce processing – Here, nonce and tag length are encoded into a single 128-bit block, which is encrypted by the block cipher and the result is used to compute the initial offset block O.
3. The actual encryption/decryption – In both cases the block S is initialized to all zeros and the value of the block O from the previous phase is used.
∙ Encryption – For the i-th complete plaintext block P_i, the corresponding ciphertext block C_i is computed as:
1. O ← O ⊕ L_{ntz(i)}
2. C_i ← O ⊕ Encrypt_K(P_i ⊕ O)
3. S ← S ⊕ P_i
∙ Decryption – For the i-th complete ciphertext block C_i, the corresponding plaintext block P_i is computed as:
1. O ← O ⊕ L_{ntz(i)}
2. P_i ← O ⊕ Decrypt_K(C_i ⊕ O)
3. S ← S ⊕ P_i
The final partial block (if any) is processed in a special way that is not detailed here. See RFC 7253 ([40]) for a complete formal definition. In both cases, the final tag is computed by XOR-ing the final value of S from the first phase with Encrypt_K(S ⊕ O ⊕ L_$) and truncating the result to the required length.
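The offset computation and the processing of complete blocks described above can be sketched as follows. The doubling function follows our recollection of section 2 of RFC 7253 (a shift left by one bit in the big-endian representation, with the constant 0x87 XOR-ed in when a bit is shifted out) and should be checked against the RFC. The block cipher is replaced by a toy invertible function that offers no security, the initial offset is a placeholder for the nonce-derived value, and all function names are ours.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def ntz(i: int) -> int:
    """Number of trailing zero bits of i (for i >= 1)."""
    return (i & -i).bit_length() - 1


def double(block: bytes) -> bytes:
    """Polynomial doubling of a 128-bit block (assumed per RFC 7253)."""
    value = int.from_bytes(block, "big") << 1
    if value >> 128:
        value = (value & ((1 << 128) - 1)) ^ 0x87
    return value.to_bytes(16, "big")


def l_values(l_star: bytes, k: int):
    """Return (L_$, [L_0, ..., L_k]) derived from L_* by doubling."""
    l_dollar = double(l_star)
    ls = [double(l_dollar)]
    for _ in range(k):
        ls.append(double(ls[-1]))
    return l_dollar, ls


def ocb_full_blocks(blocks, offset, ls, cipher, decrypt=False):
    """Process complete 16-byte blocks; cipher is Encrypt_K or Decrypt_K.
    Returns (output blocks, final offset O, checksum S)."""
    checksum = bytes(16)
    out = []
    for i, block in enumerate(blocks, start=1):
        offset = xor(offset, ls[ntz(i)])
        result = xor(offset, cipher(xor(block, offset)))
        checksum = xor(checksum, result if decrypt else block)
        out.append(result)
    return out, offset, checksum


# Toy stand-in for the block cipher: byte rotation plus key XOR.
KEY = bytes(range(16))


def toy_encrypt(block: bytes) -> bytes:
    return xor(block[1:] + block[:1], KEY)


def toy_decrypt(block: bytes) -> bytes:
    t = xor(block, KEY)
    return t[-1:] + t[:-1]


l_dollar, ls = l_values(toy_encrypt(bytes(16)), 4)
plaintext = [bytes([i]) * 16 for i in (1, 2, 3)]
ct, _, s_enc = ocb_full_blocks(plaintext, bytes(16), ls, toy_encrypt)
pt, _, s_dec = ocb_full_blocks(ct, bytes(16), ls, toy_decrypt, decrypt=True)
print(pt == plaintext, s_enc == s_dec)
```

Note that the offset for block i depends only on i and on the precomputed L values, not on the data, which is why all block cipher calls can be issued in parallel.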
3.5.2 Properties
∙ Provable security – OCB is provably secure. For the security proof, see the latest paper by Krovetz and Rogaway ([41]).
∙ On-line input processing – OCB supports on-line input processing.
∙ Two-pass processing – OCB does not require two-pass processing.
∙ Parallelizability – OCB is fully parallelizable – all underlying cipher calls (the bottleneck) can be performed in parallel. The associated data and message data can also be processed independently in parallel.
∙ Endian-dependency – The main loop of OCB is endian-independent. The computation of key-dependent variables and nonce preprocessing is endian-dependent and uses the big-endian byte ordering.
∙ Nonce misuse resistance – OCB is not nonce misuse resistant. In fact, nonce reuse has a catastrophic effect on both the confidentiality and integrity provided by OCB (see section 5.1 of [40]).
∙ Software performance – The software performance of AES-OCB is much better than AES-GCM on platforms with AES instructions and close to AES-GCM on other platforms. [17]
∙ Hardware performance and circuit area – In hardware, AES-OCB has performance similar to AES-GCM and about 30-40% larger circuit area. The throughput per area ratio is about 70% of that of AES-GCM. [33]
3.6 Comparison
Table 3.2 summarizes the differences between the discussed AEAD algorithms in terms of the properties listed in section 2.1. In the case of generic AEAD modes, we used AES, the most common block cipher, as the underlying block cipher. The properties are denoted by abbreviations based on their names. Entries prefixed with a question mark are based only on an educated guess.
Algorithm         | PS  | ON     | 2P  | PL     | ED      | NMR
AES-GCM           | yes | yes    | no  | partly | BE      | no
ChaCha20-Poly1305 | yes | yes    | no  | partly | LE      | no
AES-CCM           | yes | no     | no  | partly | BE      | no
AES-SIV           | yes | partly | yes | partly | BE      | yes
AES-GCM-SIV       | yes | partly | yes | partly | LE      | yes
MORUS             | no  | yes    | no  | block  | LE      | no
AEGIS             | no  | yes    | no  | block  | no (LE) | no
AES-OCB           | yes | yes    | no  | fully  | no (BE) | no
Algorithm         | SWP                         | HWP       | CA
AES-GCM           | very good (AES-NI + CLMUL)  | good      | large
ChaCha20-Poly1305 | moderate                    | good      | ?
AES-CCM           | very good (AES-NI)          | ?good     | ?large
AES-SIV           | moderate (AES-NI)           | ?good     | ?large
AES-GCM-SIV       | good (AES-NI + CLMUL)       | good      | ?large
MORUS             | very good (SIMD)            | excellent | large
AEGIS             | excellent (AES-NI)          | excellent | large
AES-OCB           | very good (AES-NI)          | good      | large
PS = provable security; ON = on-line input processing; 2P = two-pass processing; PL = parallelizability; ED = endian-dependency; NMR = nonce misuse resistance; SWP = software performance; HWP = hardware performance; CA = circuit area
Table 3.2: AEAD algorithms comparison
Compared to the existing AEAD algorithms, MORUS and AEGIS offer much better software and hardware performance, although neither is provably secure. In addition, the software performance of MORUS does not depend on AES-specific instructions, only on more common and simpler SIMD instructions (operating on vectors of four 32-bit or 64-bit words).
The advantages of AES-OCB are software and hardware performance comparable to AES-GCM, full parallelizability, endian-independence, and provable security.
4 Linux Kernel Crypto API
The Linux Kernel Crypto API (also referred to as Kernel Crypto API, Linux Crypto API, or just Crypto API) is a programming interface inside the Linux kernel, which provides a framework for two types of users. First, it enables in-kernel users (usually other kernel modules) to perform various types of cryptographic and non-cryptographic transformations of data. Second, it allows device drivers to provide hardware accelerated implementations of these transformations. [45]
Each algorithm must have at least one generic implementation (written in portable C code). Other implementations are usually architecture-specific assembly implementations or drivers for cryptographic accelerators.
The users can access cryptographic functionality using transformation objects (TFMs), which are instances of a transformation implementation. The user can request allocation of a TFM and receives a cipher handle, which is an opaque representation of the underlying TFM. After all required cipher operations are performed on the handle, the user is responsible for destroying the cipher handle. [45]
The Linux Kernel Crypto API supports the following types of algorithms:
∙ block cipher algorithms,
∙ symmetric key cipher algorithms (stream ciphers),
∙ AEAD algorithms,
∙ message digest algorithms (hash functions),
∙ asymmetric key cipher algorithms,
∙ key agreement protocol algorithms,
∙ random number generation algorithms.
Apart from the in-kernel interface (which can only be used by kernel modules), the Linux Crypto API also provides a user space interface, which allows user space processes to invoke Crypto API operations via socket system calls.
4.1 Architecture
4.1.1 Cipher and driver names
The algorithms provided by the Linux Crypto API are referenced using a cipher name – a character string identifying the algorithm. For example, the string aes identifies the AES block cipher and the string gcm(aes) identifies the AES-GCM AEAD algorithm.
The specific implementations of algorithms are distinguished using driver names. When the user requests a cipher handle, they may specify a driver name to force the use of a specific implementation.
Each algorithm implementation may also specify a set of flags, which describe properties of the implementation (e.g. that it supports asynchronous invocation of operations).
4.1.2 Templates
The algorithms can be either concrete “atomic” algorithms or instances of a template. A template is a scheme that takes other algorithms as inputs to form a new algorithm. For example, cbc is a template for the cipher block chaining mode of operation that accepts a block cipher algorithm (e.g. aes) to form a stream cipher (e.g. cbc(aes)). Note that the arguments of a template can also be other template instances.
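To illustrate the naming scheme, the following user-space C sketch splits a cipher name into the outermost template name and its argument string. The helper split_cipher_name is hypothetical and is not part of the Crypto API; it only models the syntax described above and does not check that nested parentheses are balanced.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel function): splits a cipher name such as
 * "cbc(aes)" into the outermost template name and its argument string.
 * Returns 1 for a template instance, 0 for an atomic algorithm name, and
 * -1 if the name is malformed or does not fit into the output buffers.
 */
static int split_cipher_name(const char *name,
                             char *tmpl, size_t tmpl_size,
                             char *args, size_t args_size)
{
    const char *open;
    size_t len, n;

    assert(name != NULL && tmpl != NULL && args != NULL);
    if (tmpl_size == 0 || args_size == 0)
        return -1;

    len = strlen(name);
    open = strchr(name, '(');
    if (open == NULL) {
        /* Atomic algorithm, e.g. "aes": there are no arguments. */
        if (len == 0 || len >= tmpl_size)
            return -1;
        memcpy(tmpl, name, len + 1);
        args[0] = '\0';
        return 0;
    }

    /* A template instance needs a non-empty name and a closing ')'. */
    if (open == name || name[len - 1] != ')')
        return -1;

    n = (size_t)(open - name);
    if (n >= tmpl_size)
        return -1;
    memcpy(tmpl, name, n);
    tmpl[n] = '\0';

    /* Everything between the first '(' and the final ')'. */
    n = len - n - 2;
    if (n == 0 || n >= args_size)
        return -1;
    memcpy(args, open + 1, n);
    args[n] = '\0';
    return 1;
}
```

For a nested name such as authenc(hmac(sha256),xts(aes)), the sketch yields the template authenc and the argument string hmac(sha256),xts(aes), which could be split further in the same manner.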
4.1.3 Synchronous and asynchronous operations
The Linux Crypto API allows the user to use both synchronous and asynchronous methods of invocation. When invoking the algorithm synchronously, the operation is performed by calling a function that blocks until the operation is finished. When invoking asynchronously, the function may return before the operation is finished – in this case a user-provided callback function is called when the operation is finished. If the algorithm implementation does not support asynchronous processing, then the operation is always performed synchronously.
4.1.4 Priorities
Each algorithm implementation must specify its priority number. When a cipher handle is requested, the implementation with the highest priority number (among all that implement the requested algorithm) is picked.
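The selection rule can be modeled in a few lines of user-space C. The sketch below is only an illustration under our own assumptions: struct impl, pick_impl, and the table of names and priority values are hypothetical and do not reproduce the kernel's internal data structures. A request by driver name selects exactly that implementation, while a request by cipher name selects the highest priority (ties are resolved in favor of the earlier entry, which is an arbitrary choice of this model).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified record of one registered implementation. */
struct impl {
    const char *cipher_name;   /* e.g. "aes" */
    const char *driver_name;   /* e.g. "aes-generic" */
    int priority;
};

/* Illustrative table; the names and numbers are example values only. */
static const struct impl example_impls[] = {
    { "aes", "aes-generic", 100 },
    { "aes", "aes-asm",     200 },
    { "aes", "aes-aesni",   300 },
};

/*
 * Returns the implementation to use for the requested name, or NULL if
 * nothing matches. The name may be a driver name or a cipher name.
 */
static const struct impl *pick_impl(const struct impl *impls, size_t count,
                                    const char *name)
{
    const struct impl *best = NULL;
    size_t i;

    assert(name != NULL);
    for (i = 0; i < count; i++) {
        /* An exact driver name match forces that implementation. */
        if (strcmp(impls[i].driver_name, name) == 0)
            return &impls[i];
        if (strcmp(impls[i].cipher_name, name) != 0)
            continue;
        if (best == NULL || impls[i].priority > best->priority)
            best = &impls[i];
    }
    return best;
}
```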
4.1.5 Input parameter sizes
The key sizes (for keyed algorithms) in Linux Crypto API are selected by the user when setting the key. The API does not provide a way to list the key sizes supported by a given algorithm. Some algorithm types, however, provide maximum and minimum key sizes.
The initialization vector (IV) sizes (for stream ciphers and AEAD algorithms) are fixed for the given algorithms and can be queried from a corresponding cipher handle.
The message and associated data sizes are specified when invoking the operations and can take values from the range of the unsigned int type (in the Linux kernel this is always 0 to 2^32 − 1).
The tag size (for AEAD algorithms) is selected by the user and must not exceed the maximum, which is fixed for a given algorithm and can be queried from the cipher handle. The algorithm may support only a subset of the tag sizes between 0 and the maximum size.
4.1.6 Scatter-gather lists
The input/output data (plaintext/ciphertext, message data or associated data) for cryptographic operations is passed to the implementations in the form of scatter-gather lists. A scatter-gather list (or SG list) is a special collection of memory page numbers, which can be either transformed to DMA¹ addresses or temporarily mapped to virtual memory.
Using SG lists to pass data to the Crypto API makes it easy for cryptographic accelerator drivers to pass the data to the accelerator. Software drivers can just map the memory regions to virtual memory and work with them via regular pointers.
An existing memory region can be converted into an SG list entry, although this cannot be done with all types of memory. For example, pointers to C global variables or memory allocated with the kvmalloc function cannot be converted to an SG list entry, but memory allocated using kmalloc (which is allocated as a contiguous physical memory region) can be.
1. DMA = Direct Memory Access; a technology that allows hardware devices to directly access regions of RAM.
Figure 4.1: AEAD input/output data layout (encryption takes the associated data followed by the plaintext and produces the associated data followed by the ciphertext and the tag; decryption works in the opposite direction)
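The way a software implementation works with an SG list (section 4.1.6) can be approximated in user space as walking a list of separate memory segments. The following C sketch is a simplified model, not kernel code: struct sg_entry and sg_gather are hypothetical names, and ordinary pointers stand in for the page references and mappings used by the kernel. It copies a logical byte range that may span several segments into one contiguous buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for one SG list entry. */
struct sg_entry {
    const unsigned char *addr;  /* start of the segment */
    size_t len;                 /* segment length in bytes */
};

/*
 * Copies up to `want` bytes, starting at logical offset `skip` within the
 * concatenation of all segments, into `out`. Returns the number of bytes
 * actually copied (less than `want` if the list is too short).
 */
static size_t sg_gather(const struct sg_entry *sg, size_t nents,
                        size_t skip, unsigned char *out, size_t want)
{
    size_t copied = 0;
    size_t i;

    assert(sg != NULL || nents == 0);
    assert(out != NULL || want == 0);

    for (i = 0; i < nents && copied < want; i++) {
        const unsigned char *p = sg[i].addr;
        size_t len = sg[i].len;

        /* Skip whole segments that lie before the requested offset. */
        if (skip >= len) {
            skip -= len;
            continue;
        }
        p += skip;
        len -= skip;
        skip = 0;

        if (len > want - copied)
            len = want - copied;
        memcpy(out + copied, p, len);
        copied += len;
    }
    return copied;
}
```

With the associated data and the message stored back to back, as in the AEAD layout, the message can be extracted by passing the associated data length as `skip`.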
4.2 AEAD interface
4.2.1 Input/output data layout
The AEAD interface of the Linux Crypto API uses a specific data layout for the input and output data. The SG list for encryption input (and decryption output) should contain (and have enough space for) the associated data, immediately followed by the plaintext. [46]
The SG list for encryption output (and decryption input) contains the associated data, immediately followed by the ciphertext, immediately followed by the authentication tag. [46]
The AEAD encrypt and decrypt operations do not copy the associated data from the input SG list to the output SG list (for simplicity of implementation). [46]
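The layout described above translates into simple length and offset arithmetic. The following C sketch computes the sizes that the source and destination buffers must have; struct aead_layout and both functions are hypothetical helpers written for illustration, and the ciphertext is assumed to have the same length as the plaintext.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the buffers of one AEAD operation. */
struct aead_layout {
    size_t src_len;      /* bytes the source SG list must provide */
    size_t dst_len;      /* bytes the destination SG list must hold */
    size_t text_offset;  /* where the plaintext/ciphertext starts */
    size_t tag_offset;   /* where the tag starts on the ciphertext side */
};

/* Encryption: AD || plaintext  ->  AD || ciphertext || tag */
static struct aead_layout aead_encrypt_layout(size_t assoclen, size_t ptlen,
                                              size_t taglen)
{
    struct aead_layout l;

    assert(assoclen <= SIZE_MAX - ptlen);
    assert(assoclen + ptlen <= SIZE_MAX - taglen);

    l.src_len = assoclen + ptlen;
    l.dst_len = assoclen + ptlen + taglen;
    l.text_offset = assoclen;
    l.tag_offset = assoclen + ptlen;
    return l;
}

/* Decryption: AD || ciphertext || tag  ->  AD || plaintext
 * (ctlen is the ciphertext length without the tag). */
static struct aead_layout aead_decrypt_layout(size_t assoclen, size_t ctlen,
                                              size_t taglen)
{
    struct aead_layout l;

    assert(assoclen <= SIZE_MAX - ctlen);
    assert(assoclen + ctlen <= SIZE_MAX - taglen);

    l.src_len = assoclen + ctlen + taglen;
    l.dst_len = assoclen + ctlen;
    l.text_offset = assoclen;
    l.tag_offset = assoclen + ctlen;
    return l;
}
```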
4.2.2 For users
The functions for accessing the AEAD functionality of the Crypto API are provided in the
After all requested operations on the cipher handle are finished, crypto_free_aead must be called on the handle to free the resources associated with it.
The functions crypto_aead_ivsize, crypto_aead_authsize, and crypto_aead_blocksize return the IV (nonce) size, the authentication tag size (which is initially the maximum supported by the algorithm, until a different size is set), and the block size of the algorithm provided by the cipher handle, respectively (all in bytes). The block size represents the size that the plaintext/ciphertext size must be aligned to (for algorithms like AES-CBC, where the message length must be a multiple of 16 bytes). [46]
Before the cipher handle can be used to perform encryption or decryption, the user must set the key. This is done by calling the crypto_aead_setkey function on it, and supplying a pointer to the key data and key length (in bytes). All subsequent encrypt/decrypt operations performed on the cipher handle will use the set key, until the crypto_aead_setkey function is called again with a different key. If the function is called with an invalid or unsupported key or key length, the -EINVAL error code is returned.
Another function that has to be called before encrypting or decrypting using the cipher handle is crypto_aead_setauthsize. This function sets the authentication tag size (in bytes). If the function is called with an unsupported tag size, the -EINVAL error code is returned. [46]
In order to perform an encrypt/decrypt operation on a cipher handle, the user needs to allocate one or more AEAD requests (represented by a pointer to struct aead_request) by calling aead_request_alloc with the cipher handle and kernel allocation flags (this is usually GFP_KERNEL). Each AEAD request handle can be used multiple times and must be eventually freed by a call to aead_request_free. [46]
Before performing an encryption or decryption, an AEAD request must be initialized by calling the following functions:
∙ aead_request_set_callback – sets the flags for the operation and specifies the callback function that should be called when an asynchronous operation is finished,
∙ aead_request_set_crypt – sets the source and destination SG list, the length of the message and a pointer to the nonce bytes,
∙ aead_request_set_ad – sets the length of the associated data.
When the AEAD request is initialized, crypto_aead_encrypt or crypto_aead_decrypt can be called with the request handle to perform AEAD encryption or decryption, respectively. [46]
If the return value of these functions is 0, the operation was performed synchronously and completed successfully. If the return value is -EINPROGRESS (the operation has started and is in progress) or -EBUSY (the operation has been queued for execution), then an asynchronous operation has started and the callback function of the request will be called when the operation finishes (with the appropriate error code). Any other return value means that an error has occurred (-EBADMSG is returned when the authentication tag verification fails). If the return value was -EBUSY, then the callback function is first called once with error code -EINPROGRESS when the operation has started, and then again when the operation has finished.
While one or more operations are being performed asynchronously (but none is queued), another operation may be started on the same cipher handle (but with a different request handle). If a call returns -EBUSY, then no further operations can be started on the cipher handle until the corresponding callback is called with the -EINPROGRESS error code.
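The return value conventions described above can be summarized as a small decision function. The following user-space C sketch is illustrative only: the enumeration and the function names are hypothetical, while the error codes are the standard errno values, which the kernel returns negated.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification of an encrypt/decrypt return value. */
enum aead_outcome {
    AEAD_DONE,           /* finished synchronously, success */
    AEAD_ASYNC_STARTED,  /* in progress, callback will report the result */
    AEAD_ASYNC_QUEUED,   /* queued, callback is called twice */
    AEAD_AUTH_FAILED,    /* tag verification failed */
    AEAD_ERROR           /* any other error */
};

static enum aead_outcome classify_aead_result(int ret)
{
    switch (ret) {
    case 0:
        return AEAD_DONE;
    case -EINPROGRESS:
        return AEAD_ASYNC_STARTED;
    case -EBUSY:
        return AEAD_ASYNC_QUEUED;
    case -EBADMSG:
        return AEAD_AUTH_FAILED;
    default:
        return AEAD_ERROR;
    }
}

/* Nonzero if the caller must wait for the completion callback. */
static int callback_pending(int ret)
{
    return ret == -EINPROGRESS || ret == -EBUSY;
}

/* Nonzero if another request may be submitted on the same handle right
 * away; after -EBUSY the caller has to wait for the -EINPROGRESS callback. */
static int may_submit_more(int ret)
{
    return ret != -EBUSY;
}
```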
4.2.3 For implementations
A Linux kernel module that implements an AEAD algorithm only needs to do two things (usually in the module’s initialization function):
1. Allocate a variable of type struct aead_alg (the algorithm definition) and set all its required fields. These include the algorithm and driver name, priority, numeric characteristics of the algorithm (IV size, tag size, etc.), and pointers to functions that perform operations on a TFM object of the implementation.
2. Register the algorithm with the Crypto API. This is done by calling the crypto_register_aead function with a pointer to the algorithm definition. The function is declared in the header file
When the module can no longer provide the implementation (for example when the module is being unloaded), it should call the crypto_unregister_aead function with a pointer to the previously registered algorithm definition.
For generic AEAD templates (such as generic AEAD mode implementations like GCM), the process is more complex. This time the module registers a template (struct crypto_template) using the crypto_register_template function. In this template, the module provides the template name and a function that parses the template arguments, obtains the algorithm definition for each argument, allocates an AEAD definition for the resulting algorithm, and registers it.
The Crypto API and other internal kernel libraries provide additional helper functions and structures that can be used when developing cryptographic transformation implementations, for example to traverse the input/output SG lists.
5 Software optimization of cryptographic algorithms
Most cryptographic algorithms are designed so that they can be implemented easily using data types and operations that are available in C-like programming languages (32-bit and 64-bit integers, bit-level and arithmetic operations, constant table lookups, etc.). Nonetheless, an implementation written in portable and standards-compliant C code is often several orders of magnitude slower than a dedicated hardware circuit implementation.
Thus, CPU manufacturers often add special instructions to their instruction sets, which allow small, frequently used tasks to be computed efficiently (such as parallel vector operations, an AES round, counting the bits set in an integer, etc.). Sometimes compilers are able to automatically identify places in the code where such instructions can be used, but most often the programmer needs to write pieces of code directly in assembly language or use non-standard compiler intrinsics¹ in order to ensure that optimal machine code is generated.
1. A compiler intrinsic is a special function defined by the compiler that is automatically converted to the special instruction it represents.
The operations that are the most frequent targets for optimizations in cryptographic algorithms are:
∙ AES building blocks – Since AES is the most used block cipher and thus has great business importance, CPU manufacturers began including AES-related instructions in their instruction sets. These instructions often provide an order of magnitude speed-up compared to a table-lookup-based implementation of AES. An additional advantage of these instructions is that they are time-invariant (as opposed to table lookup implementations), which means implementations that use them are resistant to timing side-channel attacks (which have proven to be very practical in some real-world scenarios, see [7]).
∙ Helper instructions for SHA-1 and SHA-256 – Similar to AES, SHA-1 and SHA-256 are widely used and standardized algorithms (cryptographic hash functions) and some CPUs provide helper instructions that accelerate these algorithms as well.
∙ Large-block and SIMD operations – Many cryptographic algorithms (including AES) operate on 128- or 256-bit blocks. Modern 64-bit CPUs also provide wide registers (128, 256, or even 512 bits in size) along with instructions that perform efficient wide block loads/stores and various operations on them. These instructions can sometimes be used to accelerate existing cryptographic algorithms.
Some more recent algorithms (for example MORUS, described in section 3.3) also specifically target SIMD instructions, which also operate on wide blocks, but interpret them as vectors of independent 32-bit or 64-bit values. These instructions support a similar range of operations as common 32-bit and 64-bit instructions (arithmetic operations, bit shifts, bit rotations, comparisons, etc.).
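As an illustration of the SIMD style of computation mentioned above, the following C sketch rotates four independent 32-bit words at once using SSE2 compiler intrinsics, and falls back to a portable scalar loop when SSE2 is not available at compile time. It is a minimal example written for this text, not an excerpt from any of the discussed implementations; the function names are our own. Since SSE2 has no vector rotate instruction, the rotation is composed of two shifts and an OR.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Portable rotation of one 32-bit word to the left by r bits. */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31;
    return r ? (x << r) | (x >> (32 - r)) : x;
}

/* Scalar reference: rotates each of the four words separately. */
static void rotl32x4_scalar(uint32_t v[4], unsigned r)
{
    int i;

    for (i = 0; i < 4; i++)
        v[i] = rotl32(v[i], r);
}

/* Rotates all four words with a single sequence of vector instructions. */
static void rotl32x4(uint32_t v[4], unsigned r)
{
#if defined(__SSE2__)
    __m128i x, left, right;

    r &= 31;
    x = _mm_loadu_si128((const __m128i *)v);
    /* Vector shifts by 32 or more produce zero, so r == 0 is safe. */
    left = _mm_sll_epi32(x, _mm_cvtsi32_si128((int)r));
    right = _mm_srl_epi32(x, _mm_cvtsi32_si128((int)(32 - r)));
    _mm_storeu_si128((__m128i *)v, _mm_or_si128(left, right));
#else
    rotl32x4_scalar(v, r);
#endif
}
```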
5.1 Intel/AMD (x86 architecture)
The x86 architecture (implemented by Intel and AMD processors) allows CPUs to support optional extensions, which provide additional instructions. The support for extensions can be detected by applications via the special CPUID instruction. Some extensions also provide additional registers, so the operating system must be able to recognize the extension and properly save and restore the register values when switching between processes.
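A user-space program can query some of these feature flags as sketched below. The example assumes a GCC- or Clang-compatible compiler, which provides the cpuid.h header with the __get_cpuid helper; the structure and function names are our own. It reads only CPUID leaf 1 and therefore covers SSE2, CLMUL, and AES-NI. It does not check operating system support for saving wider registers, which would additionally be needed before using AVX or AVX2.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical summary of the feature flags relevant to this chapter. */
struct x86_crypto_features {
    int sse2;    /* CPUID leaf 1, EDX bit 26 */
    int pclmul;  /* CPUID leaf 1, ECX bit 1 (carry-less multiplication) */
    int aesni;   /* CPUID leaf 1, ECX bit 25 */
};

static struct x86_crypto_features detect_x86_crypto_features(void)
{
    struct x86_crypto_features f = { 0, 0, 0 };
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.sse2 = (int)((edx >> 26) & 1u);
        f.pclmul = (int)((ecx >> 1) & 1u);
        f.aesni = (int)((ecx >> 25) & 1u);
    }
#endif
    /* On other architectures all flags stay zero. */
    return f;
}
```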
5.1.1 SSE, AVX
SSE (short for Streaming SIMD Extensions) is an x86 extension that provides 70 additional instructions for operating on 128-bit values (mostly interpreted as vectors of 32-bit floating point numbers). SSE also provides 16 new 128-bit registers (only 8 on 32-bit architectures), denoted xmm0 through xmm15. The first processor that supported SSE was Intel Pentium III (1999) based on the P6 microarchitecture.
Even though SSE itself contains only a few instructions applicable for cryptography, its successors extend it with more useful instructions – most notably SSE2, SSSE3 (Supplemental Streaming SIMD Extensions 3), XOP (eXtended Operations), AVX (Advanced Vector eXtensions), AVX2, and AVX-512.
SSE2 (since 2001) added (among other things) a basic set of instructions operating on vectors of 32-bit integers (as opposed to just floating point numbers in SSE), which makes it usable for optimizing some cryptographic algorithms, such as the Serpent block cipher, or the Poly1305 MAC.
SSSE3 (since 2006) extended SSE2 with even more integer vector instructions, enabling optimized implementations for many algorithms, such as SHA-1, SHA-2, or ChaCha20.
XOP (2009-2017) is a now-deprecated extension developed and implemented by AMD, which provides extra integer vector instructions, such as bit rotation and permutation of vector elements.
AVX (since 2008) and AVX2 (since 2013) added support for 256-bit registers, providing similar operations as SSE, SSE2, and SSSE3 (but on 256-bit registers). They can be used to write even faster implementations of 128-bit algorithms, since sometimes two blocks can be processed simultaneously.
AVX-512 (since 2015) is a new set of independent extensions that adds support for 512-bit vectors.
5.1.2 AES-NI
The AES instruction set (or AES New Instructions; AES-NI) is an x86 extension that provides accelerated instructions for encrypting and decrypting using the Advanced Encryption Standard block cipher. The first processor microarchitecture that supported this extension was Intel’s Westmere (2010). AMD started supporting this extension in the Bulldozer microarchitecture (2011). [55] The AES-NI instruction set depends on SSE and provides six new instructions that compute several complex parts of the AES algorithm.
5.1.3 SHA extensions
The SHA instruction set is an x86 extension that provides accelerated instructions for hash functions SHA-1 and SHA-256 (from Secure Hash Algorithm family). The first processor microarchitecture to support this extension was Intel’s Goldmont (2016). AMD introduced support for the SHA extension in its Ryzen brand of processors implementing the Zen microarchitecture (2017). [31]
The SHA instruction set depends on SSE and provides seven new instructions that compute several complex parts of the SHA-1 and SHA-256 algorithms.
5.1.4 CLMUL
The CLMUL (or Carry-less Multiplication) instruction set is an x86 extension that adds 5 instructions that compute 64-bit carry-less multiplication on values stored in 128-bit registers. It can be used for optimizing GHASH (the MAC used in GCM) as well as common non-cryptographic algorithms, such as CRC² or LZ77 compression.
It is available in Intel processors on the Westmere microarchitecture and above (since 2010) and on AMD starting from the Bulldozer microarchitecture (since 2011). [28, 63]
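To make the operation concrete, the following portable C sketch computes the same kind of 64-bit carry-less product in software, i.e. a multiplication of two polynomials over GF(2), where partial products are combined with XOR instead of addition. It is a slow reference model written for illustration (struct u128 and clmul64 are our own names), not a replacement for the hardware instruction. The selection mask avoids a data-dependent branch, but the sketch makes no claim about constant-time behavior on any particular compiler or CPU.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit result stored as two 64-bit halves. */
struct u128 {
    uint64_t lo;
    uint64_t hi;
};

/* Carry-less (polynomial) multiplication of two 64-bit operands. */
static struct u128 clmul64(uint64_t a, uint64_t b)
{
    struct u128 r = { 0, 0 };
    unsigned i;

    for (i = 0; i < 64; i++) {
        /* All ones if bit i of b is set, all zeros otherwise. */
        uint64_t mask = (uint64_t)0 - ((b >> i) & 1u);

        /* XOR in the operand a shifted left by i bits. */
        r.lo ^= (a << i) & mask;
        if (i != 0)
            r.hi ^= (a >> (64 - i)) & mask;
    }
    return r;
}
```

For example, multiplying 3 by 3 gives 5 rather than 9, because (x + 1)(x + 1) = x^2 + 1 when coefficients are reduced modulo 2.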
5.2 ARM
ARM is a popular reduced instruction set (RISC) architecture used mainly in mobile devices and embedded systems. It allows building more lightweight and power-efficient processors than common complex instruction set architectures, such as x86.
ARM provides 128-bit SIMD extensions similar to x86’s SSE extensions. Some basic SIMD support was introduced in the ARMv6 architecture (manufactured since 2002) and a more advanced SIMD extension (called NEON) was introduced in ARMv7-A (available in processors since 2009). [3]
ARMv8-A also provides an optional extension containing instructions equivalent to x86’s AES-NI, SHA, and CLMUL. [2]
2. CRC = cyclic redundancy check, an error-detecting code.
6 Implementation of selected CAESAR candidates for the Linux Crypto API
The main contribution of this thesis is an implementation of three selected CAESAR submissions (MORUS, AEGIS, and OCB) as external modules for the Linux kernel. We have written both generic implementations and optimized implementations utilizing x86 extensions (for OCB, we only optimized the AES-OCB instance). The implementations have been tested against standard (where available) and generated test vectors.
The source code of the modules is written in C and x86 assembly. Auxiliary scripts are written in the scripting language of the Bash shell.
6.1 Contents of the attached source code
The attached source code package consists of two main parts – a set of Linux kernel modules (that provide implementations of the AEAD algorithms and a performance benchmarking facility) and auxiliary scripts that can be used to run performance measurements.
The provided kernel modules are built out-of-tree. This means that they are not compiled as part of the Linux kernel source code, but instead their source code resides in a separate directory. To build an out-of-tree module for a given kernel version, one needs to have a directory containing the build files for that kernel version (i.e. build scripts and header files). Such a directory is usually provided by Linux distributions as an optional package (e.g. on Ubuntu-based distributions it is provided in linux-headers-generic).
The modules can be built for Linux version 4.10 (released on 19 February 2017) and above. Earlier versions are not supported because they lack an important part of the internal AEAD API.
The provided modules are built by running make in their source code directory. By default, the modules are built for the currently running kernel (the path to kernel build files is auto-guessed). An alternative path can be specified as follows:
make KERNEL_BUILD=/path/to/kernel/build
A successful build process produces a set of files with the “.ko” extension, which can be manually inserted at runtime by running the insmod program (and removed by running rmmod). A script that performs inserting/removing of the modules is provided in the source directory of each module.
The attached source code contains four sets of kernel modules (listed by subdirectory names):
∙ linux-crypto-bench – Contains the source code of the crypto_bench module, which benchmarks the execution time of various cryptographic operations of various algorithm types (stream cipher, AEAD, keyed hash function – MAC). The results are printed to the kernel console, whose contents can be retrieved in userspace using the dmesg utility.
∙ linux-crypto-morus – Contains the source code of the following modules:
  – morus640, morus1280 – provide the generic implementation of MORUS-640 and MORUS-1280 (respectively),
  – morus640-glue, morus1280-glue – common glue code for optimized x86 implementations of MORUS,
  – morus640-sse2 – provides an optimized implementation of MORUS-640 that uses the SSE2 extensions,
  – morus1280-sse2 – provides an optimized implementation of MORUS-1280 that uses the SSE2 extensions,
  – morus1280-avx2 – provides an optimized implementation of MORUS-1280 that uses the AVX2 extensions,
  – morus_test – tests MORUS algorithms against test vectors.
∙ linux-crypto-aegis – Contains the source code of the following modules:
  – aesenc – generic implementation of the AES round,
  – aegis128, aegis128l, and aegis256 – provide generic implementations of AEGIS-128, AEGIS-128L, and AEGIS-256 (respectively),
  – aegis128-aesni, aegis128l-aesni, and aegis256-aesni – provide optimized AES-NI+SSE2 implementations of AEGIS-128, AEGIS-128L, and AEGIS-256 (respectively),
  – aegis_test – tests AEGIS algorithms against test vectors.
∙ linux-crypto-ocb – Contains the source code of the following modules:
  – ocb – provides a generic template for the OCB mode,
  – ocb-aesni – provides an optimized AES-NI+SSE2 implementation of AES-OCB,
  – ocb_test – tests AES-OCB against test vectors.
The algorithm implementations provided by the modules are registered with the Crypto API when the respective module is loaded. When a module is unloaded, the implementations it provides are unregistered.
The performance measurement scripts, along with the source code of the modules and measured performance data, are available at https://gitlab.com/omos/masters-thesis-code as a Git repository. The module sources are located in separate repositories that are linked in the master repository as Git submodules.
6.1.1 Implementation limitations
Our implementation of OCB currently has a minor limitation: since the Linux Crypto API only supports a fixed nonce size, our implementation only allows 12-byte (96-bit) nonces. We chose this value since all standard test vectors have a 12-byte nonce, even though a 15-byte nonce would be less sensitive to nonce collisions.
It would be possible to support all nonce sizes by accepting a 16-byte nonce and parsing the nonce length from the first byte (a similar method is used by the kernel’s implementation of the CCM mode). This would, however, require all users to format the nonces in a special way, which is not desirable.
6.1.2 Merging into the upstream Linux repository
Currently, the AEAD algorithm implementations are in the form of externally built loadable kernel modules. When the CAESAR competition finishes, and if the final portfolio contains one of the implemented algorithms, it may be desirable to merge our implementations into the official Linux kernel repository. This would require minor changes and putting the code into appropriate files, but most of the code could be adopted verbatim.
6.2 Performance measurements
In order to compare the performance of our AEAD algorithm implementations to the existing implementations already present in the Linux kernel, we executed a set of performance measurements. We performed two kinds of measurements:
1. Testing the encryption/decryption speed directly – we measured the time to complete an encryption/decryption request inside the kernel using our crypto_bench module.
2. Testing sequential I/O throughput with full-disk AEAD encryption using Dm-crypt¹ – we measured the read/write speed of a Dm-crypt AEAD mapping over a ramdisk (a block device that uses the computer’s RAM for storage). We used a ramdisk because it allows the differences between algorithms to better stand out. When used with slower block devices (hard disk drive, solid-state drive), the influence of the algorithm selection is usually small or negligible (unless the cryptographic operations are very slow).
The measurements have been performed on a laptop computer with the Intel Core i5-6200U processor (dual-core) and 8 GB of RAM, running Linux version 4.14.4. The processor supports the SSE2, AVX2, AES-NI, and CLMUL extensions. All measurements discussed in this text have been performed directly on the running system (not on a virtual machine).
1. Dm-crypt is a Linux kernel functionality that provides software full-disk encryption and uses the Linux Crypto API for cryptographic operations. Since Linux 4.12 it also supports authenticated encryption in compliance with IEEE 1619.1 ([34]).
Figure 6.1: Encryption and decryption speed of AEAD algorithms via Crypto API
(Plot panels: encryption and decryption, each without associated data, “No AD”, and with associated data, “With AD”. Horizontal axis: message size of 512, 4096, 16384, and 32768 bytes. Vertical axis: speed in MB/s. Plotted algorithms: aegis128-128, aegis128l-128, aegis256-256, gcm(aes)-128, gcm(aes)-256, morus1280-256, morus640-128, ocb(aes)-128, ocb(aes)-256, rfc7539(chacha20,poly1305)-256.)
6.2.1 Direct speed comparison
Figure 6.1 shows a graph comparing the encryption and decryption speeds of our algorithm implementations (AEGIS, MORUS and AES-OCB) as well as some AEAD algorithms already implemented in the kernel. The left half of the graphs displays the encryption/decryption speed without associated data and the right half displays the performance with associated data of the same size as message size (the speeds refer to the processing of the whole data – the associated data plus the message data). The algorithm labels consist of the Crypto API algorithm name and the key size in bits, separated by a dash (-).
For AEGIS and AES-OCB, the SSE2+AES-NI implementations were used, for MORUS-640 the SSE2 implementation, and for MORUS-1280 the AVX2 implementation. For other algorithms the optimal implementation was selected automatically by the Crypto API based on supported architecture extensions.
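The way a speed value relates to a measured time can be sketched as follows. The helper below is a hypothetical illustration, not the code of the crypto_bench module; it assumes that 1 MB equals 10^6 bytes and that the elapsed time is given in nanoseconds. In line with the description above, the associated data is counted together with the message data.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: converts one timed AEAD operation into a speed in
 * MB/s (assuming 1 MB = 10^6 bytes). Both the message and the associated
 * data count as processed bytes.
 */
static double speed_mb_per_s(uint64_t msg_bytes, uint64_t ad_bytes,
                             uint64_t elapsed_ns)
{
    double bytes;

    assert(elapsed_ns > 0);
    bytes = (double)msg_bytes + (double)ad_bytes;
    /* bytes / 10^6 divided by ns / 10^9 equals bytes * 1000 / ns. */
    return bytes * 1000.0 / (double)elapsed_ns;
}
```

For instance, processing a 4096-byte message with 4096 bytes of associated data in 1000 ns corresponds to 8192 MB/s under these assumptions.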
The results show that with AES-NI acceleration, the AEGIS variants are able to achieve the best performance. On long messages, AEGIS-128L is about 2.5 times faster than AES-GCM with 128-bit key. AEGIS-128 and AEGIS-256 are slower, but both still about 1.5 times faster than AES-GCM with 128-bit key. On short messages (512 bytes), all AEGIS variants achieve about the same performance as AES-GCM, although they process the associated data faster.
With the AVX2 extension, MORUS-1280 achieves about the same speed as AES-GCM (128-bit key) for long messages. MORUS-640 (which only uses the SSE2 extension) achieves about 0.6 times that speed, i.e. it is slightly slower than AES-GCM with 256-bit key. On short messages, MORUS variants are about 50% slower than AES-GCM or AEGIS, most likely due to their expensive initialization and finalization. We also separately measured the performance of the SSE2-only implementation of MORUS-1280 and its performance turned out to be just slightly lower than the performance of MORUS-640.
The AES-OCB algorithm has performance very similar to AES-GCM on long messages, although with 256-bit key it is slightly faster (than AES-GCM with 256-bit key). On shorter messages AES-OCB is slightly slower than AES-GCM with equal key size.
Note that for an unknown reason, the AES-GCM implementation exhibits a performance degradation when processing large amounts of associated data.
For reference, we also measured the speed of ChaCha20-Poly1305, but its speed was very low compared to the other algorithms.
6.2.2 Comparison of Dm-crypt performance
Figure 6.2 compares the speeds of sequential reading from and writing to a virtual device created as a Dm-crypt mapping over a ramdisk (provided by the Linux kernel via block devices with filename of the form: /dev/ram
Figure 6.2: Read and write speed of a ramdisk encrypted with Dm-crypt using different AEAD algorithms
(Plot panels: read and write, each without and with journal, “No journal” and “With journal”. Horizontal axis: sector size of 512 and 4096 bytes. Vertical axis: speed in MB/s. Plotted algorithms: aegis128l-128, aegis256-256, authenc(cmac(aes),xts(aes))-384, authenc(cmac(aes),xts(aes))-640, authenc(hmac(sha256),xts(aes))-512, authenc(hmac(sha256),xts(aes))-768, gcm(aes)-128, gcm(aes)-256, morus1280-256, morus640-128, ocb(aes)-128, ocb(aes)-256, rfc7539(chacha20,poly1305)-256.)
Since Dm-crypt supports sector sizes larger than the usual 512 bytes, we also measured the I/O speeds with a sector size of 4096 bytes. The Dm-crypt sector size is independent of the underlying device’s physical sector size and represents the unit of data that is processed by a single AEAD encrypt/decrypt operation.
When the physical sector size of the underlying device is smaller than the Dm-crypt sector size, the user needs to enable Dm-crypt’s journal functionality if they require atomicity of sector writes in case the writing process is abruptly interrupted (e.g. by a power failure). Thus, we measured the speed both with the journal enabled and disabled.
Because of the overhead of the kernel’s I/O subsystem, the differences in algorithm throughput do not always propagate to differences in I/O throughput. Specifically, when the overhead of cryptographic operations falls well below the I/O overhead, further reduction in encryption overhead has only a negligible effect on the I/O performance. If a regular storage device (hard disk or a solid-state drive) were used instead of a ramdisk, the effect of encryption speed on the I/O performance would be even smaller (or completely negligible).
The performance measurements show that the AEGIS variants again performed better than the rest, but only very slightly, and there is almost no difference between AEGIS-128L and the other two. MORUS, AES-GCM, and AES-OCB all achieved around the same speeds as AEGIS (under equivalent conditions). ChaCha20-Poly1305 had a comparable performance when writing and when reading with 4096-byte sectors, but was significantly slower when reading with 512-byte sectors. The generic composition of AES-XTS and CMAC-AES was even slower and the generic composition of AES-XTS and HMAC-SHA-256 only achieved about 230-300 MB/s when reading and 160-270 MB/s when writing.
The measurements also demonstrated that switching to a 4096-byte sector encryption can have a noticeable effect on the performance of full-disk encryption, since the algorithm implementations are more efficient when processing more data at once.
In our measurements, the journal had no noticeable impact on reading performance and only a small impact on writing performance. However, on devices with slow random write access or slow overall write speed the impact is likely to be higher.
6.2.3 Summary of results
The results demonstrate that dedicated AEAD algorithms can provide very good performance for disk encryption (at least on modern x86 platforms). They also show that the selected CAESAR candidates can provide performance comparable to or better than AES-GCM, while providing sufficient nonce size to avoid collisions when used with Dm-crypt (120-128 bits, with AEGIS-256 having even a 256-bit nonce while achieving the same performance as AEGIS-128).
7 Conclusion
In this thesis, we have described the principles of authenticated encryption and some of the most commonly used AEAD algorithms and modes of operation. We also described three selected candidates from the CAESAR competition (MORUS, AEGIS, and OCB) and compared their basic properties with existing algorithms.

We programmed Linux kernel modules that add implementations of the three selected CAESAR candidates to the Linux kernel Crypto API. We wrote both generic C implementations and optimized implementations in x86 assembly.

Finally, we measured the performance of the optimized implementations, both directly and when used for transparent authenticated disk encryption using Dm-crypt. The performance of the MORUS family and AES-OCB was comparable to the existing x86 implementation of AES-GCM. The performance of the AEGIS algorithms was even better, with AEGIS-128L being more than twice as fast as AES-GCM.
7.1 Contribution
The main contribution of this thesis is the set of CAESAR candidate implementations. If one or more of the selected candidates become part of the final portfolio of winners of the CAESAR competition, the implementations could be included in the upstream Linux kernel code with little additional effort.

During our work on the implementations, we also discovered a minor bug (present since Linux 4.10) in AEAD-related internal helper code in the Linux Crypto API. The bug caused performance degradation under certain specific conditions and was triggered, for example, when an algorithm implementation that used said helper code was invoked via Dm-crypt.

We submitted a fix for the bug to the Linux cryptographic subsystem maintainer and it was accepted for the 4.15 kernel release.¹ It was also backported to version 4.14 (which is the only past affected version that is still officially supported) as part of the 4.14.4 release.
1. https://github.com/torvalds/linux/commit/c14ca8386539
7.2 Future work
There are a number of possibilities to extend the work from this thesis.

First, as already mentioned, the CAESAR candidate implementations could be merged into the Linux kernel codebase.

Second, due to the rising importance of mobile devices (and the dominant use of Linux as the OS on these devices), optimized implementations for the ARM architecture could be useful.

Finally, adding implementations for other CAESAR candidates would provide a more comprehensive performance comparison.
60 Bibliography
[1] AE Zoo contributors. Authenticated encryption Zoo. 2017. url: https://aezoo.compute.dtu.dk/doku.php (visited on Oct. 30, 2017). [2] ARM Engineers. ARM Cortex-A53 MPCore Processor. Technical Reference Manual. 2014. url: http : / / infocenter . arm . com / help/index.jsp?topic=/com.arm.doc.ddi0500e/CJHDEBAF. html (visited on Oct. 9, 2017). [3] ARM Engineers. Cortex-A9 NEON Media Processing Engine Tech- nical Reference Manual. 2010. url: http://infocenter.arm.com/ help/index.jsp?topic=/com.arm.doc.ddi0409f/Chdceejc. html (visited on Nov. 26, 2017). [4] T. Ashur, O. Dunkelman, and A. Luykx. Boosting Authenticated Encryption Robustness With Minimal Modifications. Cryptology ePrint Archive 2017/239. 2017. url: https : / / eprint . iacr . org/2017/239. [5] M. Bellare, P. Rogaway, and D. Wagner. EAX: A Conven- tional Authenticated-Encryption Mode. Cryptology ePrint Archive 2003/069. 2003. url: https://eprint.iacr.org/2003/069. [6] M. Bellare and C. Namprempre. “Authenticated Encryption: Relations among Notions and Analysis of the Generic Composi- tion Paradigm”. In: Advances in Cryptology — ASIACRYPT 2000: 6th International Conference on the Theory and Application of Cryp- tology and Information Security Kyoto, Japan, December 3–7, 2000. Proceedings. Ed. by T. Okamoto. Springer Berlin Heidelberg, 2000, pp. 531–545. isbn: 978-3-540-44448-0. doi: 10.1007/3-540-44448- 3_41. url: https://doi.org/10.1007/3-540-44448-3_41. [7] D. J. Bernstein. Cache-timing attacks on AES. Tech. rep. 2005. url: https://cr.yp.to/antiforgery/cachetiming-20050414.pdf. [8] D. J. Bernstein. ChaCha, a variant of Salsa20. Tech. rep. 2008. url: https://cr.yp.to/chacha/chacha-20080128.pdf. [9] D. J. Bernstein. Cryptographic Competitions. CAESAR call for sub- missions, final (2014.01.27). 2014. url: https://competitions. cr.yp.to/caesar-call.html (visited on Oct. 25, 2017).
61 BIBLIOGRAPHY
[10] D. J. Bernstein. Cryptographic Competitions. CAESAR frequently asked questions. 2014. url: https://competitions.cr.yp.to/ faq.html (visited on Oct. 25, 2017). [11] D. J. Bernstein. Cryptographic Competitions. CAESAR: Competition for Authenticated Encryption: Security, Applicability, and Robust- ness. 2014. url: https://competitions.cr.yp.to/caesar.html (visited on Oct. 25, 2017). [12] D. J. Bernstein. Cryptographic Competitions. CAESAR committee. 2015. url: https : / / competitions . cr . yp . to / caesar - committee.html (visited on Oct. 25, 2017). [13] D. J. Bernstein. Cryptographic Competitions. CAESAR submissions. 2017. url: https : / / competitions . cr . yp . to / caesar - submissions.html (visited on Oct. 25, 2017). [14] D. J. Bernstein. Cryptographic competitions. CAESAR use cases. July 2016. url: https://groups.google.com/d/topic/crypto- competitions / DLv193SPSDc / discussion (visited on Oct. 25, 2017). [15] D. J. Bernstein. Cryptographic competitions. end of CAESAR round 2. Aug. 2016. url: https : / / groups . google . com / d / topic / crypto- competitions/LBwnd- pzxBk/discussion (visited on Oct. 25, 2017). [16] D. J. Bernstein. Cryptographic competitions. CAESAR inputs. June 2017. url: https://groups.google.com/d/topic/crypto- competitions / PnOp4FS8YQI / discussion (visited on Oct. 25, 2017). [17] D. J. Bernstein. Measurements of CAESAR candidates, indexed by machine. Oct. 2017. url: https://bench.cr.yp.to/results- caesar.html (visited on Oct. 30, 2017). [18] D. J. Bernstein. The Poly1305-AES message-authentication code. Tech. rep. 2005. url: https : / / cr . yp . to / mac / poly1305 - 20050329.pdf. [19] D. J. Bernstein. The Salsa20 family of stream ciphers. Tech. rep. 2007. url: https://cr.yp.to/snuffle/salsafamily-20071225.pdf. [20] H. Böck et al. Nonce-Disrespecting Adversaries: Practical Forgery Attacks on GCM in TLS. Cryptology ePrint Archive 2016/475. 2016. url: https://eprint.iacr.org/2016/475.
62 BIBLIOGRAPHY
[21] E. Bursztein. Google Security Blog. Speeding up and strengthen- ing HTTPS connections for Chrome on Android. 2014. url: https: //security.googleblog.com/2014/04/speeding- up- and- strengthening-https.html (visited on Oct. 12, 2017). [22] M. Dworkin. Block Cipher Techniques. Guidelines for Submission of Modes of Operation. 2017. url: https : / / csrc . nist . gov / Projects/Block-Cipher-Techniques/BCM/Guidelines-for- Submitting-Modes (visited on Oct. 9, 2017). [23] M. Dworkin. Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC. NIST Special Publication 800-38D. Nov. 2007. doi: 10.6028/NIST.SP.800-38D. url: https: //doi.org/10.6028/NIST.SP.800-38D. [24] M. Dworkin. Recommendation for Block Cipher Modes of Operation: Methods and Techniques. NIST Special Publication 800-38A. Dec. 2001. doi: 10.6028/NIST.SP.800-38A. url: https://doi.org/ 10.6028/NIST.SP.800-38A. [25] M. Dworkin. Recommendation for Block Cipher Modes of Operation: the CCM Mode for Authentication and Confidentiality. NIST Special Publication 800-38C. May 2004. doi: 10.6028/NIST.SP.800-38C. url: https://doi.org/10.6028/NIST.SP.800-38C. [26] M. Dworkin. Recommendation for Block Cipher Modes of Operation: the CMAC Mode for Authentication. NIST Special Publication 800- 38B. May 2005. doi: 10.6028/NIST.SP.800- 38B. url: https: //doi.org/10.6028/NIST.SP.800-38B. [27] O. Goldreich. Foundations of Cryptography: Volume 2, Basic Appli- cations. New York, NY, USA: Cambridge University Press, 2004. isbn: 0-521-83084-2. [28] S. Gueron. Intel Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode. 2011. url: https : / / software . intel . com / en - us / articles / intel - carry - less - multiplication - instruction - and - its - usage - for - computing-the-gcm-mode/ (visited on Oct. 9, 2017). [29] S. Gueron, A. Langley, and Y. Lindell. AES-GCM-SIV: Nonce Misuse-Resistant Authenticated Encryption. Internet Draft. July 2017. 
url: https://tools.ietf.org/html/draft-irtf-cfrg- gcmsiv-06.
63 BIBLIOGRAPHY
[30] S. Gueron, A. Langley, and Y. Lindell. AES-GCM-SIV: Specifica- tion and Analysis. Cryptology ePrint Archive 2017/168. 2017. url: https://eprint.iacr.org/2017/168. [31] S. Gulley, V. Gopal, K. Yap, et al. New Instructions Supporting the Secure Hash Algorithm on Intel Architecture Processors. 2013. url: https://software.intel.com/en-us/articles/intel-sha- extensions (visited on Nov. 24, 2017). [32] D. Harkins. Synthetic Initialization Vector (SIV) Authenticated En- cryption Using the Advanced Encryption Standard (AES). RFC 5297. Oct. 2008. url: https://tools.ietf.org/html/rfc5297. [33] E. Homsirikamol et al. Benchmarking of Round 3 CAESAR Candi- dates in Hardware: Methodology, Designs & Results. Presentation. Aug. 2017. url: https : / / cryptography . gmu . edu / athena / presentations/CAESAR_R3_HW_Benchmarking_v1.1.pdf (vis- ited on Oct. 7, 2017). [34] IEEE employees. IEEE Standard for Authenticated Encryption with Length Expansion for Storage Devices. IEEE Standard 1619.1. May 2008. doi: 10.1109/IEEESTD.2008.4523925. [35] K. Igoe and J. Solinas. AES Galois Counter Mode for the Secure Shell Transport Layer Protocol. RFC 5647. Aug. 2009. url: https: //tools.ietf.org/html/rfc5647. [36] ISO employees. ISO/IEC 19772:2009. Information technology – Se- curity techniques – Authenticated encryption. Standard. Geneva, Switzerland, 2009. url: https://www.iso.org/standard/46345. html. [37] J. Jonsson. “On the Security of CTR + CBC-MAC”. In: Selected Ar- eas in Cryptography: 9th Annual International Workshop, SAC 2002 St. John’s, Newfoundland, Canada, August 15–16, 2002 Revised Pa- pers. Ed. by K. Nyberg and H. Heys. Springer Berlin Heidelberg, 2003, pp. 76–93. isbn: 978-3-540-36492-4. doi: 10.1007/3-540- 36492-7_7. url: https://doi.org/10.1007/3-540-36492-7_7. [38] A. Joux. Authentication Failures in NIST version of GCM. Tech. rep. 2006. url: https://csrc.nist.gov/csrc/media/projects/ block-cipher-techniques/documents/bcm/joux_comments. pdf. [39] C. S. Jutla. 
Encryption Modes with Almost Free Message Integrity. Cryptology ePrint Archive 2000/039. 2000. url: https://eprint. iacr.org/2000/039.
64 BIBLIOGRAPHY
[40] T. Krovetz and P. Rogaway. The OCB Authenticated-Encryption Algorithm. RFC 7253. May 2014. url: https://tools.ietf.org/ html/rfc7253. [41] T. Krovetz and P. Rogaway. “The Software Performance of Authenticated-encryption Modes”. In: Proceedings of the 18th In- ternational Conference on Fast Software Encryption. FSE’11. Lyn- gby, Denmark: Springer-Verlag, 2011, pp. 306–327. isbn: 978-3- 642-21701-2. url: https://www.iacr.org/archive/fse2011/ 67330313/67330313.pdf. [42] D. A. McGrew and J. Viega. The Security and Performance of the Galois/Counter Mode of Operation (Full Version). Cryptology ePrint Archive 2004/193. 2004. url: https://eprint.iacr.org/2004/ 193. [43] A. Melnikov. [Cfrg] RG Last Call on draft-irtf-cfrg-gcmsiv-06. E- mail from mailing list. Sept. 2017. url: https://www.ietf.org/ mail-archive/web/cfrg/current/msg09329.html. [44] A. Mileva, V. Dimitrova, and V. Velichkov. “Analysis of the Authenticated Cipher MORUS (v1)”. In: Cryptography and In- formation Security in the Balkans: Second International Conference, BalkanCryptSec 2015, Koper, Slovenia, September 3-4, 2015, Revised Selected Papers. Ed. by E. Pasalic and L. R. Knudsen. Springer In- ternational Publishing, 2016, pp. 45–59. isbn: 978-3-319-29172-7. doi: 10.1007/978-3-319-29172-7_4. url: https://doi.org/ 10.1007/978-3-319-29172-7_4. [45] S. Mueller, M. Vasut, et al. The Linux Kernel documentation. Ker- nel Crypto API Interface Specification. 2017. url: https : / / www . kernel.org/doc/html/latest/crypto/intro.html (visited on Nov. 9, 2017). [46] S. Mueller, M. Vasut, et al. The Linux Kernel documentation. Au- thenticated Encryption With Associated Data (AEAD) Algorithm Definitions. 2017. url: https://www.kernel.org/doc/html/ latest/crypto/api-aead.html (visited on Nov. 15, 2017). [47] Y. Nir and A. Langley. ChaCha20 and Poly1305 for IETF Protocols. RFC 7539. May 2015. url: https://tools.ietf.org/html/ rfc7539. [48] G. Procter. A Security Analysis of the Composition of ChaCha20 and Poly1305. 
Cryptology ePrint Archive 2014/613. 2014. url: https: //eprint.iacr.org/2014/613.
65 BIBLIOGRAPHY
[49] P.Rogaway and D. Wagner. A Critique of CCM. Cryptology ePrint Archive 2003/070. 2003. url: https://eprint.iacr.org/2003/ 070. [50] P. Rogaway. License for Non-Military Software Implementations of OCB. Jan. 2013. url: http://web.cs.ucdavis.edu/~rogaway/ ocb/license2.pdf (visited on Nov. 2, 2017). [51] P. Rogaway. License for Open Source Software Implementations of OCB. Jan. 2013. url: http://web.cs.ucdavis.edu/~rogaway/ ocb/license1.pdf (visited on Nov. 2, 2017). [52] P. Rogaway. “Method and apparatus for facilitating efficient authenticated encryption”. Patent US7949129 B2 (US). May 2011. url: https://www.google.com/patents/US7949129. [53] P. Rogaway. “Method and apparatus for facilitating efficient authenticated encryption”. Patent US8321675 B2 (US). Nov. 2012. url: https://www.google.com/patents/US8321675. [54] P. Rogaway and T. Shrimpton. “A Provable-Security Treatment of the Key-Wrap Problem”. In: Advances in Cryptology - EURO- CRYPT 2006: 24th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28 - June 1, 2006. Proceedings. Ed. by S. Vaudenay. Springer Berlin Heidelberg, 2006, pp. 373–390. isbn: 978-3-540-34547-3. doi: 10.1007/11761679_23. url: https://doi.org/10.1007/ 11761679_23. [55] J. Rott. Intel Advanced Encryption Standard Instructions (AES-NI). 2012. url: https://software.intel.com/en- us/articles/ intel-advanced-encryption-standard-instructions-aes- ni (visited on Nov. 23, 2017). [56] J. Salowey, A. Choudhury, and D. McGrew. AES Galois Counter Mode (GCM) Cipher Suites for TLS. RFC 5288. Aug. 2008. url: https://tools.ietf.org/html/rfc5288. [57] B. Schneier. Applied Cryptography: Protocols, Algorithms, and Source Code in C. 2nd. New York, NY, USA: John Wiley & Sons, Inc., 1996. isbn: 0-471-11709-9. [58] P. Švenda. Basic comparison of Modes for Authenticated-Encryption. Tech. rep. 2004. url: http://www.fi.muni.cz/~xsvenda/docs/ AE_comparison_ipics04.pdf.
66 BIBLIOGRAPHY
[59] The GnuPG Project Authors. Libgcrypt – GnuPG. Oct. 2017. url: https://www.gnupg.org/related_software/libgcrypt/ (vis- ited on Oct. 2, 2017). [60] The OpenSSL Project Authors. EVP_EncryptInit – 1.1.0 manpages. Oct. 2017. url: https://www.openssl.org/docs/man1.1.0/ crypto/EVP_aes_128_gcm.html (visited on Oct. 2, 2017). [61] J. Viega and D. McGrew. The Use of Galois/Counter Mode (GCM) in IPsec Encapsulating Security Payload (ESP). RFC 4106. June 2005. url: https://tools.ietf.org/html/rfc4106. [62] D. Whiting, R. Housely, and N. Ferguson. Counter with CBC- MAC (CCM). RFC 3610. Sept. 2003. url: https://tools.ietf. org/html/rfc3610. [63] Wikipedia contributors. Bulldozer (microarchitecture). 2017. url: https://en.wikipedia.org/w/index.php?title=Bulldozer_ (microarchitecture ) &oldid = 802064821 (visited on Oct. 9, 2017). [64] H. Wu and T. Huang. The Authenticated Cipher MORUS (v2). Tech. rep. Sept. 2016. url: https://competitions.cr.yp.to/round3/ morusv2.pdf (visited on Oct. 30, 2017). [65] H. Wu and B. Preneel. AEGIS. A Fast Authenticated Encryption Al- gorithm (v1.1). Tech. rep. Sept. 2016. url: https://competitions. cr.yp.to/round3/aegisv11.pdf (visited on Oct. 31, 2017).
67