Usuba: High-Throughput and Constant-Time Ciphers, by Construction

Darius Mercadier and Pierre-Évariste Dagand
Sorbonne Université, CNRS, Inria, LIP6, Paris, France
[email protected]

Abstract

Cryptographic primitives are subject to diverging imperatives. Functional correctness and auditability push for the use of a high-level programming language. Performance and the threat of timing attacks push for using nothing more abstract than an assembler to exploit (or avoid!) the micro-architectural features of a given machine. We believe that a suitable programming language can reconcile both views and actually improve on the state of the art of both. Usuba is an opinionated dataflow programming language in which block ciphers become so simple as to be "obviously correct" and whose types document and enforce valid parallelization strategies at the granularity of individual bits. Its optimizing compiler, Usubac, produces high-throughput, constant-time implementations performing on par with hand-tuned reference implementations. The cornerstone of our approach is a systematization and generalization of bitslicing, an implementation trick frequently used by cryptographers.

CCS Concepts • Software and its engineering → Domain specific languages; Data flow languages; • Security and privacy → Block and stream ciphers.

Keywords Optimizing Compiler, Bitslicing, Vectorization

ACM Reference Format:
Darius Mercadier and Pierre-Évariste Dagand. 2019. Usuba: High-Throughput and Constant-Time Ciphers, by Construction. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '19), June 22–26, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3314221.3314636

1 Introduction

Implementations of cryptographic primitives are subject to stringent requirements. The correctness of a primitive is not judged solely on its behavior (a hard problem in itself [8, 15]) but within a threat model that may, for example, include timing attacks or any other side-channel. In this paper, we focus on timing attacks because, with the development of practical and remote timing attacks [10, 21], providing constant-time implementations has become standard practice in cryptographic libraries.

To write constant-time code is to fight an uphill battle against compilers [4], which may silently rewrite one's carefully crafted program into one vulnerable to timing attacks, and against the underlying architecture itself [52], whose micro-architectural features may leak secrets through timing or otherwise. The issue is so far-reaching that tools traditionally applied to hardware evaluation are now used to analyze software implementations [68], treating the program and its execution environment as a single black box.

However, aiming for soundness and security cannot come at the cost of efficiency. These functions are executed repeatedly, processing massive amounts of data. Being able to sustain a high throughput is therefore essential. Bitslicing was introduced by Biham [17] as an implementation trick to speed up software implementations of DES. Bitslicing achieves high throughput by increasing parallelism. Most record-breaking software implementations of block ciphers exploit this technique [6, 38, 42, 50]. Modern ciphers, meant to be executed in software, are now designed with bitslicing in mind [12, 18, 56, 72].
Their design purposefully allows the programmer to exploit SIMD instruction sets (from Arm's Neon extension to Intel's SSE, AVX and AVX512 extensions). Such ciphers are massively parallel, bit-level programs.

Most cryptographic primitives are likely to stay in software, never graduating to hardware – too expensive and too rigid. Since Intel Core i5, only AES became widely available in hardware thanks to its unique position as a NIST standard. At the other end of the spectrum, the Internet of Things has motivated the development of lightweight ciphers [5, 9, 19, 31] that run reasonably efficiently on low-end devices with power and code size constraints – a niche left vacant by AES and which is unlikely to be filled by a one-size-fits-all solution. There is therefore a need for supporting a rich portfolio of cryptographic primitives, both on the low-end side (the deployed clients) and the high-end side (the server, interacting with thousands of clients). In this paper, we focus exclusively on high-throughput implementations for high-end servers, for which we can reasonably assume that the threat model is restricted to timing attacks (thus excluding physical interference with the machine). Targeting low-end devices and integrating countermeasures against physical tampering is out of the scope of this paper.

Our language, Usuba [51], has been designed to enable the high-level description of block ciphers that our compiler, Usubac, automatically turns into a high-throughput and constant-time implementation. The language design – its syntax, semantics and type system – guarantees that Usuba programs are by construction parallel and constant-time.

Usuba exploits and generalizes the bitslicing trick [17], also known as "SIMD Within A Register" [29] (SWAR) in the compiler community. The key insight of bitslicing is to consider a block cipher as a combinational circuit written using bitwise logic operators such as x&y, x|y, x^y and ~x. On a 64-bit machine, the bitwise x&y operation effectively works like 64 parallel Boolean conjunctions, each processing a single bit.
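To make this concrete, here is a minimal hand-written C sketch (ours, not taken from the paper's artifacts): with bit i of 64 independent blocks stored in the 64-bit word slice[i], every bitwise operator of the circuit processes 64 blocks at once.

#include <stdint.h>

/* Illustration of the bitslicing layout: slice i holds bit i of 64
 * *independent* blocks, one block per bit position. A single 64-bit
 * AND then acts as 64 parallel 1-bit ANDs. */
void bitsliced_and(const uint64_t a[64], const uint64_t b[64],
                   uint64_t out[64]) {
    for (int i = 0; i < 64; i++)
        out[i] = a[i] & b[i];   /* one gate of the circuit, 64 blocks at once */
}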
High throughput is achieved by parallelism: 64 instances of the cipher can execute in parallel. The implementation is constant-time by design: no data-dependent accesses to memory are made (or, in fact, possible at all).

While restrictive, this programming model is well-suited to the description of block ciphers. Bit-twiddling, typical of symmetric ciphers, is naturally expressed, often at no run-time cost. Their streaming model is compatible with parallel execution: in parallelizable encryption/decryption modes – such as counter mode (CTR) – multiple blocks of the input can be processed in parallel whereas, in non-parallelizable modes – such as output feedback (OFB) – one can still multiplex independent encryptions from several, concurrent clients [30]. Finally, bitsliced implementations can automatically scale to larger registers (e.g. SIMD).

On SIMD architectures, we can generalize bitslicing in two variants and thus broaden the scope of the original Usuba language [51]. Rather than considering the bit as the atomic unit of computation, one may use m-bit words as such a basis (m being a word size supported by the SIMD engine). On Intel AVX2, m could for example be 8, 16, 32 or 64. We can then exploit vertical SIMD operations to perform logic as well as arithmetic in parallel. If we visually represent registers as vertically stacked, vertical operations consist in element-wise computations along this vertical direction. On Intel AVX2, we can apply 32 parallel additions on 8-bit atoms (m = 8). We call this transformation "vertical slicing", or "vslicing" for short.

Rather than considering an m-bit atom as a single packed element, we may also dispatch its m bits into m distinct packed elements, assuming that m is less than or equal to the number of packed elements of the architecture. We lose the ability to perform arithmetic operations but gain the ability to arbitrarily shuffle the bits of our m-bit word in a single instruction (using vpshufb on Intel AVX2). This style relies on horizontal SIMD operations: we can perform element-wise computations within a single register, i.e. along a horizontal direction. We call this transformation "horizontal slicing", or "hslicing" for short. Whenever the slicing direction is unimportant, we talk about "m-slicing" (assuming that m > 1) and call "slicing" the technique encompassing bitslicing (m = 1) and m-slicing (m > 1).

As shown in Section 4, the choice between bitslicing, vslicing and hslicing depends on the cipher and the target architecture. On x86-64 (without SIMD extensions), we have no choice but to resort to bitslicing. On AVX, the fastest implementation of the Rectangle cipher is vsliced [72] whereas the fastest implementation of AES is hsliced [38].

Manually writing and maintaining a sliced program is extremely tedious. For the sake of illustration, Kwan's implementation of DES is composed of thousands of lines like the following:

s1 (r31 ^ k[47], r0 ^ k[11], r1 ^ k[26], r2 ^ k[3], r3 ^ k[13],
    r4 ^ k[41], &l8, &l16, &l22, &l30);
s2 (r3 ^ k[27], r4 ^ k[6], r5 ^ k[54], r6 ^ k[48], r7 ^ k[39],
    r8 ^ k[19], &l12, &l27, &l1, &l17);

As we shall demonstrate in Section 2, Usuba enables us to translate almost literally the textbook description of a cipher. For example, Figure 1 shows the (complete) implementation of the Rectangle cipher [72].

table SubColumn (in:v4) returns (out:v4) {
  6, 5, 12, 10, 1, 14, 7, 9,
  11, 0, 3, 13, 8, 15, 4, 2
}

node ShiftRows (input:u16x4) returns (out:u16x4)
let
  out[0] = input[0];
  out[1] = input[1] <<< 1;
  out[2] = input[2] <<< 12;
  out[3] = input[3] <<< 13
tel

node Rectangle (plain:u16x4, key:u16x4[26]) returns (cipher:u16x4)
vars
  round : u16x4[26]
let
  round[0] = plain;
  forall i in [0,24] {
    round[i+1] = ShiftRows(SubColumn(round[i] ^ key[i]))
  }
  cipher = round[25] ^ key[25]
tel

Figure 1. Rectangle cipher

Usuba provides domain-specific language constructs, a type system guaranteeing safety as well as parallelism, and an optimizing compiler striving to exploit SIMD and maximize throughput. By abstracting away parallelism from the implementation, an Usuba program is trivial to write and easy to maintain.

In summary, the main contributions of this paper are:

• We present Usuba, a statically-typed vector-based programming language for specifying block ciphers (Section 2). Its programming model makes the description of cryptographic algorithms intuitive while its type system reconciles the need for abstraction – for code reuse – and specialization – giving access to architecture-specific features;

• We describe Usubac (Section 3), an optimizing compiler translating Usuba to C.¹ It involves a combination of folklore techniques (inlining, unrolling) tailored to the unique problems posed by sliced programs and introduces new techniques (interleaving, sliced scheduling) made possible by our programming model;

• Finally, we evaluate our Usuba implementations of 5 ciphers (Section 4) in various combinations of slicing modes and target architectures. We show that Usubac produces code outperforming hand-tuned implementations written in assembly while scaling gracefully across several generations of SIMD architectures. Thanks to Usuba's expressivity, we are able to compare, at no expense, the performance of all the slicing modes supported by a given cipher, allowing us to carry out the first – to our knowledge – performance evaluation of slicing modes across instruction sets.

Aside from the technical challenge of producing "faster than hand-tuned" cryptographic code from high-level descriptions, this paper is an attempt to draw the consequences of the end of Moore's law. To provide sustained performance improvements, hardware designers turn to specialization. Re-compiling old software to target a new architecture does not automatically pay off anymore, if it ever did. A high-performance cryptographic primitive for AVX2 is but a poor cryptographic primitive for AVX512.

Usuba takes the stance that specialized hardware calls for specializable software. The language design reflects this by enabling the programmer to encode the parallelism of a cipher in its type. The compiler sides with Gustafson [32], targeting faster yet more specialized architectures to tackle larger problems within the same time frame, i.e. striving for increased throughput.

2 The Usuba Language

The Usuba language design is driven by a combination of algorithmic and hardware-specific constraints. Algorithmically, implementing a block cipher requires some discipline to achieve high throughput and avoid timing attacks. To get high throughput, we focus exclusively on parallel modes of operation such as counter mode (CTR, the most widely used). This excludes feedback modes, such as output feedback (OFB), which deliver much lower throughput in software. We translate this constraint by forbidding feedback loops in our designs. An Usuba program can be understood as a stateless combinational circuit, parallelizable by design.

Processing independent blocks in parallel forbids us from using any form of conditional execution. As is standard in GPU programming, we resort to masking for conditionals, effectively wiring the two results to a multiplexer. Usuba programs are thus immune to timing attacks based on branch prediction, by construction. We are further able to guarantee the absence of cache effects by forbidding data-dependent accesses to memory. For example, we support lookup tables but compile them to Boolean circuits.
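The masking discipline admits a very small illustration. The following sketch (ours, not Usubac output) shows the standard constant-time selection idiom that replaces a data-dependent branch:

#include <stdint.h>

/* Constant-time conditional, sketched for illustration. Both branches are
 * always computed; a mask of all-ones or all-zeros selects the result, so
 * no branch ever depends on secret data. */
static inline uint64_t ct_select(uint64_t cond_bit, uint64_t then_v, uint64_t else_v) {
    uint64_t mask = (uint64_t)0 - (cond_bit & 1);  /* 0x00...0 or 0xFF...F */
    return (then_v & mask) | (else_v & ~mask);
}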
Finally, Usuba must be expressive enough to describe hardware-oriented ciphers (such as DES or Trivium, which are specified in terms of Boolean operations) as well as software-oriented ciphers specified in terms of affine transformations (such as AES and, more generally, ciphers exploiting maximum distance separable matrices [28]). To account for the former, Usuba provides abstractions to manipulate vectors, such as extracting a single bit as well as splitting or combining vectors. To account for the latter, Usuba handles the matricial structure of each block of data, allowing bit-level operations to carry over such structured types while driving the compiler into generating efficient SIMD code.

Altogether, Usuba provides a vector-based programming model, similar in spirit to APL [35], allowing us to work at the granularity of a single bit while providing static type-checking and compilation to efficient code.

Usuba programs are also subject to architecture-specific constraints. For instance, a cipher relying on 32-bit addition is a poor candidate for bitslicing: this addition turns into a genuine full-adder circuit simulated in software. Similarly, a cipher relying on 6-bit arithmetic would be impossible to execute in vertical slicing: the SIMD instruction sets manipulate quantities ranging from bytes to zwords, leaving aside such an exotic word size. We are therefore in a rather peculiar situation where, on the one hand, we would like to implement a cipher once while, on the other hand, the validity of our program depends on a combination of slicing mode and target architecture.

¹ Available from https://github.com/DadaIsCrazy/usuba

(a) Bitslicing (u‘D1 × 64): r0: b0, r1: b1, ..., r63: b63 – one bit per register.
(b) Vertical slicing (uV16 × 4): r0: b0...b15, r1: b16...b31, r2: b32...b47, r3: b48...b63 – each 16-bit word sits in the first packed element of one register.
(c) Horizontal slicing (uH16 × 4): r0: b0 b1 ... b15, r1: b16 b17 ... b31, r2: b32 b33 ... b47, r3: b48 b49 ... b63 – each 16-bit word is spread across a register, one bit per packed element.

Figure 2. Data layouts

How can we, at compile time, provide meaningful feedback to the Usuba programmer so as to (always) generate high-throughput code? We address this issue by introducing a type for structured blocks (Section 2.1) upon which we develop a language (Section 2.2) supporting parametric polymorphism (for genericity) and ad-hoc polymorphism (for architecture-specific code generation) as well as a type system (Section 2.3) ensuring that "well-typed programs do always vectorize".

2.1 Data Layout

Our basic unit of computation is the block, i.e. a bitvector of statically-known length. To account for its matricial structure and its intended parallelization, we introduce the type uDm × n to denote n ∈ N* registers of unsigned m-bit words (m ∈ N* and "u" stands for "unsigned") that we intend to parallelize using vertical (D = V) or horizontal (D = H) SIMD instructions. A single m-bit value is thus typed uDm × 1, abbreviated uDm.

This notation allows us to unambiguously specify the data layout of the blocks processed by the cipher. Consider the 64-bit input block b0, b1, ..., b63 of the Rectangle cipher, targeting 128-bit SSE registers. Bitsliced Rectangle manipulates blocks of type uV1 × 64, i.e. each bit is dispatched to an individual register (Figure 2a). A 16-bit vertical slicing of Rectangle manipulates blocks of type uV16 × 4, i.e. each one of the 4 sub-blocks is dispatched to the first 16-bit element of 4 registers (Figure 2b). Horizontal slicing has type uH16 × 4, i.e. each 16-bit word is horizontally spread across the 16 packed elements of 4 SIMD registers (Figure 2c). Note that directions collapse in the case of bitslicing, i.e. uV1 × 64 ≅ uH1 × 64. Both cases amount to the same layout or, put otherwise: vertical and horizontal slicing are two (orthogonal!) generalizations of bitslicing.

For a given data layout, throughput is maximized by filling the remaining bits of the registers with subsequent blocks of the input stream, following the same pattern. Thus, a vertical 16-bit SIMD addition (vpaddw) in vertical slicing will amount to performing an addition on 8 blocks in parallel. We only need to specify the treatment of a single slice; Usubac then automatically generates code that maximizes register usage. Transposing a sequence of input blocks into a form suitable for parallel processing is fully determined by its type. Usubac automatically synthesizes this function and we evaluate its throughput in Section 4.3.
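As a concrete (hand-written, illustrative) reading of this layout, the following C sketch uses SSE2 intrinsics: with blocks packed as uV16 × 4, each 128-bit register holds the same 16-bit word of 8 independent blocks, so a single packed addition advances all 8 blocks.

#include <emmintrin.h>  /* SSE2 */

/* Illustration (not Usubac output): one register = one 16-bit word of 8
 * independent blocks. One packed addition therefore processes 8 blocks. */
__m128i vsliced_add16(__m128i word_of_8_blocks_a, __m128i word_of_8_blocks_b) {
    return _mm_add_epi16(word_of_8_blocks_a, word_of_8_blocks_b);
}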
2.2 Syntax & Semantics

In and of itself, Usuba is an unsurprising dataflow language [1, 25]. We introduce its syntax and semantics by way of an example: our implementation of the Rectangle cipher (Figure 1), whose reference implementation consists in 115 lines of C++. In the following, we shall leave types and typing aside, coming back to this point in Section 2.3.

An Usuba program is composed of a totally ordered set of nodes (here, SubColumn, ShiftRows and Rectangle). The last node plays the role of the main entry point: it will be compiled to a C function. A node typically consists of an unordered system of equations involving logic and arithmetic operators. The semantics is defined extensionally as a solution to the system of equations, i.e. an assignment of variables to values such that all the equations hold.

Usuba also provides syntactic sugar for declaring lookup tables (here, SubColumn), useful for specifying S-boxes. Conceptually, a lookup table is an array: the n-bit input indexes into an array of 2^n possible output values. However, to maximize throughput and avoid cache timing attacks, the compiler expands lookup tables into Boolean circuits. For prototyping purposes, Usuba uses an elementary logic synthesis algorithm based on binary decision diagrams (BDD) – inspired by [67] – to perform this expansion. The circuits generated by this tool are hardly optimal: finding optimal representations of S-boxes is a full-time occupation for cryptographers, often involving months of exhaustive search [23, 44, 61, 70]. Usuba integrates these hard-won results into a database of known circuits, which is searched before trying to convert any lookup table to a circuit. For instance, Rectangle's S-box (SubColumn) is replaced with the following node:

node SubColumn (a:v4) returns (b:v4)
vars
  t1:v1, t2:v1, t3:v1, t4:v1, t5:v1, t6:v1,
  t7:v1, t8:v1, t9:v1, t10:v1, t11:v1, t12:v1
let
  t1 = ~a[1];       t2 = a[0] & t1;    t3 = a[2] ^ a[3];
  b[0] = t2 ^ t3;   t5 = a[3] | t1;    t6 = a[0] ^ t5;
  b[1] = a[2] ^ t6; t8 = a[1] ^ a[2];  t9 = t3 & t6;
  b[3] = t8 ^ t9;   t11 = b[0] | t8;   b[2] = t6 ^ t11
tel

Syntactically, our treatment of vectors stems from the work on hardware synthesis of synchronous dataflow programs [69]. In effect, we are merely describing "software circuits". Many programming idioms found in hardware synthesis, which would be extremely inefficient in general, can be efficiently implemented in our setting. In bitsliced mode, we can for example permute the 4th and 13th bits of a 16-bit word at no run-time cost: this amounts to statically swapping the names of the registers holding the 4th and 13th bits throughout the rest of the program.
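For illustration, here is a hand-written C sketch (not Usubac output) of why such a permutation is free in bitsliced code: the permutation never materializes as data movement; the subsequent equations simply read the renamed variables.

#include <stdint.h>

/* In bitsliced code, bit i of a 16-bit word lives in its own 64-bit
 * variable b[i] (one block per bit position). A permutation swapping
 * bits 4 and 13 costs nothing at run time: the rest of the circuit just
 * reads the renamed wires. */
static inline void use_permuted(const uint64_t b[16], uint64_t out[2]) {
    out[0] = b[13] & b[0];   /* consumed "bit 4" before the static renaming */
    out[1] = b[4]  ^ b[1];   /* consumed "bit 13" before the static renaming */
}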
Given a vector x of size n (thus, of type τ[n]), we can obtain the element x[k] at position 0 ≤ k < n and the consecutive elements x[k..l] in the range 0 ≤ k < l < n. This syntax is instrumental for writing concise bit-twiddling code. Indices must be known at compile-time, since variable indices could compromise the constant-time property of the code. The type-checker can therefore prevent out-of-bounds accesses.

Noticeably, vectors are maintained in a flattened form. Given two vectors x and y of size, respectively, m and n, the variable z = (x,y) is itself a vector of size m + n (not a pair of vectors). Conversely, for a vector u of size m + n, the equation (x,y) = u stands for x = u[0..n] and y = u[n..m+n].

To account for repetitive definitions (such as the wiring of the 25 rounds of Rectangle), the forall construct lets us declare a group of equations within static bounds. Its semantics is intuitively defined by macro-expansion: we can always translate it into a chunk of inter-dependent equations. However, in practice, the Usubac compiler preserves this structure in its pipeline and may generate imperative loops performing destructive updates (Section 3.2).

Some ciphers (e.g. Chacha20, Serpent) are defined in a very imperative manner, repetitively updating a local state. Writing those ciphers in a synchronous dataflow language can be tedious: it amounts to writing code in static single assignment form (SSA). To simplify such code, Usuba provides an "imperative assignment" operator x := e. It desugars into a standard equation with a fresh variable on the left-hand side that is substituted for the updated state in later equations.

The constructs introduced so far deal with the wiring and structure of the dataflow graph. To compute, one must introduce operators. Usuba supports bitwise logical operators (conjunction &, disjunction |, exclusive-or ^ and negation ~), arithmetic operators (addition +, multiplication * and subtraction -), shifts of vectors (left << and right >>) and rotations of vectors (left <<< and right >>>). Their availability and exact semantics (esp., run-time cost) depend on the slicing mode and the target architecture, as shown next.

2.3 Type System

Base types. For the purpose of interacting with its cryptographic runtime, the interface of a block cipher is specified in terms of the matricial type uDm × n, which documents the layout of blocks coming in and out of the cipher.

We also have general vectors whose types are τ[n], for any type τ. Now, consider the key schedule of vsliced Rectangle: it is presented as an object key of type uV16 × 4[26], which is 26 quadruples of 16-bit words. To obtain the key at round 23, we thus write key[23] while, to obtain the 3rd word of that key, one must write key[23][3]. The notation uDm × 4[26] indicates that accesses must be performed in column-major order, i.e. as if we were accessing an object of type uDm[26][4] (following the C convention). In fact, because vectors are kept flat, the types uDm[26][4] and uDm × 4[26] are actually equivalent from the typechecker's standpoint, the compiler collapsing both types to uDm[104]. This apparent redundancy is explained by the fact that types serve two purposes. In the surface language, matricial types (uDm × 4) document the data layout and its SIMDization. In the target language, the matricial structure is irrelevant: an object of type uDm × 4[26] supports exactly the same operations as an object of type uDm[26][4]. Surface types are thus normalized, after type-checking, into distilled types.

Parametric polymorphism. A final addition to our language of types is the notion of parametric word size and parametric direction. A cipher like Rectangle can in fact be sliced horizontally or vertically: both modes of operation are compatible with the various SIMD architectures introduced after SSSE3. Similarly, the node SubColumn amounts to a Boolean circuit whose operations (&, |, ^, and ~) are defined for any atomic word size (ranging from a single Boolean to a 512-bit AVX512 register): SubColumn thus applies to uD1, uD8, etc. This completes the grammar of distilled types, including type parameters in word size and direction:

n ∈ N                          (integers)
⟨τ⟩ ::=                        (types)
    | u⟨D⟩⟨m⟩                  (base type)
    | τ[n]                     (vectors)
    | nat                      (constants)
⟨m⟩ ::=                        (size)
    | ‘m                       (parameter)
    | n                        (fixed size)
⟨D⟩ ::=                        (direction)
    | ‘D                       (parameter)
    | V                        (vertical)
    | H                        (horizontal)

Nodes being first-order functions, a function type is (at most) rank-1 polymorphic: the polymorphic parameters it may depend on are universally quantified over the whole type (and node body). We use the abbreviations bn and um for the types u‘D1 × n and, respectively, u‘Dm, where ‘D is the direction parameter in the nearest scope. Similarly, we write vn for u‘D‘m × n when ‘m is the nearest word size parameter.

Ad-hoc polymorphism. In reality, very few programs are defined for any word size or any direction. Only Boolean circuits are direction polymorphic, bitwise logical operations applying uniformly. Also, no program is absolutely parametric in the word size: we can only compute up to the register size of the underlying architecture.

Table 1. Operator instances.

Class      Instances                                          Architecture   Compiled with
Logic(τ)   Logic(τ) ⇒ Logic(τ[n]), for n ∈ N                  all            homomorphic application (n instr.)
           Logic(u‘Dm) for m ∈ [1, 64]                        ≥ x86-64       and, or, etc. (1 instr.)
           Logic(u‘Dm) for m ∈ [65, 128]                      ≥ SSE          and, or, etc. (1 instr.)
           Logic(u‘Dm) for m ∈ [129, 256]                     ≥ AVX          and, or, etc. (1 instr.)
           Logic(u‘Dm) for m ∈ [257, 512]                     ≥ AVX512       and, or, etc. (1 instr.)
Arith(τ)   Arith(τ) ⇒ Arith(τ[n]), for n ∈ N                  all            homomorphic application (n instr.)
           Arith(uV8), Arith(uV16)                            ≥ SSE          vpadd, vpsub, etc. (1 instr.)
           Arith(uV32), Arith(uV64)                           ≥ AVX2         vpadd, vpsub, etc. (1 instr.)
Shift(τ)   Shift(τ[n]), for n ∈ N                             all            variable renaming (0 instr.)
           Shift(uV‘m), Shift(uH‘m) ⇒ Shift(u‘D‘m)            all            depends on the instance
           Shift(uV16), Shift(uV32)                           ≥ SSE          vpsrl/vpsll (≤ 3 instr.)
           Shift(uV64)                                        ≥ AVX2         vpsrl/vpsll (≤ 3 instr.)
           Shift(uH2), Shift(uH4), Shift(uH8), Shift(uH16)    ≥ SSE          vpshuf (1 instr.)
           Shift(uH32), Shift(uH64)                           ≥ AVX512       vpshuf (1 instr.)

To capture these invariants, we introduce a form of bounded polymorphism through type-classes [71]. Whether a given cipher can be implemented over a collection of word sizes and/or directions is determined by the availability of logical and arithmetic operators. We therefore introduce three type-classes for logical, arithmetic and shift/rotate operations:

Logic(τ):  & : τ → τ → τ,  | : τ → τ → τ,  ^ : τ → τ → τ,  ~ : τ → τ
Arith(τ):  + : τ → τ → τ,  * : τ → τ → τ,  - : τ → τ → τ
Shift(τ):  >> : τ → nat → τ,  << : τ → nat → τ,  >>> : τ → nat → τ,  <<< : τ → nat → τ

The implementations of these classes depend on the type considered and the target architecture. For instance, arithmetic on 13-bit words is impossible, even in vertical mode. The generated code also depends on a combination of types and target architecture. For instance, shifting a vector amounts to renaming registers whereas shifting in horizontal mode requires an actual shuffle instruction. We chose to provide instances solely for operations that can be implemented statically or with a handful of instructions.

Our type-class mechanism is not user-extensible. The list of all the possible type-class instances is summarized in Table 1. This set of instances is obviously non-overlapping so our overloading mechanism is coherent: if the type-checker succeeds in finding an instance for the target architecture, then that instance is unique.

Type-checking. Despite its unusual combination of features, Usuba's type system is rather unsurprising. First, it is applied to a first-order and explicitly-typed language, thus requiring no type inference. Second, vectors – while perhaps intimidating – can only take statically-known sizes: we are far away from the realm of dependent types, merely touching upon phantom type techniques. The combination of prenex polymorphism and type-classes is well understood. We therefore refrain from further expounding our type system.

As a result of these features, our generic implementation of Rectangle type-checks as-is when targeting any instruction set beyond SSE, meaning that it can be sliced vertically or horizontally. In Section 3.1, we shall describe the mechanisms at our disposal to produce vertically or horizontally sliced C code as well as for extracting a bitsliced implementation.
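To illustrate how the Shift instances of Table 1 materialize, here is a hand-written SSE sketch (ours, not Usubac output; the lane numbering is an assumption made for the example) of a left rotation <<< 1 on 16-bit words in both m-slicing modes:

#include <emmintrin.h>   /* SSE2: shifts, or */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* vsliced: each 16-bit lane holds a whole word, so <<< 1 is two shifts + or. */
static __m128i rotl1_vsliced(__m128i x) {
    return _mm_or_si128(_mm_slli_epi16(x, 1), _mm_srli_epi16(x, 15));
}

/* hsliced: each byte lane holds one bit of the word (lane i = bit i, by
 * assumption), so <<< 1 is a single byte shuffle that renames the 16 lanes. */
static __m128i rotl1_hsliced(__m128i x) {
    const __m128i rot = _mm_set_epi8(14, 13, 12, 11, 10, 9, 8, 7,
                                     6, 5, 4, 3, 2, 1, 0, 15);
    return _mm_shuffle_epi8(x, rot);
}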

3 The Usubac Compiler

The Usubac compiler consists of two phases: the front-end (Section 3.1), whose role is to distill Usuba's high-level constructs into a minimal core language, and the back-end (Section 3.2), which performs optimizations over the core language. The core language, called Usuba0, amounts to dataflow graphs of Boolean and arithmetic operators. In its final pass, the back-end translates Usuba0 to C code with intrinsics.

3.1 Front-end

The front-end extends the usual compilation pipeline of synchronous dataflow languages [16, 69], namely normalization and scheduling, with domain-specific transformations. Normalization enforces that, aside from nodes, equations are restricted to defining variables at integer type. Scheduling checks that a given set of equations is well-founded and constructively provides an ordering of equations for which a sequential execution yields a valid result. In our setting, scheduling has to be further refined to produce high-performance code: scheduling belongs to the compiler's back-end.

Several features of Usuba require specific treatment by the front-end. First, the language offers domain-specific syntax for specifying cryptographic constructs such as lookup tables or permutations. We boil these constructs down to Boolean circuits. Second, the language provides a limited form of parametric polymorphism while offering an overloading mechanism for logical and arithmetic operations based on ad-hoc polymorphism. We produce monomorphic code, with types and operations being expanded to machine-baked instructions for the target architecture and parallelism discipline.

Monomorphization. The parametric polymorphism offered by Usuba enables the concise description of size-generic or direction-generic algorithms. The choice of execution model can thus be postponed to compile-time. However, the code generated by Usuba must be monomorphic. Striving for performance, we cannot afford to manipulate boxed values at run-time. Also, the underlying SIMD architecture does not allow switching between a vertical or horizontal style of SIMD operations. The Usubac compiler provides the flags -H and -V to monomorphize the main node to the corresponding horizontal or vertical direction, while the -w m flag monomorphizes the atomic word size to the given positive integer m.

Crucially, we do not implement ad-hoc polymorphism by dictionary passing [71], which would imply a run-time overhead, but resort to static monomorphization [36]. Provided that type-checking was successful, we know precisely which operator instance exists at the given type and macro-expand it.

Flattening: from m-slicing to bitslicing. We may also want to flatten an m-sliced cipher to a purely bitsliced model. Performance-wise, it is rarely (if ever) interesting: the higher register pressure imposed by bitslicing is too detrimental. However, some architectures (such as 8-bit micro-controllers) do not offer vectorized instruction sets at all. Also, bitsliced algorithms serve as the basis for hardening software implementations against fault attacks [45, 64]. To account for these use-cases, Usuba can automatically flatten a cipher into bitsliced form. Flattening is a whole-program transformation (triggered by passing the flag -B to Usubac) that globally rewrites all instances of vectorial types uDm × n into the type bm[n]. Note that the vectorial direction of the source is irrelevant: it will be collapsed after flattening. The rest of the source code is processed as-is: we rely solely on ad-hoc polymorphism to drive the elaboration of the operators at the rewritten types. Either type checking (and, therefore, type-class resolution) succeeds, in which case we have obtained the desired bitsliced implementation, or type-class resolution fails, meaning that the given program exploits operators that are not available in bitsliced form. For instance, short of providing an arithmetic instance on b8, we will not be able to bitslice an algorithm relying on addition on uV8. Once again, we rely on ad-hoc polymorphism to capture (in a single conceptual framework) the fact that a given program may or may not be amenable to bitslicing. The user is thus presented with a meaningful type error, which points out exactly which operator is incompatible with (efficient) bitslicing.

Usuba0. The front-end produces a monomorphic dataflow graph whose nodes correspond to logical and arithmetic operations provided by the target instruction set. This strict subset of the Usuba language is called Usuba0. It translates almost directly to C: in principle, we only need to schedule the code (as described in the following Section) after which each equation turns into a variable assignment of an integer type. Nodes are translated to function definitions whereas node calls translate to function calls. However, the Usuba0 code produced by the front-end is subjected to several optimizations before delivery to a C compiler. We describe these transformations in the following Section.
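As an illustration of this final step, the bitsliced SubColumn node shown in Section 2.2 would translate to C code along the following lines (our sketch; actual Usubac output may differ in naming and layout), with each equation becoming an assignment on 64-bit registers:

#include <stdint.h>

/* Sketch of the C code one may expect for the bitsliced SubColumn node:
 * one assignment per Usuba0 equation, operating on 64 blocks at once. */
static void SubColumn(const uint64_t a[4], uint64_t b[4]) {
    uint64_t t1 = ~a[1];
    uint64_t t2 = a[0] & t1;
    uint64_t t3 = a[2] ^ a[3];
    b[0] = t2 ^ t3;
    uint64_t t5 = a[3] | t1;
    uint64_t t6 = a[0] ^ t5;
    b[1] = a[2] ^ t6;
    uint64_t t8 = a[1] ^ a[2];
    uint64_t t9 = t3 & t6;
    b[3] = t8 ^ t9;
    uint64_t t11 = b[0] | t8;
    b[2] = t6 ^ t11;
}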
3.2 Back-end

The back-end exclusively manipulates Usuba0 code, taking advantage of referential transparency as well as the fact that we are dealing with a non-Turing-complete language. Optimizations carried out at this level are easier to write but also more precise. The back-end is complementary with the optimizations offered by C compilers. There is a significant semantic gap between the user intents expressed in Usuba and their straightforward sliced translations in C. This results in missed optimization opportunities for the C compiler, of which we shall study a few examples in the following. The back-end authoritatively performs these optimizations at the level of Usuba0.

Interleaving. A first optimization consists in interleaving several executions of the program. Usuba allows us to systematize this folklore programming trick, popularized by Matsui [49]: for a cipher using a small number of registers (for example, strictly below 8 general-purpose registers on Intel), we can increase its instruction-level parallelism (ILP) by interleaving several copies of a single cipher, each manipulating its own independent set of variables. This can be understood as a static form of hyper-threading, by which we (statically) interleave the instruction streams of several parallel executions of a single cipher. By increasing ILP, we reduce the impact of data hazards in the deeply pipelined CPU architectures we are targeting. Note that this technique is orthogonal to slicing (which exploits spatial parallelism, in the registers): it exploits temporal parallelism, i.e. the fact that a modern CPU can dispatch multiple, independent instructions to its parallel execution units.
This technique naturally fits within our parallel programming framework: we can implement it by a straightforward Usuba0-to-Usuba0 translation.

The interleaving heuristic itself is retrospectively straightforward: the number of interleaved instances is set to be the number of CPU registers available on the target architecture divided by the maximal number of live registers in the given algorithm. For example, Serpent and Rectangle use respectively 8 and 7 AVX registers at most, which drives the compiler to pick an interleaving factor of 2 when compiling. Choosing a larger factor would induce spilling, which is highly detrimental to performance.

A second design decision concerns the size of the independent code blocks to be interleaved. We have empirically observed that we can adopt a coarse-grained approach: we chose to alternate between blocks of 10 equations from each of the interleaved instances. The scheduling performed by Usubac and the instruction scheduling later performed by C compilers will do an excellent job at taking advantage of the resulting ILP.

On Serpent, we observe that the throughput of 2 interleaved ciphers is 21.75% higher than the throughput of a single cipher, while increasing the code size by 29.3%. Similarly for Rectangle, the throughput increases by 27.62% at the expense of a 19.2% increase in code size.
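To make the transformation concrete, here is a hand-written sketch (not actual Usubac output) of what interleaving two independent instances of the same computation looks like in the generated C: the two instruction streams are merged so that consecutive instructions never depend on one another.

#include <stdint.h>

/* Two independent instances of the same (toy) round body, interleaved so
 * that the CPU always has an instruction with no pending dependency. */
void round_x2(uint64_t a0, uint64_t b0, uint64_t *r0,
              uint64_t a1, uint64_t b1, uint64_t *r1) {
    uint64_t t0 = a0 ^ b0;      /* instance 0 */
    uint64_t t1 = a1 ^ b1;      /* instance 1: independent, fills the pipeline */
    t0 &= a0;                   /* instance 0 */
    t1 &= a1;                   /* instance 1 */
    *r0 = t0 | b0;
    *r1 = t1 | b1;
}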

Inlining. The decision of inlining nodes is partly justified by the usual reasoning applied by C compilers: a function call implies a significant overhead that, for very frequently called functions (such as S-boxes, in our case), makes it worth the increase in code size. In fact, all the compilers we have tested (except GCC in some situations) are able to spot the fact that S-boxes, for example, benefit from being inlined. Usubac is only a helping hand here, making sure that the user does not observe a performance regression because of what could only be characterized as a bug in the C compiler's heuristics.

Bitslicing, however, sends the inlining heuristics of C compilers astray. A bitsliced node compiles to a C function taking hundreds of variables as inputs and outputs. For instance, the round function in DES takes 120 arguments once bitsliced. Calling such a function requires the caller to push hundreds of variables onto the stack while the callee has to go through the stack to retrieve them, leading to a significant execution overhead but also a growth in code size. C compilers naturally avoid inlining this function because its code is quite large, missing the fact that calling it actually becomes a bottleneck.

On DES, inlining results in a 44.8% improvement in throughput while actually reducing code size by 9.1%. Similarly, a bitsliced implementation of AES (automatically obtained from the more efficient hsliced version) is 24.24% more efficient with inlining at the expense of a 24.8% increase in code size. Because the AES round function is significantly larger than its number of input variables, we notice an increase of code size. However, increasing code size is hardly a performance issue in our setting: our code is executed in a straight-line sequence, with few to no control flow instructions. As a result, however big the resulting binary is, instruction prefetching allows us to amortize the few cache misses over whole cache lines.

Besides, inlining offers more opportunities for scheduling. For instance, symmetric ciphers are usually composed of a sequence of 10 to 20 applications of a single round function. Inlining this function enables Usubac to schedule instructions across rounds, so as to increase ILP.

Algorithm 1 Bitsliced code scheduling
 1: procedure Schedule(prog)
 2:   for each function call funCall of prog do
 3:     for each variable v in funCall's arguments do
 4:       if v's definition is not scheduled yet then
 5:         schedule v's definition next
 6:     schedule funCall next
 7:     for each variable v defined by funCall do
 8:       for each equation eqn of prog using v do
 9:         if eqn is ready to be scheduled then
10:           schedule eqn next

Scheduling bitsliced code. Our scheduling algorithm differs significantly depending on whether we are scheduling bitsliced or m-sliced code. In a bitsliced program, the single, major bottleneck is register pressure: a significant portion of the execution time is spent spilling registers to and from the stack. On DES, about 1 in 5 instructions deals with spilling. The role of scheduling is therefore to minimize register pressure as much as possible.

One might expect the C compiler to already minimize register pressure. However, it is well known that combining register allocation and instruction scheduling is NP-hard [55], and modern C compilers use heuristics to get the best possible approximations in most cases. In the bitslicing setting, where hundreds of variables are alive at the same time, those heuristics often prove to be inefficient and miss opportunities to lower the spilling.

Our scheduling algorithm (Algorithm 1) focuses on reducing the live ranges of function call arguments and return values – regardless of whether those functions will be inlined or not. The reasoning behind it is as follows. According to the x86-64 calling conventions, the first 8 arguments of function calls are passed in registers. It is therefore profitable to schedule the instructions computing those arguments close to the function call, thus removing the need to spill them only to reload them later into registers. As for return values, they are returned by reference in a structure (since the x86-64 calling conventions do not allow functions to return more than one value), which amounts to having spilled them already. However, by moving the instructions that use them right after the function call, we increase the potential benefits of inlining: the return values will not need to be spilled but instead will be allowed to stay in registers, as their live ranges will now be much shorter.

On bitsliced DES, combining scheduling and inlining increases throughput by 6.77% compared to the inlined code and reduces code size by 9.9% whereas, on bitsliced AES, throughput is increased by 2.49% and code size reduced by 3.4%. This witnesses the fact that the scheduling performed by Usubac is able to reduce unnecessary spilling that was not spotted by C compilers. Overall, combining inlining and scheduling results in a net 45.8% increase in throughput compared to baseline. Similarly, on bitsliced AES, throughput is globally improved by 26.22%.
Scheduling m-sliced code. Unlike bitsliced code, m-sliced programs have much lower register pressure. Spilling is less of an issue, the latency of the few resulting load and store operations being hidden by the CPU pipeline. Instead, the challenge consists in being able to saturate the parallel CPU execution units. To do so, one must increase ILP, taking into account the availability of specialized execution units. For instance, hsliced code will rely on SIMD shuffle instructions (vpshufb) that can be executed on a single execution unit on the current Skylake architecture. Performing a sequence of shuffles is extremely detrimental to performance: the execution of the shuffles (and their respective dependencies) becomes serialized, bottlenecking the dedicated execution unit.

One might think that the out-of-order nature of modern CPUs would alleviate this issue, allowing the processor to execute independent instructions ahead in the execution stream. C compilers (such as ICC and Clang) seem to adopt this policy: groups of shuffles in the source code will be scheduled as-is in the resulting assembly. Only GCC statically schedules independent (for example, arithmetic) instructions in-between shuffles in the generated assembly. However, due to the bulky nature of sliced code, we observe that the reservation station is quickly saturated, preventing actual out-of-order execution.

We statically increase ILP in Usubac by maintaining a look-behind window of the previous 16 instructions (which corresponds to the maximal number of registers available on Intel platforms without AVX512) while scheduling. To schedule an instruction, we search among the remaining instructions for one with no data hazard with the instructions in the look-behind window, and that would execute on a different execution unit. If no such instruction can be found, we reduce the size of the look-behind window and, ultimately, just schedule any instruction that is ready.

This scheduling algorithm increased the throughput of hsliced AES by 2.43%, and of vsliced Chacha20 by 9.09%. Inspecting the generated assembly, we notice that, in both cases, the impact of data hazards has been minimized by reorganizing computations within each round.

Inlining is instrumental in improving the quality of scheduling in this setting too: it allows instructions to flow freely across node calls. For instance, a round of Chacha20 is specified in terms of 2 half-rounds, which we naturally implement with two nodes. Thanks to inlining, we can afford to define a node, hence increasing readability, knowing that, performance-wise, it is transparent.

Whereas inlining allows scheduling to improve the code quality within a single round, symmetric ciphers usually combine ten or more iterations of a given round. For instance, Chacha20 performs 10 iterations of a double round, and AES performs between 10 and 14 rounds. In Usuba, these iterations are naturally expressed with a grouped definition, using the forall declaration.

Usubac can choose between expanding the forall declaration – unrolling the loop at the expense of code size – or translating it into a C for loop. The loop bounds being statically known, the C compiler could also decide to fully unroll this loop. However, the size of the loop body being significant, compilers prefer to avoid unrolling it, a sound decision in general since code locality ought to be preserved. Since we have perfect locality anyway, code size is not an issue in our setting. We may thus profitably unroll all the rounds into a single straight-line program: scheduling is then able to re-order instructions across distinct encryption rounds. On AES (resp. Chacha20), this yields a 3.22% (resp. 3.63%) speedup compared to an implementation performing intra-round scheduling only, at the expense of a 31.90% (resp. 19.40%) increase in code size.
Remark. It would be tempting to integrate the above optimizations as a domain-specific backend for, say, LLVM [46]. We conjecture that the resulting compiler would not be radically different nor significantly faster than Usuba. First, our optimizations are complementary to those performed by the C compiler (e.g., inlining). They would be expressed almost as-is over LLVM-IR. Second, Table 2 will show that there is no silver bullet among compilers: we are currently free to pick the absolute best compiler, even if closed source. Third, going the extra mile and targeting C simplifies integration in existing code bases, as witnessed by the incorporation of the C files produced by HACL* [73] into the Firefox and Wireguard projects. This allows the cryptographic primitives to potentially outlive the DSL. In the following, we evaluate the performance of the code generated by our compiler.

Table 2. Optimal configurations

Cipher     Mode      Compiler  Inlining  Unrolling  Interleaving  Scheduling
DES        bitslice  Clang     ✓         ✓                        ✓
AES        bitslice  Clang     ✓         ✓                        ✓
AES        hslice    Clang     ✓         ✓                        ✓
Rectangle  bitslice  ICC       ✓         ✓                        ✓
Rectangle  hslice    GCC       ✓                    ✓
Rectangle  vslice    Clang     ✓                    ✓
Chacha20   vslice    ICC       ✓         ✓                        ✓
Serpent    vslice    Clang     ✓                    ✓

4 Experimental Results

We use the Supercop [13] benchmarking framework to measure the throughput of reference implementations of 3 ciphers (AES, Chacha20 and Serpent) in CTR mode and 2 ciphers in ECB mode (DES and Rectangle). To provide a fair comparison, focused solely on the implementation of the cryptographic primitive, we benchmark our implementations following the same key schedule and CTR increment algorithms. This puts us at a slight disadvantage over some implementations that fuse the runtime into the primitive, so as to further speed up computation. Unlike the hand-tuned assembly implementations, ours are bound to respect the x86 calling conventions, implying a small overhead. It is future work to integrate the cryptographic runtime as part of our model, so as to be able to optimize it as well. We have conducted our benchmarks on an Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz machine running Ubuntu 16.04.5 LTS (Xenial Xerus) with Clang 7.0.0, GCC 8.1.0 and ICC 18.0.2.

By design, our sliced implementations are constant-time: no data-dependent branches or accesses to memory are ever made. We nonetheless experimentally validated this claim using the dudect [68] tool, which statistically checks that the execution time of our primitive is independent of its input. All our implementations have received a green flag, unsurprisingly.

In Table 2, we provide a high-level view of our benchmark results: for each cipher and each of its supported slicing modes, we indicate the combination of C compiler and Usubac features (Section 3.2) that led to the code delivering the highest throughput. We observe that no C compiler stands out as producing the most efficient code in all situations. The same goes for Usubac optimizations: their potency depends on the algorithm as well as its slicing mode. The combination of inlining, unrolling and scheduling is a requirement on bitsliced code. Interleaving is generally useful in m-sliced modes (vertical or horizontal), under the proviso that the cipher uses a small number of registers (which is not the case for Chacha20 or AES).
4.1 Benchmarks

In Table 3, we report the benchmarks of our Usuba implementations of 5 ciphers against the most efficient publicly available implementations. We provide the SLOC count of the cipher primitive (i.e. excluding key schedule and counter management) for every implementation. Usuba codes are almost always shorter than the reference ones, as well as more portable: for each cipher, a single Usuba code is used to generate every specialized SIMD code. Latencies are shown for illustrative purposes, and are roughly the same for Usuba and reference implementations. Unsurprisingly, bitslicing exhibits higher latencies than m-slicing since more blocks must be loaded before computation can proceed.

Our first benchmark is DES [57] in Electronic Codebook (ECB) mode [58]. Needless to say, DES in general and its ECB mode in particular are not secure and should not be used in practice. However, the simplicity and longevity of DES make for an interesting benchmark: bitslicing was born out of the desire to provide fast software implementations of DES, which results in publicly-available, optimized C implementations in bitsliced mode for 64-bit general-purpose registers. Our implementation, using the same S-box, is faster, thanks in large part to our careful treatment of register spilling.

Our second benchmark is AES [59], whose fastest hand-tuned SSSE3 and AVX implementations are hsliced. The two reference implementations are given in hand-tuned assembly. On AVX, one can take advantage of 3-operand, non-destructive instructions, which significantly reduces register pressure. Thanks to Usubac, we have compiled our (single) implementation of AES to both targets, our AVX implementation taking full advantage of the extended instruction set thanks to the C compiler. Our generated code lags behind the hand-tuned implementations because the latter fuse the implementation of the counter-mode (CTR) run-time into the cipher itself. In particular, they locally violate the x86 calling conventions so that they can return multiple values within registers instead of returning a pointer to a structure. If we manually modify our implementation to avoid this indirection (and, as a result, violate the calling conventions), our AVX implementation is on par with the corresponding hand-tuned one (i.e. within the margin of error).

Table 3. Comparison between Usuba code & reference implementations

Mode         Cipher              Instr. set  Code size (SLOC)   Throughput (cycles/byte)  Speedup   Latency (cycles)
                                             Ref.     Usuba     Ref.     Usuba                       Usuba
bitslicing   DES [43, 44]        x86-64      1053     655       12.01    11.47            +4.50%     5872
16-hslicing  AES [37, 38]        SSSE3       272      218       7.77     7.92             -1.93%     1013
16-hslicing  AES [39, 41]        AVX         339      218       5.59     5.71             -2.15%     730
32-vslicing  Chacha20 [12, 27]   AVX2        20       24        1.03     1.02             +0.97%     522
32-vslicing  Chacha20 [12, 53]   AVX         134      24        2.09     2.07             +0.96%     529
32-vslicing  Chacha20 [12, 53]   SSSE3       134      24        2.72     2.31             +15.07%    591
32-vslicing  Chacha20 [12]       x86-64      26       24        5.64     5.65             -0.18%     361
32-vslicing  Serpent [18, 33]    AVX2        300      214       4.33     4.53             -4.62%     579
32-vslicing  Serpent [18, 34]    AVX         300      214       8.36     8.66             -3.59%     554
32-vslicing  Serpent [18, 40]    SSE2        300      214       11.48    11.29            +1.66%     722
32-vslicing  Serpent [18, 62]    x86-64      300      214       30.37    25.78            +15.11%    412
16-vslicing  Rectangle [72]      AVX2        115      31        2.45     2.10             +14.29%    268
16-vslicing  Rectangle [72]      AVX         115      31        4.92     4.21             +14.43%    269
16-vslicing  Rectangle [72]      SSE4.2      115      31        14.51    11.18            +22.95%    715
16-vslicing  Rectangle [72]      x86-64      115      31        28.61    25.88            +9.54%     207

Chacha20 [12] has been designed to be efficiently implementable in software, with an eye toward SIMD. The fastest implementation to date is vsliced. Key to our success has been our ability to do instruction scheduling within each round as well as across successive rounds. Chacha20's round is composed of 4 updates of 4 variables which, done sequentially, suffer from data hazards. Our scheduling algorithm is able to effectively interleave those updates, thus removing any data hazard.

Serpent [18] was another contender in the AES competition. It can be implemented with a small number of 64-bit registers (fewer than 8). Its fastest implementation is written in assembly, exploiting the AVX2 instruction set. It interleaves 2 parallel executions of the cipher. While Usubac generates C code using this optimization, the assembly produced by Clang contains about 25% more memory MOVs than the reference implementation, thus making us fall a few percent behind. On general-purpose 32-bit registers, however, the reference C code does not use interleaving, allowing us to be about 15% faster.

Rectangle [72] is a lightweight cipher, designed for relatively fast execution on micro-controllers. It only consumes a handful of registers. The reference implementation of Rectangle is manually vsliced in C++ and fails to take advantage of interleaving: as a result, our straightforward Usuba implementation easily outperforms the reference one.

4.2 Scalability

Figure 3 shows the scaling of our implementations on the main SIMD extensions available on Intel: SSE (SSE4.2), AVX, AVX2 and AVX512.
Those benchmarks were compiled using ICC from AVX2 because the latter introduced shifts, n-bit integer 18.0.1, and executed on a Intel(R) Xeon(R) Gold 6126 CPU @ arithmetic, and byte-shuffle on 256 bits, thus making it more 2.60GHz running a Scientific Linux 7.5 (Nitrogen). suitable for slicing on 256 bits. PLDI ’19, June 22–26, 2019, Phoenix, AZ, USA Darius Mercadier and Pierre-Évariste Dagand

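To make the ternary-logic instruction concrete, here is a minimal C sketch of ours (not Usubac output, and the helper names are hypothetical) showing how a three-input XOR can be fused into a single vpternlogq through the _mm512_ternarylogic_epi64 intrinsic. The 8-bit immediate is the truth table of the desired Boolean function, indexed by the bits of the three operands; any other three-operand function is obtained by changing the immediate.

    #include <immintrin.h>

    /* a ^ b ^ c in one instruction: immediate 0x96 is the truth table of
       three-input XOR (bit (a<<2 | b<<1 | c) of 0x96 gives the result). */
    static inline __m512i xor3(__m512i a, __m512i b, __m512i c) {
        return _mm512_ternarylogic_epi64(a, b, c, 0x96);
    }

    /* Without the ternary-logic instruction, the same computation
       takes two bitwise instructions. */
    static inline __m512i xor3_twostep(__m512i a, __m512i b, __m512i c) {
        return _mm512_xor_si512(_mm512_xor_si512(a, b), c);
    }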
[Figure 3 is a bar chart plotting the speedup (y-axis, from 0 to 5) of the bitsliced, hsliced and vsliced implementations of DES, AES, Serpent, Rectangle and Chacha20 when compiled for GP 64-bit, SSE 128-bit, AVX 128-bit, AVX2 256-bit and AVX512 512-bit registers.]

Figure 3. Scalability of SIMD compilation

Bitsliced Rectangle and DES both scale nicely. Like any bitsliced algorithm, they suffer from a lot of spilling, and therefore benefit greatly from having more registers available: using AVX instructions instead of SSE allows the use of non-destructive 3-operand instructions, yielding a 10 to 15% speedup without using wider registers. When doubling the word size by using 256-bit AVX2 registers instead of 128-bit ones, the speedup is unsurprisingly doubled. The benefits of AVX512 are twofold. First, it provides twice as many registers as AVX2, thus reducing spilling. Second, it offers a bitwise ternary logic instruction, which ICC exploits to fuse nested logical gates. Overall, this allows Rectangle and DES to run almost 5 times faster on AVX512 than on SSE. Bitsliced AES, on the other hand, does not benefit from wider registers, because it contains many more spill and memory operations: about 6 times more than Rectangle and DES. Consequently, increasing the register size puts more pressure on the caches: while AES suffers from less than 1% misses in the L1 data cache when using general-purpose 64-bit registers, that number grows to 6% on SSE, 14% on AVX2, and up to 50% on AVX512.

Overall, m-sliced programs are similar except when it comes to shuffles and m-bit arithmetic. As a result, they exhibit similar scaling behavior. They tend to use only a handful of variables and do not suffer from spilling as much as bitsliced implementations, which explains why using AVX instructions yields only a marginal – if any – improvement in throughput. As in bitslicing, AVX2 registers allow the throughput to be doubled. AVX512 is particularly beneficial because of the aforementioned vpternlogq instruction. This translates into a good speedup on Rectangle and AES, while the effect is reduced on Chacha20, which involves few Boolean operations.

4.3 Monomorphizations

As explained in Section 2.3, we can specialize our polymorphic implementation of Rectangle to any slicing mode and SIMD instruction set: for instance, on AVX2, the vsliced C implementation of Rectangle generated from the Usuba code will have the type

void Rectangle (__m256i plain[4], __m256i key[26][4],
                __m256i cipher[4])

while the bitsliced SSE implementation has type

void Rectangle (__m128i plain[64], __m128i key[26][64],
                __m128i cipher[64])

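As an illustration, the following C sketch (ours, not part of the paper's artifact) drives the AVX2 vsliced monomorph above, which processes 16 independent 64-bit blocks per call, one 16-bit row per block in each vector lane. The driver name, the layout of the input arrays, and the assumption that round keys and data have already been transposed into vsliced form are ours.

    #include <immintrin.h>
    #include <stdint.h>

    /* Generated by Usubac (AVX2, 16-vslicing). */
    void Rectangle(__m256i plain[4], __m256i key[26][4], __m256i cipher[4]);

    /* rows[r][i] holds the 16-bit row r of block i, for 16 blocks;
       rk holds the 26 round keys, already in vsliced form. */
    void encrypt16(const uint16_t rows[4][16], __m256i rk[26][4],
                   uint16_t out[4][16]) {
        __m256i plain[4], cipher[4];
        for (int r = 0; r < 4; r++)              /* pack 16 copies of row r */
            plain[r] = _mm256_loadu_si256((const __m256i *) rows[r]);
        Rectangle(plain, rk, cipher);
        for (int r = 0; r < 4; r++)              /* unpack the 16 ciphertexts */
            _mm256_storeu_si256((__m256i *) out[r], cipher[r]);
    }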
This allows us to present the first – to the best of our knowledge – comparison of vsliced, hsliced and bitsliced implementations of a single cipher (Figure 4).

[Figure 4 is a bar chart reporting, in cycles/byte, the cost of the vsliced, hsliced and bitsliced monomorphizations of Rectangle on general-purpose (64-bit), SSE (128-bit), AVX (128-bit), AVX2 (256-bit) and AVX512 (512-bit) registers, each bar stacking the transposition cost on top of the cost of the Rectangle cipher itself.]

Figure 4. Monomorphizations of Rectangle

Overall, vslicing is much more efficient than bitslicing and hslicing. One of the main reasons is that it uses a very efficient transposition (about 15 instructions), as it only needs to transpose down to 16-bit words, unlike bitslicing, where the transposition goes all the way down to single bits. Both m-slicing techniques use only a handful of registers, which explains why bitsliced implementations are slower than m-sliced ones.

Excluding transposition, vslicing and hslicing exhibit very similar performance. This is not surprising since both codes are similar: their only difference comes from the rotations of Rectangle, each of which is done with a shuffle in hslicing, and with 2 shifts and an or in vslicing. However, shuffles can only be executed on Port 5 of the Skylake CPU, while shifts and ors can be executed on Ports 0, 1 and 5, which means that both versions of the rotations have the same throughput in isolation.

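For concreteness, the left rotation by one bit of each 16-bit word required by Rectangle might look as follows in the vsliced AVX2 setting; this is a hand-written sketch, not Usubac's actual output.

    #include <immintrin.h>

    /* Vsliced rotation: x <<< 1 on each 16-bit lane, i.e. 2 shifts and an or. */
    static inline __m256i rotl16_by1(__m256i x) {
        return _mm256_or_si256(_mm256_slli_epi16(x, 1),
                               _mm256_srli_epi16(x, 15));
    }

In the hsliced setting, the bits of each word are spread across distinct lane positions, so the same rotation amounts to a single byte shuffle (e.g. _mm256_shuffle_epi8 with a constant pattern), which, as noted above, only executes on Port 5.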
On 64-bit general-purpose registers, bitslicing is actually faster than vslicing because the latter processes only one block at a time, a consequence of the lack of an x86-64 instruction shifting 4 packed 16-bit words within a single 64-bit register.

5 Related Work

Bitslicing compilers. The idea of automating the generation of bitsliced primitives was developed by Pornin [67], who designed a proof-of-concept compiler called bsc. We borrow bsc's syntax for lookup tables and permutations. As a prototype, it has been used to bitslice DES, for which it produced an excessively large C file (65,000 variables for 65,000 statements) exhibiting poor performance. To the best of our knowledge, this is the only attempt at mechanically deriving a bitsliced implementation from a high-level algorithmic description: some clues left in bitsliced reference implementations suggest that cryptographers use scripting languages to help set up the skeleton of their programs, but the grunt work of connecting and optimizing the pieces is left to the programmer.

The previous version of Usuba also dealt with bitslicing only. Our earlier work [51] explored the impact of bitslicing on a broad range of SIMD architectures (from SSE to AVX512, but also Arm Neon and IBM AltiVec). Thanks to m-slicing, Usuba can now be used to implement a broader range of ciphers while enabling new optimizations (m-sliced scheduling, interleaving). In the present work, we also formally account for the front-end (type system and type-class mechanism).

The swarc compiler of Fisher [29] can be understood as a slicing compiler of sorts. SIMD Within A Register stems from an attempt at defining a programming model to unify the vendor-specific SIMD instruction sets available at the turn of the millennium. By programming within this model, programmers do not need to worry about the availability of particular SIMD instructions on target architectures: when an instruction is not realized in hardware, the compiler provides "emulation code". Our use case takes us in the opposite direction: it forbids us from turning to emulation, since our interest is precisely to expose as much of the (relevant) detail of the target SIMD architecture to the programmer.

Data-parallel languages. The difficulty of writing efficient and future-proof SIMD programs has nonetheless persisted, hence the search for alternative programming models. The Single Program Multiple Data (SPMD) programming model has recently been suggested to address the frustration of C programmers trying to vectorize their inherently sequential programs [66]. Usuba takes a similar stance: to take full advantage of hardware-level parallelism, we must be able to describe the parallelism of our programs. Our scope is much narrower, however: we are concerned with the high-throughput realization of straight-line, bit-twiddling programs whose functional and observational correctness is of paramount importance. This rules out the use of a variant of C or C++ as a source language.

Our work bears some similarity to (and is indeed culturally acclimated to) the work on Data Parallel Haskell [48, 65]. The combination of parametric polymorphism, ad-hoc overloading, and equational reasoning is common to both Haskell and Usuba. In principle, we conjecture that one could translate – with minor modifications – our Usuba programs into semantically equivalent Haskell ones. However, a Haskell program comes with a runtime system that is superfluous for our purposes, while the Turing completeness of the language would impede future efforts on automated verification.

Constant-time cryptography. Several complementary tools have been proposed to guarantee that cryptographic primitives run in constant time. A first approach consists in verifying a posteriori that a given implementation is constant time [3]. Using a formal timing model of an abstract machine, one can use automated theorem provers to show that a program is safe within this model. Taking advantage of such a security verifier, which processes LLVM code, Cauligi et al. [26] side-step the C compiler, substituting it with a domain-specific language that lets the user indicate whether an input is secret or not. Thanks to a powerful type system, they help the programmer write constant-time code: the compiler exploits types to avoid leaking secret information through timing, and carries the annotations across to the LLVM code, where ct-verif [3] is used for verification.

Portable assemblers. Rather than fight against C compilers, cryptographers tend to develop their own assemblers, aiming at the same time at writing high-performance code and at automating much of the boilerplate [11, 60]. The semantics of these "languages" remain largely undocumented, making for a poor foundation to carry out verification work. Recent works have been trying to bridge that gap [2, 20] by introducing full-featured portable assembly languages, backed by a formal semantics and automated theorem-proving facilities. The arguments put forward by these works resonate with our own: we have struggled to domesticate 3 compilers, taming them into somewhat docile register allocators for SIMD intrinsics. While this effort automatically ripples through to all Usuba programs, it is still a wasteful process. In the future, we would like to remove our dependency on C compilers by directly targeting one of these portable assemblers.

Languages for cryptography. More generally, cryptographers with an inclination for verification have been prone to invest in domain-specific languages. This methodology is exemplified by Cryptol [47], whose primary goal is to support validation and verification of crypto-systems. Similarly, the F* language is a key component of the Everest project [15], which provides a verified implementation of HTTPS. CAO [7, 54] is also an inspiration to Usuba: it has demonstrated that some crypto-systems can be described as operations in a given algebraic structure (the finite field GF(2^8) for AES), using a compiler to produce somewhat efficient code from such a high-level description. Usuba aims at closing that gap too, starting from measurably efficient code and now working its way up to higher levels of abstraction.

Cryptographic implementation tricks. Historically, bitslicing was born out of Biham's ingenious software implementation of DES, and it quickly became folklore amongst cryptographers, leading to it being used to speed up other ciphers and applied on SIMD architectures. Interleaving [49] is another technique, introduced in the context of the Camellia cipher, which we turn into a systematic program transformation in Usuba. Other techniques, such as counter caching [63] or natively supporting Galois field arithmetic [23, 24], would be natural extensions of the present work.

6 Conclusion

We have presented Usuba, a synchronous dataflow programming language designed to write bitsliced code. Using parametric polymorphism, Usuba allows us to concisely and generically describe block ciphers, which can then be specialized to a given slicing mode and architecture to produce efficient C code.

We implemented domain-specific optimizations, like interleaving and scheduling, and as a result, Usubac produces C code whose performance is on par with hand-tuned assembly. We backed up this claim by benchmarking our code against some of the best hand-tuned bitsliced (DES), hsliced (AES) and vsliced (Serpent, Rectangle, Chacha20) implementations available.

Future work. Usuba is restricted to describing combinational circuits. We would like to extend it with a controlled form of stateful computation that would, by typing, be compatible with sliced compilation. For instance, Trivium [22] is a stateful stream cipher in which the bits of the state are only used 64 rounds after their definition. It can therefore be efficiently bitsliced on 64-bit registers.

Usuba currently focuses solely on the cryptographic primitive, to the exclusion of the cryptographic run-time (key schedule, CTR increment). Integrating these elements into Usuba would enable further optimizations, like counter caching [14]: in CTR mode, some ciphers, such as AES, exhibit low diffusion during individual rounds, so some of the early computations of the cipher can be cached instead of being recomputed for every block, yielding a speedup of up to 15%.

Finally, bitslicing has been used to protect cryptographic code against fault attacks, using temporal and spatial redundancy [45, 64] for instance. While protecting an implementation by hand is a tedious and error-prone task, we could add a front-end to Usuba to automatically add such counter-measures.

Acknowledgments

The original idea of a bitslicing compiler was suggested to us by Xavier Leroy, to whom we are deeply indebted. We are also grateful to Lionel Lacassagne for allowing us to borrow his AVX-512 machine to run our benchmarks. We would like to thank our shepherd, Ayal Zaks, the anonymous reviewers and the artifact reviewers for their thoughtful comments. This work was partially funded by the Émergence(s) program of the City of Paris and the EDITE doctoral school.

References

[1] 2009. IEEE Standard VHDL Language Reference Manual. IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002). https://doi.org/10.1109/IEEESTD.2009.4772740
[2] José Bacelar Almeida, Manuel Barbosa, Gilles Barthe, Arthur Blot, Benjamin Grégoire, Vincent Laporte, Tiago Oliveira, Hugo Pacheco, Benedikt Schmidt, and Pierre-Yves Strub. 2017. Jasmin: High-Assurance and High-Speed Cryptography. In CCS 2017. 1807–1823. https://doi.org/10.1145/3133956.3134078
[3] José Bacelar Almeida, Manuel Barbosa, Gilles Barthe, François Dupressoir, and Michael Emmi. 2016. Verifying Constant-Time Implementations. In USENIX Security 2016. 53–70. https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/almeida
[4] Gogul Balakrishnan and Thomas W. Reps. 2010. WYSINWYX: What you see is not what you eXecute. ACM Trans. Program. Lang. Syst. 32, 6 (2010), 23:1–23:84. https://doi.org/10.1145/1749608.1749612
[5] Subhadeep Banik, Sumit Kumar Pandey, Thomas Peyrin, Yu Sasaki, Siang Meng Sim, and Yosuke Todo. 2017. GIFT: A Small Present – Towards Reaching the Limit of Lightweight Encryption. In CHES 2017. 321–345. https://doi.org/10.1007/978-3-319-66787-4_16
[6] Zhenzhen Bao, Peng Luo, and Dongdai Lin. 2015. Bitsliced Implementations of the PRINCE, LED and RECTANGLE Block Ciphers on AVR 8-Bit Microcontrollers. In ICICS 2015. 18–36. https://doi.org/10.1007/978-3-319-29814-6_3
[7] Manuel Barbosa, Andrew Moss, Dan Page, Nuno F. Rodrigues, and Paulo F. Silva. 2011. Type Checking Cryptography Implementations. In FSEN 2011. 316–334. https://doi.org/10.1007/978-3-642-29320-7_21
[8] Gilles Barthe. 2015. High-Assurance Cryptography: Cryptographic Software We Can Trust. IEEE Security & Privacy 13, 5 (2015), 86–89. https://doi.org/10.1109/MSP.2015.112
[9] Adnan Baysal and Sühap Sahin. 2015. RoadRunneR: A Small and Fast Bitslice Block Cipher for Low Cost 8-Bit Processors. In LightSec 2015. 58–76. https://doi.org/10.1007/978-3-319-29078-2_4
[10] Daniel J. Bernstein. 2005. Cache-timing attacks on AES. http://cr.yp.to/papers.html#cachetiming
[11] Daniel J. Bernstein. 2007. qhasm software package. https://cr.yp.to/qhasm.html
[12] Daniel J. Bernstein. 2008. ChaCha, a variant of Salsa20. In SASC 2008. https://cr.yp.to/papers.html#chacha
[13] Daniel J. Bernstein and Tanja Lange (editors). 2018. SUPERCOP: System for Unified Performance Evaluation Related to Cryptographic Operations and Primitives. https://bench.cr.yp.to/supercop.html
[14] Daniel J. Bernstein and Peter Schwabe. 2008. New AES Software Speed Records. In INDOCRYPT 2008. 322–336. https://doi.org/10.1007/978-3-540-89754-5_25
[15] Karthikeyan Bhargavan, Barry Bond, Antoine Delignat-Lavaud, Cédric Fournet, Chris Hawblitzel, Catalin Hritcu, Samin Ishtiaq, Markulf Kohlweiss, Rustan Leino, Jay R. Lorch, Kenji Maillard, Jianyang Pan, Bryan Parno, Jonathan Protzenko, Tahina Ramananandro, Ashay Rane, Aseem Rastogi, Nikhil Swamy, Laure Thompson, Peng Wang, Santiago Zanella Béguelin, and Jean Karim Zinzindohoue. 2017. Everest: Towards a Verified, Drop-in Replacement of HTTPS. In SNAPL 2017. 1:1–1:12. https://doi.org/10.4230/LIPIcs.SNAPL.2017.1
[16] Dariusz Biernacki, Jean-Louis Colaço, Grégoire Hamon, and Marc Pouzet. 2008. Clock-directed modular code generation for synchronous data-flow languages. In LCTES 2008. 121–130. https://doi.org/10.1145/1375657.1375674
[17] Eli Biham. 1997. A fast new DES implementation in software. In FSE 1997. https://doi.org/10.1007/BFb0052352
[18] Eli Biham, Ross J. Anderson, and Lars R. Knudsen. 1998. Serpent: A New Block Cipher Proposal. In FSE 1998. 222–238. https://doi.org/10.1007/3-540-69710-1_15
[19] Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and C. Vikkelsoe. 2007. PRESENT: An Ultra-Lightweight Block Cipher. In CHES 2007. 450–466. https://doi.org/10.1007/978-3-540-74735-2_31
[20] Barry Bond, Chris Hawblitzel, Manos Kapritsos, K. Rustan M. Leino, Jacob R. Lorch, Bryan Parno, Ashay Rane, Srinath T. V. Setty, and Laure Thompson. 2017. Vale: Verifying High-Performance Cryptographic Assembly Code. In USENIX Security 2017. https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/bond
[21] David Brumley and Dan Boneh. 2005. Remote timing attacks are practical. Computer Networks 48, 5 (2005), 701–716. https://doi.org/10.1016/j.comnet.2005.01.010
[22] Christophe De Cannière. 2006. Trivium: A Stream Cipher Construction Inspired by Block Cipher Design Principles. In ISC 2006. 171–186. https://doi.org/10.1007/11836810_13
[23] David Canright. 2005. A Very Compact S-Box for AES. In CHES 2005. 441–455. https://doi.org/10.1007/11545262_32
[24] David Canright and Dag Arne Osvik. 2009. A More Compact AES. In SAC 2009. 157–169. https://doi.org/10.1007/978-3-642-05445-7_10
[25] P. Caspi, D. Pilaud, N. Halbwachs, and J.A. Plaice. 1987. LUSTRE: A declarative language for programming synchronous systems. In POPL 1987. 178–188.
[26] Sunjay Cauligi, Gary Soeller, Fraser Brown, Brian Johannesmeyer, Yunlu Huang, Ranjit Jhala, and Deian Stefan. 2017. FaCT: A Flexible, Constant-Time Programming Language. In SecDev 2017. 69–76. https://doi.org/10.1109/SecDev.2017.24
[27] Romain Dolbeau and Daniel J. Bernstein. 2014. chacha20 dolbeau amd64-avx2. https://bench.cr.yp.to/supercop.html
[28] Sébastien Duval and Gaëtan Leurent. 2018. MDS Matrices with Lightweight Circuits. IACR Trans. Symmetric Cryptol. 2018, 2 (2018), 48–78. https://doi.org/10.13154/tosc.v2018.i2.48-78
[29] Randall James Fisher. 2003. General-purpose SIMD Within a Register: Parallel Processing on Consumer Microprocessors. Ph.D. Dissertation. West Lafayette, IN, USA. AAI3108343.
[30] Vinodh Gopal, Jim Guilford, Wajdi Feghali, Erdinc Ozturk, Gil Wolrich, and Martin Dixon. 2010. Processing Multiple Buffers in Parallel to Increase Performance on Intel Architecture Processors. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/communications-ia-multi-buffer-paper.pdf
[31] Vincent Grosso, Gaëtan Leurent, François-Xavier Standaert, and Kerem Varici. 2014. LS-Designs: Bitslice Encryption for Efficient Masked Software Implementations. In FSE 2014. 18–37. https://doi.org/10.1007/978-3-662-46706-0_2
[32] John L. Gustafson. 1988. Reevaluating Amdahl's Law. Commun. ACM 31, 5 (1988), 532–533. https://doi.org/10.1145/42411.42415
[33] Johannes Götzfried. 2012. Serpent AVX2. https://github.com/jkivilin/supercop-blockciphers/tree/beyond_master/crypto_stream/serpent128ctr/avx2-16way-1
[34] Johannes Götzfried. 2012. Serpent AVX2. https://github.com/jkivilin/supercop-blockciphers/tree/beyond_master/crypto_stream/serpent128ctr/avx-8way-1
[35] Kenneth E. Iverson. 1980. Notation as a Tool of Thought. Commun. ACM 23, 8 (1980), 444–465. https://doi.org/10.1145/358896.358899
[36] Stefan Kaes. 1988. Parametric Overloading in Polymorphic Programming Languages. In ESOP 1988. 131–144. https://doi.org/10.1007/3-540-19027-9_9
[37] Emilia Käsper and Peter Schwabe. 2009. AES-CTR, non-constant-time key setup. https://cryptojedi.org/crypto/data/aes-ctr-128-const-intel64-20090611.tar.bz2
[38] Emilia Käsper and Peter Schwabe. 2009. Faster and Timing-Attack Resistant AES-GCM. In CHES 2009. https://doi.org/10.1007/978-3-540-74735-2_9
[39] Jussi Kivilinna. 2011. Block Ciphers: Fast Implementations on x86-64 Architecture. https://github.com/jkivilin/supercop-blockciphers/tree/beyond_master/crypto_stream/aes128ctr/avx
[40] Jussi Kivilinna. 2011. Serpent implementation. https://github.com/jkivilin/supercop-blockciphers/tree/beyond_master/crypto_stream/serpent128ctr/sse2-8way
[41] Jussi Kivilinna. 2013. Block Ciphers: Fast Implementations on x86-64 Architecture. Master's thesis. University of Oulu, Department of Information Processing Science.
[42] Robert Könighofer. 2008. A Fast and Cache-Timing Resistant Implementation of the AES. In CT-RSA 2008. 187–202. https://doi.org/10.1007/978-3-540-79263-5_12
[43] Matthew Kwan. 2000. Bitslice DES. http://www.darkside.com.au/bitslice/
[44] Matthew Kwan. 2000. Reducing the Gate Count of Bitslice DES. IACR Cryptology ePrint Archive 2000 (2000), 51. http://eprint.iacr.org/2000/051
[45] Benjamin Lac, Anne Canteaut, Jacques J. A. Fournier, and Renaud Sirdey. 2018. Thwarting Fault Attacks against Lightweight Cryptography using SIMD Instructions. In ISCAS 2018. 1–5. https://doi.org/10.1109/ISCAS.2018.8351693
[46] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO 2004.
[47] J. R. Lewis and B. Martin. [n. d.]. Cryptol: high assurance, retargetable crypto development and validation, Vol. 2. 820–825.
[48] Geoffrey Mainland, Roman Leshchinskiy, and Simon Peyton Jones. 2017. Exploiting vector instructions with generalized stream fusion. Commun. ACM 60, 5 (2017), 83–91. https://doi.org/10.1145/3060597
[49] Mitsuru Matsui. 2006. How Far Can We Go on the x64 Processors?. In FSE 2006. 341–358. https://doi.org/10.1007/11799313_22
[50] Mitsuru Matsui and Junko Nakajima. 2007. On the Power of Bitslice Implementation on Intel Core2 Processor. In CHES 2007. 121–134. https://doi.org/10.1007/978-3-540-74735-2_9
[51] Darius Mercadier, Pierre-Évariste Dagand, Lionel Lacassagne, and Gilles Muller. 2018. Usuba: Optimizing & Trustworthy Bitslicing Compiler. In WPMVP@PPoPP 2018. 4:1–4:8. https://doi.org/10.1145/3178433.3178437
[52] Ahmad Moghimi, Thomas Eisenbarth, and Berk Sunar. 2018. MemJam: A False Dependency Attack Against Constant-Time Crypto Implementations in SGX. In CT-RSA 2018. 21–44. https://doi.org/10.1007/978-3-319-76953-0_2
[53] Andrew Moon. [n. d.]. chacha-opt. https://github.com/floodyberry/chacha-opt
[54] Andrew Moss and Dan Page. 2010. Bridging the gap between symbolic and efficient AES implementations. In PEPM 2010. 101–110. https://doi.org/10.1145/1706356.1706376
[55] Rajeev Motwani, Krishna V. Palem, Vivek Sarkar, and Salem Reyen. 1995. Combining register allocation and instruction scheduling. Courant Institute, New York University (1995).
[56] Samuel Neves and Jean-Philippe Aumasson. 2012. Implementing BLAKE with AVX, AVX2, and XOP. IACR Cryptology ePrint Archive 2012 (2012), 275. http://eprint.iacr.org/2012/275
[57] National Bureau of Standards. 1977. Announcing the Data Encryption Standard. Technical Report FIPS Publication 46.
[58] National Bureau of Standards. 1980. DES modes of operation. Technical Report FIPS Publication 81.
[59] National Bureau of Standards. 2001. Announcing the Advanced Encryption Standard (AES). Technical Report.
[60] OpenSSL. [n. d.]. perlasm. https://github.com/openssl/openssl/tree/master/crypto/perlasm
[61] Dag Arne Osvik. 2000. Speeding up Serpent. In AES Candidate Conference. 317–329.
[62] Dag Arne Osvik and Jussi Kivilinna. 2002. Linux serpent. https://github.com/jkivilin/supercop-blockciphers/tree/beyond_master/crypto_stream/serpent128ctr/linux_c
[63] Jin Park and Dong Lee. 2018. FACE: Fast AES CTR mode Encryption Techniques based on the Reuse of Repetitive Data. IACR Transactions on Cryptographic Hardware and Embedded Systems 2018, 3 (2018), 469–499. https://doi.org/10.13154/tches.v2018.i3.469-499
[64] Conor Patrick, Bilgiday Yuce, Nahid Farhady Ghalaty, and Patrick Schaumont. 2016. Lightweight Fault Attack Resistance in Software Using Intra-instruction Redundancy. In SAC 2016. 231–244. https://doi.org/10.1007/978-3-319-69453-5_13
[65] Simon L. Peyton Jones. 2008. Harnessing the Multicores: Nested Data Parallelism in Haskell. In APLAS 2008. 138. https://doi.org/10.1007/978-3-540-89330-1_10
[66] Matt Pharr and William R. Mark. 2012. ispc: A SPMD compiler for high-performance CPU programming. In InPar 2012. 1–13.
[67] Thomas Pornin. 2001. Implantation et optimisation des primitives cryptographiques. Ph.D. Dissertation. École Normale Supérieure. http://www.bolet.org/~pornin/2001-phd-pornin.pdf
[68] Oscar Reparaz, Josep Balasch, and Ingrid Verbauwhede. 2017. Dude, is my code constant time?. In DATE 2017. 1697–1702. https://doi.org/10.23919/DATE.2017.7927267
[69] Frédéric Rocheteau. 1992. Extension du langage LUSTRE et application à la conception de circuits : le langage LUSTRE-V4 et le système POLLUX (Extension of the LUSTRE language and application to hardware design: the LUSTRE-V4 language and the POLLUX system). Ph.D. Dissertation. Grenoble Institute of Technology, France. https://tel.archives-ouvertes.fr/tel-00342092
[70] Markus Ullrich, Christophe De Cannière, Sebastiaan Indesteege, Özgül Küçük, Nicky Mouha, and Bart Preneel. 2011. Finding optimal bitsliced implementations of 4×4-bit S-boxes. In SKEW 2011. 16–17. http://skew2011.mat.dtu.dk/proceedings/Finding%20Optimal%20Bitsliced%20Implementations%20of%204%20to%204-bit%20S-boxes.pdf
[71] Philip Wadler and Stephen Blott. 1989. How to Make ad-hoc Polymorphism Less ad-hoc. In POPL 1989. 60–76. https://doi.org/10.1145/75277.75283
[72] Wentao Zhang, Zhenzhen Bao, Dongdai Lin, Vincent Rijmen, Bohan Yang, and Ingrid Verbauwhede. 2014. RECTANGLE: A Bit-slice Ultra-Lightweight Block Cipher Suitable for Multiple Platforms. IACR Cryptology ePrint Archive 2014 (2014), 84. http://eprint.iacr.org/2014/084
[73] Jean Karim Zinzindohoué, Karthikeyan Bhargavan, Jonathan Protzenko, and Benjamin Beurdouche. 2017. HACL*: A Verified Modern Cryptographic Library. IACR Cryptology ePrint Archive 2017 (2017), 536. http://eprint.iacr.org/2017/536