“Make Sure DSA Signing Exponentiations Really Are Constant-Time”

“Make Sure DSA Signing Exponentiations Really are Constant-Time” Cesar Pereida García Billy Bob Brumley Yuval Yarom Department of Computer Department of Pervasive The University of Adelaide and Science Computing Data61, CSIRO, Australia Aalto University, Finland Tampere University of [email protected] cesar.pereida@aalto.fi Technology, Finland billy.brumley@tut.fi ABSTRACT Our attack builds upon several techniques to profile the TLS and SSH are two of the most commonly used proto- cache memory and capture timing signals. The signals are cols for securing Internet traffic. Many of the implemen- processed and converted into a sequence of square and mul- tations of these protocols rely on the cryptographic primi- tiplication (SM) operations from which we extract informa- tives provided in the OpenSSL library. In this work we dis- tion to create a lattice problem. The solution to the lattice close a vulnerability in OpenSSL, affecting all versions and problem yields the secret key of digital signatures. forks (e.g. LibreSSL and BoringSSL) since roughly October Flush+Reload [40] is a powerful technique to perform 2005, which renders the implementation of the DSA signa- cache-timing attacks. We adapt the Flush+Reload tech- ture scheme vulnerable to cache-based side-channel attacks. nique to OpenSSL's implementation of DSA and, exploit- Exploiting the software defect, we demonstrate the first pub- ing properties of the Intel implementation of the x86 and lished cache-based key-recovery attack on these protocols: x64 processor architectures, our spy program probes rele- 260 SSH-2 handshakes to extract a 1024/160-bit DSA host vant memory addresses to create a signal trace. key from an OpenSSH server, and 580 TLS 1.2 handshakes We process the captured signal to get the SM sequence to extract a 2048/256-bit DSA key from an stunnel server. performed by the sliding window exponentiation (SWE) al- gorithm. Then we observe and analyze the number of bits that can be extracted and used from each of those sequences. Keywords Later, the variable amount of bits extracted from each trace applied cryptography; digital signatures; side-channel anal- is used as input to a lattice attack that recovers the private ysis; timing attacks; cache-timing attacks; DSA; OpenSSL; key. CVE-2016-2178 To bridge the gap between the limited resolution of the Flush+Reload technique [4] and the high-performance of 1. INTRODUCTION the OpenSSL code we apply the performance-degradation One of the contributing factors to the explosion of the technique of Allan et al. [4]. This technique slows the expo- Internet in the last decade is the security provided by the nentiation by an average factor of 20, giving a high resolution underlying cryptographic protocols. Two of those protocols trace and allowing us to extract up to 8 bits of information are the Transport Layer Security (TLS) protocol, which pro- from some of the traces. vides security to network communication and the more spe- Similar to previous works [9, 14, 21, 32], we perform a cialized Secure Shell (SSH), which provides secure login to lattice attack to recover the secret key. We use the lattice remote hosts. construction of Benger et al. [9] and solve the resulting lat- Software implementations of these protocols often use the tice problem using the lattice reduction technique of Nguyen cryptographic primitives' implementations of the OpenSSL and Shparlinski [28]. cryptographic library. Consequently, the security of these A unique feature of our work is that we target common implementations depends on the security of OpenSSL. cryptographic protocols. Previous works that demonstrate In this paper we present a novel side-channel cache-timing cache-timing key-recovery attacks only target the crypto- attack against OpenSSL's DSA implementation. The attack graphic primitives, ignoring potential cache noise from the exploits a vulnerability in OpenSSL, which fails to use a side- protocol implementation. In contrast, we present end-to- channel-secure implementation of modular exponentiation end attacks on two common cryptographic protocols: SSH | the core mathematical operation used in DSA signatures. and TLS. We are, therefore, the first to demonstrate that cache-timing attacks are a threat not only when executing Permission to make digital or hard copies of part or all of this work for personal or the cryptographic primitives but also in the presence of the classroom use is granted without fee provided that copies are not made or distributed cache activity of the whole protocol suite. for profit or commercial advantage and that copies bear this notice and the full citation Our contributions in this work are the following: on the first page. Copyrights for third-party components of this work must be honored. CCS’16 October 24-28, 2016, Vienna, Austria • We identify a security weakness in OpenSSL which c 2016 Copyright held by the owner/author(s). fails to use a side-channel safe implementation when ACM ISBN 978-1-4503-4139-4/16/10. performing DSA signatures. (Section 3) DOI: http://dx.doi.org/10.1145/2976749.2978420 • We describe how to use a combination of the Flush+ 1639 Reload technique with a performance-degradation at- cache levels above them. In the case of Intel processors, tack to leak information from the unsafe SWE algo- the contents of the L1 and L2 caches is also stored in the rithm. (Section 4) last-level cache. A consequence of the inclusion property is that when data is evicted from the last-level cache it is also • We present the first key-recovery cache-timing attack evicted from all of the other levels of cache in the processor. on the TLS and SSH cryptographic protocols. (Sec- Intel architecture implements several cache optimizations. tion 5) The spatial pre-fetcher pairs cache lines and attempt to fetch the pair of a missed line [17]. Consecutive accesses to mem- • We construct and solve a lattice problem with the side- ory addresses are detected and pre-fetched when the pro- channel information and the digital signatures in order cessor anticipates they may be required [17]. Additionally, to recover the secret key. (Section 6) when the processor is presented with a conditional branch, speculative execution brings the data of both branches into 2. BACKGROUND the cache before the branch condition is evaluated [35]. Page [30] noted that tracing the sequence of cache hits 2.1 Memory Hierarchy and misses of software may leak information on the internal working of the software, including information that may lead Accessing data and instructions from main memory is a to recovering cryptographic keys. time consuming operation which delays the work of the fast This idea was later extended and used for mounting sev- processors, for that reason the memory hierarchy includes eral cache-based side-channel attacks [10, 29, 31]. Other smaller and faster memories called caches. Caches improve attacks were shown against the L1-instruction cache [3], the the performance by exploiting the spatial and temporal lo- branch prediction buffer [1, 2] and the last-level cache [20, cality of the memory access. 22, 25, 40]. In modern processors the hierarchy of caches is structured as follows, higher-level caches, located closer to the processor 2.2 The Flush+Reload Attack core, are smaller and faster than low-level caches, which are Our LLC-based attack is based on the [20, located closer to main memory. Recent Intel architecture Flush+Reload 40] attack, which is a cache-based side-channel attack tech- typically has three levels of cache: L1, L2 and Last-Level nique. Cache (LLC). Unlike the earlier technique [29, 31] that Each core has two L1 caches, a data cache and an instruc- Prime+Probe detects activity in cache sets, the technique tion cache, each 32 KiB in size with an access time of 4 Flush+Reload identifies access to memory lines, giving it a higher resolu- cycles. L2 caches are also core-private and have an inter- tion, a high accuracy and high signal-to-noise ratio. mediate size (256 KiB) and latency (7 cycles). The LLC is Like , relies on cache shar- shared among all of the cores and is a unified cache, con- Prime+Probe Flush+Reload ing between processes. Additionally, it requires data shar- taining both data and instructions. Typical LLC sizes are ing, which is typically achieved through the use of shared in megabytes and access time is in the order of 40 cycles. libraries or using page de-duplication [6, 36]. The unit of memory and allocation in a cache is called A round of the attack, which identifies victim access to cache line. Cache lines are of a fixed size B, which is typ- a shared memory line, consists of three phases. (See Algo- ically 64 bytes. The lg(B) low-order bits of the address, rithm 1.) In the first phase the adversary evicts the mon- called line offset, are used to locate the datum in the cache itored memory line from the cache. In the second phase, line. the adversary waits a period of time so the victim has an When a memory address is accessed, the processor checks opportunity to access the memory line. In the third phase, the availability of the address line in the top-level L1 cache. the adversary measures the time it takes to reload the mem- If the data is there then it is served to the processor, a ory line. If during the second phase the victim accesses the situation referred to as a cache hit. In a cache miss, when memory line, the line will be available in the cache and the the data is not found in the L1 cache, the processor repeats reload operation in the third phase will take a short time.

Load more