ARMageddon: Cache Attacks on Mobile Devices Moritz Lipp, Daniel Gruss, Raphael Spreitzer, Clémentine Maurice, and Stefan Mangard, Graz University of Technology https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/lipp
This paper is included in the Proceedings of the 25th USENIX Security Symposium August 10–12, 2016 • Austin, TX ISBN 978-1-931971-32-4
Open access to the Proceedings of the 25th USENIX Security Symposium is sponsored by USENIX AR geddon c e Att cks on o le e ces
Moritz Lipp, Daniel Gruss, Raphael Spreitzer, Clementine Maurice, and Stefan Mangard ra University of Technology, Austria
A str ct the possibility to use lush Reload to automatically ex- In the last 10 years, cache attacks on Intel x86 CPUs have ploit cache-based side channels via cache template at- gained increasing attention among the scientific com- tacks on Intel platforms. lush Reload does not only al- munity and powerful techniques to exploit cache side low for efficient attacks against cryptographic implemen- channels have been developed. However, modern smart- tations [8,26,56], but also to infer keystroke information phones use one or more multi-core ARM CPUs that have and even to build keyloggers on Intel platforms [19]. In a different cache organization and instruction set than contrast to attacks on cryptographic algorithms, which Intel x86 CPUs. So far, no cross-core cache attacks have are typically triggered multiple times, these attacks re- been demonstrated on non-rooted Android smartphones. quire a significantly higher accuracy as an attacker has In this work, we demonstrate how to solve key chal- only one single chance to observe a user input event. lenges to perform the most powerful cross-core cache at- Although a few publications about cache attacks on tacks Prime Probe, lush Reload, Evict Reload, and AES T-table implementations on mobile devices ex- lush lush on non-rooted ARM-based devices without ist [10, 50–52, 57], the more efficient cross-core attack any privileges. Based on our techniques, we demonstrate techniques Prime Probe, lush Reload, Evict Reload, covert channels that outperform state-of-the-art covert and lush lush [18] have not been applied on smart- channels on Android by several orders of magnitude. phones. In fact, there was reasonable doubt [60] whether Moreover, we present attacks to monitor tap and swipe these cross-core attacks can be mounted on ARM-based events as well as keystrokes, and even derive the lengths devices at all. In this work, we demonstrate that these of words entered on the touchscreen. Eventually, we are attack techniques are applicable on ARM-based devices the first to attack cryptographic primitives implemented by solving the following key challenges systematically: in Java. Our attacks work across CPUs and can even 1. ast level caches are not inclusive on ARM and thus monitor cache activity in the ARM TrustZone from the cross core attacks cannot rely on this property In- normal world. The techniques we present can be used to deed, existing cross-core attacks exploit the inclu- attack hundreds of millions of Android devices. siveness of shared last-level caches [18, 19, 22, 24, 35, 37, 38, 42, 60] and, thus, no cross-core attacks ntrod ct on have been demonstrated on ARM so far. We present an approach that exploits coherence protocols and Cache attacks represent a powerful means of exploit- L1-to-L2 transfers to make these attacks applicable ing the different access times within the memory hi- on mobile devices with non-inclusive shared last- 1 erarchy of modern system architectures. Until re- level caches, irrespective of the cache organization. cently, these attacks explicitly targeted cryptographic 2. Most modern smartphones have multiple CPUs that implementations, for instance, by means of cache tim- do not share a cache However, cache coherence ing attacks [9] or the well-known Evict Time and protocols allow CPUs to fetch cache lines from re- Prime Probe techniques [43]. The seminal paper mote cores faster than from the main memory. We by Yarom and Falkner [60] introduced the so-called utilize this property to mount both cross-core and lush Reload attack, which allows an attacker to infer cross-CPU attacks. which specific parts of a binary are accessed by avic- 1Simultaneously to our work on ARM, Irazoqui et al. [25] devel- tim program with an unprecedented accuracy and prob- oped a technique to exploit cache coherence protocols on AMD x86 ing frequency. Recently, Gruss et al. [19] demonstrated CPUs and mounted the first cross-CPU cache attack.
1 USENIX Association 25th USENIX Security Symposium 549 3. e t A v -A P s A ro essors do not level caches that are instruction-inclusive and data- su ort flus instru tion In these cases, a fast non-inclusive as well as caches that are instruction- eviction strategy must be applied for high-fre uency non-inclusive and data-inclusive. measurements. As existing eviction strategies are Our cache-based covert channel outperforms all ex- • too slow, we analyze more than 4 200 eviction isting covert channels on Android by several orders strategies for our test devices, based on Rowham- of magnitude. mer attack techni ues [17]. We demonstrate the power of these attacks • 4. A P s use seudo-r ndo re e ent o - by attacking cryptographic implementations and i to decide which cache line to replace within a by inferring more fine-grained information like cache set. This introduces additional noise even for keystrokes and swipe actions on the touchscreen. robust time-driven cache attacks [50, 52]. For the same reason, Pri e Pro e has been an open chal- tl ne The remainder of this paper is structured as lenge [51] on ARM, as an attacker needs to predict follows. In Section 2, we provide information on back- which cache line will be replaced first and wrong ground and related work. Section 3 describes the tech- predictions destroy measurements. We design re- ni ues that are the building blocks for our attacks. In access loops that interlock with a cache eviction Section 4, we demonstrate and evaluate fast cross-core strategy to reduce the effect of wrong predictions. and cross-CPU covert channels on Android. In Sec- 5. e- ur te ti ings re uire root ess on tion 5, we demonstrate cache template attacks on user A [3] and alternatives have not been evaluated so input events. In Section 6, we present attacks on crypto- far. We evaluate different timing sources and show graphic implementations used in practice as well the pos- that cache attacks can be mounted in any case. sibility to observe cache activity of cryptographic com- Based on these building blocks, we demonstrate prac- 2 putations within the TrustZone. We discuss countermea- tical and highly efficient cache attacks on ARM. We sures in Section 7 and conclude this work in Section 8. do not restrict our investigations to cryptographic im- plementations but also consider cache attacks as a means to infer other sensitive information—such as ckgro nd nd Rel ted ork inter-keystroke timings or the length of a swipe action— re uiring a significantly higher measurement accuracy. In this section, we provide the re uired preliminaries and Besides these generic attacks, we also demonstrate that discuss related work in the context of cache attacks. cache attacks can be used to monitor cache activity caused within the ARM TrustZone from the normal world. Nevertheless, we do not aim to exhaustively list c es possible exploits or find new attack vectors on crypto- Today’s CPU performance is influenced not only by the graphic algorithms. Instead, we aim to demonstrate the clock fre uency but also by the latency of instructions, immense attack potential of the presented cross-core and operand fetches, and other interactions with internal and cross-CPU attacks on ARM-based mobile devices based external devices. In order to overcome the latency of on well-studied attack vectors. Our work allows to ap- system memory accesses, CPUs employ caches to buffer ply existing attacks to millions of off-the-shelf Android fre uently used data in small and fast internal memories. devices without any privileges. Furthermore, our investi- Modern caches organize cache lines in multiple sets, gations show that Android still employs vulnerable AES which is also known as set-associative caches. Each T-table implementations. memory address maps to one of these cache sets and ad- dresses that map to the same cache set are considered ontr t ons The contributions of this work are: congruent. Congruent addresses compete for cache lines We demonstrate the applicability of highly efficient within the same set and a predefined replacement policy • cache attacks like Pri e Pro e, us e o d, determines which cache line is replaced. For instance, vi t e o d, and us us on ARM. the last generations of Intel CPUs employ an undocu- Our attacks work irrespective of the actual cache or- mented variant of least-recently used (LRU) replacement • ganization and, thus, are the first last-level cache policy [17]. ARM processors use a pseudo-LRU replace- attacks that can be applied cross-core and also ment policy for the L1 cache and they support two dif- cross-CPU on off-the-shelf ARM-based devices. ferent cache replacement policies for L2 caches, namely More specifically, our attacks work against last- round-robin and pseudo-random replacement policy. In practice, however, only the pseudo-random replacement 2Source code for ARMageddon attack examples can be found at policy is used due to performance reasons. Switching https://github.com/IAIK/armageddon. the cache replacement policy is only possible in privi-
2 550 25th USENIX Security Symposium USENIX Association leged mode. The implementation details for the pseudo- opened or accessed, an attacker can map a binary to have random policy are not documented. read-only shared memory with a victim program. A sim- CPU caches can either be virtually indexed or phys- ilar effect is caused by content-based page deduplication ically indexed, which determines whether the index is where physical pages with identical content are merged. derived from the virtual or physical address. A so-called Android applications are usually written in Java and, tag uni uely identifies the address that is cached within thus, contain self-modifying code or just-in-time com- a specific cache line. Although this tag can also be based piled code. This code would typically not be shared. on the virtual or physical address, most modern caches Since Android version 4.4 the Dalvik VM was gradu- use physical tags because they can be computed simul- ally replaced by the Android Runtime (ART). With ART, taneously while locating the cache set. ARM typically Java byte code is compiled to native code binaries [1] and uses physically indexed, physically tagged L2 caches. thus can be shared too. CPUs have multiple cache levels, with the lower lev- els being faster and smaller than the higher levels. ARM processors typically have two levels of cache. If all cache c e Att cks lines from lower levels are also stored in a higher-level Initially, cache timing attacks were performed on cryp- cache, the higher-level cache is called inclusive. If a tographic algorithms [9, 30, 31, 40, 41, 44, 55]. For ex- cache line can only reside in one of the cache levels at ample, Bernstein [9] exploited the total execution time any point in time, the caches are called e clusive. If the of AES T-table implementations. More fine-grained cache is neither inclusive nor exclusive, it is called non exploitations of memory accesses to the CPU cache inclusive. The last-level cache is often shared among have been proposed by Percival [45] and Osvik et al. all cores to enhance the performance upon transitioning [43]. More specifically, Osvik et al. formalized two con- threads between cores and to simplify cross-core cache cepts, namely Evict Time and Prime Probe, to deter- lookups. However, with shared last-level caches, one mine which specific cache sets were accessed by a victim core can (intentionally) influence the cache content of all program. Both approaches consist of three basic steps. other cores. This represents the basis for cache attacks Evict+Time like lush Reload [60]. 1. Measure execution time of victim program. In order to keep caches of multiple CPU cores or CPUs 2. Evict a specific cache set. in a coherent state, so-called coherence protocols are em- 3. Measure execution time of victim program again. ployed. However, coherence protocols also introduce Prime+Probe exploitable timing effects, which has recently been ex- 1. Occupy specific cache sets. ploited by Irazo ui et al. [25] on x86 CPUs. 2. Victim program is scheduled. In this paper, we demonstrate attacks on three smart- 3. Determine which cache sets are still occupied. phones as listed in Table 1. The Krait 400 is an ARMv7- Both approaches allow an adversary to determine A CPU, the other two processors are ARMv8-A CPUs. which cache sets are used during the victim’s compu- However, the stock Android of the Alcatel One Touch tations and have been exploited to attack cryptographic Pop 2 is compiled for an ARMv7-A instruction set and implementations [24, 35, 43, 54] and to build cross-VM thus ARMv8-A instructions are not used. We generically covert channels [37]. Yarom and Falkner [60] proposed refer to ARMv7-A and ARMv8-A as “ARM architec- lush Reload, a significantly more fine-grained attack ture” throughout this paper. All devices have a shared L2 that exploits three fundamental concepts of modern sys- cache. On the Samsung Galaxy S6, the flush instruction tem architectures. First, the availability of shared mem- is unlocked by default, which means that it is available ory between the victim process and the adversary. Sec- in userspace. Furthermore, all devices employ a cache ond, last-level caches are typically shared among all coherence protocol between cores and on the Samsung cores. Third, Intel platforms use inclusive last-level Galaxy S6 even between the two CPUs [6]. caches, meaning that the eviction of information from the last-level cache leads to the eviction of this data from all red emor lower-level caches of other cores, which allows any pro- gram to evict data from other programs on other cores. Read-only shared memory can be used as a means of While the basic idea of this attack has been proposed by memory usage optimization. In case of shared libraries it Gullasch et al. [21], Yarom and Falkner extended this reduces the memory footprint and enhances the speed by idea to shared last-level caches, allowing cross-core at- lowering cache contention. The operating system imple- tacks. lush Reload works as follows. ments this behavior by mapping the same physical mem- Flush+Reload ory into the address space of each process. As this mem- 1. Map binary (e.g., shared object) into address space. ory sharing mechanism is independent of how a file was 2. Flush a cache line (code or data) from the cache.
3 USENIX Association 25th USENIX Security Symposium 551 Table 1: Test devices used in this paper.
Device SoC CPU (cores) L1 caches L2 cache Inclusiveness OnePlus Qualcomm Krait 400 (2) 2 16 KB, 2 048 KB, non-inclusive × One Snapdragon 801 2.5 GHz 4-way, 64 sets 8-way, 2 048 sets Alcatel One Qualcomm Cortex-A53 (4) 4 32 KB, 512 KB, instruction-inclusive, × Touch Pop 2 Snapdragon 410 1.2 GHz 4-way, 128 sets 16-way, 512 sets data-non-inclusive Cortex-A53 (4) 4 32 KB, 256 KB, instruction-inclusive, × Samsung Samsung Exynos 1.5 GHz 4-way, 128 sets 16-way, 256 sets data-non-inclusive Galaxy S6 7 Octa 7420 Cortex-A57 (4) 4 32 KB, 2 048 KB, instruction-non-inclusive, × 2.1 GHz 2-way, 256 sets 16-way, 2 048 sets data-inclusive
3. Schedule the victim program. board. As Wei et al. [57] claimed that noise makes 4. Check if the corresponding line from step 2 has the attack difficult, Spreitzer and Plos [52] investigated been loaded by the victim program. the applicability of Bernstein’s cache-timing attack on Thereby, lush Reload allows an attacker to deter- different ARM Cortex-A8 and ARM Cortex-A9 smart- mine which specific instructions are executed and also phones running Android. Both investigations [52, 57] which specific data is accessed by the victim program. confirmed that timing information is leaking, but theat- Thus, rather fine-grained attacks are possible and have tack takes several hours due to the high number of mea- already been demonstrated against cryptographic im- surement samples that are re uired, i e , about 230 AES plementations [22, 27, 28]. Furthermore, Gruss et al. encryptions. Later on, Spreitzer and Gerard [50] im- [19] demonstrated the possibility to automatically ex- proved upon these results and managed to reduce the key ploit cache-based side-channel information based on space to a complexity which is practically relevant. the lush Reload approach. Besides attacking crypto- Besides Bernstein’s attack, another attack against AES graphic implementations like AES T-table implementa- T-table implementations has been proposed by Bog- tions, they showed how to infer keystroke information danov et al. [10], who exploited so-called wide collisions and even how to build a keylogger by exploiting the on an ARM9 microprocessor. In addition, power analysis cache side channel. Similarly, Oren et al. [42] demon- attacks [13] and electromagnetic emanations [14] have strated the possibility to exploit cache attacks on Intel been used to visualize cache accesses during AES com- platforms from JavaScript and showed how to infer vis- putations on ARM microprocessors. Furthermore, Spre- ited websites and how to track the user’s mouse activity. itzer and Plos [51] implemented Evict Time [43] in or- Gruss et al. [19] proposed the Evict Reload techni ue der to attack an AES T-table implementation on Android- that replaces the flush instruction in lush Reload by based smartphones. However, so far only cache attacks eviction. While it has no practical application on x86 against AES T-table implementations have been consid- CPUs, we show that it can be used on ARM CPUs. Re- ered on smartphone platforms and none of the recent ad- cently, lush lush [18] has been proposed. Unlike vances have been demonstrated on mobile devices. other techni ues, it does not perform any memory ac- cess but relies on the timing of the flush instruction to AR geddon Att ck ec n es determine whether a line has been loaded by a victim. We show that the execution time of the ARMv8-A flush We consider a scenario where an adversary attacks a instruction also depends on whether or not data is cached smartphone user by means of a malicious application. and, thus, can be used to implement this attack. This application does not re uire any permission and, While the attacks discussed above have been proposed most importantly, it can be executed in unprivileged and investigated for Intel processors, the same attacks userspace and does not re uire a rooted device. As our were considered not applicable to modern smartphones attack techni ues do not exploit specific vulnerabilities due to differences in the instruction set, the cache or- of Android versions, they work on stock Android ROMs ganization [60], and in the multi-core and multi-CPU as well as customized ROMs in use today. architecture. Thus, only same-core cache attacks have been demonstrated on smartphones so far. For instance, efe t ng t e c e rg n t on Wei et al. [57] investigated Bernstein’s cache-timing at- tack [9] on a Beagleboard employing an ARM Cortex- In this section, we tackle the aforementioned challenges A8 processor. Later on, Wei et al. [58] investigated this 1 and 2, i e , the last-level cache is not inclusive and mul- timing attack in a multi-core setting on a development tiple processors do not necessarily share a cache level.
4 552 25th USENIX Security Symposium USENIX Association Core 0 Core 1 Hit (same core) Hit (cross-core)
4 Miss (same core) Miss (cross-core) L1I L1D L1I L1D 10 · 3 Sets 2
(1) L2 Unified Cache (2) 1
Sets 0 (3) Number of accesses 0 200 400 600 800 1,000 Measured access time in CPU cycles Figure 1: Cross-core instruction cache eviction through data accesses. Figure 2: Histograms of cache hits and cache misses measured same-core and cross-core on the OnePlus One. When it comes to caches, ARM CPUs are very hetero- geneous compared to Intel CPUs. For example, whether Figure 2 shows the cache hit and miss histogram on the or not a CPU has a second-level cache can be decided by OnePlus One. The cross-core access introduces a latency the manufacturer. Nevertheless, the last-level cache on of 40 CPU cycles on average. However, cache misses ARM devices is usually shared among all cores and it can take more than 500 CPU cycles on average. Thus, cache have different inclusiveness properties for instructions hits and misses are clearly distinguishable based on a sin- and data. Due to cache coherence, shared memory is gle threshold value. kept in a coherent state across cores and CPUs. This is of importance when measuring timing differences between cache accesses and memory accesses (cache misses), as st c e ct on fast remote-cache accesses are performed instead of slow memory accesses [6]. In case of a non-coherent cache, a In this section, we tackle the aforementioned challenges cross-core attack is not possible but an attacker can run 3 and 4, i e , not all ARM processors support a flush in- the spy process on all cores simultaneously and thus fall struction, and the replacement policy is pseudo-random. back to a same-core attack. However, we observed that There are two options to evict cache lines: (1) the caches are coherent on all our test devices. flush instruction or (2) evict data with memory accesses To perform a cross-core attack we load enough data to congruent addresses, i e , addresses that map to the into the cache to fully evict the corresponding last-level same cache set. As the flush instruction is only available cache set. Thereby, we exploit that we can fill the last- on the Samsung Galaxy S6, we need to rely on eviction level cache directly or indirectly depending on the cache strategies for the other devices and, therefore, to defeat organization. On the Alcatel One Touch Pop 2, the last- the replacement policy. The L1 cache in Cortex-A53 and level cache is instruction-inclusive and thus we can evict Cortex-A57 has a very small number of ways and em- instructions from the local caches of the other core. Fig- ploys a least-recently used (LRU) replacement policy [5]. ure 1 illustrates such an eviction. In step 1, an instruc- However, for a full cache eviction, we also have to evict tion is allocated to the last-level cache and the instruc- cache lines from the L2 cache, which uses a pseudo- tion cache of one core. In step 2, a process fills its core’s random replacement policy. data cache, thereby evicting cache lines into the last-level cache. In step 3, the process has filled the last-level cache ct on str teg es Previous approaches to evict data set using only data accesses and thereby evicts the in- on Intel x86 platforms either have too much over- structions from instruction caches of other cores as well. head [23] or are only applicable to caches implement- We access cache lines multiple times to perform trans- ing an LRU replacement policy [35, 37, 42]. Spreitzer fers between L1 and L2 cache. Thus, more and more and Plos [51] proposed an eviction strategy for ARMv7- addresses used for eviction are cached in either L1 or L2. A CPUs that re uires to access more addresses than As ARM CPUs typically have L1 caches with a very low there are cache lines per cache set, due to the pseudo- associativity, the probability of eviction to L2 through random replacement policy. Recently, Gruss et al. [17] other system activity is high. Using an eviction strategy demonstrated how to automatically find fast eviction that performs fre uent transfers between L1 and L2 in- strategies on Intel x86 architectures. We show that creases this probability further. Thus, this approach also their algorithm is applicable to ARM CPUs as well. works for other cache organizations to perform cross- Thereby, we establish eviction strategies in an automated core and cross-CPU cache attacks. Due to the cache co- way and significantly reduce the overhead compared to herence protocol between the CPU cores [6,33], remote- [51]. We evaluated more than 4 200 access patterns on core fetches are faster than memory accesses and thus our smartphones and identified the best eviction strate- can be distinguished from cache misses. For instance, gies. Even though the cache employs a random replace-
5 USENIX Association 25th USENIX Security Symposium 553 Table 2: Different eviction strategies on the Krait 400. Table 3: Different eviction strategies on the Cortex-A53.
NAD Cycles Eviction rate NAD Cycles Eviction rate - - - 549 100.00 - - - 767 100.00 11 2 2 1 578 100.00 23 2 5 6 209 100.00 12 1 3 2 094 100.00 23 4 6 16 912 100.00 13 1 5 2 213 100.00 22 1 6 5 101 99.99 16 1 1 3 026 100.00 21 1 6 4 275 99.93 24 1 1 4 371 100.00 20 4 6 13 265 99.44 13 1 2 2 372 99.58 800 1 1 142 876 99.10 11 1 3 1 608 80.94 200 1 1 33 110 96.04 11 4 1 1 948 58.93 100 1 1 15 493 89.77 10 2 2 1 275 51.12 48 1 1 6 517 70.78
ment policy, average eviction rate and average execu- Flush (address cached) Flush (address not cached) tion time are reproducible. Eviction sets are computed 104 · based on physical addresses, which can be retrieved via 3 /proc/self/pagemap as current Android versions al- 2 low access to these mappings to any unprivileged app 1 without any permissions. Thus, eviction patterns and Number of cases 0 eviction sets can be efficiently computed. 0 100 200 300 400 500 600 We applied the algorithm of Gruss et al. [17] to a set Measured execution time in CPU cycles of physically congruent addresses. Table 2 summarizes different eviction strategies, i e , loop parameters, for the Figure 3: Histograms of the execution time of the flush Krait 400. N denotes the total eviction set size (length of operation on cached and not cached addresses measured the loop), A denotes the shift offset (loop increment) to on the Samsung Galaxy S6. be applied after each round, and D denotes the number of memory accesses in each iteration (loop body). The col- umn cycles states the average execution time in CPU cy- cles over 1 million evictions and the last column denotes A53 would take 28 times more CPU cycles to achieve an the average eviction rate. The first line in Table 2 shows average eviction rate of only 99.10 , thus it is not suit- the average execution time and the average eviction rate able for attacks on the last-level cache as used in previous for the privileged flush instruction, which gives the best work [51]. The reason for this is that data can only be al- result in terms of average execution time (549 CPU cy- located to L2 cache by evicting it from the L1 cache on cles). We evaluated 1863 different strategies and our best the ARM Cortex-A53. Therefore, it is better to reaccess identified eviction strategy (N = 11, A = 2, D = 2) also the data that is already in the L2 cache and gradually add achieves an average eviction rate of 100 but takes 1578 new addresses to the set of cached addresses instead of CPU cycles. Although a strategy accessing every address accessing more different addresses. in the eviction set only once (A = 1, D = 1, also called On the ARM Cortex-A57 the userspace flush in- LRU eviction) performs significantly fewer memory ac- struction was significantly faster in any case. Thus, cesses, it consumes more CPU cycles. For an average for lush Reload we use the flush instruction and for eviction rate of 100 , LRU eviction re uires an eviction Prime Probe the eviction strategy. Falling back to set size of at least 16. The average execution time then Evict Reload is not necessary on the Cortex-A57. Sim- is 3026 CPU cycles. Considering the eviction strategy ilarly to recent Intel x86 CPUs, the execution time of the used in [51] that takes 4371 CPU cycles, clearly demon- flush instruction on ARM depends on whether or notthe strates the advantage of our optimized eviction strategy value is cached, as shown in Figure 3. The execution that takes only 1578 CPU cycles. time is higher if the address is cached and lower if the We performed the same evaluation with 2295 different address is not cached. This observation allows us to dis- strategies on the ARM Cortex-A53 in our Alcatel One tinguish between cache hits and cache misses depending Touch Pop 2 test system and summarize them in Table 3. on the timing behavior of the flush instruction, and there- For the best strategy we found (N = 21, A = 1, D = 6), we fore to perform a lush lush attack. Thus, in case of measured an average eviction rate of 99.93 and an av- shared memory between the victim and the attacker, it is erage execution time of 4275 CPU cycles. We observed not even re uired to evict and reload an address in order that LRU eviction (A = 1, D = 1) on the ARM Cortex- to exploit the cache side channel.
6 554 25th USENIX Security Symposium USENIX Association Hit (PMCCNTR) Hit (clock gettime .15) A note on Prime+Probe Finding a fast eviction strat- × Miss (PMCCNTR) Miss (clock gettime .15) × egy for Prime Probe on architectures with a random Hit (syscall .25) Hit (counter thread .05) × × replacement policy is not as straightforward as on In- Miss (syscall .25) Miss (counter thread .05) 104 × × tel x86. Even in case of x86 platforms, the problem of · cache trashing has been discussed by Tromer et al. [54]. 4 Cache trashing occurs when reloading (probing) an ad- dress evicts one of the addresses that are to be accessed 2 next. While Tromer et al. were able to overcome this Number of accesses 0 problem by using a doubly-linked list that is accessed 0 20 40 60 80 100 120 140 160 180 200 forward during the prime step and backwards during the Measured access time (scaled) probe step, the random replacement policy on ARM also contributes to the negative effect of cache trashing. Figure 4: Histogram of cross-core cache hits/misses on We analyzed the behavior of the cache and designed the Alcatel One Touch Pop 2 using different methods. a prime step and a probe step that work with a smaller X-values are scaled for visual representation. set size to avoid set thrashing. Thus, we set the evic- tion set size to 15 on the Alcatel One Touch Pop 2. As n r leged s sc ll The perf_event_open we run the Prime Probe attack in a loop, exactly 1 way syscall is an abstract layer to access perfor- in the L2 cache will not be occupied after a few attack mance information through the kernel indepen- rounds. We might miss a victim access in 1 of the cases, 16 dently of the underlying hardware. For instance, which however is necessary as otherwise we would not PERF_COUNT_HW_CPU_CYCLES returns an accurate be able to get reproducible measurements at all due to set cycle count including a minor overhead due to the thrashing. If the victim replaces one of the 15 ways occu- syscall. The availability of this feature depends on the pied by the attacker, there is still one free way to reload Android kernel configuration, e.g., the stock kernel on the address that was evicted. This reduces the chance of the Alcatel One Touch Pop 2 as well as the OnePlus set thrashing significantly and allows us to successfully One provide this feature by default. Thus, in contrast perform Prime Probe on caches with a random replace- to previous work [51], the attacker does not have to ment policy. load a kernel module to access this information as the perf_event_open syscall can be accessed without any privileges or permissions. Acc r te n r leged m ng f nct on Another alternative to obtain suf- In this section, we tackle the aforementioned challenge 5, ficiently accurate timing information is the POSIX i e , cycle-accurate timings require root access on ARM. function clock_gettime(), with an accuracy In order to distinguish cache hits and cache misses, in the range of microseconds to nanoseconds. timing sources or dedicated performance counters can be Similar information can also be obtained from used. We focus on timing sources, as cache misses have /proc/timer_list. a significantly higher access latency and timing sources are well studied on Intel x86 CPUs. Cache attacks on ed c ted t re d t mer If no interface with sufficient x86 CPUs employ the unprivileged rdtsc instruction accuracy is available, an attacker can run a thread to obtain a sub-nanosecond resolution timestamp. The that increments a global variable in a loop, provid- ARMv7-A architecture does not provide an instruction ing a fair approximation of a cycle counter. Our ex- for this purpose. Instead, the ARMv7-A architecture periments show that this approach works reliably on has a performance monitoring unit that allows to mon- smartphones as well as recent x86 CPUs. The resolu- itor CPU activity. One of these performance counters— tion of this threaded timing information is as high as denoted as cycle count register (PMCCNTR)—can be with the other methods. used to distinguish cache hits and cache misses by re- lying on the number of CPU cycles that passed during In Figure 4 we show the cache hit and miss histogram a memory access. However, these performance counters based on the four different methods, including the cycle are not accessible from userspace by default and an at- count register, on a Alcatel One Touch Pop 2. Despite the tacker would need root privileges. latency and noise, cache hits and cache misses are clearly We broaden the attack surface by exploiting timing distinguishable with all approaches. Thus, all methods sources that are accessible without any privileges or per- can be used to implement cache attacks. Determining missions. We identified three possible alternatives for the best timing method on the device under attack can be timing measurements. done in a few seconds during an online attack.
7 USENIX Association 25th USENIX Security Symposium 555 g erform nce o ert nnels slower, Evict Reload is slower than lush Reload, and retransmission might be necessary in 0.14 of the cases To evaluate the performance of our attacks, we measure where eviction is not successful (cf. Section 3.2). On the the capacity of cross-core and cross-CPU cache covert older OnePlus One we achieve a cross-core transmission channels. A covert channel enables two unprivileged ap- rate of 12 537 bps at an error rate of 5.00 , 3 times faster plications on a system to communicate with each other than previous covert channels on smartphones. The without using any data transfer mechanisms provided by reason for the higher error rate is the additional timing the operating system. This communication evades the noise due to the cache coherence protocol performing a sandboxing concept and the permission system (cf. col- high number of remote-core fetches. lusion attacks [36]). Both applications were running in the background while the phone was mostly idle and an unrelated app was running as the foreground application. Att ck ng ser n t on m rt ones Our covert channel is established on addresses of a shared library that is used by both the sender and the re- In this section we demonstrate cache side-channel at- ceiver. While both processes have read-only access to the tacks on Android smartphones. We implement cache shared library, they can transmit information by loading template attacks [19] to create and exploit accu- addresses from the shared library into the cache or evict- rate cache-usage profiles using the Evict Reload or ing (flushing) it from the cache, respectively. lush Reload attack. Cache template attacks have a pro- The covert channel transmits packets of n-bit data, an filing phase and an exploitation phase. In the profiling s-bit sequence number, and a c-bit checksum that is com- phase, a template matrix is computed that represents how puted over data and sequence number. The sequence many cache hits occur on a specific address when trig- number is used to distinguish consecutive packets and gering a specific event. The exploitation phase uses this the checksum is used to check the integrity of the packet. matrix to infer events from cache hits. The receiver acknowledges valid packets by responding To perform cache template attacks, an attacker has with an s-bit sequence number and an -bit checksum. to map shared binaries or shared libraries as read-only By adjusting the sizes of checksums and sequence num- shared memory into its own address space. By us- bers the error rate of the covert channel can be controlled. ing shared libraries, the attacker bypasses any potential Each bit is represented by one address in the shared countermeasures taken by the operating system, such as library, whereas no two addresses are chosen that map restricted access to runtime data of other apps or address to the same cache set. To transmit a bit value of 1, the space layout randomization (ASLR). The attack can even sender accesses the corresponding address in the library. be performed online on the device under attack if the To transmit a bit value of 0, the sender does not access event can be simulated. the corresponding address, resulting in a cache miss on Triggering the actual event that an attacker wants to the receiver’s side. Thus, the receiving process observes spy on might require either (1) an offline phase or (2) a cache hit or a cache miss depending on the memory ac- privileged access. For instance, in case of a keylogger, cess performed by the sender. The same method is used the attacker can gather a cache template matrix offline for the acknowledgements sent by the receiving process. for a specific version of a library, or the attacker relies on privileged access of the application (or a dedicated per- We implemented this covert channel using mission) in order to be able to simulate events for gath- Evict Reload, lush Reload, and lush lush on ering the cache template matrix. However, the actual ex- our smartphones. The results are summarized in Table 4. ploitation of the cache template matrix to infer events On the Samsung Galaxy S6, we achieve a cross-core neither requires privileged access nor any permission. transmission rate of 1 140 650 bps at an error rate of 1.10 . This is 265 times faster than any existing covert channel on smartphones. In a cross-CPU transmission Att ck ng red r r we achieve a transmission rate of 257 509 bps at an error rate of 1.83 . We achieve a cross-core transition rate of Just as Linux, Android uses a large number of shared li- 178 292 bps at an error rate of 0.48 using lush lush braries, each with a size of up to several megabytes. We on the Samsung Galaxy S6. On the Alcatel One Touch inspected all available libraries on the system by man- Pop 2 we achieve a cross-core transmission rate of ually scanning the names and identified libraries that 13 618 bps at an error rate of 3.79 using Evict Reload. might be responsible for handling user input, e.g., the This is still 3 times faster than previous covert channels libinput.so library. Without loss of generality, we re- on smartphones. The covert channel is significantly stricted the set of attacked libraries since testing all li- slower on the Alcatel One Touch Pop 2 than on the braries would have taken a significant amount of time. Samsung Galaxy S6 because the hardware is much Yet, an adversary could exhaustively probe all libraries.
8 556 25th USENIX Security Symposium USENIX Association Table 4: Comparison of covert channels on Android.
Work Type Bandwidth [bps] Error rate Ours (Samsung Galaxy S6) lush Reload, cross-core 1.10 Ours (Samsung Galaxy S6) lush Reload, cross-CPU 1.83 Ours (Samsung Galaxy S6) lush lush, cross-core 0.48 Ours (Alcatel One Touch Pop 2) Evict Reload, cross-core 3.79 Ours (OnePlus One) Evict Reload, cross-core 5.00 Marforio et al. [36] Type of Intents 4 300 Marforio et al. [36] UNIX socket discovery 2 600 Schlegel et al. [48] File locks 685 Schlegel et al. [48] Volume settings 150 Schlegel et al. [48] Vibration settings 87
We automated the search for addresses in these shared key libraries and after identifying addresses, we monitored longpress swipe them in order to infer user input events. For in- Event tap stance, in the profiling phase on libinput.so, we sim- text ulated events via the android-debug bridge (adb shell)
with two different methods. The first method uses 0x840 0x880 0x3280 0x7700 0x8080 0x8100 0x8140 0x8840 0x8880 0x8900 0x8940 0x8980
the input command line tool to simulate user input 0x11000 0x11040 0x11080 events. The second method is writing event messages Addresses to /dev/input/event*. Both methods can run entirely on the device for instance in idle periods while the user is Figure 5: Cache template matrix for libinput.so. not actively using the device. As the second method only re uires a write() statement it is significantly faster, but 200 it is also more device specific. Therefore, we used the 150 input command line except when profiling differences between different letter keys. While simulating these 100 Access time events, we simultaneously probed all addresses within 50 Tap Tap Tap Swipe Swipe Swipe Tap Tap Tap Swipe Swipe the libinput.so library, i e , we measured the number of cache hits that occurred on each address when trig- 0 5 10 15 gering a specific event. As already mentioned above, the Time in seconds simulation of some events might re uire either an offline phase or specific privileges in case of online attacks. Figure 6: Monitoring address 0x11040 of libinput.so on the Alcatel One Touch Pop 2 reveals taps and swipes. Figure 5 shows part of the cache template matrix for libinput.so. We triggered the following events: key events including the power button (key), long touch number of cache hits than swipe actions. Swipe actions events (longpress), swipe events, touch events (tap), and cause cache hits in a high fre uency as long as the screen text input events (te t) via the input tool as often as pos- is touched. Figure 6 shows a se uence of 3 tap events, sible and measured each address and event for one sec- 3 swipe events, 3 tap events, and 2 swipe events. These ond. The cache template matrix clearly reveals addresses events can be clearly distinguished due to the fast access with high cache-hit rates for specific events. Darker col- times. The gaps mark periods of time where our program ors represent addresses with higher cache-hit rates for a was not scheduled on the CPU. Events occurring in those specific event and lighter colors represent addresses with periods can be missed by our attack. lower cache-hit rates. Hence, we can distinguish differ- Swipe input allows to enter words by swiping over ent events based on cache hits on these addresses. the soft-keyboard and thereby connecting single charac- We verified our results by monitoring the identified ters to form a word. Since we are able to determine the addresses while operating the smartphone manually, i e , length of swipe movements, we can correlate the length we touched the screen and our attack application reliably of the swipe movement with the actual word length in reported cache hits on the monitored addresses. For in- any Android application or system interface that uses stance, address 0x11040 of libinput.so can be used to swipe input without any privileges. Furthermore, we can distinguish tap actions and swipe actions on the screen of determine the actual length of the unlock pattern for the the Alcatel One Touch Pop 2. Tap actions cause a smaller pattern-unlock mechanism.
9 USENIX Association 25th USENIX Security Symposium 557 Figure 7 shows a user input sequence consisting of 3 alphabet tap events and 3 swipe events on the Samsung Galaxy enter space S6. The attack was conducted using lush Reload. Input An attacker can monitor every single event. Taps and backspace swipes can be distinguished based on the length of the cache hit phase. The length of a swipe movement can 0x45140 0x56940 0x57280 0x58480 0x60280 0x60340 0x60580 0x66340 0x66380 be determined from the same information. Figure 8 Addresses shows the same experiment on the OnePlus One using Evict Reload. Thus, our attack techniques work on co- Figure 9: Cache template matrix for the default AOSP herent non-inclusive last-level caches. keyboard.
400 Key 300 Space 200 Tap Tap Tap Swipe Swipe Swipe 200 Access time Access time 100 t h i s Space i s Space a Space m e s s a g e 0 2 4 6 8 0 1 2 3 4 5 6 7 Time in seconds Time in seconds Figure 7: Monitoring address 0xDC5C of libinput.so on the Samsung Galaxy S6 reveals tap and swipe events. Figure 10: Evict Reload on 2 addresses in custpack@ app@[email protected]@classes.dex on the Alcatel One Touch Pop 2 while entering the sentence 1,000 “this is a message”. 800 600 sible to precisely determine the length of single words 400 entered using the default AOSP keyboard. Access time 200 We illustrate the capability of detecting word lengths Tap Tap Tap Swipe Swipe Swipe in Figure 10. The blue line shows the timing measure- 0 2 4 6 ments for the address identified for keys in general, the Time in seconds red dots represent measurements of the address for the space key. The plot shows that we can clearly determine Figure 8: Monitoring address 0xBFF4 of libinput.so the length of entered words and monitor user input accu- on the OnePlus One reveals tap and swipe events. rately over time.