arXiv:1603.05615v1 [cs.CR] 17 Mar 2016

A Software Approach to Defeating Side Channels in Last-Level Caches

Ziqiao Zhou (University of North Carolina, Chapel Hill, NC, USA)
Michael K. Reiter (University of North Carolina, Chapel Hill, NC, USA)
Yinqian Zhang (The Ohio State University, Columbus, OH, USA)

ABSTRACT

We present an approach to mitigate access-driven side-channel attacks that leverage last-level caches (LLCs) shared across cores to leak information between security domains (e.g., tenants in a cloud). Our approach dynamically manages physical memory pages shared between security domains to disable sharing of LLC lines, thus preventing "FLUSH-RELOAD" side channels via LLCs. It also manages cacheability of memory pages to thwart cross-tenant "PRIME-PROBE" attacks in LLCs. We have implemented our approach as a memory management subsystem called CACHEBAR within the Linux kernel to intervene on such side channels across container boundaries, as containers are a common method for enforcing tenant isolation in Platform-as-a-Service (PaaS) clouds. Through formal verification, principled analysis, and empirical evaluation, we show that CACHEBAR achieves strong security with small performance overheads for PaaS workloads.

1 INTRODUCTION

An access-driven side channel is an attack by which an attacker computation learns secret information about a victim computation running on the same computer, not by violating the logical access control implemented by the isolation software (typically an operating system (OS) or virtual machine monitor (VMM)) but rather by observing the effects of the victim's execution on microarchitectural components it shares with the attacker's. Overwhelmingly, the components most often used in these attacks are CPU caches. Early cache-based side channels capable of leaking fine-grained information (e.g., cryptographic keys) across security boundaries used per-core caches (e.g., [30, 9, 36]), though the need for the attacker to frequently preempt the victim to observe its effects on per-core caches renders these attacks relatively easy to mitigate in software (e.g., [39, 31]). [Footnote 1: Hyper-threading can enable the attacker to observe the victim's effects on per-core caches without preempting it, if both are scheduled simultaneously on the same core. So, potentially adversarial tenants are generally not scheduled together on the same core (or hyper-threading is disabled) in cloud environments, for example.] Of more concern are side channels via last-level caches (LLCs) that are shared across cores and, in particular, do not require preemption of the victim to extract fine-grained information from it (e.g., [35, 37, 12, 21]).

Two varieties of LLC-based side channels capable of extracting fine-grained information from a victim have been demonstrated. The first such attacks were of the FLUSH-RELOAD variety [35, 37], which requires the attacker to share a physical memory page with the victim—a common situation in a modern OS, due to shared libraries, copy-on-write memory management, and memory deduplication mechanisms that aim for smaller memory footprints. The attacker first FLUSHes a cache-line-sized chunk of the shared page out of the cache using processor-specific instructions (e.g., clflush in x86 processors) and later measures the time to RELOAD (or re-FLUSH [8]) it to infer whether this chunk was touched (and thus loaded to the shared cache already) by the victim. More recently, the so-called PRIME-PROBE attack, which can be conducted when the two programs share the same CPU cache, has been demonstrated via LLCs [12, 21]; these attacks do not require page sharing between the attacker and victim. Rather, the attacker PRIMEs the cache by loading its own memory into certain cache sets. Later it PROBEs the cache by measuring the time to load the same memory and inferring how many lines in each cache set are absent due to conflicts with the victim's execution.

In this paper we propose a software-only defense against these LLC-based side-channel attacks, based on two seemingly straightforward principles. First, to defeat FLUSH-RELOAD attacks, we propose a copy-on-access mechanism to manage physical pages shared across mutually distrusting security domains (i.e., processes, containers, or VMs). [Footnote 2: https://linuxcontainers.org/] Specifically, temporally proximate accesses to the same physical page by multiple security domains result in the page being copied so that each domain has its own copy. In this way, a victim's access to its copy will be invisible to an attacker's RELOAD in a FLUSH-RELOAD attack. When accesses are sufficiently spaced in time, the copies can be deduplicated to return the overall memory footprint to its original size. Second, to defeat PRIME-PROBE attacks, we design a mechanism to manage the cacheability of memory pages so as to limit the number of lines per cache set that an attacker may PROBE. In doing so, we limit the visibility of the attacker into the victim's demand for memory that maps to that cache set. Of course, the challenge in these defenses is in engineering them to be effective in both mitigating LLC-based side channels and supporting efficient execution of computations.

To demonstrate these defenses and the tradeoffs between security and efficiency that they offer, we detail their design and implementation in a memory management subsystem called CACHEBAR (short for "Cache Barrier") for the Linux kernel. CACHEBAR supports these defenses for security domains represented as Linux containers. That is, copy-on-access to defend against FLUSH-RELOAD attacks makes copies of pages as needed to isolate temporally proximate accesses to the same page from different containers. Moreover, memory cacheability is managed so that the processes in each container are collectively limited in the number of lines per cache set they can PROBE. This implementation would thus be well-suited for use in Platform-as-a-Service (PaaS) clouds that isolate cloud customers in distinct containers; indeed, cross-container LLC-based side channels have been demonstrated in such clouds in the wild [37]. Our security evaluations show that CACHEBAR mitigates cache-based side-channel attacks, and our performance evaluation indicates that CACHEBAR imposes very modest overheads on PaaS workloads.

To summarize, we contribute:

• A novel copy-on-access mechanism to manage physical memory pages shared by distrusting tenants to prevent FLUSH-RELOAD side-channel attacks, and its formal verification using model checking.

• A novel mechanism to dynamically maintain queues of cacheable memory pages so as to limit the cache lines a malicious tenant may access in PRIME-PROBE attacks, and a principled derivation of its parameters to balance security and performance.

• Implementation of both mechanisms in a mainstream Linux kernel and an extensive security and performance evaluation for PaaS workloads.

2 RELATED WORK

Numerous proposals have sought to mitigate cache-based side channels with low overhead through redesign of the cache hardware, e.g., [24, 15, 33, 13, 20]. Unfortunately, there is little evidence that mainstream CPU manufacturers will deploy such defenses in the foreseeable future, and even if they did, it would be years before these defenses permeated the installed computing base. Other proposals modify applications to better protect secrets from side-channel attacks. These solutions range from tools to limit branching on sensitive data (e.g., [3, 4]) to application-specific side-channel-free implementations (e.g., [16]). These techniques can introduce substantial runtime overheads, however, and these overheads tend to increase with the generality of the tool. It is for this reason that we believe that systems-level (i.e., OS- or VMM-level) defenses are the most plausible, general defense for deployment in the foreseeable future.

With attention to cache-based side channels specifically, several works provide to each security domain a limited number of designated pages that are never evicted from the LLC (e.g., [14, 19]), thereby rendering their contents immune to PRIME-PROBE and FLUSH-RELOAD attacks. These approaches, however, require the application developer to determine what data/instructions to protect and then to modify the application to organize the sensitive content into the protected pages; in contrast, CACHEBAR seeks to protect applications holistically and requires no application modifications. CACHEBAR also differs in several design choices that free it from limitations of prior approaches (e.g., the limitation of only one protected page per core [14] or dependence on relatively recent, Intel-specific cache optimizations [19]). Other systems-level solutions manage memory so as to partition the use of the LLC by different security domains (e.g., [26, 27]), though these approaches preclude memory-page and CPU-cache sharing entirely and hence can underutilize these resources considerably.

LLC-based side channels are a particular instance of timing side channels, and so defenses that seek to eliminate timing side channels are also relevant to our problem. Examples include fuzzing real-time sources on the computer (e.g., [32]), though this impinges on legitimate uses of real time. Since real-time counters are not the only way to time memory fetches [34], other efforts have sought to eliminate side-channel risks more holistically via altering the CPU scheduler (e.g., [28, 18]) and managing how tenants co-locate (e.g., [17, 38, 10, 2, 18]). In contrast, here we focus specifically on LLC-based side channels (vs. a larger subset of timing side channels)—which again are arguably the most potent known side-channel vectors [35, 37, 12, 21]—and restrict our modifications to the memory management subsystem.

3 COPY-ON-ACCESS FOR FLUSH-RELOAD DEFENSE

The FLUSH-RELOAD attack is a highly effective LLC-based side channel that was used, e.g., by Zhang et al. [37] to mount fine-grained side-channel attacks in commercial PaaS clouds. It leverages physical memory pages shared between an attacker and victim security domains, as well as the ability to evict those pages from LLCs, using a capability such as provided by the clflush instruction on the x86 architecture. clflush is designed to maintain consistency between caches and memory for write-combined memory [11].

The attacker uses clflush, providing a virtual address as an argument, to invalidate the cache lines occupied by the backing physical memory. After a short time interval (the "FLUSH-RELOAD interval") during which the victim executes, the attacker measures the time to access the same virtual address. Based on this duration, the attacker can infer whether the victim accessed that memory during the interval.
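To make the timing concrete, the following user-space C sketch illustrates one FLUSH-RELOAD round as just described. It is a minimal illustration, not code from CACHEBAR or from any published attack: the probed symbol, the busy-wait interval, and the 200-cycle hit threshold are illustrative assumptions that an attacker would calibrate per machine (in a real attack the probed address would lie in a page shared with the victim, e.g., a shared library).

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* _mm_clflush, __rdtscp, _mm_mfence */

    /* Time one access to addr in cycles, serialized with fences. */
    static uint64_t timed_access(const volatile char *addr) {
        unsigned aux;
        _mm_mfence();
        uint64_t t0 = __rdtscp(&aux);
        (void)*addr;                      /* the RELOAD */
        uint64_t t1 = __rdtscp(&aux);
        _mm_mfence();
        return t1 - t0;
    }

    /* One FLUSH-RELOAD round against a (hypothetically shared) address. */
    static int victim_touched(const volatile char *addr, uint64_t threshold) {
        _mm_clflush((const void *)addr);  /* FLUSH the line out of the caches */
        _mm_mfence();
        for (volatile int i = 0; i < 2500; i++)
            ;                             /* the FLUSH-RELOAD interval */
        /* A fast RELOAD implies the victim brought the line back. */
        return timed_access(addr) < threshold;
    }

    static char probe_target[64];         /* stand-in for a shared-page address */

    int main(void) {
        int hits = 0;
        for (int i = 0; i < 100000; i++)
            hits += victim_touched(probe_target, 200 /* cycles, illustrative */);
        printf("apparent victim accesses: %d\n", hits);
        return 0;
    }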

3.1 Design

Modern operating systems, in particular Linux, often adopt on-demand paging and copy-on-write mechanisms [6] to reduce the memory footprints of userspace applications. In particular, copy-on-write enables multiple processes to share the same set of physical memory pages as long as none of them modifies the content. If a process writes to a shared memory page, the write will trigger a page fault and a subsequent new page allocation, so that a private copy of the page will be provided to this process. In addition, memory merging techniques like Kernel Same-page Merging (KSM) [1] are also used in Linux to deduplicate identical memory pages. Memory sharing, however, is one of the key factors that enable FLUSH-RELOAD side-channel attacks. Disabling memory page sharing entirely would eliminate FLUSH-RELOAD side channels, but at the cost of much larger memory footprints and thus inefficient use of physical memory.

CACHEBAR adopts a design that we call copy-on-access, which dynamically controls the sharing of physical memory pages between security domains. We designate each physical page as being in exactly one of the following states: UNMAPPED, EXCLUSIVE, SHARED, and ACCESSED. An UNMAPPED page is a physical page that is not currently in use. An EXCLUSIVE page is a physical page that is currently used by exactly one security domain, but may be shared by one or multiple processes in that domain later. A SHARED page is a physical page that is shared by multiple security domains, i.e., mapped by at least one process of each of the sharing domains, but no process in any domain has accessed this physical page recently. In contrast, an ACCESSED page is a previously SHARED page that was recently accessed by a security domain. The state transitions are shown in Fig. 1.

[Figure 1: State transition of a physical page]

An UNMAPPED page can transition to the EXCLUSIVE state either due to normal page mapping, or due to copy-on-access when a page is copied into it. Unmapping a physical page for any reason (e.g., process termination, page swapping) will move an EXCLUSIVE page back to the UNMAPPED state. However, mapping the current EXCLUSIVE page by another security domain will transition it into the SHARED state. If all but one domain unmaps this page, it will transition back from the SHARED state to the EXCLUSIVE state, or from the ACCESSED state to the EXCLUSIVE state. A page in the SHARED state may be shared by more domains and remain in the same state; when any one of the domains accesses the page, it will transition to the ACCESSED state. An ACCESSED page can stay that way as long as only one security domain accesses it. If this page is accessed by another domain, a new physical page will be allocated to make a copy of this one, and the current page will transition to either the EXCLUSIVE or SHARED state, depending on the remaining number of domains mapping this page. The new page will be assigned state EXCLUSIVE. An ACCESSED page will be reset to the SHARED state if it is not accessed for ∆accessed seconds. This timeout mechanism ensures that only recently used pages remain in the ACCESSED state. Page merging may also be triggered by deduplication services in a modern OS (e.g., KSM in Linux). This effect is reflected by a dashed line in Fig. 1 from state EXCLUSIVE to SHARED. A page in any of the mapped states (i.e., EXCLUSIVE, SHARED, ACCESSED) can transition to the UNMAPPED state for the same reason when it is a copy of another page (not shown in the figure).

Merging duplicated pages requires some extra bookkeeping. When a page transitions from UNMAPPED to EXCLUSIVE due to copy-on-access, the original page is tracked by the new copy so that CACHEBAR knows with which page to merge it when deduplicating. If the original page is unmapped first, then one of its copies will be designated as the new "original" page, with which other copies will be merged in the future. The interaction between copy-on-access and existing copy-on-write mechanisms is also implicitly depicted in Fig. 1: upon copy-on-write, the triggering process will first unmap the physical page, possibly inducing a state transition (from SHARED to EXCLUSIVE). The state of the newly mapped physical page is maintained separately.
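To make the transitions of Fig. 1 concrete, here is a simplified user-space C model of the per-page state machine on the access path. This is a minimal sketch, not CACHEBAR kernel code: the names (page_state, on_access) are ours, the domain count stands in for the counter bookkeeping described in Sec. 3.2, and copying is reduced to reporting that a copy would be allocated.

    #include <stdio.h>

    enum page_state { UNMAPPED, EXCLUSIVE, SHARED, ACCESSED };

    struct page {
        enum page_state state;
        int ndomains;   /* security domains currently mapping this page */
        int owner;      /* last accessor while ACCESSED; -1 means none  */
    };

    /* Access by security domain `dom`; returns 1 if a copy must be made. */
    static int on_access(struct page *p, int dom) {
        switch (p->state) {
        case SHARED:                     /* first recent access: record owner */
            p->state = ACCESSED;
            p->owner = dom;
            return 0;
        case ACCESSED:
            if (p->owner == dom)         /* same domain: no change */
                return 0;
            /* Another domain touched an ACCESSED page: copy-on-access.
             * The accessor gets a new EXCLUSIVE copy; this page becomes
             * SHARED or EXCLUSIVE per the remaining number of mappers. */
            p->ndomains--;
            p->state = (p->ndomains > 1) ? SHARED : EXCLUSIVE;
            p->owner = -1;
            return 1;
        default:                         /* UNMAPPED/EXCLUSIVE accesses are free */
            return 0;
        }
    }

    /* Timeout: an ACCESSED page unused for Delta_accessed reverts to SHARED. */
    static void on_timeout(struct page *p) {
        if (p->state == ACCESSED) { p->state = SHARED; p->owner = -1; }
    }

    int main(void) {
        struct page p = { SHARED, 2, -1 };
        printf("victim access -> copy? %d (state=%d)\n", on_access(&p, 0), p.state);
        printf("attacker access -> copy? %d (state=%d)\n", on_access(&p, 1), p.state);
        on_timeout(&p);
        return 0;
    }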

3.2 Implementation

At the core of the copy-on-access implementation is the state machine depicted in Fig. 1.

UNMAPPED ⇔ EXCLUSIVE ⇔ SHARED. Conventional Linux kernels maintain the relationship between processes and the physical pages they use. However, CACHEBAR also needs to keep track of the relationship between containers and the physical pages that the container's processes use. Therefore, CACHEBAR incorporates a new data structure, counter, which is conceptually a table used for recording, for each physical page, the number of processes in each container that have Page Table Entries (PTEs) mapped to this page. Specifically, let counter[i,j] indicate the number of PTEs mapped to physical page i in container j. For example, consider five physical pages and four containers (see Table 1). counter[1,2] = 2 indicates that there are 2 processes in container 2 that have virtual pages mapped to physical page 1. It is easy to see from counter that physical pages 1, 2, and 4 are in the SHARED or ACCESSED state, physical page 5 is UNMAPPED, and physical page 3 is EXCLUSIVE.

Table 1: Example of counter: 5 pages, 4 containers

    physical page | container 1 | container 2 | container 3 | container 4
    1             | 0           | 2           | 0           | 1
    2             | 3           | 1           | 1           | 0
    3             | 0           | 0           | 1           | 0
    4             | 1           | 1           | 1           | 1
    5             | 0           | 0           | 0           | 0

Of course, the table counter needs to be dynamically maintained, as containers may be created and terminated at any time. Therefore, to implement counter, we added one data field in each PID namespace structure: a pointer to an array whose size is the number of physical pages in the system. These data fields collectively function as the table counter.

The counter data structure is updated and referenced in multiple places in the kernel. Specifically, in CACHEBAR we instrumented every update of mapcount, a data field in the page structure for counting PTE mappings, so that every time the kernel tracks the PTE mapping of a physical page, counter is updated accordingly. The use of counter greatly simplifies the process of maintaining and determining the state of a physical page: (1) Given a container, access to a single cell suffices to check whether a physical page is already mapped in the container. This operation is very commonly used to decide whether a state transition is required when a page is mapped by a process. Without counter, such an operation requires running through the entire reverse mapping process and checking whether each mapping is from the given container. (2) Given a physical page, it takes N accesses to counter, where N is the total number of containers, to determine which containers have mapped this page. This operation is commonly used to determine the state of a physical page.

SHARED ⇒ ACCESSED. To differentiate the SHARED and ACCESSED states, one additional data field, owner, is added (see Fig. 2) to indicate the owner of the page (a pointer to a PID namespace structure). When the page is in the SHARED state, its owner is NULL; otherwise it points to the container that last accessed it.

[Figure 2: Structure of copy-on-access page lists]

All PTEs pointing to a SHARED physical page will have a reserved Copy-On-Access (COA) bit set. Therefore, any access to these virtual pages will induce a page fault. When a page fault is triggered, CACHEBAR checks whether the page is present in physical memory; if so, and if the physical page is in the SHARED state, the COA bit of the current PTE for this page will be cleared so that additional accesses to this physical page from the current process will be allowed without page faults. The physical page will also transition to the ACCESSED state.

ACCESSED ⇒ EXCLUSIVE/SHARED. If the page is already in the ACCESSED state when a domain other than the owner accesses it, the page fault handler will allocate a new physical page, copy the content of the original page into the new page, and change the PTEs of the processes in the accessing container so that they point to the new page. Since multiple same-content copies in one domain burden both performance and memory but contribute nothing to security, the fault handler will reuse a copy belonging to that domain if one exists. After copy-on-access, the original page can be either EXCLUSIVE or SHARED. All copy pages are anonymous-mapped, since only a single file-mapped page for the same file section is allowed.
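The counter bookkeeping lends itself to a compact illustration. The following user-space C sketch shows the two lookups described above over a per-container array indexed by physical page; it is our simplification, assuming a fixed container and page count rather than the per-PID-namespace pointer arrays CACHEBAR actually hangs off the kernel's structures. The data reproduces Table 1.

    #include <stdio.h>

    #define NCONTAINERS 4
    #define NPAGES      5

    /* counter[i][j]: number of PTEs in container j mapping physical page i. */
    static int counter[NPAGES][NCONTAINERS] = {
        {0, 2, 0, 1},   /* page 1 (row 0): mapped by 2 containers */
        {3, 1, 1, 0},   /* page 2: SHARED or ACCESSED             */
        {0, 0, 1, 0},   /* page 3: EXCLUSIVE                      */
        {1, 1, 1, 1},   /* page 4: SHARED or ACCESSED             */
        {0, 0, 0, 0},   /* page 5: UNMAPPED                       */
    };

    /* (1) One cell answers: does container j already map page i? */
    static int mapped_in(int page, int container) {
        return counter[page][container] > 0;
    }

    /* (2) N cells answer: how many containers map page i? */
    static int ndomains(int page) {
        int n = 0;
        for (int j = 0; j < NCONTAINERS; j++)
            n += counter[page][j] > 0;
        return n;   /* 0: UNMAPPED, 1: EXCLUSIVE, >1: SHARED/ACCESSED */
    }

    int main(void) {
        printf("page 1 mapped in container 2? %d\n", mapped_in(0, 1));
        for (int i = 0; i < NPAGES; i++)
            printf("page %d: %d mapping container(s)\n", i + 1, ndomains(i));
        return 0;
    }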

A transition from the ACCESSED state to the SHARED or EXCLUSIVE state can also be triggered by a timeout mechanism. CACHEBAR implements a periodic timer (every ∆accessed = 1s). Upon timer expiration, all physical pages in the ACCESSED state that were not accessed during this ∆accessed interval will be reset to the SHARED state by clearing their owner field, so that pages that are infrequently accessed are less likely to trigger copy-on-access. If an ACCESSED page is found for which counter shows that the number of domains mapping it is 1, then the daemon instead clears the COA bit of all PTEs for that page and marks the page EXCLUSIVE.

Instead of keeping a list of ACCESSED pages, CACHEBAR maintains a list of pages that are in the SHARED or ACCESSED state, denoted original_list (shown in Fig. 2). Each node in the list also maintains a list of copies of the page it represents, dubbed copy_list, which could be empty. Both lists are doubly linked. A tracking pointer track_ptr to the copy_list or original_list node is added in the struct page structure to attach the lists onto the array of data structures that represent physical pages. Whenever a copy is made from a page upon copy-on-access, the copy page is inserted into the copy_list of the original page. Whenever a physical page transitions to the UNMAPPED state, it will be removed from whichever of original_list or copy_list it is contained in. In the former case, CACHEBAR will designate a copy page of the original page as the new original page and adjust the lists accordingly.

Every ∆accessed seconds, CACHEBAR will traverse the original_list. If the visited page is in the ACCESSED state, the timer interrupt handler will check whether it has been accessed since the last such check by seeing if the ACCESSED bit of any PTE for this page in the owner container is set. If not, CACHEBAR will transition this page back to the SHARED state. In any case, the ACCESSED bits in the PTEs will be cleared. For security reasons that will be explained in Sec. 3.3, we further require flushing the entire memory page out of the cache after transitioning a page from the ACCESSED state to the SHARED state due to this timeout mechanism. This page-flushing procedure is implemented by issuing clflush on each of the memory blocks of any virtual page that maps to this physical page.

State transition upon clflush. The clflush instruction is subject to the same permission checks as a memory load, will trigger the same page faults, and will similarly set the ACCESSED bit in the PTE of its argument [11]. [Footnote 3: We empirically confirmed this by executing clflush instructions on memory pages with PTE reserved bits set.] As such, each FLUSH via clflush triggers the same transitions (e.g., from SHARED to ACCESSED, and from ACCESSED to an EXCLUSIVE copy) as a RELOAD in our implementation, meaning that this defense is equally effective against both FLUSH-RELOAD and FLUSH-FLUSH [8] attacks.

Page deduplication. To mitigate the impact of copy-on-access on the size of memory, CACHEBAR implements a less frequent timer (every ∆copy = 10×∆accessed seconds) to periodically merge page copies with their original pages. Within the timer interrupt handler, original_list and each copy_list are traversed similarly to the "ACCESSED ⇒ SHARED" transition description above, though the ACCESSED bits in the PTEs of only pages that are in the EXCLUSIVE state are checked. If a copy page has not been accessed since the last such check (i.e., the ACCESSED bit is unset in all PTEs pointing to it), it will be merged with its original page (the head of the copy_list). The ACCESSED bits in the PTEs will be cleared afterwards.

When merging two pages, if the original page is anonymous-mapped, then the copy page can be merged by simply updating all PTEs pointing to the copy page to instead point to the original page, and then updating the original page's reverse mappings to include these PTEs. If the original page is file-mapped, then the merging process is more intricate, additionally involving the creation of a new virtual memory area (vma structure) that maps to the original page's file position and using this structure to replace the virtual memory area of the (anonymous) copy page in the relevant task structure.

For security reasons, merging of two pages requires flushing the original physical page from the LLC. We will elaborate on this point in Sec. 3.3.
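The page-flushing step has a direct translation to code. Below is a user-space C sketch of flushing every cache-line-sized block of one page via clflush, as the timeout and merge paths require. The 4KB page size and 64-byte line size are the usual x86 values but are assumptions here, and kernel code would issue the flushes through a kernel virtual mapping rather than a user pointer.

    #include <stdint.h>
    #include <x86intrin.h>   /* _mm_clflush, _mm_mfence */

    #define PAGE_SIZE 4096   /* assumed x86 page size   */
    #define LINE_SIZE 64     /* assumed cache line size */

    /* Evict an entire page from the cache hierarchy, one line at a time,
     * so that no residue of the page remains in the LLC. */
    static void flush_page(void *addr) {
        uintptr_t base = (uintptr_t)addr & ~(uintptr_t)(PAGE_SIZE - 1);
        for (uintptr_t off = 0; off < PAGE_SIZE; off += LINE_SIZE)
            _mm_clflush((void *)(base + off));
        _mm_mfence();        /* make the flushes globally visible */
    }

    int main(void) {
        static char buf[2 * PAGE_SIZE];
        flush_page(buf + PAGE_SIZE);   /* flush the page containing this address */
        return 0;
    }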

Interacting with KSM. Page deduplication can also be triggered by existing memory deduplication mechanisms (e.g., KSM). To maintain the state of the physical pages, CACHEBAR instruments every reference to mapcount within KSM and updates counter accordingly. In addition to merging a copy page with its original, three other types of page merging in KSM might occur: (1) an original page is merged with a copy page on a different copy_list; (2) a copy page is merged with another copy page on a different copy_list; (3) an original page is merged with another original page.

We illustrate these operations in the example in Fig. 3. Consider the initial original_list and copy_lists shown in Fig. 3(a), where all of pages 1–7 have the same content. Fig. 3(b) shows the list configurations after copy page 5 is merged into copy page 7 by KSM, after which only page 7 is preserved and is in the SHARED state. (Copy page 5 is unmapped and removed from its copy_list.) Fig. 3(c) shows an original page 1 merged into a copy page 6. As a result, page 6 becomes a SHARED page, and one of page 1's copies (in our implementation, the first copy in the list) is designated as the new original page of the list. Fig. 3(d) shows original page 3 being merged into original page 2. A copy page in page 3's copy_list becomes the new original page. A physical page may be disconnected from the original_list or any copy_list, which we call untracked, if it has never entered the SHARED or ACCESSED states. If KSM merges a page that is untracked with a tracked page, then the untracked page will simply be merged into the tracked page, which will transition to the SHARED state (if not already there).

[Figure 3: KSM operation example. (a) Initial state; (b) merging page 5 into 7; (c) merging page 1 into 6; (d) merging page 3 into 2.]

It is apparent that KSM is capable of merging more pages than our built-in page deduplication mechanisms. However, CACHEBAR still relies on the built-in page deduplication mechanisms for several reasons. First, KSM can merge only anonymous-mapped pages, while CACHEBAR needs to frequently merge an anonymous-mapped page (a copy) with a file-mapped page (the original). Second, KSM may not be enabled in certain settings, which would lead to ever-growing copy_lists. Third, KSM needs to compare page contents byte-by-byte before merging two identical pages, whereas CACHEBAR deduplicates pages on the same copy_list, avoiding the expensive page content comparison.

3.3 Security

Copy-on-access is intuitively secure by design, as no two security domains may access the same physical page at the same time, rendering FLUSH-RELOAD attacks seemingly impossible. To show security formally, we subjected our design to model checking in order to prove that copy-on-access is secure against FLUSH-RELOAD attacks. Model checking is an approach to formally verify a specification of a finite-state concurrent system expressed as temporal logic formulas, by traversing the finite-state machine defined by the model. In our study, we used the Spin model checker, which offers efficient ways to model concurrent systems and verify temporal logic specifications.

Prior works have proposed the use of model checking to evaluate whether software is vulnerable to remote timing attacks (e.g., [29]). However, using model checking to verify that a system is secure against the side-channel attacks of concern in this paper is, we believe, novel and might be of interest in its own right.

System modeling. We model a physical page in Fig. 1 using a byte variable in the PROMELA programming language, and two physical pages as an array of two such variables, named pages. We model two security domains (e.g., containers), an attacker domain and a victim domain, as two processes in PROMELA. Each process maps a virtual page, virt, to one of the physical pages. The virtual page is modeled as an index into the pages[] array; initially virt for both the attacker and the victim points to the first physical page (i.e., virt is 0). The victim process repeatedly sets pages[virt] to 1, simulating a memory access that brings pages[virt] into the cache. The attacker process FLUSHes the virtual page by assigning 0 to pages[virt] and RELOADs it by assigning 1 to pages[virt] after testing whether it already equals 1. Both the FLUSH and RELOAD operations are modeled as atomic to simplify the state exploration.

We track the state and owner of the first physical page using another two variables, state and owner. The first page is initially in the SHARED state (state is SHARED), and the state transitions in Fig. 1 are implemented by each process when it accesses the memory. For example, the RELOAD code snippet run by the attacker is shown in Fig. 4. If the attacker has access to the shared page (Line 3), versus an exclusive copy (Line 16), then it simulates an access to the page, which either moves the state of the page to ACCESSED (Line 10) if the state was SHARED (Line 9), or to EXCLUSIVE (Line 14) after making a copy (Line 13) if the state was already ACCESSED and not owned by the attacker (Line 12). Leakage is detected if pages[virt] is 1 prior to the attacker setting it as such (Line 19), which the attacker tests in Line 18.

To model the dashed lines in Fig. 1, we implemented another process, called timer, in PROMELA that periodically transitions the physical page back to the SHARED state from the ACCESSED state and, periodically with a longer interval, merges the two pages by changing the value of virt of each domain back to 0, owner to none, and state to SHARED.

The security specification is stated as a non-interference property. Specifically, as the attacker domain always first FLUSHes the memory block (sets pages[virt] to 0) before RELOADing it (setting pages[virt] to 1), if the non-interference property holds, then it should follow that the attacker should always find pages[virt] to be 0 upon RELOADing the page. The model checker checks for violation of this property in the verification.

    1  atomic {
    2    if
    3    ::(virt==0) ->
    4      if
    5      ::(state==UNMAPPED) ->
    6        assert(0)
    7      ::(state==EXCLUSIVE && owner!=ATTACKER) ->
    8        assert(0)
    9      ::(state==SHARED) ->
    10       state=ACCESSED
    11       owner=ATTACKER
    12     ::(state==ACCESSED && owner!=ATTACKER) ->
    13       virt=1 /* copy-on-access */
    14       state=EXCLUSIVE
    15     fi
    16   ::else -> skip
    17   fi
    18   assert(pages[virt]==0)
    19   pages[virt]=1
    20 }

[Figure 4: Code snippet for RELOAD. The procedures for other memory accesses are similar.]

Automated verification. We checked the model using Spin. Interestingly, our first model-checking attempt suggested that the state transitions may leak information to a FLUSH-RELOAD attacker. The leaks were caused by the timer process that periodically transitions the model to the SHARED state. After inspecting the design and implementation, we found that there were two situations that may cause information leaks. In the first case, when the timer transitions the state machine to the SHARED state from the ACCESSED state, if the prior owner of the page was the victim and the attacker reloaded the memory right after the transition, the attacker may learn one bit of information. In the second case, when the physical page was merged with its copy, if the owner of the page was the victim before the page became SHARED, the attacker may reload it and again learn one bit of information. Since in our implementation of CACHEBAR these two state transitions are triggered only if the page (or its copy) has not been accessed for a while (roughly ∆accessed and ∆copy seconds, respectively), the information leakage bandwidth due to each would be approximately 1/∆accessed bits per page per second or 1/∆copy bits per page per second, respectively.

We improved our CACHEBAR implementation to prevent this leakage by enforcing LLC flushes (as described in Sec. 3.2) upon these two periodic state transitions. We adapted our model accordingly to reflect these changes by adding one more instruction that assigns pages[0] to 0 right after the two timer-induced state transitions. Model checking this refined model revealed no further information leakage in the design.

4 CACHEABILITY MANAGEMENT FOR PRIME-PROBE DEFENSE

Another common method to launch side-channel attacks via caches is using PRIME-PROBE attacks, introduced by Osvik et al. [23]. These attacks have recently been adapted to use LLCs to great effect, e.g., [21, 12]. Unlike a FLUSH-RELOAD attack, PRIME-PROBE attacks do not require the attacker and victim security domains to share pages. Rather, the attacker simply needs to access memory so as to evict (PRIME) the contents of a cache set and later access (PROBE) this memory again to determine (by timing the accesses) how much the victim evicted from the cache set. A potentially effective countermeasure to these attacks, accordingly, is to remove the attacker's ability to PRIME and PROBE the whole cache set and to predict how a victim's demand for that set will be reflected in the number of evictions from that set.

4.1 Design

Suppose a w-way set-associative LLC, i.e., so that each cache set has w lines. Let x be the number of cache lines in one set that the attacker observes having been evicted in a PRIME-PROBE interval. The PRIME-PROBE attack is effective today because x is typically a good indicator of the demand d that the victim security domain had for memory that mapped to that cache set during the PRIME-PROBE interval. In particular, if the attacker PRIMEs and PROBEs all w lines, then it can often observe the victim's demand d exactly, unless d > w (in which case the attacker learns at least that d ≥ w).

The alternative that we propose here is to periodically and probabilistically reconfigure the budget ki of lines per cache set that the security domain i can occupy. After such a reconfiguration, the attacker's view of the victim's demand d is clouded by the following three effects. First, if the attacker is allotted a budget ka < w, then the attacker will be unable to observe any evictions at all (i.e., x = 0) if d < w − ka. [Footnote 4: This statement assumes a least-recently-used replacement policy and that the victim is the only security domain that runs in the PRIME-PROBE interval. If it was not the only security domain to run and evictions were caused by another, then the ambiguity of what caused the observable evictions will additionally cause difficulties for the attacker.] Second, if the victim is given allotment kv, then any two victim demands d, d′ satisfying d > d′ ≥ kv will be indistinguishable to the attacker. Third, the probabilistic assignment of kv results in extra ambiguity for the attacker, since x evictions might reflect the demand d or the budget kv, since x ≤ min{d, kv} (if all x evictions are caused by the victim).

To enforce the budget ki of lines that security domain i can use in a given cache set, CACHEBAR maintains for each cache set a queue per security domain that records which memory blocks are presently cacheable in this set by processes in this domain. Each element in the queue indicates a memory block that maps to this cache set; only blocks listed in the queue can be cached in that set. The queue is maintained with a least-recently-used (LRU) replacement algorithm. That is, whenever a new memory block is accessed, it will replace the memory block in the corresponding queue that was least recently used.
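The three effects admit a closed form under the same idealizations the text already makes (pure LRU replacement, and the victim as the only other occupant of the set during the interval). The C sketch below computes the evictions x an attacker would observe for given w, d, kv, and ka; the formula is our distillation of the three effects for illustration, not something CACHEBAR computes at runtime.

    #include <stdio.h>

    static int min(int a, int b) { return a < b ? a : b; }
    static int max(int a, int b) { return a > b ? a : b; }

    /* Idealized evictions observed by the attacker in one PRIME-PROBE
     * interval: the victim occupies min(d, kv) lines, and the attacker's
     * ka primed lines are evicted only after the other w - ka lines are. */
    static int observed_evictions(int w, int d, int kv, int ka) {
        int occupied = min(d, kv);         /* effects 2 and 3: x <= min(d, kv) */
        return min(ka, max(0, occupied - (w - ka)));  /* effect 1: x = 0 if
                                                         d < w - ka           */
    }

    int main(void) {
        int w = 16, kv = 14, ka = 6;       /* illustrative budgets */
        for (int d = 0; d <= w; d++)
            printf("d=%2d -> x=%d\n", d, observed_evictions(w, d, kv, ka));
        return 0;
    }

Running this with the illustrative budgets shows all three effects at once: x stays 0 for every d ≤ w − ka = 10, and all demands d ≥ kv = 14 produce the same x, so the attacker cannot tell them apart.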

4.2 Implementation

Implementation of cacheable queues is processor-microarchitecture dependent. Here we focus our attention on Intel x86 processors, which appear to be more vulnerable to PRIME-PROBE attacks due to their inclusive last-level caches [21]. As x86 architectures only support memory management at the page granularity (e.g., by manipulating the PTEs to cause page faults), CACHEBAR controls the cacheability of memory blocks at page granularity. CACHEBAR uses reserved bits in each PTE to manage the cacheability of, and to track accesses to, the physical page to which it points, since a reserved bit set in a PTE induces a page fault upon access to the associated virtual page, for which the backing physical page cannot be retrieved or cached (if it is not already) before the bit is cleared [11, 25]. We hence use the term domain-cacheable to refer to a physical page that is "cacheable" in the view of all processes in a particular security domain, which is implemented by modifying all relevant PTEs (to have no reserved bits set) in the processes of that security domain. By definition, a physical page that is domain-cacheable to one container may not necessarily be domain-cacheable to another.

To ensure that no more than ki memory blocks from all processes in container i can occupy lines in a given cache set, CACHEBAR ensures that no more than ki of those processes' physical memory pages whose contents can be stored in that cache set are domain-cacheable at any point in time. Physical memory pages whose contents can be stored in the same cache set are said to be of the same color, and so to implement this property, CACHEBAR maintains, per container and per color (rather than per cache set), one cacheable queue, each element of which is a physical memory page that is domain-cacheable in this container. Since the memory blocks in each physical page map to different cache sets, limiting the domain-cacheable pages of a color to ki also limits the number of cache lines that blocks from these pages can occupy in the same cache set to ki. [Footnote 5: In this way, the cache contention arising from accesses in the same page is minimized.]

To implement a non-domain-cacheable memory page, CACHEBAR sets one of the reserved bits, which we denote by NC, in the PTE for each virtual page in the domain mapped to that physical page. As such, accesses to any of these virtual pages will be trapped into the kernel and handled by the page fault handler. Upon detecting page faults of this type, the page fault handler will move the accessed physical page into the corresponding cacheable queue, clear the NC bit in the current PTE, and remove a least-recently-used physical page from the cacheable queue, setting the NC bits in this domain's PTEs mapped to that page. [Footnote 6: We avoid the overhead of traversing all PTEs in the container that map to this physical page. Access to those virtual pages will trigger page faults to make these updates without altering the cacheable queue.] A physical page removed from the cacheable queue will be flushed out of the cache using clflush instructions on all of its memory blocks to ensure that no residue remains in the cache. CACHEBAR will flush the translation lookaside buffers (TLBs) of all processors to ensure the correctness of page cacheabilities every time PTEs are altered. In this way, CACHEBAR limits the number of domain-cacheable pages of a single color at any time to ki.

To maintain the LRU property of the cacheable queue, a daemon periodically re-sorts the queue in descending order of recent access count. Specifically, the daemon traverses the domain's PTEs mapped to each physical frame within that domain's queue and counts the number having their ACCESSED bit set, after which it clears these ACCESSED bits. It then orders the physical pages in the cacheable queue by this count (see Fig. 5). In our present implementation, this daemon is the same daemon that resets pages from the ACCESSED state to the SHARED state (see Sec. 3), which already checks and resets the ACCESSED bits in copies' PTEs. Again, this daemon runs every ∆accessed = 1s in our implementation. This daemon also performs the task of resetting ki for each security domain i, which in our present implementation it does every tenth time it runs.

[Figure 5: A cacheable queue for one page color in a domain: (a) access to page 24 brings the page into the queue and clears the NC bit in the PTE triggering the fault; periodically, (b) a daemon counts, per page, the ACCESSED bits in the domain's PTEs referring to that page and (c) reorders the pages in the queue accordingly; to make room for a new page, (d) the NC bits in PTEs pointing to the least recently used page are set, and the page is removed from the queue.]

Interacting with copy-on-access. The cacheable queues work closely with the copy-on-access mechanisms. In particular, as both the COA and NC bits may trigger a page fault upon page accesses, the page fault handler logic must incorporate both (shown in Fig. 6). First, a page fault is handled as normal unless it is due to one of the reserved bits set in the PTE. As CACHEBAR is the only source of reserved bits, it takes over page fault handling from this point. CACHEBAR first checks the COA bit in the PTE. If it is set, the corresponding physical page is either SHARED, in which case it will be transitioned to ACCESSED, or ACCESSED, in which case it will be copied. CACHEBAR then clears the COA bit and, if no other reserved bits are set, the fault handler returns. Otherwise, if the NC bit is set, the associated physical page is not in the cacheable queue for its domain, and so CACHEBAR enqueues the page and, if the queue is full, removes the least-recently-used page from the queue. If the NC bit is clear, this page fault is caused by unknown reasons and CACHEBAR turns control over to the generic handler for reserved bits.

[Figure 6: Page fault handler for CACHEBAR]
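The dispatch order of Fig. 6 is easy to mis-read, so here it is as a self-contained C sketch. The types and helper names (coa_copy_or_mark, nc_enqueue_and_maybe_evict, and so on) are ours, standing in for the kernel paths the text describes rather than CACHEBAR's actual functions.

    #include <stdbool.h>
    #include <stdio.h>

    struct pte { bool coa; bool nc; };   /* the two reserved bits CACHEBAR uses */

    /* Stand-ins for the kernel paths described in the text. */
    static void handle_fault_as_normal(void)      { puts("generic fault path"); }
    static void coa_copy_or_mark(void)            { puts("SHARED->ACCESSED or copy"); }
    static void nc_enqueue_and_maybe_evict(void)  { puts("enqueue; evict+flush LRU"); }
    static void generic_reserved_bit_handler(void){ puts("unknown reserved bit"); }

    static void cachebar_page_fault(struct pte *p, bool reserved_bit_fault) {
        if (!reserved_bit_fault) {       /* not CACHEBAR's fault to handle */
            handle_fault_as_normal();
            return;
        }
        if (p->coa) {                    /* copy-on-access is checked first */
            coa_copy_or_mark();
            p->coa = false;
            if (!p->nc)                  /* no other reserved bits: done */
                return;
        }
        if (p->nc) {                     /* cacheability management second */
            nc_enqueue_and_maybe_evict();
            p->nc = false;
        } else {
            generic_reserved_bit_handler();  /* reserved fault we did not cause */
        }
    }

    int main(void) {
        struct pte p = { .coa = true, .nc = true };
        cachebar_page_fault(&p, true);   /* exercises both branches in order */
        return 0;
    }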

4.3 Security

Recall that ki is the number of cache lines in a certain cache set that is made available to security domain i for use by its processes. Each ki is chosen at random, independent of the draws for other security domains. Let Ki denote the random variable distributed according to how ki is determined. The random variables that we presume can be observed by the attacker domains include K1, ..., Km; let

    K_a = \min\{w, \sum_{i=1}^{m} K_i\}

denote the number of cache lines allocated to the attacker domains. We also presume that the attacker can accurately measure the number X of the attacker's cache lines that are evicted during the victim's execution.

Let P_d(E) denote the probability of event E in an execution period during which the victim's cache usage would populate d lines (of this color) if it were allowed to use all w lines, i.e., if k0 = w. We (the defender) would like to distribute K0, ..., Km, and thus Ka, so as to minimize the statistical distance between the eviction distributions observable by the attacker for different victim demands d, d′, i.e., to minimize

    \sum_{0 \le d < d' \le w} \sum_{x} | P_d(X = x) - P_{d'}(X = x) |        (1)

Here P_d(X = x) can be decomposed by conditioning on the budgets:

    P_d(X = x) = \sum_{k_0, k_a} P_d(X = x \mid K_0 = k_0 \wedge K_a = k_a) \cdot P(K_0 = k_0) \cdot P(K_a = k_a)        (2)

Note that we have dropped the "d" subscripts from the probabilities on the right, since K0 and Ka are distributed independently of d. And, since K1, ..., Km are independent, for ka < w,

    P(K_a = k_a) = \sum_{k_1 + \ldots + k_m = k_a} \; \prod_{i=1}^{m} P(K_i = k_i)        (3)

and

    P(K_a = w) = \sum_{k_1, \ldots, k_m \,:\, k_i \le w,\; k_1 + \ldots + k_m \ge w} \; \prod_{i=1}^{m} P(K_i = k_i)        (4)

As such, our final optimization problem seeks to balance Eqn. 1 and Eqn. 10. Let the constant γ denote the maximum (i.e., worst) possible value of Eqn. 1 (i.e., when P(Ki = w) = 1 for each i) and δ denote the maximum (i.e., worst) possible value of Eqn. 10 (i.e., when P(Ki = 0) = 1 for each i). Then, given a parameter ǫ, 0 < ǫ < 1, our optimization computes distributions for K0, ..., Km so as to minimize a value u subject to

    \frac{\epsilon}{\gamma} \sum_{0 \le d < d' \le w} \sum_{x} | P_d(X = x) - P_{d'}(X = x) | \;\le\; u
    \quad\text{and}\quad
    \frac{1 - \epsilon}{\delta} \cdot (\text{the value of Eqn. 10}) \;\le\; u
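Eqns. 3 and 4 amount to convolving the per-domain budget distributions and clipping the total at w. The short C sketch below computes the distribution of Ka this way for a toy example with two attacker domains and uniform budgets; the distributions are illustrative stand-ins, not the ones our optimization actually produces.

    #include <stdio.h>

    #define W 16   /* cache associativity */
    #define M 2    /* number of attacker domains */

    int main(void) {
        /* p[i][k]: P(Ki = k) for attacker domain i; uniform on 4..14
         * purely for illustration. */
        double p[M][W + 1] = { { 0 } };
        for (int i = 0; i < M; i++)
            for (int k = 4; k <= 14; k++)
                p[i][k] = 1.0 / 11.0;

        /* Convolve the Ki to get the distribution of their sum (Eqn. 3)... */
        double sum[M * W + 1] = { 0 }, next[M * W + 1];
        for (int k = 0; k <= W; k++) sum[k] = p[0][k];
        for (int i = 1; i < M; i++) {
            for (int s = 0; s <= M * W; s++) next[s] = 0;
            for (int s = 0; s <= i * W; s++)
                for (int k = 0; k <= W; k++)
                    next[s + k] += sum[s] * p[i][k];
            for (int s = 0; s <= M * W; s++) sum[s] = next[s];
        }

        /* ...then clip at w: Ka = min(w, sum of Ki), folding the tail of
         * the sum into P(Ka = w) (Eqn. 4). */
        double ka[W + 1] = { 0 };
        for (int s = 0; s <= M * W; s++)
            ka[s < W ? s : W] += sum[s];

        for (int k = 0; k <= W; k++)
            if (ka[k] > 0)
                printf("P(Ka = %2d) = %.4f\n", k, ka[k]);
        return 0;
    }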

5.2.1 FLUSH-RELOAD Attacks

The sender and receiver were linked to a shared library, libcrypto.so.1.0.0, and were pinned to run on different cores of the same socket, thus sharing the same last-level cache. The sender ran in a loop, repeatedly accessing one memory location (the beginning address of the function AES_decrypt()). The receiver executed FLUSH-RELOAD attacks on the same memory address, by first FLUSHing the memory block out of the shared LLC with a clflush instruction and then RELOADing the memory address by accessing it directly while measuring the access latency. The interval between FLUSH and RELOAD was set to 2500 cycles. The experiment was run for 500,000 FLUSH-RELOAD trials. We then repeated this experiment with the sender accessing an unshared address, to form a baseline.

Fig. 7(a) shows the results of this experiment, when run over unmodified Linux. The three horizontal lines forming the "box" in each boxplot represent the first, second (median), and third quartiles of the FLUSH-RELOAD measurements; whiskers extend to cover all points that lie within 1.5× the interquartile range. As can be seen in this figure, the times observed by the receiver to RELOAD the shared address were clearly separable from the times to RELOAD the unshared address, over unmodified Linux. With CACHEBAR enabled, however, these measurements are no longer separable (Fig. 7(b)).

[Figure 7: RELOAD timings (CPU cycles) in FLUSH-RELOAD attacks on an address shared with the victim vs. timings on an unshared address: (a) with CACHEBAR disabled; (b) with CACHEBAR enabled.]

Certain corner cases are not represented in Fig. 7. For example, we found it extremely difficult to conduct experiments to capture the corner cases where FLUSH and RELOAD take place right before and after physical page mergers, as described in Sec. 3.3. As such, we rely on our manual inspection of the implementation in these cases to check correctness, and argue these corner cases are very difficult to exploit in practice.

5.2.2 PRIME-PROBE Attacks

We evaluate the effectiveness of CACHEBAR against PRIME-PROBE attacks by measuring its ability to interfere with a simulated attack. In our simulation, a process in an attacker container repeatedly performed PRIME-PROBE attacks on a specific cache set, while a process in a victim container accessed data that were retrieved into the same cache set at the rate of d accesses per attacker PRIME-PROBE interval. Moreover, the cache lines available to the victim container and attacker container, i.e., kv and ka respectively, were fixed in each experiment.

The machine architecture on which we performed these tests had a w-way LLC with w = 16. The values ka and kv were distributed as computed in Sec. 4.3; with this distribution, only values in {4, 5, 6, ..., 14} were possible. In each test with fixed kv and ka, we allowed the victim to place a demand of (i.e., retrieve memory blocks to fill) d ∈ {0, 1, 2, ..., 16} cache lines of the cache set undergoing the PRIME-PROBE attack by the attacker. The attacker's goal was to classify the victim's demand into one of six classes: NONE = {0}, ONE = {1}, FEW = {2, 3, 4}, SOME = {5, 6, 7, 8}, LOTS = {9, 10, 11, 12}, and MOST = {13, 14, 15, 16}. By asking the attacker to classify the victim's demand into only one of six classes (versus one of 16), we substantially simplified the attacker's job.

Also to make the attacker's job easier, we permitted the attacker to know ka; i.e., the attacker trained a different classifier per value of ka, with knowledge of the demand d per PRIME-PROBE trial, and then tested against additional trial results to classify unknown victim demands. Specifically, after training a naive Bayes classifier on 500,000 PRIME-PROBE trials per (d, ka, kv) triple, we tested it on another 500,000 trials. To filter out PROBE readings due to page faults, excessively large readings were discarded from our evaluation. The tests without protection by CACHEBAR yielded the confusion matrix in Fig. 8(a), with overall accuracy of 67.5%. In this figure, cells with higher numbers have lighter backgrounds, and so the best attacker would be one who achieves white cells along the diagonal and dark-gray cells elsewhere. As can be seen there, classification by the attacker was very accurate for d falling into NONE, ONE, or LOTS; e.g., d = 1 resulted in a classification of ONE with probability 0.80. Some other demands had lower accuracy, but were almost always classified into adjacent classes; specifically, every class of victim demand was classified correctly or as an adjacent class (e.g., d ∈ FEW was classified as ONE, FEW, or SOME) at least 96% of the time.

In contrast, Fig. 8(b) shows the confusion matrix for a naive Bayes classifier trained and tested using PRIME-PROBE trials conducted with CACHEBAR enabled. Specifically, these values were calculated using

    P(class = c \mid d \in c') = \sum_{4 \le k_a, k_v \le 14} P(class = c \mid d \in c' \wedge K_v = k_v \wedge K_a = k_a) \cdot P(K_a = k_a) \cdot P(K_v = k_v)

where class denotes the classification obtained by the adversary using the naive Bayes classifier; c, c′ ∈ {NONE, ONE, FEW, SOME, LOTS, MOST}; and P(Ka = ka) and P(Kv = kv) are calculated as described in Sec. 4.3. The factor P(class = c | d ∈ c′ ∧ Kv = kv ∧ Ka = ka) was measured empirically.

Figure 8: Confusion matrix of the naive Bayes classifier (rows: victim demand d; columns: classification by attacker).

(a) Without CACHEBAR:

            NONE  ONE   FEW   SOME  LOTS  MOST
    NONE    .96   .04   .00   .00   .00   .00
    ONE     .01   .80   .19   .01   .00   .00
    FEW     .00   .16   .50   .30   .04   .00
    SOME    .00   .00   .07   .54   .34   .04
    LOTS    .00   .00   .00   .03   .84   .13
    MOST    .00   .00   .00   .03   .56   .41

(b) With CACHEBAR:

            NONE  ONE   FEW   SOME  LOTS  MOST
    NONE    .33   .16   .26   .18   .04   .02
    ONE     .16   .36   .19   .19   .06   .04
    FEW     .13   .14   .40   .19   .09   .05
    SOME    .09   .10   .16   .37   .20   .07
    LOTS    .08   .06   .10   .16   .46   .13
    MOST    .10   .07   .18   .18   .18   .29

Though space limits preclude reporting the full class confusion matrix for each kv, ka pair, the accuracy of the naive Bayes classifier per kv, ka pair, averaged over all classes c, is shown in Fig. 9. As in Fig. 8, cells with larger values in Fig. 9 are more lightly colored, though in this case, the diagonal has no particular significance. Rather, we would expect that when the attacker and victim are each limited to fewer lines in the cache set (i.e., small values of ka and kv, in the upper left-hand corner of Fig. 9) the accuracy of the attacker will suffer, whereas when the attacker and victim are permitted to use more lines of the cache (i.e., in the lower right-hand corner) the attacker's accuracy would improve. Fig. 9 supports these general trends.

Figure 9: Accuracy per values of kv (columns) and ka (rows):

    ka\kv   4    5    6    7    8    9    10   11   12   13   14
    4      .18  .17  .17  .17  .17  .17  .17  .17  .36  .22  .33
    5      .19  .17  .30  .32  .27  .27  .20  .26  .33  .46  .39
    6      .17  .31  .24  .18  .21  .17  .20  .27  .43  .39  .41
    7      .17  .33  .22  .22  .19  .31  .33  .33  .46  .48  .54
    8      .33  .35  .32  .23  .43  .37  .43  .42  .32  .38  .49
    9      .20  .26  .31  .28  .44  .38  .34  .34  .46  .39  .56
    10     .41  .31  .27  .35  .50  .55  .53  .31  .53  .50  .62
    11     .45  .45  .40  .45  .47  .54  .54  .57  .67  .50  .50
    12     .55  .50  .59  .63  .49  .48  .54  .49  .56  .58  .57
    13     .55  .53  .68  .68  .54  .65  .52  .56  .57  .66  .66
    14     .53  .56  .45  .65  .46  .62  .48  .68  .55  .57  .53

Returning to Fig. 8(b), we see that CACHEBAR substantially degrades the adversary's classification accuracy, which overall is only 33%. Moreover, the adversary is not only wrong more often, but is also often "more wrong" in those cases. That is, whereas Fig. 8(a) shows that each class of victim demand was classified as that demand or an adjacent demand at least 96% of the time, this property no longer holds true in Fig. 8(b). Indeed, the attacker's best case in this regard is classifying victim demand LOTS, which it classifies as SOME, LOTS, or MOST 75% of the time. In the case of a victim demand of MOST, this number is only 47%.

5.3 Performance Evaluation

In this section we describe tests we have run to evaluate the performance impact of CACHEBAR relative to an unmodified Linux kernel. As mentioned previously, we are motivated by side-channel prevention in PaaS clouds, and so we focused our performance evaluation on typical PaaS applications (primarily web servers supporting various language runtimes) running together with CACHEBAR. For the sake of space, we defer our discussion of typical PaaS applications to App. A. Also, here we report only throughput and response-time measurements; other experiments, to shed light on CACHEBAR's memory savings over prohibiting cross-container memory sharing in Linux as an alternative to copy-on-access, can be found in App. B.

Our experiments explored CACHEBAR's performance impact (1) as a function of the number of container (and webserver) instances; (2) for different combinations of webserver and application language; (3) for complex workloads characteristic of a social networking website; and (4) for media-streaming workloads.

Webserver performance. In the first experiments, each container ran an Apache (version 2.4.7) web server with PHP-FPM and SSL enabled. We set up one client per server using autobench; clients were spread across four computers, each with the same networking capabilities as the (one) server computer (not to mention more cores and memory than the server computer), to ensure that any bottlenecks were on the server machine. Each client repeatedly requested a web page and recorded its achievable throughputs and response times at those throughput rates. The content returned to each client request was the 86KB output of phpinfo().

Fig. 10 shows the throughputs and response times when clients sent requests using SSL without reusing connections. In particular, Fig. 10(a) shows the achieved response rates (left axis) and response times (right axis), averaged over all containers, as a function of offered load when there were four containers (and so four web servers). Bars depict average response rates running over unmodified Linux ("rate w/o CACHEBAR") or CACHEBAR ("rate w CACHEBAR"), and lines depict average response times running over unmodified Linux ("time w/o CACHEBAR") or CACHEBAR ("time w CACHEBAR"). Fig. 10(b) shows the same information for 16 containers. As can be seen in these figures, the throughput impact of CACHEBAR was minimal, while the response time increased by around 20%. Fig. 10(c) shows this information in another way, with the number of containers (and hence servers) increasing along the horizontal axis. In Fig. 10(c), each bar represents the largest request rate at which the responses could keep up.

[Figure 10: Average throughput and response time per Apache+PHP-FPM web server, each in a separate container: (a) 4 webservers; (b) 16 webservers; (c) different numbers of webservers. Bars show responses per second, lines show response time (ms), each with and without CACHEBAR.]

[Figure 11: Throughput per webserver/language (tomcat, python-apache+cgi, python-tornado, ruby-puma, ruby-passenger, ruby-unicorn, ruby-mongrel, ruby-thin), with and without CACHEBAR; annotations give CACHEBAR's throughput overhead per configuration.]

[Figure 12: Response times per operation (Browse, Login, Post, SelfWall, AddFriend, SendMessage, ReceiveMessage, Register, Update, Logout), with and without CACHEBAR.]

[Figure 13: Media streaming: throughput and response time for 1, 4, 8, and 16 containers, with and without CACHEBAR.]

Webserver+language combinations. Next, we se- Fig. 12 shows that the responsivenessof the variouscom- lected other common webserver+app-languagecombina- mon operations suffered little with CACHEBAR, suffer- tions (again, see Table 2), namely Java over a Tomcat ing between 2% and 15% overhead. Three operations web server, Python over Apache+cgi, Python over (register, update, and logout) suffered greater , and Ruby over . For each configura- than 25% overhead, but these operations were rare in the tion, we instantiated 16 containers and set each up to Faban workload (and presumably in practice). dynamically generate 80KB random strings for clients. Media streaming in CloudSuite. In additionto the web- We also did tests using another four web servers run- server benchmark setup used above, CloudSuite offers a ning the same Ruby application, namely Passenger, media streaming server running over that serves Thin , , and . Fig. 11 shows the 3.1GB static video files at different levels of quality. We throughput that resulted in each case, over Linux and set up a client process per server to issue a mix of re- over CACHEBAR. As shown there, the throughput over- quests for videos at different quality levels and, through a heads were modest for most of the server+languagecom- binary search, to find the peak request rate the server can binations that we considered. The worst case was Python sustain while keeping the failure rate below a threshold. Apache over +cgi, which suffered a throughputdegrada- Fig. 13 shows that CACHEBAR affected this application tion with CACHEBAR of 25%; other degradations were least of all, in both throughput and response time. much more modest. Impact on a more complex workload. To test for ef- 6 CONCLUSION fects on more complex workloads, we used the web- Side-channel attacks via the LLC are becoming increas- server instance in CloudSuite [7] that implements a so- ingly efficient and powerful. To counter this growing cial community website written in PHP over Nginx on threat, we have presented the design of two techniques to our CACHEBAR-protected machine. This implementa- defend against these attacks, namely (i) copy-on-access tion queries a MySQL database and caches results using for physical pages shared among multiple security do- Memcached; in keeping with PaaS architectures (see mains, to interfere with FLUSH-RELOAD attacks, and (ii) App. A), the database and Memcached server were im- cacheability management for pages to limit the number plemented on another machine without CACHEBAR pro- of cache lines per cache set that an adversary can oc- tection, since tenants cannot typically execute directly on cupy simultaneously, to mitigate PRIME-PROBE attacks. these machines. We used the Faban tool to generate We described the implementation of these techniques in a mix of requests to the webserver, including browse a memory-management subsystem called CACHEBAR (7.9%), login (7.5%), post (24.9%), add friend for Linux, to interfere with LLC-based side-channel at- (7.3%), send msg (44.0%), register (0.8%), and tacks across containers. We confirmed that our de- logout (7.5%). In addition, a background activity sign mitigates side-channel attacks through formal anal- happened on the webserver every 10s, which was ei- ysis of both copy-on-access (using model checking) and ther receive msg or update with equal likelihood. cacheability management (through probabilistic analy-

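As promised above, the following is a minimal Python sketch of a generator for the weighted request mix of the Faban workload. The operation names and weights are taken from the text; the driver itself is hypothetical (Faban is configured through its own run definitions, not code like this), and pacing/think time is omitted. The issue parameter stands in for a function that sends one request.

import random
import time

# Operation mix from the Faban workload above (weights are percentages).
MIX = {
    "browse": 7.9, "login": 7.5, "post": 24.9, "add friend": 7.3,
    "send msg": 44.0, "register": 0.8, "logout": 7.5,
}
OPS, WEIGHTS = zip(*MIX.items())

def run_workload(issue, duration_s=60):
    """Draw foreground operations from MIX; every 10s also issue a
    background receive msg or update, chosen with equal likelihood."""
    next_background = time.time() + 10
    deadline = time.time() + duration_s
    while time.time() < deadline:
        if time.time() >= next_background:
            issue(random.choice(["receive msg", "update"]))
            next_background += 10
        issue(random.choices(OPS, weights=WEIGHTS, k=1)[0])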
6 CONCLUSION

Side-channel attacks via the LLC are becoming increasingly efficient and powerful. To counter this growing threat, we have presented the design of two techniques to defend against these attacks, namely (i) copy-on-access for physical pages shared among multiple security domains, to interfere with FLUSH-RELOAD attacks, and (ii) cacheability management for pages to limit the number of cache lines per cache set that an adversary can occupy simultaneously, to mitigate PRIME-PROBE attacks. We described the implementation of these techniques in a memory-management subsystem called CACHEBAR for Linux, to interfere with LLC-based side-channel attacks across containers. We confirmed that our design mitigates side-channel attacks through formal analysis of both copy-on-access (using model checking) and cacheability management (through probabilistic analysis), as well as through empirical evaluations. Our experiments also confirmed that the overheads of our approach are modest for PaaS workloads, e.g., imposing a virtually unnoticeable cost on server throughputs.

Acknowledgments  This work was supported in part by NSF grant 1330599.

REFERENCES

[1] ARCANGELI, A., EIDUS, I., AND WRIGHT, C. Increasing memory density by using KSM. In Linux Symposium (2009), pp. 19–28.

[2] AZAR, Y., KAMARA, S., MENACHE, I., RAYKOVA, M., AND SHEPARD, B. Co-location-resistant clouds. In 6th ACM Cloud Computing Security Workshop (2014), pp. 9–20.

[3] COPPENS, B., VERBAUWHEDE, I., BOSSCHERE, K. D., AND SUTTER, B. D. Practical mitigations for timing-based side-channel attacks on modern x86 processors. In 30th IEEE Symposium on Security and Privacy (2009), pp. 45–60.

[4] CRANE, S., HOMESCU, A., BRUNTHALER, S., LARSEN, P., AND FRANZ, M. Thwarting cache side-channel attacks through dynamic software diversity. In 2015 ISOC Network and Distributed System Security Symposium (2015).

[5] LEVINA, E., AND BICKEL, P. The earth mover's distance is the Mallows distance. In 8th International Conference on Computer Vision (2001), pp. 251–256.

[6] FÁBREGA, F. J. T., AND GUTTMAN, J. D. Copy on write. Citeseer, 1995.

[7] FERDMAN, M., ADILEH, A., KOCBERBER, O., VOLOS, S., ALISAFAEE, M., JEVDJIC, D., KAYNAK, C., POPESCU, A. D., AILAMAKI, A., AND FALSAFI, B. Clearing the clouds: A study of emerging scale-out workloads on modern hardware. In 17th International Conference on Architectural Support for Programming Languages and Operating Systems (2012), pp. 37–48.

[8] GRUSS, D., MAURICE, C., AND WAGNER, K. Flush+Flush: A stealthier last-level cache attack. CoRR abs/1511.04594 (Nov. 2015).

[9] GULLASCH, D., BANGERTER, E., AND KRENN, S. Cache games – bringing access-based cache attacks on AES to practice. In 32nd IEEE Symposium on Security and Privacy (May 2011), pp. 490–505.

[10] HAN, Y., ALPCAN, T., CHAN, J., AND LECKIE, C. Security games for virtual machine allocation in cloud computing. In 4th International Conference on Decision and Game Theory for Security, vol. 8252 of LNCS. Nov. 2013, pp. 99–118.

[11] INTEL. Intel® 64 and IA-32 Architectures Software Developer's Manual, 2010.

[12] IRAZOQUI, G., EISENBARTH, T., AND SUNAR, B. S$A: A shared cache attack that works across cores and defies VM sandboxing—and its application to AES. In 36th IEEE Symposium on Security and Privacy (May 2015).

[13] KERAMIDAS, G., ANTONOPOULOS, A., SERPANOS, D. N., AND KAXIRAS, S. Non deterministic caches: A simple and effective defense against side channel attacks. Design Automation for Embedded Systems 12, 3 (2008), 221–230.

[14] KIM, T., PEINADO, M., AND MAINAR-RUIZ, G. STEALTHMEM: System-level protection against cache-based side channel attacks in the cloud. In USENIX Security Symposium (2012), pp. 189–204.

[15] KONG, J., ACIICMEZ, O., SEIFERT, J. P., AND ZHOU, H. Deconstructing new cache designs for thwarting software cache-based side channel attacks. In 2nd ACM Workshop on Computer Security Architectures (2008), pp. 25–34.

[16] KÖNIGHOFER, R. A fast and cache-timing resistant implementation of the AES. In Topics in Cryptology–CT-RSA 2008. 2008, pp. 187–202.

[17] LI, M., ZHANG, Y., BAI, K., ZANG, W., YU, M., AND HE, X. Improving cloud survivability through dependency based virtual machine placement. In International Conference on Security and Cryptography (July 2012), pp. 321–326.

[18] LI, P., GAO, D., AND REITER, M. K. StopWatch: A cloud architecture for timing channel mitigation. ACM Transactions on Information and System Security 17, 2 (Nov. 2014).

[19] LIU, F., GE, Q., YAROM, Y., MCKEEN, F., ROZAS, C., HEISER, G., AND LEE, R. B. CATalyst: Defeating last-level cache side channel attacks in cloud computing. In 22nd IEEE Symposium on High Performance Computer Architecture (Mar. 2016).

[20] LIU, F., AND LEE, R. B. Random fill cache architecture. In 47th Annual IEEE/ACM International Symposium on Microarchitecture (2014), IEEE Computer Society, pp. 203–215.

[21] LIU, F., YAROM, Y., GE, Q., HEISER, G., AND LEE, R. B. Last-level cache side-channel attacks are practical. In 36th IEEE Symposium on Security and Privacy (May 2015).

[22] MALLOWS, C. L. A note on asymptotic joint normality. Annals of Mathematical Statistics 43, 2 (1972), 508–515.

[23] OSVIK, D. A., SHAMIR, A., AND TROMER, E. Cache attacks and countermeasures: the case of AES. In Topics in Cryptology–CT-RSA 2006. Springer, 2006, pp. 1–20.

[24] PAGE, D. Partitioned cache architecture as a side-channel defence mechanism. Tech. Rep. 2005/280, IACR Cryptology ePrint Archive, 2005.

[25] RAIKIN, S., SAGER, J. D., SPERBER, Z., KRIMER, E., LEMPEL, O., SHWARTSMAN, S., YOAZ, A., AND GOLZ, O. Tracking mechanism coupled to retirement in reorder buffer for indicating sharing logical registers of physical register in record indexed by logical register, Dec. 16 2014. US Patent 8,914,617.

[26] RAJ, H., NATHUJI, R., SINGH, A., AND ENGLAND, P. Resource management for isolation enhanced cloud services. In 2009 ACM Workshop on Cloud Computing Security (2009), pp. 77–84.

[27] SHI, J., SONG, X., CHEN, H., AND ZANG, B. Limiting cache-based side-channel in multi-tenant cloud using dynamic page coloring. In Workshops of the 41st IEEE/IFIP International Conference on Dependable Systems and Networks (2011), pp. 194–199.

[28] STEFAN, D., BUIRAS, P., YANG, E. Z., LEVY, A., TEREI, D., RUSSO, A., AND MAZIÈRES, D. Eliminating cache-based timing attacks with instruction-based scheduling. In Computer Security–ESORICS 2013. 2013, pp. 718–735.

[29] SVENNINGSSON, J., AND SANDS, D. Specification and verification of side channel declassification. In 6th International Conference on Formal Aspects in Security and Trust (2010), Springer-Verlag, pp. 111–125.

[30] TROMER, E., OSVIK, D. A., AND SHAMIR, A. Efficient cache attacks on AES, and countermeasures. Journal of Cryptology 23, 1 (2010), 37–71.

[31] VARADARAJAN, V., RISTENPART, T., AND SWIFT, M. Scheduler-based defenses against cross-VM side channels. In 23rd USENIX Security Symposium (Aug. 2014).

[32] VATTIKONDA, B. C., DAS, S., AND SHACHAM, H. Eliminating fine grained timers in Xen. In 3rd ACM Cloud Computing Security Workshop (Oct. 2011), pp. 41–46.

[33] WANG, Z., AND LEE, R. B. A novel cache architecture with enhanced performance and security. In 41st IEEE/ACM International Symposium on Microarchitecture (2008), pp. 83–93.

[34] WRAY, J. C. An analysis of covert timing channels. In 1991 IEEE Symposium on Security and Privacy (1991), pp. 2–7.

[35] YAROM, Y., AND FALKNER, K. E. FLUSH+RELOAD: A high resolution, low noise, L3 cache side-channel attack. In 23rd USENIX Security Symposium (2014), pp. 719–732.

[36] ZHANG, Y., JUELS, A., REITER, M. K., AND RISTENPART, T. Cross-VM side channels and their use to extract private keys. In ACM Conference on Computer & Communications Security (2012), pp. 305–316.

[37] ZHANG, Y., JUELS, A., REITER, M. K., AND RISTENPART, T. Cross-tenant side-channel attacks in PaaS clouds. In ACM Conference on Computer & Communications Security (2014), pp. 990–1003.

[38] ZHANG, Y., LI, M., BAI, K., YU, M., AND ZANG, W. Incentive compatible moving target defense against VM-colocation attacks in clouds. In 27th IFIP Information Security and Privacy Conference, vol. 376 of IFIP Advances in Information and Communication Technology. June 2012, pp. 388–399.

[39] ZHANG, Y., AND REITER, M. K. Düppel: Retrofitting commodity operating systems to mitigate cache side channels in the cloud. In 2013 ACM Conference on Computer & Communications Security (2013), pp. 827–838.

A PLATFORM-AS-A-SERVICE CLOUDS

Platform-as-a-Service (PaaS) cloud is a model of cloud computing which enables users to develop and deploy web applications without installing and managing the required in-house hardware and software. A typical PaaS cloud supports multiple programming languages and facilitates the integration of a variety of application middleware (e.g., data analytics, messaging, and load balancing) and databases (e.g., memcache, SQL). For instance, the popular Heroku (heroku.com) service supports more than ten programming languages, such as Ruby, Node.js, Python, Java, PHP, Go, Perl, C, Erlang, Scala, and Clojure, and provides integration of application middleware as add-ons to facilitate data storage, mobile integration, monitoring and logging, and other types of application development. Customers of PaaS usually need to develop their applications on local machines, and then upload source code to the PaaS system for testing and deployment. In some cases, they are allowed to ssh onto the remote machine and perform necessary configuration and debugging.

In order to increase server utilization and reduce cost, most public PaaS offerings are multi-tenant, serving multiple customers on the same (virtual) machine. Tenants sharing the same operating system are typically isolated using Linux containers. While a web application may contain web servers, programming-language runtimes, databases, and a set of middleware that enrich its functionality, in all PaaS clouds we have studied, language runtimes and web servers are located on different servers from databases and middleware; web servers controlled by different tenants may share the same OS, however. Because users of PaaS clouds do not have permission to execute arbitrary code on the databases and middleware that are typically shared by multiple tenants, the targets of the PRIME-PROBE and FLUSH-RELOAD side-channel attacks we consider in this paper are primarily web servers that support various language runtimes, which may be co-located with adversary-controlled malicious web servers on which arbitrary code can be executed. We conducted a survey to understand the web servers that are used in major PaaS clouds, and the programming languages they support. The results are shown in Table 2.

B EVALUATION OF CACHEBAR'S MEMORY SAVINGS

In this appendix we test the memory footprints induced by CACHEBAR's copy-on-access mechanism in comparison to the simpler alternative of not sharing memory between containers at all. (We caution the reader that this simpler alternative addresses only FLUSH-RELOAD attacks; it could not serve as a replacement for our PRIME-PROBE defense in Sec. 4.) To measure the memory savings that copy-on-access offers over disabling memory sharing between containers, we measured the total unique physical memory pages used across various numbers of webservers, each in its own container, when running over (i) unmodified Linux, (ii) Linux without cross-container memory sharing, and (iii) CACHEBAR-enabled Linux. We used the system diagnosis tool smem for memory accounting, specifically by accumulating the PSS (proportional set size) field output by smem for each process, which reports the process' shared memory pages divided by the number of processes sharing these pages, plus its unshared memory pages. (A short sketch of this accounting appears below.)

For each platform, we incrementally grew the number of containers, each containing a webserver, and left each webserver idle after issuing to it a single request to confirm its functioning. For each number of containers, we measured the memory usage on the machine. Fig. 14(a) shows the memory overhead of Linux without cross-container sharing (“nonshared-idle”) and CACHEBAR (“CACHEBAR-idle”), computed by subtracting the memory measured for unmodified Linux from the memory measured for each of these systems. We grew the number of containers to 16 in each case, and then extrapolated these measurements to larger numbers of containers using best-fit lines (“nonshared-idle-fit” and “CACHEBAR-idle-fit”); a sketch of this overhead-and-fit computation appears at the end of this appendix. As can be seen in Fig. 14(a), the overhead of CACHEBAR is virtually zero, whereas the overhead of Linux without cross-container sharing is more substantial, even with negligible query load.
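The PSS-based accounting described above can be reproduced in a few lines of Python. This is a minimal sketch, not the scripts used for these measurements, and it assumes an smem build that supports the -H (omit header) and -c (select columns) flags.

import subprocess

def total_pss_kib():
    """Sum PSS over all processes as reported by smem (values in KiB).

    PSS charges each shared page to a process as (page size / number
    of sharers), so summing PSS across all processes counts each
    unique physical page once, which is the quantity accumulated in
    the measurements above."""
    out = subprocess.check_output(["smem", "-H", "-c", "pss"], text=True)
    return sum(int(v) for v in out.split())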

PaaS cloud | Supported server engines | Supported programming languages
AppFog (www.appfog.com) | , Apache HTTP, Nginx, Microsoft IIS | Java, Python, PHP, Node.js, Ruby and Go
Azure (azure.microsoft.com) | Apache Tomcat, , Apache HTTP, Nginx, GlassFish, Wildfly, Jetty, Microsoft IIS | Java, Python, PHP, Node.js, and ASP.NET
DotCloud (www.dotcloud.com) | Apache Tomcat, Tornado, PHP built-in webserver | Java, Python, PHP, Node.js, Ruby and Go
Elastic Beanstalk (aws.amazon.com/elasticbeanstalk) | Apache Tomcat, Apache HTTP or Nginx, Passenger or Puma, Microsoft IIS | Java, Python, PHP, Node.js, Ruby, Go, ASP.NET, and others
Engine Yard (www.engineyard.com) | Nginx, Rack, Passenger, Puma, Unicorn, Trinidad | Java, PHP, Node.js, and Ruby
Google Cloud (cloud.google.com/appengine) | JBoss, Wildfly, Apache Tomcat, Apache HTTP, Nginx, Zend Server, Passenger, Mongrel, Thin, Microsoft IIS | Java, Python, PHP, Node.js, Ruby, ASP.NET and Go
Heroku (www.heroku.com) | Jetty, Tornado, PHP built-in webserver, Mongrel, Thin, Hypnotoad, , , Mochiweb | Java, Python, PHP, Node.js, Ruby, Go, Perl, C, Erlang, Scala, and Clojure
HP Cloud (www.hpcloud.com) | Apache, Apache TomEE, Nginx | Java, Python, PHP, Node.js, Ruby, Perl, Erlang, Scala, Clojure, ASP.NET
OpenShift (www.openshift.com) | JBoss, Wildfly, Apache Tomcat, Zend Server, Vert.x | Java, Python, PHP, Node.js, Ruby, Perl, Ceylon, and others

Table 2: Server and programming-language support in various PaaS clouds

Figure 14(a): Webservers idle (memory overhead in MB vs. number of containers; series: nonshared-idle, nonshared-idle-fit, CacheBar-idle, CacheBar-idle-fit)

Figure 14(b): 25% of webservers busy (memory overhead in MB vs. number of containers; series: nonshared-busy, nonshared-busy-fit, CacheBar-busy, CacheBar-busy-fit)

Figure 14: Memory overhead compared with unmodified Linux

The memory overhead of CACHEBAR does grow somewhat (relative to that of unmodified Linux) when some of the servers are subjected to load. Fig. 14(b) shows the same measures, but in an experiment in which every fourth server was subjected to a slightly more active (but still quite modest) load of four requests per second. This was enough to induce CACHEBAR's copy-on-access mechanism to copy some memory pages, resulting in a more noticeable increase in the memory usage of the containers on CACHEBAR-enabled Linux. Again, however, the memory overhead of CACHEBAR was substantially less than that of disabling cross-container sharing altogether. Moreover, even at maximum throughput load for all servers (not shown), the CACHEBAR overhead approaches but does not exceed that of disabling cross-container sharing. As such, the copy-on-access mechanism strikes a better balance between minimizing memory footprints and isolating containers from side channels than does simply disabling cross-container sharing.
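For reference, the per-count overheads and the best-fit extrapolations plotted in Fig. 14 can be recomputed from raw measurements with an ordinary least-squares fit. The following numpy sketch uses hypothetical function and variable names; it illustrates the computation rather than reproducing our scripts.

import numpy as np

def overhead_and_fit(counts, baseline_mb, system_mb):
    """Memory overhead relative to unmodified Linux at each container
    count, plus a least-squares line for extrapolating to larger
    counts (the -fit curves in Fig. 14)."""
    overhead = np.asarray(system_mb, float) - np.asarray(baseline_mb, float)
    slope, intercept = np.polyfit(counts, overhead, deg=1)
    return overhead, lambda n: slope * np.asarray(n, float) + intercept

# Hypothetical usage: measurements for 1..16 containers, extrapolated
# to 42 containers as in Fig. 14.
# overhead, fit = overhead_and_fit(range(1, 17), linux_mb, cachebar_mb)
# print(fit(42))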
