True IOMMU Protection from DMA Attacks:

When Copy Is Faster Than Zero Copy

Alex Markuze Adam Morrison Dan Tsafrir

Hello, my name is Alex and I would like to present our work from this year's ASPLOS. This work was done by me and my advisors Adam Morrison and Dan Tsafrir from the Technion: "True IOMMU Protection from DMA Attacks, a.k.a. When Copy Is Faster Than Zero Copy". (0:20)

DMA

Direct Memory Access

Main Memory

Let's start with DMA. DMA is the ability of an attached device to access main memory without involving the CPU. DMA-capable devices are shown on the screen: devices like GPUs, SSDs, and Network Interface Cards have their own CPUs and their own code, and none of it is controlled by the OS. (0:45)

Problem: DMA Attacks

How to develop a rootkit for Broadcom NetExtreme network cards (ESEC 2011)

Someone (probably the NSA) has been hiding viruses in hard drive firmware (Kaspersky 2015)

This ability to access memory directly also allows a malicious device to read or modify sensitive data. We call this a DMA attack. This is far from fiction. For example, one detailed tutorial that can easily be found on the internet describes in great detail how someone could get malicious code running on a popular NIC, even remotely. That is a NIC you may well find in your own university labs. Also, Kaspersky Labs recently identified attacks in the wild that reprogrammed hard-disk firmware so that the drive could reintroduce malware even after the disk was formatted. (1:20)

Solution: IOMMU

To prevent arbitrary access to memory, modern systems have IOMMU hardware. The IOMMU works for devices just like the MMU works for processes: the device can access only the pages that are explicitly listed in its OS-controlled page table. So with an IOMMU, the device can access main memory only via I/O virtual addresses, or IOVAs. (01:50)

IOMMU Usage

• Made part of DMA API

• Drivers remain unaware

• Backward compatible

When the IOMMU was introduced, it was integrated into the existing DMA API. This allowed all device drivers to remain unchanged, and devices remained functional when the IOMMU was enabled.

Packet Transmission using DMA API

1. dma_map(VA)
   • Allocate IOVA
   • Update device page table (IOVA → PA)

2. Transmit packet (IOVA → PA translation cached in the IO/TLB)

3. dma_unmap(IOVA)
   • Remove entry from page table
   • Flush IO/TLB
   • Free IOVA

To demonstrate how the IOMMU is used via the DMA API, let's follow a sent packet. First, the NIC driver dma_map-s the packet. This operation creates a new I/O virtual address, i.e., it generates the unique integer that serves as the IOVA, and then populates the page table, represented by the blue box on the slide. The red VA and yellow IOVA point to the same physical address. Second, the NIC reads the packet using the IOVA and the packet is sent. When the IOVA is accessed, the IOTLB is updated; the IOTLB is a small translation cache, just like the TLB in the MMU. Third, when the packet has been sent, the IOVA is dma_unmap-ped to prevent a potentially malicious device from re-accessing the packet. Unmap removes the translation from the device's page table, flushes the entry from the IO/TLB, and finally deallocates the IOVA, namely frees the unique integer. (4:00)
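To make this flow concrete, here is a minimal sketch of a transmit path. The dma_map_single/dma_unmap_single calls are the real Linux DMA API; the descriptor-posting helpers are hypothetical stand-ins for driver-specific code:

```c
#include <linux/dma-mapping.h>

/* Sketch of a NIC transmit path over the DMA API.
 * post_tx_descriptor()/wait_tx_complete() are hypothetical
 * placeholders for driver-specific descriptor handling. */
static int nic_xmit(struct device *dev, void *pkt, size_t len)
{
	dma_addr_t iova;

	/* 1. dma_map: allocate an IOVA and install the IOVA->PA
	 *    translation in the device's IOMMU page table. */
	iova = dma_map_single(dev, pkt, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, iova))
		return -ENOMEM;

	/* 2. Transmit: the NIC reads the packet through the IOVA,
	 *    caching the translation in the IOTLB. */
	post_tx_descriptor(dev, iova, len);
	wait_tx_complete(dev);

	/* 3. dma_unmap: remove the page-table entry, flush the IOTLB
	 *    entry (immediately under strict protection), and free
	 *    the IOVA. */
	dma_unmap_single(dev, iova, len, DMA_TO_DEVICE);
	return 0;
}
```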

Problem #1: Performance

[Figure: netperf RX throughput in Gb/s (y-axis 0–40), single core (left) and 16 cores (right), with bars for "no iommu" and "w. iommu"]

Though the IOMMU seems to provide the much-needed layer of protection, using it creates several problems. The first problem is degraded performance. These are netperf results for a 40-gigabit NIC used by a single core, without the IOMMU and with it. As you can see, using the IOMMU reduces throughput from 20 Gb/s to less than 5. Here we see a similar experiment on sixteen cores. The throughput difference is even worse; I don't know if you have noticed the little red bar on the left, and this is with all cores utilised at 100%. (4:40)

Problem #2: vulnerability window

• Strict protection mode:

• IOTLB flush on every single dma_unmap

• IOTLB flushes are expensive!

• To alleviate performance degradation

• Trading off safety for performance

One of the causes of the significant performance hit is the strict IOMMU protection mode, which we use. This mode dictates that IOVAs be purged from the IOTLB as part of each dma_unmap operation, which is performed millions of times per second. As it turns out, IOTLB invalidations are extremely costly, especially since they are performed while holding a lock.

Problem #2: vulnerability window

• Strict protection: IOTLB flush on every single dma_unmap

• Deferred protection: batch IOTLB flushes

• Amortise flush cost

[Timeline: Page P is dma_mapped → legal access to P → Page P is dma_unmapped and freed → Page P is reallocated → vulnerability window until the deferred IOTLB flush]

To address this, kernel developers added an IOMMU usage mode that amortises the overhead. This mode flushes the IOTLB only once every 250 operations. Unfortunately, this mode, which is called deferred protection, trades off security for performance. With deferred protection, there is a window of time in which a device can access buffers that have been unmapped. I'll explain how. First, page P is dma_mapped; now the NIC can legally access it. Then page P is unmapped and freed. Not much later, page P is allocated and used, and no one knows that it is still accessible by the device, because the IOTLB has not been flushed yet. (6:00)
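The window can be seen directly in code. A minimal sketch, assuming a kmalloc'd buffer and the standard DMA API running under deferred protection:

```c
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Sketch of the deferred-protection vulnerability window. */
static void vulnerability_window_demo(struct device *dev, size_t len)
{
	void *buf, *secret;
	dma_addr_t iova;

	buf = kmalloc(len, GFP_KERNEL);     /* page P allocated */
	iova = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	/* ... the device legally DMAs into buf via iova ... */
	dma_unmap_single(dev, iova, len, DMA_FROM_DEVICE);
	/* Deferred mode: the stale IOTLB entry is NOT flushed here;
	 * flushes are batched, roughly once per 250 unmaps. */
	kfree(buf);                         /* page P returned to the allocator */
	secret = kmalloc(len, GFP_KERNEL);  /* page P may be reused for new data */
	/* Until the batched flush runs, a malicious device can still
	 * read or write page P through its cached translation. */
}
```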

Strict vs. Deferred Protection

[Figure: netperf RX throughput in Gb/s (y-axis 0–40), single core (left) and 16 cores (right), with bars for "no iommu", "strict", and "deferred"]

As you can see, deferred protection doesn't help much in the way of performance, and it does create the window of vulnerability. (5:30)

Problem #3: No Sub-Page Protection

Everything the NIC needs to access

Everything the NIC CAN access (4KB page)

Secrets

The problems don't end there: IOMMU granularity is 4KB pages. This means that any data co-located on a page with a DMA buffer can be accessed by the device! For example, the smallest network packet is about 48B, so this is everything the NIC needs to access, and this is everything the NIC can access. Each time a 48-byte packet is mapped, the NIC can access the whole page. We call this the sub-page protection problem. (7:00)
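As an illustrative sketch (the variable names and sizes are hypothetical, not from the talk), consider two small kernel allocations that the slab allocator may co-locate on one 4KB page:

```c
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Illustration of the sub-page protection problem. */
static void subpage_problem_demo(struct device *dev)
{
	char *pkt = kmalloc(48, GFP_KERNEL);  /* 48 B packet buffer      */
	char *key = kmalloc(64, GFP_KERNEL);  /* secret; the allocator
	                                       * may place it on the same
	                                       * 4 KB page as pkt        */
	dma_addr_t iova;

	/* The IOMMU maps the entire 4 KB page containing pkt, so a
	 * malicious device handed this IOVA can also reach key if the
	 * two allocations share a page. */
	iova = dma_map_single(dev, pkt, 48, DMA_TO_DEVICE);
	/* ... transmit ... */
	dma_unmap_single(dev, iova, 48, DMA_TO_DEVICE);
	kfree(key);
	kfree(pkt);
}
```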

Related Work

• [FAST'15] Improving single-core performance

• [ATC'15] Resolving scalability issues for deferred protection

Our group has been looking at the performance problems of IOMMU protection for some time now. It turns out that part of the problem with deferred protection was the algorithm for IOVA allocation: it had very poor performance even on a single core, and at FAST last year we presented a fix for the IOVA allocator. Later that year, at USENIX ATC, we looked at the multicore scalability problems and fixed them as well. (7:40)

Related Work

[Figure: netperf RX throughput in Gb/s (y-axis 0–40), single core (left) and 16 cores (right), with bars for "no iommu", "strict", "deferred", "atc'15+", and "atc'15-"]

Here are the results of this work. The ATC work includes the fixes from FAST, so the single-core case is basically the same. There are two bars for ATC: ATC- is the scalable deferred solution, and ATC+ is with strict protection. Deferred protection now looks much better, but strict still performs very poorly. The reason is that accessing the IOMMU on each DMA is costly, and with multiple cores it becomes prohibitively expensive. (7:00)

Our contribution: performant & secure

[Figure: netperf RX throughput in Gb/s (y-axis 0–40), single core (left) and 16 cores (right), with bars for "no iommu", "strict", "deferred", "atc'15+", "atc'15-", and "copy"]

And this is copy protection: the green bar represents this paper's performance results. (7:05)

Our Contribution

Protection Model   | Sub-Page | No Vulnerability Window | Single Core | Multi Core
Strict   (Linux)   |    ✗     |           ✓             |      ✗      |     ✗
Strict   (FAST)    |    ✗     |           ✓             |      ✓      |     ✗
Strict   (ATC)     |    ✗     |           ✓             |      ✓      |     ✗
Deferred (Linux)   |    ✗     |           ✗             |      ✓      |     ✗
Deferred (FAST)    |    ✗     |           ✗             |      ✓      |     ✗
Deferred (ATC)     |    ✗     |           ✗             |      ✓      |     ✓
Copy Protection    |    ✓     |           ✓             |      ✓      |     ✓

This table summarises the bar charts we saw in the previous slides. As you can see, copy protection is the only mode that provides sub-page protection, and the only mode that provides multicore performance with strict protection. Now let's talk about how it is done. (7:25)

Copy Protection

• dma_map:
  1. Allocate a permanently mapped shadow buffer
  2. Sync shadow and original (if needed)

• dma_unmap:
  1. Sync shadow and original (if needed)
  2. Free the permanently mapped shadow buffer

We take a very different approach. The basic idea is simple: since IOTLB invalidations are expensive, we avoid them; we just don't do invalidations! Instead, we use a set of buffers, called shadow buffers, which are always mapped by the IOMMU, and they are the only thing that is ever mapped for the device. On DMA, we sync the buffers by copying. This provides strict protection: there is never a window of time in which the device can access data that it shouldn't. It also gets us sub-page protection, because we copy exactly the amount needed. And, as it turns out, you can do this without losing much performance. (9:00)
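A minimal sketch of the idea, hooked under the DMA API; struct shadow and the shadow_alloc/shadow_lookup/shadow_free helpers are hypothetical names for the machinery described on the next slides:

```c
#include <linux/dma-mapping.h>
#include <linux/string.h>

/* Hypothetical shadow-buffer descriptor and pool helpers. */
struct shadow { void *vaddr; void *orig; dma_addr_t iova; };
struct shadow *shadow_alloc(size_t len, enum dma_data_direction dir);
struct shadow *shadow_lookup(dma_addr_t iova);
void shadow_free(struct shadow *sh);

dma_addr_t copy_dma_map(void *buf, size_t len, enum dma_data_direction dir)
{
	/* The shadow buffer is permanently mapped in the IOMMU;
	 * no page-table update and no IOTLB work happen here. */
	struct shadow *sh = shadow_alloc(len, dir);

	sh->orig = buf;
	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
		memcpy(sh->vaddr, buf, len);      /* sync: original -> shadow */
	return sh->iova;                          /* all the device ever sees */
}

void copy_dma_unmap(dma_addr_t iova, size_t len, enum dma_data_direction dir)
{
	struct shadow *sh = shadow_lookup(iova);

	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
		memcpy(sh->orig, sh->vaddr, len); /* sync: shadow -> original */
	shadow_free(sh);                          /* no IOTLB flush needed    */
}
```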

Copy Protection

1. dma_map(VA)
   • Allocate shadow buffer
   • Sync VA and shadow buffer
   • Return shadow IOVA

2. Transmit packet (shadow IOVA → PA, cached in the IO/TLB)

3. dma_unmap(IOVA)
   • Free shadow buffer

Let's follow a sent packet again, but now with copy protection. Please notice that the page table is already populated. On send, dma_map allocates a shadow buffer, syncs the two buffers, and then returns the shadow's IOVA. The packet is transmitted using the returned shadow IOVA. Unmap releases the shadow buffer, and that is it. On receive, we sync the shadow buffer on unmap. (9:25)

Copy Protection

Everything the NIC needs to access

Everything the NIC can EVER access

Secrets

With copy protection, the 48B packet, which is the only thing the NIC needs to see, is the only thing the NIC can ever see. And this is how copy protection provides sub-page protection. (10:00)

Observation

• Copy is cheaper than invalidation (< 4KB)

[Figure: single-core RX per-packet cost breakdown in µsec (spinlock, IOTLB invalidation, IOMMU page-table management, copy overhead, other) and throughput in Gb/s, for copy, no iommu, atc'15-, and atc'15+]

Now let's talk about the costs in detail. What we see here are the time costs of receiving a packet when running on a single core: the time breakdown on the left and the throughput on the right. The leftmost bar is copy protection, with the copy overhead coloured in black. The next column is the unprotected kernel, with no such overheads, and the last two bars are the ATC'15 results with deferred and strict protection. The main observation here is that for a 1500-byte MTU, copy is cheaper than the page-table management and the IOTLB invalidation. So clearly, copy is not that expensive, at least not for MTU-sized packets on a single core. (11:30)

Observation

• Copy is cheaper than invalidation (12 µs)

[Figure: 16-core TX per-packet cost breakdown in µsec (spinlock, IOTLB invalidation, IOMMU page-table management, copy overhead, other) and throughput in Gb/s, for copy, no iommu, atc'15-, and atc'15+]

Now let's look at multicore and send. We see here a similar graph, breaking down the send costs of a 64KB packet. On send, the NIC hardware breaks the packet into MTU-sized chunks on its own, so the software doesn't really need to care about the actual MTU. We see that the copy now costs more than the page-table updates and the IOTLB invalidation, but it's nowhere near the locking overhead of IOTLB flushes under strict protection. In this regard, copy is still much cheaper than flushing the IOTLB for each packet. Basically, copy is cheaper than zero-copy. (12:15)

Implementation Challenges

• Concurrency & Scalability

• Transparency

• NUMA locality

• Memory consumption

• Copy costs

When designing copy protection, we wanted to provide a generic solution that requires little to no change to the drivers. The solution had to be scalable, as this is clearly an issue for all existing solutions. The sync step also introduced the need to be NUMA-aware and to look for ways to optimise the copy. By hooking the DMA API, we made the solution transparent to the driver. Copy costs were mitigated by syncing only when needed and only the needed bytes. Thankfully, Intel Ivy Bridge and Haswell CPUs also have optimised memcpy support called ERMS (Enhanced REP MOVSB), sketched below, which made the copy costs much more manageable. But we still needed to avoid cross-socket copying. We will discuss scalability, NUMA awareness, and memory consumption in the next slides. (13:20)
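On ERMS-capable CPUs, a plain rep movsb is microcoded into an optimised, size-adaptive copy. A minimal sketch of such a copy routine (x86-64, GCC/Clang inline assembly; this is an illustration, not the kernel's actual memcpy implementation):

```c
#include <stddef.h>

/* Copy len bytes with REP MOVSB; on CPUs advertising the ERMS
 * (Enhanced REP MOVSB/STOSB) feature flag, the microcode turns this
 * single instruction into a fast copy for a wide range of sizes. */
static inline void erms_copy(void *dst, const void *src, size_t len)
{
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : /* no other inputs */
		     : "memory");
}
```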

Design: 48-bit IOVA

[IOVA layout, bit 47 down to bit 0: shadow flag (bit 47) | cpu index (bits 46..40) | r/rw/w (bits 39..38) | size | meta index | offset]

[Diagram: per-size arrays of metadata entries; each free list is a struct split across two cache lines, one holding meta *array_start and meta *head, the other holding meta *array_start and meta *tail; each meta entry holds void *shadow and void *os_buff]

The key is in the design. Each shadow buffer has a metadata descriptor; the descriptor holds the shadow-buffer pointer and the pointer to the mapped buffer. This descriptor is part of an array of descriptors, and each array holds shadow buffers of the same size and with the same permissions. The free entries of the array form a free list. Each free list is managed by a struct split between two cache lines: the top half holds the head of the free list, and the bottom half holds the pointer to the tail. A freed entry is added to the tail of the list, and an allocated entry is taken from the head, thus avoiding contention between concurrent free and alloc. Both halves hold a pointer to the start of the array, avoiding false sharing. For a fast and scalable design, each core holds its own arrays and shadow buffers. When a new shadow buffer is allocated, its IOVA itself encodes the meta descriptor's location: the free list and the meta entry index encode the exact position. This allows a fast and efficient lookup of the entry on dma_unmap, when only the IOVA is provided. This simple scheme allows extremely fast and scalable free and allocate operations. (14:30)
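A sketch of this encoding; the slide fixes bit 47 as the shadow flag and bits 46..40 as the CPU index, while the exact widths of the remaining fields below are assumptions for illustration:

```c
#include <stdint.h>

/* Assumed field layout of a 48-bit shadow IOVA (field widths below
 * bit 40 are illustrative guesses, not the paper's exact values). */
#define SHADOW_FLAG   (1ULL << 47)   /* marks a shadow-buffer IOVA       */
#define CPU_SHIFT     40             /* bits 46..40: owning CPU          */
#define PERM_SHIFT    38             /* bits 39..38: r / rw / w          */
#define SIZE_SHIFT    32             /* assumed: size class of the pool  */
#define IDX_SHIFT     12             /* assumed: index of the meta entry */
                                     /* bits 11..0: offset in the buffer */

static inline uint64_t iova_encode(unsigned cpu, unsigned perm,
				   unsigned size_class, unsigned idx,
				   unsigned off)
{
	return SHADOW_FLAG
	     | ((uint64_t)cpu        << CPU_SHIFT)
	     | ((uint64_t)perm       << PERM_SHIFT)
	     | ((uint64_t)size_class << SIZE_SHIFT)
	     | ((uint64_t)idx        << IDX_SHIFT)
	     | off;
}

/* dma_unmap receives only the IOVA; decoding it locates the per-CPU
 * array and the metadata entry in O(1), with no lookup structure. */
static inline unsigned iova_cpu(uint64_t iova)
{
	return (iova >> CPU_SHIFT) & 0x7f;
}

static inline unsigned iova_index(uint64_t iova)
{
	return (iova >> IDX_SHIFT) & ((1u << (SIZE_SHIFT - IDX_SHIFT)) - 1);
}
```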

Memory Consumption

Needed buffers == Inflight DMA Operations

Measured on our setup <260MB

Theoretical Boundary:

RX = #Cores × RingSize × MTU
TX = #Cores × RingSize × (64KB + 4KB) × #TC
Mem = RX + TX

Theoretical on our setup 8GB+

One obvious concern is the memory consumed by shadow buffers. The key point to notice here is that the memory needed is directly proportional to the number of DMAs in flight. Looking at an Ethernet driver as an example, in the theoretical worst case we need a buffer for each entry in the receive and transmit rings. On our evaluation setup this would amount to about 8 GB in theory. But in practice, we have observed less than 260MB of memory consumed. The bottom line is that, due to fast hardware, the number of in-flight DMA operations is effectively limited.
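Plugging in numbers makes the gap concrete. A small sketch of the worst-case bound; the ring size and traffic-class count are assumed values chosen to reproduce the roughly 8GB figure, not numbers stated in the talk:

```c
#include <stdio.h>

int main(void)
{
	/* Assumed parameters (illustrative, not from the talk). */
	const unsigned long long cores = 16;
	const unsigned long long ring  = 4096;   /* descriptors per ring */
	const unsigned long long mtu   = 1500;
	const unsigned long long tcs   = 2;      /* traffic classes      */

	/* RX = #Cores x RingSize x MTU
	 * TX = #Cores x RingSize x (64KB + 4KB) x #TC */
	unsigned long long rx = cores * ring * mtu;
	unsigned long long tx = cores * ring * (64ULL * 1024 + 4096) * tcs;

	printf("RX bound:   %llu MB\n", rx >> 20);   /* ~93 MB   */
	printf("TX bound:   %llu MB\n", tx >> 20);   /* ~8704 MB */
	printf("Worst case: %llu MB vs. <260 MB observed in practice\n",
	       (rx + tx) >> 20);
	return 0;
}
```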

Evaluation Setup

• Dell PowerEdge R430
• Dual-socket, 8-core Haswell E5 @ 2.4GHz (16 cores in total)
• 40 GbE Intel XL710 (Fortville)
• DDR4 memory controller
• Target: modified Linux 3.17.2
• Loader: no IOMMU, stock Linux 3.17.2

Now let's talk about the performance evaluation. In our evaluation we used the setup listed on the slide.

Netperf 16 Cores RX

[Figure: (a) throughput in Gb/s and (c) CPU utilisation in % (CPU × cores) vs. message size from 64B to 64KB, for: no iommu (no protection), copy (strict sub-page protection), atc'15- (deferred page protection), atc'15+ (strict page protection)]

This graph shows how our solution performs when receiving on 16 cores; the loaders send packet sizes from 64 bytes to 64KB. The left graph shows the throughput achieved per sent packet size, and the right graph shows the corresponding CPU utilisation. The throughput graph is almost flat because, regardless of the sent packet size, the received packets are no bigger than MTU bytes. The magenta line is the unprotected kernel, the blue line is copy protection, the green is ATC'15 with deferred, and the red is ATC'15 with strict. It is clearly visible that while copying does have its price, the results are much better than trying to flush the IOTLB on every DMA: the red line fails to achieve more than 10Gb/s while utilising 16 cores at 100% CPU.

Netperf 16 Cores TX

[Figure: (a) throughput in Gb/s and (c) CPU utilisation in % vs. message size from 64B to 64KB, for: no iommu (no protection), copy (strict sub-page protection), atc'15- (deferred page protection), atc'15+ (strict page protection)]

On send, the situation is better for strict ATC, as there are an order of magnitude fewer IOTLB invalidations: each DMA now handles up to 64KB per packet, rather than the 1500 bytes when receiving. And again we see that copy does have its price, while strict ATC is still prohibitively costly.

Netperf 16 Cores RX + TX

[Figure: (a) throughput in Gb/s (y-axis up to 80) and (c) CPU utilisation in % vs. message size from 64B to 64KB, for: no iommu (no protection), copy (strict sub-page protection), atc'15- (deferred page protection), atc'15+ (strict page protection)]

We have gone one step farther by looking at bidirectional flow; notice that the throughput ticks have gone up to 80Gb/s. We see that copy is still comparable in performance while providing superior protection. At this rate the cost is considerably higher, though still not as high as IOTLB invalidations.

Memcached

[Figure: transactions/sec in millions (left) and total CPU in % (right), for copy, no iommu, atc15-, and atc15+]

We also ran the popular memslap benchmark for memcached, to see how copy protection affects a real workload. Transactions per second are on the left graph, and CPU utilisation is on the right. And again it is clear that copy provides strict and sub-page protection at a reasonable price, especially when compared to the state of the art; just look at the CPU utilisation and performance.

Concluding Remarks

• Copy protection = Strict + Sub-Page protection

• Copy is faster than IOTLB invalidation

Questions?

To conclude, I've shown you that for most practical scenarios, copying a DMA buffer is faster than an IOMMU invalidation, because of the related hardware costs and synchronisation. We have leveraged this observation to build an alternative IOMMU usage model that copies DMA buffers to and from IOMMU-protected shadow buffers, and therefore doesn't need to do any invalidations at all. Our model provides I/O performance comparable to an execution without an IOMMU, while providing true security against DMA attacks. Actually, our security is better than the current state of the art: we provide sub-page protection, which no prior solution has. With that, I'll conclude, and I'll be happy to take any questions. Thank you. (20:00)