True IOMMU Protection from DMA Attacks:

When Copy Is Faster Than Zero Copy

Alex Markuze Adam Morrison Dan Tsafrir

Hello, my name is Alex and I would like to present our work from this year's ASPLOS. This work was done by me and my advisors Adam Morrison and Dan Tsafrir from the Technion: "True IOMMU Protection from DMA Attacks, a.k.a. When Copy Is Faster Than Zero Copy". (0:20)

DMA

Direct Memory Access

Main Memory

Let's start with DMA. DMA is the ability of an attached device to access main memory without involving the CPU. DMA-capable devices are shown on the screen: devices like GPUs, SSDs, and Network Interface Cards have their own CPUs and their own code, and none of it is controlled by the OS. (0:45)

Problem: DMA Attacks

How to develop a rootkit for Broadcom NetExtreme network cards (ESEC 2011)

Someone (probably the NSA) has been hiding viruses in hard drive firmware (Kaspersky 2015)

This ability to access memory directly also allows a malicious device to read or modify sensitive data. We call this a DMA attack. This is far from fiction. For example, one detailed tutorial that can easily be found on the internet describes in great detail how someone could get malicious code running on a popular NIC, even remotely. That is a NIC you may well find in your own university labs. Also, Kaspersky Labs recently identified attacks in the wild that reprogrammed hard-disk firmware so that the drive could reintroduce malware even after the disk was formatted. (1:20)

Solution: IOMMU

To prevent arbitrary access to memory, modern systems have IOMMU hardware. The IOMMU works for devices just like the MMU works for processes: the device can access only the pages that are explicitly listed in its OS-controlled page table. So with an IOMMU, the device can access main memory only via I/O virtual addresses, or IOVAs. (01:50)

IOMMU Usage

• Made part of DMA API

• Drivers remain unaware

• Backward compatible

When the IOMMU was introduced, it was integrated into the existing DMA API. This allowed all device drivers to remain unchanged, and devices remained functional when the IOMMU was enabled.

Packet Transmission using DMA API

1. dma_map(VA)
   • Allocate IOVA
   • Update device page table (IOVA → PA)

2. Transmit packet (IOVA → PA translation cached in the IO/TLB)

3. dma_unmap(IOVA)
   • Remove entry from page table
   • Flush IO/TLB
   • Free IOVA

To demonstrate how the IOMMU is used via the DMA API, let's follow a sent packet. First, the NIC driver dma_map-s the packet. This operation creates a new I/O virtual address, i.e., it generates the unique integer that serves as the IOVA, and then populates the page table, represented by the blue box on the slide. The red VA and yellow IOVA point to the same physical address. Second, the NIC reads the packet using the IOVA and the packet is sent. When the IOVA is accessed, the IOTLB is updated; the IOTLB is a small translation cache, just like the TLB in the MMU. Third, when the packet has been sent, the IOVA is dma_unmap-ped to prevent a potentially malicious device from re-accessing the packet. Unmap removes the translation from the device's page table, flushes the entry from the IO/TLB, and finally deallocates the IOVA, namely frees the unique integer. (4:00)
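To make this flow concrete, here is a minimal sketch of a transmit path. The dma_map_single/dma_unmap_single calls are the real Linux DMA API; the descriptor-posting helpers are hypothetical stand-ins for driver-specific code:

```c
#include <linux/dma-mapping.h>

/* Sketch of a NIC transmit path over the DMA API.
 * post_tx_descriptor()/wait_tx_complete() are hypothetical
 * placeholders for driver-specific descriptor handling. */
static int nic_xmit(struct device *dev, void *pkt, size_t len)
{
	dma_addr_t iova;

	/* 1. dma_map: allocate an IOVA and install the IOVA->PA
	 *    translation in the device's IOMMU page table. */
	iova = dma_map_single(dev, pkt, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, iova))
		return -ENOMEM;

	/* 2. Transmit: the NIC reads the packet through the IOVA,
	 *    caching the translation in the IOTLB. */
	post_tx_descriptor(dev, iova, len);
	wait_tx_complete(dev);

	/* 3. dma_unmap: remove the page-table entry, flush the IOTLB
	 *    entry (immediately under strict protection), and free
	 *    the IOVA. */
	dma_unmap_single(dev, iova, len, DMA_TO_DEVICE);
	return 0;
}
```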

Problem #1: Performance

[Figure: netperf RX throughput in Gb/s (y-axis 0–40), single core (left) and 16 cores (right), with bars for "no iommu" and "w. iommu"]

Though the IOMMU seems to provide the much-needed layer of protection, using it creates several problems. The first problem is degraded performance. These are netperf results for a 40-gigabit NIC used by a single core, without the IOMMU and with it. As you can see, using the IOMMU reduces throughput from 20 Gb/s to less than 5. Here we see a similar experiment on sixteen cores. The throughput difference is even worse; I don't know if you have noticed the little red bar on the left, and this is with all cores utilised at 100%. (4:40)

Problem #2: vulnerability window

• Strict protection mode:

• IOTLB flush on every single dma_unmap

• IOTLB flushes are expensive!

• To alleviate performance degradation

• Trading off safety for performance

One of the causes of the significant performance hit is the strict IOMMU protection mode, which we use. This mode dictates that IOVAs be purged from the IOTLB as part of each dma_unmap operation, which is performed millions of times per second. As it turns out, IOTLB invalidations are extremely costly, especially since they are performed while holding a lock.

Problem #2: vulnerability window

• Strict protection: IOTLB flush on every single dma_unmap

• Deferred protection: batch IOTLB flushes

• Amortise flush cost

[Timeline: Page P is dma_mapped → legal access to P → Page P is dma_unmapped and freed → Page P is reallocated → vulnerability window until the deferred IOTLB flush]

To address this, kernel developers added an IOMMU usage mode that amortises the overhead. This mode flushes the IOTLB only once every 250 operations. Unfortunately, this mode, which is called deferred protection, trades off security for performance. With deferred protection, there is a window of time in which a device can access buffers that have been unmapped. I'll explain how. First, page P is dma_mapped; now the NIC can legally access it. Then page P is unmapped and freed. Not much later, page P is allocated and used, and no one knows that it is still accessible by the device, because the IOTLB has not been flushed yet. (6:00)
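The window can be seen directly in code. A minimal sketch, assuming a kmalloc'd buffer and the standard DMA API running under deferred protection:

```c
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Sketch of the deferred-protection vulnerability window. */
static void vulnerability_window_demo(struct device *dev, size_t len)
{
	void *buf, *secret;
	dma_addr_t iova;

	buf = kmalloc(len, GFP_KERNEL);     /* page P allocated */
	iova = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	/* ... the device legally DMAs into buf via iova ... */
	dma_unmap_single(dev, iova, len, DMA_FROM_DEVICE);
	/* Deferred mode: the stale IOTLB entry is NOT flushed here;
	 * flushes are batched, roughly once per 250 unmaps. */
	kfree(buf);                         /* page P returned to the allocator */
	secret = kmalloc(len, GFP_KERNEL);  /* page P may be reused for new data */
	/* Until the batched flush runs, a malicious device can still
	 * read or write page P through its cached translation. */
}
```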

Strict vs. Deferred Protection

[Figure: netperf RX throughput in Gb/s (y-axis 0–40), single core (left) and 16 cores (right), with bars for "no iommu", "strict", and "deferred"]

As you can see, deferred protection doesn't help much in the way of performance, and it does create the window of vulnerability. (5:30)

Problem #3: No Sub-Page Protection

Everything the NIC needs to access

Everything the NIC CAN access (4KB page)

Secrets

The problems don't end there: IOMMU granularity is 4KB pages. This means that any data co-located on a page with a DMA buffer can be accessed by the device! For example, the smallest network packet is about 48B, so this is everything the NIC needs to access, and this is everything the NIC can access. Each time a 48-byte packet is mapped, the NIC can access the whole page. We call this the sub-page protection problem. (7:00)
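As an illustrative sketch (the variable names and sizes are hypothetical, not from the talk), consider two small kernel allocations that the slab allocator may co-locate on one 4KB page:

```c
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Illustration of the sub-page protection problem. */
static void subpage_problem_demo(struct device *dev)
{
	char *pkt = kmalloc(48, GFP_KERNEL);  /* 48 B packet buffer      */
	char *key = kmalloc(64, GFP_KERNEL);  /* secret; the allocator
	                                       * may place it on the same
	                                       * 4 KB page as pkt        */
	dma_addr_t iova;

	/* The IOMMU maps the entire 4 KB page containing pkt, so a
	 * malicious device handed this IOVA can also reach key if the
	 * two allocations share a page. */
	iova = dma_map_single(dev, pkt, 48, DMA_TO_DEVICE);
	/* ... transmit ... */
	dma_unmap_single(dev, iova, 48, DMA_TO_DEVICE);
	kfree(key);
	kfree(pkt);
}
```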

Related Work

• [FAST'15] Improving single-core performance

• [ATC'15] Resolving scalability issues for deferred protection

Our group has been looking at the performance problems of IOMMU protection for some time now. It turns out that part of the problem with deferred protection was the algorithm for IOVA allocation: it had very poor performance even on a single core, and at FAST last year we presented a fix for the IOVA allocator. Later that year, at USENIX ATC, we looked at the multicore scalability problems and fixed them as well. (7:40)

Related Work

[Figure: netperf RX throughput in Gb/s (y-axis 0–40), single core (left) and 16 cores (right), with bars for "no iommu", "strict", "deferred", "atc'15+", and "atc'15-"]

Here are the results of this work. The ATC work includes the fixes from FAST, so the single-core case is basically the same. There are two bars for ATC: ATC- is the scalable deferred solution, and ATC+ is with strict protection. Deferred protection now looks much better, but strict still performs very poorly. The reason is that accessing the IOMMU on each DMA is costly, and with multiple cores it becomes prohibitively expensive. (7:00)

Our contribution: performant & secure

[Figure: netperf RX throughput in Gb/s (y-axis 0–40), single core (left) and 16 cores (right), with bars for "no iommu", "strict", "deferred", "atc'15+", "atc'15-", and "copy"]

And this is copy protection: the green bar represents this paper's performance results. (7:05)

Our Contribution

Protection Model   | Sub-Page | No Vulnerability Window | Single Core | Multi Core
Strict   (Linux)   |    ✗     |           ✓             |      ✗      |     ✗
Strict   (FAST)    |    ✗     |           ✓             |      ✓      |     ✗
Strict   (ATC)     |    ✗     |           ✓             |      ✓      |     ✗
Deferred (Linux)   |    ✗     |           ✗             |      ✓      |     ✗
Deferred (FAST)    |    ✗     |           ✗             |      ✓      |     ✗
Deferred (ATC)     |    ✗     |           ✗             |      ✓      |     ✓
Copy Protection    |    ✓     |           ✓             |      ✓      |     ✓

This table summarises the bar charts we saw in the previous slides. As you can see, copy protection is the only mode that provides sub-page protection, and the only mode that provides multicore performance with strict protection. Now let's talk about how it is done. (7:25)

Copy Protection

• dma_map:
  1. Allocate a permanently mapped shadow buffer
  2. Sync shadow and original (if needed)

• dma_unmap:
  1. Sync shadow and original (if needed)
  2. Free the permanently mapped shadow buffer

We take a very different approach. The basic idea is simple: since IOTLB invalidations are expensive, we avoid them; we just don't do invalidations! Instead, we use a set of buffers, called shadow buffers, which are always mapped by the IOMMU, and they are the only thing that is ever mapped for the device. On DMA, we sync the buffers by copying. This provides strict protection: there is never a window of time in which the device can access data that it shouldn't. It also gets us sub-page protection, because we copy exactly the amount needed. And, as it turns out, you can do this without losing much performance. (9:00)
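A minimal sketch of the idea, hooked under the DMA API; struct shadow and the shadow_alloc/shadow_lookup/shadow_free helpers are hypothetical names for the machinery described on the next slides:

```c
#include <linux/dma-mapping.h>
#include <linux/string.h>

/* Hypothetical shadow-buffer descriptor and pool helpers. */
struct shadow { void *vaddr; void *orig; dma_addr_t iova; };
struct shadow *shadow_alloc(size_t len, enum dma_data_direction dir);
struct shadow *shadow_lookup(dma_addr_t iova);
void shadow_free(struct shadow *sh);

dma_addr_t copy_dma_map(void *buf, size_t len, enum dma_data_direction dir)
{
	/* The shadow buffer is permanently mapped in the IOMMU;
	 * no page-table update and no IOTLB work happen here. */
	struct shadow *sh = shadow_alloc(len, dir);

	sh->orig = buf;
	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
		memcpy(sh->vaddr, buf, len);      /* sync: original -> shadow */
	return sh->iova;                          /* all the device ever sees */
}

void copy_dma_unmap(dma_addr_t iova, size_t len, enum dma_data_direction dir)
{
	struct shadow *sh = shadow_lookup(iova);

	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
		memcpy(sh->orig, sh->vaddr, len); /* sync: shadow -> original */
	shadow_free(sh);                          /* no IOTLB flush needed    */
}
```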

Copy Protection

1. dma_map(VA)
   • Allocate shadow buffer
   • Sync VA and shadow buffer
   • Return shadow IOVA

2. Transmit packet (shadow IOVA → PA, cached in the IO/TLB)

3. dma_unmap(IOVA)
   • Free shadow buffer

Let's follow a sent packet again, but now with copy protection. Please notice that the page table is already populated. On send, dma_map allocates a shadow buffer, syncs the two buffers, and then returns the shadow's IOVA. The packet is transmitted using the returned shadow IOVA. Unmap releases the shadow buffer, and that is it. On receive, we sync the shadow buffer on unmap. (9:25)

Copy Protection

Everything the NIC needs to access

Everything the NIC can EVER access

Secrets

With copy protection, the 48B packet, which is the only thing the NIC needs to see, is the only thing the NIC can ever see. And this is how copy protection provides sub-page protection. (10:00)

Observation

• Copy is cheaper than invalidation (< 4KB)

[Figure: single-core RX per-packet cost breakdown in µsec (spinlock, IOTLB invalidation, IOMMU page-table management, copy overhead, other) and throughput in Gb/s, for copy, no iommu, atc'15-, and atc'15+]

Now let's talk about the costs in detail. What we see here are the time costs of receiving a packet when running on a single core: the time breakdown on the left and the throughput on the right. The leftmost bar is copy protection, with the copy overhead coloured in black. The next column is the unprotected kernel, with no such overheads, and the last two bars are the ATC'15 results with deferred and strict protection. The main observation here is that for a 1500-byte MTU, copy is cheaper than the page-table management and the IOTLB invalidation. So clearly, copy is not that expensive, at least not for MTU-sized packets on a single core. (11:30)

Observation

• Copy is cheaper than invalidation (12 µs)

[Figure: 16-core TX per-packet cost breakdown in µsec (spinlock, IOTLB invalidation, IOMMU page-table management, copy overhead, other) and throughput in Gb/s, for copy, no iommu, atc'15-, and atc'15+]

Now let's look at multicore and send. We see here a similar graph, breaking down the send costs of a 64KB packet. On send, the NIC hardware breaks the packet into MTU-sized chunks on its own, so the software doesn't really need to care about the actual MTU. We see that the copy now costs more than the page-table updates and the IOTLB invalidation, but it's nowhere near the locking overhead of IOTLB flushes under strict protection. In this regard, copy is still much cheaper than flushing the IOTLB for each packet. Basically, copy is cheaper than zero-copy. (12:15)

Implementation Challenges

• Concurrency & Scalability

• Transparency

• NUMA locality

• Memory consumption

• Copy costs

When designing copy protection, we wanted to provide a generic solution that requires little to no change to the drivers. The solution had to be scalable, as this is clearly an issue for all existing solutions. The sync step also introduced the need to be NUMA-aware and to look for ways to optimise the copy. By hooking the DMA API, we made the solution transparent to the driver. Copy costs were mitigated by syncing only when needed and only the needed bytes. Thankfully, Intel Ivy Bridge and Haswell CPUs also have optimised memcpy support called ERMS (Enhanced REP MOVSB), sketched below, which made the copy costs much more manageable. But we still needed to avoid cross-socket copying. We will discuss scalability, NUMA awareness, and memory consumption in the next slides. (13:20)
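On ERMS-capable CPUs, a plain rep movsb is microcoded into an optimised, size-adaptive copy. A minimal sketch of such a copy routine (x86-64, GCC/Clang inline assembly; this is an illustration, not the kernel's actual memcpy implementation):

```c
#include <stddef.h>

/* Copy len bytes with REP MOVSB; on CPUs advertising the ERMS
 * (Enhanced REP MOVSB/STOSB) feature flag, the microcode turns this
 * single instruction into a fast copy for a wide range of sizes. */
static inline void erms_copy(void *dst, const void *src, size_t len)
{
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     : /* no other inputs */
		     : "memory");
}
```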

Design: 48-bit IOVA

[IOVA layout, bit 47 down to bit 0: shadow flag (bit 47) | cpu index (bits 46..40) | r/rw/w (bits 39..38) | size | meta index | offset]

[Diagram: per-size arrays of metadata entries; each free list is a struct split across two cache lines, one holding meta *array_start and meta *head, the other holding meta *array_start and meta *tail; each meta entry holds void *shadow and void *os_buff]

The key is in the design. Each shadow buffer has a metadata descriptor; the descriptor holds the shadow-buffer pointer and the pointer to the mapped buffer. This descriptor is part of an array of descriptors, and each array holds shadow buffers of the same size and with the same permissions. The free entries of the array form a free list. Each free list is managed by a struct split between two cache lines: the top half holds the head of the free list, and the bottom half holds the pointer to the tail. A freed entry is added to the tail of the list, and an allocated entry is taken from the head, thus avoiding contention between concurrent free and alloc. Both halves hold a pointer to the start of the array, avoiding false sharing. For a fast and scalable design, each core holds its own arrays and shadow buffers. When a new shadow buffer is allocated, its IOVA itself encodes the meta descriptor's location: the free list and the meta entry index encode the exact position. This allows a fast and efficient lookup of the entry on dma_unmap, when only the IOVA is provided. This simple scheme allows extremely fast and scalable free and allocate operations. (14:30)
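A sketch of this encoding; the slide fixes bit 47 as the shadow flag and bits 46..40 as the CPU index, while the exact widths of the remaining fields below are assumptions for illustration:

```c
#include <stdint.h>

/* Assumed field layout of a 48-bit shadow IOVA (field widths below
 * bit 40 are illustrative guesses, not the paper's exact values). */
#define SHADOW_FLAG   (1ULL << 47)   /* marks a shadow-buffer IOVA       */
#define CPU_SHIFT     40             /* bits 46..40: owning CPU          */
#define PERM_SHIFT    38             /* bits 39..38: r / rw / w          */
#define SIZE_SHIFT    32             /* assumed: size class of the pool  */
#define IDX_SHIFT     12             /* assumed: index of the meta entry */
                                     /* bits 11..0: offset in the buffer */

static inline uint64_t iova_encode(unsigned cpu, unsigned perm,
				   unsigned size_class, unsigned idx,
				   unsigned off)
{
	return SHADOW_FLAG
	     | ((uint64_t)cpu        << CPU_SHIFT)
	     | ((uint64_t)perm       << PERM_SHIFT)
	     | ((uint64_t)size_class << SIZE_SHIFT)
	     | ((uint64_t)idx        << IDX_SHIFT)
	     | off;
}

/* dma_unmap receives only the IOVA; decoding it locates the per-CPU
 * array and the metadata entry in O(1), with no lookup structure. */
static inline unsigned iova_cpu(uint64_t iova)
{
	return (iova >> CPU_SHIFT) & 0x7f;
}

static inline unsigned iova_index(uint64_t iova)
{
	return (iova >> IDX_SHIFT) & ((1u << (SIZE_SHIFT - IDX_SHIFT)) - 1);
}
```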

Memory Consumption

Needed buffers == Inflight DMA Operations

Measured on our setup <260MB

Theoretical Boundary:

RX = #Cores × RingSize × MTU
TX = #Cores × RingSize × (64KB + 4KB) × #TC
Mem = RX + TX

Theoretical on our setup 8GB+

One obvious concern is the memory consumed by shadow buffers. The key point to notice here is that the memory needed is directly proportional to the number of DMAs in flight. Looking at an Ethernet driver as an example, in the theoretical worst case we need a buffer for each entry in the receive and transmit rings. On our evaluation setup this would amount to about 8 GB in theory. But in practice, we have observed less than 260MB of memory consumed. The bottom line is that, due to fast hardware, the number of in-flight DMA operations is effectively limited.
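Plugging in numbers makes the gap concrete. A small sketch of the worst-case bound; the ring size and traffic-class count are assumed values chosen to reproduce the roughly 8GB figure, not numbers stated in the talk:

```c
#include <stdio.h>

int main(void)
{
	/* Assumed parameters (illustrative, not from the talk). */
	const unsigned long long cores = 16;
	const unsigned long long ring  = 4096;   /* descriptors per ring */
	const unsigned long long mtu   = 1500;
	const unsigned long long tcs   = 2;      /* traffic classes      */

	/* RX = #Cores x RingSize x MTU
	 * TX = #Cores x RingSize x (64KB + 4KB) x #TC */
	unsigned long long rx = cores * ring * mtu;
	unsigned long long tx = cores * ring * (64ULL * 1024 + 4096) * tcs;

	printf("RX bound:   %llu MB\n", rx >> 20);   /* ~93 MB   */
	printf("TX bound:   %llu MB\n", tx >> 20);   /* ~8704 MB */
	printf("Worst case: %llu MB vs. <260 MB observed in practice\n",
	       (rx + tx) >> 20);
	return 0;
}
```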

Evaluation Setup

• Dell PowerEdge R430
• Dual-socket, 8-core Haswell E5 @ 2.4GHz (16 cores in total)
• 40 GbE Intel XL710 (Fortville)
• DDR4 memory controller
• Target: modified Linux 3.17.2
• Loader: no IOMMU, stock Linux 3.17.2

Now let's talk about the performance evaluation. In our evaluation we used the setup listed on the slide.

Netperf 16 Cores RX

[Figure: (a) throughput in Gb/s and (c) CPU utilisation in % (CPU × cores) vs. message size from 64B to 64KB, for: no iommu (no protection), copy (strict sub-page protection), atc'15- (deferred page protection), atc'15+ (strict page protection)]

This graph shows how our solution performs when receiving on 16 cores; the loaders send packet sizes from 64 bytes to 64KB. The left graph shows the throughput achieved per sent packet size, and the right graph shows the corresponding CPU utilisation. The throughput graph is almost flat because, regardless of the sent packet size, the received packets are no bigger than MTU bytes. The magenta line is the unprotected kernel, the blue line is copy protection, the green is ATC'15 with deferred, and the red is ATC'15 with strict. It is clearly visible that while copying does have its price, the results are much better than trying to flush the IOTLB on every DMA: the red line fails to achieve more than 10Gb/s while utilising 16 cores at 100% CPU.

Netperf 16 Cores TX

[Figure: (a) throughput in Gb/s and (c) CPU utilisation in % vs. message size from 64B to 64KB, for: no iommu (no protection), copy (strict sub-page protection), atc'15- (deferred page protection), atc'15+ (strict page protection)]

On send, the situation is better for strict ATC, as there are an order of magnitude fewer IOTLB invalidations: each DMA now handles up to 64KB per packet, rather than the 1500 bytes when receiving. And again we see that copy does have its price, while strict ATC is still prohibitively costly.

Netperf 16 Cores RX + TX

[Figure: (a) throughput in Gb/s (y-axis up to 80) and (c) CPU utilisation in % vs. message size from 64B to 64KB, for: no iommu (no protection), copy (strict sub-page protection), atc'15- (deferred page protection), atc'15+ (strict page protection)]

We have gone one step farther by looking at bidirectional flow; notice that the throughput ticks have gone up to 80Gb/s. We see that copy is still comparable in performance while providing superior protection. At this rate the cost is considerably higher, though still not as high as IOTLB invalidations.

Memcached

[Figure: transactions/sec in millions (left) and total CPU in % (right), for copy, no iommu, atc15-, and atc15+]

We also ran the popular memslap benchmark for memcached, to see how copy protection affects a real workload. Transactions per second are on the left graph, and CPU utilisation is on the right. And again it is clear that copy provides strict and sub-page protection at a reasonable price, especially when compared to the state of the art; just look at the CPU utilisation and performance.

Concluding Remarks

• Copy protection = Strict + Sub-Page protection

• Copy is faster than IOTLB invalidation

Questions?

To conclude, I've shown you that for most practical scenarios, copying a DMA buffer is faster than an IOMMU invalidation, because of the related hardware costs and synchronisation. We have leveraged this observation to build an alternative IOMMU usage model that copies DMA buffers to and from IOMMU-protected shadow buffers, and therefore doesn't need to do any invalidations at all. Our model provides I/O performance comparable to an execution without an IOMMU, while providing true security against DMA attacks. Actually, our security is better than the current state of the art: we provide sub-page protection, which no prior solution has. With that, I'll conclude, and I'll be happy to take any questions. Thank you. (20:00)