Digital Investigation 14 (2015) S16eS24

Contents lists available at ScienceDirect

Digital Investigation

journal homepage: www.elsevier.com/locate/diin

DFRWS 2015 US The impact of GPU-assisted malware on memory forensics: A case study

* Davide Balzarotti a, Roberto Di Pietro b, 1, Antonio Villani , a Eurecom, Sophia Antipolis, France b Bell Labs, Paris, France c Department of Mathematics and Physics, University of Roma Tre, Rome, Italy

abstract

Keywords: In this paper we assess the impact of GPU-assisted malware on memory forensics. In Graphic processing units particular, we first introduce four different techniques that malware can adopt to hide its Memory analysis presence. We then present a case study on a very popular family of Intel GPUs, and we Malware analyze in which cases the forensic analysis can be performed using only the host's Digital Forensics memory and in which cases it requires access to the GPU's memory. Our analysis shows that, by offloading some computation to the GPUs, it is possible to successfully hide some malicious behavior. Furthermore, we provide suggestions and insights about which arti- facts could be used to detect the presence of GPU-assisted malware. © 2015 The Authors. Published by Elsevier Ltd on behalf of DFRWS. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Introduction the best of our knowledge, no GPU-assisted malware has been discovered in the wild e with the sole exception of For more than twenty years computers have relied on BadMiner (ArsTechnica, 2015) which exploits the presence dedicated Graphics Processing Units (GPUs) to perform of a GPU to mine bitcoins. However, researchers have graphical computations and rendering. More recently, with already shown that several malicious activities can take the advent of general-purpose computing on graphics advantage of GPUs, for instance to steal sensitive infor- processing units (GPGPU), GPUs have been increasingly mation (Kotcher et al., 2013; Lee et al., 2014), unpack ma- adopted also to perform other generic tasks. In fact, thanks licious code (Vasiliadis et al., Oct 2010), and hide malicious to their many-core architecture, GPUs can provide a sig- activities from malware detection and analysis tools nificant speed-up for several applications e such as finan- (Triulzi, 2008; Ladakis et al., 2013). It is important to notice cial and scientific computations, regular expression that these threats are not confined to the graphic envi- matching, bitcoins mining, video transcoding, and pass- ronment (e.g., screen grabbing or unrestricted access to word-cracking. GPU memory) but that GPUs can also be abused to perform Despite their pervasiveness and their ability to perform attacks that are outside the graphic domain. In fact, oppo- generic computations, the role and impact of GPUs to site to others commonly available PCI devices, the fact that perform malicious activities is still largely understudied. To GPUs can be programmed by userspace applications makes them a very attractive attack vector. As a matter of fact, also other peripherals (such as hard disks, network devices, printers) have been abused to perform malicious activities * Corresponding author. E-mail addresses: [email protected] (D. Balzarotti), (Zaddach et al., 2013; Sparks et al., 2009; Cui et al., 2013), [email protected] (R. Di Pietro), villani@mat. but those scenarios required to introduce custom modifi- uniroma3.it (A. Villani). cations to the device firmware. GPUs are instead meant to 1 He is also with the Department of Maths, University of Padua, Padua, be programmed by the end-user and require no firmware Italy. http://dx.doi.org/10.1016/j.diin.2015.05.010 1742-2876/© 2015 The Authors. Published by Elsevier Ltd on behalf of DFRWS. This is an open access article under the CC BY-NC-ND license (http:// creativecommons.org/licenses/by-nc-nd/4.0/). D. Balzarotti et al. / Digital Investigation 14 (2015) S16eS24 S17 modification to execute arbitrary code. To make things worse, graphic cards vendors pay more attention to the performance of their products than to their security. Un- fortunately, this has often a negative impact on several security mechanisms, such as the memory isolation be- tween independent process running on the same GPU (Di Pietro et al., 2013; Maurice et al., 2014). From a defense perspective, the main problem is that neither antivirus softwares nor memory forensics tools are currently able to analyze the content of the GPU memory. However, it is not clear how severe this limitation is in practice, and what is the real level of stealthiness that can be achieved by GPU-assisted malware. In particular, the main goal of this paper is to understand what is the impact of GPU-assisted malware on memory forensics. Are traditional collection and analysis techniques sufficient to detect and fi fully understand the malicious behavior? This is a funda- Fig. 1. A simpli ed view of a GPU-based heterogeneous architecture. The GPUs have access both to a dedicated device memory and to the host mental question for the computer forensic domain. Nowa- memory. days, memory forensics tools have an incomplete view of the system, limited only to the content of the memory. It is very important to understand under introduce four possible attack scenarios and discuss which conditions this piece of information is enough to their impact for a forensic analyst. For each scenario we detect the presence of a malicious activity (even if the implement a different proof-of-concept for and malware relies to a certain extent on the use of the GPU). we report under what conditions the memory analysis Only recently GPUs gained the attention of the digital can be performed through the kernel data structures, forensic community and a common assumption in the cur- the video driver or the GPU memory rent research (Richard, 2015) is that GPU-assisted malware  We present a detailed case study on Intel integrated needs at least a component running in the system memory. GPUs, which are part of all modern processors. Our re- If so, the analysis of this component could be sufficient to sults show that an adversary can take advantage of these detect that the malware is using the GPU and maybe even to integrated GPUs to obtain full access to any memory retrieve the purpose of the outsourced computation. page in the host. Even worse, we show how it is possible However, in this paper we show that in certain condi- to obtain a persistent malicious code running on the tions it is possible for an attacker to leave no trace in the OS GPU without any associated process running on the memory. In this case, malware coders have a substantial host. advantage that could only be limited by developing custom techniques to acquire and analyze the internal GPU mem- ory of each vendor (as well as graphic card model, and Background ). Unfortunately, this can be a very difficult e and time-demanding task especially for proprietary Modern GPUs are specialized massively parallel pro- products for which no documentation exists about their cessors with hundreds of computational units. In fact, internal data structures. while traditional CPUs dedicate most of their die area to Our goal is to answer a number of very practical ques- cache memory, GPUs dedicate it to logic circuits. The tions. For instance, is it possible for an analyst to identify combination of the two processing units, which takes which processes are using the GPU and which code is being advantage of the CPU for general applications tasks and of executed there? To answer the previous question, is it the GPU for highly parallel computation, is today the most enough to analyze an image of the system memory, or does common example of a heterogeneous architecture (see the analyst need also to inspect the regions memory- Fig. 1 for a simplified example). The host, namely the logical mapped to the (potentially containing pro- element containing the CPU, has its own memory and co- prietary data structures) or, worst case, does the internal ordinates the execution of one or more GPUs in a master- memory of the video card also need to be collected and slave configuration. The GPUs, which also have their own analyzed? Which of these questions can be answered by local memory, consist of several identical computational only looking at the system memory? Can the GPU be used units, each containing the same amount of processing el- to implement anti-forensic techniques? To address these ements. A task which is executed on a GPU is called kernel. questions, we designed and implemented a number of tests The GPU drivers manages kernels through the means of and experiments, based on a popular family of Intel GPUs. contexts, the equivalent of a process control block in the To summarize, this paper makes the following GPU.2 A key aspect of this architecture is that the GPU is contributions: also able to directly access arbitrary pages on the host

 We conduct the first study of the impact of GPU-assisted malware on memory forensics. In particular, we 2 To disambiguate between the operating system kernel and the GPU kernel, the second one will be typed in italic. S18 D. Balzarotti et al. / Digital Investigation 14 (2015) S16eS24 memory. One of the motivation for this is that, in order to scenario, and a more general approach (independent from coordinate the executions of tasks on the GPU, the CPU the device that has been compromised) is needed to handle typically writes commands into a command buffer that is these powerful cases. consumed by the GPU. The same execution model is Finally, our definition of GPU assisted malware does not adopted by both graphic and general purpose tasks. Indeed, cover attacks against the GPU. For instance, this includes regardless on which high level library is used (e.g. OpenGL, malicious code which resides completely in the host CUDA, or OpenCL), eventually all commands are appended memory but which has the GPU memory as main target, into the command buffer. e.g., to retrieve sensitive data from the content of the fra- mebuffer or from the graphic buffer objects allocated by the Physical memory acquisition GPU driver.

The dump of the physical memory allows a forensic GPU assisted malware investigator to extract detailed information about the sys- tem state, its configuration, and possible malware in- The goal of our paper is to understand what is the fections. However, the amount of useful information that impact of gpu-assisted malware on memory forensics. For can be extracted from a memory dump strictly depends on this reason, it is convenient to classify each type of GPU the way the memory image is obtained. Modern open- malware according to two sets of requirements: the oper- source memory acquisition tools for Linux (such as LiMe ating system privileges required by the malicious sample, (LiMe, 2015) and pmem (Rekall, 2015)) load a kernel mod- and the amount of internal information about the graphic ule and perform the acquisition from kernel space. Both card required to realize the malware itself. According to LiMe and pmem automatically select the address ranges these characteristics, we identify three main classes: user- that are associated to the main system memory by exam- space GPU malware, super-user GPU malware, and kernel- ining the iomem_resource linked list in the Linux kernel and level GPU malware. selecting only the address ranges marked as System RAM. Userspace malware includes all the GPU malware sam- This is necessary because the access to other memory re- ples that do not require administrative privileges on the gions can have side effects due to the presence of memory victim host. This kind of malware simply exploits the pos- mapped I/O (MMIO), potentially resulting in a system crash sibility to execute general purpose computation on the GPU (Stttgen & Cohen, 2013). and, therefore, from an OS perspective it only uses legiti- A portion of the MMIO memory is reserved to the PCI mate and it does not rely on any underlying software bus. At boot time, the BIOS instructs the operating system bug. For instance, Ladakis et al. (2013) presented an on which memory areas should be used for this purpose. example of this category that uses the GPU to execute the Therefore, the CPU and the DRAM controller have two unpacking routine of a generic malware. different views of the memory. The main difference is that Super-user malware requires instead administrative the DRAM controller does not see memory ranges privileges on the target host. This allows the malicious code consumed for MMIO. The TOLUD register marks the top of to perform a number of additional operations that are not the low (i.e. below 4 GB) usable physical memory that is available from user space. However, in this case we assume accessible to the CPU. The CPU views the memory range that the malware writer only relies on well defined libraries from TOLUD to 4 GB as reserved for MMIO, while the DRAM and APIs e without tampering with the internal informa- controller views the same range to be allocated to DRAM. tion used by the graphic driver or the card itself. All the memory areas assigned to the PCI subsystem are The most powerful case is represented by kernel-level marked as Reserved and they are dumped neither by LiMe malware. In this case, on top of having administrative nor by pmem. privileges on the host, the malware also knows the internal implementation of the graphic driver. As a result, it is able Threat model to modify arbitrary data structures in kernel space per- forming Direct Kernel Object Manipulation (DKOM). While In this paper we use the term GPU assisted malware (or from a strict operating system perspective, super-user and just GPU malware for simplicity) to describe malicious kernel access are equivalent, we prefer to separate these software that performs some of its computation in the GPU. two scenario because they may have different conse- This definition covers the case in which the malware del- quences on the portability of the malicious code and on the egates some tasks to the GPU for pure performance reasons forensic examination itself. (e.g., to mine bitcoins in a more efficient way) and the case, more interesting for our study, in which the malware le- verages the GPU for anti-forensic purposes. GPU malware and memory forensic However, in this paper we do not consider neither malicious hardware, nor code that requires the modifica- In presence of GPU malware, a forensic analyst needs to tion of the graphic card's firmware. A considerable effort be able to answer a certain number of questions. In has already been dedicated to study firmware-level mal- particular, in this paper we focus on three of them, which ware in several contexts, spanning from chipsets, hard- we believe are the most important for an investigation: (1) drives, and graphic cards (Zaddach et al., 2013; Sparks the enumeration of the processes that are using the GPU, et al., 2009; Cui et al., 2013; Triulzi, 2008). Current (2) the extraction of the code that those processes have forensic analysis techniques are often ineffective in this been executing in the video card, and (3) the identification D. Balzarotti et al. / Digital Investigation 14 (2015) S16eS24 S19 of the memory areas that the code in the video card has system should always maintain a link between a task access to. executed in the GPU and the process which is responsible To answer these three questions, the analyst needs to for that execution. However, in Section “Case study: Intel collect a memory dump and to extract a certain number of GPUs Architecture” it will be shown how the GPU execu- information. Our goal is not to present a step-by-step tion model can be broken on Intel GPUs allowing the description of how to perform this procedure. Instead, we presence of a running kernel K without any controlling want to study which of these three questions can be process P. In such case, the presence of the malware could answered depending on the type of memory acquired, the still be detected by looking at the GPU driver itself. In fact, class of GPU malware (userspace, super-user, or kernel- an analysis of the memory of the driver (which resides in level), and the knowledge of the analyst (limited to the the host memory) would reveal the presence of a Operating system internals, or covering also the internals of GPU context running a kernel K without any controlling the video card manufacturer). We stress that we are process P. interested on the impact on memory forensic of GPU mal- ware. Section “Related Work” provides a brief description Context-less code execution on what malicious tasks a GPU malware can execute and In the previous point we say how the graphic drivers how it can achieve persistence. stores information about the task being executed on the GPU (refer to Section “Case study: Intel GPUs Architecture” Anti-forensic techniques for more details). A more advanced anti-forensic technique could directly target the kernel objects to detach the kernel The operating system, some of its core services (e.g., X from the list of contexts in the GPU driver and remove any ), and/or the graphic driver usually trace regarding the existence of a particular GPU kernel (of maintain a number of internal data structures that describe course, this only makes sense if the malware already ach- who is using the graphic card and which tasks have been ieves unlimited code execution). If context-less code scheduled for execution. Since all of this data resides in the execution is possible, and we will investigate that in the system memory, it may seem that a properly designed next section, then the GPU malware can completely hide memory forensic tool should be able to answer all our three his presence from the host memory. questions without the need to access the GPU memory. In the next section we will see how this is indeed Inconsistent memory mapping possible in most of the cases. However, malware developers In (Vasiliadis et al., Oct 2010), the authors propose to use do not need to always play by the rules, and therefore we the GPU to implement a stealthy keylogger that runs inside investigate a number of possible anti-forensic techniques the graphic card and accesses through DMA the physical that could be used to reduce the footprint in the system page containing the keyboard buffer. This makes the memory. detection of the keylogger functionality more difficult, as the keyboard buffer does not appear in the list of memory Unlimited code execution pages mapped by the malicious process. In our tests we Under normal conditions, executing code on the GPU discovered that the list of accessible pages is kept both in requires a controlling process running on the host. The host the operating system and in the GPU memory. However, we process adds a task on the command queue, which will be will see how the two page tables do not necessarily need to eventually fetched and executed by the GPU. However, contain the same information. This technique has some GPUs have a non-preemptive nature: once the execution of similarities with the TLB desynchronization attack (Sparks a task is initiated, the GPU is locked with the execution of & Butler, 2005b); however, in our case a custom page that task and no one else can use the GPU in the mean- fault handler is not required. while. This is particularly problematic when the GPU is used both for rendering and computation, as this could Case study: Intel GPUs architecture generate undesired effects such as an unresponsive user interface. GPUs can be broadly divided into two categories: (i) As a consequence, in order to ensure a proper behavior, discrete GPUs with dedicated device memory and, (ii) in- the graphic driver usually enforces a timeout to kill long tegrated GPUs that use a portion of DRAM for graphics. For lasting kernels. For GPU malware this could represent an our analysis, we focus on the integrated GPU of Intel, important limitation because the malicious kernel needs to namely the Intel Integrated Graphic Devices (IGDs). How- be sent over and over in a loop, making it more easy to ever, the concepts and ideas described in this paper can be detect in system memory. also applied to other hardware and software configura- The first anti-forensic technique consists in disabling tions. In particular, in Section “Beyond our Case Study” we the existing timeout to take full control of the GPU. For will discuss how our findings can be extended beyond our instance, in Vasiliadis et al. (2014) the authors disabled the case study. GPU hangcheck to lock the GPUs indefinitely. It is important to note that our study focuses on the Linux Direct Rendering Infrastructure (DRI), a framework to Process-less code execution allow userspace applications to access graphics hardware The GPU execution model involves the presence of (Freedesktop, 2015). In DRI, a userspace application can talk a host process P controlling a GPU kernel K. This is an to the graphic drivers in two ways: directly using ioctls or advantage for a forensic investigation, since the operating indirectly through the X server. As it will be described in S20 D. Balzarotti et al. / Digital Investigation 14 (2015) S16eS24

The DRM exposes an API that user space programs can use to send commands and data to the GPU, and perform op- erations like memory allocation on the GPU memory. Graphics drivers may also use DRM functions to provide a uniform interface to applications for memory manage- ment, interrupt handling and DMA. While user-space ap- plications can interact directly with the GPU through ioctls, in many cases developers use the more high-level interface provided by the libdrm library. Fig. 2. Overview on the Linux Beignet architecture. The DRM supports two memory managers: the Trans- lation Table Maps (TTM) and the Graphics Execution Manager (GEM). In this paper we focus on the GEM, which the following, OpenCL applications uses ioctls to interact is used by the Intel graphic driver. GEM is designed to with the GPUs. Therefore, in this work we focus on the manage both the graphic memory and the graphic execu- former case. Nonetheless, in a different context the X server tion contexts, allowing multiple host processes to share memory could also contain useful artifacts and therefore graphics device resources. However, GEM only provides should also be analyzed to extract forensics information. generic functions and each vendor implements its own Fig. 2 shows a simplified view of the reference software functionality to support the GEM interface (in our case stack. Note that we represented only the DRI's software study the GEM functions were implemented inside the components which are involved in our study. In the Intel graphic driver). The buffers allocated from GEM are following, we describe briefly each layer. The information called buffer objects. contained in the current section has been obtained from several sources including the source code of i915.ko, the The Intel graphic driver official documentation of Intel (Intel, 2015) and unofficial sources such as (Vetter, 2015). The Intel graphic driver, called i915 in the main branch of the Linux kernel, implements a superset of the ioctls The userspace components: Beignet and libdrm defined in the DRM layer (we used the version 3.14 of the Linux kernel in our experiments). The IGD is controlled by The Open Computing Language (OpenCL) is an open the CPU in two ways: through a set of memory-mapped IO standard for parallel programming of heterogeneous sys- registers, and through the ringbuffers, a set of queues con- tems. OpenCL allows the parallel programming of modern taining the list of commands to be executed on the IGD. In processors found in personal computers, servers and mo- particular, in Haswell there are three different types of bile devices. Beignet (2015) is an open source OpenCL ringbuffers: the render, the blitter, and the bitstream implementation for Linux supporting two Intel processors: buffers. Ivy-Bridge and Haswell. Any application which uses the GPU is called a GPU Beignet is composed by two main components: the client. Before submitting a job to the GPU, each client sends OpenCL runtime and the Just-in-Time compiler. The run- a hardware-context creation request to the Intel graphic time consists of a dynamic library which implements the driver. Hardware contexts are opaque objects which are OpenCL API. The Just-in-Time (JIT) compiler uses LLVM to used to manage context switches between clients. Intel implement the OpenCL C language compiler and to trans- added the support to hardware contexts from the Ironlake late the code of the kernel into the ISA of the IGD. The kernel family (i.e. from 2010) relieving the driver from managing code is compiled through JIT because the OpenCL programs them. The clients submit a job j to the driver, which then should be device-independent and could be executed on tells the GPU to perform a context switch and execute j.To any hardware platform, supported by OpenCL. enforce the context switch, the driver encloses the com- In order to implement the runtime, Beignet uses a user- mands received from the client with the proper restore and space library called libdrm. Libdrm is a cross-driver mid- save commands. It is worth mentioning that a client can dleware which allows user-space applications to commu- decide to avoid the creation of new hardware context, and nicate with the kernel. The library provides wrapper in this case it gets associated with the so-called default functions for the ioctls to avoid exposing the kernel inter- hardware context. Nonetheless, in our experiments we face directly. Libdrm is a low-level library, typically used by focused on clients with an associated hardware context e graphics drivers such as the X drivers, and provides a as it is the case for Beignet. fi common API for higher layers and implements speci c The Intel graphic driver tracks all the hardware contexts functions for several vendors such as Intel, and using a linked list of i915_hw_context. AMD. In this paper we will rely on the Intel functions. All our tests were conducted on an Intel Haswell pro- Memory management cessor with Beignet 0.9.1 and libdrm 2.4.58. From a memory forensics perspective it is very impor- Direct rendering manager tant to understand that the CPU and the have two different views of the memory. From the The Direct Rendering Manager (DRM) is a subsystem of controller point of view, the memory is just an array of the Linux kernel responsible for the interface with the GPU. contiguous physical addresses. On the contrary, the CPU D. Balzarotti et al. / Digital Investigation 14 (2015) S16eS24 S21

Fig. 4. The PCI bars for GPU MMIO.

GSM data range and to system RAM, while the PPGTT en- tries can point only to the system RAM. The two tables contain almost the same entries, except for those pointing to the data GSM that are only present in the GGTT. The GGTT is a memory-resident page table containing an array of 32-bit Page Translation Entries (PTEs). The PPGTT is instead a two-level page table. The second level makes the allocation of memory for these structures less problematic for the operating system since it can be allo- cated also on not-contiguous pages. If a PPGTT is enabled, all rendering operations target per-process virtual memory. Fig. 3. The address space layout on Intel Haswell. Malware on Intel GPUs sees the memory as sequence of special-purpose areas of In this section we describe the experiments we con- contiguous locations. Such areas include system RAM as ducted to test how the different techniques presented in well as MMIO. The combination of these special-purpose Section “GPU assisted malware” could be implemented on areas is normally called Address Space Layout. The address Intel GPUs. space layout of the Haswell architecture is represented in Fig. 3. The figure shows on the right side the view of the Unlimited code execution Haswell DRAM controller and on the left side the view of the CPU. The address space is divided into two parts (i.e., To obtaining unlimited code execution, a malware needs below and above the 4 GB threshold) and each part is to disable the hangcheck in the Intel graphic driver. The composed by system RAM, MMIO areas, and remapped hangcheck is a watchdog mechanism that kills any kernel regions. The graphic card has a specially-reserved area of running for more then t seconds. The i915 driver exposes a memory that is not accessible from the CPU. This section is facility to disable the hangheck through the sysfs, at the called Graphic Stolen Memory (GSM) and consists of two path/sys/module/i915/parameters/enable_hangcheck contiguous ranges, which are configured by the BIOS: the (writing a zero to this file disables the watchdog). This Graphics Translation Tables (GTT) range that stores the vir- operation requires superuser privileges. tual to physical graphic translation tables and the Data Range which is a programmable space, used to store Inconsistent memory mapping graphics data (e.g the ). A special register called Base of GTT Stolen Memory When a buffer object is allocated through the GEM (BGSM in the Figure) points to the base address of the GSM. subsystem, it can be referenced from both the CPU and the The TOLUD register marks instead the top of the low usable GPU domains using two different virtual addresses. To physical memory that is accessible to the CPU. The only way implement this feature, the graphic driver creates two for the CPU to read the GTT memory is through MMIO on memory mappings in both the domains. However, the the PCI BAR0 of the graphic card. A MMIO on the PCI BAR2 mapping in the CPU domain is stored into the OS page table e which is also called the Graphics Aperture e is used to whereas the mapping in the IGD domain is stored in the access the Data range. The size of this area is defined in the graphic page tables. During normal operations the two BIOS and goes from 32 MB to 256 MB. The layout of the PCI information are consistent. Therefore, as long as the cor- BARs is depicted in Fig. 4. responding physical address is not in the GSM, an analyst IGDs have two virtual address spaces of 2 GB each, each can extract the memory accessed by the GPU by looking at one with its own page table. The first page table is called data structures of the operating system, in system memory. Global Graphic Translation Table (GGTT) and the second However, if an entry in any of the graphic page tables is one is called Per-Process Graphic Translation Table (PPGTT). manually modified to point to a different location, the The GGTT is used by the render ring and the display func- Linux page table is not affected. Indeed, the driver only tions. Interestingly, the GGTT entries can point both to the cares of preserving the coherency between the GGTT and S22 D. Balzarotti et al. / Digital Investigation 14 (2015) S16eS24 the PPGTT. In this case, the virtual address in the CPU This attack only makes sense if the malware also disable domain continues to point to the original physical page, the watchdog to perform an infinite task in the GPU. In a while the GPU address now points to any arbitrary page in normal desktop computer, this would render the display memory. not responsive and the attack could be easily noticed. To show the feasibility of this attack we performed a However, this side effect is only noticeable when the IGD is simple test. Our experiment includes a victim application V effectively used for rendering. This is not the case of servers which contains some sensitive data in a memory area that are managed through remote connections or high end

Bvictim. This application does not use the video card. At the laptop and workstations which are equipped with both an same time, a malicious application M invokes the OpenCL integrated and the discrete GPUs. In such configurations is library function clCreateBuffer() to allocate a buffer Bdummy very common that the discrete GPU is the one used by of 1024 bytes. The malware then loads a kernel module that default. performs a number of tasks. First, it finds the physical address of Bvictim and Bdummy and locates the Intel GPU in the PCI device list. Second, it retrieves the pointer to the MMIO Context-less execution address of the beginning of the Graphic Stolen Memory through the struct drm_i915_private data structure. Then, We demonstrate that it is possible to have a kernel the module scans the two levels of the PPGTT to find the running in the GPU without any hardware context regis- tered in the data structures of the Intel graphic driver. entry pointing to the physical address of Bdummy and mod- Indeed, the process-less execution still leaves some traces ifies the entry to point to the physical address of Bvictim.At this point, the malware unloads and removes the kernel inside the Intel graphic driver. We were not able to retrieve module from the system. Finally, M prepares a kernel task K the original malware PID, however, the buffer objects related to the running kernel and its hardware context, to overwrite the content of Bdummy (that in the IGD domains were still in the driver's memory. points to Bvictim), it compiles it using the Beignet JIT compiler, and sends it to the video card. Even though there is no a straightforward way to link Using this technique the malware forced an inconsistent such information back to the owner PID since the task memory mapping between the Linux page table and the struct was unlinked by the kernel after the SIGINT, this GPU page table. An investigator looking at the system anomaly can be detected by a very careful forensic memory would see that the process M is using the GPU and investigator. that his kernel task has access to a buffer, which still Therefore, in our final experiment we tried to replicate the previous attack, but this time also to remove any correctly points to the harmless Bdummy. However, the GPU page table (which does not reside in the system RAM) allow hardware context associated with the running kernel. This technique, on top of requiring superuser privileges, also the malware to overwrite the content of Bvictim, which re- sides in a different process which does not even use the requires a detailed knowledge of the driver internals to GPU. Even more interesting, we verified that the GPU is able perform a DKOM attack. In our test, we developed another to overwrite the target buffer even if it resides in a read- malicious kernel module that locates the hardware context only page (for instance, we were able to modify the .text of the malware kernel and removes it from the internal list segment of another running process). in the driver. To do so, the module performs the following steps: first, it looks for symbol referencing the struct drm_i915_private. Then it gets the context_list pointer and Process-less execution destroys the target context through the driver function i915_gem_context_unreference (). This operation has the We demonstrate that it is possible to have a kernel side effect of also freeing all the buffer objects of the running in the GPU without any corresponding active host hardware context. However, as long as the malicious pro- process. In this experiment the malware M uses a simple cess is the only process running on the GPU, this operation kernel that runs an infinite loop to increments the value of a does not alter the behavior of the malicious kernel. Finally, the malware set the memory of the hardware context to properly allocated Bdummy. Using again the inconsistent memory mapping technique, the malware uses its kernel K zero. to overwrite an integer variable which resides on the stack of a victim process. Findings and discussion However, in this experiment once the kernel execution started, we killed the malicious process by sending a SIGINT Table 1 summarizes the main findings of our study. The to M just after the kernel submission. In order to verify if the first column shows the different techniques we proposed in IGD was still running our code, we used several indicators. this paper and that can be used by malware to increase the First, even if there was no trace of M in the process list, the complexity of a memory forensic examination. The second value of the stack variable in the victim process kept column shows the requirements of such malware: U for increasing according to the code of K. Second, we used an user privileges, S for super-user privileges, and K for super- utility to estimate the load of the render ring measuring the user privileges with an additional knowledge of the GPU distance between the head and the tail pointers. Indeed, driver internal data structures. On the right side of the table when the GPU is idle, the head and the tail pointers point to we summarize the consequences for memory forensics, the same address. This was not the case in our experiment, grouped around the three main objective of listing the confirming that something was indeed running in the GPU. processes that use the GPU, understanding which kernel D. Balzarotti et al. / Digital Investigation 14 (2015) S16eS24 S23

Table 1 accessed by a process, passing through the GPU. In the Summary of our Findings. case of GEM, this information consists of two doubly- Anti-forensic Malware Forensic objectives linked lists that can be obtained from the struct

Technique Requirements List List kernels Memory drm_i915_private. processes map List of contexts None U ✓ (OS) ✓ (Driver) ✓ (OS) Unlimited Execution S ✓ (OS) ✓ (Driver) ✓ (OS) This artifact can be used to get information on the GPU Process-less S ✗ ✓ (Driver) ✓ (Driver) kernels and understand which processes are using the GPU. Inconsistent Map K ✓ (OS) ✓ (Driver) ✗ In IGDs this can be done through the context_list field in ✗✗ ✗ Context-less K the struct drm_i915_private data structure.

Register file code is executed by each of them, and listing the memory It contains information on the internal state of the ranges that are accessible to such kernels. GPU and can be accessed through mmio on GPU PCI Memory forensic involves two separate processes: BAR0. Among the IO registers the RING_TAIL and the memory acquisition and memory analysis. Table 1 reports RING_HEAD deserve a special mention. These registers information for both processes. First, a ✓ sign means that are used to control how the ringbuffer is accessed by the the corresponding objective can be achieved by looking only CPU and can be very useful for the analyst to understand at the content of the system memory. This is very important if there were any tasks running during the acquisition because it means that the memory acquisition tools we use process. today are already sufficient to complete the task. When this is possible, we report if the memory analysis Beyond our case study can be performed extracting information only from OS data structures, or if it requires ad-hoc modules to analyze the The variety of GPU ecosystem poses a big challenge for video driver internals. The latter can be quite problematic the forensic community. In the worst case, a different tools for a forensic point of view. Video drivers are often closed- should be developed for each possible combination of GPU source, and developing a separate memory analysis module model and Operating system. Nonetheless, for what con- (e.g., a Volatility plugin) for each of them can be a daunting cerns Linux, the presence of the DRM layer can ease this task. A similar consideration applies to the cells containing task. Other than the Intel driver, also other drivers (e.g. the ✗ sign. In this case, an analyst would first need a , nouveau, , …) are compliant with both the specialized tool to dump the video card memory. In our DRM and the GEM memory manager. For such drivers, scenario this means the Graphic Stolen Memory and the PCI most of our considerations remains the same. For other BARs of the GPU, but this could be even more complicated closed-source drivers, more work is required to reverse- for cards with a separate dedicated memory. Moreover, the engineer their internal data structures. analyst would then need custom modules to parse the data Few additional observations can be done by looking at this memory contains. Again, without the support of the Table 1. For example, the information stored in the system vendors, this can be completely impractical on a large scale. memory (marked with the ✓ sign in the table) are likely to be the same for most GPUs. In fact, to use the GPU, a host process needs to communicate with the driver and in Summary and guidelines order to do this, the host process must open the GPU de- vice file. Therefore, to find which host processes are using As part of our experiments, we developed a set of the GPU, the system memory is enough. Contrarily, espe- custom tools and volatility plugins to retrieve a number of cially for cards with dedicated memory, some of the in- information from the system and GPU memory. In partic- formation that in our case were stored within the driver's ular, from a forensic perspective, we collected and parsed memory may instead reside in the device memory. This the following data structures: would translate inevitably into more ✗ locations on the table. Graphic page tables This artifact can be used to detect the presence of Related work inconsistent memory map in the system. For IGDs, the Graphic Translation Table (GTT) and Per-Process GTT Several examples of GPU-assisted malware have been (PGTT) can be dumped through mmio on the GPU PCI BAR2. analyzed so far (Vasiliadis et al., Oct 2010; Ladakis et al., 2013; Triulzi, 2008; Danisevskis et al., 2014). For instance, Hangcheck flag in Ladakis et al. (2013) the GPU is used to implement a This flag is stored in the driver and, as described in the stealthy keylogger. To access the keyboard buffer, the GPU previous section, can be very important to detect sophis- uses the physical addressing through DMA, without any ticated attacks. CPU intervention. In (Vasiliadis et al., Oct 2010) the authors present a proof-of-concept which leverages GPU to unpack List of buffer objects the code of a malware with a XOR-based encryption Together with the graphic page tables, these artifacts scheme using several random keys. To hinder the analysis, can be used to detect which memory pages can be the malware's code is unpacked at runtime and the S24 D. Balzarotti et al. / Digital Investigation 14 (2015) S16eS24 decryption keys are stored in the device memory, that is GPU as an anti-forensic tools and we highlighted four not accessible from the CPU. In Triulzi (2008) the author different techniques that a malware can use to hide its shows how the GPU and NIC with a malicious firmware can presence. We provided a case study on Intel Integrated be used to exfiltrate data from the host's memory. Finally, GPUs showing how each of the four techniques impacts on in Danisevskis et al. (2014) the authors describe an attack the memory analysis. We tried to keep our analysis as targeting a mobile GPU. In such case a vulnerability in the general as possible however, as a future work, we plan to driver is exploited to take over the memory protection perform additional experiments targeting different GPU mechanism of the GPU. models and different Operating Systems. Vasiliadis et al. (Vasiliadis et al., 2014) implemented some encryption algorithms with CUDA and leveraged the non-preemptive nature of GPUs to preserve the integrity of References the cryptographic algorithms and the confidentiality of the key. However, the same technique can be used by malware ArsTechnica. Bitcoin malware uses gpu for mining. 2015. http://goo.gl/ to hinder live analysis or to achieve persistence. jdGHZV. In the past, several anti-memory forensic techniques Beignet, 2015. https://01.org/beignet. fi fi have been developed. Such techniques can be divided into Cui A, Costello M, Stolfo SJ. When rmware modi cations attack: a case study of embedded exploitation. In: NDSS; 2013. two categories: anti-acquisition and anti-analysis (Stttgen Danisevskis J, Piekarska M, Seifert J-P. Dark side of the : mobile and Cohen, 2013). Anti-acquisition operates during the gpu-aided malware delivery. In: Information Security and Cryptology memory acquisition process interfering with the memory (ICISC). Springer International Publishing; 2014. Di Pietro R, Lombardi F, Villani A. Cuda leaks: information leakage in gpu scanner. For instance, in Sparks and Butler (2005a) the architectures. arXiv preprint. 2013., arXiv:1305.7383. author shows how a kernel rootkit can hijack the memory Freedesktop. The direct rendering infrastructure. 2015. http://dri. scanner reads to hide its presence in the memory dump. freedesktop.org/wiki/. Haruyama T, Suzuki H. One-byte modification for breaking memory Some PoC tools are more invasive and they prevent the forensic analysis. In: BlackHat Europe Conference; 2012. kernel from loading additional drivers (Memoryze, 2015). Intel, 2015. https://01.org/linuxgraphics/. The goal of anti-analysis techniques is instead to prevent Kotcher R, Pei Y, Jumde P, Jackson C. Cross-origin pixel stealing: timing attacks using css filters. In: Proc. of the ACM SIGSAC Conference on the correct analysis of the memory dump. They perform Computer and communications security; 2013. DKOM to prevent the memory analysis tools from finding Ladakis E, Koromilas L, Vasiliadis G, Polychronakis M, Ioannidis S. You some fundamental kernel variables which are used as can type, but you can't hide: a stealthy gpu-based keylogger. In: Proc. of the 6th European Workshop on System Security (EuroSec); starting point for the analysis (e.g. the _KDDE- 2013. BUGGER_DATA64 data structure). Haruyama et al. Lee S, Kim Y, Kim J, Kim J. Stealing webpages rendered on your browser by (Haruyama & Suzuki, 2012) show how the modification of exploiting gpu vulnerabilities. In: Proc. of the IEEE Symposium on one-byte on some kernel variables can break the memory Security and Privacy; 2014. Washington, DC, USA. LiMe, 2015. http://goo.gl/lrKQ3P. analysis process. Stuttgen et al. (Stttgen & Cohen, 2013) Maurice C, Neumann C, Heen O, Francillon A. Confidentiality issues on a show that hooking memory enumeration and memory gpu in a virtualized environment. In: Proc. of the 18th International mapping API can hinder the analysis. This is substantially Conference on Financial Cryptography and Data Security; 2014. Memoryze. Meterpreter anti memory forensics script. 2015. http://goo.gl/ different from the Incosistent Mapping presented here. lJS3a6. Indeed, even in the presence of the improved memory Rekall. Memory dump with rekall. 2015. http://goo.gl/0DU5Bl. acquisition technique proposed in Stttgen and Cohen. Richard GG. Gpu malware research and the 2014 dfrws forensic challenge. 2015. http://goo.gl/Ly3ILv. (2013), the GPU page table would be marked as not safe Sparks S, Butler J. Shadow walker: raising the bar for rootkit detection. In: to dump and the modified mapping would not be detected. BlackHat Japan; 2005. GPUs can be used as both an anti-acquisition and anti- Sparks S, Butler J. Raising the bar for windows rootkit detection. PHRACK, Issue 63. 2005. analysis technique. Indeed, the non-preemptive nature of Sparks S, Embleton S, Zou CC. A chipset level network backdoor: GPUs can be leveraged to prevent the memory acquisition bypassing host-based firewall & ids. In: Proc. of 4th International process of the GPU memory itself. Pixelvault (Vasiliadis Symposium on Information, Computer, and Communications Secu- rity. New York, NY, USA: ACM; 2009. et al., 2014) locks the GPU to protect the key material, the Stttgen J, Cohen M. Anti-forensic resilient memory acquisition. Digit same mechanisms can be used to prevent the memory Investig 2013;10(0). the Proc. of the 13th Annual Digital Forensics scanner from getting the dump of the GPU memory. Research Conference (DFRWS). Furthermore, since GPUs are not considered during the Triulzi A. Project maux mk.ii. 2008. http://goo.gl/N4KdYR. Vasiliadis G, Elias A, Polychronakis M, Ioannidis S. Pixelvault: using gpus memory-forensic analysis by current tools, using the GPU for securing cryptographic operations. In: Proc. of the 21st ACM to run malicious code can prevent the correct analysis of Conference on Computer and Communications Security. ACM; 2014. the memory dump. Vasiliadis G, Polychronakis M, Ioannidis S. Gpu-assisted malware. In: Malicious and Unwanted Software (MALWARE), 2010 5th Interna- tional Conference; Oct 2010. p. 1e6. Conclusions and future work Vetter, D., 2015. http://goo.gl/rIU0bn. Zaddach J, Kurmus A, Balzarotti D, Blass EO, Francillon A, Goodspeed T, et al. Implementation and implications of a stealth hard-drive back- In this work we analyzed to what extent a GPU can be door. In: ACSAC, 29th Annual Computer Security Applications Con- used to perform malicious computation. We modeled the ference; 2013. New Orleans, Louisiana, USA.