<<

GPU System Calls

Jan´ Vesely´§‡, Arkaprava Basu‡, Abhishek Bhattacharjee§, Gabriel H. Loh‡, Mark Oskin‡, Steven K. Reinhardt‡

§ Rutgers University, ‡ AMD Research

Abstract rectX [16] for general-purpose usage. Since then, fi GPGPU programming has become increasingly acces- GPUs are becoming rst-class compute citizens and are sible to traditional CPU programmers, with graphics being tasked to perform increasingly complex work. APIs giving way to computation-oriented languages Modern GPUs increasingly support programmability- with a familiar C/C++ heritage such as OpenCLTM [9], enhancing features such as shared virtual memory and C++AMP [15], and CUDA [19]. But access to privileged hardware cache coherence, enabling them to run a wider OS services, or system calls, are an important aspect of variety of programs. But a key aspect of general-purpose CPU programming that remain out of reach for typical programming where GPUs are still found lacking is the GPU programmers. ability to invoke system calls. While traditionally considered to only be of academic We explore how to directly invoke generic system calls curiosity, GPU system calls have recently begun attract- in GPU programs. We examine how system calls should ing wider attention [13, 21, 37, 26]. Studies have ex- be meshed with prevailing GPGPU programming mod- plored the first few types of system calls on GPU pro- els, where thousands of threads are organized in a hi- grams for file input/out (I/O) [37] and network access erarchy of execution groups: Should a system call be [13]. These studies present an important start but focus invoked at the individual GPU task, or at different ex- on specific OS subsystems, such as management of the ecution group levels? What are reasonable ordering se- file system buffer cache, and in some cases, rely on spe- mantics for GPU system calls across these hierarchy of cific hardware features such as direct user-mode access execution groups? To study these questions, we imple- to the network interface controller [21]. As such, they mented GENESYS – a mechanism to allow GPU pro- point to the exigent need for a broader examination of grams to invoke system calls in the operating sys- how system calls can aid GPU programs. tem. Numerous subtle changes to Linux were necessary, Consequently, this paper generalizes the means by as the existing kernel assumes that only CPUs invoke which GPU programs directly invoke system calls, and system calls. We analyze the performance of GENESYS the benefits of doing so. Our implementation of GPU using micro-benchmarks and three applications that ex- system calls reveals two main benefits – better pro- ercise the filesystem, networking, and memory allocation grammability and the potential for improved perfor- subsystems of the kernel. We conclude by analyzing the mance. In assessing how to achieve these, we study sev- suitability of all of Linux’s system calls for the GPU. eral research questions unique to GPU system calls: How does the GPU execution model impact system 1 Introduction call invocation strategies? To manage parallelism, GPU programming languages and the underlying hardware ar- Graphics processing units (GPUs) are now widely used chitecture decompose work into a hierarchy of execu- for high-performance computing (HPC), machine learn- tion groups. The granularity of these groups range from ing, and data-analytics. Increasing deployments of these work-items (or GPU threads) to work-groups (composed general-purpose GPUs (GPGPUs) have been enabled, in of hundreds of work-items) to kernels (composed of hun- part, by improvements in their programmability. GPUs dreds work-groups) 1. 
This naturally presents the follow- have gradually evolved from fixed function 3D accelera- ing research question – at which of these granularities tors to fully programmable units [17, 25, 39]. Today, they should GPU system calls be invoked? support programmability-enhancing features such as ad- dress translation [20, 5, 22, 29, 31, 41], virtual memory How should GPU system calls be ordered? Tradi- [41], and cache coherence [1, 7, 14, 30, 33]. tional CPU system calls are implemented with the guar- GPU programming models have steadily evolved 1 with these hardware changes. Early GPGPU program- Without loss of generality, we use the AMD terminology of work- items, work-groups, kernels, and compute unit (CU), although our mers [40, 23, 24, 32, 35] adapted graphics-oriented work applies equally to the NVIDIA terminology of threads, thread- programming models such as OpenGL [10] and Di- blocks, kernels, and streaming multiprocessor (SM) respectively.

1 antee that instructions prior to the system call have com- 2 Motivation pleted execution, while code following the system call remains unexecuted. These semantics are a good fit for A reasonable research question to ponder is, why equip CPUs, which generally target single-threaded execution. GPUs with system call invocation capabilities at all? Af- But such “strong ordering” semantics may be overly- ter all, the lack of system call support has not hindered conservative for GPUs, acting as implicit synchroniza- widespread GPU adoption. tion barriers across potentially thousands of work-items, Consider Figure 1(left), which depicts how GPU pro- compromising performance. In fact, we find that GPU grams currently invoke system calls. At a high-level, pro- code is often amenable to “relaxed ordering”, where such grammers are forced to delay system call requests until implicit barriers may be avoided without harming the se- the end of the GPU kernel invocation. This is not ideal mantics of common GPU programs. We describe the sit- because developers have to take what was a single con- uations where strong versus weak ordering decisions af- ceptual GPU kernel and partition it into two – one be- fect performance. Beyond ordering, we study the utility fore the system call, and one after it. This model, which of “blocking” versus “non-blocking” GPU system calls. is akin to continuations, is notoriously difficult to pro- gram for [6]. Compiler technologies can assist the pro- Where and how should GPU system calls be pro- cess [36], but the effect of ending the GPU kernel, and cessed? Modern GPUs cannot, as of yet, execute priv- restarting another is the same as a barrier synchronization ileged instructions or support security rings. This is un- across all GPU threads and adds unnecessary roundtrips likely to change in the near future, due to concerns about between the CPU and the GPU. This significantly de- how privileged modes on GPUs may compromise secu- grades performance. rity [22]. We assume, therefore, that system calls in- voked by GPU programs need to ultimately be serviced 0 5="5'7 5&- 39:9;<; by the CPU. This makes efficient GPU-CPU commu- 012 312 012 312

nication and CPU-side processing fundamental to GPU  !"##$%&'&()*+,  !"##$%&'&()*+, system calls. We find that careful use of modern GPU

features like shared virtual addressing [41] and page fault 4"5"-/657#8 ".*"#'/%&'&

support [29, 31], coupled with traditional interrupt mech- - &%$%&'&()*+, - &%$%&'&()*+, anisms, can enable efficient CPU-GPU communication 4"5"-/#'&' ".*"#'/%&'& of system call requests and responses. - &%$%&'&()*+,

To study these broad research questions, we imple-

ment and evaluate a framework for generic system call  !"##$%&'&()*+, invocation on GPUs, or GENESYS, in the Linux OS.

We use microbenchmarks to study the questions posed 4"5"-/657#8 above. We then showcase the benefits of GPU system - &%$%&'&()*+, calls on with three workloads: string-matching with stor- 4"5"-/#'&' age I/O [37], an echo server with network I/O, and an enhanced version of a GPGPU benchmark [41] that pro-  !"##$%&'&()*+, vides hints to the memory allocator within the OS kernel. When designing GENESYS, we ran into several inter- esting questions. For example, what system calls make Figure 1: (Left) Timeline of events when the GPU has sense to provide to GPUs? System calls such as pread- to rely on a CPU to handle system services; and (Right) /pwrite to file(s) or send/recv to and from the network when the GPU has system call invocation capabilities. stack are useful since they underpin many I/O activities required to complete a task. But system calls such as Now consider Figure 1(right), which shows a timeline fork and execv do not, for now, seem necessary for GPU for a system where the GPU can directly invoke the OS. threads. In the middle are many system calls which need There is no need for repeated kernel launches, enabling adaptation for GPU execution. For example, getrusage better GPU efficiency because processing can proceed can be adapted to return information about GPU resource uninterrupted. System calls (e.g., request data) are still usage, and sched setaffinity likely makes sense only if processed by the CPU, as often-times the work requires the GPU thread scheduler is software-programmable, access to hardware resources only the CPU can interact which it currently is not. We conclude our study by ex- with. But CPUs do not need to actively “baby-sit” the amining all system calls in Linux and classifying their GPU and can schedule tasks in response to GPU system usefulness for GPU applications. calls as and when needed. CPU system call processing

2 also naturally overlaps with the execution of other GPU rupt handler on the CPU, which reads the syscall area, threads. processes the interrupt, and writes the results back into OSes have traditionally provided a standardized ab- the syscall area. Finally, the CPU notifies the GPU wave- stract machine, in the form of a process, to user programs front that its system call has been processed. executing on the CPU. Parts of this process abstraction, Our system call invocation framework relies on the such as memory layout, the location of program argu- ability of the GPU to interrupt the CPU, and we do lever- ments, an ISA, have benefited GPU programs as well. age readily-available hardware [2, 12] for this support. Other aspects, however, such as standardized and pro- But this is not a fundamental design choice – in fact, tected access to the filesystem, network, and memory al- prior work [26, 37] uses a CPU polling thread to service locator are extremely important for processes, yet lack- a limited set of GPU system service requests instead. Al- ing for GPU code. Allowing GPU code to invoke system though using GPU-to-CPU interrupts does present cer- calls is a further step to providing a more complete pro- tain benefits (e.g., freeing CPUs from having to , cess abstraction to GPU code. allowing them to execute other applications, etc.), our work is orthogonal to the interrupt versus polling debate. 3 Our Approach Finally, while features like shared virtual memory and CPU-GPU cache coherence [29, 30, 31, 41] are benefi- We focus on how to implement a framework for generic cial to our design and becoming common, they are not system calls. Many past studies, GPUfs [37], GPUnet strictly necessary. Instead, CPU-GPU communication [13] and CPU-to-GPU callbacks [26], have implemented can also be achieved via atomic reads and writes in sys- specific types of system calls to enable research on the tem memory and the GPU’s device memory [28]. benefits of offloading aspects of filesystems and network- ing to GPUs. Our efforts are complementary and provide 4 System Call Design Space Exploration a single unifying mechanism for all system calls. Enabling GPU system calls opens up many ways in which they can be used to improve programmability and performance. The “right” approach depends on the in- teractions between the algorithms and individual system calls. We discuss several interactions, providing GPGPU programmers a blueprint on reasoning about when and how to use system calls.

4.1 GPU-Side Design Considerations Figure 2: High-level overview of how GPU system calls To use system calls effectively in GPGPU code, pro- are invoked and processed on CPUs. grammers must be aware of the following considerations: We use the GPU system call mechanism illustrated in Invocation granularity: Potentially the first and most Figure 2 to enable our studies. We use a system with a important question is, how should system calls be CPU (possibly with multiple cores) and GPU. Our ap- aligned with the hierarchy of GPU execution groups? proach is equally applicable to integrated CPU-GPUs GPU hardware relies on singe-instruction multiple-data (also called accelerated processing units or APUs) and to (SIMD) execution and can concurrently run thousands discrete GPUs. We assume that both CPU and GPU can of threads. To keep such massive parallelism tractable, access system memory, a feature that is common today. GPGPU programming languages like OpenCLTM [9] and Figure 2 details the steps used for GPU system calls. CUDA [19] exposes hierarchical groups of concurrently- First, the GPU invokes a system call. Since the GPU is executing threads to programmers. The smallest gran- not capable of executing system calls itself, the CPU pro- ularity of execution is the GPU work-item (akin to a cesses them on its behalf. Therefore, the GPU must place CPU thread). Several work-items (typically, 32-64) op- the relevant system call arguments and information in a erate in lockstep in the unit of wavefronts, the smallest portion of system memory also visible by the CPU. We hardware-scheduled units of execution. Several of these designate a system memory syscall area to store this in- wavefronts (e.g., 16) make up programmer-visible work- formation. In the next step, the GPU interrupts the CPU, groups and execute on a single compute unit (CU) of conveying the need for system call processing. With the the underlying GPU hardware. Work-items in a work- interrupt, the GPU also identifies the wavefront issuing group can communicate amongst themselves using local the system call. This triggers the execution of an inter- CU caches and/or scratchpads. Finally, hundreds of these

3 work-groups together comprise the GPU kernel (GPU System call ordering semantics: When programmers program). The CPU dispatches work to a GPU at the use traditional system calls on the CPU, they expect pro- granularity of a kernel. Each work-group in a kernel gram instructions appearing before invocation to finish can execute independently. Further, it is possible to syn- completely. They also expect that the CPU resumes pro- chronize just the work-items within a single work-group gram execution only after the system call returns. En- [9, 14]. This allows us to avoid the cost of globally syn- forcing such “strong” ordering in the GPU may be natu- chronizing across thousands of work-items in a kernel, ral for work-item granularity invocations but is problem- which is often unnecessary in a GPU program. In light atic for work-group and kernel-grain invocations. Con- of this, should a GPU system call be invoked separately sider, for example, strong ordering on per-work-group for each work-item, once for every work-group, or once invocations. We would need to place a barrier before the for the entire GPU kernel? Determining the right system system call to ensure that all work-items in the work- call invocation granularity requires an understanding of group are synchronized. Similarly, we would require the system call’s use for the particular task at hand. a second barrier after system call return so that work- Consider a GPU program that writes sorted integers to items remain strongly ordered. Since a work-group can a single output file. One might expect the write system easily consist of hundreds of work-items, such barriers call, invoked per work-item, to be the intuitive approach. are expensive. It is worse for kernel granularity invoca- This can present correctness issues, however, since write tion since a typical kernel consists of several such work- is position-relative, and requires access to a globally up- groups, totaling thousands of work-items. dated file pointer. Without synchronizing the execution Fortunately, strong ordering is not always necessary of write across work-items, the output file will be written in GPU programs. We thus introduce the concept of in a non-deterministic, and hence, unsorted order. “relaxed ordering” of system calls. Fundamental to this concept is the observation that in general, in a program, There exist several ways of fixing this problem us- a system call play a role of primarily a producer or a ing different system call invocation granularities. In one consumer of information. For example, system calls like approach, the programmer can use a memory location write and send generate or write data while those like to temporarily buffer the sorted output. Once all work- read and receive consume or read data. For correct oper- items have updated this buffer, the GPGPU programmer ation, system calls in producer role typically require that can use a single write system call at the end of the kernel all work-items corresponding to the system call’s invo- to write the contents of the buffer to the output file. This cation granularity finish executing all their program in- approach partly loses the benefits of the GPU’s parallel structions before the system call. Producers, however, resources, since the entire system call latency is exposed may not necessarily have to wait for the system call to on the program’s critical path and cannot be overlapped finish. For example, consider the write system call and with the execution of any other work-items. 
A compro- assume that the programmer invokes it at the work-group mise may be to invoke a write system call per work- granularity. Program instructions after the write call may group, buffering the sorted output of the work-items until not necessarily depend upon the outcome of the write the per-work-group buffers are fully populated. Finally, itself; therefore, other work-items in the group need not yet another alternative may be to use pwrite system calls wait for the completion of the system call. This is in con- instead of write system calls. Since pwrite allows pro- trast with system calls in consumer role like read, which grammers to specify the absolute file position where the typically require system calls to return before any work- output is to be written, per-work-item invocations present items can start executing any program instructions after no correctness problems. invocation. Consumers, however, do not necessarily re- Naturally, these approaches have different perfor- quire all work-items of the work-group to finish execut- mance implications. While per-work-item invocations of ing all instructions before the invocation. The same ob- pwrite result in a flood of system calls, potentially harm- servations apply to per-kernel system call invocations. ing performance, overly-coarse kernel-grain invocations Overall, relaxed ordering obviates the need for costly also restrict performance by wasting the GPU’s parallel barrier synchronizations around system calls. For system execution resources. Sec. 6 shows that these decisions call invocations in purely producer role, it eliminates the can yield as much as a 1.75× performance difference. barrier after the invocation, and for those in only con- As such, the utility of a system call and its invocation sumer role, it eliminates the barrier before invocation. granularity are context-dependent and must be assessed Our analysis in Sec. 6 shows that ensuing performance carefully by the programmer. Therefore, we believe that improvements can be significant, up to 27%. an effective GPU system call interface must allow the Beyond performance, we have found that relaxed or- programmer to specify the desired invocation granularity dering is necessary for functional correctness for kernel for every supported system call. granularity invocation. This is because GPU kernels can

4 consist of more work-items than can concurrently exe- vocation, however, ensures that one wavefront waits for cute on the GPU. For strongly ordered system calls in- the write to complete and can raise an alarm if the write voked at the kernel-grain, all work-items in the kernel fails. Consider another scenario, where a programmer need to finish executing instructions before the invoca- wishes to prefetch data using read system calls but may tion but all work-items cannot execute concurrently. To not use the results immediately. Here, weak ordering make space, some work-items need to completely finish with non-blocking invocation is likely to provide the best their execution, including system call invocation. Unfor- performance without breaking the program’s semantics. tunately, this contradicts strong ordering requirements. In short, different combinations of blocking and ordering In other words, using strong ordering at kernel granular- enables a GPU programmer to fine tune performance and ity can run the risk of deadlocking the GPU. We need programmability tradeoffs. relaxed ordering for these cases to work correctly. In summary, at work-item granularity, we enable only 4.2 CPU-Side Design Considerations implicit strong ordering (akin to CPU threads). At work- group granularity invocation, a programmer can choose Once GPUs interrupt the CPU to convey the need to pro- between strong or weak ordering to fine tune trade-offs cess system calls, several design considerations deter- between programmability and performance. Finally, at mine the efficiency of CPU execution. kernel granularity only weak ordering is possible as for- System call coalescing: GPUs rely on extreme paral- ward progress may otherwise not be guaranteed. lelism to enable good performance. This means there Blocking versus non-blocking approaches: Most tradi- may be bursts of system calls invoked on the CPU by the tional CPU system calls, barring those for asynchronous GPU. System call coalescing is one way to increase the I/O (e.g., aio read, aio write), return only after the throughput of system call processing. The idea is to col- system call finishes execution and are hence blocking. lect several GPU system call requests and batch their pro- fi We find, however, that blocking system calls are of- cessing on the CPU. The bene t is that CPUs can man- ten overly-restrictive for GPUs. In particular, because age multiple system calls as a single unit, a GPUs use SIMD execution, if even a single work-item is single software task to process them. This reduction in blocked on a system call, none of the other work-items in task management, scheduling, and processing overheads the wavefront can make progress either. The natural way can often boost performance (see Sec. 6). Coalescing of countering the overheads of system calls is to leverage must be performed judiciously as it improves system call the GPU’s parallel execution model. That is, we find that processing throughput at the expense of latency. It im- GPUs can often benefit from non-blocking system calls plicitly serializes the processing of system calls within a which can return immediately, before system call pro- coalesced bundle. cessing completes. 
Non-blocking system calls can there- To allow the GPGPU programmer to balance the ben- fi fore overlap system call processing on the CPU with use- e ts of coalescing with its potential problems, we enable ful GPU work, improving performance by as much as our system call interface to accept two parameterized bits 30% in some of our studies (see Sec. 6). of information – the time window length that the CPU waits for, in order to coalesce subsequent system calls, The concepts of blocking versus non-blocking and and the maximum number of system call invocations that strong versus relaxed ordering are related but are orthog- can be coalesced within the time window. Our analysis onal. The strictness of ordering refers to the question in Sec. 6 reveals that system call coalescing can improve of when a system call can be invoked, with respect to performance by as much as 10-15%. the progress of work-items within its granularity (execu- tion group) of invocation. Instead, the question of sys- tem call blocking refers to how the return from a system 5 Implementing GENESYS call relates to the completion of its processing. There- fore, they can be combined in several useful ways. For We implemented GENESYS on a CPU-GPU system with example, consider a case where a GPU program writes the architectural details presented in Table 1. Our target to a file at work-group granularity. The execution may system uses an AMD FX-9800P heterogeneous proces- not depend upon the output of the write but the program- sor with an integrated GPU, and runs the open-source mer may want to ensure that write completes success- ROCm software stack [3]. We modified the GPU driver fully. In such a scenario, blocking writes may be invoked and code to enable GPU system calls. We with weak ordering. Weak ordering permits all but one also modified the HCC compiler in order to permit GPU wavefront in the work-group to proceed without waiting system call invocations in its C++AMP dialect. for completion of the write. This enables us to lever- By construction, GENESYS supports the invocation of age the GPU’s parallelism (see Section 5). Blocking in- any system call from GPU programs. To enable a more

5 level. When the programmer intends to invoke system Table 1: System Configuration calls at a coarser-granularity (e.g., the work-group or ker- SoC Mobile AMD FX-9800PTM nel), a single work-item is designated as the caller on CPU 4× 2.7GHz behalf of the entire work-group or kernel. System calls AMD Family 21h Model 101h Integrated-GPU 758 MHz AMD GCN 3 GPU invoked at the work-item granularity follow CPU-centric Memory 16 GB Dual-Channel DDR4-1066MHz strong ordering implicitly. However, work-group invo- Operating system Fedora 25 using cation granularities can be either relaxed or strongly or- ROCm stack 1.2 dered. With strong ordering, the GPU system call uses a (based on Linux 4.4) work-group scope synchronization before and after sys- Compiler HCC-0.10.16433 tem call invocation. When relaxed ordering is employed, C++AMP with HC extensions a synchronization is placed either before (for system calls in producer role) or after (for system calls in consumer role) invoking the system call. Strong ordering at kernel invocations are unsupported as they may deadlock GPU hardware.

GPU to CPU communication: After the GPU invokes a system call, GENESYS facilitates efficient GPU to CPU communication of the system call request. As Figure 3: Content of each slot in syscall area. described in Section 3, GENESYS uses a pre-allocated shared memory location called a syscall area to allow the GPU to convey system call parameters to the CPU. The syscall area maintains one slot for each active GPU work-item. The OS and driver code can identify the de- sired slot by using the hardware ID of the active work- item, a piece of information available to modern GPU runtimes. Note that the hardware ID is distinct from the programmer-visible work-item ID. Each of potentially millions of work-items has a unique work-item ID that is visible to the application programmer. At any one point in time, however, only a subset of these work-items ex- ecutes (as permitted by the GPU’s CU count, supported work-groups per CU, and SIMD width). The hardware Figure 4: State transition diagram for a slot in syscall ID distinguishes amongst these active work-items. Over- area. Orange shows GPU side state and actions while all, our system uses 64 bytes per slot, for a total of 1.25 blue shows that of the CPU. MBs of total syscall area size. tractable study, however, we focus on 11 system calls Figure 3 depicts the information that is contained fi that span memory allocation as well as filesystem and within each slot. The elds are the requested system call network I/O. number, the request’s state, system call arguments (as many as 6, in Linux), and padding to avoid false sharing. We designed GENESYS with several goals in mind. The field for arguments is also re-purposed for the return We desired a general framework capable of invoking any value of the system call. Figure 4 shows that when a GPU system call from the GPU program with the traditional program’s work- item invokes a system call, it atomically function-like semantics used for CPUs. We also desired updates the state of the corresponding slot from free to a simple implementation to enable easy adaptation to dif- populating. If the slot is not free, system call invoca- ferent hardware. Finally, prior work has shown that with- tion is delayed. Once the state is populated, the invoking out care, GPU execution can hamper the CPU’s access work-item populates the slot with system call informa- to shared resources like memory bandwidth and chip- or tion and changes the state to ready. The work-item also package-wide power and thermal budgets [27]. Hence, adds one bit of information about whether the invocation GENESYS also aims to limit interference with the CPU. is blocking or non-blocking. The work- item then inter- To achieve these goals, we implemented the system with rupts the CPU using a scalar GPU instruction (s sendmsg following high level parts: on AMD GPUs). For blocking invocation, the work-item Invoking GPU system calls: GENESYS permits pro- waits for the availability of the result by polling on the grammers to invoke GPU system calls at the work-item- state of the slot or could suspend itself if the hardware

6 supports. Note that, for a non-blocking request, the GPU wi kernel wg64 wg512 wg wg256 wg1024 does not wait for the result. 1 0.5 0.9 0.45 CPU-side system call processing: Once the GPU inter- 0.8 0.4 rupts the CPU, system call processing commences. The 0.7 0.35 interrupt handler creates a new kernel task and adds it to 0.6 0.3 Linux’s work-queue. This task is also passed the hard- ime (s) 0.5 0.25 0.4 0.2 ware ID of the requesting wavefront. At an expedient GPU T 0.3 0.15 future point in time an OS worker thread executes this 0.2 0.1 task. The task scans the syscall area of the given hard- 0.1 0.05 0 0 ware ID and, atomically switches any ready system call 32 128 512 1024 2048 32 128 512 10242048 requests to the processing state for blocking system calls. File Size (MB) File Size (MB) The task then carries out the system call work. Figure 5: Impact of system call invocation granularity. A subtle but critical challenge is that Linux’s tradi- tional system call routines implicitly assume that they system calls. We use microbenchmarks to quantify the are to be executed within the context of the original pro- trade-offs in the different approaches. cess invoking the system call. Consequently, they use context- specific variables (e.g., the current variable used Invocation granularity: Whether a system call should to identify the process context). This presents a problem be invoked at the work-item, work-group, or application for GPU system calls, which are instead serviced purely kernel level is often application-dependent and is best in the context of the OS’ worker thread. GENESYS over- left up to the programmer. The choice does have per- comes this problem in two ways – it either switches to formance implications. Figure 5(a) quantifies this, for a the context of the original CPU program that invoked the GPU microbenchmark that uses pread to read data from GPU kernel, or it provides additional context informa- files in tmpfs. The x-axis plots the file size, and y-axis tion in the code performing system call processing. The shows the time to read the file, with lower values being exact strategy is employed on a case-by-case basis. better. Within each cluster, we separate runtimes for dif- GENESYS implements coalescing by waiting for a ferent pread invocation granularities. predetermined amount of time before enqueueing the Figure 5(left) shows that work-item invocation granu- task to process a system call, on an interrupt. If multi- larities tend to perform worst. This is unsurprising, since ple requests are made to the CPU during this time win- this is the finest granularity of system call invocation and dow, they are coalesced with the first system call, such leads to a flood of individual system calls that overwhelm that they can be handled as a single unit of work by the the CPU. On the other end of the granularity spectrum, OS worker-thread. GENESYS exposes two parameters kernel-level invocation is problematic too as it generates to control the amount of coalescing through Linux’s a single system call and fails to leverage any potential interface – the length of time window, and the number of parallelism in processing of system call requests. This maximum system calls that can be coalesced together. problem is particularly severe at large file sizes (e.g., 2GB). In general, a good compromise is to use work- Communicating results from the CPU to the GPU: group invocation granularities instead. 
It does not choke fi Once the CPU worker-thread nishes processing the sys- the CPU cores with large bursts of system call requests fi tem call, results are put in the eld for arguments in the while still being able to exploit parallelism in servicing slot for blocking requests. Further, it also changes the system calls (Figure 5(left)). finished state of the slot to for blocking system calls. For In choosing how best to use the work-group invoca- free non-blocking invocations, the state is changed to . tion approach, an important design issue is the ques- The invoking GPU work-item is then re-scheduled (on tion of how many work-items should constitute a work- hardware that supports wavefront suspension) or auto- group. While Figure 5(left) assumes 64 work-items in matically restarts because it was polling on this state. a work-group, Figure 5(right) quantifies the performance Results presented in this paper are from hardware where impact of pread system calls as we vary work-group sizes polling within the GPU is necessary. The work-item can from 64 to 1024 work-items. In general, larger work- consume the result and continue execution. group sizes enable better performance, as there are fewer unique system calls necessary to handle the same amount of work. 6 Evaluation with Microbenchmarks Blocking and ordering strategies: The choice of Our evaluation begins with performance implications of whether to use blocking or non-blocking system calls, different ways in which programmers can invoke GPU with strong or weak ordering, also has a significant im-

7 weak-block strong-block weak-non-block strong-non-block voked in a non-blocking manner (albeit with strong or- 7.5 dering), performance is aided. This is expected since

7 non-blocking pwrites permit the GPU to end the invok- ing work-group, freeing GPU resources it was occupy- 6.5 ing. The CPU-side processing of pwrite then can over- 6 lap with the execution of another work-group permuting on a different block of data utilizing just-freed GPU re- 5.5 ime per iteration (ms) sources. Figure 6 shows that generally, latencies drop by

GPU T 5 30% compared to blocking calls at low iteration counts.

4.5 At higher iteration counts (beyond 16), these benefits di- 0 5 10 15 20 25 30 35 40 # compute iterations minish since the latency to perform repeated permuta- tions dominates any system call processing times. Figure 6: Performance implications of system call block- Next consider relaxing system call ordering but retain- ing and ordering semantics. ing blocking invocation (weak-block). In these cases, pact on performance. To quantify this, we designed a the work-group scope barrier post-system call is elimi- microbenchmark that uses the GPU to perform block per- nated. One out of every 16 wavefronts 2 in the work- mutation on an array. This block permutation is similar group waits for the blocking system call, while others to the permutation steps performed in the popular DES exit, freeing GPU resources. Once again, the GPU can encryption algorithm. The input data array is pre-loaded use these freed resources to run other work-items, hiding with random values and is divided into blocks of 8KB. CPU-side system call latency. Consequently, the perfor- We designed this microbenchmark so that work-groups mance trends follow those of strong-non-block, with mi- execute 1024 work-items each and permute on blocks in- nor differences in performance arising from differences dependently. At the end of the permutation, the results in the way the GPU schedules the work-groups for exe- are written to a file by invoking the pwrite system call cution. Finally, Figure 6 shows that GPUs are most ef- at work-group granularity. Importantly, it is possible to fectively able to hide system call latencies using relaxed overlap pwrite system calls for one block of data with and non-blocking approaches (weak-non-block). permutations on other blocks of data. To vary the amount Interrupt coalescing: Figure 7 shows the performance of computation per system call, we repeat the permuta- impact of coalescing multiple system calls. We again tion multiple times before writing the result. use a microbenchmark that invokes pread for our exper- Figure 6 quantifies the impact of using blocking versus iments. We read data from files of different sizes us- non-blocking system calls with strong and weak order- ing a constant number of work-items. More data is read ing. The trend lines separate results for the four combi- per pread system call from the larger files. The x-axis nations of strong- and weak-ordering with blocking and shows the varying amounts of data read, and quantifies non-blocking system calls. The x-axis plots the num- the latency per requested byte. We present two bars for ber of permutation iterations performed on each block by each point on the x-axis, illustrating the average time each work-group before writing the results. Meanwhile, needed to read one byte with the system call in the ab- the y-axis plots the average time needed to execute one sence of coalescing and when up to eight system calls iteration (lower is better). are coalesced. In general, coalescing is most beneficial Figure 6 shows that strongly ordered blocking invoca- when small amounts of data are read. Reading more data tions (strong-block) hurt performance. This is expected generally takes longer; the act of coalescing the ensuing as they require work-group scoped barriers to be placed longer-latency system calls induces greater serialization before and after pwrite invocations. The GPU’s hardware in their processing on the CPUs, degrading performance. resources for work-groups are stalled waiting for the pwrite to complete. 
Not only does the inability to overlap GPU parallelism hurt strongly ordered blocking perfor- 7 Case Studies mance, it also means that GPU performance is heavily fl in uenced by CPU-side performance vagaries like syn- In this section we present three workloads that execute chronization overheads, servicing other processes, etc. on the GPU and exercise different subsystems of the OS: This is particularly true at iteration counts where sys- storage, memory management, and network. In targeting tem call overheads dominate GPU-side computation – either better programmability or performance, or both, below 15 compute iterations on the x-axis. Even when we showcase the generality and utility of GENESYS. the application becomes GPU-compute bound, perfor- mance remains non-ideal. 2Each wavefront has 64 work-items. Thus a 1024 work-item wide Figure 6 shows that alternately, when pwrite is in- work-group has 16 wavefronts.

8 uncoalesced coalesced (8 interrupts) onds to complete. Figure 8 enables us to focus on this 5

4.5 sharp performance improvement. We plot traces for CPU

4 and GPU utilization and disk I/O throughput. Our results 3.5 show that the GPU implementation extracts much higher 3 throughput from the underlying storage device (up to

equest byte (ms) 2.5 170MB/s compared to the CPU’s 30MB/s). Offloading 2 search string processing to the GPU frees up the CPU to 1.5 ime per r process system calls effectively. The change in CPU uti- 1

GPU T 0.5 lization between the GPU workload and CPU workload

0 64 256 1024 2048 4096 reveals this trend. In addition, we made another interest-

Size per syscall(B) ing discovery – the GPU’s ability to launch more con- current I/O requests enabled the I/O scheduler to make Figure 7: Implications of system call coalescing. better scheduling decisions, enhancing I/O throughput. 7.1 Storage This is made possible because of the ability to directly invoke many concurrent I/O requests from the GPU. We first demonstrate the value of GPU system calls for storage I/O. Like past work [37], we wrote a wordcount 7.2 Memory workload which takes a list of words and list of files to search. Our workload searches for the specified words in We next showcase the benefits of being able to use mem- those files, and generates the word counts as output. To ory management system calls directly from the GPU. We showcase the utility of GPU system calls relative to CPU focus our attention on the miniAMR application [34], performance, we show results for two versions of this and use the madvise system call directly from the GPU to workload – a CPU-only workload and a GPU workload. better manage memory. The miniAMR application per- We parallelize the CPU version with OpenMP, assign- forms 3D stencil calculations using adaptive mesh refine- ing files to cores. Our GPU version directly invokes the ment. This application is ripe for memory management open, read, and close system calls made available with because it varies its memory needs depending on its in- GENESYS. We found, due to the nature of the GPU al- tended use. For example, when simulating regions ex- gorithm suitable for wordcount, that these system calls periencing turbulence, miniAMR needs to model with were best invoked at work-group granularity with block- higher resolution. Instead, if lower resolution is pos- ing and weak-ordering semantics. Finally, both CPU and sible without compromising simulation accuracy, mini- GPU workloads are configured to search for occurrences AMR reduces memory and computation usage. As such, of 64 strings in the Linux source. when miniAMR decides to reduce resolution, it is possi- ble to free memory. While relinquishing excess memory CPU GPU in this way is not possible in traditional GPU programs, 100 80 without explicitly dividing the offload into multiple ker- 60 40 nels interspersed with CPU code (see 1), our framework 20 does permit this with direct GPU madvise invocations. 0 CPU utilization (%) 0 5 10 15 20 25 30 35 We invoke madvise using work-group granularities, with Time (s) non-blocking and weak ordering. CPU GPU 180 160 140 madvise 120 4.5 100 4 80 3.5 60 3 oughput (MB/s) 40 2.5 20 2 0 1.5 RSS (GB)

IO Thr 0 5 10 15 20 25 30 35 1 0.5 Time (s) 0 0 20 40 60 80 100 120 140 160 180 Figure 8: Wordcount I/O and CPU utilization reading Time (s) data from an SSD. Note the high CPU usage necessary Figure 9: Memory footprint of miniAMR using madvise to provide high I/O throughput. to hint at unused memory. As one might expect, we found it much easier to write Figure 9 shows the results of our experiment. We exe- GPU code given access to direct system calls. But per- cute miniAMR with an input dataset of 4.3GB. Figure 9 haps even more crucially, we found that the GPU version depicts a time trace of memory usage for miniAMR us- with direct system call invocation capabilities executes in ing madvise. The step-like nature of the curve is due roughly 6 seconds, constituting nearly a 6× performance to the periodic nature in which the memory reclamation improvement over the CPU version, which takes 35 sec- process is executed. Note that the madvise additions to

9 CPU GPU 1000 800 600

Mbps 400 200 0 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 Packet Size (B) Figure 10: Bandwidth versus packet size for an echo server.

miniAMR help it use an RSS far below the peak usage of 4.3GB. As such, enabling such adaptivity to memory needs allows GPU programmers to relinquish memory, which is particularly valuable for workloads with mas- sive amounts of data that often exceed the machine limits of physical memory capacity. Figure 11: Our classification of Linux’s system calls for GPUs. Size of each circle is representative of relative 7.3 Network fraction of system calls in that category. Finally, we have also studied the programmability ben- efits of GPU system call support for networking. We recvfrom), and memory management (mmap, munmap, wrote an echo server application that can process pack- madvise). Some of these system calls, like read, write, ets on either the GPU or CPU. Both the GPU and CPU lseek, are stateful. Thus, their semantics may need to be portions of the application use sendto and recvfrom sys- extended for concurrent invocations for the same file de- tem calls. A load generator, running on the CPU, can scriptor. For example, the current value of the file pointer choose to direct UDP packets to sockets on the CPU or determines what value is read or written by the read or to those running on the GPU. The GPU portion invokes write system call. This can be arbitrary if invoked at these system calls at the granularity of a work-group and work-item or work-group granularity for the same file with blocking and weak ordering. Figure 10 depicts per- descriptor since many work-items and work-groups are formance of when packets are directed to the CPU and likely to execute concurrently. when directed to the GPU. The x-axis is the packet size, while the y-axis is network throughput. Note the per- 2 Useful but currently not implementable: There are formance of the CPU and GPU are similar. Both can several system calls (13% of the total) that seem useful maximize link bandwidth with large packets. Both have for GPU code, but are not easily implementable because similar trends. of Linux’s (POSIX’s) existing design. For example, con- sider sigsuspend or sigaction. There is no kernel repre- sentation of a GPU work-item to manage and dispatch a 8 Discussion signal to. Additionally, there is no lightweight method to alter the GPU program counter of a work-item from Enabling GPUs to invoke the full stack of generic system the CPU kernel. One approach is for signal masks to be calls opens up several interesting research questions. associated with the GPU context and for signals to be de- livered as additional work-items. These semantics do not 8.1 Linux system calls on GPUs match POSIX, but they seem useful. In the process of designing GENESYS, we considered all 3 Not useful for GPUs at this time: This small group of Linux’s roughly 300 system calls and assessed which (8% of the total) contains perhaps the most controversial ones to support. Figure 11 shows how we classify the set of system calls, as we debate even amongst ourselves system calls into several groups, explained below: whether or not they should ever be supported. Examples 1 Useful and implementable: This subset is useful include fork, clone, execv, etc. At this time, we do not and adaptable to GPUs today (left, lower corner in Fig- believe that it is worth the implementation effort to sup- ure 11). Examples include pread, pwrite, mmap, mun- port these system calls from the GPU. For example, fork map etc. This is also the largest subset of Linux’s system necessitates cloning a copy of the executing caller’s GPU calls, comprising nearly 79% of them. 
In GENESYS, we state. Technically, this can be done – it is how GPGPU implemented 11 such system calls for filesystems (read, code is context switched with the GUI for example – but write, pread, pwrite, open, close), networking (sendto, it seems unnecessary to work at this time.

10 8.2 Lack of OS visibility this into account. To help drive this innovation, we in- tend to release GENESYS as open source, and thereby In the course of implementing GPU system call support encourage wide-spread adoption of a uniform method for within the Linux kernel, we came across a common de- invoking the OS from the GPU. sign challenge. When a CPU thread invokes the OS, that thread has a representation within the kernel. The ker- nel maintains a data-structure for each thread and relies on it to accomplish several common tasks (e.g., kernel 9 Related Work resource use, permission checks, auditing, etc.). GPU tasks, however, have traditionally not been represented in the OS kernel. We believe this should not change. Prior work has looked at subsystem-specific OS ser- GPU threads are numerous and short lived. Creating and vices for GPUs. GPUfs [37] provides filesystem access, managing a kernel structure for each GPU thread would while GPUNet [13] provides a socket-like API for net- vastly slow down the system. Moreover, these structures work access. The latest generation of C++AMP [15] would only be useful for GPU threads that actually in- and CUDA [19] provide access to the memory allocator. voke the OS, and would be wasted overhead for the rest. These efforts employ a user-mode service thread execut- Our approach, given these constraints, is to process ing on the CPU to proxy requests from the GPU [26]. system calls that do not rely on the task representation As with GENESYS, system call requests are placed in in whatever thread context the CPU happened to be in at a shared queue by the GPU. From there, however, the the time, and switch CPU contexts if necessary (See Sec- designs are different. The user-mode thread polls this tion 5). As more GPUs have system call capabilities and shared queue and “thunks” the request to the libc run- more system calls are implemented, this is an area that time or OS. This incurs added overhead for entering and will require careful consideration by kernel developers. leaving the kernel on each request, instead of being natu- rally aggregated as with GENESYS. GPU workitems are 8.3 Asynchronous system call handling also forced to spin, as user-mode code cannot wakeup the GPU hardware workgroup. In our design, GPU system calls are enqueued as kernel Several efforts have focused on providing network work items and processed outside of the interrupt con- access to GPU code [38, 21, 18, 13]. NVidia pro- text. We do this because Linux is designed such that few vides GPUDirect [18], which is used by several MPI li- operations can be processed in an interrupt handler. A braries [42, 8, 11], allows the NIC to bypass main mem- fl aw with this design, however, is it defers the system call ory and communicate directly to memory on the GPU processing to potentially past the end of the life-time of itself. GPUDirect does not provide a programming inter- the GPU thread and potentially the process that created face for GPU side code. The CPU must initiate com- the GPU thread itself! It is an example of a more gen- munication with the NIC. In [21] the author exposed eral problem with asynchronous system calls [4]. Our the memory-mapped control interface of the NIC to the solution is to provide a new function call, invoked by the GPU and thereby allowed the GPU to directly commu- CPU, that ensures all GPU system calls have completed. nicate with the NIC. 
This low-level interface, however, lacks the benefits a traditional operating system interface 8.4 Modifying GPU runtimes brings (e.g. protection, sockets, TCP, etc.). In developing GENESYS we pushed the GPU HW/SW stack into operating regions it was not designed for, namely with work-groups that executed for seconds to minutes instead of microseconds. We’ve focused our de- 10 Conclusions sign efforts on AMD hardware and software, but tests on other platforms reveal similar design assumptions. One We present the first study that considers generic system peculiar effect we discovered is on our test platform is if calls on GPUs, showing their programmability and per- a workgroup is long-lived, power-management kicks in formance benefits. Not only does our system call inter- and throttles up the GPU and down the CPU. While such face support any system call, we also study various as- a design choice is entirely sensible for graphics work- pects of GPU system call invocation, such as ordering loads, for GPGPU programs it may not always be. Sys- and blocking issues, invocation granularities, and CPU- tem call support on GPUs enables far richer application side coalescing. Although more work needs to be done to programs, where it may be entirely sensible to code GPU enable truly heterogeneous CPU-GPU applications, our kernels where workgroups execute for the lifetime of the work is an important first step in making the GPU a peer program. Future GPU HW/SW systems will need to take to CPUs today.

11 11 Acknowledgment [8] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. AMDR , the AMD Arrow logo, and combinations Squyres, Vishal Sahay, Prabhanjan Kambadur, thereof are trademarks of Advanced Micro Devices, Inc. Brian Barrett, Andrew Lumsdaine, Ralph H. Cas- The format PCIeR is a registered trademark of PCI-SIG tain, David J. Daniel, Richard L. Graham, and Tim- Corporation. The format ARMR is a registered trade- othy S. Woodall. Open MPI: Goals, concept, and mark of ARM Limited. OpenCLTM is a trademark of design of a next generation MPI implementation. Apple Inc. used by permission by Khronos. Other prod- In Proceedings, 11th European PVM/MPI Users’ uct names used in this publication are for identification Group Meeting, pages 97–104, Budapest, Hungary, purposes only and may be trademarks of their respective September 2004. companies c 2017 Advanced Micro Devices, Inc. All rights reserved. [9] Khronos Group. Opencl, 2014. https://www. We thank the National Science Foundation, which par- khronos.org/opengl/. tially supported this work through grants 1253700 and [10] Khronos Group. Opengl, 2014. https://www. 1337147. Finally, we thank Guilherme Cox for his feed- khronos.org/opengl/. back. [11] IBM. Ibm spectrum mpi, 2017. http: References //www-03.ibm.com/systems/spectrum- computing/products/mpi/.

[1] Neha Agarwal, David Nellans, Eiman Ebrahimi, [12] Intel. Intel Open Source HD Graphics Program- Thomas Wenisch, John Danskin, and Stephen mers Reference Manual. 2014. Keckler. Selective GPU Caches to Eliminate CPU- GPU HW Cache Coherence. In International Sym- [13] Sangman Kim, Seonggu Huh, Xinya Zhang, Yige posium on High Performance Computer Architec- Hu, Amir Wated, Emmett Witchel, and Mark Sil- ture (HPCA), 2016. berstein. GPUnet: Networking Abstractions for GPU Programs. In USENIX Symposium on Oper- [2] AMD. Graphics core next 3.0 isa, 2013. ating Systems Design and Implementation (OSDI), http://amd-dev.wpengine.netdna- 2014. cdn.com/wordpress/media/2013/07/AMD_ GCN3_Instruction_Set_Architecture.pdf. [14] George Kyriazis. Heterogeneous System Architec- ture: A Technical Review. In AMD Whitepaper, [3] AMD. ROCm: a New Era 2012. in Open GPU Computing. In https://radeonopencompute.github.io/index.html, [15] Microsoft. C++ AMP (C++ Accelerated Massive 2016. Parallelism). In https://msdn.microsoft.com/en- us/library/hh265137.aspx, 2016. [4] Zach Brown. Asynchronous System Calls. In Ot- tawa Linux Symposium, 2007. [16] Microsoft. Programming Guide for Di- rectX. In https://msdn.microsoft.com/en- [5] Guilherme Cox and Abhishek Bhattacharjee. Effi- us/library/windows/desktop/ff476455(v=vs.85).aspx, cient Address Translation with Multiple Page Sizes. 2016. In International Conference on Architectural Sup- [17] Sparsh Mittal and Jeffrey Vetter. A Survey of CPU- port for Programming Languages and Operating GPU Heterogeneous Computing Techniques. In Systems (ASPLOS), 2017. ACM Computing Surveys, 2015.

[6] Richard Draves, Brian Bershad, Richard Rashid, [18] NVidia. Nvidia gpudirect, 2013. https:// and Randall Dean. Using Continuations to Imple- developer.nvidia.com/gpudirect. ment Thread Management and Communication in Operating Systems. In Symposium on Operating [19] NVIDIA. CUDA Toolkit Documentation. In Systems Principles (SOSP), 1991. http://docs.nvidia.com/cuda, 2016.

[7] HSA Foundation. Hsa programmer’s reference [20] Juraj Obert, JMP Van Waveran, and Graham Sell- manual, 2015. http://www.hsafoundation. ers. Virtual Texturing in Software and Hardware. com/?ddownload=4945. In SIGGRAPH Courses, 2012.

[21] Lena Oden. Direct Communication Methods for Distributed GPUs. Dissertation Thesis, Ruprecht-Karls-Universität Heidelberg, 2015.

[22] Lena Olson, Jason Power, Mark Hill, and David Wood. Border Control: Sandboxing Accelerators. In International Symposium on Microarchitecture (MICRO), 2015.

[23] John Owens, Ujval Kapasi, Peter Mattson, Brian Towles, Ben Serebrin, Scott Rixner, and William Dally. Media Processing Applications on the Imagine Stream Processor. In International Conference on Computer Design (ICCD), 2002.

[24] John Owens, Brucek Khailany, Brian Towles, and William Dally. Comparing Reyes and OpenGL on a Stream Architecture. In SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2002.

[25] John Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Kruger, Aaron Lefohn, and Tim Purcell. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, Vol. 26, Num. 1, pp. 80-113, 2007.

[26] John Owens, Jeff Stuart, and Michael Cox. GPU-to-CPU Callbacks. In Workshop on Unconventional High Performance Computing, 2010.

[27] Indrani Paul, Vignesh Ravi, Srilatha Manne, Manish Arora, and Sudhakar Yalamanchili. Coordinated Energy Management in Heterogeneous Processors. In International Conference on High Performance Computing, Networking, Storage and Analysis, 2013.

[28] PCI-SIG. Atomic Operations, 2008. https://pcisig.com/sites/default/files/specification_documents/ECN_Atomic_Ops_080417.pdf.

[29] Bharath Pichai, Lisa Hsu, and Abhishek Bhattacharjee. Architectural Support for Address Translation on GPUs. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.

[30] Jason Power, Arkaprava Basu, Junli Gu, Sooraj Puthoor, Bradford Beckmann, Mark Hill, Steven Reinhardt, and David Wood. Heterogeneous System Coherence for Integrated CPU-GPU Systems. In International Symposium on Microarchitecture (MICRO), 2013.

[31] Jason Power, Mark Hill, and David Wood. Supporting x86-64 Address Translation for 100s of GPU Lanes. In International Symposium on High Performance Computer Architecture (HPCA), 2014.

[32] Scott Rixner, William Dally, Ujval Kapasi, Brucek Khailany, Abelardo Lopez-Lagunas, Peter Mattson, and John Owens. A Bandwidth-Efficient Architecture for Media Processing. In International Symposium on Microarchitecture (MICRO), 1998.

[33] Sagi Sahar, Shai Bergman, and Mark Silberstein. Active Pointers: A Case for Software Address Translation on GPUs. In International Symposium on Computer Architecture (ISCA), 2016.

[34] Aparna Sasidharan and Marc Snir. MiniAMR: A Miniapp for Adaptive Mesh Refinement, 2014. http://hdl.handle.net/2142/91046.

[35] Ben Serebrin, John Owens, Brucek Khailany, Peter Mattson, Ujval Kapasi, Chen Chen, Jinyung Namkoong, Stephen Crago, Scott Rixner, and William Dally. A Stream Processor Development Platform. In International Conference on Computer Design (ICCD), 2002.

[36] Olin Shivers and Matthew Might. Continuations and Transducer Composition. In International Symposium on Programming Languages Design and Implementation (PLDI), 2006.

[37] Mark Silberstein, Bryan Ford, Idit Keidar, and Emmett Witchel. GPUfs: Integrating File Systems with GPUs. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013.

[38] Jeff A. Stuart, Pavan Balaji, and John D. Owens. Extending MPI to Accelerators. In PACT 2011 Workshop Series: Architectures and Systems for Big Data, October 2011.

[39] David Tarditi, Sidd Puri, and Jose Oglesby. Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.

[40] Chris J. Thompson, Sahngyun Hahn, and Mark Oskin. Using Modern Graphics Architectures for General-Purpose Computing: A Framework and Analysis. In International Symposium on Microarchitecture (MICRO), 2002.

[41] Jan Vesely, Arkaprava Basu, Mark Oskin, Gabriel Loh, and Abhishek Bhattacharjee. Observations and Opportunities in Architecting Shared Virtual Memory for Heterogeneous Systems. In International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016.

[42] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda. GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation. IEEE Transactions on Parallel and Distributed Systems, 25(10):2595–2605, October 2014.

Appendix

A  Classification of Linux System Calls

System call | Viable
sys_accept | yes
sys_accept4 | yes
sys_access | yes
sys_acct | yes
sys_add_key | yes
sys_adjtimex | yes
sys_alarm | yes, limited use∗
sys_arch_prctl | yes
sys_bind | yes
sys_bpf | yes
sys_brk | yes, limited use∗∗
sys_capget | no, targets threads∗∗∗
sys_capset | no, targets threads∗∗∗
sys_chdir | yes
sys_chmod | yes
sys_chown | yes
sys_chroot | yes
sys_clock_adjtime | yes
sys_clock_getres | yes
sys_clock_gettime | yes
sys_clock_nanosleep | yes∗∗∗∗
sys_clock_settime | yes
sys_clone/ptregs | yes
sys_close | yes
sys_connect | yes
sys_copy_file_range | yes
sys_creat | yes
sys_delete_module | yes
sys_dup | yes
sys_dup2 | yes
sys_dup3 | yes
sys_epoll_create | yes
sys_epoll_create1 | yes
sys_epoll_ctl | yes
sys_epoll_pwait | yes∗
sys_epoll_wait | yes
sys_eventfd | yes
sys_eventfd2 | yes
sys_execveat/ptregs | yes, limited use∗∗
sys_execve/ptregs | yes, limited use∗∗
sys_exit | yes∗∗∗∗∗
sys_exit_group | yes
sys_faccessat | yes
sys_fadvise64 | yes
sys_fallocate | yes
sys_fanotify_init | yes
sys_fanotify_mark | yes
sys_fchdir | yes
sys_fchmod | yes
sys_fchmodat | yes
sys_fchown | yes
sys_fchownat | yes
sys_fcntl | yes
sys_fdatasync | yes
sys_fgetxattr | yes
sys_finit_module | yes
sys_flistxattr | yes
sys_flock | yes, exclusive is limited∗∗
sys_fork/ptregs | no
sys_fremovexattr | yes
sys_fsetxattr | yes
sys_fstatfs | yes
sys_fsync | yes
sys_ftruncate | yes
sys_futex | yes∗∗∗∗
sys_futimesat | yes
sys_getcpu | yes∗∗∗∗∗
sys_getcwd | yes
sys_getdents | yes
sys_getdents64 | yes
sys_getegid | yes
sys_geteuid | yes
sys_getgid | yes
sys_getgroups | yes
sys_getitimer | yes
sys_get_mempolicy | yes, address mode only
sys_getpeername | yes
sys_getpgid | yes
sys_getpgrp | yes
sys_getpid | yes
sys_getppid | yes
sys_getpriority | yes∗∗∗∗∗
sys_getrandom | yes
sys_getresgid | yes
sys_getresuid | yes
sys_getrlimit | yes
sys_get_robust_list | no
sys_getrusage | yes, process level only
sys_getsid | yes
sys_getsockname | yes
sys_getsockopt | yes
sys_gettid | yes∗∗∗∗∗
sys_gettimeofday | yes
sys_getuid | yes
sys_getxattr | yes
sys_init_module | yes
sys_inotify_add_watch | yes
sys_inotify_init | yes
sys_inotify_init1 | yes
sys_inotify_rm_watch | yes
sys_io_cancel | yes
sys_ioctl | depends
sys_io_destroy | yes
sys_io_getevents | yes
sys_ioperm | no∗∗∗
sys_iopl/ptregs | yes
sys_ioprio_get | yes, CPU threads only
sys_ioprio_set | yes, CPU threads only
sys_io_setup | yes
sys_io_submit | yes
sys_kcmp | yes
sys_kexec_file_load | yes
sys_kexec_load | yes
sys_keyctl | yes
sys_kill | yes∗
sys_lchown | yes
sys_lgetxattr | yes
sys_link | yes
sys_linkat | yes
sys_listen | yes
sys_listxattr | yes
sys_llistxattr | yes
sys_lookup_dcookie | yes
sys_lremovexattr | yes
sys_lseek | yes
sys_lsetxattr | yes
sys_madvise | yes
sys_mbind | yes
sys_membarrier | no
sys_memfd_create | yes
sys_migrate_pages | yes
sys_mincore | yes
sys_mkdir | yes
sys_mkdirat | yes
sys_mknod | yes
sys_mknodat | yes
sys_mlock | yes
sys_mlock2 | yes
sys_mlockall | yes
sys_mmap | yes
sys_modify_ldt | yes
sys_mount | yes
sys_move_pages | yes
sys_mprotect | yes
sys_mq_getsetattr | yes
sys_mq_notify | yes∗
sys_mq_open | yes
sys_mq_timedreceive | yes
sys_mq_timedsend | yes
sys_mq_unlink | yes
sys_mremap | yes
sys_msgctl | yes
sys_msgget | yes
sys_msgrcv | yes
sys_msgsnd | yes
sys_msync | yes
sys_munlock | yes
sys_munlockall | yes
sys_munmap | yes
sys_name_to_handle_at | yes
sys_nanosleep | yes∗∗∗∗
sys_newfstat | yes
sys_newfstatat | yes
sys_newlstat | yes
sys_newstat | yes
sys_newuname | yes
sys_open | yes
sys_openat | yes
sys_open_by_handle_at | yes
sys_pause | no
sys_perf_event_open | yes, CPU perf events only
sys_personality | yes
sys_pipe | yes
sys_pipe2 | yes
sys_pivot_root | yes, limited use∗∗
sys_pkey_alloc | yes
sys_pkey_free | yes
sys_pkey_get | yes
sys_pkey_mprotect | yes
sys_pkey_set | yes
sys_poll | yes
sys_ppoll | yes∗
sys_prctl | yes
sys_pread64 | yes
sys_preadv | yes
sys_preadv2 | yes
sys_preadv64 | yes
sys_preadv64v2 | yes
sys_prlimit64 | yes
sys_process_vm_readv | yes
sys_process_vm_writev | yes
sys_pselect6 | yes∗
sys_ptrace | yes∗∗
sys_pwrite64 | yes
sys_pwritev | yes
sys_pwritev2 | yes
sys_pwritev64 | yes
sys_pwritev64v2 | yes
sys_quotactl | yes∗∗
sys_read | yes
sys_readahead | yes
sys_readlink | yes
sys_readlinkat | yes
sys_readv | yes
sys_reboot | yes∗∗
sys_recvfrom | yes
sys_recvmmsg | yes
sys_recvmsg | yes
sys_remap_file_pages | yes
sys_removexattr | yes
sys_rename | yes
sys_renameat | yes
sys_renameat2 | yes
sys_request_key | yes
sys_restart_syscall | yes, no use∗
sys_rmdir | yes
sys_rt_sigaction | yes∗
sys_rt_sigpending | yes∗
sys_rt_sigprocmask | yes∗
sys_rt_sigqueueinfo | yes∗
sys_rt_sigreturn/ptregs | yes, no use∗
sys_rt_sigsuspend | yes, no use∗
sys_rt_sigtimedwait | yes, no use∗
sys_rt_tgsigqueueinfo | yes, no use∗
sys_sched_getaffinity | yes, CPU threads only
sys_sched_getattr | yes, CPU threads only
sys_sched_getparam | yes, CPU threads only
sys_sched_get_priority_max | yes∗∗∗∗∗
sys_sched_get_priority_min | yes∗∗∗∗∗
sys_sched_getscheduler | yes, CPU threads only
sys_sched_rr_get_interval | yes, CPU threads only
sys_sched_setaffinity | yes, CPU threads only
sys_sched_setattr | yes, CPU threads only
sys_sched_setparam | yes, CPU threads only
sys_sched_setscheduler | yes, CPU threads only
sys_sched_yield | no
sys_seccomp | no
sys_select | yes
sys_semctl | yes
sys_semget | yes
sys_semop | yes
sys_semtimedop | yes
sys_sendfile64 | yes
sys_sendmmsg | yes
sys_sendmsg | yes
sys_sendto | yes
sys_setdomainname | yes∗∗
sys_setfsgid | yes
sys_setfsuid | yes
sys_setgid | yes
sys_setgroups | yes
sys_sethostname | yes∗∗
sys_setitimer | yes∗
sys_set_mempolicy | no
sys_setns | no
sys_setpgid | yes
sys_setpriority | yes∗∗∗∗∗
sys_setregid | yes
sys_setresgid | yes
sys_setresuid | yes
sys_setreuid | yes
sys_setrlimit | yes
sys_set_robust_list | no
sys_setsid | yes
sys_setsockopt | yes
sys_set_tid_address | no
sys_settimeofday | yes
sys_setuid | yes
sys_setxattr | yes
sys_shmat | yes
sys_shmctl | yes
sys_shmdt | yes
sys_shmget | yes
sys_shutdown | yes∗∗
sys_sigaltstack | no
sys_signalfd | yes
sys_signalfd4 | yes
sys_socket | yes
sys_socketpair | yes
sys_splice | yes
sys_statfs | yes
sys_swapoff | yes∗∗
sys_swapon | yes∗∗
sys_symlink | yes
sys_symlinkat | yes
sys_sync | yes∗∗
sys_sync_file_range | yes
sys_syncfs | yes∗∗
sys_sysctl | yes∗∗
sys_sysfs | yes∗∗
sys_sysinfo | yes
sys_syslog | yes∗∗
sys_tee | yes
sys_tgkill | yes∗
sys_time | yes
sys_timer_create | yes∗
sys_timer_delete | yes
sys_timer_getoverrun | yes
sys_timer_gettime | yes
sys_timer_settime | yes
sys_timerfd_create | yes
sys_timerfd_gettime | yes
sys_timerfd_settime | yes
sys_times | yes, CPU times only
sys_tkill | yes∗
sys_truncate | yes
sys_umask | yes
sys_umount | yes∗∗
sys_unlink | yes
sys_unlinkat | yes
sys_unshare | yes
sys_userfaultfd | yes
sys_ustat | yes
sys_utime | yes
sys_utimensat | yes
sys_utimes | yes
sys_vfork/ptregs | no
sys_vhangup | yes
sys_vmsplice | yes
sys_wait4 | yes
sys_waitid | yes
sys_write | yes
sys_writev | yes

Table 4: x86_64 system call table

∗ signals can be delivered only to CPU threads
∗∗ mostly serializing use; little benefit for GPU workloads
∗∗∗ targets threads; there is currently no OS kernel structure representing thread-level context for a GPU task
∗∗∗∗ even though GPU threads are not represented, postponing the return from the system call has the desired effect
∗∗∗∗∗ implementable without a system call, with GPU-modified semantics (local priority, CU instead of CPU core, ...)
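Footnote ∗∗∗∗∗ observes that some of these calls need not trap into the OS at all if their semantics are adapted to the GPU, for example reporting the compute unit rather than the CPU core. As a loose illustration only, and not the GENESYS mechanism evaluated in this paper, the CUDA-flavored sketch below answers a getcpu-style query entirely on the device by reading the SM identifier from a special register; the helper name gpu_getcpu is hypothetical.

    // Illustrative sketch only: a getcpu-like query served on the GPU itself,
    // per footnote ***** (CU/SM id instead of CPU core id). Not part of GENESYS.
    #include <cstdio>

    __device__ unsigned int gpu_getcpu() {            // hypothetical helper name
        unsigned int smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));      // read the SM id special register
        return smid;
    }

    __global__ void where_am_i(unsigned int *out) {
        if (threadIdx.x == 0)                         // one answer per work-group
            out[blockIdx.x] = gpu_getcpu();
    }

    int main() {
        const int blocks = 4;
        unsigned int *out = nullptr;
        cudaMallocManaged(&out, blocks * sizeof(unsigned int));
        where_am_i<<<blocks, 64>>>(out);
        cudaDeviceSynchronize();
        for (int i = 0; i < blocks; ++i)
            printf("work-group %d ran on SM %u\n", i, out[i]);
        cudaFree(out);
        return 0;
    }

The same reasoning applies to the other entries marked ∗∗∗∗∗, such as sys_gettid or the priority-related calls, where a purely local answer is cheaper and more meaningful than a trap to the CPU-side kernel.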
