UNIVERSITY OF CALIFORNIA RIVERSIDE

Spatio-Temporal GPU Management for Real-Time Cyber-Physical Systems

A Thesis submitted in partial satisfaction of the requirements for the degree of

Master of Science

in

Electrical Engineering

by

Sujan Kumar Saha

March 2018

Thesis Committee:

Dr. Hyoseung Kim, Chairperson
Dr. Nael Abu-Ghazaleh
Dr. Daniel Wong

Copyright by
Sujan Kumar Saha
2018

The Thesis of Sujan Kumar Saha is approved:

Committee Chairperson

University of California, Riverside

Acknowledgments

First, I would like to thank my supervising professor, Dr. Hyoseung Kim. His unconditional assistance, guidance, and support helped me finally accomplish this work. I also wish to express my deepest appreciation to Dr. Nael Abu-Ghazaleh and Dr. Daniel Wong, who serve as my committee members, for their valuable advice and generous help. I offer my special thanks to my lab-mates, Ankit Juneja, Ankith Rakesh Kumar, and Yecheng Xiang, for helping me in different aspects; without their generous help, this work would not have been successful. Finally, it is an honor for me to thank my family, especially my mom, Mrs. Anjana Rani Saha, my sister, Poly Saha, and my brother, Abhijeet Saha. All of their love and support encouraged me to overcome all the challenges that I have faced.

From the bottom of my heart, thank you all.

To my parents for all the support.

ABSTRACT OF THE THESIS

Spatio-Temporal GPU Management for Real-Time Cyber-Physical Systems

by

Sujan Kumar Saha

Master of Science, Graduate Program in Electrical Engineering
University of California, Riverside, March 2018
Dr. Hyoseung Kim, Chairperson

General-purpose Graphics Processing Units (GPUs) have been considered as a promising technology to address the high computational demands of real-time data-intensive applications. Many of today's embedded processors already provide on-chip GPUs, the use of which can greatly help satisfy the timing challenges of data-intensive tasks by accelerating their executions. However, the current state-of-the-art GPU management in real-time systems still lacks properties required for efficient and certifiable real-time GPU computing. For example, existing real-time systems sequentially execute GPU workloads to guarantee predictable GPU access time, which significantly underutilizes the GPU and exacerbates temporal dependency among the workloads.

In this research, we propose a spatio-temporal GPU management framework for real-time cyber-physical systems. Our proposed framework explicitly manages the allocation of the GPU's internal execution engines. This approach allows multiple GPU-using tasks to simultaneously execute on the GPU, thereby improving GPU utilization and reducing response time. It can also improve temporal isolation by allocating a portion of the GPU execution engines to tasks for their exclusive use. We have implemented a prototype of the proposed framework for a CUDA environment. The case study using this implementation on two GPUs, GeForce 970 and Jetson TX2, shows that our framework reduces the response time of GPU execution segments in a predictable manner, by executing them in parallel. Experimental results with randomly-generated tasksets indicate that our framework yields a significant benefit in schedulability compared to the existing approach.

Contents

List of Figures
List of Tables

1 Introduction

2 Background and Related Work
  2.1 GPU organization and Kernel Execution
  2.2 Related Work
  2.3 Motivation
  2.4 System Model

3 Spatio-Temporal GPU Reservation Framework
  3.1 Reservation Design
  3.2 Admission Control
    3.2.1 Self-suspension Mode
    3.2.2 Busy-waiting Mode
  3.3 Resource Allocator
  3.4 Reservation based Program Transformation

4 Evaluation
  4.1 Implementation
  4.2 Overhead Estimation
  4.3 Case Study
  4.4 Schedulability Results

5 Conclusions

Bibliography

List of Figures

2.1 Overview of GPU Architecture
2.2 Multi-kernel Execution
2.3 Execution time vs Number of SM on GTX970
2.4 Execution time vs Number of SM on TX2
3.1 Example schedule of GPU-using tasks showing the blocking times in self-suspending mode
3.2 Normalized Execution Time vs Different Par Value on GTX970
3.3 Normalized Execution Time vs Different Par Value on TX2
4.1 Percentage overhead of selected benchmarks on GTX970
4.2 Percentage overhead of selected benchmarks on TX2
4.3 Comparison of Kernel Execution on GTX970
4.4 Comparison of Kernel Execution on TX2
4.5 Schedulability w.r.t Number of Tasks in a taskset
4.6 Schedulability w.r.t Number of SM
4.7 Schedulability w.r.t Number of GPU Segments
4.8 Schedulability w.r.t Ratio of C to G
4.9 Schedulability w.r.t Ratio of Number of GPU tasks to Number of CPU tasks

List of Tables

4.1 Parameters for taskset generation

Chapter 1

Introduction

Massive data streams generated by recent embedded and cyber-physical applications pose substantial challenges in satisfying real-time processing requirements. For example, in self-driving cars, data streams from tens of sensors, such as cameras and laser range finders (LIDARs), should be analyzed in a timely manner so that the results of processing can be delivered to path/behavior planning algorithms with short and bounded delay. This requirement of real-time processing is particularly important for safety-critical domains such as automotive, unmanned vehicles, avionics, and industrial automation, where any transient violation of timing constraints may lead to system failures and catastrophic losses.

General-purpose graphics processing units (GPUs) have been considered as a promising technology to address the high computational demands of real-time data streams. Many of today's embedded processors, such as the NVIDIA TX1/TX2 and NXP i.MX series, already have on-chip GPUs, the use of which can greatly help satisfy the timing challenges of data-intensive tasks by accelerating their executions. The stringent size, weight, power and cost constraints of embedded and cyber-physical systems are also expected to be substantially mitigated by GPUs.

For the safe use of GPUs, much research has been done in the real-time systems community to schedule GPU-using tasks with timing constraints [6, 8, 7, 10, 11, 15, 22]. However, the current state-of-the-art has the following limitations in terms of efficiency and predictability. First, existing real-time GPU management schemes significantly underutilize GPUs in providing predictable GPU access time. They limit a GPU to be accessed by only one task at a time, which can cause unnecessarily long waiting delays when multiple tasks need to access the GPU. This problem becomes worse in an embedded computing environment where each machine typically has only a limited number of GPUs, e.g., one on-chip GPU on the latest NVIDIA TX2 processor. Second, system support for strong temporal isolation among GPU workloads is not yet provided. In a mixed-criticality system, low-criticality tasks and high-criticality tasks may share the same GPU. If low-criticality tasks use the GPU for a longer time than expected, the timing of high-criticality tasks can be easily jeopardized. Also, if both types of tasks are concurrently executed on the GPU, it is unpredictable how much temporal interference may occur.

In this research, we propose a spatio-temporal GPU reservation framework to address the aforementioned limitations. The key contribution of this work is in the explicit management of the GPU's internal execution engines, e.g., streaming multiprocessors on NVIDIA GPUs and core groups on ARM Mali GPUs. With this approach, a single GPU is divided into multiple logical units, and a fraction of the GPU can be exclusively reserved for each (or a group of) time-critical task(s). This approach allows simultaneous execution of multiple tasks on a single GPU, which can potentially eliminate the waiting time for GPU execution and achieve strong temporal isolation among tasks. Since recent GPUs have multiple execution engines and many GPU applications are not programmed to fully utilize them, our proposed framework will be a viable solution to efficiently and safely share the GPU among tasks with different criticalities. In addition, our framework substantially improves task schedulability by a fine-grained allocation of GPU resources at the execution-engine level.

As a proof of concept, we have implemented our framework in a CUDA programming environment. The case study using this implementation on two NVIDIA GPUs, GeForce 970 and Jetson TX2, shows that our framework reduces the response time of GPU execution segments in a predictable manner. Experimental results with randomly-generated tasksets indicate that our framework yields a significant benefit in schedulability compared to the existing approach.

Our GPU framework does not require any specific hardware support or detailed internal scheduling information. Thus, it is readily applicable to COTS GPUs from various vendors, e.g., AMD, ARM, NVIDIA and NXP.

The rest of the thesis is organized as follows. Chapter 2 describes background knowledge about GPU architecture, the motivation for this work, and related prior work. Our proposed GPU reservation framework is explained in detail in Chapter 3. Chapter 4 presents the evaluation methodology and result analysis. Finally, we conclude in Chapter 5.

Chapter 2

Background and Related Work

2.1 GPU organization and Kernel Execution

GPUs are used as accelerators alongside CPUs in modern computing systems. Their highly parallel structure makes them more efficient than general-purpose CPUs for data-intensive applications. Figure 2.1 shows a high-level overview of the internal structure of a GPU.

A single GPU consists of multiple Execution Engines (EEs)¹. One EE, or SM, has multiple CUDA cores; the total number of cores in a GPU is the number of SMs multiplied by the number of cores per SM. Each SM contains a register file, an L1 cache, and shared memory for faster data access, and all the cores in an SM share these memory components. Two other memory components are the L2 cache and the main memory; the L2 cache is shared among all SMs. There are also one or more Copy Engines (CEs) in a GPU, which copy data from CPU memory to GPU memory and vice versa. Data is processed by the GPU cores residing in the EEs. GPUs also have several other components, such as instruction buffers, warp schedulers, dispatch units, and texture units; these are not shown in the figure, as knowledge of them is not required for this research.

¹ In NVIDIA GPUs, an Execution Engine is called a Streaming Multiprocessor (SM). We will use NVIDIA terms in the rest of the thesis, but the proposed approach is applicable to other architectures too.

Figure 2.1: Overview of GPU Architecture

NVIDIA provides the necessary APIs for parallel programming through CUDA. The general structure of CUDA code has five sections: (1) memory allocation in the GPU memory, (2) data copy from CPU memory to GPU memory, (3) kernel execution on the GPU, (4) copying the result data back from GPU memory to CPU memory, and (5) freeing the GPU memory. While launching a kernel, the thread block and grid dimensions are provided. The data stream that needs to be processed on a GPU is divided into multiple logical thread blocks. The grid consists of all the thread blocks, and each thread block consists of multiple threads. In general, each block is processed by a single SM and each core processes one thread.
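To make this structure concrete, the following minimal CUDA sketch walks through the five sections with a simple vector-add kernel; the kernel, sizes, and names are illustrative and are not taken from the benchmarks used later in the thesis.

#include <cuda_runtime.h>

// Illustrative kernel: one thread processes one element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void hostSide(const float* hA, const float* hB, float* hC, int n) {
    float *dA, *dB, *dC;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);  // (1) allocate GPU memory
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);                       // (2) copy CPU -> GPU
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    dim3 block(256);                              // thread-block dimension
    dim3 grid((n + block.x - 1) / block.x);       // grid dimension: enough blocks to cover n
    vecAdd<<<grid, block>>>(dA, dB, dC, n);       // (3) kernel execution on the GPU
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);                       // (4) copy GPU -> CPU
    cudaFree(dA); cudaFree(dB); cudaFree(dC);                                // (5) free GPU memory
}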

The GPU device driver schedules the blocks to SMs, but the scheduling policy is not disclosed by the manufacturer. Hence, it is unknown which block will be processed by which SM. This has two disadvantages. First, depending on the number of thread blocks and the thread block size, not all SMs may be utilized, or an SM may not be fully utilized. Second, on each SM, more than one thread block can be processed concurrently if the total size of the thread blocks is less than or equal to the capacity of the SM. In that case, the execution time of a kernel cannot be predicted accurately. Hence, it is necessary to know the GPU scheduling policy. Even though Amert et al. [3] found some details of the scheduling policy of the NVIDIA TX2, it is unclear whether their findings are applicable to other NVIDIA architectures or to GPUs from different vendors. For this reason, software techniques that modify the application to run its thread blocks on user-specified SMs have been used [19, 9]. Although there is overhead from adding extra pieces of code to the application program, this gives more flexibility to programmers and makes the execution time of the application more predictable. In our research, we use a similar software approach to reserve SMs for an application program.

2.2 Related Work

There are several research papers on computation techniques using GPUs in the real-time domain. Among these, TimeGraph [11] proposes a priority-based GPU scheduler to improve the responsiveness of tasks. Elliott et al. [6] presented two methods, the shared-resource method and the container method, for integrating GPUs with CPUs in soft real-time multiprocessor systems and showed the performance benefits over a pure CPU system. In [7], the authors present an optimal k-exclusion locking protocol for multi-GPU systems where GPUs are modeled as shared resources. GPUSync [8] provides a framework for managing a real-time multi-GPU system based on the k-exclusion locking protocol of [7]. The server-based approach [13] identifies the two major limitations of the synchronization-based approaches in [7, 8], busy waiting and priority inversion, and proposes solutions for them. While all the approaches mentioned above consider the GPU as a shared resource, they do not allow multiple tasks to use the GPU at the same time. As a result, the GPU may be underutilized and a process may experience a long waiting time to access the GPU.

RGEM [10] addresses the non-preemptive behaviour of the GPU, because of which the response time of a high-priority task may increase. It splits the data copy operation between CPU memory and GPU memory into several parts and allows preemption at the splitting points, which reduces the response time of a high-priority task. GPES [22] also considers the non-preemptive nature of the GPU and breaks a kernel into multiple subkernels, as well as a data copy into multiple small copy operations; as a result, preemption can happen at the break points and the long waiting time of a high-priority task is reduced. Basaran et al. [4] propose a technique for preemptive kernel execution and data copy operations on the GPU; it also supports concurrent kernel execution and copy operations to increase task responsiveness. Gdev [12] treats the GPU as a standard resource like the CPU and provides a management system at the OS level so that user-space programs as well as the OS kernel can use the GPU as a standard computing resource. Even though the above papers contribute to reducing the waiting time of high-priority tasks and improving responsiveness, none of them discusses concurrent execution of multiple kernels on the GPU to improve utilization.

Otterness et al. [16] discuss parallel execution of multiple kernels on the GPU of the NVIDIA TX1 and show that some benchmarks become slower compared to when they run as individual programs; however, the paper does not give a detailed explanation or quantification of the slowdown. Kernelet [21] is a runtime system that improves GPU utilization by slicing a kernel into sub-kernels and co-scheduling sub-kernels of different kernels. Wu et al. [19] propose a software technique to run an application program on specific SMs of the GPU regardless of the internal GPU scheduling mechanism, giving the user the flexibility to assign SMs to a particular task. Janzén et al. [9] also describe GPU partitioning among applications by allocating different SMs to different user programs. Wang et al. [18] propose simultaneous GPU multi-kernel execution via fine-grained sharing. Xu et al. [20] propose Warped-Slicer to allow intra-SM slicing for GPU multi-programming. While the papers [21, 19, 9, 20, 18] provide mechanisms to improve utilization, they do not include schedulability analysis for real-time systems, so it remains to be investigated whether these methods are applicable to real-time systems. In this research, we present a GPU reservation method for multiple GPU-using tasks, run the tasks in parallel on the GPU for higher utilization, and provide a complete formulation to analyze the schedulability of the tasks in the real-time domain.

2.3 Motivation

The GPU resource utilization can be improved by launching multiple kernels at the same time and executing these kernels on the SMs simultaneously. Recent GPUs support such multi-kernel execution, but it introduces unpredictability in the completion times of the kernels, which is not acceptable in real-time systems. For example, Figure 2.2 shows an nvprof timeline of four kernels launched at the same time. Each kernel has four thread blocks and the system has a total of 13 SMs. The computation inside each kernel is exactly the same. The timeline shows that two kernels finish execution earlier than the other two, but it is not deterministic which two finish earlier and which two are delayed. Also, there is no proper way to know the amount of delay. This issue needs to be addressed when using multi-kernel execution in real-time systems.

Figure 2.2: Multi-kernel Execution

As mentioned earlier, software techniques [19, 9] can be used to run GPU kernels on user-defined SMs. We performed experiments to study the execution behaviour of eight GPU benchmarks for different numbers of SMs assigned to each kernel. In these experiments, multi-kernel execution is not used; in other words, each benchmark is executed separately. Figures 2.3 and 2.4 show the normalized GPU execution time of the benchmarks for different numbers of SMs, varying from 1 to 13 on GTX970 and from 1 to 2 on TX2, as GTX970 has 13 SMs and TX2 has only 2 SMs. From the results, we see that the execution time of none of the benchmarks decreases linearly. In most cases, the execution time plateaus after a certain number of SMs; we do not gain much benefit from assigning more SMs beyond that point. One benchmark does not show any significant change for a varying number of SMs, which can happen when an application has a small number of thread blocks. From this observation, we can conclude that multiple kernels can run on their own specified SMs, which can be beneficial compared to running one application at a time on the GPU.

Figure 2.3: Execution time vs Number of SM on GTX970 (benchmarks: MMUL, backprop, b+tree, hotspot, kmeans, workzone, stereoDisparity, streamcluster)

Figure 2.4: Execution time vs Number of SM on TX2 (same benchmarks, 1 and 2 SMs)

2.4 System Model

The system considered in this work is equipped with a multi-core CPU and a general-purpose GPU. The CPU has N_P cores; each core is identical and runs at a fixed frequency. The GPU is assumed to follow the architecture described in Section 2.1. In that GPU, there are N_EE execution engines (EEs), which are equivalent to streaming multiprocessors in NVIDIA GPUs and core groups in ARM Mali GPUs. We assume that the GPU has one copy engine (CE), which is typical in many of today's GPUs, and that it handles copy requests on a first-come, first-served basis. The GPU memory is assumed to be sufficiently large for all GPU-using tasks in the system.

We focus on partitioned fixed-priority preemptive task scheduling due to its wide acceptance in real-time OSs such as QNX RTOS [1]. Any fixed-priority assignment can be used, such as Rate Monotonic (RM) and Deadline Monotonic (DM). For the task model, we consider sporadic tasks with constrained deadlines. Each instance (job) of a task consists of CPU segments and GPU segments. As their names imply, CPU segments run entirely on the CPU, and GPU segments include GPU operations, e.g., data copy from/to the GPU and kernel execution. Once a task launches a GPU kernel, the task may self-suspend to save CPU cycles. The kernel execution time depends on the number of EEs assigned to the task. Specifically, a task τ_i is characterized as follows:

τ_i := (C_i, G_i(k), T_i, D_i, θ_i, η_i)

where

• C_i: the sum of the worst-case execution times (WCETs) of the CPU segments of each job of τ_i

• G_i(k): the sum of the worst-case durations of the GPU segments of each job of τ_i, when k EEs are assigned to τ_i and no other task is using the GPU

• T_i: the minimum inter-arrival time (or period) of τ_i

• D_i: the relative deadline of each job of τ_i

• θ_i: the number of CPU segments of each job of τ_i

• η_i: the number of GPU segments of each job of τ_i

τ_{i,j} and τ^*_{i,j} are used to denote the j-th CPU and GPU segments of τ_i, respectively. Note that we do not make any assumption about the sequence of CPU and GPU segments; hence, a task may have two consecutive GPU segments. G_i(k) is assumed to be non-increasing with k, i.e., G_i(k) ≥ G_i(k + 1). This assumption can be easily met by monotonic over-approximations [2, 14]. The number of EEs assigned to each task is statically determined and does not change at runtime.

We use G_{i,j} to denote the worst-case duration of τ^*_{i,j} (the j-th GPU segment of a task τ_i). Hence, G_i(k) = Σ_{j=1}^{η_i} G_{i,j}(k). Without loss of generality, each GPU segment is assumed to have three sub-segments: (i) data copy to the GPU, (ii) kernel execution, and (iii) data copy back from the GPU. Thus, each GPU segment uses the CE up to two times. In this model, more than one consecutive kernel can be represented with multiple GPU segments. τ^*_{i,j} is characterized by the following parameters:

τ^*_{i,j} := (G^{mhd}_{i,j}, G^e_{i,j}(k), G^{mdh}_{i,j})

where

• G^{mhd}_{i,j}: the WCET of miscellaneous operations executed before the GPU kernel in τ^*_{i,j}, e.g., memory copy from the host to the device

• G^e_{i,j}(k): the WCET of the GPU kernel of τ^*_{i,j} on k EEs

• G^{mdh}_{i,j}: the WCET of miscellaneous operations executed after the GPU kernel in τ^*_{i,j}, e.g., memory copy from the device to the host

G^e_{i,j}(k) is the kernel execution time on the GPU, and G^{mhd}_{i,j} and G^{mdh}_{i,j} are the CPU time consumed for data copy between CPU and GPU memory, kernel launch commands, the notification of kernel completion, etc. Note that G_{i,j}(k) ≤ G^{mhd}_{i,j} + G^e_{i,j}(k) + G^{mdh}_{i,j}, as kernel execution on the GPU may overlap with miscellaneous operations on the CPU. For simplicity, we may omit k and use G_{i,j} to refer to G_{i,j}(k). This rule also applies to other GPU-segment parameters, e.g., G^e_{i,j} ≡ G^e_{i,j}(k). We use G^m_{i,j} to represent the sum of G^{mhd}_{i,j} and G^{mdh}_{i,j}, i.e., G^m_{i,j} = G^{mhd}_{i,j} + G^{mdh}_{i,j}.

The CPU utilization of τ_i is defined as U_i = (C_i + G_i)/T_i if τ_i busy-waits during kernel execution, and U_i = (C_i + G^m_i)/T_i, where G^m_i = Σ_{j=1}^{η_i} G^m_{i,j}, if τ_i self-suspends during kernel execution.
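As a quick illustration of these definitions with hypothetical numbers, consider a task τ_i with C_i = 2 ms, a single GPU segment (η_i = 1) with G^{mhd}_{i,1} = G^{mdh}_{i,1} = 0.5 ms and G^e_{i,1}(k) = 4 ms, and T_i = D_i = 20 ms. Assuming no overlap between the kernel and the miscellaneous operations, G_i(k) = 5 ms and G^m_i = 1 ms, so U_i = (2 + 5)/20 = 0.35 if τ_i busy-waits and U_i = (2 + 1)/20 = 0.15 if it self-suspends.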

Chapter 3

Spatio-Temporal GPU Reservation Framework

3.1 Reservation Design

The GPU reservation approach allocates GPU resources to applications both spatially and temporally. The motivation for using GPU reservation is to reduce the long waiting time of tasks and to increase GPU utilization.

When GPU kernels are launched, thread blocks are executed on SMs according to the scheduling policy of the device driver, which is not revealed by the manufacturers. In that case, the application programmer has little control over GPU resources. The spatial reservation method reserves GPU resources, i.e., SMs, for each task that requires GPU computation. When specific SMs are assigned to an application, its execution time is more predictable. The number of SMs assigned to each task is determined by the algorithm in Section 3.3. A small application modification is required to use spatial GPU reservation. In this research, the application modification is done by creating a mapping array that declares the SM ids for a task and passing that array when launching the kernel. In the device code, when a block starts execution, it first checks the mapping to determine whether it should run on its current SM. If the block is not on a reserved SM, it immediately stops execution; if it is on a reserved SM, it continues execution. In this case, the grid dimension is modified so that all the blocks run on the reserved SMs.

The temporal reservation of the GPU allows multiple tasks to run in parallel on the GPU. As multiple tasks use the GPU concurrently, overall utilization increases. In this research, the CUDA Multi-Process Service (MPS) is enabled to allow GPU co-scheduling, which has negligible overhead in the system. When multiple tasks run on the GPU at the same time with their reserved SM ids, there are two cases. First, a task may not share any SM id with the other tasks accessing the GPU at that moment; in this case, when the task issues its kernel launch command, it can start execution on the GPU immediately. Second, the task may have SM ids in common with other tasks; in this case, after issuing the kernel launch command, the task needs to wait until all previous kernels with shared SM ids have finished. Here, we assume that tasks are dispatched to the GPU in FIFO manner. So, even if a high-priority task launches a GPU kernel and shares an SM id with previously launched low-priority tasks running on the GPU, it must wait for the completion of all the tasks ahead of it in the FIFO queue.

In order to minimize interference during GPU segment execution, we adopt the priority-boosting mechanism widely used for predictable shared resource control [5, 13, 17]. Specifically, τ_i's priority is increased to the highest-priority level when τ_i begins a GPU segment, and it is reverted back to τ_i's original priority when τ_i finishes that GPU segment. In this way, no CPU segments of other tasks assigned to the same CPU core can preempt τ_i during the interval of the GPU segment.
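As an illustration, priority boosting around a GPU segment could be realized on a Linux-based system roughly as in the following sketch; the function names are placeholders and the exact mechanism used by our framework is not prescribed here.

#include <pthread.h>
#include <sched.h>

void gpuSegment();   // placeholder: copies data, launches the kernel, waits, copies back

// Boost the calling thread to the highest SCHED_FIFO priority for the duration
// of one GPU segment, then revert to its original priority.
void runGpuSegmentBoosted(int originalPriority) {
    struct sched_param p;
    p.sched_priority = sched_get_priority_max(SCHED_FIFO);     // boost to highest level
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &p);

    gpuSegment();                                               // the GPU segment itself

    p.sched_priority = originalPriority;                        // revert to original priority
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &p);
}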

During kernel execution, GPU-using tasks may either busy-wait or self-suspend.

This is configurable in many GPU programming environments, such as CUDA and OpenCL.

In general, busy-waiting is good for short kernels as it does not cause any additional scheduling overhead, and self-suspension is good for long kernels as it saves CPU time. Following this line of reasoning, our framework supports both modes, and also applies the chosen mode to any task waiting for GPU resources to be ready. Hence, if self-suspension (or busy-wait) is chosen, our framework lets tasks suspend (or busy-wait) not only when they execute GPU kernels but also when they wait for the CE and any shared EEs.
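For instance, in CUDA the waiting behaviour of the host thread can be selected through device flags; the sketch below shows one way to pick between the two modes (the flag must be set before the CUDA context is created).

#include <cuda_runtime.h>

// Choose how the host thread waits for GPU work to finish. With blocking sync,
// the thread self-suspends until the GPU signals completion; with spin, it
// busy-waits on synchronization calls such as cudaStreamSynchronize().
void configureWaitingMode(bool selfSuspend) {
    if (selfSuspend)
        cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    else
        cudaSetDeviceFlags(cudaDeviceScheduleSpin);
}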

3.2 Admission Control

The admission control module of our framework checks the schedulability of a task under a given resource allocation by computing the task’s worst-case response time

(WCRT). As our framework supports self-suspension and busy-waiting modes, we describe the schedulability analyses for both modes in the following.

3.2.1 Self-suspension Mode

If self-suspension is used, the WCRT of τ_i is upper-bounded by the following recurrence equation:

W_i^{k+1} = C_i + G_i + B_i + Σ_{τ_h ∈ hp(τ_i)} ⌈(W_i^k + (W_h - (C_h + G_h^m))) / T_h⌉ · (C_h + G_h^m)    (3.1)

where C_i is the CPU computation time of τ_i, G_i is the total GPU computation time of τ_i, and B_i is the total blocking time caused by GPU access. As higher-priority tasks self-suspend during GPU computation, (W_h - (C_h + G_h^m)) is added inside the ceiling function to capture the effect of their self-suspension. Also, the copy operations require CPU cycles, so a higher-priority task cannot self-suspend while copying data between CPU and GPU memory; this is why G_h^m is added to C_h.

The blocking time B_i is computed as follows:

B_i = B_i^m + B_i^e + B_i^l    (3.2)

where B_i^m is the blocking time from GPU data copy and miscellaneous operations in GPU segments, B_i^e is the blocking time from kernel execution, and B_i^l is the blocking time from priority inversion.

The blocking time from a sub-segment for data copy and miscellaneous operations in the j-th GPU segment of τ_i is upper-bounded by:

B_{i,j}^m = Σ_{τ_u ≠ τ_i ∧ η_u > 0} max_{1 ≤ w ≤ η_u} G_{u,w}^{m*}    (3.3)

where G_{u,w}^{m*} = max(G_{u,w}^{mhd}, G_{u,w}^{mdh}). Since there is one Copy Engine (CE) in the GPU, G_{u,w}^{m*} is the maximum between the host-to-device copy time and the device-to-host copy time. If τ_u has η_u GPU segments, only one of them runs on the GPU at a time, so we have to capture the maximum copy-operation time over all the segments of τ_u. Also, the CE handles requests in FIFO manner, which is why τ_i needs to wait for all the copy operations already in the queue to be completed.

Figure 3.1: Example schedule of GPU-using tasks showing the blocking times in self-suspending mode

As there are at most two accesses to the CE in one GPU segment, B_i^m is given by:

B_i^m = Σ_{1 ≤ j ≤ η_i} 2 · B_{i,j}^m    (3.4)

The blocking time from kernel execution in the j-th GPU segment of τ_i is upper-bounded by:

B_{i,j}^e = Σ_{τ_u ≠ τ_i ∧ S(τ_u) ∩ S(τ_i) ≠ ∅} max_{1 ≤ w ≤ η_u} G_{u,w}^e    (3.5)

If τ_i does not share its assigned EEs with any other task, then B_{i,j}^e = 0. If τ_i shares EEs with other tasks, it may need to wait, in the worst case, for all the tasks with which it shares EEs. The total blocking time B_i^e is the summation of the blocking times of all the GPU segments of τ_i:

B_i^e = Σ_{1 ≤ j ≤ η_i} B_{i,j}^e    (3.6)

Whenever τ_i suspends, GPU-access segments of lower-priority tasks on the same CPU core can block τ_i from executing on that core; this is called priority inversion. The blocking time of a GPU segment of τ_i due to priority inversion is captured by the following equation, where P(τ_i) denotes the set of tasks assigned to the same CPU core as τ_i and π_i denotes the priority of τ_i:

B_{i,j}^l = Σ_{τ_u ∈ P(τ_i) ∧ π_u < π_i ∧ η_u > 0} max_{1 ≤ w ≤ η_u} G_{u,w}^{m*}    (3.7)

The total blocking over all segments, B_i^l, is given by:

B_i^l = θ_i · B_{i,j}^l    (3.8)

Note that θ_i is used instead of η_i, because this blocking can occur for every CPU segment of τ_i.
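For concreteness, the following C++ sketch iterates the recurrence of Eq. (3.1) to a fixed point; it assumes that the blocking term B_i of Eq. (3.2) has already been computed and that hp contains the higher-priority tasks on the same core together with their WCRTs. The structure and names are illustrative, not part of the thesis implementation.

#include <cmath>
#include <vector>

// Per-task parameters needed by the self-suspension analysis (illustrative).
struct Task {
    double C;   // total WCET of CPU segments (C_i)
    double G;   // total worst-case GPU segment duration for its SM share (G_i(k))
    double Gm;  // total copy/miscellaneous CPU time of GPU segments (G_i^m)
    double T;   // period (T_i)
    double D;   // relative deadline (D_i)
    double W;   // previously computed WCRT of this task (needed as W_h in Eq. 3.1)
};

// Iterates Eq. (3.1) until the fixed point is reached.
// Returns the WCRT, or a negative value if the deadline is exceeded.
double wcrtSelfSuspend(const Task& ti, double Bi, const std::vector<Task>& hp) {
    double W = ti.C + ti.G + Bi;                 // initial value W_i^0
    for (;;) {
        double next = ti.C + ti.G + Bi;
        for (const Task& th : hp) {
            double jitter = th.W - (th.C + th.Gm);            // self-suspension term of tau_h
            next += std::ceil((W + jitter) / th.T) * (th.C + th.Gm);
        }
        if (next > ti.D) return -1.0;            // not schedulable: admission control rejects tau_i
        if (next == W) return W;                 // converged to the fixed point
        W = next;
    }
}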

3.2.2 Busy-waiting Mode

If the tasks are in busy-waiting mode instead of self-suspending, a variant of the traditional response-time test is used to upper-bound the WCRT of a task τ_i:

W_i^{k+1} = C_i + G_i + B_i + Σ_{τ_h ∈ P(τ_i) ∧ π_h > π_i} ⌈W_i^k / T_h⌉ · (C_h + G_h + B_h)    (3.9)

where B_i is the blocking time caused by GPU access. The blocking time B_i is computed as follows:

B_i = B_i^m + B_i^e + B_i^l    (3.10)

B_i^m and B_i^e are the same as those in the self-suspension mode. GPU-access segments of a lower-priority task on the same CPU core can still block τ_i:

B_{i,j}^l = Σ_{τ_u ∈ P(τ_i) ∧ π_u < π_i ∧ η_u > 0} max_{1 ≤ w ≤ η_u} G_{u,w}    (3.11)

Under busy-waiting mode, B_i^l is computed as follows:

B_i^l = B_{i,j}^l    (3.12)

This is because there is only one priority inversion from lower-priority tasks.

3.3 Resource Allocator

The resource allocation algorithm reserves SMs for each GPU-using task and allocates tasks to CPU cores. As we consider a multi-core system, the Worst-Fit Heuristic (WFH) is used to assign tasks to cores depending on their CPU utilization. The aim of the algorithm is to distribute the GPU resources evenly among the tasks; but if a task is not schedulable because its GPU execution time is too long with a small number of SMs, more SMs are allocated to that task. The benefit of evenly distributing SMs among tasks is to reduce the time tasks wait before starting GPU execution. The algorithm considers a taskset that is a combination of CPU-only tasks and tasks requiring GPU computation.

Initially, one SM is given to every GPU-using task (lines 2 and 3 of Algorithm 1). In lines 6 and 7, the utilization of each core is set to 0, as no task has been assigned to any core yet. The SM ids are then allocated to the GPU-using tasks in lines 10 to 19.

Algorithm 1 SM-aware Task Allocation Algorithm

Require: Γ: a taskset, N_c: number of CPU cores, N_SM: total number of SMs in the GPU
Ensure: N_i: number of SMs for each task τ_i ∈ Γ, S_i: SM indices for each task τ_i ∈ Γ, A_T: an array of the taskset allocated to each core, U: utilization array of the tasks of Γ if schedulable, and ∞ otherwise.
1: for all τ_i ∈ Γ do
2:   if τ_i uses GPU then
3:     N_i ← 1
4:   end if
5: end for
6: for cid ← 1 to N_c do
7:   U_cid ← 0
8: end for
9: /* SM Allocation */
10: sm_idx ← 0
11: for all τ_i ∈ Γ do
12:   if τ_i uses GPU then
13:     S_i ← ∅
14:     for k ← 1 to N_i do
15:       S_i ← S_i ∪ {sm_idx}
16:       sm_idx ← (sm_idx + 1) mod N_SM
17:     end for
18:   end if
19: end for
20: Sort tasks in Γ in decreasing order of utilization

Algorithm 2 SM-aware Task Allocation Algorithm (continued)

21: for all τ_i ∈ Γ do
22:   for cid ← 1 to N_c do
23:     if 1 - U_cid ≥ C_i/T_i and τ_i satisfies Eq. (3.1) then
24:       U_cid ← U_cid + C_i/T_i
25:       Insert τ_i into A_T[cid]
26:       Mark τ_i schedulable
27:       break
28:     end if
29:   end for
30: end for
31: if all tasks in Γ are schedulable then RETURN {N_i, S_i, A_T, U}
32: else if N_i ≤ N_max then
33:   i ← argmax_{τ_i ∈ Γ ∧ τ_i uses GPU} G_i(k + 1)/T_i
34:   N_i ← N_i + 1
35:   Go to line 10
36: else RETURN ∞
37: end if

Here, the spatial reservation is maintained because, if a task has more than one SM, consecutive SM ids are allocated to that task. The tasks are sorted by utilization in line 20 and assigned to cores according to WFH in lines 21 to 30. If all tasks are schedulable, the return values are set in line 31. If some tasks are not schedulable, the number of SMs of the task with the highest GPU utilization is increased and the algorithm goes back to the SM allocation step, as described in lines 32 to 35. This iteration continues until all the tasks are schedulable or the number of SMs given to a task reaches the maximum number of SMs available in the GPU. As the number of tasks in a taskset is limited, the algorithm converges after allocating the tasks to cores; if the taskset is not schedulable even after assigning all the available SMs of the GPU, the algorithm returns infinity, which indicates that the taskset is not schedulable.
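The following condensed C++ sketch mirrors the structure of Algorithms 1 and 2 under simplifying assumptions: schedulable() is a stub standing in for the response-time test of Section 3.2, G_i(N_i + 1)/T_i is treated as a precomputed field, and the worst-fit step only tries the least-loaded core.

#include <algorithm>
#include <vector>

// Illustrative task record for the allocator sketch.
struct GpuTask {
    bool usesGpu = false;
    int numSm = 0;                 // N_i
    std::vector<int> smIds;        // S_i
    double util = 0.0;             // C_i / T_i (CPU utilization)
    double gpuUtilNext = 0.0;      // G_i(N_i + 1) / T_i; stays 0 for CPU-only tasks
};

// Stub: in the real framework this wraps the response-time test of Section 3.2.
static bool schedulable(const GpuTask&, const std::vector<GpuTask>&) { return true; }

// Round-robin SM assignment followed by worst-fit partitioning onto CPU cores;
// grows the SM share of the most GPU-demanding task and retries on failure.
bool allocate(std::vector<GpuTask>& tasks, int numCores, int numSm) {
    for (auto& t : tasks) if (t.usesGpu) t.numSm = 1;              // lines 1-5
    for (;;) {
        int idx = 0;                                               // lines 10-19: SM ids
        for (auto& t : tasks) {
            if (!t.usesGpu) continue;
            t.smIds.clear();
            for (int k = 0; k < t.numSm; ++k) {
                t.smIds.push_back(idx);
                idx = (idx + 1) % numSm;
            }
        }
        std::vector<double> coreUtil(numCores, 0.0);               // lines 6-8
        std::sort(tasks.begin(), tasks.end(),                      // line 20
                  [](const GpuTask& a, const GpuTask& b) { return a.util > b.util; });
        bool allOk = true;
        for (auto& t : tasks) {                                    // lines 21-30 (simplified)
            int core = std::min_element(coreUtil.begin(), coreUtil.end()) - coreUtil.begin();
            if (1.0 - coreUtil[core] >= t.util && schedulable(t, tasks))
                coreUtil[core] += t.util;
            else
                allOk = false;
        }
        if (allOk) return true;                                    // line 31
        GpuTask* grow = nullptr;                                   // lines 32-35
        for (auto& t : tasks)
            if (t.usesGpu && (!grow || t.gpuUtilNext > grow->gpuUtilNext)) grow = &t;
        if (!grow || grow->numSm >= numSm) return false;           // line 36: give up
        grow->numSm += 1;
    }
}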

3.4 Reservation based Program Transformation

Our spatio-temporal GPU reservation method requires application modification. In general, GPU applications have host-side code and device-side code, and a small piece of code needs to be added to each. Our goal is to make the application ready to run on user-specified SMs. The key idea is that, instead of using the original threads, a specific number of worker threads are created which perform all the computation on behalf of the original threads on the specified SMs. To do that, in the host-side code, a mapping array is created to define which SM ids will be used by the application and how many original threads need to be processed by each worker thread. While launching the kernel, instead of providing the original grid dimension, an integer multiple of the total number of available SMs is given; in the pseudo code, this integer multiple is the variable par. In the device-side code, all worker threads initially start running on all SMs. Each worker thread then checks its own SM id against the mapping array. If the mapping indicates that the SM is not allocated to this application, the worker threads on it immediately stop execution; otherwise, they continue. By following this approach, even if multiple applications run simultaneously on the GPU, no interference occurs, as the worker threads run independently. This application modification is quite similar to the method proposed in [19, 9].

Figure 3.2: Normalized Execution Time vs Different Par Value on GTX970

Figure 3.3: Normalized Execution Time vs Different Par Value on TX2
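A sketch of this transformation for a one-dimensional kernel is given below; the helper names, the atomic work counter, and the placeholder kernel body are illustrative choices, not the exact code used in our implementation.

#include <cuda_runtime.h>

// Returns the id of the SM this thread is running on (PTX special register %smid).
__device__ unsigned int mySmId() {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Worker-block version of a (placeholder) kernel. Blocks that land on a
// non-reserved SM exit immediately; blocks on reserved SMs repeatedly claim
// original thread-block indices from a global counter and process them.
// Assumes data has origNumBlocks * blockDim.x elements and *nextBlock starts at 0.
__global__ void reservedKernel(const int* smMap, int numReservedSm,
                               int origNumBlocks, int* nextBlock, float* data) {
    unsigned int sm = mySmId();
    bool reserved = false;
    for (int i = 0; i < numReservedSm; ++i)
        if (smMap[i] == (int)sm) { reserved = true; break; }
    if (!reserved) return;                       // not on a reserved SM: stop right away

    __shared__ int b;
    for (;;) {
        if (threadIdx.x == 0) b = atomicAdd(nextBlock, 1);   // claim the next original block
        __syncthreads();
        if (b >= origNumBlocks) return;
        int tid = b * blockDim.x + threadIdx.x;  // reconstructed original thread index
        data[tid] += 1.0f;                       // placeholder for the original kernel body
        __syncthreads();                         // all threads done before b is reused
    }
}

// Host side: launch par * totalSm worker blocks instead of the original grid,
// with par chosen according to Eq. (3.13).
void launchReserved(const int* hostSmMap, int numReservedSm,
                    int origNumBlocks, int blockSize, float* dData) {
    int totalSm = 0, maxThreadsPerSm = 0;
    cudaDeviceGetAttribute(&totalSm, cudaDevAttrMultiProcessorCount, 0);
    cudaDeviceGetAttribute(&maxThreadsPerSm, cudaDevAttrMaxThreadsPerMultiProcessor, 0);
    int par = maxThreadsPerSm / blockSize;       // Eq. (3.13)

    int *dSmMap, *dNextBlock;
    cudaMalloc(&dSmMap, numReservedSm * sizeof(int));
    cudaMemcpy(dSmMap, hostSmMap, numReservedSm * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc(&dNextBlock, sizeof(int));
    cudaMemset(dNextBlock, 0, sizeof(int));

    reservedKernel<<<par * totalSm, blockSize>>>(dSmMap, numReservedSm,
                                                 origNumBlocks, dNextBlock, dData);
    cudaDeviceSynchronize();
    cudaFree(dSmMap);
    cudaFree(dNextBlock);
}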

Here, our observation is that the number of worker threads affects the kernel execution time significantly. We can change the number of worker threads by varying the par value. As each SM in the GPU has a maximum number of threads that can be resident simultaneously, the optimum benefit is achieved if the par value is selected by the following formula:

par = (maximum number of allowable threads per SM) / (thread block size of the application)    (3.13)

For example, with a hypothetical limit of 2048 resident threads per SM and a thread block size of 256, Eq. (3.13) gives par = 8. If the par value is smaller than this, the number of worker threads is less than the capacity of an SM, which underutilizes the GPU resource. Conversely, if the par value is larger than the calculated value, the number of worker threads exceeds the maximum limit, and the remaining worker threads have to wait for GPU resources. An experiment has been performed to show this effect, and the results are plotted in Figures 3.2 and 3.3. The graphs show the execution time for different par values for four benchmarks. We observe that after a certain par value the execution time becomes stable, which indicates that further increasing the par value is not beneficial.

Chapter 4

Evaluation

In this chapter, the experimental methodology and the evaluation of our proposed work are described. First, we describe our two implementation platforms, a general-purpose computer and an embedded platform, and the mechanism for launching multiple kernels. Second, the overhead of our approach is presented. Third, we show the case study results on the two platforms. Fourth, the schedulability analysis of randomly generated tasksets under our spatio-temporal GPU management is discussed.

4.1 Implementation

The prototype of our spatio-temporal GPU management framework has been implemented on two platforms: a general-purpose computer and an embedded platform. The general-purpose computer is equipped with an Intel Core i7-6700 CPU with an operating frequency of 3.4 GHz. It has 4 physical CPU cores with 2 threads per core and 16 GB of memory. Ubuntu 16.04 has been used as the operating system. The system has an NVIDIA GeForce GTX 970 GPU for running CUDA applications. This GPU has 13 Streaming Multiprocessors (SMs) with 128 CUDA cores per SM. It has two copy engines and 4 GB of global memory. CUDA driver version 8 has been used to run the applications.

An NVIDIA Jetson TX2 has been used as our embedded platform. It has 4 ARM Cortex-A57 CPU cores and 2 Denver CPU cores, together with an NVIDIA Pascal GPU integrated on a single chip [3]. The GPU has 2 SMs with 256 CUDA cores in total and 8 GB of unified memory. The CPU operating frequency is 2 GHz and the GPU operating frequency is 1.3 GHz.

To run multiple kernels from different applications simultaneously, we create a single parent process, and each kernel is launched using a separate CPU thread of that parent process. Each thread is assigned to a different CPU core according to the allocation algorithm. The parent process can also run a GPU kernel application itself, so that no additional CPU core is required for the parent process. Also, CUDA streams are used to allow asynchronous copies and kernel execution when multiple kernels execute.
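A minimal sketch of this launch mechanism is shown below; the kernel and the core assignment are placeholders, and thread pinning uses the Linux-specific pthread affinity call.

#include <cuda_runtime.h>
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Placeholder kernel standing in for one of the (modified) benchmark kernels.
__global__ void dummyKernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Pin the calling thread to one CPU core, as done for each GPU-using task.
static void pinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// One GPU-using task: issues its work on a private CUDA stream so that copies
// and kernels from different tasks can overlap on the device.
static void runTask(int core) {
    pinToCore(core);                     // core chosen by the allocation algorithm
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    dummyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaStreamSynchronize(stream);       // busy-wait or blocking sync, per Section 3.1
    cudaFree(d);
    cudaStreamDestroy(stream);
}

int main() {
    const int numTasks = 4;              // e.g., the four case-study kernels
    std::vector<std::thread> workers;
    for (int i = 0; i < numTasks; ++i)
        workers.emplace_back(runTask, i);    // one CPU thread per task, one core each
    for (auto& w : workers) w.join();
    return 0;
}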

Launching multiple kernels at the same time can also be done using the CUDA Multi-Process Service (MPS). In that case, creating a parent process is not required, but MPS itself runs as a process and consumes CPU cycles. During the experiments, we disabled hyperthreading and Dynamic Voltage and Frequency Scaling (DVFS) to keep the timing behaviour predictable.

27 15

10

5

Overhead(%) 0

MMUL b+tree hotspot kmeans backprop workzone streamcluster -5 stereoDisparity

-10

Benchmarks

Figure 4.1: Percentage overhead of selected benchmarks on GTX970

8

6

4

2

0

-2 MMUL b+tree hotspot kmeans

Overhead(%) backprop workzone streamcluser -4 stereoDisparity

-6

-8

-10 Benchmarks

Figure 4.2: Percentage overhead of selected benchmarks on TX2

4.2 Overhead Estimation

As applications are modified in our approach by adding extra code, one would expect an overhead in the form of increased kernel execution time. However, when launching kernels, the grid size is the total number of SMs multiplied by the constant par, which can be calculated by Formula (3.13). With this procedure, a single SM can run par thread blocks in parallel, which increases parallelism and can reduce the overall execution time. Considering these two effects, an application may experience an overhead (longer execution time) or a speedup (shorter execution time). Figures 4.1 and 4.2 plot the overhead of the eight benchmarks mentioned earlier. On GTX970, only three benchmarks experienced a delay; two of them are below 5% and only one is about 13%. The other benchmarks achieved a speedup. Similarly, on TX2 the highest delay is less than 6%. So, with our approach, many applications become faster than the original instead of incurring an overhead, whereas some applications experience only a small delay.

4.3 Case Study

The case study of our approach has been performed on the two platforms mentioned earlier. To show the benefit of our approach, we use four GPU-using tasks: two matrix multiplication programs, one stereoDisparity kernel, and one workzone kernel. We use GPU-based matrix multiplication as a benchmark because of its high computational intensity. The stereoDisparity kernel is taken from the NVIDIA CUDA sample programs. The workzone kernel is an image-processing application for self-driving cars that detects work zones on the road [13]. Each of the kernels has a single GPU segment. These four kernels are launched from a single process by creating four CPU threads. To create the critical instance, the four threads are assigned to four different CPU cores and the kernels are launched together so that they access GPU resources at the same time. The number of SMs assigned to each task and the allocated SM ids are determined by our proposed algorithm; in this case, no SM is shared among the four tasks.

We have considered two other cases for comparison with our algorithm: fair share allocation and all share allocation. In fair share allocation, the SMs are evenly distributed to all GPU-using tasks; if the total number of available SMs is less than the number of GPU-using tasks, every task is given one SM so that each task gets an SM. In all share allocation, all the available SMs of the system are allocated to every GPU-using task, which means that all kernel executions are serialized; this is equivalent to the synchronization-based approaches. The kernel execution traces have been taken using CUDA nvprof. The simultaneous execution of these four benchmarks using our spatio-temporal reservation, fair share allocation, and all share allocation is shown in Figures 4.3 and 4.4.

Figure 4.3: Comparison of Kernel Execution on GTX970

Figure 4.4: Comparison of Kernel Execution on TX2

From the results, we can draw the following observations.

• Obs 1. Multiple kernels can run in parallel on the GPU without interfering with each other when using our algorithm.

• Obs 2. The total completion time of all the tasks using our approach is the smallest of the three approaches on GTX970. This is because the overall completion time under our algorithm is determined by the longest execution time among the four kernels, whereas in the all share allocation method the overall completion time is the summation of all the kernels. In the fair share case, there may be some shared SMs among the kernels, or the number of SMs given to each kernel is not optimal for reducing the overall completion time. On TX2, however, the total completion time of our approach is almost the same as that of the synchronization-based approaches because only 2 SMs are available on TX2.

Table 4.1: Parameters for taskset generation

Parameter | Minimum Value | Maximum Value
Number of CPU cores (N_c) | 2 | 4
Number of tasks | 2 * N_c | 4 * N_c
Ratio of GPU tasks to CPU tasks | 0.2 | 2
Utilization of a task | 0.05 | 0.3
Number of GPU segments | 1 | 3
Task period (T_i = D_i) | 100 ms | 500 ms
Ratio of CPU execution time to GPU execution time (C_i/G_i) | 0.3 | 0.7
Ratio of kernel execution time to GPU execution time (G_i^e/G_i) | 0.7 | 0.9

4.4 Schedulability Results

For the schedulability experiments, we generated 10,000 random tasksets. Each task in a taskset is generated randomly using the parameter ranges given in Table 4.1. As the GPU kernel execution time varies with the number of SMs, we use the benchmark execution times shown in Figure 2.3 to select execution times. First, we choose the number of tasks in a taskset randomly between the minimum and maximum values given in the table. Then, for each task, we select the period and utilization randomly based on the values in the table; the deadline is set equal to the period. After that, depending on the randomly chosen ratio of GPU tasks to CPU tasks, the number of GPU-using tasks is determined. For CPU-only tasks (which do not use the GPU), we calculate the CPU computation time by multiplying utilization and period. For GPU-using tasks, we choose the ratio of CPU computation time to GPU computation time randomly and use it to calculate the CPU computation time. Then, the number of GPU segments is chosen randomly, and the GPU execution time of each segment is calculated based on the number of SMs assigned to that task and the randomly selected benchmark. Finally, the ratio of copy operations to actual kernel execution is used to split the GPU time into data copy time and kernel execution time.
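A condensed sketch of this generation procedure over the ranges of Table 4.1 is given below; it samples uniformly, omits the benchmark-based lookup of per-SM kernel execution times and the split into individual segments, assumes that the sampled utilization covers C_i + G_i for GPU tasks, and all structure and names are illustrative.

#include <random>

std::mt19937 rng(42);
double uniform(double lo, double hi) {
    return std::uniform_real_distribution<double>(lo, hi)(rng);
}

struct GenTask {
    double T, D, C, G, Ge, Gm;   // period, deadline, CPU time, GPU time, kernel / copy split (ms)
    int gpuSegments = 0;         // 0 for CPU-only tasks
};

GenTask generateTask(bool usesGpu) {
    GenTask t{};
    t.T = uniform(100.0, 500.0);                 // period, Table 4.1
    t.D = t.T;                                   // deadline equal to period
    double u = uniform(0.05, 0.3);               // per-task utilization
    if (!usesGpu) {
        t.C = u * t.T;                           // CPU-only task: C_i = U_i * T_i
        return t;
    }
    double ratioCG = uniform(0.3, 0.7);          // C_i / G_i
    double total = u * t.T;                      // assumed budget for C_i + G_i
    t.C = total * ratioCG / (1.0 + ratioCG);
    t.G = total - t.C;
    t.gpuSegments = 1 + (int)uniform(0.0, 3.0);  // 1..3 GPU segments
    double ratioKernel = uniform(0.7, 0.9);      // G_i^e / G_i
    t.Ge = t.G * ratioKernel;                    // kernel execution time
    t.Gm = t.G - t.Ge;                           // data copy / miscellaneous time
    return t;
}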

We present the percentage of schedulable tasksets among the 10,000 generated tasksets, where a schedulable taskset means that all tasks in the taskset meet their deadlines. We compare the schedulability results of our spatio-temporal GPU reservation method with five other approaches. One is assigning all available SMs to all GPU tasks, which effectively serializes the GPU kernel executions even though they are launched together; this is the current trend of using GPUs in real-time systems. Another is fair share allocation of SMs to tasks, which distributes the SMs to GPU-using tasks as evenly as possible; if the total number of available SMs is less than the number of GPU-using tasks, every GPU-using task is given one SM. For both our approach and fair share allocation, we report results for the self-suspending and busy-waiting CPU modes. We also compare our results with two synchronization-based approaches, the Multiprocessor Priority Ceiling Protocol (MPCP) and the Flexible Multiprocessor Locking Protocol (FMLP), and with the server-based approach [13].

Figure 4.5 shows the percentage of schedulable tasksets for a varying number of tasks per taskset; the number of CPU cores is four in this experiment. The results show that as the number of tasks per taskset increases, the number of schedulable tasksets decreases for all approaches, since the number of CPU cores is fixed. However, our approach yields a higher number of schedulable tasksets than the all share approach for both CPU modes: according to the results, up to 32% more tasksets are schedulable than with the all share approach. Compared to the busy-waiting mode, the self-suspending CPU mode gives a higher benefit for both our approach and fair share allocation.

Figure 4.5: Schedulability w.r.t Number of Tasks in a taskset (schemes: ST_busy, ST_self, Fair_busy, Fair_self, All_busy, All_self, MPCP, FMLP, gpu_server)

Figure 4.6: Schedulability w.r.t Number of SM

The GPUs available on the market have different numbers of SMs; for example, the NVIDIA TX2 has only two SMs whereas the GeForce 970 has 13 SMs. So, varying the number of SMs is another important parameter to consider when checking schedulability. Figure 4.6 presents the percentage of schedulable tasksets as the number of SMs increases from 1 to 13. Naturally, with fewer SMs the performance degrades, but as the number of SMs increases, the performance of our approach starts to improve over the all share case. Up to 24% higher schedulability can be achieved using our approach over the all share case with 13 SMs in the self-suspending CPU mode, whereas in the busy-waiting mode the improvement is nearly 16%. The performance of fair share allocation becomes the same as ours as the number of SMs increases.

Figure 4.7: Schedulability w.r.t Number of GPU Segments

Figure 4.8: Schedulability w.r.t Ratio of C to G

Another important parameter is the number of GPU execution segments in an application. Many current GPU benchmarks have multiple GPU segments, or a single GPU segment is run multiple times in a loop or over multiple iterations. We therefore checked schedulability by varying the number of GPU segments of the GPU-using tasks from 1 to 8, assuming that all GPU tasks have the same number of GPU segments. The results in Figure 4.7 show that the number of schedulable tasksets decreases for all approaches as the number of GPU segments increases. This is expected because the GPU resources are fixed while the number of GPU segments increases in all tasks. Still, our algorithm outperforms the all share approach for any number of GPU segments: based on the analysis, up to 28% more tasksets are schedulable using our approach than with serial execution of GPU kernels.

Figure 4.9: Schedulability w.r.t Ratio of Number of GPU tasks to Number of CPU tasks

The next parameter considered for the schedulability analysis is the ratio of the total CPU execution time of all CPU segments to the total GPU execution time of all GPU segments of a GPU-using task. This ratio also affects the schedulability of tasksets. The results in Figure 4.8 show that as the ratio increases, more tasksets become schedulable for all algorithms. This is because increasing the ratio means that the amount of GPU execution time decreases while the CPU execution time increases; hence, the blocking time and the total response time of tasks decrease. In this experiment, our allocation algorithm also outperforms all other approaches; at some points, schedulability increases by up to 60% compared to MPCP.

Chapter 5

Conclusions

The state-of-the-art GPU management in real-time systems significantly underutilizes GPU resources because of the serialization of GPU kernels, and the waiting time to access GPU resources also becomes significantly high under this approach. In this thesis, we have presented our spatio-temporal GPU management framework, which allows multiple tasks to access GPU resources simultaneously without interference. The proposed algorithm allocates GPU resources to GPU-using tasks and assigns both CPU-only and GPU-using tasks to CPU cores in a predictable manner. A mathematical model has been presented to calculate the blocking time and the WCRT of each task. The implementation and case study results show that, using our approach, multiple GPU-using tasks can access the GPU in parallel, which improves resource utilization and reduces the WCRT compared to synchronization-based approaches. The schedulability results show that our allocation algorithm significantly improves the schedulability of tasksets over the current serialization-based GPU access method and fair share allocation. As our approach does not need any hardware modification, it can be implemented easily in current real-time systems with very little overhead, and in some cases even a performance improvement.

Bibliography

[1] QNX RTOS. http://www.qnx.com.

[2] Sebastian Altmeyer, Roeland Douma, Will Lunniss, and Robert I Davis. Evaluation of cache partitioning for hard real-time systems. In Euromicro Conference on Real-Time Systems (ECRTS), 2014.

[3] Tanya Amert, Nathan Otterness, Ming Yang, James H Anderson, and F Donelson Smith. GPU scheduling on the NVIDIA TX2: Hidden details revealed. In IEEE Real-Time Systems Symposium (RTSS), 2017.

[4] Can Basaran and Kyoung-Don Kang. Supporting preemptive task executions and memory copies in GPGPUs. In Euromicro Conference on Real-Time Systems (ECRTS), 2012.

[5] Björn Brandenburg. The FMLP+: An asymptotically optimal real-time locking protocol for suspension-aware analysis. In Euromicro Conference on Real-Time Systems (ECRTS), 2014.

[6] Glenn Elliott and James Anderson. Globally scheduled real-time multiprocessor systems with GPUs. Real-Time Systems, 48(1):34–74, 2012.

[7] Glenn Elliott and James Anderson. An optimal k-exclusion real-time locking protocol motivated by multi-GPU systems. Real-Time Systems, 49(2):140–170, 2013.

[8] Glenn Elliott, Bryan C Ward, and James H Anderson. GPUSync: A framework for real-time GPU management. In IEEE Real-Time Systems Symposium (RTSS), 2013.

[9] Johan Janzén, David Black-Schaffer, and Andra Hugo. Partitioning GPUs for improved scalability. In International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2016.

[10] Shinpei Kato, Karthik Lakshmanan, Aman Kumar, Mihir Kelkar, Yutaka Ishikawa, and Ragunathan Rajkumar. RGEM: A responsive GPGPU execution model for runtime engines. In IEEE Real-Time Systems Symposium (RTSS), 2011.

[11] Shinpei Kato, Karthik Lakshmanan, Raj Rajkumar, and Yutaka Ishikawa. TimeGraph: GPU scheduling for real-time multi-tasking environments. In USENIX Annual Technical Conference (ATC), 2011.

[12] Shinpei Kato, Michael McThrow, Carlos Maltzahn, and Scott A Brandt. Gdev: First-class GPU resource management in the operating system. In USENIX Annual Technical Conference (ATC), 2012.

[13] Hyoseung Kim, Pratyush Patel, Shige Wang, and Ragunathan (Raj) Rajkumar. A server-based approach for predictable GPU access control. In IEEE Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), 2017.

[14] Hyoseung Kim and Ragunathan Rajkumar. Real-time cache management for multi-core virtualization. In ACM International Conference on Embedded Software (EMSOFT), 2016.

[15] Junsung Kim, Bjorn Andersson, Dionisio de Niz, and Ragunathan Rajkumar. Segment-fixed priority scheduling for self-suspending real-time tasks. In IEEE Real-Time Systems Symposium (RTSS), 2013.

[16] Nathan Otterness, Ming Yang, Sarah Rust, Eunbyung Park, James H Anderson, F Donelson Smith, Alex Berg, and Shige Wang. An evaluation of the NVIDIA TX1 for supporting real-time computer-vision workloads. In IEEE Real-Time Technology and Applications Symposium (RTAS), 2017.

[17] Ragunathan Rajkumar, Lui Sha, and John P Lehoczky. Real-time synchronization protocols for multiprocessors. In IEEE Real-Time Systems Symposium (RTSS), 1988.

[18] Z. Wang, J. Yang, R. Melhem, B. Childers, Y. Zhang, and M. Guo. Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. In IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016.

[19] Bo Wu, Guoyang Chen, Dong Li, Xipeng Shen, and Jeffrey Vetter. Enabling and exploiting flexible task assignment on GPU through SM-centric program transformations. In International Conference on Supercomputing (ICS), 2015.

[20] Q. Xu, H. Jeon, K. Kim, W. W. Ro, and M. Annavaram. Warped-slicer: Efficient intra-SM slicing through dynamic resource partitioning for GPU multiprogramming. In ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.

[21] Jianlong Zhong and Bingsheng He. Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling. IEEE Transactions on Parallel and Distributed Systems, 25(6):1522–1532, 2014.

[22] Husheng Zhou, Guangmo Tong, and Cong Liu. GPES: a preemptive execution system for GPGPU computing. In IEEE Real-Time Technology and Applications Symposium (RTAS), 2015.
