UNIVERSITY OF CALIFORNIA RIVERSIDE
Spatio-Temporal GPU Management for Real-Time Cyber-Physical Systems
A Thesis submitted in partial satisfaction of the requirements for the degree of
Master of Science
in
Electrical Engineering
by
Sujan Kumar Saha
March 2018
Thesis Committee:
Dr. Hyoseung Kim, Chairperson
Dr. Nael Abu-Ghazaleh
Dr. Daniel Wong

Copyright by Sujan Kumar Saha 2018

The Thesis of Sujan Kumar Saha is approved:
Committee Chairperson
University of California, Riverside

Acknowledgments
First, I would like to thank my supervising professor, Dr. Hyoseung Kim. All of his unconditional assistance, guidance, and support helped me finally accomplish this work. I also wish to express my deepest appreciation to Dr. Nael Abu-Ghazaleh and Dr. Daniel Wong, who serve as my committee members, for their valuable advice and generous help. I offer my special thanks to my lab-mates Ankit Juneja, Ankith Rakesh Kumar, and Yecheng Xiang for helping me in different aspects. Without their generous help, this work would not have been successful. Finally, it is an honor for me to thank my family, especially my mom, Mrs. Anjana Rani Saha, my sister, Poly Saha, and my brother, Abhijeet Saha. All of their love and support encouraged me to overcome all the challenges that I have faced.
From the bottom of my heart, thank you all.
To my parents for all the support.
ABSTRACT OF THE THESIS
Spatio-Temporal GPU Management for Real-Time Cyber-Physical Systems
by
Sujan Kumar Saha
Master of Science, Graduate Program in Electrical Engineering
University of California, Riverside, March 2018
Dr. Hyoseung Kim, Chairperson
General-purpose Graphics Processing Units (GPUs) have been considered a promising technology to address the high computational demands of real-time data-intensive applications. Many of today's embedded processors already provide on-chip GPUs, the use of which can greatly help satisfy the timing challenges of data-intensive tasks by accelerating their executions. However, the current state-of-the-art GPU management in real-time systems still lacks properties required for efficient and certifiable real-time GPU computing.
For example, existing real-time systems sequentially execute GPU workloads to guarantee predictable GPU access time, which significantly underutilizes the GPU and exacerbates temporal dependency among the workloads.
In this research, we propose a spatio-temporal GPU management framework for real-time cyber-physical systems. Our proposed framework explicitly manages the allocation of the GPU's internal execution engines. This approach allows multiple GPU-using tasks to simultaneously execute on the GPU, thereby improving GPU utilization and reducing response times. Also, it can improve temporal isolation by allocating a portion of the GPU execution engines to tasks for their exclusive use. We have implemented a prototype of the proposed framework for a CUDA environment. The case study using this implementation on two NVIDIA GPUs, the GeForce GTX 970 and the Jetson TX2, shows that our framework reduces the response time of GPU execution segments in a predictable manner, by executing them in parallel. Experimental results with randomly-generated tasksets indicate that our framework yields a significant benefit in schedulability compared to the existing approach.
Contents

List of Figures
List of Tables

1 Introduction
2 Background and Related Work
   2.1 GPU Organization and Kernel Execution
   2.2 Related Work
   2.3 Motivation
   2.4 System Model
3 Spatio-Temporal GPU Reservation Framework
   3.1 Reservation Design
   3.2 Admission Control
      3.2.1 Self-suspension Mode
      3.2.2 Busy-waiting Mode
   3.3 Resource Allocator
   3.4 Reservation-based Program Transformation
4 Evaluation
   4.1 Implementation
   4.2 Overhead Estimation
   4.3 Case Study
   4.4 Schedulability Results
5 Conclusions
Bibliography

List of Figures

2.1 Overview of GPU Architecture
2.2 Multi-kernel Execution
2.3 Execution time vs. number of SMs on GTX970
2.4 Execution time vs. number of SMs on TX2
3.1 Example schedule of GPU-using tasks showing the blocking times in self-suspension mode
3.2 Normalized execution time vs. different Par values on GTX970
3.3 Normalized execution time vs. different Par values on TX2
4.1 Percentage overhead of selected benchmarks on GTX970
4.2 Percentage overhead of selected benchmarks on TX2
4.3 Comparison of kernel execution on GTX970
4.4 Comparison of kernel execution on TX2
4.5 Schedulability w.r.t. the number of tasks in a taskset
4.6 Schedulability w.r.t. the number of SMs
4.7 Schedulability w.r.t. the number of GPU segments
4.8 Schedulability w.r.t. the ratio of C to G
4.9 Schedulability w.r.t. the ratio of the number of GPU tasks to the number of CPU tasks

List of Tables

4.1 Parameters for taskset generation
Chapter 1
Introduction
Massive data streams generated by recent embedded and cyber-physical applications pose substantial challenges in satisfying real-time processing requirements. For example, in self-driving cars, data streams from tens of sensors, such as cameras and laser range finders (LIDARs), should be analyzed in a timely manner so that the results of processing can be delivered to path/behavior planning algorithms with short and bounded delay. This requirement of real-time processing is particularly important for safety-critical domains such as automotive, unmanned vehicles, avionics, and industrial automation, where any transient violation of timing constraints may lead to system failures and catastrophic losses.
General-purpose graphics processing units (GPUs) have been considered a promising technology to address the high computational demands of real-time data streams.
Many of today's embedded processors, such as the NVIDIA TX1/TX2 and NXP i.MX series, already have on-chip GPUs, the use of which can greatly help satisfy the timing challenges of data-intensive tasks by accelerating their executions. The stringent size, weight, power, and cost constraints of embedded and cyber-physical systems are also expected to be substantially mitigated by GPUs.
For the safe use of GPUs, much research has been done in the real-time systems community to schedule GPU-using tasks with timing constraints [6, 8, 7, 10, 11, 15, 22].
However, the current state-of-the-art has the following limitations in efficiency and predictability. First, existing real-time GPU management schemes significantly underutilize GPUs in providing predictable GPU access time. They allow a GPU to be accessed by only one task at a time, which can cause unnecessarily long waiting delays when multiple tasks need to access the GPU. This problem becomes worse in an embedded computing environment, where each machine typically has only a limited number of GPUs, e.g., one on-chip GPU on the latest NVIDIA TX2 processor. Second, system support for strong temporal isolation among GPU workloads is not yet provided. In a mixed-criticality system, low-criticality tasks and high-criticality tasks may share the same GPU. If low-criticality tasks use the GPU for a longer time than expected, the timing of high-criticality tasks can be easily jeopardized. Also, if both types of tasks are concurrently executed on the GPU, it is unpredictable how much temporal interference may occur.
In this research, we propose a spatio-temporal GPU reservation framework to address the aforementioned limitations. The key contribution of this work is the explicit management of the GPU's internal execution engines, e.g., streaming multiprocessors on NVIDIA GPUs and core groups on ARM Mali GPUs. With this approach, a single GPU is divided into multiple logical units, and a fraction of the GPU can be exclusively reserved for each (or a group of) time-critical task(s). This approach allows simultaneous execution of multiple tasks on a single GPU, which can potentially eliminate the waiting time for GPU execution and achieve strong temporal isolation among tasks. Since recent GPUs have multiple execution engines and many GPU applications are not programmed to fully utilize them, our proposed framework will be a viable solution to efficiently and safely share the GPU among tasks with different criticalities. In addition, our framework substantially improves task schedulability through fine-grained allocation of GPU resources at the execution-engine level.
As a proof of concept, we have implemented our framework in a CUDA programming environment. The case study using this implementation on two NVIDIA GPUs, the GeForce GTX 970 and the Jetson TX2, shows that our framework reduces the response time of GPU execution segments in a predictable manner. Experimental results with randomly-generated tasksets indicate that our framework yields a significant benefit in schedulability compared to the existing approach.
Our GPU framework does not require any specific hardware support or detailed internal scheduling information. Thus, it is readily applicable to COTS GPUs from various vendors, e.g., AMD, ARM, NVIDIA and NXP.
The rest of the thesis is organized as follows. Chapter 2 describes background knowledge about GPU architecture, the motivation for this work, and related prior work. Our proposed GPU reservation framework is explained in detail in Chapter 3. Chapter 4 presents the evaluation methodology and result analysis. Finally, we conclude in Chapter 5.
Chapter 2
Background and Related Work
2.1 GPU Organization and Kernel Execution
GPUs are used as accelerators along with CPUs in modern computing systems. Their highly parallel structure makes them more efficient than general-purpose CPUs for data-intensive applications. Figure 2.1 shows a high-level overview of the internal structure of a GPU.
A single GPU consists of multiple Execution Engines (EEs)¹. One EE, or SM, has multiple CUDA cores. The total number of cores in a GPU is the number of SMs multiplied by the number of cores per SM; for example, the GTX970 used later in this thesis has 13 SMs with 128 cores each, for 1664 cores in total. In each SM, there are a register file, an L1 cache, and shared memory for faster access to data. All the cores in an SM share these memory components. Two other memory components are the L2 cache and main memory. The L2 cache is shared among all SMs. There are one or more Copy Engines (CEs) in a GPU. CEs copy data from CPU memory to GPU memory and vice versa. Data is processed by the GPU cores residing in the EEs. GPUs also have several other components, such as instruction buffers, warp schedulers, dispatch units, and texture units; these are not shown in the figure as they are not required for this research.
¹In NVIDIA GPUs, an Execution Engine is called a Streaming Multiprocessor (SM). We will use NVIDIA terms in the rest of the thesis, but the proposed approach is applicable to other architectures too.
[Figure 2.1: Overview of GPU Architecture, showing an EE with its register file, cores, L1 cache, and shared memory, alongside the L2 cache, Copy Engine, memory controller, and main memory]
NVIDIA provides the necessary APIs for parallel programming, such as CUDA. The general structure of a CUDA program has five sections: (1) memory allocation in the GPU memory, (2) data copy from CPU memory to GPU memory, (3) kernel execution on the GPU, (4) copying the result data back from GPU memory to CPU memory, and (5) freeing the GPU memory. When launching a kernel, the thread block and grid dimensions are provided. The data stream that needs to be processed on a GPU is divided into multiple logical thread blocks. The grid consists of all the thread blocks, and each thread block consists of multiple threads. In general, each block is processed by a single SM and each core processes one thread.
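For concreteness, the following is a minimal sketch of this five-section structure. The kernel name scale, the buffer size, and the launch configuration are illustrative assumptions, not code taken from this thesis.

#include <cuda_runtime.h>

__global__ void scale(float *d_buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) d_buf[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_buf = new float[n]();
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));            // (1) allocate GPU memory
    cudaMemcpy(d_buf, h_buf, n * sizeof(float),
               cudaMemcpyHostToDevice);               // (2) copy CPU -> GPU
    scale<<<(n + 255) / 256, 256>>>(d_buf, n);        // (3) kernel execution
    cudaMemcpy(h_buf, d_buf, n * sizeof(float),
               cudaMemcpyDeviceToHost);               // (4) copy GPU -> CPU
    cudaFree(d_buf);                                  // (5) free GPU memory
    delete[] h_buf;
    return 0;
}

Here, the grid dimension (n + 255) / 256 and the block dimension 256 determine how the n threads are grouped into thread blocks.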
The GPU device driver schedules the blocks onto SMs, but the scheduling policy is not disclosed by the manufacturer. Hence, it is unknown which block will be processed by which SM. This has two disadvantages. First, depending on the number of thread blocks and the thread block size, not all SMs may be utilized, or an SM may not be fully utilized. Second, on each SM, more than one thread block can be concurrently processed if the total size of the thread blocks is less than or equal to the capacity of the SM. In that case, the execution time of a kernel cannot be predicted accurately. Hence, knowledge of the GPU scheduling policy is required. Amert et al. [3] found some details of the scheduling policy on the NVIDIA TX2, but it is unclear whether their findings are applicable to other NVIDIA architectures or to GPUs from different vendors. For this reason, a software technique is used that modifies the application to run its thread blocks on user-specified SMs [19, 9]. Although adding extra code to the application program incurs overhead, it gives programmers more flexibility and makes the application's execution time more predictable. In our research, we use a similar software approach to reserve SMs for an application program.
2.2 Related Work
Several research papers address computation techniques using GPUs in the real-time domain. Among these, TimeGraph [11] proposes a priority-based GPU scheduler to improve the responsiveness of tasks. Elliott et al. [6] presented two methods, the shared-resource method and the container method, for integrating GPUs with CPUs in soft real-time multiprocessor systems, and showed their performance benefits over a pure CPU system. In [7], the authors present an optimal k-exclusion locking protocol for multi-GPU systems in which GPUs are modeled as shared resources. GPUSync [8] provides a framework for managing a real-time multi-GPU system based on the k-exclusion locking protocol of [7]. The server-based approach [13] identifies two major limitations of the synchronization-based approaches of [7, 8], namely busy-waiting and priority inversion, and proposes a solution to both. While all of the approaches mentioned above consider the GPU as a shared resource, they do not allow multiple tasks to use the GPU at the same time. As a result, the GPU may be underutilized, and a process may wait a long time to access the GPU.

RGEM [10] addresses the non-preemptive behavior of the GPU, due to which the response time of a high-priority task may increase. It splits the data copy operation between CPU memory and GPU memory into several chunks and allows preemption at the splitting points, which reduces the response time of a high-priority task. GPES [22] also considers the non-preemptive nature of the GPU and breaks a kernel into multiple subkernels, as well as a data copy into multiple small copy operations; as a result, preemption can happen at the break points, and the long waiting time of a high-priority task is reduced. Basaran et al. [4] propose a technique for preemptive kernel execution and data copy operations on the GPU, which also supports concurrent kernel execution and copy operations to increase task responsiveness. Gdev [12] treats the GPU as a standard resource like the CPU and provides a management system at the OS level so that user-space programs as well as the OS kernel can use the GPU as a standard computing resource. Even though the above papers contribute to reducing the waiting time of high-priority tasks and improving responsiveness, none of them discusses concurrent execution of multiple kernels on the GPU to improve utilization.
Otterness et al. [16] discuss executing multiple kernels in parallel on the GPU of the NVIDIA TX1 and show that some benchmarks run slower than when they run as individual programs, but the paper offers no detailed explanation or quantification of the slowdown. Kernelet [21] is a runtime system that improves GPU utilization by slicing a kernel into sub-kernels and co-scheduling sub-kernels of different kernels. Bo et al. [19] propose a software technique to run an application program on specific SMs of the GPU regardless of the GPU's internal scheduling mechanism, giving the user the flexibility to assign SMs to a particular task. Janzen et al. [9] also discuss GPU partitioning among applications by allocating different SMs to different user programs. Wang et al. [20] propose simultaneous GPU multi-kernel execution via fine-grained sharing. Xu et al. [18] propose a warp slicer to allow intra-SM slicing for GPU multi-programming. While these papers [21, 19, 9, 20, 18] provide mechanisms to improve utilization, none of them provides a schedulability analysis for real-time systems, so whether these methods are applicable to real-time systems remains to be investigated. In this research, we present a GPU reservation method for multiple GPU-using tasks, run the tasks in parallel on the GPU for higher utilization, and provide a complete formulation to analyze the schedulability of the tasks in the real-time domain.
2.3 Motivation
The GPU resource utilization can be improved by launching multiple kernels at the same time and executing them on the SMs simultaneously. Recent GPUs support such multi-kernel execution, but it introduces unpredictability in the completion times of the kernels, which is not acceptable in real-time systems. For example, Figure 2.2 shows an nvprof timeline of four kernels launched at the same time. Each kernel has four thread blocks, and the system has 13 SMs in total. The computation inside each kernel is exactly the same. The timeline shows that two kernels finish execution earlier than the other two, but which two finish earlier and which two are delayed is not deterministic. Also, there is no proper way to know the amount of delay. This issue needs to be addressed before multi-kernel execution can be used in real-time systems.

[Figure 2.2: Multi-kernel Execution]
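To make the setup of this experiment concrete, the sketch below launches four identical kernels concurrently, one per CUDA stream, with four thread blocks each; the kernel body and buffer sizes are illustrative assumptions rather than the actual benchmark code used for Figure 2.2.

#include <cuda_runtime.h>

__global__ void busy_kernel(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = buf[i];
        for (int k = 0; k < 100000; ++k)  // dummy compute load
            v = v * 1.000001f + 0.5f;
        buf[i] = v;
    }
}

int main() {
    const int kKernels = 4, kBlocks = 4, kThreads = 256;
    const int n = kBlocks * kThreads;
    cudaStream_t streams[kKernels];
    float *bufs[kKernels];
    for (int s = 0; s < kKernels; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&bufs[s], n * sizeof(float));
        // Distinct streams allow the four kernels to run concurrently;
        // which SMs each kernel's blocks land on is up to the driver,
        // which is exactly the source of the unpredictability above.
        busy_kernel<<<kBlocks, kThreads, 0, streams[s]>>>(bufs[s], n);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < kKernels; ++s) {
        cudaFree(bufs[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}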
As mentioned earlier, software techniques [19, 9] can be used to run GPU kernels on user-defined SMs. We therefore performed experiments to observe the execution behavior of eight GPU benchmarks under different numbers of SMs assigned to each kernel. In these experiments, multi-kernel execution is not used; in other words, each benchmark is executed separately. Figures 2.3 and 2.4 show the normalized GPU execution time of the benchmarks as the number of assigned SMs varies from 1 to 13 on the GTX970 and from 1 to 2 on the TX2, since the GTX970 has 13 SMs and the TX2 has only 2 SMs.

[Figure 2.3: Normalized execution time vs. number of SMs (1 to 13) on GTX970 for MMUL, backprop, b+tree, hotspot, kmeans, workzone, stereoDisparity, and streamcluster]

[Figure 2.4: Normalized execution time vs. number of SMs (1 to 2) on TX2 for the same benchmarks]

From these experiments, we see that the execution time of none of the benchmarks decreases linearly. In most cases, the execution time plateaus after a certain number of SMs, and assigning more SMs beyond that point yields little benefit. One benchmark does not show any significant change for varying numbers of SMs; this can happen when an application has only a small number of thread blocks. From this observation, we can conclude that running multiple kernels, each on its own specified SMs, can be more beneficial than running one application at a time on the GPU.
2.4 System Model
The system considered in this work is equipped with a multi-core CPU and a general-purpose GPU. The CPU has $N_P$ cores, where all cores are identical and run at a fixed frequency. The GPU is assumed to follow the architecture described in Section 2.1. In that GPU, there are $N_{EE}$ execution engines (EEs), which are equivalent to streaming multiprocessors in NVIDIA GPUs and core groups in ARM Mali GPUs. We assume that the GPU has one copy engine (CE), which is typical in many of today's GPUs, and that it handles copy requests on a first-come first-served basis. The GPU memory is assumed to be sufficiently large for all GPU-using tasks in the system.
We focus on partitioned fixed-priority preemptive task scheduling due to its wide acceptance in real-time OSs such as the QNX RTOS [1]. Any fixed-priority assignment can be used, such as Rate Monotonic (RM) and Deadline Monotonic (DM). For the task model, we consider sporadic tasks with constrained deadlines. Each instance (job) of a task consists of CPU segments and GPU segments. As their names imply, CPU segments run entirely on the CPU, and GPU segments include GPU operations, e.g., data copy from/to the GPU and kernel execution. Once a task launches a GPU kernel, the task may self-suspend to save CPU cycles. The kernel execution time depends on the number of EEs assigned to the task. Specifically, a task $\tau_i$ is characterized as follows:

$$\tau_i := (C_i, G_i(k), T_i, D_i, \theta_i, \eta_i)$$

where
• $C_i$: The sum of the worst-case execution times (WCETs) of the CPU segments of each job of $\tau_i$
• $G_i(k)$: The sum of the worst-case durations of the GPU segments of each job of $\tau_i$, when $k$ EEs are assigned to $\tau_i$ and no other task is using the GPU
• $T_i$: The minimum inter-arrival time (or period) of $\tau_i$
• $D_i$: The relative deadline of each job of $\tau_i$
• $\theta_i$: The number of CPU segments of each job of $\tau_i$
• $\eta_i$: The number of GPU segments of each job of $\tau_i$
$\tau_{i,j}$ and $\tau^{*}_{i,j}$ are used to denote the $j$-th CPU and GPU segments of $\tau_i$, respectively. Note that we do not make any assumption about the sequence of CPU and GPU segments; hence, a task may have two consecutive GPU segments. $G_i(k)$ is assumed to be non-increasing with $k$, i.e., $G_i(k) \geq G_i(k+1)$. This assumption can be easily met by monotonic over-approximations [2, 14]. The number of EEs assigned to each task is statically determined and does not change at runtime.
We use $G_{i,j}$ to denote the worst-case duration of $\tau^{*}_{i,j}$ (the $j$-th GPU segment of a task $\tau_i$). Hence, $G_i(k) = \sum_{j=1}^{\eta_i} G_{i,j}(k)$. Without loss of generality, each GPU segment is assumed to have three sub-segments: (i) data copy to the GPU, (ii) kernel execution, and (iii) data copy back from the GPU. Thus, each GPU segment uses the CE up to two times. In this model, more than one consecutive kernel can be represented with multiple GPU segments. $\tau^{*}_{i,j}$ is characterized by the following parameters:

$$\tau^{*}_{i,j} := (G^{mhd}_{i,j},\; G^{e}_{i,j}(k),\; G^{mdh}_{i,j})$$
where
• $G^{mhd}_{i,j}$: The WCET of miscellaneous operations executed before the GPU kernel in $\tau^{*}_{i,j}$, e.g., memory copy from the host to the device
• $G^{e}_{i,j}(k)$: The WCET of the GPU kernel of $\tau^{*}_{i,j}$ on $k$ EEs
• $G^{mdh}_{i,j}$: The WCET of miscellaneous operations executed after the GPU kernel in $\tau^{*}_{i,j}$, e.g., memory copy from the device to the host
$G^{e}_{i,j}(k)$ is the kernel execution time on the GPU, and $G^{mhd}_{i,j}$ and $G^{mdh}_{i,j}$ are the CPU time consumed for data copy between CPU and GPU memory, kernel launch commands, the notification of kernel completion, etc. Note that $G_{i,j}(k) \leq G^{mhd}_{i,j} + G^{e}_{i,j}(k) + G^{mdh}_{i,j}$, as kernel execution on the GPU may overlap with miscellaneous operations on the CPU. For simplicity, we may omit $k$ and use $G_{i,j}$ to refer to $G_{i,j}(k)$. This rule also applies to other GPU-segment parameters, e.g., $G^{e}_{i,j} \equiv G^{e}_{i,j}(k)$. We use $G^{m}_{i,j}$ to represent the sum of $G^{mhd}_{i,j}$ and $G^{mdh}_{i,j}$, i.e., $G^{m}_{i,j} = G^{mhd}_{i,j} + G^{mdh}_{i,j}$.

The CPU utilization of $\tau_i$ is defined as $U_i = (C_i + G_i)/T_i$ if $\tau_i$ busy-waits during kernel execution, and $U_i = (C_i + G^{m}_i)/T_i$, where $G^{m}_i = \sum_{j=1}^{\eta_i} G^{m}_{i,j}$, if $\tau_i$ self-suspends during kernel execution.
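For illustration (the numbers here are hypothetical, not taken from the thesis): consider a task with $C_i = 2$ ms, $G_i = 4$ ms, $G^{m}_i = 1$ ms, and $T_i = 20$ ms. Under busy-waiting, $U_i = (2 + 4)/20 = 0.30$; under self-suspension, $U_i = (2 + 1)/20 = 0.15$, since the 3 ms of pure kernel execution then consumes no CPU cycles.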
Chapter 3

Spatio-Temporal GPU Reservation Framework
3.1 Reservation Design
The GPU reservation approach allocates GPU resources to applications both spatially and temporally. The motivation for using GPU reservation is to reduce the long waiting times of tasks and to increase GPU utilization.
When GPU kernels are launched, thread blocks are executed on SMs according to the scheduling policy of the device driver, which is not revealed by the manufacturers. Consequently, the application programmer has little control over GPU resources. The spatial reservation method reserves GPU resources, i.e., SMs, for each task that requires GPU computation. When the SMs are specified for an application, its execution time becomes more predictable. The number of SMs assigned to each task is determined by the algorithm in Section 3.3. A small application modification is required to use spatial GPU reservation. In this research, the application is modified by creating a mapping array that declares the SM IDs reserved for a task and passing that array when launching the kernel. In the device code, when a block starts execution, it first checks the mapping to determine whether it is on a reserved SM. If it is not, it immediately stops execution; if it is, it continues. In this scheme, the grid dimension is enlarged so that all logical blocks are executed on the reserved SMs, as illustrated by the sketch below.
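The following device-side sketch illustrates this transformation. It assumes an NVIDIA GPU (the %smid special register is NVIDIA-specific), and the names sm_mask, next_block, and reserved_kernel are our illustrative choices rather than the thesis's actual implementation; the persistent-block work distribution via an atomic counter is one common way to realize the approach of [19, 9].

// Returns the ID of the SM that the calling block is running on.
__device__ unsigned int get_smid(void) {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// sm_mask[s] is 1 if SM s is reserved for this task, 0 otherwise.
// next_block is a device counter zeroed before each launch.
__global__ void reserved_kernel(const int *sm_mask, unsigned int *next_block,
                                int num_logical_blocks, float *data, int n) {
    // A block scheduled onto a non-reserved SM exits immediately,
    // leaving that SM free for other tasks.
    if (!sm_mask[get_smid()])
        return;
    __shared__ unsigned int vblock;  // logical block index being processed
    while (true) {
        if (threadIdx.x == 0)
            vblock = atomicAdd(next_block, 1u);  // grab the next logical block
        __syncthreads();
        if (vblock >= (unsigned int)num_logical_blocks)
            return;                              // all work has been consumed
        int i = vblock * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                     // placeholder computation
        __syncthreads();  // finish before thread 0 overwrites vblock
    }
}

On the host side, the kernel would be launched with an enlarged grid, e.g., enough blocks to cover every SM at least once, so that a sufficient number of blocks survives on the reserved SMs while the logical blocks of the original grid are still all processed exactly once.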
The temporal reservation of the GPU allows multiple tasks to run in parallel on the GPU. As multiple tasks use the GPU concurrently, overall utilization increases. In this research, the CUDA Multi-Process Service (MPS) is enabled to allow GPU co-scheduling, which has a negligible overhead in the system. When multiple tasks run on the GPU at the same time with their reserved SM IDs, there are two cases. First, a task may not share any SM ID with the other tasks accessing the GPU at that moment. In this case, when the task issues a GPU kernel launch command, the kernel can start execution on the GPU immediately. Second, the task may have SM IDs in common with other tasks. In this case, after issuing the kernel launch command, the task needs to wait until all previous kernels with shared SM IDs have finished. Here, we assume that tasks are dispatched to the GPU in FIFO order. So, even if a high-priority task launches a GPU kernel that shares SM IDs with previously launched low-priority tasks running on the GPU, it must wait for the completion of all the tasks ahead of it in the FIFO queue.
In order to minimize interference during GPU segment execution, we adopt the priority-boosting mechanism widely used for predictable shared-resource control [5, 13, 17]. Specifically, $\tau_i$'s priority is raised to the highest priority level when $\tau_i$ begins a GPU segment, and it is reverted to $\tau_i$'s original priority when $\tau_i$ finishes that GPU segment. In this way, no CPU segments of other tasks assigned to the same CPU core can preempt $\tau_i$ during the interval of the GPU segment.
During kernel execution, GPU-using tasks may either busy-wait or self-suspend.
This is configurable in many GPU programming environments, such as CUDA and OpenCL.
In general, busy-waiting is good for short kernels as it does not cause any additional scheduling overhead, and self-suspension is good for long kernels as it saves CPU time. Following this line of reasoning, our framework supports both modes, and also applies the chosen mode to any task waiting for GPU resources to be ready. Hence, if self-suspension (or busy-waiting) is chosen, our framework lets tasks suspend (or busy-wait) not only when they execute GPU kernels but also when they wait for the CE and any shared EEs.
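In CUDA, for example, the waiting mode can be selected per process through device flags set before the CUDA context is created; a minimal sketch follows (the function name configure_wait_mode is ours):

#include <cuda_runtime.h>

// cudaDeviceScheduleSpin: the CPU thread busy-waits on synchronization
// calls (suited to short kernels). cudaDeviceScheduleBlockingSync: the
// CPU thread blocks, i.e., self-suspends, until the GPU work completes
// (suited to long kernels).
void configure_wait_mode(bool self_suspend) {
    cudaSetDeviceFlags(self_suspend ? cudaDeviceScheduleBlockingSync
                                    : cudaDeviceScheduleSpin);
}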
3.2 Admission Control
The admission control module of our framework checks the schedulability of a task under a given resource allocation by computing the task’s worst-case response time
(WCRT). As our framework supports self-suspension and busy-waiting modes, we describe the schedulability analyses for both modes in the following.
3.2.1 Self-suspension Mode
If self-suspension is used, the WCRT of $\tau_i$ is upper-bounded by the following recurrence equation:

$$W_i^{k+1} = C_i + G_i + B_i + \sum_{\tau_h \in hp(\tau_i)} \left\lceil \frac{W_i^{k} + (W_h - (C_h + G_h^{m}))}{T_h} \right\rceil (C_h + G_h^{m}) \qquad (3.1)$$

where $C_i$ is the CPU computation time of $\tau_i$, $G_i$ is the total GPU computation time of $\tau_i$, and $B_i$ is the total blocking time caused by GPU access. Since higher-priority tasks self-suspend during GPU computation, the term $(W_h - (C_h + G_h^{m}))$ is added inside the ceiling function to capture the resulting release jitter. Also, the copy operations require CPU cycles, so a high-priority task cannot self-suspend while copying data from CPU to GPU or from GPU to CPU; that is why $G_h^{m}$ is added to $C_h$.
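Recurrence (3.1) is solved by standard fixed-point iteration, starting from $W_i^{0} = C_i + G_i + B_i$ and iterating until the value converges or exceeds the deadline. The host-side sketch below illustrates this; the Task structure and function name are our illustrative assumptions, and $B_i$ is taken as a precomputed input from equation (3.2).

#include <cmath>
#include <vector>

struct Task {
    double C;   // total WCET of CPU segments, C_i
    double G;   // total worst-case GPU segment duration, G_i(k)
    double Gm;  // total copy/miscellaneous CPU time, G_i^m
    double T;   // minimum inter-arrival time, T_i
    double D;   // relative deadline, D_i
    double W;   // previously computed WCRT (known for higher-priority tasks)
};

// WCRT of task ti in self-suspension mode, per recurrence (3.1).
// hp holds the higher-priority tasks on the same CPU core.
// Returns -1 if the response time exceeds the deadline.
double wcrt_self_suspension(const Task &ti, double Bi,
                            const std::vector<Task> &hp) {
    double W = ti.C + ti.G + Bi;  // initial value W_i^0
    while (true) {
        double next = ti.C + ti.G + Bi;
        for (const Task &th : hp) {
            double jitter = th.W - (th.C + th.Gm);  // self-suspension jitter
            next += std::ceil((W + jitter) / th.T) * (th.C + th.Gm);
        }
        if (next > ti.D) return -1.0;  // unschedulable
        if (next == W) return W;       // converged to the fixed point
        W = next;
    }
}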
The blocking time $B_i$ is computed as follows:

$$B_i = B_i^{m} + B_i^{e} + B_i^{l} \qquad (3.2)$$

where $B_i^{m}$ is the blocking time from GPU data copy and miscellaneous operations in GPU segments, $B_i^{e}$ is the blocking time from kernel execution, and $B_i^{l}$ is the blocking time from priority inversion.
The blocking time from a sub-segment for data copy and miscellaneous operations in the $j$-th GPU segment of $\tau_i$ is upper-bounded by:

$$B_{i,j}^{m} = \sum_{\tau_u \neq \tau_i \,\wedge\, \eta_u > 0} \max_{1 \leq w \leq \eta_u} G_{u,w}^{m*} \qquad (3.3)$$

where $G_{u,w}^{m*} = \max(G_{u,w}^{mhd}, G_{u,w}^{mdh})$. If there is one Copy Engine (CE) in the GPU, $G_{u,w}^{m*}$ is the maximum of the host-to-device copy time and the device-to-host copy time. If $\tau_u$ has $\eta_u$ GPU segments, each segment runs on the GPU one at a time. So, we have to capture the
[Figure 3.1: Example schedule of GPU-using tasks showing the blocking times in self-suspension mode. Legend: CPU execution, copy operation, GPU execution]