UNIVERSITY OF CALIFORNIA RIVERSIDE

Spatio-Temporal GPU Management for Real-Time Cyber-Physical Systems

A Thesis submitted in partial satisfaction of the requirements for the degree of Master of Science in Electrical Engineering

by Sujan Kumar Saha

March 2018

Thesis Committee:
Dr. Hyoseung Kim, Chairperson
Dr. Nael Abu-Ghazaleh
Dr. Daniel Wong

Copyright by Sujan Kumar Saha, 2018

The Thesis of Sujan Kumar Saha is approved:

Committee Chairperson

University of California, Riverside

Acknowledgments

First, I would like to thank my supervising professor, Dr. Hyoseung Kim. His unconditional assistance, guidance, and support helped me finally accomplish this work. I also wish to express my deepest appreciation to Dr. Nael Abu-Ghazaleh and Dr. Daniel Wong, who served as my committee members, for their valuable advice and generous help. I offer my special thanks to my lab-mates, Ankit Juneja, Ankith Rakesh Kumar, and Yecheng Xiang, for helping me in different aspects. Without their generous help, this work would not have been successful. Finally, it is an honor for me to thank my family, especially my mom, Mrs. Anjana Rani Saha, my sister, Poly Saha, and my brother, Abhijeet Saha. All of their love and support encouraged me to overcome all the challenges that I have faced. From the bottom of my heart, thank you all.

To my parents, for all the support.

ABSTRACT OF THE THESIS

Spatio-Temporal GPU Management for Real-Time Cyber-Physical Systems

by Sujan Kumar Saha

Master of Science, Graduate Program in Electrical Engineering
University of California, Riverside, March 2018
Dr. Hyoseung Kim, Chairperson

General-purpose graphics processing units (GPUs) have been considered a promising technology to address the high computational demands of real-time data-intensive applications. Many of today's embedded processors already provide on-chip GPUs, the use of which can greatly help satisfy the timing requirements of data-intensive tasks by accelerating their executions. However, the current state of the art in real-time GPU management still lacks the properties required for efficient and certifiable real-time GPU computing. For example, existing real-time systems execute GPU workloads sequentially to guarantee predictable GPU access time, which significantly underutilizes the GPU and exacerbates temporal dependency among the workloads.

In this research, we propose a spatio-temporal GPU management framework for real-time cyber-physical systems. Our proposed framework explicitly manages the allocation of the GPU's internal execution engines. This approach allows multiple GPU-using tasks to execute simultaneously on the GPU, thereby improving GPU utilization and reducing response time. It can also improve temporal isolation by allocating a portion of the GPU execution engines to tasks for their exclusive use. We have implemented a prototype of the proposed framework for a CUDA environment. A case study using this implementation on two NVIDIA GPUs, the GeForce GTX 970 and the Jetson TX2, shows that our framework reduces the response time of GPU execution segments in a predictable manner by executing them in parallel. Experimental results with randomly generated tasksets indicate that our framework yields a significant schedulability benefit compared to the existing approach.

Contents

List of Figures
List of Tables
1 Introduction
2 Background and Related Work
  2.1 GPU Organization and Kernel Execution
  2.2 Related Work
  2.3 Motivation
  2.4 System Model
3 Spatio-Temporal GPU Reservation Framework
  3.1 Reservation Design
  3.2 Admission Control
    3.2.1 Self-suspension Mode
    3.2.2 Busy-waiting Mode
  3.3 Resource Allocator
  3.4 Reservation-based Program Transformation
4 Evaluation
  4.1 Implementation
  4.2 Overhead Estimation
  4.3 Case Study
  4.4 Schedulability Results
5 Conclusions
Bibliography

List of Figures

2.1 Overview of GPU Architecture
2.2 Multi-kernel Execution
2.3 Execution Time vs. Number of SMs on GTX970
2.4 Execution Time vs. Number of SMs on TX2
3.1 Example Schedule of GPU-using Tasks Showing the Blocking Times in Self-suspending Mode
3.2 Normalized Execution Time vs. Different Par Values on GTX970
3.3 Normalized Execution Time vs. Different Par Values on TX2
4.1 Percentage Overhead of Selected Benchmarks on GTX970
4.2 Percentage Overhead of Selected Benchmarks on TX2
4.3 Comparison of Kernel Execution on GTX970
4.4 Comparison of Kernel Execution on TX2
4.5 Schedulability w.r.t. Number of Tasks in a Taskset
4.6 Schedulability w.r.t. Number of SMs
4.7 Schedulability w.r.t. Number of GPU Segments
4.8 Schedulability w.r.t. Ratio of C to G
4.9 Schedulability w.r.t. Ratio of Number of GPU Tasks to Number of CPU Tasks

List of Tables

4.1 Parameters for Taskset Generation

Chapter 1

Introduction

Massive data streams generated by recent embedded and cyber-physical applications pose substantial challenges in satisfying real-time processing requirements. For example, in self-driving cars, data streams from tens of sensors, such as cameras and laser range finders (LIDARs), must be analyzed in a timely manner so that the results of processing can be delivered to path/behavior planning algorithms with short and bounded delay. This requirement of real-time processing is particularly important in safety-critical domains such as automotive, unmanned vehicles, avionics, and industrial automation, where any transient violation of timing constraints may lead to system failures and catastrophic losses.

General-purpose graphics processing units (GPUs) have been considered a promising technology to address the high computational demands of real-time data streams. Many of today's embedded processors, such as the NVIDIA TX1/TX2 and the NXP i.MX series, already have on-chip GPUs, the use of which can greatly help satisfy the timing requirements of data-intensive tasks by accelerating their executions. The stringent size, weight, power, and cost constraints of embedded and cyber-physical systems are also expected to be substantially mitigated by GPUs.

For the safe use of GPUs, much research has been done in the real-time systems community on scheduling GPU-using tasks with timing constraints [6, 8, 7, 10, 11, 15, 22]. However, the current state of the art has the following limitations in efficiency and predictability. First, existing real-time GPU management schemes significantly underutilize GPUs while providing predictable GPU access time. They allow a GPU to be accessed by only one task at a time, which can cause unnecessarily long waiting delays when multiple tasks need to access the GPU. This problem becomes worse in an embedded computing environment, where each machine typically has only a limited number of GPUs, e.g., one on-chip GPU on the latest NVIDIA TX2 processor. Second, system support for strong temporal isolation among GPU workloads is not yet provided.
In a mixed-criticality system, low-criticality tasks and high-criticality tasks may share the same GPU. If low-criticality tasks use the GPU for longer than expected, the timing of high-criticality tasks can easily be jeopardized. Also, if both types of tasks execute concurrently on the GPU, it is unpredictable how much temporal interference may occur.

In this research, we propose a spatio-temporal GPU reservation framework to address the aforementioned limitations. The key contribution of this work is the explicit management of the GPU's internal execution engines, e.g., streaming multiprocessors on NVIDIA GPUs and core groups on ARM Mali GPUs. With this approach, a single GPU is divided into multiple logical units, and a fraction of the GPU can be exclusively reserved for each (or a group of) time-critical task(s). This allows simultaneous execution of multiple tasks on a single GPU, which can potentially eliminate the waiting time for GPU execution and achieve strong temporal isolation among tasks. Since recent GPUs have multiple execution engines and many GPU applications are not programmed to fully utilize them, our proposed framework is a viable solution for efficiently and safely sharing the GPU among tasks with different criticalities. In addition, our framework substantially improves task schedulability through fine-grained allocation of GPU resources at the execution-engine level.

As a proof of concept, we have implemented our framework in a CUDA programming environment. A case study using this implementation on two NVIDIA GPUs, the GeForce GTX 970 and the Jetson TX2, shows that our framework reduces the response time of GPU execution segments in a predictable manner. Experimental results with randomly generated tasksets indicate that our framework yields a significant schedulability benefit compared to the existing approach. Our GPU framework does not require any specific hardware support or detailed internal scheduling information. Thus, it is readily applicable to COTS GPUs from various vendors, e.g., AMD, ARM, NVIDIA, and NXP.

The rest of the thesis is organized as follows. Chapter 2 describes background knowledge about GPU architecture, the motivation for this work, and related prior work. Our proposed GPU reservation framework is explained in detail in Chapter 3. Chapter 4 presents the evaluation methodology and result analysis. Finally, we conclude in Chapter 5.

Chapter 2

Background and Related Work

2.1 GPU Organization and Kernel Execution

GPUs are used as accelerators alongside CPUs in modern computing systems. Their highly parallel structure makes them more efficient than general-purpose CPUs for data-intensive applications. Figure 2.1 shows a high-level overview of the internal structure of a GPU.
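On NVIDIA GPUs, the execution engines are the streaming multiprocessors (SMs), and their count can be read at runtime through the standard CUDA runtime API. The following minimal sketch (our illustration, not code from the thesis) queries the property that a spatial reservation scheme would partition; for reference, the GeForce GTX 970 exposes 13 SMs and the Jetson TX2 exposes 2.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    // multiProcessorCount is the pool of execution engines that a
    // spatial reservation would divide among tasks.
    printf("%s: %d SMs, up to %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
    return 0;
}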
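Kernels submitted to different CUDA streams may be co-scheduled by the hardware when neither kernel occupies all SMs, which is the property simultaneous execution relies on. Below is a hedged sketch of such a multi-kernel launch; the kernel bodies and launch sizes are illustrative, not taken from the thesis.

// Two independent kernels launched into separate CUDA streams may
// execute concurrently if each leaves some SMs idle.
__global__ void taskA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f;
}
__global__ void taskB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] + 1.0f;
}

void launch_concurrently(float *dx, float *dy, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // Small grids (few thread blocks) underutilize the GPU when run
    // alone; in separate streams the hardware may co-schedule them.
    taskA<<<4, 256, 0, s1>>>(dx, n);
    taskB<<<4, 256, 0, s2>>>(dy, n);
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}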
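CUDA exposes no API for pinning a kernel to particular SMs, so SM-level reservations are usually realized by transforming the kernel itself; the thesis dedicates Section 3.4 to such a reservation-based program transformation. One technique common in SM-partitioning research, sketched here purely as an assumption about what such a transformation can look like, reads the %smid PTX register and retires thread blocks that land on non-reserved SMs.

// Hypothetical sketch of SM-level gating; not the thesis's actual code.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));  // which SM am I on?
    return smid;
}

__global__ void gated_scale(float *x, int n,
                            unsigned int sm_lo, unsigned int sm_hi,
                            unsigned int *next_tile) {
    // Blocks on non-reserved SMs exit at once, leaving those SMs
    // free for other tasks' kernels.
    unsigned int sm = get_smid();
    if (sm < sm_lo || sm > sm_hi) return;
    // Surviving blocks pull tiles from a global work queue so that
    // no work is lost with the retired blocks.
    __shared__ unsigned int tile;
    while (true) {
        if (threadIdx.x == 0) tile = atomicAdd(next_tile, 1u);
        __syncthreads();
        if (tile * blockDim.x >= (unsigned int)n) break;  // uniform exit
        unsigned int i = tile * blockDim.x + threadIdx.x;
        if (i < (unsigned int)n) x[i] *= 2.0f;
        __syncthreads();  // all threads done with 'tile' before reuse
    }
}
// Launch with more blocks than SMs (e.g., <<<2 * numSMs, 256>>>) so
// the reserved SMs are guaranteed to receive blocks, and zero
// *next_tile on the host before each launch.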