Real-Time Scheduling for Gpus with Applications in Advanced Automotive Systems

REAL-TIME SCHEDULING FOR GPUS WITH APPLICATIONS IN ADVANCED AUTOMOTIVE SYSTEMS

Glenn A. Elliott

A dissertation submitted to the faculty at the University of North Carolina at Chapel Hill in partial fulﬁllment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science.

Chapel Hill 2015

Approved by:

James H. Anderson

Sanjoy K. Baruah

Anselmo Lastra

F. Donelson Smith

Lars S. Nyland

ii ABSTRACT

Glenn A. Elliott: Real-Time Scheduling for GPUs with Applications in Advanced Automotive Systems (Under the direction of James H. Anderson)

Self-driving cars, once constrained to closed test tracks, are beginning to drive alongside human drivers on public roads. Loss of life or property may result if the computing systems of automated vehicles fail to respond to events at the right moment. We call such systems that must satisfy precise timing constraints

“real-time systems.” Since the 1960s, researchers have developed algorithms and analytical techniques used in the development of real-time systems; however, this body of knowledge primarily applies to traditional

CPU-based platforms. Unfortunately, traditional platforms cannot meet the computational requirements of self-driving cars without exceeding the power and cost constraints of commercially viable vehicles. We argue that modern graphics processing units, or GPUs, represent a feasible alternative, but new algorithms and analytical techniques must be developed in order to integrate these uniquely constrained processors into a real-time system.

The goal of the research presented in this dissertation is to discover and remedy the issues that prevent the use of GPUs in real-time systems. To overcome these issues, we design and implement a real-time multi-GPU scheduler, called GPUSync. GPUSync tightly controls access to a GPU’s computational and

DMA processors, enabling simultaneous use despite potential limitations in GPU hardware. GPUSync enables tasks to migrate among GPUs, allowing new classes of real-time multi-GPU computing platforms.

GPUSync employs heuristics to guide scheduling decisions to improve system efﬁciency without risking violations in real-time constraints. GPUSync may be paired with a wide variety of common real-time CPU schedulers. GPUSync supports closed-source GPU runtimes and drivers without loss in functionality.

We evaluate GPUSync with both analytical and runtime experiments. In our analytical experiments, we model and evaluate over fifty configurations of GPUSync. We determine which configurations support the greatest computational capacity while maintaining real-time constraints. In our runtime experiments, we

iii execute computer vision programs similar to those found in automated vehicles, with and without GPUSync.

Our results demonstrate that GPUSync greatly reduces jitter in video processing.

Research into real-time systems with GPUs is a new area of study. Although there is prior work on such systems, no other GPU scheduling framework is as comprehensive and ﬂexible as GPUSync.

iv To the McKenzies

v ACKNOWLEDGEMENTS

When I joined the computer science department at UNC, I largely believed that research as a graduate student was an independent affair: my success or failure depended entirely upon myself. I certainly would have failed had this been true. The completion of this dissertation is due in part to the knowledge, guidance, aid, and even love, from those I met and worked with during my time at Chapel Hill. I owe them all a great debt of gratitude.

First and foremost, I wish to thank my advisor, Jim Anderson, for taking a chance on a student who knew nothing of real-time systems and only thought “research on GPUs might be interesting.” On more than one occasion, Jim’s encouragement made it possible for me to complete a successful paper before a submission deadline when I had been ready to throw in the towel a day before. I would also like to thank the members of my committee, Sanjoy Baruah, Anselmo Lastra, Don Smith, Lars Nyland, and Shinpei Kato for their guidance in my research.

I also wish thank my wonderful colleagues of the Real-Time Systems Research Group at UNC. In particular, I am especially thankful for Björn Brandenburg and Bryan Ward. Björn’s tireless work on the

LITMUSRT operating system, along with his techniques for real-time analysis, set the stage upon which I performed my work. The results of Bryan’s research on locking protocols (as well as Björn’s) ﬁgures heavily in the research presented herein. My work simply would not have been possible without the contributions of these two. I would also like to thank fellow LITMUSRT developers Andrea Bastoni, Jonathan Herman, and Chris Kenna for helping me work through software bugs at any hour, as well as their willingness to offer advice during the implementation of the ideas behind my research. Additionally, I wish to thank my conference paper co-authors, Jeremy Erickson, Namhoon Kim, Cong Liu, and Kecheng Yang. I throughly enjoyed working with each of you. Finally, I would like to thank all of the other members of the real-time systems group, including Bipasa Chattopadhyay, Zhishan Guo, Hennadiy Leontyev, Haohan Li, and Mac

Mollison.

I am also grateful for the opportunities NVIDIA afforded me. My internships with the company allowed me to “peer behind the curtain” to see how GPUs and GPGPU runtimes are made. I thank Bryan Catanzaro,

vi Vlad Glavtchev, Mark Hairgrove, Chris Lamb, and Thierry Lepley for their guidance. I also wish to especially thank Elif Albuz and Michael Garland for giving me a free hand to explore.

I am also indebted to the staff of the UNC computer science department. I owe special thanks to Mike

Stone for keeping my computing hardware going, Melissa Wood for keeping grant applications on track, and

Jodie Turnbull for keeping the graduate school paperwork in order.

Finally, I wish to thank my parents and sisters for their love and support during the long years I was in

North Carolina—thank you for not telling me that I was crazy for going back to school, even if you may have thought it, and even if you were right! I also want to thank Ryan and Leann McKenzie for their friendship; words cannot express how grateful I am. Lastly, I wish to thank Kim Kutz for her continued patience, love, and support, which sometimes took the form of fresh out-of-the-oven strawberry-rhubarb pie delivered to

Sitterson Hall at one in the morning.

The funding for this research was provided by NSF grants CNS 0834132, CNS 0834270, CNS 1016954,

CNS 1115284, CNS 1218693, CNS 1409175, CPS 1239135, and CPS 1446631; AFOSR grants FA9550-09-

1-0549 and FA9550-14-1-0161; AFRL grant FA8750-11-1-0033; and ARO grants W911NF-14-1-0499 and

W911NF-09-1-0535; with additional grants from General Motors, AT&T, and IBM.

vii TABLE OF CONTENTS

LIST OF TABLES ...... xiv

LIST OF FIGURES ...... xvi

LIST OF ABBREVIATIONS ...... xx

Chapter 1: Introduction ...... 1

1.1 Real-Time Systems ...... 2

1.2 Graphics Processing Units ...... 3

1.3 Real-Time GPU Applications ...... 6

1.4 An Introduction to GPGPU Programming ...... 7

1.4.1 GPGPU Programming ...... 7

1.4.2 Real-Time GPU Scheduling ...... 9

1.4.3 Real-Time Multi-GPU Scheduling ...... 11

1.5 Thesis Statement ...... 11

1.6 Contributions ...... 12

1.6.1 A Flexible Real-Time Multi-GPU Scheduling Framework ...... 12

1.6.2 Techniques for Supporting Closed-Source GPGPU Software ...... 13

1.6.3 Support for Graph-Based GPGPU Applications ...... 13

1.6.4 Implementation and Evaluation...... 14

1.7 Organization ...... 14

Chapter 2: Background and Prior Work...... 15

2.1 Multiprocessor Real-Time Scheduling ...... 15

2.1.1 Sporadic Task Model ...... 15

2.1.2 Rate-Based Task Model ...... 17

viii 2.1.3 Processing Graph Method ...... 18

2.1.4 Scheduling Algorithms ...... 21

2.1.5 Schedulability Tests and Tardiness Bounds ...... 23

2.1.6 Locking Protocols ...... 27

2.1.6.1 Priority Inversions and Progress Mechanisms...... 28

2.1.6.2 Nested Locking ...... 33

2.1.6.3 Priority-Inversion Blocking...... 35

2.1.7 Multiprocessor k-Exclusion Locking Protocols ...... 37

2.1.7.1 The k-FMLP ...... 38

2.1.7.2 The R2DGLP ...... 41

2.1.7.3 The CK-OMLP ...... 44

2.1.8 Accounting for Overheads in Schedulability Tests ...... 46

2.1.8.1 Preemption-Centric Accounting ...... 47

2.1.8.2 Locking Protocol Overheads ...... 50

2.1.9 Integration of PI-Blocking and Overhead Accounting...... 52

2.2 Real-Time Operating Systems ...... 54

2.2.1 Basic RTOS Requirements ...... 54

2.2.2 LITMUSRT ...... 56

2.2.3 Interrupt Handling ...... 57

2.2.3.1 Linux ...... 58

2.2.3.2 PREEMPT_RT ...... 60

2.2.3.3 Towards Ideal Real-Time Bottom-Half Scheduling ...... 62

2.3 Review of Embedded GPUs and Accelerators ...... 66

2.4 GPGPU Mechanics...... 70

2.4.1 Software Architecture...... 70

2.4.2 Hardware Architecture ...... 74

2.4.2.1 Execution Engine ...... 76

2.4.2.2 Copy Engines...... 80

ix 2.4.3 Other Data Transfer Mechanisms ...... 81

2.4.4 Maintaining Engine Independence ...... 83

2.4.5 VectorAdd Revisited ...... 85

2.5 Prior Work on Accelerator Scheduling...... 87

2.5.1 Real-Time GPU Scheduling ...... 87

2.5.1.1 Persistent Kernels ...... 88

2.5.1.2 GPU Kernel WCET Estimation and Control ...... 88

2.5.1.3 GPU Resource Scheduling ...... 90

2.5.2 Real-Time DSPs and FPGA Scheduling ...... 92

2.5.3 Non-Real-Time GPU Scheduling ...... 94

2.5.3.1 GPU Virtualization ...... 94

2.5.3.2 GPU Resource Maximization ...... 96

2.5.3.3 Fair GPU Scheduling ...... 97

2.6 Conclusion ...... 99

Chapter 3: GPUSync ...... 100

3.1 Software Architectures For GPU Schedulers ...... 101

3.1.1 API-Driven vs. Command-Driven GPU Schedulers...... 101

3.1.2 Design-Space of API-Driven GPU Schedulers...... 102

3.1.2.1 GPU Scheduling in User-Space ...... 103

3.1.2.2 GPU Scheduling in Kernel-Space ...... 108

3.2 Design ...... 114

3.2.1 Synchronization-Based Philosophy ...... 114

3.2.2 System Model ...... 115

3.2.3 Resource Allocation ...... 115

3.2.3.1 High-Level Description ...... 115

3.2.3.2 GPU Allocator...... 117

3.2.3.3 Cost Predictor ...... 123

x 3.2.3.4 Engine Locks ...... 124

3.2.4 Budget Enforcement ...... 127

3.2.5 Integration ...... 130

3.2.5.1 GPU Interrupt Handling ...... 130

3.2.5.2 CUDA Runtime Callback Threads ...... 135

3.3 Implementation ...... 136

3.3.1 General Information ...... 136

3.3.2 Scheduling Policies ...... 137

3.3.3 Priority Propagation ...... 138

3.3.4 Heuristic Plugins for the GPU Allocator ...... 143

3.3.5 User Interface ...... 144

3.4 Conclusion ...... 145

Chapter 4: Evaluation ...... 146

4.1 Evaluation Platform ...... 147

4.2 Platform Conﬁgurations ...... 148

4.3 Schedulability Analysis...... 149

4.3.1 Overhead Measurement ...... 149

4.3.1.1 Algorithmic Overheads ...... 150

4.3.1.2 Memory Overheads ...... 154

4.3.2 Scope ...... 167

4.3.3 Task Model for GPU-using Sporadic Tasks...... 169

4.3.4 Blocking Analysis ...... 170

4.3.4.1 Three-Phase Blocking Analysis ...... 172

4.3.4.2 Coarse-Grain Blocking Analysis for Engine Locks ...... 172

4.3.4.3 Detailed Blocking Analysis for Engine Locks ...... 174

4.3.4.4 Detailed Blocking Analysis for the GPU Allocator ...... 193

4.3.5 Overhead Accounting ...... 196

xi 4.3.5.1 Accounting for Interrupt Top-Halves ...... 197

4.3.5.2 Accounting for Interrupt Bottom-Halves ...... 198

4.3.5.3 Limitations ...... 203

4.3.6 Schedulability Experiments ...... 205

4.3.6.1 Experimental Setup ...... 206

4.3.6.2 Results ...... 208

4.4 Runtime Evaluation ...... 221

4.4.1 Budgeting, Cost Prediction, and Afﬁnity ...... 222

4.4.1.1 Budget Performance ...... 223

4.4.1.2 Cost Predictor Accuracy ...... 225

4.4.1.3 Migration Frequency ...... 227

4.4.2 Feature-Tracking Use-Case ...... 228

4.5 Conclusion ...... 234

Chapter 5: Graph Scheduling with GPUs ...... 236

5.1 Motivation for Graph-Based Task Models ...... 237

5.2 PGMRT ...... 240

5.2.1 Graphs, Nodes, and Edges ...... 240

5.2.2 Precedence Constraints and Token Transmission ...... 240

5.2.3 Real-Time Concerns ...... 243

5.3 OpenVX ...... 244

5.4 Adding Real-Time Support to VisionWorks ...... 246

5.4.1 VisionWorks and the Sporadic Task Model ...... 247

5.4.2 Libgpui: An Interposed GPGPU Library for CUDA ...... 252

5.4.3 VisionWorks, libgpui, and GPUSync ...... 252

5.5 Evaluation of VisionWorks Under GPUSync ...... 254

5.5.1 Video Stabilization ...... 254

5.5.2 Experimental Setup ...... 255

xii 5.5.3 Results ...... 258

5.5.3.1 Completion Delays ...... 258

5.5.3.2 End-to-End Latency...... 275

5.6 Conclusion ...... 280

Chapter 6: Conclusion ...... 282

6.1 Summary of Results ...... 282

6.2 Future Work ...... 288

6.3 Closing Remarks ...... 291

BIBLIOGRAPHY ...... 292

xiii LIST OF TABLES

1.1 ADAS prototypes and related research that employ GPUs ...... 7

1.2 Results of experiment reﬂecting unpredictable and unfair sharing of GPU resources ...... 10

2.1 Summary of sporadic task set parameters ...... 17

2.2 Summary parameters for describing shared resources ...... 28

2.3 Summary of parameters used in preemption-centric overhead accounting...... 49

2.4 Summary of locking protocol overheads considered by preemption-centric accounting ...... 51

2.5 Performance and GPGPU support of several embedded GPUs ...... 67

2.6 NVIDIA software and hardware terminology with OpenCL and AMD equivalents ...... 76

3.1 GPU Allocator heuristic plugin API ...... 144

4.1 Summary of additional parameters to describe and analyze sporadic task sets with GPUs. . . . . 171

4.2 All possible representative blocking chains that may delay a copy engine request ...... 179

4.3 An example arrangement of copy engine requests ...... 182

4.4 Summary of ILP parameters used for bounding the blocking of copy engine requests ...... 188

4.5 GPUSync conﬁguration rankings under worst-case loaded overheads ...... 210

4.6 GPUSync conﬁguration rankings under worst-case idle overheads ...... 211

4.7 GPUSync conﬁguration rankings under average-case loaded overheads ...... 212

4.8 GPUSync conﬁguration rankings under average-case idle overheads ...... 213

4.9 Migration frequency under C-EDF and C-RM GPUSync conﬁgurations ...... 228

5.1 Description of nodes used in the video stabilization graph of Figure 5.8 ...... 256

5.2 Evaluation task set using VisionWorks’ video stabilization demo ...... 257

5.3 Completion delay data for SCHED_OTHER ...... 262

5.4 Completion delay data for SCHED_FIFO ...... 263

5.5 Completion delay data for LITMUSRT without GPUSync ...... 263

5.6 Completion delay data for GPUSync for (1,6,FIFO) ...... 264

xiv 5.7 Completion delay data for GPUSync for (1,6,PRIO) ...... 264

5.8 Completion delay data for GPUSync for (2,3,FIFO) ...... 265

5.9 Completion delay data for GPUSync for (2,3,PRIO) ...... 265

5.10 Completion delay data for GPUSync for (3,2,FIFO) ...... 266

5.11 Completion delay data for GPUSync for (3,2,PRIO) ...... 266

5.12 Completion delay data for GPUSync for (6,1,FIFO) ...... 267

5.13 Completion delay data for GPUSync for (6,1,PRIO) ...... 267

5.14 Average normalized completion delay data ...... 268

xv LIST OF FIGURES

1.1 Historical trends in CPU and GPU processor performance ...... 5

1.2 Host and device code for adding two vectors in CUDA ...... 8

1.3 A schedule of vector_add() ...... 9

1.4 Matrix of high-level CPU and GPU organizational choices ...... 11

2.1 Example of a PGM-speciﬁed graph ...... 18

2.2 PGM-speciﬁed graph of Figure 2.1 transformed into rate-based and sporadic tasks...... 20

2.3 Task ready queues under partitioned, clustered, and global processor scheduling ...... 21

2.4 Example of bounded deadline tardiness for a task set scheduled on two CPUs by G-EDF . . . . . 25

2.5 Schedule depicting a critical section ...... 27

2.6 Schedules where priority inheritance and priority boosting shorten priority inversions ...... 29

2.7 Situations where multiprocessor systems require stronger progress mechanisms ...... 31

2.8 Comparison of s-oblivious and s-aware pi-blocking under global scheduling ...... 37

2.9 Queue structure of the k-FMLP ...... 39

2.10 Queue structure of the R2DGLP ...... 41

2.11 Queue structure of the CK-OMLP ...... 45

2.12 Schedule depicting system overheads...... 47

2.13 Fixed-priority assignment when an I/O device is used by a single thread...... 60

2.14 Example of an unbounded priority inversion ...... 61

2.15 Example of a bounded priority inversion during interrupt bottom-half processing ...... 62

2.16 A pathological scenario for ﬁxed-priority interrupt handling ...... 63

2.17 A priority inversion due to the co-scheduling of a bottom-half ...... 65

2.18 Layered GPGPU software architecture on Linux with closed-source drivers ...... 70

2.19 Situation where a callback thread may lead to a priority inversion ...... 73

2.20 High-level architecture of a modern multicore platform ...... 75

2.21 Code fragments for a three-dimensional CUDA kernel ...... 77

2.22 The hierarchical relation of grids, blocks, and threads ...... 78

xvi 2.23 Decomposition of block threads into warps that are mapped to hardware lanes ...... 79

2.24 Two streams of sequentially ordered GPU operations...... 83

2.25 Schedule of falsely dependent streamed GPU operations and corrected counterpart ...... 84

2.26 A detailed schedule of host and GPU processing for the simple vector_add() kernel ...... 86

3.1 Software architectures of API-driven GPU schedulers implemented in user-space ...... 104

3.2 Software architectures of API-driven GPU schedulers implemented in kernel-space...... 110

3.3 High-level design of GPUSync’s resource allocation mechanisms ...... 116

3.4 A schedule of a job using GPU resources controlled by GPUSync ...... 117

3.5 The structure of a GPU allocator lock ...... 118

3.6 Example of budget signal handling ...... 129

3.7 Architecture of GPU tasklet scheduling infrastructure using klmirqd ...... 132

3.8 Memory layout of nv_linux_state_t ...... 134

3.9 Relative static priorities among scheduling policies ...... 138

3.10 Example of a complex chain of execution dependencies for tasks using GPUSync ...... 139

3.11 Recursive algorithm to propagate changes in effective priority ...... 141

4.1 Concrete examples of multicore, multi-GPU, organizational conﬁgurations ...... 148

4.2 PDF of GPU top-half execution time ...... 151

4.3 PDF of GPU bottom-half execution time ...... 152

4.4 CCDFs of top-half and bottom-half execution times ...... 153

4.5 Considered CPMD maximum overheads due to GPU trafﬁc ...... 155

4.6 Considered CPMD mean overheads due to GPU trafﬁc ...... 156

4.7 Increase in considered CPMD overheads due to GPU trafﬁc ...... 159

4.8 GPU data transmission time in an idle system ...... 161

4.9 GPU data transmission time in system under load ...... 162

4.10 Increase in DMA operation costs due to load ...... 164

4.11 Increase in DMA operation costs due to page interleaving ...... 166

4.12 Procedures for computing blocking chains for a given request scenario ...... 184

xvii 4.13 Schedules with overheads due to bottom-half interrupt processing ...... 199

4.14 Schedule depicting callback overheads ...... 201

4.15 Illustrative ranking of conﬁguration A against conﬁguration B ...... 208

4.16 Detailed schedulability result ...... 217

4.17 Detailed result of schedulability and effective utilization ...... 220

4.18 Accumulated execution engine time allocated to tasks ...... 224

4.19 CDFs of percentage error in cost predictions ...... 226

4.20 CDF of job response time for C-EDF with FIFO-ordered engine locks ...... 231

4.21 CDF of job response time for C-EDF with priority-ordered engine locks ...... 231

4.22 CDF of job response time for C-RM with priority-ordered engine locks ...... 232

5.1 Dataﬂow graph of a simple pedestrian detector application ...... 237

5.2 Transformation of a PGM-speciﬁed pedestrian detection application to sporadic tasks ...... 238

5.3 Parallel execution of graph nodes ...... 239

5.4 Construction of a graph in OpenVX for pedestrian detection ...... 245

5.5 Procedure for PGMRT-coordinated job execution...... 249

5.6 Derivation of PGM graphs used to support the pipelined thread-per-node execution model . . . 250

5.7 Overly long token critical sections may result by releasing tokens at job completion time . . . . 253

5.8 Dependency graph of a video stabilization application...... 255

5.9 Scenarios that lead to extreme values for measured completion delays...... 258

5.10 PDF of normalized completion delay data for SCHED_OTHER...... 271

5.11 PDF of normalized completion delay data for SCHED_FIFO...... 271

5.12 PDF of normalized completion delay data for LITMUSRT without GPUSync ...... 271

5.13 PDF of normalized completion delay data for GPUSync with (1,6,FIFO) ...... 272

5.14 PDF of normalized completion delay data for GPUSync with (1,6,PRIO)...... 272

5.15 PDF of normalized completion delay data for GPUSync with (2,3,FIFO) ...... 272

5.16 PDF of normalized completion delay data for GPUSync with (2,3,PRIO)...... 272

5.17 PDF of normalized completion delay data for GPUSync with (3,2,FIFO) ...... 273

xviii 5.18 PDF of normalized completion delay data for GPUSync with (3,2,PRIO)...... 273

5.19 PDF of normalized completion delay data for GPUSync with (6,1,FIFO) ...... 273

5.20 PDF of normalized completion delay data for GPUSync with (6,1,PRIO)...... 273

5.21 CDF of normalized observed end-to-end latency ...... 275

5.22 CCDF of normalized observed end-to-end latency ...... 277

xix LIST OF ABBREVIATIONS

ADAS Advanced Driver Assistance Systems

API Application Programming Interface

ASIC Application-Speciﬁc Integrated Circuit

BWI Bandwidth Inheritance

CDF Cumulative Distribution Function

CCDF Complementary Cumulative Distribution Function

C-EDF Clustered Earliest Deadline First

CE Copy Engine

CK-OMLP Clustered k-exclusion Optimal Multiprocessor Locking Protocol

C++ AMP C++ Accelerated Massive Parallelism

CPU Central Processing Unit

CUDA Compute Uniﬁed Device Architecture

CVA Compliant Vector Analysis

DAG Directed Acyclic Graph

DGL Dynamic Group Lock dGPU Discrete Graphics Processing Unit

DMA Direct Memory Access

DPCP Distributed Priority Ceiling Protocol

DSP Digital Signal Processor

EDF Earliest Deadline First

EE Execution Engine

FL Fair-Lateness

FMLP Flexible Multiprocessor Locking Protocol

FPGA Field Programmable Gate Array

FPS Frames Per Second

G-EDF Global Earliest Deadline First

GPC Graphics Processing Cluster

GPL GNU General Public License

xx GPGPU General-Purpose Computing on Graphics Processing Units

GPU Graphics Processing Unit

HSA Heterogeneous System Architecture iGPU Integrated Graphics Processing Unit

IORW Input Output and Read Write

IPC Inter-Process Communication

IPI Inter-Processor Interrupt

ISR Interrupt Service Routine

JLFP Job-Level Fixed Priority k-FMLP k-exclusion Flexible Multiprocessor Locking Protocol

MPCP Multiprocessor Priority Ceiling Protocol

MSRP Multiprocessor Stack Resource Policy

NUMA Non-Uniform Memory Access

O-KGLP Optimal k-exclusion Global Locking Protocol

OMLP O(m) Multiprocessor Locking Protocol OpenACC Open Accelerators

OpenCL Open Computing Language

OS Operating System

PCIe Peripheral Component Interconnect Express

PCP Priority Ceiling Protocol

PDF Probability Density Function

P-EDF Partitioned Earliest Deadline First

PGM Processing Graph Method

P2P Peer-to-Peer

RM Rate-Monotonic

RPC Remote Procedure Call

RTOS Real-Time Operating System

R2DGLP Replica-Request Donation Global Locking Protocol

SM Streaming Multiprocessor

xxi SRP Stack Resource Policy

WSS Working Set Size

xxii CHAPTER 1: INTRODUCTION

Real-time systems are those that must satisfy precise timing constraints in order to meet application requirements. We often ﬁnd such systems where computers sense and react to the physical world. Here, loss of life or property may result if a computer fails to act in the right moment. Real-time system designers must employ algorithms that realize predictable behavior that can be modeled by mathematical analysis.

This analysis allows the designer to prove that an application’s timing constraints are met. However, an algorithm can only be as predictable as allowed by the underlying software and hardware upon which it is implemented. Ensuring predictability becomes an increasing challenge as computing hardware grows in complexity. This is especially true for commodity computing platforms, which are optimized for throughput performance—often at the expense of predictability. This challenge is exempliﬁed by the recent development of programmable graphics processing units (GPUs) that are used to perform general purpose computations.

GPUs offer extraordinary performance and relative energy efﬁciency in comparison to traditional processors.

However, today’s standard GPU technology is unable to meet basic real-time predictability requirements.

The goal of the research presented in this dissertation is to discover and address the issues that prevent

GPUs from being used in real-time systems. GPUs exhibit unique characteristics that are not dealt with easily using methods developed in prior real-time literature. New techniques are necessary. This dissertation presents such techniques that remove the fundamental obstacles that bar the use of GPUs in real-time systems.

This research is important because it may allow GPUs to become an enabling technology for embedded real-time systems that tackle computing problems that have been outside the reach of traditional processors.

This chapter begins with a brief introduction to real-time systems. We follow with a closer look at the beneﬁts offered by GPUs and potential real-time applications. We then discuss general purpose GPU programming at a high level, followed by the challenges to supporting real-time constraints. We then present the thesis of this dissertation and describe the dissertation’s contributions. Finally, we outline the organization of this dissertation’s remaining chapters.

1 1.1 Real-Time Systems

The term “real time” has different meaning in different ﬁelds. In the context of computer graphics, “real time” often equates to “real fast” or some degree of quality-of-service.1 For instance, a computer graphics animation may be rendered in “real time” if image frames are generated at roughly 30 frames per second.

An interactive simulation may be considered “real time” if the simulation runs at about 10 to 15 frames per second. In contrast to these throughput-oriented quality-of-service-based deﬁnitions, “real time” in the ﬁeld of real-time systems is more precise. A real-time system is said to be “correct” if computations meet both logical and temporal criteria. Logical criteria require that the results of a computation must be valid. This condition is true for practically any computational system. Temporal criteria require that these results must also be made available by a designated physical time (hence, “real time”). This strict concern for temporal correctness may not necessarily be included in the aforementioned computer graphics systems where “real fast” is often good enough. Indeed, temporal correctness is as important as logical correctness in a real-time system.

A real-time workload is often embodied by a set of computational tasks. Each task releases recurring work, with each such release called a job, according to a predicable rate or time interval. The completion time of each job must satisfy some temporal constraint, such as a deadline that occurs within some interval of time after the job’s release. A real-time scheduler is responsible for allocating processor time to each incomplete job. A set of tasks, or task set, is said to be schedulable when timing constraints are guaranteed to always be satisﬁed.

A scheduling algorithm, in and of itself, does not prove schedulability. Instead, schedulability is formally proven according to an analytical model of the scheduling algorithm and the task set in question. These models may incorporate real-world overheads that are often a function of the scheduling algorithm, the algorithm’s implementation, and the hardware platform upon which jobs are scheduled. Overheads can have a strong effect on schedulability. As a consequence, the design of an efﬁcient real-time system involves the co-design of the analytical model, the scheduling algorithm, and the algorithm’s implementation.

Over the last decade, multicore processors have become ubiquitous in computing. This has spurred interest in the design and implementation of real-time multiprocessor schedulers and analytical models.

Multiprocessor schedulers can be generally classiﬁed into one of three categories: partitioned, clustered,

1The term “real fast” in this context is borrowed from McKenney (2009).

2 or global. Under partitioned scheduling, each task (and all its associated jobs) is assigned to a processor.

Under global scheduling, jobs are free to migrate among all processors. Cluster scheduling is a hybrid of partitioned and global approaches: the jobs of tasks are free to migrate among an assigned subset (or cluster) of processors. Each method may be best suited to a particular application with its own temporal constraints, as there are tradeoffs among the analytical models and associated overheads for each approach.

It is reasonable to assume that many future real-time systems will use multicore processors. Support for real-time GPU computing with multicore processors is a central theme of this dissertation. Moreover, we pay special attention to the interrelations between our analytical models, scheduling algorithms, algorithm implementation, and the computing hardware.

We wish for the results of this dissertation to have bearing on practical applications, so much of the effort behind this dissertation has been on the implementation of real-time multi-GPU schedulers and their integration with real-time multiprocessor (i.e., CPU) schedulers. Implementation and integration is done at the operating system (OS) level. We implement all of our solutions by extending LITMUSRT, a real-time patch to the Linux kernel (Calandrino et al., 2006; Brandenburg, 2011b, 2014b). This is advantageous since we can use all Linux-based GPGPU software in our research, while also beneﬁting from LITMUSRT’s variety of real-time multiprocessor schedulers and its other supporting functions.

1.2 Graphics Processing Units

The growth of GPU technology is characterized by an evolutionary process. Early GPUs of the 1970s and 1980s were used to offload 2D rendering computations from the CPU, and support for 3D rendering was common by the end of the 1990s (Buck, 2010). With few exceptions, these GPUs were “fixed function,” meaning that rendering operations were defined a priori by the GPU hardware. This changed with the advent of the “programmable pipeline” in 2001. The programmable pipeline enables programmers to implement custom rendering operations called “shaders” by using program code that is executed on the GPU. Some of the early successful shader languages include the OpenGL Shading Language (GLSL) (Khronos Group,

2014b), NVIDIA’s “C for Graphics” (Cg) (NVIDIA, 2012), and Microsoft’s High-Level Shader Language

(HLSL) (Microsoft, 2014).

Empowered by shader languages and programmable GPUs, researchers and developers began to exploit the generality of the programmable pipeline to solve non-graphics-related problems. This practice of using

3 GPUs for general-purpose computations was coined GPGPU by M. Harris in 2002 (Luebke et al., 2004;

Harris, 2009). In early GPGPU approaches, computations were expressed as shader programs, operating on graphics-oriented data (e.g., pixels and vertices). Recognizing the potential of GPGPU, generalized languages and runtime environments were developed by major graphics hardware vendors and software producers to allow general purpose programs to be run on graphics hardware without the limitations imposed by graphics-focused shader languages. Notable platforms include the Compute Uniﬁed Device Architecture

(CUDA) (NVIDIA, 2014c), OpenCL (Khronos Group, 2014a), and OpenACC (OpenACC, 2013). The ease of use enabled by these advances has facilitated the adoption of GPGPU in a number of fields of computing. Today, GPGPU is used to efficiently handle data-parallel compute-intensive problems such as cryptography (Harrison and Waldron, 2008), supercomputing (Meuer et al., 2014), finance (Scicomp

Incorporated, 2013), ray-tracing (Aila and Laine, 2009), medical imaging (Watanabe and Itagaki, 2009), video processing (Pieters et al., 2009), and many others.

GPGPU technology has received little attention in the ﬁeld of real-time systems, despite strong motivations for doing so. The strongest motivation is that the use of GPUs can signiﬁcantly increase computational performance. This is illustrated in Figure 1.1(a), which depicts performance trends of high-end Intel CPUs and

NVIDIA GPUs over much of the past decade (NVIDIA, 2014c; Intel, 2014; Ong et al., 2010). Figure 1.1(a) plots the peak theoretical single-precision performance in terms of billions of ﬂoating point operations per second (GFLOPS). There is a clear disparity between CPU and GPU performance in favor of GPUs. For example, NVIDIA’s Titan GTX Black GPU can perform at 5,121 GFLOPS in comparison to 672 GFLOPS for the Intel Ivy Bridge-EX—the GPU has a 7.6 times greater throughput.

This disparity is even greater when we consider mobile CPUs, such as those designed by ARM. For instance, the ARM Cortex-A15 series processor as a peak theoretical performance of 8 GFLOPs at 1GHz (Ra- jovic et al., 2014). Thus, the dual Cortex-A15 cores of Samsung’s Exynos 5250 (which runs as 1.7GHz) collectively achieve 27.2 GFLOPS. In contrast, the embedded NVIDIA K1 and PowerVR GX6650 GPUs can both achieve a reported 384 GFLOPS (Smith, 2014a), making them more than 14 times faster than the

Exynos 5250.2

Growth in raw ﬂoating-point performance does not necessarily translate to equal gains in performance for actual applications. In the case of both CPUs and GPUs, observed theoretical performance only nears

2It is unfair to compare the Titan GTX Black to any embedded processor since the GPU requires a great deal more power. The Titan GTX Black consumes roughly 250 watts, while the Exynos 5250 consumes no more than 8 watts.

4 Trends in GFLOPS 6000 CPU (Intel) GPU (NVIDIA) TITAN GTX Black 5000 TITAN GTX

4000

GTX-680 3000

Theoretical Peak GFLOPS Peak Theoretical 2000 GTX-580 GTX-480 GTX-280 1000 GTX-260 Ivy Bridge-EX 8800 GTX Sandy Bridge-EP Westmere-EX 6800 Ultra 7800 GTX Harpertown Bloomfield 0 Jan-03 Jan-04 Jan-05 Jan-06 Jan-07 Jan-08 Jan-09 Jan-10 Jan-11 Jan-12 Jan-13 Jan-14 Date (a) Trend in GFLOPS. Trends in GFLOPS-per-Watt 25 CPU (Intel) GPU (NVIDIA)

TITAN GTX Black 20

TITAN GTX

GTX-680 15

GTX-580 Watts per Theoretical Peak GFLOPS Peak Theoretical per Watts GTX-280 GTX-480 GTX-260 5 Ivy Bridge-EX 8800 GTX Sandy Bridge-EP Westmere-EX Harpertown Bloomfield Irwindale Clovertown 0 Jan-03 Jan-04 Jan-05 Jan-06 Jan-07 Jan-08 Jan-09 Jan-10 Jan-11 Jan-12 Jan-13 Jan-14 Date (b) Trend in GFLOPS-per-Watt.

Figure 1.1: Historical trends in CPU and GPU processor performance. theoretical peak performance when executing very data-parallel algorithms. However, GPU performance degrades signiﬁcantly when executing inherently serial algorithms or those that contain many conditional code paths (i.e., “branchy” code)—CPUs perform better in such cases (NVIDIA, 2014c). Nevertheless, a review of published research reveals that GPUs commonly increase performance over CPUs on the range of four to 20 times (Owens et al., 2007) for many types of computationally heavy applications. In the context of

5 real-time systems, computations accelerated by GPUs may execute at higher frequencies or perform more computation per unit time, possibly improving system responsiveness or accuracy.

Power efﬁciency is another motivation to use GPUs in real-time systems, since real-time constraints often must be satisﬁed in power-constrained embedded applications. GPUs can carry out computations at a fraction of the power needed by CPUs for equivalent computations. This is illustrated in Figure 1.1(b), which depicts the GFLOPS-per-watt for the same performance points of Figure 1.1(a). Here, the Titan

GTX Black can perform roughly 20.5 GFLOPS per watt in comparison to 4.3 GFLOPS per watt of the

Ivy Bridge-EX—the GPU is 4.7 times more efﬁcient. For an additional point of reference, the K1 and

GX6650 integrated GPUs perform approximately 48 GFLOPS-per-watt, while the Exynos 5250 CPUs deliver approximately 6.8 GFLOPS-per-watt.3

1.3 Real-Time GPU Applications

There are several application domains that may beneﬁt from real-time support for GPUs. For example, real-time-scheduled GPUs may be employed to realize predictable video compositing and encoding for use in live news and sports broadcasting (NVIDIA, 2014e). Another domain includes support of high frequency trading and other time-sensitive ﬁnancial applications (King et al., 2010). However, possibly the greatest potential for GPUs in real-time systems is in future automotive applications.

The domain that may beneﬁt the most from real-time support for GPUs is in the advanced driver assistance systems (ADAS) of new and future automobiles. Here, vehicle computer systems realize “intelligent” alert and automated features that improve safety and/or the driving experience. For example, a system that alerts the driver of pedestrians in the path of the vehicle is an ADAS. An intelligent adaptive cruise control system that can automatically steer the vehicle and control its speed in stop-and-go highway trafﬁc is another. Other

ADAS features include automatic trafﬁc sign recognition, obstacle avoidance, and driver fatigue detection.

There are clear safety implications to ADAS: if the vehicle fails to act in the right moment, loss of life may result. Precise timing constraints must be met.

Common to these ADAS applications is a reliance upon a rich sensor suite. This includes video cameras, radar detectors, acoustic sensors, and lidar4 sensors (Wei et al., 2013). Together, these sensors generate an

3Power metrics for individual components of embedded processors are difﬁcult to ﬁnd. Here, we conservatively assume that the K1, GX6650, and Exynos 5250 each consume eight watts. This a common limit for the entirety of a smartphone- and tablet-class chip, of which GPU and CPUs are merely components. 4The term “lidar” is a blend of the words “light” and “radar”.

6 Application Prototypes and Research Pedestrian, Vehicle, and Zhang and Nevatia (2008); Wojek et al. (2008) Obstacle Detection Bauer et al. (2010); Benenson et al. (2012) Trafﬁc Sign Recognition Mussi et al. (2010); Muyan-Ozcelik et al. (2011) Driver Fatigue Detection Lalonde et al. (2007) Lane Following Homm et al. (2010); Kuhnl et al. (2012) Seo and Rajkumar (2014)

Table 1.1: ADAS prototypes and related research that employ GPUs. enormous amount of data for an embedded vehicle computing system to process. It is too much in fact for traditional computing hardware to handle within a vehicle’s size, weight, and power (SWaP) constraints, much less being affordable. GPUs offer a viable alternative because many of the algorithms employed in

ADAS are data parallel—ideal for GPUs. This is especially true of computer vision algorithms that operate upon video camera feeds and the point-cloud processing of LIDAR data.

Researchers have begun applying GPUs to ADAS problems, as illustrated by Table 1.1, which lists several ADAS prototypes and related research that uses GPUs. However, each prototype assumes full control of the entire computing system; computing resources such as CPUs and GPUs are not shared with other tasks.

This does little to resolve automotive SWaP or cost constraints. If advanced automotive features are to be realistically viable, then such computations must be consolidated onto as few low-cost CPUs and GPUs as possible, while still meeting timing constraints. The design and implementation of foundational real-time methods for such systems is the focus of this dissertation research.

1.4 An Introduction to GPGPU Programming

A brief introduction to GPGPU programming is necessary to conceptualize the challenges of using GPUs in real-time systems. We ﬁrst present a high-level description of GPGPU programming and GPU mechanics.

We then discuss the limitations that prevent us from using GPUs in a real-time system without any specialized real-time mechanisms.

1.4.1 GPGPU Programming

Although GPUs are superior to CPUs in terms of raw performance and energy efﬁciency, today’s GPUs cannot operate as independent processors. Instead, GPUs used in GPGPU applications act as co-processors to CPUs. GPGPU programs are made up of a sequence of operations involving CPU code, GPU code, and, in

7 1 // Add vectors ‘a’ and ‘b’ of ‘num_elements’ floats and store results in ‘c’ 2 // using a GPU. 3 void vector_add(float *a, float *b, float *c, int num_elements) { 4 float *gpu_a, *gpu_b, *gpu_c; 5 6 ... // allocate GPU-side memory for ‘gpu_a’, ‘gpu_b’, and ‘gpu_c’ 7 8 // copy contents of ’a’ and ’b’ to corresponding buffers on the GPU 9 cudaMemcpy(gpu_a, a, num_elements*sizeof(float)); 10 cudaMemcpy(gpu_b, b, num_elements*sizeof(float)); 11 12 // perform ‘gpu_c[i] = gpu_a[i] + gpu_b[i]’ for i in [0..num_elements) 13 gpu_vector_add<<<...>>>(gpu_a, gpu_b, gpu_c, num_elements); 14 15 // copy the results of vectorAdd stored in ‘gpu_c’ to buffer ‘c’ 16 cudaMemcpy(c, gpu_c, num_elements*sizeof(float)); 17 18 ... // free buffers allocated to ‘gpu_a’, ‘gpu_b’, and ‘gpu_c’ 19 } (a) Host code for adding vectors, using a GPU.

1 // GPU routine for adding elements of vectors ‘a’ and ‘b’ with result stored ‘c’ 2 __global__ 3 void gpu_vector_add(float *a, float *b, float *c, int num_elements) { 4 int i = blockDim.x * blockIdx.x + threadIdx.x; 5 if (i < num_elements) { 6 c[i] = a[i] + b[i]; 7 } 8 } (b) Device kernel for adding vectors.

Figure 1.2: Host and device code for adding two vectors in CUDA.

the case where GPUs are equipped with their own memory, memory copies between CPU main memory and

GPU-local memory.

GPU (or “device”) code is invoked by CPU (or “host”) code, in a manner similar to a remote procedure

call (RPC). Device-side procedures are commonly referred to as “kernels.”5 A simple GPGPU routine (in

the CUDA language) for adding the elements of two arrays (or vectors) is given in Figure 1.2. Host code

appears in Figure 1.2(a) and device code in Figure 1.2(b). The routine vector_add() executes on the host.

In lines 9 and 10 of Figure 1.2(a), memory is copied from host memory to device memory as input for

the gpu_vector_add() kernel. In line 13, this kernel is called by the host, triggering the procedure in

Figure 1.2(b) to execute on the device. The kernel’s output resides in device memory after it completes. The

resulting output of the kernel is copied back to host memory on line 16 of Figure 1.2(a).

A simpliﬁed schedule for the VectorAdd routine is depicted in Figure 1.3. (The actual sequence of

operations is more complex, as we shall see in Chapter 2, but this level of detail is sufﬁcient for now.) From

5This unfortunate name should not be confused with an operating system kernel. A GPU kernel and operating system kernel have nothing in common.

8 cudaMemcpy(gpu_a, a,…); gpu_vector_host(…)

vector_host(…) cudaMemcpy(gpu_b, b,…); cudaMemcpy(c, gpu_c,…);

CPU

time t1 t2 t3 t4 t5 t6 t7 t8 t9

Figure 1.3: A schedule of vector_add().

this schedule, we see that vector_add() begins execution on a CPU, starting at time t1. The memory copy operations are carried out using direct memory access (DMA) by a “Copy Engine” (CE) processor on the time intervals [t2,t3], [t4,t5], and [t8,t9]. The CE is a GPU component used for DMA memory operations. The kernel gpu_vector_add() executes on the GPU’s “Execution Engine” (EE) during the time interval

[t6,t7]. (We discuss CEs and EEs in more depth in Chapter 2.) Observe that the CPU code waits for each GPU-related operation to complete—indicated by a set of horizontal dashed lines. During these intervals, the CPU code may suspend or spin while waiting for GPU operations to complete (the particular desired behavior may be speciﬁed by the programmer).

1.4.2 Real-Time GPU Scheduling

In a real-time system, CPU, CE, and EE processors must be scheduled according to a predictable policy to ensure that timing constraints are met. This is difﬁcult to achieve using stock GPGPU technology. Although the routines in Figure 1.2 are simple, a great deal of complex underlying software is needed execute them.

This software comes in two parts: (i) the OS device driver that manages the GPU and issues device-speciﬁc commands; and (ii) the GPGPU language runtime that interfaces with the GPU device driver to execute the

GPGPU program. GPU manufacturers provide this software. However, this software is not designed with real-time requirements in mind. Worse, it is commonly distributed as closed-source, so our ability to modify the software’s behavior to support real-time requirements is constrained.

9 Thread Loop Counts Run 1 Run 2 Run 3 1 0 0 0 2 104 969 0 3 1,307 0 3,706 4 2,230 1,928 0 5 0 786 0 Total 3,641 3,683 3,706

Table 1.2: Reported loop counts.

Unfortunately, stock GPGPU technology offers little support for dependable scheduling policies. Con- tention for GPU resources is often resolved through undisclosed arbitration policies. These arbitration policies typically have no regard for task priority and may exhibit behaviors detrimental to multitasking on host

CPUs. Furthermore, a single task can dominate GPU resources by issuing many or long-running operations.

Allocation methods are needed that eliminate or ameliorate these problems. We demonstrate several of these points with a simple experiment.

In this experiment, we have a program that repeatedly executes the VectorAdd routine on 4,000,000- element vectors in a tight loop. We ran ﬁve instances of our program, as threads, concurrently under Linux’s general purpose scheduler (e.g., a non-real-time scheduler). Our test platform has more CPUs than test threads, but all threads share the same GPU (an NVIDIA Quadro K5000). Our threads execute for a duration of 30 seconds and report the number of completed loop iterations upon completion. Table 1.2 gives the reported loop counts for three separate test runs of our experiment. We observe two important behaviors in this data. First, by comparing the values within each column, we see that each thread receives a very uneven share of GPU resources. For instance, in Run 1, Thread 2 completes 104 loops while Thread 4 completes

2,230. Second, by comparing the values within each row, we see that each thread receives a different amount of GPU resources in each experiment. For example, Thread 3 is starved in Run 2, but it receives all of the

GPU resources in Run 3. In short, GPU resources are allocated unfairly and unpredictably.

Stock GPGPU technologies do not provide a solid foundation upon which to build a real-time system.

We could endeavor to replace the closed-source software with our own, as has been explored by Kato et al.

(2012). However, this requires a great deal of software development and reverse engineering effort. It is better to leverage existing software, if at all possible. We show in this dissertation that this is indeed possible. We

10 CPU Scheduling Partitioned Clustered Global Partitioned Clustered GPU Organization GPU Global

Figure 1.4: Matrix of high-level CPU and GPU organizational choices. devise and implement mechanisms that satisfy our need for predictability while still using the manufacturer’s original software.

1.4.3 Real-Time Multi-GPU Scheduling

Modern hardware can support computing platforms that have multiple GPUs. Given the performance beneﬁts GPUs offer, it is desirable to support multi-GPU computing in a real-time system. Similar to CPUs in multiprocessor scheduling, GPUs can be organized following a partitioned, clustered, or global approach.

When combined with the earlier-discussed multiprocessor scheduling methods, we have nine high-level possible allocation categories, as illustrated in matrix form in Figure 1.4. Can real-time mechanisms be devised to support every conﬁguration choice? Which conﬁgurations are best for real-time predictability?

Which conﬁgurations offer the best observed real-time performance at runtime? Do conﬁguration choices really matter? The answers to these basic questions are not immediately clear. This dissertation investigates these questions in depth.

1.5 Thesis Statement

GPGPU is a new technology that has received little attention in the field of real-time computing. Initial research results are promising. However, there has yet to be a comprehensive study of the topic. For instance, it is not known what tradeoffs exist among the configurations depicted in Figure 1.4. More importantly, however, it has not yet been shown to what extent a real-time system can actually benefit from GPGPU technology.

11 This dissertation seeks to investigate real-time GPU scheduling and demonstrate the beneﬁts of GPGPU in real-time systems. To this end, we put forth the following thesis statement:

The computational capacity of a real-time system can be greatly increased for data-parallel

applications by the addition of GPUs as co-processors, integrated using real-time scheduling and

synchronization techniques tailored to take advantage of speciﬁc GPU capabilities. Increases in

computational capacity outweigh costs, both analytical and actual, introduced by management

overheads and limitations of GPU hardware and software.

1.6 Contributions

We now present an overview of the contributions of this dissertation that support this thesis.

1.6.1 A Flexible Real-Time Multi-GPU Scheduling Framework

The central contribution of this dissertation is the design of a ﬂexible real-time multi-GPU scheduling framework. In Chapter 3, we present our framework, which is called GPUSync. We posit that GPU management is best viewed as a synchronization problem rather than one of scheduling. At its heart,

GPUSync is a novel combination and adaptation of several recent advances in multiprocessor real-time synchronization made by Brandenburg et al. (2011) and Ward et al. (2012, 2013). This approach provides us established techniques for enforcing real-time predictable and an analytical framework for testing real-time schedulability.

GPUSync is highly conﬁgurable and supports every high-level CPU/GPU scheduling/organizational method depicted in Figure 1.4. For each high-level conﬁguration, a system designer may select a low-level

GPUSync conﬁguration that best complements their analytical model or yields the best observed runtime performance (or perhaps both).

GPUSync is also designed to efﬁciently support practical multi-GPU scheduling, in terms of both real- time analysis and real-world performance. We employ real-time scheduling techniques that reduce pessimism in analysis. We combine online monitoring with heuristics that guide GPU scheduling decisions that improve performance while still maintaining real-time predictability. We also develop budget-enforcement techniques that allow us to mitigate the effects of GPU operations that exceed their provisioned execution times.

12 1.6.2 Techniques for Supporting Closed-Source GPGPU Software

In Chapter 2, we identify issues that may arise when we attempt to use non-real-time GPGPU software in a real-time system. Speciﬁcally, these issues relate to: (i) non-real-time resource arbitration, and (ii) the real-time scheduling of GPU device driver and GPGPU runtime computations.

We present a solution to these problems in Chapter 3. We resolve the arbitration issues of (i) by wrapping the GPGPU runtime with our own interface to GPUSync. Resource contention is already resolved when the GPGPU runtime is invoked by application code. Thus, our real-time system is no longer at the mercy of undisclosed non-real-time arbitration techniques. We address the scheduling problems of (ii) through interception techniques. For example, we intercept the OS-level interrupt processing of the GPU device driver and insert our own interrupt scheduling framework. We also intercept the creation of “helper” threads created by GPGPU runtime and assign them proper real-time scheduling priorities. These features integrate with

GPUSync, allowing the framework to dynamically adjust scheduling priorities in order to maintain real-time predictability.

1.6.3 Support for Graph-Based GPGPU Applications

In Chapter 5, we extend real-time GPGPU support to graph-based software architectures, strengthening the relevance of GPUs in real-time systems. In such architectures, vertices represent sequential code segments that operate upon data, and edges express the ﬂow of data among vertices. The ﬂexibility offered by such an architecture’s inherent modularity promotes code reuse and parallel development. Also, these architectures naturally support concurrency, since parallelism can be explicitly described by the graph structure. This allows work to be scheduled in parallel and take advantage of the pipelined execution of graph processing— techniques known to improve analytical schedulability and runtime behavior.

Graph-based software architectures are well suited to efﬁciently handle the sensor processing and computer vision algorithms in the complex automotive applications we discussed earlier. For example, instead of repeating the same sensor processing steps within each application that uses a particular sensor, redundant computations can be consolidated in a application that distributes its results to others.

13 1.6.4 Implementation and Evaluation

In Chapter 3, we also describe the implementation of GPUSync in LITMUSRT. We discuss several of the unique implementation-related challenges we addressed, including efﬁcient locking protocols, budget enforcement techniques, and the tracking of real-time priorities.

In Chapters 4 and 5, we evaluate our implementation in terms of analytical schedulability and runtime performance. The results of this evaluation support the central claim of this dissertation’s thesis that GPUs can be used in a real-time system, resulting in increased computational capacity.

In order to evaluate schedulability, we develop and present an analytical model of real-world system behavior under GPUSync in Chapter 4. This model incorporates carefully measured empirical overheads relating to real-time scheduling and GPU operations. Using this model, we carry out schedulability experiments in order to determine the most promising conﬁgurations of GPUSync. This evaluation is broad and is backed by data generated from tens of thousands of CPU hours of experimentation. Ultimately, we ﬁnd that clustered CPU scheduling with partitioned GPUs offers the best real-time schedulability, overall. However, clustered GPU scheduling is competitive in some situations.

We demonstrate the effectiveness of our real-time GPU scheduling techniques by observing the runtime behavior of GPUSync under several synthetic and “real-world application” scenarios in Chapters 4 and 5. Among our ﬁndings, we show that although partitioned GPU scheduling may offer better real-time schedulability, clustered GPU scheduling may offer better observed real-time behavior.

1.7 Organization

The rest of this dissertation is organized as follows. In Chapter 2, we discuss several background topics, including the architecture and mechanics of GPUs and prior work on real-time heterogeneous multiprocessor scheduling (including GPUs). In Chapter 3, we describe the design and implementation of our conﬁgurable real-time GPU scheduling framework, GPUSync. In Chapter 4, we evaluate GPUSync in terms of theoretical schedulability and runtime experiments. In Chapter 5, we extend GPUSync to support graph-based applications and present a runtime evaluation using real-world computer vision code. We end in

Chapter 6 with concluding remarks and a discussion of future work.

14 CHAPTER 2: BACKGROUND AND PRIOR WORK1

In this chapter, we discuss background material and prior work on topics related to this dissertation.

We begin with a discussion of real-time multiprocessor scheduling, locking protocols, and schedulability analysis. We then examine prior work on the implementation of real-time schedulers in real-time operating systems (RTOSs) and issues related to peripheral device management (speciﬁcally, device interrupt handling).

We then review the current state of accelerator co-processors in the embedded domain to help motivate the timeliness of our research. We then delve into relevant aspects of GPU hardware and software architectures and programming models. Here, we also discuss the challenges of real-time GPU computing. We conclude with a review of related prior work on GPU scheduling.

2.1 Multiprocessor Real-Time Scheduling

We discuss several foundational elements of real-time multiprocessor scheduling in this section. We begin with a description of the analytical approach we use to model real-time workloads in this dissertation.

This is followed by a discussion of the meaning of the term “schedulability” and the procedures we use for formally proving real-time correctness. We then examine the topic of resource sharing in real-time systems, and how sharing may impact schedulability analysis. Finally, we discuss the methods we use to account for real-world system overheads in schedulability analysis.

2.1.1 Sporadic Task Model

A workload that is run on a real-time system is said to be schedulable when guarantees on timing constraints can be made. Schedulability is formally proved through analytical models. One such model is the well-studied sporadic task model (Mok, 1983); we focus on this task model in this research. We deﬁne the

1 Portions of this chapter previously appeared in the proceedings of two conferences. The original citations are as follows: Elliott, G. and Anderson, J. (2012a). The limitations of ﬁxed-priority interrupt handling in PREEMPT_RT and alternative approaches. In Proceedings of the 14th OSADL Real-Time Linux Workshop, pages 149–155; Elliott, G. and Anderson, J. (2012b). Robust real-time multiprocessor interrupt handling motivated by GPUs. In Proceedings of the 24th Euromicro Conference on Real-Time Systems, pages 267–276.

15 basic elements of the sporadic task model here. Later in this section, we expand this model to incorporate resource sharing. In Chapter 4, we further expand the model to describe GPUs and GPGPU workloads.

Under the sporadic task model, we describe the computational workload as a task set, T , that is speciﬁed as a collection of n tasks: T , {T1,··· ,Tn}.A job is a recurrent invocation of work by a task, Ti, and is th denoted by Ji, j, where j indicates the j job of Ti (we may omit the subscript j if the particular job invocation is inconsequential). Task Ti is described by a tuple of three parameters: (ei, pi,di). The worst-case execution time (WCET) of a job is given by ei. The releases of jobs Ji, j and Ji, j+1 have a minimum release separation time described by the task’s period, pi. A task is said to be periodic (instead of sporadic) if its jobs are always separated by pi time units. Job Ji, j is released (arrives) at time ai, j and completes (ﬁnishes) at time fi, j.

Each job has a precedence constraint: although job Ji, j+1 may be released before job Ji, j completes, Ji, j+1 cannot be scheduled until after fi, j.A pending job is an incomplete released job that has had its precedence constraint met. The response time of Ji, j is

ri, j , fi, j − ai, j. (2.1)

Every task has a relative deadline, di. A task is said to have an implicit, constrained, or arbitrary deadline if di = pi, di ≤ pi, or di ≥ 0, respectively. Every job has an absolute deadline, deﬁned by

Di, j , ai, j + di. (2.2)

Each job must execute for at most ei time units in order to complete, and it must receive this execution time by Di, j to meet its deadline. We deﬁne lateness by

li, j , fi, j − Di, j. (2.3)

Deadline tardiness is deﬁned by

xi, j , max(0,li, j). (2.4)

The utilization of a task quantiﬁes the long-term processor share required by the task and is given by the term

ei ui , . (2.5) pi

16 Task Set and Scheduler Parameters m number of CPUs c CPU cluster size T task set n number of tasks in task set U(T ) task set utilization Parameters of Task Ti ei job worst-case execution time pi period di relative deadline ui utilization

Parameters of Job Ji, j ai, j release (arrival) time fi, j completion (ﬁnish) time Di, j absolute deadline ri, j response time li, j deadline lateness (may be negative) xi, j deadline tardiness

Table 2.1: Summary of sporadic task set parameters.

The total utilization of a task set is given by

n U(T ) = ∑ ui. (2.6) i=1

Table 2.1 summarizes the various parameters we use in modeling a sporadic task set.

2.1.2 Rate-Based Task Model

The rate-based task model provides a similar approach to describing the workload of a real-time system as the sporadic task model (Jeffay and Goddard, 1999). Instead of using pi to describe the arrival sequence of task Ti’s jobs, we use χi to specify the maximum number of jobs of task Ti that may arrive within a time

2 window of υi time units. Thus, each task is described by a tuple of four parameters: (ei, χi,υi,di). We reuse all of the sporadic task model parameters we described earlier, except for pi and ui. Utilization is given by

rb χi ui , ei · . (2.7) υi

2 Jeffay and Goddard use the symbols xi and yi instead of χi and υi, respectively. We deviate from their notation to avoid confusion with other parameters we deﬁne. For example, we use xi to denote deadline tardiness.

17 1 2 1 3 1 ̺i ← = 4 Gi ̺i ← = 4 2 1 3 1 ϕi ← = 7 ϕi ← = 5 2 1 3 1 κi ← = 3 κi ← = 3 2 3 Gi Gi

4 2 4 3 ̺i ← = 1 ̺i ← = 2 4 2 4 3 ϕi ← = 3 ϕi ← = 6 4 2 4 3 κi ← = 2 4 κi ← = 4 Gi

Figure 2.1: Example of a PGM-speciﬁed graph (courtesy of Liu and Anderson (2010)).

We do not use the rate-based task model directly to determine schedulability in this dissertation. However, we do use it as an intermediary representation in the process of transforming a more complicated real-time model into the sporadic task model. We discuss this next.

2.1.3 Processing Graph Method

The sporadic and rate-based task models are limited in that they describe a set of independent tasks.

However, real-time workloads are not always so simple. For example, the sporadic and rate-based task models lack the necessary expressiveness to describe the scenario where the input of a job of one task Ti is dependent upon the output of a job of another task Tj. We see this type of inter-task dependence in graph-based software architectures. The Processing Graph Method (PGM) is an expressive model for describing such software architectures.

PGM describes the dependencies among jobs in terms of producer/consumer relationships. The workload

1 zi is described by set of n graphs: G , {G1,··· ,Gn}. Each graph is comprised of subtasks, Gi , {Gi ,··· ,Gi }, where zi denotes the number of subtasks in Gi. Figure 2.1 depicts a graph expressed in PGM. We represent each subtask with a node. A directed edge connecting a producer subtask to a dependent consumer subtask reﬂects the data dependencies between connected nodes. The production and consumption of data is modeled by tokens, where producers produce tokens and consumers consume them. Each edge is described by three parameters that describe token production and consumption. The number of tokens produced by a subtask

j k j k← j Gi for subtask Gi each time a job of Gi completes is denoted by %i . The number of tokens consumed by k k← j each job of Gi is given by κi . Finally, a threshold on the number of tokens that must be available on the

18 j k k k← j edge connecting Gi and Gi before a job of Gi may execute is given by ϕi . We denote the set of subtasks j j that directly generate input for Gi with the function pred(Gi ). Likewise, we denote the set of subtasks that j j directly consume the output of Gi with the function cons(Gi ). We attach rate-based arrival parameters to the j source nodes of each graph. Similar to the rate-based task model, we use χi to specify the maximum number j j of jobs of task Gi that may arrive within a time window of υi time units. Goddard presents a procedure for transforming a set of PGM graphs into a rate-based task set in his Ph.D. dissertation (Goddard, 1998). Building upon this work, Liu and Anderson developed a method to transform the PGM-derived rate-based task set into a sporadic task set, provided that the underlying graph contains no cycles (i.e., a directed acyclic graph (DAG)) (Liu and Anderson, 2010).

We now present the procedure for transforming a PGM-based graph into a sporadic task set, via an intermediate transformation into a rate-based task set. We direct the reader to Goddard (1998) and Liu

(2013) for the justiﬁcations behind the graph-to-rate-based-task-set and rate-based-to-sporadic-task-set transformations, respectively.

We ﬁrst assume that rate-based arrival parameters have been assigned to all source nodes of every graph.

We then derive these parameters for the remaining nodes using the equations

k←v v k κi · υi k υi , lcm k←v v k←v | v ∈ pred(Gi ) , (2.8) gcd(%i · χi ,κi )

k←v v k k %i χi v k χi , υi · k←v · v , where Gi ∈ pred(Gi ). (2.9) κi υi

We compute a relative deadline for every task using the equation:

k k υi di , k . (2.10) χi

At the end of this transformation, we have a rate-based task set, T rb, derived from G. We transform T rb into

k k a sporadic task set by replacing the terms χi and υi of each task with a period equal to each task’s relative deadline: k k υi pi , k . (2.11) χi

19 1 k k k k (ei , 1, 4, 4) Ti = ( ei , p i , d i )

1 1 Ti = ( ei , 4, 4) 2 2 Ti = ( ei , 3, 3) 3 3 Ti = ( ei , 3, 3) 4 4 Ti = ( ei , 6, 6)

(a) Rate-based transformation. (b) Sporadic transformation.

Figure 2.2: PGM-speciﬁed graph of Figure 2.1 transformed into rate-based (a) and sporadic (b) tasks.

1 Example 2.1. Consider the PGM-speciﬁed graph in Figure 2.1. Let us assume that Gi has an execution 1 1 3 rate of (χi = 1,υi = 4). We now transform the graph Gi into a set of rate-based tasks, followed by a transformation into sporadic tasks. These transformations are illustrated in Figure 2.2.

2 n 3·4 o 2 4·1 We apply Equations (2.8) and (2.9) to ﬁnd υi = lcm gcd(4·1,3) = 12 and χi = 12 · 3·4 = 4 for

rb2 rb2 2 the rate-based task Ti . We use Equation (2.10) to ﬁnd that the relative deadline for Ti is di =

12 rb3 2←1 3←1 2←1 3←1 4 = 3. The rate-based task Ti is similarly deﬁned, since κi = κi and %i = %i . Also, n o 4 2·12 4·12 4 1 4 rb4 υi = lcm gcd(4,2) , gcd(8,4) = lcm{12,12} = 12 and χi = 12 · 2 · 12 = 2 for the rate-based task Ti . The 4 12 relative deadline is di = 2 = 6. To transform the rate-based tasks to sporadic tasks, we merely take the rate-based relative deadline as the period for each task. ♦

Although we model the derived sporadic task set as a set of independent tasks, a scheduler must take steps, at runtime, to (i) efficiently track the production and consumption of tokens on each edge; and (ii) dynamically adjust the release time of jobs to ensure token input constraints are satisfied. We may accomplish (i) by enabling token-producing jobs to notify the scheduler of when token constraints are satisfied. Liu and

Anderson (2010) specify how to address (ii): we delay the release of any sporadic job until inputs are satisﬁed.

k k Suppose a job Ji, j is “released” at time ai, j as a conventional sporadic task, but that the token constraints k k k of Ji, j are not satisﬁed until a later time, t. In such a case, we adjust the release time of job Ji, j to ai, j = t k k and its deadline to Di, j = t + di . We present an efﬁcient implementation for token constraint tracking and release-time adjustment in Chapter 5.

3This example is courtesy of Liu and Anderson (2010).

20 CPU0 CPU0 CPU0

CPU1 CPU1 CPU1

CPU2 CPU2 CPU2

CPU3 CPU3 CPU3

(a) Partitioned (c = 1) (b) Clustered (c = 2) (c) Global (c = 4)

Figure 2.3: Each processor cluster is fed by a dedicated ready queue of jobs. Example with m = 4.

2.1.4 Scheduling Algorithms

We now discuss the algorithms we use to schedule a given sporadic task set on set of processors, as well as the formal analysis we use to determine whether timing constraints can be met.

The task set T is scheduled on hardware platform consisting of m processors (or cores). These processors may be divided into disjoint clusters of c processors each. Each task (and all its associated jobs) is assigned to a cluster. When c = 1, 1 < c < m, or c = m, the multiprocessor system is said to be scheduled by a partitioned, clustered, or global scheduler, respectively.

An incomplete released job is ready if it is available for execution, it is scheduled if the job is executing on a processor, and it is suspended if the job cannot be scheduled (for whatever reason). A scheduled job is either preemptible or non-preemptible, and cannot be descheduled while it is non-preemptible.

Each processor cluster draws ready jobs from a dedicated priority-ordered ready queue. This is depicted in Figure 2.3 for a system with four processors under partitioned (inset (a)), clustered (inset (b)), and global

(inset (c)) scheduling. In this ﬁgure, each ready job is depicted by a different shaded box. Jobs enter the ready queue when they are released, resume from a suspension, or are preempted. Jobs exit the ready queue when they are scheduled. A job is free to migrate among all processors within its assigned cluster. Of course, no migration is possible under partitioned scheduling, since each cluster is made up of only one processor.

The research in this dissertation focuses on multiprocessor scheduling algorithms; speciﬁcally, job-level

fixed-priority (JLFP) scheduling algorithms where a fixed scheduling priority is assigned to a job when the job is released. We further classify JLFP scheduling algorithms as fixed- or dynamic-priority. Fixed-priority

(FP) scheduling algorithms assign the same priority to all jobs of the same task, though each task may have a different priority. Dynamic-priority scheduling algorithms assign a priority to a job upon release—every job

21 of every task may have a different priority. The rate-monotonic (RM) scheduler prioritizes tasks with shorter periods over those with longer ones. Similarly, the deadline-monotonic (DM) scheduler prioritizes tasks according to relative deadlines instead of period. Period and relative deadline parameters are constant, so RM and DM are FP schedulers. The earliest-deadline-ﬁrst (EDF) scheduler prioritizes jobs with earlier absolute deadlines over those with later ones. EDF is a dynamic-priority scheduler since the absolute deadline depends upon the release time of a job. Similarly, the recently developed “fair-lateness” (FL) scheduler prioritizes jobs using pseudo-deadlines called priority points (Erickson, 2014). The relative priority point of a task, denoted by the parameter yi, is determined by the following formula:

m − 1 y d − e . (2.12) i , i m i

Similarly, an absolute priority point of a job, denoted by the parameter Yi, j, is given by:

m − 1 Y D − e . (2.13) i, j , i, j m i

FL is also a dynamic-priority scheduler.

We combine the processor cluster organization with a prioritization scheme to create a multiprocessor scheduling algorithm. For example, we refer to the multiprocessor scheduler deﬁned by combination of partitioned processor organization (c = 1) with EDF prioritization as “partitioned EDF” (P-EDF). Likewise, we get the “global RM” (G-RM) scheduler by combining globally organized processors (c = m) with RM prioritization.

Clustered approaches are particularly effective on large multiprocessor systems where migration costs are high due to high interprocessor communication costs. Migration costs are reduced if processors that can communicate efﬁciently, such as through shared caches, are clustered together. Semi-partitioned scheduling is another hybrid approach whereby most tasks are partitioned to a individual processors, while remaining tasks may migrate between two or more processors (Anderson et al., 2005). Semi-partitioned algorithms are also effective since migration costs are eliminated for most tasks. We will not discuss semi-partitioned algorithms any further; we only mention them here for the sake of completeness.

22 2.1.5 Schedulability Tests and Tardiness Bounds

A schedule is feasible for a task set if all timing constraints are met. A task set is schedulable under a given scheduling algorithm if the algorithm always generates a feasible schedule. (A scheduling algorithm is optimal if it always generates a feasible schedule, if one exists.) Our deﬁnitions of feasibility and schedulability are with respect to some notion of “timing constraints”—these may be application-speciﬁc.

We call systems where all deadlines must be met hard real-time (HRT) systems. Applications with HRT requirements are found in safety-critical applications where loss of life or damage may occur if a deadline is missed. We call systems where some deadline misses are acceptable soft real-time (SRT) systems. A video decoder is an example of an application with SRT requirements. This deﬁnition of SRT remains general and can be further reﬁned. In the context of this dissertation, an SRT system is one where deadline tardiness (the margin by which a deadline may be missed) is bounded.

A schedulability test is a procedure that determines if a given task set is schedulable. For instance, a classic result from a seminal work by Liu and Layland (1973) states that any periodic task set scheduled by uniprocessor RM scheduling is HRT-schedulable if

1 U(T ) ≤ n(2 n − 1). (2.14)

This schedulability test is only sufﬁcient, as some task sets may be schedulable with utilizations greater

1 than n(2 n − 1). In the same work, Liu and Layland also showed that any periodic task set scheduled under uniprocessor EDF scheduling is HRT-schedulable iff

U(T ) ≤ 1. (2.15)

Since any task set with a utilization greater than one has no feasible schedule, uniprocessor EDF is an optimal scheduler.

The development of schedulability tests has been, and continues to be, a central topic in real-time systems research. This research has resulted in numerous schedulability tests—each may evaluate schedulability under a different set of timing constraints, scheduling algorithms, and assumptions of task set characteristics.

The primary schedulability test we concern ourselves with in this dissertation is the SRT test developed by

Devi and Anderson (2006) for implicit-deadline sporadic task sets scheduled by G-EDF: any conventional

23 sporadic task set is schedulable with bounded deadline tardiness if the constraints

U(T ) ≤ m (2.16) and

∀Ti : ui ≤ 1 (2.17) hold true. Inequality (2.16) deﬁnes the task set utilization constraint, and Inequality (2.17) deﬁnes a per-task utilization constraint. These constraints can also be used to evaluate the schedulability (with bounded tardiness) of task sets with arbitrary deadlines (Erickson, 2014).

We can compute deadline tardiness bounds for schedulable implicit-deadline task sets. We deﬁne several terms and functions in order to describe this process. Let emin be the smallest job execution time among the tasks in T . We use E(T ) to deﬁne an operation that returns the subset of m − 1 tasks in T with the largest e j.

Similarly, we use U(T ) to deﬁne an operation that returns the subset of m − 2 tasks in T with the largest uk. The deadline tardiness of any job is bounded by

(∑Tj∈E(T ) e j) − emin xi = ei + . (2.18) m − ∑Tk∈U(T ) uk

We may extend the above test and tardiness bounds to C-EDF by testing each cluster individually.

In cases where dU(T )e < m, we may obtain tighter tardiness bounds by substituting the term m with mb, where

mb , dU(T )e, (2.19) in Equation (2.18) and the deﬁnitions of E(T ) and U(T ). This optimization reﬂects the observation that a task set only requires dU(T )e processors to be schedulable with bounded tardiness. The bound provided by Equation (2.18) may be tightened if we assume fewer than m processors in analysis.

Example 2.2. Figure 2.4 depicts the schedule for three periodic implicit-deadline tasks scheduled under

G-EDF on two processors (m = 2), with deadline ties broken by task index. The tasks share the same parameters: Ti = (ei = 8, pi = 12,di = 12). After time t = 12, the schedule settles into a steady pattern.

24 T1

T3 time 0 5 10 15 20 25 30 35 40 45 50 55 60 65

ReleaseDeadline Complete

Scheduled on CPU0 Scheduled on CPU1

Figure 2.4: Example of bounded deadline tardiness for a task set scheduled under G-EDF on two CPUs.

This task set satisﬁes the task set utilization (Inequality (2.16)) and per-task utilization (Inequality (2.17))

2 constraints since U(T ) = 2 and ui = 3 . Using Equation (2.18) to analytically bound deadline tardiness, we ﬁnd that no job will miss its deadline by more than eight time units. ♦

In Figure 2.4, we see intuitively that task T3 will never misses its deadline by more than four time units. In contrast, Equation (2.18) bounds tardiness by eight time units. This difference of four time units is evidence of pessimism in analysis. Compliant Vector Analysis (CVA) by Erickson (2014) offers tighter tardiness bounds. CVA computes bounds on deadline lateness, instead of tardiness, by solving a linear program. We present the CVA linear program here, but we direct the reader to Erickson (2014) for justiﬁcation.

CVA uses pseudo-deadline priority points, which we discussed with respect to FL-scheduling in Sec- tion 2.1.4, and also deﬁnes several additional terms. Under CVA analysis of EDF scheduling, the relative priority point and deadline of a task coincide (i.e., yi = di). We characterize processor demand by task Ti with the function yi Si(yi) = ei · max 0,1 − . (2.20) pi

Total demand is given by

S(~y) = ∑ Si(yi). (2.21) Ti∈T

Response time and lateness bounds are deﬁned recursively, withx ˆi as a real value:

ri = yi + xˆi + ei, (2.22)

25 and

li = yi + xˆi + ei − di. (2.23)

The function ~ G(xˆ,~y) = ∑ (xˆiui + ei − Si(yi)) (2.24) mb−1 largest denotes the processor demand from tasks that can contribute to job lateness.4 Finally, let

s = G(~xˆ,~y) + S(~y). (2.25)

We define additional variables Si, Ssum, G, b, and zi for the linear program. We find values for xî by solving the following linear program:

Minimize: s

s − ei Subject to:x ˆ = ∀i, i m

Si ≥ 0 ∀i, yi Si ≥ ei · 1 − ∀i, pi

zi ≥ 0 ∀i,

zi ≥ xˆiui + ei − Si − b ∀i,

Ssum = ∑ Si, Ti∈T

G = b · (mb − 1) + ∑ zi, Ti∈T

s ≥ G + Ssum

With values forx ˆi from the solution to the linear program, we compute bounds for li using Equation (2.23). There are three main advantages to CVA over Devi’s method (Equation (2.18)). First, CVA usually gives tighter bounds than Equation (2.18).5 Second, CVA computes lateness instead of tardiness. In some instances,

CVA may compute a negative value for li, indicating that a job of task Ti never misses a deadline. Finally,

4 Observe that we use the term mb deﬁned by Equation (2.19). 5 CVA strictly dominates Equation (2.18) with additional enhancements to its linear program (Erickson, 2014).

26 critical section blocked

Ji time req sat rel ti, j,k ti, j,k ti, j,k

Figure 2.5: Critical section of job Ji.

CVA can be applied to task sets that include tasks with arbitrary deadlines, while Equation (2.18) may only be applied task sets made up entirely of tasks with implicit deadlines.

2.1.6 Locking Protocols

The sporadic task model can be extended to allow a set of serially-reusable shared resources (such as shared data objects and I/O devices) to be speciﬁed. Access to these resources must be mutually exclusive, in that only one job may access the resource at a time. We denote q such resources by `1,··· ,`q. When a job Ji

req th requires a resource ` j, it issues a request Ri, j,k for ` j at time ti, j,k. The subscript k denotes the k non-nested sat request of Ji for ` j. Ri, j,k is satisﬁed as soon as Ji holds ` j at time ti, j,k, and completes when Ji releases ` j at rel time ti, j,k. This sequence of events is illustrated by Figure 2.5. We call the computation and operations (such sat rel as I/O) of Ji performed within the interval [ti, j,k,ti, j,k] a critical section. The length of a critical section of job

Ji for request Ri, j,k of resource ` j given by

rel sat Li, j,k , ti, j,k −ti, j,k (2.26)

when job Ji executes without preemption. We denote the longest critical section for resource ` j of any job of

max Ti by Li, j . The longest critical section of resource ` j of any task is given by

max max L j , max Li, j . (2.27) Ti∈T

We denote the number of times a job Ji may issue a request for resource ` j with ηi, j. Finally, the total number of resource requests of all resources requested by Ji is denoted by ηi. The above parameters are summarized in Table 2.2.

27 Parameter Meaning th ` j j serially reusable shared resource th Ri, j,k k request of job Ji for resource ` j Li, j,k critical section length of request Ri, j,k max Li, j longest critical section of any request from a job of Ti for resource ` j max L j longest critical section of any request for resource ` j ηi, j number of times job Ji may request resource ` j ηi total number of resource requests issued by job Ji

Table 2.2: Summary parameters for describing shared resources.

Locking protocols arbitrate resource requests for exclusive access issued by jobs. If a job Ji issues a request Ri, j,k for resource ` j that is unavailable, then Ji is blocked until Ri, j,k is satisfied. As depicted in req sat Figure 2.5, job Ji is blocked during the interval [ti, j,k,ti, j,k). In general, the blocked job Ji can wait for Ri, j,k to be satisfied by either spinning or suspending. Under spinning, job Ji remains scheduled on a processor and executes a tight polling loop (busy-waits) until Ri, j,k is satisfied. Under suspension, Ji relinquishes its processor and enters a suspended state, and Ji becomes ready the instant Ri, j,k is satisfied. The locking protocol determines whether spin- or suspension-based waiting methods are used, as well as the order in which multiple outstanding requests are satisfied. Spin-based locking protocols are commonly used when resource access times are very short, since the runtime overhead of suspending a job can exceed the time spent spinning. However, because GPUs have relatively long access times, we concern ourselves only with suspension-based locking protocols. This allows other useful work to be done while jobs wait for GPU access.

2.1.6.1 Priority Inversions and Progress Mechanisms

A priority inversion occurs whenever lower-priority work is scheduled instead of ready higher-priority work. Some sources of priority inversions are forced upon us by real-world constraints. For instance, device interrupts can be a source of priority inversions—we examine this at length later in Section 2.2.3. Resource sharing can also lead to such inversions.

Example 2.3. Figure 2.6(a) depicts a classic priority-inversion scenario. Here, three jobs are scheduled on a uniprocessor system. A low-priority job JL is released at time 0. At time 5, job JL obtains a lock on a shared resource. A time 8, a high-priority job JH is released, preempting JL. Job JH requires the resource held by job

JL at time 15, so JH blocks and is suspended from the processor; job JL resumes execution. At time 18, a medium-priority job JM is released and preempts job JL because JM has a higher priority. Job JM continues to

28 ReleaseComplete Block Unblock Resource Held Priority Inversion Elevated Priority

J H JH

J M JM

J J H L JL time time 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 50

(a) JH experiences an unbounded priority inversion. (b) JL inherits a priority from JH at time 15.

JVH

MAX JL time 0 5 10 15 20 25 30 35 40 45 50

Figure 2.6: Resource sharing can lead to priority inversions. Progress mechanisms, such as priority inheritance (inset (b)) and priority boosting (inset (c)), can shorten priority inversion duration.

execute beyond the depicted schedule. Job JH is ready to be scheduled at time 15, but it cannot since it must wait to obtain the resource held by job JL. Thus, job JH suffers from a priority inversion, starting at time 15.

♦

We say that the priority inversion suffered by job JH in Figure 2.6(a) is unbounded because the length of the inversion depends upon the execution time of job JM, which may be arbitrarily large. Priority inversion durations must be bounded. We accomplish this by using “progress mechanisms” that expedite the scheduling of lower-priority resource-holding jobs. That is, we intentionally increase the scheduling priority of a resource-holder from a base priority (i.e., its default priority) to a higher effective priority. Naturally, the resultant priority inversion bounds are a function of resource critical section lengths, as opposed to arbitrary processor demand (as was the case with job JM in Figure 2.6(a)). We discuss three general methods employed by real-time locking protocols to bound the duration of priority inversions due to resource sharing: priority inheritance, priority boosting, and priority donation.

Priority Inheritance. Under priority inheritance (Rajkumar, 1991; Sha et al., 1990), the effective priority of a job Ji holding resource `k is set to the maximum of Ji’s base priority and the effective priority of all jobs

29 blocked (or that may block, depending upon the locking protocol) on `k. That is, job Ji inherits the priority of the highest-priority job that is waiting for `k. Job Ji’s effective priority remains elevated until `k is released.

Example 2.4. Figure 2.6(b) depicts an example of priority inheritance. As before, job JH requires the resource held by job JL at time 15, so JH blocks and is suspended from the processor; job JL resumes execution. However, job JL inherits the priority of job JH , so JL is scheduled with an effective priority of JH .

When job JM is released at time 18, it lacks the sufﬁcient priority to preempt JL. Job JL remains scheduled.

Job JL relinquishes the share resource to job JH at time 20. With the needed resource obtained, job JH is immediately scheduled. ♦

Priority inheritance is often viewed as the temporary transference of the priority from a high priority job to a low priority job. However, it is better to conceive of priority inheritance as a transference, occurring in the opposite direction, of work. That is, we may view priority inheritance as the transference of low priority work, i.e., a critical section of a low priority job, to a higher-priority job. Conceptually, we may think of a job as obtaining any requested resource immediately—the job is never blocked. However, such a job may be required to notionally execute the critical sections of lower priority jobs on their behalf.6 This view is embodied by a strengthened form of priority inheritance called bandwidth inheritance (BWI) (Lamastra et al., 2001; Nogueira and Pinho, 2008). Under BWI, a resource holding job that inherits a priority also inherits the execution time budget of the job associated with the inherited priority. Thus, the execution time of the critical section is charged against the budget of the blocked high-priority job, not the lower-priority resource holding job. Tasks can be provisioned with enough budget to cover any budget lost due to BWI.

However, budgets may be exhausted if critical sections take longer to execute than expected (or provisioned).

We may take additional measures to isolate the temporal effects of such a fault. If the budget of a task with an unsatisﬁed resource request is exhausted, then we abort the task’s request, refresh the budget of the task

(possibly decreasing the task’s priority), and reissue the aborted request (Brandenburg, 2012, 2014a).

Priority Boosting. Under priority boosting (Brandenburg and Anderson, 2013; Lakshmanan et al., 2009;

Rajkumar, 1990, 1991; Rajkumar et al., 1988), a job Ji’s effective priority is set to the highest scheduling priority when access to a shared resource is contended, or when access is granted, depending upon the locking protocol. Job Ji’s effective priority remains elevated until `k is released.

6The number of critical sections executed depends upon the locking protocol.

30 ReleaseComplete Block Unblock Resource Held Priority Inversion Elevated Priority

JM JM CPU0 CPU0

CPU1 CPU1 JH JH

J JL M JL MAX time time 0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 50 (a) Priority inheritance ineffective across CPU partitions. (b) Priority boosting effective across CPU partitions.

JM CPU0

CPU1 JH

J JL H time 0 5 10 15 20 25 30 35 40 45 50 (c) Priority donation ensures resource-holding jobs are always scheduled.

Figure 2.7: Stronger progress mechanisms may be needed in multiprocessor systems.

Example 2.5. Figure 2.6(c) illustrates an example of priority boosting. This ﬁgure depicts the same scenario we have studied before, except that we have added an additional very high priority job, JVH , in order to illustrate an important difference between priority inheritance and priority boosting. As before, job JH requires the resource held by job JL at time 15, so JH blocks and is suspended from the processor; job JL resumes execution, but when job JM is released at time 18, it lacks the sufﬁcient priority to preempt JL—JL remains scheduled. At time 19, the very high priority job JVH is released. However, since job JL has a boosted priority, JVH cannot preempt JL, either. Job JL relinquishes the share resource to job JH at time 20, and job

JVH is immediately scheduled. Job JVH completes at time 25. With the needed resource obtained, job JH is scheduled next. ♦

Observe that in Figure 2.6(c) JVH suffers from a priority inversion, even though it does not require the shared resource. This is a drawback to priority boosting: any job may suffer a priority inversion due to the priority boosting of another job. This can be detrimental to schedulability.

Priority boosting is implicitly used by most spin-based locking protocols, as these protocols commonly disable interrupts (disabling preemption) while spinning and within a critical section—this behavior is essential to avoiding deadlock in a spin-based locking protocol. Disabling preemption essentially gives the resource holding job a maximum priority. Priority boosting is also used in suspension-based locking

31 protocols. Particularly, in multiprocessor locking protocols where resources are shared among partitions or clusters of processors. In general, priority inheritance is an ineffective progress mechanism across partitions and clusters. This because a priority that guarantees that a job is scheduled within its own local partition or cluster (i.e., the priority of a blocked job) may not be sufﬁcient to guarantee that the resource holder is scheduled within a remote partition or cluster.

Example 2.6. The ineffectiveness of priority inheritance for resources shared across partitions or clusters is illustrated in Figure 2.7(a). Here, job JM is partitioned to CPU0, while jobs JH and JL are partitioned to

CPU1. Job JL inherits a priority from JM at time 15. However, since the priority of JH is greater than JM, JL is not scheduled. JM experiences an unbounded priority inversion, even though it has sufﬁcient priority to be scheduled on CPU0. Priority boosting is effective in the same scenario, as illustrated by Figure 2.7(b). Here, the priority inversions that effect jobs JH and JM are bounded by the critical section length of JL. ♦

Priority Donation. Priority donation is a recently developed progress mechanism for multiprocessor locking protocols (Brandenburg and Anderson, 2013). Priority donation is similar to priority inheritance in that a job may adopt a priority from another job. Priority donation is also similar to priority boosting in that resource holding jobs are always scheduled if they are ready. A donor is a job that donates its priority to a donee job.

The effective priority of a donee job is set to that of its donor. The donation relationship between donor and donee is established upon job release of the donor.

We deﬁne two sets of pending jobs to help describe priority donation. Let J denote the set of all pending jobs; these may be ready or suspended. Let the subset J top-c ∈ J denote the set of pending jobs with the top-c

top-c priorities (recall that c denotes the processor cluster size). Upon the release of a job Jd, if Jd ∈ J , and the

top-c top-c arrival of Jd causes a job Ji to be moved from J to the set J \ J , then Jd may become a priority donor under the following conditions:

1. If Ji is blocked waiting for a resource or it is a resource holder, then Jd donates its priority to Ji; Ji

becomes the donee of Jd.

2. If Ji is a donor to a job Jj, then Jd donates its priority to Jj, ending the donor relationship between Ji

and Jj, and replacing it with a relationship between Jd and Jj.

32 The effective priority of a donee job remains elevated until it releases any shared resources, terminating any donor relationship it may have. A priority donor may not be scheduled until its donor relationship is terminated.7

Example 2.7. Figure 2.7(c) illustrates an example of priority donation. Here, job JL obtains a shared resource at time 5. The higher-priority job JH is released at time 8. Since job JL holds a resource, job JH donates its priority to JL, and JL continues to execute. Job JL terminates the donation relationship at time 13, when it releases the shared resource. Job JH is immediately scheduled. ♦

In general, any job may required to become a priority donor upon its release. As a result, any job can experience a priority inversion due to donation, since donors always suffer from a priority inversion. That is, every donor is in J top-c (i.e., every donor should be scheduled), and no donor is ever scheduled. Priority donation and priority boosting are similar in that any job can suffer a priority inversion due to shared resources used by other jobs. However, under priority donation, this priority inversion occurs at most once per job.

Whereas, under priority boosting, a job may experience multiple priority inversions.

Priority donation is used by the “Clustered k-exclusion O(m) Locking Protocol” (CK-OMLP), developed by Brandenburg and Anderson (2013). Ward et al. (2012) adapted the ideas behind priority donation for the

“Replica-Request Donation Global Locking Protocol” (R2DGLP). However, the R2DGLP limits donors to the set of jobs that actually share a resource. The protocol also defers the establishment of donation relationships to the moment a shared resource is requested, instead of at job release. Both the CK-OMLP and R2DGLP are a foundational element to GPUSync, so we discuss them at length in Section 2.1.7.

2.1.6.2 Nested Locking

A job may require exclusive access to multiple shared resources at once. This can lead to nested resource requests, where a task first acquires resource à and then acquires resource `b. In other words, the critical section of `b may be nested within the critical section of à. Arbitrary nesting of critical sections may lead to deadlock. The classic example of deadlock is the situation where task Ti holds resource à and blocks for access to resource `b, while Tj holds resource `b and blocks for access to resource à. Neither task makes progress, so the two tasks are blocked forever. Real-time correctness cannot be guaranteed for a system where

7Additional reﬁnements to priority donation rules allow a donor to be scheduled under special conditions when its donee is suspended. However, this is merely a runtime optimization that does not improve schedulability analysis. We direct the interested reader to Brandenburg and Anderson (2013) for details.

33 deadlock is possible. There are three general approaches to supporting nested resource requests: group locks, totally-ordered nested requests, and deadlock-free locking protocols.

Group Locks. The first approach to manage nested locking is to define away the problem. This is done by protecting the set of nested shared resources with a single group lock (Block et al., 2007). A task must acquire this lock if it needs to access one or more of the resources protected by the group lock. While safe, this approach limits parallelism. For example, consider the situation where three resources, à, `b, and `c, are protected by a single group lock. Task Ti requires resources à and `b, while task Tj only requires `c. The execution of Ti and Tj is serialized when they contend for the group lock, even though they do not actually share the same resources.

Totally-Ordered Nested Requests. Another approach to supporting nested critical sections is to ensure that resources are acquired in an order that guarantees deadlock freedom. To do so, we enumerate all resources in a single sorted order `1,··· ,`q. This ordering is observed by all tasks in a system. If a task requires simultaneous access to two resources, then it must acquire `i before resource ` j, where i < j. This generalizes to an arbitrary number of resources. It is easy to see how total ordering resolves the classic deadlock scenario.

If tasks Ti and Tj both require access to resources à and `b, then the tasks contend for à before they may contend for `b. No task can hold `b while it contends for à, so nested locking is deadlock-free. A drawback to totally ordered nested requests is that it requires disciplined programming. A given resource ordering may also be at odds with the natural flow of program code. For example, although a task Ti may require access to both resources à and `b, program code may begin using `b long before à. However, the total ordering requires à to be obtained early.

Deadlock-Free Locking Protocols. Deadlock freedom can also be guaranteed by a locking protocol algorithm. Classic (uniprocessor) real-time locking protocols that ensure deadlock freedom include the priority-ceiling protocol (PCP) (Sha et al., 1990) and the stack resource policy (SRP) (Baker, 1991). These locking protocols use rules that delay access to a shared resource, even if it is available, if immediate access may lead to deadlock at a later time.

Another technique that can guarantee deadlock freedom is the use of dynamic group locks (DGLs) (Ward and Anderson, 2013). Under DGLs, a task issues requests for all resources it may require atomically. DGLs leverage the combined atomic request to guarantee deadlock freedom. Consider the following scenario. A task Ti requires resources `a and `b, while task Tj requires resources `a, `b, and `c. Ti issues a combined

34 request for (`a,`b), and Tj issues a combined request for (`a,`b,`c). The underlying locking protocol data structures for each resource are jointly updated atomically. Under FIFO-ordered locks, resources can be granted in any order without risk of deadlock. This is because whenever tasks Ti and Tj contend for the same resources, the relative ordering between the tasks’ requests is the same in every FIFO queue. Thus, access to every resource is granted in the same order. This prevents the deadlock scenario where each task waits for resources held by the other.

We must point out two important details of DGLs. First, DGLs must maintain the illusion of obtaining resources through individual requests in order to maintain sporadic task model abstractions. This means that progress mechanisms that act on behalf of a task Ti may only be active on one lock at a time. We illustrate this point with an example. Suppose task Ti waits for resources `a and `b, and priority inheritance is used as the progress mechanism for these locks. Either the resource holder of `a or the resource holder of `b may

8 inherit the priority of Ti at any given time instant, but not both. The second important detail of DGLs is that the underlying locking protocol implementation must support joint atomic updates. The data structures that manage unsatisﬁed lock requests are commonly protected by per-lock spinlocks that reside in the OS kernel.

These spinlocks are only held while the data structures are modiﬁed. In order to support atomic DGL resource requests, all of the spinlocks related to the resources in a DGL request must be obtained before modifying the data structures of the individual locks. We have traded one multi-resource request problem (the request for DGL-protected resources) for another (the spinlocks that protect the locking protocol data structures of said resources)! We can resolve this problem in one of two ways. We may protect all locking protocol data structures with a single spinlock (i.e., a group lock). This may be appropriate, since spinlocks are held for only a short duration. However, this hurts parallelism, as all concurrently issued resource requests serialize on the DGL spinlock. A better approach is to obtain the necessary spinlocks in a total order. Thankfully, this trivial to implement in the OS kernel. Each spinlock has a unique memory address, so we obtain spinlocks in order of their memory addresses.

2.1.6.3 Priority-Inversion Blocking

Blocking durations must be accounted for in schedulability analysis when locking protocols are used.

However, schedulability analysis must only consider blocking durations for which delays in execution cannot

8 The only exception to this rule is when the same task holds both `a and `b.

35 be attributed to higher-priority demand (otherwise, a job without sufﬁcient priority to be scheduled has no effect on analysis). We term this type of blocking priority inversion blocking (pi-blocking), and quantify the total time a job Ji may be pi-blocked with the term bi. Brandenburg and Anderson (2013) classify analytical techniques for bounding pi-blocking due to suspension-based locking protocols as either suspension-oblivious

(s-oblivious) or suspension-aware (s-aware). Under s-oblivious analysis, all suspensions, including those introduced by waiting for shared resources, are analytically treated as processor demand. Hence, a job’s execution time ei is inﬂated by bi prior to performing schedulability tests such that

0 ei ≥ ei + bi, (2.28)

0 where ei denotes the safe bound on execution time used in s-oblivious schedulability analysis. Inﬂation essentially converts a set of dependent tasks into a set of independent tasks that can be analyzed by “normal”

(locking-protocol-agnostic) schedulability tests. This approach is safe, but pessimistic in that processor time is analytically consumed by all suspensions. S-aware schedulability analysis explicitly treats bi as suspension time. However, this treatment must be incorporated into schedulability tests. These tests are more difﬁcult to develop, and s-aware analysis has not yet matured for all schedulers. We primarily use s-oblivious tests for global dynamic-priority JLFP schedulers. S-aware analysis is available for global ﬁxed-priority scheduling,

P-EDF, and partitioned ﬁxed-priority scheduling (e.g. see Easwaran and Andersson (2009); Lakshmanan et al. (2009); Rajkumar (1991)).

Under global multiprocessor scheduling, if a ready job Ji is not amongst the m highest-priority ready jobs, then Ji is not scheduled, and it does not suffer from any priority inversions. This is reﬂected by s-aware analysis: the presence of m higher-priority ready jobs rules out the possibility of priority inversions for lower-priority jobs. In contrast, under s-oblivious analysis, the presence of m higher priority jobs, ready or suspended, rules out the possibility of priority inversions for lower-priority jobs. This fact can be exploited to implement optimal locking protocols under s-oblivious analysis, as we discuss shortly.

Example 2.8. Figure 2.8 illustrates the difference between s-oblivious and s-aware pi-blocking. Here, three jobs are scheduled globally across two (m = 2) processors. Jobs JH , JM, and JL have a high, medium, and low relative priorities, respectively. Job JH is suspended during the time interval [5,15) while it waits for the shared resource held by job JM. Job JH has the highest priority, so it experiences pi-blocking under both deﬁnitions of s-oblivious and s-aware analysis. Job JL is suspended during the time interval [10,20), waiting

36 Release Scheduled on CPU0 JH Scheduled on CPU 1 Complete JM Resource Held Block S-aware pi-blocking only JL time S-aware or s-oblivious pi-blocking Unblock 0 5 10 15 20 25

Figure 2.8: Comparison of s-oblivious and s-aware pi-blocking under global scheduling.

for the same shared resource. Because job JM completes at time 15, job JL experiences pi-blocking under both deﬁnitions of s-oblivious and s-aware analysis from time [15,20) (it is among the m highest-priority pending jobs). However, job JL experiences only s-aware pi-blocking during the time interval [10,15) because job JH pending but not scheduled. ♦

2.1.7 Multiprocessor k-Exclusion Locking Protocols

Most multiprocessor locking protocols have been developed for JLFP schedulers, with most attention towards the common schedulers P-RM, P-DM, P-EDF, and G-EDF.

The multiprocessor priority-ceiling protocol (MPCP) (Lakshmanan et al., 2009; Rajkumar, 1990) and the distributed priority-ceiling protocol (DPCP) (Rajkumar et al., 1988) were developed for P-FP schedulers and represent the ﬁrst multiprocessor real-time locking protocols. Locking protocols for P-EDF scheduling were later developed by Chen and Tripathi (1994), Gai et al. (2003), and Lopez et al. (2004). Locking protocols supporting global scheduling have developed more recently. These include the ﬂexible multiprocessor locking protocol (FMLP) by Block et al. (2007), supporting any global JLFP scheduler, and the parallel priority-ceiling protocol (PPCP) by Easwaran and Andersson (2009), supporting G-FP scheduling.

Our earlier discussion on the effect of s-oblivious and s-aware analysis on pi-blocking is based upon insights of Brandenburg and Anderson (2013). These insights are a relatively new development in the analysis real-time locking protocols and allow some locking protocols to be classified as optimal under s-aware or s-oblivious analysis. Under s-aware analysis, a mutual exclusion locking protocol is optimal if bi is O(n), in terms of the number of conflicting requests. In contrast, a mutual exclusion locking protocol is optimal if bi is O(m) under s-oblivious analysis; this is significant since m n is common in practice. Locking protocols

37 from the “O(m) Multiprocessor Locking Protocol” (OMLP) family of locking protocols are optimal under s-oblivious analysis (Brandenburg and Anderson, 2013; Elliott and Anderson, 2013; Ward et al., 2012).

“K-exclusion” locking protocols can be used to arbitrate access to pools of similar or identical serially reusable resources, such as communication channels or I/O buffers. K-exclusion extends ordinary mutual exclusion (mutex) by allowing up to k tasks to simultaneously hold locks (thus, mutual exclusion is equivalent to 1-exclusion). K-exclusion has historically received little attention in the real-time community. Up until recently, only Chen (1992) had examined k-exclusion for real-time uniprocessor systems (to the best of our knowledge). However, GPU applications have renewed interest in the topic, and several real-time k-exclusion locking protocols for multiprocessor systems have resulted. These include the aforementioned

CK-OMLP (Brandenburg and Anderson, 2013) and R2DGLP (Ward et al., 2012), as well as a k-exclusion variant of the global FMLP-Long, called the k-FMLP (Elliott and Anderson, 2012b).9

The deﬁnition of optimality changes under k-exclusion. Under s-aware analysis, a k-exclusion locking protocol is optimal if bi is O(n/k), in terms of the number of conﬂicting requests. In contrast, a k-exclusion

2 locking protocol is optimal if bi is O(m/k) under s-oblivious analysis. The R DGLP and CK-OMLP are optimal under s-oblivious analysis.

The k-FMLP, R2DGLP, and CK-OMLP are important to the design of GPUSync, so we discuss next the rules governing each protocol. We begin with the k-FMLP, being the simplest of the three protocols, and then discuss the R2DGLP and CK-OMLP.

2.1.7.1 The k-FMLP

The k-FMLP is simple extension of the global FMLP-Long to support k-exclusion.10 It may be used to protect a pool of k resources shared by tasks within the same cluster of processors. The pi-blocking experienced by a job waiting for a resource protected by the k-FMLP is O(n/k) where n is the number of tasks using the lock. The k-FMLP is designed as follows.

9The author of this dissertation contributed to the development of the “Optimal k-Exclusion Global Locking Protocol” (O-KGLP) (El- liott and Anderson, 2013). We do not discuss the O-KGLP since it is obsolesced by the R2DGLP, which analytically dominates the O-KGLP. 10The k-FMLP was designed by the author of this dissertation. We discuss it in this background chapter, rather than a later chapter, because the k-FMLP is a minor contribution. A detailed description and analysis of the k-FMLP may be found in the online appendix of Elliott and Anderson (2012b) at http://www.cs.unc.edu/~anderson/papers.html.

38 FQ1

FQ2 k resources FQ 3 …

FQk

Resource Holders

Figure 2.9: Queue structure of the k-FMLP.

Structure. The structure of the k-FMLP is illustrated in Figure 2.9. The k-FMLP uses k FIFO request queues, denoted FQ1,··· ,FQk. Each queue is assigned to one of the k protected replicas of resource ` j.A job Ji enqueues a resource request Ri, j,k onto the queue FQx when the job requires a resource. A job with a request at the head of its queue is considered the holder of the associated resource and is ready to run. Jobs with blocked requests are suspended from the processor.

Rules. The k-FMLP may compute the length of a FIFO queue FQx at runtime with either one of two formulas.

Under the simplest formulation, the length of FQx is given by the number of enqueued requests:

length(FQx) , |FQx|. (2.29)

Alternatively, length may be expressed by the critical section lengths of enqueued requests:

length(FQx) , ∑ Li, (2.30) Ri∈FQx where we reindex the requests in FQx with i. We call the formulation of Equation (2.29) critical-section- oblivious, and the formulation of Equation (2.30) critical-section-aware. We may use either formulation in the following rules that govern the k-FMLP. Let Ji denote a job that issues a request Ri, j,k for resource ` j.

F1 When Ji issues Ri, j,k, Ri, j,k is appended to the queue with the minimum length, min1≤x≤k{length(FQx)}. th Ji acquires the x resource when Ri, j,k is at the head of FQx.

39 F2 All jobs with queued requests are suspended except for resource holders, which are ready. The effective

priority of the resource holder in FQx is set to the maximum priority of all jobs with requests queued in

FQx.

F3 When Ji releases replica x of resource ` j, Ri, j,k is dequeued from FQx, and the job with the next queued

request in FQx is granted the newly available resource. If FQx is empty, then an arbitrary pending

request (if one exists) from another queue is “stolen” (removed from its queue) and moved to FQx, and the stolen request is granted replica x.11

Blocking Analysis. We provide a summary of the s-oblivious blocking analysis presented by Elliott and

Anderson (2012b). We direct the reader to that paper for the rational behind the following claims. For simplicity of presentation, we assume that each job issues one request.

By Rule F1, each request is enqueued on the shortest queue when it is issued, according to the function length(FQx). Thus, the k-FMLP load-balances requests among the k resources. We denote the bound on pi-blocking that a request of job Ji for a replica of a resource ` j may experience under the k-FMLP with the

k-FMLP term bi, j . Under the critical-section-oblivious formulation (Equation 2.29), request Ri, j,k is blocked by at most n − 1 bk-FMLP = · Lmax (2.31) i, j k j

n−1 time units. No request is blocked by more than k requests. Hence, blocking under the k-FMLP is O(n/k).

Under the critical-section-aware formulation (Equation 2.30), request Ri, j,k is blocked by at most

max ∑Tl ∈T \{Ti} Ll j bk-FMLP = , (2.32) i, j k

12 n−1 time units. Although request Ri, j,k may be blocked by more than k individual requests under this formu- Lmax ∑Tl ∈T \{Ti} l, j n−1 max lation, k < k ·L j often holds true, so the critical-section-aware method may provide a tighter bound on blocking. However, the critical-section-aware method requires a more complex implementation of the k-FMLP, as it must be cognizant of the critical section lengths of enqueued requests.

11Request “stealing” does not affect worst-case blocking analysis, but it ensures efﬁcient resource utilization at runtime. 12Tighter blocking bounds under the critical-section-aware method can be obtained by using an integer linear program to determine the longest the shortest queue may be when Ri, j,k is issued.

40 m donors

m " k !

FQ1

FQ2 k resources

FQ 3 …

FQk Resource PQ Holders

Figure 2.10: Queue structure of the R2DGLP.

2.1.7.2 The R2DGLP

The R2DGLP is a k-exclusion locking protocol that is optimal under s-oblivious analysis. It may be used to protect a pool of k resources shared by tasks within the same cluster of processors. The pi-blocking experienced by a job waiting for a resource protected by the R2DGLP is O(m/k). The generality of the structure and rules that govern the R2DGLP, as described by Ward et al. (2012), lends a considerable degree of leeway to the developer in the protocol’s implementation. Here, we describe a slightly simpliﬁed version to make the R2DGLP more concrete for the reader. These simpliﬁcations do not violate the rules prescribed by Ward et al. The locking protocol is designed as follows.

Structure. The structure of the R2DGLP is illustrated in Figure 2.10. Similar to the k-FMLP, the R2DGLP uses k FIFO requests queues, denoted FQ1,··· ,FQk. Each queue is assigned to one of the k protected replicas of resource ` j. A job with a request at the head of its queue is considered the holder of the associated resource and is ready to run. Jobs with blocked requests are suspended from the processor. However, unlike the k-FMLP, no FIFO queue may hold more than dm/ke requests. Additional requests may “overflow” into one of two priority queues, denoted PQ and DQ. (We depict PQ and DQ in Figure 2.10 with triangles, as they are efficiently implemented by heap data structures.) Requests with a sufficiently high priority are inserted into DQ, while others are inserted into PQ. PQ is organized by priority-order. DQ is organized in

41 reverse-priority-order (i.e., the lowest-priority request appears at the head of the queue). As we describe shortly, DQ may hold at most m requests.

Rules. Let Ji denote a job that issues a request Ri, j,k for resource ` j. The priority of Ri, j,k is equivalent to the base priority of Ji. We denote the set of incomplete requests with the m-highest priorities by M . The R2DGLP makes use of the relaxed notion of priority donation as discussed at the end of Section 2.1.6.1, whereby the establishment of a donation relationship is deferred to the moment a shared resource is requested, rather than at job release. We denote the set of requests that have a priority donor (i.e., the donees) by d . The length of FQx is the number of enqueued requests in FQx, given by the function

length(FQx) , |FQx|. (2.33)

With these deﬁnitions, the R2DGLP operates under the following rules.

R1 When Ji issues Ri, j,k:

(a) Ri, j,k is enqueued on the shortest FIFO queue, FQx, if length(FQx) ≤ dm/ke.

(b) else if Ri, j,k ∈/ M , then Ri, j,k is enqueued in PQ.

(c) else Ri, j,k is enqueued in DQ. If the number of requests in DQ is no greater than m after Ri, j,k Sk is enqueued, then the priority of Ri, j,k is donated to an arbitrary request in the set x=1{R|R ∈

FQx}\ d \ M (i.e., any request in an FQ without a donor and is not among the requests in M ). L Otherwise, the lowest-priority request R in DQ (which cannot be Ri, j,k by Rule R1b), is moved L to PQ, and Ri, j,k becomes the priority donor to the donee of R . (Donor relationships are depicted in Figure 2.10 by dashed arrows.)

R2 Ri, j,k is satisﬁed when it is at the head of FQx.

R3 Ji suspends until request Ri, j,k is satisﬁed.

R4 The job with a request at the head of FQx inherits the highest effective priority (which could be a

donated priority) of any request in FQx.

D D R5 Ri, j,k is dequeued from FQx when Ji releases the replica x. If Ri, j,k has a priority donor R , then R is

removed from DQ (by Rule R1c all donors must be in DQ) and enqueued on FQx. The job that issued

42 D R no longer donates its priority to the job of Ri, j,k. Otherwise, the request at the head of PQ (if it

exists) is removed from PQ and enqueued on FQx.

Blocking Analysis. We provide a summary of the blocking analysis presented by Ward et al. (2012). We direct the reader to that paper for the rational behind the following claims. For simplicity of presentation, we assume that each job issues one request.

We begin by bounding the number of other requests that may pi-block a request Ri, j,k within each queue under s-oblivious analysis.

Pi-blocking in FQx: Rule R1a ensures that the maximum length of FQx is dm/ke. No request in FQx is pi-blocked by more than dm/ke − 1 other requests.

Pi-blocking in DQ: Rule R5 ensures that a donor request in DQ is moved to FQx once their donee request completes. A request in DQ is pi-blocked by no more than dm/ke other requests.

Pi-blocking in PQ: Rules R1b and R1c ensure that Ri, j,k ∈/ M holds at the time Ri, j,k enters PQ. Ri, j,k may only enter M when a resource holder releases a replica, i.e., when the request of a resource

holder, Rx, exits from FQx. Rx ∈ M must hold if the dequeue of Rx from FQx promotes Ri, j,k into

M . By Rule R1c, Rx cannot have a priority donor. Thus, by Rule R5, a request from PQ is moved

into FQx. Ri, j,k must have the highest priority among all requests in PQ since Ri, j,k ∈ M while

{R | R ∈ PQ ∧ R =6 Ri, j,k} ∩ M = /0. By Rule R5, Ri, j,k is removed from PQ and enqueued on FQx. No request in PQ experiences pi-blocking under s-oblivious analysis, because the moment such a

request could experience pi-blocking, it is moved to an FQ.

Summing the pi-blocking that can be incurred by Ri, j,k as it moves through the queues, the maximum number of other requests that may pi-block a request is

lmm 2 · − 1. (2.34) k

We denote the bound on pi-blocking that a request of job Ji for a replica of a resource ` j may experience

2 R2DGLP under the R DGLP with the term bi, j . By Equation (2.34),

2 lmm bR DGLP = 2 · − 1 · Lmax. (2.35) i, j k j

43 Hence, pi-blocking under the R2DGLP is O(m/k)—optimal under s-oblivious analysis. Equation (2.35) provides a coarse-grain bound on pi-blocking. We examine derivation for ﬁner-grained blocking bounds in

Chapter 4.

2.1.7.3 The CK-OMLP

The CK-OMLP, like the R2DGLP, is a k-exclusion locking protocol that is optimal under s-oblivious analysis. Unlike the R2DGLP, the CK-OMLP also supports the protection of a pool of k resources shared by tasks across processor clusters. The pi-blocking experienced by any job in a cluster where resources protected by the CK-OMLP is O(m/k). The CK-OMLP is designed as follows.

Structure. The structure of the CK-OMLP is illustrated in Figure 2.11. The CK-OMLP uses a single FIFO queue, denoted FQ, that holds a maximum of m − k unsatisﬁed resource requests. The requests of the k resource holders are not kept in FQ. The CK-OMLP relies upon additional scheduler data structures that track the c-highest priority incomplete jobs within each processor cluster. Min-heaps are an efﬁcient data structure for such bookkeeping, so these scheduler data structures are depicted by triangles in Figure 2.11.

We denote the set of the c-highest priority incomplete jobs with the ath cluster by C a. A job may be in C a while also waiting for, or holding, a resource replica.

th Rules. The CK-OMLP operates under the following rules. Let Ji denote a job in the a cluster that issues a request Ri, j,k for resource ` j.

a a C1 Ji receives a donated priority from a donor job in C if Ji ∈/ C , pursuant to the description of priority

donation in Section 2.1.6.1, while Ri, j,k is incomplete.

C2 Ji acquires the replica x when Ji issues Ri, j,k if such a replica is available. Otherwise, Ri, j,k is enqueued

on FQ and Ji suspends.

C3 When Ji releases the replica x, the pending request at the head of FQ (if it exists) is dequeued, and the associated job acquires x.

Blocking Analysis. We provide a summary of the blocking analysis presented by Brandenburg and Anderson

(2013). We direct the reader to that paper for the rational behind the following claims. For simplicity of presentation, we assume that each job issues at most one request per resource.

44 c donors

m - k

m clusters k resources c FQ

Resource Holders

scheduler data structures

Figure 2.11: Queue structure of the CK-OMLP.

Under the CK-OMLP, a job may experience both direct and indirect pi-blocking. A job may experience direct pi-blocking while it is blocked for a shared resource. A job may experience indirect blocking while it acts as a priority donor. We consider bounds on direct and indirect blocking, in turn.

The priority donation rule, Rule C1, ensures that there are at most m incomplete requests. Since the k resource holders are not kept in FQ, FQ has a maximum length of m − k. We denote the bound on direct CK-OMLP-D pi-blocking with the term bi, j . By Rule C1, resource holders are always scheduled, so requests are max satisﬁed at a rate of at least k requests per L j -units-of-time. Thus, by this property and Rule C3, direct pi-blocking is bounded by

m − k lmm bCK-OMLP-D = · Lmax = − 1 · Lmax. (2.36) i, j k j k j

A job experiences indirect pi-blocking while it acts as a priority donor. We denote the bound on indirect

CK-OMLP-I pi-blocking due to resource ` j with the term bi, j . This duration is bounded by the maximum time that a request may be waiting in FQ, plus the critical section length of that request. Thus, indirect pi-blocking due

45 to resource ` j is bounded by

lmm bCK-OMLP-I = bCK-OMLP-D + 1 · Lmax = · Lmax. (2.37) i, j i, j j k j

There may exist multiple pools of resources that are each protected by a different instance of the CK-

CK-OMLP-I OMLP, so bi, j does not bound indirect pi-blocking due to all such resources. We denote the set of resources accessed by tasks within the ath cluster by l a. We denote the bound on total indirect pi-blocking by

CK-OMLP-I bi , which is n o bCK-OMLP-I = max bCK-OMLP-I . (2.38) i a i, j ` j∈l

We denote the bound on the total pi-blocking experienced by a job under the CK-OMLP with the term

CK-OMLP bi . For jobs that do not issue requests,

CK-OMLP CK-OMLP-I bi = bi . (2.39)

a th Let li denote the subset of resources accessed by job Ji that is scheduled within the a cluster. For jobs that do issue requests resource requests under the CK-OMLP,

CK-OMLP CK-OMLP-I CK-OMLP-D bi = bi + ∑ bi, j . (2.40) a ` j∈li

This concludes our review of real-time k-exclusion locking protocols.

2.1.8 Accounting for Overheads in Schedulability Tests

The schedulability tests we discussed in Section 2.1.5 assume that all scheduling decisions and actions are instantaneous. However, this is impossible to achieve in the real world. Simply, scheduling algorithms take time to execute. Moreover, scheduling decisions have side-effects. For instance, a job’s execution time regularly increases with every preemption due to the loss of cache afﬁnity. Basic OS functions, such as processor and device interrupt handling, introduce additional delays to real-time jobs. Contention for shared hardware, such as a system memory bus or caches, cause concurrently executing jobs to interfere with one another. Collectively, we refer to these costs, as well as others, as system overheads. Overhead-aware schedulability tests are those that incorporate system overheads into analysis.

46 ev ipi Off-Chip

Ji tck cid rel cid CPU0

Jj sch cxs cpdJk tck cid sch cxs Jl CPU1 time t +0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85

CPU Overhead Off-Chip Overhead

Figure 2.12: Schedule depicting system overheads.

We now describe the “preemption-centric” method of Brandenburg (2011b), which we use for overhead- aware schedulability analysis in this dissertation. We ﬁrst focus our attention on the case where tasks are independent, i.e., they share no resources besides processors. We discuss additional methods for account for overheads due to resource sharing, thereafter.

2.1.8.1 Preemption-Centric Accounting

Figure 2.12 depicts a schedule for four jobs on a system with two CPUs. In this illustrative example, we do not need to consider task parameters or the particular scheduling algorithm in use to study the relevant overheads. Prior to time t, assume the following: jobs Ji and Jj were scheduled on CPU0 and CPU1, respectively; job Jk had been scheduled, but was either preempted or suspended from execution; and a hardware timer has been set to time 55 to trigger the release of job Jl. We observe several different overheads in Figure 2.12:

• Scheduling (sch) overheads are incurred when a job completes (time 15) or when a newly arrived job

must preempt a scheduled job (time 68).

• A release (rel) overhead is incurred when a job becomes ready for execution (time 59). This may occur

when a new job is released, or when an incomplete job resumes after having voluntarily relinquished a

CPU.

• A scheduler tick (tck) overhead is incurred at regular intervals (time 40). In Figure 2.12, we assume

that the scheduler ticks of the CPUs are synchronized, but this is not strictly necessary. Within the tick

47 handler, the OS may reevaluate scheduler state or perform bookkeeping functions, such as tracking the

execution time of currently scheduled jobs.

• Context switch (cxs) overhead is incurred whenever the scheduler switches from one job to another

(times 19 and 72). This process takes time as the scheduler must save program state (e.g., CPU register

values) of the old job, restore the state of the new job.

• A cache preemption/migration delay (cpd or CPMD) overhead is incurred when a previously scheduled

job resumes from execution (time 23). This overhead reﬂects the fact that such a job may have to reload

data that had been previously cached. In practice, this overhead is incurred incrementally as a job

executes. However, without exact knowledge of the state of the resuming job or caches, we analytically

assume a single worst-case overhead.

• A cache interrupt delay (cid) overhead is incurred after every interrupt is handled by a CPU (times 44

and 63). The interrupt handler should be relatively light-weight in terms of execution cost, leaving the

cached state of the interrupted job relatively intact, so we account for cache interrupt delays separately

from cache preemption/migration delays.

• An event (ev) overhead accounts for the latency between the occurrence of an event, e.g., the ﬁring of a

hardware timer (time 55), and when that event is communicated to a CPU (time 59). Event overheads

do not consume CPU time, so Figure 2.12 depicts this overhead as occurring in parallel “off-chip.”

• Interprocessor interrupt (ipi) overheads are incurred in cooperative multiprocessor schedulers where a

CPU may make scheduling decisions for other CPUs. A CPU communicates such scheduling decisions

to the target CPU by sending an interprocessor interrupt (IPI) to a target CPU. The IPI overhead

captures the latency between when an IPI is sent (time 63) and when it is received (time 68). Like the

event overhead, this overhead does not consume CPU time, so Figure 2.12 depicts this overhead as

occurring in parallel “off-chip.”

The overhead durations reﬂected in Figure 2.12 are merely illustrative. The actual costs of each overhead vary and depend upon different factors. For instance, context switch overheads depend primarily upon the capabilities of the underlying processor and memory subsystem. Scheduling overheads, on the other hand, depend upon the runtime complexity of the scheduling algorithm and its implementation. Cache-related

48 Parameter Meaning ∆sch scheduler overhead ∆rel release overhead ∆tck scheduler tick overhead ∆cxs context switch overhead ∆cpd cache preemption/migration delay ∆cid cache interrupt delay ∆ev event latency ∆ipi interprocessor interrupt latency Q scheduler tick quantum tck T0 virtual task modeling the scheduler tick interrupt irq Ti virtual task modeling the release interrupt of task Ti cpre cost of one preemption due to periodic interrupts

Table 2.3: Summary of parameters used in preemption-centric overhead accounting. overheads depend upon the memory footprint and data access patterns of the jobs themselves. Each overhead may be quantiﬁed though empirical measurement. We explore this methodology further in Chapter 4.

We now present the formalisms developed by Brandenburg (2011b) to integrate the above overheads into schedulability analysis. The general strategy of preemption-centric overhead accounting is to inﬂate a task’s

WCET with a charge that accounts for the overheads that the task may experience in a worst-case scenario.

This provides a safe upper-bound on what a task may experience at runtime. We denote each overhead with the symbol ∆, with the type of overhead expressed by the label in superscript. Table 2.3 contains a summary of the parameters we use in overhead accounting.

Periodic interrupt sources are modeled as virtual tasks that always preempt non-virtual tasks.13 The

tck scheduler tick interrupt is modeled by the virtual task T0 . Release interrupts are modeled by a per-task irq virtual task Ti . These tasks take the following values:

tck tck cid tck tck ∆tck+∆cid e0 , ∆ + ∆ p0 , Q u0 , Q re cid eirq ∆rel + ∆cid pirq p uirq ∆ +∆ i , i , i i , pi

Here, Q denotes the period of the scheduler tick, or quantum. The value of Q is set by the OS. Also, observe irq that Ti is periodic, even if its associated task Ti is sporadic. This is done to safely bound utilization loss due to release interrupts.

13We emphasize that this virtual-task approach can only be applied to interrupts that may be modeled with a periodic arrival pattern. Interrupt overheads due to non-periodic interrupt sources, such as GPUs, must be accounted for in a different manner.

49 The total cost of one preemption due to these periodic interrupts is given by:

tck ev tck ev irq irq e0 + ∆ · u0 + ∑1≤i≤n ∆ · ui + ei cpre . (2.41) , tck irq 1 − u0 − ∑1≤i≤n ui

We transform the task set T into the task set T 0 by inﬂating task execution times and shrinking task periods, such that: sch cxs cpd ei + 2(∆ + ∆ ) + ∆ e0 ≥ + 2cpre + ∆ipi, (2.42) i tck irq 1 − u0 − ∑1≤i≤n ui 0 ev pi ≤ pi − ∆ , (2.43)

0 ev di ≤ di − ∆ . (2.44)

Once we have obtained T 0 using the above formulations by Brandenburg (2011b), we perform schedulability analysis upon the task set T 0, instead of T , to safely test for schedulability.

2.1.8.2 Locking Protocol Overheads

Runtime overheads due to locking protocols must also be accounted for in overhead-aware schedulability analysis. For instance, we must consider the time it may take to execute locking protocol logic to issue a resource request. We must also consider scheduling decisions that may result from changes in job priorities due to priority inheritance, boosting, or donation. We brieﬂy discuss Brandenburg’s preemption-centric overhead accounting methodology to account for such overheads. We direct the reader to Brandenburg

(2011b) for a full explanation of the following analysis.

Suspension-based locking protocols introduce four additional types of overheads:

• System call entry (sci) overhead is incurred when a job invokes the OS in order to issue a resource

request.

• System call return (sco) overhead is incurred when a resource request is satisﬁed, and the resource-

holding job resumes execution, switching control from the OS to the job.

• Lock (lk) overhead represents the execution cost of issuing a request under a given locking protocol.

• Unlock (ulk) overhead represents the execution cost of releasing a held resource.

50 Parameter Meaning ∆sci system call entry overhead ∆sco system call return overhead ∆lk lock overhead ∆ulk unlock overhead

Table 2.4: Summary of locking protocol overheads considered by preemption-centric accounting.

These overheads are summarized in Table 2.4. Overheads affect both the execution time of jobs and increase the lengths of critical sections.

Overheads and WCET. Brandenburg prescribes the following additional inﬂation to job WCET to account for resource requests overheads, assuming suspension-based locking protocols:

0 sci sco lk ulk sch cxs cpd ipi ei ≥ ei + ηi · 2∆ + 2∆ + ∆ + ∆ + 3∆ + 3∆ + 2∆ + ∆ (2.45)

We charge two instances of system call overheads (∆sci and ∆sco) to account for the calls made by a job to request and release a resource, respectively. Likewise, we charge lock (∆lk) and unlock (∆ulk) overheads to account for the locking protocol’s handling of these calls. We charge two sets of scheduling, context switch, and cache afﬁnity loss overheads (2 ∆sch + 2 ∆cxs + 2 ∆cpd) to cover the cost of self-suspension when

ipi job Ji is blocked. We incorporate the overhead of one IPI (∆ ) to account for the latency of waking up Ji after its request has been satisﬁed. Finally, we charge an additional set of scheduling and context switch overheads (∆sch + ∆cxs) to account for a potential change in priority of the resource-holding task triggered by the blocked request of Ji. (We do not make this charge for locking protocols where resource-holding jobs cannot be preempted, such as the CK-OMLP.) We inﬂate ei using Equation (2.45) prior to the application of

pre Equation (2.42). In the application of Equation (2.42), we must also include an additional charge of ηi · c if processors are not shielded from interrupts, since tick and release interrupts may further delay the resumption of job Ji after it has acquired its resource.

Overheads and Critical Section Lengths. Overheads may also be incurred during the execution of a critical section, effectively increasing the length of the critical section. This, in turn, affects blocking analysis. We inﬂate critical section lengths to account for these overheads prior to blocking analysis. Brandenburg inﬂates

51 each critical section length for suspension-based locking protocols, such that:

0 sch cxs sci sco ulk ipi Li, j,k ≥ Li, j,k + 2∆ + 2∆ + ∆ + ∆ + ∆ + ∆ . (2.46)

We charge an IPI overhead (∆ipi) to account for the delay in notifying a suspended job that it has obtained its needed resource and that it may execute (i.e., the delay in waking up the waiting job). One set of scheduling and context switch (∆sch + ∆cxs) overheads account for the time it takes a job to resume execution after suspension. The remaining scheduling and context switch overheads account for the situation where a resource-holding job may be preempted by another resource-holding job. (As in Equation (2.45), we do not make this charge for locking protocols where resource-holding jobs cannot be preempted.) Finally, the remaining overheads account for the time taken to free a resource (∆sci + ∆sco + ∆ulk).

2.1.9 Integration of PI-Blocking and Overhead Accounting

In this section, we tie together the pi-blocking and overhead accounting methods we discussed above into a six-step procedure for performing overhead-aware schedulability tests. Our procedure is tailored to s-oblivious analysis and the suspension-based locking protocols we use in this dissertation. We refer to values [ j] that are computed in each step with a superscript. For example, ei denotes the bound for job execution time th of task Ti computed in the j step. We restate several formulas in order to give a consolidated presentation of the procedure.

Step 1: Inﬂate Critical Sections

We begin our procedure by inﬂating the critical section lengths of each resource request:

[1] sch cxs sci sco ulk ipi Li, j,k , Li, j,k + 2∆ + 2∆ + ∆ + ∆ + ∆ + ∆ . (2.47)

This step corresponds to the application of Equation (2.46).

Step 2: Bound S-Oblivious PI-Blocking [1] Using the inﬂated critical section lengths computed in the prior step (i.e., Li, j,k), we compute bounds on s-oblivious pi-blocking in according to the blocking analysis techniques prescribed by the locking

protocols that we employ. For example, we may bound pi-blocking under the k-FMLP using Equa-

52 tion (2.30). For each job, we compute the total bound on pi-blocking each job may experience, denoted [2] by bi .

Step 3: Convert S-Oblivious PI-Blocking to Execution Time

We inﬂate job execution time to incorporate our bounds on pi-blocking:

[3] [2] ei , ei + bi (2.48)

Step 4: Account for Locking Protocol Overheads

We inﬂate job execution time to account for the execution of locking protocol algorithms and associated

scheduling costs:

[4] [3] sci sco lk ulk sch cxs cpd ipi ei , ei + ηi · 2∆ + 2∆ + ∆ + ∆ + 3∆ + 3∆ + 2∆ + ∆ (2.49)

This step corresponds to the application of Equation (2.45).

Step 5: Account for General Overheads

We inﬂate job execution time according to the preemption-centric method:

e[4] + 2(∆sch + ∆cxs) + ∆cpd e[5] i + (2 + η ) · cpre + ∆ipi, (2.50) i , tck irq i 1 − u0 − ∑1≤i≤n ui

[5] ev pi , pi − ∆ , (2.51)

[5] ev di , di − ∆ . (2.52)

This step corresponds to the application of Equations (2.42), (2.43), and (2.44). As we discussed in

pre the description of Equation (2.45), we incorporate an additional charge of ηi · c because we assume processors are not shielded from interrupts.

Step 6: Perform Schedulability Analysis

Steps 1 through 5 give rise to a transformed task set, T [5], that accounts for pi-blocking and overheads.

We analyze T [5] using “normal” (i.e., overhead- and locking-protocol-agnostic) schedulability analysis.

For example, we may test T [5] for bounded deadline tardiness under global fair-lateness (G-FL)

53 scheduling using Equation (2.23). If T [5] has bounded deadline tardiness, then T is also has bounded deadline tardiness.

This concludes our discussion of real-time task models, locking protocols, and overhead-aware real-time schedulability analysis. We now shift our attention towards real-time operating systems and the services that they provide to user-level applications.

2.2 Real-Time Operating Systems

In this section, we discuss the role of real-time operating systems (RTOSs) in the support of real-time applications. We begin with a brief discussion of basic RTOS requirements that must be met in order to support the realization of sound real-time systems. We then discuss LITMUSRT, an RTOS that we extend in order to meet the needs of our real-time GPGPU applications. Finally, we conclude with an in-depth discussion of interrupt handling techniques in general purpose and real-time OSs. This topic is relevant to us as GPUs use interrupts to signal the completion of operations.

2.2.1 Basic RTOS Requirements

An RTOS is the underlying software that manages hardware resources and coordinates the execution of user applications. It must meet the needs of the real-time applications that rely upon it. Of course, these needs are application-speciﬁc. Due to the wide variety of real-time applications, there are a wide variety of

RTOSs. RTOSs range from small microcontroller environments to fully-featured OSs that are as capable as, if not more capable than, general purpose OSs. Here, we only discuss basic RTOS requirements. We direct the interested reader to a taxonomy and thorough survey of modern RTOSs by Brandenburg (2011b) for a more in-depth discussion of RTOS capabilities.

Fundamentally, an RTOS is responsible for providing a deterministic and predictable environment for user applications, while fulﬁlling other operating system requirements (such as memory management and

ﬁle system support). System designers require a high degree of conﬁdence that the timing constraints of their real-time applications are met. This is only possible if the underlying RTOS ensures deterministic and predictable behavior, as far as it is able. These behaviors are predominantly realized by the RTOS scheduler, which allocates processor time to user applications and other system services. The scheduler may employ one or more real-time scheduling algorithms, such as those we discussed in Section 2.1.4.

54 System designers design and implement their application software in accordance to a real-time task model, such as the sporadic or rate-based task models, that is supported by the RTOS scheduler. The designer provisions tasks with resources (e.g., job execution time) and speciﬁes task execution rates (e.g., task periods).

With this information, the designer may use schedulability analysis to model the real-time behaviors of their system. Analysis provides feedback to the designer on whether their implemented system, as modeled, meets application timing requirements. We stress that schedulability analysis only approximates real system behavior and that nothing in analysis forces the implemented software to behave as modeled. Analytical results hold only if the following requirements are met:

1. The RTOS scheduler adheres to the real-time scheduling algorithm modeled by the analysis.

2. Application behaviors are predictable.

The ﬁrst requirement is achieved through the thoughtful design and implementation of the RTOS. The

RTOS must employ strategies that eliminate or minimize priority inversions (recall that, by deﬁnition, a priority inversion is a deviation from the real-time scheduling algorithm). This impacts the types of algorithms the RTOS may employ to schedule tasks and perform other system management operations. For example, algorithms that are non-starvation free, retry-based, or use unpredictable heuristics, may be acceptable in a general purpose OS, but are rarely so in an RTOS. Real-world constraints sometimes make priority inversions unavoidable. However, techniques can be used to mitigate their negative effects on timeliness. For example, in Section 2.2.3, we investigate strategies that minimize unavoidable priority inversions due to device interrupt handling.

The second requirement above may be harder to achieve by an RTOS, since application behavior partly depends upon the application itself. However, an RTOS may work in support of predictable behavior. For instance, an RTOS may provide predictable mechanisms for coordinated access to shared resources (i.e., locking protocols). The RTOS may monitor the resources consumed by a task at runtime (e.g., execution time) and use budget enforcement policies to prevent tasks from exceeding provisioned resources. If prevention is not possible, the RTOS may penalize the offending task by denying future resources, and/or use techniques that isolate the effects over-consumption may have on the rest of the system.

In this brief section, we have discussed RTOSs at only a high-level. We discuss our speciﬁc needs of an

RTOS in support of GPGPU real-time systems next.

55 2.2.2 LITMUSRT

A major contribution of this dissertation is the design and implementation of real-time systems that support the matrix of CPU/GPU organizational choices we discussed in Chapter 1 (see Figure 1.4). Unfor- tunately, as Brandenburg (2011b) reports in a comprehensive survey of modern RTOSs, most RTOSs are limited to ﬁxed-priority scheduling. Further, these RTOSs lack robust support for resource sharing through multiprocessor real-time locking protocols, such as those we discussed in Section 2.1.7. As we shall see in Chapter 3, we require a more feature-rich RTOS to fully explore the CPU/GPU conﬁguration options depicted in Figure 1.4.

The GPU scheduling framework we present in Chapter 3 requires tight integration with the RTOS scheduler. As a result, most of our framework is realized by extensions and modiﬁcations to an RTOS kernel—speciﬁcally, the LITMUSRT kernel. LITMUSRT (LInux Testbed for MUltiprocessor Scheduling in Real-Time systems), is an open-source real-time extension to Linux which has a long development history, beginning in 2006 (Calandrino et al., 2006), and remains under continual development (Brandenburg,

2011b, 2014b). LITMUSRT provides a plugin-based architecture that facilitates the implementation of, and experimentation with, multiprocessor real-time scheduling algorithms and locking protocols. It includes support for fixed-priority and deadline-based schedulers, among others. The flexibility of LITMUSRT enables us to explore a variety of CPU/GPU configuration.

The use of a Linux-based operating system such as LITMUSRT is critical to the research of this dissertation, as it allows us to leverage support of GPGPU technology (e.g., GPU device drivers and GPGPU runtime

(described in Section 2.4.1)) in LITMUSRT. Moreover, high-performance GPUs capable of supporting modern

GPGPU applications are primarily limited to Windows, Mac, and Linux-based platforms. The open-source nature of Linux grants us the ability to implement and effectively evaluate operating-system-level algorithms for GPU-enabled real-time systems.

LITMUSRT, being based on mainline Linux, cannot be used to host “true” (safety-critical) HRT workloads.

However, true HRT constraints are problematic in a GPU-enabled real-time system anyway due to hardware complexity, the closed-source nature of GPU hardware and software, and the lack of timing analysis tools for such platforms. Still, we do not believe this limitation totally precludes the use of GPUs in safety-related applications. In an automotive setting, for example, the reaction time of an alert driver is about 700ms (Green,

2000). A GPU-based automotive component may only have to react to events within such a relatively lax time

56 window in order to be a viably safe system. Thus, we justify research into the use of GPUs in safety-related applications in two ways. First, we expect that algorithms developed in LITMUSRT are transferable to

HRT operating systems in the future. Second, even if GPU technology cannot support HRT requirements for GPU-related computations, GPUs may still be utilized in real-time systems with mixed hard and soft requirements, also known as multi-criticality systems (Vestal, 2007), as long as SRT GPU-related operations do not interfere with HRT CPU computations.

2.2.3 Interrupt Handling

Interrupts are a hardware signaling mechanism that may be used by asynchronously operating computing elements (e.g., processors and peripheral devices) to communicate with one another. An interrupt communicates the occurrence of an event. For example, a network card may raise an interrupt to signal the arrival of a network packet—such an interrupt is received (or handled) by system CPUs. Interrupts may also be transmitted among CPUs (i.e., IPIs). IPIs can be used to coordinate scheduling in a multiprocessor system. Interrupts may also be used by high-resolution timing hardware to signal the expiry of a timer. These timers may be leveraged to realize accurately timed job releases and precise budget enforcement features in event-driven real-time schedulers.

In the case of device management, device drivers may bind, or register, an interrupt handler to a uniquely identiﬁed interrupt, or interrupt line. A device driver may multiplex multiple events on one interrupt line.

Moreover, interrupt lines may sometimes be shared by multiple devices. In such cases, multiple interrupt handlers may be executed in sequence upon receipt of an interrupt from a shared interrupt line.

Upon receipt of an interrupt, a CPU halts its currently executing task and invokes an interrupt handler, which is a segment of code responsible for taking the appropriate actions to process the interrupt. An interrupted task can only resume execution after the interrupt handler has completed. The CPU may be prevented from handling other interrupts during this time as well. Interrupt handlers must execute quickly so that the interrupted task can resume execution, and so the CPU can be responsive to other interrupts.

Interrupts require careful implementation and analysis in real-time systems. In uniprocessor and partitioned multiprocessor systems, an interrupt handler can be modeled as the highest-priority real-time task (Jeffay and Stone, 1993; Liu, 2000), though the unpredictable nature of interrupts in some applications may require conservative analysis. Such approaches can be extended to multiprocessor systems where tasks may migrate between CPUs (Brandenburg et al., 2010). However, in such systems, the subtle difference

57 between an interruption and preemption creates an additional concern: an interrupted task cannot migrate to another CPU. As a result, conservative analysis must also be used when accounting for interrupts in these systems as well. A real-time system, both in analysis and in practice, beneﬁts greatly by minimizing interruption durations. Split interrupt handling is a common way of achieving this, even in non-real-time systems.

Under split interrupt handling, an interrupt handler performs the minimum amount of processing necessary to ensure proper functioning of hardware. This may include an acknowledgement of receipt and any processing needed to identify the context of the interrupt (i.e., demultiplexing). We call this immediate processing the interrupt handler’s top-half. Additional work to be carried out in response to an interrupt is deferred.14 We call this processing the interrupt handler’s bottom-half. Bottom-half computations are executed at a more “opportune” time. However, each operating system may have a different notion of when that opportune moment may be. We now discuss the implications of different interpretations.

2.2.3.1 Linux

We now review how Linux performs split interrupt handling. Despite its general-purpose origins, variants of Linux are widely used in supporting real-time workloads.

During the initialization of the Linux kernel, device drivers register interrupt handlers with the kernel’s interrupt services layer, mapping interrupt lines to interrupt service routines (ISRs)—ISRs are equivalent to top-halves. By default, any CPU may receive a device interrupt, though CPUs may be later “shielded” individually from speciﬁed interrupts through interrupt masks.

Upon receipt of an interrupt on a CPU, Linux immediately invokes the registered ISR(s). The ISR(s) are executed within the “interrupt context,” meaning that the receipt of other interrupts is disabled. Deferred work is issued by the ISRs in the form of a softirq, or tasklet (the terms are commonly used interchangeably).

The softirq is equivalent to an interrupt handler’s bottom-half. A pending softirq is enqueued on one of several per-CPU FIFO queues, depending upon the source of the softirq. The number and designation of these queues may vary with each kernel version. However, long-established queues include (in order highest to lowest priority): HI_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ, and TASKLET_SOFTIRQ.

14Some interrupts can be, or may need to be, handled entirely within the top-half processing. This includes relatively lightweight handlers for IPIs.

58 The Linux kernel executes softirqs using a heuristic. Immediately after executing a top-half, but before exiting the interrupt context and resuming execution of the interrupted task, the kernel executes up to ten softirqs. These softirqs are taken from the softirq queues, in order of priority. For example, all network-related softirqs in the NET_TX_SOFTIRQ queue are processed before any in the TASKLET_SOFTIRQ queue. Any remaining softirqs are dispatched to one of several (per-CPU) kernel threads dedicated to softirq processing; these are the “ksoftirq” daemons. The ksoftirq daemons are scheduled with high priority, but are preemptible.

The Linux kernel is optimized for throughput. Excessive thread context-switch overheads are avoided by executing a batch of tasklets before returning from a top-half. The FIFO queue structures also bias the processing of some types of softirqs over others. However, this batch processing may introduce long interrupt latencies, leading one to wonder if this can even be considered a split interrupt system. The original motivation behind split interrupt handling is to minimize the duration of top-half execution, not extend this duration with additional work. Moreover, in all likelihood, in a system experiencing few interrupts

(though it may still be heavily utilized), for every top-half that yields a tasklet (bottom-half), that tasklet will subsequently be executed before the interrupted task is restored to the CPU. This essentially fuses the split top-half and bottom-half into one non-split interrupt handler.

How does Linux’s softirq processing affect real-time analysis? It is generally impossible to model

Linux’s interrupt processing mechanisms. Even with a model of interrupt arrival patterns, it is difﬁcult to predict the delay experienced by an interrupted task since we do not know which or how may softirqs may be processed before the interrupt handler returns. Moreover, if a bottom-half is deferred to a ksoftirq daemon, it is generally not possible to analytically bound the length of the deferral since these daemons are not scheduled with real-time priorities. Thus, we cannot predict how long a task may wait for a given bottom-half to be processed.

Schedulability analysis under Linux is further complicated by its software architecture. Because each softirq might execute within the interrupt context, softirq code may never suspend—it may never block on I/O or utilize suspension-based synchronization mechanisms. If such processing is necessary, then the bottom-half may defer additional work in yet another form. Speciﬁcally, a work item dispatched to one of

Linux’s per-CPU kworker daemons. The kworker daemons process deferred work, much like the ksoftirq daemons, but allow work items to suspend. Also like the ksoftirq daemon, kworker threads are not scheduled with real-time priorities, implying the same challenges to realizing real-time predictability.

59 T Top-Halves M

T H

TM TI

T M

ReleaseComplete Block Unblock

Priority Inversion

Figure 2.13: Fixed-priority assignment when an I/O device is used by a single thread.

2.2.3.2 PREEMPT_RT

The PREEMPT_RT patch to the Linux kernel alters the interrupt handling mechanisms of Linux to reduce interrupt latency and improve real-time performance. PREEMPT_RT executes all deferred softirqs in per-interrupt source (e.g., per-device) threads. Only performance-sensitive softirqs, such as high-resolution timers, may be appended to top-half execution, as this avoids thread context switch overheads. A system designer may assign an appropriate ﬁxed priority to each softirq-handling thread, according to their application needs. For example, disk I/O softirqs can be given a lower priority than softirqs from a GPU. Under vanilla

Linux, softirqs from both devices are processed with the same priority since these softirqs are put in the

TASKLET_SOFTIRQ FIFO queue.

There are scenarios where the priority assignment method of PREEMPT_RT for interrupt handling threads is sufﬁcient. Consider a uniprocessor real-time system with three independent threads, TH , TM, and TL. TH is assigned a high priority, TM is assigned a middle priority, and TL is assigned a low priority.

Suppose TM issues a command to an I/O device and suspends from execution until an interrupt from the device, indicating completion of the operation, has been processed. TM cannot resume execution until the bottom-half of the interrupt has completed. What priority should be assigned to the interrupt handling thread, denoted TI, that will do this work? In order to avoid interference with other threads, TI must have any priority greater than or equal to TM but less than the priority of TH . (To avoid ambiguity in scheduling, interrupt handling threads are commonly given a priority slightly greater than their dependent threads.) We refer to the priority of a task Ti with the function prio(Ti). As depicted in Figure 2.13, with the priority assignment

60 T Top-Halves H

T H

T T H I

ReleaseComplete Block Unblock

Priority Inversion

Figure 2.14: TH may suffer a long priority inversion dependent upon the execution time of TM if prio(TI) is too low.

prio(TH ) > prio(TI) > prio(TM) > prio(TL), the operations of TM have less impact on TH (priority inversions due to top-halves are unavoidable); TM and TI only receive processing time when TH is not ready to run.

Likewise, TL can in no way delay the execution of TI. Because of the lack of interference, this system is also easy to analyze. However, the situation changes when the I/O device is shared by different threads of differing priorities.

Let us reconsider the prior scenario with one change: suppose TH and TL share the I/O device simultaneously, and TM does not use the device at all. Does this change the priority that should be assigned to TI?

Indeed it does. If the priority of TI is less than TM, then TH can experience needlessly long priority inversions.

For example, this may occur when TH suspends from execution after issuing a command to the I/O device and suspends to wait for completion of the command. The interrupt indicating that the operation has completed may be received, top-half executed, and bottom-half deferred to TI, but if prio(TI) < prio(TM) and TM is scheduled, then TI cannot execute and unblock TH until TM gives up the processor. Thus, TH indirectly suffers a priority inversion with respect to TM. Such a scenario is illustrated in Figure 2.14. Observe that the duration of this inversion largely is not dependent upon the time it takes to execute the interrupt bottom-half, but rather upon the duration between when TI is ready to run and TM relinquishes the processor. This dependency can break analysis and real-time predictability may not be ensured.

The potential for such long priority inversions forces an alternative priority assignment where TI is assigned a priority great enough to ensure TH cannot suffer this particular priority inversion. In general, the priority of TI must be no less than the highest-priority thread that may depend upon TI. However, this

61 T Top-Halves L

T T L I

T M

ReleaseComplete Block Unblock

Priority Inversion

Figure 2.15: TH may suffer a priority inversion when TI processes a bottom-half for TL.

assignment introduces a different priority inversion scenario. What happens when TI processes a bottom-half for which TL blocks, as depicted in Figure 2.15? Since TI has the greatest priority, it is immediately scheduled whenever it is ready to run, so the bottom-half for TL is immediately processed, resulting in the preemption of any other threads, including TH . This is another priority inversion from which any threads with priorities less than TI, but greater than TL, may suffer. The primary advantage to using a greater priority for TI is that at least these inversions are short—one inversion only lasts as long as the execution time of one bottom-half.

However, the priority assignment that we have been forced to use is susceptible to pathological cases. Suppose that TH rarely uses the I/O device and TL uses it very frequently, generating many interrupts. Or, suppose there are many low-priority threads (e.g., TL1,··· ,TLn) that use the I/O device. In either case, TH and TM may experience many priority inversions, as illustrated in Figure 2.16.

Pathological cases are undesirable and become harder to avoid when many tasks of different priorities share devices. Further, determination of a safe priority assignment for each interrupt handling thread becomes increasingly difﬁcult with additional interrupt sources.

2.2.3.3 Towards Ideal Real-Time Bottom-Half Scheduling

In this section, we discuss other approaches to real-time scheduling of interrupt bottom-halves. Ideally, all bottom-halves should be explicitly scheduled in accordance with the analytical model used to determine schedulability. Under the sporadic task model, bottom-half processing should be accounted for through the addition of dedicated bottom-half processing sporadic tasks, or by somehow delegating the processing to tasks already within the task model.

62 T T T T Top-Halves L1 L2 Ln H

TL1 TL2 TLn TH TI

TM experiences many priority inversions TM TL1

TL2 …

TLn

ReleaseComplete Block Unblock

Priority Inversion

Figure 2.16: A pathological scenario for ﬁxed-priority interrupt handling.

Dedicated Bottom-Half Servers. Although PREEMPT_RT uses dedicated threads to process bottom-halves, they are not sporadic tasks. This leads to pessimistic analysis that must account for scenarios like that depicted in Figure 2.16. To force bottom-half processing to better conform to the sporadic task model, Lewandowski et al. (2007) and Manica et al. (2010) have employed dedicated sporadic tasks to handle bottom-half processing in ﬁxed-priority and deadline scheduled systems, respectively. These dedicated tasks take the form of servers that receive a ﬁxed amount of execution-time budget that is replenished periodically. This approach works well for handling bottom-halves spawned by interrupts raised in response to external events, such as the arrival of network packets—budgetary constraints ensure that the system is not overwhelmed by events that are outside of its control.

Individually Prioritized Bottom-Halves. The usefulness of the server-based approach is limited in situations where a real-time task may block while waiting for an internal event to occur, such as when a task waits for a GPU kernel to complete. This is because the time that the task is blocked depends upon the budget and replenishment rate of the bottom-half-processing server. An alternative approach is to individually prioritize and schedule each bottom-half with the priority of the task that is blocked waiting for the bottom-half to be processed. This approach is attractive because the bottom-half execution time can be analytically incorporated into the execution time of the waiting task itself. The operating system can determine a proper priority for a bottom-half upon its arrival, provided it tracks every task that currently waits for a bottom-half from a given device. Bottom-half-processing threads may dynamically inherit the priority of each bottom-half as it is

63 processed. This technique resolves the pathological case depicted in Figure 2.16, as TI would only preempt

TM when TI processes bottom-halves while TH is waiting. Individual prioritization of bottom-halves is employed by the commercial RTOS QNX Neutrino, which has a microkernel architecture. Device drivers are implemented as threaded “servers” (not to confused with the servers employed by Lewandowski et al. (2007) or Manica et al. (2010)). Servers receive and execute

I/O requests from clients and also perform bottom-half processing. Characteristic to microkernel designs, clients and servers communicate through message passing channels. Device drivers receive requests for I/O operations as messages. In-bound messages are queued, in priority order, in the event that they are sent faster than they can be serviced by the device driver.

In order to avoid priority inversions, device drivers inherit the in-bound message’s priority, which is attached by the sender, when it is sent. The device driver inherits the maximum priority among queued, and currently processing, messages. In addition to priority, device drivers also inherit the execution time budget of their clients (a mechanism commonly referred to as “bandwidth inheritance”). This allows for the throttling of I/O workloads on a per-client basis.

Bottom-halves are delivered to the device driver for handling as event messages, and processed at the priority inherited from I/O request messages. Academic microkernels Credo (Steinberg et al., 2005), an L4 extension, and NOVA (Steinberg et al., 2010), a microhypervisor, have employed similar techniques to QNX

Neutrino.

The beneﬁts of threaded interrupt handling comes at the cost of additional thread context-switch overheads.

(Recall that the primary reason behind Linux’s interrupt handling mechanisms to avoid these overheads.)

To address these concerns, Zhang and West (2006) developed a “process-aware interrupt” (PAI) method, which supports individually prioritized bottom-halves. Here, newly spawned bottom-halves are scheduled immediately (before the interrupt top-half returns control to the interrupted task) if the bottom-half has sufﬁcient priority. Otherwise, the bottom-half is deferred, but it is not processed by a dedicated thread.

Instead, the scheduling of bottom-half processing takes place within the context-switch code path of the operating system. Prior to a context switch, the priority of the highest-priority deferred bottom-half is compared against that of the next thread to be scheduled on the processor. The context switch is skipped if the bottom-half has greater priority, and the bottom-half is scheduled instead. The bottom-half temporarily uses the program stack of the task that was scheduled prior to the aborted context switch. The resumption of

64 T (bottom half) TL H TL CPU1

TH TH (top half) TH CPU0 time t1 t2

Release Complete Priority Inversion

Figure 2.17: Priority inversion due to the co-scheduling of a bottom-half. this task can be delayed since it may not be rescheduled until the bottom-half has completed, so the risk of priority inversions is not completely avoided.

Asynchronous I/O and Multiprocessors. The above techniques provide better real-time properties than

Linux or PREEMPT_RT, but they are not without their limitations. Thus far we have discussed device interrupts within the context of synchronous operations. Here, a device-using task blocks until an operation completes, as signaled by the completion of a bottom-half. However, interrupts are used in asynchronous operations as well, where a task may issue an operation to a device and continue execution, perhaps blocking for the operation to complete at a later time. Even with individual bottom-half prioritization, there is a risk of priority inversions under global and clustered multiprocessors schedulers.

Most real-time analysis techniques assume single threaded workload models. As such, a thread that has its priority inherited by another should never be scheduled simultaneously with that inheriting thread.

Otherwise, two threads may be scheduled at the same time under the same “identity” and the non-inheriting thread analytically becomes multi-threaded, breaking analytical assumptions. Thus, it may not be correct to schedule a bottom-half of an asynchronous operation with the priority of the dependent task. The danger here is illustrated in Figure 2.17, where TL suffers from a priority inversion within the time interval [t1,t2] on CPU1, when it is preempted by the bottom-half of TH . The inversion is due to the fact that TH is already scheduled on CPU0. We present a solution to this problem in Chapter 3, where the individual priority of a bottom-half is conditioned on the state of its dependent task.

Engineering Challenges. We conclude this section with the remark that dynamically tracking tasks that are blocked upon the completion of a bottom-half represents a software engineering challenge. QNX Neutrino,

Credo, and NOVA overcome this challenge in part by using a microkernel architecture, whose message- passing-based architecture eases priority tracking. The same cannot be said for monolithic kernels, such

65 as Linux, especially where closed-source device drivers are concerned. We show how this is overcome by

GPUSync in Chapter 3.

2.3 Review of Embedded GPUs and Accelerators

In this section, we examine the current state of GPU technology in embedded applications; this review includes a discussion of similar GPU-like accelerator technologies. This review demonstrates that there is a market for data-parallel processor architectures in embedded systems, a domain where real-time constraints are common. This conﬁrms the relevance of the research presented in this dissertation.

GPUs may be “discrete” or “integrated.” Discrete GPUs (dGPUs) are those that plug into a host system as a daughter card. Integrated GPUs (iGPUs) are found on the same physical chip as CPUs. iGPUs are common to system-on-chip (SoC) processors targeted to embedded applications, including smartphones and tablets. dGPUs are traditionally far more computationally capable than iGPUs. It is feasible to use a dGPU in an embedded system, as long as the host platform supports the necessary I/O peripheral interconnects (e.g.,

PCIe). Unfortunately, conventional dGPUs may not be well-suited to all embedded applications for several reasons. First, dGPUs may be physically too large, taking up too much space. Second, dGPUs commonly draw enough power (commonly between 150 watts to 250 watts) that they must rely upon active cooling, which requires a fan and an unobstructed airﬂow. Third, the physical port where the dGPU connects to the host system may be prone to physical vibration. However, these challenges are not insurmountable.

General Electric (General Electric, 2011) manufactures “ruggedized” dGPU platforms designed to deal with harsh embedded environments. These dGPUs may be secured to the host computing platform through reenforced I/O ports or soldered directly onto the motherboard of the computing system. Special heat-dissipating enclosures cool the GPU without the need for free-ﬂowing air. General Electric’s ruggedized systems have been used in radar for unmanned aerial vehicles (Pilaud, 2012), sonar for unmanned underwater vehicles (Keller, 2013), and situational awareness applications in manned armored vehicles (McMurray,

2011). However, these dGPUs may still not meet the needs of every embedded application due to several limitations: (i) the heat-dissipating enclosures are large and heavy; (ii) ruggedized dGPUs draw the same power as conventional counterparts; and (iii) they are expensive. General Electric’s platforms are clearly meant for defense applications. What is affordable for a multi-million dollar military vehicle may not be affordable for a mass-market automobile.

66 GPU Designer / GFLOPS GPGPU SoCs Name Maker (single-precision) Runtime (not exhaustive) GC2000 Vivante 32a OpenCL 1.2 (embedded) Freescale i.MX6 SGX544 MP3 PowerVR 51b OpenCL 1.1 MediaTek MT6589 GC4000 Vivante 64a OpenCL 1.2 (embedded) hiSilicon K3V2 Mali-628 MP6 ARM 109c OpenCL 1.1 Samsung Exynos 5422 G6400 PowerVR 256d OpenCL 1.2 Renesas R-Car H2 GC7000 Vivante 256a OpenCL 1.2 — Radeon HD 8210 AMD 256e OpenCL 1.2 AMD A4-1340 HD Graphics 4000 Intel 295f OpenCL 1.2 Intel BayTrail-T Mali-760 MP16 ARM 326h OpenCL 1.2 — GX6650 PowerVR 384e OpenCL 1.2 Apple A8 (iPhone 6) K1 NVIDIA 384e OpenCL 1.2, CUDA NVIDIA Tegra K1

a Vivante (2014) b Klug (2011) c Sandhu (2013) d Shimpi (2013b) e Smith (2014b) f Shimpi (2013a) h Athow (2013); Smith (2014b)

Table 2.5: Performance and GPGPU support of several embedded GPUs.

Although less capable, iGPUs may offer a viable alternative to dGPUs for some applications. iGPUs lack the physical limitations of dGPUs. The size of an iGPU is negligible as it resides on-chip with CPUs.

The interconnect between the host system and iGPU is also on-chip, so it is not prone to physical vibration. iGPUs require far less power; common SoCs with iGPUs commonly draw four watts of power, and rarely more than eight watts. As a consequence, iGPUs seldom require active cooling. In addition to these ideal physical characteristics, iGPUs are also more affordable, due to the economies of scale in the smartphone and tablet markets.

We now examine recent trends in iGPU performance and capabilities. Table 2.5 lists several recent iGPUs and their characteristics. We quantify computational capabilities in terms theoretical peak ﬂoating point performance, measured in GFLOPS. Unfortunately, GPU manufacturers do not always provide these numbers to the public. As a result, our data is gathered primarily from technology news websites. Each source is cited by footnote. We caution that this data may not be entirely precise. Nevertheless, we are conﬁdent that the GFLOPS reported in Table 2.5 are accurate enough to get a sense of performance.

We begin by observing trends in computational performance. The Freescale i.MX6, which includes the

GC2000 iGPU, was ﬁrst announced in early 2011. The NVIDIA Tegra K1, which includes the K1 GPU, was

ﬁrst made available to developers in mid-2014. We see that the K1 is twelve times faster than the GC2000 (32 versus 384 GFLOPS). The K1’s performance is not unusual. The Mali-760 MP16 and the GX6650 perform at a similar level. The K1, Mali-760 MP16, and GX6650 were released in 2014. Comparing the GFLOPS of these recent iGPUs to the trends in Figure 1.1(a), we see that the performance of an iGPU today is roughly

67 equivalent to a high-end dGPU in 2006. We note, however, that dGPUs of that era regularly required over

150 watts of power. Contrast this with the four to eight watts of an entire SoC today.

In Table 2.5, we also observe wide adoption of GPGPU technology. Table 2.5 lists six different

GPU designers that produce iGPUs with GPGPU support. These GPUs are licensed by even more SoC manufacturers. We see that OpenCL 1.2 is widely supported. Only the four least-performing GPUs are limited to OpenCL 1.1 or the embedded proﬁle of OpenCL 1.2.

iGPUs that support GPGPU also cross instruction-set boundaries. Although the SoCs in Table 2.5 predominantly use the ARM instruction set, we also see support for the x86 instruction set from the Intel

BayTrail-T and AMD A4-1340.

It is difﬁcult to judge which GPUs best support an embedded real-time system, as this is not strictly deﬁned by the GPU. Other SoC features are important to consider as well. Freescale and Renesas have an established presence in embedded markets. They have demonstrated an understanding and appreciation of real-time system constraints. In contrast, Samsung, Apple, and NVIDIA largely focus on consumer electronics like smartphones and tablets. Each tailors their SoC for their selected market. For example, the

Tegra K1 includes a modern cell phone radio for smartphones and tablets (NVIDIA, 2014f). However, it lacks integrated support for CAN, a data bus commonly used in automotive electronics. The converse is true of the Renesas R-Car H2—it supports CAN, but lacks a cellphone radio (Renesas, 2013).

There are also differences in software to consider. For instance, the CUDA programming language is more succinct than OpenCL. Less code is necessary to perform the same operations. Moreover, NVIDIA has developed a broad set of tuned CUDA libraries and development tools. As a result, development may proceed faster on a K1 than it might on any of the other OpenCL-only GPUs. The instruction set of the SoC may also affect development. For example, the Intel BayTrail-T and AMD A4-1340 support the x86 instruction set.

Development on these platforms beneﬁts from a wide set of tools and software libraries originally developed for desktops and servers. Also, prototypes developed on x86 workstations are easier to port to x86 SoCs, than

ARM SoCs.

Before concluding with this survey of iGPUs, we wish to discuss digital signal processors (DSPs) designed to support computer vision computations. These DSPs function much like an iGPU that executed

GPU kernels. As such, we can apply the same GPU scheduling techniques we present in Chapter 3 to these

DSPs.

68 We begin with the G2-APEX DSP developed by CogniVue (CogniVue, 2014), which CogniVue licenses to SoC manufacturers. CogniVue claims that the G2-APEX consumes only milliwatts of power, making it more power efﬁcient than iGPUs. The company provides development tools for implementing computer vision applications, including a custom version of the popular OpenCV computer vision library (OpenCV,

2014). Freescale has licensed CogniVue technology for their own SoCs (Freescale, 2014).

The other computer vision accelerator is the IMP-X4 computer vision DSP, which is incorporated into the Renesas R-Car H2 SoC (Renesas, 2013). Like CogniVue, Renesas also distributes a custom version of

OpenCV. What is unique to the R-Car H2 is that it also includes a G6400 iGPU from PowerVR. As we see in Table 2.5, the G6400 is among the more modest iGPUs. However, such deﬁciencies may be offset when paired with the IMP-X4. Unfortunately, we were unable to obtain benchmark information from Renesas or other sources to support this speculation.

It is clear from this survey of ruggedized dGPUs, iGPUs, and unique accelerator DSPs, that there is a market for data-parallel processor architectures in embedded systems. Solutions today range from expensive military-grade hardware, to specialized embedded DSPs, to common consumer-grade electronics.

These technologies will evolve with time. What direction will this evolution take? Industry has already signaled that we can expect CPUs and GPUs to become more tightly coupled. For instance, CUDA 6.0

(released in early 2014) introduced memory management features that automatically move data between host and GPU local memory. This eases programming because it frees the programmer from the burden of explicit memory management in their program code. A yet stronger signal for tightly coupled CPUs and

GPUs comes from the development of the “Heterogeneous System Architecture” (HSA), which is backed by several industry leaders. HSA is a processor architecture where CPU and GPU memory subsystems are tightly integrated with full cache coherency (HSA Foundation, 2014). A GPU is more of a peer to CPUs in this architecture, rather than an I/O device as GPUs are today. Many of the advanced features of OpenCL 2.0

(the latest revision of the OpenCL standard) require HSA-like functionality from hardware. We speculate that

CPUs and GPUs with HSA-like functionality will ﬁrst come from manufacturers that design both types of processors, as they are in the best position to tightly integrate them. This includes companies such as Intel,

AMD, NVIDIA, and ARM. It may take more time for makers of licensed GPU processors, such as PowerVR and Vivante. Also, although dGPUs typically lead iGPUs in functionality and performance, iGPUs are likely support HSA-like features before dGPUs.

69 Application GPGPU API GPGPU User Space Runtime

Kernel Space Device ioctl() Driver GPL Linux Layer

Figure 2.18: Layered GPGPU software architecture on Linux with closed-source drivers.

2.4 GPGPU Mechanics

In this section, we discuss: the software stack that manages GPUs; the inherently data-parallel hardware architecture of GPUs and other hardware components; and how GPUs are used in GPGPU applications.

We wish to deﬁne several terms before proceeding. The term “host” refers to CPUs and main system memory. “Device” refers to the GPU. Hence, “host memory” and “device memory” refer to main system memory and local GPU memory, respectively. The terms “process,” “application,” and “task,” all refer to a single thread of a program. “Process” carries connotations to services offered by kernel- or user-space daemons. “Application” refers to a general user-space program. “Task” refers to a program that performs a repetitive operation (real-time, or not).

2.4.1 Software Architecture

A complex software stack sits between a user’s GPGPU application and the GPU hardware. Figure 2.18 provides a high-level illustration of this stack in the Linux operating system. There are four primary layers: the application, the GPGPU runtime, the GPU device driver, and ﬁnally, the operating system. These layers are split across the “user space” and “kernel space” boundaries, providing the necessary memory protections between the application and operating system. We now discuss the role of each layer from top to bottom.

A GPGPU language is deﬁned by programming language features, such as extensions to the C language, and an application programming interface (API). At runtime, application-layer code interfaces, either directly or indirectly, with the GPGPU runtime through the provided API. Direct interaction occurs when application code calls API functions explicitly. For instance, the OpenCL API clEnqueueWriteBuffer() is used by the application to copy data to device memory. Indirect interaction occurs when elements of the GPGPU

70 programming language are converted into API calls at compile-time. For example, the “triple chevrons”

(i.e., <<<>>>) used to launch a GPU kernel in the CUDA language are transformed into calls to CUDA’s cuLaunchKernel().15

The GPGPU runtime manages the application’s session with the GPU. One role of the runtime is to implement GPGPU language features that do not require OS intervention. For example, both CUDA and

OpenCL allow an application to attach host-side callbacks to GPU operations. These callbacks are executed by threads internal to the GPGPU runtime as GPU operations complete. (We discuss the implications these threads have on a real-time system shortly.)

Another function of the GPGPU runtime is to translate API calls into commands given to the GPU device driver. On Linux, these commands are passed to the driver through the ioctl() system call. By design, this system call does not have a strictly deﬁned interface, as it is used by the callee to pass arbitrary data to a device driver. In this way, a device driver exposes its own API to the runtime via ioctl().

The device driver is responsible for managing the GPU hardware. It communicates directly with the

GPU to carry out operations requested by the GPGPU runtime. The driver may arbitrate GPU access when multiple applications wish to use the GPU at the same time. The driver is also responsible for providing device management services to the OS. These include device initialization and interrupt handling. The driver uses interfaces deﬁned by the operating system to provide these services. This is true for even closed-source device drivers. This is depicted in Figure 2.18, where a “GPL layer” mediates the interactions between the driver and Linux. We call this the GPL layer since it bridges the closed-source driver with the Linux kernel

APIs made available under the second version of the GNU General Public License (GPL) (Free Software

Foundation, Inc., 1991).

For efﬁciency, the GPGPU runtime may also communicate directly to GPU hardware through a memory- mapped interface. Here, the runtime is given direct access to a segment of GPU device memory and special registers by mapping these elements into the virtual address space of the application. The GPGPU runtime can submit some commands to the GPU through this interface, bypassing the GPU driver. Similarly, the runtime can monitor the completion of commands by polling memory-mapped registers that reside on the device, instead of waiting for an interrupt raised by the device. This has obvious implications on scheduling, since neither the operating system nor the GPU driver is involved in GPU resource scheduling decisions.

15See line 7 in Figure 2.21 for an example of CUDA’s triple chevrons.

71 Unfortunately, it is not always clear which GPU operations may be issued directly or which require support from the driver. However, we do have some insights to offer. GPGPU runtimes commonly allow a CPU thread to spin or suspend while waiting for a GPU operation to complete. Our experience suggests that operation completion is actively monitored through memory-mapped registers when the user program elects to spin. Operation completion is signaled by a device interrupt when the user program elects to suspend.

The GPGPU runtime and device driver are complex software packages that are usually developed and maintained by the GPU manufacturer. Unfortunately, to date, all manufacture-produced software has been for general purpose computing, not real-time systems. This raises several issues:

1. The GPGPU API provides no mechanisms to express real-time priority or time-related parameters

(such as deadlines).

2. The driver, or GPU hardware itself, may resolve contention for GPU resources using policies that are

not be amenable to real-time analysis.

3. The execution time of runtime and driver operations may vary widely, with extreme outliers in worst-

case behavior.

4. The software may employ synchronization techniques that break real-time scheduling.

We examine the ﬁrst three issues in depth in Chapters 3 and 4. However, we further explore the remaining issue regarding synchronization now, as this provides additional insight into how the GPGPU software stack operates.

As mentioned earlier, the CUDA runtime supports the attachment of callbacks to the completion of

GPU operations. Callbacks are executed by host-side threads created and managed by the CUDA runtime.

These callbacks are also responsible for signaling (waking) user threads that have suspended from execution to free up CPU resources while waiting for GPU operations to complete. That is, the callback threads are used to synchronize GPU operations and user threads. This can lead to the problematic scenario depicted by the schedule in Figure 2.19 for a uniprocessor system. At time t1, the high-priority task, TH , suspends while waiting for a GPU operation to complete. A low-priority task TL is scheduled at this time. The GPU

cb operation completes some time shortly before time t2. At time t2, the callback thread of TH , TH , is ready cb to run and wake up TH . However, TH was created by the real-time-oblivious CUDA runtime, so the thread cb lacks the priority to preempt TL. TH is not scheduled until time t3. It completes at time t4. As a result, TH

72 cb TH

TL time t1 t2 t3t4

ReleaseComplete Block Unblock

Priority Inversion

cb Figure 2.19: TH experiences a priority inversion if the callback thread, TH , is not scheduled with a proper real-time priority.

experiences an priority inversion during the time interval [t2,t3]. (The inversion ends at t3 instead of t4 since

cb cb TH performs work on behalf of TH .) The execution time of TH does not affect the duration of the priority inversion—it remains constant even if t4 −t3 ≈ 0. This scenario is not unlike the interrupt scheduling problem of Figure 2.14.

We cannot be too critical of the CUDA runtime. It was not designed with real-time scheduling in mind. The use of callback threads in dual roles of callback execution and signaling likely reduces software complexity. The approach performs well under a general purpose scheduler, since callback threads are unlikely to wait for a long time before being scheduled; such schedulers are generally responsive to threads that execute infrequently and for short durations. However, the runtime’s implementation hinders any naïve attempt at real-time scheduling.

What can we do to resolve the issues raised by the manufacturer-provided GPGPU software stack? Can it be altered? In many cases, the GPGPU software is distributed as closed-source, where direct alteration is not possible. Can it be replaced? Open-source alternatives do exist for some GPUs (e.g., Gdev (Kato et al., 2012) and Beignet (Segovia and Nanha, 2014)). We may alter the behavior of this software, but there are several practical issues to consider. First, the software may also be oblivious to the needs of real-time applications, despite being open-source. Second, the software may not support recent GPUs or all GPU features. Third, the software may be unable to utilize a GPU to its full potential since the software may be designed from knowledge gained through reverse engineering of GPU hardware. Finally, altered software

73 needs to be maintained. Replacement of manufacturer-provided GPGPU software is feasible, but it is a costly endeavor and may require the sacriﬁce of some functionality.

As we explore in Chapter 3, there is another approach that can be applied to both closed- and open- source software stacks: we can wrap the GPGPU runtime and device driver to force the software to respect real-time scheduling priorities and behave more predictably. If proven effective, such an approach should be preferred because it enables us to enjoy the beneﬁts of manufacturer-provided GPGPU software and avoid the signiﬁcant investment of alteration and maintenance of open-source alternatives.

2.4.2 Hardware Architecture

We describe the GPU hardware architecture in a top-down manner, beginning with how a GPU integrates into a host platform. We then discuss internal GPU components.

Recall from Section 2.3 that GPUs may be discrete (dGPUs) or integrated (iGPUs). For either type,

GPUs interface to the host system as I/O devices and are managed by device drivers. Discrete GPUs differ from integrated GPUs in three ways: (i) they are much more computationally capable; (ii) they have local high-speed memory (integrated GPUs use system memory); and (iii) they operate most efﬁciently upon local GPU memory, which requires copying data to and from system memory. For most of this dissertation, we focus our attention on dGPUs for their performance characteristics and interesting challenges posed by memory management. However, the techniques developed herein remain applicable to iGPUs, except that there is no need for GPU memory management.

Figure 2.20 depicts a high-level architecture of a multicore, multi-GPU system. The GPU is connected to the host system via a full-duplex PCIe bus. PCIe is a hierarchically organized packet-switched bus with an

I/O hub at its root (for this reason, the I/O hub is technically referred to as the “root complex”). Switches multiplex the bus to allow multiple devices to connect to the I/O hub. Unlike the older PCI bus, where only one device on a bus may transmit data at a time, PCIe devices can transmit data simultaneously. Trafﬁc is arbitrated at each switch using round-robin arbitration at the packet level in case of contention.16 The structure depicted in Figure 2.20 may be replicated in large-scale NUMA platforms, with CPUs and I/O hubs connected by high-speed interconnects. However, only devices that share an I/O hub may communicate directly with each other as peers.

16The PCIe speciﬁcation allows for other arbitration schemes, but these appear to be rarely implemented (PCI-SIG, 2010).

74 GPU Memory GPU Local Memory Bus CE 0 CE 1 EE

Switch

GPU0

GPU1 System Multicore I/O Hub Memory Processor GPU2

GPU3 Memory High-Speed PCIe Bus Interconnect

Figure 2.20: Example high-level architecture. The I/O hub may be integrated onto some multicore chips.

Within each GPU device are several specialized processors. The processors of greatest concern to this dissertation are labeled in Figure 2.20, using terminology partly deﬁned by NVIDIA. These are the execution engine (EE), which is used to perform computations, and the copy engines (CEs), which perform bulk memory operations to interface with the host system and other I/O devices (including other GPUs). The EE and CEs share a memory bus to local GPU memory. We take a moment to address GPGPU terminology before discussing these engines in greater detail.

For the sake of consistency, we use terminology deﬁned by NVIDIA throughout this dissertation. Our only deviation from this is the use of the term “execution engine.” NVIDIA documentation may refer to this component as the “compute engine.” We avoid this term to eschew an ambiguous abbreviation with the term

“copy engine.” Although GPU architectures from different manufacturers may differ greatly, there remain many high-level similarities. Table 2.6 provides common equivalent GPGPU terminology for many of the

NVIDIA-based terms we use herein. We note that because the approach we present in Chapter 3 operates by wrapping elements of the GPGPU runtime, we need not concern ourselves with lower-level differences between GPU architectures—the high-level limitations that we discuss are consistent across GPU platforms today.

75 Software CUDA (NVIDIA) OpenCL (Khronos) Thread Work-Item Warp Wavefront Block or Work-Group Cooperative Thread Array (CTA) Grid NDRange Hardware NVIDIA AMD CUDA Processor or Lane Processing Element Streaming Multiprocessor (SM) Compute Unit (CU) Copy Engine (CE) DMA Engine Compute Engine Compute Device

Table 2.6: NVIDIA software and hardware terminology (NVIDIA, 2014c) with equivalent OpenCL (Khronos Group, 2014a) and AMD (AMD, 2013) terminology.

2.4.2.1 Execution Engine

The execution engine consists of one or more parallel processors. GPU manufacturers scale the number of processors in the EE to realize GPUs with different computational capabilities. This allows the manufacturer to cover embedded, laptop, desktop, gaming enthusiast, and supercomputing markets with a common processor architecture. The number of parallel processors in a GPU also varies with each hardware architecture— especially among GPUs of different manufacturers. For example, recent NVIDIA GPUs typically have between one to sixteen processors, while AMD GPUs may have up to 44 (Smith, 2013).

For NVIDIA GPUs, each parallel processor is called a “streaming multiprocessor” (SM). Each SM is capable of executing a single instruction concurrently across several “lanes” of data operands. At any instant, a group of tightly coupled user-deﬁned threads, called “warps,” are bound to these lanes, one thread per lane.

Thus, the threads in a warp are executed in lock-step. If threads diverge on a conditional-code branch (e.g., an if/else-branch), then each branch is executed in turn, with the appropriate threads “masked out” within each branch to ensure each thread executes the correct branch.17

Although an SM can only execute one warp at a time, an SM may be oversubscribed with several warps at once. That is, several warps may be assigned to a single SM. This is done to facilitate the hiding of memory latencies. The SM will quickly context switch to another ready warp if the currently executing warp stalls on a memory access.

17It is for this reason that GPU performance on branchy code is poor.

76 1 // Operate on ‘input’ of size (x, y, z) in (4, 4, 4)-sized thread blocks 2 void kernel3d_host(int ***input, int x, int y, int z) 3 { 4 dim3 size3d(x, y, z); 5 dim3 blockSize(4, 4, 4); 6 dim3 gridSize(data.x / blockSize.x, data.y / blockSize.y, data.z / blockSize.z); 7 kernel3d_gpu<<>>(input, size3d); 8 } 9 10 // A 3D CUDA kernel 11 __global__ 12 void kernel3d_gpu(int ***input, dim3 size3d) 13 { 14 // Compute spatial location of thread within grid 15 int i = blockDim.x * blockIdx.x + threadIdx.x; 16 int j = blockDim.y * blockIdx.y + threadIdx.y; 17 int k = blockDim.z * blockIdx.z + threadIdx.z; 18 19 // Only operate on input if thread index is within boundaries 20 if((i < size3d.x) && (j < size3d.y) && (k < size3d.z)) 21 { 22 ... 23 } 24 }

Figure 2.21: Code fragments for a three-dimensional CUDA kernel.

To better understand how warps are assigned to the EE, we must ﬁrst brieﬂy discuss the general GPGPU programming model. In GPGPU programs, threads are organized in a hierarchical and spatial manner. Warps are groups of tightly-coupled user threads. Warps are grouped into one-, two-, or three-dimensional blocks.

Blocks are arranged into one-, two-, or three-dimensional grids. One grid represents all the threads used to execute a single GPU kernel, as discussed in Chapter 1. Conceptually, it may help to think of individual threads as mapped to a single inner-most iteration of a singly-, doubly-, or triply-nested loop. At execution time, lane-speciﬁc hardware registers within the SM inform the currently executing thread of its location within its grid. Using this information, user code can properly index input and output data structures.

A simple example is illustrated by the code fragments in Figure 2.21. In line 5, the host conﬁgures a

GPU kernel to process data in three-dimensional blocks of size 4x4x4 threads. The number of blocks in the grid is computed in line 6, assuming that the problem size divides evenly by four. This dimensional conﬁguration of the kernel is provided at kernel launch, in line 7. The code in the function kernel3d_gpu() is programmed from the perspective of a single thread—one thread among many within the grid. This thread determines its (i, j,k) coordinates within the grid on lines 15 through 17. If the coordinates are within the bounds of the problem (line 20), then the thread operates on the input data. If the input problem has the dimensions of 128x128x128, then the resulting grid has 32x32x32 blocks, each with 4x4x4 (or 64) threads. This breakdown of the grid into threads is illustrated in Figure 2.22. Each multi-dimensional block is

77 Grid Block Thread

32x32x32 4x4x4 1 Blocks Threads Thread

Figure 2.22: Grid of 32x32x32 blocks, with blocks of 4x4x4 threads. linearized and decomposed into warps. The threads of each warp are mapped to the hardware lanes. Although not speciﬁed by the language, all warp sizes to date on NVIDIA GPUs have been 32 lanes, so under this assumption, each block would be made up of two warps. This is illustrated in Figure 2.23.

The EEs of modern GPUs are capable of executing more than one kernel concurrently.18 This feature can be leveraged to increase EE utilization. Consider the following situation. Suppose we have several independent kernels to execute. We issue each kernel to the GPU, one at a time, waiting for each issued kernel to complete before issuing the next. Recall that grid blocks are distributed among the EE’s SMs. Towards the end of the execution of each kernel, SMs will begin to idle after completing their assigned blocks, while other

SMs continue executing their remaining work. At the instant before the last block completes, all but one SM will be idle. However, if we issue the independent kernels to the GPU in quick succession, not waiting for each issued kernel to complete before issuing the next, then SMs can be kept busy as they execute blocks of grids that have already been queued up for execution.

A GPGPU kernel is decomposed into a collection of multidimensional blocks, which are made up of threads that are grouped into warps. How is each SM assigned warps to execute? Although SMs execute the instructions of a single warp at a time, SMs are not assigned individual warps. Instead, SMs are assigned blocks, and the SMs independently schedule the warps within each block. How are blocks assigned to SMs?

In today’s technology, blocks are assigned to SMs by in-silicon hardware schedulers on the GPU (Bradley,

18There may be restrictions, however. For example, on NVIDIA GPUs, memory subsystems of the GPU require that concurrently running kernels share the same address space.

78 block of 64 threads warp of 32 threads warp of 32 threads

32 lanes

Figure 2.23: Linearized block of threads decomposed into two warps, which are multiplexed on hardware lanes.

2012). As software engineers, we have no direct control over how work is distributed among the SMs.19 We merely provide the GPU with a collection of blocks, which are then scheduled by the GPU itself. This has two important implications:

1. We cannot predictably schedule individual SMs.

2. We cannot predict how SMs may be shared among concurrent kernels.

It is for these reasons that we consider the SMs together as a single processor—the execution engine—rather than individual processors.20 We can predictably schedule the EE.

There is one more aspect of the EE that affects real-time predictability: the EE is non-preemptive. This is understandable, given the complexities of the GPU’s hardware scheduler. Non-preemption has a signiﬁcant impact on any real-time system in two ways. First, it becomes impossible to strictly enforce budgets on task execution time. At best, any real-time scheduling algorithm may only attempt to avoid and isolate the harmful effects violations of provisioned execution times may have. Second, priority inversions become inevitable under any work-conserving scheduler. It will always be the case that low-priority work may be scheduled on an idle EE at time t when higher-priority work for the EE arrives at time t + ε. The higher-priority

19This holds true for AMD GPUs as well (AMD, 2013). 20OpenCL 1.2 supports an optional feature called Device Fission. This feature allows the system designer to divide a single compute device into several logical compute devices. This may be exploited to reserve a segment of compute resources for high-priority work. However, support for this feature is currently limited to OpenCL runtimes that execute on CPUs (including Intel’s Xeon Phi) and tightly coupled heterogeneous processors, such as the Cell BE. We note that the problem of scheduling several logical devices is very similar to scheduling a multi-GPU system.

79 work cannot be scheduled until the low-priority work completes. We must develop algorithms that limit the duration of such inversions and account for them in analysis.

2.4.2.2 Copy Engines

The GPU copy engines transmit data between local GPU memory, remote I/O memory (e.g., other GPUs or I/O devices such as network cards), and system memory. The CEs operate through DMA operations. GPUs commonly have only one CE, and thus cannot send and receive data at the same time. However, high-end

GPUs (such as those used in high-performance computing applications) may have an additional CE, enabling simultaneous bi-directional transmissions.

Each CE may only be tasked with one DMA operation at a time. The command to perform a DMA operation must be issued by host-side code. That is, a GPU kernel cannot issue the DMA command itself.

This implies that any real-time DMA scheduling must be performed from the host.21 Like EEs, CEs are also non-preemptive, incurring the same issues regarding budget enforcement and priority inversions.

DMA operations may only read or write data that is ﬁxed, or “pinned,” to a physical memory address.

This is necessary in order to prevent data from being relocated by the OS while it is acted upon by a DMA processor. GPGPU runtimes provide APIs that allow the user to pin data. However, if the user requests a DMA operation that acts upon non-pinned data, then the GPGPU runtime must take extra measures to orchestrate the operation—the particular steps depend upon the runtime implementation. For example, the runtime may automatically copy data to a staging buffer that is pinned and perform the DMA operation on this buffer. The operation may be performed incrementally if the staging buffer is smaller than the data accessed.

This method is inefﬁcient, since it requires data to be temporarily copied to the staging buffer. Alternatively, the runtime may dynamically pin and unpin user memory as needed. This method incurs overheads, since pinning and unpinning requires support from the OS and device driver. In real-time programming, it is already common practice to pin all application data in order to avoid page faults, which are harmful to real-time predictability. We assume that this practice is extended to all GPU-related memory using provided GPGPU runtime APIs.

Some GPUs, such as those made by NVIDIA, support peer-to-peer (P2P) DMA, where data is transmitted directly between I/O devices (such as two GPUs). This is more efﬁcient than passing data between two

21Pellizzoni (2010) has explored scheduling DMA operations through the use of specialized interposition hardware that sits between the I/O device and the PCI bus. However, we consider this extreme method out of scope for this dissertation.

80 devices by way of a temporary buffer in main system memory. There are two restrictions to P2P DMA operations. First, the two peers must share the same I/O hub (root complex), as depicted in Figure 2.20.

Second, in the case of P2P operations between two GPUs, it is unclear whether it is the CE of the sender, receiver, or both that perform(s) the DMA operation. It is partly for this reason that we make the conservative assumption that both CEs are utilized under our management approach described in Chapter 3. However, this requires that we coordinate the schedules of the source and destination CEs.

The efﬁciency of a P2P DMA operation is partly dependent upon the distance between GPUs. Distance is the number of links to the nearest common switch or I/O hub of two GPUs. For example, in Figure 2.20, the distance between GPU0 and GPU2 is two (one link to a switch, a second link to a common I/O hub). P2P DMA operations are generally more efﬁcient over short distances, since there are fewer opportunities for bus contention.

2.4.3 Other Data Transfer Mechanisms

The use of CEs to transfers data to and from GPU memory is the most widely supported method of data transmission in GPGPU programming languages. However, GPUs may support other mechanisms. Fujii et al.

(2013) have explored two such mechanisms: “GPC” and “IORW.” We describe these in turn. We conclude with a remark on uniﬁed memory models offered by recent versions of OpenCL and CUDA.

On NVIDIA GPUs, there are lightweight microcontrollers shared by clusters of SMs (or “graphics processing clusters” (GPCs)) that are capable of performing DMA operations, much like the CEs. However, this functionality is not normally exposed to the programmer. Fujii et al. enable microcontroller-based

“GPC” DMA with custom ﬁrmware that is loaded by an open-source driver.22 They found that GPC-based memory transmissions can be as much as ten-times faster than CE-based memory transmissions for data chunks of 4KB or less, with equivalent performance at roughly 16KB. The microcontroller outperforms the

CE on small memory transmissions because there is less overhead in initiating the DMA operation. However, the GPC method has the significant drawback that it currently requires custom firmware, which must be a complete functional replacement for the vendor-provided firmware. The IORW approach may be a reasonable alternative if CEs cannot meet the memory transmission latency needs of an application.

22Our use of the term “GPC” generalizes three similar microcontroller-based methods explored by Fujii et al.

81 Under “memory-mapped I/O read and write” (IORW) data transmission, remote memory is mapped into the address space of the CPU or GPU. For instance, GPU device memory may be mapped directly into the address space of the CPUs. Host memory can also be mapped into the address space of the GPU processors.

Data is transmitted automatically as load and store instructions that operate upon memory-mapped addresses are executed by CPUs or SMs. Fujii et al. showed that IORW can perform very fast and low-latency memory transfers, especially for data transmissions from the host to GPU memory. IORW performs well in situations where it is difﬁcult to predict what data will be needed by a GPU kernel (e.g., graph traversal algorithms).

This is because it may be more efﬁcient for the GPU to access data as-needed directly from the host, rather than use a CE to transmit the entire problem dataset to the GPU prior to kernel launch. The IORW method transmits less data in such a case. However, there are several potential drawbacks to IORW. One such drawback is that IORW proves inefﬁcient in cases where a memory-mapped address is accessed multiple times by a remote processor. Each access incurs the relatively high penalty of transmitting data over the PCIe bus. In contrast, under the CE method, data is transmitted once over the PCIe bus and then accessed locally.

Another drawback is that host-to-device IORW data transmissions are not always a supported. Fujii et al. enabled this feature for NVIDIA GPUs through the use of an open-source GPU driver. We note, however, that industry is moving towards bi-directional IORW support. This is demonstrated by industry backing of

HSA-like architectures, which we discussed at the end of Section 2.3.

The above drawbacks to IORW do not necessarily preclude its use in a real-time system. However, the following may: we cannot directly schedule memory transfers under IORW. Instead, data is transmitted as instructions are executed by a processor. This makes predicting the worst-case execution time of computations more challenging. Any such predictions are highly likely to be exceedingly pessimistic, and thus result in poor schedulability. Instruction-level interleaving of computation and memory transfers has another drawback: we expend processor time to perform memory operations that could be ofﬂoaded to DMA processors. We can achieve greater levels of system utilization by separately scheduling DMA-based memory transfers and processor execution time.

Recently, OpenCL 2.0 and CUDA 6.0 have introduced memory models (“coarse-grained shared virtual address spaces” in OpenCL 2.0 and “uniﬁed memory” in CUDA 6.0) that unify the CPU and GPU address spaces. Under these models, the GPGPU runtime and device driver coordinate to automatically transmit data between host and device memory on a page-based (4KB) granularity. However, these models are merely

82 T : S 3CE 2EE 1CE 0CE i i Si Si Si Si

GPU

4 3 T : S S CE S CE 2EE 1EE 0CE j j j j Sj Sj Sj

Figure 2.24: Two streams made up of sequentially ordered GPU operations issued to a GPU. abstractions built on top of other DMA-based (e.g., CE-based) mechanisms. Their use in real-time systems may be ill-advised, since the abstractions prevent the use of a real-time scheduling policy.

2.4.4 Maintaining Engine Independence

The execution and copy engines of a GPU can operate independently, allowing us to schedule them separately and model them as separate processors in real-time analysis. However, limitations of in-silicon hardware schedulers that feed work to the EE and CEs can break our assumption of engine independence.

We must be aware of these limitations, so that we may design around them.

GPGPU applications issue sequentially ordered GPU operations in streams, as depicted in Figure 2.24.

Here, task Ti issues GPU operations on stream Si, and task Tj issues GPU operations on stream S j. We index and denote the engine used by each operation with a superscript. For example, the second GPU

1CE operation issued by Ti uses the CE and is denoted by Si . (For the sake of simplicity, we assume only a single CE in this example.) New operations can be issued before prior ones have completed (i.e., they may be batched). However, an operation may not begin execution until all prior operations issued to the same stream have completed. A single stream is somewhat analogous to a single CPU thread, as both are made up of sequentially ordered computational elements: CPU instructions for threads, GPU operations for streams.

All NVIDIA GPUs without the “Hyper-Q” stream scheduler suffer from a limitation that can break our assumption of engine independence.23 On a non-Hyper-Q NVIDIA GPU, pending GPU operations of every stream are combined into a single first-in-first-out (FIFO) queue. The engine scheduler dispatches enqueued operations to the appropriate engines. However, the engine scheduler stalls if the GPU operation at the head of the FIFO queue has an unsatisfied stream dependency on an unfinished operation. This

23This includes all Tesla, Fermi, and non-GK110 Kepler architecture GPUs.

83 Release Complete Stalled

0 0 S EE S EE EE i EE i

1 0 1 S EE S CE S0CE S EE CE i j CE j i

0 S EE 0EE i Si

1 S CE 1CE i Si

0CE 0CE Sj Sj time time 0 5 10 15 20 25 30 0 5 10 15 20 25

1CE 0EE 1CE 0EE (a) Scheduler stalls because Si depends upon Si . (b) No stall when Si is released after Si completes.

Figure 2.25: Schedules of streamed GPU operations when engines are dependent due to hardware scheduler limitations (a) and made independent through software (b). stall prevents operations of other streams from being dispatched to idle engines. Such a case is illustrated

0EE 1CE in Figure 2.25 (a). Here, task Ti issues GPU operations Si and Si on stream Si for the EE and CE,

0CE respectively, to a completely idle GPU. Concurrently, task Tj issues operation S j for a CE on stream S j.

0EE 1CE 0CE These operations may be enqueued in the hardware FIFO in the order: [Si ,Si ,S j ]. In this case, the

0EE 1CE 0EE engine scheduler dispatches Si to the EE immediately. Si depends upon the completion of Si , so the

0EE 1CE engine scheduler stalls until Si completes at time 15. Si is dispatched to the CE at this time, followed

0CE 0CE 0CE by S j at time 25. Observe that S j is not scheduled at time 0, even though S j is ready and the CE

0CE 1CE is idle—this is because S j has enqueued behind Si in the hardware FIFO queue. The engines are not independent in such scenarios. The particular order of stream operation interleaving in the FIFO queue occurs largely by chance. Observe that the illusion of engine independence would have been maintained if

0EE 0CE 1CE the concurrently dispatched operations had been ordered [Si ,S j ,Si ] in the hardware FIFO queue. There are two ways to resolve the above engine-dependency issue in a real-time system. First, we may model a GPU’s EE and CEs as a single notional processor. Unfortunately, such an approach results in utilization loss in schedulability analysis. For instance, suppose for a particular task set that a GPU’s EE and

CE each have a utilization of 51%, when the EE and CE are modeled as independent processors. Combining the EE and CE, we have a utilization of 102%. The notional processor is overutilized, so the task set is unschedulable under this analysis. An alternative approach is to design around the limitations of the hardware

84 scheduler: in software, each application issues GPU operations to its stream one at a time, waiting for each operation to complete before issuing the next. That is, applications “synchronize” with the GPU after issuing an operation.24 It is impossible for the hardware scheduler to stall because the FIFO queue is prevented from holding more than one operation per stream. This maintains engine independence, avoiding needless utilization loss. Figure 2.25 (b) depicts the schedule for our prior example when engine independence is enforced through software. Observe that the last operation completes at time 25 instead of time 30 because the CE is used more efﬁciently.

Enforcing engine independence through software comes at the cost of additional overheads due to synchronization. In the case where applications suspend while waiting for GPU operations to complete, overheads include costs due to the CPU scheduler, thread context switches, and interrupt processing. These overheads must be incorporated into schedulability analysis.

2.4.5 VectorAdd Revisited

In Chapter 1 (Section 1.4.1), we presented a schedule for a basic GPU kernel that adds two vectors

(Figure 1.3). As we have learned in this chapter, the actual schedule is far more complex. We revisit the schedule of the VectorAdd routine of Chapter 1, with additional details to tie together the various operations that occur in a simple GPGPU program. Figure 2.26 depicts the schedule for our VectorAdd program. The upper-half of the ﬁgure depicts when various processors are scheduled. The lower-half of the ﬁgure depicts the corresponding schedule for the threads and interrupts of the GPU-using task, Ti. For simplicity, the schedule assumes that Ti executes alone on the system. Also, we assume Ti suspends from CPU execution while waiting for a GPU operation to complete. We now walk through this schedule, step-by-step.

Task Ti is scheduled on the CPU and issues a command to copy data from the host to GPU memory at time t1. Ti suspends while waiting for this operation to complete. The copy engine CE0 is scheduled with the

DMA memory copy, which completes at time t2. The GPU issues an interrupt to the host to signal completion of the copy. Several operations occur in quick succession:

1. The interrupt handler is invoked on the CPU at time t2 and the interrupt top-half executes.

24Care must be taken in selecting a synchronization method. For instance, the CUDA API supports several synchronization methods: cudaDeviceSynchronize(), cudaStreamSynchronize(), and cudaEventSynchronize(). cudaEventSynchronize() cannot be used to maintain engine independence because its use actually results in the injection of a synchronization operation into the hardware FIFO queue, creating the very scenario we are trying to avoid. We recommend the use of cudaStreamSynchronize(), based upon our own experience.

85 Release Complete Block Unblock Interrupt Suspended

cudaMemcpy(gpu_a, a,…); gpu_vector_host(…)

vector_host(…) cudaMemcpy(gpu_b, b,…); cudaMemcpy(c, gpu_c,…);

CPU

CE 0

CE 1

Top-Halves

bh Ti cb Ti

Ti time t1 t2 t4 t6 t7 t8 t9 t3 t5

Figure 2.26: A detailed schedule depicting the interactions between the host system, GPU, and various schedulable threads, for the simple vector_add() kernel of Figure 1.2.

bh 2. The top-half spawns a bottom-half, which is processed by the thread Ti , starting at time t3.

cb 3. The bottom-half signals the callback thread of the GPGPU runtime, Ti , which is awoken and scheduled

at time t4.

cb 4. Ti wakes the task Ti at time t5.

This sequence of operations repeat for the remaining memory copy operations and the GPU kernel beginning at times t6, t8, and t9.

At time t7, Ti launches the GPU kernel. This kernel is scheduled on the EE and executes as many parallel threads across the EE’s SMs. The kernel completes at time t8. At time t9, Ti issues the command to copy the

86 results of the GPU kernel into host memory. This operation is carried out by copy engine CE1. The copy completes and Ti ﬁnishes its execution. The schedule depicted in Figure 2.26 should give us an appreciation for the complexity of the problems we face in developing a real-time system with GPUs. The VectorAdd program is simple compared to real-world GPGPU programs that may invoke many more memory copies and GPU kernels within a single

bh cb real-time job. Despite this complexity, our CPU scheduler must schedule threads Ti and Ti with priorities no less than Ti in order to avoid priority inversions, all while respecting the priorities of other real-time tasks in the system. Access to EE, CE0, and CE1 must be arbitrated among competing tasks with different real-time priorities. Overheads related to scheduling, context switches, and interrupt handling, to name a few, must be integrated into real-time schedulability analysis.

Figure 2.26 does not capture the full complexity of the system we propose to build. In Chapter 3, we develop multi-GPU scheduling algorithms that allow tasks to migrate among GPUs. The multi-GPU schedulers can be paired with a variety of multiprocessor CPU schedulers. Our proposed GPU scheduling framework also supports budget enforcement mechanisms that isolate the effects of occasionally poorly behaved tasks. Finally, we achieve this by wrapping the GPGPU runtime and device driver, rather than implementing our own real-time GPU software stack from the ground up.

2.5 Prior Work on Accelerator Scheduling

In this section, we cover prior work related to the scheduling of compute accelerators, such as GPUs.

We begin with a discussion of the various directions real-time GPU scheduling has taken. We then examine prior work on scheduling computational accelerators, such as digital signal processors (DSPs) and ﬁeld- programmable gate arrays (FPGAs). We conclude with brief survey of GPGPU scheduling techniques in non-real-time systems.

2.5.1 Real-Time GPU Scheduling

Current real-time GPU research falls within three general categories: (i) techniques for persistent low- latency GPU kernels, (ii) WCET analysis of GPU kernel code, or (iii) GPU resource scheduling. We review these categories in turn.

87 2.5.1.1 Persistent Kernels

In the first category, a persistent GPU kernel executes on a dedicated GPU. These kernels never terminate; they continually poll for work in a producer/consumer-styled software architecture. Research in this area focuses on efficient data movement between a single GPU and the rest of the system. There is no need for scheduling data-movement or GPU computations since there is only a single dedicated GPU and a single persistent kernel. Instead, low-latency memory operations are of primary importance. Aumiller et al. (2012) explored the tradeoffs between DMA memory transfers and various IORW-based data transfer mechanisms, and Fujii et al. (2013) explore these matters in greater depth. This work was done in support of that by Rath et al. (2012), wherein a GPU is used to actively adjust the magnetic field that contains plasma in a tokamak fusion reactor. Using IORW-based data transfers, Rath et al. (2012) report that their GPU-based solution can react to changes in input sensor data within 10 µs. These latencies could not be achieved using DMA-based data transfers due to the overhead of setting up the CEs.

2.5.1.2 GPU Kernel WCET Estimation and Control

The second category of real-time GPU research has focused on bounding the execution time of GPU program code, with no attention paid to scheduling or data-movement costs—it is assumed all data already resides on the GPU. This work is useful within the context of real-time analysis. Berezovskyi has been the primary investigator in developing methods for estimating the WCET of GPU kernels. In his ﬁrst effort,

Berezovskyi et al. (2012) developed a model of GPU kernel execution in terms of the total number of lanes in an SM, the number of threads within a GPU kernel, and the number of program instructions of the GPU kernel. Under several simplifying assumptions, they developed a formulation of an integer linear program

(ILP) that computes the maximal execution time of the GPU kernel across the SMs. This work matured with the development of heuristics to estimate WCET (Berezovskyi et al., 2013)—this heuristic method produces a WCET estimate in several hours, instead of several days. Unfortunately, these approaches can only provide a WCET for a GPU kernel that executes on a single SM. This limits the scope of this work to single-SM

GPUs. Thankfully, these are prevalent in iGPUs, such as those we discussed in Section 2.3.

Berezovskyi recently shifted focus towards WCET estimation through empirical measurements and statistical analysis (Berezovskyi et al., 2014). This work is more practical than the prior approaches because it accounts for memory subsystem overheads (such as cache misses), as well as being applicable to multi-SM

88 GPUs. Further, an empirical approach towards WCET estimation is generally sound for two reasons. First, data parallel algorithms that execute efﬁciently on GPUs exhibit very regular memory and code path execution patterns. Second, the GPU EE largely executes in isolation—they are not burdened with an operating system, memory paging, or even multi-tasking (assuming GPU kernels do not execute concurrently on the EE).25 For these reasons, deviations in observed GPU kernel execution time are relatively small in most applications, so violations of a statistically derived WCET should be rare.

Berezovskyi et al. are not the only ones to have researched GPU kernel WCET estimation. Betts and

Donaldson (2013) have developed two WCET-estimation methods that both utilize low-level GPU kernel execution traces and the control flow graph of GPU kernel code. The first method proceeds by estimating the start-time of the last warp executed on each SM (this is derived from empirical measurements), and then estimating the execution time of the last warp. These two estimates combine to derive a final kernel WCET estimate. Under the second method, the authors model the hardware scheduler that distributes blocks among

SMs. Using trace data, they estimate the execution-time costs due to inference among warps. These costs accumulate through to the ﬁnal warp. Betts and Donaldson tested their methods using a cycle-accurate GPU simulator, and determined that the ﬁrst method provides the most accurate results, and that the later is overly pessimistic.

WCET-estimation can be useful in hardware provisioning and used in real-time schedulability analysis.

However, a WCET-estimate merely models system behavior—the estimate in and of itself does not guarantee that a GPU kernel will actually complete within that time. Actual enforcement of GPU kernel execution time is challenging since EEs are non-preemptive. It is possible to signal to a kernel that it should terminate once a provisioned WCET has been exceeded. This is accomplished through the setting of an application-defined variable (or flag) by the host, using IORW, that is checked by the GPU kernel at runtime. A kernel can voluntarily self-terminate once it detects that its “should-terminate” flag has been raised. Unfortunately, such an approach is not always practical as it can leave data in an unknown (or difficult to resume from) state. However, Mangharam and Saba (2011) explain that just such an approach is practical for “anytime” algorithms. Anytime algorithms are those that iteratively improve results as they execute; Mangharam and

Saba use the parallel version of the A* path-ﬁnding algorithm, called PAP*, as a motivating example. A

GPU kernel that implements an anytime algorithm may check its “should-terminate” ﬂag at the top of each

25Contention on the GPU’s local memory bus can be a (minor) source of interference, as we show later in Chapter 4.

89 iteration and terminate when requested, thereby adhering to the provisioned WCET (with some measurable delay).

2.5.1.3 GPU Resource Scheduling

The ﬁnal category of real-time GPU research is on the scheduling of GPU resources. That is, the problem of scheduling both data movement and GPU computations on GPU(s) shared by competing jobs of different priorities. Work in this area seeks to develop real-time GPU scheduling algorithms, as well as analytical models to support schedulability analysis. This dissertation falls within this category.

TimeGraph is an early approach to the real-time scheduling of modern GPUs (Kato et al., 2011b).

TimeGraph plugs into an open-source GPU driver, where it intercepts the GPU commands issued by GPU- using applications. TimeGraph schedules a GPU as a single processor, scheduling intercepted commands according to a conﬁgurable scheduling policy. TimeGraph supports two scheduling policies: the “high- throughput” (HT) policy, and the “predictable-response-time” (PRT) policy. The HT policy allows commands from a task to be scheduled immediately, provided that the GPU is idle, or if commands from that task are currently scheduled on the GPU and no other commands from higher-priority tasks are waiting to be scheduled. This policy promotes throughput at the risk of introducing priority inversions—the scheduling of new commands may extend the delay experienced by higher-priority commands issued soon after. The PRT policy decreases the risk of lengthy priority inversions, as new GPU commands of a task are not scheduled until all of its prior commands have completed. TimeGraph monitors the completion status of commands by plugging into the interrupt handler of the open-source device driver.

GPU-using tasks under TimeGraph may be assigned a ﬁxed-priority and GPU utilization budget.26

A task’s GPU budget is drained as it executes commands on a GPU. TimeGraph supports two budget enforcement mechanisms: “posterior enforcement” (PE) and “a priori enforcement” (AE). Under PE, the budgetary deﬁcits incurred by a task’s budget-overrun is recouped by delaying further scheduling of the offending task until its budget has been replenished. The PE strategy is to recover from budget overruns.

Under AE, TimeGraph attempts to anticipate budget exhaustion. This is done by matching the sequence of a task’s requested GPU commands against a historical record of prior-issued GPU command sequences.

The historical record contains an average execution time, which is taken as a predicted execution time of

26TimeGraph also supports online priority assignment for graphics (i.e., non-compute) applications, where the foreground application is allowed to consume additional GPU resources.

90 the requested commands. A task’s requested GPU commands are not eligible for scheduling until the task’s budget is sufﬁcient to cover the predicted execution time. Thus, the AE strategy is to avoid budget overruns.

TimeGraph is somewhat limited in the real-time GPGPU domain. As presented by Kato et al. (2011b),

TimeGraph is targeted to graphics applications, such as video games and movie players, rather than GPGPU applications. It focuses on providing a configurable quality-of-service for applications, while loosely adhering to a fixed-priority real-time task model. TimeGraph also unifies the GPU EE and CE processors into one—we discussed the negative effects on schedulability of such an approach in Section 2.4.4. TimeGraph is limited in this way because it schedules GPU commands without inspecting them to determine their function. Separate

EE and CE scheduling is impossible without this inspection.

The developers of TimeGraph later developed RGEM, a real-time GPGPU scheduler (Kato et al., 2011a).

RGEM is similar to TimeGraph in that it also supports ﬁxed-priority scheduling. However, RGEM operates entirely within the user-space through a user-level API. The RGEM API provides functions for issuing DMA operations and launching GPU kernels. These APIs invoke GPU scheduler routines. Because RGEM is implemented in user-space, GPU scheduler state is maintained in shared memory accessed by each GPGPU task. Tasks that are unable to be scheduled immediately on the GPU are suspended from the CPU, awaiting for a message to proceed (delivered through a POSIX message queue). Perhaps the most notable of RGEM’s contributions is how it addresses schedulability problems caused by long non-preemptive DMA operations.

Here, RGEM breaks large DMA operations into smaller chunks, reducing the duration of priority inversions and thus improving schedulability.

RGEM has several advantages over TimeGraph for GPGPU applications. Unlike TimeGraph, RGEM utilizes techniques that make it amenable to schedulability analysis under rate-monotonic scheduling. Also,

RGEM separately schedules a GPU’s EE and CEs. However, as presented in Kato et al. (2011a), RGEM provides no budget enforcement mechanisms.

The notion of breaking large non-preemptive GPU operations into smaller ones has also been explored by

Basaran and Kang (2012). In addition to chunked DMA, Basaran and Kang also developed a mechanism for breaking large GPU kernels into smaller ones. Here, the kernel’s grid of thread blocks is programmatically split into smaller sub-grids that are launched as separate kernels. Unfortunately, this kernel-splitting requires developers to modify GPGPU kernel code. As we discussed in Section 2.4.2.1, threads must compute their spatial location (index) within a grid. Kernel-splitting requires the kernel code to include additional spatial offsets in the index computation. Zhong and He (2014) recently developed a method to make these offset

91 calculations transparent to the programmer in a framework called Kernelet. Kernelet programmatically analyzes kernel code and patches indexing calculations at runtime. However, no one has yet attempted to apply this technique in a real-time setting—Kernelet’s just-in-time patching of GPU kernel code may present a challenge to real-time analysis.

Kato et al. and Basaran and Kang examined GPU scheduling strictly in terms of the sporadic task model.

A different approach has been taken by Verner et al. (2012), where GPU operations of various jobs of sporadic tasks are combined into a batch at runtime and scheduled jointly. Here, batches of GPU work execute in a four-stage pipeline: data aggregation, DMA data transfer from host to device memory, kernel execution,

1 and DMA data transfer of results from device to host memory. GPU work is batched at a rate of 4 dmin, or one quarter of the shortest relative deadline in the task set. Consecutive batches may execute concurrently, each in a different stage of the pipeline.27 Verner et al. has continued research on batched scheduling for multi-GPU real-time systems in Verner et al. (2014a,b). Although their work is targeted to hard real-time systems, Verner et al. only consider schedulability in terms of the GPUs only. The real-time scheduling of the CPU-side GPGPU (i.e., triggering DMA and launching kernels) work remains unaddressed.

Thus far, we have discussed research in GPU resource scheduling largely in terms of systems development, i.e., the design and implementation of real-time GPU scheduling algorithms. Research on developing new analytical models has also been pursued. Kim et al. (2013) point out that conventional rate-monotonic schedulability analysis cannot be applied to task sets where tasks may self-suspend, unless suspensions are modeled as CPU execution time (i.e., suspension-oblivious analysis, as discussed in Section 2.1.6.3).

In order to reclaim CPU utilization that would otherwise be lost in suspension-oblivious analysis, Kim et al. devised a task model whereby jobs are broken into sub-jobs, along the phases of CPU and GPU execution. The challenge then becomes assigning a unique ﬁxed priority to each sub-job. Kim et al. showed that the determination of an optimal priority assignment is NP-hard. They presented and evaluated several priority-assignment heuristics.

2.5.2 Real-Time DSPs and FPGA Scheduling

Digital signal processors are highly specialized processors optimized to provide a limited set of computational facilities. The generality of DSPs varies from processor to processor. At one end of the spectrum,

27For GPUs with one CE, the host-to-device DMA stage of batch N is combined with the device-to-host DMA stage of batch N + 2.

92 a DSP may be an application-speciﬁc integrated circuit (ASIC), capable of performing only one type of computation, and thus lack generality. A coprocessor for performing fast Fourier transforms, a common SoC

DSP component, is an example of an ASIC DSP. At the other end of the spectrum, a DSP may be entirely programmable, such as the Texas Instruments C66x processor, which can be programmed with standard C code (Texas Instruments, 2013). Between these extremes are DSPs that are partially programmable, such as the computer vision-focused G2-APEX discussed earlier in Section 2.3.

DSPs sacriﬁce some degree of generality in order to achieve greater energy efﬁciency and speed. Common

DSPs found on SoCs provide services for audio/video encoding and decoding, image sensor processing, and cryptography. Although these features can be realized by software that executes on CPUs (or even a GPU), a software-based approach may be slower and less energy efﬁcient.

In most cases, DSPs execute non-preemptively. In this way, a DSP computation is not unlike a GPU kernel scheduled on the EE. Similarly, the logic fabric of FPGAs may be conﬁgured to implement DSP-like functionality. Researchers have examined the problem of scheduling non-preemptive DSPs, and similarly conﬁgured FPGAs, in real-time systems.

Gai et al. (2002) examined the problem of ﬁxed-priority scheduling on a hardware platform with one

CPU and one non-preemptive DSP. In their approach, ready jobs are classiﬁed as normal or DSP-using.

Normal jobs only execute on the CPU, while DSP-using jobs execute on both the CPU and the DSP. Ready normal jobs enqueue on one ready queue, while ready DSP-using jobs enqueue on another. When the DSP is idle, the scheduler selects the highest-priority job among both queues to scheduler. However, when the DSP is in use, jobs in the DSP queue are ignored. Thus, the scheduler itself resolves contention for the DSP, since no

DSP-using job can preempt a scheduled DSP-using job. Gai et al. found that their approach was preferable in terms of schedulability in comparison to a method where the DSP was treated as a shared resource, protected by the DPCP locking protocol.

Non-preemptive DSP scheduling has also been researched by Pellizzoni and Lipari (2007) for deadline- based scheduling. Pellizzoni and Lipari break DSP-using tasks into subtasks. Each subtask is designated as normal or DSP-using, just as Gai et al. Pellizzoni and Lipari provided a heuristic based upon simulated annealing to derive from task set parameters a relative deadline for each subtask. DSP access is arbitrated by the SRP locking protocol. Pellizzoni and Lipari developed a schedulability test to account for the inter-task dependencies among sub-tasks.

93 The approaches taken by Gai et al. (2002) and Pellizzoni and Lipari (2007) may be applied to GPU kernel scheduling on iGPUs. However, these methods are not general enough to be extended to support dGPUs, since DMA operations must also be scheduled. Also, these solutions do not address issues raised by large complex drivers or interrupt handling. DSPs are generally less complex than GPUs, so the associated drivers are similarly less complex.

2.5.3 Non-Real-Time GPU Scheduling

Before concluding this chapter, we examine prior work on GPU scheduling in the non-real-time domain.

This is a valuable exercise as it gives us greater insight into how GPGPU runtimes and GPU drivers can be manipulated (or even replaced) to enact a scheduling policy, be it real-time or not. Non-real-time GPU scheduling is a broad area of research, so we limit our attention to work where support for GPGPU applications is explicit. That is, we ignore work that only examines scheduling graphics applications.

In general, non-real-time GPU scheduling research may be categorized into one of three categories:

(i) GPU virtualization; (ii) GPU resource maximization; and (iii) fair GPU resource sharing. Although the research goals in each area may differ, we see that researchers often apply similar techniques.

2.5.3.1 GPU Virtualization

We now discuss the topic of GPU virtualization for GPGPU applications.

Services for cloud computing are a growing market in today’s computing industry. In order to avoid the cost of purchasing and maintaining expensive data centers, companies may rent computing services from cloud providers. Through virtualization technology, customers create their own virtual machines. From the customer’s perspective, a virtual machine offers the same functionality as a physical computer. Cloud providers multiplex several virtual machines onto a single physical computer; the virtual machines run concurrently on the shared hardware.

There is interest in offering GPGPU support in cloud services. For example, Amazon offers “GPU

Instance” virtual machines that support CUDA and OpenCL (Amazon, 2014). However, the tight coupling between the GPGPU runtime, GPU driver, and GPU hardware makes the multiplexing of multiple virtual machines on a GPU non-trivial. Indeed, Amazon side-steps this issue entirely—each GPU Instance virtual machine is allocated a dedicated GPU.

94 GPU virtualization solutions are starting to be produced by GPU manufacturers themselves. NVIDIA recently announced a product called “vGPU,” where up to eight virtual machines may share a single physical

GPU (NVIDIA, 2014d). However, vGPU supports virtualization for graphics applications only. There is no support for virtualization of GPGPU services—dedicated GPUs are still required.

Researchers have developed several prototypes for virtualizing GPGPU services: GViM by Gupta et al.

(2009), gVirtuS by Giunta et al. (2010), vCUDA by Shi et al. (2012), and GPUvm by Suzuki et al. (2014).

Implementation details differ among these prototypes, but they all (with the exception of GPUvm) share the same RPC-styled software architecture. Here, GPGPU API calls made by processes within the virtual machines are intercepted and translated into remote procedure calls that are executed within the environment that hosts the virtual machines. API interception is handled by an API-compatible stub library that replaces the GPGPU runtime within each virtual machine. The stub library communicates API calls to a GPGPU backend user-space daemon that runs within the host environment, outside the purview of the virtual machines.

This backend services the remote procedure calls using the full GPGPU runtime to communicate with the

GPU driver and GPU. These GPGPU virtualization prototypes use different mechanisms for transmitting remote procedure call data between the virtual machines and the host environment. In order to avoid costly memory copy operations between the virtual machines and host environment, these prototypes also implement mechanisms for remapping pages of memory between the virtual machines and the host environment.

GPUvm differs from the other GPU virtualization techniques in that it is not API-driven. Instead, GPUvm presents a logical instance of a GPU to each virtual machine through the hypervisor. This is achieved by creating “shadows” of memory regions, including GPU memory addresses. This technique allows applications within the virtual machines to use standard GPGPU runtimes instead of stub libraries. A GPUvm backend user-space daemon runs the host environment and monitors access to shadowed memory addresses, shuttling

GPU commands and results between the virtual machines and the physical GPU.

In the software architectures described above, GPU scheduling policies can be implemented within the backend daemons that service remote procedure requests (or GPU commands, in the case of GPUvm).

For instance, requests can be serviced with a simple policy like ﬁrst-come-ﬁrst-serve or round-robin. More advanced polices may also be employed. GViM and GPUvm use scheduling algorithms inspired by the

Xen hypervisor “credit” scheduler. Here, a budget for GPU operation execution time is assigned to each virtual machine. The appropriate budget is drained as GPU operations from the associated virtual machine

95 are executed. Budgets are replenished periodically. GPU resources can be fairly allocated to virtual machines by setting the appropriate budgets.

2.5.3.2 GPU Resource Maximization

The RPC-styled software architecture of the virtualization techniques discussed above can also be used to maximize GPU resource utilization. Recall that an EE is capable of running several GPU kernels concurrently, provided that these kernels share the same address space (see Section 2.4.2.1).28 A kernel completes when the last block of its grid completes. If kernels are not executed concurrently, then all but one of an EE’s SMs will always be left idle while the last block executes. However, these idle SMs can be kept busy if additional kernels have been queued for concurrent execution.

By funneling all GPU operations through a single process, similar to the RPC backend in the virtualization prototypes, a greater degree of EE utilization can be achieved. rCUDA is the ﬁrst attempt at implementing such a framework for the CUDA runtime (Duato et al., 2010). rCUDA consists of two major components: a

CUDA API-compatible stub library and an RPC backend daemon. In addition to servicing requests made by local processes, the rCUDA RPC server can also service GPU operation requests from remote machines in a compute cluster. VOCL provides a similar framework for the OpenCL runtime (Xiao et al., 2012). Recently,

NVIDIA released a framework of their own, called “Multi-Process Service” or MPS (NVIDIA, 2014b).

Unlike rCUDA or VOCL, MPS lacks the ability to service requests from remote machines.

With regards to GPU scheduling, the rCUDA daemon services requests from clients in a round-robin fashion. VOCL’s daemon schedules work in ﬁrst-come-ﬁrst-serve order. These schedulers do not separately schedule the EE and CEs, so engines may be left idle even though there may be work ready to be scheduled on them. NVIDIA has not disclosed the scheduler employed by MPS.

We conclude with a remark relating these GPU resource maximization techniques to the virtualization methods we discussed earlier. The authors of rCUDA and VOCL describe their approaches as a solution for

GPGPU virtualization. However, we opt to separate rCUDA and VOCL from the virtualization category because their software architectures match that of a conventional RPC-based distributed system. No attempt is made to take advantage of a virtualized environment. Speciﬁcally, they do not eliminate memory copies

28Technically speaking, in CUDA, these kernels must actually belong to the same CUDA “context.” A single context can be thought of as a GPU address space. The CUDA runtime creates one default context per GPU for each process. A single process can create multiple contexts for the same GPU using a low-level CUDA API, but this feature is rarely used.

96 between the client and server when they reside on the same physical machine; this could be done through memory remapping. Instead, rCUDA and VOCL transmit all data between the client and server through network sockets.29

2.5.3.3 Fair GPU Scheduling

We now discuss several notable works on fair GPU scheduling. We begin with PTask (Rossbach et al.,

2011). The PTask framework defines a set of OS-managed abstractions. These abstractions give the OS insight into an application’s various phases of execution on CPUs, EE, and CEs. This insight is leveraged by the OS to schedule GPU resources as “first class” processors (i.e., exercise a degree of control commensurate to that exerted on CPUs). In the PTask framework, GPU-using tasks are decomposed into a dataflow graph. Graph vertices loosely represent a computation or operation on a CPU or EE. We describe this representation as

“loose” because EE vertices must still be scheduled on a CPU in order to initiate GPU operations. Moreover, an EE vertex may still contain considerable CPU code, if desired by the programmer. Each vertex exposes a set of input and output ports. Each port is backed by a data buffer. These data buffers reside either in host memory or GPU memory. Ports are connected by channels, which represent the data dependencies among vertices. CE operations are deduced by examining the memory type (host or GPU) of data buffers that are connected by a channel. Connected ports with buffers of different memory types require the CE to shuttle data between them. Connected ports with buffers of the same memory types are combined into one, eliminating the need for memory copy operations.

GPU operations are initiated through a set of PTask-deﬁned APIs that operate on the vertex, port, and channel abstractions. These APIs call special PTask system calls before and after calling underlying GPGPU runtime APIs. This allows the OS to actively monitor the state of GPU activities. The OS can exert control over EE and CE scheduling by forcing tasks to delay or sleep within the PTask system calls. GPU EEs and

CEs are scheduled separately, allowing the EE and CEs of a GPU to be utilized simultaneously. PTask also supports automatic GPU allocation in multi-GPU systems. PTask includes a data-aware GPU scheduler that attempts to greedily schedule GPU computations on the “best” available GPU at the time the operation is issued, where “best” is deﬁned by GPU capabilities (e.g., speed) and data locality. However, data migration between GPUs must be performed by copying data to and from system memory (i.e., PTask does not use

29For the sake of completeness, we note that MPS transmits data through POSIX named pipes.

97 peer-to-peer DMA). Rossbach et al. propose several fairness-based schedulers for EE and CE scheduling.

The scheduling priority of a task is determined by a heuristic that considers parameters such as OS CPU scheduling priority, waiting time for GPU resources, and expected GPU operation time.

Rossbach et al. are somewhat radical in their approach to GPU scheduling in that they deﬁne a new programming model—GPGPU programs must be programmed to the PTask model. In contrast, Gdev achieves non-real-time GPU scheduling while remaining transparent to GPGPU program code (Kato et al.,

2012). Gdev replaces the manufacturer-provided device driver with its own. Gdev also offers a replacement

GPGPU runtime stub library in order to prevent direct communication between tasks and GPU hardware by way of the memory-mapped interface. This stub library calls into the replacement device driver to issue GPU operations. Like the PTask framework, the OS (by way of the Gdev driver) can actively monitor the state of

GPU operations and control the issuance of future ones. Gdev employs a “bandwidth-aware non-preemptive device” (BAND) scheduling algorithm to separately schedule the EE and CE. The BAND scheduler operates much like the Xen’s credit scheduler, with enhancements for managing the non-preemptive nature of the GPU engines in budget accounting. The BAND scheduler, in conjunction with unique GPU memory management features offered by Gdev, can be used to partition a physical GPU into several logical GPUs as a form of

GPU virtualization. However, Gdev (as presented) does not integrate with any virtualization technologies

(such as Xen), so we opt to not include it in the earlier virtualization category.

PTask and Gdev are both API-driven in that APIs, be they provided by the framework (PTask) or inserted by a stub library (Gdev), route the requests to perform GPU operations through a GPU scheduler. This incurs a scheduling overhead for every request. Menychtas et al. (2014) make the observation that, in cases where tasks submit all work to GPUs through a memory-mapped interface, a task can be prevented from accessing a GPU by unmapping the memory interface from the task’s virtual address space. A task’s attempt to issue work to a GPU can be then trapped within the OS’s page fault handler. The OS schedules the GPU by restoring the memory-mapped interface and allowing the application to return from the page fault. Menychtas et al. explored the implementation of several fairness-based schedulers based upon this mechanism in their

GPU scheduling framework called NEON. The NEON schedulers are passive in that GPU scheduling is not in the code-path of each API call, so some scheduling overheads are avoided. Through reverse engineering of the GPU driver, Menychtas et al. were able to use the stock GPGPU software stack. Although admirably elegant, this approach has a drawback. Like TimeGraph, Menychtas et al.’s approach also operates at the

98 GPU command level. The scheduler does not disambiguate between commands that require an EE or CE, so it must schedule the GPU as a single processor—engines may be left idle as a result.

2.6 Conclusion

Our review of the topics in this chapter should give us an appreciation for the challenges we face in realizing a multi-GPU real-time system. GPUs have unique constraints that permeate the GPGPU runtime, device driver, and GPU hardware. This marks GPU scheduling as distinct from CPU scheduling. GPUs have the potential to enable new capabilities in real-time systems if we can overcome these challenges. In this chapter, we have reviewed a variety of real-time techniques that we can use to address these issues.

Namely, real-time task models, schedulability analysis, scheduling algorithms, and locking protocols. In the next chapter, we tie these techniques together to create a single cohesive and comprehensive framework for real-time GPU scheduling.

99 CHAPTER 3: GPUSync1

In this chapter, we present the design of our real-time GPU scheduling framework, GPUSync. GPUSync addresses issues in the three fundamental categories of allocation, budgeting, and integration. Allocation issues include task-to-GPU assignment, the scheduling of GPU memory transfers, and the scheduling of GPU computations. Budgeting issues arise when tasks utilize more GPU resources than allocated. Integration issues relate to the technical challenges of integrating GPU hardware (and closed-source software) into a real-time system.

In resolving these issues, we pay careful attention to managing GPU-related parallelism. For example, modern GPUs can send data, receive data, and perform computations simultaneously. Ideally, these three operations should be allowed to overlap in time to maximize performance. Additionally, data transmissions result in increased trafﬁc on shared buses used in parallel by other tasks. We carefully manage bus trafﬁc to limit the effect bus contention has on other real-time tasks.

If a system has multiple GPUs, then parallelism-related issues arise when allocating GPUs. As we discussed in Chapter 1, it may be desirable to use a clustered or global GPU organization in order to avoid the utilization loss common to partitioned approaches. However, a GPU-using task may develop memory-based affinity for a particular GPU as it executes. In such cases, program state (data) is stored in GPU memory and is accessed by the task each time it executes on that particular GPU. This state must be migrated each time the task uses a GPU different from the one it used previously. Such migrations increase bus traffic and affect system-wide performance and predictability. GPUSync supports clustered and global GPU organizations, and supports efficient migration of program state between GPUs while maintaining real-time predictability.

1 Portions of this chapter previously appeared in conference proceedings or journals. The original citations are as follows: Elliott, G. and Anderson, J. (2012b). Robust real-time multiprocessor interrupt handling motivated by GPUs. In Proceedings of the 24th Euromicro Conference on Real-Time Systems, pages 267–276; Elliott, G. and Anderson, J. (2013). An optimal k-exclusion real-time locking protocol motivated by multi-GPU systems. Real-Time Systems, 49(2):140–170; Elliott, G., Ward, B., and Anderson, J. (2013). GPUSync: A framework for real-time GPU management. In Proceedings of the 34th IEEE International Real-Time Systems Symposium, pages 33–44; Elliott, G. and Anderson, J. (2014). Exploring the multitude of real-time multi-GPU conﬁgurations. In Proceedings of the 35th IEEE International Real-Time Systems Symposium, pages 260–271.

100 The remainder of this chapter is organized as follows. We begin by contrasting API-driven and command- driven GPU schedulers. We then discuss eight general software architectures that may be used by an

API-driven (real-time) GPU scheduler. There are tradeoffs to consider among these architectures, such as ease of implementation, robustness, and real-time correctness. We carefully weigh these tradeoffs in our selection of a general software architecture for GPUSync. Following this discussion, we describe our synchronization-based philosophy to real-time GPU scheduling. With these concepts ﬁrmly in place, we present the detailed design of GPUSync. This is done in three parts, each one addressing issues related to allocation, budgeting, and integration, respectively. We then discuss a variety of implementation challenges.

We conclude with a few ﬁnal remarks to summarize characteristics of GPUSync.

3.1 Software Architectures For GPU Schedulers

There are two importance choices to make in the design of a GPU scheduler. We must ﬁrst decide how the GPU scheduler is inserted between user applications and GPUs. We must then decide how tightly our scheduler integrates with the underlying RTOS. We make tradeoffs in implementation effort, efﬁciency, stability, and real-time correctness with each choice. We now explore these options and provide the rational behind the choices we make for GPUSync.

3.1.1 API-Driven vs. Command-Driven GPU Schedulers

In Chapter 2, we mentioned two general approaches to GPU scheduling. A GPU scheduler may be

API-driven or command-driven. Under the API-driven model, explicit GPGPU API calls issued by user applications are routed through a GPU scheduling software layer, which decides when an API call may proceed. The scheduler is invoked by every API call that may issue work to a GPU. The GPU schedulers

RGEM, GViM, gVirtuS, vCUDA, rCUDA, VOCL, MPS, PTask, and Gdev, which we discussed in Section 2.5, fall within this category. Under the command-driven model, GPGPU API calls are not scheduled. Instead, the GPU command sequences generated by API calls are scheduled. These commands must be written to an in-memory command buffer that is read by the GPU. A command-driven GPU scheduling framework controls when an application may write its buffer and/or when the buffer is read by the GPU. The GPU schedulers TimeGraph, NEON, and GPUvm, also discussed in Section 2.5, fall within this category.

We wish GPUSync to exhibit two properties:

101 1. The ability to separately schedule the EE and CEs of a GPU.

2. The ability to strictly arbitrate access to the EE and CEs of a GPU.

The ﬁrst property enables efﬁcient utilization of GPU resources. The second property allows us to model the

GPU scheduler in real-time analysis. We may imagine both API-driven and command-driven schedulers that achieve these properties. However, these properties are far easier to achieve with an API-driven scheduler than a command-driven scheduler.

The low level GPGPU runtime APIs that issue work to a GPU are cleanly divided between those that perform memory copies (i.e., CE work) and those that launch GPU kernels (i.e., EE work). There is no API call that invokes the EE and CEs of a GPU in a combined sequence of operations. Under the API-driven approach, we can easily determine which engine is invoked by a given API call. This is not the case with the command-driven approach. A call to a single GPGPU API may write a sequence of GPU commands to the command buffer. After several GPGPU API calls, this buffer may contain a mix of CE and EE commands.

A command-driven scheduler that separately schedules the EE and CEs of a GPU must parse the buffered commands in order to segment the buffer into EE and CE command sub-sequences. This scheduler must then separately schedule these sub-sequences on the appropriate GPU engines. Parsing the command buffer requires intimate knowledge of the structure and meaning of GPU commands. GPU manufacturers often withhold such information, so obtaining this knowledge may require a non-trivial effort to reverse engineer.

Moreover, this process may have to be repeated for new versions of a given GPGPU runtime, as well as new

GPU devices. Due to these challenges, it should come as no surprise that no one has yet, to the best of our knowledge, developed a command-driven GPU scheduler that separately schedules the EE and CEs of a GPU.

GPUSync Architectural Choice #1. The expressiveness of API-driven schedulers and the challenges associated with command-driven GPU schedulers lead us to make the following decision: We use an API-driven GPU scheduler in GPUSync.

3.1.2 Design-Space of API-Driven GPU Schedulers

In this section, we discuss several software architectures for API-driven GPU schedulers. An API-driven

GPU scheduling framework must determine when an intercepted API call may proceed. This scheduling decision may be made centrally by a dedicated scheduling process (e.g., a daemon) or cooperatively by the

GPU-using tasks that share a GPU scheduling algorithm and scheduler state. In either case, the GPU scheduler

102 may be implemented in user-space or kernel-space. Also, a GPU scheduler may employ a mechanism that enforces GPU scheduling decisions, or it may trust API callers to abide by them. In summary, scheduling decisions may be: (i) made centrally or cooperatively; (ii) made in user-space or kernel-space; (iii) enforced or not. These choices give rise to eight general software architectures for API-driven GPU schedulers. We will describe and discuss the tradeoffs made by each approach. However, some additional background is needed before proceeding.

Common to all eight general software architectures is the use of interposed or stub libraries. An interposed library is inserted into the code paths of processes at dynamic link-time (i.e., when the process is launched). The interposed library overrides the default linkage between application code the underlying

GPGPU runtime.2 An interposed library may invoke a GPU scheduling framework before passing an intercepted API call on to the original GPGPU runtime. A stub library is similar to an interposed library, excepting that the stub library does invoke the GPGPU runtime, but merely passes API calls onto another software component that does. Stub libraries can be employed at either static or dynamic link-time. The use of interposed and stub libraries is optional in the following architectures—an application may always implement the functionality of these libraries itself.

In our consideration of the eight general software architectures, we make two important assumptions:

(i) the GPU device driver executes within the kernel-space of the RTOS; and (ii) we are not necessarily constrained by microkernel RTOS design principles. These assumptions are consistent with the technical constraints under which we prototype GPUSync (i.e., a Linux-based OS with standard GPU drivers and

GPGPU runtime).

We are now ready to discuss the architectural options of GPU schedulers. Some architectures exhibit the same characteristics. In order to avoid being repetitive, we may paraphrase previously discussed characteristics. Paraphrased characteristics are denoted by square brackets (e.g., “[paraphrased characteristic]”).

3.1.2.1 GPU Scheduling in User-Space

We ﬁrst consider the class of software architectures for user-space GPU schedulers. Figure 3.1 depicts several such architectures. We discuss each in turn, before discussing the tradeoffs between user-space and kernel-space schedulers.

2On many UNIX-like systems, this can be accomplished through the use of the LD_PRELOAD environment variable.

103 Memory Access IPC Function/System Call Interrupt Process Shared Code/Memory

User Space

T1 T2

GPGPU GPGPU Stub Library Stub Library

User Space GPU Sched. T T Daemon 1 2

GPU Sched. GPGPU GPGPU Algorithm Interception GPU Sched. Interception Library Daemon Library GPU Sched. GPGPU GPU Sched. GPGPU State Runtime Algorithm Runtime

GPGPU GPU Sched. Runtime State

Kernel Space Kernel Space GPU Device GPU Device Driver Driver

Hardware Hardware GPU GPU

(a) Enforced centralized GPU scheduling. (b) Centralized GPU scheduling with trusted API callers.

User Space

T1 T2

GPGPU GPGPU Interception Interception Library Library User Space GPU Sched. GPU Sched. GPU Sched. Algorithm State Algorithm T1 T2

GPGPU GPGPU Interception Interception GPGPU Library Library Runtime GPU Sched. GPU Sched. GPU Sched. Daemon Algorithm State Algorithm GPGPU GPGPU GPGPU Runtime Runtime Runtime

Kernel Space Kernel Space GPU Device GPU Device Driver Driver

Hardware Hardware GPU GPU

Figure 3.1: Several software architectures of API-driven GPU schedulers implemented in user-space.

104 Centralized Scheduling With Enforcement. Figure 3.1(a) depicts a common software architecture for

GPU scheduling in user-space. Here, GPGPU API calls are issued to a GPGPU stub library within each task.

The stub library redirects the API request over an IPC channel (e.g., UNIX domain socket, TCP/IP socket, etc.) to a GPGPU scheduling daemon, which services requests according to a centralized scheduling policy.

The daemon executes all API calls itself, enforcing all scheduling decisions.

This software architecture makes the following tradeoffs. Pros:

1. Scheduling policy is easy to implement since decisions are centralized.

2. Scheduling decisions are enforced, since scheduled API calls are executed by the GPGPU scheduling

daemon itself.

Cons:

1. The daemon must include, or be able to load, the GPU kernel code of constituent tasks. This can be

accomplished at compile-time when the daemon is compiled. Dynamic loading of GPU kernel code is

also possible, but is non-trivial to implement.

2. The IPC introduces overheads due to message passing between tasks and the daemon.

3. Barring the use of any memory remapping techniques, GPU kernel data must be transmitted through the

IPC channel. Performance of data-heavy GPGPU applications (e.g., a pedestrian detection application

that is fed data by a video camera) is poor.

4. The daemon itself must be scheduled. This introduces additional scheduler overheads. Moreover,

unless the RTOS provides a mechanism by which the daemon may inherit a priority from its constituent

tasks, schedulability analysis is not straight forward. We may work around this limitation by either

boosting the priority of the daemon, which hurts schedulability analysis (see Section 2.1.6.1), or we

may reserve a CPU exclusively for the daemon, which results in the loss of a CPU for other work.

Neither approach is desirable.

Remarks:

This popular architecture is employed by GViM, gVirtuS, vCUDA, rCUDA, and MPS. However, none of these implement real-time GPU scheduling policies. We note that the overheads associated with transmitting

GPU kernel data over the IPC channel may lead to unacceptable performance for data-heavy applications.

105 Centralized Scheduling Without Enforcement. Figure 3.1(b) depicts another daemon-based scheduler.

Here, GPGPU API calls are intercepted by an interposed library. For each call, the library issues a request for the appropriate GPU engine to the GPU scheduling daemon through an IPC channel. The library waits for each request to be granted. The daemon grants requests according to a centralized scheduling policy. The interposed library passes intercepted API calls on to the original GPGPU runtime once necessary resources have been granted.

This software architecture makes the following tradeoffs. Pros:

1. Easy to implement, since scheduling decisions are centralized and GPU kernel code remains local to

each constituent task.

2. GPU kernel data is not copied over the IPC channel. Data-heavy GPGPU applications perform well.

Cons:

1. GPU scheduling decisions cannot be enforced. It is possible for a misbehaved or malicious task to

bypass the interposition library and access the GPGPU runtime directly.

2. [The IPC introduces message passing overheads.]

3. [The daemon itself must be scheduled.]

Remarks:

This architecture sacriﬁces the enforcement of scheduling decisions to realize performance beneﬁts for data-heavy applications, as GPU kernel input and output data does not traverse an IPC channel. Also, this is the easiest of the eight GPU scheduling architectures to implement. This is the architecture employed by the

Windows 7-based prototype of PTask, which is a non-real-time GPU scheduler. This is architecture is also employed by RGEM, which is a real-time GPU scheduler.

Cooperative Scheduling With Enforcement. Figure 3.1(c) depicts the software architecture of a cooperative GPU scheduler and a GPGPU runtime daemon. Here, an interposed library intercepts API calls. Each instance of the library within each task invokes the same GPU scheduling algorithm, which is embedded

106 within the interposed library. A single instance of GPU scheduler state is stored in shared memory.3 The interposed library passes API calls to a daemon for actual execution.

This software architecture makes the following tradeoffs. Pros:

1. GPU scheduling is efﬁcient, as each task can access the GPU scheduler state directly.

2. Scheduling decisions are weakly enforced. Although the GPGPU runtime daemon centralizes all

accesses to the GPUs, misbehaved or malicious tasks may issue work directly to the daemon, bypassing

the cooperative GPU scheduler.

Cons:

1. Access to the shared GPU scheduler state must be coordinated (or synchronized) among tasks. De-

pending upon the synchronization mechanism used, tasks may need to execute non-preemptively while

executing scheduling algorithms (in order to avoid deadlock). This requires support from the RTOS or

access to privileged CPU instructions that temporarily disable preemption.

2. A misbehaved or malicious task may corrupt GPU scheduler state, as it may overwrite any data in

shared memory. Recovery from such faults may be difﬁcult.

3. A misbehaved or malicious task may bypass the GPU scheduler and issue work directly to the GPGPU

runtime daemon, unless the daemon has a mechanism by which to validate requests.

4. [The IPC introduces message passing overheads.]

5. [GPU kernel data must be transmitted through the IPC channel.]

6. [The daemon itself must be scheduled.]

Remarks:

Any potential beneﬁts of cooperative scheduling are obviated by IPC-related overheads in this architecture.

Moreover, the ability to enforce scheduling decisions is weakened by the fact that the GPU scheduler can be bypassed. This architecture offers no apparent beneﬁt over centralized scheduling with enforcement

(Figure 3.1(a)).

3On many UNIX-like systems, POSIX or SysV APIs may be used to allocate and manage memory shared among processes.

107 Cooperative Scheduling Without Enforcement. Figure 3.1(d) depicts the software architecture of a cooperative GPU scheduler, without the use of a daemon. Here, an interposed library intercepts API calls. As before, tasks cooperatively execute the same GPU scheduling algorithm and operate upon the same shared scheduler state. Intercepted API calls are passed on to the original GPGPU runtime when scheduled.

This software architecture makes the following tradeoffs. Pros:

1. There are no IPC overheads. Data-heavy GPGPU applications perform well.

2. There is no daemon to schedule. This simpliﬁes real-time analysis and requires less support from the

RTOS.

3. [Efﬁcient GPU scheduling.]

Cons:

1. [Access to GPU scheduler state must be coordinated.]

2. [GPU scheduler state is vulnerable to corruption.]

3. [GPU scheduling decisions cannot be enforced.]

Remarks:

This is the most efﬁcient user-space architecture we examine. It avoids all IPC overheads. It avoids all overheads and analytical challenges raised by daemons. However, it is also the most fragile of all eight architectures. We must trust tasks to: (i) not bypass the GPU scheduler; and (ii) not corrupt the GPU scheduler state.

3.1.2.2 GPU Scheduling in Kernel-Space

GPU scheduling in user-space has several weakness. One weakness is that we may be unable to sufﬁciently protect GPU scheduler data structures. This is the case with the cooperative scheduling approaches, where GPU scheduler state can be corrupted by misbehaved or malicious tasks. However, the greatest weakness in user-space scheduling is our inability to tightly integrate with the underlying RTOS—this has the potential to prevent us from realizing a correct real-time system with any degree of conﬁdence. RTOSs may provide some mechanisms that allow real-time tasks to affect the scheduling priority of other tasks through

108 user-space actions (e.g., real-time locking protocols with priority-modifying progress mechanisms, as well as system calls that directly manipulate priorities). These can be leveraged to add some real-time determinism to a user-space GPU scheduler.4 However, these mechanisms may be insufﬁcient to minimize, and more importantly, bound, GPU-related priority inversions.

Recall from Section 2.4.5 the challenge we face in scheduling interrupt and GPGPU-runtime-callback threads. How do we dynamically bind the priorities of these threads to those of the appropriate real-time jobs, which they themselves may have priorities that change dynamically? How can we ensure that the appropriate job budgets are charged, and how do we handle budget exhaustion? Our ability resolve these issues are severely limited from user-space. These problems are best addressed by tightly integrating the GPU scheduler with the CPU scheduler and other OS components (e.g., interrupt handling services) within the RTOS kernel.

These issues motivate us to consider kernel-space GPU schedulers. Figure 3.2 depicts several high-level software architectures of kernel-space GPU schedulers. We assume that all approaches beneﬁt from the ability to tightly integrate with the RTOS. (This ability is not explicitly reﬂected by the diagrams in Figure 3.2.) We now discuss the tradeoffs among these kernel-space options.

Centralized Scheduling With Enforcement. Figure 3.2(a) depicts a software architecture with a centralized scheduler daemon. The architecture bears a strong resemblance to the one depicted in Figure 3.1(a), and functions much in the same manner. However, the GPU scheduling daemon now runs from kernel-space.

This has an important implication, which we discuss shortly.

This software architecture makes the following tradeoffs. Pros:

1. GPU kernel data is not copied over the IPC channel. This is possible because the GPU scheduling

daemon may directly access the user-space memory of its constituent tasks. Data-heavy GPGPU

applications perform well.

2. [Scheduling policy is centralized.]

3. [Scheduling decisions are enforced.]

Cons:

4These mechanisms may require real-time tasks to have an escalated privilege to enable priority-modifying capabilities—this is undesirable from a security perspective.

109 Memory Access IPC Function/System Call Interrupt Process Shared Code/Memory

User Space User Space T T T1 T2 1 2

GPGPU GPGPU GPGPU GPGPU Stub Library Stub Library Interception Interception Library Library

GPGPU GPGPU Kernel Space Runtime Runtime GPU Sched. Daemon

GPU Sched. Kernel Space GPU Sched. Algorithm Daemon GPU Sched. GPU Sched. GPU Device State Algorithm Driver GPGPU GPU Device GPU Sched. Runtime Driver State

Hardware Hardware GPU GPU

(a) Enforced centralized GPU scheduling. (b) Centralized GPU scheduling.

User Space User Space T T T1 T2 1 2 GPGPU GPGPU GPGPU GPGPU Interception Interception Stub Library Stub Library Library Library

GPGPU GPGPU Kernel Space Runtime Runtime GPU Sched. Algorithm Kernel Space GPU Sched. GPU Device GPU Sched. State Driver Algorithm GPU Device GPGPU GPU Sched. Driver Runtime State

Hardware Hardware GPU GPU

Figure 3.2: Several software architectures of API-driven GPU schedulers implemented in kernel-space.

110 1. A kernel-space GPGPU runtime may be unavailable. To the best of our knowledge, all manufacturer-

provided GPGPU runtimes only run in user-space.

2. The daemon itself must be scheduled. This introduces additional system overheads. However, from

kernel-space, we may have more ﬂexibility in properly prioritizing the daemon to ensure real-time

determinism.

3. [The daemon must include, or be able to load, the GPU kernel code of constituent tasks.]

4. [The IPC introduces message passing overheads.]

Remarks:

This architecture beneﬁts from enforced centralized scheduling, without shuttling GPU kernel input and output data across the IPC channel. This approach still suffers some IPC channel overheads due to message passing. However, the greatest drawback of this approach is a practical one: the general unavailability of kernel-space GPGPU runtimes. One must develop their own—we discussed the challenges behind such an effort in Section 2.4.1. This is not an insurmountable challenge, as demonstrated by Gdev (which employs a hybrid architecture of Figure 3.2(a) and Figure 3.2(d)), but difﬁcult.

Centralized Scheduling Without Enforcement. Figure 3.2(b) depicts another software architecture with a

GPU scheduling daemon. Its architecture matches that of Figure 3.1(b), except that the daemon now runs in kernel-space.

This software architecture makes the following tradeoffs. Pros:

1. Uses a commonly available user-space GPGPU runtime.

2. [Scheduling policy is centralized.]

3. [GPU kernel data is not copied over the IPC channel.]

Cons:

1. [GPU scheduling decisions cannot be enforced.]

2. [The IPC introduces message passing overheads.]

111 3. [The daemon itself must be scheduled.]

Remarks:

This architecture strikes a compromise between kernel-space centralized scheduling and practical constraints. Scheduling decisions are made within kernel-space, but carried out by individual tasks with a user-space GPGPU runtime. Thus, the architecture cannot enforce its scheduling decisions. This is the architecture employed by the Linux-based prototype of PTask, which is a non-real-time GPU scheduler.

Cooperative Scheduling With Enforcement. Figure 3.2(c) depicts a software architecture for a cooperative

GPU scheduler. Here, GPGPU API calls are routed to a stub library that invokes the kernel-space GPU scheduler through an OS system call. GPU scheduler state is shared by all tasks, but this is stored in kernel-space data structures. Scheduled API calls are executed by a kernel-space GPGPU runtime using the program thread of the calling task.

This software architecture makes the following tradeoffs. Pros:

1. GPU scheduler state is protected. Unlike cooperative user-space schedulers, the GPU scheduler state is

protected within kernel-space from corruption by misbehaved or malicious user-space tasks.

2. Synchronized access to GPU scheduler state is trivial within kernel-space. There is no need for an

escalated privilege to execute non-preemptively while GPU scheduler data structures are updated.

3. [GPU scheduling is efﬁcient.]

4. [Scheduling decisions are enforced.]

Cons:

1. [Need for a kernel-space GPGPU runtime.]

Remarks:

This is the strongest of the eight architectures we examine from a performance perspective. Cooperative scheduling decisions are efﬁcient and enforced by the RTOS. The GPGPU runtime is executed within kernel- space using the program stacks of the calling tasks, rather than a separately scheduled daemon. There are no

IPC overheads. The only limitation of this approach is the reliance upon a kernel-space GPGPU runtime.

112 Cooperative Scheduling Without Enforcement. Figure 3.2(d) depicts another software architecture for a cooperative GPU scheduler. Here, an interposed library intercepts API calls, and invokes the kernel-space

GPU scheduler via system calls. As before, GPU scheduler state is shared by all tasks and protected from misbehaved and malicious tasks. To schedule an API call, the GPU scheduler returns control to the interposed library. The interposed library uses the user-space GPGPU runtime to execute the scheduled API call.

This software architecture makes the following tradeoffs. Pros:

1. [Uses a commonly available user-space GPGPU runtime.]

2. [GPU scheduler state is protected.]

3. [Access to GPU scheduler state is easily synchronized.]

4. [GPU scheduling is efﬁcient.]

Cons:

1. [GPU scheduling decisions cannot be enforced.]

Remarks:

In order to support a user-space GPGPU runtime, this architecture sacriﬁces enforcement capabilities, trusting tasks to not bypass the interposed library by accessing the GPGPU runtime directly. Despite this limitation, it is still a strong architecture from a performance perspective. Like the prior approach, scheduling decisions are efﬁcient. Also, there are no IPC- or daemon-related overheads. For a researcher or developer willing to implement OS-level code, this architecture is the most practical high-performance option.

GPUSync Architectural Choice #2. After a careful consideration of the above software architectures, we come to the following decision: We opt to use a kernel-space cooperative GPU scheduler without enforcement in GPUSync.

We do so because:

1. It avoids overheads due to IPCs and daemons.

2. It supports the use of user-space GPGPU runtime libraries.

113 3. It enables tight integration with the RTOS, enabling us to fully explore the matrix of CPU/GPU

organizational choices we discussed in Chapter 1 (Figure 1.4) and support a variety of real-time

schedulers.

3.2 Design

We now describe the design of GPUSync in detail. We begin by explaining our synchronization-based philosophy. We then describe our assumed task model supported by GPUSync. We then delve into the software design details of GPUSync.

3.2.1 Synchronization-Based Philosophy

GPU management is often viewed as a scheduling problem. This is a natural extension to conventional techniques. Similar to CPU scheduling (e.g., as in Figure 2.3), pending work for GPU engines is placed in a ready queue. This pending work is prioritized and scheduled on the GPU engines. The implementation of these approaches is straightforward, but the resulting mix of CPU and GPU scheduling algorithms is difﬁcult to analyze holistically. For example, existing schedulability analysis techniques for heterogeneous processors, such as Gai et al. (2002); Baruah (2004); Pellizzoni and Lipari (2007); Kim et al. (2013), are problematic in multi-GPU systems due to one or more of the following constraints: (i) they cannot account for non-preemptive GPU execution; (ii) they require that tasks be partitioned among different types of processors, yet our GPU-using tasks must make use of CPU, EE, and CE processors; (iii) they statically assign GPU-using tasks to GPUs; (iv) they place restrictions on how GPUs may be shared among tasks; and (v) they place limits on the number of CPUs and GPUs.

Instead, we view GPU scheduling as a synchronization problem. This perspective allows us to apply existing techniques developed for real-time locking protocols towards GPU scheduling. This perspective inﬂuences both the design of the GPU scheduler and real-time analysis. However, we wish to make clear that the distinction between scheduling and synchronization approaches is somewhat blurred: the locking protocols we use become GPU schedulers—these protocols prioritize ready GPU work and grant access accordingly, just as any scheduler would. The differences lie in how work is prioritized and the use of locking protocol progress mechanisms. Nonetheless, as we shall see, a synchronization-based approach gives us established techniques to address problems relating to allocation, budgeting, and integration.

114 3.2.2 System Model

We consider a system with m CPUs, partitioned into clusters of c CPUs each, and h GPUs, partitioned into clusters of g GPUs each. We assume that the workload to be supported can be modeled as a traditional sporadic real-time task system (as described in Section 2.1.1), with jobs being scheduled by a JLFP scheduler.

Furthermore, we assume that the scheduled system is a SRT system for which bounded deadline tardiness is acceptable. While GPUSync’s design does not inherently preclude use in HRT systems, reliance upon closed-source software would make a claim of HRT support premature. Thus, we focus on design strategies that improve predictability and average-case performance, while maintaining SRT guarantees. We assume that tasks can tolerate GPU migration at job boundaries, and that the per-job execution times of each task remain relatively consistent, with overruns of provisioned bounds being uncommon events. We also assume that tasks pre-allocate all necessary GPU memory on any GPU upon which its jobs may run.

3.2.3 Resource Allocation

In this section, we discuss how GPUSync assigns GPUs and GPU engines to jobs. We begin with a high-level description of GPUSync’s resource allocation methods. We then describe the mechanisms behind

GPU allocation in detail. This is followed by a description of how GPUSync arbitrates access to GPU engines.

3.2.3.1 High-Level Description

GPUSync uses a two-level nested locking structure. The high-level design of GPUSync’s allocation mechanisms is illustrated in Figure 3.3. There are several components: a self-tuning execution Cost Predictor; a GPU Allocator, based upon a real-time k-exclusion locking protocol, augmented with heuristics; and a set of real-time Engine Locks, one per GPU engine, to arbitrate access to GPU engines.

We describe the general steps followed within GPUSync to allocate GPU resources. We refer to the schedule in Figure 3.4 to help describe when GPU resources are requested by, and allocated to, a job. This schedule corresponds to the simpliﬁed schedule for the VectorAdd program from Section 1.4.1 (Figure 1.3).

A GPU critical section is a region of code where a GPU-using task cannot tolerate a migration between

GPUs during a sequence of GPU operations. A job must acquire one of ρ tokens associated with a particular GPU before entering a critical section that involves accessing that GPU. We call a GPU critical section

115 Engine Locks

GPU 0 E D CE GPU Allocator C 0,0 request ! CE 0,1 A k = ! x g

EE 0 B … ! GPU Cost Predictor g-1

CE g-1,0

CE g-1,1

EE g-1

Figure 3.3: High-level design of GPUSync’s resource allocation mechanisms. protected by a token a token critical section. We assume each job has at most one token critical section.5 As depicted in Figure 3.3, a job requests a token from the GPU Allocator in Step A (or time 5 in Figure 3.4).

Utilizing the cost predictor in Step B and internal heuristics, the GPU Allocator determines which token (and by extension, which GPU) should be allocated to the request. The requesting job is allowed access to the assigned GPU once it receives a token in Step C. In Step D, the job competes with other token-holding jobs for GPU engines; access is arbitrated by the engine locks. A code region protected by an engine lock is an engine critical section. A job may only issue GPU operations on its assigned GPU once its needed engine locks have been acquired in Step E. For example, engine locks are requested at times 11, 22, 33, and 54 in

Figure 3.4. With the exception of peer-to-peer migrations, a job cannot hold more than one engine lock at a time. This is reﬂected in Figure 3.4, where engine locks are released at times 21, 31, 53, and 64.

The general structure of GPUSync is straightforward: a GPU Allocator assigns jobs to GPUs and engine locks arbitrate engine access. However, many questions remain. For example, how many tokens can each

GPU have? What queuing structures should be used to manage token and engine requests? How can we

5Enhancements to the Cost Predictor are necessary in order to support multiple token critical sections per job—we leave such enhancements to future work.

116 Token Critical Section

Engine Critical Sections

CPU

CE 0

CE 1 time 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70

Request Token Acquire Token Release Token Suspended

Request Eng. Lock Acquire Eng. Lock Release Eng. Lock

Figure 3.4: A schedule of a job using GPU resources controlled by GPUSync. enable GPU migration, yet minimize associated overheads? We now answer such questions, and provide additional rationale for our design choices.

3.2.3.2 GPU Allocator

Each cluster of g GPUs is managed by one GPU Allocator; as such, we henceforth consider GPU management only within a single GPU cluster. We associate ρ tokens, a conﬁgurable parameter, with each GPU. All GPU tokens are pooled and managed by the GPU Allocator using a single k-exclusion lock, where k = ρ · g. The value of ρ directly affects the maximum parallelism that can be achieved by a GPU since it controls the number of jobs that may directly compete for a GPU’s engines—ρ must be at least the number of engines in a GPU if every engine is to ever be used in parallel. This implies that a large value should be used for

ρ. However, the value of ρ also strongly affects GPU migrations. Too great a value may make the GPU Allocator too migration-averse, since tokens will likely be available for every job’s preferred GPU, even those that are heavily utilized. Constraining ρ prevents GPUSync from overloading a GPU and promotes GPU migration to distribute load.

117 m donors

Token Holders

GPU0

DQ FQ 0 …

FQ -1 … k tokens

FQ k- …

FQ k-1 PQ f GPU g-1

Figure 3.5: The structure of a GPU allocator lock.

The GPU Allocator’s k-exclusion lock uses a hybrid queuing structure consisting of several ﬁxed-size

FIFO queues FQi, a priority queue PQ, and a donor queue DQ, as depicted in Figure 3.5. The FIFO queue th FQi is associated with the i token. Token-requesting jobs are enqueued in a load-balancing manner until every FQ is f jobs in length. (We will discuss mechanisms for load balancing shortly.) The parameter f is conﬁgurable. Additional requests that cannot be placed in an FQ “overﬂow” into PQ or DQ. Jobs are moved from PQ or DQ into an FQ as space becomes available. A job enqueued in FQi suspends until it is at the th head of FQi, in which case it is granted the i token. The GPU Allocator adopts the structure and rules of the R2DGLP (Section 2.1.7.2), with some exceptions.

There are three key differences. First, unlike the R2DGLP, the maximum length of the GPU Allocator FIFO queues is conﬁgurable by the system designer. Selecting different values for f allows the GPU Allocator lock to take on different analytical properties. The GPU Allocator functions exactly as the R2DGLP when f = dc/ke, as the k-FMLP when f = ∞, and as a purely priority-based protocol when f = 1. A system

118 designer may tailor the GPU Allocator to their speciﬁc task sets and schedulers.6 For instance, the k-FMLP can outperform the R2DGLP in some cases (see Ward et al. (2012) for more information). The purely priority-based GPU Allocator may be used with a task-level static priority scheduler, such as RM, when request interference on high-priority tasks must be minimized.

The second difference between the GPU Allocator and the R2DGLP is that, for cases where a GPU cluster is shared by tasks of different CPU clusters, we adopt Rule C1 (i.e., the donation-at-job-release rule) of the CK-OMLP. In this way, the GPU Allocator can be conﬁgured to mimic the CK-OMLP by using f = d(c · γ)/ke, where γ is the number of CPU clusters that share the GPU cluster in question. Such a conﬁguration differs from the CK-OMLP in that the GPU Allocator uses k shorter FIFO queues instead of a single long one—this change does not alter bounds on pi-blocking. Also, the “request stealing” aspect of the

R2DGLP’s Rule R5 ensures similar work-conserving behavior at runtime. The addition of Rule C1 subsumes the inheritance and donation rules of the R2DGLP. This is because Rule C1 ensures that token-holding jobs are always scheduled on a CPU when ready to run—there is no need for other progress mechanisms. We note that valid values of f must meet the constraint f ≥ dm/ke when Rule C1 is enacted. The last difference between the GPU Allocator and the R2DGLP is that the GPU Allocator augments

Rules R1a, R1c, and R5 of the R2DGLP with heuristics to improve runtime performance. We discuss the details of these heuristics next.

Real-time locking protocols are rarely designed to make use of online knowledge of critical section lengths, even though critical sections ﬁgure prominently in schedulability analysis and system provisioning.

We augment the GPU Allocator lock to incorporate knowledge of token critical section lengths into queuing decisions to reduce the frequency and cost of GPU migrations without preventing beneﬁcial migrations. Most of our heuristics use information provided by the Cost Predictor. For now, let us treat the Cost Predictor as a black-box that returns a predicted token critical section length for a GPU request Ri by a job Ji, under the hypothesis that Ri is enqueued on FQx and is eventually allocated the associated token—this prediction includes the cost of migration. We say that the preferred GPU of a job Ji is the GPU last used by task Ti, since

Ji may have afﬁnity with this GPU. Let the function perf(Ji) return the identity of the GPU preferred by Ji.

We must deﬁne several additional terms and functions in order to describe our heuristics. Let Li denote prd the provisioned token critical section length for Ri. Let Li,x denote the predicted token critical section

6We caution that not every conﬁguration of the GPU Allocator (and GPUSync, in general) is necessarily amenable to real-time analysis.

119 prd length provided by Cost Predictor, supposing Ri is enqueued on FQx. We assume 0 ≤ Li,x ≤ Li. We 2 use two methods to measure the length of FQx. The ﬁrst method uses the same function as the R DGLP (Equation 2.33), measuring length in terms of the number of enqueued requests. We repeat it here for a consolidated presentation:

length(FQx) , |FQx|. (3.1)

The second method measures the length in terms of the token critical section length predictions provided by the Cost Predictor: prd execLength(FQx) , ∑ L j,x , (3.2) R j∈FQx where we reindex the requests in FQx with j. For a given GPU, we determine the set of tasks that make up its “active users.” The set of tasks that last used, or are currently using, a GPU make up the GPU’s active users. A task with no pending job can still be among the active users of a GPU because the next job of this task is assumed to prefer the GPU used by the task’s prior job. Let the function numGpuUsers(FQx) denote the number of active users of the GPU associated with FQx. We denote the distance between to GPUs with 7 function distance(GPUa,GPUb). Finally, let the set F , {FQx | length(FQx) < f }, denote the set of FQs that are not full at any given time instant. The GPU Allocator employees the following four heuristics. We use these heuristics simultaneously, as each controls a different operation within the GPU Allocator.

H1: Distribute Tasks Among GPUs

Conditions: This heuristic is employed when: (i) the GPU token request Ri is the ﬁrst such request of

any job of task Ti; and (ii) there exist multiple FQs that are not full (i.e., |F | ≥ 2).

Description: Request Ri is enqueued on an FQ from the set minFQx∈F {numGpuUsers(FQx)}. That is, the request is enqueued on an FQ of a GPU with the fewest number of current users. There may

be several FQs that satisfy this criteria. The particular FQ is selected using the same methods of

Heuristic H2.

Rational: GPUSync is designed with sporadic task sets in mind. If a GPU-using task does not currently

require a GPU, there remains a high likelihood that it will in the near future. This heuristic exploits this

knowledge to distribute tasks among GPUs to reduce contention for preferred GPUs.

7Recall from Section 2.4.2.2 that distance is the number of PCIe links to the nearest common switch or I/O hub of two GPUs.

120 H2: Minimize Predicted Response Time

Conditions: This heuristic is employed when: (i) job Ji issues request Ri for a GPU token; and (ii) there exist multiple FQs that are not full.

prd Description: Request Ri is enqueued on an arbitrary FQ from the set minFQx∈F {execLength(FQx) + Li,x }.

Rational: This heuristic is meant to only allow a GPU migration if it is predicted to be beneﬁcial.A

GPU migration is only beneﬁcial if it expedites the completion of the token critical section of job Ji. Harmful GPU migrations are possible if we only expedite token acquisition. This heuristic not only

estimates how long Ji must wait for a token, given by execLength(FQx), but also how long it may take prd for Ji to complete its token critical section, given by Li,x , which includes predicted migration costs.

H3: Afﬁnity-Aware Priority Donation

Conditions: This heuristic is employed when: (i) job Ji issues request Ri for a GPU token; (ii) all FQs

are full (i.e., |F | = k); and (iii) job Ji must become a priority donor.

Description: We denote the set of eligible donee jobs by E. Let eligibility be deﬁned by Rule R1c of the R2DGLP. All eligible donees have requests in the FQs. Let x denote the index of the FQ in

which an eligible donee job Jj has a request (i.e., FQx). We select a donee from the set generated by 8 minJj∈E {distance(perf(Ji),GPUbx/gc)}. That is, we select an eligible donee with a request in an FQ

of a GPU that is closest to the preferred GPU of Ji. If there are multiple such eligible donees, then we select the one with the earliest predicted token critical section completion time.

Rational: Rule R1c dictates that a donor is immediately moved to an FQ when its donee completes its

critical section. This heuristic is meant to increase the likelihood that the request Ri is moved from the

DQ to an FQ of job Ji’s preferred GPU, or failing this, one nearby.

Additional Remarks: This heuristic is described in terms of priority donation as deﬁned by the R2DGLP.

However, we may modify this heuristic to apply to conﬁgurations where the GPU Allocator is accessed

by jobs on different CPU clusters, i.e., when Rule C1 of the CK-OMLP is used, subsuming Rule R1c.

In such a conﬁguration, this heuristic may be employed when a released job Ji has a preferred GPU

and Ji must become a priority donor, as deﬁned by Rule C1. We note, however, that eligible donees may include jobs that hold other inter-CPU-cluster (non-GPU-token) shared resources. This heuristic

8We compute the index of a GPU in a cluster from the index of an FQ by dividing the FQ-index by the number of GPUs managed by the GPU Allocator and rounding down to the nearest integer.

121 may be updated to avoid donating a priority to such jobs whenever possible. We also note that there is

a delay between when Ji begins execution (after it is no longer a priority donor) and when it issues its

GPU token request Ri—the FQ of Ji’s donee may have been ﬁlled by other requests by the time Ri is issued. Ultimately, Heuristic H3 is very difﬁcult to implement under Rule C1. Practically speaking, it

may be best to do without it in this case, unless there is a clear demonstrated need.

H4: Afﬁnity-Aware Request Stealing

Conditions: This heuristic is employed when: (i) an FQx is empty after the token for FQx has been

released; (ii) the PQ and DQ are empty; and (iii) there exists an FQy with length(FQy) > 1.

Description: Let w j denote the predicted time from completion if request R j if it remains in its current FQ. This prediction is computed by summing the predicted token critical section lengths

of requests enqueued ahead of R j and R j itself. Let the function fq(R j) return the index of the FQ

in which R j is enqueued. Let W denote the set of unsatisﬁed requests of all FQs, i.e., all requests

waiting to become a resource holder. Let Wˆ denote the set of requests that are predicted to beneﬁt by

obtaining the token of FQx immediately, relative to waiting for a token in their respective FQs. Thus, prd prd prd ˆ \{R j | w j > L }. An arbitrary request Rz from the set max ˆ {L − L } is removed W , W j,x R j∈W j,fq(R j) j,x

from its associated FQ, say FQy, and moved to FQx. If the resource holder of FQy inherited the priority

of Rz, then this inheritance relationship is ended, and the resource holder may inherit a new priority ˆ from another request in FQy. It is possible for W = /0,in which case no request is moved to FQx.

Rational: This heuristic is meant to immediately grant the available token of FQx to an unsatisﬁed request in another FQ. The heuristic selects the request that is expected to beneﬁt the most. In cases

where migration costs are high, it may be better to simply let FQx idle, which this heuristic allows.

We brieﬂy summarize the above heuristics. Heuristic H1 assigns the initial GPU to tasks in order to distribute GPU afﬁnity among the GPUs. Heuristic H2 enqueues request Ri in the FQ that the Cost Predictor predicts will lead to the earliest completion time of Ri. Heuristic H3 guides priority donation towards jobs that have (or are slated to obtain) a GPU that minimizes migration costs for request Ri, increasing the likelihood that Ri will obtain the same GPU. Finally, Heuristic H4 grants a newly idle token to a request Ri that is waiting for another token, but only if doing so is expected to expedite the completion of Ri.

122 3.2.3.3 Cost Predictor

The Cost Predictor provides predictions of the token critical section lengths of GPU-using jobs. These predictions include the estimated cost of GPU migration. The Cost Predictor uses a record of observed token critical section lengths to make its predictions.

The Cost Predictor makes estimates of token critical section lengths by tracking the accumulate CPU and

GPU execution time within the token critical section. Upon token assignment, the GPU Allocator informs the Cost Predictor when it has allocated a GPU token to a job Ji, along with the identity of the assigned GPU. In response, the Cost Predictor resets an execution-time counter, records the identity of the newly assigned GPU, and notes the distance between the last GPU used by task Ti and the newly assigned GPU. The distance information enables the Cost Predictor to make different predictions for GPU migrations of different distances.

The execution-time counter tracks the combined CPU and GPU execution time of a job while a token is held. Execution delays due to preemption and blocking due to engine lock acquisition (explained later) are not included, but CPU suspension durations due to GPU operations (memory copies, GPU kernels) are. This tracking requires tight integration with the CPU scheduler.

When a job releases its GPU token, the accumulated execution time gives a total request execution cost for both CPU and GPU operations and includes any delays due to migrations. However, this measurement is for only a single observation and provides a poor basis for predicting future behavior, especially since these measurements are strongly affected by PCIe bus congestion. Thus, we use a more refined process to drive a prediction model based upon statistical process control (Kim and Noble, 2001). Specifically, for each task and GPU migration distance pair, we maintain an average and standard deviation over a window of recent observations (our experimentation suggests that a window of twenty samples is suitable). The average and standard deviation is recomputed after every new observation unless the observation falls more than two standard deviations away from the average and at least ten prior observations have been made. This filtering prevents unusual observations from overly-influencing future predictions.

One may be concerned that the heuristic nature of GPUSync’s Cost Predictor does not lend itself to real-time predictability. However, the Cost Predictor derives its predictions from an average of observed token critical section lengths. Assuming tasks never exceed provisioned execution times, the Cost Predictor can never make a prediction that exceeds the provisioned execution time. Under-estimates are possible. The

123 effects of under-estimates are already accounted for in pi-blocking bounds derived from worst-case blocking analysis.

3.2.3.4 Engine Locks

Engine locks enable the parallelism offered by GPUs to be exploited while obviating the need for the

(unpredictable) GPU driver or GPU hardware to make resource arbitration decisions.

A mutex is associated with each GPU copy and execution engine, as depicted in Figure 3.3. For GPUs with two copy engines, one engine lock must be obtained before copying data to the GPU, and the other must be obtained before copying data from the GPU. We require that all issued GPU operations are completed before the associated engine lock is released. This is necessary in order to: (i) prevent the GPU driver or hardware from interfering with GPUSync’s scheduling decisions; and (ii) support proper scheduling of interrupt daemons and GPGPU callback threads (we discuss this shortly in Section 3.2.5). For GPUs that do not ensure engine independence (Section 2.4.4), we also require that application code that issues several GPU operations to the same engine within the same engine critical section wait for each operation to complete before the next is issued. This is necessary to prevent the GPU hardware scheduler from stalling the other engines.

Engine locks should be held for as little time as possible in order to prevent excessive blocking times that degrade overall schedulability. For this reason, tasks are discouraged from issuing multiple GPU operations within the same engine critical section, although this is not strictly prevented by GPUSync. Minimizing the hold time of execution engine locks requires application-speciﬁc solutions to break kernels into small operations. However, a generic approach is possible for copy engine locks. Chunking is a technique where large memory copies are broken up into smaller copies. The effectiveness of this technique is demonstrated by Kato et al. (2011a). GPUSync supports chunking of both regular and peer-to-peer memory copies by way of a user-space library. Chunk size is fully conﬁgurable. The library also transparently handles copy engine locking.

GPUSync can be conﬁgured to satisfy engine lock requests in either FIFO- or priority-order (all GPUs within the same GPU cluster must use the same engine lock order). Blocked jobs suspend while waiting for an engine. A job that holds an engine lock may inherit the effective priority of any job it blocks. We stress effective priority, because the blocked job may itself inherit a priority from a job waiting for a token. In

124 order to reduce worst-case blocking, a job is allowed to hold at most one engine lock at a time, except during peer-to-peer migrations.

Bus Scheduling Concerns. The copy engine locks indirectly impose a schedule on the PCIe bus if we assume that bus traffic from non-GPU devices is negligible. This is because these locks grant permission to send/receive data to/from a particular GPU to one task at a time. It is true that traffic of other GPUs will cause bus congestion and slow down a single memory transmission. However, since the PCIe bus is packetized, it is shared fluidly among copy engine lock holders. Thus, we can bound the effect of maximum PCIe bus congestion and incorporate it into real-time analysis. Deeper analysis or explicit bus scheduling techniques may be necessary if very strict timing guarantees are required, but such approaches are beyond the scope of this dissertation.

Migrations. GPUSync supports both peer-to-peer and system memory migrations, which are handled differently.

For a peer-to-peer migration from one GPU to another, a job must hold copy engine locks for both GPUs.

Requests for both copy engine locks are issued together atomically to avoid deadlock.9 This is accomplished through the use of dynamic group locks (DGLs) (Section 2.1.6.2). The job may issue memory copies to carry out the migration once both engine locks are held. Peer-to-peer migration isolates trafﬁc to the PCIe bus: copied data does not traverse the high-speed processor interconnect or system memory buses—computations utilizing these interconnects are not disturbed. However, gains from fast peer-to-peer migrations may be offset by higher lock contention: unlikely scenarios exist where every token holder in a GPU cluster may request the same copy engine lock simultaneously. However, unlikely scenarios must still be accounted for in schedulability analysis.

If GPUSync is conﬁgured to perform migrations through system memory, then such migrations are performed conservatively, i.e., they are always assumed to be necessary. Thus, state data is aggregated with input and output data. State is always copied off of a GPU after per-job GPU computations have completed.

State is then copied back to the next GPU used by the corresponding task for its subsequent job if a different

GPU is allocated. An advantage of this approach over peer-to-peer migration is that a job never needs to hold two copy engine locks at once. This conservative approach may seem heavy handed, especially when migrations between GPUs may not always be necessary. An optimistic approach could be taken where state

9GPUSync also grants requests atomically to avoid deadlock if priority-ordered engine locks are used.

125 is only pulled off of a GPU once a migration is detected. However, this results in the same degree of copy engine lock contention as peer-to-peer migrations, but without the beneﬁts of isolated bus trafﬁc.

Critical Section Fusion. Every GPU DMA memory copy must be protected by a copy engine critical section.

For chunked memory copies, a call to unlock a held copy engine lock may be immediately followed by a call to reacquire the very same lock. There are cases where this results in needless system call overheads, or worse, harm schedulability. For FIFO-ordered engine locks, such a case occurs when the engine lock in question is not contended. For priority-ordered engine locks, this occurs when said engine lock has been requested by only tasks with lower priorities than the engine lock holder.

In order to avoid needless overheads and improve schedulability, GPUSync provides a system call interface that allows an engine-lock-holding callee to ask the question “Should I relinquish this lock?” (or in the case of DGLs, “Should I relinquish these locks?”). Each engine lock instance answers this question in accordance with its locking protocol (e.g., a FIFO-ordered engine lock always answers “yes” if the lock is contended). If the answer is “yes,” then the callee should issue the necessary call to release its held engine lock(s).10 Otherwise, the callee may proceed directly into its next critical section without releasing and reacquiring its needed engine lock(s). This essentially fuses back-to-back engine lock critical sections of a job that are protected by the same locks. Real-time correctness is maintained by the “preemption points” inserted between each original critical section.

Critical section fusion for FIFO-ordered engine locks is primarily a runtime optimization. However, fusion is important for priority-ordered engine locks, since it can strongly impact schedulability. Consider the case where two tasks with different priorities contend for the same copy engine lock and both need to perform a large chunked memory copy. Without fusion, the high and low priority chunks are interleaved by the copy engine lock. This is due to that the low priority task may obtain the contended copy engine lock after each time the high priority task exits a copy engine critical section. The locking protocol governing the engine lock cannot know that the high priority task intends to immediately re-request the lock, since this behavior is entirely application-deﬁned. As a result, each copy engine lock request issued by the high priority task may be delayed by one low priority copy engine critical section. The high priority task may suffer a priority inversion each time it requests the copy engine. However, with fusion, the high priority task may

10We say “should” because the GPUSync cannot strictly enforce GPU scheduling decisions, we as discussed at the end of Section 3.1.2.2.

126 suffer a priority inversion for only its ﬁrst copy engine request, since it retains ownership of the engine lock until its memory copies have completed. This can be reﬂected in schedulability analysis.

3.2.4 Budget Enforcement

Budget enforcement policies are necessary to ensure that a task’s resource utilization remains within its provisioned budget. These policies are particularly important in a real-time GPU system that uses closed- source GPU drivers and GPGPU runtimes. We ﬁrst explain why execution time variance is usually due to closed-source software, rather than GPU kernels or GPU hardware. We then describe the budget enforcement policies provided by GPUSync.

GPU kernels usually exhibit consistent runtime behaviors. There are two reasons for this. First, high- performance GPU algorithms usually depend upon very regular, non-divergent, execution patterns (recall our discussion of the GPU EE in Section 2.4.2.1). For example, image processing algorithms usually fall within this category, since computation commonly consists of applying the same operation to image pixels.

The second reason for GPU kernel execution time predictability stems from the relative simplicity of the EE.

Deep execution pipelines, branch predictors, memory prefetchers, and multiprocessor cache interference all contribute towards variance in execution time on modern multiprocessor CPUs. However, in order to maximize the number of transistors devoted to computation, the SMs of the EE eschew such average-case- oriented features. For example, all CUDA-capable NVIDIA GPUs to date lack L1 cache coherency among

SMs.

GPU DMA operations also exhibit relatively consistent runtime behaviors. The execution time of

DMA operations are affected bus contention. However, as we show in Section ??, it is possible to perform experiments to characterize worst-case behavior. Such experimental results may be incorporated into schedulability analysis and budget provisioning. We cannot apply the same approach to program code, in general.

In most cases, we can rule out GPU kernels or GPU hardware as the primary cause of overruns of provisioned budgets. Ignoring user application software, this leaves the closed-source software as the most likely culprit. This is understandable since this software has not been designed with real-time constraints in mind—its implementation optimizes for average-case performance, not worst-case performance. As a result, the execution time of operations that typically exhibit predictable runtime behavior may occasionally take much longer to execute than usual. For example, we show that this is the case for the interrupt-handling

127 subsystem of a GPU driver in Section ??. We have already discussed the strong practical motivations for using closed-source GPU software in a real-time system. However, we must recognize that the closed-source software may sometimes act unpredictably. We must prepare for it.

Understanding the likely causes of budget overruns in a real-time GPU system can make us better researchers and engineers, but it is not clear how this information can guide the design of budget enforcement policies in GPUSync. Ultimately, it matters little if a budget overrun is due to user code, closed-source software, GPU kernels, or GPU hardware. The consequence is the same: a budget is overrun. Thus, GPUSync uses general budget enforcement policies that work regardless of whether an overrun is due to CPU or GPU work.

Unfortunately, strict budget enforcement is difﬁcult, if not impossible, to achieve due to limitations of the GPU technology. Non-preemptivity of GPU operations makes budget enforcement problematic. Even if limited preemption is provided by breaking a single GPU operation into multiple smaller operations, data on a GPU may be in a transient state at a preemption point and thus be too difﬁcult (or too costly) to resume on another GPU at a later time. This motivates us to focus on budget enforcement based on overrun recovery rather than strict enforcement to absolutely prevents overruns.

GPUSync provides three budget enforcement options: signaled overruns, early budget releasing, and a bandwidth inheritance-based method.

Signaled Overruns. Under the signaled overrun policy, jobs are provisioned with a single budget equal to a maximum CPU execution time, plus a maximum total GPU operation time. A job’s budget is drained when it is scheduled on a CPU or GPU engine. The OS delivers a signal to the job if it exhausts its budget. In order always maintain a consistent state with the GPGPU runtime, this signal is deferred if the job holds an engine lock, and it is delivered immediately once the lock is released. The signal triggers the job to execute an application-deﬁned signal handler. The handler is application-deﬁned since appropriate responses may vary at different points of execution. As depicted in Figure 3.6, one way an application may respond to a signal is to unwind the job’s stack (either by throwing an exception11 or a call to the standard function longjmp()) and execute clean-up code before releasing its token lock.

11Throwing exceptions from signal handlers may require platform support and special compiler options. For example, one must specify the compiler options -fasynchronous-unwind-tables and -fnon-call-exceptions to the GCC compiler.

128 . Enabled/disabled on try-block enter/exit. function BUDGETSIGNALHANDLER() throw BudgetException(); end function

procedure DOJOB() t ← GetToken(); gpu ← MapTokenToGpu(t); try: DoGpuWork(gpu); . Main job computation. catch BudgetException: CleanUpGpu(gpu); . Gracefully cleans up state. ﬁnally: FreeToken(t); end procedure Figure 3.6: Example of budget signal handling.

Early Releasing. The “early releasing” policy immediately refreshes the exhausted budget of a job with the budget of the task’s next job. That is, the budget of the next job is released early. This is accomplished by shifting (postponing) the job’s current deadline to that of its next job. In essence, the next job has been sacriﬁced in order to complete the overrunning one. This policy penalizes an overrunning task by forcing it to consume its own future budget allocations. This prevents the system from being overutilized in the long-term.

Under deadline schedulers, deadline postponement also helps to prevent the system from being overutilized in the short-term. We note that deadline postponement is challenging to implement since it requires priority inheritance relations established by locking protocols to be reevaluated.

Bandwidth Inheritance (BWI). A job that overruns its budget while holding a shared resource can neg- atively affect other jobs, even non-resource-using ones. We use BWI to limit the effects of such overruns to resource-sharing tasks. This is accomplished by draining the budget of a blocked job whose priority is currently inherited by a scheduled job in place of the scheduled job’s own budget.12 This can cause a blocked job to be penalized for the overrun of another, but improves temporal isolation for non-GPU-using tasks.

Under deadline-based scheduling, GPUSync takes additional measures to isolate the temporal effects of overruns—speciﬁcally, the “abort/refresh/reissue” BWI technique we discussed in Section 2.1.6.1. If the budget of a job waiting for a GPU token is exhausted, then: (i) the token request is immediately aborted;

(ii) the budget of the exhausted job is refreshed through early releasing, decreasing the job’s priority; and

12Jobs must be provisioned with additional budget derived from analytical bounds on pi-blocking to cover budget that may be lost via BWI under assumed conditions.

129 (iii) the token request is automatically reissued. Care must be taken in the implementation of this policy, as decreasing the priority of the exhausted job also requires priority inheritance relations established by locking protocols to be reevaluated.

3.2.5 Integration

Resource allocation techniques and budget enforcement policies make up the general elements of

GPUSync that may be applied to a variety of GPU technologies (i.e., GPUs from different manufacturers and

GPGPU runtimes) and RTOSs (when RTOS code is available). We now discuss elements of GPUSync that address issues that arise due to reliance on specific non-real-time software. We discussed the deficiencies of split interrupt handling in the Linux kernel in Section 2.2.3.1 and the challenges posed by the CUDA runtime’s callback threads in Sections 2.4.1 and 2.4.5. In this section, we discuss how GPUSync resolves these issues to improve real-time predictability. We present first the method we use to realize proper real-time scheduling of GPU interrupts. We then apply similar techniques to schedule CUDA callback threads.

3.2.5.1 GPU Interrupt Handling

We must deﬁne scheduling policies for the execution of both top- and bottom-halves of GPU interrupts.

We discuss these in turn.

GPU Top-Halves. The disruptions imposed upon a real-time system by interrupt top-halves are inescapable.

However, we can isolate their effects in order to improve real-time predicability. We do so through CPU shielding. CPU shielding allows the system designer to direct interrupts to a particular CPU, or group of

CPUs. To support this, the OS associates a CPU bitmask with every interrupt source, or more generally, each interrupt identiﬁer (i.e., a unique interrupt ID). These bitmasks are used by the OS to program the underlying interrupt-handling hardware (e.g., the Advanced Programmable Interrupt Controller (APIC)) that is responsible for delivering interrupts to each CPU.

Under GPUSync, we assume that the system designer applies the appropriate CPU bitmasks to direct interrupts of a given GPU to the CPUs that may be scheduled with tasks that use that GPU. For example, suppose GPUi is used by tasks that are scheduled on CPU j and CPUk. The system designer would conﬁgure the CPU bitmasks to ensure that the interrupts of GPUi may only be handed by CPU j and CPUk. CPU j and

CPUk do not necessarily reside in the same CPU cluster. For instance, GPUi could be shared by CPU j and

130 CPUk under partitioned CPU scheduling. In this case, CPU j could be forced to handle GPU interrupts on behalf of tasks that execute exclusively on CPUk—this must be accounted for in schedulability analysis. Under older interrupt handling mechanisms, a GPU may be forced to share the same interrupt identifier with other GPUs and even other devices. This stems from limitations in legacy hardware where each interrupt identifier maps to a physical interrupt pin or wire. This leads to scenarios where a GPU may be forced to share an interrupt identifier with unrelated devices, such as network interface cards and disk controllers. This sharing may make it difficult to isolate GPU interrupts on the proper CPUs. This has negative effects on both runtime predictability and schedulability analysis. Historically, the NVIDIA driver has configured GPUs to share interrupt identifiers. However, since late 2008, the GPL layer of the NVIDIA driver includes a compile-time option to enable modern interrupt handling mechanisms (NVIDIA, 2014a), specifically, Message Signaled Interrupts (MSI) (PCI-SIG, 2010).13 Under MSI, interrupts are delivered

“in-band” through the PCIe data pathways. Each GPU is assigned a unique interrupt identiﬁer under MSI, allowing us to direct GPU interrupts to the appropriate CPUs. We note that MSI is enabled by default in the

GPL layer of the NVIDIA driver, starting in late 2013 (NVIDIA, 2014a).

GPU Bottom-Halves. As we discuss in greater depth in Section 3.3, GPUSync is implemented within the Linux-based LITMUSRT kernel. GPUSync introduces a new class of LITMUSRT-aware daemons called klmirqd. This name is an abbreviation for “Litmus softirq daemon” and is preﬁxed with a “k” to indicate that the daemon executes in kernel space. Klmirqd daemons may function under any LITMUSRT-supported JLFP scheduling algorithm.

We associate one dedicated klmirqd daemon with each GPU. Each daemon processes the Linux tasklets

(i.e., bottom-halves) issued by GPU ISRs (i.e., top-halves). These daemons may execute within the CPU cluster(s) of the tasks that use the associated GPU. Each inherits the maximum effective priority of any suspended job that is blocked waiting for a GPU operation of the associated GPU to complete. In effect, each tasklet is scheduled under this inherited priority. If there are no suspended jobs waiting for the daemon’s assigned GPU to complete an operation, then the daemon is scheduled with a base priority statically below that of any real-time task. This allows the daemon to take advantage of CPUs left idle.

In order to properly schedule bottom-halves, we must identify: (i) the GPU associated with each bottom half; (ii) the klmirqd daemon associated with each GPU; and (iii) the set of suspended jobs waiting for a

13MSI is enabled in the NVIDIA driver by asserting the __NV_ENABLE_MSI compile-time option.

131 tasklet_schedule() C Is callback to the GPU driver? callback Driver Top-Half tasklet 0101101 GPU ID 1001110 D klmirqd Registry E

klmirqd_tasklet_schedule()

A GPU Registry GPU Allocator klmirqd daemon

B CPU Scheduler

Figure 3.7: Architecture of GPU tasklet scheduling infrastructure using klmirqd.

GPU operation to complete on each GPU. This is non-trivial with closed-source GPU drivers. Nonetheless, we achieve this by coordinating the GPU Allocator, CPU scheduler, and klmirqd daemons in a multi-step process. This approach is summarized in Figure 3.7, which we now describe in detail.

Step A: Update GPU Registry

When the GPU Allocator assigns a token to a task, a record of the assignment is stored in a GPU

Registry. The GPU Registry maps a task identiﬁer (i.e., a Linux task_struct pointer) to a GPU

identiﬁer (an enumerated value). The appropriate record is removed from the GPU Registry when a

GPU token is freed.

Step B: Enable Inheritance

Upon job suspension, the CPU scheduler checks if the suspending job holds an engine lock. If so, the

scheduler retrieves the GPU identiﬁer of the task’s allocated GPU from the GPU Registry. Using this

identiﬁer, the scheduler then uses the Klmirqd Registry to look up the task identiﬁer of the klmirqd

daemon assigned to that GPU. (Entries are inserted into the Klmirqd Registry at scheduler initialization-

time as klmirqd daemons are created.) The CPU scheduler allows the klmirqd daemon to inherit the

effective priority of the suspending task—this includes any priority that the suspending task itself may

inherit from a blocked task waiting for a GPU token in the GPU Allocator or waiting for engine locks.

132 The klmirqd daemon may inherit the suspended job’s effective priority until the suspended task is ready

to run.

Step C: Intercept GPU Tasklet

The closed-source GPU driver must interface with the open-source Linux-based kernel. We exploit this

fact to intercept tasklets dispatched by the driver. This is done by modifying the standard internal Linux

tasklet_schedule() function. This function is used to issue a tasklet to the kernel for execution.

When tasklet_schedule() is called, the callback entry point of the deferred work is speciﬁed

by a function pointer. We identify a tasklet as belonging to the closed-source GPU driver if this

function pointer points to a memory region allocated to the driver. It is possible to make this determi-

nation, since the driver is loaded as a module (or kernel plugin). We inspect every callback function

pointer of every dispatched tasklet, online, using Linux’s module-related routines.14 Thus, we alter

tasklet_schedule() to intercept tasklets from the GPU driver and override their scheduling. It

should be possible to use this technique to schedule tasklets of any driver in Linux that is loaded as a

module, not just GPU drivers.

Step D: Extract GPU Identiﬁer

Merely intercepting GPU tasklets is not enough if a system has multiple GPUs; we must also identify

which GPU raised the initial interrupt in order to determine which klmirqd daemon should handle

the tasklet. It may be possible to perform this identiﬁcation process at the lowest levels of interrupt

handling (i.e., when the OS looks up the appropriate ISR). However, this information must be passed

into the deeper interrupt handling layers, potentially requiring invasive changes to the OS’s internal

APIs and the users of those APIs. Instead, we opt for a simpler solution closer to tasklet scheduling.

The GPU driver attaches a memory address to each tasklet, providing input parameters for the tasklet

callback. This address points to a data block that contains a device identiﬁer indicating which GPU

raised the interrupt. However, locating this identiﬁer within the data block is challenging since it is

packaged in a driver-speciﬁc format.

Ofﬂine, we inspect the code of the driver’s GPL layer to reverse engineer the memory address offset of

the GPU identiﬁer from the address that is attached to the tasklet. From code analysis, we ﬁnd that

14This may sound like a costly operation, but it is actually quite a low-overhead process.

133 nv_linux_state_t 0xX typedef struct { offset = 0xY - 0xX ... unsigned int device_num; device_num 0xY ... } nv_linux_state_t;

memory layout

Figure 3.8: Memory layout of nv_linux_state_t.

the attached address is a pointer to a C-struct with the type “nv_linux_state_t.” Embedded within

this struct is a 32-bit unsigned integer variable named “device_num.”15 Through experimentation, we

learn that this variable indicates the GPU that raised the interrupt. By modeling the memory layout of

nv_linux_state_t, we can determine the offset needed to locate device_num within the tasklet’s

data block. This is depicted in Figure 3.8.

From experience, we ﬁnd that the memory layout of nv_linux_state_t may change with each new

version of the NVIDIA driver. Moreover, several compile-time options of the GPL layer can change the

layout of nv_linux_state_t, which may change the offset of device_num. Thus, a system designer

must repeat this reverse engineering process with each new version of the NVIDIA driver to determine

the proper offset. However, it may be feasible to automate this process by modifying the compilation

scripts of the NVIDIA driver.

Step E: Dispatch Tasklet

After completing Steps C and D, tasklet_schedule() passes intercepted GPU tasklets to the klmirqd

daemon by calling klmirqd_tasklet_schedule(). This function takes the tasklet and klmirqd task

identiﬁer as arguments. The function inserts the tasklet into a queue of pending tasklets for the klmirqd

daemon. The daemon is awoken if it is suspended waiting for work. Steps A and B ensure that the

daemon is scheduled with the proper priority.

GPU “Bottom-Bottom-Halves.” In Section 2.2.3.1, we discussed how tasklets may never block on I/O or suspend from the CPU, since tasklets may be executed within interrupt context. In order to carry out such operations, a tasklet itself may defer additional “work items” to Linux kworker daemons. In a sense, these

15In recent versions of the GPL driver, this ﬁeld has been renamed “minor_num.”

134 work items are “bottom-bottom-halves.” As with tasklets, we must also ensure that work items are properly scheduled.

We apply the same methodology we use to intercept GPU tasklets to intercept GPU work items. This is done by modifying the standard internal Linux schedule_work() function, which is normally used to pass work to the kworker daemons. Each work item contains callback and attached memory addresses, similarly to a tasklet—we identify GPU work items and associated GPU identiﬁers in the same way. Intercepted

GPU work items may be passed to a secondary klmirqd daemon dedicated to work-item processing, or it may be passed to the klmirqd daemon that also processes tasklets. In the former case, we apply the same mechanisms we use in Steps A and B above to ensure the daemon is scheduled with the proper priority. In the latter case, the klmirqd daemon executes a single pending work item, if one exists, after executing each tasklet. This essentially fuses the tasklet and work item into a single unit of work. GPUSync supports both conﬁgurations. While the use of separate tasklet and work-time klmirqd daemons may improve performance through parallelism, it breaks the sporadic task model since the two daemons may execute concurrently under the same inherited priority.

3.2.5.2 CUDA Runtime Callback Threads

As we discussed in Sections 2.4.1 and 2.4.5, the CUDA runtime employs callback threads, one per

GPU, to signal (wake) user threads that have suspended from execution while waiting for GPU operations to complete. These threads must be properly scheduled to avoid unbounded priority inversions. However, applying a real-time scheduling policy to these threads is challenging, since the threads are created and managed by closed-source software. Nonetheless, in our approach, we make no assumption of when the

CUDA runtime creates callback threads. We also assume that we do not know which callback threads are associated with which GPUs. We now describe how GPUSync schedules the callback threads of the CUDA runtime.

A GPU-using task makes a system call, set_helper_tasks(), to GPUSync during the task’s initialization phase. The system call takes two flag parameters: CURRENT and FUTURE. If the flag CURRENT is set, then the OS to examines each thread within the process of the calling task. Any encountered non-real-time thread is assumed to be a CUDA runtime callback thread. We apply a specialized LITMUSRT scheduling policy to each of these threads. If the flag FUTURE is set, then we automatically apply the same specialized scheduling policy to any threads created (forked) in the future.

135 The specialized scheduling policy we apply to callback threads is very similar to the one we use for klmirqd daemons. When job Ji suspends waiting for a GPU operation to complete, all callback threads of task Ti inherit the effective priority of Ji. This is policy is enacted by the CPU scheduler when a task suspends while holding an engine lock—this is the same mechanism we used to enable priority inheritance for klmirqd.

Normally, the simultaneous inheritance of Ji’s effective priority by multiple threads would break the sporadic task model, which assumes tasks are single threaded. However, it is safe in this particular instance because a job may only use one GPU at a time under GPUSync—only one callback thread will ever need to execute at a given moment. No two callback threads inheriting the same priority will ever execute simultaneously, so the sporadic task model is not violated.

3.3 Implementation

In this section, we discuss the implementation of several key components in GPUSync. We have already discussed several implementation-related issues that relate to the integration of GPUSync with LITMUSRT and the broader Linux kernel. We provide additional implementation details herein. We begin with general information on the implementation of GPUSync. We then discuss the challenges of implementing priority inheritance for the nested locking structure of GPUSync, while accommodating dynamic behaviors of

GPUSync. This is followed by a discussion of the plugin infrastructure supported by the GPU Allocator to implement afﬁnity-aware heuristics, such as the ones we discussed in Section 3.2.3.2. We conclude this section with a description of the GPUSync user interface used by real-time applications to communicate GPU resource requests to GPUSync.

3.3.1 General Information

We implemented GPUSync as an extension to LITMUSRT, version 2014.1, which is based upon the 3.10.5

Linux kernel.16 GPUSync adds approximately 20,000 lines of code to LITMUSRT. Contributions to this total by category are approximately: GPU Allocator and locking protocols, 35%; scheduler enhancements, budgeting, and nested inheritance, 35%; GPU interrupt and callback thread management, 20%; miscellaneous infrastructural changes, 10%. For comparison, the LITMUSRT patch to the Linux 3.10.5 kernel is roughly

15,000 lines of code. The source code for GPUSync is available at

16We distribute GPUSync as open source under the GNU General Public License, version 2. The code is currently available at www.github.com/GElliott.

136 LITMUSRT provides a plugin-based real-time scheduling framework within the Linux kernel, where particular real-time scheduling algorithms are implemented as plugins. With the exception of modiﬁcations to the tasklet and work-item processing of Linux (see Section 3.2.5), GPUSync is implemented almost entirely within LITMUSRT—GPUSync rarely interfaces directly with Linux kernel components.

We limit the integration of GPUSync with LITMUSRT to LITMUSRT’s C-EDF plugin. This plugin can be conﬁgured to cluster CPUs along different boundaries of the hardware memory hierarchy. For example, if we cluster around CPU private caches (e.g., the L1), then our C-EDF scheduler is equivalent to P-EDF.

Likewise, if we cluster around main memory, our C-EDF scheduler is equivalent to G-EDF. The LITMUSRT

C-EDF plugin provides us with the necessary flexibility to test a variety of CPU cluster configurations with a single code base. We recognize that this flexibility may result in slightly greater system overheads. For example, the LITMUSRT P-EDF plugin is more streamlined than its C-EDF counterpart (even when similarly configured), since the uniprocessor nature of P-EDF admits assumptions that reduce code complexity. Finally, we note that the C-EDF plugin requires only minor modifications to support fixed-priority scheduling, as the programmer need only provide a new prioritization function for comparing the priority of two tasks. We extended the LITMUSRT scheduler plugin API to expose the prioritization function used by a scheduler to the GPU Allocator and engine locks, so changes in the prioritization function are transparent to GPUSync components.

3.3.2 Scheduling Policies

Threads in Linux are generally scheduled under one of two primary policies: SCHED_OTHER or

SCHED_FIFO.17 The SCHED_OTHER policy is used to schedule general purpose, non-real-time, applications. SCHED_FIFO is used to schedule ﬁxed-priority real-time tasks in accordance to the POSIX standard.

Linux prioritizes SCHED_FIFO threads above SCHED_OTHER threads, as illustrated in Figure 3.9(a).

LITMUSRT introduces a third scheduling policy: SCHED_LITMUS. All threads scheduled by the SCHED_

LITMUS policy have a greater priority than SCHED_FIFO threads. This is depicted in Figure 3.9(b).

The GPUSync implementation splits the SCHED_LITMUS scheduling policy into three sub-policies.

We use sub-policies within LITMUSRT as it allows us to quickly transition threads among them. In decreasing static priority, these sub-policies are: normal, callback, and daemon. Conventional real-time tasks are

17Linux provides additional scheduling policies, such as SCHED_BATCH and SCHED_RR. These are similar to SCHED_OTHER and SCHED_FIFO, respectively. We do not discuss them in order to simplify the presentation.

137 high priority normal

SCHED_LITMUS callback

SCHED_LITMUS daemon

SCHED_FIFO SCHED_FIFO SCHED_FIFO

SCHED_OTHER SCHED_OTHER SCHED_OTHER low priority

(a) Linux policies (simpliﬁed). (b) LITMUSRT policies. (c) LITMUSRT policies with GPUSync.

Figure 3.9: Relative static priorities among scheduling policies. scheduled under the normal sub-policy. CUDA runtime callback threads are scheduled under the callback sub-policy. Klmirqd threads are scheduled under the daemon sub-policy. We prioritize the callback sub-policy over the daemon sub-policy in an effort to expedite completion of each GPU operation. CUDA callback and klmirqd threads transition to the normal sub-policy when they inherit a priority from a normal real-time task.

Figure 3.9 depicts the relative static priorities of each scheduling policy.

Threads that share the same policy compete for CPU time in accordance to the policy’s CPU scheduler.

For example, we may conﬁgure LITMUSRT to schedule SCHED_LITMUS-policy tasks with C-EDF. Jobs in LITMUSRT may have equal priorities. For instance, the deadlines of two jobs may coincide under

EDF scheduling. LITMUSRT implements several conﬁgurable mechanisms to break such ties in priority.

For GPUSync, we conﬁgure LITMUSRT to use tie-break heuristics that fairly distribute deadline lateness, proportional to each task’s relative deadline, among tasks. This heuristic can have a signiﬁcant effect on the response time of jobs on a temporarily overloaded system. For callback and klmirqd threads that do not actively inherit a priority from a normal LITMUSRT task, we break ties in a manner that fairly distributes CPU time among them.

3.3.3 Priority Propagation

The two-level nested locking structure of GPUSync requires the propagation of inherited or donated priorities along complex chains of dependent tasks.

Figure 3.10 illustrates a possible chain of task dependencies in GPUSync. In this example, suppose all tasks are scheduled within the same CPU and GPU clusters. Here, task T1 is in the DQ of the GPU Allocator and donates its priority to task T2 with a request enqueued in FQ0 of the GPU Allocator. Task T3 holds the token of FQ0, so it may inherit the effective priority of T2. T3 has an unsatisﬁed request for the ﬁrst CE of

138 T1 T2 T3 donor in DQ of waiting in FQ 0 of GPU Allocator GPU Allocator blocked on CE 0,0 cb1 T6 T8 holds on CE 0,0 and blocked on CE 1,1 T T 4 5 T8 waiting in FQ of holds CE and 1 blocked on CE 0,0 1,1 GPU Allocator suspended for DMA T7 klmirqd 1 blocked on CE 1,1

T9 holds EE 1 and suspended for GPU kernel

Figure 3.10: Example of a complex chain of execution dependencies for tasks using GPUSync.

GPU0, or CE0,0. Another task T4 is enqueued on FQ1 in the GPU Allocator. Let us assume that the token of

FQ1 is associated with GPU0. Task T5 holds the token of FQ1, so it may inherit the effective priority of T4.

T5 also has an unsatisfied request for CE0,0. Task T6 issued a DGL request for both CE0,0 and CE1,1, prior to the CE requests of T3 and T5, in order to perform a peer-to-peer migration between GPU0 and GPU1. T6’s request for CE0,0 has been satisfied, but the task remains blocked waiting for CE1,1. Because T6 holds the engine lock of CE0,0, it may inherit either the effective priority of T3 or T5. Task T7 has an unsatisfied request for CE1,1. Task T8 holds the engine lock for CE1,1, so it may inherit either the effective priority of T6 or T7. T8 has issued its DMA operation to CE1,1. The DMA is incomplete, so T8 is suspended. Thus, all of the callback

cb1 threads of T8 may inherit the effective priority of T8, including the callback thread for GPU1, T8 . Task T9 holds the engine lock for EE1 and is waiting for a GPU kernel to complete, so T9 has also suspended. Finally, the klmirqd daemon for GPU1, klmirqd1, may inherit the effective priority of either T8 or T9, since both tasks

cb1 have incomplete operations on GPU1 and both tasks are suspended. What is the effective priority of T8 ?

It is the maximum priority among tasks T1 through T8. What is the effective priority of klmirqd1? It is the maximum priority among tasks T1 through T9. Implementing the necessary mechanism to propagate effective priorities through dependency chains such as the one in Figure 3.10 is challenging since information about each task is distributed among several data structures (e.g., the request queues of the various locks). We may imagine a recursive algorithm that propagates the donated priority of T1 down the dependency change of Figure 3.10 when T1 issues its token—indeed, such an algorithm is obviously necessary. Under conventional JLFP locking protocols, the implementation of such an algorithm is straight forward due to the following two invariants:

I1 The effective priority of a job may only increase monotonically, unless such a job releases a held lock.

139 I2 The base priorities of jobs are ﬁxed.

However, neither of these hold under certain conﬁgurations of GPUSync, so implementation is more complex.

Let us consider two cases where GPUSync breaks conventional Invariants I1 and I2.

In the ﬁrst case, the request stealing rule (Rule R5 of the R2DGLP) used by the GPU Allocator may break Invariant I1, since the GPU Allocator may move a pending request from one FQ to another. Suppose

cb1 task T4 in Figure 3.10 has the highest base priority among all tasks, so T8 and klmirqd1 inherit the effective priority of T4. If T4 is moved to another FQ, say, FQ2, then we must determine a new effective priorities for

cb1 cb1 T8 and klmirqd1. T8 may draw an effective priority from the tasks T1 through T8 (except T4). klmirqd1 may draw an effective priority from the same tasks, in addition to task T9. Whatever these new effective priorities may be, they are less than the base priority of T4 (which remains unchanged). Thus, the effective

cb1 priorities of T8 and klmirqd1 decrease, breaking Invariant I1. In the second case, GPUSync’s budget enforcement policies can cause the base priority of a job to change under deadline-based schedulers, violating Invariant I2. This is clear under the early-releasing policy.

However, let us discuss a more complicated scenario under the BWI policy. With out loss of generality, let us assume that the tasks in Figure 3.10 are synchronous (i.e., each task releases a job at the same time instant), periodic, and have implicit deadlines. Thus, the jth jobs of all tasks share the same absolute deadline, since they share a common release time and relative deadline. Jobs are prioritized by EDF and ties are broken by task index such that Ti has greater priority than Ti+1. Suppose job J8, j of task T8 overruns its provisioned budget. J8, j drains the budget of job J1, j ﬁrst since J1, j has the highest priority. (Job Ji, j has greater priority than job Ji+1, j due to tie-breaking.) Once exhausted, the priority of J1, j is decreased through early-releasing and now has the lowest priority among all jobs, since D1, j+1 > Di, j for i ∈ {2,3,··· ,9}. As a result, we must determine a new effective priority for all other jobs that inherited the priority of J1, j. The job J2, j now has the highest priority among all jobs, so J8, j begins consuming the budget of J2, j. J8, j continues to exhaust the budgets of jobs J2, j through J7, j, until J8, j begins to consume its own budget. The base priority of each job decreases as its budget is exhausted, breaking Invariant I2.

We describe the situations that break Invariants I1 and I2 not because they imply a fundamental departure from a recursive algorithm to propagate priorities through chains of dependent tasks, but rather to illustrate the fact that effective priorities of tasks under GPUSync may be in ﬂux. The core algorithm to propagate effective priorities among tasks is embodied by a three-step recursive process where we detect a potential

140 –Start– GPU Allocator steals a request from an FQ

–Start– –Start– Resource request blocked Budget enforcement actions

Detect change in effective priority (P1)

no Possible change in effective priority of a resource holder?

yes

Apply the change in effective priority of the resource holder (P2)

Is resource holder yes Propagate change (P3) itself blocked?

–End–

Figure 3.11: Recursive algorithm to propagate changes in effective priority. change in the effective priority of a task, apply the change if necessary, and propagate that change if the task is itself blocked for a resource. We summarize these steps with the ﬂowchart depicted in Figure 3.11.

We now describe the algorithm we use to propagate effective priorities in more detail. We label each set of operations that make up each step in our propagation algorithm.

P1: Detect

For every protected resource (token or engine lock), the implementation of the locking protocol tracks

the effective priorities of all pending requests from which a resource holder may inherit. This is done

by using a max-heap, ordered by effective priority. Consider the implementation of a FIFO-ordered

engine lock. In addition to a FIFO queue to order requests, a heap is populated with the unsatisﬁed

141 requests. We may efﬁciently determine the greatest effective priority among the unsatisﬁed requests by

examining the root note of the heap. Updates to the heap are logarithmic in runtime complexity.

P2: Apply

Each task may inherit an effective priority through any one of several locks it may hold concurrently.

Within the Linux kernel task_struct of each task, we maintain a max-heap of the effective priorities

that the task might inherit. The nodes in this heap reference the root nodes of the associated max-heaps

maintained by P1. A task inherits the effective priority of the root node of the task_struct-heap if the

task has a lower base priority. We update task_struct-heap whenever the root note of an associated

heap maintained by P1 changes. Concurrent updates to the task_struct-heap are serialized by a

per-task spinlock.

Additional measures are taken to propagate priorities to klmirqd daemons and callback threads. The

klmirqd daemon uses the task_struct-heap to determine its effective priority. However, instead of

referencing root nodes of the heaps maintained by P1, the nodes reference tasks suspended waiting for

GPU operations to complete. If the effective priority of a task changes while it is suspended holding an

engine lock, then we propagate the change to the klmirqd daemon assigned to the GPU for the token

held by the task and reevaluate the daemon’s effective priority. For callback threads, if the effective

priority of a task changes while it is suspended holding an engine lock, then the callback threads of the

task inherit the new effective priority.

P3: Propagate

Within each task_struct, we store a pointer to a lock for which a task is blocked.18 The value of this

pointer is null if the task is not blocked. If the effective priority of a task changes while it is blocked,

we update the max-heap maintained by P1. Changes are recursively propagated.

We take the following actions to support the request stealing rule used by the GPU Allocator. First, we remove the request to steal from FQx. We then update the max-heap used by the GPU Allocator to track the effective priorities of requests in FQx. We apply the potential change in the effective priority of the token holder of FQx by following P1 and any subsequent steps, as necessary.

18When DGLs are used, this pointer points to any single requested lock not held by the task.

142 We follow a multi-step process if the budget of a job is exhausted when GPUSync is conﬁgured to use

BWI and dynamic-priority JLFP scheduling (e.g., EDF):

1. If the job is blocked for a token, we coordinate with the GPU Allocator to abort the token request. This

may result in changes to a max-heap maintained by the GPU Allocator. We apply any potential change

in the effective priority of the token holder by following P1 and any subsequent steps, as necessary.

2. We decrease the base priority of the job through early-releasing.

3. If the job holds a resource lock, then we reevaluate the effective priority of the job by following P1 and

any subsequent steps, as necessary. This step is required since the job may need to inherit a priority a

due to its decreased base priority.

4. If we aborted a token request above, then we re-issue the request. This may result in changes to a

max-heap maintained by the GPU Allocator—changes in effective priorities are propagated.

We only perform steps 2 and 3 under the early-releasing policy under dynamic-priority JLFP scheduling.

We conclude this section with a remark on the challenges of implementing the above algorithms. These algorithms require coordination among software components that implement locking protocols, schedulers, and budget enforcement policies. Each component may use a variety of spinlocks to protect kernel data structures. The implementation must be extremely careful with respect to when, and in what order, these locks are acquired so that deadlock is avoided.

3.3.4 Heuristic Plugins for the GPU Allocator

We use a plugin-based software architecture for the GPU Allocator to implement the heuristics described in Section 3.2.3.2. This provides a clean separation between the core GPU Allocator locking protocol algorithm (R2DGLP or CK-OMLP), the heuristics, and other software components of GPUSync. The design also facilities experimentation with new heuristics. A heuristic plugin deﬁnes a collection of heuristics used to guide the actions of the GPU Allocator. We may use different plugins to implement a variety of strategies such as “optimize the average case” or “aggressive migration.” We implement the heuristics discussed in

Section 3.2.3.2 as a single plugin.

Table 3.1 lists the heuristic plugin API. The advisory functions are invoked by the GPU Allocator to request guidance from the plugin at key algorithmic decision points. For example, the GPU Allocator may

143 Advisory API advise_enqueue() Recommend the FQ on which to enqueue a given request. advise_steal() Recommend a request to steal from an FQ. advise_donee_selection() Recommend a donee for a priority donor. Notification API notify_enqueue() A request has been enqueued on a specified FQ. notify_dequeue() A request has been removed from a specified FQ. notify_acquired() A specified token has been acquired by a specified job. notify_freed() A specified token has been freed by a specified job. notify_exit() A given task has terminated.

Table 3.1: Heuristic plugin interface for the GPU Allocator. call advise_enqueue() to solicit advice from the plugin in deciding which FQ a given request should be enqueued—we use this interface to implement Heuristics H1 and H2. The GPU Allocator is free to reject a recommendation made by any advisory function. We do so in order to protect the real-time correctness of the underlying GPU Allocator locking protocol algorithm from poor advice provided by the plugin.

The notiﬁcation functions are invoked by the GPU Allocator to inform the plugin of the Allocator’s actions. Heuristic plugins use these functions to maintain their own internal state. For instance, we use notify_enqueue() and notify_dequeue() to maintain a cached estimate of the lengths of each FQ (i.e.,

Equation (3.2) of Section 3.2.3.2, which is necessary to efﬁciently implement Heuristic H2). Also, we use the function notify_exit() to inform a plugin that a real-time task has terminated. This information is needed to properly implement Heuristic H1, which attempts to distribute sporadic tasks among GPUs.

We also take advantage of the notiﬁcation functions to interface with other software components. For instance, we use notify_acquired() and notify_freed() to bridge the GPU Allocator and the GPU

Registry that is used for scheduling klmirqd daemons (Section 3.2.5.1). We also use these functions to integrate the GPU Allocator and the Cost Predictor. When the GPU Allocator calls notify_acquired(), the plugin begins tracking the combined CPU and GPU execution time of a job within the token critical section, as described in Section 3.2.3.3. Similarly, the plugin provides the Cost Predictor with a new observation when the GPU Allocator calls notify_freed().

3.3.5 User Interface

The developers of LITMUSRT provide a companion user-space library, called liblitmus, to ease the development of real-time LITMUSRT tasks. The liblitmus API provides functions for creating and initializing real-time tasks, assigning tasks to clusters, and job management.

144 We extend liblitmus to provide the user with an interface to GPUSync.19 This extended library does not serve as a GPGPU interception library (Section 3.1.2). Rather, such an interception library would use the services exposed by our GPUSync-modified liblitmus to enact scheduling policies. The extended library includes functions for creating and configuring the GPU Allocator and engine locks. The API to create an instance of a GPU Allocator requires the callee to specify which GPUs are to be managed by the allocator and values for ρ (the maximum number of per-GPU concurrent users) and f (the maximum length of each FQ). The callee may also specify which, if any, heuristic plugin should be employed by the GPU Allocator instance. APIs for creating and configuring engine locks are also provided. Each GPU Allocator and engine lock instance is given a unique name by the creating process. Other real-time tasks use these names to obtain the necessary references to the created objects.

Additional extensions to the liblitmus API include functions for obtaining GPU tokens, locking and unlocking engine locks (with or without the use of DGLs), and performing chunked memory copies. There are also routines to facilitate exception-based handling of budget exhaustion signals.

3.4 Conclusion

This concludes our description of the design and implementation of GPUSync. Through the careful consideration of tradeoffs among a variety of real-time GPU scheduler designs, we have designed GPUSync to be an API-driven scheduler implemented within the operating system. We sacriﬁce the ability to strictly enforce all GPU scheduling decisions in order to support closed-source software. GPUSync is designed around a synchronization-based philosophy to real-time GPU scheduling. This provides us with a variety of tools and methodologies we need to ensure real-time predicability.

19This extended version of liblitmus is currently available at www.github.com/GElliott under the GNU General Public License, version 2.

145 CHAPTER 4: EVALUATION1

In this chapter, we evaluate real-time properties of GPUSync. Our evaluation is in two parts. In the first part, we develop a theoretical model of GPUSync for our evaluation platform, a twelve-core, eight GPU, system. This model incorporates carefully measured system overheads and overhead-aware schedulability analysis. Upon this model, we perform a large-scale set of experiments where we evaluate the schedulability of randomly generated task sets under a variety of GPUSync configurations. Although not exhaustive, this evaluation is broad. It required over 85,000 CPU hours to complete, testing over 2.8 billion task sets for schedulability. We show that real-time guarantees differ greatly among GPUSync configurations. In the second part of our evaluation, we investigate the runtime performance of GPUSync. We first examine the efficacy of GPUSync’s budget enforcement mechanisms, the accuracy of GPUSync’s Cost Predictor, and the ability of GPUSync’s affinity-aware GPU token allocation heuristics to reduce costly GPU migrations.

We then evaluate a variety of GPUSync conﬁgurations through additional runtime experiments. Here, we execute task sets made up of tasks that execute GPU-based computer vision algorithms. We report upon the differences in observed real-time performance among the GPUSync conﬁgurations.

The remainder of this chapter is organized as follows. In Section 4.1, we describe the evaluation platform upon which our schedulability and runtime evaluations are based. We then describe the platform conﬁgurations (e.g., CPU and GPU cluster conﬁgurations) we consider in our evaluations in Section 4.2. In

Section 4.3, we develop our real-time model for evaluating task set schedulability under GPUSync. This section includes a description of the methods used to gather empirical data of GPU-related system overheads— we present this data as well. Section 4.3 also includes detailed blocking analysis of GPUSync’s engine and token locks, and the methodology used to integrate GPU-related overheads into schedulability analysis. In

Section 4.4, we describe our runtime evaluation of GPUSync, and we present our results. We conclude in

1 Portions of this chapter previously appeared in conference proceedings. The original citations are as follows: Elliott, G., Ward, B., and Anderson, J. (2013). GPUSync: A framework for real-time GPU management. In Proceedings of the 34th IEEE International Real-Time Systems Symposium, pages 33–44; Elliott, G. and Anderson, J. (2014). Exploring the multitude of real-time multi-GPU conﬁgurations. In Proceedings of the 35th IEEE International Real-Time Systems Symposium, pages 260–271.

146 Section 4.5, where we compare and contrast the results of our schedulability experiments with the empirical results of our runtime evaluation.

4.1 Evaluation Platform

We evaluate our implementation of GPUSync on a high-end multicore, multi-GPU, platform. This platform has two NUMA nodes, each like the system depicted in Figure 2.20. Each NUMA node is equipped with one Xeon X5060 processor with six 2.67GHz cores, and four NVIDIA K5000 Quadro GPUs. In total, our evaluation platform is equipped with twelve CPUs and eight GPUs. We ﬁrst provide additional details about our CPUs before describing our GPUs. Each CPU core has a private 32KB L1 cache for instructions, and another of the same size for data. Each CPU core also has a private 256KB L2 cache. The six cores on each X5060 processor share a single 12MB L3 cache. On our platform, the L3 is an inclusive cache, meaning that it contains copies of the data stored in caches above it. Each Quadro K5000 GPU connects to the platform through PCIe 2.0, using 16 PCIe lanes. The K5000 has two CEs, so each GPU is capable of simultaneous bi-directional DMA operations. This GPU also supports peer-to-peer DMA. We call peer-to-peer DMA operations between two GPUs that share the same PCIe switch “near” peer-to-peer

DMA operations. Similarly, peer-to-peer DMA operations between two GPUs that share the same I/O hub, but not a PCIe switch, “far” peer-to-peer DMA operations. The hardware scheduler issues we discussed in

Section 2.4.4 limit the K5000. For all of the evaluations performed herein, we took the steps necessary to ensure that GPU engines operated independently, maximizing the parallelism of our platform.

We justify our use of a high-end platform in lieu of a smaller embedded platform, such as those we discussed in Section 2.3, in two ways. First, the complexity of the high-end platform enables research into more complex scheduling problems (e.g., clustered GPU scheduling). Second, as technology advances, embedded platforms often resemble earlier higher-end platforms. For example, NVIDIA recently announced plans to develop a platform targeted to computer vision processing in automotive applications (Ho and Smith,

2015). This platform uses two Tegra chips, each containing four CPU cores and an integrated GPU. Although details of this new product are currently unavailable, the high-level architecture of this platform appears to be not unlike that of our multi-GPU NUMA platform.

147 CPU Scheduling Partitioned Clustered Global

CPU Partitioned GPU

Cluster Clustered Association GPUOrganization NUMA Boundary Global

Figure 4.1: Concrete platform conﬁgurations.

4.2 Platform Conﬁgurations

As we described in Chapter 1 (see Figure 1.4), a multicore multi-GPU platform may organized in a number of ways. We define a notational system to help us describe specific organizational configurations.

We use a matrix of several conﬁgurations, depicted in Figure 4.1, for a twelve-CPU, eight-GPU platform to illustrate several examples. We refer to each cell in Figure 4.1 using a column-major tuple, with the indices P,

C, and G denoting partition, clustered, and global choices, respectively. The tuple (P,P) refers to the top-left corner—a conﬁguration with partitioned CPUs and GPUs. Likewise, (G,C) indicates the right-most middle cell—globally scheduled CPUs with clustered GPUs. We use the wildcard ∗ to refer to an entire row or column: e.g., (P,∗) refers to the left-most column—all conﬁgurations with partitioned CPUs. Within each cell, individual CPUs and GPUs are shown on the left and right, respectively. Dashed boxes delineate CPU and GPU clusters (no boxes are used in partitioned cases). The solid lines depict the association between

CPUs and GPUs. For example, the solid lines in (C,C) indicate that two GPU clusters are wholly assigned to each CPU cluster. Finally, the horizontal dashed line across each cell denotes the NUMA boundary of the system.

When necessary, we extend our notation to denote the number of CPUs or GPUs within a cluster by using a subscript. For example, C2 may denote a cluster of two GPUs. We must also disambiguate between

148 GPU clusters where migration is carried out by peer-to-peer DMA and those where migration is carried out by DMA operations to and from system memory. To do so, we denote peer-to-peer conﬁgurations with the

P2P superscript “P2P.” For instance, C4 describes a cluster of four GPUs with peer-to-peer migration.

4.3 Schedulability Analysis

In this section, we develop a model of our evaluation platform for testing real-time schedulability. We begin by examining and measuring GPU-related overheads. We show that the use of GPUs can lead to significant system overheads in Section 4.3.1. Due to the large configuration space represented by the combination of GPUSync parameters, platform organizational choices, and variety of JLFP schedulers support by GPUSync, we scope our study to the subset of configurations we expect will show the most promise, while still broadly covering a range of possible configurations. We describe our scope and rationale in Section 4.3.2. In Section 4.3.3, we then discuss the task model we use to characterize sporadic task sets with tasks that use GPUs. We then develop detailed pi-blocking analysis for token and engine lock requests, which we must use in schedulability analysis in Section 4.3.4. In Section 4.3.5, we extend the overhead- aware preemption-centric schedulability analysis we discussed in Section 2.1.8 to incorporate our measured

GPU-related overheads. Finally, we perform a broad set of schedulability experiments in Section 4.3.6. These experiments are meant to determine the most theoretically promising GPUSync conﬁgurations, and serve to show to what degree GPUs can increase computational capacity, despite heavy system overheads, under our model of GPUSync.

4.3.1 Overhead Measurement

We now investigate and quantify overheads due to GPU processing. In general, we classify overheads as being algorithmic or memory-related. Algorithmic overheads are those due to code execution and event signaling. These include overheads due to thread context switching, scheduling, job release queuing, interprocessor interrupt latency, CPU clock tick processing, and interrupt processing. Memory overheads are those that increase execution time due to shared use of memory and data busses. These include overheads due to cache preemption/migration delays and memory bus contention. We already discussed non-GPU- related overheads from both categories in Sections 2.1.8 and 2.1.8.2 (and summarized in Tables 2.3 and 2.4).

Overheads due to GPU processing also fall within algorithmic and memory-related categories. Namely, GPU

149 interrupt handling (algorithmic) and DMA operations (memory-related). The process for quantifying GPU interrupt handling overheads is relatively straightforward—we directly measure the execution time of GPU interrupt-handling routines within the OS. However, the quantiﬁcation of DMA overheads requires a more nuanced approach, since the heavy load that DMA operations place on the system memory bus also affects code executing on the CPUs. We now present the methodology we used for measuring and quantifying both algorithmic and memory-related overheads.

4.3.1.1 Algorithmic Overheads

We measured algorithmic overheads using the lightweight tracing facilities of LITMUSRT while executing workloads that stress the various hardware components managed by GPUSync. Measurements were taken under different CPU and GPU cluster conﬁgurations, as well as with task sets of varying sizes (in order to capture overhead trends dependent upon the number of tasks). Over 11GB of trace data was recorded (a single trace event is only 16 bytes in size). We distilled this data into average and worst-case overheads for each of the algorithmic overheads in Tables 2.3 and 2.4. The properties of non-GPU-related overheads needed for preemption-centric accounting have been thoroughly studied by Brandenburg (2011b), so we will not replicate his work here. However, we do discuss algorithmic overheads related to GPUs.

GPUs interact with the host platform through I/O interrupts. As we discussed in Section 2.2.3, interrupt processing is split into “top” and “bottom” halves. Overhead-aware schedulability analysis requires that we quantify the execution cost of top and bottom halves of GPU interrupt processing. In order to do so, we stressed our evaluation platform with 30 GPU-using tasks that performed computer vision calculations on pre-recorded video streams.2 Tasks were assigned periods between 33ms and 100ms. GPUSync allocated

GPUs to jobs and arbitrated access to GPU engines. We instrumented the code paths of GPU top-half and bottom-half routines to measure their execution times. We took execution time measurements over a duration of 20 minutes.

Figure 4.2(a) gives the probability density function (PDF) derived from over 4,500,000 GPU top-half execution time observations. In the PDF, to determine the probability that a given top-half execution time measurement falls within the domain [a,b], we sum the area under the curve between x = a and x = b; the total area under each curve is 1.0. The most striking aspects of this data are the outliers: the maximum value

2We describe this code in more detail later in Section 4.4.2.

150 (a) All measurements. Probability Density Function of Observed Top-Half Execution Times 0.3

0.25

0.2

0.15

Density 0.1

0.05

0 0 5 10 15 20 25 Time (µs) (b) Measurements below the 99th percentile.

Figure 4.2: PDF of GPU top-half execution time. is 87.19µs, yet the mean and median are only 8.01µs and 7.31µs, respectively. In order to better observe the shape of the PDF, we plot the same data in Figure 4.2(b), but we clip the domain to include only measurements below the 99th percentile. The four humps in the PDF (centered near x = 5.5µs, x = 6.5µs, x = 8.5µs, and x = 11µs, respectively) suggest that there may be at least four code paths commonly taken by the ISR.

Figure 4.3(a) gives the PDF derived from about 4,200,000 GPU bottom-half execution time observations.3 This PDF shares a similar characteristic with the PDF for top-halves in Figure 4.2(a): extreme outliers. Here, we see that the maximum value is 1008.58µs, yet the mean and median are only 66.14µs and 54.68µs, respectively. Moreover, the maximum-to-median ratio for bottom-half execution time is approximately 18.4

(1008.58/54.68 ≈ 18.4). In contrast, this ratio is roughly 11.9 (87.19/7.31 ≈ 11.9) for top-half execution time. In other words, the severity of outlier behavior in bottom-half execution time is worse. Figure 4.3(b) plots the same data, but we clip the domain to stop at the 99th percentile. In this ﬁgure, we observe at eight distinct humps in the PDF that span the approximate domain of [30µs,80µs]; even if we ignore outlier behavior, there is still a great deal of variance in bottom-half execution time.

We now examine the outlier behavior of top-half and bottom-half execution time in more detail. Figure 4.4 depicts the complement cumulative distribution function (CCDF) of top-half and bottom-half observations, in

3The number of bottom-half observations is less than top-half observations. This implies that not every top-half spawns a corresponding bottom-half.

151 (a) All measurements. Probability Density Function of Observed Bottom-Half Execution Times 0.025

0.02

0.015

0.01 Density

0.005

0 0 50 100 150 200 250 300 Time (µs) (b) Measurements below the 99th percentile.

Figure 4.3: PDF of GPU bottom-half execution time. insets (a) and (b), respectively. It is important to note that the y-axis is plotted on a log scale—the log scale makes it easier for us to observe and reason about outlier characteristics. In Figure 4.4(a), we observe that top-half outliers are rare. For instance, only 0.01% (y = 10−4) of observed top-halves had execution times greater than about 50µs. Moreover, only 0.001% of observed top-halves had execution times greater than roughly 60µs, and only 0.0001% of observed top-halves had execution times greater than about 78µs.

Bottom-half outliers are also rare. In Figure 4.4(b), we observe that only 0.01% of observed bottom- halves had execution times greater than about 550µs. Moreover, only 0.001% of observed bottom-halves had execution times greater than roughly 750µs, and only 0.0001% of observed bottom-halves had execution times greater than 1000µs.

We conclude our discussion of GPU interrupt handling overheads by stressing the need for including these overheads in schedulability analysis. The CPU time consumed by GPU interrupt handling is not trivial. In our 20-minute experiment, we ﬁnd that the system spent roughly 36.6 seconds executing top- halves. Similarly, about 282.29 seconds were spent executing bottom-halves. Altogether, interrupt processing consumed approximately 2.21% of available CPU time, across twelve CPUs. Moreover, schedulability analysis must be mindful of the very clear discrepancies between worst-case and average measurements.

152 Experimental CCDF of Top-Half Execution Times 100

10-1

10-2 > x 10-3

10-4 (Execution Time) P

10-5

10-6

10-7 0 10 20 30 40 50 60 70 80 90 Time (µs) (a) top-halves

Experimental CCDF of Bottom-Half Execution Times 100

10-1

10-2 > x 10-3

10-4 (Execution Time) P

10-5

10-6

10-7 0 200 400 600 800 1000 1200 Time (µs) (b) bottom-halves

Figure 4.4: CCDFs of top-half and bottom-half execution times.

153 4.3.1.2 Memory Overheads

Although algorithmic overheads are important, those related to memory access are more so in a real-time

GPU system. As pointed out by Pellizzoni and Caccamo (2010), I/O memory bus traffic can significantly impact the performance of tasks executing on CPUs due to system memory bus contention. Moreover, in multi-GPU platforms, there is also contention for the PCIe bus. We seek to quantify two memory-related overheads. First, we want to determine the impact GPU memory traffic has on CPMDs. Second, we seek to determine the speed at which data can be transmitted to and from system memory and directly between

GPUs. We incorporate the former into schedulability analysis. The latter is used to compute task execution requirements on GPU CEs—critical to real-time GPU schedulability.

The Effect of Bus Contention on CPMDs. To assess CPMDs, we used an experimental method modeled after the “synthetic method” described by (Brandenburg, 2011b). A non-preemptive instrumented process records the time taken to read a prescribed amount (a “working set size”) of sequential data from a “hot” cache. The process suspends for a short duration, resumes on a random processor, and rereads said data from the now “cold” cache. A cost is determined by subtracting the hot measurement from the cold. On our evaluation platform, individual measurements fall into one of three categories: L2 preemption, L3 migration, and memory migration. An L2 preemption measurement is one where the hot and cold measurements are performed on the same CPU, since each CPU has a private L2 cache on our evaluation platform. An L3 migration measurement is one where the hot and cold measurements are performed on CPUs that share an

L3 cache; on our platform, these are CPUs within the same NUMA node. A memory migration is one where hot and cold measurements are performed on CPUs that do not share a cache; on our platform, these CPUs reside within different NUMA nodes.

We are concerned with two memory configurations since our test platform is a NUMA platform. Under partitioned and clustered CPU scheduling (when clusters reside entirely within a NUMA node), memory can be allocated locally to increase performance and reduce interference from NUMA-remote tasks. However, under global CPU scheduling, one may interleave memory pages across the NUMA nodes in order to obtain good average case performance. We require overhead data for both configurations in order to accurately model each CPU/GPU configuration described in Section 4.3.6.

Under both local and interleaved conﬁgurations, we collected three CPMD datasets: (i) An “idle” dataset where the instrumented process runs alone; (ii) a “loaded” dataset where “cold” measurements are taken in

154 (a) local

(b) interleaved

Figure 4.5: Considered CPMD maximum overheads due to GPU trafﬁc.

155 (a) local

(b) interleaved

Figure 4.6: Considered CPMD mean overheads due to GPU trafﬁc.

156 the presence of cache-trashing processes that introduce contention for both caches and memory bus; and

(iii) a “loaded+gpu” dataset where additional load is created by GPU-using processes, one for each CE, that fully loads the bidirectional PCIe bus with a constant stream of 512MB DMA memory transfers to and from pinned pages in system memory. We gathered 5,000 samples apiece to measure L2 preemptions, L3 migrations, and memory migrations for each working set size. We distilled these samples into max and mean values for each type of measurement.4

Figure 4.5 plots measured maximum costs for each type of CPMD. Figure 4.5(a) depicts measurements where all accessed memory was local to the NUMA node of the CPU(s). This ﬁgure does not include measurements for memory migrations, since there are no memory migrations when data is NUMA-local.

Figure 4.5(b) depicts measurements where accessed memory was evenly interleaved, with page granularity, across the two NUMA nodes of our evaluation platform. In both of these ﬁgures, the x-axis uses a log2 scale, while the y-axis uses a log10 scale. We make two high-level observations from these graphs.

Observation 1. The cache offers little or no beneﬁt in the presence of heavy load.

In Figure 4.5(b), observe that the curves for each type of measurement within the loaded data sets practically coincide for working set sizes of 16KB or greater. For example, curves 2, 5, and 8 are virtually indistinguishable. We can make similar observations in Figure 4.5(a).

Observation 2. CPMDs on an idle platform are characterized by two plateaus, with an abrupt increase around working set sizes of 128KB for CPMDs for local memory access, and working set sizes of 64KB for CPMDs for interleaved memory access.

In Figure 4.5(a), ﬁnd the curves for L2 preemptions (curve 1) and L3 migrations (curve 4). There is very little variation among these CPMDs for working set sizes of 4KB, 8KB, 16KB, 32KB, and 64KB. Also, these CPMDs are on the order of hundreds of nanoseconds—we expect CPMDs to be small because cache reuse should be high in an idle platform. CMPDs increase abruptly for working set sizes of 128KB. However, the increases level off for working set sizes of 256KB and greater. The increase in CPMDs for working set sizes of 128KB is explained by L2 cache the utilization. Although a 256KB L2 cache is only half-ﬁlled by a working set size of 128KB, this L2 may still hold other data such as program instructions, OS memory, and

4We distribute the measurement tools we developed to assess memory overheads as open source under the GNU General Public License, version 2. The source code for these tools is currently available at www.github.com/GElliott. Portions of this code is derived from that developed by Brandenburg (2011b).

157 the memory of other processes. The plateau for working set sizes no less than 256KB is explained by the inclusive 12MB L3 cache, which is large enough to hold the bulk of the tested working sets (also, the relative share of the L3 cache consumed by program code and the memory of the OS and other processes decreases with the larger cache). It is not be surprising that CPMD costs begin to increase for working set sizes larger than 12MB. We can make similar observations for CPMDs for interleaved memory access in Figure 4.5(b).

However, the ﬁrst signiﬁcant increase in CPMDs (curves 1, 4, and 7) occurs at working set sizes of 32KB.

Figure 4.6 plots measured mean costs for each type of CPMD. We see the same trends in Figure 4.6 that we see in Figure 4.5.

For a deeper study of CPMDs, we direct the reader to the work of Brandenburg (2011b), which provides a more in-depth investigation of CPMDs. We note that the general trends in Figures 4.5 and 4.6 are consistent with those reported by Brandenburg. More relevant to the topics covered by this dissertation is the increase in CPMDs due to GPU memory trafﬁc. We investigate this next.

Figure 4.7 plots the relative increase in CPMD costs of select working set size “loaded+gpu” datasets with respect to the “loaded” data set. Figure 4.7(a) relates these increases in terms of maximum CPMD costs, while Figure 4.7(b) relates these increases in terms of mean CPMD costs. We make two observations.

Observation 3. Maximum and mean CPMDs are affected similarly by GPU trafﬁc.

The shape and magnitude of the corresponding curves in Figures 4.7(a) and 4.7(b) are very similar. For example, the plots for L2 preemption CPMDs (curve 1) both exhibit a dip in costs for working set sizes around 8192KB.

Observation 4. GPU trafﬁc affects CPMDs for local memory access more strongly than CPMDs for interleaved memory access: CPMDs for local memory access increase by factors between two and four, while

CPMDs for interleaved memory access increase by factors between one and two.

In Figures 4.7(a) and 4.7(b), we observe that GPU trafﬁc increased CPMDs for local memory access by a factor between two and four for working set sizes larger than 32KB (curves 1 and 3). CPMDs for interleaved memory access were affected to a lesser degree, with increases by a factor between approximately

1.1 and 1.9. However, CPMDs for interleaved memory access without GPU traffic are nearly as great as local CPMDs with GPU traffic. For example, in Figure 4.6(a), find the L2 preemption CPMD cost for a platform under load with GPUs (curve 3) for a working set size of 4096KB—it is roughly 650µs. Compare this to

158 (a) maximum

(b) mean

Figure 4.7: Increase in considered CPMD overheads due to GPU trafﬁc.

159 the L2 preemption CPMD cost for a platform under load without GPUs (curve 2) for interleaved memory in

Figure 4.6(b)—it is about 550µs.

The above measurements demonstrate that GPU trafﬁc must be considered when deriving estimates of CPMD overhead. Moreover, these considerations must be cognizant of memory locality on a NUMA platform.

The Effect of Bus Contention on GPU DMA Costs. Under load, GPU DMAs experience contention for the following buses: the GPU-internal memory bus, several PCIe buses at various hierarchical levels), the processor-I/O hub interconnect, and the system memory bus. If memory is interleaved across NUMA nodes, then additional contention can be experienced for the processor-processor (NUMA) interconnect as well as the remote system memory bus.

We performed experiments to determine GPU DMA costs using a technique similar to the one we used to determine CPMD overheads. An instrumented process performed DMA operations to and from system memory and peer-to-peer DMA operations between GPUs. We tested both local and interleaved conﬁgurations under idle and loaded scenarios. In addition to loading every GPU CE, we also executed memory-heavy GPU kernels on the EE of GPUs used by the instrumented process in order to stress the GPU’s own local memory bus. We took 50,000 measurements for each type of DMA operation for each tested DMA operation size.

Figure 4.8 shows the maximum and mean DMA times for local and interleaved memory access that we observed on idle platform.5 We make two observations.

Observation 5. On an idle platform, all DMA operation types have similar performance curves: for smaller DMA operations, setup costs are dominant; for larger DMA operations, the costs to transmitting data are dominant.

In each of the four insets of Figure 4.8, we observe that all curves are similar. Overhead costs to setting up DMA operations dominate for smaller DMA operations (those no greater than 64KB)—this is indicated by the relatively ﬂat portion the curves that connect the cost data-points of smaller DMA operation. The cost of DMA operations begin to scale linearly with DMA operations size for larger DMA operations (those

5Although peer-to-peer DMA operations should not be affected by interleaved memory access of system memory in an idle platform, we include the costs of these operations in order to allow direct comparison against the costs of GPU-to-system and system-to-GPU DMA operations.

160 (a) local, maximum (b) local, mean

Figure 4.8: GPU data transmission time in an idle system. no less than 128KB). For example, in Figure 4.8(a), observe that far peer-to-peer DMA (curve 2) takes approximately 200µs for 1024KB, 400µs for 2048KB, and 800µs for 4048KB—cost doubles as DMA size doubles. The curves for the other DMA operation types share the same slope as the one for far peer-to-peer

DMA, indicating the same trend. The data reﬂected by Figure 4.8 is very consistent with those reported by Kato et al. (2011a) and Fujii et al. (2013).

Observation 6. On an idle platform, far peer-to-peer DMA operations may be more costly than DMA operations to or from system memory.

In Figures 4.8(a) and 4.8(b) (local memory DMA), observe that the curve for far peer-to-peer DMA operations (curve 2) lies above the other curves for DMA operations no less than 32KB. In Figures 4.8(a) and 4.8(b) (interleaved memory DMA), the curve for GPU-to-system memory (curve 3) practically coincides with that of far peer-to-peer DMA. The relative high cost of far peer-to-peer DMA is surprising, since data only traverses the PCIe bus in peer-to-peer DMA operations. We may rule out costs due to coordination between the peer GPUs, since near peer-to-peer DMA operations would be equally costly if this were the

161 (a) local, maximum (b) local, mean

Figure 4.9: GPU data transmission time in system under load. case. We can only speculate that the PCIe root complex that connects far GPUs may not be as efﬁcient as the PCIe switch that connects near GPUs. Nevertheless, the use of far peer-to-peer DMA may still be more efﬁcient for transmitting data between two GPUs than using two DMA operations to bounce the data through system memory.

We now examine DMA operation costs when our evaluation platform is under load. Figure 4.9 shows the maximum and mean DMA costs for local and interleaved memory access when the platform is under load.

We draw only high-level observations from Figure 4.9, deferring observations of comparative performance against DMA costs in an idle platform to another set of ﬁgures.

Observation 7. System-to-GPU DMA is more costly than GPU-to-system DMA; GPU-to-system DMA is more costly than far peer-to-peer DMA; far peer-to-peer DMA is more costly than near peer-to-peer DMA.

We may observe this in every inset of Figure 4.9. The DMA cost curve for system-to-GPU DMA

(curve 4) lies above the curve for GPU-to-system DMA (curve 3), even though the curves are relatively close

(especially for DMA operations no less than 1024KB). The curve for GPU-to-system DMA clearly lies above

162 the curve for far peer-to-peer DMA (curve 2). Similarly, the curve for far peer-to-peer DMA lies above the curve for near peer-to-peer DMA (curve 1).

Observation 8. For DMA operations no less than 256KB, the mean cost of far peer-to-peer DMA is approximately twice as much the mean cost of near peer-to-peer DMA.

We can make this observation in Figure 4.9(b). Find the value of curve 1 for near peer-to-peer for DMA sizes of 256KB—it is approximately 50µs. Find the value of curve 2 for far peer-to-peer DMA for the same

DMA size—it is approximately 110µs, a little more than twice the cost for near peer-to-peer DMA. We expect this behavior from the PCIe bus topology of our evaluation platform. Under load, far peer-to-peer

DMA operations receive half as much PCIe bandwidth, on average, as near peer-to-peer DMA operations.

This is because far peer-to-peer DMA operations must traverse an additional (loaded) PCIe switch, which reduces the available bandwidth to the DMA operation by half.

We now investigate the effect load and page interleaving has on DMA costs more directly. Figure 4.10 depicts observed increases in maximum and mean DMA costs due to load. We observe the following.

Observation 9. Load can cause signiﬁcant increases in DMA costs, so it must be considered in schedulability analysis.

Consider the case where four GPUs share a PCIe bus, as they do in Figure 2.20. Under load, one might assume that each GPU will obtain 25% of the PCIe bus’s bandwidth—increasing DMA costs by a factor of four. However, such an assumption ignores the effect of contention for the system memory bus. We see in

Figure 4.10 these cost increases can be considerably greater. For example, consider curves for GPU-to-system and system-to-GPU DMA (curves 3 and 5, respectively) in both Figure 4.10(a) and Figure 4.10(b). Here, we see that DMA costs generally increase by factors between ﬁve to eight times. DMA cost increases are even greater when memory pages accessed by GPUs are interleaved across the NUMA nodes. To observe this, ﬁnd the curves for GPU-to-system and system-to-GPU DMA with interleaved system memory (curves 4 and 6, respectively) in both Figure 4.10(a) and Figure 4.10(b). We see that the increase in DMA costs generally increase between eight to ten times, but may still be as great as twelve times (curve 6, for data sizes of 64KB, in Figure 4.10(a)).

Ultimately, this result shows us that bus contention must be accounted for in schedulability analysis. We consider this a signiﬁcant oversight of prior work in real-time GPU research.

163 (a) maximum

(b) mean

Figure 4.10: Increase in DMA operation costs due to load.

164 Observation 10. Near peer-to-peer DMA operations are hardly affected by load. Far peer-to-peer operations are moderately affected.

Despite added contention for the GPU’s local memory bus caused by the memory-heavy GPU kernel executing on the EE of the tested GPU, we observe that load hardly affects near peer-to-peer DMA: curve 1 in both insets of Figure 4.10 are very close to 1.0 (never exceeding a factor of 1.1) for all tested data sizes.

Far peer-to-peer memory copies exhibit an increase factor between 2.1 to 3.5 for maximum DMA costs and an increase factor between 1.7 to 2.6 for mean DMA costs, as respectively reﬂected in Figure 4.10(a) and Figure 4.10(b) by curve 2. From this observation, we learn that costs due to contention for the PCIe bus are not negligible.

Figure 4.11 depicts observed increases in maximum and mean DMA costs caused by interleaving memory pages across NUMA nodes in system memory. The insets in this ﬁgure do not include curves for peer-to-peer DMA, since these operations do not touch interleaved pages in system memory.

Observation 11. In general, GPU DMA performance is not improved by page interleaving.

One may suspect that interleaving pages among NUMA nodes may actually improve GPU DMA performance, since memory accesses in such a scenario may operate in parallel. However, in general, interleaving pages across NUMA nodes usually increases GPU DMA costs, even when the platform is idle.

We see that nearly every data point for the curves in Figure 4.11(a) and Figure 4.11(b) lie above 1.0, indicating increases in DMA cost. For example, maximum GPU-to-system DMA costs increased by roughly 20% and

40% for data sizes no less than 256KB on idle (curve 1) and loaded (curve 2) platforms, respectively, as depicted in Figure 4.11(a). There are some exceptions where page interleaving may lead to decreased DMA costs. For instance, in Figure 4.11(a), page interleaving reduced maximum DMA costs for some DMA data sizes no greater than 32KB on an idle platform, as indicated by curves 1 and 3. However, practically speaking, it is unlikely that DMA operations will always be performed on a completely idle platform. We claim that meaningful performance beneﬁts from page interleaving cannot be achieved on our evaluation platform.

Observation 12. Page interleaving has a stronger effect on DMA costs on a platform under load.

As we discussed above, interleaving may introduce additional bus contention, especially for the bus connecting the two NUMA nodes. Here, from curves 2 and 4 in Figure 4.11(a) and Figure 4.11(b), we see that the cross-trafﬁc between NUMA nodes caused by interleaving results in increased DMA costs—these

165 (a) maximum

(b) mean

Figure 4.11: Increase in DMA operation costs due to page interleaving.

166 curves always lie above those for an idle platform. Page interleaving increased maximum DMA operation costs in a loaded platform between 40% and 65% (Figure 4.11(a)). The increase in mean DMA operation costs fell within nearly the same range: between about 40% and 55% (Figure 4.11(b)).

This concludes our investigation of system overheads introduced by GPUs. Next, we discuss how we incorporate these overheads into overhead-aware schedulability analysis.

4.3.2 Scope

GPUSync is highly configurable and adaptable. It may be used with any JLFP scheduler. It supports a mix of CPU and GPU cluster configurations. GPUSync can support GPUs with zero, one, or two CEs.6 Also, the number of tokens per GPU and maximum FIFO length are configurable. GPU migration may be carried out through peer-to-peer DMA or through a temporary buffer in system memory. In support of peer-to-peer migration, CE locks may be requested one at a time, or once as a DGL. Engine locks may be configured to satisfy pending requests in FIFO- or priority-order. Finally, any of the four GPU Allocator heuristics we employ may be disabled without breaking real-time predictability. There are well over 100,000 different GPUSync configurations that are worthy of study.

We must limit the scope of the configurations we examine in order to make our study tractable. This scope must be focused enough so that we are not overwhelmed with data, and yet it must be broad enough so that we may come to understand the tradeoffs among general classes of GPUSync configurations. For the sake of presentation, we define our scope in its entirety here in one place. Our configuration choices are as follows.

1. Twelve CPU/GPU conﬁgurations. We wish to understand the tradeoffs in schedulability among the

combinations of CPU and GPU cluster conﬁgurations, as we described in Chapter 1. As a consequence,

we must study every reasonable combination of CPU and GPU cluster conﬁgurations. Nine of the

twelve conﬁgurations we study are depicted in Figure 4.1. The remaining three conﬁgurations are

those where GPUs are put in larger clusters of four, rather than clusters of two. This enables us to

determine what effect far peer-to-peer migration costs may have on schedulability.

6In Section 2.3, we discussed integrated GPUs, which lack CEs. Although we do not focus on integrated GPUs in this dissertation, we discuss GPUSync’s support for such GPUs in Chapter 6.

167 2. Peer-to-peer and system memory migration. As we discussed in Section 4.3.1.2, peer-to-peer DMA

has the potential to greatly reduce GPU migration costs, since data is handled less often and peer-to-peer

DMA can be signiﬁcantly faster than DMAs that involve system memory. However, peer-to-peer DMA

requires a migrating task to hold two CE locks simultaneously. As we will see in Sections 4.3.4.2

and 4.3.4.3, this has a signiﬁcant impact on pi-blocking analysis. Studying both peer-to-peer and

system memory migration methods allows us to determine if peer-to-peer DMA is efﬁcient enough to

overcome more pessimistic pi-blocking bounds.

3. Optimal conﬁgurations of the GPU Allocator with the number of tokens per GPU deﬁned as

ρ ∈ {1,2,3}, and a limited set of non-optimal GPU Allocator conﬁgurations for ρ = ∞. We study a wide selection of values for ρ in order to determine if GPU engine parallelism can improve schedulability. We primarily focus on (suspension-oblivious) optimal conﬁgurations of the GPU

Allocator where ρ ∈ {1,2,3}. This gives us a spectrum of token values to study. Exclusive GPU

allocation is represented by ρ = 1, while the potential for full GPU engine utilization is represented

by ρ = 3. For partitioned GPU conﬁgurations, we also study the effect of essentially eliminating the

token lock. Here, we render all token lock request trivial by setting ρ = ∞.

4. FL scheduling. We limit our study to FL-based schedulers. This may seem like an odd choice, given

our stated motivation for supporting automotive applications. EDF-like global and clustered schedulers,

such as C-FL, are commonly associated with soft real-time systems, i.e., those generally regarded as

non-safety-critical. This is partly due to generally weaker job response-time guarantees. However, in an

automotive setting, the reaction time of an alert driver is about 700ms (Green, 2000). Such a reaction

time is well within the realm of possibility under bounded deadline tardiness constraints. Moreover,

EDF-like schedulers come with the added beneﬁt of the absence of severe utilization constraints. We

choose to investigate schedulability under FL-based schedulers because these schedulers have the best

known bounds on deadline tardiness.

5. FIFO-ordered engine locks. We limit our scope to FIFO-ordered engine locks because we expect

that they will yield better schedulability under FL scheduling than priority-ordered locks. Generally

speaking, FIFO-ordered locks result in less pessimistic bounds on pi-blocking for deadline-based

schedulers, since the analysis for priority-ordered locks under deadline-based schedulers must generally

assume that each issued request has the lowest priority for long durations of time.

168 6. Use of DGLs. DGLs limit the effect of transitive blocking under nested locking. For peer-to-peer

migrations, we assume that CE locks are acquired through atomic DGL requests, since this will give us

better schedulability results.

7. Two CEs per GPU. We focus our attention on GPUs with two CEs. We make this decision because

GPUs with two CEs provide us with a richer platform for schedulability studies. However, we recognize

the importance of GPUs with a single CE. For this reason, we present detailed blocking analysis for

such GPUs in Section 4.3.4.3. However, our schedulability experiments assume dual-CE GPUs.

8. 1MB DMA chunk size. In our schedulability experiments, we assume that each type of DMA

operation (i.e., input, output, and state data) are broken into an integral number of 1MB chunks, plus at

most one fractional chunk. We select this chunk size because it is large enough that it is not dominated

by setup cost overheads, while keeping DMA operation sizes small enough to prevent long durations

of CE blocking.

Choices 1, 2, and 3 give us a broad selection of GPUSync conﬁgurations to study. The remaining choices help keep our study tractable, and are also meant to maximize real-time schedulability under GPUSync for the systems we are motivated to study.

4.3.3 Task Model for GPU-using Sporadic Tasks

We require a task model that adequately describes a task set of GPU-using tasks. We extend the sporadic task model described in Section 2.1.1 with additional notation.

We consider a task system, T , comprised of n real-time tasks T1,··· ,Tn that are scheduled on m CPUs, partitioned into clusters of c CPUs each. The subset T gpu ⊆ T includes all tasks that require GPU resources

cpu gpu from the system’s h GPUs, partitioned into clusters of g GPUs each. The subset T , T \T are tasks that do not use a GPU. The tasks in T are partitioned among the CPU and GPU clusters. We denote the set

th gpu of tasks assigned to the a GPU cluster by Ta , where a is a GPU cluster index, which starts from zero.

th We similarly denote the set of tasks assigned to the a CPU cluster by Ta. The parameter qi denotes Ti’s gpu provisioned GPU execution time on an execution engine. The parameter ei denotes Ti’s total CPU execution gpu time requirements within its GPU critical section (note that we assume ei is included in ei). Each job Ji, j I O sends zi bytes of data as input to GPU computations. Similarly, each job Ji, j receives zi bytes of data as output from GPU computations. The size of job Ji, j’s state that may be migrated among GPUs is denoted

169 S I O S by zi . For convenience, we deﬁne the function xmit(zi ,zi ,zi ) to specify the total data transmission time required by job Ji, j. The value of this function can be computed given the empirical measurements described gpu in Section 4.3.1.2. We assume that a job of Ti ∈ T may use any one arbitrary GPU in its GPU cluster. For

cpu I O S Ti ∈ T , the parameters qi, zi , zi , and zi are zero. Finally, we redeﬁne task utilization to incorporate GPU execution time, such that I O S ei + qi + xmit(zi ,zi ,zi ) ui , . (4.1) pi

K We deﬁne several additional terms for the purpose of locking analysis. Let Li denote the maximum K token critical section length of task Ti, bi denote the maximum time job Ji, j may be blocked due to the token E lock, and bi denote the maximum time Ji, j may be blocked within a token critical section for all engine locks. Let Z denote a conﬁgured DMA chunk size that is used to break large DMA operations into smaller ones. We

I I O O S S denote the number of chunks required to transmit task data by: Ni , dzi /Ze; Ni , dzi /Ze; Ni , dzi /Ze. Let XI, XO, and XP2P denote the maximum time it takes to transmit a chunk of GPU data for input, output, and peer-to-peer migration, respectively, and let Xmax denote the maximum of XI, XO, and XP2P. Finally, let

Si denote the maximum time to perform a GPU migration. For peer-to-peer migrations,

P2P S Si , X · Ni . (4.2)

For migrations through system memory,

I O S Si , (X + X ) · Ni . (4.3)

For platform conﬁgurations with partitioned GPUs, Si = 0. Table 4.1 summarizes these terms, as well as additional terms that we use later in analysis. In the subsequent sections, we will refer to the terms listed in Tables 2.1, 2.2, and 4.1.

4.3.4 Blocking Analysis

We now discuss the method we use to bound the length of time a job may be pi-blocked due to GPU token and engine lock requests. We begin by outlining our three-phase process to computing these bounds. We

170 Additional Task Set and Scheduler Parameters h number of GPUs g GPU cluster size T gpu set of GPU-using tasks in the task set T T cpu set of CPU-only tasks in the task set T gpu gpu it Ta set of GPU-using tasks in the task set T that execute on the a GPU cluster it Tba set of tasks scheduled on CPU clusters associated with the a GPU cluster Z DMA chunk size

Additional Parameters of Task Ti qi job worst-case execution time on an GPU execution engine gpu ei portion CPU execution time of a task within its token critical section I zi bytes sent to GPU as kernel input O zi bytes received from GPU as kernel output S zi bytes of state data that resides on a GPU Additional Blocking Analysis Parameters K Li token critical section length K bi token request blocking bound νi number of token critical sections that may block job Ji EE bi total of EE request blocking bounds CE bi total of CE request blocking bounds E bi total of engine lock request blocking bounds I I Ni number of chunks to transmit zi O O Ni number of chunks to transmit zi S S Ni number of chunks to transmit zi XI time to transmit a chunk to a GPU from system memory XO time to transmit a chunk to system memory from a GPU XP2P time to transmit a chunk through peer-to-peer DMA Xmax maximum of XI, XO, and XP2P Si time needed to perform a GPU migration k k Ri set of requests, sorted by L j, for resource `k that may interfere with a request of job Ji for `k EE k Ri set of EE requests, sorted by L j, that may interfere with a similar request of job Ji CE k Ri set of CE requests, sorted by L j, that may interfere with a CE request of job Ji Overhead-Aware Analysis Parameters λ number of GPU engines of a GPU ∆top GPU interrupt top-half overhead ∆bot GPU interrupt bottom-half overhead Hi number of GPU interrupts that may interfere with job Ji γi number of GPU engine operations issued by job Ji

Table 4.1: Summary of additional parameters to describe and analyze sporadic task sets with GPUs.

171 then derive coarse-grain bounds on pi-blocking due to engine lock requests.7 We call this analysis “coarse” because it assumes that all engine critical sections have the same length. Although we do not use coarse analysis in our own schedulability experiments (presented later in Section 4.3.6), this level of analysis gives us an appreciation for the order of complexity of engine lock pi-blocking, in terms of the number of interfering requests, and the tradeoffs between GPU migration through system memory and direct peer-to-peer DMA.

We then delve into detailed, or “ﬁne-grain,” analysis to bound pi-blocking caused by engine lock and token requests.

4.3.4.1 Three-Phase Blocking Analysis

The maximum total time a job may be pi-blocked accessing tokens and engine locks is given by the equation

E K bi = bi + bi . (4.4)

E K Our challenge is to determine pi-blocking bounds on bi and bi . E We approach this problem using a three-phase process. In the ﬁrst phase, we compute bi for each task

Ti ∈ T . In the second phase, we use the computed bounds on engine lock pi-blocking to bound the length

K K of the token critical section of each GPU-using task, denoted by Li . More precisely, we bound Li with the following equation:

K I O S gpu E Li = qi + xmit(zi ,zi ,zi ) + ei + bi . (4.5)

K That is, Li is bounded by the sum of: (i) the total time job Ji executes on an EE (qi); (ii) the time to perform I O S all possible DMA operations (xmit(zi ,zi ,zi )); (iii) the CPU execution time of Ji that occurs within the token gpu critical section (ei ); and ﬁnally, (iv) the total time Ji may be pi-blocked waiting for engine locks. In the third phase, we bound pi-blocking induced by GPU token requests using analysis appropriate for the GPU

Allocator conﬁguration.

4.3.4.2 Coarse-Grain Blocking Analysis for Engine Locks

We now derive coarse-grain pi-blocking bounds for FIFO-ordered engine locks.

7We do not present coarse-grain bounds for token requests in this chapter, since we have already presented this analysis in Sections 2.1.7.2 and 2.1.7.3 for the relevant GPU Allocator conﬁgurations.

172 EE I/O Let bi denote Ti’s maximum total pi-blocking time for the EE lock, let bi denote its maximum total P2P pi-blocking time while waiting to transmit input and output chunks, and let bi denote its maximum total pi-blocking time while waiting for CE locks to perform a peer-to-peer migration. By construction,

EE I/O P2P bi = bi + bi + bi . (4.6)

A job may be pi-blocked for every GPU kernel it executes when acquiring the EE lock of its allocated

GPU. At most ρ − 1 other jobs may compete simultaneously for this lock for a given request. Since requests are FIFO-ordered, the resulting pi-blocking is bounded by

EE gpu bi = (ρ − 1) · max{q j | Tj ∈ Ta }, (4.7)

th where task Ti is assigned to the a GPU cluster.

I/O P2P Bounds for bi and bi depend partly on whether migrations are peer-to-peer or through system memory. In our analysis, we assume that all migrations are performed using the same method, though

GPUSync can support both types in the same system.

CE blocking with peer-to-peer. Under peer-to-peer migrations, any task holding a GPU token may request the CE lock of the GPU it used in its prior job in order to perform a migration. There are at most ρ · g such tasks. In the worst case, they may all attempt to access the same CE lock at the same instant. Thus, any request for a CE lock may be blocked by (ρ · g − 1) other requests. From the blocking analysis of DGLs of Ward and Anderson (2013), the total number of interfering requests for a CE is at most (ρ · g − 1). Since no request requires more than Xmax time to complete,

I/O max I O bi = X (Ni + Ni )(ρ · g − 1) (4.8) and

P2P max S bi = X · Ni (ρ · g − 1). (4.9)

CE blocking with system memory migration. When migration between GPUs takes place via system memory, CEs are only accessed by tasks that have been given a token for an allocated GPU, so at most ρ − 1 other jobs may compete for the CE lock at a given instant. Recall from Section 3.2.3.4 that state is aggregated

173 P2P with input and output data, in this case. Thus, bi = 0. However, now

I/O max I O S bi = X (Ni + Ni + 2 · Ni )(ρ − 1), (4.10) since state data must be handled twice.

Analytical bounds for peer-to-peer and system memory migrations differ. As seen above, CE lock contention is O(ρ · g) and O(ρ) under peer-to-peer and system memory migrations, respectively. Despite its inferior order of complexity, peer-to-peer migration may still result in better analytical bounds if the advantages of fewer total DMA operations and faster peer-to-peer DMA operations can be exploited. Also, there are beneﬁts to peer-to-peer migrations that are not captured in the above analysis, namely, isolation from the system memory bus and rarity of migrations due to the GPU Allocator’s heuristics.

4.3.4.3 Detailed Blocking Analysis for Engine Locks

Our detailed blocking analysis, whether for engine lock or token requests, follows the same general

8 approach, which we outline before delving into detailed analysis. For task Ti, we ﬁrst determine the number of jobs of another task, Tj where i =6 j, that may be ready to run at the same time as Ji,u. This is characterized by the task interference function, tif (Ti,Tj). From tif (Ti,Tj), we generate a set of interfering resource requests that the interfering jobs Jj,v may make when Ji,u requests a resource of the same type, where type may be GPU token, execution engine lock, or copy engine lock. This set is generated by the request interference function, xif (Ti,Tj,`k), where `k is a given resource. We aggregate the set of interfering requests of all tasks

(excluding Ti) into a single set of all interfering resource requests that may be made, as given by the total request interference function, txif (Ti,`k).

k Each interfering request, R j, has an associated length, L j. The set deﬁned by Ri , txif (Ti,`k) is sorted in descending order by length. To compute the pi-blocking experienced by Ji,u for a given single resource k k request, the top y requests are removed from Ri , depleting Ri by y requests, and summed. This process is k repeated iteratively for each request of a given type made by Ji,u, terminating early if Ri becomes empty. In general, the value of y depends upon the locking protocol used and resource organization. For example, under

GPUSync, y = (ρ ·g−1) for CE lock requests when peer-to-peer migrations are used. (This is derived directly

8We follow the general approach used by Brandenburg (2011b), so we adopt his terminology and formulas for modeling the set of requests that may interfere with a request issued by a job of task Ti.

174 from the blocking complexity of peer-to-peer migrations, which we discussed at the end of Section 4.3.4.2.)

This entire process must be repeated for each resource: GPU token, EE lock, and CE lock(s).

In the case of soft real-time scheduling, tif (Ti,Tj) depends upon tardiness bounds, which in turn depend upon blocking bounds. A ﬁxed-point iterative method must then also be used ensure schedulability, outlined by the following steps:

1. Initialize tardiness bounds to zero.

2. Compute pi-blocking bounds.

3. Compute tardiness bounds.

4. Compute new pi-blocking bounds, incorporating tardiness bounds.

5. Check schedulability. Go back to step (3) if the task set is schedulable, but new pi-blocking bounds

from step (4) differ from bounds previously computed.

This method will terminate when either bounds on pi-blocking have reached a steady state or the task set is unschedulable. This highlights a signiﬁcant beneﬁt of FL scheduling over EDF scheduling: the tighter tardiness bounds offered by FL scheduling may reduce computed pi-blocking bounds.

Before proceeding, we state two important assumptions. First, we assume that the total number of

GPU-using tasks is greater than g; otherwise, GPU Allocator Heuristics H1 and H2 load-balance GPU token requests such that no two tasks share a GPU and that each task always receives the same GPU for every job—there is no blocking or migration under this scenario. Second, we assume that a GPU-using task only requests a GPU token once per job. The following analysis can be extended to handle multiple token requests per job, though it becomes cumbersome to express. We now proceed to deﬁne the above formulas. We direct the reader towards the work of Brandenburg (2011b) for a detailed explanation of each formula.

Deﬁnition 4.1. For hard real-time systems,

pi + r j tif (Ti,Tj) , , (4.11) p j

Deﬁnition 4.2. For soft real-time systems (under the “bounded tardiness” deﬁnition of soft real-time),

max(pi,ri) + r j tif (Ti,Tj) , . (4.12) p j

175 tif (Ti,Tj) gives us the number of jobs of Tj that may interfere with a job Ji,u. We use a different deﬁnition of tif (Ti,Tj) (Equation (4.12)) for soft real-time systems, since a job of Ti may be tardy—we must consider a larger window of execution in which such a job may issue resource requests. We now derive the set of requests from Tj that may interfere with requests of Ji,u for resource `k.

Deﬁnition 4.3. The set of requests of Tj that interfere with requests of a job of Ti for resource `k is given by

xif (Ti,Tj,`k) , R j,v | 1 ≤ v ≤ tif (Ti,Tj) · η j,k , (4.13)

where η j,k is the maximum number of requests for `k that a job of Tj may make.

We say that xif (Ti,Tj,`k) deﬁnes a set of generic requests because request R j,v ∈ xif (Ti,Tj,`k) does not th th denote the v request made by task Tj after the release of Tj’s ﬁrst job. Rather, R j,v denotes the v resource request in a worst-case string of consecutive requests of Tj that may interfere with request Ri of Ti. Finally, we can derive an aggregate of all interfering requests.

Deﬁnition 4.4. The set of all interfering resource requests of other jobs that may interfere with requests of a job of Ti for resource `k is given by

[ txif (Ti,`k) , xif (Ti,Tj,`k). (4.14) Tj∈T \{Ti}

We these formulas deﬁned, we can now present detailed pi-blocking analysis for engine lock and token requests.

Detailed analysis for execution engine lock requests. We can now compute a bound on worst-case pi-

EE blocking job Ji experiences when it requests an EE, bi . Let the resource `k represent a particular execution k k engine lock. Let the function top(v,Ri ) denote the v longest requests in the set of requests Ri for `k. The k set Ri is given by txif (Ti,`k), by construction. The total worst-case pi-blocking experienced by job Ji while waiting for an execution engine is bounded by

EE bi = ∑ L j, (4.15) k R j∈top((ρ−1)·ηi,k, Ri ) where ηi,k denotes the number of EE requests issued by job Ji for `k.

176 Detailed analysis for copy engine lock requests with peer-to-peer migration. In Section 4.3.4.2, we computed pi-blocking for kernel input/output DMA operations and peer-to-peer DMA operations separately,

I/O P2P denoted by the terms bi and bi , respectively. Under detailed analysis, it is easier to compute bounds on pi-blocking for these different types of DMA operations jointly. We denote pi-blocking due to all CE requests

CE by bi . We ﬁrst present detailed analysis that holds for GPUs with either one or two CEs. We then present tighter analysis for the dual-CE case.

I O P2P Let ` , ` , and ` , represent the CE(s) of the same single GPU used by job Ji to transmit kernel input to a GPU, transmit kernel output from a GPU, and migrate state to a GPU, respectively. The CE(s) represented by these resources may be one in the same. However, we generate the set of interfering requests by considering them as separate resources, each dedicated to performing a particular type of DMA (e.g.,

I input, output, or peer-to-peer migration). Let Ri denote the sorted set of requests for copying kernel input I I O data to a GPU that may interfere with a similar request of job Ji; Ri , txif (Ti,` ). Let Ri denote the sorted set of requests for copying kernel output data from a GPU that may interfere with a similar request of job

O O P2P Ji: Ri , txif (Ti,` ). Let Ri denote the sorted set of requests of peer-to-peer DMA requests that may P2P P2P P2P interfere with a similar request of job Ji: Ri , txif (Ti,` ). Ri includes interfering requests of jobs that may migrate to and from the GPU allocated to job Ji. The sorted set of all CE requests that may interfere CE CE I S O S P2P with a CE request of job Ji is denoted by Ri : Ri , Ri Ri Ri .

The total worst-case pi-blocking experienced by Ji while waiting to receive the requested CE lock when peer-to-peer migrations are used is bounded by

CE bi = ∑ L j, (4.16) I O S CE R j∈top((ρ·g−1)·(Ni +Ni +Ni ), Ri )

I O P2P where L j is equal to the length of the associated CE operation (e.g., X , X , or X ). Observe that the above analysis does not take advantage of the fact that the GPU has two CEs. That is, the above analysis holds when a GPU has only one CE. The transitive pi-blocking induced by peer-to-peer migrations makes it difﬁcult to derive tighter bounds for dual-CE GPUs. However, tighter analysis is possible.

We now describe this optimization.

P2P Transitive pi-blocking due to peer-to-peer migrations is only possible when Ri =6 /0. Recall that the CE CE computation of bi is iterative: requests from Ri are extracted in groups of ρ · g − 1 at a time.

177 CE,k th CE I,k Let Rbi denote the set of requests remaining after the k iteration of bi ’s computation. Let Rbi , CE,k T I CE,k O,k CE,k T O Rbi Ri , denoting the sorted set of remaining interfering input requests in Rbi . Let Rbi , Rbi Ri , CE,k. P2P,k CE,k T P2P denoting the sorted set of remaining interfering output requests in Rbi Let Rbi , Rbi Ri , CE,k P2P,k denoting the sorted set of remaining interfering migration requests in Rbi . If Rbi = /0 then transitive pi- blocking due to migrations is no longer possible, since no migration requests remain. From this observation,

CE we derive the following tighter analysis for dual-CE GPUs where bi is broken down into two terms,

CEtrans CEdirect bi and bi , where pi-blocking complexity is O(ρ · g) and O(ρ), respectively. Let k ∈ N1 denote the CE CE T P2P CE smallest integer such that (Ri \top((ρ · g − 1) · k , Ri )) Ri = /0. Observe that (Ri \top((ρ · g − 1) · CE CE,k k , Ri )) ≡ Rbi .

The total transitive worst-case pi-blocking experienced by Ji while waiting for a CE to copy data to or from a GPU when peer-to-peer migrations are used with dual-CE GPUs is bounded by

CEtrans bi = ∑ L j. (4.17) CE R j∈top((ρ·g−1)·k , Ri )

The total direct worst-case pi-blocking experienced by Ji while waiting for a CE to copy data to or from a GPU when peer-to-peer migrations are used with dual-CE GPUs is bounded by

CEdirect bi = ∑ L j. (4.18) I O S CE,k R j∈top((ρ−1)·max(Ni +Ni +Ni −k ,0), Rbi )

By construction, the total worst-case pi-blocking experienced by Ji while waiting for a CE to copy data to or from a GPU when peer-to-peer migrations are used with dual-CE GPUs is bounded by

CE CEtrans CEdirect bi = bi + bi . (4.19)

Blocking chains. Are tighter bounds on CE pi-blocking possible? Certainly. We call a sequence of CE requests that may block a request of job Ji a blocking chain. The number of all possible blocking chains is ﬁnite, since each token holder may issue at most one DMA operation at a time. When g and ρ are relatively small, it is feasible to enumerate each potential blocking chain. For example, Table 4.2 depicts all possible representative blocking chains for outbound, inbound, and migration CE requests when g = 2, ρ = 3, and all GPUs have two CEs. In the table, “O,” “I,” and “M” represent outbound, inbound, and migration DMA

178 Outbound Inbound Migration Chain Cost (µs) Chain Cost (µs) Chain Cost (µs) IIOOM 4967 IOOOM 5000 IIOOO 6006 IOOMM 3961 OOOMM 3994 IOOOM 5000 IOOM 3786 OOOM 3818 IOOO 4824 IIOM 3753 IOOM 3786 IIOO 4791 OOMMM 2956 OOMM 2780 OOOMM 3994 OOMM 2780 OOM 2604 OOOM 3818 IOMM 2747 IOM 2571 IOOM 3786 OOM 2604 II 2363 OOO 3643 IOM 2571 OMM 1566 IOO 3610 IIM 2538 OM 1390 IIO 3577 OO 2429 IM 1357 OOMM 2780 OMMM 1741 I 1181 OOM 2604 OMM 1566 MM 351 IOM 2571 IMM 1533 M 176 OO 2429 OM 1390 − − IO 2396 IM 1357 − − II 2363 O 1214 − − OMM 1566 MMM 527 − − OM 1390 MM 351 − − IM 1357 M 176 − − O 1214 − − − − I 1181 − − − − MM 351 − − − − M 176

Table 4.2: All possible representative blocking chains that may delay a CE request for an outbound, inbound, or migration operation for GPUs with two CEs, g = 2, and ρ = 3. Costs computed assuming worst-case conditions for a 1MB DMA operation and non-interleaved system memory.

179 operations, respectively. An outbound DMA operation is one where a CE is used to copy data away from a

GPU. Likewise, an inbound DMA operation is one where a CE is used to copy data onto a GPU. A migration

DMA operation is simultaneously an inbound and an outbound DMA operation. Each letter (“I,” “O,” or

“M”) represents a type of CE request that interferes with a CE request of job Ji. We say the chains in Table 4.2 are “representative,” since there may exist multiple DMA sequences (or “actual” blocking chains) that contain the same frequency of request types, but differ in the order these requests appear within each sequence and which particular GPUs handle each request. We derive the cost of each blocking chain (i.e., the time job Ji can be blocked by said chain) by summing the costs of the individual DMA operations within each chain.

Thus, blocking chains that have the same frequency of request types also have the same cost. We can identify such requests with a single representative blocking chain. For example, the chains “IMO” and “MOI” each contain one outbound, inbound, and migration operation, so these chains have the same cost. In Table 4.2, we take the chain “IMO” to represent all equivalent chains.

The chains in Table 4.2 are sorted in order of decreasing cost. Here, we assume that all DMA operations are 1MB in size. The costs of individual DMA operations are derived from our overhead model where we assume worst-case overheads in a loaded system with non-interleaved system memory. We also assume that all peer-to-peer migrations are “near,” as we discussed in Section 4.3.1.2, since g = 2. Table 4.2 is divided into outbound, inbound, and migration columns to denote the sets of possible blocking chains that may interfere with a CE request of job Ji for outbound, inbound, and migration DMA operations, respectively. Observe that no chain in Table 4.2 has a length of more than ﬁve operations. This is consistent with the CE blocking bounds we derived in Section 4.3.4.2 for system conﬁgurations with peer-to-peer migration, where we showed that each CE lock request is blocked by at most ρ · g − 1 CE requests: 2 · 3 − 1 = 5. Although some representative chains appear in all three columns (e.g., “IOOM”), others do not. For example, the chain “OO” does not appear in the inbound column. This is because it is impossible for an inbound CE lock request to be blocked exclusively by outbound CE lock requests, since GPUSync directs inbound and outbound DMA operations to different CEs when a GPU has two CEs. A blocking chain where an inbound CE lock request is blocked by an outbound CE lock request must include a migration CE lock request in order to link the otherwise independent CEs.

Under GPUSync, peer-to-peer migrations are pulled from one GPU to another. That is, the job that issues a migration CE request always holds a token for the destination GPU of the peer-to-peer DMA operation.

180 A peer-to-peer migration always uses the inbound CE of the requesting job’s assigned GPU. It is for this reason that the set of blocking chains for inbound requests is a proper subset of the set of blocking chains for migration requests. However, we cannot treat migration requests as inbound requests. This is because a migration request may also contend with requests for the outbound CE of the source GPU of the peer-to-peer

DMA operation (even if the request experiences no contention for the inbound CE of the destination GPU).

The pull-based nature of peer-to-peer migrations also affects the blocking chains associated with outbound

CE requests. There may be at most ρ simultaneous inbound migration requests for GPUa’s inbound CE, since only jobs assigned a token for GPUa ever access GPUa’s inbound CE. However, there may be as many as ρ · (g − 1) simultaneous outbound migration requests for GPUa’s outbound CE. It is for this reason that the inbound and outbound columns of Table 4.2 differ.

Computing all possible blocking chains. In support of the schedulability experiments we present in Sec- tion 4.3.6, we computed tables of all possible blocking chains for system configuration defined by the unique combinations of system configuration parameters g ∈ {2,4} and ρ ∈ {1,2,3}, for systems with dual-CE GPUs.9 We use a brute-force algorithm to perform an exhaustive search of all possible blocking chains for outbound, inbound, and migration requests.

To compute the blocking chains for a given CE request of job Ji on GPUa, we consider the situation where all token holders may have incomplete CE requests that were issued prior to that of Ji’s. Job Jj, assigned to GPUb, may have issued an outbound, inbound, or migration request from any of the other g − 1

GPUs. Job Jj may also have not issued a request, which we represent with a place-holder “null” request.

Thus, Jj may perform one of (1 + 1 + (g − 1) + 1) = g + 2 possible operations. Since there are ρ · g − 1 token holders, excluding job Ji in the cluster that includes GPUa, there may be as many as (ρ · g − 1)(g + 2) different sets of incomplete requests issued before job Ji’s request. The order in which these requests are issued may affect the blocking experienced by job Ji. There are (ρ · g − 1)! different ways in which we may order the requests (including null requests) in each of these sets. This results in (ρ ·g−1)(g+2)((ρ ·g−1)!) scenarios in which these requests issued by the token holders other than job Ji may precede the request of job

Ji.

9We do not compute tables for conﬁgurations where g ∈ {1,8}. Peer-to-peer migrations are not used when g = 1 (i.e., a partitioned GPU conﬁguration). On our evaluation platform, peer-to-peer migrations are not possible when g = 8, since peer-to-peer DMA operations cannot cross NUMA boundaries.

181 Request R1 R2 R3 R4 R5 Request Type I M O O N GPU(s) Handling Request GPU0 GPU0 ← GPU1 GPU1 GPU1 − GPU Assigned to Requesting Task GPU0 GPU0 GPU1 GPU1 GPU1

Table 4.3: A possible arrangement of copy engine requests where g = 2 and ρ = 3.

With a procedure we discuss shortly, we evaluate each of these scenarios to construct a blocking chain for each type of request that may be issued by job Ji, and from it, we construct a representative blocking chain. We insert this representative blocking chain into a hash table if it is not already stored therein. This algorithm leaves much to be desired. It inspects (ρ · g − 1)(g + 2)((ρ · g − 1)!) scenarios. The procedure we describe below takes O(ρ · g) to evaluate each scenario. As a result, our brute-force algorithm has a runtime complexity of O((ρ2 · g3)((ρ · g − 1)!)). Still, we ﬁnd that this algorithm takes no more than several hours for small values of g and ρ, where g ≤ 3 and ρ ≤ 3.10,11 This is acceptable, since the results can be stored ofﬂine and reused in schedulability experiments.

We now describe the process we use to compute a blocking chain for each request scenario. Each scenario is represented by a string of requests, where each type of request is denoted by a symbol (i.e., O, I, M, or

N (for null requests)). Each request is paired with a GPU identiﬁer, indicating which GPU is to handle the request. We interpret the order in which these requests are arranged, left to right, as a temporal ordering—the most recent request is at the head of the string, and the least recent request at the tail. The row labeled

“Request Type” in Table 4.3 depicts such a string for a platform where g = 2 and ρ = 3: {I, M, O, O, N}. Table 4.3 also includes information about each request, such as the GPU handling each request and the

GPU to which the requesting task is assigned. Recall that migration requests are actually two copy engine requests that are issued simultaneously and atomically through a DGL request (please refer to Sections 2.1.6.2 and 3.3.3 for details). This allows us to treat these dual-requests as a single request that spans two GPUs.

Request R2 in Table 4.3 is such request. We use an arrow to describe the migration between GPUs. For example, request R2 is for a migration from GPU1 to GPU0, which is denoted by “GPU0 ← GPU1” in the row labeled “GPU(s) Handling Request” in Table 4.3.

10When g = 4 and ρ = 3, we must make optimizations that allow us to consolidate computations for scenarios with common requests in order to complete within a reasonable time frame. 11The code for the implementation of our algorithm is freely available at www.github.com/GElliott.

182 Every prefix-substring of the string for a given scenario may be a chain of requests that blocks a request issued by job Ji. Whether a chain of requests actually blocks a request of job Ji partly depends upon the request type issued by Ji and to which GPU(s) the request is directed. Without loss of generality, we assume every request issued by Ji, denoted by R0, is for an inbound or outbound copy engine of GPU0. In the case of migrations, we assume job Ji migrates away from GPU1 to GPU0. This is a safe generalization because our ultimate goal is to find all representative blocking chains for a particular platform configuration. Since GPUs are homogeneous under GPUSync, the set of actual blocking chains that may block a request of a given type for GPU0 is homomorphic to the set of actual blocking chains that block a request of the same type for GPU1. Representative blocking chains only describe the frequency of each request type in a chain; information on the order in which requests are issued, as well as to which GPU each request is issued, is stripped away when a representative chain is constructed from an actual chain.

We use the recursive procedure COMPUTEBLOCKINGCHAIN, depicted in Figure 4.12(a), to compute a blocking chain for a given scenario and initial request type. The procedure inspects the string of requests, denoted by the function parameter requestString, and returns the maximal preﬁx-substring, in terms of string length, that blocks the request on the top of the request stack, denoted by the function parameter stack. We initialize the parameter stack with the request of job Ji prior to the ﬁrst call to COMPUTEBLOCKINGCHAIN, as seen in insets (b), (c), and (d) of Figure 4.12, for inbound, outbound, and migration requests, respectively.

We now describe COMPUTEBLOCKINGCHAIN in more detail. If requestString is an empty string, then no request in stack can be blocked. In this case (when line 2 evaluates to false), COMPUTEBLOCKINGCHAIN returns a empty blocking chain at line 18. Otherwise, the procedure pops a request from stack (line 3) and stores it in the variable request. The parameter stack is guaranteed to contain at least one request at line 3, since a request is pushed onto the stack immediately prior to a call to COMPUTEBLOCKINGCHAIN (e.g., line 6), or stack is known to contain at least one request prior to calling COMPUTEBLOCKINGCHAIN (e.g., line 12). After extracting a request from stack, the procedure obtains a reference to the ﬁrst element in requestString (line 4); this request is stored in the variable next. We use the subroutine BLOCKEDBY to test whether request is blocked by next (line 5). BLOCKEDBY returns true if requests request and next contend for the same CEs; it returns false, otherwise. If next blocks request, then we push next onto stack (line 6), and we recursively call COMPUTEBLOCKINGCHAIN (line 11) to process the requests that follow next in requestString (the substring of requestString that makes up these requests is denoted by requestString[1:]).

At line 11, the procedure joins two lists of requests to create the returned blocking chain. The ﬁrst list is

183 1: procedure COMPUTEBLOCKINGCHAIN(stack,requestString) 2: if requestString =6 /0 then 3: request ← POP(stack) . Pop request from stack. 4: next ← requestString[0] . Process request at head of requestString. 5: if BLOCKEDBY(request,next) then . True if next and request contend for same CE locks. 6: PUSH(stack,next) . Prepare to compute chain of what may block next. 7: if ISMIGRATIONREQUEST(next) then 8: PUSH(stack,next) . Push next twice, since migrations are two requests. 9: end if 10: . Recurse on next and remainder of requestString. 11: return {next} + COMPUTEBLOCKINGCHAIN(stack,requestString[1 :]) 12: else if stack not empty then 13: . Migration request at top of stack may be blocked by next; recurse. 14: return COMPUTEBLOCKINGCHAIN(stack,requestString) 15: end if 16: end if 17: . Return nothing if (a) no more requests or (b) next does not block request and stack is empty. 18: return {} 19: end procedure (a) Recursive procedure for computing a blocking chain for a given scenario.

1: procedure COMPUTEINBOUNDBLOCKINGCHAIN(scenario) 2: stack ← { CREATEINBOUNDREQUEST(GPU0) } . Initialize a stack with request on top. 3: return COMPUTEBLOCKINGCHAIN(stack,scenario) 4: end procedure (b) Computes the blocking chain for an inbound request.

1: procedure COMPUTEOUTBOUNDBLOCKINGCHAIN(scenario) 2: stack ← { CREATEOUTBOUNDREQUEST(GPU0) } . Initialize a stack with request on top. 3: return COMPUTEBLOCKINGCHAIN(stack,scenario) 4: end procedure (c) Computes the blocking chain for an outbound request.

1: procedure COMPUTEMIGRATIONBLOCKINGCHAIN(scenario) 2: request ← CREATEMIGRATIONREQUEST(GPU0, GPU1) . Migration from GPU1 to GPU0. 3: . Initialize stack with request twice, since migrations are two requests. 4: stack ← {request,request} 5: return COMPUTEBLOCKINGCHAIN(stack,scenario) 6: end procedure (d) Computes the blocking chain for a migration request.

Figure 4.12: Procedures for computing blocking chains for a given request scenario.

184 composed of a single element, constructed from next. The second list is whatever is returned by the recursive call to COMPUTEBLOCKINGCHAIN. The remaining lines of code in COMPUTEBLOCKINGCHAIN handle a special case introduced by migration requests, which we discuss next.

Consider the following hypothetical situation. Suppose we removed lines 7, 8, and 12 through 14 from COMPUTEBLOCKINGCHAIN, and the procedure were called with stack = {(M,GPU0 → GPU1)} and requestString = {(I,GPU0),(O,GPU1)}. The broken procedure would correctly ﬁnd that the request

(I,GPU0) blocks (M,GPU0 → GPU1), since both requests contend for the inbound CE of GPU0. In the subsequent recursive call made from line 11, stack = {(I,GPU0)} and requestString = {(O,GPU1)}. As result, our broken procedure would determine that request (I,GPU0) is not blocked by (O,GPU0) (which is true), and the call would return the empty chain on line 18. Our broken version of COMPUTEBLOCK-

INGCHAIN would return the chain {(I,GPU0)} for request (M,GPU0 → GPU1). This is incorrect. The request (M,GPU0 → GPU1) may be blocked by both requests (I,GPU0) and (O,GPU1), since the migration request requires the inbound CE of GPU0 and the outbound CE of GPU1 in order to proceed. Recall from

Section 2.1.6.2 that only one job of the two jobs associated with requests (I,GPU0) and (O,GPU1) may inherit the priority of the job that issued request (M,GPU0 → GPU1) at a time. Even though the DMA operations of requests (I,GPU0) and (O,GPU1) potentially execute in parallel, in the worst-case, requests

(I,GPU0) and (O,GPU1) complete serially. Hence, request (M,GPU0 → GPU1) may be blocked by both requests (I,GPU0) and (O,GPU1), one after the other. In other words, request (M,GPU0 → GPU1) may be blocked by two parallel blocking chains, one starting from GPU0 and the other starting from GPU1, that are encountered serially.

To correctly handle this case, COMPUTEBLOCKINGCHAIN checks whether a migration request is blocked on both its source and destination GPUs. This is accomplished by pushing a migration request onto stack a second time (e.g., at line 8 in Figure 4.12(a), as well as on line 4 in Figure 4.12(d)). This realizes the following behavior: if COMPUTEBLOCKINGCHAIN reaches the potential end to a blocking chain, the procedure “rolls back” to a prior encountered migration request that may still be blocked by requests in requestStirng, and attempts to construct a parallel blocking chain. This parallel chain is naturally appended to the already constructed partial chain. This roll-back-and-continue behavior is implemented through lines 12 through 14. At line 12, the procedure knows that request is not blocked by next. If stack is not empty, then the procedure checks whether next blocks the newly exposed request on the top of stack—this request must

185 be a migration request, because all non-migration requests, which may only appear in stack once, are popped from stack immediately after they are pushed onto stack (if requests in requestString remain to be evaluated).

Let us reconsider the prior hypothetical situation under the correct version of COMPUTEBLOCK-

INGCHAIN. COMPUTEBLOCKINGCHAIN is called with stack = {(M,GPU0 → GPU1),(M,GPU0 → GPU1)} and requestString = {(I,GPU0),(O,GPU1)}. The ﬁrst instance of the request (M,GPU0 → GPU1) in stack is popped on line 3, and the procedure ﬁnds that request (I,GPU0) indeed blocks the migration request. We recurse on line 11. On this call, stack = {(I,GPU0),(M,GPU0 → GPU1)} and requestString = {(O,GPU1)}.

Request (I,GPU0) is popped from stack. At line 5, the condition evaluates to false, since requests (I,GPU0) and (O,GPU1) do not contend for the same CEs. Consequently, the procedure executes line 14, since stack is not empty (it still holds request (M,GPU0 → GPU1)). The procedure recurses once again. Now, stack =

{(M,GPU0 → GPU1)} and requestString = {(O,GPU1)}. At line 5, the procedure determines that request

(O,GPU1) blocks (M,GPU0 → GPU1), since these requests contend for the same outbound CE of GPU1.

As the recursive calls to COMPUTEBLOCKINGCHAIN unwind, the blocking chain {(I,GPU0),(O,GPU1)} is constructed for initial request (M,GPU0 → GPU1).

The runtime complexity of an initial call to COMPUTEBLOCKINGCHAIN (e.g., in insets (b), (c), and (d) of Figure 4.12) is a function of the request issued by job Ji and the length of the input request string. Each scenario we evaluate contains ρ ·g−1 requests, so each corresponding input request string is ρ ·g−1 requests in length. Every request processed by COMPUTEBLOCKINGCHAIN, including the request issued by job Ji, appears on stack at most twice (e.g., when it is a migration request). Therefore, the stack may contain up to

2 · ρ · g requests. COMPUTEBLOCKINGCHAIN processes one request from stack each time it is called. Thus, we may upper-bound the number of calls to COMPUTEBLOCKINGCHAIN by 2 · ρ · g. Since the procedure contains no loops, and all comparison, stack, and list concatenation operations are O(1) in complexity, the runtime complexity of COMPUTEBLOCKINGCHAIN is O(ρ · g).

The insightful reader may have noticed that COMPUTEBLOCKINGCHAIN does not always ﬁnd the longest blocking chain for a given scenario. Let us return to our prior scenario, with the addition of a null request in requestString. Suppose COMPUTEBLOCKINGCHAIN is called with stack = {(M,GPU0 →

GPU1),(M,GPU0 → GPU1)} and requestString = {(I,GPU0),(N,−),(O,GPU1)}. The null request in requestString causes the procedure to terminate “early” and return the blocking chain {(I,GPU0)} for the migration request. Since a null request can never interfere with another request, one may suppose we should eliminate all null requests from requestString prior to processing. If this preprocessing step were taken, then COM-

186 PUTEBLOCKINGCHAIN would return the chain {(I,GPU0),(O,GPU1)} for the initial request (M,GPU0 →

GPU1). However, recall that our brute-force approach for ﬁnding all possible blocking chains evaluates all permutations of request types and request orders. Although COMPUTEBLOCKINGCHAIN returns the chain {(I,GPU0)} for request (M,GPU0 → GPU1) with the request string {(I,GPU0),(N,−),(O,GPU1)}, at some point, we also evaluate the request string {(I,GPU0),(O,GPU1),(N,−)} for the same migration request. With the null request at the end of the request string, COMPUTEBLOCKINGCHAIN returns the chain {(I,GPU0),(O,GPU1)}. This brute-force approach also handles permuted request strings such as

{(I,GPU0),(O,GPU1),(I,GPU0),(O,GPU1)} and {(I,GPU0),(I,GPU0),(O,GPU1),(O,GPU1)}. For a migration request (M,GPU0 → GPU1),COMPUTEBLOCKINGCHAIN returns the blocking chains {(I,GPU0),

(O,GPU1)} and {(I,GPU0),(I,GPU0),(O,GPU1),(O,GPU1)} for the respective request permutations.

Blocking analysis with blocking chains. We now discuss how we incorporate blocking chains, which have been computed ofﬂine, into schedulability tests. Every CE request issued by job Ji may be blocked by any

CE one of the possible blocking chains, until all requests R j ∈ Ri have been counted. We compute a tighter CE bound on bi by ﬁnding the maximal sum of all possible chains that may block job Ji across all CE requests I O S CE issued by Ji. We can compute this sum using an ILP, given Ni , Ni , Ni , Ri , a table of all representative blocking chains, and assumed overhead costs. Of course, this is an undesirable solution since solving an

ILP is NP-hard in the strong sense. Can it be avoided? Does a greedy polynomial-time algorithm exist? In general, the answer is negative. Consider the following case. Suppose job Ji makes two inbound requests

I S to copy data to a GPU, so Ni = 2. Assume that job Ji has no state to migrate (Ni = 0), but another job Jj S CE does (Nj =6 0). Further suppose Ri is made up of six outbound requests and two migration requests. That CE CE,0 is, Ri = Rbi = {O, O, O, O, O, O, M, M}. Finally, suppose we consider a simple platform where g = 2 and ρ = 3. Table 4.2 depicts all representative blocking chains for such a platform with associated costs.

Which two representative blocking chains, one for each request issued by job Ji, can we construct from the

CE requests in Ri that maximizes the total blocking cost for job Ji? Under a greedy approach, we select the most costly chain that can be constructed from the available

CE,k requests in Rbi for each of the k requests issued by job Ji. We continue to select the most costly chains until a chain has been selected for each of the k requests, or until the pool of interfering requests has been

CE,k exhausted (i.e., Rbi = /0). Returning to our example, we examine the “inbound” column of Table 4.2 for a list of possible chains to select for job Ji’s two inbound requests. We ﬁrst select the chain “OOOMM,” since

187 Parameter Description O Ri set of outbound requests that may interfere with a request of job Ji I Ri set of inbound requests that may interfere with a request of job Ji P2P Ri set of migration requests that may interfere with a request of job Ji C out set of all representative outbound blocking chains for a given GPUSync configuration C in set of all representative inbound blocking chains for a given GPUSync configuration C mig set of all representative migration blocking chains for a given GPUSync configuration out th out Xj cost of the j chain in C in th in Xk cost of the k chain in C mig th mig X` cost of the ` chain in C out th out O j number of outbound requests in the j chain in C in th out O j number of inbound requests in the j chain in C mig th out O j number of migration requests in the j chain in C out th in Ik number of outbound requests in the k chain in C in th in Ik number of inbound requests in the k chain in C mig th in Ik number of migration requests in the k chain in C out th mig M` number of outbound requests in the ` chain in C in th mig M` number of inbound requests in the ` chain in C mig th mig M` number of migration requests in the ` chain in C O Ni number of outbound requests issued by job Ji I Nk number of inbound requests issued by job Ji S Ni number of migration requests issued by job Ji out th out c j number of instances of the j chain in C that blocks job Ji in th in ck number of instances of the k chain in C that blocks job Ji mig th mig c` number of instances of the ` chain in C that blocks job Ji Table 4.4: Summary ILP parameters.

CE,0 CE,1 it is the greatest-cost chain that we can construct from the requests in Rbi . Rbi = {O, O, O}. No chain CE,1 listed in the inbound column of Table 4.2 can be constructed from the requests in Rbi for job Ji’s second request. The total cost blocking cost for job Ji is 3,994µs (the cost of the chain “OOOMM”). What if we had made a non-greedy choice for Ji’s ﬁrst request? Under a non-greedy approach, suppose we select “OOOM” CE,1 for the ﬁrst chain instead of “OOOMM.” Now Rbi = {O, O, O, M}. We may select the chain “OOOM” once again for the second chain. The total cost of these chains is 3,818µs + 3,818µs = 7,636µs—greater than the greedy approach’s “bound.” This example demonstrates that the greedy approach fails to provide valid

CE,k upper-bounds on blocking costs. The fault in the approach is that it may prematurely exhausted Rbi of migration requests. This prevents the construction of subsequent chains that require them. We now describe

gpu the ILP, solved once for each Ti ∈ T , that we use to obtain correct bounds.

We begin by defining the variables of our ILP, which are summarized in Table 4.4. Let C out denote the set of all representative outbound blocking chains for a given GPUSync configuration defined by ρ and g. For example, C out is the set of outbound blocking chains in Table 4.2 when ρ = 3 and g = 2. Similarly, let C in

188 and C mig denote the set of all representative inbound and migration blocking chains for the same GPUSync conﬁguration, respectively. We count the number of instances of the jth outbound blocking chain in C out that

out interfere with inbound requests of job Ji with the integer variable c j . We count the number of instances of th in the k inbound blocking chain in C that interfere with inbound requests of job Ji with the integer variable

in th mig ck . Similarly, we count the number of instances of the ` migration blocking chain in C that interfere with mig migration requests of job Ji with the integer variable c` . We now deﬁne the constants that appear in our ILP. We count the number of outbound, inbound, and migration requests that appear in the jth outbound blocking

out out out out chain with the constants O j , Ij , and Mj , respectively. For example, for the chain “IIOOM,” I0 = 2, out out O0 = 2, and M0 = 1, where each value corresponds to the number of inbound, outbound, and migration in in in th requests that appear in the chain. We deﬁne similar constants, Ok , Ik , and Mk for the k outbound blocking mig mig mig th chain, as well as O` , I` , and M` for the ` migration blocking chain. Several additional constants are derived from the parameters of job Ji and the set of CE requests that may interfere with a CE request of Ji.

O I S The number of outbound, inbound, and migration requests issued by job Ji is bounded by Ni , Ni , and Ni , respectively. We denote the total number of inbound, outbound, and migration requests that may interfere

I O P2P with any CE request of job Ji by | Ri |, | Ri |, and | Ri |, respectively. We now derive the coefﬁcients that we use in the objective function of our ILP. Each coefﬁcient denotes the cost of a single instance of a

out th blocking chain. The coefﬁcient Xj represents the cost of the j outbound blocking chain. We compute out the value of Xj from our empirical measurements and types of requests that make up each blocking chain. More precisely,

out O out I out P2P out Xj , X · O j + X · Ij + X · Mj . (4.20)

in th in Similarly, Xk represents the cost of the k inbound blocking chain. The value of Xk is computed by the equation

in O in I in P2P in Xk , X · Ok + X · Ik + X · Mk . (4.21)

mig th mig Finally, X` represents the cost of the ` migration blocking chain. The value of X` is computed by the equation

mig O mig I mig P2P mig X` , X · O` + X · I` + X · M` . (4.22)

189 With the above variables, constants, and coefﬁcients deﬁned, we now present the ILP we use to bound the pi-blocking any job of task Ti may experience when issuing copy engine requests. We take the solution of

CE this ILP as the value of bi .

|C out|−1 |C in|−1 |C mig|−1 out out in in mig mig Maximize: ∑ Xj · c j + ∑ Xk · ck + ∑ X` · c` (4.23) j=0 k=0 `=0 |C out|−1 out O Subject to: ∑ c j ≤ Ni , (4.24) j=0 |C in|−1 in I ∑ ck ≤ Ni , (4.25) k=0 |C mig|−1 mig S ∑ c` ≤ Ni , (4.26) `=0 |C out|−1 |C in|−1 |C mig|−1 out out in in mig mig O ∑ Ij · c j + ∑ Ik · ck + ∑ I` · c` ≤| Ri |, (4.27) j=0 k=0 `=0 |C out|−1 |C in|−1 |C mig|−1 out out in in mig mig I ∑ O j · c j + ∑ Ok · ck + ∑ O` · c` ≤| Ri |, (4.28) j=0 k=0 `=0 |C out|−1 |C in|−1 |C mig|−1 out out in in mig mig P2P ∑ Mj · c j + ∑ Mk · ck + ∑ M` · c` ≤| Ri | . (4.29) j=0 k=0 `=0

Inequality (4.24) constrains the total number of outbound chains that may interfere with job Ji to the number of outbound CE requests issued by Ji. Similarly, Inequalities (4.25) and (4.26) constrain the total number of inbound and migration chains that may interfere with job Ji to the number of inbound and migration

CE requests issued by Ji, respectively. Inequalities (4.27), (4.28), and (4.29) constrain the number CE requests that are used to compose blocking chains, by type, to the maximum number of such requests that may interfere

P2P with job Ji. For example, suppose no task in T maintains state on a GPU. In this case, | Ri |= 0, since no job ever issues a migration request. Under these conditions, the constraint represented by Inequality (4.29)

CE ensures that no chain that contains an “M” is ever included in the calculation of bi . We make a noteworthy compromise in the above ILP. Observe that the cost of each blocking chain is computed in terms integral numbers of XI, XO, XP2P. These values represent the cost of transmitting a chunk of data. For example, the costs reﬂected in Table 4.2 are derived from the costs for 1MB chunks.

What happens if a job transmits only a fractional chunk of data? The ILP will compute a pessimistic bound

CE for bi . Although it may be possible to extend the above ILP to account for fractional chunks, such a

190 program is likely to become exceedingly complicated. Alternatively, one may assume smaller chunk sizes to reduce pessimism. This approach increases the number of CE requests issued by each task, increasing the computational complexity of the ILP. Moreover, smaller chunk sizes incur greater overheads due to

DMA setup costs (recall Observation 5 from Section 4.3.1.2). In large scale schedulability experiments, such as the ones we present later in Section 4.3.6, we ﬁnd that our ILP already boarders on the edge of intractability within the bounds of the resources available to us on our university compute cluster, so we accept the limitations of the ILP. For smaller scale experiments, or for a system designer attempting to provision or validate a handful of task sets, more complex accounting methods may be considered feasible.

In contrast to the ILP, we can trivially support fractional chunk sizes in the request interference function,

CE xif (Ti,Tj,`k) (Equation (4.13)). As such, the constructed set Ri may incorporate fractional costs. For task sets that transmit little data, bounds computed by Equation (4.19) may actually produce less pessimistic bounds than our ILP.

Detailed analysis for copy engine lock requests with system memory migration. Detailed analysis of pi-blocking for CEs under system memory migration is easier to conceptualize and compute since there can be no transitive pi-blocking. We present detailed bounds for GPUs with two CEs ﬁrst. As before, we denote

CE I O pi-blocking job Ji experiences due to its CE requests with the term bi . We redeﬁne the terms Ri , Ri , and CE Ri as needed.

CE CEI CEO I We compute bi in two parts: bi and bi . Let Ri denote the sorted set of requests for copying kernel I I input data and task state to a GPU that may interfere with a similar request of job Ji: Ri , txif (Ti,` ). The total worst-case pi-blocking experienced by Ji while waiting to receive the requested CE lock when system memory migrations are used with dual-CE GPUs is bounded by

CEI bi = ∑ L j, (4.30) I S I R j∈top((ρ−1)·(Ni +Ni ), Ri )

I where L j is equal to the length of the associated CE operation (e.g., X ).

O Let Ri denote the sorted set of requests for copying kernel output data and task state from a GPU that O O may interfere with a similar request of job Ji: Ri , txif (Ti,` ). The total worst-case pi-blocking experienced by Ji while waiting to receive the requested CE lock when system memory migrations are used with dual-CE

191 GPUs is bounded by

CEO bi = ∑ L j, (4.31) O S O R j∈top((ρ−1)·(Ni +Ni ), Ri )

O where L j is equal to the length of the associated CE operation (e.g., X ).

By construction, the total worst-case pi-blocking experienced by Ji while waiting for a CE lock when system memory migrations are used with dual-CE GPUs is bounded by

CE CEI CEO bi = bi + bi . (4.32)

Unlike the more complicated analysis for bounds when peer-to-peer migrations are used, bounds for inbound and outbound CEs are completely isolated from one another.

CE For GPUs with a single CE, we combine input, output, and state operations to compute bi jointly. In CE I S O this case, let Ri , Ri Ri , denoting the sorted set of interfering requests for the CE of any single GPU.

The total worst-case pi-blocking experienced by Ji while waiting to receive the requested CE lock when system memory migrations are used with GPUs with a single CE is bounded by

CE bi = ∑ L j, (4.33) I O S CE R j∈top((ρ−1)·(Ni +Ni +2·Ni ), Ri )

I O where L j is equal to the length of the associated CE operation (e.g., X or X ). Migration operations are counted twice, since state data is copied twice: once to system memory and once to GPU memory.

Obtaining tighter bounds in special cases. Before concluding this section, we discuss special cases where we can improve our engine lock blocking analysis. The analysis presented thus far has assumed a worst-case scenario where all of the ρ · g tokens are held at once, since this results in maximum engine lock contention. However, there are two conditions where this degree of contention is impossible. The first condition is gpu gpu gpu straightforward: if the number of tasks in Ta is less than ρ · g, i.e., | Ta |< ρ · g, then at most | Ta | tokens may be held at a given time instant. The second condition is more nuanced. If the GPU Allocator is configured like the CK-OMLP, then Rule C1 (i.e., the donation-at-job-release rule) may limit the number simultaneously held tokens. Specifically, Rule C1 ensures that at most cba jobs may hold tokens at a time, gpu where cba denotes the number of CPUs where the tasks in Ta may execute.

192 We integrate these special cases into our engine lock blocking analysis. Let the variable Z take a boolean value indicating whether the GPU Allocator is conﬁgured as the CK-OMLP (Z = 1) or not (Z = 0). We make the following substitutions our blocking calculations. First, we substitute all instances of “(ρ − 1)” with gpu “(min(ρ, | Ta |, Z ·cba +¬Z ·ρ)−1)” in Equations (4.10), (4.18), (4.30), (4.31), and (4.33). Also, we substi- gpu tute all instances of “(ρ ·g−1)” with “(min(ρ, | Ta |, Z ·cba +¬Z ·ρ)·g−1)” in Equations (4.8), (4.9), (4.16), and (4.17). Finally, for detailed blocking analysis of copy engine requests when peer-to-peer migrations are used, we further constrain the objective function expressed by Equation (4.23) to prevent the inclusion of gpu representative blocking chains greater than (min(ρ, | Ta |, Z · cba + ¬Z · ρ) · g − 1) requests in length.

This concludes our detailed blocking analysis for engine locks.

4.3.4.4 Detailed Blocking Analysis for the GPU Allocator

In this section, we discuss detailed pi-blocking analysis for the GPU Allocator. After we have performed

K engine lock pi-blocking analysis, we use Equation (4.5) to bound the token critical section length, Li , gpu of each Ti ∈ T . We may use these bounds in the coarse-grain blocking analysis that we discussed in Section 2.1.7.2 or 2.1.7.3, provided that the GPU Allocator is conﬁgured as the R2DGLP or CK-OMLP, respectively. However, detailed blocking analysis for the GPU Allocator will yield better schedulability results. We begin by presenting detailed analysis for the GPU Allocator when it is conﬁgured as the R2DGLP.

We then examine the case where the GPU Allocator is conﬁgured as the CK-OMLP.

2 gpu Detailed analysis for token request pi-blocking under the R DGLP. Let | Ta | denote the number of

th 2 GPU-using tasks in the a GPU cluster, which is managed by a single instance of the R DGLP. Let νi denote gpu the maximum number of token critical sections that may block a token request of task Ti ∈ Ta . As described by Ward et al. (2012), νi depends upon the number of tokens, task that request tokens, and CPU and GPU cluster sizes. Given these parameters, νi may be bound using one of the equations for each of the three following cases:

gpu 1. If | Ta |≤ ρ · g, then token requests are trivially satisﬁed, since there is always an available token for any requesting task:

νi = 0. (4.34)

193 gpu 2 2. If ρ · g <| Ta |≤ c, then no request ever overﬂows into the R DGLP’s PQ or DQ (every request enters an FQ immediately), so gpu | Ta | −1 ν = . (4.35) i ρ · g

Equation (4.35) assumes that the GPU Allocator simply load-balances the number of pending requests

among FIFO queues {FQ0,··· ,FQρ·g−1}. The GPU Allocator behaves in this manner if heuristics are disabled. However, if the GPU Allocator employs heuristics to reduce migration overheads, then the

number of requests in each FQ may become unbalanced—some queues are full while others are empty.

We may account for this by pessimistically assuming a bound on νi that is based upon maximum FQ length: c gpu νi = min , | T | − 1. (4.36) ρ · g a

If a task set is schedulable when Equation (4.35) is used in analysis, but not schedulable when

Equation (4.36) is used, the system designer may reduce maximum FQ length from dc/(ρ · g)e to

gpu 12 as little as d| Ta | /(ρ · g)e. We may make this alteration to the GPU Allocator without affecting gpu gpu optimality, since | Ta |≤ c. When the maximum FQ length is constrained to d| Ta | /(ρ · g)e, Equation (4.35) holds, even if the GPU Allocator employs heuristics. The choice to reduce maximum

FQ length reﬂects a tradeoff between analytical schedulability and potential beneﬁts in average-case

runtime performance.

gpu 3. If | Ta |> c, then c ν = 2 · − 1. (4.37) i ρ · g

Equation (4.37) reﬂects the bound given by Equation (2.34) of Section 2.1.7.2.

K K Once νi has been determined for each task, we may compute Li . Let Ri denote the sorted set of token gpu requests that may interfere with a token request of a job of Ti. When every task Ti ∈ T requests a token

K K K at most once per job, then Ri , txif (Ti,` ), where ` denotes the requested token. The total worst-case pi-blocking experienced by job Ji while waiting for a token is given by

K bi = ∑ L j, (4.38) K R j∈top(νi, Ri )

12Our implementation of GPUSync supports this degree of conﬁgurability.

194 where L j denotes a bound on the token critical section length of task Tj.

Detailed analysis for token request pi-blocking under the CK-OMLP. We use the detailed analysis provided by Brandenburg (2011b) for the CK-OMLP in our analysis of the GPU Allocator when it is conﬁgured as the CK-OMLP. Recall from Section 2.1.7.3 that jobs experience both direct and indirect

gpu pi-blocking under the CK-OMLP. Only jobs of tasks Ti ∈ T may experience direct pi-blocking. All jobs may experience indirect pi-blocking.

We bound direct pi-blocking under the CK-OMLP with the following equations:

 [ K top(c, xif (Ti,Tj,` )) if Ti ∈/ Ta  K  Tj∈Ta Ri,a , (4.39)  [ K top(c − 1, xif (Ti,Tj,` )) if Ti ∈ Ta ,  Tj∈Ta\{Ti} m/c−1 K [ K Ri , Ri,a , (4.40) a=0

Kdirect bi = ∑ L j . (4.41) l c m K R j∈top( ρ·g −1,Ri )

Equation (4.39) computes the set of token requests, in the worst-case, of each CPU cluster that may directly

K interfere with a request of Ti. This set of per-cluster requests is denoted by Ri,a. Under Rule C1 (i.e., the donation-at-job-release rule) of the CK-OMLP, the number of such requests is bound by c for remote clusters

(Ti ∈/ Ta) and (c − 1) for the local cluster (Ti ∈ Ta). This is represented by the two cases in Equation (4.39).

K Ri,a = /0 if the tasks Tj ∈ Ta do not share a GPU cluster with tasks Ti ∈ Tb. Equation (4.40) combines the K per-cluster sets of interfering requests into one, Ri . Finally, Equation (4.41) computes a bound on direct pi-blocking experienced by task Ti, by summing the (dc/(ρ · g)e − 1)-longest token critical sections that may directly interfere with a token request of task Ti. We now bound indirect pi-blocking due to token requests. A job is pi-blocked indirectly while it donates its priority to another task. The time a job acts as a priority donor is bounded by the time its donee is blocked, plus the critical section length of the donee itself. In the worst-case, this is bounded by

Kindirect n Kdirect o bi = max b j + L j | Tj ∈ Ta\{Ti} . (4.42)

195 With bounds on both direct and indirect pi-blocking computed, we may bound total pi-blocking experienced by any task due to token requests with the following equation:

K Kdirect Kindirect bi = bi + bi . (4.43)

This concludes our discussion of detailed blocking analysis for GPUSync.

4.3.5 Overhead Accounting

We now discuss the methods we use to integrate GPU-related overheads into overhead-aware schedulability analysis. We follow the preemption-centric interrupt accounting method that we discussed in Section 2.1.8.

We use all of the formulations described therein to account for non-GPU-related overheads. In this section, we describe the enhancements we make to this analysis to account for GPU-related overheads. Specifically, those relating to GPU interrupt processing. (We account for DMA-related overheads by using the data we gathered in Section 4.3.1.2 to determine values for ∆cpd, XI, XO, and XP2P.) We first discuss accounting for top-half interrupt processing. We then discuss the inflation of critical sections under GPUSync to account for locking-protocol-related self-suspensions and bottom-half interrupt processing.

Before we begin with our overhead analysis, let us deﬁne several terms and equations. We begin by rewriting the preemption-centric equation for job execution time inﬂation, (i.e., Equation (2.42)), in the following manner:

sch cxs cpd eˆi , ei + 2 · (∆ + ∆ ) + ∆ (4.44) eˆi e0 = + 2cpre + ∆ipi (4.45) i tck irq 1 − u0 − ∑1≤i≤n ui

Here, in Equation (4.45), we represent the numerator of the fraction that appears in Equation (2.42) with eî, defined by Equation (4.44). We will incrementally inflate eî to account for various GPUSync overheads. We refer to values that are computed in each incremental step with a superscript, such that the superscript value

[1] indicated within the square brackets matches the labeled step that defined the inflated value (e.g.,e î ).

Deﬁnition 4.5. Let Tba denote the set of tasks that are scheduled on the CPUs of the CPU clusters that gpu are associated with the GPU cluster of Ta . Tba may be made up of tasks from different CPU clusters

196 when GPU clusters are shared among CPU clusters (i.e., (P,G,∗), (P,C,∗), (P,CP2P,∗), and (C,G,∗) cluster conﬁgurations).

Deﬁnition 4.6. Let γi denote the maximum number of times Ti ∈ Tba performs a GPU engine operation (e.g., executes a kernel or transmits a chunk of data). Under the following analysis, we assume that all GPU-using tasks are conﬁgured to suspend while waiting for GPU operations to complete. Thus, each job of Ti causes at most γi interrupts to be raised by a GPU, where interrupt signals the completion of one operation.

Deﬁnition 4.7. Let λ denote the number of engine locks assigned to each GPU. The value of λ expresses the maximum number of simultaneous operations that may be carried out by a single GPU.13

Deﬁnition 4.8. Let ∆top denote the execution cost of a top-half of a single GPU interrupt.

Deﬁnition 4.9. Let ∆bot denote the execution cost of a top-half of a single GPU interrupt.

We now use the above equations and terms to account for GPU-related overheads.

4.3.5.1 Accounting for Interrupt Top-Halves

We now account for top-half interrupt processing overheads. Interrupts always execute with maximum priority, so accounting for priority-inversions due to the top-halves of GPU interrupts is straightforward: we assume that a job Ji,u is affected (or “hit”) by every GPU top-half that may be raised while Ji,u executes.

We assume that GPU interrupts are arbitrarily delivered to CPUs where tasks Ti ∈ Tba may execute—other CPUs are shielded from processing these interrupts. We compute the total number of these interrupts with the following equation:

Hi = ∑ γ j · tif (Ti,Tj). (4.46) Tj∈Tba\{Ti}

Next, we inﬂate job execution cost to place an upper bound on the burden of processing GPU interrupt top-halves:

[1] top eˆi = eˆi + Hi · ∆ . (4.47)

This bound is very pessimistic, but it is also safe.

13This statement does not hold if multiple kernels are allowed to execute simultaneously on an execution engine. However, GPUSync’s execution engine locks explicitly forbid this possibility.

197 4.3.5.2 Accounting for Interrupt Bottom-Halves

We now consider overheads associated with interrupt bottom-half processing. We account for these overheads in two ways:

1. Inﬂate engine request lengths. Bottom-half processing overheads may be incurred by a job Ji,u while it holds an engine lock, effectively lengthening the engine lock critical section. This may indirectly

delay any job waiting to obtain the engine lock in question. This must be accounted for in pi-blocking

analysis. We accomplish this by inﬂating engine request lengths prior to pi-blocking analysis.

2. Inﬂate job WCET. Bottom-half processing overheads may be incurred directly by a job Ji,u. These

overheads represent the cost of OS and other system work that Ji,u may be required to perform.

We first consider request length inflation and then job WCET inflation.

Request length inﬂation. We begin with analysis of engine request lengths. The engine request length of a job Ji,u can be affected by the GPU operations of other jobs assigned to the same GPU. Recall that under GPUSync, all interrupt bottom-halves of each GPU are processed serially by a per-GPU klmirqd thread. This thread inherits the maximum priority of any task that has suspended while waiting for a GPU operation to complete. This ensures that all priority inversions are bounded. However, this does not isolate this maximum-priority task from the GPU operations of other threads. Such a case is depicted in Figure 4.13.

In Figure 4.13(a), job Ji issues an operation to an EE within an engine critical section, and Ji suspends at time t + 5. A number of bottom-halves are processed between time t + 16 and t + 53 by the klmirqd thread at job Ji’s priority—we examine these later when we discuss WCET inﬂation. At time t +64, three events occur simultaneously: (i) a DMA operation of job Jk on CE0 completes; (ii) a DMA operation of job Jj on CE1 completes; and (iii) the GPU kernel of job Ji completes. As a result, three GPU interrupts are raised in quick succession. In Figure 4.13(b), we examine what occurs next in greater detail (we change the timescale in this

ﬁgure to ease presentation). Here, CPU1 processes the interrupt top-halves for the interrupts raised by the

GPU. The completion of the ﬁrst top-half at time s + 4 triggers an IPI to be sent to CPU0 to wake up and schedule the associated klmirqd thread. This IPI is received at time s + 13. This triggers the scheduler on

sch cxs CPU0 to run, and Jl is preempted. This incurs both a scheduling (∆ ) and a context switch (∆ ) overhead. At time s + 21, the klmirqd thread begins processing pending bottom-halves. By the time the klmirqd thread begins execution, the remaining GPU top-halves have already been enqueued for processing. The klmirqd

198 Release Complete Block Unblock

CPU Overhead Off-Chip Overhead

Jk Jk Jk Jk CE 0

CE Jj Jj Jj Jj 1

EE Ji

bot bot bot bot bot bot bot bot bot CPU Ji Jl Ji 0 Jj Jk Jj Jk Jj Jk Jk Jj Ji time t +0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 (a) High-level view of bottom-half overhead processing.

ipi Off-Chip

top top top CPU1 Jk Jj Ji klmirqd

bot bot bot Jl sch cxs sch cxs Ji CPU0 Jk Jj Ji time s +0 5 10 15 20 25 30 35 40 45 50 (b) Scenario where overheads increase engine critical section length.

Figure 4.13: Schedules with overheads due to bottom-half interrupt processing.

199 thread executes each bottom-half, completing at time s + 33. Job Ji is unblocked once its bottom-half is

14 processed, so Ji is scheduled next on CPU0, resulting in additional scheduling and context switch overheads.

We assume that a system designer incorporates the top-half and bottom-half processing overheads into the provisioned execution time of job Ji’s own GPU kernel. That is, we do not consider job Ji to be delayed by ∆top or ∆bot (i.e., the top- and bottom-half overheads of J ). However, we do consider job J to be delayed Ji Ji i i for the duration that its bottom-half is ready for processing, but is not scheduled. IPIs and the bottom-halves of other jobs cause these delays. In Figure 4.13(b), we see that the processing of Ji’s bottom-half can be delayed by up to one IPI and two scheduling and context switch operations. Also, in this particular example, job Ji is delayed by two top- and bottom-halves (those of jobs Jk and Jj). In general, job Ji may be delayed by up to min(λ,ρ) − 1 top- and bottom-halves.

To account for these overheads, we inﬂate each engine request length, Li,u,k, of job Ji,u using the following equation:

[2] bot top sch csx ipi Li,u,k , Li,u,k + (min(λ,ρ) − 1) · (∆ + ∆ ) + 2(∆ + ∆ ) + 2∆ . (4.48)

We charge for two IPIs in the above equation. The ﬁrst IPI overhead accounts for the situation we observed in Figure 4.13(b). The second IPI charge accounts for the situation where job Ji is scheduled on a different CPU than the klmirqd thread. We may drop the IPI (∆ipi) component of Equation (4.48) when both CPUs and

GPUs are partitioned (i.e., (P,P,∗)), since GPU top-halves, the klmirqd thread, and job Ji are guaranteed to be scheduled on the same CPU.

Overheads due to scheduling klmirqd threads are not the only scheduling overheads to consider. Recall that in more recent version of the CUDA runtime, that user-space callback threads, one per GPU per process, are responsible for waking jobs that are suspended waiting for their GPU operation to complete. This callback thread is only active while it wakes the suspended thread, and it sleeps otherwise. GPUSync schedules the callback thread with the current priority of the corresponding job Ji, but only while Ji is suspended waiting for a GPU operation to complete. We assume the execution cost of the callback thread is already captured by the provisioning of Li,u,k. However, we must include thread scheduling costs of the callback thread, since these overheads delay the waking of job Ji. System call overheads must also be considered, since the callback thread executes in user-space, and wakes up the suspended task through a system call. These overheads are

14In this example, we ignore the effects of any GPGPU runtime callback threads. In actuality, in GPGPU runtimes where callback threads are used, a callback thread of job Ji would be scheduled at time t = 41, not job Ji itself. We examine the effects of callback threads, shortly.

200 CPU Overhead Off-Chip Overhead

ipi ipi Off-Chip

klmirqd bot sch cxs Ji CPU1 Ji callback

sch cxs sco Ji sci CPU0 time t +0 5 10 15 20 25 30 35 40 45 50 55 60 65

Figure 4.14: Schedule depicting callback overheads. depicted in Figure 4.14. Figure 4.14 begins with the bottom-half processing to provide a frame of reference.

The overheads prior to time t + 21 have already been accounted for by Equation (4.48). At time t + 21, the callback thread returns from any prior-made system call, as indicated by the system-call-out overhead, ∆sco. Since the callback thread relinquishes any inherited priority within a system call the moment it wakes up a job, the callback thread may still be within the OS when it is scheduled at time t + 21. At time t + 34, the callback thread makes the system call to wake job Ji. This results in an IPI to schedule job Ji on CPU1—in general, we cannot guarantee that job Ji will be scheduled on the same CPU as the callback thread. Job Ji is ﬁnally scheduled at time t + 55, where it can release its engine lock, completing its engine request. We must inﬂate engine request lengths to account for delays due to these callback-related overheads. We do so with the following equation:

[3] [2] sch csx ipi sci Li,u,k , Li,u,k + ∆ + ∆ + ∆ + ∆ . (4.49)

Here, we do not charge IPI, scheduling, and context switch costs to schedule the callback thread itself (i.e., events prior to time t + 21 in Figure 4.14), since these are already captured by Equation (4.48). However, we must still account for the overheads to wake and schedule job Ji; we do so by inﬂating the engine request

sch csx ipi sci sco lengths of Ji,u by ∆ + ∆ + ∆ . We also inﬂate the request length by system call in and out (∆ and ∆ , respectively) to account for system call overheads. As with Equation (4.48), we may also drop ∆ipi from

Equation (4.49) when both CPUs and GPUs are partitioned (i.e., (P,P,∗)).

201 This completes the description of steps we use to inﬂate engine request lengths. Please note that additional inﬂations are necessary to account for standard locking protocol overheads by applying Equation (2.46) to both engine and token lock requests. After this has been done, bounds on pi-blocking may be computed

(using the blocking analysis of Section 4.3.4.3) using the inﬂated request lengths.

WCET inﬂation. The incorporation of inﬂated engine request lengths in blocking analysis models how bottom-half processing overheads incurred by job Ji can increase the pi-blocking experienced by job Jv. We must also charge overheads caused by bottom-half and callback thread processing to Ji itself. In Figure 4.13(a), we see several bottom-halves that are processed by a klmirqd daemon in the interval

[t + 16,t + 52). These are scheduled under job Ji’s priority. Under usual circumstances, we inflate the provisioned execution time of job Ji to cover any work performed under Ji’s priority. However, when these bottom-halves are scheduled on a CPU concurrently with job Ji when it is scheduled on a GPU engine, there is a “loophole” in suspension-oblivious analysis that we may exploit to cover bottom-half execution costs without needing to inflate job Ji’s provisioned execution time. Under suspension-oblivious analysis, all GPU execution time of job Ji is masked analytically by fictitious CPU demand. For example, in Figure 4.13(a), the GPU kernel of job Ji executes during the time interval [t + 5,t + 64), for a total of 59 time units. Under suspension-oblivious analysis, job Ji is correspondingly provisioned with 59 time units of CPU execution time. This CPU budget may only be consumed while job Ji executes on a GPU, but Ji cannot execute on a CPU during this time, since it is blocked on the completion of a GPU operation. However, we may schedule other useful work under job Ji’s priority during this time, making use of Ji’s otherwise unusable budget. This is the loophole in suspension-oblivious analysis we exploit to cover the costs of processing bottom-halves under job Ji’s priority while Ji executes on a GPU.

Job Ji must only be provisioned with additional CPU budget to cover GPU-related overheads that the job may incur while the job is not scheduled on a GPU. We already examined these overheads in the derivations of Equations (4.48) and (4.49). We now apply them to inﬂate the provisioned execution time of job Ji instead of engine request lengths. We do so using the following two equations to account for bottom-half and callback processing, respectively:

[4] bot sch csx ipi eˆi = eˆi + γi · (min(λ,ρ) · ∆ + 2(∆ + ∆ ) + 2∆ ), (4.50)

[5] [4] sch csx sci sco ipi eˆi = eˆi + γi · (2(∆ + ∆ ) + ∆ + ∆ + ∆ ). (4.51)

202 For each of job Ji’s of GPU operations, denoted by γi, Equation (4.50) charges the cost of processing min(λ,ρ) bottom-halves (i.e., the cost of processing (min(λ,ρ) − 1) bottom-halves of other jobs, plus the bottom-half of job Ji itself) and the cost of scheduling the klmirqd daemon. Equation (4.51) charges the cost of scheduling the callback thread, once for each of job Ji’s GPU operations. After Equations (4.50) and (4.51) have been used to account for GPU-related overheads, we must still apply Equation (2.45) to account for the standard locking protocol overheads, with ηi = γi + 1, where we add one to γi to account for a single token lock request of Ji.

4.3.5.3 Limitations

In this section, we explain and justify three pragmatic compromises we make in the analysis presented in

Sections 4.3.5.1 and 4.3.5.2.

GPU interrupt processing and cache affinity. The observant reader may notice that we have not included cache affinity loss overheads (e.g., ∆cpd or ∆cid) in Equations (4.47), (4.48), (4.49), (4.50), or (4.51). This is intentional. We expect the cache working set size of bottom-half and callback routines to be small, since these routines merely route GPU kernel completion notifications to waiting tasks—there is no heavy data processing to be done. Nevertheless, it would be desirable to include these overheads in analysis. However, we lack a straightforward method for accurately determining the associated working set sizes of GPU top-half, bottom-half, and callback processing. Moreover, we fear that the inclusion of overly pessimistic cache affinity loss overheads in these equations would render our schedulability analysis too pessimistic to be meaningful, since the costs of these overheads would be quickly compounded in blocking analysis. As a result, the differences among different GPUSync configurations that we seek to highlight and explore would be obscured.

Interrupt processing and engine request lengths. In Equation (4.48), we inﬂated each engine request length to incorporate the bottom-half processing that may occur under Ji’s priority before the pending bottom-half of Ji is itself is processed. Similarly, Brandenburg (2011b) argues that critical sections should also be inﬂated to account for every interrupt that may occur within a critical section that is arbitrated by any locking protocol that does not disable interrupts (e.g., suspension-based locking protocols). However, he also acknowledges that such an approach results in “tremendous pessimism” that may increase critical section lengths “by several orders of magnitude.” Partly motivated by a desire to explore hard real-time

203 schedulability, Brandenburg side-steps this issue by localizing all interrupt handling and scheduling decisions on a specially designated “release-master” CPU.

Although GPUSync does not forbid the use of a release-master CPU, we opt to not use one in this work for three reasons. First, a system CPU must be sacriﬁced if no real-time tasks are partitioned onto

(i.e., assigned to) the release-master CPU. This results in a compulsory CPU utilization loss of 1.0. This is something that we would rather avoid. Second, if real-time tasks are partitioned onto the release-master

CPU, then these tasks experience a high-degree of interference from all system interrupts. Schedulability analysis for this release-master partition will be very pessimistic. Finally, a release-master CPU can introduce asymmetry in CPU clusters. For example, on our twelve-core evaluation platform, if we reserved one CPU as a release-master and form CPU clusters along NUMA boundaries, then one CPU cluster would have

ﬁve CPUs, while the other would have six. This adds complications to task partitioning and schedulability analysis.

As with GPU interrupt processing and cache affinity loss, if we were to inflate critical sections to account for worst-case interrupt processing scenarios, the resulting schedulability analysis would be too pessimistic to hold any meaning for us. Instead, we compromise by accounting for interrupts by inflating job execution time (e.g., Equation (4.47)), and we do not further inflate critical section lengths. Since our stated research goal is in discovering the best methods for supporting soft real-time systems with GPUs, we feel that this more relaxed model has acceptable limitations and remains sufficiently conservative to model reality.

GPU interrupt latency. Throughout our preemption-centric analysis, we have used the overhead ∆ipi to account for delays in interprocessor interrupt latency. This overhead captures the latency between the time when CPUa sends a signal (i.e., an interrupt) to CPUb, to the time when CPUb actually receives said signal. There are analogous interrupt latencies between CPUs and GPUs. For example, there is a delay between the time a GPU raises an interrupt to notify the host platform of GPU operation completion, to the time a CPU actually receives said interrupt. We do not directly account for these overhead latencies.

We ignore GPU interrupt latency overheads because we know of no reasonable method to measure them.

For CPUs, we can directly measure IPI latencies, since CPU clocks can be synchronized—we merely log a timestamp each time an IPI is sent and received, and we take the time difference as an observed IPI latency.

Unfortunately, CPUs and GPUs do not share a synchronized clock, so we cannot apply the same methodology.

Alternatively, we could estimate GPU-related interrupt latencies with the following multi-step experiment,

204 where: (i) a GPU signals CPU; (ii) this CPU echoes back a signal to the GPU; (iii) we measure the delay between when the GPU sends its signal to when it receives the echo; (iv) we divide this delay by two to give an estimated signal latency. Unfortunately, it is not clear to us how this experiment can actually be performed. While we may craft software that triggers these signals to be sent, this software must sit atop several layers of closed-source software. The execution time of these additional layers prevents us from making accurate measurements. Moreover, the above experiment assumes that CPU-to-GPU interrupt latency is commensurate with GPU-to-CPU interrupt latency. We do not know if this assumption is well-founded.

Given these issues, we instead account for GPU-related interrupt latencies indirectly by including these overheads in the provisioned GPU execution time of GPU kernels and DMA operations. Indeed, this what was done in Section 4.3.1.2, where we characterized the cost of DMA operations. The time taken to complete each DMA operation was measured by the CPU process that issued the work. Each DMA measurement inherently includes both CPU-to-GPU and GPU-to-CPU interrupt latencies. This compromise prevents us from possibly exploiting tighter analytical methods. For example, IPI latencies are not inﬂated to account for

CPU scheduler tick overheads in preemption-centric accounting, because the message-passing mechanisms to deliver an IPI occur “off-chip” in parallel with CPU processing (see Figure 2.12). The message-passing mechanisms that lead to latencies in GPU interrupts similarly occur off-chip. However, since we implicitly incorporate these overheads into engine request lengths, these overheads are treated as CPU execution time under suspension-oblivious analysis. Consequently, these overheads are incorporated into the numerator of

Equation (4.45), where they are inﬂated (by way of the denominator) to account for CPU scheduler ticks, even though CPU scheduler ticks do not affect GPU interrupt latencies in reality. Ultimately, our approach is more pessimistic, albeit safe.

This concludes our accounting of GPU overheads in schedulability analysis.

4.3.6 Schedulability Experiments

In this section, we assess tradeoffs among the conﬁguration options we described in Section 4.3.2 by presenting the results of overhead-aware schedulability studies. We randomly generated task sets of varying characteristics and tested them for schedulability using the methods described above. We now describe the experimental process we used.

205 4.3.6.1 Experimental Setup

There is a wide space of system configuration and task set parameters to explore. We evaluated each of the nine high-level configurations illustrated in Figure 4.1, plus an additional three configurations where GPUs are clustered into two clusters of four GPUs apiece. As we discussed in Section 4.3.2, these configurations are not exhaustive, but we feel they are the simplest and most practical configurations for each combination of CPU and GPU cluster configurations. For instance, in (P,P) of Figure 4.1, four partitioned CPUs have no attached GPU; these CPUs may only schedule tasks of T cpu. Such a configuration is a natural extension of existing uniprocessor, uni-GPU methods. Each considered configuration was tested with several values of ρ:

ρ = 1 to examine schedulability under exclusive GPU allocation; ρ = 3 to explore schedulability when all

GPU engines (one EE and two CEs) are given the opportunity to operate simultaneously; and ρ = 2 to see if there is a balance to strike between ρ = 1 and ρ = 3. The conﬁguration (∗,P) was also tested with ρ = ∞ since ρ’s role in facilitating migrations is moot. Random task sets for schedulability experiments were generated according to several parameters in a multistep process. Task utilizations were generated using three uniform distributions: [0.01,0.1] (light),

[0.1,0.4] (medium), and [0.5,0.9] (heavy). Task periods were generated using two uniform distributions with ranges [33ms,100ms] (moderate), and [200ms,1000ms] (long).15 Tasks were generated by selecting a utilization and period until reaching a desired task set utilization. The task set was then randomly subdivided into T gpu and T cpu. The number of tasks in T gpu was set to be: 33%, 66%, or 100% of the task set size. For tasks in T gpu, kernel execution times were generated using three uniform distributions with ranges [10%,25%], [25%,75%], and [75%,95%] of task execution time (a corresponding amount of time was subtracted from

CPU execution time). For simplicity, we model each task in T gpu as executing one kernel per job. Each such job has one GPU critical section. Input/output data sizes were generated using three values: 256KB (light),

I O 2MB (medium), and 8MB (heavy). A selected data size was evenly split between zi and zi . Task GPU state size was generated using three values: 0%, 25%, and 100% of Ti’s combined input/output data size. In order to keep our study tractable, all tasks were assigned a CPU cache working set size of 4KB. For tasks in T gpu, 5% of its CPU execution time was determined to be within the task’s single GPU critical section. Overheads and data transmission times were taken from four data sets: average-case (AC) observations in an idle system

15These periods are inspired by the sensor streams GPUs may process. Moderate periods represent video-based sensors. Long periods model slower sensors such as LIDAR.

206 (AC/I); AC observations in a loaded system (AC/L); worst-case (WC) observations in an idle system (WC/I); and WC observations in a loaded system (WC/L). Due to the extremely long tails of the distributions of observed for GPU top-half and bottom-half interrupt execution times (depicted in Figures 4.2(a) and 4.3(a), respectively), we take the 99.9th percentiles of these measurements as our “worst-case.” We presume that extreme outlier observations are the result of software bugs in the GPU driver that could be ﬁxed before a

GPU were deployed in a serious real-time system.

A unique combination of the above system configurations and task set parameters defined a set of experimental settings, 75,816 in all. Under each set of experimental parameters, for each 0.25 increment in system utilization range (0,12] (reflecting the range of system utilizations supported by our twelve-core test platform), we generated between 500 and 4,000 task sets.16 Task sets were partitioned to the CPU/GPU clusters in three phases:

Phase 1: T gpu was partitioned among the GPU clusters, using the worst-ﬁt heuristic in decreasing GPU utilization order, where GPU utilization was given by

gpu I O S gpu qi + ei + xmit(zi ,zi ,zi ) ui , . (4.52) pi

Phase 2: T gpu was then partitioned among CPU clusters, in accordance with experimental parameters, using the worst-ﬁt heuristic in decreasing utilization order, where task utilization was given by

Equation (4.1). Bounds for pi-blocking terms were calculated and incorporated into each CPU

cluster’s (estimated) total utilization.

Phase 3: T cpu was then partitioned among the CPU clusters using the worst-ﬁt heuristic in decreasing utilization order, where task utilization was calculated using Equation (2.5) (i.e., the standard

deﬁnition of task utilization under the sporadic task model).

Task sets were tested for bounded response time. Task execution time and request critical section lengths were inﬂated to incorporate system overheads and s-oblivious pi-blocking. Tardiness bounds were computed using CVA analysis of Erickson (2014) (see Section 2.1.5). Approximately 2.8 billion task sets were tested. We used the KillDevil compute cluster at the University of North Carolina, at Chapel Hill, to perform our experiments, consuming over 85,000 CPU hours on modern Intel Xeon processors (models

16After testing a minimum of 500 task sets, additional task sets were generated until average schedulability fell within a three percentage-point interval with 95% conﬁdence, or until 4,000 task sets had been tested.

207 Area(A) > Area(B) ) Rank(A) > Rank(B) 1.0 A

B Average Schedulability Average

0.0 m 0 Task Set Utilization

Figure 4.15: Illustrative ranking of conﬁguration A against conﬁguration B.

X5670 and E5-2670). Our experimental tools were implemented on top of the schedulability test toolkit

SchedCAT (Brandenburg, 2011a). Although most of SchedCAT is implemented in the Python scripting language, performance-critical paths are implemented in C/C++. Our experimental tools used the commercial

Gurobi optimization solver (Gurobi Optimization, Inc., 2015) to solve the ILP we use to bound pi-blocking for conﬁgurations with peer-to-peer migration. We point out these implementation details of our experimental tools to highlight the fact that our CPU hours were used with reasonable efﬁciency.

4.3.6.2 Results

With over 75,000 experiments, it is infeasible to compare different system configurations by examining individual schedulability curves alone. Since our primary goal is to compare the effectiveness of each configuration, we devised the following ranking method to collapse our results into something more man- ageable. For every unique combination of task set parameters, we determined a “sub-rank” for each system configuration from first to last place. These sub-rankings were determined by comparing the area under each system configuration’s schedulability curve. A larger area under the curve indicates better schedulability.

An illustrative example is shown in Figure 4.15. In this example with two system conﬁgurations A and

B, configuration A has a first-place sub-rank since the area under A’s curve is greater (i.e., more task sets were schedulable under A). A final rank for each system configuration was determined by computing for each configuration, the median, average, and standard deviation of its sub-ranks. We then ranked system

208 conﬁgurations according to median sub-rank, tie-breaking by average sub-rank. This ranking approach was applied separately to results from each of our four overhead datasets.

Tables 4.5, 4.6, 4.7, and 4.8 show conﬁguration rankings assuming worst-case loaded system overheads

(WC/L), worst-case idle system overheads (WC/I), average-case loaded system overheads (AC/L), and average-case idle system overheads (AC/I), respectively. The columns labeled “Rank” give each configuration’s final rank. Observe that each table is sorted according to final rankings. The following columns give the median, average, and standard deviation of each configuration’s sub-ranks. Entries in the columns labeled “(CPU,GPU,ρ)” identify the ranked system configuration. Here, we extend the tuple-notation from

Section 4.2 to include ρ. The next three columns give the final rank of a configuration under the other overhead data sets. For a given row, we may compare the values of these columns against each other, and the value in the “Rank” column, to discern how a system configuration’s ranking changes under different overhead conditions. In order to fit each table onto a single page, the left-half of each table gives the rankings for the top 29 ranked configurations, and the right-half gives the rankings for the remaining configurations.

We make the following observations.

Observation 13. Clustered CPU scheduling with partitioned GPUs and ρ = ∞ had the highest rank under three of the four overhead conditions.

We may observe this in the ﬁrst row of Table 4.5. We see that (C,P,∞) ranked 1st under WC/L overheads.

In the same row, we see that (C,P,∞) also ranked 1st under WC/I and AC/L overheads, and ranked 2nd under AC/I overheads. Under AC/I overheads, we see that (G,P,∞) ranks 1st . The considered overheads are lightest under AC/I assumptions, so (G,P,∞) does well. As observed by Brandenburg (2011b), global CPU scheduling provides good soft real-time schedulability, provided that overheads are low; we see this here. These results show, in the context of these experiments, that partitioned GPU scheduling, when paired with clustered or global CPU scheduling, provides the best overall performance.

Observation 14. Clustered CPU scheduling with partitioned GPUs and ρ = ∞ was not always the best conﬁguration.

We already observed that (G,P,∞) had the highest rank under AC/I overheads. However, we can also recognize this observation in the remaining tables. In Tables 4.5, 4.6, and 4.7, compare the Median and

Average sub-rank values for (C,P,∞). If (C,P,∞) always had the highest rank, then Median and Average

209 Rankings Under Worst-Case Overheads, Loaded Rank Median Avg σ (CPU,GPU,ρ) WC/I AC/I AC/L Rank Median Avg σ (CPU,GPU,ρ) WC/I AC/I AC/L 1 1 1.82 2.72 (C,P,∞) 1 2 1 30 29 28.17 8.20 (G,C4,3) 29 25 28 2 4 5.72 5.00 (C,P,2) 2 5 4 31 31 27.09 8.89 (C,C2,1) 32 34 32 P2P 3 4 5.73 5.55 (C,P,3) 3 3 3 32 32 32.44 6.11 (G,C2 ,1) 33 31 33 4 5 10.90 10.03 (G,P,∞) 4 1 2 33 33 33.14 11.94 (G,G,3) 31 27 30 P2P 5 7 9.02 8.45 (P,P,∞) 8 7 7 34 34 28.76 11.34 (C,C4 ,1) 35 36 35 6 8 15.34 13.02 (G,P,3) 6 4 5 35 34 34.30 6.16 (G,C2,1) 34 33 34 7 9 15.59 12.88 (G,P,2) 5 6 6 36 35 29.77 12.76 (C,C4,1) 37 38 37 P2P P2P 8 10 12.84 7.15 (C,C2 ,2) 7 10 8 37 35 33.78 8.55 (P,C2 ,1) 39 40 39 9 10 13.08 6.96 (C,C2,2) 9 11 9 38 37 34.30 9.34 (P,C2,1) 41 41 41 P2P 10 12 13.86 8.49 (P,P,3) 12 14 14 39 37 35.29 8.07 (G,C4 ,1) 36 35 36 11 12 13.89 8.95 (P,P,2) 10 13 16 40 38 36.17 9.01 (G,C4,1) 38 37 38 12 13 13.91 6.04 (C,P,1) 11 18 10 41 39 35.03 9.57 (P,C2,2) 43 43 42 P2P P2P 13 15 15.06 7.22 (C,C2 ,3) 14 20 17 42 39 35.37 9.46 (P,C2 ,2) 42 42 43 14 15 15.64 7.14 (C,C2,3) 17 23 18 43 40 36.27 9.54 (P,C2,3) 45 45 44 P2P 15 15 16.03 6.09 (C,C4,2) 15 12 13 44 40 36.35 9.73 (P,C2 ,3) 44 44 45 P2P 16 16 16.52 6.82 (C,C4 ,2) 13 15 15 45 44 40.36 11.83 (G,G,1) 40 39 40 P2P P2P 17 19 18.94 7.80 (C,C4 ,3) 20 29 23 46 47 45.25 6.75 (P,C4 ,1) 46 46 46 18 19 19.17 7.57 (C,C4,3) 22 28 20 47 47 45.72 6.76 (P,C4,1) 47 47 47 P2P 19 19 22.35 11.16 (G,C2 ,2) 16 8 11 48 48 46.26 6.94 (P,C4,2) 49 48 48 P2P 20 20 22.66 10.78 (G,C2,2) 18 9 12 49 48 46.50 6.79 (P,C4 ,2) 48 49 49 21 22 18.36 10.16 (P,P,1) 19 30 26 50 48 46.86 6.37 (P,C4,3) 50 50 50 P2P 22 22 23.13 9.82 (G,P,1) 21 22 19 51 49 47.14 6.76 (P,C4 ,3) 51 51 51 P2P 23 23 24.57 10.10 (G,C2 ,3) 23 17 21 52 53 52.82 1.82 (P,G,1) 52 52 53 24 23 25.54 9.72 (G,C2,3) 25 21 22 53 53 53.15 1.60 (P,G,2) 53 54 54 25 24 25.73 8.58 (G,C4,2) 26 16 24 54 53 53.15 1.67 (P,G,3) 55 55 55 P2P 26 26 25.78 8.31 (G,C4 ,2) 24 19 25 55 54 52.95 2.79 (C,G,1) 54 53 52 27 28 31.21 12.23 (G,G,2) 27 24 29 56 56 55.81 1.60 (C,G,3) 57 56 56 P2P 28 28 27.89 8.67 (G,C4 ,3) 28 26 27 57 56 55.83 1.61 (C,G,2) 56 57 57 P2P 29 29 24.99 8.25 (C,C2 ,1) 30 32 31

Table 4.5: Conﬁguration rankings under WC/L.

210 Rankings Under Worst-Case Overheads, Idle Rank Median Avg σ (CPU,GPU,ρ) AC/I AC/L WC/L Rank Median Avg σ (CPU,GPU,ρ) AC/I AC/L WC/L P2P 1 1 3.26 4.68 (C,P,∞) 2 1 1 30 30 27.17 6.89 (C,C2 ,1) 32 31 29 2 3 4.84 4.68 (C,P,2) 5 4 2 31 30 27.58 9.47 (G,G,3) 27 30 33 3 4 5.12 5.02 (C,P,3) 3 3 3 32 32 29.47 7.17 (C,C2,1) 34 32 31 P2P 4 5 8.52 7.28 (G,P,∞) 1 2 4 33 32 31.23 5.60 (G,C2 ,1) 31 33 32 5 7 11.05 8.67 (G,P,2) 6 6 7 34 34 32.88 6.26 (G,C2,1) 33 34 35 P2P 6 7 11.19 8.91 (G,P,3) 4 5 6 35 35 31.47 10.13 (C,C4 ,1) 36 35 34 P2P P2P 7 9 12.68 8.43 (C,C2 ,2) 10 8 8 36 36 34.31 8.63 (G,C4 ,1) 35 36 39 8 10 12.29 9.67 (P,P,∞) 7 7 5 37 37 32.94 11.03 (C,C4,1) 38 37 36 9 11 14.38 8.11 (C,C2,2) 11 9 9 38 38 35.61 9.63 (G,C4,1) 37 38 40 P2P 10 12 14.28 10.24 (P,P,2) 13 16 11 39 39 37.59 5.82 (P,C2 ,1) 40 39 37 11 13 14.20 6.52 (C,P,1) 18 10 12 40 40 36.72 13.48 (G,G,1) 39 40 45 12 13 15.14 9.74 (P,P,3) 14 14 10 41 40 38.40 5.18 (P,C2,1) 41 41 38 P2P P2P 13 15 15.51 6.79 (C,C4 ,2) 15 15 16 42 41 39.64 5.05 (P,C2 ,2) 42 43 42 P2P 14 15 15.62 8.76 (C,C2 ,3) 20 17 13 43 41 40.10 4.89 (P,C2,2) 43 42 41 P2P 15 15 16.10 6.74 (C,C4,2) 12 13 15 44 42 41.08 5.02 (P,C2 ,3) 44 45 44 P2P 16 17 18.79 9.02 (G,C2 ,2) 8 11 19 45 43 41.71 4.91 (P,C2,3) 45 44 43 P2P 17 18 17.82 8.01 (C,C2,3) 23 18 14 46 47 46.48 3.44 (P,C4 ,1) 46 46 46 18 19 20.36 8.85 (G,C2,2) 9 12 20 47 47 47.07 2.76 (P,C4,1) 47 47 47 P2P 19 20 18.82 10.42 (P,P,1) 30 26 21 48 48 48.24 2.12 (P,C4 ,2) 49 49 49 P2P 20 20 19.17 8.26 (C,C4 ,3) 29 23 17 49 48 48.42 1.75 (P,C4,2) 48 48 48 21 21 20.22 7.55 (G,P,1) 22 19 22 50 49 49.01 1.92 (P,C4,3) 50 50 50 P2P 22 22 20.71 7.79 (C,C4,3) 28 20 18 51 49 49.10 1.97 (P,C4 ,3) 51 51 51 P2P 23 22 22.12 8.46 (G,C2 ,3) 17 21 23 52 53 52.77 1.95 (P,G,1) 52 53 52 P2P 24 22 22.30 6.62 (G,C4 ,2) 19 25 26 53 53 53.37 1.15 (P,G,2) 54 54 53 25 23 23.97 8.08 (G,C2,3) 21 22 24 54 54 52.88 2.84 (C,G,1) 53 52 55 26 24 23.36 6.66 (G,C4,2) 16 24 25 55 54 53.47 1.31 (P,G,3) 55 55 54 27 25 25.07 8.13 (G,G,2) 24 29 27 56 56 56.29 0.80 (C,G,2) 57 57 57 P2P 28 27 24.67 7.49 (G,C4 ,3) 26 27 28 57 56 56.29 0.84 (C,G,3) 56 56 56 29 28 26.37 7.50 (G,C4,3) 25 28 30

Table 4.6: Conﬁguration rankings under WC/I.

211 Rankings Under Average-Case Overheads, Loaded Rank Median Avg σ (CPU,GPU,ρ) WC/I AC/I WC/L Rank Median Avg σ (CPU,GPU,ρ) WC/I AC/I WC/L 1 1 2.07 3.08 (C,P,∞) 1 2 1 30 30 29.02 10.89 (G,G,3) 31 27 33 P2P 2 2 3.91 5.58 (G,P,∞) 4 1 4 31 30 26.95 7.65 (C,C2 ,1) 30 32 29 3 4 5.47 4.74 (C,P,3) 3 3 3 32 32 28.98 8.17 (C,C2,1) 32 34 31 P2P 4 4 5.80 4.42 (C,P,2) 2 5 2 33 32 30.49 6.66 (G,C2 ,1) 33 31 32 5 5 9.20 9.96 (G,P,3) 6 4 6 34 34 32.50 6.95 (G,C2,1) 34 33 35 P2P 6 6 9.80 9.77 (G,P,2) 5 6 7 35 35 29.86 10.93 (C,C4 ,1) 35 36 34 P2P 7 9 11.39 8.59 (P,P,∞) 8 7 5 36 35 33.08 9.28 (G,C4 ,1) 36 35 39 P2P 8 11 13.78 7.57 (C,C2 ,2) 7 10 8 37 37 31.29 11.81 (C,C4,1) 37 38 36 9 11 13.95 7.51 (C,C2,2) 9 11 9 38 37 34.69 9.80 (G,C4,1) 38 37 40 P2P 10 15 16.32 6.87 (C,P,1) 11 18 12 39 39 38.02 6.12 (P,C2 ,1) 39 40 37 P2P 11 15 18.15 9.80 (G,C2 ,2) 16 8 19 40 39 38.50 11.84 (G,G,1) 40 39 45 12 15 18.54 10.05 (G,C2,2) 18 9 20 41 40 38.76 5.75 (P,C2,1) 41 41 38 13 16 16.66 6.44 (C,C4,2) 15 12 15 42 41 39.46 5.78 (P,C2,2) 43 43 41 P2P 14 16 17.14 8.12 (P,P,3) 12 14 10 43 41 39.63 5.94 (P,C2 ,2) 42 42 42 P2P 15 16 17.46 7.01 (C,C4 ,2) 13 15 16 44 42 40.51 5.80 (P,C2,3) 45 45 43 P2P 16 16 17.71 8.80 (P,P,2) 10 13 11 45 42 40.77 5.81 (P,C2 ,3) 44 44 44 P2P P2P 17 17 16.53 7.45 (C,C2 ,3) 14 20 13 46 47 45.99 4.67 (P,C4 ,1) 46 46 46 18 18 17.29 7.30 (C,C2,3) 17 23 14 47 47 46.46 4.17 (P,C4,1) 47 47 47 19 20 20.19 8.61 (G,P,1) 21 22 22 48 48 47.22 4.10 (P,C4,2) 49 48 48 P2P 20 21 20.36 7.37 (C,C4,3) 22 28 18 49 48 47.34 4.47 (P,C4 ,2) 48 49 49 P2P 21 21 21.32 9.89 (G,C2 ,3) 23 17 23 50 49 47.68 4.19 (P,C4,3) 50 50 50 P2P 22 21 21.64 9.89 (G,C2,3) 25 21 24 51 49 47.72 4.73 (P,C4 ,3) 51 51 51 P2P 23 21 20.15 7.32 (C,C4 ,3) 20 29 17 52 53 52.81 2.92 (C,G,1) 54 53 55 24 22 21.99 8.01 (G,C4,2) 26 16 25 53 53 53.01 2.17 (P,G,1) 52 52 52 P2P 25 22 22.31 7.67 (G,C4 ,2) 24 19 26 54 53 53.26 2.05 (P,G,2) 53 54 53 26 25 23.86 8.39 (P,P,1) 19 30 21 55 54 53.32 1.97 (P,G,3) 55 55 54 P2P 27 26 24.14 8.43 (G,C4 ,3) 28 26 28 56 56 55.86 1.84 (C,G,3) 57 56 56 28 26 24.25 8.60 (G,C4,3) 29 25 30 57 56 55.89 1.80 (C,G,2) 56 57 57 29 27 27.32 10.60 (G,G,2) 27 24 27

Table 4.7: Conﬁguration rankings under AC/L.

212 Rankings Under Average-Case Overheads, Idle Rank Median Avg σ (CPU,GPU,ρ) WC/I AC/L WC/L Rank Median Avg σ (CPU,GPU,ρ) WC/I AC/L WC/L 1 2 3.12 2.49 (G,P,∞) 4 2 4 30 25 23.74 7.54 (P,P,1) 19 26 21 P2P 2 2 3.71 4.83 (C,P,∞) 1 1 1 31 31 29.31 7.79 (G,C2 ,1) 33 33 32 P2P 3 4 5.90 5.54 (C,P,3) 3 3 3 32 32 30.84 4.16 (C,C2 ,1) 30 31 29 4 5 5.08 3.37 (G,P,3) 6 5 6 33 33 30.45 8.10 (G,C2,1) 34 34 35 5 5 6.21 5.39 (C,P,2) 2 4 2 34 33 31.86 4.63 (C,C2,1) 32 32 31 P2P 6 6 5.74 3.09 (G,P,2) 5 6 7 35 36 32.20 10.74 (G,C4 ,1) 36 36 39 P2P 7 10 13.44 9.30 (P,P,∞) 8 7 5 36 36 33.89 7.44 (C,C4 ,1) 35 35 34 P2P 8 12 15.23 8.27 (G,C2 ,2) 16 11 19 37 37 33.26 11.35 (G,C4,1) 38 38 40 9 12 15.46 8.49 (G,C2,2) 18 12 20 38 37 34.84 7.86 (C,C4,1) 37 37 36 P2P 10 12 15.60 8.50 (C,C2 ,2) 7 8 8 39 39 34.13 15.47 (G,G,1) 40 40 45 P2P 11 12 15.99 8.73 (C,C2,2) 9 9 9 40 39 38.95 3.83 (P,C2 ,1) 39 39 37 12 16 16.80 7.02 (C,C4,2) 15 13 15 41 40 39.44 3.40 (P,C2,1) 41 41 38 P2P 13 16 16.88 9.51 (P,P,2) 10 16 11 42 42 40.90 3.17 (P,C2 ,2) 42 43 42 14 17 17.69 8.74 (P,P,3) 12 14 10 43 42 40.99 3.04 (P,C2,2) 43 42 41 P2P P2P 15 17 18.04 7.24 (C,C4 ,2) 13 15 16 44 43 42.18 3.01 (P,C2 ,3) 44 45 44 16 18 18.29 6.59 (G,C4,2) 26 24 25 45 43 42.21 3.15 (P,C2,3) 45 44 43 P2P P2P 17 18 18.69 8.86 (G,C2 ,3) 23 21 23 46 47 46.35 3.31 (P,C4 ,1) 46 46 46 18 19 18.84 6.70 (C,P,1) 11 10 12 47 47 46.75 3.00 (P,C4,1) 47 47 47 P2P 19 19 19.03 7.17 (G,C4 ,2) 24 25 26 48 48 47.85 2.69 (P,C4,2) 49 48 48 P2P P2P 20 19 19.03 8.64 (C,C2 ,3) 14 17 13 49 48 47.95 2.79 (P,C4 ,2) 48 49 49 21 19 19.44 8.46 (G,C2,3) 25 22 24 50 49 48.73 2.88 (P,C4,3) 50 50 50 P2P 22 20 17.66 8.02 (G,P,1) 21 19 22 51 49 48.90 2.90 (P,C4 ,3) 51 51 51 23 20 19.83 8.21 (C,C2,3) 17 18 14 52 53 52.29 2.88 (P,G,1) 52 53 52 24 21 19.27 7.98 (G,G,2) 27 29 27 53 53 52.33 3.39 (C,G,1) 54 52 55 25 24 21.86 7.95 (G,C4,3) 29 28 30 54 53 52.97 2.52 (P,G,2) 53 54 53 P2P 26 24 21.97 7.95 (G,C4 ,3) 28 27 28 55 54 53.13 2.52 (P,G,3) 55 55 54 27 25 22.66 9.34 (G,G,3) 31 30 33 56 56 55.81 2.26 (C,G,3) 57 56 56 28 25 23.03 6.53 (C,C4,3) 22 20 18 57 56 55.83 2.24 (C,G,2) 56 57 57 P2P 29 25 23.35 6.77 (C,C4 ,3) 20 23 17

Table 4.8: Conﬁguration rankings under AC/I.

213 would both have a value of “1.” They do not. This result demonstrates that conﬁgurations other than (C,P,∞) perform better for some classes of task sets.

Observation 15. Under partitioned GPUs, schedulability tended to be maximized when ρ was large, especially when ρ = ∞.

We may observe this by scanning the system configuration columns in Tables 4.5, 4.6, 4.7, and 4.8, picking out entries matching (∗,P,∗). Observe that entires that only differ by ρ generally tend to be ranked in decreasing ρ-order. For instance, in Table 4.7, the configurations (C,P,∞), (C,P,3), (C,P,2), and (C,P,1) are ranked first, third, fourth, and tenth, respectively. There are minor exceptions to this general trend.

For example, in Table 4.5, (C,P,2) is ranked 2nd, while (C,P,3) is ranked 3rd. However, the average sub- ranks of these conﬁgurations are very close: 5.72 and 5.73, respectively. We see similar exceptions for

(∗,P,∗) configurations, where those with ρ = 2 are occasionally ranked slightly higher than the similar corresponding configuration with ρ = 3. This occurs under all overhead assumptions, so we cannot conclude that these exceptions are due to overheads alone. Despite these exceptions, the general trend still holds: under partitioned GPU scheduling, configurations with many GPU tokens perform best. This is a good property, since it motivates the use of GPUSync configurations that maximize the opportunity for parallelism at runtime.

Observation 16. With the exception of those where CPUs are partitioned, schedulability of clustered GPU conﬁgurations tended to be maximized when ρ = 2.

To see this, locate the sets of clustered GPU conﬁgurations that only differ by their values for ρ in

Tables 4.5, 4.6, 4.7, and 4.8. With the exception of those with partitioned CPUs, entries where ρ = 2 have the highest rank among similar conﬁguration that only differ by ρ. For example, in Table 4.5, (C,C4,2) is ranked

th th th 15 , while (C,C4,3) is ranked 18 , and (C,C4,1) is ranked 36 . Similar trends can be observed for rankings in the WC/L, AC/I, and AC/L columns, as well. This result is interesting because it indicates, in terms of schedulability, that there is a “sweet spot” to the number of tokens for many clustered GPU conﬁgurations that maximizes theoretical performance. As we see later in Chapter 5, the existence of token-number sweet spots can be observed in runtime performance as well (see Observation 45).

Observation 17. Peer-to-peer migrations offered better schedulability than system memory migrations.

214 With the exception of conﬁgurations that use a GPU Allocator based upon the CK-OMLP, every clustered

GPU conﬁguration where peer-to-peer migrations are used ranks higher than the similar conﬁguration that use system memory migration. In most cases, the system-memory-variant ranks closely below the corresponding

P2P th th peer-to-peer conﬁguration. For instance, (G,C2 ,2) ranks 16 while (G,C2,2) ranks 18 in Table 4.6. The differences in ranking between similarly matched conﬁgurations under the other overhead data sets can also be observed in Tables 4.5, 4.7, and 4.8.

This result differs from what we reported in prior work (Elliott et al., 2013). In that work, results from schedulability experiments showed that system memory migration configurations outperformed corresponding peer-to-peer configurations. The schedulability analysis for peer-to-peer configurations used in those experiments was based upon the non-ILP fine-grain blocking analysis we described in Section 4.3.4.3. In contrast, the schedulability analysis we use in this work employs our ILP-based blocking chain analysis. The improvement in our results demonstrates the benefits of our ILP-based analysis. Also, as we see later in

Section 4.4, these improvements bring our analytical results more in line with observed runtime behavior.

Observation 18. Smaller GPU clusters, where g = 2, offered better schedulability than larger clusters, where g = 4.

This observation holds for every evaluated conﬁguration, even those that use a GPU Allocator based upon the CK-OMLP, which have bucked many other trends observed here. We give a diverse set of examples.

P2P th Under WC/L overheads in Table 4.5, (P,C2 ,3) (ranked 44 ) has an average sub-ranking of 36.35, while P2P st (P,C4 ,3) (ranked 51 ) has an average sub-ranking of 47.14. Under WC/I overheads in Table 4.6, (C,C2,2) th th (ranked 9 ) has an average sub-ranking of 14.38, while (C,C4,2) (ranked 15 ) has an average sub-ranking

P2P th of 16.10. Under AC/L overheads in Table 4.7, (G,C2 ,2) (ranked 11 ) has an average sub-ranking of 18.15, P2P th while (G,C4 ,2) (ranked 25 ) has an average sub-ranking of 22.31. Finally, under AC/I overheads in th th Table 4.8, (C,C2,1) (ranked 34 ) has an average sub-ranking of 31.86, while (C,C4,1) (ranked 38 ) has an average sub-ranking of 34.84.

There are two aspects of larger GPU clusters that explain this poor performance. First, for larger clusters that use peer-to-peer migration, in Section 4.3.1.2 (see Observation 8), DMA overhead costs when g = 4 are approximately twice those when g = 2. Second, with larger GPU clusters under both peer-to-peer and system memory migration conﬁgurations, jobs that issue token and engine requests may experience worse blocking in the worst-case. This is not only explained by the asymptotic blocking bounds for token and engine requests

215 that increase with g, but also by the fact that a single request potentially competes with the requests of more tasks due to there being fewer GPU clusters. Since blocking analysis always assumes worst-case resource request scenarios, the poor performance of lager cluster sizes is usually inevitable.

Observation 19. The clustered CPU and clustered GPU conﬁguration, (C,CP2P,2), was competitive with partitioned CPU and partitioned GPU conﬁgurations, (P,P,∗).

Under WC/L overheads, in Table 4.5 we see that (P,P,∞) is ranked 5th, while (C,CP2P,2) is ranked 8th.

Under AC/L overheads, in Table 4.7, this rankings gap shrinks: (P,P,∞) is ranked 7th, while (C,CP2P,2) remains ranked 8th. The relative performance of these conﬁgurations switch under WC/I overheads, as we see in Table 4.6. Here, (C,CP2P,2) is ranked 7th, and (P,P,∞) is ranked 8th. Under AC/I overheads, in Table 4.8 we see that (P,P,∞) is ranked 7th once again, while (C,CP2P,2) is ranked 10th.

We have already established that conﬁgurations (C,P,∞) and (G,P,∞) have the best schedulability in

Observation 15. Why do we care about the comparative performance of (C,CP2P,2) against fully partitioned alternatives? As we see later in Section 4.4.2, GPUSync’s affinity-aware heuristics greatly reduce the likelihood of GPU migrations. As a result, GPUSync configurations with clustered GPUs can outperform partitioned alternatives at runtime on average. We compare the best-performing clustered configuration,

(C,CP2P,2), against fully partitioned alternatives because these partitioned configurations represent the approach one would expect from current industrial practice—one where all computations are statically assigned to processor partitions. We highlight the analytical performance of (C,CP2P,2) to show that a system designer need only sacrifice a small degree of schedulability to realize potential runtime benefits of clustered

GPUs, with respect to the de facto alternative.

Observation 20. Schedulability was comparably poor under GPU Allocators based upon the CK-OMLP.

We observe in Table 4.5 that configurations (P,G,∗), (P,C,∗), (P,CP2P,∗), and (C,G,∗) make up 18 of the 21 lowest ranked configurations under WC/L. Similar trends hold under the other overhead data sets, as can be seen in Tables 4.6, 4.7, and 4.8. The consistently poor performance of configurations that rely upon the CK-OMLP for GPU token assignment clearly indicates that these configurations should be avoided (at least in the absence of other compelling system requirements).

Recall from Section 2.1.7.3 what we know of the CK-OMLP: tasks that do not obtain a resource through the CK-OMLP may still suffer s-oblivious pi-blocking due to priority donation. In general, the utilization

216 Figure 4.16: Detailed schedulability result. of every task that may share a CPU with a GPU-using tasks increases under the CK-OMLP—this severely decreases schedulability. This explains the poor schedulability we see here.

This completes our high-level comparisons of the various system conﬁgurations. We now take a deeper look at some of our results.

Figure 4.16 plots schedulability curves for the twelve highest-ranked conﬁguration of each high-level

GPUSync conﬁguration under AC/L overheads. The curves are numbered in descending order according to the area under each curve. That is, curve 1 has the greatest area under its curve, while curve 12 has the least area under its curve. In general, a curve with greater area reﬂects better performance in terms of schedulability.

The tasks of the schedulability experiment illustrated by Figure 4.16 had medium utilizations, moderate periods, heavy data requirements, and each GPU-using task had a state size of 2MB (or 25% of the tasks’ data requirement). GPU kernel execution times were sampled from a [25%,75%] uniform distribution. Finally, 66% of the tasks in each task set used a GPU. Figure 4.16 represents a tiny portion of our schedulability results. However, we examine these schedulability curves to reinforce the observations we made from the ranking data and illustrate additional points. We make the following observations.

Observation 21. The clustered and global CPU conﬁgurations with partitioned GPUs gave the best schedulability results.

217 Curves 1 and 2 plot schedulability for the configurations (G,P,∞) and (C,P,∞), respectively. As we see in Figure 4.16, these curves nearly completely overlap, reflecting roughly equivalent performance. We also see in this figure that these curves reflect greater schedulability than those for the other configurations.

Although this result is immediately apparent in Figure 4.16, we consider task sets with a utilization of about 5.0 (x = 5.0) to make more precise comparisons.

Roughly 98% of task sets with a utilization of about 5.0 are schedulable under (G,P,∞) and (C,P,∞). In

P2P comparison, only about 80% of task sets with that utilization are schedulable under (C,C2 ,2) (curve 3) P2P and (G,C2 ,2) (curve 4). Approximately 65% and 57% of task sets with a utilization of 5.0 are schedulable under (C,C4,2) (curve 5) and (C,C4,2) (curve 7), respectively; and roughly 50% and 25% of task sets with a 5.0 utilization are schedulable under (P,P,∞) and (G,G,2), respectively. No task set with a utilization

P2P P2P of 5.0 was schedulable under (P,C2 ,1) (curve 9), (P,C4 ,1) (curve 10), (P,G,1) (curve 11), or (C,G,1) (curve 12) conﬁgurations.

Observation 22. The conventional conﬁguration, (P,P,∞), did not give the best results.

P2P P2P We see in Figure 4.16 that, (C,C2 ,2) (curve 3), (G,C2 ,2) (curve 4), and (P,P,∞) (curve 6) offer similar performance for task sets with utilizations no greater than 4.2. However, these curves begin to

P2P P2P diverge thereafter. Ultimately, configurations (C,C2 ,2) and (G,C2 ,2) have better schedulability. The configurations (C,C4,2) (curve 5) and (G,C4,2) (curve 7) are arguably competitive with (P,P,∞) as well. We see that these two configurations have better schedulability than (P,P,∞) for task sets with utilizations greater than approximately 4.8.

The curves in Figure 4.16 provide a concrete example in support of Observation 19, which noted that better alternatives exist to the default industry approach of conﬁguration (P,P,∞).

Observation 23. Small GPU peer-to-peer clusters where g = 2 perform relatively well.

We make this observation by comparing the curves in Figure 4.16 for similar clustered GPU conﬁgurations

P2P that differ in cluster size. For example, conﬁguration (C,C2 ,2) (curve 3) dominates (C,C4,2) (curve 5). P2P P2P Similarly, (G,C2 ,2) (curve 4) dominates (G,C4,2) (curve 7). Even (P,C2 ,1) (curve 9) dominates P2P (P,C4 ,1) (curve 10). This observation reinforces Observation 18: smaller GPU cluster are better than large GPU clusters.

Observation 24. Peer-to-peer GPU clusters do not always offer the best schedulability.

218 In Observation 17, we remarked on compelling evidence that GPU clusters with peer-to-peer migration offered better schedulability than similar GPU clusters that use system memory migration. While this remains true in general, as we see in Figure 4.16, there are exceptional cases.

Recall that Figure 4.16 plots only the schedulability curves of the twelve best high-level conﬁgurations.

We see that conﬁgurations (C,C4,2) and (G,C4,2) are plotted by curves 5 and 7, respectively. There are no

P2P P2P curves for configuration (C,C4 ,2) or (G,C4 ,2). This is because configurations (C,C4,2) and (G,C4,2) had better schedulability. In this case, the benefits of ILP-based blocking chain analysis do not overcome larger DMA overheads or asymptoticly more complex blocking bounds that are associated with large GPU clusters with peer-to-peer migration.

Observation 25. Conﬁgurations where GPUs are shared among CPU clusters had inferior schedulability results.

In Observation 20, we remarked that conﬁgurations where the GPU Allocator was based upon the

CK-OMLP offered poor schedulability. This claim is supported here by the four lowest curves in Figure 4.16.

P2P P2P Configurations (P,C2 ,1) (curve 9), (P,C4 ,1) (curve 10), (P,G,1) (curve 11), and (C,G,1) (curve 12) were unable to schedule task sets that other configurations were always able to schedule. For example, 100% of the evaluated task sets with a utilization of 4.0 were schedulable by configurations (G,P,∞) (curve 1) and (C,P,∞) (curve 2). In contrast, practically no task sets with a utilization of 4.0 were schedulable under

P2P (P,C2 ,1) (curve 9).

If we ended our study of schedulability results here, one may be left with the impression that GPUs only harm schedulability. In Figure 4.16, no task set with a utilization of 8.0 or more is schedulable, and yet in

Section 4.3.2, we stated that task sets that fully utilize platform CPUs are schedulable with bounded deadline tardiness under FL scheduling. Does this mean that no GPUSync configuration supports a computing capacity of eight CPUs on our evaluation platform? The answer to this question is “no.” This is because the x-axis of Figure 4.16 reflects only CPU utilization—it does not reflect the gains in computational capacity made possible by GPUs. We illustrate these gains by considering what we call the effective utilization of a task set with GPU-using tasks.

To ﬁnd the effective utilization of a task set, we begin by supposing a GPU-to-CPU speed-up ratio. Let us denote this ratio by S. We then analytically convert each GPU-using task into a functionally equivalent

CPU-only independent task by viewing each unit of time spent executing on an EE as S time units spent

219 Figure 4.17: A detailed schedulability result showing schedulability and effective utilization. executing on a CPU. We also discard all DMA operations and ignore all GPU-related overheads. The effective utilization of a task set is the sum of all task utilizations after this conversion process.17

We re-plot the schedulability curves of the task sets evaluated in Figure 4.16 in terms of effective utilization in Figure 4.17, where we assume a speed-up ratio of S = 32. This is a reasonable speed-up factor, since we are converting the execution time of data-parallel GPU computations to serialized CPU computations. We make the following critical observation.

Observation 26. GPUs greatly increase the computational capacity of a platform.

Every GPUSync configuration represented by the curves in Figure 4.17 was able to schedule a task set that has an effective utilization greater than 12.0 (the number of system CPUs). The best performing configurations, (C,P,∞) (curve 1) and (G,P,∞) (curve 2), managed to schedule task sets that had effective utilizations as great as 94.0. In Figure 4.16, we saw that these configurations were not able to schedule task sets with utilizations greater than roughly 8.0—a utilization loss of 4.0. However, this utilization loss is not truly a loss in light of utilization gains from the GPUs. In a way, we sacrifice 4.0 units of CPU capacity in order to gain an equivalent of 90.0 (94.0 − 4.0) CPUs from the eight GPUs.

17Observe that two task sets with the same real CPU utilization and the same speed-up ratio may still have different effective utilizations, since the tasks in each task set may use the EE to different degrees.

220 An effective utilization of 94.0 is a best-case schedulability scenario of the conﬁgurations represented by Figure 4.17. Even when we consider more conservative cases, the beneﬁts of GPUs remain clear. For instance, about 90% of task sets with effective utilizations of 66.0 were schedulable under (C,P,∞) (curve 1) and (G,P,∞) (curve 2). Roughly 50% of task sets with effective utilizations of 76.0 were schedulable under

(C,P,∞) (curve 1) and (G,P,∞) (curve 2). Conﬁgurations with clustered GPU scheduling also realized signiﬁcant gains in computational capacity.

P2P About 90% of task sets with effective utilizations of 50.0 were schedulable under (C,C2 ,2) (curve 3) and P2P (G,C2 ,2) (curve 4). Roughly 50% of task sets with effective utilizations of 65.0 were schedulable under these conﬁgurations. This result supports a major part of this dissertation’s thesis: increases in computational capacity outweigh analytical costs introduced by management overheads and limitations of GPU hardware and software.

This concludes our schedulability analysis of GPUSync.

4.4 Runtime Evaluation

In this section, we evaluate GPUSync through runtime experiments. We present our evaluation in two parts. First, we examine the effectiveness GPUSync’s budgeting mechanisms, Cost Predictor, and afﬁnity-aware GPU Allocator. Second, we assess the effectiveness of clustered GPU management and peer-to-peer migrations through the scheduling of computer vision workloads. Before continuing, we note that we evaluated schedulability under FL schedulers in the prior section, since the analytical techniques associated with FL schedulers provide better bounds on deadline tardiness. However, the use of FL schedulers in practice requires ofﬂine analysis (e.g., execution time analysis and derivation of blocking terms) in order to properly determine task priority points. In this section, we forgo such analysis because FL scheduling is not a core aspect of GPUSync. Instead, we evaluate GPUSync under EDF and RM schedulers, since these are easier to use in practice. Up to this point, we have focused mainly on deadline-based scheduling

(e.g., FL and EDF). We include RM schedulers in our runtime evaluations, since ﬁxed-priority schedulers are more prevalent in RTOSs than deadline-based schedulers. Moreover, evaluations under RM schedulers also highlight the ﬂexibility of GPUSync’s design. However, as we discuss laster in Chapter 6, we leave the

221 development of overhead-aware schedulability tests for GPUSync under RM schedulers as future work. Such analysis is not necessary in order to perform runtime experiments.

4.4.1 Budgeting, Cost Prediction, and Afﬁnity

We used a mixed task set of CPU-only and GPU-using implicit-deadline periodic tasks to evaluate budget enforcement, the Cost Predictor, and the GPU Allocator. We now describe our task set in more detail.

Numerical code was executed by tasks on both CPUs and GPUs to simulate real applications. Task execution time was tightly controlled through the use of processor cycle counters on both CPUs and GPUs.

GPU-using tasks also transmitted data on the PCIe bus. Task periods ranged from 10ms to 75ms, reﬂecting a range of periods found in ADAS systems, such as those we described in Chapter 1 (see Table 1.1).

The task set consisted of 28 CPU-only tasks and 34 GPU-using tasks. Each task was assigned a utilization based upon the combined processor time (CPU and GPU engines) a task’s job must receive before completing.

Of the CPU-only tasks, twelve had a utilization of 0.1, eight had a utilization of 0.2, and two had a utilization of 0.3. Of the GPU-using tasks, fourteen had a utilization of 0.1, fourteen more had a utilization of 0.25, and six had a utilization of 0.5. 90% of each GPU-using task’s utilization was devoted to GPU operations, with 75% of that towards GPU kernels, and the remaining 25% towards memory copies. Each GPU-using job had one GPU critical section. Within its critical section, each such job executed two GPU kernels of equal execution time. The amount of memory to copy was determined based upon desired copy time and worst-case bus congestion bandwidth derived from empirical measurements. Memory copies were evenly split between input and output data. Task state size was set to twice the combined size of input and output data. Additionally, DMA operations were broken up into 2MB chunks.

We configured our system to run as a cluster along NUMA boundaries. Thus, there were two clusters of six CPUs and four GPUs apiece. The above task set was evenly partitioned between the two clusters. The task set was scheduled under C-EDF and C-RM schedulers. Three tokens (ρ = 3) were allocated to each GPU in order to allow all GPU engines to be used simultaneously. Under C-EDF, we configured the GPU Allocator to be optimal under suspension-oblivious analysis, i.e., the GPU Allocator was configured as the R2DGLP.

We realized this configuration by setting the maximum length of the FIFO queues of the GPU Allocator to f = dc/(g · ρ)e = d6/(4 · 3)e = 1. Under C-RM, the GPU Allocator was configured to be strictly priority- ordered, as is common to locking protocols under fixed-priority schedulers. We realized this configuration by setting f to the minimum size, or f = 1. These two configurations of the GPU Allocator are actually one in

222 the same, since our values for ρ and GPU cluster size are so great. The GPU Allocator under the C-EDF and C-RM schedulers only differ by the method used to determine job priority (earliest-deadline-ﬁrst for

C-EDF and shortest-period-ﬁrst for C-RM). We used different engine lock conﬁgurations for each scheduler.

Engine locks were FIFO-ordered and priority-ordered under C-EDF and C-RM schedulers, respectively. GPU migrations were performed using peer-to-peer DMA operations.

The task set was scheduled under two execution-behavior scenarios. Under the first scenario, tasks adhered to their prescribed execution times as closely as possible. Under the second scenario, however, eight GPU-using tasks in each cluster were configured to exhibit aberrant behaviors by executing for ten times their normal execution time at random moments (roughly spaced by five seconds for each task). These scenarios were scheduled for 180 seconds under both C-EDF and C-RM schedulers and measurements were taken. We use data gathered during the execution of these scenarios as the basis of the evaluations described in Sections 4.4.1.1, 4.4.1.2, and 4.4.1.3.

4.4.1.1 Budget Performance

We analyze the ability of GPUSync to manage budgets (and penalize overrunning tasks) by examining the long-term utilization of the execution engines. Here, we test GPUSync under the early releasing policy described in Section 3.2.4, and compare performance against a no-budget-enforcement policy. We measure execution engine utilization (GPU execution time divided by period) with respect to the hold time of execution engine locks. This is an effective measure, even if the engine may idle while the engine lock is held, since all other tasks are blocked from using the engine.

From the task set described in Section 4.4.1, we focus our attention on two of the 34 GPU-using tasks, T1 and T2. T1 has a period of 15ms and a utilization of 0.25. T1’s ideal execution-engine utilization is 0.169, and it sends and receives 512KB to and from the GPU. T2 has a period of 75ms, a utilization of 0.1, an ideal execution-engine utilization of 0.068, and it sends and receive 1024KB to and from the GPU. We focus on these tasks because of their short and long periods, respectively.

Figure 4.18 depicts the accumulated time (on the y-axis) tasks T1 and T2 hold an execution engine lock over the duration of their execution (on the x-axis). Figure 4.18 also displays the equation for the line-of-best-fit of each plotted line. The slope of each line-of-best-fit (i.e., the coefficient of the variable x) approximates the accumulated time the associated task holds an engine lock, divided by task period. We interpret this slope as a measure of long-term execution engine utilization. We make three observations.

223 [2, 8] [1, 7] [9]

[3]

[5, 11] [6, 12] [4, 10]

Figure 4.18: Allocated execution engine time.

224 Observation 27. A synchronization-based approach to GPU scheduling is effective at supplying GPU-using tasks with provisioned execution time.

Ideally, the slope of the lines in Figure 4.18 should be equal to the task’s execution-engine utilization.

With the exception of line 3, the slopes of all the lines are very close to the desired utilization. For example, when T2 is well-behaved under C-EDF scheduling (line 10), the slope is 0.065, this is commensurate with the assigned utilization of 0.068

Observation 28. Budget enforcement can penalize aberrant tasks by allocating less execution time.

This may be observed in lines 3 and 9 for the aberrant task T1 under both C-RM and C-EDF scheduling in Figure 4.18. As shown by line 3, T1’s utilization is 0.12—30%less than the provisioned 0.169, for C-RM.

Similarly, for C-EDF, T1’s utilization is 0.153—10% less than the provisioned 0.169, This loss of utilization is a result of the early releasing budget policy where any surplus from an early-released budget is discarded after an overrunning job completes.

Observation 29. C-RM and C-EDF can both perform well.

For this particular experiment, we observe that both C-RM and C-EDF are able to supply the needed

GPU execution time. This is an important result because it empowers system designers to select the scheduler that suits their needs.

4.4.1.2 Cost Predictor Accuracy

We measured the overheads related to the Cost Predictor because of our concern that computing averages and standard deviations can be computationally expensive. However, we found these worries to be unfounded.

Updating the Cost Predictor estimate took 0.294µs on average, and 2.335µ in the (observed) worst-case. We now discuss the accuracy of the Cost Predictor. Figure 4.19 plots cumulative distribution functions

(CDFs) for the percentage error of the cost predictor for different migration distances under the aberrant behavior scenario, without budget enforcement. Migration distance is denoted by d; d = 0 reﬂects no migration, d = 1 denotes migrations to neighboring (i.e., near) GPUs, and d = 2 reﬂects migrations to distant

(i.e., far) GPUs. We continue to consider tasks T1 and T2 described earlier in Section 4.4.1.

Observation 30. The Cost Predictor is generally accurate at predicating token hold time, despite aberrant task behavior.

225 CDF of Percentage Error, C-EDF With No Budgeting 1.0 [1] [6] [4] [5]

0.8 [2] x ≤ ) |

0.6 Percentage Error | 0.4 [1] T1 , d = 0

[2] T1 , d = 1

Probability( [3] T1 , d = 2 0.2 [4] T2 , d = 0

[5] T2 , d = 1 [3] [6] T2 , d = 2 0.0 0.0% 20.0% 40.0% 60.0% 80.0% 100.0% Percentage Error | | (a) C-EDF

CDF of Percentage Error, C-RM With No Budgeting 1.0 [2]

[3,1] [4]

0.8 [5] x

≤ [6] ) |

0.6 Percentage Error | 0.4 [1] T1 , d = 0

[2] T1 , d = 1

Probability( [3] T1 , d = 2 0.2 [4] T2 , d = 0

[5] T2 , d = 1

[6] T2 , d = 2 0.0 0.0% 20.0% 40.0% 60.0% 80.0% 100.0% Percentage Error | | (b) C-RM

Figure 4.19: CDFs of percentage error in cost predictions.

226 As seen in Figure 4.19, under C-EDF (inset(a)), roughly 95% of all predictions for task T1 have a percentage error of 5% or less (only one sample exists for T1 where d = 2). Accuracy is even better for T1 under C-RM. This is expected since T1 has a statically high priority due to its short period, and it is thus able to avoid more interference from other tasks than under C-EDF.

Observation 31. The Cost Predictor is less accurate for tasks with longer execution times and periods.

Both Figure 4.19(a) and Figure 4.19(b) show that the Cost Predictor is less accurate for T2 than T1 in this experiment. This is because T2 has a longer execution time and period than most other tasks in the task set.

Thus, T2 is more likely to experience interference from aberrant tasks.

Observation 32. The Cost Predictor is moderately less accurate under C-RM than C-EDF.

This can be seen by comparing insets (a) and (b) of Figure 4.19. For example, about 90% of predictions for T2 with d = 1 under C-EDF (inset (a)) have a percentage error no greater than 20%. Compare this to

C-RM (inset(b)), where only 80% of d = 1 predictions for T2 have the same degree of accuracy.

Observation 33. The Cost Predictor is generally less accurate for longer migration distances.

A migrating job of a task must acquire additional copy engine locks and do more work than non-migrating jobs. This introduces variability into token hold times predicted by the Cost Predictor. We see that this generally has a negative effect on the Cost Predictor. This is clearly demonstrated in Figure 4.19(b) for C-RM, where each CDF generally upper-bounds the CDF of the same task at the next migration distance. We note that this does not always hold under C-EDF (Figure 4.19(a)). For example, predictions for T2 with d = 1 are more accurate than d = 0. However, as we discuss shortly, this may be due to a smaller sample size of observations.

4.4.1.3 Migration Frequency

The Cost Predictor is an important component of GPUSync, since it inﬂuences the behavior of the migration heuristics. Table 4.9 gives the total number of migrations observed under the aberrant scenario, with and without budget enforcement. We continue to consider tasks T1 and T2 described earlier in Section 4.4.1. We make two observations.

Observation 34. Afﬁnity-aware GPU assignment helps maintain afﬁnity.

227 C-EDF C-RM Migration Type No Budgeting Budgeting No Budgeting Budgeting T1 T2 T1 T2 T1 T2 T1 T2 No Migration 11,887 2,382 8,866 2,374 11,255 1,826 8,586 1,915 Near 110 11 130 8 374 245 386 142 Far 0 4 0 13 368 326 173 340 Total 11,997 2,397 8,996 2,395 11,997 2,397 9,145 2,397

Table 4.9: Migration frequency for T1 and T2.

Under all scenarios, we see that tasks are significantly more likely to be assigned the GPU with which they have affinity. For instance, task T1, under C-EDF with no budget enforcement, maintained affinity 11,887 times, migrated to a neighboring GPU 110 times, and never migrated to a distant GPU. Similar trends are observed for all tasks. Without an affinity-aware method, migrations would be more frequent than reassignment, since any task would have a 75% chance (with a GPU-cluster size of four) of being assigned to a different GPU.

Observation 35. GPUSync is more successful at maintaining afﬁnity under the C-EDF conﬁguration.

Observe in Table 4.9 that migrations are more frequent under C-RM than C-EDF. Indeed, distant migrations are practically eliminated under C-EDF. This behavior is attributable to the decreased accuracy of the Cost Predictor under C-RM (recall Observation 32). Inaccurate estimates can cause migrations to seem faster than waiting for the GPU for which the job has afﬁnity. This observation demonstrates the importance of the Cost Predictor with respect to the runtime behavior.

GPUSync can effectively constrain GPU utilization through budget enforcement techniques, produce relatively accurate predictions of GPU migration costs, and effectively employ GPU allocation heuristics that signiﬁcantly reduce the frequency of GPU migrations.

This concludes our focused evaluation of the elemental components of GPUSync. We now examine overall runtime performance.

4.4.2 Feature-Tracking Use-Case

We now describe the vision-related experiments mentioned earlier. In these experiments, we adapted a freely-available CUDA-based feature tracking program to GPUSync on LITMUSRT by Kim et al. (2009).

Feature tracking detects interesting features, such as lines and corners, in a video image and tracks the

228 movement of detected features through time. Feature tracking can be used to deduce the three-dimensional structure of environment and movement of the camera. Feature tracking can be an important element of a pipeline of computer vision algorithms used by ADAS and autonomous automotive systems that sense and monitor the environment. The tracker represents a scheduling challenge since it utilizes both CPUs and GPUs to carry out its computations. Though feature tracking is only one GPGPU application, its image-processing operations are emblematic of many others.

We stressed our evaluation platform by applying feature tracking to thirty independent video streams simultaneously. Each video stream was handled by one task, with each frame being processed by one job.

The video streams were assigned different execution-rate requirements: two high-rate streams ran at 30 frames per second (FPS), ten medium-rate streams ran at 20FPS, and eighteen low-rate streams ran at 10FPS.

The high- and medium-rate streams operated on video frames with a resolution of 320x240 pixels. Each job of these streams had roughly 1MB of combined input and output data and a state size of about 6.5MB. The low-rate streams processed larger video frames with a resolution of 640x480. A low-rate stream required four times as much memory as a higher-rate stream. Video frames were preloaded into memory in order to avoid disk latencies (such latencies would be non-existent with real video cameras). All data was page-locked in system memory to facilitate fast and deterministic memory operations.

We tested the same conﬁgurations as in the prior runtime experiments, with the addition of another

GPUSync configuration under C-EDF scheduling: the use of priority-ordered engine locks. The token count was also reduced to ρ = 2 for all configurations—we found that this configuration worked best for our particular computer vision workload.18 The video streams were scheduled on three different GPU clustering configurations: eight GPU partitions (g = 1), four small GPU clusters of two GPUs (g = 2), and two large

GPU clusters of four GPUs (g = 4). CPUs were organized in two clusters of six. We focused our attention on platform configurations where GPU clusters were not shared by CPUs of different CPU clusters (recall that such configurations performed exceedingly poorly in our schedulability experiments). Each CPU cluster was associated with the same number of GPU clusters (and hence, the same number of GPUs). We tested the clustered configurations with both peer-to-peer and system memory migration. Tasks were partitioned evenly among the CPU and GPU clusters. Our video-stream workload was scheduled under each configuration for 120 seconds and measurements were taken.

18In Chapter 5, we ﬁnd that ρ = 2 does not necessarily lead to the best runtime performance on our evaluation platform in general. Indeed, greater values of ρ may be necessary to achieving predictable runtime performance.

229 We analyze each GPUSync conﬁguration by inspecting the response time of each job. We deem a conﬁguration as resulting in an unschedulable system if any task consistently exhibits growth in response time, since this is a sign of unbounded deadline tardiness. We assume a schedulable system with bounded deadline tardiness, otherwise.

Figure 4.20 plots the CDFs of normalized job response times for the platform conﬁgurations that employed FIFO-ordered engine locks and C-EDF scheduling. We normalize job response times by dividing the observed response time by the period of the associated task; we do so that we may combine job response times into a single CDF. Figures 4.21 and 4.22 give the same type of plots for platforms that employed priority-ordered engine locks, under C-EDF and C-RM, respectively. Unschedulable conﬁgurations are denoted by dashed lines. We clip the plots in order to show more detail for shorter response times. Clipped curves are annotated with the value (in parentheses) where the associated CDF reaches 1.0 (i.e., the observed worst-case response time). We make several observations from these results.

Observation 36. GPUSync can be used to achieve predictable real-time performance for GPGPU applications.

This experiment provides our ﬁrst look at the overall runtime performance of GPUSync. The CDFs in

Figures 4.20, 4.21, and 4.22 demonstrate that GPUSync can be used to realize real-time predictability for

GPGPU applications. For instance, examine the CDF for GPU clusters of size two in Figure 4.20 (curve 2).

We see that all jobs had a normalized response time less than approximately 180% of period. We take this as an indicator of bounded deadline tardiness. CDFs that reﬂect even better real-time performance can be found in Figures 4.21 and 4.22. This result provides evidence supporting a central tenant of this dissertation: real-time scheduling and synchronization techniques can be applied to GPUs to realize predictable real-time performance.

Observation 37. Priority-ordered queues can provide improved observed response times.

This can be observed by comparing the curves in Figure 4.20 to those of Figures 4.21 and 4.22. For example, in Figure 4.20, the probability that a job under partitioned GPU scheduling (curve 1) had normalized response time less than 200% was approximately 90%. Compare this to Figure 4.21, where 100% of jobs had a normalized response time less than 200%. We can observe less dramatic differences in performance for the clustered GPU conﬁgurations that used peer-to-peer migration (curves 2 and 4).

230 CDF of Response Time As % of Period for C-EDF, FIFO Queues 1.0 [4] [1] (47820%) 0.8 [2] x ≤ [3] 0.6

0.4 [1] g = 1 [2] g = 2 (P2P)

Probability(Response Time) 0.2 [3] g = 2 (SysMem) [4] g = 4 (P2P) [5] (59408%) [5] g = 4 (SysMem) 0.0 0% 200 % 400 % 600 % 800 % 1000 % Response Time (% of period)

Figure 4.20: CDF of job response time as percent of period for C-EDF with FIFO-ordered engine locks. CDF of Response Time As % of Period for C-EDF, Priority Queues 1.0 [1] [4, 2] [3] (9033%) [5] (17295%) 0.8 x ≤

0.6

0.4 [1] g = 1 [2] g = 2 (P2P)

Probability(Response Time) 0.2 [3] g = 2 (SysMem) [4] g = 4 (P2P) [5] g = 4 (SysMem) 0.0 0% 200 % 400 % 600 % 800 % 1000 % Response Time (% of period)

Figure 4.21: CDF of job response time as percent of period for C-EDF with priority-ordered engine locks.

231 CDF of Response Time As % of Period for C-RM, Priority Queues 1.0 [1] [2, 4] (19200%) [3] [5] (20736%) 0.8 x ≤

0.6

0.4 [1] g = 1 [2] g = 2 (P2P)

Probability(Response Time) 0.2 [3] g = 2 (SysMem) [4] g = 4 (P2P) [5] g = 4 (SysMem) 0.0 0% 200 % 400 % 600 % 800 % 1000 % Response Time (% of period)

Figure 4.22: CDF of job response time as percent of period for C-RM with priority-ordered engine locks.

232 Observation 38. Clustered GPU management can outperform a partitioned approach.

This can be observed in curves 2 and 4 of Figure 4.20 for C-EDF scheduling with FIFO-ordered engine locks. The CDFs for the clustered GPU conﬁgurations with peer-to-peer migration generally lie above the

CDFs for the other configurations (particularly, curve 1 for partitioned GPU scheduling). This indicates that a job under clustered GPU scheduling with peer-to-peer migration is more likely to have a shorter normalized response time than jobs in the other FIFO-ordered configurations. Despite the extra load imposed by memory migrations, clustered GPU approaches can still outperform a partitioned GPU approach. This is a positive result in light of the benefits FIFO ordering can have in schedulability analysis. Clustering may provide a viable FIFO-based alternative to a poorer performing partitioned approach.

Observation 39. Small GPU clusters may be preferable over large GPU clusters.

In Figures 4.20, 4.21, and 4.22, we observe that small GPU clusters (curves 2 and 3) generally yield shorter normalized response times than large GPU clusters (curves 4 and 5) in this experiment. For example, in Figure 4.20, the curve for the platform conﬁguration with a GPU cluster size of two with peer-to-peer migrations (curve 2) lies above the curve for a GPU cluster size of four with peer-to-peer migrations (curve 4).

This difference is partly attributable to the higher cost of migration between distant GPUs in larger clusters, which we observed in Section 4.3.1.2 (Figures 4.8 and 4.9). However, we see that the smaller GPU cluster configurations still outperform the larger GPU cluster configurations when system memory migrations are used. We cannot attribute this difference in performance to migration distance, since GPUs in both configurations have the same migration distance to system memory. Instead, we attribute this difference to an increased chance of migration in larger GPU clusters.

Observation 40. GPU state migration through system memory is often too costly.

We observe in Figures 4.20, 4.21, and 4.22, that no configuration that used system memory migration resulted in a schedulable system in our experiment. Indeed, we observe that partitioned GPU management is preferable to all such system-memory-migration configurations. This is clearly reflected by curves 3 and 5 in each figure for GPU clusters of size two and four, respectively, where the observed worst-case response times are over 9,000% of task period. Furthermore, system memory migrations are even more costly for larger cluster sizes on account of the increased frequency of migrations between GPUs, thereby further increasing response times.

233 Observation 41. GPUSync can be used effectively with both C-EDF and C-RM schedulers.

We compare the figures for the C-EDF and C-RM platform configurations that used priority-ordered engine locks to contrast the performance of these schedulers. We ignore GPU cluster configurations that use system memory migration, since our workload is clearly unschedulable (Observation 40). Compare the remaining corresponding curves in Figures 4.21 (C-EDF) and 4.22 (C-RM)—the corresponding CDFs are virtually indistinguishable. For example, compare the curves for GPU clusters of size two that use peer-to-peer migration (curves 2): the curves are nearly identical. The same holds true of the corresponding curves for partitioned GPUs and the GPU clusters of size four.

GPUSync is meant to provide a configurable real-time GPU scheduling framework. We have mainly focused on configurability as it relates to the number of GPU tokens and engine lock queues. However, this configurability also extends to support for any JLFP CPU scheduler. Our observation here helps demonstrate our success at meeting this design goal. We see that GPUSync can be just as effective under C-EDF scheduling as it is under C-RM scheduling.

This concludes our ﬁrst evaluation GPUSync’s runtime performance. We direct the reader towards

Chapter 5 for additional experiments and in-depth analysis of GPUSync’s runtime performance when it used to support graph-based real-time workloads.

4.5 Conclusion

In this chapter, we have evaluated both the analytical and runtime aspects of GPUSync. We have examined the overheads associated with GPGPU computing in depth. Through carefully crafted experiments, we have quantiﬁed and characterized the runtime properties of GPU interrupt processing and DMA operations. We have also measured the degree to which DMA operations may increase the execution time of tasks executing on CPUs due to memory bus contention. We incorporated these overheads, among others, into an analytical model of our evaluation platform.

We also presented blocking analysis for various GPUSync configurations. We provided this analysis in varying degrees of granularity. We began with “coarse-grain” analysis in order to study the overall characteristics of different GPUSync configurations. We then developed more fine-grain analysis for use in schedulability experiments. In this analysis, we also discussed the tradeoffs between various GPUSync configurations.

234 We evaluated the analytical performance of GPUSync through a set of large scale schedulability experiments, where we tested over 2.8 billion task sets for schedulability. These experiments required over 85,000 CPU hours to complete on a university compute cluster. Through these experiments, we identiﬁed the most promising GPUSync conﬁgurations.

We evaluated the runtime properties of GPUSync through both targeted experiments and a computer vision use-case study. The results of our targeted experiments show that GPUSync’s afﬁnity-aware heuristics are effective at reducing GPU migration, and that these techniques can practically eliminate the most costly types of migrations. These targeted experiments also showed that GPUSync’s budget enforcement policies help prevent aberrant tasks from exceeding provisioned GPU execution times and mitigate the negative effects on other tasks when they do. Finally, in the computer vision use-case study, we found that GPUSync can effectively schedule real-time work on GPUs. Moreover, we saw that in some conﬁgurations, clustered GPU scheduling can outperform more conventional partitioned GPU approaches.

235 CHAPTER 5: GRAPH SCHEDULING AND GPUS1

In Chapter 4, we used GPUSync to arbitrate access to GPU resources among tasks that follow the sporadic task model, whereby each real-time task is represented by a single thread of execution. In this chapter, we apply GPUSync towards the graph-based PGM task model described earlier in Section 2.1.3. We begin by providing our motivation for supporting a graph-based task model, which we draw from recent trends in real-world software architectures and application constraints. We then describe PGMRT, a middleware library we developed to support real-time task sets derived from PGM graphs. PGMRT integrates tightly with the LITMUSRT kernel (which we also modiﬁed to support PGM-derived real-time task sets) to minimize system overheads. However, PGMRT remains portable to POSIX-compliant operating systems. Next, we discuss the newly developed open standard, OpenVX™, which is designed to support computer vision applications (Khronos Group, 2014c).2 OpenVX uses a graph-based software architecture designed to enable efﬁcient computation on heterogeneous computing platforms, including those that use accelerators like GPUs.

We examine assumptions made by the designers of OpenVX that conﬂict with our real-time task model.

We then discuss VisionWorks®, an OpenVX implementation by NVIDIA (Brill and Albuz, 2014).3 With support from NVIDIA, we adapted an alpha version of VisionWorks to run atop PGMRT, GPUSync, and

LITMUSRT. We describe several challenges we faced in this effort, along with our solutions. We then present the results from a runtime evaluation of our modiﬁed version of VisionWorks under several conﬁgurations of

GPUSync. We compare our GPUSync configurations against two purely Linux-based configurations, as well as a LITMUSRT configuration without GPUSync. Our results demonstrate clear benefits from GPUSync. We conclude this chapter with a summary of our efforts and experimental results.

1 Portions of this chapter previously appeared in conference proceedings. The original citation is as follows: Elliott, G., Kim, N., Liu, C., and Anderson, J. (2014). Minimizing response times of automotive dataﬂows on multicore. In Proceedings of the 20th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, pages 1–10. 2OpenVX is a trademark of the Khronos Group Inc. 3VisionWorks is a registered trademark of the NVIDIA Corp.

236 Soft Raw video Convert to Convert To Casecade Display at 30Hz YUV Grayscale Classiﬁer

Figure 5.1: Dataﬂow graph of a simple pedestrian detector application.

5.1 Motivation for Graph-Based Task Models

Graph-based software architectures, often referred to as dataﬂow architectures, are common to software applications that process continual streams of data or events. In such architectures, vertices represent sequential code segments that operate upon data, and edges express the ﬂow of data among vertices. The

flexibility offered by such an architecture’s inherent modularity promotes code reuse and parallel development. Also, these architectures naturally support concurrency, since parallelism can be explicitly described by the graph structure. These characteristics have made dataflow architectures popular in multimedia technologies (Khronos Group, 2005; Taymans et al., 2013) and the emerging field of computational photogra- phy (Adams et al., 2010; NVIDIA, 2013). Dataflow architectures are also prevalent in the sensor-processing components in prototypes of advanced automotive systems, for both driver-assisted and autonomous driving

(e.g., Miller et al. (2009); Urmson et al. (2009); Wei et al. (2013)). While many domains with dataﬂow architectures have timing requirements, the automotive case is set apart since timing violations may result in loss of life or property.

Figure 5.1 depicts a dataﬂow graph of a simple pedestrian detection application that could be used in an automotive application. We describe these nodes from left to right. A video camera feeds the source of the graph with video frames at 30Hz (or 30FPS). The ﬁrst node converts raw camera data into the common

YUV color image format. The second node extracts the “Y” component of each pixel from the YUV image, producing a grayscale image. (Computer vision algorithms often operate only on grayscale images.) The third node performs pedestrian detection computations and produces a list of the locations of detected pedestrians. In this case, the node uses a common “soft cascade classiﬁer” (Bourdev and Brandt, 2005) to detect pedestrians. Finally, the last node displays an overlay of detected pedestrians over the original color image.

We may shoehorn our pedestrian detection application into a single implicit deadline sporadic task. Here,

1 we give such a task a period and relative deadline of 33 3 ms to match the period of the video camera. Each

237 Raw video Raw video at 30Hz at 30Hz

1 Gi Convert to Convert to YUV YUV k k k k Ti = ( ei , p i , d i )

2 Gi Convert To Convert To Grayscale Grayscale

3 Gi Soft Casecade Soft Casecade Classiﬁer Classiﬁer

4 Gi Display Display

(a) PGM representation. (b) Rate-based representation. (c) Sporadic representation.

Figure 5.2: Transformation of a PGM-specified graph for a pedestrian detection application to sporadic tasks. job of this task executes the dataflow graph, end-to-end, once per job. This technique can be applied to any graph by executing graphs nodes in a topological order. However, this approach prevents us from exploiting parallelism in two ways. First, we cannot exploit parallelism expressed by parallel branches (or forks) in a graph since we serialize all graph nodes. Second, we cannot execute nodes in a pipeline, since the graph is executed end-to-end within each job. There is another significant drawback to this shoehorned approach: the combined execution time of graph nodes may exceed the period of the task. Such a task is intrinsically unschedulable without parallelism.

In Section 2.1.3, we described a process for transforming a dataflow graph described by PGM into a set of sporadic tasks. Figure 5.2 depicts such a transformation for our pedestrian detection application. We describe this transformation in more detail. Figure 5.2(a) gives a PGM-specification for the pedestrian detection k← j dataflow graph. Here, each non-sink node produces one token (%i = 1) for each of its consumers. Similarly, k← j each non-source node consumes one token (κi = 1) from each of its producers, as tokens become available

k← j rb1 (ϕi = 1). Given an input video rate of 30Hz, the rate-based task for the source node, Ti , is released once

rb1 1 rb1 1 (χi = 1) every 33 3 ms (υi = 33 3 ms). The remaining nodes have the same rate-based speciﬁcation since k← j k← j %i = κi (see Equations (2.8)–(2.11)). This is depicted in Figure 5.2(b). Finally, Figure 5.2(c) gives the k k 1 ﬁnal transformation to implicit-deadline sporadic tasks, where di, j = pi, j = 33 3 ms.

238 CPU0 CPU1 CPU2 Release Complete

x x Ji,j Ji,j +1

y Ji,j

z Ji,j time a b c d Figure 5.3: Parallel execution of graph nodes.

Ideally, any node with satisﬁed constraints should be eligible for scheduling to allow the parallel execution and pipelining of nodes. This would allow three types of parallelism: inter-node, intra-node, and pipeline

x parallelism. Consider the graph and schedule depicted in Figure 5.3. Here, the completion of job Ji, j at time y z a releases jobs Ji, j and Ji, j. These released jobs execute in parallel during the time interval [a,c]. This is an example of inter-node parallelism, since two nodes released by a shared parent may execute simultaneously.

x y z Job Ji, j+1 is released at time b. It is schedule in parallel with jobs Ji, j and Ji, j during the time interval [b,c]. x x This is an example of pipeline parallelism, since Ji, j+1 may execute before the jobs released by Ji, j complete. x x x Job Ji, j+2 is released at time d. If Ji, j+2 were scheduled before the completion of Ji, j+1, then this would be an example of intra-node parallelism. However, the schedule depicted in Figure 5.3 observes job precedence constraints; i.e., two instances of the same node may not executing simultaneously.

The transformation of a PGM-speciﬁed graph into a set of sporadic tasks gives us most of the parallelism we seek. Although it would be desirable to support intra-node parallelism, it may be challenging to realize in implementation, since it may require dynamic thread creation or preemptive work assignment to threads in a pool of worker threads. Moreover, intra-node parallelism may also require us to re-order or buffer node outputs, since the ( j + 1)st invocation of a node may complete before its jth invocation. From a practical perspective, we feel the loss of intra-node parallelism is acceptable.

239 5.2 PGMRT

In this section, we describe PGMRT, a middleware library that we developed to support PGM-derived real-time task sets. PGMRT is responsible for transmitting tokens (and, optionally, data) among the nodes and enforcing token constraints.4 PGMRT may be conﬁgured to integrate with LITMUSRT in order to reduce token transmission overheads and ensure predictable real-time execution. However, PGMRT may also be conﬁgured to support other POSIX-compliant platforms at the expense of greater overheads and priority inversions.

We begin by describing the underlying mechanisms of PGMRT, speciﬁcally, graph management and token transmission. We then address issues relating to proper real-time scheduling of tasks running under our

PGM-derived sporadic task model on LITMUSRT.

5.2.1 Graphs, Nodes, and Edges

Each graph is identiﬁed by a unique name and path, similar to a UNIX named pipe. Applications use

PGMRT’s API to create new graphs described by nodes and edges. Real-time tasks, as unique threads of execution within the same address space or separate processes, use PGMRT’s API to access information about a named graph and claim/bind to a node and its edges. PGMRT uses a plugin-based architecture to support different methods for transmitting tokens (and, optionally data) among tasks.

5.2.2 Precedence Constraints and Token Transmission

Non-source nodes have two types of precedence constraints: job and token constraints. Job constraints are satisﬁed in PGMRT, since a single thread binds to each node—jobs are naturally serialized by this thread.

Regarding token constraints, consumers block (suspend execution) whenever they lack the requisite tokens.

Producers must have a mechanism to signal consumers of new tokens. The appropriate underlying IPC mechanism depends upon how tokens are used: tokens may be event-signaling or data-passing. A single node may use a mix of event-signaling and data-passing tokens, as appropriate. Regardless of the underlying

IPC, nodes produce and consume tokens using a common API.

Event-Signaling Tokens. With event-signaling tokens, token production and consumption is realized through increment/decrement operations on per-edge token counters, similar to counting semaphores. To facilitate

4We distribute PGMRT as open source under the Revised BSD license. Source code is currently available at www.github.com/ GElliott/pgm.

240 IPC, token counters may be stored in POSIX shared memory that is mapped into each PGMRT application.

Thus, a single graph may be made up of multiple coordinated processes.

Although tokens do not transmit data implicitly, tokens can coordinate data sharing in application- level logic. For example, a token may signal the availability of new data in an out-of-band queue (i.e., a data structure outside the purview of PGMRT) shared between two nodes. Signaling is achieved via a monitor synchronization primitive (which can also be stored in shared memory to facilitate inter-process communication). For POSIX-compliant operating systems, this monitor is realized by a POSIX (pthread) condition variable, one per consumer. A consumer blocks on its condition variable if it does not have requisite tokens on every edge. A producer signals the condition variable whenever it is the last producer to satisfy all of its consumer’s token constraints.

The use of pthread condition variables has the drawback that threads that utilize the synchronization primitive must ﬁrst acquire a pthread mutex. This suspension-based mutex can be problematic in a real-time setting for three reasons. First, it can introduce heavy overheads, due to context switching, with respect to very short critical sections in PGMRT (e.g., the increment/decrement of a handful of token counters).

Such overheads may be pessimistically accounted for in real-time analysis, but it is desirable to avoid them altogether. Second, suspensions are difﬁcult to model under some methods of real-time analysis (e.g., s- oblivious analysis). Finally, an operating system may offer little or no support for real-time priority inheritance for pthread mutexes. As an alternative to pthread-based monitor, PGMRT also offers a FIFO-ordered spinlock- based monitor built upon Linux’s “fast user-space mutex” (or “futex”) API. Its use can eliminate costly context switches and problematic suspensions. Furthermore, the duration of priority inversions is bounded since spinning tasks wait non-preemptively. On non-LITMUSRT platforms, non-preemptive waiting is achieved by disabling CPU interrupts from the user-space (e.g., sti/cli/pushf/popf instructions on x86 processors).

On LITMUSRT, PGMRT uses LITMUSRT’s special support for non-preemptive code sections. Here, an application enters and exits non-preemptive sections by merely writing to a variable shared by the application and LITMUSRT.

Data-Passing Tokens. With data-passing tokens, each byte-sized token is interpreted as a byte of data.

Byte-ordering is preserved through FIFO-ordered token consumption. Data-passing can be achieved through a variety of IPC channels. PGMRT supports PGMRT-provided ring buffers, named pipes (FIFOs), message queues (MQs), and stream sockets (e.g., TCP)—all IPCs are POSIX-standard/compatible. We classify these

241 mechanisms collectively as IPC channels. One channel is created for each edge. With the exception of the ring-buffer IPC,5 consumers block on the set of channels, represented as a list of ﬁle descriptors, from each inbound edge using select().6 The operating system wakes the consumer whenever data arrives on a previously empty channel. Consumers read()/recv() tokens from the channel once tokens are available on all edges. Under PGMRT, all read/write operations are non-blocking in order to avoid excessive thread suspensions. Under non-blocking writes, producers continually write data from within a loop until all data has been written. Consumers similarly loop while reading. As a fail-safe mechanism, consumers suspend through select() if inbound data is exhausted before the requisite number of bytes have been read. Due to the general lack of introspective capabilities with the above IPC mechanisms (speciﬁcally, the inability to query the IPC channel regarding the amount of available data), PGM consumer thresholds greater than the number of tokens consumed per node invocation are not easily supported. However, PGMRT offers a solution to this problem that we discuss next.

The use of select() for data-passing tokens can introduce additional thread suspensions since select() wakes a blocked thread when data becomes available on any one channel. Thus, a consumer waiting for data on all inbound edges must loop on select() until data arrives on all inbound edges.7 To avoid this, PGMRT offers “fast” variants of FIFO- and MQ-based channels (and PGMRT ring buffers are only available in this

ﬂavor), where the underlying channel IPC is wrapped with event-signaling tokens. Here, the availability of data is tracked by event-signaling tokens, with each token corresponding to one byte of data. As with plain event-signaling tokens, consumers block on a monitor, and are awoken by the last producer to satisfy all token constraints. Producers only transmit event-signaling tokens after they have written the produced data to the associated channel. Thus, consumers using the fast channels avoid repeated suspensions while looping on select(). Moreover, support for PGM consumer thresholds greater than the number of tokens consumed per node invocation is trivialized by event-signaling token counters. Thus, the fast channel variants also support PGM’s consumer thresholds. There is one limitation to using event-signaling tokens in this context: a reliance upon shared memory. As a result, PGMRT does not offer a fast variant of stream socket

5Ring-buffers use the “fast” method described next. 6The use of select() with MQs is not strictly POSIX-compliant, but such use is commonly supported. 7An “all-or-nothing” variant of select() could be used to address this issue. However, we are unaware of any OS that supports such an API.

242 channels, since producers and consumers connected by these channels are expected to reside on different computers in a distributed system.

5.2.3 Real-Time Concerns

PGMRT described as above can be used with general-purpose schedulers. However, additional enhancements are required to ensure predictable real-time behavior. These relate to predictable token signaling and proper deadline assignment. We describe how we address these problems when PGMRT runs atop LITMUSRT.

Early Releasing and Deadline Shifting. Under deadline-based schedulers, the technique of early releasing allows a job to be scheduled prior to its release time, provided that the job’s absolute deadline is computed from the normal (i.e., “non-early”) release time. Under LITMUSRT, the jobs of all tasks associated with non-source nodes are released early. However, early-released jobs must still observe token constraints.

RT A job’s absolute deadline is computed as di time units after its (non-early) release. LITMUS employs high-resolution timers to track the minimum separation time between releases. However, recall from

Section 2.1.3 that the release time of a non-source node can be no less than the time instant the job’s token constraints are satisfied. Thus, the absolute deadline for each job must be computed on-the-fly. Immediately before a consumer blocks for tokens, it sets a “token-wait” flag stored in memory shared by user-space and the kernel. The kernel checks this flag whenever a real-time task is awoken from a sleeping state. If set, and the current time is later than the release time dictated by the sporadic task model, the kernel automatically computes an adjusted release and absolute deadline for the job and clears the flag. This computation requires the current time to approximate the arrival time of the last token—this is ensured by boosting the priority of a producer while it signals consumers. We discuss this next.

Priority-Boosting of Producers. To ensure properly computed deadlines, we boost the priority of a producer while it is signaling a sequence of consumers. Moreover, in cases where a graph spans multiple processor clusters, boosting is necessary to avoid leaving processors in remote clusters idle. Boosting is achieved through a lazy process: the priority of a producer is boosted only when the scheduler attempts to preempt it.

A producer informs LITMUSRT of when it is signaling consumers through a “token-sending” ﬂag stored in memory shared by user-space and the kernel. Once all consumers have been signaled, a producer clears the

“token-sending” ﬂag and triggers LITMUSRT to “unboost” and reschedule the producer. Priority inversions due to boosting should be accounted for in real-time analysis.

243 We now discuss applications where we may use PGMRT to help realize real-time graph scheduling, especially those that use GPUs.

5.3 OpenVX

OpenVX is a newly ratified standard API for developing computer vision applications for heterogeneous computing platforms. The API provides the programmer with a set of basic operations, or primitives, commonly used in computer vision algorithms.8 The programmer may supplement the standard set of OpenVX primitives with their own or with those provided by third-party libraries. Each primitive has a well-defined set of inputs and outputs. The implementation of a primitive is defined by the particular implementation of the

OpenVX standard. Thus, a given primitive may use a GPU in one OpenVX implementation and a specialized

DSP (e.g., CongniVue’s G2-APEX or Renesas’ IMP-X4) or mere CPUs in another. OpenVX also deﬁnes a set of data objects. Types of data objects include simple data structures such as scalars, arrays, matrices, and images. There are also higher-level data objects common to computer vision algorithms—these include histograms, image pyramids, and lookup tables.9 The programmer constructs a computer vision algorithm by instantiating primitives as nodes and data objects as parameters. The programmer binds parameters to node inputs and outputs. Since each node may use a mix of the processing elements of a heterogeneous platform, a single graph may execute across CPUs, GPUs, DSPs, etc.

Node dependencies (i.e., edges) are not explicitly provided by the programmer. Rather, the structure of a graph is derived from how parameters are bound to nodes. We demonstrate this with an example.

Figure 5.4(a) gives the relevant code fragments for creating an OpenVX graph for pedestrian detection.

The data objects imageRaw and detected represent the input and output of the graph, respectively. The data objects imageIYUV and imageGray store an image in color and grayscale formats, respectively. At line 12, the code creates a color-conversion node, convertToIYUV. The function that creates this node, vxColorConvertNode(), takes imageRaw and imageIYUV as input and output parameters, respectively.

Whenever the node represented by convertToIYUV is executed, the contents of imageRaw is processed by the color-conversion primitive, and the resulting image is stored in convertToIYUV. Similarly, the node convertToGray converts the color image into a grayscale image. The grayscale image is processed

8The OpenVX speciﬁcation calls these basic operations “kernels.” However, we opt to avoid this term since we must already differentiate between GPU and OS kernels. 9An image pyramid stores multiple copies of the same image. Each copy has a different resolution or scale.

244 1 vx_image imageRaw; // graph input : an image 2 vx_array detected; // graph output : a list of detected pedestrians 3 ... 4 // instantiate a graph 5 vx_graph pedDetector = vxCreateGraph(...); 6 ... 7 // instantiate additional parameters 8 vx_image imageIYUV = vxCreateVirtualImage(pedDetector, ...); 9 vx_image imageGray = vxCreateVirtualImage(pedDetector, ...); 10 ... 11 // instantiate primitives as nodes 12 vx_node convertToIYUV = vxColorConvertNode(pedDetector, imageRaw, imageIYUV); 13 vx_node convertToGray = vxChannelExtractNode(pedDetector, imageIYUV, 14 VX_CHANNEL_Y, imageGray); 15 vx_node detectPeds = mySoftCascadeNode(pedDetector, imageGray, detected, ...); 16 ... 17 vxProcessGraph(pedDetector); // execute the graph end-to-end (a) OpenVX code for constructing a graph.

convertToIYUV convertToGray detectPeds imageRaw imageIYUV imageGray detected (vxColorConvertNode) (vxChannelExtractNode) (mySoftCascadeNode)

(b) Bindings of data object parameters to nodes.

convertToIYUV convertToGray detectPeds (vxColorConvertNode) (vxChannelExtractNode) (mySoftCascadeNode)

Figure 5.4: Construction of a graph in OpenVX for pedestrian detection. by a user-provided node created by the function mySoftCascadeNode(), which writes a list of detected pedestrians to detected.10 Figure 5.4(b) depicts the bindings of parameters to nodes. Figure 5.4(c) depicts the derived structure of this graph.

OpenVX deﬁnes a simple execution model. From Section 2.8.5 of the OpenVX standard:

[A constructed graph] may be scheduled multiple times but only executes sequentially with

respect to itself.

Moreover:

[Simultaneously executed graphs] do not have a deﬁned behavior and may execute in parallel or

in series based on the behavior of the vendor’s implementation.

10The OpenVX standard does not currently specify a primitive for object detection, so the user must provide their own or use one from a third party.

245 This execution model simpliﬁes the OpenVX API and its implementation. Also, this simplicity is partly motivated by the variety of heterogeneous platforms upon which OpenVX applications are meant to run.

The API must work well for simple processors such as ASICs as well as modern CPUs. Furthermore, the simple execution model enables interesting opportunities for optimization. For example, an entire graph could be transformed into gate-level logic or a single GPU kernel and executed entirely on an FPGA or GPU, respectively. However, OpenVX’s execution model has four signiﬁcant implications on real-time scheduling.

First, the specification has no notion of a periodic or sporadic task. Second, the specification only allows the programmer to control when the root node of a graph is ready for execution, not when internal nodes are ready. Third, the specification does not define a threading model for graph execution. The intent of the standard is to allow the OpenVX implementation to be tailored to particular heterogeneous platforms.

However, it provides no mechanism by which to control the number of threads used to execute a graph or the priority of these threads. Finally, the speciﬁcation requires a graph to execute end-to-end before it may be executed again. This makes pipelining impossible.11

Given these limitations, how can we execute OpenVX graphs under a sporadic task model? As we discussed in Section 5.1, we may assign a single graph to a single sporadic real-time task. A job of this task would execute the nodes of the graph serially in topological order. Of course, we miss opportunities to take advantage of inherit graph parallelism with this approach. We present a better solution in the next section.

5.4 Adding Real-Time Support to VisionWorks

VisionWorks is an implementation of OpenVX developed by NVIDIA. Many of the OpenVX primitives are implemented using CUDA and are optimized for NVIDIA GPUs. The VisionWorks software provides an ideal tool with which we can evaluate the effectiveness of GPUSync for the real-time scheduling of real-world applications. With support from NVIDIA, we adapted an alpha version of VisionWorks to run atop PGMRT,

GPUSync, and LITMUSRT. In this section, we describe several challenges we faced in this effort, along with our solutions. However, ﬁrst we wish to note that the alpha version of VisionWorks provided to us by

NVIDIA was under active development at the time. The reader should not assume that statements we make regarding VisionWorks will necessarily hold when the software is made available to the public. Also, our

11At line 17 of Figure 5.4(a), the pedestrian detection graph is executed once, from end-to-end, by calling the function vxProcessGraph(). This function does not return until the graph has completed. The OpenVX function vxScheduleGraph() may be used to asynchronously execute a graph without blocking. However, a graph instance must still complete before it may be reissued.

246 work with VisionWorks was funded by NVIDIA through their internship program—at this time we are unable to share the software we developed in this effort,12 since it is the property of NVIDIA.

We describe the software we developed to bring real-time support to VisionWorks in three parts. We begin by describing the changes we made to the VisionWorks execution model to support a PGM-derived sporadic task model. Following this, we discuss a separate GPGPU interception library, called libgpui, we developed to transparently add GPUSync GPU scheduling to VisionWorks and the third-party libraries it uses. Finally, we describe how our modiﬁed VisionWorks execution model directly interacts with GPUSync; speciﬁcally, when GPU tokens are acquired and engine locks are obtained.

5.4.1 VisionWorks and the Sporadic Task Model

In this section, we describe the graph execution model used by our alpha version of VisionWorks and how we modiﬁed it to support the sporadic task model.

VisionWorks adopts the simple execution model prescribed by the OpenVX speciﬁcation. Moreover, in the alpha version software, nodes of a graph are executed in topological order by a single thread. However,

VisionWorks places no restrictions on the threading model used by individual primitives. We found that several primitives, by way third party libraries such as OpenCV, employ OpenMP to execute for-loops across several parallel threads. In order to remain within the constraints of the single-threaded sporadic task mode, we took the necessary steps to disable such intra-node multi-threading. For example, to control OpenMP, we set the environment variable OMP_NUM_THREADS = 1, effectively disabling it.

Sporadic Task Set Execution Model. Our ﬁrst change to VisionWorks was to introduce a new API function, nvxSpawnGraph(), which is used to spawn a graph for periodic execution. This function is somewhat analogous to the OpenVX asynchronous graph-scheduling function vxScheduleGraph(). However, unlike vxScheduleGraph(), a graph spawned by nvxSpawnGraph() executes repetitively, instead of only once.

Each node of a spawned graph is executed by a dedicated thread. Each thread is assigned a common period speciﬁed by the programmer. We assume implicit deadlines. An invocation of a node is equivalent to a periodic job (we sometimes use the terms “node” and “job” interchangeably). We call our new execution model for VisionWorks the “sporadic task set execution model.”

12Speciﬁcally, we refer to our modiﬁed version of VisionWorks and a software library we call libgpui.

247 Graph Input and Output. The OpenVX API assumes that graph input is primed prior to the call of vxProcessGraph() (or vxScheduleGraph()). For example, in Figure 5.4(a), the contents of the input image imageRaw must be set prior to the call to vxProcessGraph() on line 17. We achieve a similar behavior by attaching an “input callback” to the spawned graph. The input callback is executed by an “input callback node” that is prepended to the graph—all source nodes of the original graph become children to the input callback node. The same approach is taken to handle graph output, where sink nodes become parents of an appended “output callback node.”

Graph Dependencies and Pipelining. We use PGMRT to coordinate the execution of the per-node threads.

We achieve this by duplicating the VisionWorks graph structure with an auxiliary graph in PGMRT. We connect

RT k← j k← j k← j the nodes of the PGM graph with edges that use event-signaling tokens. We set %i = κi = ϕi = 1 for all edges.

Recall from Section 5.3 that OpenVX does not pass data through graph edges. Rather, node input and output is passed through singular instances of data objects. Although graph pipelining is naturally supported by the periodic task set execution model, a new hazard arises: a producer node may overwrite the contents of a data object before the old contents has been read or written by consumer node! Such consumers may not even be a direct successor of the producer. For instance, we can conceive of a graph where an image data object is passed through a chain of nodes, each node applying a ﬁlter to the image. The node at the head of this chain cannot execute again until after the image has been handled by the node at the tail. In short, the graph cannot be pipelined.

To resolve this pipelining issue, we begin by ﬁrst replicating all data objects in the graph N times. We set

N to the maximum depth of the graph as a rule of thumb. The number of replicas is actually conﬁgurable by the programmer. (Any value for N ≥ 1 will work, given the fail-safe mechanism we discuss shortly.) The jth invocation of a node (i.e., a job) accesses the ( j modN)th replica. However, replication alone does not ensure safe pipelining, since we do not enforce end-to-end graph precedence constraints. For example, a node on its

( j + N)th invocation may execute before the data it generated on its jth invocation has been fully consumed. To prevent this from happening, we introduce a fail-safe that stalls the node whenever such hazards arise.

The fail-safe is realized through an additional PGM “feedback graph.” To generate the feedback graph, we ﬁrst create a copy of the auxiliary PGM graph that already shadows the structure of the VisionWorks graph. We then add additional edges to this copy. Each additional directed edge connects a node that accesses

248 procedure DOJOB() ConsumeTokens(); . Wait for input from producers. ConsumeFeedbackTokens(); . Wait for go-ahead from descendants. DoPrimitive(); . Execute the primitive. ProduceFeedbackTokens(); . Give go-ahead to any waiting ancestors. ProduceTokens(); . Signal output to consumers. j ← j + 1 . Increment job counter. end procedure Figure 5.5: Procedure for PGMRT-coordinated job execution. a data object to any descendant node that also accesses that data object (no edge is added if such an edge already exists). We then ﬂip the direction of all edges in this graph, and each edge is initialized with an available token count equal to N.

Figure 5.5 outlines the procedure executed by each thread to execute a job. The jth invocation of the node ﬁrst waits for and consumes a token from each of its producers. It then consumes a token from each inbound-edge in the feedback graph, blocking if the requisite tokens are unavailable. The node only blocks if the ( j modN)th data object replicas are still in use.13 Thus, the node stalls, and the hazard is avoided. After executing the actual primitive, the node signals that it is done using the ( j modN)th data object replicas. The node then generates tokens for its consumers.

Example 5.1. We illustrate the measures we take to support pipelining with an example depicted in Figure 5.6.

1 In Figure 5.6(a), we begin with a simple three-node VisionWorks graph, Vi. Suppose that node Vi writes to 2 3 the data object A; node Vi modiﬁes A; and node Vi reads A. Graph Vi has a depth of three. In Figure 5.6(b), we replicate the data object A such that there are three replicas. Figure 5.6(c) gives the structure of Vi,

RT without data objects. This structure is replicated in PGM , represented by graph Gi in Figure 5.6(d). Token production, consumption, and threshold parameters are set to one. We generate the feedback graph, G¯i, as ¯3 ¯1 depicted in Figure 5.6(e). Note the edge connecting node Gi to Gi . ♦

Support for Back-Edges. Computer vision algorithms that operate on video streams often feed data derived from prior frames back into the computations performed on future frames. For example, an object tracking algorithm must recall information about objects of prior frames if the algorithm is to describe the motions of those objects in the current frame. OpenVX deﬁnes a special data object called a “delay,” which is used

13We disable priority boosting and deadline shifting for operations on the feedback graph in order to prevent alterations to properly assigned real-time priorities. The feedback graph is a fail-safe mechanism meant to prevent the overwriting of data before it has been consumed; it is only meant to trigger when task execution behaviors violate provisioned resources (i.e., the number of graph replicas is insufﬁcient). Of course, these special measures only apply to spawned graphs executing under LITMUSRT, since these features are speciﬁc to LITMUSRT.

249 V1 V1 V1 i i i

A A AA

2 same data V 2 V 2 V object i i i

A A AA

V3 V3 V3 i i i

(a) VisionWorks graph, Vi. (b) Vi with replicated data objects. (c) Structure of Vi without data objects.

— G1 1 i Gi

— G2 2 initialize each i Gi edge with 3 tokens

3 — 3 Gi Gi

(d) PGM graph, Gi, of Vi. (e) PGM feedback graph, G¯i, for Vi.

Figure 5.6: Derivation of PGM graphs used to support the pipelined thread-per-node execution model.

250 to buffer node output for use by subsequent node invocations. A delay is essentially a ring buffer used to contain other data objects (e.g., prior image frames). The oldest data object is overwritten when a new data object enters the buffer. The number of data objects stored in a ring buffer (or the “size” of the delay) is tied to how “far into the past” the vision algorithm must go. For example, suppose a node operates on frame i and it needs to access copies of the last two prior frames. In this case, the size of the delay would be two.

The consumer node of data buffered by a delay may appear anywhere within a graph. It may be a ancestor or descendant of the producer node—it may even be the producer itself. A back-edge is created when the consumer node of a delay is not a descendant of the producer node in the graph derived from non-delay data objects. Such back-edges can be used to implement the object tracking algorithm described above.

Due to complexities in the implementation and use of delays in VisionWorks, for our periodic task set execution model, we do not replicate delay data objects as we do for other types of data objects. Instead, we simply increase the size of the delay ring buffer to store an additional N data objects. For example, suppose we have a delay with size M; we increase the size of the delay to M + N. We also introduce the necessary back-edges to our auxiliary and feedback PGM graphs. The available token counts for these back-edges must be initialized with the appropriate number of tokens to allow a consuming node to execute while the delay is initially ﬁlling with data objects.

Scheduling Policies. The programmer may specify which scheduling policy to use for the threads in our sporadic task set execution model. Our modiﬁed VisionWorks supports the standard Linux SCHED_OTHER and SCHED_FIFO policies, as well as SCHED_LITMUS. We use POSIX real-time timers to implement periodic job releases under the standard Linux policies. We rely upon the periodic job release infrastructure of LITMUSRT for the SCHED_LITMUS policy.

We assume deadline-based scheduling under the SCHED_LITMUS policy. When the SCHED_FIFO policy is employed, the programmer supplies a “base” graph priority. The priority of a thread of a given node is determined by taking the length of the longest path between the node and the input callback source node, plus the base graph priority. Thus, thread priorities increase monotonically down the graph. We use this prioritization scheme to expedite the movement of data down a graph. It is important to ﬁnish work deeper in the graph, since graph execution is pipelined.

251 5.4.2 Libgpui: An Interposed GPGPU Library for CUDA

VisionWorks is a large and complex software package. The alpha version of VisionWorks we modiﬁed was made up of approximately 80k lines of code, not including third-party libraries. In the evaluation of

GPUSync in Chapter 4, the test programs interfaced directly with GPUSync through the user interface provided by liblitmus, as described in Section 3.3.5. We deemed it infeasible to manually alter VisionWorks and GPU-using third-party libraries to use GPUSync directly.

Instead, we developed an interposition library, libgpui, to intercept all calls to the lowest-level library of NVIDIA’s CUDA software, libcuda. Libgpui intercepts all API calls that may launch a kernel or issue a

DMA operation. With the exception of GPU token acquisition, libgpui is entirely transparent to the user—no modiﬁcations to user code are necessary. The library automatically interfaces with GPUSync on behalf of the

API callee. For example, with features provided by libcuda, libgpui inspects the memory address parameters of the DMA API call to deduce the end-points of the DMA operation (i.e., the memories of the source and destination of the DMA) and libgpui automatically requests the needed copy engine locks. Each API call is passed on to libcuda once GPUSync schedules the operation. Libgpui also automatically breaks large DMA operations into smaller chunks, as we discussed in Section 3.2.3.4.

Libgpui also overrides three default behaviors of the CUDA runtime to improve real-time predictability.

First, libgpui can be conﬁgured to enforce suspension-based synchronization when tasks wait for a GPU operation to complete. Second, libgpui ensures that proper stream synchronization behaviors are followed to ensure engine independence (see Section 2.4.4), if it is required for the particular GPU being used. Finally, libgpui forces all GPU operations to be issued on unique per-thread CUDA streams. This is a change from the current CUDA runtime, where threads within a process share a common stream by default.14 This prevents the threads for contending for access to a shared CUDA stream, which would be arbitrated by the CUDA runtime and result in unpredictable behavior.

5.4.3 VisionWorks, libgpui, and GPUSync

We now describe how VisionWorks interacts with GPUSync through libgpui. The alpha version of

VisionWorks does not support transparent migration of data objects among GPUs, so we focus our attention

14Coincidentally, a stream-per-thread feature similar to libgpui’s is slated to be a part of the forthcoming CUDA 7.0 (Harris, 2015).

252 Actual Token Critical Section

Ideal Token Critical Section

CE Critical Section job completes

CPU

CE time t1 t2 t3

Request Token Acquire Token Release Token Suspended

Request Eng. Lock Acquire Eng. Lock Release Eng. Lock

Figure 5.7: Overly long token critical sections may result by releasing tokens at job completion time. on partitioned GPU scheduling under GPUSync.15 The user assigns a given VisionWorks application to a

GPU partition through an environment variable when the application is launched. Libgpui reads this variable and initializes the CUDA runtime for the selected GPU on behalf of the VisionWorks application.

Although there is only one GPU within each partition, GPUSync still requires each GPU-using job to acquire one of the ρ tokens before using the GPU. This is necessary to ensure that interrupt and CUDA runtime callback threads are assigned proper real-time priorities. When libgpui intercepts an API call, it

first checks to see if the callee already holds a GPU token. If not, libgpui requests and obtains a token before proceeding with the necessary engine lock request. We modified VisionWorks to interface directly with libgpui to release any held token upon job completion—this is the only aspect to libgpui that is not transparent to VisionWorks. (Unfortunately, there is no CUDA API libgpui can intercept and interpret as a job completion, so libgpui must be notified explicitly.)

One drawback to using libgpui in this way is that token critical section lengths may be longer than are strictly necessary. We depict such a case in Figure 5.7, where a primitive performs a DMA operation early in its execution and does not use the GPU during the remainder of its computations. We may see such an execution pattern in a primitive that executes primarily on a CPU, but consumes the output of another primitive that executes on a GPU. In Figure 5.7, at time t1, the primitive operation executing on the CPU

15VisionWorks data-object abstractions do not preclude transparent migration. However, we deemed adding such support to be a non-trivial effort and outside the scope of our core goal of adding real-time GPU scheduling to VisionWorks.

253 obtains a GPU token. The primitive then copies an OpenVX data object from GPU memory to host memory.

This DMA operation completes at time t2. The remaining computations of the primitive take place entirely on the CPU, completing at time t3. At this point, the GPU token is released. Under our methodology, the

GPU token is held during these CPU computations over the time interval [t1,t3], even though the GPU is no longer required after time t2. Ideally, the GPU token would be freed at time t2. This is a compromise we make by not integrating GPUSync with VisionWorks directly. We may work around this issue by splitting the primitive into two parts: a DMA primitive and a CPU-computation primitive. However, such a change is invasive—perhaps more invasive than simply modifying the primitive to communicate directly with libgpui to release the token early. Perhaps future versions of VisionWorks or OpenVX could include an API that allows a primitive to express to the execution framework of when a particular heterogeneous computing element

(e.g., a GPU) is no longer needed. This API could be leveraged to shorten the token critical section length.

5.5 Evaluation of VisionWorks Under GPUSync

In this section, we evaluate the runtime performance of our sporadic task set execution model for

VisionWorks under eight configurations of GPUSync. We compare our results against VisionWorks (using the same sporadic task set execution model) running under two purely Linux-based configurations, as well as a LITMUSRT configuration without GPUSync.

The rest of this section is organized as follows. We begin with a description of the computer vision application we used to evaluate VisionWorks with GPUSync. We then discuss our experimental setup. Finally, we present our results in two parts. In the ﬁrst part, we report the observed frame-to-frame output delays of many concurrently executed graphs in our experiments. In the second part, we report on observed end-to-end response-time latencies of these same graphs.

5.5.1 Video Stabilization

VisionWorks includes a variety of demo applications, including a pedestrian detection application, not unlike the one illustrated in Figure 5.4. However, the pedestrian detection application is relatively uninteresting from a scheduling perspective—it is made up of only a handful of nodes arranged in a pipeline.

In our evaluation, we use VisionWorks’ “video stabilization” demo, since the application is far more complex and uses primitives common to other computer vision algorithms. Video stabilization is used to digitally

254 Duplicate Color Image

Compute Compute Display Read Convert To Compute Homography Smooth Warp Image Optical Stabilized Frame Grayscale Homography Filter Homography Perspective Pyramid Flow Image

Harris Feature Tracker

Forward Edge Delay Edge

Figure 5.8: Dependency graph of video stabilization application. dampen the effect of shaky camera movement on a video stream. Vehicle-mounted cameras are prone to camera shake. A video stream may require stabilization as a pre-processing step before higher-level processing is possible. For example, an object tracker may require stabilization—too much shake may decrease the accuracy of predicted object trajectories.

Figure 5.8 depicts the dependency graph of the video stabilization application. Table 5.1 gives a brief description of each node in this graph. We make note of two characteristics of this graph. First, video stabilization operates over a temporal window of several frames. This is needed in order to differentiate between movements due to camera shake and desired camera translation (i.e., stable long-term movement of the camera). These inter-frame dependencies are implemented using OpenVX delay data objects, which are reﬂected by delay edges in Figure 5.8. Second, although the primitive of a node may execute entirely on CPUs, it may still use a GPU to pull data out of GPU memory through DMA operations. The “Display

Stabilized Image” node is such an example. Here, the “Warp Perspective” node performs its computation and stores a stabilized frame in GPU memory. The Display Stabilized Image node pulls this data off of the GPU through DMA. This means that the Display Stabilized Image node will compete with other nodes for GPU tokens, even though it does not use the GPU to perform computation.

5.5.2 Experimental Setup

In this evaluation, we focus on clustered CPU scheduling with a single GPU. This focus is motivated by the results of Chapter 4, as well as the inability of our version of VisionWorks to support the migration of data objects among GPUs. Moreover, such a multicore single-GPU system reﬂects many common embedded computing platforms available today.

255 Node Name Function Read Frame Reads a frame from the video source. Duplicate Color Image Copies the input color image for later use. Convert To Grayscale Converts a frame from a color to grayscale (i.e., “black and white”) image. Harris Feature Tracker Detects Harris corners (features) in an image. Compute Image Pyramid Resizes the image into several images at multiple resolutions. Compute Optical Flow Determines the movement of image features from the last frame into the current frame. Compute Homography Computes a “homography matrix” that characterizes the transformation from the last frame into the current frame. Homography Filter Filters noisy values from homography matrix. Smooth Homography Merges the homography matrices of the last several frames into one. Warp Perspective Transforms an image using a provided homography matrix. (Stabilization occurs here.) Display Stabilized Image Displays the stabilized image.

Table 5.1: Description of nodes used in the video stabilization graph of Figure 5.8.

Our experimental workload was comprised of seventeen instances of the video stabilization application in order to load the CPUs and GPU engines. Each graph was executed within its own Linux process. Thus, each instance had eleven regular real-time threads (one for each node in the graph depicted in Figure 5.8) and one CUDA callback thread, which GPUSync automatically schedules as a real-time task (see Section 3.2.5.2).

With the inclusion of the GPU interrupt handling thread (see Section 3.2.5.1), there were a total of 205 real-time threads in the workload.

To each graph we assigned a period that was shared by every real-time task of the nodes therein. Two graphs had 20ms periods; another two graphs had 30ms periods; and yet another two graphs had 40ms periods. Five graphs had a period of 60ms; four graphs had a period of 80ms; and another two graphs had a period of 100ms. These periods represent a range of those we ﬁnd in ADAS, such as those we described in

Chapter 1 (see Table 1.1). Table 5.2 summarizes these period assignments.

We isolated the real-time tasks to a single NUMA node of the hardware platform we used in Chapter 4.

The remaining NUMA node was used for performance monitoring and did not execute any real-time tasks.

The workload was executed across six CPUs and a single NUMA-local GPU. This heavily loaded the CPUs and GPU to near capacity.16

The workload was scheduled under eleven different system conﬁgurations. Eight of these conﬁgurations used GPUSync under LITMUSRT’s C-EDF scheduling, with CPU clustering around the L3 cache (i.e., G-EDF

16During experimentation, the system tool top reported the NUMA node to be 96% utilized (a CPU utilization of 5.76). Similarly, the NVIDIA tool nvidia-smi reported the EE to be 66% utilized.

256 Period No. Graphs Base Priority (for SCHED_FIFO only) 20ms 2 75 30ms 2 60 40ms 2 45 60ms 5 30 80ms 4 15 100ms 2 1

Table 5.2: Evaluation task set using VisionWorks’ video stabilization demo. scheduling within the NUMA node). We experimented with four GPUSync conﬁgurations of the GPU

Allocator. These conﬁgurations differed in the number of tokens, ρ, and maximum FIFO length, f . The number of tokens were set to one, two, three, or six. The maximum FIFO length was set correspondingly to six, three, two, or one. We are interested in these GPU Allocator conﬁgurations because the associated

GPU token blocking terms under each configuration are optimal with respect to suspension-oblivious schedulability analysis for a platform with six CPUs and a single GPU. Under each corresponding pairing of ρ and f , we experimented with both FIFO-ordered (FIFO) and priority-ordered (PRIO) engine locks, resulting in the eight GPUSync configurations. We refer to each configuration with a tuple of the form

(ρ, f ,Engine Lock Protocol). For example, a configuration with ρ = 2, f = 3, and priority-ordered engine locks is denoted by (2,3,PRIO). The remaining three system configurations were as follows: LITMUSRT C-EDF scheduling without GPUSync; Linux under SCHED_FIFO fixed-priority scheduling; and Linux under

SCHED_OTHER scheduling. These last three conﬁgurations relied upon the GPGPU runtime, GPU driver, and GPU hardware to handle all GPU resource arbitration and scheduling. Under the SCHED_FIFO policy, graphs were assigned the base priorities given in Table 5.2. A ﬁxed-priority for the thread of each node was derived using the method we described in Section 5.4.1.

We executed our task set under each of the eleven system configurations for 400 seconds. Each graph processed a pre-recorded video file cached in system memory. For all configurations, we used libgpui to force all tasks to suspend while waiting for GPU operations to complete, and to also ensure that all nodes used distinct GPU streams to issue GPU operations. Libgpui invoked GPUSync for only the GPUSync configurations—libgpui passed CUDA API calls immediately on to the underlying CUDA library for non-

GPUSync conﬁgurations.

257 ReleaseDeadline Complete 2p - e ei i i

T Ti i (a) Back-to-back job completion. (b) First job completes as early as possible, while last job completes as late as possible.

Figure 5.9: Scenarios that lead to extreme values for measured completion delays.

5.5.3 Results

We now report our findings in two parts. In the first part, we examine the observed delay between consecutive outputs of the graphs. In the second part, we examine observed end-to-end response time latencies (i.e., the time from the release of a source node to the completion of the corresponding sink node) under the LITMUSRT-based configurations.

5.5.3.1 Completion Delays

We require a common observational framework in order to fairly compare the eleven system configurations. Although LITMUSRT offers kernel-level low-overhead tracing capabilities, we are unable to make use of them for the non-LITMUSRT-based configurations. Instead, we take our measurements from user-space within the graph applications themselves, and we examine the timing properties of the sink node in each graph. Specifically, we look at the delay between consecutive completions of the sink node in each graph. To gather these metrics, the sink node of each graph (i.e., the “Display Stabilized Image” node) records a time stamp at the end of its computation. We then measure the difference in consecutive timestamps to obtain a

“completion delay” metric.

There are two beneﬁts to this completion delay metric. First, we may make observations from user space; we do not require special support from the operating system, besides access to accurate hardware time counters (which are commonly available). Second, we may apply the completion delay metric to any periodic task set, regardless of any processor schedulers, be they real-time or general purpose. There are two limitations, however. First, completion delay metrics are only useful within the context of periodic task sets, since task periods not only ensure a minimum separation time between job releases of a given task,

258 but also a maximum separation time. We cannot meaningfully compare measured completion delays if the gap between consecutive job releases varies. Second, the metric is inherently noisy. Figure 5.9 depicts two extreme completion delay values that are possible for an implicit-deadline periodic task that is hard real-time schedulable. In Figure 5.9(a), the two jobs complete back-to-back. In Figure 5.9(b), the ﬁrst job completes as early as possible and the second as late as possible. As we see, any completion delay measurement for a job

Ji of a schedulable periodic task set may vary on the interval [ei,2pi − ei]. Of course, this interval may grow if the ﬁrst of two consecutive jobs executes for less than ei time units. The upper-bound (i.e., 2pi − ei) may also increase if we consider task sets that are schedulable with bounded tardiness. Despite these limitations, we feel that observed completion delays are useful in reasoning about the real-time runtime performance of our various system conﬁgurations.

What makes a “good” completion delay measurement? In these experiments, we desire smooth playback of stabilized video—we want completion delays that are close to the graph period. We also want to see little variation in completion delays. We measure variation by computing the standard deviation (σ) of measured completion delays of each graph.

Tables 5.3 through 5.13 report the characteristics of the completion delays for each graph under each labeled conﬁguration. The tables include columns for the maximum, 99.9th percentile, 99th percentile, median, and mean observed completion delay for each graph. The tables also include the standard deviation for each mean. The tables also include a column reporting the percentage of video frames that a graph could not complete in the allotted time. For example, in a 400 second experiment, a graph with a period of 20ms should complete 20,000 frames. If it only completes 15,000 frames, then we say that the graph has dropped 5,000 frames, or 25%. A graph with dropped frames is indicative of processor starvation. This gives us another method by which to detect an unschedulable task set observationally. We make the following observations.

Observation 42. The task set is clearly unschedulable under the SCHED_FIFO, and GPUSync with ρ ∈

{1,2} conﬁgurations.

We look at the percentage of dropped frames to detect unschedulability. Each system conﬁguration exhibits different characteristics as to which graphs dropped frames.

In Table 5.4, we see that the graphs with the longest periods (100ms), i.e., those with the lowest ﬁxed priorities, dropped many frames under SCHED_FIFO. For instance, G16 dropped 88.25% of the 4,000 frames

259 that should have been processed. This sort of behavior is characteristic of ﬁxed-priority schedulers—the lowest priority tasks may be starved of CPU time.

In Tables 5.6 and 5.7, we see that many frames are dropped by all tasks under GPUSync with ρ = 1.17

The percentage of dropped frames increases with period. For example, G1 in Table 5.6 drops 25.41% of its frames, while G17 drops about 47.33%. Although performance is generally bad under these configurations, we do observe that the number of dropped frames is strongly correlated with graph period. That is, performance is very regular. For instance, the five graphs in Table 5.7 with a 60ms period (graphs G7 through G11) each drop either 44.76% or 44.78% of frames—they all exhibit nearly the same behavior. Similar regularity is reflected by the median and mean completion delays for these tasks.

In Tables 5.8 and 5.9, we see that performance improves with a greater number of GPU tokens, where

ρ = 2. However, frames are still dropped. We observe a similar regularity in the percentage of dropped frames as we saw in Tables 5.6 and 5.7 (GPUSync with ρ = 1).

Observation 43. GPUSync, with ρ ∈ {3,6}, improved observed predictability.

We can observe this by looking at several different metrics. We ﬁrst compare GPUSync with ρ ∈ {3,6}

(Tables 5.10 through 5.13) against LITMUSRT without GPUSync (Table 5.5). These system conﬁgurations all share the same CPU scheduler (C-EDF), yet there are clear differences in runtime behavior.

We observe signiﬁcant differences in terms of outlier behavior. For example, in Table 5.5 for LITMUSRT

th th without GPUSync, G7 has a maximum, 99.9 percentile, and 99 percentile completion delays of 1108.68ms,

776.63ms, and 465.71ms, respectively. In Table 5.13 for GPUSync with (6,1,PRIO), G7 has a maximum, 99.9th percentile, and 99th percentile completion delays of 224.33ms, 192.2ms, 88.0ms, respectively. The observed completion delays for G7 under GPUSync with (6,1,PRIO) were reduced by factors of approximately 4.9, 4.0, and 5.3 for maximum, 99.9th percentile, and 99th percentile measurements, respectively. We see similar improvements when we consider GPUSync with (3,2,FIFO), (3,2,PRIO), and (6,1,FIFO).

We also compare the behavior of GPUSync with ρ ∈ {3,6} against LITMUSRT without GPUSync in terms of the standard deviation of completion delays. In Table 5.5 for LITMUSRT without GPUSync, the standard deviations range over [15.47ms,130.21ms]. In Table 5.10 for GPUSync with (3,2,FIFO), the range

17Since ρ = 1, these conﬁgurations are theoretically equivalent since a token-holding job immediately receives every engine lock it requests. That is, the token grants exclusive access to the GPU. However, we do see some minor variation. These variations may be simply due to experimental noise or differences in runtime overheads due to different locking logic behind FIFO and PRIO engine locks.

260 is [6.21ms,10.07ms]. In Table 5.11 for GPUSync with (3,2,PRIO), the range is [6.32ms,10.55ms]. In

Table 5.12 for GPUSync with (6,1,FIFO), the range is [8.11ms,14.65ms]. In Table 5.13 for GPUSync with

(6,1,PRIO), the range is [10.64ms,16.94ms]. With the exception of GPUSync with (6,1,PRIO), the greatest standard deviations of these GPUSync conﬁgurations are less than the smallest standard deviation under

LITMUSRT without GPUSync.

From these observations, we conclude that real-time CPU scheduling alone is not enough to ensure real-time perceptibility in a system with GPUs.

We make similar comparisons of GPUSync with ρ ∈ {3,6} against SCHED_OTHER (Table 5.3), and we also observe signiﬁcant differences in terms of outlier behavior. For example, in Table 5.3 for

th th SCHED_OTHER, G1 has a maximum, 99.9 percentile, and 99 percentile completion delays of 405.88ms,

233.32ms, and 54.61ms, respectively. In Table 5.10 for GPUSync with (3,2,FIFO), G1 has a maximum, 99.9th percentile, and 99th percentile completion delays of 115.11ms, 83.21ms, 33.81ms, respectively.

GPUSync with (3,2,FIFO) reduces maximum, 99.9th percentile, and 99th percentile completion delays by factors of approximately 3.5, 2.8, and 1.6 times, respectively. We see similar improvements when we consider GPUSync with (3,2,PRIO), (6,1,FIFO), and (6,1,PRIO).

We may also compare the behavior of GPUSync with ρ ∈ {3,6} against SCHED_OTHER in terms of the standard deviation of completion delays. In Table 5.3 for SCHED_OTHER, the standard deviations range over [13.90ms,18.36ms]. In Table 5.10 for GPUSync with (3,2,FIFO), the range is [6.21ms,10.07ms].

In Table 5.11 for GPUSync with (3,2,PRIO), the range is [6.32ms,10.55ms]. In Table 5.12 for GPUSync with (6,1,FIFO), the range is [8.11ms,14.65ms]. In Table 5.13 for GPUSync with (6,1,PRIO), the range is

[10.64ms,16.94ms]. Although the standard deviation ranges among these GPUSync conﬁgurations differ, they are less than that of SCHED_OTHER. Indeed, the greatest standard deviations when ρ = 3 are less than the smallest standard deviation under SCHED_OTHER.

Observation 44. In this experiment, the SCHED_OTHER conﬁguration outperformed both real-time conﬁgurations that lack real-time GPU management.

A surprising result from these experiments is that the SCHED_OTHER conﬁguration (Table 5.3) outperforms both SCHED_FIFO (Table 5.4) and LITMUSRT without GPUSync (Table 5.5) conﬁgurations. The

SCHED_FIFO conﬁguration dropped frames while the SCHED_OTHER conﬁguration did not. The LIT-

MUSRT without GPUSync conﬁguration has worse outlier behavior than the SCHED_OTHER conﬁguration.

261 These differences in behavior may be due to busy-waiting employed by the CUDA runtime or Vision-

Works software, despite the fact that libgpui forces tasks to suspend while waiting for GPU operations to complete. Under SCHED_FIFO and LITMUSRT without GPUSync conﬁgurations, the CPU scheduler always schedules the m-highest priority tasks (in this experiment, m = 6) that are ready to run. CPU time may be wasted if any of these tasks busy-wait for a long duration of time. The amount of CPU time wasted by a task at the expense of other tasks is limited under SCHED_OTHER, since the scheduler attempts to distribute

CPU time equally among all tasks. That is, the scheduler will preempt a task that busy-waits for too long.

This assumes that the tasks use preemptive busy-waiting, but this would be expected of software developed for general purpose computing.18 Interestingly, the fact that the SCHED_FIFO and LITMUSRT without GPUSync conﬁgurations do not deadlock suggests that any such busy-waiting is probably not used to implement a locking protocol. (Preemptive busy-waiting in a locking protocol under real-time scheduling can easily lead to deadlock.)

It is difﬁcult to draw general conclusions from the completion delay data presented in Tables 5.3 through 5.13 because there is so much data. To help us gain additional insights into the performance of the eleven system conﬁgurations, we collapsed the information in the above eleven tables into a single table,

Table 5.14. We collapse the data with the following process. We compute the total percentage of dropped

18Non-preemptive execution generally requires privileged permissions (e.g., “superuser” permissions).

Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 0 405.88 233.32 54.61 15.94 20.00 15.85 G2 20 0 412.24 248.62 53.06 16.12 20.00 15.57 G3 30 0 377.97 246.86 62.68 28.34 29.99 17.33 G4 30 0 399.43 240.92 64.64 28.08 30.00 17.71 G5 40 0 397.06 291.17 71.02 40.03 40.00 17.75 G6 40 0 452.15 295.14 69.62 39.94 39.99 18.13 G7 60 0 473.23 376.42 82.87 60.07 59.99 17.68 G8 60 0 453.20 367.44 85.32 60.47 59.99 17.91 G9 60 0 414.78 352.29 86.16 60.36 59.99 17.14 G10 60 0 462.29 372.52 84.26 60.50 59.99 17.91 G11 60 0 454.22 331.98 87.79 59.99 59.99 17.66 G12 80 0 485.56 338.03 107.36 79.86 79.98 17.68 G13 80 0 442.45 340.89 105.01 79.87 79.98 16.90 G14 80 0 391.75 349.92 105.96 79.78 79.98 16.48 G15 80 0 486.90 351.47 106.70 79.62 80.00 18.36 G16 100 0 440.83 256.71 127.19 100.00 99.99 15.48 G17 100 0 376.85 273.21 127.43 99.92 99.98 13.90

Table 5.3: Completion delay data for SCHED_OTHER. Time in ms.

262 Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 0 99.25 96.09 25.68 20.26 20.00 4.78 G2 20 0 99.29 95.93 25.86 20.34 20.00 4.88 G3 30 0 115.89 105.92 38.39 28.93 30.00 6.60 G4 30 0 116.27 105.41 38.72 28.90 30.00 6.81 G5 40 0 184.15 140.43 45.91 38.92 40.00 6.73 G6 40 0 184.30 142.32 45.89 38.94 40.00 6.69 G7 60 0 533.52 387.16 96.62 59.22 59.98 21.72 G8 60 0 456.71 351.90 96.87 59.58 60.00 20.61 G9 60 0 519.93 347.87 94.97 59.24 59.99 20.36 G10 60 0 463.48 346.12 91.02 59.76 59.99 20.61 G11 60 0 396.98 353.92 96.65 59.63 59.99 19.86 G12 80 0 904.76 770.43 204.38 66.19 79.98 49.61 G13 80 0 964.77 729.72 213.20 66.37 79.95 49.63 G14 80 0 903.07 771.49 202.04 66.64 79.93 48.67 G15 80 0 936.06 717.49 233.15 66.59 79.94 49.44 G16 100 88.25 3786.02 3786.02 2279.03 920.63 850.65 559.04 G17 100 88.25 3844.70 3844.70 2333.78 879.83 848.36 580.58

Table 5.4: Completion delay data for SCHED_FIFO. Time in ms.

Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 0 612.16 300.41 159.49 4.62 20.00 34.69 G2 20 0 254.20 143.12 62.49 19.30 20.00 15.47 G3 30 0 309.02 242.21 91.72 27.21 30.00 22.61 G4 30 0 272.94 170.40 73.96 27.71 30.00 19.36 G5 40 0 450.22 295.02 135.39 37.88 40.00 31.04 G6 40 0 497.19 335.93 154.33 36.91 40.00 38.84 G7 60 0 1108.68 776.63 465.71 43.80 59.96 85.33 G8 60 0 785.02 596.92 256.90 56.42 59.96 67.26 G9 60 0 459.19 373.74 152.66 58.97 59.95 33.11 G10 60 0 383.78 307.72 143.46 58.83 59.95 25.82 G11 60 0 383.59 307.70 143.25 58.81 59.95 25.81 G12 80 0 703.36 543.54 257.50 70.84 79.86 65.42 G13 80 0 861.65 652.05 315.17 69.61 79.86 76.18 G14 80 0 829.90 601.26 264.91 70.27 79.86 70.62 G15 80 0 865.79 666.63 363.45 69.41 79.86 79.94 G16 100 0 1278.93 1077.55 641.47 79.46 99.78 124.68 G17 100 0 1416.51 1014.76 711.54 79.27 99.78 130.21

Table 5.5: Completion delay data for LITMUSRT without GPUSync. Time in ms.

263 Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 25.41 413.27 200.79 43.21 25.48 26.90 14.83 G2 20 25.41 406.52 200.29 43.57 25.47 26.90 15.09 G3 30 36.42 422.43 345.47 113.72 45.70 47.35 18.18 G4 30 36.42 422.15 270.20 117.96 45.75 47.34 17.43 G5 40 41.23 435.51 364.95 200.15 66.22 68.29 20.48 G6 40 41.24 436.16 356.91 204.60 66.11 68.29 20.16 G7 60 44.82 462.04 454.50 238.66 105.89 109.09 24.54 G8 60 44.48 460.74 453.70 238.53 105.84 108.58 25.26 G9 60 44.48 458.31 450.14 238.97 106.00 108.58 25.21 G10 60 44.48 451.62 442.64 238.14 105.95 108.58 25.28 G11 60 44.49 455.47 448.70 241.87 105.83 108.60 25.67 G12 80 46.16 492.02 479.68 285.07 145.90 149.23 29.64 G13 80 46.16 491.48 479.88 281.86 145.69 149.23 29.51 G14 80 46.16 492.47 473.16 281.49 145.84 149.22 28.73 G15 80 46.16 471.79 462.25 285.98 145.79 149.22 29.47 G16 100 47.33 519.71 498.72 329.67 186.90 190.60 33.30 G17 100 47.33 543.10 508.17 324.20 186.96 190.59 33.38

Table 5.6: Completion delay data for GPUSync for (1,6,FIFO). Time in ms.

Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 25.3 407.33 200.10 43.65 25.22 26.87 14.89 G2 20 25.3 357.44 199.38 43.77 25.79 26.87 14.81 G3 30 36.35 418.08 276.83 119.78 45.81 47.32 17.38 G4 30 36.35 420.18 282.08 113.06 45.77 47.32 18.07 G5 40 41.18 429.04 356.54 203.34 66.12 68.27 20.12 G6 40 41.17 436.99 352.97 201.51 66.29 68.26 20.07 G7 60 44.76 462.70 452.84 237.07 105.94 109.00 24.34 G8 60 44.76 458.72 454.74 239.81 105.86 109.00 24.54 G9 60 44.76 451.96 450.42 239.74 105.90 109.00 24.48 G10 60 44.78 452.22 444.57 237.70 105.96 109.03 24.37 G11 60 44.78 455.94 449.44 239.17 105.81 109.03 24.68 G12 80 46.44 485.87 473.34 285.59 145.74 149.81 27.96 G13 80 46.44 495.53 479.42 279.90 145.98 149.81 27.58 G14 80 46.44 499.83 478.61 282.22 145.82 149.81 28.22 G15 80 46.44 544.97 474.02 282.61 145.84 149.81 28.30 G16 100 47.63 522.80 515.44 328.67 187.30 191.44 31.31 G17 100 47.63 566.30 515.94 327.71 187.12 191.44 31.92

Table 5.7: Completion delay data for GPUSync for (1,6,PRIO). Time in ms.

264 Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 0.99 174.27 93.66 30.40 19.84 20.29 6.00 G2 20 1.06 124.28 96.09 30.85 19.86 20.31 6.09 G3 30 1.14 180.10 111.63 43.56 30.00 30.49 7.09 G4 30 1.15 178.34 108.24 44.06 29.96 30.49 7.00 G5 40 1.37 183.82 120.26 66.28 40.05 40.76 7.64 G6 40 1.33 206.13 122.24 61.22 40.05 40.74 7.62 G7 60 7.32 218.13 170.21 109.48 63.10 65.03 10.55 G8 60 7.28 215.98 162.14 105.90 63.04 65.00 10.26 G9 60 7.35 216.59 162.00 112.48 63.11 65.04 10.54 G10 60 7.29 218.09 152.88 109.10 63.02 65.00 10.32 G11 60 7.31 217.27 167.35 109.57 63.11 65.01 10.35 G12 80 22.66 248.40 217.49 162.55 102.92 103.85 13.80 G13 80 22.64 250.84 208.36 156.96 103.03 103.82 13.55 G14 80 22.58 250.55 208.46 155.46 102.80 103.72 13.53 G15 80 22.64 246.19 223.89 158.22 102.86 103.80 14.05 G16 100 30.08 275.23 240.95 208.00 142.62 143.53 15.73 G17 100 30.05 244.98 229.37 200.21 142.66 143.46 15.31

Table 5.8: Completion delay data for GPUSync for (2,3,FIFO). Time in ms.

Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 1.12 188.25 102.50 30.83 19.88 20.32 6.36 G2 20 1.07 178.11 105.39 30.85 19.85 20.31 6.40 G3 30 1.13 197.98 111.49 43.57 29.97 30.49 7.23 G4 30 1.19 162.21 115.02 44.53 29.98 30.51 7.30 G5 40 1.42 191.41 133.79 63.56 39.98 40.78 8.30 G6 40 1.47 214.40 126.83 65.99 40.02 40.80 8.09 G7 60 7.89 216.31 193.13 108.98 63.35 65.43 10.87 G8 60 7.89 218.85 190.42 105.69 63.41 65.43 10.75 G9 60 7.86 222.21 186.86 111.05 63.51 65.41 10.70 G10 60 7.95 214.20 170.20 114.40 63.54 65.47 10.78 G11 60 7.89 226.12 197.13 108.45 63.47 65.43 10.79 G12 80 23.08 255.15 222.58 157.23 103.61 104.41 13.77 G13 80 23.08 238.12 228.92 154.70 103.48 104.41 13.98 G14 80 23.06 248.43 229.53 154.51 103.64 104.39 13.72 G15 80 23.04 261.14 238.38 154.46 103.40 104.35 14.16 G16 100 30.25 308.29 261.81 204.78 142.94 143.88 15.95 G17 100 30.30 275.49 256.45 210.65 142.91 143.98 16.34

Table 5.9: Completion delay data for GPUSync for (2,3,PRIO). Time in ms.

265 Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 0 115.11 83.21 33.81 19.56 20.01 6.21 G2 20 0 154.24 89.77 32.45 19.83 20.01 6.41 G3 30 0 119.40 91.59 44.81 29.55 30.01 6.26 G4 30 0 119.96 90.32 45.51 29.60 30.01 6.48 G5 40 0 154.99 93.90 56.19 39.60 40.02 6.30 G6 40 0 176.54 95.40 55.83 39.70 40.01 6.54 G7 60 0 299.21 118.69 78.54 59.51 60.04 7.09 G8 60 0 209.89 125.10 79.49 59.68 60.04 7.10 G9 60 0 246.73 112.11 79.03 59.57 60.03 6.98 G10 60 0 175.79 130.78 80.90 59.72 60.04 7.34 G11 60 0 205.56 122.91 79.30 59.60 60.03 6.66 G12 80 0 152.52 141.36 105.04 79.44 80.01 8.19 G13 80 0 198.76 141.15 104.87 79.63 80.01 7.93 G14 80 0 176.11 142.24 106.64 79.55 80.01 8.42 G15 80 0 226.29 152.43 103.39 79.55 80.02 8.52 G16 100 0 209.92 172.15 129.92 99.57 100.07 10.07 G17 100 0 209.91 170.36 131.09 99.59 100.07 9.85

Table 5.10: Completion delay data for GPUSync for (3,2,FIFO). Time in ms.

Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 0 100.51 86.71 33.42 19.09 20.01 6.32 G2 20 0 171.95 82.87 34.49 19.37 20.01 6.50 G3 30 0 163.08 91.38 44.38 29.25 30.01 6.84 G4 30 0 119.83 95.17 44.77 29.62 30.01 6.62 G5 40 0 216.41 101.41 55.15 40.46 40.02 7.35 G6 40 0 238.03 95.13 56.15 39.73 40.02 7.60 G7 60 0 215.71 120.58 81.55 59.57 60.00 6.70 G8 60 0 295.60 122.66 80.61 59.61 60.01 7.99 G9 60 0 210.39 121.58 81.13 59.50 60.01 7.37 G10 60 0 182.70 122.54 79.40 59.59 59.99 6.21 G11 60 0 142.32 119.76 78.38 59.58 59.98 6.08 G12 80 0 175.73 146.69 106.27 79.47 80.00 8.65 G13 80 0 163.97 148.16 109.05 79.78 80.00 8.94 G14 80 0 170.10 147.12 109.53 79.56 80.00 9.93 G15 80 0 218.05 152.67 109.07 79.62 80.01 9.59 G16 100 0 208.42 168.30 139.57 99.89 100.07 10.55 G17 100 0 201.03 163.87 135.47 99.80 100.06 10.13

Table 5.11: Completion delay data for GPUSync for (3,2,PRIO). Time in ms.

266 Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 0 84.66 69.29 39.55 18.56 20.01 8.11 G2 20 0 100.91 71.41 40.40 18.31 20.01 8.28 G3 30 0 106.65 78.04 49.33 29.45 30.02 9.32 G4 30 0 117.70 67.74 49.16 29.97 30.02 9.33 G5 40 0 188.84 87.52 58.98 40.21 40.02 9.21 G6 40 0 242.19 80.76 58.93 40.09 40.03 9.14 G7 60 0 378.94 120.78 83.18 60.86 60.03 9.42 G8 60 0 378.56 122.67 82.59 60.77 60.03 8.89 G9 60 0 378.50 120.32 84.30 60.93 60.03 10.73 G10 60 0 304.69 123.79 85.33 60.45 60.02 9.47 G11 60 0 163.24 126.17 89.90 60.42 59.99 10.48 G12 80 0 166.87 135.46 115.25 78.99 79.95 11.93 G13 80 0 189.56 151.96 115.58 78.77 79.95 12.50 G14 80 0 173.29 146.28 112.74 78.33 79.95 12.25 G15 80 0 170.68 148.38 105.88 78.12 79.95 10.10 G16 100 0 242.56 197.47 139.18 100.00 99.98 14.51 G17 100 0 240.90 209.31 140.24 100.09 100.00 14.65

Table 5.12: Completion delay data for GPUSync for (6,1,FIFO). Time in ms.

Graph Period % Dropped Max 99.9th% 99th% Median Mean σ

G1 20 0 126.79 113.19 44.66 16.83 20.01 10.64 G2 20 0 136.47 118.97 45.99 17.26 20.01 11.56 G3 30 0 156.19 120.81 52.68 29.89 30.01 11.72 G4 30 0 141.65 113.71 53.14 30.12 30.01 11.68 G5 40 0 174.46 102.46 60.99 41.35 40.01 11.76 G6 40 0 177.19 135.44 61.72 41.84 40.02 12.68 G7 60 0 224.33 192.20 88.00 60.66 59.99 11.08 G8 60 0 224.36 184.27 88.33 60.66 59.99 11.09 G9 60 0 197.53 164.71 88.50 60.72 59.99 10.96 G10 60 0 215.30 169.92 89.38 60.64 59.99 11.16 G11 60 0 214.23 184.25 89.54 60.71 59.99 11.56 G12 80 0 237.41 199.40 116.73 77.81 79.96 15.12 G13 80 0 216.04 178.55 116.64 77.79 79.93 13.55 G14 80 0 224.25 181.11 116.91 77.82 79.93 13.98 G15 80 0 237.55 214.03 116.46 77.25 79.93 14.11 G16 100 0 286.73 267.92 147.47 100.18 99.98 16.94 G17 100 0 257.77 234.21 139.80 100.26 99.97 16.24

Table 5.13: Completion delay data for GPUSync for (6,1,PRIO). Time in ms.

267 Normalized and Averaged Conﬁguration Total % Dropped Max 99.9th% 99th% Median Mean σ SCHED_OTHER 0 9.20 6.33 1.65 0.97 1.00 0.37 SCHED_FIFO 4.77 11.12 9.83 4.25 1.89 1.88 0.99 LITMUSRT C-EDF 0 12.12 8.40 4.22 0.86 1.00 0.93 GPUSync (1,6,FIFO) 37.83 9.65 7.64 3.70 1.69 1.74 0.47 (1,6,PRIO) 37.88 9.53 7.53 3.70 1.69 1.74 0.46 (2,3,FIFO) 6.99 4.26 3.05 1.78 1.13 1.15 0.20 (2,3,PRIO) 7.23 4.54 3.34 1.78 1.14 1.16 0.20 (3,2,FIFO) 0 3.67 2.37 1.39 0.99 1.00 0.15 (3,2,PRIO) 0 3.83 2.38 1.41 0.99 1.00 0.16 (6,1,FIFO) 0 3.99 2.22 1.51 0.99 1.00 0.21 (6,1,PRIO) 0 3.92 3.23 1.60 0.98 1.00 0.26

Table 5.14: Average normalized completion delay data. frames by summing the number of dropped frames of all graphs under a given system conﬁguration and computing its share of the total number of frames that should have been completed within the allotted time.

To combine the completion delay data, we first normalized each measurement by dividing the measurements by the graph period. We then compute the average of the normalized values. (We use the term “average” to differentiate from the completion delay means that we discuss.) For example, the “Max” for SCHED_OTHER in Table 5.14 reflects the average normalized-maximum of the “Max” values in Table 5.3. Likewise, the standard deviation of GPUSync with (1,6,PRIO) in Table 5.14 reflects an average of normalized standard deviations from Table 5.13. In Table 5.14, we also highlight the “best” values in each column. We consider values closest to 1.0 as best for average normalized completion delays, and values closest to zero as best for standard deviations. We make the following additional observations.

Observation 45. For GPUSync conﬁgurations, completion delays are more regular when ρ = 3.

In Table 5.14, we observe that the average normalized maximum completion delay is smallest under

GPUSync with (3,2,FIFO). The next smallest is the GPUSync (3,2,PRIO) conﬁguration. The same holds for the average normalized 99th percentile completion delay and standard deviation. We note that the GPUSync

(6,1,FIFO) configuration exhibited the best average normalized 99.9th percentile completion delay. However, given that this configuration is beaten or matched by GPUSync configurations with ρ = 3 in other measures, we assert that ρ = 3 still gives the best real-time performance overall.

It appears that ρ = 3 is the “sweet spot” for ρ in this experiment. The experimental workload is unschedulable when ρ = 2, yet real-time performance worsens when ρ = 6. The likelihood that GPU

268 resources are left idle is greater with smaller values of ρ. When ρ = 1, two of the GPU’s three engines (one

EE and two CEs) are guaranteed to be left idle. Similarly, at least one engine will always be idle when ρ = 2.

Engines may still be left idle when ρ = 3, but the possibility remains for all engines to be used simultaneously.

However, this line of reasoning fails when we consider the case when ρ = 6. In Section 3.2.3.2, we argued that we should constrain the number of GPU tokens in order to limit the aggressiveness of migrations. This is no longer a concern in a single-GPU system, so why does performance not continue to improve when ρ = 6? We provide the following possible explanation.

In this experiment, we executed the same amount of work under GPUSync conﬁgurations with ρ = 3 and ρ = 6. More tokens allows a ﬁnite amount work (on both CPUs and GPU engines) to be shifted to an earlier point in time of the schedule, since the CPU scheduler and locking protocols are work-conserving.

Here, the shifted work “piles up” at an earlier point in the schedule. From Table 5.14, we see that ρ = 3 provides sufﬁcient GPU parallelism to achieve good real-time performance; the constrained number of tokens meters out GPU access. However, when ρ = 6, bursts in GPU activity, where all GPU engines are used simultaneously, become more likely. These bursts may result in moments of heavy interference (particularly on the system memory bus) with other tasks, thus increasing variance in completion delays. This interference would only increase the likelihood of outlier behavior on both ends of the spectrum of observed completion delays, since bursts in GPU activity would be followed by corresponding lulls. We theorize that median-case performance would remain unaffected. This explanation is borne out by the data in Table 5.14, where we see that the average normalized median completion delays under ρ = 6 are generally indistinguishable from conﬁgurations where ρ = 3.

Observation 46. For GPUSync conﬁgurations, FIFO engine locks offered better observed predictability than PRIO engine locks.

To see this, we compare the GPUSync conﬁgurations that only differ by engine lock protocol. With exception of conﬁgurations where ρ = 1 (which we discuss shortly), we see that the average normalized standard deviations of the mean completion delays are less (or equal in the case of ρ = 2) under FIFO. FIFO-ordered engine locks impart very regular blocking behaviors—any single engine request may only be delayed between zero and ρ − 1 other engine requests. Under priority-ordered locks, higher-priority requests may continually “cut ahead” of a low-priority engine request. Thus, the low-priority engine request may be

269 delayed by more than ρ − 1 requests. A job with such requests may still meet its deadline, but there may be increased variance in completion delays due to increased variance in blocking delays.

Observation 47. GPUSync with (3,2,FIFO) outperformed the other ten system conﬁgurations.

In Table 5.14, we see that the GPUSync conﬁguration with (3,2,FIFO) produced the smallest Max, 99th, and standard deviation values among all conﬁgurations. On average, the normalized standard deviation of the mean completion delays was only 15% (0.15 in Table 5.14) of a graph’s period. Compare this to 37%, 99%, and 93% under SCHED_OTHER, SCHED_FIFO, and LITMUSRT without GPUSync, respectively. Other

GPUSync conﬁgurations where ρ ∈ {3,6} were competitive, but none performed as well as (3,2,FIFO).

The above tables give us insight into worst-case and average-case behaviors of the tested system conﬁgurations. However, the distribution of observed completion delays is somewhat obscured. To gain deeper insights into these distributions, we plot the PDF of normalized completion delays in Figures 5.10 through 5.20. Each plot is derived from a histogram with a bucket width of 0.005. In each of these ﬁgures, the x-axis denotes a normalized completion delay. The y-axis denotes a probability density. To determine the probability that a normalized completion delay falls within the domain [a,b], we sum the area under the curve between x = a and x = b; the total area under each curve is 1.0. Generally, distributions with the greatest area near x = 1.0 are best. Our goal is to understand the shape of completion delay distributions, so each distribution is plotted on the same domain and range. We also clip the domain at x = 2.5, so the long tails of these distributions are not depicted. However, we have examined worst-case behaviors in the prior tables, so we do not revisit the topic here. We make several observations.

Observation 48. The PDFs for GPUSync with ρ = 3 show that normalized completion delays are most likely near 1.0.

We see this in Figures 5.17 and 5.18 for GPUSync with ρ = 3 for FIFO and PRIO engine locks, respectively. This result is not surprising, given prior Observation 45. However, in these PDFs we also see that they closely resemble the curve of a normal distribution, more so than any PDF of the other conﬁgurations.

The PDFs for GPUSync with ρ = 6 (Figures 5.19 and 5.20), and even the PDF for the SCHED_OTHER conﬁguration (Figure 5.10), have similar shapes, but they are not as strongly centered around 1.0. Also, they do not exhibit the same degree of symmetry.

270 Probability Density Function of Normalized Completion Delays 4 3.5 3 2.5 2