Running Multiple Workloads on a GPU: A UX-Oriented Approach

Yuval Sarna
Graphics Software Expert @ GameFly Streaming

Agenda
• Sharing the GPU
• We all like to Play
• Introduction to GPU Scheduling
• Proposed GPU Scheduler
• Summary & Q&A

What does it mean to “share the GPU”?
• Most modern applications use the GPU
• They all share the same hardware resources – CPU, RAM, GPU, etc.
• The GPU executes tasks coming from different processes, satisfying their needs – be it
  . Graphical HW Acceleration
  . GPGPU
  . Etc.

The GPU Model
• Many physical cores but a single-core computational model (no “SetAffinity”)
• Access model is FIFO, no fairness, no preemption
• Many processes use the GPU simultaneously, yet it can process only one task at a time

Why do we need to share the GPU?
• Cost Efficiency
• Cloud Environments
• Academic Super-Computers

Difficulties in Sharing the GPU Efficiently
• Running non-demanding applications in parallel is easy
  . Not real-time – i.e., they don’t require low latency
• When it comes to running multiple demanding workloads on the GPU, sharing becomes difficult
  . Which workload should execute now?
  . How do we handle greedy workloads?
  . What do we expect from a GPU sharing scheme?
Efficient GPU Sharing
• Utilizing the GPU
• Fairness of GPU between applications
• Smooth User Experience (UX)

Case Study – GameFly Streaming
• Game is running (and rendered) on a server
• Rendered frames are streamed as video in real time to the client
• Gamepad commands are sent back to the server

The Technology
• Game is running (and rendered) on a server
• Rendered frames are streamed as video in real time to the client

Definitions & Assumptions
[Diagram labels: Process (×3), Context (×6), Command Buffers, GPU Scheduler, GPU with Node 0 to Node 6]

Scheduling Efficiency
To measure the efficiency of a scheduling algorithm, we may look at two main factors:
• Maximum utilization of the GPU
  . The algorithm should allow it to be 100% utilized.
• Number of frames that missed their deadline
  . Measured by how far they exceeded their expected time.
• Ask your target audience
[Diagram: a frame timeline from Begin Frame (0ms) to Deadline (33ms), filled with command buffers: CB CB CB CB?]

Fairness
If life is unfair to everyone, isn’t life fair?

How is it done?
• Windows Display Driver Model
  [Diagram: WDDM architecture]
  . User Mode: Application; Direct3D runtime; OpenGL runtime; User-mode display driver; Win32® GDI; Kernel-mode access (gdi32.dll); OpenGL installable client driver (ICD)
  . Kernel Mode: DirectX graphics kernel subsystem (Dxgkrnl.sys), which includes display port driver, video memory manager, and GPU scheduler; Win32K.sys; Display miniport driver
• Stall command buffers if they shouldn’t yet be submitted for GPU execution

Windows OS GPU Scheduler
• Round-Robin scheduling algorithm
• Let’s take a look at a video showing the issues
  . GPU utilization is ~105%
  . Six concurrent games –
    • 5 Overlord II
    • 1 Alan Wake’s American Nightmare
  . Running on NVIDIA GRID K520

A Look Behind the Scenes
[Figure: GPU timeline annotated with ~142ms, ~48ms, ~57ms, ~35ms and ~80ms]
Grey command buffers are new frames released by the game

A Look Behind the Scenes
[Figure: GPU timeline annotated with ~24ms]

Why can it be done better?
• We know what kind of workloads we want to schedule
• We can set a target performance
• Our scheduler doesn’t have to be generic

GPU Resources
• For example, say we set the target performance to a rate of 30 frames per second (FPS)
• Each frame shouldn’t take more than ~33ms
[Diagram: the GPU drawn as 33 blocks, numbered 1 to 33]
• These are the GPU’s resources we have to manage and schedule
• We don’t allow running more than “33 blocks” worth of workloads concurrently
  . But is it enough?

First Attempt – Earliest Deadline First
• Prioritize CBs with earlier deadlines using the following data:
  . The time it took the context to complete the previous frame
  . The time a context has used so far to create the current frame

Round-Robin Scheduling
[Animation: command buffers placed step by step on a timeline in ms (0, 8, 16, 24, 32, 40), with the 33 ms frame deadline marked]
Tomb Raider is the only game that managed to complete its frame before the deadline.

First Attempt – Earliest Deadline First
[Animation: command buffers placed step by step on a timeline in ms (0, 8, 16, 24, 32, 40), with the 33 ms frame deadline marked]
This game has already started, so its priority is higher.
Both Tomb Raider & MotoGP15 completed their frames before the deadline.
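The difference between the two walkthroughs above can be reproduced with a small, non-preemptive simulation. This is only a sketch under stated assumptions: the `Context` class, the `simulate` function, the workloads named A, B and C, and the 8 ms command-buffer durations are all hypothetical and are not taken from the talk. The rule that breaks deadline ties in favor of the context that has already used more GPU time is also an assumption, chosen to mirror the slide note "This game has already started, so its priority is higher."

```python
from dataclasses import dataclass
from typing import List, Optional

FRAME_TIME_MS = 33.0  # 30 FPS target, as in the slides


@dataclass
class Context:
    name: str
    pending_cbs: List[float]            # GPU time of each queued command buffer (hypothetical values)
    frame_begin_ms: float = 0.0
    used_ms: float = 0.0                # GPU time already spent on the current frame
    finished_ms: Optional[float] = None

    @property
    def deadline_ms(self) -> float:
        return self.frame_begin_ms + FRAME_TIME_MS


def simulate(contexts: List[Context], policy: str) -> List[str]:
    """Run every queued command buffer, one at a time, and return the names
    of the contexts whose frame finished before its deadline."""
    now = 0.0
    turn = 0  # round-robin cursor
    while any(c.pending_cbs for c in contexts):
        if policy == "rr":
            # Round-robin: next context in fixed order that still has work.
            while not contexts[turn % len(contexts)].pending_cbs:
                turn += 1
            ctx = contexts[turn % len(contexts)]
            turn += 1
        else:
            # EDF: earliest deadline first; ties go to the context that has
            # already started its frame (assumption, see lead-in).
            ready = [c for c in contexts if c.pending_cbs]
            ctx = min(ready, key=lambda c: (c.deadline_ms, -c.used_ms))
        cb = ctx.pending_cbs.pop(0)
        now += cb
        ctx.used_ms += cb
        if not ctx.pending_cbs:
            ctx.finished_ms = now
    return [c.name for c in contexts
            if c.finished_ms is not None and c.finished_ms <= c.deadline_ms]


def make_games() -> List[Context]:
    # Three hypothetical games, two 8 ms command buffers each: 48 ms of work
    # competing for a 33 ms frame.
    return [Context("A", [8.0, 8.0]), Context("B", [8.0, 8.0]), Context("C", [8.0, 8.0])]


print(simulate(make_games(), "rr"))   # ['A']
print(simulate(make_games(), "edf"))  # ['A', 'B']
```

With round-robin the work of all three games is interleaved, so only one frame lands before 33 ms. With the deadline-based order a started frame is finished before the next one begins, so two frames make it. The GPU is overcommitted in both cases.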
Results
• 10 games running concurrently
• UX is improved – frames interval variance is reduced significantly
[Figure: two histograms, “Windows GPU Scheduler” and “EDF GPU Scheduler”; x-axis: Frames Interval (ms), y-axis: Sum]

First Attempt – Earliest Deadline First
• Drawbacks:
  . Tries to schedule more than 100% capacity worth of work
  . Greedy workloads get the highest priority
  . The innocents suffer from low FPS and stuttering

Proposed New Scheduling Algorithm
• The proposed new algorithm uses a combination of two principles:
  . Each process gets a time quantum.
    • If the time quantum is depleted before the frame is finished, the process may not submit further tasks for execution.
    • The total time given to all processes always equals the global frame time (for example, 33ms).
  . Amongst those with available time quantum, use priorities using:
    • Deadline.
    • Other schemes

Definitions
• $n$ – Number of running processes.
• $i$ – The index of a process (counting from 1).
• $\mathrm{Prev}(i)$ – The time the previous frame took for process $i$.
• $\mathrm{Expect}(i)$ – The expected time a single frame will take for process $i$.
• $\mathrm{Exceed}(i)$ – By how much process $i$ exceeded its expected frame time in the previous frame: $\mathrm{Exceed}(i) = \max(0, \mathrm{Prev}(i) - \mathrm{Expect}(i)) \ge 0$.
• $\mathrm{Time}(i)$ – The new time quantum process $i$ receives.
• $FT$ – The global Frame Time. This dictates the deadlines. For example, for a 30FPS target, the FT is ~33.33ms.

Calculating Time Quantum
Utilization $= \frac{\sum_{i=1}^{n} \mathrm{Prev}(i)}{FT}$
1. If $\sum_{i=1}^{n} \mathrm{Prev}(i) = 0$:
  . Utilization $= 0\%$
2. Else if $0 < \sum_{i=1}^{n} \mathrm{Prev}(i) \le FT$:
  . Utilization $\le 100\%$
  . $\mathrm{Time}(i) = \frac{\mathrm{Prev}(i)}{\sum_{j=1}^{n} \mathrm{Prev}(j)} \cdot FT$
3. Else:
  . Utilization $> 100\%$
  . $\mathrm{Time}(i) = \mathrm{Prev}(i) - \mathrm{Exceed}(i) \cdot \frac{\sum_{j=1}^{n} \mathrm{Prev}(j) - FT}{\sum_{j=1}^{n} \mathrm{Exceed}(j)}$
• In cases 2 and 3, $\sum_{i=1}^{n} \mathrm{Time}(i) = FT$, i.e. 100% utilization.

Example 1 – Utilization < 100%
FT = 33ms, n = 3

Process   Prev(i)   Expect(i)   Exceed(i)
P1            8         8           0
P2           12        13           0
P3            5         3           2
Total        25                     2

Utilization: $\frac{\sum_{i=1}^{3} \mathrm{Prev}(i)}{FT} = \frac{25}{33} \approx 76\%$

• Here’s the time quantum each process will get for the current frame (case 2, $0 < \sum \mathrm{Prev}(i) \le FT$):
  . $\mathrm{Time}(1) = \frac{8}{25} \cdot 33 = 10.56$
  . $\mathrm{Time}(2) = \frac{12}{25} \cdot 33 = 15.84$
  . $\mathrm{Time}(3) = \frac{5}{25} \cdot 33 = 6.6$
Utilization: $\frac{\sum_{i=1}^{3} \mathrm{Time}(i)}{FT} = \frac{33}{33} = 1 \rightarrow 100\%$

Example 2 – Utilization > 100%
FT = 33ms, n = 3

Process   Prev(i)   Expect(i)   Exceed(i)
P1           10        10           0
P2           16        10           6
P3           10         8           2
Total        36                     8

Utilization: $\frac{\sum_{i=1}^{3} \mathrm{Prev}(i)}{FT} = \frac{36}{33} \approx 109\%$

• Time quanta for the current frame (case 3):
  . $\mathrm{Time}(1) = 10 - 0 \cdot \frac{36 - 33}{8} = 10 - 0 \cdot \frac{3}{8} = 10$
  . $\mathrm{Time}(2) = 16 - 6 \cdot \frac{36 - 33}{8} = 16 - 6 \cdot \frac{3}{8} = 13.75$
  . $\mathrm{Time}(3) = 10 - 2 \cdot \frac{36 - 33}{8} = 10 - 2 \cdot \frac{3}{8} = 9.25$
Utilization: $\frac{\sum_{i=1}^{3} \mathrm{Time}(i)}{FT} = \frac{33}{33} = 1 \rightarrow 100\%$

Calculating Priorities
• To address the case where several processes have enough time quantum, each process also gets a priority
• Priorities are given based on the deadline by using Earliest Deadline First
• Other schemes may be used – .
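Returning to the time quantum calculation, the two worked examples above can be checked with a short sketch. The function name `time_quanta` and its parameter names are hypothetical. Two branches are assumptions that the slides do not spell out: an equal split of the frame time when no process has any frame history, and proportional scaling when the GPU is overcommitted but no process exceeded its expected time.

```python
from typing import List, Sequence


def time_quanta(prev_ms: Sequence[float],
                expected_ms: Sequence[float],
                frame_time_ms: float = 33.0) -> List[float]:
    """Return the time quantum of each process for the next frame.

    prev_ms[i]     -- time the previous frame took for process i
    expected_ms[i] -- expected time of a single frame for process i
    """
    n = len(prev_ms)
    if n == 0:
        return []
    total_prev = sum(prev_ms)

    # Case 1: no history, utilization is 0%.
    # ASSUMPTION: split the frame time equally.
    if total_prev == 0:
        return [frame_time_ms / n] * n

    # Case 2: utilization <= 100%. Scale every process up proportionally
    # so that the quanta fill the whole frame time.
    if total_prev <= frame_time_ms:
        return [p / total_prev * frame_time_ms for p in prev_ms]

    # Case 3: utilization > 100%. Take the overshoot back from the
    # processes that exceeded their expected time, in proportion to
    # how much each one exceeded it.
    exceed = [max(0.0, p - e) for p, e in zip(prev_ms, expected_ms)]
    total_exceed = sum(exceed)
    if total_exceed == 0:
        # ASSUMPTION: nobody exceeded, so scale everyone down proportionally.
        return [p / total_prev * frame_time_ms for p in prev_ms]
    overshoot = total_prev - frame_time_ms
    return [p - d * overshoot / total_exceed for p, d in zip(prev_ms, exceed)]


# Example 1 (utilization < 100%)
print([round(t, 2) for t in time_quanta([8, 12, 5], [8, 13, 3])])     # [10.56, 15.84, 6.6]
# Example 2 (utilization > 100%)
print([round(t, 2) for t in time_quanta([10, 16, 10], [10, 10, 8])])  # [10.0, 13.75, 9.25]
```

In both examples the quanta add up to the 33 ms frame time. The sketch does not clamp results, so a process whose excess is much larger than its share of the frame could in principle receive a negative quantum in case 3; a real scheduler would need a policy for that.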
