
Running Multiple Workloads on a GPU: A UX-Oriented Approach

Yuval Sarna, Graphics Software Expert @ GameFly Streaming

Agenda

• Sharing the GPU

• We all like to Play

• Introduction to GPU Scheduling

• Proposed GPU Scheduler

• Summary & Q&A

What does it mean to “share the GPU”?

• Most modern applications use the GPU

• They all share the same hardware resources – CPU, RAM, GPU, etc.

• The GPU executes tasks coming from different processes, satisfying their needs – be it
  . Graphical HW Acceleration
  . GPGPU
  . Etc.

The GPU Model

• Many physical cores but a single core computational model (no “SetAffinity”)

• Access model is FIFO, no fairness, no preemption

• Many processes use the GPU simultaneously, but it can process only one task at a time

Why do we need to share the GPU?

• Cost Efficiency

• Cloud Environments

• Academic Super-Computers

Difficulties in Sharing the GPU Efficiently

• Running non-demanding applications in parallel is easy
  . Not real-time – i.e., they don’t require low latency

• When it comes to running multiple demanding workloads on the GPU, sharing becomes difficult
  . Which workload should execute now?
  . How do we handle greedy workloads?
  . What do we expect from a GPU sharing scheme?

Efficient GPU Sharing

• Utilizing the GPU

• Fairness of GPU between applications

• Smooth User Experience (UX)

Case Study – GameFly Streaming

Game is running (and rendered) on a server

Rendered frames are streamed as video in real-time to the client

Gamepad commands are sent back to the server

Definitions & Assumptions

[Diagram, bottom to top: Processes → Contexts → Command Buffers → GPU Scheduler → GPU (Node 0 to Node 6)]

Scheduling Efficiency

To measure the efficiency of a scheduling algorithm, we may look at two main factors:

• Maximum utilization of the GPU
  . The algorithm should allow it to be 100% utilized.

• Number of frames that missed their deadline
  . With relation to them exceeding their expected time.

• Ask your target audience

[Timeline figure: the command buffers (CBs) of one frame, from Begin Frame at 0ms to the Deadline at 33ms]

Fairness

If life is unfair to everyone, isn’t life fair?

How is it done?

• Windows Display Driver Model

[Diagram: WDDM architecture]
User mode: Application; Direct3D runtime; OpenGL runtime; Win32 GDI; user-mode display driver; OpenGL installable client driver (ICD); kernel-mode access (gdi32.dll)
Kernel mode: DirectX graphics kernel subsystem (Dxgkrnl.sys), which includes the display port driver, video memory manager, and GPU scheduler; Win32K.sys; display miniport driver

• Stall command buffers if they shouldn’t yet be submitted for GPU execution

Windows OS GPU Scheduler

• Round-Robin scheduling algorithm

• Let’s take a look at a video showing the issues
  . GPU utilization is ~105%
  . Six concurrent games –
    • 5 Overlord II
    • 1 Alan Wake’s American Nightmare
  . Running on NVIDIA GRID K520

A Look Behind the Scenes

[Figure: GPU timeline of command buffers; annotated intervals: ~142ms, ~48ms, ~57ms, ~35ms, ~80ms]

Grey command buffers are new frames released by the game

[Figure: second GPU timeline; annotated interval: ~24ms]

Why can it be done better?

• We know what kind of workloads we want to schedule

• We can set a target performance

• Our scheduler doesn’t have to be generic

GPU Resources

• For example, say we set the target performance to a rate of 30 frames per second (FPS)

• Each frame shouldn’t take more than ~33ms

[Figure: the GPU’s frame time drawn as 33 numbered blocks, 1 to 33]

• These are the GPU’s resources we have to manage and schedule

• We don’t allow running more than “33 blocks” worth of workloads concurrently
  . But is it enough?

First Attempt – Earliest Deadline First

• Prioritize CBs with earlier deadlines using the following data:

  . The time it took the context to complete the previous frame
  . The time a context has used so far to create the current frame

Round-Robin Scheduling

[Animation: GPU timeline under round-robin scheduling; time axis in ms (0, 8, 16, 24, 32, 40), frame deadline at 33 ms]

Tomb Raider is the only game that managed to complete its frame before the deadline.

First Attempt – Earliest Deadline First
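To make the difference between the two policies concrete, here is a minimal Python sketch, not the scheduler from the talk. It assumes a non-preemptive GPU that runs one command buffer (CB) at a time, and three hypothetical games (`game_a`, `game_b`, `game_c`) whose CB durations are invented for illustration. All frames share the 33 ms deadline, so the EDF variant breaks ties in favour of the game that has already used the most GPU time in its current frame, mirroring the "already started" rule shown in the animation.

```python
FRAME_DEADLINE_MS = 33  # 30 FPS target


def round_robin(frames):
    """Run one CB per game per round; return the finish time (ms) of each frame."""
    queues = {name: list(cbs) for name, cbs in frames.items()}
    now, finish = 0, {}
    while queues:
        for name in list(queues):  # iterate over a copy, games drop out when done
            now += queues[name].pop(0)
            if not queues[name]:
                finish[name] = now
                del queues[name]
    return finish


def earliest_deadline_first(frames, deadlines):
    """Always run the next CB of the game with the earliest deadline.

    Ties go to the game that has used the most GPU time so far (assumption)."""
    queues = {name: list(cbs) for name, cbs in frames.items()}
    used = {name: 0 for name in frames}
    now, finish = 0, {}
    while queues:
        name = min(queues, key=lambda n: (deadlines[n], -used[n]))
        cb = queues[name].pop(0)
        now += cb
        used[name] += cb
        if not queues[name]:
            finish[name] = now
            del queues[name]
    return finish


def on_time(finish, deadlines):
    """Names of the games whose frame finished by its deadline."""
    return sorted(n for n, t in finish.items() if t <= deadlines[n])


# Hypothetical workloads: CB durations in ms, 45 ms of work in total.
frames = {"game_a": [5, 5, 5], "game_b": [6, 6, 6], "game_c": [4, 4, 4]}
deadlines = {name: FRAME_DEADLINE_MS for name in frames}

rr_finish = round_robin(frames)
edf_finish = earliest_deadline_first(frames, deadlines)
print("round-robin on time:", on_time(rr_finish, deadlines))
print("EDF on time:        ", on_time(edf_finish, deadlines))
```

With these made-up numbers, round-robin spreads the overload over every game so none finishes by 33 ms, while the EDF variant lets two of the three frames finish in time. The exact counts depend entirely on the assumed CB durations.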

[Animation: GPU timeline under Earliest Deadline First scheduling; time axis in ms (0, 8, 16, 24, 32, 40), frame deadline at 33 ms; callout: “This game has already started, so its priority is higher.”]

Both Tomb Raider & MotoGP15 completed their frames before the deadline.

Results

• 10 games running concurrently

• UX is improved – frame interval variance is reduced significantly

[Figure: two histograms, “Windows GPU Scheduler” and “EDF GPU Scheduler”; x-axis: Frames Interval (ms), y-axis: Sum (0 to 1400)]

First Attempt – Earliest Deadline First

• Drawbacks:
  . Tries to schedule more than 100% capacity worth of work.
  . Greedy workloads get the highest priority
  . The innocents suffer from low FPS and stuttering

Proposed New Scheduling Algorithm

• The proposed new algorithm uses a combination of two principles:

  . Each process gets a time quantum.
    • If the time quantum is depleted before finishing the frame, the process may not submit further tasks for execution.
    • The total time given to all processes always equals the global frame time (for example, 33ms).

  . Amongst those with available time quantum, assign priorities using:
    • Deadline.
    • Other schemes

Definitions

• $n$ – Number of running processes.

• $i$ – The index of a process (counting from 1).

• $T_i$ – The time the previous frame took for process $i$.

• $Exp_i$ – The expected time a single frame will take for process $i$.

• $D_i$ – How much process $i$ exceeded its expected frame time in the previous frame; $D_i \ge 0$.

• $Time(i)$ – The new time quantum process $i$ receives.

• $FT$ – The global Frame Time. This dictates the deadlines. For example, for a 30FPS target, the FT is ~33.33ms.

Calculating Time Quantum

1. If $\sum_{i=1}^{n} T_i = 0$ (Utilization = 0%):
  . $Time(i) = FT$

2. Else if $0 < \sum_{i=1}^{n} T_i \le FT$ (Utilization ≤ 100%):
  . $Time(i) = \frac{T_i}{\sum_{j=1}^{n} T_j} \cdot FT$

3. Else (Utilization > 100%):
  . $Time(i) = T_i - \frac{\sum_{j=1}^{n} T_j - FT}{\sum_{j=1}^{n} D_j} \cdot D_i$
    • If $Time(i) \le 0 \rightarrow Time(i) = T_i$

Example 1 – Utilization < 100%

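FT = 33ms, n = 3

| Process | $T_i$ (ms) | $Exp_i$ (ms) | $D_i$ (ms) |
|---|---|---|---|
| P1 | 8 | 8 | 0 |
| P2 | 12 | 13 | 0 |
| P3 | 5 | 3 | 2 |
| Total | $\sum_{i=1}^{3} T_i = 25$ | | $\sum_{i=1}^{3} D_i = 2$ |

For instance, $D_1 = T_1 - Exp_1 = 0$.

Utilization: $\frac{\sum_{i=1}^{3} T_i}{FT} = \frac{25}{33} = {\sim}75\%$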
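The three cases of the time quantum calculation can be sketched in Python and run on this example's inputs, as well as on the over-utilized example later in the deck. This is a minimal sketch, not the production scheduler: the function name `time_quanta` and its parameters are made up for illustration, $D_i$ is computed as $\max(0, T_i - Exp_i)$ to match the tables, and the fallback used when every $D_i$ is zero in case 3 is an assumption, since the slides do not cover it.

```python
def time_quanta(prev_times, expected_times, frame_time):
    """Return the new time quantum Time(i) for every process.

    prev_times[i]     -- T_i, time the previous frame took (ms)
    expected_times[i] -- Exp_i, expected time of a single frame (ms)
    frame_time        -- FT, the global frame time (ms)
    """
    total = sum(prev_times)
    if total == 0:
        # Case 1 (0% utilization), as read from the slide: Time(i) = FT.
        return [float(frame_time)] * len(prev_times)
    if total <= frame_time:
        # Case 2: share the whole frame time in proportion to T_i.
        return [t / total * frame_time for t in prev_times]
    # Case 3: take the overshoot back in proportion to D_i = max(0, T_i - Exp_i).
    excess = [max(0, t - e) for t, e in zip(prev_times, expected_times)]
    total_excess = sum(excess)
    if total_excess == 0:
        # Not covered by the slides (assumption): fall back to proportional shares.
        return [t / total * frame_time for t in prev_times]
    overshoot = total - frame_time
    quanta = [t - overshoot / total_excess * d for t, d in zip(prev_times, excess)]
    # Slide rule: if Time(i) <= 0, use T_i instead.
    return [q if q > 0 else t for q, t in zip(quanta, prev_times)]


example_1 = time_quanta([8, 12, 5], [8, 13, 3], 33)       # utilization < 100%
example_2 = time_quanta([10, 16, 10], [10, 10, 8], 33)    # utilization > 100%
print([round(q, 2) for q in example_1])  # [10.56, 15.84, 6.6]
print([round(q, 2) for q in example_2])  # [10.0, 13.75, 9.25]
```

In both examples the quanta sum to the 33 ms frame time, which is the invariant the algorithm is built around.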

• Here’s the time quantum each process will get for the current frame:

2. If $0 < \sum_{i=1}^{n} T_i \le FT$: $Time(i) = \frac{T_i}{\sum_{j=1}^{n} T_j} \cdot FT$

  . $Time(1) = \frac{T_1}{\sum_{j=1}^{3} T_j} \cdot FT = \frac{8}{25} \cdot 33\,ms = 10.56\,ms$

  . $Time(2) = \frac{T_2}{\sum_{j=1}^{3} T_j} \cdot FT = \frac{12}{25} \cdot 33\,ms = 15.84\,ms$

  . $Time(3) = \frac{T_3}{\sum_{j=1}^{3} T_j} \cdot FT = \frac{5}{25} \cdot 33\,ms = 6.6\,ms$

Utilization: $\frac{\sum_{i=1}^{3} Time(i)}{FT} = \frac{33}{33} = 1 \rightarrow 100\%$

Example 2 – Utilization > 100%

FT = 33ms, n = 3

| Process | $T_i$ (ms) | $Exp_i$ (ms) | $D_i$ (ms) |
|---|---|---|---|
| P1 | 10 | 10 | 0 |
| P2 | 16 | 10 | 6 |
| P3 | 10 | 8 | 2 |
| Total | $\sum_{i=1}^{3} T_i = 36$ | | $\sum_{i=1}^{3} D_i = 8$ |

For instance, $D_1 = T_1 - Exp_1 = 0$.

Utilization: $\frac{\sum_{i=1}^{3} T_i}{FT} = \frac{36}{33} = {\sim}110\%$

3. Else: $Time(i) = T_i - \frac{\sum_{j=1}^{n} T_j - FT}{\sum_{j=1}^{n} D_j} \cdot D_i$

• $Time(1) = T_1 - \frac{\sum_{j=1}^{3} T_j - FT}{\sum_{j=1}^{3} D_j} \cdot D_1 = 10 - \frac{36 - 33}{8} \cdot 0 = 10 - \frac{3}{8} \cdot 0 = 10\,ms$

• $Time(2) = T_2 - \frac{\sum_{j=1}^{3} T_j - FT}{\sum_{j=1}^{3} D_j} \cdot D_2 = 16 - \frac{36 - 33}{8} \cdot 6 = 16 - \frac{3}{8} \cdot 6 = 13.75\,ms$

• $Time(3) = T_3 - \frac{\sum_{j=1}^{3} T_j - FT}{\sum_{j=1}^{3} D_j} \cdot D_3 = 10 - \frac{36 - 33}{8} \cdot 2 = 10 - \frac{3}{8} \cdot 2 = 9.25\,ms$

Utilization: $\frac{\sum_{i=1}^{3} Time(i)}{FT} = \frac{33}{33} = 1 \rightarrow 100\%$

Calculating Priorities

• To address the case where we have several processes with enough time quantum, each process also gets a priority

• Priorities are given based on the deadline by using Earliest Deadline First

• Other schemes may be used –
  . For example, we could take into account the amount of time by which the process exceeded its expected frame time

Example 1 – Context given enough QT

• Context received 12ms time quantum

[Timeline: 0ms BeginFrame; 10ms EndFrame; 33ms Deadline and next BeginFrame]

• Finished Frame @ 10ms

• QT Left – 2ms

Example 2 – Context not given enough QT

• Context received 10ms time quantum, needs 14ms

[Timeline: 0ms BeginFrame; 10ms Out of Time Quantum, all future CBs must wait; 33ms Deadline, new time quantum given; 37ms EndFrame and next BeginFrame]
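The timeline can be reproduced with a few lines of Python. This is a simplified sketch under stated assumptions: the context runs alone on the GPU, it stalls as soon as its quantum is depleted, and it receives a fresh quantum of the same size at every frame deadline (in the real scheduler the quantum would be recalculated). The function name `simulate_frame` is made up for illustration.

```python
def simulate_frame(quantum_ms, needed_ms, frame_time_ms=33):
    """Return (time at which the frame ends, quantum left over), both in ms."""
    if quantum_ms <= 0:
        raise ValueError("quantum_ms must be positive")
    period_start, remaining = 0, needed_ms
    while remaining > quantum_ms:
        # Quantum depleted: all further CBs wait for the next deadline,
        # where a new quantum is handed out.
        remaining -= quantum_ms
        period_start += frame_time_ms
    return period_start + remaining, quantum_ms - remaining


end_ms, quantum_left = simulate_frame(quantum_ms=10, needed_ms=14)
print(end_ms, quantum_left, round(1000 / end_ms))  # 37 6 27
```

Calling `simulate_frame(quantum_ms=12, needed_ms=10)` gives the previous example instead: the frame ends at 10 ms with 2 ms of quantum left.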

• FPS Drop to 27FPS

Results

• Let’s take a look at a video showing the scheduler’s result
  . GPU utilization is ~105%
  . Six concurrent games –
    • 5 Overlord II
    • 1 Alan Wake’s American Nightmare
  . Running on NVIDIA GRID K520

Purple command buffers are new frames released by the game

Thank You!

• You’re more than welcome to talk to me after the lecture or email me

. Yuval Sarna [email protected]

• Please don’t forget to fill out the survey