CUDA Dynamic Parallelism

©Jin Wang and Sudhakar Yalamanchili unless otherwise noted (1)

Objective

• To understand the CUDA Dynamic Parallelism (CDP) execution model, including its synchronization and memory model
• To understand the benefits of CDP in terms of productivity, workload balance, and memory regularity
• To understand the launch path of child kernels at the microarchitectural level
• To understand the overhead of CDP

(2)

Reading

• CUDA Programming Guide, Appendix C, "CUDA Dynamic Parallelism".
• S. Jones, "Introduction to Dynamic Parallelism", GPU Technology Conference (presentation), Mar 2012.
• J. Wang and S. Yalamanchili, "Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications", IEEE International Symposium on Workload Characterization (IISWC), October 2014.
• J. Wang, A. Sidelnik, N. Rubin, and S. Yalamanchili, "Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs", IEEE/ACM International Symposium on Computer Architecture (ISCA), June 2015.

(3)

Recap: CUDA Execution Model

• Kernels/Grids are launched by the host (CPU)
• Grids are executed on the GPU
• Results are returned to the CPU

[Figure: the host launches Kernel 1 and Kernel 2; each runs on the device as a grid of thread blocks (Grid 1 and Grid 2, blocks (0,0)–(1,1))]

(4)

Launching from device

• Kernels/Grids can be launched by the GPU

[Figure: same picture, but Grid 2 is now launched from within Grid 1 on the device rather than from the host]

(5)

Code Example

Launching k2 from CPU:

    __global__ void k1() { }

    __global__ void k2() { ... }

    int main() {
        ...
        k1<<<1,1>>>();
        ... // get result of k1
        if (result) {
            k2<<<1,1>>>();
        }
    }

Launching k2 from GPU:

    __global__ void k1() {
        if (result)
            k2<<<1,1>>>();
    }

    __global__ void k2() { ... }

    int main() {
        ...
        k1<<<1,1>>>();
    }

(6)

CUDA Dynamic Parallelism (CDP)

• Introduced in NVIDIA Kepler GK110
• Launch workload directly from the GPU
• Launches can be nested

[Figure: a parent kernel launches a child, which launches a grandchild; each child must finish before its parent finishes]

(7)

CDP (2)

• Launches are per thread
  v Each thread can launch new kernels
  v Each thread can make nested launches

    __global__ void k1() {
        k2<<<1, 1>>>();
    }

    __global__ void k2() { }

    int main() {
        ...
        k1<<<1,32>>>();
    }

Q1: How many k2 are launched? 32 (one per thread of k1).
Q2: How to change the code to launch only one k2? Guard the launch so only one thread performs it, e.g. if (threadIdx.x == 0) k2<<<1, 1>>>(); (8)

Compilation

• Needs special options to compile:

    nvcc test.cu -o test -arch=sm_35 -rdc=true -lcudadevrt

• -arch=sm_35: must target a device with compute capability 3.5 or higher
• -rdc=true: generate relocatable device code; required for CDP programs
• -lcudadevrt: link the CUDA device runtime library; can be omitted on CUDA 6.0 and later
• Demo: https://github.gatech.edu/jwang323/gpu_class/blob/master/src/cdp.cu
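A minimal sketch of a CDP program that the nvcc line above would compile (an illustration only, not necessarily the cdp.cu from the demo link; kernel names and launch configurations are assumptions):

    #include <cstdio>

    __global__ void child() {
        printf("child: block %d thread %d\n", blockIdx.x, threadIdx.x);
    }

    __global__ void parent() {
        if (threadIdx.x == 0)
            child<<<2, 4>>>();        // device-side launch
        cudaDeviceSynchronize();      // wait for the child (device-side sync, supported in the CUDA versions these slides target)
    }

    int main() {
        parent<<<1, 32>>>();
        cudaDeviceSynchronize();      // host waits for parent (and therefore the child)
        return 0;
    }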

(9)

PTX Interface

• The "<<< >>>" launch syntax in device code is compiled into two PTX calls
  v cudaGetParameterBuffer (cudaGetParameterBufferV2)
  v cudaLaunchDevice (cudaLaunchDeviceV2)

    call.uni (retval0), cudaGetParameterBufferV2, (param0, param1, param2, param3);

    call.uni (retval0), cudaLaunchDeviceV2, (param0, param1);

More on CUDA Programming Guide Appendix C
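At the CUDA C level, the device runtime exposes the same functionality as cudaGetParameterBuffer and cudaLaunchDevice (both documented in Appendix C). A hedged sketch of roughly what the compiler lowers k2<<<1,1>>>(arg) into (argument layout, alignment, and error handling are simplified; the actual generated code differs):

    __global__ void k2(int arg) { /* child work (placeholder) */ }

    __device__ void launch_k2_by_hand(int arg) {
        // Reserve a parameter buffer sized and aligned for the child's single int argument.
        void *buf = cudaGetParameterBuffer(sizeof(int), sizeof(int));
        *(int *)buf = arg;

        // Hand the buffer plus the launch configuration to the device runtime.
        cudaLaunchDevice((void *)k2, buf, dim3(1), dim3(1),
                         0 /* shared memory bytes */, 0 /* stream */);
    }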

(10)

CDP Synchronization

• Child execution is not guaranteed to have started unless there is synchronization
  v Explicit
  v Implicit

    __global__ void k1() {
        if (threadIdx.x == 0)
            k2<<<1, 1>>>();
    }

    __global__ void k2() {
    }

    int main() {
        ...
        k1<<<1,32>>>();
    }

Q1: Does k2 start execution immediately after it is launched? Don't know!

(11)

CDP Synchronization (2)

• Explicit synchronization
  v API: cudaDeviceSynchronize()
  v All child kernels launched in the thread block before the sync API immediately start execution
  v If there are not enough GPU resources, the parent kernel is suspended to yield to its children
  v Blocks the parent until the child kernels finish

    __global__ void k1() {
        if (threadIdx.x == 0)
            k2<<<1, 1>>>();
        cudaDeviceSynchronize();
    }

(12)

CDP Synchronization (3)

• cudaDeviceSynchronize and __syncthreads
  v Do NOT imply each other!

    __global__ void k1() {
        if (threadIdx.x == 0)
            k2<<<1, 1>>>();      // k2 result visible to (same block): thread 0? N   Other threads? N

        // Only syncs with child kernels, not with other parent threads
        cudaDeviceSynchronize(); // thread 0? Y   Other threads? N

        // Still necessary as the parent-side barrier
        __syncthreads();         // thread 0? Y   Other threads? Y

        // k2 result consumed by all parent threads
        consumeK2Result();
    }
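A self-contained sketch of this pattern (consumeK2Result() is not defined on the slide; here the child writes its results into a global buffer allocated by the host, which is an assumption of this example). Note that device-side cudaDeviceSynchronize() is supported in the CUDA versions these slides target but has since been deprecated/removed in recent CUDA releases.

    #define N 32

    __global__ void k2(int *out) {
        out[threadIdx.x] = threadIdx.x * threadIdx.x;   // child produces results in global memory
    }

    __global__ void k1(int *out) {
        if (threadIdx.x == 0)
            k2<<<1, N>>>(out);

        cudaDeviceSynchronize();       // thread 0 waits for k2 and sees its writes
        __syncthreads();               // barrier so every parent thread can now read 'out'

        int val = out[threadIdx.x];    // all parent threads consume k2's results
        out[threadIdx.x] = val + 1;    // placeholder "consume" step
    }

    int main() {
        int *out;
        cudaMalloc(&out, N * sizeof(int));
        k1<<<1, N>>>(out);
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }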

(13)

CDP Synchronization (3)

• How about this?

    __global__ void k1() {
        if (threadIdx.x == 0)
            k2<<<1, 1>>>();      // k2 result visible to (same block): thread 0? N   Other threads? N

        __syncthreads();         // thread 0? N   Other threads? N

        cudaDeviceSynchronize(); // thread 0? Y   Other threads? N
    }

(14)

CDP Synchronization (4)

• Implicit synchronization
  v If there is no explicit sync, the child kernel is guaranteed to start execution by the time the parent kernel's execution ends
  v The parent kernel is not finished until all its child kernels are finished

    __global__ void k1() {
        if (threadIdx.x == 0)
            k2<<<...>>>();

        otherOps();
    }   // <- implicit sync point

    int main() {
        k1<<<...>>>();
        k3<<<...>>>();
    }

Q: When can k3 start execution? After k1 and k2 finish. (15)
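A runnable sketch of the slide above (the launch configurations were lost on the slide, so the <<<1, 32>>> configurations here are assumptions, as are the bodies of k2, k3, and otherOps):

    __device__ void otherOps() { /* other parent work (placeholder) */ }

    __global__ void k2() { /* child work (placeholder) */ }

    __global__ void k3() { /* independent kernel (placeholder) */ }

    __global__ void k1() {
        if (threadIdx.x == 0)
            k2<<<1, 32>>>();     // no explicit cudaDeviceSynchronize()

        otherOps();              // may overlap with k2's execution
    }                            // implicit sync point: k1 is not complete until k2 completes

    int main() {
        k1<<<1, 32>>>();
        k3<<<1, 32>>>();         // same default stream: starts only after k1 (and therefore k2) finish
        cudaDeviceSynchronize();
        return 0;
    }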

CDP Memory Model

• Global memory: visible to both parent and child
• Shared and local memory: private to the parent/child; cannot be passed from one to the other

    __global__ void k1(int *glmem) {
        __shared__ int shmem[...];

        k2<<<...>>>(shmem);   // undefined behavior

        k2<<<...>>>(glmem);   // ok
    }
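A hedged sketch of the usual workaround when a child needs data that currently lives in shared memory: stage it into a global buffer first (the scratch buffer, its size, and the assumption that blockDim.x == 256 are all for illustration only):

    __global__ void k2(int *data) { /* child work using 'data' (placeholder) */ }

    __global__ void k1(int *glmem, int *scratch) {   // 'scratch' allocated with cudaMalloc on the host
        __shared__ int shmem[256];
        shmem[threadIdx.x] = glmem[threadIdx.x];     // some shared-memory computation
        __syncthreads();

        // Copy the shared-memory results to global memory before handing them to the child;
        // passing 'shmem' itself would be undefined behavior.
        scratch[threadIdx.x] = shmem[threadIdx.x];
        __syncthreads();

        if (threadIdx.x == 0)
            k2<<<1, 256>>>(scratch);                 // ok: global memory pointer
    }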

(16)

CDP Memory Model (2)

• Global memory is fully consistent when:
  v The child is invoked (parent -> child)
  v The child has completed and cudaDeviceSynchronize is invoked (child -> parent)

    __global__ void k1(int *glmem) {

        // glmem is fully consistent between k1 and k2
        k2<<<...>>>(glmem);

        // No consistency guarantee here

        cudaDeviceSynchronize();
        // glmem is fully consistent between k1 and k2 again
    }

(17)

CDP Benefits

• What are the issues with the following CUDA code?
• Hint: control flow and memory accesses

    __global__ void k1(int *iterations, int *offsets, int *vals) {
        int it = iterations[threadIdx.x];
        int offset = offsets[threadIdx.x];
        for (int i = 0; i < it; i++) {      // control divergence: 'it' differs per thread
            int val = vals[offset + i];     // non-coalesced memory accesses
            process(val);                   // per-element work (function name assumed)
        }
    }

(18)

CDP Benefits (2)

    threadIdx.x   iterations   offset
         0             3          0
         1             4          3
         2             5          7
         3             7         12
         4             2         19
         5             2         21
         6            11         23
         7             4         34

• Workload imbalance causes control flow divergence
• Different offsets cause non-coalesced memory accesses

(19)

CDP Benefits (3)

• Use CDP to implement it:
  v Launch a child kernel from each parent thread
  v Launch only when there is sufficient parallelism

[Figure: parent kernel threads t0–t7; child kernels are launched by t1, t2, t3, t6, and t7]

(20)

CDP Benefits (3)

• Reduce control divergence
  v Reduce workload imbalance in the parent kernel
  v Uniform control flow in the child kernels
• Increase coalesced memory accesses

[Figure: the child kernels launched by t1, t2, t3, t6, and t7 each process a contiguous range of vals, giving uniform control flow, more memory coalescing, and reduced workload imbalance] (21)

CDP Benefits (4)

• Compared with other optimizations that reduce control and memory divergence:
  v Productivity
  v Fewer lines of code

    __global__ void k1(int *iterations, int *offsets, int *vals) {
        int it = iterations[threadIdx.x];
        int offset = offsets[threadIdx.x];

        if (it > para_threshold)
            child<<<1, it>>>(vals, offset);
    }
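The child kernel itself is not shown on the slide; a minimal sketch of what it might look like (process() is the assumed per-element work, and parent threads whose 'it' is below the threshold would still handle their elements in the parent loop):

    __device__ void process(int v) { /* per-element work (assumed) */ }

    __global__ void child(int *vals, int offset) {
        // One child thread per iteration of the original loop:
        // consecutive threads read consecutive elements, so the accesses coalesce.
        int val = vals[offset + threadIdx.x];
        process(val);
    }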

(22)

CDP Benefits (5)

• Other benefits (from S. Jones' slides)
  v Data-dependent execution
  v Dynamic load balancing
  v Dynamic library kernel calls
  v Recursive kernel calls
  v Simplify the CPU/GPU divide

(23)

Recap: Host-launched Kernel Execution

[Figure: baseline GPU microarchitecture for host-launched kernels — the host CPU submits kernels through HW work queues to the Kernel Management Unit (pending kernels) and the Kernel Distributor (entries hold PC, Dim, Param, ExeBL); the SMX scheduler dispatches thread blocks to the SMXs (warp schedulers, warp contexts, registers, cores, L1 cache / shared memory), backed by the interconnect, L2 cache, memory controller, and DRAM; multiple kernels execute concurrently across the SMXs]

(24)

Device Kernel Launching

[Figure: the same baseline microarchitecture; with CDP, device-side launches flow from the SMXs to the Kernel Management Unit, which now has to manage thousands of pending kernels]

Warning: information speculated from NVIDIA's documentation (25)

Device Kernel Launching (2)

• Managed by hardware (uArch) and software (device driver)
• Pending kernel management in the KMU
  v Manages pending child kernels and suspended kernels (due to parent-child yielding)
  v Kernel information is stored in memory
  v Tracked by the driver

[Figure: baseline microarchitecture diagram repeated — Kernel Management Unit, Kernel Distributor, SMX scheduler, SMXs, L2 cache, memory controller, DRAM, host CPU]

Warning: information speculated from NVIDIA's documentation (26)

Device Kernel Launching (3)

• New launching path from the SMX
  v The driver (software) initializes the child kernels and stores their information in memory
  v The SMX (hardware) notifies the Kernel Management Unit
• Device kernel execution
  v Same as host-launched kernels

[Figure: microarchitecture diagram with the SMX-to-KMU launch path highlighted]

Warning: information speculated from NVIDIA's documentation (27)

Device Kernel Launching (4)

• Parent-child synchronization
  v The device driver processes the synchronization request and performs the necessary actions:
    o Suspend parent kernels
    o Start executing child kernels
    o Complete parent kernels
  v The SMX issues the corresponding commands to the KMU

[Figure: microarchitecture diagram with the SMX-to-KMU synchronization commands highlighted]

Warning: information speculated from NVIDIA's documentation (28)

CDP Launching Overhead

• Measured from a CDP program directly (a timing sketch follows below)
  v Device kernel launching time: ~35 us
  v Recap: host kernel launching time: ~10 us

• Hardware (uArch)
  v Extra launching path
• Software (device driver)
  v Similar to the driver for host-launched kernels, but executed on the device
  v Kernel preallocation
  v Resource management
• No detailed breakdown available

Warning: information speculated from NVIDIA's documentation (29)
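A hedged sketch of one way to measure this from a CDP program (kernel names, launch counts, and configurations are assumptions; the measured time also includes the completion of the empty children, and serialization effects are ignored):

    #include <cstdio>

    __global__ void emptyChild() {}

    __global__ void parent(int n, bool doLaunch) {
        if (threadIdx.x == 0 && doLaunch)
            for (int i = 0; i < n; i++)
                emptyChild<<<1, 1>>>();     // device-side launches to be timed
    }

    int main() {
        const int n = 1000;                 // number of device-side launches (assumed)
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        float ms[2];
        for (int pass = 0; pass < 2; pass++) {   // pass 0: no launches, pass 1: with launches
            cudaEventRecord(start);
            parent<<<1, 32>>>(n, pass == 1);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            cudaEventElapsedTime(&ms[pass], start, stop);
        }
        printf("~%.2f us per device-side launch\n", (ms[1] - ms[0]) * 1000.0f / n);
        return 0;
    }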

CDP Launching Overhead (2)

• Scales with the kernel launch count
• 98.57 ms for 256K kernel launches on a K40 (vs. 3.27 ms for the non-CDP BFS kernel)
• 36.1%-80.6% of total execution time

[Chart: launch overhead (ms) vs. kernel launch count]

(30)

CDP Launching Overhead (2)

• Potential benefits in control flow and memory behavior
• However, the launch overhead causes a slowdown instead

[Chart: speedup over non-CDP, CDP-ideal vs. CDP-actual, across benchmarks]

(31) From Wang and Yalamanchili, IISWC’14

CDP Memory Footprint

• Requires preallocation for child kernel information
  v ~8 KB per child kernel launch
• Requires preallocation for suspended parent state
  v Registers
  v Shared memory
  v ~100 MB per explicit synchronization level (cudaDeviceSynchronize)
• (These reservations are configurable from the host via device-runtime limits; see the sketch below)
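A hedged sketch of the host-side knobs behind these reservations (cudaLimitDevRuntimeSyncDepth and cudaLimitDevRuntimePendingLaunchCount are real CUDA runtime limits; the specific values below and the exact bytes reserved per level/launch are version-dependent):

    #include <cstdio>

    int main() {
        // Depth of nested device-side cudaDeviceSynchronize() levels to reserve parent state for.
        cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 2);

        // Number of outstanding device-side launches to reserve launch buffers for.
        cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

        size_t syncDepth = 0, pending = 0;
        cudaDeviceGetLimit(&syncDepth, cudaLimitDevRuntimeSyncDepth);
        cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
        printf("sync depth limit = %zu, pending launch limit = %zu\n", syncDepth, pending);
        return 0;
    }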

(32)

CDP Memory Footprint (2)

• Reserved global memory for kernel launches
  v Scales with the kernel count
  v Up to 1.2 GB reserved (the K20c has 5 GB of global memory)

[Chart: pending launch count limit (x1000) vs. global memory reserved on the K20c (GB)]

(33) From Wang and Yalamanchili, IISWC'14

Summary

• CUDA Dynamic Parallelism (CDP) supports launching kernels from the GPU
• Good for productivity, control flow, and memory behavior (irregular applications)
• Supported by an extended uArch and driver
• Non-trivial overhead overshadows the performance benefit

(34)

Dynamic Thread Block Launch (DTBL)

Wang, Rubin, Sidelnik and Yalamanchili, [ISCA 2015] (35)

Objectives

• To understand the DTBL execution model
• To understand the microarchitectural support for DTBL
• To understand the benefits of DTBL

(36)

Example of Irregular Applications

• Dynamically Formed pockets of structured Parallelism (DFP)

[Figure: Adaptive Mesh Refinement (AMR) and graph traversal — each parent element/vertex dynamically produces a pocket of structured parallelism, and the pockets differ in size, causing workload imbalance]

(37)

Recap: CDP for Irregular Applications

• Launch a child kernel from each parent thread when there is sufficient parallelism

[Figure: parent kernel threads t0–t7; child kernels are launched by t1, t2, t3, t6, and t7]

(38)

Features of DFP

• High dynamic workload density: 1,000–200,000 pockets of structured parallelism (# of dynamic kernels)

[Chart: count of launched child kernels (x1000) per benchmark]

• High kernel launching overhead

(39)

Features of DFP (2)

• Low compute intensity: the average degree of parallelism in a dynamic workload is ~40 (# of threads in dynamic kernels)

[Chart: average child kernel thread count per benchmark, roughly 32-105]

• Low GPU occupancy and utilization

(40)

Features of DFP (3)

• Workload similarity: most dynamic workloads execute the same kernel code (with different configurations, e.g. TB size, shared memory size, etc.)

(41)

Dynamic Thread Block Launch (DTBL)

• Motivation:
  v A lightweight, efficient programming model and microarchitecture for DFP

• Proposal: DTBL
  v Extend the current GPU execution model
  v Threads can launch fine-grained TBs on demand instead of kernels
  v Goals:
    o Increase SMX execution efficiency for DFP
    o Minimize hardware and runtime overhead

(42)

DTBL Execution Model

• Each thread in a kernel can launch new TBs
• New TBs are coalesced with the existing kernel
• Coalescing: the new TBs belong to the same kernel and are scheduled together

[Figure: newly launched TBs are coalesced with their kernel — new TBs for Kernel 1 join Kernel 1, and new TBs for Kernel 3 join Kernel 3] (43)

User Interface and Code Example

• Program interface: one additional CUDA Device Runtime API, cudaLaunchAggGroup

    __global__ void K1(...) {

        void *param = cudaGetParameterBuffer(...);

        cudaLaunchAggGroup(param, ...);
    }

(44)

Revisit: Graph Traversal

• Launch new TBs for DFP
  v All the TBs are doing the same work
• Launch only when there is sufficient parallelism

[Figure: parent kernel threads 0-5 traverse the example graph; t1, t2, t3, and t4 dynamically launch TBs for their vertices' neighbors]

Similar benefit as in CDP for control flow and memory behavior

(45)

Revisit: Graph Traversal

• Launch new TBs for DFP
  v All the TBs are doing the same work
• Launch only when there is sufficient parallelism

[Figure: under DTBL, the TBs launched by t1, t2, t3, and t4 are coalesced into a single new kernel]

(46)

TB Coalescing and Scheduling

• Achieves high SMX occupancy and utilization
• Recall: DFP consists of fine-grained dynamic workloads with high density
• TB coalescing allows enough TBs to be aggregated into one big kernel and scheduled to the SMXs
• In comparison: CDP only allows 32 concurrent kernels, so the SMXs are not fully occupied by small kernels

(47)

Examples for TB coalescing

• One TB has 64 threads or 2 warps (in CDP, that is one kernel)
• On a K20c GK110 GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels

CDP: all child kernels
[Figure: Kernel 1 ... Kernel 100 waiting to be scheduled onto SMX0-SMX12]

(48)

Examples for TB coalescing

• (Setup repeated from the previous slide.)

[Figure: Kernel 1 becomes resident on SMX0]

(49)

Examples for TB coalescing

• (Same setup.)

[Figure: Kernels 1 and 2 are resident on SMX0 and SMX1]

(50)

Examples for TB coalescing

• (Same setup.)

[Figure: Kernels 1-13 are resident, one per SMX]

(51)

Examples for TB coalescing

• (Same setup.)

[Figure: Kernels 14, 15, ... fill in as additional concurrent kernels on the SMXs]

(52)

Examples for TB coalescing

• (Same setup.)

[Figure: only 32 kernels (Kernels 1-32) can be concurrent; Kernel 33 onward must wait until a previous kernel finishes]

(53)

Examples for TB coalescing

• (Same setup.)
• Achieved SMX occupancy: 32 kernels * 2 warps/kernel / (64 warps/SMX * 13 SMXs) = 0.077

[Figure: the 32 concurrent kernels leave the SMXs mostly idle; Kernel 33 onward waits for a previous kernel to finish]

(54)

Examples for TB coalescing

• One TB has 64 threads or 2 warps (in CDP, that is one kernel)
• On a K20c GK110 GPU: 13 SMXs, max 64 warps/SMX, max 32 concurrent kernels

DTBL: all TBs are coalesced with one kernel; no restrictions in scheduling
[Figure: TB 1 ... TB 100 waiting to be scheduled onto SMX0-SMX12]

(55)

Examples for TB coalescing

• (Same setup.)

DTBL: all TBs are coalesced with one kernel; no restrictions in scheduling
[Figure: TBs 91-100 (and the rest) are distributed freely across the SMXs]

(56)

Examples for TB coalescing

• Achieved SMX occupancy: 100 TBs * 2 warps/TB / (64 warps/SMX * 13 SMXs) = 0.24
• Can be increased further with more TBs!

DTBL: all TBs are coalesced with one kernel; no restrictions in scheduling
[Figure: the 100 TBs spread across the SMXs]

(57)

Microarchitecture (Baseline GPU)

[Figure: baseline GPU microarchitecture (HW work queues, Kernel Management Unit with pending kernels, Kernel Distributor entries with PC/Dim/Param/ExeBL, SMX scheduler, SMXs, interconnect, L2 cache, memory controller, DRAM, host CPU) shown alongside the DTBL microarchitecture extensions: an Aggregated Group Table (AggDim, Param, Next, ExeBL), NAGEI/LAGEI fields in the Kernel Distributor entries, DTBL scheduling logic with an FCFS controller and KDEI/AGEI/NextBL control registers in the SMX scheduler, per-SMX Aggregated Group Information, and per-TB control registers (KDEI, AGEI, BLKID)]

(58)

Microarchitecture Extension

[Figure: the same diagram with the DTBL extensions highlighted — the Aggregated Group Table tracks newly launched TBs, the DTBL scheduler in the SMX scheduler coalesces and schedules the new TBs, and the SMXs supply new TB launching information through their control registers]

(59)

TB Coalescing on Microarchitecture

[Figure: TB coalescing — newly launched TBs from the SMXs are recorded in the Aggregated Group Table and coalesced with an existing kernel entry (K) in the Kernel Distributor]

(60)

TB Coalescing on Microarchitecture

[Figure: the coalesced TBs are then scheduled together onto the SMXs to increase SMX occupancy]

(61)

TB Launching Latency

• Exclusive launching path from the SMXs
• Simple runtime and driver support with low latency
  v Kernel information only has to be managed once
  v New TBs can reuse the kernel information

• In comparison, CDP has
  v A longer launching path
  v A more complicated runtime and driver for kernel management

(62)

Launch Path for CDP and DTBL

[Figure: if launched as a kernel (CDP), a child goes through the Kernel Management Unit, which manages thousands of kernels; a TB launched through DTBL takes a shorter path directly to the Kernel Distributor Entry]

(63)

Launch Path for CDP and DTBL

• Launch path for CDP: ~70K cycles
• Launch path for DTBL: ~10K cycles
  (32 launches per warp)

[Figure: the same launch-path diagram, annotated with the cycle counts]

(64)

Speedup of DTBL

[Chart: speedup over the flat implementation for CDP-ideal, DTBL-ideal, CDP, and DTBL across benchmarks]

• Ideal (excluding launch latency): 1.63x over the flat implementations and 1.14x over CDP (benefits from higher GPU utilization)
• Actual (including launch latency): 1.21x over the flat implementations and 1.40x over CDP (benefits from lower launch overhead)

(65)

Summary

• DTBL: a new extension to the GPU execution model
• An efficient solution for dynamic parallelism in irregular applications

• Shows the most benefit for dynamic, fine-grained but sufficient parallelism
• Dynamic launch latency is reduced compared with CDP

(66)
