Matt Sandy Program Manager Microsoft Corporation

- Many-Core Systems Today
- What is DirectCompute?
- Where does it fit?
- Design of DirectCompute: Language, Execution Model, Memory Model…
- Tutorial – Hello DirectCompute
- Tutorial – Constant Buffers
- Performance Considerations
- Tutorial – Thread Groups
- Tutorial – Shared Memory

Many-Core Systems Today
[Figure: block diagrams of a CPU (CPU0 to CPU3 with L2 caches), a GPU (arrays of SIMD engines) and an APU]
- Discrete CPU + GPU system
  - CPU: 50 GFLOPS, CPU RAM 4-6 GB at ~10 GB/s
  - GPU: 2500 GFLOPS, GPU RAM 1 GB at ~100 GB/s
  - CPU to GPU link: ~10 GB/s
- APU
  - x86 cores: 50 GFLOPS
  - SIMD engines: 500 GFLOPS
  - System RAM at ~10 GB/s

What is DirectCompute?
- Microsoft’s GPGPU Programming Solution
- API of the DirectX Family
- Component of the Direct3D API

Where does it fit?
- Applications: Technical Computing, Games, Media Playback & Processing, etc.
- Domain Libraries: ACML, cuFFT, D3DCSX, MKL, etc.
- Domain Languages: Accelerator, Brook+, C++ AMP, Ct, RapidMind, etc.
- Compute Languages: DirectCompute, OpenCL, CUDA C, etc.
- Processors: APU, CPU, GPU from AMD, Intel, NVIDIA, S3, etc.
Language and Syntax
- DirectCompute code is written in “HLSL” (High-Level Shader Language)
- DirectCompute functions are “Compute Shaders”
- Syntax is C-like, with some exceptions
  - Built-in types and intrinsic functions
  - No pointers
- Compute Shaders use existing memory resources
  - They neither create nor destroy memory

Execution Model
- Threads are bundled into Thread Groups
- Thread Group size is defined in the Compute Shader
  - numthreads attribute
- Host code dispatches Thread Groups, not threads
[Figure: grid of thread IDs (x,y,z) running from 0,0,0 to 8,5,1, partitioned into thread groups by [numthreads(3,2,1)] in the shader and pContext->Dispatch(3,3,2); on the host]

Direct3D Runtime
- Manages all interactions with the GPU
- Executes Compute Shaders
- Handles device memory allocation

Compilation
- Pipeline: HLSL Code, HLSL Compiler, Intermediate Language (IL), Driver, Hardware Native Code
- HLSL compilation options
  - Offline (fxc.exe)
  - Runtime (D3D extensions library)
- The Direct3D runtime consumes IL

Memory Model
- DirectCompute partitions memory into Resources
  - Buffers for arbitrary data formats
  - Textures for media formats
- Resources are accessed using Resource Views
  - Conveys usage intent, permitting access optimizations
  - Enables utilization of fixed-function translation and sampling
  - Can create multiple views of a single resource
- Runtime connects Views to Compute Shaders
  - Compute Shaders operate on views of resources

Common Buffer Views
- Shader Resource View (SRV)
  - Provides read-only access to a resource
- Unordered Access View (UAV)
  - Provides read-write access to a resource
[Figure: a Compute Shader object connected to a Resource through a Shader Resource View and an Unordered Access View]
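The thread numbering in the Execution Model figure can be reproduced on the CPU. The sketch below is not part of the original slides: it is a plain C++ stand-in that applies the documented HLSL relations SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID and SV_GroupIndex = z*X*Y + y*X + x. The names `UInt3`, `dispatchThreadId`, `groupIndex` and `enumerateThreads` are hypothetical helpers chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Three-component unsigned vector, standing in for HLSL's uint3.
struct UInt3 {
    std::uint32_t x, y, z;
};

// Per component: SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID.
UInt3 dispatchThreadId(UInt3 groupId, UInt3 groupThreadId, UInt3 numThreads) {
    return { groupId.x * numThreads.x + groupThreadId.x,
             groupId.y * numThreads.y + groupThreadId.y,
             groupId.z * numThreads.z + groupThreadId.z };
}

// SV_GroupIndex flattens SV_GroupThreadID inside one group (x varies fastest).
std::uint32_t groupIndex(UInt3 groupThreadId, UInt3 numThreads) {
    return groupThreadId.z * numThreads.x * numThreads.y
         + groupThreadId.y * numThreads.x
         + groupThreadId.x;
}

// Lists the ID of every thread launched by Dispatch(groups) for a shader
// declared with [numthreads(numThreads)]. Hypothetical helper, CPU only.
std::vector<UInt3> enumerateThreads(UInt3 groups, UInt3 numThreads) {
    assert(numThreads.x > 0 && numThreads.y > 0 && numThreads.z > 0);
    std::vector<UInt3> ids;
    for (std::uint32_t gz = 0; gz < groups.z; ++gz)
      for (std::uint32_t gy = 0; gy < groups.y; ++gy)
        for (std::uint32_t gx = 0; gx < groups.x; ++gx)
          for (std::uint32_t tz = 0; tz < numThreads.z; ++tz)
            for (std::uint32_t ty = 0; ty < numThreads.y; ++ty)
              for (std::uint32_t tx = 0; tx < numThreads.x; ++tx)
                ids.push_back(dispatchThreadId({gx, gy, gz}, {tx, ty, tz}, numThreads));
    return ids;
}
```

With [numthreads(3,2,1)] and Dispatch(3,3,2), `enumerateThreads({3,3,2}, {3,2,1})` yields 3*3*2 groups of 3*2*1 threads each, and the last thread of the last group gets ID 8,5,1, the far corner of the grid in the figure.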
Tutorial – Hello DirectCompute
[Figure: animation built up one host call at a time, showing the APU's SIMD engines, the SimpleCS shader, Buffer0 and Buffer1 in Device Memory, and the SRV and UAV created over them]

Host code:
    D3D11CreateDevice(…);
    device->CreateComputeShader(…);
    device->CreateBuffer(…);
    device->CreateBuffer(…);
    device->CreateShaderResourceView(…);
    device->CreateUnorderedAccessView(…);
    context->CSSetShader(…);
    context->CSSetShaderResources(…);
    context->CSSetUnorderedAccessViews(…);
    context->Dispatch(…);

Compute Shader:
    StructuredBuffer<float> myInput;
    RWStructuredBuffer<float> myOutput;

    [numthreads(256,1,1)]
    void SimpleCS( uint3 DTid : SV_DispatchThreadID )
    {
        myOutput[DTid.x] = sqrt( myInput[DTid.x] );
    }

Thread Groups
Thread group sizes shown: 1x1x1 and 32x8x1
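To make the dispatch arithmetic behind SimpleCS concrete, here is a minimal CPU sketch in C++. It is not the Direct3D host code from the slides (those calls keep their elided arguments); it only mimics what the shader computes. `groupsFor` and `runSimpleCS` are hypothetical names, and the explicit bounds guard is an assumption added so that the sketch is safe when the element count is not a multiple of 256.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr std::size_t kGroupSize = 256; // matches [numthreads(256,1,1)]

// Number of thread groups to dispatch so that every element gets a thread
// (rounds up when elementCount is not a multiple of the group size).
std::size_t groupsFor(std::size_t elementCount) {
    return (elementCount + kGroupSize - 1) / kGroupSize;
}

// CPU stand-in for SimpleCS: one loop iteration per GPU thread.
std::vector<float> runSimpleCS(const std::vector<float>& myInput) {
    std::vector<float> myOutput(myInput.size());
    const std::size_t groups = groupsFor(myInput.size());
    assert(groups * kGroupSize >= myInput.size());
    for (std::size_t g = 0; g < groups; ++g) {
        for (std::size_t t = 0; t < kGroupSize; ++t) {
            const std::size_t dtid = g * kGroupSize + t; // SV_DispatchThreadID.x
            if (dtid >= myInput.size()) continue;         // guard added for this sketch
            myOutput[dtid] = std::sqrt(myInput[dtid]);
        }
    }
    return myOutput;
}
```

For 1000 elements, `groupsFor(1000)` gives 4 groups, so 1024 threads are launched and the last 24 do no work.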
[Figure: grid of SIMD engines]

Shared Memory
    StructuredBuffer<float> myInput;
    RWStructuredBuffer<float> myOutput;
    groupshared float temp[256];

    [numthreads(256,1,1)]
    void main( uint3 DTid : SV_DispatchThreadID, uint GI : SV_GroupIndex )
    {
        temp[GI] = myInput[DTid.x]; // copy device memory value to shared memory
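The listing above breaks off after the copy into shared memory, so what the shader does next is not known from the source. As one common use of groupshared storage, the C++ sketch below emulates a per-group sum on the CPU: each group stages its 256 values in a local array, then halves the active range until one value remains. The reduction itself, the name `groupSums`, and the whole-groups-only restriction are assumptions for illustration, not the continuation of the original shader. On a GPU each phase boundary would need a group barrier such as HLSL's GroupMemoryBarrierWithGroupSync(); here the sequential loops play that role.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kGroupSize = 256; // matches [numthreads(256,1,1)] and temp[256]

// Emulates one dispatch in which every thread group produces the sum of its
// 256 input values, using a local array in place of groupshared memory.
std::vector<float> groupSums(const std::vector<float>& myInput) {
    assert(myInput.size() % kGroupSize == 0); // sketch handles whole groups only
    const std::size_t groups = myInput.size() / kGroupSize;
    std::vector<float> sums(groups);
    for (std::size_t g = 0; g < groups; ++g) {
        float temp[kGroupSize]; // stands in for: groupshared float temp[256]

        // Phase 1: thread GI copies one device memory value to shared memory.
        for (std::size_t GI = 0; GI < kGroupSize; ++GI)
            temp[GI] = myInput[g * kGroupSize + GI];
        // (GPU: barrier here so every copy is visible to the whole group)

        // Phase 2: tree reduction; the lower half adds in the upper half.
        for (std::size_t stride = kGroupSize / 2; stride > 0; stride /= 2) {
            for (std::size_t GI = 0; GI < stride; ++GI)
                temp[GI] += temp[GI + stride];
            // (GPU: barrier here before the next, smaller stride)
        }
        sums[g] = temp[0];
    }
    return sums;
}
```

Staging the data once in shared memory means each input value is read from device memory a single time, while the repeated reads and writes of the reduction stay inside the group's local storage.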
