Matt Sandy
Program Manager
Microsoft Corporation

Many-Core Systems Today
What is DirectCompute? Where does it fit?
Design of DirectCompute
  Language, Execution Model, Memory Model…
Tutorial – Hello DirectCompute
Tutorial – Constant Buffers
Performance Considerations
Tutorial – Thread Groups
Tutorial – Shared Memory

[Diagram: CPU cores CPU0 to CPU3 with two L2 caches, alongside an array of SIMD engines]

[Figure: a CPU with a discrete GPU compared with an APU. Labels recovered from the figure: CPU 50 GFLOPS, GPU 2500 GFLOPS, CPU RAM 4-6 GB, GPU RAM 1 GB, links of ~10 GB/s and ~100 GB/s; APU with x86 cores at 50 GFLOPS and SIMD engines at 500 GFLOPS, System RAM at ~10 GB/s]

What is DirectCompute?
  Microsoft’s GPGPU Programming Solution
  API of the DirectX Family
  Component of the Direct3D API
  Technical Computing, Games, Applications
  Media Playback & Processing, etc.

Domain Languages:   C++ AMP, Accelerator, Brook+, Ct, RapidMind, etc.
Domain Libraries:   D3DCSX, MKL, ACML, cuFFT, etc.
Compute Languages:  DirectCompute, OpenCL, CUDA C, etc.
Processors:         APU, CPU, GPU (AMD, Intel, NVIDIA, S3, etc.)

Language and Syntax

  DirectCompute code is written in HLSL (High-Level Shader Language)
  DirectCompute functions are “Compute Shaders”
  Syntax is C-like, with some exceptions
    Built-in types and intrinsic functions
    No pointers
  Compute Shaders use existing memory resources
    They neither create nor destroy memory

Execution Model
  Threads are bundled into Thread Groups
  Thread Group size is defined in the Compute Shader
    numthreads attribute
  Host code dispatches Thread Groups, not threads

  [Figure: a grid of 9 x 6 x 2 threads, each labeled with its x,y,z coordinate, produced by [numthreads(3,2,1)] and pContext->Dispatch(3,3,2); each of the 3 x 3 x 2 thread groups covers 3 x 2 x 1 threads]

Direct3D Runtime

  Manages all interactions with the GPU
    Executes Compute Shaders
    Handles device memory allocation

Compilation
  HLSL compilation options
    Offline (fxc.exe)
    Runtime (D3D extensions library)
  The Direct3D runtime consumes the Intermediate Language (IL)
  [Diagram: HLSL Code -> HLSL Compiler -> Intermediate Language (IL) -> Driver -> Hardware Native Code]

Memory Model
  DirectCompute partitions memory into Resources
    Buffers for arbitrary data formats
    Textures for media formats
  Resources are accessed using Resource Views
    Views convey usage intent, permitting access optimizations
    Views enable use of fixed-function translation and sampling
    Multiple views of a single resource can be created
  The runtime connects Views to Compute Shaders
    Compute Shaders operate on views of resources

Common Buffer Views
  Shader Resource View (SRV)
    Provides read-only access to a resource
  Unordered Access View (UAV)
    Provides read-write access to a resource

[Diagram: a Resource exposed through a Shader Resource View and an Unordered Access View, both connected to the Compute Shader Object]

Host Code

    D3D11CreateDevice(…);
    device->CreateComputeShader(…);
    device->CreateBuffer(…);
    device->CreateBuffer(…);
    device->CreateShaderResourceView(…);
    device->CreateUnorderedAccessView(…);
    context->CSSetShader(…);
    context->CSSetShaderResources(…);
    context->CSSetUnorderedAccessViews(…);
    context->Dispatch(…);

Compute Shader Code

    StructuredBuffer<float>   myInput;
    RWStructuredBuffer<float> myOutput;

    [numthreads(256,1,1)]
    void SimpleCS( uint3 DTid : SV_DispatchThreadID )
    {
        myOutput[DTid.x] = sqrt( myInput[DTid.x] );
    }

[Animation: as the host calls execute, the diagram of the APU (SIMD Engine and Device Memory) gains the SimpleCS shader, then Buffer0 and Buffer1 in Device Memory, then an SRV and a UAV, and finally the SIMD units running the dispatch]
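To make the relationship between Dispatch and numthreads in the walkthrough above more concrete, here is a minimal CPU-side sketch in C++ that emulates what SimpleCS computes. The names groupCountFor and emulateSimpleCS are hypothetical and do not come from the slides, and the explicit bounds guard is an assumption added because indexing a std::vector out of range is undefined behavior in C++. It is an illustration of the indexing scheme only, not DirectCompute code.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Matches [numthreads(256,1,1)] in the shader.
constexpr std::uint32_t kGroupSize = 256;

// Hypothetical helper: how many thread groups Dispatch(x,1,1) needs so that
// at least `numElements` threads run. Rounds up to a whole group.
std::uint32_t groupCountFor(std::uint32_t numElements) {
    return (numElements + kGroupSize - 1) / kGroupSize;
}

// CPU emulation of SimpleCS: each inner loop iteration plays one GPU thread.
std::vector<float> emulateSimpleCS(const std::vector<float>& input) {
    std::vector<float> output(input.size(), 0.0f);
    const std::uint32_t groups =
        groupCountFor(static_cast<std::uint32_t>(input.size()));
    for (std::uint32_t gid = 0; gid < groups; ++gid) {             // SV_GroupID.x
        for (std::uint32_t gtid = 0; gtid < kGroupSize; ++gtid) {  // SV_GroupThreadID.x
            // SV_DispatchThreadID.x = group ID * group size + thread ID in group
            const std::uint32_t dtid = gid * kGroupSize + gtid;
            if (dtid >= input.size()) continue;  // guard added for the CPU version
            output[dtid] = std::sqrt(input[dtid]);
        }
    }
    return output;
}
```

With this scheme, a buffer of 1000 floats would need groupCountFor(1000) = 4 groups, which launches 1024 threads; the last 24 have no element to process.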

Thread Groups
  [Figure: two grids of SIMD units comparing thread group sizes 1x1x1 and 32x8x1]

Shared Memory

    StructuredBuffer<float>   myInput;
    RWStructuredBuffer<float> myOutput;
    groupshared float temp[256];

    [numthreads(256,1,1)]
    void main( uint3 DTid : SV_DispatchThreadID, uint GI : SV_GroupIndex )
    {
        temp[GI] = myInput[DTid.x];          // copy device memory value to shared memory
        GroupMemoryBarrierWithGroupSync();   // ensures all copy operations have finished

        // skip threads whose 17-element window would fall outside temp[0..255]
        if( GI < 8 || GI >= 256 - 8 ) return;

        float sum = 0;
        for( int i = -8; i <= 8; i++ )
            sum += temp[GI+i];               // each thread can use inexpensive shared loads
        myOutput[DTid.x] = sum / 17.f;
    }
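The two phases of the shared-memory shader (copy into the group's tile, then average from the tile) can be sketched on the CPU as follows. This is a hedged C++ illustration, not DirectCompute code: the name emulateBoxFilter is hypothetical, the input length is assumed to be a multiple of 256, and outputs that no thread writes are assumed to stay at 0. The sketch keeps every read inside the 256-element tile, so only indices 8 through 247 of each group produce a result.

```cpp
#include <vector>

constexpr int kGroupSize = 256;  // matches [numthreads(256,1,1)]
constexpr int kRadius    = 8;    // 17-tap window: 8 left + centre + 8 right

// CPU emulation of the shared-memory box filter, one thread group at a time.
// Assumes input.size() is a multiple of kGroupSize.
std::vector<float> emulateBoxFilter(const std::vector<float>& input) {
    std::vector<float> output(input.size(), 0.0f);
    const int groups = static_cast<int>(input.size()) / kGroupSize;
    for (int g = 0; g < groups; ++g) {
        // Phase 1: every thread copies one value into the group's tile
        // (the groupshared array in the shader).
        float temp[kGroupSize];
        for (int gi = 0; gi < kGroupSize; ++gi)
            temp[gi] = input[g * kGroupSize + gi];

        // On the GPU, a group-wide barrier separates the two phases.

        // Phase 2: threads whose whole window lies inside the tile average it.
        for (int gi = kRadius; gi < kGroupSize - kRadius; ++gi) {
            float sum = 0.0f;
            for (int i = -kRadius; i <= kRadius; ++i)
                sum += temp[gi + i];
            output[g * kGroupSize + gi] = sum / 17.0f;
        }
    }
    return output;
}
```

Each input value is read from the large buffer once per group and then reused by up to 17 neighbouring threads from the tile, which is the saving the shared-memory version is after. The unwritten edge elements show why a production filter would load a halo of extra elements per group.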

Download the DirectX SDK
  http://msdn.microsoft.com/directx
Browse the DirectCompute SDK Samples
  BasicCompute11
  ComputeShaderSort11
  NBodyGravityCS11
  HDRToneMappingCS11
Check out the Lecture Series on Channel 9
  http://channel9.msdn.com/Tags/directcompute-lecture-series
Start Using DirectCompute Today!

© 2011 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

SDK Debug Layers
  Provide information about runtime warnings and errors
PIX for Windows
  Designed for graphics applications; provides high-level debugging of compute shader invocations
GPUView for profiling and performance analysis
Vendor-provided tools

Interoperability with the rest of the 2D, 3D, and video rendering APIs
  Often want to display results you have computed
Cross-hardware compatibility
  Feature compatibility guarantees
Access to fixed-function hardware

No one API provides “better” performance
  Different APIs run on the same hardware
Performance depends on what the application/algorithm is
  And whether it is a common case for the API

    // Device and immediate context, filled in by D3D11CreateDevice
    ID3D11Device*        pDevice  = NULL;
    ID3D11DeviceContext* pContext = NULL;

    UINT uCreationFlags = D3D11_CREATE_DEVICE_SINGLETHREADED;
    #if defined(DEBUG) || defined(_DEBUG)
    uCreationFlags |= D3D11_CREATE_DEVICE_DEBUG;
    #endif
    D3D_FEATURE_LEVEL flOut;

    D3D11CreateDevice(
        NULL,                      // Use the default graphics card
        D3D_DRIVER_TYPE_HARDWARE,  // Try to create a hardware accelerated device
        NULL,                      // Do not use external software rasterizer modules
        uCreationFlags,            // Use the creation flags specified
        NULL,                      // Try to get highest feature level available
        0,                         // Number of elements in the previous array argument
        D3D11_SDK_VERSION,         // SDK version
        &pDevice,                  // Device out
        &flOut,                    // Actual feature level created
        &pContext                  // Context out
    );

    ID3D11Buffer* pBuffer;
    ID3D11UnorderedAccessView* pBufferUAV;

    D3D11_BUFFER_DESC bufferDesc;
    ZeroMemory( &bufferDesc, sizeof( bufferDesc ) );
    bufferDesc.StructureByteStride = sizeof( float );
    bufferDesc.ByteWidth = sizeof( float ) * uNumFloats;
    bufferDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    bufferDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
    bufferDesc.Usage = D3D11_USAGE_DEFAULT;

    D3D11_SUBRESOURCE_DATA InitialSRData;
    ZeroMemory( &InitialSRData, sizeof( InitialSRData ) );
    InitialSRData.pSysMem = pInitialData;
    pDevice->CreateBuffer( &bufferDesc, &InitialSRData, &pBuffer );

    D3D11_UNORDERED_ACCESS_VIEW_DESC viewDescUAV;
    ZeroMemory( &viewDescUAV, sizeof( viewDescUAV ) );
    viewDescUAV.Format = DXGI_FORMAT_UNKNOWN;
    viewDescUAV.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    viewDescUAV.Buffer.FirstElement = 0;
    viewDescUAV.Buffer.NumElements = uNumFloats;
    pDevice->CreateUnorderedAccessView( pBuffer, &viewDescUAV, &pBufferUAV );

    // create a new “staging buffer” as a resource accessible by the CPU and GPU
    D3D11_BUFFER_DESC desc;
    ZeroMemory( &desc, sizeof( desc ) );
    pDeviceBuffer->GetDesc( &desc );
    desc.Usage = D3D11_USAGE_STAGING;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
    desc.BindFlags = 0;
    desc.MiscFlags = 0;
    ID3D11Buffer* pDeviceBufferCopy = NULL;
    pDevice->CreateBuffer( &desc, NULL, &pDeviceBufferCopy );

    // copy the data from the desired buffer to the staging buffer
    pContext->CopyResource( pDeviceBufferCopy, pDeviceBuffer );

    // create a mapped resource to allow CPU access to the data in the staging buffer
    D3D11_MAPPED_RESOURCE MappedResource;
    pContext->Map( pDeviceBufferCopy, 0, D3D11_MAP_READ, 0, &MappedResource );

    // allocate memory on the CPU for the buffer data, and perform the copy operation
    float* pHostBuffer = new float[uNumFloats];
    memcpy( pHostBuffer, MappedResource.pData, sizeof( float ) * uNumFloats );

    // release the mapping and the staging buffer once the data has been copied
    pContext->Unmap( pDeviceBufferCopy, 0 );
    pDeviceBufferCopy->Release();

Read/write from/to structured data
  Indexed I/O with known data amount
Writes of unknown/arbitrary amount
  Preserving order is not useful in this case
Load/sample() from image
  Just like pixel shaders do via a texture and SRV
Random access write-only to image
  Hardware pixel format conversion is supported
Random access read/write to image data
  Supports in-place image operations

Indexed I/O with known data amount
  Exactly like arrays, even in syntax
Can also use atomic operators
  Need to watch for contention on write destinations
  Packing/compaction can cause contention on counters

Buffer and View Creation

    ID3D11Buffer* pBuffer;
    D3D11_BUFFER_DESC bufferDesc;
    ZeroMemory( &bufferDesc, sizeof( bufferDesc ) );
    bufferDesc.StructureByteStride = sizeof( float );
    bufferDesc.ByteWidth = sizeof( float ) * uNumFloats;
    bufferDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    bufferDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
    bufferDesc.Usage = D3D11_USAGE_DEFAULT;

    D3D11_SUBRESOURCE_DATA InitialSRData;
    ZeroMemory( &InitialSRData, sizeof( InitialSRData ) );
    InitialSRData.pSysMem = pInitialData;
    pDevice->CreateBuffer( &bufferDesc, &InitialSRData, &pBuffer );

    ID3D11UnorderedAccessView* pBufferUAV;
    D3D11_UNORDERED_ACCESS_VIEW_DESC viewDescUAV;
    ZeroMemory( &viewDescUAV, sizeof( viewDescUAV ) );
    viewDescUAV.Format = DXGI_FORMAT_UNKNOWN;
    viewDescUAV.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    viewDescUAV.Buffer.FirstElement = 0;
    viewDescUAV.Buffer.NumElements = uNumFloats;
    viewDescUAV.Buffer.Flags = 0;
    pDevice->CreateUnorderedAccessView( pBuffer, &viewDescUAV, &pBufferUAV );

Compute Shader Code

    // host execution code:
    // context->Dispatch(4,2,1);   // 4x2 groups of 32x16 threads = 128x32 threads

    StructuredBuffer<float>   myInput;
    RWStructuredBuffer<float> myOutput;

    [numthreads(32,16,1)]
    void main( uint3 Gid  : SV_GroupID,           // Dispatch-unique group ID
               uint3 DTid : SV_DispatchThreadID,  // Dispatch-unique thread ID
               uint3 GTid : SV_GroupThreadID,     // Group-unique thread ID
               uint  GI   : SV_GroupIndex )       // Flattened group-unique thread index
    {
        // 128 is the dispatch width in threads (4 groups x 32 threads)
        myOutput[DTid.x+DTid.y*128] = sqrt( myInput[DTid.x+DTid.y*128] );
    }

Preserving order is not useful in this case

Use Append/Consume() intrinsics

Buffer and View Creation

    ID3D11Buffer* pBuffer;
    D3D11_BUFFER_DESC bufferDesc;
    ZeroMemory( &bufferDesc, sizeof( bufferDesc ) );
    bufferDesc.StructureByteStride = sizeof( float );
    bufferDesc.ByteWidth = sizeof( float ) * uNumFloats;
    bufferDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    bufferDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
    bufferDesc.Usage = D3D11_USAGE_DEFAULT;

    D3D11_SUBRESOURCE_DATA InitialSRData;
    ZeroMemory( &InitialSRData, sizeof( InitialSRData ) );
    InitialSRData.pSysMem = pInitialData;
    pDevice->CreateBuffer( &bufferDesc, &InitialSRData, &pBuffer );

    ID3D11UnorderedAccessView* pBufferUAV;
    D3D11_UNORDERED_ACCESS_VIEW_DESC viewDescUAV;
    ZeroMemory( &viewDescUAV, sizeof( viewDescUAV ) );
    viewDescUAV.Format = DXGI_FORMAT_UNKNOWN;
    viewDescUAV.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    viewDescUAV.Buffer.FirstElement = 0;
    viewDescUAV.Buffer.NumElements = uNumFloats;
    viewDescUAV.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_APPEND;   // required for Append/Consume
    pDevice->CreateUnorderedAccessView( pBuffer, &viewDescUAV, &pBufferUAV );

Compute Shader Code

    // host execution code:
    // context->Dispatch(4,1,1);

    ConsumeStructuredBuffer<float> myInput;
    AppendStructuredBuffer<float>  myOutput;

    [numthreads(32,16,1)]
    void main()
    {
        float x = myInput.Consume();   // take one element; order is not guaranteed
        x = sqrt( x );
        myOutput.Append( x );          // add one element; order is not guaranteed
    }

Enables hardware accelerated filtering/sampling
  Leverages a large fraction of the silicon

Used just like in a pixel shader, via a texture resource and an SRV

Coherency of access affects performance
  There is a very small cache in this path

Texture and View Creation

    ID3D11Texture2D* pTexture;
    D3D11_TEXTURE2D_DESC textureDesc;
    ZeroMemory( &textureDesc, sizeof( textureDesc ) );
    textureDesc.Width = 1920;
    textureDesc.Height = 1200;
    textureDesc.MipLevels = 1;
    textureDesc.ArraySize = 1;
    textureDesc.SampleDesc.Count = 1;
    textureDesc.SampleDesc.Quality = 0;
    textureDesc.Usage = D3D11_USAGE_DEFAULT;
    textureDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
    textureDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    pDevice->CreateTexture2D( &textureDesc, NULL, &pTexture );

    ID3D11ShaderResourceView* pTextureSRV;
    D3D11_SHADER_RESOURCE_VIEW_DESC viewDescSRV;
    ZeroMemory( &viewDescSRV, sizeof( viewDescSRV ) );
    viewDescSRV.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    viewDescSRV.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2D;
    viewDescSRV.Texture2D.MipLevels = 1;
    viewDescSRV.Texture2D.MostDetailedMip = 0;
    pDevice->CreateShaderResourceView( pTexture, &viewDescSRV, &pTextureSRV );

Compute Shader Code

    // host execution code:
    // context->Dispatch(60,75,1);   // 60x75 groups of 32x16 threads = 1920x1200 threads

    Texture2D                 myInputTex;
    RWStructuredBuffer<float> myOutput;

    [numthreads(32,16,1)]
    void main( uint3 DTid : SV_DispatchThreadID )
    {
        // Load takes (x, y, mip level); DTid.z is always 0 here, so mip 0 is read
        float4 pixel = myInputTex.Load( DTid );
        float mag = length( pixel.xyz );
        myOutput[DTid.x+DTid.y*1920] = mag;
    }

Hardware pixel format conversion supported

Blending/compositing operations are not supported
  Must use a Pixel Shader to get these, but then random access is not available

Be careful of contention on write destinations

Texture and View Creation

    ID3D11Texture2D* pTexture;
    D3D11_TEXTURE2D_DESC textureDesc;
    ZeroMemory( &textureDesc, sizeof( textureDesc ) );
    textureDesc.Width = 1920;
    textureDesc.Height = 1200;
    textureDesc.MipLevels = 1;
    textureDesc.ArraySize = 1;
    textureDesc.SampleDesc.Count = 1;
    textureDesc.SampleDesc.Quality = 0;
    textureDesc.Usage = D3D11_USAGE_DEFAULT;
    textureDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
    textureDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    pDevice->CreateTexture2D( &textureDesc, NULL, &pTexture );

    ID3D11UnorderedAccessView* pTextureUAV;
    D3D11_UNORDERED_ACCESS_VIEW_DESC viewDescUAV;
    ZeroMemory( &viewDescUAV, sizeof( viewDescUAV ) );
    viewDescUAV.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    viewDescUAV.ViewDimension = D3D11_UAV_DIMENSION_TEXTURE2D;
    viewDescUAV.Texture2D.MipSlice = 0;
    // pass the resource pointer itself, not its address
    pDevice->CreateUnorderedAccessView( pTexture, &viewDescUAV, &pTextureUAV );

Compute Shader Code

    // host execution code:
    // context->Dispatch(60,75,1);   // 60x75 groups of 32x16 threads = 1920x1200 threads

    StructuredBuffer<float2> myInput;
    RWTexture2D<float4>      myOutputTex;

    [numthreads(32,16,1)]
    void main( uint3 DTid : SV_DispatchThreadID )
    {
        const float4 red  = float4(1,0,0,0);
        const float4 blue = float4(0,0,1,0);
        float2 data = myInput[DTid.x+DTid.y*1920];
        float4 color = lerp( red, blue, data.x );   // data.x selects the hue
        color = normalize( color ) * data.y;        // data.y scales the intensity
        myOutputTex[DTid.xy] = color;               // hardware converts to R8G8B8A8_UNORM
    }

Supports in-place image operations
  Modifying just a few locations in a matrix/array
  A few pixels in an image (red eye removal)

App must perform any pixel format conversions in shader code

Same texture resource can be sample()d through a separate SRV at a different time

Texture and View Creation

    ID3D11Texture2D* pTexture;
    D3D11_TEXTURE2D_DESC textureDesc;
    ZeroMemory( &textureDesc, sizeof( textureDesc ) );
    textureDesc.Width = 1920;
    textureDesc.Height = 1200;
    textureDesc.MipLevels = 1;
    textureDesc.ArraySize = 1;
    textureDesc.SampleDesc.Count = 1;
    textureDesc.SampleDesc.Quality = 0;
    textureDesc.Usage = D3D11_USAGE_DEFAULT;
    textureDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
    textureDesc.Format = DXGI_FORMAT_R8G8B8A8_TYPELESS;   // typeless so the UAV can reinterpret it
    pDevice->CreateTexture2D( &textureDesc, NULL, &pTexture );

    ID3D11UnorderedAccessView* pTextureUAV;
    D3D11_UNORDERED_ACCESS_VIEW_DESC viewDescUAV;
    ZeroMemory( &viewDescUAV, sizeof( viewDescUAV ) );
    viewDescUAV.Format = DXGI_FORMAT_R32_UINT;            // each pixel is read and written as one uint
    viewDescUAV.ViewDimension = D3D11_UAV_DIMENSION_TEXTURE2D;
    viewDescUAV.Texture2D.MipSlice = 0;
    // pass the resource pointer itself, not its address
    pDevice->CreateUnorderedAccessView( pTexture, &viewDescUAV, &pTextureUAV );

Compute Shader Code

    // host execution code:
    // context->Dispatch(60,75,1);   // 60x75 groups of 32x16 threads = 1920x1200 threads

    RWTexture2D<uint> myTex;

    [numthreads(32,16,1)]
    void main( uint3 DTid : SV_DispatchThreadID )
    {
        uint packed = myTex[DTid.xy];

        // unpack the four 8-bit channels into normalized floats
        precise float4 pixel;
        pixel.r = ( ( packed >>  0 ) & 0xff ) / 255.0f;
        pixel.g = ( ( packed >>  8 ) & 0xff ) / 255.0f;
        pixel.b = ( ( packed >> 16 ) & 0xff ) / 255.0f;
        pixel.a = ( ( packed >> 24 ) & 0xff ) / 255.0f;

        // replace strong red with the average of green and blue
        if( pixel.r > 0.8f )
            pixel.r = ( pixel.g + pixel.b ) / 2.0f;

        // clamp, round, and repack into a single uint
        packed = ( ( ( (uint)( floor( saturate( pixel.r ) * 255.0f + 0.5f ) ) ) <<  0 ) |
                   ( ( (uint)( floor( saturate( pixel.g ) * 255.0f + 0.5f ) ) ) <<  8 ) |
                   ( ( (uint)( floor( saturate( pixel.b ) * 255.0f + 0.5f ) ) ) << 16 ) |
                   ( ( (uint)( floor( saturate( pixel.a ) * 255.0f + 0.5f ) ) ) << 24 ) );
        myTex[DTid.xy] = packed;
    }

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.