Applying AMD's "Kaveri" APU for Heterogeneous Computing
Dan Bouvier, Ben Sander
August 2014

OUR DESIGN CHOICES

REDESIGNED COMPUTE CORES TO UNLOCK GFLOPS
 All new architecture for 45% more GPU performance 1

HSA FEATURES ADDED
 Featuring shared system memory (hUMA, CPU + GPU)
 Heterogeneous Queuing

THE LATEST GAMING TECHNOLOGY
 GCN architecture
 AMD TrueAudio technology*
 Mantle

ENERGY EFFICIENCY
 95W to 15W solutions featuring configurable TDP
 PCI-Express® Gen 3

*AMD TrueAudio technology is offered by select AMD Radeon™ R9 and R7 200 Series GPUs and select AMD A-Series APUs and is designed to improve acoustic realism. Requires enabled game or application. Not all audio equipment supports all audio effects; additional audio equipment may be required for some audio effects. Not all products feature all technologies—check with your component or system manufacturer for specific capabilities.

A-SERIES REDEFINES COMPUTE

MAXIMUM COMPUTE PERFORMANCE
• Up to 12 compute cores*
• 4 "Steamroller" CPU cores
• 8 GCN GPU cores
• HSA enabled

ENHANCED USER EXPERIENCES
• Video acceleration (UVD, VCE)
• AMD TrueAudio technology
• 4 display heads (DisplayPort 1.2, HDMI 1.4a, DVI)

HIGH PERFORMANCE CONNECTIVITY
• 128-bit DDR3 up to 2133
• PCI-Express® Gen3 x16 for discrete graphics upgrade
• PCI-Express® for direct attach NVMe SSD

(Block diagram of "Kaveri": 4 "Steamroller" CPU cores and 8 GCN GPU cores over shared hUMA memory, plus multimedia (UVD, VCE, AMD TrueAudio technology), connectivity (PCIe Gen 3, PCIe Gen 2), and display blocks.)

* For more information visit amd.com/computecores

"KAVERI"

Die size: 245 mm²
Transistor count: 2.41 billion
Process: 28nm

(Die photo labels: two dual-core x86 modules, GPU, Unified Northbridge (UNB), PCIe/Display, DDR3 PHY.)

IMPROVEMENTS FOR AMD'S DUAL CPU COMPUTE CORES

"Steamroller" improvements:
 Reduced I-cache fetch misses (increase in instruction cache size)
 Reduced mispredicted branches
 Increase in scheduling efficiency
 Max-width dispatches per thread
 Major improvements in store handling

(Module diagram: shared fetch and dual decode units feeding two integer schedulers and a shared FP scheduler; integer pipelines per core, 128-bit FMAC and MMX units, per-core L1 D-caches, and a shared L2 cache.)

"KAVERI" GPU – ARCHITECTURE

 47% of "Kaveri" is dedicated for GPU
 8 compute units (512 IEEE 754-2008-compliant shaders)
 Global Data Share
 Masked Quad Sum of Absolute Difference (MQSAD) with 32b accumulation and saturation
 Device flat (generic) addressing support
 Precision improvement for native LOG/EXP ops to 1 ULP

(GPU block diagram: Graphics Command Processor and eight ACEs feed a Shader Engine containing the Geometry Processor, Rasterizer, 8 CUs, and RBs. Each CU contains a Branch & Message Unit, Scheduler, Vector Units (4x SIMD-16), Scalar Unit, Texture Filter Units (4), Texture Fetch Load/Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (4KB), L1 Cache (16KB), and access to the L2 Cache.)

"KAVERI" APU ENHANCEMENTS

 Coherent Hub and IOMMU
‒ Dedicated coherent transaction path
 AMD Radeon™ Memory Bus
‒ High bandwidth graphics data
 PCI-Express® Gen3 for external graphics attach
 Four display engines
 AMD TrueAudio technology

(SoC block diagram: two "Steamroller" modules, the Unified Northbridge with DDR3 memory controllers and System Management Unit, and the GCN GPU behind a Graphics Northbridge/IOMMUv2 and Graphics Memory Controller on the AMD Radeon™ Memory Bus; I/O includes PCIe Gen3 x16, PCIe Gen2 x8, a PCI bridge, the Universal Video Decoder, Video CODEC Engine, display engines (DP/HDMI, DP/DVI), and TrueAudio.)

INTRODUCTION OF HARDWARE COHERENCY FOR THE GPU

 Coherent Hub (CHUB)
‒ Compute traffic steered to dedicated coherent transaction path
‒ Includes IOMMU Address Translation Cache (ATC)
‒ Selectively probe CPU caches based on page attribute
 Atomics
‒ Single cycle request
‒ One at a time (no gathering at the SYS level)
‒ All atomics return the original data from DRAM (success of conditional)
‒ Types supported: Test and OR, Swap, Add, Subtract, AND, OR, XOR, Signed Min, Signed Max, Unsigned Min, Unsigned Max, Clamping Inc, Clamping Dec, Compare and Swap

(Block diagram: CU clients in the GPU reach the Graphics Northbridge/IOMMUv2; graphics traffic takes the non-coherent path through the Graphics Memory Controller and AMD Radeon™ Memory Bus, while compute traffic takes the coherent path through the CHUB/IOMMU ATC into the Unified Northbridge (SYS), where it can probe the CPU L2 caches ahead of the Memory Controller and DRAM.)

IOMMUv2 – ELIMINATES DOUBLE COPY

 With traditional memory system (CPU copies from unpinned to pinned memory, then GPU DMA into the local frame buffer)
‒ Not all GPU memory is CPU accessible (e.g. local frame buffer memory)
‒ Local frame buffer may not be large enough working space
‒ Lack of demand-paging support
‒ Alignment limitations
‒ Data copied from unpinned to pinned regions
 With IOMMUv2 (GPU reaches unpinned memory directly through the IOMMUv2)
‒ Eliminate CPU & DMA copy operations (in both directions!)
‒ GPU operates on unpinned region directly
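To make the zero-copy point concrete, here is a minimal host-side sketch using OpenCL 2.0 fine-grained shared virtual memory, which "Kaveri" exposes (see the summary slide). It is illustrative only, not code from this presentation: the context, queue, and kernel are assumed to already exist, and error checking is omitted.

    #include <CL/cl.h>

    // Hypothetical helper: the GPU kernel reads and writes `data` in place.
    void run_on_svm(cl_context ctx, cl_command_queue queue, cl_kernel kernel, size_t n) {
        // One allocation, visible to CPU and GPU alike; no pinned staging buffer.
        float* data = static_cast<float*>(
            clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                       n * sizeof(float), 0));

        for (size_t i = 0; i < n; ++i)              // CPU initializes the unpinned region
            data[i] = static_cast<float>(i);

        clSetKernelArgSVMPointer(kernel, 0, data);  // pass the pointer, not a copy
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clFinish(queue);

        float result = data[0];                     // CPU reads results directly, no read-back
        (void)result;
        clSVMFree(ctx, data);
    }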

IOMMUV2 PERIPHERAL PAGE FAULTS

(Sequence diagram, peripheral vs. processor over time: the peripheral issues an ATS request; the IOMMU walks the page tables in system memory and returns an ATS response; the peripheral evaluates the ATS response and, on a miss, issues a PRI request; a PPR is posted to the SW queue; software services the fault and responds through the CMD queue; the IOMMU returns a PRI response and logs the event; the peripheral evaluates the PRI response and retries the ATS request, now receiving a valid ATS response.)

Legend:
ATS – Address Translation Services
PRI – Page Request Interface
PPR – Peripheral Page Request
CMD – Command Queue
SW – Software (OS or Hypervisor)

QUEUING (TODAY'S PICTURE)

 High overhead to pass work to/from GPU

(Dispatch flow across Application, OS, and GPU: the application transfers the buffer to the GPU; the OS copies/maps memory; the application queues the job; the OS schedules the job; the GPU starts and finishes the job; the OS schedules the application; the application gets the buffer; the OS copies/maps memory back.)

QUEUING WITH HSA ON KAVERI

 Shared Virtual Memory

(The same dispatch flow is repeated on this and the following builds; each added HSA feature removes more of the OS and copy steps.)

QUEUING WITH HSA ON KAVERI

 Shared Virtual Memory
 System Coherency


QUEUING WITH HSA ON KAVERI

 Shared Virtual Memory
 System Coherency
 Signaling


QUEUING WITH HSA ON KAVERI

 Shared Virtual Memory
 System Coherency
 Signaling
 User Mode Queuing


QUEUING WITH HSA ON KAVERI

 Shared Virtual Memory
 System Coherency
 Signaling
 User Mode Queuing

(With all four features in place the flow collapses to three steps: the application queues the job, the GPU starts the job, the GPU finishes the job; the OS is no longer in the dispatch path.)
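What makes the three-step flow possible is the architected user-level queue: a ring buffer in shared memory that the application writes directly, plus a doorbell it rings without a system call. The sketch below is a conceptual, single-producer model of that idea only (queue-full handling omitted); it is not the real AQL packet format or the HSA runtime API, and the Packet and UserQueue names are illustrative.

    #include <atomic>
    #include <cstdint>
    #include <vector>

    // Illustrative stand-in for a dispatch packet (a real AQL packet also carries
    // grid/workgroup sizes, a completion signal handle, and so on).
    struct Packet { std::uint64_t kernel_object; std::uint64_t kernarg_address; };

    // Conceptual user-mode queue: the application writes packets into the ring and
    // bumps the doorbell; the GPU's packet processor consumes them. No OS call here.
    struct UserQueue {
        explicit UserQueue(std::size_t size) : ring(size) {}

        void enqueue(const Packet& p) {
            std::uint64_t idx = write_index.fetch_add(1, std::memory_order_relaxed);
            ring[idx % ring.size()] = p;                     // write the packet in place
            doorbell.store(idx, std::memory_order_release);  // "ring" the doorbell
        }

        std::vector<Packet> ring;                  // backing store for the ring buffer
        std::atomic<std::uint64_t> write_index{0};
        std::atomic<std::uint64_t> doorbell{0};    // models the memory-mapped doorbell
    };

    int main() {
        UserQueue q(256);
        q.enqueue({0x1000, 0x2000});   // "Queue Job" is one store plus one doorbell write
    }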

HSA IN A NUTSHELL

HSA Hardware Building Blocks
 Shared Virtual Memory
‒ Single address space
‒ Coherent
‒ Pageable
‒ Fast access from all components
‒ Can share pointers
 Architected User-Level Queues
 Signals
Provide industry-standard, architected requirements for how devices share memory and communicate with each other.

HSA Software Building Blocks
 HSAIL
‒ Portable, parallel compiler IR
 HSA Runtime
‒ Create queues
‒ Allocate memory
‒ Device discovery
 Reference High-level Compiler
‒ CLANG/LLVM
‒ Generate HSAIL
‒ OpenCL, C++ AMP
Provide industry-standard compiler IR and runtime to enable existing programming languages to target the GPU.
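As a flavor of what "create queues" and "device discovery" look like in practice, here is a heavily abridged sketch against the HSA runtime C API roughly as published in the HSA Runtime 1.0 specification. Treat the exact signatures as assumptions to be checked against hsa.h; error handling and the packet-building step are omitted.

    #include <hsa.h>       // HSA runtime header (assumed installed)
    #include <cstdint>

    // Device discovery callback: remember the first GPU agent found.
    static hsa_status_t find_gpu(hsa_agent_t agent, void* data) {
        hsa_device_type_t type;
        hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
        if (type == HSA_DEVICE_TYPE_GPU) {
            *static_cast<hsa_agent_t*>(data) = agent;
            return HSA_STATUS_INFO_BREAK;      // stop iterating
        }
        return HSA_STATUS_SUCCESS;
    }

    int main() {
        hsa_init();                            // bring up the runtime

        hsa_agent_t gpu = {};
        hsa_iterate_agents(find_gpu, &gpu);    // device discovery

        hsa_queue_t* queue = nullptr;          // architected user-level queue
        hsa_queue_create(gpu, 256, HSA_QUEUE_TYPE_MULTI, nullptr, nullptr,
                         UINT32_MAX, UINT32_MAX, &queue);

        // ... write AQL dispatch packets into the queue and ring its doorbell ...

        hsa_queue_destroy(queue);
        hsa_shut_down();
    }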

EVOLUTION OF THE SOFTWARE STACK

Driver Stack
‒ Apps
‒ Domain Libraries
‒ OpenCL™, DX Runtimes, User Mode Drivers
‒ Graphics Kernel Mode Driver
‒ Hardware: APUs, CPUs, GPUs

HSA Software Stack
‒ Apps
‒ HSA Domain Libraries, OpenCL™ 2.x Runtime
‒ HSA Runtime, Task Queuing Libraries, HSA JIT
‒ HSA Kernel Mode Driver
‒ Hardware: APUs, CPUs, GPUs

(Legend: user mode components, kernel mode components, and components contributed by third parties.)

WHAT IS HSAIL?

 Intermediate language for parallel compute in HSA
 Generated by a "High-level Compiler" (GCC, LLVM, Java VM, etc.)
 Expresses parallel regions of code
 Portable across vendors and stable across product generations
 Goal: bring parallel acceleration to mainstream programming languages (OpenMP, C++ AMP, Java)

main() {
  ...
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    ...
  }
}

PROGRAMMING LANGUAGES PROLIFERATING ON HSA

(Stack diagram: applications written in OpenCL™, C++ AMP, Python, and Java sit on their language runtimes (the OpenCL runtime, various runtimes, Fabric Engine RT, and the Java JVM with Sumatra), all of which emit HSAIL; HSAIL is consumed by the HSA runtime (HSA core runtime, HSA finalizer, HSA helper libraries) on top of the Kernel Fusion Driver (KFD).)

HSA ENABLEMENT OF JAVA

JAVA 9 – HSA-ENABLED JAVA (SUMATRA)

 Adds native APU acceleration to the Java Virtual Machine (JVM)
 Developer uses Lambda, Stream API
 JVM generates HSAIL automatically

(Stack diagram: Java application, Java JDK Stream + Lambda API, then the Graal JIT backend inside the JVM emitting CPU ISA for the CPU and HSAIL, which the HSA finalizer & runtime turn into GPU ISA for the GPU.)

USE CASES SHOWING HSA ADVANTAGE ON KAVERI

Use case: Binary tree searches (programming technique: Data Pointers)
‒ Description: GPU performs searches in a CPU-created binary tree
‒ HSA advantage: GPU can access existing data structures containing pointers; higher performance through parallel operations

Use case: Binary tree updates (programming technique: Platform Atomics)
‒ Description: CPU and GPU operating simultaneously on the tree, both doing modifications
‒ HSA advantage: CPU and GPU can synchronize using platform atomics; higher performance through parallel operations

Use case: Hierarchical data searches (programming technique: Large Data Sets)
‒ Description: applications include object recognition, collision detection, global illumination, BVH
‒ HSA advantage: GPU can operate on huge models in place; higher performance through parallel operations

Use case: Middleware user-callbacks (programming technique: CPU Callbacks)
‒ Description: GPU processes work items, some of which require a call to a CPU function to fetch new data
‒ HSA advantage: GPU can invoke CPU functions from within a GPU kernel; simpler programming, does not require "split kernels"; higher performance through parallel operations

Data Pointers

DATA POINTERS - Legacy GPU

(Diagram: a pointer-linked binary tree (L/R child pointers) and a result buffer live in system memory; for a legacy GPU the tree must be flattened into a "flat tree" plus result buffer in GPU memory before the kernel can search it, and results are copied back.)



DATA POINTERS - HSA GPU

(Diagram: with HSA the kernel walks the pointer-linked tree directly in system memory and writes to the result buffer there; there is no GPU-memory copy and no flat tree.)


DATA POINTERS - CODE COMPLEXITY

(Side-by-side source listings compare the HSA and legacy implementations; the legacy version needs explicit tree flattening, buffer creation, and copy code that the HSA version avoids.)
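The slide's listings are not reproduced here, but a hedged sketch of what the HSA-style version boils down to follows: the CPU builds an ordinary pointer-linked tree inside one fine-grained SVM allocation and hands the root pointer straight to the kernel. This is an OpenCL 2.0 rendering under assumptions of my own (the Node layout, kernel source, and helper names are illustrative, and error handling is omitted); it is not the presenters' code.

    #include <CL/cl.h>

    // Same node layout on both sides; child links are ordinary pointers.
    struct Node { long key; Node* left; Node* right; };

    // Device side (OpenCL 2.0 C), embedded as a string: walk the CPU-built tree in place.
    // (Compile with clCreateProgramWithSource to obtain the `k` kernel used below.)
    static const char* kSearchKernel = R"CLC(
    typedef struct Node { long key; __global struct Node* left; __global struct Node* right; } Node;
    __kernel void search(__global Node* root, __global long* keys, __global int* found) {
        size_t i = get_global_id(0);
        __global Node* n = root;
        found[i] = 0;
        while (n) {
            if (n->key == keys[i]) { found[i] = 1; break; }
            n = (keys[i] < n->key) ? n->left : n->right;
        }
    })CLC";

    // Host side: build the tree inside a single fine-grained SVM pool so the CPU's
    // pointers stay valid on the GPU, then hand over the root. No flat tree, no copy.
    void search_tree(cl_context ctx, cl_command_queue q, cl_kernel k,
                     size_t n_nodes, long* keys_svm, int* found_svm, size_t n_keys) {
        Node* pool = static_cast<Node*>(
            clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                       n_nodes * sizeof(Node), 0));
        // ... CPU builds an ordinary binary search tree in `pool` with plain pointers ...
        Node* root = &pool[0];

        clSetKernelArgSVMPointer(k, 0, root);        // the pointer itself is the argument
        clSetKernelArgSVMPointer(k, 1, keys_svm);    // keys and results also live in SVM
        clSetKernelArgSVMPointer(k, 2, found_svm);
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &n_keys, nullptr, 0, nullptr, nullptr);
        clFinish(q);
        clSVMFree(ctx, pool);
    }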

DATA POINTERS - PERFORMANCE

(Chart: binary tree search rate in nodes/ms versus tree size of 1M, 5M, 10M, and 25M nodes, for CPU (1 core), CPU (4 core), Legacy APU, and HSA APU; vertical axis 0 to 60,000. *Measured in AMD labs 2)

Platform Atomics

PLATFORM ATOMICS - HSA and full OpenCL 2.0 GPU

 Both CPU and GPU operating on the same data structure concurrently

(Diagram: the GPU kernel, CPU 0, and CPU 1 all update the input tree buffer in shared memory.)

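A hedged sketch of the technique (again my own illustration, not the presenters' code): the shared counter lives in SVM allocated with CL_MEM_SVM_ATOMICS, the GPU kernel uses OpenCL 2.0 C atomics with all-SVM-devices scope, and the CPU updates the same location at the same time through std::atomic. The layout compatibility of std::atomic<int> with the device-side atomic_int is an assumption, noted in the comments.

    #include <CL/cl.h>
    #include <atomic>
    #include <new>

    // Device side (OpenCL 2.0 C): GPU work-items bump a counter the CPU also updates.
    static const char* kAtomicKernel = R"CLC(
    __kernel void gpu_worker(__global atomic_int* counter) {
        atomic_fetch_add_explicit(counter, 1,
                                  memory_order_relaxed, memory_scope_all_svm_devices);
    })CLC";

    void cpu_and_gpu_count(cl_context ctx, cl_command_queue q, cl_kernel k) {
        // Fine-grained SVM with atomics: each side observes the other's atomic updates.
        void* mem = clSVMAlloc(
            ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS,
            sizeof(std::atomic<int>), 0);
        // Assumes std::atomic<int> is layout-compatible with the kernel's atomic_int.
        auto* counter = new (mem) std::atomic<int>(0);

        clSetKernelArgSVMPointer(k, 0, counter);
        size_t gws = 1024;
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &gws, nullptr, 0, nullptr, nullptr);
        clFlush(q);                                   // let the GPU start

        for (int i = 0; i < 1024; ++i)                // CPU works on the same counter
            counter->fetch_add(1, std::memory_order_relaxed);

        clFinish(q);
        // counter->load() is now 2048: both processors contributed, with no copies or locks.
        clSVMFree(ctx, mem);
    }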

AMD'S UNIFIED SDK

 Access to AMD APU and GPU programmable components
 Component installer: choose just what you need
 Initial release includes APP SDK v2.9 and Media SDK 1.0

APP SDK 2.9
 Web-based sample browser
 Supports programming standards: OpenCL™, C++ AMP
 Code samples for accelerated open source libraries: OpenCV, OpenNI, Bolt, Aparapi
 OpenCL™ source editing plug-in for Visual Studio
 Now supports CMake

MEDIA SDK 1.0
 GPU accelerated video pre/post processing library
 Leverages AMD's media encode/decode acceleration blocks
 Library for low latency video encoding
 Supports both Windows Store and classic desktop

ACCELERATED OPEN SOURCE LIBRARIES

OpenCV
 Most popular computer vision library
 Now with many OpenCL™ accelerated functions

Bolt
 C++ template library
 Provides GPU off-load for common data-parallel algorithms
 Now with cross-OS support and improved performance/functionality

clMath
 AMD released APPML as open source to create clMath
 Accelerated BLAS and FFT libraries
 Accessible from Fortran, C and C++

Aparapi
 OpenCL™ accelerated Java 7
 Java for data parallel algorithms (no need to learn OpenCL)
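For a sense of the Bolt column, here is a minimal example in the STL-like style the library exposes; it assumes the Bolt OpenCL-backend headers are installed with the header path used by Bolt releases of that era, and that an OpenCL device is present.

    #include <bolt/cl/sort.h>   // Bolt C++ template library, OpenCL backend (assumed installed)
    #include <vector>
    #include <cstdlib>

    int main() {
        std::vector<int> v(1 << 20);
        for (int& x : v) x = std::rand();

        // STL-style call; Bolt dispatches the sort to the GPU when one is available.
        bolt::cl::sort(v.begin(), v.end());
        return 0;
    }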

KAVERI OPENS THE GATES TO PERFORMANCE

EQUAL ACCESS TO ENTIRE MEMORY (hUMA)
 GPU and CPU have uniform visibility into the entire memory space

UNLOCKING APU GFLOPS
 118.4 CPU GFLOPS + 737.3 GPU GFLOPS = APU GFLOPS
 Access to full potential of Kaveri APU compute power

ALL-PROCESSORS-EQUAL (hQ)
 GPU and CPU have equal flexibility to create and dispatch work items
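The headline GFLOPS figures are consistent with the configuration in the endnotes (512 GCN shaders at 720 MHz, four CPU cores at 3.7 GHz). The quick check below assumes one fused multiply-add (2 FLOPs) per shader per cycle and 8 single-precision FLOPs per CPU core per cycle; those per-cycle rates are my assumptions, not numbers from the slides.

    // Back-of-the-envelope check of the slide's peak-GFLOPS figures.
    // Assumptions (not from the slides): 2 FLOPs/shader/cycle (FMA), 8 FLOPs/CPU-core/cycle.
    constexpr double gpu_gflops = 512 * 0.720 * 2.0;  // 737.28, matching the 737.3 GPU GFLOPS
    constexpr double cpu_gflops = 4 * 3.7 * 8.0;      // 118.4, matching the CPU GFLOPS figure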

END RESULT: HSA RESULTS IN MORE ENERGY EFFICIENT COMPUTATION

What does this mean for power?

 Many important workloads execute many times more efficiently using GPU compute resources than CPU only
‒ E.g. video indexing, natural human interfaces, pattern recognition
 For the same power, much better performance: finish early (computation, web page render, display update) and go to sleep

(Chart: compute capacity trend in PCs, GFLOPS versus product year 2010 to 2014, contrasting the CPU trend with the GPU trend from the first APU onward and into the future; example improvement from GPU compute illustrated with PCMark7 and PCMark8 v2.)

Source: AMD internal data

AMD POWER EFFICIENCY IMPROVEMENTS

Power Reductions Over Time / Typical-Use Energy Efficiency

 Over the last 6 years (2008-2014), AMD achieved a 10x improvement in platform energy efficiency*
 Enabled by:
‒ Intelligent dynamic power management
‒ Further integration of system components (NorthBridge, GPU, SouthBridge)
‒ Silicon power optimizations
‒ Process scaling improvements

(Chart: typical-use energy efficiency on a log scale, rising from a 1.0x baseline in 2008 through "Tigris", "Danube", "Llano", "Trinity", and "Richland" to 10x with "Kaveri" in 2014.)

Energy use drops while performance increases = increased efficiency

*Typical-use energy efficiency as defined by taking the ratio of compute capability as measured by common performance measures such as SpecIntRate, PassMark and PCMark, divided by typical energy use as defined by ETEC (Typical Energy Consumption for notebook computers) as specified in Energy Star Program Requirements Rev 6.0, 10/2013.

HSA A KEY ENABLER FOR AN ENERGY EFFICIENT ROADMAP

 Power efficient APUs
 Next-gen "Puma+" x86 cores
 Heterogeneous compute with energy efficient accelerators
 Smart power management
 Integration and miniaturization

SUMMARY

With HSA features, Kaveri is an optimized platform for heterogeneous computing

 HSA features make Kaveri the first full OpenCL 2.0-capable platform
‒ Fine-grained SVM (hUMA)
‒ C11 atomics (platform atomics)
‒ Dynamic parallelism (hQ)
‒ Pipes (hUMA, hQ)

ENDNOTES

1 Testing by AMD Performance Labs using 3DMark Sky Diver as of July 3, 2014. AMD A10-7850K with 2x8GB DDR3-2133 memory, 512GB SSD, Windows 8.1, Driver 14.20.1004 Beta 11 scored 5523. AMD A10-6800K with 2x8GB DDR3-2133 memory, 512GB SSD, Windows 8.1, Driver 14.20.1004 Beta 11 scored 3796.
2 System configuration: "Kaveri" A10-, 95W TDP, CPU speed 3.7 GHz/4.0 GHz, GPU speed 720 MHz, memory 2x4GB DDR3-1600, disk HDD, video driver 13.35/HSA Beta 2.2, test dates January 1-3, 2014, Windows 8.1 (64-bit).

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. PCI Express is a registered trademark of PCI-SIG Corporation. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.
