Applying AMD's "Kaveri" APU for Heterogeneous Computing
Dan Bouvier, Ben Sander
August 2014

OUR DESIGN CHOICES

REDESIGNED COMPUTE CORES TO UNLOCK GFLOPS
 All new architecture for 45% more GPU performance 1

HSA FEATURES ADDED
 Featuring shared system memory (hUMA, CPU + GPU)
 Heterogeneous Queuing

THE LATEST GAMING TECHNOLOGY
 GCN architecture
 AMD TrueAudio technology*
 Mantle

ENERGY EFFICIENCY
 95W to 15W solutions featuring configurable TDP
 PCI-Express® Gen 3

*AMD TrueAudio technology is offered by select AMD Radeon™ R9 and R7 200 Series GPUs and select AMD A-Series APUs and is designed to improve acoustic realism. Requires enabled game or application. Not all audio equipment supports all audio effects; additional audio equipment may be required for some audio effects. Not all products feature all technologies—check with your component or system manufacturer for specific capabilities.

A-SERIES REDEFINES COMPUTE

MAXIMUM COMPUTE PERFORMANCE
• Up to 12 compute cores*
• 4 "Steamroller" CPU cores
• 8 GCN GPU cores
• HSA enabled

ENHANCED USER EXPERIENCES
• Video acceleration (UVD, VCE)
• AMD TrueAudio technology
• 4 display heads (DisplayPort 1.2, HDMI 1.4a, DVI)

HIGH PERFORMANCE CONNECTIVITY
• 128-bit DDR3 up to 2133
• PCI-Express® Gen3 x16 for discrete graphics upgrade
• PCI-Express® for direct attach NVMe SSD

(Block diagram of "Kaveri": 4 "Steamroller" CPU cores and 8 GCN GPU cores over shared hUMA memory, plus multimedia (UVD, VCE, AMD TrueAudio technology), connectivity (PCIe Gen 3, PCIe Gen 2), and display blocks.)

* For more information visit amd.com/computecores

"KAVERI"

Die size: 245 mm²
Transistor count: 2.41 billion
Process: 28nm

(Die photo labels: two dual-core x86 modules, GPU, Unified Northbridge (UNB), PCIe/Display, DDR3 PHY.)

IMPROVEMENTS FOR AMD'S DUAL CPU COMPUTE CORES

"Steamroller" improvements:
 Reduced I-cache fetch misses (increase in instruction cache size)
 Reduced mispredicted branches
 Increase in scheduling efficiency
 Max-width dispatches per thread
 Major improvements in store handling

(Module diagram: shared fetch and dual decode units feeding two integer schedulers and a shared FP scheduler; integer pipelines per core, 128-bit FMAC and MMX units, per-core L1 D-caches, and a shared L2 cache.)

"KAVERI" GPU – ARCHITECTURE

 47% of "Kaveri" is dedicated for GPU
 8 compute units (512 IEEE 754-2008-compliant shaders)
 Global Data Share
 Masked Quad Sum of Absolute Difference (MQSAD) with 32b accumulation and saturation
 Device flat (generic) addressing support
 Precision improvement for native LOG/EXP ops to 1 ULP

(GPU block diagram: Graphics Command Processor and eight ACEs feed a Shader Engine containing the Geometry Processor, Rasterizer, 8 CUs, and RBs. Each CU contains a Branch & Message Unit, Scheduler, Vector Units (4x SIMD-16), Scalar Unit, Texture Filter Units (4), Texture Fetch Load/Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (4KB), L1 Cache (16KB), and access to the L2 Cache.)

"KAVERI" APU ENHANCEMENTS

 Coherent Hub and IOMMU
‒ Dedicated coherent transaction path
 AMD Radeon™ Memory Bus
‒ High bandwidth graphics data
 PCI-Express® Gen3 for external graphics attach
 Four display engines
 AMD TrueAudio technology

(SoC block diagram: two "Steamroller" modules, the Unified Northbridge with DDR3 memory controllers and System Management Unit, and the GCN GPU behind a Graphics Northbridge/IOMMUv2 and Graphics Memory Controller on the AMD Radeon™ Memory Bus; I/O includes PCIe Gen3 x16, PCIe Gen2 x8, a PCI bridge, the Universal Video Decoder, Video CODEC Engine, display engines (DP/HDMI, DP/DVI), and TrueAudio.)

INTRODUCTION OF HARDWARE COHERENCY FOR THE GPU

 Coherent Hub (CHUB)
‒ Compute traffic steered to dedicated coherent transaction path
‒ Includes IOMMU Address Translation Cache (ATC)
‒ Selectively probe CPU caches based on page attribute
 Atomics
‒ Single cycle request
‒ One at a time (no gathering at the SYS level)
‒ All atomics return the original data from DRAM (success of conditional)
‒ Types supported: Test and OR, Swap, Add, Subtract, AND, OR, XOR, Signed Min, Signed Max, Unsigned Min, Unsigned Max, Clamping Inc, Clamping Dec, Compare and Swap

(Block diagram: CU clients in the GPU reach the Graphics Northbridge/IOMMUv2; graphics traffic takes the non-coherent path through the Graphics Memory Controller and AMD Radeon™ Memory Bus, while compute traffic takes the coherent path through the CHUB/IOMMU ATC into the Unified Northbridge (SYS), where it can probe the CPU L2 caches ahead of the Memory Controller and DRAM.)

IOMMUv2 – ELIMINATES DOUBLE COPY

 With traditional memory system (CPU copies from unpinned to pinned memory, then GPU DMA into the local frame buffer)
‒ Not all GPU memory is CPU accessible (e.g. local frame buffer memory)
‒ Local frame buffer may not be large enough working space
‒ Lack of demand-paging support
‒ Alignment limitations
‒ Data copied from unpinned to pinned regions
 With IOMMUv2 (GPU reaches unpinned memory directly through the IOMMUv2)
‒ Eliminate CPU & DMA copy operations (in both directions!)
‒ GPU operates on unpinned region directly
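To make the zero-copy point concrete, here is a minimal host-side sketch using OpenCL 2.0 fine-grained shared virtual memory, which "Kaveri" exposes (see the summary slide). It is illustrative only, not code from this presentation: the context, queue, and kernel are assumed to already exist, and error checking is omitted.

    #include <CL/cl.h>

    // Hypothetical helper: the GPU kernel reads and writes `data` in place.
    void run_on_svm(cl_context ctx, cl_command_queue queue, cl_kernel kernel, size_t n) {
        // One allocation, visible to CPU and GPU alike; no pinned staging buffer.
        float* data = static_cast<float*>(
            clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                       n * sizeof(float), 0));

        for (size_t i = 0; i < n; ++i)              // CPU initializes the unpinned region
            data[i] = static_cast<float>(i);

        clSetKernelArgSVMPointer(kernel, 0, data);  // pass the pointer, not a copy
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
        clFinish(queue);

        float result = data[0];                     // CPU reads results directly, no read-back
        (void)result;
        clSVMFree(ctx, data);
    }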

IOMMUV2 PERIPHERAL PAGE FAULTS

(Sequence diagram, peripheral vs. processor over time: the peripheral issues an ATS request; the IOMMU walks the page tables in system memory and returns an ATS response; the peripheral evaluates the ATS response and, on a miss, issues a PRI request; a PPR is posted to the SW queue; software services the fault and responds through the CMD queue; the IOMMU returns a PRI response and logs the event; the peripheral evaluates the PRI response and retries the ATS request, now receiving a valid ATS response.)

Legend:
ATS – Address Translation Services
PRI – Page Request Interface
PPR – Peripheral Page Request
CMD – Command Queue
SW – Software (OS or Hypervisor)

QUEUING (TODAY'S PICTURE)

 High overhead to pass work to/from GPU

(Dispatch flow across Application, OS, and GPU: the application transfers the buffer to the GPU; the OS copies/maps memory; the application queues the job; the OS schedules the job; the GPU starts and finishes the job; the OS schedules the application; the application gets the buffer; the OS copies/maps memory back.)

QUEUING WITH HSA ON KAVERI

 Shared Virtual Memory

(The same dispatch flow is repeated on this and the following builds; each added HSA feature removes more of the OS and copy steps.)

QUEUING WITH HSA ON KAVERI

 Shared Virtual Memory
 System Coherency


QUEUING WITH HSA ON KAVERI

 Shared Virtual Memory
 System Coherency
 Signaling


QUEUING WITH HSA ON KAVERI

 Shared Virtual Memory
 System Coherency
 Signaling
 User Mode Queuing


QUEUING WITH HSA ON KAVERI

 Shared Virtual Memory
 System Coherency
 Signaling
 User Mode Queuing

(With all four features in place the flow collapses to three steps: the application queues the job, the GPU starts the job, the GPU finishes the job; the OS is no longer in the dispatch path.)
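What makes the three-step flow possible is the architected user-level queue: a ring buffer in shared memory that the application writes directly, plus a doorbell it rings without a system call. The sketch below is a conceptual, single-producer model of that idea only (queue-full handling omitted); it is not the real AQL packet format or the HSA runtime API, and the Packet and UserQueue names are illustrative.

    #include <atomic>
    #include <cstdint>
    #include <vector>

    // Illustrative stand-in for a dispatch packet (a real AQL packet also carries
    // grid/workgroup sizes, a completion signal handle, and so on).
    struct Packet { std::uint64_t kernel_object; std::uint64_t kernarg_address; };

    // Conceptual user-mode queue: the application writes packets into the ring and
    // bumps the doorbell; the GPU's packet processor consumes them. No OS call here.
    struct UserQueue {
        explicit UserQueue(std::size_t size) : ring(size) {}

        void enqueue(const Packet& p) {
            std::uint64_t idx = write_index.fetch_add(1, std::memory_order_relaxed);
            ring[idx % ring.size()] = p;                     // write the packet in place
            doorbell.store(idx, std::memory_order_release);  // "ring" the doorbell
        }

        std::vector<Packet> ring;                  // backing store for the ring buffer
        std::atomic<std::uint64_t> write_index{0};
        std::atomic<std::uint64_t> doorbell{0};    // models the memory-mapped doorbell
    };

    int main() {
        UserQueue q(256);
        q.enqueue({0x1000, 0x2000});   // "Queue Job" is one store plus one doorbell write
    }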

HSA IN A NUTSHELL

HSA Hardware Building Blocks
 Shared Virtual Memory
‒ Single address space
‒ Coherent
‒ Pageable
‒ Fast access from all components
‒ Can share pointers
 Architected User-Level Queues
 Signals
Provide industry-standard, architected requirements for how devices share memory and communicate with each other.

HSA Software Building Blocks
 HSAIL
‒ Portable, parallel compiler IR
 HSA Runtime
‒ Create queues
‒ Allocate memory
‒ Device discovery
 Reference High-level Compiler
‒ CLANG/LLVM
‒ Generate HSAIL
‒ OpenCL, C++ AMP
Provide industry-standard compiler IR and runtime to enable existing programming languages to target the GPU.
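As a flavor of what "create queues" and "device discovery" look like in practice, here is a heavily abridged sketch against the HSA runtime C API roughly as published in the HSA Runtime 1.0 specification. Treat the exact signatures as assumptions to be checked against hsa.h; error handling and the packet-building step are omitted.

    #include <hsa.h>       // HSA runtime header (assumed installed)
    #include <cstdint>

    // Device discovery callback: remember the first GPU agent found.
    static hsa_status_t find_gpu(hsa_agent_t agent, void* data) {
        hsa_device_type_t type;
        hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);
        if (type == HSA_DEVICE_TYPE_GPU) {
            *static_cast<hsa_agent_t*>(data) = agent;
            return HSA_STATUS_INFO_BREAK;      // stop iterating
        }
        return HSA_STATUS_SUCCESS;
    }

    int main() {
        hsa_init();                            // bring up the runtime

        hsa_agent_t gpu = {};
        hsa_iterate_agents(find_gpu, &gpu);    // device discovery

        hsa_queue_t* queue = nullptr;          // architected user-level queue
        hsa_queue_create(gpu, 256, HSA_QUEUE_TYPE_MULTI, nullptr, nullptr,
                         UINT32_MAX, UINT32_MAX, &queue);

        // ... write AQL dispatch packets into the queue and ring its doorbell ...

        hsa_queue_destroy(queue);
        hsa_shut_down();
    }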

EVOLUTION OF THE SOFTWARE STACK

Driver Stack
‒ Apps
‒ Domain Libraries
‒ OpenCL™, DX Runtimes, User Mode Drivers
‒ Graphics Kernel Mode Driver
‒ Hardware: APUs, CPUs, GPUs

HSA Software Stack
‒ Apps
‒ HSA Domain Libraries, OpenCL™ 2.x Runtime
‒ HSA Runtime, Task Queuing Libraries, HSA JIT
‒ HSA Kernel Mode Driver
‒ Hardware: APUs, CPUs, GPUs

(Legend: user mode components, kernel mode components, and components contributed by third parties.)

WHAT IS HSAIL?

 Intermediate language for parallel compute in HSA
 Generated by a "High-level Compiler" (GCC, LLVM, Java VM, etc.)
 Expresses parallel regions of code
 Portable across vendors and stable across product generations
 Goal: bring parallel acceleration to mainstream programming languages (OpenMP, C++ AMP, Java)

main() {
  ...
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    ...
  }
}

PROGRAMMING LANGUAGES PROLIFERATING ON HSA

(Stack diagram: applications written in OpenCL™, C++ AMP, Python, and Java sit on their language runtimes (the OpenCL runtime, various runtimes, Fabric Engine RT, and the Java JVM with Sumatra), all of which emit HSAIL; HSAIL is consumed by the HSA runtime (HSA core runtime, HSA finalizer, HSA helper libraries) on top of the Kernel Fusion Driver (KFD).)

HSA ENABLEMENT OF JAVA

JAVA 9 – HSA-ENABLED JAVA (SUMATRA)

 Adds native APU acceleration to the Java Virtual Machine (JVM)
 Developer uses Lambda, Stream API
 JVM generates HSAIL automatically

(Stack diagram: Java application, Java JDK Stream + Lambda API, then the Graal JIT backend inside the JVM emitting CPU ISA for the CPU and HSAIL, which the HSA finalizer & runtime turn into GPU ISA for the GPU.)

USE CASES SHOWING HSA ADVANTAGE ON KAVERI

Use case: Binary tree searches (programming technique: Data Pointers)
‒ Description: GPU performs searches in a CPU-created binary tree
‒ HSA advantage: GPU can access existing data structures containing pointers; higher performance through parallel operations

Use case: Binary tree updates (programming technique: Platform Atomics)
‒ Description: CPU and GPU operating simultaneously on the tree, both doing modifications
‒ HSA advantage: CPU and GPU can synchronize using platform atomics; higher performance through parallel operations

Use case: Hierarchical data searches (programming technique: Large Data Sets)
‒ Description: applications include object recognition, collision detection, global illumination, BVH
‒ HSA advantage: GPU can operate on huge models in place; higher performance through parallel operations

Use case: Middleware user-callbacks (programming technique: CPU Callbacks)
‒ Description: GPU processes work items, some of which require a call to a CPU function to fetch new data
‒ HSA advantage: GPU can invoke CPU functions from within a GPU kernel; simpler programming, does not require "split kernels"; higher performance through parallel operations

Data Pointers

DATA POINTERS - Legacy GPU

(Diagram: a pointer-linked binary tree (L/R child pointers) and a result buffer live in system memory; for a legacy GPU the tree must be flattened into a "flat tree" plus result buffer in GPU memory before the kernel can search it, and results are copied back.)



DATA POINTERS - HSA GPU

(Diagram: with HSA the kernel walks the pointer-linked tree directly in system memory and writes to the result buffer there; there is no GPU-memory copy and no flat tree.)


DATA POINTERS - CODE COMPLEXITY

(Side-by-side source listings compare the HSA and legacy implementations; the legacy version needs explicit tree flattening, buffer creation, and copy code that the HSA version avoids.)
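The slide's listings are not reproduced here, but a hedged sketch of what the HSA-style version boils down to follows: the CPU builds an ordinary pointer-linked tree inside one fine-grained SVM allocation and hands the root pointer straight to the kernel. This is an OpenCL 2.0 rendering under assumptions of my own (the Node layout, kernel source, and helper names are illustrative, and error handling is omitted); it is not the presenters' code.

    #include <CL/cl.h>

    // Same node layout on both sides; child links are ordinary pointers.
    struct Node { long key; Node* left; Node* right; };

    // Device side (OpenCL 2.0 C), embedded as a string: walk the CPU-built tree in place.
    // (Compile with clCreateProgramWithSource to obtain the `k` kernel used below.)
    static const char* kSearchKernel = R"CLC(
    typedef struct Node { long key; __global struct Node* left; __global struct Node* right; } Node;
    __kernel void search(__global Node* root, __global long* keys, __global int* found) {
        size_t i = get_global_id(0);
        __global Node* n = root;
        found[i] = 0;
        while (n) {
            if (n->key == keys[i]) { found[i] = 1; break; }
            n = (keys[i] < n->key) ? n->left : n->right;
        }
    })CLC";

    // Host side: build the tree inside a single fine-grained SVM pool so the CPU's
    // pointers stay valid on the GPU, then hand over the root. No flat tree, no copy.
    void search_tree(cl_context ctx, cl_command_queue q, cl_kernel k,
                     size_t n_nodes, long* keys_svm, int* found_svm, size_t n_keys) {
        Node* pool = static_cast<Node*>(
            clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                       n_nodes * sizeof(Node), 0));
        // ... CPU builds an ordinary binary search tree in `pool` with plain pointers ...
        Node* root = &pool[0];

        clSetKernelArgSVMPointer(k, 0, root);        // the pointer itself is the argument
        clSetKernelArgSVMPointer(k, 1, keys_svm);    // keys and results also live in SVM
        clSetKernelArgSVMPointer(k, 2, found_svm);
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &n_keys, nullptr, 0, nullptr, nullptr);
        clFinish(q);
        clSVMFree(ctx, pool);
    }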

DATA POINTERS - PERFORMANCE

(Chart: binary tree search rate in nodes/ms versus tree size of 1M, 5M, 10M, and 25M nodes, for CPU (1 core), CPU (4 core), Legacy APU, and HSA APU; vertical axis 0 to 60,000. *Measured in AMD labs 2)

Platform Atomics

PLATFORM ATOMICS - HSA and full OpenCL 2.0 GPU

 Both CPU and GPU operating on the same data structure concurrently

(Diagram: the GPU kernel, CPU 0, and CPU 1 all update the input tree buffer in shared memory.)

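A hedged sketch of the technique (again my own illustration, not the presenters' code): the shared counter lives in SVM allocated with CL_MEM_SVM_ATOMICS, the GPU kernel uses OpenCL 2.0 C atomics with all-SVM-devices scope, and the CPU updates the same location at the same time through std::atomic. The layout compatibility of std::atomic<int> with the device-side atomic_int is an assumption, noted in the comments.

    #include <CL/cl.h>
    #include <atomic>
    #include <new>

    // Device side (OpenCL 2.0 C): GPU work-items bump a counter the CPU also updates.
    static const char* kAtomicKernel = R"CLC(
    __kernel void gpu_worker(__global atomic_int* counter) {
        atomic_fetch_add_explicit(counter, 1,
                                  memory_order_relaxed, memory_scope_all_svm_devices);
    })CLC";

    void cpu_and_gpu_count(cl_context ctx, cl_command_queue q, cl_kernel k) {
        // Fine-grained SVM with atomics: each side observes the other's atomic updates.
        void* mem = clSVMAlloc(
            ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS,
            sizeof(std::atomic<int>), 0);
        // Assumes std::atomic<int> is layout-compatible with the kernel's atomic_int.
        auto* counter = new (mem) std::atomic<int>(0);

        clSetKernelArgSVMPointer(k, 0, counter);
        size_t gws = 1024;
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &gws, nullptr, 0, nullptr, nullptr);
        clFlush(q);                                   // let the GPU start

        for (int i = 0; i < 1024; ++i)                // CPU works on the same counter
            counter->fetch_add(1, std::memory_order_relaxed);

        clFinish(q);
        // counter->load() is now 2048: both processors contributed, with no copies or locks.
        clSVMFree(ctx, mem);
    }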

AMD'S UNIFIED SDK

 Access to AMD APU and GPU programmable components
 Component installer: choose just what you need
 Initial release includes APP SDK v2.9 and Media SDK 1.0

APP SDK 2.9
 Web-based sample browser
 Supports programming standards: OpenCL™, C++ AMP
 Code samples for accelerated open source libraries: OpenCV, OpenNI, Bolt, Aparapi
 OpenCL™ source editing plug-in for Visual Studio
 Now supports CMake

MEDIA SDK 1.0
 GPU accelerated video pre/post processing library
 Leverages AMD's media encode/decode acceleration blocks
 Library for low latency video encoding
 Supports both Windows Store and classic desktop

ACCELERATED OPEN SOURCE LIBRARIES

OpenCV
 Most popular computer vision library
 Now with many OpenCL™ accelerated functions

Bolt
 C++ template library
 Provides GPU off-load for common data-parallel algorithms
 Now with cross-OS support and improved performance/functionality

clMath
 AMD released APPML as open source to create clMath
 Accelerated BLAS and FFT libraries
 Accessible from Fortran, C and C++

Aparapi
 OpenCL™ accelerated Java 7
 Java for data parallel algorithms (no need to learn OpenCL)
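For a sense of the Bolt column, here is a minimal example in the STL-like style the library exposes; it assumes the Bolt OpenCL-backend headers are installed with the header path used by Bolt releases of that era, and that an OpenCL device is present.

    #include <bolt/cl/sort.h>   // Bolt C++ template library, OpenCL backend (assumed installed)
    #include <vector>
    #include <cstdlib>

    int main() {
        std::vector<int> v(1 << 20);
        for (int& x : v) x = std::rand();

        // STL-style call; Bolt dispatches the sort to the GPU when one is available.
        bolt::cl::sort(v.begin(), v.end());
        return 0;
    }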

KAVERI OPENS THE GATES TO PERFORMANCE

EQUAL ACCESS TO ENTIRE MEMORY (hUMA)
 GPU and CPU have uniform visibility into the entire memory space

UNLOCKING APU GFLOPS
 118.4 CPU GFLOPS + 737.3 GPU GFLOPS = APU GFLOPS
 Access to full potential of Kaveri APU compute power

ALL-PROCESSORS-EQUAL (hQ)
 GPU and CPU have equal flexibility to create and dispatch work items
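The headline GFLOPS figures are consistent with the configuration in the endnotes (512 GCN shaders at 720 MHz, four CPU cores at 3.7 GHz). The quick check below assumes one fused multiply-add (2 FLOPs) per shader per cycle and 8 single-precision FLOPs per CPU core per cycle; those per-cycle rates are my assumptions, not numbers from the slides.

    // Back-of-the-envelope check of the slide's peak-GFLOPS figures.
    // Assumptions (not from the slides): 2 FLOPs/shader/cycle (FMA), 8 FLOPs/CPU-core/cycle.
    constexpr double gpu_gflops = 512 * 0.720 * 2.0;  // 737.28, matching the 737.3 GPU GFLOPS
    constexpr double cpu_gflops = 4 * 3.7 * 8.0;      // 118.4, matching the CPU GFLOPS figure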

END RESULT: HSA RESULTS IN MORE ENERGY EFFICIENT COMPUTATION

What does this mean for power?

 Many important workloads execute many times more efficiently using GPU compute resources than CPU only
‒ E.g. video indexing, natural human interfaces, pattern recognition
 For the same power, much better performance: finish early (computation, web page render, display update) and go to sleep

(Chart: compute capacity trend in PCs, GFLOPS versus product year 2010 to 2014, contrasting the CPU trend with the GPU trend from the first APU onward and into the future; example improvement from GPU compute illustrated with PCMark7 and PCMark8 v2.)

Source: AMD internal data

AMD POWER EFFICIENCY IMPROVEMENTS

Power Reductions Over Time / Typical-Use Energy Efficiency

 Over the last 6 years (2008-2014), AMD achieved a 10x improvement in platform energy efficiency*
 Enabled by:
‒ Intelligent dynamic power management
‒ Further integration of system components (NorthBridge, GPU, SouthBridge)
‒ Silicon power optimizations
‒ Process scaling improvements

(Chart: typical-use energy efficiency on a log scale, rising from a 1.0x baseline in 2008 through "Tigris", "Danube", "Llano", "Trinity", and "Richland" to 10x with "Kaveri" in 2014.)

Energy use drops while performance increases = increased efficiency

*Typical-use energy efficiency as defined by taking the ratio of compute capability as measured by common performance measures such as SpecIntRate, PassMark and PCMark, divided by typical energy use as defined by ETEC (Typical Energy Consumption for notebook computers) as specified in Energy Star Program Requirements Rev 6.0, 10/2013.

HSA A KEY ENABLER FOR AN ENERGY EFFICIENT ROADMAP

 Power efficient APUs
 Next-gen "Puma+" x86 cores
 Heterogeneous compute with energy efficient accelerators
 Smart power management
 Integration and miniaturization

SUMMARY

With HSA features, Kaveri is an optimized platform for heterogeneous computing

 HSA features make Kaveri the first full OpenCL 2.0-capable platform
‒ Fine-grained SVM (hUMA)
‒ C11 atomics (platform atomics)
‒ Dynamic parallelism (hQ)
‒ Pipes (hUMA, hQ)

ENDNOTES

1 Testing by AMD Performance Labs using 3DMark Sky Diver as of July 3, 2014. AMD A10-7850K with 2x8GB DDR3-2133 memory, 512GB SSD, Windows 8.1, Driver 14.20.1004 Beta 11 scored 5523. AMD A10-6800K with 2x8GB DDR3-2133 memory, 512GB SSD, Windows 8.1, Driver 14.20.1004 Beta 11 scored 3796.
2 System configuration: "Kaveri" A10-, 95W TDP, CPU speed 3.7 GHz/4.0 GHz, GPU speed 720 MHz, memory 2x4GB DDR3-1600, disk HDD, video driver 13.35/HSA Beta 2.2, test dates January 1-3, 2014, Windows 8.1 (64-bit).

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. PCI Express is a registered trademark of PCI-SIG Corporation. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.
