Applying AMD's "Kaveri" APU for Heterogeneous Computing
Dan Bouvier, Ben Sander | August 2014

OUR DESIGN CHOICES
REDESIGNED COMPUTE CORES
- All-new architecture for CPU and GPU

HSA FEATURES ADDED TO UNLOCK GFLOPS
- hUMA, featuring shared system memory
- Heterogeneous Queuing

THE LATEST GAMING TECHNOLOGY
- GCN Architecture, 45% more GPU performance¹
- AMD TrueAudio technology*
- Mantle
- PCI-Express® Gen 3

ENERGY EFFICIENCY
- 95W to 15W solutions featuring configurable TDP
*AMD TrueAudio technology is offered by select AMD Radeon™ R9 and R7 200 Series GPUs and select AMD A-Series APUs and is designed to improve acoustic realism. Requires enabled game or application. Not all audio equipment supports all audio effects; additional audio equipment may be required for some audio effects. Not all products feature all technologies—check with your component or system manufacturer for specific capabilities.
2 | Applying AMD's "Kaveri" APU for Heterogeneous Computing | HOT CHIPS 26 - AUGUST 2014

A-SERIES REDEFINES COMPUTE
Kaveri

MAXIMUM COMPUTE PERFORMANCE
- Up to 12 compute cores*: 4 "Steamroller" CPU cores + 8 GCN GPU cores
- HSA enabled

MULTIMEDIA
- AMD TrueAudio technology
- UVD (video decode), VCE (video encode)

ENHANCED USER EXPERIENCES
- Video acceleration
- AMD TrueAudio technology
- 4 display heads: DisplayPort 1.2, HDMI 1.4a, DVI

HIGH PERFORMANCE CONNECTIVITY
- 128-bit DDR3 up to 2133, shared memory controller with hUMA
- PCI-Express® Gen 3 x16 for discrete graphics upgrade
- PCI-Express® Gen 2
- PCI-Express® for direct-attach NVMe SSD
* For more information visit amd.com/computecores

"KAVERI"
[Die shot: PCIe/Display block, two dual-core x86 modules, GPU, UNB (Unified Northbridge), DDR3 PHY]
- Die size: 245 mm²
- Transistor count: 2.41 billion
- Process: 28nm
IMPROVEMENTS FOR AMD’S DUAL CPU COMPUTE CORES
"Steamroller" improvements:
- Reduced I-cache misses (increased instruction cache size)
- Reduced mispredicted branches
- Increased scheduling efficiency
- Max-width dispatches per thread
- Major improvements in store handling

[Block diagram: shared fetch and dual decode feeding two integer cores, each with its own integer scheduler, execution pipelines, and L1 D-cache; a shared FP scheduler with 128-bit FMAC pipes and an MMX unit; shared L2 cache]
"KAVERI" GPU – GRAPHICS CORE NEXT ARCHITECTURE
- 47% of "Kaveri" is dedicated to the GPU
- Graphics Command Processor plus 8 Asynchronous Compute Engines (ACEs)
- Global Data Share
- Shader Engine with 8 compute units (512 IEEE 754-2008 compliant shaders)
- Geometry Processor and Rasterizer, render backends (RBs)
- Masked Quad Sum of Absolute Difference (MQSAD) with 32-bit accumulation and saturation
- Device flat (generic) addressing support
- Precision improvement for native LOG/EXP ops to 1 ULP

Per compute unit (CU):
- Branch & Message Unit, Scheduler
- Vector Units (4x SIMD-16), Scalar Unit
- Texture Filter Units (4), Texture Fetch Load/Store Units (16)
- Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (4KB), L1 Cache (16KB)
- Shared L2 Cache
"KAVERI" APU ENHANCEMENTS
- Two "Steamroller" modules connected through the Unified Northbridge
- Coherent Hub and IOMMU: dedicated coherent transaction path for compute traffic
- AMD Radeon™ Memory Bus: high-bandwidth graphics data path from the GCN GPU to the dual DDR3 memory controllers
- Graphics Northbridge with IOMMUv2; System Management Unit; PCI Bridge
- PCI-Express® Gen 3 x16 for external graphics attach; PCIe Gen 2 x8
- Four display engines (DP/HDMI, DP/DVI); Universal Video Decoder; Video CODEC Engine; Audio CODEC; AMD TrueAudio technology
INTRODUCTION OF HARDWARE COHERENCY FOR THE GPU
Coherent Hub (CHUB)
‒ Compute traffic steered to a dedicated coherent transaction path
‒ Includes IOMMU Address Translation Cache (ATC)
‒ Selectively probes CPU caches based on page attributes

Atomics
‒ Single-cycle request
‒ One at a time (no gathering at the SYS level)
‒ All atomics return the original data from DRAM (success of conditional)
‒ Types supported: Swap, Compare and Swap, Add, Subtract, AND, OR, XOR, Signed Min, Signed Max, Unsigned Min, Unsigned Max, Clamping Inc, Clamping Dec, Test and OR

[Diagram: CU clients reach DRAM either through the non-coherent path (Graphics Northbridge/IOMMUv2 and AMD Radeon™ Memory Bus to the memory controller) or through the coherent path (CHUB/IOMMU ATC into the Unified Northbridge and memory controller)]
IOMMUv2 – ELIMINATES DOUBLE COPY

With a traditional memory system:
‒ Not all GPU memory is CPU accessible (e.g. local frame buffer memory)
‒ The local frame buffer may not be a large enough working space
‒ Lack of demand-paging support
‒ Alignment limitations
‒ Data is copied from unpinned to pinned regions (CPU copy), then DMA'd into GPU memory

With IOMMUv2:
‒ Eliminates the CPU and DMA copy operations, in both directions
‒ The GPU operates on the unpinned region directly
IOMMUV2 PERIPHERAL PAGE FAULTS
Sequence (peripheral ↔ IOMMU ↔ system memory and software, over time):
1. The peripheral processor issues an ATS request; the IOMMU walks the page tables in system memory and returns an ATS response.
2. On a translation fault, the peripheral issues a PRI request; the IOMMU posts a PPR to the software queue (and may record an entry in the event log).
3. Software (OS or hypervisor) evaluates the PPR, resolves the fault, and replies through the command queue; the IOMMU returns the PRI response.
4. The peripheral evaluates the PRI response, retries the ATS request, and receives a successful ATS response.

Glossary:
‒ ATS – Address Translation Services
‒ PRI – Page Request Interface
‒ PPR – Peripheral Page Request
‒ CMD – Command Queue
‒ SW – Software (OS or Hypervisor)
QUEUING (TODAY'S PICTURE)

High overhead to pass work to/from the GPU.
Application / OS / GPU flow:
1. Application: Transfer buffer to GPU
2. OS: Copy/map memory
3. Application: Queue job
4. OS: Schedule job
5. GPU: Start job … finish job
6. OS: Schedule application
7. Application: Get buffer
8. OS: Copy/map memory
QUEUING WITH HSA ON KAVERI

Build sequence, adding one HSA feature per slide: Shared Virtual Memory, then System Coherency, then Signaling, then User Mode Queuing. Each intermediate build shows the same flow as today's picture; once all four features are in place, every OS step drops out of the dispatch path:

1. Application: Queue job
2. GPU: Start job
3. GPU: Finish job
HSA IN A NUTSHELL
HSA Hardware Building Blocks:
‒ Shared Virtual Memory: single address space; coherent; pageable; fast access from all components; can share pointers
‒ Architected User-Level Queues
‒ Signals

HSA Software Building Blocks:
‒ HSAIL: portable, parallel compiler IR
‒ HSA Runtime: create queues, allocate memory, device discovery
‒ Reference high-level compiler: CLANG/LLVM, generates HSAIL; OpenCL™, C++ AMP

The hardware blocks provide industry-standard, architected requirements for how devices share memory and communicate with each other; the software blocks provide an industry-standard compiler IR and runtime so that existing programming languages can target the GPU.
EVOLUTION OF THE SOFTWARE STACK
Driver Stack (today):
‒ Apps
‒ Domain Libraries
‒ OpenCL™, DX Runtimes, User Mode Drivers
‒ Graphics Kernel Mode Driver
‒ Hardware: APUs, CPUs, GPUs

HSA Software Stack:
‒ Apps
‒ HSA Domain Libraries, OpenCL™ 2.x Runtime
‒ HSA Runtime, Task Queuing Libraries, HSA JIT
‒ HSA Kernel Mode Driver
‒ Hardware: APUs, CPUs, GPUs

(User-mode components sit above the kernel-mode driver in both stacks; some components are contributed by third parties.)
WHAT IS HSAIL?
‒ Intermediate language for parallel compute in HSA
‒ Generated by a high-level compiler (GCC, LLVM, Java VM, etc.)
‒ Expresses parallel regions of code
‒ Portable across vendors and stable across product generations
‒ Goal: bring parallel acceleration to mainstream programming languages (OpenMP, C++ AMP, Java)
Example: a parallel loop in the source language is what the high-level compiler lowers to HSAIL (the loop body and bound were elided on the slide):

main() {
  …
  #pragma omp parallel for
  for (int i = 0; i < …; i++) {
    …
  }
}

PROGRAMMING LANGUAGES PROLIFERATING ON HSA
‒ OpenCL™: App on the OpenCL Runtime
‒ C++ AMP: App on various runtimes
‒ Python: App on the Fabric Engine RT
‒ Java: App on the JVM (Sumatra)
All of these generate HSAIL and sit on the common HSA stack: HSA Runtime, HSA Finalizer, HSA Core and Helper Libraries, and the HSA Kernel Fusion Driver (KFD).

HSA ENABLEMENT OF JAVA

JAVA 9 – HSA-ENABLED JAVA (SUMATRA)
‒ Adds native APU acceleration to the Java Virtual Machine (JVM)
‒ The developer uses the Lambda and Stream APIs (Java JDK)
‒ The JVM's GRAAL JIT backend generates HSAIL automatically
‒ The HSA Finalizer & Runtime turn the HSAIL into GPU ISA, alongside the JVM's normal CPU ISA path

USE CASES SHOWING HSA ADVANTAGE ON KAVERI

Data Pointers: binary tree searches
‒ The GPU performs searches in a CPU-created binary tree
‒ HSA advantage: the GPU can access existing data structures containing pointers; higher performance through parallel operations

Platform Atomics: binary tree updates
‒ CPU and GPU operate simultaneously on the tree, both doing modifications
‒ HSA advantage: CPU and GPU can synchronize using platform atomics; higher performance through parallel operations

Large Data Sets: hierarchical data searches
‒ Applications include object recognition, collision detection, global illumination, BVH
‒ HSA advantage: the GPU can operate on huge models in place; higher performance through parallel operations

CPU Callbacks: middleware user callbacks
‒ The GPU processes work items, some of which require a call to a CPU function to fetch new data
‒ HSA advantage: the GPU can invoke CPU functions from within a GPU kernel; simpler programming, no "split kernels" required; higher performance through parallel operations

Data Pointers

DATA POINTERS (Legacy GPU)
Legacy GPU: the kernel cannot follow CPU pointers, so the pointer-linked tree in system memory must first be flattened into a pointer-free layout and copied into GPU memory; the kernel searches the flat tree and fills a GPU-side result buffer, which is then copied back to system memory.

DATA POINTERS (HSA GPU)

HSA GPU: the kernel searches the original pointer-linked tree in system memory directly and writes the result buffer in place; no flattening and no copies are needed.

DATA POINTERS - CODE COMPLEXITY

Side-by-side source comparison: the HSA version is far shorter than the legacy version, which also carries the tree-flattening and buffer-management code.

DATA POINTERS - PERFORMANCE

Binary tree search: search rate (nodes/ms) versus tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 cores), legacy APU, and HSA APU.
*Measured in AMD labs²

Platform Atomics

PLATFORM ATOMICS
‒ HSA and full OpenCL 2.0
‒ CPU threads (CPU 0, CPU 1) and the GPU kernel operate concurrently on the same data structure (an input tree buffer), synchronizing with platform atomics

AMD'S UNIFIED SDK
‒ Access to AMD APU and GPU programmable components
‒ Component installer: choose just what you need
‒ Initial release includes APP SDK v2.9 and Media SDK 1.0

APP SDK 2.9:
‒ Web-based sample browser
‒ Supports programming standards: OpenCL™, C++ AMP
‒ Code samples for accelerated open source libraries: OpenCV, OpenNI, Bolt, Aparapi
‒ OpenCL™ source editing plug-in for Visual Studio
‒ Now supports CMake

MEDIA SDK 1.0:
‒ GPU-accelerated video pre/post processing library
‒ Leverages AMD's media encode/decode acceleration blocks
‒ Library for low-latency video encoding
‒ Supports both Windows Store and classic desktop

ACCELERATED OPEN SOURCE LIBRARIES
‒ OpenCV: the most popular computer vision library, now with many OpenCL™ accelerated functions
‒ Bolt: C++ template library providing GPU off-load for common data-parallel algorithms, now with cross-OS support
‒ clMath: AMD released APPML as open source to create clMath, accelerated BLAS and FFT libraries with improved performance and functionality, accessible from Fortran, C and C++
‒ Aparapi: OpenCL-accelerated Java 7 APIs for data-parallel algorithms (no need to learn OpenCL)

KAVERI OPENS THE GATES TO PERFORMANCE
‒ Equal access to entire memory (hUMA): GPU and CPU have uniform visibility into the entire memory space
‒ Unlocking APU GFLOPS: access to the full potential of Kaveri APU compute power, 737.3 GPU GFLOPS plus 118.4 CPU GFLOPS
‒ All-processors-equal (hQ): GPU and CPU have equal flexibility to create and dispatch work items

END RESULT: HSA RESULTS IN MORE ENERGY EFFICIENT COMPUTATION
‒ [Chart: compute capacity (GFLOPS) by product year, 2010-2014, since the first APU; the GPU trend rises much faster than the CPU trend. Source: AMD internal data]
‒ Many important workloads execute many times more efficiently using GPU compute resources than CPU only, e.g. video indexing, natural human interfaces, pattern recognition
‒ For the same power, much better performance: finish early (computation, web page render, display update) and go to sleep

AMD POWER EFFICIENCY IMPROVEMENTS
‒ Typical-use energy efficiency: over the last 6 years (2008-2014), AMD achieved a 10x improvement in platform energy efficiency* ("Puma" at 1.0x through "Tigris", "Danube", "Llano", "Trinity", and "Richland" to "Kaveri" at 10x, log scale)
‒ Enabled by: intelligent dynamic power management; further integration of system components (NorthBridge, GPU, SouthBridge); silicon power optimizations; process scaling improvements
‒ Energy use drops while performance increases = increased efficiency

*Typical-use energy efficiency is defined as the ratio of compute capability, as measured by common performance measures such as SpecIntRate, PassMark and PCMark, divided by typical energy use as defined by ETEC (Typical Energy Consumption for notebook computers) as specified in Energy Star Program Requirements Rev 6.0, 10/2013.

HSA: A KEY ENABLER FOR AN ENERGY EFFICIENT ROADMAP
‒ Power-efficient APUs (next-gen "Puma+" x86 cores)
‒ Heterogeneous compute with energy-efficient accelerators
‒ Smart power management
‒ Integration and miniaturization

SUMMARY
‒ With HSA features, Kaveri is an optimized platform for heterogeneous computing
‒ HSA features make Kaveri the FIRST full OpenCL 2.0 capable chip:
  ‒ Fine-grained SVM (hUMA)
  ‒ C11 Atomics (Platform Atomics)
  ‒ Dynamic Parallelism (hQ)
  ‒ Pipes (hUMA, hQ)

ENDNOTES
¹ Testing by AMD Performance Labs using 3DMark Sky Diver as of July 3, 2014. AMD A10-7850K with 2x8GB DDR3-2133 memory, 512GB SSD, Windows 8.1, Driver 14.20.1004 Beta 11 scored 5523. AMD A10-6800K with 2x8GB DDR3-2133 memory, 512GB SSD, Windows 8.1, Driver 14.20.1004 Beta 11 scored 3796.
² System configuration: "Kaveri" A10, 95W TDP, CPU speed 3.7GHz/4.0GHz, GPU speed 720MHz, memory 2x4GB DDR3-1600, HDD, video driver 13.35/HSA Beta 2.2, test dates January 1-3, 2014, Microsoft Windows 8.1 (64-bit).

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE.
IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. PCI Express is a registered trademark of PCI-SIG Corporation. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.