Andre Heidekrueger

AMD Heterogeneous Computing
x86 in development
AMD new CPU and Accelerator Designs
Building blocks for Heterogeneous computing with the GPU Accelerators and the Latest x86 Platform Innovations
1 | Hot Chips | August, 2010

Server Industry Trends
• China has 265m online gamers
• Seismic datasets typically exceed a terabyte
• The performance of the fastest supercomputer grew 500x in the last decade
• The top 8 systems on the Green 500 list use accelerators
• Accelerator-based servers on the Green 500 list are 3x as energy efficient as those without accelerators
• 800 images are uploaded to Facebook every second

Top500.org Performance Projections: Can the Current Trajectory Achieve Exascale?
[Chart: Top500.org performance projection, with the 1 EFlops level marked]
Might get there on current trajectory, but…
• Too late for major government programs leading to 2018
• System power in traditional x86 architecture would be unmanageable
Source for chart: Top500.org; annotations by AMD

Three Eras of Processor Performance
Single-Core Era
• Enabled by: Moore’s Law, Voltage Scaling, Micro-Architecture
• Constrained by: Power, Complexity
• [Chart: Single-thread Performance over Time; "we are here" marker]
Multi-Core Era
• Enabled by: Moore’s Law, Desire for Throughput, 20 years of SMP arch
• Constrained by: Power, Parallel SW availability, Scalability
• [Chart: Throughput Performance over Time (# of Processors); "we are here" marker]
Heterogeneous Systems Era
• Enabled by: Moore’s Law, Abundant data parallelism, Power efficient GPUs
• Temporarily constrained by: Programming models, Communication overheads
• [Chart: Targeted Application Performance over Time (Data-parallel exploitation); "we are here" marker]
4 | Driving HPC Performance Efficiency

Fusion Architecture: The Benefits of Heterogeneous Computing
x86 CPU owns the Software World
• Windows®, MacOS and Linux® franchises
• Thousands of apps
• Established programming and memory model
• Mature tool chain
• Extensive backward compatibility for applications and OSs
• High barrier to entry
GPU Optimized for Modern Workloads
• Enormous parallel computing capacity
• Outstanding performance-per-watt-per-dollar
• Very efficient hardware threading
• SIMD architecture well matched to modern workloads: video, audio, graphics

A New Era of HPC Performance
[Chart: Throughput Performance plotted against Microprocessor Advancement (Single-Core Era, Multi-Core Era, Heterogeneous Systems Era) and GPU Advancement (Graphics driver-based programs, OpenCL™/DirectX® driver-based programs, System-level programmable); regions labelled Homogeneous Computing, Heterogeneous Computing and FUSION ARCHITECTURE]

Current AMD HPC Product Portfolio
Energy efficient CPU and discrete GPU processors focused on addressing the most demanding HPC workloads
Multi-core x86 Processors
• Outstanding Performance
• Superior Scalability
• Enhanced Power Efficiency
ATI FirePro™ Professional Graphics
• 3D Accelerators For Visualization
• Full support for GPU computation with OpenCL
AMD FireStream™ GPU Accelerators
• Optimized for server integration
• Single-slot and dual-slot form factors
• Industry standard OpenCL SDK

Progression to AMD Server Fusion
[Diagram: Current (Socket G34) versus Future; labels: NG Interlagos, Terramar, CPU/FMAC, Server Fusion, SR56x0, PCI bus, GPU Acc. 1, Accelerator 2, Accelerator 3]

Ongoing Memory System and GPU Architecture Improvements
• Significant and ongoing bandwidth improvements throughout the memory system
• Continue incredible pace of GPU improvements in performance/watt
• Increase sophistication of the GPU so that it becomes a first-class citizen of the overall system architecture
  • Single unified virtual address space
  • Virtual memory support via IOMMU
  • Participation in system-level coherency
  • Support for context switching
[Diagram: Future APU Chip (CPU Cores, UNB / MC, GPU, UVD) connected to System Memory and to an optional Discrete GPU Card/Chip (MC, UVD, GDDR GPU Memory); links annotated "Substantial increases planned"; PCIe link annotated "2X increase planned (PCIe Gen3)"]

AMD SERVER PLATFORM 2003-2011
Milestones shown on a 2003 to 2011 timeline:
• AMD Opteron™ processor, world’s first x86 processor with 64- and 32-bit capability, integrated memory controller, HyperTransport™ technology high-speed serial system bus and glueless 1-to-8P scalability
• AMD Opteron™ processor HE/EE, first low-power x86 processors
• World’s first dual-core x86 processor
• AMD socket F (1207) platform: four generations of upgradeability
• World’s first native quad-core x86 processor with HW-based virtualization features
• World’s first 6-core x86 processor for 1P to 8P
• World’s first 12-core x86 processor and direct connect architecture 2.0
• World’s only sub-6 watt per core server processor
• “Bulldozer”
Anticipating needs and delivering innovation at the right time for the market
10 | Presentation Title | February, 2011 | Public

BUILDING A “BULLDOZER” PROCESSOR
• Each processor die is composed of multiple “Bulldozer” modules
• Module divisions are transparent to hardware, operating system or application
• The modular architecture speeds chip development and increases product flexibility
[Diagram labels: Shared L3 Cache, Memory Controller, NB/HT Links]
Server: “Interlagos”: 16 cores (2 dies); “Valencia”: 8 cores (1 die)
Client: “Zambezi”: 8 cores (1 die)
12 | Presentation Title | March, 2011 | Public

NEW “BULLDOZER” INSTRUCTIONS
SSSE3
  Description: Supplemental Streaming SIMD Extensions 3 (SSSE3) is a SIMD instruction set. It contains 16 discrete instructions; because each can act on 64-bit MMX or 128-bit XMM registers, it represents a total of 32 instructions.
  Applications: Video encoding and transcoding.
SSE 4.1
  Description: A set of 47 instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field and a set of instructions that take XMM0 as an implicit third operand.
  Applications: Covers a range of applications with several new packed data operations, including various specialized data movement instructions; dot product, min/max, compare and rounding operations for numeric processing; and further video encoding support.
SSE 4.2
  Description: An additional 7 instructions that are incremental to SSE 4.1, including 4 very powerful and generic string compare operations.
  Applications: The POPCNT (Population Count) instruction is applicable to bioinformatics algorithms; CRC32 is for accelerating CRC-based integrity checking of network or disk data transfers; string instructions are applicable to any text-intensive applications such as XML and HTML parsing.
AES (AESNI), PCLMULQDQ
  Description: The Advanced Encryption Standard (AES) Instruction Set is an extension to the x86 instruction set architecture. It helps improve the speed of applications performing encryption and decryption using the Advanced Encryption Standard (AES).
  Applications: Any application that uses AES encryption, key uses being secure network transactions (internet and LANs), disk encryption such as Microsoft’s BitLocker, and database encryption.
AVX
  Description: The size of the SIMD vector registers is increased from 128-bit XMM registers to 256-bit registers called YMM0 - YMM15. Existing 128-bit instructions use the lower half of the YMM registers. The AVX instruction set allows all two-operand XMM instructions to be modified into non-destructive three-operand forms where the destination register is different from both source registers.
  Applications: Provides a significant performance boost for vector floating point applications, and a lesser boost for multimedia apps and non-vector FP-intensive apps.
FMA4
  Description: The FMA instruction set is an extension to the 128-bit and 256-bit SIMD instructions in the x86 microprocessor instruction set to perform fused multiply-add operations.
  Applications: A further boost for numeric applications, particularly HPC-type applications, providing up to a 2x increase beyond AVX on AMD processors.
XOP
  Description: XOP makes the binary coding of new instructions more compatible with Intel's AVX instruction extensions, while the functionality of the instructions is unchanged.
  Applications: A variety of numeric and multimedia applications, including DSP signal processing algorithms for audio and radio.

Unprecedented Server Performance Gains
[Chart: Floating Point and Integer performance, vertical axis 0 to 45, time axis 2003 to 2011. Products along the time axis: AMD Opteron™ 244, 250, x75, 285, 2356, 2384, 2435, 6176 SE and “Interlagos”*; one of the product labels carries the note "(dual-core)"]
AMD expects “Bulldozer”-based Interlagos to deliver up to 50% more performance than current 12-core AMD Opteron™ processors in standard server workloads
* “Interlagos” data is based on AMD projections
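
The NEW “BULLDOZER” INSTRUCTIONS slide describes POPCNT, CRC32 and fused multiply-add only in prose. The Python sketch below is a rough reference model of what each operation computes. It is not AMD's implementation and not how the instructions are invoked in practice (that goes through compiler intrinsics or assembly). All function names are hypothetical. The CRC sketch assumes the CRC-32C (Castagnoli) polynomial used by the SSE 4.2 CRC32 instruction, plus the common software convention of inverting the value before and after, which the instruction itself does not do. The fused multiply-add is emulated with exact rational arithmetic followed by a single rounding.

```python
from fractions import Fraction

# Bit-reflected form of the CRC-32C (Castagnoli) polynomial.
CRC32C_POLY_REFLECTED = 0x82F63B78


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer (what POPCNT returns)."""
    if x < 0:
        raise ValueError("popcount is modelled for non-negative integers only")
    return bin(x).count("1")


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C of `data`; pass a previous result as `crc` to continue.

    The initial and final XOR with 0xFFFFFFFF is a software convention
    (an assumption of this sketch), not part of the CRC32 instruction.
    """
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Model of a fused multiply-add: compute a*b + c exactly, round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


if __name__ == "__main__":
    print(popcount(0b1011))
    print(hex(crc32c(b"123456789")))
    a, b, c = 1.0 + 2.0**-27, 1.0 - 2.0**-27, -1.0
    print(unfused_multiply_add(a, b, c), fused_multiply_add(a, b, c))
```

The last line illustrates why fusing matters numerically as well as for throughput: the exact product is 1 - 2^-54, which the separate multiply rounds to 1.0, so the unfused result is 0.0, while the single-rounding result keeps the -2^-54 term.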
