Andre Heidekrueger

CPU / GPU TECHNOLOGIES NOW AND FUTURE

André Heidekrüger

Sr. Technical Consultant Presales EMEA

1 | HPC Advisory Council Lugano

A CASE FOR SERVER FUSION - EXASCALE

. Current trajectory puts traditional x86 computing at just over 20 PFLOPS by 2018
. A data center that could achieve an exaflop in 2018 using only x86 processors would consume over 3TW
. To achieve exascale capability by 2018, x86 performance would need to increase by 2x each year, starting in 2010

Chart: performance on a log scale from 1.00E+15 to 1.00E+19 FLOPS, with levels marked at 20PF, 50PF, 100PF, 500PF and EXAFLOP (1.00E+18). Homogeneous x86 compute at 30% uplift reaches 20PF and hits a wall; heterogeneous compute is required to bridge the gap to the Exascale Target.
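The growth argument on this slide can be checked with a few lines of compound-growth arithmetic. The sketch below is illustrative only: the 2010 baseline is not given on the slide, so it is back-computed here from the "20PF @30% uplift" label, and the function names are hypothetical.

```python
def project(start_pflops: float, annual_factor: float, years: int) -> float:
    """Compound a starting performance by a fixed yearly factor."""
    return start_pflops * annual_factor ** years


def required_annual_factor(start_pflops: float, target_pflops: float, years: int) -> float:
    """Yearly growth factor needed to reach the target after the given number of years."""
    return (target_pflops / start_pflops) ** (1.0 / years)


YEARS = 2018 - 2010      # eight yearly steps, as on the slide
EXAFLOP_PF = 1000.0      # 1 exaflop expressed in petaflops

# ASSUMPTION: baseline chosen so that 30% yearly uplift lands on 20 PF in 2018.
baseline_pf = 20.0 / 1.3 ** YEARS

if __name__ == "__main__":
    print(f"assumed 2010 baseline: {baseline_pf:.2f} PF")
    print(f"2018 at 30% uplift:    {project(baseline_pf, 1.3, YEARS):.1f} PF")
    print(f"2018 at 2x per year:   {project(baseline_pf, 2.0, YEARS):.0f} PF")
    factor = required_annual_factor(baseline_pf, EXAFLOP_PF, YEARS)
    print(f"factor needed for 1 EF: {factor:.2f}x per year")
```

Under that assumed baseline, the required yearly factor comes out at roughly two, which is in line with the slide's "2x each year" statement.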

HIGH EFFICIENCY LINPACK IMPLEMENTATION ON AMD MAGNY COURS + AMD 5870 GPU

Chart: System GFLOPS (axis 0 to 2500) for DGEMM/node, Linpack/node and Linpack/4 nodes

•GPU DPFP Peak: 544 GFLOPS
•GPU DGEMM kernel: 87% of Peak
•Node DPFP Peak: 745.6 GFLOPS
•Linpack efficiency: 75.5% of Peak
•Linpack scaling across 4 nodes: 70% of Peak
•2.5 GFLOPS/W

HPL code: http://code.compeng.uni-frankfurt.de/
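The percentages above translate into absolute GFLOPS figures with simple arithmetic. The following sketch makes two assumptions that the slide does not state explicitly: that the node peak is the sum of one GPU peak plus the CPU peak, and that "70% of Peak" for four nodes refers to four times the single-node peak.

```python
GPU_PEAK = 544.0        # GFLOPS, GPU DPFP peak (from the slide)
NODE_PEAK = 745.6       # GFLOPS, node DPFP peak (from the slide)
DGEMM_EFF = 0.87        # GPU DGEMM kernel efficiency (from the slide)
LINPACK_EFF = 0.755     # single-node Linpack efficiency (from the slide)
SCALING_EFF = 0.70      # four-node Linpack efficiency (from the slide)
NODES = 4

# ASSUMPTION: one GPU per node, so the remainder of the node peak is CPU.
cpu_peak = NODE_PEAK - GPU_PEAK

gpu_dgemm = GPU_PEAK * DGEMM_EFF                 # sustained GPU DGEMM rate
linpack_node = NODE_PEAK * LINPACK_EFF           # sustained Linpack on one node
# ASSUMPTION: the four-node peak is simply NODES * NODE_PEAK.
linpack_cluster = NODES * NODE_PEAK * SCALING_EFF

if __name__ == "__main__":
    print(f"CPU share of node peak: {cpu_peak:.1f} GFLOPS")
    print(f"GPU DGEMM:              {gpu_dgemm:.1f} GFLOPS")
    print(f"Linpack, 1 node:        {linpack_node:.1f} GFLOPS")
    print(f"Linpack, 4 nodes:       {linpack_cluster:.1f} GFLOPS")
```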

AMD PROCESSOR POWER AND PERFORMANCE OVER TIME
INCREASING PERFORMANCE-PER-WATT EFFICIENCIES

Chart labels (power management features):
Advanced Platform Management Link
AMD CoolSpeed Technology
C1E power-state
DDR3 LV Memory
AMD PowerCap Manager
AMD SmartFetch Technology
AMD CoolCore™ Technology on L3
AMD CoolCore™ Technology
Dual Dynamic Power Management
Independent Dynamic Core Engagement
AMD PowerNow!™ Technology
Integrated Memory Controller

SPEC, SPECint, and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. The comparison presented above is based on the best performing two-socket servers using the specified processor model. For the latest SPECint_rate2006 and SPECfp_rate2006 results, visit http://www.spec.org/cput2006/results. “Interlagos” performance based on internal AMD estimates

TECHNOLOGY COMPARISONS

Compared to how the current architecture handles two threads, “Bulldozer” can deliver more performance in a smaller die space

SHARING RESOURCES
HELPING TO MAXIMIZE POWER EFFICIENCY AND COST

The “Bulldozer” module has shared and dedicated components

The shared components:
. Help reduce power consumption
. Help reduce die space (cost)

The dedicated components:
. Help increase performance and scalability

“Bulldozer” dynamically switches between shared and dedicated components to maximize performance per watt

Diagram: “Bulldozer” module with a legend of Dedicated Components, Shared at the module level, and Shared at the chip level. Blocks shown: Fetch, Decode, two Int Schedulers, FP Scheduler, Core 1 and Core 2 (four pipelines each), two 128-bit FMAC units, L1 DCache per core, Shared L2 Cache, Shared L3 Cache and NB.

CORE MICROARCHITECTURE – SHARED FPU

. Co-processor organization
. Reports completion back to parent core
. Dual 128-bit FMAC pipes
. Dual 128-bit packed integer pipes
. PRF-based register renaming
. Unified scheduler (for both threads)

Diagram: module block diagram. Front end: Prediction Queue, ICache, L1 BTB, L2 BTB, Fetch Queue, Ucode ROM, 4 x86 Decoders. Per integer core: Int Scheduler, Instr Retire, AGen and EX (DIV, MUL) pipes, Ld/ST Unit, L1 DTLB, L1 DCache. Shared FPU: FP Scheduler, two 128-bit FMAC pipes, two MMX pipes, FP Ld Buffer. Also shared: Data Prefetcher, Shared L2 Cache.

THREAD CONTROL AND SELECTION MECHANISMS

Each core is a logical processor from the viewpoint of software

Diagram: pipeline stages grouped into frontend, execution and backend thread domains, with a legend of Vertical MT, Single Thread, and SMT/thread agnostic. Stages shown: BP, PredQ, IF, IBB, Decode, uopQ, Dispatch, Core SC Qs, FP SC Q, FP, FP RetQs, Req Q, L2/CU.

POWER EFFICIENCY AND APM

. Start with inherently power-efficient micro-architecture and implementation:
– Dynamic sharing of shared resources
– Minimize data movement
– Extensive clock and power gating
. Add active management support:
– Digitally measure activity to estimate power
– Hardware uses higher frequency when power limit allows
. Support for chip-level core power gating

Chart: Power consumption varies greatly by workload (Percent of TDP, axis 10% to 90%), showing the power headroom that can be utilized per core.*

* Based on internal AMD modeling using benchmark simulations
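The "measure activity, estimate power, raise frequency when the limit allows" idea can be illustrated with a toy selection routine. This is only a sketch under stated assumptions: the linear power model, the frequency steps, the `watts_per_mhz` constant and the function names are all hypothetical and do not describe AMD's actual implementation.

```python
def estimate_power(activity: float, freq_mhz: float, watts_per_mhz: float = 0.05) -> float:
    """Toy estimate: power scales linearly with activity (0..1) and frequency.

    ASSUMPTION: real hardware uses digital activity counters and a far more
    detailed model; this linear form is for illustration only.
    """
    return activity * freq_mhz * watts_per_mhz


def pick_frequency(activity: float, freqs_mhz: list, power_limit_watts: float) -> int:
    """Return the highest frequency whose estimated power fits under the limit."""
    fitting = [f for f in freqs_mhz if estimate_power(activity, f) <= power_limit_watts]
    # Fall back to the lowest step if nothing fits.
    return max(fitting) if fitting else min(freqs_mhz)


# Hypothetical base frequency and two boost steps.
FREQS_MHZ = [2000, 2300, 2500]
POWER_LIMIT_WATTS = 120.0   # hypothetical limit

if __name__ == "__main__":
    for activity in (1.0, 0.9, 0.8):
        chosen = pick_frequency(activity, FREQS_MHZ, POWER_LIMIT_WATTS)
        print(f"activity {activity:.1f} -> {chosen} MHz")
```

A workload with lower activity leaves more headroom, so the routine selects a higher frequency step for it.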

BUILDING A “BULLDOZER” PROCESSOR

Each processor die is composed of multiple “Bulldozer” modules

Module divisions are transparent to shared hardware, operating system or application

The modular architecture speeds chip development and increases product flexibility

Diagram: die layout with “Bulldozer” modules, Shared L3 Cache, Memory Controller, and NB/HT Links

Server: “Interlagos” –16 cores (2 dies) “Valencia” –8 cores (1 die)

Client: “Zambezi” –8 cores (1 die)

NEW “BULLDOZER” INSTRUCTIONS

Instruction: Description

SSSE3: Supplemental Streaming SIMD Extensions 3 (SSSE3) is a SIMD instruction set. It contains 16 discrete instructions; because each can act on 64-bit MMX or 128-bit XMM registers, it represents a total of 32 instructions.

SSE 4.1: A set of 47 instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field and a set of instructions that take XMM0 as an implicit third operand.

SSE 4.2: An additional 7 instructions that are incremental to SSE 4.1, including 4 very powerful and generic string compare operations.

AES and PCLMULQDQ: The Advanced Encryption Standard (AES) Instruction Set is an extension to the x86 instruction set architecture. It helps improve the speed of applications performing encryption and decryption using the Advanced Encryption Standard (AES).

AVX: The size of the SIMD vector registers is increased from 128-bit XMM registers to 256-bit AVX registers called YMM0 - YMM15. Existing 128-bit instructions use the lower half of the YMM registers. The AVX instruction set allows all two-operand XMM instructions to be modified into non-destructive three-operand forms where the destination register is different from both source registers.

FMA4: The FMA instruction set is an extension to the 128-bit and 256-bit SIMD instructions in the x86 microprocessor instruction set to perform fused multiply-add operations.

XOP: XOP makes the binary coding of new instructions more compatible with Intel's AVX instruction extensions, while the functionality of the instructions is unchanged.

Slide margin labels group these rows into “no recompile” and “require a recompile”.
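A fused multiply-add, as described for FMA4 above, computes a*b + c with a single rounding at the end, whereas a separate multiply followed by an add rounds twice. The sketch below emulates the fused behaviour in software with exact rational arithmetic; it does not execute any FMA4 instruction, and the function names are hypothetical.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def multiply_then_add(a: float, b: float, c: float) -> float:
    """Unfused version: the product is rounded before the addition."""
    return a * b + c


# Inputs chosen so the exact product 1 - 2**-60 is not representable as a double.
A = 1.0 + 2.0 ** -30
B = 1.0 - 2.0 ** -30
C = -1.0

if __name__ == "__main__":
    print("fused:  ", fused_multiply_add(A, B, C))
    print("unfused:", multiply_then_add(A, B, C))
```

With these inputs the unfused result loses the small residual entirely, while the fused emulation keeps it, which is one reason fused operations matter for numerical kernels.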

“BULLDOZER”: FLEX FP OPERATING MODES

Mode                       Legacy   AVX (Core 1)   AVX (Core 2)
Core 1 single precision    4        8              0
Core 1 double precision    2        4              0
Core 2 single precision    4        0              8
Core 2 double precision    2        0              4
FLOPs/Cycle (16 cores)     64       64             64
Recompiled app?            No       Yes            Yes
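The bottom row of the table can be reproduced from the per-core rows. The sketch below assumes that the 64 FLOPs/cycle figure refers to single precision and that a 16-core part consists of eight two-core modules; the dictionary layout and function name are illustrative only.

```python
# Per-module FLOPs per cycle, copied from the table rows above.
FLEX_FP_MODES = {
    "Legacy":       {"core1": {"sp": 4, "dp": 2}, "core2": {"sp": 4, "dp": 2}},
    "AVX (Core 1)": {"core1": {"sp": 8, "dp": 4}, "core2": {"sp": 0, "dp": 0}},
    "AVX (Core 2)": {"core1": {"sp": 0, "dp": 0}, "core2": {"sp": 8, "dp": 4}},
}

CORES = 16
CORES_PER_MODULE = 2   # two cores share one FPU per module


def chip_flops_per_cycle(mode: str, precision: str = "sp") -> int:
    """Sum both cores of a module, then multiply by the number of modules."""
    modules = CORES // CORES_PER_MODULE
    per_module = sum(core[precision] for core in FLEX_FP_MODES[mode].values())
    return modules * per_module


if __name__ == "__main__":
    for mode in FLEX_FP_MODES:
        print(mode, chip_flops_per_cycle(mode, "sp"), chip_flops_per_cycle(mode, "dp"))
```

Every mode yields the same chip-level total because the shared FPU is either split between the two cores or given entirely to one of them.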

GENERATIONAL COMPARISONS

Feature | AMD Opteron™ 4100/6100 Series Processors | “Valencia” / “Interlagos”
Cores | 4100: 4 or 6 core; 6100: 8 or 12 core | 4200: 6 or 8 core; 6200: 8, 12 or 16 core
Cache (L2 per core / L3 per die) | 512KB / 6MB | 2MB (shared between 2 cores) / 8MB
Memory Channels and speed | 4100: two; 6100: four; up to 1333MHz | 4200: two; 6200: four; up to 1600MHz
Floating point capability | 128-bit FPU per core (FADD/FMUL) | 128-bit dedicated FMAC per core or 256-bit AVX shared between 2 cores
Integer Issues Per Cycle | 3 | 4
Turbo CORE Technology | No | Yes (+500MHz with all cores active)
Power (ACP) | 65W, 80W, 105W | TBD (planned 65W, 80W, 105W)
New Instruction Sets | - | SSSE3, SSE 4.1/4.2, AVX, AES, FMA4, XOP, PCLMULQDQ
Power Gating | AMD CoolCore™, C1E | AMD CoolCore™, C1E, C6
Process / Die Size | 45nm SOI | 32nm SOI (smaller overall die size)
Performance | - | Expected up to 50% higher throughput

The above reflects current expectations regarding features and performance and is subject to change

AMD AND GPU COMPUTING: PIONEERING INNOVATION

Milestone headings on the slide: First Development Platform; First with Double Precision; First to 1 TFLOPS, Single Slot; First OpenCL CPU; First on Top 500’s Top Ten; Second Generation Platform

Timeline, 2006 to 2010 (items shown):
. Stream Computing Development Platform, CTM SDK
. Industry’s first double precision GPU: AMD FireStream™ 9170
. First to 1 TFLOPS, first single slot: AMD FireStream 9250
. FireStream 9270
. ATI Stream SDK
. Industry Standard API: OpenCL announced
. Tianhe-1: Top500 #5
. FireStream 9350
. FireStream 9370

AMD FIRESTREAM™ GPU ACCELERATORS
SOLUTIONS OPTIMIZED FOR PERFORMANCE, POWER AND DENSITY

AMD FireStream 9370
- 2.64 TFLOPS
- 528 GFLOPS DPFP
- 4GB GDDR5
- <225W
- Passive heat sink

Maximum Performance
- Fastest memory technology
- Large memory

AMD FireStream 9350
- 2.0 TFLOPS
- 400 GFLOPS DPFP
- 2GB GDDR5
- Single slot, <150W
- Passive heat sink

Deployable Performance
- Highest performance per watt and density
- Low power
- Low profile
- Optimal price/performance

ATI RADEON™ HD 5870 (“CYPRESS”)

. 1600 Stream Processors
. 20 SIMD engines
. 2.72 TFLOPs SP
. 544 GFLOPs DP

OPENCL™ DEVICE EXAMPLE

. ATI Radeon™ HD 5870 GPU

1 Compute Unit Contains 16 Stream Cores

1 Stream Core = 5 Processing Elements
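The device hierarchy above multiplies out to the stream processor count quoted for the HD 5870, and the peak figures follow from it. In this sketch the 850 MHz engine clock and the "one MAD counts as two FLOPs" convention are assumptions not stated on the slides, as is the mapping of one OpenCL compute unit to one SIMD engine.

```python
SIMD_ENGINES = 20              # from the HD 5870 slide; ASSUMED to equal OpenCL compute units
STREAM_CORES_PER_CU = 16       # from the slide
PES_PER_STREAM_CORE = 5        # from the slide

ENGINE_CLOCK_MHZ = 850         # ASSUMPTION: engine clock, not given on the slides
FLOPS_PER_MAD = 2              # ASSUMPTION: a multiply-add counts as two FLOPs

processing_elements = SIMD_ENGINES * STREAM_CORES_PER_CU * PES_PER_STREAM_CORE

# Single precision peak: every processing element issues one MAD per clock.
sp_peak_gflops = processing_elements * FLOPS_PER_MAD * ENGINE_CLOCK_MHZ / 1000

# Ratio implied by the slide figures (544 GFLOPs DP vs 2.72 TFLOPs SP).
dp_to_sp_ratio = 544.0 / sp_peak_gflops

if __name__ == "__main__":
    print(f"processing elements: {processing_elements}")
    print(f"SP peak:             {sp_peak_gflops:.0f} GFLOPS")
    print(f"DP/SP ratio:         {dp_to_sp_ratio:.2f}")
```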

AMD RADEON™ HD 6900 SERIES

. Dual graphics engines
. New VLIW4 core architecture
. More SIMD engines and texture units
. Over 5 Gbps
. 1536 Stream Processors
. 2.7 TFLOPs SP
. 683 GFLOPs DP

NEW CORE DESIGN

. VLIW4 thread processors
– 4-way co-issue
– All stream processing units now have equal capabilities (no more “T-unit”)
. Special functions (transcendentals) occupy 3 of 4 issue slots
. Allow better utilization than previous VLIW5 design
– Similar performance with ~10% area reduction
– Simplified scheduling and register management
– Extensive logic re-use

Stream Processing Units
FP ops per clock         Integer ops per clock
4 32-bit MAD             4 24-bit MUL, ADD or MAD
2 64-bit MUL or ADD      2 32-bit ADD
1 64-bit MAD or FMA      1 32-bit MUL
1 Special Function
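The ops-per-clock table implies the relative double precision rate of the VLIW4 design. The sketch below assumes the table describes one VLIW4 thread processor (four stream processing units) and that a MAD or FMA counts as two FLOPs; neither is spelled out on the slide, and the names used are illustrative.

```python
# FP ops per clock for one VLIW4 thread processor, from the table above.
FP_OPS_PER_CLOCK = {
    "32-bit MAD": 4,
    "64-bit MUL or ADD": 2,
    "64-bit MAD or FMA": 1,
}

# ASSUMPTION: a MAD/FMA is two FLOPs, a MUL or ADD is one.
FLOPS_PER_OP = {
    "32-bit MAD": 2,
    "64-bit MUL or ADD": 1,
    "64-bit MAD or FMA": 2,
}

STREAM_PROCESSORS = 1536   # from the HD 6900 series slide
ISSUE_SLOTS = 4            # 4-way co-issue


def flops_per_clock(op: str) -> int:
    """FLOPs one thread processor completes per clock for a given op type."""
    return FP_OPS_PER_CLOCK[op] * FLOPS_PER_OP[op]


thread_processors = STREAM_PROCESSORS // ISSUE_SLOTS
dp_to_sp_ratio = flops_per_clock("64-bit MAD or FMA") / flops_per_clock("32-bit MAD")
slots_left_after_special_function = ISSUE_SLOTS - 3   # transcendentals take 3 of 4 slots

if __name__ == "__main__":
    print(f"VLIW4 thread processors: {thread_processors}")
    print(f"DP FMA rate relative to SP MAD: {dp_to_sp_ratio}")
    print(f"slots free alongside a special function: {slots_left_after_special_function}")
```

The resulting ratio of one quarter agrees with the "1/4 SP rate" figure given for double precision on the GPU compute enhancements slide.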

GPU COMPUTE ENHANCEMENTS

. Asynchronous dispatch
– Execute multiple compute kernels simultaneously
– Each kernel has its own command queue and protected virtual address domain
. Dual bidirectional DMA engines for faster system memory reads & writes
. Coalescing of shader read ops
. Fetch direct to LDS
. Improved flow control
. Faster double precision ops (1/4 SP rate)

UPCOMING POWER CONTAINMENT FEATURE DRIVES GPU PERFORMANCE EFFICIENCY

. Clamps GPU TDP to a pre-determined level

. Integrated power control processor monitors power draw every clock cycle

– Dynamically adjusts clocks for various blocks to enforce TDP

. No longer need to constrain clock speeds to allow for outlier applications

. User controllable via AMD OverDrive Utility
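The difference between clamping power dynamically and constraining clocks for the worst case can be sketched with a toy model. Everything here is hypothetical: the per-step power demand values, the TDP, the assumption that power scales linearly with clock, and the function names. It is not a description of the actual power control processor.

```python
def contain(demand_watts: list, tdp_watts: float) -> list:
    """Per-step clock scale (0..1] that keeps each step at or below the TDP.

    ASSUMPTION: power scales linearly with clock, so scaling the clock by
    tdp/demand brings an over-budget step back to exactly the TDP.
    """
    return [min(1.0, tdp_watts / d) if d > 0 else 1.0 for d in demand_watts]


def static_scale(demand_watts: list, tdp_watts: float) -> float:
    """One fixed clock scale that is safe for the worst-case (outlier) step."""
    return min(1.0, tdp_watts / max(demand_watts))


# Hypothetical power demand at full clock for four consecutive intervals.
DEMAND_WATTS = [150.0, 180.0, 260.0, 170.0]
TDP_WATTS = 200.0

dynamic_scales = contain(DEMAND_WATTS, TDP_WATTS)
contained_power = [d * s for d, s in zip(DEMAND_WATTS, dynamic_scales)]
average_dynamic_clock = sum(dynamic_scales) / len(dynamic_scales)
fixed_clock = static_scale(DEMAND_WATTS, TDP_WATTS)

if __name__ == "__main__":
    print("contained power per step:", [round(p, 1) for p in contained_power])
    print(f"average clock with containment: {average_dynamic_clock:.3f}")
    print(f"fixed clock sized for outlier:  {fixed_clock:.3f}")
```

In this toy run only the outlier interval is slowed down, so the average clock stays higher than with a single fixed clock chosen for the worst case.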

Note: See slide 34 for additional information

POWER CONTAINMENT

Chart (theoretical example only): power over time for three cases (unconstrained power, scaled power, power containment), annotated with high peak power, lower peak power, fast completion and slower completion.

NOW THE AMD FUSION™ ERA OF COMPUTING BEGINS

. APU: Fusion of CPU & GPU compute power within one processor

. High-bandwidth I/O

Fusion APU Based PC

ATI STREAM SDK V2.3: OPENCL™ FOR MULTICORE X86 CPUS AND GPUS

The Power of AMD Fusion™: Developers leverage heterogeneous architecture to enable superior user experience
• Complete OpenCL™ development platform
• Certified OpenCL™ 1.1 compliant by The Khronos Group [1]
• Write code that can scale well on multi-core CPUs and GPUs
• AMD delivers on the promise of support for OpenCL™, with both high-performance CPU and GPU technologies
• Available for download now – includes documentation, samples, profilers and developer support

Software Product Page: http://developer.amd.com/stream

Diagram: Applications spanning Graphics Workloads, Serial and Task Parallel Workloads, and Data Parallel Workloads

[1] Conformance logs submitted for the ATI Radeon™ HD 5800 Series GPUs, ATI Radeon™ HD 5700 Series GPUs, ATI Radeon™ HD 5600 Series GPUs, ATI Radeon™ HD 5400 Series GPUs, ATI Radeon™ HD 5500 Series GPUs, ATI Radeon™ HD 4800 Series GPUs, ATI Radeon™ 4600 Series GPUs, ATI FirePro™ V8800 Series GPUs, ATI FirePro™ V8700 Series GPUs, ATI FirePro™ V7800 Series GPUs, ATI FirePro™ V7700 Series GPUs, ATI FirePro™ V5800 Series GPUs, ATI FirePro™ V5700 Series GPUs, ATI FirePro™ V4800 Series GPUs, ATI FirePro™ V3800 Series GPUs, ATI FirePro™ V3700 Series GPUs, AMD FireStream™ 9200 Series GPU Compute Accelerators, ATI Mobility Radeon™ HD 5800 Series GPUs (Windows®), ATI Mobility Radeon™ HD 5700 Series GPUs (Windows®), ATI Mobility Radeon™ HD 5600 Series GPUs (Windows®), ATI Mobility Radeon™ HD 5400 Series GPUs (Windows®), ATI Mobility Radeon™ HD 4800 Series GPUs (Windows®), ATI Mobility Radeon™ HD 4600 Series GPUs (Windows®), ATI Radeon™ E4690 Discrete GPU (Windows®), and x86 CPUs with SSE3.

GPU COMPUTE OFFLOAD – 3 PHASES

Chart: Architecture Maturity & Programmer Accessibility (Poor to Excellent) over time

Proprietary Drivers Era (2002 - 2008): Graphics & Proprietary Driver-based APIs
. “Hacker” programmers
. Exploit early programmable “shader cores” in the GPU
. Make your program look like “graphics” to the GPU
. CUDA, Brook+, etc

Industry Standard Drivers Era (2009 - 2011): OpenCL™/DirectCompute Driver-based APIs
. Expert programmers
. Good APIs for compute
. “C and C++ like”
. Multiple address spaces & explicit data movement

Architected Era (2012 - 2020): Fusion Architecture
. Mainstream programmers
. GPU is a first class member of the platform architecture
. Full C++ support
. Single unified & coherent address space

http://developer.amd.com/afds/pages/default.aspx

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Trademark Attribution

AMD, AMD Opteron, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.
