AMD OpteronTM Multicore Processors

Brian Waldecker | February 1, 2009 Senior Member of Technical Staff System Optimization Engineering Group Processor Solutions Engineering, AMD Austin TX

Pat Conway | Presenter Principal Member of Technical Staff Unified North Bridge team Processor Solutions Engineering, AMD Sunnyvale, CA Outline

ƒAMD Roadmaps and Decoder Rings ƒHardware Overview ƒSoftware Overview ƒQuestions

2 AMD Hex-Core Processors | Nersc/OLCF/NICS XT5 Workshop| February 2010 Roadmap (and Decoder Rings)

3 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 Planned Server Platform Roadmap

2006 2007 2008 2009 2010 2011 “Maranello” with AMD SR56x0 and SP5100 Magny-Cours New Architecture Six-Core AMD ™ Processor with AMD ay rm rise oo ww pp (1207) with AMD SR56x0 and SP5100 Shanghai/Istanbul Platf 2/4- Enter “Socket F (1207)” Socket F(1207) with Nvidia and Broadcom Rev F Barcelona Shanghai Istanbul “San Marino” +AMD SR56x0 and SP5100 y cient m ii aa Lisbon New Architecture “Adelaide” Platfor 1/2-w Socket C32 EE+AMD SR5650 and SP5100 Lisbon New Architecture Power Eff “Buenos Aires” Socket AM3 with AMD SR56x0 and SP5100 Suzuka way tform -- aa

1 “Soc ket AM2+” Pl Socket AM2+ with Nvidia and Broadcom chipsets 4 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 Rev F Budapest Suzuka AMD Multi-core Processor trends Timeline: 2007 2008 2009 2010 2011

2FLOP/clk 4FLOP/clk 8FLOP/clk

DDR2 533, 667, 800 DDR3 1333, 1600 60 nm 45 nm 32 nm 32

ee 16 cor

8 GFLOP/s wrt 1

4 GB/s Watts metric

2 nd tre

1

Log2 single core dual core quad core six core 12 core 050.5 * More computation while using less power per core 64-bit Architecture Evolution

2003 2005 2007 2008 2009 2010

AMD AMD “Barcelona” “Shanghai” “Istanbul” “Magny-Cours” Opteron™ Opteron™

Mfg. 90nm SOI 90nm SOI 65nm SOI 45nm SOI 45nm SOI 45nm SOI Process

K8 K8 Greyhound Greyhound+ Greyhound+ Greyhound+ CPU Core

L2/L3 1MB/0 1MB/0 512kB/2MB 512kB/6MB 512kB/6MB 512kB/12MB

Hyper Transport™ 3x 1.6GT/s 3x 1.6GT/s 3x 2GT/s 3x 4.0GT/s 3x 4.8GT/s 4x 6.4GT/s Technology

Memory 2x DDR1 300 2x DDR1 400 2x DDR2 667 2x DDR2 800 2x DDR2 1066 4x DDR3 1333

Max Power Budget Remains Consistent

6 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 A Strong Partnership in HPC: “Granite” Past, Present, and Future “Baker”+ “Marble”

“Baker”

Cray XT5

&XT5& XT5h

Cray XT4 CX1

Scalar VtVector CXMTCray XMT

Multithreaded

7 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 Hardware

8 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 “Istanbul” Silicon

L3 HT0 Link CACHE

CORE 0 H CORE 1 CORE 2 D L2 T L2 CACHE 2 D NORTH BRIDGE H T CORE 3 CORE 4 CORE 5 3 R L2 L2 L3 CACHE CACHE HT1 Link

9 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Shanghai to Istanbul

• 6 cores (~1.5X ) • Same per core L1 & L2 • Same shared L3 • NB & Xbar upgrades (going from 4 to 6 cores) • HT Assist – provides 3 probe scenarios • No probe need dded • Directed probe • Broadcast probe •Memory BW and latency improvement • Amount depends on platform and configuration • Socket Compatibility

10 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Cache Hierarchy

Dedi cated L1 cach e Core 0 Core 1 Core 2 Core 3 Core 4 Core 5

– 2 way associativity. Cache Cache Cache Cache Cache Cache Control Control Control Control Control Control – 8 banks.

– 2 128-bit loads per 64KB 64KB 64KB 64KB 64KB 64KB cycle.

Dedicated L2 cache 512KB 512KB 512KB 512KB 512KB 512KB

– 16 way associativity. 2 to 6 MB Shared L3 cache

– 32 way (Barcel ona), 48 way (Shanghai and Istanbul) associativity.

– fills from L3 leave likely shared lines in L3.

– sharing aware replacement policy.

11 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Core Micro Architecture Reference : Software Optimization Guide for AMD Family 10h Processors, FastPath? Macro-Ops? Micro-Ops? Pub. #40546, Rev. 3.10 Feb 2009

Notes / Considerations

Avoid having more than 2 or 3 X86 Insts. branches per 16B of instructions.

Three Decode Categories (FastPath also called DirectPath) • DirectPath Single - best • DirectPath Double - better • VectorPath (microcode) - good

Macro-Ops Macro-Ops tracked ReOrder Buffer (ROB). • 72 entries (3 wide x 24 deep) • In-orde r dispatch, r etir ement

Micro-Ops Micro-Ops issue from Sched to Execution Units • “Sched” aka “ Reserv. StationStation” • Out-of-order issue • FP scheduler shared across units • INT Schedulers are “per unit” Improved Out-of-Order Load Execution In-Order Address Generation (per AGU) Store-to-Load Forwarding Support

12 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 TLB Review (Barcelona, Shanghai, Istanbul, Magny-Cours) • Support for 1GB pagesize (4k, 2M, 1G) • 48 bit physical addresses = 256TB (increase from 40bits on K8) • Data TLB • L1 Data TLB • 48 entries, fully associative • all 48 entries support any pagesize • L2 TLB • 512 4k entries, and • 128 2M entries • Instruction TLB • L1 Instruction TLB • fully associative • support for 4k or 2M pagesizes • L2 Instruction TLB

13 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Data Prefetch: Review of Options

ƒ Hardware prefetching – DRAM prefetcher • tktracks positive, negati ve, non-unit str ides. • dedicated buffer (in NB) to hold prefetched data. • Aggressively use idle DRAM cycles. – Core prefhfetchers • Does hardware prefetching into L1 Dcache. ƒ Software ppgrefetching instructions – MOV (prefetch via load / store) – prefetcht0, prefetcht1, prefetcht2 (currently all treated the same) – prefetchw = prefetch with intent to modify – prefetchnta = prefetch non-temporal (favor for replacement)

14 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Six-Core AMD RDDR-2 RDDR-2 Opteron™ Processor Up to 8 DIMMs Up to 8 DIMMs MM MM MM MM HT-1 MM MM MM MM II II II II Performance II II II II D D D D “Istanbul” HT-3 “Istanbul” D D D D • Six-Core AMD Opteron™ Processor 6M Shared L3 Cache North Bridge enhancements (PF + prefetch) DIMM DIMM DIMM DIMM DIMM DIMM DIMM 45nm Process Technology DIMM • DDR2-800 Memory HT-1 HT-1 • HyperTransport-3 @ 4.8 GT/sec HT-3 HT-3

HT-1 DIMM DIMM DIMM DIMM Reliability/Availability “Istanbul” HT-3 “Istanbul” DIMM DIMM DIMM DIMM • L3 Cache Index Disable M M M M • HyperTransport Retry (HT -3 Mode) M M M M MM MM MM MM MM MM MM MM DI DI DI DI • x8 ECC (Supports x4 Chipkill in unganged mode) DI DI DI DI HT-1 HT-1 Virtualization HT-3 HT-3 • AMD-V™ with Rapid Virtualization Indexing PCI- PCI- e or I/O Br idge e or Manageability PCI- PCI- • APML Management Link* X X

Scalability USB • 48-bit Physical Addressing (256TB) South SATA SE: 2. 8GHz • HT Assist (Cache Probe Filter) PCI Bridge Std: 2.6GHz LPC PATA HE: 2.1GHz Continued Platform Compatibility EE: 1.8GHz • Nvidia/Broadcom-based F/1207 platforms 4 socket example block diagram

*APML-enabled platform support required. 15 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Six-Core AMD Opteron™ RDDR-2 RDDR-2 Processor Up to 8 DIMMs Up to 8 DIMMs MM MM MM MM HT-1 MM MM MM MM II II II II Performance II II II II D D D D “Istanbul” “Istanbul” D D D D • Six-Core AMD Opteron™ Processor HT-3 6M Shared L3 Cache North Bridge enhancements (PF + prefetch) DIMM DIMM DIMM DIMM DIMM DIMM DIMM 45nm Process Technology DIMM • DDR2-800 Memory HT-1 HT-1 • HyperTransport-3 @ 4.8 GT/sec HT-3 HT-3

HT-1 DIMM DIMM DIMM DIMM Reliability/Availability “Istanbul” HT-3 “Istanbul” DIMM DIMM DIMM DIMM • L3 Cache Index Disable M M M M • HyperTransport Retry (HT -3 Mode) M M M M MM MM MM MM MM MM MM MM DI DI DI DI • x8 ECC (Supports x4 Chipkill in unganged mode) DI DI DI DI HT-1 HT-1 Virtualization HT-3 HT-3 • AMD-V™ with Rapid Virtualization Indexing 4 socket example block diagram

Manageability STREAM Bandwidth • APML Management Link* (GB/s)* Product, Freq, Dram 2S 4S 8S • 48-bit Physical Addressing (256TB) • HT Assist (Cache Probe Filter) “Barcelona,” 2.3/2.0, RDDR2‐667 17.2 20.5 “Shanghai,” 2.7/2.2, RDDR2‐800 21.4 24 Continued Platform Compatibility • Nvidia/Broadcom-based F/1207 platforms “Istanbul,” 2.4/2.2, RDDR2‐800 22 42 81.5

*Based on measurements at AMD performance labs. See backup slide for configuration information. 16 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Significantly Higher Performance in Same Power Envelope

Two-socktket servers using Six-Core AMD Opt eron™ processors significantly outperform two-socket servers using Dual-Core and Quad-Core AMD Opteron™ processors without consuming significantly more power

700% SPECfp®_rate2006 600% SPECint®_rate2006 SPECjbb®2005 500% Server Idle Power ance mm 400% 3x core count >3x performance Perfor

300% Same power envelope 200% Relative

100%

0% Dual‐Core Quad‐Core Six‐Core AMD OOtpteron™ processor AMD OOtpteron™ processor ("Shangh ai") AMD OOtpteron™ processor ("Ist anb ul") Model 2218 (2.6GHz) Model 2382 (2.6GHz) Model 2435 (2.6GHz)

SPEC, SPECfp, SPECint, and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation. The performance results for Six-Core AMD Opteron™ processor Model 2435 and the SPECjbb result for Quad-Core AMD Opteron™ processor Model 2382 are based upon data submitted to Standard Performance Evaluation Corporation as of May 21, 2009. The other performance results stated below reflect results published on http://www.spec.org/ as of May 21, 2009. The server idle power results are based on measurements of server active idle power for 60 seconds at AMD performance labs as of May 21, 2009. The comparison presented below is based on the best performing two-socket servers using AMD Opteron™ processor Models 2218, 2382, and 2435. For the latest results, visit http://www.spec.org/. Please see backup slides for configuration information.

17 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Istanbul to Magny-Cours

• 12 cores per socket (2 Istanbul die MCM) • Same per core L1 & L2 • Same shared L3 • NB & Xbar upgrades (going from 4 to 6 cores) • DDR3 dimms (up to DDR3-1333) • 4 memory chhl/ktannels/socket (2 chhl/di)annels/die) • “local” memory refers to die, not socket • Memory BW improvements • Same Probe filters as Istanbul • DDR3

18 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 MCM 2.0 Logical View

G34 Socket “Magny-Cours” utilizes a Directly Connected MCM

DDR3 Memory Channel

x16 cHT Package has 12 cores, P0 4 HT ports, & 4 memory channels x16 x8 cHT (NC)

Die (Node) has 6 cores, 4 HT ports & 2 memory channels P1

x16 ncHT

19 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Topologies

2P 4P

P2 P6 P0 P2

P3 P7

P0 P4

P1 P3 I/O I/O P1 P5 x16 I/O I/O x16 I/O I/O Diameter 1 Diameter 2 Avg Diam 0.75 Avg Diam 1.25 DRAM BW 85.6 GB/s DRAM BW 170.4 GB/s XFIRE BW 71.7 GB/s (*) XFIRE BW 143.4 GB/s

20 AMD(*) XFIREHex-Core BW is the Processors maximum available | Cray coherent XT memory5 Hex-core bandwidth Workshop| if the HT links wereDecember the only limiting 2009 factor. Each node accesses its own memory and that of every other node in an interleaved fashion. Cray XT5 Blade and Compute Node

Eight OpteronTM cpus Seastar2+ chips form 3-D torus interconnect

Memory r2+ chips y y y ity ity tt aa tt tt cc cc Airflow Airflow Airflow Airflow Airflow Low Veloci Low Veloci Low Veloci High Velo High Velo Four Seast Four

Memory DIMMS

Cray Four Compute Nodes per Blade SeaStar2+ (2 cpus per node) Interconnect (Node)

21 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 HT Assist Feature (Probe Filters)

22 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Istanbul Block Diagram

“Istanbul”

Core Core Core Core Core Core 0 1 2 3 4 5

512kB 512kB 512kB 512kB 512kB 512kB L2 L2 L2 L2 L2 L2

System Request Interface (SRI)

L3 data array L3 tag (6MB)

XBAR

Memory Probe Controller Filter MCT/DCT

DRAM DRAM 4 HyperTransport™3 Technology Ports DRAMDRAM DRAMDRAM

23 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Multi-Socket System Overview

DRAM DRAM (Two Socket System)

core0 core0 MCT core1 MCT core1

SRI core2 SRI core2 core3 core3 ncHT-H XBAR XBAR ncHT-H cHT cHT B cHT cHT B

I/O I/O Inter-socket Probes and Probe Responses travel: SRI -> XBAR -> cHT -> cHT -> XBAR -> SRI

Probes Requests initiate at home memory node, but return directly to node making initial memory request. key: cHT = coherent HyperTransport ncHT = non-coherent HyperTransport XBAR = crossbar switch SRI = system request interface (memory access, cache probes, etc.) MCT = memory controller HB = host bridgg(ge (e.g. HT to PCI , SeaStar, etc.)

24 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 HT Assist and Memory Latency

With “old” broadcast coherence protocol, the latency of a memory access is the longer of 2 paths: ƒ time it takes to return data from DRAM and ƒ thkbllhhe time it takes to probe all caches

With HT Assist, local memory latency is significantly reduced as it is not necessaryyp to probe caches on other nodes.

Several server workloads naturally have ~100% local accesses ƒ SPECint®, SPECfp® ƒ VMMARKTM typically run with 1 VM per core ƒ SPECpower_ssj® with 1 JVM per core ƒ STREAM

Probe Filter amplifies benefit of any NUMA optimizations in OS/application which make memory accesses local

SPEC, SPECint, SPECfp, and SPECpower_ssj are trademarks or registered trademarks of the Standard Performance Evaluation Corporation. 25 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 HyperTransport™ Technology HT Assist (Probe Filter) “Old” broadcast protocol

Key enabling technology on “Istanbul” and Home “Magny-Cours” Node Probes RdBlk RequestRdBlk Request HT Assist is a sparse directory cache Resps Req ƒ Associated with the memory controller on Node the home node PF – clean data ƒ Tracks all lines cached in the system from Home PF Node the home nod e Lookup

DRAM Eliminates most probe broadcasts (see diagram) Resp Req ƒ Lowers latency Node – local accesses get local DRAM latency, no need to wait for probe responses PF – dirty data Home PF – less queuing delay due to lower HT Node Lookupoo up traffic overh ead Directed Probe ƒ Increases system bandwidth by reducing Req Cache Owner probe traffic Node Resp

26 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Cache Coherence Protocol

ƒ Track lines in M, E, O or S state in probe filter

ƒ PF is fully inclusive of all cached data in system - if a line is cached, then a PF entry must exist.

ƒ Presence of probe filter entry says line in M, E, O or S state

ƒ Absence of probe filter entry says line is uncached

ƒ New messages – Directed probe on probe filter hit – Replacement notification E ->I (clean VicBlk)

27 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Where Do We Put the HT Assist Probe Filter?

Q: Where do we store probe filter entries without adding a large on-chip probe filter RAM which is not used in a 1P desktop system?

A: Steal 1MB of 6MB L3 cache per die in “Magny-Cours”

systems way 0 way 1 way 15 L3 cache Dir set

Implementation in fast SRAM (L3) minimizes – Access latency – Port occupancy of read-modify-write operations – Indirection latency for cache-to-cache transfers

28 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Format of a Probe Filter Entry

ƒ 16 probe filter entries per L3 cache line (64B), 4B per entry, 4-way set associative ƒ 1MB of a 6MB L3 cache per die holds 256k probe filter entries and covers 16MB of cache

4 ways

Entry 0 Entry 1 Entry 2 Entry 3 L3 Cache Line (64B) 4 sets Entry 12 Entry 13 Entry 14 Entry 15

PbFiltProbe Filter EtEntry (4B) Tag State t Owner

EM, O, S, S1 states

29 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Software

30 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Compiler Options

ƒ The Portland Group (PGI) family of compilers • Support for Linux and Windows. • Debuggers and Profilers for OpenMP and MPI. ƒ GCC and GNU Tools for AMD (gcc, glibc, binutils/gdb) • AMD actively contributes improvements targeting AMD cpus. • More information available at http://developer.amd.com/cpu/gnu/Pages/default.aspx ƒ C/C++/For tran compilers bdbased on O64Open64 technology • Available for download on AMD Developer Central at http://developer.amd.com/cpu/open64 in source and binary forms. ƒ

31 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Some Notable PGI Flags A Good Base Set of Flags

C/C++ -fastsse -Mipa=fast -Mipa=inline -tp shanghai-64 Fortran -fastsse -Mvect=short -Mipa=fast -Mipa=inline -tp shanghai-64

(note: “-tp shanghai-64” is also appropriate for Istanbul) And Some More to Usually Consider Flag Purpose -Msmartalloc Use smart malloc routines. -Msmartalloc=huge Use smart malloc routines with large pages (depends on amount of OS allocated large pages and the PGI_HUGE_PAGES environment variable). -Mvect=fuse Enables vectorizer to fuse loops. -Mpfi=indirect (first pass) Use profile feedback optimizations. (requires -Mppofo=ind irect ect(secod (second pass) comp iling, do ing t ra ining ru n(s), a nd reco mp iling) . -Msafeptr (C/C++ only) Says arrays don’t overlap and different pointers point to distinct locations. (has many sub-options). -Mfprelaxed Allows relaxed precision for certain Intrinsic functions (sqrt, rsqrt, order, div).

32 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Software Optimization & Compiler Guidelines

http://developer.amd.com/documentation/guides/Pages/default.aspx

33 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 ACML 4.3.0 ƒ L3 BLAS improvements – SGEMM for Six-Core AMD Opteron™ Processor – New Intel DGEMM and SGEMM kernels • Supporting Woodcrest, Penryn, Nehalem • Competitive with MKL – New DGEMM "fast memory allocation" scheme • allows improved performance of other routines (such as LAPACK) which make heavy use of DGEMM ƒ Six-Core AMD Opteron™ processor tuning for Level 1 BLAS – xDOT,,,, xCOPY, xAXPY, and xSCAL ƒ 3DFFT performance improvements – Now outperforming MKL ƒ AMD Family 10h tuning for real-complex FFTs – csfft, dzfft, scfft and zdfft have been re-tuned for FP128

34 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 New Memory Allocation Scheme

35 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 3DFFT Performance Improvement

Linux 44 thhdreads

30

25

20

ACML 4.2 4T

ACML 4.3 4T 15 Time MKL 10.1 4T

10

5

0 400 450 500 576 600 720 750 Problem size

Lower is better

36 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 ACML 4.3.0 Supported Compilers

ƒ PGI – 8.0-6 ƒ GCC/GFORTRAN – 4.3.2 – Now backwards compatible with GCC 4.1.2 and 4.2 ƒ Open64 – 4.2.1 ƒ Intel Fortran 11.0 ƒ Microsoft Visual Studio 2008

37 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 ACML 4.4 (()Just Out)

ƒ Ensure proper scaling with Istanbul/Magny-Cours ƒ Resolve scaling issues with small problems ƒ Family 10h optimized ZGEMM, associated L3 routines ƒ Double Complex 2D/3D-FFT improvements

38 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 AMD Developer Central (for more info and downloads)

Downloads ƒ AACMLCML andand AMD LIBM areare offeredoffered as freefree dodownloadswnloads to registered members of AMD Developer Central. Forums ƒ Thfhe ACML forum is an excell lllfdent place to find answers. Blogs ƒ Watch for blog entries by members of the ACML and LIBM team. http://developer. amd.com

39 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 (bib(A bit about Stream Comput ing )

40 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 ATI Radeon™ HD 5870 Graphics Architecture

2.72 Teraflops Single Precision 544 Gigaflops Double Precision

• Full Hard ware Impl ement ati on of DirectCompute 11 and OpenCL™ 1.0

• IEEE754-2008 Compliance Enhancements

• Additional Compute Features: • 32-bit Atomic Operations • Flexibb3le 32kB Lo oaaaacal Data Shares • 64kB Global Data Share • Global synchronization • Append/consume buffers

• ATI Stream SDK beta • http://developer.amd.com/streambeta

41 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Some GPGPU thoughts

• Massively Parallel Compute capability • FLOPS per watt and per mm2 are truly impressive. • Programmability • Improving but not painless. • Moving Data onto and off of the GPGPU • Must do enough computation to amortize this. • RAS • Not the same level as general purpose CPUs yet.

42 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 ACML-GPU 1.0 Released March 2009

Selected BLAS routines enabled for GPU ƒ DGEMM, SGEMM will run on GPU if present ƒ Small probl ems (N,M ,K < 200) run on CPU ƒ Supports Quad- and Six-Core AMD Opteron™ processors

BLAS LAPACK FFT RNG LIBMV Expert L3 23 Shape Array Simple GtGenerators L2 LAPACK Vector 3.1.1 3D L1 2D 5 Base Fast Scalar 1D GtGenerators

43 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Trademark Attribution

AMD, the AMD Arrow logo and combinations thereof are trademarks of , Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2009 Advanced Micro Devices, Inc. All rights reserved.

44 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 Questions

45 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 Backup

46 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 MOESI Cache Coherency Protocol

Probe Write Hit Invalid Exclusive Read Hit Read Miss Exclusive dd Miss Share Write Hit dd ee Rea Prob

Modified Shared

Read Hit Write Hit Owned Read Hit Probe Read Hit •“Read” and “Write” are by this core. •“Probe Read” and “Probe Write” are reads and writes by others, that must (May consider Owned as probe this core’s caches. special case of Shared) Read Hit Probe Read Hit

47 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009