AMD Opterontm Multicore Pprocessors
Total Page:16
File Type:pdf, Size:1020Kb
AMD OpteronTM Multicore Processors Brian Waldecker | February 1, 2009 Senior Member of Technical Staff System Optimization Engineering Group Processor Solutions Engineering, AMD Austin TX Pat Conway | Presenter Principal Member of Technical Staff Unified North Bridge team Processor Solutions Engineering, AMD Sunnyvale, CA Outline AMD Roadmaps and Decoder Rings Hardware Overview Software Overview Questions 2 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 Roadmap (and Decoder Rings) 3 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 Planned Server Platform Roadmap 2006 2007 2008 2009 2010 2011 “Maranello” Socket G34 with AMD SR56x0 and SP5100 Magny-Cours New Architecture Six-Core AMD Opteron™ Processor with AMD ay rm rise oo ww pp Chipset Socket F(1207) with AMD SR56x0 and SP5100 Shanghai/Istanbul Platf 2/4- Enter “Socket F (1207)” Socket F(1207) with Nvidia and Broadcom chipsets Rev F Barcelona Shanghai Istanbul “San Marino” Socket C32+AMD SR56x0 and SP5100 y cient m ii aa Lisbon New Architecture “Adelaide” Platfor 1/2-w Socket C32 EE+AMD SR5650 and SP5100 Lisbon New Architecture Power Eff “Buenos Aires” Socket AM3 with AMD SR56x0 and SP5100 Suzuka way tform -- aa 1 “Soc ket AM2+” Pl Socket AM2+ with Nvidia and Broadcom chipsets 4 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 Rev F Budapest Suzuka AMD Multi-core Processor trends Timeline: 2007 2008 2009 2010 2011 2FLOP/clk 4FLOP/clk 8FLOP/clk DDR2 533, 667, 800 DDR3 1333, 1600 60 nm 45 nm 32 nm 32 ee 16 cor 8 GFLOP/s wrt 1 4 GB/s Watts metric 2 nd tre 1 Log2 single core dual core quad core six core 12 core 050.5 * More computation while using less power per core x86 64-bit Architecture Evolution 2003 2005 2007 2008 2009 2010 AMD AMD “Barcelona” “Shanghai” “Istanbul” “Magny-Cours” Opteron™ Opteron™ Mfg. 90nm SOI 90nm SOI 65nm SOI 45nm SOI 45nm SOI 45nm SOI Process K8 K8 Greyhound Greyhound+ Greyhound+ Greyhound+ CPU Core L2/L3 1MB/0 1MB/0 512kB/2MB 512kB/6MB 512kB/6MB 512kB/12MB Hyper Transport™ 3x 1.6GT/s 3x 1.6GT/s 3x 2GT/s 3x 4.0GT/s 3x 4.8GT/s 4x 6.4GT/s Technology Memory 2x DDR1 300 2x DDR1 400 2x DDR2 667 2x DDR2 800 2x DDR2 1066 4x DDR3 1333 Max Power Budget Remains Consistent 6 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 A Strong Partnership in HPC: “Granite” Past, Present, and Future “Baker”+ “Marble” “Baker” Cray XT5 &XT5& XT5h Cray XT4 CX1 Scalar VtVector CXMTCray XMT Multithreaded 7 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 Hardware 8 AMD Hex-Core Processors | Nersc/OLCF/NICS Cray XT5 Workshop| February 2010 “Istanbul” Silicon L3 HT0 Link CACHE CORE 0 H CORE 1 CORE 2 D L2 T L2 CACHE 2 D NORTH BRIDGE H T CORE 3 CORE 4 CORE 5 3 R L2 L2 L3 CACHE CACHE HT1 Link 9 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Shanghai to Istanbul • 6 cores (~1.5X flops) • Same per core L1 & L2 • Same shared L3 • NB & Xbar upgrades (going from 4 to 6 cores) • HT Assist – provides 3 probe scenarios • No probe needdded • Directed probe • Broadcast probe •Memory BW and latency improvement • Amount depends on platform and configuration • Socket Compatibility 10 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Cache Hierarchy Dedi cate d L1 cache Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 – 2 way associativity. Cache Cache Cache Cache Cache Cache Control Control Control Control Control Control – 8 banks. – 2 128-bit loads per 64KB 64KB 64KB 64KB 64KB 64KB cycle. Dedicated L2 cache 512KB 512KB 512KB 512KB 512KB 512KB – 16 way associativity. 2 to 6 MB Shared L3 cache – 32 way (Barce lona), 48 way (Shanghai and Istanbul) associativity. – fills from L3 leave likely shared lines in L3. – sharing aware replacement policy. 11 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Core Micro Architecture Reference : Software Optimization Guide for AMD Family 10h Processors, FastPath? Macro-Ops? Micro-Ops? Pub. #40546, Rev. 3.10 Feb 2009 Notes / Considerations Avoid having more than 2 or 3 X86 Insts. branches per 16B of instructions. Three Decode Categories (FastPath also called DirectPath) • DirectPath Single - best • DirectPath Double - better • VectorPath (microcode) - good Macro-Ops Macro-Ops tracked ReOrder Buffer (ROB). • 72 entries (3 wide x 24 deep) • In-or der dispatch, retir ement Micro-Ops Micro-Ops issue from Sched to Execution Units • “Sched” aka “Reserv. StationStation” • Out-of-order issue • FP scheduler shared across units • INT Schedulers are “per unit” Improved Out-of-Order Load Execution In-Order Address Generation (per AGU) Store-to-Load Forwarding Support 12 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 TLB Review (Barcelona, Shanghai, Istanbul, Magny-Cours) • Support for 1GB pagesize (4k, 2M, 1G) • 48 bit physical addresses = 256TB (increase from 40bits on K8) • Data TLB • L1 Data TLB • 48 entries, fully associative • all 48 entries support any pagesize • L2 TLB • 512 4k entries, and • 128 2M entries • Instruction TLB • L1 Instruction TLB • fully associative • support for 4k or 2M pagesizes • L2 Instruction TLB 13 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Data Prefetch: Review of Options Hardware prefetching – DRAM prefetcher • tktracks positive, negative, non-unit str ides. • dedicated buffer (in NB) to hold prefetched data. • Aggressively use idle DRAM cycles. – Core prefhfetchers • Does hardware prefetching into L1 Dcache. Software ppgrefetching instructions – MOV (prefetch via load / store) – prefetcht0, prefetcht1, prefetcht2 (currently all treated the same) – prefetchw = prefetch with intent to modify – prefetchnta = prefetch non-temporal (favor for replacement) 14 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Six-Core AMD RDDR-2 RDDR-2 Opteron™ Processor Up to 8 DIMMs Up to 8 DIMMs MM MM MM MM HT-1 MM MM MM MM II II II II Performance II II II II D D D D “Istanbul” HT-3 “Istanbul” D D D D • Six-Core AMD Opteron™ Processor 6M Shared L3 Cache North Bridge enhancements (PF + prefetch) DIMM DIMM DIMM DIMM DIMM DIMM DIMM 45nm Process Technology DIMM • DDR2-800 Memory HT-1 HT-1 • HyperTransport-3 @ 4.8 GT/sec HT-3 HT-3 HT-1 DIMM DIMM DIMM DIMM Reliability/Availability “Istanbul” HT-3 “Istanbul” DIMM DIMM DIMM DIMM • L3 Cache Index Disable M M M M • HyperTransport Retry (HT -3 Mode) M M M M MM MM MM MM MM MM MM MM DI DI DI DI • x8 ECC (Supports x4 Chipkill in unganged mode) DI DI DI DI HT-1 HT-1 Virtualization HT-3 HT-3 • AMD-V™ with Rapid Virtualization Indexing PCI- PCI- e or I/O Br idge e or Manageability PCI- PCI- • APML Management Link* X X Scalability USB • 48-bit Physical Addressing (256TB) South SATA SE: 2. 8GHz • HT Assist (Cache Probe Filter) PCI Bridge Std: 2.6GHz LPC PATA HE: 2.1GHz Continued Platform Compatibility EE: 1.8GHz • Nvidia/Broadcom-based F/1207 platforms 4 socket example block diagram *APML-enabled platform support required. 15 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Six-Core AMD Opteron™ RDDR-2 RDDR-2 Processor Up to 8 DIMMs Up to 8 DIMMs MM MM MM MM HT-1 MM MM MM MM II II II II Performance II II II II D D D D “Istanbul” “Istanbul” D D D D • Six-Core AMD Opteron™ Processor HT-3 6M Shared L3 Cache North Bridge enhancements (PF + prefetch) DIMM DIMM DIMM DIMM DIMM DIMM DIMM 45nm Process Technology DIMM • DDR2-800 Memory HT-1 HT-1 • HyperTransport-3 @ 4.8 GT/sec HT-3 HT-3 HT-1 DIMM DIMM DIMM DIMM Reliability/Availability “Istanbul” HT-3 “Istanbul” DIMM DIMM DIMM DIMM • L3 Cache Index Disable M M M M • HyperTransport Retry (HT -3 Mode) M M M M MM MM MM MM MM MM MM MM DI DI DI DI • x8 ECC (Supports x4 Chipkill in unganged mode) DI DI DI DI HT-1 HT-1 Virtualization HT-3 HT-3 • AMD-V™ with Rapid Virtualization Indexing 4 socket example block diagram Manageability STREAM Bandwidth • APML Management Link* (GB/s)* Scalability Product, Freq, Dram 2S 4S 8S • 48-bit Physical Addressing (256TB) • HT Assist (Cache Probe Filter) “Barcelona,” 2.3/2.0, RDDR2‐667 17.2 20.5 “Shanghai,” 2.7/2.2, RDDR2‐800 21.4 24 Continued Platform Compatibility • Nvidia/Broadcom-based F/1207 platforms “Istanbul,” 2.4/2.2, RDDR2‐800 22 42 81.5 *Based on measurements at AMD performance labs. See backup slide for configuration information. 16 AMD Hex-Core Processors | Cray XT5 Hex-core Workshop| December 2009 Significantly Higher Performance in Same Power Envelope Two-socktket servers using Six-Core AMD Opt eron™ processors significantly outperform two-socket servers using Dual-Core and Quad-Core AMD Opteron™ processors without consuming significantly more power 700% SPECfp®_rate2006 600% SPECint®_rate2006 SPECjbb®2005 500% Server Idle Power ance mm 400% 3x core count >3x performance Perfor 300% Same power envelope 200% Relative 100% 0% Dual‐Core Quad‐Core Six‐Core AMD OtOpteron ™ processor AMD OtOpteron ™ processor ("Shangh ai") AMD OtOpteron ™ processor ("Is tan bu l") Model 2218 (2.6GHz) Model 2382 (2.6GHz) Model 2435 (2.6GHz) SPEC, SPECfp, SPECint, and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation. The performance results for Six-Core AMD Opteron™ processor Model 2435 and the SPECjbb result for Quad-Core AMD Opteron™ processor Model 2382 are based upon data submitted to Standard Performance Evaluation Corporation as of May 21, 2009.