Power8 Quser Mspl Nov 2015 Handout


IBM Power Systems
Power Systems Hardware: Today and Tomorrow
November 2015
Mark Olson, [email protected]
© 2015 IBM Corporation

POWER8 Chip
[Die photo of the POWER8 chip.]

Processor Technology Roadmap
• POWER4 – 180 nm – 2001
• POWER5 – 130 nm – 2004
• POWER6 – 65 nm – 2007
• POWER7 – 45 nm – 2010
• POWER8 – 22 nm – 2014
• POWER9, POWER10, POWER11 (or whatever they are named) – future

Processor Chip Comparisons

                  POWER5      POWER6     POWER7           POWER7+          POWER8
Year              2004        2007       2010             2012             2014
Technology        130nm SOI   65nm SOI   45nm SOI, eDRAM  32nm SOI, eDRAM  22nm SOI, eDRAM
Compute cores     2           2          8                8                12
Threads           SMT2        SMT2       SMT4             SMT4             SMT8
On-chip cache     1.9MB (L2)  8MB (L2)   2+32MB (L2+L3)   2+80MB (L2+L3)   6+96MB (L2+L3)
Off-chip cache    36MB (L3)   32MB (L3)  None             None             128MB (L4)
Sust. memory BW   15GB/s      30GB/s     100GB/s          100GB/s          230GB/s
Peak I/O BW       6GB/s       20GB/s     40GB/s           40GB/s           96GB/s

Processor Designs

                  POWER5+       POWER6    POWER7            POWER7+            POWER8
Max cores         4             2         8                 8                  12
Technology        90nm          65nm      45nm              32nm               22nm
Die size          245 mm²       341 mm²   567 mm²           567 mm²            650 mm² *
Transistors       276 M         790 M     1.2 B             2.1 B              4.2 B *
Frequencies       1.9 GHz       4–5 GHz   3–4 GHz           Up to 4.4 GHz      Up to 4.1 GHz **
SMT (threads)     2             2         4                 4                  8
L2 cache          1.9MB shared  4MB/core  256KB/core        256KB/core         512KB/core
L3 cache          36MB          32MB      4MB/core on chip  10MB/core on chip  8MB/core on chip
L4 cache          --            --        --                --                 Up to 128MB
Sust. memory BW   15GB/s        30GB/s    100GB/s           100GB/s            230GB/s
Peak I/O BW       6GB/s         20GB/s    40GB/s            40GB/s             96GB/s

* with 12-core chip
** announced so far

Processor Designs (extended back to POWER4)

                  POWER4   POWER4+  POWER5        POWER5+      POWER6   POWER7            POWER7+            POWER8
Max cores         2        2        2             4            2        8                 8                  12
Technology        180nm    130nm    130nm         90nm         65nm     45nm              32nm               22nm
Die size          412 mm²           389 mm²       245 mm²      341 mm²  567 mm²           567 mm²            650 mm² *
Transistors       170 M    180 M    276 M         276 M        790 M    1.2 B             2.1 B              4.2 B *
Frequencies       1.3 GHz  1.9 GHz  1.65–1.9 GHz  1.9–2.2 GHz  3–5 GHz  3–4.25 GHz        Up to 4.4 GHz      3–4.35 GHz
SMT (threads)     1        1        2             2            2        4                 4                  8
L2 cache          1.4MB shared  1.5MB shared  1.9MB shared  1.9MB shared  4MB/core  256KB/core  256KB/core  512KB/core
L3 cache          32MB     32MB     36MB          36MB         32MB     4MB/core on chip  10MB/core on chip  8MB/core on chip
L4 cache          --       --       --            --           --       --                --                 Up to 128MB
Sust. memory BW   6GB/s    6GB/s    15GB/s        15GB/s       30GB/s   100GB/s           100GB/s            230GB/s
Peak I/O BW       2GB/s    2GB/s    6GB/s         6GB/s        20GB/s   40GB/s            40GB/s             96GB/s

* with 12-core chip
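As a quick consistency check on the cache figures (my arithmetic, not from the deck), the per-core L2/L3 sizes in the Processor Designs tables reproduce the chip totals in the Processor Chip Comparisons table. For the 12-core POWER8:

\[
12 \times 512\,\text{KB} = 6\,\text{MB of L2}, \qquad 12 \times 8\,\text{MB} = 96\,\text{MB of L3}
\]

which matches the "6 + 96MB (L2+L3)" entry; likewise for the 8-core POWER7+, 8 × 256 KB = 2 MB and 8 × 10 MB = 80 MB give "2 + 80MB (L2+L3)".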
Innovation Drives Performance
[Bar chart: relative % of improvement attributed to technology scaling vs. innovation, plotted for nodes from 180 nm down to 22 nm and future 14 & 7 nm.]
IBM plans for future technology are subject to change.

POWER8 Chip – Packaging Technology
• 22nm SOI with eDRAM, 650 mm² die
• 15 metal layers = great bandwidth
[Die diagram: 12 cores, each with its own L2, surrounding the shared L3 (one 8MB L3 region per core) and chip interconnect; dual memory controllers, SMP links, PCIe, and accelerators at the edges.]

World-Class 22nm Semiconductor Technology
• Silicon-on-insulator: faster transistors, less noise
• On-chip eDRAM: 6x latency improvement and 8x bandwidth improvement, with no off-chip signaling requirement; 3x less area and 5x less energy than SRAM
• 15-layer copper wire, dense interconnect: faster connections, low-latency distance paths, high-density complex circuits, 2x wire per transistor
[Images: 22nm eDRAM cell; 15-layer wiring cross-section.]

IBM 22nm Technology
• 15-layer metal stack: low-level wires used for dense local circuit interconnect; top-level wires used for power distribution, clocks, and off-chip signaling
• Compare: a 9-layer metal stack is typical in the industry
[Image: 15-layer vs. 9-layer metal stack cross-sections.]

POWER8 Chip
Cores
• 12 cores (SMT8)
• 8-wide dispatch, 10-wide issue, 16 execution pipes
• 2x internal data flows/queues
• Enhanced prefetching
Caches
• 64KB data cache (L1)
• 512KB SRAM L2 per core
• 96MB eDRAM shared L3
• Up to 128MB eDRAM L4 (off-chip)
Accelerators
• Crypto and memory expansion
• Transactional memory (a usage sketch follows this slide)
• Data move / VM mobility
Energy management
• On-chip power-management micro-controller
• Integrated per-core VRM
• Critical path monitors
Bus interfaces
• Integrated PCIe Gen3
• SMP interconnect
• CAPI
Memory
• Dual memory controllers
• 230 GB/s sustained memory bandwidth
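The transactional-memory feature above is exposed to software as POWER8's hardware transactional memory (HTM) facility. Below is a minimal lock-elision sketch, assuming GCC's PowerPC HTM built-ins (compile with -mcpu=power8 -mhtm); the counter, lock variable, and function names are illustrative, not from the deck:

```c
#include <htmintrin.h>   /* GCC's PowerPC HTM support; requires -mhtm */
#include <stdatomic.h>

static atomic_int fallback_lock;   /* 0 = free, 1 = held (illustrative) */
static long shared_counter;        /* illustrative shared state */

void increment_counter(void)
{
    if (__builtin_tbegin(0)) {
        /* Transactional path: reading the lock adds it to the transaction's
           read set, so another thread taking the lock aborts us automatically. */
        if (atomic_load(&fallback_lock))
            __builtin_tabort(0);   /* lock is held: bail out to the fallback */
        shared_counter++;
        __builtin_tend(0);         /* commit the transaction */
    } else {
        /* Transaction failed to start or aborted (conflict, capacity, ...):
           take the software lock instead. */
        while (atomic_exchange(&fallback_lock, 1))
            ;                      /* spin until the lock is free */
        shared_counter++;
        atomic_store(&fallback_lock, 0);
    }
}
```

Production code would bound the number of transactional retries and inspect the abort cause to distinguish persistent from transient failures before falling back to the lock.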
POWER8 Memory Buffer Chip
• A memory buffer chip sits between the POWER8 processor interface and the DRAM and provides an "L4 cache"
• Intelligence moved into memory: logic previously on the POWER7+ chip moves onto the buffer
• High-speed processor interface; performance value
[Diagram: POWER8 chips linked over a high-speed interface to buffer chips fronting DRAM.]

Memory Bandwidth per Socket
[Three bar charts that successively "reset the scale": memory bandwidth per socket for POWER5 through POWER8, first on a 0–70 GB/s axis, then 0–200 GB/s, then 0–250 GB/s to fit POWER8's 230 GB/s.]

POWER8 Integrated PCIe Gen3
• POWER7: a GX bus runs to an external I/O bridge, which drives PCIe Gen2 to PCI devices
• POWER8: PCIe Gen3 comes directly off the chip to PCI devices

Power 770/780 Node I/O Bandwidth (system node / processor enclosure / CEC drawer)
[Bar charts, again with a scale reset: POWER6 570, POWER7 770, and POWER7+ 770 fit on a 0–80 GB/s axis; adding the POWER8 E870 pushes the axis to 0–300 GB/s.]

I/O Bandwidth, Comparing 2-Socket Servers
[Bar chart comparing POWER6, POWER7, POWER7+, and POWER8 2-socket servers on a 0–200 GB/s axis.]

POWER8 CAPI (Coherent Accelerator Processor Interface)
• An FPGA or ASIC attaches to the POWER8 coherence bus, with PCIe Gen3 as the transport for encapsulated messages
• The accelerator acts like an "extra" core for the POWER8 chip
• Customizable hardware application accelerator: specific system software, middleware, or user applications, written to a durable interface

First CAPI Solution Example
• High-speed, fast-response social application (for example, Twitter), enabled by an in-memory NoSQL store using distributed hash tables
• Initially implemented on x86 servers, but limited DRAM per server meant LOTS of servers, resulting in a costly, complex infrastructure
• One 2-socket, 2U POWER8 server plus one FlashSystem drawer replaced 24 x86 servers
• Much lower cost of acquisition; much smaller footprint and less energy; much lower operational cost

Example: CAPI-Attached Flash Optimization
• Conventional I/O path: application read/write syscall → file system (strategy()/iodone()) → LVM (strategy()/iodone()) → disk and adapter device drivers (pin buffers, translate, map DMA, start I/O, interrupt, unmap, unpin, iodone scheduling) – roughly 20K instructions
• CAPI path: flash memory attaches to POWER8 via CAPI coherent attach; applications issue read/write commands through a POSIX-style async I/O API (aio_read()/aio_write()) into a user library with a shared-memory work queue – fewer than 500 instructions
• The CAPI flash controller operates in user space, eliminating 97% of the instruction path length
• Saves 10 cores per 1M IOPS

POWER8 Leapfrogs
• Leapfrogs in memory bandwidth and I/O bandwidth (integrated PCIe Gen3), recapping the earlier charts
• PLUS: CAPI, accelerators, transactional memory, scalability, smart use of energy, and more

Scale-out CPW Comparisons

720 POWER7+ (1 socket)               S814 (1 socket)
– 4-core 3.6 GHz: 28,400             – 4-core 3.0 GHz: 39,500     +40%
– 6-core 3.6 GHz: 42,400             – 6-core 3.0 GHz: 59,500
– 8-core 3.6 GHz: 56,300             – 8-core 3.7 GHz: 85,500     +50% at ~same GHz

740 POWER7+ (1 or 2 sockets)         S824 (1 or 2 sockets)
– 6-core 4.2 GHz: 49,000             – 6-core 3.8 GHz: 72,000
– 12-core 4.2 GHz: 91,700            – 12-core 3.8 GHz: 130,000   +40%
– 8-core 3.6 GHz: 56,300             – 8-core 4.1 GHz: 94,500
– 16-core 3.6 GHz: 106,500           – 16-core 4.1 GHz: 173,500   +60%
– 8-core 4.2 GHz: 64,500             – 12-core 1-socket not offered
– 16-core 4.2 GHz: 120,000           – 24-core 3.5 GHz: 230,500   +90%

Enterprise CPW Comparisons

POWER7+ 770                          E870
– 16-core 3.8 GHz: 110,000           – 32-core 4.02 GHz: 359,000
– 32-core 3.8 GHz: 191,500           – 64-core 4.02 GHz: 711,000  +87%
– 48-core 3.8 GHz: 290,500           – 40-core 4.19 GHz: 460,000
– 64-core 3.8 GHz: 379,300           – 80-core 4.19 GHz: 911,000
– 12-core 4.2 GHz: 90,000
– 24-core 4.2 GHz: 154,800
– 36-core 4.2 GHz: 242,600
– 48-core 4.2 GHz: 306,600

POWER7+ 780                          E880
– 16-core 4.4 GHz: 123,500           – 32-core 4.35 GHz: 381,000
                                     – 64-core 4.35 GHz: [rating cut off in the source text]
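For reference, the percentage callouts are simple ratios of the new CPW rating to the comparable POWER7+ rating (my arithmetic; the deck states only the rounded results):

\[
\frac{85{,}500}{56{,}300} \approx 1.52 \;\rightarrow\; \text{roughly the deck's "+50\%"}, \qquad
\frac{711{,}000}{379{,}300} \approx 1.87 \;\rightarrow\; \text{the deck's "+87\%"}
\]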