POWER6™ Processor and Systems Jim McInnes [email protected] Compiler Optimization IBM Canada Toronto Software Lab © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only Role .I am a Technical leader in the Compiler Optimization Team . Focal point to the hardware development team . Member of the Power ISA Architecture Board .For each new microarchitecture I . help direct the design toward helpful features . Design and deliver specific compiler optimizations to enable hardware exploitation 2 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only POWER5 Chip Overview High frequency dual-core chip . 8 execution units 2LS, 2FP, 2FX, 1BR, 1CR . 1.9MB on-chip shared L2 – point of coherency, 3 slices . On-chip L3 directory and controller . On-chip memory controller Technology & Chip Stats . 130nm lithography, Cu, SOI . 276M transistors, 389 mm2 . I/Os: 2313 signal, 3057 Power/Gnd 3 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only POWER6 Chip Overview SDU RU Ultra-high frequency dual-core chip IFU FXU . 8 execution units L2 Core0 VMX L2 2LS, 2FP, 2FX, 1BR, 1VMX QUAD LSU FPU QUAD . 2 x 4MB on-chip L2 – point of L3 coherency, 4 quads L2 CNTL CNTL . On-chip L3 directory and controller . Two on-chip memory controllers M FBC GXC M Technology & Chip stats C C . CMOS 65nm lithography, SOI Cu L2 L3 . 2 CNTL 790M transistors, 341 mm die CNTL FPU . I/Os: 1953 signal, 5399 Power/Gnd LSU VMX L2 L2 Full error checking and recovery (RU) FXU QUAD IFU Core1 QUAD RU SDU 4 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only POWER4 / POWER5 / POWER6 POWER4 POWER5 POWER6 Alti P6 P6 Alti POWER4POWER4 POWER5POWER5 Core Core Core Core Ve Ve c Core Core c L3 L2 L3 Ctl L2 L3 4 MB 4 MB L3 L3 Ctrl L2 L2 Ctrl L3 Distributed Enhanced Switch Distributed Switch L3 Mem Cntrl Ctl Fabric Bus Controller L3 GX Cntrl GX Bus Cntrl GX Memory Memory Bus Mem Ctl Bus Cntrl Memory GX+ Memory Memor Bridge Memor y+ y+ 5 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only POWER6 I/O: Speeds and Feeds L2 reload: 32B/Pc L3 buses: 16B/bc Read SMT 2 MB L2 2 MB L2 16B/bc Write Core 0 Quad 0 Quad 1 (Split into two, 8 and 8) (Core0) 64KB L1D (Core0) DRAM Memory connected by up to L2 1 L3 32 MB 4 channels, 533 – 800MHz DIMMS Ctrl Ctrl L3 DDR2 (channels run at 4X DRAM frequency) Dir Memory port (1 of 2): Dir (2x16MB) Nova Nova Nova Nova 8B Read (4x2B), Nova Nova Nova Nova Memory Memory (same as Fabric 4B Write (4x1B) Cntlr 0 I/OI/O Ctlr Cntlr Cntlr 1 MC0) Nova Nova Nova Nova Switch Nova Nova Nova Nova I/O L2 L3 Interface: Ctrl Ctrl 4B Read, Off-Node Fabric Buses (2 pairs): Dir Dir 4B Write SMT 2 MB L2 2 MB L2 at 2-8:1 the pc Core 1 4B/bc or 8B/bc per unidirectional pair Quad 20 Quad 3 (Core(Core 1)1) 64KB L1D (Core 1) Total L2 Buses scale at 2:1 with core frequency on chip: 8MB pc = processor clock On-Node Fabric Buses (3 pairs): bc = bus clock 2B/bc or 8B/bc per unidirectional pair 2 pc = 1 bc 1May be a single 32MB L3 chip with 8B buses 6 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only POWER6 Core Pipelines 7 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only POWER5 vs. POWER6 Pipeline Comparison Pre-Dec POWER5 Pipe Stages Pre-Dec POWER6 Pipe Stages 22 FO4 Pre-Dec 13 FO4 Pre-Dec Pre-Dec I Cache I Cache 5 cycle 14 cycle 3-4 cycle Xmit 12 cycle I Cache Branch redirect Branch redirect IBUF Xmit Resolution Resolution Rotate Group Form Decode IBUF0 Assembly IBUF1 Group Xfer Pre-dispatch Group Disp Group Disp Rename 4 cycle RF Issue Load-load AG/RF 1 cycle Fx-FX RF DCache / Ex 2 cycle AG/Ex 2 cycle Fx-Fx Load-Fx DCache 3 cycle Load-load/fx D Cache Fmt1 Fmt Fmt2 6 cycle FP-FP 6 cycle FP to local FP use Writeback 5 cycle LD-FP 8 cycle FP to remote FP use Finish Writeback 0 cycle LD to FP use Completion Completion Chkpt 8 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only Side-by-Side comparisons POWER5+ POWER6 Mostly in-order with special Style General out-of-order execution case out-of-order execution 2FX, 2LS, 2FP, 1BR/CR, Units 2FX, 2LS, 2FP, 1BR, 1CR 1VMX 2 SMT threads 2 SMT threads Priority-based dispatch Alternate ifetch Threading Simultaneous dispatch from Alternate dispatch (up to 5 two threads (up to 7 instructions) instructions) 9 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only Side-by-side comparisons… POWER5+ POWER6 L1 Cache ICache capacity, associativity 64 KB, 2-way 64 KB, 4-way DCache capacity, associativity 32 KB, 4-way 64 KB, 8-way L2 Cache Point of Coherency Point of Coherency Capacity, line size 1.9 MB, 128 B line 2 x 4 MB, 128 B line Associativity, replacement 10-way, LRU each 8-way, LRU Off-chip L3 Cache Capacity, line size 36 MB, 256 B line 32 MB, 128 B line Associativity, replacement 12-way, LRU 16-way, LRU Memory 4 TB maximum 8 TB maximum Memory bus 2x DRAM frequency 4x DRAM frequency 10 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only Latency Profiles 160 Local Memory 140 120 100 P5@ 1.9GHz 80 [email protected] Time (ns) 60 L3 40 20 L2 0 0.1 1 10 100 1000 Size (MB) 11 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only Processor Core Functional Features PowerPC AS Architecture Simultaneous Multithreading ( 2 threads) Decimal Floating Point (extension to PowerPC ISA) 48 Bit Real Address Support Virtualization Support (1024 partitions) Concurrent support of 4 page sizes (4K, 64K, 16M, 16G) New storage key support Bi-Endian Support Dynamic Power Management Single Bit Error Detection on all Dataflow Robust Error Recovery (R unit) CPU sparing support (dynamic CPU swapping) Common Core for I, P, and Blade 12 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only Achieving High Frequency: POWER6 13FO4 Challenge Example Circuit Design .1 FO4 = delay of 1 inverter that drives 4 receivers .1 Logical Gate = 2 FO4 .1 cycle = Latch + function + wire 1 cycle = 3 FO4 + function + 4 FO4 . Function = 6 FO4 = 3 Gates Integration .It takes 6 cycles to send a signals across the core .Communication between units takes 1 cycle using good wire .Control across a 64-bit data flow takes a cycle 13 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only Sacrifices associated with high frequency . In-order design instead of out-of-order .Longer pipelines .Fp-compare latency . FX multiply done in FPU . Memory bandwidth reduced . Esp. Store queue draining bandwidth Mitigating this: .SMT implementation is improved 14 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only In-Order vs. Out of Order In-Order Dispatch In-Order Dispatch Out-of-Order Execution In-Order Execution Fetch Fetch Load R2,… Addi R2,R2,1 Dispatch Load R3, … Dispatch Operands must be available Instruction Instruction Queue Queue Operands must be available Functional Functional Functional Functional STQ STQ Unit Unit Unit Unit 15 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only P5 vs. P6 instruction groups Both machines are “superscalar,” meaning that instructions are formed into groups and dispatched together. There are two big differences between P5 and P6: . On P5 there are few restrictions on what types on insns and how many can be grouped together, while on P6 each group must fit in the available resources (2 fp, 2 fx, 2 ldst,..) . On P5, insns are placed in queues to wait for operands to be ready, while on P6 dispatch must wait until all operands are ready for the whole group On p6, cycles are shorter, but there are more stall cycles and more partial instruction groups 16 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only Strategic NOP insertion ORI 1,1,0 terminates a dispatch group early Suppose at cycle 10 the current group contains FMA, FMA, LFL and all are ready to go.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages35 Page
-
File Size-