Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER Microprocessor
Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group
© 2003 IBM Corporation SMT Implementation in POWER5
Outline
§ Motivation § Background § Threading Fundamentals § Enhanced SMT POWER5 Implentation § Additional SMT Considerations § Summary
2 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
Microprocessor Design Optimization Focus Areas
§ Memory latency 4 Increased processor speeds make memory appear further away 4 Longer stalls possible § Branch processing 4 Mispredict more costly as pipeline depth increases resulting in stalls and wasted power 4 Predication drives increased power and larger chip area § Execution Unit Utilization 4 Currently 20-25% execution unit utilization common § Simultaneous multi-threading (SMT) and POWER architecture address these areas
3 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
POWER4 --- Shipped in Systems December 2001
§ Technology: 180nm lithography, Cu, SOI FPU ISU ISU FPU 4 POWER4+ shipping in 130nm today FXU FXU § Dual processor core IDU IDU § 8-way superscalar IFU LSU LSU IFU 4 Out of Order execution BXU BXU 4 2 Load / Store units 4 2 Fixed Point units 4 2 Floating Point units L2 L2 L2 4 Logical operations on Condition Register 4 Branch Execution unit
§ > 200 instructions in flight L3 Directory/Control § Hardware instruction and data prefetch
4 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
POWER5 --- The Next Step
§ Technology: 130nm lithography, Cu, SOI § Dual processor core § 8-way superscalar § Simultaneous multithreaded (SMT) core 4 Up to 2 virtual processors per real processor 4 24% area growth per core for SMT 4 Natural extension to POWER4 design
5 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
Multi-threading Evolution Single Thread Coarse Grain Threading
FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL
Fine Grain Threading Simultaneous Multi-Threading
FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL Thread 0 Executing Thread 0 Executing No Thread Executing
6 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5 Changes Going From ST to SMT Core § SMT easily added to Superscalar Micro-architecture 4 Second Program Counter (PC) added to share I-fetch bandwidth 4 GPR/FPR rename mapper expanded to map second set of registers (High order address bit indicates thread) 4 Completion logic replicated to track two threads 4 Thread bit added to most address/tag buses
Fetch PCPC Unit BR/CRUBR/CRU CR,CR, LR,LR, BR, CRLCRL I-Cache Issue QsQs CTRCTR Units
RegisterRegister FPFP Decode FPRs FPUs Rename Issue QsQs FPUs Data FXUs, Cache Integer GPRsGPRs LSUsLSUs Issue QsQs
7 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
Resource Sizes
§ Analysis done to optimize every ST SMT micro-architectural resource size 4GPR/FPR rename pool size 4I-fetch buffers
4Reservation Station IPC 4SLB/TLB/ERAT 4I-cache/D-cache § Many Workloads examined § Associativity also examined ~ 50 60 70 80 90 100 110 120 130 Number of GPR Renames
Results based on simulation of an online transaction processing application Vertical axis does not originate at 0 8 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
Resource Sharing Global Completion Table Occupancy
Without dynamic resource utilization § Threads share many resources 10 adjustment 4 Global Completion Table, BHT, 8 6 TLB, . . . 4 0 5 § Higher performance realized 2 10 Relative Occurrence 15 when resources balanced 0 20 0 5 10 15 20 Thread 1 across threads Thread 0
With dynamic 4 Tendency to drift toward resource utilization 10 adjustment extremes accompanied by 8 reduced performance 6 4 0 § Solution: Dynamically adjust 5 2 10
Relative Occurrence 15 resource utilization 0 20 0 5 10 15 20 Thread 1 Thread 0
Results based on simulation of an online transaction processing application 9 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
Thread Priority Single Thread Mode § Instances when unbalanced execution desirable 4 No work for opposite thread 2 2 4 Thread waiting on lock 1 4 Software determined non uniform 1 balance 1
4 Power management IPC 1 4 … 1 § Solution: Control instruction decode 0 rate 0 0 4 Software/hardware controls 8 priority 0,7 -5 -3 -1 0 1 3 5 7,0 1,1 levels for each thread Thread 1 Priority - Thread 0 Priority Power Thread 0 IPC Thread 1 IPC Save Mode
10 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
Dynamic Thread Switching
Thread States § Used if no task ready for second thread to run Dormant § Allocates all machine resources to software one thread § Initiated by software hardware § Dormant thread wakes up on: Active or software 4 External interrupt software 4 Decrementer interrupt
4 Special instruction from active Null thread software
11 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
Single Thread Operation
§ Advantageous for execution unit limited applications 4 Floating or fixed point intensive workloads § Execution unit limited applications
provide minimal performance leverage IPC for SMT 4 Extra resources necessary for SMT POWER5 ST POWER4+ provide higher performance benefit POWER5 SMT when dedicated to single thread § Determined dynamically on a per Matrix Multiply processor basis
12 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
Other SMT Considerations
§ Power Management 4 SMT Increases execution unit utilization 4 Dynamic power management does not impact performance § Debug tools / Lab bring-up 4 Instruction tracing 4 Hang detection 4 Forward progress monitor § Performance Monitoring § Serviceability
13 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5
Summary
§ POWER5 SMT implementation is more than SMT 4 Good ROI for silicon area: Performance gain > Area increase 4 Resource sizes optimized 4 Dynamic feedback enhances instruction throughput 4 Software controlled priority exploits machine architecture 4 Dynamic ST to/from SMT mode capability optimizes system resources § SMT impacts pervasive throughout chip § Operating in laboratory 4 AIX, Linux and OS/400 booted and running
14 Hotchips 15, August 2003 © 2003 IBM Corporation