Simultaneous Multi-threading Implementation in POWER5 -- IBM's Next Generation POWER

Ron Kalla, Balaram Sinharoy, Joel Tendler IBM Systems Group

© 2003 IBM Corporation SMT Implementation in POWER5

Outline

§ Motivation § Background § Threading Fundamentals § Enhanced SMT POWER5 Implentation § Additional SMT Considerations § Summary

2 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

Microprocessor Design Optimization Focus Areas

§ Memory latency 4 Increased processor speeds make memory appear further away 4 Longer stalls possible § Branch processing 4 Mispredict more costly as pipeline depth increases resulting in stalls and wasted power 4 Predication drives increased power and larger chip area § Execution Unit Utilization 4 Currently 20-25% execution unit utilization common § Simultaneous multi-threading (SMT) and POWER architecture address these areas

3 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

POWER4 --- Shipped in Systems December 2001

§ Technology: 180nm lithography, Cu, SOI FPU ISU ISU FPU 4 POWER4+ shipping in 130nm today FXU FXU § Dual processor core IDU IDU § 8-way superscalar IFU LSU LSU IFU 4 Out of Order execution BXU BXU 4 2 Load / Store units 4 2 Fixed Point units 4 2 Floating Point units L2 L2 L2 4 Logical operations on Condition Register 4 Branch Execution unit

§ > 200 instructions in flight L3 Directory/Control § Hardware instruction and data prefetch

4 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

POWER5 --- The Next Step

§ Technology: 130nm lithography, Cu, SOI § Dual processor core § 8-way superscalar § Simultaneous multithreaded (SMT) core 4 Up to 2 virtual processors per real processor 4 24% area growth per core for SMT 4 Natural extension to POWER4 design

5 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

Multi-threading Evolution Single Coarse Grain Threading

FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL

Fine Grain Threading Simultaneous Multi-Threading

FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL Thread 0 Executing Thread 0 Executing No Thread Executing

6 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5 Changes Going From ST to SMT Core § SMT easily added to Superscalar Micro-architecture 4 Second (PC) added to share I-fetch bandwidth 4 GPR/FPR rename mapper expanded to map second set of registers (High order address bit indicates thread) 4 Completion logic replicated to track two threads 4 Thread bit added to most address/tag buses

Fetch PCPC Unit BR/CRUBR/CRU CR,CR, LR,LR, BR, CRLCRL I-Cache Issue QsQs CTRCTR Units

RegisterRegister FPFP Decode FPRs FPUs Rename Issue QsQs FPUs Data FXUs, Cache Integer GPRsGPRs LSUsLSUs Issue QsQs

7 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

Resource Sizes

§ Analysis done to optimize every ST SMT micro-architectural resource size 4GPR/FPR rename pool size 4I-fetch buffers

4Reservation Station IPC 4SLB/TLB/ERAT 4I-cache/D-cache § Many Workloads examined § Associativity also examined ~ 50 60 70 80 90 100 110 120 130 Number of GPR Renames

Results based on simulation of an online transaction processing application Vertical axis does not originate at 0 8 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

Resource Sharing Global Completion Table Occupancy

Without dynamic resource utilization § Threads share many resources 10 adjustment 4 Global Completion Table, BHT, 8 6 TLB, . . . 4 0 5 § Higher performance realized 2 10 Relative Occurrence 15 when resources balanced 0 20 0 5 10 15 20 Thread 1 across threads Thread 0

With dynamic 4 Tendency to drift toward resource utilization 10 adjustment extremes accompanied by 8 reduced performance 6 4 0 § Solution: Dynamically adjust 5 2 10

Relative Occurrence 15 resource utilization 0 20 0 5 10 15 20 Thread 1 Thread 0

Results based on simulation of an online transaction processing application 9 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

Thread Priority Single Thread Mode § Instances when unbalanced execution desirable 4 No work for opposite thread 2 2 4 Thread waiting on lock 1 4 Software determined non uniform 1 balance 1

4 Power management IPC 1 4 … 1 § Solution: Control instruction decode 0 rate 0 0 4 Software/hardware controls 8 priority 0,7 -5 -3 -1 0 1 3 5 7,0 1,1 levels for each thread Thread 1 Priority - Thread 0 Priority Power Thread 0 IPC Thread 1 IPC Save Mode

10 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

Dynamic Thread Switching

Thread States § Used if no task ready for second thread to run Dormant § Allocates all machine resources to software one thread § Initiated by software hardware § Dormant thread wakes up on: Active or software 4 External interrupt software 4 Decrementer interrupt

4 Special instruction from active Null thread software

11 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

Single Thread Operation

§ Advantageous for execution unit limited applications 4 Floating or fixed point intensive workloads § Execution unit limited applications

provide minimal performance leverage IPC for SMT 4 Extra resources necessary for SMT POWER5 ST POWER4+ provide higher performance benefit POWER5 SMT when dedicated to single thread § Determined dynamically on a per Matrix Multiply processor basis

12 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

Other SMT Considerations

§ Power Management 4 SMT Increases execution unit utilization 4 Dynamic power management does not impact performance § Debug tools / Lab bring-up 4 Instruction tracing 4 Hang detection 4 Forward progress monitor § Performance Monitoring § Serviceability

13 Hotchips 15, August 2003 © 2003 IBM Corporation SMT Implementation in POWER5

Summary

§ POWER5 SMT implementation is more than SMT 4 Good ROI for silicon area: Performance gain > Area increase 4 Resource sizes optimized 4 Dynamic feedback enhances instruction throughput 4 Software controlled priority exploits machine architecture 4 Dynamic ST to/from SMT mode capability optimizes system resources § SMT impacts pervasive throughout chip § Operating in laboratory 4 AIX, Linux and OS/400 booted and running

14 Hotchips 15, August 2003 © 2003 IBM Corporation