Chip Multithreading: Opportunities and Challenges
International Symposium on High-Performance Computer Architecture, Feb 15th 2005

Lawrence Spracklen & Santosh G. Abraham

Overview

● The case for Chip Multithreaded (CMT) Processors

● Evolution of CMT Processors

● CMT Design Space

● CMT Challenges

● Conclusions

The Case for CMT Processors

Diminishing Returns

● Dramatic gains have been achieved in single-threaded performance in recent years

● Used a variety of microarchitectural techniques, e.g. superscalar issue, out-of-order issue, on-chip caches and deeper pipelines

● Recent attempts to continue leveraging these techniques have not led to demonstrably better performance

● Power and memory latency considerations introduce increasingly insurmountable obstacles to improving single thread performance – Several high frequency designs have recently been abandoned

Power Consumption

● Power consumption has an almost cubic dependence on core frequency

● Processors already pushing the limits of power dissipation

● For applications with sufficient threads:
– Double the number of concurrent threads: power x2
– Halve the frequency of each thread: power ÷8 (cubic dependence)
– Result: equivalent performance at ¼ of the power
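The trade-off above can be sketched as a back-of-the-envelope calculation, assuming (as the previous slide notes) a roughly cubic dependence of power on frequency and throughput that scales linearly with thread count and frequency:

```python
# Simple model of the threads-vs-frequency trade-off on a CMT.
# Assumptions: power ~ threads * frequency^3, and for workloads with
# sufficient TLP, throughput ~ threads * frequency.

def relative_power(threads, freq):
    """Power relative to a 1-thread, frequency-1.0 baseline."""
    return threads * freq ** 3

def relative_performance(threads, freq):
    """Throughput relative to the same baseline."""
    return threads * freq

baseline = (1, 1.0)   # one thread at full frequency
cmt      = (2, 0.5)   # twice the threads at half the frequency

# Same throughput...
assert relative_performance(*cmt) == relative_performance(*baseline)
# ...at a quarter of the power: 2 * (0.5)^3 = 0.25
print(relative_power(*cmt) / relative_power(*baseline))  # -> 0.25
```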

Offchip Bandwidth

● While offchip bandwidth has increased, so have offchip latencies

● Need to sustain over 100 in-flight requests to fully utilize the available bandwidth

● Difficult to achieve with a single core

– Even an aggressive OOO processor generates less than 2 parallel requests on typical server workloads

● A large number of concurrently executing threads are required to achieve high bandwidth utilization
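The 100+ in-flight-requests figure follows from Little's law: outstanding requests = bandwidth × latency ÷ line size. A quick sketch (the bandwidth, latency, and line-size numbers below are illustrative assumptions, not figures from the talk):

```python
# Little's law applied to the memory system: to saturate the pins,
# enough misses must be in flight to cover one full miss latency.
# GB/s * ns conveniently cancels to bytes (1e9 * 1e-9 = 1).

def required_in_flight(bandwidth_gb_per_s, latency_ns, line_bytes=64):
    # bytes that arrive during one miss latency, divided by line size
    return bandwidth_gb_per_s * latency_ns / line_bytes

# e.g. 20 GB/s of offchip bandwidth and a 400 ns miss latency
print(required_in_flight(20, 400))  # -> 125.0 outstanding cache lines
```

With fewer than 2 parallel requests per core (the OOO figure above), covering ~125 outstanding lines clearly requires many concurrent strands.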

Server Workloads

● Server workloads are characterized by:

– High levels of thread-level parallelism (TLP)

– Low instruction-level parallelism (ILP)

● Limited opportunity to boost single-thread performance

– Majority of CPI is a result of offchip misses

● Couple high TLP in the application domain with support for multiple threads of execution (strands) on a processor chip

Chip Multithreaded (CMT) processors support many simultaneous hardware strands

CMT Processors

● CMT processors support many simultaneous hardware strands via a combination of techniques:
1) Support for multiple cores (Chip Multiprocessing (CMP))
2) Simultaneous Multi-Threading (SMT)

● Strands spend a large amount of time stalled waiting for offchip misses

– SMT enables multiple strands to share many of the resources within the core

● SMT increases utilization of key resources within a core

● CMP allows multiple cores to share resources such as the L2 cache and offchip bandwidth

Evolution of CMT Processors

Evolution of CMTs

● Given the exponential increase in transistors per chip over time, chip level multiprocessors have long been predicted...

● [Olukotun & the Stanford Hydra CMP, 1996]

– 4 MIPS-based cores on a single chip

● [DEC/Compaq Piranha, 2000]

– 8 simple Alpha cores sharing an onchip L2$

● [Sun Microsystems' MAJC Architecture, 1995]

– Provided well-defined support for both CMP and SMT processors

● [Sun Microsystems' MAJC-5200, 1999]

– A dual-core CMT processor with cores sharing an L1 D$

An Evolutionary Approach to CMT Design

● A new process technology essentially doubles the transistor budget

● An attractive proposition for an evolving CMT design is to double the number of cores every generation

– Little redesign effort is expended on the cores

– Doubles performance every process generation (with sufficient TLP)

● The design philosophy behind the initial generations of CMT processors

1st Generation CMTs

● 2 cores per chip

● Cores derived from earlier uniprocessor designs

● Cores do not share any resources, except off-chip data paths

[Diagram: CPU0 and CPU1, each with a private L2 cache, connected via a crossbar switch to the memory controller]

Examples:
– Sun's Gemini processor: a dual-core UltraSPARC-II derivative
– Sun's UltraSPARC-IV processor: a dual-core UltraSPARC-III derivative
– AMD's upcoming dual-core Opteron
– Intel's upcoming dual-core Itanium
– Intel's upcoming dual-core Xeon

2nd Generation CMTs

● 2 or more cores per chip

● Cores still derived from earlier uniprocessor designs

● Cores now share the L2 cache
– Advantageous as most commercial applications have significant instruction footprints
– Speeds inter-core communication

[Diagram: CPU0 and CPU1 connected via a crossbar switch to a shared L2 cache and the memory controller]

Examples:
– Sun's UltraSPARC-IV+ processor: a dual-core UltraSPARC-III derivative
– IBM's Power4/5 processor: a dual-core Power processor
– Fujitsu's Sparc64 VI processor: a dual-core SPARC

CMT Problems?

● Re-using existing core designs may not scale beyond a couple of generations:
– Power: to restrain total power consumption, the power dissipation of each core must be halved in each generation
– Unlikely to be achievable by voltage scaling, or even by clock gating and frequency scaling
– Offchip bandwidth: to maintain the same offchip BW per core, total BW must double each generation
– Can increase BW by devoting additional pins or by increasing BW per pin
– However, the maximum number of pins is only increasing at 10% per generation

● Need a new approach to CMT design

3rd Generation CMTs

● Future CMTs need to satisfy these power and bandwidth constraints while delivering ever-increasing performance

● CMT processors are likely to be designed from the ground up
– Entire design optimized for a CMT design point

● Multiple cores per chip

[Diagram: eight 4-way SMT cores connected via a crossbar switch to a shared L2 cache and the memory controller]

Examples:
– Sun's Niagara processor:

● 8 four-way SMT cores

CMT Design Space

Number of Cores

For a fixed area and power budget:

● Decision is between small number of aggressive OOO cores, or multiple simple cores

● For workloads with sufficient TLP, the simpler-core solution may deliver superior performance at a fraction of the power
– e.g. Niagara: 8 four-way SMT cores; each core single-issue, with a short pipeline

● What about applications with limited TLP?
– Leverage speculative parallelism
– Employ heterogeneous cores: a single aggressive core can be provided for single-threaded applications

Resource Sharing

● A core requires several chip-level resources

● Branch predictors
● Execution pipeline
● Floating-point units
● L1 I-cache
● L1 D-cache
● L2 cache

● Currently, cores in advanced CMTs only share the L2 cache

● Opportunity to share a much wider range of chip-level resources
– Cores in the MAJC-5200 processor share an L1 D-cache
– Cores could share floating-point units

● CMT processors can begin to include a variety of new [expensive] shared resources
– Cost amortized over multiple cores

Hardware Accelerators

● On-chip hardware accelerators become increasingly attractive in the CMT space
– Cost is amortized and high utilization is achieved

● On-chip accelerators can provide a range of benefits
– Increased performance on certain specialized tasks
– A more power-efficient approach
– Offload of more mundane tasks from the general-purpose cores
– Frequently significantly more efficient than offchip accelerators

● A variety of different onchip accelerators can be envisaged
– Cryptographic accelerators
– Network offload engines
– XML parsing
– OS functionality, e.g. memcopy

CMT Challenges

CMT Challenges

● Attention has focused on improving single-thread performance using a variety of speculative techniques

● Aggressive speculative techniques can consume significant resources for even modest performance improvements

● This philosophy is compatible with a single-core design
– During periods of extensive speculation, the resources are otherwise underutilized

● In CMT processors, resources are shared between strands

● Need to ensure strands are not deprived of resources

● Strands need to be “good neighbors”

● Policies should focus on maximizing overall performance

Prefetching

● Instruction prefetching can significantly increase performance of commercial workloads on CMT processors

[Chart: potential performance improvement from instruction prefetching, 1.0X–1.5X, for Database, TPC-W, SPECjAppServer and SPECweb; 4-core CMP, private 32KB 4-way L1$, shared 2MB 4-way L2$]

● Potential to increase performance by up to 1.5X by mitigating the effects of instruction misses

Prefetching on CMTs

● Speculative prefetching strategies are still important for CMT processors

● Algorithms need to be tailored to the CMT design point

● In a single-strand system, the entire system is idle on a miss
– Aggressive prefetching may be very effective

● In a CMT processor, a less aggressive strategy may be preferable
– The harmful side effects of mis-speculation are more pronounced

● One strand's speculation can impact other strands
– Evicting their lines from the caches or monopolizing shared resources

● Prefetch accuracy is more critical
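One way these ideas could be realized is a feedback-directed throttle that scales back prefetch aggressiveness when measured accuracy drops; the class, thresholds, and degrees below are illustrative assumptions, not a mechanism from the talk:

```python
# Hypothetical feedback-directed prefetch throttle for a CMT: when a
# strand's prefetches turn out to be mostly pollution, lower its
# prefetch degree so it stops evicting other strands' lines from the
# shared L2. Thresholds (0.4/0.8) and degrees are invented for the
# sketch.

class PrefetchThrottle:
    def __init__(self, max_degree=4):
        self.degree = max_degree       # lines fetched ahead per miss
        self.max_degree = max_degree
        self.issued = 0                # prefetches issued this window
        self.useful = 0                # ...that were later referenced

    def record(self, prefetch_was_used):
        self.issued += 1
        self.useful += prefetch_was_used

    def accuracy(self):
        return self.useful / self.issued if self.issued else 1.0

    def adjust(self):
        """Called periodically, e.g. at the end of a sampling window."""
        if self.accuracy() < 0.4:            # mostly pollution: back off
            self.degree = max(1, self.degree // 2)
        elif self.accuracy() > 0.8:          # accurate: ramp back up
            self.degree = min(self.max_degree, self.degree + 1)
        self.issued = self.useful = 0        # start a new window

t = PrefetchThrottle()
for used in [True, False, False, False, False]:   # 20% accuracy
    t.record(used)
t.adjust()
print(t.degree)  # -> 2 (halved from 4)
```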

Request Prioritization

● Hardware scouting generates a highly-accurate stream of prefetches
– Pollution is not a major concern

● In a single-core design, the processor is otherwise stalled during scouting
– The prefetches don't impact the timeliness of the demand fetches

● In a CMT processor, one strand may be scouting while others are issuing demand fetches
– One strand's speculative requests may delay another strand's demand fetches
– Optimizing for peak single-strand performance may not result in optimal aggregate (chip-wide) performance

● Require speculative strategies that leverage idle resources and defer to demand fetches

Hot Sets

● Hot sets result when many heavily accessed physical addresses map to the same cache set

– Results in thrashing and a significant increase in conflict misses

● Multiple strands now sharing the L2 cache

● Traditionally addressed by increasing the associativity of the cache

● Not feasible to increase cache associativity in line with the number of sharers in CMT processors

● Require a more elegant way to reduce hot sets...

Hot Banks

● CMT processors typically support write-through L1 caches
– Makes coherency protocols simpler

● As a result, stores and L1 cache load and instruction misses all pressure the L2 cache

● To provide sufficient bandwidth into the L2, the cache is heavily banked

● Hot banks occur when certain banks are accessed much more heavily than others
– Can have a significant performance impact

Hot Sets & Hot Banks: Potential Solution

● Cache set/bank selection is typically based on the low-order bits of the physical address

● Techniques such as index hashing can reduce the prevalence of hot sets/banks

– Use more bits from the physical address to generate an XOR hash that is then used to select the cache set/bank

● Can be problems:

– Only a limited set of bits are available to generate the XOR hash

– L1 cache inclusion requirements may limit use of index hashing in the L2 cache
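The XOR index hashing idea above can be sketched as follows (the set count, line size, and choice of hash bits are illustrative assumptions):

```python
# Sketch of XOR index hashing for cache set/bank selection. Instead
# of using the low-order index bits of the physical address directly,
# XOR them with a higher-order field so that strided access patterns
# that pile onto one hot set get spread across the cache.

def conventional_index(addr, num_sets=1024, line_bytes=64):
    return (addr // line_bytes) % num_sets

def hashed_index(addr, num_sets=1024, line_bytes=64):
    line = addr // line_bytes
    low = line % num_sets                  # conventional index bits
    high = (line // num_sets) % num_sets   # next field of the address
    return low ^ high

# A stride of num_sets * line_bytes maps every access to the same hot
# set with conventional indexing, but to distinct sets after hashing.
addrs = [i * 1024 * 64 for i in range(8)]
print({conventional_index(a) for a in addrs})   # -> {0}  (one hot set)
print(sorted(hashed_index(a) for a in addrs))   # -> [0, 1, 2, 3, 4, 5, 6, 7]
```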

● Significant room for further improvements while satisfying constraints on L1 and L2 set/bank selection

Offchip Bandwidth

● A core's bandwidth requirements can increase significantly when speculative techniques are employed

[Chart: offchip bandwidth normalized to no speculation, 1.0–1.3X, for SPECweb, SPECjbb and Database, under control speculation; control speculation + hardware scouting; and control speculation + hardware scouting + value prediction; 4-core CMP, private 32KB 4-way L1$, shared 2MB 4-way L2$]

● Offchip bandwidth limitations may curtail the number and aggressiveness of cores in future CMT processors

Offchip Bandwidth: Potential Solutions

● Reduce offchip traffic by compressing the onchip caches

– Achieves the reduced miss rate associated with a larger cache at a significantly lower cost

– Reduction in miss rate needs to be carefully balanced against increased latency resulting from decompression overheads [favors adaptive schemes]

● Attempt to do more with the available BW

– Compress the off-chip traffic

– Exploit silence to minimize writeback traffic
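A minimal sketch of exploiting silence: a "silent" writeback is one where the dirty line's contents match what memory already holds, so the offchip write can be dropped. (A real design would track silence with extra per-line or per-word state rather than by re-reading memory; the check below is an assumption for illustration.)

```python
# Silent-writeback elimination: before spending offchip bandwidth on
# a dirty-line writeback, check whether the line actually changed.
# If every byte matches memory's copy, the writeback is "silent" and
# can be dropped without losing any data.

def writeback_needed(dirty_line: bytes, memory_copy: bytes) -> bool:
    return dirty_line != memory_copy

line_in_memory = bytes(64)                 # 64 zero bytes
silent_store   = bytes(64)                 # rewrote the same values
real_store     = bytes([1]) + bytes(63)    # first byte changed

print(writeback_needed(silent_store, line_in_memory))  # -> False
print(writeback_needed(real_store, line_in_memory))    # -> True
```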

Compiler Optimizations

● Compilers can still make a significant difference to application performance on CMT processors

● Optimizing compilers currently perform extensive loop unrolling and aggressive software prefetching

– Perfect optimizations for traditional resource-rich single-core processors

● Certain aggressive optimizations may be less well suited to CMT processors where each strand has more modest resources

– May be advantageous to optimize for reduced code footprint

● Of course, anything compilers can do to expose additional TLP is very welcome!

Concluding Remarks (1)

● A variety of pressing factors are bringing about a wholesale shift to Chip Multithreaded (CMT) designs

● CMT designs support many HW strands via efficient sharing of on-chip resources

● CMT designs are rapidly evolving
– First-generation CMT processors were rapidly derived from existing uniprocessor designs
– The current generation supports sharing of the L2 caches
– Next-generation CMT processors are specifically designed for the CMT design point

Concluding Remarks (2)

● CMT processors are a perfect match for thread-rich commercial applications

● Need to begin exploiting speculative parallelism approaches for applications with limited TLP

● Resource sharing and the resultant inter-strand interactions give rise to many design challenges
– Speculative prefetching, request prioritization, hot sets, hot banks, BW limitations, to name but a few...

CMT processors present an exciting range of opportunities and challenges

Questions?

[email protected]

Backup Slides

Request Prioritization on CMTs

● Prefetches can delay other strands' demand fetches

[Chart: cumulative % of offchip misses (20–100%) vs. miss latency normalized to unloaded latency (1.0–2.0), with no speculation and with hardware scouting; 4-core CMP, private 32KB 4-way L1$, shared 2MB 4-way L2$. Miss latency increases under hardware scouting]

Potential Solution

● The L2 cache should give higher priority to demand fetches
– However, prefetches should not be dropped, just delayed until the resource is idle

● More sophisticated schemes may yield improved performance
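The prioritization above can be sketched as a two-level request queue in which demand fetches always go first and prefetches are retained rather than dropped (an illustrative model, not Sun's implementation):

```python
# Sketch of demand-over-prefetch prioritization at a shared L2 /
# memory-controller queue: demand fetches are always serviced first;
# prefetches are deferred, never discarded, and drain when the
# resource would otherwise be idle.

import heapq
import itertools

DEMAND, PREFETCH = 0, 1          # lower value = higher priority
_counter = itertools.count()     # preserves FIFO order within a level

class RequestQueue:
    def __init__(self):
        self._heap = []

    def enqueue(self, kind, addr):
        heapq.heappush(self._heap, (kind, next(_counter), addr))

    def next_request(self):
        kind, _, addr = heapq.heappop(self._heap)
        return kind, addr

q = RequestQueue()
q.enqueue(PREFETCH, 0x1000)   # strand A scouting ahead
q.enqueue(DEMAND,   0x2000)   # strand B misses for real
q.enqueue(PREFETCH, 0x3000)

order = [q.next_request() for _ in range(3)]
print(order)  # demand fetch first, then the deferred prefetches in order
```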