Chip Multithreading: Opportunities and Challenges
International Symposium on High-Performance Computer Architecture, Feb 15th 2005
Lawrence Spracklen & Santosh G. Abraham
Advanced Processor Architecture, Sun Microsystems

Overview
● The case for Chip Multithreaded (CMT) Processors
● Evolution of CMT Processors
● CMT Design Space
● CMT Challenges
● Conclusions
The Case for CMT Processors

Diminishing Returns
● Dramatic gains have been achieved in single-threaded performance in recent years
● Achieved using a variety of microarchitectural techniques
– E.g. superscalar issue, out-of-order issue, on-chip caches & deeper pipelines
● Recent attempts to continue leveraging these techniques have not led to demonstrably better performance
● Power and memory latency considerations introduce increasingly insurmountable obstacles to improving single-thread performance
– Several high-frequency designs have recently been abandoned
Power Consumption
● Power consumption has an almost cubic dependence on core frequency
● Processors already pushing the limits of power dissipation
● For applications with sufficient threads:
– Double the number of concurrent threads: ×2 power
– Halve the frequency of each thread: ×1/8 power (cubic dependence)
– Result: equivalent performance at ¼ of the power
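The arithmetic behind this slide can be sketched with a toy model, assuming (as the slide does) that power scales cubically with frequency and throughput scales linearly with frequency and thread count:

```python
# Toy model of the CMT power/performance trade-off, assuming P ~ f^3
# (voltage scales with frequency) and throughput ~ threads * frequency.

def relative_power(freq_scale, num_threads):
    """Total chip power relative to one thread at full frequency."""
    return num_threads * freq_scale ** 3

def relative_throughput(freq_scale, num_threads):
    """Aggregate throughput, assuming perfect thread-level parallelism."""
    return num_threads * freq_scale

# Double the threads, halve the frequency:
cmt_power = relative_power(0.5, 2)       # 2 * (1/2)^3 = 1/4
cmt_perf = relative_throughput(0.5, 2)   # 2 * (1/2)   = 1

print(cmt_perf)   # 1.0  -> equivalent performance
print(cmt_power)  # 0.25 -> a quarter of the power
```

This only holds when the workload actually provides enough independent threads, which is exactly the caveat on the slide.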
Offchip Bandwidth
● While offchip bandwidth has increased, so have offchip latencies
● Need to sustain over 100 in-flight requests to fully utilize the available bandwidth
● Difficult to achieve with a single core
– Even an aggressive OOO processor generates less than 2 parallel requests on typical server workloads
● A large number of concurrently executing threads are required to achieve high bandwidth utilization
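The "over 100 in-flight requests" figure above can be sanity-checked with Little's law (outstanding requests = throughput × latency). The bandwidth, latency, and line-size numbers below are illustrative assumptions, not figures from the slides:

```python
# Little's-law estimate of the concurrency needed to saturate the
# memory interface: outstanding requests = request rate * miss latency.

def inflight_requests_needed(bandwidth_bytes_per_s, latency_s, line_bytes):
    """Requests that must be in flight to keep the offchip bus busy."""
    lines_per_second = bandwidth_bytes_per_s / line_bytes
    return lines_per_second * latency_s

# Assumed: 20 GB/s offchip bandwidth, 400 ns miss latency, 64B lines.
needed = inflight_requests_needed(20e9, 400e-9, 64)
print(round(needed))  # 125 -> comfortably over 100 in-flight requests
```

Since a single aggressive out-of-order core sustains only a handful of parallel misses on server workloads, many concurrent strands are needed to reach this level of memory-level parallelism.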
Server Workloads
● Server workloads are characterized by:
– High levels of thread-level parallelism (TLP)
– Low instruction-level parallelism (ILP)
● Limited opportunity to boost single-thread performance
– Majority of CPI is a result of offchip misses
● Couple high TLP in the application domain with support for multiple threads of execution (strands) on a processor chip
Chip Multithreaded (CMT) processors support many simultaneous hardware strands
CMT Processors
● CMT processors support many simultaneous hardware strands via a combination of techniques:
1) Support for multiple cores (Chip Multiprocessors (CMP))
2) Simultaneous Multi-Threading (SMT)
● Strands spend a large amount of time stalled waiting for offchip misses
– SMT enables multiple strands to share many of the resources within the core
● SMT increases utilization of key resources within a core
● CMP allows multiple cores to share resources such as the L2 cache and offchip bandwidth
Evolution of CMT Processors

Evolution of CMTs
● Given the exponential increase in transistors per chip over time, chip level multiprocessors have long been predicted...
● [Olukotun & the Stanford Hydra CMP, 1996]
– 4 MIPS-based cores on a single chip
● [DEC/Compaq Piranha, 2000]
– 8 simple Alpha cores sharing an onchip L2$
● [Sun Microsystem's MAJC Architecture, 1995]
– Provided well-defined support for both CMP and SMT processors
● [Sun Microsystem's MAJC-5200, 1999]
– A dual-core CMT processor with cores sharing an L1 D$
An Evolutionary Approach to CMT Design
● A new process technology essentially doubles the transistor budget
● An attractive proposition for an evolving CMT design is to double the number of cores every generation
– Little redesign effort is expended on the cores
– Doubles performance every process generation (with sufficient TLP)
● The design philosophy behind the initial generations of CMT processors
1st Generation CMTs
● 2 cores per chip
● Cores derived from earlier uniprocessor designs
● Cores do not share any resources, except offchip data paths
(Slide diagram: two cores, each with a private L2 cache, connected via a crossbar switch to the memory controller.)
● Examples:
– Sun's Gemini processor: a dual-core UltraSPARC-II derivative
– Sun's UltraSPARC-IV processor: a dual-core UltraSPARC-III derivative
– AMD's upcoming dual-core Opteron
– Intel's upcoming dual-core Itanium
– Intel's upcoming dual-core Xeon

2nd Generation CMTs
● 2 or more cores per chip
● Cores still derived from earlier uniprocessor designs
● Cores now share the L2 cache
– Advantageous as most commercial applications have significant instruction footprints
– Speeds inter-core communication
(Slide diagram: cores sharing an L2 cache via a crossbar switch, backed by the memory controller.)
● Examples:
– Sun's UltraSPARC-IV+ processor: a dual-core UltraSPARC-III derivative
– IBM's Power4/5 processor: a dual-core Power processor
– Fujitsu's Sparc64 VI processor: a dual-core SPARC
CMT Problems?
● Re-using existing core designs may not scale beyond a couple of generations:
– Power: to restrain total power consumption, power dissipation of each core must be halved in each generation
● Unlikely to be achievable by voltage scaling, or even clock gating and frequency scaling
– Offchip bandwidth: to maintain the same offchip BW per core, total BW must double each generation
● Can increase BW by devoting additional pins or increasing BW per pin
● Maximum number of pins only increasing at ~10% per generation
● Need a new approach to CMT design
3rd Generation CMTs
● Future CMTs need to satisfy these power and bandwidth constraints while delivering ever-increasing performance
● CMT processors are likely to be designed from the ground up
– Entire design optimized for a CMT design point
● Multiple cores per chip
(Slide diagram: eight 4-way SMT cores connected via a crossbar switch to a shared L2 cache and memory controller.)
● Examples:
– Sun's Niagara processor: 8 four-way SMT cores
CMT Design Space

Number of Cores
For a fixed area and power budget:
● Decision is between small number of aggressive OOO cores, or multiple simple cores
● For workloads with sufficient TLP, the simpler core solution may deliver superior performance at a fraction of the power
– E.g. Niagara: 8 four-way SMT cores; each core single-issue, short pipeline
● What about applications with limited TLP?
– Leverage speculative parallelism
– Employ heterogeneous cores: a single aggressive core can be provided for single-threaded applications
Resource Sharing
● A core requires several chip-level resources
● Register file
● Execution pipeline
● Floating-point units
● Branch predictors
● L1 I-cache
● L1 D-cache
● L2 cache
● Currently, cores in advanced CMTs only share the L2 cache
● Opportunity to share a much wider range of chip-level resources
– Cores in the MAJC-5200 processor share an L1 D-cache
– Cores could share floating-point units
● CMT processors can begin to include a variety of new [expensive] shared resources
– Cost amortized over multiple cores

Hardware Accelerators
● On-chip hardware accelerators become increasingly attractive in the CMT space
– Cost is amortized and high utilization is achieved
● On-chip accelerators can provide a range of benefits
– Increased performance on certain specialized tasks
– A more power-efficient approach
– Offload of more mundane tasks from the general-purpose cores
– Frequently significantly more efficient than offchip accelerators
● A variety of different onchip accelerators can be envisaged
– Cryptographic accelerators
– Network offload engines
– XML parsing
– OS functionality, e.g. memcopy

CMT Challenges
CMT Challenges
● Attention has focussed on improving single-thread performance using a variety of speculative techniques
● Aggressive speculative techniques can consume significant resources for even modest performance improvements
● This philosophy is compatible with a single-core design
– During periods of extensive speculation, the resources are otherwise underutilized
● In CMT processors, resources are shared between strands
● Need to ensure strands are not deprived of resources
● Strands need to be “good neighbors”
● Policies should focus on maximizing overall performance
Prefetching
● Instruction prefetching can significantly increase performance of commercial workloads on CMT processors
(Slide chart: potential performance improvement, normalized, from eliminating instruction misses for Database, TPC-W, SPECjAppServer, and SPECweb workloads; 4-core CMP, private 32KB 4-way L1$, shared 2MB 4-way L2$.)
● Potential to increase performance by up to 1.5X by mitigating the effects of instruction misses

Prefetching on CMTs
● Speculative prefetching strategies are still important for CMT processors
● Algorithms need to be tailored to the CMT design point
● In a single-strand system, the entire system is idle on a miss
– Aggressive prefetching may be very effective
● In a CMT processor, a less aggressive strategy may be preferable
– The harmful side effects of mis-speculation are more pronounced
● One strand's speculation can impact other strands
– Evicting their lines from the caches or monopolizing shared resources
● Prefetch accuracy is more critical
Request Prioritization
● Hardware scouting generates a highly-accurate stream of prefetches
– Pollution not a major concern
● In a single-core design, the processor is otherwise stalled during scouting
– The prefetches don't impact the timeliness of the demand fetches
● In a CMT processor, one strand may be scouting while others are issuing demand fetches
– One strand's speculative requests may delay another strand's demand fetches
– Optimizing for peak single-strand performance may not result in optimal aggregate (chip-wide) performance
● Require speculative strategies that leverage idle resources and defer to demand fetches

Hot Sets
● Hot sets result when many heavily accessed physical addresses map to the same cache set
– Results in thrashing and a significant increase in conflict misses
● Multiple strands now sharing the L2 cache
● Traditionally addressed by increasing the associativity of the cache
● Not feasible to increase the cache associativity in line with the number of sharers in CMT processors
● Require a more elegant way to reduce hot sets...
Hot Banks
● CMT processors typically support write-through L1 caches
– Makes coherency protocols simpler
● As a result, stores and L1 cache load and instruction misses all pressure the L2 cache
● To provide sufficient bandwidth into the L2, the cache is heavily banked
● Hot banks occur when certain banks are accessed much more heavily than others
– Can have a significant performance impact
Hot Sets & Hot Banks: Potential Solution
● Cache set/bank selection is typically based on the low-order bits of the physical address
● Techniques such as index hashing can reduce the prevalence of hot sets/banks
– Use more bits from the physical address to generate an XOR hash that is then used to select the cache set/bank
● But there can be problems:
– Only a limited set of bits are available to generate the XOR hash
– L1 cache inclusion requirements may limit use of index hashing in the L2 cache
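The index-hashing idea above can be sketched in a few lines. The set count and the exact bit ranges XORed together are illustrative assumptions; a real design chooses them under the bit-availability and inclusion constraints just listed:

```python
# Sketch of XOR index hashing for cache set selection, assuming a
# 1024-set cache with 64-byte lines (illustrative parameters).

NUM_SETS = 1024   # 10 index bits
LINE_BITS = 6     # 64-byte lines

def naive_set_index(paddr):
    """Traditional selection: low-order bits above the line offset."""
    return (paddr >> LINE_BITS) % NUM_SETS

def hashed_set_index(paddr):
    """XOR the traditional index with the next 10 higher-order bits,
    spreading addresses that share an index across different sets."""
    low = (paddr >> LINE_BITS) % NUM_SETS
    high = (paddr >> (LINE_BITS + 10)) % NUM_SETS
    return low ^ high

# Eight heavily-accessed addresses, 64 KiB apart, all alias to set 0
# under the naive scheme (a "hot set"); hashing spreads them out.
hot = [i * NUM_SETS * 64 for i in range(8)]
print({naive_set_index(a) for a in hot})        # a single hot set: {0}
print(len({hashed_set_index(a) for a in hot}))  # 8 distinct sets
```

The same transform applies to bank selection, which is why one hash can attack both hot sets and hot banks.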
● Significant room for further improvements while satisfying constraints on L1 and L2 set/bank selection

Offchip Bandwidth
● A core's bandwidth requirements can increase significantly when speculative techniques are employed
(Slide chart: offchip bandwidth, normalized to no speculation, for SPECweb, SPECjbb, and Database workloads, comparing control speculation; control speculation + hardware scouting; and control speculation + hardware scouting + value prediction; 4-core CMP, private 32KB 4-way L1$, shared 2MB 4-way L2$.)
● Offchip bandwidth limitations may curtail the number and aggressiveness of cores in future CMT processors

Offchip Bandwidth: Potential Solutions
● Reduce offchip traffic by compressing the onchip caches
– Achieves the reduced miss rate associated with a larger cache at a significantly lower cost
– Reduction in miss rate needs to be carefully balanced against increased latency resulting from decompression overheads [favors adaptive schemes]
● Attempt to do more with the available BW
– Compress the off-chip traffic
– Exploit silence to minimize writeback traffic
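"Exploiting silence" refers to detecting stores that write the value already present, so the line never becomes dirty and no writeback is needed on eviction. A minimal sketch, with an illustrative (not hardware-accurate) cache-line model:

```python
# Sketch of silent-store detection: a store that does not change the
# line's contents leaves it clean, so its eviction costs no offchip
# writeback traffic. The Line structure is illustrative only.

class Line:
    def __init__(self, data):
        self.data = bytearray(data)
        self.dirty = False

def store(line, offset, value):
    """Mark the line dirty only if the store actually changes it."""
    if line.data[offset] != value:
        line.data[offset] = value
        line.dirty = True  # real update: writeback needed on eviction

line = Line(b"\x00\x01\x02\x03")
store(line, 1, 0x01)   # silent store: value unchanged
print(line.dirty)      # False -> eviction produces no offchip traffic
store(line, 1, 0xFF)   # non-silent store
print(line.dirty)      # True
```

In hardware the comparison happens on the store path (the old value is already in the cache), so the check itself consumes no offchip bandwidth.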
Compiler Optimizations
● Compilers can still make a significant difference to application performance on CMT processors
● Optimizing compilers currently perform extensive loop unrolling and aggressive software prefetching
– Perfect optimizations for traditional resource-rich single-core processors
● Certain aggressive optimizations may be less well suited to CMT processors where each strand has more modest resources
– May be advantageous to optimize for reduced code footprint
● Of course, anything compilers can do to expose additional TLP is very welcome!
Concluding Remarks (1)
● A variety of pressing factors are bringing about a wholesale shift to Chip Multithreaded (CMT) designs
● CMT designs support many HW strands via efficient sharing of on-chip resources
● CMT designs are rapidly evolving
– First-generation CMT processors were rapidly derived from existing uniprocessor designs
– The current generation supports sharing of the L2 cache
– Next-generation CMT processors are specifically designed for the CMT design point
Concluding Remarks (2)
● CMT processors are a perfect match for thread-rich commercial applications
● Need to begin exploiting speculative parallelism approaches for applications with limited TLP
● Resource sharing and the resultant inter-strand interactions give rise to many design challenges
– Speculative prefetching, request prioritization, hot sets, hot banks, BW limitations, to name but a few...
CMT processors present an exciting range of opportunities and challenges
Questions?
[email protected]

Backup Slides

Request Prioritization on CMTs
● Prefetches can delay other strands' demand fetches
(Slide chart: cumulative % of offchip misses vs. miss latency, normalized to unloaded latency, comparing no speculation against hardware scouting; 4-core CMP, private 32KB 4-way L1$, shared 2MB 4-way L2$. Scouting shifts the distribution toward increased miss latency.)
Potential Solution
● L2 cache should give higher priority to demand fetches
– However, prefetches should not be dropped, just delayed until the resource is idle
● More sophisticated schemes may yield improved performance
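The delay-but-don't-drop policy above can be sketched as a two-level request queue in front of the L2. The queue structure and names are illustrative, not a description of any real controller:

```python
# Sketch of demand-over-prefetch prioritization at the L2: demand
# fetches are always serviced first; prefetches are deferred until the
# resource is otherwise idle, but never discarded.

from collections import deque

class L2RequestQueue:
    def __init__(self):
        self.demand = deque()
        self.prefetch = deque()

    def enqueue(self, addr, is_prefetch):
        (self.prefetch if is_prefetch else self.demand).append(addr)

    def next_request(self):
        """Pick the next request to service: demand wins; prefetches
        are delayed, never dropped."""
        if self.demand:
            return self.demand.popleft()
        if self.prefetch:
            return self.prefetch.popleft()
        return None

q = L2RequestQueue()
q.enqueue(0xA0, is_prefetch=True)    # strand 0 is scouting
q.enqueue(0xB0, is_prefetch=False)   # strand 1 takes a demand miss
print(hex(q.next_request()))  # 0xb0 -> demand serviced first
print(hex(q.next_request()))  # 0xa0 -> prefetch issued once idle
```

The "more sophisticated schemes" mentioned above might, for example, age deferred prefetches so stale ones are eventually discarded, or throttle per-strand prefetch occupancy.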