Chip Multithreading: Opportunities and Challenges
International Symposium on High-Performance Computer Architecture, Feb 15th 2005

Lawrence Spracklen & Santosh G. Abraham

Overview

● The case for Chip Multithreaded (CMT) Processors

● Evolution of CMT Processors

● CMT Design Space

● CMT Challenges

● Conclusions

The Case for CMT Processors

Diminishing Returns

● Dramatic gains have been achieved in single-threaded performance in recent years

● Used a variety of microarchitectural techniques, e.g. superscalar issue, out-of-order issue, on-chip caches and deeper pipelines

● Recent attempts to continue leveraging these techniques have not led to demonstrably better performance

● Power and memory latency considerations introduce increasingly insurmountable obstacles to improving single thread performance – Several high frequency designs have recently been abandoned

Power Consumption

● Power consumption has an almost cubic dependence on core frequency

● Processors already pushing the limits of power dissipation

● For applications with sufficient threads:
– Double the number of concurrent threads: power x2
– Halve the frequency of each thread: power ÷8 (cubic dependence)
– Result: equivalent performance at ¼ of the power
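The trade-off above can be sketched as a back-of-the-envelope calculation, assuming (as the previous slide notes) a roughly cubic dependence of power on frequency and throughput that scales linearly with thread count and frequency:

```python
# Simple model of the threads-vs-frequency trade-off on a CMT.
# Assumptions: power ~ threads * frequency^3, and for workloads with
# sufficient TLP, throughput ~ threads * frequency.

def relative_power(threads, freq):
    """Power relative to a 1-thread, frequency-1.0 baseline."""
    return threads * freq ** 3

def relative_performance(threads, freq):
    """Throughput relative to the same baseline."""
    return threads * freq

baseline = (1, 1.0)   # one thread at full frequency
cmt      = (2, 0.5)   # twice the threads at half the frequency

# Same throughput...
assert relative_performance(*cmt) == relative_performance(*baseline)
# ...at a quarter of the power: 2 * (0.5)^3 = 0.25
print(relative_power(*cmt) / relative_power(*baseline))  # -> 0.25
```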

Offchip Bandwidth

● While offchip bandwidth has increased, so have offchip latencies

● Need to sustain over 100 in-flight requests to fully utilize the available bandwidth

● Difficult to achieve with a single core

– Even an aggressive OOO processor generates less than 2 parallel requests on typical server workloads

● A large number of concurrently executing threads are required to achieve high bandwidth utilization
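The 100+ in-flight-requests figure follows from Little's law: outstanding requests = bandwidth × latency ÷ line size. A quick sketch (the bandwidth, latency, and line-size numbers below are illustrative assumptions, not figures from the talk):

```python
# Little's law applied to the memory system: to saturate the pins,
# enough misses must be in flight to cover one full miss latency.
# GB/s * ns conveniently cancels to bytes (1e9 * 1e-9 = 1).

def required_in_flight(bandwidth_gb_per_s, latency_ns, line_bytes=64):
    # bytes that arrive during one miss latency, divided by line size
    return bandwidth_gb_per_s * latency_ns / line_bytes

# e.g. 20 GB/s of offchip bandwidth and a 400 ns miss latency
print(required_in_flight(20, 400))  # -> 125.0 outstanding cache lines
```

With fewer than 2 parallel requests per core (the OOO figure above), covering ~125 outstanding lines clearly requires many concurrent strands.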

Server Workloads

● Server workloads are characterized by:

– High levels of thread-level parallelism (TLP)

– Low instruction-level parallelism (ILP)

● Limited opportunity to boost single-thread performance

– Majority of CPI is a result of offchip misses

● Couple high TLP in the application domain with support for multiple threads of execution (strands) on a processor chip

Chip Multithreaded (CMT) processors support many simultaneous hardware strands

CMT Processors

● CMT processors support many simultaneous hardware strands via a combination of techniques:
1) Support for multiple cores (Chip Multiprocessing (CMP))
2) Simultaneous Multi-Threading (SMT)

● Strands spend a large amount of time stalled waiting for offchip misses

– SMT enables multiple strands to share many of the resources within the core

● SMT increases utilization of key resources within a core

● CMP allows multiple cores to share resources such as the L2 cache and offchip bandwidth

Evolution of CMT Processors

Evolution of CMTs

● Given the exponential increase in transistors per chip over time, chip level multiprocessors have long been predicted...

● [Olukotun & the Stanford Hydra CMP, 1996]

– 4 MIPS-based cores on a single chip

● [DEC/Compaq Piranha, 2000]

– 8 simple Alpha cores sharing an onchip L2$

● [Sun Microsystems' MAJC Architecture, 1995]

– Provided well-defined support for both CMP and SMT processors

● [Sun Microsystems' MAJC-5200, 1999]

– A dual-core CMT processor with cores sharing an L1 D$

An Evolutionary Approach to CMT Design

● A new process technology essentially doubles the transistor budget

● An attractive proposition for an evolving CMT design is to double the number of cores every generation

– Little redesign effort is expended on the cores

– Doubles performance every process generation (with sufficient TLP)

● The design philosophy behind the initial generations of CMT processors

1st Generation CMTs

● 2 cores per chip

● Cores derived from earlier uniprocessor designs

● Cores do not share any resources, except off-chip data paths

[Diagram: CPU0 and CPU1, each with a private L2 cache, connected via a crossbar switch to the memory controller]

Examples:
– Sun's Gemini processor: a dual-core UltraSPARC-II derivative
– Sun's UltraSPARC-IV processor: a dual-core UltraSPARC-III derivative
– AMD's upcoming dual-core Opteron
– Intel's upcoming dual-core Itanium
– Intel's upcoming dual-core Xeon

2nd Generation CMTs

● 2 or more cores per chip

● Cores still derived from earlier uniprocessor designs

● Cores now share the L2 cache
– Advantageous as most commercial applications have significant instruction footprints
– Speeds inter-core communication

[Diagram: CPU0 and CPU1 connected via a crossbar switch to a shared L2 cache and the memory controller]

Examples:
– Sun's UltraSPARC-IV+ processor: a dual-core UltraSPARC-III derivative
– IBM's Power4/5 processor: a dual-core Power processor
– Fujitsu's Sparc64 VI processor: a dual-core SPARC

CMT Problems?

● Re-using existing core designs may not scale beyond a couple of generations:
– Power: to restrain total power consumption, the power dissipation of each core must be halved in each generation
– Unlikely to be achievable by voltage scaling, or even by clock gating and frequency scaling
– Offchip bandwidth: to maintain the same offchip BW per core, total BW must double each generation
– Can increase BW by devoting additional pins or by increasing BW per pin
– However, the maximum number of pins is only increasing at 10% per generation

● Need a new approach to CMT design

3rd Generation CMTs

● Future CMTs need to satisfy these power and bandwidth constraints while delivering ever-increasing performance

● CMT processors are likely to be designed from the ground up
– Entire design optimized for a CMT design point

● Multiple cores per chip

[Diagram: eight 4-way SMT cores connected via a crossbar switch to a shared L2 cache and the memory controller]

Examples:
– Sun's Niagara processor:

● 8 four-way SMT cores

CMT Design Space

Number of Cores

For a fixed area and power budget:

● Decision is between small number of aggressive OOO cores, or multiple simple cores

● For workloads with sufficient TLP, the simpler-core solution may deliver superior performance at a fraction of the power
– e.g. Niagara: 8 four-way SMT cores; each core single-issue, with a short pipeline

● What about applications with limited TLP?
– Leverage speculative parallelism
– Employ heterogeneous cores: a single aggressive core can be provided for single-threaded applications

Resource Sharing

● A core requires several chip-level resources

● Branch predictors
● Execution pipeline
● Floating-point units
● L1 I-cache
● L1 D-cache
● L2 cache

● Currently, cores in advanced CMTs only share the L2 cache

● Opportunity to share a much wider range of chip-level resources
– Cores in the MAJC-5200 processor share an L1 D-cache
– Cores could share floating-point units

● CMT processors can begin to include a variety of new [expensive] shared resources
– Cost amortized over multiple cores

Hardware Accelerators

● On-chip hardware accelerators become increasingly attractive in the CMT space
– Cost is amortized and high utilization is achieved

● On-chip accelerators can provide a range of benefits
– Increased performance on certain specialized tasks
– A more power-efficient approach
– Offload of more mundane tasks from the general-purpose cores
– Frequently significantly more efficient than offchip accelerators

● A variety of different onchip accelerators can be envisaged
– Cryptographic accelerators
– Network offload engines
– XML parsing
– OS functionality, e.g. memcopy

CMT Challenges

CMT Challenges

● Attention has focused on improving single-thread performance using a variety of speculative techniques

● Aggressive speculative techniques can consume significant resources for even modest performance improvements

● This philosophy is compatible with a single-core design
– During periods of extensive speculation, the resources are otherwise underutilized

● In CMT processors, resources are shared between strands

● Need to ensure strands are not deprived of resources

● Strands need to be “good neighbors”

● Policies should focus on maximizing overall performance

Prefetching

● Instruction prefetching can significantly increase performance of commercial workloads on CMT processors

[Chart: potential performance improvement from instruction prefetching, 1.0X–1.5X, for Database, TPC-W, SPECjAppServer and SPECweb; 4-core CMP, private 32KB 4-way L1$, shared 2MB 4-way L2$]

● Potential to increase performance by up to 1.5X by mitigating the effects of instruction misses

Prefetching on CMTs

● Speculative prefetching strategies are still important for CMT processors

● Algorithms need to be tailored to the CMT design point

● In a single-strand system, the entire system is idle on a miss
– Aggressive prefetching may be very effective

● In a CMT processor, a less aggressive strategy may be preferable
– The harmful side effects of mis-speculation are more pronounced

● One strand's speculation can impact other strands
– Evicting their lines from the caches or monopolizing shared resources

● Prefetch accuracy is more critical
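One way these ideas could be realized is a feedback-directed throttle that scales back prefetch aggressiveness when measured accuracy drops; the class, thresholds, and degrees below are illustrative assumptions, not a mechanism from the talk:

```python
# Hypothetical feedback-directed prefetch throttle for a CMT: when a
# strand's prefetches turn out to be mostly pollution, lower its
# prefetch degree so it stops evicting other strands' lines from the
# shared L2. Thresholds (0.4/0.8) and degrees are invented for the
# sketch.

class PrefetchThrottle:
    def __init__(self, max_degree=4):
        self.degree = max_degree       # lines fetched ahead per miss
        self.max_degree = max_degree
        self.issued = 0                # prefetches issued this window
        self.useful = 0                # ...that were later referenced

    def record(self, prefetch_was_used):
        self.issued += 1
        self.useful += prefetch_was_used

    def accuracy(self):
        return self.useful / self.issued if self.issued else 1.0

    def adjust(self):
        """Called periodically, e.g. at the end of a sampling window."""
        if self.accuracy() < 0.4:            # mostly pollution: back off
            self.degree = max(1, self.degree // 2)
        elif self.accuracy() > 0.8:          # accurate: ramp back up
            self.degree = min(self.max_degree, self.degree + 1)
        self.issued = self.useful = 0        # start a new window

t = PrefetchThrottle()
for used in [True, False, False, False, False]:   # 20% accuracy
    t.record(used)
t.adjust()
print(t.degree)  # -> 2 (halved from 4)
```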

Request Prioritization

● Hardware scouting generates a highly-accurate stream of prefetches
– Pollution is not a major concern

● In a single-core design, the processor is otherwise stalled during scouting
– The prefetches don't impact the timeliness of the demand fetches

● In a CMT processor, one strand may be scouting while others are issuing demand fetches
– One strand's speculative requests may delay another strand's demand fetches
– Optimizing for peak single-strand performance may not result in optimal aggregate (chip-wide) performance

● Require speculative strategies that leverage idle resources and defer to demand fetches

Hot Sets

● Hot sets result when many heavily accessed physical addresses map to the same cache set

– Results in thrashing and a significant increase in conflict misses

● Multiple strands now sharing the L2 cache

● Traditionally addressed by increasing the associativity of the cache

● Not feasible to increase cache associativity in line with the number of sharers in CMT processors

● Require a more elegant way to reduce hot sets...

Hot Banks

● CMT processors typically support write-through L1 caches
– Makes coherency protocols simpler

● As a result, stores and L1 cache load and instruction misses all pressure the L2 cache

● To provide sufficient bandwidth into the L2, the cache is heavily banked

● Hot banks occur when certain banks are accessed much more heavily than others
– Can have a significant performance impact

Hot Sets & Hot Banks: Potential Solution

● Cache set/bank selection is typically based on the low-order bits of the physical address

● Techniques such as index hashing can reduce the prevalence of hot sets/banks

– Use more bits from the physical address to generate an XOR hash that is then used to select the cache set/bank

● Can be problems:

– Only a limited set of bits are available to generate the XOR hash

– L1 cache inclusion requirements may limit use of index hashing in the L2 cache
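The XOR index hashing idea above can be sketched as follows (the set count, line size, and choice of hash bits are illustrative assumptions):

```python
# Sketch of XOR index hashing for cache set/bank selection. Instead
# of using the low-order index bits of the physical address directly,
# XOR them with a higher-order field so that strided access patterns
# that pile onto one hot set get spread across the cache.

def conventional_index(addr, num_sets=1024, line_bytes=64):
    return (addr // line_bytes) % num_sets

def hashed_index(addr, num_sets=1024, line_bytes=64):
    line = addr // line_bytes
    low = line % num_sets                  # conventional index bits
    high = (line // num_sets) % num_sets   # next field of the address
    return low ^ high

# A stride of num_sets * line_bytes maps every access to the same hot
# set with conventional indexing, but to distinct sets after hashing.
addrs = [i * 1024 * 64 for i in range(8)]
print({conventional_index(a) for a in addrs})   # -> {0}  (one hot set)
print(sorted(hashed_index(a) for a in addrs))   # -> [0, 1, 2, 3, 4, 5, 6, 7]
```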

● Significant room for further improvements while satisfying constraints on L1 and L2 set/bank selection

Offchip Bandwidth

● A core's bandwidth requirements can increase significantly when speculative techniques are employed

[Chart: offchip bandwidth normalized to no speculation, 1.0–1.3X, for SPECweb, SPECjbb and Database, under control speculation; control speculation + hardware scouting; and control speculation + hardware scouting + value prediction; 4-core CMP, private 32KB 4-way L1$, shared 2MB 4-way L2$]

● Offchip bandwidth limitations may curtail the number and aggressiveness of cores in future CMT processors

Offchip Bandwidth: Potential Solutions

● Reduce offchip traffic by compressing the onchip caches

– Achieves the reduced miss rate associated with a larger cache at a significantly lower cost

– Reduction in miss rate needs to be carefully balanced against increased latency resulting from decompression overheads [favors adaptive schemes]

● Attempt to do more with the available BW

– Compress the off-chip traffic

– Exploit silence to minimize writeback traffic
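A minimal sketch of exploiting silence: a "silent" writeback is one where the dirty line's contents match what memory already holds, so the offchip write can be dropped. (A real design would track silence with extra per-line or per-word state rather than by re-reading memory; the check below is an assumption for illustration.)

```python
# Silent-writeback elimination: before spending offchip bandwidth on
# a dirty-line writeback, check whether the line actually changed.
# If every byte matches memory's copy, the writeback is "silent" and
# can be dropped without losing any data.

def writeback_needed(dirty_line: bytes, memory_copy: bytes) -> bool:
    return dirty_line != memory_copy

line_in_memory = bytes(64)                 # 64 zero bytes
silent_store   = bytes(64)                 # rewrote the same values
real_store     = bytes([1]) + bytes(63)    # first byte changed

print(writeback_needed(silent_store, line_in_memory))  # -> False
print(writeback_needed(real_store, line_in_memory))    # -> True
```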

Compiler Optimizations

● Compilers can still make a significant difference to application performance on CMT processors

● Optimizing compilers currently perform extensive loop unrolling and aggressive software prefetching

– Perfect optimizations for traditional resource-rich single-core processors

● Certain aggressive optimizations may be less well suited to CMT processors where each strand has more modest resources

– May be advantageous to optimize for reduced code footprint

● Of course, anything compilers can do to expose additional TLP is very welcome!

Concluding Remarks (1)

● A variety of pressing factors are bringing about a wholesale shift to Chip Multithreaded (CMT) designs

● CMT designs support many HW strands via efficient sharing of on-chip resources

● CMT designs are rapidly evolving
– First-generation CMT processors were rapidly derived from existing uniprocessor designs
– The current generation supports sharing of the L2 caches
– Next-generation CMT processors are specifically designed for the CMT design point

Concluding Remarks (2)

● CMT processors are a perfect match for thread-rich commercial applications

● Need to begin exploiting speculative parallelism approaches for applications with limited TLP

● Resource sharing and the resultant inter-strand interactions give rise to many design challenges
– Speculative prefetching, request prioritization, hot sets, hot banks, BW limitations, to name but a few...

CMT processors present an exciting range of opportunities and challenges

Questions?

[email protected]

Backup Slides

Request Prioritization on CMTs

● Prefetches can delay other strands' demand fetches

[Chart: cumulative % of offchip misses (20–100%) vs. miss latency normalized to unloaded latency (1.0–2.0), with no speculation and with hardware scouting; 4-core CMP, private 32KB 4-way L1$, shared 2MB 4-way L2$. Miss latency increases under hardware scouting]

Potential Solution

● The L2 cache should give higher priority to demand fetches
– However, prefetches should not be dropped, just delayed until the resource is idle

● More sophisticated schemes may yield improved performance
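The prioritization above can be sketched as a two-level request queue in which demand fetches always go first and prefetches are retained rather than dropped (an illustrative model, not Sun's implementation):

```python
# Sketch of demand-over-prefetch prioritization at a shared L2 /
# memory-controller queue: demand fetches are always serviced first;
# prefetches are deferred, never discarded, and drain when the
# resource would otherwise be idle.

import heapq
import itertools

DEMAND, PREFETCH = 0, 1          # lower value = higher priority
_counter = itertools.count()     # preserves FIFO order within a level

class RequestQueue:
    def __init__(self):
        self._heap = []

    def enqueue(self, kind, addr):
        heapq.heappush(self._heap, (kind, next(_counter), addr))

    def next_request(self):
        kind, _, addr = heapq.heappop(self._heap)
        return kind, addr

q = RequestQueue()
q.enqueue(PREFETCH, 0x1000)   # strand A scouting ahead
q.enqueue(DEMAND,   0x2000)   # strand B misses for real
q.enqueue(PREFETCH, 0x3000)

order = [q.next_request() for _ in range(3)]
print(order)  # demand fetch first, then the deferred prefetches in order
```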