
Chip Multithreading: Opportunities and Challenges

International Symposium on High-Performance Computer Architecture
Feb 15th 2005

Lawrence Spracklen & Santosh G. Abraham
Advanced Processor Architecture
Sun Microsystems

Overview
● The case for Chip Multithreaded (CMT) Processors
● Evolution of CMT Processors
● CMT Design Space
● CMT Challenges
● Conclusions

The case for CMT Processors

Diminishing Returns
● Dramatic gains have been achieved in single-threaded performance in recent years
● Achieved using a variety of microarchitectural techniques, e.g. superscalar issue, out-of-order issue, on-chip caches & deeper pipelines
● Recent attempts to continue leveraging these techniques have not led to demonstrably better performance
● Power and memory latency considerations introduce increasingly insurmountable obstacles to improving single-thread performance
  – Several high-frequency designs have recently been abandoned

Power Consumption
● Power consumption has an almost cubic dependence on core frequency
● Processors are already pushing the limits of power dissipation
● For applications with sufficient threads:
  – Double the number of concurrent threads: power ×2
  – Halve the frequency of the threads: per-thread power ÷8 (cubic dependence)
  – Result: equivalent performance at ¼ of the power (2 × 1/8)

Offchip Bandwidth
● While offchip bandwidth has increased, so have offchip latencies
● Need to sustain over 100 in-flight requests to fully utilize the available bandwidth
● Difficult to achieve with a single core
  – Even an aggressive OOO processor generates fewer than 2 parallel requests on typical server workloads
● A large number of concurrently executing threads is required to achieve high bandwidth utilization

Server Workloads
● Server workloads are characterized by:
  – High levels of thread-level parallelism (TLP)
  – Low instruction-level parallelism (ILP)
● Limited opportunity to boost single-thread performance
  – The majority of CPI is a result of offchip misses
● Couple high TLP in the application domain with support for multiple threads of execution (strands) on a processor chip
● Chip Multithreaded (CMT) processors support many simultaneous hardware strands

CMT Processors
● CMT processors support many simultaneous hardware strands via a combination of techniques:
  1) Support for multiple cores (Chip Multiprocessors (CMP))
  2) Simultaneous Multi-Threading (SMT)
● Strands spend a large amount of time stalled waiting for offchip misses
  – SMT enables multiple strands to share many of the resources within the core
● SMT increases utilization of key resources within a core
● CMP allows multiple cores to share resources such as the L2 cache and offchip bandwidth

Evolution of CMT Processors

Evolution of CMTs
● Given the exponential increase in transistors per chip over time, chip-level multiprocessors have long been predicted...
● [Olukotun & the Stanford Hydra CMP, 1996]
  – 4 MIPS-based cores on a single chip
● [DEC/Compaq Piranha, 2000]
  – 8 simple Alpha cores sharing an onchip L2$
● [Sun Microsystems' MAJC Architecture, 1995]
  – Provided well-defined support for both CMP and SMT processors
● [Sun Microsystems' MAJC-5200, 1999]
  – A dual-core CMT processor with cores sharing an L1 D$

An Evolutionary Approach to CMT Design
● A new process technology essentially doubles the transistor budget
● An attractive proposition for an evolving CMT design is to double the number of cores every generation
  – Little redesign effort is expended on the cores
  – Doubles performance every process generation (with sufficient TLP)
● This was the design philosophy behind the initial generations of CMT processors

1st Generation CMTs
[Diagram: two cores (CPU0, CPU1), each with a private L2 cache, connected through a crossbar switch to the memory controller]
● 2 cores per chip
● Cores derived from earlier uniprocessor designs
● Cores do not share any resources, except off-chip data paths
● Examples:
  – Sun's Gemini processor: a dual-core UltraSPARC-II derivative
  – Sun's UltraSPARC-IV processor: a dual-core UltraSPARC-III derivative
  – AMD's upcoming dual-core Opteron
  – Intel's upcoming dual-core Itanium
  – Intel's upcoming dual-core Xeon

2nd Generation CMTs
[Diagram: two cores (CPU0, CPU1) connected through a crossbar switch to a shared L2 cache and the memory controller]
● 2 or more cores per chip
● Cores still derived from earlier uniprocessor designs
● Cores now share the L2 cache
  – Advantageous as most commercial applications have significant instruction footprints
  – Speeds inter-core communication
● Examples:
  – Sun's UltraSPARC-IV+ processor: a dual-core UltraSPARC-III derivative
  – IBM's Power4/5 processor: a dual-core Power processor
  – Fujitsu's Sparc64 VI processor: a dual-core SPARC

CMT Problems?
● Re-using existing core designs may not scale beyond a couple of generations
● Power: to restrain total power consumption, the power dissipation of each core must be halved in each generation
  – Unlikely to be achievable by voltage scaling, or even by clock gating and frequency scaling
● Offchip bandwidth: to maintain the same offchip BW per core, total BW must double each generation
  – BW can be increased by devoting additional pins or by increasing the BW per pin
  – But the maximum number of pins is only increasing at ~10% per generation
● Need a new approach to CMT design

3rd Generation CMTs
[Diagram: eight 4-way SMT cores (CPU0–CPU7) connected through a crossbar switch to a shared L2 cache and the memory controller]
● Future CMTs need to satisfy these power and bandwidth constraints while delivering ever-increasing performance
● CMT processors are likely to be designed from the ground up
  – The entire design is optimized for a CMT design point
● Multiple cores per chip
● Example:
  – Sun's Niagara processor: 8 four-way SMT cores

CMT Design Space

Number of Cores
● For a fixed area and power budget, the decision is between a small number of aggressive OOO cores or multiple simple cores
● For workloads with sufficient TLP, the simpler-core solution may deliver superior performance at a fraction of the power
  – e.g. Niagara: 8 4-way SMT cores, each core single-issue with a short pipeline
● What about applications with limited TLP?
  – Leverage speculative parallelism
  – Employ heterogeneous cores: a single aggressive core can be provided for single-threaded applications

Resource Sharing
● A core requires several chip-level resources:
  – Register file, branch predictors, execution pipeline, floating-point units, L1 I-cache, L1 D-cache, L2 cache
● Currently, cores in advanced CMTs only share the L2 cache
● Opportunity to share a much wider range of chip-level resources
  – Cores in the MAJC-5200 processor share an L1 D-cache
  – Cores could share floating-point units
● CMT processors can begin to include a variety of new [expensive] shared resources
  – Cost is amortized over multiple cores

Hardware Accelerators
● On-chip hardware accelerators become increasingly attractive in the CMT space
  – Cost is amortized and high utilization is achieved
● On-chip accelerators can provide a range of benefits:
  – Increased performance on certain specialized tasks
  – A more power-efficient approach
  – Offload of more mundane tasks from the general-purpose cores
  – Frequently significantly more efficient than offchip accelerators
● A variety of different onchip accelerators can be envisaged:
  – Cryptographic accelerators
  – Network offload engines
  – XML parsing
  – OS functionality, e.g. memcopy

CMT Challenges

CMT Challenges
● Attention has focused on improving single-thread performance using a variety of speculative techniques
● Aggressive speculative techniques can consume significant resources for even modest performance improvements
● This philosophy is compatible with a single-core design
  – During periods of extensive speculation, the resources are otherwise underutilized
● In CMT processors, resources are shared between strands
  – Need to ensure strands are not deprived of resources
  – Strands need to be “good neighbors”
  – Policies should focus on maximizing overall performance

Prefetching
● Instruction prefetching can significantly increase the performance of commercial workloads on CMT processors
[Chart: potential performance improvement (1.0–1.5×) from instruction prefetching on Database, TPC-W, SPECjAppServer and SPECweb workloads; 4-core CMP with private 32KB 4-way L1$ and a shared 2MB 4-way L2$]
● Potential to increase performance by up to 1.5X by mitigating the effects of instruction misses

Prefetching on CMTs
● Speculative prefetching strategies are still important for CMT processors
● Algorithms need to be tailored to the CMT design point
● In a single-strand system, the entire system is idle on a miss
  – Aggressive prefetching may be very effective
● In a CMT processor, a less aggressive strategy may be preferable
  – The harmful side effects of mis-speculation are more pronounced
  – One strand's speculation can impact other strands, evicting their lines from the caches or monopolizing shared resources
  – Prefetch accuracy is more critical

Request Prioritization
● Hardware scouting generates a highly accurate stream of prefetches
  – Pollution is not a major concern
● In a single-core design, the processor is otherwise stalled during scouting
  – The prefetches don't impact the timeliness of the demand fetches
● In a CMT processor, one strand may be scouting while others are issuing demand fetches
  – One strand's speculative requests may delay another strand's demand fetches
  – Optimizing for peak single-strand performance may not result in optimal aggregate (chip-wide) performance
● Require speculative strategies that leverage idle resources and defer to demand fetches

Hot Sets
● Hot sets result when many heavily accessed physical addresses map to
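The power arithmetic on the Power Consumption slide can be checked with a minimal sketch. It assumes the deck's roughly cubic power-frequency dependence (P ≈ V²f with supply voltage scaling roughly with frequency, hence P ∝ f³) and that throughput scales with cores × frequency when enough threads are available; both relations are idealizations, not measurements:

```python
# Sketch of the slide's power argument, assuming per-core power ~ f^3
# (P ~ V^2 * f, with voltage V scaling roughly with frequency f).
# All quantities are relative to one baseline core at full frequency.

def relative_power(num_cores: int, rel_freq: float) -> float:
    """Total chip power relative to one core at full frequency."""
    return num_cores * rel_freq ** 3

def relative_throughput(num_cores: int, rel_freq: float) -> float:
    """Aggregate throughput, assuming the workload has enough
    threads to keep every core busy (throughput ~ cores x frequency)."""
    return num_cores * rel_freq

baseline = (1, 1.0)   # one core at full frequency
cmt      = (2, 0.5)   # two cores at half frequency

# Same aggregate performance...
assert relative_throughput(*cmt) == relative_throughput(*baseline)
# ...at a quarter of the power: 2 * (1/2)^3 = 1/4
print(relative_power(*cmt))  # 0.25
```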
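The Offchip Bandwidth slide's claim that over 100 in-flight requests are needed to saturate the memory pipe follows from Little's law: outstanding requests = bandwidth × latency / bytes per request. The bandwidth, latency and line size below are illustrative assumptions chosen to be plausible for a circa-2005 server, not figures from the talk:

```python
# Little's law applied to the memory system:
#   concurrent_requests = bandwidth * latency / bytes_per_request
# The specific numbers are illustrative assumptions, not from the talk.

def requests_in_flight(bandwidth_bytes_per_s: float,
                       latency_s: float,
                       line_size_bytes: int) -> float:
    """Outstanding misses needed to keep the offchip pipe fully utilized."""
    return bandwidth_bytes_per_s * latency_s / line_size_bytes

needed = requests_in_flight(bandwidth_bytes_per_s=25.6e9,  # assumed 25.6 GB/s
                            latency_s=250e-9,              # assumed 250 ns miss latency
                            line_size_bytes=64)            # 64-byte cache lines
print(round(needed))  # 100
```

With fewer than 2 parallel misses per core (the slide's figure for an aggressive OOO core on server workloads), on the order of 50+ cores' worth of concurrent strands would be needed to reach that level of memory parallelism.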
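The "defer to demand fetches" policy from the Request Prioritization slide can be sketched as a two-class priority queue in which demand misses always drain before any strand's speculative prefetches, with FIFO order within each class. This is an illustrative model only, not a description of any real Sun memory controller:

```python
# Illustrative model of request prioritization in a shared memory queue:
# demand fetches are always serviced before speculative prefetches,
# FIFO within each class. A sketch, not any real controller's design.
import heapq
import itertools

DEMAND, PREFETCH = 0, 1   # lower value = higher priority

class RequestQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserves FIFO order

    def issue(self, strand: int, addr: int, kind: int) -> None:
        heapq.heappush(self._heap, (kind, next(self._seq), strand, addr))

    def next_request(self):
        kind, _, strand, addr = heapq.heappop(self._heap)
        return strand, addr, kind

q = RequestQueue()
q.issue(strand=0, addr=0x1000, kind=PREFETCH)  # scouting strand speculates
q.issue(strand=1, addr=0x2000, kind=DEMAND)    # another strand takes a real miss
q.issue(strand=0, addr=0x1040, kind=PREFETCH)

strand, addr, kind = q.next_request()
assert kind == DEMAND and strand == 1  # demand fetch jumps ahead of older prefetches
```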