
Piranha: Designing a Scalable CMP-based System for Commercial Workloads
Luiz André Barroso
Western Research Laboratory
April 27, 2001
Asilomar Microcomputer Workshop

What is Piranha?
• A scalable shared memory architecture based on chip multiprocessing (CMP) and targeted at commercial workloads
• A research prototype under development by Compaq Research and Compaq NonStop Hardware Development Group
• A departure from ever increasing processor complexity and system design/verification cycles

Importance of Commercial Applications
[Pie chart: Worldwide Server Customer Spending (IDC 1999). Segments as ordered in the source: Scientific & engineering and Other (3% and 6%), Infrastructure 29%, Collaborative 12%, Software development 14%, Decision support 14%, Business processing 22%]
• Total server market size in 1999: ~$55-60B
  – technical applications: less than $6B
  – commercial applications: ~$40B

Price Structure of Servers
• IBM eServer 680 (220KtpmC; $43/tpmC)
  – 24 CPUs
  – 96GB DRAM, 18 TB Disk
  – $9M price tag
• Compaq ProLiant ML370 (32KtpmC; $12/tpmC)
  – 4 CPUs
  – 8GB DRAM, 2TB Disk
  – $240K price tag

[Figure: normalized breakdown of HW cost (Base, CPU, DRAM, I/O; 0% to 100%) for IBM eServer 680 and Compaq ProLiant ML570]

Price per component
System                  $/CPU     $/MB DRAM   $/GB Disk
IBM eServer 680         $65,417   $9          $359
Compaq ProLiant ML570   $6,048    $4          $64

- Storage prices dominate (50%-70% in customer installations)
- Software maintenance/management costs even higher (up to $100M)
- Price of expensive CPUs/memory system amortized

Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary

Studies of Commercial Workloads
• Collaboration with Kourosh Gharachorloo (Compaq WRL)
  – ISCA’98: Memory System Characterization of Commercial Workloads (with
    E. Bugnion)
  – ISCA’98: An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh)
  – ASPLOS’98: Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors (with P. Ranganathan and S. Adve)
  – HPCA’00: Impact of Chip-Level Integration on Performance of OLTP Workloads (with A. Nowatzyk and B. Verghese)
  – ISCA’01: Code Layout Optimizations for Transaction Processing Workloads (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)

Studies of Commercial Workloads: summary
• Memory system is the main bottleneck
  – astronomically high CPI
  – dominated by memory stall times
  – instruction stalls as important as data stalls
  – fast/large L2 caches are critical
• Very poor Instruction Level Parallelism (ILP)
  – frequent hard-to-predict branches
  – large L1 miss ratios
  – Ld-Ld dependencies
  – disappointing gains from wide-issue out-of-order techniques!

Increasing Complexity of Processor Designs
• Pushing limits of instruction-level parallelism
  – multiple instruction issue
  – speculative out-of-order (OOO) execution
• Driven by applications such as SPEC
• Increasing design time and team size

Processor    Year      Transistor Count   Design      Design Time   Verification Team Size
(SGI MIPS)   Shipped   (millions)         Team Size   (months)      (% of total)
R2000        1985      0.10               20          15            15%
R4000        1991      1.40               55          24            20%
R10000       1996      6.80               >100        36            >35%
courtesy: John Hennessy, IEEE Computer, 32(8)

• Yielding diminishing returns in performance

Exploiting Higher Levels of Integration
[Figure: Alpha 21364 single chip: 1GHz 21264 CPU, 64KB I$, 64KB D$, 1.5MB L2$, two MEM-CTL blocks (ports 0 to 31), network interface, I/O and coherence engine; alongside, a multiprocessor built from 364 nodes, each with its own memory (M) and I/O]
• lower latency, higher bandwidth
• incrementally scalable glueless multiprocessing
• reuse of existing CPU core addresses complexity issues

Exploiting Parallelism in Commercial Apps
[Figure: Simultaneous Multithreading (SMT): threads 1 to 4 plotted over time; example: Alpha 21464. Chip Multiprocessing (CMP): two CPUs, each with I$ and D$, plus L2$, MEM-CTL, network, coherence and I/O; example: IBM Power4]
• SMT superior in single-thread performance
• CMP addresses complexity by using simpler cores

Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
  – Architecture
  – Performance
• Design Methodology
• Summary

Piranha Project
• Explore chip multiprocessing for scalable servers
• Focus on parallel commercial workloads
• Small team, modest investment, short design time
• Address complexity by using:
  – simple processor cores
  – standard ASIC methodology
Give up on ILP, embrace TLP

Piranha Team Members
Research
  – Luiz André Barroso (WRL)
  – Kourosh Gharachorloo (WRL)
  – David Lowell (WRL)
  – Joel McCormack (WRL)
  – Mosur Ravishankar (WRL)
  – Rob Stets (WRL)
  – Yuan Yu (SRC)
NonStop Hardware Development ASIC Design Center
  – Tom Heynemann
  – Dan Joyce
  – Harland Maxwell
  – Harold Miller
  – Sanjay Singh
  – Scott Smith
  – Jeff Sprouse
  – … several contractors
Former Contributors
  Robert McNamara, Brian Robinson, Basem Nayfeh, Barton Sano, Andreas Nowatzyk, Daniel Scales, Joan Pendleton, Ben Verghese, Shaz Qadeer

Piranha Processing Node
[Figure: processing node with eight CPUs, each with I$ and D$, eight L2$ banks and eight MEM-CTL blocks, connected through the ICS to the HE and RE protocol engines and the router]
• Alpha core: 1-issue, in-order, 500MHz
• L1 caches: I&D, 64KB, 2-way
• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
• L2 cache: shared, 1MB, 8-way
• Memory Controller (MC): RDRAM, 12.8GB/sec
• Protocol Engines (HE & RE): µprog., 1K µinstr., even/odd interleaving
• System Interconnect:
  4-port Xbar router, topology independent, 32GB/sec total bandwidth
Single chip

Piranha I/O Node
[Figure: I/O node with one CPU (I$, D$), L2$, MEM-CTL, HE and RE protocol engines, ICS, FB blocks, PCI-X interface, and a router with 2 links @ 8GB/s]
• I/O node is a full-fledged member of system interconnect
  – CPU indistinguishable from Processing Node CPUs
  – participates in global coherence protocol

Example Configuration
[Figure: example system built from processing nodes (P) and I/O nodes (P-I/O)]
• Arbitrary topologies
• Match ratio of Processing to I/O nodes to application requirements

L2 Cache and Intra-Node Coherence
• No inclusion between L1s and L2 cache
  – total L1 capacity equals L2 capacity
  – L2 misses go directly to L1
  – L2 filled by L1 replacements
• L2 keeps track of all lines in the chip
  – sends Invalidates, Forwards
  – orchestrates L1-to-L2 write-backs to maximize chip-memory utilization
  – cooperates with Protocol Engines to enforce system-wide coherence

Inter-Node Coherence Protocol
• ‘Stealing’ ECC bits for memory directory

  Layout (data + ECC + directory bits)   Total directory bits
  8 x (64+8)                             0
  4 x (128+9+7)                          28
  2 x (256+10+22)                        44
  1 x (512+11+53)                        53

• Directory (2b state + 40b sharing info)
  [Figure: directory layout with two fields, each 2b state + 20b info on sharers]
• Dual representation: limited pointer + coarse vector
• “Cruise Missile” Invalidations (CMI)
  – limit fan-out/fan-in serialization with CV
  [Figure: CMI example with coarse vector 010000001000]
• Several new protocol optimizations

Simulated Architectures

Single-Chip Piranha Performance
[Figure: normalized execution time (0 to 350), broken down into CPU, L2 Hit and L2 Miss, for P1 (500 MHz, 1-issue), INO (1GHz, 1-issue), OOO (1GHz, 4-issue) and P8 (500MHz, 1-issue) on OLTP and DSS. Bar labels in the source: 350, 233, 191, 145, 100, 100, 34, 44]
• Piranha’s performance margin 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses → better utilizes memory system

Single-Chip Performance (Cont.)
[Figure: left panel, speedup (0 to 8) versus number of cores (1 to 8); right panel, normalized breakdown of L1 misses (%) into L2 Hit, L2 Fwd and L2 Miss for P1, P2, P4 and P8 (500 MHz, 1-issue)]
• Near-linear scalability
  – low memory latencies
  – effectiveness of highly associative L2 and non-inclusive caching

Potential of a Full-Custom Piranha
[Figure: normalized execution time (0 to 120), broken down into CPU, L2 Hit and L2 Miss, for OOO (1GHz, 4-issue), P8 (500MHz, 1-issue) and P8F (1.25GHz, 1-issue) on OLTP and DSS. Bar labels in the source: 100, 100, 43, 34, 19, 20]
• 5x margin over OOO for OLTP and DSS
• Full-custom design benefits substantially from boost in core speed

Managing Complexity in the Architecture
• Use of many simpler logic modules
  – shorter design
  – easier verification
  – only short wires*
  – faster synthesis
  – simpler chip-level layout
• Simplify intra-chip communication
  – all traffic goes through ICS (no backdoors)
• Use of microprogrammed protocol engines
• Adoption of large VM pages
• Implement sub-set of Alpha ISA
  – no VAX floating point, no multimedia instructions, etc.

Methodology Challenges
• Isolated sub-module testing
  – need to create robust bus functional models (BFM)
  – sub-modules’ behavior highly inter-dependent
  – not feasible with a small team
• System-level (integrated) testing
  – much easier to create tests
  – only one BFM at the processor interface
  – simpler to
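
The ECC-stealing figures on the Inter-Node Coherence Protocol slide can be checked with a little arithmetic. The sketch below is only an illustration: the four layouts and the 2b + 40b directory size are copied from the slide, while the function names and the idea of comparing the entry size against each layout are assumptions made here. The slide does not say which layout Piranha adopts, and neither does this code.

```python
# Each layout is (chunks per line, data bits, ECC bits, directory bits per chunk),
# copied from the slide: 8x(64+8), 4x(128+9+7), 2x(256+10+22), 1x(512+11+53).
LAYOUTS = [
    (8, 64, 8, 0),
    (4, 128, 9, 7),
    (2, 256, 10, 22),
    (1, 512, 11, 53),
]

# Directory entry size from the slide: 2b state + 40b sharing info.
DIRECTORY_ENTRY_BITS = 2 + 40


def line_totals(chunks, data, ecc, directory):
    """Return per-line bit totals for one layout (hypothetical helper)."""
    return {
        "data": chunks * data,
        "ecc": chunks * ecc,
        "directory": chunks * directory,
        "total": chunks * (data + ecc + directory),
    }


def layouts_fitting(entry_bits):
    """Return the layouts whose freed directory bits can hold entry_bits."""
    return [
        layout for layout in LAYOUTS
        if line_totals(*layout)["directory"] >= entry_bits
    ]


for layout in LAYOUTS:
    t = line_totals(*layout)
    print(f"{layout[0]}x: data={t['data']} ecc={t['ecc']} "
          f"directory={t['directory']} total={t['total']}")

print("layouts that fit the directory entry:",
      layouts_fitting(DIRECTORY_ENTRY_BITS))
```

Every layout stores the same number of bits per line; computing ECC over wider chunks needs fewer check bits, and the difference is what becomes available for directory state.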
