Piranha:Piranha

Piranha:Piranha: DesigningDesigning aa ScalableScalable CMP-basedCMP-based SystemSystem forfor CommercialCommercial WorkloadsWorkloads LuizLuiz AndréAndré BarrosoBarroso WesternWestern ResearchResearch LaboratoryLaboratory April 27, 2001 Asilomar Microcomputer Workshop WhatWhat isis Piranha?Piranha? l A scalable shared memory architecture based on chip multiprocessing (CMP) and targeted at commercial workloads l A research prototype under development by Compaq Research and Compaq NonStop Hardware Development Group l A departure from ever increasing processor complexity and system design/verification cycles ImportanceImportance ofof CommercialCommercial ApplicationsApplications Worldwide Server Customer Spending (IDC 1999) Scientific & Other engineering 3% 6% Infrastructure Collaborative 29% 12% Software development 14% Decision Business support processing 14% 22% l Total server market size in 1999: ~$55-60B – technical applications: less than $6B – commercial applications: ~$40B PricePrice StructureStructure ofof ServersServers Normalized breakdown of HW cost l IBM eServer 680 100% (220KtpmC; $43/tpmC) 90% § 24 CPUs 80% 70% I/O § 96GB DRAM, 18 TB Disk 60% DRAM 50% § $9M price tag CPU 40% Base 30% l Compaq ProLiant ML370 20% 10% (32KtpmC; $12/tpmC) 0% § 4 CPUs IBM eServer 680 Compaq ProLiant ML570 § 8GB DRAM, 2TB Disk Price per component System § $240K price tag $/CPU $/MB DRAM $/GB Disk IBM eServer 680 $65,417 $9 $359 Compaq ProLiant ML570 $6,048 $4 $64 - Storage prices dominate (50%-70% in customer installations) - Software maintenance/management costs even higher (up to $100M) - Price of expensive CPUs/memory system amortized OutlineOutline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha l Design Methodology l Summary StudiesStudies ofof CommercialCommercial WorkloadsWorkloads l Collaboration with Kourosh Gharachorloo (Compaq WRL) – ISCA’98: Memory System Characterization of Commercial Workloads (with E. Bugnion) – ISCA’98: An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh) – ASPLOS’98: Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors (with P. Ranganathan and S. Adve) – HPCA’00: Impact of Chip-Level Integration on Performance of OLTP Workloads (with A. Nowatzyk and B. Verghese) – ISCA’01: Code Layout Optimizations for Transaction Processing Workloads (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero) StudiesStudies ofof CommercialCommercial Workloads:Workloads: summarysummary l Memory system is the main bottleneck – astronomically high CPI – dominated by memory stall times – instruction stalls as important as data stalls – fast/large L2 caches are critical l Very poor Instruction Level Parallelism (ILP) – frequent hard-to-predict branches – large L1 miss ratios – Ld-Ld dependencies – disappointing gains from wide-issue out-of-order techniques! OutlineOutline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha l Design Methodology l Summary IncreasingIncreasing ComplexityComplexity ofof ProcessorProcessor DesignsDesigns l Pushing limits of instruction-level parallelism – multiple instruction issue – speculative out-of-order (OOO) execution l Driven by applications such as SPEC l Increasing design time and team size Processor Year Transistor Design Design Verification (SGI MIPS) Shipped Count Team Time Team Size (millions) Size (months) (% of total) R2000 1985 0.10 20 15 15% R4000 1991 1.40 55 24 20% R10000 1996 6.80 >100 36 >35% courtesy: John Hennessy, IEEE Computer, 32(8) l Yielding diminishing returns in performance ExploitingExploiting HigherHigher LevelsLevels ofof IntegrationIntegration Alpha 21364 Single M M 1GHz chip 364 364 21264 CPU IO IO M M 64KB 64KB MEM-CTL I$ D$ 364 364 0 31 IO IO 1.5MB M M L2$ 364 364 MEM-CTL IO IO Network Interface 0 I/O Coherence Engine 31 l lower latency, higher bandwidth l incrementally scalable glueless multiprocessing l reuse of existing CPU core addresses complexity issues ExploitingExploiting ParallelismParallelism inin CommercialCommercial AppsApps Simultaneous Multithreading (SMT) Chip Multiprocessing (CMP) CPU CPU thread 1 thread 2 I$ D$ I$ D$ thread 3 MEM-CTL thread 4 time L2$ MEM-CTL Network Example: Alpha 21464 Coherence I/O Example: IBM Power4 l SMT superior in single-thread performance l CMP addresses complexity by using simpler cores OutlineOutline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha – Architecture – Performance l Design Methodology l Summary PiranhaPiranha ProjectProject l Explore chip multiprocessing for scalable servers l Focus on parallel commercial workloads l Small team, modest investment, short design time l Address complexity by using: – simple processor cores – standard ASIC methodology Give up on ILP, embrace TLP PiranhaPiranha TeamTeam MembersMembers Research NonStop Hardware Development – Luiz André Barroso (WRL) ASIC Design Center – Kourosh Gharachorloo (WRL) – Tom Heynemann – David Lowell (WRL) – Dan Joyce – Harland Maxwell – Joel McCormack (WRL) – Harold Miller – Mosur Ravishankar (WRL) – Sanjay Singh – Rob Stets (WRL) – Scott Smith – Yuan Yu (SRC) – Jeff Sprouse – … several contractors Former Contributors Robert McNamara Brian Robinson Basem Nayfeh Barton Sano Andreas Nowatzyk Daniel Scales Joan Pendleton Ben Verghese Shaz Qadeer PiranhaPiranha ProcessingProcessing NodeNode Alpha core: MEM-CTL MEM-CTL MEM-CTL MEM-CTL 1-issue, in-order, 500MHz CPU CPU CPU CPU L1 caches: I&D, 64KB, 2-way HE L2$ L2$ L2$ L2$ Intra-chip switch (ICS) I$ D$ I$ D$ I$ D$ I$ D$ 32GB/sec, 1-cycle delay L2 cache: shared, 1MB, 8-way ICS Memory Controller (MC) Router RDRAM, 12.8GB/sec Protocol Engines (HE & RE): I$ D$ I$ D$ I$ D$ I$ D$ mprog., 1K minstr., RE L2$ L2$ L2$ L2$ even/odd interleaving System Interconnect: 4-port Xbar router CPU CPU CPU CPU topology independent MEM-CTL MEM-CTL MEM-CTL MEM-CTL 32GB/sec total bandwidth Single Chip PiranhaPiranha I/OI/OI/O NodeNode CPU 2 Links @ Router HE 8GB/s I$ D$D$PCI-X FBICSFB RE L2$ MEM-CTL l I/O node is a full-fledged member of system interconnect – CPU indistinguishable from Processing Node CPUs – participates in global coherence protocol ExampleExample ConfigurationConfiguration P P P P- I/O P- I/O P P P l Arbitrary topologies l Match ratio of Processing to I/O nodes to application requirements L2L2 CacheCache andand Intra-NodeIntra-Node CoherenceCoherence l No inclusion between L1s and L2 cache – total L1 capacity equals L2 capacity – L2 misses go directly to L1 – L2 filled by L1 replacements l L2 keeps track of all lines in the chip – sends Invalidates, Forwards – orchestrates L1-to-L2 write-backs to maximize chip-memory utilization – cooperates with Protocol Engines to enforce system-wide coherence Inter-NodeInter-Node CoherenceCoherence ProtocolProtocol l ‘Stealing’ ECC bits for memory directory 8x(64+8) 4X(128+9+7) 2X(256+10+22) 1X(512+11+53) Data-bits ECC Directory-bits 0 28 44 53 l Directory (2b state + 40b sharing info) state info on sharers state info on sharers 2b 20b 2b 20b l Dual representation: limited pointer + coarse vector l “Cruise Missile” Invalidations (CMI) CMI – limit fan-out/fan-in serialization with CV 010000001000 l Several new protocol optimizations SimulatedSimulated ArchitecturesArchitectures Single-ChipSingle-Chip PiranhaPiranha PerformancePerformance 350 350 300 L2Miss L2Hit 233 250 CPU 200 191 150 145 100 100 100 50 34 44 Normalized Execution Time 0 P1 INO OOO P8 P1 INO OOO P8 500 MHz 1GHz 1GHz 500MHz 500 MHz 1GHz 1GHz 500MHz 1-issue 1-issue 4-issue 1-issue 1-issue 1-issue 4-issue 1-issue OLTP DSS l Piranha’s performance margin 3x for OLTP and 2.2x for DSS l Piranha has more outstanding misses è better utilizes memory system Single-ChipSingle-Chip PerformancePerformance (Cont.)(Cont.) 8 100 7 90 80 6 70 5 60 L2 Miss 4 50 L2 Fwd sses (%) 40 L2 Hit Speedup 3 Mi 2 30 20 1 10 Normalized Breakdown of L1 0 0 0 1 2 3 4 5 6 7 8 P1 P2 P4 P8 Number of Cores 500 MHz, 1-issue l Near-linear scalability – low memory latencies – effectiveness of highly associative L2 and non-inclusive caching PotentialPotential ofof aa Full-CustomFull-Custom PiranhaPiranha 120 100 100 L2 Miss 100 L2 Hit 80 CPU 60 43 40 34 20 19 20 Normalized Execution Time 0 OOO P8 P8F OOO P8 P8F 1GHz 500MHz 1.25GHz 1GHz 500MHz 1.25GHz 4-issue 1-issue 1-issue 4-issue 1-issue 1-issue OLTP DSS l 5x margin over OOO for OLTP and DSS l Full-custom design benefits substantially from boost in core speed OutlineOutline l Importance of Commercial Workloads l Commercial Workload Requirements l Trends in Processor Design l Piranha l Design Methodology l Summary ManagingManaging ComplexityComplexity inin thethe ArchitectureArchitecture l Use of many simpler logic modules – shorter design – easier verification – only short wires* – faster synthesis – simpler chip-level layout l Simplify intra-chip communication – all traffic goes through ICS (no backdoors) l Use of microprogrammed protocol engines l Adoption of large VM pages l Implement sub-set of Alpha ISA – no VAX floating point, no multimedia instructions, etc. MethodologyMethodology ChallengesChallenges l Isolated sub-module testing – need to create robust bus functional models (BFM) – sub-modules’ behavior highly inter-dependent – not feasible with a small team l System-level (integrated) testing – much easier to create tests – only one BFM at the processor interface – simpler to

Piranha:Piranha

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support