Chip Multiprocessing and the Cell Broadband Engine

Total Page:16

File Type:pdf, Size:1020Kb

Chip Multiprocessing and the Cell Broadband Engine Cell Broadband Engine Chip Multiprocessing and the Cell Broadband Engine Michael Gschwind © 2006 IBM Corporation ACM Computing Frontiers 2006 Cell Broadband Engine Computer Architecture at the turn of the millenium “The end of architecture” was being proclaimed – Frequency scaling as performance driver – State of the art microprocessors •multiple instruction issue •out of order architecture •register renaming •deep pipelines Little or no focus on compilers – Questions about the need for compiler research Academic papers focused on microarchitecture tweaks 2 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine The age of frequency scaling Scaled Device Voltage, V / α 1 WIRING tox/α bips W/α 0.9 GATE n+ n+ source drain 0.8 L/α xd/α p substrate, doping α*NA 0.7 Performance SCALING: RESULTS: Voltage: V/α Higher Density: ~α2 0.6 Oxide: tox /α Higher Speed: ~α Wire width: W/α Power/ckt: ~1/α2 0.5 Gate width: L/α Power Density: 37 34 31 28 25 22 19 16 13 10 7 ~Constant Total FO4 Per Stage Diffusion: xd /α Substrate: α *NA Source: Dennard et al., JSSC 1974. deeper pipeline 3 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Frequency scaling A trusted standby – Frequency as the tide that raises all boats – Increased performance across all applications – Kept the industry going for a decade Massive investment to keep going – Equipment cost – Power increase – Manufacturing variability – New materials 4 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine … but reality was closing in! 1000 100 1 10 0.8 1 0.6 0.4 0.1 bips 0.2 bips^3/W 0.01 Relative to Optimal FO4 Optimal to Relative Active Power 0 Leakage Power 37 34 31 28 25 22 19 16 13 10 7 0.001 Total FO4 Per Stage 1 0.1 0.01 5 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Power crisis 1000 The power crisis is not “natural” – Created by deviating from 100 ideal scaling theory Tox(Å) – Vdd and Vt not scaled by α • additional performance with increased voltage 10 Laws of physics Vdd classic scaling – Pchip = ntransistors * Ptransistor Marginal performance gain 1 Vt per transistor low – significant power increase – decreased power/performance 0.1 efficiency 1 0.1 0.01 gate length Lgate (µm) 6 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine The power inefficiency of deep pipelining Power-performance optimal Performance optimal 1 0.8 0.6 0.4 bips 0.2 bips^3/W deeper pipeline Relative to Optimal FO4 Relative 0 37 34 31 28 25 22 19 16 13 10 7 Total FO4 Per Stage Source: Srinivasan et al., MICRO 2002 Deep pipelining increases number of latches and switching rate ⇒ power increases with at least f2 Latch insertion delay limits gains of pipelining 7 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine 100 Hitting the memory wall MFLOPS Memory latency 80 60 40 Latency gap 20 – Memory speedup lags behind normalized 0 processor speedup -20 1990 2003 – limits ILP -40 -60 Chip I/O bandwidth gap Source: McKee, Computing Frontiers 2004 – Less bandwidth per MIPS Latency gap as application bandwidth gap usable bandwidth avg. request size no. of inflight roundtrip latency requests – Typically (much) less than chip I/O bandwidth Source: Burger et al., ISCA 1996 8 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Our Y2K challenge 10× performance of desktop systems 1 TeraFlop / second with a four-node configuration 1 Byte bandwidth per 1 Flop – “golden rule for balanced supercomputer design” scalable design across a range of design points mass-produced and low cost 9 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Cell Design Goals Provide the platform for the future of computing – 10× performance of desktop systems shipping in 2005 Computing density as main challenge –Dramatically increase performance per X • X = Area, Power, Volume, Cost,… Single core designs offer diminishing returns on investment – In power, area, design complexity and verification cost 10 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Necessity as the mother of invention Increase power-performance efficiency –Simple designs are more efficient in terms of power and area Increase memory subsystem efficiency –Increasing data transaction size –Increase number of concurrently outstanding transactions ⇒ Exploit larger fraction of chip I/O bandwidth Use CMOS density scaling –Exploit density instead of frequency scaling to deliver increased aggregate performance Use compilers to extract parallelism from application –Exploit application parallelism to translate aggregate performance to application performance 11 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Exploit application-level parallelism Data-level parallelism – SIMD parallelism improves performance with little overhead Thread-level parallelism – Exploit application threads with multi-core design approach Memory-level parallelism – Improve memory access efficiency by increasing number of parallel memory transactions Compute-transfer parallelism – Transfer data in parallel to execution by exploiting application knowledge 12 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Concept phase ideas modular design large architected design reuse register set Chip control cost of high frequency coherency and low power! Multiprocessor reverse Vdd efficient use of scaling for low memory interface power 13 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Cell Architecture SPE Heterogeneous multi- SPU SPU SPU SPU SPU SPU SPU SPU core system SXU SXU SXU SXU SXU SXU SXU SXU architecture LS LS LS LS LS LS LS LS – Power Processor Element for control tasks SMF SMF SMF SMF SMF SMF SMF SMF – Synergistic Processor 16B/cycle Elements for data- EIB (up to 96B/cycle) intensive processing 16B/cycle PPE 16B/cycle 16B/cycle (2x) Synergistic Processor PPU MIC BIC Element (SPE) consists of L2 L1 PXU – Synergistic Processor 32B/cycle 16B/cycle Unit (SPU) Dual FlexIOTM TM – Synergistic Memory Flow XDR Control (SMF) 64-bit Power Architecture with VMX 14 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Shifting the Balance of Power with Cell Broadband Engine Data processor instead of control system – Control-centric code stable over time – Big growth in data processing needs • Modeling • Games • Digital media • Scientific applications Today’s architectures are built on a 40 year old data model – Efficiency as defined in 1964 – Big overhead per data operation – Parallelism added as an after-thought 15 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Powering Cell – the Synergistic Processor Unit 64B Fetch ILB Local Store 128B SMF Issue / Single Port Branch SRAM 2 instructions 16B VRF 16B x 2 16B x 3 x 2 V F P U V F X U PERM LSU Source: Gschwind et al., Hot Chips 17, 2005 16 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Density Computing in SPEs Today, execution units only fraction of core area and power – Bigger fraction goes to other functions • Address translation and privilege levels • Instruction reordering • Register renaming • Cache hierarchy Cell changes this ratio to increase performance per area and power – Architectural focus on data processing • Wide datapaths • More and wide architectural registers • Data privatization and single level processor-local store • All code executes in a single (user) privilege level • Static scheduling 17 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine Streamlined Architecture Architectural focus on simplicity – Aids achievable operating frequency – Optimize circuits for common performance case – Compiler aids in layering traditional hardware functions – Leverage 20 years of architecture research Focus on statically scheduled data parallelism – Focus on data parallel instructions • No separate scalar execution units • Scalar operations mapped onto data parallel dataflow – Exploit wide data paths • data processing • instruction fetch – Address impediments to static scheduling • Large register set • Reduce latencies by eliminating non-essential functionality 18 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine SPE Highlights LS DP RISC organization SFP – 32 bit fixed instructions LS – Load/store architecture FXU EVN – Unified register file User-mode architecture FWD – No page translation within SPU FXU ODD LS SIMD dataflow
Recommended publications
  • Ray Tracing on the Cell Processor
    Ray Tracing on the Cell Processor Carsten Benthin† Ingo Wald Michael Scherbaum† Heiko Friedrich‡ †inTrace Realtime Ray Tracing GmbH SCI Institute, University of Utah ‡Saarland University {benthin, scherbaum}@intrace.com, [email protected], [email protected] Abstract band”) architectures1 to hide memory latencies, at exposing par- Over the last three decades, higher CPU performance has been allelism through SIMD units, and at multi-core architectures. At achieved almost exclusively by raising the CPU’s clock rate. Today, least for specialized tasks such as triangle rasterization, these con- the resulting power consumption and heat dissipation threaten to cepts have been proven as very powerful, and have made modern end this trend, and CPU designers are looking for alternative ways GPUs as fast as they are; for example, a Nvidia 7800 GTX offers of providing more compute power. In particular, they are looking 313 GFlops [16], 35 times more than a 2.2 GHz AMD Opteron towards three concepts: a streaming compute model, vector-like CPU. Today, there seems to be a convergence between CPUs and SIMD units, and multi-core architectures. One particular example GPUs, with GPUs—already using the streaming compute model, of such an architecture is the Cell Broadband Engine Architecture SIMD, and multi-core—become increasingly programmable, and (CBEA), a multi-core processor that offers a raw compute power CPUs are getting equipped with more and more cores and stream- of up to 200 GFlops per 3.2 GHz chip. The Cell bears a huge po- ing functionalities. Most commodity CPUs already offer 2–8 tential for compute-intensive applications like ray tracing, but also cores [1, 9, 10, 25], and desktop PCs with more cores can be built requires addressing the challenges caused by this processor’s un- by stacking and interconnecting smaller multi-core systems (mostly conventional architecture.
    [Show full text]
  • IBM System P5 Quad-Core Module Based on POWER5+ Technology: Technical Overview and Introduction
    Giuliano Anselmi Redpaper Bernard Filhol SahngShin Kim Gregor Linzmeier Ondrej Plachy Scott Vetter IBM System p5 Quad-Core Module Based on POWER5+ Technology: Technical Overview and Introduction The quad-core module (QCM) is based on the well-known POWER5™ dual-core module (DCM) technology. The dual-core POWER5 processor and the dual-core POWER5+™ processor are packaged with the L3 cache chip into a cost-effective DCM package. The QCM is a package that enables entry-level or midrange IBM® System p5™ servers to achieve additional processing density without increasing the footprint. Figure 1 shows the DCM and QCM physical views and the basic internal architecture. to I/O to I/O Core Core 1.5 GHz 1.5 GHz Enhanced Enhanced 1.9 MB MB 1.9 1.9 MB MB 1.9 L2 cache L2 L2 cache switch distributed switch distributed switch 1.9 GHz POWER5 Core Core core or CPU or core 1.5 GHz L3 Mem 1.5 GHz L3 Mem L2 cache L2 ctrl ctrl ctrl ctrl Enhanced distributed Enhanced 1.9 MB Shared Shared MB 1.9 Ctrl 1.9 GHz L3 Mem Ctrl POWER5 core or CPU or core 36 MB 36 MB 36 L3 cache L3 cache L3 DCM 36 MB QCM L3 cache L3 to memory to memory DIMMs DIMMs POWER5+ Dual-Core Module POWER5+ Quad-Core Module One Dual-Core-chip Two Dual-Core-chips plus two L3-cache-chips plus one L3-cache-chip Figure 1 DCM and QCM physical views and basic internal architecture © Copyright IBM Corp. 2006.
    [Show full text]
  • Implementing Powerpc Linux on System I Platform
    Front cover Implementing POWER Linux on IBM System i Platform Planning and configuring Linux servers on IBM System i platform Linux distribution on IBM System i Platform installation guide Tips to run Linux servers on IBM System i platform Yessong Johng Erwin Earley Rico Franke Vlatko Kosturjak ibm.com/redbooks International Technical Support Organization Implementing POWER Linux on IBM System i Platform February 2007 SG24-6388-01 Note: Before using this information and the product it supports, read the information in “Notices” on page vii. Second Edition (February 2007) This edition applies to i5/OS V5R4, SLES10 and RHEL4. © Copyright International Business Machines Corporation 2005, 2007. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents Notices . vii Trademarks . viii Preface . ix The team that wrote this redbook. ix Become a published author . xi Comments welcome. xi Chapter 1. Introduction to Linux on System i platform . 1 1.1 Concepts and terminology . 2 1.1.1 System i platform . 2 1.1.2 Hardware management console . 4 1.1.3 Virtual Partition Manager (VPM) . 10 1.2 Brief introduction to Linux and Linux on System i platform . 12 1.2.1 Linux on System i platform . 12 1.3 Differences between existing Power5-based System i and previous System i models 13 1.3.1 Linux enhancements on Power5 / Power5+ . 14 1.4 Where to go for more information . 15 Chapter 2. Configuration planning . 17 2.1 Concepts and terminology . 18 2.1.1 Processor concepts .
    [Show full text]
  • From Blue Gene to Cell Power.Org Moscow, JSCC Technical Day November 30, 2005
    IBM eServer pSeries™ From Blue Gene to Cell Power.org Moscow, JSCC Technical Day November 30, 2005 Dr. Luigi Brochard IBM Distinguished Engineer Deep Computing Architect [email protected] © 2004 IBM Corporation IBM eServer pSeries™ Technology Trends As frequency increase is limited due to power limitation Dual core is a way to : 2 x Peak Performance per chip (and per cycle) But at the expense of frequency (around 20% down) Another way is to increase Flop/cycle © 2004 IBM Corporation IBM eServer pSeries™ IBM innovations POWER : FMA in 1990 with POWER: 2 Flop/cycle/chip Double FMA in 1992 with POWER2 : 4 Flop/cycle/chip Dual core in 2001 with POWER4: 8 Flop/cycle/chip Quadruple core modules in Oct 2005 with POWER5: 16 Flop/cycle/module PowerPC: VMX in 2003 with ppc970FX : 8 Flops/cycle/core, 32bit only Dual VMX+ FMA with pp970MP in 1Q06 Blue Gene: Low frequency , system on a chip, tight integration of thousands of cpus Cell : 8 SIMD units and a ppc970 core on a chip : 64 Flop/cycle/chip © 2004 IBM Corporation IBM eServer pSeries™ Technology Trends As needs diversify, systems are heterogeneous and distributed GRID technologies are an essential part to create cooperative environments based on standards © 2004 IBM Corporation IBM eServer pSeries™ IBM innovations IBM is : a sponsor of Globus Alliances contributing to Globus Tool Kit open souce a founding member of Globus Consortium IBM is extending its products Global file systems : – Multi platform and multi cluster GPFS Meta schedulers : – Multi platform
    [Show full text]
  • IBM Powerpc 970 (A.K.A. G5)
    IBM PowerPC 970 (a.k.a. G5) Ref 1 David Benham and Yu-Chung Chen UIC – Department of Computer Science CS 466 PPC 970FX overview ● 64-bit RISC ● 58 million transistors ● 512 KB of L2 cache and 96KB of L1 cache ● 90um process with a die size of 65 sq. mm ● Native 32 bit compatibility ● Maximum clock speed of 2.7 Ghz ● SIMD instruction set (Altivec) ● 42 watts @ 1.8 Ghz (1.3 volts) ● Peak data bandwidth of 6.4 GB per second A picture is worth a 2^10 words (approx.) Ref 2 A little history ● PowerPC processor line is a product of the AIM alliance formed in 1991. (Apple, IBM, and Motorola) ● PPC 601 (G1) - 1993 ● PPC 603 (G2) - 1995 ● PPC 750 (G3) - 1997 ● PPC 7400 (G4) - 1999 ● PPC 970 (G5) - 2002 ● AIM alliance dissolved in 2005 Processor Ref 3 Ref 3 Core details ● 16(int)-25(vector) stage pipeline ● Large number of 'in flight' instructions (various stages of execution) - theoretical limit of 215 instructions ● 512 KB L2 cache ● 96 KB L1 cache – 64 KB I-Cache – 32 KB D-Cache Core details continued ● 10 execution units – 2 load/store operations – 2 fixed-point register-register operations – 2 floating-point operations – 1 branch operation – 1 condition register operation – 1 vector permute operation – 1 vector ALU operation ● 32 64 bit general purpose registers, 32 64 bit floating point registers, 32 128 vector registers Pipeline Ref 4 Benchmarks ● SPEC2000 ● BLAST – Bioinformatics ● Amber / jac - Structure biology ● CFD lab code SPEC CPU2000 ● IBM eServer BladeCenter JS20 ● PPC 970 2.2Ghz ● SPECint2000 ● Base: 986 Peak: 1040 ● SPECfp2000 ● Base: 1178 Peak: 1241 ● Dell PowerEdge 1750 Xeon 3.06Ghz ● SPECint2000 ● Base: 1031 Peak: 1067 Apple’s SPEC Results*2 ● SPECfp2000 ● Base: 1030 Peak: 1044 BLAST Ref.
    [Show full text]
  • IBM Power Systems Performance Report Apr 13, 2021
    IBM Power Performance Report Power7 to Power10 September 8, 2021 Table of Contents 3 Introduction to Performance of IBM UNIX, IBM i, and Linux Operating System Servers 4 Section 1 – SPEC® CPU Benchmark Performance 4 Section 1a – Linux Multi-user SPEC® CPU2017 Performance (Power10) 4 Section 1b – Linux Multi-user SPEC® CPU2017 Performance (Power9) 4 Section 1c – AIX Multi-user SPEC® CPU2006 Performance (Power7, Power7+, Power8) 5 Section 1d – Linux Multi-user SPEC® CPU2006 Performance (Power7, Power7+, Power8) 6 Section 2 – AIX Multi-user Performance (rPerf) 6 Section 2a – AIX Multi-user Performance (Power8, Power9 and Power10) 9 Section 2b – AIX Multi-user Performance (Power9) in Non-default Processor Power Mode Setting 9 Section 2c – AIX Multi-user Performance (Power7 and Power7+) 13 Section 2d – AIX Capacity Upgrade on Demand Relative Performance Guidelines (Power8) 15 Section 2e – AIX Capacity Upgrade on Demand Relative Performance Guidelines (Power7 and Power7+) 20 Section 3 – CPW Benchmark Performance 19 Section 3a – CPW Benchmark Performance (Power8, Power9 and Power10) 22 Section 3b – CPW Benchmark Performance (Power7 and Power7+) 25 Section 4 – SPECjbb®2015 Benchmark Performance 25 Section 4a – SPECjbb®2015 Benchmark Performance (Power9) 25 Section 4b – SPECjbb®2015 Benchmark Performance (Power8) 25 Section 5 – AIX SAP® Standard Application Benchmark Performance 25 Section 5a – SAP® Sales and Distribution (SD) 2-Tier – AIX (Power7 to Power8) 26 Section 5b – SAP® Sales and Distribution (SD) 2-Tier – Linux on Power (Power7 to Power7+)
    [Show full text]
  • IBM's POWER5 Micro Processor Design and Methodology
    IBM's POWER5 Micro Processor Design and Methodology Ron Kalla IBM Systems Group © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Outline POWER5 Overview Design Process Power UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology POWER Server Roadmap 2001 2002-3 2004* 2005* 2006* POWER4 POWER4+ POWER5 POWER5+ POWER6 90 nm 65 nm 130 nm 130 nm Ultra high 180 nm >> GHz >> GHz frequency cores Core Core 1.7 GHz 1.7 GHz > GHz > GHz L2 caches Core Core Core Core 1.3 GHz 1.3 GHz Shared L2 Advanced System Features Core Core Distributed Switch Shared L2 Shared L2 Distributed Switch Distributed Switch Shared L2 Simultaneous multi-threading Distributed Switch Sub-processor partitioning Reduced size Dynamic firmware updates Enhanced scalability, parallelism Chip Multi Processing Lower power High throughput performance - Distributed Switch Larger L2 Enhanced memory subsystem - Shared L2 More LPARs (32) Dynamic LPARs (16) Autonomic Computing Enhancements *Planned to be offered by IBM. All statements about IBM’s future direction and intent are* subject to change or withdrawal without notice and represent goals and objectives only. UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology POWER5 Technology: 130nm lithography, Cu, SOI 389mm2 276M Transistors Dual processor core 8-way superscalar Simultaneous multithreaded (SMT) core Up to 2 virtual processors per real processor UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro
    [Show full text]
  • Microprocessor
    MICROPROCESSOR www.MPRonline.com THE REPORTINSIDER’S GUIDE TO MICROPROCESSOR HARDWARE POWER5 TOPS ON BANDWIDTH IBM’s Design Is Still Elegant, But Itanium Provides Competition By Kevin Krewell {12/22/03-02} On large multiprocessing systems, often the dominant attributes needed to assure good incremental processor performance include memory coherency issues, aggregate memory bandwidth, and I/O performance. The Power4 processor, now shipping from IBM, nicely balances integration with performance. The Power4 has an processor speed, but are eight bytes wide, offering more than eight-issue superscalar core, 12.8GB/s of memory bandwidth, 4GB/s of bandwidth per bus. The fast buses are used to 1.5MB of L2 cache, and 128MB of external L3 cache. The enable what IBM terms aggressive cache-to-cache transfers. recently introduced Power5 steps up integration and per- This tightly coupled multiprocessing is designed to allow formance by integrating the distributed switch fabric between processing threads to operate in parallel over the processor memory controller and core/caches. (See MPR 10/14/03-01, array. With eight modules, Power5 supports up to 128-way “IBM Raises Curtain on Power5.”) The Power5 has an on-die multithreaded processing. memory controller that will support both DDR and DDR2 The bus structure and distributed switch fabric also SDRAM memory. The Power5 also improves system per- allow IBM to create a 64-way (processor cores) configuration. formance, as each processor core is now multithreaded. (See MPR 9/08/03-02, “IBM Previews Power5.”) IBM has packed four processor die (eight processor cores) and four L3 cache chips on one 95- × 95mm MCM, seen in Figure 1.
    [Show full text]
  • Introduction to the Cell Multiprocessor
    Introduction J. A. Kahle M. N. Day to the Cell H. P. Hofstee C. R. Johns multiprocessor T. R. Maeurer D. Shippy This paper provides an introductory overview of the Cell multiprocessor. Cell represents a revolutionary extension of conventional microprocessor architecture and organization. The paper discusses the history of the project, the program objectives and challenges, the design concept, the architecture and programming models, and the implementation. Introduction: History of the project processors in order to provide the required Initial discussion on the collaborative effort to develop computational density and power efficiency. After Cell began with support from CEOs from the Sony several months of architectural discussion and contract and IBM companies: Sony as a content provider and negotiations, the STI (SCEI–Toshiba–IBM) Design IBM as a leading-edge technology and server company. Center was formally opened in Austin, Texas, on Collaboration was initiated among SCEI (Sony March 9, 2001. The STI Design Center represented Computer Entertainment Incorporated), IBM, for a joint investment in design of about $400,000,000. microprocessor development, and Toshiba, as a Separate joint collaborations were also set in place development and high-volume manufacturing technology for process technology development. partner. This led to high-level architectural discussions A number of key elements were employed to drive the among the three companies during the summer of 2000. success of the Cell multiprocessor design. First, a holistic During a critical meeting in Tokyo, it was determined design approach was used, encompassing processor that traditional architectural organizations would not architecture, hardware implementation, system deliver the computational power that SCEI sought structures, and software programming models.
    [Show full text]
  • IBM Power Roadmap
    POWER6™ Processor and Systems Jim McInnes [email protected] Compiler Optimization IBM Canada Toronto Software Lab © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only Role .I am a Technical leader in the Compiler Optimization Team . Focal point to the hardware development team . Member of the Power ISA Architecture Board .For each new microarchitecture I . help direct the design toward helpful features . Design and deliver specific compiler optimizations to enable hardware exploitation 2 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only POWER5 Chip Overview High frequency dual-core chip . 8 execution units 2LS, 2FP, 2FX, 1BR, 1CR . 1.9MB on-chip shared L2 – point of coherency, 3 slices . On-chip L3 directory and controller . On-chip memory controller Technology & Chip Stats . 130nm lithography, Cu, SOI . 276M transistors, 389 mm2 . I/Os: 2313 signal, 3057 Power/Gnd 3 IBM POWER6 Overview © 2007 IBM Corporation All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only POWER6 Chip Overview SDU RU Ultra-high frequency dual-core chip IFU FXU . 8 execution units L2 Core0 VMX L2 2LS, 2FP, 2FX, 1BR, 1VMX QUAD LSU FPU QUAD . 2 x 4MB on-chip L2 – point of L3 coherency, 4 quads L2 CNTL CNTL . On-chip L3 directory and controller . Two on-chip memory controllers M FBC GXC M Technology & Chip stats C C .
    [Show full text]
  • POWER8: the First Openpower Processor
    POWER8: The first OpenPOWER processor Dr. Michael Gschwind Senior Technical Staff Member & Senior Manager IBM Power Systems #OpenPOWERSummit Join the conversation at #OpenPOWERSummit 1 OpenPOWER is about choice in large-scale data centers The choice to The choice to The choice to differentiate innovate grow . build workload • collaborative • delivered system optimized innovation in open performance solutions ecosystem • new capabilities . use best-of- • with open instead of breed interfaces technology scaling components from an open ecosystem Join the conversation at #OpenPOWERSummit Why Power and Why Now? . Power is optimized for server workloads . Power8 was optimized to simplify application porting . Power8 includes CAPI, the Coherent Accelerator Processor Interconnect • Building on a long history of IBM workload acceleration Join the conversation at #OpenPOWERSummit POWER8 Processor Cores • 12 cores (SMT8) 96 threads per chip • 2X internal data flows/queues • 64K data cache, 32K instruction cache Caches • 512 KB SRAM L2 / core • 96 MB eDRAM shared L3 • Up to 128 MB eDRAM L4 (off-chip) Accelerators • Crypto & memory expansion • Transactional Memory • VMM assist • Data Move / VM Mobility • Coherent Accelerator Processor Interface (CAPI) Join the conversation at #OpenPOWERSummit 4 POWER8 Core •Up to eight hardware threads per core (SMT8) •8 dispatch •10 issue •16 execution pipes: •2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR •Larger Issue queues (4 x 16-entry) •Larger global completion, Load/Store reorder queue •Improved branch prediction •Improved unaligned storage access •Improved data prefetch Join the conversation at #OpenPOWERSummit 5 POWER8 Architecture . High-performance LE support – Foundation for a new ecosystem . Organic application growth Power evolution – Instruction Fusion 1600 PowerPC .
    [Show full text]
  • Power8 Quser Mspl Nov 2015 Handout
    IBM Power Systems Power Systems Hardware: Today and Tomorrow November 2015 Mark Olson [email protected] © 2015 IBM Corporation IBM Power Systems POWER8 Chip © 2015 IBM Corporation IBM Power Systems Processor Technology Roadmap POWER11 Or whatever it is POWER10 named Or whatever it is POWER9 named Or whatever it is named POWER7 POWER8 POWER6 45 nm 22 nm POWER5 65 nm 130 nm POWER4 90 nm 180 nm 130 nm 2001 2004 2007 2010 2014 Future 3 © 2015 IBM Corporation IBM Power Systems Processor Chip Comparisons POWER5 POWER6 POWER7 POWER7+ POWER8 2004 2007 2010 2012 45nm SOI 32nm SOI 22nm SOI Technology 130nm SOI 65nm SOI eDRAM eDRAM eDRAM Compute Cores 2 2 8 8 12 Threads SMT2 SMT2 SMT4 SMT4 SMT8 Caching On-chip 1.9MB (L2) 8MB (L2) 2 + 32MB (L2+3) 2 + 80MB (L2+3) 6 + 96MB (L2+3) Off-chip 36MB (L3) 32MB (L3) None None 128MB (L4) Bandwidth Sust. Mem. 15GB/s 30GB/s 100GB/s 100GB/s 230GB/s Peak I/O 6GB/s 20GB/s 40GB/s 40GB/s 96GB/s 4 © 2015 IBM Corporation IBM Power Systems Processor Designs POWER5+ POWER6 POWER7 POWER7+ POWER8 Max cores 4 2 8 8 12 Technology 90nm 65nm 45nm 32nm 22nm Size 245 mm2 341 mm2 567 mm2 567 mm2 650 mm2 * Transistors 276 M 790 M 1.2 B 2.1 B 4.2 B * 1.9 4 - 5 3 – 4 Up to 4.4 Up to 4.1 Frequencies GHz GHz GHz GHz GHz ** SMT (threads) 2 2 4 4 8 L2 Cache 1.9MB Shared 4MB / Core 256KB / core 256KB / core 512KB/core 4MB / Core 10MB / Core 8MB / Core L3 Cache 36MB 32MB On chip On chip On chip L4 Cache -- -- -- -- Up to 128MB Bandwidth Sust memory 15GB/s 30GB/s 100GB/s 100GB/s 230GB/s Peak I/O 6GB/s 20GB/s 40GB/s 40GB/s 96GB/s * with 12-core
    [Show full text]