The Cell Processor


Seminar talk by Viacheslav Mlotok, SS 2005
Universität Mannheim, Lehrstuhl für Rechnerarchitektur

Contents
• Design goals
• Cell architecture
  – Power Processor Element
  – Synergistic Processor Element
  – Memory Interface Controller
  – Bus Interface Controller
  – Element Interconnect Bus
• Software Cells
• Application and product examples
• Conclusion
• References

Design goals (Source: [9])
• main goal: a new architecture for efficient processing of next-generation multimedia and graphics data
• the conventional approach, a modern RISC processor:
  – large cache
  – out-of-order execution, speculative execution, hardware branch prediction, predication, etc.
  – superscalar architecture, deep pipelines
• yields only about 40% more performance when the transistor count is doubled
[Figure: block diagram of a conventional core — storage (cache), control logic (IF, ID, dispatch, issue) and execution core (vector unit, FP units, integer units, LS unit, reorder/commit)]

Design goals (Source: [9])
• performance per transistor as the metric
• simplify the microarchitecture and shift the complexity into software
• no cache — just a fast local store
• large general-purpose register file; data fetch and branch prediction in software
• multiple cores instead of deep pipelines
[Figure: block diagram of the simplified core — local store, control logic (IF, ID, dispatch) and execution core (vector arithmetic units, permute, LS, channel and branch units, commit)]

Cell architecture (Source: [8])
• joint development by IBM, Sony and Toshiba
• 90 nm SOI, 8 metal layers
• 234 million transistors
• 221 mm² die size

Cell architecture (Source: [1])
• SoC design
• consists of
  – a PowerPC core with L2 cache
  – 8 SIMD processors
  – memory and I/O interfaces
  – a test unit for testing, monitoring and debugging
• 9 processor cores, 10-way multithreading
• high-performance distributed, parallel computing
• virtualization technology
• modular structure
• universal deployment: from PDAs to servers

Cell architecture (Source: [2])
• 4 GHz @ 1.1 V, 50–80 W power dissipation
• 4.6 GHz @ 1.3 V
• future: 5.6 GHz @ 1.4 V, 180 W power dissipation
• 256 GFlops @ 4 GHz (single precision, not IEEE compliant)
• 26 GFlops @ 4 GHz (double precision, IEEE compliant)

Power Processor Element (PPE) (Source: [1])
• modified 64-bit PowerPC with the VMX extension (Vector Multimedia Extension)
• in-order execution, dual-issue
• 2-way multithreading, round-robin scheduling
• 32 KB L1 instruction and data caches, 512 KB L2 cache
• host processor for the SIMD processors; runs the general-purpose tasks

Synergistic Processor Element (SPE) (Source: [1])
• independent 128-bit SIMD unit, optimized for single precision
• in-order execution, dual-issue
• 21 million transistors
• 2.5 × 5.81 mm
• managed by the PPE; programmable in high-level languages (e.g. C, C++)

Synergistic Processor Element (SPE) (Source: [3])
• instruction format: 32-bit fixed length, 3 operands, 1 result
• 128-bit operands (4 × 32 bit)
• three-level memory model
  – main memory, local store, registers
  – memory accesses via DMA (coherent with respect to main memory)
• stream processing (chaining) possible
  – SPEs of other Cells can be included if a longer chain is needed
• 4 operations/cycle ⇒ 256 GFlops for 8 SPEs @ 4 GHz (8 SPEs × 4 GHz × 2 FPUs × 4 operations)
• data fetch and branch prediction in software
  – 18-cycle branch misprediction penalty

Synergistic Processor Element (SPE) (Source: [3])
• 256 KB local store (LS)
• 128 × 128-bit register file with 6 read ports and 2 write ports, 2-cycle access latency
• DMA unit (memory accesses)
• MMU
• BIU — bus interface unit (connection to the shared bus)
• RTB — test block with ABIST (Array Built-In Self Test)
• ATO — atomic memory unit (coherence)

Synergistic Processor Element (SPE) — Local Store (LS) (Source: [3])
• array of single-ported, pipelined SRAMs
• 256 KB (4 × 64 KB)
• replaces an L1 cache for instructions and data
• not coherent; local accesses only
• read latency 6 cycles, write latency 4 cycles
• quadword (16-byte) or line (128-byte) accesses
• three priority levels
  – DMA
  – load and store
  – instruction fetch
• 16 bytes/cycle load/store bandwidth
• 128 bytes/cycle DMA bandwidth

Synergistic Processor Element (SPE) — DMA unit (Source: [3])
• controls the data traffic
• local and external requests
• address translation by the MMU
• up to 16 outstanding requests supported
• up to 16 KB per request
• programmed through a channel interface
  – message-passing interface
  – up to 128 unidirectional channels (blocking or non-blocking)
  – read channel, write channel and read channel count

Synergistic Processor Element (SPE) (Source: [3])
• two pipelines, "even" and "odd"
• 6 execution units
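The DMA constraints listed above (at most 16 KB per request, at most 16 outstanding requests) determine how a large transfer has to be tiled. A minimal sketch of that bookkeeping in Python — a conceptual model only, not Cell SDK code; the function names are invented for illustration:

```python
# Conceptual model of an SPE DMA unit splitting a large transfer into
# requests: at most 16 KB per request, at most 16 requests in flight.
MAX_REQUEST_BYTES = 16 * 1024
MAX_OUTSTANDING = 16

def split_transfer(total_bytes):
    """Return the list of request sizes needed to move total_bytes."""
    requests = []
    remaining = total_bytes
    while remaining > 0:
        chunk = min(remaining, MAX_REQUEST_BYTES)
        requests.append(chunk)
        remaining -= chunk
    return requests

def batches(requests):
    """Group requests into waves of at most 16 outstanding at a time."""
    return [requests[i:i + MAX_OUTSTANDING]
            for i in range(0, len(requests), MAX_OUTSTANDING)]

reqs = split_transfer(300 * 1024)   # a 300 KB transfer
print(len(reqs))                    # 19 requests
print(len(batches(reqs)))           # 2 waves of outstanding requests
```

A 300 KB transfer thus needs 19 requests, which the DMA unit can only keep in flight in two waves (16, then 3).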
Synergistic Processor Element (SPE) — pipeline depths and latencies (Source: [3])

Execution Unit    | Instructions                                     | Pipe | Depth | Latency
Simple Fixed      | word arithmetic, logicals, count leading zeros,  | Even |   2   |   2
                  |   selects and compares                           |      |       |
Simple Fixed      | word shifts and rotates                          | Even |   3   |   4
Single Precision  | multiply-accumulate                              | Even |   6   |   6
Single Precision  | integer multiply-accumulate                      | Even |   7   |   7
Byte              | pop count, absolute sum of differences,          | Odd  |   3   |   4
                  |   byte average, byte sum                         |      |       |
Permute           | quadword shifts, rotates, gathers, shuffles,     | Odd  |   3   |   4
                  |   as well as reciprocal estimates                |      |       |
Load Store        | load and store                                   | Odd  |   6   |   6
Channel           | channel read/write                               | Odd  |   5   |   6
Branch            | branches                                         | Odd  |   3   |   4

Memory Interface Controller (MIC) (Source: [1])
• supports 3.2 GHz XDR DRAM from Rambus (licensed by Sony and Toshiba)
• XIO interface — an ultra-high-speed parallel interface between multiple chips
• 25.6 GB/s memory bandwidth (dual channel)
• up to 4.5 GB addressable (72 DRAM devices @ 512 Mb)

XDR memory system (Source: [11])
• 1024-bit block accesses
• partitioned into "sandboxes"
  – status bits determine which SPEs may access them (memory protection)
• command and address bus
  – unidirectional, 12 bits wide, 800 Mb/s
  – up to 36 DRAM devices
• data bus
  – 36 bidirectional point-to-point links (3.2 Gb/s)
  – configurable according to the number of devices
  – 3.2 GHz (8 × 400 MHz) — octal data rate (ODR)
  – maximum bandwidth 12.8 GB/s (2 devices, 16 bit, 1 channel)

Bus Interface Controller (Source: [1])
• communication with the rest of the system
• FlexIO interface (developed by Rambus)
• coherent interface for SMP (symmetric multiprocessing), non-coherent interface for I/O
• 44.8 GB/s outbound bandwidth, 32 GB/s inbound bandwidth
• 76.8 GB/s total I/O bandwidth

Bus Interface Controller (Source: [1])
• 12 byte lanes
• byte lane — an 8-bit wide, unidirectional, point-to-point link
• 6.4 GB/s per byte lane
• asymmetric configuration
  – 7 byte lanes outbound, 5 byte lanes inbound
  – grouping into coherent and non-coherent groups
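The headline throughput and bandwidth figures quoted in this talk are mutually consistent; a quick back-of-the-envelope check reproduces them (the GFlops factoring follows the slides' own 8 × 4 × 2 × 4 breakdown):

```python
# Back-of-the-envelope check of the peak figures quoted in this talk.

# Peak single precision: 8 SPEs x 4 GHz x 2 FPUs x 4 operations/cycle.
spe_gflops = 8 * 4 * 2 * 4                    # 256 GFlops

# XDR memory: 2 devices x 16 bits x 3.2 Gb/s per pin = 12.8 GB/s per channel.
xdr_channel_gbs = 2 * 16 * 3.2 / 8            # GB/s on one channel
xdr_dual_gbs = 2 * xdr_channel_gbs            # 25.6 GB/s (the MIC figure)

# FlexIO: 12 byte lanes at 6.4 GB/s each, split 7 outbound / 5 inbound.
flexio_out_gbs = 7 * 6.4                      # 44.8 GB/s outbound
flexio_in_gbs = 5 * 6.4                       # 32.0 GB/s inbound
flexio_total_gbs = flexio_out_gbs + flexio_in_gbs   # 76.8 GB/s in total

print(spe_gflops, round(xdr_dual_gbs, 1), round(flexio_total_gbs, 1))
```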
Element Interconnect Bus (EIB) (Source: [1])
• connects all units of the Cell with one another
• coherent
• data structure
  – 4 unidirectional 128-bit rings
  – 96 bytes/cycle, up to 128 outstanding requests
• control structure
  – a single bus
• runs at half the Cell clock frequency

Software Cells (Source: [4])
• apulet — a bundle of a data object and the program code required to process it
• initialized by the PPE and executed directly by the SPEs
• system-independent
• can be sent over the network

Example — digital TV receiver (Source: [10])
• required steps (MPEG-2)
  – COFDM demodulation
  – error correction
  – demultiplexing and descrambling
  – MPEG audio and video decode
  – video scaling
  – display construction
  – contrast and brightness processing
• processing chain: COFDM demodulation → error correction → demultiplexing & descrambling → audio & video decode → video scaling → display construction → contrast & brightness processing

Flexibility (Source: [1])
• two Cells can be connected directly via FlexIO, or indirectly through a switch
• different system configurations
  – game consoles, HDTVs
  – home entertainment systems
  – workstations, supercomputers

Cell Processor Based Workstation (CPBW) (Source: [6])
• Cell-based blade server board prototype
  – 2 Cells, 512 MB XDR DRAM, 2 south bridges
  – 2.4–2.8 GHz (3 GHz in the lab)
  – 200 GFlops @ 3 GHz (400 GFlops per board)
  – 7 boards per rack
  – 23 × 43 cm
  – Linux 2.6.11

Playstation 3 (Source: [7])
• 1 Cell with 7 SPEs @ 3.2 GHz, 256 MB XDR DRAM
• graphics chip from NVidia @ 550 MHz, 256 MB GDDR3 DRAM
• 3 Ethernet ports, 802.11b/g WLAN, Bluetooth 2.0, optional hard disk
• availability: spring 2006; price: 300–400 euros

Conclusion
• an unusual architecture that differs from that of current processors
• a simplified design that partly abandons classical approaches of computer architecture
• high performance through massive parallelization of the workload
• a new programming model is required to use the SPEs efficiently
• the SPEs will be difficult to program
• primary use in the multimedia domain (single precision)
• open questions: use in supercomputers (double precision) and in desktop PCs (competing with x86)

References
• [1]: H. Peter Hofstee, "Power Efficient Processor Design and the Cell Processor"
  http://www.hpcaconf.org/hpca11/slides/Cell_Public_Hofstee.pdf
  http://www.hpcaconf.org/hpca11/papers/25_hofstee-cellprocessor_final.pdf
• [2]: D. Pham et al., "The Design and Implementation of a First-Generation CELL Processor", International Solid-State Circuits Conference Technical Digest, Feb. 2005
  http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell
• [3]: B. Flachs et al., "A Streaming Processing Unit for a CELL Processor", International Solid-State …
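The one-step-per-SPE chaining shown in the digital TV receiver example lends itself to a simple software model. The sketch below is purely illustrative — a generator-based pipeline with invented stage functions that shows only the dataflow shape, not real Cell code:

```python
# Toy model of SPE stream processing ("chaining"): each stage transforms
# frames and hands them on, as in the digital TV receiver example.
def make_stage(name):
    def stage(frames):
        for frame in frames:
            yield frame + [name]      # record that this stage processed it
    return stage

STAGES = ["COFDM demodulation", "error correction",
          "demux & descrambling", "audio/video decode",
          "video scaling", "display construction",
          "contrast & brightness"]

def pipeline(frames, stages=STAGES):
    stream = iter(frames)
    for name in stages:               # one stage per SPE, chained in order
        stream = make_stage(name)(stream)
    return list(stream)

out = pipeline([[f"frame{i}"] for i in range(3)])
print(out[0])   # the frame record now lists all 7 stages, in order
```

If a longer chain were needed, the slide above notes that SPEs of other Cells could be appended — in this model, simply a longer `STAGES` list.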