The Microarchitecture of the Synergistic Processor for a Cell Processor Brian Flachs, Shigehiro Asano, Member, IEEE, Sang H

Total Page:16

File Type:pdf, Size:1020Kb

The Microarchitecture of the Synergistic Processor for a Cell Processor Brian Flachs, Shigehiro Asano, Member, IEEE, Sang H IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006 63 The Microarchitecture of the Synergistic Processor for a Cell Processor Brian Flachs, Shigehiro Asano, Member, IEEE, Sang H. Dhong, Fellow, IEEE, H. Peter Hofstee, Member, IEEE, Gilles Gervais, Roy Kim, Tien Le, Peichun Liu, Jens Leenstra, John Liberty, Brad Michael, Hwa-Joon Oh, Silvia Melitta Mueller, Osamu Takahashi, Member, IEEE, A. Hatakeyama, Yukio Watanabe, Naoka Yano, Daniel A. Brokenshire, Mohammad Peyravian, Vandung To, and Eiji Iwata Abstract—This paper describes an 11 FO4 streaming data becomes an important performance issue. Since leakage is processor in the IBM 90-nm SOI-low-k process. The dual-issue, proportional to area, processor designs need to extract more four-way SIMD processor emphasizes achievable performance per performance per transistor. area and power. Software controls most aspects of data movement and instruction flow to improve memory system performance and core performance density. The design minimizes instruction II. ARCHITECTURE latency while providing for fine grain clock control to reduce power. The Cell processor is a heterogeneous shared memory mul- tiprocessor [2]. It features a multi-threaded 64 bit POWER Index Terms—Cell, DSP, RISC, SIMD, SPE, SPU. processing element (PPE) and eight synergistic processing elements (SPE). Performance per transistor is the motivation I. INTRODUCTION for heterogeneity. Software can be divided into general purpose computing threads, operating system tasks, and streaming NCREASING thread level parallelism, data bandwidth, media threads and targeted to a processing core customized memory latency, and leakage current are important drivers I for those tasks. For example, PPE is responsible for running for new processor designs, such as Cell. Today’s media-rich the operating system and coordinating the flow of the data application software is often characterized by multiple light processing threads through the SPEs. This differentiation weight threads and software pipelines. This trend in software allows the architectures and implementations of the PPE and design favors processors that utilize these threads to drive SPE to be optimized for their respective workloads and enables the improved data bandwidths over processors designed to significant improvements in performance per transistor. accelerate a single thread of execution by taking advantage of The synergistic processor element (SPE) is the first imple- instruction level parallelism. Memory latency is a key limiter to mentation of a new processor architecture designed to accel- processor performance. Modern processors can lose up to 4000 erate media and streaming workloads. The architecture aims instruction slots while they wait for data from main memory. to improve the effective memory bandwidth achievable by ap- Previous designs emphasize large caches and reorder buffers, plications by improving the degree to which software can tol- first to reduce the average latency and second to maintain in- erate memory latency. SPE provides processing power needed struction throughput while waiting for data from cache misses. by streaming and media workloads through four-way SIMD op- However, these hardware structures have difficulty scaling to erations, dual issue and high frequency. the sizes required by the large data structures utilized by media Area and power efficiency are important enablers for multi- rich software. Transistors oxides are now a few atomic levels core designs that take advantage of parallelism in applications, thick and the channels are extremely narrow. These features are where performance is power limited. Every design choice must very good for improving transistor performance and increasing trade off the performance a prospective feature would bring transistor density, but tend to increase leakage current. As versus the prospect of omitting the feature and devoting the area processor performance becomes power limited, leakage current and power toward higher clock frequency or more SPE cores per Cell processor chip. Power efficiency drives a desire to replace Manuscript received April 15, 2005; revised August 31, 2005. event and status polling performed by software during synchro- B. Flachs, S. H. Dhong, H. P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Liberty, B. Michael, H.-J. Oh, O. Takahashi, D. A. Brokenshire, and V. To are nization with synchronization mechanisms that allow for low with the IBM Systems and Technology Group, Austin, TX 78758 USA (e-mail: power waiting. fl[email protected]). Fig. 1 is a diagram of the SPE architecture’s major entities and S. Asano and Y. Watanabe are with Toshiba America Electronic Components, Austin, TX 78717 USA. their relationships. Local store is a private memory for SPE in- N. Yano is with the Broadband System LSI Development Center, Semicon- structions and data. The synergistic processing unit (SPU) core ductor Company, Toshiba Corporation, Kawasaki, Japan. is a processor than runs instructions from the local store and J. Leenstra and S. M. Mueller are with the IBM Entwicklung GmbH, Boeblingen 71032, Germany. can read or write the local store with its load and store instruc- A. Hatakeyama and E. Iwata are with Sony Computer Entertainment, Austin, tions. The direct memory access (DMA) unit transfers data be- TX 78717 USA. tween local store and system memory. The DMA unit is pro- M. Peyravian is with IBM Microelectronics, Research Triangle Park, NC 27709 USA. grammable by SPU software via the channel unit. The channel Digital Object Identifier 10.1109/JSSC.2005.859332 unit is a message passing interface between the SPU core and 0018-9200/$20.00 © 2006 IEEE 64 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 1, JANUARY 2006 the DMA unit and the rest of the Cell processing system. The channel unit is accessed by the SPE software through channel access instructions. The SPU core is a SIMD RISC-style processor. All instruc- tions are encoded in 32 bit fixed length instruction formats and there are no hard to pipeline instructions. SPU features 128 gen- eral purpose registers. These registers are used by both floating point and integer instructions. The shared register file allows the highest level of performance for various workloads with the smallest number of registers. 128 registers allow for loop un- rolling which is necessary to fill functional unit pipelines with independent instructions. Most instructions operate on 127 bit wide data. For example, the floating point multiply add instruc- tion operates on vectors of four 32 bit single precision floating point values. Some instructions, such as floating point multiply Fig. 1. SPE architecture diagram. add, consume three register operands and produce a register result. SPE includes instructions that perform single precision floating point, integer arithmetic, logicals, loads, stores, com- pares and branches in addition to some instructions intended to help media applications. The instruction set is designed to simplify compiler register allocation and code schedulers. Most SPE software is written in C or C++ with intrinsic functions. The SPE’s architecture reduces area and power while facili- tating improved performance by requiring software solve “hard” scheduling problems such as data fetch and branch prediction. Because SPE will not be running the operating system, SPE concentrates on user mode execution. SPE load and store in- structions are performed within a local address space, not in system address space. The local address space is untranslated, unguarded and noncoherent with respect to the system address space and is serviced by the local store (LS). LS is a private memory, not a cache, and does not require tag arrays or backing Fig. 2. Example time line of concurrent computation and memory access. store. Loads, stores and instruction fetch complete with fixed delay and without exception, greatly simplifying the core de- sign and providing predictable real-time behavior. This design instructions. These commands can perform scatter-gather op- reduces the area and power of the core while allowing for higher erations from system memory or setup a complex set of status frequency operation. reporting and notification mechanisms. Not only can software Data is transferred to and from the LS in 1024 bit lines by achieve much higher bandwidth through the DMA engine than the SPE DMA engine. The SPE DMA engine allows SPE soft- it could with a hardware prefetch engine, a much higher frac- ware to schedule data transfers in parallel with core execution. tion of the bandwidth is useful data than would occur with the Fig. 2 is a time line that illustrates how software can be di- prefetch engine design. vided into coarse grained threads to overlap data transfer and The channel unit is a message passing interface between the core computation. As thread 1 finishes its computation it initi- SPU core and the rest of the system. Each device is allocated ates DMA fetch of its next data set and branches to thread 2. one or more channels through which messages can be sent to or Thread 2 begins by waiting for its previously requested data from the SPU core. SPU software sends and receives messages transfers to finish and begins computation while the DMA en- with the write and read channel instructions. Channels have ca- gine gets the data needed by thread 1. When thread 2 completes pacity which allows for multiple messages to be queued. Ca- the computation, it programs the DMA engine to store the re- pacity allows the SPU to send multiple commands to a device sults to system memory and fetch from system memory the in pipelined fashion without incurring delay, until the channel next data set. Thread 2 then branches back to thread 1. Tech- capacity is exhausted. When a channel is exhausted the write niques like double buffering and course grained multithreading or read instruction will stall the SPU in a low power wait mode allow software to overcome memory latency to achieve high until the device becomes ready. Channel wait mode can often memory bandwidth and improve performance.
Recommended publications
  • Wind Rose Data Comes in the Form >200,000 Wind Rose Images
    Making Wind Speed and Direction Maps Rich Stromberg Alaska Energy Authority [email protected]/907-771-3053 6/30/2011 Wind Direction Maps 1 Wind rose data comes in the form of >200,000 wind rose images across Alaska 6/30/2011 Wind Direction Maps 2 Wind rose data is quantified in very large Excel™ spreadsheets for each region of the state • Fields: X Y X_1 Y_1 FILE FREQ1 FREQ2 FREQ3 FREQ4 FREQ5 FREQ6 FREQ7 FREQ8 FREQ9 FREQ10 FREQ11 FREQ12 FREQ13 FREQ14 FREQ15 FREQ16 SPEED1 SPEED2 SPEED3 SPEED4 SPEED5 SPEED6 SPEED7 SPEED8 SPEED9 SPEED10 SPEED11 SPEED12 SPEED13 SPEED14 SPEED15 SPEED16 POWER1 POWER2 POWER3 POWER4 POWER5 POWER6 POWER7 POWER8 POWER9 POWER10 POWER11 POWER12 POWER13 POWER14 POWER15 POWER16 WEIBC1 WEIBC2 WEIBC3 WEIBC4 WEIBC5 WEIBC6 WEIBC7 WEIBC8 WEIBC9 WEIBC10 WEIBC11 WEIBC12 WEIBC13 WEIBC14 WEIBC15 WEIBC16 WEIBK1 WEIBK2 WEIBK3 WEIBK4 WEIBK5 WEIBK6 WEIBK7 WEIBK8 WEIBK9 WEIBK10 WEIBK11 WEIBK12 WEIBK13 WEIBK14 WEIBK15 WEIBK16 6/30/2011 Wind Direction Maps 3 Data set is thinned down to wind power density • Fields: X Y • POWER1 POWER2 POWER3 POWER4 POWER5 POWER6 POWER7 POWER8 POWER9 POWER10 POWER11 POWER12 POWER13 POWER14 POWER15 POWER16 • Power1 is the wind power density coming from the north (0 degrees). Power 2 is wind power from 22.5 deg.,…Power 9 is south (180 deg.), etc… 6/30/2011 Wind Direction Maps 4 Spreadsheet calculations X Y POWER1 POWER2 POWER3 POWER4 POWER5 POWER6 POWER7 POWER8 POWER9 POWER10 POWER11 POWER12 POWER13 POWER14 POWER15 POWER16 Max Wind Dir Prim 2nd Wind Dir Sec -132.7365 54.4833 0.643 0.767 1.911 4.083
    [Show full text]
  • Vxworks Architecture Supplement, 6.2
    VxWorks Architecture Supplement VxWorks® ARCHITECTURE SUPPLEMENT 6.2 Copyright © 2005 Wind River Systems, Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means without the prior written permission of Wind River Systems, Inc. Wind River, the Wind River logo, Tornado, and VxWorks are registered trademarks of Wind River Systems, Inc. Any third-party trademarks referenced are the property of their respective owners. For further information regarding Wind River trademarks, please see: http://www.windriver.com/company/terms/trademark.html This product may include software licensed to Wind River by third parties. Relevant notices (if any) are provided in your product installation at the following location: installDir/product_name/3rd_party_licensor_notice.pdf. Wind River may refer to third-party documentation by listing publications or providing links to third-party Web sites for informational purposes. Wind River accepts no responsibility for the information provided in such third-party documentation. Corporate Headquarters Wind River Systems, Inc. 500 Wind River Way Alameda, CA 94501-1153 U.S.A. toll free (U.S.): (800) 545-WIND telephone: (510) 748-4100 facsimile: (510) 749-2010 For additional contact information, please visit the Wind River URL: http://www.windriver.com For information on how to contact Customer Support, please visit the following URL: http://www.windriver.com/support VxWorks Architecture Supplement, 6.2 11 Oct 05 Part #: DOC-15660-ND-00 Contents 1 Introduction
    [Show full text]
  • Quietrock Case Study | Sony Computer Entertainment America
    Studios & Entertainment Quiet® Success Story Project: ® Sony Computer Entertainment ‘Sh-h-h-h!’ PlayStation 4 game- America, LLC making in progress! Location: San Mateo, California New sound studios spearhead renovations for Sony Computer General Contractor: Entertainment America, with an acoustical assist from Magnum Drywall Magnum Drywall and PABCO® Gypsum’s QuietRock® Products: QuietRock® FLAME CURB® San Mateo, California The Sound of Silence means a lot to engineers producing video games for Sony PlayStation®4 (PS4™) enthusiasts. At Sony Computer Entertainment America LLC headquarters in San Mateo, California, producers’ tolerance for intrusive noise from beyond studio walls is zero. Only the intense action on-screen matters while creating audio effects that dramatize, punctuate and heighten the deeply immersive experience for video gamers. In these studios Sony Entertainment sound engineers make the most of that capability while keeping the PlayStation® pipeline full for weekly launches of new games. The gaming experience draws players to PS4™ and its predecessor PlayStation® consoles, and PS4™ elevates 3D excitement ever higher. It’s the world’s most powerful games console, with a Graphics Processing Unit (GPU) able to perform 1,843 teraflops*. “When sound is important, we prefer to submit QuietRock as a good solution for the architect and the owner. There’s nothing else on the market that’s comparable. I even used it in my own home movie theatre.”” – Gary Robinson, Owner Magnum Drywall what the job demands Visit www.QuietRock.com or call Call 1.800.797.8159 for more information On top of its introductory lineup in November 2013, over 180 PS4™ games are in development, including “Be the Batman”, the epic conclusion of the “Batman: Arkham Knight” trilogy, that is due out in June 2015.
    [Show full text]
  • List of Notable Handheld Game Consoles (Source
    List of notable handheld game consoles (source: http://en.wikipedia.org/wiki/Handheld_game_console#List_of_notable_handheld_game_consoles) * Milton Bradley Microvision (1979) * Epoch Game Pocket Computer - (1984) - Japanese only; not a success * Nintendo Game Boy (1989) - First internationally successful handheld game console * Atari Lynx (1989) - First backlit/color screen, first hardware capable of accelerated 3d drawing * NEC TurboExpress (1990, Japan; 1991, North America) - Played huCard (TurboGrafx-16/PC Engine) games, first console/handheld intercompatibility * Sega Game Gear (1991) - Architecturally similar to Sega Master System, notable accessory firsts include a TV tuner * Watara Supervision (1992) - first handheld with TV-OUT support; although the Super Game Boy was only a compatibility layer for the preceding game boy. * Sega Mega Jet (1992) - no screen, made for Japan Air Lines (first handheld without a screen) * Mega Duck/Cougar Boy (1993) - 4 level grayscale 2,7" LCD - Stereo sound - rare, sold in Europe and Brazil * Nintendo Virtual Boy (1994) - Monochromatic (red only) 3D goggle set, only semi-portable; first 3D portable * Sega Nomad (1995) - Played normal Sega Genesis cartridges, albeit at lower resolution * Neo Geo Pocket (1996) - Unrelated to Neo Geo consoles or arcade systems save for name * Game Boy Pocket (1996) - Slimmer redesign of Game Boy * Game Boy Pocket Light (1997) - Japanese only backlit version of the Game Boy Pocket * Tiger game.com (1997) - First touch screen, first Internet support (with use of sold-separately
    [Show full text]
  • Power4 Focuses on Memory Bandwidth IBM Confronts IA-64, Says ISA Not Important
    VOLUME 13, NUMBER 13 OCTOBER 6,1999 MICROPROCESSOR REPORT THE INSIDERS’ GUIDE TO MICROPROCESSOR HARDWARE Power4 Focuses on Memory Bandwidth IBM Confronts IA-64, Says ISA Not Important by Keith Diefendorff company has decided to make a last-gasp effort to retain control of its high-end server silicon by throwing its consid- Not content to wrap sheet metal around erable financial and technical weight behind Power4. Intel microprocessors for its future server After investing this much effort in Power4, if IBM fails business, IBM is developing a processor it to deliver a server processor with compelling advantages hopes will fend off the IA-64 juggernaut. Speaking at this over the best IA-64 processors, it will be left with little alter- week’s Microprocessor Forum, chief architect Jim Kahle de- native but to capitulate. If Power4 fails, it will also be a clear scribed IBM’s monster 170-million-transistor Power4 chip, indication to Sun, Compaq, and others that are bucking which boasts two 64-bit 1-GHz five-issue superscalar cores, a IA-64, that the days of proprietary CPUs are numbered. But triple-level cache hierarchy, a 10-GByte/s main-memory IBM intends to resist mightily, and, based on what the com- interface, and a 45-GByte/s multiprocessor interface, as pany has disclosed about Power4 so far, it may just succeed. Figure 1 shows. Kahle said that IBM will see first silicon on Power4 in 1Q00, and systems will begin shipping in 2H01. Looking for Parallelism in All the Right Places With Power4, IBM is targeting the high-reliability servers No Holds Barred that will power future e-businesses.
    [Show full text]
  • SIMD Extensions
    SIMD Extensions PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 12 May 2012 17:14:46 UTC Contents Articles SIMD 1 MMX (instruction set) 6 3DNow! 8 Streaming SIMD Extensions 12 SSE2 16 SSE3 18 SSSE3 20 SSE4 22 SSE5 26 Advanced Vector Extensions 28 CVT16 instruction set 31 XOP instruction set 31 References Article Sources and Contributors 33 Image Sources, Licenses and Contributors 34 Article Licenses License 35 SIMD 1 SIMD Single instruction Multiple instruction Single data SISD MISD Multiple data SIMD MIMD Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism. History The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel.
    [Show full text]
  • The Evolution of Ibm Research Looking Back at 50 Years of Scientific Achievements and Innovations
    FEATURES THE EVOLUTION OF IBM RESEARCH LOOKING BACK AT 50 YEARS OF SCIENTIFIC ACHIEVEMENTS AND INNOVATIONS l Chris Sciacca and Christophe Rossel – IBM Research – Zurich, Switzerland – DOI: 10.1051/epn/2014201 By the mid-1950s IBM had established laboratories in New York City and in San Jose, California, with San Jose being the first one apart from headquarters. This provided considerable freedom to the scientists and with its success IBM executives gained the confidence they needed to look beyond the United States for a third lab. The choice wasn’t easy, but Switzerland was eventually selected based on the same blend of talent, skills and academia that IBM uses today — most recently for its decision to open new labs in Ireland, Brazil and Australia. 16 EPN 45/2 Article available at http://www.europhysicsnews.org or http://dx.doi.org/10.1051/epn/2014201 THE evolution OF IBM RESEARCH FEATURES he Computing-Tabulating-Recording Com- sorting and disseminating information was going to pany (C-T-R), the precursor to IBM, was be a big business, requiring investment in research founded on 16 June 1911. It was initially a and development. Tmerger of three manufacturing businesses, He began hiring the country’s top engineers, led which were eventually molded into the $100 billion in- by one of world’s most prolific inventors at the time: novator in technology, science, management and culture James Wares Bryce. Bryce was given the task to in- known as IBM. vent and build the best tabulating, sorting and key- With the success of C-T-R after World War I came punch machines.
    [Show full text]
  • 14789093.Pdf
    iNIS-mf—8658 THE INFLUENCE OF COLLISIONS WITH NOBLE GASES ON SPECTRAL LINES OF HYDROGEN ISOTOPES PROEFSCHRIFT TER VERKRIJGING VAN DE GRAAD VAN DOCTOR IN DE WISKUNDE EN NATUURWETENSCHAPPEN AAN DE RIJKSUNIVERSITEIT TE LEIDEN, OP GEZAG VAN DE RECTOR MAGNIFICUS DR. A.A.H. KASSENAAR, HOOGLERAAR IN DE FACULTEIT DER GENEESKUNDE, VOLGENS BESLUIT VAN HET COLLEGE VAN DEKANEN TE VERDEDIGEN OP WOENSDAG 10 NOVEMBER 1982 TE KLOKKE 14.15 UUR DOOR PETER WILLEM HERMANS GEBOREN TE ROTTERDAM IN 1952 1982 DRUKKERIJ J.H. PASMANS B.V., 's-GRAVENHAGE Promotor: Prof. dr. J.J.M. Beenakker Het onderzoek is uitgevoerd mede onder verantwoordelijkheid van wijlen prof. dr. H.F.P. Knaap Aan mijn oudeva Het 1n dit proefschrift beschreven onderzoek werd uitgevoerd als onderdeel van het programma van de werkgemeenschap voor Molecuul fysica van de Stichting voor Fundamenteel Onderzoek der Materie (FOM) en is mogelijk gemaakt door financiële steun van de Nederlandse Organisatie voor Zuiver- WetenschappeHjk Onderzoek (ZWO). CONTENTS PREFACE 9 CHAPTER I THEORY OF THE COLLISIONAL BROADENING AND SHIFT OF SPECTRAL LINES OF HYDROGEN INFINITELY DILUTED IN NOBLE GASES 11 1. Introduction 11 2. General theory 12 a. Rotational Raman 15 b. Depolarized Rayleigh 16 3. Experimental preview 16 a. Rotational Raman lines; broadening and shift 17 b. Depolarized Rayleigh line 17 CHAPTER II EXPERIMENTAL DETERMINATION OF LINE BROADENING AND SHIFT CROSS SECTIONS OF HYDROGEN-NOBLE GAS MIXTURES 21 1. Introduction 21 2. Experimental setup 22 2.1 Laser 22 2.2 Scattering cell 24 2.3 Analyzing system 27 2.4 Detection system 28 3. Measuring technique 28 3.1 Width measurement 28 3.2 Shift measurement 32 3.3 Gases 32 4.
    [Show full text]
  • IBM Research AI Residency Program
    IBM Research AI Residency Program The IBM Research™ AI Residency Program is a 13-month program Topics of focus include: that provides opportunity for scientists, engineers, domain experts – Trust in AI (Causal modeling, fairness, explainability, and entrepreneurs to conduct innovative research and development robustness, transparency, AI ethics) on important and emerging topics in Artificial Intelligence (AI). The program aims at creating and investigating novel approaches – Natural Language Processing and Understanding in AI that progress capabilities towards significant technical (Question and answering, reading comprehension, and real-world challenges. AI Residents work closely with IBM language embeddings, dialog, multi-lingual NLP) Research scientists and are expected to fully complete a project – Knowledge and Reasoning (Knowledge/graph embeddings, within the 13-month residency. The results of the project may neuro-symbolic reasoning) include publications in top AI conferences and journals, development of prototypes demonstrating important new AI functionality – AI Automation, Scaling, and Management (Automated data and fielding of working AI systems. science, neural architecture search, AutoML, transfer learning, few-shot/one-shot/meta learning, active learning, AI planning, As part of the selection process, candidates must submit parallel and distributed learning) a 500-word statement of research interest and goals. – AI and Software Engineering (Big code analysis and understanding, software life cycle including modernize, build, debug, test and manage, software synthesis including refactoring and automated programming) – Human-Centered AI (HCI of AI, human-AI collaboration, AI interaction models and modalities, conversational AI, novel AI experiences, visual AI and data visualization) Deadline to apply: January 31, 2021 Earliest start date: June 1, 2021 Duration: 13 months Locations: IBM Thomas J.
    [Show full text]
  • POWER® Processor-Based Systems
    IBM® Power® Systems RAS Introduction to IBM® Power® Reliability, Availability, and Serviceability for POWER9® processor-based systems using IBM PowerVM™ With Updates covering the latest 4+ Socket Power10 processor-based systems IBM Systems Group Daniel Henderson, Irving Baysah Trademarks, Copyrights, Notices and Acknowledgements Trademarks IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both: Active AIX® POWER® POWER Power Power Systems Memory™ Hypervisor™ Systems™ Software™ Power® POWER POWER7 POWER8™ POWER® PowerLinux™ 7® +™ POWER® PowerHA® POWER6 ® PowerVM System System PowerVC™ POWER Power Architecture™ ® x® z® Hypervisor™ Additional Trademarks may be identified in the body of this document. Other company, product, or service names may be trademarks or service marks of others. Notices The last page of this document contains copyright information, important notices, and other information. Acknowledgements While this whitepaper has two principal authors/editors it is the culmination of the work of a number of different subject matter experts within IBM who contributed ideas, detailed technical information, and the occasional photograph and section of description.
    [Show full text]
  • IBM Powerpc 970 (A.K.A. G5)
    IBM PowerPC 970 (a.k.a. G5) Ref 1 David Benham and Yu-Chung Chen UIC – Department of Computer Science CS 466 PPC 970FX overview ● 64-bit RISC ● 58 million transistors ● 512 KB of L2 cache and 96KB of L1 cache ● 90um process with a die size of 65 sq. mm ● Native 32 bit compatibility ● Maximum clock speed of 2.7 Ghz ● SIMD instruction set (Altivec) ● 42 watts @ 1.8 Ghz (1.3 volts) ● Peak data bandwidth of 6.4 GB per second A picture is worth a 2^10 words (approx.) Ref 2 A little history ● PowerPC processor line is a product of the AIM alliance formed in 1991. (Apple, IBM, and Motorola) ● PPC 601 (G1) - 1993 ● PPC 603 (G2) - 1995 ● PPC 750 (G3) - 1997 ● PPC 7400 (G4) - 1999 ● PPC 970 (G5) - 2002 ● AIM alliance dissolved in 2005 Processor Ref 3 Ref 3 Core details ● 16(int)-25(vector) stage pipeline ● Large number of 'in flight' instructions (various stages of execution) - theoretical limit of 215 instructions ● 512 KB L2 cache ● 96 KB L1 cache – 64 KB I-Cache – 32 KB D-Cache Core details continued ● 10 execution units – 2 load/store operations – 2 fixed-point register-register operations – 2 floating-point operations – 1 branch operation – 1 condition register operation – 1 vector permute operation – 1 vector ALU operation ● 32 64 bit general purpose registers, 32 64 bit floating point registers, 32 128 vector registers Pipeline Ref 4 Benchmarks ● SPEC2000 ● BLAST – Bioinformatics ● Amber / jac - Structure biology ● CFD lab code SPEC CPU2000 ● IBM eServer BladeCenter JS20 ● PPC 970 2.2Ghz ● SPECint2000 ● Base: 986 Peak: 1040 ● SPECfp2000 ● Base: 1178 Peak: 1241 ● Dell PowerEdge 1750 Xeon 3.06Ghz ● SPECint2000 ● Base: 1031 Peak: 1067 Apple’s SPEC Results*2 ● SPECfp2000 ● Base: 1030 Peak: 1044 BLAST Ref.
    [Show full text]
  • Treatment and Differential Diagnosis Insights for the Physician's
    Treatment and differential diagnosis insights for the physician’s consideration in the moments that matter most The role of medical imaging in global health systems is literally fundamental. Like labs, medical images are used at one point or another in almost every high cost, high value episode of care. Echocardiograms, CT scans, mammograms, and x-rays, for example, “atlas” the body and help chart a course forward for a patient’s care team. Imaging precision has improved as a result of technological advancements and breakthroughs in related medical research. Those advancements also bring with them exponential growth in medical imaging data. The capabilities referenced throughout this document are in the research and development phase and are not available for any use, commercial or non-commercial. Any statements and claims related to the capabilities referenced are aspirational only. There were roughly 800 million multi-slice exams performed in the United States in 2015 alone. Those studies generated approximately 60 billion medical images. At those volumes, each of the roughly 31,000 radiologists in the U.S. would have to view an image every two seconds of every working day for an entire year in order to extract potentially life-saving information from a handful of images hidden in a sea of data. 31K 800MM 60B radiologists exams medical images What’s worse, medical images remain largely disconnected from the rest of the relevant data (lab results, patient-similar cases, medical research) inside medical records (and beyond them), making it difficult for physicians to place medical imaging in the context of patient histories that may unlock clues to previously unconsidered treatment pathways.
    [Show full text]