IBM POWER9 SMT Deep Dive Summit Training Workshop


Brian Thompto, POWER Systems, IBM Systems
© 2018 IBM Corporation

POWER9 Processor: Performance Optimized for Cognitive Workloads, Open Interfaces for Accelerated Computing

• New core microarchitecture with an enhanced cache hierarchy: up to 120 MB per chip
• First processor to introduce PCIe Gen4
• 25G coherent link, an on-chip super-highway connecting cores, caches, and accelerators/GPUs: next-generation CAPI technology and NVLink 2.0 for GPU attach
• 14 nm silicon technology
• A family of scale-out and scale-up optimized offerings
• Dual memory subsystems, optimized for scale-out (latency/density) and enterprise (capacity/bandwidth/RAS)
• 12 SMT8 or 24 SMT4 cores (96 threads per chip)
• High-bandwidth scale-up fabric: 2-16 socket offerings with 2-4x chip-to-chip interconnect bandwidth

POWER9 – AC922 with 6 GPUs

[Diagram: POWER9 die floorplan; memory signaling and SMP/accelerator signaling along the chip edges; PCIe signaling, SMP interconnect, and on-chip/off-chip accelerator enablement in the center; 24 cores arranged around them, each with its L2 cache and L3 region.]

• POWER9 chip with 22 of 24 cores active
• Up to 88 threads per socket

Images/diagrams modified from: "IBM POWER9 systems designed for commercial cognitive and cloud", IBM J. Res. & Dev., vol. 62, no. 4/5, 2018; "POWER9: Processor for the cognitive era", Proc. Hot Chips 28 Symp., pp. 1-19, Aug. 2016.

POWER9 – Core and Cache Topology

[Diagram: the same die floorplan, with one core pair expanded: two SMT4 cores (1-4 threads each) sharing an L2 cache (512 KB) and an L3 cache region (10 MB).]

• 2 x POWER9 SMT4 cores: 1-4 threads each
• Shared L2 cache (512 KB) and L3 cache region (10 MB): serves 1-8 threads

Images/diagrams modified from: "POWER9: Processor for the cognitive era", Proc. Hot Chips 28 Symp., pp. 1-19, Aug. 2016.
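This core-pair sharing is visible from software. As an illustrative aside (ours, not a slide from the deck), the short Python script below reads the generic Linux sysfs topology files to show which hardware threads share a core and which CPUs share each cache level. The paths are standard kernel interfaces, so they should apply on a POWER9 Linux system; treat this as a sketch, not an IBM-provided tool.

```python
# Sketch: inspect SMT-sibling and cache sharing via Linux sysfs.
# The paths are generic kernel interfaces; the CPU numbering you see
# depends on the SMT mode the system is running in.
from pathlib import Path

def read(p):
    return Path(p).read_text().strip()

base = Path("/sys/devices/system/cpu/cpu0")

# Hardware threads that share cpu0's core (its SMT siblings).
print("SMT siblings of cpu0:", read(base / "topology/thread_siblings_list"))

# For each cache level visible to cpu0, show size and sharers.
for idx in sorted((base / "cache").glob("index*")):
    level = read(idx / "level")
    size = read(idx / "size")
    shared = read(idx / "shared_cpu_list")
    print(f"L{level} ({size}) shared with CPUs: {shared}")
```

On a core pair as described above, the L2 and L3 entries should list the hardware threads of both cores in the pair, while the L1 entries list only the SMT siblings of one core.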
POWER9: Cache Capacity

Caches per pair of SMT4 cores (serving 1-8 threads):
• L2: 512 KB, 8-way
• L3: 10 MB, 20-way
• Enhanced L3 cache effectiveness with enhanced replacement
• Aggregate 110 MB of L3 (11 regions x 10 MB, 20-way) when 22 of the 24 cores are active, as on Summit

[Die photo: POWER9 with 17 layers of metal; twelve 10 MB eDRAM L3 regions alongside the processing cores; 7 TB/s on-chip bandwidth, 256 GB/s per core pair.]

Off-chip interfaces:
• DDR: new IBM & partner memory
• SMP: other POWER9 chips
• PCIe: new PCIe devices
• CAPI: Mellanox and IBM & partner devices
• NVLink 2: NVIDIA GPUs

New POWER9 Microarchitecture

Optimized for cognitive workloads and stronger thread performance:
• Shorter pipeline and improved scheduling / branch prediction for unoptimized code and interpretive languages
• Increased execution bandwidth for a range of workloads, including commercial, cognitive, and analytics
• Adaptive features for improved efficiency and performance

[Diagram: POWER9 pipeline (SMT4): fetch of 8 instructions from the instruction cache; two decode stages handling 6 instructions, with cracking into internal operations; dispatch of 6 instructions through resource assignment and slice routing; issue to the branch, LSU, and VSU pipes of the execution slices; result broadcast, floating-point stages, format/bypass; finish and completion of up to 64 instructions or internal operations.]

POWER9 SMT4-Core Microarchitecture

[Diagram: the sliced micro-architecture. A 64b execution slice pairs VSU compute (64b) with an LSU doubleword load/store port; two slices combine into a 128b super-slice (2 x 64b compute, 1 x 128b load/store); two super-slices plus the instruction fetch unit (IFU) and instruction sequencing unit (ISU) form the POWER9 SMT4 core.]

Images/diagrams modified from: "POWER9: Processor for the cognitive era", Proc. Hot Chips 28 Symp., pp. 1-19, Aug. 2016; "IBM POWER9 processor core", IBM J. Res. & Dev., vol. 62, no. 4/5, pp. 2:1-2:12, 2018.
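To make the associativity figures concrete, here is a quick back-of-envelope calculation (ours, not the deck's) deriving the number of sets in each cache; the 128-byte cache line size is an assumption stated here, not slide content.

```python
# Back-of-envelope: set counts for the per-core-pair caches.
# Assumes 128-byte cache lines (our assumption; see lead-in above).
LINE_BYTES = 128

def num_sets(size_bytes, ways, line=LINE_BYTES):
    # sets = capacity / (associativity * line size)
    return size_bytes // (ways * line)

print("L2 sets:", num_sets(512 * 1024, 8))     # 512 KB, 8-way  -> 512
print("L3 sets:", num_sets(10 * 1024**2, 20))  # 10 MB, 20-way -> 4096
```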
POWER9 – Core Compute

SMT4 core resources (22 SMT4 cores per socket in Summit systems):

Fetch / branch:
• 32 KB, 8-way instruction cache
• 8-instruction fetch, 6-instruction decode
• 1x branch execution

Execution slices (issue VSU operations and address generation):
• 4x scalar 64b / 2x vector 128b
• 4x load/store AGEN

Vector Scalar Unit (VSU) pipes:
• 4x ALU + simple (64b)
• 4x FP + FX-MUL + complex (64b)
• 2x permute (128b)
• 2x quad fixed-point (128b)
• 2x fixed-point divide (64b)
• 1x quad-precision FP & decimal FP
• 1x cryptography

Load Store Unit (LSU) slices:
• 32 KB, 8-way data cache
• Up to 4 doubleword (DW) loads or stores

[Diagram: the SMT4 core. Predecode feeds the L1 instruction cache and instruction buffer; decode/crack with branch prediction; dispatch with allocate/rename and an instruction/iop completion table; a branch slice plus four execution slices (ALU, AGEN, XS, FP, MUL, XC per slice, with PM, QFX, QP/DFU, DIV, and CRYPT pipes shared across each 128b super-slice); four store-data pipes; four L1 data-cache banks with load reorder queues (LRQ 0/1, LRQ 2/3) and store reorder queues (SRQ 0-3).]
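One way to read this resource list: the four 64b FP + MUL pipes (two 128b super-slices) give each SMT4 core four double-precision FMAs per cycle. The estimate below is our back-of-envelope arithmetic, not slide content; the per-cycle FMA mapping and the 3.07 GHz nominal Summit clock are both assumptions.

```python
# Back-of-envelope peak double-precision FLOP/s for one Summit socket.
# Assumptions (ours): FMA counts as 2 flops; nominal clock 3.07 GHz.
CORES = 22          # active SMT4 cores per socket on Summit
FMA_PER_CYCLE = 4   # four 64b FP pipes = 4 DP FMAs per cycle per core
FLOPS_PER_FMA = 2
GHZ = 3.07          # assumed nominal Summit clock

peak = CORES * FMA_PER_CYCLE * FLOPS_PER_FMA * GHZ * 1e9
print(f"peak DP: {peak / 1e12:.2f} TFLOP/s per socket")  # ~0.54
```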
Thread Sharing of POWER9 Cores + Cache

[Diagram series: one core pair and its shared cache region shown in four configurations; inactive cores are power-gated, and active threads are spread across the execution super-slices.]

ST x 1 core: 11 threads per socket
• A subset of cores runs in single-thread (ST) mode; the other core of each pair is inactive
• Inactive cores allow a higher socket frequency via the WOF (Workload Optimized Frequency) boost
• The single thread gets access to the full L2 / L3 cache region

ST x 2 cores: 22 threads per socket
• 1 thread active per core (ST)
• Each thread gets ½ of the pair's execution resources (a full core)
• The two threads share the L2 / L3 cache region

SMT2 x 2 cores: 44 threads per socket
• 2 threads active per core (SMT2)
• Each pair of threads shares one core, ½ of the pair's execution resources
• 4 threads share the L2 / L3 cache region

SMT4 x 2 cores: 88 threads per socket
• 4 threads active per core (SMT4)
• 8 threads share the L2 / L3 cache region
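These sharing modes matter for thread placement. As an illustrative sketch (ours, Linux-specific), the script below pins a process either one software thread per core, approximating the ST rows above, or onto the SMT siblings of a single core. It assumes the usual numbering in which a core's four hardware threads get consecutive CPU ids (cpu0-3 = core 0, and so on); verify with the sysfs files shown earlier.

```python
# Sketch: ST-like vs SMT4-like CPU placement on Linux.
# Assumes consecutive numbering of each core's hardware threads;
# verify via /sys/devices/system/cpu/cpu*/topology/thread_siblings_list.
import os

SMT = 4  # hardware threads per core in SMT4 mode

def st_cpus(n_cores):
    """First hardware thread of each core (ST-like placement)."""
    return {core * SMT for core in range(n_cores)}

def smt4_cpus(core):
    """All four hardware threads of one core (SMT4-like placement)."""
    return {core * SMT + t for t in range(SMT)}

os.sched_setaffinity(0, st_cpus(4))      # spread over 4 cores
print(sorted(os.sched_getaffinity(0)))   # [0, 4, 8, 12]

os.sched_setaffinity(0, smt4_cpus(0))    # pack onto core 0
print(sorted(os.sched_getaffinity(0)))   # [0, 1, 2, 3]
```

The cache arithmetic follows directly from the slides: each 10 MB L3 region serves 1, 2, 4, or 8 threads in the four configurations, i.e. roughly 10, 5, 2.5, and 1.25 MB of L3 per thread. On Linux, the SMT mode itself can be inspected or changed with the powerpc-utils command ppc64_cpu (for example, ppc64_cpu --smt=2).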