Sony/Toshiba/IBM (STI) CELL Processor Three Performance-Limiting Walls

Scientific Computing for Engineers: Spring 2007

Three Performance-Limiting Walls

• Power Wall: Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.

• Frequency Wall: Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.

• Memory Wall: On multi-gigahertz symmetric multiprocessors – even those with integrated memory controllers – latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.

The Memory Wall

"When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available."
H. Peter Hofstee, "Cell Broadband Engine Architecture from 20,000 feet"
http://www-128.ibm.com/developerworks/power/library/pa-cbea.html

"Their (multicore) low cost does not guarantee their effective use in HPC. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory. The stream benchmark performance of Intel's new Woodcrest dual core processor illustrates this point. [...] Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GBs/sec on nodes with two sockets and four cores. Yet, four-threaded runs of the memory intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem."
Richard B. Walsh, "New Processor Options for HPC"
http://www.hpcwire.com/hpc/906849.html
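To make the memory-wall numbers concrete, here is a minimal pointer-chasing sketch (not from the slides): every load depends on the previous one, so the measured time per iteration approaches raw DRAM latency rather than bandwidth. Plain C99 on a generic Linux host is assumed; the array size, seed, and timing method are illustrative only.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 24)   /* 16M entries of size_t -- far larger than any cache */
#define STEPS (1u << 24)

int main(void)
{
    size_t *next = malloc((size_t)N * sizeof *next);
    size_t i, j, tmp, cur = 0;
    struct timespec t0, t1;

    if (!next) return 1;

    /* Build a single random cycle (Sattolo's algorithm) so that every load
       is data-dependent on the one before it -- no overlap is possible. */
    for (i = 0; i < N; i++) next[i] = i;
    srand(12345);
    for (i = N - 1; i > 0; i--) {
        j = (size_t)rand() % i;              /* j in [0, i-1] */
        tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < STEPS; i++)
        cur = next[cur];                     /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg %.1f ns per dependent load (cur=%zu)\n", ns / STEPS, cur);
    return 0;
}
```

DRAM latencies on the order of 100 ns translate into several hundred cycles at multi-gigahertz clock rates, which is exactly the stall behaviour the Hofstee quote describes.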
CELL Overview

• $400 million over 5 years
• Sony/Toshiba/IBM alliance, known as STI
• STI Design Center opened in Austin, Texas, in March 2001
• Mercury Computer Systems Inc. – dual CELL blades
• Cell Broadband Engine Architecture (CBEA), also known as CELL BE
• Products: PlayStation 3, dual CELL blades, PCI accelerator cards
• 3.2 GHz, 90 nm SOI
• 234 million transistors, compared to:
  – 165 million – Xbox 360
  – 220 million – Itanium 2 (2002)
  – 1,700 million – Dual-Core Itanium 2 (2006)

CELL Architecture

• PPU – PowerPC 970 core
• SPE – Synergistic Processing Element
• SPU – Synergistic Processing Unit
• LS – Local Store
• MFC – Memory Flow Controller
• EIB – Element Interconnect Bus
• MIC – Memory Interface Controller

Power Processing Element

• Power Processing Element (PPE)
• PowerPC 970 architecture compliant
• 2-way symmetric multithreading (SMT)
• 32 KB level 1 instruction cache
• 32 KB level 1 data cache
• 512 KB level 2 cache
• VMX (AltiVec) with 32 128-bit vector registers
• Standard FPU:
  – fully pipelined double precision with FMA
  – 6.4 Gflop/s DP at 3.2 GHz
• AltiVec:
  – no double precision
  – 4-way fully pipelined single precision with FMA
  – 25.6 Gflop/s SP at 3.2 GHz

Synergistic Processing Elements

• Synergistic Processing Elements (SPEs)
• 128-bit SIMD
• 128 vector registers
• 256 KB local store for instructions and data
• Memory Flow Controller (MFC)
• SIMD widths:
  – 16-way (8-bit integer)
  – 8-way (16-bit integer)
  – 4-way (32-bit integer, single-precision FP)
  – 2-way (64-bit double-precision FP)
• 25.6 Gflop/s SP at 3.2 GHz (fully pipelined)
• 1.8 Gflop/s DP at 3.2 GHz (7-cycle latency)

SPE Architecture

• Dual-issue (in-order) pipeline:
  – Even pipeline – arithmetic: integer, floating point
  – Odd pipeline – data motion: permutations, local store, branches, channel

Element Interconnect Bus

• Element Interconnect Bus (EIB)
• 4 16-byte-wide unidirectional rings
• runs at half the system clock (1.6 GHz)
• 204.8 GB/s peak bandwidth (subject to arbitration)

Main Memory System

• Memory Interface Controller (MIC)
• external dual XDR:
  – 3.2 GHz maximum effective frequency (max 400 MHz, octal data rate)
  – each channel: 8 banks → max 256 MB
  – total: 16 banks → max 512 MB
  – 25.6 GB/s

CELL Performance – Double Precision

In double precision:
• every seven cycles each SPE can process a two-element vector and perform two operations on each element;
• in one cycle the FPU on the PPE can process one element and perform two operations on it.

  SPEs: 8 × 2 × 2 × 3.2 GHz / 7 = 14.63 Gflop/s
  PPE:  2 × 3.2 GHz = 6.4 Gflop/s
  Total: 21.03 Gflop/s

CELL Performance – Single Precision

In single precision:
• in one cycle each SPE can process a four-element vector and perform two operations on each element;
• in one cycle the VMX on the PPE can process a four-element vector and perform two operations on each element.

  SPEs: 8 × 4 × 2 × 3.2 GHz = 204.8 Gflop/s
  PPE:  4 × 2 × 3.2 GHz = 25.6 Gflop/s
  Total: 230.4 Gflop/s

CELL Performance – Bandwidth

At the 3.2 GHz clock:
• each SPU – 25.6 GB/s (compare to 25.6 Gflop/s SP per SPU)
• main memory – 25.6 GB/s
• EIB – 204.8 GB/s (compare to 204.8 Gflop/s SP for 8 SPUs)
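The single-precision peaks above all come from issuing one four-wide fused multiply-add per SPU cycle (4 elements × 2 flops × 3.2 GHz = 25.6 Gflop/s per SPU). As an illustration only — a minimal sketch assuming the Cell SDK's spu_intrinsics.h and the spu-gcc toolchain, with hypothetical function and variable names — an SPU SAXPY inner loop maps directly onto that FMA:

```c
#include <spu_intrinsics.h>

/* y[i] = a*x[i] + y[i], four floats at a time.
   Assumes x and y are 16-byte aligned in the local store and n is a multiple of 4. */
void saxpy_spu(float *x, float *y, int n, float a)
{
    vector float va = spu_splats(a);           /* replicate the scalar into all 4 slots */
    vector float *vx = (vector float *)x;
    vector float *vy = (vector float *)y;
    int i;

    for (i = 0; i < n / 4; i++)
        vy[i] = spu_madd(va, vx[i], vy[i]);    /* one 4-way fused multiply-add: 8 flops */
}
```

Sustaining one FMA per cycle additionally requires the loop to be unrolled (to hide the floating-point latency) and the data to already be resident in the local store, which is what the coding tips and the DMA double-buffering scheme later in this section address.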
Performance Comparison – Double Precision

1.6 GHz Dual-Core Itanium 2:
• 1.6 × 4 × 2 = 12.8 Gflop/s

3.2 GHz CELL BE (SPEs only):
• 8 × 2 × 2 × 3.2 GHz / 7 = 14.6 Gflop/s

Performance Comparison – Single Precision

1.6 GHz Dual-Core Itanium 2:
• 1.6 × 4 × 2 = 12.8 Gflop/s

3.2 GHz SPE:
• 3.2 × 8 = 25.6 Gflop/s
• one SPE ≈ 2 Dual-Core Itanium 2 processors

3.2 GHz CELL BE (SPEs only):
• 3.2 × 8 × 8 = 204.8 Gflop/s
• one CBE ≈ 16 Dual-Core Itanium 2 processors

CELL Programming Basics

• Programming the SPUs: SIMD'ization (vectorization)
• Communication: DMAs, mailboxes
• Measuring performance: SPU decrementer

(Figure-only slides: SPE Register File; SPE SIMD Arithmetic; SPE SIMD Data Motion; SPE SIMD Vector Data Types; SPE SIMD Arithmetic Intrinsics; SPE Scalar Processing; SPE Static Code Analysis; SPE DMA Commands; SPE DMA Status.)

DMA Double Buffering

A C sketch of this scheme is given at the end of the section.

  Prologue:
    Receive tile 1
    Receive tile 2
    Compute tile 1
    Swap buffers
  Loop body:
    FOR I = 2 TO N-1
      Send tile I-1
      Receive tile I+1
      Compute tile I
      Swap buffers
    END FOR
  Epilogue:
    Send tile N-1
    Compute tile N
    Send tile N

SPE Mailboxes

• FIFO queues of 32-bit messages
• intended mainly for communication between the PPE and the SPEs

SPE Decrementer

• 14 MHz – IBM dual CELL blade
• 80 MHz – Sony PlayStation 3

CELL Basic Coding Tips

• Local Store
  – keep in mind the Local Store is 256 KB in size
  – use plug-ins to handle larger codes
• DMA transfers
  – use SPE-initiated DMA transfers
  – use double buffering to hide transfer latency
  – use fences and barriers to order transfers
• Loops
  – unroll loops to reduce dependency stalls, increase the dual-issue rate, and exploit the SPU's large register file
• Branches
  – eliminate non-predicted branches
• Dual issue
  – choose intrinsics to maximize dual issue

UT CELL BE Cluster

• Ashe.CS.UTK.EDU

UT CELL BE Cluster – Historical Perspective

• Connection Machine CM-5 (512 CPUs): 512 × 128 Mflop/s ≈ 65 Gflop/s DP
• PlayStation 3 (4 units): 4 × 17 Gflop/s ≈ 68 Gflop/s DP
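As promised under "DMA Double Buffering", here is a hedged sketch of the receive/compute part of that scheme on the SPU side. It assumes the Cell SDK headers spu_intrinsics.h and spu_mfcio.h and the spu-gcc toolchain; TILE_BYTES, process_stream, compute_tile and the tag assignments are illustrative names, not part of the SDK, and the write-back (mfc_put) side is omitted but symmetric. The decrementer is used for the coarse timing mentioned under "Measuring Performance".

```c
#include <stdio.h>
#include <stdint.h>
#include <spu_intrinsics.h>
#include <spu_mfcio.h>

#define TILE_BYTES 16384   /* 16 KB per transfer: the maximum single DMA size */

/* Two local-store buffers so one tile can be computed while the next is in flight. */
static char buf[2][TILE_BYTES] __attribute__((aligned(128)));

extern void compute_tile(char *tile);   /* placeholder for the real kernel */

/* Start an asynchronous DMA of tile i from effective address ea_base into buffer b,
   using the buffer index as the tag group. */
static void get_tile(int b, uint64_t ea_base, int i)
{
    mfc_get(buf[b], ea_base + (uint64_t)i * TILE_BYTES, TILE_BYTES, b, 0, 0);
}

/* Block until every DMA issued with tag 'tag' has completed. */
static void wait_tag(unsigned int tag)
{
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}

void process_stream(uint64_t ea_base, int ntiles)
{
    int i, cur = 0;
    unsigned int t0, ticks;

    spu_write_decrementer(0xFFFFFFFFu);     /* start the decrementer (it counts down) */
    t0 = spu_read_decrementer();

    get_tile(cur, ea_base, 0);              /* prologue: start fetching tile 0 */

    for (i = 0; i < ntiles; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < ntiles)
            get_tile(nxt, ea_base, i + 1);  /* prefetch tile i+1 into the other buffer */
        wait_tag(cur);                      /* wait only for the tile about to be used */
        compute_tile(buf[cur]);             /* compute overlaps with the outstanding DMA */
        cur = nxt;                          /* swap buffers */
    }

    ticks = t0 - spu_read_decrementer();    /* elapsed decrementer ticks (14 or 80 MHz) */
    printf("processed %d tiles in %u decrementer ticks\n", ntiles, ticks);
}
```

Using the buffer index as the DMA tag keeps the two in-flight transfers in separate tag groups, so the wait in each iteration blocks only on the buffer that is about to be consumed, not on the prefetch just issued.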