INF5063: Programming Heterogeneous Multi-Core Processors

Total Page:16

File Type:pdf, Size:1020Kb

INF5063: Programming Heterogeneous Multi-Core Processors INF5063: Programming heterogeneous multi-core processors Introduction Håkon Kvale Stensland August 25th, 2015 INF5063 University of Oslo INF5063 Overview § Course topic and scope § Background For the use and parallel processing using heterogeneous multi-core processors § Examples oF heterogeneous architectures § Vector Processing University of Oslo INF5063 INF5063: The Course People § Håkon Kvale Stensland email: haakonks @ ifi § Carsten Griwodz email: griFF @ ifi § ProFessor Pål Halvorsen email: paalh @ ifi § Guest lectures From Dolphin Interconnect Solutions Hugo Kohmann & Roy Nordstrøm University of Oslo INF5063 Time and place § Lectures: Tuesday 09:00 - 16:00 Storstua (Simula Research Laboratory) − August 25th − September 22nd − October 20th − November 24st § Group exercises: No group exercises in this course! University of Oslo INF5063 Plan For Today (Session 1) 10:15 – 11:00: Course Introduction 11:00 – 11:15: Break 11:15 – 12:00: Introduction to SSE/AVX 12:00 – 13:00: Lunch (Will be provided by Simula) 13:00 – 13:45: Introduction to Video Processing 13:45 – 14:00: Break 14:00 – 14:45: Using SIMD For Video Processing 14:45 – 15:00: Break 15:00 – 15:45: Codec 63 (c63) Home Exam Precode 15:45 – 16:00: Home Exam 1 is presented University of Oslo INF5063 About INF5063: Topic & Scope § Content: The course gives … − … an overview oF heterogeneous multi-core processors in general and three variants in particular and a modern general-purpose core (architectures and use) − … an introduction to working with heterogeneous multi-core processors • SSEx / AVXx for x86 • Nvidia’s Family oF GPUs and the CUDA programming Framework • Multiple machines connected with Dolphin PCIe links − … some ideas oF how to use/program heterogeneous multi-core processors University of Oslo INF5063 About INF5063: Topic & Scope § Tasks: The important part of the course is lab-assignments where you program each oF the three examples oF heterogeneous multi-core processors § 3 graded home exams (counting 33% each): − Deliver code and make a demonstration explaining your design and code to the class 1. On the x86 • Video encoding – Improve the perFormance of video compression by using SSE instructions. 2. On the Nvidia graphics cards • Video encoding – Improve the perFormance oF video compression using the GPU architecture. 3. On the distributed systems • Video encoding – The same as above, but exploit the parallelism on multiple GPUs and computers connected with Dolphin PCIe links. § Students will be working together in groups oF two. Try to Find a partner during the session today! § Competition at the end oF the course! Have the Fastest implementation oF the code! University of Oslo INF5063 Background and Motivation: Moore’s Law Motivation: Intel View 2015: § >billion transistors integrated •8,0 billion - nVIDIA GM200 (Maxwell) •8,1 billion - AMD Fury (Pirate Islands) 1971: • 2,300 - Intel 4004 University of Oslo INF5063 Motivation: Intel View § >billion transistors integrated § Clock Frequency has not increased since 2006 2015 (Still): • 5 GHz: IBM Power 6 & 8 University of Oslo INF5063 Motivation: Intel View § >billion transistors integrated § Clock Frequency has not increased since 2006 § Power? Heat? University of Oslo INF5063 Motivation: Intel View § Soon >billion transistors integrated § Clock Frequency can still increase § Future applications will demand TIPS § Power? Heat? University of Oslo INF5063 Motivation “Future applications will demand more instructions per second” “Think platform beyond a single processor” “Exploit concurrency at multiple levels” “Power will be the limiter due to complexity and leakage” Distribute workload on multiple cores University of Oslo INF5063 Multicores! Symmetric Multi-Core Processors Intel Haswell University of Oslo INF5063 Symmetric Multi-Core Processors AMD “Piledriver” University of Oslo INF5063 Symmetric Multi-Core Processors § Good − Growing computational power § Problematic − Growing die sizes − Unused resources • Some cores used much more than others • Many core parts Frequently unused § Why not spread the load better? ⇒ Heterogeneous Architectures! University of Oslo INF5063 x86 – heterogeneous? Intel Haswell Core Architecture Branch Instruction Predictors Fetch Unit LE ITLB R9KB LE I-Cache GQ wayU EgB EgB Precode Fetch Buffer g Instructions Front-end 9x9d Instruction Queue µcode Complex Simple Simple Simple Haswell Decoder Decoder Decoder Decoder O µops E µop E µop E µop EFXK µop Cache GQ-wayU Xg µop Decode Queue O µops R9B O µops E.9 Entry Reorder Buffer GROBU EgQ Integer EgQ AVX OQ Entry Branch m9 Entry O9 Entry Out-of-Order Registers Registers Order Buffer Load Buffer Store Buffer Scheduler gd Entry Unified Scheduler Port d Port E Port X Port g Port 9 Port R Port O Port m ALU 9Xg-bit ALU 9Xg-bit ALU gO-bit gO-bit Store Branch VMUL LEA VALU Fast LEA AGU AGU AGU Execution Shift VShift MUL VShuffle Units 9Xg-bit 9Xg-bit 9Xg-bit 9Xg-bit ALU Store FMA FMA VALU FShuffle Branch Data FBlend FADD VBlend FBlend Shift 9xR9B Memory R9B Units L9 TLB LE DTLB R9KB LE D-Cache GQ wayU gOB 9XgKB L9 Cache GQ wayU University of Oslo INF5063 Branch Instruction Predictors Fetch Unit LE ITLB R9KB LE I-Cache GQ wayU EgB EgB Precode Fetch Buffer g Instructions Front-end Intel Haswell Architecture9x9d Instruction Queue µcode Complex Simple Simple Simple Decoder Decoder Decoder Decoder O µops E µop E µop E µop EFXK µop Cache GQ-wayU Xg µop Decode Queue O µops R9B O µops E.9 Entry Reorder Buffer GROBU EgQ Integer EgQ AVX OQ Entry Branch m9 Entry O9 Entry Out-of-Order Registers Registers Order Buffer Load Buffer Store Buffer Scheduler gd Entry Unified Scheduler Port d Port E Port X Port g Port 9 Port R Port O Port m ALU 9Xg-bit ALU 9Xg-bit ALU gO-bit gO-bit Store Branch VMUL LEA VALU • THUS, an x86 is deFinitelyFast parallel LEA and hasAGU heterogeneousAGU AGU Execution(internal)Shift cores.VShift MUL VShuffle Units 9Xg-bit 9Xg-bit 9Xg-bit 9Xg-bit ALU Store • The coresFMA have manyFMA (hidden)VALU FShuffle complicatingBranch Factors. One oF these Data are out-ofFBlend-orderFADD execution.VBlend FBlend Shift 9xR9B Memory R9B Units L9 TLB LE DTLB R9KB LE D-Cache GQ wayU gOB University of Oslo INF5063 9XgKB L9 Cache GQ wayU Out-of-order execution § Beginning with Pentium Pro, out-of-order-execution has improved the micro-architecture design − Execution oF an instruction is delayed iF the input data is not yet available. − Haswell architecture has 192 instructions in the out-of-order window. § Instructions are split into micro-operations (μ-ops) − ADD EAX, EBX % Add content in EBX to EAX • simple operation which generates only 1 μ-ops − ADD EAX, [mem1] % Add content in mem1 to EAX • operation which generates 2 μ-ops: 1) load mem1 into (unamed) register, 2) add − ADD [mem1], EAX % Add content in EAX to mem1 • operation which generates 3 μ-ops: 1) load mem1 into (unamed) register, 2) add, 3) write result back to memory University of Oslo INF5063 Branch Instruction Predictors Fetch Unit LE ITLB R9KB LE I-Cache GQ wayU EgB EgB Precode Fetch Buffer g Instructions Front-end 9x9d Instruction Queue µcode Complex Simple Simple Simple Decoder Decoder Decoder Decoder O µops E µop E µop E µop EFXK µop Cache GQ-wayU Xg µop Decode Queue O µops R9B O µops E.9 Entry Reorder Buffer GROBU Intel Haswell EgQ Integer EgQ AVX OQ Entry Branch m9 Entry O9 Entry Out-of-Order Registers Registers Order Buffer Load Buffer Store Buffer Scheduler gd Entry Unified Scheduler Port d Port E Port X Port g Port 9 Port R Port O Port m ALU 9Xg-bit ALU 9Xg-bit ALU gO-bit gO-bit Store Branch VMUL LEA VALU Fast LEA AGU AGU AGU Execution Shift VShift MUL VShuffle Units 9Xg-bit 9Xg-bit 9Xg-bit 9Xg-bit ALU Store FMA FMA VALU FShuffle Branch Data FBlend FADD VBlend FBlend Shift 9xR9B Memory R9B Units L9 TLB LE DTLB R9KB LE D-Cache GQ wayU gOB 9XgKB L9 Cache GQ wayU § The x86 architecture “should” be a nice, easy introduction to more complex heterogeneous architectures later in the course… University of Oslo INF5063 Co-Processors § The original IBM PC included a socket For an Intel 8087 Floating point co-processor (FPU) − 50-fold speed up of floating point operations § Intel kept the co-processor up to i486 − 486DX contained an optimized i487 block − Still separate pipeline (pipeline Flush when starting and ending use) − Communication over an internal bus § Commodore Amiga was one oF the earlier machines that used multiple processors − Motorola 680x0 main processor − Blitter (block image transFerrer - moving data, Fill operations, line drawing, perForming boolean operations) − Copper (Co-Processor - change address For video RAM on the Fly) University of Oslo INF5063 nVIDIA Tegra X1 ARM SoC § One oF many multi-core processors for handheld devices § 4 ARM Cortex-A57 processors − 4 ARM Cortex-A53 cores − Out-of-order design − 64-bit ARMv8 instruction set − Cache-coherent cores § Several “dedicated” co-processors: − 4K Video Decoder − 4K Video Encoder − Audio Processor − 2x Image Processor § Fully programmable Maxwell-family GPU with 256 simple cores. University of Oslo INF5063 Intel SCC (Single Chip Cloud Computer) § Prototype processor From Intel’s TerraScale project § 24 «tiles» connected in a mesh § 1.3 billion transistors § Intel SCC Tile: − Two Intel P54C cores Pentium − In-order-execution design − IA32 architecture − NO SIMD units(!!) § No cache coherency § Only made a limited number oF chips For research. University of Oslo INF5063 Intel MIC («Larabee» / «Knights Ferry» / «Knights Corner») § Introduced in 2009 as a consumer GPU From Intel § Canceled in May 2010 due to
Recommended publications
  • Intel® IA-64 Architecture Software Developer's Manual
    Intel® IA-64 Architecture Software Developer’s Manual Volume 1: IA-64 Application Architecture Revision 1.1 July 2000 Document Number: 245317-002 THIS DOCUMENT IS PROVIDED “AS IS” WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. Intel® IA-64 processors may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800- 548-4725, or by visiting Intel’s website at http://developer.intel.com/design/litcentr.
    [Show full text]
  • USE CASE Requirements
    Ref. Ares(2017)2745733 - 31/05/2017 USE CASE Requirements Co-funded by the Horizon 2020 Framework Programme of the European Union DELIVERABLE NUMBER D2.1 DELIVERABLE TITLE USE CASE Requirements RESPONSIBLE AUTHOR CSI Piemonte OPERA: LOw Power Heterogeneous Architecture for Next Generation of SmaRt Infrastructure and Platform in Industrial and Societal Applications GRANT AGREEMENT N. 688386 PROJECT REF. NO H2020 - 688386 PROJECT ACRONYM OPERA LOw Power Heterogeneous Architecture for Next Generation of PROJECT FULL NAME SmaRt Infrastructure and Platform in Industrial and Societal Applications STARTING DATE (DUR.) 01 /12 /2015 ENDING DATE 30/11/2018 PROJECT WEBSITE www.operaproject.eu WP2 | Low Power Computing Requirements and Innovation WORKPACKAGE N. | TITLE Engineering WORKPACKAGE LEADER ISMB DELIVERABLE N. | TITLE D2.1 | USE CASE Requirements RESPONSIBLE AUTHOR Luca Scanavino – CSI Piemonte DATE OF DELIVERY M10 (CONTRACTUAL) DATE OF DELIVERY (SUBMITTED) M10 VERSION | STATUS V2.0 (update) NATURE R(Report) DISSEMINATION LEVEL PU(Public) AUTHORS (PARTNER) CSI PIEMONTE, DEPARTMENT DE L’ISERE, ISMB D2.1 | USE CASEs Requirements 1 OPERA: LOw Power Heterogeneous Architecture for Next Generation of SmaRt Infrastructure and Platform in Industrial and Societal Applications VERSION MODIFICATION(S) DATE AUTHOR(S) Luca Scanavino – CSI Jean-Christophe 0.1 All Document 12/09/2016 Maisonobe – LD38 Pietro Ruiu – ISMB Alberto Scionti - ISMB Finalization of the 1.0 15/09/2016 Giulio URLINI (ST) document Update based on the Luca Scanavino – CSI feedback
    [Show full text]
  • Copyrighted Material
    CHAPTER 1 MULTI- AND MANY-CORES, ARCHITECTURAL OVERVIEW FOR PROGRAMMERS Lasse Natvig, Alexandru Iordan, Mujahed Eleyat, Magnus Jahre and Jorn Amundsen 1.1 INTRODUCTION 1.1.1 Fundamental Techniques Parallelism hasCOPYRIGHTED been used since the early days of computing MATERIAL to enhance performance. From the first computers to the most modern sequential processors (also called uni- processors), the main concepts introduced by von Neumann [20] are still in use. How- ever, the ever-increasing demand for computing performance has pushed computer architects toward implementing different techniques of parallelism. The von Neu- mann architecture was initially a sequential machine operating on scalar data with bit-serial operations [20]. Word-parallel operations were made possible by using more complex logic that could perform binary operations in parallel on all the bits in a computer word, and it was just the start of an adventure of innovations in parallel computer architectures. Programming Multicore and Many-core Computing Systems, 3 First Edition. Edited by Sabri Pllana and Fatos Xhafa. © 2017 John Wiley & Sons, Inc. Published 2017 by John Wiley & Sons, Inc. 4 MULTI- AND MANY-CORES, ARCHITECTURAL OVERVIEW FOR PROGRAMMERS Prefetching is a 'look-ahead technique' that was introduced quite early and is a way of parallelism that is used at several levels and in different components of a computer today. Both data and instructions are very often accessed sequentially. Therefore, when accessing an element (instruction or data) at address k, an auto- matic access to address k+1 will bring the element to where it is needed before it is accessed and thus eliminates or reduces waiting time.
    [Show full text]
  • Appendix D an Alternative to RISC: the Intel 80X86
    D.1 Introduction D-2 D.2 80x86 Registers and Data Addressing Modes D-3 D.3 80x86 Integer Operations D-6 D.4 80x86 Floating-Point Operations D-10 D.5 80x86 Instruction Encoding D-12 D.6 Putting It All Together: Measurements of Instruction Set Usage D-14 D.7 Concluding Remarks D-20 D.8 Historical Perspective and References D-21 D An Alternative to RISC: The Intel 80x86 The x86 isn’t all that complex—it just doesn’t make a lot of sense. Mike Johnson Leader of 80x86 Design at AMD, Microprocessor Report (1994) © 2003 Elsevier Science (USA). All rights reserved. D-2 I Appendix D An Alternative to RISC: The Intel 80x86 D.1 Introduction MIPS was the vision of a single architect. The pieces of this architecture fit nicely together and the whole architecture can be described succinctly. Such is not the case of the 80x86: It is the product of several independent groups who evolved the architecture over 20 years, adding new features to the original instruction set as you might add clothing to a packed bag. Here are important 80x86 milestones: I 1978—The Intel 8086 architecture was announced as an assembly language– compatible extension of the then-successful Intel 8080, an 8-bit microproces- sor. The 8086 is a 16-bit architecture, with all internal registers 16 bits wide. Whereas the 8080 was a straightforward accumulator machine, the 8086 extended the architecture with additional registers. Because nearly every reg- ister has a dedicated use, the 8086 falls somewhere between an accumulator machine and a general-purpose register machine, and can fairly be called an extended accumulator machine.
    [Show full text]
  • SIMD Extensions
    SIMD Extensions PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 12 May 2012 17:14:46 UTC Contents Articles SIMD 1 MMX (instruction set) 6 3DNow! 8 Streaming SIMD Extensions 12 SSE2 16 SSE3 18 SSSE3 20 SSE4 22 SSE5 26 Advanced Vector Extensions 28 CVT16 instruction set 31 XOP instruction set 31 References Article Sources and Contributors 33 Image Sources, Licenses and Contributors 34 Article Licenses License 35 SIMD 1 SIMD Single instruction Multiple instruction Single data SISD MISD Multiple data SIMD MIMD Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism. History The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel.
    [Show full text]
  • Programmable Digital Microcircuits - a Survey with Examples of Use
    - 237 - PROGRAMMABLE DIGITAL MICROCIRCUITS - A SURVEY WITH EXAMPLES OF USE C. Verkerk CERN, Geneva, Switzerland 1. Introduction For most readers the title of these lecture notes will evoke microprocessors. The fixed instruction set microprocessors are however not the only programmable digital mi• crocircuits and, although a number of pages will be dedicated to them, the aim of these notes is also to draw attention to other useful microcircuits. A complete survey of programmable circuits would fill several books and a selection had therefore to be made. The choice has rather been to treat a variety of devices than to give an in- depth treatment of a particular circuit. The selected devices have all found useful ap• plications in high-energy physics, or hold promise for future use. The microprocessor is very young : just over eleven years. An advertisement, an• nouncing a new era of integrated electronics, and which appeared in the November 15, 1971 issue of Electronics News, is generally considered its birth-certificate. The adver• tisement was for the Intel 4004 and its three support chips. The history leading to this announcement merits to be recalled. Intel, then a very young company, was working on the design of a chip-set for a high-performance calculator, for and in collaboration with a Japanese firm, Busicom. One of the Intel engineers found the Busicom design of 9 different chips too complicated and tried to find a more general and programmable solu• tion. His design, the 4004 microprocessor, was finally adapted by Busicom, and after further négociation, Intel acquired marketing rights for its new invention.
    [Show full text]
  • SIMD: Data Parallel Execution J
    ERLANGEN REGIONAL COMPUTING CENTER SIMD: Data parallel execution J. Eitzinger PATC, 12.6.2018 Stored Program Computer: Base setting for (int j=0; j<size; j++){ Memory sum = sum + V[j]; } 401d08: f3 0f 58 04 82 addss xmm0,[rdx + rax * 4] Arithmetic Control 401d0d: 48 83 c0 01 add rax,1 Logical 401d11: 39 c7 cmp edi,eax Unit CPU 401d13: 77 f3 ja 401d08 Unit Input Output Architect’s view: Make the common case fast ! Execution and memory . Improvements for relevant software Strategies . What are the technical opportunities? . Increase clock speed . Parallelism . Economical concerns . Specialization . Marketing concerns 2 Data parallel execution units (SIMD) for (int j=0; j<size; j++){ Scalar execution A[j] = B[j] + C[j]; } Register widths • 1 operand = + • 2 operands (SSE) • 4 operands (AVX) • 8 operands (AVX512) 3 Data parallel execution units (SIMD) for (int j=0; j<size; j++){ A[j] = B[j] + C[j]; SIMD execution } Register widths • 1 operand • 2 operands (SSE) = + • 4 operands (AVX) • 8 operands (AVX512) 4 History of short vector SIMD 1995 1996 1997 1998 1999 2001 2008 2010 2011 2012 2013 2016 VIS MDMX MMX 3DNow SSE SSE2 SSE4 VSX AVX IMCI AVX2 AVX512 AltiVec NEON 64 bit 128 bit 256 bit 512 bit Vendors package other ISA features under the SIMD flag: monitor/mwait, NT stores, prefetch ISA, application specific instructions, FMA X86 Power ARM SPARC64 Intel 512bit IBM 128bit A5 upward Fujitsu 256bit AMD 256bit 64bit / 128bit K Computer 128bit 5 Technologies Driving Performance Technology 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
    [Show full text]
  • Programming the Cell Broadband Engine Examples and Best Practices
    Front cover Draft Document for Review February 15, 2008 4:59 pm SG24-7575-00 Programming the Cell Broadband Engine Examples and Best Practices Practical code development and porting examples included Make the most of SDK 3.0 debug and performance tools Understand and apply different programming models and strategies Abraham Arevalo Ricardo M. Matinata Maharaja Pandian Eitan Peri Kurtis Ruby Francois Thomas Chris Almond ibm.com/redbooks Draft Document for Review February 15, 2008 4:59 pm 7575edno.fm International Technical Support Organization Programming the Cell Broadband Engine: Examples and Best Practices December 2007 SG24-7575-00 7575edno.fm Draft Document for Review February 15, 2008 4:59 pm Note: Before using this information and the product it supports, read the information in “Notices” on page xvii. First Edition (December 2007) This edition applies to Version 3.0 of the IBM Cell Broadband Engine SDK, and the IBM BladeCenter QS-21 platform. © Copyright International Business Machines Corporation 2007. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Draft Document for Review February 15, 2008 4:59 pm 7575TOC.fm Contents Preface . xi The team that wrote this book . xi Acknowledgements . xiii Become a published author . xiv Comments welcome. xv Notices . xvii Trademarks . xviii Part 1. Introduction to the Cell Broadband Engine . 1 Chapter 1. Cell Broadband Engine Overview . 3 1.1 Motivation . 4 1.2 Scaling the three performance-limiting walls. 6 1.2.1 Scaling the power-limitation wall . 6 1.2.2 Scaling the memory-limitation wall .
    [Show full text]
  • APPLICATION NOTE Ap·113
    APPLICATION Ap·113 NOTE February 1981 Intel Corporation makes no warranty for the use of its products and assumes no responsibility for any errors which may appear in this document nor does it make a commitment to update the information contained herein. Intel software products are copyrighted by and shall remain the property of Intel Corporation. Use, duplication or disclosure is subjectto restrictions stated in Intel's software license, or as defined in ASPR 7-104.9 (a) (9). Intel Corporation a!=;sumes no resDonsibilitv for the use of anv circuitry other than circuitry embodied in an Intel product. No other circuit patent licenses are implied. No part of this document may be copied or reproduced in any form or by any means without the prior written consent of Intel Corporation. The following are trademarks of Intel Corporation and may only be used to identify Intel products: BXP Intelevision MULTIBUS* CREDIT Intellec MULTIMODULE i iSBC Plug-A-Bubble ICE iSBX PROMPT ICS Library Manager Promware im MCS RMX Insite Megachassis UPI Intel Micromap ~Scope System 2000 and the combinations of ICE, iCS, iSBC, MCS or RMX and a numerical suffix. MDS is an ordering code only and is not used as a product name or trademark. MDS® is a registered trademark of Mohawk Data Sciences Corporation. *MULTIBUS is a patented Intel bus. Additional copies of this manual or other Intel literature may be obtained from: Intel Corporation Literature Department SV3-3 3065 Bowers Avenue Santa Clara, CA 95051 © INTEL CORPORATION, 1981 AFN-013008-1 Ap·113 Getting Started With Contents the Numeric Data INTRODUCTION Processor iAPX 86,88 Base .............................
    [Show full text]
  • Cell Broadband Engine Architecture
    Cell Broadband Engine Architecture Patrick White Aun Kei Hong Overview • Background • Overall Chip • Goals Specifications • Components • Commercialization – PPE • Conclusion – SPE –EIB – Memory I/O –PMU, TMU Background • Ken Kutaragi • Sony, IBM, and Toshiba Partnership • 400+ people and $400M • 10 centers globally Goals • Parallelism – Thread Level – Instruction Level – Data Level • Create a computer that acts like cells in a biological system POWER Processing Element (PPE) • Power Processing Unit (PPU) • Cache – 32 kB L1 Cache – 512 kB L2 Cache • SIMD (Single Issue Multiple Data) – Vector multimedia extension • Controls SPEs POWER Processing Unit (PPU) • Main processor • Dual issue, In order, Dual thread Synergistic Processing Element (SPE) • Synergistic Processing Unit • Memory Flow Controller • Direct Memory Access Synergistic Processing Unit (SPU) • Dual issue, in order • 256 kB embedded SRAM • 128 entry operations per cycle Memory Flow Controller (MFC) • Connects SPE to EIB • Communicates through SPU channel interface • Process commands by depending on the type of commands • Process DMA Commands Element Interconnect Bus (EIB) • Facilitates commutation between the PPE, SPE, Main Memory, and I/O • 4 16-bit wide data rings • Allows up to 3 concurrent transfers • Contains a data bus arbiter that decides which ring handles each request • Runs on half the processor speed Memory Interface Controller (MIC) • Interface between main memory and the Cell • Two Rambus XRD memory banks – 36 bits wide – Independent control bus • Banks are interleaved
    [Show full text]
  • Introduction to Cpu
    microprocessors and microcontrollers - sadri 1 INTRODUCTION TO CPU Mohammad Sadegh Sadri Session 2 Microprocessor Course Isfahan University of Technology Sep., Oct., 2010 microprocessors and microcontrollers - sadri 2 Agenda • Review of the first session • A tour of silicon world! • Basic definition of CPU • Von Neumann Architecture • Example: Basic ARM7 Architecture • A brief detailed explanation of ARM7 Architecture • Hardvard Architecture • Example: TMS320C25 DSP microprocessors and microcontrollers - sadri 3 Agenda (2) • History of CPUs • 4004 • TMS1000 • 8080 • Z80 • Am2901 • 8051 • PIC16 microprocessors and microcontrollers - sadri 4 Von Neumann Architecture • Same Memory • Program • Data • Single Bus microprocessors and microcontrollers - sadri 5 Sample : ARM7T CPU microprocessors and microcontrollers - sadri 6 Harvard Architecture • Separate memories for program and data microprocessors and microcontrollers - sadri 7 TMS320C25 DSP microprocessors and microcontrollers - sadri 8 Silicon Market Revenue Rank Rank Country of 2009/2008 Company (million Market share 2009 2008 origin changes $ USD) Intel 11 USA 32 410 -4.0% 14.1% Corporation Samsung 22 South Korea 17 496 +3.5% 7.6% Electronics Toshiba 33Semiconduc Japan 10 319 -6.9% 4.5% tors Texas 44 USA 9 617 -12.6% 4.2% Instruments STMicroelec 55 FranceItaly 8 510 -17.6% 3.7% tronics 68Qualcomm USA 6 409 -1.1% 2.8% 79Hynix South Korea 6 246 +3.7% 2.7% 812AMD USA 5 207 -4.6% 2.3% Renesas 96 Japan 5 153 -26.6% 2.2% Technology 10 7 Sony Japan 4 468 -35.7% 1.9% microprocessors and microcontrollers
    [Show full text]
  • Sony/Toshiba/IBM (STI) CELL Processor Three Performance-Limiting Walls
    Sony/Toshiba/IBM (STI) CELL Processor Scientific Computing for Engineers: Spring 2007 Three Performance-Limiting Walls ¾ Power Wall Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase. ¾ Frequency Wall Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account. ¾ Memory Wall On multi-gigahertz symmetric multiprocessors – even those with integrated memory controllers – latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor. 01/31/07 11:10 2 1 The Memory Wall "When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available." H. Peter Hofstee "Cell Broadband Engine Architecture from 20,000 feet" http://www-128.ibm.com/developerworks/power/library/pa-cbea.html Their (multicore) low cost does not guarantee their effective use in HPC.
    [Show full text]