Vector Support for Multicore Processors with Major Emphasis on Configurable Multiprocessors

Total Pages: 16

File Type: pdf, Size: 1020 KB

Copyright Warning & Restrictions

The copyright law of the United States (Title 17, United States Code) governs the making of photocopies or other reproductions of copyrighted material. Under certain conditions specified in the law, libraries and archives are authorized to furnish a photocopy or other reproduction. One of these specified conditions is that the photocopy or reproduction is not to be "used for any purpose other than private study, scholarship, or research." If a user makes a request for, or later uses, a photocopy or reproduction for purposes in excess of "fair use," that user may be liable for copyright infringement. This institution reserves the right to refuse to accept a copying order if, in its judgment, fulfillment of the order would involve violation of copyright law.

Please Note: The author retains the copyright while the New Jersey Institute of Technology reserves the right to distribute this thesis or dissertation.

Printing note: If you do not wish to print this page, then select "Pages from: first page # to: last page #" on the print dialog screen.

The Van Houten Library has removed some of the personal information and all signatures from the approval page and biographical sketches of theses and dissertations in order to protect the identity of NJIT graduates and faculty.

ABSTRACT

VECTOR SUPPORT FOR MULTICORE PROCESSORS WITH MAJOR EMPHASIS ON CONFIGURABLE MULTIPROCESSORS

by Hongyan Yang

It recently became increasingly difficult to build higher-speed uniprocessor chips because of performance degradation and high power consumption. The quadratically increasing circuit complexity forbade the exploitation of more instruction-level parallelism (ILP) to continue raising the performance. Processor designers then focused on thread-level parallelism (TLP) to realize a new architecture design paradigm. Multicore processor design is the result of this trend. It has proven quite capable in performance increase and provides new opportunities in power management and system scalability. But current multicore processors do not provide powerful vector architecture support, which could yield significant speedups for array operations while maintaining area/power efficiency.

This dissertation proposes and presents the realization of an FPGA-based prototype of a multicore architecture with a shared vector unit (MCwSV). FPGA stands for Field-Programmable Gate Array. The idea is that, rather than improving only scalar performance, some hardware budget could be used to realize a vector unit to greatly speed up applications abundant in data-level parallelism (DLP). To be realistic, limited by the parallelism in the application itself and by the compiler's vectorizing abilities, most general-purpose programs can only be partially vectorized. Thus, for efficient resource usage, one vector unit should be shared by several scalar processors. This approach could also keep the overall budget within acceptable limits. We suggest that this type of vector-unit sharing be established in future multicore chips.

The design, implementation and evaluation of an MCwSV system with two scalar processors and a shared vector unit are presented. For FPGA prototyping, the MicroBlaze processor, which is a commercial IP (Intellectual Property) core from Xilinx, is used as the scalar processor; in the experiments, the vector unit is connected to a pair of MicroBlaze processors through standard bus interfaces. The overall system is organized in a decoupled and multi-banked structure. This organization provides substantial system scalability and better vector performance for a given area budget. Benchmarks from several areas show that the MCwSV system can provide significant performance increase as compared to a multicore system without a vector unit.

However, an MCwSV system with two MicroBlazes and a shared vector unit is not always an optimized system configuration for various applications with different percentages of vectorization. On the other hand, the MCwSV framework was designed for easy scalability, to potentially incorporate various numbers of scalar/vector units and various function units. Also, the flexibility inherent to FPGAs can aid the task of matching target applications. These benefits can be taken into account to create optimized MCwSV systems for various applications. So the work eventually focused on building an architecture design framework incorporating performance and resource management for application-specific MCwSV (AS-MCwSV) systems. For embedded system design, resource usage, power consumption and execution latency are three metrics to be used in design tradeoffs. The product of these metrics is used here to choose the MCwSV system with the smallest value.

VECTOR SUPPORT FOR MULTICORE PROCESSORS WITH MAJOR EMPHASIS ON CONFIGURABLE MULTIPROCESSORS

by Hongyan Yang

A Dissertation Submitted to the Faculty of New Jersey Institute of Technology in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Electrical Engineering. Department of Electrical and Computer Engineering, May 2008.

Copyright © 2008 by Hongyan Yang. ALL RIGHTS RESERVED.

APPROVAL PAGE

VECTOR SUPPORT FOR MULTICORE PROCESSORS WITH MAJOR EMPHASIS ON CONFIGURABLE MULTIPROCESSORS

Hongyan Yang

Sotirios G. Ziavras, Dissertation Advisor (Professor of Electrical and Computer Engineering, NJIT)
Edwin Hou, Committee Member (Associate Professor of Electrical and Computer Engineering, NJIT)
Jie Hu, Committee Member (Assistant Professor of Electrical and Computer Engineering, NJIT)
Roberto Rojas-Cessa, Committee Member (Associate Professor of Electrical and Computer Engineering, NJIT)
Alexandros Gerbessiotis, Committee Member (Associate Professor of Computer Science, NJIT)

BIOGRAPHICAL SKETCH

Author: Hongyan Yang
Degree: Doctor of Philosophy
Date: May 2008

Undergraduate and Graduate Education:
• Doctor of Philosophy in Electrical Engineering, New Jersey Institute of Technology, Newark, NJ, 2008
• Master of Science in Electrical Engineering, Beijing University of Posts and Telecommunications, Beijing, China
• Bachelor of Science in Electrical Engineering, Beijing University of Posts and Telecommunications, Beijing, China

Major: Electrical Engineering

Publications:

H. Yang, S. G. Ziavras and J. Hu, "Reconfiguration support for vector operations," International Journal of High Performance Systems Architecture, vol. 1, 2007.

S. Wang, H. Yang, J. Hu and S. G. Ziavras, "Asymmetrically banked value-aware register files for low energy and high performance," Microprocessors and Microsystems, 2007.

H. Yang, S. G. Ziavras and J. Hu, "FPGA-based vector processing for matrix operations," in International Conference on Information Technology: New Generations, Las Vegas, USA, April 2007.

H. Yang and S. G. Ziavras, "FPGA-based vector processor for algebraic equation solvers," in IEEE International Systems-On-Chip Conference, Washington, DC, USA, 2005.

H. Yang, S. Wang, S. G. Ziavras and J. Hu, "Vector processing support for FPGA-oriented high performance applications," in IEEE Computer Society Annual Symposium on VLSI, Porto Alegre, Brazil, May 2007.

S. Wang, H. Yang, J. Hu and S. G. Ziavras, "Asymmetrically banked value-aware register files," in IEEE Computer Society Annual Symposium on VLSI, Porto Alegre, Brazil, May 2007.

To my parents, Jianyu Zhang and Caifeng Yang; my husband, Shuangquan Wang; and my daughter, Andrea Wang.

ACKNOWLEDGMENT

First of all, I would like to express my deep and sincere gratitude to my advisor, Sotirios G. Ziavras. I could not have finished this dissertation without his patience, understanding and encouragement during the past five years. His knowledge and expertise in computer architecture has constantly helped and inspired me to go further in my research. I want to thank him for his help in many aspects: his stimulating and constructive suggestions on my research; his patience in reviewing and improving my writings tens of times; his kindness in providing financial support during my study and research; and much more. I also learned a lot from his professional work style; those will be of great value in my future career.

I warmly thank Edwin Hou, Jie Hu, Roberto Rojas-Cessa and Alexandros Gerbessiotis for being my committee members, spending time on reviewing my dissertation and giving me valuable suggestions. I also thank my lab colleagues for our interesting discussions on various research topics and the fun of being together.

I owe my deepest gratitude to my husband, Shuangquan Wang, for his understanding and help. Finally, I am deeply indebted to my father, Jianyu Zhang, and my mother, Caifeng Yang, for their unconditional love and continuous support.

TABLE OF CONTENTS

1 INTRODUCTION
1.1 Vector Processing
1.1.1 Vector Supercomputers and Their Applications
1.1.2 Vector Processing in Embedded Systems
1.2 Multicore Processors
1.2.1 Motivation for Their Development
1.2.2 State of the Art and Future Multicore Processors
1.3 Configurable Computing
1.3.1 Overview of Field-Programmable Gate Arrays (FPGAs)
1.3.2 Recent Advances in FPGAs
1.3.3 Design Flow
1.4 Motivations and Objectives
2 VECTOR PROCESSING FOR EMBEDDED APPLICATIONS
2.1 Architecture of the Vector Processor
2.1.1 Instruction Set Architecture
2.1.2 Scalar Unit
2.1.3 Vector Register File
2.1.4 Vector Memory Interface (VMI)
2.2 Performance Results
2.2.1 Preliminaries of Vector Processing
2.2.2 Sparse Linear Equation Solver
2.2.3 Dense Matrix-Matrix Multiplication
2.2.4 Sparse Matrix-Vector Multiplication
3 MULTICORE PROCESSOR SYSTEM WITH A SHARED VECTOR UNIT
3.1 Scalar Processor and FPGA Resources
3.1.1 MicroBlaze Soft Processor
3.1.2 Virtex-5 Resources
3.2 System Design and Organization
3.2.1 Vector Issue Logic
3.2.2 Vector Register File
3.2.3 Function Units
3.2.4 Memory Controller
3.3 System Implementation on the FPGA Chip
3.3.1 Digital Signal Processing (DSP) Blocks
3.3.2 Digital Clock Manager (DCM) Primitive
3.3.3 On-Chip Block Memories
3.3.4 Resource Usage for Various Configurations
4 PERFORMANCE RESULTS AND ANALYSIS
4.1 RGB to YIQ Conversion (RGB2YIQ)
4.2 Demodulation Algorithm in Communication Systems
4.3 Discrete Cosine Transform
5 POWER CONSUMPTION CHARACTERIZATION AND RESOURCE MANAGEMENT FOR MCwSV SYSTEMS
5.1 Resource/Power Modeling of the MCwSV System
5.1.1 Dynamic Power Consumption of Floating-Point Arithmetic Units
5.1.2 Dynamic Power Consumption of Dual-Port Block Memories
5.1.3 Power Consumption of MicroBlaze with/without a Floating-Point Unit
5.1.4 Power Consumption of the Vector Unit for Different Configurations
5.2 Power/Resource-Efficient MCwSV Systems for Various Applications
5.2.1 Scalar/Vector Execution Time Estimation
5.2.2 Application-Specific MCwSV (AS-MCwSV) System for Power/Resource Efficiency
6 CONCLUSIONS AND FUTURE WORK
6.1 Conclusions
6.2 Future Work
REFERENCES

LIST OF TABLES

2.1 Non-zero Element Changes in W-Matrix Partition Generation
2.2 Input Sparse Matrices from the Harwell-Boeing Collection in Our Experiments
3.1 Resource Consumption of Single-Precision Implementations
3.2 Configuration and Usage of Block Memories
3.3 Resource Usage of the Vector Unit and the Memory Controller
4.1 Execution Times for RGB2YIQ Conversion
4.2 Execution Times for the Distance Calculation Algorithm When Run on One MicroBlaze with the Vector Unit
4.3 Execution Times for the Demodulation Algorithm (MB: MicroBlaze, DIST: distance, VU: Vector Unit)
4.4 Vector and Scalar Execution Times and Their Ratio for the DCT Algorithm
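A compact restatement of the selection rule at the end of the abstract (notation ours, not the dissertation's): writing R(c), P(c) and T(c) for the resource usage, power consumption and execution latency of a candidate configuration c, the chosen AS-MCwSV design is

    c* = argmin over c of  R(c) × P(c) × T(c)

so, for example, a configuration that halves execution latency is preferred only if it less than doubles its resource-power product.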
Recommended publications
  • Data-Level Parallelism
    Fall 2015 :: CSE 610 – Parallel Computer Architectures. Data-Level Parallelism, Nima Honarmand. Overview
• Data Parallelism vs. Control Parallelism
– Data Parallelism: parallelism arises from executing essentially the same code on a large number of objects
– Control Parallelism: parallelism arises from executing different threads of control concurrently
• Hypothesis: applications that use massively parallel machines will mostly exploit data parallelism
– Common in the Scientific Computing domain
• DLP originally linked with SIMD machines; now SIMT is more common
– SIMD: Single Instruction Multiple Data
– SIMT: Single Instruction Multiple Threads
• Many incarnations of DLP architectures over decades
– Old vector processors: Cray processors (Cray-1, Cray-2, …, Cray X1)
– SIMD extensions: Intel SSE and AVX units; Alpha Tarantula (didn't see light of day)
– Old massively parallel computers: Connection Machines, MasPar machines
– Modern GPUs: NVIDIA, AMD, Qualcomm, …; focus on throughput rather than latency
Vector Processors. SCALAR (1 operation): add r3, r1, r2. VECTOR (N operations): vadd.vv v3, v1, v2, operating over the vector length. Scalar processors operate on single numbers (scalars); vector processors operate on linear sequences of numbers (vectors). What's in a Vector Processor? A scalar processor (e.g. a MIPS processor): scalar register file (32 registers), scalar functional units (arithmetic, load/store, etc). A vector register file (a 2D register array): each register is an array of elements, e.g. 32 registers with 32 64-bit elements per register; MVL = maximum vector length = max # of elements per register. A set of vector functional units: integer, FP, load/store, etc. Sometimes vector and scalar units are combined (share ALUs). Basic Vector ISA: Instr.
    [Show full text]
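The scalar-vs-vector contrast in the excerpt above is easy to state in code. A minimal sketch (ours, not from the course notes); the loop is what a single vadd.vv-style instruction covers for a whole MVL-sized chunk:

    /* Scalar version: one add instruction issued per element. With vector
       hardware the same n-element loop becomes one vadd.vv-style instruction
       per MVL-sized chunk; compilers auto-vectorize exactly this pattern. */
    #include <stddef.h>

    void add_scalar(double *c, const double *a, const double *b, size_t n) {
        for (size_t i = 0; i < n; i++)   /* one "add r3,r1,r2" per element */
            c[i] = a[i] + b[i];
    }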
  • 2.5 Classification of Parallel Computers
    2.5 Classification of Parallel Computers
2.5.1 Granularity. In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronization events.
• fine level (same operations with different data)
◦ vector processors
◦ instruction level parallelism
◦ fine-grain parallelism:
– Relatively small amounts of computational work are done between communication events
– Low computation to communication ratio
– Facilitates load balancing
– Implies high communication overhead and less opportunity for performance enhancement
– If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation
• operation level (different operations simultaneously)
• problem level (independent subtasks)
◦ coarse-grain parallelism:
– Relatively large amounts of computational work are done between communication/synchronization events
– High computation to communication ratio
– Implies more opportunity for performance increase
– Harder to load balance efficiently
2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1). With N elements in the pipeline and L clock cycles per element, the calculation takes L + N cycles; without pipelining, L ∗ N cycles. Example of good code for pipelining:

    do i = 1, k
      z(i) = x(i) + y(i)
    end do

Vector processors: fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but, e.g., recursion is hard to optimise for vector processors. Example: Intel MMX – a simple vector processor.
    [Show full text]
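The L + N rule in the excerpt above can be checked with a few lines of code. A minimal sketch (ours, with an assumed pipeline depth; not part of the course notes):

    /* Pipeline timing sketch for the "L + N vs. L * N" rule quoted above:
       an L-stage pipelined unit finishes N independent element operations
       in roughly L + N cycles (fill + drain), versus L * N unpipelined. */
    #include <stdio.h>

    int main(void) {
        const long L = 8;   /* assumed pipeline depth (stages) */
        const long N = 64;  /* number of vector elements */
        printf("pipelined:   %ld cycles\n", L + N);
        printf("unpipelined: %ld cycles\n", L * N);
        printf("speedup:     %.1fx\n", (double)(L * N) / (L + N)); /* ~7.1x */
        return 0;
    }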
  • Vector-Thread Architecture and Implementation by Ronny Meir Krashinsky B.S
    Vector-Thread Architecture and Implementation by Ronny Meir Krashinsky. B.S. Electrical Engineering and Computer Science, University of California at Berkeley, 1999. S.M. Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2001. Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, June 2007. © Massachusetts Institute of Technology 2007. All rights reserved. Author: Department of Electrical Engineering and Computer Science, May 25, 2007. Certified by: Krste Asanović, Associate Professor, Thesis Supervisor. Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students. Abstract: This thesis proposes vector-thread architectures as a performance-efficient solution for all-purpose computing. The VT architectural paradigm unifies the vector and multithreaded compute models. VT provides the programmer with a control processor and a vector of virtual processors. The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality. VT architectures can efficiently exploit a wide variety of loop-level parallelism, including non-vectorizable loops with cross-iteration dependencies or internal control flow.
    [Show full text]
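The abstract's two control mechanisms, vector-fetch and thread-fetch, can be caricatured in a few lines. A conceptual sketch (ours; the names and structures are illustrative, not Krashinsky's ISA):

    /* Toy model of the VT control mechanisms described in the abstract:
       vector-fetch broadcasts one instruction block to every virtual
       processor (VP); thread-fetch lets each VP pick its own next block. */
    #include <stddef.h>

    typedef struct { int pc; int reg[8]; } VP;   /* one virtual processor   */
    typedef void (*Block)(VP *);                 /* an instruction block    */

    /* vector-fetch: the control processor runs the same block on all VPs */
    void vector_fetch(VP *vps, size_t n, Block b) {
        for (size_t i = 0; i < n; i++) b(&vps[i]);
    }

    /* thread-fetch: each VP follows its own control flow via its pc */
    void thread_fetch(VP *vps, size_t n, Block table[]) {
        for (size_t i = 0; i < n; i++) table[vps[i].pc](&vps[i]);
    }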
  • Lecture 14: GPUs
    LECTURE 14: GPUS. Daniel Sanchez and Joel Emer [incorporates material from Kozyrakis (EE382A), NVIDIA Kepler whitepaper, Hennessy & Patterson]. 6.888 Parallel and Heterogeneous Computer Architecture, Spring 2013. Today's Menu: review of vector processors; basic GPU architecture; paper discussions. Vector Processors. SCALAR (1 operation): add r3, r1, r2. VECTOR (N operations): vadd.vv v3, v1, v2, operating over the vector length. Scalar processors operate on single numbers (scalars); vector processors operate on linear sequences of numbers (vectors). What's in a Vector Processor? A scalar processor (e.g. a MIPS processor): scalar register file (32 registers), scalar functional units (arithmetic, load/store, etc). A vector register file (a 2D register array): each register is an array of elements, e.g. 32 registers with 32 64-bit elements per register; MVL = maximum vector length = max # of elements per register. A set of vector functional units: integer, FP, load/store, etc. Sometimes vector and scalar units are combined (share ALUs). Basic Vector ISA:

    Instr.    Operands   Operation            Comment
    VADD.VV   V1,V2,V3   V1=V2+V3             vector + vector
    VADD.SV   V1,R0,V2   V1=R0+V2             scalar + vector
    VMUL.VV   V1,V2,V3   V1=V2*V3             vector x vector
    VMUL.SV   V1,R0,V2   V1=R0*V2             scalar x vector
    VLD       V1,R1      V1=M[R1...R1+63]     load, stride=1
    VLDS      V1,R1,R2   V1=M[R1...R1+63*R2]  load, stride=R2
    [Show full text]
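The ISA table above maps directly onto scalar loops. A hedged C rendering (ours) of two of the listed instructions, treating memory as an array of doubles and R1/R2 as element indices rather than byte addresses:

    /* Scalar equivalents of two instructions from the table, for
       64-element vector registers. */
    void vadd_vv(double V1[64], const double V2[64], const double V3[64]) {
        for (int i = 0; i < 64; i++)
            V1[i] = V2[i] + V3[i];            /* VADD.VV V1,V2,V3 */
    }

    void vlds(double V1[64], const double *M, long R1, long R2) {
        for (int i = 0; i < 64; i++)
            V1[i] = M[R1 + i * R2];           /* VLDS: strided load, stride=R2 */
    }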
  • Lecture 14: Vector Processors
    Lecture 14: Vector Processors. Department of Electrical Engineering, Stanford University. http://eeclass.stanford.edu/ee382a EE382A – Autumn 2009, Christos Kozyrakis. Announcements
• Readings for this lecture – H&P 4th edition, Appendix F – Required paper
• HW3 available online – Due on Wed 11/11
• Exam on Fri 11/13, 9am - noon, room 200-305 – All lectures + required papers – Closed books, 1 page of notes, calculator – Review session on Friday 11/6, 2-3pm, Gates Hall Room 498
Review: Multi-core Processors
• Use Moore's law to place more cores per chip – 2x cores/chip with each CMOS generation – Roughly same clock frequency – Known as multi-core chips or chip-multiprocessors (CMP)
• Shared-memory multi-core – All cores access a unified physical address space – Implicit communication through loads and stores – Caches and OOO cores lead to coherence and consistency issues
Review: Memory Consistency Problem (assume initial value of A and flag is 0)

    P1:          P2:
    A = 1;       while (flag == 0); /* spin idly */
    flag = 1;    print A;

• Intuitively, you expect to print A=1 – But can you think of a case where you will print A=0? – Even if cache coherence is available
• Coherence talks about accesses to a single location
• Consistency is about ordering for accesses to different locations
• Alternatively – Coherence determines what value is returned by a read – Consistency determines when a write value becomes visible
Sequential Consistency (What the Programmers Often Assume) • Definition by L.
    [Show full text]
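The P1/P2 example in this excerpt is exactly the pattern that weakly ordered hardware can break. A sketch (ours, not from the lecture) of how C11 atomics restore the intuitive outcome:

    /* Release/acquire on flag orders the store to A before the store to
       flag, and the load of flag before the load of A, so P2 must see A=1.
       (Run p1 and p2 on two threads; shown as plain functions here.) */
    #include <stdatomic.h>
    #include <stdio.h>

    int A = 0;
    atomic_int flag = 0;

    void p1(void) {                       /* producer */
        A = 1;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    void p2(void) {                       /* consumer */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                             /* spin idly */
        printf("%d\n", A);                /* guaranteed to print 1 */
    }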
  • Vector Vs. Scalar Processors: a Performance Comparison Using a Set of Computational Science Benchmarks
    Vector vs. Scalar Processors: A Performance Comparison Using a Set of Computational Science Benchmarks. Mike Ashworth, Ian J. Bush and Martyn F. Guest, Computational Science & Engineering Department, CCLRC Daresbury Laboratory. ABSTRACT: Despite a significant decline in their popularity in the last decade vector processors are still with us, and manufacturers such as Cray and NEC are bringing new products to market. We have carried out a performance comparison of three full-scale applications, the first, SBLI, a Direct Numerical Simulation code from Computational Fluid Dynamics, the second, DL_POLY, a molecular dynamics code and the third, POLCOMS, a coastal-ocean model. Comparing the performance of the Cray X1 vector system with two massively parallel (MPP) micro-processor-based systems we find three rather different results. The SBLI PCHAN benchmark performs excellently on the Cray X1 with no code modification, showing 100% vectorisation and significantly outperforming the MPP systems. The performance of DL_POLY was initially poor, but we were able to make significant improvements through a few simple optimisations. The POLCOMS code has been substantially restructured for cache-based MPP systems and now does not vectorise at all well on the Cray X1 leading to poor performance. We conclude that both vector and MPP systems can deliver high performance levels but that, depending on the algorithm, careful software design may be necessary if the same code is to achieve high performance on different architectures. KEYWORDS: vector processor, scalar processor, benchmarking, parallel computing, CFD, molecular dynamics, coastal ocean modelling. 1. Introduction. All of the key computational science groups in the UK made use of vector supercomputers during their halcyon days of the 1970s, 1980s and into the early 1990s [1]-[3]. Vector computers entered the scene at a very early
    [Show full text]
  • Computer Architecture: Parallel Processing Basics
    Computer Architecture: Parallel Processing Basics. Onur Mutlu & Seth Copen Goldstein, Carnegie Mellon University, 9/9/13. Today: What is Parallel Processing? Why? Kinds of Parallel Processing: Multiprocessing and Multithreading. Measuring success: Speedup, Amdahl's Law. Bottlenecks to parallelism. Concurrent Systems: Embedded-Physical Distributed (Sensor Networks, Claytronics); Geographically Distributed (Power Grid, Internet); Cloud Computing (EC2, Tashi); Parallel. Comparison of concurrent systems:

                          Physical  Geographical  Cloud  Parallel
    Geophysical location    +++        ++          ---     ---
    Relative location       +++        +++          +       -
    Faults                 ++++        +++         ++++     --
    Number of processors    +++        +++          +       -
    Network structure      varies     varies      fixed   fixed
    Network connectivity    ---        ---          +       +

Concurrent System Challenge: Programming. The old joke: How long does it take to write a parallel program? One Graduate Student Year. Parallel Programming Again?? Increased demand (multicore). Increased scale (cloud). Improved compute/communicate. Change in application focus: irregular, recursive data structures. Why Parallel Computers? Parallelism: doing multiple things at a time. Things: instructions,
    [Show full text]
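The "Measuring success: Speedup, Amdahl's Law" item above has a compact worked form. A small sketch (ours) evaluating Amdahl's bound for a parallel fraction p on n processors:

    /* Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the
       fraction of the work that parallelizes over n processors. */
    #include <stdio.h>

    double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

    int main(void) {
        printf("p=0.90, n=16: %.2fx\n", amdahl(0.90, 16));  /* ~6.4x  */
        printf("p=0.99, n=16: %.2fx\n", amdahl(0.99, 16));  /* ~13.9x */
        return 0;
    }

Even with 16 processors, a 10% serial fraction caps the speedup near 6.4x, which is the "bottlenecks to parallelism" point the lecture is making.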
  • COSC 6385 Computer Architecture - Multi-Processors (IV) Simultaneous Multi-Threading and Multi-Core Processors Edgar Gabriel Spring 2011
    COSC 6385 Computer Architecture - Multi-Processors (IV): Simultaneous multi-threading and multi-core processors. Edgar Gabriel, Spring 2011. Moore's Law
• Long-term trend on the number of transistors per integrated circuit
• Number of transistors doubles every ~18 months
Source: http://en.wikipedia.org/wki/Images:Moores_law.svg
What do we do with that many transistors?
• Optimizing the execution of a single instruction stream through:
– Pipelining: overlap the execution of multiple instructions. Example: all RISC architectures; Intel x86 underneath the hood
– Out-of-order execution: allow instructions to overtake each other in accordance with code dependencies (RAW, WAW, WAR). Example: all commercial processors (Intel, AMD, IBM, SUN)
– Branch prediction and speculative execution: reduce the number of stall cycles due to unresolved branches. Example: (nearly) all commercial processors
What do we do with that many transistors? (II)
– Multi-issue processors: allow multiple instructions to start execution per clock cycle. Superscalar (Intel x86, AMD, …) vs. VLIW architectures
– VLIW/EPIC architectures: allow compilers to indicate independent instructions per issue packet. Example: Intel Itanium series
– Vector units: allow for the efficient expression and execution of vector operations. Example: SSE, SSE2, SSE3, SSE4 instructions
Limitations of optimizing a single instruction
    [Show full text]
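The "vector units" bullet above names SSE; as a concrete (if minimal) illustration, here is a sketch (ours) of four single-precision additions issued as one SSE instruction through compiler intrinsics:

    /* Four float adds in one SSE instruction via intrinsics. */
    #include <xmmintrin.h>   /* SSE */

    void add4(float *c, const float *a, const float *b) {
        __m128 va = _mm_loadu_ps(a);          /* load 4 floats (unaligned ok) */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(c, _mm_add_ps(va, vb)); /* one vector add, 4 lanes */
    }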
  • Design and Implementation of a Multithreaded Associative SIMD Processor
    DESIGN AND IMPLEMENTATION OF A MULTITHREADED ASSOCIATIVE SIMD PROCESSOR. A dissertation submitted to Kent State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy, by Kevin Schaffer, December, 2011. Dissertation written by Kevin Schaffer: B.S., Kent State University, 2001; M.S., Kent State University, 2003; Ph.D., Kent State University, 2011. Approved by Robert A. Walker, Chair, Doctoral Dissertation Committee; Johnnie W. Baker, Kenneth E. Batcher, Eugene C. Gartland, Members, Doctoral Dissertation Committee. Accepted by John R. D. Stalvey, Administrator, Department of Computer Science; Timothy Moerland, Dean, College of Arts and Sciences. TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1. Architectural Trends
1.1.1. Wide-Issue Superscalar Processors
1.1.2. Chip Multiprocessors (CMPs)
1.2. An Alternative Approach: SIMD
1.3. MTASC Processor
    [Show full text]
  • Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures
    Computer Architecture: A Quantitative Approach, Fifth Edition. Chapter 4: Data-Level Parallelism in Vector, SIMD, and GPU Architectures. Copyright © 2012, Elsevier Inc. All rights reserved. Contents
1. SIMD architecture
2. Vector architectures optimizations: Multiple Lanes, Vector Length Registers, Vector Mask Registers, Memory Banks, Stride, Scatter-Gather
3. Programming Vector Architectures
4. SIMD extensions for media apps
5. GPUs – Graphical Processing Units
6. Fermi architecture innovations
7. Examples of loop-level parallelism
8. Fallacies
Classes of Computers: Flynn's Taxonomy
• SISD - Single instruction stream, single data stream
• SIMD - Single instruction stream, multiple data streams. New: SIMT – Single Instruction Multiple Threads (for GPUs)
• MISD - Multiple instruction streams, single data stream. No commercial implementation
• MIMD - Multiple instruction streams, multiple data streams. Tightly-coupled MIMD; loosely-coupled MIMD
Advantages of SIMD architectures
1. Can exploit significant data-level parallelism for: matrix-oriented scientific computing; media-oriented image and sound processors
2. More energy efficient than MIMD: only needs to fetch one instruction per multiple data operations, rather than one instruction per data op. Makes SIMD attractive for personal mobile devices
3. Allows programmers to continue thinking sequentially
SIMD/MIMD comparison: potential speedup for SIMD twice that from MIMD. x86 processors expect two additional cores per chip per year; SIMD width to double every four years. SIMD parallelism: SIMD architectures: A. Vector architectures; B. SIMD extensions for mobile systems and multimedia applications; C.
    [Show full text]
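Two of the vector-architecture optimizations listed in the contents above, vector length registers and scatter-gather, have simple scalar renderings. A sketch (ours, with an assumed MVL of 64):

    /* Strip-mining: process an arbitrary n in MVL-sized strips, with the
       last strip's length playing the role of the vector-length register.
       Gather: an indexed load, the memory half of scatter-gather. */
    #define MVL 64                               /* assumed max vector length */

    void saxpy_stripmined(float *y, const float *x, float a, int n) {
        for (int i = 0; i < n; i += MVL) {           /* one "vector" per strip */
            int vl = (n - i < MVL) ? n - i : MVL;    /* vector-length register */
            for (int j = 0; j < vl; j++)
                y[i + j] += a * x[i + j];
        }
    }

    void gather(double *dst, const double *src, const int *idx, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = src[idx[i]];                    /* indexed (gather) load */
    }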
  • 7th Gen Intel® Core™ Processor U/Y-Platforms
    7th Generation Intel® Processor Families for U/Y Platforms. Datasheet, Volume 1 of 2. Supporting 7th Generation Intel® Core™ Processor Families, Intel® Pentium® Processors, Intel® Celeron® Processors for U/Y Platforms. January 2017. Document Number: 334661-002. Legal Lines and Disclaimers: You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or visit www.intel.com/design/literature.htm.
    [Show full text]
  • Lect. 11: Vector and SIMD Processors
    Lect. 11: Vector and SIMD Processors
• Many real-world problems, especially in science and engineering, map well to computation on arrays
• RISC approach is inefficient:
– Based on loops → require dynamic or static unrolling to overlap computations
– Indexing arrays based on arithmetic updates of induction variables
– Fetching of array elements from memory based on individual, and unrelated, loads and stores
– Instruction dependences must be identified for each individual instruction
• Idea:
– Treat operands as whole vectors, not as individual integer or floating-point numbers
– Single machine instruction now operates on whole vectors (e.g., a vector add)
– Loads and stores to memory also operate on whole vectors
– Individual operations on vector elements are independent and only dependences between whole vector operations must be tracked
Execution Model: for (i=0; i<64; i++) a[i] = b[i] + s;
• Straightforward RISC code:
– F2 contains the value of s
– R1 contains the address of the first element of a
– R2 contains the address of the first element of b
– R3 contains the address of the last element of a + 8

    loop: L.D    F0,0(R2)   ;F0=array element of b
          ADD.D  F4,F0,F2   ;main computation
          S.D    F4,0(R1)   ;store result
          DADDUI R1,R1,8    ;increment index
          DADDUI R2,R2,8    ;increment index
          BNE    R1,R3,loop ;next iteration

• Straightforward vector code:
– F2 contains the value of s
– R1 contains the address of the first element of a
– R2 contains the address of the first element of b
– Assume vector registers have 64 double precision elements

          LV      V1,R2     ;V1=array b
          ADDVS.D V2,V1,F2  ;main computation
          SV      V2,R1     ;store result

– Notes:
    [Show full text]
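A quick count makes the excerpt's point concrete: the scalar loop above executes 6 instructions per iteration, so roughly 6 × 64 = 384 dynamic instructions for the 64-element array, while the vector version executes just 3 (LV, ADDVS.D, SV), and only the dependences among those 3 whole-vector operations need to be tracked.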