Pipelined Multithreading Transformations and Support Mechanisms


PIPELINED MULTITHREADING TRANSFORMATIONS AND SUPPORT MECHANISMS

RAM RANGAN

A DISSERTATION PRESENTED TO THE FACULTY OF PRINCETON UNIVERSITY IN CANDIDACY FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

RECOMMENDED FOR ACCEPTANCE BY THE DEPARTMENT OF COMPUTER SCIENCE

JUNE 2007

© Copyright by Ram Rangan, 2007. All Rights Reserved.

Abstract

Even though chip multiprocessors have emerged as the predominant organization for future microprocessors, the multiple on-chip cores do not directly result in improved application performance (especially for legacy applications, which are predominantly sequential C/C++ codes). Consequently, parallelizing applications to execute on multiple cores is essential to their success. Independent multithreading techniques, like DOALL extraction, create partially or fully independent threads, which communicate rarely, if at all. While such strategies keep high inter-thread communication costs from impacting program performance, they cannot be applied to parallelize general-purpose applications, which are characterized by difficult-to-break recurrences. Even though cyclic multithreading techniques, such as DOACROSS, are more applicable, the cyclic inter-thread dependences created by these techniques cause them to have very low tolerance to rising inter-core latencies.

To address these problems, this work introduces a pipelined multithreading (PMT) transformation called Decoupled Software Pipelining (DSWP). DSWP, in particular, and PMT techniques, in general, are able to tolerate inter-core latencies while still handling codes with complex recurrences. They achieve this by enforcing an acyclic communication discipline amongst threads, which allows threads to use inter-thread queues to communicate in a pipelined fashion. This dissertation demonstrates that DSWPed codes not only tolerate inter-core communication costs, but also tolerate variable-latency stalls in applications more effectively than single-threaded execution on both in-order and out-of-order issue processors with comparable resources. It then performs a thorough analysis of the performance scalability of automatically generated DSWPed codes and identifies the conditions necessary to achieve peak PMT performance.

Next, the dissertation shows that even though PMT applications tolerate inter-core latencies well, the high frequency of inter-thread communication (once every 5 to 20 dynamic instructions) in these codes makes them very sensitive to the intra-thread overhead imposed by communication operations. In order to understand the issues surrounding inter-thread communication for PMT applications, this dissertation undertakes a methodical exploration of the design space of communication support options for PMT. Three new communication mechanisms with varying cost-performance tradeoffs are introduced and are shown to perform 38% to 200% better than the state of the art.
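To make the pipelined-multithreading idea concrete, the following is a minimal sketch (not taken from the dissertation) of how a DSWP-style transformation splits a pointer-chasing loop into two communicating stages: one thread keeps the loop-carried recurrence (the pointer traversal) and streams work to a second thread through an inter-thread queue, so the dependence between threads flows in only one direction. The BlockingQueue class, the Node type, and all names here are illustrative assumptions; the dissertation's hardware proposals (such as the synchronization array) exist precisely because lock-based software queues like this one impose heavy per-element overhead.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Node { int value; Node* next; };

// Stand-in for the inter-thread queues of the dissertation: a simple
// lock-based blocking queue with produce/consume operations.
template <typename T>
class BlockingQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void produce(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T consume() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};

int main() {
    // Build a small linked list: values 0 through 9.
    std::vector<Node> nodes(10);
    for (int i = 0; i < 10; ++i) {
        nodes[i].value = i;
        nodes[i].next = (i + 1 < 10) ? &nodes[i + 1] : nullptr;
    }

    BlockingQueue<Node*> queue;

    // Stage 1: the loop-carried recurrence (pointer chasing) stays in
    // one thread, which runs ahead, producing node pointers.
    std::thread traverse([&] {
        for (Node* n = &nodes[0]; n != nullptr; n = n->next)
            queue.produce(n);
        queue.produce(nullptr);  // end-of-stream token
    });

    // Stage 2: the (possibly long-latency) loop body consumes nodes
    // downstream; the acyclic dependence means it never stalls stage 1.
    std::thread compute([&] {
        long sum = 0;
        while (Node* n = queue.consume())
            sum += n->value;
        std::printf("sum = %ld\n", sum);
    });

    traverse.join();
    compute.join();
    return 0;
}
```

Because the traversal thread never waits on the compute thread, a long-latency stall in either stage is absorbed by the queue's buffering rather than serializing the whole loop, which is the latency-tolerance property the abstract describes.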
Acknowledgments

I owe this work to my advisor, David August. I thank him for believing in me. I thank him for supporting me every step of the way, from recommending my admission to Princeton to feeding me for a whole three and a half years (with liberal grants from NSF Grant No. 0133712, NSF Grant No. 0305617, and Intel Corporation)¹, from showing me how to use gdb during my first summer to helping improve my writing skills over the years, from celebrating every small research success I had to consoling and encouraging me every single time I was not able to make a paper deadline or had a paper rejected, and so on. His tonic was definitely bitter many a time, but in hindsight, there appears to have been a method to the madness, and the many situations I sulked about originally have ultimately worked out to my benefit. His take-the-bull-by-the-horns approach to research and hard work will always be an inspiration for me.

The central idea of this dissertation, decoupled software pipelining (DSWP), was born out of discussions with Neil Vachharajani, my primary collaborator, and David August.

I would like to thank Margaret Martonosi, George Cai, Li Shiuan Peh, and Jaswinder Pal Singh for serving on my thesis committee. Their collective wisdom has made this dissertation more complete in terms of providing an in-depth understanding of various aspects of DSWP behavior. I thank my advisor and my readers, George Cai and Margaret Martonosi, for carefully reading through my dissertation and suggesting fixes. Special thanks to Margaret Martonosi for her impressive turnaround time that enabled the timely scheduling of my final defense. As a collaborator, George Cai's expert inputs helped improve the quality of the DSWP communication support work.

I would like to thank HP Labs and Intel for providing me with valuable internship opportunities. I had great fun during these internships and learned a lot from working closely with Shail Aditya and Scott Mahlke (at HP) and Shubu Mukherjee, Arijit Biswas, Paul Racunas, and Joel Emer (at Intel).

¹ Opinions, findings, conclusions, and recommendations expressed in this dissertation are not necessarily the views of the NSF or Intel Corporation.

I would probably not be in computer architecture if not for my undergraduate advisor, Ranjani Parthasarathi. Her classes and exams gave me a good understanding of the basics. By allowing her students complete freedom and by being available for discussions, she enabled the exploration of research-quality ideas for class projects. For all this and more, I am deeply indebted to her.

My heartfelt thanks to all past and current members of the Liberty group. I thank Manish Vachharajani, David Penry, and Neil Vachharajani for developing the Liberty Simulation Environment; it really made microarchitecture modeling a straightforward process and helped me reason about and truly understand the innards of various hardware blocks. The cache, in particular! Thanks to David Penry for developing a robust IA64 emulator and a cycle-accurate Itanium 2 core model. Many thanks to Spyridon Triantafyllis, Matthew Bridges, Easwaran Raman, and the entire VELOCITY compiler team for their hard work. Guilherme Ottoni and Neil Vachharajani implemented the automatic DSWP algorithm used in this dissertation in the VELOCITY compiler. I had great fun working with Jonathan Chang, George Reis, Adam Stoler, Jason Blome, and Bolei Guo on various projects. The random discussions, jokes, movies, literati, text twist, tennis, and squash games with my lab mates kept me sane through graduate school.
Outside of my lab, Lakshmi and Mani, Easwar, Logo, the Shivakumar and the Mahesh families, Julia Chen, Kevin Ko, Ge Wang, Margaret and Bob Slighton, Steve Wallitt, Neal Kaufmann, the entire PUTTC crowd, and several others cheered me up so much that I almost never missed home. Close relatives and friends from school and college, Alwyn, Anusha, Arun, Arundhati, Kasturi athai, Mani, Ravi, Robs, Satish, Siva, Subha, and others have always provided great encouragement and support. I thank Nitya for her unconditional love and friendship. A big thank you to my little sister, Ramya, for her love and affection over the years.

I thank my grandparents for their blessings and prayers. I salute their indefatigable spirit that made them take great interest in my progress towards thesis completion, despite their myriad age-related problems. I thank all four grandparents for loving me so much and for all that they have taught me.

Finally, words cannot express my love and gratitude for my parents, who have given their all for both their children. Their innumerable sacrifices, their selfless love, and their keen interest in my success are what have gotten me to this point. I am indeed fortunate to have such a wonderful family, and I wish to dedicate my life's work (well, at least five and a half years of it) to my parents and my grandparents for everything they have given me and done for me.

Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
  1.1 Thread Level Parallelism Paradigms
    1.1.1 Independent Multithreading (IMT)
    1.1.2 Cyclic Multithreading (CMT)
    1.1.3 Pipelined Multithreading (PMT)
  1.2 Contributions
  1.3 Overview

2 Decoupled Software Pipelining
  2.1 Limitations of single-threaded execution
  2.2 Overview of DSWP
  2.3 RDS Loops and Latency Tolerance
  2.4 RDS Parallelization
  2.5 Decoupled Software Pipelining
  2.6 Automatic DSWP
  2.7 Summary

3 Communication Support for PMT
  3.1 High-Frequency Streaming
  3.2 Design Space
    3.2.1 Communication Operation Sequences
    3.2.2 Dedicated Interconnects
    3.2.3 Pipelined Interconnects
    3.2.4 Synchronization
    3.2.5 Queue Backing Store
  3.3 The Synchronization Array
    3.3.1 Operation
    3.3.2 Handling Control Speculation
    3.3.3 Performance Scalability
    3.3.4 Integrating with the Itanium® 2 pipeline
  3.4 The Snoop-Based Synchronization technique
  3.5 Summary

4 Evaluation Methodology
  4.1 Benchmarks and Tools
  4.2 Performance Measurement
  4.3 Sampling Methodology
  4.4 Summary

5 Performance Evaluation of DSWP
  5.1 Performance of 2-thread DSWP
    5.1.1 Balancing Threads Better
    5.1.2 Latency tolerance through decoupling
  5.2 Performance scalability of DSWP
    5.2.1 Linear and non-linear thread pipelines
    5.2.2 Performance under ideal communication behavior
  5.3 Summary

6 Evaluation of Communication Support
  6.1 Systems Studied
  6.2 Experimental Setup
  6.3 Results and Analysis
  6.4 Sensitivity Study
  6.5 Summary

7 Communication Support Optimizations
  7.1 Amortizing Overhead Costs for Software Queues
    7.1.1 Analysis
    7.1.2 Code Generation
    7.1.3 Evaluation
    7.1.4 Discussion
  7.2 Hardware Enhancements to Snoop-Based Synchronization
  7.3 Summary

8 Conclusions and Future Directions

List of Tables

2.1 Multi-core techniques to improve single-threading performance