SIMULTANEOUS MULTITHREADING (SMT): Multiple HW Contexts (Regs, PC, SP); Each Cycle, Any Context May Execute


Simultaneous Multithreading (SMT)

• An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide-issue processors (superscalars).
• SMT has the potential of greatly enhancing superscalar processor computational capabilities by:
  – Exploiting thread-level parallelism (TLP) in a single processor core, simultaneously issuing, executing and retiring instructions from different threads during the same cycle (chip-level TLP).
    • A single physical SMT processor core acts as a number of logical processors, each executing a single thread.
  – Providing multiple hardware contexts, hardware thread scheduling and context-switching capability.
  – Providing effective long-latency hiding, e.g. FP, branch misprediction, memory latency.

SMT Issues

• SMT CPU performance gain potential.
• Modifications to superscalar CPU architecture to support SMT.
• SMT performance evaluation vs. fine-grain multithreading, superscalar, chip multiprocessors.
• Hardware techniques to improve SMT performance (ref. papers SMT-1, SMT-2):
  – Optimal level-one cache configuration for SMT.
  – SMT thread instruction fetch, issue policies.
  – Instruction recycling (reuse) of decoded instructions.
• Software techniques:
  – Compiler optimizations for SMT. (SMT-3)
  – Software-directed register deallocation.
  – Operating system behavior and optimization. (SMT-7)
• SMT support for fine-grain synchronization. (SMT-4)
• SMT as a viable architecture for network processors.
• Current SMT implementation: Intel's Hyper-Threading (2-way SMT) microarchitecture and performance in compute-intensive workloads. (SMT-8, SMT-9)

Evolution of Microprocessors

General-purpose processors (GPPs) evolved from multi-cycle designs, to pipelined single-issue designs (CPI = 1), to multiple-issue designs (CPI < 1): superscalar/VLIW/SMT/CMP. Original (2002) Intel predictions ran from 1 GHz to 15 GHz to ???? GHz. A single-issue processor is a scalar processor; instructions per cycle (IPC) = 1/CPI. Performance: T = I x CPI x C. (Source: John P. Chen, Intel Labs.)

Microprocessor Frequency Trend

[Figure: clock frequency and gate delays per clock for Intel, IBM PowerPC and DEC processors, 1987-2005.]
Reality check: clock frequency scaling, which used to deliver 2x per generation, is slowing down. (Did silicon finally hit the wall?) Why?
1. Power leakage.
2. Clock distribution delays.
Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle).
Historically, frequency doubled each generation while the number of gates per clock dropped by 25%, leading to deeper pipelines with more stages (e.g. the Intel Pentium 4E has 30+ pipeline stages).
Possible solutions? Chip-level TLP:
– Exploit thread-level parallelism (TLP) at the chip level (SMT/CMP).
– Utilize/integrate more-specialized computing elements other than GPPs.
T = I x CPI x C
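To make the performance equation concrete, here is a minimal sketch that plugs assumed numbers into T = I x CPI x C; the workload size, clock rate and CPI values are made up for the example, not taken from the slides.

```c
/* Illustrative only: made-up numbers in T = I x CPI x C, showing why
   IPC (= 1/CPI) matters as clock-rate scaling slows. */
#include <stdio.h>

int main(void) {
    double I = 1e9;            /* instruction count (assumed workload) */
    double clock_hz = 3e9;     /* 3 GHz clock; C = 1/f is the cycle time */
    double C = 1.0 / clock_hz;

    double cpi_scalar = 1.0;   /* ideal single-issue pipeline */
    double cpi_4issue = 0.5;   /* 4-issue superscalar: realistic, not the ideal 0.25 */

    printf("single-issue: T = %.3f s\n", I * cpi_scalar * C);
    printf("4-issue:      T = %.3f s (IPC = %.1f)\n",
           I * cpi_4issue * C, 1.0 / cpi_4issue);
    return 0;
}
```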
Parallelism in Microprocessor VLSI Generations

[Figure: transistor counts from the i4004 (1970) to dual-core chips (2005), spanning bit-level parallelism, instruction-level parallelism (ILP) and thread-level parallelism (TLP).]
Single-issue, multi-cycle non-pipelined designs (CPI >> 1) gave way to pipelined designs (CPI = 1), then to superscalar/VLIW machines issuing multiple micro-operations per cycle (CPI < 1, AKA operation-level parallelism), then to simultaneous multithreading (SMT, e.g. Intel's Hyper-Threading) and chip multiprocessors (CMPs, e.g. IBM Power 4 and 5, Intel Pentium D and Core 2, AMD Athlon 64 X2, dual-core Opteron, Sun UltraSparc T1 "Niagara"). Chip-level parallel processing and thread-level parallelism (TLP) are even more important due to the slowing clock-rate increase: microprocessor generation performance improves by exploiting more levels of parallelism.

Microprocessor Architecture Trends

General-purpose processor (GPP) lineage:
• CISC machines: instructions take variable times to complete.
• RISC machines (microcode): single simple instructions, optimized for speed.
• RISC machines (pipelined): same individual instruction latency, greater throughput through instruction "overlap".
• Superscalar processors: multiple instructions executing simultaneously. These branch into:
  – Multithreaded processors: additional HW resources (regs, PC, SP); each context gets the processor for x cycles.
  – VLIW: "superinstructions" grouped together; decreased HW control complexity.
  – Single-chip multiprocessors (CMPs): duplicate entire processors, single- or multi-threaded (e.g. IBM Power 4/5, AMD X2, X3, X4, Intel Core 2).
  – Simultaneous multithreading (SMT): multiple HW contexts (regs, PC, SP); each cycle, any context may execute (e.g. Intel's HyperThreading on the P4).
  – Chip-level TLP, SMT/CMPs: e.g. IBM Power 5, 6, 7; Intel Pentium D; Sun Niagara (UltraSparc T1); Intel Nehalem (Core i7).

CPU Architecture Evolution: Single-Threaded/Single-Issue Pipeline

• Traditional 5-stage integer pipeline: Fetch, Decode, Execute, Memory, Writeback, with a register file, PC and SP, backed by the memory hierarchy (management).
• Increases throughput: ideal CPI = 1.

CPU Architecture Evolution: Single-Threaded/Superscalar Architectures

• Fetch, issue, execute, etc. more than one instruction per cycle (CPI < 1): parallel pipelines (Fetch i / Fetch i+1, Decode i / Decode i+1, ...) share the register file and memory hierarchy.
• Limited by instruction-level parallelism (ILP), due to single-thread limitations.

Hardware-Based Speculation

Speculative execution + Tomasulo's algorithm = speculative Tomasulo. A speculative Tomasulo-based processor adds a reorder buffer, a FIFO usually implemented as a circular buffer, between the instruction queue (IQ) holding instructions to issue in order and the commit or retirement stage (in order); speculative store results are buffered until commit. (4th edition: page 107; 3rd edition: page 228.)
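The reorder buffer idea lends itself to a short sketch. The C toy below is not the lecture's design; sizes and fields are illustrative. It shows a ROB as a circular FIFO: entries are allocated at issue, results arrive out of order, and only the head entry may retire.

```c
/* Minimal sketch of a reorder buffer as a circular FIFO. */
#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE 8

typedef struct { int dest_reg; int value; bool ready; } RobEntry;

static RobEntry rob[ROB_SIZE];
static int head = 0, tail = 0, count = 0;

/* Issue: grab a ROB slot; its index is the tag sent along with the op. */
int rob_alloc(int dest_reg) {
    if (count == ROB_SIZE) return -1;              /* structural stall */
    int tag = tail;
    rob[tag] = (RobEntry){ dest_reg, 0, false };
    tail = (tail + 1) % ROB_SIZE;
    count++;
    return tag;
}

/* Write result: a FU deposits its value in the ROB entry (out of order). */
void rob_write(int tag, int value) {
    rob[tag].value = value;
    rob[tag].ready = true;
}

/* Commit: only the head may retire, and only once its result is present. */
bool rob_commit(void) {
    if (count == 0 || !rob[head].ready) return false;
    printf("commit: R%d <- %d\n", rob[head].dest_reg, rob[head].value);
    head = (head + 1) % ROB_SIZE;
    count--;
    return true;
}

int main(void) {
    int t0 = rob_alloc(1), t1 = rob_alloc(2);
    rob_write(t1, 42);         /* younger result arrives first... */
    rob_commit();              /* ...but cannot commit past the head */
    rob_write(t0, 7);
    while (rob_commit()) ;     /* now both retire, in program order */
    return 0;
}
```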
Four Steps of Speculative Tomasulo Algorithm

Stage 0: Instruction fetch (IF): no changes, in order.
1. Issue (in order): get an instruction from the instruction queue. If a reservation station and a reorder buffer slot are free, issue the instruction and send its operands and the reorder buffer number for its destination. (This stage is sometimes called "dispatch".)
2. Execution (out of order, EX): operate on operands; includes data MEM read. When both operands are ready, execute; if not ready, watch the CDB for the result. When both operands are in the reservation station, execute; this checks RAW. (Sometimes called "issue".)
3. Write result (out of order, WB): finish execution. Write on the Common Data Bus (CDB) to all awaiting FUs (i.e. reservation stations) and the reorder buffer; mark the reservation station available. No write to registers or memory happens in WB; there is no WB for stores or branches.
4. Commit (in order): update registers or memory with the reorder buffer result. When an instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. Successfully completed instructions write to registers and memory (stores) here.
Mispredicted branch handling: a mispredicted branch at the head of the reorder buffer flushes the reorder buffer (cancels speculated instructions after the branch).
⇒ Instructions issue in order, execute (EX) and write results (WB) out of order, but must commit in order. (4th edition: pages 106-108; 3rd edition: pages 227-229.)

Advanced CPU Architectures: VLIW: Intel/HP IA-64 Explicitly Parallel Instruction Computing (EPIC)

• Strengths:
  – Allows for a high level of instruction parallelism (ILP).
  – Takes a lot of the dependency analysis out of HW and places the focus on smart compilers.
• Weaknesses:
  – Limited by instruction-level parallelism (ILP) in a single thread.
  – Keeping functional units (FUs) busy (control hazards).
  – Static FU scheduling limits performance gains.
  – Resulting overall performance heavily depends on compiler performance.

Superscalar Architecture Limitations: Issue Slot Waste Classification

• Empty or wasted issue slots can be classified as either vertical waste or horizontal waste:
  – Vertical waste is introduced when the processor issues no instructions in a cycle.
  – Horizontal waste occurs when not all issue slots can be filled in a cycle.
• Example: a 4-issue superscalar has ideal IPC = 4 (ideal CPI = 0.25); why not 8-issue? The classification also applies to VLIW. Instructions per cycle (IPC) = 1/CPI. The result of issue slot waste: actual performance << peak performance.

Sources of Unused (Wasted) Issue Cycles in an 8-issue Superscalar Processor

[Figure: ideal instructions per cycle IPC = 8 (ideal CPI = 1/8); the real IPC measured here is about 1.5.]
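A small sketch of the waste classification above, assuming a 4-issue machine and a made-up per-cycle issue trace:

```c
/* Classify empty slots of a WIDTH-issue machine into vertical waste
   (whole cycle idle) and horizontal waste (cycle partially filled). */
#include <stdio.h>

#define WIDTH 4

int main(void) {
    /* assumed trace: instructions issued in each of 8 cycles */
    int issued[] = { 4, 2, 0, 3, 0, 1, 4, 2 };
    int n = sizeof issued / sizeof issued[0];
    int vertical = 0, horizontal = 0, used = 0;

    for (int i = 0; i < n; i++) {
        used += issued[i];
        if (issued[i] == 0) vertical += WIDTH;       /* nothing issued   */
        else horizontal += WIDTH - issued[i];        /* partially filled */
    }
    printf("IPC = %.2f of ideal %d\n", (double)used / n, WIDTH);
    printf("vertical waste = %d slots, horizontal waste = %d slots\n",
           vertical, horizontal);
    return 0;
}
```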
Recommended publications
  • Enhancing Performance in Multiple Execution Unit Architecture Using Tomasulo Algorithm
    Vasantha.N.S. et al, International Journal of Computer Science and Mobile Computing (IJCSMC), Vol. 6, Issue 6, June 2017, pg. 302-306. Available online at www.ijcsmc.com. ISSN 2320-088X, impact factor 6.017.
    Enhancing Performance in Multiple Execution Unit Architecture using Tomasulo Algorithm
    Vasantha.N.S.¹, Meghana Kulkarni²
    ¹Department of VLSI and Embedded Systems, Centre for PG Studies, VTU Belagavi, India
    ²Associate Professor, Department of VLSI and Embedded Systems, Centre for PG Studies, VTU Belagavi, India
    ¹[email protected]; ²[email protected]
    Abstract: Tomasulo's algorithm is a computer architecture hardware algorithm for dynamic scheduling of instructions that allows out-of-order execution, designed to efficiently utilize multiple execution units. It was developed by Robert Tomasulo at IBM. The major innovations of Tomasulo's algorithm include register renaming in hardware. It also uses the concept of reservation stations for all execution units, along with a common data bus (CDB) on which computed values are broadcast to all reservation stations that may need them. The algorithm allows for improved parallel execution of instructions that would otherwise stall under earlier algorithms such as scoreboarding.
    Keywords: reservation station, register renaming, common data bus, multiple execution unit, register file
    I. INTRODUCTION. The instructions in any program may be executed in either of two orders, namely sequential order or data-flow order. Sequential order is the one in which the instructions are executed one after the other, but in reality this flow is very rare in programs.
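    The CDB broadcast described in this abstract can be sketched in a few lines of C. This is illustrative only, not Tomasulo's full algorithm: a finished result is broadcast with its tag, and any reservation station waiting on that tag captures the value. The convention that tag 0 means "operand already ready" is an assumption of the sketch.

```c
/* Sketch of the CDB broadcast step: every busy reservation station
   whose source operand carries the broadcast tag captures the value. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool busy;
    int  qj, qk;   /* tags of producing units (0 = operand ready) */
    int  vj, vk;   /* operand values once ready */
} RS;

#define NUM_RS 4

void cdb_broadcast(RS rs[], int tag, int value) {
    for (int i = 0; i < NUM_RS; i++) {
        if (!rs[i].busy) continue;
        if (rs[i].qj == tag) { rs[i].vj = value; rs[i].qj = 0; }
        if (rs[i].qk == tag) { rs[i].vk = value; rs[i].qk = 0; }
    }
}

int main(void) {
    RS rs[NUM_RS] = {
        { true, 3, 0, 0, 5 },   /* waiting on tag 3 for first operand  */
        { true, 0, 3, 9, 0 },   /* waiting on tag 3 for second operand */
    };
    cdb_broadcast(rs, 3, 42);   /* unit with tag 3 finishes, value 42 */
    for (int i = 0; i < 2; i++)
        printf("RS%d: vj=%d vk=%d ready=%s\n", i, rs[i].vj, rs[i].vk,
               (rs[i].qj == 0 && rs[i].qk == 0) ? "yes" : "no");
    return 0;
}
```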
  • Computer Science 246 Computer Architecture Spring 2010 Harvard University
    Computer Science 246: Computer Architecture, Spring 2010, Harvard University. Instructor: Prof. David Brooks, [email protected]. Dynamic Branch Prediction, Speculation, and Multiple Issue.
    Lecture outline:
    • Tomasulo's algorithm review (3.1-3.3)
    • Pointer-based renaming (MIPS R10000)
    • Dynamic branch prediction (3.4)
    • Other front-end optimizations (3.5): branch target buffers / return address stack
    Tomasulo review:
    • Reservation stations: distribute RAW hazard detection; renaming eliminates WAW hazards; buffering values in reservation stations removes WARs; tag match on the CDB requires many associative compares.
    • Common data bus: the Achilles heel of Tomasulo; multiple writebacks (multiple CDBs) are expensive.
    • Load/store reordering: a load address is compared with store addresses in the store buffer.
    Tomasulo organization: an FP op queue feeds load buffers (Load1-Load6), store buffers, and reservation stations (Add1-Add3, Mult1-Mult2) in front of the FP adders, FP multipliers and FP registers, all tied together by the common data bus (CDB), with paths from and to memory.
    Tomasulo review timing example over cycles 1-20 (each row lists the stages the instruction occupies; repeated Iss entries show it waiting in its reservation station):
    LD F0, 0(R1):   Iss M1 M2 M3 M4 M5 M6 M7 M8 Wb
    MUL F4, F0, F2: Iss Iss Iss Iss Iss Iss Iss Iss Iss Ex Ex Ex Ex Wb
    SD 0(R1), F0:   Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss M1 M2 M3 Wb
    SUBI R1, R1, 8: Iss Ex Wb
    BNEZ R1, Loop:  Iss Ex Wb
    LD F0, 0(R1):   Iss Iss Iss Iss M Wb
    MUL F4, F0, F2: Iss Iss Iss Iss Iss Ex Ex Ex Ex Wb
    SD 0(R1), F0:   Iss Iss Iss Iss Iss Iss Iss Iss Iss M1 M2
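    The load/store reordering bullet above hides a concrete check worth sketching: before a load is allowed to bypass older stores, its address is compared against every entry in the store buffer. A minimal sketch, assuming word-aligned addresses and that the youngest matching store forwards its data:

```c
/* Store-buffer address check for load/store reordering (simplified). */
#include <stdbool.h>
#include <stdio.h>

typedef struct { unsigned addr; int data; } StoreEntry;

/* Returns true and forwards data if an older store matches the address. */
bool load_check(const StoreEntry sb[], int n, unsigned load_addr, int *out) {
    for (int i = n - 1; i >= 0; i--)       /* youngest matching store wins */
        if (sb[i].addr == load_addr) { *out = sb[i].data; return true; }
    return false;
}

int main(void) {
    StoreEntry sb[] = { {0x100, 1}, {0x104, 2}, {0x100, 3} };
    int v;
    if (load_check(sb, 3, 0x100, &v))
        printf("load 0x100 forwarded from store buffer: %d\n", v);  /* 3 */
    if (!load_check(sb, 3, 0x200, &v))
        printf("load 0x200 misses store buffer: go to memory\n");
    return 0;
}
```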
  • Dynamic Scheduling
    Dynamically-Scheduled Machines
    • In a statically-scheduled machine, the compiler schedules all instructions to avoid data hazards: the ID unit may require that instructions can issue together without hazards, otherwise the ID unit inserts stalls until the hazards clear.
    • This section deals with dynamically-scheduled machines, where hardware-based techniques are used to detect and remove avoidable data hazards automatically, to allow "out-of-order" execution and improve performance.
    • Dynamic scheduling is used in the Pentium III and 4, the AMD Athlon, the MIPS R10000, the Sun UltraSPARC III, the IBM Power chips, the IBM/Motorola PowerPC, the HP Alpha 21264, and the Intel dual-core and quad-core processors.
    • In contrast, static multiple-issue with compiler-based scheduling is used in the Intel IA-64 Itanium architectures.
    • In 2007, the dual-core and quad-core Intel processors use the Pentium 5 family of dynamically-scheduled processors. The Itanium has had a hard time gaining market share. (Ch. 6, Advanced Pipelining-DYNAMIC, slide 1, © Ted Szymanski)
    Dynamic Scheduling: the Idea (class text, pg. 168, 171, 5th ed.)
    DIVD F0, F2, F4
    ADDD F10, F0, F8   (data hazard: stall issue for 23 cc)
    SUBD F12, F8, F14  (SUBD inherits 23 cc of stalls)
    • ADDD depends on DIVD; in a statically scheduled machine, the ID unit detects the hazard and causes the basic pipeline to stall for 23 cc.
    • The SUBD instruction cannot execute because the pipeline has stalled, even though SUBD does not logically depend upon either previous instruction.
    • Suppose the machine architecture was re-organized to let SUBD and subsequent instructions "bypass" the previous stalled instruction (the ADDD) and proceed with execution: we would allow "out-of-order" execution.
    • However, out-of-order execution would allow out-of-order completion, which may allow RW (read-write) and WW (write-write) data hazards.
    • An RW or WW hazard occurs when the reads/writes complete in the wrong order, destroying the data.
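    The DIVD/ADDD/SUBD example reduces to a simple RAW test: an instruction may bypass a stalled one only if it reads none of the earlier instructions' results. A hedged sketch (the register encoding is illustrative, not this text's notation):

```c
/* RAW dependence check behind the bypass decision in the example above. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { const char *name; int dest; int src1, src2; } Instr;

/* True if 'later' reads a register that 'earlier' writes (RAW hazard). */
bool raw_dep(Instr earlier, Instr later) {
    return later.src1 == earlier.dest || later.src2 == earlier.dest;
}

int main(void) {
    Instr divd = { "DIVD F0,F2,F4",    0,  2,  4 };
    Instr addd = { "ADDD F10,F0,F8",  10,  0,  8 };
    Instr subd = { "SUBD F12,F8,F14", 12,  8, 14 };

    printf("ADDD depends on DIVD: %s\n", raw_dep(divd, addd) ? "yes (stall)" : "no");
    printf("SUBD depends on DIVD: %s\n", raw_dep(divd, subd) ? "yes" : "no (may bypass)");
    printf("SUBD depends on ADDD: %s\n", raw_dep(addd, subd) ? "yes" : "no (may bypass)");
    return 0;
}
```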
  • United States Patent (19) 11 Patent Number: 5,680,565 Glew Et Al
    US005680565A. United States Patent (19), Patent Number: 5,680,565; Glew et al.; (45) Date of Patent: Oct. 21, 1997.
    (54) METHOD AND APPARATUS FOR PERFORMING PAGE TABLE WALKS IN A MICROPROCESSOR CAPABLE OF PROCESSING SPECULATIVE INSTRUCTIONS
    (75) Inventors: Andy Glew, Hillsboro; Glenn Hinton and Haitham Akkary, both of Portland, all of Oreg.
    (73) Assignee: Intel Corporation, Santa Clara, Calif.
    (21) Appl. No.: 176,363. (22) Filed: Dec. 30, 1993.
    (51) Int. Cl. G06F 12/12. (52) U.S. Cl. 395/415; 395/800; 395/383. (58) Field of Search: 395/400, 375, 800, 414, 415, 421.03.
    (56) References cited: Diefendorff, "Organization of the Motorola 88110 Superscalar RISC Microprocessor," IEEE Micro, Apr. 1996, pp. 40-63; Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, Apr. 1996, pp. 28-40; Smotherman et al., "Instruction Scheduling for the Motorola 88110," Microarchitecture 1993 International Symposium, pp. 257-262; Circello et al., "The Motorola 68060 Microprocessor," COMPCON IEEE Comp. Soc. Int'l Conf., Spring 1993, pp. 73-78; Johnson, "Superscalar Microprocessor Design," Advanced Micro Devices, Prentice Hall, 1991; Popescu et al., "The Metaflow Architecture," IEEE Micro, pp. 10-13 and 63-73, Jun. 1991. U.S. patent documents: 5,136,697 (8/1992, Johnson, 395/586); 5,226,126 (7/1993, McFarland et al., 395/394); 5,230,068 (7/1993, Van Dyke et al.).
    Primary Examiner: David K. Moore. Assistant Examiner: Kevin Verbrugge. Attorney, Agent, or Firm: Blakely, Sokoloff, Taylor & Zafiman.
    (57) ABSTRACT: A page table walk is performed in response to a data translation lookaside buffer miss based on a speculative
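    Setting aside the speculative-fault machinery that is the patent's actual contribution, the underlying page table walk can be sketched as two dependent memory accesses. A loose illustration assuming a 32-bit virtual address, 4 KB pages, and a two-level table; all sizes here are made up, not the patent's design:

```c
/* Toy two-level page-table walk: directory lookup, then table lookup. */
#include <stdint.h>
#include <stdio.h>

#define PDE_INDEX(va) (((va) >> 22) & 0x3FFu)   /* top 10 bits    */
#define PTE_INDEX(va) (((va) >> 12) & 0x3FFu)   /* middle 10 bits */
#define PAGE_OFF(va)  ((va) & 0xFFFu)           /* low 12 bits    */

static uint32_t page_dir[1024];        /* entry: which page table to use  */
static uint32_t page_tables[4][1024];  /* entries: physical frame numbers */

uint32_t walk(uint32_t va) {
    uint32_t pt  = page_dir[PDE_INDEX(va)];         /* 1st memory access */
    uint32_t pte = page_tables[pt][PTE_INDEX(va)];  /* 2nd memory access */
    return (pte << 12) | PAGE_OFF(va);
}

int main(void) {
    uint32_t va = 0x00403ABCu;
    page_dir[PDE_INDEX(va)] = 2;             /* region uses page table 2 */
    page_tables[2][PTE_INDEX(va)] = 0x12345u;
    printf("VA 0x%08X -> PA 0x%08X\n", va, walk(va));
    return 0;
}
```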
  • Identifying Bottlenecks in a Multithreaded Superscalar
    Identifying Bottlenecks in a Multithreaded Superscalar Microprocessor
    Ulrich Sigmund and Theo Ungerer
    VIONA Development GmbH, Karlstr., D Karlsruhe, Germany; University of Karlsruhe, Dept. of Computer Design and Fault Tolerance, D Karlsruhe, Germany
    Abstract: This paper presents a multithreaded superscalar processor that permits several threads to issue instructions to the execution units of a wide superscalar processor in a single cycle. Instructions can simultaneously be issued from up to [...] threads with a total issue bandwidth of [...] instructions per cycle. Our results show that the threaded issue processor reaches a throughput of [...] instructions per cycle.
    Introduction: Current microprocessors utilize instruction-level parallelism by a deep processor pipeline and by the superscalar technique that issues up to four instructions per cycle from a single thread. VLSI technology will allow future generations of microprocessors to exploit instruction-level parallelism up to [...] instructions per cycle or more. However, the instruction-level parallelism found in a conventional instruction stream is limited. The solution is the additional utilization of more coarse-grained parallelism. The main approaches are the multiprocessor chip and the multithreaded processor. The multiprocessor chip integrates two or more complete processors on a single chip; therefore every unit of a processor is duplicated and used independently of its copies on the chip.
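    The per-cycle selection such a multithreaded superscalar must make can be sketched briefly. The toy below uses round-robin over hardware contexts until the issue bandwidth is consumed; real designs use smarter fetch policies (e.g. ICOUNT), and all sizes here are assumptions rather than the paper's configuration:

```c
/* One cycle of issue-slot filling from several hardware contexts. */
#include <stdio.h>

#define THREADS 4
#define ISSUE_WIDTH 8

int ready[THREADS] = { 3, 0, 5, 2 };   /* issuable instructions per context */

int main(void) {
    int issued_total = 0;
    /* rotate over contexts, taking one instruction at a time */
    while (issued_total < ISSUE_WIDTH) {
        int progress = 0;
        for (int t = 0; t < THREADS && issued_total < ISSUE_WIDTH; t++) {
            if (ready[t] > 0) {
                ready[t]--;
                issued_total++;
                progress++;
                printf("issue from thread %d\n", t);
            }
        }
        if (!progress) break;          /* vertical waste: nothing ready */
    }
    printf("issued %d of %d slots this cycle\n", issued_total, ISSUE_WIDTH);
    return 0;
}
```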
  • GPU Implementation Over IPTV Software Defined Networking
    Esmeralda Hysenbelliu, Int. Journal of Engineering Research and Application, www.ijera.com, ISSN 2248-9622, Vol. 7, Issue 8 (Part 1), August 2017, pp. 41-45. RESEARCH ARTICLE, OPEN ACCESS.
    GPU Implementation over IPTV Software Defined Networking
    Esmeralda Hysenbelliu, Information Technology Faculty, Polytechnic University of Tirana, Sheshi "Nënë Tereza", Nr. 4, Tirana, Albania. Corresponding author: Esmeralda Hysenbelliu.
    ABSTRACT: One of the most important issues in IPTV software-defined networking is bandwidth and quality of service at the client side. High-quality images are required at low bandwidth, which calls for different transcoding standards (compressing the image as much as possible without destroying its quality) such as H.264, H.265, VP8 and VP9. During a test performed on the SMC IPTV SDN cloud network, it was observed that a server HP ProLiant DL380 G6 with two physical processors could not transcode more than 30 channels simultaneously in H.264 format because the CPUs reached 100%. This is why graphics processing units (GPUs), which offer high-level image processing, were immediately needed. After the GPU superscalar processor was integrated and made functional via the NVENC module of the FFMPEG program, the number of channels transcoded simultaneously increased tremendously (to more than 100 channels). The aim of this paper is to implement GPU superscalar processors in real IPTV cloud networks, achieving a performance improvement of more than 60%.
    Keywords: GPU superscalar processor, performance improvement, NVENC, CUDA
    Date of Submission: 01-05-2017. Date of acceptance: 19-08-2017.
    I.
  • Computer Hardware Architecture Lecture 4
    Computer Hardware Architecture, Lecture 4. Manfred Liebmann, Technische Universität München, Chair of Optimal Control, Center for Mathematical Sciences, M17, [email protected], November 10, 2015.
    Reading list:
    • Pacheco, An Introduction to Parallel Programming (Chapters 1-2): introduction to computer hardware architecture from the parallel programming angle.
    • Hennessy-Patterson, Computer Architecture: A Quantitative Approach: reference book for computer hardware architecture.
    All books are available on the Moodle platform.
    UMA architecture (Figure 1: a uniform memory access (UMA) multicore system): access times to main memory are the same for all cores in the system.
    NUMA architecture (Figure 2: a nonuniform memory access (NUMA) multicore system): access times to main memory differ from core to core depending on the proximity of the main memory. This architecture is often used in dual- and quad-socket servers, due to improved memory bandwidth.
    Cache coherence (Figure 3: a shared memory system with two cores and two caches): what happens if the same data element z1 is manipulated in two different caches? The hardware enforces cache coherence, i.e. consistency between the caches. Expensive!
    False sharing: the cache coherence protocol works on the granularity of a cache line. If two threads manipulate different elements within a single cache line, the cache coherency protocol is activated to ensure consistency, even if every thread is only manipulating its own data.
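    The usual fix for false sharing follows directly from the cache-line granularity described above: pad per-thread data so that each item occupies its own line. A minimal sketch assuming 64-byte cache lines (the line size is an assumption; it varies by processor):

```c
/* Padding per-thread counters onto separate cache lines. */
#include <stdio.h>

#define CACHE_LINE 64

/* Bad: both counters fit in one cache line, so two threads updating
   a and b separately still ping-pong the line between their caches. */
struct shared { long a, b; };

/* Good: padding pushes each counter onto its own line. */
struct padded {
    long a; char pad1[CACHE_LINE - sizeof(long)];
    long b; char pad2[CACHE_LINE - sizeof(long)];
};

int main(void) {
    printf("unpadded: a and b within %zu bytes -> same line, false sharing\n",
           sizeof(struct shared));
    printf("padded:   struct spans %zu bytes -> separate lines\n",
           sizeof(struct padded));
    return 0;
}
```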
  • Superscalar Fall 2020
    CS232, Superscalar, Fall 2020
    What is superscalar?
    • A superscalar processor has more than one set of functional units and executes multiple independent instructions during a clock cycle by simultaneously dispatching multiple instructions to different functional units in the processor.
    • You can think of a superscalar processor as having more than one washer, dryer, and person who can fold, so it allows more throughput.
    • The order of instruction execution is usually assisted by the compiler. The hardware and the compiler ensure that parallel execution does not violate the intent of the program.
    • Example:
      – Ordinary pipeline: four stages (Fetch, Decode, Execute, Write back), one clock cycle per stage. Executing 6 instructions takes 9 clock cycles: I0 enters at cycle 1, I1 at cycle 2, and so on, with I5 finishing at cycle 9.
      – 2-degree superscalar: attempts to process 2 instructions simultaneously. Executing 6 instructions takes 6 clock cycles: I0 and I1 enter together at cycle 1, I2 and I3 at cycle 2, I4 and I5 at cycle 3.
    Limitations of superscalar:
    • The above example assumes that the instructions are independent of each other, so it is easy to push them into the pipeline and issue them in parallel. In reality, however, instructions are usually dependent on each other. Just like the hazards in a pipeline, superscalar execution has limitations too.
    • There are several fundamental limitations the system must cope with: true data dependency, procedural dependency, resource conflict, output dependency, and anti-dependency.
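    The two timing examples above follow one formula: a k-stage pipeline of superscalar degree s retires n independent instructions in k + ceil(n/s) - 1 cycles. A small check of both numbers:

```c
/* Cycle count for n independent instructions on a k-stage pipeline
   of superscalar degree s: k + ceil(n/s) - 1. */
#include <stdio.h>

int cycles(int n, int k, int s) {
    return k + (n + s - 1) / s - 1;     /* (n+s-1)/s is integer ceil(n/s) */
}

int main(void) {
    printf("scalar pipeline, 6 instrs:      %d cycles\n", cycles(6, 4, 1)); /* 9 */
    printf("2-degree superscalar, 6 instrs: %d cycles\n", cycles(6, 4, 2)); /* 6 */
    return 0;
}
```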
  • The Multiscalar Architecture
    THE MULTISCALAR ARCHITECTURE, by MANOJ FRANKLIN. A thesis submitted in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY (Computer Sciences) at the UNIVERSITY OF WISCONSIN-MADISON, 1993. Under the supervision of Associate Professor Gurindar S. Sohi.
    ABSTRACT: The centerpiece of this thesis is a new processing paradigm for exploiting instruction level parallelism. This paradigm, called the multiscalar paradigm, splits the program into many smaller tasks, and exploits fine-grain parallelism by executing multiple, possibly (control and/or data) dependent tasks in parallel using multiple processing elements. Splitting the instruction stream at statically determined boundaries allows the compiler to pass substantial information about the tasks to the hardware. The processing paradigm can be viewed as an extension of the superscalar and multiprocessing paradigms, and shares a number of properties of the sequential processing model and the dataflow processing model. The multiscalar paradigm is easily realizable, and we describe an implementation of the multiscalar paradigm, called the multiscalar processor. The central idea here is to connect multiple sequential processors, in a decoupled and decentralized manner, to achieve overall multiple issue. The multiscalar processor supports speculative execution, allows arbitrary dynamic code motion (facilitated by an efficient hardware memory disambiguation mechanism), exploits communication localities, and does all of these with hardware that is fairly straightforward to build. Other desirable aspects of the implementation include decentralization of the critical resources, absence of wide associative searches, and absence of wide interconnection/data paths.
  • Advanced Processor Designs
    Advanced processor designs We’ve only scratched the surface of CPU design. Today we’ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. —The Motorola PowerPC, used in Apple computers and many embedded systems, is a good example of state-of-the-art RISC technologies. —The Intel Itanium is a more radical design intended for the higher-end systems market. April 9, 2003 ©2001-2003 Howard Huang 1 General ways to make CPUs faster You can improve the chip manufacturing technology. — Smaller CPUs can run at a higher clock rates, since electrons have to travel a shorter distance. Newer chips use a “0.13µm process,” and this will soon shrink down to 0.09µm. — Using different materials, such as copper instead of aluminum, can improve conductivity and also make things faster. You can also improve your CPU design, like we’ve been doing in CS232. — One of the most important ideas is parallel computation—executing several parts of a program at the same time. — Being able to execute more instructions at a time results in a higher instruction throughput, and a faster execution time. April 9, 2003 Advanced processor designs 2 Pipelining is parallelism We’ve already seen parallelism in detail! A pipelined processor executes many instructions at the same time. All modern processors use pipelining, because most programs will execute faster on a pipelined CPU without any programmer intervention. Today we’ll discuss some more advanced techniques to help you get the most out of your pipeline. April 9, 2003 Advanced processor designs 3 Motivation for some advanced ideas Our pipelined datapath only supports integer operations, and we assumed the ALU had about a 2ns delay.
  • Computer Architecture Out-Of-Order Execution
    Computer Architecture: Out-of-order Execution. By Yoav Etsion, with acknowledgement to Dan Tsafrir, Avi Mendelson, Lihu Rappoport, and Adi Yoaz. (Computer Architecture 2013, Out-of-Order Execution)
    The need for speed: superscalar
    • Remember our goal: minimize CPU Time. CPU Time = duration of clock cycle × CPI × IC
    • So far we have learned that in order to:
      – minimize the clock cycle ⇒ add more pipe stages
      – minimize CPI ⇒ utilize the pipeline
      – minimize IC ⇒ change/improve the architecture
    • Why not make the pipeline deeper and deeper? Beyond some point, adding more pipe stages doesn't help, because control/data hazards increase and become costlier.
    • (Recall that in a pipelined CPU, CPI = 1 only without hazards.)
    • So what can we do next? Reduce the CPI by utilizing ILP (instruction level parallelism). We will need to duplicate HW for this purpose.
    A simple superscalar CPU
    • Duplicates the pipeline to accommodate ILP (IPC > 1). ILP = instruction-level parallelism.
    • Note that duplicating HW in just one pipe stage doesn't help; e.g., with 2 ALUs the bottleneck moves to other stages (IF ID EXE MEM WB).
    • Conclusion: getting IPC > 1 requires fetching/decoding/executing/retiring more than one instruction per clock, i.e. two full IF ID EXE MEM WB pipes.
    Example: Pentium Processor
    • The Pentium fetches and decodes 2 instructions per cycle.
    • Before the register file read, it decides on pairing: can the two instructions (u-pipe and v-pipe) be executed in parallel? (yes/no)
    • The pairing decision is based... on data
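    The "why not make the pipeline deeper?" point above can be put into a toy model: the misprediction penalty grows with depth, so effective CPI rises even as the cycle shrinks. All rates below are assumptions for illustration, not measurements:

```c
/* Toy model: effective CPI and relative time per instruction vs. depth. */
#include <stdio.h>

int main(void) {
    double branch_freq = 0.20;      /* assumed fraction of branches       */
    double mispredict  = 0.10;      /* assumed misprediction rate         */
    for (int depth = 5; depth <= 30; depth += 5) {
        double penalty = depth * 0.7;               /* flush ~70% of pipe */
        double cpi = 1.0 + branch_freq * mispredict * penalty;
        /* cycle time shrinks roughly with depth; relative units */
        double rel_time = cpi / depth;
        printf("depth %2d: CPI = %.2f, relative time/instr = %.3f\n",
               depth, cpi, rel_time);
    }
    return 0;
}
```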
  • CSE 586 Computer Architecture Lecture 4
    CSE 586 Computer Architecture, Lecture 4. Jean-Loup Baer. http://www.cs.washington.edu/education/courses/586/00sp
    Highlights from last week:
    • ILP: where can the compiler optimize: loop unrolling and software pipelining; speculative execution (we'll see predication today).
    • ILP: dynamic scheduling in a single-issue machine: scoreboard; Tomasulo's algorithm.
    Highlights from last week (c'ed): scoreboard
    • The scoreboard keeps a record of all data dependencies.
    • The scoreboard keeps a record of all functional unit occupancies.
    • The scoreboard decides if an instruction can be issued.
    • The scoreboard decides if an instruction can store its result.
    • Implementation-wise, the scoreboard keeps track of which registers are used as sources and destinations and which functional units use them.
    Highlights from last week (c'ed): Tomasulo's algorithm
    • Decentralized control.
    • Use of reservation stations to buffer and/or rename registers (hence gets rid of WAW and WAR hazards).
    • Results, and their names, are broadcast to reservation stations and the register file.
    • Instructions are issued in order but can be dispatched, executed and completed out of order.
    Highlights from last week (c'ed)
    • Register renaming: avoids WAW and WAR hazards.
    • It is performed at "decode" time to rename the result register.
    • Two basic implementation schemes: have a separate physical register file, or use a reorder buffer and reservation stations (cf.
    Multiple issue alternatives
    • Superscalar (hardware detects conflicts):
      – Statically scheduled (in-order dispatch and hence execution; cf. DEC Alpha 21164)
      – Dynamically scheduled (in-order issue, out-of-order dispatch and execution; cf. MIPS 10000, IBM Power PC 620 and Intel Pentium
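    The register renaming summary above can be sketched with a simple map table: at decode, each architectural destination is mapped to a fresh physical register, which removes WAW and WAR hazards. Free-list recycling is omitted and all sizes are illustrative, so this is a sketch of the idea rather than either implementation scheme in full:

```c
/* Rename-at-decode with a map table (free-list handling omitted). */
#include <stdio.h>

#define ARCH_REGS 8

int map[ARCH_REGS];         /* architectural reg -> physical reg */
int next_phys = ARCH_REGS;  /* physical registers beyond the initial map */

int rename_src(int r)  { return map[r]; }              /* reads use map */
int rename_dest(int r) { map[r] = next_phys++; return map[r]; }

int main(void) {
    for (int i = 0; i < ARCH_REGS; i++) map[i] = i;     /* identity map */
    /* WAW example: two writes to R1 get different physical registers */
    printf("R1 first write  -> p%d\n", rename_dest(1));
    printf("read of R1      -> p%d\n", rename_src(1));
    printf("R1 second write -> p%d (no WAW with the first)\n", rename_dest(1));
    return 0;
}
```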