CS 61C: Great Ideas in Computer Architecture Multiple Instruction

Total Page:16

File Type:pdf, Size:1020Kb

CS 61C: Great Ideas in Computer Architecture Multiple Instruction 7/31/2014 CS 61C: Great Ideas in Great Idea #4: Parallelism Computer Architecture Software Hardware • Parallel Requests Warehouse Smart Assigned to computer Scale Phone Computer e.g. search “Garcia” Leverage Multiple Instruction • Parallel Threads Parallelism & Assigned to core Achieve High Issue/Virtual Memory e.g. lookup, ads Performance Computer Introduction • Parallel Instructions Core … Core > 1 instruction @ one time Memory e.g. 5 pipelined instructions Input/Output Guest Lecturer/TA: Andrew Luo • Parallel Data Core > 1 data item @ one time Instruction Unit(s) Functional e.g. add of 4 pairs of words Unit(s) • Hardware descriptions A0+B0 A1+B1 A2+B2 A3+B3 Logic Gates All gates functioning in Cache Memory parallel at same time 8/31/2014 Summer 2014 - Lecture 23 1 8/31/2014 Summer 2014 - Lecture 23 2 Review of Last Lecture (1/2) Review of Last Lecture (2/2) •A hazard is a situation that prevents the next instruction from executing in the next clock •Hazards hurt the performance of pipelined cycle processors • Structural hazard •Stalling can be applied to any hazard, but hurt • A required resource is needed by multiple instructions performance… better solutions? in different stages • Structural hazards • Data hazard • Have separate structures for each stage • Data dependency between instructions • A later instruction requires the result of an earlier • Data hazards instruction • Forwarding • Control hazard • Control hazards • The flow of execution depends on the previous • Branch prediction (mitigates), branch delay slot instruction (ex. jumps, branches) 8/31/2014 Summer 2014 - Lecture 23 3 8/31/2014 Summer 2014 - Lecture 23 4 Agenda Multiple Issue •Modern processors can issue and execute •Multiple Issue multiple instructions per clock cycle •Administrivia •CPI < 1 (superscalar), so can use Instructions •Virtual Memory Introduction Per Cycle (IPC) instead •e.g. 4 GHz 4-way multiple-issue can execute 16 billion IPS with peak CPI = 0.25 and peak IPC = 4 • But dependencies and structural hazards reduce this in practice 8/31/2014 Summer 2014 - Lecture 23 5 8/31/2014 Summer 2014 - Lecture 23 6 1 7/31/2014 Superscalar Laundry: Parallel per Multiple Issue stage6 PM 7 8 9 10 11 12 1 2 AM •Static multiple issue 3030 30 30 30 Time • Compiler reorders independent/commutative T instructions to be issued together a A (light clothing) • Compiler detects and avoids hazards s B (dark clothing) • k Dynamic multiple issue C (very dirty clothing) • CPU examines pipeline and chooses instructions to O D reorder/issue r (light clothing) • CPU can resolve hazards at runtime d E (dark clothing) e (very dirty clothing) r F • More resources, HW to match mix of parallel tasks? 8/31/2014 Summer 2014 - Lecture 23 7 8/31/2014 Summer 2014 - Lecture 23 8 Pipeline Depth and Issue Width Pipeline Depth and Issue Width • Intel Processors over Time 10000 Microprocessor Year Clock Rate Pipeline Issue Cores Power Clock Stages width i486 1989 25 MHz 5 1 1 5W 1000 Power Pentium 1993 66 MHz 5 2 1 10W Pentium Pro 1997 200 MHz 10 3 1 29W Pipeline P4 Willamette 2001 2000 MHz 22 3 1 75W 100 Stages P4 Prescott 2004 3600 MHz 31 3 1 103W Issue width Core 2 Conroe 2006 2930 MHz 12-14 4* 2 75W 10 Core 2 Penryn 2008 2930 MHz 12-14 4* 4 95W Cores Core i7 Westmere 2010 3460 MHz 14 4* 6 130W Xeon Sandy Bridge 2012 3100 MHz 14-19 4* 8 150W 1 1989 1992 1995 1998 2001 2004 2007 2010 Xeon Ivy Bridge 2014 2800 MHz 14-19 4* 15 155W 8/31/2014 Summer 2014 - Lecture 23 9 8/31/2014 Summer 2014 - Lecture 23 10 Static Multiple Issue Scheduling Static Multiple Issue • Compiler reorders independent/commutative instructions to be issued together (an “issue packet”) •Compiler must remove some/all hazards • Group of instructions that can be issued on a single cycle • Reorder instructions into issue packets • Determined by structural resources required • No dependencies within a packet • Specifies multiple concurrent operations • Possibly some dependencies between packets • Varies between ISAs; compiler must know! • Pad with nops if necessary 8/31/2014 Summer 2014 - Lecture 23 11 8/31/2014 Summer 2014 - Lecture 23 12 2 7/31/2014 Dynamic Multiple Issue Dynamic Pipeline Scheduling • Allow the CPU to execute instructions out of order to •Used in “superscalar” processors avoid stalls •CPU decides whether to issue 0, 1, 2, … • But commit result to registers in order instructions each cycle • Example: lw $t0, 20($s2) • Goal is to avoid structural and data hazards addu $t1, $t0, $t2 subu $s4, $s4, $t3 •Avoids need for compiler scheduling slti $t5, $s4, 20 • Though it may still help • Can start subu while addu is waiting for lw • Code semantics ensured by the CPU • Especially useful on cache misses; can execute many instructions while waiting! 8/31/2014 Summer 2014 - Lecture 23 13 8/31/2014 Summer 2014 - Lecture 23 14 Why Do Dynamic Scheduling? Speculation • “Guess” what to do with an instruction • Why not just let the compiler schedule code? • Start operation as soon as possible • Not all stalls are predicable • Check whether guess was right and roll back if necessary • e.g. cache misses • Examples: • Can’t always schedule around branches • Speculate on branch outcome (Branch Prediction) • Branch outcome is dynamically determined by I/O • Roll back if path taken is different • Different implementations of an ISA have different • Speculate on load latencies and hazards • Load into an internal register before instruction to minimize time waiting for memory • Forward compatibility and optimizations • Can be done in hardware or by compiler • Common to static and dynamic multiple issue 8/31/2014 Summer 2014 - Lecture 23 15 8/31/2014 Summer 2014 - Lecture 23 16 Not a Simple Linear Pipeline Out-of-Order Execution (1/2) 3 major units operating in parallel: Can also unroll loops in hardware • Instruction fetch and issue unit 1) Fetch instructions in program order (≤ 4/clock) • Issues instructions in program order • Many parallel functional (execution) units 2) Predict branches as taken/untaken • Each unit has an input buffer called a Reservation Station • Holds operands and records the operation • Can execute instructions out-of-order (OOO) 3) To avoid hazards on registers, rename registers • Commit unit using a set of internal registers (≈ 80 registers) • Saves results from functional units in Reorder Buffers • Stores results once branch resolved so OK to execute 4) Collection of renamed instructions might execute in • Commits results in program order a window (≈ 60 instructions) 8/31/2014 Summer 2014 - Lecture 23 17 8/31/2014 Summer 2014 - Lecture 23 18 3 7/31/2014 Out-of-Order Execution (2/2) Dynamically Scheduled CPU Preserves Branch prediction, dependencies 5) Execute instructions with ready operands in 1 of Register renaming multiple functional units (ALUs, FPUs, Ld/St) Wait here 6) Buffer results of executed instructions until until all predicted branches are resolved in reorder buffer operands available 7) If predicted branch correctly, commit results in program order Execute… Results also sent to any waiting 8) If predicted branch incorrectly, discard all reservation … and Hold dependent results and start with correct PC stations (think: Reorder buffer forwarding!) for register and Can supply memory writes operands for issued instructions 8/31/2014 Summer 2014 - Lecture 23 19 8/31/2014 Summer 2014 - Lecture 23 20 Out-Of-Order Intel Intel Nehalem Microarchitecture • All use O-O-O since 2001 Microprocessor Year Clock Rate Pipeline Issue Out-of-order/ Cores Power Stages width Speculation i486 1989 25 MHz 5 1 No 1 5W Pentium 1993 66 MHz 5 2 No 1 10W Pentium Pro 1997 200 MHz 10 3 Yes 1 29W P4 Willamette 2001 2000 MHz 22 3 Yes 1 75W P4 Prescott 2004 3600 MHz 31 3 Yes 1 103W Core 2 Conroe 2006 2930 MHz 12-14 4* Yes 2 75W Core 2 Penryn 2008 2930 MHz 12-14 4* Yes 4 95W Core i7 2010 3460 MHz 14 4* Yes 6 130W Westmere Xeon Sandy 2012 3100 MHz 14-19 4* Yes 8 150W Bridge Xeon Ivy Bridge 2014 2800 MHz 14-19 4* Yes 15 155W 8/31/2014 Summer 2014 - Lecture 23 21 8/31/2014 Summer 2014 - Lecture 23 22 Intel Nehalem Pipeline Flow Does Multiple Issue Work? • Yes, but not as much as we’d like • Programs have real dependencies that limit ILP • Some dependencies are hard to eliminate • e.g. pointer aliasing (restrict keyword helps) • Some parallelism is hard to expose • Limited window size during instruction issue • Memory delays and limited bandwidth • Hard to keep pipelines full • Speculation can help if done well 8/31/2014 Summer 2014 - Lecture 23 23 8/31/2014 Summer 2014 - Lecture 23 24 4 7/31/2014 Agenda Administrivia •Multiple Issue •HW5 due tonight •Administrivia •Project 2 (Performance Optimization) due •Virtual Memory Introduction Sunday •No lab today •TAs will be in lab to check-off make up labs • Highly encouraged to make up labs today if you’re behind – treated as Tuesday checkoff for lateness •Project 3 (Pipelined Processor in Logisim) released Friday/Saturday 8/31/2014 Summer 2014 - Lecture 23 25 8/31/2014 Summer 2014 - Lecture 23 26 Agenda Memory Hierarchy •Multiple Issue Upper Level Regs Faster •Administrivia Instr Operands •Virtual Memory Introduction Earlier: L1 Cache Caches Blocks L2 Cache Blocks Next Up: Memory Virtual Pages Memory Disk Files Larger Tape Lower Level 8/31/2014 Summer 2014 - Lecture 23 27 8/31/2014 Summer 2014 - Lecture 23 28 Memory Hierarchy Requirements Memory Hierarchy Requirements • Principle of Locality • Allow multiple processes to simultaneously occupy memory and provide • Allows caches to offer (close to) speed of cache memory with size of DRAM protection memory • Don’t let
Recommended publications
  • IEEE Paper Template in A4
    Vasantha.N.S. et al, International Journal of Computer Science and Mobile Computing, Vol.6 Issue.6, June- 2017, pg. 302-306 Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320–088X IMPACT FACTOR: 6.017 IJCSMC, Vol. 6, Issue. 6, June 2017, pg.302 – 306 Enhancing Performance in Multiple Execution Unit Architecture using Tomasulo Algorithm Vasantha.N.S.1, Meghana Kulkarni2 ¹Department of VLSI and Embedded Systems, Centre for PG Studies, VTU Belagavi, India ²Associate Professor Department of VLSI and Embedded Systems, Centre for PG Studies, VTU Belagavi, India 1 [email protected] ; 2 [email protected] Abstract— Tomasulo’s algorithm is a computer architecture hardware algorithm for dynamic scheduling of instructions that allows out-of-order execution, designed to efficiently utilize multiple execution units. It was developed by Robert Tomasulo at IBM. The major innovations of Tomasulo’s algorithm include register renaming in hardware. It also uses the concept of reservation stations for all execution units. A common data bus (CDB) on which computed values broadcast to all reservation stations that may need them is also present. The algorithm allows for improved parallel execution of instructions that would otherwise stall under the use of other earlier algorithms such as scoreboarding. Keywords— Reservation Station, Register renaming, common data bus, multiple execution unit, register file I. INTRODUCTION The instructions in any program may be executed in any of the 2 ways namely sequential order and the other is the data flow order. The sequential order is the one in which the instructions are executed one after the other but in reality this flow is very rare in programs.
    [Show full text]
  • Computer Science 246 Computer Architecture Spring 2010 Harvard University
    Computer Science 246 Computer Architecture Spring 2010 Harvard University Instructor: Prof. David Brooks [email protected] Dynamic Branch Prediction, Speculation, and Multiple Issue Computer Science 246 David Brooks Lecture Outline • Tomasulo’s Algorithm Review (3.1-3.3) • Pointer-Based Renaming (MIPS R10000) • Dynamic Branch Prediction (3.4) • Other Front-end Optimizations (3.5) – Branch Target Buffers/Return Address Stack Computer Science 246 David Brooks Tomasulo Review • Reservation Stations – Distribute RAW hazard detection – Renaming eliminates WAW hazards – Buffering values in Reservation Stations removes WARs – Tag match in CDB requires many associative compares • Common Data Bus – Achilles heal of Tomasulo – Multiple writebacks (multiple CDBs) expensive • Load/Store reordering – Load address compared with store address in store buffer Computer Science 246 David Brooks Tomasulo Organization From Mem FP Op FP Registers Queue Load Buffers Load1 Load2 Load3 Load4 Load5 Store Load6 Buffers Add1 Add2 Mult1 Add3 Mult2 Reservation To Mem Stations FP adders FP multipliers Common Data Bus (CDB) Tomasulo Review 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 LD F0, 0(R1) Iss M1 M2 M3 M4 M5 M6 M7 M8 Wb MUL F4, F0, F2 Iss Iss Iss Iss Iss Iss Iss Iss Iss Ex Ex Ex Ex Wb SD 0(R1), F0 Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss M1 M2 M3 Wb SUBI R1, R1, 8 Iss Ex Wb BNEZ R1, Loop Iss Ex Wb LD F0, 0(R1) Iss Iss Iss Iss M Wb MUL F4, F0, F2 Iss Iss Iss Iss Iss Ex Ex Ex Ex Wb SD 0(R1), F0 Iss Iss Iss Iss Iss Iss Iss Iss Iss M1 M2
    [Show full text]
  • Dynamic Scheduling
    Dynamically-Scheduled Machines ! • In a Statically-Scheduled machine, the compiler schedules all instructions to avoid data-hazards: the ID unit may require that instructions can issue together without hazards, otherwise the ID unit inserts stalls until the hazards clear! • This section will deal with Dynamically-Scheduled machines, where hardware- based techniques are used to detect are remove avoidable data-hazards automatically, to allow ‘out-of-order’ execution, and improve performance! • Dynamic scheduling used in the Pentium III and 4, the AMD Athlon, the MIPS R10000, the SUN Ultra-SPARC III; the IBM Power chips, the IBM/Motorola PowerPC, the HP Alpha 21264, the Intel Dual-Core and Quad-Core processors! • In contrast, static multiple-issue with compiler-based scheduling is used in the Intel IA-64 Itanium architectures! • In 2007, the dual-core and quad-core Intel processors use the Pentium 5 family of dynamically-scheduled processors. The Itanium has had a hard time gaining market share.! Ch. 6, Advanced Pipelining-DYNAMIC, slide 1! © Ted Szymanski! Dynamic Scheduling - the Idea ! (class text - pg 168, 171, 5th ed.)" !DIVD ! !F0, F2, F4! ADDD ! !F10, F0, F8 !- data hazard, stall issue for 23 cc! SUBD ! !F12, F8, F14 !- SUBD inherits 23 cc of stalls! • ADDD depends on DIVD; in a static scheduled machine, the ID unit detects the hazard and causes the basic pipeline to stall for 23 cc! • The SUBD instruction cannot execute because the pipeline has stalled, even though SUBD does not logically depend upon either previous instruction! • suppose the machine architecture was re-organized to let the SUBD and subsequent instructions “bypass” the previous stalled instruction (the ADDD) and proceed with its execution -> we would allow “out-of-order” execution! • however, out-of-order execution would allow out-of-order completion, which may allow RW (Read-Write) and WW (Write-Write) data hazards ! • a RW and WW hazard occurs when the reads/writes complete in the wrong order, destroying the data.
    [Show full text]
  • United States Patent (19) 11 Patent Number: 5,680,565 Glew Et Al
    USOO568.0565A United States Patent (19) 11 Patent Number: 5,680,565 Glew et al. 45 Date of Patent: Oct. 21, 1997 (54) METHOD AND APPARATUS FOR Diefendorff, "Organization of the Motorola 88110 Super PERFORMING PAGE TABLE WALKS INA scalar RISC Microprocessor.” IEEE Micro, Apr. 1996, pp. MCROPROCESSOR CAPABLE OF 40-63. PROCESSING SPECULATIVE Yeager, Kenneth C. "The MIPS R10000 Superscalar Micro INSTRUCTIONS processor.” IEEE Micro, Apr. 1996, pp. 28-40 Apr. 1996. 75 Inventors: Andy Glew, Hillsboro; Glenn Hinton; Smotherman et al. "Instruction Scheduling for the Motorola Haitham Akkary, both of Portland, all 88.110" Microarchitecture 1993 International Symposium, of Oreg. pp. 257-262. Circello et al., “The Motorola 68060 Microprocessor.” 73) Assignee: Intel Corporation, Santa Clara, Calif. COMPCON IEEE Comp. Soc. Int'l Conf., Spring 1993, pp. 73-78. 21 Appl. No.: 176,363 "Superscalar Microprocessor Design" by Mike Johnson, Advanced Micro Devices, Prentice Hall, 1991. 22 Filed: Dec. 30, 1993 Popescu, et al., "The Metaflow Architecture.” IEEE Micro, (51) Int. C. G06F 12/12 pp. 10-13 and 63–73, Jun. 1991. 52 U.S. Cl. ........................... 395/415: 395/800; 395/383 Primary Examiner-David K. Moore 58) Field of Search .................................. 395/400, 375, Assistant Examiner-Kevin Verbrugge 395/800, 414, 415, 421.03 Attorney, Agent, or Firm-Blakely, Sokoloff, Taylor & 56 References Cited Zafiman U.S. PATENT DOCUMENTS 57 ABSTRACT 5,136,697 8/1992 Johnson .................................. 395/586 A page table walk is performed in response to a data 5,226,126 7/1993 McFarland et al. 395/.394 translation lookaside buffer miss based on a speculative 5,230,068 7/1993 Van Dyke et al.
    [Show full text]
  • Identifying Bottlenecks in a Multithreaded Superscalar
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by KITopen Identifying Bottlenecks in a Multithreaded Sup erscalar Micropro cessor Ulrich Sigmund and Theo Ungerer VIONA DevelopmentGmbH Karlstr D Karlsruhe Germany University of Karlsruhe Dept of Computer Design and Fault Tolerance D Karlsruhe Germany Abstract This pap er presents a multithreaded sup erscalar pro cessor that p ermits several threads to issue instructions to the execution units of a wide sup erscalar pro cessor in a single cycle Instructions can simul taneously b e issued from up to threads with a total issue bandwidth of instructions p er cycle Our results show that the threaded issue pro cessor reaches a throughput of instructions p er cycle Intro duction Current micropro cessors utilize instructionlevel parallelism by a deep pro cessor pip eline and by the sup erscalar technique that issues up to four instructions p er cycle from a single thread VLSItechnology will allow future generations of micropro cessors to exploit instructionlevel parallelism up to instructions p er cycle or more However the instructionlevel parallelism found in a conventional instruction stream is limited The solution is the additional utilization of more coarsegrained parallelism The main approaches are the multipro cessor chip and the multithreaded pro ces pro cessors on a sor The multipro cessor chip integrates two or more complete single chip Therefore every unit of a pro cessor is duplicated and used indep en dently of its copies on the chip
    [Show full text]
  • MBP4ASG41M-VS3.Pdf
    ASRock > G41M-VS3 Página 1 de 2 Home | Global / English [Change] About ASRock Products News Support Forum Download Awards Dealer Zone Where to Buy Products G41M-VS3 Motherboard Series »G41M-VS3 Translate »Overview & Specifications ■ Supports FSB1333/1066/800/533 MHz CPUs »Download ■ Supports Dual Channel DDR3 1333(OC) ■ Intel® Graphics Media Accelerator X4500, Pixel Shader 4.0, DirectX 10, Max. shared »Manual memory 1759MB »FAQ ■ EuP Ready »CPU Support List ■ Supports ASRock XFast RAM, XFast LAN, XFast USB Technologies ■ Supports Instant Boot, Instant Flash, OC DNA, ASRock OC Tuner (Up to 158% CPU »Memory QVL frequency increase) »Beta Zone ■ Supports Intelligent Energy Saver (Up to 20% CPU Power Saving) ■ Free Bundle : CyberLink DVD Suite - OEM and Trial; Creative Sound Blaster X-Fi MB - Trial This model may not be sold worldwide. Please contact your local dealer for the availability of this model in your region. Product Specifications General - LGA 775 for Intel® Core™ 2 Extreme / Core™ 2 Quad / Core™ 2 Duo / Pentium® Dual Core / Celeron® Dual Core / Celeron, supporting Penryn Quad Core Yorkfield and Dual Core Wolfdale processors - Supports FSB1333/1066/800/533 MHz CPU - Supports Hyper-Threading Technology - Supports Untied Overclocking Technology - Supports EM64T CPU - Northbridge: Intel® G41 Chipset - Southbridge: Intel® ICH7 - Dual Channel DDR3 memory technology - 2 x DDR3 DIMM slots - Supports DDR3 1333(OC)/1066/800 non-ECC, un-buffered memory Memory - Max. capacity of system memory: 8GB* *Due to the operating system limitation, the actual memory size may be less than 4GB for the reservation for system usage under Windows® 32-bit OS. For Windows® 64-bit OS with 64-bit CPU, there is no such limitation.
    [Show full text]
  • Advanced Processor Designs
    Advanced processor designs We’ve only scratched the surface of CPU design. Today we’ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. —The Motorola PowerPC, used in Apple computers and many embedded systems, is a good example of state-of-the-art RISC technologies. —The Intel Itanium is a more radical design intended for the higher-end systems market. April 9, 2003 ©2001-2003 Howard Huang 1 General ways to make CPUs faster You can improve the chip manufacturing technology. — Smaller CPUs can run at a higher clock rates, since electrons have to travel a shorter distance. Newer chips use a “0.13µm process,” and this will soon shrink down to 0.09µm. — Using different materials, such as copper instead of aluminum, can improve conductivity and also make things faster. You can also improve your CPU design, like we’ve been doing in CS232. — One of the most important ideas is parallel computation—executing several parts of a program at the same time. — Being able to execute more instructions at a time results in a higher instruction throughput, and a faster execution time. April 9, 2003 Advanced processor designs 2 Pipelining is parallelism We’ve already seen parallelism in detail! A pipelined processor executes many instructions at the same time. All modern processors use pipelining, because most programs will execute faster on a pipelined CPU without any programmer intervention. Today we’ll discuss some more advanced techniques to help you get the most out of your pipeline. April 9, 2003 Advanced processor designs 3 Motivation for some advanced ideas Our pipelined datapath only supports integer operations, and we assumed the ALU had about a 2ns delay.
    [Show full text]
  • Evolution of Microprocessor Performance
    EvolutionEvolution ofof MicroprocessorMicroprocessor PerformancePerformance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static & dynamic branch predication. Even with these improvements, the restriction of issuing a single instruction per cycle still limits the ideal CPI = 1 Multiple Issue (CPI <1) Multi-cycle Pipelined T = I x CPI x C (single issue) Superscalar/VLIW/SMT Original (2002) Intel Predictions 1 GHz ? 15 GHz to ???? GHz IPC CPI > 10 1.1-10 0.5 - 1.1 .35 - .5 (?) Source: John P. Chen, Intel Labs We next examine the two approaches to achieve a CPI < 1 by issuing multiple instructions per cycle: 4th Edition: Chapter 2.6-2.8 (3rd Edition: Chapter 3.6, 3.7, 4.3 • Superscalar CPUs • Very Long Instruction Word (VLIW) CPUs. Single-issue Processor = Scalar Processor EECC551 - Shaaban Instructions Per Cycle (IPC) = 1/CPI EECC551 - Shaaban #1 lec # 6 Fall 2007 10-2-2007 ParallelismParallelism inin MicroprocessorMicroprocessor VLSIVLSI GenerationsGenerations Bit-level parallelism Instruction-level Thread-level (?) (TLP) 100,000,000 (ILP) Multiple micro-operations Superscalar /VLIW per cycle Simultaneous Single-issue CPI <1 u Multithreading SMT: (multi-cycle non-pipelined) Pipelined e.g. Intel’s Hyper-threading 10,000,000 CPI =1 u uuu u u Chip-Multiprocessors (CMPs) u Not Pipelined R10000 e.g IBM Power 4, 5 CPI >> 1 uuuuuuu u AMD Athlon64 X2 u uuuuu Intel Pentium D u uuuuuuuu u u 1,000,000 u uu uPentium u u uu i80386 u i80286
    [Show full text]
  • Conroe and Allendale Electrical, Mechanical, and Thermal
    Intel® Xeon® Processor 5500 Series Datasheet, Volume 1 March 2009 Document Number: 321321-001 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The Intel® Xeon® Processor 5500 Series may contain design defects or errors known as errata which may cause the product to deviate from published specifications.Current characterized errata are available on request. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details. Over time processor numbers will increment based on changes in clock, speed, cache, FSB, or other features, and increments are not intended to represent proportional or quantitative increases in any particular feature.
    [Show full text]
  • Multiprocessing Contents
    Multiprocessing Contents 1 Multiprocessing 1 1.1 Pre-history .............................................. 1 1.2 Key topics ............................................... 1 1.2.1 Processor symmetry ...................................... 1 1.2.2 Instruction and data streams ................................. 1 1.2.3 Processor coupling ...................................... 2 1.2.4 Multiprocessor Communication Architecture ......................... 2 1.3 Flynn’s taxonomy ........................................... 2 1.3.1 SISD multiprocessing ..................................... 2 1.3.2 SIMD multiprocessing .................................... 2 1.3.3 MISD multiprocessing .................................... 3 1.3.4 MIMD multiprocessing .................................... 3 1.4 See also ................................................ 3 1.5 References ............................................... 3 2 Computer multitasking 5 2.1 Multiprogramming .......................................... 5 2.2 Cooperative multitasking ....................................... 6 2.3 Preemptive multitasking ....................................... 6 2.4 Real time ............................................... 7 2.5 Multithreading ............................................ 7 2.6 Memory protection .......................................... 7 2.7 Memory swapping .......................................... 7 2.8 Programming ............................................. 7 2.9 See also ................................................ 8 2.10 References .............................................
    [Show full text]
  • Computer Architecture Out-Of-Order Execution
    Computer Architecture Out-of-order Execution By Yoav Etsion With acknowledgement to Dan Tsafrir, Avi Mendelson, Lihu Rappoport, and Adi Yoaz 1 Computer Architecture 2013– Out-of-Order Execution The need for speed: Superscalar • Remember our goal: minimize CPU Time CPU Time = duration of clock cycle × CPI × IC • So far we have learned that in order to Minimize clock cycle ⇒ add more pipe stages Minimize CPI ⇒ utilize pipeline Minimize IC ⇒ change/improve the architecture • Why not make the pipeline deeper and deeper? Beyond some point, adding more pipe stages doesn’t help, because Control/data hazards increase, and become costlier • (Recall that in a pipelined CPU, CPI=1 only w/o hazards) • So what can we do next? Reduce the CPI by utilizing ILP (instruction level parallelism) We will need to duplicate HW for this purpose… 2 Computer Architecture 2013– Out-of-Order Execution A simple superscalar CPU • Duplicates the pipeline to accommodate ILP (IPC > 1) ILP=instruction-level parallelism • Note that duplicating HW in just one pipe stage doesn’t help e.g., when having 2 ALUs, the bottleneck moves to other stages IF ID EXE MEM WB • Conclusion: Getting IPC > 1 requires to fetch/decode/exe/retire >1 instruction per clock: IF ID EXE MEM WB 3 Computer Architecture 2013– Out-of-Order Execution Example: Pentium Processor • Pentium fetches & decodes 2 instructions per cycle • Before register file read, decide on pairing Can the two instructions be executed in parallel? (yes/no) u-pipe IF ID v-pipe • Pairing decision is based… On data
    [Show full text]
  • Intel® Core™ Microarchitecture • Wrap Up
    EW N IntelIntel®® CoreCore™™ MicroarchitectureMicroarchitecture MarchMarch 8,8, 20062006 Stephen L. Smith Bob Valentine Vice President Architect Digital Enterprise Group Intel Architecture Group Agenda • Multi-core Update and New Microarchitecture Level Set • New Intel® Core™ Microarchitecture • Wrap Up 2 Intel Multi-core Roadmap – Updates since Fall IDF 3 Ramping Multi-core Everywhere 4 All products and dates are preliminary and subject to change without notice. Refresher: What is Multi-Core? Two or more independent execution cores in the same processor Specific implementations will vary over time - driven by product implementation and manufacturing efficiencies • Best mix of product architecture and volume mfg capabilities – Architecture: Shared Caches vs. Independent Caches – Mfg capabilities: volume packaging technology • Designed to deliver performance, OEM and end user experience Single die (Monolithic) based processor Multi-Chip Processor Example: 90nm Pentium® D Example: Intel Core™ Duo Example: 65nm Pentium D Processor (Smithfield) Processor (Yonah) Processor (Presler) Core0 Core1 Core0 Core1 Core0 Core1 Front Side Bus Front Side Bus Front Side Bus *Not representative of actual die photos or relative size 5 Intel® Core™ Micro-architecture *Not representative of actual die photo or relative size 6 Intel Multi-core Roadmap 7 Intel Multi-core Roadmap 8 Intel® Core™ Microarchitecture Based Platforms Platform 2006 20072007 Caneland Platform (2007) MP Servers Tigerton (QC) (2007) Bensley Platform (Q2’06)/ Glidewell Platform (Q2’06) ) DP Servers/ Woodcrest (Q3’06) DP Workstation Clovertown (QC) (Q1’07) Kaylo Platform (Q3’06)/ Wyloway Platform (Q3 ’06) UP Servers/ Conroe (Q3’06) UP Workstation Kentsfield (QC) (Q1’07) Bridge Creek Platform (Mid’06) Desktop -Home Conroe (Q3’06) Kentsfield (QC) (Q1’07) Desktop -Office Averill Platform (Mid’06) Conroe (Q3’06) Mobile Client Napa Platform (Q1’06) Merom (2H’06) All products and dates are preliminary 9 Note: only Intel® Core™ microarchitecture QC refers to Quad-Core and subject to change without notice.
    [Show full text]