Intel Technology Platform for HPC Intel® Xeon® and Intel® Xeon Phi™ Processor Update

Total pages: 16

File type: PDF, size: 1020 KB

Intel technology platform for HPC: Intel® Xeon® and Intel® Xeon Phi™ processor update
Manel Fernández
Intel HPC Software Workshop Series 2016: HPC Code Modernization for Intel® Xeon and Xeon Phi™
February 17th 2016, Barcelona

Today's Intel solutions for HPC: the multi- and many-core era (slide 2)

                     Multi-core (Intel® Xeon®)             Many integrated core / MIC (Intel® Xeon Phi™)
Programming          C/C++/Fortran, OMP/MPI/Cilk+/TBB      C/C++/Fortran, OMP/MPI/Cilk+/TBB
Execution model      Bootable, native execution            PCIe coprocessor, native and offload execution
Cores/threads        Up to 18 cores, 3 GHz, 36 threads     Up to 61 cores, 1.2 GHz, 244 threads
Memory and peak DP   Up to 768 GB, 68 GB/s, 432 GFLOP/s    Up to 16 GB, 352 GB/s, 1.2 TFLOP/s
SIMD                 256-bit, FMA, gather (AVX2)           512-bit, FMA, gather/scatter, EMU (IMCI)
Target               General-purpose applications:         Highly parallel applications:
                     single-thread performance (ILP),      high parallelism (DLP, TLP),
                     memory capacity                       high memory bandwidth

Intel® Core™ architecture (slide 3)
[Die diagram: Core 1 ... Core n, shared L3 cache, memory controller (DDR3 or DDR4), clock & power unit, QPI, graphics]

Intel® Core™ architecture roadmap: the "Tick-Tock" model (slide 4)

Microarchitecture   Process   Year       New instructions
Nehalem             45nm      2008       SSE4.2
Westmere            32nm      2009       AES
Sandy Bridge        32nm      2011       AVX
Ivy Bridge          22nm      2012       RDRAND, etc.
Haswell             22nm      2013       AVX2
Broadwell           14nm      Sep 2014   ADX and 3 other new instructions
Skylake             14nm      2015       MPX, SGX, AVX-512 (Xeon only)

Tick = shrink (new process technology); tock = innovate (new microarchitecture).

Haswell execution unit overview (slide 6)
Unified reservation station feeding ports 0-7:
• Port 0: integer ALU & shift, FMA / FP multiply, vector int multiply, vector logicals, branch, divide, vector shifts
• Port 1: integer ALU & LEA, FMA / FP multiply, FP add, vector int ALU, vector logicals
• Ports 2 and 3: load & store address
• Port 4: store data
• Port 5: integer ALU & LEA, vector int ALU, vector logicals, vector shuffle
• Port 6: integer ALU & shift, branch
• Port 7: store address
Highlights:
• 2x FMA: doubles peak FLOPs; two FP multiplies benefit legacy code
• 4th integer ALU: great for integer workloads; frees ports 0 and 1 for vector work
• Additional branch unit: reduces port 0 conflicts; 2nd EU for branch-heavy code
• Additional AGU for stores: leaves ports 2 and 3 open for loads

Microarchitecture buffer sizes: extract more parallelism on every generation (slide 7)

                        Nehalem     Sandy Bridge   Haswell   Skylake
Out-of-order window     128         168            192       224
In-flight loads         48          64             72        72
In-flight stores        32          36             42        56
Scheduler entries       36          54             60        97
Integer register file   N/A         160            168       180
FP register file        N/A         144            168       168
Allocation queue        28/thread   28/thread      56        64/thread

Fused multiply-add (FMA) instruction; example: polynomial evaluation (slide 9)

Microarchitecture   Instruction set         SP FLOPs/cycle   DP FLOPs/cycle
Nehalem             SSE (128-bit)           8                4
Sandy Bridge        AVX (256-bit)           16               8
Haswell             AVX2 + FMA (256-bit)    32               16

2x peak FLOPs/cycle (throughput).

Latency (clocks)   Xeon E5 v2   Xeon E5 v3   Ratio (lower is better)
MulPS, PD          5            5
AddPS, PD          3            3
Mul+Add / FMA      8            5            0.625

>37% reduced latency (the 5-cycle FMA latency is the same as an FP multiply). FMA improves accuracy and performance for a commonly used class of algorithms.

Broadwell: 5th generation Intel® Core™ architecture (slide 10)
Microarchitecture changes. FP instruction performance improvements:
• Decreased latency and increased throughput for most divider (radix-1024) uops
• Pseudo-double bandwidth for scalar divider uops
• Vector multiply latency decreased from 5 to 3 cycles
STLB improvements:
• Native, 16-entry 1G STLB array
• Increased STLB size (from 1k to 1.5k entries)
• Two simultaneous page-miss walks enabled
Other ISA performance improvements:
• ADC, CMOV: 1-uop flow
• PCLMULQDQ: from 2 uops / 7 cycles to 1 uop / 5 cycles
• VCVTPS2PH (memory form): from 4 uops to 3 uops

Skylake: 6th generation Intel® Core™ architecture (slide 11)
Dedicated server and client IP configurations. Improved microarchitecture:
• Higher-capacity front end (up to 6 instructions/cycle)
• Improved branch predictor
• Deeper out-of-order buffers
• More execution units, shorter latencies
• Deeper store, fill, and write-back buffers
• Smarter prefetchers
• Improved page-miss handling
• Better L2 cache miss bandwidth
• Improved Hyper-Threading
• Performance/watt enhancements
New instructions supported:
• Software Guard Extensions (SGX)
• Memory Protection Extensions (MPX)
• AVX-512 (Xeon versions only)

Intel® Xeon® processor architecture (slide 12)
[Die diagram as on slide 3: Core 1 ... Core n, shared L3 cache, memory controller (DDR3 or DDR4), clock & power unit, QPI, graphics]

Intel® Xeon® processors and platforms (slide 13)
• Intel® Xeon® E3: E3-1xxx, 1 CPU/socket
• Intel® Xeon® E5: E5-1xxx, E5-2xxx (2x QPI), E5-4xxx
• Intel® Xeon® E7: E7-4xxx, E7-xxxx (3x QPI, >4S)
Each platform has its own memory and PCIe3 attach. E5-2600 v3 is "Haswell EP"; E5-2600 v4 is "Broadwell EP".

Intel® Xeon® E5-2600 v3 processor overview (slide 14)
22nm process ("tock"):
• PCI Express 3.0 (EP: 40 lanes), DMI
• Power management (new features): per-core P-state (PCPS), uncore frequency scaling (UFS), energy-efficient turbo (EET)
• Intel® Hyper-Threading Technology (2 threads/core), Intel® Turbo Boost Technology
• Memory: socket R3, 4x DDR4 channels at 1333/1600/1866/2133 MT/s
• Up to 18 cores; ~2.5 MB last-level cache per core, up to 45 MB total LLC
• Intel® AVX 2.0 / Haswell new instructions (HNI)
• Integrated voltage regulator (new feature)
• Intel® QuickPath Interconnect (x2) at 9.6 GT/s

Haswell EP die configurations (slide 15)
[Die floor plans for the three chop options: 14-18 core (HCC), 10-12 core (MCC), and 4-8 core (LCC) rings of cores with 2.5 MB LLC slices and CBo/SAD agents, QPI links, PCIe x16/x16/x8/x4 (ESI), UBox, PCU, CB DMA, IOAPIC, R3QPI/R2PCI, home agents, and DDR memory controllers]

Chop   Columns   Home agents   Cores   Power (W)   Transistors (B)   Die area (mm²)
HCC    4         2             14-18   110-145     5.69              662
MCC    3         2             6-12    65-160      3.84              492
LCC    2         1             4-8     55-140      2.60              354

Cluster on die (COD) mode (slide 16)
• Supported on 1S/2S HSW-EP SKUs with 2 home agents (10+ cores)
• In-memory directory bits and a directory cache are used on 2S to reduce coherence traffic and cache-to-cache transfer latencies
• Targeted at NUMA-optimized workloads where latency is more important than sharing across caching agents (CBo)
• Reduces average LLC hit and local memory latencies
• Each home agent sees most requests from a reduced set of threads, which can lead to higher memory bandwidth
• OS/VMM owns NUMA and process-affinity decisions
• Exported as 2 NUMA nodes

Intel® Xeon® E5-2600 v3 product family
Intel® Xeon® E5-2699 v3 (18C, 2.3GHz) vs.
Recommended publications
  • Enhancing Performance in Multiple Execution Unit Architecture Using Tomasulo Algorithm
    Vasantha N.S.¹ and Meghana Kulkarni², International Journal of Computer Science and Mobile Computing, Vol. 6, Issue 6, June 2017, pg. 302-306. ¹Department of VLSI and Embedded Systems, Centre for PG Studies, VTU Belagavi, India; ²Associate Professor, Department of VLSI and Embedded Systems, Centre for PG Studies, VTU Belagavi, India. Abstract: Tomasulo's algorithm is a computer-architecture hardware algorithm for dynamic scheduling of instructions that allows out-of-order execution, designed to efficiently utilize multiple execution units. It was developed by Robert Tomasulo at IBM. Its major innovations include register renaming in hardware and reservation stations for all execution units. A common data bus (CDB), on which computed values are broadcast to all reservation stations that may need them, is also present. The algorithm allows for improved parallel execution of instructions that would otherwise stall under earlier algorithms such as scoreboarding. Keywords: reservation station, register renaming, common data bus, multiple execution units, register file. I. Introduction: The instructions in a program may be executed in either of two orders, namely sequential order or data-flow order. In sequential order the instructions are executed one after the other, but in reality this flow is very rare in programs.
  • Computer Science 246 Computer Architecture Spring 2010 Harvard University
    Computer Science 246: Computer Architecture, Spring 2010, Harvard University. Instructor: Prof. David Brooks. Lecture: Dynamic Branch Prediction, Speculation, and Multiple Issue.
    Lecture outline:
    • Tomasulo's algorithm review (3.1-3.3)
    • Pointer-based renaming (MIPS R10000)
    • Dynamic branch prediction (3.4)
    • Other front-end optimizations (3.5): branch target buffers / return address stack
    Tomasulo review:
    • Reservation stations: distribute RAW hazard detection; renaming eliminates WAW hazards; buffering values in reservation stations removes WARs; tag match on the CDB requires many associative compares
    • Common data bus: the Achilles heel of Tomasulo; multiple writebacks (multiple CDBs) are expensive
    • Load/store reordering: load address compared with store addresses in the store buffer
    Tomasulo organization: FP op queue, load buffers (Load1-Load6), store buffers, reservation stations (Add1-Add3, Mult1-Mult2), FP adders and FP multipliers, all connected to the FP registers by the common data bus (CDB).
    Cycle-by-cycle example (cycles 1-20):
    LD F0, 0(R1)     Iss M1 M2 M3 M4 M5 M6 M7 M8 Wb
    MUL F4, F0, F2   Iss (x9) Ex Ex Ex Ex Wb
    SD 0(R1), F0     Iss (x13) M1 M2 M3 Wb
    SUBI R1, R1, 8   Iss Ex Wb
    BNEZ R1, Loop    Iss Ex Wb
    LD F0, 0(R1)     Iss (x4) M Wb
    MUL F4, F0, F2   Iss (x5) Ex Ex Ex Ex Wb
    SD 0(R1), F0     Iss (x9) M1 M2
  • Dynamic Scheduling
    Dynamically-Scheduled Machines
    • In a statically-scheduled machine, the compiler schedules all instructions to avoid data hazards: the ID unit may require that instructions can issue together without hazards, otherwise the ID unit inserts stalls until the hazards clear.
    • This section deals with dynamically-scheduled machines, where hardware-based techniques are used to detect and remove avoidable data hazards automatically, to allow out-of-order execution and improve performance.
    • Dynamic scheduling is used in the Pentium III and 4, the AMD Athlon, the MIPS R10000, the Sun UltraSPARC III, the IBM Power chips, the IBM/Motorola PowerPC, the HP Alpha 21264, and the Intel dual-core and quad-core processors.
    • In contrast, static multiple issue with compiler-based scheduling is used in the Intel IA-64 Itanium architectures. The Itanium has had a hard time gaining market share.
    Dynamic scheduling, the idea (class text, pp. 168, 171, 5th ed.):
    DIVD F0, F2, F4
    ADDD F10, F0, F8    (data hazard, stall issue for 23 cc)
    SUBD F12, F8, F14   (SUBD inherits 23 cc of stalls)
    • ADDD depends on DIVD; in a statically scheduled machine, the ID unit detects the hazard and stalls the basic pipeline for 23 clock cycles.
    • The SUBD instruction cannot execute because the pipeline has stalled, even though SUBD does not logically depend on either previous instruction.
    • Suppose the machine architecture were reorganized to let SUBD and subsequent instructions "bypass" the previous stalled instruction (the ADDD) and proceed with execution: we would allow out-of-order execution.
    • However, out-of-order execution allows out-of-order completion, which may introduce RW (read-write) and WW (write-write) data hazards.
    • An RW or WW hazard occurs when the reads/writes complete in the wrong order, destroying the data.
    (Ch. 6, Advanced Pipelining (Dynamic), © Ted Szymanski)
  • United States Patent (19) 11 Patent Number: 5,680,565 Glew Et Al
    U.S. Patent 5,680,565, Glew et al., Oct. 21, 1997: Method and Apparatus for Performing Page Table Walks in a Microprocessor Capable of Processing Speculative Instructions. Inventors: Andy Glew, Hillsboro; Glenn Hinton and Haitham Akkary, both of Portland, all of Oreg. Assignee: Intel Corporation, Santa Clara, Calif. Appl. No. 176,363, filed Dec. 30, 1993. Int. Cl. G06F 12/12; U.S. Cl. 395/415; 395/800; 395/383; field of search 395/400, 375, 800, 414, 415, 421.03.
    References cited include: Diefendorff, "Organization of the Motorola 88110 Superscalar RISC Microprocessor," IEEE Micro, Apr. 1996, pp. 40-63; Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, Apr. 1996, pp. 28-40; Smotherman et al., "Instruction Scheduling for the Motorola 88110," Microarchitecture 1993 International Symposium, pp. 257-262; Circello et al., "The Motorola 68060 Microprocessor," COMPCON IEEE Comp. Soc. Int'l Conf., Spring 1993, pp. 73-78; Johnson, Superscalar Microprocessor Design, Prentice Hall, 1991; Popescu et al., "The Metaflow Architecture," IEEE Micro, Jun. 1991, pp. 10-13 and 63-73; U.S. Patents 5,136,697 (Johnson, 8/1992), 5,226,126 (McFarland et al., 7/1993), 5,230,068 (Van Dyke et al., 7/1993).
    Abstract: A page table walk is performed in response to a data translation lookaside buffer miss based on a speculative
  • Identifying Bottlenecks in a Multithreaded Superscalar
    Identifying Bottlenecks in a Multithreaded Superscalar Microprocessor. Ulrich Sigmund (VIONA Development GmbH, Karlsruhe, Germany) and Theo Ungerer (University of Karlsruhe, Dept. of Computer Design and Fault Tolerance, Karlsruhe, Germany).
    Abstract: This paper presents a multithreaded superscalar processor that permits several threads to issue instructions to the execution units of a wide superscalar processor in a single cycle. Instructions can simultaneously be issued from up to ... threads with a total issue bandwidth of ... instructions per cycle. Our results show that the ...-threaded issue processor reaches a throughput of ... instructions per cycle.
    Introduction: Current microprocessors utilize instruction-level parallelism by a deep processor pipeline and by the superscalar technique that issues up to four instructions per cycle from a single thread. VLSI technology will allow future generations of microprocessors to exploit instruction-level parallelism up to ... instructions per cycle or more. However, the instruction-level parallelism found in a conventional instruction stream is limited. The solution is the additional utilization of more coarse-grained parallelism. The main approaches are the multiprocessor chip and the multithreaded processor. The multiprocessor chip integrates two or more complete processors on a single chip; therefore every unit of a processor is duplicated and used independently of its copies on the chip.
  • Advanced Processor Designs
    Advanced processor designs We’ve only scratched the surface of CPU design. Today we’ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. —The Motorola PowerPC, used in Apple computers and many embedded systems, is a good example of state-of-the-art RISC technologies. —The Intel Itanium is a more radical design intended for the higher-end systems market. April 9, 2003 ©2001-2003 Howard Huang 1 General ways to make CPUs faster You can improve the chip manufacturing technology. — Smaller CPUs can run at a higher clock rates, since electrons have to travel a shorter distance. Newer chips use a “0.13µm process,” and this will soon shrink down to 0.09µm. — Using different materials, such as copper instead of aluminum, can improve conductivity and also make things faster. You can also improve your CPU design, like we’ve been doing in CS232. — One of the most important ideas is parallel computation—executing several parts of a program at the same time. — Being able to execute more instructions at a time results in a higher instruction throughput, and a faster execution time. April 9, 2003 Advanced processor designs 2 Pipelining is parallelism We’ve already seen parallelism in detail! A pipelined processor executes many instructions at the same time. All modern processors use pipelining, because most programs will execute faster on a pipelined CPU without any programmer intervention. Today we’ll discuss some more advanced techniques to help you get the most out of your pipeline. April 9, 2003 Advanced processor designs 3 Motivation for some advanced ideas Our pipelined datapath only supports integer operations, and we assumed the ALU had about a 2ns delay.
  • Computer Architecture Out-Of-Order Execution
    Computer Architecture: Out-of-order Execution. By Yoav Etsion, with acknowledgement to Dan Tsafrir, Avi Mendelson, Lihu Rappoport, and Adi Yoaz. (Computer Architecture 2013)
    The need for speed: superscalar
    • Remember our goal: minimize CPU time. CPU time = duration of clock cycle × CPI × IC.
    • So far we have learned: to minimize the clock cycle, add more pipe stages; to minimize CPI, utilize the pipeline; to minimize IC, change/improve the architecture.
    • Why not make the pipeline deeper and deeper? Beyond some point, adding more pipe stages doesn't help, because control/data hazards increase and become costlier. (Recall that in a pipelined CPU, CPI = 1 only without hazards.)
    • So what can we do next? Reduce the CPI by utilizing ILP (instruction-level parallelism). We will need to duplicate hardware for this purpose.
    A simple superscalar CPU
    • Duplicates the pipeline to accommodate ILP (IPC > 1).
    • Note that duplicating hardware in just one pipe stage doesn't help; e.g., with 2 ALUs, the bottleneck moves to other stages (IF ID EXE MEM WB).
    • Conclusion: getting IPC > 1 requires fetching/decoding/executing/retiring more than 1 instruction per clock.
    Example: Pentium processor
    • The Pentium fetches and decodes 2 instructions per cycle (u-pipe and v-pipe).
    • Before the register file read, it decides on pairing: can the two instructions be executed in parallel? (yes/no)
    • The pairing decision is based... on data
  • CSE 586 Computer Architecture Lecture 4
    CSE 586 Computer Architecture, Lecture 4. Jean-Loup Baer. http://www.cs.washington.edu/education/courses/586/00sp
    Highlights from last week:
    • ILP: where the compiler can optimize: loop unrolling and software pipelining; speculative execution (we'll see predication today)
    • ILP: dynamic scheduling in a single-issue machine: scoreboard; Tomasulo's algorithm
    Highlights (continued): Scoreboard
    • The scoreboard keeps a record of all data dependencies
    • The scoreboard keeps a record of all functional-unit occupancies
    • The scoreboard decides if an instruction can be issued
    • The scoreboard decides if an instruction can store its result
    • Implementation-wise, the scoreboard keeps track of which registers are used as sources and destinations and which functional units use them
    Highlights (continued): Tomasulo's algorithm
    • Decentralized control
    • Use of reservation stations to buffer and/or rename registers (hence gets rid of WAW and WAR hazards)
    • Results (and their names) are broadcast to reservation stations and the register file
    • Instructions are issued in order but can be dispatched, executed, and completed out of order
    Multiple-issue alternatives:
    • Superscalar (hardware detects conflicts): statically scheduled (in-order dispatch and hence execution; cf. DEC Alpha 21164) or dynamically scheduled (in-order issue, out-of-order dispatch and execution; cf. MIPS R10000, IBM PowerPC 620, and Intel Pentium ...)
    • Register renaming: avoids WAW and WAR hazards; it is performed at "decode" time to rename the result register
    • Two basic implementation schemes: have a separate physical register file, or use a reorder buffer and reservation stations (cf. ...)
  • Superscalar Instruction Dispatch with Exception Handling
    Superscalar Instruction Dispatch with Exception Handling: Design and Testing Report. Joseph McMahan, December 2013.
    Contents:
    1. Design: overview (the Tomasulo algorithm; two-phase clocking; resolving data hazards; software interface); high-level design; low-level design (buses; functional units; completion file; bus cycles; instruction decode and dispatch)
    2. Testing: load data / add; resource hazard with only one station available; multiply; resource hazard with two instructions of the same type; data hazards (RAW, WAW, WAR); memory hazard (WAR); resource hazard with no stations available; stall on full completion file; exception handling
    3. Code: Verilog modules rstation.v, reg.v, fetch.v, store.v, mem-unit.v, ID.v, alu.v, alu-unit.v, mult.v, mp-unit.v, cf.v
  • A Fast, Verified, Cross-Platform Cryptographic Provider
    EverCrypt: A Fast, Verified, Cross-Platform Cryptographic Provider. Jonathan Protzenko, Bryan Parno, Aymeric Fromherz, Chris Hawblitzel, Marina Polubelova, Karthikeyan Bhargavan, Benjamin Beurdouche, Joonwon Choi, Antoine Delignat-Lavaud, Cédric Fournet, Natalia Kulatova, Tahina Ramananandro, Aseem Rastogi, Nikhil Swamy, Christoph M. Wintersteiger, Santiago Zanella-Béguelin (Microsoft Research, Carnegie Mellon University, Inria, MIT).
    Abstract: We present EverCrypt: a comprehensive collection of verified, high-performance cryptographic functionalities available via a carefully designed API. The API provably supports agility (choosing between multiple algorithms for the same functionality) and multiplexing (choosing between multiple implementations of the same algorithm). Through abstraction and zero-cost generic programming, we show how agility can simplify verification without sacrificing performance, and we demonstrate how C and assembly can be composed and verified against shared specifications. We substantiate the effectiveness of these techniques with...
    From the introduction: ...prone (due in part to Intel and AMD reporting CPU features inconsistently [78]), with various cryptographic providers invoking illegal instructions on specific platforms [74], leading to killed processes and even crashing kernels. Since a cryptographic provider is the linchpin of most security-sensitive applications, its correctness and security are crucial.
  • MIPS Architecture with Tomasulo Algorithm [12]
    California State University Northridge: Tomasulo Architecture Based MIPS Processor. A graduate project submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering, by Sameer S. Pandit, May 2014. Committee: Dr. Ali Amini, Ph.D.; Dr. Shahnam Mirzaei, Ph.D.; Dr. Ramin Roosta, Ph.D. (Chair).
    Acknowledgement: I would like to express my gratitude towards all the members of the project committee and thank them for their continuous support and for mentoring me through every step toward the completion of this project: Dr. Roosta for his guidance, Dr. Shahnam for his ideas, and Dr. Ali Amini for his utmost support. I would also like to thank my family and friends for their love, care, and support through all the tough times during my graduation.
  • Extracting and Mapping Industry 4.0 Technologies Using Wikipedia
    Computers in Industry 100 (2018) 244-257. Extracting and Mapping Industry 4.0 Technologies Using Wikipedia. Filippo Chiarello (Department of Energy, Systems, Territory and Construction Engineering, University of Pisa), Leonello Trivelli (Department of Economics and Management, University of Pisa), Andrea Bonaccorsi (University of Pisa), Gualtiero Fantoni (Department of Mechanical, Nuclear and Production Engineering, University of Pisa).
    Keywords: Industry 4.0; digital industry; industrial IoT; big data; digital currency; programming languages; computing; embedded systems; IoT.
    Abstract: The explosion of interest in Industry 4.0 has generated hype in both academia and business: the former is attracted by the opportunities offered by the emergence of such a new field, while the latter is pulled by incentives and national investment plans. The Industry 4.0 technological field is not new, but it is highly heterogeneous (it is the aggregation point of more than 30 different fields of technology). For this reason, many stakeholders feel uncomfortable: they do not master the whole set of technologies, and they manifest a lack of knowledge and problems of communication with other domains. The problem is twofold: on one side, a common vocabulary that helps domain experts reach a mutual understanding is missing [Riel et al., 1]; on the other side, an overall standardization effort would be beneficial to integrate existing terminologies into a reference architecture for the Industry 4.0 paradigm [Smit et al.].