Simultaneous Multithreading (SMT)
• An evolutionary architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide-issue processors (superscalars).

• SMT has the potential of greatly enhancing computational capabilities by:
– Exploiting Thread-Level Parallelism (TLP) in a single processor core (chip-level TLP), simultaneously issuing, executing and retiring instructions from different threads during the same cycle.
• A single physical SMT processor core acts as a number of logical processors, each executing a single thread:
– Providing multiple hardware contexts, hardware thread scheduling and context switching capability.
– Providing effective long-latency hiding (e.g. FP operations, branch misprediction, memory latency).

EECC722 - Shaaban #1 Lec # 2 Fall 2012 9-3-2012

SMT Issues
• SMT CPU performance gain potential.
• Modifications to superscalar CPU architecture to support SMT.
• SMT performance evaluation vs. fine-grain multithreading, superscalar, chip multiprocessors.
• Hardware techniques to improve SMT performance (Ref. papers SMT-1, SMT-2):
– Optimal level one cache configuration for SMT.
– SMT thread instruction fetch, issue policies.

– Instruction recycling (reuse) of decoded instructions.
• Software techniques:
– Compiler optimizations for SMT. SMT-3
– Software-directed register deallocation.
– Operating system behavior and optimization. SMT-7
• SMT support for fine-grain synchronization. SMT-4
• SMT as a viable architecture for network processors.
• Current SMT implementation: Intel's Hyper-Threading (2-way SMT) and performance in compute-intensive workloads. SMT-8, SMT-9

Evolution of Microprocessors
General Purpose Processors (GPPs)

[Diagram: GPP evolution — multi-cycle, then pipelined (single issue), then multiple issue (CPI < 1): Superscalar/VLIW/SMT/CMP]

[Annotation: original (2002) Intel predictions — 1 GHz to 15 GHz to ???? GHz]


T = I x CPI x C
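The performance equation above (execution time = instruction count × cycles per instruction × cycle time) can be sanity-checked in a few lines. A minimal sketch; the function name is my own:

```python
def exec_time(instructions, cpi, cycle_time_s):
    """Performance equation: T = I x CPI x C."""
    return instructions * cpi * cycle_time_s

# 1 billion instructions at CPI = 1 on a 1 GHz clock (C = 1 ns) -> ~1 s;
# a 4-issue machine hitting its ideal CPI of 0.25 would take ~0.25 s.
t_scalar = exec_time(1e9, 1.0, 1e-9)
t_superscalar = exec_time(1e9, 0.25, 1e-9)
```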

Source: John P. Chen, Intel Labs. (Single-issue processor: IPC = 1/CPI.)

Microprocessor Frequency Trend
[Chart: clock frequency (MHz, log scale) and gate delays/clock vs. year, 1987-2005, for Intel, IBM PowerPC and DEC processors (386, 486, Pentium, Pentium II, 601/603/604, MPC750, 21064A-21264S, ...)]
Intel reality check: clock frequency scaling is slowing down! (Did silicon finally hit the wall?)
Why?
1- Power leakage
2- Clock distribution delays
Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle).
1. Frequency used to double each generation (2X per generation).
2. Number of gates/clock reduced by 25%.
3. Leads to deeper pipelines with more stages (e.g. Intel Pentium 4E has 30+ stages).
Possible solutions? Chip-level TLP:
- Exploit Thread-Level Parallelism (TLP) at the chip level (SMT/CMP).
- Utilize/integrate more-specialized computing elements other than GPPs.

T = I x CPI x C

Parallelism in Microprocessor VLSI Generations
[Chart: transistors per chip vs. year, 1970-2005 (i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, ...)]
• Bit-level parallelism → instruction-level parallelism (ILP) → thread-level parallelism (TLP?).
• Single-issue (multi-cycle, not pipelined): CPI >> 1. Pipelined: CPI = 1. Superscalar/VLIW, multiple micro-operations per cycle (AKA Operation-Level Parallelism): CPI < 1.
• Thread-level parallelism (chip-level parallel processing):
– Simultaneous Multithreading (SMT), e.g. Intel's Hyper-Threading.
– Chip-Multiprocessors (CMPs), e.g. IBM Power 4, 5; Intel Pentium D, Core 2; AMD Athlon 64 X2; Dual Core Opteron; Sun UltraSparc T1 (Niagara).
• TLP is even more important (vs. single thread) due to the slowing clock rate increase.
Improving generation performance by exploiting more levels of parallelism.

Microprocessor Architecture Trends
General Purpose Processor (GPP)
CISC machines: instructions take variable times to complete

RISC machines: simple instructions, optimized for speed

RISC machines (pipelined): same individual instruction latency, greater throughput through instruction "overlap"

Superscalar processors: multiple instructions executing simultaneously

CMPs — single-chip multiprocessors: duplicate entire processors, single- or multi-threaded (e.g. IBM Power 4/5; AMD X2, X3, X4; Intel Core 2).
Multithreaded processors: additional HW resources (regs, PC, SP) per context; each context gets the processor for x cycles.
VLIW: "superinstructions" grouped together; decreased HW control complexity.
Simultaneous Multithreading (SMT): multiple HW contexts (regs, PC, SP); each cycle, any context may execute (e.g. Intel's HyperThreading in the P4).
Chip-level TLP — SMT/CMPs: e.g. IBM Power5, 6, 7; Intel Pentium D, Nehalem (Core i7); Sun Niagara (UltraSparc T1).

CPU Architecture Evolution: Single-Threaded/Issue Pipeline

• Traditional 5-stage integer pipeline.
• Increases throughput; ideal CPI = 1.

[Diagram: single-threaded 5-stage pipeline — Fetch, Decode, Execute, Memory, Writeback — with PC, SP, register file, and memory hierarchy (management)]

CPU Architecture Evolution: Single-Threaded/Superscalar Architectures
• Fetch, issue, execute, etc. more than one instruction per cycle (CPI < 1).
• Limited by instruction-level parallelism (ILP), due to single-thread limitations.

[Diagram: single-threaded superscalar — parallel Fetch/Decode/Execute/Memory/Writeback pipelines for instructions i and i+1, sharing one PC, SP, and memory hierarchy (management)]

Hardware-Based Speculation
• Tomasulo's algorithm + in-order commit or retirement = speculative Tomasulo.
• The reorder buffer is a FIFO, usually implemented as a circular buffer: instructions from the Instruction Queue (IQ) issue in order, and the next instruction to commit is at its head.


[Figure: Speculative Tomasulo-based processor — 4th Edition: page 107 (3rd Edition: page 228)]

Four Steps of Speculative Tomasulo Algorithm
Stage 0: Instruction Fetch (IF) — no changes, in order.
1. Issue — (in order) Get an instruction from the Instruction Queue. If a reservation station and a reorder buffer slot are free, issue the instruction & send the operands & the reorder buffer number for the destination (this stage is sometimes called "dispatch").

2. Execution — (out of order) Operate on operands (EX); includes data memory reads. When both operands are ready then execute; if not ready, watch the CDB for the result; when both operands are in the reservation station, execute. Checks RAW hazards (sometimes called "issue").
3. Write result — (out of order) Finish execution (WB). No write to registers or memory in WB.

Write the result on the Common Data Bus (CDB) to all awaiting FUs (i.e. reservation stations) & the reorder buffer; mark the reservation station available. No WB for stores or branches.
4. Commit — (in order) Update registers/memory with the reorder buffer result.
– When an instruction is at the head of the reorder buffer & the result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. Successfully completed instructions write to registers and memory (stores) here.
– Mispredicted branch handling: a mispredicted branch at the head of the reorder buffer flushes the reorder buffer (cancels speculated instructions after the branch).
⇒ Instructions issue in order, execute (EX) and write result (WB) out of order, but must commit in order.
4th Edition: pages 106-108 (3rd Edition: pages 227-229)

Advanced CPU Architectures: VLIW — Intel/HP IA-64 Explicitly Parallel Instruction Computing (EPIC)
• Strengths:
– Allows for a high level of instruction parallelism (ILP).
– Takes a lot of the dependency analysis out of HW and places the focus on smart compilers.
• Weaknesses:
– Limited by instruction-level parallelism (ILP) in a single thread.
– Keeping Functional Units (FUs) busy (control hazards).
– Static FU scheduling limits performance gains.
– Resulting overall performance heavily depends on compiler performance.

Superscalar Architecture Limitations: Issue Slot Waste Classification
• Empty or wasted issue slots can be defined as either vertical waste or horizontal waste:
– Vertical waste is introduced when the processor issues no instructions in a cycle.
– Horizontal waste occurs when not all issue slots can be filled in a cycle.
(Why not 8-issue?)

Example: 4-issue superscalar — ideal IPC = 4, ideal CPI = 0.25 (also applies to VLIW). Instructions Per Cycle (IPC) = 1/CPI.

Result of issue slot waste: actual performance << peak performance.

Sources of Unused (Wasted) Issue Cycles in an 8-issue Superscalar Processor
Ideal Instructions Per Cycle: IPC = 8 (CPI = 1/8). Here the real IPC is about 1.5 (18.75% of the ideal IPC).

Average issue rate IPC = 1.5 instructions/cycle: real IPC << ideal IPC (1.5 << 8), i.e. ~81% of issue slots are wasted.

Processor busy represents the utilized issue slots; all others represent wasted issue slots.

61% of the wasted cycles are vertical waste, the remainder are horizontal waste.
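This vertical/horizontal accounting can be reproduced for any per-cycle issue trace. A sketch; the 8-cycle trace below is made up to match the 1.5-of-8 average and is not taken from the paper:

```python
def issue_slot_waste(issue_counts, width):
    """Classify wasted issue slots: vertical waste = whole cycles with no
    issue; horizontal waste = unfilled slots in partially used cycles."""
    total_slots = len(issue_counts) * width
    vertical = sum(width for n in issue_counts if n == 0)
    horizontal = sum(width - n for n in issue_counts if 0 < n < width)
    return vertical, horizontal, (vertical + horizontal) / total_slots

# 8-issue machine, 8 cycles, 12 instructions issued -> average IPC = 1.5;
# the wasted fraction comes out to 0.8125, matching the ~81% figure.
v, h, wasted = issue_slot_waste([0, 0, 0, 4, 0, 0, 4, 4], 8)
```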

Workload: SPEC92 suite. [SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Superscalar Architecture Limitations
All possible causes of wasted issue slots, and latency-hiding or latency-reducing techniques that can reduce the number of cycles wasted by each cause.

Main issue: one thread leads to limited ILP (cannot fill the issue slots). Solution: exploit Thread-Level Parallelism (TLP) within a single microprocessor chip:

How? Via one or both (AND/OR) of:
• Simultaneous Multithreaded (SMT) processor:
– The processor issues and executes instructions from a number of threads, creating a number of logical processors within a single physical processor.
– e.g. Intel's HyperThreading (HT): each physical processor executes instructions from two threads.
• Chip-Multiprocessors (CMPs):
– Integrate two or more complete processor cores on the same chip (die).
– Each core runs a different thread (or program).
– Limited ILP is still a problem in each core (solution: combine this approach with SMT).
[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Advanced CPU Architectures: Single-Chip Multiprocessors

(CMPs) — AKA Multi-Core Processors
• Strengths:
– Create a single processor block and duplicate it.
– Exploits Thread-Level Parallelism at the chip level.
– Takes a lot of the dependency analysis out of HW and places the focus on smart compilers.
• Weaknesses:
– Performance within each processor is still limited by individual thread performance (ILP).
– High power requirements using current VLSI processes:
• Almost entire processor cores are replicated on chip.
• May run at lower clock rates to reduce heat/power consumption.
e.g. IBM Power 4/5; Intel Pentium D, Core Duo, Core 2 (Conroe), Core i7; AMD Athlon 64 X2, X3, X4, Dual/Quad Core Opteron; Sun UltraSparc T1 (Niagara) …

Advanced CPU Architectures: Single Chip Multiprocessor (CMP)

[Diagram: CMP with n cores — n two-way (or 4-way) superscalar pipelines, each with its own register file, control unit, PC and SP, sharing one memory hierarchy (management)]

Current Dual-Core Chip-Multiprocessor (CMP) Architectures
• Single die with a shared L2 (or L3) cache.
• Single die with private caches and a shared system interface.
• Two dies in a shared package with private caches and private system interfaces.

• Shared cache: cores communicate using the shared L2 (or L3) cache, e.g. via an on-chip crossbar (lowest communication latency). Examples: IBM POWER4/5, Intel Pentium Core Duo (Yonah), Conroe, Sun UltraSparc T1 (Niagara), Quad Core AMD K10 (shared L3 cache).
• Shared system interface: cores with private caches communicate using on-chip interconnects. Examples: AMD Dual Core Opteron, Athlon 64 X2, Intel Itanium2 (Montecito).
• Shared package: cores communicate over the external Front Side Bus (FSB) (highest communication latency). Example: Intel Pentium D.
Source: Real World Technologies, http://www.realworldtech.com/page.cfm?ArticleID=RWT101405234615

Advanced CPU Architectures: Fine-grained or Traditional Multithreaded Processors

• Multiple hardware contexts (PC, SP, and registers).
• Only one context or thread issues instructions each cycle.
• Performance limited by Instruction-Level Parallelism (ILP) within each individual thread:
– Can reduce some of the vertical issue slot waste.
– No reduction in horizontal issue slot waste.
• Example architecture: The Tera Computer System
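The vertical-vs-horizontal point can be made concrete with a toy scheduler comparing fine-grain rotation against SMT-style slot sharing. This is a sketch under simplified assumptions (fixed per-thread ILP, round-robin fine-grain rotation, greedy SMT slot filling); all names are mine:

```python
def cycles_needed(num_threads, insts_per_thread, ilp, width, smt):
    """Cycles to drain equal per-thread workloads on a `width`-wide core
    when each thread can offer at most `ilp` instructions per cycle."""
    remaining = [insts_per_thread] * num_threads
    cycles = 0
    while any(remaining):
        cycles += 1
        if smt:   # SMT: all threads compete for the issue slots each cycle
            slots = width
            for t in range(num_threads):
                take = min(ilp, remaining[t], slots)
                remaining[t] -= take
                slots -= take
        else:     # fine-grain: exactly one thread owns the whole cycle
            t = cycles % num_threads
            remaining[t] -= min(ilp, remaining[t], width)
    return cycles

# 4 threads x 8 instructions, per-thread ILP of 2, on a 4-wide core:
# SMT fills all 4 slots every cycle, while fine-grain leaves 2 slots
# empty each cycle (horizontal waste) and takes twice as long.
```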

Fine-grain or Traditional Multithreaded Processors: The Tera (Cray) Computer System
• The Tera computer system is a multiprocessor that can accommodate up to 256 processors.

• Each Tera processor is fine-grain multithreaded:
– Each processor can issue one 3-operation Long Instruction Word (LIW) every 3 ns cycle (333 MHz), from one thread at a time, from among as many as 128 distinct instruction streams (hardware threads), thereby hiding up to 128 cycles (384 ns) of memory latency.
– In addition, each stream can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor.
– A stream implements a load/store architecture with three addressing modes and 31 general-purpose 64-bit registers.
– The instructions are 64 bits wide and can contain three operations: a memory reference operation (M-unit operation, or simply M-op for short), an arithmetic or logical operation (A-op), and a branch or simple arithmetic or logical operation (C-op).
Source: http://www.cscs.westminster.ac.uk/~seamang/PAR/tera_overview.html

SMT: Simultaneous Multithreading
• Multiple hardware contexts (or threads) running at the same time (HW context = thread state: ISA registers, PC, SP, etc.).
• A single physical SMT processor core acts (and reports to the operating system) as a number of logical processors, each executing a single thread.
• Reduces both horizontal and vertical waste by having multiple threads keep the functional units busy during every cycle.
• Builds on top of current time-proven advancements in CPU design: superscalar, dynamic scheduling, hardware speculation, dynamic HW branch prediction, multiple levels of cache, hardware pre-fetching, etc.
• Enabling technology: VLSI logic density on the order of hundreds of millions of transistors per chip.
– The potential performance gain is much greater than the increase in chip area and power consumption needed to support SMT.
• Improved performance / chip area / watt (computational efficiency) vs. single-threaded superscalar cores: a 2-way SMT processor needs only a 10-15% increase in area, vs.
~100% increase for a dual-core CMP.

SMT
• With multiple threads running, penalties from long-latency operations, cache misses, and branch mispredictions will be hidden: thus SMT is an effective long-latency-hiding technique.
– Reduction of both horizontal and vertical waste, and thus an improved Instructions Issued Per Cycle (IPC) rate.
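The latency-hiding arithmetic works the same way here as in the Tera example earlier, whose figures make it concrete:

```python
# Tera processor: one LIW issued per 3 ns cycle (333 MHz), chosen from up
# to 128 hardware instruction streams. While one stream waits on memory,
# the remaining streams keep issuing, so the tolerated memory latency is:
cycle_ns = 3.0
streams = 128
hidden_latency_ns = streams * cycle_ns   # 128 cycles = 384 ns, as stated
```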

• Functional units are shared among all contexts during every cycle:
– More complicated register read and writeback stages.
• More threads issuing to functional units results in higher resource utilization.
• CPU resources may have to be resized to accommodate the additional demands of the multiple threads running:
– e.g. caches, TLBs, branch prediction tables, rename registers.

(Context = hardware thread.)

SMT: Simultaneous Multithreading

[Diagram: one n-way SMT core — a modified out-of-order superscalar core with n hardware contexts (register file, PC and SP per context), a chip-wide control unit, and two-way superscalar pipelines sharing the memory hierarchy (management)]

The Power Of SMT
[Diagram: instruction issue slots over time (processor cycles) for a superscalar, a traditional (fine-grain) multithreaded, and a simultaneous multithreaded processor]


Superscalar vs. traditional (fine-grain) multithreading vs. simultaneous multithreading: rows of squares represent instruction issue slots; a box with number x means an instruction issued from thread x; an empty box is a wasted slot.

SMT Performance Example

Inst | Code | Description | Functional unit
A | LUI R5,100 | R5 = 100 | Int ALU
B | FMUL F1,F2,F3 | F1 = F2 x F3 | FP ALU
C | ADD R4,R4,8 | R4 = R4 + 8 | Int ALU
D | MUL R3,R4,R5 | R3 = R4 x R5 | Int mul/div
E | LW R6,R4 | R6 = (R4) | Memory port
F | ADD R1,R2,R3 | R1 = R2 + R3 | Int ALU
G | NOT R7,R7 | R7 = !R7 | Int ALU
H | FADD F4,F1,F2 | F4 = F1 + F2 | FP ALU
I | XOR R8,R1,R7 | R8 = R1 XOR R7 | Int ALU
J | SUBI R2,R1,4 | R2 = R1 - 4 | Int ALU
K | SW ADDR,R2 | (ADDR) = R2 | Memory port

• 4 integer ALUs (1-cycle latency)
• 1 integer multiplier/divider (3-cycle latency)
• 3 memory ports (2-cycle latency, assume cache hit)
• 2 FP ALUs (5-cycle latency)
• Assume all functional units are fully pipelined
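The functional-unit mix can be written down as a small configuration table (the dictionary layout and names are mine) and used to check one dependence in the example's schedule:

```python
# Functional units assumed in the example (all fully pipelined):
FUNCTIONAL_UNITS = {
    "int_alu":  {"count": 4, "latency": 1},
    "int_mul":  {"count": 1, "latency": 3},   # integer multiplier/divider
    "mem_port": {"count": 3, "latency": 2},   # assuming cache hits
    "fp_alu":   {"count": 2, "latency": 5},
}

# MUL (D) issues in cycle 2; its result is ready 3 cycles later, so the
# dependent ADD (F) = R2 + R3 cannot issue before cycle 5.
f_earliest_issue = 2 + FUNCTIONAL_UNITS["int_mul"]["latency"]
```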

SMT Performance Example (continued)

Cycle | 4-issue (single-threaded) superscalar, slots 1-4 | 2-thread SMT, slots 1-4
1 | LUI (A), FMUL (B), ADD (C) | T1.LUI (A), T1.FMUL (B), T1.ADD (C), T2.LUI (A)
2 | MUL (D), LW (E) | T1.MUL (D), T1.LW (E), T2.FMUL (B), T2.ADD (C)
3 | — | T2.MUL (D), T2.LW (E)
4 | — | —
5 | ADD (F), NOT (G) | T1.ADD (F), T1.NOT (G)
6 | FADD (H), XOR (I), SUBI (J) | T1.FADD (H), T1.XOR (I), T1.SUBI (J), T2.ADD (F)
7 | SW (K) | T1.SW (K), T2.NOT (G), T2.FADD (H)
8 | — | T2.XOR (I), T2.SUBI (J)
9 | — | T2.SW (K)
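The throughput arithmetic from this schedule can be checked directly:

```python
# 11 instructions (A-K) per thread; the superscalar runs one thread in 7
# cycles, the 2-thread SMT runs both threads (22 instructions) in 9 cycles.
superscalar_ipc = 11 / 7              # ~1.57 IPC
smt_ipc = 22 / 9                      # ~2.44 IPC
speedup = smt_ipc / superscalar_ipc   # ~1.55x (ideal for 2 threads: 2x)
```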

• 2 additional cycles for SMT to complete the program (i.e. the 2nd thread).
• Throughput:
– Superscalar: 11 inst / 7 cycles = 1.57 IPC
– SMT: 22 inst / 9 cycles = 2.44 IPC
– SMT is 2.44/1.57 = 1.55 times faster than the superscalar for this example (ideal = 2).

Modifications to Superscalar CPUs to Support SMT

Necessary modifications (i.e. thread state):
• Multiple program counters (PCs) and ISA register sets, and some mechanism by which the fetch unit selects one each cycle (thread instruction fetch/issue policy).
• A separate return stack for each thread for predicting subroutine return destinations.
• Per-thread instruction issue/retirement, instruction queue flush, and trap mechanisms.
• A thread id with each branch target buffer entry to avoid predicting phantom branches.

Modifications to improve SMT performance (resize some hardware resources for performance):
• A larger rename register file, to support the logical registers for all threads plus additional registers for register renaming (may require additional pipeline stages).
• A higher available main memory fetch bandwidth may be required.
• A larger data TLB with more entries to compensate for the increased virtual-to-physical address translations.
• Improved cache organization to offset the cache performance degradation due to cache sharing among the threads and the resulting reduced locality.
– e.g. private per-thread vs. shared L1 cache.
[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

SMT Implementations
• Intel's implementation of Hyper-Threading (HT) Technology (2-thread SMT):
– Originally implemented in its NetBurst microarchitecture (P4 processor family).
– Current Hyper-Threading implementation: Intel's Nehalem (Core i7, introduced 4th quarter 2008): 2, 4 or 8 cores per chip, each 2-thread SMT (4-16 threads per chip).

• IBM POWER 5/6: dual cores, each 2-thread SMT.
• The Alpha EV8 (4-thread SMT), originally scheduled for production in 2001.
• A number of special-purpose processors targeted towards network processor (NP) applications.
• Sun UltraSparc T1 (Niagara): eight processor cores, each executing from 4 hardware threads (32 threads total).
– Actually not SMT but fine-grain multithreaded (each core issues one instruction from one thread per cycle).

• Current technology has the potential for 4-8 simultaneous threads per core (based on design complexity).

A Base SMT Hardware Architecture.

[Diagram: in-order front end (fetch/issue) feeding an out-of-order core — a modified superscalar / speculative Tomasulo engine]
[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Example SMT vs. Superscalar Pipeline


• The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines.
• Two extra pipeline stages are added for register read/write to account for the size increase of the register file.

[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Intel Hyper-Threaded (2-way SMT) P4 Processor Pipeline

[SMT-8] Source: Intel Technology Journal, Volume 6, Number 1, February 2002.

Intel P4 Out-of-order Execution Engine Detailed Pipeline

Hyper-Threaded (2-way SMT). [SMT-8] Source: Intel Technology Journal, Volume 6, Number 1, February 2002.

SMT Performance Comparison
• Instruction throughput (IPC) from simulations by Eggers et al. at the University of Washington, using both multiprogramming and parallel workloads (8-issue, up to 8 threads):

Multiprogramming workload (IPC):
Threads | Superscalar | Traditional Multithreading | SMT
1 | 2.7 | 2.6 | 3.1
2 | - | 3.3 | 3.5
4 | - | 3.6 | 5.7
8 | - | 2.8 | 6.2

Parallel workload (IPC) (MP = chip multiprocessor; Traditional Multithreading = fine-grained):
Threads | Superscalar | MP2 | MP4 | Traditional Multithreading | SMT
1 | 3.3 | 2.4 | 1.5 | 3.3 | 3.3
2 | - | 4.3 | 2.6 | 4.1 | 4.7
4 | - | - | 4.2 | 4.2 | 5.6
8 | - | - | - | 3.5 | 6.1

Multiprogramming workload = multiple single-threaded programs (multi-tasking). Parallel workload = a single multi-threaded program.

Possible Machine Models for an 8-way Multithreaded Processor
• The following machine models for a multithreaded CPU that can issue 8 instructions per cycle differ in how threads use issue slots and functional units:
• Fine-Grain Multithreading:
– Only one thread issues instructions each cycle, but it can use the entire issue width of the processor. This hides all sources of vertical waste, but does not hide horizontal waste.
• SM: Full Simultaneous Issue (i.e. SM: Eight Issue — the most complex):
– This is a completely flexible simultaneous multithreaded superscalar: all eight threads compete for each of the 8 issue slots each cycle. This is the least realistic model in terms of hardware complexity, but provides insight into the potential of simultaneous multithreading. The following models each represent restrictions to this scheme that decrease hardware complexity.
• SM: Single Issue, SM: Dual Issue, and SM: Four Issue:
– These three models limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle.
– For example, in an SM: Dual Issue processor, each thread can issue a maximum of 2 instructions per cycle; therefore, a minimum of 4 threads would be required to fill the 8 issue slots in one cycle.
• SM: Limited Connection (i.e. partition functional units among threads):
– Each hardware context is directly connected to exactly one of each type of functional unit.
– For example, if the hardware supports eight threads and there are four integer units, each integer unit could receive instructions from exactly two threads.
– The partitioning of functional units among threads is thus less dynamic than in the other models, but each functional unit is still shared (the critical factor in achieving high utilization).
[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Comparison of Multithreaded CPU Models Complexity

A comparison of key hardware complexity features of the various models (H = high complexity). The comparison takes into account:
– the number of ports needed for each register file,
– the dependence checking for a single thread to issue multiple instructions,
– the amount of forwarding logic,
– and the difficulty of scheduling issued instructions onto functional units.

[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Simultaneous vs. Fine-Grain Multithreading Performance

[Charts: instruction throughput (IPC) vs. number of threads; annotated values include 3.1 IPC, 4.8?, and 6.4 IPC. Workload: SPEC92]

Instruction throughput as a function of the number of threads. (a)-(c) show the throughput by thread priority for particular models, and (d) shows the total throughput for all threads for each of the six machine models. The lowest segment of each bar is the contribution of the highest-priority thread to the total throughput.
[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Simultaneous Multithreading (SM) vs. Single-Chip Multiprocessing (MP) — IPC

• Results for the multiprocessor (MP) vs. simultaneous multithreading (SM) comparisons. The multiprocessor always has one functional unit of each type per processor. In most cases the SM processor has the same total number of each FU type as the MP.
[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Impact of Level 1 Cache Sharing on SMT Performance
• Results for the simulated cache configurations, shown relative to the throughput (instructions per cycle) of the baseline 64s.64p (64K instruction cache shared, 64K data cache private — 8K per thread).
• Notation: the caches are specified as [total I cache size in KB][private or shared].[D cache size][private or shared]. For instance, 64p.64s has eight private 8 KB I caches and a shared 64 KB data cache.

Best overall performance of the configurations considered is achieved by 64s.64s (64K instruction cache shared, 64K data cache shared).
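The cache-configuration notation can be decoded mechanically; a sketch (function and field names are mine):

```python
def parse_cache_config(cfg, n_threads=8):
    """Decode '[I KB][s|p].[D KB][s|p]' notation, e.g. '64p.64s';
    private ('p') caches are split evenly among the hardware threads."""
    result = {}
    for name, spec in zip(("I", "D"), cfg.split(".")):
        size_kb, kind = int(spec[:-1]), spec[-1]
        result[name] = {
            "shared": kind == "s",
            "per_thread_kb": size_kb if kind == "s" else size_kb // n_threads,
        }
    return result

# '64p.64s': eight private 8 KB I-caches and one shared 64 KB D-cache.
```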

[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

The Impact of Increased Multithreading on Some Low-Level Metrics for the Base SMT Architecture

[Charts: IPC and low-level metrics (cache miss rates, renaming register usage, etc.) vs. number of threads]

So? Supporting more threads may lead to more demand on hardware resources (e.g. here the D-cache and I-cache miss rates increased substantially, and these resources thus need to be resized).
[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Possible SMT Thread Instruction Fetch Scheduling Policies
• Round Robin:
– Instructions from Thread 1, then Thread 2, then Thread 3, etc. (e.g. RR.1.8: each cycle one thread fetches up to eight instructions; RR.2.4: each cycle two threads fetch up to four instructions each).
• BR-Count:
– Give highest priority to those threads that are least likely to be on a wrong path, by counting branch instructions that are in the decode stage, the rename stage, and the instruction queues, favoring those with the fewest unresolved branches.
• MISS-Count:
– Give priority to those threads that have the fewest outstanding data cache misses.
• ICount:
– Highest priority assigned to the thread with the lowest number of instructions in the static portion of the pipeline (decode, rename, and the instruction queues).
• IQPOSN (Instruction Queue Position):
– Give lowest priority to those threads with instructions closest to the head of either the integer or floating-point instruction queues (the oldest instruction is at the head of the queue).
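The ICOUNT heuristic reduces, in essence, to sorting threads by their pre-issue instruction counts; a sketch (names are mine):

```python
def icount_pick(inflight, n=2):
    """ICOUNT-style selection: return the ids of the n threads with the
    fewest instructions in the decode/rename/instruction-queue stages."""
    return sorted(range(len(inflight)), key=lambda t: inflight[t])[:n]

# With per-thread pre-issue counts [12, 3, 7, 9], an ICOUNT.2.x policy
# would fetch from threads 1 and 2 this cycle.
```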

[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Instruction Throughput For Round Robin Instruction Fetch Scheduling

[Chart: IPC vs. number of threads for round-robin fetch variants; RR.2.8 peaks at IPC = 4.2 — the RR variant with the best performance]
Best overall instruction throughput is achieved using round robin RR.2.8 (in each cycle two threads each fetch a block of 8 instructions).

Workload: SPEC92. [SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Instruction Throughput & Thread Fetch Policy: ICOUNT.2.8

All other fetch heuristics provide a speedup over round robin; the instruction-count policy ICOUNT.2.8 provides the most improvement: 5.3 instructions/cycle vs. 2.5 for the unmodified superscalar. Workload: SPEC92.
(ICOUNT: highest priority assigned to the thread with the lowest number of instructions in the static portion of the pipeline — decode, rename, and the instruction queues.)
[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Low-Level Metrics For Round Robin 2.8, ICOUNT 2.8

[Charts: low-level metrics, including renaming register usage, for RR.2.8 vs. ICOUNT.2.8]

ICOUNT improves on the performance of round robin by 23%, by reducing Instruction Queue (IQ) clog through selecting a better mix of instructions to queue. [SMT-2]

Possible SMT Instruction Issue Policies
• OLDEST FIRST: Issue the oldest instructions (those deepest into the instruction queue; the default).
• OPT LAST and SPEC LAST: Issue optimistic and speculative instructions after all others have been issued.
• BRANCH FIRST: Issue branches as early as possible in order to identify mispredicted branches quickly.

Instruction issue bandwidth is not a bottleneck in SMT as shown above

[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202. (The ICOUNT.2.8 fetch policy is used for all issue policies above.)

SMT: Simultaneous Multithreading
• Strengths:
– Overcomes the limitations imposed by low single-thread instruction-level parallelism.
– Resource-efficient support of chip-level TLP.
– Multiple threads running will hide individual control hazards (i.e. branch mispredictions) and other long latencies (i.e. main memory access latency on a cache miss).
• Weaknesses:
– Additional stress placed on the memory hierarchy.
– Complexity.
– Sizing of resources (caches, branch prediction, TLBs, etc.).
– Accessing registers (32 integer + 32 FP for each HW context):
• Some designs devote two clock cycles each for register reads and register writes (deeper pipeline).
