Simultaneous Multithreading (SMT)
• An evolutionary architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide-issue processors (superscalars).

• SMT has the potential of greatly enhancing computational capabilities by:
– Exploiting Thread-Level Parallelism (TLP) in a single processor core (chip-level TLP), simultaneously issuing, executing and retiring instructions from different threads during the same cycle.
• A single physical SMT processor core acts as a number of logical processors, each executing a single thread:
– Providing multiple hardware contexts, hardware thread scheduling and context switching capability.
– Providing effective long-latency hiding (e.g. FP operations, branch misprediction, memory latency).

EECC722 - Shaaban #1 Lec # 2 Fall 2012 9-3-2012

SMT Issues
• SMT CPU performance gain potential.
• Modifications to superscalar CPU architecture to support SMT.
• SMT performance evaluation vs. fine-grain multithreading, superscalar, chip multiprocessors.
• Hardware techniques to improve SMT performance (Ref. papers SMT-1, SMT-2):
– Optimal level one cache configuration for SMT.
– SMT thread instruction fetch, issue policies.

– Instruction recycling (reuse) of decoded instructions.
• Software techniques:
– Compiler optimizations for SMT. SMT-3
– Software-directed register deallocation.
– Operating system behavior and optimization. SMT-7
• SMT support for fine-grain synchronization. SMT-4
• SMT as a viable architecture for network processors.
• Current SMT implementation: Intel's Hyper-Threading (2-way SMT) and performance in compute-intensive workloads. SMT-8, SMT-9

Evolution of Microprocessors
General Purpose Processors (GPPs)

[Diagram: GPP evolution — multi-cycle, then pipelined (single issue), then multiple issue (CPI < 1): Superscalar/VLIW/SMT/CMP]

[Annotation: original (2002) Intel predictions — 1 GHz to 15 GHz to ???? GHz]


T = I x CPI x C
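The performance equation above (execution time = instruction count × cycles per instruction × cycle time) can be sanity-checked in a few lines. A minimal sketch; the function name is my own:

```python
def exec_time(instructions, cpi, cycle_time_s):
    """Performance equation: T = I x CPI x C."""
    return instructions * cpi * cycle_time_s

# 1 billion instructions at CPI = 1 on a 1 GHz clock (C = 1 ns) -> ~1 s;
# a 4-issue machine hitting its ideal CPI of 0.25 would take ~0.25 s.
t_scalar = exec_time(1e9, 1.0, 1e-9)
t_superscalar = exec_time(1e9, 0.25, 1e-9)
```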

Source: John P. Chen, Intel Labs. (Single-issue processor: IPC = 1/CPI.)

Microprocessor Frequency Trend
[Chart: clock frequency (MHz, log scale) and gate delays/clock vs. year, 1987-2005, for Intel, IBM PowerPC and DEC processors (386, 486, Pentium, Pentium II, 601/603/604, MPC750, 21064A-21264S, ...)]
Intel reality check: clock frequency scaling is slowing down! (Did silicon finally hit the wall?)
Why?
1- Power leakage
2- Clock distribution delays
Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle).
1. Frequency used to double each generation (2X per generation).
2. Number of gates/clock reduced by 25%.
3. Leads to deeper pipelines with more stages (e.g. Intel Pentium 4E has 30+ stages).
Possible solutions? Chip-level TLP:
- Exploit Thread-Level Parallelism (TLP) at the chip level (SMT/CMP).
- Utilize/integrate more-specialized computing elements other than GPPs.

T = I x CPI x C

Parallelism in Microprocessor VLSI Generations
[Chart: transistors per chip vs. year, 1970-2005 (i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, ...)]
• Bit-level parallelism → instruction-level parallelism (ILP) → thread-level parallelism (TLP?).
• Single-issue (multi-cycle, not pipelined): CPI >> 1. Pipelined: CPI = 1. Superscalar/VLIW, multiple micro-operations per cycle (AKA Operation-Level Parallelism): CPI < 1.
• Thread-level parallelism (chip-level parallel processing):
– Simultaneous Multithreading (SMT), e.g. Intel's Hyper-Threading.
– Chip-Multiprocessors (CMPs), e.g. IBM Power 4, 5; Intel Pentium D, Core 2; AMD Athlon 64 X2; Dual Core Opteron; Sun UltraSparc T1 (Niagara).
• TLP is even more important (vs. single thread) due to the slowing clock rate increase.
Improving generation performance by exploiting more levels of parallelism.

Microprocessor Architecture Trends
General Purpose Processor (GPP)
CISC machines: instructions take variable times to complete

RISC machines: simple instructions, optimized for speed

RISC machines (pipelined): same individual instruction latency, greater throughput through instruction "overlap"

Superscalar processors: multiple instructions executing simultaneously

CMPs — single-chip multiprocessors: duplicate entire processors, single- or multi-threaded (e.g. IBM Power 4/5; AMD X2, X3, X4; Intel Core 2).
Multithreaded processors: additional HW resources (regs, PC, SP) per context; each context gets the processor for x cycles.
VLIW: "superinstructions" grouped together; decreased HW control complexity.
Simultaneous Multithreading (SMT): multiple HW contexts (regs, PC, SP); each cycle, any context may execute (e.g. Intel's HyperThreading in the P4).
Chip-level TLP — SMT/CMPs: e.g. IBM Power5, 6, 7; Intel Pentium D, Nehalem (Core i7); Sun Niagara (UltraSparc T1).

CPU Architecture Evolution: Single-Threaded/Issue Pipeline

• Traditional 5-stage integer pipeline.
• Increases throughput; ideal CPI = 1.

[Diagram: single-threaded 5-stage pipeline — Fetch, Decode, Execute, Memory, Writeback — with PC, SP, register file, and memory hierarchy (management)]

CPU Architecture Evolution: Single-Threaded/Superscalar Architectures
• Fetch, issue, execute, etc. more than one instruction per cycle (CPI < 1).
• Limited by instruction-level parallelism (ILP), due to single-thread limitations.

[Diagram: single-threaded superscalar — parallel Fetch/Decode/Execute/Memory/Writeback pipelines for instructions i and i+1, sharing one PC, SP, and memory hierarchy (management)]

Hardware-Based Speculation
• Tomasulo's algorithm + in-order commit or retirement = speculative Tomasulo.
• The reorder buffer is a FIFO, usually implemented as a circular buffer: instructions from the Instruction Queue (IQ) issue in order, and the next instruction to commit is at its head.


[Figure: Speculative Tomasulo-based processor — 4th Edition: page 107 (3rd Edition: page 228)]

Four Steps of Speculative Tomasulo Algorithm
Stage 0: Instruction Fetch (IF) — no changes, in order.
1. Issue — (in order) Get an instruction from the Instruction Queue. If a reservation station and a reorder buffer slot are free, issue the instruction & send the operands & the reorder buffer number for the destination (this stage is sometimes called "dispatch").

2. Execution — (out of order) Operate on operands (EX); includes data memory reads. When both operands are ready then execute; if not ready, watch the CDB for the result; when both operands are in the reservation station, execute. Checks RAW hazards (sometimes called "issue").
3. Write result — (out of order) Finish execution (WB). No write to registers or memory in WB.

Write the result on the Common Data Bus (CDB) to all awaiting FUs (i.e. reservation stations) & the reorder buffer; mark the reservation station available. No WB for stores or branches.
4. Commit — (in order) Update registers/memory with the reorder buffer result.
– When an instruction is at the head of the reorder buffer & the result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. Successfully completed instructions write to registers and memory (stores) here.
– Mispredicted branch handling: a mispredicted branch at the head of the reorder buffer flushes the reorder buffer (cancels speculated instructions after the branch).
⇒ Instructions issue in order, execute (EX) and write result (WB) out of order, but must commit in order.
4th Edition: pages 106-108 (3rd Edition: pages 227-229)

Advanced CPU Architectures: VLIW — Intel/HP IA-64 Explicitly Parallel Instruction Computing (EPIC)
• Strengths:
– Allows for a high level of instruction parallelism (ILP).
– Takes a lot of the dependency analysis out of HW and places the focus on smart compilers.
• Weaknesses:
– Limited by instruction-level parallelism (ILP) in a single thread.
– Keeping Functional Units (FUs) busy (control hazards).
– Static FU scheduling limits performance gains.
– Resulting overall performance heavily depends on compiler performance.

Superscalar Architecture Limitations: Issue Slot Waste Classification
• Empty or wasted issue slots can be defined as either vertical waste or horizontal waste:
– Vertical waste is introduced when the processor issues no instructions in a cycle.
– Horizontal waste occurs when not all issue slots can be filled in a cycle.
(Why not 8-issue?)

Example: 4-issue superscalar — ideal IPC = 4, ideal CPI = 0.25 (also applies to VLIW). Instructions Per Cycle (IPC) = 1/CPI.

Result of issue slot waste: actual performance << peak performance.

Sources of Unused (Wasted) Issue Cycles in an 8-issue Superscalar Processor
Ideal Instructions Per Cycle: IPC = 8 (CPI = 1/8). Here the real IPC is about 1.5 (18.75% of the ideal IPC).

Average issue rate IPC = 1.5 instructions/cycle: real IPC << ideal IPC (1.5 << 8), i.e. ~81% of issue slots are wasted.

Processor busy represents the utilized issue slots; all others represent wasted issue slots.

61% of the wasted cycles are vertical waste, the remainder are horizontal waste.
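This vertical/horizontal accounting can be reproduced for any per-cycle issue trace. A sketch; the 8-cycle trace below is made up to match the 1.5-of-8 average and is not taken from the paper:

```python
def issue_slot_waste(issue_counts, width):
    """Classify wasted issue slots: vertical waste = whole cycles with no
    issue; horizontal waste = unfilled slots in partially used cycles."""
    total_slots = len(issue_counts) * width
    vertical = sum(width for n in issue_counts if n == 0)
    horizontal = sum(width - n for n in issue_counts if 0 < n < width)
    return vertical, horizontal, (vertical + horizontal) / total_slots

# 8-issue machine, 8 cycles, 12 instructions issued -> average IPC = 1.5;
# the wasted fraction comes out to 0.8125, matching the ~81% figure.
v, h, wasted = issue_slot_waste([0, 0, 0, 4, 0, 0, 4, 4], 8)
```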

Workload: SPEC92 suite. [SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Superscalar Architecture Limitations
All possible causes of wasted issue slots, and latency-hiding or latency-reducing techniques that can reduce the number of cycles wasted by each cause.

Main issue: one thread leads to limited ILP (cannot fill the issue slots). Solution: exploit Thread-Level Parallelism (TLP) within a single microprocessor chip:

How? Via one or both (AND/OR) of:
• Simultaneous Multithreaded (SMT) processor:
– The processor issues and executes instructions from a number of threads, creating a number of logical processors within a single physical processor.
– e.g. Intel's HyperThreading (HT): each physical processor executes instructions from two threads.
• Chip-Multiprocessors (CMPs):
– Integrate two or more complete processor cores on the same chip (die).
– Each core runs a different thread (or program).
– Limited ILP is still a problem in each core (solution: combine this approach with SMT).
[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Advanced CPU Architectures: Single-Chip Multiprocessors

(CMPs) — AKA Multi-Core Processors
• Strengths:
– Create a single processor block and duplicate it.
– Exploits Thread-Level Parallelism at the chip level.
– Takes a lot of the dependency analysis out of HW and places the focus on smart compilers.
• Weaknesses:
– Performance within each processor is still limited by individual thread performance (ILP).
– High power requirements using current VLSI processes:
• Almost entire processor cores are replicated on chip.
• May run at lower clock rates to reduce heat/power consumption.
e.g. IBM Power 4/5; Intel Pentium D, Core Duo, Core 2 (Conroe), Core i7; AMD Athlon 64 X2, X3, X4, Dual/Quad Core Opteron; Sun UltraSparc T1 (Niagara) …

Advanced CPU Architectures: Single Chip Multiprocessor (CMP)

[Diagram: CMP with n cores — n two-way (or 4-way) superscalar pipelines, each with its own register file, control unit, PC and SP, sharing one memory hierarchy (management)]

Current Dual-Core Chip-Multiprocessor (CMP) Architectures
• Single die with a shared L2 (or L3) cache.
• Single die with private caches and a shared system interface.
• Two dies in a shared package with private caches and private system interfaces.

• Shared cache: cores communicate using the shared L2 (or L3) cache, e.g. via an on-chip crossbar (lowest communication latency). Examples: IBM POWER4/5, Intel Pentium Core Duo (Yonah), Conroe, Sun UltraSparc T1 (Niagara), Quad Core AMD K10 (shared L3 cache).
• Shared system interface: cores with private caches communicate using on-chip interconnects. Examples: AMD Dual Core Opteron, Athlon 64 X2, Intel Itanium2 (Montecito).
• Shared package: cores communicate over the external Front Side Bus (FSB) (highest communication latency). Example: Intel Pentium D.
Source: Real World Technologies, http://www.realworldtech.com/page.cfm?ArticleID=RWT101405234615

Advanced CPU Architectures: Fine-grained or Traditional Multithreaded Processors

• Multiple hardware contexts (PC, SP, and registers).
• Only one context or thread issues instructions each cycle.
• Performance limited by Instruction-Level Parallelism (ILP) within each individual thread:
– Can reduce some of the vertical issue slot waste.
– No reduction in horizontal issue slot waste.
• Example architecture: The Tera Computer System
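The vertical-vs-horizontal point can be made concrete with a toy scheduler comparing fine-grain rotation against SMT-style slot sharing. This is a sketch under simplified assumptions (fixed per-thread ILP, round-robin fine-grain rotation, greedy SMT slot filling); all names are mine:

```python
def cycles_needed(num_threads, insts_per_thread, ilp, width, smt):
    """Cycles to drain equal per-thread workloads on a `width`-wide core
    when each thread can offer at most `ilp` instructions per cycle."""
    remaining = [insts_per_thread] * num_threads
    cycles = 0
    while any(remaining):
        cycles += 1
        if smt:   # SMT: all threads compete for the issue slots each cycle
            slots = width
            for t in range(num_threads):
                take = min(ilp, remaining[t], slots)
                remaining[t] -= take
                slots -= take
        else:     # fine-grain: exactly one thread owns the whole cycle
            t = cycles % num_threads
            remaining[t] -= min(ilp, remaining[t], width)
    return cycles

# 4 threads x 8 instructions, per-thread ILP of 2, on a 4-wide core:
# SMT fills all 4 slots every cycle, while fine-grain leaves 2 slots
# empty each cycle (horizontal waste) and takes twice as long.
```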

Fine-grain or Traditional Multithreaded Processors: The Tera (Cray) Computer System
• The Tera computer system is a multiprocessor that can accommodate up to 256 processors.

• Each Tera processor is fine-grain multithreaded:
– Each processor can issue one 3-operation Long Instruction Word (LIW) every 3 ns cycle (333 MHz), from one thread at a time, from among as many as 128 distinct instruction streams (hardware threads), thereby hiding up to 128 cycles (384 ns) of memory latency.
– In addition, each stream can issue as many as eight memory references without waiting for earlier ones to finish, further augmenting the memory latency tolerance of the processor.
– A stream implements a load/store architecture with three addressing modes and 31 general-purpose 64-bit registers.
– The instructions are 64 bits wide and can contain three operations: a memory reference operation (M-unit operation, or simply M-op for short), an arithmetic or logical operation (A-op), and a branch or simple arithmetic or logical operation (C-op).
Source: http://www.cscs.westminster.ac.uk/~seamang/PAR/tera_overview.html

SMT: Simultaneous Multithreading
• Multiple hardware contexts (or threads) running at the same time (HW context = thread state: ISA registers, PC, SP, etc.).
• A single physical SMT processor core acts (and reports to the operating system) as a number of logical processors, each executing a single thread.
• Reduces both horizontal and vertical waste by having multiple threads keep the functional units busy during every cycle.
• Builds on top of current time-proven advancements in CPU design: superscalar, dynamic scheduling, hardware speculation, dynamic HW branch prediction, multiple levels of cache, hardware pre-fetching, etc.
• Enabling technology: VLSI logic density on the order of hundreds of millions of transistors per chip.
– The potential performance gain is much greater than the increase in chip area and power consumption needed to support SMT.
• Improved performance / chip area / watt (computational efficiency) vs. single-threaded superscalar cores: a 2-way SMT processor needs only a 10-15% increase in area, vs.
~100% increase for a dual-core CMP.

SMT
• With multiple threads running, penalties from long-latency operations, cache misses, and branch mispredictions will be hidden: thus SMT is an effective long-latency-hiding technique.
– Reduction of both horizontal and vertical waste, and thus an improved Instructions Issued Per Cycle (IPC) rate.
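The latency-hiding arithmetic works the same way here as in the Tera example earlier, whose figures make it concrete:

```python
# Tera processor: one LIW issued per 3 ns cycle (333 MHz), chosen from up
# to 128 hardware instruction streams. While one stream waits on memory,
# the remaining streams keep issuing, so the tolerated memory latency is:
cycle_ns = 3.0
streams = 128
hidden_latency_ns = streams * cycle_ns   # 128 cycles = 384 ns, as stated
```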

• Functional units are shared among all contexts during every cycle:
– More complicated register read and writeback stages.
• More threads issuing to functional units results in higher resource utilization.
• CPU resources may have to be resized to accommodate the additional demands of the multiple threads running:
– e.g. caches, TLBs, branch prediction tables, rename registers.

(Context = hardware thread.)

SMT: Simultaneous Multithreading

[Diagram: one n-way SMT core — a modified out-of-order superscalar core with n hardware contexts (register file, PC and SP per context), a chip-wide control unit, and two-way superscalar pipelines sharing the memory hierarchy (management)]

The Power Of SMT
[Diagram: instruction issue slots over time (processor cycles) for a superscalar, a traditional (fine-grain) multithreaded, and a simultaneous multithreaded processor]


Superscalar vs. traditional (fine-grain) multithreading vs. simultaneous multithreading: rows of squares represent instruction issue slots; a box with number x means an instruction issued from thread x; an empty box is a wasted slot.

SMT Performance Example

Inst | Code | Description | Functional unit
A | LUI R5,100 | R5 = 100 | Int ALU
B | FMUL F1,F2,F3 | F1 = F2 x F3 | FP ALU
C | ADD R4,R4,8 | R4 = R4 + 8 | Int ALU
D | MUL R3,R4,R5 | R3 = R4 x R5 | Int mul/div
E | LW R6,R4 | R6 = (R4) | Memory port
F | ADD R1,R2,R3 | R1 = R2 + R3 | Int ALU
G | NOT R7,R7 | R7 = !R7 | Int ALU
H | FADD F4,F1,F2 | F4 = F1 + F2 | FP ALU
I | XOR R8,R1,R7 | R8 = R1 XOR R7 | Int ALU
J | SUBI R2,R1,4 | R2 = R1 - 4 | Int ALU
K | SW ADDR,R2 | (ADDR) = R2 | Memory port

• 4 integer ALUs (1-cycle latency)
• 1 integer multiplier/divider (3-cycle latency)
• 3 memory ports (2-cycle latency, assume cache hit)
• 2 FP ALUs (5-cycle latency)
• Assume all functional units are fully pipelined
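The functional-unit mix can be written down as a small configuration table (the dictionary layout and names are mine) and used to check one dependence in the example's schedule:

```python
# Functional units assumed in the example (all fully pipelined):
FUNCTIONAL_UNITS = {
    "int_alu":  {"count": 4, "latency": 1},
    "int_mul":  {"count": 1, "latency": 3},   # integer multiplier/divider
    "mem_port": {"count": 3, "latency": 2},   # assuming cache hits
    "fp_alu":   {"count": 2, "latency": 5},
}

# MUL (D) issues in cycle 2; its result is ready 3 cycles later, so the
# dependent ADD (F) = R2 + R3 cannot issue before cycle 5.
f_earliest_issue = 2 + FUNCTIONAL_UNITS["int_mul"]["latency"]
```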

SMT Performance Example (continued)

Cycle | 4-issue (single-threaded) superscalar, slots 1-4 | 2-thread SMT, slots 1-4
1 | LUI (A), FMUL (B), ADD (C) | T1.LUI (A), T1.FMUL (B), T1.ADD (C), T2.LUI (A)
2 | MUL (D), LW (E) | T1.MUL (D), T1.LW (E), T2.FMUL (B), T2.ADD (C)
3 | — | T2.MUL (D), T2.LW (E)
4 | — | —
5 | ADD (F), NOT (G) | T1.ADD (F), T1.NOT (G)
6 | FADD (H), XOR (I), SUBI (J) | T1.FADD (H), T1.XOR (I), T1.SUBI (J), T2.ADD (F)
7 | SW (K) | T1.SW (K), T2.NOT (G), T2.FADD (H)
8 | — | T2.XOR (I), T2.SUBI (J)
9 | — | T2.SW (K)
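The throughput arithmetic from this schedule can be checked directly:

```python
# 11 instructions (A-K) per thread; the superscalar runs one thread in 7
# cycles, the 2-thread SMT runs both threads (22 instructions) in 9 cycles.
superscalar_ipc = 11 / 7              # ~1.57 IPC
smt_ipc = 22 / 9                      # ~2.44 IPC
speedup = smt_ipc / superscalar_ipc   # ~1.55x (ideal for 2 threads: 2x)
```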

• 2 additional cycles for SMT to complete the program (i.e. the 2nd thread).
• Throughput:
– Superscalar: 11 inst / 7 cycles = 1.57 IPC
– SMT: 22 inst / 9 cycles = 2.44 IPC
– SMT is 2.44/1.57 = 1.55 times faster than the superscalar for this example (ideal = 2).

Modifications to Superscalar CPUs to Support SMT

Necessary modifications (i.e. thread state):
• Multiple program counters (PCs) and ISA register sets, and some mechanism by which the fetch unit selects one each cycle (thread instruction fetch/issue policy).
• A separate return stack for each thread for predicting subroutine return destinations.
• Per-thread instruction issue/retirement, instruction queue flush, and trap mechanisms.
• A thread id with each branch target buffer entry to avoid predicting phantom branches.

Modifications to improve SMT performance (resize some hardware resources for performance):
• A larger rename register file, to support the logical registers for all threads plus additional registers for register renaming (may require additional pipeline stages).
• A higher available main memory fetch bandwidth may be required.
• A larger data TLB with more entries to compensate for the increased virtual-to-physical address translations.
• Improved cache organization to offset the cache performance degradation due to cache sharing among the threads and the resulting reduced locality.
– e.g. private per-thread vs. shared L1 cache.
[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

SMT Implementations
• Intel's implementation of Hyper-Threading (HT) Technology (2-thread SMT):
– Originally implemented in its NetBurst microarchitecture (P4 processor family).
– Current Hyper-Threading implementation: Intel's Nehalem (Core i7, introduced 4th quarter 2008): 2, 4 or 8 cores per chip, each 2-thread SMT (4-16 threads per chip).

• IBM POWER 5/6: dual cores, each 2-thread SMT.
• The Alpha EV8 (4-thread SMT), originally scheduled for production in 2001.
• A number of special-purpose processors targeted towards network processor (NP) applications.
• Sun UltraSparc T1 (Niagara): eight processor cores, each executing from 4 hardware threads (32 threads total).
– Actually not SMT but fine-grain multithreaded (each core issues one instruction from one thread per cycle).

• Current technology has the potential for 4-8 simultaneous threads per core (based on design complexity).

A Base SMT Hardware Architecture.

[Diagram: in-order front end (fetch/issue) feeding an out-of-order core — a modified superscalar / speculative Tomasulo engine]
[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Example SMT vs. Superscalar Pipeline


• The pipeline of (a) a conventional superscalar processor and (b) that pipeline modified for an SMT processor, along with some implications of those pipelines.
• Two extra pipeline stages are added for register read/write to account for the size increase of the register file.

[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Intel Hyper-Threaded (2-way SMT) P4 Processor Pipeline

[SMT-8] Source: Intel Technology Journal, Volume 6, Number 1, February 2002.

Intel P4 Out-of-order Execution Engine Detailed Pipeline

Hyper-Threaded (2-way SMT). [SMT-8] Source: Intel Technology Journal, Volume 6, Number 1, February 2002.

SMT Performance Comparison
• Instruction throughput (IPC) from simulations by Eggers et al. at the University of Washington, using both multiprogramming and parallel workloads (8-issue, up to 8 threads):

Multiprogramming workload (IPC):
Threads | Superscalar | Traditional Multithreading | SMT
1 | 2.7 | 2.6 | 3.1
2 | - | 3.3 | 3.5
4 | - | 3.6 | 5.7
8 | - | 2.8 | 6.2

Parallel workload (IPC) (MP = chip multiprocessor; Traditional Multithreading = fine-grained):
Threads | Superscalar | MP2 | MP4 | Traditional Multithreading | SMT
1 | 3.3 | 2.4 | 1.5 | 3.3 | 3.3
2 | - | 4.3 | 2.6 | 4.1 | 4.7
4 | - | - | 4.2 | 4.2 | 5.6
8 | - | - | - | 3.5 | 6.1

Multiprogramming workload = multiple single-threaded programs (multi-tasking). Parallel workload = a single multi-threaded program.

Possible Machine Models for an 8-way Multithreaded Processor
• The following machine models for a multithreaded CPU that can issue 8 instructions per cycle differ in how threads use issue slots and functional units:
• Fine-Grain Multithreading:
– Only one thread issues instructions each cycle, but it can use the entire issue width of the processor. This hides all sources of vertical waste, but does not hide horizontal waste.
• SM: Full Simultaneous Issue (i.e. SM: Eight Issue — the most complex):
– This is a completely flexible simultaneous multithreaded superscalar: all eight threads compete for each of the 8 issue slots each cycle. This is the least realistic model in terms of hardware complexity, but provides insight into the potential of simultaneous multithreading. The following models each represent restrictions to this scheme that decrease hardware complexity.
• SM: Single Issue, SM: Dual Issue, and SM: Four Issue:
– These three models limit the number of instructions each thread can issue, or have active in the scheduling window, each cycle.
– For example, in an SM: Dual Issue processor, each thread can issue a maximum of 2 instructions per cycle; therefore, a minimum of 4 threads would be required to fill the 8 issue slots in one cycle.
• SM: Limited Connection (i.e. partition functional units among threads):
– Each hardware context is directly connected to exactly one of each type of functional unit.
– For example, if the hardware supports eight threads and there are four integer units, each integer unit could receive instructions from exactly two threads.
– The partitioning of functional units among threads is thus less dynamic than in the other models, but each functional unit is still shared (the critical factor in achieving high utilization).
[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Comparison of Multithreaded CPU Models Complexity

A comparison of key hardware complexity features of the various models (H = high complexity). The comparison takes into account:
– the number of ports needed for each register file,
– the dependence checking for a single thread to issue multiple instructions,
– the amount of forwarding logic,
– and the difficulty of scheduling issued instructions onto functional units.

[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Simultaneous vs. Fine-Grain Multithreading Performance

[Charts: instruction throughput (IPC) vs. number of threads; annotated values include 3.1 IPC, 4.8?, and 6.4 IPC. Workload: SPEC92]

Instruction throughput as a function of the number of threads. (a)-(c) show the throughput by thread priority for particular models, and (d) shows the total throughput for all threads for each of the six machine models. The lowest segment of each bar is the contribution of the highest-priority thread to the total throughput.
[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Simultaneous Multithreading (SM) vs. Single-Chip Multiprocessing (MP) — IPC

• Results for the multiprocessor (MP) vs. simultaneous multithreading (SM) comparisons. The multiprocessor always has one functional unit of each type per processor. In most cases the SM processor has the same total number of each FU type as the MP.
[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

Impact of Level 1 Cache Sharing on SMT Performance
• Results for the simulated cache configurations, shown relative to the throughput (instructions per cycle) of the baseline 64s.64p (64K instruction cache shared, 64K data cache private — 8K per thread).
• Notation: the caches are specified as [total I cache size in KB][private or shared].[D cache size][private or shared]. For instance, 64p.64s has eight private 8 KB I caches and a shared 64 KB data cache.

Best overall performance of the configurations considered is achieved by 64s.64s (64K instruction cache shared, 64K data cache shared).
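The cache-configuration notation can be decoded mechanically; a sketch (function and field names are mine):

```python
def parse_cache_config(cfg, n_threads=8):
    """Decode '[I KB][s|p].[D KB][s|p]' notation, e.g. '64p.64s';
    private ('p') caches are split evenly among the hardware threads."""
    result = {}
    for name, spec in zip(("I", "D"), cfg.split(".")):
        size_kb, kind = int(spec[:-1]), spec[-1]
        result[name] = {
            "shared": kind == "s",
            "per_thread_kb": size_kb if kind == "s" else size_kb // n_threads,
        }
    return result

# '64p.64s': eight private 8 KB I-caches and one shared 64 KB D-cache.
```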

[SMT-1] Source: Simultaneous Multithreading: Maximizing On-Chip Parallelism, Dean Tullsen et al., Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392-403.

The Impact of Increased Multithreading on Some Low-Level Metrics for the Base SMT Architecture

[Charts: IPC and low-level metrics (cache miss rates, renaming register usage, etc.) vs. number of threads]

So? Supporting more threads may lead to more demand on hardware resources (e.g. here the D-cache and I-cache miss rates increased substantially, and these resources thus need to be resized).
[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Possible SMT Thread Instruction Fetch Scheduling Policies
• Round Robin:
– Instructions from Thread 1, then Thread 2, then Thread 3, etc. (e.g. RR.1.8: each cycle one thread fetches up to eight instructions; RR.2.4: each cycle two threads fetch up to four instructions each).
• BR-Count:
– Give highest priority to those threads that are least likely to be on a wrong path, by counting branch instructions that are in the decode stage, the rename stage, and the instruction queues, favoring those with the fewest unresolved branches.
• MISS-Count:
– Give priority to those threads that have the fewest outstanding data cache misses.
• ICount:
– Highest priority assigned to the thread with the lowest number of instructions in the static portion of the pipeline (decode, rename, and the instruction queues).
• IQPOSN (Instruction Queue Position):
– Give lowest priority to those threads with instructions closest to the head of either the integer or floating-point instruction queues (the oldest instruction is at the head of the queue).
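The ICOUNT heuristic reduces, in essence, to sorting threads by their pre-issue instruction counts; a sketch (names are mine):

```python
def icount_pick(inflight, n=2):
    """ICOUNT-style selection: return the ids of the n threads with the
    fewest instructions in the decode/rename/instruction-queue stages."""
    return sorted(range(len(inflight)), key=lambda t: inflight[t])[:n]

# With per-thread pre-issue counts [12, 3, 7, 9], an ICOUNT.2.x policy
# would fetch from threads 1 and 2 this cycle.
```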

[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Instruction Throughput For Round Robin Instruction Fetch Scheduling

[Chart: IPC vs. number of threads for round-robin fetch variants; RR.2.8 peaks at IPC = 4.2 — the RR variant with the best performance]
Best overall instruction throughput is achieved using round robin RR.2.8 (in each cycle two threads each fetch a block of 8 instructions).

Workload: SPEC92. [SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Instruction Throughput & Thread Fetch Policy: ICOUNT.2.8

All other fetch heuristics provide a speedup over round robin; the instruction-count policy ICOUNT.2.8 provides the most improvement: 5.3 instructions/cycle vs. 2.5 for the unmodified superscalar. Workload: SPEC92.
(ICOUNT: highest priority assigned to the thread with the lowest number of instructions in the static portion of the pipeline — decode, rename, and the instruction queues.)
[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202.

Low-Level Metrics For Round Robin 2.8, ICOUNT 2.8

[Charts: low-level metrics, including renaming register usage, for RR.2.8 vs. ICOUNT.2.8]

ICOUNT improves on the performance of round robin by 23%, by reducing Instruction Queue (IQ) clog through selecting a better mix of instructions to queue. [SMT-2]

Possible SMT Instruction Issue Policies
• OLDEST FIRST: Issue the oldest instructions (those deepest into the instruction queue; the default).
• OPT LAST and SPEC LAST: Issue optimistic and speculative instructions after all others have been issued.
• BRANCH FIRST: Issue branches as early as possible in order to identify mispredicted branches quickly.

Instruction issue bandwidth is not a bottleneck in SMT as shown above

[SMT-2] Source: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Dean Tullsen et al., Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996, pages 191-202. (The ICOUNT.2.8 fetch policy is used for all issue policies above.)

SMT: Simultaneous Multithreading
• Strengths:
– Overcomes the limitations imposed by low single-thread instruction-level parallelism.
– Resource-efficient support of chip-level TLP.
– Multiple threads running will hide individual control hazards (i.e. branch mispredictions) and other long latencies (i.e. main memory access latency on a cache miss).
• Weaknesses:
– Additional stress placed on the memory hierarchy.
– Complexity.
– Sizing of resources (caches, branch prediction, TLBs, etc.).
– Accessing registers (32 integer + 32 FP for each HW context):
• Some designs devote two clock cycles each for register reads and register writes (deeper pipeline).
