ECE5917 SoC Architecture: MP SoC

Tae Hee Han: [email protected]
Semiconductor Systems Engineering, Sungkyunkwan University

Outline

n Overview

n Parallelism

n Data-Level Parallelism

n Instruction-Level Parallelism

n Thread-Level Parallelism

n Processor-Level Parallelism

n Multi-core

2 Parallelism - Thread Level Parallelism

3 Superscalar (In)Efficiency

[Figure: instruction issue slots (issue width × time) in a 4-wide superscalar]

n Completely idle cycle (vertical waste): introduced when the processor issues no instructions in a cycle

n Partially filled cycle, i.e., IPC < 4 (horizontal waste): occurs when not all issue slots can be filled in a cycle

4 Thread n Definition

n A discrete sequence of related instructions

n Executed independently of other such sequences n Every program has at least one thread

n Initializes

n Executes instructions

n May create other threads n Each thread maintains its current state n OS maps a thread to hardware resources

5 Multithreading

n On a single processor, multithreading generally occurs by time-division multiplexing (as in multitasking) – context switching

n On a multiprocessor or multi-core system, threads can be truly concurrent, with every processor or core executing a separate thread simultaneously

n Many modern OS directly support both time-sliced and multiprocessor threading with a process scheduler

n The kernel of an OS allows programmers to manipulate threads via the system-call interface
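As a concrete illustration of that interface, a minimal POSIX-threads sketch (illustrative only; pthread_create/pthread_join are the usual user-level entry points backed by the kernel's threading support):

#include <pthread.h>
#include <stdio.h>

/* Work executed by each software thread; the OS scheduler maps it
 * onto an available hardware core or SMT context. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    /* Create four threads through the kernel's threading interface. */
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    /* Wait for all of them to finish. */
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}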

6 Thread Level Parallelism (TLP)

n Interaction with OS

n OS perceives each core as a separate processor

n OS scheduler maps threads/processes to different logical (or virtual) cores

n Most major OSes support multithreading today

n TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel

n Goal: use multiple instruction streams to improve

n Throughput of computers that run many programs

n Execution time of multi-threaded programs

n TLP could be more cost-effective than ILP

[Figure: two processes with private virtual memory (ASID 1, ASID 2), each with threads holding their own stack, registers, and PC, scheduled by the OS thread scheduler onto two processor cores (e.g., 2-way SMT) sharing physical memory]

7 Multithreaded Execution

n Multithreading: multiple threads share functional units of 1 processor via overlapping

n Processor must duplicate independent state of each thread

n Separate copy of register file, PC

n Separate page table if different process

n Memory sharing via virtual memory mechanisms

n Already supports multiple processes

n HW for fast thread switch

n Must be much faster than full process switch (which is 100s to 1000s of clocks)

n When to switch?

n Alternate instruction per thread (fine grain)—round robin

n When thread is stalled (coarse grain)

n e.g., cache miss

8 Conceptual Illustration of Multithreaded Architecture

[Figure: a program partitioned into serial code and sub-problems A, B, C running as concurrent threads of computation, mapped onto hardware streams (some unused), feeding an instruction-ready pool and the pipeline of executing instructions]

9 Sources of Wasted Issue Slots

Source: Possible latency-hiding or latency-reducing technique

TLB miss: Increase TLB sizes, HW instruction prefetching, HW or SW data prefetching, faster servicing of TLB misses

I-cache miss: Increase cache size, more associativity, HW instruction prefetching

D-cache miss: Increase cache size, more associativity, HW or SW data prefetching, improved instruction scheduling, more sophisticated dynamic execution

Branch misprediction: Improved branch prediction scheme, lower branch misprediction penalty

Control hazard: Speculative execution, more aggressive if-conversion

Load delays (L1 cache hits): Shorter load latency, improved instruction scheduling, dynamic scheduling

Short integer delay: Improved instruction scheduling

Long integer, short FP, long FP delays: Shorter latencies, improved instruction scheduling

Memory conflict: Improved instruction scheduling

10 Fine-Grained Multithreading

n Switches between threads on each instruction, interleaving execution of multiple threads

n Usually done round-robin, skipping stalled threads (see the sketch below)

n CPU must be able to switch threads every clock

n Advantage: can hide both short and long stalls

n Instructions from other threads always available to execute

n Easy to insert on short stalls

n Disadvantage: slows individual threads

n A thread ready to execute without stalls will be delayed by instructions from other threads

n Used on Sun (now Oracle) Niagara (UltraSPARC T1) – Nov. 2005
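A minimal C sketch of the round-robin thread-select policy described above; the stalled[] array and the function name are illustrative, not from the slides:

#include <stdbool.h>

#define NUM_THREADS 4

/* Per-hardware-thread status: true while the thread is stalled,
 * e.g. waiting on a cache miss. (Hypothetical model, not RTL.) */
static bool stalled[NUM_THREADS];

/* Round-robin selection for fine-grained multithreading:
 * starting after the thread that issued last cycle, pick the next
 * thread that is not stalled. Returns -1 if every thread is stalled
 * (a completely idle, "vertically wasted" cycle). */
int select_next_thread(int last_issued) {
    for (int i = 1; i <= NUM_THREADS; i++) {
        int candidate = (last_issued + i) % NUM_THREADS;
        if (!stalled[candidate])
            return candidate;
    }
    return -1;
}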

11 Coarse-Grained Multithreading

n Switches threads only on costly stalls: e.g., L2 cache misses

n Advantages

n Relieves need to have very fast thread switching

n Doesn’t slow thread

n Other threads only issue instructions when main one would stall (for long time) anyway

n Disadvantage: pipeline startup costs make it hard to hide throughput losses from shorter stalls

n Pipeline must be emptied or frozen on stall, since CPU issues instructions from only one thread

n New thread must fill pipe before instructions can complete

n Thus, better for reducing penalty of high-cost stalls where pipeline refill << stall time

n Used in IBM AS/400

12 Simple Multithreaded Pipeline

[Figure: a simple multithreaded pipeline with per-thread PCs and register files (GPR1, GPR2), round-robin thread-select logic, I$, D$, and the usual fetch/decode/execute stages]

n Additional state: One copy of architected state per thread (e.g., PC, GPR)

n Thread select: Round-robin logic; Propagate Thread-ID down pipeline to access correct state (e.g., GPR1 versus GPR2)

n OS perceives multiple logical CPUs

13 Cycle Interleaved MT (Fine-Grain MT)

[Figure: issue slots over time with a second thread interleaved cycle-by-cycle; partially filled cycles (IPC < 4, horizontal waste) remain]

Cycle interleaved multithreading reduces vertical waste with cycle-by-cycle interleaving. However, horizontal waste remains.

14 Chip Multiprocessing (CMP)

[Figure: issue slots over time for a chip multiprocessor with two narrower cores, each running its own thread]

Chip multiprocessing reduces horizontal waste with simple (narrower) cores. However, (1) vertical waste remains and (2) ILP is bounded.

15 Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

n Interleave multiple threads to multiple issue slots with no restrictions

[Figure: issue slots (issue width × time) completely filled with instructions from multiple threads]

16 Simultaneous Multithreading (SMT) Motivation n Fine-grain Multithreading

n HEP, Tera, MASA, MIT Alewife

n Fast context switching among multiple independent threads

n Switch threads on cache miss stalls – Alewife

n Switch threads on every cycle – Tera, HEP

n Target vertical wastes only

n At any cycle, issue instructions from only a single thread

n Single-chip MP

n Coarse-grain parallelism among independent threads running on different processors

n Each individual processor pipeline still exhibits both vertical and horizontal waste

17 Simultaneous Multithreading (SMT)

n An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue processors (superscalars)

n SMT has the potential of greatly enhancing superscalar processor computational capabilities by

n Exploiting thread-level parallelism in a single processor core, simultaneously issuing, executing and retiring instructions from different threads during the same cycle

n A single physical SMT processor core acts as a number of logical processors each executing a single thread

n Providing multiple hardware contexts, hardware thread scheduling and context switching capability

n Providing effective long latency hiding

n e.g.) FP, branch misprediction, memory latency

18 Simultaneous Multithreading (SMT)

n Intel’s Hyper-Threading (2-way SMT)

n IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores - each 2-way SMT, 4 chips per package): Power5 has OoO cores, Power6 In-order cores;

n Basic ideas: Conventional MT + Simultaneous issue + Sharing common resources

[Figure: an SMT pipeline – per-thread PCs and renamers feed shared fetch/decode, RS & ROB, and a physical register file; shared execution units include Fdiv (16 cycles, unpipelined), Fmult (4 cycles), Fadd (2 cycles), 2 ALUs, and Load/Store (variable latency), backed by the I-cache and D-cache]

19 Overview of SMT Hardware Changes

n For an N-way (N threads) SMT, we need:

n Ability to fetch from N threads

n N sets of registers (including PCs)

n N rename tables (RATs)

n N virtual memory spaces

n But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)
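A rough C sketch of which front-end state an N-way SMT core replicates per thread versus what stays shared; the struct layout and sizes are illustrative assumptions, not a real design:

#include <stdint.h>

#define ARCH_REGS 32

/* Replicated per hardware thread in an N-way SMT core (illustrative
 * model only): architectural registers, PC, a register-rename table
 * (RAT), and an address-space ID for its virtual memory space. */
typedef struct {
    uint64_t pc;
    uint64_t arch_reg[ARCH_REGS];
    uint16_t rat[ARCH_REGS];   /* architectural -> physical mapping */
    uint16_t asid;             /* separate virtual memory space     */
} smt_thread_context;

/* Shared among all threads: the OOO engine itself (physical register
 * file, schedulers, execution units, bypass networks, ROB, caches). */
typedef struct {
    uint64_t phys_reg[192];            /* shared physical registers  */
    smt_thread_context thread[2];      /* e.g. a 2-way SMT core      */
} smt_core;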

20 Multithreading: Classification

[Figure: functional-unit (FU1–FU4) occupancy over execution time for five organizations – conventional single-threaded superscalar, fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), chip multiprocessor (CMP or multi-core), and simultaneous multithreading (SMT)]

21 SMT Performance n When it works, it fills idle “issue slots” with work from other threads; throughput improves

n But sometimes it can cause performance degradation!

[Figure: it can happen that finishing one task and then doing the other takes less time than doing both at the same time using SMT]

22 How? n Cache thrashing

n Thread 0 just fits in the level-1 caches and executes reasonably quickly due to high cache hit rates

n Thread 1 alone also fits nicely in the caches

n But the caches were just big enough to hold one thread’s data, not two threads’ worth

n With both threads running, both now have significantly higher cache miss rates

à Intel Smart Cache!

23 Multithreading: How Many Threads?

n With more HW threads:

n Larger/multiple register files

n Replicated & partitioned resources à Lower utilization, lower single-thread performance

n Shared resources à Utilization vs. interference and thrashing

n Impact of MT/MC on memory hierarchy?

Source: Guz et al. "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 8, NO. 1, 2009

24 SMT: Intel vs. ARM n In 2010, ARM said it might include SMT in its chips in the future; however this was rejected for their 2012 64-bit design

Noel Hurley (VP of marketing and strategy in ARM’s processor division) said ARM rejected SMT as an option. Although it can be used to hide the latency of memory accesses in parallel applications – a technique used heavily in GPUs – multithreading complicates the design of the pipeline itself. The tradeoff did not make sense for the engineering team, he said. ( http://www.techdesignforums.com/blog/2012/10/30/arm-64bit-cortex-a53-a57-launch/ ) n Intel conceded SMT will not be supported on its processor cores in order to save power

25 MT Analysis by ARM for Mobile Applications

n Evaluation tests by ARM have shown that MT is not efficient for mobile devices

n Increasing performance by 50% will cost more than a 50% increase in power

n MT is much less predictable than multi-core solutions

n In MT, the implementation cost of ‘sleep mode’ becomes higher due to more sharing of resources between multiple threads

n In high-end mobile apps which require superscalar and OoO based multi-core, single-threaded multi-core implementations such as big.LITTLE are the most efficient solution

[Figure: relative mW and relative DMIPS for Cortex-A7, dual Cortex-A7, Cortex-A12, and an estimate of Cortex-A12 with MT]

Source: http://www.arm.com/files/pdf/Multi-threading_Technology.pdf

26 Summary

n Limits to ILP (power efficiency, compilers, dependencies …) seem to restrict practical designs to 3- to 6-issue

n Data-level parallelism and/or thread-level parallelism is exploited to improve performance

n Coarse-grain vs. fine-grain multithreading

n Switch only on big stalls vs. switch every clock cycle

n Simultaneous multithreading is fine-grained multithreading built on an OoO superscalar microarchitecture

n Instead of replicating registers, reuse rename registers

27 Parallelism - Processor Level Parallelism

28 Beyond ILP (Instruction Level Parallelism) n Performance is limited by the serial fraction

n Coarse grain parallelism in the post ILP era

n Thread, process and data parallelism

n Learn from the lessons of the parallel processing community

n Revisit the classifications and architectural techniques

[Figure: serial vs. parallelizable fractions of a program as more CPUs (1–4) are added]

29 “Automatic” Parallelism in Modern Machines

n Bit level parallelism

n Within floating point operations, etc.

n Instruction level parallelism (ILP)

n Multiple instructions execute per clock cycle

n Memory system parallelism

n Overlap of memory operations with computation

n OS parallelism

n Multiple jobs run in parallel on commodity SMPs

Limits to all of these -- for very high performance, need user to identify, schedule and coordinate parallel tasks

30 Principles of Parallel Computing

n Finding enough parallelism (Amdahl’s Law)

n Granularity

n Locality

n Load balance

n Coordination and synchronization

n Performance modeling

All of these things make parallel programming even harder than sequential programming

31 Finding Enough Parallelism

n Suppose only part of an application seems parallel

n Amdahl’s law

n Let s be the fraction of work done sequentially, so (1-s) is fraction parallelizable

n P = number of processors

Speedup(P) = Time(1)/Time(P) = 1/(s + (1-s)/P) ≈ 1/s as P grows large

n Even if the parallel part speeds up perfectly performance is limited by the sequential part
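A small C check of Amdahl's law as stated above (the function name and the 10% serial fraction are just an illustrative choice):

#include <stdio.h>

/* Amdahl's law from the slide: Speedup(P) = 1 / (s + (1 - s) / P),
 * where s is the serial fraction and P is the processor count. */
double speedup(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    /* With 10% serial work, even many processors cannot exceed 10x. */
    for (int p = 1; p <= 1024; p *= 4)
        printf("s = 0.10, P = %4d -> speedup = %.2f\n", p, speedup(0.10, p));
    return 0;
}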

32 Overhead of Parallelism n Given enough parallel work, this is the biggest barrier to getting desired speedup n Parallelism overheads include:

n cost of starting a thread or process

n cost of communicating shared data

n cost of synchronizing

n extra (redundant) computation n Each of these can be in the range of milliseconds (=millions of flops) on some systems n Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work

33 Locality and Parallelism

[Figure: conventional storage hierarchy – each processor with its own L1, L2, and L3 caches and memory; potential interconnects join the memories]

n Large memories are slow, fast memories are small

n Storage hierarchies are large and fast on average

n Parallel processors, collectively, have large, fast cache

n the slow accesses to “remote” data we call “communication”

n Algorithm should do most work on local data

34 Load Imbalance n Load imbalance is the time that some processors in the system are idle due to

n Insufficient parallelism (during that phase)

n Unequal size tasks n Examples of the latter

n Adapting to “interesting parts of a domain”

n Tree-structured computations

n Fundamentally unstructured problems n Algorithm needs to balance load

35 Computer Architecture Classifications

Processor Organizations

n Single Instruction Single Data Stream (SISD) à uniprocessor

n Single Instruction Multiple Data Stream (SIMD) à vector processor, array processor

n Multiple Instruction Single Data Stream (MISD)

n Multiple Instruction Multiple Data Stream (MIMD) à Centralized Shared Memory (UMA) architecture, or Distributed Memory architecture (Distributed Shared Memory / NUMA, or Message Passing)

36 Multiprocessors

n Why do we need multiprocessors?

n Uniprocessor speed keeps improving

n But there are things that need even more speed

n Wait for a few years for Moore’s law to catch up?

n Or use multiple processors and do it now?

n Multiprocessor software problem

n Most code is sequential (for uniprocessors)

n MUCH easier to write and debug

n Correct parallel code very, very difficult to write

n Efficient and correct is even harder

n Debugging even more difficult (Heisenbugs)

ø Heisenbug is a computer programming jargon term for a software bug that seems to disappear or alter its behavior when one attempts to study it. The term is a pun on the name of Werner Heisenberg, the physicist who first asserted the observer effect of quantum mechanics, which states that the act of observing a system inevitably alters its state.

37 MIMD Multiprocessors

Centralized Shared Memory Distributed Memory

38 Centralized-Memory Machines n Also “Symmetric Multiprocessors” (SMP) n “Uniform Memory Access” (UMA)

n All memory locations have similar latencies

n Data sharing through memory reads/writes

n P1 can write data to a physical address A, P2 can then read physical address A to get that data

n Problem: Memory Contention

n All processors share the one memory

n Memory bandwidth becomes bottleneck

n Used only for smaller machines

n Most often 2, 4, or 8 processors

39 Distributed-Memory Machines

n Two kinds

n Distributed Shared-Memory (DSM)

n All processors can address all memory locations

n Data sharing like in SMP

n Also called NUMA (non-uniform memory access)

n Latencies of different memory locations can differ (local access faster than remote access)

n Message-Passing

n A processor can directly address only local memory

n To communicate with other processors, must explicitly send/receive messages

n Also called multicomputers or clusters

n Most accesses local, so less memory contention (can scale to well over 1,000 processors)

40 Another Classification

n Two Models for Communication and Memory Architecture

1. Communication occurs by explicitly passing messages among the processors: Message-passing multiprocessors

2. Communication occurs through a shared address space (via loads and stores): Shared-memory multiprocessors, either

n UMA (Uniform Memory Access time) for shared-address, centralized-memory MP

n NUMA (Non-Uniform Memory Access time) for shared-address, distributed-memory MP

n In the past, there was confusion over whether “sharing” means sharing physical memory (Symmetric MP) or sharing the address space

41 Process Coordination: Shared Memory vs. Message Passing n Shared memory

n Efficient, familiar

n Not always available

n Potentially insecure

global int x

process foo          process bar
begin                begin
  ...                  ...
  x := ...             y := x
  ...                  ...
end foo              end bar

n Message passing

n Extensible to communication in distributed systems

Canonical syntax:

send (process : process_id, message : string)
receive (process : process_id, var message : string)

42 Message Passing Protocols

[Figure: two CPUs with private memories connected by send/recv queues]

n Explicitly send data from one thread to another

n Need to track IDs of other CPUs

n Broadcast may need multiple sends

n Each CPU has its own memory space

n Hardware: send/recv queues between CPUs

n Program components can be run on the same or different systems, so can use 1,000s of processors

n “Standard” libraries exist to encapsulate messages:

n Parasoft's Express (commercial)

n PVM (standing for Parallel Virtual Machine, non-commercial)

n MPI (Message Passing Interface, also non-commercial).
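For reference, a minimal MPI point-to-point example in C matching the canonical send/receive syntax shown earlier (rank numbers and the tag value are arbitrary):

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI point-to-point example: rank 0 sends an integer to rank 1.
 * Compile with mpicc and run with at least two processes (mpirun -np 2). */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}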

43 Message Passing Machines

n A cluster of computers

n Each with its own processor and memory

n An interconnect to pass messages between them

n Producer-Consumer Scenario:

n P1 produces data D, uses a SEND to send it to P2

n The network routes the message to P2

n P2 then calls a RECEIVE to get the message

n Two types of send primitives

n Synchronous: P1 stops until P2 confirms receipt of message

n Asynchronous: P1 sends its message and continues

n Standard libraries for message passing: Most common is MPI – Message Passing Interface

44 Communication Performance

n Metrics for Communication Performance

n Communication Bandwidth

n Communication Latency

n Sender overhead + transfer time + receiver overhead

n Communication latency hiding

n Characterizing Applications

n Communication to Computation Ratio

n Work done vs. bytes sent over network

n Example: 146 bytes per 1000 instructions

45 Parallel Performance

n Serial sections

n Very difficult to parallelize the entire application

n Amdahl’s law

Speedup_Overall = 1 / ((1 - F_Parallel) + F_Parallel / Speedup_Parallel)

n With Speedup_Parallel = 1024 and F_Parallel = 0.5: Speedup_Overall = 1.998

n With Speedup_Parallel = 1024 and F_Parallel = 0.99: Speedup_Overall = 91.2

n Large remote access latency (100s of ns)

n Overall IPC goes down

n This cost is reduced with CMP/multi-core

CPI = CPI_Base + RemoteRequestRate × RemoteRequestCost

CPI_Base = 0.4, RemoteRequestRate = 0.002
RemoteRequestCost = 400 ns / 0.33 ns/cycle ≈ 1200 cycles
CPI = 0.4 + 0.002 × 1200 ≈ 2.8

We need at least 7 processors just to break even!
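A quick C re-check of the arithmetic above (the clock period and remote request rate are the slide's numbers):

#include <stdio.h>

/* Reproduces the slide's arithmetic: effective CPI with remote accesses,
 * and the processor count needed to match one processor's throughput. */
int main(void) {
    double cpi_base  = 0.4;
    double remote_ns = 400.0;             /* remote access latency      */
    double cycle_ns  = 0.33;              /* ~3 GHz clock               */
    double rate      = 0.002;             /* remote requests per instr. */

    double remote_cost = remote_ns / cycle_ns;      /* ~1200 cycles */
    double cpi = cpi_base + rate * remote_cost;     /* ~2.8         */

    printf("remote cost = %.0f cycles, CPI = %.1f\n", remote_cost, cpi);
    printf("break-even processors = %.1f\n", cpi / cpi_base);  /* ~7 */
    return 0;
}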

46 Message Passing Pros and Cons

n Pros

n Simpler and cheaper hardware

n Explicit communication makes programmers aware of costly (communication) operations

n Cons

n Explicit communication is painful to program

n Requires manual optimization

n If you want a variable to be local and accessible via LD/ST, you must declare it as such

n If other processes need to read or write this variable, you must explicitly code the needed sends and receives to do this

47 Message Passing: A Program

n Calculating the sum of array elements

#define ASIZE   1024
#define NUMPROC 4

double myArray[ASIZE/NUMPROC];          // must manually split the array
double mySum = 0;
for (int i = 0; i < ASIZE/NUMPROC; i++)
    mySum += myArray[i];

if (myPID == 0) {                       // "master" processor adds up partial
    for (int p = 1; p < NUMPROC; p++) { //   sums and prints the result
        double pSum;
        recv(p, pSum);
        mySum += pSum;
    }
    printf("Sum: %lf\n", mySum);
} else {
    send(0, mySum);                     // "slave" processors send their
}                                       //   partial results to the master

48 Shared Memory Model

[Figure: CPU0 writes X and CPU1 reads X through a globally shared main memory]

n The processors are all connected to a "globally available" memory, via either a SW or HW means

n The operating system usually maintains its memory coherence n That’s basically it…

n Need to fork/join threads, synchronize (typically locks)
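A minimal pthreads sketch of that fork/join-plus-lock pattern (names and iteration counts are illustrative):

#include <pthread.h>
#include <stdio.h>

/* Shared data lives in one address space; a lock serializes updates. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* synchronize the shared write */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, increment, NULL); /* fork */
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);                     /* join */
    printf("counter = %ld\n", counter);  /* always 200000 with the lock held */
    return 0;
}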

49 Shared Memory Multiprocessors: Roughly Two Styles n UMA (Uniform Memory Access)

n The time to access main memory is the same for all processors since they are equally close to all memory locations

n Machines that use UMA are called Symmetric Multiprocessors (SMPs)

n In a typical SMP architecture, all memory accesses are posted to the same shared memory bus

n Contention - as more CPUs are added, competition for access to the bus leads to a decline in performance

n Thus, scalability is limited to about 32 processors

[Figure: conceptual UMA/SMP model – four processors, each with a cache, connected through an interconnection network to a single shared memory]

50 Shared Memory Multiprocessors: Roughly Two Styles

n NUMA (Non-Uniform Memory Access)

n Since memory is physically distributed, it is faster for a processor to access its own local memory than non-local memory (memory local to another processor or shared between processors)

n Unlike SMPs, all processors are not equally close to all memory locations

n A processor’s own internal computations can be done in its local memory leading to reduced memory contention

n Designed to surpass the scalability limits of SMPs

[Figure: NUMA – processors P1 … Pn, each with a cache, local memory, and directory, joined by an interconnect]

n The “interconnect” usually includes

§ a cache directory to reduce snoop traffic

§ a remote cache to reduce access latency (think of it as an L3)

n Cache-Coherent NUMA systems (CC-NUMA) vs. Non-Cache-Coherent NUMA (NCC-NUMA)

51 Modern Multiprocessor System: Mixed NUMA & UMA

[Figure: two multi-core CPU packages (“nodes”), each with processors, caches, a local interconnection network, and local memory, joined by a shared interconnect]

n In this complex hierarchical scheme, processors are grouped by their physical location on one or the other multi-core CPU package or “node”

n Processors within a node share access to memory modules as per the UMA shared memory architecture

n At the same time, they may also access memory from the remote node using a shared interconnect, but with slower performance as per the NUMA shared memory architecture

Source: intel http://software.intel.com/en-us/articles/optimizing-applications-for-numa

52 Shared Memory: A Program

n Calculating the sum of array elements

#define ASIZE   1024
#define NUMPROC 4

shared double array[ASIZE];             // array is shared
shared double allSum = 0;
shared mutex  sumLock;

double mySum = 0;                       // each processor sums up "its"
for (int i = myPID*ASIZE/NUMPROC; i < (myPID+1)*ASIZE/NUMPROC; i++)
    mySum += array[i];                  //   part of the array

lock(sumLock);                          // each processor adds its partial
allSum += mySum;                        //   sum to the final result
unlock(sumLock);

if (myPID == 0)                         // "master" processor prints the result
    printf("Sum: %lf\n", allSum);

53 Shared Memory Pros and Cons n Pros

n Communication happens automatically

n More natural way of programming

n Easier to write correct programs and gradually optimize them

n No need to manually distribute data (but can help if you do)

n Cons

n Needs more hardware support

n Easy to write correct, but inefficient programs (remote accesses look the same as local ones)

54 Communication/Connection Options for MPs n Multiprocessors come in two main configurations:

n a single bus connection, and a network connection n The choice of the communication model and the physical connection depends largely on the number of processors in the organization

n Notice that the scalability of NUMA makes it ideal for a network configuration

n UMA, however, is best suited to a bus connection

Category              Choice                      Typical number of processors
Communication model   Message passing             8 ~ thousands
                      Shared address – UMA        2 ~ 64
                      Shared address – NUMA       8 ~ 256
Physical connection   Network                     8 ~ thousands
                      Bus                         2 ~ 32

55 Focus on Shared Memory Multiprocessors…

n We are more interested in single chip multi-core processor architecture rather than MPP systems in Data Centers

n It implements a memory system with a single global physical address space (usually)

n Goal 1: Minimize memory latency

n Use co-location & caches

n Goal 2: Maximize memory bandwidth

n Use parallelism & caches

56 Focus on Shared Memory Multiprocessors: Let’s See

[Figure: ARM server SoC example – four quad-core Cortex-A57 clusters, each sharing an L2 cache among its 4 cores, connected through the CoreLink CCN-504 Cache Coherent Network with a snoop filter and an 8-16MB L3 cache shared between the 4 clusters; plus GIC-500, MMU-500, NIC-400 network interconnects, dual CoreLink DMC-520 DDR4-3200 memory controllers, and I/O (10-40 GbE, PCIe, DSP, DPI, crypto, USB, SATA, flash, GPIO)]

Source: ARM (2013)

57 Cache Coherence Problem n Cache coherent processors

n Reading processor must get the most current value

n Most current value is the last write

n Cache coherency problem

n Updates from one processor are not known to others

[Figure: P0 and P1 both load A; P0 stores A = 1; when P1 loads A again, its cached copy (A = 0) is stale while P0 holds A = 1]

n Mechanisms for maintaining cache coherency

n Coherency state associated with a block of data

n Bus/interconnect operations on shared data change the state

n For the processor that initiates an operation

n For other processors that have the data of the operation resident in their caches

58 Possible Causes of Incoherence n Sharing of writeable data

n Cause most commonly considered n Process migration

n Can occur even if independent jobs are executing n I/O

n Often fixed via OS cache flushes

59 Defining Coherent Memory System

n A memory system is coherent if

1. A read R from address X on processor P1 returns the value written by the most recent write W to X on P1, if no other processor has written to X between W and R.

2. If P1 writes to X and P2 reads X after a sufficient time, and there are no other writes to X in between, P2’s read returns the value written by P1’s write.

3. Writes to the same location are serialized: two writes to location X are seen in the same order by all processors.

60 Cache Coherence Definition

n Property 1. preserves program order

n It says that in the absence of sharing, each processor behaves as a uniprocessor would

n Property 2. says that any write to an address must eventually be seen by all processors

n If P1 writes to X and P2 keeps reading X, P2 must eventually see the new value

n Property 3. preserves causality

n Suppose X starts at 0. Processor P1 increments X and processor P2 waits until X is 1 and then increments it to 2. Processor P3 must eventually see that X becomes 2.

n If different processors could see writes in different orders, P2 could see P1’s write and do its own write, while P3 first sees the write by P2 and then the write by P1. Now we have two processors that will forever disagree about the value of X.
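The causality scenario above, written as three threads with C11 atomics (an illustrative sketch; on coherent hardware P3 must eventually observe X == 2):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* X starts at 0. P1 writes 1, P2 waits for 1 and then writes 2,
 * P3 waits for 2. Write serialization guarantees all processors
 * agree on the order of the two writes. */
static atomic_int x = 0;

static void *p1(void *a) { (void)a; atomic_store(&x, 1); return NULL; }
static void *p2(void *a) { (void)a; while (atomic_load(&x) != 1) ; atomic_store(&x, 2); return NULL; }
static void *p3(void *a) { (void)a; while (atomic_load(&x) != 2) ; puts("P3 sees X == 2"); return NULL; }

int main(void) {
    pthread_t t1, t2, t3;
    pthread_create(&t3, NULL, p3, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL); pthread_join(t2, NULL); pthread_join(t3, NULL);
    return 0;
}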

61 Maintaining Cache Coherence

n Snooping Solution (Snoopy Bus):

n Send all requests for data to all processors

n Processors snoop to see if they have a copy and respond accordingly

n Requires broadcast, since caching information is at processors

n Works well with bus (natural broadcast medium)

n Dominates for small scale machines (most of the market)

n Directory-Based Schemes

n Keep track of what is being shared in one centralized place

n Distributed memory à distributed directory (avoids bottlenecks)

n Send point-to-point requests to processors

n Scales better than Snoop

n Actually existed BEFORE Snoop-based schemes

62 Snooping vs. Directory-based (1/3)

n Snooping protocols tend to be faster, if enough bandwidth is available, since all transactions are a request/response seen by all processors

n The drawback is that snooping is not scalable

n Every request must be broadcast to all nodes in a system, meaning that as the system gets larger, the size of the (logical or physical) bus and the bandwidth it provides must grow

n In broadcast snoop systems the coherency traffic is proportional to: N×(N-1) where N is the number of coherent masters

n For each master the broadcast goes to all other masters except itself,

n So coherency traffic for 1 master is proportional to N-1

63 Snooping vs. Directory-based (2/3)

n Directories, on the other hand, tend to have longer latencies (with a 3 hop request/forward/respond) but use much less bandwidth since messages are point to point and not broadcast

n In the best case if all shared data is shared only by two masters and we count the directory lookup and the snoop as separate transactions then traffic scales at order 2N

n In the worst case where all traffic is shared by all masters, a directory doesn’t help and the traffic scales at order N×((N-1)+1) = N², where the ‘+1’ is the directory lookup

n In reality, data is probably rarely shared amongst more than 2 masters except in certain special-case scenarios

n For this reason, many of the larger systems (>64 processors) use this type of cache coherence

64 Snooping vs. Directory-based (3/3) n Actually, these two schemes are really two ends of a continuum of approaches n A snoop based system can be enhanced with snoop filters that can filter out unnecessary broadcast snoops by using partial directories

n Thus snoop filters enable larger scaling of snoop-based systems n A directory-based system is akin to a snoop-based system with perfect, fully populated snoop filters

65 Snooping

n Typically used for bus-based (SMP) multiprocessors

n Serialization on the bus used to maintain coherence property 3

n Two flavors

n Write-update (write broadcast)

n A write to shared data is broadcast to update all copies

n All subsequent reads will return the new written value (property 2)

n All see the writes in the order of broadcasts One bus == one order seen by all (property 3)

n Write-invalidate

n Write to shared data forces invalidation of all other cached copies

n Subsequent reads miss and fetch new value (property 2)

n Writes ordered by invalidations on the bus (property 3)

66 Update vs. Invalidate

n A burst of writes by a processor to one address

n Update: each sends an update

n Invalidate: possibly only the first invalidation is sent

n Writes to different words of a block

n Update: update sent for each word

n Invalidate: possibly only the first invalidation is sent

n Producer-consumer communication latency

n Update: producer sends an update, consumer reads new value from its cache

n Invalidate: producer invalidates consumer’s copy, consumer’s read misses and has to request the block

n Which is better depends on application

n But write-invalidate is simpler and implemented in most MP-capable processors today

67 Cache Coherency Protocols n Invalidation based protocol

n Simple 2-state write-through invalidate protocol

n 3-state (MSI) write-back invalidate protocol

n 4-state MESI write-back invalidate protocol

n 5-state MOESI write-back invalidate protocol

n And many variants

n Update based protocol

n Dragon

n …

68 2-State Invalidate Protocols

n Write-through caches, invalidation-based protocol

n The snooping cache monitors the bus for writes

n If it detects that another processor has written to a block it is caching, it invalidates its copy

n This requires each cache controller to perform a tag match operation

n Cache tags can be made dual-ported

[Figure: 2-state diagram. Valid: Load / --, Store / OwnGETX, and OtherGETS / -- stay Valid; OtherGETX / -- goes to Invalid. Invalid: Load / OwnGETS or Store / OwnGETX go to Valid; OtherGETS / -- and OtherGETX / -- stay Invalid.]

69 3-State Write-Back Invalidate Protocol

n 2-State Protocol

n + Simple hardware and protocol

n – Uses lots of bandwidth (every write goes on bus!)

n 3-State Protocol (MSI)

n Modified

n One cache exclusively has the valid (modified) copy – it is the owner

n Memory is stale

n Shared

n >= 1 cache and memory have valid copy (memory = owner)

n Invalid (only memory has valid copy and memory is owner)

n Must invalidate all other copies before entering modified state

n Requires bus transaction (order and invalidate)

70 MSI Processor and Bus Actions

n Processor Actions:

n Load: load data in the cache line

n Store: store data into the cache line

n Eviction: processor wants to replace cache block

n Bus Actions:

n GETS: request to get data in shared state

n GETX: request for data in modified state (i.e., eXclusive access)

n UPGRADE: request for exclusive access to data owned in shared state

n Cache Controller Actions:

n Source : this cache provides the data to the requesting cache (your copy is more recent than the copy in memory)

n Writeback: this cache updates the block in memory

71 MSI Snoopy Protocol

[Figure: MSI state diagram]

n Invalid à Shared on Load / GETS; Invalid à Modified on Store / GETX

n Shared à Modified on Store / UPGRADE; Shared à Invalid on Eviction or on an observed GETX or UPGRADE

n Modified à Shared on an observed GETS / SOURCE, WRITEBACK; Modified à Invalid on Eviction / WRITEBACK (or on an observed GETX / SOURCE)

All edges are labeled with the activity that causes the transition; any value after the / represents an action place on the bus. All edges not shown are self edges that perform no actions (or are actions that are not possible)
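A simplified C rendering of the processor-side MSI transitions in the diagram above (snoop-side transitions such as the GETS / SOURCE downgrade are omitted for brevity; the enum and function names are illustrative):

/* Per-line MSI transition on processor-side events. */
typedef enum { INVALID, SHARED, MODIFIED } msi_state;
typedef enum { LOAD, STORE, EVICT } cpu_event;
typedef enum { BUS_NONE, BUS_GETS, BUS_GETX, BUS_UPGRADE, BUS_WRITEBACK } bus_action;

msi_state msi_cpu_event(msi_state s, cpu_event e, bus_action *out) {
    *out = BUS_NONE;
    switch (s) {
    case INVALID:
        if (e == LOAD)  { *out = BUS_GETS;    return SHARED;   }
        if (e == STORE) { *out = BUS_GETX;    return MODIFIED; }
        return INVALID;
    case SHARED:
        if (e == STORE) { *out = BUS_UPGRADE; return MODIFIED; }
        if (e == EVICT) {                     return INVALID;  }
        return SHARED;                /* loads hit silently */
    case MODIFIED:
        if (e == EVICT) { *out = BUS_WRITEBACK; return INVALID; }
        return MODIFIED;              /* loads and stores hit silently */
    }
    return s;
}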

72 4-State MESI Invalidation Protocol n MSI + New state: “Exclusive”

n data that is clean and unique copy (matches memory) n Benefit: bandwidth reduction

MSI cache line states:  (I)nvalid | Valid – Clean: (S)hared; Dirty: (M)odified
MESI cache line states: (I)nvalid | Valid – Clean: (S)hared, (E)xclusive; Dirty: (M)odified

73 MESI Protocol n Let's consider what happens if we read a block and then subsequently wish to modify it

n This will require two bus transactions using the 3-state MSI protocol

n But if we know that we have the only copy of the block, the transaction required to transition from state S to M is really unnecessary

n We could safely, and silently, transition from S to M

n E ® M transition doesn’t require bus transaction

n Improvement over MSI depends on the number of E®M transitions

74 MOESI Protocol n MESI + New state: “Owned”

n data that is both modified and shared n Benefit: bandwidth reduction

MESI cache line states:  (I)nvalid | Valid – Clean: (S)hared, (E)xclusive; Dirty: (M)odified
MOESI cache line states: (I)nvalid | Valid – Clean: (S)hared, (E)xclusive; Dirty: (O)wned, (M)odified

75 MOESI Protocol n An important assumption:

n Cache-to-cache transfer is possible, so a cache with the data in the modified state can supply that data to another reader without transferring it to memory

n O(wned) state

n Other shared copies of this block exist, but memory is stale

n This cache (the owner) is responsible for supplying the data when it observes the relevant bus transaction

n This avoids the need to write modified data back to memory when another processor wants to read it

n Look at the M to S transition in the MSI protocol

76 Cache-to-Cache Transfers

n Problem

n P1 has block B in M state

n P2 wants to read B, puts a RdReq on bus

n If P1 does nothing, memory will supply the data to P2

n What does P1 do?

n Solution 1: abort/retry

n P1 cancels P2’s request, issues a write back

n P2 later retries RdReq and gets data from memory

n Too slow (two memory latencies to move data from P1 to P2)

n Solution 2: intervention

n P1 indicates it will supply the data (“intervention” bus signal)

n Memory sees that, does not supply the data, and waits for P1’s data

n P1 starts sending the data on the bus, memory is updated

n P2 snoops the transfer during the write-back and gets the block

77 Cache-to-Cache Transfers n Intervention works if some cache has data in M state

n Nobody else has the correct data, clear who supplies the data n What if a cache has requested data in S state

n There might be others who have it, who should supply the data?

n Solution 1: let memory supply the data

n Solution 2: whoever wins arbitration supplies the data

n Solution 3: A separate state similar to S that indicates there are maybe others who have the block in S state, but if anybody asks for the data we should supply it

78 Coherence in Distributed Memory Multiprocessors n Distributed memory systems are typically larger à bus- based snooping may not work well

n Option 1: software-based mechanisms – message-passing systems or software-controlled cache coherence

n Option 2: hardware-based mechanisms – directory-based cache coherence

79 Directory-Based Cache Coherence n Typically in distributed shared memory n For every local memory block, local directory has an entry n Directory entry indicates

n Who has cached copies of the block

n In what state do they have the block

[Figure: distributed shared memory – each node has a processor with caches, local memory, I/O, and a directory, all joined by an interconnection network]

80 Basic Directory Scheme

n K processors

n With each cache block in memory: K presence bits, 1 dirty bit

n With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit

[Figure: processors P1 … Pn with caches, an interconnect, and memory with a per-block directory entry (dirty bit + presence bits)]

n Read from main memory by processor i:

n If dirty-bit OFF then { read from main memory; turn p[i] ON; }

n If dirty-bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }

n Write to main memory by processor i:

n If dirty-bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty-bit ON; turn p[i] ON; ... }
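A C sketch of the home node's read handling described above (only the read path; the recall transaction and the 64-byte block size are assumptions, and recall_from_owner is a hypothetical helper):

#include <stdbool.h>
#include <stdint.h>

#define NUM_PROCS 16

/* Directory entry for one memory block, as on the slide:
 * one presence bit per processor plus a dirty bit. */
typedef struct {
    bool presence[NUM_PROCS];
    bool dirty;
} dir_entry;

/* Home-node handling of a read request from processor i. */
void directory_read(dir_entry *d, int i, uint8_t *block, uint8_t *memory) {
    if (!d->dirty) {
        /* Memory is up to date: supply data, record the new sharer. */
        for (int b = 0; b < 64; b++) block[b] = memory[b];
        d->presence[i] = true;
    } else {
        /* Find the dirty owner, recall the line (owner downgrades to
         * shared), update memory, clear the dirty bit, then supply. */
        int owner = 0;
        while (!d->presence[owner]) owner++;
        /* recall_from_owner(owner, memory);  -- abstracted transaction */
        d->dirty = false;
        for (int b = 0; b < 64; b++) block[b] = memory[b];
        d->presence[i] = true;
    }
}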

81 Directory Protocol n Similar to Snoopy Protocol: Three states

n Shared: ≥ 1 processors have data, memory up-to-date

n Uncached (no processor has it; not valid in any cache)

n Exclusive: 1 processor (owner) has data; memory out-of-date

n Terms: typically 3 processors involved

n Local node where a request originates

n Home node where the memory location of an address resides

n Remote node has a copy of a cache block, whether exclusive or shared

82 (Execution) Latency vs. Bandwidth

n Desktop processing

n Typically want an application to execute as quickly as possible (minimize latency)

n Server/Enterprise processing

n Often throughput oriented (maximize bandwidth)

n Latency of individual task less important

n ex. Amazon processing thousands of requests per minute: it’s ok if an individual request takes a few seconds more so long as total number of requests are processed in time

83 Implementing MP Machines n One approach: add sockets to your MOBO

n minimal changes to existing CPUs

n power delivery, heat removal and I/O not too bad since each chip has own set of pins and cooling

[Figure: a motherboard with four CPU sockets (CPU0–CPU3)]

84 Chip-Multiprocessing n Simple SMP on the same chip

Intel “Smithfield” Block Diagram AMD Dual-Core Athlon FX

85 Shared Caches

n Resources can be shared between CPUs

n ex. IBM Power5

L2 cache shared between both CPUs (no need to keep two copies coherent)

L3 cache is also shared (only tags are on-chip; data are off-chip)

86 Benefits? n Cheaper than mobo-based SMP

n all/most interface logic integrated on to main chip (fewer total chips, single CPU socket, single interface to main memory)

n less power than mobo-based SMP as well (communication on- die is more power-efficient than chip-to-chip communication) n Performance

n on-chip communication is faster n Efficiency

n potentially better use of hardware resources than trying to make wider/more OOO single-threaded CPU

87 Performance vs. Power n 2x CPUs not necessarily equal to 2x performance n 2x CPUs à ½ power for each

n maybe a little better than ½ if resources can be shared n Back-of-the-Envelope calculation:

n 3.8 GHz CPU at 100W

n Dual-core: 50W per CPU

n P ∝ V³: V_orig³ / V_CMP³ = 100W / 50W à V_CMP ≈ 0.8 V_orig

n f ∝ V: f_CMP ≈ 0.8 × 3.8 GHz ≈ 3.0 GHz

Benefit of SMP: Full power budget per socket!

88 Summary

n Cache Coherence

n Coordinate accesses to shared, writeable data

n Coherence protocol defines cache line states, state transitions, actions

n Snooping and Directory Protocols similar; bus makes snooping easier because of broadcast (snooping à uniform memory access)

n Directory has extra data structure to keep track of state of all cache blocks

n Synchronization

n Locks and ISA support for atomicity

n Memory Consistency

n Defines programmers’ expected view of memory

n Sequential consistency imposes ordering on loads/stores

89 Multi-core

90 Multi-core Architectures

n SMPs on a single chip

n Chip Multi-Processors (CMP)

n Pros

n Efficient exploitation of available transistor budget

n Improves throughput and speed of parallelized applications

n Allows tight coupling of cores

n better communication between cores than in SMP

n shared caches

n Low power consumption

n low clock rates

n idle cores can be suspended

n Cons

n Only improves speed of parallelized applications

n Increased gap to memory speed

91 Multi-core Architectures n Design decisions

n Homogeneous vs. Heterogeneous

n Specialized accelerator cores

n SIMD

n GPU operations

n cryptography

n DSP functions (e.g. FFT)

n FPGA (programmable circuits)

n Access to memory

n own memory area (distributed memory)

n via cache hierarchy (shared memory)

n Connection of cores

n internal bus / cross bar connection

n Cache architecture

92 Multi-core Architectures: Examples

[Figure: two example multi-core organizations – left: a homogeneous design with cores (2x SMT), private L1/L2 caches, a shared L3, memory modules, I/O, and a crossbar; right: a heterogeneous design with cores, local stores, and a ring bus to memory and I/O]

Homogeneous with shared caches and crossbar (left); heterogeneous with caches, local stores, and ring bus (right)

93 Shared Cache Design

[Figure: traditional design – multiple single-core chips, each with a private L1 and a switch, sharing an off-chip L2 in front of main memory; multi-core architecture – cores with private L1s sharing an on-chip L2 in front of main memory]

94 Is a Multi-core really better off?

DEEP BLUE

480 chess chips Can evaluate 200,000,000 moves per second!!

95 IBM Watson Jeopardy! Competition (2011.2)

n POWER7 chips (2,880 cores) + 16TB memory

n Massively parallel processing

n Combine: Processing power, Natural language processing, AI, Search, Knowledge extraction

96 Major Challenges for Multi-core Designs n Communication

n Memory hierarchy

n Data allocation (you have a large shared L2/L3 now)

n Interconnection network

n AMD HyperTransport

n Intel QPI

n Scalability

n Bus Bandwidth, how to get there? n Power-Performance — Win or lose?

n Borkar’s multi-core arguments

n 15% per core performance drop à 50% power saving

n Giant, single core wastes power when task is small

n How about leakage? n Process variation and yield n Programming Model

97 Intel Core 2 Duo

n Homogeneous cores

n Bus-based on-chip interconnect

n Shared on-die cache memory

n Traditional I/O

Classic OOO: Reservation Stations, Issue ports, Schedulers…etc Source: Intel Corp.

Large, shared set associative, prefetch, etc.

98 Core 2 Duo

99 Why Sharing on-die L2?

n What happens when L2 is too large?

100 CoreTM μArch — Wide Dynamic Execution

101 CoreTM μArch — Wide Dynamic Execution

102 CoreTM μArch — MACRO Fusion

n Common “Intel 32” instruction pairs are combined

n 4-1-1-1 decoder that sustains 7 μop’s per cycle

n 4+1 = 5 “Intel 32” instructions per cycle

103 Micro(-ops) Fusion (from Pentium M)

n A misnomer..

n Instead of breaking up an Intel32 instruction into μop, they decide not to break it up…

n A better naming scheme would call the previous techniques — “IA32 fission”

n To fuse

n Store address and store data μops

n Load-and-op μops (e.g. ADD (%esp), %eax)

n Extend each RS entry to take 3 operands

n To reduce

n micro-ops (10% reduction in the OOO logic)

n Decoder bandwidth (simple decoder can decode fusion type instruction)

n Energy consumption

n Performance improved by 5% for INT and 9% for FP ( data)

104 Smart Memory Access

105 AMD Quad-Core Processor (Barcelona)

On different power plane from the cores

ø Source: AMD

n True 128-bit SSE (as opposed to 64-bit in prior Opterons)

n Sideband stack optimizer

n Parallelizes many POPs and PUSHes (which were dependent on each other)

n Converts them into pure load/store instructions

n No uops in FUs for stack-pointer adjustment

106 Barcelona’s Cache Architecture

ø Source: AMD

107 Intel Dual-Core (First 45nm microprocessor)

• High-k dielectric metal gate • Up to 12MB L2 • 47 new SSE4 instructions • > 3GHz

ø Source: Intel

108 Intel Arrandale Processor

• 32nm • Unified 3MB L3 • Power sharing (Turbo Boost) between cores and GFX via DFS

109 AMD 12-Core “Magny-Cours” Opteron

n 45nm

n 4 memory channels

111 IBM Power8 Technology

§ 22nm SOI, eDRAM, 15ML, 650mm2

Cores
§ 12 cores (SMT8)
§ 8 dispatch, 10 issue, 16 exec pipes
§ 2X internal data flows/queues
§ Enhanced prefetching
§ 64K data cache, 32K instruction cache

Caches
§ 512KB SRAM L2 / core
§ 96MB eDRAM shared L3
§ Up to 128MB eDRAM L4 (off-chip)

Memory
§ Up to 230 GB/s sustained bandwidth
§ Durable open memory attach interface

Bus Interfaces
§ Integrated PCIe Gen3
§ SMP Interconnect
§ CAPI (Coherent Accelerator Processor Interface)

Accelerators
§ Crypto & memory expansion
§ Transactional Memory
§ VMM assist
§ Data Move / VM Mobility

Energy Management
§ On-chip Power Management Micro-controller
§ Integrated Per-core VRM
§ Critical Path Monitors

112 IBM Power8 Core

Execution Improvement vs. POWER7
§ SMT4 à SMT8
§ 8 dispatch
§ 10 issue
§ 16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
§ Larger issue queues (4 x 16-entry)
§ Larger global completion, Load/Store reorder
§ Improved branch prediction
§ Improved unaligned storage access

Larger Caching Structures vs. POWER7
§ 2x L1 data cache (64 KB)
§ 2x outstanding data cache misses
§ 4x translation cache

Wider Load/Store
§ 32B à 64B L2 to L1 data bus
§ 2x data cache to execution dataflow bus

Enhanced Prefetch
§ Instruction speculation awareness
§ Data prefetch depth awareness
§ Adaptive bandwidth awareness
§ Topology awareness

Core Performance vs. POWER7
§ ~1.6x Single Thread
§ ~2x Max SMT

113 POWER8 On Chip Caches

n L2: 512 KB 8 way per core

n L3: 96 MB (12 x 8 MB 8 way Bank)

n “NUCA” Cache policy (Non-Uniform Cache Architecture)

n Scalable bandwidth and latency

n Migrate “hot” lines to local L2, then local L3 (replicate L2 contained footprint)

n Chip interconnect: 150 GB/sec per direction per segment

114 Cache Bandwidths

n GB/sec shown assuming 4 GHz

n Product frequency will vary based on model type

n Across 12 core chip

n 4 TB/sec L2 BW

n 3 TB/sec L3 BW

115 POWER8 Memory Organization

n Up to 8 high speed channels, each running up to 9.6 GB/s, for up to 230 GB/s sustained

n Up to 32 total DDR ports yielding 410 GB/s peak at the DRAM

n Up to 1 TB memory capacity per fully configured processor socket (at initial launch)

116 POWER8 Memory Buffer Chip

n Intelligence Moved into Memory

n Scheduling logic, caching structures

n Energy Mgmt., RAS decision point

n Formerly on processor

n Moved to Memory Buffer n Processor Interface

n 9.6 GB/s high speed interface

n More robust RAS

n “On-the-fly” lane isolation/repair

n Extensible for innovation build-out n Performance Value

n End-to-end fastpath and data retry (latency)

n Cache à latency/bandwidth, partial updates

n Cache à write scheduling, prefetch, energy

n 22nm SOI for optimal performance/energy

n 15 metal levels (latency, bandwidth)

117 POWER8 Integrated PCIe Gen 3

n Native PCI Gen 3 Support

n Direct processor integration

n Replaces proprietary GX/Bridge

n Low latency

n High Gen 3 bandwidth (8 Gb/s) à High utilization realizable

n Transport Layer for CAPI Protocol

n Coherently Attach Devices connected via PCI

n Protocol encapsulated in PCI

118 POWER8 CAPI (Coherence Attach Processor Interface)

n Virtual Addressing

n Accelerator can work with same memory addresses that the processor use

n Pointers de-referenced same as the host application

n Removes OS & device driver overhead n Hardware Managed Cache Coherence

n Enables the accelerator to participate in “locks” as a normal thread

n Lowers latency over the IO communication model

n PCI Gen 3 transport for encapsulated messages

n Processor Service Layer (PSL)

n Presents robust, durable interfaces to applications

n Offloads complexity / content from CAPP

n Customizable Hardware Application Accelerator

n Specific system SW, middleware, or user application

n Written to the durable interface provided by the PSL

119 POWER8 n Significant performance at thread, core and system n Optimization for VM density & efficiency n Strong enablement of autonomic system optimization n Excellent Big Data analytics capability

120 Summary

n High frequency -> high power consumption

n Trend towards multiple cores on chip

n Broad spectrum of designs: homogeneous, heterogeneous, specialized, general purpose, number of cores, cache architectures, local memories, simultaneous multithreading, …

n Problem: memory latency and bandwidth

121 ARM MPCore Intra-Cluster Coherency Technology & ACE

122 ARM MPCore Intra-Cluster Coherency Technology

n ARM introduced MPCore™ multi-core coherency technology in the ARM11 MPCore and subsequently in the Cortex-A5 and Cortex-A9 MPCore, which enables cache coherency within a cluster of 2 to 4 processors

[Figure: ARM MPCore cluster – an interrupt distributor with configurable HW interrupt lines and private FIQ lines feeding per-CPU timer/watchdog and IRQ interfaces; 1 to 4 symmetric CPU/VFP cores with L1 memory; and a Snoop Control Unit (SCU) with a 64-bit I&D bus and a coherency control bus]

123 ARM MPCore Intra-Cluster Coherency Technology

n In the Cortex-A15 MPCore, it is extended with AMBA 4 ACE coherency capability and thus supports

n Multiple CPU clusters enabling systems containing more than 4 cores

n Heterogeneous systems consisting of multiple CPUs and cached accelerators

n An improved MESI protocol:

n Enables direct cache-to-cache copy of clean data and direct cache-to-cache move of dirty data within the cluster, without write back to memory as in normal MESI-based processor

n Further enhanced by the ‘Snoop Control Unit’ (SCU), which maintains a copy of all L1 data cache tag RAMs acting as a local, low-latency directory, enabling it to direct transfers only to the L1 caches as needed

n This increases performance, since unnecessary snoop traffic to the L1 caches would otherwise increase the effective L1 access latency by stealing processor access cycles to the L1 cache

124 ARM MPCore Intra-Cluster Coherency Technology n Also supports an optional Accelerator Coherency Port (ACP), which enables un-cached accelerators access to the processor cache hierarchy, enabling ‘one-way’ coherency where the accelerator can read and write data within the CPU caches without a write-back to RAM n But ACP cannot support cached accelerators since the CPU has no way to snoop accelerator caches, and the accelerator caches may contain stale data if the CPU writes accelerator-cached data n Effectively the ACP acts like an additional master port into the SCU and the ACP interface consists of a regular AXI3 slave interface

125 Different meanings of “protocol”

n Cache coherent protocols

n System communication policies

n ACE protocol

n Interface communication protocol

n Interconnect responsibilities

n ACE protocol does not guarantee coherency => ACE is a support for coherency

126 Different Kinds of Components

n Interconnect: called CCI (Cache Coherent Interconnect)

n ACE Masters: masters with caches

n ACE-Lite Masters: components without caches snooping other caches

n ACE-Lite/AXI Slaves: components not initiating snoop transactions

[Figure: the same ARM SoC example as before – quad Cortex-A57 clusters with L2 caches on the CoreLink CCN-504 Cache Coherent Network with snoop filter and 8-16MB L3, NIC-400 network interconnects, MMU-500, GIC-500, DMC-520 DDR4-3200 memory controllers, and I/O]

127 ACE Cache Coherency States

n ACE states of a cache line: 5-state cache model

n Each cache line is either Valid or Invalid

n The ACE states can be mapped directly onto the MOESI cache coherency model states, however ACE is designed to support components that use a variety of internal cache state models, including MESI, MOESI, MEI and others

ARM ACE state       MOESI           Meaning
UniqueDirty (UD)    M (Modified)    Not shared, dirty, must be written back
SharedDirty (SD)    O (Owned)       Shared, dirty, must be written back to memory
UniqueClean (UC)    E (Exclusive)   Not shared, clean
SharedClean (SC)    S (Shared)      Shared, no need to write back, may be clean or dirty
Invalid (I)         I (Invalid)     Invalid

128 ACE Cache Coherency States

n ACE does not prescribe the cache states a component can use à Some components may not support all ACE transactions

n The ARM Cortex-A15 MPCore internally uses MESI states for the L1 data cache, meaning the cache cannot be in the SharedDirty (Owned) state

n To emphasize that ACE is not restricted to the MOESI cache state model, ACE does not use the familiar MOESI terminology

ARM ACE state       MOESI           Meaning
UniqueDirty (UD)    M (Modified)    Not shared, dirty, must be written back
SharedDirty (SD)    O (Owned)       Shared, dirty, must be written back to memory
UniqueClean (UC)    E (Exclusive)   Not shared, clean
SharedClean (SC)    S (Shared)      Shared, no need to write back, may be clean or dirty
Invalid (I)         I (Invalid)     Invalid

129 ACE Design Principle n Lines held in more than one cache must be held in the Shared state n Only one copy can be in the SharedDirty state, and that is the one that is responsible for updating memory n Devices are not required to support all 5 states in the protocol internally à Flexible n System interconnect is responsible for coordinating the progress of all shared (coherent) transactions and can handle these in various manners, e.g.

n The interconnect may present snoop addresses to all masters in parallel simultaneously, or it may present snoop addresses one at a time serially

130 ACE Design Principle n System interconnect may choose either

n to perform speculative reads to lower latency,

n or to wait until snoop responses have been received to reduce system power consumption by minimizing external memory reads n The interconnect may include a directory or snoop filter, or it may broadcast snoops to all masters n ACE has been designed to enable performance and power optimizations by avoiding wherever possible unnecessary external memory accesses

n ACE facilitates direct master-to-master data transfer wherever possible

131 ACE Additional Signals and Channels

n AMBA 4 ACE is backwards-compatible with AMBA 4 AXI, adding additional signals and channels to the AMBA 4 AXI interface

n The AXI interface consists of 5 channels: read address (ARADDR), read data (RDATA), write address (AWADDR), write data (WDATA), and write response (BRESP)

n In AXI, the read and write channels each have their own dedicated address and control channel

n The BRESP channel is used to indicate the completion of write transactions

132 ACE Additional Signals and Channels

AXI4 Channel    Signal           Source   Description
Read address    ARDOMAIN[1:0]    Master   Indicates the shareability domain of a read transaction
Read address    ARSNOOP[3:0]     Master   Indicates the transaction type for Shareable read transactions
Read address    ARBAR[1:0]       Master   Indicates a read barrier transaction
Write address   AWDOMAIN[1:0]    Master   Indicates the shareability domain of a write transaction
Write address   AWSNOOP[2:0]     Master   Indicates the transaction type for Shareable write transactions
Write address   AWBAR[1:0]       Master   Indicates a write barrier transaction
Write address   AWUNIQUE         Master   Indicates that a line is permitted to be held in a Unique state
Read data       RRESP[3:2]       Slave    Read response: the additional bits provide the information required to complete a Shareable read transaction

133 ACE Additional Signals and Channels

n Three new channels are added: the snoop address channel, the snoop data channel, and the snoop response channel

n The snoop address (AC) channel is an input to a cached master that provides the address and associated control information for snoop transactions

n The snoop response (CR) channel is an output channel from a cached master that provides a response to a snoop transaction

n Every snoop transaction has a single response associated with it

n The snoop response indicates if an associated data transfer on the CD channel is expected

n The snoop data (CD) channel is an optional output channel that passes snoop data out from a master

n Typically, this occurs for a read or clean snoop transaction when the master being snooped has a copy of the data available to return

134 ACE Additional Signals and Channels

n ACE-specific signals:

n Snoop address (AC) channel: ACVALID (Slave): snoop address and control information is valid; ACREADY (Master): snoop address ready; ACADDR[ac-1:0] (Slave): snoop address; ACSNOOP[3:0] (Slave): snoop transaction type; ACPROT[2:0] (Slave): snoop protection type
n Snoop response (CR) channel: CRVALID (Master): snoop response valid; CRREADY (Slave): snoop response ready; CRRESP[4:0] (Master): snoop response
n Snoop data (CD) channel: CDVALID (Master): snoop data valid; CDREADY (Slave): snoop data ready; CDDATA[cd-1:0] (Master): snoop data; CDLAST (Master): indicates the last data transfer of a snoop transaction
n Acknowledge signals: RACK (Master): read acknowledge; WACK (Master): write acknowledge
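Purely as an illustration (this is not RTL or a driver API from the ACE specification), the three snoop channels can be modeled as C structures whose fields mirror the signal names above; widths such as ACADDR and CDDATA are left as plain integers because the spec leaves them implementation-defined.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical models of the three ACE snoop channels. */
struct snoop_addr_ch {          /* AC: interconnect -> cached master */
    bool     acvalid, acready;  /* valid/ready handshake */
    uint64_t acaddr;            /* snoop address (width is implementation-defined) */
    uint8_t  acsnoop;           /* snoop transaction type, ACSNOOP[3:0] */
    uint8_t  acprot;            /* protection type, ACPROT[2:0] */
};

struct snoop_resp_ch {          /* CR: cached master -> interconnect */
    bool    crvalid, crready;
    uint8_t crresp;             /* snoop response, CRRESP[4:0]; it indicates whether data follows on CD */
};

struct snoop_data_ch {          /* CD: optional data channel from the snooped master */
    bool     cdvalid, cdready;
    uint64_t cddata;            /* one snoop data beat (width is implementation-defined) */
    bool     cdlast;            /* last beat of the snoop data transfer */
};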

135 ACE Transactions

n Transaction groups (the ACE-Lite transaction subset covers the Non-shared, Non-cached, and Cache Maintenance groups):

n Non-shared: Read, Write
n Read Shareable: ReadClean, ReadNotSharedDirty, ReadShared
n Write Shareable: ReadUnique, CleanUnique, MakeUnique
n Shareable Cache Maintenance: CleanShared, CleanInvalid, MakeInvalid
n Non-cached: ReadOnce, WriteUnique, WriteLineUnique
n Update memory: WriteBack, WriteClean
n Snoop Filter: Evict

136 Summary: ACE

n ACE states of a cache line: 5-state cache model

n ACE_UD, ACE_SD, ACE_UC, ACE_SC, ACE_I (Unique/Shared crossed with Dirty/Clean, plus Invalid)

n ACE channels

n Read channels (AR, R)

n Write channels (AW, W, B)

n Snoop Channels ( AC, CR, CD )

n ACE supported policies

n 100% snoop

n Directory based

n Anything in between (e.g., a snoop filter)

137 Appendix

138 Non-Uniform Cache Architecture n Proposed by UT-Austin at ASPLOS 2002 n Facts

n Large shared on-die L2

n Wire-delay dominating on-die cache

n Cache size and latency scaling: 1MB @ 180nm (1999): 3 cycles; 4MB @ 90nm (2004): 11 cycles; 16MB @ 50nm (2010): 24 cycles

139 Multi-banked L2 cache

n 2MB L2 @ 130nm built from 128KB banks: 11-cycle access (bank access time = 3 cycles, interconnect delay = 8 cycles)

140 Multi-banked L2 cache

n 16MB L2 @ 50nm built from 64KB banks: 47-cycle access (bank access time = 3 cycles, interconnect delay = 44 cycles)
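As a quick back-of-the-envelope check of the two data points above, the total hit latency is just the fixed bank access time plus the interconnect delay to reach the bank; the C sketch below only reproduces the slide's aggregate numbers and is not a model of the real on-die network.

#include <stdio.h>

/* L2 hit latency = bank access time + interconnect delay (slide's aggregate figures). */
static int l2_hit_latency(int bank_access_cycles, int interconnect_cycles)
{
    return bank_access_cycles + interconnect_cycles;
}

int main(void)
{
    printf("2MB @ 130nm, 128KB banks: %d cycles\n", l2_hit_latency(3, 8));   /* 11 */
    printf("16MB @ 50nm, 64KB banks: %d cycles\n",  l2_hit_latency(3, 44));  /* 47 */
    return 0;
}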

141 Static NUCA-1

(Figure: S-NUCA-1 bank organization: sub-banks with predecoder, tag array, wordline drivers and decoders, and sense amplifiers; each bank connects to dedicated address and data buses)

n Use private per-bank channels

n Each bank has its distinct access latency

n Statically decide data location for its given address

n Average access latency = 34.2 cycles

n Wire overhead = 20.9% → an issue

142 Static NUCA-2

(Figure: S-NUCA-2 bank organization: tag array, predecoder, wordline drivers and decoders; banks connect through switches to the data bus)

n Use a 2D switched network to alleviate wire area overhead

n Average access latency = 24.2 cycles

n Wire overhead = 5.9%

143 Multiprocessors

n Shared-memory Multiprocessors

n Provide a shared-memory abstraction

n Enables familiar and efficient programmer interface

(Figure: processors P1-P4, each with a cache, local memory M1, and a network interface, connected by an interconnection network)

144 Processors and Memory – UMA

n Uniform Memory Access (UMA)

n Access all memory locations with same latency

n Pros: Simplifies software. Data placement does not matter

n Cons: Lowers peak performance. Latency defined by worst case

n Implementation: Bus-based UMA for symmetric multiprocessor (SMP)

(Figure: four CPU($) nodes sharing a bus to the memory modules (Mem))

145 Processors and Memory – NUMA

n Non-Uniform Memory Access (NUMA)

n Access local memory locations faster

n Pros: Increases peak performance.

n Cons: Increases software complexity; data placement matters

n Implementation: Network-based NUMA with various network topologies, which require routers (R).

(Figure: four CPU($) nodes, each with local memory (Mem) and a router (R), connected point-to-point)

146 Networks and Topologies

n Shared Networks

n Every CPU can communicate with every other CPU via a bus or crossbar

n Pros: lower latency

n Cons: lower bandwidth and more difficult to scale with processor count (e.g., 16)

n Point-to-Point Networks

n Every CPU can talk to specific neighbors (depending on topology)

n Pros: higher bandwidth and easier to scale with processor count (e.g., 100s)

n Cons: higher multi-hop latencies

(Figures: a shared network of CPU($)/Mem/router (R) nodes on one interconnect vs. a point-to-point network of CPU($)/Mem/router (R) nodes)

147 Topology 1 – Bus

n Network Topology

n Defines organization of network nodes

n Topologies differ in connectivity, latency, bandwidth, and cost.

n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc…

n Bus

n Direct interconnect style

n Latency: f(1) wire delay

n Bandwidth: f(1/p) and not scalable (p<=4)

n Cost: f(1) wire cost

n Supports ordered broadcast only

148 Topology 2 – Crossbar Switch

n Network Topology

n Defines organization of network nodes

n Topologies differ in connectivity, latency, bandwidth, and cost.

n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc…

n Crossbar Switch

n Indirect interconnect.

n Switches implemented as big multiplexors

n Latency: f(1) constant latency

n Bandwidth: f(1)

n Cost: f(2P) wires, f(P^2) switches

149 Topology 3 – Multistage Network n Network Topology

n Defines organization of network nodes

n Topologies differ in connectivity, latency, bandwidth, and cost.

n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc…

n Multistage Network

n Indirect interconnect.

n Routing done by address decoding

n k: switch arity (#inputs or #outputs)

n d: number of network stages = log_k(P)

n Latency: f(d)

n Bandwidth: f(1)

n Cost: f(d×P/k) switches, f(P×d) wires

n Commonly used in large UMA systems
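As a small illustration of these formulas (a sketch only; "cost" here just counts switches and wires the way the slide does, and the 64-processor, 4-ary example is hypothetical):

#include <stdio.h>

/* Number of stages d: smallest d with k^d >= P, i.e. d = ceil(log_k(P)). */
static int stages(int P, int k)
{
    int d = 0;
    for (int n = 1; n < P; n *= k)
        d++;
    return d;
}

int main(void)
{
    int P = 64, k = 4;                 /* example: 64 processors, 4x4 switches */
    int d = stages(P, k);              /* 3 stages */
    printf("d = %d stages, ~%d switches, ~%d wires\n",
           d, d * P / k, P * d);       /* cost ~ f(d*P/k) switches, f(P*d) wires */
    return 0;
}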

150 Topology 4 – 2D Torus

n Network Topology

n Defines organization of network nodes

n Topologies differ in connectivity, latency, bandwidth, and cost.

n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc…

n 2D Torus

n Direct interconnect

n Latency: f(P^(1/2))

n Bandwidth: f(1)

n Cost: f(2P) wires

n Scalable and widely used.

n Variants: 1D torus, 2D mesh, 3D torus

151 Challenges in Shared Memory n Cache Coherence

n “Common Sense”

P1-Read[X] → P1-Write[X] → P1-Read[X]: the read returns the value P1 wrote to X
P1-Write[X] → P2-Read[X]: the read returns the value written by P1
P1-Write[X] → P2-Write[X]: writes are serialized; all P's see writes in the same order

n Synchronization

n Atomic read/write operations

n Memory Consistency

n What behavior should programmers expect from shared memory?

n Provide a formal definition of memory behavior to programmer

n Example: When will a written value be seen?

n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?

152 Example Execution

Both processors (two ATMs) run the same withdrawal code:

0: addi r1, accts, r3    # get address for the account
1: ld 0(r3), r4          # load balance into r4
2: blt r4, r2, 6         # check for sufficient funds
3: sub r4, r2, r4        # withdraw
4: st r4, 0(r3)          # store new balance
5: call give-cash

n Two withdrawals from one account. Two ATMs

n Withdraw value: r2 (e.g., $100)

n Account memory address: accts+r1

n Account balance: r4
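The same race can be reproduced in a few lines of C with POSIX threads; this is a hypothetical illustration of the slide's scenario, not code from the lecture. Without any synchronization, both withdrawals may read the same balance and one update is lost.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int balance = 500;                 /* shared account balance */

static void *withdraw(void *arg)
{
    int amount = *(int *)arg;
    int r4 = balance;                     /* ld: load balance */
    if (r4 >= amount) {                   /* blt: check for sufficient funds */
        usleep(1000);                     /* widen the race window for demonstration */
        balance = r4 - amount;            /* st: store new balance (not atomic!) */
    }
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    int amount = 100;
    pthread_create(&t0, NULL, withdraw, &amount);
    pthread_create(&t1, NULL, withdraw, &amount);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("final balance = %d (expected 300, often 400)\n", balance);
    return 0;
}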

153 Scenario 1 – No Caches

Trace of Mem[accts+r1]: initially 500; P0 loads 500 and stores 400; P1 then loads 400 and stores 300

n Processors have no caches

n Withdrawals update balance without a problem

154 Scenario 2a – Cache Incoherence

Trace (P0 cache / P1 cache / Mem), write-back caches:
P0 loads X:              P0 V:500,             Mem 500
P0 stores 400:           P0 D:400,             Mem 500
P1 loads X from memory:  P0 D:400, P1 V:500,   Mem 500
P1 stores 400:           P0 D:400, P1 D:400,   Mem 500

n Processors have write-back caches

n Processor 0 updates balance in cache, but does not write-back to memory

n Multiple copies of memory location [accts+r1]

n Copies may get inconsistent

155 Scenario 2b – Cache Incoherence

Trace (P0 cache / P1 cache / Mem), write-through caches:
P0 loads X:                        P0 500,            Mem 500
P0 stores 400 (written through):   P0 400,            Mem 400
P1 loads X:                        P0 400, P1 400,    Mem 400
P1 stores 300 (written through):   P0 400 (stale), P1 300, Mem 300

n Processors have write-through caches

n What happens if processor 0 performs another withdrawal?

156 Hardware Coherence Protocols

n Absolute Coherence

n All cached copies have the same data at the same time. Slow and hard to implement

n Relative Coherence

n Temporary incoherence is ok (e.g., write-back caches) as long as no load reads incoherent data.

(Figure: a coherence controller (CC) sits between the CPU's D$ tags/data and the bus)

n Coherence Protocol

n Finite state machine that runs for every cache line

n Define states per cache line

n Define state transitions based on bus activity

n Requires a coherence controller to examine bus traffic (address, data)

n Invalidates or updates cache lines as required

157 Protocol 1 – Write Invalidate

n Mechanics

n Process P performs write, broadcasts address on bus

n All other processors (!P) snoop the bus; if the address is locally cached, !P invalidates the local copy

n Process P performs read, broadcasts address on bus

n !P snoop the bus. If address is locally cached, !P writes back local copy

n Example

Step (Cache-A / Cache-B / Mem[X], with Mem[X] initially 0):
CPU-A reads X:        cache miss for X:    A = 0,            Mem = 0
CPU-B reads X:        cache miss for X:    A = 0, B = 0,     Mem = 0
CPU-A writes 1 to X:  invalidation for X:  A = 1, B invalid, Mem = 0
CPU-B reads X:        cache miss for X:    A = 1, B = 1,     Mem = 1

158 Cache Coherent Systems

n Provide Coherence Protocol

n States

n State transition diagram

n Actions

n Implement Coherence Protocol

n (0) Determine when to invoke coherence protocol

n (1) Find state of cache line to determine action

n (2) Locate other cached copies

n (3) Communicate with other cached copies (invalidate, update)

n Implementation Variants

n (0) is done in the same way for all systems. Maintain additional state per cache line. Invoke protocol based on state

n (1-3) have different approaches

159 Implementation 1 – Snooping n Bus-based Snooping

n All cache/coherence controllers observe/react to all bus events.

n Protocol relies on globally visible events

n i.e., all processors see all events

n Protocol relies on globally ordered events

n i.e., all processors see all events in the same sequence

n Bus Events

n Processor (events initiated by own processor P)

n read (R), write (W), write-back (WB)

n Bus (events initiated by other processors !P)

n bus read (BR), bus write (BW)

160 Three-State Invalidate Protocol

n Implement protocol for every cache line.

n Add state bits to every cache line to indicate (1) invalid, (2) shared, (3) exclusive; a minimal sketch of the resulting state machine follows
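The C sketch below is one possible encoding of such a three-state invalidate protocol (illustrative only, not the lecture's exact transition diagram). Events follow the bus-event naming from the snooping slide: local processor read/write (R, W) and bus read/write (BR, BW) observed from other processors.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state_t;
typedef enum { PROC_READ, PROC_WRITE, BUS_READ, BUS_WRITE } event_t;

/* Per-cache-line transition function of a three-state invalidate protocol. */
static line_state_t next_state(line_state_t s, event_t e)
{
    switch (s) {
    case INVALID:
        if (e == PROC_READ)  return SHARED;     /* fetch a shared copy (bus read) */
        if (e == PROC_WRITE) return EXCLUSIVE;  /* fetch and invalidate others (bus write) */
        return INVALID;                         /* bus events: nothing cached here */
    case SHARED:
        if (e == PROC_WRITE) return EXCLUSIVE;  /* broadcast an invalidation */
        if (e == BUS_WRITE)  return INVALID;    /* another processor is writing */
        return SHARED;                          /* local reads and bus reads keep it shared */
    case EXCLUSIVE:
        if (e == BUS_READ)   return SHARED;     /* write back / supply data, keep a shared copy */
        if (e == BUS_WRITE)  return INVALID;    /* write back data and invalidate */
        return EXCLUSIVE;                       /* local reads and writes hit */
    }
    return INVALID;
}

int main(void)
{
    line_state_t s = INVALID;
    event_t trace[] = { PROC_READ, BUS_READ, PROC_WRITE, BUS_WRITE };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        s = next_state(s, trace[i]);
        printf("after event %u: state %d\n", i, s);
    }
    return 0;
}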

161 Example

P1 read (A) P2 read (A1) P1 write (B) P2 read (C) P1 write (D) P2 write (E) P2 write (F-Z)

162 Implementation 2 – Directory n Bus-based Snooping – Limitations

n Snooping scalability is limited

n Bus has insufficient data bandwidth for coherence traffic

n Processor has insufficient snooping bandwidth for coherence traffic

n Directory-based Coherence – Scalable Alternative

n Directory contains state for every cache line

n Directory identifies processors with cached copies and their states

n In contrast to snoopy protocols, processors observe/act only on relevant memory events. Directory determines whether a processor is involved

163 Directory Communication

n Processor sends coherence events to the directory

1. Find the directory entry
2. Identify processors with copies
3. Communicate with those processors, if necessary
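A minimal sketch of these three steps, assuming a full-map directory with one entry per memory block and a bit vector of sharers (the field names and the 32-processor limit are illustrative assumptions, not from the lecture):

#include <stdint.h>
#include <stdio.h>

#define NUM_PROCS 32

/* One directory entry per memory block: block state plus a sharer bit vector. */
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    uint32_t    sharers;   /* bit i set => processor i holds a copy */
} dir_entry_t;

/* Handle a write request from 'writer': invalidate every other cached copy. */
static void handle_write(dir_entry_t *e, int writer)
{
    for (int p = 0; p < NUM_PROCS; p++)
        if (((e->sharers >> p) & 1u) && p != writer)
            printf("send invalidate to P%d\n", p);   /* stand-in for a network message */
    e->sharers = 1u << writer;
    e->state   = DIR_EXCLUSIVE;
}

int main(void)
{
    dir_entry_t e = { DIR_SHARED, (1u << 0) | (1u << 3) };  /* P0 and P3 share the block */
    handle_write(&e, 0);                                    /* P0 writes: only P3 is invalidated */
    return 0;
}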

164 Challenges in Shared Memory n Cache Coherence

n “Common Sense”

P1-Read[X] → P1-Write[X] → P1-Read[X]: the read returns the value P1 wrote to X
P1-Write[X] → P2-Read[X]: the read returns the value written by P1
P1-Write[X] → P2-Write[X]: writes are serialized; all P's see writes in the same order

n Synchronization

n Atomic read/write operations

n Memory Consistency

n What behavior should programmers expect from shared memory?

n Provide a formal definition of memory behavior to programmer

n Example: When will a written value be seen?

n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?

165 Synchronization n Regulate access to data shared by processors

n Synchronization primitive is a lock

n Critical section is a code segment that accesses shared data

n Processor must acquire lock before entering critical section.

n Processor should release lock when exiting critical section

n Spin Locks – Broken Implementation

acquire (lock)     # if lock = 0, then set lock = 1, else spin
<critical section>
release (lock)     # lock = 0

Inst-0: ldw R1, lock       # load lock into R1
Inst-1: bnez R1, Inst-0    # check lock; if lock != 0, go back to Inst-0
Inst-2: stw 1, lock        # acquire lock, set to 1
<<critical section>>       # access shared data
Inst-n: stw 0, lock        # release lock, set to 0

166 Implementing Spin Locks

Processor 0                              Processor 1
Inst-0: ldw R1, lock
Inst-1: bnez R1, Inst-0   # P0 sees lock is free
                                         Inst-0: ldw R1, lock
                                         Inst-1: bnez R1, Inst-0   # P1 sees lock is free
Inst-2: stw 1, lock       # P0 acquires lock
                                         Inst-2: stw 1, lock       # P1 acquires lock
...                                      ...                       # P0/P1 in the critical section at the same time
Inst-n: stw 0, lock

n Problem: Lock acquire is not atomic

n A set of atomic operations either all complete or all fail. During a set of atomic operations, no other processor can interject.

n Spin lock requires atomic load-test-store sequence

167 Implementing Spin Locks

n Solution: Test-and-set instruction

n Add single instruction for load-test-store (t&s R1, lock)

n Test-and-set atomically executes:

ld R1, lock   # load previous lock value
st 1, lock    # store 1 to set/acquire

n If lock initially free (0), t&s acquires lock (sets to 1)

n If lock initially busy (1), t&s does not change it

n Instruction is un-interruptible/atomic by definition

Inst-0: t&s R1, lock       # atomically load, check, and set lock = 1
Inst-1: bnez R1, Inst-0    # if the previous value in R1 was not 0, the acquire was unsuccessful; retry
...                        # critical section
Inst-n: stw 0, lock        # release lock, set to 0
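For reference, a minimal C11 sketch of a test-and-set spin lock; an atomic exchange is one common way to realize the t&s instruction in software, and this is an illustration rather than code from the lecture.

#include <stdatomic.h>

/* Spin lock built on an atomic test-and-set (here: atomic exchange). */
typedef struct { atomic_int lock; } spinlock_t;    /* 0 = free, 1 = held */

static void spin_acquire(spinlock_t *s)
{
    /* atomic_exchange returns the previous value: 0 means we acquired the lock. */
    while (atomic_exchange(&s->lock, 1) != 0)
        ;                                          /* spin until acquired */
}

static void spin_release(spinlock_t *s)
{
    atomic_store(&s->lock, 0);                     /* set lock back to free */
}

int main(void)
{
    spinlock_t l = { 0 };
    spin_acquire(&l);
    /* critical section: access shared data here */
    spin_release(&l);
    return 0;
}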

168 Test-and-Set Inefficiency

n Test-and-set works…

Processor 0                              Processor 1
Inst-0: t&s R1, lock      # P0 sees lock is free and acquires it
Inst-1: bnez R1, Inst-0
                                         Inst-0: t&s R1, lock      # P1 does not acquire
                                         Inst-1: bnez R1, Inst-0   # P1 keeps spinning

n …but performs poorly

n Suppose Processor 2 (not shown) has the lock

n Processors 0/1 must…

n Execute a loop of t&s instructions

n Issue multiple store instructions

n Generate useless interconnection traffic

169 Test-and-Test-and-Set Locks n Solution: Test-and-test-and-set

Inst-0: ld R1, lock        # test with a plain load; see if the lock has changed
Inst-1: bnez R1, Inst-0    # if lock = 1, spin
Inst-2: t&s R1, lock       # lock looks free (0), so try test-and-set
Inst-3: bnez R1, Inst-0    # if the acquire failed, spin again

n Advantages

n Spins locally without stores

n Reduces interconnect traffic

n Not a new instruction, simply new software (lock implementation)
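A corresponding C11 sketch of the test-and-test-and-set idea (again illustrative, not lecture code): spin on a plain atomic load, and only attempt the atomic exchange once the lock looks free, so waiting processors generate no stores or invalidation traffic.

#include <stdatomic.h>

typedef struct { atomic_int lock; } ttas_lock_t;    /* 0 = free, 1 = held */

static void ttas_acquire(ttas_lock_t *s)
{
    for (;;) {
        while (atomic_load(&s->lock) != 0)
            ;                                       /* spin locally on reads only */
        if (atomic_exchange(&s->lock, 1) == 0)      /* lock looked free: try to grab it */
            return;
    }
}

static void ttas_release(ttas_lock_t *s)
{
    atomic_store(&s->lock, 0);
}

/* usage: ttas_lock_t l = { 0 }; ttas_acquire(&l); ...critical section... ttas_release(&l); */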

170 Semaphores n Semaphore (semaphore S, integer N)

n Allows N parallel threads to access shared variable

n If N = 1, equivalent to lock

n Requires atomic fetch-and-add

Function Init (semaphore S, integer N) { S = N; }

Function P (semaphore S) {   # "Proberen": to test
    while (S == 0) { };
    S = S - 1;
}

Function V (semaphore S) {   # "Verhogen": to increment
    S = S + 1;
}
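A C11 sketch of P and V using atomic operations (illustrative only; the decrement uses a compare-and-swap loop so that the test and the decrement happen atomically, which the pseudocode above glosses over, and the type name is hypothetical to avoid clashing with POSIX sem_t):

#include <stdatomic.h>

typedef struct { atomic_int count; } sem_sketch_t;

static void sem_init_sketch(sem_sketch_t *s, int n) { atomic_store(&s->count, n); }

static void sem_P(sem_sketch_t *s)                   /* "Proberen": wait */
{
    for (;;) {
        int c = atomic_load(&s->count);
        if (c > 0 && atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return;                                  /* atomically took one slot */
        /* otherwise retry; a real implementation would block in the OS instead of spinning */
    }
}

static void sem_V(sem_sketch_t *s)                   /* "Verhogen": signal */
{
    atomic_fetch_add(&s->count, 1);
}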

171 Challenges in Shared Memory n Cache Coherence

n “Common Sense”

P1-Read[X] → P1-Write[X] → P1-Read[X]: the read returns the value P1 wrote to X
P1-Write[X] → P2-Read[X]: the read returns the value written by P1
P1-Write[X] → P2-Write[X]: writes are serialized; all P's see writes in the same order

n Synchronization

n Atomic read/write operations

n Memory Consistency

n What behavior should programmers expect from shared memory?

n Provide a formal definition of memory behavior to programmer

n Example: When will a written value be seen?

n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?

172 Memory Consistency n Execution Example

A = Flag = 0 initially

Processor 0          Processor 1
A = 1                while (!Flag) { }
Flag = 1             print A

n Intuition: P1 should print A = 1

n Coherence: makes no guarantees!

173 Consistency and Caches n Execution Example

A = Flag = 0 initially

Processor 0          Processor 1
A = 1                while (!Flag) { }
Flag = 1             print A

n Caching Scenario

1. P0 writes A = 1. Misses in the cache. Puts the write into a store buffer.
2. P0 continues execution.
3. P0 writes Flag = 1. Hits in the cache. Completes the write (with coherence).
4. P1 reads Flag = 1.
5. P1 exits the spin loop.
6. P1 prints A = 0.

n Caches, buffering, and other performance mechanisms can cause strange behavior
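The same example written in C11 (a sketch, not lecture code): with relaxed ordering on Flag the outcome "A = 0" is permitted, whereas the release/acquire pair shown below forbids it, which anticipates the consistency models discussed next.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int A = 0, Flag = 0;

static void *p0(void *unused)
{
    (void)unused;
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    /* release: all earlier writes become visible before Flag = 1 is observed.
       With memory_order_relaxed here instead, P1 may legally print 0. */
    atomic_store_explicit(&Flag, 1, memory_order_release);
    return NULL;
}

static void *p1(void *unused)
{
    (void)unused;
    while (atomic_load_explicit(&Flag, memory_order_acquire) == 0)
        ;                                     /* spin until Flag is set */
    printf("A = %d\n", atomic_load_explicit(&A, memory_order_relaxed));
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t0, NULL, p0, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}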

174 Sequential Consistency (SC)

n Definition of Sequential Consistency

n Formal definition of programmers' expected view of memory:

(1) Each processor P sees its own loads/stores in program order
(2) Each processor P sees !P loads/stores in program order
(3) All processors see the same global load/store ordering

n P and !P loads/stores may be interleaved into some order, but all processors see the same interleaving/ordering

n Definition of Multiprocessor Ordering [Lamport]

n Multiprocessor ordering corresponds to some sequential interleaving of the uniprocessor orderings; the multiprocessor ordering should be indistinguishable from a multiprogrammed uniprocessor

175 Enforcing SC

n Consistency and Coherence

n SC Definition: loads/stores globally ordered

n SC Implications: coherence events of all load/stores globally ordered

n Implementing Sequential Consistency

n All loads/stores commit in-order

n Delay completion of memory access until all invalidations that are caused by access are complete

n Delay a memory access until previous memory access is complete

n Delay memory read until previous write completes. Cannot place writes in a buffer and continue with reads.

n Simple for the programmer, but constrains HW/SW performance optimizations

176 Weaker Consistency Models

n Assume programs are synchronized

n SC required only for lock variables

n Other variables are either (1) in critical section and cannot be accessed in parallel or (2) not shared

n Use fences to restrict re-ordering

n Increases opportunity for HW optimization but increases programmer effort

n Memory fences stall execution until write buffers empty

n Allows load/store reordering in critical section.

n Slows lock acquire, release

acquire
memory fence
<critical section>
memory fence    # ensures all writes from the critical section are cleared from the buffer
release
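In C11 this acquire / fence / critical section / fence / release pattern can be written with explicit thread fences. The sketch below assumes the spin-lock word from the earlier examples and a hypothetical shared variable; real code would usually rely on the acquire/release semantics of the lock operations themselves.

#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock = 0;      /* spin-lock word as in the earlier sketches: 0 = free */
static int shared_data;          /* variable protected by the lock */

static void update_shared(int value)
{
    while (atomic_exchange_explicit(&lock, 1, memory_order_relaxed) != 0)
        ;                                            /* acquire (spin) */
    atomic_thread_fence(memory_order_acquire);       /* memory fence after acquire */

    shared_data = value;                             /* critical section */

    atomic_thread_fence(memory_order_release);       /* fence: drain critical-section writes */
    atomic_store_explicit(&lock, 0, memory_order_relaxed);   /* release */
}

int main(void)
{
    update_shared(42);
    printf("%d\n", shared_data);
    return 0;
}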

177