Computer Architecture

ESE 545 Computer Architecture

Limits to ILP, Thread-Level Parallelism (TLP), Data-Level Parallelism (DLP), and Single-Instruction-Multiple-Data (SIMD) Computing

Limits to ILP

- Conflicting studies of the amount of ILP
  - Benchmarks (vectorized Fortran FP vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on the performance curve?
  - Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
  - Intel SSE2: 128 bit, including 2 × 64-bit FP per clock
  - Motorola AltiVec: 128-bit ints and FPs
  - SuperSPARC multimedia ops, etc.

Limits to ILP

Initial HW model here: MIPS compilers. Assumptions for an ideal/perfect machine to start:

1. Register renaming - infinite virtual registers => all register WAW & WAR hazards are avoided
2. Branch prediction - perfect; no mispredictions
3. Jump prediction - all jumps perfectly predicted (returns, case statements)
   2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis - addresses known & a load can be moved before a store provided the addresses are not equal
   1 & 4 eliminate all but RAW hazards

Also: perfect caches; 1-cycle latency for all instructions (including FP multiply and divide); unlimited instructions issued per clock cycle.

Limits to ILP: HW Model Comparison

                             Model      Power 5
Instructions issued/clock    Infinite   4
Instruction window size      Infinite   200
Renaming registers           Infinite   48 integer + 40 FP
Branch prediction            Perfect    2% to 6% misprediction (tournament branch predictor)
Cache                        Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        Perfect    ??

Upper Limit to ILP: Ideal Machine

(Figure: instructions per clock on the ideal machine, by program)

  gcc 54.8   espresso 62.6   li 17.9   fpppp 75.2   doduc 118.7   tomcatv 150.1

  FP programs: 75 - 150 IPC; integer programs: 18 - 60 IPC

Limits to ILP: HW Model Comparison

                             New Model                     Model      Power 5
Instructions issued/clock    Infinite                      Infinite   4
Instruction window size      Infinite, 2K, 512, 128, 32    Infinite   200
Renaming registers           Infinite                      Infinite   48 integer + 40 FP
Branch prediction            Perfect                       Perfect    2% to 6% misprediction (tournament)
Cache                        Perfect                       Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        Perfect                       Perfect    ??

More Realistic HW: Window Impact

Change from infinite window to windows of 2048, 512, 128, and 32 entries.

(Figure: IPC for gcc, espresso, li, fpppp, doduc, tomcatv at each window size. FP programs range from about 150 IPC down to 9 as the window shrinks; integer programs from about 63 down to 8.)

Limits to ILP: HW Model Comparison

                             New Model                                Model      Power 5
Instructions issued/clock    64                                       Infinite   4
Instruction window size      2048                                     Infinite   200
Renaming registers           Infinite                                 Infinite   48 integer + 40 FP
Branch prediction            Perfect vs. 8K tournament vs. 512 2-bit  Perfect    2% to 6% misprediction
                             vs. profile vs. none                                (tournament branch predictor)
Cache                        Perfect                                  Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        Perfect                                  Perfect    ??

More Realistic HW: Branch Impact

Change from the infinite window to a 2048-entry window and a maximum issue of 64 instructions per clock cycle.

(Figure: IPC for gcc, espresso, li, fpppp, doduc, tomcatv under five branch-prediction schemes - perfect, selective/tournament predictor, standard 2-bit with a 512-entry BHT, static profile-based, and no prediction. FP: 15 - 45 IPC; integer: 6 - 12 IPC.)

Misprediction Rates

(Figure: branch misprediction rates for profile-based, 2-bit counter, and tournament predictors on tomcatv, doduc, fpppp, li, espresso, and gcc. Rates range from about 0% to 30%; the tournament predictor is generally the lowest on each benchmark.)

Limits to ILP: HW Model Comparison

                             New Model                            Model      Power 5
Instructions issued/clock    64                                   Infinite   4
Instruction window size      2048                                 Infinite   200
Renaming registers           Infinite v. 256, 128, 64, 32, none   Infinite   48 integer + 40 FP
Branch prediction            8K 2-bit                             Perfect    Tournament branch predictor
Cache                        Perfect                              Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        Perfect                              Perfect    Perfect

More Realistic HW: Renaming Register Impact (N int + N fp)

Change: 2048-instruction window, 64-instruction issue, 8K two-level prediction.

(Figure: IPC for gcc, espresso, li, fpppp, doduc, tomcatv with infinite, 256, 128, 64, 32, and no renaming registers. FP: 11 - 45 IPC; integer: 5 - 15 IPC.)

Limits to ILP: HW Model Comparison

                             New Model                             Model      Power 5
Instructions issued/clock    64                                    Infinite   4
Instruction window size      2048                                  Infinite   200
Renaming registers           256 Int + 256 FP                      Infinite   48 integer + 40 FP
Branch prediction            8K 2-bit                              Perfect    Tournament
Cache                        Perfect                               Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        Perfect v. Stack v. Inspect v. none   Perfect    Perfect

More Realistic HW: Memory Address Alias Impact

Change: 2048-instruction window, 64-instruction issue, 8K two-level prediction, 256 renaming registers.

(Figure: IPC for gcc, espresso, li, fpppp, doduc, tomcatv under four alias-analysis schemes - perfect; global/stack perfect with heap conflicts assumed; inspection of assembly; none. FP: 4 - 45 IPC (Fortran, no heap); integer: 4 - 9 IPC.)

Limits to ILP: HW Model Comparison

                             New Model                        Model      Power 5
Instructions issued/clock    64 (no restrictions)             Infinite   4
Instruction window size      Infinite vs. 256, 128, 64, 32    Infinite   200
Renaming registers           64 Int + 64 FP                   Infinite   48 integer + 40 FP
Branch prediction            1K 2-bit                         Perfect    Tournament
Cache                        Perfect                          Perfect    64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis        HW disambiguation                Perfect    Perfect

Realistic HW: Window Impact

Perfect disambiguation (HW), 1K selective prediction, 16-entry return stack, 64 renaming registers, issue as many instructions as the window allows.

(Figure: IPC for gcc, espresso, li, fpppp, doduc, tomcatv at window sizes infinite, 256, 128, 64, 32, 16, 8, and 4. FP: 8 - 45 IPC; integer: 6 - 12 IPC.)

HW vs. SW to Increase ILP

- Memory disambiguation: HW best
- WAR and WAW hazards through memory: register renaming eliminates WAW and WAR hazards for registers, but not for memory
  - Conflicts can arise via allocation of stack frames, as a called procedure reuses the memory addresses of a previous frame on the stack
- Speculation:
  - HW best when dynamic branch prediction is better than compile-time prediction
  - Exceptions are easier to handle in HW
  - HW doesn't need bookkeeping code or compensation code
  - Very complicated to get right
- Scheduling: SW can look further ahead to schedule better
- Compiler independence: HW does not require a new compiler or recompilation to run well

In Conclusion

- Limits to ILP (power efficiency, compilers, dependencies, ...) appear to limit practical designs to roughly 3- to 6-issue
- These are not laws of physics, just practical limits for today; they may yet be overcome via research
- Compiler and ISA advances could change the results

Performance Beyond Single-Thread ILP

- There can be much higher natural parallelism in some applications (e.g., database or scientific codes) => explicit thread-level parallelism or data-level parallelism (next lectures)
- Thread: process with its own instructions and data
  - A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
  - Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
- Data-level parallelism: perform identical operations on data, and lots of data

Multithreading

- Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
- Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
- Multithreading uses TLP to improve utilization of a single processor

Pipeline Hazards

                  t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14
LW   r1, 0(r2)    F  D  X  M  W
LW   r5, 12(r1)      F  D  D  D  D  X  M  W
ADDI r5, r5, #12        F  F  F  F  D  D  D  D  X   M   W
SW   12(r1), r5                     F  F  F  F  D   D   D   D

- Each instruction may depend on the next

What is usually done to cope with this?
- Interlocks (slow), or
- Bypassing (needs hardware, doesn't help all hazards)

Multithreading

How can we guarantee there are no dependencies between instructions in a pipeline? One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

                      t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LW   r1, 0(r2)    F  D  X  M  W
T2: ADD  r7, r1, r4      F  D  X  M  W
T3: XORI r5, r4, #12        F  D  X  M  W
T4: SW   0(r7), r5             F  D  X  M  W
T1: LW   r5, 12(r1)               F  D  X  M  W

The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.

CDC 6600 Peripheral Processors (Cray, 1964)

- First multithreaded hardware
- 10 "virtual" I/O processors
- Fixed interleave on a simple pipeline
- Pipeline has 100 ns cycle time
- Each virtual processor executes one instruction every 1000 ns
- Accumulator-based instruction set to reduce processor state

Simple Multithreaded Pipeline

(Figure: five-stage multithreaded pipeline datapath - per-thread PCs and GPR files feed the instruction cache, register read, execute (X/Y), and D$ stages, with a thread-select signal choosing which thread's state is used each cycle.)

- Have to carry thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage
- Appears to software (including OS) as multiple, albeit slower, CPUs

Multithreading Costs

- Each thread requires its own user state
  - PC
  - GPRs
- Also needs its own system state
  - Virtual-memory page-table-base register
  - Exception-handling registers
- Other overheads:
  - Additional cache/TLB conflicts from competing threads
  - (or add larger cache/TLB capacity)
  - More OS overhead to schedule more threads

Thread Scheduling Policies

- Fixed interleave (CDC 6600 PPUs, 1964); see the sketch after this list
  - Each of N threads executes one instruction every N cycles
  - If a thread is not ready to go in its slot, insert a pipeline bubble
- Software-controlled interleave (TI ASC, with 8 virtual processors in the Peripheral Processor and 16 pipeline slots, 1971)
  - OS allocates 16 pipeline slots amongst the 8 threads
  - Hardware performs fixed interleave over the 16 slots, executing whichever thread is in each slot

- Hardware-controlled thread scheduling (HEP, 1982)
  - Hardware keeps track of which threads are ready to go
  - Picks the next thread to execute based on a hardware priority scheme
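A minimal C sketch of the fixed-interleave policy above; the function and array names are illustrative, not from the original slides:

/* Fixed-interleave scheduling (CDC 6600 style): thread (cycle mod N)
   owns the issue slot; if it is not ready, the slot becomes a bubble. */
#define N_THREADS 4

int pick_thread(int cycle, const int ready[N_THREADS]) {
    int slot = cycle % N_THREADS;    /* fixed round-robin ownership */
    return ready[slot] ? slot : -1;  /* -1 means insert a pipeline bubble */
}

Hardware-controlled scheduling (HEP-style) would instead pick any ready thread by priority rather than forcing a bubble.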

Denelcor HEP (Burton Smith, 1982)

First commercial machine to use hardware threading in main CPU

- 120 threads per processor
- 10 MHz clock rate
- Up to 8 processors
- Precursor to the Tera MTA (Multithreaded Architecture)

Tera MTA (1990-)

- Up to 256 processors
- Up to 128 active threads per processor
- Processors and memory modules populate a sparse 3D torus interconnection fabric
- Flat, shared main memory
  - No data cache
  - Sustains one main memory access per cycle per processor
- GaAs logic in prototype, 1 KW/processor @ 260 MHz
- Second version CMOS, MTA-2, 50 W/processor
- New version, XMT, fits into an AMD Opteron socket, runs at 500 MHz

MTA Pipeline

(Figure: MTA pipeline - instruction fetch feeds an issue pool; the W/M/A/C pipelines connect to write, memory, and retry pools through the interconnection network and memory pipeline.)

- Every cycle, one VLIW instruction from one active thread is launched into the pipeline
- The instruction pipeline is 21 cycles long
- Memory operations incur ~150 cycles of latency

Assuming a single thread issues one instruction every 21 cycles, and the clock rate is 260 MHz, what is single-thread performance? Effective single-thread issue rate is 260/21 = 12.4 MIPS.

Coarse-Grain Multithreading

The Tera MTA was designed for supercomputing applications with large data sets and low locality:

- No data cache
- Many parallel threads needed to hide large memory latency

Other applications are more cache friendly:

- Few pipeline bubbles if the cache mostly hits
- Just add a few threads to hide occasional cache-miss latencies
- Swap threads on cache misses

MIT Alewife (1990)

Modified SPARC chips

- Register windows hold different thread contexts

Up to four threads per node

Thread switch on local cache miss

IBM PowerPC RS64-IV (2000)

- Commercial coarse-grain multithreading CPU
- Based on PowerPC with a quad-issue, in-order, five-stage pipeline
- Each physical CPU supports two virtual CPUs
- On an L2 cache miss, the pipeline is flushed and execution switches to the second thread
  - Short pipeline minimizes flush penalty (4 cycles), small compared to memory access latency
  - Flushing the pipeline also simplifies exception handling

Oracle/Sun Niagara Processors

- Target is datacenters running web servers and databases, with many concurrent requests
- Provide multiple simple cores, each with multiple hardware threads: reduced energy per operation, though with much lower single-thread performance
- Niagara-1 [2004], 8 cores, 4 threads/core
- Niagara-2 [2007], 8 cores, 8 threads/core
- Niagara-3 [2009], 16 cores, 8 threads/core

Oracle/Sun Niagara-3, "Rainbow Falls" (2009)

Simultaneous Multithreading (SMT) for OoO Superscalars

- Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time
- SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. This gives better utilization of machine resources.

Multithreaded & vector processors 35 For most apps, most execution units lie idle in an OoO superscalar For an 8-way superscalar.

From: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-chip Parallelism”, ISCA 1995. Multithreaded & vector processors 36 Superscalar Machine Efficiency

(Figure: issue slots over time for a 4-wide machine. A completely idle cycle is vertical waste; a partially filled cycle, i.e., IPC < 4, is horizontal waste.)

Vertical Multithreading

(Figure: issue slots over time with a second thread interleaved cycle-by-cycle; partially filled cycles, i.e., IPC < 4, remain as horizontal waste.)

- What is the effect of cycle-by-cycle interleaving?
  - Removes vertical waste, but leaves some horizontal waste

Chip Multiprocessing (CMP)

(Figure: issue slots over time divided between two narrower processors.)

- What is the effect of splitting into multiple processors?
  - Reduces horizontal waste,
  - leaves some vertical waste, and
  - puts an upper limit on the peak throughput of each thread.

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

(Figure: issue slots over time with instructions from all threads freely mixed in every cycle.)

- Interleave multiple threads onto multiple issue slots with no restrictions

O-o-O Simultaneous Multithreading [Tullsen, Eggers, Emer, Levy, Stamm, Lo, DEC/UW, 1996]

- Add multiple contexts and fetch engines, and allow instructions fetched from different threads to issue simultaneously
- Utilize the wide out-of-order issue queue to find instructions to issue from multiple threads
- The OOO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize the whole machine

MARS-M: Truly First SMT System - Russia, 1988: demonstrated well before Tullsen's SMT paper, IBM, and Intel!

- Decoupled multithreaded architecture with Control, Address, and Execution multithreaded processors and a multi-ported, multi-bank memory subsystem
- Simultaneous multithreading in the Address and Execution VLIW processors
  - Up to 4 threads can issue VLIW instructions each cycle in each Address/Execution processor
- Coarse-grain multithreading in the Control VLIW processor, with 8 thread contexts
  - Thread switch on issuing load instructions (no cache)
- Three water-cooled cabinets
  - Memory-subsystem cabinet shown in the left photo
  - 739 boards with up to 100 MSI and LSI ECL ICs mounted on both sides

Dorozhevets, M.N. and Wolcott, P., "The El'brus-3 and MARS-M: Recent Advances in Russian High-performance Computing", The Journal of Supercomputing, USA, 6, pp. 5-48, 1992.

IBM Power 4

Single-threaded predecessor to Power 5. 8 execution units in out-of-order engine, each may issue an instruction each cycle.

Power 4 vs. Power 5

(Figure: Power 4 vs. Power 5 pipelines. Power 5 adds 2 fetch (PC) and 2 initial decode paths, one per thread, and 2 commit units with separate architected register sets.)

Power 5 data flow ...

Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck

Changes in Power 5 to Support SMT

- Increased associativity of the L1 instruction cache and the instruction address translation buffers
- Added per-thread load and store queues
- Increased size of the L2 (1.92 vs. 1.44 MB) and L3 caches
- Added separate instruction prefetch and buffering per thread
- Increased the number of virtual registers from 152 to 240
- Increased the size of several issue queues
- The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Pentium-4 Hyperthreading (2002)

- First commercial SMT design (2-way SMT)
  - Hyperthreading == SMT
- Logical processors share nearly all resources of the physical processor
  - Caches, execution units, branch predictors
- Die area overhead of hyperthreading ~ 5%
- When one logical processor is stalled, the other can make progress
  - No logical processor can use all entries in the queues when two threads are active
- A processor running only one active software thread runs at approximately the same speed with or without hyperthreading
- Hyperthreading was dropped on the OoO P6-based follow-ons to Pentium-4 (Pentium-M, Core Duo, Core 2 Duo), until revived with the Nehalem generation machines in 2008
- Intel Atom (in-order x86 core) has two-way vertical multithreading

Initial Performance of SMT

- Pentium 4 Extreme SMT yields 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
  - Pentium 4 is dual-threaded SMT
  - SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
- Running each of the 26 SPEC benchmarks paired with every other on Pentium 4 (26^2 runs): speedups from 0.90 to 1.58; the average was 1.20
- Power 5, 8-processor server: 1.23x faster for SPECint_rate with SMT, 1.16x faster for SPECfp_rate
- Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
  - Most gained some
  - Floating-point apps had the most cache conflicts and the least gains

SMT Adaptation to Parallelism Type

For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads. For regions with low TLP, the entire machine width is available for instruction-level parallelism (ILP).

(Figure: issue slots over time in the two regimes.)

Summary: Multithreaded Categories

(Figure: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading; slots show threads 1-5 and idle slots.)

SIMD Architectures

- Data-Level Parallelism (DLP): executing one operation on multiple data streams
- Example: multiplying a coefficient vector by a data vector (e.g., in filtering): y[i] := c[i] × x[i], 0 ≤ i < n (see the C sketch after this list)
- Sources of performance improvement:
  - One instruction is fetched & decoded for the entire operation
  - The multiplications are known to be independent
  - Pipelining/concurrency in memory access as well
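A minimal C sketch of the filtering example above, assuming SSE intrinsics are available (the function names and the n-divisible-by-4 restriction are illustrative assumptions, not from the original slides):

#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar version: one multiply per loop iteration. */
void filter_scalar(float *y, const float *c, const float *x, int n) {
    for (int i = 0; i < n; i++)
        y[i] = c[i] * x[i];
}

/* SIMD version: one MULPS instruction multiplies four floats at once.
   Assumes n is a multiple of 4; unaligned loads used for safety. */
void filter_simd(float *y, const float *c, const float *x, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 vc = _mm_loadu_ps(&c[i]);  /* load 4 coefficients */
        __m128 vx = _mm_loadu_ps(&x[i]);  /* load 4 data elements */
        _mm_storeu_ps(&y[i], _mm_mul_ps(vc, vx));
    }
}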

Intel's SIMD Instructions

Example: SIMD Array Processing

SSE Instruction Categories for Multimedia Support

- Intel processors are CISC (vs. RISC MIPS)
- SSE-2+ supports wider data types to allow 16 × 8-bit and 8 × 16-bit operands (a minimal sketch follows)
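An illustration in C of the packed-integer operand widths, assuming SSE2 intrinsics (the function name is an assumption for illustration):

#include <emmintrin.h>  /* SSE2 intrinsics */

/* One PADDB instruction adds 16 x 8-bit integers at once;
   _mm_add_epi16 would instead give 8 x 16-bit adds. */
__m128i add_16_bytes(__m128i a, __m128i b) {
    return _mm_add_epi8(a, b);
}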

Intel Architecture SSE2+ 128-Bit SIMD Data Types

XMM Registers

SSE/SSE2 Floating Point Instructions

SSE/SSE2 Floating Point Instructions (continued)

Example: Add Single Precision FP Vectors
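The original slide's code is not preserved in this text; a minimal C sketch of what it demonstrates, assuming SSE intrinsics (the name and the n-multiple-of-4 restriction are illustrative):

#include <xmmintrin.h>

/* Add two single-precision FP vectors; ADDPS adds 4 floats per instruction. */
void vadd(float *sum, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {    /* assumes n is a multiple of 4 */
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&sum[i], _mm_add_ps(va, vb));
    }
}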

Packed and Scalar Double-Precision Floating-Point Operations

Example: Image Converter

Intel SSE Intrinsics

Sample of SSE Intrinsics
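The slide's intrinsic listing is not preserved; a few representative SSE intrinsics in C for orientation (each compiles to roughly one instruction; the demo function name is an assumption):

#include <xmmintrin.h>

/* _mm_load_ps / _mm_store_ps   : MOVAPS, aligned 128-bit load/store
   _mm_loadu_ps / _mm_storeu_ps : MOVUPS, unaligned load/store
   _mm_add_ps                   : ADDPS, 4 packed SP adds
   _mm_mul_ps                   : MULPS, 4 packed SP multiplies */
__m128 axpy4(__m128 a, __m128 x, __m128 y) {
    return _mm_add_ps(_mm_mul_ps(a, x), y);  /* a*x + y in all 4 lanes */
}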

Example: 2 × 2 Matrix Multiply
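The worked example itself is lost from this extraction; a compact C sketch of a 2 × 2 double-precision matrix multiply in the same spirit, assuming SSE2 intrinsics and 16-byte-aligned, column-major matrices (the layout and names are assumptions):

#include <emmintrin.h>  /* SSE2: __m128d holds 2 doubles */

/* C = A * B for 2 x 2 matrices stored column-major. Each __m128d holds
   one column; column j of C is a0 * B[0][j] + a1 * B[1][j]. */
void mm_2x2(double *C, const double *A, const double *B) {
    __m128d a0 = _mm_load_pd(&A[0]);   /* column 0 of A (aligned load) */
    __m128d a1 = _mm_load_pd(&A[2]);   /* column 1 of A */
    for (int j = 0; j < 2; j++) {
        __m128d b0 = _mm_load1_pd(&B[2*j]);      /* broadcast B[0][j] */
        __m128d b1 = _mm_load1_pd(&B[2*j + 1]);  /* broadcast B[1][j] */
        __m128d cj = _mm_add_pd(_mm_mul_pd(a0, b0),
                                _mm_mul_pd(a1, b1));
        _mm_store_pd(&C[2*j], cj);     /* write column j of C */
    }
}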

Inner loop from gcc -O -S

Performance-Driven ISA Extensions

Architectures for Data Parallelism

- SONY/IBM PS3 Cell processor in 2006
- The current landscape: chips that deliver TeraOps/s in 2014, and how they differ:
  - E5-26xx v2: stretching the Xeon server approach for compute-intensive apps
  - GK110: nVidia's Kepler GPU, customized for compute applications
  - GM107: nVidia's Maxwell GPU, customized for energy efficiency

Sony/IBM Playstation PS3 Cell Chip - Released 2006

8 SPEs, 3.2 GHz clock, 200 GigaOps/s (peak)

Sony PS3 Cell Processor: SPE Floating-Point Unit (Single-Instruction Multiple-Data)

4 single-precision multiply-adds issue in lockstep (SIMD) per cycle, with 6-cycle latency.

At a 3.2 GHz clock, 4 multiply-adds = 8 FLOPs per cycle, so 3.2 GHz × 8 = 25.6 SP GFLOPS per SPE.

Intel Xeon Ivy Bridge E5-2697v2 (2013)

12-core Xeon Ivy Bridge: 0.52 TeraOps/s (130 W). Haswell: 1.04 TeraOps/s.

12 cores @ 2.7 GHz; each core can issue 16 single-precision operations per cycle (12 cores × 16 ops × 2.7 GHz ≈ 0.52 TeraOps/s).

$2,600 per chip.

Intel E5-2697v2 vs. Haswell

12 cores @ 2.7 GHz Each core can issue 16 single-precision operations per cycle.

Haswell cores issue 32 SP FLOPS/cycle.

How?

Advanced Vector Extension (AVX) Unit

Smaller than the L3 cache, but larger than the L2 cache. Its relative area has increased in Haswell.

Die closeup of one Sandy Bridge core

AVX: Not Just Single-Precision Floating-Point

AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc.

256-bit version -> double-precision vectors of length 4
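For instance, in C with AVX intrinsics (a minimal illustration; the function name is an assumption):

#include <immintrin.h>

/* One VADDPD on a 256-bit YMM register adds 4 doubles at once. */
__m256d add4_doubles(__m256d a, __m256d b) {
    return _mm256_add_pd(a, b);
}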

Sandy Bridge, Haswell (2014)

Sandy Bridge extends the register set to 256 bits: vectors are twice the size. x86-64 AVX/AVX2 has 16 registers (IA-32: 8).

Haswell adds 3-operand fused multiply-add (FMA) instructions, a*b + c. Two EX units with FMA -> 2X increase in ops/cycle. (A sketch follows.)
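A minimal C sketch of the FMA form, assuming FMA3 intrinsics (compile with -mfma; the name and the n-multiple-of-8 restriction are illustrative):

#include <immintrin.h>  /* AVX2 / FMA intrinsics */

/* y[i] += a[i] * b[i], 8 floats at a time; _mm256_fmadd_ps
   computes a*b + c in a single fused instruction. */
void fma_accumulate(float *y, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 8) {   /* assumes n is a multiple of 8 */
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vy = _mm256_loadu_ps(&y[i]);
        _mm256_storeu_ps(&y[i], _mm256_fmadd_ps(va, vb, vy));
    }
}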

Out-of-Order Issue in Haswell (2014)

Haswell has two copies of the FMA engine, on separate ports. Haswell sustains 4 micro-op issues per cycle. One possibility: 2 for AVX, and 2 for loads, stores, and bookkeeping.

Graphics Processing Units (GPUs)

- Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

- Basic idea:
  - Heterogeneous execution model
    - CPU is the host, GPU is the device
  - Develop a C-like programming language for the GPU
  - Unify all forms of GPU parallelism as the CUDA thread
  - Programming model is "Single Instruction Multiple Thread" (SIMT)

Kepler GK110 nVidia GPU

5.12 TeraOps/s: 2880 MACs @ 889 MHz, single-precision multiply-adds (2880 × 2 × 0.889 GHz ≈ 5.12 TeraOps/s). $999 GTX Titan Black with 6 GB GDDR5 (and 1 GPU).

SIMD Summary

- Intel SSE SIMD instructions
  - One instruction fetch that operates on multiple operands simultaneously
  - 256/128/64-bit multimedia registers (XMM & YMM)
  - Embed the SSE machine instructions directly into C programs through the use of intrinsics

Acknowledgements

These slides contain material developed and copyright by:

- Morgan Kaufmann (Elsevier, Inc.)
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
- Justin Hsia (UCB)
