Overview of Multiprocessors

Review of Concept of ILP Processors • Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model CS 211 Computer Architecture • Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice ILP to Multiprocessing • Processors of last 5 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple- issue processors announced in 1995 – Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units ⇒ performance 8 to 16X • Peak v. delivered performance gap increasing 2 Next… Limits to ILP • Conflicting studies of ILP amount • Quick review of Limits to ILP: Chapter 3 – Benchmarks • Thread Level Parallelism: Chapter 3 – Hardware sophistication – Multithreading – Compiler sophistication – Simultaneous Multithreading • How much ILP is available using existing • Multiprocessing: Chapter 4 mechanisms with increasing HW budgets? – Fundamentals of multiprocessors » Synchronization, memory, … • Do we need to invent new HW/SW – Chip level multiprocessing: Multi-Core mechanisms to keep on processor performance curve? – Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints – Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock – Motorola AltaVec: 128 bit ints and FPs – Supersparc Multimedia ops, etc. 3 4 NOW Handout Page 1 1 Overcoming Limits Limits to ILP • Advances in compiler technology + significantly new and different hardware Assumptions for ideal/perfect machine to start: techniques may be able to overcome 1. Register renaming – infinite virtual registers limitations assumed in studies => all register WAW & WAR hazards are avoided • However, unlikely such advances when 2. Branch prediction – perfect; no mispredictions coupled with realistic hardware will 3. Jump prediction – all jumps perfectly predicted overcome these limits in near future (returns, case statements) 2 & 3 ⇒ no control dependencies; perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysis – addresses known & a load can be moved before a store provided addresses not equal; 1&4 eliminates all but RAW 5. Window size is infinite 5 6 Some numbers… Upper Limit to ILP: Ideal Machine (Figure 3.1) • Initial MIPS HW Model here; MIPS compilers. 160 150.1 FP: 75 - 150 • perfect caches; 1 cycle latency for all 140 instructions (FP *,/); unlimited instructions Integer: 18 - 60 118.7 issued/clock cycle 120 100 75.2 80 62.6 60 54.8 40 17.9 20 0 Instructions Per Clock Instructions gcc espresso li fpppp doducd tomcatv Programs 7 8 NOW Handout Page 2 2 ILP Limitations: In Reality – ILP Limitations: Architecture “Parameters” Architecture“Parameters” • Window size • Window size – Large window size could lead to more ILP, but more time to examine – Large window size could lead to more ILP, but more time to instruction packet to determine parallelism examine instruction packet to determine parallelism • Register file size – Finite size of register file (virtual and physical) introduces more name dependencies • Branch prediction – Effects of realistic branch predictor • Memory aliasing – Idealized model assumes we can analyze all memory dependencies: but compile time analysis cannot be perfect – Realistic: » Global/stack: assume perfect predictions for global and stack data but all heap references will conflict » Inspection: deterministic compile time references • R1(10) and R1(100) cannot conflict if R1 has not changed between the two » None: all memory accesses are assumed to conflict 9 10 More Realistic HW: Window Impact ILP Limitations: Figure 3.2 Architecture“Parameters” Change from Infinite window 2048, 512, 128, 32 FP: 9 - 150 160 150 • Register file size – Finite size of register file (virtual and physical) introduces 140 more name dependencies 119 k 120 Integer: 8 - 63 100 75 80 63 61 59 60 55 IPC 60 49 45 41 36 35 40 34 Instructions Per Cloc 18 16 1513 1512 14 15 14 20 101088119 9 0 gcc espresso li fpppp doduc tomcatv Inf inite 2048 512 128 32 11 12 NOW Handout Page 3 3 More Realistic HW: ILP Limitations: Renaming Register Impact (N int + N fp +64) Architecture“Parameters” Figure 3.5 70 FP: 11 - 45 Change 2048 instr 59 60 window, 64 instr 54 49 50 issue, 8K 2 level 45 Prediction 44 • Branch prediction – Effects of realistic branch predictor 40 35 30 Integer: 5 - 15 29 28 IPC 20 20 16 15 15 15 13 12 12 12 11 11 11 10 10 10 10 9 6 7 5 5 5 555 5 4 4 4 0 gcc espresso li fpppp doducd tomcatv Program Infinite 256 128 64 32 None Infinite 256 128 64 32 None 13 14 More Realistic HW: Branch Impact Figure 3.3 Misprediction Rates Change from Infinite 61 FP: 15 -60 45 60 window to examine to 58 35% 2048 and maximum 30% 30% 50 issue of 64 instructions 48 46 46 per clock cycle 45 45 45 25% 23% 41 40 20% 18% 18% 35 16% 14% 14% Integer: 6 - 12 15% 30 29 12% 12% 10% 6% Misprediction Rate Misprediction IPC 20 19 5% 4% 16 3% 15 5% 2% 2% 13 14 1%1% 12 0% 10 10 9 0% 7 7 6 666 4 tomcatv doduc fpppp li espresso gcc 2 2 2 0 gcc espresso li fpppp doducd tomcatv Profile-based 2-bit counter Tournament Program Perfect Selective predictor Standard 2-bit Static None 15 16 Perfect Tournament BHT (512) Profile No prediction NOW Handout Page 4 4 ILP Limitations: More Realistic HW: Architecture“Parameters” Memory Address Alias Impact Figure 3.6 49 49 • Memory aliasing 50 45 45 – Idealized model assumes we can analyze all memory 45 Change 2048 instr dependencies: compile time analysis cannot be perfect 40 window, 64 instr FP: 4 - 45 – Realistic: issue, 8K 2 level 35 (Fortran, » Global/stack: assume perfect predictions for global and Prediction, 256 30 no heap) stack data but all heap references will conflict renaming registers » Inspection: deterministic compile time references 25 • R1(10) and R1(100) cannot conflict if R1 has not changed between the 20 Integer: 4 - 9 16 16 two 15 15 12 » None: all memory accesses are assumed to conflict 10 9 IPC 10 7 7 5 5 6 4 4 4 5 5 3 3 3 4 4 0 gcc espresso li fpppp doducd tomcatv Program Perfect Global/stack Perfect Inspection None Perfect Global/Stack perf; Inspec. None 17 heap conflicts Assem. 18 How to Exceed ILP Limits of this study? HW v. SW to increase ILP • These are not laws of physics; just practical limits • Memory disambiguation: HW best for today, and perhaps overcome via research • Speculation: • Compiler and ISA advances could change results – HW best when dynamic branch prediction • WAR and WAW hazards through memory: better than compile time prediction eliminated WAW and WAR hazards through – Exceptions easier for HW register renaming, but not in memory usage – Can get conflicts via allocation of stack frames as a called – HW doesn’t need bookkeeping code or procedure reuses the memory addresses of a previous frame compensation code on the stack – Very complicated to get right • Scheduling: SW can look ahead to schedule better • Compiler independence: does not require new compiler, recompilation to run well 19 20 NOW Handout Page 5 5 Next….Thread Level Parallelism Performance beyond single thread ILP • There can be much higher natural parallelism in some applications (e.g., Database or Scientific codes) • Explicit Thread Level Parallelism or Data Level Parallelism • Thread: process with own instructions and data – thread may be a process part of a parallel program of multiple processes, or it may be an independent program – Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute • Data Level Parallelism: Perform identical operations on data, and lots of data – Example: Vector operations, matrix computations 21 22 Thread Level Parallelism (TLP) New Approach: Mulithreaded Execution • ILP exploits implicit parallel operations • Multithreading: multiple threads to share the within a loop or straight-line code functional units of 1 processor via segment overlapping • TLP explicitly represented by the use of – processor must duplicate independent state of each thread multiple threads of execution that are e.g., a separate copy of register file, a separate PC, and for inherently parallel running independent programs, a separate page table – memory shared through the virtual memory mechanisms, • Goal: Use multiple instruction streams to which already support multiple processes improve – HW for fast thread switch; much faster than full process 1. Throughput of computers that run many switch ≈ 100s to 1000s of clocks programs • When switch? 2. Execution time of multi-threaded programs – Fine Grain: Alternate instruction per thread • TLP could be more cost-effective to – Coarse grain: When a thread is stalled, perhaps for a cache exploit than ILP miss, another thread can be executed 23 24 NOW Handout Page 6 6 Multithreaded Processing Fine-Grained Multithreading Fine Grain Coarse Grain • Switches between threads on each instruction, causing the execution of multiples threads to be interleaved – Usually done in a round-robin fashion, skipping any stalled threads • CPU must be able to switch threads every clock • Advantage: – can hide both short and long stalls, since instructions from other threads executed when one thread stalls • Disadvantage: – slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads • Used on Sun’s Niagara (see textbook) Thread 1 Thread 3 Thread 5 Thread 2 Thread 4 Idle slot 25 26 Course-Grained

Overview of Multiprocessors

Development of a Dynamically Extensible Spinnaker Chip Computing Module

Wind Rose Data Comes in the Form >200,000 Wind Rose Images

1 Introduction

Instruction Latencies and Throughput for AMD and Intel X86 Processors

Power4 Focuses on Memory Bandwidth IBM Confronts IA-64, Says ISA Not Important

NVIDIA CUDA on IBM POWER8: Technical Overview, Software Installation, and Application Development

Openpower AI CERN V1.Pdf

IBM Powerpc 970 (A.K.A. G5)

IBM Power Systems Performance Report Apr 13, 2021

Exam 1 Solutions

Introduction to the Cell Multiprocessor

IBM Power Roadmap