Overview of Multiprocessors

CS 211 Computer Architecture: ILP to Multiprocessing

Review of Concept of ILP Processors
• Interest in multiple issue arose because we wanted to improve performance without affecting the uniprocessor programming model
• Taking advantage of ILP is conceptually simple, but the design problems are amazingly complex in practice
• Processors of the last 5 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the first dynamically scheduled, multiple-issue processors announced in 1995
  – Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units ⇒ performance 8 to 16X
• Peak vs. delivered performance gap is increasing

Next…
• Quick review of limits to ILP: Chapter 3
• Thread Level Parallelism: Chapter 3
  – Multithreading
  – Simultaneous Multithreading
• Multiprocessing: Chapter 4
  – Fundamentals of multiprocessors
    » Synchronization, memory, …
  – Chip-level multiprocessing: Multi-Core

Limits to ILP
• Conflicting studies of the amount of ILP
  – Benchmarks
  – Hardware sophistication
  – Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
  – Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
  – Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
  – Motorola AltiVec: 128-bit ints and FPs
  – SuperSPARC multimedia ops, etc.

Overcoming Limits
• Advances in compiler technology plus significantly new and different hardware techniques may be able to overcome the limitations assumed in the studies
• However, it is unlikely that such advances, when coupled with realistic hardware, will overcome these limits in the near future

Limits to ILP: Ideal Machine Assumptions
Assumptions for an ideal/perfect machine to start:
1. Register renaming – infinite virtual registers ⇒ all register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted (returns, case statements); 2 & 3 ⇒ no control dependencies, perfect speculation, and an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses are known, and a load can be moved before a store provided the addresses are not equal; 1 & 4 eliminate all but RAW hazards
5. Window size is infinite
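Assumption 1 can be made concrete with a small source-level illustration. This is only a sketch, not from the lecture: the variables and values are invented, and real renaming works on physical registers in hardware, but the dependence structure is the same.

```c
#include <stdio.h>

/* Illustration of assumption 1 (register renaming).
 * Reusing one temporary 't' creates a WAW hazard (two writes to t) and a
 * WAR hazard (the read 't * 2' must complete before the second write),
 * even though the two computations are independent.  Renaming - here
 * simply using two temporaries t0 and t1, as renaming hardware would do
 * with physical registers - leaves only true (RAW) dependences, so the
 * two chains could issue in parallel. */
int main(void) {
    int a = 3, b = 4, c = 5, d = 6;

    /* One shared name: WAW and WAR force the statements to stay ordered. */
    int t;
    t = a + b;
    int x = t * 2;
    t = c + d;            /* WAW with the first write, WAR with the read above */
    int y = t * 3;

    /* Renamed: independent names, independent dependence chains. */
    int t0 = a + b;
    int x_renamed = t0 * 2;
    int t1 = c + d;
    int y_renamed = t1 * 3;

    printf("%d %d %d %d\n", x, y, x_renamed, y_renamed);
    return 0;
}
```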
Some Numbers: Upper Limit to ILP, Ideal Machine (Figure 3.1)
• Initial MIPS HW model here; MIPS compilers
• Perfect caches; 1-cycle latency for all instructions (including FP multiply/divide); unlimited instructions issued per clock cycle
• [Figure 3.1: instructions per clock on the ideal machine – gcc 54.8, espresso 62.6, li 17.9 (Integer: 18 - 60); fpppp 75.2, doduc 118.7, tomcatv 150.1 (FP: 75 - 150)]

ILP Limitations: In Reality – Architecture “Parameters”
• Window size
  – A large window could expose more ILP, but more time is needed to examine the instruction packet to determine parallelism
• Register file size
  – The finite size of the register file (virtual and physical) introduces more name dependencies
• Branch prediction
  – Effects of a realistic branch predictor
• Memory aliasing
  – The idealized model assumes we can analyze all memory dependencies, but compile-time analysis cannot be perfect
  – Realistic models:
    » Global/stack: assume perfect prediction for global and stack data, but all heap references conflict
    » Inspection: deterministic compile-time references
      • e.g., R1(10) and R1(100) cannot conflict if R1 has not changed between the two accesses
    » None: all memory accesses are assumed to conflict

More Realistic HW: Window Impact (Figure 3.2)
• Change from an infinite window to 2048, 512, 128, and 32 entries
• [Figure 3.2: IPC vs. window size (Infinite, 2048, 512, 128, 32) for gcc, espresso, li, fpppp, doduc, tomcatv – Integer: 8 - 63, FP: 9 - 150]

More Realistic HW: Renaming Register Impact (Figure 3.5)
• N int + N fp + 64 renaming registers; change to a 2048-instruction window, 64-instruction issue, 8K 2-level prediction
• [Figure 3.5: IPC vs. number of renaming registers (Infinite, 256, 128, 64, 32, None) – Integer: 5 - 15, FP: 11 - 45]

More Realistic HW: Branch Impact (Figure 3.3)
• Change from an infinite window to 2048 entries and a maximum issue of 64 instructions per clock cycle
• [Figure 3.3: IPC vs. branch predictor (Perfect; tournament/selective predictor; standard 2-bit BHT (512); static profile-based; no prediction) – Integer: 6 - 12, FP: 15 - 60]
• [Misprediction rates for profile-based, standard 2-bit counter, and tournament predictors on tomcatv, doduc, fpppp, li, espresso, and gcc range from about 1% to 30%]

More Realistic HW: Memory Address Alias Impact (Figure 3.6)
• Change to a 2048-instruction window, 64-instruction issue, 8K 2-level prediction, and 256 renaming registers
• [Figure 3.6: IPC vs. alias model (Perfect; Global/stack perfect, heap conflicts; Inspection; None) – Integer: 4 - 9, FP: 4 - 45 (Fortran, no heap)]
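The gap between the Perfect, Global/stack, Inspection, and None models in Figure 3.6 comes from accesses that a compiler cannot disambiguate statically. A small C sketch of the problem; the function and values are invented for illustration and are not taken from the study's benchmarks.

```c
#include <stdio.h>

/* The compiler cannot tell at compile time whether *p and *q refer to the
 * same heap location, so under the "global/stack perfect" model the load
 * of *q may not be moved above the store to *p: the two references are
 * assumed to conflict.  Accesses to distinct constant offsets of the same
 * base, by contrast, can be disambiguated by inspection: a[10] and a[100]
 * can never overlap. */
int update(int *p, int *q, int a[]) {
    *p = a[10];        /* store through p                                  */
    int v = *q;        /* load through q: may alias *p, cannot be hoisted  */
    a[100] = v + 1;    /* provably does not conflict with a[10]            */
    return v;
}

int main(void) {
    int a[128] = {0};
    a[10] = 7;
    int x = 1, y = 2;
    printf("%d\n", update(&x, &y, a));   /* no aliasing in this call...      */
    printf("%d\n", update(&x, &x, a));   /* ...but here *p and *q do alias   */
    return 0;
}
```

The second call is exactly the case a compiler must conservatively assume for arbitrary heap references, which is why the "None" model collapses the integer codes to an IPC of about 4.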
How to Exceed the ILP Limits of This Study?
• These are not laws of physics; they are practical limits for today, and perhaps can be overcome via research
• Compiler and ISA advances could change the results
• WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not in memory usage
  – Conflicts can arise via the allocation of stack frames, as a called procedure reuses the memory addresses of a previous frame on the stack

HW vs. SW to Increase ILP
• Memory disambiguation: HW best
• Speculation:
  – HW best when dynamic branch prediction is better than compile-time prediction
  – Exceptions are easier to handle in HW
  – HW doesn’t need bookkeeping code or compensation code
  – Very complicated to get right
• Scheduling: SW can look ahead to schedule better
• Compiler independence: HW does not require a new compiler or recompilation to run well

Next… Thread Level Parallelism

Performance Beyond Single-Thread ILP
• There can be much higher natural parallelism in some applications (e.g., database or scientific codes)
• Explicit Thread Level Parallelism or Data Level Parallelism
• Thread: a process with its own instructions and data
  – A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
  – Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• Data Level Parallelism: perform identical operations on data, and lots of data
  – Example: vector operations, matrix computations

Thread Level Parallelism (TLP)
• ILP exploits implicit parallel operations within a loop or a straight-line code segment
• TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
• Goal: use multiple instruction streams to improve
  1. Throughput of computers that run many programs
  2. Execution time of multithreaded programs
• TLP could be more cost-effective to exploit than ILP
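To make the thread-level vs. data-level distinction concrete, a minimal sketch in C using POSIX threads; the arrays and work functions are invented for illustration. Each thread carries its own PC, register state, and stack and runs an independent instruction stream (TLP), while each loop body applies one operation to many elements, the pattern that data-level (vector) parallelism targets.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

/* Thread 1: an independent instruction stream computing c[i] = 2*a[i]. */
static void *scale(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) c[i] = 2.0 * a[i];   /* same op on lots of data: DLP */
    return NULL;
}

/* Thread 2: another independent stream computing b[i] = a[i] + 1. */
static void *offset(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) b[i] = a[i] + 1.0;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = (double)i;

    pthread_t t1, t2;
    pthread_create(&t1, NULL, scale, NULL);   /* two explicit threads: TLP */
    pthread_create(&t2, NULL, offset, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("%f %f\n", c[N - 1], b[N - 1]);
    return 0;
}
```

Compile with `cc -pthread`; the two threads share memory (the arrays) through the normal virtual memory mechanisms, as the next slide notes.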
New Approach: Multithreaded Execution
• Multithreading: multiple threads share the functional units of one processor via overlapping
  – The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
  – Memory is shared through the virtual memory mechanisms, which already support multiple processes
  – HW support for fast thread switch; much faster than a full process switch, which takes 100s to 1000s of clocks
• When to switch?
  – Fine grain: alternate instructions per thread
  – Coarse grain: when a thread is stalled, perhaps for a cache miss, another thread can be executed

Multithreaded Processing
• [Diagram: issue slots over time for Threads 1-5 under fine-grain and coarse-grain multithreading, with idle slots marked]

Fine-Grained Multithreading
• Switches between threads on each instruction, causing the execution of multiple threads to be interleaved
  – Usually done in a round-robin fashion, skipping any stalled threads
• The CPU must be able to switch threads every clock
• Advantage:
  – Can hide both short and long stalls, since instructions from other threads are executed when one thread stalls
• Disadvantage:
  – Slows down the execution of individual threads, since a thread ready to execute without stalls is delayed by instructions from other threads
• Used on Sun’s Niagara (see textbook)

Coarse-Grained Multithreading
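The coarse-grained policy was already stated on the "When to switch?" slide: stay on one thread and switch only when it hits a long stall such as a cache miss. The toy C model below contrasts it with the fine-grained round-robin policy above; the per-cycle ready/stalled traces are invented for illustration, and the model ignores the pipeline start-up cost a real coarse-grained switch pays.

```c
#include <stdio.h>

#define NTHREADS 3
#define NCYCLES  12

/* Each thread is a string of 'R' (ready to issue this cycle) and
 * 'S' (stalled, e.g. on a cache miss).  Traces are made up. */
static const char *trace[NTHREADS] = {
    "RRSSRRRRSSRR",
    "RRRRRRSSRRRR",
    "RSRRRRRRRRSR",
};

/* First ready thread at or after 'start' (wrapping), or -1 if all stalled. */
static int next_ready(int start, int cycle) {
    for (int k = 0; k < NTHREADS; k++) {
        int t = (start + k) % NTHREADS;
        if (trace[t][cycle] == 'R') return t;
    }
    return -1;
}

static void run(int fine_grained) {
    int current = 0;
    printf("%s-grained: ", fine_grained ? "fine" : "coarse");
    for (int cycle = 0; cycle < NCYCLES; cycle++) {
        int pick;
        if (fine_grained)
            /* switch every cycle: round-robin, skipping stalled threads */
            pick = next_ready((current + 1) % NTHREADS, cycle);
        else
            /* stay on the current thread; switch only when it stalls */
            pick = (trace[current][cycle] == 'R')
                       ? current
                       : next_ready((current + 1) % NTHREADS, cycle);
        if (pick >= 0) {
            printf("T%d ", pick);   /* issue slot filled from thread 'pick' */
            current = pick;
        } else {
            printf("-- ");          /* every thread stalled: idle slot */
        }
    }
    printf("\n");
}

int main(void) {
    run(1);   /* fine-grained schedule   */
    run(0);   /* coarse-grained schedule */
    return 0;
}
```

The printed schedules show the trade-off from the slides: fine-grained fills idle slots cycle by cycle but spreads each thread out, while coarse-grained lets one thread run at full speed until a long stall forces a switch.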
