Lab #6, Homework #4 Parallel Processor Design 1. SIMD

4/26/2015

Lecture 35, 36, 37: Parallel Processing
  Reading: Chapter 4, 6 (Patterson & Hennessy textbook)
  April 24, 27, 29, 2015
  Prof. R. Iris Bahar
  © 2015 R.I. Bahar
  Portions of these slides taken from Professors S. Reda and D. Patterson

Lab #6, Homework #4
  • Lab #6 is now posted on the course website
    - Adding I/O support to access SRAM memory in your processor
    - Can be completed in teams of 2
    - Due May 8th or later??
  • HW #4 will be posted by Wednesday
    - Covers virtual memory, I/O, superscalar design
    - Due May 11th ??
  • Class final was scheduled for May 15th (last day of finals!)
    - Willing to extend deadlines, but prefer not to give until the 15th, since grading takes time for both assignments

Parallel Processor Design
  1. SIMD (data-level parallelism)
  2. Superscalar (instruction-level parallelism)
  3. Multi-cores (thread-level parallelism)
  [Figure: Intel Core i7, AMD A10, IBM Power 8]

1. SIMD architectures
  • Single Instruction Multiple Data (SIMD)
  • Single instruction acts on multiple pieces of data at once (aka vector instructions)
  • Common application: scientific computing, graphics
  • Requires vector register file and multiple execution units

2. Superscalar architectures
  [Figure from Fisher et al.]

Superscalar vs. VLIW
  • VLIW: Compiler groups instructions to be issued together into Very Large Instruction Words (VLIW)
    - Packages them statically into "issue slots"
    - Compiler detects and avoids hazards
  • Superscalar: CPU examines instruction stream and chooses instructions to issue each cycle
    - Compiler can help by reordering instructions
    - CPU resolves hazards using advanced techniques at runtime
    - Can be static in-order or dynamic out-of-order
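As a concrete picture of the SIMD slide above (a single instruction acting on multiple pieces of data at once), here is a minimal Python sketch. It only models the idea and is not real vector hardware or a real SIMD API: the lane count `LANES = 4` and the helper names `simd_add` and `vector_add` are illustrative assumptions, not part of the lecture.

```python
from typing import List, Tuple

LANES = 4  # assumed vector width; real SIMD units vary


def simd_add(a: List[float], b: List[float]) -> List[float]:
    """Model ONE vector instruction: the same add applied to every lane."""
    assert len(a) == len(b) == LANES
    return [x + y for x, y in zip(a, b)]


def vector_add(xs: List[float], ys: List[float]) -> Tuple[List[float], int]:
    """Add two arrays LANES elements at a time.

    Returns the result and the number of (modelled) vector instructions issued.
    """
    assert len(xs) == len(ys) and len(xs) % LANES == 0
    out: List[float] = []
    vector_instructions = 0
    for i in range(0, len(xs), LANES):
        out.extend(simd_add(xs[i:i + LANES], ys[i:i + LANES]))
        vector_instructions += 1
    return out, vector_instructions


result, count = vector_add([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8)
print(result, count)  # 8 element additions done with 2 vector instructions
```

A scalar loop would issue one add per element (8 here). The vector version issues one instruction per group of `LANES` elements, which is why the slide lists a vector register file and multiple execution units as requirements.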
Instruction-Level Parallelism (ILP)
  • Pipelining: executing multiple instructions in parallel
  • To increase ILP
    - Deeper pipeline
        Less work per stage => shorter clock cycle
    - Multiple issue
        Replicate pipeline stages => multiple pipelines
        Start multiple instructions per clock cycle
        CPI < 1, so use Instructions Per Cycle (IPC)
        E.g., 4GHz 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
        But dependencies reduce this in practice

Multiple Issue
  • Static multiple issue
    - Compiler groups instructions to be issued together
    - Packages them into "issue slots"
    - Compiler detects and avoids hazards
  • Dynamic multiple issue
    - CPU examines instruction stream and chooses instructions to issue each cycle
    - Compiler can help by reordering instructions
    - CPU resolves hazards using advanced techniques at runtime

Scheduling Static Multiple Issue
  • Compiler must remove some/all hazards
    - Reorder instructions into issue packets
    - No dependencies within a packet
    - Possibly some dependencies between packets
        Varies between ISAs; compiler must know!

MIPS with Static Dual Issue
  • Two-issue packets
    - One ALU/branch instruction
    - One load/store instruction
    - 64-bit aligned
        ALU/branch, then load/store
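The peak-rate example on the ILP slide (4 GHz, 4-way multiple issue) is plain arithmetic, worked here as a short Python sketch. The function name `peak_rates` is a hypothetical helper introduced only for this illustration.

```python
def peak_rates(clock_hz: float, issue_width: int):
    """Return (peak IPC, peak CPI, peak instructions per second).

    Peak values assume every issue slot is filled every cycle,
    which dependencies prevent in practice.
    """
    peak_ipc = issue_width            # instructions started per cycle
    peak_cpi = 1 / issue_width        # cycles per instruction
    peak_ips = clock_hz * issue_width
    return peak_ipc, peak_cpi, peak_ips


ipc, cpi, ips = peak_rates(4e9, 4)
print(ipc, cpi, ips / 1e9)  # 4 0.25 16.0  (16 billion instructions per second)
```

These are upper bounds only. The scheduling examples later in the lecture show how far real code falls below the peak.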
Pad an unused issue slot in a packet with nop if necessary. Pipeline timing of the two-issue packets:

  Address   Instruction type   Pipeline stages
  n         ALU/branch         IF  ID  EX  MEM WB
  n + 4     Load/store         IF  ID  EX  MEM WB
  n + 8     ALU/branch             IF  ID  EX  MEM WB
  n + 12    Load/store             IF  ID  EX  MEM WB
  n + 16    ALU/branch                 IF  ID  EX  MEM WB
  n + 20    Load/store                 IF  ID  EX  MEM WB

Pipeline Design for Dual Issue
  [Figure]

Hazards in the Dual-Issue MIPS
  • More instructions executing in parallel
  • EX data hazard
    - Forwarding avoided stalls with single-issue
    - Now can't use ALU result in load/store in same packet
        add $t0, $s0, $s1
        lw  $s2, 0($t0)
      Split into two packets, effectively a stall
  • Load-use hazard
    - Still one cycle use latency, but now two instructions
  • More aggressive scheduling required

Scheduling Example
  Schedule this for dual-issue MIPS

    Loop: lw   $t0, 0($s1)      # $t0=array element
          add  $t0, $t0, $s2    # add scalar in $s2
          sw   $t0, 0($s1)      # store result
          addi $s1, $s1, -4     # decrement pointer
          bne  $s1, $zero, Loop # branch $s1!=0

          ALU/branch              Load/store         cycle
    Loop: nop                     lw $t0, 0($s1)     1
          nop                     nop                2
          add $t0, $t0, $s2       nop                3
          addi $s1, $s1, -4       sw $t0, 0($s1)     4
          bne $s1, $zero, Loop    nop                5

  IPC = 5/5 = 1 (peak dual-issue IPC = 2; single-issue IPC = 5/6 = 0.83)

Dynamic Multiple Issue
  • "Superscalar" processors
  • CPU decides whether to issue 0, 1, 2, … each cycle
    - Avoiding structural and data hazards
  • Avoids the need for compiler scheduling
    - Though it may still help
    - Code semantics ensured by the CPU

Dynamic Pipeline Scheduling
  • Allow the CPU to execute instructions out of order to avoid stalls
    - But commit result to registers in order
  • Example
        lw   $t0, 20($s2)
        addu $t1, $t0, $t2
        sub  $s4, $s4, $t3
        slti $t5, $s4, 20
    - Can start sub while addu is waiting for lw

Name dependency: WAR
        lw  $s0, 0($t0)
        …
        add $t0, $s1, $s2
        add $s4, $s2, $s0
        …
        sub $s2, $s1, $s3
        lw  $t2, 0($s2)
        …
        lw  $s2, 4($t0)
  • Just a name dependency: no values being transmitted
  • Dependency can be removed by renaming registers (either by compiler or HW)

Name dependency: WAW
        lw  $s0, 0($t0)
        …
        add $s0, $s1, $s2
        add $s2, $s1, $s0
        …
        sub $s2, $t2, $t3
  • Just a name dependency: no values being transmitted
  • Dependency can be removed by renaming registers (either by compiler or HW)

Register Renaming
  • With only 32 architectural registers defined in the MIPS ISA, the compiler needs to reuse the same registers repeatedly
    - Leads to potential WAR, WAW hazards if instructions are reordered
  • HW solution: allocate extra physical registers for additional temporary storage
    - Reservation stations: all instructions are allocated space in the reservation station after they are decoded
    - Only need to preserve RAW dependencies
    - If operand(s) available (in RF or elsewhere in pipeline), copy to reservation station entry
    - If operand(s) not immediately available, RS entry will be updated by a functional unit
    - Use in conjunction with a Reorder Buffer to write values back to architectural registers in order (thus avoiding WAW, WAR hazards)

Register Renaming
  • Reservation stations and reorder buffer effectively provide register renaming
  • On instruction issue to reservation station
    - If operand is available in register file or reorder buffer
        Copied to reservation station
        No longer required in the register; can be overwritten
    - If operand is not yet available
        It will be provided to the reservation station by a functional unit

Overall superscalar organization
  • Dynamic Out-of-order (OoO) execution
  • Multiple instructions fetched and decoded in parallel.
  • Decoder checks for dependencies and renames registers to avoid WAW and WAR hazards.
  • Instructions wait in a dispatch buffer until their operands are available (avoids RAW hazards).
  • When ready, instructions are dispatched to the execution units.
  • Re-order buffer puts instructions back in program order for write-back (WB)
    - When a functional unit supplies an operand directly to a reservation station, a register update may not be required

Dynamic Rescheduling Example
  Reschedule code for dual-issue MIPS

    Loop: lw   $t0, 0($s1)      # $t0=array element
          add  $t0, $t0, $s2    # add scalar in $s2
          sw   $t0, 0($s1)      # store result
          addi $s1, $s1, -4     # decrement pointer
          bne  $s1, $zero, Loop # branch $s1!=0

          ALU/branch              Load/store         cycle
    Loop: nop                     lw $t0, 0($s1)     1
          addi $s1, $s1, -4       nop                2
          add $t0, $t0, $s2       nop                3
          bne $s1, $zero, Loop    sw $t0, 4($s1)     4

  IPC = 5/4 = 1.25

Exposing ILP using Loop Unrolling
  • Replicate loop body to expose more parallelism
    - Reduces loop-control overhead
  • Use different registers per replication
    - Register renaming taken care of with reservation station and reorder buffer
    - Avoids loop-carried WAR, WAW dependencies

Loop unrolling example

          ALU/branch              Load/store         cycle
    Loop: addi $s1, $s1, -16      lw $t0, 0($s1)     1
          nop                     lw $t1, 12($s1)    2
          add $t0, $t0, $s2       lw $t2, 8($s1)     3
          add $t1, $t1, $s2       lw $t3, 4($s1)     4
          add $t2, $t2, $s2       sw $t0, 16($s1)    5
          add $t3, $t3, $s2       sw $t1, 12($s1)    6
          nop                     sw $t2, 8($s1)     7
          bne $s1, $zero, Loop    sw $t3, 4($s1)     8

  IPC = 14/8 = 1.75
  Closer to 2, but at a cost of registers and code size

Speculation
  • "Guess" what to do with an instruction
    - Start operation as soon as possible
    - Check whether guess was right
        If so, complete the operation
        If not, roll back and do the right thing
  • Common to static and dynamic multiple issue
  • Examples
    - Speculate on branch outcome
        Roll back if path taken is different
    - Speculate on load
        Roll back if location is updated (with store) or if data was not in cache

Compiler/Hardware Speculation
  • Compiler can reorder instructions
    - e.g., move load before branch
    - Can include "fix-up" instructions to recover from incorrect guess
  • Hardware can look ahead for instructions to execute
    - Buffer results until it determines they are actually needed
        Reorder buffer takes care of this
    - Flush buffers on incorrect speculation

Cortex A8 and Intel i7

  Processor                       ARM A8                    Intel Core i7 920
  Market                          Personal Mobile Device    Server, cloud
  Thermal design power            2 Watts                   130 Watts
  Clock rate                      1 GHz                     2.66 GHz
  Cores/Chip                      1                         4
  Floating point?                 No                        Yes
  Multiple issue?                 Dynamic                   Dynamic
  Peak instructions/clock cycle   2                         4
  Pipeline stages                 14                        14
  Pipeline schedule               Static in-order           Dynamic out-of-order with speculation
  Branch prediction               2-level                   2-level
  1st level caches/core           32 KiB I, 32 KiB D        32 KiB I, 32 KiB D
  2nd level caches/core           128-1024 KiB              256 KiB
  3rd level caches (shared)       -                         2-8 MB

ARM Cortex-A8 Pipeline
  [Figure]

Core i7 Pipeline
  [Figure]

Summary of superscalar architectures
  • Pros: improved single-thread throughput
    - Hides memory latency
    - Avoids or reduces stalls
    - Can fetch and execute multiple instructions per cycle
  • Cons:
    - Impacts silicon area
    - Impacts power consumption
    - Impacts design complexity
  • SW compilation techniques
    - Enable more ILP from the same HW
    - Simplify HW (e.g., VLIW) at the expense of code portability
  • Conclusion: great single-thread performance but at the expense of energy efficiency.
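The summary's point that compilation techniques enable more ILP from the same hardware can be checked against the three dual-issue schedules in this lecture. The Python sketch below recomputes their IPC by counting non-nop slots per cycle. The packet representation (a pair of strings per cycle) and the names `ipc`, `naive`, `rescheduled`, and `unrolled` are illustrative assumptions; the instruction text is copied from the schedule tables.

```python
NOP = "nop"


def ipc(schedule):
    """Instructions per cycle for a dual-issue schedule.

    schedule: one (ALU/branch slot, load/store slot) pair per cycle.
    nop padding does not count as a useful instruction.
    """
    useful = sum(1 for packet in schedule for slot in packet if slot != NOP)
    return useful / len(schedule)


naive = [
    (NOP, "lw $t0, 0($s1)"),
    (NOP, NOP),
    ("add $t0, $t0, $s2", NOP),
    ("addi $s1, $s1, -4", "sw $t0, 0($s1)"),
    ("bne $s1, $zero, Loop", NOP),
]

rescheduled = [
    (NOP, "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4", NOP),
    ("add $t0, $t0, $s2", NOP),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]

unrolled = [
    ("addi $s1, $s1, -16", "lw $t0, 0($s1)"),
    (NOP, "lw $t1, 12($s1)"),
    ("add $t0, $t0, $s2", "lw $t2, 8($s1)"),
    ("add $t1, $t1, $s2", "lw $t3, 4($s1)"),
    ("add $t2, $t2, $s2", "sw $t0, 16($s1)"),
    ("add $t3, $t3, $s2", "sw $t1, 12($s1)"),
    (NOP, "sw $t2, 8($s1)"),
    ("bne $s1, $zero, Loop", "sw $t3, 4($s1)"),
]

for name, sched in [("naive", naive), ("rescheduled", rescheduled), ("unrolled", unrolled)]:
    print(name, ipc(sched))  # 1.0, 1.25, 1.75
```

The hardware is the same in all three cases (peak IPC = 2). Only the instruction order and the amount of unrolling change, at the cost of extra registers and code size for the unrolled loop.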
