
4/26/2015

Lab #6, Homework #4

Lecture 35, 36, 37: Parallel Processing
April 24, 27, 29, 2015
Prof. R. Iris Bahar
Reading: Chapter 4, 6 of the Patterson & Hennessy textbook

 Lab #6 is now posted on the course website
    Adding I/O support to access SRAM memory in your
    Can be completed in teams of 2
    Due May 8th or later??
 HW #4 will be posted by Wednesday
    Covers , I/O, superscalar design
    Due May 11th??

 Class final was scheduled for May 15th (last day of finals!)
 Willing to extend deadlines, but prefer not to extend until the 15th, since grading takes time for both assignments

© 2015 R.I. Bahar
Portions of these slides taken from Professors S. Reda and D. Patterson

Parallel architectures

1. SIMD (data-level parallelism)
2. Superscalar (instruction-level parallelism)
3. Multi-cores (thread-level parallelism)

1. SIMD architectures

 Single Instruction Multiple Data (SIMD)
 Single instruction acts on multiple pieces of data at once (aka vector instructions)
 Common application: scientific computing, graphics
 Requires vector registers and multiple execution units
 Examples: Intel Core i7, AMD A10, IBM Power 8



2. Superscalar architectures

Superscalar vs. VLIW

 VLIW:
    Compiler groups instructions to be issued together into Very Long Instruction Words (VLIW)
    Packages them statically into “issue slots”
    Compiler detects and avoids hazards
 Superscalar:
    CPU examines instruction stream and chooses instructions to issue each cycle
    Compiler can help by reordering instructions
    CPU resolves hazards using advanced techniques at runtime
    Can be static in-order or dynamic out-of-order

[from Fisher et al.]

Instruction-Level Parallelism (ILP)

 Pipelining: executing multiple instructions in parallel
 To increase ILP:
    Deeper pipeline
       Less work per stage  shorter clock cycle
    Multiple issue
       Replicate pipeline stages  multiple pipelines
       Start multiple instructions per clock cycle
       CPI < 1, so use Instructions Per Cycle (IPC)
       E.g., 4 GHz 4-way multiple-issue  16 BIPS, peak CPI = 0.25, peak IPC = 4
       But dependencies reduce this in practice

Multiple Issue

 Static multiple issue
    Compiler groups instructions to be issued together
    Packages them into “issue slots”
    Compiler detects and avoids hazards
 Dynamic multiple issue
    CPU examines instruction stream and chooses instructions to issue each cycle
    Compiler can help by reordering instructions
    CPU resolves hazards using advanced techniques at runtime



Scheduling Static Multiple Issue

 Compiler must remove some/all hazards
    Reorder instructions into issue packets
    No dependencies within a packet
    Possibly some dependencies between packets
       Varies between ISAs; compiler must know!
    Pad with nop if necessary

MIPS with Static Dual Issue

 Two-issue packets
    One ALU/branch instruction
    One load/store instruction
    64-bit aligned
       ALU/branch first, then load/store
       Pad an unused instruction slot with nop

Address   Instruction type   Pipeline stages
n         ALU/branch         IF ID EX MEM WB
n + 4     Load/store         IF ID EX MEM WB
n + 8     ALU/branch            IF ID EX MEM WB
n + 12    Load/store            IF ID EX MEM WB
n + 16    ALU/branch               IF ID EX MEM WB
n + 20    Load/store               IF ID EX MEM WB

Pipeline Design for Dual Issue

Hazards in the Dual-Issue MIPS

 More instructions executing in parallel
 EX data hazard
    Forwarding avoided stalls with single-issue
    Now can’t use ALU result in load/store in the same packet
       add $t0, $s0, $s1
       lw  $s2, 0($t0)
    Split into two packets, effectively a stall
 Load-use hazard
    Still one cycle use latency, but now two instructions
 More aggressive scheduling required



Scheduling Example

 Schedule this for dual-issue MIPS:

Loop: lw   $t0, 0($s1)      # $t0 = array element
      add  $t0, $t0, $s2    # add scalar in $s2
      sw   $t0, 0($s1)      # store result
      addi $s1, $s1, –4     # decrement pointer
      bne  $s1, $zero, Loop # branch if $s1 != 0

      ALU/branch             Load/store        cycle
Loop: nop                    lw $t0, 0($s1)    1
      nop                    nop               2
      add $t0, $t0, $s2      nop               3
      addi $s1, $s1, –4      sw $t0, 0($s1)    4
      bne $s1, $zero, Loop   nop               5

 IPC = 5/5 = 1 (peak dual-issue IPC = 2; single-issue IPC = 5/6 ≈ .83)

Dynamic Multiple Issue

 “Superscalar” processors
 CPU decides whether to issue 0, 1, 2, … instructions each cycle
    Avoiding structural and data hazards
 Avoids the need for compiler scheduling
    Though it may still help
    Code semantics ensured by the CPU

Dynamic Pipeline Scheduling

 Allow the CPU to execute instructions out of order to avoid stalls
 But commit results to registers in order
 Example:
      lw   $t0, 20($s2)
      addu $t1, $t0, $t2
      sub  $s4, $s4, $t3
      slti $t5, $s4, 20
    Can start sub while addu is waiting for lw

Name dependency: WAR

      lw  $s0, 0($t0)
      …
      add $t0, $s1, $s2

      add $s4, $s2, $s0
      …
      sub $s2, $s1, $s3

      lw  $t2, 0($s2)
      …
      lw  $s2, 4($t0)

 Just a name dependency --- no values being transmitted
 Dependency can be removed by renaming registers (either by compiler or HW)


Name dependency: WAW

      lw  $s0, 0($t0)
      …
      add $s0, $s1, $s2

      add $s2, $s1, $s0
      …
      sub $s2, $t2, $t3

 Just a name dependency --- no values being transmitted
 Only need to preserve RAW dependencies
 Dependency can be removed by renaming registers (either by compiler or HW)

Register renaming in HW

 With only 32 architectural registers defined in the MIPS ISA, the compiler needs to reuse the same registers repeatedly
    Leads to potential WAR, WAW hazards if instructions are reordered
 HW solution: allocate extra physical registers for additional temporary storage
 Reservation stations: all instructions are allocated space in the reservation station after they are decoded
    If operand(s) available (in RF or elsewhere in pipeline), copy to reservation station entry
    If operand(s) not immediately available, RS entry will be updated by a functional unit
 Use in conjunction with a Reorder Buffer to write values back to architectural registers in order (thus avoiding WAW, WAR hazards)

Overall superscalar organization

• Dynamic Out-of-order (OoO) execution
• Multiple instructions fetched and decoded in parallel.
• Decoder checks for dependencies and renames registers to avoid WAW and WAR hazards.
• Instructions wait in a dispatch buffer until their operands are available (avoids RAW hazards).
• When ready, instructions are dispatched to the execution units.
• Re-order buffer puts instructions back in program order for WB.

Register Renaming

 Reservation stations and reorder buffer effectively provide register renaming
 On instruction issue to reservation station:
    If operand is available in register file or reorder buffer
       Copied to reservation station
       No longer required in the register; can be overwritten
    If operand is not yet available
       It will be provided to the reservation station by a functional unit
       Register update may not be required



Dynamic Rescheduling Example

 Reschedule this code for dual-issue MIPS:

Loop: lw   $t0, 0($s1)      # $t0 = array element
      add  $t0, $t0, $s2    # add scalar in $s2
      sw   $t0, 0($s1)      # store result
      addi $s1, $s1, –4     # decrement pointer
      bne  $s1, $zero, Loop # branch if $s1 != 0

      ALU/branch             Load/store        cycle
Loop: nop                    lw $t0, 0($s1)    1
      addi $s1, $s1, –4      nop               2
      add $t0, $t0, $s2      nop               3
      bne $s1, $zero, Loop   sw $t0, 4($s1)    4

 IPC = 5/4 = 1.25

Exposing ILP using loop unrolling

 Replicate loop body to expose more parallelism
    Reduces loop-control overhead
 Use different registers per replication
    Register renaming taken care of with reservation station and reorder buffer
    Avoids loop-carried WAR, WAW dependencies


Loop unrolling example

      ALU/branch             Load/store        cycle
Loop: addi $s1, $s1, –16     lw $t0, 0($s1)    1
      nop                    lw $t1, 12($s1)   2
      add $t0, $t0, $s2      lw $t2, 8($s1)    3
      add $t1, $t1, $s2      lw $t3, 4($s1)    4
      add $t2, $t2, $s2      sw $t0, 16($s1)   5
      add $t3, $t3, $s2      sw $t1, 12($s1)   6
      nop                    sw $t2, 8($s1)    7
      bne $s1, $zero, Loop   sw $t3, 4($s1)    8

 IPC = 14/8 = 1.75
    Closer to 2, but at a cost of registers and code size

Speculation

 “Guess” what to do with an instruction
    Start operation as soon as possible
    Check whether guess was right
       If so, complete the operation
       If not, roll back and do the right thing
 Common to static and dynamic multiple issue
 Examples
    Speculate on branch outcome
       Roll back if path taken is different
    Speculate on load
       Roll back if location is updated (with store) or if data was not in



Compiler/Hardware Speculation

 Compiler can reorder instructions
    e.g., move load before branch
    Can include “fix-up” instructions to recover from incorrect guess
 Hardware can look ahead for instructions to execute
    Buffer results until it determines they are actually needed
    Reorder buffer takes care of this with speculation
    Flush buffers on incorrect speculation

Cortex A8 and i7 Processor

                               ARM A8                   i7 920
Market                         Personal Mobile Device   Server, cloud
Thermal design power           2 Watts                  130 Watts
Clock rate                     1 GHz                    2.66 GHz
Cores/chip                     1                        4
Floating point?                No                       Yes
Multiple issue?                Dynamic                  Dynamic
Peak instructions/clock cycle  2                        4
Pipeline stages                14                       14
Pipeline schedule              Static in-order          Dynamic out-of-order
Branch prediction              2-level                  2-level
1st level caches/core          32 KiB I, 32 KiB D       32 KiB I, 32 KiB D
2nd level caches/core          128-1024 KiB             256 KiB
3rd level caches (shared)      -                        2-8 MB


ARM Cortex-A8 Pipeline

Core i7 Pipeline



Summary of superscalar architectures

 Pros:
    Improved single-thread throughput:
       hides memory latency;
       avoids or reduces stalls; and
       ability to fetch and execute multiple instructions per cycle.
 Cons:
    Impacts silicon area
    Impacts power consumption
    Impacts design complexity
 SW compilation techniques enable
    More ILP from the same HW
    Simpler HW (e.g., VLIW) at the expense of code portability

 Conclusion: great single-thread performance but at the expense of energy efficiency.

