EE141
EE141-Spring 2006
Digital Integrated Circuits
Design of an Execution Unit
Luke Tsai, AMD

Outline
- Introduction
- What is the Execution Unit?
- High Level Design Considerations
- Circuit Design of a Barrel Shifter
- “Real Life” Designs
Introduction
If you love EE141…
- Consider a career in Microprocessor Design
  - All aspects and variety of circuit design
  - Maximum complexity
  - Leading Edge Technology

What is an Execution Unit (EX)?
A Classical Processor Block Diagram
- Instruction Fetch (IF)
- Decode (DE)
- Scheduler (SC)
- Execution Unit (EX)
- Load-Store (LS)
- Floating Point (FPU)
- Memory (L2 Cache)

The EX Unit Implements the Integer Instruction Set
- Add* R1, R2
- Sub R1, R2
- Mult R1, R2
- Div R1, R2
- ROL R1, R2
- SAR R1, R2
- CLZ R1
* X86 notation. The first register is both a source and the destination.
Interface to the SC
- The SC issues instructions to the EX.
- An out-of-order SC needs to check for source dependencies.
- Example:
    Add R1, R2
    Sub R3, R1     (dependency on the Add: reads R1)
    Mult R4, R2    (no dependency, can issue in parallel)

Interface to the LS
- For Load/Store ops, the EX generates the address for the LS, which in turn sends/receives data to/from the EX.
- Address generation to load data return is a classical critical path in processor design.
- Example:
    Add R1, [R2]       (Load)
    Sub [R3], R1       (Store)
    Mult [R4], [R2]    (Load-Op-Store)
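The source-dependency check in the SC example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Instr` type and the `make` and `depends_on` helpers are hypothetical names, only register-to-register read-after-write dependencies are modeled, and a real out-of-order scheduler tracks considerably more state.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    op: str
    dst: str               # register written by the instruction
    srcs: Tuple[str, ...]  # registers read by the instruction

def make(op: str, first: str, second: str) -> Instr:
    # X86 notation from the slides: the first register is both a source
    # and the destination.
    return Instr(op, first, (first, second))

def depends_on(later: Instr, earlier: Instr) -> bool:
    """True if `later` reads the register that `earlier` writes."""
    return earlier.dst in later.srcs

add = make("Add", "R1", "R2")
sub = make("Sub", "R3", "R1")
mult = make("Mult", "R4", "R2")

print(depends_on(sub, add))   # True: Sub reads R1, which Add writes
print(depends_on(mult, add))  # False: Mult can issue in parallel with Add
```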
Execution Unit High Level Design Considerations

A Typical Block Diagram of EX
- Multi-ported Register File
- Operand Bus
- Bypass
- ALU0: Mult, Shifter, Adder, Div/CLZ/Popcnt
- ALU1..N
- AGen1..N
- Result Bus
Meeting the Performance Target
- IPC: which EX units to build, and how many of each
- Frequency: what type of circuit style
- Power: how much energy per operation
- Area: silicon real estate is expensive
- The design point is based on trade-offs among the above criteria.

Micro-Architecture Considerations
- How each instruction is executed
  - Pipeline
- Interface with the Scheduler
  - How to handle out-of-order execution
- Interface with the LS unit
  - How many cycles for the Agen-Data loop?
  - How to suppress speculative execution when load data is invalid?
Physical Design Considerations: Operand Bypass
- A bypass condition occurs when an operand of an instruction scheduled to execute in cycle n is generated in the immediately preceding cycle (n-1).
- The data for this operand does not yet reside in the register file and needs to be bypassed from one of the result buses.
- Example*:
    Add R1, R2
    Sub R3, R1     (bypass condition: R1 was produced in the preceding cycle)
    Mult R4, R2
* Actual execution sequence (not program order)

Physical Design Considerations: Floorplan
- The floorplan of an EX unit is a crucial piece of the design decision. It impacts:
  - Bus length (frequency, power)
  - Datapath pitch (frequency, power, area)
  - Bypass scheme (area, power)
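The operand bypass condition can be modeled as a tag match against the result buses before falling back to the register file. The sketch below is a behavioral illustration only; `read_operand`, the register values, and the single result bus are assumptions made for the example, not details from the slides.

```python
def read_operand(reg, regfile, result_buses):
    """Return the value of `reg` for an instruction executing in cycle n.

    result_buses holds (dest_reg, value) pairs produced in cycle n-1 that
    have not yet been written back to the register file.
    """
    for dest, value in result_buses:
        if dest == reg:       # bypass condition: destination tag matches
            return value
    return regfile[reg]       # otherwise the register file copy is current

# Hypothetical register contents, chosen only for illustration.
regfile = {"R1": 10, "R2": 3, "R3": 50, "R4": 7}

# Cycle n-1: Add R1, R2 puts R1 + R2 on a result bus.
result_buses = [("R1", regfile["R1"] + regfile["R2"])]

# Cycle n: Sub R3, R1 must take R1 from the result bus, not the register file.
sub_result = (read_operand("R3", regfile, result_buses)
              - read_operand("R1", regfile, result_buses))
print(sub_result)  # 37, using the bypassed R1 = 13 rather than the stale 10
```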
Circuit Design of a Barrel Shifter

What is a Barrel Shifter?
- Performs a shift or rotate on the full/partial data
- Example: 8 bit shifter

  Input Bit Position         7 6 5 4 3 2 1 0
  Rot Left 1                 6 5 4 3 2 1 0 7
  Rot Right 1                0 7 6 5 4 3 2 1
  Logical Shift Left 2       5 4 3 2 1 0 L L   (= mult by 4)
  Arithmetic Shift Left 2    5 4 3 2 1 0 L L   (same as above)
  Logical Shift Right 3      L L L 7 6 5 4 3
  Arithmetic Shift Right 3   7 7 7 7 6 5 4 3

  L = Low (zero)
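The rows of the 8 bit example can be checked with a small behavioral model. The function names below are hypothetical, and the code models only the bit movement shown in the table, not any circuit.

```python
N = 8
MASK = (1 << N) - 1

def rol(x, k):
    """Rotate left: bits leaving the top re-enter at the bottom."""
    k %= N
    return ((x << k) | (x >> (N - k))) & MASK

def ror(x, k):
    """Rotate right: bits leaving the bottom re-enter at the top."""
    k %= N
    return ((x >> k) | (x << (N - k))) & MASK

def shl(x, k):
    """Logical (and arithmetic) shift left: fill with zeros."""
    return (x << k) & MASK

def shr(x, k):
    """Logical shift right: fill with zeros."""
    return (x & MASK) >> k

def sar(x, k):
    """Arithmetic shift right: fill with copies of the sign bit (bit 7)."""
    sign = (x >> (N - 1)) & 1
    fill = (MASK << (N - k)) & MASK if sign else 0
    return ((x & MASK) >> k) | fill

x = 0b10110010
print(format(rol(x, 1), "08b"))  # 01100101
print(format(ror(x, 1), "08b"))  # 01011001
print(format(shl(x, 2), "08b"))  # 11001000
print(format(shr(x, 3), "08b"))  # 00010110
print(format(sar(x, 3), "08b"))  # 11110110
```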
Barrel Shifter Design
- Observe: any input bit could be passed to ALL output bit positions.
- Therefore: the shifter is nothing but a giant NxN mux, where N is the width of the data.
- The mux select is the one-hot decode of the shift amount.
[Figure: 8-bit mux array, inputs 7..0 routed to outputs 7..0, every mux driven by the 3-bit shift amount]

Barrel Shifter Implementations
1. Single-stage NxN mux
   - Fewest gates between input and output
   - Largest number of select signals (largest load for the shift amount)
2. Multi-stage mux
   - More stages = more gates between input and output
   - The reduction in select signals has diminishing returns. For 64 bit shifts:
     - 1 stage = 64 selects
     - 2 stages (8x8) = 16 selects (75% reduction)
     - 3 stages (4x4x4) = 12 selects (25% further reduction)
3. Mux implementation
   - Low swing passgate
   - Full swing domino
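A behavioral sketch of the one-hot mux view and of the select counts quoted for 64 bit shifts follows. It models a rotate-left only, the helper names (`one_hot`, `mux_stage`, `rotate_left`, `select_lines`) are hypothetical, and the way the shift amount is split across stages (low digit first) is an assumption made for the example rather than something the slides specify.

```python
def one_hot(value, width):
    """One-hot decode: `width` select lines, exactly one of them high."""
    return [1 if i == value else 0 for i in range(width)]

def mux_stage(bits, digit, radix, step):
    """One mux stage: every output bit is a radix:1 mux over inputs
    that sit 0, step, 2*step, ... positions below it (wrapping around)."""
    n = len(bits)
    sel = one_hot(digit, radix)
    out = []
    for j in range(n):
        v = 0
        for s in range(radix):
            v |= sel[s] & bits[(j - s * step) % n]
        out.append(v)
    return out

def rotate_left(bits, amount, radices):
    """Rotate left using one mux stage per entry in `radices`.
    bits[i] is bit position i (index 0 = LSB)."""
    step = 1
    for radix in radices:
        digit = (amount // step) % radix
        bits = mux_stage(bits, digit, radix, step)
        step *= radix
    return bits

def select_lines(radices):
    """Total one-hot select signals across all stages."""
    return sum(radices)

word = [(0x0123456789ABCDEF >> i) & 1 for i in range(64)]
for radices in ([64], [8, 8], [4, 4, 4]):
    print(radices, select_lines(radices))  # 64, then 16, then 12
```

Going from 64 to 16 selects is the 75% reduction, and going from 16 to 12 is the further 25%; the price in this model is one additional mux stage between input and output each time.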
Barrel Shifter Array
[Figure: one-stage mux array (inputs, selects, connection, outputs) next to a two-stage mux array (inputs, selects, connection, intermediate, turn 90°, outputs)]

Barrel Shifter Additional Complexity
1. Partial shifts/rotates
   - The X86 instruction set supports 8(L/H)/16/32/64 bit shifts.
2. Shift differs from rotate
   - Shifts fill in zeros or the sign bit.
   => How do you build a barrel shifter that does both shift and rotate?
3. Rotate could include the carry bit
   - X86 supports RCL/RCR (Rotate with Carry Left/Right).
   => A 64-bit RCL requires a 65-bit barrel shifter!
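One possible answer to the shift-versus-rotate question, sketched behaviorally: rotate first, then overwrite the wrapped-around bits with zeros or the sign bit. The same sketch shows why RCL needs one extra bit of width. All function names are hypothetical, shift counts are assumed to be smaller than the width, and the X86 rules for masking the count are not modeled.

```python
def rol_n(x, k, n):
    """Rotate an n-bit value left by k."""
    mask = (1 << n) - 1
    x &= mask
    k %= n
    return ((x << k) | (x >> (n - k))) & mask

def shl_via_rotate(x, k, n):
    """Logical shift left: rotate left, then zero the k wrapped-in low bits."""
    keep = ((1 << n) - 1) & ~((1 << k) - 1)
    return rol_n(x, k, n) & keep

def sar_via_rotate(x, k, n):
    """Arithmetic shift right: rotate right, then overwrite the top k bits
    with the sign bit."""
    rotated = rol_n(x, (n - k) % n, n)        # rotate right by k
    top = ((1 << k) - 1) << (n - k)           # the k wrapped-in high bits
    sign = (x >> (n - 1)) & 1
    return (rotated & ~top) | (top if sign else 0)

def rcl(x, carry, k, n):
    """Rotate through carry left: rotate the (n+1)-bit word {carry, x}.
    Returns (result, carry_out)."""
    wide = rol_n((carry << n) | x, k, n + 1)
    return wide & ((1 << n) - 1), wide >> n

print(hex(shl_via_rotate(0xB2, 2, 8)))  # 0xc8
print(hex(sar_via_rotate(0xB2, 3, 8)))  # 0xf6
print(rcl(1 << 63, 0, 1, 64))           # (0, 1): top bit moves into carry
```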
“Real Life” Designs

Robustness and Reliability
- Robustness: higher yield = higher profit margin
  - The circuit needs to function across PVT variation.
  - A chip target yield of 70% could require an EX yield of 99%.
  - What works in spice (w/o PVT) may not work in real life.
- Reliability: in addition to simulation for speed, a real design also checks
  - Noise
  - IR drop
  - Electro-migration
  - Inductive effects
  - …
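The slides do not say how the 70% and 99% figures are related. One simple way to see why a block needs a much higher yield than the chip is to assume the chip is made of k blocks that yield independently and equally, so that chip yield is the product of the block yields. Both the independence assumption and the block count of 35 used below are illustrative assumptions, not numbers from the slides.

```python
def chip_yield(block_yield, num_blocks):
    """Chip yield when all blocks must work and fail independently."""
    return block_yield ** num_blocks

def required_block_yield(target_chip_yield, num_blocks):
    """Per-block yield needed to reach a target chip yield."""
    return target_chip_yield ** (1.0 / num_blocks)

# Assumed: 35 equally sized, independent blocks (hypothetical count).
print(round(chip_yield(0.99, 35), 3))            # about 0.70
print(round(required_block_yield(0.70, 35), 3))  # about 0.99
```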
Process Variation
- Major culprits: threshold, channel length, channel width
  - In 45nm, Vth ~ ±150mV, ΔL ~ ±15%, ΔW ~ ±10% (for min devices). (The Idsat/Idoff relationships to variation are non-linear. Try it in spice.)
- Matching devices/paths: sense-amp, analog, memory cell stability, clock tree
- Increases leakage: 80% of chip leakage is caused by 20% of devices, which limits the usage of dynamic circuits
- Slows down critical paths
- Worsens hold-time requirements

Voltage/Temperature Variations
- Introduce more timing variations
- Increase noise
- Worsen cross chip matching (e.g. clock tree)
- Degrade reliability
[Figure legend: 1.072 V, 1.103 V, 1.134 V, 1.164 V, 1.194 V, 1.224 V]
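The non-linearity of Idoff noted under Process Variation can be illustrated with the usual first-order subthreshold model, in which off current changes by one decade for every S millivolts of threshold shift. The swing value of 100 mV/decade below is an assumed round number, not a figure from the slides, and this is no substitute for running the suggested spice experiment.

```python
S_MV_PER_DECADE = 100.0  # assumed subthreshold swing (illustrative value)

def ioff_ratio(delta_vth_mv, swing=S_MV_PER_DECADE):
    """Off current relative to nominal for a threshold shift in mV."""
    return 10.0 ** (-delta_vth_mv / swing)

low_vth = ioff_ratio(-150.0)   # roughly 31.6x the nominal leakage
high_vth = ioff_ratio(+150.0)  # roughly 0.03x the nominal leakage
average = (low_vth + high_vth) / 2

print(round(low_vth, 1), round(high_vth, 3), round(average, 1))
```

Under this model a symmetric ±150mV spread is far from symmetric in current: the low-Vth device leaks so much more than nominal that it dominates the average, which is consistent with a small fraction of devices accounting for most of the chip leakage.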