CIS 371 Computer Organization and Design
Unit 3: Arithmetic
Based on slides by Profs. Amir Roth & Milo Martin & C.J. Taylor
CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 1
This Unit: Arithmetic
[Figure: system stack (Apps / System software / Mem, CPU, I/O)]
• A little review
  • Binary + 2s complement
  • Ripple-carry addition (RCA)
• Fast integer addition
  • Carry-select (CSeA)
  • Carry Lookahead Adder (CLA)
• Shifters
• Integer multiplication and division
• Floating point arithmetic
Readings
• P&H
  • Chapter 3
The Importance of Fast Arithmetic
[Figure: datapath (PC, +4, Insn Mem, Register File, ALU, Data Mem) with stage delays Tinsn-mem, Tregfile, TALU, Tdata-mem, Tregfile]
• Addition of two numbers is the most common operation
  • Programs use addition frequently
  • Loads and stores use addition for address calculation
  • Branches use addition to test conditions and calculate targets
  • All insns use addition to calculate default next PC
• Fast addition is critical to high performance
Review: Binary Integers
• Computers represent integers in binary (base 2)
    3 = 11, 4 = 100, 5 = 101, 30 = 11110
  + Natural since only two values are represented
  • Addition, etc. take place as usual (carry the 1, etc.)
      17 = 10001
     + 5 =   101
      22 = 10110
• Some old machines use decimal (base 10) with only 0/1
    30 = 0011 0000
  • Often called BCD (binary-coded decimal)
  – Unnatural for digital logic, implementation complicated & slow
  + More precise for currency operations
Fixed Width
• On pencil and paper, integers have infinite width
• In hardware, integers have fixed width
  • N bits: 16, 32 or 64
  • LSB is $2^0$, MSB is $2^{N-1}$
  • Range: 0 to $2^N - 1$
• Numbers $> 2^N - 1$ represented using multiple fixed-width integers
  • In software (e.g., Java BigInteger class)
What About Negative Integers?
• Option I: sign/magnitude
  • Unsigned plus one bit for sign
      10 = 000001010, -10 = 100001010
  + Matches our intuition from “by hand” decimal arithmetic
  – Representations of both +0 and –0
  – Addition is difficult
  • Symmetric range: $-(2^{N-1} - 1)$ to $2^{N-1} - 1$
• Option II: two’s complement (2C)
  • Leading 0s mean positive number, leading 1s negative
      10 = 00001010, -10 = 11110110
  + One representation for 0 (all zeroes)
  + Easy addition
  • Asymmetric range: $-2^{N-1}$ to $2^{N-1} - 1$
The Tao of 2C
• How did 2C come about?
  • “Let’s design a representation that makes addition easy”
  • Think of subtracting 10 from 0 by hand with 8-bit numbers
  • Have to “borrow” 1s from some imaginary leading 1
         0 = 100000000
      – 10 =  00001010
       -10 = 011110110
• Now, add the conventional way…
       -10 =  11110110
      + 10 =  00001010
         0 = 100000000
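The borrow idea above can be sketched in C. This is a minimal illustration, assuming 8-bit values held in `uint8_t`; the names `negate_by_borrow` and `add8` are made up for this example and do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Negate an 8-bit value by "borrowing" from an imaginary leading 1:
 * compute 1 00000000 - x in a wider type, then keep the low 8 bits. */
uint8_t negate_by_borrow(uint8_t x) {
    unsigned wide = 0x100u - x;
    uint8_t result = (uint8_t)(wide & 0xFFu);
    /* Invariant: x plus its negation is 0 modulo 2^8. */
    assert(((result + x) & 0xFFu) == 0);
    return result;
}

/* Conventional 8-bit addition; the carry out of bit 7 is dropped. */
uint8_t add8(uint8_t a, uint8_t b) {
    return (uint8_t)((a + b) & 0xFFu);
}
```

Calling `negate_by_borrow(10)` yields the bit pattern 11110110, and `add8` of that pattern with 10 gives 0, mirroring the two hand calculations.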
Still More On 2C
• What is the interpretation of 2C?
  • Same as binary, except MSB represents $-2^{N-1}$, not $2^{N-1}$
  • $-10 = 11110110 = -2^7 + 2^6 + 2^5 + 2^4 + 2^2 + 2^1$
  + Extends to any width
  • $-10 = 110110 = -2^5 + 2^4 + 2^2 + 2^1$
  • Why? $2^N = 2 \cdot 2^{N-1}$
  • $-2^5 + 2^4 + 2^2 + 2^1 = (-2^6 + 2 \cdot 2^5) - 2^5 + 2^4 + 2^2 + 2^1 = -2^6 + 2^5 + 2^4 + 2^2 + 2^1$
  • Equivalent to computing modulo $2^N$
• Trick to negating a number quickly: –B = B’ + 1
  • –(1) = (0001)’+1 = 1110+1 = 1111 = –1
  • –(–1) = (1111)’+1 = 0000+1 = 0001 = 1
  • –(0) = (0000)’+1 = 1111+1 = 0000 = 0
  • Think about why this works
Addition
1st Grade: Decimal Addition
     1
     43
   + 29
     72
• Repeat N times
  • Add least significant digits and any overflow from previous add
  • Carry “overflow” to next addition
  • Overflow: any digit other than least significant of sum
  • Shift two addends and sum one digit to the right
• Sum of two N-digit numbers can yield an N+1 digit number
Binary Addition: Works the Same Way
    1     111111
    43 = 00101011
  + 29 = 00011101
    72 = 01001000
• Repeat N times
  • Add least significant bits and any overflow from previous add
  • Carry the overflow to next addition
  • Shift two addends and sum one bit to the right
• Sum of two N-bit numbers can yield an N+1 bit number
– More steps (smaller base)
+ Each one is simpler (adding just 1 and 0)
  • So simple we can do it in hardware
The Half Adder
[Figure: half adder (HA) block with inputs A, B and outputs S, CO]
• How to add two binary integers in hardware?
• Start with adding two bits
  • When all else fails ... look at truth table
      A B = CO S
      0 0 =  0 0
      0 1 =  0 1
      1 0 =  0 1
      1 1 =  1 0
  • S = A^B
  • CO (carry out) = AB
  • This is called a half adder
The Other Half
[Figure: full adder (FA) block with inputs A, B, CI and outputs S, CO]
• We could chain half adders together, but to do that…
  • Need to incorporate a carry out from previous adder
      C A B = CO S
      0 0 0 =  0 0
      0 0 1 =  0 1
      0 1 0 =  0 1
      0 1 1 =  1 0
      1 0 0 =  0 1
      1 0 1 =  1 0
      1 1 0 =  1 0
      1 1 1 =  1 1
  • S = C’A’B + C’AB’ + CA’B’ + CAB = C ^ A ^ B
  • CO = C’AB + CA’B + CAB’ + CAB = CA + CB + AB
  • This is called a full adder
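The full-adder equations can be checked against ordinary integer addition. The sketch below is illustrative only; `full_adder` and `check_full_adder` are names invented for this example.

```c
#include <assert.h>

/* One-bit full adder using the equations above:
 *   S  = C ^ A ^ B
 *   CO = CA + CB + AB
 * All inputs are expected to be 0 or 1. */
void full_adder(int a, int b, int ci, int *s, int *co) {
    *s  = ci ^ a ^ b;
    *co = (ci & a) | (ci & b) | (a & b);
}

/* Walk all eight truth-table rows: CO and S together must equal A + B + C. */
void check_full_adder(void) {
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int ci = 0; ci <= 1; ci++) {
                int s, co;
                full_adder(a, b, ci, &s, &co);
                assert(2 * co + s == a + b + ci);
            }
}
```

Reading CO as the "twos" digit and S as the "ones" digit is what makes the single assertion in `check_full_adder` cover the whole truth table.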
Ripple-Carry Adder
[Figure: chain of full adders for bits 0 through 15; carry-in 0 at bit 0, CO out of bit 15]
• N-bit ripple-carry adder
  • N 1-bit full adders “chained” together
  • CO0 = CI1, CO1 = CI2, etc.
  • CI0 = 0
  • $CO_{N-1}$ is carry-out of entire adder
  • $CO_{N-1} = 1$ → “overflow”
• Example: 16-bit ripple carry adder
  • How fast is this?
  • How fast is an N-bit ripple-carry adder?
Quantifying Adder Delay
• Combinational logic dominated by gate delays
  • How many gates between input and output?
  • Array storage dominated by wire delays
  • Longest delay or critical path is what matters
• Can implement any combinational function in “2” logic levels
  • 1 level of AND + 1 level of OR (PLA)
  • NOTs are “free”: push to input (DeMorgan’s) or read from latch
  • Example: delay(FullAdder) = 2
    • d(CarryOut) = delay(AB + AC + BC)
    • d(Sum) = d(A ^ B ^ C) = d(AB’C’ + A’BC’ + A’B’C + ABC) = 2
    • Note ‘^’ means Xor (just like in C & Java)
• Caveat: “2” assumes gates have few (<8 ?) inputs
Ripple-Carry Adder Delay
• Longest path is to CO15 (or S15)
  • delay(CO15) = 2 + MAX(d(A15),d(B15),d(CI15))
  • d(A15) = d(B15) = 0, d(CI15) = d(CO14)
  • d(CO15) = 2 + d(CO14) = 2 + 2 + d(CO13) …
  • d(CO15) = 32
• $d(CO_{N-1}) = 2N$
  – Too slow!
  – Linear in number of bits
• Number of gates is also linear
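A ripple-carry adder and its delay count can be modeled bit by bit. This is a rough software sketch, not a hardware description; `struct rca_result` and `ripple_carry_add` are hypothetical names, and the delay model is simply the "2 per full adder" rule from the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Result of an n-bit ripple-carry add. */
struct rca_result {
    uint32_t sum;       /* low n bits of a + b */
    int carry_out;      /* CO of the last full adder */
    int gate_delay;     /* delay of that carry-out: 2 per full adder */
};

struct rca_result ripple_carry_add(uint32_t a, uint32_t b, int n) {
    assert(n >= 1 && n <= 32);
    struct rca_result r = {0, 0, 0};
    int carry = 0;                              /* CI0 = 0 */
    for (int i = 0; i < n; i++) {
        int ai = (a >> i) & 1, bi = (b >> i) & 1;
        int s = ai ^ bi ^ carry;                /* full-adder sum */
        carry = (ai & bi) | (ai & carry) | (bi & carry);
        r.sum |= (uint32_t)s << i;
        r.gate_delay += 2;                      /* d(CO_i) = 2 + d(CO_{i-1}) */
    }
    r.carry_out = carry;
    return r;
}
```

For the 8-bit example 43 + 29 this returns sum 72 with a delay of 16; a 16-bit add reports the delay of 32 derived above.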
Division
4th Grade: Decimal Division
       9      // quotient
  3 |29       // divisor | dividend
    -27
      2       // remainder
• Shift divisor left (multiply by 10) until MSB lines up with dividend’s
• Repeat until remaining dividend (remainder) < divisor
  • Find largest single digit q such that (q*divisor) <= dividend
  • Set LSB of quotient to q
  • Subtract (q*divisor) from dividend
  • Shift quotient left by one digit (multiply by 10)
  • Shift divisor right by one digit (divide by 10)
Binary Division
                   1001 = 9
    3 |29 = 0011 |011101
      -24 =     - 011000
        5 =       000101
      - 3 =     - 000011
        2 =       000010
Binary Division Hardware
• Same as decimal division, except (again)
  – More individual steps (base is smaller)
  + Each step is simpler
  • Find largest bit q such that (q*divisor) <= dividend
    • q = 0 or 1
  • Subtract (q*divisor) from dividend
    • q = 0 or 1 → no actual multiplication, subtract divisor or not
• Complication: largest q such that (q*divisor) <= dividend
  • How do you know if (1*divisor) <= dividend?
  • Human can “eyeball” this
  • Computer does not have eyeballs
  • Subtract and see if result is negative
Software Divide Algorithm
• Can implement this algorithm in software
• Inputs: dividend and divisor (unsigned 32-bit values)

    uint32_t remainder = 0, quotient = 0;
    for (int i = 0; i < 32; i++) {
        // shift the MSB of the dividend into the remainder
        remainder = (remainder << 1) | (dividend >> 31);
        if (remainder >= divisor) {           // divisor fits: quotient bit is 1
            quotient = (quotient << 1) | 1;
            remainder = remainder - divisor;
        } else {                              // divisor does not fit: quotient bit is 0
            quotient = (quotient << 1) | 0;
        }
        dividend = dividend << 1;             // expose the next dividend bit
    }
Divide Example
• Input: Divisor = 00011, Dividend = 11101

    Step  Remainder   Quotient  Remainder         Dividend
          (shifted)             (after subtract)
    0     00000       00000     00000             11101
    1     00001       00000     00001             11010
    2     00011       00001     00000             10100
    3     00001       00010     00001             01000
    4     00010       00100     00010             10000
    5     00101       01001     00010             00000

• Result: Quotient: 1001, Remainder: 10
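The worked example can be reproduced with a width-parameterized version of the software divide loop. This is a sketch under stated assumptions: `divide_nbit` is a hypothetical helper, and the width parameter `n` is an addition made here so that the 5-bit table can be replayed.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-subtract division on n-bit unsigned values.
 * Same loop as the 32-bit version, with the width as a parameter. */
void divide_nbit(uint32_t dividend, uint32_t divisor, int n,
                 uint32_t *quotient, uint32_t *remainder) {
    assert(n >= 1 && n <= 31 && divisor != 0);
    uint32_t mask = (1u << n) - 1;
    uint32_t q = 0, r = 0;
    dividend &= mask;
    for (int i = 0; i < n; i++) {
        /* shift the MSB of the n-bit dividend into the remainder */
        r = (r << 1) | ((dividend >> (n - 1)) & 1);
        if (r >= divisor) {          /* subtract succeeds: quotient bit 1 */
            q = (q << 1) | 1;
            r -= divisor;
        } else {                     /* quotient bit 0 */
            q = q << 1;
        }
        dividend = (dividend << 1) & mask;
    }
    *quotient = q;
    *remainder = r;
}
```

With dividend 11101 (29), divisor 00011 (3) and `n = 5`, the loop passes through the same remainder and quotient values as the table and ends with quotient 9 and remainder 2.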
Divider Circuit
[Figure: divider circuit with Divisor, Remainder, Dividend and Quotient registers and a subtractor (Sub, ">= 0" test); labels: "Shift in 0 or 1", "msb", "Shift in 0"]
• N cycles for an N-bit divide
Fast Addition
Bad idea: a PLA-based Adder?
• If any function can be expressed as two-level logic…
  • …why not use a PLA for an entire 8-bit adder?
• Not small
  • Approx. $2^{16}$ AND gates, each with 16 inputs
  • Then, 8 OR gates, each with $2^{16}$ inputs
  • Number of gates exponential in bit width!
• Not that fast, either
  • An AND gate with 65K inputs != 2-input AND gate
  • Many-input gates made from tree of, say, 4-input gates
  • $2^{16}$-input gates would have at least 8 logic levels
  • So, at least 10 levels of logic for a 16-input PLA
  • Even so, delay is still logarithmic in number of bits
• There are better (faster, smaller) ways
Theme: Hardware != Software
• Hardware can do things that software fundamentally can’t
  • And vice versa (of course)
• In hardware, it’s easier to trade resources for latency
• One example of this: speculation
  • Slow computation waiting for some slow input?
  • Input one of two things?
  • Compute with both (slow), choose right one later (fast)
• Does this make sense in software? Not on a uni-processor
  • Difference? hardware is parallel, software is sequential
Carry-Select Adder
[Figure: a 16-bit adder (16+) next to its carry-select version built from three 8-bit adders (8+) and a mux; annotated delays 16 and 18]
• Carry-select adder
  • Do A15-8+B15-8 twice, once assuming C8 (CO7) = 0, once = 1
  • Choose the correct one when CO7 finally becomes available
  + Effectively cuts carry chain in half (break critical path)
  – But adds mux
  • Delay?
Multi-Segment Carry-Select Adder
[Figure: carry-select adder with 5-, 5- and 6-bit segments (5+, 5+, 6+) and a serial mux chain; annotated delays 10, 12, 14]
• Multiple segments
  • Example: 5, 5, 6 bit = 16 bit
• Hardware cost
  • Still mostly linear (~2x)
  • Compute each segment with 0 and 1 carry-in
• Serial mux chain
• Delay
  • 5-bit adder (10) + Two muxes (4) = 14
Carry-Select Adder Delay
• What is carry-select adder delay (two segment)?
  • d(CO15) = MAX(d(CO15-8), d(CO7-0)) + 2
  • d(CO15) = MAX(2*8, 2*8) + 2 = 18
  • In general: 2*(N/2) + 2 = N+2 (vs 2N for RCA)
• What if we cut adder into 4 equal pieces?
  • Would it be 2*(N/4) + 2 = 10? Not quite
  • d(CO15) = MAX(d(CO15-12),d(CO11-0)) + 2
  • d(CO15) = MAX(2*4, MAX(d(CO11-8),d(CO7-0)) + 2) + 2
  • d(CO15) = MAX(2*4,MAX(2*4,MAX(d(CO7-4),d(CO3-0)) + 2) + 2) + 2
  • d(CO15) = MAX(2*4,MAX(2*4,MAX(2*4,2*4) + 2) + 2) + 2
  • d(CO15) = 2*4 + 3*2 = 14
• N-bit adder in M equal pieces: 2*(N/M) + (M–1)*2
  • 16-bit adder in 8 parts: 2*(16/8) + 7*2 = 18
16b Carry-Select Adder Delay
[Figure: plot of gate delay (y-axis, 0 to 35) versus number of segments M (x-axis, 1 to 16)]
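The equal-pieces formula 2*(N/M) + (M–1)*2 is easy to tabulate. The sketch below covers only values of M that divide N evenly, since those are the cases the formula is stated for; how the plotted curve treats other values of M is not recoverable from the text. The names `carry_select_delay` and `best_equal_split` are illustrative.

```c
#include <assert.h>

/* Gate delay of an n-bit carry-select adder cut into m equal segments:
 * a ripple-carry segment costs 2 per bit, each of the m-1 muxes costs 2. */
int carry_select_delay(int n, int m) {
    assert(n >= 1 && m >= 1 && n % m == 0);
    return 2 * (n / m) + 2 * (m - 1);
}

/* Among the segment counts that divide n evenly, return the one with the
 * smallest delay (the first one found if there is a tie). */
int best_equal_split(int n) {
    int best_m = 1;
    for (int m = 2; m <= n; m++)
        if (n % m == 0 &&
            carry_select_delay(n, m) < carry_select_delay(n, best_m))
            best_m = m;
    return best_m;
}
```

For a 16-bit adder the delays for M = 1, 2, 4, 8, 16 come out as 32, 18, 14, 18, 32: too few segments leaves long ripple chains, too many leaves a long mux chain.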
Another Option: Carry Lookahead
• Is carry-select adder as fast as we can go?
  • Nope!
• Another approach to using additional resources
  • Instead of redundantly computing sums assuming different carries
  • Use redundancy to compute carries more quickly
  • This approach is called carry lookahead (CLA)
A & B → Generate & Propagate
[Figure: carry-out logic for inputs A, B, C redrawn with G and P signals producing CO]
• Let’s look at the single-bit carry-out function
  • CO = AB+AC+BC
  • Factor out terms that use only A and B (available immediately)
  • CO = (AB)+(A+B)C
  • (AB): generates carry-out regardless of incoming C → rename to G
  • (A+B): propagates incoming C → rename to P
  • CO = G+PC
Infinite Hardware CLA
• Can expand C1…CN in terms of G’s, P’s, and C0
• Example: C16
  • C16 = G15+P15C15
  • C16 = G15+P15(G14+P14C14)
  • C16 = G15+P15G14+ … + P15P14…P2P1G0 + P15P14…P2P1P0C0
• Similar expansions for C15, C14, etc.
• How much does this cost?
  • CN needs: N AND’s + 1 OR, largest have N+1 inputs
  • CN…C1 needs: N*(N+1)/2 AND’s + N OR’s, max N+1 inputs
  • N=16: 152 total gates, max input 17
  • N=64: 2144 total gates, max input 65
• And how fast is it really?
  – Not that fast, unfortunately, 17-input gates are really slow
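The fully expanded carry equation and its gate count can be written out directly. This is a software sketch of the equation rather than a circuit; `lookahead_carry` and `lookahead_gate_count` are names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Carry into bit k (C_k) from the expanded lookahead equation:
 *   C_k = G_{k-1} + P_{k-1}G_{k-2} + ... + P_{k-1}...P_1 G_0 + P_{k-1}...P_0 C_0
 * with G_i = A_i B_i and P_i = A_i + B_i. Every term depends only on
 * A, B and C_0, never on another carry. */
int lookahead_carry(uint32_t a, uint32_t b, int c0, int k) {
    assert(k >= 0 && k <= 32 && (c0 == 0 || c0 == 1));
    uint32_t g = a & b, p = a | b;
    int carry = 0;
    int prefix = 1;                      /* running product P_{k-1} ... P_{j+1} */
    for (int j = k - 1; j >= 0; j--) {
        carry |= prefix & (int)((g >> j) & 1);   /* term ending in G_j */
        prefix &= (int)((p >> j) & 1);
    }
    carry |= prefix & c0;                /* term ending in C_0 */
    return carry;
}

/* Gates needed for C_1..C_N this way: N(N+1)/2 ANDs plus N ORs. */
int lookahead_gate_count(int n) {
    return n * (n + 1) / 2 + n;
}
```

The loop is sequential only because it is software; in hardware each term is its own AND gate, which is where the quadratic gate count and the N+1 input gates come from.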
Compromise: Multi-Level CLA
• Ripple carry
  + Few small gates: 2N gates, max 3 inputs
  – Laid in series: 2N (linear) latency
• Infinite hardware CLA
  – Many big gates: N*(N+3)/2 additional gates, max N+1 inputs
  + Laid in parallel: constant 5 latency
• Is there a compromise?
  • Reasonable number of small gates?
  • Sublinear (doesn’t have to be constant) latency?
  • Yes, multi-level CLA: exploits hierarchy to achieve this
Carry Lookahead Adder (CLA)
[Figure: two-level 4-bit CLA: per-bit G/P from A0-A3 and B0-B3; first-level blocks produce G1-0/P1-0 and G3-2/P3-2; second-level block produces G3-0/P3-0; carries C1 to C4]
• Calculate “propagate” and “generate” based on A, B
  • Not based on carry in
• Combine with tree structure
• Take aways
  • Tree gives logarithmic delay
  • Reasonable area
Two-Level CLA for 4-bit Adder
• Individual carry equations
  • C1 = G0+P0C0, C2 = G1+P1C1, C3 = G2+P2C2, C4 = G3+P3C3
• Infinite hardware CLA equations
  • C1 = G0+P0C0
  • C2 = G1+P1G0+P1P0C0
  • C3 = G2+P2G1+P2P1G0+P2P1P0C0
  • C4 = G3+P3G2+P3P2G1+P3P2P1G0+P3P2P1P0C0
• Hierarchical CLA equations
  • First level: expand C2 using C1, C4 using C3
    • C2 = G1+P1(G0+P0C0) = (G1+P1G0)+(P1P0)C0 = G1-0+P1-0C0
    • C4 = G3+P3(G2+P2C2) = (G3+P3G2)+(P3P2)C2 = G3-2+P3-2C2
  • Second level: expand C4 using expanded C2
    • C4 = G3-2+P3-2(G1-0+P1-0C0) = (G3-2+P3-2G1-0)+(P3-2P1-0)C0
    • C4 = G3-0+P3-0C0
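The hierarchical equations can be exercised in software by combining generate/propagate pairs level by level. A minimal sketch, assuming a 4-bit adder; `struct gp`, `combine`, `cla4_carry_out` and `check_cla4` are hypothetical names.

```c
#include <assert.h>

/* A generate/propagate pair for one bit or one group of bits. */
struct gp { int g, p; };

/* Merge two adjacent groups (hi is the more significant one):
 *   G = G_hi + P_hi G_lo,  P = P_hi P_lo */
struct gp combine(struct gp hi, struct gp lo) {
    struct gp out = { hi.g | (hi.p & lo.g), hi.p & lo.p };
    return out;
}

/* C4 of a 4-bit add, computed hierarchically:
 *   first level:  G1-0/P1-0 and G3-2/P3-2
 *   second level: G3-0/P3-0, then C4 = G3-0 + P3-0 C0 */
int cla4_carry_out(unsigned a, unsigned b, int c0) {
    struct gp bit[4];
    for (int i = 0; i < 4; i++) {
        int ai = (a >> i) & 1, bi = (b >> i) & 1;
        bit[i].g = ai & bi;
        bit[i].p = ai | bi;
    }
    struct gp g10 = combine(bit[1], bit[0]);
    struct gp g32 = combine(bit[3], bit[2]);
    struct gp g30 = combine(g32, g10);
    return g30.g | (g30.p & c0);
}

/* Exhaustive check against ordinary addition. */
void check_cla4(void) {
    for (unsigned a = 0; a < 16; a++)
        for (unsigned b = 0; b < 16; b++)
            for (int c0 = 0; c0 <= 1; c0++)
                assert(cla4_carry_out(a, b, c0) == (int)((a + b + c0) >> 4));
}
```

The same `combine` step serves both levels, which reflects the point that group G/P signals behave like the G/P of a single wider "bit".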
Two-Level CLA for 4-bit Adder
• Top first-level CLA block
  • Input: C0, G0, P0, G1, P1
  • Output: C1, G1-0, P1-0
• Bottom first-level CLA block
  • Input: C2, G2, P2, G3, P3
  • Output: C3, G3-2, P3-2
• Second-level CLA block
  • Input: C0, G1-0, P1-0, G3-2, P3-2
  • Output: C2, G3-0, P3-0
• These 3 blocks are “the same”
• C4 block
  • Input: C0, G3-0, P3-0
  • Output: C4
4-bit CLA Timing: d0
• Which signals are ready at 0 gate delays (d0)?
  • C0
  • A0, B0
  • A1, B1
  • A2, B2
  • A3, B3
4-bit CLA Timing: d1
• Signals ready from before
  • d0: C0, Ai, Bi
• New signals ready at d1
  • P0 = A0+B0, G0 = A0B0
  • P1 = A1+B1, G1 = A1B1
  • P2 = A2+B2, G2 = A2B2
  • P3 = A3+B3, G3 = A3B3
4-bit CLA Timing: d3
• Signals ready from before
  • d0: C0, Ai, Bi
  • d1: Pi, Gi
• New signals ready at d3
  • P1-0 = P1P0
  • G1-0 = G1+P1G0
  • C1 = G0+P0C0
  • P3-2 = P3P2
  • G3-2 = G3+P3G2
  • C3 is not ready
4-bit CLA Timing: d5
• Signals ready from before
  • d0: C0, Ai, Bi
  • d1: Pi, Gi
  • d3: P1-0, G1-0, C1, P3-2, G3-2
• New signals ready at d5
  • P3-0 = P3-2P1-0
  • G3-0 = G3-2+P3-2G1-0
  • C2 = G1-0+P1-0C0
4-bit CLA Timing: d7
• Signals ready from before
  • d0: C0, Ai, Bi
  • d1: Pi, Gi
  • d3: P1-0, G1-0, C1, P3-2, G3-2
  • d5: P3-0, G3-0, C2
• New signals ready at d7
  • C3 = G2+P2C2
  • C4 = G3-0+P3-0C0
• Si ready 2 gate delays after Ci
Two-Level CLA for 4-bit Adder
• Hardware?
  • CLA blocks: 5 gates, max 3 inputs (infinite CLA with N=2)
    • 3 of these
  • C4 block: 2 gates, max 2 inputs
  • Total: 17 gates/3 inputs
  • Infinite: 14/5
• Latency?
  • 2 for “outer” CLA, 4 for “interior”
  • G/P go “up”, C go “down”
  • Total: 9 (1 for GP, 2 for S)
  • Infinite: 5
• 2L: more gates and slower?
Individual G & P → Group G & P
• Group G/P: convenient abstraction for multi-level CLA
• Individual carry equations
  • C1 = G0+P0C0, C2 = G1+P1C1
• Infinite hardware CLA equations
  • C1 = G0+P0C0
  • C2 = G1+P1G0+P1P0C0
• Group terms
  • C2 = (G1+P1G0)+(P1P0)C0
  • C2 = G1-0+P1-0C0
• G1-0, P1-0 are group G & P: treat multiple bits as a single bit
  • G1-0: carry-out generated by 1-0 bits
  • P1-0: carry-out propagated by 1-0 bits
Two-Level CLA for 16-bit Adder
[Figure: two-level 16-bit CLA: four first-level blocks (group G/P for bits 3-0, 7-4, 11-8, 15-12; carries C1,2,3, C5,6,7, C9,10,11, C13,14,15) and one second-level block (G15-0, P15-0; carries C4,8,12), with C0 in and C16 out]
• 4 G/P inputs per level
• Hardware?
  • First level: 14 gates * 4 blocks
  • Second level: 14 gates * 1 block
  • Total: 70 gates
  • largest gate: 5 inputs
  • Infinite: 152 gates, 17 inputs
• Latency?
  • Total: 9 (1 + 2 + 2 + 2 + 2)
  • Infinite: 5
• CLA for a 64-bit adder?
16-bit CLA Timing: d0
• What is ready after 0 gate delays?
  • C0
16-bit CLA Timing: d1
• What is ready after 1 gate delay?
  • Individual G/P
16-bit CLA Timing: d3
• What is ready after 3 gate delays?
  • 1st level group G/P
  • Interior carries of 1st group
    • C1, C2, C3
16-bit CLA Timing: d5
• And after 5 gate delays?
  • Outer level group G/P
  • Outer level “interior” carries
    • C4, C8, C12
16-bit CLA Timing: d7
• And after 7 gate delays?
  • C16
  • First level “interior” carries
    • C5, C6, C7
    • C9, C10, C11
    • C13, C14, C15
  • Essentially, all remaining carries
• Si ready 2 gate delays after Ci
  • All sum bits ready after 9 delays!
• Same as 2-level 4-bit CLA
  • All 2-level CLAs have similar delay structure
Adders In Real Processors
• Real processors super-optimize their adders
  • Ten or so different versions of CLA
  • Highly optimized versions of carry-select
  • Other gate techniques: carry-skip, conditional-sum
  • Sub-gate (transistor) techniques: Manchester carry chain
  • Combinations of different techniques
    • Alpha 21264 used CLA+CSeA+RippleCA
    • Used at different levels
• Even more optimizations for incrementers
  • Why?
Subtraction: Addition’s Tricky Pal
• Sign/magnitude subtraction is mental reverse addition
  • 2C subtraction is addition
• How to subtract using an adder?
  • sub A B = add A -B
  • Negate B before adding (fast negation trick: –B = B’ + 1)
• Isn’t a subtraction then a negation and two additions?
  + No, an adder can implement A+B+1 by setting the carry-in to 1
[Figure: adder with inputs A and B; B optionally inverted (~); carry-in 0 or 1]
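The carry-in trick can be sketched as a combined add/subtract unit. This is an illustration, assuming 16-bit operands; `adder16` stands in for the hardware adder, and both function names are invented here.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit adder with an explicit carry-in; the carry-out is dropped. */
uint16_t adder16(uint16_t a, uint16_t b, int carry_in) {
    assert(carry_in == 0 || carry_in == 1);
    return (uint16_t)(a + b + carry_in);
}

/* sub = 0 computes A + B; sub = 1 computes A - B.
 * For subtraction B is inverted and the carry-in supplies the "+1"
 * of -B = B' + 1, so only one pass through the adder is needed. */
uint16_t add_sub16(uint16_t a, uint16_t b, int sub) {
    uint16_t b_in = sub ? (uint16_t)~b : b;
    return adder16(a, b_in, sub ? 1 : 0);
}
```

`add_sub16(29, 3, 1)` gives 26, and `add_sub16(3, 29, 1)` gives 65510, which is the 16-bit two's complement pattern for -26.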
Shifts & Rotates
Shift and Rotation Instructions
• Left/right shifts are useful…
  • Fast multiplication/division by small constants (next)
  • Bit manipulation: extracting and setting individual bits in words
• Right shifts
  • Can be logical (shift in 0s) or arithmetic (shift in copies of MSB)
      srl 110011, 2 = 001100
      sra 110011, 2 = 111100
  • Caveat: for negative numbers, sra is not equal to division by a power of 2
  • Consider: -1 / 16 = ?
• Rotations are less useful…
  • But almost “free” if shifter is there
• MIPS and LC4 have only shifts, x86 has shifts and rotations
Compiler Opt: Strength Reduction
• Strength reduction: compilers will do this (sort of)
    A * 4 = A << 2
    A * 5 = (A << 2) + A
    A / 8 = A >> 3 (only if A is unsigned)
• Useful for address calculation: all basic data types are $2^M$ in size
    int A[100];
    &A[N] = A+(N*sizeof(int)) = A+N*4 = A+(N<<2)
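The "only if A is unsigned" caveat and the earlier sra caveat can be seen side by side. In this sketch `sra32` builds an arithmetic right shift from unsigned operations, because C leaves `>>` on negative signed values implementation-defined; all function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right: shift in copies of the MSB. */
int32_t sra32(int32_t x, int n) {
    assert(n >= 0 && n <= 31);
    uint32_t shifted = (uint32_t)x >> n;           /* logical shift first */
    if (x < 0 && n > 0)
        shifted |= ~(UINT32_MAX >> n);             /* fill the top n bits with 1s */
    if (shifted & 0x80000000u)                     /* negative bit pattern */
        return -(int32_t)(UINT32_MAX - shifted) - 1;
    return (int32_t)shifted;
}

/* Strength-reduced forms from the list above (unsigned operands). */
uint32_t times4(uint32_t a) { return a << 2; }
uint32_t times5(uint32_t a) { return (a << 2) + a; }
uint32_t div8(uint32_t a)   { return a >> 3; }
```

C integer division truncates toward zero, so `-1 / 16` is 0, while `sra32(-1, 4)` is still -1: the shift rounds toward negative infinity. That mismatch is why the shift form of A / 8 is only safe for unsigned A.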
A Simple Shifter
• The simplest 16-bit shifter: can only shift left by 1
  • Implement using wires (no logic!)
• Slightly more complicated: can shift left by 1 or 0
  • Implement using wires and a multiplexor (mux16_2to1)
[Figure: shift-left-by-1 wiring (0 into the low bit, A0 through A15 moved up by one) and a "<<1" block with a mux from A to O]
Barrel Shifter
• What about shifting left by any amount 0–15?
• 16 consecutive “left-shift-by-1-or-0” blocks?
  – Would take too long (how long?)
• Barrel shifter: 4 “shift-left-by-X-or-0” blocks (X = 1,2,4,8)
  • What is the delay?
[Figure: A → <<8 (shift[3]) → <<4 (shift[2]) → <<2 (shift[1]) → <<1 (shift[0]) → O]
• Similar barrel designs for right shifts and rotations
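The four-stage structure can be modeled with one mux-like function per stage. This is a software sketch of the idea, assuming a 16-bit left shifter; `shift_stage` and `barrel_shift_left` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* One "shift-left-by-amount-or-0" stage: a 2-to-1 mux between the input
 * and a fixed rewiring of it. */
uint16_t shift_stage(uint16_t in, int amount, int enable) {
    return enable ? (uint16_t)(in << amount) : in;
}

/* 16-bit left barrel shifter: four stages (8, 4, 2, 1), each controlled
 * by one bit of the 4-bit shift amount. */
uint16_t barrel_shift_left(uint16_t a, unsigned shift) {
    assert(shift <= 15);
    uint16_t x = a;
    x = shift_stage(x, 8, (shift >> 3) & 1);
    x = shift_stage(x, 4, (shift >> 2) & 1);
    x = shift_stage(x, 2, (shift >> 1) & 1);
    x = shift_stage(x, 1, (shift >> 0) & 1);
    return x;
}
```

Any shift amount from 0 to 15 passes through exactly four stages, compared with 16 stages for the chain of shift-by-1-or-0 blocks.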
Shifter in Verilog
• Logical shift operators << >>
  • performs zero-extension for >>
      wire [15:0] a = b << c[3:0];
• Arithmetic shift operator >>>
  • performs sign-extension
  • requires a signed wire input
      wire signed [15:0] b;
      wire [15:0] a = b >>> c[3:0];
Multiplication
3rd Grade: Decimal Multiplication
      19   // multiplicand
    * 12   // multiplier
      38
   + 190
     228   // product
• Start with product 0, repeat steps until no multiplier digits
  • Multiply multiplicand by least significant multiplier digit
  • Add to product
  • Shift multiplicand one digit to the left (multiply by 10)
  • Shift multiplier one digit to the right (divide by 10)
• Product of N-digit and M-digit numbers may have N+M digits
Binary Multiplication: Same Refrain
      19 = 010011          // multiplicand
    * 12 = 001100          // multiplier
       0 = 000000000000
       0 = 000000000000
      76 = 000001001100
     152 = 000010011000
       0 = 000000000000
    +  0 = 000000000000
     228 = 000011100100    // product
± Smaller base → more steps, each is simpler
  • Multiply multiplicand by least significant multiplier digit
    + 0 or 1 → no actual multiplication, add multiplicand or not
  • Add to total: we know how to do that
  • Shift multiplicand left, multiplier right by one digit
Software Multiplication
• Can implement this algorithm in software
• Inputs: md (multiplicand) and mr (multiplier), unsigned 16-bit values held in 32-bit variables

    unsigned int pd = 0;                        // product
    for (int i = 0; i < 16 && mr != 0; i++) {
        if (mr & 1) {                           // low multiplier bit is 1:
            pd = pd + md;                       //   add the shifted multiplicand
        }
        md = md << 1;                           // shift left
        mr = mr >> 1;                           // shift right
    }
Hardware Multiply: Sequential
[Figure: sequential multiplier: Multiplicand register (32 bit, << 1), Multiplier register (16 bit, >> 1), 32-bit adder (32+), Product register (32 bit) with write enable (we) from "lsb==1?"]
• Control: repeat 16 times
  • If least significant bit of multiplier is 1…
    • Then add multiplicand to product
  • Shift multiplicand left by 1
  • Shift multiplier right by 1
Hardware Multiply: Combinational
[Figure: two arrangements of 16-bit adders summing the terms C, C<<1, C<<2, C<<3, C<<4 (or 0) into P: a chain and a tree]
• Multiply by N bits at a time using N adders
  • Example: N=5, terms (P=product, C=multiplicand, M=multiplier)
  • P = (M[0] ? (C) : 0) + (M[1] ? (C<<1) : 0) + (M[2] ? (C<<2) : 0) + (M[3] ? (C<<3) : 0) + …
  • Arrange like a tree to reduce gate delay critical path
• Delay? $N^2$ vs N*log N? Not that simple, depends on adder
  • Approx “2N” versus “N + log N”, with optimization: O(log N)
Partial Sums/Carries
• Observe: carry-outs don’t have to be chained immediately
  • Can be saved for later and added back in
       00111 = 7
      +00011 = 3
       00100     // partial sums (sums without carries)
      +00110     // partial carries (carries without sums)
       01010 = 10
• Partial sums/carries use simple half-adders, not full-adders
  + Aren’t “chained” → can be done in two levels of logic
  – Must sum partial sums/carries eventually, and this sum is chained
  • d(CS-adder) = 2 + d(normal-adder)
  • What is the point?
Three Input Addition
• Observe: only 0/1 carry-out possible even if 3 bits added
       00111 = 7
       00011 = 3
      +00010 = 2
       00110     // partial sums (sums without carries)
      +00110     // partial carries (carries without sums)
       01100 = 12
• Partial sums/carries use full adders
  + Still aren’t “chained” → can be done in two levels of logic
• The point is delay(CS-adder) = 2 + delay(normal-adder)…
  • …even for adding 3 numbers!
  • 2 + delay(normal-adder) < 2 * delay(normal-adder)
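The partial-sum and partial-carry words are simple bitwise functions of the three addends. A minimal sketch, with `carry_save` and `add3` as made-up names; the final `+` stands in for the single conventional (chained) adder.

```c
#include <assert.h>
#include <stdint.h>

/* 3-to-2 carry-save step: reduce three addends to a partial-sum word and
 * a partial-carry word. Each bit position is one independent full adder,
 * so there is no carry chain. */
void carry_save(uint32_t a, uint32_t b, uint32_t d,
                uint32_t *partial_sum, uint32_t *partial_carry) {
    *partial_sum   = a ^ b ^ d;                            /* sums without carries */
    *partial_carry = ((a & b) | (a & d) | (b & d)) << 1;   /* carries, moved up one bit */
}

/* Three-input add: one carry-save step, then one chained addition. */
uint32_t add3(uint32_t a, uint32_t b, uint32_t d) {
    uint32_t ps, pc;
    carry_save(a, b, d, &ps, &pc);
    /* The two partial words always add up to the full sum (mod 2^32). */
    assert(ps + pc == a + b + d);
    return ps + pc;
}
```

For 7, 3 and 2 the partial sums and partial carries are both 00110, matching the worked example, and their sum is 12.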
Hardware != Software: Part Deux
• Recall: hardware is parallel, software is sequential
  • Exploit: evaluate independent sub-expressions in parallel
• Example I: S = A + B + C + D
  • Software? 3 steps: (1) S1 = A+B, (2) S2 = S1+C, (3) S = S2+D
  + Hardware? 2 steps: (1) S1 = A+B, S2=C+D, (2) S = S1+S2
• Example II: S = A + B + C
  • Software? 2 steps: (1) S1 = A+B, (2) S = S1+C
  • Hardware? 2 steps: (1) S1 = A+B (2) S = S1+C
  + Actually hardware can do this in 1.2 steps! (CSA adder)
Carry Save Addition (CSA)
• Carry save addition (CSA): delay(N adds) < N*d(1 add)
• Enabling observation: unconventional view of full adder
  • 3 inputs (A,B,Cin) → 2 outputs (S,Cout)
  • If adding two numbers, only thing to do is chain Cout to the next bit's Cin
• But what if we are adding three numbers (A+B+D)?
  • One option: back-to-back conventional adders
    • (A,B,CinT) → (T,CoutT), chain CoutT to the next bit's CinT
    • (T,D,CinS) → (S,CoutS), chain CoutS to the next bit's CinS
  • Notice: we have three independent inputs to feed first adder
    • (A,B,D) → (T,CoutT), no chaining (CSA: 2 gate levels)
    • T: A+B+D partial sum
    • CoutT: A+B+D partial carry
    • (T,CoutT,CinS) → (S,CoutS), chain CoutS to the next bit's CinS
Carry Save Addition (CSA)
[Figure: top: two chained 4-bit rows of full adders (A+B giving T, then T+D giving S, CO); bottom: a row of independent full adders on (A, B, D) feeding one chained row producing S, CO]
• 2 RC adders
  + 2 * d(add) gate delays
  – d(add) is really long
• CSA+RC adder
  • 2 + d(add)
• Subtraction works too
  • ?
Carry Save Addition (CSA)
[Figure: the same two datapaths with "Any adder (carry-select, CLA, …)" in place of the chained rows]
• 2 general adders
  – 2 * d(add) gate delays
  + d(add) may be fast
• CSA+general adder
  + 2 + d(add) gate levels
  + d(add) may be fast
CSA Tree Multiplier
[Figure: tree of CSA blocks reducing the terms C, C<<1, …, C<<8 (or 0), followed by a CLA producing P]
• Use 3-to-2 CSA adders
  • Build a tree structure
  • called a “Wallace Tree”
• 16-bit CSA
  • Start: 16 bits
  • 1st: 5*(3->2)+1 = 11
  • 2nd: 3*(3->2)+2 = 8
  • 3rd: 2*(3->2)+2 = 6
  • 4th: 2*(3->2)+0 = 4
  • 5th: 1*(3->2)+1 = 3
  • 6th: 1*(3->2)+0 = 2
  • 7th: CLA
• delay: 2+(6*2)+9
Consecutive Addition: Carry Save Adder
• 2 N-bit RC adders
  + 2 + d(add) gate delays
• M N-bit RC adders delay
  • Naïve: O(M*N)
  • Actual: O(M+N)
• M N-bit Carry Select?
  • Delay calculation tricky
• Carry Save Adder (CSA)
  • 3-to-2 CSA tree + adder
  • Delay: O(log M + log N)
Floating Point
Floating Point (FP) Numbers
• Floating point numbers: numbers in scientific notation
  • Two uses
• Use I: real numbers (numbers with non-zero fractions)
  • 3.1415926…
  • 2.1878…
  • $6.62 \times 10^{-34}$
• Use II: really big numbers
  • $3.0 \times 10^{8}$
  • $6.02 \times 10^{23}$
• Aside: best not used for currency values
Scientific Notation
• Scientific notation:
  • Number [S,F,E] = $S \times F \times 2^{E}$
  • S: sign
  • F: significand (fraction)
  • E: exponent
• “Floating point”: binary (decimal) point has different magnitude
+ “Sliding window” of precision using notion of significant digits
  • Small numbers very precise, many places after decimal point
  • Big numbers are much less so, not all integers representable
  • But for those instances you don’t really care anyway
– Caveat: all representations are just approximations
  • Sometimes weirdos like 0.9999999 or 1.0000001 come up
  + But good enough for most purposes
IEEE 754 Standard Precision/Range
• Single precision: float in C
  • 32-bit: 1-bit sign + 8-bit exponent + 23-bit significand
  • Range: $2.0 \times 10^{-38} < N < 2.0 \times 10^{38}$
  • Precision: ~7 significant (decimal) digits
  • Used when exact precision is less important (e.g., 3D games)
• Double precision: double in C
  • 64-bit: 1-bit sign + 11-bit exponent + 52-bit significand
  • Range: $2.0 \times 10^{-308} < N < 2.0 \times 10^{308}$
  • Precision: ~15 significant (decimal) digits
  • Used for scientific computations
• Numbers $> 10^{308}$ don’t come up in many calculations
  • $10^{80}$ ~ number of atoms in universe
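The "~7 significant digits" of single precision shows up as soon as integers exceed $2^{24}$. The sketch below assumes the platform's `float` and `double` are IEEE 754 single and double precision, which C does not strictly guarantee; the function names are illustrative.

```c
#include <assert.h>

/* With a 23-bit significand (plus the implicit leading 1), float holds
 * every integer up to 2^24 = 16777216 exactly, but not 2^24 + 1. */
int float_holds_exactly(long n) {
    volatile float f = (float)n;     /* volatile: force rounding to float */
    return (long)f == n;
}

/* double has a 52-bit significand, so the same value still fits. */
int double_holds_exactly(long n) {
    volatile double d = (double)n;
    return (long)d == n;
}

void check_precision(void) {
    assert(float_holds_exactly(16777216L));      /* 2^24 */
    assert(!float_holds_exactly(16777217L));     /* 2^24 + 1 */
    assert(double_holds_exactly(16777217L));
}
```

This is the "sliding window" at work: 16777217 needs 25 significant bits, one more than single precision keeps, so the nearest representable value is stored instead.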
[Figure: from Patterson, David A.; Hennessy, John L. Computer Organization and Design, Revised Fourth Edition: The Hardware/Software Interface. Morgan Kaufmann, 2011. p 277. http://site.ebrary.com/lib/upenn/Doc?id=10509203&ppg=277]
[Figure: from Patterson, David A.; Hennessy, John L. Computer Organization and Design, Revised Fourth Edition: The Hardware/Software Interface. Morgan Kaufmann, 2011. p 279. http://site.ebrary.com/lib/upenn/Doc?id=10509203&ppg=279]
Floating Point is Inexact
• Accuracy problems sometimes get bad
  • FP arithmetic not associative: (A+B)+C not same as A+(B+C)
  • Addition of big and small numbers (summing many small numbers)
  • Subtraction of two big numbers
  • Example, what’s $(1 \times 10^{30} + 1 \times 10^{0}) - 1 \times 10^{30}$?
    • Intuitively: $1 \times 10^{0} = 1$
    • But: $(1 \times 10^{30} + 1 \times 10^{0}) - 1 \times 10^{30} = (1 \times 10^{30} - 1 \times 10^{30}) = 0$
  • Reciprocal math: “x/y” versus “x*(1/y)”
    • Reciprocal & multiply is faster than divide, but less precise
  • Compilers are generally conservative by default
    • GCC flag: -ffast-math (allows assoc. opts, reciprocal math)
• Numerical analysis: field formed around this problem
  • Re-formulating algorithms in a way that bounds numerical error
• In your code: never test for equality between FP numbers
  • Use something like: if (fabs(a-b) < 0.00001) then …
Pentium FDIV Bug
• Pentium shipped in August 1994
• Intel actually knew about the bug in July
  • But calculated that delaying the project a month would cost ~$1M
  • And that in reality only a dozen or so people would encounter it
  • They were right… but one of them took the story to EE Times
• By November 1994, firestorm was full on
  • IBM said that typical Excel user would encounter bug every month
    • Assumed 5K divisions per second around the clock
  • People believed the story
  • IBM stopped shipping Pentium PCs
• By December 1994, Intel promises full recall
  • Total cost: ~$550M
Arithmetic Latencies
• Latency in cycles of common arithmetic operations
• Source: Agner Fog, https://www.agner.org/optimize/#manuals
• AMD Ryzen core

                   Int 32   Int 64   Fp 32   Fp 64
    Add/Subtract   1        1        5       5
    Multiply       3        3        5       5
    Divide         14-30    14-46    8-15    8-15

• Divide is variable latency based on the size of the dividend
  • Detect number of leading zeros, then divide
• Why is FP divide faster than integer divide?
Summary
[Figure: system stack (Apps / System software / Mem, CPU, I/O)]
• Integer addition
  • Most timing-critical operation in datapath
  • Hardware != software
    • Exploit sub-addition parallelism
• Fast addition
  • Carry-select: parallelism in sum
• Multiplication
  • Chains and trees of additions
• Division
• Floating point
• Next: single-cycle datapath