Ripple Carry Adder 2 • How Fast Is This? … • How Fast Is an N-Bit Ripple-Carry Adder? a S 15 FA 15 B15 CO

CIS 371 Computer Organization and Design

Unit 3: Arithmetic

Based on slides by Profs. Amir Roth & Milo Martin & C.J. Taylor

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 1 This Unit: Arithmetic

App App App • A little review System software • Binary + 2s complement • Ripple-carry addition (RCA) Mem CPU I/O • Fast integer addition • Carry-select (CSeA) • Carry Lookahead Adder (CLA) • Shifters • Integer multiplication and division • Floating point arithmetic

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 2 Readings

• P&H • Chapter 3

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 3 The Importance of Fast Arithmetic

+ 4 Register Data Insn File PC s1 s2 d Mem Mem

Tinsn-mem Tregfile TALU Tdata-mem Tregfile • Addition of two numbers is most common operation • Programs use addition frequently • Loads and stores use addition for address calculation • Branches use addition to test conditions and calculate targets • All insns use addition to calculate default next PC • Fast addition is critical to high performance

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 4 Review: Binary Integers

• Computers represent integers in binary (base2) 3 = 11, 4 = 100, 5 = 101, 30 = 11110 + Natural since only two values are represented • Addition, etc. take place as usual (carry the 1, etc.)

17 = 10001 +5 = 101 22 = 10110

• Some old machines use decimal (base10) with only 0/1 30 = 011 000 • Often called BCD (binary-coded decimal) – Unnatural for digital logic, implementation complicated & slow + More precise for currency operations

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 5 Fixed Width

• On pencil and paper, integers have infinite width

• In hardware, integers have fixed width • N bits: 16, 32 or 64 • LSB is 20, MSB is 2N-1

• Range: 0 to 2N–1

• Numbers >2N represented using multiple fixed-width integers • In software (e.g., Java BigInteger class)

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 6 What About Negative Integers?

• Sign/magnitude • Unsigned plus one bit for sign 10 = 000001010, -10 = 100001010 + Matches our intuition from “by hand” decimal arithmetic – representations of both +0 and –0 – Addition is difficult • symmetric range: –(2N-1–1) to 2N-1–1

• Option II: two’s complement (2C) • Leading 0s mean positive number, leading 1s negative 10 = 00001010, -10 = 11110110 + One representation for 0 (all zeroes) + Easy addition • asymmetric range: –(2N-1) to 2N-1–1

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 7 The Tao of 2C

• How did 2C come about? • “Let’s design a representation that makes addition easy” • Think of subtracting 10 from 0 by hand with 8-bit numbers • Have to “borrow” 1s from some imaginary leading 1

0 = 100000000 -10 = 00000010 -10 = 011111110

• Now, add the conventional way…

-10 = 11111110 +10 = 00000010 0 = 100000000

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 8 Still More On 2C

• What is the interpretation of 2C? • Same as binary, except MSB represents –2N–1, not 2N–1 • –10 = 11110110 = –27+26+25+24+22+21 + Extends to any width • –10 = 110110 = –25+24+22+21 • Why? 2N = 2*2N–1 • –25+24+22+21 = (–26+2*25)–25+24+22+21 = –26+25+24+22+21 • Equivalent to computing modulo 2N

• Trick to negating a number quickly: –B = B’ + 1 • –(1) = (0001)’+1 = 1110+1 = 1111 = –1 • –(–1) = (1111)’+1 = 0000+1 = 0001 = 1 • –(0) = (0000)’+1 = 1111+1 = 0000 = 0 • Think about why this works CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 9 Addition

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 10 1st Grade: Decimal Addition

1 43 +29 72

• Repeat N times • Add least significant digits and any overflow from previous add • Carry “overflow” to next addition • Overflow: any digit other than least significant of sum • Shift two addends and sum one digit to the right

• Sum of two N-digit numbers can yield an N+1 digit number

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 11 Binary Addition: Works the Same Way

1 111111 43 = 00101011 +29 = 00011101 72 = 01001000

• Repeat N times • Add least significant bits and any overflow from previous add • Carry the overflow to next addition • Shift two addends and sum one bit to the right • Sum of two N-bit numbers can yield an N+1 bit number

– More steps (smaller base) + Each one is simpler (adding just 1 and 0) • So simple we can do it in hardware

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 12 The Half Adder

• How to add two binary integers in hardware? • Start with adding two bits • When all else fails ... look at truth table A B S A B = C0 S 0 0 = 0 0 0 1 = 0 1 1 0 = 0 1 1 1 = 1 0 CO

• S = A^B A S HA • CO (carry out) = AB B • This is called a half adder CO

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 13 The Other Half

• We could chain half adders together, but to do that… • Need to incorporate a carry out from previous adder C A B = C0 S CI 0 0 0 = 0 0 0 0 1 = 0 1 S 0 1 0 = 0 1 CI A 0 1 1 = 1 0 A S 1 0 0 = 0 1 B FA 1 0 1 = 1 0 B 1 1 0 = 1 0 CO 1 1 1 = 1 1

CO • S = C’A’B + C’AB’ + CA’B’ + CAB = C ^ A ^ B • CO = C’AB + CA’B + CAB’ + CAB = CA + CB + AB • This is called a full adder

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 14 Ripple-Carry Adder 0 • N-bit ripple-carry adder A S • N 1-bit full adders “chained” together 0 FA 0 B0 • CO0 = CI1, CO1 = CI2, etc.

• CI0 = 0 A S 1 FA 1 • CO is carry-out of entire adder N–1 B1 • CON–1 = 1 ® “overflow” A S 2 FA 2 B • Example: 16-bit ripple carry adder 2 • How fast is this? … • How fast is an N-bit ripple-carry adder? A S 15 FA 15 B15 CO

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 15 Quantifying Adder Delay

• Combinational logic dominated by gate delays • How many gates between input and output? • Array storage dominated by wire delays • Longest delay or critical path is what matters

• Can implement any combinational function in “2” logic levels • 1 level of AND + 1 level of OR (PLA) • NOTs are “free”: push to input (DeMorgan’s) or read from latch • Example: delay(FullAdder) = 2 • d(CarryOut) = delay(AB + AC + BC) • d(Sum) = d(A ^ B ^ C) = d(AB’C’ + A’BC’ + A’B’C + ABC) = 2 • Note ‘^’ means Xor (just like in C & Java)

• Caveat: “2” assumes gates have few (<8 ?) inputs

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 16 Ripple-Carry Adder Delay 0 • Longest path is to CO15 (or S15) A0 S0 • delay(CO15) = 2 + MAX(d(A15),d(B15),d(CI15)) FA B0 • d(A15) = d(B15) = 0, d(CI15) = d(CO14)

• d(CO15) = 2 + d(CO14) = 2 + 2 + d(CO13) … A S 1 FA 1 • d(CO ) = 32 15 B1

A2 S2 • d(CON–1) = 2N FA B2 – Too slow! – Linear in number of bits …

A S • Number of gates is also linear 15 FA 15 B15 CO

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 17 Division

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 18 4th Grade: Decimal Division

9 // quotient 3 |29 // divisor | dividend -27 2 // remainder

• Shift divisor left (multiply by 10) until MSB lines up with dividend’s • Repeat until remaining dividend (remainder) < divisor • Find largest single digit q such that (q*divisor) < dividend • Set LSB of quotient to q • Subtract (q*divisor) from dividend • Shift quotient left by one digit (multiply by 10) • Shift divisor right by one digit (divide by 10)

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 19 Binary Division

1001 = 9 3 |29 = 0011 |011101 -24 = - 011000 5 = 000101 - 3 = - 000011 2 = 000010

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 20 Binary Division Hardware

• Same as decimal division, except (again) – More individual steps (base is smaller) + Each step is simpler • Find largest bit q such that (q*divisor) < dividend • q = 0 or 1 • Subtract (q*divisor) from dividend • q = 0 or 1 ® no actual multiplication, subtract divisor or not

• Complication: largest q such that (q*divisor) < dividend • How do you know if (1*divisor) < dividend? • Human can “eyeball” this • Computer does not have eyeballs • Subtract and see if result is negative

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 21 Software Divide Algorithm

• Can implement this algorithm in software • Inputs: dividend and divisor

remainder = 0; quotient = 0; for (int i = 0; i < 32; i++) { remainder = (remainder << 1) | (dividend >> 31); if (remainder >= divisor) { quotient = (quotient << 1) | 1; remainder = remainder - divisor; } else { quotient = (quotient << 1) | 0; } dividend = dividend << 1; }

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 22 Divide Example

• Input: Divisor = 00011 , Dividend = 11101

Step Remainder Quotient Remainder Dividend 0 00000 00000 00000 11101 1 00001 00000 00001 11010 2 00011 00001 00000 10100 3 00001 00010 00001 01000 4 00010 00100 00010 10000 5 00101 01001 00010 00000

• Result: Quotient: 1001, Remainder: 10

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 23 Divider Circuit

Divisor Quotient

Sub >=0 Shift in 0 or 1

Remainder Dividend msb Shift in 0 or 1 Shift in 0

• N cycles for n-bit divide

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 24 Fast Addition

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 25 Bad idea: a PLA-based Adder?

• If any function can be expressed as two-level logic… • …why not use a PLA for an entire 8-bit adder? • Not small • Approx. 216 AND gates, each with 16 inputs • Then, 8 OR gates, each with 216 inputs • Number of gates exponential in bit width! • Not that fast, either • An AND gate with 65K inputs != 2-input AND gate • Many-input gates made from tree of, say, 4-input gates • 216-input gates would have at least 8 logic levels • So, at least 10 levels of logic for a 16-bit PLA • Even so, delay is still logarithmic in number of bits

• There are better (faster, smaller) ways

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 26 Theme: Hardware != Software

• Hardware can do things that software fundamentally can’t • And vice versa (of course)

• In hardware, it’s easier to trade resources for latency

• One example of this: speculation • Slow computation waiting for some slow input? • Input one of two things? • Compute with both (slow), choose right one later (fast)

• Does this make sense in software? Not on a uni-processor • Difference? hardware is parallel, software is sequential

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 27 Carry-Select Adder

• Carry-select adder

• Do A15-8+B15-8 twice, once assuming C8 (CO7) = 0, once = 1

• Choose the correct one when CO7 finally becomes available + Effectively cuts carry chain in half (break critical path) – But adds mux • Delay? 0 0 A7-0 8+ S7-0 B7-0 A15-0 16 S15-0 B15-0 0 S 16+ A 15-8 15-8 8+ B15-8 S15-8 CO 1 S 16 CO A 15-8 15-8 8+ 18 B15-8

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 28 Multi-Segment Carry-Select Adder 0 • Multiple segments A4-0 5+ S4-0 • Example: 5, 5, 6 bit = 16 bit B4-0 10 0 S9-5 A9-5 • Hardware cost 5+ B9-5 S9-5 • Still mostly linear (~2x) 1 10 S9-5 • Compute each segment A9-5 12 with 0 and 1 carry-in 5+ B9-5 • Serial mux chain 0 S15-10 A15-10 • Delay 6+ B15-10 S15-10 • 5-bit adder (10) + 12 14 1 S Two muxes (4) = 14 A 15-10 CO 15-10 6+ B15-10

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 29 Carry-Select Adder Delay

• What is carry-select adder delay (two segment)?

• d(CO15) = MAX(d(CO15-8), d(CO7-0)) + 2

• d(CO15) = MAX(2*8, 2*8) + 2 = 18 • In general: 2*(N/2) + 2 = N+2 (vs 2N for RCA)

• What if we cut adder into 4 equal pieces? • Would it be 2*(N/4) + 2 = 10? Not quite

• d(CO15) = MAX(d(CO15-12),d(CO11-0)) + 2 • d(CO15) = MAX(2*4, MAX(d(CO11-8),d(CO7-0)) + 2) + 2

• d(CO15) = MAX(2*4,MAX(2*4,MAX(d(CO7-4),d(CO3-0)) + 2) + 2) + 2 • d(CO15) = MAX(2*4,MAX(2*4,MAX(2*4,2*4) + 2) + 2) + 2 • d(CO15) = 2*4 + 3*2 = 14

• N-bit adder in M equal pieces: 2*(N/M) + (M–1)*2 • 16-bit adder in 8 parts: 2*(16/8) + 7*2 = 18 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 30 16b Carry-Select Adder Delay

35 30 25 20 15

Gatedelay 10 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 M

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 31 Another Option: Carry Lookahead

• Is carry-select adder as fast as we can go? • Nope!

• Another approach to using additional resources • Instead of redundantly computing sums assuming different carries • Use redundancy to compute carries more quickly • This approach is called carry lookahead (CLA)

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 32 A & B ® Generate & Propagate

• Let’s look at the single-bit carry-out function • CO = AB+AC+BC • Factor out terms that use only A and B (available immediately) • CO =(AB)+(A+B)C • (AB): generates carry-out regardless of incoming C ® rename to G • (A+B) : propagates incoming C ® rename to P • CO = G+PC C C

P A A B B G

CO CO CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 33 Infinite Hardware CLA

• Can expand C1…N in terms of G’s, P’s, and C0 • Example: C16

• C16 = G15+P15C15 • C16 = G15+P15(G14+P14C14)

• C16 = G15+P15G14+ … + P15P14…P2P1G0 + P15P14…P2P1P0C0

• Similar expansions for C15, C14, etc.

• How much does this cost?

• CN needs: N AND’s + 1 OR’s, largest have N+1 inputs

• CN…C1 needs: N*(N+1)/2 AND’s + N OR’s, max N+1 inputs • N=16: 152 total gates, max input 17 • N=64: 2144 total gates, max input 65 • And how fast is it really? – Not that fast, unfortunately, 17-input gates are really slow

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 34 Compromise: Multi-Level CLA

• Ripple carry + Few small gates: 2N gates, max 3 inputs – Laid in series: 2N (linear) latency

• Infinite hardware CLA – Many big gates: N*(N+3)/2 additional gates, max N+1 inputs + Laid in parallel: constant 5 latency

• Is there a compromise? • Reasonable number of small gates? • Sublinear (doesn’t have to be constant) latency? • Yes, multi-level CLA: exploits hierarchy to achieve this

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 35 Carry Lookahead Adder (CLA)

• Calculate “propagate” and “generate” based on A, B • Not based on carry in

• Combine with tree structure C0

A0 G0

B0 P0 G1-0 • Take aways C 1 P1-0 • Tree gives logarithmic delay A1 G1 C1 G3-0 • Reasonable area B1 P1 P C2 3-0 C2 A2 G2 B2 P2 G3-2 C3 P3-2 C3 A3 G3 B3 P3

C 4 C4 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 36 Two-Level CLA for 4-bit Adder

• Individual carry equations

• C1 = G0+P0C0, C2 = G1+P1C1, C3 = G2+P2C2, C4 = G3+P3C3 • Infinite hardware CLA equations

• C1 = G0+P0C0

• C2 = G1+P1G0+P1P0C0 • C3 = G2+P2G1+P2P1G0+P2P1P0C0 • C4 = G3+P3G2+P3P2G1+P3P2P1G0+P3P2P1P0C0 • Hierarchical CLA equations

• First level: expand C2 using C1, C4 using C3

• C2 = G1+P1(G0+P0C0) = (G1+P1G0)+(P1P0)C0 = G1-0+P1-0C0 • C4 = G3+P3(G2+P2C2) = (G3+P3G2)+(P3P2)C2 = G3-2+P3-2C2

• Second level: expand C4 using expanded C2

• C4 = G3-2+P3-2(G1-0+P1-0C0) = (G3-2+P3-2G1-0)+(P3-2P1-0)C0

• C4 = G3-0+P3-0C0

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 37 Two-Level CLA for 4-bit Adder

• Top first-level CLA block

• Input: C0, G0, P0, G1, P1

• Output: C1, G1-0, P1-0 C0

• Bottom first-level CLA block A0 G0 B P • Input: C , G , P , G , P 0 0 G1-0 2 2 2 3 3 C 1 P1-0 • Output: C3, G3-2, P3-2 A1 G1 C1 G3-0 • Second-level CLA block B1 P1 P C2 3-0 • Input: C0, G1-0, P1-0, G3-2, P3-2 C2 A G • Output: C2, G3-0, P3-0 2 2 B2 P2 G3-2 • These 3 blocks are “the same” C3 P3-2 C3 • C4 block A3 G3 B3 P3 • Input: C0, G3-0, P3-0 • Output: C 4 C 4 C4 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 38 4-bit CLA Timing: d0

• Which signals are ready at 0 gate delays (d0)?

• C0

• A0, B0 C0

• A1, B1 A0 G0

• A2, B2 B0 P0 G1-0 C1 P • A3, B3 1-0 A1 G1 C1

B1 P1 G3-0 C 2 P3-0

A2 G2 C2 B2 P2 G3-2 C3 P3-2 C3 A3 G3 B3 P3

C 4 C4 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 39 4-bit CLA Timing: d1

• Signals ready from before

• d0: C0, Ai, Bi • New signals ready at d1 C0

• P0 = A0+B0, G0 = A0B0 A0 G0 B P • P = A +B , G = A B 0 0 G1-0 1 1 1 1 1 1 C 1 P1-0 • P2 = A2+B2, G2 = A2B2 A1 G1 C1 • P3 = A3+B3, G3 = A3B3 G3-0 B1 P1 P C2 3-0 C2 A2 G2 B2 P2 G3-2 C3 P3-2 C3 A3 G3 B3 P3

C 4 C4 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 40 4-bit CLA Timing: d3

• Signals ready from before

• d0: C0, Ai, Bi

• d1: Pi, Gi C0

• New signals ready at d3 A0 G0 B P • P = P P 0 0 G1-0 1-0 1 0 C 1 P1-0 • G1-0 = G1+P1G0 A1 G1 C1 • C1 = G0+P0C0 G3-0 B1 P1 P C2 3-0 C2 • P3-2 = P3P2 A2 G2 B2 P2 G3-2 • G3-2 = G3+P3G2 C P3-2 • C is not ready 3 3 C3 A3 G3 B3 P3

C 4 C4 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 41 4-bit CLA Timing: d5

• Signals ready from before

• d0: C0, Ai, Bi

• d1: Pi, Gi C0

• d3: P1-0, G1-0 , C1 , P3-2 , G3-2 A0 G0 B P • New signals ready at d5 0 0 G1-0 C 1 P1-0 • P3-0 = P3-2P1-0 A1 G1 C1 • G3-0 = G3-2+P3-2G1-0 G3-0 B1 P1 P • C2 = G1-0+P1-0C0 C2 3-0 C2 A2 G2 B2 P2 G3-2 C3 P3-2 C3 A3 G3 B3 P3

C 4 C4 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 42 4-bit CLA Timing: d7

• Signals ready from before

• d0: C0, Ai, Bi

• d1: Pi, Gi C0

• d3: P1-0, G1-0 , C1 , P3-2 , G3-2 A0 G0

• d5: P3-0, G3-0, C2 B0 P0 G1-0 C • New signals ready at d7 1 P1-0 A1 G1 C1 • C3 = G2+P2C2 G3-0 B1 P1 P • C4 = G3-0+P3-0C0 C2 3-0 C2 A2 G2 G • Si ready d2 after Ci B2 P2 3-2 C3 P3-2 C3 A3 G3 B3 P3

C 4 C4 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 43 Two-Level CLA for 4-bit Adder

• Hardware? • CLA blocks: 5 gates, max 3 inputs (infinite CLA with N=2) • 3 of these C0 • C4 block: 2 gates, max 2 inputs • Total: 17 gates/3inputs A0 G0 B0 P0 G1-0 • Infinite: 14/5 C 1 P1-0

A1 G1 C1 G3-0 B1 P1 • Latency? P C2 3-0 • 2 for “outer” CLA, 4 for “interior” C2 • G/P go “up”, C go “down” A2 G2 B P G3-2 • Total: 9 (1 for GP, 2 for S) 2 2 C3 P3-2 • Infinite: 5 C3 A3 G3 B3 P3 • 2L: more gates and slower? C 4 C4 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 44 Individual G & P ® Group G & P

• Group G/P: convenient abstraction for multi-level CLA

• Individual carry equations

• C1 = G0+P0C0, C2 = G1+P1C1 • Infinite hardware CLA equations

• C1 = G0+P0C0 • C2 = G1+P1G0+P1P0C0 • Group terms

• C2 = (G1+P1G0)+(P1P0)C0

• C2 = G1-0+P1-0C0 • G1-0, P1-0 are group G & P: treat multiple bits as a single bit

• G1-0: carry-out generated by 1-0 bits • P1-0: carry-out propagated by 1-0 bits

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 45 Two-Level CLA for 16-bit Adder

• 4 G/P inputs per level C0 G3-0 P3-0 • Hardware? C1,2,3 C • First level: 14 gates * 4 blocks 4 • Second level: 14 gates * 1 block G7-4 P • Total: 70 gates 7-4 C5,6,7 • largest gate: 5 inputs G15-0 C8 • Infinite: 152 gates, 17 inputs P15-0 G11-8 C4,8,12 P11-8 • Latency? C9,10,11 C12 • Total: 9 (1 + 2 + 2 + 2 + 2) • Infinite: 5 G15-12 P15-12 C13,14,15 • CLA for a 64-bit adder? C CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 16 46 16-bit CLA Timing: d0

C • What is ready after 0 gate delays? 0 G3-0 • C0 P3-0 C1,2,3 C4

G7-4 P7-4 C5,6,7 G15-0 C8 P15-0 G11-8 C4,8,12 P11-8 C9,10,11 C12

G15-12 P15-12 C13,14,15

C CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 16 47 16-bit CLA Timing: d1

C • What is ready after 1 gate delay? 0 G3-0 • Individual G/P P3-0 C1,2,3 C4

G7-4 P7-4 C5,6,7 G15-0 C8 P15-0 G11-8 C4,8,12 P11-8 C9,10,11 C12

G15-12 P15-12 C13,14,15

C CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 16 48 16-bit CLA Timing: d3

C0 • What is ready after 3 gate delays? G3-0 st • 1 level group G/P P3-0 st • Interior carries of 1 group C1,2,3 C4 • C1, C2, C3 G7-4 P7-4 C5,6,7 G15-0 C8 P15-0 G11-8 C4,8,12 P11-8 C9,10,11 C12

G15-12 P15-12 C13,14,15

C CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 16 49 16-bit CLA Timing: d5

C0 • And after 5 gate delays? G3-0 • Outer level group G/P P3-0 • Outer level “interior” carries C1,2,3 C4 • C4, C8, C12 G7-4 P7-4 C5,6,7 G15-0 C8 P15-0 G11-8 C4,8,12 P11-8 C9,10,11 C12

G15-12 P15-12 C13,14,15

C CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 16 50 16-bit CLA Timing: d7

C0 • And after 7 gate delays? G3-0 • C16 P3-0 • First level “interior” carries C1,2,3 C4 • C5, C6, C7 G7-4 • C9, C10, C11 P7-4 • C , C , C 13 14 15 C5,6,7 G15-0 • Essentially, all remaining carries C8 P15-0 G11-8 C4,8,12 • Si ready 2 gate delays after Ci P11-8 • All sum bits ready after 9 delays! C9,10,11 C12 • Same as 2-level 4-bit CLA G • All 2-level CLAs have similar delay 15-12 P15-12 structure C13,14,15

C CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 16 51 Adders In Real Processors

• Real processors super-optimize their adders • Ten or so different versions of CLA • Highly optimized versions of carry-select • Other gate techniques: carry-skip, conditional-sum • Sub-gate (transistor) techniques: Manchester carry chain • Combinations of different techniques • Alpha 21264 used CLA+CSeA+RippleCA • Used at different levels

• Even more optimizations for incrementers • Why?

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 52 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 53 Subtraction: Addition’s Tricky Pal

• Sign/magnitude subtraction is mental reverse addition • 2C subtraction is addition • How to subtract using an adder? • sub A B = add A -B • Negate B before adding (fast negation trick: –B = B’ + 1) • Isn’t a subtraction then a negation and two additions? + No, an adder can implement A+B+1 by setting the carry-in to 1

0 1 A B

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 54 Shifts & Rotates

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 55 Shift and Rotation Instructions

• Left/right shifts are useful… • Fast multiplication/division by small constants (next) • Bit manipulation: extracting and setting individual bits in words

• Right shifts • Can be logical (shift in 0s) or arithmetic (shift in copies of MSB) srl 110011, 2 = 001100 sra 110011, 2 = 111100 • Caveat: for negative numbers, sra is not equal to division by 2 • Consider: -1 / 16 = ?

• Rotations are less useful… • But almost “free” if shifter is there • MIPS and LC4 have only shifts, x86 has shifts and rotations CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 56 Compiler Opt: Strength Reduction

• Strength reduction: compilers will do this (sort of) A * 4 = A << 2 A * 5 = (A << 2) + A A / 8 = A >> 3 (only if A is unsigned) • Useful for address calculation: all basic data types are 2M in size int A[100]; &A[N] = A+(N*sizeof(int)) = A+N*4 = A+N<<2

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 57 A Simple Shifter

• The simplest 16-bit shifter: can only shift left by 1 • Implement using wires (no logic!) • Slightly more complicated: can shift left by 1 or 0 • Implement using wires and a multiplexor (mux16_2to1)

0 A0

A O

A15

A O A <<1 O <<1

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 58 Barrel Shifter

• What about shifting left by any amount 0–15?

• 16 consecutive “left-shift-by-1-or-0” blocks? – Would take too long (how long?) • Barrel shifter: 4 “shift-left-by-X-or-0” blocks (X = 1,2,4,8) • What is the delay?

A O <<8 <<4 <<2 <<1 shift[3] shift[2] shift[1] shift[0] shift

• Similar barrel designs for right shifts and rotations

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 59 Shifter in Verilog

• Logical shift operators << >> • performs zero-extension for >> wire [15:0] a = b << c[3:0]; • Arithmetic shift operator >>> • performs sign-extension • requires a signed wire input wire signed [15:0] b; wire [15:0] a = b >>> c[3:0];

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 60 Multiplication

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 61 3rd Grade: Decimal Multiplication

19 // multiplicand * 12 // multiplier 38 + 190 228 // product

• Start with product 0, repeat steps until no multiplier digits • Multiply multiplicand by least significant multiplier digit • Add to product • Shift multiplicand one digit to the left (multiply by 10) • Shift multiplier one digit to the right (divide by 10)

• Product of N-digit and M-digit numbers may have N+M digits

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 62 Binary Multiplication: Same Refrain

19 = 010011 // multiplicand * 12 = 001100 // multiplier 0 = 000000000000 0 = 000000000000 76 = 000001001100 152 = 000010011000 0 = 000000000000 + 0 = 000000000000 228 = 000011100100 // product

± Smaller base ® more steps, each is simpler • Multiply multiplicand by least significant multiplier digit + 0 or 1 ® no actual multiplication, add multiplicand or not • Add to total: we know how to do that • Shift multiplicand left, multiplier right by one digit

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 63 Software Multiplication

• Can implement this algorithm in software • Inputs: md (multiplicand) and mr (multiplier)

int pd = 0; // product int i = 0; for (i = 0; i < 16 && mr != 0; i++) { if (mr & 1) { pd = pd + md; } md = md << 1; // shift left mr = mr >> 1; // shift right }

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 64 Hardware Multiply: Sequential

Multiplicand Multiplier << 1 >> 1 (32 bit) (16 bit)

lsb==1? 32+ 32 Product (32 bit) we

• Control: repeat 16 times • If least significant bit of multiplier is 1… • Then add multiplicand to product • Shift multiplicand left by 1 • Shift multiplier right by 1 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 65 Hardware Multiply: Combinational C<<4 C<<3 C<<4 + 0 + C<<3 0 0 16 16 +

0 P +

C<<2 +

16 C<<2 + 0 16 0 16 P 16

C<<1 + C<<1 0 + 16 0 16 C C 0 0

• Multiply by N bits at a time using N adders • Example: N=5, terms (P=product, C=multiplicand, M=multiplier) • P = (M[0] ? (C) : 0) + (M[1] ? (C<<1) : 0) + (M[2] ? (C<<2) : 0) + (M[3] ? (C<<3) : 0) + … • Arrange like a tree to reduce gate delay critical path • Delay? N2 vs N*log N? Not that simple, depends on adder • Approx “2N” versus “N + log N”, with optimization: O(log N) CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 66 Partial Sums/Carries

• Observe: carry-outs don’t have to be chained immediately • Can be saved for later and added back in 00111 = 7 +00011 = 3 00100 // partial sums (sums without carrries) +00110 // partial carries (carries without sums) 01010 = 10

• Partial sums/carries use simple half-adders, not full-adders + Aren’t “chained” ® can be done in two levels of logic – Must sum partial sums/carries eventually, and this sum is chained • d(CS-adder) = 2 + d(normal-adder) • What is the point?

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 67 Three Input Addition

• Observe: only 0/1 carry-out possible even if 3 bits added 00111 = 7 00011 = 3 +00010 = 2 00110 // partial sums (sums without carrries) +00110 // partial carries (carries without sums) 01100 = 12

• Partial sums/carries use full adders + Still aren’t “chained” ® can be done in two levels of logic • The point is delay(CS-adder) = 2 + delay(normal-adder)… • …even for adding 3 numbers! • 2 + delay(normal-adder) < 2 * delay(normal-adder)

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 68 Hardware != Software: Part Deux

• Recall: hardware is parallel, software is sequential • Exploit: evaluate independent sub-expressions in parallel

• Example I: S = A + B + C + D • Software? 3 steps: (1) S1 = A+B, (2) S2 = S1+C, (3) S = S2+D + Hardware? 2 steps: (1) S1 = A+B, S2=C+D, (2) S = S1+S2

• Example II: S = A + B + C • Software? 2 steps: (1) S1 = A+B, (2) S = S1+C • Hardware? 2 steps: (1) S1 = A+B (2) S = S1+C + Actually hardware can do this in 1.2 steps! (CSA adder)

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 69 Carry Save Addition (CSA)

• Carry save addition (CSA): delay(N adds) < N*d(1 add) • Enabling observation: unconventional view of full adder

• 3 inputs (A,B,Cin) ® 2 outputs (S,Cout)

• If adding two numbers, only thing to do is chain Cout to Cin+1 • But what if we are adding three numbers (A+B+D)? • One option: back-to-back conventional adders

• (A,B,CinT) ® (T,CoutT), chain CoutT to CinT+1

• (T,D,CinS) ® (S,CoutS), chain CoutS to CinS+1 • Notice: we have three independent inputs to feed first adder

• (A,B,D) ® (T,CoutT), no chaining (CSA: 2 gate levels) • T: A+B+D partial sum

• CoutT: A+B+D partial carry

• (T,CoutT,CinS) ® (S,CoutS), chain CoutS to CinS+1

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 70 Carry Save Addition (CSA)

A B A B A B A B C 3 3 2 2 1 1 0 0 B0 • 2 RC adders FA FA FA FA + 2 * d(add) gate delays

0 T3 D3 T2 D2 T1 D1 T0 D0 CD0 – d(add) is really long

FA FA FA FA FA

CO S3 S2 S1 S0

A3B3 D3A2B2 D2A1B1 D1A0B0 D0 • CSA+RC adder FA FA FA FA • 2 + d(add) 0 T3 T2 T1 T0 CB0 CD0 • Subtraction works too • ? FA FA FA FA FA

CO S3 S2 S1 S0 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 71 Carry Save Addition (CSA)

A B A B A B A B C 3 3 2 2 1 1 0 0 B0 • 2 general adders Any adder (carry-select, CLA, …) – 2 * d(add) gate delays

0 T3 D3 T2 D2 T1 D1 T0 D0 CD0 + d(add) may be fast Any adder (carry-select, CLA, …)

CO S3 S2 S1 S0

A3B3 D3A2B2 D2A1B1 D1A0B0 D0 • CSA+general adder FA FA FA FA + 2 + d(add) gate levels 0 T3 T2 T1 T0 CB0 CD0 + d(add) may be fast Any adder (carry-select, CLA, …)

CO S3 S2 S1 S0 CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 72 CSA Tree Multiplier C<<8 • Use 3-to-2 CSA adders 0 • Build a tree structure C<<7 0 • called a “Wallace Tree” C<<6

0 CSA • 16-bit CSA C<<5 • Start: 16 bits 0 • 1st: 5*(3->2)+1 = 11 C<<4 0 P • 2nd: 3*(3->2)+2 = 8 C<<3 • 3rd: 2*(3->2)+2 = 6 0 CLA

• 4th: 2*(3->2)+0 = 4 CSA th C<<2 CSA • 5 : 1*(3->2)+1 = 3 CSA 0 th • 6 : 1*(3->2)+0 = 2 C<<1 th 0 CSA

• 7 : CLA CSA • delay: 2+(6*2)+9 C CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 0 73 Consecutive Addition: Carry Save Adder

A B A B A B A B C 3 3 2 2 1 1 0 0 B0 • 2 N-bit RC adders FA FA FA FA + 2 + d(add) gate delays

0 T3 D3 T2 D2 T1 D1 T0 D0 CD0

FA FA FA FA FA • M N-bit RC adders delay • Naïve: O(M*N) CO S3 S2 S1 S0 • Actual: O(M+N)

A3B3 D3A2B2 D2A1B1 D1A0B0 D0 • M N-bit Carry Select? FA FA FA FA • Delay calculation tricky

0 T3 T2 T1 T0 CB0 CD0 • Carry Save Adder (CSA) FA FA FA FA FA • 3-to-2 CSA tree + adder

CO S3 S2 S1 S0 • Delay: O(log M + log N) CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 74 Floating Point

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 75 Floating Point (FP) Numbers

• Floating point numbers: numbers in scientific notation • Two uses

• Use I: real numbers (numbers with non-zero fractions) • 3.1415926… • 2.1878… • 6.62 * 10–34

• Use II: really big numbers • 3.0 * 108 • 6.02 * 1023

• Aside: best not used for currency values

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 76 Scientific Notation

• Scientific notation: • Number [S,F,E] = S * F * 2E • S: sign • F: significand (fraction) • E: exponent • “Floating point”: binary (decimal) point has different magnitude

+ “Sliding window” of precision using notion of significant digits • Small numbers very precise, many places after decimal point • Big numbers are much less so, not all integers representable • But for those instances you don’t really care anyway – Caveat: all representations are just approximations • Sometimes wierdos like 0.9999999 or 1.0000001 come up + But good enough for most purposes

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 77 IEEE 754 Standard Precision/Range

• Single precision: float in C • 32-bit: 1-bit sign + 8-bit exponent + 23-bit significand • Range: 2.0 * 10–38 < N < 2.0 * 1038 • Precision: ~7 significant (decimal) digits • Used when exact precision is less important (e.g., 3D games)

• Double precision: double in C • 64-bit: 1-bit sign + 11-bit exponent + 52-bit significand • Range: 2.0 * 10–308 < N < 2.0 * 10308 • Precision: ~15 significant (decimal) digits • Used for scientific computations

• Numbers >10308 don’t come up in many calculations • 1080 ~ number of atoms in universe

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 78 Morgan Kaufmann Series in Computer Architecture and Design : Computer…vised Fourth Edition : The Hardware/Software Interface (4th Edition) 2/6/14, 10:19 AM

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 79

Patterson, David A.; Hennessy, John L.. Morgan Kaufmann Series in Computer Architecture and Design : Computer Organization and Design, Revised Fourth Edition : The Hardware/Software Interface (4th Edition). St. Louis, MO, USA: Morgan Kaufmann, 2011. p 277. http://site.ebrary.com/lib/upenn/Doc?id=10509203&ppg=277

http://site.ebrary.com/lib/upenn/docPrint.action?encrypted=e171de…d41d44dba3b10d668ca117e0d3b9e1c59428e41b6d50cbf19ce7ebcb1b9ba879 Page 1 of 2 Morgan Kaufmann Series in Computer Architecture and Design : Computer…vised Fourth Edition : The Hardware/Software Interface (4th Edition) 2/6/14, 10:21 AM

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 80 Patterson, David A.; Hennessy, John L.. Morgan Kaufmann Series in Computer Architecture and Design : Computer Organization and Design, Revised Fourth Edition : The Hardware/Software Interface (4th Edition). St. Louis, MO, USA: Morgan Kaufmann, 2011. p 279. http://site.ebrary.com/lib/upenn/Doc?id=10509203&ppg=279

http://site.ebrary.com/lib/upenn/docPrint.action?encrypted=6117b4…c1b1253394631a61f9bd11d1d89f2c1317de0391f6185100115fcc5fe1002a0e1 Page 1 of 2 Floating Point is Inexact • Accuracy problems sometimes get bad • FP arithmetic not associative: (A+B)+C not same as A+(B+C) • Addition of big and small numbers (summing many small numbers) • Subtraction of two big numbers • Example, what’s (1*1030 + 1*100) – 1*1030? • Intuitively: 1*100 = 1 • But: (1*1030 + 1*100) – 1*1030 = (1*1030 – 1*1030) = 0 • Reciprocal math: “x/y” versus ”x*(1/y)” • Reciprocal & multiply is faster than divide, but less precise • Compilers are generally conservative by default • GCC flag: –ffast-math (allows assoc. opts, reciprocal math) • Numerical analysis: field formed around this problem • Re-formulating algorithms in a way that bounds numerical error • In your code: never test for equality between FP numbers • Use something like: if (abs(a-b) < 0.00001) then … CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 81 Pentium FDIV Bug

• Pentium shipped in August 1994 • Intel actually knew about the bug in July • But calculated that delaying the project a month would cost ~$1M • And that in reality only a dozen or so people would encounter it • They were right… but one of them took the story to EE times • By November 1994, firestorm was full on • IBM said that typical Excel user would encounter bug every month • Assumed 5K divisions per second around the clock • People believed the story • IBM stopped shipping Pentium PCs • By December 1994, Intel promises full recall • Total cost: ~$550M

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 82 Arithmetic Latencies

• Latency in cycles of common arithmetic operations • Source: Agner Fog, https://www.agner.org/optimize/#manuals • AMD Ryzen core Int 32 Int 64 Fp 32 Fp 64 Add/Subtract 1 1 5 5 Multiply 3 3 5 5 Divide 14-30 14-46 8-15 8-15

• Divide is variable latency based on the size of the dividend • Detect number of leading zeros, then divide • Why is FP divide faster than integer divide?

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 83 Summary

App App App • Integer addition System software • Most timing-critical operation in datapath • Hardware != software Mem CPU I/O • Exploit sub-addition parallelism • Fast addition • Carry-select: parallelism in sum • Multiplication • Chains and trees of additions • Division • Floating point

• Next: single-cycle datapath

CIS 501: Comp Arch| Dr. Joe Devietti | Arithmetic 84