Computer Architecture (Plus Finishing up Computer Arithmetic) Fall 2019 John K

IWKS 2300 Computer Architecture (plus finishing up computer arithmetic) Fall 2019 John K. Bennett From Last Lecture: "Ripple" Carry Adder Ripple carry makes addition time approximately equal to number of bits times the propagation delay of a Full Adder A B Cin A B Cin A B Cin Full Adder Full Adder Full Adder Cout ∑ Cout ∑ Cout ∑ Full adder prop. delay = 3 gpd (carry output) So a 16 bit adder would take 48 gpd to complete add Eliminating Ripple Carry: Carry Look Ahead Basics If we understand how carry works we can compute carry in advance. This is called "Carry Look-Ahead." For any bit position, if A = 1 and B = 1; Cout = 1, i.e., a carry will be generated to the next bit position, regardless of value of Cin. This is called "Carry Generate" For any bit position, if one input is 1 and the other input is 0; Cout will equal Cin (i.e., the value of Cin will be propagated to the next bit position. This is called "Carry Propagate" For any bit position, if A = 0 and B = 0; Cout will equal 0, regardless of value of Cin. This is called "Carry Stop." A B Cin A B Cin A B Cin Full Adder Full Adder Full Adder Cout ∑ Cout ∑ Cout ∑ Carry Generate, Propagate and Stop Truth table for Full Adder Cin A B ∑ Cout f fgps 0 0 0 x CSi gps fgps 0 1 1 x CPi fgps 1 0 1 x CPi A B Cin fgps 1 1 0 x CGi Full Adder Cout ∑ X No need for carry chain Carry Look Ahead Basics The equations to compute Cin at Bit Position i are as follows: Cini = Cgi-1 + Cpi-1 ● Cgi-2 + Cpi-1 ● Cpi-2 ● Cgi-3 … + Cpi-1 ● Cpi-2 … Cp1 ● Cg0 Practical Considerations Very wide (more than 8 input) gates are impractical, so we would likely use a logn depth tree of gates to implement the wide ANDs and ORs. This is still faster than chained carry, even for 16 bits (and is much faster for 32 or 64 bit adders). Cini = Cgi-1 + Cpi-1 ● Cgi-2 + Cpi-1 ● Cpi-2 ● Cgi-3 … + Cpi-1 ● Cpi-2 … Cp1 ● Cg0 What About Multiplication? 3 2 1 0 3 2 1 0 03 02 01 00 13 12 11 10 23 22 21 20 33 32 31 30 Classic Multiplication in Hardware/Software Use add – shift, like pen and paper method: Speeding Up Binary Multiplication 1. Retire more than bit at a time: • 2 bits at a time ("Booth’s Algorithm") • 3 bits at a time: Bits Operation 000 No Operation 001 Add Multiplicand 010 Add Multiplicand 011 Add 2xMultiplicand 100 Sub 2xMultiplicand 101 Sub Multiplicand 110 Sub Multiplicand 111 No Operation 2. Parallel Multiplier Using Carry Save Addition Carry Save Addition The idea is to perform several additions in sequence, keeping the carries and the sum separate. This means that all of the columns can be added in parallel without relying on the result of the previous column, creating a two output "adder" with a time delay that is independent of the size of its inputs. The sum and carry can then be recombined using one normal carry-aware addition (ripple or CLA) to form the correct result. CSA Uses Full Adders Depth = 4; 7 adders "Wallace Tree" Addition Plus add with carry CSA Adder Tree Depth = 7; 7 adders used Plus add with carry A 4-bit Example (carry propagating to the right) (or carry look-ahead) Example: An 8-bit Carry Save Array Multiplier A parallel multiplier for unsigned operands. It is composed of 2-input AND gates for producing the partial products, a series of carry save adders for adding them and a ripple-carry adder for producing the final product. FA with 3 inputs, 2 outputs Generating Partial Products What is Computer Architecture? Machine Organization + Instruction Set Architecture Decisions in each area are made for reasons of: Cost Performance Compatibility with earlier designs Computer Design is the art of balancing these criteria Classic Machine Organization (Von Neumann) Input (mouse, keyboard, …) Output (display, printer, …) Memory main (DRAM), cache (SRAM) Input secondary (disk, CD, DVD, …) Output Datapath Processor Processor Control (CPU) Control Memory 1001010010110000 0010100101010001 1111011101100110 1001010010110000 Datapath 1001010010110000 1001010010110000 Atanasoff–Berry Computer (Iowa State University) (1937-42; vacuum tubes) Zuse Z3 (Nazi Germany) (1941-43; relays) Von Neumann (Princeton) Machine (circa 1940) CPU Input device Memory Arithmetic Logic Unit (ALU) (data + Registers instructions) Output Control device Keyboard Gordon Moore, Andy Grove (and others) John Von Neumann (and others) ... made it possible ... made it small and fast. Harvard Mark 1 (circa 1940) Howard Aiken The ALU Arithmetic (in order of implementation complexity): Add operation Subtract a Shift (Right and Left) 32 ALU Rotate (Right and Left result 32 Multiply b Divide 32 Floating Point Operations Logic (usually implemented with multiplexors) And Nand Or Nor Not XNor Xor etc. Registers While there have been "memory-only" machines, even early computers typically had at least one register (called the "accumulator"), used to capture the output of the ALU for the next instruction. Since memory (RAM) is much slower than registers (which are internal to the CPU), we would like a lot of them. But registers are very expensive relative to RAM, and we have to be able to address every register. This impacts both instruction set design and word length (e.g., 8 bit, 16 bit, 32 bit, 64 bit). This has led to unusual designs, e.g., the SPARC architecture "register windows." Control Early computers were hardwired to perform a single program. Later, the notion of a "stored program" was introduced. Early programmers entered programs in binary directly into memory using switches and buttons. Assemblers and compilers made it possible for more human-readable programs to be translated into binary. Binary programs, however entered, are interpreted by the hardware to generate controls signals. This interpretation can be "hardwired" logic, or another computer using what is known as "microprogramming." Processing Logic: fetch-execute cycle CPU Input device Memory Arithmetic Logic Unit (ALU) (data + Registers instructions) Output Control device Executing the current instruction involves one or more of the following tasks: Have the ALU compute some function out = f (register values) Write the ALU output to selected registers As a side-effect of this computation, Keyboard determine what instruction to fetch and execute next. What Do Instructions Look Like in Memory? In a Von Neumann machine, both instructions and data are stored in the same memory. Data is just a set of bits in one or more words of memory. Instructions contain operation codes ("Op Codes") and addresses (of either registers or RAM). oprn addr 1 oprn addr addr 1 2 oprn addr addr addr 1 2 3 Suppose "addr" was 4 bits and the word length was 16 bits. How many registers could we have? How many operations? Architecture Families Before mid-60’s, every machine had a different ISA programs from previous generation could not run on new machine (this made replacement very expensive) IBM System/360 introduced the concept of an "architecture family" based on different detailed implementations single instruction set architecture wide range of price and performance with same software: o memory path width (1 byte to 8 bytes) o faster, more complex CPU design o greater I/O throughput and overlap IBM 360 Architecture Family Model Shipped Scientific Commercial CPU Memory Memory size Performance Performance Bandwidth Bandwidth KB KIPS KIPS MB/Sec MB/Sec 30 Jun-65 10.2 29 1.3 0.7 8-64 40 Apr-65 40 75 3.2 0.8 16-256 50 Aug-65 133 169 8 2 64-512 20 Mar-66 2 2.6 4-32 91 Oct-67 1,900 1,800 133 164 1024-4096 65 Nov-65 563 567 40 21 128-1024 75 Jan-66 940 670 41 43 256-1024 67 May-66 40 21 512-2048 44 Sep-66 118 185 16 4 32-256 95 Feb-68 3800 est. 3600 est. 133 711 5220 25 Oct-68 9.7 25 1.1 2.2 16-48 85 Dec-69 3,245 3,418 100 67 512-4096 195 Mar-71 10000 est. 10000 est. 148 169 1024-4096 The Intel x86 Architecture History … Xeon 2019 ~7B 4.0GHz 42bit 384 4.5 TB Platinum ($16,616) (64 bit 8276L arch.) The Intel x86 Instruction Set Architecture Complexity instructions from 1 to 17 bytes long one operand must act as both a source and destination one operand may come from memory several complex addressing modes Why has the x86 architecture survived this long? Historically tied to MS Windows The most frequently used instructions are relatively easy to implement and optimize Compilers avoid the portions of the architecture that are slow (i.e., most compilers for X86 machines only use a fraction of the instruction set). CISC vs. RISC CISC = Complex Instruction Set Computer RISC = Reduced Instruction Set Computer Historically, machines tend to add features over time Instruction opcodes IBM 70X, 70X0 series went from 24 opcodes to 185 in 10 years At the same time, performance increased 30 times Addressing modes Special purpose registers CISC motivations were to Improve efficiency, since complex instructions implemented in hardware presumably execute faster Supposed to make life easier for compiler writers Supposed to support more complex higher-level languages CISC vs. RISC Examination of actual code demonstrated many of these features were not used, largely because compiler code generation and optimization is hard even with simple instruction sets. RISC advocates (e.g., Dave Patterson of UC Berkeley) proposed simple, limited (reduced) instruction set large number of general purpose registers instructions mostly only used registers optimized instruction pipeline Benefits of this approach included: faster execution of instructions commonly used faster design and implementation Issues: things like floating point had to be implemented in SW CISC vs.

Computer Architecture (Plus Finishing up Computer Arithmetic) Fall 2019 John K

Comparison of Parallel and Pipelined CORDIC Algorithm Using RCA and CSA

Basics of Logic Design Arithmetic Logic Unit (ALU) Today's Lecture

Implementation of Carry Tree Adders and Compare with RCA and CSLA

CS/EE 260 – Homework 5 Solutions Spring 2000

UNIT 8B a Full Adder

Design of High Speed and Low Power Six Transistor Full Adder Using Two Transistor Xor Gate

Half Adder, Which Finds the Sum of Two Bits

Lecture 4 Adders

On the Design and Analysis of Quaternary Serial and Parallel Adders

Computer Arithmetic

Adders: Efficient Multiple Input

Comparative Study on Transistor Based Full Adder Designs