
IWKS 2300 Architecture (plus finishing up computer arithmetic)
Fall 2019
John K. Bennett

From Last Lecture: "Ripple"

Ripple carry makes the total add time approximately equal to the number of bits times the propagation delay of a full adder.

[Figure: a ripple-carry adder — full adders chained in series, each stage's Cout feeding the next stage's Cin]

Full adder propagation delay = 3 gpd (to the carry output), so a 16-bit adder would take 48 gpd to complete an add.

Eliminating Ripple Carry: Carry Look-Ahead Basics

 If we understand how carry works, we can compute the carries in advance. This is called "carry look-ahead."
 For any bit position, if A = 1 and B = 1, Cout = 1, i.e., a carry will be generated to the next bit position regardless of the value of Cin. This is called "carry generate."
 For any bit position, if one input is 1 and the other input is 0, Cout will equal Cin (i.e., the value of Cin will be propagated to the next bit position). This is called "carry propagate."
 For any bit position, if A = 0 and B = 0, Cout will equal 0, regardless of the value of Cin. This is called "carry stop."

[Figure: the same chain of full adders, each bit position labeled by whether it generates, propagates, or stops the carry]

Carry Generate, Propagate and Stop

 Truth table for Full Adder

A  B  Cin  Cout  Class
0  0  x    0     CSi (carry stop)
0  1  x    Cin   CPi (carry propagate)
1  0  x    Cin   CPi (carry propagate)
1  1  x    1     CGi (carry generate)

No need for a carry chain.

Carry Look-Ahead Basics

 The equations to compute Cin at Bit Position i are as follows:

Cin_i = Cg_(i-1) + Cp_(i-1)·Cg_(i-2) + Cp_(i-1)·Cp_(i-2)·Cg_(i-3) + … + Cp_(i-1)·Cp_(i-2)·…·Cp_1·Cg_0

Practical Considerations

Very wide (more than 8 input) gates are impractical, so we would likely use a logn depth tree of gates to implement the wide ANDs and ORs. This is still faster than chained carry, even for 16 bits (and is much faster for 32 or 64 bit adders).
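To make the look-ahead equations above concrete, here is a minimal Python sketch (my own illustration, not lecture code) that evaluates the same sum-of-products terms for every bit; it assumes the carry-in at bit 0 is zero, and all names are just illustrative.

def cla_add(a, b, n=16):
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]    # carry generate: Ai AND Bi
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # carry propagate: Ai XOR Bi

    def cin(i):
        # Cin_i = Cg_(i-1) + Cp_(i-1)*Cg_(i-2) + ... + Cp_(i-1)*...*Cp_1*Cg_0
        c = 0
        for j in range(i):                 # one product term anchored at Cg_j ...
            term = g[j]
            for k in range(j + 1, i):      # ... ANDed with Cp_(j+1) .. Cp_(i-1)
                term &= p[k]
            c |= term
        return c

    total = sum((p[i] ^ cin(i)) << i for i in range(n))   # sum bit i = Pi XOR Cin_i
    return total, cin(n)                                  # (n-bit sum, carry out)

# cla_add(0xF0F0, 0x0F11) -> (1, 1), since 0xF0F0 + 0x0F11 = 0x10001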

What About Multiplication?

[Figure: partial products for a 4-bit × 4-bit multiply — multiplicand bits a3 a2 a1 a0, multiplier bits b3 b2 b1 b0, and the 16 partial-product bits pij = bi·aj, one row per multiplier bit]

Classic Multiplication in Hardware/Software

Use add–shift, like the pen-and-paper method.

Speeding Up Binary Multiplication

1. Retire more than one bit at a time:
• 2 bits at a time ("Booth's Algorithm")
• 3 bits at a time, recoded as follows (a sketch of this recoding follows the list):

Bits   Operation
000    No Operation
001    Add Multiplicand
010    Add Multiplicand
011    Add 2x Multiplicand
100    Sub 2x Multiplicand
101    Sub Multiplicand
110    Sub Multiplicand
111    No Operation

2. Parallel Multiplier Using Carry Save Addition
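A rough Python model of the recoding table above (mine, not lecture code): each 3-bit group of the multiplier selects add/subtract of 1× or 2× the multiplicand, retiring two multiplier bits per step. It assumes the multiplier fits in the signed range of the chosen width; the function name and test values are just illustrative.

RECODE = {0b000: 0, 0b001: +1, 0b010: +1, 0b011: +2,
          0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_multiply(multiplicand, multiplier, bits=16):
    # Works for multipliers that fit in `bits`-bit two's complement.
    extended = multiplier << 1             # append the implicit 0 below bit 0
    product = 0
    for i in range(0, bits, 2):            # retire two multiplier bits per step
        group = (extended >> i) & 0b111    # bits i+1, i, i-1 of the multiplier
        product += RECODE[group] * multiplicand * (4 ** (i // 2))
    return product

assert booth_multiply(1234, 567) == 1234 * 567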

Carry Save Addition

The idea is to perform several additions in sequence, keeping the carries and the sums separate. This means that all of the columns can be added in parallel without relying on the result of the previous column, creating a two-output "adder" with a time delay that is independent of the size of its inputs. The sum and carry can then be recombined using one normal carry-aware addition (ripple or CLA) to form the correct result.
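A small Python sketch (my own, under the description above) of one carry-save step and of summing several addends with it; the final `+` stands in for the single carry-propagating add.

def carry_save(x, y, z):
    s = x ^ y ^ z                              # per-bit sums, no carries propagated
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carries, shifted into place
    return s, c

def csa_sum(addends):
    # Reduce a list of addends with CSA steps, then do one real addition.
    s, c = addends[0], 0
    for a in addends[1:]:
        s, c = carry_save(s, c, a)
    return s + c                               # the one carry-propagating add

assert csa_sum([3, 5, 7, 11, 13]) == 39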

CSA Uses Full Adders

[Figure: a "Wallace tree" of carry-save adders — depth 4, 7 adders, plus a final add with carry — compared with a linear CSA adder tree — depth 7, 7 adders, plus a final add with carry]

A 4-bit Example (carry propagating to the right, or using carry look-ahead)

Example: An 8-bit Carry Save Array Multiplier

A parallel multiplier for unsigned operands. It is composed of 2-input AND gates for producing the partial products, a series of carry save adders for adding them and a ripple-carry adder for producing the final product.
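As a rough software model (mine, not the lecture's circuit) of how the AND gates form the partial products that the carry-save adders then sum:

def partial_products(a, b, n=8):
    # AND each multiplier bit bi with the whole multiplicand, shifted into column i
    return [(a << i) if (b >> i) & 1 else 0 for i in range(n)]

assert sum(partial_products(0x5A, 0xC3)) == 0x5A * 0xC3   # the CSA tree performs this sum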

[Figure: each cell of the array is a full adder with 3 inputs and 2 outputs; the AND gates generate the partial products]

What is Computer Architecture?

Machine Organization + Instruction Set Architecture

Decisions in each area are made for reasons of:

 Cost

 Performance

 Compatibility with earlier designs

Computer design is the art of balancing these criteria.

Classic Machine Organization (Von Neumann)

 Input (mouse, keyboard, …)
 Output (display, printer, …)
 Memory
 main (DRAM), cache (SRAM)
 secondary (disk, CD, DVD, …)
 Processor (CPU)
 Control
 Datapath

[Figure: block diagram connecting Input, Output, Memory (holding binary instructions and data), and the processor's Control and Datapath]

Atanasoff–Berry Computer (Iowa State University) (1937-42; vacuum tubes); Zuse (Nazi Germany) (1941-43; relays)

Von Neumann (Princeton) Machine (circa 1940)

[Figure: the Von Neumann machine — a CPU containing an ALU, Registers, and Control, connected to Memory (data + instructions), an input device (keyboard), and an output device]

John Von Neumann (and others) made it possible; Gordon Moore, Andy Grove (and others) made it small and fast.

Harvard Mark 1 (circa 1940)

Howard Aiken

The ALU

 Arithmetic (in order of implementation complexity):

 Add
 Subtract
 Shift (Right and Left)
 Rotate (Right and Left)
 Multiply
 Divide
 Floating Point Operations

[Figure: the ALU takes 32-bit inputs a and b and produces a 32-bit result]

 Logic (usually implemented with multiplexors)
 And / Nand
 Or / Nor
 Not / XNor
 Xor, etc.

Registers

 While there have been "memory-only" machines, even early machines typically had at least one register (called the "accumulator"), used to capture the output of the ALU for the next instruction.
 Since memory (RAM) is much slower than registers (which are internal to the CPU), we would like a lot of them.
 But registers are very expensive relative to RAM, and we have to be able to address every register. This impacts both instruction set design and word length (e.g., 8 bit, 16 bit, 32 bit, 64 bit).
 This has led to unusual designs, e.g., the SPARC architecture's "register windows."

Control

 Early computers were hardwired to perform a single program.
 Later, the notion of a "stored program" was introduced. Early programmers entered programs in binary directly into memory using switches and buttons. Assemblers and compilers made it possible for more human-readable programs to be translated into binary.
 Binary programs, however entered, are interpreted by the hardware to generate control signals. This interpretation can be "hardwired" logic, or can be done by another computer using what is known as "microprogramming."

Processing Logic: fetch-execute cycle

[Figure: the fetch-execute cycle on the Von Neumann machine — CPU (ALU, Registers, Control), Memory (data + instructions), input and output devices]

Executing the current instruction involves one or more of the following tasks:

 Have the ALU compute some function out = f (register values)

 Write the ALU output to selected registers

 As a side-effect of this computation, determine what instruction to fetch and execute next.

What Do Instructions Look Like in Memory?

 In a Von Neumann machine, both instructions and data are stored in the same memory.

 Data is just a set of bits in one or more words of memory.

 Instructions contain operation codes ("Op Codes") and addresses (of either registers or RAM).

[Figure: instruction formats — one address: | oprn | addr1 |; two addresses: | oprn | addr1 | addr2 |; three addresses: | oprn | addr1 | addr2 | addr3 |]

 Suppose "addr" was 4 bits and the word length was 16 bits. How many registers could we have? How many operations?

Architecture Families

 Before the mid-60's, every machine had a different ISA
 programs from the previous generation could not run on the new machine (this made replacement very expensive)
 IBM System/360 introduced the concept of an "architecture family" based on different detailed implementations
 single instruction set architecture
 wide range of price and performance with the same software:
o memory path width (1 byte to 8 bytes)
o faster, more complex CPU design
o greater I/O throughput and overlap

IBM 360 Architecture Family

Model  Shipped  Scientific perf.  Commercial perf.  CPU bandwidth  Memory bandwidth  Memory size
                (KIPS)            (KIPS)            (MB/sec)       (MB/sec)          (KB)
30     Jun-65   10.2              29                1.3            0.7               8-64
40     Apr-65   40                75                3.2            0.8               16-256
50     Aug-65   133               169               8              2                 64-512
20     Mar-66   2                 2.6               -              -                 4-32
91     Oct-67   1,900             1,800             133            164               1024-4096
65     Nov-65   563               567               40             21                128-1024
75     Jan-66   940               670               41             43                256-1024
67     May-66   -                 -                 40             21                512-2048
44     Sep-66   118               185               16             4                 32-256
95     Feb-68   3,800 est.        3,600 est.        133            711               5220
25     Oct-68   9.7               25                1.1            2.2               16-48
85     Dec-69   3,245             3,418             100            67                512-4096
195    Mar-71   10,000 est.       10,000 est.       148            169               1024-4096

The Intel Architecture History

[Table: the Intel architecture history; most recent entry — Xeon Platinum 8276L ($16,616), 2019, ~7 billion transistors, 4.0 GHz, 42-bit, 384, 4.5 TB, 64-bit architecture]

The Intel x86 Instruction Set Architecture

 Complexity

 instructions from 1 to 17 bytes long

 one operand must act as both a source and destination

 one operand may come from memory

 several complex addressing modes

 Why has the x86 architecture survived this long?

 Historically tied to MS Windows

 The most frequently used instructions are relatively easy to implement and optimize

 Compilers avoid the portions of the architecture that are slow (i.e., most compilers for x86 machines use only a fraction of the instruction set).

CISC vs. RISC

 CISC = Complex Instruction Set Computer  RISC = Reduced Instruction Set Computer

 Historically, machines tend to add features over time  Instruction opcodes

 IBM 70X, 70X0 series went from 24 opcodes to 185 in 10 years

 At the same time, performance increased 30 times
 Addressing modes
 Special purpose registers
 CISC motivations were to
 Improve efficiency, since complex instructions implemented in hardware presumably execute faster
 Supposed to make life easier for compiler writers
 Supposed to support more complex higher-level languages

CISC vs. RISC

 Examination of actual code demonstrated many of these features were not used, largely because compiler code generation and optimization is hard even with simple instruction sets.  RISC advocates (e.g., Dave Patterson of UC Berkeley) proposed

 simple, limited (reduced) instruction set

 large number of general purpose registers

 instructions mostly operate only on registers

 optimized instruction pipeline  Benefits of this approach included:

 faster execution of instructions commonly used

 faster design and implementation
 Issues: things like floating point had to be implemented in SW

CISC vs. RISC

 Some early RISC architectures compared to contemporaneous CISC

Machine         Year  # Instructions  Instruction size (bytes)  Addressing modes  Registers
IBM 370/168     1973  208             2 - 6                     4                 16
VAX 11/780      1978  303             2 - 57                    22                16
Intel 80486     1989  235             1 - 11                    11                8
Motorola 88000  1988  51              4                         3                 32
MIPS R4000      1991  94              4                         1                 32
IBM 6000        1990  184             4                         2                 32

CISC vs. RISC

 Which approach is best?

 In general, fewer simpler instructions allow for increased clock speeds.

 Typically, RISC processors take less than half the design time of a CISC processor, sometimes far less.

 RISC/CISC comparisons often neglect the increased time it takes to do things like develop a software floating point library.

 In addition, CISC designers have adopted RISC techniques everywhere possible.

 Instruction complexity is only one variable  A couple of design principles:

 Make the common case fast.

 Design for the expected workload, e.g., a GPU needs a very different ISA than a Windows 10 CPU.

Some Typical Assembly Language Constructs

// In what follows R1, R2, R3 are registers, PC is the program counter,
// and addr is some address in memory. There is an implied PC++
// with every instruction.

ADD R1,R2,R3      // R1 ← R2 + R3

ADDI R1,R2,addr   // R1 ← R2 + addr

AND R1,R1,R2      // R1 ← R1 and R2 (bit-wise)

JMP addr          // PC ← addr

JEQ R1,R2,addr    // IF R1 == R2 THEN PC ← addr ELSE PC++

LOAD R1, addr     // R1 ← RAM[addr]

STORE R1, addr    // RAM[addr] ← R1

NOP               // Do nothing

// Etc. – *many* variants

Three Address Architecture

 Consider the following code fragment: X = (A-B) / (C+(D*E))

Load R1, A       // R1 ← Mem[A]
Load R2, B       // R2 ← Mem[B]
Sub R3, R1, R2   // R3 ← R1 – R2
Load R1, D       // R1 ← Mem[D]
Load R2, E       // R2 ← Mem[E]
Mpy R4, R1, R2   // R4 ← R1 * R2
Load R1, C       // R1 ← Mem[C]
Add R2, R1, R4   // R2 ← R1 + R4
Div R1, R3, R2   // R1 ← R3 / R2
Store X, R1      // Mem[X] ← R1

There are typically a finite number of registers, on the order of 16-32.
This code: 10 instructions, 6 memory references; the code is not compact.

Two Address Architecture

 Consider the following code fragment: X = (A-B) / (C+(D*E))

Load R2, A   // R2 ← Mem[A]
Load R1, B   // R1 ← Mem[B]
Sub R2, R1   // R2 ← R2 – R1
Load R1, D   // R1 ← Mem[D]
Load R3, E   // R3 ← Mem[E]
Mpy R1, R3   // R1 ← R1 * R3
Load R4, C   // R4 ← Mem[C]
Add R1, R4   // R1 ← R1 + R4
Div R2, R1   // R2 ← R2 / R1
Store X, R2  // Mem[X] ← R2

There are typically a finite number of registers, on the order of 16-32.
This code: 10 instructions, 6 memory references; the code is a little more compact.

One Address Architecture

 Consider the following code fragment: X = (A-B) / (C+(D*E))

Load A       // Acc ← Mem[A]
Sub B        // Acc ← Acc - Mem[B]
Store Temp1  // Mem[Temp1] ← Acc
Load D       // Acc ← Mem[D]
Mpy E        // Acc ← Acc * Mem[E]
Add C        // Acc ← Acc + Mem[C]
Store Temp2  // Mem[Temp2] ← Acc
Load Temp1   // Acc ← Mem[Temp1]
Div Temp2    // Acc ← Acc / Mem[Temp2]
Store X      // Mem[X] ← Acc

There is one register, called the Accumulator.
This code: 10 instructions, 10 memory references; the code is more compact.

Zero Address Architecture

 Consider the following code fragment: X = (A-B) / (C+(D*E))

Push D   // SP = SP + 1; Mem[SP] ← Mem[D]
Push E   // SP = SP + 1; Mem[SP] ← Mem[E]
Mpy      // Mem[SP-1] ← Mem[SP] * Mem[SP-1]; SP = SP - 1
Push C   // SP = SP + 1; Mem[SP] ← Mem[C]
Add      // Mem[SP-1] ← Mem[SP] + Mem[SP-1]; SP = SP - 1
Push B   // SP = SP + 1; Mem[SP] ← Mem[B]
Push A   // SP = SP + 1; Mem[SP] ← Mem[A]
Sub      // Mem[SP-1] ← Mem[SP] - Mem[SP-1]; SP = SP - 1
Div      // Mem[SP-1] ← Mem[SP] / Mem[SP-1]; SP = SP - 1
Pop X    // Mem[X] ← Mem[SP]; SP = SP - 1

10 instructions, 24 memory references; the code is very compact.
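A tiny Python sketch (my own illustration, not part of the slides) that interprets this zero-address sequence; the operator convention matches the comments above (the top of the stack is the left operand), and the memory values are made up.

def run(program, mem):
    stack = []
    ops = {"Mpy": lambda a, b: a * b, "Add": lambda a, b: a + b,
           "Sub": lambda a, b: a - b, "Div": lambda a, b: a / b}
    for instr, *arg in program:
        if instr == "Push":
            stack.append(mem[arg[0]])
        elif instr == "Pop":
            mem[arg[0]] = stack.pop()
        else:                              # binary operator: top is the left operand
            top, below = stack.pop(), stack.pop()
            stack.append(ops[instr](top, below))
    return mem

prog = [("Push", "D"), ("Push", "E"), ("Mpy",), ("Push", "C"), ("Add",),
        ("Push", "B"), ("Push", "A"), ("Sub",), ("Div",), ("Pop", "X")]
mem = {"A": 20, "B": 8, "C": 2, "D": 2, "E": 2}
assert run(prog, mem)["X"] == (20 - 8) / (2 + 2 * 2)   # 2.0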

What Does it Mean to Make the Common Case Fast?

 There are a variety of techniques to speed up instruction execution. Some of these include:

 Increase clock speed (we are approaching some hard limits here)

 Pipelining (execute more than one instruction at one time)

 Caching (store data we will need near the processor)

 Other methods of improving access time  Note that we only need to employ these techniques for instructions that actually get used.  How do we know what instructions get used?

 There are decades of research exploring how different kinds of compilers and programs use instructions and memory.

 If we have a specialized workload, we can study its execution ourselves.

The Memory Hierarchy

 Registers (very small and very fast) – implemented as part of the processor.
 Cache (small and fast storage for data and instructions we expect to need). There may be several layers of cache, e.g., a small and very fast L1 cache, a larger and somewhat slower L2 cache, and an even larger and not quite as fast L3 cache.
 Main Memory (RAM) – the bulk of the volatile memory available to the CPU. Usually implemented using dynamic RAM.
 Disk – relatively non-volatile storage for large amounts of information. May use rotating media, or (more recently) SSD (solid state drive) technology.
 Although less common today, there may be additional layers in the memory hierarchy, e.g., tape, on-line, DVD, etc.

[Figure: the hierarchy from Registers to Cache to Memory (RAM) to Disk — speed goes from fastest to slowest, size from smallest to biggest, and cost per bit from highest to lowest as you move down]

Memory Hierarchy

 Ideally, all the memory we want would be in the processor, but that is cost-prohibitive (and certainly wasteful) by today's standards.
 We use the memory hierarchy to efficiently create the illusion that all memory is the same.

 The memory hierarchy must be inclusive, i.e., lower levels must include everything present in higher levels of the memory hierarchy.

 The performance of the memory hierarchy depends on hit rate, i.e., how often we find what we need at higher levels.

[Figure: the processor at the top, with Level 1, Level 2, …, Level n below; blocks of data (the unit of copying) are transferred between levels; distance from the CPU in access time, and the size of the memory at each level, increase as you move down]

Program Locality

 Caching in the memory hierarchy works because of two kinds of program "locality."

 temporal locality: an item (data or instruction) a program has just accessed is likely to be accessed again in the near future. Why?

 spatial locality: items near an item a program has just accessed are likely to be referenced soon. Why?
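A toy illustration (mine, not from the slides) of both kinds of locality in ordinary code:

def sum_rows(matrix):
    total = 0                        # reused on every iteration: temporal locality
    for row in matrix:
        for value in row:            # consecutive elements of a row: spatial locality
            total += value
    return total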

Cache Terminology

 block: minimum unit of data moved between levels
 hit: data requested is found in the nearest upper level
 miss: data requested is not found in the nearest upper level
 hit rate: fraction of memory accesses that are hits
 miss rate: fraction of memory accesses that are not hits

 miss rate = 1 – hit rate
 hit time: time to determine if the access is a hit + time to deliver the data to the CPU
 miss penalty: time to determine if the access is a miss + time to replace the block at the upper level with the corresponding block from the lower level + time to deliver the data to the CPU
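A quick back-of-the-envelope example of how these terms combine (the numbers are made up, not from the lecture):

hit_time = 1         # cycles to hit in the cache
miss_rate = 0.05     # 1 - hit rate
miss_penalty = 100   # cycles to refill from the next level
average_access_time = hit_time + miss_rate * miss_penalty
print(average_access_time)   # 6.0 cycles per access, on average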

How Do Caches Actually Work?

 Simple example:

 assume block size = one word of data

[Figure: (a) before the reference to Xn, the cache holds X1 … Xn-1; the reference to Xn causes a miss, so it is fetched from memory; (b) after the reference, Xn is in the cache as well]

 Issues:

 how do we know if a data item is in the cache?

 if it is, how do we find it?

 if not, what do we do, and what if the cache is full (or the block is "dirty")?
 Solution depends on the cache addressing scheme.

Cache Addressing Schemes

 Fully Associative - A cache where data from any address can be stored in any cache location. All tags are compared simultaneously (associatively) with the requested address, and if one matches then its associated data is accessed.
 Direct Mapped - A cache where the cache location for a given address is explicitly determined from the middle address bits. The remaining top address bits are stored as a "tag" along with the entry. In this scheme, there is only one place for any block to go.
 Set Associative - A compromise between a direct mapped cache and a fully associative cache, where each address is mapped to a certain "set" of cache locations.
 A direct mapped cache could be referred to as "one-way set associative", i.e., one location in each set, whereas a fully associative cache is "N-way associative" (where N is the total number of blocks in the cache).

Direct Mapped Cache

 Addressing scheme in direct mapped cache:

 cache block address = memory block address mod cache size (unique)

 if cache size = 2^m blocks, cache address = lower m bits of the n-bit memory block address

 remaining upper n-m bits kept as tag bits at each cache block

 also need a valid bit to recognize valid entry
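A minimal Python sketch of the addressing scheme above (the entry layout and names are my own illustration, not any particular implementation):

def direct_mapped_lookup(cache, block_address, m):
    index = block_address % (2 ** m)         # lower m bits choose the cache block
    tag = block_address >> m                 # remaining upper bits are kept as the tag
    valid, stored_tag, data = cache[index]   # each entry: (valid bit, tag, data)
    if valid and stored_tag == tag:
        return data                          # hit
    return None                              # miss: fetch the block and refill this entry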

[Figure: an 8-block direct-mapped cache; memory block addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101 all map to the cache block selected by their lower three bits (001)]

Direct Mapped Cache Implementation Example

[Figure: the address (bit positions 31-0) is split into a 20-bit tag, a 10-bit index, and a 2-bit byte offset; the index selects one of 1024 entries, each holding a valid bit, a 20-bit tag, and 32 bits of data; a tag compare produces Hit]

Cache with 1024 1-word blocks: the byte offset (the 2 least significant bits) is ignored and the next 10 bits are used to index into the cache.

Implementation of a Set-Associative Cache

[Figure: the address is split into a 22-bit tag and an 8-bit index; the index selects one of 256 sets, each holding four (V, Tag, Data) entries; four comparators and a 4-to-1 multiplexor produce Hit and Data]

4-way set-associative cache with 4 comparators and one 4-to-1 multiplexor: the size of the cache is 1K blocks = 256 sets * 4 blocks per set.
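A sketch of that lookup in Python (assuming the figure's 256 sets and 4 ways; the data structure is my own illustration):

def set_associative_lookup(cache, block_address, num_sets=256):
    index = block_address % num_sets          # selects one set
    tag = block_address // num_sets           # compared against every way in the set
    for way, (valid, stored_tag, data) in enumerate(cache[index]):
        if valid and stored_tag == tag:
            return way, data                  # hit in this way
    return None, None                         # miss: pick a victim way to replace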

Performance of Set-Associative Caches

It is generally more effective to increase the number of entries rather than associativity.

[Figure: miss rate (0%-15%) vs. associativity (one-way, two-way, four-way, eight-way) for eight cache sizes (1 KB through 128 KB); data generated from SPEC92 benchmarks with a 32-byte block size for all caches]

Cache Replacement Policy

Cache has finite size. What if the cache is full? Analogy:
• Desktop full? Move books to the bookshelf to make room.
• Bookshelf full? Move the least-used books to the library, etc.
Caches follow this same idea: if "replacement" is necessary, move the old block to the next level of cache (if "dirty"). How do we choose the "victim"? Many policies are possible, e.g.:
• FIFO (first-in-first-out)
• LRU (least recently used) – see the sketch below
• NMRU (not most recently used)
• Random
• Random + NMRU (almost as good as LRU)
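A minimal sketch (mine, not lecture code) of one of these policies, LRU, for a single cache set; load_block is a hypothetical callback that fetches the block from the next level.

from collections import OrderedDict

class LRUSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()           # tag -> data, least recently used first

    def access(self, tag, load_block):
        if tag in self.blocks:                # hit: mark as most recently used
            self.blocks.move_to_end(tag)
            return self.blocks[tag]
        if len(self.blocks) == self.ways:     # set full: evict the LRU victim
            self.blocks.popitem(last=False)   # a dirty victim would be written back here
        self.blocks[tag] = load_block(tag)    # miss: fetch from the next level down
        return self.blocks[tag]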

Pipelining: The Basic Idea

 Pipelining breaks instruction execution down into multiple stages

 Put registers between stages to "buffer" data and control.

 The idea is to start the execution of an instruction.

 As first instruction moves to second stage, start execution of second instruction, and so on.

 Speedup is the same as the number of stages, as long as the pipeline is full (see the sketch below).
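A back-of-the-envelope sketch of that speedup claim (the instruction count and stage count are made-up numbers; no stalls are modeled):

def unpipelined_cycles(n_instructions, stages):
    return n_instructions * stages            # each instruction runs start to finish

def pipelined_cycles(n_instructions, stages):
    return stages + (n_instructions - 1)      # fill the pipe once, then one per cycle

n, k = 1_000_000, 5
print(unpipelined_cycles(n, k) / pipelined_cycles(n, k))   # ~5x for large n, if never stalled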

Pipeline Hazards

 Why the pipeline is not always full:

 Structural hazards arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution (e.g., a floating point operation).

 Data Hazards arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline (e.g., A = B**19; A = A**C; or a cache miss)

 Control Hazards arise from the pipelining of branches and other instructions that change the PC (program counter), which points to the next instruction to execute (e.g., an if statement).
 Hazards in pipelines can make it necessary to stall the pipeline. When this happens, some instructions in the pipeline are allowed to proceed while others are delayed. When an instruction is stalled, all the instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never clear.

Some Final Thoughts

 Processor speeds remain very fast relative to everything else in the memory hierarchy. This isn't likely to change in the near future.
 New designs are making memory wider (and a little bit faster).
 Compilers are getting better at restructuring code to increase locality and to reduce the number of pipeline hazards. In the general case, this is really hard.
 Processor designers are making the cache visible in the instruction set architecture, making it possible for programmers / compilers to use "pre-fetching" to manually populate cache entries.