CM0144 – An Introduction

Professor David Walker [email protected]

http://users.cs.cf.ac.uk/David.W.Walker/ARCHOPASS/

Course Structure

44 lectures over 1 semester, in 3 main parts:
- Architecture (11 lectures): DW Walker
- Operating Systems (22 lectures): Y Huang
- Assembly Language (11 lectures): M Daley

Assessment
- Coursework (40%): class tests (every 2 weeks from week 3, on Mondays at 11am) worth 25%, and assembly language exercises worth 15%
- Exam (60%)

Recommended Texts

“Computer Systems Architecture”, Rob Williams, Pearson Higher Education 2000, ISBN 0201648598.
“Operating Systems”, Deitel, Deitel and Choffnes, Pearson Higher Education 2004, ISBN 0131246968.

Online reading list is at: http://www.readinglists.co.uk/rsl/student/sviewlist.dfp?id=13102

Thought For The Day

“... the cost of hardware has halved every year whilst its complexity has, on average, quadrupled every three years ... the increase in complexity and decrease in cost are both of the order of one million since the mid 1960s. Equivalent progress in, for example, the motor industry, would have provided us with luxury cars requiring only half a gallon of petrol for life, having sufficient thrust to go into orbit, and a price tag of about twenty pence.” (Professor Stonham, 1987, “Digital Logic Design”)

What is a Digital Computer?

An electronic machine which solves problems by carrying out a sequence of instructions. The instructions are its program, and are usually very simple:
- Add two numbers
- Check to see if something is zero
- Copy data from A to B

The Central Problem

We want to use a computer to do useful things, but it can only process simple instructions. The aim of this course is to explain how we do this. In essence, we design a new set of instructions which are more convenient for us to use, but which are built from those which the machine understands.

Multilevel Machines

The levels, from most to least abstract:
- Problem-Oriented Language level (translation by compiler)
- Assembly Language level (translation by assembler)
- Operating System level (partial interpretation by the OS)
- Instruction Set Architecture level (interpretation by microprogram)
- Microarchitecture level
- Digital Logic level (hardware)
Abstraction increases as we move up the levels.

Multilevel Machines (cont.)

As we progress upward, the programming model becomes more abstract. Higher levels tend to be translated, lower ones interpreted. Application programmers are particularly interested in the top two levels.

Digital Logic Level

Digital circuits are constructed from logic gates. Each gate computes a simple function of a small number of digital inputs. Clusters of logic gates are combined to make registers, control circuits and other computational devices.

Microarchitecture Level

Consists of:
- Local memory (8 to 32 registers)
- An Arithmetic Logic Unit (ALU)
The ALU is connected to the registers by the data path, which operates by sending data from the registers to the ALU and storing the result back in some register.

Instruction Set Architecture (ISA) Level

A description of the machine’s instruction set. An instruction may be carried out in hardware (direct execution) or in software (interpretation by a microprogram). There are many different types of instruction and instruction sets.

Operating System Level

Extends the ISA level with extra functions for:
- Memory management
- Process control
- Interprocess communication
- Concurrent program execution
We’ll see more later…

Assembly Language Level

This is really just a human-readable form of one of the underlying languages, based on the ISA level. Programs in assembler need to be translated into machine language before execution. Again, more coming later…

Hardware = Software

The levels are really just an abstraction to help us understand a complex system. The boundary between hardware and software is fuzzy: we can emulate hardware in software. What goes where depends on speed, cost, reliability and expected usage.

The Computer Zoo

Type                        Price  Application
Disposable                  1      Greeting cards
Embedded                    10     Watches, videos
Game                        100    Playstation etc.
Personal                    1K     Desktop PC
Server                      10K    Network file server
Collection of workstations  100K   Departmental cluster
Mainframe                   1M     Transaction processing
Supercomputer               10M    Weather prediction

Summary

- Computers are designed as a series of levels, each built upon the last
- Computer architecture is the study of those parts of the computer which are visible to the programmer
- Hardware and software are equivalent
- Computers are everywhere
- Knowing how computers work makes programming a lot easier

Milestones in Computer Architecture

A brief history of computing:
- 4 generations of innovation
- Agents of change in computing
- Sample computer families: Pentium II, UltraSPARC II, picoJava II

Generation 0: Mechanical Computers

Mechanical calculators built from cogs and gears:
- 1642: Blaise Pascal (+, -)
- 1670s: Leibniz (+, -, ×, ÷)
- 1834: Charles Babbage
  - Difference Engine (+, -)
  - Analytical Engine (general purpose): a Store (1000 × 50-digit registers), a Mill (the computational unit), and input and output via punched cards
  - But never fully completed

The Analytical Engine

Generation 0.5: Electromagnetic Computers

Various relay-based calculators:
- 1936: Konrad Zuse (the Z1)
- 1940s: John Atanasoff (had binary arithmetic, but was never completed)
- 1940s: George Stibitz (working prototype)
- 1944: Howard Aiken (Mark I): Babbage’s design, but using relays; a store of 72 words of 23 digits; an instruction time of 6 seconds; input and output using paper tape

The Mark I

Generation 1: Vacuum Tubes

Faster and cheaper than relays:
- 1943: COLOSSUS, the world’s first electronic digital computer
- 1946: ENIAC, with a 20 × 10-digit memory
- 1949: EDSAC, the first stored-program computer: the “von Neumann machine”

The von Neumann Machine

Diagram: the five units of the von Neumann machine: the Memory, the Control Unit, the Arithmetic Logic Unit (containing the Accumulator), Input and Output.

Generation 2: Transistors

Faster, cheaper and more reliable; vacuum tubes were obsolete by the 1950s.
- 1961: PDP-1, the first minicomputer
- 1963: Burroughs B5000, with special features for software
- 1964: CDC 6600, the first scientific supercomputer
- 1965: PDP-8, with the Omnibus

The PDP-8 Omnibus

Diagram: the CPU, memory, console terminal, paper tape I/O and other I/O devices all attached to a single bus, the Omnibus.

Generation 3: Integrated Circuits

Allowed dozens of transistors to be put on one chip; faster, cheaper, smaller…
- 1964: IBM System/360, the first computer family
- 1970: PDP-11, a 16-bit successor to the PDP-8 (another family), hugely popular with universities

Generation 4: Very Large Scale Integration

Millions of transistors per chip; faster, smaller, cheaper… The era of the personal computer.
- 1974: Intel 8080 (sold as a kit)
- 1981: IBM PC, built from standard parts, with the technical specs available -> clones, MSDOS, OS/2, Microsoft…

Smaller, Faster & Cheaper

The trend is for smaller computers which run faster and cost less.
Q: Why has this happened? A: Money.
At least three factors are involved:
- Moore’s Law
- Nathan’s First Law of Software
- The Virtuous Economic Circle

Moore’s Law

Chart: transistors per memory chip against year, rising exponentially from 1K in 1970 through 16K, 256K and 4M to 64M by 2000.

Nathan’s 1st Law of Software

“Software is a gas. It expands to fill the container holding it.” (Nathan Myhrvold, Microsoft)
Example: word processors
- 1980: 10-100 KB (WordStar)
- 1990: 100 KB-10 MB (WordPerfect 5.1)
- 2000: 10-100 MB (MS Office 2K)

Virtuous Economic Circle

Diagram: the virtuous circle: new technology leads to better, cheaper products, which open new markets and create new companies and new applications; more competition follows, and the money generated funds further new technology.

The Evolution of the Pentium II

Chip         Date  MHz      Transistors  Memory
4004         1971  0.108    2,300        640 B
8008         1972  0.108    3,500        16 KB
8080         1974  2        6,000        64 KB
8086         1978  5-10     29,000       1 MB
8088         1979  5-8      29,000       1 MB
80286        1982  8-12     134,000      16 MB
80386        1985  16-33    275,000      1 GB
80486        1989  25-100   1.2M         4 GB
Pentium      1993  60-233   3.1M         4 GB
Pentium Pro  1995  150-200  5.5M         4 GB
Pentium II   1997  233-400  7.5M         4 GB

The UltraSPARC II

SPARC = Scalable Processor ARChitecture, a Reduced Instruction Set Computer (RISC).
- Special instructions designed to handle multimedia (like MMX)
- A 64-bit architecture for high-end computing
- Up to 2 TB of main memory

The picoJava II

A Java Virtual Machine (JVM) realised in hardware, intended for use in embedded systems, so it needs to be very cheap. It makes it very easy to change the functionality of a device whilst in operation: a mobile phone could download a fax-viewer. The picoJava II is not a real chip, but the basis of a computer family, e.g. the Sun microJava 701.

Summary

- A general trend towards faster, cheaper, smaller and more reliable computers (Moore’s Law)
- Progress in the computer industry is self-catalysing (the Virtuous Circle)
- We will study the architecture of the Pentium, the UltraSPARC and the JVM

Next Lecture:

Processors

Previous Lecture:

We learnt about:
- Milestones in computer architecture
- Where it all began
- What caused it all to happen
- Where it might go next

Processors

We will see:
- What a CPU is, what it does, and why it does it
- The tradeoffs necessary to make cost-effective designs
- Modern CPU design principles
- How to make faster computers by exploiting parallelism

What is a CPU?

The Central Processing Unit is the “brain” of the computer. It executes programs by:
- Fetching the next instruction from memory
- Examining it
- Executing it
It consists of:
- A Control Unit
- An Arithmetic Logic Unit (ALU)
- Registers (high-speed memory), including the Program Counter (PC) and the Instruction Register (IR)

A Typical Layout

Diagram: the CPU (Control Unit, ALU and registers) connected by a bus to main memory and to I/O devices such as a disk and a printer.

The Data Path of a von Neumann Machine

Diagram: two ALU input registers, A and B, feed the ALU; the result (e.g. A+B) goes into the ALU output register and from there back into the registers.

Fetch-Decode-Execute

1. Get the current instruction and put it in the Instruction Register (IR)
2. Increment the Program Counter (PC)
3. Determine the type of instruction fetched
4. If the instruction needs a word from memory, determine where it is
5. Fetch the word, if needed, and put it into a register
6. Execute the instruction
7. Repeat

Software = Hardware

    public class Interpreter {
        static int PC, AC;
        static int instr, instr_type;
        static int data_loc, data;
        static boolean run_bit = true;

        public static void interpret(int memory[], int start_address) {
            PC = start_address;
            while (run_bit) {
                instr = memory[PC];                         // fetch the current instruction
                PC = PC + 1;                                // increment the program counter
                instr_type = get_instr_type(instr);         // decode it
                data_loc = find_data(instr, instr_type);    // locate the operand, if any
                if (data_loc > 0)
                    data = memory[data_loc];                // fetch the operand
                execute(instr_type, data);                  // execute the instruction
            }
        }
    }
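The slide does not show the helper methods. A minimal sketch of what they might look like inside the class, assuming a toy instruction format of my own (4-bit opcode in the top bits, 12-bit address in the bottom; the opcodes are hypothetical, purely for illustration):

    // Hypothetical helpers, assuming 16-bit instructions:
    // top 4 bits = opcode, bottom 12 bits = memory address.
    static int get_instr_type(int instr) {
        return instr >> 12;                      // extract the opcode
    }

    static int find_data(int instr, int instr_type) {
        // Assume every opcode except HALT (0) addresses memory;
        // return -1 when there is no operand to fetch.
        return (instr_type == 0) ? -1 : (instr & 0xFFF);
    }

    static void execute(int instr_type, int data) {
        switch (instr_type) {
            case 0: run_bit = false; break;      // HALT stops the loop
            case 1: AC = data; break;            // LOAD the accumulator
            case 2: AC = AC + data; break;       // ADD to the accumulator
            default: break;                      // further opcodes here
        }
    }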

Problem: Hardware Fast but Expensive

- Complex dedicated instructions are often faster than a sequence of simpler ones
- Complex instructions are common on expensive, high-performance computers
- But complex instructions are needed on cheaper machines too, because of rising software development costs and instruction compatibility within computer families

Solution: Interpretation

Build a simple (low-cost) computer which implements complex instructions in terms of simpler ones, as in the IBM System/360: complex hardware implemented as complex software. This has many side-benefits:
- Incorrectly implemented instructions can be fixed
- New instructions are easy to add
- A faster product lifecycle: easy to develop, test, debug and document

An Added Bonus: Control Stores

Keep the interpreter in fast read-only memory (the control store). E.g. a typical M68000 instruction takes 10 microinstructions and needs 2 references to main memory:
- Each microinstruction takes 100 nsec (using the control store)
- Each main memory reference takes 500 nsec
- Total = 2000 nsec with a control store (without one, the total = 6000 nsec)
- Direct execution = 1000 nsec
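The arithmetic behind those figures, worked through as a runnable check (the 10-microinstruction, 2-reference, 100 nsec and 500 nsec figures come from the slide; the code itself is just a sketch):

    public class ControlStore {
        public static void main(String[] args) {
            int micro = 10;      // microinstructions per M68000 instruction
            int memRefs = 2;     // references to main memory
            int tStore = 100;    // nsec per microinstruction from the control store
            int tMain = 500;     // nsec per main memory reference

            // Microinstructions fetched from the control store:
            System.out.println(micro * tStore + memRefs * tMain);   // 2000
            // Microinstructions fetched from main memory instead:
            System.out.println(micro * tMain + memRefs * tMain);    // 6000
            // Direct execution in hardware: only the memory references remain.
            System.out.println(memRefs * tMain);                    // 1000
        }
    }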

The Backlash: RISC

The Reduced Instruction Set Computer (in contrast to the Complex Instruction Set Computer). RISC machines initially had a relatively small number of instructions (<50), hence the name. The main idea is that each instruction executes in one cycle of the data path: if instructions can be issued quickly enough, then even if a RISC machine needs several cycles to do what a CISC machine does in one, it can still be faster.

So why don’t RISC designs dominate?

(Or: why do we still use the Pentium?) Backward compatibility: companies have billions invested in software for the Intel line. Intel has also managed to use some of the same ideas:
- The 80486 has a “RISC core” for executing common instructions
- Less common and complex instructions are still executed in the old-fashioned way

Design Principles for Modern Computers

- Execute all instructions directly in hardware
- Maximise the rate at which instructions are issued
- Ensure that instructions are easy to decode
- Only LOADs and STOREs should reference main memory
- Provide plenty of registers

Parallelism

- Instruction-level parallelism: multiple instructions executed simultaneously, giving more instructions per second
- Processor-level parallelism: multiple CPUs working together on the same problem

5-Stage Pipelining

The five stages: S1 instruction fetch, S2 decode, S3 operand fetch, S4 execution, S5 write-back.

Diagram: instruction 1 occupies S1 in the first time step, S2 in the second, and so on; instruction 2 follows one stage behind, so that once the pipeline is full all five stages are busy with five consecutive instructions.

Pipelining

Pipelining allows the processor to schedule more instructions per second. Suppose the cycle time per stage is 2 nsec, so each instruction takes 10 nsec to complete:
- Without pipelining, this gives 100 MIPS
- With pipelining, we get 500 MIPS, as we can issue another instruction every 2 nsec
There is a tradeoff between latency (how long it takes to execute one instruction) and bandwidth (how many MIPS we get).
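The same numbers as a calculation (the 5 stages and 2 nsec cycle time are from the slide; MIPS here is 10^3 divided by the time per instruction in nsec, since 10^9 nsec/sec ÷ 10^6 = 10^3):

    public class Pipeline {
        public static void main(String[] args) {
            int stages = 5;
            double stageNs = 2.0;

            double latencyNs = stages * stageNs;      // 10 ns to complete one instruction
            System.out.println(1000.0 / latencyNs);   // no pipelining: 100 MIPS
            System.out.println(1000.0 / stageNs);     // pipelined: one finishes every 2 ns = 500 MIPS
        }
    }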

One Pipeline Good, Two Pipelines Better?

Diagram: a single instruction-fetch stage feeds two parallel pipelines, each with its own decode, operand-fetch, execution and write-back stages.

But dual pipelines are difficult to program for:
- There is the possibility of resource conflicts (e.g. over registers)
- Instructions in one pipeline cannot depend on the outputs of those in the other

Superscalar Architectures

The S4 (execution) stage is the main bottleneck. The solution is to use multiple functional units.

Diagram: a five-stage pipeline whose execution stage fans out to several units: LOAD, STORE, ALU and floating point.

Summary

- The CPU is the first main part of a computer
- There is a tradeoff between the speed and the cost of the computer: interpretation is cheap (but slow)
- Modern processors are usually RISC-based, even if they have to support older designs
- Higher speeds can be obtained through the use of parallelism

Next Lecture:

Primary Memory

Previous Lecture:

We learnt about:
- Central Processing Units
- Processor layout
- RISC vs. CISC designs
- Instruction-level parallelism: pipelining and superscalar architectures

Primary Memory

We will see:
- A high-level overview of computer memory
- How it is organised
- The effect of byte ordering
- What cache memory is and why it is used
- How memory is packaged

Bits

The bit is the basic unit of memory. Computers normally use a binary (0/1) encoding:
- Only two distinguishable levels are needed
- It is the most robust encoding
- Binary arithmetic
Some computers use Binary Coded Decimal (BCD) instead.

Memory Addresses

Computer memory consists of cells, each with a unique address; a cell is the smallest addressable unit. Each cell consists of k bits, giving 2^k different bit combinations. Addresses are also encoded in binary, so using m bits for an address gives 2^m addressable cells. Most computers use 8-bit cells (bytes), which are grouped into words; most instructions operate on entire words.

Memory Organisation

Diagram: the same 96 bits organised three ways: as twelve 8-bit cells (4-bit addresses 0-11), as eight 12-bit words (3-bit addresses 0-7), or as six 16-bit words (3-bit addresses 0-5).

Byte Ordering

The bytes within a word can be numbered differently:
- left-to-right (big endian)
- right-to-left (little endian)
As long as we are consistent, this doesn’t matter to the computer, but it can cause problems when transferring data from one computer to another, especially with character strings.

Byte-Ordering and Numbers

Diagram: sixteen bytes of memory viewed as four 32-bit words. With big-endian numbering the bytes of each word are labelled left to right (0 1 2 3, 4 5 6 7, …); with little-endian numbering they are labelled right to left (3 2 1 0, 7 6 5 4, …).

32-bit integers are represented the same way in both schemes. E.g. the value 6 in binary is 110:
- Big-endian: 00…00110
- Little-endian: 00…00110
Both are also 32 bits long.
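The two orderings are easy to see with Java's standard java.nio.ByteBuffer (an aside, not from the slides):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class Endian {
        public static void main(String[] args) {
            for (ByteOrder order : new ByteOrder[] {
                    ByteOrder.BIG_ENDIAN, ByteOrder.LITTLE_ENDIAN }) {
                // Write the int 6 into 4 bytes under each byte order.
                byte[] b = ByteBuffer.allocate(4).order(order).putInt(6).array();
                for (byte x : b) System.out.printf("%02x ", x);
                System.out.println(order);
            }
            // Prints: 00 00 00 06 BIG_ENDIAN    (most significant byte first)
            //         06 00 00 00 LITTLE_ENDIAN (least significant byte first)
        }
    }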

Problems with Byte Ordering

Diagram: the record “Jim Smith, age 21, department 260” (260 = 1×256 + 4) stored as the string “JIM SMITH” followed by the integers 21 and 260. Transferred byte-for-byte from a big-endian machine to a little-endian one, the name comes out fine but the age and department are scrambled; if we instead swap the bytes of every word, the age and department come out right but now the name is scrambled.

Problem: CPU Fast, Memory Slow

After a memory request, the CPU will not get the word for several cycles. Two simple solutions:
- Continue execution, but stall the CPU if an instruction references the word before it has arrived (hardware)
- Require the compiler to fetch words before they are needed (software); this may mean inserting NOP instructions, and it is very difficult to write compilers that do this effectively

The Root of the Problem: Economics

Fast memory is possible, but to run at full speed it needs to be located on the same chip as the CPU, which is very expensive and limits the size of the memory. So do we choose:
- A small amount of fast memory?
- A large amount of slow memory?

The Best of Both Worlds: Cache Memory

Combine a small amount of fast memory (the cache) with a large amount of slow memory. When a word is referenced, put it and its neighbours into the cache. This works because programs do not access memory randomly:
- Temporal locality: recently accessed items are likely to be used again
- Spatial locality: the next access is likely to be near the last one

The Cache Hit Ratio

How often is a word found in the cache? Suppose a word is accessed k times in a short interval: that is 1 reference to main memory and (k-1) references to the cache. The cache hit ratio h is then

    h = (k - 1) / k

Mean Access Time

If the cache access time is c and the main memory access time is m, the mean access time = c + (1-h)m.
- If all address references are satisfied by the cache (h near 1), the access time approaches c
- If no reference is in the cache (h near 0), the access time approaches c + m
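Plugging in some illustrative figures of my own (say c = 2 nsec for the cache and m = 50 nsec for main memory; neither number is from the slides):

    public class CacheTime {
        public static void main(String[] args) {
            double c = 2.0, m = 50.0;   // assumed access times in nsec
            for (double h : new double[] { 0.0, 0.5, 0.9, 0.99, 1.0 }) {
                double mean = c + (1 - h) * m;   // mean access time formula
                System.out.printf("h = %.2f -> mean access time = %.1f ns%n", h, mean);
            }
            // h = 0 gives c + m = 52 ns; h = 1 gives just c = 2 ns.
        }
    }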

Cache Design Issues

- How big should the cache be? Bigger means more hits, but is more expensive
- How big should a cache line be?
- How does the cache keep track of what it contains?
- If we change an item in the cache, how do we write it back to main memory?
- Should there be separate caches for data and instructions? (Instructions never have to be written back to main memory)
- How many caches should there be? Primary (on chip), secondary (off chip), tertiary…

Memory Packaging

Until the 1990s, memory came on single chips. Now we usually use SIMMs or DIMMs (Single/Dual Inline Memory Modules):
- Each module contains 8 or 16 chips on a small circuit board
- A SIMM transfers up to 32 bits at once
- A DIMM = 2 × SIMM: up to 64 bits at once

Summary

- A bit is a single unit of memory
- Bits are organised into 8-bit bytes, and bytes are organised into words
- How the bytes are ordered is important
- Accessing main memory is slow: CPUs are typically 10-100 times faster
- A cache is a way of improving this, better than waiting for memory accesses to complete
- Memory for PCs comes in DIMMs or SIMMs

Next Lecture:

Secondary Memory and Input/Output

Previous Lecture:

We learnt about:
- Primary memory and memory layout
- Problems caused by byte ordering
- The use and design of cache memory

Secondary Memory, Input and Output

We will see:
- The memory hierarchy
- How a magnetic disk works
- How to improve performance using a RAID layout
- The importance of a bus for coordinating input and output

The Memory Hierarchy

Diagram: the memory hierarchy, with registers at the top, then cache, main memory, magnetic disks, and tapes and optical disks at the bottom. Moving down the hierarchy, access time increases, but so do storage capacity and bits per pound.

Magnetic Disks

Diagram: one track (5-10 microns wide) of a rotating disk, divided into sectors separated by intersector gaps; each sector consists of a preamble, 4096 data bits (about 0.1 microns per bit) and an ECC block, and is read by the R/W head.

A magnetic disk is an aluminium disk with a magnetic coating, divided into circular tracks. Each track is made up of a number of fixed-length sectors, and each sector consists of a preamble, a data block, and an error-correction block.

RAID

Disk performance has not improved much over the past few decades: seek times were 50-100 msec in the 1970s and 10 msec in the 1990s, leaving a large gap between CPU and disk performance. RAID (Redundant Array of Inexpensive Disks) uses disk layout to improve performance: distributed data (robust) and parallel access.

RAID 0

Diagram: strips 0-3 spread across disks 1-4, with strips 4-7 and 8-11 on the following rows, i.e. data striped round-robin across four disks.

Each strip is a number of contiguous sectors.
- High data rate: large requests can be dealt with in parallel
- Can also handle simultaneous requests
- But no redundancy or robustness to disk failure, so not really a true RAID design
- Poor response for many small sector requests

RAID 1

Diagram: the four data disks of RAID 0 plus four backup disks holding identical copies of every strip.

Duplicate all the disks in the RAID 0 design.
- If a drive crashes, we can simply replace it and copy from the backup
- Read performance can be twice as good
- The main disadvantage is the number of disks needed

RAID 2

Diagram: four data disks (bit 0 to bit 3) and three error-correction disks (EC 0 to EC 2).

Each byte is split into two 4-bit segments, and 3 error-correction bits are added to each; the 7 resulting bits go to the 7 drives, whose arm and rotational positions are synchronised.
- Fast, as we can read/write half a byte each cycle
- Robust, as the error correction allows any one drive to fail
- Disadvantage: it is hard to synchronise the drives

RAID 3

Diagram: four data disks (bit 0 to bit 3) and a single parity disk.

A simpler version of RAID 2, which still requires drive synchronisation. The parity drive provides robustness: we know which drive went wrong, so we can correct any single error. Both RAID 2 and RAID 3 have a high data rate, but cannot handle parallel I/O.

RAID 4

Diagram: strips striped across four data disks as in RAID 0, with the parity of each row (P 0-3, P 4-7, P 8-11) kept on a fifth disk.

Like RAID 0, but with strip parity kept on a separate drive.
- Does not require drive synchronisation
- Poor performance for small updates: all the drives must be read to recompute the parity
- The heavy load on the parity drive may make it a bottleneck

RAID 5

Diagram: as RAID 4, but the parity block for each row rotates across all five drives.

Like RAID 4, but with the parity blocks distributed over all the drives. It can be difficult to reconstruct the contents of a drive if one fails, though.

CD-ROM and DVD

Data is encoded as changes in reflectivity. Much slower than magnetic disks: seek times are usually 100-200 msec. Uses constant linear velocity rather than constant angular velocity, so the speed of the player changes with position: 530 rpm at the inside of the disk, 200 rpm at the outside. A CD-ROM can hold around 650 MB. A DVD has smaller pits (0.4 vs. 0.8 microns) and a tighter spiral (0.74 vs. 1.6 microns), and can hold 4.7 to 17 GB, depending on the type.

Input/Output

How do we get data in and out of the computer? There are many different types of input and output device: monitors, keyboards, mice, modems, sound processors, printers… All interact with the processor and memory via a bus, using an I/O controller; the controller “looks after” the device it is connected to.

The Bus in a Typical PC

Diagram: the CPU and main memory sit either side of a bridge onto the PCI bus; a SCSI controller (with its own SCSI bus and disk), a video controller and a network controller sit on the PCI bus; a second bridge connects the PCI bus to the ISA bus, which carries the sound processor and the modem.

Direct Memory Access

With Direct Memory Access (DMA), the I/O controller reads/writes data to memory without supervision by the CPU. When the transfer is complete, the controller raises an interrupt, which forces the CPU to suspend its current task and:
- Check for errors
- Perform any special actions
- Inform the Operating System
The CPU also needs the bus to access memory, so it can be slowed down by DMA (cycle-stealing).

Summary

- There is a cost/performance/size tradeoff, and main memory is always too small
- Secondary memory is vital for storing useful amounts of data for the long term
- RAID is one way in which the performance of magnetic disks can be improved
- Input and output in a computer is usually handled by a bus
- DMA is a method of freeing the CPU whilst I/O carries on in the background

Next Lecture:

Logic Gates and Boolean Algebra

Previous Lecture:

We learnt about:
- Secondary memory and I/O
- Magnetic disks
- RAID
- CD-ROM and DVD
- Buses and DMA

The Digital Logic Level

We will see:
- Logic gates and Boolean algebra
- Using transistors to build logic gates
- Digital circuits: multiplexers, decoders, adders
- The design of an ALU

Boolean Algebra

An algebra for Boolean functions: functions which operate on binary values. We can use truth tables to describe such functions: a truth table lists all possible values of the inputs, so we need 2^n rows for a function of n variables. Conventions:
- Ā means “not A”
- A+B means “A or B”
- AB or A.B means “A and B”

Logic Gates

Diagram: the five basic logic gates and their truth tables: NOT (X = Ā), AND (X = A.B), OR (X = A+B), NAND (X = A.B complemented) and NOR (X = A+B complemented).

Transistors as Logic Gates

Diagram: three transistor circuits, each powered from +Vcc. A single transistor with its input Vin on the base and output Vout at the collector inverts its input (a NOT gate); two transistors in series form a NAND gate; two in parallel form a NOR gate.

Example: The Majority Function

If more of the inputs are logical 1 than logical 0, then output 1; otherwise output 0.

A B C | X
0 0 0 | 0
0 0 1 | 0
0 1 0 | 0
0 1 1 | 1
1 0 0 | 0
1 0 1 | 1
1 1 0 | 1
1 1 1 | 1

X = ĀBC + AB̄C + ABC̄ + ABC

The Majority Function as a Digital Circuit

Diagram: inputs A, B and C, with inverters providing their complements, feed four 3-input AND gates, one for each product term (ĀBC, AB̄C, ABC̄, ABC); the four AND outputs feed an OR gate whose output is X.

Implementation of Boolean Functions

1. Write down the truth table for the function
2. Provide inverters to generate the complement of each input
3. Draw an AND gate for each term with a 1 as an output
4. Wire the AND gates to the appropriate inputs
5. Feed the outputs of the AND gates into an OR gate
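The same sum-of-products recipe expressed in Java for the majority function, with one boolean product per truth-table row that outputs a 1 (a sketch of the logic rather than the gates):

    public class Majority {
        static boolean majority(boolean a, boolean b, boolean c) {
            return (!a && b && c)     // row 011
                || (a && !b && c)     // row 101
                || (a && b && !c)     // row 110
                || (a && b && c);     // row 111
        }

        public static void main(String[] args) {
            // Print the truth table to check against the slide.
            for (int i = 0; i < 8; i++) {
                boolean a = (i & 4) != 0, b = (i & 2) != 0, c = (i & 1) != 0;
                System.out.printf("%d%d%d | %d%n",
                        (i >> 2) & 1, (i >> 1) & 1, i & 1,
                        majority(a, b, c) ? 1 : 0);
            }
        }
    }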

Combinatorial Circuits

Many applications require a digital circuit with multiple inputs and outputs, in which we sometimes need to select amongst the inputs or outputs (e.g. the majority circuit). Such circuits are called combinatorial circuits; examples include:
- Multiplexers
- Decoders

Multiplexers

A circuit with 2^n data inputs, n control inputs and 1 output; the control inputs select one of the data inputs.
- Useful in parallel-to-serial conversion: step the control lines sequentially from 00…0 to 11…1
- Can also be used to implement Boolean functions (using the control inputs): for each row of the truth table which outputs a 1, connect the corresponding data input to +Vcc; for each row which outputs a 0, connect it to ground
We can also make a demultiplexer, which routes an input to one of 2^n different outputs.

2-Bit Multiplexer Circuit

Diagram: data inputs D0-D3 each feed an AND gate; the control inputs A and B and their complements enable exactly one of the AND gates, and the AND outputs are OR-ed together to produce X.

2-bit Multiplexer as an XOR Gate

Diagram: a 2-bit multiplexer with control inputs A and B; data inputs D0 and D3 are wired to ground and D1 and D2 to +Vcc, so the output X follows the exclusive-OR truth table (A B X: 0 0 0, 0 1 1, 1 0 1, 1 1 0).

Decoders

A circuit with n control inputs and 2^n outputs; the output corresponding to the binary value encoded on the n input lines is set to 1. Useful in address decoding: e.g. a decoder could select amongst 8 memory chips using the top three bits of an address.

Arithmetic Circuits

We can easily build circuits which perform Boolean operations on binary words at the bit level; e.g. the AND of two words X and Y is just the repeated AND of the bits of X and Y. Binary arithmetic can also be thought of as a Boolean operation, so we can build circuits which add, subtract, multiply and divide in a similar way.

A Half-Adder

Diagram: inputs A and B feed an XOR gate producing Sum and an AND gate producing the carry-out Cout.

A B | Sum Cout
0 0 |  0    0
0 1 |  1    0
1 0 |  1    0
1 1 |  0    1

A half-adder performs 1-bit addition, but it will not work for a bit in the middle of a word: it needs a “carry” input.

A Full-Adder

Diagram: a full adder constructed from two half-adders, with inputs A, B and the carry-in Cin, and outputs Sum and the carry-out Cout.

A B Cin | Sum Cout
0 0  0  |  0    0
0 0  1  |  1    0
0 1  0  |  1    0
0 1  1  |  0    1
1 0  0  |  1    0
1 0  1  |  0    1
1 1  0  |  0    1
1 1  1  |  1    1

We can replicate this circuit to build adders for any word size, but the carry needs to be propagated, which can be slow.
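Replicating the full adder in software makes the carry propagation explicit; a sketch in Java (the bit-twiddling style is mine, not the course's):

    public class RippleCarry {
        // One full adder: Sum = A xor B xor Cin; Cout is 1 when
        // at least two of the three inputs are 1.
        static int[] fullAdder(int a, int b, int cin) {
            return new int[] { a ^ b ^ cin, (a & b) | (a & cin) | (b & cin) };
        }

        // An 8-bit adder built from eight full adders.
        static int add8(int x, int y) {
            int carry = 0, result = 0;
            for (int i = 0; i < 8; i++) {
                int[] sc = fullAdder((x >> i) & 1, (y >> i) & 1, carry);
                result |= sc[0] << i;
                carry = sc[1];      // each stage must wait for this: slow
            }
            return result;          // the final carry out is dropped
        }

        public static void main(String[] args) {
            System.out.println(add8(58, 17));   // 75
        }
    }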

A 1-bit ALU

Diagram: inputs A and B feed both a logical unit (computing AND, OR and NOT) and a full adder (with carry-in and carry-out); the function control lines F0 and F1 pass through a decoder which raises one of the enable lines (EAND, EOR, ENOT, E+) to select which result drives the output.

We can thus create a single functional unit to carry out AND, OR, NOT or addition on single bits, using a decoder to select the active function.

An 8-bit ALU

Diagram: eight 1-bit ALUs side by side; slice i takes inputs Ai and Bi and produces output Oi, and the carry-out of each slice feeds the carry-in of the next.

The 8-bit ALU is constructed by replicating a 1-bit ALU eight times. We still need to propagate the carry.

Next Lecture:

Binary Arithmetic

Previous Lecture:

We learnt about:
- Digital logic and Boolean algebra
- Digital circuits: multiplexers, decoders
- Half- and full-adders
- The design of the ALU

Binary Arithmetic

We will see:
- How numbers are stored in a computer
- The consequences of using fixed-length arithmetic
- How to convert to and from binary
- How to represent negative numbers
- How to represent fractional numbers: fixed-point and floating-point

Why Binary Arithmetic?

Early computers often used decimal arithmetic, but binary arithmetic is the most robust: we only need to distinguish between two different voltage levels. A decimal machine would need to distinguish ten different levels, and so would need a much lower noise level to operate reliably (expensive), although it could pack far more into a given word length. Another possibility is an analogue computer, but the presence of noise means low accuracy.

8 bits, 1 byte, 256 values

Diagram: 3 bits give 8 combinations; 4 bits give 16.

Each bit we add doubles the number of possible combinations: with n bits we have 2^n combinations, so 8 bits (1 byte) gives 2^8 = 256 different combinations.

How many bits in a date?

Using binary values means we must divide our world into powers of 2, usually picking the nearest power of 2 above what we need. For the date format “DD/MM/YYYY”:
- 28 to 31 days in a month: 2^5 = 32
- 12 months in a year: 2^4 = 16
- 9999 possible years: 2^14 = 16384
So we would need 5 + 4 + 14 = 23 bits in total to represent a date like this. This leaves many illegal bit combinations, e.g. “32/16/9999”.
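The same sum in code, using the standard library to count the bits each field needs (a sketch):

    public class DateBits {
        // Bits needed to represent values 0 .. max.
        static int bitsFor(int max) {
            return 32 - Integer.numberOfLeadingZeros(max);
        }

        public static void main(String[] args) {
            int total = bitsFor(31) + bitsFor(12) + bitsFor(9999);
            System.out.println(total);   // 5 + 4 + 14 = 23
        }
    }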

Fixed-length Arithmetic

We are used to working with numbers of any length, but computers have to work with fixed-length numbers, which causes some interesting problems. Suppose we only have 3 decimal digits: 000, 001, 002, …, 999. Then we cannot represent:
- Negative numbers (no sign)
- Fractional numbers (no decimal point)
- Numbers larger than 999 (not enough digits)

2/4/2005 CMO144 David Walker 115 Problems with Fixed- Length Arithmetic Fixed-length arithmetic is not closed For example, using 3 decimal digits: 600 + 600 = 1200 (too large) 003 - 005 = -2 (negative) 050 x 050 = 2500 (too large) 007 / 002 = 3.5 (not an integer) We can divide the problems into three main classes Overflow (too large) Underflow (too small) Not a member of the set 2/4/2005 CMO144 David Walker 116 Binary-to-Decimal Binary Decimal 000 0 = 0x22 + 0x21 + 0x20 001 1 = 0x22 + 0x21 + 1x20 010 2 = 0x22 + 1x21 + 0x20 011 3 = 0x22 + 1x21 + 1x20 100 4 = 1x22 + 0x21 + 0x20 101 5 = 1x22 + 0x21 + 1x20 110 6 = 1x22 + 1x21 + 0x20 111 7 = 1x22 + 1x21 + 1x20

By convention, the right-most bit is the least- significant Each subsequent bit is worth twice the one before We can build this idea into an algorithm 2/4/2005 CMO144 David Walker 117 Decimal-to-Binary
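For instance, in Java (a sketch; the library call Integer.parseInt(s, 2) does the same job in one step):

    public class BinToDec {
        // Binary-to-decimal: the right-most bit is worth 1, and each
        // subsequent bit is worth twice the one before.
        static int binaryToDecimal(String bits) {
            int value = 0, weight = 1;
            for (int i = bits.length() - 1; i >= 0; i--) {
                if (bits.charAt(i) == '1') value += weight;
                weight *= 2;
            }
            return value;
        }

        public static void main(String[] args) {
            System.out.println(binaryToDecimal("111010"));   // 58
        }
    }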

Decimal-to-Binary

This involves successive division of the decimal number by 2; the remainder at each stage is the next digit of the binary expansion. E.g. converting 58 into binary:
- 58 ÷ 2 = 29 remainder 0
- 29 ÷ 2 = 14 remainder 1
- 14 ÷ 2 = 7 remainder 0
- 7 ÷ 2 = 3 remainder 1
- 3 ÷ 2 = 1 remainder 1
- 1 ÷ 2 = 0 remainder 1
Reading the remainders from last to first, 58 is 111010 in binary.
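And the reverse direction, following the successive-division method above (Integer.toBinaryString is the library shortcut; this is just a sketch):

    public class DecToBin {
        // Decimal-to-binary: divide by 2 repeatedly; each remainder is
        // the next binary digit, least significant first.
        static String decimalToBinary(int n) {
            if (n == 0) return "0";
            StringBuilder bits = new StringBuilder();
            while (n > 0) {
                bits.insert(0, n % 2);   // prepend the remainder
                n /= 2;
            }
            return bits.toString();
        }

        public static void main(String[] args) {
            System.out.println(decimalToBinary(58));   // 111010
        }
    }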

Representations of Negative Integers

There are four main ways of representing negative m-bit binary numbers:
- Sign and magnitude
- 1’s complement
- 2’s complement
- Excess 2^(m-1)
All of them have problems, but all of them work well with binary addition and subtraction.

Sign and Magnitude

We use one bit for the sign, and the rest for the magnitude of the number. E.g. using an 8-bit binary representation:
- 58 is 00111010
- -58 is 10111010 (in sign and magnitude)
The range is therefore -127 to 127. Problem: there are two zeros, 00000000 (zero) and 10000000 (-zero).

1’s Complement

This also has a sign bit. To negate a number, we simply flip every 0 to a 1 and vice versa. E.g. using an 8-bit binary representation:
- 58 is 00111010
- -58 is 11000101 (in 1’s complement)
The range is again -127 to 127, and we again have two representations of zero: 00000000 (zero) and 11111111 (-zero).

2’s Complement

Like 1’s complement, but to negate a number we flip all the bits and then add 1 to the result. E.g. using an 8-bit binary representation:
- 58 is 00111010
- -58 is 11000110 (in 2’s complement)
This gives a range of -128 to 127, with only one representation of zero. But we still have a problem: -(-128) = -(10000000) = 10000000 = -128.
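Java's byte type is 8-bit 2's complement, so both the negation rule and the -128 anomaly can be checked directly (a sketch):

    public class TwosComplement {
        public static void main(String[] args) {
            byte x = 58;                        // 00111010
            byte minusX = (byte) (~x + 1);      // flip the bits, add 1
            System.out.println(minusX);         // -58 (11000110)

            byte min = -128;                    // 10000000
            // Negating -128 overflows back to -128, since +128
            // is not representable in 8 bits.
            System.out.println((byte) (~min + 1));   // -128
        }
    }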

Excess 2^(m-1)

We store each number as its value plus 2^(m-1). E.g. using an 8-bit binary representation this is “excess 128”:
- 58 is represented as 186, or 10111010
- -58 is represented as 70, or 01000110
This is equivalent to 2’s complement with the sign bit reversed. The range is again -128 to 127, still asymmetric about zero.

Fixed-Point Arithmetic

It is useful to be able to represent real numbers, e.g. 1.23, 3.141, 0.0000123… One way to do this is to reserve a certain number of digits for the fractional part. E.g. 1010.1100 = 10.75, since “.1100” = 2^-1 + 2^-2 = 0.5 + 0.25 = 0.75. Note that we don’t need any special hardware to deal with this: we just need to keep track of the binary point. Given m bits before and n bits after the binary point, m determines the range and n determines the precision.
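A toy 4.4 fixed-point format (4 bits before the binary point, 4 after; my own choice, not the course's) showing that ordinary integer hardware is enough:

    public class FixedPoint {
        static final int SCALE = 1 << 4;   // 4 fraction bits

        static int toFixed(double x)  { return (int) Math.round(x * SCALE); }
        static double toDouble(int f) { return f / (double) SCALE; }

        public static void main(String[] args) {
            int a = toFixed(10.75);              // 1010.1100 -> 172
            int b = toFixed(2.5);                // 0010.1000 -> 40
            // Addition is a plain integer add; only the interpretation
            // of the bits keeps track of the binary point.
            System.out.println(toDouble(a + b)); // 13.25
        }
    }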

Problems with Fixed-Point Arithmetic

Fixed-point arithmetic is useful, but limited: it is no good for extremely large or small numbers. The mass of the Sun is 2x10^33 grams; the mass of an electron is 9x10^-28 grams. To use both these amounts in one calculation we would need over 60 decimal digits, most of which would be irrelevant. We need floating-point arithmetic.

Floating Point Arithmetic

We borrow the familiar scientific notation (3.14 = 0.314 x 10^1, 0.00005 = 0.5 x 10^-4), but use powers of 2 instead of 10:

    N = mantissa x 2^exponent    (-1 < mantissa < 1)

The range depends on the exponent; the precision depends on the mantissa. This allows a huge range of numbers to be covered, with some loss in accuracy. IEEE floating-point standard 754:
- Single precision: 1 sign bit, 8 exponent bits, 23 mantissa bits
- Double precision: 1 sign bit, 11 exponent bits, 52 mantissa bits
- Reserved values for infinity and NaN (not a number)
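The single-precision layout can be pulled apart with Java's Float.floatToIntBits (an aside, not from the slides):

    public class Ieee754 {
        public static void main(String[] args) {
            int bits = Float.floatToIntBits(-58.0f);
            int sign     = (bits >>> 31) & 0x1;    // 1 sign bit
            int exponent = (bits >>> 23) & 0xFF;   // 8 exponent bits, biased by 127
            int mantissa = bits & 0x7FFFFF;        // 23 mantissa bits
            System.out.printf("sign=%d exponent=%d mantissa=%x%n",
                              sign, exponent, mantissa);
            // -58 = -1.8125 * 2^5, so sign=1 and the stored exponent is 5+127=132.
        }
    }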

Problems with Floating-Point Arithmetic

Diagram: the number line divided into seven regions: negative overflow, expressible negative numbers, negative underflow, zero, positive underflow, expressible positive numbers, positive overflow.

Using floating point we can only access three of these regions. The range is not continuous, nor equally sampled: the gap between 0.998x10^99 and 0.999x10^99 is vastly larger than the gap between 0.998x10^0 and 0.999x10^0. We cannot express some numbers at all, e.g. 0.100x10^3 / 3, and must round them to the nearest expressible number. For a fixed number of bits, we must trade off range against precision, and fast performance requires special hardware.

Summary

- Computers use a binary representation for numbers, and are bound by the constraints of fixed-length arithmetic
- Representing negative numbers is something of a challenge: no system is ideal
- We can represent real numbers using fixed-point or floating-point arithmetic; again, neither system is ideal

Next Lecture:

Digital Logic for Memory

Previous Lecture:

We learnt about:
- Binary arithmetic
- How to convert decimal into binary and binary into decimal
- Four ways to represent negative integers
- Ways of representing real numbers: fixed-point and floating-point

Digital Logic for Memory

We will see:
- Clocks
- Implementing memory using latches: SR latches, clocked SR latches, clocked D latches
- Flip-flops
- Memory organisation
- RAM and ROM

Clocks

Diagram: a crystal oscillator generates the clock signal C1; passing C1 through a delay element gives a shifted copy C2; AND-ing the two (C1.C2) yields a train of shorter pulses.

The order in which operations happen in a digital circuit is important. To coordinate events we use a clock, driven by a crystal oscillator, with frequencies usually in the range 1-2000 MHz.

SR Latches

Diagram: two cross-coupled NOR gates (truth table: 00 -> 1, 01 -> 0, 10 -> 0, 11 -> 0) form a latch with inputs S and R and complementary outputs Q and Q̄, shown in its two stable states, state 0 and state 1.

An SR latch is a memory device with two stable states (S = Set, R = Reset, Q = Output). It remembers whether S or R was last set.

Clocked SR Latches

Diagram: an SR latch whose S and R inputs are gated through AND gates by a clock signal.

The clock input controls when the latch can change state. But what happens when S = R = 1?

Clocked D Latches

Diagram: a clocked latch with a single data input D, which feeds S directly and R through an inverter, so S and R can never both be 1.

The D latch prevents the S = R = 1 problem, and is a true 1-bit memory. It requires 11 transistors, although less obvious designs use only 6.

Flip-Flops

Diagram: a signal a and a delayed, inverted copy b feed an AND gate, so c = a.b is a short pulse on each 0-to-1 transition of a.

A flip-flop is a latch which stores its input on the clock transition from 0 to 1: it is edge-triggered, not level-triggered, so the length of the clock pulse is unimportant. We use a simple delay circuit like the one above to generate a short pulse, and use this to filter the clock input of a D latch.

An 8-bit Register

Diagram: eight D flip-flops, F0 to F7, with data inputs D0-D7 and outputs Q0-Q7, all sharing a single clock line.

We can use a collection of flip-flops to build a register, simply by tying their clock signals together.

Building Larger Memories

We can build a number of bit registers into an addressable memory:
- A decoder on the address activates the desired register
- Separate inputs to the memory circuit control the reading and writing of data
- The entire device is constructed from a large number of similar components laid out on a 2-dimensional grid
Such layouts are well suited to VLSI technology.

A Design for an 8-byte Memory

Diagram: a 3-bit decoder on the address lines selects one of Byte 0 to Byte 7; the bytes share common data lines, and read-enable (RE) and write-enable (WE) inputs control transfers.

Memory Layout

Diagram: two 4-Mbit chips. One is organised as 512K x 8, with 19 address lines, 8 data lines, and CS, WE and OE control inputs. The other is organised as 4096K x 1, with 11 address lines (plus RAS and CAS strobes), 1 data line, and the same control inputs.

The total memory is the same (4 Mbit = 2^22 bits). Organised as 512K x 8 bits, 19 address lines are needed (512K = 2^19). Organised as a 2048 x 2048 matrix of 1-bit cells, we need only 11 address lines (2048 = 2^11), but we have to use them twice, hence the row and column address strobes (RAS and CAS).

Random Access Memory (RAM)

- Static RAM (SRAM): based on flip-flops; very fast (2 nsec access time)
- Dynamic RAM (DRAM): based on capacitors, using 1 transistor and 1 capacitor per bit; very large memories are possible, but the charge must be refreshed constantly, and it is slow (>10 nsec access time)
Typically SRAM is used for cache and DRAM for main memory.

Read Only Memory (ROM)

RAM only stores data whilst powered, but we sometimes need persistent data: ROMs are used in cars, videos etc. to store data which doesn’t need to be changed. Data in a ROM is hardwired, unchangeable after manufacture. Hybrid memories combine persistence with programmability:
- Erasable programmable ROM (EPROM)
- Flash memory (digital cameras, mobile phones)

Summary

- Clocks can be used to synchronise the digital circuits in a computer
- Circuits with at least two stable states are needed to implement memory
- How we organise a memory chip influences how we use it
- RAM and ROM evolved to meet different design criteria

Next Lecture:

CPU Chips and Buses

Previous Lecture:

We learnt about:
- Clocks
- How to build memory from flip-flops
- The influence of memory layout on its use
- Different types of memory: RAM and ROM

CPU Chips and Buses

We will see:
- How the CPU interacts with the rest of the computer
- How a bus works: bus widths, bus arbitration and bus operations
- Some common buses: the ISA bus, the PCI bus and the USB

CPU Chips

All modern CPUs are built on a single chip. The CPU interacts with the outside world solely through physical wires (lines): address lines, data lines and control lines. E.g. to fetch data from memory:
1. The CPU places the address of the data on the address lines
2. It signals that the address is ready using one or more control lines
3. The memory responds by placing the required data on the data lines, then signals that it has finished
4. The CPU reads the data from the bus

CPU Performance Parameters

- The number of address lines (m): up to 2^m addresses can be addressed; commonly m = 16, 20, 32 or 64
- The number of data lines (n): an n-bit word can be read/written in a single operation; commonly n = 8, 16, 32, 36 or 64
A chip with 8 data pins will take 4 cycles to read a 32-bit word.

CPU Control Lines

The control lines can be grouped into six main categories:
- Bus control: outputs from the CPU used to control the rest of the system
- Interrupts: inputs from I/O devices used to signal completion
- Bus arbitration: to regulate the traffic on the bus
- Coprocessor signalling: to make requests of special devices (floating-point units, graphics processors…)
- Status: running, suspended, temperature…
- Others: reset, compatibility modes…

Buses

A bus is a common electrical pathway between two or more devices. There are many different types (memory bus, I/O bus, ALU bus…), and each type has its own requirements and properties: one size doesn’t fit all. The main design issues are:
- Bus width
- Bus clocking
- Bus arbitration
- Bus operations

Bus Width

To allow large address spaces, buses need many address lines: m address lines means 2^m addresses. But bigger buses are more expensive: more wires are needed, connectors are physically larger, etc. It is also difficult to speed up a bus simply by increasing the clock speed, because of bus skew: signals on different lines travel at different speeds. The address and data buses can be combined (a multiplexed bus). Designers tend to use the smallest possible bus width, which is not easy to expand later.

Synchronous Buses

Diagram: a memory read on a synchronous bus, occupying clock cycles T1-T3. The address is placed on the ADR lines, MREQ and RD are asserted, and the memory returns the word on the DATA lines; a WAIT line is also shown. The timing parameters are TAD (address output time), TDS (data setup time), TML (address stable before MREQ), TM and TMH (MREQ delays), TRL and TRH (RD delays) and TDH (data hold time).

Problems with Synchronous Buses

Synchronous buses are easy to work with, as they are clock-driven, but transfers must take a whole number of clock cycles: if the memory needs only 3.1 cycles to fetch the data, it will still take 4. It is also difficult to take advantage of improvements in technology: the bus is only as fast as its clock, and if there are many devices, the bus must be geared to the slowest.

Diagram: a read on an asynchronous bus: the address is placed on ADR and MREQ and RD are asserted, then MSYN; the memory places the word on DATA and asserts SSYN in reply.

An asynchronous bus has no clock, so each device can work as fast as it is able. A full handshake:
1. MSYN is set; SSYN is set in response
2. MSYN is unset in response to SSYN being set
3. SSYN is unset in response to MSYN being unset

Bus Arbitration

What happens if two devices want to use the bus at the same time? A bus arbiter decides who gets the bus in case of a conflict. There are two main types of arbitration:
- Centralised
- Decentralised

Centralised Arbitration

Diagram: devices 1 to 5 share a request line into the arbiter, and the grant line is daisy-chained from device to device.

Each device must ask to use the bus. When the arbiter receives a request it issues a grant, but it doesn’t know who asked for the bus: if a device doesn’t need the bus, it passes the grant along to the next device (daisy-chaining). This sets up implicit priorities between the devices.

Decentralised Arbitration

Diagram: devices 1 to 5 share a bus request line and a bus busy line, with an arbitration line, tied to +Vcc at one end, daisy-chained through the devices.

To acquire the bus, a device sets the request line, then waits until the bus is not busy and the inward arbitration signal is set. It then unsets the outward arbitration and request lines and sets the busy line, after which it can use the bus. When done, it unsets the busy line and re-sets the outward arbitration line. Again, the devices are implicitly prioritised.

Bus Operations

Normally only one word at a time is sent down the bus, which is not efficient. There are different types of bus operation for different situations:
- Block transfers, e.g. sending a sequence of words to a cache
- Multiprocessor support: read, inspect and modify a word without releasing the bus
- Interrupt handling: an interrupt may need access to the bus, and an arbiter may be needed to prioritise requests

The ISA Bus

Diagram: the ISA bus grew with each processor generation: the 8086 (1 MB) has 20 address lines plus control; the 80286 (16 MB) adds 4 more lines plus control; the 80386 (4 GB) adds a further 8, again with extra control lines.

The Industry Standard Architecture bus runs at 8.33 MHz:
- The 8086 has a 20-bit bus
- The 80286 has a 20-bit + 4-bit bus (for backward compatibility)
- The 80386 has a 20-bit + 4-bit + 8-bit bus
Extra control lines are also needed at each step.

The PCI (Peripheral Component Interconnect) Bus

ISA is too slow for multimedia applications: its maximum rate is 16.7 Mbyte/sec, and we need at least 100 Mbyte/sec for smooth full-screen, full-colour video. Intel created the PCI bus to deal with this:
- 32 bits/cycle at 33 MHz: 133 Mbyte/sec
- PCI-2: 64 bits/cycle at 66 MHz: 528 Mbyte/sec
- Multiplexed data and address lines
But PCI is still not fast enough for a memory bus, so most PCs have several buses. Intel also designed a bridge to interface the ISA bus to the PCI bus, for backward compatibility.

The Bus in a Typical PC

Diagram (repeated from earlier): the CPU and main memory bridged onto the PCI bus, which carries the SCSI, video and network controllers, with a second bridge down to the ISA bus for the sound processor and modem.

The Universal Serial Bus

USB is an effort to standardise buses for low-speed devices: PCI is too expensive for mice, keyboards etc. Some of its main goals:
- No setting of jumpers or switches needed to install a new device
- Only one kind of cable needed to connect devices
- I/O devices powered by the computer
- “Plug-and-play” functionality
- Inexpensive to manufacture
- Support for real-time devices (sound etc.)
- Up to 127 devices attachable to one computer

Summary

- A bus connects the CPU to the rest of the computer system
- The factors which affect a bus are its width, the arbitration scheme used, and the operations it can perform
- Modern buses vary in their design objectives: a price/performance tradeoff
