Chapter 1 Introduction

Objectives of today's lecture  In this chapter we will learn about the structure and function of a computer, and the nature and characteristics of computers. At the end of this chapter you will be able to understand:  What a computer is  What computer organization and architecture are  The structure and function of a computer  What is inside the CPU: the ALU and registers

Computer:

A computer is a device that accepts information (in the form of digitized data) and manipulates it to produce some result, based on a program: a sequence of instructions describing how the data is to be processed. Complex computers also include the means for storing data (including the program, which is itself a form of data) for some necessary duration. Architecture & Organization  Architecture consists of those attributes visible to the programmer  Instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques.  e.g. Is there a multiply instruction?  Organization is how those features are implemented  Control signals, interfaces, memory technology.  e.g. Is there a hardware multiply unit, or is multiplication done by repeated addition? Architecture & Organization 2  All Intel x86 family machines share the same basic architecture  The IBM System/370 family share the same basic architecture  This gives code compatibility  At least backwards  Organization differs between different versions
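To make the architecture/organization split concrete, here is a small hedged sketch in C (function names are invented): both routines implement the same architectural operation, multiply, but one relies on a hardware multiply unit while the other builds the product by repeated addition — two organizations of one architecture.

```c
#include <stdio.h>
#include <stdint.h>

/* Organization A: a hardware multiply unit (the compiler emits a MUL). */
uint32_t multiply_hw(uint32_t a, uint32_t b) {
    return a * b;
}

/* Organization B: no multiply unit; the product is built by repeated addition. */
uint32_t multiply_by_addition(uint32_t a, uint32_t b) {
    uint32_t product = 0;
    while (b--)          /* add a to the product b times */
        product += a;
    return product;
}

int main(void) {
    /* Same architectural result either way; only the speed differs. */
    printf("%u %u\n", multiply_hw(6, 7), multiply_by_addition(6, 7));
    return 0;
}
```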

Structure & Function  Structure is the way in which the components relate to each other  Function is the operation of each individual component as part of the structure Function  All computer functions are:  Data processing  Data storage  Data movement  Control Functional View  [Figure: functional views — (a) data movement; (b) storage; (c) processing from/to storage; (d) processing from storage to I/O] Structure - Top Level

[Figure: top-level structure — the Computer (Central Processing Unit, Main Memory, Systems Interconnection, Input/Output) with Peripherals and Communication Lines attached]

Structure - The CPU

[Figure: the CPU — Arithmetic and Logic Unit, Registers, Internal CPU Interconnection, and Control Unit — shown within the computer alongside I/O, system memory, and the system interconnection]

Structure - The Control Unit

[Figure: the Control Unit — Sequencing Logic, Control Unit Registers and Decoders, and Control Memory — connected to the ALU, registers, and internal CPU bus]

Outline of the Book (1)   Computer Evolution and Performance  Computer Interconnection Structures  Internal Memory  External Memory  Input/Output  Operating Systems Support  Computer Arithmetic  Instruction Sets

Outline of the Book (2)   CPU Structure and Function  Reduced Instruction Set Computers  Superscalar Processors  Control Unit Operation  Microprogrammed Control  Multiprocessors and Vector Processing  Digital Logic (Appendix) Internet Resources - Web site for book  http://WilliamStallings.com/COA/COA7e.html  links to sites of interest  links to sites for courses that use the book  errata list for book  information on other books by W. Stallings  http://WilliamStallings.com/StudentSupport.html  Math  How-to  Research resources  Misc

Internet Resources - Web sites to look for  WWW Computer Architecture Home Page  CPU Info Center  Processor Emporium  ACM Special Interest Group on Computer Architecture  IEEE Technical Committee on Computer Architecture  Intel Technology Journal  Manufacturer’s sites  Intel, IBM, etc.

Internet Resources - Usenet News Groups  comp.arch  comp.arch.arithmetic  comp.arch.storage  comp.parallel Video lectures of computer organization and architecture

https://youtu.be/VxZh5NuotNo

https://youtu.be/K_21dNxKC4o

https://youtu.be/UX64uR63jys

https://youtu.be/vDxyrEvKPa4?list=PL6tk8xkG6ZQfFjCZhFFuGHpKqfxHAy8sw

Thank you William Stallings Computer Organization and Architecture 8th Edition

Chapter 2 Computer Evolution and Performance

Objectives of today's lecture

• In this chapter we will learn about the history of computers, von Neumann machines, the IAS computer, the instruction word and memory formats, instruction sets, generations of computers, bus structure, and design for performance. By the end of this chapter the student will know: • Generations of computers • The stored program concept • Operations on different memory locations • The fetch-execute cycle ENIAC - background

• Electronic Numerical Integrator And Computer • Eckert and Mauchly • University of Pennsylvania • Trajectory tables for weapons • Started 1943 • Finished 1946 – Too late for war effort • Used until 1955

ENIAC - details

• Decimal (not binary) • 20 accumulators of 10 digits • Programmed manually by switches • 18,000 vacuum tubes • 30 tons • 15,000 square feet • 140 kW power consumption • 5,000 additions per second von Neumann/Turing

• Stored Program concept • Main memory storing programs and data • ALU operating on binary data • Control unit interpreting instructions from memory and executing • Input and output equipment operated by control unit • Princeton Institute for Advanced Studies – IAS • Completed 1952 Structure of von Neumann machine IAS - details • 1000 x 40 bit words – Binary number – 2 x 20 bit instructions • Set of registers (storage in CPU) – Memory Buffer Register – Memory Address Register – Instruction Register – Instruction Buffer Register – Program Counter – Accumulator – Multiplier Quotient Structure of IAS – detail MBR: contains a word to be stored in memory or sent to I/O unit

MAR: Specifies the address in memory of the word to be written from, or read into, the MBR

IR: Contains the 8-bit opcode of the instruction currently being executed

IBR: Temporarily holds the right hand instruction from a word in memory

PC: contains the address of next instruction to be fetched from memory.
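The registers above can be pulled together into a toy fetch-execute loop. The sketch below uses invented opcode values (the real IAS encodings differ) but follows the word format described: 40-bit words, each holding two 20-bit instructions of an 8-bit opcode and 12-bit address, with the IBR buffering the right-hand instruction.

```c
#include <stdio.h>
#include <stdint.h>

#define MEM_WORDS 1000
static uint64_t memory[MEM_WORDS];   /* 40-bit words, held in 64-bit slots */

enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STOR = 3 }; /* illustrative */

int main(void) {
    uint64_t mbr;          /* Memory Buffer Register: word just fetched     */
    uint32_t ibr = 0;      /* Instruction Buffer Register: right-hand instr */
    int      ibr_full = 0; /* is a buffered right-hand instruction waiting? */
    uint32_t ir, mar, pc = 0;
    int64_t  ac = 0;       /* Accumulator */

    /* Tiny program: AC = M(100) + M(101); M(102) = AC; halt. */
    memory[100] = 20; memory[101] = 22;
    memory[0] = ((uint64_t)((OP_LOAD << 12) | 100) << 20) | ((OP_ADD << 12) | 101);
    memory[1] = ((uint64_t)((OP_STOR << 12) | 102) << 20) | ((OP_HALT << 12) | 0);

    for (;;) {
        uint32_t instr;
        if (ibr_full) {                 /* use the buffered right instruction */
            instr = ibr; ibr_full = 0; pc++;
        } else {                        /* fetch a fresh 40-bit word */
            mar = pc;
            mbr = memory[mar];
            instr = (uint32_t)(mbr >> 20) & 0xFFFFF;  /* left instruction */
            ibr   = (uint32_t)(mbr & 0xFFFFF);        /* buffer right one */
            ibr_full = 1;
        }
        ir  = instr >> 12;      /* 8-bit opcode   */
        mar = instr & 0xFFF;    /* 12-bit address */
        switch (ir) {
        case OP_LOAD: ac = (int64_t)memory[mar];  break;
        case OP_ADD:  ac += (int64_t)memory[mar]; break;
        case OP_STOR: memory[mar] = (uint64_t)ac; break;
        case OP_HALT: printf("M(102)=%llu\n", (unsigned long long)memory[102]);
                      return 0;
        }
    }
}
```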

Commercial Computers

• 1947 - Eckert-Mauchly Computer Corporation • UNIVAC I (Universal Automatic Computer) • US Bureau of Census 1950 calculations • Became part of Sperry-Rand Corporation • Late 1950s - UNIVAC II – Faster – More memory IBM

• Punched-card processing equipment • 1953 - the 701 – IBM's first stored program computer – Scientific calculations • 1955 - the 702 – Business applications • Led to 700/7000 series Transistors

• Replaced vacuum tubes • Smaller • Cheaper • Less heat dissipation • Solid State device • Made from Silicon (Sand) • Invented 1947 at Bell Labs • William Shockley et al. Transistor Based Computers

• Second generation machines • NCR & RCA produced small transistor machines • IBM 7000 • DEC - 1957 – Produced PDP-1

Microelectronics

• Literally - "small electronics" • A computer is made up of gates, memory cells and interconnections • These can be manufactured on a semiconductor • e.g. silicon wafer

Generations of Computer

• Vacuum tube - 1946-1957 • Transistor - 1958-1964 • Small scale integration - 1965 on – Up to 100 devices on a chip • Medium scale integration - to 1971 – 100-3,000 devices on a chip • Large scale integration - 1971-1977 – 3,000 - 100,000 devices on a chip • Very large scale integration - 1978-1991 – 100,000 - 100,000,000 devices on a chip • Ultra large scale integration – 1991 on Moore's Law

• Increased density of components on chip • Gordon Moore – co-founder of Intel • Number of transistors on a chip will double every year • Since the 1970s development has slowed a little – Number of transistors doubles every 18 months • Cost of a chip has remained almost unchanged • Higher packing density means shorter electrical paths, giving higher performance • Smaller size gives increased flexibility • Reduced power and cooling requirements • Fewer interconnections increase reliability Growth in CPU Transistor Count
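A back-of-envelope illustration of the 18-month doubling rule stated above: transistors(t) ≈ transistors(0) × 2^(months/18). The starting count (roughly that of the Intel 4004) is only indicative.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double start = 2300.0;   /* approx. transistor count of the Intel 4004 */
    for (int years = 0; years <= 30; years += 10) {
        /* doubling every 18 months */
        double count = start * pow(2.0, (years * 12.0) / 18.0);
        printf("after %2d years: ~%.3g transistors\n", years, count);
    }
    return 0;
}
```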

IBM 360 series • 1964 • Replaced (& not compatible with) 7000 series • First planned family of computers – Similar or identical instruction sets – Similar or identical O/S – Increasing speed – Increasing number of I/O ports (i.e. more terminals) – Increased memory size – Increased cost • DEC PDP-8

• 1964 • First minicomputer (after miniskirt!) • Did not need air conditioned room • Small enough to sit on a lab bench • $16,000 – $100k+ for IBM 360 • Embedded applications & OEM • BUS STRUCTURE DEC - PDP-8 Bus Structure Semiconductor Memory

• 1970 • Fairchild • Size of a single core – i.e. 1 bit of magnetic core storage • Holds 256 bits • Non-destructive read • Much faster than core • Capacity approximately doubles each year Intel

• 1971 - 4004 – First microprocessor – All CPU components on a single chip – 4 bit • Followed in 1972 by 8008 – 8 bit – Both designed for specific applications • 1974 - 8080 – Intel's first general purpose microprocessor

Speeding it up

• Pipelining • On board cache • On board L1 & L2 cache • Branch prediction • Data flow analysis • Speculative execution

Performance Balance

• Processor speed increased • Memory capacity increased • Memory speed lags behind processor speed

Logic and Memory Performance Gap Solutions

• Increase number of bits retrieved at one time – Make DRAM wider rather than deeper • Change DRAM interface – Cache • Reduce frequency of memory access – More complex cache and cache on chip • Increase interconnection bandwidth – High speed buses – Hierarchy of buses I/O Devices

• Peripherals with intensive I/O demands • Large data throughput demands • Processors can handle this • Problem moving data • Solutions: – Caching – Buffering – Higher-speed interconnection buses – More elaborate bus structures – Multiple-processor configurations Key is Balance

• Processor components • Main memory • I/O devices • Interconnection structures Improvements in Chip Organization and Architecture • Increase hardware speed of processor – Fundamentally due to shrinking logic gate size • More gates, packed more tightly, increasing clock rate • Propagation time for signals reduced • Increase size and speed of caches – Dedicating part of processor chip • Cache access times drop significantly • Change processor organization and architecture – Increase effective speed of execution – Parallelism Problems with Clock Speed and Logic Density • Power – Power density increases with density of logic and clock speed – Dissipating heat • RC delay – Speed at which electrons flow limited by resistance and capacitance of metal wires connecting them – Delay increases as RC product increases – Wire interconnects thinner, increasing resistance – Wires closer together, increasing capacitance • Memory latency – Memory speeds lag processor speeds • Solution: – More emphasis on organizational and architectural approaches Increased Cache Capacity

• Typically two or three levels of cache between processor and main memory • Chip density increased – More cache memory on chip • Faster cache access • Pentium chip devoted about 10% of chip area to cache • Pentium 4 devotes about 50% More Complex Execution Logic

• Enable parallel execution of instructions • Pipeline works like assembly line – Different stages of execution of different instructions at same time along pipeline • Superscalar allows multiple pipelines within single processor – Instructions that do not depend on one another can be executed in parallel

Diminishing Returns

• Internal organization of processors complex – Can get a great deal of parallelism – Further significant increases likely to be relatively modest • Benefits from cache are reaching limit • Increasing clock rate runs into power dissipation problem – Some fundamental physical limits are being reached

New Approach – Multiple Cores

• Multiple processors on single chip – Large shared cache • Within a processor, increase in performance proportional to square root of increase in complexity • If software can use multiple processors, doubling number of processors almost doubles performance • So, use two simpler processors on the chip rather than one more complex processor • With two processors, larger caches are justified – Power consumption of memory logic less than processing logic x86 Evolution (1)

• 8080 – first general purpose microprocessor – 8 bit data path – Used in first personal computer – Altair • 8086 – 5MHz – 29,000 transistors – much more powerful – 16 bit – instruction cache, prefetch few instructions – 8088 (8 bit external bus) used in first IBM PC • 80286 – 16 Mbyte memory addressable – up from 1Mb • 80386 – 32 bit – Support for multitasking • 80486 – sophisticated powerful cache and instruction pipelining – built in maths co-processor x86 Evolution (2)

• Pentium – Superscalar – Multiple instructions executed in parallel • Pentium Pro – Increased superscalar organization – Aggressive register renaming – branch prediction – data flow analysis – speculative execution • Pentium II – MMX technology – graphics, video & audio processing • Pentium III – Additional floating point instructions for 3D graphics x86 Evolution (3)

• Pentium 4 – Note Arabic rather than Roman numerals – Further floating point and multimedia enhancements • Core – First x86 with dual core • Core 2 – 64 bit architecture • Core 2 Quad – 3GHz – 820 million transistors – Four processors on chip

• x86 architecture dominant outside embedded systems • Organization and technology changed dramatically • Instruction set architecture evolved with backwards compatibility • ~1 instruction per month added • 500 instructions available • See Intel web pages for detailed information on processors Performance Assessment Clock Speed

• Key parameters – Performance, cost, size, security, reliability, power consumption • System clock speed – In Hz or multiples of – Clock rate, clock cycle, clock tick, cycle time • Signals in CPU take time to settle down to 1 or 0 • Signals may change at different speeds • Operations need to be synchronised • Instruction execution in discrete steps – Fetch, decode, load and store, arithmetic or logical – Usually require multiple clock cycles per instruction • Pipelining gives simultaneous execution of instructions • So, clock speed is not the whole story System Clock Instruction Execution Rate

• Millions of instructions per second (MIPS) • Millions of floating point instructions per second (MFLOPS) • Heavily dependent on instruction set, compiler design, processor implementation, cache & memory hierarchy Benchmarks
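The MIPS figure quoted above is usually computed as MIPS = f / (CPI × 10^6), where f is the clock rate and CPI the average clocks per instruction. A minimal sketch, with made-up clock rate and CPI:

```c
#include <stdio.h>

int main(void) {
    double f   = 400e6;  /* clock rate in Hz (assumed)               */
    double cpi = 1.6;    /* average cycles per instruction (assumed) */
    double mips = f / (cpi * 1e6);
    printf("%.1f MIPS\n", mips);   /* 400 MHz / 1.6 CPI = 250 MIPS */
    return 0;
}
```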

• Programs designed to test performance • Written in high level language – Portable • Represents style of task – Systems, numerical, commercial • Easily measured • Widely distributed • E.g. Standard Performance Evaluation Corporation (SPEC) – CPU2006 for computation bound • 17 floating point programs in C, C++, Fortran • 12 integer programs in C, C++ • 3 million lines of code – Speed and rate metrics • Single task and throughput Internet Resources

• http://www.intel.com/ – Search for the Intel Museum • http://www.ibm.com • http://www.dec.com • Charles Babbage Institute • PowerPC • Intel Developer Home References

• [AMDA67] Amdahl, G. "Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capabilities", Proceedings of the AFIPS Conference, 1967. Video lectures for students

• https://youtu.be/7R0IJa5v6zg • https://youtu.be/LWjTIrxVS3Y • https://youtu.be/wmcwIL_oZ-s?list=PL6tk8xkG6ZQfFjCZhFFuGHpKqfxHAy8sw

THANKS William Stallings Computer Organization and Architecture 8th Edition

Chapter 3 Top Level View of Computer Function and Interconnection Program Concept • Hardwired systems are inflexible • General purpose hardware can do different tasks, given correct control signals • Instead of re-wiring, supply a new set of control signals

What is a program? • A sequence of steps • For each step, an arithmetic or logical operation is done • For each operation, a different set of control signals is needed

Function of Control Unit • For each operation a unique code is provided —e.g. ADD, MOVE • A hardware segment accepts the code and issues the control signals

• We have a computer! Components • The Control Unit and the Arithmetic and Logic Unit constitute the • Data and instructions need to get into the system and results out —Input/output • Temporary storage of code and results is needed —Main memory Computer Components: Top Level View Instruction Cycle • Two steps: —Fetch —Execute Fetch Cycle • Program Counter (PC) holds address of next instruction to fetch • Processor fetches instruction from memory location pointed to by PC • Increment PC —Unless told otherwise • Instruction loaded into Instruction Register (IR) • Processor interprets instruction and performs required actions Execute Cycle • Processor-memory —data transfer between CPU and main memory • Processor I/O —Data transfer between CPU and I/O module • Data processing —Some arithmetic or logical operation on data • Control —Alteration of sequence of operations —e.g. jump • Combination of above Example of Program Execution Instruction Cycle State Diagram Interrupts • Mechanism by which other modules (e.g. I/O) may interrupt normal sequence of processing • Program —e.g. overflow, division by zero • Timer —Generated by internal processor timer —Used in pre-emptive multi-tasking • I/O —from I/O controller • Hardware failure —e.g. memory parity error Program Flow Control Interrupt Cycle • Added to instruction cycle • Processor checks for interrupt —Indicated by an interrupt signal • If no interrupt, fetch next instruction • If interrupt pending: —Suspend execution of current program —Save context —Set PC to start address of interrupt handler routine —Process interrupt —Restore context and continue interrupted program Transfer of Control via Interrupts Instruction Cycle with Interrupts Program Timing Short I/O Wait Program Timing Long I/O Wait Instruction Cycle (with Interrupts) - State Diagram Multiple Interrupts • Disable interrupts —Processor will ignore further interrupts whilst processing one interrupt —Interrupts remain pending and are checked after first interrupt has been processed —Interrupts handled in sequence as they occur • Define priorities —Low priority interrupts can be interrupted by higher priority interrupts —When higher priority interrupt has been processed, processor returns to previous interrupt Multiple Interrupts - Sequential Multiple Interrupts – Nested Time Sequence of Multiple Interrupts Connecting • All the units must be connected • Different type of connection for different type of unit —Memory —Input/Output —CPU Computer Modules Memory Connection • Receives and sends data • Receives addresses (of locations) • Receives control signals —Read —Write —Timing Input/Output Connection(1) • Similar to memory from computer’s viewpoint • Output —Receive data from computer —Send data to peripheral • Input —Receive data from peripheral —Send data to computer

Input/Output Connection(2) • Receive control signals from computer • Send control signals to peripherals —e.g. spin disk • Receive addresses from computer —e.g. port number to identify peripheral • Send interrupt signals (control) CPU Connection • Reads instruction and data • Writes out data (after processing) • Sends control signals to other units • Receives (& acts on) interrupts
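Pulling together the fetch-execute cycle and the interrupt cycle described earlier in this chapter, a skeletal (and heavily simplified) simulator loop — memory, decoding and the saved "context" are all stubs for illustration:

```c
#include <stdio.h>
#include <stdint.h>

static int interrupt_pending = 0;        /* set by a simulated I/O module */
static uint32_t pc = 0, saved_pc;
#define HANDLER_ADDR 0x100u

static uint32_t fetch(uint32_t addr)  { return addr; /* stub memory */ }
static void     execute(uint32_t ir)  { (void)ir;    /* stub decode/execute */ }

int main(void) {
    for (int cycle = 0; cycle < 8; cycle++) {
        uint32_t ir = fetch(pc);         /* fetch: IR <- memory[PC]       */
        pc++;                            /* increment PC (unless a jump)  */
        execute(ir);                     /* execute the instruction       */

        if (cycle == 3) interrupt_pending = 1;  /* pretend I/O raises one */

        /* Interrupt cycle: checked once per instruction. */
        if (interrupt_pending) {
            saved_pc = pc;               /* save context (just PC here)   */
            pc = HANDLER_ADDR;           /* jump to the handler routine   */
            printf("interrupt: PC saved=%u, handler at 0x%X\n", saved_pc, pc);
            interrupt_pending = 0;       /* process the interrupt (stub)  */
            pc = saved_pc;               /* handler done: restore context */
        }
    }
    return 0;
}
```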

Buses • There are a number of possible interconnection systems • Single and multiple BUS structures are most common • e.g. Control/Address/Data bus (PC) • e.g. (DEC-PDP) What is a Bus? • A communication pathway connecting two or more devices • Usually broadcast • Often grouped —A number of channels in one bus —e.g. 32 bit data bus is 32 separate single bit channels • Power lines may not be shown Data Bus • Carries data —Remember that there is no difference between “data” and “instruction” at this level • Width is a key determinant of performance —8, 16, 32, 64 bit Address bus • Identify the source or destination of data • e.g. CPU needs to read an instruction (data) from a given location in memory • Bus width determines maximum memory capacity of system —e.g. 8080 has 16 bit address bus giving 64k address space Control Bus • Control and timing information —Memory read/write signal —Interrupt request —Clock signals

Bus Interconnection Scheme Big and Yellow? • What do buses look like? —Parallel lines on circuit boards —Ribbon cables —Strip connectors on mother boards – e.g. PCI —Sets of wires Physical Realization of Bus Architecture Single Bus Problems • Lots of devices on one bus leads to: —Propagation delays – Long data paths mean that co-ordination of bus use can adversely affect performance – If aggregate data transfer approaches bus capacity • Most systems use multiple buses to overcome these problems Traditional (ISA) (with cache) High Performance Bus Bus Types • Dedicated —Separate data & address lines • Multiplexed —Shared lines —Address valid or data valid control line —Advantage - fewer lines —Disadvantages – More complex control – Ultimate performance

Bus Arbitration • More than one module controlling the bus • e.g. CPU and DMA controller • Only one module may control bus at one time • Arbitration may be centralised or distributed Centralised or Distributed Arbitration • Centralised —Single hardware device controlling bus access – Bus Controller – Arbiter —May be part of CPU or separate • Distributed —Each module may claim the bus —Control logic on all modules Timing • Co-ordination of events on bus • Synchronous —Events determined by clock signals —Control Bus includes clock line —A single 1-0 is a bus cycle —All devices can read clock line —Usually sync on leading edge —Usually a single cycle for an event Synchronous Timing Diagram Asynchronous Timing – Read Diagram Asynchronous Timing – Write Diagram PCI Bus • Peripheral Component Interconnect • Intel released to public domain • 32 or 64 bit • 50 lines

PCI Bus Lines (required) • Systems lines —Including clock and reset • Address & Data —32 time mux lines for address/data —Interrupt & validate lines • Interface Control • Arbitration —Not shared —Direct connection to PCI bus arbiter • Error lines PCI Bus Lines (Optional) • Interrupt lines —Not shared • Cache support • 64-bit Bus Extension —Additional 32 lines —Time multiplexed —2 lines to enable devices to agree to use 64- bit transfer • JTAG/Boundary Scan —For testing procedures

PCI Commands • Transaction between initiator (master) and target • Master claims bus • Determine type of transaction —e.g. I/O read/write • Address phase • One or more data phases PCI Read Timing Diagram PCI Bus Arbiter PCI Bus Arbitration Foreground Reading • Stallings, chapter 3 (all of it) • www.pcguide.com/ref/mbsys/buses/

• In fact, read the whole site! • www.pcguide.com/

William Stallings Computer Organization and Architecture 8th Edition

Chapter 4 Cache Memory Characteristics • Location • Capacity • Unit of transfer • Access method • Performance • Physical type • Physical characteristics • Organisation Location • CPU • Internal • External Capacity • Word size —The natural unit of organisation • Number of words —or Bytes Unit of Transfer • Internal —Usually governed by data bus width • External —Usually a block which is much larger than a word • Addressable unit —Smallest location which can be uniquely addressed —Word internally —Cluster on M$ disks Access Methods (1) • Sequential —Start at the beginning and read through in order —Access time depends on location of data and previous location —e.g. tape • Direct —Individual blocks have unique address —Access is by jumping to vicinity plus sequential search —Access time depends on location and previous location —e.g. disk Access Methods (2) • Random —Individual addresses identify locations exactly —Access time is independent of location or previous access —e.g. RAM • Associative —Data is located by a comparison with contents of a portion of the store —Access time is independent of location or previous access —e.g. cache Memory Hierarchy • Registers —In CPU • Internal or Main memory —May include one or more levels of cache —“RAM” • External memory —Backing store Memory Hierarchy - Diagram Performance • Access time —Time between presenting the address and getting the valid data • Memory Cycle time —Time may be required for the memory to “recover” before next access —Cycle time is access + recovery • Transfer Rate —Rate at which data can be moved Physical Types • Semiconductor —RAM • Magnetic —Disk & Tape • Optical —CD & DVD • Others —Bubble —Hologram Physical Characteristics • Decay • Volatility • Erasable • Power consumption

Organisation • Physical arrangement of bits into words • Not always obvious • e.g. interleaved The Bottom Line • How much? —Capacity • How fast? —Time is money • How expensive?

Hierarchy List • Registers • L1 Cache • L2 Cache • Main memory • Disk cache • Disk • Optical • Tape So you want fast? • It is possible to build a computer which uses only static RAM (see later) • This would be very fast • This would need no cache —How can you cache cache? • This would cost a very large amount

Locality of Reference • During the course of the execution of a program, memory references tend to cluster • e.g. loops

Cache • Small amount of fast memory • Sits between normal main memory and CPU • May be located on CPU chip or module

Cache and Main Memory Cache/Main Memory Structure Cache operation – overview • CPU requests contents of memory location • Check cache for this data • If present, get from cache (fast) • If not present, read required block from main memory to cache • Then deliver from cache to CPU • Cache includes tags to identify which block of main memory is in each cache slot Cache Read Operation - Flowchart Cache Design • Addressing • Size • Mapping Function • Replacement Algorithm • Write Policy • Block Size • Number of Caches Cache Addressing • Where does cache sit? —Between processor and virtual memory management unit —Between MMU and main memory • Logical cache (virtual cache) stores data using virtual addresses —Processor accesses cache directly, not through physical cache —Cache access faster, before MMU address translation —Virtual addresses use same address space for different applications – Must flush cache on each context switch • Physical cache stores data using main memory physical addresses Size does matter • Cost —More cache is expensive • Speed —More cache is faster (up to a point) —Checking cache for data takes time Typical Cache Organization Comparison of Cache Sizes

Processor | Type | Year of Introduction | L1 cache | L2 cache | L3 cache
IBM 360/85 | Mainframe | 1968 | 16 to 32 KB | — | —
PDP-11/70 | Minicomputer | 1975 | 1 KB | — | —
VAX 11/780 | Minicomputer | 1978 | 16 KB | — | —
IBM 3033 | Mainframe | 1978 | 64 KB | — | —
IBM 3090 | Mainframe | 1985 | 128 to 256 KB | — | —
Intel 80486 | PC | 1989 | 8 KB | — | —
Pentium | PC | 1993 | 8 KB/8 KB | 256 to 512 KB | —
PowerPC 601 | PC | 1993 | 32 KB | — | —
PowerPC 620 | PC | 1996 | 32 KB/32 KB | — | —
PowerPC G4 | PC/server | 1999 | 32 KB/32 KB | 256 KB to 1 MB | 2 MB
IBM S/390 G4 | Mainframe | 1997 | 32 KB | 256 KB | 2 MB
IBM S/390 G6 | Mainframe | 1999 | 256 KB | 8 MB | —
Pentium 4 | PC/server | 2000 | 8 KB/8 KB | 256 KB | —
IBM SP | High-end server/supercomputer | 2000 | 64 KB/32 KB | 8 MB | —
CRAY MTAb | Supercomputer | 2000 | 8 KB | 2 MB | —
Itanium | PC/server | 2001 | 16 KB/16 KB | 96 KB | 4 MB
SGI Origin 2001 | High-end server | 2001 | 32 KB/32 KB | 4 MB | —
Itanium 2 | PC/server | 2002 | 32 KB | 256 KB | 6 MB
IBM POWER5 | High-end server | 2003 | 64 KB | 1.9 MB | 36 MB
CRAY XD-1 | Supercomputer | 2004 | 64 KB/64 KB | 1 MB | —

Mapping Function • Cache of 64kByte • Cache block of 4 bytes —i.e. cache is 16k (2^14) lines of 4 bytes • 16MBytes main memory • 24 bit address —(2^24 = 16M) Direct Mapping • Each block of main memory maps to only one cache line —i.e. if a block is in cache, it must be in one specific place • Address is in two parts • Least Significant w bits identify unique word • Most Significant s bits specify one memory block • The MSBs are split into a cache line field r and a tag of s-r (most significant) Direct Mapping Address Structure

Tag (s-r) = 8 bits | Line or Slot (r) = 14 bits | Word (w) = 2 bits

• 24 bit address • 2 bit word identifier (4 byte block) • 22 bit block identifier — 8 bit tag (= 22-14) — 14 bit slot or line • No two blocks mapping to the same line have the same tag field • Check contents of cache by finding the line and comparing the tag
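The 8/14/2 split above can be sketched directly in C; the sample address is arbitrary and the cache itself is not modelled, only the field extraction and the hit test it implies:

```c
#include <stdio.h>
#include <stdint.h>

#define WORD_BITS 2
#define LINE_BITS 14

int main(void) {
    uint32_t addr = 0x16339Cu;                         /* arbitrary 24-bit address */
    uint32_t word = addr & ((1u << WORD_BITS) - 1);              /* low 2 bits  */
    uint32_t line = (addr >> WORD_BITS) & ((1u << LINE_BITS) - 1); /* next 14   */
    uint32_t tag  = addr >> (WORD_BITS + LINE_BITS);             /* top 8 bits  */
    printf("tag=0x%02X line=0x%04X word=%u\n", tag, line, word);
    /* A hit means: cache[line].valid && cache[line].tag == tag */
    return 0;
}
```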

Direct Mapping from Cache to Main Memory Direct Mapping Cache Line Table

Cache line | Main memory blocks held
0 | 0, m, 2m, 3m, …, 2^s − m
1 | 1, m+1, 2m+1, …, 2^s − m + 1
… | …
m−1 | m−1, 2m−1, 3m−1, …, 2^s − 1

Direct Mapping Cache Organization Direct Mapping Example Direct Mapping Summary • Address length = (s + w) bits • Number of addressable units = 2^(s+w) words or bytes • Block size = line size = 2^w words or bytes • Number of blocks in main memory = 2^(s+w)/2^w = 2^s • Number of lines in cache = m = 2^r • Size of tag = (s – r) bits Direct Mapping pros & cons • Simple • Inexpensive • Fixed location for given block —If a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high Victim Cache • Lower miss penalty • Remember what was discarded —Already fetched —Use again with little penalty • Fully associative • 4 to 16 cache lines • Between direct mapped L1 cache and next memory level Associative Mapping • A main memory block can load into any line of cache • Memory address is interpreted as tag and word • Tag uniquely identifies block of memory • Every line’s tag is examined for a match • Cache searching gets expensive

Associative Mapping from Cache to Main Memory Fully Associative Cache Organization Associative Mapping Example Associative Mapping Address Structure

Tag = 22 bits | Word = 2 bits • 22 bit tag stored with each 32 bit block of data • Compare tag field with tag entry in cache to check for hit • Least significant 2 bits of address identify which 16 bit word is required from 32 bit data block • e.g. —Address Tag Data Cache line —FFFFFC FFFFFC 24682468 3FFF Associative Mapping Summary • Address length = (s + w) bits • Number of addressable units = 2^(s+w) words or bytes • Block size = line size = 2^w words or bytes • Number of blocks in main memory = 2^(s+w)/2^w = 2^s • Number of lines in cache = undetermined • Size of tag = s bits Set Associative Mapping • Cache is divided into a number of sets • Each set contains a number of lines • A given block maps to any line in a given set —e.g. Block B can be in any line of set i • e.g. 2 lines per set —2 way associative mapping —A given block can be in one of 2 lines in only one set

Set Associative Mapping Example • 13 bit set number • Block number in main memory is modulo 213 • 000000, 00A000, 00B000, 00C000 … map to same set Mapping From Main Memory to Cache: v Associative Mapping From Main Memory to Cache: k-way Associative K-Way Set Associative Cache Organization Set Associative Mapping Address Structure

Tag = 9 bits | Set = 13 bits | Word = 2 bits

• Use set field to determine cache set to look in • Compare tag field to see if we have a hit • e.g —Address Tag Data Set number —1FF 7FFC 1FF 12345678 1FFF —001 7FFC 001 11223344 1FFF
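A sketch of the 9/13/2 split above, plus the within-set tag search for the two-way case; the tiny cache arrays are illustrative, not a full simulator:

```c
#include <stdio.h>
#include <stdint.h>

#define WAYS 2
#define SET_BITS 13
#define WORD_BITS 2
#define NSETS (1u << SET_BITS)

struct line { int valid; uint32_t tag; };
static struct line cache[NSETS][WAYS];

static int lookup(uint32_t addr) {
    uint32_t set = (addr >> WORD_BITS) & (NSETS - 1);
    uint32_t tag = addr >> (WORD_BITS + SET_BITS);
    for (int w = 0; w < WAYS; w++)                 /* compare every way's tag */
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return 1;                              /* hit */
    return 0;                                      /* miss */
}

int main(void) {
    uint32_t addr = 0x7FFC04u;
    uint32_t set = (addr >> WORD_BITS) & (NSETS - 1);
    cache[set][0].valid = 1;                  /* pretend the block is resident */
    cache[set][0].tag = addr >> (WORD_BITS + SET_BITS);
    printf("lookup: %s\n", lookup(addr) ? "hit" : "miss");
    return 0;
}
```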

Two Way Set Associative Mapping Example Set Associative Mapping Summary • Address length = (s + w) bits • Number of addressable units = 2^(s+w) words or bytes • Block size = line size = 2^w words or bytes • Number of blocks in main memory = 2^(s+w)/2^w = 2^s • Number of lines in set = k • Number of sets = v = 2^d • Number of lines in cache = kv = k × 2^d • Size of tag = (s – d) bits Direct and Set Associative Cache Performance Differences • Significant up to at least 64kB for 2-way • Difference between 2-way and 4-way at 4kB much less than 4kB to 8kB • Cache complexity increases with associativity • Not justified against increasing cache to 8kB or 16kB • Above 32kB gives no improvement • (simulation results)

[Figure 4.16: Varying associativity over cache size — hit ratio (0.0 to 1.0) versus cache size (1k to 1M bytes) for direct, 2-way, 4-way, 8-way and 16-way mapping] Replacement Algorithms (1) Direct mapping • No choice • Each block only maps to one line • Replace that line Replacement Algorithms (2) Associative & Set Associative • Hardware implemented algorithm (speed) • Least Recently Used (LRU) • e.g. in 2 way set associative —Which of the 2 blocks is LRU? • First in first out (FIFO) —replace block that has been in cache longest • Least frequently used —replace block which has had fewest hits • Random
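For the two-way case asked about above ("which of the 2 blocks is LRU?"), a single use bit per set is enough. A minimal sketch, assuming a two-line set:

```c
#include <stdio.h>

struct set2 {
    int lru;          /* index (0 or 1) of the least recently used way */
};

/* Called on every hit: the other way becomes least recently used. */
static void touch(struct set2 *s, int way) { s->lru = 1 - way; }

/* Called on a miss: replace the LRU way, which then becomes most recent. */
static int victim(struct set2 *s) { int v = s->lru; s->lru = 1 - v; return v; }

int main(void) {
    struct set2 s = { .lru = 0 };
    touch(&s, 0);                             /* way 0 used -> way 1 is now LRU */
    printf("replace way %d\n", victim(&s));   /* prints 1 */
    return 0;
}
```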

Write Policy • Must not overwrite a cache block unless main memory is up to date • Multiple CPUs may have individual caches • I/O may address main memory directly Write through • All writes go to main memory as well as cache • Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date • Lots of traffic • Slows down writes

• Remember bogus write through caches! Write back • Updates initially made in cache only • Update bit for cache slot is set when update occurs • If block is to be replaced, write to main memory only if update bit is set • Other caches get out of sync • I/O must access main memory through cache • N.B. 15% of memory references are writes Line Size • Retrieve not only desired word but a number of adjacent words as well • Increased block size will increase hit ratio at first —the principle of locality • Hit ratio will decreases as block becomes even bigger —Probability of using newly fetched information becomes less than probability of reusing replaced • Larger blocks —Reduce number of blocks that fit in cache —Data overwritten shortly after being fetched —Each additional word is less local so less likely to be needed • No definitive optimum value has been found • 8 to 64 bytes seems reasonable • For HPC systems, 64- and 128-byte most common Multilevel Caches • High logic density enables caches on chip —Faster than bus access —Frees bus for other transfers • Common to use both on and off chip cache —L1 on chip, L2 off chip in static RAM —L2 access much faster than DRAM or ROM —L2 often uses separate data path —L2 may now be on chip —Resulting in L3 cache – Bus access or now on chip… Hit Ratio (L1 & L2) For 8 kbytes and 16 kbyte L1 Unified v Split Caches • One cache for data and instructions or two, one for data and one for instructions • Advantages of unified cache —Higher hit rate – Balances load of instruction and data fetch – Only one cache to design & implement • Advantages of split cache —Eliminates cache contention between instruction fetch/decode unit and execution unit – Important in pipelining Pentium 4 Cache • 80386 – no on chip cache • 80486 – 8k using 16 byte lines and four way set associative organization • Pentium (all versions) – two on chip L1 caches —Data & instructions • Pentium III – L3 cache added off chip • Pentium 4 —L1 caches – 8k bytes – 64 byte lines – four way set associative —L2 cache – Feeding both L1 caches – 256k – 128 byte lines – 8 way set associative —L3 cache on chip Intel Cache Evolution

Problem | Solution | Processor on which feature first appears
External memory slower than the system bus. | Add external cache using faster memory technology. | 386
Increased processor speed results in external bus becoming a bottleneck for cache access. | Move external cache on-chip, operating at the same speed as the processor. | 486
Internal cache is rather small, due to limited space on chip. | Add external L2 cache using faster technology than main memory. | 486
Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit's data access takes place. | Create separate data and instruction caches. | Pentium
Increased processor speed results in external bus becoming a bottleneck for L2 cache access. | Create separate back-side bus that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache. | Pentium Pro
Increased processor speed results in external bus becoming a bottleneck for L2 cache access. | Move L2 cache on to the processor chip. | Pentium II
Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small. | Add external L3 cache. | Pentium III
Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small. | Move L3 cache on-chip. | Pentium 4

Pentium 4 Block Diagram Pentium 4 Core Processor • Fetch/Decode Unit —Fetches instructions from L2 cache —Decode into micro-ops —Store micro-ops in L1 cache • Out of order execution logic —Schedules micro-ops —Based on data dependence and resources —May speculatively execute • Execution units —Execute micro-ops —Data from L1 cache —Results in registers • Memory subsystem —L2 cache and systems bus Pentium 4 Design Reasoning • Decodes instructions into RISC like micro-ops before L1 cache • Micro-ops fixed length — Superscalar pipelining and scheduling • Pentium instructions long & complex • Performance improved by separating decoding from scheduling & pipelining — (More later – ch14) • Data cache is write back — Can be configured to write through • L1 cache controlled by 2 bits in register — CD = cache disable — NW = not write through — 2 instructions to invalidate (flush) cache and write back then invalidate • L2 and L3 8-way set-associative — Line size 128 bytes ARM Cache Features

Core | Cache Type | Cache Size (kB) | Cache Line Size (words) | Associativity | Location | Write Buffer Size (words)
ARM720T | Unified | 8 | 4 | 4-way | Logical | 8
ARM920T | Split | 16/16 D/I | 8 | 64-way | Logical | 16
ARM926EJ-S | Split | 4-128/4-128 D/I | 8 | 4-way | Logical | 16
ARM1022E | Split | 16/16 D/I | 8 | 64-way | Logical | 16
ARM1026EJ-S | Split | 4-128/4-128 D/I | 8 | 4-way | Logical | 8
Intel StrongARM | Split | 16/16 D/I | 4 | 32-way | Logical | 32
Intel Xscale | Split | 32/32 D/I | 8 | 32-way | Logical | 32
ARM1136-JF-S | Split | 4-64/4-64 D/I | 8 | 4-way | Physical | 32

ARM Cache Organization • Small FIFO write buffer —Enhances memory write performance —Between cache and main memory —Small c.f. cache —Data put in write buffer at processor clock speed —Processor continues execution —External write in parallel until empty —If buffer full, processor stalls —Data in write buffer not available until written – So keep buffer small ARM Cache and Write Buffer Organization Internet Sources • Manufacturer sites —Intel —ARM • Search on cache William Stallings Computer Organization and Architecture 8th Edition

Chapter 5 Internal Memory Semiconductor Memory Types

Memory Type | Category | Erasure | Write Mechanism | Volatility
Random-access memory (RAM) | Read-write memory | Electrically, byte-level | Electrically | Volatile
Read-only memory (ROM) | Read-only memory | Not possible | Masks | Nonvolatile
Programmable ROM (PROM) | Read-only memory | Not possible | Electrically | Nonvolatile
Erasable PROM (EPROM) | Read-mostly memory | UV light, chip-level | Electrically | Nonvolatile
Electrically Erasable PROM (EEPROM) | Read-mostly memory | Electrically, byte-level | Electrically | Nonvolatile
Flash memory | Read-mostly memory | Electrically, block-level | Electrically | Nonvolatile

Semiconductor Memory • RAM —Misnamed as all semiconductor memory is random access —Read/Write —Volatile —Temporary storage —Static or dynamic Memory Cell Operation Dynamic RAM • Bits stored as charge in capacitors • Charges leak • Need refreshing even when powered • Simpler construction • Smaller per bit • Less expensive • Need refresh circuits • Slower • Main memory • Essentially analogue —Level of charge determines value Dynamic RAM Structure DRAM Operation • Address line active when bit read or written —Transistor switch closed (current flows) • Write —Voltage to bit line – High for 1 low for 0 —Then signal address line – Transfers charge to capacitor • Read —Address line selected – transistor turns on —Charge from capacitor fed via bit line to sense amplifier – Compares with reference value to determine 0 or 1 —Capacitor charge must be restored Static RAM • Bits stored as on/off switches • No charges to leak • No refreshing needed when powered • More complex construction • Larger per bit • More expensive • Does not need refresh circuits • Faster • Cache • Digital —Uses flip-flops Static RAM Structure Static RAM Operation • Transistor arrangement gives stable logic state • State 1

—C1 high, C2 low

—T1 T4 off, T2 T3 on • State 0

—C2 high, C1 low —T2 T3 off, T1 T4 on

• Address line transistors T5 T6 act as a switch • Write – apply value to B and its complement to B̄ • Read – value is on line B

SRAM v DRAM • Both volatile —Power needed to preserve data • Dynamic cell —Simpler to build, smaller —More dense —Less expensive —Needs refresh —Larger memory units • Static —Faster —Cache

Read Only Memory (ROM) • Permanent storage —Nonvolatile • Microprogramming (see later) • Library subroutines • Systems programs (BIOS) • Function tables

Types of ROM • Written during manufacture —Very expensive for small runs • Programmable (once) —PROM —Needs special equipment to program • Read “mostly” —Erasable Programmable (EPROM) – Erased by UV —Electrically Erasable (EEPROM) – Takes much longer to write than read —Flash memory – Erase whole memory electrically Organisation in detail • A 16Mbit chip can be organised as 1M of 16 bit words • A bit per chip system has 16 lots of 1Mbit chip with bit 1 of each word in chip 1 and so on • A 16Mbit chip can be organised as a 2048 x 2048 x 4bit array —Reduces number of address pins – Multiplex row address and column address – 11 pins to address (211=2048) – Adding one more pin doubles range of values so x4 capacity Refreshing • Refresh circuit included on chip • Disable chip • Count through rows • Read & Write back • Takes time • Slows down apparent performance Typical 16 Mb DRAM (4M x 4) Packaging 256kByte Module Organisation 1MByte Module Organisation Interleaved Memory • Collection of DRAM chips • Grouped into memory bank • Banks independently service read or write requests • K banks can service k requests simultaneously Error Correction • Hard Failure —Permanent defect • Soft Error —Random, non-destructive —No permanent damage to memory • Detected using Hamming error correcting code Error Correcting Code Function Advanced DRAM Organization • Basic DRAM same since first RAM chips • Enhanced DRAM —Contains small SRAM as well —SRAM holds last line read (c.f. Cache!) • Cache DRAM —Larger SRAM component —Use as cache or serial buffer

Synchronous DRAM (SDRAM) • Access is synchronized with an external clock • Address is presented to RAM • RAM finds data (CPU waits in conventional DRAM) • Since SDRAM moves data in time with system clock, CPU knows when data will be ready • CPU does not have to wait, it can do something else • Burst mode allows SDRAM to set up stream of data and fire it out in block • DDR-SDRAM sends data twice per clock cycle (leading & trailing edge) SDRAM SDRAM Read Timing RAMBUS • Adopted by Intel for Pentium & Itanium • Main competitor to SDRAM • Vertical package – all pins on one side • Data exchange over 28 wires < cm long • Bus addresses up to 320 RDRAM chips at 1.6Gbps • Asynchronous block protocol —480ns access time —Then 1.6 Gbps RAMBUS Diagram DDR SDRAM • SDRAM can only send data once per clock • Double-data-rate SDRAM can send data twice per clock cycle —Rising edge and falling edge

DDR SDRAM Read Timing Simplified DRAM Read Timing Cache DRAM • Mitsubishi • Integrates small SRAM cache (16 kb) onto generic DRAM chip • Used as true cache —64-bit lines —Effective for ordinary random access • To support serial access of block of data —E.g. refresh bit-mapped screen – CDRAM can prefetch data from DRAM into SRAM buffer – Subsequent accesses solely to SRAM Reading • The RAM Guide • RDRAM William Stallings Computer Organization and Architecture 8th Edition

Chapter 6 External Memory Types of External Memory • Magnetic Disk —RAID —Removable • Optical —CD-ROM —CD-Recordable (CD-R) —CD-R/W —DVD • Magnetic Tape Magnetic Disk • Disk substrate coated with magnetizable material (iron oxide…rust) • Substrate used to be aluminium • Now glass —Improved surface uniformity – Increases reliability —Reduction in surface defects – Reduced read/write errors —Lower flight heights (See later) —Better stiffness —Better shock/damage resistance Read and Write Mechanisms

• Recording & retrieval via conductive coil called a head • May be single read/write head or separate ones • During read/write, head is stationary, platter rotates • Write — Current through coil produces magnetic field — Pulses sent to head — Magnetic pattern recorded on surface below • Read (traditional) — Magnetic field moving relative to coil produces current — Coil is the same for read and write • Read (contemporary) — Separate read head, close to write head — Partially shielded magneto resistive (MR) sensor — Electrical resistance depends on direction of magnetic field — High frequency operation – Higher storage density and speed Inductive Write MR Read Data Organization and Formatting • Concentric rings or tracks —Gaps between tracks —Reduce gap to increase capacity —Same number of bits per track (variable packing density) —Constant angular velocity • Tracks divided into sectors • Minimum block size is one sector • May have more than one sector per block Disk Data Layout Disk Velocity • Bit near centre of rotating disk passes fixed point slower than bit on outside of disk • Increase spacing between bits in different tracks • Rotate disk at constant angular velocity (CAV) —Gives pie shaped sectors and concentric tracks —Individual tracks and sectors addressable —Move head to given track and wait for given sector —Waste of space on outer tracks – Lower data density • Can use zones to increase capacity —Each zone has fixed bits per track —More complex circuitry Disk Layout Methods Diagram Finding Sectors • Must be able to identify start of track and sector • Format disk —Additional information not available to user —Marks tracks and sectors

Winchester Disk Format Seagate ST506 Characteristics • Fixed (rare) or movable head • Removable or fixed • Single or double (usually) sided • Single or multiple platter • Head mechanism —Contact (Floppy) —Fixed gap —Flying (Winchester) Fixed/Movable Head Disk • Fixed head —One read write head per track —Heads mounted on fixed ridged arm • Movable head —One read write head per side —Mounted on a movable arm

Removable or Not • Removable disk —Can be removed from drive and replaced with another disk —Provides unlimited storage capacity —Easy data transfer between systems • Nonremovable disk —Permanently mounted in the drive

Multiple Platter • One head per side • Heads are joined and aligned • Aligned tracks on each platter form cylinders • Data is striped by cylinder —reduces head movement —Increases speed (transfer rate) Multiple Platters Tracks and Cylinders Floppy Disk • 8”, 5.25”, 3.5” • Small capacity —Up to 1.44Mbyte (2.88M never popular) • Slow • Universal • Cheap • Obsolete?

Winchester Hard Disk (1) • Developed by IBM in Winchester (USA) • Sealed unit • One or more platters (disks) • Heads fly on boundary layer of air as disk spins • Very small head to disk gap • Getting more robust

Winchester Hard Disk (2) • Universal • Cheap • Fastest external storage • Getting larger all the time —250 Gigabyte now easily available Speed • Seek time —Moving head to correct track • (Rotational) latency —Waiting for data to rotate under head • Access time = Seek + Latency • Transfer rate Timing of Disk I/O Transfer RAID • Redundant Array of Independent Disks • Redundant Array of Inexpensive Disks • 6 levels in common use • Not a hierarchy • Set of physical disks viewed as single logical drive by O/S • Data distributed across physical drives • Can use redundant capacity to store parity information
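Before turning to RAID, the disk timing relation above (access time = seek + rotational latency + transfer time) made concrete; all drive parameters below are assumed, not taken from any particular product:

```c
#include <stdio.h>

int main(void) {
    double seek_ms    = 4.0;                  /* average seek time (assumed)    */
    double rpm        = 7200.0;
    double latency_ms = 0.5 * 60000.0 / rpm;  /* half a revolution: ~4.17 ms    */
    double track_kb   = 512.0;                /* data per track (assumed)       */
    double sector_kb  = 4.0;
    /* reading one sector takes (sector/track) of one full revolution */
    double transfer_ms = (sector_kb / track_kb) * (60000.0 / rpm);
    printf("access = %.2f + %.2f + %.2f = %.2f ms\n",
           seek_ms, latency_ms, transfer_ms,
           seek_ms + latency_ms + transfer_ms);
    return 0;
}
```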

RAID 0 • No redundancy • Data striped across all disks • Round Robin striping • Increase speed —Multiple data requests probably not on same disk —Disks seek in parallel —A set of data is likely to be striped across multiple disks

RAID 1 • Mirrored Disks • Data is striped across disks • 2 copies of each stripe on separate disks • Read from either • Write to both • Recovery is simple —Swap faulty disk & re-mirror —No down time • Expensive RAID 2 • Disks are synchronized • Very small stripes —Often single byte/word • Error correction calculated across corresponding bits on disks • Multiple parity disks store Hamming code error correction in corresponding positions • Lots of redundancy —Expensive —Not used RAID 3 • Similar to RAID 2 • Only one redundant disk, no matter how large the array • Simple parity bit for each set of corresponding bits • Data on failed drive can be reconstructed from surviving data and parity info • Very high transfer rates RAID 4 • Each disk operates independently • Good for high I/O request rate • Large stripes • Bit by bit parity calculated across stripes on each disk • Parity stored on parity disk RAID 5 • Like RAID 4 • Parity striped across all disks • Round robin allocation for parity stripe • Avoids RAID 4 bottleneck at parity disk • Commonly used in network servers

• N.B. DOES NOT MEAN 5 DISKS!!!!! RAID 6 • Two parity calculations • Stored in separate blocks on different disks • User requirement of N disks needs N+2 • High data availability —Three disks need to fail for data loss —Significant write penalty RAID 0, 1, 2 RAID 3 & 4 RAID 5 & 6 Data Mapping For RAID 0 Optical Storage CD-ROM • Originally for audio • 650Mbytes giving over 70 minutes audio • Polycarbonate coated with highly reflective coat, usually aluminium • Data stored as pits • Read by reflecting laser • Constant packing density • Constant linear velocity CD Operation CD-ROM Drive Speeds • Audio is single speed —Constant linear velocity —1.2 m/s —Track (spiral) is 5.27km long —Gives 4391 seconds = 73.2 minutes • Other speeds are quoted as multiples • e.g. 24x • Quoted figure is maximum drive can achieve

CD-ROM Format

• Mode 0=blank data field • Mode 1=2048 byte data+error correction • Mode 2=2336 byte data Random Access on CD-ROM • Difficult • Move head to rough position • Set correct speed • Read address • Adjust to required location • (Yawn!)

CD-ROM for & against • Large capacity (?) • Easy to mass produce • Removable • Robust

• Expensive for small runs • Slow • Read only

Other Optical Storage • CD-Recordable (CD-R) —WORM —Now affordable —Compatible with CD-ROM drives • CD-RW —Erasable —Getting cheaper —Mostly CD-ROM drive compatible —Phase change – Material has two different reflectivities in different phase states

DVD - what’s in a name? • Digital Video Disk —Used to indicate a player for movies – Only plays video disks • Digital Versatile Disk —Used to indicate a computer drive – Will read computer disks and play video disks • Dogs Veritable Dinner • Officially - nothing!!! DVD - technology • Multi-layer • Very high capacity (4.7G per layer) • Full length movie on single disk —Using MPEG compression • Finally standardized (honest!) • Movies carry regional coding • Players only play correct region films • Can be “fixed” DVD – Writable • Loads of trouble with standards • First generation DVD drives may not read first generation DVD-W disks • First generation DVD drives may not read CD-RW disks • Wait for it to settle down before buying! CD and DVD High Definition Optical Disks • Designed for high definition videos • Much higher capacity than DVD —Shorter wavelength laser – Blue-violet range —Smaller pits • HD-DVD —15GB single side single layer • Blu-ray —Data layer closer to laser – Tighter focus, less distortion, smaller pits —25GB on single layer —Available read only (BD-ROM), recordable once (BD-R) and re-recordable (BD-RE) Optical Memory Characteristics Magnetic Tape • Serial access • Slow • Very cheap • Backup and archive • Linear Tape-Open (LTO) Tape Drives —Developed late 1990s —Open source alternative to proprietary tape systems

Linear Tape-Open (LTO) Tape Drives

LTO-1 LTO-2 LTO-3 LTO-4 LTO-5 LTO-6

Release date 2000 2003 2005 2007 TBA TBA

Compressed capacity 200 GB 400 GB 800 GB 1600 GB 3.2 TB 6.4 TB

Compressed transfer rate (MB/s) 40 80 160 240 360 540

Linear density (bits/mm) 4880 7398 9638 13300

Tape tracks 384 512 704 896

Tape length 609 m 609 m 680 m 820 m

Tape width (cm) 1.27 1.27 1.27 1.27

Write elements 8 8 16 16 Internet Resources • Optical Storage Technology Association —Good source of information about optical storage technology and vendors —Extensive list of relevant links • DLTtape —Good collection of technical information and links to vendors • Search on RAID William Stallings Computer Organization and Architecture 8th Edition

Chapter 7 Input/Output Input/Output Problems • Wide variety of peripherals —Delivering different amounts of data —At different speeds —In different formats • All slower than CPU and RAM • Need I/O modules

Input/Output Module • Interface to CPU and Memory • Interface to one or more peripherals Generic Model of I/O Module External Devices • Human readable —Screen, printer, keyboard • Machine readable —Monitoring and control • Communication —Modem —Network Interface Card (NIC) External Device Block Diagram I/O Module Function • Control & Timing • CPU Communication • Device Communication • Data Buffering • Error Detection

I/O Steps • CPU checks I/O module device status • I/O module returns status • If ready, CPU requests data transfer • I/O module gets data from device • I/O module transfers data to CPU • Variations for output, DMA, etc. I/O Module Diagram I/O Module Decisions • Hide or reveal device properties to CPU • Support multiple or single device • Control device functions or leave for CPU • Also O/S decisions —e.g. Unix treats everything it can as a file

Input Output Techniques • Programmed • Interrupt driven • (DMA)

Three Techniques for Input of a Block of Data Programmed I/O • CPU has direct control over I/O —Sensing status —Read/write commands —Transferring data • CPU waits for I/O module to complete operation • Wastes CPU time Programmed I/O - detail • CPU requests I/O operation • I/O module performs operation • I/O module sets status bits • CPU checks status bits periodically • I/O module does not inform CPU directly • I/O module does not interrupt CPU • CPU may wait or come back later I/O Commands • CPU issues address —Identifies module (& device if >1 per module) • CPU issues command —Control - telling module what to do – e.g. spin up disk —Test - check status – e.g. power? Error? —Read/Write – Module transfers data via buffer from/to device
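A minimal sketch of the programmed I/O loop just described. On real hardware the status, data and command registers would live at fixed device addresses; here they are ordinary variables (with an explicit device stub) so the sketch runs anywhere. Names and bit masks are invented:

```c
#include <stdio.h>
#include <stdint.h>

#define CMD_READ   0x01u
#define STATUS_RDY 0x80u

static volatile uint8_t status_reg, data_reg, cmd_reg;

/* Stand-in for the I/O module: performs the operation and sets status bits. */
static void io_module_step(void) {
    if (cmd_reg == CMD_READ) { data_reg = 'A'; status_reg |= STATUS_RDY; }
}

int main(void) {
    cmd_reg = CMD_READ;                   /* CPU issues the read command    */
    while (!(status_reg & STATUS_RDY))    /* CPU busy-waits: wasted time    */
        io_module_step();                 /* (device works in the meantime) */
    printf("read byte: %c\n", data_reg);  /* transfer the data to the CPU   */
    return 0;
}
```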

Addressing I/O Devices • Under programmed I/O data transfer is very like memory access (CPU viewpoint) • Each device given unique identifier • CPU commands contain identifier (address) I/O Mapping • Memory mapped I/O —Devices and memory share an address space —I/O looks just like memory read/write —No special commands for I/O – Large selection of memory access commands available • Isolated I/O —Separate address spaces —Need I/O or memory select lines —Special commands for I/O – Limited set Memory Mapped and Isolated I/O Interrupt Driven I/O • Overcomes CPU waiting • No repeated CPU checking of device • I/O module interrupts when ready Interrupt Driven I/O Basic Operation • CPU issues read command • I/O module gets data from peripheral whilst CPU does other work • I/O module interrupts CPU • CPU requests data • I/O module transfers data
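Contrasting the two mappings above in C (GCC on x86 assumed for the isolated case): memory-mapped I/O needs only ordinary loads and stores, while isolated I/O needs special instructions such as x86 IN/OUT. The port-read helper is shown but deliberately not called, since it requires ring-0 access to real hardware:

```c
#include <stdint.h>

/* Memory-mapped I/O: a device register is reached with an ordinary load;
 * any of the normal memory-access instructions can touch it. */
static inline uint8_t mmio_read8(volatile uint8_t *reg) {
    return *reg;
}

/* Isolated I/O: a separate address space reachable only through special
 * instructions -- IN/OUT on x86, sketched here with GCC inline assembly. */
static inline uint8_t port_read8(uint16_t port) {
    uint8_t v;
    __asm__ volatile ("inb %1, %0" : "=a"(v) : "Nd"(port));
    return v;
}

int main(void) {
    static volatile uint8_t fake_device_reg = 0x5A; /* stands in for hardware */
    return mmio_read8(&fake_device_reg) == 0x5A ? 0 : 1;
    /* port_read8(0x60) would touch the legacy keyboard port, but only in
     * ring 0 on real x86 hardware, so it is not called here. */
}
```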

Simple Interrupt Processing CPU Viewpoint • Issue read command • Do other work • Check for interrupt at end of each instruction cycle • If interrupted:- —Save context (registers) —Process interrupt – Fetch data & store • See Operating Systems notes Changes in Memory and Registers for an Interrupt Design Issues • How do you identify the module issuing the interrupt? • How do you deal with multiple interrupts? —i.e. an interrupt handler being interrupted

Identifying Interrupting Module (1) • Different line for each module —PC —Limits number of devices • Software poll —CPU asks each module in turn —Slow
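The software poll option above, sketched with an invented module table; note why it is slow — the time to find the interrupting module grows with the number of modules queried:

```c
#include <stdio.h>

struct io_module {
    const char *name;
    int pending;              /* set by the device when it raises the line */
};

static struct io_module modules[] = {
    { "disk", 0 }, { "keyboard", 1 }, { "timer", 0 },
};

/* On interrupt, ask each module in turn: "did you interrupt?" */
static void interrupt_dispatch(void) {
    for (int i = 0; i < (int)(sizeof modules / sizeof modules[0]); i++) {
        if (modules[i].pending) {            /* one status query per module */
            printf("servicing %s\n", modules[i].name);
            modules[i].pending = 0;
            return;                          /* handle one interrupt */
        }
    }
}

int main(void) { interrupt_dispatch(); return 0; }
```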

Identifying Interrupting Module (2) • Daisy Chain or Hardware poll —Interrupt Acknowledge sent down a chain —Module responsible places vector on bus —CPU uses vector to identify handler routine • Bus Master —Module must claim the bus before it can raise interrupt —e.g. PCI & SCSI Multiple Interrupts • Each interrupt line has a priority • Higher priority lines can interrupt lower priority lines • If bus mastering only current master can interrupt Example - PC Bus • 80x86 has one interrupt line • 8086 based systems use one 8259A interrupt controller • 8259A has 8 interrupt lines

Sequence of Events • 8259A accepts interrupts • 8259A determines priority • 8259A signals 8086 (raises INTR line) • CPU Acknowledges • 8259A puts correct vector on data bus • CPU processes interrupt

ISA Bus Interrupt System • ISA bus chains two 8259As together • Link is via interrupt 2 • Gives 15 lines —16 lines less one for link • IRQ 9 is used to re-route anything trying to use IRQ 2 —Backwards compatibility • Incorporated in chip set 82C59A Interrupt Controller Intel 82C55A Programmable Peripheral Interface Keyboard/Display Interfaces to 82C55A Direct Memory Access • Interrupt driven and programmed I/O require active CPU intervention —Transfer rate is limited —CPU is tied up • DMA is the answer DMA Function • Additional Module (hardware) on bus • DMA controller takes over from CPU for I/O

Typical DMA Module Diagram DMA Operation • CPU tells DMA controller:- —Read/Write —Device address —Starting address of memory block for data —Amount of data to be transferred • CPU carries on with other work • DMA controller deals with transfer • DMA controller sends interrupt when finished DMA Transfer Cycle Stealing • DMA controller takes over bus for a cycle • Transfer of one word of data • Not an interrupt —CPU does not switch context • CPU suspended just before it accesses bus —i.e. before an operand or data fetch or a data write • Slows down CPU but not as much as CPU doing transfer DMA and Interrupt Breakpoints During an Instruction Cycle Aside • What effect does caching memory have on DMA? • What about on board cache? • Hint: how much are the system buses available?
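A sketch of the DMA handoff described above: the CPU fills in a descriptor (direction, device address, memory start address, count) and starts the controller, then carries on until the completion interrupt. The controller is simulated in software here and every name is illustrative:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct dma_ctrl {
    int      write;       /* 0 = read from device, 1 = write to device */
    int      device;      /* device address                            */
    uint8_t *mem_start;   /* starting address of the memory block      */
    uint32_t count;       /* amount of data to transfer                */
    volatile int done;    /* set by controller; would be an interrupt  */
};

/* Stand-in for the controller: moves the block without CPU involvement,
 * one word per stolen bus cycle on real hardware. */
static void dma_run(struct dma_ctrl *c, const uint8_t *device_data) {
    if (!c->write)
        memcpy(c->mem_start, device_data, c->count);
    c->done = 1;                          /* completion interrupt */
}

int main(void) {
    uint8_t buf[8] = {0};
    const uint8_t disk_block[8] = "DMA-OK!";
    struct dma_ctrl c = { .write = 0, .device = 1,
                          .mem_start = buf, .count = sizeof buf, .done = 0 };
    dma_run(&c, disk_block);     /* CPU would do other work while this runs */
    if (c.done) printf("transferred: %s\n", buf);
    return 0;
}
```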

DMA Configurations (1)

• Single Bus, Detached DMA controller • Each transfer uses bus twice —I/O to DMA then DMA to memory • CPU is suspended twice DMA Configurations (2)

• Single Bus, Integrated DMA controller • Controller may support >1 device • Each transfer uses bus once —DMA to memory • CPU is suspended once DMA Configurations (3)

• Separate I/O Bus • Bus supports all DMA enabled devices • Each transfer uses bus once —DMA to memory • CPU is suspended once Intel 8237A DMA Controller • Interfaces to 80x86 family and DRAM • When DMA module needs buses it sends HOLD signal to processor • CPU responds HLDA (hold acknowledge) — DMA module can use buses • E.g. transfer data from memory to disk 1. Device requests service of DMA by pulling DREQ (DMA request) high 2. DMA puts high on HRQ (hold request), 3. CPU finishes present bus cycle (not necessarily present instruction) and puts high on HLDA (hold acknowledge). HOLD remains active for duration of DMA 4. DMA activates DACK (DMA acknowledge), telling device to start transfer 5. DMA starts transfer by putting address of first byte on address bus and activating MEMR; it then activates IOW to write to peripheral. DMA decrements counter and increments address pointer. Repeat until count reaches zero 6. DMA deactivates HRQ, giving bus back to CPU 8237 DMA Usage of Systems Bus Fly-By • While DMA using buses processor idle • Processor using bus, DMA idle —Known as fly-by DMA controller • Data does not pass through and is not stored in DMA chip —DMA only between I/O port and memory —Not between two I/O ports or two memory locations • Can do memory to memory via register • 8237 contains four DMA channels —Programmed independently —Any one active —Numbered 0, 1, 2, and 3 I/O Channels • I/O devices getting more sophisticated • e.g. 3D graphics cards • CPU instructs I/O controller to do transfer • I/O controller does entire transfer • Improves speed —Takes load off CPU —Dedicated processor is faster

I/O Channel Architecture Interfacing • Connecting devices together • Bit of wire? • Dedicated processor/memory/buses? • E.g. FireWire, InfiniBand IEEE 1394 FireWire • High performance serial bus • Fast • Low cost • Easy to implement • Also being used in digital cameras, VCRs and TV

FireWire Configuration • Daisy chain • Up to 63 devices on single port —Really 64 of which one is the interface itself • Up to 1022 buses can be connected with bridges • Automatic configuration • No bus terminators • May be tree structure Simple FireWire Configuration FireWire 3 Layer Stack • Physical —Transmission medium, electrical and signaling characteristics • Link —Transmission of data in packets • Transaction —Request-response protocol FireWire Protocol Stack FireWire - Physical Layer • Data rates from 25 to 400Mbps • Two forms of arbitration —Based on tree structure —Root acts as arbiter —First come first served —Natural priority controls simultaneous requests – i.e. who is nearest to root —Fair arbitration —Urgent arbitration

FireWire - Link Layer • Two transmission types —Asynchronous – Variable amount of data and several bytes of transaction data transferred as a packet – To explicit address – Acknowledgement returned —Isochronous – Variable amount of data in sequence of fixed size packets at regular intervals – Simplified addressing – No acknowledgement FireWire Subactions InfiniBand • I/O specification aimed at high end servers —Merger of Future I/O (Cisco, HP, Compaq, IBM) and Next Generation I/O (Intel) • Version 1 released early 2001 • Architecture and spec. for data flow between processor and intelligent I/O devices • Intended to replace PCI in servers • Increased capacity, expandability, flexibility InfiniBand Architecture • Remote storage, networking and connection between servers • Attach servers, remote storage, network devices to central fabric of switches and links • Greater server density • Scalable data centre • Independent nodes added as required • I/O distance from server up to —17m using copper —300m multimode fibre optic —10km single mode fibre • Up to 30Gbps InfiniBand Switch Fabric InfiniBand Operation • 16 logical channels (virtual lanes) per physical link • One lane for management, rest for data • Data in stream of packets • Virtual lane dedicated temporarily to end to end transfer • Switch maps traffic from incoming to outgoing lane InfiniBand Protocol Stack Foreground Reading • Check out Universal Serial Bus (USB) • Compare with other communication standards e.g.