Announcements

• TBD

EE108B Lecture 18: Buses and OS, Review, & What’s Next

Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b


Review: Buses

• A bus is a shared communication link that connects multiple devices
  – A single set of wires connects multiple “subsystems,” as opposed to a point-to-point link, which connects only two components
  – Wires connect in parallel, so a 32-bit bus has 32 wires of data

[Figure: memory and several I/O devices sharing a single bus]

Review: Trends for Buses: Logical and Physical Switch

• Serial point-to-point advantages
  – Faster links
  – Fewer chip package pins
  – Higher performance
  – A switch keeps arbitration on chip
• Many bus standards are moving to serial, point-to-point links
  – 3GIO / PCI-Express (replacing PCI)
  – Serial ATA (replacing the parallel IDE hard-disk interface)
  – AMD HyperTransport versus the front-side bus (FSB)

[Figure: devices D1–D4 on a shared parallel bus with control lines, versus D1–D4 connected through a 4-port switch]

Review: OS Communication

• The operating system should prevent user programs from communicating with I/O devices directly
  – Must protect I/O resources to keep sharing fair
  – Protection of shared I/O resources cannot be provided if user programs could perform I/O directly
• Three types of communication are required:
  – The OS must be able to give commands to I/O devices
  – An I/O device must be able to notify the OS when it has completed an operation or has encountered an error
  – Data must be transferred between memory and an I/O device

Review: A Method for Addressing a Device

• Memory-mapped I/O:
  – Portions of the address space are assigned to each I/O device
  – I/O addresses correspond to device registers
  – User programs are prevented from issuing I/O operations directly, since the I/O address space is protected by the address translation mechanism

[Figure: physical address map from 0x00000000 to 0xFFFFFFFF, divided into OS memory, general RAM, graphics memory, and I/O devices #1–#3]
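To make memory-mapped I/O concrete, here is a minimal C sketch for a hypothetical device; the base address, register offsets, and accessor names are assumptions for illustration, not details from the lecture.

#include <stdint.h>

/* Hypothetical device registers mapped into the physical address
 * space; the base address and layout are made up for this sketch. */
#define DEV_BASE    0xFFFF0000u
#define DEV_STATUS  (*(volatile uint32_t *)(DEV_BASE + 0x0))
#define DEV_DATA    (*(volatile uint32_t *)(DEV_BASE + 0x4))

/* An ordinary load or store to a mapped address becomes a device-
 * register access; 'volatile' keeps the compiler from caching or
 * reordering the accesses.  Because these pages are mapped only for
 * the OS, address translation keeps user programs from touching them. */
uint32_t read_device_status(void)      { return DEV_STATUS; }
void     write_device_data(uint32_t v) { DEV_DATA = v; }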


Communicating with the CPU

• Method #1: Polling
  – The I/O device places information in a status register
  – The OS periodically checks the status register
  – Whether polling is used often depends on whether the device can initiate I/O independently
    • For instance, a mouse works well since it has a fixed I/O rate and initiates its own data (whenever it is moved)
    • For others, such as disk access, I/O occurs only under the control of the OS, so we poll only when the OS knows the device is active
  – Advantages
    • Simple to implement
    • The processor is in control and does the work
  – Disadvantage
    • Polling overhead and data transfer consume CPU time

Polling and Programmed I/O

[Figure: flowcharts for polled input and output. Input: loop on “is the device ready?”; when yes, read data from the device and store it to memory; repeat until done. Output: loop on “is the device ready?”; when yes, write data to the device; repeat until done. The busy-wait loop is not an efficient way to use the CPU unless the device is very fast, and the processor may be an inefficient way to move the data.]
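The input flowchart translates almost line-for-line into C. This is a sketch assuming hypothetical status/data registers (P_STATUS, P_DATA) and a READY bit:

#include <stddef.h>
#include <stdint.h>

/* Polled ("programmed") input, following the input flowchart above.
 * Register addresses and the READY bit are invented for this sketch. */
#define P_STATUS (*(volatile uint32_t *)0xFFFF1000u)
#define P_DATA   (*(volatile uint32_t *)0xFFFF1004u)
#define P_READY  0x1u

void polled_read(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        while ((P_STATUS & P_READY) == 0)
            ;                        /* busy-wait: CPU does no useful work */
        buf[i] = (uint8_t)P_DATA;    /* read from device, store to memory */
    }
}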

I/O Notification (cont.)

• Method #2: I/O interrupt
  – Whenever an I/O device needs attention from the processor, it interrupts the processor
  – The interrupt must tell the OS about the event and which device raised it
    • Using “cause” register(s): the kernel “asks” what interrupted
    • Using vectored interrupts: a different exception handler for each source
    • Example: the Intel 80x86 has 256 vectored interrupts
  – I/O interrupts are asynchronous events and can happen at any time
    • The processor waits until the current instruction is completed
  – Interrupts may have different priorities
    • Ex.: network = high priority, keyboard = low priority
  – Advantage: execution is halted only during the actual transfer
  – Disadvantage: software overhead of interrupt processing

I/O Interrupts

[Figure: interrupt handling. (1) An I/O interrupt arrives while a user program executes (add, sub, and, or, nop). (2) The processor saves state and consults the exception vector table in memory, which contains the PCs of the ISRs for sources such as arithmetic exception, page fault, address error, unaligned access, floating-point exception, timer interrupt, network, disk drive, and keyboard. (3) Control jumps to the interrupt service routine (read, store, ..., rti). (4) State is restored and execution returns to the program.]
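A rough C sketch of vectored dispatch, step (3) of the figure. The table layout, vector numbers, and handler names are invented; only the 256-entry count for the 80x86 comes from the slide:

#include <stdint.h>

/* One handler per interrupt source, indexed by vector number.  On a
 * real machine the table layout is fixed by the ISA; this is a sketch. */
#define NUM_VECTORS 256

typedef void (*isr_t)(void);

static isr_t vector_table[NUM_VECTORS];

static void disk_isr(void)     { /* service the disk controller */ }
static void keyboard_isr(void) { /* service the keyboard        */ }

/* What hardware plus the low-level handler conceptually do:
 * (1) finish the current instruction, (2) save state, (3) jump to the
 * ISR whose address is stored in the table, (4) restore and return. */
void dispatch_interrupt(unsigned vector)
{
    if (vector < NUM_VECTORS && vector_table[vector] != 0)
        vector_table[vector]();   /* step (3): run the service routine */
}

void install_handlers(void)
{
    vector_table[14] = disk_isr;      /* vector numbers are made up */
    vector_table[33] = keyboard_isr;
}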


Data Transfer

• The third component of I/O communication is the transfer of data from the I/O device to memory (or vice versa)
• Simple approach: “programmed” I/O
  – Software on the processor moves all data between memory addresses and I/O addresses
  – Simple and flexible, but wastes CPU time
  – Also causes lots of excess data movement in modern systems
    • Ex.: Mem --> NB --> CPU --> NB --> graphics, when we want Mem --> NB --> graphics
• So we need a solution that allows data transfer to happen without the processor’s involvement

Delegating I/O: DMA

• Direct memory access (DMA)
  – Transfers blocks of data to or from memory without CPU intervention
  – Communication is coordinated by the DMA controller
    • DMA controllers are integrated into memory or I/O controller chips
    • The DMA controller acts as a bus master in bus-based systems
• DMA steps (see the sketch below):
  – The processor sets up the DMA transfer by supplying:
    • The identity of the device and the operation (read/write)
    • The memory address for the source/destination
    • The number of bytes to transfer
  – The DMA controller starts the operation by arbitrating for the bus and then starting the transfer when the data is ready
  – It notifies the processor when the DMA transfer is complete or on error, usually using an interrupt
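A minimal sketch of the setup step in C, assuming an invented register layout for the DMA controller (DMA_DEV, DMA_ADDR, DMA_COUNT, DMA_CTRL):

#include <stdint.h>

/* Hypothetical DMA controller registers; addresses and bit layout are
 * made up to illustrate the setup steps listed above. */
#define DMA_DEV    (*(volatile uint32_t *)0xFFFF2000u) /* device id + read/write */
#define DMA_ADDR   (*(volatile uint32_t *)0xFFFF2004u) /* source/destination address */
#define DMA_COUNT  (*(volatile uint32_t *)0xFFFF2008u) /* bytes to transfer */
#define DMA_CTRL   (*(volatile uint32_t *)0xFFFF200Cu) /* go / interrupt-enable bits */

#define DMA_READ   0x1u
#define DMA_GO     0x1u
#define DMA_IRQ_EN 0x2u

/* The processor's role ends after setup; the controller masters the
 * bus, moves the data, and raises an interrupt on completion or error. */
void dma_start_read(uint32_t device, uint32_t mem_addr, uint32_t nbytes)
{
    DMA_DEV   = device | DMA_READ;
    DMA_ADDR  = mem_addr;
    DMA_COUNT = nbytes;
    DMA_CTRL  = DMA_GO | DMA_IRQ_EN;  /* kick off transfer, interrupt when done */
}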

Reading a Disk Sector (1)

• The CPU initiates a disk read by writing a command, logical block number, and destination memory address to the disk controller’s control registers.

[Figure: a 1 GHz Pentium processor (pipeline, caches) connects through the north bridge to the AGP graphics controller (monitor) and the memory controller; the PCI bus below the bridge connects a USB hub controller (printer, mouse, keyboard), an Ethernet controller, and a disk controller with two disks.]

Reading a Disk Sector (2)

• The disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.

[Figure: same system diagram; the data flows from the disk controller across the PCI bus and north bridge into main memory.]


Reading a Disk Sector (3)

• When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special “interrupt” pin on the CPU).

[Figure: same system diagram; the interrupt travels from the disk controller to the CPU.]

DMA Problems: Virtual vs. Physical Addresses

• If DMA uses physical addresses
  – A memory access that crosses a physical page boundary may not correspond to contiguous virtual pages (or even to the same application!)
• Solution 1: ≤ 1 page per DMA transfer
• Solution 1+: chain a series of 1-page requests provided by the OS (see the sketch below)
  – Single interrupt at the end of the last DMA request in the chain
• Solution 2: the DMA engine uses virtual addresses
  – Multi-page DMA requests are now easy
  – A TLB is necessary for the DMA engine
• For DMA with physical addresses, pages must be pinned in DRAM
  – The OS should not page out to disk any pages involved in pending I/O
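A sketch of “Solution 1+” as the OS might build it, assuming a hypothetical descriptor format and a virt_to_phys() stand-in for the page-table lookup:

#include <stdint.h>

/* A chain of one-page DMA descriptors built by the OS from the
 * physical pages behind a virtual buffer.  A real controller defines
 * its own descriptor format, and the OS must also pin the pages in
 * DRAM while the I/O is pending. */
#define PAGE_SIZE 4096u

struct dma_desc {
    uint32_t phys_addr;      /* physical address of this piece */
    uint32_t nbytes;         /* at most one page */
    struct dma_desc *next;   /* 0 terminates the chain */
    uint32_t irq_on_done;    /* set only on the last descriptor */
};

extern uint32_t virt_to_phys(void *va);  /* stand-in for a page-table walk */

/* Assumes buf is page-aligned and d[] has enough entries; the
 * controller walks the chain and raises one interrupt at the end. */
void build_chain(struct dma_desc *d, void *buf, uint32_t len)
{
    if (len == 0)
        return;
    for (uint32_t done = 0; done < len; d++) {
        uint32_t chunk = (len - done < PAGE_SIZE) ? len - done : PAGE_SIZE;
        d->phys_addr   = virt_to_phys((char *)buf + done);
        d->nbytes      = chunk;
        d->irq_on_done = 0;
        done += chunk;
        d->next = (done < len) ? d + 1 : 0;
    }
    /* d now points one past the last descriptor filled in */
    (d - 1)->irq_on_done = 1;   /* single interrupt for the whole chain */
}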

DMA Problems: Cache Coherence

• A copy of the data involved in a DMA transfer may reside in the processor cache
  – If memory is updated: must update or invalidate the “old” cached copy
  – If memory is read: must read the latest value, which may be in the cache
    • Only a problem with write-back caches
• This is called the “cache coherence” problem
  – The same problem appears in multiprocessor systems
• Solution 1: the OS flushes the cache before I/O reads or forces write-backs before I/O writes (sketched below)
  – The flush/write-back may involve selected addresses or the whole cache
  – Can be done in software or with hardware (ISA) support
• Solution 2: route memory accesses for I/O through the cache
  – Search the cache for copies and invalidate or write back as needed
  – This hardware solution may hurt performance
    • While the cache is being searched for I/O requests, it is not available to the processor
    • Multi-level, inclusive caches make this easier
      – The processor searches mostly the L1 cache (until it misses)
      – I/O requests search mostly the L2 cache (until they find a copy of interest)

I/O Summary

• I/O performance has to take into account many variables
  – Response time and throughput
  – CPU, memory, bus, I/O device
  – Workload, e.g., transaction processing
• I/O devices also span a wide spectrum
  – Disks, graphics, and networks
• Buses
  – Bandwidth
  – Synchronization
  – Transactions
• OS and I/O
  – Communication: polling and interrupts
  – Handling I/O outside the CPU: DMA
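Returning to the cache-coherence slide above: a minimal sketch of Solution 1, assuming hypothetical cache-maintenance primitives (cache_invalidate_range, cache_writeback_range) that the OS would call before starting a DMA transfer:

/* Before the device writes the buffer (I/O read), drop any cached
 * copies so later loads see the DMA'd data; before the device reads
 * the buffer (I/O write), push dirty lines back so memory holds the
 * latest values.  Both primitives are assumed OS/ISA helpers. */
extern void cache_invalidate_range(void *addr, unsigned len);
extern void cache_writeback_range(void *addr, unsigned len);

void dma_prepare(void *buf, unsigned len, int device_writes_memory)
{
    if (device_writes_memory)
        cache_invalidate_range(buf, len);  /* drop stale copies */
    else
        cache_writeback_range(buf, len);   /* memory gets latest values */
}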

EE108B Quick Review (1)

• What we learned in EE108b
  – How to build programmable systems
    • Processor, memory system, I/O system, interactions with the OS and compiler
  – Processor design
    • ISA, single-cycle design, pipelined design, ...
  – The basics of computer system design
    • Levels of abstraction, pipelining, caching, address indirection, DMA, ...
  – Understanding why your programs sometimes run slowly
    • Pipeline stalls, cache misses, page faults, I/O accesses, ...

EE108B Quick Review (2): Major Lessons to Take Away

• Levels of abstraction (e.g., ISA→processor→RTL blocks→gates)
  – Simplifies the design process for complex systems
  – Need a good specification for each level
• Pipelining
  – Improve the throughput of an engine by overlapping tasks
  – Watch out for dependencies between tasks
• Caching
  – Maintain a close copy of frequently accessed data
  – Avoid long accesses or expensive recomputation (memoization)
  – Think about associativity, replacement policy, block size
• Indirection (e.g., virtual→physical address translation)
  – Allows transparent relocation, sharing, and protection
• Overlapping (e.g., CPU work and DMA access)
  – Hide the cost of expensive tasks by executing them in parallel with other useful work
• Other
  – Amdahl’s law: make the common case fast (worked out below)
  – Amortization, parallel processing, memoization, speculation
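Amdahl’s law can be stated in one line: speeding up a fraction f of the execution time by a factor s gives an overall speedup of 1 / ((1 - f) + f / s). A tiny C illustration:

#include <stdio.h>

/* Amdahl's law: the unimproved fraction (1 - f) bounds the overall
 * speedup, which is why making the common case fast matters most. */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    /* Speeding up 80% of a program by 10x yields only ~3.57x overall. */
    printf("%.2f\n", amdahl(0.80, 10.0));
    return 0;
}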

The End

After EE108B

[Figure: course map. EE108B leads to CS140 (Operating Systems), EE109 (Digital Systems III), EE282 (Systems Architecture), EE271 (VLSI Design), EE382 A/B/C (Advanced Computer Architecture), and CS143 (Compilers).]


EE282 Spring 07: Computer Systems Architecture

• Goal: learn how to build and use efficient systems

• Topics:
  – Advanced memory hierarchies
  – Advanced I/O and interactions with the OS
  – Cluster architecture and programming
  – Virtual machines
  – Reliable systems
  – Energy-efficient systems

• Programming assignments:
  – Efficient programming for uniprocessors
  – Efficient programming for clusters

Looking Forward

End of Uniprocessor Performance

[Figure: single-processor performance relative to the VAX-11/780 on a log scale (1 to 10,000), 1978–2006, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th ed. Growth slows markedly in recent years; one marked trend line is 25%/year. “The free ride is over!”]

The World has Changed

• Process technology stops improving
  – Moore’s law continues, but …
  – Transistors don’t get faster, and they leak more (65nm vs. 45nm)

[Figure from Intel Developer Forum, September 2004]

The Era of Single-Chip Multiprocessors

• Single-chip multiprocessors provide a scalable alternative
  – They rely on scalable forms of parallelism
    • Request-level parallelism
    • Data-level parallelism
  – Modular design with inherent fault tolerance and a good match to VLSI technology
• Single-chip multiprocessor systems are here
  – All processor vendors are following this approach
  – In embedded, server, and even desktop systems
• How do we architect CMPs to best exploit thread-level parallelism?
  – Server applications: throughput
  – General-purpose and scientific applications: latency

CMP Options

[Figure: (a) a conventional microprocessor: one CPU core with registers, L1 I$/D$, and an L2 cache above main memory and I/O. (b) a simple chip multiprocessor: CPU cores 1..N, each with its own registers, L1 I$/D$, and L2 cache, sharing main memory and I/O, e.g., AMD dual-core, Intel Paxville.]

CMP Options (cont.)

[Figure: (c) a shared-cache chip multiprocessor: cores 1..N, each with registers and L1 I$/D$, sharing a single L2 cache above main memory and I/O. (d) a multithreaded, shared-cache chip multiprocessor: each core holds several register sets (hardware threads), with per-core L1 caches and a shared L2 cache.]

Multiprocessor Questions

• How do parallel processors share data?
  – A single address space (SMP, NUMA)
  – Message passing (clusters, massively parallel processors (MPPs))
• How do parallel processors coordinate? (see the sketch below)
  – Software synchronization (locks, semaphores)
  – Primitives built into send/receive, OS primitives (sockets)
• How are parallel processors interconnected?
  – Connected by a single bus
  – Connected by a network
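A minimal illustration of the lock-based coordination bullet, using POSIX threads (not covered in the slides): two threads in one address space serialize their updates to a shared counter.

#include <pthread.h>

/* Software synchronization with a lock: without it, the two threads'
 * increments race and the final count is unpredictable. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
    return 0;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, 0, worker, 0);
    pthread_create(&t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return counter == 2000000 ? 0 : 1;  /* with the lock, always 2000000 */
}

Compile with -pthread; the same coordination could instead use semaphores or message-passing primitives, as the slide notes.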
