
The System Bus
• System bus: connects system components together
  • Important: insufficient bandwidth can bottleneck entire system
• Performance factors
  • Physical length
  • Number and type of connected devices (taps)
[Figure: system bus diagram; labels: CPU ($), “System” (memory-I/O) bus, DMA, DMA, I/O ctrl, Main Memory, kbd, Disk, display, NIC]

© 2008 Daniel J. Sorin from Roth ECE 152

Three Buses
• Processor-memory bus
  • Connects CPU and memory, no direct I/O interface
  + Short, few taps → fast, high-bandwidth
  – System specific
• I/O bus
  • Connects I/O devices, no direct P-M interface
  – Longer, more taps → slower, lower-bandwidth
  + Industry standard
  • Connect P-M bus to I/O bus using adapter
• Backplane bus
  • CPU, memory, I/O connected to same bus
  + Industry standard, cheap (no adapters needed)
  – Processor-memory performance compromised
[Figure: two bus organizations; labels: CPU, Mem, Proc-Mem, adapter, I/O; CPU, Mem, Backplane, I/O]

Bus Design
[Figure: bus wires; labels: data lines, address lines, control lines]
• Goals
  • High performance: low latency and high bandwidth
  • Standardization: flexibility in dealing with many devices
  • Low cost
  • Processor-memory bus emphasizes performance, then cost
  • I/O & backplane emphasize standardization, then performance
• Design issues
  1. Width/multiplexing: are wires shared or separate?
  2. Clocking: is bus clocked or not?
  3. Switching: how/when is bus control acquired and released?
  4. Arbitration: how do we decide who gets the bus next?

(1) Bus Width and Multiplexing
• Wider
  + More bandwidth
  – More expensive and more susceptible to skew
• Multiplexed: address and data share same lines
  + Cheaper
  – Less bandwidth
• Burst transfers
  • Multiple sequential data transactions for single address
  + Increase bandwidth at relatively little cost
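The width, multiplexing, and burst trade-offs can be made concrete with a rough back-of-the-envelope model. The sketch below is not from the slides: it assumes a hypothetical 32-bit bus clocked at 33 MHz, and it assumes that a multiplexed bus spends one cycle on the address before the data cycles of a transaction. The function names are illustrative only.

```python
def peak_bandwidth_mb_s(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth in MB/s if one bus-width word moves every cycle."""
    return (width_bits / 8) * clock_mhz


def burst_efficiency(data_cycles: int, address_cycles: int = 1) -> float:
    """Fraction of bus cycles that carry data on a multiplexed bus.

    Assumed model: each transaction uses `address_cycles` cycles on the
    shared lines, followed by `data_cycles` back-to-back data cycles.
    """
    return data_cycles / (address_cycles + data_cycles)


if __name__ == "__main__":
    # Hypothetical bus parameters, chosen only for illustration.
    peak = peak_bandwidth_mb_s(width_bits=32, clock_mhz=33.0)
    for burst_len in (1, 4, 8):
        eff = burst_efficiency(burst_len)
        print(f"burst length {burst_len}: {eff:.0%} of peak, "
              f"about {peak * eff:.0f} MB/s")
```

Under these assumptions, a longer burst amortizes the address cycle over more data cycles, which is one way to read "increase bandwidth at relatively little cost".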

(2) Bus Clocking
• Synchronous: clocked
  + Fast
  – Bus must be short to minimize clock skew
• Asynchronous: un-clocked
  + Can be longer: no clock skew, deals with devices of different speeds
  – Slower: requires “hand-shaking” protocol
  • For example, asynchronous read
    • Multiplexed data/address lines, 3 control lines
    1. Processor drives address onto bus, asserts Request line
    2. Memory asserts Ack line, processor stops driving
    3. Memory drives data on bus, asserts DataReady line
    4. Processor asserts Ack line, memory stops driving
• P-M buses are synchronous
• I/O and backplane buses asynchronous or slow-clock synchronous

(3) Bus Switching
• Atomic: bus “busy” between request and reply
  + Simple
  – Low utilization
• Split-transaction: requests/replies can be interleaved
  + Higher utilization → higher throughput
  – Complex, requires sending IDs to match replies to requests

(4) Bus Arbitration
• Bus master: component that can initiate a bus request
  • Bus typically has several masters, including processor
  • I/O devices can also be masters (Why? See in a bit)
• Arbitration: choosing a master among multiple requests
  • Try to implement priority and fairness (no device “starves”)
• Daisy-chain: devices connect to bus in priority order
  • High-priority devices intercept/deny requests by low-priority ones
  ± Simple, but slow and can’t ensure fairness
• Centralized: special arbiter chip collects requests, decides
  ± Ensures fairness, but arbiter chip may itself become bottleneck
• Distributed: everyone sees all requests simultaneously
  • Back off and retry if not the highest priority request
  ± No bottlenecks and fair, but needs a lot of control lines

Standard Bus Examples

                  PCI             SCSI           USB
Type              Backplane       I/O            I/O
Width             32–64 bits      8–32 bits      1 bit
Multiplexed?      Yes             Yes            Yes
Clocking          33 (66) MHz     5 (10) MHz     Asynchronous
Data rate         133 (266) MB/s  10 (20) MB/s   0.2, 1.5, 80 MB/s
Arbitration       Distributed     Distributed    Daisy-chain
Maximum masters   1024            7–31           127
Maximum length    0.5 m           2.5 m          –

• USB (universal serial bus)
  • Popular for low/moderate bandwidth external peripherals
  + Packetized interface (like TCP), extremely flexible
  + Also supplies power to the peripheral

This Unit: I/O
[Figure: system layers; labels: Application, OS, Compiler, Firmware, CPU, I/O, Memory, Digital Circuits, Gates & Transistors]
• I/O system structure
  • Devices, controllers, and buses
• Device characteristics
  • Disks
• Bus characteristics
• I/O control
  • Polling and interrupts
  • DMA

I/O Control and Interfaces
• Now that we know how I/O devices and buses work…
• How does I/O actually happen?
  • How does CPU give commands to I/O devices?
  • How do I/O devices execute data transfers?
  • How does CPU know when I/O devices are done?

Sending Commands to I/O Devices
• Remember: only OS can do this! Two options …
• I/O instructions
  • OS only? Instructions must be privileged (only OS can execute)
  • E.g., IA-32
• Memory-mapped I/O
  • Portion of physical address space reserved for I/O
  • OS maps physical addresses to I/O device control registers
  • Stores/loads to these addresses are commands to I/O devices
    • Main memory ignores them, I/O devices recognize and respond
    • Address specifies both I/O device and command
    • These addresses are not cached – why?
  • OS only? I/O physical addresses only mapped in OS address space
  • E.g., almost every architecture other than IA-32 (see pattern??)

Querying I/O Device Status
• Now that we’ve sent command to I/O device …
• How do we query I/O device status?
  • So that we know if data we asked for is ready?
  • So that we know if device is ready to receive next command?
• Polling: Ready now? How about now? How about now???
  • Processor queries I/O device status register (e.g., with MM load)
  • Loops until it gets status it wants (ready for next command)
    • Or tries again a little later
  + Simple
  – Waste of processor’s time
    • Processor much faster than I/O device

Polling Overhead: Example #1
• Parameters
  • 500 MHz CPU
  • Polling event takes 400 cycles
• Overhead for polling a mouse 30 times per second?
  • Cycles per second for polling = (30 poll/s) * (400 cycles/poll)
  • → 12000 cycles/second for polling
  • (12000 cycles/second)/(500 M cycles/second) = 0.0024% overhead
  + Not bad

Polling Overhead: Example #2
• Same parameters
  • 500 MHz CPU, polling event takes 400 cycles
• Overhead for polling a 4 MB/s disk with 16 B interface?
  • Must poll often enough not to miss data from disk
  • Polling rate = (4 MB/s)/(16 B/poll) >> mouse polling rate
  • Cycles per second for polling = [(4 MB/s)/(16 B/poll)] * (400 cycles/poll)
  • → 100 M cycles/second for polling
  • (100 M cycles/second)/(500 M cycles/second) = 20% overhead
  – Bad
  • This is the overhead of polling, not actual data transfer
  • Really bad if disk is not being used (pure overhead!)
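The two polling examples use the same arithmetic, so it may help to see them as one small calculation. This is a minimal sketch using the parameters stated in the examples (500 MHz CPU, 400 cycles per poll); the function names are made up for illustration, and 1 MB is assumed to mean 10**6 bytes so that the results match the slide figures.

```python
CPU_HZ = 500e6          # 500 MHz CPU (from the examples)
CYCLES_PER_POLL = 400   # cost of one polling event (from the examples)


def polls_needed(bytes_per_second: float, bytes_per_poll: float) -> float:
    """Minimum polling rate so that no data from the device is missed."""
    return bytes_per_second / bytes_per_poll


def polling_overhead(polls_per_second: float,
                     cycles_per_poll: float = CYCLES_PER_POLL,
                     cpu_hz: float = CPU_HZ) -> float:
    """Fraction of all CPU cycles spent on polling."""
    return polls_per_second * cycles_per_poll / cpu_hz


if __name__ == "__main__":
    mouse = polling_overhead(30)                     # Example #1
    disk = polling_overhead(polls_needed(4e6, 16))   # Example #2, 1 MB = 10**6 B
    print(f"mouse: {mouse:.4%}")
    print(f"disk:  {disk:.0%}")
```

The disk case costs far more only because its required polling rate (250,000 polls per second) is so much higher; the per-poll cost is identical.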

Interrupt-Driven I/O
• Interrupts: alternative to polling
  • I/O device generates interrupt when status changes, data ready
  • OS handles interrupts just like exceptions (e.g., page faults)
  • Identity of interrupting I/O device recorded in ECR
    • ECR: exception cause register
• I/O interrupts are asynchronous
  • Not associated with any one instruction
  • Don’t need to be handled immediately
• I/O interrupts are prioritized
  • Synchronous interrupts (e.g., page faults) have highest priority
  • High-bandwidth I/O devices have higher priority than low-bandwidth ones

Interrupt Overhead
• Parameters
  • 500 MHz CPU
  • Polling event takes 400 cycles
  • Interrupt handler takes 400 cycles
  • Data transfer takes 100 cycles
  • 4 MB/s, 16 B interface disk, transfers data only 5% of time
• Percent of time processor spends transferring data
  • 0.05 * (4 MB/s)/(16 B/xfer) * [(100 c/xfer)/(500M c/s)] = 0.25%
• Overhead for polling?
  • (4 MB/s)/(16 B/poll) * [(400 c/poll)/(500M c/s)] = 20%
• Overhead for interrupts?
  + 0.05 * (4 MB/s)/(16 B/int) * [(400 c/int)/(500M c/s)] = 1%
• Note: when disk is transferring data, the interrupt rate is same as polling rate

Direct Memory Access (DMA)
• Interrupts remove overhead of polling…
  • But still requires OS to transfer data one word at a time
  • OK for low bandwidth I/O devices: mice, microphones, etc.
  • Bad for high bandwidth I/O devices: disks, monitors, etc.
• Direct Memory Access (DMA)
  • Transfer data between I/O and memory without processor control
  • Transfers entire blocks (e.g., pages, video frames) at a time
    • Can use bus “burst” transfer mode if available
  • Only interrupts processor when done (or if error occurs)

DMA Controllers
• To do DMA, I/O device attached to DMA controller
  • Multiple devices can be connected to one DMA controller
  • Controller itself seen as a memory mapped I/O device
    • Processor initializes start memory address, transfer size, etc.
  • DMA controller takes care of bus arbitration and transfer details
    • So that’s why buses support arbitration and multiple masters!
[Figure: DMA system diagram; labels: CPU ($), Bus, DMA, DMA, I/O ctrl, Main Memory, Disk, display, NIC]

I/O Processors
• A DMA controller is a very simple component
  • May be as simple as a FSM with some local memory
• Some I/O requires complicated sequences of transfers
  • I/O processor: heavier DMA controller that executes instructions
    • Can be programmed to do complex transfers
    • E.g., programmable network card
[Figure: I/O processor diagram; labels: CPU ($), Bus, DMA, DMA, IOP, Main Memory, Disk, display, NIC]

DMA Overhead
• Parameters
  • 500 MHz CPU
  • Interrupt handler takes 400 cycles
  • Data transfer takes 100 cycles
  • 4 MB/s, 16 B interface, disk transfers data 50% of time
  • DMA setup takes 1600 cycles, transfers one 16 KB page at a time
• Processor overhead for interrupt-driven I/O?
  • Each 16 B transfer costs 400 (handler) + 100 (transfer) = 500 cycles
  • 0.5 * (4 MB/s)/(16 B/xfer) * [(500 c/xfer)/(500M c/s)] = 12.5%
• Processor overhead with DMA?
  • Processor only gets involved once per page, not once per 16 B
  • Each page costs 1600 (setup) + 400 (completion interrupt) = 2000 cycles
  + 0.5 * (4 MB/s)/(16 KB/page) * [(2000 c/page)/(500M c/s)] = 0.05%
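The DMA Overhead comparison can be reproduced with a few lines of arithmetic. The sketch below uses the parameters from that example; the helper names are hypothetical, and decimal units are assumed (1 MB = 10**6 B, 1 KB = 10**3 B) because that is what makes the slide's 12.5% and 0.05% come out exactly.

```python
CPU_HZ = 500e6            # 500 MHz CPU
HANDLER_CYCLES = 400      # interrupt handler
XFER_CYCLES = 100         # processor moves one 16 B chunk itself
DMA_SETUP_CYCLES = 1600   # processor programs the DMA controller
DISK_BPS = 4e6            # 4 MB/s disk, assuming 1 MB = 10**6 B
CHUNK_B = 16              # 16 B interface
PAGE_B = 16e3             # 16 KB page, assuming 1 KB = 10**3 B


def overhead(events_per_second: float, cycles_per_event: float,
             busy_fraction: float) -> float:
    """Fraction of CPU cycles consumed, scaled by how often the disk is active."""
    return busy_fraction * events_per_second * cycles_per_event / CPU_HZ


def interrupt_driven(busy_fraction: float) -> float:
    """One interrupt plus one processor-driven transfer per 16 B chunk."""
    return overhead(DISK_BPS / CHUNK_B, HANDLER_CYCLES + XFER_CYCLES, busy_fraction)


def dma(busy_fraction: float) -> float:
    """One DMA setup plus one completion interrupt per 16 KB page."""
    return overhead(DISK_BPS / PAGE_B, DMA_SETUP_CYCLES + HANDLER_CYCLES, busy_fraction)


if __name__ == "__main__":
    print(f"interrupt-driven: {interrupt_driven(0.5):.2%}")
    print(f"DMA:              {dma(0.5):.2%}")
```

The gap comes almost entirely from the event rate: the processor is involved once per page rather than once per 16 B, so the per-event cost can be four times higher and the total still drops by a factor of 250.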