EE108b Lecture 18: Buses and OS, CMPs, Review
Kunle Olukotun, Stanford University
Handout #41, Fall 06/07
http://eeclass.stanford.edu/ee108b
Review: Designing an I/O System

• General approach
  – Find the weakest link in the I/O system (consider the CPU, memory system, backplane bus, I/O controllers, and I/O devices)
  – Configure this component to sustain the required throughput
  – Determine the requirements for the rest of the system and configure them to support this throughput

[Figure: example system with a CPU (2M IOPS), memory (1.5M IOPS), memory bus (1M IOPS), I/O bus (0.8M IOPS), and I/O devices (1M IOPS). The I/O bus is the weakest link, so the system sustains at most 0.8M IOPS.]

Review: Buses

• A bus is a shared communication link that connects multiple devices
• A single set of wires connects multiple "subsystems," as opposed to a point-to-point link, which connects only two components
• The wires run in parallel, so a 32-bit bus has 32 wires of data

Review: Trends for Buses

• Serial point-to-point advantages
  – Faster links
  – Fewer chip package pins
  – Higher performance
  – A switch keeps arbitration on chip
• Many bus standards are moving to serial, point-to-point links
  – 3GIO / PCI-Express (PCI)
  – Serial ATA (IDE hard disks)
  – AMD HyperTransport versus the Intel Front Side Bus (FSB)

[Figure: logical bus, physical switch. Four devices D1 through D4 attached to a parallel bus with bus control, versus the same four devices attached to a 4-port switch.]

Review: Modern Pentium 4 I/O

• I/O options:

[Figure: Pentium 4 processor on a system bus (800 MHz, 6.4 GB/sec) to the memory controller hub (north bridge, 82875P), which connects the DDR 400 main-memory DIMMs (3.2 GB/sec), an AGP 8X graphics output (2.1 GB/sec), and CSA 1 Gbit Ethernet (0.266 GB/sec). A 266 MB/sec link leads to the I/O controller hub (south bridge, 82801EB), which connects Serial ATA disks (150 MB/sec), Parallel ATA disk, CD/DVD, and tape (100 MB/sec), AC/97 stereo surround sound (1 MB/sec), USB 2.0 (60 MB/sec), 10/100 Mbit Ethernet (20 MB/sec), and the PCI bus (132 MB/sec).]

Review: OS Communication

• The operating system should prevent user programs from communicating with I/O devices directly
  – It must protect I/O resources to keep sharing fair
  – Protection of shared I/O resources cannot be provided if user programs could perform I/O directly
• Three types of communication are required:
  – The OS must be able to give commands to I/O devices
  – An I/O device must be able to notify the OS when it has completed an operation or has encountered an error
  – Data must be transferred between memory and an I/O device

Review: Polling and Programmed I/O

[Flowchart: the CPU loops on "Is the device ready?"; when it is, the CPU reads data from the device and stores it to memory (or writes data to the device), then repeats until done.]

• A busy-wait loop is not an efficient way to use the CPU unless the device is very fast!
• The processor itself may also be an inefficient way to transfer the data (a minimal polling sketch follows below)
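To make the polling loop concrete, here is a minimal sketch in C. The register names, addresses, and status bits (DEV_STATUS, DEV_DATA, READY_BIT, DONE_BIT) are hypothetical, invented for illustration; they are assumed to be memory-mapped, and real devices differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory-mapped device registers; the addresses and bit
   positions below are invented for illustration only.                  */
#define DEV_STATUS (*(volatile uint32_t *)0xFFFF0000) /* status register */
#define DEV_DATA   (*(volatile uint32_t *)0xFFFF0004) /* data register   */
#define READY_BIT  0x1u  /* a word is available to read                  */
#define DONE_BIT   0x2u  /* the device has no more data to deliver       */

/* Programmed-I/O read: the CPU busy-waits on the status register and
   moves every word to memory itself.                                    */
size_t pio_read(uint32_t *buf, size_t max_words)
{
    size_t n = 0;
    while (n < max_words) {
        uint32_t status = DEV_STATUS;
        if (status & DONE_BIT)
            break;                /* transfer finished                   */
        if (!(status & READY_BIT))
            continue;             /* busy-wait loop: wastes CPU cycles   */
        buf[n++] = DEV_DATA;      /* read from device, store to memory   */
    }
    return n;
}
```

Every iteration of the inner wait burns CPU cycles, and every word crosses the bus under CPU control; this is exactly the inefficiency that motivates delegating the transfer to a DMA controller, described next.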
Delegating I/O: DMA

• Direct memory access (DMA)
  – Transfers blocks of data to or from memory without CPU intervention
  – The communication is coordinated by the DMA controller
    • DMA controllers are integrated in memory or I/O controller chips
    • The DMA controller acts as a bus master in bus-based systems
• DMA steps (sketched in code below)
  – The processor sets up the DMA by supplying:
    • The identity of the device and the operation (read/write)
    • The memory address of the source/destination
    • The number of bytes to transfer
  – The DMA controller starts the operation by arbitrating for the bus and then starting the transfer when the data is ready
  – The controller notifies the processor when the DMA transfer is complete or an error occurs, usually using an interrupt

Reading a Disk Sector

[Figure: a 1 GHz Pentium processor (pipeline and caches) connects through a north bridge to main memory and AGP graphics (monitor); a PCI bus connects a USB hub (printer, mouse, keyboard), an Ethernet controller, and a disk controller with its disks.]

1. The CPU initiates a disk read by writing a command, a logical block number, and a destination memory address to the disk controller's control registers.
2. The disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.
3. When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., it asserts a special "interrupt" pin on the CPU).

DMA Problems: Virtual vs. Physical Addresses

• If DMA uses physical addresses
  – Memory accesses across physical page boundaries may not correspond to contiguous virtual pages (or even to the same application!)
• Solution 1: limit each DMA transfer to at most one page
• Solution 1+: chain a series of one-page requests provided by the OS
  – A single interrupt is raised at the end of the last DMA request in the chain
• Solution 2: the DMA engine uses virtual addresses
  – Multi-page DMA requests are now easy
  – A TLB is necessary for the DMA engine
• For DMA with physical addresses, pages must be pinned in DRAM
  – The OS must not page out to disk any page involved in pending I/O
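As a concrete illustration of the setup step, here is a sketch in C of how a driver might program a disk controller for the DMA read described above. All register names, addresses, and command encodings (DISK_CMD, DISK_LBA, and so on) are hypothetical, invented for this sketch; real controllers have different interfaces.

```c
#include <stdint.h>

/* Hypothetical memory-mapped disk-controller registers (made-up addresses). */
#define DISK_CMD   (*(volatile uint32_t *)0xFFFF1000) /* command: 1 = read   */
#define DISK_LBA   (*(volatile uint32_t *)0xFFFF1004) /* logical block number */
#define DISK_ADDR  (*(volatile uint32_t *)0xFFFF1008) /* destination physical address */
#define DISK_COUNT (*(volatile uint32_t *)0xFFFF100C) /* number of bytes      */
#define CMD_READ   1u

/* Step 1: the CPU supplies the operation, the memory address, and the
   byte count. The destination buffer must be pinned in DRAM, since this
   controller uses physical addresses (see the discussion above).        */
void start_disk_read(uint32_t lba, uint32_t dest_phys, uint32_t nbytes)
{
    DISK_LBA   = lba;
    DISK_ADDR  = dest_phys;
    DISK_COUNT = nbytes;
    DISK_CMD   = CMD_READ;  /* Step 2: the controller arbitrates for the
                               bus and transfers the sector by DMA       */
}

/* Step 3: the controller raises an interrupt when the transfer completes
   or fails; the CPU does useful work in the meantime instead of polling. */
void disk_interrupt_handler(void)
{
    /* acknowledge the device, check for errors, wake the waiting process */
}
```

The point of the sketch is the division of labor: the CPU only writes a few control registers up front and handles one interrupt at the end, while the controller moves the entire block.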
DMA Problems: Cache Coherence

• A copy of the data involved in a DMA transfer may reside in the processor cache
  – If memory is updated: the "old" cached copy must be updated or invalidated
  – If memory is read: the latest value must be read, and it may be in the cache (this is only a problem with write-back caches)
• This is called the "cache coherence" problem
  – The same problem arises in multiprocessor systems
• Solution 1: the OS flushes the cache before I/O reads or forces write-backs before I/O writes
  – The flush/write-back may involve selected addresses or the whole cache
  – It can be done in software or with hardware (ISA) support
• Solution 2: route memory accesses for I/O through the cache
  – Search the cache for copies and invalidate or write back as needed
  – This hardware solution may impact performance negatively
    • While the cache is being searched for I/O requests, it is not available to the processor
  – Multi-level, inclusive caches make this easier
    • The processor searches mostly the L1 cache (until it misses)
    • I/O requests search mostly the L2 cache (until they find a copy of interest)

I/O Summary

• I/O performance has to take many variables into account
  – Response time and throughput
  – CPU, memory, bus, and I/O device
  – Workload, e.g. transaction processing
• I/O devices also span a wide spectrum
  – Disks, graphics, and networks
• Buses
  – Bandwidth
  – Synchronization
  – Transactions
• OS and I/O
  – Communication: polling and interrupts
  – Handling I/O outside the CPU: DMA

End of Uniprocessor Performance

[Figure: uniprocessor performance relative to the VAX-11/780, 1978 to 2006, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006: growth of roughly 25%/year until 1986, 52%/year from 1986 to 2002, and under 20%/year since.]

The World Has Changed

• Process technology stops improving
  – Moore's law continues, but transistors don't get faster and they leak more (65nm vs. 45nm)
  – Wires are much worse
• Single-thread performance has reached a plateau
  – Design and verification complexity is overwhelming
  – Power consumption is increasing dramatically
  – Instruction-level parallelism (ILP) has been mined out
• The free ride is over!

(From Intel Developer Forum, September 2004)

The Era of Single-Chip Multiprocessors

• Single-chip multiprocessors provide a scalable alternative
  – They rely on scalable forms of parallelism
    • Request-level parallelism
    • Data-level parallelism
  – Modular design with inherent fault tolerance and a good match to VLSI technology
• Single-chip multiprocessor systems are here
  – All processor vendors are following this approach
  – In embedded, server, and even desktop systems
• How do we architect CMPs to best exploit thread-level parallelism?
  – Server applications: throughput
  – General-purpose and scientific applications: latency

CMP Options

a) Conventional microprocessor: a single CPU core (registers, L1 I$ and D$) with an L2 cache, main memory, and I/O
b) Simple chip multiprocessor: N cores, each with its own registers, private L1 I$/D$, and private L2 cache, sharing main memory and I/O (e.g., the AMD dual-core Opteron and Intel Paxville)
c) Shared-cache chip multiprocessor: N cores with private L1 caches sharing a single L2 cache
d) Multithreaded, shared-cache chip multiprocessor: like (c), but each core holds several register sets so it can run multiple threads

Multiprocessor Questions

• How do parallel processors share data?
  – A single address space (SMP, NUMA)
  – Message passing (clusters, massively parallel processors (MPPs))
• How do parallel processors coordinate? (see the lock sketch below)
  – Software synchronization (locks, semaphores)
  – Synchronization built into send/receive primitives
  – OS primitives (sockets)
• How are parallel processors interconnected?
  – Connected by a single bus
  – Connected by a network
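As a concrete instance of the first coordination option, here is a minimal sketch of lock-based software synchronization using POSIX threads. The shared counter, thread count, and iteration count are illustrative only, not from the lecture.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    100000

static long counter = 0;    /* shared data: one address space for all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter under a lock, so updates
   from different processors never interleave mid-update.                */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);   /* software synchronization */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* 400000 with correct locking */
    return 0;
}
```

Without the lock, the final count would typically fall short of NTHREADS * ITERS, because increments from different cores would race on the shared variable; this is the software face of the same coherence and synchronization issues the hardware slides discuss.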