Announcements

• TBD

EE108B Lecture 18: Buses and OS, Review, & What’s Next

Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b


Review: Buses

• A bus is a shared communication link that connects multiple devices
  – A single set of wires connects multiple “subsystems,” as opposed to a point-to-point link, which connects only two components
  – Wires connect in parallel, so a 32-bit bus has 32 wires of data

[Figure: memory and several I/O devices sharing a single bus]

Review: Trends for Buses: Logical and Physical Switch

• Serial point-to-point advantages
  – Faster links
  – Fewer chip package pins
  – Higher performance
  – A switch keeps arbitration on chip
• Many bus standards are moving to serial, point-to-point links
  – 3GIO / PCI-Express (replacing PCI)
  – Serial ATA (replacing the parallel IDE hard-disk interface)
  – AMD HyperTransport versus the front-side bus (FSB)

[Figure: devices D1–D4 on a shared parallel bus with control lines, versus D1–D4 connected through a 4-port switch]

Review: OS Communication

• The operating system should prevent user programs from communicating with I/O devices directly
  – Must protect I/O resources to keep sharing fair
  – Protection of shared I/O resources cannot be provided if user programs could perform I/O directly
• Three types of communication are required:
  – The OS must be able to give commands to I/O devices
  – An I/O device must be able to notify the OS when it has completed an operation or has encountered an error
  – Data must be transferred between memory and an I/O device

Review: A Method for Addressing a Device

• Memory-mapped I/O:
  – Portions of the address space are assigned to each I/O device
  – I/O addresses correspond to device registers
  – User programs are prevented from issuing I/O operations directly, since the I/O address space is protected by the address translation mechanism

[Figure: physical address map from 0x00000000 to 0xFFFFFFFF, divided into OS memory, general RAM, graphics memory, and I/O devices #1–#3]
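To make memory-mapped I/O concrete, here is a minimal C sketch for a hypothetical device; the base address, register offsets, and accessor names are assumptions for illustration, not details from the lecture.

#include <stdint.h>

/* Hypothetical device registers mapped into the physical address
 * space; the base address and layout are made up for this sketch. */
#define DEV_BASE    0xFFFF0000u
#define DEV_STATUS  (*(volatile uint32_t *)(DEV_BASE + 0x0))
#define DEV_DATA    (*(volatile uint32_t *)(DEV_BASE + 0x4))

/* An ordinary load or store to a mapped address becomes a device-
 * register access; 'volatile' keeps the compiler from caching or
 * reordering the accesses.  Because these pages are mapped only for
 * the OS, address translation keeps user programs from touching them. */
uint32_t read_device_status(void)      { return DEV_STATUS; }
void     write_device_data(uint32_t v) { DEV_DATA = v; }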


Communicating with the CPU

• Method #1: Polling
  – The I/O device places information in a status register
  – The OS periodically checks the status register
  – Whether polling is used often depends on whether the device can initiate I/O independently
    • For instance, a mouse works well since it has a fixed I/O rate and initiates its own data (whenever it is moved)
    • For others, such as disk access, I/O occurs only under the control of the OS, so we poll only when the OS knows the device is active
  – Advantages
    • Simple to implement
    • The processor is in control and does the work
  – Disadvantage
    • Polling overhead and data transfer consume CPU time

Polling and Programmed I/O

[Figure: flowcharts for polled input and output. Input: loop on “is the device ready?”; when yes, read data from the device and store it to memory; repeat until done. Output: loop on “is the device ready?”; when yes, write data to the device; repeat until done. The busy-wait loop is not an efficient way to use the CPU unless the device is very fast, and the processor may be an inefficient way to move the data.]
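The input flowchart translates almost line-for-line into C. This is a sketch assuming hypothetical status/data registers (P_STATUS, P_DATA) and a READY bit:

#include <stddef.h>
#include <stdint.h>

/* Polled ("programmed") input, following the input flowchart above.
 * Register addresses and the READY bit are invented for this sketch. */
#define P_STATUS (*(volatile uint32_t *)0xFFFF1000u)
#define P_DATA   (*(volatile uint32_t *)0xFFFF1004u)
#define P_READY  0x1u

void polled_read(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        while ((P_STATUS & P_READY) == 0)
            ;                        /* busy-wait: CPU does no useful work */
        buf[i] = (uint8_t)P_DATA;    /* read from device, store to memory */
    }
}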

I/O Notification (cont.)

• Method #2: I/O interrupt
  – Whenever an I/O device needs attention from the processor, it interrupts the processor
  – The interrupt must tell the OS about the event and which device raised it
    • Using “cause” register(s): the kernel “asks” what interrupted
    • Using vectored interrupts: a different exception handler for each source
    • Example: the Intel 80x86 has 256 vectored interrupts
  – I/O interrupts are asynchronous events and can happen at any time
    • The processor waits until the current instruction is completed
  – Interrupts may have different priorities
    • Ex.: network = high priority, keyboard = low priority
  – Advantage: execution is halted only during the actual transfer
  – Disadvantage: software overhead of interrupt processing

I/O Interrupts

[Figure: interrupt handling. (1) An I/O interrupt arrives while a user program executes (add, sub, and, or, nop). (2) The processor saves state and consults the exception vector table in memory, which contains the PCs of the ISRs for sources such as arithmetic exception, page fault, address error, unaligned access, floating-point exception, timer interrupt, network, disk drive, and keyboard. (3) Control jumps to the interrupt service routine (read, store, ..., rti). (4) State is restored and execution returns to the program.]
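A rough C sketch of vectored dispatch, step (3) of the figure. The table layout, vector numbers, and handler names are invented; only the 256-entry count for the 80x86 comes from the slide:

#include <stdint.h>

/* One handler per interrupt source, indexed by vector number.  On a
 * real machine the table layout is fixed by the ISA; this is a sketch. */
#define NUM_VECTORS 256

typedef void (*isr_t)(void);

static isr_t vector_table[NUM_VECTORS];

static void disk_isr(void)     { /* service the disk controller */ }
static void keyboard_isr(void) { /* service the keyboard        */ }

/* What hardware plus the low-level handler conceptually do:
 * (1) finish the current instruction, (2) save state, (3) jump to the
 * ISR whose address is stored in the table, (4) restore and return. */
void dispatch_interrupt(unsigned vector)
{
    if (vector < NUM_VECTORS && vector_table[vector] != 0)
        vector_table[vector]();   /* step (3): run the service routine */
}

void install_handlers(void)
{
    vector_table[14] = disk_isr;      /* vector numbers are made up */
    vector_table[33] = keyboard_isr;
}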


Data Transfer

• The third component of I/O communication is the transfer of data from the I/O device to memory (or vice versa)
• Simple approach: “programmed” I/O
  – Software on the processor moves all data between memory addresses and I/O addresses
  – Simple and flexible, but wastes CPU time
  – Also causes lots of excess data movement in modern systems
    • Ex.: Mem --> NB --> CPU --> NB --> graphics, when we want Mem --> NB --> graphics
• So we need a solution that allows data transfer to happen without the processor’s involvement

Delegating I/O: DMA

• Direct memory access (DMA)
  – Transfers blocks of data to or from memory without CPU intervention
  – Communication is coordinated by the DMA controller
    • DMA controllers are integrated into memory or I/O controller chips
    • The DMA controller acts as a bus master in bus-based systems
• DMA steps (see the sketch below):
  – The processor sets up the DMA transfer by supplying:
    • The identity of the device and the operation (read/write)
    • The memory address for the source/destination
    • The number of bytes to transfer
  – The DMA controller starts the operation by arbitrating for the bus and then starting the transfer when the data is ready
  – It notifies the processor when the DMA transfer is complete or on error, usually using an interrupt
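A minimal sketch of the setup step in C, assuming an invented register layout for the DMA controller (DMA_DEV, DMA_ADDR, DMA_COUNT, DMA_CTRL):

#include <stdint.h>

/* Hypothetical DMA controller registers; addresses and bit layout are
 * made up to illustrate the setup steps listed above. */
#define DMA_DEV    (*(volatile uint32_t *)0xFFFF2000u) /* device id + read/write */
#define DMA_ADDR   (*(volatile uint32_t *)0xFFFF2004u) /* source/destination address */
#define DMA_COUNT  (*(volatile uint32_t *)0xFFFF2008u) /* bytes to transfer */
#define DMA_CTRL   (*(volatile uint32_t *)0xFFFF200Cu) /* go / interrupt-enable bits */

#define DMA_READ   0x1u
#define DMA_GO     0x1u
#define DMA_IRQ_EN 0x2u

/* The processor's role ends after setup; the controller masters the
 * bus, moves the data, and raises an interrupt on completion or error. */
void dma_start_read(uint32_t device, uint32_t mem_addr, uint32_t nbytes)
{
    DMA_DEV   = device | DMA_READ;
    DMA_ADDR  = mem_addr;
    DMA_COUNT = nbytes;
    DMA_CTRL  = DMA_GO | DMA_IRQ_EN;  /* kick off transfer, interrupt when done */
}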

Reading a Disk Sector (1)

• The CPU initiates a disk read by writing a command, logical block number, and destination memory address to the disk controller’s control registers.

[Figure: a 1 GHz Pentium processor (pipeline, caches) connects through the north bridge to the AGP graphics controller (monitor) and the memory controller; the PCI bus below the bridge connects a USB hub controller (printer, mouse, keyboard), an Ethernet controller, and a disk controller with two disks.]

Reading a Disk Sector (2)

• The disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.

[Figure: same system diagram; the data flows from the disk controller across the PCI bus and north bridge into main memory.]


Reading a Disk Sector (3)

• When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special “interrupt” pin on the CPU).

[Figure: same system diagram; the interrupt travels from the disk controller to the CPU.]

DMA Problems: Virtual vs. Physical Addresses

• If DMA uses physical addresses
  – A memory access that crosses a physical page boundary may not correspond to contiguous virtual pages (or even to the same application!)
• Solution 1: ≤ 1 page per DMA transfer
• Solution 1+: chain a series of 1-page requests provided by the OS (see the sketch below)
  – Single interrupt at the end of the last DMA request in the chain
• Solution 2: the DMA engine uses virtual addresses
  – Multi-page DMA requests are now easy
  – A TLB is necessary for the DMA engine
• For DMA with physical addresses, pages must be pinned in DRAM
  – The OS should not page out to disk any pages involved in pending I/O
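A sketch of “Solution 1+” as the OS might build it, assuming a hypothetical descriptor format and a virt_to_phys() stand-in for the page-table lookup:

#include <stdint.h>

/* A chain of one-page DMA descriptors built by the OS from the
 * physical pages behind a virtual buffer.  A real controller defines
 * its own descriptor format, and the OS must also pin the pages in
 * DRAM while the I/O is pending. */
#define PAGE_SIZE 4096u

struct dma_desc {
    uint32_t phys_addr;      /* physical address of this piece */
    uint32_t nbytes;         /* at most one page */
    struct dma_desc *next;   /* 0 terminates the chain */
    uint32_t irq_on_done;    /* set only on the last descriptor */
};

extern uint32_t virt_to_phys(void *va);  /* stand-in for a page-table walk */

/* Assumes buf is page-aligned and d[] has enough entries; the
 * controller walks the chain and raises one interrupt at the end. */
void build_chain(struct dma_desc *d, void *buf, uint32_t len)
{
    if (len == 0)
        return;
    for (uint32_t done = 0; done < len; d++) {
        uint32_t chunk = (len - done < PAGE_SIZE) ? len - done : PAGE_SIZE;
        d->phys_addr   = virt_to_phys((char *)buf + done);
        d->nbytes      = chunk;
        d->irq_on_done = 0;
        done += chunk;
        d->next = (done < len) ? d + 1 : 0;
    }
    /* d now points one past the last descriptor filled in */
    (d - 1)->irq_on_done = 1;   /* single interrupt for the whole chain */
}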

DMA Problems: Cache Coherence

• A copy of the data involved in a DMA transfer may reside in the processor cache
  – If memory is updated: must update or invalidate the “old” cached copy
  – If memory is read: must read the latest value, which may be in the cache
    • Only a problem with write-back caches
• This is called the “cache coherence” problem
  – The same problem appears in multiprocessor systems
• Solution 1: the OS flushes the cache before I/O reads or forces write-backs before I/O writes (sketched below)
  – The flush/write-back may involve selected addresses or the whole cache
  – Can be done in software or with hardware (ISA) support
• Solution 2: route memory accesses for I/O through the cache
  – Search the cache for copies and invalidate or write back as needed
  – This hardware solution may hurt performance
    • While the cache is being searched for I/O requests, it is not available to the processor
    • Multi-level, inclusive caches make this easier
      – The processor searches mostly the L1 cache (until it misses)
      – I/O requests search mostly the L2 cache (until they find a copy of interest)

I/O Summary

• I/O performance has to take into account many variables
  – Response time and throughput
  – CPU, memory, bus, I/O device
  – Workload, e.g., transaction processing
• I/O devices also span a wide spectrum
  – Disks, graphics, and networks
• Buses
  – Bandwidth
  – Synchronization
  – Transactions
• OS and I/O
  – Communication: polling and interrupts
  – Handling I/O outside the CPU: DMA
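Returning to the cache-coherence slide above: a minimal sketch of Solution 1, assuming hypothetical cache-maintenance primitives (cache_invalidate_range, cache_writeback_range) that the OS would call before starting a DMA transfer:

/* Before the device writes the buffer (I/O read), drop any cached
 * copies so later loads see the DMA'd data; before the device reads
 * the buffer (I/O write), push dirty lines back so memory holds the
 * latest values.  Both primitives are assumed OS/ISA helpers. */
extern void cache_invalidate_range(void *addr, unsigned len);
extern void cache_writeback_range(void *addr, unsigned len);

void dma_prepare(void *buf, unsigned len, int device_writes_memory)
{
    if (device_writes_memory)
        cache_invalidate_range(buf, len);  /* drop stale copies */
    else
        cache_writeback_range(buf, len);   /* memory gets latest values */
}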

EE108B Quick Review (1)

• What we learned in EE108b
  – How to build programmable systems
    • Processor, memory system, I/O system, interactions with the OS and compiler
  – Processor design
    • ISA, single-cycle design, pipelined design, ...
  – The basics of computer system design
    • Levels of abstraction, pipelining, caching, address indirection, DMA, ...
  – Understanding why your programs sometimes run slowly
    • Pipeline stalls, cache misses, page faults, I/O accesses, ...

EE108B Quick Review (2): Major Lessons to Take Away

• Levels of abstraction (e.g., ISA→processor→RTL blocks→gates)
  – Simplifies the design process for complex systems
  – Need a good specification for each level
• Pipelining
  – Improve the throughput of an engine by overlapping tasks
  – Watch out for dependencies between tasks
• Caching
  – Maintain a close copy of frequently accessed data
  – Avoid long accesses or expensive recomputation (memoization)
  – Think about associativity, replacement policy, block size
• Indirection (e.g., virtual→physical address translation)
  – Allows transparent relocation, sharing, and protection
• Overlapping (e.g., CPU work and DMA access)
  – Hide the cost of expensive tasks by executing them in parallel with other useful work
• Other
  – Amdahl’s law: make the common case fast (worked out below)
  – Amortization, parallel processing, memoization, speculation
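Amdahl’s law can be stated in one line: speeding up a fraction f of the execution time by a factor s gives an overall speedup of 1 / ((1 - f) + f / s). A tiny C illustration:

#include <stdio.h>

/* Amdahl's law: the unimproved fraction (1 - f) bounds the overall
 * speedup, which is why making the common case fast matters most. */
static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    /* Speeding up 80% of a program by 10x yields only ~3.57x overall. */
    printf("%.2f\n", amdahl(0.80, 10.0));
    return 0;
}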

The End

After EE108B

[Figure: course map. EE108B leads to CS140 (Operating Systems), EE109 (Digital Systems III), EE282 (Systems Architecture), EE271 (VLSI Design), EE382 A/B/C (Advanced Computer Architecture), and CS143 (Compilers).]


EE282 Spring 07: Computer Systems Architecture

• Goal: learn how to build and use efficient systems

• Topics:
  – Advanced memory hierarchies
  – Advanced I/O and interactions with the OS
  – Cluster architecture and programming
  – Virtual machines
  – Reliable systems
  – Energy-efficient systems

• Programming assignments:
  – Efficient programming for uniprocessors
  – Efficient programming for clusters

Looking Forward

End of Uniprocessor Performance

[Figure: single-processor performance relative to the VAX-11/780 on a log scale (1 to 10,000), 1978–2006, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th ed. Growth slows markedly in recent years; one marked trend line is 25%/year. “The free ride is over!”]

The World has Changed

• Process technology stops improving
  – Moore’s law continues, but …
  – Transistors don’t get faster, and they leak more (65nm vs. 45nm)

[Figure from Intel Developer Forum, September 2004]

The Era of Single-Chip Multiprocessors

• Single-chip multiprocessors provide a scalable alternative
  – They rely on scalable forms of parallelism
    • Request-level parallelism
    • Data-level parallelism
  – Modular design with inherent fault tolerance and a good match to VLSI technology
• Single-chip multiprocessor systems are here
  – All processor vendors are following this approach
  – In embedded, server, and even desktop systems
• How do we architect CMPs to best exploit thread-level parallelism?
  – Server applications: throughput
  – General-purpose and scientific applications: latency

CMP Options

[Figure: (a) a conventional microprocessor: one CPU core with registers, L1 I$/D$, and an L2 cache above main memory and I/O. (b) a simple chip multiprocessor: CPU cores 1..N, each with its own registers, L1 I$/D$, and L2 cache, sharing main memory and I/O, e.g., AMD dual-core, Intel Paxville.]

CMP Options (cont.)

[Figure: (c) a shared-cache chip multiprocessor: cores 1..N, each with registers and L1 I$/D$, sharing a single L2 cache above main memory and I/O. (d) a multithreaded, shared-cache chip multiprocessor: each core holds several register sets (hardware threads), with per-core L1 caches and a shared L2 cache.]

Multiprocessor Questions

• How do parallel processors share data?
  – A single address space (SMP, NUMA)
  – Message passing (clusters, massively parallel processors (MPPs))
• How do parallel processors coordinate? (see the sketch below)
  – Software synchronization (locks, semaphores)
  – Primitives built into send/receive, OS primitives (sockets)
• How are parallel processors interconnected?
  – Connected by a single bus
  – Connected by a network
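A minimal illustration of the lock-based coordination bullet, using POSIX threads (not covered in the slides): two threads in one address space serialize their updates to a shared counter.

#include <pthread.h>

/* Software synchronization with a lock: without it, the two threads'
 * increments race and the final count is unpredictable. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
    return 0;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, 0, worker, 0);
    pthread_create(&t2, 0, worker, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return counter == 2000000 ? 0 : 1;  /* with the lock, always 2000000 */
}

Compile with -pthread; the same coordination could instead use semaphores or message-passing primitives, as the slide notes.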
