K. Olukotun Handout #41 Fall 06/07 EE108b

EE108B Lecture 18: Buses and OS, CMPs, Review
Kunle Olukotun, Stanford University
http://eeclass.stanford.edu/ee108b

Review: Designing an I/O System

• General approach
  – Find the weakest link in the I/O system (consider things such as the CPU, memory system, backplane, I/O controllers, and I/O devices)
  – Configure this component to sustain the required throughput
  – Determine the requirements for the rest of the system and configure them to support this throughput

[Figure: example I/O system with CPU, memory bus, memory, I/O bus, and I/O devices; the component throughputs shown are 2M, 1.5M, 1M, 0.8M, and 1M IOPS, so the sustained rate is limited by the slowest link]
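The weakest-link rule is just a minimum over the component rates. The sketch below is illustrative only; the IOPS figures echo the example figure, and `sustained_iops` is an invented name:

```c
#include <stddef.h>

/* Sustained I/O throughput is bounded by the slowest component on the
 * path: CPU, memory bus, memory, I/O bus, I/O devices. */
double sustained_iops(const double *component_iops, size_t n) {
    double min = component_iops[0];
    for (size_t i = 1; i < n; i++)
        if (component_iops[i] < min)
            min = component_iops[i];
    return min;
}
```

Once the bottleneck is identified, the other components only need to be configured to keep up with that rate.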


Review: Buses

• A bus is a shared communication link that connects multiple devices
  – A single set of wires connects multiple “subsystems,” as opposed to a point-to-point link, which connects only two components
  – The wires connect in parallel, so a 32-bit bus has 32 wires of data

[Figure: a processor, memory, and three I/O devices sharing one parallel bus with control lines]

Review: Trends for Buses — Logical Bus and Physical Switch

• Serial point-to-point advantages
  – Faster links
  – Fewer chip package pins
  – Higher performance
  – A switch keeps arbitration on chip
• Many bus standards are moving to serial, point-to-point links
  – 3GIO / PCI-Express (PCI)
  – Serial ATA (IDE hard disk)
  – AMD HyperTransport versus the Front Side Bus (FSB)

[Figure: devices D1–D4 on a logical bus implemented as a 4-port switch]

K. Olukotun EE108b Lecture 18 3 K. Olukotun D4 EE108b Lecture 18 4


Review: Modern Pentium 4 I/O

• I/O options

[Figure: Pentium 4 processor (800 MHz, 6.4 GB/sec front-side bus) attached to the memory controller hub (north bridge, 82875P), which connects DDR 400 main-memory DIMMs (3.2 GB/sec), AGP 8X graphics output (2.1 GB/sec), and CSA 1 Gbit Ethernet (0.266 GB/sec); a 266 MB/sec hub link leads to the I/O controller hub (south bridge, 82801EB), which provides Serial ATA disks (150 MB/sec), Parallel ATA disk, CD/DVD, and tape (100 MB/sec), AC/97 surround-sound audio (1 MB/sec), USB 2.0 (60 MB/sec), 10/100 Mbit Ethernet (20 MB/sec), and a PCI bus (132 MB/sec)]

Review: OS Communication

• The operating system should prevent user programs from communicating with I/O devices directly
  – It must protect I/O resources to keep sharing fair
  – Protection of shared I/O resources cannot be provided if user programs could perform I/O directly
• Three types of communication are required:
  – The OS must be able to give commands to the I/O devices
  – An I/O device must be able to notify the OS when it has completed an operation or has encountered an error
  – Data must be transferred between memory and an I/O device


Review: Polling and Programmed I/O

[Flowchart: for input, the CPU loops on “Is the device ready?” and, once ready, reads data from the device and stores it to memory until done; for output, it loops until the device is ready and then writes data to the device until done]

• A busy-wait loop is not an efficient way to use the CPU unless the device is very fast!
• The processor may be an inefficient way to transfer data

Delegating I/O: DMA

• Direct memory access (DMA)
  – Transfers blocks of data to or from memory without CPU intervention
  – Communication is coordinated by the DMA controller
    • DMA controllers are integrated in memory or I/O controller chips
  – The DMA controller acts as a bus master in bus-based systems
• DMA steps
  – The processor sets up the DMA transfer by supplying:
    • The identity of the device and the operation (read/write)
    • The memory address for the source/destination
    • The number of bytes to transfer
  – The DMA controller starts the operation by arbitrating for the bus and then starting the transfer when the data is ready
  – It notifies the processor when the DMA transfer is complete or on an error
    • Usually using an interrupt
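The polling loop in the flowchart can be sketched as follows. `STATUS_READY`, `dev_status`, and `dev_data` are hypothetical stand-ins for a device's memory-mapped status and data registers, simulated here with plain variables so the sketch is self-contained (the simulated device is always ready, so the spin loop falls through immediately):

```c
#include <stddef.h>

#define STATUS_READY 0x1   /* hypothetical "data ready" bit in the status register */

/* Simulated device registers; on real hardware these would be volatile
 * pointers into memory-mapped I/O space. */
unsigned dev_status = STATUS_READY;
unsigned char dev_data = 0;

unsigned char read_device_byte(void) {
    /* Busy-wait: burn CPU cycles until the device reports ready.
     * This spin is exactly why polling wastes CPU time on slow devices. */
    while (!(dev_status & STATUS_READY))
        ;
    return dev_data;                        /* read the data register */
}

/* Programmed I/O: the CPU itself copies every byte into memory. */
void polled_input(unsigned char *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dev_data = (unsigned char)(i + 1);  /* simulate the device producing data */
        buf[i] = read_device_byte();        /* CPU moves each byte to memory */
    }
}
```

With DMA, the per-byte copy loop moves out of the CPU and into the controller, leaving the processor free until the completion interrupt.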



Reading a Disk Sector (1)

• The CPU initiates a disk read by writing a command, logical block number, and destination memory address to the disk controller's control registers.

Reading a Disk Sector (2)

• The disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.

[Figure: 1 GHz Pentium processor (pipeline, caches) connected via the north bridge to memory and an AGP graphics controller/monitor; the PCI bus links a USB hub (printer, mouse, keyboard), an Ethernet controller, and a disk controller with two disks]


Reading a Disk Sector (3)

• When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., it asserts a special “interrupt” pin on the CPU).

[Figure: same 1 GHz Pentium system as above]

DMA Problems: Virtual vs. Physical Addresses

• If DMA uses physical addresses
  – A memory access across a physical page boundary may not correspond to contiguous virtual pages (or even to the same application!)
• Solution 1: ≤ 1 page per DMA transfer
• Solution 1+: chain a series of 1-page requests provided by the OS
  – Single interrupt at the end of the last DMA request in the chain
• Solution 2: the DMA engine uses virtual addresses
  – Multi-page DMA requests are now easy
  – A TLB is necessary for the DMA engine
• For DMA with physical addresses, pages must be pinned in DRAM
  – The OS should not page out to disk any pages involved with pending I/O
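Solution 1+ amounts to splitting a virtual buffer into page-bounded chunks and translating each one. The descriptor layout and the toy `virt_to_phys` mapping below are hypothetical, not any particular controller's format:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Hypothetical DMA descriptor: one physically contiguous chunk per entry,
 * in the spirit of a scatter-gather engine's chain. */
struct dma_desc {
    uint64_t phys_addr;  /* physical address of this chunk */
    uint32_t len;        /* bytes in this chunk; never crosses a page */
};

/* Toy stand-in for the OS page-table walk: pretend virtual page v
 * maps to physical page v + 100. */
uint64_t virt_to_phys(uint64_t vaddr) {
    return ((vaddr / PAGE_SIZE) + 100) * PAGE_SIZE + (vaddr % PAGE_SIZE);
}

/* Split [vaddr, vaddr+len) into page-bounded descriptors ("Solution 1+").
 * Returns the number of descriptors written. */
size_t build_dma_chain(uint64_t vaddr, uint32_t len,
                       struct dma_desc *chain, size_t max) {
    size_t n = 0;
    while (len > 0 && n < max) {
        uint32_t in_page = PAGE_SIZE - (uint32_t)(vaddr % PAGE_SIZE);
        uint32_t chunk = len < in_page ? len : in_page;
        chain[n].phys_addr = virt_to_phys(vaddr);  /* translate per chunk */
        chain[n].len = chunk;
        vaddr += chunk;
        len -= chunk;
        n++;
    }
    return n;
}
```

The controller then walks the chain and raises a single interrupt after the last descriptor, as the slide describes.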



DMA Problems: Cache Coherence

• A copy of the data involved in a DMA transfer may reside in the processor cache
  – If memory is updated: must update or invalidate the “old” cached copy
  – If memory is read: must read the latest value, which may be in the cache
    • Only a problem with write-back caches
• This is called the “cache coherence” problem
  – The same problem arises in multiprocessor systems
• Solution 1: the OS flushes the cache before I/O reads, or forces write-backs before I/O writes
  – The flush/write-back may involve selected addresses or the whole cache
  – Can be done in software or with hardware (ISA) support
• Solution 2: route memory accesses for I/O through the cache
  – Search the cache for copies and invalidate or write back as needed
  – This hardware solution may impact performance negatively
    • While the cache is being searched for I/O requests, it is not available to the processor
  – Multi-level, inclusive caches make this easier
    • The processor searches mostly the L1 cache (until it misses)
    • I/O requests search mostly the L2 cache (until they find a copy of interest)

I/O Summary

• I/O performance has to take into account many variables
  – Response time and throughput
  – CPU, memory, bus, I/O device
  – Workload, e.g. transaction processing
• I/O devices also span a wide spectrum
  – Disks, graphics, and networks
• Buses
  – Bandwidth
  – Synchronization
  – Transactions
• OS and I/O
  – Communication: polling and interrupts
  – Handling I/O outside the CPU: DMA
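Solution 1 in miniature: the toy one-line write-back cache below shows why skipping the flush lets the DMA engine read a stale value from memory. All names here (`cpu_write`, `flush_cache`, `dma_read_from_memory`) are invented for the sketch:

```c
#include <stdbool.h>

/* Toy one-line write-back cache over a tiny memory (initially all zero). */
int main_memory[4];
struct { int addr; int data; bool valid; bool dirty; } line;

/* Write-back, write-allocate: a CPU write lands in the cache only,
 * leaving memory stale until the line is evicted or flushed. */
void cpu_write(int addr, int data) {
    if (line.valid && line.dirty && line.addr != addr)
        main_memory[line.addr] = line.data;   /* write back on eviction */
    line.addr = addr;
    line.data = data;
    line.valid = true;
    line.dirty = true;                        /* memory is now stale */
}

/* "Solution 1": force a write-back before the device reads memory. */
void flush_cache(void) {
    if (line.valid && line.dirty)
        main_memory[line.addr] = line.data;
    line.dirty = false;
}

/* The DMA engine reads straight from memory, bypassing the cache. */
int dma_read_from_memory(int addr) {
    return main_memory[addr];
}
```

Without the `flush_cache()` call, a DMA write of this buffer to disk would carry the old memory contents, which is exactly the coherence hazard the slide describes.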

End of Uniprocessor Performance

[Figure: uniprocessor performance relative to the VAX-11/780, 1978–2006, on a log scale; roughly 25%/year growth in the early years, faster growth afterward, then a marked flattening after 2002. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition]

The World has Changed

• Process technology stops improving
  – Moore's law continues, but…
  – Transistors don't get faster, and they leak more (65nm vs. 45nm)
  – Power consumption is increasing dramatically
• Design and verification complexity is overwhelming
• Instruction-level parallelism (ILP) has been mined out
• The free ride is over!

(From Intel Developer Forum, September 2004)


The Era of Single-Chip Multiprocessors

• Single-chip multiprocessors provide a scalable alternative
  – They rely on scalable forms of parallelism
    • Request-level parallelism
    • Data-level parallelism
  – Modular design with inherent fault tolerance and a good match to VLSI technology
• Single-chip multiprocessor systems are here
  – All processor vendors are following this approach
  – In embedded, server, and even desktop systems
• How do we architect CMPs to best exploit thread-level parallelism?
  – Server applications: throughput
  – General-purpose and scientific applications: latency

CMP Options

a) Conventional microprocessor: a single CPU core (registers, L1 I$ and D$) with an L2 cache, main memory, and I/O
b) Simple chip multiprocessor: N cores, each with private registers, L1 I$/D$, and a private L2 cache, sharing main memory and I/O (e.g. AMD dual-core Opteron, Intel Paxville)


CMP Options (continued)

c) Shared-cache chip multiprocessor: N cores with private L1 caches sharing a single L2 cache, main memory, and I/O
d) Multithreaded, shared-cache chip multiprocessor: like (c), but each core holds multiple register sets so it can run several threads (e.g. Sun Niagara, aka T1)

Multiprocessor Questions

• How do parallel processors share data?
  – A single shared address space (SMP, NUMA)
  – Message passing (clusters, massively parallel processors (MPPs))
• How do parallel processors coordinate?
  – Software synchronization (locks, semaphores)
  – Synchronization built into send/receive primitives
  – OS primitives (sockets)
• How are parallel processors interconnected?
  – Connected by a single bus
  – Connected by a network



Simple Problem

  for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
    sum = sum + A[i]

• Split the loops
• Independent iterations

  for i = 1 to N
    A[i] = (A[i] + B[i]) * C[i]
  for i = 1 to N
    sum = sum + A[i]

• Data flow graph?

Partitioning of Data Flow Graph

[Figure: data flow graph for N = 4 — each pair A[i], B[i] feeds an add, each add feeds a multiply by C[i], and the four products are combined by a tree of adds; a global synchronization precedes the final add]
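The split-and-partition idea can be sketched serially: each "node" handles one slice of the first loop, reduces its slice to a partial sum, and the partials are combined after the point where a global synchronization would sit. The node/slice bookkeeping below is illustrative, with the loops run one after another rather than on real parallel hardware:

```c
#include <stddef.h>

#define N 8
#define NODES 4

/* Phase 1: each node updates its own slice of A. The iterations are
 * independent, so no synchronization is needed within the phase. */
void update_slice(float *A, const float *B, const float *C,
                  size_t start, size_t end) {
    for (size_t i = start; i < end; i++)
        A[i] = (A[i] + B[i]) * C[i];
}

/* Phase 2: each node reduces its slice to a partial sum. */
float partial_sum(const float *A, size_t start, size_t end) {
    float s = 0.0f;
    for (size_t i = start; i < end; i++)
        s += A[i];
    return s;
}

float run_partitioned(float *A, const float *B, const float *C) {
    float partials[NODES];
    size_t chunk = N / NODES;
    for (int node = 0; node < NODES; node++)   /* "parallel" phase 1 */
        update_slice(A, B, C, node * chunk, (node + 1) * chunk);
    for (int node = 0; node < NODES; node++)   /* "parallel" phase 2 */
        partials[node] = partial_sum(A, node * chunk, (node + 1) * chunk);
    /* ...the global synchronization would happen here... */
    float sum = 0.0f;
    for (int node = 0; node < NODES; node++)
        sum += partials[node];                 /* combine after the barrier */
    return sum;
}
```

The per-node partial sums correspond to the add tree in the data flow graph; only the final combine needs all nodes to have finished.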


Simple Problem: Shared Memory

  private int i, my_start, my_end, mynode;
  shared float A[N], B[N], C[N], sum;

  for i = my_start to my_end
    A[i] = (A[i] + B[i]) * C[i]
  GLOBAL_SYNCH;
  if (mynode == 0)
    for i = 1 to N
      sum = sum + A[i]

• Can run this on any shared-memory machine

Cache Coherence Problem

• Problem
  – Two caches hold inconsistent values for the same location

[Figure: three processors, each with a private cache, on a single bus to memory and I/O; one cache holds X=1 while another holds X=100]



Snoopy Cache Coherence

• Solution
  – Have caches “snoop” the contents of other caches on reads and writes
  – Writes: invalidate or update the other caches' copies
  – Reads: get the latest version from another cache if it holds a “dirty” copy
  – Extra tags are needed on the caches to handle the numerous requests

[Figure: a processor writes X=1, marking its copy dirty; the old copy X=100 in another cache is invalidated, and a later read of X must obtain the dirty value]

EE108B Quick Review (1)

• What we learned in EE108b
  – How to build programmable systems
    • Processor, memory system, I/O system, interactions with the OS and compiler
  – Processor design
    • ISA, single-cycle design, pipelined design, …
  – The basics of computer system design
    • Levels of abstraction, pipelining, caching, address indirection, DMA, …
  – Understanding why your programs sometimes run slowly
    • Pipeline stalls, cache misses, page faults, I/O accesses, …
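Write-invalidate snooping in miniature: on a write, every other cache drops its copy, so the next read refetches the new value. The three-cache model below is a toy, write-through for simplicity rather than the write-back protocol sketched on the slide, and all names are invented:

```c
#include <stdbool.h>

#define NCACHES 3

/* Toy model: one shared variable X, with one cached copy per cache. */
struct { int x; bool valid; } cache[NCACHES];
int mem_x = 100;

int cache_read(int id) {
    if (!cache[id].valid) {          /* miss: fill from memory */
        cache[id].x = mem_x;
        cache[id].valid = true;
    }
    return cache[id].x;
}

/* Write-invalidate: the write is broadcast on the bus, every other
 * cache snoops it and drops its now-stale copy. */
void cache_write(int id, int v) {
    for (int i = 0; i < NCACHES; i++)
        if (i != id)
            cache[i].valid = false;  /* snoop hit: invalidate */
    cache[id].x = v;
    cache[id].valid = true;
    mem_x = v;                       /* write-through, for simplicity */
}
```

A write-update protocol would instead push the new value into the other caches; both choices keep the caches coherent.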


EE108B Quick Review (2): Major Lessons to Take Away

• Levels of abstraction (e.g. ISA → processor → RTL blocks → gates)
  – Simplifies the design process for complex systems
  – Need a good specification for each level
• Pipelining
  – Improves the throughput of an engine by overlapping tasks
  – Watch out for dependencies between tasks
• Caching
  – Maintain a close copy of frequently accessed data
  – Avoids long accesses or expensive recomputation (memoization)
  – Think about associativity, replacement policy, and block size
  – Algorithm and data-structure layout affect locality
• Indirection (e.g. virtual → physical address translation)
  – Allows transparent relocation, sharing, and protection
• Overlapping (e.g. CPU work & DMA access)
  – Hide the cost of expensive tasks by executing them in parallel with other useful work

After EE108B

[Course map: EE108B leads to EE109 (Digital Systems III), EE282 (Computer Systems Architecture), and CS140 (Operating Systems), and beyond those to EE271 (VLSI Design), EE382A/B/C (Advanced Computer Architecture), and CS143 (Compilers)]
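The caching/memoization lesson in one small example (the `fib` function is illustrative, not from the handout): keep a close copy of previously computed results so an expensive recomputation becomes a cheap table lookup.

```c
#include <stdint.h>

/* Memoization: cache each Fibonacci number the first time it is
 * computed; later calls hit the table instead of recursing again. */
#define MAXN 64
uint64_t memo[MAXN];   /* 0 means "not cached yet" */

uint64_t fib(unsigned n) {
    if (n < 2)
        return n;
    if (memo[n] == 0)                  /* miss: compute and fill */
        memo[n] = fib(n - 1) + fib(n - 2);
    return memo[n];                    /* hit: reuse the cached value */
}
```

Without the table, the naive recursion does exponential work; with it, each value is computed once, which is the same locality argument that justifies hardware caches.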



Announcements

• HW5 due today
  – Not mandatory
  – Extra credit
  – No late days
• Lab4 is due on 12/11

• Final exam: Wednesday, 12/7, 11am–1pm
  – All lectures (1–18), focusing mostly on lectures 8–18
  – Calculator
  – Review session on Friday
  – Study the lecture notes & textbook
  – Local SCPD students come to class for the exam

• Course evaluations
  – On Axess

