EE108B Lecture18 BusesandOS,Review,&What’sNext

ChristosKozyrakis StanfordUniversity http://eeclass.stanford.edu/ee108b

C.Kozyrakis EE108bLecture18 1 Announcements

• TBD

C.Kozyrakis EE108bLecture18 2 Review:Buses

• Aisasharedcommunicationlinkthatconnectsmultipledevices • Singlesetofwiresconnectsmultiple“subsystems” asopposedtoa pointtopointlinkwhichonlyconnectstwocomponentstogether • Wiresconnectinparallel,so32bitbushas32wiresofdata

I/O I/O I/O Processor Device Device Device Memory

C.Kozyrakis EE108bLecture18 3 Review:TrendsforBuses LogicalBusandPhysicalSwitch

D1 D2 D3 D4

Bus ParallelBus Cntrl • Serialpointtopointadvantages – Fasterlinks – Fewerchippackagepins D2 – Higherperformance – Switchkeepsarbitrationonchip

4port • Manybusstandardsaremovingtoserial, D3 pointtopoint D1 switch • 3GIO,PCIExpress(PCI) • SerialATA(IDEharddisk) • AMDHypertransport versusFrontSide Bus(FSB)

C.Kozyrakis D4 EE108bLecture18 4 Review:OSCommunication

• Theoperatingsystemshouldpreventuserprogramsfrom communicatingwithI/Odevicedirectly – MustprotectI/Oresourcestokeepsharingfair – ProtectionofsharedI/Oresourcescannotbeprovidedifuser programscouldperformI/Odirectly

• Threetypesofcommunicationarerequired: – OSmustbeabletogivecommandstoI/Odevices – I/OdevicemustbeabletonotifyOSwhenI/Odevicehas completedanoperationorhasencounteredanerror – DatamustbetransferredbetweenmemoryandanI/Odevice

C.Kozyrakis EE108bLecture18 5 Reivew:Amethodforaddressingadevice

• MemorymappedI/O: – PortionsoftheaddressspaceareassignedtoeachI/Odevice – I/Oaddressescorrespondtodeviceregisters – UserprogramspreventedformissuingI/Ooperationsdirectly sinceI/Oaddressspaceis protected bytheaddresstranslation mechanism

0x00000000 0xFFFFFFFF

OS- Graphics I/O I/O I/O General RAM Memory mem Memory #1 #2 #3

C.Kozyrakis EE108bLecture18 6 CommunicatingwiththeCPU

• Method#1:Polling – I/Odeviceplacesinformationinastatusregister – TheOSperiodicallychecksthestatusregister – Whetherpollingisusedisoftendependentuponwhetherthe devicecaninitiateI/Oindependently • Forinstance,amouseworkswellsinceithasafixedI/Orateand initiatesitsowndata(wheneveritismoved) • Forothers,suchasdiskaccess,I/Oonlyoccursunderthecontrol oftheOS,sowepollonlywhentheOSknowsitisactive – Advantages • Simpletoimplement • Processorisincontrolanddoesthework – Disadvantage • PollingoverheadanddatatransferconsumeCPUtime

C.Kozyrakis EE108bLecture18 7 PollingandProgrammedI/O

read data from Is the memory device ready?

Busy wait loop yes no Is the not an efficient device way to use the CPU read data ready? unless the device from is very fast! device yes no store write data Processor may be to memory data to inefficient way to device transfer data done? no done? no yes yes

C.Kozyrakis EE108bLecture18 8 I/ONotification(cont)

• Method#2:I/O – WheneveranI/Odeviceneedsattentionfromtheprocessor,it theprocessor – InterruptmusttellOSabouttheeventandwhichdevice • Using“cause” register(s):Kernel“asks” whatinterrupted • Using vectored interrupts:Adifferentexceptionhandlerforeach • Example:Intel80x86has256vectoredinterrupts – I/Ointerruptsareasynchronousevents,andhappenanytime • Processorwaitsuntilcurrentinstructioniscompleted – Interruptsmayhavedifferentpriorities • Ex.:Network=highpriority,keyboard=lowpriority – Advantage:Executionisonlyhaltedduringactualtransfer – Disadvantage:Softwareoverheadofinterruptprocessing

C.Kozyrakis EE108bLecture18 9 I/OInterrupts

Interrupts add Exceptions sub User Exception (1) I/O interrupt and program Vector Table or ArithmeticException nop (2) Save state PageFault ISR AddressError #1 UnalignedAccess FloatingPointException ISR (3) Interrupt TimerInterrupt #2 service addr Network read Interrupt DiskDrive store ISR service Keyboard #3 ... : (4) Restore rti routine (containsPCsofISRs) state and return Memory

C.Kozyrakis EE108bLecture18 10 DataTransfer

• ThethirdcomponenttoI/Ocommunicationisthetransferofdata fromtheI/Odevicetomemory(orviceversa) • Simpleapproach:“Programmed” I/O – Softwareontheprocessormoves all databetweenmemory addressesandI/Oaddresses – Simpleandflexible,butwastesCPUtime – Also,lotsofexcessdatamovementinmodernsystems • Ex.:Mem >NB>CPU>NB>graphics • Whenwewant:Mem >NB>graphics • Soneedasolutiontoallowdatatransfertohappen without the processor’sinvolvement

C.Kozyrakis EE108bLecture18 11 DelegatingI/O:DMA

• DirectMemoryAccess(DMA) – Transfer blocks ofdatatoorfrommemorywithoutCPUintervention – Communicationcoordinatedbythe DMA controller • DMAcontrollersareintegratedinmemoryorI/Ocontrollerchips – DMAcontrolleractsasabusmaster,inbusbasedsystems

• DMASteps – ProcessorsetsupDMAbysupplying: • Identityofthedeviceandtheoperation(read/write) • Thememoryaddressforsource/destination • Thenumberofbytestotransfer – DMAcontrollerstartstheoperationbyarbitratingforthebusandthen startingthetransferwhenthedataisready – NotifytheprocessorwhentheDMAtransferiscompleteoronerror • Usuallyusinganinterrupt

C.Kozyrakis EE108bLecture18 12 ReadingaDiskSector(1)

1 GHz Pentium Processor Pipeline CPUinitiatesadiskreadbywritinga command,logicalblocknumber,anda Caches destinationmemoryaddresstodisk controllercontrolregisters. AGP North Monitor Graphics Bridge

PCI

USBHub Disk Controller Controller Controller

Printer Mouse Disk Disk Keyboard

C.Kozyrakis EE108bLecture18 13 ReadingaDiskSector(2)

1 GHz Pentium Processor Pipeline Diskcontrollerreadsthesectorand performsadirectmemoryaccess(DMA) Caches transferintomainmemory.

AGP North Monitor Graphics Memory Controller Bridge

PCI

USBHub Ethernet Disk Controller Controller Controller

Printer Mouse Disk Disk Keyboard

C.Kozyrakis EE108bLecture18 14 ReadingaDiskSector(3)

1 GHz Pentium Processor Pipeline WhentheDMAtransfercompletes,the diskcontrollernotifiestheCPUwithan Caches interrupt (i.e.,assertsaspecial“interrupt” pinontheCPU) AGP North Monitor Graphics Memory Controller Bridge

PCI

USBHub Ethernet Disk Controller Controller Controller

Printer Mouse Disk Disk Keyboard

C.Kozyrakis EE108bLecture18 15 DMAProblems:VirtualVs.PhysicalAddresses

• IfDMAusesphysicaladdresses – Memoryaccessacrossphysicalpageboundariesmaynotcorrespond tocontiguousvirtualpages(oreventhesameapplication!) • Solution1: ≤ 1pageperDMAtransfer • Solution1+:chainaseriesof1pagerequestsprovidedbytheOS – SingleinterruptattheendofthelastDMArequestinthechain • Solution2:DMAengineusesvirtualaddresses – MultipageDMArequestsarenoweasy – ATLBisnecessaryfortheDMAengine

• ForDMAwithphysicaladdresses:pagesmustbepinnedinDRAM – OSshouldnotpagetodiskspagesinvolvedwithpendingI/O

C.Kozyrakis EE108bLecture18 16 DMAProblems:Coherence

• AcopyofthedatainvolvedinaDMAtransfermayresideinprocessorcache – Ifmemoryisupdated:Mustupdateorinvalidate“old” cachedcopy – Ifmemoryisread:Mustreadlatestvalue,whichmaybeinthecache • Onlyaproblemwithwritebackcaches • Thisiscalledthe“cachecoherence” problem – Sameprobleminmultiprocessorsystems

• Solution1:OSflushesthecachebeforeI/Oreadsorforceswritebacks before I/Owrites – Flush/writebackmayinvolveselectiveaddressesorwholecache – Canbedoneinsoftwareorwithhardware(ISA)support • Solution2:RoutememoryaccessesforI/Othroughthecache – Searchthecacheforcopiesandinvalidateorwritebackasneeded – Thishardwaresolutionmayimpactperformancenegatively • WhilesearchingcacheforI/Orequests,itisnotavailabletoprocessor – Multilevel,inclusivecachesmakethiseasier • ProcessorsearchesL1cachemostly(untilitmisses) • I/OrequestssearchL2cachemostly(untilitfindsacopyofinterest) C.Kozyrakis EE108bLecture18 17 I/OSummary

• I/Operformancehastotakeintoaccountmanyvariables – Responsetimeandthroughput – CPU,memory,bus,I/Odevice – Workloade.g.transactionprocessing • I/Odevicesalsospanawidespectrum – Disks,graphics,andnetworks • Buses – bandwidth – synchronization – transactions • OSandI/O – Communication:PollingandInterrupts – HandlingI/OoutsideCPU:DMA

C.Kozyrakis EE108bLecture18 18 EE108BQuickReview(1)

• WhatwelearnedinEE108b – Howtobuildprogrammablesystems • Processor,memorysystem,IOsystem,interactionswithOSand compiler – Processordesign • ISA,singlecycledesign,pipelineddesign,… – Thebasicsofcomputersystemdesign • Levelsofabstraction,pipelining,caching,addressindirection, DMA,… – Understandingwhyyourprogramssometimesrunslowly • Pipelinestalls,cachemisses,pagefaults,IOaccesses,…

C.Kozyrakis EE108bLecture18 19 EE108BQuickReview(2) MajorLessonstoTakeAway

• Levelsofabstraction(e.g.ISA→processor→RTLblocks→gates) – Simplifiesdesignprocessforcomplexsystems – Needgoodspecificationforeachlevel • Pipelining – Improvethroughputofanenginebyoverlappingtasks – Watchoutfordependenciesbetweentasks • Caching – Maintainaclosecopyoffrequentlyaccesseddata – Avoidlongaccessesorexpensiverecomputation (memoization) – Thinkaboutassociativity,replacementpolicy,blocksize • Indirection(e.g.virtual→physicaladdresstranslation) – Allowstransparentrelocation,sharing,andprotection • Overlapping(e.g.CPUwork&DMAaccess) – Hidethecostofexpensivetasksbyexecutingtheminparallelwithotherusefulwork • Other: – Amdhal’s law:makecommoncasefast – Amortization,oarallel processing,memoization,speculation

C.Kozyrakis EE108bLecture18 20 TheEnd

C.Kozyrakis EE108bLecture18 21 AfterEE108B

EE108B

CS140 EE109 EE282 Operat. Sys. Dig. Sys. III Systems Architecture

EE271 EE382 (A/B/C) CS143 VLSI Design Adv. Comp. Arch. Compilers

C.Kozyrakis EE108bLecture18 22 EE282Spring07: ComputerSystemsArchitecture

• Goal:learnhowtobuildanduseefficientsystems

• Topics: – Advancedmemoryhierarchies – AdvancedI/OandinteractionswithOS – Clusterarchitectureandprogramming – Virtualmachines – Reliablesystems – Energyefficientsystems

• Programmingassignments – Efficientprogrammingforuniprocessors – Efficientprogrammingforclusters

C.Kozyrakis EE108bLecture18 23 LookingForward

C.Kozyrakis EE108bLecture18 24 EndofUniprocessor Performance

10000

From Hennessy and Patterson, : A Quantitative Approach , 4th <20%/year??%/year edition, October, 2006 1000

52%/year

100

Performance(vs.VAX11/780) 10 25%/year Thefreerideisover!

1 1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 200 2 2004 2006

C.Kozyrakis EE108bLecture18 25 TheWorldhasChanged

• ProcessTechnologyStopsImproving – Moore’slawbut… – Transistorsdon’tgetfasterandtheyleakmore(65nmvs.45nm) – Wiresaremuchworse • SingleThreadPerformancePlateau – Designandverificationcomplexityisoverwhelming – Powerconsumptionincreasingdramatically – Instructionlevelparallelism(ILP)hasbeenminedout

C.Kozyrakis EE108bLecture18 26 FromIntelDeveloperForum,September2004 TheEraofSingleChipMultiprocessors

• Singlechipmultiprocessorsprovideascalablealternative – Reliesonscalableformsofparallelism • Requestlevelparallelism • Datalevelparallelism – ModulardesignwithinherentfaulttoleranceandmatchtoVLSI technology

• Singlechipmultiprocessorssystemsarehere – Allprocessorvendorsarefollowingthisapproach – Inembedded,server,andevendesktopsystems

• HowdowearchitectCMPs tobestexploitthreadlevelparallelism? – Serverapplications:throughput – Generalpurposeandscientificapplications:latency

C.Kozyrakis EE108bLecture18 27 CMPOptions

a)Conventionalmicroprocessor b)Simplechipmultiprocessor

CPU Core CPU CPU Core 1 Core N Registers Regs ... Regs

L1I$ L1D$ L1I$ L1D$ L1I$ L1D$ L2 Cache L2 Cache ... L2 Cache

Main Memory I/O Main Memory I/O

e.g. AMD Dual-core Opteron Intel Paxville

C.Kozyrakis EE108bLecture18 28 CMPOptions

c)Sharedcachechipmultiprocessor d)Multithreaded,sharedcachechipmultiprocessor

CPU CPU CPU Core 1 CPU Core N Core 1 Core N Regs Regs Regs Regs Regs ... Regs Regs Regs ... Regs Regs

L1I$ L1D$ L1I$ L1D$ L1I$ L1D$ L1I$ L1D$ L2 Cache L2 Cache

Main Memory I/O Main Memory I/O

C.Kozyrakis EE108bLecture18 29 MultiprocessorQuestions

• Howdoparallelprocessorssharedata? – singleaddressspace(SMP,NUMA) – messagepassing(clusters,massivelyparallelprocessors (MPP))

• Howdoparallelprocessorscoordinate? – softwaresynchronization(locks,semaphores) – builtintosend/receiveprimitives OSprimitives(sockets)

• Howareparallelprocessorsinterconnected? – connectedbyasinglebus – connectedbyanetwork

C.Kozyrakis EE108bLecture18 30