CS252 Lecture Notes Multithreaded Architectures

Concept

Tolerate or mask long and often unpredictable latency operations by switching to another context that is able to do useful work.

Situation Today: why is this topic relevant?
- ILP has been exhausted, which means thread-level parallelism must be utilized
- The gap between processor performance and memory performance is still large
- There is ample real estate for implementation
- More applications are being written with threads, and multitasking is ubiquitous
- Multiprocessors are more common
- Network latency is analogous to memory latency
- Complex scheduling is already being done in hardware

Classical Problem

60's and 70's
- I/O latency prompted multitasking
- IBM mainframes
- Multitasking
- I/O processors
- Caches within disk controllers

Requirements of Multithreading
- Storage is needed to hold multiple contexts' PC, registers, status word, etc.
- Coordination to match an event with a saved context
- A way to switch contexts
- Long-latency operations must use resources that are not otherwise in use

To visualize the effect of latency on processor utilization, let R be the run length to a long-latency event and let L be the amount of latency. Then:

    Util = R / (R + L)

[Figure: utilization, from 0 to 1, plotted against latency L]

80's
- The problem was revisited due to the advent of graphics workstations: Xerox Alto, TI Explorer
- Concurrent processes are interleaved to make the workstation more responsive.
- These processes could drive or monitor the display, input, file system, network, and user processing
- Process switching was slow, so the subsystems were microprogrammed to support multiple contexts

Scalable Multiprocessor
- Dance hall: a shared interconnect with memory on one side and processors on the other.
- Or processors may have local memory

[Figure: a dance-hall organization (memory modules M and processors P on opposite sides of a high-speed interconnect) beside a distributed organization (processor/memory nodes P/M on a high-speed interconnect); the figure also labels I/O]

How do the processors communicate?

Shared Memory
- Potential long latency on every load
- Cache coherency becomes an issue
- Examples include NYU's Ultracomputer, IBM's RP3, BBN's Butterfly, MIT's Alewife, and later Stanford's DASH.
- Synchronization occurs through shared variables, locks, flags, and semaphores.
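
The "long latency on every load" point can be made concrete with the utilization formula given earlier in these notes, Util = R / (R + L). The following is a minimal Python sketch of that formula for a single context. The function name and the sample values of R and L are illustrative assumptions, not measurements of any of the machines listed.

```python
def single_context_utilization(run_length, latency):
    """Util = R / (R + L) for a processor that holds only one context.

    run_length (R): cycles of useful work before a long-latency event.
    latency (L): cycles the processor then sits idle waiting.
    """
    if run_length <= 0 or latency < 0:
        raise ValueError("run_length must be positive and latency non-negative")
    return run_length / (run_length + latency)


# Hypothetical values: a run length of 10 cycles and increasing latency.
for latency in (0, 10, 100):
    util = single_context_utilization(10, latency)
    print(f"R=10 L={latency:3d} Util={util:.3f}")
```

With R = 10 and L = 100 the processor is busy for only 10 of every 110 cycles, which is the gap that switching to another context is meant to fill.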
Message Passing
- The programmer deals with latency explicitly. This makes it possible to minimize the number of messages while maximizing their size, and to minimize delay by sending a message so that it reaches the receiver at the time the receiver expects it.
- Examples include Intel's iPSC and Paragon, Caltech's Cosmic Cube, and Thinking Machines' CM-5
- Synchronization occurs through send and receive

Cycle-by-Cycle Interleaved Multithreading

Burton Smith
- Currently chief scientist at Cray
- Denelcor HEP1 (1982), HEP2
- Horizon, which was never built
- Tera, MTA

[Figure: pipeline in which per-context PCs feed a thread scheduler, followed by the I-Fetch, RF, Mem, and WB stages]

Features of this architecture
- An instruction from a different context is launched at each clock cycle
- No interlocks or bypasses, thanks to a non-blocking pipeline

Optimizations:
- Leaving context state in the processor (PC, register #, status)
- Assigning tags to remote requests and then matching them on completion

Additional optimizations:
- A full/empty bit on every memory word, allowing automatic and efficient synchronization. This obviates the need for semaphores, locks, etc., and it may reduce the time the processor spends polling by moving that job to a controller.

Challenges with this approach
- Instruction bandwidth: since instructions are fetched from many different contexts, instruction locality is degraded and the I-cache miss rate rises.
- Register file access time increases because the register file must grow significantly to accommodate many separate contexts. In fact, the HEP and Tera use SRAM to implement the register file, which means longer access times. Some of this may be alleviated by increasing the pipeline depth, at the cost of additional latency.
- Single-thread performance is significantly degraded, since the context is forced to switch to a new thread even if none are available.
- Insufficient pipelining (bandwidth bottleneck)
- Unpipelined FP unit: must stall or reflect up into the thread scheduler
- Very high bandwidth network, which is fast and wide
- Retries on load-empty or store-full

Improving Single Thread Performance
- Do more operations per instruction (VLIW)
- Allow multiple instructions to issue into the pipeline from each context. This could lead to pipeline hazards, so other safe instructions could be interleaved into the execution. For Horizon and Tera, the compiler detects such data dependencies and the hardware enforces them by switching to another context when one is detected. This is implemented by inserting into each instruction a field that indicates its minimum number of independent successors over all possible control flows.
- Switch on load
- Switch on miss
- Switching on load or miss will increase the context switch time. Consider:

    Type of switch   R (run length)   C (context switch time)
    Cycle            1                0
    Load             5-10             1-2
    Miss             5-100            5-10 (+ hit time)

- Max Utilization = R / (R + C)
- Where does saturation occur?

    Nsat = L / (R + C) + 1

[Figure: utilization versus number of contexts N. Utilization starts at R/(R+L), rises with N, and saturates at R/(R+C) once N reaches Nsat; curves are shown for deterministic R and stochastic R]

Cautions
- Pipeline bottlenecks are more apparent with effective multithreading. For example, an unpipelined FP unit needs to reflect its reservation up to the thread scheduler
- Architected delay slots complicate multithreaded control logic. For example, an exception occurs on an instruction in a branch delay slot while fetch and exec are at the branch target
- Register file delay and bandwidth

Other Concepts
- Tagged memory is another concept that is often circulated, in which memory is thought of as a set of objects instead of homogeneous bits. Examples of this are Lisp machines, dataflow machines, and J-Machines.
- Physical vs. virtual parallelism
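
Returning to the saturation question raised under switch on load and switch on miss, the two formulas from the notes, Max Utilization = R / (R + C) and Nsat = L / (R + C) + 1, can be explored with a short Python sketch. The piecewise-linear rise below saturation is an assumption made here so that the knee falls exactly at Nsat; the notes give only the endpoints and the figure. Function names are hypothetical, R and C are taken from the "Load" row of the table, and L = 100 is an arbitrary illustrative latency.

```python
def max_utilization(R, C):
    """Saturated utilization from the notes: R / (R + C)."""
    return R / (R + C)


def saturation_contexts(R, C, L):
    """Number of contexts at which utilization saturates: L / (R + C) + 1."""
    return L / (R + C) + 1


def utilization(n, R, C, L):
    """Estimated utilization with n contexts.

    ASSUMPTION (not stated in the notes): utilization grows linearly with n
    until n reaches Nsat, then stays flat at R / (R + C). This charges the
    switch cost C even when n == 1, so it sits slightly below R / (R + L)
    for a single context.
    """
    if n < 1:
        raise ValueError("need at least one context")
    n_sat = saturation_contexts(R, C, L)
    return min(n / n_sat, 1.0) * max_utilization(R, C)


# Hypothetical parameters: R and C from the "Load" row, L chosen arbitrarily.
R, C, L = 10, 2, 100
print(f"Nsat = {saturation_contexts(R, C, L):.2f}")
for n in (1, 2, 4, 8, 16):
    print(f"N={n:2d} Util={utilization(n, R, C, L):.3f}")
```

Under these assumed numbers, adding contexts beyond roughly nine or ten buys nothing: the processor is then limited by the switch cost C rather than by the latency L.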
