CS252 Lecture Notes Multithreaded Architectures

Concept

Tolerate or mask long and often unpredictable latency operations by switching to another context, which is able to do useful work.

Situation Today: Why is this topic relevant?
  • ILP has been exhausted, which means thread-level parallelism must be utilized
  • The gap between processor performance and memory performance is still large
  • There is ample real estate for implementation
  • More applications are being written with the use of threads, and multitasking is ubiquitous
  • Multiprocessors are more common
  • Network latency is analogous to memory latency
  • Complex scheduling is already being done in hardware

Classical Problem: 60's and 70's
  • I/O latency prompted multitasking
  • IBM mainframes
  • Multitasking
  • I/O processors
  • Caches within disk controllers

Requirements of Multithreading
  • Storage is needed to hold multiple contexts' PCs, registers, status words, etc.
  • Coordination to match an event with a saved context
  • A way to switch contexts
  • Long latency operations must use resources not in use

To visualize the effect of latency on processor utilization, let R be the run length to a long latency event and let L be the amount of latency. Then:

    Util = R / (R + L)

[Figure: utilization (0 to 1) plotted against latency L]

80's

Problem was revisited due to the advent of graphics workstations: Xerox Alto, TI Explorer
  • Concurrent processes are interleaved to allow the workstations to be more responsive.
  • These processes could drive or monitor display, input, file system, network, user processing
  • Process switch was slow, so the subsystems were microprogrammed to support multiple contexts

Scalable Multiprocessor
  • Dance hall: a shared interconnect with memory on one side and processors on the other.
  • Or processors may have local memory

[Figure: two organizations. Dance hall: memories (M) and processors (P) on opposite sides of a high-speed interconnect. Local memory: processor/memory nodes (P/M) and I/O attached to a high-speed interconnect.]

How do the processors communicate?

Shared Memory
  • Potential long latency on every load
  • Cache coherency becomes an issue
  • Examples include NYU's Ultracomputer, IBM's RP3, BBN's Butterfly, MIT's Alewife, and later Stanford's Dash.
  • Synchronization occurs through shared variables, locks, flags, and semaphores.
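The single-context utilization formula given earlier, Util = R / (R + L), can be tabulated to see how quickly utilization collapses as latency grows. The following is a minimal sketch; the function name `utilization` and the sample values of R and L are illustrative assumptions, not figures from the lecture.

```python
def utilization(run_length, latency):
    """Single-context utilization Util = R / (R + L).

    run_length (R): cycles of useful work before a long latency event.
    latency (L): cycles spent waiting on that event.
    """
    if run_length <= 0 or latency < 0:
        raise ValueError("need run_length > 0 and latency >= 0")
    return run_length / (run_length + latency)


if __name__ == "__main__":
    R = 10  # assumed run length, for illustration only
    for L in (0, 10, 100, 1000):  # assumed latencies, for illustration only
        print(f"R={R:3d}  L={L:5d}  Util={utilization(R, L):.3f}")
```

With a fixed run length, utilization falls toward zero as L grows, which is the motivation for keeping other contexts ready to run.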
Message Passing
  • Programmer deals with latency. This enables them to minimize the number of messages while maximizing their size, and this scheme allows for delay minimization by sending a message so that it reaches the receiver at the time it expects it.
  • Examples include Intel's iPSC and Paragon, Caltech's Cosmic Cube, and Thinking Machines' CM-5
  • Synchronization occurs through send and receive

Cycle-by-Cycle Interleaved Multithreading

Burton Smith
  • Currently chief scientist at Cray
  • Denelcor HEP1 (1982), HEP2
  • Horizon, which was never built
  • Tera, MTA

[Figure: pipeline with per-context PCs and a thread scheduler feeding the I-Fetch, RF, Mem, and WB stages]

Features of this architecture

An instruction from a different context is launched at each clock cycle
  • No interlocks or bypasses thanks to a non-blocking pipeline

Optimizations:
  • Leaving context state in proc (PC, register #, status)
  • Assigning tags to remote requests and then matching them on completion

Additional optimizations:
  • A full/empty bit on every memory word allowing for automatic and efficient synchronization. This obviates the need for semaphores, locks, etc., and it may reduce polling time done by the processor by moving that job to a controller.

Challenges with this approach

Instruction bandwidth
  • Since instructions are being grabbed from many different contexts, instruction locality is degraded and the I-cache miss rate rises.
  • Register file access time increases because the register file had to grow significantly to accommodate many separate contexts. In fact, the HEP and Tera use SRAM to implement the register file, which means longer access times. Some of this may be alleviated by increasing the pipeline depth at the cost of additional latency.
  • Single thread performance is significantly degraded since the context is forced to switch to a new thread even if none are available.
  • Insufficient pipelining (bandwidth bottleneck)
  • Unpipelined FP unit: must stall or reflect up into thread scheduler
  • Very high bandwidth network, which is fast and wide
  • Retries on load empty or store full

Improving Single Thread Performance
  • Do more operations per instruction (VLIW)
  • Allow multiple instructions to issue into pipeline from each context. This could lead to pipeline hazards, so other safe instructions could be interleaved into the execution. For Horizon & Tera the compiler detects such data dependencies and the hardware enforces them by switching to another context if detected. This is implemented by inserting into each instruction a field which indicates its minimum number of independent successors over all possible control flows.
  • Switch on load
  • Switch on miss
  • Switching on load or miss will increase the context switch time. Consider:

    Type of Switch    R        C
    Cycle             1        0
    Load              5-10     1-2
    Miss              5-100    5-10 (+ hit time)

  • Max Utilization = R / (R + C)
  • Where does saturation occur?

    Nsat = L / (R + C) + 1

[Figure: utilization (0 to 1) versus number of contexts N, with curves for deterministic R and stochastic R; the vertical axis marks R/(R+L) and R/(R+C), and the horizontal axis marks Nsat]

Cautions
  • Pipeline bottlenecks are more apparent with effective multithreading. For example, an unpipelined FP unit needs to reflect reservation up to the thread scheduler
  • Architected delay slots complicate multithreaded control logic. For example, an exception occurs on instructions in branch delay slot, while fetch and exec are at branch target
  • Register file delay and bandwidth

Other Concepts
  • Tagged Memory is another concept that is often circulated in which memory is thought of as a set of objects instead of homogeneous bits. Examples of this are Lisp machines, data-flow machines, and J-machines.
  • Physical vs Virtual Parallelism
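The saturation question raised under Improving Single Thread Performance can be explored numerically. The sketch below uses the two expressions from the notes, the ceiling R / (R + C) and Nsat = L / (R + C) + 1. The notes do not give a formula for the region below saturation, so the linear form N * R / (R + C + L) is an assumption here, chosen because it meets the ceiling exactly at Nsat; it also charges the switch cost C even when N = 1, so it does not reproduce the single-context value R / (R + L). The function names and the sample values R = 8, L = 40, C = 2 are likewise illustrative.

```python
def n_sat(R, L, C):
    """Number of contexts at which utilization saturates: Nsat = L / (R + C) + 1."""
    return L / (R + C) + 1


def multithreaded_utilization(n, R, L, C):
    """Utilization with n contexts under a deterministic run length R.

    Below saturation, an assumed linear model: n * R / (R + C + L).
    At or above saturation, the ceiling from the notes: R / (R + C).
    """
    if n < 1:
        raise ValueError("need at least one context")
    linear = n * R / (R + C + L)   # assumption: not stated in the notes
    ceiling = R / (R + C)          # Max Utilization from the notes
    return min(linear, ceiling)


if __name__ == "__main__":
    R, L, C = 8, 40, 2  # assumed values, for illustration only
    print(f"Nsat = {n_sat(R, L, C):.1f}")
    for n in range(1, 9):
        print(f"N={n}  Util={multithreaded_utilization(n, R, L, C):.3f}")
```

Adding contexts beyond Nsat buys nothing in this model: the processor is already busy every cycle except for the switch overhead C, which is why reducing C matters as much as adding contexts.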
