
An Evaluation of Speculative Instruction Execution on Simultaneous Multithreaded Processors

Steven Swanson, Luke K. McDowell, Michael M. Swift, Susan J. Eggers and Henry M. Levy
University of Washington

Abstract

Modern superscalar processors rely heavily on speculative execution for performance. For example, our measurements show that on a 6-issue superscalar, 93% of committed instructions for SPECINT95 are speculative. Without speculation, processor resources on such machines would be largely idle. In contrast to superscalars, simultaneous multithreaded (SMT) processors achieve high resource utilization by issuing instructions from multiple threads every cycle. An SMT processor thus has two means of hiding latency: speculation and multithreaded execution. However, these two techniques may conflict; on an SMT processor, wrong-path speculative instructions from one thread may compete with and displace useful instructions from another thread. For this reason, it is important to understand the trade-offs between these two latency-hiding techniques, and to ask whether multithreaded processors should speculate differently than conventional superscalars.

This paper evaluates the behavior of instruction speculation on SMT processors using both multiprogrammed (SPECINT and SPECFP) and multithreaded (the Apache Web server) workloads. We measure and analyze the impact of speculation and demonstrate how speculation on an 8-wide SMT differs from superscalar speculation. We also examine the effect of speculation-aware fetch and branch prediction policies in the processor. Our results quantify the extent to which (1) speculation is, in fact, critical to performance on a multithreaded processor, and (2) SMT actually enhances the effectiveness of speculative execution, compared to a superscalar processor.

Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures, C.4 [Performance of Systems], C.5 [Computer System Implementation].

General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: Instruction-level parallelism, multiprocessors, multithreading, simultaneous multithreading, speculation, thread-level parallelism.

1 Introduction

Instruction speculation is a crucial component of modern superscalar processors. Speculation hides branch latencies and thereby boosts performance by executing the likely branch path without stalling. Branch predictors, which provide accuracies up to 96% (excluding OS code) [8], are the key to effective speculation. The primary disadvantage of speculation is that some processor resources are invariably allocated to useless, wrong-path instructions that must be flushed from the pipeline. However, since resources are often underutilized on superscalars because of low single-thread instruction-level parallelism (ILP) [28, 3], the benefit of speculation far outweighs this disadvantage and the decision to speculate as aggressively as possible is an easy one.

In contrast to superscalars, simultaneous multithreading (SMT) processors [28, 27] operate with high processor utilization, because they issue and execute instructions from multiple threads each cycle, with all threads dynamically sharing hardware resources. If some threads have low ILP, utilization is improved by executing instructions from additional threads; if only one or a few threads are executing, then all critical hardware resources are available to them. Consequently, instruction throughput on a fully loaded SMT processor is two to four times higher than on a superscalar with comparable hardware on a variety of integer, scientific, database, and web service workloads [16, 15, 20].
With its high hardware utilization, speculation on an SMT may harm rather than improve performance, particularly with all hardware contexts occupied. Speculative (and potentially wasteful) instructions from one thread compete with useful, non-speculative instructions from other threads for highly utilized hardware resources, and in some cases displace them, lowering performance. Therefore, it is important to understand the behavior of speculation on an SMT processor and the extent to which it helps or hinders performance.

This paper investigates the interactions between instruction speculation and multithreading and quantifies the impact of speculation on the performance of SMT processors. Our analyses are based on five different workloads (including all operating system code): SPECINT95, SPECFP95, a combination of the two, the Apache Web server, and a synthetic workload that allows us to manipulate basic-block length and available ILP. Using these workloads, we carefully examine how speculative instructions behave on SMT, as well as how and when SMT should speculate.

We attempt to improve SMT performance by reducing wrong-path speculative instructions, either by not speculating at all or by using speculation-aware fetch policies (including policies that incorporate confidence estimators). To explain the results, we investigate which hardware structures and pipeline stages speculation affects, and how speculation on SMT processors differs from speculation on a traditional superscalar. Finally, we explore the boundaries of speculation's usefulness on SMT by increasing the number of hardware threads and using synthetic workloads to change the branch frequency and ILP within threads.

Our results show that despite its potential downside, speculation increases SMT performance by 9% to 32% over a non-speculating SMT. Furthermore, modulating speculation with either a speculation-aware fetch policy or confidence estimation never improved, and usually lowered, performance. Two separate factors explain this behavior. First, SMT processors actually rely on speculation to provide a rich, cross-thread instruction mix to fully utilize the functional units. While sophisticated speculation policies can reduce speculation's negative effects, they simultaneously restrict the diversity of threads occupying the pre-execution stages of the pipeline. Consequently, SMT only benefits from less aggressive speculation when executing unlikely workloads or on extremely aggressive hardware designs. Second, and somewhat surprisingly, SMT enhances the effectiveness of speculation by decreasing the percentage of speculative instructions on the wrong path. SMT reduced the percentage of wrong-path speculative instructions by 29% when compared to a superscalar with comparable execution resources.

The remainder of this paper is organized as follows. The next section details the methodology for our experiments. Section 3 presents the basic speculation results and explains why and how speculation benefits SMT performance; it also presents alternative fetch and prediction schemes and shows why they fall short. Section 4 explores the effects of software and microarchitectural parameters on speculation. Finally, Section 5 discusses related work and Section 6 summarizes our findings.

2 Methodology

2.1 Simulator

Our SMT simulator is based on the SMTSIM simulator [26] and has been ported to the SimOS framework [22, 2, 20]. It simulates the full pipeline and memory hierarchy, including bank conflicts and bus contention, for both the applications and the operating system.
The baseline configuration for our experiments is shown in Table 1. For most experiments we used the ICOUNT fetch policy [27]. ICOUNT gives priority to threads with the fewest instructions in the pre-issue stages of the pipeline and fetches 8 instructions (or to the end of the cache line) from each of the two highest priority threads. From these instructions, it chooses up to 8 to issue, selecting from the highest priority thread until a branch instruction is encountered, then taking the remainder from the second thread. In addition to ICOUNT, we also experimented with three alternative fetch policies. The first does not speculate at all, i.e., instruction fetching for a particular thread stalls until the branch is resolved; instead, instructions are selected only from the non-speculative threads using ICOUNT. The second favors non-speculating threads by fetching instructions from threads whose next instructions are non-speculative before fetching from threads with speculative instructions; ties are broken with ICOUNT. The third uses branch confidence estimators to favor threads with high-confidence branches.

Our baseline experiments used the McFarling branch prediction algorithm [19] used on modern processors from Compaq; for some studies we augmented this with confidence estimators. (Sketches of the ICOUNT selection step and of a McFarling-style hybrid predictor appear at the end of this section.)

CPU
  Thread Contexts: 8
  Pipeline: 9 stages, 7 cycle misprediction penalty
  Fetch Policy: 8 instructions per cycle from up to 2 contexts (the ICOUNT scheme of [27])
  Functional Units: 6 integer (including 4 load/store and 2 synchronization units), 4 floating point
  Instruction Queues: 32-entry integer and floating point queues
  Renaming Registers: 100 integer and 100 floating point
  Retirement Bandwidth: 12 instructions/cycle
  Branch Predictor: McFarling-style hybrid predictor [19] (shared among all contexts)
  Local Predictor: 4K-entry prediction table, indexed by 2K-entry history table
  Global Predictor: 8K entries, 8K-entry selection table
  Branch Target Buffer: 256 entries, 4-way set associative (shared among all contexts)

Cache Hierarchy
  Cache Line Size: 64 bytes
  Icache: 128KB, 2-way set associative, dual-ported, 2 cycle latency
  Dcache: 128KB, 2-way set associative, dual-ported (from CPU, read & write), single-ported (from the L2), 2 cycle latency
  L2 Cache: 16MB, direct mapped, 23 cycle latency, fully pipelined (1 access per cycle)
  MSHR: 32 entries for the L1 cache, 32 entries for the L2 cache
  Store Buffer: 32 entries
  ITLB & DTLB: 128 entries, fully associative
  L1-L2 Bus: 256 bits wide
  Memory Bus: 128 bits wide
  Physical Memory: 128MB, 90 cycle latency, fully pipelined

Table 1: SMT parameters.

Our simulator speculates past an unlimited number of branches, although in practice it speculates only past 1.4 on average and almost never (less than 0.06% of cycles) past more than 5 branches. In exploring the limits of speculation's effectiveness, we also varied the number of hardware contexts from 1 to 16. Finally, for the comparisons between SMT and superscalar processors we use a superscalar with the same hardware components as our SMT model but with a shorter pipeline, made possible by the superscalar's smaller register file.

2.2 Workload

We use three multiprogrammed workloads: SPECINT95, SPECFP95 [21], and a combination of four applications from each suite, INT+FP. In addition we used the Apache web server, an open source web server run by the majority of web sites [10]. We drive Apache with SPECWEB96 [25], a standard web server performance benchmark. Each workload serves a different purpose in the experiments. The integer benchmarks are our dominant workload and were chosen because
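The thread-selection step of ICOUNT described in Section 2.1 can be summarized in code. The following is a minimal sketch, assuming a simple per-thread counter of instructions in the pre-issue pipeline stages; the names (thread_ctx, preissue_count, icount_select) are illustrative and are not taken from SMTSIM or our simulator.

#include <limits.h>
#include <stdbool.h>

#define NUM_CONTEXTS      8   /* hardware thread contexts (Table 1)        */
#define FETCH_WIDTH       8   /* instructions fetched per selected thread  */
#define THREADS_PER_FETCH 2   /* ICOUNT fetches from up to 2 threads/cycle */

typedef struct {
    bool active;              /* context currently holds a runnable thread */
    int  preissue_count;      /* instructions in decode/rename/queue stages */
} thread_ctx;

/* Pick up to two active threads with the fewest pre-issue instructions.
 * Returns how many threads were selected; their ids are placed in chosen[]. */
int icount_select(const thread_ctx ctx[NUM_CONTEXTS],
                  int chosen[THREADS_PER_FETCH])
{
    int  n = 0;
    bool used[NUM_CONTEXTS] = { false };

    for (int slot = 0; slot < THREADS_PER_FETCH; slot++) {
        int best = -1;
        int best_count = INT_MAX;
        for (int t = 0; t < NUM_CONTEXTS; t++) {
            if (ctx[t].active && !used[t] &&
                ctx[t].preissue_count < best_count) {
                best = t;
                best_count = ctx[t].preissue_count;
            }
        }
        if (best < 0)
            break;            /* fewer than two runnable threads */
        used[best] = true;
        chosen[n++] = best;
    }
    return n;
}

A fetch stage built on this selection would then take up to FETCH_WIDTH instructions (or to the end of the cache line) from chosen[0] until it encounters a branch, and fill any remaining slots from chosen[1], as described above.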
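For readers unfamiliar with McFarling-style hybrid prediction, the sketch below shows one plausible organization using the table sizes from Table 1: a two-level local predictor (a 2K-entry history table indexing a 4K-entry counter table), an 8K-entry global predictor, and an 8K-entry selection (chooser) table. The indexing scheme and field names are assumptions for illustration only, not a description of the simulator's actual implementation.

#include <stdbool.h>
#include <stdint.h>

#define LOCAL_HIST_ENTRIES  2048   /* per-branch local history registers  */
#define LOCAL_PRED_ENTRIES  4096   /* local pattern -> 2-bit counters     */
#define GLOBAL_PRED_ENTRIES 8192   /* global history -> 2-bit counters    */
#define CHOICE_ENTRIES      8192   /* selects local vs. global prediction */

typedef struct {
    uint16_t local_hist[LOCAL_HIST_ENTRIES];  /* recent outcomes per branch */
    uint8_t  local_ctr[LOCAL_PRED_ENTRIES];   /* saturating 2-bit counters  */
    uint8_t  global_ctr[GLOBAL_PRED_ENTRIES];
    uint8_t  choice_ctr[CHOICE_ENTRIES];
    uint16_t ghist;                           /* global branch history      */
} hybrid_pred;

/* A 2-bit saturating counter predicts taken when in its upper half. */
bool counter_taken(uint8_t ctr) { return ctr >= 2; }

bool hybrid_predict(const hybrid_pred *p, uint64_t pc)
{
    uint16_t lh = p->local_hist[pc % LOCAL_HIST_ENTRIES];
    bool local_p   = counter_taken(p->local_ctr[lh % LOCAL_PRED_ENTRIES]);
    bool global_p  = counter_taken(p->global_ctr[p->ghist % GLOBAL_PRED_ENTRIES]);
    bool use_global = counter_taken(p->choice_ctr[p->ghist % CHOICE_ENTRIES]);
    return use_global ? global_p : local_p;
}

On an update, both component counters would be nudged toward the actual branch outcome, the local and global histories shifted, and the chooser counter trained only on branches where the two components disagree; those details are omitted here for brevity.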