
An Evaluation of Speculative Instruction Execution on Simultaneous Multithreaded Processors

Steven Swanson, Luke K. McDowell, Michael M. Swift, Susan J. Eggers and Henry M. Levy
University of Washington

Abstract

Modern superscalar processors rely heavily on speculative execution for performance. For example, our measurements show that on a 6-issue superscalar, 93% of committed instructions for SPECINT95 are speculative. Without speculation, processor resources on such machines would be largely idle. In contrast to superscalars, simultaneous multithreaded (SMT) processors achieve high resource utilization by issuing instructions from multiple threads every cycle. An SMT processor thus has two means of hiding latency: speculation and multithreaded execution. However, these two techniques may conflict; on an SMT processor, wrong-path speculative instructions from one thread may compete with and displace useful instructions from another thread. For this reason, it is important to understand the trade-offs between these two latency-hiding techniques, and to ask whether multithreaded processors should speculate differently than conventional superscalars.

This paper evaluates the behavior of instruction speculation on SMT processors using both multiprogrammed (SPECINT and SPECFP) and multithreaded (the Apache Web server) workloads. We measure and analyze the impact of speculation and demonstrate how speculation on an 8-wide SMT differs from superscalar speculation. We also examine the effect of speculation-aware fetch and branch prediction policies in the processor. Our results quantify the extent to which (1) speculation is, in fact, critical to performance on a multithreaded processor, and (2) SMT actually enhances the effectiveness of speculative execution, compared to a superscalar processor.

Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures, C.4 [Performance of Systems], C.5 [Computer System Implementation].

General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: Instruction-level parallelism, multiprocessors, multithreading, simultaneous multithreading, speculation, thread-level parallelism.

1 Introduction

Instruction speculation is a crucial component of modern superscalar processors. Speculation hides branch latencies and thereby boosts performance by executing the likely branch path without stalling. Branch predictors, which provide accuracies up to 96% (excluding OS code) [8], are the key to effective speculation. The primary disadvantage of speculation is that some processor resources are invariably allocated to useless, wrong-path instructions that must be flushed from the pipeline. However, since resources are often underutilized on superscalars because of low single-thread instruction-level parallelism (ILP) [28, 3], the benefit of speculation far outweighs this disadvantage and the decision to speculate as aggressively as possible is an easy one.

In contrast to superscalars, simultaneous multithreading (SMT) processors [28, 27] operate with high processor utilization, because they issue and execute instructions from multiple threads each cycle, with all threads dynamically sharing hardware resources. If some threads have low ILP, utilization is improved by executing instructions from additional threads; if only one or a few threads are executing, then all critical hardware resources are available to them. Consequently, instruction throughput on a fully loaded SMT processor is two to four times higher than on a superscalar with comparable hardware on a variety of integer, scientific, database, and web service workloads [16, 15, 20].
With its high hardware utilization, speculation on an SMT may harm rather than improve performance, particularly with all hardware contexts occupied. Speculative (and potentially wasteful) instructions from one thread compete with useful, non-speculative instructions from other threads for highly utilized hardware resources, and in some cases displace them, lowering performance. Therefore, it is important to understand the behavior of speculation on an SMT processor and the extent to which it helps or hinders performance.

This paper investigates the interactions between instruction speculation and multithreading and quantifies the impact of speculation on the performance of SMT processors. Our analyses are based on five different workloads (including all operating system code): SPECINT95, SPECFP95, a combination of the two, the Apache Web server, and a synthetic workload that allows us to manipulate basic-block length and available ILP. Using these workloads, we carefully examine how speculative instructions behave on SMT, as well as how and when SMT should speculate.

We attempt to improve SMT performance by reducing wrong-path speculative instructions, either by not speculating at all or by using speculation-aware fetch policies (including policies that incorporate confidence estimators). To explain the results, we investigate which hardware structures and pipeline stages speculation affects, and how speculation on SMT processors differs from speculation on a traditional superscalar. Finally, we explore the boundaries of speculation's usefulness on SMT by increasing the number of hardware threads and using synthetic workloads to change the branch frequency and ILP within threads.

Our results show that despite its potential downside, speculation increases SMT performance by 9% to 32% over a non-speculating SMT. Furthermore, modulating speculation with either a speculation-aware fetch policy or confidence estimation never improved, and usually lowered, performance. Two separate factors explain this behavior. First, SMT processors actually rely on speculation to provide a rich, cross-thread instruction mix to fully utilize the functional units. While sophisticated speculation policies can reduce speculation's negative effects, they simultaneously restrict the diversity of threads occupying the pre-execution stages of the pipeline. Consequently, SMT only benefits from less aggressive speculation when executing unlikely workloads or on extremely aggressive hardware designs. Second, and somewhat surprisingly, SMT enhances the effectiveness of speculation by decreasing the percentage of speculative instructions on the wrong path. SMT reduced the percentage of wrong-path speculative instructions by 29% when compared to a superscalar with comparable execution resources.

The remainder of this paper is organized as follows. The next section details the methodology for our experiments. Section 3 presents the basic speculation results and explains why and how speculation benefits SMT performance; it also presents alternative fetch and prediction schemes and shows why they fall short. Section 4 explores the effects of software and microarchitectural parameters on speculation. Finally, Section 5 discusses related work and Section 6 summarizes our findings.

2 Methodology

2.1 Simulator

Our SMT simulator is based on the SMTSIM simulator [26] and has been ported to the SimOS framework [22, 2, 20]. It simulates the full pipeline and memory hierarchy, including bank conflicts and bus contention, for both the applications and the operating system.
The baseline configuration for our experiments is shown in Table 1. For most experiments we used the ICOUNT fetch policy [27]. ICOUNT gives priority to threads with the fewest instructions in the pre-issue stages of the pipeline and fetches 8 instructions (or to the end of the cache line) from each of the two highest priority threads. From these instructions, it chooses up to 8 to issue, selecting from the highest priority thread until a branch instruction is encountered, then taking the remainder from the second thread. In addition to ICOUNT, we also experimented with three alternative fetch policies. The first does not speculate at all, i.e., instruction fetching for a particular thread stalls until the branch is resolved; instead, instructions are selected only from the non-speculative threads using ICOUNT. The second favors non-speculating threads by fetching instructions from threads whose next instructions are non-speculative before fetching from threads with speculative instructions; ties are broken with ICOUNT. The third uses branch confidence estimators to favor threads with high-confidence branches.

Our baseline experiments used the McFarling branch prediction algorithm [19] used on modern processors from Compaq; for some studies we augmented this with confidence estimators. (Sketches of the ICOUNT selection step and of a McFarling-style hybrid predictor appear at the end of this section.)

CPU
  Thread Contexts: 8
  Pipeline: 9 stages, 7 cycle misprediction penalty
  Fetch Policy: 8 instructions per cycle from up to 2 contexts (the ICOUNT scheme of [27])
  Functional Units: 6 integer (including 4 load/store and 2 synchronization units), 4 floating point
  Instruction Queues: 32-entry integer and floating point queues
  Renaming Registers: 100 integer and 100 floating point
  Retirement Bandwidth: 12 instructions/cycle
  Branch Predictor: McFarling-style hybrid predictor [19] (shared among all contexts)
  Local Predictor: 4K-entry prediction table, indexed by 2K-entry history table
  Global Predictor: 8K entries, 8K-entry selection table
  Branch Target Buffer: 256 entries, 4-way set associative (shared among all contexts)

Cache Hierarchy
  Cache Line Size: 64 bytes
  Icache: 128KB, 2-way set associative, dual-ported, 2 cycle latency
  Dcache: 128KB, 2-way set associative, dual-ported (from CPU, read & write), single-ported (from the L2), 2 cycle latency
  L2 Cache: 16MB, direct mapped, 23 cycle latency, fully pipelined (1 access per cycle)
  MSHR: 32 entries for the L1 cache, 32 entries for the L2 cache
  Store Buffer: 32 entries
  ITLB & DTLB: 128 entries, fully associative
  L1-L2 Bus: 256 bits wide
  Memory Bus: 128 bits wide
  Physical Memory: 128MB, 90 cycle latency, fully pipelined

Table 1: SMT parameters.

Our simulator speculates past an unlimited number of branches, although in practice it speculates only past 1.4 on average and almost never (less than 0.06% of cycles) past more than 5 branches. In exploring the limits of speculation's effectiveness, we also varied the number of hardware contexts from 1 to 16. Finally, for the comparisons between SMT and superscalar processors we use a superscalar with the same hardware components as our SMT model but with a shorter pipeline, made possible by the superscalar's smaller register file.

2.2 Workload

We use three multiprogrammed workloads: SPECINT95, SPECFP95 [21], and a combination of four applications from each suite, INT+FP. In addition we used the Apache web server, an open source web server run by the majority of web sites [10]. We drive Apache with SPECWEB96 [25], a standard web server performance benchmark. Each workload serves a different purpose in the experiments. The integer benchmarks are our dominant workload and were chosen because
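The thread-selection step of ICOUNT described in Section 2.1 can be summarized in code. The following is a minimal sketch, assuming a simple per-thread counter of instructions in the pre-issue pipeline stages; the names (thread_ctx, preissue_count, icount_select) are illustrative and are not taken from SMTSIM or our simulator.

#include <limits.h>
#include <stdbool.h>

#define NUM_CONTEXTS      8   /* hardware thread contexts (Table 1)        */
#define FETCH_WIDTH       8   /* instructions fetched per selected thread  */
#define THREADS_PER_FETCH 2   /* ICOUNT fetches from up to 2 threads/cycle */

typedef struct {
    bool active;              /* context currently holds a runnable thread */
    int  preissue_count;      /* instructions in decode/rename/queue stages */
} thread_ctx;

/* Pick up to two active threads with the fewest pre-issue instructions.
 * Returns how many threads were selected; their ids are placed in chosen[]. */
int icount_select(const thread_ctx ctx[NUM_CONTEXTS],
                  int chosen[THREADS_PER_FETCH])
{
    int  n = 0;
    bool used[NUM_CONTEXTS] = { false };

    for (int slot = 0; slot < THREADS_PER_FETCH; slot++) {
        int best = -1;
        int best_count = INT_MAX;
        for (int t = 0; t < NUM_CONTEXTS; t++) {
            if (ctx[t].active && !used[t] &&
                ctx[t].preissue_count < best_count) {
                best = t;
                best_count = ctx[t].preissue_count;
            }
        }
        if (best < 0)
            break;            /* fewer than two runnable threads */
        used[best] = true;
        chosen[n++] = best;
    }
    return n;
}

A fetch stage built on this selection would then take up to FETCH_WIDTH instructions (or to the end of the cache line) from chosen[0] until it encounters a branch, and fill any remaining slots from chosen[1], as described above.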
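For readers unfamiliar with McFarling-style hybrid prediction, the sketch below shows one plausible organization using the table sizes from Table 1: a two-level local predictor (a 2K-entry history table indexing a 4K-entry counter table), an 8K-entry global predictor, and an 8K-entry selection (chooser) table. The indexing scheme and field names are assumptions for illustration only, not a description of the simulator's actual implementation.

#include <stdbool.h>
#include <stdint.h>

#define LOCAL_HIST_ENTRIES  2048   /* per-branch local history registers  */
#define LOCAL_PRED_ENTRIES  4096   /* local pattern -> 2-bit counters     */
#define GLOBAL_PRED_ENTRIES 8192   /* global history -> 2-bit counters    */
#define CHOICE_ENTRIES      8192   /* selects local vs. global prediction */

typedef struct {
    uint16_t local_hist[LOCAL_HIST_ENTRIES];  /* recent outcomes per branch */
    uint8_t  local_ctr[LOCAL_PRED_ENTRIES];   /* saturating 2-bit counters  */
    uint8_t  global_ctr[GLOBAL_PRED_ENTRIES];
    uint8_t  choice_ctr[CHOICE_ENTRIES];
    uint16_t ghist;                           /* global branch history      */
} hybrid_pred;

/* A 2-bit saturating counter predicts taken when in its upper half. */
bool counter_taken(uint8_t ctr) { return ctr >= 2; }

bool hybrid_predict(const hybrid_pred *p, uint64_t pc)
{
    uint16_t lh = p->local_hist[pc % LOCAL_HIST_ENTRIES];
    bool local_p   = counter_taken(p->local_ctr[lh % LOCAL_PRED_ENTRIES]);
    bool global_p  = counter_taken(p->global_ctr[p->ghist % GLOBAL_PRED_ENTRIES]);
    bool use_global = counter_taken(p->choice_ctr[p->ghist % CHOICE_ENTRIES]);
    return use_global ? global_p : local_p;
}

On an update, both component counters would be nudged toward the actual branch outcome, the local and global histories shifted, and the chooser counter trained only on branches where the two components disagree; those details are omitted here for brevity.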