An Evaluation of Speculative Instruction Execution on Simultaneous Multithreaded Processors

Steven Swanson, Luke K. McDowell, Michael M. Swift, Susan J. Eggers and Henry M. Levy
University of Washington

Abstract

Modern superscalar processors rely heavily on speculative execution for performance. For example, our measurements show that on a 6-issue superscalar, 93% of committed instructions for SPECINT95 are speculative. Without speculation, resources on such machines would be largely idle. In contrast to superscalars, simultaneous multithreaded (SMT) processors achieve high resource utilization by issuing instructions from multiple threads every cycle. An SMT processor thus has two means of hiding latency: speculation and multithreaded execution. However, these two techniques may conflict; on an SMT processor, wrong-path speculative instructions from one thread may compete with and displace useful instructions from another thread. For this reason, it is important to understand the trade-offs between these two latency-hiding techniques, and to ask whether multithreaded processors should speculate differently than conventional superscalars.

This paper evaluates the behavior of instruction speculation on SMT processors using both multiprogrammed (SPECINT and SPECFP) and multithreaded (the Apache Web server) workloads. We measure and analyze the impact of speculation and demonstrate how speculation on an 8-wide SMT differs from superscalar speculation. We also examine the effect of speculation-aware fetch and branch prediction policies in the processor. Our results quantify the extent to which (1) speculation is, in fact, critical to performance on a multithreaded processor, and (2) SMT actually enhances the effectiveness of speculative execution, compared to a superscalar processor.

Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures; C.4 [Performance of Systems]; C.5 [System Implementation].

General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: Instruction-level parallelism, multiprocessors, multithreading, simultaneous multithreading, speculation, thread-level parallelism.

1 Introduction

Instruction speculation is a crucial component of modern superscalar processors. Speculation hides branch latencies and thereby boosts performance by executing the likely branch path without stalling. Branch predictors, which provide accuracies of up to 96% (excluding OS code) [8], are the key to effective speculation. The primary disadvantage of speculation is that some processor resources are invariably allocated to useless, wrong-path instructions that must be flushed from the pipeline. However, since resources are often underutilized on superscalars because of low single-thread instruction-level parallelism (ILP) [28, 3], the benefit of speculation far outweighs this disadvantage, and the decision to speculate as aggressively as possible is an easy one.

In contrast to superscalars, simultaneous multithreading (SMT) processors [28, 27] operate with high processor utilization, because they issue and execute instructions from multiple threads each cycle, with all threads dynamically sharing hardware resources. If some threads have low ILP, utilization is improved by executing instructions from additional threads; if only one or a few threads are executing, then all critical hardware resources are available to them. Consequently, instruction throughput on a fully loaded SMT processor is two to four times higher than on a superscalar with comparable hardware on a variety of integer, scientific, database, and web service workloads [16, 15, 20].

With its high hardware utilization, speculation on an SMT may harm rather than improve performance, particularly with all hardware contexts occupied. Speculative (and potentially wasteful) instructions from one thread compete with useful, non-speculative instructions from other threads for highly utilized hardware resources, and in some cases displace them, lowering performance.

Therefore, it is important to understand the behavior of speculation on an SMT processor and the extent to which it helps or hinders performance.

This paper investigates the interactions between instruction speculation and multithreading and quantifies the impact of speculation on the performance of SMT processors. Our analyses are based on five different workloads (including all operating system code): SPECINT95, SPECFP95, a combination of the two, the Apache Web server, and a synthetic workload that allows us to manipulate basic-block length and available ILP. Using these workloads, we carefully examine how speculative instructions behave on SMT, as well as how and when SMT should speculate.

We attempt to improve SMT performance by reducing wrong-path speculative instructions, either by not speculating at all or by using speculation-aware fetch policies (including policies that incorporate confidence estimators). To explain the results, we investigate which hardware structures and pipeline stages speculation affects, and how speculation on SMT processors differs from speculation on a traditional superscalar. Finally, we explore the boundaries of speculation's usefulness on SMT by increasing the number of hardware threads and using synthetic workloads to change the branch frequency and ILP within threads.

Our results show that despite its potential downside, speculation increases SMT performance by 9% to 32% over a non-speculating SMT. Furthermore, modulating speculation with either a speculation-aware fetch policy or confidence estimation never improved, and usually lowered, performance. Two separate factors explain this behavior. First, SMT processors actually rely on speculation to provide a rich, cross-thread instruction mix that fully utilizes the functional units. While sophisticated speculation policies can reduce speculation's negative effects, they simultaneously restrict the diversity of threads occupying the pre-execution stages of the pipeline. Consequently, SMT only benefits from less aggressive speculation when executing unlikely workloads or on extremely aggressive hardware designs. Second, and somewhat surprisingly, SMT enhances the effectiveness of speculation by decreasing the percentage of speculative instructions on the wrong path. SMT reduced the percentage of wrong-path speculative instructions by 29% when compared to a superscalar with comparable execution resources.

The remainder of this paper is organized as follows. The next section details the methodology for our experiments. Section 3 presents the basic speculation results and explains why and how speculation benefits SMT performance; it also presents alternative fetch and prediction schemes and shows why they fall short. Section 4 explores the effects of software and microarchitectural parameters on speculation. Finally, Section 5 discusses related work and Section 6 summarizes our findings.

2 Methodology

2.1 Simulator

Our SMT simulator is based on the SMTSIM simulator [26] and has been ported to the SimOS framework [22, 2, 20]. It simulates the full pipeline and memory hierarchy, including bank conflicts and contention, for both the applications and the operating system.

The baseline configuration for our experiments is shown in Table 1. For most experiments we used the ICOUNT fetch policy [27]. ICOUNT gives priority to threads with the fewest instructions in the pre-issue stages of the pipeline and fetches 8 instructions (or to the end of the cache line) from each of the two highest-priority threads. From these instructions, it chooses up to 8 to issue, selecting from the highest-priority thread until a branch instruction is encountered, then taking the remainder from the second thread. In addition to ICOUNT, we also experimented with three alternative fetch policies. The first does not speculate at all, i.e., instruction fetching for a particular thread stalls until the branch is resolved; instead, instructions are selected only from the non-speculative threads using ICOUNT. The second favors non-speculating threads by fetching instructions from threads whose next instructions are non-speculative before fetching from threads with speculative instructions; ties are broken with ICOUNT. The third uses branch confidence estimators to favor threads with high-confidence branches.
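To make the selection rules concrete, the following is a minimal sketch of ICOUNT-style thread selection and the non-speculative-first variant, assuming a hypothetical Thread record with an icount field (instructions in the pre-issue stages) and a speculating flag; it illustrates the priority rules described above, not the simulator's implementation.

```python
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    icount: int          # instructions currently in the pre-issue stages
    speculating: bool    # next fetch would be past an unresolved branch

def icount_pick(threads, n=2):
    """ICOUNT: fetch from the n threads with the fewest pre-issue instructions."""
    return sorted(threads, key=lambda t: t.icount)[:n]

def nonspec_first_pick(threads, n=2):
    """Variant that favors non-speculating threads, breaking ties with ICOUNT."""
    return sorted(threads, key=lambda t: (t.speculating, t.icount))[:n]

# Example: thread 2 is non-speculative, so the variant fetches it first even
# though thread 0 has fewer pre-issue instructions.
pool = [Thread(0, 4, True), Thread(1, 9, True), Thread(2, 6, False)]
print([t.tid for t in icount_pick(pool)])         # [0, 2]
print([t.tid for t in nonspec_first_pick(pool)])  # [2, 0]
```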

Our baseline experiments used the McFarling branch prediction algorithm [19] employed on modern processors from Compaq; for some studies we augmented it with confidence estimators.
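As a rough illustration of how a McFarling-style hybrid predictor combines its components, here is a minimal sketch using the table sizes from Table 1 (a 2K-entry per-branch history table feeding a 4K-entry local pattern table, an 8K-entry global table, and an 8K-entry selection table); the index hashing, counter initialization, and update rules are our simplifying assumptions, not details taken from the paper or from Compaq's implementation.

```python
class HybridPredictor:
    def __init__(self):
        self.local_hist = [0] * 2048          # per-branch history registers
        self.local_ctr  = [1] * 4096          # 2-bit counters, start weakly not-taken
        self.global_ctr = [1] * 8192          # global-history-indexed 2-bit counters
        self.chooser    = [2] * 8192          # 2-bit chooser: >= 2 favors global
        self.ghr = 0                          # global history register

    def predict(self, pc):
        l_idx = self.local_hist[pc % 2048] % 4096
        g_idx = (self.ghr ^ pc) % 8192        # gshare-style global index (assumption)
        local_taken  = self.local_ctr[l_idx]  >= 2
        global_taken = self.global_ctr[g_idx] >= 2
        use_global   = self.chooser[self.ghr % 8192] >= 2
        return global_taken if use_global else local_taken

    def update(self, pc, taken):
        l_slot = pc % 2048
        l_idx  = self.local_hist[l_slot] % 4096
        g_idx  = (self.ghr ^ pc) % 8192
        c_idx  = self.ghr % 8192
        local_ok  = (self.local_ctr[l_idx]  >= 2) == taken
        global_ok = (self.global_ctr[g_idx] >= 2) == taken
        # Train the chooser toward whichever component was right.
        if global_ok and not local_ok:
            self.chooser[c_idx] = min(3, self.chooser[c_idx] + 1)
        elif local_ok and not global_ok:
            self.chooser[c_idx] = max(0, self.chooser[c_idx] - 1)
        # Train both component counters, then both histories.
        delta = 1 if taken else -1
        self.local_ctr[l_idx]  = max(0, min(3, self.local_ctr[l_idx]  + delta))
        self.global_ctr[g_idx] = max(0, min(3, self.global_ctr[g_idx] + delta))
        self.local_hist[l_slot] = ((self.local_hist[l_slot] << 1) | int(taken)) & 0xFFF
        self.ghr = ((self.ghr << 1) | int(taken)) & 0x1FFF

bp = HybridPredictor()
for _ in range(20):
    bp.update(pc=0x120003ab0, taken=True)
print(bp.predict(0x120003ab0))   # True after repeatedly seeing the branch taken
```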

CPU
  Thread Contexts:      8
  Pipeline:             9 stages, 7-cycle misprediction penalty
  Fetch Policy:         8 instructions per cycle from up to 2 contexts (the ICOUNT scheme of [27])
  Functional Units:     6 integer (including 4 load/store and 2 synchronization units), 4 floating point
  Instruction Queues:   32-entry integer and floating point queues
  Renaming Registers:   100 integer and 100 floating point
  Retirement Bandwidth: 12 instructions/cycle
  Branch Predictor:     McFarling-style hybrid predictor [19] (shared among all contexts)
  Local Predictor:      4K-entry prediction table, indexed by 2K-entry history table
  Global Predictor:     8K entries, 8K-entry selection table
  Branch Target Buffer: 256 entries, 4-way set associative (shared among all contexts)

Cache Hierarchy
  Cache Line Size:      64 bytes
  Icache:               128KB, 2-way set associative, dual-ported, 2-cycle latency
  Dcache:               128KB, 2-way set associative, dual-ported (from CPU, r&w), single-ported (from the L2), 2-cycle latency
  L2 cache:             16MB, direct mapped, 23-cycle latency, fully pipelined (1 access per cycle)
  MSHR:                 32 entries for the L1 cache, 32 entries for the L2 cache
  Store Buffer:         32 entries
  ITLB & DTLB:          128 entries, fully associative
  L1-L2 bus:            256 bits wide
  Memory bus:           128 bits wide
  Physical Memory:      128MB, 90-cycle latency, fully pipelined

Table 1: SMT parameters.

Our simulator speculates past an unlimited number of branches, although in practice it speculates only past 1.4 on average and almost never (less than 0.06% of cycles) past more than 5 branches.

In exploring the limits of speculation's effectiveness, we also varied the number of hardware contexts from 1 to 16. Finally, for the comparisons between SMT and superscalar processors we use a superscalar with the same hardware components as our SMT model but with a shorter pipeline, made possible by the superscalar's smaller register file.

2.2 Workload

We use three multiprogrammed workloads: SPECINT95, SPECFP95 [21], and a combination of four applications from each suite, INT+FP. In addition we used the Apache web server, an open source web server run by the majority of web sites [10]. We drive Apache with SPECWEB96 [25], a standard web server performance benchmark. Each workload serves a different purpose in the experiments. The integer benchmarks are our dominant workload and were chosen because their frequent, less predictable branches (relative to floating point programs) provide many opportunities for speculation to affect performance. Apache was chosen because over three-quarters of its execution occurs in the operating system, whose branch behavior is also less predictable [1, 5], and because it represents the server workloads that constitute one of SMT's target domains. We selected the floating point suite because it contains loop-based code with large basic blocks and more predictable branches (than integer code), providing an important perspective on workloads where speculation is less frequent. Finally, following the example of Snavely and Tullsen [34], we combined floating point and integer code to understand how interactions between different types of applications affect our results.

We also used a synthetic workload to explore how branch prediction accuracy, branch frequency, and the amount of ILP affect speculation on an SMT. The synthetic program executes a continuous stream of instructions separated by branches. We varied the average number and independence of instructions between branches across experiments, and the prediction accuracy of the branches is set by a command-line argument to the simulator.
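For concreteness, the sketch below shows one way such a synthetic stream could be generated; the toy instruction format, the chain-based way of controlling independence, and the function names are illustrative assumptions, since the paper does not describe the generator's internals (the branch prediction accuracy is imposed separately in the simulator).

```python
import random

def synthetic_block(block_size, ilp):
    """Emit one basic block: `block_size` ALU ops ending in a branch.

    Independence is controlled by splitting the ops into `ilp` chains; each op
    depends only on the previous op in its own chain, so up to `ilp` ops can
    issue per cycle.
    """
    ops = [f"add r{i % ilp}, r{i % ilp}, 1" for i in range(block_size)]
    ops.append("bne r0, loop_top")        # branch that ends the block
    return ops

def synthetic_stream(mean_block_size, ilp, n_blocks, seed=0):
    """Emit a stream of blocks whose sizes vary around mean_block_size."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n_blocks):
        size = max(1, mean_block_size + rng.randint(-2, 2))
        stream.extend(synthetic_block(size, ilp))
    return stream

for op in synthetic_stream(mean_block_size=8, ilp=4, n_blocks=2):
    print(op)
```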

We execute all of our workloads under the Compaq Tru64 Unix 4.0d operating system; the simulation includes all OS privileged code, interrupts, drivers, and Alpha PAL code. The operating system execution accounts for only a small portion of the cycles executed for the SPEC workloads (about 5%), while the majority of cycles (77%) for the Apache Web server are spent inside the OS managing the network and disk.

Most experiments include 200 million cycles of simulation starting from a point 600 million instructions into each program (simulated in ‘fast mode’). The synthetic benchmarks, owing to their simple behavior and small size (there is no need to warm the L2 cache), were simulated for only 1 million cycles each; longer simulations had no significant effect on the results. For machine configurations with more than 8 contexts, we ran multiple instances of some of the applications.

2.3 Metrics and Fairness

Changing the fetch policy of an SMT necessarily changes which instructions execute and in what order. Different policies affect each thread differently and, as a result, the threads may execute more or fewer instructions over a 200 million cycle simulation. Consequently, directly comparing the total IPC under two different fetch policies may not be fair, since a different mix of instructions is executed, and the contribution of each thread to the bottom-line IPC changes.

We can resolve this problem by following the example set by the SPECrate metric [24] and averaging performance across threads instead of cycles. The SPECrate is the percent increase in throughput (IPC) relative to a baseline for each thread, combined using the geometric mean. Following this example, we computed the geometric mean of the threads' speedups in IPC relative to their performance on a machine using the baseline ICOUNT fetch policy and executing the same threads on the same number of contexts. Finally, because our workload contains some threads (such as interrupt handlers) that run for only a small fraction of total simulation cycles, we weighted the per-thread speedups by the number of cycles each thread was scheduled in a context.
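The sketch below computes this cycle-weighted geometric mean of per-thread IPC speedups; the tuple format and the numbers in the example are hypothetical, and the calculation is simply our reading of the description above.

```python
import math

def weighted_speedup(threads):
    """threads: list of (ipc_new, ipc_baseline, cycles_scheduled) tuples."""
    total_weight = sum(cycles for _, _, cycles in threads)
    log_sum = sum(cycles * math.log(new / base) for new, base, cycles in threads)
    return math.exp(log_sum / total_weight)   # cycle-weighted geometric mean

# Example: two long-running threads and a short-lived interrupt handler.
print(weighted_speedup([(2.1, 2.0, 180e6), (1.8, 2.0, 180e6), (0.5, 0.4, 5e6)]))
```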

Using this technique we computed an average speedup across all threads. We then compared this value to a speedup calculated using just the total IPC of the workload. We found that the two metrics produced very similar results, differing on average by just 1% and at most by 5%. Moreover, none of the performance trends or conclusions changed based on which metric was used. Consequently, for the configurations we consider, using total IPC to compare performance is accurate. Since IPC is a more intuitive metric to discuss than the speedup averaged over threads, in this paper we report only the IPC for each experiment.

3 Speculation on SMT

This section presents the results of our simulation experiments on instruction speculation for SMT. Our goal is to understand the trade-offs between two alternative means of hiding branch delays: instruction speculation and SMT's ability to execute multiple threads each cycle. First, we compare the performance of an SMT processor with and without speculation and analyze the differences between these two options. We then discuss the impact of speculation-aware fetch policies and the use of branch prediction confidence estimators on speculation performance.

3.1 The behavior of speculative instructions

As a first task, we modified our SMT simulator to turn off speculation (i.e., the processor never fetches past a branch until it has resolved the branch) and compared the throughput in instructions per cycle of our four workloads on the speculative and non-speculative SMT CPUs. The results of these measurements are shown in Table 2. Speculation benefits SMT performance on all four workloads: the speculative SMT achieves performance gains of between 9% and 32% over the non-speculative processor.

                                 SPECINT95   SPECFP95   INT+FP   Apache
  IPC with speculation              5.2         6.0       6.0      4.5
  IPC without speculation           4.2         5.5       5.5      3.4
  Improvement from speculation      24%         9%        9%       32%

Table 2: Effect of speculation on SMT. We simulated each of the four workloads on machines with and without speculation. Apache, with its small basic blocks and poor branch prediction, derives the most performance from speculation, while the more predictable floating point benchmarks benefit least.

Speculation can have effects throughout the pipeline and the memory system. For example, speculation could pollute the cache with instructions that will never be executed or, alternatively, prefetch instructions before they are needed, eliminating future cache misses. Neither of these effects appears in our simulation: changing the speculation policy never alters the percentage of cache hits by more than 0.4%.

To understand why this benefit occurs, how speculative instructions execute on an SMT processor, and how they affect its performance and resource utilization, we categorized instructions according to their speculation behavior:

• non-speculative instructions are those fetched non-speculatively, i.e., they always perform useful work;
• correct-path-speculative instructions are fetched speculatively, are on the correct path of execution, and therefore accomplish useful work;
• wrong-path-speculative instructions are fetched speculatively, but lie on incorrect execution paths, and are thus ultimately flushed from the execution pipeline (and consequently waste hardware resources).

Using this categorization, we followed all instructions through the execution pipeline. At each pipeline stage we measured the average number of each of the three instruction types that leaves that stage each cycle. We call these values the correct-path-, wrong-path-, and non-speculative per-stage IPCs. The overall machine IPC is the sum of the correct-path-speculative and non-speculative commit IPCs. Figure 1 depicts the per-stage instruction categories for SPECINT95 (footnote 1).
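The per-stage accounting amounts to a simple tally; a minimal sketch follows, assuming a hypothetical event trace with one (stage, category) record per instruction per stage it leaves (the stage names are abbreviations, not the simulator's exact pipeline labels).

```python
from collections import defaultdict

CATEGORIES = ("non-spec", "correct-spec", "wrong-spec")

def per_stage_ipc(events, total_cycles):
    """events: iterable of (stage, category) pairs, one record per instruction
    per pipeline stage it leaves.  Returns {stage: {category: per-stage IPC}}."""
    counts = defaultdict(lambda: dict.fromkeys(CATEGORIES, 0))
    for stage, cat in events:
        counts[stage][cat] += 1
    return {stage: {cat: n / total_cycles for cat, n in cats.items()}
            for stage, cats in counts.items()}

# Toy trace: over 2 cycles, 4 instructions leave fetch and 3 leave commit
# (the wrong-path one was flushed before commit).
trace = [("fetch", "non-spec"), ("fetch", "correct-spec"),
         ("fetch", "correct-spec"), ("fetch", "wrong-spec"),
         ("commit", "non-spec"), ("commit", "correct-spec"),
         ("commit", "correct-spec")]
ipc = per_stage_ipc(trace, total_cycles=2)
# Overall IPC = non-spec + correct-spec commit IPC = 0.5 + 1.0 = 1.5
print(ipc["commit"])
```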

Figure 1a shows why speculation is crucial to high instruction throughput and explains why misspeculation does not waste hardware resources. Speculative instructions on an SMT comprise the majority of instructions fetched, executed, and committed: for SPECINT95, 57% of fetch IPC, 53% of instructions issued to the functional units, and 52% of commit IPC are speculative. Given the magnitude of these numbers and the accuracy of today's branch prediction hardware, it would be extremely surprising if ceasing to speculate or significantly reducing the number of speculative instructions improved performance.

Footnote 1: While the bottom-line IPC of the four workloads varies considerably, the trends we describe in the next few paragraphs are remarkably consistent across all of them; e.g., the same fraction of wrong-path speculative instructions reaches the execute stage for all four workloads. Furthermore, the conclusions for SPECINT95 are applicable to the other three workloads, suggesting that the behavior is fundamental to SMT, rather than being workload dependent. Because of this, we present data only for SPECINT95 in many of the figures.

Figure 1. Per-pipeline-stage IPC for SPECINT95, divided between correct-path-, wrong-path-, and non-speculative instructions. On the left, (a) SMT with ICOUNT; on the right, (b) SMT with a fetch policy that favors non-speculative instructions.

Speculation is particularly effective on SMT for two reasons. First, since SMT fetches from each thread only once every 5.4 cycles on average (as opposed to almost every cycle for the single-threaded superscalar), it speculates less aggressively past branches (past 1.4 branches on average, compared to 3.5 branches on a superscalar). This causes the percentage of speculative instructions fetched to decline from 93% on a superscalar to 57% on SMT. More important, it also reduces the percentage of speculative instructions on the wrong path: because an SMT processor makes less progress down speculative paths, it avoids multiple levels of speculative branches, which impose higher (compounded) misprediction rates. For the SPECINT benchmarks, for example, 19% of speculative instructions on SMT are wrong path, compared to 28% on a superscalar. Therefore, SMT receives significant benefit from speculation at a lower cost compared to a superscalar.
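To see why deeper speculation compounds, consider a first-order model (our simplification, assuming independent branches each predicted with accuracy p): an instruction fetched past k unresolved branches is on the correct path with probability p^k, so

```latex
\Pr[\text{correct path after } k \text{ branches}] = p^{k},
\qquad 1 - 0.88^{1.4} \approx 0.16,
\qquad 1 - 0.88^{3.5} \approx 0.36 .
```

With p = 0.88, the average speculation depths above (1.4 branches on SMT versus 3.5 on the superscalar) give roughly 16% versus 36% wrong-path speculative instructions. The measured 19% and 28% differ because real speculation depths vary around those averages and mispredictions are not independent, but the direction of the effect is the same.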

Second, the data show that speculation is not particularly wasteful on SMT. Branch prediction accuracy in this experiment was 88% (footnote 2) and only 11% of fetched instructions were flushed from the pipeline. 73% of the wrong-path-speculative instructions were removed from the pipeline before they reached the functional units, consuming resources only in the form of integer instruction queue entries, renaming registers, and fetch bandwidth. Both the instruction queue (IQ) and the pool of renaming registers are adequately sized: the IQ is full only 4.3% of cycles and renaming registers are exhausted only 0.3% of cycles. (Doubling the integer IQ size reduced queue overflow to 0.4% of cycles, but raised IPC by only 1.8%, confirming that the integer IQ is not a serious bottleneck; [27] reports a similar result.) Thus, IQ entries and renaming registers are not highly contended. This leaves fetch bandwidth as the only resource that speculation wastes significantly and suggests that modifying the fetch policy might improve performance. We address this question in Section 3.2.

Footnote 2: The prediction rate is lower than the value found in [8] because we include the operating system code.

Without speculation, only non-speculative instructions use processor resources and SMT devotes no processor resources to wrong-path instructions. However, in avoiding wrong-path instructions, SMT leaves many of its hardware resources idle. Fetch stall cycles, i.e., cycles when no thread was fetched, rose almost five-fold; consequently, per-stage IPCs dropped between 19% and 29%. Functional unit utilization dropped by 20%, and commit IPC, the bottom-line metric for SMT performance, was 4.2, a 19% loss compared to an SMT that speculates. Contrary to the usual hypothesis, not speculating wastes more resources than speculating!

3.2 Fetch policies

It is possible that more speculation-aware fetch policies might outperform SMT's default fetch algorithm, ICOUNT, reducing the number of wrong-path instructions while increasing the number of correct-path and non-speculative instructions. To investigate these possibilities, we compared SMT with ICOUNT to an SMT with two alternative fetch policies: one that favors non-speculating threads and a family of fetch policies that incorporate branch prediction confidence.

3.2.1 Favoring non-speculative contexts

A fetch policy that favors non-speculative contexts (see Figure 1b) increased the proportion of non-speculative instructions fetched by an average of 44% and decreased correct-path- and wrong-path-speculative instructions by an average of 33% and 39%, respectively. Despite the moderate shift to useful instructions (wrong-path-speculative instructions were reduced from 11% to 7% of the workload), the effect on commit IPC was negligible. This lack of improvement in IPC will be addressed again and explained in Section 3.3.

3.2.2 Using Confidence Estimators

Researchers have proposed several hardware structures that assign confidence levels to branch predictions, with the goal of reducing the number of wrong-path speculations [11, 7]. Each dynamic branch receives a confidence rating: a high value for branches that are usually predicted correctly and a low value for misbehaving branches. Several groups have suggested using confidence estimators on SMT to reduce wrong-path-speculative instructions and thus improve performance [11, 18]. In our study we examined three different confidence estimators discussed in [7, 11]:

• The JRS estimator uses a table that is indexed by the PC xor'ed with the global branch history register. The table contains counters that are incremented when the predictor is correct and reset on an incorrect prediction.
• The strong-count estimator uses the counters in the local and global predictors to assign confidence. The confidence value is the number of counters for the branch (0, 1, or 2) that are in a strongly-taken or strongly-not-taken state (this subsumes the both-strong and either-strong estimators in [7]).
• The distance estimator takes advantage of the fact that mispredictions are clustered. The confidence value for a branch is the number of correct predictions that context has made in a row (globally, not just for this branch).

There are (at least) two different ways to use such confidence information. In the first, hard confidence, the processor stalls the thread on a low-confidence branch, fetching from other threads until the branch is resolved. In the second, soft confidence, the processor assigns fetch priority according to the confidence of the thread's most recent branch.
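A minimal sketch of a JRS-style table combined with a hard-confidence fetch gate appears below; the table size, index hashing, and saturating limit are our assumptions rather than parameters from [11] or from our experiments (Table 3 lists the thresholds we actually evaluated).

```python
class JRSConfidence:
    def __init__(self, entries=1024, max_count=15):
        self.table = [0] * entries
        self.max_count = max_count

    def _index(self, pc, ghr):
        return (pc ^ ghr) % len(self.table)   # PC xor'ed with global history

    def confidence(self, pc, ghr):
        return self.table[self._index(pc, ghr)]

    def update(self, pc, ghr, prediction_correct):
        i = self._index(pc, ghr)
        if prediction_correct:
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = 0                 # reset on a misprediction

def hard_confidence_fetch(estimator, pc, ghr, threshold):
    """True -> follow the prediction; False -> stall this thread until the
    branch resolves, fetching from other threads in the meantime."""
    return estimator.confidence(pc, ghr) >= threshold

est = JRSConfidence()
est.update(pc=0x4ac0, ghr=0b1011, prediction_correct=True)
print(hard_confidence_fetch(est, pc=0x4ac0, ghr=0b1011, threshold=1))  # True
```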

Hard confidence schemes use a confidence threshold to divide branches into high- and low-confidence groups. If the confidence value is above the threshold, the prediction is followed; otherwise, the issuing thread stalls until the branch is resolved. Using hard confidence has two effects. First, it reduces the number of wrong-path-speculative instructions by keeping the processor from speculating on some incorrect predictions (i.e., true negatives). Second, it increases the number of correct predictions the processor ignores (false negatives).

Table 3 contains true and false negatives for the baseline SMT and an SMT with several hard confidence schemes when executing SPECINT95 (SPECFP95, INT+FP, and Apache had similar results).

                                 Wrong-path Predictions Avoided   Correct Predictions Lost   IPC
                                 (true negatives)                 (false negatives)
                                 --------- % of branch instructions ---------
  No confidence estimation               0                              0                    5.2
  JRS (threshold=1)                      2.0                            6.0                  5.2
  JRS (threshold=15)                     7.7                           38.3                  4.8
  Strong (threshold=1: either)           0.7                            3.9                  5.1
  Strong (threshold=2: both)             5.6                           31.9                  4.8
  Distance (threshold=1)                 1.5                            6.6                  5.2
  Distance (threshold=3)                 3.8                           16.2                  5.1
  Distance (threshold=7)                 5.8                           27.9                  4.9

Table 3: Hard confidence performance for SPECINT95. Branch prediction accuracy was 88%.

Since our McFarling branch predictor [19] has high accuracy (workload-dependent prediction accuracies that range from 88% to 99%), the false negatives outnumber the true negatives by between 3 and 6 times. Therefore, although mispredictions declined by 14% to 88% (data not shown), this benefit was offset by lost successful speculation opportunities, and IPC never rose significantly. In the two cases when IPC did increase by a slim margin (less than 0.5%), JRS and Distance each with a threshold of 1, there were frequent ties between many contexts. Since ICOUNT breaks ties, these two schemes end up being quite similar to ICOUNT.

In contrast to hard confidence, the priority that soft confidence calculates is integrated into the fetch policy. We give priority to contexts that aren't speculating, followed by those fetching past a high-confidence branch; ICOUNT breaks any ties. In evaluating soft confidence, we used the same three confidence estimators. Table 4 contains the results for SPECINT95. From the table, we see that soft confidence estimators hurt performance, despite the fact that they reduced wrong-path-speculative instructions from 9% to between 2.9% and 5.9% of instructions fetched.

  Confidence Estimator       IPC    Wrong-path instructions
  No confidence estimation   5.2    9.7%
  JRS                        5.0    4.5%
  Strong                     5.0    5.9%
  Distance                   4.9    2.9%

Table 4: Soft confidence performance for SPECINT95.
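A minimal sketch of the soft-confidence priority described above (our reading of it, not the simulator's code): non-speculating contexts come first, then contexts whose most recent branch has higher confidence, with ICOUNT breaking ties. The Ctx record and its fields are hypothetical.

```python
from collections import namedtuple

Ctx = namedtuple("Ctx", "tid speculating last_branch_confidence icount")

def soft_confidence_pick(contexts, n=2):
    """Non-speculating contexts first, then higher confidence on the most
    recent branch, then fewest pre-issue instructions (ICOUNT tie-break)."""
    key = lambda c: (c.speculating, -c.last_branch_confidence, c.icount)
    return sorted(contexts, key=key)[:n]

pool = [Ctx(0, True, 14, 5), Ctx(1, True, 3, 2), Ctx(2, False, 0, 9)]
print([c.tid for c in soft_confidence_pick(pool)])   # [2, 0]
```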

Overall, then, neither hard nor soft confidence estimators improved SMT performance, and they actually reduced performance in most cases.

3.3 Why Restricting Speculation Hurts SMT Performance

SMT derives its performance benefits from fetching and executing instructions from multiple threads. The greater the number of active hardware contexts, the greater the global (cross-thread) pool of instructions available to hide intra-thread latencies. All the mechanisms we have investigated that restrict speculation do so by eliminating certain threads from consideration for fetching during some period of time, either by assigning them a low priority or by excluding them outright.

The consequence of restricting the pool of fetchable threads is a less diverse thread mix in the instruction queue, where instructions wait to be dispatched to the functional units. When the IQ holds instructions from many threads, the chance that a large number of them are unable to issue is greatly reduced, and SMT can best hide intra-thread latencies. However, when fewer threads are present, it is less able to avoid these delays (footnote 3).

Footnote 3: The same effect was observed in [27] for the BRCOUNT and MISSCOUNT policies. These policies use the number of thread-specific outstanding branches and cache misses, respectively, to assign fetch priority. Neither performed as well as ICOUNT.

SMT with ICOUNT provides the highest average number of threads in the IQ for all four workloads when compared to any of the alternative fetch policies or confidence estimators. Executing SPECINT95 with soft confidence can serve as a case in point. With soft confidence, the processor tends to fetch repeatedly from threads that have high-confidence branches, filling the IQ with instructions from a few threads. Consequently, there are no issuable instructions between 2.8% and 4.2% of the time, which is 3 to 4.5 times more often than with ICOUNT. The result is that the IQ backs up more often (12 to 15% of cycles versus 4% with ICOUNT), causing the front end of the processor to stop fetching. This also explains why none of the new policies improved performance: they all reduced the number of threads represented in the IQ. In contrast, ICOUNT works directly toward a good mix of instructions by favoring underrepresented threads.

Figure 2 correlates the average number of threads represented in the IQ with the IPC of all schemes discussed in this paper, on all workloads. For all four workloads, there is a clear correlation between performance and the number of threads present; ICOUNT achieves the largest value for both metrics (footnote 4) in most cases.

Footnote 4: The JRS and Distance estimators with thresholds of 1 achieve higher performance by minuscule margins for some of the workloads. See Section 3.2.2.

Figure 2. The relationship between the average number of threads in the instruction queue and overall SMT performance (IPC versus average number of threads present in the IQ). Each point represents a different fetch policy. The relative ordering from left to right of fetch policies differs between workloads. For SPECINT95, no speculation performed worst; the soft confidence schemes were next, followed by the distance estimator (thresh=3), the strong count schemes, and favoring non-speculative contexts. The ordering for SPECINT+FP is the same. For SPECFP95, soft confidence and favoring non-speculative contexts performed worst, followed by no speculation and the strong count, distance, and JRS hard confidence estimators. Finally, for Apache, soft confidence outperformed no speculation (the worst) and the hard confidence distance estimator but fell short of the hard confidence JRS and strong count estimators. For all four workloads, SMT with ICOUNT is the best performer, although, for SPECINT95 and SPECINT+FP, the hard distance estimator (thresh=1) obtains essentially identical performance.

3.4 Summary

In this section we examined the performance of SMT processors with speculative instruction execution. Without speculation, an 8-context SMT is unable to provide a sufficient instruction stream to keep the processor fully utilized, and performance suffers. Although the fetch policies we examined reduce the number of wrong-path instructions, they also limit thread diversity in the IQ, leading to lower performance when compared to ICOUNT.

4 Limits to speculative performance

In the previous section, we showed that speculation greatly benefits SMT performance for our four workloads running on the hardware we simulated. However, speculation will not improve performance in every conceivable environment. The goal of this section is to explore the boundaries of speculation's benefit, i.e., to characterize the transition between beneficial and harmful speculation. We do this by perturbing the software workload and hardware configurations beyond their normal limits to see where the benefits of speculative execution begin to disappear.

4.1 Examining program characteristics

Three different workload characteristics determine whether speculation is profitable on an SMT processor:

1. As branch prediction accuracy decreases, the number of wrong-path instructions will increase, causing performance to drop. Speculation will become less useful and at some point will no longer pay off.

2. As the basic block size increases, branches become less frequent and the number of threads with no unresolved branches increases. Consequently, more non-speculative threads will be available to provide instructions, reducing the value of speculation. As a result, branch prediction accuracy will have to be higher for speculation to pay off for larger basic-block sizes.

3. As ILP within a basic block increases, the number of unused resources declines, causing speculation to benefit performance less.

Figure 3 illustrates the trade-offs in all three of these parameters. The horizontal axis is the number of instructions between branches, i.e., the basic-block size. The different lines represent varying amounts of ILP. The vertical axis is the branch prediction accuracy required for speculation to pay off for a given average basic-block size (footnote 5); that is, for any given point, speculation will pay off for branch prediction accuracy values above the point but hurt performance for values below it. The higher the cross-over point, the less benefit speculation provides. The data was obtained by simulating a synthetic workload (as described in Section 2.2) on the baseline SMT with ICOUNT (Section 2.1). For instance, a thread with an ILP of 4 and a basic-block size of 16 instructions could issue all instructions in 4 cycles, while a thread with an ILP of 1 would need 16 cycles; the former workload requires that branch prediction accuracy be worse than 95% in order for speculation to hurt performance; the latter (ILP 1) requires that it be lower than 46%.

Footnote 5: The synthetic workload for a particular average basic block size contained basic blocks of a variety of sizes. This helps to make the measurements independent of Icache block size but does not remove all the noise due to Icache interactions (for instance, the tail of the ILP 1 line goes down).

Figure 3. Branch prediction accuracies at which speculating makes no difference (branch prediction accuracy versus number of instructions between branches, for ILP 1, 2, 4, and 6; the SPECINT95, SPECFP95, INT+FP, and Apache workloads are marked).

The four labeled points represent the average basic block sizes and branch prediction accuracies for SPECINT95, SPECFP95, INT+FP, and Apache on SMT with ICOUNT. SPECINT95 has a branch prediction accuracy of 88% and 6.6 instructions between branches. According to the graph, such a workload would need branch prediction accuracy to be worse than 65% for speculation to be harmful. Likewise, given the same information for SPECFP95 (18.2 instructions between branches, 99% prediction accuracy; see footnote 6), INT+FP (10.5 instructions between branches, 90% prediction accuracy) and Apache (4.9 instructions between branches, 91% prediction accuracy), branch prediction accuracy would have to be worse than 98%, 88% and 55%, respectively.

Footnote 6: Compiler optimization was set to -O5 on Compaq's F77 compiler, which unrolls, by a factor of 4 or more, loops below a certain size (100 cycles of estimated execution). SPECFP benchmarks have large basic blocks due both to unrolling and to large native loops in some programs.

SPECFP95 comes close to hitting the crossover point; this is consistent with the relatively smaller performance gain due to speculation for SPECFP95 that we saw in Section 3. Similarly, Apache's large distance from its crossover point coincides with the large benefit speculation provides.

The data in Figure 3 show that for modern branch prediction hardware, only workloads with extremely large basic blocks and high ILP benefit from not speculating. While some scientific programs may have these characteristics, most integer programs and operating systems do not. Likewise, it is doubtful that branch prediction hardware (or even static branch prediction strategies) will exhibit poor enough performance to warrant turning off speculation with basic block sizes typical of today's workloads. For example, our simulations of SPECINT95 with a branch predictor one-sixteenth the size of our baseline correctly predict only 70% of branches, but still experience a 9.5% speedup over not speculating.

4.2 Increasing the Number of Hardware Contexts

Increasing the number of hardware contexts (while maintaining the same number and mix of functional units and number of issue slots) will increase the number of independent and non-speculative instructions, and thus will decrease the likelihood that speculation will benefit SMT. One metric that illustrates the effect of increasing the number of hardware contexts is the number of cycles between two consecutive fetches from the same context, or fetch-to-fetch delay. As the fetch-to-fetch delay increases, it becomes more likely that the branch will resolve before the thread fetches again. This causes individual threads to speculate less aggressively, and makes speculation less critical to performance. For a superscalar, the fetch-to-fetch delay is 1.4 cycles; for an 8-context SMT with ICOUNT, it is 5.0 cycles, 3.6 times longer.

We can use fetch-to-fetch delay to explore the effects of adding contexts to our baseline configuration. With 16 contexts (running two copies of each of the 8 SPECINT95 programs), the fetch-to-fetch delay rises to 10.0 cycles (3 cycles longer than the branch delay), and the difference between IPC with and without speculation falls from 24% for 8 contexts to 0% with 16 (see Figure 4), signaling the point at which speculation should start hurting SMT performance.

Figure 4. The relationship between fetch-to-fetch delay and the performance improvement due to speculation (fetch-to-fetch delay in cycles and percentage increase in IPC from speculation, versus the number of hardware contexts: 1, 4, 8, and 16).

At first glance, 16-context non-speculative SMTs might seem unwise, since single-threaded performance still depends heavily on speculation. However, recent chip multiprocessor designs, such as the Piranha [32] from Compaq, make a persuasive argument that single-threaded performance could be sacrificed in favor of a simpler, throughput-oriented design. In this light, a 16-context SMT might indeed be a reasonable machine to build. Not only would it not require the complexity of speculative hardware, but the large number of threads would make it much easier to hide the large memory latency often associated with server workloads.

Still, forthcoming SMT architectures will most likely have a higher, rather than a lower, ratio of functional units to hardware contexts than even our SMT prototype, which has 6 integer units and 8 contexts. For example, the recently canceled Compaq 21464 [4] would have been an 8-wide machine with only four contexts, suggesting that speculation will provide much of its performance.

5 Related Work

Several researchers have explored issues in branch prediction, confidence estimation, and speculation, both on superscalars and multithreaded processors. Others have studied related topics, such as software speculation and value prediction.

Wall [29] examines the relationships between branch prediction and available parallelism on a superscalar and concludes that good branch predictors can significantly enhance the amount of parallelism available to the processor. Hily and Seznec [9] investigate the effectiveness of various branch predictors under SMT. They determine that both constructive and destructive interference affect branch prediction accuracy on SMT, but they do not address the issue of speculation.

Golla and Lin [6] investigate a notion similar to fetch-to-fetch delay and its effect on speculation in the context of fine-grained-multithreading architectures. They find that as instruction queues become larger, the probability of a speculative instruction being on the correct path decreases dramatically. They solve the problem by fetching from several threads, thereby increasing the fetch-to-fetch delay. We investigate the notion of fetch-to-fetch delay in the context of SMT and demonstrate that high fetch-to-fetch delays can reduce the need for speculation.

Jacobson et al. [11], Grunwald et al. [7], and Manne et al. [18] suggest using confidence estimators for a wide variety of applications, including reducing power consumption and moderating speculation on SMT to increase performance. Manne et al. [7] provide a detailed analysis of confidence estimator performance but do not address the loss of performance due to false negatives. We demonstrate that false negatives are a significant danger in hard confidence schemes. Both papers restrict their discussion to confidence estimators that produce strictly high- or low-confidence estimates (by setting thresholds), and do not consider soft confidence.

Wallace et al. [30] use spare SMT contexts to execute down both possible paths of a branch. They augment ICOUNT to favor the highest-confidence path as determined by a JRS estimator and only create new speculative threads for branches on this path. Although they assume that there are one or more unutilized contexts (while our work focuses on more heavily loaded scenarios), their work complements our own. Both show that speculation pays off when few threads are active, either because hardware contexts are available (their work) or threads are not being fetched (ours). Klauser and Grunwald [12] demonstrate a similar technique, but they do not restrict the creation of speculative threads to the highest-confidence path. Instead, they use a confidence estimator to determine when to create a new speculative thread. Because of this difference in their design, ICOUNT performs very poorly, while a confidence-based priority scheme performs much better.

Seng et al. [33] examine the effects of speculation on power consumption in SMT. They observe that SMT speculates less deeply past branches, resulting in less power being spent on useless instructions. We examine the aggressiveness of SMT in more detail and demonstrate the connection between fetch policy, the number of hardware contexts, and how aggressively SMT speculates.

Lo et al. [17] investigated the effect of SMT's architecture on the design of several compiler optimizations, including software speculation. They found that software speculation on SMT was useful for loop-based codes, but hurt performance on non-loop applications.

6 Summary

This paper examined and analyzed the behavior of speculative instructions on simultaneous multithreaded processors. Using both multiprogrammed and multithreaded workloads, we showed that:

• speculation is required to achieve maximum performance on an SMT;
• fetch policies and branch confidence estimators that favor non-speculative execution succeed only in reducing performance;
• the benefits of correct-path speculative instructions greatly outweigh any harm caused by wrong-path speculative instructions; and
• multithreading actually enhances speculation, by reducing the percentage of speculative instructions on the wrong path.

We also showed that multiple contexts provide a significant advantage for SMT relative to a superscalar with respect to speculative execution; namely, by interleaving instructions, multithreading reduces the distance that threads need to speculate past branches. Overall, SMT derives its benefit from this fine-grained interleaving of instructions from multiple threads in the IQ; therefore, policies that reduce the pool of participating threads (e.g., to favor non-speculating threads) tend to reduce performance.

These results hold for all “reasonable” hardware configurations and workloads we tested; only extreme conditions (i.e., a very high ratio of contexts to issue slots and functional units, or very large basic block sizes) warranted reducing or eliminating speculation. However, there remain interesting microarchitectural trade-offs between speculation, implementation complexity, and single-threaded performance that make the decisions of how and when to speculate on SMT processors more complex than they are on traditional superscalar processors.

References

[1] A. Agarwal, J. Hennessy, and M. Horowitz. Cache performance of operating system and multiprogramming workloads. ACM Transactions on Computer Systems, 6(4), May 1988.
[2] Compaq. SimOS-Alpha. http://www.research.digital.com/wrl/projects/SimOS/.
[3] Z. Cvetanovic and R. Kessler. Performance analysis of the Alpha 21264-based Compaq ES40 system. In 27th Annual International Symposium on Computer Architecture, June 2000.
[4] J. Emer. Simultaneous multithreading: Multiplying Alpha's performance, October 1999.
[5] N. Gloy, C. Young, J. Chen, and M. Smith. An analysis of dynamic branch prediction schemes on system workloads. In 23rd Annual International Symposium on Computer Architecture, May 1996.
[6] P. Golla and E. Lin. A comparison of the effect of branch prediction on multithreaded and scalar architectures. ACM SIGARCH Computer Architecture News, 26(4):3–11, 1998.
[7] D. Grunwald, A. Klauser, S. Manne, and A. Pleszkun. Confidence estimation for speculation control. In 25th Annual International Symposium on Computer Architecture, June 1998.
[8] L. Gwennap. New algorithm improves branch prediction. Microprocessor Report, March 27, 1995.
[9] S. Hily and A. Seznec. Branch prediction and simultaneous multithreading. In International Conference on Parallel Architectures and Compilation Techniques, 1996.
[10] Y. Hu, A. Nanda, and Q. Yang. Measurement, analysis and performance improvement of the Apache web server. In Proceedings of the 18th International Performance, Computing and Communications Conference, February 1999.
[11] E. Jacobson, E. Rotenberg, and J. Smith. Assigning confidence to conditional branch predictions. In 29th Annual International Symposium on Microarchitecture, December 1996.
[12] A. Klauser and D. Grunwald. Instruction fetch mechanisms for multipath execution processors. In 32nd Annual International Symposium on Microarchitecture, November 1999.
[13] M. Lipasti and J. Shen. Exceeding the dataflow limit via value prediction. In 29th Annual International Symposium on Microarchitecture, December 1996.
[14] M. Lipasti, C. B. Wilkerson, and J. Shen. Value locality and load value prediction. In 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996.
[15] J. Lo, L. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh. An analysis of database workload performance on simultaneous multithreading processors. In 25th Annual International Symposium on Computer Architecture, June 1998.
[16] J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15(2), August 1997.
[17] J. Lo, S. Eggers, H. Levy, S. Parekh, and D. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In 30th Annual International Symposium on Microarchitecture, December 1997.
[18] S. Manne, D. Grunwald, and A. Klauser. Pipeline gating: Speculation control for energy reduction. In 25th Annual International Symposium on Computer Architecture, June 1998.
[19] S. McFarling. Combining branch predictors. Technical Report TN-36, DEC-WRL, June 1993.
[20] J. Redstone, S. Eggers, and H. Levy. An analysis of operating system behavior on a simultaneous multithreaded architecture. In 9th International Conference on Architectural Support for Programming Languages and Operating Systems, November 2000.
[21] J. Reilly. SPEC describes SPEC95 products and benchmarks. September 1995. http://www.specbench.org/.
[22] M. Rosenblum, S. Herrod, E. Witchel, and A. Gupta. Complete computer simulation: The SimOS approach. IEEE Parallel and Distributed Technology, 3(4), Winter 1995.
[23] J. Smith. A study of branch prediction strategies. In 8th Annual International Symposium on Computer Architecture, May 1981.
[24] SPEC CPU2000. Run and reporting rules for SPEC CPU2000, 2000. http://www.specbench.org/.
[25] SPECWeb. An explanation of the SPECWeb96 benchmark, 1996. http://www.specbench.org/.
[26] D. Tullsen. Simulation and modeling of a simultaneous multithreading processor. In 22nd Annual Computer Measurement Group Conference, December 1996.
[27] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In 23rd Annual International Symposium on Computer Architecture, May 1996.
[28] D. Tullsen, S. Eggers, and H. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In 22nd Annual International Symposium on Computer Architecture, June 1995.
[29] D. Wall. Speculative execution and instruction-level parallelism. Technical Report 42, Digital Western Research Laboratory, March 1994.
[30] S. Wallace, B. Calder, and D. Tullsen. Threaded multiple path execution. In 25th Annual International Symposium on Computer Architecture, June 1998.
[31] T.-Y. Yeh and Y. Patt. Alternative implementations of two-level adaptive branch prediction. In 19th Annual International Symposium on Computer Architecture, May 1992.
[32] L. Barroso et al. Piranha: A scalable architecture based on single-chip multiprocessing. In 27th Annual International Symposium on Computer Architecture, June 2000.
[33] J. Seng, D. Tullsen, and G. Cai. Power-sensitive multithreaded architecture. In Proceedings of the 2000 International Conference on Computer Design, 2000.
[34] A. Snavely and D. Tullsen. Symbiotic job scheduling for a simultaneous multithreaded processor. In 9th International Conference on Architectural Support for Programming Languages and Operating Systems, November 2000.
