Analysis and Improvement of Performance and Power Consumption of Chip Multi-Threading SMP Architectures

By: Ryan Eric Grant

A thesis submitted to the Department of Electrical and Computer Engineering in conformity with the requirements for the degree of Master of Science (Engineering)

Queen's University

Kingston, Ontario, Canada

August 2007

Copyright Ryan E. Grant, 2007

Abstract

Emerging processor technologies are becoming commercially available that make multiprocessor capabilities affordable for use in a large number of computer systems.

Increasing power consumption by this next generation of processors is a growing concern as the cost of operating such systems continues to increase.

It is important to understand the characteristics of these emerging technologies in order to enhance their performance. By understanding the characteristics of high performance computing workloads on real systems, the overall efficiency with which such workloads are executed can be increased. In addition, it is important to determine the best trade-off between system performance and power consumption using the variety of system configurations that are possible with these new technologies.

This thesis seeks to provide a comprehensive presentation of the performance characteristics of several real, commercially available simultaneous multi-threading multi-processor architectures and to provide recommendations to improve overall system performance. As well, it provides solutions to reduce the power consumption of such systems while minimizing the performance impact of these techniques on the system.

The results of the research conducted show that the new scheduler proposed in this thesis is capable of providing significant increases in efficiency for traditional and emerging multiprocessor technologies. These findings are confirmed using real system performance and power measurements.

Acknowledgements

I would like to first thank my supervisor, Dr. Ahmad Afsahi, for his valuable feedback and support in conducting the research for this thesis and his assistance in assembling this document. Without his help this work would not have been possible.

Many thanks to the office staff of the ECE Department here at Queen's, particularly Ms. Debie Frasier and Ms. Bernice Ison, for their invaluable help in administrative matters during the tenure of my Master's degree.

To my coworkers in the Parallel Processing Research Laboratory, Reza Zamani, Mohammad Rashti, and Ying Qian, thank you for your continued support over the last two years.

Finally, my deepest appreciation goes to my immediate and extended family for their support throughout the years. A special and heartfelt thanks goes to my loving and understanding wife Meagan, without whom this would not be possible.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Glossary
Chapter 1: Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Outline
Chapter 2: Background
  2.1 Parallel Computing Architecture
    2.1.1 Prior Art
  2.2 Programming Shared Memory Parallel Applications
    2.2.1 Prior Art
  2.3 System and Application Characterization
    2.3.1 Prior Art
  2.4 Scheduling and Operating System Noise
    2.4.1 Prior Art
  2.5 System Power Consumption
    2.5.1 Power Metrics
    2.5.2 Prior Art
  2.6 Summary
Chapter 3: Application Specifications
  3.1 NAS Parallel Benchmarks
    3.1.1 NAS Simulated CFD Applications
      3.1.1.1 BT
      3.1.1.2 LU
      3.1.1.3 SP
    3.1.2 NAS Kernel Applications
      3.1.2.1 MG
      3.1.2.2 CG
    3.1.3 SPEC OpenMP Benchmark Suite
Chapter 4: Workload Characteristics of SMT-Capable SMPs
  4.1 Experimental Setup
    4.1.1 Terminology
    4.1.2 Application Characterization
  4.2 Experimental Results and Analysis
    4.2.1 EPCC
      4.2.1.1 OpenMP Synchronization
      4.2.1.2 OpenMP Scheduling
    4.2.2 Application Performance and Analysis
      4.2.2.1 Trace Cache Analysis
  4.3 Summary
Chapter 5: Characterization and Analysis of Multi-Core SMPs

  5.1 Experimental Methodology
    5.1.1 Terminology
  5.2 Single Application Results
    5.2.1 Cache Performance
      5.2.1.1 TLB Performance
    5.2.2 Stalled Operation
    5.2.3 Branch Prediction
    5.2.4 Bus Transactions
    5.2.5 Cycles Per Instruction
    5.2.6 Wall Clock Performance
  5.3 Multi-Application Results
    5.3.1 Cache Performance
      5.3.1.1 TLB Performance
    5.3.2 Stalled Operation
    5.3.3 Branch Prediction
    5.3.4 Bus Transactions
    5.3.5 Cycles Per Instruction
    5.3.6 Wall Clock Performance
    5.3.7 Cross-Product Multi-Program Results
  5.4 Overloaded Configuration Analysis
    5.4.1 Cache Performance
    5.4.2 Stalled Operation
    5.4.3 Branch Prediction
    5.4.4 Bus Transactions
    5.4.5 Cycles Per Instruction
    5.4.6 Wall Clock Performance
    5.4.7 Overloaded Overhead
  5.5 Effect of Operating System Noise
    5.5.1 Operating System Noise Effects on Single-Threaded Applications
    5.5.2 Operating System Noise Effect on Multithreaded Applications
  5.6 Summary
Chapter 6: Power Management of Chip Multi-Threading SMPs
  6.1 Experimental Framework
    6.1.1 AMP Setup
  6.2 PS Scheduler vs. the Default Linux Scheduler
  6.3 Real Power Measurements
    6.3.1 Average Power Consumption
    6.3.2 Slowdown and Energy Savings
    6.3.3 Energy-Delay Analysis
  6.4 AMP Power Consumption Predictions With Future Technology
    6.4.1 Slowdown and Energy Savings
    6.4.2 Energy-Delay Analysis
  6.5 Summary
Chapter 7: Conclusions and Future Work
  7.1 Future Work
References

List of Tables

Table 4.1: Configuration Information
Table 4.2: Average speedup gained by enabling HT for NAS and SPEC parallel benchmarks
Table 5.1: Configuration Information
Table 5.2: Speedup for architectures
Table 5.3: Percentage degradation for overloaded cases versus non-overloaded cases
Table 6.1: AMP Naming Convention

List of Figures

Figure 1.1: Server Power Consumption Forecasts
Figure 4.1: Processor numbering for an SMT system with a maximum of 2 processors
Figure 4.2: Overhead of OpenMP synchronization
Figure 4.3: Kernel 2.6.9 vs. 2.4.22 impact on OpenMP synchronization overhead
Figure 4.4: Overhead of OpenMP loop scheduling policies
Figure 4.5: Kernel 2.6.9 vs. 2.4.22 impact on OpenMP loop scheduling policies
Figure 4.6: Speedup for NAS OpenMP and SPEC OMPM2001 applications under kernel 2.6.9 relative to the serial case
Figure 4.7: Trace cache misses
Figure 4.8: Trace cache delivery rate (trace cache fetches per 100 clock cycles)
Figure 4.9: 2nd Level Cache Misses
Figure 5.1: Intel Xeon Dual-Core Chip Layout
Figure 5.2: Processor numbering for 2-way Dual-Core System
Figure 5.3: (a) L1 Cache Miss Rate and (b) L2 Cache Miss Rate
Figure 5.4: Trace Cache Miss Rates
Figure 5.5: (a) DTLB Load and Store Misses Normalized to the Serial Case (b) ITLB Miss Rate
Figure 5.6: % of Total Execution Spent in a Stalled State
Figure 5.7: Branch Prediction Rate
Figure 5.8: Prefetching Bus Accesses
Figure 5.9: Cycles Per Instruction
Figure 5.10: Speedup for NAS OpenMP applications
Figure 5.11: (a) L1 Cache Miss Rate and (b) L2 Cache Miss Rate
Figure 5.12: Trace Cache Miss Rate
Figure 5.13: (a) DTLB Load and Store Misses Normalized to the Serial Case (b) ITLB Misses
Figure 5.14: Percentage of Operation Time Spent Stalled
Figure 5.15: Branch Prediction Rate
Figure 5.16: Percentage of Prefetching Bus Accesses of All Bus Accesses
Figure 5.17: Cycles Per Instruction
Figure 5.18: (a) CG/FT and (b) FT/FT Multi-Application Speedup
Figure 5.19: CG/CG Multi-Application Speedup
Figure 5.20: Multiprogrammed speedup of pairs of NAS benchmarks for all architectures
Figure 5.21: (a) 1st Level Cache Miss Rates for Overloaded Cases, (b) 2nd Level Cache Miss Rates for Overloaded Cases
Figure 5.22: Trace Cache Miss Rates for Overloaded Cases
Figure 5.23: (a) DTLB Load and Store Misses and (b) ITLB Miss Rates for Overloaded Cases
Figure 5.24: Percentage of Stalled Operation for Overloaded Cases
Figure 5.25: Branch Prediction Rate for Overloaded Configurations
Figure 5.26: Percentage of Prefetching Bus Accesses for Overloaded Configurations
Figure 5.27: CPI for Overloaded Configurations
Figure 5.28: Overloaded NAS Benchmarks Speedup

Figure 5.29: Improvement in cache hit rate without OS noise
Figure 5.30: Effect of operating system noise on system performance
Figure 5.31: Improvement in Cache Hit Rate Without OS Noise
Figure 5.32: Application Runtime Improvement
Figure 6.1: Processor numbering for 2-way dual-core system
Figure 6.2: PS Scheduler vs. default scheduler for SPEC benchmarks
Figure 6.3: Average power consumption of HT-enabled configurations
Figure 6.4: Average power consumption of HT-disabled configurations
Figure 6.5: Slowdown and energy savings for (a) AMP HTon 11 and (b) AMP HTon 31
Figure 6.6: Slowdown and Energy Savings for (a) AMP HTon 32 and (b) AMP HTon 72
Figure 6.7: Slowdown and energy savings for (a) AMP HToff 11 and (b) AMP HToff 12
Figure 6.8: Slowdown and energy savings for AMP HToff 32
Figure 6.9: Energy-delay for the PS Scheduler in an HTon 11 configuration
Figure 6.10: Energy-delay for the PS Scheduler in an HTon 31 configuration
Figure 6.11: Energy-delay for the PS Scheduler in an HTon 32 configuration
Figure 6.12: Energy-delay for the PS Scheduler in an AMP HTon 72 configuration
Figure 6.13: Energy-delay for the PS Scheduler in an AMP HToff 11 configuration
Figure 6.14: Energy-delay for the PS Scheduler in an AMP HToff 21 configuration
Figure 6.15: Energy-delay for the PS Scheduler in an AMP HToff 32 configuration
Figure 6.16: AMP slowdown and energy savings for SPEC benchmarks over HTon 42
Figure 6.17: AMP slowdown and energy savings for SPEC benchmarks over HTon 82
Figure 6.18: AMP slowdown and energy savings for SPEC benchmarks over HToff 42
Figure 6.19: Normalized Energy-Delay for SPEC benchmarks for AMP HTon 32 over HTon 42
Figure 6.20: Normalized Energy-Delay for SPEC benchmarks for AMP HTon 72 over HTon 82
Figure 6.21: Normalized Energy-Delay for SPEC benchmarks for AMP HToff 32 over HToff 42

Glossary

ADI – Alternating Direction Implicit
AMD – Advanced Micro Devices, Inc.
AMP – Asymmetric Multi-Processor
API – Application Programming Interface
BT – Block Tri-Diagonal
CFD – Computational Fluid Dynamics
CG – Conjugate Gradient
CMP – Chip Multi-Processor
CMT – Chip Multi-Threading
CPI – Cycles Per Instruction
CPU – Central Processing Unit
DDR – Double Data Rate
DVFS – Dynamic Voltage and Frequency Scaling
EP – Embarrassingly Parallel
EPI – Energy Per Instruction
FORTRAN – "IBM Mathematical FORmula TRANslating System"
FT – Fourier Transform
HPC – High Performance Computing
HT – Hyper-Threading
ITRS – International Technology Roadmap for Semiconductors
LLP – Loop Level Parallelism
LU – Lower-Upper
MG – Multi-Grid
MPI – Message Passing Interface
OS – Operating System
SDRAM – Synchronous Dynamic Random Access Memory
SMP – Symmetric Multi-Processor
SMT – Simultaneous Multi-Threading
SP – Scalar Pentadiagonal

SPEC – Standard Performance Evaluation Corporation
TLP – Thread Level Parallelism

Chapter 1: Introduction

Recent developments in the computing industry have signaled a shift in microprocessor design philosophy towards the use of multiple processing cores. This allows a single device to execute multiple instructions simultaneously. The process by which multiple instructions are executed in parallel on multiple CPUs is referred to as parallel processing. There are two distinct types of parallelism: instruction level parallelism (ILP) and thread level parallelism (TLP). ILP exploits the potential of individual instructions to be executed simultaneously. ILP requires that the instructions of a process be executed simultaneously or out of order, and that code not be dependent on the previously executed instructions. In reality, ILP is of limited use because of the difficulty of coding algorithms in which there are limited amounts of inter-instruction dependence. As such, ILP can be very difficult to exploit because many sequential code sections occur in most algorithms, which lessens the overall effect that ILP can have upon run times.

TLP approaches the parallelization problem in a different way than ILP, by dividing processes into distinct threads of execution, which contain their own independent sequential instruction dependencies. This allows algorithms to be coded in a more traditional manner and parallelism to be more logically divided amongst separate threads. In addition, TLP is not as inherently limited in its upward parallelism potential as ILP. With ILP, there are only a finite number of instructions that can be executed in parallel in one time period, whereas with TLP one can execute as many instructions as possible on each CPU during the same time period, given that there are enough threads to keep all of the processors busy. Because of the advantages of TLP and the larger potential benefits of exploiting it, the research and commercial sectors have taken an interest in TLP over ILP.

TLP can be taken advantage of using the concept of multi-threading. Traditionally, multi-threading has been used for multitasking, with a single-CPU desktop computer running multiple processes in a time-multiplexed fashion that enables operating systems to perform multitasking. All major modern operating systems support multi-threading, including Microsoft's Windows operating system, Apple Computer's OS X and

UNIX/Linux. Multi-threading can also be used in systems with multiple CPUs, as threads can execute simultaneously on each independent CPU. In addition to running multiple distinct processes, multi-threading can be used to divide a single computational task or process into several execution threads that can be run simultaneously for the purpose of using multiple CPUs to reduce process run times. Multiprocessor systems such as symmetric multi-processors (SMP), which use shared memory, have been available for many years. These systems use multiple traditional central processing units (CPUs) connected to a shared memory system.

Efforts to take advantage of TLP have been attempted using single-core multi-thread execution units, commonly referred to as simultaneous multi-threading (SMT) [91] capable platforms. SMT-equipped processors can simultaneously execute multiple threads by sharing processor resources and using a limited number of duplicate resources. This creates a significant amount of pressure on the memory system, which limits the overall effectiveness of SMT technology. In order to increase the performance of systems, new designs are incorporating multiple complete independent processors on a single chip, eliminating the need for any sharing of actual chip architecture.

These new designs, called chip multi-processors (CMP), have been proposed for many years by the research community [35] but only began to appear on the commercial market in 2006. Both Intel Corp. [40] and AMD [1] have released multi-core versions of their processors, with Intel Corp. releasing a multi-core multi-threading CPU design. Sun Microsystems has taken this trend further with the UltraSparc T1 processor [86], offering six or eight cores on a single chip with each core capable of executing four threads simultaneously. These multi-core simultaneous multi-threading processors are referred to as chip multi-threading (CMT) [83] processors.

The adoption of multiprocessor systems in mainstream applications is reinvigorating interest in parallel computing as more systems move to take advantage of multiple cores, and the performance of future computing systems relies increasingly on their ability to effectively parallelize complex tasks.

The corresponding increase in transistor density of new systems has also brought about significant challenges relating to power consumption and heat generation. The move to multiple cores has been motivated by the inability to increase individual core

clock speeds due to heat generation. The waste heat produced by processors is directly related to their power consumption. Therefore, there have been efforts made to reduce the power consumption of modern systems by introducing dynamic voltage and frequency scaling (DVFS) [22] as well as low-level design revision in order to increase energy efficiency.

1.1 Motivation

With the trend toward massively parallel computing systems in the future, designers will conceivably continue to add additional cores and thread support in future designs. With this emerging trend, the analysis and evaluation of existing multi-threaded systems is important to better understand the performance benefits of parallel processing on commercially available platforms. In addition to performance concerns, the increasing power consumption of modern processors is of great concern to the research community as well as the commercial sector. Future forecasts of power consumption and thermal dissipation by the International Technology Roadmap for Semiconductors (ITRS) [43] are as high as 100 W/cm² in 2010 and up to 250 W/cm² by 2020. This means that current design trends cannot continue to be effective, as the forecasted power consumption for devices in the year 2020 is four times the allowable thermal generation density for future chip packages [43]. In fact, the amount of power consumed by servers throughout the world is estimated to rise by as much as 76% from 2005 to 2010 [50]. The forecasts for future server system power consumption for the next four years are detailed in Figure 1.1, with information taken from the 2006 ITRS Roadmap Report [43].

Figure 1.1: Server Power Consumption Forecasts (power consumption in watts of high-performance servers, years 2005 through 2011)

As such, it is important to understand the power consumption characteristics of existing multi-threading systems and determine the optimal performance/power-savings trade-off point for system execution. The characterization data can then be used to fine-tune existing systems through process scheduling as well as other software techniques, such as compiler optimizations, that can improve system performance and reduce power consumption. This characterization information will also be of use to the semiconductor industry until the introduction of technologies, like emerging nanotechnology, that can possibly help to reduce the overall power consumption of integrated circuits. However, these technologies are not guaranteed to solve all of the current power consumption issues, and therefore work such as that presented in this thesis will continue to be of importance long into the future.

1.2 Contributions

The primary goal of this thesis is to garner some insight into the behaviour of emerging multi-threaded systems as it relates to power and performance, in order to help determine the best architecture for different types of system workloads. This information will aid the research community as well as the commercial sector in the utilization of such systems for real-world workloads. In addition, this thesis seeks to provide the basis for a software approach to better manage the power consumption of a simultaneous multi-threading processor system. It utilizes a program/profile-independent approach designed to minimize the effect of operating system noise [89] for the purpose of performance/power savings. As such, it is extremely lightweight and useful in an asymmetric multi-processor (AMP) [3] system that utilizes SMT. The resulting energy savings when utilizing this method equate to a significant cost savings when the method is applied to larger scientific processing systems, potentially saving hundreds of thousands of dollars in costs per year. Although the resultant energy savings and speedup may seem small at 5-15%, this is excellent in an HPC context, as HPC applications provide only limited opportunities to increase performance or save energy because they typically utilize all of the processing system's resources for extended periods of intensive calculations. The idea of a protected CPU has been explored in a performance sense on older real-time systems [11] by using a protected CPU to process real-time tasks in order

to decrease system response time and ensure that tasks are completed before their deadline. The application of the protected CPU method in relation to non-real-time power/performance optimization and the use of SMT are, to this author's knowledge, novel and unique. In addition, this thesis explores the effects of multi-application workloads and thread overloading on commercial CMP/SMT platforms. Each of these contributions is outlined below.

• An analysis of OpenMP [69] constructs and their effect on shared memory programs. This analysis was also conducted using different versions of the Linux kernel to evaluate the effect that OpenMP constructs have on the performance of the system using the traditional Linux scheduler and the recent O(1) scheduler.

• An in-depth profiling of single-core multiprocessor shared memory systems. The profiling of the NAS [44] and SPEC [82] benchmark suites running OpenMP on a shared memory system is a long and complex task. The collection of the profiling information takes a significant amount of time, with each benchmark run taking 20-32 hours and multiple runs required to capture and confirm the results for a single performance metric. Therefore, the collection of the dozens of metrics required in order to properly determine the effect of the various architectural elements on the system's performance is time consuming and produces an incredibly large volume of experimental data, which must then be collated and analyzed. The collection and analysis of such data is of great use to the academic community as it is the basis from which further research can be performed.

• Profiling of chip multi-threaded SMP systems in multiple architectural configurations with multi-threaded single- and multi-application workloads. This contribution is a more in-depth profiling and analysis of multi-core systems in multiple configurations (SMT enabled and SMT disabled), simulating systems of various sizes and determining the differences between execution on a single multi-core processor and

multiple multi-core processors in a shared memory system. Building upon the knowledge of previous profiles, this contribution is significant in its scope, given the lack of good profiling data available on modern multi-core shared memory systems.

• Development of a power/performance optimizing scheduling algorithm. The development of a scheduler that reduces the impact of operating system noise while reducing power consumption and increasing overall system performance is an excellent proof of concept for illustrating the potential benefits of intelligent scheduling in reducing the power consumption of systems while maintaining excellent performance.

1.3 Outline

This thesis is organized into seven chapters. The first is this introduction, followed by an in-depth description of the background work and the most important literature in the subject area. The third chapter describes in detail the two major benchmarking suites for multiprocessor shared memory systems that are used throughout this thesis. The fourth chapter explores the overhead incurred when running a parallelized application using the OpenMP shared memory API, and details the performance of multiprocessor shared memory systems using simultaneous multi-threading. Chapter 5 continues the study of multiprocessor systems by examining the performance of multi-core multiprocessor systems with multi-threaded single- and multi-application workloads, and the effect of thread overloading on such architectures. It concludes by investigating the effect of operating system noise on multi-core systems and laying the groundwork for the scheduler changes introduced in the next chapter. Chapter 6 details a study performed on the power usage of multiprocessor systems and uses an alternative scheduling algorithm that takes advantage of an asymmetric multi-processor system. Chapter 7 concludes the thesis with a brief summary and closing remarks, as well as discussing future work to be done in this area.

Chapter 2: Background

To aid in understanding the material presented in this publication, the following background material provides an introduction to parallel computer architecture, multi-threading, the OpenMP [69] shared memory application programming interface (API), system characterization, operating system noise and power modeling of real systems.

2.1 Parallel Computing Architecture

Parallel computing can be accomplished using a variety of methods, each with its own advantages and disadvantages. Fundamentally, it is the usage of multiple CPUs working together on a given task for the purpose of decreasing the total running time of the task, or increasing the throughput of the system for a multi-program workload. The two basic classifications of parallel computing implementations are the shared memory multiprocessor and the message passing multiprocessor system. Shared memory processors can be of several types, the most prevalent being symmetric multi-processors. Other types of shared memory systems include non-uniform memory access (NUMA) systems, which are most commonly cache coherent non-uniform memory access (ccNUMA) systems, and cache-only memory architecture (COMA) systems. Multiple independent systems can be linked together to form a parallel processor system using methods like message passing, where information relating to the execution of a parallel program is passed between systems over a network using standardized message passing systems such as MPI [64]. By using such networked systems it is possible to create clusters of individual shared memory nodes in an attempt at increased performance. This thesis concerns a shared memory multiprocessor where all of the processing units are physically localized within a single system.

Traditional multiprocessor systems require the use of multiple processor chips that are integrated onto a single motherboard. Each processor must share the available system memory. Typically this is done using a shared system bus. This system bus must then be arbitrated to ensure system consistency. There are a number of arbitration techniques that allow for pipelined bus accesses, and other techniques for a split-transaction bus that allow for delayed independent memory requests and replies.

In addition to bus arbitration, a multiprocessor system must ensure cache consistency, as each processor typically has independent L1 and L2 caches. Cache consistency for shared memory multiprocessors is typically accomplished via snooping on the shared bus. Snooping involves additional circuitry for each microprocessor that monitors the bus for requests that correspond to data held in the local processor's L1 or L2 cache. In the event that a request corresponds to a value held in the local cache, the request is examined to determine its type. A request for a read access to the data requires no action, while a request for a write to a location requires that the data be flushed from the cache. However, this is further complicated as the data held in the cache may be a more current version of the data than that held in memory, since the caches are typically write-back caches instead of write-through caches. Consequently, the processor may have to suppress the main memory response and respond to the request with the data in its cache. This ensures data consistency and enables correct execution of a program when its threads are run across several processors in a system.
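The snooping decision just described can be sketched in C as follows. This is a deliberately simplified, illustrative model only; the state and action names are invented for this example, and real processors implement richer protocols (such as MESI) with additional states and transitions.

    /* Simplified snoop decision for a write-back cache (illustrative only). */
    typedef enum { LINE_INVALID, LINE_SHARED, LINE_MODIFIED } line_state_t;
    typedef enum { BUS_READ, BUS_WRITE } bus_request_t;
    typedef enum {
        SNOOP_NO_ACTION,    /* line not held, or a read of a clean copy           */
        SNOOP_SUPPLY_DATA,  /* suppress memory and supply the modified local copy */
        SNOOP_INVALIDATE    /* another CPU is writing: flush/drop the local copy  */
    } snoop_action_t;

    snoop_action_t snoop(bus_request_t req, line_state_t state)
    {
        if (state == LINE_INVALID)
            return SNOOP_NO_ACTION;         /* request does not match this cache */

        if (req == BUS_READ)
            return (state == LINE_MODIFIED)
                       ? SNOOP_SUPPLY_DATA  /* cache holds the newest data       */
                       : SNOOP_NO_ACTION;   /* the memory copy is still valid    */

        /* A write by another processor: the local copy must be invalidated,
           with the dirty data written back or supplied first if it is modified. */
        return SNOOP_INVALIDATE;
    }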

Multi-threading is an essential element in the use of parallel computing. It is a process by which a given task is broken down into separate executable units (threads) which can then be executed on several processing units at a given time to reduce the overall time of execution. Multi-threading using a time-division scheme has existed for many years. This method uses a single CPU and time-shares many threads, providing a multi-program environment.

Simultaneous multi-threading is a technology that allows more than one thread to run on a single processing core at one time. The goal of SMT is to increase the overall utilization of a CPU by using as many components of said CPU in a cycle as possible. However, numerous complex interactions among the shared resources may affect the performance of multi-threaded applications running on SMTs [33, 90]. This can create conflicts as multiple threads vie for the use of individual components and, as such, the proper pairing of threads on an SMT is vital, as complementary threads can perform very well, while thread pairs that require the same processing resources in the same execution phases can cause slowdowns due to contention. Hyper-Threading (HT) [61] technology is an implementation of SMT on Intel processors. HT technology replicates essential CPU resources to allow the execution of two program threads at a time on the CPU. Using HT technology, each physical processor is divided into two logical processors, each of which has its own independent run queue. These logical processors share the resources of the physical CPU, including cache, execution units, translation lookaside buffers, branch prediction unit, and load and store buffers.

Processors are becoming available that have multiple processing cores integrated onto a single processor die. This technology, called chip multi-processing [35], has been introduced by both AMD and Intel, with Intel also introducing a CMP that is HT enabled, creating a chip multi-threading architecture. The CMP and CMT architectures can then be integrated into a traditional SMP interconnect to allow multiple physical CMP or CMT chips to be integrated into a single system.

The amount of interconnectivity between the cores in a CMP can vary between implementations. The UltraSPARC IV's dual cores [85] share only off-chip datapaths, while the dual cores in the IBM Power5 [46] share an L2 cache for faster inter-thread communications between the cores. The Intel Xeon [39] dual-core processors studied in this thesis are the first generation of dual-core processors from Intel and do not share an L2 cache, unlike the latest Intel Core [40] processors that have shared L2 caches.

2.1.1 Prior Art

This section reviews some of the most important publications in the field of parallel processing architecture. It also discusses the most recent publications that are related to the work presented in this thesis.

SMT processors have been studied in academia [91], and have appeared in mainstream processors from IBM such as the Power4 and Power5 processors [46], Intel's Core 2 [40] and Xeon [39] processors, and Sun Microsystems' UltraSparc IV [85] and T1 [86] processors.

Tuck and Tullsen [90] analyzed the Intel Pentium 4 HT processor. They found that up to a 25% performance improvement on parallel applications could be achieved by using HT technology. They also confirmed Snavely et al.'s paper [81] by showing that the Pentium 4 Hyper-Threading technology demonstrates symbiotic behaviour due to cache and resource conflicts. Research at Intel has focused on Hyper-Threading with multimedia OpenMP applications [88]. This study found that enabling HT increased

performance, although it results in an increased demand on the memory system. However, the benefit gained by increased utilization of other processor resources was more significant than the penalties incurred in the memory systems.

Researchers at Dell published results for some benchmarks from the Message Passing Interface (MPI) version of the NAS Parallel Benchmarks running on a cluster of HT-enabled systems [55]. They conclude that the necessity of doubling the number of processes for HT can degrade performance by increasing demand on the interconnect subsystems.

Recent work has also been done on CMP architectures and their emergence in the commercial sector. Work by De Vuyst et al. [21] and El-Moursy et al. [24] has examined the effect of scheduling on CMP architectures. They study the performance and energy optimization possibilities of such architectures. Researchers have also been examining scheduler performance and support in current operating systems [27, 67], as well as examining possible methods of increasing overall system throughput using CMPs [26]. Their investigations differ from the work in this thesis in that they use a simulated platform for their investigations, while we focus on real, commercially available servers.

Work relating to the cache contention and sharing issues brought about by chip multiprocessors has been studied, mainly focusing on cache sharing and partitioning [14] as well as the contention effects expected to occur in CMPs [13].

CMPs have also been examined with respect to their abilities executing single-threaded programs [93], particularly relating to thread migration effects [16]. Researchers have also investigated the power performance of CMP architectures [56, 58, 77], although they also use simulations and only one of these papers examines the SMT/CMP architecture [56]. Other publications relating to CMT/CMP architectures have documented their evolution in the commercial sector [83].

2.2 Programming Shared Memory Parallel Applications

Programming of multi-threaded shared memory applications is normally done using one of two multi-threading APIs, OpenMP [69] or POSIX [38] threads. The POSIX threads API is based on the POSIX 1003.1-1995 standard from ANSI and IEEE. It defines a set of thread creation and management operations as well as subroutines to

manage mutually exclusive code sections and communication between individual threads. POSIX threads are only available for coding in C, but wrappers make it possible to use POSIX threads in other languages such as FORTRAN with minimal associated overhead. However, programming with POSIX threads can be significantly more complicated than programming with OpenMP, as OpenMP does not require explicit thread creation, joining or destruction, and only uses relatively simple compiler directives for defining parallel sections of code. Therefore, OpenMP is typically the API of choice when developing shared memory applications.

The OpenMP API [69] is an industry standard interface that is used to develop shared memory parallel processing programs in Fortran, C and C++. It is a specification of compiler flags, environment variables and programming library functions that can be used to create parallelism on a system using a shared memory interface. OpenMP is an attempt to standardize shared memory parallel processing interfaces across a wide range of platforms. The predecessor to OpenMP, ANSI X3H5 [70], was never finalized and, as a draft specification, was never able to obtain true portability amongst different vendors' platforms due to implementation-specific directives that were not portable between all of the system vendors' machines. OpenMP overcame this problem by revising the standard such that proprietary machine hardware implementations were not a prerequisite for any of the directives.

OpenMP defines a set of compiler directives which determine how threads are assigned portions of computational loops, and directives used to control the execution of the threads, providing support for mutually exclusive sections of code and thread synchronization. The most used clauses of the OpenMP specification are the PARALLEL, DO/FOR, PARALLEL DO/FOR, BARRIER, SINGLE, CRITICAL, LOCK/UNLOCK, ORDERED and REDUCTION clauses. A description of each is detailed below [69].

• PARALLEL: Defines a region that should be executed by several threads. Each thread executes the code within the region unless it is flagged by an exclusion clause.

• DO/FOR: The DO/FOR clause is used inside of a PARALLEL region and defines loops that should be split up amongst the group of executing threads, with each thread executing a portion of the iterations of the loop.

• PARALLEL DO/FOR: A shortcut clause that defines a PARALLEL region as well as starting a DO/FOR clause.

• BARRIER: A synchronization clause that causes threads to wait at the BARRIER until all threads have reached the BARRIER clause in code.

• SINGLE: The SINGLE clause prevents all of the executing threads in a PARALLEL region from executing the code within the SINGLE region. Only a single thread executes this code while the rest of the threads wait at the end of the region for the execution to complete, unless the NOWAIT option is specified.

• CRITICAL: CRITICAL clauses define regions that only one thread may execute at a time. Threads must wait at the beginning of a CRITICAL region if another thread is executing the code within the region.

• LOCK/UNLOCK: This clause is an alternative to using a CRITICAL region, and allows for mutual exclusion locking/unlocking of mutex-type variables.

• ORDERED: ORDERED directives ensure that the code defined by the ORDERED region is executed in the same order as it would be if the execution loops were run in a sequential manner.

• REDUCTION: This clause can be used within the PARALLEL region to define the method in which the variables used within the PARALLEL region are to be combined and returned when the PARALLEL region's execution is complete.
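To illustrate how these directives are typically combined in practice, the short C sketch below parallelizes a dot product. It is an illustrative example only; the array size, variable names and printed message are arbitrary and do not correspond to any benchmark used in this thesis.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N], b[N];
        double sum = 0.0;
        int i;

        /* PARALLEL FOR: the loop iterations are divided among the threads. */
        #pragma omp parallel for
        for (i = 0; i < N; i++) {
            a[i] = 1.0;
            b[i] = 2.0;
        }

        /* PARALLEL FOR with a REDUCTION clause: each thread accumulates a
           private partial sum, and the partial sums are combined when the
           parallel region completes. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i] * b[i];

        /* SINGLE: only one thread of the team prints the result while the
           others wait at the implicit barrier that ends the region. */
        #pragma omp parallel
        {
            #pragma omp single
            printf("dot product = %f (%d threads)\n", sum, omp_get_num_threads());
        }
        return 0;
    }

Compiled with an OpenMP-aware compiler, the number of threads used by such a program is typically controlled through the OMP_NUM_THREADS environment variable, which is how the thread counts in the configurations discussed in later chapters are normally set.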

2.2.1 Prior Art

The two major programming languages used for parallel programming are C and high performance FORTRAN. These languages support a combination of OpenMP and POSIX threads on many machine types and operating systems. FORTRAN was first extended for parallelization in 1973 for the ILLIAC IV, extending its support to include a parallel do operation [65]. FORTRAN has remained an important language for parallel computing and has evolved into High Performance FORTRAN (HPF), a language with parallel processing extensions. The High Performance FORTRAN specification [59] was released in 1993. The specifications were revised throughout the 1990s until the release of the current HPF 2.0 specifications in 1997.

The OpenMP API [20] standardized a shared memory programming interface for several computer languages. Researchers have developed compilers to take advantage of OpenMP instructions for clusters of SMPs [78]. Academia has also used OpenMP to develop several benchmarks for shared memory systems, the most prominent of them being the NAS OpenMP benchmark suite [44] and the SPEComp suite [6]. In addition to these major suites, other benchmarking programs such as the EPCC [74] microbenchmarks and the LLNL benchmarks [54] evaluate the overhead incurred during OpenMP library calls.

Several publications have dealt with the parallelization of sequential code using OpenMP, including work by Couturier et al. [17], in which they parallelize a molecular dynamics program using OpenMP, and work such as that by Renouf et al. [75], in which they use OpenMP to speed up the computation of the dynamics involved in large granular materials such as concrete or granular powders. OpenMP has also been used in conjunction with MPI to develop hybrid OpenMP+MPI programs for intra-node and inter-node parallelism for parallel jobs such as Monte Carlo simulations [80] and complex tasks such as high-resolution map visualization [15].

2.3 System and Application Characterization

Characterization of real systems can be a difficult task. Commercial systems are not released with their design files, or in-depth technical details about their circuit layout. This creates challenges, as it is not possible to perform a low-level circuit analysis on such systems, and consequently, simulation of such systems is not accurate compared to the real run-time characterization of the systems. The characterization of such systems must therefore be accomplished using repeatable benchmark programs, and for the purposes of this thesis, they must be multi-threaded benchmarks.

The primary method of obtaining information about the performance of a system is a profiling tool like VTune [41] that uses performance counters built into the CPU architecture. These hardware performance counters are system registers that can be set up to monitor a variety of relevant system events. These registers enable one to obtain very accurate counts of system events that are relevant to the system's performance, such as cache hit rates, the number of cycles spent actively executing the code, and a breakdown of the number of events that have occurred on each CPU, whether logical or physical, in the system. This can lead to difficulty in determining the cause of some observed trends, as the amount of information that can be obtained from a real system is by its nature less than that which can be collected using simulation tools at the design-file level. This is not to say that this method cannot provide significant insight into the real operation of systems, only that the granularity of such observations can make it difficult to prove the exact reason behind performance abnormalities. The results obtained from testing on a real machine are devoid of any errors that could be caused by assumptions and shortcuts used in simulation software. Therefore, while the results of such real-world tests can sometimes be difficult to explain, the results are an accurate representation of real-world performance.
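As an illustration of how such raw counter values are turned into the metrics reported in later chapters, the small C helpers below derive CPI and an L2 miss rate from event counts. The structure and field names are invented for this sketch and do not correspond to the VTune interface or to any specific counter names.

    /* Deriving characterization metrics from raw hardware-counter readings
       (illustrative field names only). */
    struct counter_sample {
        unsigned long long cycles;         /* clock cycles         */
        unsigned long long instructions;   /* instructions retired */
        unsigned long long l2_references;  /* L2 cache accesses    */
        unsigned long long l2_misses;      /* L2 cache misses      */
    };

    /* Cycles per instruction: lower is better. */
    double cpi(const struct counter_sample *s)
    {
        return s->instructions ? (double)s->cycles / (double)s->instructions : 0.0;
    }

    /* L2 miss rate as a percentage of all L2 references. */
    double l2_miss_rate(const struct counter_sample *s)
    {
        return s->l2_references
                   ? 100.0 * (double)s->l2_misses / (double)s->l2_references
                   : 0.0;
    }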

2.3.1 Prior Art

Several benchmarks have been developed to test the performance of real systems using OpenMP. Section 2.2.1 introduced the NAS, SPEC, EPCC and LLNL benchmark suites. This section discusses the most relevant and important publications that have dealt with shared memory systems and the characterization and evaluation of real systems using these benchmarks.

Several research publications have been released dealing with the evaluation of shared memory systems using OpenMP. The evaluation of OpenMP construct overhead has been performed on large-scale systems [9, 12, 28] as well as smaller-scale SMPs [57]. Liao et al. [57] evaluated OpenMP on chip multi-threading platforms, including the NAS and SPEC benchmarks for a Sun Fire V490 and a Dell Precision 450 workstation. They devised several experiments in order to get a better understanding of the behaviour of OpenMP on such architectures.

Evaluation of the OpenMP SPEC benchmark suite has been performed on the Fujitsu PrimePower 2000, SGI Origin 3800, and the HP PA-8700 systems by Saito et al. [76]. In 2001, Aslot et al. [5] evaluated the SPEC OpenMP benchmarks on a Sun Enterprise 450 SMP system. An evaluation of the SPEC OpenMP benchmarks was performed by Fredrickson et al. [28], in addition to the NAS OpenMP benchmark suite, on a Sun Fire 15K SMP.

The NAS OpenMP benchmarks were first evaluated at the time that they were released by Jin et al. [44]. The NAS OpenMP benchmarks have been further evaluated in [79] on large-scale SMP clusters in order to investigate the causes of performance variability. Maury et al. [19] evaluated a CMP/SMT hybrid using OpenMP on a 4-way SMT system and a simulated CMP and found the memory subsystem to be the primary performance bottleneck.

Several parallel programs such as HMMER have been evaluated in [84], where the authors evaluate the performance of bioinformatics programs. HMMER was also evaluated by Purkayastha et al. [73] on 4-, 8- and 16-way Intel Xeon SMP nodes using OpenMP.

2.4 Scheduling and Operating System Noise

Job scheduling is a critical component in modern operating systems. An operating system's scheduler has the task of deciding which threads are assigned time on the CPU(s) of a system. Time periods are usually of fixed length, unless the process finishes before the end of its time slot. The Linux scheduler works by priority scheduling, assigning each thread a priority. The priorities of the threads are modified after each time slot, with threads that have not been executed rising in priority and threads that have just completed their time slot being of lower priority. The Linux scheduler in the 2.4.x kernels used a single process queue which held information about all of the active tasks and their priority. Each time a new task was to be assigned, the queue needed to be traversed. The scheduler in the 2.5.x and higher kernels was upgraded to the O(1) scheduler to avoid having to traverse the queue after each time interval. The new scheduler functions by having many queues, one for each priority. This avoids the need to traverse all of the active processes each time a scheduling decision must be made; only the highest priority queues need to be searched.
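The difference between the two approaches can be sketched as follows. This is a simplified illustration of the idea only, not the actual Linux kernel code, and the data structures are invented for the example.

    /* O(n) versus O(1) task selection (illustrative sketch, not kernel code). */
    #define NUM_PRIO 140                       /* number of priority levels */

    struct task { int prio; struct task *next; };

    /* 2.4-style: one unified queue, scanned in full for every decision. */
    struct task *pick_next_unified(struct task *runqueue)
    {
        struct task *best = NULL, *t;
        for (t = runqueue; t != NULL; t = t->next)      /* cost grows with task count    */
            if (best == NULL || t->prio < best->prio)   /* lower value = higher priority */
                best = t;
        return best;
    }

    /* 2.6-style O(1): one queue per priority, plus a record of which queues
       are non-empty, so only the highest-priority non-empty queue is touched. */
    struct runqueues {
        unsigned char nonempty[NUM_PRIO];      /* stands in for the kernel's bitmap */
        struct task  *queue[NUM_PRIO];
    };

    struct task *pick_next_o1(struct runqueues *rq)
    {
        int p;
        for (p = 0; p < NUM_PRIO; p++)         /* fixed bound, independent of task count */
            if (rq->nonempty[p])
                return rq->queue[p];           /* head of the highest-priority queue */
        return NULL;
    }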

Operating system noise is a term describing the overhead caused by running the requisite daemons and services required by the operating system. This overhead can be a significant factor in determining the performance of high-intensity workloads. The actual time devoted to running operating system tasks is typically negligible, but the swapping of such small jobs into a processor group that is running a full workload can cause a job imbalance which degrades system performance by delaying some execution threads. This causes delays to all of the other related program threads when thread synchronization is necessary. This noise is most significant when the system is running a multi-threaded process that uses all of the processors in a shared memory system.

2.4.1 Prior Art

The work in [62, 94] has been the only work on devising algorithms for OpenMP loop scheduling [94] and thread pairing [62]. In [94], the authors focused on altering the behaviour of OpenMP applications executing on SMPs with SMT processors. They proposed a self-tuning loop scheduler to react to behaviour caused by inter-thread data locality, instruction mix and SMT-related load imbalance. McGregor and his colleagues [62] introduced new scheduling policies that use runtime performance information to identify the best mix of threads to run across processors and within each processor. They achieved a 7% to 28% improvement over the default Linux scheduler. Others have attempted to develop scheduling methods for improved performance on SMP machines [72] as well. Methods for scheduling on SMT systems [81] have also been introduced that attempt to take advantage of the symbiotic nature of some processes when running on an SMT system. Work by Nikolopoulos et al. [4, 68] has examined scheduling algorithms that can be used to improve performance by adaptively scheduling processes while taking into account bus bandwidth and memory pressure issues.

Previous research has shown that system noise may have a dramatic effect on high performance computing systems [45, 71]. In [71], Petrini and his colleagues noticed that their application has superior performance when using three processors per node instead of the full four. Using a number of methodologies, they discovered that this is due to neither the MPI implementation nor the network, but the system noise, including OS

daemons, kernel threads, and OS real-time interrupts, among other things. The effect of system noise on the performance of applications has also been verified in [45, 89].

2.5 System Power Consumption

Power modeling is a very in-depth process that can be accomplished using a wide range of techniques. Primarily, power modeling is done using software simulation based on the design files used to design the logic circuits in question, with programs such as HSPICE [87]. The power consumption of a given system can be roughly estimated using guidelines that have been found experimentally by measuring the current consumption of the system under various workloads. A common approach is the use of popular power simulation programs based on the Alpha family of processors, such as Wattch [10]. Simulators such as Wattch use predetermined power consumption values to estimate the approximate energy consumption of each individual instruction. The power simulation program then interfaces with a simulator, like SimpleScalar [7]. The major problem with this approach is that it does not easily translate over to current, real commercial systems. Power modeling of existing commercial systems whose design is not available to the public presents a much more difficult problem, as the models developed for such systems do not have enough information about the implementation of the device being modeled to make the models extremely accurate. The power simulation programs available rely on in-depth circuit information that was available for processors such as the Alpha. Such detailed information is not available from Intel, which provided the CPUs used in this investigation. However, measurements from Intel Corp. for their CPUs [3] provide the basic metrics of power consumption that are used in some of the experiments detailed in this thesis.

This thesis avoids many of the problems associated with modeling power consumption by measuring the power consumption of real systems. This provides the most accurate method of studying power consumption, as the power consumption measurements are within ±3% of the actual system power consumption. The power measurements were obtained using a digital multimeter equipped with a data recorder. The measurements were taken from the power supply side of the machines, which means that the

measurements taken represent the real power consumption of the system, including all of the power loss associated with inefficiencies in the power supply.

2.5.1 Power Metrics

The following metrics are used to study the power-performance characteristics of systems.

Performance: High-performance computing has always been concerned with performance. Performance of an application running on a system is given by wall-clock execution time, D.

Power: Theory [66] tells us that the power consumed by a CMOS processor, in watts, is equal to the activity factor of the system (the percentage of gates that switch for each cycle) multiplied by the capacitance of the CPU times the voltage squared times the frequency. This is shown in Equation 2.1.

P = αCV²f    (2.1)

Note that we have ignored the power expended due to short-circuit current, and the power loss from leakage current, as the dynamic power consumption, αCV²f, dominates in CMOS circuits.

Frequency is directly proportional to the supply voltage. Therefore, power is proportional to the cube of a changing frequency. However, historical data [3] suggests that power on modern processors is proportional to the square of the duty cycle.

Energy: Power is the consumption at a discrete point in time. Energy is the cost during the execution time, D, and is shown as:

E = ∫₀^D P dt = Pavg × D    (2.2)

18 degradation. The energydelay product is used to quantify the powerperformance efficiency,asshowninEquation2.3: E.D = E × D (2.3) 2.5.2PriorArt Several architectural proposals have been put forward in an attempt to reduce the powerconsumptionofprocessorsincludingproposalstoenablethedecayofinformation inprocessorcaches[37],andaveryinterestingproposaltoallowprocessoroperationat very low voltages and corresponding error correction [25]. As this thesis is not concerned with making architectural changes to existingCPUdesigns,nomoredetails relatingtosuchproposalswillbeprovided.Insteadwewillconcentrateonmethodsof reducingpowerconsumptionthataredoneusingsoftware. Therehavebeenquiteanumberofmicroarchitecturalstudiesonthesubjectofenergy reduction of modern day processors. These include dynamically tuning processor resources with adaptive processing [2], comparison of SMT and chip multiprocessing [56],andheterogeneousmulticorearchitectures[52,53].Mostofthisresearchhasbeen done with singlethreaded applications and through simulations, or analytical methods. Thisthesisconcentratesonmultithreadedworkloadsonrealsystems. Therehavebeenattemptstoreduceoverallpowerusagethroughtheuseofcompiler optimizations[47]andalsobycontrollingpowermanagementsystemsthroughcompiler flags and simulation [92]. Recent attempts have used realtime information from hardware performance counters to attempt to schedule threads for performance and powerconsumption[18]onSMTandCMPsystems.Additionalworkhasfocusedon previous run profiling for creating databases that can suggest configurations for the systemwhenrunningcertainprograms[60]. Dynamicvoltageandfrequencyscalingisknownasoneofthemosteffectivemethods to reduce CPU power consumption, unfortunately at the expense of performance degradation. In fact, the semiconductor industry has recently introduced many energy savingtechnologiesintotheirchips.Themostsuccessfulofthesehavebeen SpeedStep technology from Intel as well as PowerNow! and Cool'n'Quiet from AMD. Several papers have been published on research utilizing such features to reduce power and

19 energyconsumption,fromdevisingacompileralgorithmforoptimizingsinglethreaded programsforenergyusageonlaptops[36]topowerandenergymanagementtechniques for servers and data centers [51]. Power consumption has also been studied for high performancecomputingworkloadsinSMPandAMPservers[3,8]andinhighperformance clusters[29,30]. Theauthorsin[3,8]describetheprocessofcreatinganAMPnodefromacommercial IntelSMPserver.AnAMPisasystemthatcontainsmultipledifferentprocessors.The processorscanbedifferentarchitecuresrunningatthesameordifferentspeeds,orbethe samearchitecturebutrunningatdifferentspeeds.In[3],Annavarametal.analyzethe energy per instruction (EPI)gainsthatcanbeobtainedfromusingCPUsoperating at differentfrequencies.Infact,theydeterminedthatbyutilizingasetupthatconsistsof onefastprocessortorunsequentialcode,assistedbythreeslowerCPUstorunparallel code,onecouldreducetheoverallEPIofasystemwhilemaintainingahigherspeedthan anormalSMPusingthefixedpowerbudgetofone2.0GHzIntelXeonprocessor.They used the SPEComp benchmarks as well as several other applications in their study; however,theirworkdidnotaddressHTenabledsystems[3]. In [8], Balakrishnan et al. investigated the impact of performance asymmetry of different AMPs on commercial applications as well as SPEComp applications. For commercial applications, they observed significant performance instability. They were abletoeliminatethisforsomeapplicationsbydevisinganewkernelschedulerensuring fastercoresnevergoidlebeforeslowerones.ForSPECompscientificapplicationswith tightcouplingamongdifferentthreads,theyfoundstability,butwithpoorscalabilityas theslowestcoreforcesfasteronestoidle.Toeliminateperformanceasymmetry,they changedthestaticOpenMPloopschedulingusedtodynamicscheduling.However,this resultedindegradedperformance.ThisworkalsodidnotaddressHTenabledsystems. Although,theworkin[8]isonlyfocusedonperformanceasymmetry,theyindicateAMP systemscanbeeffectiveforpower/performanceefficiency.

2.6 Summary

Although the area of shared memory multiprocessors is a mature field and the research available for such systems is comprehensive, the study of modern hybrid

SMT- and CMP-based SMPs is still evolving. This thesis explores the characteristics of modern single- and multi-core processors using real machines. The majority of previous work on similar topics has been performed using simulations and non-SMT architectures. The study of AMPs is still relatively new, and previous work has focused on fixed power budgets and performance stability. This thesis expands on this breadth of research by investigating the power and performance benefits that can be realized using an AMP architecture. The following chapters investigate the overhead caused by OpenMP constructs and then profile single- and multi-core processors running multi-threaded scientific benchmarks. The knowledge gained from the profiling of real systems is then used to implement a new operating system scheduler that increases system performance while increasing energy efficiency.

Chapter 3: Application Specifications

This chapter details the benchmarking applications that are used in this thesis. The NAS OpenMP [44] and SPEC OpenMP [82] benchmarks are considered industry standards that are updated on a regular basis. The NAS benchmark suite is produced by the United States National Aeronautics and Space Administration (NASA). The NAS benchmark suite is available as an open source suite of applications coded in high performance FORTRAN. The SPEC OpenMP benchmarks were developed by the Standard Performance Evaluation Corporation in 2001. They are available as a commercial package and are coded using FORTRAN and C.

3.1 NAS Parallel Benchmarks

The NAS parallel benchmarks [44] are published by NASA and are based upon the Computational Fluid Dynamics (CFD) calculations that are used in scientific aerospace computations. The benchmarks consist of two types of applications: five kernel applications that replicate the behaviour of the main computational loops of CFD applications, and three simulated computational fluid dynamics applications that replicate the data manipulation activity of real CFD applications. This thesis studies the behaviour of all three of the simulated CFD applications as well as two of the kernel applications. Each application is discussed in detail in this chapter, and although this information is not strictly necessary for understanding this thesis, it is included for the interested reader who wishes to have more insight as to the nature of the benchmarks utilized here.

3.1.1 NAS Simulated CFD Applications

3.1.1.1 BT

The BT benchmark is an application that solves a 3D matrix of Navier-Stokes equations [44]. The Navier-Stokes equations used are not in a fully compressed form. The benchmark uses an Alternating Direction Implicit (ADI) approximate factorization that produces equations that are decoupled from the x, y, and z directions, which can then be solved directly [44]. The majority of the benchmark scales well, but some K-

dimensional operations involve large memory strides, which makes this benchmark less scalable due to poor cache performance.

3.1.1.2 LU

The LU benchmark solves Navier-Stokes equations in a 3D system using Symmetric Successive Over-Relaxation, by splitting the system into block lower and upper triangular systems [44]. This benchmark uses a pipelined computation where the first processor in the system provides information necessary for the execution of the thread run by the second processor, and so on. This makes this benchmark especially susceptible to synchronization issues. The introduction of delay from one of the processors can create a ripple effect amongst the pipeline which significantly affects the performance of the benchmark. Consequently, this benchmark is best suited to execution on systems with a small amount of additional system overhead.

3.1.1.3 SP

The SP benchmark uses a factorization to decouple the x, y and z directions and directly solves the resulting decoupled linear equations [44]. This benchmark is somewhat scalable. Its scalability is limited primarily due to the same cause as the scalability of BT: memory operations in the K direction use large strides and cause a significant number of cache misses. Therefore, this benchmark is dependent on memory subsystem performance for good performance.

3.1.2 NAS Kernel Applications

3.1.2.1 MG

MG is a multi-grid solver for processing scalar Poisson equations. Because of the nature of this multi-grid solver, it is an excellent intensive memory test as it requires a significant amount of data movement [44]. The benchmark projects a fine grid onto a coarse grid, solves the coarse grid and projects the result onto a fine grid, then performs a smoothing operation. This process is repeated several times. Due to the size of the grids, this benchmark is affected by memory subsystem speeds.

3.1.2.2 CG

CG stands for conjugate gradient, which this test uses on a sparsely populated unstructured matrix, and it is meant to test the system's ability to perform operations on such matrices [44]. This is of course memory-fetch intensive, as the locations of data are distributed randomly, thereby reducing the effectiveness of prefetching. This benchmark scales well for small numbers of processors but begins to experience poor scaling between 8 and 16 processors.

3.1.3 SPEC OpenMP Benchmark Suite

The Standard Performance Evaluation Corporation (SPEC) [82] CPU2000 benchmark suite was adapted to run under OpenMP in 2001. Since that time some modifications have been made to the suite. The version of the suite used for testing in this publication was version 3.0, with the Purdue University source code fix for the art benchmark. This version is almost identical to the only recently released version 3.1 (the Purdue University modifications were officially included with version 3.1). Eleven benchmarks are provided with the suite, and six of those benchmarks are used in this paper, comprising five high performance FORTRAN benchmarks and one C benchmark. The benchmarks simulate the runtime behaviour of real scientific applications.

The apsi benchmark is a FORTRAN benchmark simulating an air pollution modeler [6]. It uses primarily floating point arithmetic. It models the effects of temperature, pressure and wind on pollution particles and tracks the predicted diffusion of pollutants through the atmosphere. The art benchmark is a C benchmark simulating image recognition and neural network tasks [6]. An Adaptive Resonance Theory neural network is trained to recognize two objects, an airplane and a helicopter, and then is used to find these objects in an image. The fma3d benchmark is a FORTRAN benchmark based on a crash modeling simulator [6]. The benchmark is floating point operation intensive as it uses finite element analysis to model the effect of sudden impacts on 3D solids. The input files used model an explosive element of a cylinder. The mgrid benchmark is a FORTRAN benchmark that is a multi-grid solver [6]. This benchmark is a standard multi-grid solver, a technique that is used in a variety of areas. This

It is a standard multigrid solver, a technique that is used in a variety of areas. The benchmark uses a constant-coefficient equation on a cubical grid. The nature of multigrid solvers makes them memory intensive, as a large amount of data is moved during the computational process.

The swim benchmark is a FORTRAN benchmark for modeling shallow water [6]. It calculates the velocity vectors of a shallow-water data representation consisting of a 1335x1335 matrix over a short time span. It uses a significant amount of double-precision floating-point arithmetic. Finally, the wupwise benchmark is a FORTRAN benchmark modeling quantum chromodynamics [6]. Quantum chromodynamics is the study of subatomic particles such as quarks and gluons and the interactions between them. Wupwise stands for the Wuppertal Wilson Fermion Solver, which is used to calculate the interactions between quarks in the area of fundamental physics. This is a very computationally intensive process.
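To make the memory behaviour of the CG kernel described in Section 3.1.2.2 concrete, the sketch below shows the style of sparse matrix–vector product (here in compressed-sparse-row form) that dominates a conjugate-gradient iteration. It is a generic illustration, not code taken from the NAS suite; the indirect column indices are what make the access pattern effectively random and defeat hardware prefetching.

/* Sparse matrix-vector product y = A*x with A stored in compressed sparse row
 * (CSR) form: rowptr[i]..rowptr[i+1] delimits row i, and colind[] holds the
 * scattered column indices.  The gather x[colind[k]] touches essentially random
 * cache lines, which is why CG-style kernels are memory-fetch bound. */
void spmv_csr(int n, const int *rowptr, const int *colind,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];   /* irregular, prefetch-unfriendly access */
        y[i] = sum;
    }
}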

Chapter 4: Workload Characteristics of SMT-Capable SMPs

In this chapter we examine the SMT implementation from Intel called Hyper-Threading. We examine the performance of HT when used in multithreaded shared-memory applications. Naturally, multithreaded applications are very suitable for SMT systems. However, because of its extensive resource sharing, HT may not suitably benefit OpenMP high-performance computing applications.

This chapter first examines the performance of different OpenMP constructs on single-core, dual-CPU, HT-based Intel Xeon servers running Red Hat Linux kernels 2.4.22 and 2.6.9. It is found that the overhead of OpenMP constructs with HT enabled is significantly larger than when HT is off [33].

This chapter also presents an evaluation of scientific applications in the NAS OpenMP suite (version 3.2) [44] and the SPEC OMPM2001 suite (version 3) [82] with kernel 2.6.9. It is interesting to discover the impact of SMT-based SMP systems on the performance of such applications. The effect of resource sharing within the processors on the overall system performance is evaluated by collecting data from the hardware performance counters. In addition, the architectural limitations of such a system are pinpointed by observing its cache performance [33].

This chapter is organized as follows. The experimental setup is described in Section 4.1. In Section 4.2, the performance results are analyzed. A summary of the findings is presented in Section 4.3.

4.1 Experimental Setup

The experiments in this chapter were conducted on a single-core platform, a Dell PowerEdge 2850 server. The PowerEdge 2850 server from Dell has two single-core 2.8 GHz Intel Xeon EM64T processors, a 16 KB shared execution trace cache, a 16 KB shared L1 data cache, a 1 MB shared and unified L2 cache, and 2 GB of DDR2 SDRAM on an 800 MHz front-side bus. The operating system is Red Hat Enterprise Linux WS 4.1, with Linux kernel versions 2.6.9 and 2.4.22. The L1, L2, and main memory latencies of the processor are 1.44 ns, 10.25 ns, and 142.80 ns, respectively. The main memory read and write bandwidths are equal to 3856 MB/s and 1855 MB/s, respectively.

We used two different kernels to evaluate their impact on the performance. Both kernels support Hyper-Threading; however, the task scheduler underwent a complete rewrite in the 2.6.9 kernel. The new scheduler is known as the O(1) scheduler, since it can make a scheduling decision in constant time, independent of the number of processors or the number of active tasks. This is accomplished by using multiple priority queues and a priority-aware scheduling algorithm that can quickly decide between processes based upon which priority queue they reside in. The previous version of the scheduler was also priority aware, but used a unified process queue, which necessitated stepping through the entire queue at each scheduling decision. A simplified sketch of the two approaches is given below.

The number of active processors was limited using the maxcpus=X boot option of the kernel. This option causes the kernel to only initialize and use X logical processors. This method of disabling additional processors is preferable to running fewer threads when determining scalability, since it better emulates a smaller system.
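The following heavily simplified sketch contrasts the two selection strategies; it is not kernel code, and the structures shown (the runqueue, the bitmap, and the goodness field) are reduced placeholders for the real kernel data structures:

#include <stdint.h>
#include <stddef.h>

#define NUM_PRIO 140                     /* priority levels used by the O(1) scheduler */

struct task;                             /* opaque task descriptor for this sketch */

struct runqueue {
    uint64_t     bitmap[3];              /* one bit per non-empty priority list */
    struct task *queue[NUM_PRIO];        /* head of the FIFO list for each priority */
};

/* O(1)-style selection: find the first set bit (the highest-priority non-empty
 * list) and take its head.  The cost does not depend on how many tasks are runnable. */
struct task *pick_next_o1(struct runqueue *rq)
{
    for (int w = 0; w < 3; w++)
        if (rq->bitmap[w])
            return rq->queue[w * 64 + __builtin_ctzll(rq->bitmap[w])];
    return NULL;                         /* nothing runnable */
}

/* 2.4-style selection: walk a single list of all runnable tasks and keep the best
 * candidate, so the cost grows with the number of runnable tasks. */
struct old_task { int goodness; struct old_task *next; };

struct old_task *pick_next_o_n(struct old_task *runnable)
{
    struct old_task *best = NULL;
    for (struct old_task *t = runnable; t != NULL; t = t->next)
        if (best == NULL || t->goodness > best->goodness)
            best = t;
    return best;
}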

4.1.1 Terminology

Figure 4.1 and Table 4.1 describe the naming convention for specifying the configuration of the two machines. Figure 4.1 shows the naming of each of the processors in an SMT-capable SMP system with up to 2 processors.

The basic terminology used to describe these configurations is comprised of three parts. The first part is either HToff or HTon, which describes the state of Hyper-Threading in the system. The second term indicates the total number of application threads that were used. The third term represents the number of physical processor chips used in the tests (either 1 or 2).

Figure 4.1: Processor numbering for an SMT system with a maximum of 2 processors

Table 4.1: Configuration Information

  Terminology   Hardware Contexts     Corresponding Architecture
  Serial        B0                    Serial
  HTon 2-1      A0, A1                SMT
  HToff 2-2     B0, B1                2-way SMP
  HTon 4-2      A0, A1, A2, A3        SMT-capable 2-way SMP

4.1.2 Application Characterization

Application characteristics were gathered at runtime using the hardware performance counters available on the Intel Xeon processors. We collected data using the Intel VTune Performance Analyzer version 7.2 [41]. The Intel Fortran and C/C++ compilers (version 8.1) were used to build the benchmark applications.

Collecting profiles of real systems can be a very time-consuming process. The number of performance counters available for use is extremely limited, requiring multiple runs of a single application to collect all of the required profiling information. This is further exacerbated by using large data set benchmarks, which require a significant amount of time to finish execution. In order to collect the most comprehensive profile possible, the data must be collected for the entire runtime of the application, and all of the application runs must be confirmed by running the same tests multiple times. For example, in order to collect 13 performance metrics for a system configuration using the SPEC benchmark suite used in this thesis, one has to run at least 26 iterations of the benchmark: 13 for collecting the data and at least 13 to confirm the data, not including runs which are not properly confirmed and must be thrown away. With each iteration of the SPEC benchmarks taking many hours, a single configuration profile can easily take an entire month of processing time to complete successfully.

4.2 Experimental Results and Analysis

This section describes the results and analysis of our experiments on the PowerEdge 2850 server. We first present the overhead of OpenMP constructs. Then, we evaluate the performance of applications and metrics that may point to potential architectural bottlenecks due to resource sharing on HT processors. We are particularly interested in the effects of sibling logical processors. Particular attention is paid to the performance gaps in the HTon 4-2 and HTon 2-1 cases. Results are shown for kernel 2.6.9, unless otherwise noted.

4.2.1 EPCC

Overhead due to synchronization and loop scheduling is an important factor in the performance of shared-memory parallel programs written in OpenMP. The EPCC OpenMP microbenchmarks (version 2.0) [74] were used to measure the overhead of synchronization and loop scheduling calls in the OpenMP runtime library.

The synchronization benchmark measures the overhead incurred by work-sharing and mutual exclusion directives. The work-sharing directives include PARALLEL, DO/FOR, PARALLEL DO/FOR, and BARRIER. The mutual exclusion directives include SINGLE, CRITICAL, LOCK/UNLOCK, ORDERED, and ATOMIC.

The loop scheduling benchmark compares the scheduling policies available with OpenMP. Specifically, it compares the overhead of the FOR directive when used with three scheduling policies: STATIC, DYNAMIC, and GUIDED. The STATIC policy determines scheduling at compile time and is well suited for programs with static workloads that can be easily divided among threads. The DYNAMIC and GUIDED scheduling policies are intended for programs with dynamic workloads that must be balanced between threads at runtime. All three policies have an additional parameter, chunk size, which specifies the size of a single work unit in terms of loop iterations. A smaller chunk size allows for finer-grained scheduling at the cost of more scheduling overhead. The GUIDED policy attempts to balance this tradeoff by dynamically decreasing the chunk size. A short sketch of how these policies are expressed in an OpenMP program is given below.
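The sketch below is a generic illustration, not code from the EPCC suite; work() and the loop bound N are placeholders. The schedule clause selects the policy and an optional chunk size:

#include <omp.h>

#define N 4096

/* Placeholder loop body; in the EPCC benchmark this is a small delay routine. */
static void work(int i)
{
    volatile double x = 0.0;
    for (int j = 0; j < (i % 64); j++)
        x += j;
}

void scheduling_examples(void)
{
    /* STATIC: iterations are divided into fixed contiguous blocks up front. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) work(i);

    /* DYNAMIC with chunk size 8: idle threads grab 8 iterations at a time at run time. */
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 0; i < N; i++) work(i);

    /* GUIDED with chunk size 8: chunks start large and shrink, never below 8 iterations. */
    #pragma omp parallel for schedule(guided, 8)
    for (int i = 0; i < N; i++) work(i);
}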

4.2.1.1 OpenMP Synchronization

The EPCC synchronization benchmark was used to find the overhead of OpenMP directives with a varying number of threads, with and without Hyper-Threading. Figure 4.2 depicts the overhead of the different OpenMP synchronization directives for the C version (the Fortran directives are not shown, but have slightly larger overhead). Interestingly, the synchronization overhead with Hyper-Threading enabled is significantly greater than with the same number of threads on an HT-disabled system. The essence of how such overhead is measured is sketched below.
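A minimal sketch of the measurement idea, assuming the EPCC approach of timing many repetitions of the construct under test and subtracting a serial reference time (illustrative only, not the benchmark's actual source; the repetition count and delay length are arbitrary):

#include <omp.h>
#include <stdio.h>

#define REPS 1000                         /* repetitions of the timed construct */

/* Stand-in for the small unit of work used inside the timed region. */
static void delay(int length)
{
    volatile double a = 0.0;
    for (int i = 0; i < length; i++)
        a += 0.5 * i;
}

int main(void)
{
    const int work = 500;

    /* Reference time: the work executed serially, with no OpenMP construct. */
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        delay(work);
    double t_ref = omp_get_wtime() - t0;

    /* Test time: the same work wrapped in the construct being measured.  Every
     * thread performs the delay concurrently, so an ideal implementation would
     * take the same time as the reference. */
    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel
        delay(work);
    }
    double t_test = omp_get_wtime() - t0;

    /* The reported overhead is the extra time per invocation of the construct. */
    printf("PARALLEL overhead per invocation: %g seconds\n",
           (t_test - t_ref) / REPS);
    return 0;
}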

Figure 4.2: Overhead of OpenMP synchronization (overhead time per directive, kernel 2.6.9, for the HTon 2-1, HToff 2-2, and HTon 4-2 configurations).

Figure 4.3 compares the impact of the different kernels on OpenMP synchronization directives. Overall, the new Linux kernel 2.6.9 performed on par with the Linux kernel 2.4.22 in terms of synchronization overhead for the HT-disabled case (not shown here). However, the Linux kernel 2.4.22 was found to be superior when using HT, as shown in Figure 4.3. The difference between the 2.4.22 kernel overhead and the 2.6.9 kernel overhead is most likely the result of the extra processing required to assign the newly created threads to their respective priority queues in the 2.6.9 kernel. The 2.4.22 kernel does not have this overhead, as all processes are simply inserted into a single process queue.

Figure 4.3: Kernel 2.6.9 vs. 2.4.22 impact on OpenMP synchronization overhead (HTon 2-1 and HTon 4-2 under each kernel).

4.2.1.2 OpenMP Scheduling

OpenMP provides three options for scheduling loop iterations among threads: STATIC, DYNAMIC, and GUIDED. Figure 4.4 shows the loop scheduling overheads for HToff 2-2 and HTon 4-2. As can be seen, the overhead of the different OpenMP loop scheduling schemes is 2.5 to 8 times larger with Hyper-Threading than for an HT-disabled system.

Figure 4.4: Overhead of OpenMP loop scheduling policies (kernel 2.6.9; HToff 2-2 and HTon 4-2 with STATIC, STATIC(n), DYNAMIC(n), and GUIDED(n) over chunk sizes 1 to 128).

STATIC scheduling is approximately as fast as DYNAMIC scheduling for larger chunk sizes, but of course does not have the benefit of dynamic load balancing. It is clear that if the workload is large enough that large chunks are possible, DYNAMIC scheduling, with its low overhead and load-balancing benefits, is the best choice. However, if the workload is dynamic and unbalanced as well as large, then GUIDED scheduling would be an excellent choice. The overhead of STATIC(n) scheduling is nearly constant across all chunk sizes n, and matches the overhead of STATIC. For both the HT-enabled and HT-disabled cases, the overhead of GUIDED(n) is close to that of DYNAMIC(n). Figure 4.5 compares the impact of the two kernels on the loop scheduling policies. Except for some DYNAMIC strategies and one occurrence of the GUIDED policy, kernel 2.4.22 is better than the 2.6.9 kernel.

Figure 4.5: Kernel 2.6.9 vs. 2.4.22 impact on OpenMP loop scheduling policies.

4.2.2 Application Performance and Analysis

When determining the performance of an application, execution time is the most practical performance metric. In this section, we present the performance of the NAS and SPEC applications on our PowerEdge 2850 platform under the 2.6.9 kernel. Figure 4.6 shows the speedup for each of the NAS and SPEC application benchmarks on a variety of system configurations. The speedup results are calculated using application runtime relative to the serial case. For each application, the four columns can be considered as two pairs. The first pair is for one physical processor, with and without HT enabled. The second pair is for two physical processors.
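Stated explicitly, the speedup plotted in Figure 4.6 is the serial wall-clock time divided by the wall-clock time of the configuration under test,

\[ S_{\mathrm{config}} = \frac{T_{\mathrm{serial}}}{T_{\mathrm{config}}}, \]

so values above 1 indicate that the configuration is faster than the serial run.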

Figure 4.6: Speedup for NAS OpenMP and SPEC OMPM2001 applications under kernel 2.6.9 relative to the serial case.

The goal of these tests is to understand whether scientific multithreaded applications can benefit from HT on real, commercial SMT-based SMP servers and, if they do not benefit, what architectural limitations exist in such processors. In fact, it is not always clear whether it is better to run two threads or one thread on each processor. For instance, although BT, LU, MG, SP, and wupwise benefit from having two threads in the one-processor execution, they suffer somewhat in the two-processor execution. CG and swim perform the other way around. Applications such as mgrid, apsi, and fma3d always benefit from HT. Interestingly, for art and LU, HTon 2-1 not only outperforms the serial case, but also outperforms HToff 2-2. This means that a pair of hyper-threaded logical processors is able to outperform two real physical processors. Swim is the poorest-performing application; it does not scale with either HT or SMP. This lack of scalability has traditionally been the case for the OpenMP version of swim on platforms similar to the ones used here.

BT, CG, LU, MG, and SP attain an average speedup of 35%, 6%, 10%, 9%, and 19%, respectively, when HT is enabled. For the SPEC suite of applications, wupwise, mgrid, apsi, and fma3d achieve an average speedup of 37%, 68%, 31%, 49%, and 26%, respectively. While swim is the only application with a slowdown (4%), art has the best average speedup of all the applications, equal to 115%. Overall, applications achieve a 33% speedup on average.

Table 4.2 summarizes the average performance gain when the logical processors are enabled. It is evident that the HT implementation of SMT cannot provide a performance gain for 2-processor executions, at least for our multithreaded applications.

Table 4.2: Average speedup gained by enabling HT for the NAS and SPEC parallel benchmarks

  Physical Processors   NAS OMP   SPEC OMP   Overall
  1                     34%       92%        63%
  2                     -2.2%     3.3%       0.55%

4.2.2.1 Trace Cache Analysis

In this section, we investigate the reasons behind the possible flaws of Hyper-Threading. Using the hardware performance counters of the Intel Xeon EM64T, we studied the behaviour of the trace cache for the NAS applications as well as art, apsi, and swim from SPEC. Figure 4.7 presents the trace cache misses, while Figure 4.8 depicts the trace cache delivery rate for these applications.

Figure 4.7: Trace cache misses (millions).

Compared to the baseline single-processor system without HT, the HT-enabled system performs better on almost every task. The exception is when the increased processing of the HT system causes a bottleneck in the trace cache, causing performance to suffer. Therefore, except for programs that create a large number of trace cache misses and that are also very bus-bandwidth dependent, HT provides improvements over the single-processor system.

Figure 4.8: Trace cache delivery rate (trace cache fetches per 100 clock cycles).

It can be observed in Figure 4.7 that when the number of trace cache misses increases, and memory bandwidth constraints prevent the trace cache delivery rate from increasing, the performance of the HT system suffers. This implies that memory-intensive applications are more likely to suffer from trace cache starvation, decreasing the effectiveness of HT when used with such applications. This can be understood by comparing Figure 4.6 to Figure 4.7 and Figure 4.8. For instance, where performance is significantly worse for CG on HTon 2-1 compared to the HToff 2-2 and HTon 4-2 cases, one can observe a marked increase in the trace cache misses on the HTon 2-1 system. This also occurs for the SP benchmark, but SP's performance does not suffer, because there is a corresponding increase in the trace cache delivery rate. Only when the trace cache misses increase and the cache delivery rate drops does the performance suffer. The HTon 4-2 system is able to avoid this problem through its increased system resources, which significantly reduce the trace cache miss rate, avoiding the pitfalls seen in the HTon 2-1 case. For the case of diminishing performance at the HTon 4-2 level, one can observe that the main performance bottleneck is the cache memory bandwidth. In Figure 4.9, the increased number of 2nd-level cache misses over the serial case corresponds to the decreased performance of the swim benchmark illustrated in Figure 4.6.

Figure 4.9: 2nd-level cache misses (billions).

4.3 Summary

In this chapter the overhead due to OpenMP constructs has been examined, comparing the 2.4.22 Linux scheduler with the O(1) 2.6.9 Linux scheduler. It was found that, overall, STATIC loop scheduling was a good choice for small chunk sizes, while DYNAMIC and GUIDED are the best choices as the chunk size increases. It was also found that the 2.4.22 scheduler exhibited slightly less OpenMP overhead than the 2.6.9 scheduler. In addition, the implementation of SMT from Intel, Hyper-Threading, was analyzed to determine its potential benefits in increasing system performance. It was determined that HT can only be of benefit for a small number of applications in a 2-way system, although it can be of benefit in a single-processor system. The cause of this lack of speedup when enabling HT was identified as a memory system limitation. The increased memory pressure caused by doubling the number of running threads (by turning on HT) causes an overall system slowdown that cannot be offset by the increase in the number of simultaneously executing threads.

Liao et al. [57] have previously studied OpenMP overhead, and we find that our results are consistent with their findings on a Dell Precision 450 workstation. Their NAS and SPEComp benchmark results are fairly consistent with our findings.

However, their testing methodology involved using two physical processors running 2 threads for the HTon 2-1 tests, resulting in a partial-load situation, while our results were obtained with the load entirely on a single HT-enabled processor. In addition, our newer system has led to some performance improvements over the system used by Liao et al.

With the information that was gathered in this chapter, we can now focus on the most relevant metrics for determining the optimal configuration of modern multiprocessor systems. The work illustrated here on single-core processors can be extended to multi-core processors within similar environments. Therefore, the prime metrics discussed in this chapter, such as memory bandwidth and cache miss rates, can be compared for newly emerging commercial multi-core processor systems. In the next chapter we examine multi-core systems in a similar yet more in-depth manner to determine if the traditional performance metrics are suitable for judging the performance of such new systems.

Chapter 5: Characterization and Analysis of Multi-Core SMPs

The placement of multiple cores on a single processor die is an emerging technology in today's commercial market. As the design of new processors with consistently higher clock speeds has slowed due to cooling concerns, the attention of industry has shifted from faster individual processors to groups of processors in order to increase the overall performance of CPUs. The emergence of chip multiprocessing, and the hybridization of this technology with SMT and SMP technology, creates an interesting platform from which we can determine the optimal operating environment for these emerging technologies.

This chapter investigates dual-core two-way systems available with Intel's Hyper-Threading technology and examines the reasons behind the performance of such systems under various configurations and workloads [31].

The first generation of Intel Xeon CMT processors has two distinct cores with separate 2 MB L2 caches. Each core has two hardware contexts, when enabled. The introduction of dual-channel main memory further complicates the situation with chip multithreaded SMPs, where a number of CMT processors are used in a conventional n-way SMP configuration. This can create further bottlenecks, as each individual physical chip shares a memory controller between its two cores, and the two physical chips therefore require a method of ensuring memory consistency and bus arbitration while taking into account that there are two channels to main memory. As a result, the competition for shared resources is intense.

CMT-based SMPs present new challenges as well as new opportunities to maximize performance. Given a set of applications and a number of CMT processors, there are ample opportunities to control how many threads, and which threads, to co-schedule on different cores or contexts in order to achieve the best performance. It is our intention in this chapter to identify the shared resources that might become a bottleneck to performance under different configurations.

Traditionally, OpenMP applications have been developed for flat SMP systems. With the availability of modern CMT processors, hybrid SMP configurations may perform differently due to the workload characteristics of applications. They also present new challenges to the systems software to accommodate inter- and intra-parallelism among threads. Greater demand on the cache subsystem, increased bus transactions, and stall cycles may significantly affect the performance of OpenMP applications [69].

The first contribution of this chapter is the performance evaluation of scientific applications in the NAS OpenMP suite [31] in Section 5.2. It is interesting to discover the impact of multi-core SMT-based SMP systems on the performance of such applications. We consider the effects of resource sharing within the processors on the performance by collecting data from the hardware performance counters. We attempt to pinpoint architectural limitations of such a system by observing its overall cache performance, cycles per instruction (CPI), branch prediction rates, bus transactions, ITLB and DTLB hit rates, and number of stalled cycles. In Section 5.3, an investigation is performed into the performance of the different configurations using a variety of multi-application workloads. The performance of the systems while overloaded with twice as many execution threads for a given benchmark as are available in hardware is then investigated in Section 5.4. Finally, an investigation into the effects of operating system noise is detailed in Section 5.5.

5.1 Experimental Methodology

The experiments were conducted on a Dell PowerEdge 2850 SMP server. The PowerEdge 2850 has two dual-core 2.8 GHz Intel Xeon EM64T processors, with a 12 KB shared execution trace cache and a 16 KB shared L1 data cache on each core. A 2x2 MB L2 cache is allocated, with one 2 MB L2 cache for each core on a chip. There are 4 GB of DDR2 SDRAM on an 800 MHz front-side bus. The chip's core is the Paxville core, whose basic architecture is described in Figure 5.1. The operating system is Red Hat Enterprise Linux WS 4.1 with kernel 2.6.9-11. Using LMbench [63] we have measured the L1, L2, and main memory latencies of the processor as 1.43 ns, 10.61 ns, and 136.85 ns, respectively. The main memory read and write bandwidths are 3.57 GB/s and 1.77 GB/s, respectively.

The system was tested using the LMbench benchmark to determine whether any memory bandwidth differences existed between the operation of threads on a single physical chip and the operation of those same threads spread out over both physical chips. The main memory read and write bandwidths when using two physical chips are 5.43 GB/s and 1.61 GB/s, respectively. This emphasizes the utilization of the secondary memory access controller in the second chip as well as the dual-channel memory architecture available. The decreased write bandwidth is most likely due to consistency requirements with respect to the two memory access controllers.

To test the system in a variety of configurations, some of the tests were run by masking off some of the available processors in the system such that they could not be used by our application threads. This enabled us to test the performance of the system using different thread distributions between the existing resources. The default Linux scheduler was used for assigning the individual threads amongst the specified processors.
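The text does not specify the masking mechanism used, so the following is only an illustration of one common way on Linux to confine a process, and the OpenMP threads it subsequently creates, to a chosen subset of logical processors; the CPU numbers are placeholders rather than the configuration used in these experiments:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                    /* allow logical processor 0 */
    CPU_SET(2, &mask);                    /* allow logical processor 2 */

    /* pid 0 means the calling process; threads created afterwards inherit the mask. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }

    /* ... start the OpenMP workload from here, restricted to those two CPUs ... */
    return EXIT_SUCCESS;
}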

Figure 5.1: Intel Xeon Dual-Core Chip Layout

5.1.1 Terminology

The test results detailed in the next section cover a variety of different configurations. Our intention is to find the best set of architectures for each application under study running with a given number of threads. Figure 5.2 presents the labeling used for the hardware contexts in the HT-enabled and HT-disabled systems. Such labeling helps in understanding the hardware contexts available for use in each configuration.


Figure 5.2: Processor numbering for 2-way Dual-core System

Table 5.1 shows the different configurations used in our study. The basic terminology used to describe these configurations is comprised of three parts (four parts for the overloaded cases). The first part is either HToff or HTon, which describes the state of Hyper-Threading in the system. The second term indicates the total number of application threads that were used. The third term represents the number of physical processor chips used in the tests. The first three terms are identical to the system terminology in Chapter 4. For the overloaded tests, the fourth item is the number of hardware contexts available for use. There is no need for a fourth item for non-overloaded cases, as there are as many hardware contexts available as there are application threads.

Table 5.1: Configuration Information

  Terminology    Hardware Contexts                 Corresponding Architecture
  Serial         D0                                Serial uni-processor
  HTon 2-1       C0, C1                            SMT
  HToff 2-1      D0, D1                            CMP
  HTon 4-1       C0, C1, C2, C3                    CMT
  HToff 2-2      D0, D2                            SMP
  HTon 4-2       C0, C1, C4, C5                    SMT-based SMP
  HToff 4-2      D0, D1, D2, D3                    CMP-based SMP
  HTon 8-2       C0, C1, C2, C3, C4, C5, C6, C7    CMT-based SMP
  HTon 4-1-2     C0, C1                            Overloaded SMT
  HToff 4-1-2    D0, D1                            Overloaded CMP
  HToff 4-2-2    D0, D2                            Overloaded SMP
  HTon 8-1-4     C0, C1, C2, C3                    Overloaded CMT

5.2 Single Application Results

In order to present these results in a fair manner, the configurations must be divided into several groups, such that each configuration is compared against configurations with similar amounts of resources. The HTon 2-1 configuration must be examined separately from the rest of the data, as it has by far the smallest amount of resources available to it. In this configuration it shares a trace cache between the executing threads, along with several functional units on the CPU, in addition to having only one 2 MB L2 cache available for use.

The second group is much larger, in that the HToff 2-1 configuration can be compared directly to the HTon 4-1 configuration, the only difference between the two being the presence of HT. The third group comprises the HTon 4-2 and HToff 2-2 configurations, which can also be compared to the second group, as the difference between these two groups is the use of both physical chips, although only at 50% usage, to create resources similar to those available to the second group. The fourth and final group is the HToff 4-2 and HTon 8-2 group. This group utilizes all of the resources available in our platform, with HT on and off. Sometimes, comparisons may also be made between groups, but only in the context of performance per resources available, to help determine the configurations that make the best use of the resources available to them.

5.2.1 Cache Performance

The performance of the cache is very important to overall system performance, and can illustrate potential benefits and drawbacks of certain system configurations. The L1 and L2 cache miss rates are presented in Figures 5.3(a) and 5.3(b). The first observation that can be made by examining the graphs is that the L1 cache miss rates are flat across the different configurations. This seems counter-intuitive, but is a byproduct of the benchmark applications themselves, as they use a large number of infrequently changing variables in their calculations, and only a small number of new variables for each loop. This means that a large number of L1 cache requests are hits, while only a small number are misses, requesting the few new variables required for the next loop. This leads to a good L1 cache miss percentage due to the large number of requests.

Examining the HTon 2-1 configuration, we can see that it has relatively good trace cache performance and excellent overall hit rates within each level of cache for all of the applications, with the exception of the L2 miss rate for the LU benchmark. The excellent cache performance can be attributed to several factors, one being that there is relatively little contention for the cache resources, and the limited computational resources available mean that the memory bus is free for prefetching operations during execution.

The second group is similar to the third group in terms of the trends in memory performance, with the HTon configurations having a higher miss rate than the HToff configurations. This is not unexpected, as the HTon configurations have less memory available to the system per execution thread, thereby causing more cache evictions, which are not offset by the sharing of data between the threads.

The final group of HToff 4-2 and HTon 8-2 has fairly comparable memory performance, with HTon 8-2 having the advantage in terms of its trace cache performance for the CG benchmark, while the HToff 4-2 configuration has an advantage in the L2 miss rate for the LU benchmark.

Figure 5.3: (a) L1 cache miss rate and (b) L2 cache miss rate.

The effect of enabling HT on the different architectures leads to a 2.8% increase in L1 miss rate for group 1, a 0.01% increase for group 2, and a 3.4% and 11.0% decrease for groups 3 and 4, respectively. The average first-level cache miss rate is 20.1% across all applications for all HToff configurations, while the average first-level cache miss rate for the HTon configurations is 17.9%. The CG benchmark shows a small perturbation, most likely caused by interference with the executing computational loops.

The second-level cache miss rates fluctuate between the different applications, but the changes in the LU benchmark are quite significant; therefore, the results detailed here neglect the LU miss rates. The effect of enabling HT on the different architectures leads to a 105.5% increase for group 1, a 55.9% increase for group 2, a 150.5% increase for group 3 and a 37.2% decrease for group 4. The average second-level cache miss rate is 2.35% across all applications for all HToff configurations, while the average second-level cache miss rate for the HTon configurations is 5.35%.

The LU benchmark spans a large memory area throughout its execution, and its memory access patterns require that previous computational results be present in memory for use in future computation, leading to a disadvantage for computationally limited configurations, as they cannot perform in large parallel batches and old results are evicted from the cache before they are needed again. This leads to a much larger increase in L2 cache miss rate when enabling HT, with a 7260% increase for group 1, a 2121% increase for group 2, a 177% increase for group 3 and a 1242% increase for group 4.

The trace cache miss rate of each of the architectures is detailed in Figure 5.4. It shows that enabling HT has an effect on the trace cache miss rates of the different architectures, leading to a 101.5% increase for group 1, a 20.9% increase for group 2, a 12.9% decrease for group 3 and a 5.1% decrease for group 4.

Figure 5.4: Trace cache miss rates.

The average trace cache miss rate is 59.5% across all applications for all HToff configurations, while the average trace cache miss rate for the HTon configurations is 56.8%. The high increase in trace cache miss rate for group 1 is most likely due to its direct comparison against the serial case, where the trace cache of the Intel processor has a significant efficiency advantage over a multithreaded workload, in which it must predict the behavior of several threads instead of a single active thread.

5.2.1.1 TLB Performance

By examining Figures 5.5(a) and 5.5(b), we can see that the number of ITLB misses rises between the different groups: it increases as the complexity and resources of the architectures increase. This is true for group 1 and group 2, and, if CG and LU are excluded, it is also true for group 3. For group 4, the ITLB misses decrease between the HToff and HTon configurations across all of the benchmarks except for CG. DTLB misses are relatively flat across all groups (except for HTon 8-2 for BT and LU), indicating that the increases in complexity do not significantly impact the performance of the DTLB.

Figure 5.5: (a) DTLB load and store misses normalized to the serial case and (b) ITLB miss rate.

5.2.2 Stalled Operation

Examining the number of cycles spent stalled due to memory order clearing, mispredicted branches, a lack of instructions in the trace cache, or the delay caused by the need to load data in from memory can help to determine why some of the configurations behave the way that they do. Figure 5.6 illustrates the percentage of runtime that the system spends in a stalled state for the various architectures.

The number of cycles spent in a stalled state for the HTon 2-1 configuration is poor relative to the other configuration groups. This is an indication of thread contention for shared resources in the cores. Groups 2, 3 and 4 show similar patterns once again, with the HTon configurations having more stalled cycles than the HToff configurations. Interestingly, the configurations from group 3 are worse in terms of the percentage of stalled cycles throughout all of the tests compared to group 2. The average percentage of stalled cycles for the HToff configurations is 0.83%, while the average for the HTon configurations is 1.62%.

Figure 5.6: % of total execution spent in a stalled state.

5.2.3 Branch Prediction

Figure 5.7 shows the branch prediction rate for each of the architectures for all five benchmarks. We can see that the branch prediction rates are excellent for almost all benchmarks across all configurations, with the exception of two of the HTon configurations from groups 2 and 3 for CG, and HTon 8-2 for MG. The branch prediction rate does have an effect on the total number of stall cycles that an architecture incurs, as a mispredicted branch can cause a memory order clear event and may affect the relevance of data in the trace cache. This helps to explain the number of stalled cycles and, consequently, the high CPI that the HTon configurations in groups 2 and 3 have for the CG benchmark.

Figure 5.7: Branch prediction rate.

5.2.4 Bus Transactions

When examining the bus transaction characteristics of the configurations presented in Figure 5.8, it is clear that group 1 is the only group that has memory bandwidth capacity left over to perform any kind of prefetching activity. For group 1, roughly 50% of bus accesses are prefetches in 4 of the 5 benchmarks. The only other instance of significant prefetching is the HTon 8-2 configuration for the CG benchmark.

Figure 5.8: Percentage of pre-fetching bus accesses.

5.2.5 Cycles Per Instruction

When examining the CPI of the different configurations presented in Figure 5.9, many of the observations made in the previous sections can be seen affecting the efficiency of the system; there is a direct correlation between those results and the higher-than-average CPIs of some configurations when running some of the benchmarks. It is interesting to note that the high CPIs of the HTon configurations from groups 2 and 3 running the CG benchmark correlate directly with very poor branch prediction rates and relatively high L2 cache miss rates, which combine to give these two configurations higher CPIs than the others in their respective groups. The poor CPI of the HTon 8-2 configuration executing MG also corresponds to a poor branch prediction rate, but without the high L2 cache miss rate. This makes sense, as a high branch misprediction rate causes many flushes of the execution pipeline and therefore reduces overall efficiency.

Figure 5.9: Cycles per instruction.

5.2.6 Wall Clock Performance

The NAS benchmarks were run through a series of ten independent trials, with minimal variance between tests (<~15%). The results are detailed in Figure 5.10. The runtimes of the NAS benchmarks show an interesting trend that is new to the dual-core Intel Xeon architecture: the use of SMT on a CMT core can be extremely beneficial to the performance of a system. Of particular interest are the results for the 4-threaded HTon case utilizing both cores on a dual-core chip in HTon mode (HTon 4-1). In this case the overall performance of a single SMT dual-core chip is comparable to the performance of two dual-core processors operating with HT off. Despite the single HTon 4-1 chip having half as many available computational resources, the appreciable slowdown relative to the dual-processor dual-core SMP case is only 13.6%.

Overall, the HToff 4-2 configuration has the best wall clock times, with the exception of CG, and has the highest average speedup across all of the applications. This follows the same trend found for the HTon 4-1 case, in that despite the CMT architecture having half as many available computational resources, the appreciable slowdown relative to the CMP-based SMP case is only 12.8%.

Figure 5.10: Speedup for NAS OpenMP applications.

When overall processor resources are increased to utilize two dualcore processors

with HT on (HT on 82), the results are different to previously observed HTrelated behaviorinthattheoveralleffectonperformanceisminimal,incontradictiontoprevious workonsinglecoreHTarchitecturesinchapter4.However,itshouldbenotedthatin thecaseofapplicationsinwhichthereisasignificantamountofdatasharing,HThas excellentspeedup(suchasinthecaseoftheapplication CG)whichoffsetsitspoorer performanceontheotherbenchmarks. Theoverallaveragespeedupforgroup1fortheHTenabledarchitecturesversusthe HTdisabledcaseis1.81versusspeedupof1fortheserialcase.Theaveragespeedupfor group 2 was 1.86 for the HT on caseand1.42fortheHT off case.Groups3and4had

averagespeedupsof1.43and2.10fortheHT on architecturesandspeedupsof1.83and

2.10fortheHT off cases,respectively.

ExceptfortheCGcase,theperformanceoftheHT on 82caseisworsethantheHT off 42case.Tobetterunderstandthereasonsbehindthis,weexaminetheCGapplicationin detail.Ingeneral,theHT on 82setupresultsinlesstotalbusaccessesthantheHT off 42 case,withanL1cachemissrateof47.1%versus56.2%fortheHT off 42case.This, coupledwithanL2cachemissrateof1%versus9.6%translatesintoahighernumberof nonprefetchingbusaccessesfromtheHT off 42case.TheHT off 42casehasalower

49 CPIof1.04versustheHT on 82’sCPIof5.02whichwouldimplythattheperformance of the HT off 42caseshouldbesuperior,butalargenumberofbustransactionsinthe

HT on 82 case are speculative prefetching (51.2% of all bus accesses) while the vast majorityofbustransactionsfortheHT off 42casearenottheresultofprefetching.This

leadstomuchmorespeculativeexecutionintheHT on 82casewhichisnotaccountedfor in the CPI as it is counted as cycles per instruction committed. The trace cache performance of the HT on 82 system is worse than the HT off 42 case for all of the benchmarkswiththeexceptionofCGandMG,withtheHT on 82configurationhavinga majoradvantageof35.6%missrateversustheHT off 42’smissrateof87.3%forCG. TheaveragespeedupofeachofthearchitecturesisdetailedinTable5.2.

Table 5.2: Speedup for architectures

  Architecture                    Average Speedup
  SMT (HTon 2-1)                  1.81
  CMP (HToff 2-1)                 1.42
  CMT (HTon 4-1)                  1.87
  SMP (HToff 2-2)                 1.83
  SMT-based SMP (HTon 4-2)        1.43
  CMP-based SMP (HToff 4-2)       2.11
  CMT-based SMP (HTon 8-2)        2.10

5.3 Multi-Application Results

The performance of the given architectural configurations is also of interest in a multiprogrammed environment. The following results were collected using the same configurations as in Section 5.2, but utilized more than one concurrent program execution at a time, to examine the ability of the architectures to handle complementary and non-complementary workloads of multiple programs. As such, these results are not directly comparable to those in the previous section. The goal of these multi-application tests is to determine the performance of the different available system configurations in a variety of multi-application workloads. All of the workloads used for testing fully load the system, using all of the system resources in a balanced way between the applications; for example, the HToff 4-2 configuration uses four threads, two for the first application and two for the second application, while HToff 2-1 uses two threads, one for the first application and one for the second application.

Two NAS benchmarks were selected for this task: the FT benchmark, which is a Fourier transform application requiring mostly computational resources and limited memory resources, and the CG benchmark, which requires significant memory resources. Three separate tests were conducted; the first used a combination of CG and FT, a computationally demanding application paired with a memory-intensive program. The second used two copies of FT to determine the system's performance with a mostly computationally intense workload, and the third used two copies of CG to determine the system's performance under a memory-intensive workload. The maximum number of execution threads available to each system configuration was used, with the threads being distributed evenly between the executing programs. The threads were not tied to any specific processor, and the scheduler was free to relocate threads amongst the available processors.

5.3.1 Cache Performance

By studying the L1 and L2 cache miss rates of the different architectures presented in Figures 5.11(a) and 5.11(b), we can observe that the 1st-level cache miss rates are relatively stable across the different configurations. However, the 2nd-level cache miss rates show that the HTon 2-1 configuration has difficulty achieving a high hit rate for the CG benchmark, as does the HTon 8-2 configuration.

Figure 5.11: (a) L1 cache miss rate and (b) L2 cache miss rate.

In general, all of the HTon configurations have a worse L2 miss rate than their HToff equivalents. This is most likely due to cache contention, as the two programs cause cache evictions of important data belonging to the other program. This also explains why CG is affected by the sharing of the system to a greater degree than the FT benchmark, as it uses a much larger data set and is more memory intensive than the FT benchmark.

The trace cache miss rates shown in Figure 5.12 illustrate that the HToff configurations for both group 2 and group 3 are better than the HTon configurations for both the CG/FT and CG/CG workloads, with the HTon configurations having an advantage in the FT/FT workload. The advantage of the HTon configuration for the FT/FT workload in group 2 is fairly significant, but the advantage is even greater for the HTon configuration in group 3. Finally, group 4 shows that there is no advantage to the HTon configuration in terms of trace cache misses in any of the workload configurations.

Figure 5.12: Trace cache miss rate.

5.3.1.1 TLB Performance

Figures 5.13(a) and 5.13(b) show that the HTon configurations suffer from excessive ITLB misses in both groups 2 and 3 when running the CG benchmark.

Figure 5.13: (a) DTLB load and store misses normalized to the serial case and (b) ITLB misses.

The HTon configuration from group 3 also has difficulties with its DTLB for the FT/FT and CG/CG workloads. This implies that the execution units are potentially being starved of instructions, but further investigation into the amount of stalling that occurs due to these misses is required to determine the real effect of this finding.

5.3.2 Stalled Operation

The percentages of the total run time that the system spends in a stalled state are presented in Figure 5.14. Upon examination, the amount of total execution cycles spent in a stalled state for this multi-application workload is surprising. When running a supposedly complementary workload (CG/FT), we can see that a significant amount of time is spent in a stalled state. From this we can infer that the system is having a very difficult time providing the programs with the required resources, possibly switching the processors on which the programs are running frequently.

Figure 5.14: Percentage of operation time spent stalled.

5.3.3 Branch Prediction

When examining the branch prediction rates detailed in Figure 5.15, we can see that the HToff configurations from groups 2 and 3 are both worse than the HTon configurations for all of the tests except for the CG/CG workload. Group 1 has relatively good branch prediction across the workloads, and group 4 shows that there is only a marginal difference in branch prediction rate between the HTon and HToff configurations.

Figure 5.15: Branch prediction rate.

5.3.4 Bus Transactions

The prefetching activity undertaken by each of the configurations, presented in Figure 5.16, reinforces the stalled operation results examined earlier in this section. The configurations devote a significant fraction of their bus accesses to prefetching when running the CG/FT workload. From this we can infer that the source of the stalling is not memory bandwidth issues, but can instead be attributed to other factors such as pipeline flushes.

Figure 5.16: Percentage of pre-fetching bus accesses of all bus accesses.

5.3.5 Cycles Per Instruction

The CPI results in Figure 5.17 indicate that, despite the high number of execution cycles in which the systems are stalled for the CG/FT workload, the actual CPI of the HTon configurations does not suffer significantly. With the exception of the CG/CG workload, the HTon configurations for groups 2 and 3 are better than the HToff configurations in terms of CPI. Group 4 shows that the HTon configuration is worse than the HToff configuration for all workloads.

Figure 5.17: Cycles per instruction.

5.3.6 Wall Clock Performance

The results examining multi-application performance are presented in Figures 5.18(a), 5.18(b) and 5.19. As described at the start of Section 5.3, all of the workloads fully load the system, with the threads divided evenly between the two executing applications. The applications were specifically chosen because CG is typically memory bound, while FT is compute bound. Thus, the performance numbers presented in Figure 5.18(a) correspond directly to running half of the system with compute-bound threads and half with memory-bound threads. Figure 5.18(b) demonstrates the system performance for an entirely compute-bound workload, while Figure 5.19 demonstrates the performance of a fully memory-bound workload. The results clearly indicate that there is a tangible performance benefit to running compute-bound and memory-bound applications in parallel, as the performance of both applications is better in such a balanced environment than on a system running alike applications.

Figure 5.18: (a) CG/FT and (b) FT/FT multi-application speedup over serial.

The speedups show that, by a small margin, the applications run slightly better paired with copies of themselves, but small variances occur that make it difficult to draw strong generalized conclusions. The overall best performer of any configuration is HTon 4-2, which is the fastest overall for two of the three workload pairings. The HTon 8-2 configuration is the fastest for the CG/FT test, but only by a small margin. This indicates that for multi-application workloads, HT offers a tangible performance benefit, whether the system is balanced in its workload or not. In addition, it indicates that for multi-application workloads, clever scheduler design could achieve optimal performance of the system by utilizing HT and varying the number of threads for the active applications running on the system.

Figure 5.19: CG/CG multi-application speedup over serial.

5.3.7 Cross-Product Multi-Program Results

The different architectural configurations were tested using pairs of applications, for all possible two-program pairs in all configurations. The program pairs were run with enough evenly distributed threads to fully load the architecture under test. The results are shown in a box-and-whiskers plot in Figure 5.20. The boxes in the figure represent the ranges of the data (the 25th to 75th percentile of the data falls within the box), while the whiskers represent the maximum and minimum of the data.

From these results we can conclude that the HToff 4-2 (CMP-based SMP) architecture provides the overall best performance for the majority of program pairs across all of the benchmarking programs. However, for certain program pairs, the HTon architectures can provide better overall performance. The performance of the CG benchmark when running with the BT benchmark on the HTon architectures is significantly better than on the HToff architectures, which accounts for the large whiskers on the CG results for the HTon architectures. However, BT does not see a significant speedup when run in conjunction with CG on the HTon architectures.

Figure 5.20: Multi-programmed speedup of pairs of NAS benchmarks for all architectures (HTon 2-1, HToff 2-1, HTon 4-1, HToff 2-2, HTon 4-2, HToff 4-2, and HTon 8-2).

5.4 Overloaded Configuration Analysis

The four configurations presented in Figures 5.21 to 5.28 represent balanced yet overloaded workloads for the processors. All four cases have twice as many threads as they have execution contexts in which to run them.

The LU benchmark has been omitted from these tests, as it suffers from extremely high runtimes when the system is overloaded.

5.4.1 Cache Performance

In Figures 5.21(a) and 5.21(b) we can see that the cache performance of the overloaded cases is excellent given their intensive workload. The L1 cache miss rates stay relatively stable across all of the configurations, with a small increase in the HTon 8-1-4 case for CG. L2 cache miss rates are similar to those for the non-overloaded cases, rising by a significant but not overwhelming amount given the increased workload, with the majority of the increase in miss rate occurring in the SP benchmark.

Figure 5.21: (a) 1st-level cache miss rates and (b) 2nd-level cache miss rates for the overloaded cases.

The trace cache miss rates for the overloaded cases are shown in Figure 5.22. The best trace cache performance for the overloaded cases occurs with the two HTon configurations, both having much better trace cache miss rates than their HToff alternatives. This illustrates that the HTon configurations are benefiting from their shared trace cache. It also suggests that the system is fairly well balanced in terms of program synchronization, if the HTon configurations are able to exploit their shared trace cache.

Figure 5.22: Trace cache miss rates for the overloaded cases.

The DTLB miss performance of the overloaded cases in Figure 5.23(a) is similar to that of the non-overloaded cases in Figure 5.5, showing no significant increase due to the thread overloading. The ITLB performance of the overloaded cases in Figure 5.23(b) shows an increase for most of the configurations, with the exception of HTon 8-1-4, which sees a real benefit to its ITLB miss rate when overloaded.

Figure 5.23: (a) DTLB load and store misses and (b) ITLB miss rates for the overloaded cases.

5.4.2 Stalled Operation

The percentage of the total number of execution clock ticks in which the system was stalled is presented in Figure 5.24. We can see that it is slightly higher for an overloaded configuration than for the corresponding non-overloaded configuration in Figure 5.6, rising by as much as 3.9%, with HTon 4-1-2 seeing the highest increases. This is to be expected as we overload the system, since it will require more memory-order-clear operations and, in general, require flushing of the pipeline more often, especially when switching contexts to a new execution thread.

Figure 5.24: Percentage of stalled operation for the overloaded cases.

5.4.3 Branch Prediction

The branch prediction rate for the overloaded cases, detailed in Figure 5.25, is excellent. The branch prediction problems that occurred with the HTon 4-1 non-overloaded configuration (Figure 5.7) no longer occur, and overall the configurations do well compared to their non-overloaded counterparts.

Figure 5.25: Branch prediction rate for the overloaded configurations.

5.4.4 Bus Transactions

The percentage of bus accesses that are for prefetching activities is shown in Figure 5.26. It can be seen that for both of our HTon configurations, the limiting factor to performance appears to be the computational resources available to the system, while the HToff configurations are still limited by memory performance. This is in line with the observations made for the configurations in their non-overloaded states, as shown in Figure 5.8.

Figure 5.26: Percentage of pre-fetching bus accesses for the overloaded configurations.

For the majority of the applications, this advantage in prefetching does not necessarily lead to a performance increase, as the cache hit rates are still very comparable to the HToff cases, which rely on less overall prefetching activity, as illustrated in Figure 5.8. This explains why the HTon cases do not gain a significant performance advantage over the SMP cases, as the HTon cases also have less total cache memory available to them, with more threads sharing a smaller overall cache area.

5.4.5 Cycles Per Instruction

Comparing the CPI of the overloaded configurations in Figure 5.27 with the non-overloaded configurations in Figure 5.9, we find that the CPIs are not greatly affected by overloading. The overall impact for all of the cases is an average increase in CPI of 1.35%, with the best result being a drop of 46.3% in CPI for the HTon 8-1-4 configuration on the BT benchmark, and the worst being a CPI rise of 48.8% for the HTon 4-1-2 configuration on the SP benchmark.

Figure 5.27: CPI for the overloaded configurations.

5.4.6 Wall Clock Performance

The behaviour of HT for the overloaded case of HTon 8-1-4, where 8 threads execute on a single dual-core processor with HT, shows interesting results. With the exception of BT, the HTon 8-1-4 case is one of the fastest of all of the tested configurations. Overall, it results in a slowdown of only 7.6% versus the HToff 4-2 case. This is 0.9% slower than the slowdown seen by the HTon 4-1 configuration. However, when one considers the best case for each benchmark between HTon 8-1-4 and HTon 4-1, the overall result is a speedup over the HToff 4-2 case of 3.25%. This implies that there are future opportunities to achieve performance better than that of a system with twice the computational resources, by using intelligent scheduling techniques with HT enabled on a single dual-core processor.

Figure 5.28: Speedup for the overloaded NAS benchmarks.

It is interesting to examine the performance numbers for the CG benchmark, specifically the overloaded HT on -8-1-4 case versus the HT off -4-2 case. In general, this benchmark benefits from additional execution threads that are located on physically close cores. Interestingly, the best performance on this benchmark comes from the HT on -8-1-4 case, even though its cache hit rates and trace cache hit rates are lower than those of the other configurations. However, the overloaded cases have a slight advantage in terms of efficiency of bus usage over the non-overloaded cases, as the HT on -8-1-4 case executes significantly more pre-fetch operations than do the non-overloaded cases. The percentage of operations that were pre-fetching on the HT on -8-1-4 configuration was 49.5%, versus 0.08% for the HT off -4-2 case, as illustrated in Figure 5.10. The HT on configurations have an advantage in terms of executing more pre-fetching operations than the HT off configurations throughout all of the tests, but in most benchmarks this pre-fetching is not beneficial to overall system performance. In the case of the CG benchmark, the pre-fetching significantly improves the performance of the application.

5.4.7 Overloaded Overhead

Table 5.3 provides a comparison of the overloaded cases with their architecturally equivalent non-overloaded cases from Section 5.2.

Table 5.3: Percentage degradation for overloaded cases versus non-overloaded cases

                     SMT      CMP      CMT      SMP
L1                   31.7     -0.5     14.6     0.2
L2                   317.8    -35.9    -38.8    574.6
Trace cache          -12.3    12.6     -8.2     -0.3
ITLB                 96.5     1.2      -31.8    20.4
DTLB                 -7.1     0.9      -5.3     -2.2
Stalled operation    57.6     4.0      575.8    303.5
Branch prediction    -0.6     0.9      4.0      -0.4
Bus transactions     33.0     0.7      3510.6   -29.3
CPI                  20.4     0.9      -17.8    -6.7
Speedup              -4.2     -24.8    -14.8    25.7

One can observe that overloading the architectures has a beneficial effect on speedup, particularly for CMP and CMT, when neglecting the LU benchmark. The SMT configuration has increased cache miss rates for all levels of cache and sees a large increase in CPI. The CMP architecture sees a drop in L1 and L2 cache miss rates and marginal increases for its TLBs, with only minor increases in stalls and CPI. CMT sees mostly increasing cache miss rates, but also has a significant increase in stalls and pre-fetching activities. SMP sees increases in cache miss rates and stalls, but enjoys higher pre-fetching rates and a lower CPI.

5.5 Effect of Operating System Noise

Operating systems have been shown to have an effect on the performance of intensive workloads in multiprocessor systems [45, 71]. The overhead required to maintain OS services does not represent a large portion of the system's overall computational load, but the time spent providing system services can cause the greater computational workload to lose synchronicity, which creates slowdowns, as the workload must wait at synchronization barriers. As the number of simultaneously executing threads increases, so does the penalty that is incurred at barriers, as more threads wait for a small proportion of threads that are lagging behind the average threads in the workload. These effects have been observed in large cluster systems using MPI applications and in real-time systems, so it is reasonable to investigate multi-core systems to determine the extent to which operating system noise affects the performance of such platforms running multi-threaded OpenMP applications.

While there have been some effective techniques proposed in [45, 71] to reduce the impact of system noise, such as removing unnecessary OS daemons and kernel threads (or moving them to another processor), lowering the tick rate, and co-scheduling, leaving one processor for OS tasks is still a simple, viable option to effectively separate system noise from the computation [89]. Meanwhile, past work on real-time processing with Linux schedulers [11] has found that reserving a CPU specifically to respond to real-time priority threads significantly decreases the latency for real-time threads as well as the interrupt response time. These solutions have all addressed the performance impact of OS noise, and as such it is pertinent to attempt such an approach with our system in an

attempt to increase performance. This is a useful approach as it helps alleviate unwanted cache evictions caused by OS threads that adversely impact our HPC applications.

5.5.1 Operating System Noise Effects on Single-Threaded Applications

In order to verify that operating system noise has an effect on smaller SMP systems, a preliminary test of five NAS benchmark applications was run on a system using a single execution thread. The Linux processor affinity mask was used to ensure that processes were bound to a specific CPU (a minimal sketch of this binding is given below). One test was performed with all threads in the system, including the benchmark thread, bound to a single processor using the affinity mask. A second test was performed on the same machine with the benchmark execution thread on a secondary processor while all operating system tasks were assigned to the primary processor. Each application was run several times in order to ensure accurate and reportable results. The results of these tests are detailed in Figures 5.29 and 5.30. Figure 5.29 illustrates the effect of operating system noise on the cache performance of the system. The effect of operating system noise on the cache performance of the LU benchmark is noteworthy, with an increase in cache hit rate of 20.6% once the noise is removed. The remainder of the applications show a minor improvement or no change in their cache hit rates.
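As a point of reference for the setup described above, the short listing below shows how a process can be bound to a single CPU through the Linux affinity mask. It is only a minimal sketch under the assumption that the standard sched_setaffinity() interface is used; the CPU numbers are illustrative, and the actual experiments may equally have used the taskset utility or equivalent calls.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Minimal sketch: pin the calling (benchmark) process to logical CPU 1,
     * leaving CPU 0 free for operating system tasks.  CPU numbers are
     * illustrative only. */
    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(1, &mask);                   /* allow execution only on CPU 1 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        printf("pid %d bound to CPU 1\n", (int)getpid());
        /* ... launch or perform the benchmark computation here ... */
        return EXIT_SUCCESS;
    }

Binding the remaining operating system tasks to the primary processor can be achieved with the same call applied to each process, which is the effect the affinity-mask configuration described above produces.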

Figure 5.29: Improvement in cache hit rate without OS noise

The results in Figure 5.30 indicate that the system can experience up to a 9.1% performance impact due to operating system noise, and all applications show some

decrease in performance due to the noise. The LU benchmark sees a decrease in runtime corresponding to its increased cache hit rate, as do BT and CG. SP and MG have also both seen a decrease in runtime when OS noise is removed, indicating that the technique is effective for the entire range of benchmark tests.

Figure 5.30: Effect of operating system noise on system performance

5.5.2 Operating System Noise Effect on Multi-threaded Applications

Given the findings of the serial-case research, a comprehensive study of the system under many different configurations was performed. Operating system noise was isolated on the system by masking off a single processor (logical or physical, depending on the configuration) and assigning OS tasks to that processor. The resulting data, shown in Figure 5.31, indicate that the effect of OS noise can be significant with modern multi-core processors. OS noise almost always results in an increase in L1 and L2 cache misses, with an average degradation of 1.07% for the L1 cache hit rate and 3.02% for the L2 cache across all configurations and applications of the NAS benchmarks.

The average improvement in cache hit rate for each configuration is shown in Figure 5.31 for the NAS benchmarks. The improvement percentages for each architecture correspond to the average improvement in L1 and L2 cache hit rates across all five of the NAS benchmarks. The increase in cache hit rates is of particular importance, as the bottleneck to performance for such applications is most typically memory access latency and memory bandwidth.


Figure 5.31: Improvement in Cache Hit Rate Without OS Noise

For each of the configurations, the trend is that the degradation of cache hit rates is inversely proportional to the total number of processors in the system. This is expected, as the more processors there are in the system, the less overall impact occurs when a single processor is assigned to OS tasks. This corresponds with a resulting decrease in application runtimes, particularly for programs that are significantly impacted by OS noise, resulting in significant gains in terms of wall clock execution time.

Two of the configurations presented here are special cases in that they were tested using one thread less than their operating-system-loaded counterparts. This was unavoidable, as the OS noise requires a free processor to offload its overhead onto, and for the fully utilized system configurations this is impossible. This means that although the HT on -7-2 and HT off -3-2 configurations may show a slowdown versus their operating-system-loaded partners, the systems operating without operating system noise have fewer overall resources available to them. Despite this handicap, they still manage to show gains in their respective cache hit rates for the benchmarks. The HT on -7-2 and HT off -3-2 configurations show negligible improvement for their L1 cache hit rates, but have improvements of 1.27% and 2.43% respectively for their L2 cache hit rates. Despite having fewer resources available to it, the HT on -7-2 configuration sees an 8.3% improvement in runtimes over all of the applications, with only one application seeing an increase in runtime, a 3.64% slowdown for the BT benchmark. The HT off -3-2 configuration sees a much larger increase in wall clock execution times, with an average slowdown of 16.5% across all of the applications and only one application, CG, showing an improvement of 1.55%.

The effect of operating system noise on the runtime of such scientific applications is visible, with an average decrease in wall clock times of 3.8% for all of the applications across all of the configurations. If the results for the two configurations that have less overall system resources available to them due to the isolation of operating system noise (HT off -3-2 and HT on -7-2) are removed from the group, the results improve to an average decrease in wall clock time of 7.0%. All of the configurations

(except for HT off -3-2) see a decrease in wall clock run times, ranging from 0.3% to 13.5%. The improvements in runtimes are detailed in Figure 5.32.


Figure 5.32: Application Run-time Improvement

5.6 Summary

In this chapter, we presented the performance of scientific applications from the NAS OpenMP suite on a range of system configurations with kernel 2.6.9 on a 2-way dual-core Hyper-Threaded SMP. Our performance results indicate that the majority of applications could benefit from using a single dual-core processor with HT enabled, in terms of total computing power per system resources available. However, only one application enjoyed a performance gain due to HT on both dual-core processors.

By collecting data from hardware performance counters, we analyzed the effect of HT on the various system configurations as well as the effect of thread overloading on the

system. When utilizing all of the available system resources, most applications suffer from the increasing number of cache misses when both dual-core processors are enabled with HT.

The decisions made by the scheduler are crucial to the performance of HT. With the optimization of the scheduler, the performance of a single processor could be increased for scientific applications to almost the same performance level as a system with twice as many non-HT processors.

In addition, it has been determined that operating system noise may cause significant performance degradation and could be a potential source of optimization for small-scale multi-core SMPs. The next chapter uses the findings of this chapter to propose a method of improving system performance, and takes advantage of an opportunity to reduce system power consumption at the same time.

Chapter 6: Power Management of Chip Multi-Threading SMPs

Power consumption is an important design constraint in modern-day servers and high performance server clusters. This chapter explores the power-performance efficiency of Hyper-Threaded AMP servers, and proposes a scheduling algorithm that can be used to reduce the overall power consumption of a server while maintaining a high level of performance. An AMP is a system that has a heterogeneous collection of processors. The processors can be of different types and/or operate at different speeds. The AMP presented here has identical processors running at different speeds.

This chapter proposes a modification to the Linux scheduler as a method of potentially reducing the power consumption of a system, while producing less of a performance impact on the system than would otherwise be achieved using the default process scheduler [32, 34]. Previous research has shown that system noise, including operating system (OS) interference with the application, has a dramatic effect on high performance computing [71]. Using static clock throttling and processor affinity, we bind all OS activities to logical processor zero (or physical processor zero), which runs at a lower frequency than the rest of the processors in the AMP. In order to sustain the performance of the parallel OpenMP threads, all other processors run at their maximum frequency.

The results in the previous chapter concerning the effect of operating system noise on system performance have been the motivation behind this method of offloading system noise onto a single processor. This reduced noise should correspond to an increased performance of the user threads such that the impact of reserving the CPU for system tasks is minimized. In the event of a system load that does not correspond to a full load for a single processor, there exists an opportunity to reduce the frequency of the reserved CPU such that its load is as close to 100% as possible. This clock throttling of the reserved CPU has a power-saving effect.

The rest of this chapter is organized as follows. In Section 6.1, we describe the experimental framework, including the AMP setup. Section 6.2 examines the performance increase that is possible using the proposed scheduler over the default scheduler. Section 6.3 describes the real power consumption measurements for a dual-core 2-way CMP/SMT hybrid that uses clock throttling.

Finally, we predict the effect that true frequency scaling would have on the power consumption of a dual-core 2-way CMP/SMT hybrid system in Section 6.4.

6.1 Experimental Framework

The experiments in Sections 6.2 and 6.3 were conducted on a dual-core Dell PowerEdge 2850 server. The specifications of this platform can be found in Section 5.1.

6.1.1 AMP Setup

To evaluate the power-performance efficiency of AMP over SMP systems, we created static AMP configurations on our 2-way dual-core platform through clock throttling and affinity control. In clock throttling, one can set the duty cycle to one of the seven available levels. Clock throttling has a similar impact on performance as reducing the frequency [3], but is not as ideal a solution as true frequency scaling, which should further increase the potential energy savings of the approach detailed in this chapter. This effect is predicted in Section 6.4.

The Linux 2.6.9 kernel supports clock throttling through a sysfs interface with appropriate drivers. The system was configured to enable CPU frequency scaling using clock throttling. The p4-clockmod driver was built into the kernel and the standard sysfs interface was used. The frequency governor was set to userspace control, creating a static operating point for clock throttling. By static setup, we mean the duty cycle is set only once, before the application run.

We have implemented a new Linux scheduler, to be called the power-saving scheduler (PS-Scheduler), by modifying the Linux scheduler to reserve a single CPU that runs only kernel threads, leaving the rest of the CPUs in the system to execute all user threads at maximum frequency. This is accomplished by using the processor affinity properties available in Linux 2.6.9 that allow processes to be bound to a specific set of processors, or to an individual processor. A minimal user-space sketch of this throttling and reservation setup is given after Table 6.1.

The available operating points for the 2-way dual-core platform were 2.8GHz, 2.4GHz, 2.1GHz, 1.8GHz, 1.5GHz, 1.2GHz, 900MHz, 600MHz, and 300MHz. The CPU frequency of the first physical processor was adjusted throughout the available

operating points for the execution of system activities, while the rest of the processors in the system remained at the highest available clock frequency to run the application threads. However, our experimentation with the application benchmarks revealed that the performance of the AMPs decreased significantly when the clock speed was reduced to under two times the front side bus (main memory pathway) speed of the machine, so only results from the operating points above two times the front side bus speed of each machine are reported.

It should be mentioned that the duty cycle can only be set on a per-physical-processor basis on Intel multiprocessors. Therefore, in the case of an HT-enabled system, this creates an asymmetrical imbalance among the logical processors executing the user threads (the benchmarks). This can have a negative effect on system performance.

In order to differentiate between the possible configurations of our platform, a naming convention similar to the one used in Chapter 5 is presented in Table 6.1. Figure 6.1 is provided as a reference to help understand the different configurations. The system is identical to the system in Chapter 5, with the expression AMP indicating that the system is operating as an AMP, followed by its HT status (either on or off), followed by the number of threads used, and finally the number of physical processors in use.

Figure 6.1: Processor numbering for 2-way dual-core system

Table 6.1: AMP Naming Convention

Terminology        Hardware Contexts                       Corresponding Architecture
AMP-HT on -1-1     C0 used for OS, C1                      AMP SMT
AMP-HT off -1-1    D0 used for OS, D1                      AMP CMP
AMP-HT on -3-1     C0 used for OS, C1, C2, C3              AMP CMT
AMP-HT off -1-2    D0 used for OS, D2                      AMP SMP
AMP-HT on -3-2     C0 used for OS, C1, C4, C5              AMP SMT-based SMP
AMP-HT off -3-2    D0 used for OS, D1, D2, D3              AMP CMP-based SMP
AMP-HT on -7-2     C0 used for OS, C1, C2, C3, C4,         AMP CMT-based SMP
                   C5, C6, C7, C8
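To make the static AMP setup above more concrete, the following sketch shows one way the clock throttling and CPU reservation could be driven from user space. This is a hedged illustration, not the code used in this thesis: it assumes the p4-clockmod driver exposes the standard cpufreq sysfs files (scaling_governor and scaling_setspeed) for CPU 0, and it keeps a user process off the reserved CPU with sched_setaffinity(); the kernel-level PS-Scheduler performs the equivalent reservation inside the scheduler itself.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Write a string to a sysfs file (requires root); used here to select
     * the userspace governor and a static operating point for CPU 0. */
    static int write_sysfs(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s\n", value);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        cpu_set_t mask;
        int cpu;

        /* 1. Throttle CPU 0 (the CPU reserved for OS tasks) to a lower
         *    operating point; the other CPUs stay at full speed.  The path
         *    and frequency value (2.1 GHz, in kHz) are illustrative. */
        write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                    "userspace");
        write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                    "2100000");

        /* 2. Keep this user (benchmark) process off the reserved CPU by
         *    allowing it to run only on CPUs 1..3 of the platform. */
        CPU_ZERO(&mask);
        for (cpu = 1; cpu < 4; cpu++)
            CPU_SET(cpu, &mask);

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        /* ... run the OpenMP application from this process ... */
        return EXIT_SUCCESS;
    }

Note that, as discussed above, throttling physical processor zero on an HT-enabled system scales both of its logical CPUs, which is the source of the asymmetrical imbalance among the logical processors executing the user threads.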

6.2 PS-Scheduler vs. the Default Linux Scheduler

The main motivation behind the proposed scheduler is to sustain an SMP's performance as delivered by the original O(1) scheduler, while providing energy savings. Our intention in this section is to see if the new scheduler performs on par with the default scheduler in both HT-enabled and HT-disabled configurations. The configurations in this section have all processors running at full clock speed, so they use the naming convention introduced in Chapter 5 instead of that in Section 6.1.

Figure 6.2 compares the baseline performance of the PS-Scheduler with the default scheduler for the SPEC benchmarks for both HT-disabled and HT-enabled systems. The performance of the PS-Scheduler compared to the default scheduler is promising, with almost all configurations and benchmarks seeing an improvement in performance. The HT on -3-1 and HT on -3-2 configurations show the best speedup of the configurations, at an average of 14.4% and 16.2% respectively. The HT on -1-1 configuration shows the worst speedup, with a slowdown of 11.8%. Only the HT on -1-1 and HT off -3-2 configurations show a decrease in performance with the new scheduler.

Overall, the performance gain of the PS-Scheduler over the original scheduler for the HT-enabled configurations ranges from –11.8% to +16.2%, with an average performance gain of 5.3%. For the HT-disabled case, the performance gain ranges from –9.1% to +4.4%, with an average performance gain of –0.9%.


Figure 6.2: PS-Scheduler vs. default scheduler for SPEC benchmarks

6.3 Real Power Measurements

An experiment was set up in order to determine the actual power consumption of a real system utilizing the different schedulers. A Dell PowerEdge 2850 dual-core 2.8GHz SMP was used for the testing, whose details are described in Section 5.1. The system was connected to a Keithley 2701 digital multimeter [49] with a Keithley 7700 data acquisition unit [48] installed. The overall power consumption of the system was measured using a resistive element attached to the power cable leading into the system. Due to the nature of the rack-mounted system, power measurements on the output of the power supply inside the machine were impractical. Therefore, the power consumption numbers presented here are affected by the power supply losses and take into account all components of the system fed by the main power supply.

The SPEComp benchmark suite was used for real-world power measurement due to its longer runtimes, which enabled us to obtain consistent, stable power measurements. The power measurement equipment was validated through the use of a Wattsup EPS Pro power meter [23], and found to be within the acceptable error range of the two devices, with the Keithley meter having an error range of +/-1%.

6.3.1 Average Power Consumption

The average instantaneous power measurements for each of the configurations at the varying operating frequencies are presented in Figure 6.3 and Figure 6.4. The power consumption of the system using the new scheduler is surprising. In many cases, the new scheduler has higher instantaneous power consumption than the original scheduler. The highest average instantaneous power consumption of either scheduler occurs with the PS-Scheduler, with a high of 337.4W for the AMP-HT on -7-2 configuration running mgrid at 2.8GHz. This compares to the maximum value for the default scheduler of 320.9W for the HT on -8-2 configuration running mgrid at 2.8GHz. However, the average power consumption for all of the operating frequencies and configurations across all of the applications is lower for the PS-Scheduler, at 262.4W versus the default scheduler's average of 272.1W. Given that the PS-Scheduler has faster runtimes than the default

scheduler, the higher instantaneous power readings do not necessarily translate into worse overall energy efficiency, as will be discussed later in this chapter.

Figure 6.3: Average power consumption of HT-enabled configurations

Figure 6.4: Average power consumption of HT-disabled configurations

6.3.2 Slowdown and Energy Savings

This section presents the actual slowdown in wall clock time that occurs when using the PS-Scheduler and the corresponding energy savings. The AMP frequency

in the figures in this section corresponds to the CPU frequency of the first physical processor in the system. The remaining physical processors run at maximum frequency. In the case of the HT-enabled processors, it should be noted that the first two logical processors are scaled in frequency. Therefore, one logical CPU that is executing the user threads has a reduced frequency in addition to the reserved CPU.
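For clarity, the metrics plotted in this section can be read as follows; this is the usual convention for such comparisons and is stated here as an assumption rather than quoted from the measurement setup. With $t$ the wall clock time, $\bar{P}$ the average instantaneous power, and $E = \bar{P}\,t$ the consumed energy of a run, an AMP configuration compared against its full-speed counterpart has

\[
\text{slowdown} = \frac{t_{\mathrm{AMP}}}{t_{\mathrm{base}}} - 1, \qquad
\text{energy savings} = 1 - \frac{E_{\mathrm{AMP}}}{E_{\mathrm{base}}}
              = 1 - \frac{\bar{P}_{\mathrm{AMP}}\, t_{\mathrm{AMP}}}{\bar{P}_{\mathrm{base}}\, t_{\mathrm{base}}} .
\]

Because the energy depends on the product of power and time, a configuration with a higher average instantaneous power draw can still save energy overall if its runtime is sufficiently shorter, which is exactly the behaviour observed for the PS-Scheduler in Section 6.3.1.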

The results presented in Figure 6.5(a) show that the AMP-HT on -1-1 configuration on average does not perform well, with significant slowdowns occurring that yield almost a 1:1 relationship between slowdown and energy savings, with the exception of the swim benchmark, which sees both a speedup and a reduction in corresponding energy usage. This is contrasted by the AMP-HT on -3-1 results, where speedup occurs for both the 2.4GHz and 2.1GHz operating points, providing energy savings of between 18% and 32% while simultaneously reducing execution time for all applications with the exception of the swim benchmark.


Figure 6.5: Slowdown and energy savings for (a) AMP-HT on -1-1 and (b) AMP-HT on -3-1

The results for the AMP-HT on -3-2 configuration presented in Figure 6.6 are similar in pattern to those of the AMP-HT on -3-1 configuration, with speedup occurring for the applications between 2.4GHz and 2.1GHz and good resultant energy savings, with the exception of the swim benchmark, which sees some marginal improvement for the first two operating frequencies. The AMP-HT on -3-2 configuration does lag behind the AMP-HT on -3-1 configuration in total speedup and as a result shows lower potential energy savings. The AMP-HT on -7-2 configuration shows good results for the apsi benchmark, but the remaining benchmarks show a combination of slowdown and poor energy savings. The apsi, mgrid and swim benchmarks benefit from the PS-Scheduler in terms of runtime, but mgrid and swim fall behind in terms of energy consumption. The remainder of the applications suffer from both slowdown and increased energy consumption.


Figure 6.6: Slowdown and Energy Savings for (a) AMP-HT on -3-2 and (b) AMP-HT on -7-2

The results for slowdown and energy savings for the HT-disabled architectures in Figure 6.7 show that the AMP-HT off -1-1 and AMP-HT off -1-2 configurations have good slowdown/savings for the 2.4GHz operating point. With the exception of swim for the AMP-HT off -1-2 configuration, the 2.1GHz and 1.8GHz operating points see too much slowdown to be able to realize any significant energy savings.


Figure 6.7: Slowdown and energy savings for (a) AMP-HT off -1-1 and (b) AMP-HT off -1-2

The final architecture that was examined, AMP-HT off -3-2, has its slowdown and energy savings illustrated in Figure 6.8. The results are varied, with apsi and mgrid showing excellent results, while art exhibits terrible slowdown and consequently poor energy savings numbers. The fma3d and wupwise applications show that some energy savings are possible, but the resulting slowdown is slightly greater than the potential energy savings. The swim benchmark shows some speedup, and some potential for energy savings as well.


Figure 6.8: Slowdown and energy savings for AMP-HT off -3-2

6.3.3 Energy-Delay Analysis

To compare the schedulers fairly, one must also take into account the speed of each scheduler in addition to the energy savings that can be obtained. To this end, an energy-delay analysis is presented in Figures 6.9 to 6.14 for the PS-Scheduler normalized to the default scheduler. Energy-delay is a metric used to determine whether a trade-off between energy usage and the delay that it causes is beneficial or not. Energy-delay products of less than 1 indicate a beneficial trade-off.
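Stated explicitly (as a formula consistent with the description above rather than one reproduced from the original text), the normalized energy-delay product compared in these figures is

\[
ED_{\mathrm{norm}} = \frac{E_{PS}\, t_{PS}}{E_{\mathrm{default}}\, t_{\mathrm{default}}},
\]

so a value below 1 means that whatever extra delay the PS-Scheduler introduces is more than compensated by the energy it saves, and a value above 1 means the opposite.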

The energy-delay of the HT on -1-1 configuration is presented in Figure 6.9. We can observe that the energy-delay of the system operating at 2.4GHz is mediocre. However, the efficiency at the 2.1GHz and 1.8GHz operating points is excellent, with all of the applications having energy-delays below 1. Of course, this reflects the significant energy savings that a single processor has over the baseline system with four physical processors. As such, the HT on -1-1 configuration has good energy-delay products but is not very useful due to the loss of performance that it incurs.


Figure 6.9: Energy delay for the PS-Scheduler in an HT on -1-1 configuration

The energy-delay figures for the AMP-HT on -3-1 configuration in Figure 6.10 are very promising. The AMP-HT on -3-1 configuration has a majority of the applications with energy-delays of less than 1 for both the 2.4GHz and 2.1GHz operating points, but shows a significant increase in energy-delay for the 1.8GHz operating point. Overall, its average energy-delay for the 2.4GHz operating point is 0.75.


Figure 6.10: Energy delay for the PS-Scheduler in an HT on -3-1 configuration

The AMP-HT on -3-2 configuration's energy-delay products are illustrated in Figure 6.11. The AMP-HT on -3-2 configuration is the best of all of the configurations, with none of the applications having an energy-delay above 1 for the first two operating points, and energy-delays within the 0.6-0.8 range. In addition, its performance is excellent, giving it both good energy consumption and performance.


Figure 6.11: Energy-delay for the PS-Scheduler in an HT on -3-2 configuration

For the AMP-HT on -7-2 configuration presented in Figure 6.12, the majority of applications do not see an energy-delay of less than one until the operating point is lowered to 2.1GHz or below. Half of the applications see a benefit to using the PS-Scheduler. The AMP-HT on -7-2 configuration with the PS-Scheduler can be beneficial, but it is dependent on the type of applications which are being run.


Figure 6.12: Energy-delay for the PS-Scheduler in an AMP-HT on -7-2 configuration

The AMP-HT off -1-1 configuration in Figure 6.13 shows that an energy-delay of less than one can be achieved for the upper operating points with the PS-Scheduler, for all applications. This behaviour abruptly ceases when the operating point is lowered below 2.4GHz. All of the applications have energy-delay products greater than 1 for all of the operating points below 2.4GHz.


Figure 6.13: Energy-delay for the PS-Scheduler in an AMP-HT off -1-1 configuration

The energy-delay products for the AMP-HT off -1-2 configuration are shown in Figure 6.14. One can observe that the energy-delay products for the 2.4GHz operating point are all below 1, indicating that the PS-Scheduler can provide a significant increase in efficiency for all of the benchmarks in the suite. The efficacy of the PS-Scheduler technique quickly fades as the delay incurred at the lower operating frequencies outweighs any energy savings.


Figure 6.14: Energy-delay for the PS-Scheduler in an AMP-HT off -1-2 configuration

We can see that the energy-delay of the PS-Scheduler is on the whole mediocre for the

AMP-HT off -3-2 case in Figure 6.15, with no significant overall savings. Although art, mgrid and swim benefit from using the scheduler, the majority of applications do not. In fact, with the SPEC benchmarks, the system sees a general improvement of wall clock times but higher power consumption. This leads to poor energy-delay products, as the increase in speed does not offset the increase in energy consumption.


Figure 6.15: Energy-delay for the PS-Scheduler in an AMP-HT off -3-2 configuration

6.4 AMP Power Consumption Predictions With Future Technology

Using a method similar to that in [3], we can estimate the effect that true frequency scaling would have on a dual-core system. The power consumption of a 2.8GHz dual-core Xeon processor is 135W when active and 32W when idle [42]. The authors in [3] provide a method based on historical data, where, for a changing frequency, the power consumption is proportional to the square of the duty cycle. Therefore, if a 2-way 2.8GHz dual-core Intel Xeon processor system consumes 270W when highly active, the system at 2.4GHz would be expected to consume 198.3W (270W × (2.4/2.8)²) when highly active. This corresponds to a power consumption of 49.575W per core. When the CPUs are in an idle state, the power consumption of the system is 64W, corresponding to 16W per core. Applying the same scaling as used for the active case, we determine that the idle power consumption of a single core at 2.4GHz is 11.75W. Using the same methodology, one can easily find the highly active power consumption of an AMP with one CPU operating at 2.4GHz and the other three CPUs operating at 2.8GHz to be 252W. Knowing the approximate power consumption of our AMP systems when highly active or idle allows us to estimate the power consumption while executing the SPEC OpenMP benchmarks.
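The arithmetic behind these estimates can be written out as a short worked example; the quadratic scaling rule follows [3] and the wattages are the figures quoted above:

\[
P(f) \approx P_{2.8\,\mathrm{GHz}} \left( \frac{f}{2.8\,\mathrm{GHz}} \right)^{2},
\qquad
P(2.4\,\mathrm{GHz}) \approx 270\,\mathrm{W} \times \left( \frac{2.4}{2.8} \right)^{2} \approx 198.3\,\mathrm{W}
\;\; (\approx 49.6\,\mathrm{W\ per\ core}),
\]

and, for the AMP with one core at 2.4GHz and the other three at 2.8GHz (each full-speed core drawing 270W/4 = 67.5W when highly active),

\[
P_{\mathrm{AMP}} \approx 3 \times 67.5\,\mathrm{W} + 49.6\,\mathrm{W} \approx 252\,\mathrm{W}.
\]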

6.4.1 Slowdown and Energy Savings

The energy savings and slowdown due to frequency scaling of the system while executing the SPEC benchmarks are presented in Figures 6.16 to 6.18.

A prediction of the performance of the best configuration from the previous section is pertinent. Therefore, the results for the AMP-HT on -3-2 configuration are presented in Figure 6.16. Every application demonstrates a performance increase for the 2.4GHz and

2.1GHz operating points. The resulting energy savings for the AMP-HT on -3-2 configuration for the 2.4GHz and 2.1GHz operating points, on average across all of the benchmarks, are 27.6% and 27.8% respectively. This corresponds to an average speedup of 16.2% and 13.7% for the 2.4GHz and 2.1GHz operating points. The 1.8GHz operating point sees a slowdown of 32.4% and an energy loss of 5.33%.

When comparing the frequency scaling estimates with the previously measured results in Figure 6.6(a), we find that the energy savings using frequency scaling improve the results from 11.38%, 19.0% and –19.0% for the 2.4GHz, 2.1GHz and 1.8GHz operating points to 27.6%, 27.8% and –5.33%. Obviously, the use of true frequency scaling should significantly improve the effectiveness of the PS-Scheduler for this configuration.


Figure 6.16: AMP slowdown and energy savings for SPEC benchmarks over HT on -4-2

In the HT on -7-2 case in Figure 6.17, we can observe that frequency scaling would improve the energy savings of the system by 21.3% on average over the PS-Scheduler results that were measured in Figure 6.6(b).

Figure 6.17: AMP slowdown and energy savings for SPEC benchmarks over HT on -8-2

Overall, the average range of energy savings of the AMP-HT on -7-2 configuration for the benchmarks is +8.13% to +27.5%. Swim, mgrid and apsi are the applications that benefit the most from the new scheduler, seeing an improvement in both performance and power consumption. However, all of the applications see an improvement in power consumption.

The slowdown and energy savings of the AMP-HT off -3-2 system are presented in Figure 6.18. The apsi, mgrid and swim benchmarks see the greatest benefit from the PS-Scheduler, with all three benchmarks experiencing a decrease in runtime as well as energy savings. The swim benchmark exhibits the best behaviour, showing an average speedup of 5.74% and energy savings of 26.5%. The swim benchmark is a memory intensive benchmark [33], and is well known for its poor scalability. Therefore, the energy savings can be attributed to two factors: the reduction of system noise and a smaller number of overall execution threads.

Overall, the AMP-HT off -3-2 configuration has an average energy savings of 14.9% across all three frequencies and all of the benchmarks. It has an average slowdown of 9.1%. A comparison to the previous results in Figure 6.8 shows that energy savings improve by 8.84%, 7.79% and 0.26% for the 2.4GHz, 2.1GHz and 1.8GHz operating points respectively for a true frequency scaling system.


Figure 6.18: AMP slowdown and energy savings for SPEC benchmarks over HT off -4-2

6.4.2 Energy-Delay Analysis

First, we examine the predicted energy-delay product of the most efficient configuration from Section 6.3. The energy-delay products for the AMP-HT on -3-2 configuration are presented in Figure 6.19.


Figure 6.19: Normalized Energy-Delay for SPEC benchmarks for AMP-HT on -3-2 over HT on -4-2

When Figure 6.19 is compared with Figure 6.11, we can see that the energy-delay products of the predictive case are reduced. This leads to a new average energy-delay of 0.61 for the 2.4GHz operating point, compared with the measured average energy-delay of 0.74. Overall, the performance of the PS-Scheduler with a system configured as an AMP-HT on -3-2 is excellent and the energy savings are substantial.

Figure 6.20 presents the energy-delay of the AMP running the PS-Scheduler normalized to the original scheduler for the HT on -8-2 case. An energy-delay of less than one shows that the savings in energy consumption outweigh the corresponding increase in execution time. Comparing Figure 6.20 with Figure 6.12, we can observe that the energy-delays have dropped by approximately 0.2 for the applications. The average energy-delay for the 2.4GHz operating point has dropped from 1.0 to 0.82. Both the 2.4GHz and 2.1GHz operating points now have all of the SPEC applications with energy-delays below 1. This indicates that future applications of the PS-Scheduler are expected to improve as technology advances.


Figure 6.20: Normalized Energy-Delay for SPEC benchmarks for AMP-HT on -7-2 over HT on -8-2

The results of the energy-delay analysis of the AMP-HT off -3-2 configuration are presented in Figure 6.21. When comparing Figure 6.21 with Figure 6.15, we find that the energy-delays have improved, but a number of applications still have energy-delay products significantly higher than 1. However, art, mgrid and swim have shown a further improvement to their efficiency with the PS-Scheduler and receive a significant benefit from utilizing it. From this we can conclude that the PS-Scheduler has potential for use on non-HT systems for some applications.


Figure 6.21: Normalized Energy-Delay for SPEC benchmarks for AMP-HT off -3-2 over HT off -4-2

6.5 Summary

In this chapter, the lessons learned from the previous chapters were used to design a new operating system task scheduler in an attempt to increase system performance and save energy.

The power consumption of a real dual-core system was measured. It was found that performance improvements could be made using the new scheduler on dual-core systems. In addition, total consumed energy could be reduced despite a higher instantaneous power usage with the new scheduler. The best platforms for such a scheduler were determined to be the AMP-HT on -3-1, AMP-HT on -3-2 and AMP-HT off -2-2 architectures.

The modified scheduler was then tested, and its energy usage was estimated on several different computing platforms to determine the power consumption of an identical system with true frequency scaling. These results indicated that the new scheduler could achieve good energy conservation on such systems with very little performance impact.

The new scheduler does a good job of reducing the energy consumption of AMPs running the SPEComp applications while having a minimal impact on the performance of the system. The performance results using real power measurements indicate, on average, 15.6% energy savings and 6.1% slowdown for the HT-disabled case, and 7.1% energy savings and 4.8% slowdown for the HT-enabled case, across all applications studied in this chapter.

Chapter 7: Conclusions and Future Work

This thesis has explored the behaviour of single-core and dual-core symmetric multiprocessor systems using the OpenMP interface and the effect of Intel's Hyper-Threading technology. The overhead incurred by using the OpenMP API was examined, and it was determined that the 2.6.9 Linux kernel causes more OpenMP overhead than the 2.4.22 kernel for SMT architectures. The behaviour of single-core architectures, in relation to their performance executing well-known, high-profile scientific benchmarking suites, has been examined in detail. It was found that the memory system and CPU caches are the performance bottlenecks of the systems. The trace cache performance was found to be a source of performance degradation as well.

This thesis has performed an in-depth exploration of the performance characteristics of multi-core processors equipped with SMT capabilities and investigated the effect of overloaded workloads on system performance. From this investigation, it seems clear that the optimal platform for high performance computing on such systems lies in utilizing the SMT features available in the dual-core architecture. This thesis has demonstrated that SMT technologies in multi-core processor designs can perform as well as or better than SMP or CMP architectures, by showing that a single multi-core processor with SMT capabilities can closely match the performance of an SMP or CMP machine with twice as many processors. Therefore, we can conclude that HT technology can be of use in the HPC domain, and in fact can provide tangible efficiency benefits over the non-HT alternatives when used in architectures that are composed of a single dual-core CPU or two single-core CPUs. However, the benefits of utilizing HT have been negligible when the number of logical cores in the system rises above four. By operating the systems with workloads creating two times as many threads as the number of available execution contexts, it has been shown that significant performance improvements can be achieved for some applications.

The effect of operating system noise on such systems was explored, and found to be a potential area of improvement for such SMP/SMT systems. These findings initiated research into methods of reducing operating system noise and optimizing systems to increase performance, as well as investigating the potential power savings that could be realized using such schemes. In addition, the possible power consumption benefit of SMT technologies has been explored, particularly as it relates to AMPs. It has been shown that by reserving a CPU in a multiprocessor system, both performance gains and power savings are possible for real systems. This was demonstrated by measuring the power consumption of a real 2-way dual-core system. From this experimental data, the power consumption of a true frequency scaling system was predicted, showing that the energy savings of the proposed scheduler could be increased significantly when such scaling techniques are available for Xeon processors.

Therefore, we can conclude that significant performance benefits can be realized, in addition to power savings, over the traditional SMP/CMP architectures. CMP/CMT hybrid technology can be leveraged to provide performance levels equivalent to those of systems with twice as many computational resources. CMT technology is a promising new architecture that, when utilized correctly, will be able to provide great benefit to the HPC scientific community in terms of both power and performance. Coupled with the use of the scheduling method described in this thesis, the effect that such architectures can have within the HPC community is a positive one. The increase in overall machine efficiency that can be achieved using such approaches offers the possibility of increased system throughput at a lower operating cost than previous generations of systems.

7.1 Future Work

In the future, this work can be improved upon by adapting the scheduler to take advantage of online performance monitor counter data. This would allow an AMP to dynamically adjust itself according to the current processor workload. In addition, using the power monitor data collection system developed for this thesis, it is technically possible to integrate a real-time power consumption reporting system that can report the system's power consumption during system operation, and this data can be used to further refine the efficiency of the scheduler. The online power system reporting could also be integrated into an API for use within the programs themselves to make them power aware. All of these future research options should further increase the efficiency of the scheduler developed in this thesis, making the technique more beneficial for executing scientific applications.

The PS-Scheduler could be extended to work with emerging quad-core processors and tested with a larger group of scientific applications. It could also be tested and developed for use with commercial applications, particularly in a data center context. The concepts behind the PS-Scheduler could also be applied to processor design, creating a slow, low-power processing core specifically to handle OS activity.

References

[1]AdvancedMicroDevices.AMDathlonX2dualcore.Available: http://www.amd.com/us en/Processors/ProductInformation/0,,30_118_9485_13041,00.html

[2]D.H.Albonesi,R.Balasubramonian,S.G.Dropsbo,S.Dwarkadas,F.G.Friedman, M.C.Huang,V.Kursun,G.Magklis,M.L.Scott,G.Semeraro,P.Bose,A. Buyuktosunoglu,P.W.CookandS.E.Schuster,"Dynamicallytuningprocessor resourceswithadaptiveprocessing," IEEE Computer, vol.36,pp.4958,2003.

[3]M.Annavaram,E.GrochowskiandJ.Shen,"Mitigatingamdahl'slawthroughEPI throttling,"in ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005,pp.298309.

[4]C.D.Antonopoulos,D.S.NikolopoulosandT.S.Papatheodorou,"Scheduling algorithmswithbusbandwidthconsiderationsforSMPs,"in ICPP '03: Proceedings of the International Conference on Parallel Processing, 2003,pp.547554.

[5]V.AslotandR.Eigenmann,"PerformancecharacteristicsoftheSPECOMP2001 benchmarks,"in EWOMP '01: Proceedings of the European Workshop on OpenMP, 2001,

[6]V.Aslot,M.Domeika,R.Eigenmann,G.Gaertner,W.B.JonesandB.Parady, "SPEComp:Anewbenchmarksuiteformeasuringparallelcomputerperformance,"in WOMPAT '01: Proceedings of the Workshop on OpenMP Applications and Tools, 2001, pp.10.

[7]T.Austin,E.LarsonandD.Ernst,"SimpleScalar:aninfrastructureforcomputer systemmodeling," IEEE Computer, vol.35,pp.5967,2002.

[8]S.Balakrishnan,R.Rajwar,M.UptonandK.Lai,"Theimpactofperformance asymmetryinemergingmulticorearchitectures,"inISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005,pp.506517.

[9]R.BerrendorfandG.Nieken,"PerformancecharacteristicsforOpenMPconstructson differentparallelcomputerarchitectures," Concurrency Practice and Experience, vol. 12,pp.12611273,2000.

[10]D.Brooks,V.TiwariandM.Martonosi,"Wattch:Aframeworkforarchitectural levelpoweranalysisandoptimizations,"in ISCA '00: Proceedings of the 27th International Symposium on Computer Architecture, 2000,pp.8394.

[11]S.Brosky.ShieldedCPUs:Realtimeperformanceinstandardlinux.Available: http://www.linuxjournal.com

92 [12]J.M.Bull,"MeasuringsynchronisationandschedulingoverheadsinOpenMP,"in EWOMP '99: Proceedings of First European Workshop on OpenMP, 1999,pp.99105.

[13]D.Chandra,F.Guo,S.KimandY.Solihin,"Predictinginterthreadcache contentiononachipmultiprocessorarchitecture,"in HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005,pp.340 351.

[14]J.ChangandG.S.Sohi,"CooperativeCachingforChipMultiprocessors," IEEE Network, vol.1,pp.L1D,

[15]L.Chen,I.FujishiroandK.Nakajima,"Parallelperformanceoptimizationoflarge scaleunstructureddatavisualizationfortheearthsimulator,"in EGPGV '02: Proceedings of the Fourth Eurographics Workshop on Parallel Graphics and Visualization, 2002,pp. 133140.

[16]T.Constantinou,Y.Sazeides,P.Michaud,D.FetisandA.Seznec,"Performance implicationsofsinglethreadmigrationonachipmulticore," ACM SIGARCH Computer Architecture News, vol.33,pp.8091,2005.

[17]R.CouturierandC.Chipot,"ParallelmoleculardynamicsusingOpenMPona sharedmemorymachine," Comput. Phys. Commun., vol.124,pp.4959,Jan.2000.

[18]M.CurtisMaury,J.Dzierwa,D.AntonopoulosandD.S.Nikolopoulos,"Online strategiesforhighperformancepowerawarethreadexecutiononemerging multiprocessors,"in HP-PAC '06: 2nd Worksop on High-Performance, Power-Aware Computing in the Proceedings of the 21st International Parallel and Distributed Processing Symposium, 2006,

[19]M.CurtisMaury,X.Ding,C.D.AntonopoulosandD.S.Nikolopoulos,"An evaluationofOpenMPoncurrentandemergingMultithreaded/Multicoreprocessors,"in IWOMP '05: Proceedings of the International Workshop on OpenMP, 2005,

[20]L.DagumandR.Menon,"OpenMP:anindustrystandardAPIforsharedmemory programming," IEEE Computational Science and Engineering, vol.5,pp.4655,1998.

[21]M.DeVuyst,R.KumarandD.M.Tullsen,"Exploitingunbalancedthread schedulingforenergyandperformanceonaCMPofSMTprocessors,"in IPDPS '06: Proceedings of the 20th International Parallel and Distributed Processing Symposium, 2006,pp.10.

[22]M.J.DeLucaandM.A.Rivas,"Computingsystemwithselectiveoperatingvoltage andbusspeed,"U.S.A.5086501,1992,1989.

[23]ElectronicEducationalDevices.(2007,Jan.).WattsupproEPSpowermeter. 2007(May 1), pp.2.Available:https://www.doubleed.com/watts_up__es.pdf

93 [24]A.ElMoursy,R.Garg,D.H.AlbonesiandS.Dwarkadas,"Compatiblephaseco schedulingonaCMPofmultithreadedprocessors,"in IPDPS '06: Proceedings of the 20th International Parallel and Distributed Processing Symposium, 2006,pp.10.

[25]D.Ernst,N.S.Kim,S.Das,S.Pant,R.Rao,T.Pham,C.Ziesler,D.Blaauw,T. Austin,K.FlautnerandT.Mudge,"Razor:Alowpowerpipelinebasedoncircuitlevel timingspeculation,"in MICRO 36: Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003,pp.7.

[26]A.Fedorova,M.Seltzer,C.SmallandD.Nussbaum,"ThroughputOriented SchedulingOnChipMultithreadingSystems," Technical Report TR-17-04, Harvard University, August, 2004.

[27]A.Fedorova,C.Small,D.NussbaumandM.Seltzer,"Chipmultithreadingsystems needanewoperatingsystemscheduler,"in EW '04: Proceedings of the 11th Workshop on ACM SIGOPS European Workshop: Beyond the PC, 2004,pp.9.

[28]N.R.Fredrickson,A.AfsahiandY.Qian,"PerformancecharacteristicsofopenMP constructs,andapplicationbenchmarksonalargesymmetricmultiprocessor,"in ICS '03: Proceedings of the 17th Annual International Conference on Supercomputing, 2003,pp. 140149.

[29]V.W.FreehandD.K.Lowenthal,"UsingmultipleenergygearsinMPIprograms onapowerscalablecluster,"in PPoPP '05: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005,pp.164173.

[30]R.Ge,X.FengandK.W.Cameron,"Improvementofpowerperformance efficiencyforhighendcomputing,"in IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 2005,pp.8.

[31]R.E.GrantandA.Afsahi,"AcomprehensiveanalysisofmultithreadedOpenMP applicationsondualcoreintelxeonSMPs,"in MTAAP '07: Workshop on Multithreaded Architectures and Applications in the Proceedings of the 21 St International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007,pp.365.

[32]R.E.GrantandA.Afsahi.Powerperformanceefficiencyofasymmetric multiprocessorsformultithreadedscientificapplications.Presentedat HPPAC '06: 2nd Workshop on High-Performance, Power-Aware ComputingintheProceedingsofthe 20thInternationalParallelandDistributedProcessingSymposium(IPDPS2006).

[33]R.E.GrantandA.Afsahi,"Characterizationofmultithreadedscientificworkloads onsimultaneousmultithreadingintelprocessors,"in IOSCA '05: Proceedings of the Workshop on Interaction between Operating System and Computer Architecture, 2005, pp.1319.

[34]R.E.GrantandA.Afsahi,"ImprovingPowerandPerformanceofChip MultiprocessorsbyExploitingtheEffectsofOperatingSystemNoise,"Tobesubmitted.

94 [35]L.Hammond,B.A.NayfehandK.Olukotun,"ASingleChipMultiprocessor," IEEE Computer, vol.30,pp.7985,1997.

[36]C.HsuandU.Kremer,"Thedesign,implementation,andevaluationofacompiler algorithmforCPUenergyreduction,"in PLDI '03: Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, 2003,pp.38 48.

[37]Z.Hu,S.KaxirasandM.Martonosi,"Letcachesdecay:reducingleakageenergyvia exploitationofcachegenerationalbehavior," ACM Transactions on Computer Systems, vol.20,pp.161190,2002.

[38]IEEE,"InformationTechnologyPortableOperatingSystemInterface(POSIX) Standard,"vol.IEEEStd.1003.1:2001,2001.

[39]IntelCorp.(2006,Intelxeonprocessor. [http://www.intel.com/products/processor/xeon/index.htm]. 2006 Available: http://www.intel.com/products/processor/xeon/index.htm

[40]IntelCorp.Intelcore2processorfamily.Available: http://www.intel.com/products/processor/core2/index.htm

[41]IntelCorp.(2007,IntelVTuneperformanceanalyzer.Available: http://www.intel.com/software/products/vtune

[42]IntelCorp.(2005,Oct.).Dualcoreintelxeonprocessor2.8GHzdatasheet. [[online]].Available:http://download.intel.com/design/Xeon/datashts/30915801.pdf

[43]ITRS.Internationaltechnologyroadmapforsiliconorganization.Available: www.itrs.net

[44]H.Jin,M.FrumkinandJ.Yan.(1999,Oct.).TheOpenMPimplementationofNAS parallelbenchmarksanditsperformance.NASAAmesResearchCenter,U.S.A.

[45]T.Jones,S.Dawson,R.Neely,W.Tuel,L.Brenner,J.Fier,R.Blackmore,P. Caffrey,B.Maskell,P.TomlinsonandM.Roberts,"Improvingthescalabilityofparallel jobsbyaddingparallelawarenesstotheoperatingsystem,"in SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, 2003,pp.10.

[46]R.Kalla,B.SinharoyandJ.M.Tendler,"IBMPower5chip:adualcore multithreadedprocessor," IEEE Micro, vol.24,pp.4047,2004.

[47]M.Kandemir,N.Vijaykrishnan,M.J.IrwinandW.Ye,"Influenceofcompiler optimizationsonsystempower,"in DAC '00: Proceedings of the 37th Design Automation Conference, 2000,pp.304307.

95 [48]KeithleyInstrumentsInc.(2007,DigitalmultimetersanddataAcquisition/Switching systems7700. 2007(April 20, 2007), Available: http://www.keithley.com/products/dmm/?mn=7700

[49]KeithleyInstrumentsInc.(2007,DigitalmultimetersanddataAcquisition/Switching systems2701. 2007(April 20, 2007), Available: http://www.keithley.com/products/dmm/?mn=2701

[50]J.G.Koomey.(2005,Jan.).Estimatingtotalpowerconsumptionbyserversinthe U.S.andtheworld.Available: http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf

[51]R.Kotla,S.Ghiasi,T.KellerandF.Rawson,"Schedulingprocessorvoltageand frequencyinserverandclustersystems,"in IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 2005,pp.8.

[52]R.Kumar,K.Farkas,N.Jouppi,P.RanganathanandD.M.Tullsen,"SingleISA heterogeneousmulticorearchitectures:Thepotentialforprocessorpowerreduction,"in MICRO '36: Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003,pp.81.

[53]R.Kumar,D.M.Tullsen,N.P.JouppiandP.Ranganathan,"Heterogeneouschip multiprocessors," IEEE Computer, vol.38,pp.3238,2005.

[54]LawrenceLivermoreNationalLaboratory.LLNLOpenMPbenchmarks.Available: www.llnl.gov/CASC/RTSReport/openmpperf.html

[55]T.Leng,R.Ali,J.Hsieh,V.MashayekhiandR.Rooholamini,"Anempiricalstudy ofhyperthreadinginhighperformancecomputingclusters,"in The Proceedings of the Third LCI International Conference on Linux Clusters: The HPC Revolution, 2002,

[56]Y.Li,K.Skadron,D.BrooksandZ.Hu,"Performance,energy,andthermal considerationsforSMTandCMParchitectures,"in HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005,pp.7182.

[57]C.Liao,Z.Liu,L.HuangandB.Chapman,"EvaluatingOpenMPonchip MultiThreadingplatforms,"in IWOMP '05: Proceedings of the First International Workshop on OpenMP, 2005,

[58]C.Liu,A.Sivasubramaniam,M.KandemirandM.J.Irwin,"Exploitingbarriersto optimizepowerconsumptionofCMPs,"in IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 2005,pp.5a5a.

[59]D.B.Loveman,"HighPerformanceFortran," IEEE Parallel & Distributed Technology: Systems & Technology, vol.1,pp.2542,1993.

[60]G.Magklis,M.L.Scott,G.Semeraro,D.H.AlbonesiandS.Dropsho,"Profile baseddynamicvoltageandfrequencyscalingforamultipleclockdomain

96 microprocessor,"in ISCA '03: Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003,pp.1425.

[61]D.T.Marr,F.Binns,D.L.Hill,G.Hinton,D.A.Koufaty,J.A.MillerandM. Upton.(Feb.2002,Hyperthreadingtechnologyarchitectureandmicroarchitecture. Intel Technology Journal 6(1),

[62]R.L.McGregor,C.D.AntonopoulosandD.S.Nikolopoulos,"Scheduling algorithmsforeffectivethreadpairingonhybridmultiprocessors,"in IPDPS '05: Proceedingsof the 19th IEEE International Parallel and Distributed Processing Symposium, 2005,pp.28a.

[63]L.W.McVoyandC.Staelin,"Lmbench:Portabletoolsforperformanceanalysis," in USENIX Annual Technical Conference, 1996,pp.279294.

[64]MessagePassingInterfaceForum.(1997,MPI,Amessagepassinginterface standard. 1.2

[65]R.E.Millstein,"ControlstructuresinIlliacIVFortran," Communications of the ACM, vol.16,pp.621627,1973.

[66]T.Mudge,"Power:AFirstClassDesignConstraint," IEEE Computer, vol.34,pp. 525257,Apr.2001.

[67]D.Nellans,R.BalasubramonianandE.Brunvand,"Acaseforincreasedoperating systemsupportinchipmultiprocessors,"in P=ac2 '05: Proceedings of the 2nd IBM Watson Conference on Interaction between Architecture, Circuits, and Compilers, 2005,

[68] D. S. Nikolopoulos and C. D. Polychronopoulos, "Adaptive scheduling under memory pressure on multiprogrammed SMPs," in IPDPS '02: Proceedings of the International Parallel and Distributed Processing Symposium, 2002, pp. 11-16.

[69] OpenMP Architecture Review Board, "OpenMP Specification Version 2.5," 2005.

[70] OpenMP.org. OpenMP: Simple, portable, scalable SMP programming. Available: www.openmp.org

[71] F. Petrini, D. J. Kerbyson and S. Pakin, "The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of ASCI Q," in SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, 2003, p. 55.

[72] E. D. Polychronopoulos, D. S. Nikolopoulos, T. S. Papatheodorou, X. Martorell, J. Labarta and N. Navarro, "An efficient kernel-level scheduling methodology for multiprogrammed shared memory multiprocessors," in PDCS '99: Proceedings of the 12th International Conference on Parallel and Distributed Computing Systems, 1999, pp. 148-155.

[73] A. Purkayastha, C. Guiang, K. Schulz, T. Minyard, K. Milfeld, W. Barth, P. Hurley and J. Boisseau, "Performance characteristics of dual-processor HPC cluster nodes based on 64-bit commodity processors," in IISWC '05: Proceedings of the IEEE International Symposium on Workload Characterization, 2005, pp. 87-98.

[74] F. J. L. Reid and J. M. Bull, "OpenMP microbenchmarks version 2.0," in EWOMP '04: Proceedings of the 6th European Workshop on OpenMP, 2004.

[75] M. Renouf, F. Dubois and P. Alart, "A parallel version of the non-smooth contact dynamics algorithm applied to the simulation of granular media," J. Comput. Appl. Math., vol. 168, pp. 375-382, 2004.

[76] H. Saito, G. Gaertner, W. Jones, R. Eigenmann, H. Iwashita, R. Lieberman, M. van Waveren and B. Whitney, "Large system performance of SPEC OMP2001 benchmarks," in WOMPAT '02: Proceedings of the Workshop on OpenMP: Experiences and Implementations, 2002.

[77] R. Sasanka, S. V. Adve, Y. Chen and E. Debes, "The energy efficiency of CMP vs. SMT for multimedia workloads," in ICS '04: Proceedings of the 18th Annual International Conference on Supercomputing, 2004, pp. 196-206.

[78] M. Sato, S. Satoh, K. Kusano and Y. Tanaka, "Design of OpenMP compiler for an SMP cluster," in EWOMP '99: Proceedings of the European Workshop on OpenMP, 1999, pp. 32-39.

[79] D. Skinner and W. Kramer, "Understanding the causes of performance variability in HPC workloads," in IISWC '05: Proceedings of the IEEE International Symposium on Workload Characterization, 2005, pp. 137-149.

[80] L. Smith and P. Kent, "Development and performance of a mixed OpenMP/MPI quantum Monte Carlo code," Concurrency: Practice and Experience, vol. 12, pp. 1121-1129, 2000.

[81] A. Snavely, D. M. Tullsen and G. Voelker, "Symbiotic jobscheduling with priorities for a simultaneous multithreading processor," in Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2002, pp. 66-76.

[82] SPEC. (2005). SPEC OMP benchmark suite. Available: http://www.spec.org/omp/

[83] L. Spracklen and S. G. Abraham, "Chip multithreading: Opportunities and challenges," in HPCA '05: Proceedings of the International Symposium on High-Performance Computer Architecture, 2005, pp. 248-252.

[84] U. Srinivasan, P. S. Chen, Q. Diao, C. C. Lim, E. Li, Y. Chen, R. Ju and Y. Zhang, "Characterization and analysis of HMMER and SVM-RFE parallel bioinformatics applications," in IISWC '05: Proceedings of the IEEE International Symposium on Workload Characterization, 2005, pp. 87-98.

[85] Sun Microsystems. (2006). UltraSPARC IV+ processor. Available: http://www.sun.com/processors/UltraSPARCIVplus/index.xml

[86] Sun Microsystems. UltraSPARC T1 processor. Available: http://www.sun.com/processors/UltraSPARCT1/index.xml

[87] Synopsys. HSPICE circuit simulation software. Available: http://www.synopsys.com/products/mixedsignal/hspice/hspice.html

[88] X. Tian, Y. Chen, M. Girkar, S. Ge, R. Lienhart and S. Shah, "Exploring the use of Hyper-Threading technology for multimedia applications with Intel OpenMP compiler," in IPDPS '03: Proceedings of the International Parallel and Distributed Processing Symposium, 2003, p. 8.

[89] D. Tsafrir, Y. Etsion, D. G. Feitelson and S. Kirkpatrick, "System noise, OS clock ticks, and fine-grained parallel applications," in ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing, 2005, pp. 303-312.

[90] N. Tuck and D. M. Tullsen, "Initial observations of the simultaneous multithreading Pentium 4 processor," in PACT '03: Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, 2003, pp. 26-34.

[91] D. M. Tullsen, S. J. Eggers and H. M. Levy, "Simultaneous multithreading: Maximizing on-chip parallelism," in ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995, pp. 392-403.

[92] O. S. Unsal, I. Koren, C. M. Krishna and C. A. Moritz, "Cool-Fetch: Compiler-enabled power-aware fetch throttling," IEEE Computer Architecture Letters, vol. 1, pp. 100-103, 2002.

[93] N. Vachharajani, M. Iyer, C. Ashok, M. Vachharajani, D. I. August and D. Connors, "Chip multi-processor scalability for single-threaded applications," ACM SIGARCH Computer Architecture News, vol. 33, pp. 44-53, 2005.

[94] Y. Zhang, M. Burcea, V. Cheng, R. Ho and M. Voss, "An adaptive OpenMP loop scheduler for hyperthreaded SMPs," in PDCS '04: Proceedings of the 17th International Conference on Parallel and Distributed Computing Systems, 2004.
