
October 10-11, 2002, Par.Comp Workshop at IIT-Delhi

Performance Metrics and Scalability Analysis

Lecture Outline
The following topics will be discussed:
- Requirements in performance and cost
- Performance metrics
- Workload and speed metrics
- Communication overhead measurement and analysis
- Theoretical concepts of performance and scalability
- Speedup, efficiency, and the phase parallel model
- Isoefficiency models
- Examples

Performance Versus Cost

Questions to be answered:
- How do we measure the performance of a computer system executing a given application?
- What kind of performance metrics should be used?
- What are the factors (parameters) affecting performance?
- How should one measure the performance of application programs?
- What are the user's requirements in performance and cost?
- How can one determine the scalability of a parallel computer?

Approach
- Run the user's application and measure the wall clock (elapsed) time.

Remarks
- Many people believe that execution time is the only reliable metric for measuring computer performance.
- This approach is sometimes difficult to apply, and it could permit misleading interpretations.
- Pitfall of using execution time as the performance metric: execution time alone does not give the user much of a clue to the true performance of a parallel machine.

Performance Requirements

Six types of performance requirements are posed by users:
- Execution time
- System throughput
- Processing speed
- Utilization
- Cost effectiveness
- Performance/cost ratio

Remarks:
- Execution time is critical to some applications. Example: in a real-time application, the user cares about whether the job is guaranteed to finish within a time limit.
- For many applications, the users may be interested in achieving a certain processing speed rather than meeting an execution time limit. Example: a radar signal processing application.
- These requirements could lead to quite different conclusions for the same application on the same computer platform.

System Throughput
- Throughput is defined as the number of jobs processed in a unit of time. Examples: STAP benchmarks, TPC benchmarks.
- Remark: execution time is not the only performance requirement; other performance requirements are needed as well.

Utilization and Cost Effectiveness
- Instead of searching for the shortest execution time, the user may want to run his application more cost effectively.
- A high percentage of the CPU hours should be used for useful computation, which requires reducing the time spent in load imbalance and communication overheads.
- A good indicator of cost-effectiveness is the utilization factor, which is the ratio of the achieved speed to the peak speed of a given system.
- Utilization factors vary from 5% to 38% in practice. Generally, the utilization drops as more nodes are used.
- Utilization values generated from the vendor's benchmark programs are often highly optimized.
- Higher utilization corresponds to higher Gflop/s per dollar, provided CPU-hours are charged at a fixed rate.
- A low utilization always indicates a poor program or compiler. A good program could still have a long execution time due to a large workload, or a low speed due to a slow machine.

Performance/Cost Ratio
- The performance/cost ratio of a computer system is defined as the ratio of its speed to its purchasing price.
- The cost-effectiveness measure should not be confused with the performance/cost ratio of a computer system.

The misleading peak performance/cost ratio
- Using the peak performance/cost ratio to compare systems is often misleading.
- Example: the peak performance/cost ratio of some supercomputers is much lower than that of other systems. If we use this ratio alone, a current Pentium PC easily beats far more powerful supercomputers, even though their sustained performance is actually much higher.

Summary
- The execution time is just one of several performance requirements a user may want to impose. The others include speed, throughput, utilization, cost-effectiveness, performance/cost ratio, and scalability.
- Usually, a set of requirements is imposed.
- Often, users need to use more than one metric in comparing different parallel computing systems.

Workload and Speed Metrics

Three metrics are frequently used to measure the computational workload of a program:
- Execution time
- Instruction count
- Floating-point operation (flop) count

- The execution time: this first metric is tied to a specific computer system. It may change when the same C program is executed on a different machine.
- The number of instructions executed: this second metric is tied to an instruction set architecture (ISA); it stays unchanged when the program is executed on different machines with the same ISA.
- The number of floating-point operations executed: this third metric is often architecture-independent.

Summary of all three performance metrics:

  Workload type       Workload unit                       Speed unit
  Execution time      Seconds (s), CPU clocks             Applications per second
  Instruction count   Million or billion instructions     MIPS or BIPS
  Flop count          Million flop (Mflop),               Mflop/s, Gflop/s
                      billion flop (Gflop)

Instruction Count
- The workload is the set of instructions that the machine actually executed, not just the number of instructions in the assembly program text.
- The instruction count may depend on the input data values. For such an input-dependent program, the workload is defined as the instruction count for the worst-case input.
- Example: a sorting program may execute 100,000 instructions for a given input, but only 4,000 for another.
- Even with a fixed input, the instructions executed could be different on different machines.
- Even with a fixed input, the instruction count of a program could be different on the same machine when different compilers or optimization options are used.
- Finally, a larger instruction count does not necessarily mean the program needs more time to execute.
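The input dependence just described can be made concrete with a small sketch (hypothetical code, not part of the lecture): counting the key comparisons of insertion sort, a proxy for instruction count, shows how far the best and worst cases diverge, which is why the worst-case input defines the workload.

```python
# Hypothetical illustration of an input-dependent workload: insertion sort
# performs far fewer comparisons on sorted input than on reversed input.

def insertion_sort_comparisons(a):
    """Sort a copy of `a`; return the number of key comparisons performed."""
    a = list(a)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1          # one comparison of key against a[j]
            if a[j] > key:
                a[j + 1] = a[j]       # shift the larger element right
                j -= 1
            else:
                break
        a[j + 1] = key
    return comparisons

best = insertion_sort_comparisons(range(100))          # sorted input: n-1 = 99
worst = insertion_sort_comparisons(range(100, 0, -1))  # reversed: n(n-1)/2 = 4950
print(best, worst)
```

The same program, the same input size, yet a 50x difference in work: reporting anything other than the worst case (or a fixed baseline input) would understate the workload.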
Execution Time
- For a given program on a specific computer system, one can define the workload as the total time taken to execute the program. This execution time should be measured in terms of wall clock time, also known as the elapsed time.
- The basic unit of this workload is seconds.
- Execution time depends on many factors, explained below:

  Factor           How it affects performance
  Platform         The machine hardware and operating system affect performance.
                   Other factors include the memory hierarchy (cache, main
                   memory, disk) and operating system versions.
  Input data       The execution time of many applications does not depend on
                   the input data; for instance, an N-point FFT takes
                   O(N log N) time. Other programs, such as sorting and
                   searching, take different times on different inputs. When
                   using execution time as the workload for such
                   input-dependent programs, we need to use either the
                   worst-case time or the time for a baseline input data set.
  Algorithm        The algorithm chosen has a great impact on execution time.
  Data structure   How the data are structured also impacts performance.
  Language         The language used to code the application, the
                   compiler/linker options, and library functions can reduce
                   the execution time.

Floating-Point Count
- When the application program is simple and its workload is not input dependent, the workload can be determined by code inspection.
- When the application code is complex, or when the workload varies with different input data (e.g., sorting), one can measure the workload by running the application on a specific machine. This specific run generates a report of the number of flops or instructions actually executed.
- This approach has been used in the NAS benchmarks, where the flop count is determined by an execution run.
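Flop counting of this kind can be sketched as a simple tally. The per-operation costs assumed here are the NAS conventions (1 flop for add, subtract, or multiply; 4 for a division or square root; 8 for a sine, exponential, etc.; 1 for a comparison, a type conversion, or an isolated assignment); the helper and the statement encodings are illustrative only, not a real benchmark tool.

```python
# A sketch of NAS-style flop counting as a lookup of per-operation costs.

FLOP_COST = {
    "add": 1, "sub": 1, "mul": 1,   # add, subtract, multiply: 1 flop each
    "div": 4, "sqrt": 4,            # division or square root: 4 flops
    "sin": 8, "exp": 8,             # sine, exponential, etc.: 8 flops
    "convert": 1,                   # type conversion: 1 flop
    "compare": 1,                   # comparison: 1 flop
    "assign": 1,                    # only an *isolated* assignment counts
}

def flop_count(ops):
    """Total flops for a statement, given the list of its counted operations."""
    return sum(FLOP_COST[op] for op in ops)

# Encoding a few statements by their counted operations:
assert flop_count(["assign"]) == 1              # X = Y;
assert flop_count(["convert", "add"]) == 2      # X = (float) i + 3.0;
assert flop_count(["compare", "mul"]) == 2      # if (X > Y) Max = 2.0 * X;
assert flop_count(["div", "add", "sqrt"]) == 9  # X = Y/3.0 + sqrt(Z);
assert flop_count(["sin", "sub", "exp"]) == 17  # X = sin(Y) - exp(Z);
```

Note how index arithmetic and assignments inside larger expressions never appear in the operation lists; under these rules they are not counted.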
Rules for Counting Floating-Point Operations (NAS standards)

  Operation                          Flop count
  X = Y;                             1
  X = (float) i + 3.0;               2
  if (X > Y) Max = 2.0 * X;          2
  A[2*i] = B[j-1] + 1.5 * C - 2;     3
  X = Y / 3.0 + sqrt(Z);             9
  X = sin(Y) - exp(Z);               17

Comments on the rules:
- Add, subtract, and multiply each count as 1 flop.
- A division or square root is counted as 4 flops.
- A sine, exponential, etc. is counted as 8 flops.
- A type conversion is counted as 1 flop.
- A comparison is counted as 1 flop.
- An isolated assignment is counted as 1 flop; an assignment within an expression is not separately counted.
- Index arithmetic is not counted.

Caveats
- Using execution time or instruction count as the workload metric leads to a strange phenomenon: the workload changes when the same application is run on different systems, and the workload can be determined only by executing the program.
- The flop count and the instruction count have another advantage: their corresponding speed and utilization measures give some clue as to how well the application is implemented.
- It is possible to achieve 90% utilization in some numerical subroutines; highly tuned programs could achieve even more.

Summary of performance metrics
- To sum up, all three metrics are useful, but especially the flop count and the execution time.
- In practice, the flop count workload is often determined by code inspection, as described in the table above.
- The execution time is often measured on a specific machine, under a set of specific conditions including the hardware platform, compiler options, input data set, etc.
- Remark: the flop count metric is the more stable one.

Computational Characteristics of Parallel Computers
The historical values of performance parameters for PVPs, SMPs, MPPs, and clusters include:
- Year of release in the market
- Machine size n (> 1)
- Clock rate (MHz)
- Peak performance per node (Mflop/s)
- Peak performance on n nodes (Gflop/s)
- Memory capacity
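The peak-performance parameters in the list above are related by simple arithmetic; the following is a minimal sketch with invented machine parameters (no real PVP, SMP, MPP, or cluster is described).

```python
# Hypothetical peak-performance arithmetic for a parallel machine.

def peak_per_node_mflops(clock_mhz: float, flops_per_cycle: int) -> float:
    """Peak performance of one node in Mflop/s: clock rate times flops issued per cycle."""
    return clock_mhz * flops_per_cycle

def peak_aggregate_gflops(n_nodes: int, per_node_mflops: float) -> float:
    """Peak performance on n nodes in Gflop/s."""
    return n_nodes * per_node_mflops / 1000.0

per_node = peak_per_node_mflops(500.0, 2)        # 500 MHz, 2 flops/cycle -> 1000 Mflop/s
aggregate = peak_aggregate_gflops(64, per_node)  # 64 such nodes -> 64 Gflop/s
print(per_node, aggregate)
```

Comparing such a peak figure with a measured sustained speed gives the utilization factor discussed earlier, which is why peak numbers alone are a poor basis for comparison.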
The Memory Hierarchy
- The performance depends on how fast the system can move data between processors and memories. The memory subsystem forms a hierarchy.
- For each layer, the following three parameters play a vital role in extracting performance:
  - Capacity (C): how many bytes of data can be held by the device?
  - Latency (L): how much time is needed to fetch one datum from the device?
  - Bandwidth (B): for moving a large amount of data, how many bytes can be transferred from the device in a second?

Performance parameters of a typical memory hierarchy (reconstructed from the original figure; some ranges are approximate):

  Layer            Capacity (C)    Latency (L)        Bandwidth (B)
  Registers        < 2 KB          0 cycles           4-32 GB/s
  Level-1 cache    4-256 KB        1-2 cycles         4-16 GB/s
  Level-2 cache    64 KB-4 MB      2-10 cycles        1-4 GB/s
  Main memory      16 MB-16 GB     10-100 cycles      0.4-2 GB/s
  Remote memory    1-100 GB        100-100K cycles    1-300 MB/s
  Disk             1-100 GB        100K-1M cycles     1-16 MB/s

- The faster and smaller devices are closer to the processor. The devices closest to the processor are the registers, which are in fact part of the processor chip.
- The level-1 cache is usually on the processor chip, and the level-2 cache is off chip. The level-3 cache is off chip, shared by some processors in an SMP node.
- The main memory includes the local memory in a node, and the global memory for machines with a centralized shared memory.

Parallelism and Interaction Overheads
- The time to execute a parallel program is T = T_comp + T_par + T_interact, where T_comp, T_par, and T_interact are the times needed to execute the computation, the parallelism operations, and the interaction operations, respectively.
- (The formula assumes that no overlapping of the three components is involved; when interaction overheads are ignored, only the first two terms remain.)

Sources of parallelism overhead
- Process management, such as creation, termination, context switching, etc.
- Grouping operations, such as creation or destruction of a process group.
- Process inquiry operations, such as asking for process identification, rank, group, and group size.
- Remark: creating a process group is expensive on parallel computers. This matters for parallel programs that perform such operations many times while processing raw data.

Sources of interaction overheads
- Synchronization, such as barriers, locks, critical regions, and events.
- Aggregation, such as reduction and scan.
- Communication, such as point-to-point and collective communication, and reading/writing of shared variables.
- Idleness due to interaction.

Overhead Quantification
The following are used to estimate the overheads involved:
- Measurement conditions
- Measurement methods
- Expressions for parallelism overhead due to communication:
  - point-to-point communication overhead for message length m (in bytes), characterized by a startup time and an asymptotic bandwidth;
  - collective communication and computation.

Overhead measurement conditions
- The programming language used.
- The best compiler options should be used.
- The message-passing library used.
- The communication hardware and protocol used.
- The data structures used (the arrays are always made small enough to fit into the node memory, so that there will be no page faults).
- Wall clock (elapsed) time should be used.
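The point-to-point overhead expression mentioned above is commonly written as t(m) = t0 + m / r_inf, with startup time t0 and asymptotic bandwidth r_inf. A minimal sketch that recovers both parameters from two ping-pong timings; the two sample measurements are hypothetical, not from any real interconnect.

```python
# Fit the two-parameter communication model t(m) = t0 + m / r_inf
# from two (message size, one-way time) ping-pong measurements.

def fit_overhead_model(m1, t1, m2, t2):
    """Estimate (t0, r_inf) from two timings given as (bytes, seconds)."""
    r_inf = (m2 - m1) / (t2 - t1)   # asymptotic bandwidth, bytes per second
    t0 = t1 - m1 / r_inf            # startup (latency) time, seconds
    return t0, r_inf

def predicted_time(m, t0, r_inf):
    """Closed-form estimate of the transfer time for an m-byte message."""
    return t0 + m / r_inf

# Hypothetical timings: 1 KB in 30 us, 1 MB in 10.02 ms.
t0, r_inf = fit_overhead_model(1_024, 30e-6, 1_048_576, 10.02e-3)
print(t0, r_inf)   # roughly 20 us startup, roughly 105 MB/s bandwidth
```

This is exactly the "Method 3" style of presentation below: a simple closed-form expression that extrapolates the measured overhead to arbitrary message lengths.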
Overhead measurement methods
- Measuring point-to-point communication (ping-pong test) between two nodes.
- Measuring point-to-point communication involving n nodes, and collective communication performance.

Interpretation of overhead measurement data
- Method 1: present the results in tabular form.
- Method 2: present the data as curves.
- Method 3: present the data using simple, closed-form expressions that can be used to estimate the overhead for various message lengths and machine sizes.

Remarks
- Knowing the overheads can help a programmer decide how best to develop a parallel program on a given computer.
- Parallelism and interaction overheads are often large compared to the basic computation time, and they vary from system to system.
- In some cases parallelism and interaction overheads are significant, and they should be quantified.
- The overhead values also vary greatly from one system to another.

Performance Metrics: Phase Parallel Model
- Consider a sequential program C consisting of a sequence of k major computational phases C_1, C_2, ..., C_k, where phase C_i takes sequential time T_1(i) and has a degree of parallelism DOP_i.
- (The original figure depicts a phase parallel program as Phase 1 through Phase k, each with its workload W(i) and parallelism DOP(i), separated by interaction steps.)
- We want to develop an efficient parallel program by exploiting each phase C_i with its degree of parallelism DOP_i.
- While executing on p processors, the parallel execution time for phase C_i (1 <= i <= k) is T_p(i) = T_1(i) / min(DOP_i, p).
- The total parallel execution time over p nodes becomes

    T_p = sum over 1 <= i <= k of T_1(i) / min(DOP_i, p) + T_par + T_interact,

  where T_par and T_interact denote all parallelism and interaction overheads.
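The phase parallel formula above can be evaluated directly. In this sketch the phase times, degrees of parallelism, and interaction overhead are invented for illustration; note how the serial phase (DOP = 1) limits the achievable speedup.

```python
# T_p = sum_i T_1(i) / min(DOP_i, p) + T_par + T_interact

def parallel_time(phase_times, dops, p, t_par=0.0, t_interact=0.0):
    """Total p-node execution time of a phase-parallel program."""
    return sum(t / min(d, p) for t, d in zip(phase_times, dops)) + t_par + t_interact

t1 = [10.0, 40.0, 10.0]   # T_1(i): sequential time of each phase, seconds
dop = [1, 64, 4]          # DOP_i: degree of parallelism of each phase

T1 = parallel_time(t1, dop, 1)                  # sequential time: 60.0 s
Tp = parallel_time(t1, dop, 8, t_interact=2.0)  # 8 nodes: 10 + 5 + 2.5 + 2 = 19.5 s
speedup = T1 / Tp                               # about 3.08
efficiency = speedup / 8                        # about 0.38
print(T1, Tp, speedup, efficiency)
```

Even with 8 nodes, the 10-second serial phase alone bounds T_p from below, an Amdahl-style effect that the min(DOP_i, p) term captures.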
Basic metrics of the phase parallel model

  Terminology         Notation   Definition
  Sequential time     T_1        T_1 = sum over 1<=i<=k of T_1(i)
  Parallel time       T_p        T_p = sum over 1<=i<=k of T_1(i)/min(DOP_i, p)
                                       + T_par + T_interact
  Critical path       T_inf      T_inf = sum over 1<=i<=k of T_1(i)/DOP_i
  Total overhead      T_0        T_0 = p * T_p - T_1
  Speedup             S_p        S_p = T_1 / T_p
  p-node efficiency   E_p        E_p = S_p / p = T_1 / (p * T_p)
  p-node speed        P_p        P_p = W / T_p

Remarks:
1. In many cases, the total overhead is dominated by the communication overhead.
2. The implementation of point-to-point and collective communication, and the resulting communication and computation overheads for various message sizes in MPI, play an important role in this overhead.
3. These metrics help the application user understand the performance issues of a particular code.
4. Inequalities such as T_p >= max(T_1/p, T_inf), S_p <= min(p, T_1/T_inf), and E_p <= 1 are useful for estimating the parallel execution time.

Performance Metrics of Parallel Systems
- Speedup: the ratio of the serial run time of the best known sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processors. The p processors used by the parallel algorithm are assumed to be identical to the one used by the sequential algorithm.
- Efficiency: the ratio of the speedup to the number of processors, E = S/p. Efficiency can also be expressed as the ratio of the execution time of the fastest known sequential algorithm on a single processor to the cost of solving the same problem on p processors.
- Cost: the cost of solving a problem on a parallel system is the product of the parallel run time and the number of processors used, cost = p * T_p. The cost of solving a problem on a single processor is the execution time of the best known sequential algorithm.
- Cost-optimal: a parallel system is said to be cost-optimal if the cost of solving a problem on the parallel computer is proportional to the execution time of the fastest known sequential algorithm on a single processor.
- Cost is sometimes referred to as work or the processor-time product, and a cost-optimal system is also known as a pT_p-optimal system.
- A cost-optimal parallel system has an efficiency of Theta(1).
- Using fewer than the maximum number of processors to execute a parallel algorithm is called scaling down a parallel system in terms of the number of processors.

Scalability and Speedup Analysis
Remark: an ideal scalability metric should have the following two properties:
1. It predicts the workload growth rate with respect to the increase of machine size.
2. It is consistent with execution time, i.e., a more scalable system always has a shorter execution time.
- It can be shown that under very reasonable conditions, a system with a small isoutilization (thus more scalable) always has a shorter execution time.
- Systems with a small isoutilization are more scalable than those with a large one.

Conclusions
1. Performance metrics and the measurement of the performance of a parallel program have been explained.
2. Workload and speed metrics are explained.
3. Types of overhead in parallel computing are discussed.
4. Theoretical concepts of performance and scalability (the phase parallel model, speedup, efficiency, isospeed, isoutilization, and isoefficiency metrics) are explained.

References
1. Kai Hwang, Zhiwei Xu, Scalable Parallel Computing (Technology, Architecture, Programming), McGraw-Hill, New York (1997).
2. Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, Redwood City, CA (1994).
3. David E. Culler, Jaswinder Pal Singh, with Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, Inc. (1999).
4. Ian T. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley Publishing Company (1995).
5. Albert Y. H. Zomaya, Parallel and Distributed Computing Handbook, McGraw-Hill Series on Computing Engineering, New York (1996).
6. Ernst L. Leiss, Parallel and Vector Computing: A Practical Introduction, McGraw-Hill Series on Computer Engineering, New York (1995).
7. William Gropp, Rusty Lusk, Tuning MPI Applications for Peak Performance, Pittsburgh (1996).