EE108B – Lecture 5

CPU Performance

Christos Kozyrakis

Stanford University
http://eeclass.stanford.edu/ee108b

C. Kozyrakis, EE108b Lecture 5

Announcements

• Upcoming deadlines
  – HW1 due on Tuesday 1/25
  – Lab 1 due on Tuesday 1/30
  – PA1 due on Thursday 2/8

Review: Procedure Call and Return

• Procedures are required for structured programming
  – Aka: functions, methods, …
• Implementing procedures in assembly requires several things to be done
  – Memory space must be set aside for local variables
  – Arguments must be passed in and return values passed out
  – Execution must continue after the call
• Procedure steps
  1. Place parameters in a place where the procedure can access them
  2. Transfer control to the procedure
  3. Acquire the storage resources needed for the procedure
  4. Perform the desired task
  5. Place the result value in a place where the calling program can access it
  6. Return control to the point of origin

Stacks

• Data is pushed onto the stack to store it and popped from the stack when no longer needed
  – MIPS does not support push/pop in hardware (use loads/stores)
  – The procedure calling convention requires one
• Calling convention
  – Common rules across procedures are required
  – On recent machines the convention is set by software; earlier machines set it with hardware instructions
• Using stacks
  – Stacks can grow up or down
  – The stack grows down in MIPS
  – The entire stack frame is pushed and popped, rather than single elements

MIPS Storage Layout

  $sp = 7fffffff_16   Stack (grows down)
                      Dynamic data (grows up)
  $gp = 10008000_16   Static data
        10000000_16   Text/Code
        00400000_16   Reserved

  The stack and dynamic-data areas grow towards one another to maximize storage use before collision.

Procedure Activation Record (Frame): Textbook Version

• Each procedure creates an activation record on the stack

  Higher addresses
  $fp → First word of frame: saved arguments
        Saved return address
        Saved registers (if any)
        Local arrays and structures (if any)
  $sp → Last word of frame
  Lower addresses (stack grows downwards)

Procedure Activation Record (Frame): SGI/GCC Compiler

• Each procedure creates an activation record on the stack
  – At least 32 bytes by convention, to allow for $a0-$a3, $ra, $fp and to be double-doubleword aligned (16-byte multiple)

  Higher addresses
        Arguments (>= 16 B)
  $fp → First word of frame: local arrays and structures (if any)
        Saved return address
        Saved registers
        Space for nested call args (>= 16 B)
  $sp → Last word of frame
  Lower addresses (stack grows downwards)

Register Assignments

• Use the following calling conventions

  Name      Register number   Usage
  $zero     0                 the constant value 0
  $v0-$v1   2-3               values for results and expression evaluation
  $a0-$a3   4-7               arguments
  $t0-$t7   8-15              temporaries
  $s0-$s7   16-23             saved
  $t8-$t9   24-25             more temporaries
  $gp       28                global pointer
  $sp       29                stack pointer
  $fp       30                frame pointer
  $ra       31                return address

Register Assignments (cont.)

  0      $zero     Zero constant (0)
  1      $at       Reserved for assembler
  2-3    $v0-$v1   Expression results
  4-7    $a0-$a3   Arguments
  8-15   $t0-$t7   Caller saved (temporaries)
  16-23  $s0-$s7   Callee saved
  24-25  $t8-$t9   Temporaries (cont'd)
  26-27  $k0-$k1   Reserved for OS kernel
  28     $gp       Pointer to global area
  29     $sp       Stack pointer
  30     $fp       Frame pointer
  31     $ra       Return address

Caller vs. Callee Saved Registers

  Preserved                             Not preserved
  Saved registers ($s0-$s7)             Temporary registers ($t0-$t9)
  Stack/frame pointer ($sp, $fp, $gp)   Argument registers ($a0-$a3)
  Return address ($ra)                  Return values ($v0-$v1)

• Preserved registers (callee save)
  – Save register values on stack prior to use
  – Restore registers before return
• Not preserved registers (caller save)
  – Do what you please and expect callees to do likewise
  – Should be saved by the caller if needed after the procedure call

Call and Return

• Caller
  – Save caller-saved registers $a0-$a3, $t0-$t9
  – Load arguments in $a0-$a3, rest passed on stack
  – Execute jal instruction
• Callee setup
  1. Allocate memory for new frame ($sp = $sp - frame size)
  2. Save callee-saved registers $s0-$s7, $fp, $ra
  3. Set frame pointer ($fp = $sp + frame size - 4)
• Callee return
  – Place return value in $v0 and $v1
  – Restore any callee-saved registers
  – Pop stack ($sp = $sp + frame size)
  – Return by jr $ra

Calling Convention Steps

  Before call: the first four arguments are passed in registers; FP and SP are unchanged.
  Callee setup, step 1: adjust SP downwards to allocate the frame.
  Callee setup, step 2: save registers as needed ($ra, old $fp, $s0-$s7).
  Callee setup, step 3: adjust FP to point to the new frame.

Simple Example

int foo(int num)
{
    return(bar(num + 1));
}

int bar(int num)
{
    return(num + 1);
}

foo:    addiu $sp, $sp, -32    # push frame
        sw    $ra, 28($sp)     # Store $ra
        sw    $fp, 24($sp)     # Store $fp
        addiu $fp, $sp, 28     # Set new $fp
        addiu $a0, $a0, 1      # num + 1
        jal   bar              # call bar
        lw    $fp, 24($sp)     # Load $fp
        lw    $ra, 28($sp)     # Load $ra
        addiu $sp, $sp, 32     # pop frame
        jr    $ra              # return

bar:    addiu $v0, $a0, 1      # leaf procedure
        jr    $ra              # with no frame

Factorial Example

int fact(int n)
{
    if (n <= 1)
        return(1);
    else
        return(n*fact(n-1));
}

fact:   slti  $t0, $a0, 2      # a0 < 2
        beq   $t0, $zero, skip # goto skip
        ori   $v0, $zero, 1    # Return 1
        jr    $ra              # Return
skip:   addiu $sp, $sp, -32    # $sp down 32
        sw    $ra, 28($sp)     # Save $ra
        sw    $fp, 24($sp)     # Save $fp
        addiu $fp, $sp, 28     # Set up $fp
        sw    $a0, 32($sp)     # Save n
        addiu $a0, $a0, -1     # n - 1
        jal   fact             # Call recursive
link:   lw    $a0, 32($sp)     # Restore n
        mul   $v0, $v0, $a0    # n * fact(n-1)
        lw    $ra, 28($sp)     # Load $ra
        lw    $fp, 24($sp)     # Load $fp
        addiu $sp, $sp, 32     # Pop stack
        jr    $ra              # Return

Running Factorial Example

int main()
{
    fact(3);
}

[Figure: the stack holds one 32-byte frame per active call: main's frame (sp = 64), fact(3)'s frame (sp = 32, fp = 60, saved n = 3), and fact(2)'s frame (sp = 0, fp = 28, saved n = 2, saved return address = link).]

Stack just before call of fact(1)

Factorial Again: Tail Recursion

Tail-recursive procedure:

int fact_t (int n, int val)
{
    if (n <= 1)
        return val;
    return fact_t(n-1, val*n);
}

General form:

fact_t(n, val)
{
    •••
    return fact_t(Nexpr, Vexpr)
}

• Form
  – Directly return value returned by recursive call
• Consequence
  – Can convert into loop

Top-level call:

int fact(int n)
{
    return fact_t(n, 1);
}

Removing Tail Recursion

Optimized general form:

fact_t(n, val)
{
start:
    •••
    val = Vexpr;
    n = Nexpr;
    goto start;
}

Resulting code:

int fact_t (int n, int val)
{
start:
    if (n <= 1)
        return val;
    val = val*n;
    n = n-1;
    goto start;
}

• Effect of optimization
  – Turn recursive chain into a single procedure (leaf)
  – Eliminate procedure call overhead
  – No stack frame needed
  – Constant space requirement
    • Vs. linear for the recursive version
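The "convert into loop" transformation can be sketched directly in C. Below, `fact_loop` is a hypothetical name for the iterative version a compiler would effectively produce; it computes the same result as the tail-recursive `fact_t` in constant stack space.

```c
#include <assert.h>

/* Tail-recursive factorial, as on the slide. */
int fact_t(int n, int val) {
    if (n <= 1)
        return val;
    return fact_t(n - 1, val * n);
}

/* Same computation with the tail call turned into a loop:
   one stack frame total instead of one per call. */
int fact_loop(int n) {
    int val = 1;
    while (n > 1) {
        val = val * n;
        n = n - 1;
    }
    return val;
}
```

Because the recursive call is the last action taken, no state needs to survive it, which is exactly why the loop form needs no stack frames.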

Assembly Code for fact_t

int fact_t (int n, int val)
{
start:
    if (n <= 1)
        return val;
    val = val*n;
    n = n-1;
    goto start;
}

fact_t: addu  $v0, $a1, $zero  # $v0 = $a1
start:  slti  $t0, $a0, 2      # n < 2
        beq   $t0, $zero, skip # if n >= 2 goto skip
        jr    $ra              # return
skip:   mul   $v0, $v0, $a0    # val = val * n
        addiu $a0, $a0, -1     # n = n - 1
        j     start            # goto start

Today's Topics

• Reading: 2.7, 4.1–4.4

• Pseudoinstructions

• CPU performance
  – CPU == central processing unit == the processor

Pseudoinstructions

• The assembler expands pseudoinstructions

    move  $t0, $t1           # Copy $t1 to $t0

  becomes

    addu  $t0, $zero, $t1    # Actual instruction

• Some pseudoinstructions need a temporary register
  – Cannot use $t, $s, etc. since they may be in use
  – The $at register is reserved for the assembler

    blt   $t0, $t1, L1       # Goto L1 if $t0 < $t1

  becomes

    slt   $at, $t0, $t1      # Set $at = 1 if $t0 < $t1
    bne   $at, $zero, L1     # Goto L1 if $at != 0

Performance Metrics

• Response time (latency)
  – How long does it take for my job to run?
  – How long does it take to execute a job?
  – How long must I wait for the database query?

• Throughput
  – How many jobs can the machine run at once?
  – What is the average execution rate?
  – How many queries per minute?

• If we upgrade a machine with a new processor, what do we increase?
• If we add a new machine to the lab, what do we increase?

Performance = Execution Time

• Elapsed time
  – Counts everything (disk and memory accesses, I/O, etc.)
  – A useful number, but often not good for comparison purposes
    • E.g., OS and multiprogramming time make it difficult to compare CPUs

• CPU time
  – Doesn't count I/O or time spent running other programs
  – Can be broken up into system time and user time

• Our focus: user CPU time
  – Time spent executing the lines of code that are "in" our program
  – Includes arithmetic, memory, and control instructions…

Clock Cycles

• Instead of reporting execution time in seconds, we often use cycles:

    seconds/program = (cycles/program) × (seconds/cycle)

• Clock "ticks" indicate when to start activities.

• Cycle time = time between ticks = seconds per cycle
• Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
  – A 2 GHz clock has a cycle time of (1 / (2 × 10⁹)) × 10¹² = 500 picoseconds (ps)
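The cycle-time arithmetic above reduces to a one-liner; a small sketch in C (the function name is illustrative):

```c
#include <assert.h>

/* Cycle time in picoseconds for a clock rate given in GHz:
   (1 / (f * 1e9)) seconds * 1e12 ps/s = 1000 / f ps. */
double cycle_time_ps(double ghz) {
    return 1000.0 / ghz;
}
```

For a 2 GHz clock this gives 500 ps, matching the slide's example.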

Now that we understand Clock Cycles

• A given program will require
  – some number of instructions (machine instructions)
  – some number of cycles
  – some number of seconds

• We have a vocabulary that relates these quantities:
  – cycle time (seconds per cycle)
  – clock rate (cycles per second)
  – CPI (cycles per instruction)
  – MIPS (millions of instructions per second)

Performance and Execution Time

• User CPU execution time:

    Execution Time = Clock Cycles for Program × Clock Cycle Time

• Since cycle time is 1 / clock rate (or clock frequency):

    Execution Time = Clock Cycles for Program / Clock Rate

• The program should be something real people care about
  – Desktop: MS Office, edit, compile
  – Server: web, e-commerce, database
  – Scientific: physics, weather forecasting

Measuring Clock Cycles

• Clock cycles/program is not an intuitive or easily determined value, so

    Clock Cycles = Instructions × Clock Cycles Per Instruction

• Cycles Per Instruction (CPI) is used often

• CPI is an average, since the number of cycles per instruction varies from instruction to instruction
  – The average depends on the instruction mix, the latency of each instruction type, etc.

• CPIs can be used to compare two implementations of the same ISA, but CPI alone is not useful for comparing different ISAs
  – An add in another ISA is different from a MIPS add

Using CPI

• Drawing on the previous equation:

    Execution Time = Instructions × CPI × Clock Cycle Time

    Execution Time = (Instructions × CPI) / Clock Rate

• To improve performance (i.e., reduce execution time)
  – Increase clock rate (decrease clock cycle time), OR
  – Decrease CPI, OR
  – Reduce the number of instructions
• Designers balance cycle time against the number of cycles required
  – Improving one factor may make the other one worse…
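The performance equation can be checked numerically; a minimal sketch in C (function name and sample numbers are illustrative):

```c
#include <assert.h>
#include <math.h>

/* Execution time in seconds: Instructions * CPI / Clock Rate. */
double exec_time_s(double instructions, double cpi, double clock_rate_hz) {
    return instructions * cpi / clock_rate_hz;
}
```

For example, 1 billion instructions at CPI 2 on a 2 GHz clock take 1 second; halving the CPI halves the execution time, exactly as the equation predicts.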

Clock Rate ≠ Performance

• Mobile Intel Pentium 4 vs. Intel Pentium M
  – 2.4 GHz vs. 1.6 GHz

• Performance on MobileMark with the same memory and disk
  – Word, Excel, Photoshop, PowerPoint, etc.
  – Mobile Pentium 4 is only 15% faster

• What is the relative CPI?
  – Exec Time = IC × CPI / Clock rate
  – IC × CPI_M / 1.6 = 1.15 × IC × CPI_4 / 2.4
  – CPI_4 / CPI_M = 2.4 / (1.15 × 1.6) = 1.3
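A quick numerical check of this derivation in C (the function and parameter names are illustrative):

```c
#include <assert.h>
#include <math.h>

/* From IC * CPI_M / clk_m = speedup_4 * IC * CPI_4 / clk_4,
   the instruction count cancels and
   CPI_4 / CPI_M = clk_4 / (speedup_4 * clk_m). */
double relative_cpi(double clk_4_ghz, double clk_m_ghz, double speedup_4) {
    return clk_4_ghz / (speedup_4 * clk_m_ghz);
}
```

Plugging in 2.4 GHz, 1.6 GHz, and a 1.15× speedup gives roughly 1.3: the Pentium 4 needs about 30% more cycles per instruction, so its clock-rate advantage mostly evaporates.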

CPI Varies

• Different instruction types require different numbers of cycles
• CPI is often reported for types of instructions

    Clock Cycles = Σ (CPI_i × IC_i), for i = 1 to n

• where CPI_i is the CPI for instruction type i, and IC_i is the count of instructions of that type
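The summation maps directly onto a loop; a small sketch in C (the function name and sample mix are illustrative):

```c
#include <assert.h>

/* Total clock cycles: sum over instruction types i of CPI_i * IC_i. */
double total_cycles(const double cpi[], const double ic[], int n) {
    double cycles = 0.0;
    for (int i = 0; i < n; i++)
        cycles += cpi[i] * ic[i];
    return cycles;
}
```

For a mix with 5 million 1-cycle, 1 million 2-cycle, and 1 million 3-cycle instructions, this yields 10 million cycles.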

Computing CPI

• To compute the overall average CPI, use:

    CPI = Σ CPI_i × (Instruction Count_i / Instruction Count), for i = 1 to n

Computing CPI Example

  Instruction Type   CPI   Frequency   CPI × Frequency
  ALU                1     50%         0.5
  Branch             2     20%         0.4
  Load               2     20%         0.4
  Store              2     10%         0.2

• Given this machine, the CPI is the sum of CPI × Frequency

• Average CPI is 0.5 + 0.4 + 0.4 + 0.2 = 1.5

• What fraction of the time is spent on data transfer?
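The table lends itself to a direct computation; a sketch in C (the function names are illustrative). Data transfer means loads plus stores, whose cycle contribution is 0.4 + 0.2 = 0.6 out of a total CPI of 1.5, i.e., 40% of the time.

```c
#include <assert.h>
#include <math.h>

/* CPI and frequency per type: ALU, Branch, Load, Store (from the table). */
static const double cpi[]  = {1.0, 2.0, 2.0, 2.0};
static const double freq[] = {0.5, 0.2, 0.2, 0.1};

/* Weighted-average CPI: sum of CPI_i * frequency_i. */
double average_cpi(void) {
    double sum = 0.0;
    for (int i = 0; i < 4; i++)
        sum += cpi[i] * freq[i];
    return sum;
}

/* Fraction of execution time spent on loads and stores (indices 2 and 3). */
double data_transfer_fraction(void) {
    return (cpi[2] * freq[2] + cpi[3] * freq[3]) / average_cpi();
}
```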

What is the Impact of Displacement-Based Memory Addressing Mode?

  Instruction Type   CPI   Frequency   CPI × Frequency
  ALU                1     50%         0.5
  Branch             2     20%         0.4
  Load               2     20%         0.4
  Store              2     10%         0.2

• Assume 50% of MIPS loads and stores have a zero displacement.

CPU Time vs. MIPS (MIPS = million instructions per second)

• Two different compilers are being tested for a 1 GHz machine with three different classes of instructions: Class A, Class B, and Class C, which require 1, 2, and 3 cycles (respectively). Both compilers are used to produce code for a large piece of software.

The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.

The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.

• Which sequence will be faster according to MIPS?
• Which sequence will be faster according to execution time?
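Both questions can be answered with a few lines of arithmetic; a sketch in C (the struct and function names are illustrative). Compiler 1 produces 7M instructions and 10M cycles (0.010 s, 700 MIPS); compiler 2 produces 12M instructions and 15M cycles (0.015 s, 800 MIPS), so compiler 2 "wins" on MIPS while compiler 1 is actually faster.

```c
#include <assert.h>
#include <math.h>

#define CLOCK_HZ 1e9   /* the 1 GHz machine from the example */

typedef struct { double time_s; double mips; } perf_t;

/* Execution time and MIPS rating for an instruction mix
   (counts per class; classes A, B, C take 1, 2, 3 cycles). */
perf_t measure(const double counts[3]) {
    static const double cycles_per_class[3] = {1.0, 2.0, 3.0};
    double instr = 0.0, cycles = 0.0;
    for (int i = 0; i < 3; i++) {
        instr  += counts[i];
        cycles += counts[i] * cycles_per_class[i];
    }
    perf_t p;
    p.time_s = cycles / CLOCK_HZ;
    p.mips   = instr / (p.time_s * 1e6);
    return p;
}
```

The mismatch shows why MIPS is a poor metric: it rewards executing many cheap instructions rather than finishing the job sooner.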

Speedup

• Speedup allows us to compare different CPUs or optimizations:

    Speedup = CPU time_Old / CPU time_New

• Example
  – Original CPU takes 2 sec to run a program
  – New CPU takes 1.5 sec to run a program
  – Speedup = 1.333, or a 33% speedup

Amdahl's Law

• If an optimization improves a fraction f of execution time by a factor of a:

    Speedup = T_old / T_new = T_old / ([(1 − f) + f/a] × T_old) = 1 / ((1 − f) + f/a)

• This formula is known as Amdahl's Law
• Lessons from Amdahl's Law
  – If f → 100%, then speedup = a
  – If a → ∞, then speedup = 1/(1 − f)
• Summary
  – Make the common case fast
  – Watch out for the non-optimized component

Amdahl's Law Example

• Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

• How about making it 5 times faster?
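Both questions follow from Amdahl's Law with f = 0.8; a sketch in C (function names are illustrative). Solving for the improvement factor gives a = f / (1/S − (1 − f)): a 4× overall speedup needs a 16× faster multiply, while 5× is unattainable because even an infinite factor caps the speedup at 1/(1 − f) = 5.

```c
#include <assert.h>
#include <math.h>

/* Amdahl's Law: overall speedup when a fraction f of execution
   time is improved by a factor a. */
double speedup(double f, double a) {
    return 1.0 / ((1.0 - f) + f / a);
}

/* Improvement factor a needed on fraction f for overall speedup s.
   Only feasible when s < 1 / (1 - f). */
double required_factor(double f, double s) {
    return f / (1.0 / s - (1.0 - f));
}
```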

Evaluating Performance

• Performance is best determined by running a real application
  – Use programs typical of the expected workload
  – Or, typical of the expected class of applications
    • e.g., compilers/editors, scientific applications, graphics, etc.
• Small benchmarks
  – nice for architects and designers
  – easy to standardize
  – can be abused
• SPEC (System Performance Evaluation Cooperative)
  – Companies have agreed on a set of real programs and inputs
  – Valuable indicator of performance (and compiler technology)
  – Can still be abused

Games

• An embarrassed Intel Corp. acknowledged Friday that a bug in a software program known as a compiler had led the company to overstate the speed of its chips on an industry benchmark by 10 percent. However, industry analysts said the coding error…was a sad commentary on a common industry practice of "cheating" on standardized performance tests…The error was pointed out to Intel two days ago by a competitor, Motorola…came in a test known as SPECint92…Intel acknowledged that it had "optimized" its compiler to improve its test scores. The company had also said that it did not like the practice but felt compelled to make the optimizations because its competitors were doing the same thing…At the heart of Intel's problem is the practice of "tuning" compiler programs to recognize certain computing problems in the test and then substituting special handwritten pieces of code…

Saturday, January 6, 1996 New York Times

SPEC CPU2000

Other Benchmarks

• Scientific computing: Linpack, SpecOMP, SpecHPC, …
• Embedded benchmarks: EEMBC, …
• Enterprise computing
  – TPC-C, TPC-W, TPC-H
  – SpecJbb, SpecSFS, SpecMail, Streams, …
• Other
  – 3Dmark, ScienceMark, Winstone, iBench, AquaMark, …

• Watch out: your results will be as good as your benchmarks
  – Make sure you know what the benchmark is designed to measure
  – Performance is not the only metric for computing systems
    • Cost, power consumption, reliability, real-time performance, …

Performance Summary

• Performance is specific to a particular program or set of programs
  – Total execution time is a consistent summary of performance

• For a given architecture, performance increases come from:
  – Increases in clock rate (without adverse CPI effects)
  – Improvements in processor organization that lower CPI
  – Compiler enhancements that lower CPI and/or instruction count
  – Algorithm/language choices that affect instruction count

• Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance

Translation Hierarchy

• High-level → Assembly → Machine

    C Program (.c)
        ↓ Compiler
    Assembly Program (.s)
        ↓ Assembler
    Object Module (.o)
        ↓ Linker
    Executable
        ↓ Loader
    Memory

Compiler: Converts High-Level Language to Machine Code

• Compiler phases

  Front-end (Analysis):
    Lexical Analysis (Scanner)  ← .c
    Syntax Analysis (Parser)
    Semantic Analysis (Semantic Analyzer)
    Intermediate Code Gen → Intermediate Representation (IR)
      e.g., GCC: Three-Address Code (TAC); Java bytecodes

  Back-end (Synthesis):
    IR Optimization
    Assembly/Object Generation
    Assembly/Object Optimization → Assembly (.s) or Machine Object (.o)

Intermediate Representation (IR)

• Three-address code (TAC)

  Source:

    j = 2 * i + 1;
    if (j >= n)
        j = 2 * i + 3;
    return a[j];

  TAC:

    t1 = 2 * i
    t2 = t1 + 1
    j = t2
    t3 = j < n
    if t3 goto L0
    t4 = 2 * i
    t5 = t4 + 3
    j = t5
    L0: t6 = a[j]
    return t6

  Each temporary is assigned once (single assignment).
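The TAC above can be mirrored in C by making every temporary explicit. A sketch (the enclosing function, its name, and its parameters are hypothetical, added only to make the fragment runnable):

```c
#include <assert.h>

/* The slide's fragment with each three-address statement made
   explicit through single-use temporaries, as in the TAC. */
int select_elem(const int a[], int i, int n) {
    int t1, t2, t3, t4, t5, t6, j;
    t1 = 2 * i;
    t2 = t1 + 1;
    j = t2;
    t3 = j < n;
    if (!t3) {            /* "if t3 goto L0": skip the body when j < n */
        t4 = 2 * i;
        t5 = t4 + 3;
        j = t5;
    }
    t6 = a[j];            /* L0: */
    return t6;
}
```

Note that t1 and t4 both compute 2 * i; a back-end pass would remove the duplicate by common subexpression elimination.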

Optimizing Compilers

• Provide efficient mapping of program to machine
  – Code selection and ordering
  – Eliminating minor inefficiencies
  – Register allocation
• Don't (usually) improve asymptotic efficiency
  – Up to the programmer to select the best overall algorithm
  – Big-O savings are (often) more important than constant factors
    • but constant factors also matter

• Optimization types
  – Local: inside basic blocks
  – Global: across basic blocks, e.g., loops

Limitations of Optimizing Compilers

• Operate under fundamental constraints
  1. Improved code has the same output
  2. Improved code has the same side effects
  3. Improved code is as correct as the original code

• Most analysis is performed only within procedures
  – Whole-program analysis is too expensive in most cases
• Most analysis is based only on static information
  – Compiler has difficulty anticipating run-time inputs
  – Profile-driven dynamic compilation becoming more prevalent

• When in doubt, the compiler must be conservative

Preview: Code Optimization

• Goal: improve performance by:
  – Removing redundant work
    • Unreachable code
    • Common subexpression elimination
    • Induction variable elimination
  – Creating simpler operations
    • Dealing with constants in the compiler
    • Strength reduction
  – Managing registers well
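Two of the optimizations above can be illustrated by hand in C; a sketch only (a real compiler applies these on its IR, and the function names here are made up for the example):

```c
#include <assert.h>

/* Before: 2*i is computed twice, and multiplies do the scaling. */
int addr_before(int i) {
    return (2 * i) + (2 * i) * 4;
}

/* After: common subexpression elimination keeps one copy of 2*i,
   and strength reduction turns the multiplies into shifts. */
int addr_after(int i) {
    int t = i << 1;       /* 2 * i, computed once */
    return t + (t << 2);  /* t + t*4 */
}
```

Both versions compute the same value; the second simply uses fewer and cheaper operations, which lowers the instruction count and CPI terms in the execution-time equation below.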

Execution Time = Instructions × CPI × Clock Cycle Time
