InstructionSetArchitectureforMultimediaSignalProcessing RubyLee PrincetonUniversity 1. INTRODUCTION Multimediasignalprocessing,ormediaprocessing[1],istheprocessingofdigitalmultimediainformationina programmable processor. Digital multimedia information includes visual information like images, video, graphics andanimation,audioinformationlikevoiceandmusic,andtextualinformationlikekeyboardtextandhandwriting. With general-purpose computers processing more multimedia information, multimedia instructions for efficient media processing have been defined for the Instruction Set Architectures (ISAs) of microprocessors. Meanwhile, digitalprocessingofvideoandaudiodatainconsumerproductshasalsoresultedinmoresophisticatedmultimedia processors.Traditionaldigitalsignalprocessors(DSPs)inmusicplayersandrecordersandmobiletelephonesare becoming increasingly sophisticated as they process multiple forms of multimedia data, rather than just audio signals.Videoprocessorsfortelevisionsandvideorecordershavebecomemoreversatileastheyhavetotakeinto accounthigh-fidelityaudioprocessingandreal-timethree-dimensional(3-D)graphicsanimations.Thishasledto thedesignofmoreversatilemediaprocessors,whichcombinethecapabilitiesofDSPsforefficientaudioandsignal processing,videoprocessorsforefficientvideoprocessing,graphicsprocessorsforefficient2-Dand3-Dgraphics processing, and general-purpose processors for efficient and flexible programming. The functions performed by microprocessors and media processors may eventually converge. In this chapter, we describe some of the key innovationsinmultimediainstructionsaddedtomicroprocessorISAs,whichhaveallowedhigh-fidelitymultimedia tobeprocessedinreal-timeonubiquitousdesktopandnotebookcomputers.Manyofthesefeatureshavealsobeen adoptedinmodernmediaprocessorsanddigitalsignalprocessors. 1.1SubwordParallelism Workloadcharacterizationstudiesonmultimediaapplicationsshowthatmediaapplicationshavehugeamounts of data parallelism and operate on lower-precision data types. A pixel-oriented application, for example, rarely needstoprocessdatathatiswiderthan16bits.Thistranslatesintolowcomputationalefficiencyongeneral-purpose processors where the registeranddatapathsizesaretypically32or64bits,calledthewidthofaword.Efficient processing of low-precision data types in parallel becomes a basic requirement for improved multimedia performance.Thisisachievedbypartitioningawordintomultiplesubwords,eachsubwordrepresentingalower- precision datum. A packed data type will be defined as data that consists of multiple subwords packed together. These subwords can be processed in parallel using a single instruction, called a subword-parallel instruction, a packedinstruction,oramicroSIMDinstruction.SIMDstandsfor“SingleInstructionMultipleData”,atermcoined by Flynn [2] for describing very large parallel machines with many data processors, where the same instruction issuedfromasinglecontrolprocessoroperatesinparallelondataelementsintheparalleldataprocessors.Lee[3] coined the term microSIMD architecture to describe an ISA where a single instruction operates in parallel on multiplesubwordswithinasingleprocessor. Figure1.1showsa32-bitintegerregisterthatismadeupoffour8-bitsubwords.Thesubwordsintheregister canbepixelvaluesfromagrayscaleimage.Inthiscase,theregisterisholdingfourpixelswithvalues0xFF,0x0F, 0xF0 and 0x00. The same 32-bit register can also be interpreted as two 16-bit subwords, in which case, these subwordswouldbe0xFF0Fand0xF000.Thesubwordboundariesdonotcorrespondtoaphysicalboundaryinthe registerfile;theyaremerelyhowthebitsinthewordareinterpretedbytheprogram.Ifwehave64-bitregisters,the mostusefulsubwordsizeswillbe8-bits,16-bitsor32-bitwords.Asingleregistercanthenaccommodate8,4or2 ofthesedifferentsizedsubwordsrespectively.

Ra: 11111111 00001111 11110000 00000000 Figure1.132-bitintegerregistermadeupoffour8-bitsubwords. To exploit subword parallelism, packed parallelism or microSIMD parallelism in a typical word-oriented microprocessor, new subword-parallel or packed instructions are added. (We use the terms “subword-parallel”, “packed” and “microSIMD” interchangeably to describe operations, instructions and architectures.) The parallel processing of the packed data types typically requires only minor modifications to the word-oriented functional units, with the register file and the pipeline structure remaining unchanged. This results in very significant performanceimprovementsformultimediaprocessing,ataverylowcost.

1 GeneralRegisterFile

Partitionable ALU

Figure1.2MicroSIMDparallelismusespackeddatatypesandapartitionableALU. Typically,packedarithmeticinstructionssuchaspackedaddandpackedsubtractarefirstintroduced. Tosupportsubwordparallelismefficiently,otherclassesofnewinstructionslikesubwordpermutationinstructions arealsoneeded.Wedescribetypicalsubword-parallelinstructionsintherestofthischapter,pointingoutinteresting arithmeticorarchitecturalfeaturesthathavebeenaddedtosupportthisstyleofmicroSIMDparallelism.Insection 2, we describe packed add and packed subtract instructions, and several variants of these. These instructions can all be implemented on the basic Arithmetic Logical Units (ALUs) found in programmable processors, with minor modifications. We describe such partitionable ALUs in section 2.1. We also describe saturationarithmetic-oneofthemostinterestingoutcomesofsubword-paralleladditions-forefficientlyhandling overflowsandperformingin-lineconditionaloperations.Avariantofpackedadditionisthepackedaverage instruction,whereunbiasedroundingisaninterestingassociatedfeature.Anotherclassofpackedinstructionsthat can use the ALU is the parallel compare instruction where the results are the outcomes of the subword comparisons. In section 3, we describe how packed integer multiplication is handled. We describe different approaches to solving the problem of the products being twice as large asthesubwordoperandsthataremultipliedinparallel. Whilesubword-parallelmultiplicationinstructionsgenerallyrequiretheintroductionofnewintegermultiplication functional units to a microprocessor, we describe the special case of multiplication by constants, which can be achievedveryefficientlywithpackedshiftandaddinstructionsthatcanbeimplementedonanALUwith asmallpreshifter. Insection4,wedescribepackedshiftandpackedrotateinstructions,whichperformasupersetof thefunctionsofatypicalshifterfoundinmicroprocessors,inparallel,onpackedsubwords. Insection5,wedescribeanewclassofinstructions,notpreviouslyfoundinprogrammableprocessorsthatdo not support subword parallelism. These are subword permutation instructions, which rearrange the order of the subwords packed in one or more registers. These permutation instructions can be implemented using a modified shifter,orasaseparatepermutationfunctionunit(seeFigure1.3). n

n Register File Permutation ALU Shifter Function Multiplier Unit

n Figure1.3Typicaldatapathsandfunctionalunitsinaprogrammableprocessor. To provide examples and illustrations, we refer to the following first and second generation multimedia instructionsinmicroprocessorISAs: • IA-64[4],MMX[5,6],andSSE-2[7]fromIntel,

2 • MAX-2[8,9]fromHewlett-Packard, • 3DNow!1[10,11]fromAMDand • AltiVec[12]fromMotorola. 1.2HistoricalOverview Thefirstgenerationmultimediainstructionsfocusedonsubwordparallelismintheintegerdomain.Thefirstset ofmultimediaextensionstargetedatgeneral-purposemultimediaacceleration,ratherthanjustgraphicsacceleration, wasMAX-1,introducedwiththePA-7100LCprocessorinJanuary1994[13,14]byHewlett-Packard.MAX-1,an acronymfor“MultimediaAccelerationeXtensions”,isaminimalistsetofmultimediainstructionsforthe32-bitPA- RISCprocessor.AnapplicationthatclearlyillustratedthesuperiorperformanceofMAX-1wasMPEG-1videoand audio decoding with software, at real-time rates of 30 frames per second [15,16,17]. For the first time, this performance was made possible using software on a general-purpose processor in a low-end desktop computer. Until then, such high-fidelity, real-time video decompression performance was not achievable without using specializedhardware.MAX-1alsoacceleratedpixelprocessingingraphicsrenderingandimageprocessing. Next,SunintroducedVIS[18],whichwasanextensionfortheUltraSparcprocessors.VISwasamuchlargerset ofmultimediainstructions.Inadditiontopackedarithmeticoperations,VISprovidedveryspecializedinstructions foraccessingvisualdata,storedinpre-determinedwaysinmemory. IntelintroducedMMX[5,6]multimediaextensionsinthedominantPentiummicroprocessorsinJanuary1997, which immediately legitimized the valuable contribution of multimedia instructions for ubiquitous multimedia applications. MAX-2 [8,9], was Hewlett-Packard’s multimedia extension for its 64-bit PA-RISC 2.0 processors. Although designedsimultaneouslywithMAX-1,itwasonlyintroducedin1996,withthePA-RISC2.0[8]architecture.The subwordpermutationinstructionsintroducedwithMAX-2wereusefulonlywiththeincreasedsubwordparallelism in 64-bit registers. Like MAX-1, MAX-2 was also a minimalist set of general-purpose media acceleration primitives. MIPS also described MDMX (although this is not implemented in most MIPS microprocessors), and Alpha describedaverysmallsetofMVImultimediainstructionsforvideocompression. The second generation multimedia instructions initially focused on subword parallelism on the floating-point (FP) side for accelerating graphics geometry computations and high-fidelity audio processing. Both of these multimediaapplicationsusesingle-precisionfloating-pointnumbersforincreasedrangeandaccuracy,ratherthan8- bit or 16-bit integers. These multimedia ISAs include SSE and SSE-2 [7] from Intel and 3DNow! [10,11] from AMD. Finally, the PowerPC’s AltiVec [12] and the Intel-HP IA-64 [4] multimedia instruction sets are comprehensive integer and floating-point multimedia instructions. Today, every microprocessor ISA and most mediaandDSPISAsincludesubword-parallelmultimediainstructions. 2. PACKEDADDANDPACKEDSUBTRACTINSTRUCTIONS Packedaddandpackedsubtractinstructionsaresimilartoordinaryaddandsubtractinstructions, exceptthattheoperationsareperformedinparallelonthesubwordsoftwosourceregisters.Add(non-packed)and packed add operations areshowninFigures2.1and2.2respectively.ThepackedaddinFigure2.2uses sourceregisterswithfoursubwordseach.Thecorrespondingsubwordsfromthetwosourceregistersaresummed up,andthefoursumsarewrittentothetargetregister.Apackedsubtractoperationoperatessimilarly.

13DNow!maybeconsideredashavingtwoversions.InJune2000,25newinstructionswereaddedtotheoriginal3DNow!specification.In thistext,wewillactuallybeconsideringthisextended3DNow!architecture. 3 Ra: Operand#1

Rb: Operand#2

Rc: Result Figure2.1ADDRc,Ra,Rb:Ordinaryaddinstruction.

Ra:

Rb:

Rc: Figure2.2PADDRc,Ra,Rb:Packedaddinstruction. 2.1PartitionableALUs Very minor modifications to the underlying functional units are needed to implement packed add and packed subtract instructions. Assume that we have an ALU with 32-bit integer registers, and we want to extendthisALUtoperformapackedaddthatwilloperateonfour8-bitsubwordsinparallel.Toachievethis,we justhavetoblockthecarrypropagationacrossthesubwordboundaries.Sinceeachsubwordisinterpretedasbeing independentoftheneighboringsubwords,bystoppingthecarrybitsfromaffectingtheneighboringsubwords,the packedaddoperationcanberealized. InFigure2.3,thepackedintegerregisterRa=[0xFF|0x0F|0xF0|0x00]isbeingaddedtoanotherpackedregister Rb=[0x00|0xFF|0xFF|0x0F]. The result is written to the target register Rc. If the ordinary add instruction is, the overflowsgeneratedbytheadditionofthesecondandthirdsubwordswillpropagateintothefirsttwosums.The correctsums,however,canbeachievedeasilybyblockingthecarrybitpropagationacrossthesubwordboundaries, whicharespaced8-bitsapartfromoneanother. Subword1 Subword2 Subword3 Subword4

Ra: 11111111 00001111 11110000 00000000

Rb: 00000000 11111111 11111111 00001111

Rc: 11111111 00001110 11101111 00001111 Figure2.3Inthepackedaddinstruction,thecarrybitsarenotpropagatedintothefirstandsecondsums.

4 As shown in Figure 2.4, a 2-to-1 multiplexer placed at the subword boundaries of the adder can be used to control the propagation or the blocking of the carry bits. If the instruction is a packed add, the multiplexer control is set such that a zero is propagated into the next subword. If the instruction is an ordinary add, the multiplexercontrolissetsuchthatthecarryfromthepreviousstageispropagated.Byplacingsuchamultiplexerat eachsubwordboundaryandaddingthecontrollogic,thesupportforpackedaddinstructionswillbeaddedto this ALU. If multiple subword sizes must be supported, more multiplexers may be required. In this case, the multiplexercontrolgetsmorecomplicated;neverthelesstheareacostisstillveryinsignificantfortheperformance providedbysuchmicroSIMDinstructions. By using 3-to-1 multiplexers instead of 2-to-1 multiplexers, we can also implement packed subtract instructions.Themultiplexercontrolissetsuchthat: • Forpackedaddinstructions,zeroispropagatedintothenextstage, • Forpackedsubtractinstructions,oneispropagatedintothenextstage, • Forordinaryadd/subtractinstructions,thecarrybitfromthepreviousstageispropagatedintothenext stage. Whenwepropagateazerothroughtheboundaryintothenextsubwordinthepackedaddinstructions,weare essentiallyignoringanyoverflowthatmighthavebeengenerated.Similarly,whenwepropagateaonethroughthe boundaryintothenextsubwordinthepackedsubtractinstructions,weareessentiallyignoringanyborrow that might have been generated. Ignoring overflows is equivalent to using modular arithmeticinaddoperations. Whilemodulararithmeticcanbenecessaryoruseful,thereareotheroccasionswhenthecarrybitsshouldnotbe ignoredandhavetobehandleddifferently. Subword1 Subword2 Subword3 Subword4

Ra: 11111111 00001111 11110000 00000000

Rb: 00000000 11111111 11111111 00001111

carry-out carry-in 0 0 0

Rc: 11111111 00001110 11101111 00001111 Figure2.4PartitionableALU:Inpackedaddinstructions,themultiplexerspropagatezero;inordinaryadd instructions,themultiplexerspropagatecarry-outfromthepreviousstageintothecarry-inofthenextstage. 2.2HandlingParallelOverflows Overflowsinpackedadd/subtractinstructionscanbehandledinthefollowingways: • Theoverflowmaybeignored(modulararithmetic), • Aflagbitmaybesetifatleastoneoverflowisgenerated, • Multipleflagbits(i.e.oneflagbitforeachadditionoperationonthesubwords)maybeset, • Asoftwareoverflowtrapcanbetaken, • Saturationarithmetic:theresultsarelimitedtoacertainrange.Iftheoutcomeoftheoperationfallsoutsidethis range,thecorrespondinglimitingvaluewillbetheresult. Most non-packed integer add/subtract instructions choose to ignore overflows and perform modular arithmetic.Inmodulararithmetic,thenumberswraparoundfromthelargestrepresentablenumbertothesmallest representable number. For example,in8-bitmodulararithmetic,theoperation254+2willgivearesultof0.The

5 expectedresult,256,islargerthanthelargestrepresentablenumber,whichis255,andthereforeiswrappedaround tothesmallestrepresentablenumber,whichis0. In multimedia applications, modular arithmetic frequently gives undesirable results. If the numbers in the previousexamplewerepixelvaluesinagrayscaleimage,bywrappingthevaluesfrom255downto0,wewould have converted white pixels into black ones. One solution to this problem is to use overflow traps, which are implementedinsoftware. A flag bit is an indicator bit that is setorcleareddependingontheoutcomeofaparticularoperation.Inthe context of this discussion, an overflow flag bit is an indicator that is set when an add instruction generates an overflow.Thereareoccasionswheretheuseoftheflagbitsaredesirable.Consideraloopthatiteratesmanytimes andineachiteration,executesmanyaddinstructions.Inthiscase,itisnotdesirabletohandleoverflows(bytaking overflowtraproutines)assoonastheyoccur,sincethiswouldnegativelyimpacttheperformancebyinterrupting theexecutionoftheloopbody.Rather,theoverflowflagcanbesetwhentheoverflowoccurs,andtheprogramflow mayberesumedasiftheoverflowdidnotoccur.Attheendofeachiterationhowever,thisoverflowflagcanbe checkedandtheoverflowtrapcanbeexecutediftheflagturnsouttobeset.Thisway,theprogramflowwouldnot beinterruptedwhiletheloopbodyexecutes. Anoverflowtrapcanbeusedtosaturatetheresultssothattheaforementionedproblemswouldnotoccur.A resultthatisgreaterthanthelargestrepresentablevalueisreplacedbythatlargestvalue.Similarly,aresultthatis lessthanthesmallestrepresentablevalueisreplacedbythatsmallestvalue.Oneproblemwiththissolutionwillbe its negative effects to performance. An overflow trap is handled in software and may take many clock cycles to resolve.Thiscanbeacceptableonlyiftheoverflowsareinfrequent.Fornon-packedadd/subtractinstructions, generationofanoverflowona64-bitregisterbyaddingup8-bitquantitieswillberare,soasoftwareoverflowtrap willworkwell.Thisisnotthecaseforpackedarithmeticoperations.Causinganoverflowinan8-bitsubwordis much more likely than in a 64-bit register. Also, since a 64-bit register may hold eight 8-bit subwords, multiple overflowscanoccurinasingleexecutioncycle.Inthiscase,handlingtheoverflowsbysoftwaretrapscouldeasily negate any performance gains from executing packed operations. The use of saturation arithmetic solves this problem. 2.3SaturationArithmetic Saturationarithmeticimplementsinhardwaretheworkdonebytheoverflowtrapdescribedabove.Theresults fallingoutsidetheallowednumericrangesaresaturatedtotheupperandlowerlimitsbyhardware.Thiscanhandle multiple parallel overflows efficiently, without operating system intervention. We define two types overflows for arithmeticoperations: • Apositiveoverflowoccurswhentheresultislargerthanthelargestvalueinthedefinedrangeforthatresult. • Anegativeoverflowoccurswhentheresultissmallerthanthesmallestvalueinthedefinedrangeforthatresult. Ifsaturationarithmeticisusedinanoperation,theresultisclippedtothemaximumvalueinitsdefinedrangeif apositiveoverflowoccurs,andtotheminimumvalueinitsdefinedrangeifanegativeoverflowoccurs. Foragiveninstruction,multiplesaturationoptionsmayexist,dependingonwhethertheoperandsandtheresult aretreatedassignedorunsignedintegers.Foraninstructionthatusesthreeregisters(twoforsourceoperandsand onefortheresult),therecanbeeightdifferentsaturationoptions.Eachoneofthethreeregisterscanbetreatedas containingeitherasignedoranunsignedinteger,whichgives23possiblecombinations.Notalloftheeightpossible saturation options are equally useful. Only three of the eight possible saturation options are used in any of the multimediaISAssurveyed: a) sss(signedresult-signedfirstoperand-signedsecondoperand): Inthissaturationoption,theresultandthetwooperandsarealltreatedassignedintegers.Themostsignificant bitisconsideredthesignbit.Consideringn-bitsubwords,theresultandoperandsaredefinedintherange[-2n-1, 2n-1-1].Ifapositiveoverflowoccurs,theresultissaturatedto2n-1.Ifanegativeoverflowoccurs,theresultis saturated to –2n-1. In an addition operation that uses the sss saturation option, since the operands are signed numbers,apositiveoverflowispossibleonlywhenbothoperandsarepositive.Similarly,anegativeoverflowis possibleonlywhenbothoperandsarenegative. b) uuu(unsignedresult-unsignedfirstoperand-unsignedsecondoperand): 6 Inthissaturationoption,theresultandthetwooperandsarealltreatedasunsignedintegers.Consideringn-bit integersubwords,theresultandtheoperandsaredefinedintherange[0,2n-1].Ifapositiveoverflowoccurs, theresultissaturatedto2n.Ifanegativeoverflowoccurs,theresultissaturatedtozero.Inanadditionoperation that uses the uuu saturation option, since the operands are unsigned numbers, negative overflow is not a possibility. However, for a subtraction operation using the uuusaturation,negativeoverflowispossible,and anynegativeresultwillbeclampedtozeroasthesmallestvalue. c) uus(unsignedresult-unsignedfirstoperand-signedsecondoperand): In this saturation option, the result and the first operand are treated as unsigned numbers, and the second operandistreatedasasignednumber.Althoughthismayseemlikeanunusualoption,itprovesusefulsinceit allowstheadditionofasignedincrementtoanunsignedpixel.Italsoallowsnegativenumberstobeclippedto zero. In addition to the efficient handling of overflows, saturation arithmetic also facilitates several other useful computations.Forinstance,saturationarithmeticcanalsobeusedtoclipresultstoarbitrarymaximumorminimum values.Withoutsaturationarithmetic,theseoperationscouldnormallytakeuptofiveinstructionsforeachpairof subwords.Thatwouldincludeinstructionstocheckforupperandlowerboundsandthentoperformtheclipping. Using saturation arithmetic however, this effect canbeachievedinasfewastwoinstructionsforallthepairsof packedsubwords. Saturation arithmetic and also be used for in-line conditional execution, reducing the need for conditional branches that can cause significant performance degradations in pipelined processors. Some examples are the packedmaximumandpackedabsolutedifferenceinstructionsshowninFigures2.5(a)and2.5(b).

Ra: 58 14 12 77

Rb: 22 192 118 36

a) ci =max(ai,bi)

Rc: 36 0 0 41 PSUB,uuu Rc,Ra,Rb

Rc: 58 192 118 77 PADD Rc,Rc,Rb

b) ci =|ai-bi|

Re: 36 0 0 41 PSUB,uuuRe,Ra,Rb

Rf: 0 178 106 0 PSUB,uuu Rf,Rb,Ra

R : 36 178 106 41 PADD R ,R ,R c c e f Figure2.5(a)Packedmaximumoperationusingsaturationarithmetic.(b)Packedabsolute differenceoperationusingsaturationarithmetic. Table 2.1 contains examples of operations that can be performed using saturation arithmetic [14]. All of the instructionsinthetableusethreeregisters.Thefirstregisteristhetargetregister.Thesecondandthethirdregisters hold the first and the second operands respectively. PADD and PSUB denote packed add and packed subtractinstructions.Thethree-letterfieldaftertheinstructionmnemonicspecifieswhichsaturationoptionisto beused.Ifthisfieldisempty,modulararithmeticisassumed.Alltheexamplesinthetableoperateon16-bitinteger subwords. Table 2.2 contains a summary of the register and subword sizes and the saturation options found in different multimedia instruction set architectures. Table 2.3 is a summary of the packed add/subtract

7 instructionsinseveralmultimediaISAs.Thefirstcolumncontainsdescriptionsofcommonpackedinstructions.The symbols ai and bi representthecorrespondingsubwordsfromthetwosourceregisters.Thesymbol ci represents thecorrespondingsubwordinthetargetregister. TABLE2.1Examplesofoperationsthatarefacilitatedbysaturationarithmetic.aiandbiarethesubwordsinthe registersRaandRbrespectively,wherei=1,2,…,kandkdenotingthenumberofsubwordsinaregister.Subword sizen,isassumedtobetwobytes(i.e.n=16)forthistable. Operation InstructionSequence Notes 15 Clipaitoanarbitrarymaximumvaluevmax, PADD.sssRa,Ra,Rb Rbcontainsthevalue(2 -1-vmax).Ifai>vmax,this 15 15 wherevmax<2 -1. operationclipsaito2 -1onthehighend. PSUB.sssRa,Ra,Rb aiisatmostvmax. 15 Clipaitoanarbitraryminimumvaluevmin, PSUB.sssRa,Ra,Rb Rbcontainsthevalue(-2 +vmin).Ifai-2 . operationclipsaito-2 atthelowend. PADD.sssRa,Ra,Rb aiisatleastvmin. 15 Clipaitowithinthearbitraryrange[vmin,vmax], PADD.sssRa,Ra,Rb Rbcontainsthevalue(2 -1-vmax).Thisoperation 15 15 15 where–2

Clipthesignedintegeraitoanunsigned PADD.uusRa,Ra,0 Ifai<0,thenai=0elseai=ai. 15 integerwithintherange[0,vmax],where2 - Ifaiwasnegative,itgetsclippedtozero,else 16 1bi,thenci=(ai-bi)elseci=0. Packedmaximumoperation. PADDRc,RbRc Ifai>bi,thenci=aielseci=bi. ci=|ai-bi| PSUB.uuuRe,Ra,Rb Ifai>bi,thenei=(ai-bi)elseei=0. Absolutedifferenceoperation. Absolutevalueofthedifferenceofthetwo PSUB.uusRf,Rb,Ra Ifai<=bi,thenfi=(bi-ai)elsefi=0. subwordsiswrittentothetargetregister. PADDRc,Re,Rf Ifai>bi,thenci=(ai-bi),elseci=(bi-ai). TABLE2.2Summaryoftheintegerregisterandsubwordsizesforthedifferentarchitectures. Architecturalfeature IA-64 MAX-2 MMX SSE-2 AltiVec Sizeofintegerregisters(bits) 64 64 64 128 128 Supportedsubwordsizes(bytes) 1,2,4 2 1,2,4 1,2,4,8 1,2,4 Modulararithmetic Y Y Y Y Y Supportedsaturationoptions sss,uuu,uus sss,uus sss,uuu sss,uuu uuu,sss for1,2byte for2byte for1,2byte for1,2byte for1,2,4byte TABLE2.3Summaryofthepackedaddandpackedsubtractinstructionsandvariants. IntegerOperations IA-64 MAX-2 MMX SSE-2 3DNow! AltiVec = + √ √ √ √ √ ci ai bi = + √ √ √ √ ci ai bi (withsaturation) = − √ √ √ √ √ ci ai bi = − √ √ √ √ ci ai bi (withsaturation) = √ √ √ √ √ ci avg(ai ,bi ) = √ ci neg _ avg(ai ,bi ) = + + √ [c2i , c2i+1 ] [a2i a2i+1,b2i b2i+1 ] = + √ lsbit(ci ) carryout(ai bi ) = − √ lsbit(ci ) carryout(ai bi ) = √ √ √ ci compare(ai ,bi ) Movemask √ √

8 = √ √2 √ √ √ ci max(ai ,bi ) = √ √2 √ √ √ ci min(ai ,bi ) = − √ √ √ c ƒ ai bi TheIA-643architecturehas64-bitintegerregisters.Packedaddandpackedsubtractinstructionsare supported for subword sizes of 1,2 and 4 bytes. Modular arithmetic is defined for all subword sizeswhereasthe saturationoptions(sss,uuuanduus)existforonly1and2-bytesubwords. ThePA-RISCMAX-2architecturealsohas64-bitintegerregisters.Packedaddandpackedsubtract instructionsoperateononly2-bytesubwords.MAX-2instructionssupportmodulararithmetic,andthesssanduus saturationoptions. The IA-32 MMX architecture defines eight 64-bit registers for use by the multimedia instructions. Although these registers are referred to as separate registers, they are aliased to the registers in the FP data register stack. Supported subword sizes are 1, 2 and 4 bytes. Modular arithmetic is defined for all subword sizes whereas the saturationoptions(sssanduus)existforonly1and2-bytesubwords. TheIA-32SSE-2technologyintroducesanewsetofeight128-bitFPregisterstotheIA-32architecture.Eachof the128-bitregisterscanaccommodatefoursingle-precision(SP)ortwodouble-precision(DP)numbers.Moreover, theseregisterscanalsobeusedtoaccommodatepackedintegerdatatypes.Integersubwordsizescanbe1,2,4or8 bytes.Modulararithmeticisdefinedforallsubwordsizeswhereasthesaturationoptions(sssanduus)existforonly 1and2-bytesubwords. The PowerPC AltiVec architecture has 32 128-bit registers that can be accessed by multimedia instructions. Packed add/subtract instructions are supported for 1, 2 and 4-byte subwords. Modular or saturation arithmetic(uuuorsss)canbeused. 2.4PackedAverage PackedaverageinstructionsareverycommoninmediaapplicationssuchaspixelaveraginginMPEG-2 encoding,motioncompensationandvideoscaling.Inapackedaverage,thepairsofcorrespondingsubwords inthetwosourceregistersareaddedtogenerateintermediatesums.Then,theintermediatesumsareshiftedrightby one bit, so that any overflow bit is shifted in on the left as the most significant bit. The beauty of the average operationisthatnooverflowcanoccur,andtwooperations(addfollowedbyaonebitrightshift)areperformedin oneoperation.Inapackedaverageinstruction,2noperationsareperformedinasinglecycle,wherenisthe number of subwords. In fact, even more operations are performed in a packed average instruction, if the roundingappliedtotheleastsignificantendoftheresultisconsidered.Here,twodifferentroundingoptionshave beenused:

2Thisoperationisrealizedbyusingsaturationarithmetic. 3AllthediscussionsinthischapterconsiderIntel’sIA-64asthebasearchitecture.Evaluationsoftheotherarchitecturesaregenerallycarried outbycomparisonstoIA-64. 9 Ra:

Rb:

1 1 1 1

sumplus carry s s h h i i 1 f f 1 t t b b r r i i i i t t g g s s h h h h t t i i 1 1 f f t t b b r r i i i i t t g g h h t t

R : c Figure2.6PAVGRc,Ra,Rb:Packedaverageinstructionusingtheroundawayfromzerooption. • roundawayfromzero:Aoneisaddedtotheintermediatesums,beforetheyareshiftedtotherightbyonebit position.Ifcarrybitsweregeneratedduringtheadditionoperation,theyareinsertedintothemostsignificantbit positionduringtheshiftrightoperation(seeFigure2.6). • round to odd: Instead of adding one to the intermediate sums, a much simpler OR operation is used. The intermediatesumsaredirectlyshiftedrightbyonebitposition,andthelasttwobitsofeachofthesubwordsofthe intermediate sums are ORed to give the least significant bit of the final result. This makes sure that the least significant bit of the final results are set to one (odd) if at least one of the two least-significant bits of the intermediatesumsareone(seeFigure2.7). This rounding mode also performs unbiased rounding under the following assumptions. If the intermediate resultisuniformlydistributedovertherangeofpossiblevalues,thenhalfofthetime,thebitshiftedoutiszero,and theresultremainsunchangedwithrounding.Theotherhalfofthetime,thebitshiftedoutisone:ifthenextleast significantbitisone,thentheresultloses–0.5,butifthenextleastsignificantbitisazero,thentheresultgains +0.5.Sincethesecasesareequallylikelywithauniformdistributionoftheresult,theroundtooddoptiontendsto canceloutthecumulativeaveragingerrorsthatmaybegeneratedwithrepeateduseoftheaveraginginstruction.

10 Ra:

Rb:

sumplus carry

carry sumbits

s s bit h h i i 1 f f 1 t t b b r r i i i i t t g g s s h h h h t t i i 1 1 f f t t b b r r i i i i t t g g h h t t OR

Rc:

Figure2.7PAVGRc,Ra,Rb:Packedaverageinstructionusingtheroundtooddoption. 2.5AccumulateInteger Sometimes, it is useful to add adjacent subwords in the same register. This can, for example, facilitate the accumulationofstreamingdata.Anaccumulateintegerinstructionperformsanadditionofthesubwordsin thesameregisterandplacesthesumintheupperhalfofthetargetregister,whilerepeatingthesameprocessforthe secondsourceregisterandusingthelowerhalfofthetargetregister(Figure2.8).

Ra: Rb:

Rc: Figure2.8ACCRc,Ra,Rb:Accumulateintegerworkingonregisterswithtwosubwords. 2.6SaveCarryBits Thisinstructionsavesthecarrybitsfromapackedaddoperation,ratherthanthesums.Figure2.9showssuch asavecarrybitsinstructioninAltiVec:apackedaddisperformedandthecarrybitsarewrittentothe least significant bit of each result subword in the target register. A similar instruction saves the borrow bits generatedwhenperformingpackedsubtractinsteadofpackedadd.

11 Ra:

Rb: s s s s u u u u

c c c m m m m c

a a a a

r r r r r r r r y y y y

Rc: 0...0 0...0 0...0 0...0 Figure2.9Savecarrybitsinstruction. 2.7PackedCompareInstructions Sometimes, it is necessary to compare pairs of subwords. In a packed compare instruction, pairs of subwordsarecomparedaccordingtotherelationspecifiedbytheinstruction.Iftheconditionistrueforasubword pair, the corresponding field in the target register is written with a 1-mask. If the condition is false, the correspondingfieldinthetargetregisteriswrittenwitha0-mask.Alternatively,atrueorfalsebitisgeneratedfor each subword, and this set of bits is written into the least significant bits of the result register. Some of the architectureshavecompareinstructionsthatallowcomparisonoftwonumbersforallofthe10possiblerelations4, whereasothersonlysupportasubsetofthemostfrequentrelations.Atypicalpackedcompareinstructionis showninFigure2.10forthecaseoffoursubwords.

Ra:

Rb:

compare compare compare compare F T T T a r r r l u u u s e e e e

Rc: 1...1 0...0 1...1 1...1 Figure2.10Packedcompareinstruction.Bitmasksaregeneratedasaresultofthecomparisonsmade. WhenamaskofbitsisgeneratedasinFigure2.10,oftenamovemaskinstructionisalsoprovided.Inamove maskinstruction,themostsignificantbitsofeachofthesubwordsarepicked,andthesebitsareplacedintothe target register, in a right aligned field (see Figure 2.11). In different algorithms, either the subword mask format generatedinFigure2.10orthebitmaskformatgeneratedinFigure2.11ismoreuseful.

4Twonumbersaandbcanbecomparedforoneofthefollowing10possiblerelations:equal,less-than,less-than-or-equal,greater-than, greater-than-or-equal, not-equal, not-less-than, not-less-than-or-equal, not-greater-than, not-greater-than-or-equal. Typical notation for these relationsareasfollowsrespectively:==,<,<=,>,>=,!=,!<,!<=,!>,!>=. 12 Ra:

R : 0...0 0...0 0...0 0...0 b Figure2.11MovemaskRb,Ra. Twocommoncomparisonsusedarefindingthelargerofapairofnumbers,orthesmallerofapairofnumbers. In the packed maximum instruction, the greater of the subwords in the compared pair gets written to the correspondingsubwordinthetargetregister(seeFigure2.12).Similarly,inthepackedminimuminstruction,the smallerofthesubwordsinthecomparedpairgetswrittentothecorrespondingsubwordinthetargetregister.Inthe earlier section on saturation arithmetic, we saw that instead of special instructions for packed maximum and packed minimum, MAX-2 performs packed maximum and packed minimum operations by using packedaddandpackedsubtractinstructionswithsaturationarithmetic(seeFigure2.5).AnALUcanbe used to implement comparisons, maximum andminimuminstructionswithasubtractionoperation;comparisons forequalityorinequalityisusuallydonewithanexclusive-oroperation,alsoavailableinmostALUs.

Ra:

Rb:

max(ai,bi) max(ai,bi) max(ai,bi) max(ai,bi)

Rc: Figure2.12Packedmaximuminstruction. 2.8.SumofAbsoluteDifferences Amorecomplex,multi-cycleinstructionisthesumofabsolutedifferences(SAD)instruction(see Figure2.13).ThisisusedformotionestimationinMPEG-1andMPEG-2videoencoding,forexample.InaSAD instruction,thetwopackedoperandsaresubtractedfromoneanother.Absolutevaluesoftheresultingdifferences arethensummedup.

13 Ra:

Rb:

Absolute Absolute Absolute Absolute value value value value

Rc: 0...0 0...0 Figure2.13SADRc,Ra,Rb:Sumofabsolutedifferencesinstruction. While useful, the SAD instruction is a multi-cycle instruction with a typical latency of three cycles. This can complicatethepipelinecontrolofotherwisesinglecycleintegerpipelines.Hence,minimalistmultimediainstruction sets like MAX-2 do not have SAD instructions. Instead, MAX-2 uses generic packed add and packed subtractinstructionswithsaturationarithmetictoperformtheSADoperation(seeFigure2.5(b)andTable2.1). 3.PACKEDMULTIPLYINSTRUCTIONS 3.1MultiplicationoftwoPackedIntegerRegisters Themaindifficultywithpackedmultiplicationoftwon-bitintegersisthattheproductistwiceaslongaseach operand.Considerthecasewheretheregistersizeis64bitsandthesubwordsare16bits.Theresultofthepacked multiplicationwillbefour32-bitproducts,whichcannotbeaccommodatedinasingle64-bittargetregister. Onesolutionistousetwopackedmultiplyinstructions.Figure3.1showsapackedmultiplyhigh instruction,whichplacesonlythemoresignificantupperhalvesoftheproductsintothetargetregister.Figure3.2 showsapackedmultiplylowinstruction,whichplacesonlythelesssignificantlowerhalvesoftheproducts intothetargetregister.

Ra: Sourceregisters with16-bit subwords Rb:

H L H L Four32-bit products H L H L

Targetregisterholding thehigh-order16-bitsof Rc: H H H H theintermediateproducts Figure3.1Packedmultiplyhighinstruction.

14 Ra: Sourceregisters with16-bit subwords Rb:

H L H L Four32-bit products H L H L

Targetregisterholding thelow-order16-bitsof Rc: L L L L theintermediateproducts Figure3.2Packedmultiplylowinstruction. IA-64generalizesthiswithitspackedmultiplyandshiftrightinstruction(seeFigure3.3),which doesaparallelmultiplicationfollowedbyarightshift.Insteadofbeingabletochooseeithertheupperorthelower halfoftheproductstobeputintothetargetregister,itallowsmultiple5different16-bitfieldsfromeachofthe32-bit productstobechosenandplacedinthetargetregister.

Ra: Sourceregisters with16-bit subwords Rb:

Four32-bit products

Shiftright nbits >>n >>n >>n >>n

H L H L

H L H L

Targetregisterholding thelow-order16-bitsof R : theright-shifted32-bit c L L L L intermediateproducts Figure3.3Thegeneralizedpackedmultiplyandshiftrightinstruction. IA-64alsoallowsthefullproducttobesaved,butforonlyhalfofthepairsofsourcesubwords.Eithertheoddor theevenindexedsubwordsaremultiplied.Thismakessurethatonlyasmanyfullproductsascanbeaccommodated in one target register are generated. These two variants, the packed multiply left and packed multiplyrightinstructions,aredepictedinFigures3.4and3.5.

5InIA-64theright-shiftamountsarelimitedto0,7,15or16bits,sothatonly2bitsinthepackedmultiplyandshiftright instructionareneededtoencodethefourshiftamounts. 15 Ra: 1 2 3 4 Sourceregisters with16-bit subwords Rb: 1 2 3 4

Two32-bit products H L H L

R : H L H L c Figure3.4Packedmultiplyleftinstructionwhereonlytheoddindexedsubwordsofthetwosource registersaremultiplied.

Ra: 1 2 3 4 Sourceregisters with16-bit subwords Rb: 1 2 3 4

Two32-bit products H L H L

R : H L H L c Figure3.5Packedmultiplyrightinstructionwhereonlytheevenindexedsubwordsofthetwosource registersaremultiplied. Anothervariantisthepackedmultiplyandaccumulateinstruction.Normally,amultiplyand accumulate operation requires three source registers. The PMADDWD instruction in MMX requires only two source registers by performing a packed multiply followed by an addition of two adjacent subwords (see Figure3.6).

Ra: Sourceregisters with16-bit subwords Rb:

Four32-bit products

R : 32-bitsum 32-bitsum c Figure3.6PackedmultiplyandaccumulateinstructioninMMX. Instructions in the AltiVec architecture may have up to three source registers. Hence, AltiVec’s packed multiplyandaccumulateusesthreesourceregisters.InFigure3.7,theinstructionpackedmultiply highandaccumulatestartsjustlikeapackedmultiplyinstruction,selectsthemoresignificanthalves oftheproducts,thenperformsapackedaddofthesehalvesandthevaluesfromathirdregister.Theinstruction packedmultiplylowandaccumulateisthesame,exceptthatonlythelesssignificanthalvesofthe productsareaddedtothesubwordsfromthethirdregister.

16

Ra:

Sourceregisters with16-bit subwords Rb:

Rc:

H L H L Four32-bit products H L H L

Targetregisterholding R : the16-bitsums d Figure3.7InthepackedmultiplyhighandaccumulateinstructioninAltiVec,onlythehigh orderbitsoftheintermediateproductsareusedintheaddition. 3.2MultiplicationofaPackedIntegerRegisterbyanIntegerConstant Manymultiplicationsinmultimediaapplicationsarewithconstants,ratherthanwithvariables.Forexample,in the Inverse Discrete Cosine Transform (IDCT) used in the compression and decompression of JPEG images and MPEG-1 and MPEG-2 video, all the multiplications are by constants. This type of multiplication can be further optimized for simpler hardware, lower power and higher performance simultaneouslybyusingpackedshift andaddinstructions[19].Shiftingaregisterleftbynbitsisequivalenttomultiplyingitby2n.Sinceaconstant numbercanberepresentedasabinarysequenceofonesandzeros,usingthisnumberasamultiplierisequivalentto aleftshiftofthemultiplicandofnbitsforeachnthpositionwherethereisaoneinthemultiplierandanaddofeach shiftedvaluetotheresultregister. Asanexample,considermultiplyingtheintegerregisterRawiththeconstantC=11.Thefollowinginstruction sequenceperformsthismultiplication.AssumeRainitiallycontainsthevalue6.

Initialvalues:C=11=10112andRa=6=01102 Instruction Operation Result Shiftleft1bitRb,Ra Rb=Ra<<1 Rb=11002=12 AddRb,Rb,Ra Rb=Rb+Ra Rb=11002+01102=0100102=18 Shiftleft3bitRc,Ra Rc=Ra<<3 Rc=01102*8=1100002=48 AddRb,Rb,Rc Rb=Rb+Rc Rb=0100102+1100002=10000102=66 Thissequencecanbeshortenedbycombiningtheshiftleftandtheaddinstructionsintoonenewshift left and add instruction. The following new sequence performs the same multiplication in half as many instructionsandusesonelessregister. Initialvalues:C=11=10112andRa=6=01102 Instruction Operation Result Shiftleft1bitandaddRb,Ra,Ra Rb=Ra<<1+Ra Rb=18 Shiftleft3bitandaddRb,Ra,Rb Rb=Ra<<3+Rb Rb=66 Multiplication of packed integer registers by integer constants uses the same idea. The shift left and addinstructionbecomesapackedshiftleftandaddinstructiontosupportthepackeddatatypes.As

17 anexampleconsidermultiplyingthesubwordsofthepackedintegerregisterRa=[1|2|3|4]bytheconstantC=11.The instructionstoperformthisoperationare: Initialvalues:C=11=10112andRa=[1|2|3|4]=[0001|0010|0011|0100]2 Instruction Operation Result Shiftleft1bitandaddRb,Ra,Ra Rb=Ra<<1+Ra Rb=[3|6|9|12] Shiftleft3bitandaddRb,Ra,Rb Rb=Ra<<3+Rb Rb=[11|22|33|44] The same reasoning we used for multiplication by integer constants applies to multiplication by fractional constants.Arithmeticrightshiftofaregisterbynbitsisequivalenttodividingitby2n.Usingafractionalconstant asamultiplierisequivalenttoanarithmeticrightshiftofthemultiplicandbynbitsforeachnthpositionwherethere isa1inthemultiplierandanaddofeachshiftedvaluetotheresultregister.Byusingapackedarithmetic shiftrightandaddinstruction,theshiftandtheaddinstructionscanbecombinedintoonetofurther speedsuchcomputations.Forinstance,multiplicationofapackedregisterbythefractionalconstant0.0112(=0.375) canbeperformedbyusingjusttwopackedarithmeticshiftrightandaddinstructions. Initialvalues:C=0.375=0.0112andRa=[1|2|3|4]=[0001|0010|0011|0100]2 Instruction Operation Result Arithmeticshiftright3bitandaddRb,Ra,0 Rb=Ra>>2+0 Rb=[0.125|0.25|0.375|0.5] Arithmeticshiftright2bitandaddRb,Ra,Rb Rb=Ra>>2+Rb Rb=[0.375|0.75|1.125|1.5] Onlytwosingle-cycleinstructionsarerequiredtoperformthemultiplicationoffoursubwordsbyaconstant,in thisexample.Thisisequivalenttoaneffectiverateoftwomultiplicationspercycle.Withoutsubwordparallelism, the same operations wouldtakeatleastfourintegermultiplyinstructions.Furthermore,thepackedshift andaddinstructionsuseasimpleALUwithasmallpreshifter,whereastheintegermultiplyinstructionsneed amorecomplexmultiplierfunctionalunit.Inaddition,eachmultiplicationoperationtakesatleastthreecyclesof latencycomparedtoaonecyclelatencyforapreshiftandaddoperation.Henceforthisexample,thespeedupfor multiplying four subwords by a constant is 4*3/2 = 6, comparing implementations with one subword multiplier versusonepartitionableALUwithpreshifter. MAX-2 in PA-RISC and IA-64 are the only multimedia ISAs surveyed that have these efficient packed shift left and add instructions and packed shift right and add instructions. The preshift amounts allowed are by one, two or three bits, and the arithmetic is performed withsignedsaturation,for16-bit subwords. 3.3 VectorMultiplication Sofar,wehavelookedatrelativelysimplepackedmultiplyinstructions.Theseinstructionsalltakeabout thesamelatencyasasinglemultiplyinstruction,whichistypically3-4cyclescomparedtoanaddinstruction normalizedtoonecyclelatency.Forbetterorworse,somemultimediaISAshaveincludedverycomplex,multiple- cycle operations. For example, AltiVec has a packed vector multiply and accumulate instruction, usingthree128-bitpackedsourceoperandsanda128-bittargetregister(seeFigure3.8).First,allthepairsofbytes withina32-bitsubwordintwoofthesourceregistersaremultipliedinparalleland16-bitproductsaregenerated. Then, four 16-bit products are added to each other to generate a “sum of products” for every 32-bits. A 32-bit subword from the third source register is added to this “sum of products”. The resulting sum is placed in the corresponding 32-bit subword field of the target register. This process is repeated for each of the four 32-bit subwords.Thisisatotalofsixteen8-bitintegermultiplies,twelve16-bitadditions,andfour32-bitadditions,using four 128-bit registers, in a single VSUMMBM instruction. This can perform a 4x4 matrix times a 4x1 vector multiplication,whereeachelementisabyte,inasingleinstruction,butthissinglecomplexinstructiontakesmany cyclesoflatency.Whileamultiplicationofa4x4matrixwitha4x1vectorisaveryfrequentoperationingraphics geometryprocessing,theprecisionrequiredisusuallythatof32-bitsingle-precisionfloating-pointnumbers,not8- bit integers. Whether the complexity of such a compound VSUMMBM instruction is justified depends on the frequencyofsuch4x4matrix-vectormultiplicationsofbytes. Table3.1summarizesthepackedintegermultiplicationinstructionsdescribedabove.

18 Ra: ...... Sourceregisters with8-bit subwords. Rb: ......

Sourceregister R : ...... with32-bit subwords. c

H L H L Four16-bit products H L H L

32-bittarget subwordholding thesumofthefour16-bit Rd: ...... productsandthe32-bit subword Figure3.8AltiVec’sVSUMMBMinstruction:onlyonefourthoftheinstructionisshown.Eachboxrepresentsa byte.Thisprocessiscarriedoutforeach32-bitwordinthe128-bitsourceregisters. TABLE3.1Packedintegermultiplicationinstructions. IntegerOperations IA-64 MAX-2 MMX SSE-2 3Dnow! AltiVec = √ √ √ √ √ ci lower _ half (ai *bi ) = √ √ √ √ √ ci upper _ half (ai *bi ) = >> √6 ci lower _ half [(ai *bi ) n] Packedmultiplyleft √ √ = [c2i ,c2i+1 ] a2i *b2i Packedmultiplyright √ √ = [c2i ,c2i+1] a2i+1 *b2i+1 Packedmultiplyandaccumulate √ = + [c2i ,c2i+1] a2i *b2i a2i+1 *b2i+1 = + √ di upper _ half (ai *bi ) ci = + √ di lower _ half (ai *bi ) ci Packedshiftleftandadd7 √ √ = << + ci (ai n) bi ,forn=1,2or3bits. Packedshiftrightandadd8 √ √ = >> + ci (ai n) bi ,forn=1,2or3bits. Packedvectormultiplyand √ accumulate(VSUMMBM) = [d4i ,d4i+1,d4i+2 ,d4i+3 ] 4 + [c4i ,c4i+1,c4i+2 ,c4i+3 ] ƒ a4i+ j *b4i+ j j=1 VMSUMxxxinstructionsofAltiVec(generalform) √ = + + [d2i ,d2i+1] a2i *b2i a2i+1 *b2i+1 [c2i ,c2i+1] 4.PACKEDSHIFTANDROTATEOPERATIONS 6Shiftamountsarelimitedto0,7,15or16bits. 7Foruseinmultiplicationofapackedregisterbyanintegerconstant. 8Foruseinmultiplicationofapackedregisterbyafractionalconstant. 19 MostmicroprocessorshaveoneormoreshiftersinadditiontooneormoreALUs(seeFigure1.3).Justasthe ALU is partitionable, so is the shifter, for subword-parallel operation. A packed shift instruction performs blocking shifts of the subwords packed in a register. Any bits shifted to the left are blocked from affecting the adjacentsubwordontheleft;anybitsshiftedtotherightareblockedfromaffectingtheadjacentsubwordonthe right. Forthepackedshiftinstruction,theshiftcanbelogical(zerossubstitutedforvacatedbits)orarithmetic (zerossubstitutedforvacatedbitsontherightandsign-bitreplicatedforvacatedbitsontheleft).Theshiftamount canbegivenbyanimmediateoperandorbyaregisteroperand.Whentheshiftamountisgivenbyaregister,each subwordisusuallyshiftedbythesameamount,givenbytheleastsignificantlog2nbitsofasecondsourceregister, forshiftingthenbitsofafirstsourceregister(seeFigure4.1).Inamorecomplicated,butmoreversatileform,each subwordinapackedregistercanbeshiftedbyadifferentamount(seeFigure4.2). Similarly,thepackedrotateinstructionperformsrotationsoneachsubwordinparallel.Theamounttobe rotatedcanbespecifiedbyanimmediateinaninstruction,byasinglerotateamountinaregister,orbydifferent rotateamountsforeachsubword(seeFigure4.3).Data-dependentrotations,wherethesinglerotateamountisgiven inaregister,havebeenproposedforsymmetriccryptographyalgorithmslikeRC5. Packedshiftinstructionsmayalsobeusedtomultiplyordividesubwordsbyaconstantthatisapowerof two. When used in this way, it may be necessary to apply saturation arithmetic with parallel left shifts used for multiplication.Itmayalsobedesirabletoapplyroundingwithparallelarithmeticrightshifts.Suchsaturationand rounding complicate the circuitry for the shifter functional unit, and is not implemented by any of the current multimediaISAs.Hence,packedshiftinstructionsshouldbeusedformultiplicationordivisiononlywhenno overflowcanoccuronleftshifts,andsufficientprecisioncanbepreservedonrightshifts.Formultiplicationbyan integerorfractionalconstant,packedshiftandaddinstructions,describedinSection3.2,arepreferable. Thesecanbettercontrolaccuracyinthemultiplication.

Ra:

Rb:

shiftamount

shiftunit

Shiftscanbeleftorright, shiftunit logicalorarithmetic shiftunit

shiftunit

R : c Figure4.1Packedshiftinstruction.Shiftamountisgiveninthesecondoperand.Eachsubwordisshifted bythesameamount.

Ra:

Rb:

shift shift shift shift amount amount amount amount

shiftunit

shiftunit Shiftscanbeleftorright, logicalorarithmetic shiftunit

shiftunit

R : c Figure4.2Packedshiftinstruction.Shiftamountisgiveninthesecondoperand.Eachsubwordcanbe shiftedbyadifferentamount.

20

Ra:

Rb:

rotate rotate rotate rotate amount amount amount amount

rotateunit

rotateunit Rotatescanbe

leftorright rotateunit

rotateunit

R : c Figure4.3Packedrotateinstruction.Rotateamountisgiveninthesecondoperand.Eachsubwordcanbe rotatedbyadifferentamount. Table 4.1 summarizes the multimedia instructions involving packed shift and packed rotate operations.Inthetable,nisusedtorepresentashiftorrotateamountthatisspecifiedintheimmediatefieldofan = << instruction.Forexample,intheoperationdenotedas ci ai n ,eachsubwordofcisshiftedtotheleftbythe = << amountgivenintheimmediatefieldofthecorrespondinginstruction.Similarly,intheoperation ci ai b ,each = << subwordofcisshiftedtotheleftbytheamountspecifiedinthesourceregisterb.In ci ai bi ,eachsubwordof cisshiftedtotheleftbytheamountspecifiedinthecorrespondingsubwordofthesourceregisterb.Shiftleftis representedby<<,shiftrightby>>androtateby<<<. TABLE4.1Summaryofpackedshiftandpackedrotateinstructions. IntegerOperations IA-64 MAX-2 MMX SSE-2 3Dnow! AltiVec = << √ √ √ ci ai n = << √ √ ci ai b = << √ ci ai bi = >> √ √ √ ci ai n = >> √ √ ci ai b = >> √ ci ai bi = <<< ci ai n = <<< ci ai b = <<< √ ci ai bi 5.SUBWORDPERMUTATIONINSTRUCTIONS In the first generation multimedia instruction sets, the rearrangement of subwords in registers manifested as packing and unpacking operations. MAX-2 first introduced general-purpose subword permutation instructions for moreversatilere-orderingofsubwordspackedintooneormoreregisters. 5.1PackInstructions Pack instructions convert from larger subwords to smaller subwords. If the value in the larger subword is greaterthanthemaximumvaluethatcanberepresentedbythesmallersubword,saturationarithmeticisperformed, andtheresultingsubwordissettothemaximumvalueofthesmallersubword.Figure5.1showshowaregisterwith smallerpackedsubwordscanbecreatedfromtworegisterswithsubwordsthataretwiceaslarge.Packinstructions differinthesizeofthesupportedsubwordsandinthesaturationoptionsused. 5.2UnpackInstructions

21 Unpack instructions are used to convert smaller packed data types to larger ones. The subwords in the two sourceoperandsaresplitandwrittentothetargetregisterinalternatingorder.Sinceonlyonehalfofeachofthe sourceregisterscanbeused,theunpackinstructionscomewithtwovariants:unpackhighorunpacklow. Thehigh/lowunpackinstructionsselectandunpackthehighorlowordersubwordsofthesourceoperands.

Ra:

Rb:

Rc: Figure5.1Packinstructionconvertslargersubwordstosmallerones.

Ra:

Rb:

Rc: Figure5.2Unpackhighinstruction.

Ra:

Rb:

Rc: Figure5.3Unpacklowinstruction. 5.3SubwordPermutationInstructions Ideally,itisdesirabletobeabletoperformallpossiblepermutationsonpackeddata.Thisisonlypossiblefor smallnumbersofsubwords.Whenthenumberofsubwordsincreases,thenumberofcontrolbitsrequiredtospecify arbitrarypermutationsbecomestoolargetobeencodedinaninstruction.Forthecaseofnsubwords,thenumberof controlbitsusedtospecifyaparticularpermutationofthesensubwordsisn*log2(n).Table5.1showshowmany controlbitsarerequiredtospecifyanyarbitrarypermutationfordifferentnumberofsubwords.Whenthenumberof subwords is 16 or greater, the number of control bitsexceedsthenumberofthebitsavailableintheinstruction, whichistypically32bits.Therefore,itbecomesnecessarytouseasecondregister9tocontainthecontrolbitsused tospecifythepermutation.Byusingthissecondregister,itispossibletogetanyarbitrarypermutationofupto16 subwordsinoneinstruction. TABLE5.1Numberofcontrolbitsrequiredtospecifyanarbitrarypermutation. Numberofsubwords Numberofcontrolbitsrequiredtospecifyanarbitrary inapackeddatatype permutationforagivennumberofsubwords 9Thissecondregisterneedstobeatleast64-bitswidetofullyaccommodatethe64controlbitsneededfor16subwords. 22 2 2 4 8 8 24 16 64 32 160 64 384 128 896 SinceAltiVecinstructionscanhavethree128-bitsourceregisters,asubwordpermutationcanusetworegisters toholddata,andthethirdregistertoholdthecontrolbits.Thisallowsanyarbitraryselectionandre-orderingof16 ofthe32bytesinthetwosourceregistersinavperminstruction. Since only a small subset of all the possible permutations is achievable with one subword permutation instruction, it is desirable to select permutations that can be used as primitives to realize other permutations. A permuteinstructionscanhaveoneortwosourceregistersasoperands.Inthelattercase,onlyhalfofthesubwords inthetwosourceoperandsmayactuallyappearinthetargetregister.Examplesofthesetwocasesarethemuxand mixinstructionsrespectively,inbothIA-64andMAX-2. Mux in IA-64 operates on one source register. It allows all possible permutations of four packed 16-bit subwords,withandwithoutrepetitions(seeFigure5.4).An8-bitimmediatefieldisusedtoselectoneofthe256 possiblepermutations.ThisisthesameoperationperformedbythepermuteinstructionintheearlierMAX-2. InIA-64,themuxinstructioncanalsopermuteeightpacked8-bitsubwords.Forthe8-bitsubwords,muxhas fivevariants,andonlythefollowingpermutationsareimplementedinhardware(seeFigure5.5): • Mux.rev(reverse):Reversestheorderofbytes. • Mux.mix(mix):Mixesthebytesintheupperandlower32-bithalvesofthe64-bitsourceregister. • Mux.shuf(shuffle):Performsaperfectshuffleonthebytesintheupperandlowerhalvesoftheregister. • Mux.alt (alternate): Selects first the even10 indexed bytes, placing them in the upper half of the result register,thenselectstheoddindexedbytes,placingthemintherighthalfoftheresultregister. • Mux.brcst(broadcast):Replicatestheleastsignificantbyteintoallthebytelocationsoftheresultregister.

Ra:

Anypermutationof the subwords

Rb: Figure5.4Arbitrarypermutationonaregisterwithfoursubwords.

10Thebytesareindexedfrom0to7.0correspondstothemostsignificantbyte,whichisontheleftendoftheregisters. 23 Ra: Ra:

Rb: Rb: (a)rev (b)mix

Ra: Ra:

Rb: Rb: (c) shuf (d)alt

Ra:

Rb: (e) brcst Figure5.5MuxinstructionofIA-64hasfivepermutationoptionsfor8-bitsubwords. Mix is a very useful permutation operation on two source registers. A mix instruction picks alternating subwordsfromtwosourceregistersandplacesthemintothetargetregister.Sincemixusestwosourceregisters,it appearsintwovariants.Thefirstvariant(Figure5.6)iscalledmixleftandusesthelefthalvesofthesource registers in the permutation. Similarly, the mix right variant (Figure 5.7) uses the right halves of the source registers.

Ra:

Rb:

Rc: Figure5.6Mixleftinstruction.

Ra:

Rb:

Rc: Figure5.7Mixrightinstruction. Extract/DepositInstructions

24 Amoresophisticatedshiftercanalsoperformextractanddepositbit-fieldoperations[13].Anextract instructionpicksanarbitrarycontiguousbit-fieldfromthesourceoperandandplacesitrightalignedintotheresult register. Extract instructions may be limited to work on subwords instead of bit-fields. Extractinstructions cleartheupperbitsofthetargetregister.Figures5.8and5.9showsomepossibleextractinstructions.

Ra:

Rb: Figure5.8Extractbit-fieldinstruction.

Ra:

Rb: Figure5.9Extractsubwordinstruction. Adepositinstructionpicksaright-alignedcontiguousbit-fieldfromthesourceregisterandpatchesitintoan arbitrarylocationinthetargetregister.Theunpatchedbitsofthetargetregisterremainunchanged.Alternatively, theyareclearedtozeroinazeroanddepositinstruction[13].Depositinstructionsmaybelimitedtowork on subwords instead of arbitrarily long bit-fieldsandarbitrarypatchlocations.Figures5.10and5.11showsome possibledepositinstructions. A very useful instruction that can be included in this section is the shift pair instruction in IA-64 (see Figure 5.12). This instruction, which was first introduced in the PA-RISC ISA [13], is essentially a shift instructionforbit-stringsthatspanmorethanoneregister.Shiftpairconcatenatesthetwosourceregistersto forma128-bitintermediatevalue,whichisshiftedtotherightbynbits.Theleastsignificant64-bitsoftheshifted value is written to the result register. If the same register is specified for both operands, the result is a rotate operation.Sincerotatescanberealizedthisway,IA-64doesnothaveaseparaterotateinstruction.Thisshift pair instruction is more general than a rotate, allowing flexible combination of two bit-fields from separate registers. Table5.2belowsummarizesthesubwordpermutationinstructionsonpackeddatatypes.

Ra:

R : b Figure5.10Depositbit-fieldinstruction.

Ra:

R : b Figure5.11Depositsubwordinstruction.

25 Ra: :Rb

:Rc Figure5.12ShiftpairinstructioninIA-64. TABLE5.2Subwordpermutationinstructions. IntegerOperations IA-64 MAX-2 MMX SSE-2 3Dnow! AltiVec Pack √ √ √ Unpacklow √ √ √ √ Unpackhigh √ √ √ Permutensubwords √(n=4) √(n=4) √(n=4) √(n=4) √(n=16,32)11 Mux.rev √ Mux.mix √ Mux.shuffle √ Mux.alt √ Mux.brcst √ Mixleft √ √ √ Mixright √ √ √ Extractbit-field √ √ Extractsubword √ Depositbit-field √ √ Depositsubword √ ShiftpairRc,Ra,Rb √ √ 6.FLOATINGPOINTMICROSIMDINSTRUCTIONS High-fidelity audio and graphics geometry processing require the higher precision and range of floating-point numbers.Usually,single-precision(32-bit)floating-point(FP)numbersaresufficient,but16-bitintegersorfixed- pointnumbersarenot.Double-precision(64-bit)floating-pointnumbersarenotreallyneededforsuchmultimedia computations. Sincefloating-pointregistersareatleast64-bitswideinmicroprocessorstosupportdouble-precision(DP)FP numbers, it is possible to pack two single-precision (SP) FP numbers in a 64-bit register, to support subword parallelism,orpackedparallelism,ormicroSIMDparallelismontheFPfunctionalunitsandregisters.Theprecision levelssupportedbydifferentISAsareshowninTable6.1.SPandDPnumbersare32and64bitslongrespectively, asdefinedbytheIEEE-754FPnumberstandard.OnlySSE-2supportspackeddouble-precisionFPnumbers. TABLE6.1SupportedprecisionlevelsforthepackedFPoperations. Architecture IA-64 SSE-2 3DNow! AltiVec FPregistersize 82bits 128bits 128bits 128bits AllowedpackedFPdatatypes 2SP 4SPor2DP 4SP 4SP 6.1 PackedFloatingPointArithmeticInstructions PackedFPAdd:Figure6.1showsapackedFPadd,wherefourpairsofsingle-precisionFPnumbersintwo 128-bitregistersareaddedusingfloating-pointaddition.PackedFPsubtractinstructionsaresimilar.While thepackedFPinstructionlooksverysimilartothepackedintegerequivalents(seeFigure2.2),implementationof packedFPaddisnotassimpleasblockingcarriesatthesubwordboundaryasinpackedintegeraddition(see Figure2.4).ItismuchmoredifficulttopartitionaFPfunctionalunitforsubwordparallelismbecauseofthenature of FP arithmetic acting on FP numbers represented in sign, mantissa and exponent format. Another difference is

11Thisisthevperminstructionandithassomelimitationsforn=32.Seetextformoredetailsonthisinstruction. 26 that,infloating-pointnumberrepresentation,considerationslikemodulararithmeticorsaturationarithmeticarenot applicable.

Ra:

Rb:

R : c Figure6.1PFPADDRc,Ra,Rb:PackedFPaddinstruction. PackedFPMultiplication.MultiplicationoftwopackedFPregistersinvolvesmultiplicationofcorrespondingFP subwords from the source registers, where the products are written to the corresponding subword in the target register(seeFigure6.2).Inmultiplicationoftwosingle-precisionnumbers,theproductisalsosingle-precision,and hencethesamewidth.Therefore,packedFPmultiplydoesnothavetheproblemsassociatedwithpacked integermultiplyinstructions,wheretheproductistwicethewidthoftheoperands.

Ra: SP SP SP SP

Rb: SP SP SP SP

R : SP SP SP SP c Figure6.2PFPMULRc,Ra,Rb:PackedFPmultiplyinstruction. PackedFPMultiplyandAdd.ThemostimportantFPoperationinaudio,graphicsanddigitalsignalprocessingis the FP multiply and accumulate operation. Recognizing this, many ISAs have implemented thisasthe basicFPoperation,needingthreesourceregisters.Forexample,IA-64implementspackedFPmultiplyand add(FPMA),packedFPmultiplyandsubtract(FPMS),andpackedFPnegativemultiply andadd(FPNMA).ItthenrealizespackedFPadd,packedFPsubtractandpackedFPmultiply operationsbyusingFPMAandFPMSinstructions.IA-64architecturespecifies128FPregisters,whicharenumbered FR0throughFR127.Oftheseregisters,FR0andFR1arespecial.FR0alwaysreturnsthevalue+0.0whensourced asanoperand,andFR1alwaysreads+1.0.WhenFR0orFR1areusedassourceoperands,theFPMAandFPMS instructionscanbeusedtorealizepackedFPaddorpackedFPsubtractoperationsandpackedFP multiplyoperations(seeTable6.2). TheformatoftheFPMA(Figure6.3)instructionisFPMARd,Ra,Rb,RcandtheoperationitperformsisRd=Ra * Rb + Rc. If FR1 is used as the first or the second source operand, a packed FP add operation is realized. Similarly,aFPMSinstructioncanbeusedtorealizeapackedFPsubtractoperation.UsingFR0asthethird sourceoperandinFPMAorFPMSresultsinapackedFPmultiplyoperation.

27 Ra: Sourceregisters twoSPFPsubwords Rb:

Rc:

R : d Figure6.3PackedFPmultiplyandaddinstructioninIA-64. TABLE6.2IA-64usesFPMAandFPMSinstructionsforpackedFPadd,packedFPsubtractand packedFPmultiply. IA-64instruction Operation EquivalentInstruction FPMARd,FR1,Rb,Rc Rd=FR1*Rb+Rc PackedFPadd (packedFPmultiplyandadd) =1.0*Rb+Rc =Rb+Rc FPMSRd,FR1,Rb,Rc Rd=FR1*Rb-Rc PackedFPsubtract (packedFPmultiplyandsubtract) =1.0*Rb-Rc =Rb-Rc FPMARd,Ra,Rb,FR0 Rd=Ra*Rb+FR0 PackedFPmultiply (packedFPmultiplyandadd) =Ra*Rb+0.0 =Ra*Rb TABLE6.3SummaryofFPmicroSIMDinstructions. PackedFPInstructions IA-64 SSE-2 3DNow! AltiVec = + √12 √ √ √ ci ai bi = − √13 √ √ √ ci ai bi = √14 √ √ ci ai *bi = − √ di ai *bi = + √ √ di ai *bi ci (FPMA) = − √ di ai *bi ci (FPMS) = − + √ √ di ai *bi ci (FPNMA) = − √ ci ai = √ ci ai = − √ ci ai = √ √ √ √ ci compare(ai ,bi ) = √ √ √ √ ci max(ai ,bi ) = √ √ √ √ ci min(ai ,bi ) = √ ci max( ai , bi )

12ThisoperationisrealizedbyusingthepackedFPmultiplyandaddinstruction. 13ThisoperationisrealizedbyusingthepackedFPmultiplyandsubtractinstruction. 14 This operation is realized by using the packed FP multiply and add or packed FP multiply and subtract instruction. 28 = √ ci min( ai , bi ) = 15 √ ci VCMPBFP(ai ,bi ) = √ ci ai = √ √ √ ci 1 ai = √ √ √ ci 1 ai = √ ci log2 ai

= ai √ ci 2 PermutenFPsubwords √(n=2,4) SwapFPsubwords √ (optionallynegateleftorrightsubword) Mix_Left,Mix_Right, √ Mix_Left_Right Unpack_high,Unpack_low √ Pack √ √ Table 6.3 is a summary of the packed FP instructions supported by multimedia ISAs. Several packed FP instructionsoperateidenticallytotheirpackedintegerequivalents,exceptthattheyoperateonpackedFPsubwords rather than packed integer (or fixed-point) subwords. These include packed FP negate, packed FP absolute value, packed FP negated absolute value, packed FP compare, packed FPmaximumandpackedFPminimum.IA-64alsohasthepackedFPmaximumabsolutevalue andthepackedFPminimumabsolutevalue.Theseputthelargerorsmalleroftheabsolutevaluesofthe pairsofFPsubwordsintotheresultsubwordsinthetargetregister,respectively. PackedFPCompare.ThepackedFPcompareinstructioncomparespairsofFPsubwordsaccordingtothe relationspecifiedbytheinstruction.Iftheconditionistrueforasubwordpair,thecorrespondingfieldinthetarget registeriswrittenwitha1-mask.Iftheconditionisfalse,thecorrespondingfieldinthetargetregisteriswrittenwith a0-mask.Theonlydifferenceisthattwoadditionalrelations,orderedandunordered,arepossibleforfloating-point numbersinadditiontothe10relationsalreadyspecifiedforcomparingintegers(seesection2.7).SomeISAshave packed FP compare instructions that allow all the 12 possible relations16, whereas others support a more limitedsubsetofrelations. Packed FP Compare Bounds: An interesting comparison instruction is the packed FP compare bounds (VCMPBFP ) instruction of AltiVec. This instruction compares corresponding FP subwords from the two source registers, and depending on the relation between the compared numbers, it generates a two-bit result, which is written to the target register. The resulting two-bit field indicates the relation between the two compared FP numbers. For instance, in VCMPBFP Rc,Ra,Rb, the FP number pairs (ai,bi) are compared, and a two-bit field is writtenintocisuchthat: • Bit0ofthetwo-bitfieldisclearedifai<=bi,andissetotherwise, • Bit1ofthetwo-bitfieldisclearedifai>=(-bi),andissetotherwise. • BothbitsaresetifanyofthecomparedFPnumbersisaNaN. Thetwo-bitresultfieldiswrittentothehigh-ordertwobitsofci;theremainingbitsofciareclearedto0.Table 6.4givesexamplesofinputpairsthatresultineachofthefourdifferentpossibleoutputsforthisinstruction. Table6.4ResultoftheVCMPBFPinstructionfordifferentinputpairs. Input Output ai bi Bit0 Bit1 3.0 5.0 0 0

15ThisisthepackedFPcompareboundsinstruction,explainedinthetext. 16Twofloating-pointnumbersaandbcanbecomparedforoneofthefollowing12possiblerelations:equal,less-than,less-than-or-equal, greater-than, greater-than-or-equal, unordered, not-equal, not-less-than, not-less-than-or-equal, not-greater-than, not-greater-than-or-equal, ordered.Typicalnotationfortheserelationsareasfollowsrespectively:==,<,<=,>,>=,?,!=,!<,!<=,!>,!>=,!?. 29 -8.0 5.0 0 1 8.0 5.0 1 0 3.0 -5.0 1 1 TheSSE-2architecturealsoincludesapackedFPsquarerootinstruction.Thisinstructionoperateson packedsingle-precisionordouble-precisionnumbersandcomputesthesquarerootstoSPorDPaccuracy.IA-64 has the packed FP reciprocal square root instruction and the packed FP reciprocal instruction.Bothareveryusefulforgraphicscomputations. 6.2SubwordPermutationInstructions FPPermutationInstructions.SSE-2hasanFPpermute(seeFigure6.4)instructionthatallowsanyarbitrary permutation of the four 32-bit SPsubwordsinoneofits128-bitmultimediaregisters.Thisoperatesjustlikethe permuteinstructioninMAX-2andthemuxinstruction(2-bytesubwordversion)inIA-64(seefigure5.4).

Ra:

Anypermutationof the subwords

Rb: Figure6.4FPpermuteRb,Ra:FPpermuteinstruction. Since IA-64 only has two single-precision subwords in its packed format, all possible permutations of two subwords can be achieved with a much simpler operation, FP swap. This instruction just exchanges the two subwords.IA-64alsoallowstwovariantsofthis:afterswappingthesubwords,thesignofeithertheleftortheright FPvalueisnegated. FPmixisoneusefuloperationthatperformsapermutationontwopackedFPregisters.AFPmixinstruction picks alternating subwords from two source registers and places them into the target register. FP mix in IA-64 appears in three variants. The first one (Figure 6.5) is called the FP mix left andusestheoddindexedFP subwords of the source registers in the permutation, starting from the leftmost subword. The second variant, FP mixright(Figure6.6)usestheevenindexedFPsubwordsofthesourceregisters,endingwiththerightmost subword. The third variant, FP mix left right (Figure 6.7) uses the odd indexed FP subword of the first sourceregister,andtheevenindexedsubwordofthesecondsourceregister.

Ra:

Rb:

Rc: Figure6.5FPmixleftRc,Rb,Ra:FPmixleftinstructioninIA-64.

30 Ra:

Rb:

Rc: Figure6.6FPmixrightRc,Rb,Ra:FPmixrightinstructioninIA-64.

Ra:

Rb:

Rc: Figure6.7FPmixleftrightRc,Rb,Ra:FPmixleftrightinstructioninIA-64. FP Unpack: Packing and unpacking subwordshasadifferentinterpretationforFPnumbersthanforintegers.In general,thereissufficientprecisioninsingle-precisionnumbers,andthereisnoneedtounpackittoadouble- precisionnumber.However,theFPunpackcanberegardedasausefulsubwordpermutationinstructionlikeFP mix. It performs a shuffle by interleaving the subwords from two registers. The FP unpack instructions operatejustliketheequivalentintegerunpackinstructions(seeFigures5.2and5.3).Theycomeintwoflavors: FPunpackhighandFPunpacklow.WenotetheSSE-2employsFPunpack,afterunpackinMMX, andIA-64employsFPmix,aftermixinMAX-2. FPPack:Intheintegerdomain,packinstructionsareusedtocreatesmallerpackeddatatypesfromlargerdata types.TheFPpackinstructioninIA-64createstwopackedSPnumbersfromtwo82-bitsourceregisters.AllIA- 64FPregistersare82-bitextendedprecisionFPformatwithtwoextraguardbitsforcomputationalaccuracy.First, the two 82-bit numbers are converted to standard 32-bit SP representation. These two SP numbers are then concatenatedandtheresultisstoredinthesignificandfield(whichis64bits)ofthe82-bittargetFPregister.The exponentfieldofthetargetregisterissettothebiasedexponentfor2.063,whichindicatesapackedFPformat,and thesignbitissettozero,indicatingapositivenumber. 7. CONCLUSIONS We have described multimedia instructions for programmable processors by broad classes according to the functionalunitsused.Packedaddandpackedsubtractinstructions,anddifferentvariantsofthese,usethe ALU, packed multiply instructions use the multiplier functional unit, and packed shift and packed rotate instructions use the shifter. Packed subword permutation instructions can either be implemented on a modifiedshifterorinanewpermutationunit.Foreachoftheseinstructionclasses,wecomparedthemultimedia instructionsetsintroducedincurrentmicroprocessors,forexample,theIA-64[4],MMX[5,6],andSSE-2[7]from Intel,MAX-2[8,9]fromHewlett-Packard,3DNow![10,11]fromAMD,andAltiVec[12]fromMotorola. The key feature in these multimedia instructions is the concept of subword parallelism, also called packed parallelism or microSIMD parallelism. This is implemented for packed integers or fixed-point numbers in the integerdatapaths,andforpackedfloating-pointnumbersinthefloating-pointdatapaths.Visualmultimediadatalike images, video, graphics rendering and animation involve pixel processing, which can fully exploit subword parallelism on the integer datapath. Higher-fidelity audio processing and graphics geometry processing require single-precision floating-point computations, which exploit subword parallelism on the floating-point datapath. TypicalDSPoperationslikemultiplyandaccumulatehavealsobeenaddedtothemultimediarepertoireof general-purpose microprocessors. These multimedia instructions have embedded DSP and visual processing capabilities into general-purpose microprocessors, providing native signal processing (sometimes referred to as

31 NSP)formultimediadata.Infact,mostDSPsandmediaprocessorshavealsoadoptedsubwordparallelismintheir architectures, as well as subword permutation instructions and other features often first introduced in microprocessorsformultimediasignalprocessing. Some of the multimedia ISAs introduced in microprocessors adhere to the “less is more” minimalist architecture approach, defining as few instructions as necessary for high-performance, with each instruction executableinasinglepipelinecycle.Othersembodythe“moreisbetter”approach,wherecomplexsequencesof operations are represented by a single multimedia instruction, with such an instruction taking many cycles for execution.AnexampleisthepackedvectormultiplyandaccumulateinstructioninAltiVec(Figure 3.8). These two trendsrepresentdifferentstylisticpreferences,akintoRISC(ReducedInstructionSetComputer) andCISC(ComplexInstructionSetComputer)architecturalpreferences.Infact,sometimes,RISC-likemultimedia instructions have been added to CISC processor ISAs, and CISC-like multimedia instructions to RISC processor ISAs.Theremarkablefactisthatsubword-parallelmultimediainstructionshaveachievedsuchrapidandpervasive adoptioninbothRISCandCISCmicroprocessors,DSPsandmediaprocessors,attestingtotheirundisputedcost- effectivenessinacceleratingmultimediaprocessinginsoftware. Tosimplifysoftwarecompatibilityandinteroperabilityofmultimediasoftwareacrossdifferentprocessors,itis highlydesirabletorefinethebestideasfromthedifferentmultimediaISAsintoacoherentsetofsubword-parallel instructions.Ifthisisasmallyetpowerfulset,itismorelikelytobeimplementedinallfuturemicroprocessorsand mediaprocessors,allowingalgorithmandcompileroptimizationstoexploitmicroSIMDparallelismwithconfidence that benefits would be realized across almost all processors. While slight differences in multimedia instructions acrossprocessorsmaynotaffectthepotentialperformanceprovidedbyeachISA,theymakeitdifficulttodesignan optimal algorithm and a set of compiler optimizations that achieve the best multimedia performance for each processor.ThechallengeforthenextphaseofmultimediaISAdesignistounderstandwhichISAfeaturesaretruly effective for multimedia signal processing and encapsulate these insights into the design of third-generation multimediaISAforbothmicroprocessorsandmediaprocessors. Acknowledgments: I would like to thank my student, A. Murat Fiskiran, for surveying SSE-2, 3DNow! and AltiVec,andforhisinvaluablehelpinpreparingthefiguresandtables. 8. REFERENCES 1. R.B.LeeandM.Smith,“MediaProcessing:ANewDesignTarget,”IEEEMicro,Vol.16,No.4,August1996, pp.6-9. 2. M.J.Flynn,“VeryHigh-SpeedComputingSystems”,ProceedingsoftheIEEE,No.54,December1966. 3. R.B. Lee, “Efficiency of MicroSIMD Architectures and Index-Mapped Data for Media Processors,” ProceedingsofIS&T/SPIESymposiumonElectricImaging:MediaProcessors99,January25-29,1999,San Jose,California,pp.34-46. 4. Intel,“IA-64ArchitectureSoftwareDeveloper’sManual,Volume3:InstructionSetReference,”Revision1.1, July2000,OrderCode245319-002. 5. Intel, “Intel Architecture Software Developer’s Manual, Volume 2: Instruction Set Reference,” 1999, Order Code243191. 6. A.PelegandU.Weiser,“MMXTechnologyExtensiontotheIntelArchitecture,”IEEEMicro,Vol.16,No.4, August1996,pp.10-20. 7. Intel, “IA-32 Intel Architecture Software Developer’s Manual With Preliminary Willamette Architecture Information,Volume2:InstructionSetReference,”2000. 8. G.Kane,“PA-RISC2.0Architecture,”1996,PrenticeHall,ISBN0-13-182734-0. 9. R.B.Lee,“SubwordParallelismwithMAX-2,”IEEEMicro,Vol.16,No.4,August1996,pp.51-59. 10. AMD,“3DNow!TechnologyManual,”March2000,OrderCode21928G/0. 11. AMD, “AMD Extensions to the 3DNow! and MMX Instruction Sets Manual,” March 2000, Order Code 22466D/0. 12. Motorola, “AltiVec Technology Programming Environments Manual,” Revision 0.1, November 1998, Order CodeALTIVECPEM/D. 13. R.B.Lee,“PrecisionArchitecture,”IEEEComputer,Vol.22No.1,Jan.1989,pp.78-91. 14. R.B.Lee,“MultimediaExtensionsforGeneral-PurposeProcessors,”ProceedingsofIEEESIPS97,November 1997,pp.9-23. 15. R.B.Lee,“AcceleratingMultimediawithenhancedMicroprocessors,”IEEEMicro,Vol.15,No.2,April1995, pp.22-32.

32 16. V. Bhaskaran, K. Konstantinides, R.B. Lee and J.P. Beck, “Algorithmic and ArchitecturalEnhancementsfor Real-TimeMPEG-1DecodingonaGeneralPurposeRISCWorkstation,”IEEETransactionsonCircuitsand SystemsforVideoTechnology,Vol.5,No.5,October1995,pp.380-386. 17. R.B.Lee,J.P.Beck,J.LambandK.E.Severson,“Real-TimeSoftwareMPEGVideoDecoderonMultimedia- EnhancedPA7100LCProcessors,”Hewlett-PackardJournal,April1995,pp.60-68. 18. M. Tremblay, J.M. O’Connor, V. Narayanan, H. Liang, “VIS Speeds New Media Processing,” IEEE Micro, Vol.16,No.4,August1996,pp.10-20. 19. Z. Luo and R.B. Lee, “Cost-Effective Multiplication with Enhanced Adders for Multimedia Applications,” ProceedingsofISCAS2000,IEEEInternationalSymposiumonCircuitsandSystems,Vol.I,May28-31,2000, Geneva,Switzerland,pp.651-654.

33