Instruction Set Architecture for Multimedia Signal Processing

InstructionSetArchitectureforMultimediaSignalProcessing RubyLee PrincetonUniversity 1. INTRODUCTION Multimediasignalprocessing,ormediaprocessing[1],istheprocessingofdigitalmultimediainformationina programmable processor. Digital multimedia information includes visual information like images, video, graphics andanimation,audioinformationlikevoiceandmusic,andtextualinformationlikekeyboardtextandhandwriting. With general-purpose computers processing more multimedia information, multimedia instructions for efficient media processing have been defined for the Instruction Set Architectures (ISAs) of microprocessors. Meanwhile, digitalprocessingofvideoandaudiodatainconsumerproductshasalsoresultedinmoresophisticatedmultimedia processors.Traditionaldigitalsignalprocessors(DSPs)inmusicplayersandrecordersandmobiletelephonesare becoming increasingly sophisticated as they process multiple forms of multimedia data, rather than just audio signals.Videoprocessorsfortelevisionsandvideorecordershavebecomemoreversatileastheyhavetotakeinto accounthigh-fidelityaudioprocessingandreal-timethree-dimensional(3-D)graphicsanimations.Thishasledto thedesignofmoreversatilemediaprocessors,whichcombinethecapabilitiesofDSPsforefficientaudioandsignal processing,videoprocessorsforefficientvideoprocessing,graphicsprocessorsforefficient2-Dand3-Dgraphics processing, and general-purpose processors for efficient and flexible programming. The functions performed by microprocessors and media processors may eventually converge. In this chapter, we describe some of the key innovationsinmultimediainstructionsaddedtomicroprocessorISAs,whichhaveallowedhigh-fidelitymultimedia tobeprocessedinreal-timeonubiquitousdesktopandnotebookcomputers.Manyofthesefeatureshavealsobeen adoptedinmodernmediaprocessorsanddigitalsignalprocessors. 1.1SubwordParallelism Workloadcharacterizationstudiesonmultimediaapplicationsshowthatmediaapplicationshavehugeamounts of data parallelism and operate on lower-precision data types. A pixel-oriented application, for example, rarely needstoprocessdatathatiswiderthan16bits.Thistranslatesintolowcomputationalefficiencyongeneral-purpose processors where the registeranddatapathsizesaretypically32or64bits,calledthewidthofaword.Efficient processing of low-precision data types in parallel becomes a basic requirement for improved multimedia performance.Thisisachievedbypartitioningawordintomultiplesubwords,eachsubwordrepresentingalower- precision datum. A packed data type will be defined as data that consists of multiple subwords packed together. These subwords can be processed in parallel using a single instruction, called a subword-parallel instruction, a packedinstruction,oramicroSIMDinstruction.SIMDstandsfor“SingleInstructionMultipleData”,atermcoined by Flynn [2] for describing very large parallel machines with many data processors, where the same instruction issuedfromasinglecontrolprocessoroperatesinparallelondataelementsintheparalleldataprocessors.Lee[3] coined the term microSIMD architecture to describe an ISA where a single instruction operates in parallel on multiplesubwordswithinasingleprocessor. Figure1.1showsa32-bitintegerregisterthatismadeupoffour8-bitsubwords.Thesubwordsintheregister canbepixelvaluesfromagrayscaleimage.Inthiscase,theregisterisholdingfourpixelswithvalues0xFF,0x0F, 0xF0 and 0x00. The same 32-bit register can also be interpreted as two 16-bit subwords, in which case, these subwordswouldbe0xFF0Fand0xF000.Thesubwordboundariesdonotcorrespondtoaphysicalboundaryinthe registerfile;theyaremerelyhowthebitsinthewordareinterpretedbytheprogram.Ifwehave64-bitregisters,the mostusefulsubwordsizeswillbe8-bits,16-bitsor32-bitwords.Asingleregistercanthenaccommodate8,4or2 ofthesedifferentsizedsubwordsrespectively. Ra: 11111111 00001111 11110000 00000000 Figure1.132-bitintegerregistermadeupoffour8-bitsubwords. To exploit subword parallelism, packed parallelism or microSIMD parallelism in a typical word-oriented microprocessor, new subword-parallel or packed instructions are added. (We use the terms “subword-parallel”, “packed” and “microSIMD” interchangeably to describe operations, instructions and architectures.) The parallel processing of the packed data types typically requires only minor modifications to the word-oriented functional units, with the register file and the pipeline structure remaining unchanged. This results in very significant performanceimprovementsformultimediaprocessing,ataverylowcost. 1 GeneralRegisterFile Partitionable ALU Figure1.2MicroSIMDparallelismusespackeddatatypesandapartitionableALU. Typically,packedarithmeticinstructionssuchaspackedaddandpackedsubtractarefirstintroduced. Tosupportsubwordparallelismefficiently,otherclassesofnewinstructionslikesubwordpermutationinstructions arealsoneeded.Wedescribetypicalsubword-parallelinstructionsintherestofthischapter,pointingoutinteresting arithmeticorarchitecturalfeaturesthathavebeenaddedtosupportthisstyleofmicroSIMDparallelism.Insection 2, we describe packed add and packed subtract instructions, and several variants of these. These instructions can all be implemented on the basic Arithmetic Logical Units (ALUs) found in programmable processors, with minor modifications. We describe such partitionable ALUs in section 2.1. We also describe saturationarithmetic-oneofthemostinterestingoutcomesofsubword-paralleladditions-forefficientlyhandling overflowsandperformingin-lineconditionaloperations.Avariantofpackedadditionisthepackedaverage instruction,whereunbiasedroundingisaninterestingassociatedfeature.Anotherclassofpackedinstructionsthat can use the ALU is the parallel compare instruction where the results are the outcomes of the subword comparisons. In section 3, we describe how packed integer multiplication is handled. We describe different approaches to solving the problem of the products being twice as large asthesubwordoperandsthataremultipliedinparallel. Whilesubword-parallelmultiplicationinstructionsgenerallyrequiretheintroductionofnewintegermultiplication functional units to a microprocessor, we describe the special case of multiplication by constants, which can be achievedveryefficientlywithpackedshiftandaddinstructionsthatcanbeimplementedonanALUwith asmallpreshifter. Insection4,wedescribepackedshiftandpackedrotateinstructions,whichperformasupersetof thefunctionsofatypicalshifterfoundinmicroprocessors,inparallel,onpackedsubwords. Insection5,wedescribeanewclassofinstructions,notpreviouslyfoundinprogrammableprocessorsthatdo not support subword parallelism. These are subword permutation instructions, which rearrange the order of the subwords packed in one or more registers. These permutation instructions can be implemented using a modified shifter,orasaseparatepermutationfunctionunit(seeFigure1.3). n n Register File Permutation ALU Shifter Function Multiplier Unit n Figure1.3Typicaldatapathsandfunctionalunitsinaprogrammableprocessor. To provide examples and illustrations, we refer to the following first and second generation multimedia instructionsinmicroprocessorISAs: • IA-64[4],MMX[5,6],andSSE-2[7]fromIntel, 2 • MAX-2[8,9]fromHewlett-Packard, • 3DNow!1[10,11]fromAMDand • AltiVec[12]fromMotorola. 1.2HistoricalOverview Thefirstgenerationmultimediainstructionsfocusedonsubwordparallelismintheintegerdomain.Thefirstset ofmultimediaextensionstargetedatgeneral-purposemultimediaacceleration,ratherthanjustgraphicsacceleration, wasMAX-1,introducedwiththePA-7100LCprocessorinJanuary1994[13,14]byHewlett-Packard.MAX-1,an acronymfor“MultimediaAccelerationeXtensions”,isaminimalistsetofmultimediainstructionsforthe32-bitPA- RISCprocessor.AnapplicationthatclearlyillustratedthesuperiorperformanceofMAX-1wasMPEG-1videoand audio decoding with software, at real-time rates of 30 frames per second [15,16,17]. For the first time, this performance was made possible using software on a general-purpose processor in a low-end desktop computer. Until then, such high-fidelity, real-time video decompression performance was not achievable without using specializedhardware.MAX-1alsoacceleratedpixelprocessingingraphicsrenderingandimageprocessing. Next,SunintroducedVIS[18],whichwasanextensionfortheUltraSparcprocessors.VISwasamuchlargerset ofmultimediainstructions.Inadditiontopackedarithmeticoperations,VISprovidedveryspecializedinstructions foraccessingvisualdata,storedinpre-determinedwaysinmemory. IntelintroducedMMX[5,6]multimediaextensionsinthedominantPentiummicroprocessorsinJanuary1997, which immediately legitimized the valuable contribution of multimedia instructions for ubiquitous multimedia applications. MAX-2 [8,9], was Hewlett-Packard’s multimedia extension for its 64-bit PA-RISC 2.0 processors. Although designedsimultaneouslywithMAX-1,itwasonlyintroducedin1996,withthePA-RISC2.0[8]architecture.The subwordpermutationinstructionsintroducedwithMAX-2wereusefulonlywiththeincreasedsubwordparallelism in 64-bit registers. Like MAX-1, MAX-2 was also a minimalist set of general-purpose media acceleration primitives. MIPS also described MDMX (although this is not implemented in most MIPS microprocessors), and Alpha describedaverysmallsetofMVImultimediainstructionsforvideocompression. The second generation multimedia instructions initially focused on subword parallelism on the floating-point (FP) side for accelerating graphics geometry computations and high-fidelity audio processing. Both of these multimediaapplicationsusesingle-precisionfloating-pointnumbersforincreasedrangeandaccuracy,ratherthan8- bit or 16-bit integers. These multimedia ISAs include SSE and SSE-2 [7] from Intel and 3DNow!

Instruction Set Architecture for Multimedia Signal Processing

RISC-V Vector Extension Webinar II

SIMD Extensions

Optimizing Packed String Matching on AVX2 Platform

SIMD: Data Parallel Execution J

Optimizing Software Performance Using Vector Instructions Invited Talk at Speed-B Conference, October 19–21, 2016, Utrecht, the Netherlands

Effectiveness of the MAX-2 Multimedia Extensions for PA-RISC 2.0 Processors

Pengju Ren@XJTU 2021

Implications of Programmable General Purpose Processors for Compression/Encryption Applications 1. Introduction

ALEX BENNÉE KVM FORUM 2017 Created: 2017-10-20 Fri 20:46

Idisa+: a Portable Model for High Performance Simd Programming

Msc THESIS Customizing Vector Instruction Set Architectures

Counter Control Flow Divergence in Compiled Query Pipelines