Performance Analysis of Instruction Set Architecture Extensions for Multimedia§

Nathan Slingerland, Apple Computer, Inc. ([email protected])
Alan Jay Smith, University of California at Berkeley ([email protected])

Abstract

Many instruction sets include instructions for accelerating multimedia applications such as DVD playback, speech recognition and 3D graphics. Despite general agreement on the need to support this emerging workload, there are considerable differences between the instruction sets that have been designed to do so. In this paper we study the performance of five instruction sets on kernels extracted from a broad multimedia workload. Each kernel was recoded in the assembly language for each of the five multimedia extensions. We compare the performance of contemporary implementations of each extension against each other as well as to the original compiled C performance. From our analysis we determine how well multimedia workloads map to current instruction sets, noting what was useful and what was not. We also propose two enhancements to current architectures: strided memory operations, and fat subwords.

Keywords: SIMD, subword parallel, multimedia, instruction set, performance, measurement, MMX, SSE, AltiVec, VIS, MVI

1. Introduction

Specialized instructions have been introduced by microprocessor vendors to support the unique computational demands of multimedia applications. The mismatch between wide data paths and the relatively short data types found in multimedia applications has led the industry to embrace SIMD (single instruction, multiple data) style processing. Unlike traditional forms of SIMD computing in which multiple individual processors execute the same instruction, multimedia instructions are executed by a single processor, and pack multiple short data elements into a single wide (64 or 128-bit) register, with all of the sub-elements being operated on in parallel.

The goal of this paper is to quantify how architectural differences between multimedia instruction sets translate into differences in performance. Prior studies have primarily focused on a single instruction set in isolation and have measured the performance of kernels taken from vendor provided libraries [1], [5], [8], [26], [27], [31], [33]. Our contribution is unique in that we do not focus exclusively on a single architecture, and we study the performance of kernels derived from a real, measured, general purpose multimedia workload. Our results are obtained from actual hardware measurements rather than through simulation, instilling confidence in our results.

Section 2 summarizes the multimedia workload we studied, and details the sixteen computationally important kernels which we extracted from it. Our methodology for recoding the kernels with multimedia instructions and their measurement is described in Section 3. An overview of the five instruction sets, and their implementations, is given in Section 4.

Our analysis is divided into two parts. Section 5 reflects our experience in coding the kernels, and lends insight into those instruction set features that we found useful. In Section 6 we compare the performance of five different processors, each of which implements a particular multimedia instruction set. We compare the performance of the five architectures both against one another, as well as their relative improvement over compiled (optimized) C code. Finally, in Section 7 we propose two new directions for multimedia architectures on general purpose microprocessors: strided memory operations and fat subwords.

2. Workload

The lack of a standardized multimedia benchmark has meant that workload selection is the most difficult aspect of any study of multimedia. It was for this reason that we developed the Berkeley multimedia workload, which is described in [35]. [35] is highly recommended reading for any reader concerned about how applications were selected or the technical merit and relevance of the Berkeley multimedia workload. It details why we felt existing multimedia benchmarks were unsuitable, develops and characterizes our workload, and extracts the kernels that we study here. The Berkeley multimedia workload was developed for use in this research as well as our study of multimedia cache behavior in [36]. In selecting the component applications, we strove to cover as many types of media processing as possible: image compression (DjVu, JPEG), 3D graphics (Mesa, POVray), document rendering (Ghostscript), audio synthesis (Timidity), audio compression

§Funding for this research has been provided by the State of California under the MICRO program, and by AT&T, Cisco Corporation, Fujitsu Microelectronics, IBM Corporation, Maxtor Corporation, Microsoft Corporation, Toshiba Corporation and Veritas Software Corporation.


Kernel Name              Source Application  Data Type   Sat. Arith.  Native Width  Src Lines  Static Instr  % Static Instr  % CPU Cycles
Add Block                MPEG-2 Decode       8-bit (U)   Yes          64-bits       46         191           0.1%            13.7%
Block Match              MPEG-2 Encode       8-bit (U)                128-bits      52         294           0.4%            59.8%
Clip Test and Project    Mesa                FP                       limited       95         447           0.1%            0.8%
Color Space Conversion   JPEG Encode         8-bit (U)                limited       22         78            0.1%            9.8%
DCT                      MPEG-2 Encode       16-bit (S)               128-bits      14         116           0.1%            12.3%
FFT                      LAME                FP                       limited       208        981           4.4%            14.5%
Inverse DCT              MPEG-2 Decode       16-bit (S)  Yes          128-bits      75         649           0.3%            29.7%
Max Value                LAME                32-bit (S)               unlimited     8          39            0.2%            12.0%
Mix                      Timidity            16-bit (S)  Yes          unlimited     143        829           1.0%            35.7%
Quantize                 LAME                FP                       unlimited     55         312           1.4%            15.3%
Short Term Anal Filter   GSM Encode          16-bit (S)  Yes          128-bits      15         79            0.1%            20.2%
Short Term Synth Filter  GSM Decode          16-bit (S)  Yes          128-bits      15         114           0.2%            72.7%
Subsample Horizontal     MPEG-2 Encode       8-bit (U)   Yes          88-bits       35         244           0.3%            2.6%
Subsample Vertical       MPEG-2 Encode      8-bit (U)   Yes          unlimited     76         478           0.6%            2.1%
Synthesis Filtering      Mpg123              FP          Yes          512-bits      67         348           0.4%            39.6%
Transform & Normalize    Mesa                FP                       limited       51         354           0.1%            0.7%

Table 1. Multimedia Kernels Studied - from left to right, the columns list 1) primary data type specified as N-bit ({Unsigned, Signed}) integer or floating point (FP), 2) whether saturating arithmetic is used, 3) native width: the longest width which does not load/store/compute excess unused elements, 4) static C source line count (for those lines which are executed), 5) static instruction count (for those instructions which are executed), 6) percentage of total static instructions, 7) percentage of total CPU time spent in the kernel. The latter three statistics are machine specific and are for the original C code on a Compaq DS20 (dual 500 MHz Alpha 21264, Tru64 Unix v5.0 Rev. 910) machine.

(ADPCM, LAME, mpg123), video compression (MPEG-2 at DVD and HDTV resolutions), speech synthesis (Rsynth), speech compression (GSM), speech recognition (Rasta) and video game (Doom) applications. Open source software was used both for its portability (allowing for cross platform comparisons) and the fact that we could analyze the source code directly.

It is usually neither time nor resource efficient to hand optimize every line of code in an application. Instead, optimization should focus on execution hot spots or computational kernels, thereby limiting the recoding effort to those portions of the code that have the most potential to improve overall performance. In order to evaluate the multimedia instruction sets, kernels were distilled from the workload based on their computational significance and amenability to hand optimization with SIMD instructions; [35] discusses the kernel selection process in detail. Each kernel was hand coded in the assembly language of each of the five instruction sets. Table 1 lists the kernel codes examined.

3. Methodology

3.1. Berkeley Multimedia Kernel Library

From a performance standpoint, no piece of code can realistically be extracted and studied in isolation. In order to measure the performance of the multimedia instruction sets, we distilled from our Berkeley multimedia workload a set of computationally important kernel functions, which when taken together form the Berkeley multimedia kernel library (BMKL). All of the parent applications in the Berkeley multimedia workload were modified to make calls to the BMKL rather than their original internal functions. By measuring our kernel codes from within real applications we could realistically include the effects of the shared resources of our test systems (e.g. CPU, caches, TLB, memory, and registers).

Encapsulating the kernels within a library with a well defined interface allowed for low overhead measurement code to be placed around library functions, making measurements as non-invasive as possible. This also allowed us to quickly test new revisions of the code by linking against different versions of the library. When adding a new architecture to our study it was possible to simply start with a copy of the C reference version of the library and then implement and debug replacement SIMD assembly functions one at a time.

3.2. Coding Process

Ideally, high level language compilers would be able to systematically and automatically identify parallelizable sections of code and generate the appropriate SIMD instruction sequence. SIMD optimizations would then not just be limited to multimedia applications, but could be more generally applied to any application exhibiting the appropriate type of data parallelism. [21] has proposed strategies for extracting parallelism from multimedia workloads with compilers. Recently, in fact, the very first commercial compilers have become available for Intel's SSE ([16]) and Motorola's AltiVec ([45]), although their effectiveness is still an open question. The lack of languages which allow programmers to specify data types and overflow semantics at variable declaration time has hindered the development of automated compiler support for multimedia instruction sets [10]. In addition, current SIMD extensions are clearly targeted for use by an expert applications programmer coding by hand in assembly or through high-level language macros.

3.2.1. Hand Coding

As is the case with digital signal processors (DSPs), the most efficient way to program with multimedia extensions is to have an expert programmer tune software using assembly language [20]. Although this is more tedious and error prone than other methods such as hand coded standard shared libraries or automatic code generation by a compiler, it is a method that is available on every platform, and allows for the greatest flexibility and precision when coding.


                                AMD Athlon             DEC Alpha 21264  Intel Pentium III    Motorola 7400 (G4)  Sun UltraSPARC IIi
Clock (MHz)                     500                    500              500                  500                 360
SIMD Extension(s)               MMX/3DNow!+            MVI              MMX/SSE              AltiVec             VIS
SIMD Instructions               57/24                  13               57/70                162                 121
First Shipped                   6/1999                 2/1998           2/1999               8/1999              11/1998
Transistors (x10^6)             22.0                   15.2             9.5                  6.5                 5.8
Process (um) [Die Size (mm^2)]  0.25 [184]             0.25 [225]       0.20 [105]           0.20 [83]           0.25 [148]
L1 $I Cache, $D Cache (KB)      64, 64                 64, 64           16, 16               32, 32              32, 64
L2 Cache (MB)                   0.5                    4                0.5                  1                   2
Register File (# x width)       FP (8x64b)/FP (8x64b)  Int (31x64b)     FP (8x64b)/8x128b    32x128b             FP (32x64b)
Reorder Buffer Entries          72                     35               40                   6                   12
Int, FP Multimedia Units        2 (combined)           2, 0             2 (combined)         3, 1                2, 0
Int Add Latency                 2 [64b/cycle]          -                1 [64b/cycle]        1 [128b/cycle]      1 [64b/cycle]
Int Multiply Latency            3 [64b/cycle]          -                3 [64b/cycle]        3 [128b/cycle]      3 [64b/cycle]
FP Add Latency                  4 [64b/cycle]          -                4 [64b/cycle]        4 [128b/cycle]      -
FP Multiply Latency             4 [64b/cycle]          -                5 [64b/cycle]        4 [128b/cycle]      -
FP ~1/sqrt Latency              3 [32b/cycle]          -                2 [64b/cycle]        4 [128b/cycle]      -

Table 2. Microprocessors Studied - All parameters are for the actual chips used in this study. A SIMD register file may be shared with existing integer (Int) or floating point (FP) registers, or be separate. Note that some of the processors implement several multimedia extensions (e.g. the Pentium III has MMX and SSE) - the corresponding parameters for each are separated with "/". A dash (-) indicates that a particular operation is not available on a given architecture, and so no latency and throughput numbers are given. DEC's MVI extension (on the 21264) does not include any of the listed operations, but all MVI instructions have a latency of 3 cycles [64b/cycle]. [2], [3], [4], [6], [7], [9], [12], [13], [17], [19], [25], [28], [29], [42]

All of the assembly codes in our study were written by the same programmer (Slingerland) with an approximately equal amount of time spent on each platform. We chose to code in assembly language ourselves rather than measure vendor supplied optimized libraries because the latter would make a fair cross-platform comparison next to impossible: each vendor implements a different set of functions which presume particular (and likely different) data formats. For example, a vendor supplied color space conversion function might assume band-interleaved format when the program data to be accelerated uses a band-separated structure. Some overhead would then be needed to transform data structures into the expected format, perturbing the system under test and adding unnecessary overhead.

Coding ourselves had the supplemental benefit of preventing any differences in programmer ability and time spent coding between the vendors from potentially skewing our comparison. This also gave us considerable insight into the reasons for performance differences, where we would otherwise be limited to simply reporting our measurements. It is important to stress that before even a single line of code was rewritten in assembly, the kernel algorithms and applications in question were studied in considerable detail. Our level of understanding made it possible to play the role of applications expert and thereby make educated choices about SIMD algorithms, sufficient precision and required data types, and is reflected in the development and characterization of our workload in [35].

For reasons of practicability, we limited our optimizations to the kernel level; we did not rewrite entire applications. This had ramifications for data alignment and data structure layout on some architectures. Tools that aid in precise per-cycle instruction scheduling were not used, as they are not available on all platforms. To use them would mean that we would also be measuring the available tools and development environment of a given platform, rather than just the multimedia instruction set. Instructions were scheduled sanely: data production was separated as far as possible from data consumption, loops were unrolled when of performance benefit, and instructions were manually scheduled so as to take advantage of multiple functional units.

In some cases it was not practical to code a SIMD version of a kernel if an instruction set lacked the requisite functionality. For example, Sun's VIS and DEC's MVI do not support packed floating point, so floating point kernels were not recoded for these platforms. DEC's MVI also does not contain any data communication (e.g. permute, mix, merge) or subword parallel integer multiply instructions. If there was no compelling opportunity for performance gain, kernels were not recoded from C. The C version was considered to be a particular platform's "solution" to a kernel when the needed SIMD operations were not provided. Situations which called for a C "solution" are clearly denoted in our results, as the typically lesser performance of compiler generated code must be kept in mind when comparing against the other (hand-tuned) kernels.

3.2.2. C Compilers and Optimization Flags

The most architecturally tuned compiler on each architecture was used to compile the C reference version of the Berkeley Multimedia Kernel Library. The optimization flags used were those which give the best general speedup and highest level of optimization without resorting to tuning the compiler flags for a particular kernel. The specific compiler and associated optimization flags used on each platform are listed in [38].

4. Instruction Sets

In this paper we study five architectures with different multimedia instruction sets (Table 2 lists the parameters for the exact parts). Many of the instruction sets (AMD's 3DNow!, DEC's MVI, Intel's MMX, Sun's VIS) use 64-bit wide registers, while

Motorola's AltiVec and Intel's SSE are 128-bits wide. The size of the register file available on each architecture varies widely, ranging from eight 64-bit registers on the x86-based AMD Athlon and Intel Pentium III to thirty-two 128-bit wide registers with Motorola's AltiVec extension.

All of the multimedia extensions support integer operations, although the types and widths of available operations vary greatly. The earliest multimedia instruction sets (e.g. Sun's VIS and DEC's MVI) had design goals limited by the immaturity of their target workload and unproven benefit to performance. Any approach which greatly modified the overall architecture or significantly affected die size was out of the question, so leveraging as much functionality as possible from the existing architecture was a high priority. Thus, the Sun and DEC extensions do not implement subword parallel floating point instructions. Witness also the difference between Intel's first multimedia extension, MMX (1997), which had as one of its primary design goals to not require any new operating system support, and Intel's SSE (1999), which added new operating system maintained state (eight 128-bit wide registers) for the first time since the introduction of the Intel386 instruction set in 1985.

DEC

The DEC (now Compaq, but we will refer to it as DEC for historical consistency) Alpha 21264 processor includes the SIMD integer Motion Video Instructions (MVI) multimedia extension. It is the smallest of the multimedia instruction sets, weighing in with only 13 instructions. MVI shares the existing Alpha 32-register integer register file. Notably, no SIMD saturating addition/subtraction, multiplication, or shift instructions are included.

Motorola

Motorola's MPC7400 (also known as the G4) processor implements the 128-bit wide SIMD AltiVec extension, which supports a wide variety of integer data types, as well as subword parallel single precision floating point. A dedicated 32 x 128-bit register file is included, along with four non-identical parallel, pipelined vector execution units. Hardware assisted software prefetching is implemented, whereby a prefetch stream is set up by software, and fetched into the cache independently by hardware.

Sun

Sun's UltraSPARC IIi processor incorporates the VIS multimedia extension. VIS is a set of SIMD integer instructions that share UltraSPARC's scalar floating point register file. Subword parallel multiplication is accomplished through 8-bit multiplication primitive instructions, and a graphics status register (GSR) is used to support data alignment and scaling for pack operations.

Intel

Intel's Pentium III processor includes their original SIMD integer MMX extension, as well as SSE, a newer SIMD floating point instruction set. MMX is mapped onto the existing x87 floating point architecture and registers and introduces no new architectural state (registers or exceptions). SSE is primarily a SIMD floating point extension but also incorporates feedback on MMX from software vendors in the form of new integer instructions. Unlike MMX, the floating point side of SSE adds new architectural state to the Intel architecture with the addition of an 8 x 128-bit register file and exceptions to support IEEE compliant floating point operations. Note that SSE is only 128-bits wide for packed floating point operations; integer operations are still 64-bits wide like MMX. Also, although the SSE instruction set architecture and register file are defined to be 128-bits wide, the Pentium III SSE execution units are actually 64-bits (two single precision floating point elements) wide in hardware. The instruction decoder translates 4-wide (128-bit) SSE instructions into pairs of 2-wide (64-bit) internal micro-ops.

AMD

The AMD Athlon processor implements MMX (which was licensed from Intel), in addition to AMD's own 3DNow! extension, which utilizes the same x87 floating point registers and basic instruction formats as MMX, and adds a subword parallel single precision floating point data type. The Athlon processor actually includes an enhanced version of 3DNow! which adds a handful of instructions in order to make 3DNow! functionally equivalent to Intel's SSE.

5. Instruction Set Analysis

This section encompasses our experience coding each of the sixteen kernels with five different multimedia extensions. We point out those instruction set features (data types and operations) that we found to be useful in accelerating our workload, as well as those for which we found no application or that created significant bottlenecks in current multimedia architectures. Those readers primarily interested in our performance measurements of specific implementations of these instruction sets can skip ahead to Section 6 and return here to Section 5 at their leisure.

Illustrating our discussion are code fragments both from the original C source code of each kernel algorithm, as well as the different SIMD implementations. The code fragments consist of a few of the key central lines of code from a given kernel. This gives a good idea of the types of operations and data types used. The data types of all of the variables in our sample C code are specified in a platform independent way such that the prefix indicates the type (INT: signed integer, UINT: unsigned integer, FP: floating point), followed by N, the number of bits. The complete original C source code for each kernel, as well as the assembly language SIMD implementations of the BMKL, are available on the web at http://www.cs.berkeley.edu/~slingn/research/.

5.1. Register File and Data Path

Multimedia instruction sets can be broadly categorized according to the location and geometry of the register file upon which SIMD instructions operate. Alternatives include the reuse of the existing integer or floating point register files, or implementing an entirely separate one. The type of register file affects the width and therefore the number of packed elements that can be operated on simultaneously (vector length).

5.1.1. Shared Integer Data Path

Implementing multimedia instructions on the integer data path has the advantage that the functional units for shift and logical operations need not be replicated. Subword parallel addition and subtraction are easily created by blocking the appropriate carry bits.

INT16 *bp;
UINT8 *rfp;
INT32 tmp;

tmp = *bp++ + *rfp;                         /* Add  */
*rfp = tmp>255 ? 255 : (tmp<0 ? 0 : tmp);   /* Clip */

Listing 1. Add Block - computed for each pixel in a sub-block
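For context, the per-pixel operation of Listing 1 executes inside a loop over the 8 x 8 sub-block. A scalar C sketch of that enclosing loop follows; the 8x8 geometry and the row increment iincr are inferred from the AltiVec fragment in Listing 2, and the function name is illustrative:

```c
/* Scalar sketch of the enclosing Add Block loop: each 16-bit
 * prediction-error value (*bp) is added to the corresponding 8-bit
 * reference pixel (*rfp) and the sum is clipped back to 0..255.
 * The 8x8 loop bounds and row increment follow Listing 2. */
typedef short         INT16;
typedef unsigned char UINT8;
typedef int           INT32;

void add_block(INT16 *bp, UINT8 *rfp, int iincr)
{
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++) {
            INT32 tmp = *bp++ + *rfp;                        /* Add  */
            *rfp++ = tmp > 255 ? 255 : (tmp < 0 ? 0 : tmp);  /* Clip */
        }
        rfp += iincr;   /* skip to the next row of the reference frame */
    }
}
```

A SIMD version replaces the inner loop with one (128-bit) or two (64-bit) packed saturating additions per row, as Listing 2 shows for AltiVec.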

;; unaligned vector load
lvx     v3, 0, r3       ; v3: vector MSQ for initial bp0..bp7 vector
lvx     v4, r11, r3     ; v4: vector LSQ
;; unaligned vector load
lvx     v0, 0, r4       ; v0: vector MSQ
lvx     v1, r11, r4     ; v1: vector LSQ
lvsl    v2, 0, r4       ; v2: vector alignment mask for vperm
vxor    v10, v10, v10   ; v10: 0
vperm   v0, v0, v1, v2  ; v0: rfp:|0|1|2|3|4|5|6|7|X|X|X|X|X|X|X|X|
vperm   v3, v3, v4, v5  ; v3: | bp0 | bp1 | bp2 | bp3 | bp4 | bp5 | bp6 | bp7 |
addi    r3, r3, 16      ; r3: bp += 8 (pointer to INT16)
vmrghb  v1, v10, v0     ; v0: | rfp0| rfp1| rfp2| rfp3| rfp4| rfp5| rfp6| rfp7|
vaddshs v1, v3, v1      ; v1: bp + rfp [0..7]
vpkshus v1, v1, v1      ; v1: rfp:|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
stvewx  v1, 0, r4       ; store rfp [0..3]
stvewx  v1, r12, r4     ; store rfp [4..7]
add     r4, r4, r5      ; r4: rfp += (iincr + 8) (pointer to UINT8)
vmov    v3, v4          ; move current LSQ to next MSQ

Listing 2. AltiVec Add Block Fragment

However, modifications to the integer data path to accommodate multimedia instructions can potentially adversely affect the critical path of the integer data-path pipeline [20]. In addition, on architectures such as x86 (AMD, Intel) and PowerPC (Motorola) the integer data path is 32-bits wide, making a shared integer data path approach less compelling due to the limited amount of data level parallelism possible.

5.1.2. Shared Floating Point Data Path

The reuse of floating point rather than integer registers has the advantage that they need not be shared with pointers, loop counters, and other variables. In addition, multimedia and floating point instructions are not typically used simultaneously [20], so there is seldom any added register pressure. Finally, many architectures support at least double-precision (64-bit wide) operations, which is often wider than the integer data path.

5.1.3. Separate Data Path

A separate data path that is not shared with existing integer or floating point data paths has the advantage of simpler pipeline control and potentially an increased number of registers. Disadvantages include the need for saving and restoring the new registers during context switches, as well as the relative difficulty and high overhead of moving data between register files. This can potentially limit performance if scalar code depends upon a vector result.

5.1.4. SIMD Register Width

One of the most critical factors in SIMD instruction set design is deciding how long vectors will be. If an existing data path is to be reused, there is little choice, but when a new data path is to be designed it makes sense to ask how wide is wide enough. A vector length that is too short limits the ability to exploit data parallelism, while an excessively long vector length can degrade performance due to insufficiently available data parallelism in an algorithm, as well as increase the amount of clean up code overhead. The "native width" column of Table 1 specifies how each of the multimedia kernels fits into one of the following categories:

unlimited width - Kernels that operate on data elements which are truly independent and are naturally arranged so as to be amenable to SIMD processing. The inner loops of these kernels can be strip mined at almost any length, with the increase in performance of a longer vector being directly proportional to the increase in vector length.

limited width - Although data elements are independent, there is overhead involved in rearranging input data so that it may be operated on in a SIMD manner with longer vectors. Thus, the performance advantage of longer vectors is limited by the overhead (which typically increases with vector length) required to employ them.

exact width - A kernel which has a precise natural width which can be considered to be the right match for the kernel. This width is the longest width that does not load, store, or compute excess, unused elements.

Long vectors can sometimes be a problem when their length exceeds the natural width of an algorithm. A good example of this problem is the add block kernel, which operates on MPEG sub-blocks (8 x 8 arrays of pixels). One input array (bp) consists of signed 16-bit integer values and the other (rfp) of unsigned 8-bit integers (Listing 1).

During the block reconstruction phase of motion compensation in the decoder, a block of pixels is reconstituted by summing the pixels in different sub-blocks. For example, in the AltiVec implementation of the add block kernel (Listing 2), each time a row of the 8x8 rfp sub-block is loaded on a 128-bit wide architecture, one half of the vector is useless data which will be thrown away when the rfp values are promoted to 16-bits. This problem could be avoided on a 128-bit architecture with a half

INT32 x0,x1,x2,x3,x4,x5,x6,x7,x8;

x7 = x8 + x3;
x8 -= x3;
x3 = x0 + x2;
x0 -= x2;
x2 = (181*(x4+x5)+128)>>8;
x4 = (181*(x4-x5)+128)>>8;

Listing 3. Inverse DCT - only row computation shown

extern FLOAT64 c[8][8];   /* transform coefficients */
INT16 block[8][8];
FLOAT64 sum;
FLOAT64 tmp[8][8];

for (INT32 i=0; i<8; i++)
    for (int j=0; j<8; j++) {
        sum = 0.0;
        for (int k=0; k<8; k++)
            sum += c[j][k] * block[i][k];
        tmp[i][j] = sum;
    }

Listing 4. DCT

length 64-bit load, an operation which other 128-bit SIMD architectures (such as Intel's SSE) have.

As we saw in Table 1, most multimedia kernels are either unlimited/limited width, or most often have an exact required width of 128-bits. The remaining exact native widths (64-, 88- and 512-bits) come up only once each. Thus, we consider a total vector length of 128-bits to be optimal for most multimedia algorithms, especially with the inclusion of short load/store operations.

5.1.5. Number of Registers

Multimedia applications and their kernels can generally take advantage of quite large register files. This rule of thumb is echoed in the architectures of dedicated media processors such as MicroUnity's chip, which has a 64 x 64-bit register file (also accessible as 128 x 32-bits) [11], while the Philips Trimedia TM-1 has a 128 x 32-bit register file [32].

The DCT and IDCT kernels (fragments of the original codes are given in Listing 3 and Listing 4) are excellent examples of where the importance of a large register file can be seen. The discrete cosine transform (DCT) is the algorithmic centerpiece of many lossy image compression methods. It is similar to the discrete Fourier transform (DFT) in that it maps values from the time domain to the frequency domain, producing an array of coefficients representing frequencies [18]. The inverse DCT maps in the opposite direction, from the frequency to the time domain. A 2D DCT or IDCT is efficiently computed by 1D transforms on each row followed by 1D transforms on each column and then scaling the resulting matrix appropriately. A SIMD approach requires that multiple data elements from several iterations be operated on in parallel for the greatest efficiency, so computation is straightforward for the 1D column DCT, since the corresponding elements of each loop iteration are adjacent in memory (assuming a row-major storage format). Performing a 1D row DCT is more problematic since the corresponding elements of adjacent rows are not adjacent in memory. The solution is to employ a matrix transposition (making corresponding "row" elements adjacent in memory), perform the desired computation, and then re-transpose the matrix to put the result matrix back into the correct configuration. We found this technique to only be of performance benefit on those architectures with register files that were able to hold the entire 16-bit 8x8 matrix. Since the DCT and IDCT both operate on 8x8 2D matrices of 16-bit signed values, they require at least 16 64-bit or 8 128-bit registers.

5.2. Data Types

The data types supported by the different multimedia instruction sets include {signed, unsigned} {8, 16, 32, 64} bit values, as well as single precision floating point. Most of the instruction sets do not support all of these types, and usually support only a subset of operations on each. To determine which data types and operations are useful, we looked at the dynamic SIMD instruction counts for the kernels when running the Berkeley multimedia workload. We then broke down the dynamic SIMD instruction counts on each architecture in two ways: data type distribution per instruction class (e.g. add, multiply) and per kernel. Here we discuss only the salient features of these categorizations, although the full set of results is available in [38].

In general, the video and imaging kernels (add block, block match, color space, DCT, IDCT, subsample horizontal, subsample vertical) utilize 8- and 16-bit operations. Audio kernels (FFT, max val, mix stereo, quantize, short term analysis filtering, short term synthesis filtering, synthesis filtering) either use 16-bit values or single precision floating point, while the 3D kernels (clip test, transform) are limited almost exclusively to single precision floating point.

5.2.1. Integer

Image and video data is typically stored as packed unsigned 8-bit values, but intermediate processing usually requires precision greater than 8-bits. We found that, other than for width promotion, most 8-bit functionality was wasted on our set of multimedia kernels. The signed 16-bit data type is the most heavily used because it is both the native data type for audio and speech data, as well as the typical intermediate data type for video and imaging. On the wide end of the spectrum, 32-bit and longer data types are typically only used for accumulation and simple data communication operations such as alignment. Operations on 32-bit values tend to be limited to addition and subtraction (for accumulation), width promotion and demotion (for converting to a narrower output data type) and shifts (for data alignment).
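The transpose, compute, re-transpose strategy described above for the 1D row transform can be sketched in scalar C. Only the transposition is concrete here; transform_columns() is a hypothetical stand-in for a real column-oriented 1D DCT:

```c
/* Sketch of the matrix transposition technique described above:
 * transpose so that "row" elements become adjacent in memory,
 * apply the column-oriented 1D transform, then transpose back. */
typedef short INT16;

static void transpose8x8(INT16 m[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = i + 1; j < 8; j++) {
            INT16 t = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = t;
        }
}

void row_transform_via_transpose(INT16 m[8][8],
                                 void (*transform_columns)(INT16 (*)[8]))
{
    transpose8x8(m);       /* row elements now column-adjacent */
    transform_columns(m);  /* SIMD-friendly column transform   */
    transpose8x8(m);       /* restore original orientation     */
}
```

On SIMD hardware the transpose itself is done in registers with merge/unpack or permute instructions, which is why the technique only pays off when the whole 16-bit 8x8 matrix fits in the register file.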


pmaxsw mm0, [CLIP_MIN]   ; compute first element
pminsw mm0, [CLIP_MAX]   ; in order to free mm0
movq   [esi + 0], mm0    ; store x0 [0..3]
movq   mm0, [CLIP_MIN]   ; clip to -256
pmaxsw mm1, mm0
pmaxsw mm2, mm0
pmaxsw mm3, mm0
pmaxsw mm4, mm0
pmaxsw mm5, mm0
pmaxsw mm6, mm0
pmaxsw mm7, mm0
movq   mm0, [CLIP_MAX]   ; clip to +255
pminsw mm1, mm0
pminsw mm2, mm0
pminsw mm3, mm0
pminsw mm4, mm0
pminsw mm5, mm0
pminsw mm6, mm0
pminsw mm7, mm0

Listing 5. Intel MMX/SSE IDCT

5.2.2. Floating Point

Single precision floating point plays an important role in many of the multimedia kernels, such as the geometry stage of 3D rendering (the clipping and transform kernels) and the FFT, where the wide dynamic range of floating point is required. Double precision SIMD is uncommon; currently it is only available with Intel's recently announced SSE2 extension on the Pentium IV processor. This data type is targeted more towards applications such as scientific and engineering workloads rather than multimedia [14], [15], and would have no use in the kernels we examine.

5.3. Operations

One of the primary goals of this work is to separate useful SIMD operations from those that are not. The large differences in current multimedia instruction sets for general purpose processors are fertile ground for making such a determination because many different design choices have been made. Please note the very important caveat that this determination is made only with respect to supporting our multimedia workload. We explicitly comment on those instructions which we found to be "useless" for our workload but for which we believe the designers had some other purpose or workload in mind.

5.3.1. Arithmetic

5.3.1.1. Modulo/Saturation

Modulo arithmetic "wraps around" to the next representable value when overflow occurs, while saturating arithmetic clamps the output value to the highest or lowest representable value for positive and negative overflow, respectively. Saturation is useful because of its desirable aesthetic result in imaging and audio computations. For example, when adding pixels in an imaging application, modulo addition is undesirable because if overflow occurs, a small change in operand values may result in a glaring visual difference (e.g. the addition of two white pixels results in a black pixel). Saturating arithmetic also allows overflow in multiple packed elements to be dealt with efficiently. If overflow within packed elements were to be handled as in traditional scalar arithmetic, an overflowing SIMD operation would have to be repeated and tested serially to determine which element overflowed. The only added cost of saturating arithmetic is separate instructions for signed and unsigned operands, since the values must be interpreted by the hardware as a particular data type. This is unlike modulo operations, for which the same instruction can be used for both unsigned and signed 2's complement values.

Despite the utility of saturating operations, modulo computations are still important because they allow results to be numerically identical to those of existing scalar algorithms. This is sometimes a consideration for the sake of compatibility and comparability. The kernels in the BMKL that employ saturating arithmetic are noted in Table 1. From this it is clear that the most important types of saturating arithmetic are unsigned 8-bit and signed 16-bit integers. The IDCT kernel clamps to a signed 9-bit range [-256..+255], which can be accomplished through a pair of max/min operations; we discuss max and min in more detail later. Saturating 32-bit operations are of little value since overflow is not often a concern for such a wide data type.

5.3.1.2. Shift

SIMD shift operations provide an inexpensive way to perform multiplication and division by powers of two. They are also crucial for supporting fixed point integer arithmetic, in which a common sequence of operations is the multiplication of an N-bit integer by an M-bit fixed point fractional constant, producing an (N+M)-bit result with a binary point at the Mth most significant bit. At the end of computation, the final result is rounded by adding the fixed point fraction representing 0.5, and then shifted right M bits to eliminate the fractional bits.


FLOAT32 ex = vEye[i][0], ey = vEye[i][1],
        ez = vEye[i][2], ew = vEye[i][3];
FLOAT32 cx = m0 * ex + m8 * ez,
        cy = m5 * ey + m9 * ez;
FLOAT32 cz = m10 * ez + m14 * ew,
        cw = -ez;
UINT8 mask = 0;
vClip[i][0] = cx; vClip[i][1] = cy;
vClip[i][2] = cz; vClip[i][3] = cw;
if      (cx >  cw) mask |= CLIP_RIGHT_BIT;
else if (cx < -cw) mask |= CLIP_LEFT_BIT;
if      (cy >  cw) mask |= CLIP_TOP_BIT;
else if (cy < -cw) mask |= CLIP_BOTTOM_BIT;
if      (cz >  cw) mask |= CLIP_FAR_BIT;
else if (cz < -cw) mask |= CLIP_NEAR_BIT;

Listing 6. Clip Test and Project

;; v1: | cx | cy | cz | cw |
vspltw   v2, v1, 3       ; v2: | cw | cw | cw | cw |
vcmpbfp. v3, v1, v2      ; v3: bit mask of clip comparisons
;; set cr6 to 0x2 if all test values are within boundaries
li       r12, 0
mcrf     cr0, cr6
bc       COND_TRUE, 0x2, fast_out
vcmpgtsw v4, v0, v3      ; v4: > test, - mask if clipping
vcmpgtsw v5, v3, v0      ; v5: < test, + mask if clipping
vand     v4, v4, v27
vand     v5, v5, v28
vor      v4, v4, v5      ; v4: | mask0 | mask1 | mask2 | mask3 |
vsldoi   v5, v4, v4, 8
vor      v4, v4, v5
vsldoi   v5, v4, v4, 4
vor      v4, v4, v5      ; v4: | mask | mask | mask | mask |
vspltb   v4, v4, 15      ; v4: | M | M | ... | M | M | [0..15]
stvebx   v4, 0, r5       ; store mask to mask_temp
lbz      r12, 0(r5)      ; r12: mask
lbz      r15, 0(r7)      ; r15: clipMask[i]
or       r15, r15, r12
stb      r15, 0(r7)      ; r15: clipMask[i] |= mask
or       r13, r13, r12   ; r13: tmpOrMask |= mask
fast_out:
addi     r7, r7, 1       ; r7: clipMask++ (pointer to UINT8)
and      r14, r14, r12   ; r14: tmpAndMask &= mask
addic.   r3, r3, -1      ; n--
bc       COND_FALSE, ZERO_RESULT, loop

Listing 7. Motorola G4 Clip Test and Project

5.3.1.3. Min/Max

Min and max output the minimum or maximum values of the corresponding elements in two input registers, respectively. A max instruction is of obvious utility in the maximum value search kernel, which searches through an array of signed 32-bit integers for the greatest absolute value. Max and min instructions have other, less apparent uses as well. Signed minimum and maximum operations are often used with a constant second operand to saturate results to arbitrary ranges. This is the case in the IDCT kernel, which clips its output value range to -256...+255 (9-bit signed integer), a range which does not correspond to the saturating data types supported by any of the multimedia extensions. Listing 5 demonstrates this clamping to arbitrary boundaries for the Intel implementation of the IDCT.

An arbitrary clamping operation can also be simulated with packed signed saturating addition by summing with the appropriate constants. Similarly, max and min can be synthesized through simpler comparison operations, but the additional execution cost is usually considerable. Architecturally, the implementation cost of max and min instructions should be low, since the necessary comparators must already exist for saturating arithmetic; the only difference is that instead of comparing to a constant, a register value is used. We have found that in many cases where comparisons seem to be required, max and min instructions are sufficient.

5.3.1.4. Comparisons

We have found that integer control flow instructions such as packed comparisons are seldom needed, except on architectures without max/min operations (e.g. Sun's VIS). One specific type of floating point comparison was of significant benefit to the clip test and project kernel. In that kernel, 3D objects are first mapped to 2D space through a matrix multiplication of 1x4 vectors and 4x4 matrices. Objects are then clipped to the viewable area to avoid unnecessary rendering. The code fragment listed in Listing 6 is computed for each vertex in a 3D scene every time a frame is rendered.


Processor           Instruction   Latency : Throughput
AMD Athlon          PSADBW        3 cycles : 64b every 1 cycle
DEC Alpha           PERR          2 cycles : 64b every 1 cycle
Intel Pentium III   PSADBW        5 cycles : 64b every 2 cycles
Sun UltraSPARC IIi  PDIST         3 cycles : 64b every 1 cycle

Table 3. SAD Instruction Survey

Motorola's AltiVec includes a specialized comparison instruction, vcmpbfp, which deals with boundary testing. This is done by testing all of the clip values in parallel to see if any clipping is needed, and then branching as a fast out if no clipping is necessary. This technique is extremely effective because no clipping is the common case, with most vertices within screen boundaries. An example of this is shown in the clip test kernel from Mesa's 3D rendering pipeline (Listing 7). For the Mesa "gears" application, the fast out case held true for 61946 of the 64560 (96.0%) clipping tests performed in our application run of 30 frames rendered at 1024x768 resolution.

5.3.1.5. Sum of Absolute Differences

A sum of absolute differences (SAD) instruction operates on a pair of packed 8-bit unsigned input registers, summing the absolute differences between the respective packed elements and placing (or accumulating) the scalar sum in another register. The block match kernel is the only one in which SAD instructions are used. Block match sums the absolute differences between the corresponding pixels of two 16x16 macroblocks (Listing 8 lists the core of the block match kernel utilized by MPEG-2 encoding).

Although DEC's MVI extension is quite small (only 13 instructions), one of the few operations that DEC did include was SAD. DEC's architects found (in agreement with our experience) that this operation provides the most performance benefit of all multimedia extension operations [34]. Interestingly, Intel's MMX, although a much richer instruction set, did not include this operation (it was later included in both AMD's enhanced 3DNow! and Intel's SSE extensions to MMX). Sun's VIS also includes a sum of absolute differences instruction.

The Motorola G4 microprocessor was the only CPU in our survey that did not include some form of SAD operation, forcing us to synthesize it from other instructions (Listing 10). Although Intel's SSE extension (see Listing 12) includes the psadbw instruction, this offers only a one cycle performance advantage over the AltiVec implementation. Despite the fact that Intel has a performance win with a dedicated SAD instruction, it requires several additional instructions because it is a 64-bit wide extension. To estimate the latency of a hypothetical SAD instruction for a 128-bit extension such as AltiVec, we first try to understand the latency of this instruction on other

UINT8 blk_1[16][16];
UINT8 blk_2[16][16];
INT32 sad = 0;
INT32 diff;

for (j = 0; j < 16; j++)
    for (i = 0; i < 16; i++) {
        diff = blk_1[j][i] - blk_2[j][i];
        sad += (diff < 0) ? -diff : diff;
    }

Listing 8. Block Match

FLOAT64 tx, ty, tz, len, scale;
FLOAT32 ux = u[i][0], uy = u[i][1], uz = u[i][2];

tx = ux * m[0] + uy * m[1] + uz * m[2];
ty = ux * m[4] + uy * m[5] + uz * m[6];
tz = ux * m[8] + uy * m[9] + uz * m[10];
len = sqrt( tx*tx + ty*ty + tz*tz );
scale = (len > 1E-30) ? (1.0 / len) : 1.0;
v[i][0] = tx * scale;
v[i][1] = ty * scale;
v[i][2] = tz * scale;

Listing 9. Transform and Normalize

;; v1: block #1, line #1, pixels [0..F]
;; v4: block #1, line #2, pixels [0..F]
;; v6: block #2, line #1, pixels [0..F]
vavgub   v1, v1, v4     ; v1: vertically interpolated p0..pF
vmaxub   v7, v1, v6     ; v7: max [0..F]
vminub   v8, v1, v6     ; v8: min [0..F]
vsububs  v7, v7, v8     ; v7: abs_diffs [0..F]
vsum4ubs v31, v7, v31   ; v31: | SAD_0 | SAD_1 | SAD_2 | SAD_3 |
vsumsws  v31, v31, v0   ; v31: |   0   |   0   |   0   |  SAD  |

Listing 10. Motorola G4 Block Match - SAD (starting with vmaxub) takes 9 cycles


; xmm3: tx^2+ty^2+tz^2 [0..3]  (len = sqrt(tx^2+ty^2+tz^2))
; xmm7: 0.5 [0..3]
; xmm5: 3.0 [0..3]
rsqrtps xmm4, xmm3    ; xmm4: rsqrtps(a)
movaps  xmm6, xmm3
mulps   xmm6, xmm4    ; xmm6: a*rsqrtps(a)
mulps   xmm6, xmm4    ; xmm6: a*rsqrtps(a)*rsqrtps(a)
mulps   xmm4, xmm7    ; xmm4: 0.5*rsqrtps(a)
subps   xmm5, xmm6    ; xmm5: 3.0 - a*rsqrtps(a)*rsqrtps(a)
mulps   xmm4, xmm5    ; xmm4: |1/len3|1/len2|1/len1|1/len0|
;; sqrt(a) = a*(1/sqrt(a))
mulps   xmm3, xmm4    ; xmm3: | len3 | len2 | len1 | len0 |

Listing 11. Intel Approximated Square Root - 25 cycles

;; mm1: block #1, line #1, pixels [0..7]
;; mm5: block #1, line #1, pixels [8..F]
;; mm3: block #1, line #2, pixels [0..7]
;; mm4: block #1, line #2, pixels [8..F]
;; mm2: block #2, line #1, pixels [0..7]
;; mm6: block #2, line #1, pixels [8..F]
pavgb  mm1, mm3    ; mm1: vertically interpolated p0..p7
pavgb  mm5, mm4    ; mm5: vertically interpolated p8..pF
psadbw mm1, mm2    ; mm1: | 0 | 0 | 0 |SAD0|  pixels 0..7
psadbw mm5, mm6    ; mm5: | 0 | 0 | 0 |SAD1|  pixels 8..F
paddd  mm1, mm5    ; mm1: | 0 | 0 | 0 | SAD|

Listing 12. Intel Pentium III Block Match - SAD (starting with first psadbw) takes 8 cycles

; xmm3: tx^2+ty^2+tz^2 [0..3]  (len = sqrt(tx^2+ty^2+tz^2))
; xmm3: a [0..3]
sqrtps xmm3, xmm3    ; xmm3: sqrt(a)

Listing 13. Intel Full Precision Square Root - 36 cycles

architectures (Table 3). The latency of a 128-bit instruction would be higher than that of the 64-bit instructions listed in the table because a SAD instruction requires a cascade of adders to sum (reduce) the differences between the elements. An N-bit SAD instruction (M = N/8) can be broken down into steps: 1) calculate M 8-bit differences, 2) compute the absolute value of the differences, 3) perform log2(M) cascaded summations. The architects of the 64-bit DEC MVI extension comment that a 3-cycle implementation of PERR would have been easily achievable, but in the end they achieved a more aggressive 2-cycle instruction [7]. If a SAD operation were to be implemented in Motorola's AltiVec, we estimate it would have a latency of 4 cycles; certainly a superior solution to the 9 cycle sequence shown in Listing 10.

5.3.1.6. Average

In addition to computing the sum of absolute differences, half-pixel interpolation, of which MPEG-2 encoding offers three varieties, is also important. Interpolation is done by averaging a set of pixel values with pixels spatially offset by one horizontally, vertically or both. Vertical interpolation is shown in Listing 10 and Listing 12, and is done through 8-bit unsigned average instructions. The original C MPEG-2 code first interpolates and then computes the sum of absolute differences on the result, but a similar interpolation operation can be done by averaging the results of several SAD operations using scalar integer arithmetic (since the result of a SAD instruction is a scalar value). This second approach was used when programming for DEC's MVI extension, as it does not include a packed average instruction. Integer average operations were used only in the block match kernel (for interpolation). This kernel operates on 8-bit unsigned values, and the 8-bit unsigned average was the only type of "average" instruction used within our workload.

5.3.1.7. High Latency Function Approximation

3D rendering applications use floating point math functions such as reciprocal and square root that are typically very high latency. In the transform and normalize kernel, graphics primitives are transformed to the viewer's frame of reference through matrix multiplications. The transform kernel has at its heart a floating point reciprocal square root operation and is shown in Listing 9. This code is computed for each vertex in a 3D scene.

On some architectures scalar versions of these functions (reciprocal and square root) are computed in software, while others have hardware instructions. The computation of these functions is iterative, so an operation's latency is directly proportional to the number of bits of precision returned; fully IEEE compliant operations return 24 bits of mantissa.

All of the SIMD floating point extensions (AMD's 3DNow!, Intel's SSE and Motorola's AltiVec) include approximation instructions for 1/x and 1/√x. These are typically implemented as hardware lookup tables, returning k bits of precision. In Intel's SSE, for example, the approximate reciprocal (rcp) and reciprocal square root (rsqrt) instructions return 12 bits of mantissa.


; mm7: tx^2+ty^2+tz^2 [0]  (len = sqrt(tx^2+ty^2+tz^2))
pfrsqrt  mm4, mm7    ; mm4: |~1/len0 |~1/len0 |
movq     mm5, mm4
pfmul    mm4, mm4
pfrsqit1 mm4, mm7
pfrcpit2 mm4, mm5    ; mm4: | 1/len0 | 1/len0 |
pfmul    mm7, mm4    ; mm7: |  len0  |  len0  |

Listing 14. AMD Approximated Square Root - 20 cycles per element

One unique aspect of Intel's SSE instruction set is that it includes not only approximations of 1/x and 1/√x, but also full precision (24 bits of mantissa) versions of x/y and √x. Of course, this added precision comes at a price, namely much higher latency than the pipelined approximations (in fact, Intel's full precision instructions are not pipelined on the Pentium III).

All of the floating point multimedia extension vendors, including Intel, point to the Newton-Raphson method for improving the accuracy of approximations through specially derived functions. In the case of 1/√a it is possible to iteratively increase the precision of an initial approximation x0 through the equation:

x1 = x0 - (0.5 × a × x0^3 - 0.5 × x0) = 0.5 × x0 × (3.0 - a × x0^2)

Employing an approximation instruction in conjunction with the Newton-Raphson method to achieve full precision is actually faster than the full precision version of the instruction that Intel provides; compare the code fragments in Listing 11 and Listing 13, which have execution times of 25 vs. 36 cycles, respectively. One iteration of the Newton-Raphson method is enough to improve a 12-bit approximation to the full 24-bit precision of IEEE single precision.

The added cost of the Newton-Raphson method is the register space needed to hold intermediate values and constants. AMD's 3DNow! extension circumvents this by including instructions which internally perform the Newton-Raphson method, rather than having the programmer explicitly implement it (Listing 14). The only odd thing about AMD's reciprocal square root instructions is that they are actually scalar; they produce only one result value, based on the lower packed element. These instructions are a part of 3DNow! and not the normal x87 floating point; they expect input data to be in packed form (64 bits comprised of two 32-bit single precision subwords). It is necessary to execute two Newton-Raphson operations to compute both packed elements when using AMD's 3DNow! extension.

5.3.2. Exceptions

The techniques which SIMD instruction sets employ to take care of exception handling are very similar to those used when dealing with overflow. Checking result flags or generating an exception from a packed operation requires considerable time to determine which packed element caused the problem. Fortunately, in most cases where an exception might be raised it is possible to fill in a value which will give reasonable results for most applications and continue computation without interruption. This speeds execution because no explicit error condition checking need be done, and is similar to saturating integer arithmetic, where maximum or minimum result values are substituted rather than checking for and reporting positive or negative overflow.

Only Intel's SSE implements fully IEEE compliant SIMD floating point exceptions, and includes a control/status register (MXCSR) to mask or unmask packed floating point numerical exceptions.

5.3.3. Floating Point Rounding

Fully compliant IEEE floating point supports four rounding modes, but most real time 3D applications are not particularly sensitive to rounding error. It is often acceptable to incur a slight loss in precision and only use the flush to zero (FTZ) rounding mode [43]. Flush to zero clamps to a minimum representable result in the event of underflow (a number too small to be represented in single precision floating point). Intel's SSE offers a fully IEEE compliant rounding mode as well as a faster flush to zero mode. 3DNow! supports only truncated rounding (round to zero). All of Motorola's AltiVec floating point arithmetic instructions use the IEEE default rounding mode of round to nearest, but the IEEE directed rounding modes are not provided.

5.3.4. Type Conversion

It is useful to distinguish an application's storage data type (how data is stored in memory or on disk) from its computational data type (the intermediate data type used to compute a result). In general, storage data types, although narrow and therefore potentially offering the greatest degree of parallelism, are insufficiently wide for intermediate computations to occur without overflow.

Width promotion is the expansion of an N-bit value to some larger width. For unsigned fixed point numbers this requires zero extension (filling any additional bits with zeros). Zero extension is usually not specified as such in a multimedia architecture because it overlaps in functionality with data rearrangement instructions such as unpack or merge: packed values are merged with another register which has been zeroed prior to merging. Signed element unpacking is not as simple, and is rarely supported directly by hardware; only the AltiVec instruction set includes it. It can be synthesized with multiplication by one, since a multiplication yields a result that is the overall width of both its operands.


[Figure 1 diagrams byte-level examples on 64-bit registers vA (bytes 0x20..0x27), vB (bytes 0x40..0x47), a control vector vC, and a 32-bit scalar rA = 0x00000005:
merge_left vD,vA,vB interleaves the upper four bytes of vA and vB;
rotate_left vD,vA,vB,0x2 shifts the concatenated vA:vB byte sequence left by two bytes;
insert vB,rA,0x1 places the low byte of rA into byte 1 of vB;
extract rD,vB,0x2 reads byte 2 of vB into rD (0x00000042);
permute vD,vA,vB,vC selects bytes of vA and vB according to vC (index 0x0i selects byte i of vA, 0x1i selects byte i of vB).]

Figure 1. Data Communication Operations – source registers are shown at the top, operations are considered independently.

Video and imaging algorithms often utilize an 8-bit unsigned data type for storage, but 16 bits for computation; 8-bit to 16-bit zero extension is therefore useful. Audio and speech algorithms, on the other hand, typically employ signed 16-bit values, but because multiplication by a fractional fixed point constant is a common operation, these values are often unpacked as a natural consequence of computation. So, although a signed unpack operation would likely be faster than multiplication by 1, it is seldom necessary to resort to this in practice.

All data types that occur in multimedia should be supported for packing and unpacking, even those widths not directly supported by arithmetic operations. It should always be possible to convert to a width that is supported for computation or required for storage.

5.3.5. Data Rearrangement

SIMD instructions perform the same operation on multiple data elements. Because all of the data within a register must be treated identically, the ability to efficiently rearrange data bytes within and between registers is critical to performance. We will refer to this class of instructions as "data communication"; the various flavors are demonstrated in Figure 1. Interleave or merge instructions (also referred to as mixing or unpacking) merge alternate data elements from the upper or lower halves of two source registers. Align or rotate operations allow for arbitrary byte-boundary realignment of the data in two source registers; essentially a shift operation performed in byte multiples. Both interleave and align type operations have hard coded data communication patterns. Insert and extract operations allow a specific packed element to be extracted as a scalar, or a scalar value to be inserted at a specified location. Shuffle (also called permute) operations allow for greater flexibility than operations with fixed communication patterns, but this added flexibility requires that the communication pattern be specified either in a third source register or as an immediate value in part of the instruction encoding.

The sufficiency of simple data communication operations is to some degree dependent on the vector length employed. For example, 128-bit AltiVec vectors contain up to sixteen elements, while a shorter extension such as Intel's 64-bit MMX contains at most eight of the same type of element. This means that simple data rearrangement operations cover a relatively larger fraction of all possible mappings for a shorter vector length.

A related class of instructions that we found to be quite useful when programming with the AltiVec extension was "splat" instructions. A splat operation places either an immediate scalar or a specified element from a source register into every element of the destination register. This was very useful when constants were

fmul8sux16 %f0, %f2, %f4
fmul8ulx16 %f0, %f2, %f6
fpadd16    %f4, %f6, %f8

Listing 15. Sun VIS 16-bit x 16-bit → 16-bit Multiply

fmuld8sux16 %f0, %f2, %f4
fmuld8ulx16 %f0, %f2, %f6
fpadd32     %f4, %f6, %f8

Listing 16. Sun VIS 16-bit x 16-bit → 32-bit Multiply

required; on other architectures it is necessary to statically store these types of values in memory, and then load them into a register as needed.

[22] presents a novel set of simple data communication primitives which can perform all 24 permutations of a 2x2 matrix in a single cycle on a processor with dual data communication functional units. This is useful because any larger data communication problem can be decomposed into 2x2 matrices. Although this approach might make some very complex data communication patterns slow to compute, we have found that most multimedia algorithms have patterns which are relatively simple. Because of this we endorse [22]'s technique for supplying the data communication needs of multimedia applications.

5.3.6. Prefetching

Prefetching is a hardware or software technique which tries to predict data access needs in advance, overlapping memory access with useful computation. Although we will not otherwise examine the performance implications of prefetch instructions (which would be a useful extension to this study), we mention them briefly because they are often a part of multimedia instruction sets due to the highly predictable nature of data accesses in multimedia applications.

Prefetching can be implemented in software, hardware or both, with each technique distinguished by its methods for detection (discovering when prefetching will be of benefit) and synchronization (the means for ensuring that data is available in the cache before it is needed, but not so far in advance that it pollutes the cache by purging still needed items) [40].

Many multimedia instruction sets take a purely software approach by providing prefetch instructions that fetch only the cache block at a specified address into the cache from main memory (without blocking normal load/store instruction accesses). Determining the ideal location for this type of prefetch instruction depends on many architectural parameters. Unfortunately, these include such things as the number of CPU clocks for memory latency and the number of CPU clocks to transfer a cache line, which are both highly machine dependent and not readily available to the programmer.

Another technique is a software-hardware hybrid approach in which a software-directed hardware prefetching scheme is used. One variation of this is implemented by Motorola's AltiVec and its data stream touch (dst) instruction, which indicates the memory sequence or pattern that is likely to be accessed. We will refer to this hybrid of hardware and software prefetching as software directed prefetching, to indicate that a separate prefetch instruction need not be issued for each data element. Hardware optimizes the number of cache blocks to prefetch, so it is not necessary for the programmer to know the parameters of the cache system. Currently there is no means for synchronization in AltiVec implementations (other than periodically reissuing dst instructions, thereby limiting the potential benefit of this technique over pure software). Several hardware techniques for maintaining the synchronization of data prefetch and consumption in hybrid solutions, as well as pure hardware prefetching, are discussed in [30].

5.4. Bottlenecks and Unused Functionality

In this section we discuss those features which appear in multimedia instruction sets but do not appear to be useful, and are not "free"; i.e. they are not a low (or no) cost side effect of some other useful feature.

5.4.1. Instruction Primitives

The VIS instruction set does not include full 16-bit multiply instructions. It instead offers multiplication primitives, the results of which must be combined through addition (see Listing 15 and Listing 16). The reason that the architects of VIS divided up 16-bit multiplication in this way was to decrease die area: not providing a full 16x16 multiplier subunit cut the size of the arrays in half [44]. Unfortunately, dividing an operation into several instructions which are not otherwise useful in and of themselves increases register pressure, decreases instruction decoding bandwidth and creates additional data dependencies. Splitting SIMD instructions (which have been introduced for their ability to extract data parallelism) can actually cripple a superscalar processor's ability to extract instruction level parallelism. A multi-cycle operation can be a better solution than a multi-instruction operation because instruction latencies can be transparently upgraded in future processors, while poor instruction semantics cannot be repaired without adding new instructions.

5.4.2. Unused High Latency Approximation Instructions

Floating point 1/x approximation instructions, although available on several platforms, did not find application in the kernels we studied. AltiVec includes approximate log2 and exp2 instructions, which can potentially find application in the lighting stage of a 3D rendering pipeline. Lighting calculations are currently handled by 3D accelerator cards, and not the CPU, and so were not included in the BMKL.

5.4.3. Unused Pixel Conversion Instructions

Motorola's AltiVec extension includes pixel pack (vpkpx) and pixel unpack (vupkhpx, vupklpx) instructions for converting between 32-bit true color and 16-bit color representations. These did not find application within the BMKL, although it is possible that they might be of utility in situations where AltiVec needs to operate on 16-bit color data; many video games use 16-bit textures, for example.

5.4.4. Unused Memory Access Instructions

Sun's VIS includes two sets of instructions for accelerating multimedia operations with sophisticated memory addressing needs. The first, edge8, edge16, and edge32, produce bit vectors to be used in conjunction with partial store instructions to

                         AMD       DEC          Intel        Motorola   Sun
                         Athlon    Alpha 21264  Pentium III  7400 (G4)  UltraSPARC IIi
Int Load/Store           5.4%      20.3%        6.2%         2.4%       2.6%
Int Arithmetic           12.1%     29.2%        12.3%        19.1%      26.6%
Int Control Flow         12.8%     11.3%        13.4%        13.8%      9.3%
Int Data Communication   2.3%      6.7%         2.1%         1.1%       0.0%
FP Load/Store            3.2%      11.2%        4.0%         7.4%       8.0%
FP Arithmetic            0.0%      18.1%        0.0%         2.3%       8.4%
FP Control Flow          0.0%      0.1%         0.0%         0.0%       6.2%
FP Data Communication    0.0%      0.0%         0.0%         0.0%       0.0%
SIMD Load/Store          19.4%     --           21.0%        17.9%      11.7%
SIMD Data Communication  12.3%     --           12.6%        17.0%      9.7%
SIMD Int Arithmetic      12.5%     3.1%         18.8%        11.8%      13.6%
SIMD Int Control Flow    0.0%      --           0.0%         0.0%       3.5%
SIMD FP Arithmetic       19.9%     --           9.3%         6.9%       --
SIMD FP Control Flow     0.0%      --           0.0%         0.0%       --
Control State            0.2%      0.0%         0.2%         0.3%       0.4%
Total Count              4.49E+09  7.45E+09     3.52E+09     4.53E+09   5.60E+09

Table 4. Overall Instruction Mix – values are percentages of the total counts listed at the bottom of each column. Control state instructions are those that interact with special purpose registers. Control flow instructions include branches as well as conditional moves and comparisons. Dashes indicate functionality that was not implemented by a particular architecture (as opposed to implemented but unused, which is indicated by 0.0%).

deal with the boundaries in 2D images. The second group of addressing instructions includes array8, array16 and array32, which find use in volumetric imaging (the process of displaying a two dimensional slice of three dimensional data). An array instruction converts (x,y,z) coordinates into a memory address. The Berkeley multimedia workload does not include any volumetric imaging applications, so it is unsurprising that these instructions found no use in our workload.

5.4.5. Singular, Highly Utilized Resources

Although we usually think of SIMD architectures as extracting data level parallelism, all of the implementations of the instruction sets we have examined are also superscalar, with multiple parallel SIMD functional units. In fact, unless the SIMD vector length is long enough to hold the entire data set being operated on, there is almost always the potential to extract instruction as well as data level parallelism.

Sun's VIS architecture does not include subword parallel shift instructions. Instead, the graphics status register (GSR) has a 3-bit addr_offset field that is used implicitly for byte granularity alignment, as well as a 4-bit scale_factor field for packing/truncation operations. The VIS GSR is a serializing bottleneck because any time packing or alignment functionality is needed, it must be pushed through the GSR. Because VIS lacks subword shift operations, we found ourselves synthesizing such operations with the packing and alignment operations whenever we could not otherwise avoid them. Even with careful planning of packing and alignment operations, it was often necessary to write to the GSR (a slow operation) several times during each iteration of the loops of our multimedia kernels. The serializing effect of this singular resource prevented VIS operations from proceeding at the fullest possible degree of (instruction) parallelism.

5.4.6. Transparently Forced Alignment

Hardware support to efficiently handle memory accesses that are not aligned is expensive in both area and timing [43]. Ideally, data would always be aligned through software by the compiler or run-time architecture, but in some situations it is impossible to guarantee alignment. For example, in the motion compensation step of MPEG video coding, unaligned memory access is needed depending on the motion vector [20], as the addresses of the reference macroblock can be random depending on the type of motion search being performed.

Allowing only aligned memory accesses (and synthesizing unaligned accesses in software) can potentially perform better than unaligned access implemented in hardware. The AltiVec instruction set architecture does not provide for alignment exceptions when loading and storing data. Alignment is maintained by forcing the lower four bits of any address to zero. This is transparent to the programmer, so the programmer is responsible for guaranteeing alignment; otherwise incorrect data may be loaded or stored. We believe it is better that performance and correctness issues due to alignment be made explicit: the loading of incorrect data due to a mistaken assumption about alignment is extremely difficult to track down.

5.5. Overall Instruction Mix

Table 4 shows the total mix of dynamic instructions executed by each architecture. Counts are for the code within the kernels only, and do not include instructions from the rest of each application. Control state instructions are those that interact with special purpose machine registers (e.g. the GSR on Sun's VIS). Branches

C Time (ns)
 Kernel            Athlon 500  Alpha 500  P-III 500  G4 500    USIIi 360  USIIi 500*  Average
 Add Block         1,087       669        969        1,054     1,662      1,197       1,088
 Block Match       1,801       979        1,855      1,466     2,408      1,734       1,702
 Clip              7,374       8,591      7,938      8,906     12,222     8,800       9,006
 Color Space       1.22E+08    2.24E+07   6.01E+07   5.80E+07  6.20E+07   4.47E+07    6.49E+07
 DCT               24,332      12,475     38,192     33,176    30,922     22,264      27,819
 FFT               125,484     67,321     147,047    232,272   111,258    80,106      136,677
 IDCT              2,999       1,276      2,772      3,326     3,782      2,723       2,831
 Max Val           2,407       1,143      2,536      4,110     6,165      4,439       3,272
 Mix               956         616        1,065      1,710     3,550      2,556       1,579
 Quantize          34,233      19,549     37,100     29,488    34,928     25,148      31,059
 Short Term Anal   15,268      11,120     16,129     24,156    36,773     26,477      20,689
 Short Term Synth  21,849      19,208     43,582     23,721    48,660     35,035      31,404
 Subsmp Horiz      1.60E+07    9.21E+06   1.58E+07   1.75E+07  1.77E+07   1.27E+07    1.52E+07
 Subsmp Vert       2.55E+07    1.47E+07   2.20E+07   2.93E+07  3.99E+07   2.87E+07    2.63E+07
 Synth Filt        7,308       4,900      8,349      4,917     7,138      5,140       6,522
 Transform         11,527      7,990      20,491     81,985    14,455     10,408      27,290
 Arith. Mean       1.0E+07     2.9E+06    6.1E+06    6.6E+06   7.5E+06    5.4E+06     6.7E+06
 Geom. Mean        3.7E+04     2.2E+04    4.1E+04    4.7E+04   5.3E+04    3.8E+04     4.2E+04

SIMD Time (ns)
 Kernel            Athlon 500  Alpha 500  P-III 500  G4 500    USIIi 360  USIIi 500*  Average
 Add Block         114         350        152        423       342        246         276
 Block Match       257         379        381        437       852        613         461
 Clip              4,559       -          6,336      3,427     -          -           4,774
 Color Space       1.29E+07    1.30E+07   1.08E+07   1.30E+07  2.30E+07   1.66E+07    1.45E+07
 DCT               1,438       -          1,228      907       3,353      2,415       1,732
 FFT               63,446      -          84,661     116,828   -          -           88,312
 IDCT              827         -          993        847       2,508      1,806       1,294
 Max Val           604         1,168      669        618       8,715      6,275       2,355
 Mix               341         -          528        446       1,716      1,235       758
 Quantize          18,557      -          15,010     14,335    -          -           15,967
 Short Term Anal   9,887       -          11,003     3,729     15,009     10,806      9,907
 Short Term Synth  7,425       -          5,759      3,672     15,202     10,945      8,015
 Subsmp Horiz      1.13E+07    9.26E+06   1.17E+07   1.08E+07  1.95E+07   1.41E+07    1.25E+07
 Subsmp Vert       1.08E+07    4.49E+06   1.00E+07   3.07E+06  1.05E+07   7.53E+06    7.77E+06
 Synth Filt        2,585       -          4,718      2,727     -          -           3,343
 Transform         4,597       -          4,338      2,998     -          -           3,978
 Arith. Mean       2.2E+06     4.5E+06    2.0E+06    1.7E+06   4.8E+06    3.5E+06     2.2E+06
 Geom. Mean        1.1E+04     6.6E+04    1.2E+04    1.0E+04   3.2E+04    2.3E+04     1.5E+04

SIMD:C Performance Ratio
 Kernel            Athlon 500  Alpha 500  P-III 500  G4 500    USIIi 500*  Average
 Add Block         9.6x        1.9x       6.4x       2.5x      4.9x        5.0x
 Block Match       7.0x        2.6x       4.9x       3.4x      2.8x        4.1x
 Clip              1.6x        -          1.3x       2.6x      -           1.8x
 Color Space       9.4x        1.7x       5.6x       4.5x      2.7x        4.8x
 DCT               16.9x       -          31.1x      36.6x     9.2x        23.5x
 FFT               2.0x        -          1.7x       2.0x      -           1.9x
 IDCT              3.6x        -          2.8x       3.9x      1.5x        3.0x
 Max Val           4.0x        1.0x       3.8x       6.6x      0.7x        3.2x
 Mix               2.8x        -          2.0x       3.8x      2.1x        2.7x
 Quantize          1.8x        -          2.5x       2.1x      -           2.1x
 Short Term Anal   1.5x        -          1.5x       6.5x      2.5x        3.0x
 Short Term Synth  2.9x        -          7.6x       6.5x      3.2x        5.0x
 Subsmp Horiz      1.4x        1.0x       1.3x       1.6x      0.9x        1.3x
 Subsmp Vert       2.4x        3.3x       2.2x       9.6x      3.8x        4.2x
 Synth Filt        2.8x        -          1.8x       1.8x      -           2.1x
 Transform         2.5x        -          4.7x       27.4x     -           11.5x
 Arith. Mean       4.5x        1.9x       5.1x       7.6x      3.1x        5.0x
 Geom. Mean        3.4x        1.7x       3.3x       4.6x      2.5x        3.6x

C %DM
 Kernel            Athlon 500  Alpha 500  P-III 500  G4 500    USIIi 500*
 Add Block         -9%         33%        3%         -6%       -20%
 Block Match       -15%        38%        -18%       6%        -11%
 Clip              11%         -3%        5%         -7%       -6%
 Color Space       -99%        64%        2%         6%        27%
 DCT               7%          52%        -46%       -27%      15%
 FFT               4%          48%        -13%       -78%      39%
 IDCT              -14%        51%        -6%        -27%      -4%
 Max Val           18%         61%        13%        -40%      -52%
 Mix               31%         55%        23%        -24%      -85%
 Quantize          -18%        33%        -27%       -1%       14%
 Short Term Anal   18%         40%        13%        -30%      -42%
 Short Term Synth  24%         33%        -52%       17%       -22%
 Subsmp Horiz      -12%        35%        -11%       -23%      10%
 Subsmp Vert       -6%         39%        9%         -22%      -19%
 Synth Filt        -19%        20%        -36%       20%       16%
 Transform         56%         70%        23%        -210%     61%
 Mean              -1%         42%        -7%        -28%      -5%

SIMD %DM
 Kernel            Athlon 500  Alpha 500  P-III 500  G4 500    USIIi 500*
 Add Block         56%         -36%       41%        -65%      4%
 Block Match       38%         8%         8%         -6%       -48%
 Clip              28%         -35%       0%         46%       -39%
 Color Space       3%          2%         19%        2%        -25%
 DCT               -9%         52%        7%         31%       -82%
 FFT               23%         18%        -3%        -42%      3%
 IDCT              28%         -11%       14%        26%       -57%
 Max Val           68%         37%        64%        67%       -236%
 Mix               46%         3%         17%        30%       -95%
 Quantize          0%          -6%        19%        23%       -36%
 Short Term Anal   -6%         -19%       -18%       60%       -16%
 Short Term Synth  21%         -104%      39%        61%       -16%
 Subsmp Horiz      1%          19%        -2%        5%        -23%
 Subsmp Vert       -50%        38%        -40%       57%       -5%
 Synth Filt        36%         -22%       -18%       32%       -28%
 Transform         24%         -32%       28%        51%       -72%
 Mean              19%         -6%        11%        24%       -48%

Table 5. Kernel Performance Statistics - From top to bottom: C time (ns), SIMD time (ns), SIMD speedup relative to C, C percent deviation from the mean (%DM) and SIMD %DM (see text for further details on this metric). A dash (-) indicates a kernel that was not implemented in SIMD due to insufficient instruction set functionality. The DCT kernel was originally coded in floating point, but implemented in fixed point integer for the SIMD codes. (*) Sun UltraSPARC IIi 500 MHz values are interpolated from measurements taken on a 360 MHz system.
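The %DM rows of Table 5 are consistent with each 500 MHz platform's percent deviation from the five-platform mean time for a kernel, with positive values meaning faster than average. A minimal C sketch of this reading of the metric (pct_dm is an illustrative helper of ours, not code from the study):

```c
#include <assert.h>

/* Percent deviation from the mean (%DM): positive means faster
   (better) than the average of the platforms, negative means
   slower. times_ns[] holds one kernel's times, one per platform;
   'which' selects the platform being scored. */
static double pct_dm(const double times_ns[], int n, int which)
{
    double mean = 0.0;
    for (int i = 0; i < n; i++)
        mean += times_ns[i];
    mean /= n;
    return 100.0 * (mean - times_ns[which]) / mean;
}
```

Feeding in the five 500 MHz C times for the add block kernel (1,087, 669, 969, 1,054 and 1,197 ns) reproduces the published -9% (Athlon) and +33% (Alpha) figures.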

and other data dependent codes such as conditional moves are categorized as "control flow" type instructions.

We can see that SIMD kernels utilize a significant number of scalar operations for pointers, loop variables and clean up code, evidence that sharing the integer data path is not a good idea. Intel's SSE is a 128-bit wide extension, as compared to AMD's 64-bit wide 3DNow!, explaining why Intel's overall instruction count is lower by about 1 billion (22%) instructions. The same reasoning applies to Motorola's G4, which has the 128-bit wide (both packed integer and floating point) AltiVec extension. DEC's bloated instruction count is due to the fact that its MVI extension is very limited in functionality (13 instructions in total), and so many operations need to be done with scalar instructions.

Data communication operations represent the overhead necessary to put data in a format amenable to SIMD operations. Ideally, these types of instructions should make up as small a part of the dynamic instruction mix as possible. Table 4 reveals that the Motorola AltiVec extension executes the largest percentage of these instructions. This is due to two factors: 1) the wider register width means that it is less likely that the data is arranged correctly as first loaded from memory, and 2) data communication operations are used by AltiVec to simulate unaligned access to memory.

6. Performance Comparison

6.1. SIMD Speedup over C

In order to establish a baseline for performance, the average execution time of the C and SIMD assembly versions of the BMKL were measured on each platform (Table 5). A speedup from 2x-5x was typical, although some kernels were sped up considerably more. Note that the C compiler for the Apple system is from a pre-release version of MacOS X (developer's preview 3) and is known to be weak, inflating the speedup of AltiVec over C.

The DCT and IDCT algorithms are of the same computational order of magnitude, so it might seem strange that they demonstrate such different performance improvements (Table 5). The reason for this is that the algorithm employed by the original MPEG-2 encoder DCT code (Listing 4) is not very efficient - it uses double precision floating point where fixed point integer should provide sufficient precision (and is typically faster on most architectures) [41]. Unlike the original scalar forward DCT code, which was computed in floating point, the scalar inverse DCT source code (Listing 3) is written in fixed point integer. The SIMD forward and inverse DCT are both computed in fixed-point integer. This meant that it was not appropriate to use the original code as the


static INT32 lutab[10000];
INT32    l_end;
FLOAT64  xr[];
INT32    ix[];
FLOAT64  *istep_p;
FLOAT32  temp;

for (i = 0; i < l_end; i++) {
    temp = xr[i] * (*istep_p);
    /* ix[i] = temp * sqrt(temp) + 0.4054; small values are taken
       from the lutab[] lookup table (not shown) */
    ix[i] = (INT32)(temp * sqrt(temp) + 0.4054);
}

Listing 17. Quantize


latency. From Table 5, we can see that overall SIMD AMD Athlon code does 19.1% better than average, while the Pentium III is 10.9% better than the average of all of the multimedia instruction sets. Both chips perform well, but the AMD Athlon matches or outperforms the Intel chip on all but one SIMD floating point kernel (quantize).

Quantization is a process by which a real value is converted to a discrete integer representation. In the quantization kernel, an array of real double precision floating point values, xr[], is converted to an array of 32-bit integers, ix[], according to the function ix[i] = xr[i] * sqrt(xr[i]) + 0.4054. Note that the original C implementation utilizes a lookup table for some values (not shown in Listing 17). Although SIMD floating point instruction sets include approximations for 1/sqrt(x) rather than sqrt(x), the equivalence sqrt(x) = x * (1/sqrt(x)) can be used. The reason for AMD's lackluster performance on the quantize kernel is that AMD's reciprocal square root instructions are actually scalar - they only compute the result for the lower packed element. This costs 3DNow! performance on this highly square-root intensive kernel.

6.2.2. DEC Alpha 21264

From Table 5 we see that the DEC Alpha platform far outstrips the other systems in terms of compiled C performance. It has been claimed that a broader multimedia instruction set would not be useful on Alpha, as an extension like Intel's MMX only fixes x86 legacy-architecture performance deficiencies which are not present in the Alpha architecture [34]. Our performance comparison shows this argument to be false, as the kernels programmed with the extensions from AMD, Intel and Motorola were able to not only match the performance of those on the Alpha 21264, but often exceed it. This indicates that sophisticated multimedia instruction sets are far more than a means of overcoming x86 architectural limitations.

6.2.3. Motorola G4

Motorola's AltiVec was the only multimedia extension which was architected from the ground up - all of the others in some way leverage existing resources. The instructions included in AltiVec are far more general than those required by multimedia applications. Almost every possible operation and data type is supported, which should allow it to be applied to other application domains as well. A 128-bit register width, combined with latencies that are at least as good as those found on the other (64-bit) architectures, allows AltiVec to come in with the best overall performance (23.7% better than average on SIMD accelerated code, according to Table 5). However, performance on a few of the kernels is still poor, especially add block and FFT. The add block kernel's problem has already been discussed - the 128-bit register width is actually too long for this kernel, causing unnecessary data to be loaded from memory, since there is no 64-bit short load in AltiVec, although there are 8, 16, and 32-bit short loads. The reason for the poor FFT performance is not entirely clear, although revisiting our code revealed that the way in which we coded stores to unaligned memory could be improved slightly.

6.2.4. Sun UltraSPARC IIi

The performance of the VIS multimedia extension was mediocre at best (even when scaled from 360 to 500 MHz). It is the oldest multimedia instruction set we have looked at, the second to be released after HP's MAX-1 (1996). As we have pointed out earlier, this instruction set suffers from some odd instruction choices (e.g. multiplication primitives), missing functionality (no subword shift operations) and an over-utilized graphics status register that is a bottleneck.

7. New Directions

In this section we describe two new ideas for future multimedia extensions based on our coding experience.

7.1. Fat Subwords

With current SIMD instruction sets, data is loaded into registers in the same packed format that it is stored in memory. Frequently, unpacking is required before operations are performed, and the unpacked data of course no longer fits within a single register. We therefore propose:

· registers (and therefore subwords) that are wider than the data to be loaded into them
· implicit unpack with load
· implicit pack (and saturate) with store

A fat subword architecture is in some ways similar to the as yet unimplemented MIPS MDMX instruction set [24]. The MIPS MDMX extension has a single 192-bit accumulator register as its architectural cornerstone, with the more usual style of register to register SIMD operations also included; non-accumulator SIMD operations share the 64-bit floating point datapath. The destination of normal SIMD instructions can be either another SIMD register or the accumulator (to be loaded or accumulated). When accumulating packed unsigned bytes, the accumulator is partitioned into eight 24-bit unsigned slices. Packed 16-bit operations cause the accumulator to be split into four 48-bit sections.

What is good about the MDMX accumulator approach is that implicit width promotion provides an elegant solution to overflow and other issues caused by packing data as tightly as possible. For example, multiplication is semantically difficult to deal with on SIMD architectures because the result of a multiply is longer than either operand [23]. This problem is avoided by the MDMX accumulator, because there is enough extra space to prevent overflow. Fixed point arithmetic is also more precise because full precision results can be accumulated, and the total rounded only once at the end of the loop. Similarly, most scalar multimedia algorithms only specify saturation at the end of computation because it is more precise. For example, if we are adding three signed 16-bit values:

Saturate Every Step: 32760 + 50 - 20 = 32747
Saturate Last Step:  32760 + 50 - 20 = 32767

(In the first case the intermediate sum 32760 + 50 saturates to 32767 before the subtraction; deferring saturation yields 32790, which saturates to 32767.) Unfortunately, in SIMD architectures where the packed elements maximally fill the available register space, the choice is either to be imprecise (compute with saturation at every step) or lose parallelism (explicitly promote the input data to be wider and

UINT8 *rowp, *y_p, *u_p, *v_p;
INT32 red = *rowp++, green = *rowp++, blue = *rowp++;
*y_p++ = +0.2558*red + 0.5022*green + 0.0975*blue +  16.5;
*u_p++ = -0.1476*red - 0.2899*green + 0.4375*blue + 128.5;
*v_p++ = +0.4375*red - 0.3664*green - 0.0711*blue + 128.5;

Listing 18. Color Space Conversion

rgb_to_yuv:
    oris    r11,r11,RED_PATTERN
    oris    r12,r12,GREEN_PATTERN
    oris    r13,r13,BLUE_PATTERN
    lvxstrd v28,0,r3,r11 ;; v28: |r0|r1|r2|r3|r4|r5|r6|r7|r8|r9|rA|rB|rC|rD|rE|rF|
    lvxstrd v29,0,r3,r12 ;; v29: |g0|g1|g2|g3|g4|g5|g6|g7|g8|g9|gA|gB|gC|gD|gE|gF|
    lvxstrd v30,0,r3,r13 ;; v30: |b0|b1|b2|b3|b4|b5|b6|b7|b8|b9|bA|bB|bC|bD|bE|bF|

Listing 19. Modified Color Space Conversion
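The lvxstrd instruction in Listing 19 is hypothetical - it is the strided load proposed in Section 7.2, gathering one color channel out of band-interleaved RGB pixel data. Its effect can be modeled in scalar C (strided_load_u8 is an illustrative model of the semantics, not a proposed encoding):

```c
#include <stdint.h>

/* Software model of a strided byte load: gathers n bytes starting
   at base + offset, advancing stride bytes between elements. With
   stride = 3 and offsets 0, 1 and 2 over interleaved RGB data, it
   produces the band-separated red, green and blue vectors that
   Listing 19 loads into v28, v29 and v30. */
static void strided_load_u8(uint8_t *dst, const uint8_t *base,
                            int offset, int stride, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = base[offset + i * stride];
}
```

One such load per channel replaces the load-permute sequences otherwise needed to band-separate the pixels.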

saturate at the end). Saturating arithmetic can also produce unexpected results since, unlike normal addition, the order of operations matters.

While the MIPS MDMX solution may seem elegant in principle, it ignores the architectural aspect of actually making such a design fast. The accumulator is a singular and unique shared resource, and as such has a tendency to limit instruction level parallelism. We found that the existence of only a single accumulator would be a severe handicap to avoiding data dependencies. On a non-accumulator but otherwise superscalar architecture it is usually possible to perform some other useful, non data dependent operation in parallel so that processing can proceed at the greatest degree of instruction level parallelism possible. On MDMX all computations that need to use the accumulator must proceed serially, since accumulation is inherently dependent on the previous result.

7.1.1. Supported Data Types

To avoid the problems with MDMX, we suggest that a normal register file architecture be used (as opposed to a single accumulator), with the entire SIMD register file made wider than the width of loads and stores. For the sake of example we have chosen a width of 192-bits for our datapath (registers and functional units), corresponding to a load/store width of either 64- or 128-bits (we will discuss both of these possibilities shortly). A 64-bit only implementation would be possible, but of course with a lesser degree of potential parallelism. Another possibility, should very wide registers prove infeasible, is the use of register sets (e.g. triplets in the case of 64-bit wide registers). A similar approach was used by Cyrix in their extension to MMX (found in their 6x86MX/M-II processor) to compensate for the eight register encoding limit inherent in x86 based architectures.

Supported storage data types include those that we have seen to be of importance in multimedia for storage formats: 8-bit unsigned, 16-bit signed, 32-bit signed and single precision floating point. The data types supported by packed arithmetic operations for computation are different: 12-bit signed, 24-bit signed, 48-bit signed and single precision floating point. Depending on the algorithm it may be desirable to access memory at either 64-bit or 128-bit widths (see Table 6), which would then be unpacked to different computational widths: a smaller number of high-precision subwords (64-bit storage width) or a larger number of low-precision subwords (128-bit storage width). This also allows for better matching with the natural width of each algorithm.

          Low Precision Computation          High Precision Computation
          (128-bit total load/store width)   (64-bit total load/store width)
 Storage  Computation                        Computation
 8U       12S x 16                           24S x 8
 16S      24S x 8                            48S x 4
 32S      48S x 4                            -
 32FP     32FP x 4                           -

Table 6. Supported Data Types - data types are divided into storage (as stored in memory) and computation (the width of arithmetic and other instruction elements) widths.

Packed floating point data types present an interesting design choice - we can either operate on:

1. four single precision values in parallel (not using 16 of the bits in each register element)
2. four single precision values which have been expanded to a 48-bit extended precision format
3. six single precision values, exactly filling a 192-bit register

The downside of the second solution is that SIMD results may not exactly match scalar results. In addition, the latency of many floating point operations depends heavily on the precision being computed, so a more precise operation is a higher latency one. The third solution, although attractive for its additional data parallelism, is problematic because it would require its own set of


data rearrangement instructions based on a six element (rather than four or eight element) vector.

7.1.2. Supported Operations

Because we have fundamentally changed the treatment of multimedia data types, we also need to reexamine which operations are still valuable in this new light. Several instructions are no longer useful:

average - the only useful average instruction data type within our workload was 8-bit unsigned. Average is only really useful on existing multimedia architectures because it allows for computation without width promotion. On a fat subword architecture average instructions are unnecessary, since there would already be sufficient precision, and the requisite functionality can be synthesized through shift and add (operations which are useful in and of themselves).

pack/unpack - performed implicitly with loads and stores. Some types of width and type conversion may still be of utility (e.g. conversion between floating point and integer).

truncating multiplication - truncation predefines a set number of result bits to be thrown away. This primarily has application when multiplying n-bit fixed point values with fractional components that together take up a total of n bits of precision. Unfortunately, truncation plays havoc with precision, since it is usually desirable to truncate once, at the end of a series of fixed-point computations, rather than at every step. Because all of the high precision computational data types in our design are more than wide enough to hold the product of two storage data types, overflow should never be a problem.

In [38] we present our proposed instruction set (97 instructions in total) for a fat subword SIMD multimedia extension. Unlike existing instruction sets, which are fundamentally byte-based, this instruction set is centered on quantities which are multiples of 12 bits wide. This can be thought of as four extra bits of precision for every byte of actual storage data loaded.

7.1.3. Fat Subword Example

A small example of how to code for the proposed architecture is shown in Figure 2. Note that load and store instructions specify both computational and storage data types.

memory 0x1000: | 3|  2| 46| 7|  9| 0| 23| 2|   (eight 8-bit unsigned values)
memory 0x1008: |15|253| 34| 5|255| 2|  2| 0|   (eight 8-bit unsigned values)

lvx8u24s v0,[1000]  ;; v0: | 3|  2|  46| 7|   9| 0| 23| 2|  (eight 24-bit signed subwords)
lvx8u24s v1,[1008]  ;; v1: |15|253|  34| 5| 255| 2|  2| 0|  (eight 24-bit signed subwords)
mult24s  v2,v0,v1   ;; v2: |45|506|1564|35|2295| 0| 46| 0|  (eight 24-bit signed subwords)
stv24s8u v2,[1000]  ;; 0x1000: |45|255|255|35|255|0|46|0|   (8-bit unsigned, saturated)

Figure 2. Fat Subword Example - two vectors of eight 8-bit unsigned integers are loaded from addresses 0x1000 and 0x1008 into vector registers v0 and v1 respectively, each element being implicitly expanded to 24-bit signed notation. Vector registers v0 and v1 are then multiplied and the result placed in v2. Overflow cannot occur because the maximum width of the result is 16 bits per multiplication (the sum of the operand widths). The result vector is then simultaneously repacked to 8-bit unsigned notation (with saturation) and stored out to memory address 0x1000.

7.2. Strided Memory Access

We propose that SIMD architectures implement strided load and store instructions to make the gathering of non-adjacent data elements more efficient. This is similar to the prefetch mechanism in AltiVec, except that the data elements would be assembled together by the hardware into a single register, rather than simply loaded into the cache. Of course such a memory operation would necessarily be slower than a traditional one, but it would cut down immensely on the overhead that would have to go into reorganizing data as loaded from memory. Strided loads and stores would have three operands, vD, rA and rB, where vD is the destination fat-subword register, rA is the base address and rB contains a description of the memory access pattern:


width - width of the data elements to be loaded [2 bits]: 00 = 8 bits, 01 = 16 bits, 10 = 32 bits, 11 = 64 bits.

offset - number of bytes offset from the base address from which to begin loading [6 bits], interpreted as a signed value.

stride - number of bytes between the effective address of one element in the sequence and the next [24 bits], interpreted as a signed value.

The color space conversion kernel is an excellent example of where strided load and store instructions could be used. Pixel data consists of one or more channels or bands, with each channel representing some independent value associated with a given pixel's (x,y) position. A single channel, for example, represents grayscale, while a three (or more) channel image is typically color. The band data may be interleaved (each pixel's red, green, and blue data are adjacent in memory) or separated (e.g. the red data for adjacent pixels are adjacent in memory). In image processing algorithms such as color space conversion we operate on each channel in a different way, so band separated format is the most convenient for SIMD processing. Converting from the RGB to YCBCR color space is done through the conversion coefficients shown in Listing 18.

Listing 19 replaces thirty-eight instructions in the original AltiVec color space conversion kernel (the corresponding code fragment is listed in [38]). In the original AltiVec code, it was necessary to load six permute control vectors (each 128-bits wide) before executing the six vperm instructions required to rearrange the data into band separated format.

Traditional SIMD data communication operations have trouble with data which is not aligned on boundaries which are powers of two; in the case of color space conversion, visually adjacent pixels from each band are spaced 3 bytes apart. Strided loads and stores are by definition unaligned, so this would need to be handled by the load/store hardware in the CPU. It would also make sense to have additional versions of these instructions to provide hints to the memory hardware whenever data elements will not be of near-term use (especially if the stride is on the same or greater order as the line size). Another approach to strided data access is presented in [46], which discusses the Impulse system architecture. Impulse allows for a strided mapping that creates dense cache lines from array elements that are not contiguous in memory.

8. Summary

We will next summarize our findings by distilling those architectural features we found to be useful or beneficial and pointing out those which were not. Some "useless" features were simply not exercised within our workload, while others were clearly the source of performance bottlenecks. We distinguish between these two situations in our summary.

Existing architectures attempt to use the same computational width as storage width by default because there is no implicit unpacking or packing of data. This has resulted in a great deal of supporting functionality in order to aid or avoid the packing and unpacking of data. With a fat subword architecture, many of the reasons for these features are eliminated, creating a simpler overall design. We note in our summary where there would be differences introduced by such an approach.

8.1. Useful Features

8.1.1. Register File and Datapath

· Sharing the floating point datapath is usually preferable to sharing the integer data path, because there is no contention with pointer and loop variables, and less chance of affecting the critical path of the processor.

· Barring any other existing architectural considerations, a 128-bit wide data path is optimal for most multimedia algorithms (192 bits in the case of a fat subword architecture, although memory accesses are still 64- or 128-bits wide).

· Multimedia algorithms can take advantage of large register files - we suggest at least 16 128-bit registers, or 32 64-bit registers.

8.1.2. Data Types

· 8-bit signed data types were not used within our workload. 8-bit unsigned data types are often used for storage, but not for computation.

· 16-bit signed data types are the most common; they are the intermediate (computational) data type for video and the storage (and sometimes computational) data type for audio and speech algorithms. Unsigned 16-bit values were not used.

· 32-bit signed values are most often used for accumulation. Unsigned 32-bit values were not used within our workload.

· Single precision floating point (32-bit) is used in audio and 3D graphics (geometry) kernels.

8.1.3. Integer Arithmetic

· Saturation prevents overflow in a fast, numerically acceptable way for SIMD operations.

· Max and min operations are an efficient way to perform SIMD comparisons as well as to clamp to an arbitrary range. They are useful at all computational data widths.

· Average instructions were only used in the MPEG encode block match kernel for interpolation (8-bit unsigned data type).

· Shift operations of all types are useful at all data widths - they are critical for fixed point arithmetic, provide an efficient means for data realignment, and allow division and multiplication by powers of two.

· Sum of absolute difference instructions significantly speed up block matching, which is a large part of MPEG and other block based video encoding algorithms that employ motion compensation. Block matching almost always employs an 8-bit unsigned data type.

8.1.4. Floating Point Arithmetic

· 1/sqrt(x) approximation instructions are useful; through multiplication they can also estimate sqrt(x) and 1/x. A full


precision version of this instruction is probably unnecessary, unless it can be made very fast, as the Newton-Raphson method can be used to rapidly improve the precision of an approximation.

· Exceptions and sophisticated rounding modes are unnecessary for multimedia; in any instance where these might be used it is possible to substitute a reasonable value that will allow computation to continue unhindered, and still produce an acceptable result.

8.1.5. Data Rearrangement

· A full permute operation (as is found in AltiVec) is very flexible, but is unnecessary for most multimedia applications, where only simple data rearrangement patterns are needed. In AltiVec the vperm instruction serves double duty as a means for simulating unaligned memory access, so its capabilities are basically free.

· [22] presents a novel set of simple data communication primitives which can perform all 24 permutations of a 2x2 matrix in a single cycle on a processor with dual data communication functional units. Any larger data communication problem can be decomposed into 2x2 matrices, and data communication patterns are statically defined, so no extra register space need be wasted on permutation control vectors.

8.1.6. Memory Operations

· Silently accepting an unaligned address and forcing it to be aligned (as in AltiVec) is dangerous because it allows intermittent and difficult to track down programming errors to go unnoticed. Instead, an exception should be raised when an unaligned access occurs, or hardware should support unaligned memory access directly.

8.2. Bottlenecks and Unused Functionality

· Instruction primitives (such as the multiplication instruction primitives found in Sun's VIS) decrease instruction decoding bandwidth, increase register pressure, and are not useful in and of themselves. Even if the atomic version of an operation (such as multiplication) is slow, it is preferable to a set of primitives because it is cleaner, and it is easier to upgrade an instruction's latency in the next revision of an architecture than it is to implement an entirely new instruction (rendering useless the previous instruction and any software employing it).

· Motorola's AltiVec extension includes pixel pack and unpack instructions for converting between 32-bit true color and 16-bit color representations which were not used in our workload. AltiVec's approximations for log2 and exp2 also went without application in our workload because they are used in a different stage of 3D rendering (lighting) than the one we examined (due to its lesser computational overhead).

· Sun's VIS includes edge instructions for dealing with boundaries in 2D image processing, as well as array instructions for volumetric imaging. Neither type of instruction was found to be useful within the Berkeley multimedia workload.

· In general, a singular and unique resource, such as a control register, is a potential bottleneck if it will be highly utilized. In the case of Sun's VIS, the graphics status register (GSR) was a severe bottleneck that could have been avoided if SIMD shift instructions and better data communication primitives had been included.

8.3. New Directions

In addition to analyzing how well current multimedia instruction set features map to multimedia workloads, we also proposed two new directions for multimedia instruction sets.

· Because SIMD architectures apply the same operation to all of the elements in a packed register, there are many cases where data is not optimally organized as loaded from memory. Similar to scatter-gather operations from traditional vector architectures, we proposed implementing strided loads and stores for packed registers.

· Based on the observation that storage and computational data types are almost always different, we proposed a fat subword register architecture, which eliminates much of the explicit packing and unpacking overhead that typically makes SIMD processing progress slower than is necessary. We found that this fundamental change in how data types are handled would have significant implications for instruction set design.

9. Conclusion

We have presented a unified comparison of five multimedia instruction sets for general purpose microprocessors which discusses the advantages and disadvantages of specific features of current instruction sets. In addition to examining the five instruction sets from a functional standpoint, we have also empirically compared the performance of specific implementations of these instruction sets on real hardware. This has allowed us to measure typical speedups relative to a highly tuned compiler on a given platform as well as make cross-architecture comparisons. We have also introduced two new ideas for future multimedia instruction sets: strided memory accesses and a fat-subword register architecture.

Our analysis has been based on measurements of hand tuned SIMD kernels. Today most microprocessors found in desktop workstations are equipped with multimedia extensions that remain unused due to the lack of compiler support for these instruction sets [21]. A valuable extension to this study would be an analysis of what characteristics would make SIMD instructions most amenable to use by a compiler. In addition, multimedia applications, and therefore our target workload, are constantly evolving: MPEG-4 and MP3-Pro audio are only two examples of up and coming algorithms which demand further examination. As such new algorithms and applications come to light we will need to revisit our design choices for multimedia instruction sets.

References

[1] G.E. Allen, B.L. Evans, L.K. John, "Real-Time High-Throughput Sonar Beamforming Kernels Using Native Signal Processing and Memory Latency Hiding Techniques," Proc. 33rd IEEE Asilomar Conf. on Signals, Systems and Computers, Pacific Grove, Calif., 1999, pp. 137-141.


[2] Advanced Micro Devices, Inc., “AMD Athlon Processor x86 Code Optimization Guide,” Publication #22007, Rev. G, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf, (current Apr. 2000)
[3] Advanced Micro Devices, Inc., “AMD Athlon Processor Technical Brief,” Publication #22054, Rev. D, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22054.pdf, (current Apr. 2000)
[4] Advanced Micro Devices, Inc., “3DNow! Technology vs. KNI,” White Paper, http://www.amd.com/products/cpg/3dnow/vskni.html, (current Apr. 2000)
[5] R. Bhargava, L.K. John, B.L. Evans, R. Radhakrishnan, “Evaluating MMX Technology Using DSP and Multimedia Applications,” Proc. 31st IEEE Intl. Symp. on Microarchitecture (MICRO-31), Dallas, Texas, 1998, pp. 37-46
[6] T. Burd, “General Processor Info,” CPU Info Center, http://bwrc.eecs.berkeley.edu/CIC/summary/local/summary.pdf, (current Apr. 2000)
[7] D.A. Carlson, R.W. Castelino, R.O. Mueller, “Multimedia Extensions for a 550-MHz RISC Microprocessor,” IEEE J. of Solid-State Circuits, vol. 32, no. 11, Nov. 1997, pp. 1618-1624
[8] W. Chen, H. John Reekie, S. Bhave, E.A. Lee, “Native Signal Processing on the UltraSparc in the Ptolemy Environment,” Proc. 30th Ann. Asilomar Conf. on Signals, Systems, and Computers, Pacific Grove, Calif., 1996, vol. 2, pp. 1368-1372
[9] Compaq Computer Corporation, “Alpha 21264 Microprocessor Hardware Reference Manual,” Part No. DS-0027A-TE, Feb. 2000, http://www.support.compaq.com/alpha-tools/documentation/current/21264_EV67/ds-0027a-te_21264_hrm.pdf, (current Apr. 2000)
[10] T.M. Conte, P.K. Dubey, M.D. Jennings, R.B. Lee, A. Peleg, S. Rathnam, M. Schlansker, P. Song, A. Wolfe, “Challenges to Combining General-Purpose and Multimedia Processors,” IEEE Computer, vol. 30, no. 12, Dec. 1997, pp. 33-37
[11] C. Hansen, “MicroUnity’s MediaProcessor Architecture,” IEEE Micro, vol. 16, no. 4, Aug. 1996, pp. 34-41
[12] Intel Corporation, “Intel Introduces The Pentium Processor With MMX Technology,” Press Release, Jan. 8, 1997, http://www.intel.com/pressroom/archive/releases/dp010897.htm, (current Apr. 2000)
[13] Intel Corporation, “Intel Architecture Software Developer’s Manual, Volume 1: Basic Architecture,” Publication 243190, http://developer.intel.com/design/pentiumii/manuals/24319002.PDF, (current Apr. 2000)
[14] Intel Corporation, “IA32 Intel Architecture Software Developer’s Manual with Preliminary Intel Pentium 4 Processor Information, Volume 1: Basic Architecture,” http://developer.intel.com/design/processor/future/manuals/24547001.pdf, (current Sept. 2000)
[15] Intel Corporation, “Intel Announces New NetBurst Micro-Architecture for Pentium IV Processor,” http://www.intel.com/pressroom/archive/releases/dp082200.htm, (current Sept. 2000)
[16] Intel Corporation, “Pentium 4 Processor – Development Tools,” http://developer.intel.com/design/pentium4/devtools/, (current July 2001)
[17] J. Keshava, V. Pentkovski, “Pentium III Processor Implementation Tradeoffs,” Intel Tech. J., Quarter 2, 1999, http://developer.intel.com/technology/itj/q21999/pdf/impliment.pdf, (current Apr. 2000)
[18] T. Kientzle, “Implementing Fast DCTs,” Dr. Dobb’s J., vol. 24, no. 3, March 1999, pp. 115-119
[19] L. Kohn, G. Maturana, M. Tremblay, A. Prabhu, G. Zyner, “The Visual Instruction Set (VIS) in UltraSPARC,” Proc. Compcon ’95, San Francisco, 1995, pp. 462-469
[20] I. Kuroda, T. Nishitani, “Multimedia Processors,” Proc. IEEE, vol. 86, no. 6, June 1998, pp. 1203-1221
[21] S. Larsen, S. Amarasinghe, “Exploiting Superword Level Parallelism with Multimedia Instruction Sets,” Proc. ACM SIGPLAN ’00 Conf. on Programming Language Design and Implementation, Vancouver, BC, Canada, 2000, pp. 145-156
[22] R.B. Lee, “Subword Permutation Instructions for Two-Dimensional Multimedia Processing in MicroSIMD Architectures,” Proc. IEEE Intl. Conf. on Application-Specific Systems, Architectures, and Processors, Boston, 2000, pp. 3-14
[23] R.B. Lee, “Multimedia Extensions for General Purpose Processors,” Proc. IEEE Workshop on VLSI Signal Processing, Leicester, UK, 1997, pp. 1-15
[24] MIPS Technologies, Inc., “MIPS Extension for Digital Media with 3D,” White Paper, Mar. 12, 1997, http://www.mips.com/Documentation/isa5_tech_brf.pdf, (current Apr. 2000)
[25] Motorola Inc., “MPC7400 RISC Microprocessor User’s Manual, Rev. 0,” Document MPC7400UM/D, http://www.mot.com/SPS/PowerPC/teksupport/teklibrary/manuals/MPC7400UM.pdf, (current Apr. 2000)
[26] J. Nakashima, K. Tallman, “The VIS Advantage: Benchmark Results Chart VIS Performance,” White Paper, Oct. 1996, http://www.sun.com/microelectronics/vis/, (current Apr. 2000)
[27] H. Nguyen, L.K. John, “Exploiting SIMD Parallelism in DSP and Multimedia Algorithms Using the AltiVec Technology,” Proc. 1999 Int’l Conf. on Supercomputing, Rhodes, Greece, 1999, pp. 11-20
[28] K. Noer, “Heat Dissipation Per Square Millimeter Die Size Specifications,” http://home.worldonline.dk/~noer/, (current Apr. 2000)


[29] K.B. Normoyle, M.A. Csoppenszky, A. Tzeng, T.P. Johnson, C.D. Furman, J. Mostoufi, “UltraSPARC-IIi: Expanding the Boundaries of a System on a Chip,” IEEE Micro, vol. 18, no. 2, Mar./Apr. 1998, pp. 14-24
[30] A.D. Pimentel, P. Struik, P. van der Wolf, L.O. Hertzberger, “Hardware versus Hybrid Data Prefetching in Multimedia Processors: A Case Study,” Proc. IEEE Intl. Performance, Computing and Communications Conference, Phoenix, Ariz., 2000, pp. 525-531
[31] P. Ranganathan, S. Adve, N.P. Jouppi, “Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions,” Proc. 26th Ann. Intl. Symp. on Computer Architecture, Atlanta, 1999, pp. 124-135
[32] S. Rathnam, G. Slavenberg, “An Architectural Overview of the Programmable Multimedia Processor, TM-1,” Proc. of COMPCON ’96, Santa Clara, Calif., 1996, pp. 319-326
[33] D.S. Rice, “High-Performance Image Processing Using Special-Purpose CPU Instructions: The UltraSPARC Visual Instruction Set,” Univ. Calif. at Berkeley, Master’s Report, Mar. 1996
[34] P. Rubinfeld, B. Rose, M. McCallig, “Motion Video Instruction Extensions for Alpha,” White Paper, Oct. 1996, http://www.digital.com/alphaoem/papers/pmvi-abstract.htm, (current Apr. 2000)
[35] N.T. Slingerland, A.J. Smith, “Design and Characterization of the Berkeley Multimedia Workload,” University of California at Berkeley Technical Report CSD-00-1122, Dec. 2000
[36] N.T. Slingerland, A.J. Smith, “Cache Performance for Multimedia Applications,” Proc. of the 15th IEEE Intl. Conf. on Supercomputing, Sorrento, Italy, June 16-21, 2001, pp. 204-217
[37] N.T. Slingerland, A.J. Smith, “Multimedia Instruction Sets for General Purpose Microprocessors: A Survey,” University of California at Berkeley Technical Report CSD-00-1124, Dec. 2000
[38] N.T. Slingerland, A.J. Smith, “Measuring the Performance of Multimedia Instruction Sets,” University of California at Berkeley Technical Report CSD-00-1125, Dec. 2000
[39] A. Stiller, “Architecture Contest,” c’t Magazine, vol. 16/99, http://www.heise.de/ct/english/99/16/092/, (current Dec. 2000)
[40] P. Struik, P. van der Wolf, A.D. Pimentel, “A Combined Hardware/Software Solution for Stream Prefetching in Multimedia Applications,” Proc. 10th Ann. Symp. on Electronic Imaging (Multimedia Hardware Architectures track), San Jose, Calif., 1998, pp. 120-130
[41] M.T. Sun, ed., “IEEE Standard Specifications for the Implementation of 8x8 Inverse Discrete Cosine Transform,” IEEE Standard 1180-1990, IEEE, 1991
[42] Sun Microsystems Inc., “UltraSPARC-IIi User’s Manual,” 1997, Part No. 805-0087-01, http://www.sun.com/microelectronics/UltraSPARC-IIi/docs/805-0087.pdf, (current Apr. 2000)
[43] S. Thakkar, T. Huff, “The Internet Streaming SIMD Extensions,” Intel Tech. J., Q2 1999, http://developer.intel.com/technology/itj/q21999.htm, (current Apr. 2000)
[44] M. Tremblay, J.M. O’Connor, V. Narayanan, L. He, “VIS Speeds New Media Processing,” IEEE Micro, vol. 16, no. 4, Aug. 1996, pp. 10-20
[45] Veridian Systems Division, “Veridian Systems – VAST/AltiVec,” http://www.psrv.com/ .html, (current July 2001)
[46] L. Zhang, J.B. Carter, W. Hsieh, S.A. McKee, “Memory System Support for Image Processing,” Proc. 1999 International Conference on Parallel Architectures and Compilation Techniques, Newport Beach, Calif., 1999, pp. 98-107
