Measuring the Performance of Multimedia Instruction Sets

Nathan Slingerland, Apple Computer, Inc. ([email protected])
Alan Jay Smith, University of California at Berkeley ([email protected])

Abstract

Many instruction sets include instructions for accelerating multimedia applications such as DVD playback, speech recognition and 3-D graphics. Despite general agreement on the need to support this emerging workload, there are considerable differences between the instruction sets that have been designed to do so. In this paper we study the performance of five instruction sets on kernels extracted from a broad multimedia workload. We compare contemporary implementations of each extension against each other, as well as against the original compiled C performance. From our analysis we determine how well multimedia workloads map to current instruction sets, noting what was useful and what was not. We also propose two enhancements: fat subwords and strided memory operations.

Keywords: SIMD, subword parallel, multimedia, performance, measurement, MMX, SSE, AltiVec, VIS, MVI

1. Introduction

Specialized instructions have been introduced by microprocessor vendors to support the unique computational demands of multimedia applications. The mismatch between wide data paths and the relatively short data types found in multimedia applications has led the industry to embrace SIMD (single instruction, multiple data) style processing. Unlike traditional forms of SIMD computing, in which multiple individual processors execute the same instruction, multimedia instructions are executed by a single processor, and pack multiple short data elements into a single wide (64- or 128-bit) register, with all of the sub-elements being operated on in parallel.

The goal of this paper is to quantify how architectural differences between multimedia instruction sets translate into differences in performance. Prior studies have primarily focused on a single instruction set in isolation and have measured the performance of kernels taken from vendor provided libraries [1], [5], [8], [25], [26], [30], [32]. Our contribution is unique in that we do not focus exclusively on a single architecture, and we study the performance of kernels derived from a real, measured, general purpose multimedia workload.

Funding for this research has been provided by the State of California under the MICRO program, and by Cisco Corporation, Fujitsu Microelectronics, IBM Corporation, Maxtor Corporation, Microsoft Corporation, Toshiba Corporation and Veritas Software Corporation.

Our results are obtained from actual hardware measurements rather than through simulation, lending confidence to our conclusions.

Section 2 summarizes the multimedia workload we studied, and details the sixteen computationally important kernels which we extracted from it. Our methodology for recoding the kernels with multimedia instructions, and for measuring them, is described in Section 3. An overview of the five instruction sets, and their implementations, is given in Section 4. Our analysis is divided into two parts. Section 5 reflects our experience in coding the kernels, and lends insight into those instruction set features that we found useful. In Section 6 we compare the performance of five different processors, each of which implements a particular multimedia instruction set. Section 7 proposes two new directions for multimedia architectures on general purpose processors: fat subwords and strided memory operations. Readers who are interested in a much more detailed discussion of the material presented here should see [36].

2. Workload

The lack of a standardized multimedia benchmark has meant that workload selection is the most difficult aspect of any study of multimedia. It was for this reason that we developed the Berkeley multimedia workload, which is described in [35] (we recommend this paper to any reader concerned about how applications were selected or the technical merit and relevance of the Berkeley multimedia workload). It details why we felt existing multimedia benchmarks were unsuitable, develops and characterizes our workload, and extracts the kernels that we study here. In selecting the component applications, we strove to cover as many types of media processing as possible: image compression (DjVu, JPEG), 3-D graphics (Mesa, POVray), document rendering (Ghostscript), audio synthesis (Timidity), audio compression (ADPCM, LAME, mpg123), video compression (MPEG-2 at DVD and HDTV resolutions), speech synthesis (Rsynth), speech compression (GSM), speech recognition (Rasta) and video games (Doom). Open source software was used both for its portability (allowing for cross platform comparisons) and the fact that we could analyze the source code directly.

3. Methodology

3.1. Berkeley Multimedia Kernel Library

In order to measure the performance of the multimedia instruction sets, we distilled from our Berkeley multimedia workload a set of computationally important kernel functions, which


when taken together form the Berkeley multimedia kernel library (BMKL). Kernels were chosen based on their computational significance and amenability to hand optimization with SIMD instructions; [35] discusses the kernel selection process in detail. Table 1 lists the characteristics of the kernel codes examined.

| Kernel Name | Source Application | Data Type | Sat Arith | Native Width | Src Lines | Static Instr | % Static Instr | % CPU Cycles |
|---|---|---|---|---|---|---|---|---|
| Add Block | MPEG-2 Decode | 8-bit (U) | ✓ | 64-bits | 46 | 191 | 0.1% | 13.7% |
| Block Match | MPEG-2 Encode | 8-bit (U) | | 128-bits | 52 | 294 | 0.4% | 59.8% |
| Clip Test and Project | Mesa | FP | | limited | 95 | 447 | 0.1% | 0.8% |
| Color Space Conversion | JPEG Encode | 8-bit (U) | | limited | 22 | 78 | 0.1% | 9.8% |
| DCT | MPEG-2 Encode | 16-bit (S) | | 128-bits | 14 | 116 | 0.1% | 12.3% |
| FFT | LAME | FP | | limited | 208 | 981 | 4.4% | 14.5% |
| Inverse DCT | MPEG-2 Decode | 16-bit (S) | ✓ | 128-bits | 75 | 649 | 0.3% | 29.7% |
| Max Value | LAME | 32-bit (S) | | unlimited | 8 | 39 | 0.2% | 12.0% |
| Mix | Timidity | 16-bit (S) | ✓ | unlimited | 143 | 829 | 1.0% | 35.7% |
| Quantize | LAME | FP | | unlimited | 55 | 312 | 1.4% | 15.3% |
| Short Term Analysis Filter | GSM Encode | 16-bit (S) | ✓ | 128-bits | 15 | 79 | 0.1% | 20.2% |
| Short Term Synth Filter | GSM Decode | 16-bit (S) | ✓ | 128-bits | 15 | 114 | 0.2% | 72.7% |
| Subsample Horizontal | MPEG-2 Encode | 8-bit (U) | ✓ | 88-bits | 35 | 244 | 0.3% | 2.6% |
| Subsample Vertical | MPEG-2 Encode | 8-bit (U) | ✓ | unlimited | 76 | 478 | 0.6% | 2.1% |
| Synthesis Filtering | Mpg123 | FP | ✓ | 512-bits | 67 | 348 | 0.4% | 39.6% |
| Transform and Normalize | Mesa | FP | | limited | 51 | 354 | 0.1% | 0.7% |

Table 1. Multimedia Kernels Studied - from left to right, the columns list 1) the primary data type, specified as an N-bit ({Unsigned, Signed}) integer or floating point (FP), 2) whether saturating arithmetic is used, 3) the native width: the longest width which does not load/store/compute excess unused elements, 4) the static C source line count (for those lines which are executed), 5) the static instruction count (for those instructions which are executed), 6) the percentage of total static instructions, and 7) the percentage of total CPU time spent in the kernel. The latter three statistics are machine specific and are for the original C code on a Compaq DS20 (dual 500 MHz Alpha 21264, Tru64 Unix v5.0 Rev. 910) machine.

From a performance standpoint, no piece of code can realistically be extracted and studied in isolation. All of the parent applications in the Berkeley multimedia workload were modified to make calls to the BMKL rather than their original internal functions. By measuring our kernel codes from within real applications we could realistically include the effects of the shared resources of our test systems (e.g. CPU, caches, TLB, memory, and registers).

3.2. Coding Process

Ideally, high level language compilers would be able to systematically and automatically identify parallelizable sections of code and generate the appropriate SIMD instruction sequence. The lack of languages which allow programmers to specify data types and overflow semantics at variable declaration time has hindered the development of automated compiler support for multimedia instruction sets [10].


As is the case with digital signal processors (DSPs), the most efficient way to program with multimedia extensions is to have an expert programmer tune software using assembly language [19]. Although this is more tedious and error prone than other methods such as hand coded standard shared libraries or automatic code generation by a compiler, it is a method that is available on every platform, and it allows for the greatest flexibility and precision when coding.

We chose to code in assembly language ourselves rather than measure vendor supplied optimized libraries because the latter would make a fair cross-platform comparison next to impossible. Each vendor implements a different set of functions which presume particular (and likely different) data formats. For example, a vendor supplied color space conversion function might assume band-interleaved format when the program to be accelerated uses a band-separated structure. Some overhead would then be needed to transform data structures into the expected format, perturbing the system under test and adding unnecessary overhead.

All of the assembly codes in our study were written by the same programmer (Slingerland), with an approximately equal amount of time spent on each platform. Coding ourselves had the supplemental benefit of preventing differences in programmer ability and time spent coding from potentially skewing our comparison. This also gave us considerable insight into the reasons for performance differences, where we would otherwise be limited to simply reporting our measurements. It is important to stress that the kernel algorithms and applications in question were studied in considerable detail before assembly coding began. Our level of understanding made it possible to play the role of applications expert and thereby make educated choices about SIMD algorithms, sufficient precision and required data types; it is reflected in the development and characterization of our workload in [34].

In some cases it was not practical to code a SIMD version of a kernel because an instruction set lacked the requisite functionality. Sun's VIS and DEC's MVI do not support packed floating point, so floating point kernels were not recoded for those platforms. DEC's MVI does not support any data communication (e.g. permute, mix, merge) or subword parallel integer multiply instructions. In general, if there was no compelling opportunity for performance gain, the C version was considered to be a particular platform's "solution" to a kernel when the needed SIMD operations were not provided. Situations which called for a C solution are clearly denoted in our results, and the typically lesser performance of compiler generated code must be kept in mind when comparing against the other (hand-tuned) kernels.


3.2.1. C Compilers and Optimization Flags

The most architecturally tuned compiler on each architecture was used to compile the C reference version of the Berkeley Multimedia Kernel Library. The optimization flags used were those which give the best general speedup and highest level of optimization without resorting to tuning the compiler flags for each particular kernel. The specific compiler and associated optimization flags used on each platform are listed in [36].

4. Instruction Sets

In this paper we study five architectures with different multimedia instruction sets (Table 2 lists the parameters for the exact parts). Many of the instruction sets (AMD's 3DNow!, DEC's MVI, Intel's MMX, Sun's VIS) use 64-bit wide registers, while Motorola's AltiVec and Intel's SSE are 128-bits wide. The size of the register file available on each architecture varies widely, ranging from eight 64-bit registers on the AMD Athlon to thirty-two 128-bit wide registers with Motorola's AltiVec extension.

| Parameter | AMD Athlon | DEC Alpha 21264 | Intel Pentium III | Motorola 7400 (G4) | Sun UltraSPARC IIi |
|---|---|---|---|---|---|
| Clock (MHz) | 500 | 500 | 500 | 500 | 360 |
| SIMD Extension(s) | MMX/3DNow!+ | MVI | MMX/SSE | AltiVec | VIS |
| SIMD Instructions | 57/24 | 13 | 57/70 | 162 | 121 |
| First Shipped | 6/1999 | 2/1998 | 2/1999 | 8/1999 | 11/1998 |
| Transistors (millions) | 22.0 | 15.2 | 9.5 | 6.5 | 5.8 |
| Process (µm) [Die Size (mm²)] | 0.25 [184] | 0.25 [225] | 0.20 [105] | 0.20 [83] | 0.25 [148] |
| L1 I-Cache, D-Cache (KB) | 64, 64 | 64, 64 | 16, 16 | 32, 32 | 32, 64 |
| L2 Cache (MB) | 0.5 | 4 | 0.5 | 1 | 2 |
| Register File (# × width) | FP (8×64b) / FP (8×64b) | Int (31×64b) | FP (8×64b) / 8×128b | 32×128b | FP (32×64b) |
| Inst. Reorder Buffer Entries | 72 | 35 | 40 | 6 | 12 |
| SIMD Int, FP Multimedia Units | 2 (combined) | 2, 0 | 2 (combined) | 3, 1 | 2, 0 |
| SIMD Int Add Latency | 2 [64b/cycle] | - | 1 [64b/cycle] | 1 [128b/cycle] | 1 [64b/cycle] |
| SIMD Int Multiply Latency | 3 [64b/cycle] | - | 3 [64b/cycle] | 3 [128b/cycle] | 3 [64b/cycle] |
| SIMD FP Add Latency | 4 [64b/cycle] | - | 4 [64b/cycle] | 4 [128b/cycle] | - |
| SIMD FP Multiply Latency | 4 [64b/cycle] | - | 5 [64b/cycle] | 4 [128b/cycle] | - |
| SIMD FP ~1/sqrt Latency | 3 [32b/cycle] | - | 2 [64b/cycle] | 4 [128b/cycle] | - |

Table 2. Microprocessors Studied - All parameters are for the actual chips used in this study. A SIMD register file may be shared with the existing integer (Int) or floating point (FP) registers, or be separate. Note that some of the processors implement several multimedia extensions (e.g. the Pentium III has MMX and SSE); the corresponding parameters for each are separated with "/". A dash (-) indicates that a particular operation is not available on a given architecture, and so no latency and throughput numbers are given. DEC's MVI extension (on the 21264) does not include any of the listed operations, but all MVI instructions have a latency of 3 cycles [64b/cycle]. [2], [3], [4], [6], [7], [9], [12], [13], [16], [18], [24], [27], [28], [40]

All of the multimedia extensions support integer operations, although the types and widths of available operations vary greatly. The earliest multimedia instruction sets (e.g. Sun’s VIS and


DEC’s MVI) had design goals limited by the immaturity of their target workload and unproven benefit to performance. Any approach which greatly modified the overall architecture or significantly affected die size was out of the question, so leveraging as much functionality as possible from the existing chip architecture was a high priority. Thus, the Sun and DEC extensions do not implement subword parallel floating point instructions. Intel’s first multimedia extension, MMX (1997), had as one of its primary design goals to not require any new operating system support, while and Intel’s later SSE (1999) added new operating system maintained state (eight 128-bit wide registers) for the first time since the introduction of the Intel386 instruction set in 1985. The processors studied (Table 2) vary in terms of instruction latency as well as the throughput per clock cycle. Processor clock rates were all 500 MHz, with the exception of the Sun UltraSPARC IIi, for which only a 360 MHz system was available. Although multimedia extensions primarily focus on extracting data level parallelism, most modern microprocessors are also superscalar, and thereby allow for multiple multimedia instructions to be issued every cycle. All of the architectures we look at are fully pipelined, so barring any data dependencies, one new SIMD instruction can begin per parallel functional unit each clock cycle. Typically, there are separate functional units for SIMD integer and SIMD floating point processing, although on the x86 architectures they are combined. [35] surveys current multimedia instruction sets in more detail.

5. Instruction Set Analysis

This section encompasses our experience coding each of the sixteen kernels in five different multimedia extensions. We point out those instruction set features (data types and operations) that we found to be useful in accelerating our workload, as well as those for which we found no application or that created significant bottlenecks in current multimedia architectures. Those readers primarily interested in our performance measurements of specific implementations of these instruction sets can skip ahead to Section 6 and return to Section 5 at their leisure. The complete original C source code for each kernel, as well as the assembly language SIMD implementations of the BMKL, are available on the web at http://www.cs.berkeley.edu/~slingn/research/. Please contact the authors by email should this URL ever become invalid.


5.1. Register File and Data Path

Multimedia instruction sets can be broadly categorized according to the location and geometry of the register file upon which SIMD instructions operate. Alternatives include the reuse of the existing integer or floating point register files, or implementing an entirely separate one. The type of register file affects the width, and therefore the number of subwords that can be operated on simultaneously (vector length).

Implementing multimedia instructions on the integer data path has the advantage that the functional units for shift and logical operations need not be replicated. Subword parallel addition and subtraction are easily created by blocking the appropriate carry bits. However, modifications to the integer data path to accommodate multimedia instructions can potentially adversely affect the critical path of the integer data-path pipeline [19]. In addition, on architectures such as x86 (AMD, Intel) and PowerPC (Motorola) the integer data path is 32-bits wide, making a shared integer data path approach less compelling due to the limited amount of data level parallelism possible.

The reuse of floating point rather than integer registers has the advantage that the registers need not be shared with pointers, loop variables and other scalar values. In addition, multimedia and floating point instructions are not typically used simultaneously [19], so there is seldom any added register pressure. Finally, many architectures support at least double-precision (64-bit wide) floating point operations, which is often wider than the integer data path.

A separate data path that is not shared with the existing integer or floating point data paths has the advantage of simpler pipeline control and potentially an increased number of registers. Disadvantages include the need for saving and restoring the new registers during context switches, as well as the relative difficulty and high overhead of moving data between register files. This can potentially limit performance if scalar code depends upon a vector result.

A significant design parameter for SIMD instruction sets is vector length (the number of packed elements or subwords that can be operated on simultaneously). A vector length that is too short limits the ability to exploit data parallelism, while an excessively long vector length can degrade performance due to insufficient available data parallelism in an algorithm, as well as increase the amount of clean up code overhead. Table 1 classifies the "native width" of each of the kernels (defined below, with a carry-blocking code sketch after the definitions):


unlimited width - data elements are truly independent and are naturally arranged so as to be amenable to SIMD processing. Program loops can be strip mined at almost any length, with the increase in performance of a longer vector being directly proportional to the increase in vector length.

limited width - data elements are independent, but there is some overhead required to rearrange input data so that it may be operated on in a SIMD manner. The performance advantage of longer vectors is limited by the overhead (which typically increases with vector length) required to employ them.

exact width - a precise natural width which can be considered to be the right match for the kernel. This width is the longest width that does not load, store, or compute unused elements.

Most multimedia kernels are either unlimited/limited, or most often have an exact required width of 128-bits. The remaining exact native widths (64-, 88- and 512-bits) come up only once each. Thus, we consider a total vector length of 128-bits to be optimal for most multimedia algorithms.
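To illustrate the carry-blocking technique for shared integer data paths mentioned above, the following minimal C sketch (our own illustration, not code from the BMKL) performs a packed modulo 16-bit add inside an ordinary 64-bit integer:

    #include <stdint.h>

    /* Add four packed 16-bit lanes held in 64-bit words without letting
     * carries propagate across lane boundaries. The high bit of each lane
     * is masked off so an overflow out of bit 15 of one lane can never
     * reach bit 0 of the next; the true high bits are then restored. */
    static uint64_t paddw64(uint64_t a, uint64_t b)
    {
        const uint64_t H = 0x8000800080008000ULL;  /* per-lane high bits */
        uint64_t sum = (a & ~H) + (b & ~H);        /* carries stop at bit 15 */
        return sum ^ ((a ^ b) & H);                /* restore high bits */
    }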

5.1.1. Number of Registers

Multimedia applications and their kernels can generally take advantage of quite large register files. This rule of thumb is echoed in the architectures of dedicated media processors: MicroUnity's media processor has a 64 x 64-bit register file (also accessible as 128 x 32-bits) [11], while the Philips Trimedia TM-1 has a 128 x 32-bit register file [31].

The DCT and IDCT kernels are excellent examples of where the importance of a large register file can be seen. The two-dimensional discrete cosine transform (DCT) is the algorithmic centerpiece of many lossy image compression methods; it is similar to the discrete Fourier transform (DFT) in that it maps values from the time domain to the frequency domain, producing an array of coefficients representing frequencies [17]. The inverse DCT maps in the opposite direction, from the frequency to the time domain. A two dimensional (2-D) DCT or IDCT is efficiently computed by one dimensional (1-D) transforms on each row, followed by 1-D transforms on each column, and then scaling the resulting matrix appropriately. A SIMD approach requires that multiple data elements from several iterations be operated on in parallel for the greatest efficiency. Computation is straightforward for the 1-D column DCT, since the corresponding elements of each loop iteration are adjacent in memory (assuming a row-major storage format). However, performing a 1-D row


DCT is more problematic since the corresponding elements of adjacent rows are not adjacent in memory. The solution is to employ a matrix transposition (making corresponding “row” elements adjacent in memory), perform the desired computation, and then re-transpose the matrix to put the result matrix back into the correct configuration. We found this technique to only be of performance benefit on those architectures with register files that were able to hold the entire 16-bit 8x8 matrix (at least 16 64-bit or 8 128-bit registers).
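The interleave-based transpose technique can be sketched with SSE's _MM_TRANSPOSE4_PS macro from xmmintrin.h; this 4x4 single precision version is only an illustration of the idea (the kernels apply the same unpack/merge approach to 8x8 matrices of 16-bit values):

    #include <xmmintrin.h>

    /* Transpose a 4x4 single precision matrix held in four SSE registers.
     * The macro expands to a short sequence of unpack (merge) and shuffle
     * operations, the same technique used to transpose an 8x8 16-bit
     * matrix for the row-wise 1-D DCT. */
    static void transpose4x4(float m[4][4])
    {
        __m128 r0 = _mm_loadu_ps(m[0]);
        __m128 r1 = _mm_loadu_ps(m[1]);
        __m128 r2 = _mm_loadu_ps(m[2]);
        __m128 r3 = _mm_loadu_ps(m[3]);
        _MM_TRANSPOSE4_PS(r0, r1, r2, r3);
        _mm_storeu_ps(m[0], r0);
        _mm_storeu_ps(m[1], r1);
        _mm_storeu_ps(m[2], r2);
        _mm_storeu_ps(m[3], r3);
    }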

5.2. Data Types

The data types supported by the different multimedia instruction sets include {signed, unsigned} {8, 16, 32, 64}-bit values, as well as single precision floating point. Most of the instruction sets do not support all of these types, and usually support only a subset of operations on each. To determine which data types and operations are useful, we looked at the dynamic SIMD instruction counts for the kernels when running the Berkeley multimedia workload. We then broke down the dynamic SIMD instruction counts on each architecture in two ways: data type distribution per instruction class (e.g. add, multiply) and per kernel. Here we discuss only the salient features of these categorizations; the full set of results is available in [36].

It is useful to distinguish an application's storage data type (how data is stored in memory or on disk) from its computational data type (the intermediate data type used to compute a result). In general, storage data types, although narrow and therefore potentially offering the greatest degrees of parallelism, are insufficiently wide for intermediate computations to occur without overflow. Image and video data is typically stored as packed unsigned 8-bit values, but intermediate processing usually requires precision greater than 8-bits. With the exception of width promotion, most 8-bit functionality was wasted on our set of multimedia kernels. The signed 16-bit data type is the most heavily used because it is both the native data type for audio and speech data, as well as the typical intermediate data type for video and imaging. On the wide end of the spectrum, 32-bit and longer data types are typically only used for accumulation and simple data communication operations such as alignment. Operations on 32-bit values tend to be limited to addition and subtraction (for accumulation), width promotion and demotion (for converting to a narrower output data type) and shifts (for data alignment).


Single precision floating point plays an important role in many of the multimedia kernels, such as the geometry stage of 3-D rendering (the clipping and transform kernels) and the FFT, where the wide dynamic range of floating point is required. Double precision SIMD is uncommon; currently it is only available with Intel's SSE2 extension on the Pentium 4 processor (released after this study was completed). This data type is targeted more towards applications such as scientific and engineering workloads rather than multimedia [14], [15], and has no use in the kernels we examine.

5.3. Operations

One of the primary goals of this work is to separate useful SIMD operations from those that are not. The large differences between current multimedia instruction sets for general purpose processors are fertile ground for making such a determination because many different design choices have been made. Please note the very important caveat that this determination is made with respect to supporting our multimedia workload. We explicitly comment on those instructions which we found to be "useless" for our workload but for which we believe the designers had some other purpose or workload in mind.

5.3.1. Modulo/Saturation

Modulo arithmetic "wraps around" to the next representable value when overflow occurs, while saturating arithmetic clamps the output value to the highest or lowest representable value for positive and negative overflow respectively. Saturation is useful because of its desirable aesthetic result in imaging and audio computations. For example, when adding pixels in an imaging application, modulo addition is undesirable because if overflow occurs, a small change in operand values may result in a glaring visual difference (e.g. the sum of two bright pixels can wrap around to a dark value). Saturating arithmetic also allows overflow in multiple packed elements to be dealt with efficiently. If overflow within packed elements were to be handled in the same way as traditional scalar arithmetic, an overflowing SIMD operation would have to be repeated and tested serially to determine which element overflowed. The only added cost of saturating arithmetic is separate instructions for signed and unsigned operands, since the values must be interpreted by the hardware as a particular data type. This is unlike modulo operations, for which the same instruction can be used for both unsigned and signed 2's complement values.
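The contrast is easy to see in scalar C; a minimal sketch (our illustration, not BMKL code) of modulo versus saturating 8-bit pixel addition:

    #include <stdint.h>

    /* Modulo addition wraps: 250 + 10 = 4, so a bright pixel goes dark. */
    static uint8_t add_modulo(uint8_t a, uint8_t b)
    {
        return (uint8_t)(a + b);              /* wraps around at 256 */
    }

    /* Saturating addition clamps: 250 + 10 = 255, so it stays bright. */
    static uint8_t add_saturate(uint8_t a, uint8_t b)
    {
        unsigned sum = (unsigned)a + b;
        return (uint8_t)(sum > 255 ? 255 : sum);
    }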


The kernels in the BMKL that employ saturating arithmetic are noted in Table 1. From this it is clear that the most important types of saturating arithmetic are unsigned 8-bit and signed 16-bit integers. Saturating 32-bit operations are of little value since overflow is not often a concern for such a wide data type. Despite the utility of saturating operations, modulo computations are still important because they allow results to be numerically identical to those of existing scalar algorithms. This is sometimes a consideration for the sake of compatibility and comparability.

5.3.2. Shift

SIMD shift operations provide an inexpensive way to perform multiplication and division by powers of two. They are also crucial for supporting fixed point integer arithmetic, in which a common sequence of operations is the multiplication of an N-bit integer by an M-bit fixed point fractional constant, producing an (N+M)-bit result with a binary point at the Mth most significant bit. At the end of computation, the final result is rounded by adding the fixed point fraction representing 0.5, and then shifting right M bits to eliminate the fractional bits.
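A concrete worked instance of this multiply/round/shift idiom (a sketch of our own; the constant and Q14 format are illustrative assumptions, not taken from a BMKL kernel):

    #include <stdint.h>

    /* Multiply a signed 16-bit sample by the fractional constant 0.7071
     * in Q14 fixed point: round(0.7071 * 2^14) = 11585. The 32-bit
     * product carries 14 fraction bits, so add 0.5 in Q14 (1 << 13)
     * and shift right 14 bits to round back to an integer result. */
    static int16_t fixmul_q14(int16_t x)
    {
        const int32_t C = 11585;            /* 0.7071 in Q14 */
        int32_t product = (int32_t)x * C;   /* fixed point product */
        return (int16_t)((product + (1 << 13)) >> 14);
    }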

5.3.3. Min/Max and Comparisons

Min and max instructions output the minimum or maximum values, respectively, of the corresponding elements in two input registers. A max instruction is of obvious utility in the maximum value search kernel, which searches through an array of signed 32-bit integers for the maximum absolute value. Max and min instructions have other, less apparent uses as well. Signed minimum and maximum operations are often used with a constant second operand to saturate results to arbitrary ranges (this is the case in the IDCT kernel, which clips its output value range to -256...+255 (9-bit signed integer)). Architecturally, the implementation cost of max and min instructions should be low, since the necessary comparators must already exist for saturating arithmetic. The only difference is that instead of comparing to a constant, a register value is used. We have found that in many cases where comparisons seem to be required, max and min instructions are sufficient. Subword integer comparisons were only useful on architectures lacking max/min operations (e.g. Sun's VIS). In general the same limited usefulness held true for subword floating point comparisons, with the exception of a specialized comparison used in the project and clip test kernel. In that kernel, three dimensional (3-D) objects are first mapped to 2-D space through matrix multiplication of 1x4 vectors and 4x4 matrices. Objects are then clipped to the viewable area to avoid unnecessary rendering. Motorola's AltiVec includes a specialized


comparison instruction, vcmpbfp, which handles this boundary testing. The four coordinates of a vertex {x, y, z, w} are tested in parallel to see if any clipping is needed; a branch instruction can then act as a fast out if no clipping is necessary. This technique is extremely effective because no clipping is the common case, with most vertices within screen boundaries. For the Mesa “gears” application, the fast out case held true for 61946 of the 64560 (96.0%) clipping tests performed in our application run of 30 frames rendered at 1024x768 resolution.
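The branch-free range clamp described above for the IDCT can be sketched with the MMX/SSE integer min/max intrinsics (a hypothetical fragment of our own, not the BMKL code):

    #include <xmmintrin.h>   /* SSE supplies _mm_min_pi16/_mm_max_pi16 */

    /* Clamp four signed 16-bit IDCT outputs to the 9-bit signed range
     * -256..+255 using min/max against constant registers; no per-element
     * overflow tests or branches are needed. */
    static __m64 clamp_idct(__m64 v)
    {
        const __m64 lo = _mm_set1_pi16(-256);
        const __m64 hi = _mm_set1_pi16(255);
        return _mm_min_pi16(_mm_max_pi16(v, lo), hi);
    }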

5.3.4. Sum of Absolute Differences

A sum of absolute differences (SAD) instruction operates on a pair of packed 8-bit unsigned input registers, summing the absolute differences between the respective packed elements and placing (or accumulating) the scalar sum in another register. The block match kernel sums the absolute differences between the corresponding pixels (8-bit unsigned) of two 16x16 macroblocks, and is the only kernel in which SAD instructions are used. Although DEC's MVI extension is quite small (only 13 instructions), one of the few operations that DEC did include was SAD. DEC's architects found (in agreement with our experience) that this operation provides the most performance benefit of all multimedia extension operations [33]. Interestingly, Intel's MMX, although a much richer set of instructions, did not include this operation (it was later included in both AMD's enhanced 3DNow! as well as Intel's SSE extensions to MMX). Sun's VIS also includes a sum of absolute differences instruction.

| Processor | Instruction | Latency : Throughput |
|---|---|---|
| AMD Athlon | PSADBW | 3 cycles : 64b every cycle |
| DEC Alpha | PERR | 2 cycles : 64b every cycle |
| Intel Pentium III | PSADBW | 5 cycles : 64b every 2 cycles |
| Sun UltraSPARC IIi | PDIST | 3 cycles : 64b every cycle |

Table 3. SAD Instruction Survey

The Motorola G4 microprocessor was the only CPU in our survey that did not include some form of SAD operation, forcing us to synthesize it from other instructions. Although Intel’s SSE extension includes the psadbw instruction, this offers only a one cycle (per macroblock row) performance advantage when compared to an AltiVec implementation. Despite the fact that Intel has a performance win with a dedicated SAD instruction, several instructions are still required due to the fact that MMX is a 64-bit wide extension and each row is 128-bits long. To


estimate the latency of a hypothetical SAD instruction for a 128-bit extension such as AltiVec, we first examine the latency of this instruction on other architectures (Table 3). The latency of a 128-bit SAD instruction would be higher than that of the 64-bit instructions listed in the table because the instruction requires a cascade of adders to sum (reduce) the differences between the elements. An N-bit SAD instruction (M=N/8) can be broken down into three steps: 1) calculate M 8-bit differences, 2) compute the absolute value of the differences, 3) perform log2(M) cascaded summations. The architects of the 64-bit DEC MVI extension comment that a 3-cycle implementation of PERR would have been easily attainable, but that in the end they achieved a more aggressive 2-cycle instruction [7]. If a SAD operation were to be implemented in Motorola's AltiVec, we estimate it would have a latency of 4 cycles; certainly superior to the existing 9 cycle multi-instruction solution.
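For reference, the scalar computation that a SAD instruction collapses into a few operations looks like the following C sketch (our illustration of the block match inner loop, not the tuned BMKL code):

    #include <stdint.h>

    /* Sum of absolute differences over one 16x16 macroblock; psadbw,
     * PERR and PDIST each reduce a row's worth of this work to one or
     * two SIMD instructions. */
    static unsigned sad_16x16(const uint8_t *a, const uint8_t *b, int stride)
    {
        unsigned sad = 0;
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++) {
                int d = a[x] - b[x];
                sad += (unsigned)(d < 0 ? -d : d);
            }
            a += stride;
            b += stride;
        }
        return sad;
    }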

5.3.5. Average

In addition to the sum of absolute differences, half-pixel interpolation (of which MPEG-2 encoding offers three varieties) is also important. Interpolation is done by averaging a set of pixel values with pixels spatially offset by one position horizontally, vertically or both. Integer average operations were only used in the block match kernel (for interpolation); the 8-bit unsigned average was the only variety of average instruction used within our workload.
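The operation itself is simple; SIMD average instructions typically compute a rounding average per element without the intermediate sum overflowing the element width, as in this scalar sketch (our illustration):

    #include <stdint.h>

    /* Rounding average used for half-pixel interpolation: (a + b + 1) >> 1,
     * computed in a wider type so the intermediate sum cannot overflow. */
    static uint8_t avg_round(uint8_t a, uint8_t b)
    {
        return (uint8_t)(((unsigned)a + (unsigned)b + 1) >> 1);
    }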

5.3.6. High Latency Function Approximation

3-D rendering applications use floating point math functions such as reciprocal (1/x) and reciprocal square root (1/√x) that are typically very high latency. In the transform and normalize kernel, graphics primitives are transformed to the viewer's frame of reference through matrix multiplications, an operation which has at its heart a floating point reciprocal square root. On some architectures scalar versions of these functions are computed in software, while others have hardware instructions. The computation of these functions is iterative, so an operation's latency is directly proportional to the number of bits of precision returned; full IEEE compliant operations return 24 bits of mantissa. All of the SIMD floating point extensions (AMD's 3DNow!, Intel's SSE and Motorola's AltiVec) include approximation instructions for 1/x and 1/√x. These are typically implemented


as hardware lookup tables, returning k bits of precision. In Intel's SSE, for example, the approximate reciprocal (rcp) and reciprocal square root (rsqrt) instructions return 12 bits of mantissa. One unique aspect of Intel's SSE instruction set is that not only does it include approximations of 1/x and 1/√x, but it also includes full precision (24 bits of mantissa) versions of x/y and √x. Of course, this added precision comes at a price: much higher latency than the pipelined approximations (in fact, Intel's full precision instructions are not pipelined on the Pentium III).

A better approach is to use the Newton-Raphson method to improve the accuracy of an approximation in software through specially derived functions. In the case of 1/√a it is possible to iteratively increase the precision of an initial approximation x_0 through the equation:

$x_1 = x_0 - (0.5 \cdot a \cdot x_0^3 - 0.5 \cdot x_0) = 0.5 \cdot x_0 \cdot (3.0 - a \cdot x_0^2)$

Employing an approximation instruction in conjunction with the Newton-Raphson method to achieve full precision is actually faster than the full precision version of the instruction that Intel provides (25 vs. 36 cycles). One iteration of the Newton-Raphson method is enough to improve a 12-bit approximation to the full 24-bit precision of IEEE single precision. The cost of the Newton-Raphson method is the register space needed to hold intermediate values and constants. AMD's 3DNow! extension circumvents this by including instructions that internally perform the Newton-Raphson iteration, rather than having the programmer explicitly implement it. The only odd thing about AMD's reciprocal square root instructions is that they are actually scalar; they produce a single result value based on the lower packed element. These instructions are part of 3DNow! rather than the normal floating point instruction set, and they expect input data in packed form (64 bits comprising two 32-bit single precision subwords). It is therefore necessary to execute two Newton-Raphson sequences to compute both subwords when using AMD's 3DNow! extension (20 cycles per subword).
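In SSE this refinement takes only a handful of instructions; a minimal sketch using the rcp/rsqrt-style intrinsics from xmmintrin.h (our illustration, not the exact BMKL code):

    #include <xmmintrin.h>

    /* Refine SSE's ~12-bit rsqrt estimate with one Newton-Raphson step,
     * x1 = 0.5 * x0 * (3 - a * x0^2), yielding roughly full single
     * precision for four packed elements at once. */
    static __m128 rsqrt_nr(__m128 a)
    {
        __m128 x0  = _mm_rsqrt_ps(a);                    /* 12-bit estimate */
        __m128 ax2 = _mm_mul_ps(a, _mm_mul_ps(x0, x0));  /* a * x0^2 */
        __m128 t   = _mm_sub_ps(_mm_set1_ps(3.0f), ax2); /* 3 - a*x0^2 */
        return _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x0), t);
    }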

5.3.7. Floating Point Rounding and Exception Handling

The techniques which SIMD instruction sets employ to deal with floating point rounding and exception handling are very similar to those used for integer overflow. Checking result flags or generating an exception from a packed operation requires considerable time to determine which packed element caused the problem. Fortunately, in most cases where an exception might be raised it is possible to fill in a value which will give reasonable results for most applications


and continue computation without interruption. In the case of rounding, it is often acceptable to incur a slight loss in precision and only use the flush to zero (FTZ) rounding mode [41]. Flush to zero replaces a result which underflows (a number too small to be represented in single precision floating point) with zero. This speeds execution because no explicit error condition checking need be done, and is similar to saturating integer arithmetic, where maximum or minimum result values are substituted rather than checking for and reporting positive or negative overflow. Only Intel's SSE implements full IEEE compliant SIMD floating point exceptions and all four IEEE rounding modes.

5.3.8. Type Conversion

Width promotion is the expansion of an N-bit value to some larger width. For unsigned fixed point numbers this requires zero extension (filling any additional bits with zeros). Zero extension is usually not specified as such in a multimedia architecture because it overlaps in functionality with data rearrangement instructions such as unpack or merge: packed values can be merged with another register which has been zeroed prior to merging. Signed element unpacking is not as simple, and is rarely supported directly by hardware; only the AltiVec instruction set includes it. It can be synthesized with multiplication by one, since a multiplication yields a result that has the total overall width of both its operands.

Video and imaging algorithms often utilize an 8-bit unsigned data type for storage, but 16-bits for computation, making 8-bit to 16-bit zero extension quite useful. Audio and speech algorithms typically employ signed 16-bit values, but because multiplication by a fractional fixed point constant is a common operation, these values are often unpacked as a natural consequence of computation. So, although a signed unpack operation would likely be faster than multiplication by 1, it is seldom necessary to resort to this in practice.
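The unsigned case, merging with a zeroed register, looks like this with the MMX intrinsics (a sketch of our own; punpcklbw/punpckhbw are the underlying instructions):

    #include <mmintrin.h>

    /* Zero-extend eight packed unsigned 8-bit pixels to 16 bits by
     * interleaving them with a zeroed register: each byte is merged
     * with a zero byte, which is exactly unsigned width promotion. */
    static void widen_u8_to_u16(__m64 pixels, __m64 *lo, __m64 *hi)
    {
        __m64 zero = _mm_setzero_si64();
        *lo = _mm_unpacklo_pi8(pixels, zero);  /* low four pixels */
        *hi = _mm_unpackhi_pi8(pixels, zero);  /* high four pixels */
    }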

5.3.9. Data Rearrangement

SIMD instructions perform the same operation on multiple data elements. Because all of the data within a register must be treated identically, the ability to efficiently rearrange data bytes within and between registers is critical to performance. We will refer to this class of instructions as "data communication" instructions. Interleave or merge instructions (also referred to as mixing or unpacking) merge alternate data elements from the upper or lower halves of the elements in each of two source registers. Align or rotate operations allow for arbitrary byte-boundary data realignment of the data in two source registers; essentially a shift operation that is performed in


byte multiples. Both interleave and align type operations have hard coded data communication patterns. Insert and extract operations allow for a specific packed element to be extracted as a scalar or a scalar value to be inserted to a specified location. Shuffle (also called permute) operations allow for greater flexibility than those operations with fixed communication patterns, but this added flexibility requires that the communication pattern be specified either in a third source register or as an immediate value in part of the instruction encoding. [21] presents a novel set of simple data communication primitives which can perform all 24 permutations of a 2x2 matrix in a single cycle on a processor with dual data communication functional units. This is useful because any larger data communication problem can be decomposed into 2x2 matrices. Although this approach might make some very complex data communication patterns slow to compute, we have found that most multimedia algorithms have patterns which are relatively simple. Because of this we endorse [21]’s technique for supplying the data communication needs of multimedia applications. A related class of instructions that we found to be quite useful when programming with the AltiVec extension was “splat” instructions. A splat operation places either an immediate scalar or specified element from a source register into every element of the destination register. This was very useful when constants were required; on other architectures it is necessary to statically store these types of values in memory, and then load them into a register as needed.

5.3.10. Prefetching

Prefetching is a hardware or software technique which tries to predict data access needs in advance, overlapping memory access with useful computation. Although we will not otherwise examine the performance implications of prefetch instructions (which would be a useful extension to this study), we mention them briefly because they are often a part of multimedia instruction sets due to the highly predictable nature of data accesses in multimedia applications. Prefetching can be implemented in software, hardware or both, with each technique distinguished by its methods for detection (discovering when prefetching will be of benefit) and synchronization (the means for ensuring that data is available in the cache before it is needed, but not so far in advance that it pollutes the cache by purging still needed items) [38].

Many multimedia instruction sets take a purely software approach by providing prefetch instructions that fetch only the cache block at a specified address into the cache from main memory (without blocking normal load/store instruction accesses). Determining the ideal


location for this type of prefetch instruction depends on many architectural parameters. Unfortunately, these include such things as the number of CPU clocks for memory latency and the number of CPU clocks to transfer a cache line, which are both highly machine dependent and not readily available to the programmer.

Another technique is a software-hardware hybrid approach in which a software-directed hardware prefetching scheme is used. One variation of this is implemented by Motorola's AltiVec with its data stream touch (dst) instruction, which indicates the memory sequence or pattern that is likely to be accessed; a separate prefetch instruction need not be issued for each data element, as hardware optimizes the number of cache blocks to prefetch. Thus, it is not necessary for the programmer to know the parameters of the cache system. Currently there is no means for synchronization in AltiVec implementations: the data specified by a dst is brought into the cache as quickly as possible rather than at the same rate as its consumption. Thus, it is currently necessary to periodically issue dst instructions specifying short access patterns rather than a single long instruction. This limits the potential benefit of this technique over a pure software approach. Several hardware techniques for maintaining the synchronization of data prefetch and consumption in hybrid solutions, as well as pure hardware prefetching, are discussed in [29].
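A pure software prefetch loop has the following shape (a sketch using GCC's __builtin_prefetch; the lookahead distance of 256 samples is an illustrative guess rather than a tuned value, precisely because the right distance is machine dependent):

    /* Prefetch well ahead of the current index so the memory access
     * overlaps the multiply; the second argument (0) requests a read,
     * and the third (0) hints the data need not be kept in cache. */
    void scale_samples(short *dst, const short *src, int n, short gain)
    {
        for (int i = 0; i < n; i++) {
            __builtin_prefetch(src + i + 256, 0, 0);
            dst[i] = (short)((src[i] * gain) >> 8);
        }
    }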

5.4. Bottlenecks and Unused Functionality

In this section we discuss those features which appear in multimedia instruction sets but do not appear to be useful, and are not "free"; i.e. they are not a low (or no) cost side effect of some other useful feature.

5.4.1. Instruction Primitives

The VIS instruction set does not include full 16-bit multiply instructions. It instead offers multiplication primitives, the results of which must be combined through addition (see Listing 1 and Listing 2).

    fmul8sux16  %f0, %f2, %f4
    fmul8ulx16  %f0, %f2, %f6
    fpadd16     %f4, %f6, %f8

Listing 1. Sun VIS 16-bit x 16-bit → 16-bit Multiply

    fmuld8sux16 %f0, %f2, %f4
    fmuld8ulx16 %f0, %f2, %f6
    fpadd32     %f4, %f6, %f8

Listing 2. Sun VIS 16-bit x 16-bit → 32-bit Multiply


The reason that the architects of VIS divided up 16-bit multiplication in this way was to decrease die area. Not providing a full 16x16 multiplier subunit cut the size of the arrays in half [42]. Unfortunately, dividing an operation into several instructions which are not otherwise useful in and of themselves increases register pressure, decreases instruction decoding bandwidth and creates additional data dependencies. Splitting SIMD instructions (which have been introduced for their ability to extract data parallelism) can actually cripple a superscalar processor’s ability to extract instruction level parallelism. A multi-cycle operation can be a better solution than a multi-instruction operation because instruction latencies can be transparently upgraded in future processors, while poor instruction semantics cannot be repaired without adding new instructions.

5.4.2. Unused High Latency Approximation Instructions

Floating point 1/x approximation instructions, although available on several platforms, did not find application in the kernels we studied. AltiVec includes approximate log2 (logarithm base 2) and exp2 (2 raised to a power) instructions, which can potentially be used in the lighting stage of a 3-D rendering pipeline. Lighting calculations are currently handled by 3-D accelerator cards, not the CPU, and so were not included in the BMKL.

5.4.3. Unused Pixel Conversion Instructions

Motorola's AltiVec extension includes pixel pack (vpkpx) and pixel unpack (vupkhpx, vupklpx) instructions for converting between 32-bit true color and 16-bit color representations. These did not find application within the BMKL, although it is possible that they might be of utility in situations where AltiVec needs to operate on 16-bit color data; many video games use 16-bit textures, for example.

5.4.4. Unused Memory Access Instructions

Sun's VIS includes two sets of instructions for accelerating multimedia operations with sophisticated memory addressing needs. The first, edge8, edge16, and edge32, produce bit vectors to be used in conjunction with partial store instructions to deal with the boundaries of 2-D images. The second group of addressing instructions includes array8, array16 and array32, which find use in volumetric imaging (the process of displaying a two dimensional slice of three dimensional data). An array instruction converts (x,y,z) coordinates into a memory


address. The Berkeley multimedia workload does not include any volumetric imaging applications, so it is not surprising that these instructions found no use in our workload.

5.4.5. Singular, Highly Utilized Resources

Although we usually think of SIMD architectures as extracting data level parallelism, all of the implementations of the instruction sets we have examined are also superscalar, with multiple parallel SIMD functional units. In fact, unless the SIMD vector length is long enough to hold the entire data set being operated on, there is almost always the potential to extract instruction as well as data level parallelism.

Sun's VIS architecture does not include subword parallel shift instructions. Instead, the graphics status register (GSR) has a 3-bit addr_offset field that is used implicitly for byte granularity alignment, as well as a 4-bit scale_factor field for packing/truncation operations. The VIS GSR is a serializing bottleneck because any time packing or alignment functionality is needed, it must be pushed through the GSR. Because VIS lacks subword shift operations, we found ourselves synthesizing such operations with the packing and alignment operations whenever we could not otherwise avoid them. Even with careful planning of packing and alignment operations it was often necessary to write to the GSR (a slow operation) several times during each iteration of the loops of our multimedia kernels. The serializing effect of this singular resource prevented VIS operations from proceeding at the fullest possible degree of (instruction) parallelism.

5.5. Overall Instruction Mix

Table 4 shows the total mix of dynamic instructions executed by each architecture. Counts are for the code within the kernels only, and do not include instructions from the remainder of each application. "Control state" instructions are those that interact with special purpose machine registers (e.g. the GSR on Sun's VIS). Branches and other data dependent codes, such as conditional moves, are categorized as "control flow" type instructions. We can see that the SIMD kernels execute a significant number of scalar operations for pointers, loop variables and clean up code, evidence that sharing the integer data path is not a good idea.

Intel's SSE is a 128-bit wide extension, as compared to AMD's 64-bit wide 3DNow!, explaining why Intel's overall instruction count is lower by about 1 billion (22%) instructions. The same reasoning applies to Motorola's G4, which has the 128-bit wide (both packed integer and floating point) AltiVec extension. DEC's bloated instruction count is due to the fact that


their MVI extension is very limited in functionality (13 instructions in total), and so many operations need to be done with scalar instructions.

| Instruction Class | AMD Athlon | DEC 21264A | Intel Pentium III | Motorola G4 | Sun UltraSPARC IIi |
|---|---|---|---|---|---|
| Int Load/Store | 5.4% | 20.3% | 6.2% | 2.4% | 2.6% |
| Int Arithmetic | 12.1% | 29.2% | 12.3% | 19.1% | 26.6% |
| Int Control Flow | 12.8% | 11.3% | 13.4% | 13.8% | 9.3% |
| Int Data Communication | 2.3% | 6.7% | 2.1% | 1.1% | 0.0% |
| FP Load/Store | 3.2% | 11.2% | 4.0% | 7.4% | 8.0% |
| FP Arithmetic | 0.0% | 18.1% | 0.0% | 2.3% | 8.4% |
| FP Control Flow | 0.0% | 0.1% | 0.0% | 0.0% | 6.2% |
| FP Data Communication | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| SIMD Load/Store | 19.4% | n/i | 21.0% | 17.9% | 11.7% |
| SIMD Data Communication | 12.3% | n/i | 12.6% | 17.0% | 9.7% |
| SIMD Int Arithmetic | 12.5% | 3.1% | 18.8% | 11.8% | 13.6% |
| SIMD Int Control Flow | 0.0% | n/i | 0.0% | 0.0% | 3.5% |
| SIMD FP Arithmetic | 19.9% | n/i | 9.3% | 6.9% | n/i |
| SIMD FP Control Flow | 0.0% | n/i | 0.0% | 0.0% | n/i |
| Control State | 0.2% | 0.0% | 0.2% | 0.3% | 0.4% |
| Total Count | 4.49E+09 | 7.45E+09 | 3.52E+09 | 4.53E+09 | 5.60E+09 |

Table 4. Overall Instruction Mix - values are percentages of the total counts listed at the bottom of each column. Control state instructions are those that interact with special purpose registers. Control flow instructions include branches as well as conditional moves and comparisons. Entries marked "n/i" indicate functionality that was not implemented by a particular architecture (as opposed to implemented but unused, which is indicated by 0.0%).

Data communication operations represent the overhead necessary to put data in a format amenable to SIMD operations. Ideally, these types of instructions should make up as small a part of the dynamic instruction mix as possible. Table 4 reveals that the Motorola AltiVec extension executes the largest percentage of these instructions. This is because a wider register makes it less likely that data is arranged correctly when first loaded from memory. Also, data communication operations are used by AltiVec to simulate unaligned access to memory (typically a separate instruction on other architectures).

6. Performance Comparison

6.1. SIMD Speedup over C

In order to establish a baseline for performance, the average execution times of the C and SIMD assembly versions of the BMKL were measured on each platform (Table 5). A speedup of 2x-5x was typical, although some kernels were sped up considerably more. Note that the C compiler for the Apple system is from a pre-release version of MacOS X (developer's preview 3) and is known to be weak, inflating the speedup of AltiVec over C.


(a) C time (ns)

| Kernel | Athlon 500 | 21264 500 | P-III 500 | G4 500 | USIIi 360 | USIIi 500* | Average |
|---|---|---|---|---|---|---|---|
| Add Block | 1,087 | 669 | 969 | 1,054 | 1,662 | 1,197 | 1,088 |
| Block Match | 1,801 | 979 | 1,855 | 1,466 | 2,408 | 1,734 | 1,702 |
| Clip Test and Project | 7,374 | 8,591 | 7,938 | 8,906 | 12,222 | 8,800 | 9,006 |
| Color Space Conversion | 1.22E+08 | 2.24E+07 | 6.01E+07 | 5.80E+07 | 6.20E+07 | 4.47E+07 | 6.49E+07 |
| DCT | 24,332 | 12,475 | 38,192 | 33,176 | 30,922 | 22,264 | 27,819 |
| FFT | 125,484 | 67,321 | 147,047 | 232,272 | 111,258 | 80,106 | 136,677 |
| Inverse DCT | 2,999 | 1,276 | 2,772 | 3,326 | 3,782 | 2,723 | 2,831 |
| Max Value | 2,407 | 1,143 | 2,536 | 4,110 | 6,165 | 4,439 | 3,272 |
| Mix | 956 | 616 | 1,065 | 1,710 | 3,550 | 2,556 | 1,579 |
| Quantize | 34,233 | 19,549 | 37,100 | 29,488 | 34,928 | 25,148 | 31,059 |
| Short Term Analysis Filter | 15,268 | 11,120 | 16,129 | 24,156 | 36,773 | 26,477 | 20,689 |
| Short Term Synth Filter | 21,849 | 19,208 | 43,582 | 23,721 | 48,660 | 35,035 | 31,404 |
| Subsample Horizontal | 1.60E+07 | 9.21E+06 | 1.58E+07 | 1.75E+07 | 1.77E+07 | 1.27E+07 | 1.52E+07 |
| Subsample Vertical | 2.55E+07 | 1.47E+07 | 2.20E+07 | 2.93E+07 | 3.99E+07 | 2.87E+07 | 2.63E+07 |
| Synthesis Filtering | 7,308 | 4,900 | 8,349 | 4,917 | 7,138 | 5,140 | 6,522 |
| Transform and Normalize | 11,527 | 7,990 | 20,491 | 81,985 | 14,455 | 10,408 | 27,290 |
| Arithmetic mean | 1.0E+07 | 2.9E+06 | 6.1E+06 | 6.6E+06 | 7.5E+06 | 5.4E+06 | 6.7E+06 |
| Geometric mean | 3.7E+04 | 2.2E+04 | 4.1E+04 | 4.7E+04 | 5.3E+04 | 3.8E+04 | 4.2E+04 |

(b) SIMD time (ns)

| Kernel | Athlon 500 | 21264 500 | P-III 500 | G4 500 | USIIi 360 | USIIi 500* | Average |
|---|---|---|---|---|---|---|---|
| Add Block | 114 | 350 | 152 | 423 | 342 | 246 | 276 |
| Block Match | 257 | 379 | 381 | 437 | 852 | 613 | 461 |
| Clip Test and Project | 4,559 | - | 6,336 | 3,427 | - | - | 4,774 |
| Color Space Conversion | 1.29E+07 | 1.30E+07 | 1.08E+07 | 1.30E+07 | 2.30E+07 | 1.66E+07 | 1.45E+07 |
| DCT | 1,438 | - | 1,228 | 907 | 3,353 | 2,415 | 1,732 |
| FFT | 63,446 | - | 84,661 | 116,828 | - | - | 88,312 |
| Inverse DCT | 827 | - | 993 | 847 | 2,508 | 1,806 | 1,294 |
| Max Value | 604 | 1,168 | 669 | 618 | 8,715 | 6,275 | 2,355 |
| Mix | 341 | - | 528 | 446 | 1,716 | 1,235 | 758 |
| Quantize | 18,557 | - | 15,010 | 14,335 | - | - | 15,967 |
| Short Term Analysis Filter | 9,887 | - | 11,003 | 3,729 | 15,009 | 10,806 | 9,907 |
| Short Term Synth Filter | 7,425 | - | 5,759 | 3,672 | 15,202 | 10,945 | 8,015 |
| Subsample Horizontal | 1.13E+07 | 9.26E+06 | 1.17E+07 | 1.08E+07 | 1.95E+07 | 1.41E+07 | 1.25E+07 |
| Subsample Vertical | 1.08E+07 | 4.49E+06 | 1.00E+07 | 3.07E+06 | 1.05E+07 | 7.53E+06 | 7.77E+06 |
| Synthesis Filtering | 2,585 | - | 4,718 | 2,727 | - | - | 3,343 |
| Transform and Normalize | 4,597 | - | 4,338 | 2,998 | - | - | 3,978 |
| Arithmetic mean | 2.2E+06 | 4.5E+06 | 2.0E+06 | 1.7E+06 | 4.8E+06 | 3.5E+06 | 2.2E+06 |
| Geometric mean | 1.1E+04 | 6.6E+04 | 1.2E+04 | 1.0E+04 | 3.2E+04 | 2.3E+04 | 1.5E+04 |

(c) SIMD speedup relative to C

| Kernel | Athlon 500 | 21264 500 | P-III 500 | G4 500 | USIIi 500* | Average |
|---|---|---|---|---|---|---|
| Add Block | 9.6x | 1.9x | 6.4x | 2.5x | 4.9x | 5.0x |
| Block Match | 7.0x | 2.6x | 4.9x | 3.4x | 2.8x | 4.1x |
| Clip Test and Project | 1.6x | - | 1.3x | 2.6x | - | 1.8x |
| Color Space Conversion | 9.4x | 1.7x | 5.6x | 4.5x | 2.7x | 4.8x |
| DCT | 16.9x | - | 31.1x | 36.6x | 9.2x | 23.5x |
| FFT | 2.0x | - | 1.7x | 2.0x | - | 1.9x |
| Inverse DCT | 3.6x | - | 2.8x | 3.9x | 1.5x | 3.0x |
| Max Value | 4.0x | 1.0x | 3.8x | 6.6x | 0.7x | 3.2x |
| Mix | 2.8x | - | 2.0x | 3.8x | 2.1x | 2.7x |
| Quantize | 1.8x | - | 2.5x | 2.1x | - | 2.1x |
| Short Term Analysis Filter | 1.5x | - | 1.5x | 6.5x | 2.5x | 3.0x |
| Short Term Synth Filter | 2.9x | - | 7.6x | 6.5x | 3.2x | 5.0x |
| Subsample Horizontal | 1.4x | 1.0x | 1.3x | 1.6x | 0.9x | 1.3x |
| Subsample Vertical | 2.4x | 3.3x | 2.2x | 9.6x | 3.8x | 4.2x |
| Synthesis Filtering | 2.8x | - | 1.8x | 1.8x | - | 2.1x |
| Transform and Normalize | 2.5x | - | 4.7x | 27.4x | - | 11.5x |
| Arithmetic mean | 4.5x | 1.9x | 5.1x | 7.6x | 3.1x | 5.0x |
| Geometric mean | 3.4x | 1.7x | 3.3x | 4.6x | 2.5x | 3.6x |

(d) C percent deviation from the mean (%DM)

| Kernel | Athlon 500 | 21264 500 | P-III 500 | G4 500 | USIIi 500* |
|---|---|---|---|---|---|
| Add Block | -9% | 33% | 3% | -6% | -20% |
| Block Match | -15% | 38% | -18% | 6% | -11% |
| Clip Test and Project | 11% | -3% | 5% | -7% | -6% |
| Color Space Conversion | -99% | 64% | 2% | 6% | 27% |
| DCT | 7% | 52% | -46% | -27% | 15% |
| FFT | 4% | 48% | -13% | -78% | 39% |
| Inverse DCT | -14% | 51% | -6% | -27% | -4% |
| Max Value | 18% | 61% | 13% | -40% | -52% |
| Mix | 31% | 55% | 23% | -24% | -85% |
| Quantize | -18% | 33% | -27% | -1% | 14% |
| Short Term Analysis Filter | 18% | 40% | 13% | -30% | -42% |
| Short Term Synth Filter | 24% | 33% | -52% | 17% | -22% |
| Subsample Horizontal | -12% | 35% | -11% | -23% | 10% |
| Subsample Vertical | -6% | 39% | 9% | -22% | -19% |
| Synthesis Filtering | -19% | 20% | -36% | 20% | 16% |
| Transform and Normalize | 56% | 70% | 23% | -210% | 61% |
| Average %DM | -1% | 42% | -7% | -28% | -5% |

(e) SIMD %DM

| Kernel | Athlon 500 | 21264 500 | P-III 500 | G4 500 | USIIi 500* |
|---|---|---|---|---|---|
| Add Block | 56% | -36% | 41% | -65% | 4% |
| Block Match | 38% | 8% | 8% | -6% | -48% |
| Clip Test and Project | 28% | -35% | 0% | 46% | -39% |
| Color Space Conversion | 3% | 2% | 19% | 2% | -25% |
| DCT | -9% | 52% | 7% | 31% | -82% |
| FFT | 23% | 18% | -3% | -42% | 3% |
| Inverse DCT | 28% | -11% | 14% | 26% | -57% |
| Max Value | 68% | 37% | 64% | 67% | -236% |
| Mix | 46% | 3% | 17% | 30% | -95% |
| Quantize | 0% | -6% | 19% | 23% | -36% |
| Short Term Analysis Filter | -6% | -19% | -18% | 60% | -16% |
| Short Term Synth Filter | 21% | -104% | 39% | 61% | -16% |
| Subsample Horizontal | 1% | 19% | -2% | 5% | -23% |
| Subsample Vertical | -50% | 38% | -40% | 57% | -5% |
| Synthesis Filtering | 36% | -22% | -18% | 32% | -28% |
| Transform and Normalize | 24% | -32% | 28% | 51% | -72% |
| Average %DM | 19% | -6% | 11% | 24% | -48% |

Table 5. Kernel Performance Statistics - (a) C time (ns), (b) SIMD time (ns), (c) SIMD speedup relative to C, (d) C percent deviation from the mean (%DM) and (e) SIMD %DM (see Section 6.2 for this metric). Column headings abbreviate AMD Athlon, DEC Alpha 21264, Intel Pentium III, Motorola 7400 (G4) and Sun UltraSPARC IIi, each followed by its clock rate in MHz. A dash (-) marks kernels that were not implemented in SIMD due to insufficient instruction set functionality; in those cases the C version serves as that platform's solution, and it is the C time that enters the %DM comparison. The DCT kernel was originally coded in floating point, but implemented in fixed point integer for the SIMD codes. (*) Sun UltraSPARC IIi 500 MHz values are extrapolated from measurements taken on a 360 MHz system.

The DCT and IDCT algorithms are of the same computational order of magnitude, so it might seem strange that they demonstrate such different performance improvements (Table 5). The reason is that the DCT kernel was originally coded in floating point, but implemented in fixed point integer for the SIMD codes, while the IDCT is in fixed point integer in both the original and SIMD versions. Fixed point integer calculations are faster than floating point on most architectures, and provide sufficient precision for the DCT and IDCT in video applications [39].

6.2. Cross-Architecture Comparison

All of our performance measurements use the metric of time rather than instruction counts, since the instruction sets do not necessarily do the same amount of work (computation) in the same number of instructions, nor take the same amount of time to perform similar instructions. Our analysis is complicated by the fact that the Sun UltraSPARC IIi used in our study has a 360 MHz clock, while the remainder of the chips are 500 MHz parts. Of course clock


frequency is not the only determinant of performance; there can be significantly disparate amounts of work accomplished between machines in each cycle depending on the pipeline length and number of functional units available. We have chosen to present scaled results (by 500/360 = 1.4x) alongside our measurements from the actual 360 MHz system in order to place an upper bound on UltraSPARC IIi performance and eliminate a degree of freedom from our comparison. Certainly whether or not the scaled results are a fair basis for comparison is debatable. Even now (October 2001) the maximum available clock frequency for the UltraSPARC IIi is 480MHz, while the other architectures in our study have scaled considerably more in frequency. The semiconductor process of the UltraSPARC IIi (see Table 2) is comparable to that of the other processors in our study, so it is not clear that scaling is appropriate. The slower clock frequency of the UltraSPARC IIi most likely stems from greater delays between latches. To measure how well or how poorly a given platform performs relative to the competition we use the metric of percent deviation from the mean:

    %DMi = 100 × ( t̄ − ti ) / t̄

where ti is the time taken on platform i, and t̄ is the average time across all of the platforms examined for a particular kernel. This metric indicates how much better (positive) or worse (negative) than average a platform's performance is, and provides a normalized result for computing the average improvement or degradation in performance. Note that we use the scaled (500 MHz) UltraSPARC IIi values in computing the mean t̄. Table 5 lists the average %DM (the arithmetic average of the %DM values for all of the kernels on a given platform), which is a good general measure of how well a platform stacks up against the other architectures.
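For concreteness, the metric can be stated as a small C routine (a sketch of ours; the function and variable names are illustrative):

    /* Percent deviation from the mean for one kernel: t[i] is the time
       on platform i; positive %DM means faster (better) than average. */
    static void pct_dev_from_mean(const double t[], double dm[], int n)
    {
        double mean = 0.0;
        for (int i = 0; i < n; i++)
            mean += t[i];
        mean /= n;
        for (int i = 0; i < n; i++)
            dm[i] = 100.0 * (mean - t[i]) / mean;
    }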

6.2.1. AMD Athlon, Intel Pentium III At first glance the AMD Athlon and Intel Pentium III processors in our study appear to be very similar: both run at 500 MHz, and both share Intel's MMX SIMD integer extension. In fact, because we implemented the Intel kernel set first, it was possible to reuse much of the same code for the AMD SIMD versions of many of the integer kernels in the BMKL. The SIMD integer kernels include all but the clip test, FFT, quantize, synthesis filter and transform kernels. It is interesting to observe the differences in performance between the two processors on what is often identical code. The important microarchitectural differences include [37]:


· Athlon has 64 KB L1 instruction and data caches, while the Pentium III's are each 16 KB
· Athlon has three instruction decoders working in parallel, while the Pentium III has two
· Athlon's pipeline is 10 stages long, while the Pentium III's is 12-17 stages; the cost of branches on the Pentium III is exacerbated by its weaker branch prediction unit
· Intel's MMX instruction latencies are lower (1-cycle add/sub, 1-cycle shuffle, 3-cycle multiply) than the Athlon's (2-cycle add/sub, 2-cycle shuffle, 3-cycle multiply)
· AMD's 3DNow! floating point latencies (4-cycle add/sub, 4-cycle multiply) are lower than Intel's SSE (4-cycle add/sub, 5-cycle multiply); both sustain a throughput of 64 bits per cycle

So, although the SIMD integer instruction set is the same in both cases, AMD's Athlon is the more powerful SIMD integer implementation; its only deficiency is the higher latency of simple SIMD integer instructions compared to the Pentium III. We can see this very clearly in Table 5, although where each architectural difference comes into play depends on the kernel. One interesting case is that of the DCT and IDCT kernels: the Pentium III is faster on the DCT, while the Athlon is faster on the IDCT. This is surprising given that the two kernels are algorithmically quite similar - each is a mirror image of the other. Upon further investigation, we found that the DCT code uses shuffle instructions (pshufw) in several places that are not scheduled well for the Athlon's higher shuffle latency, a circumstance which was not duplicated in the IDCT code (see the sketch below). In terms of multimedia, the greatest difference between the two processors is their SIMD floating point extensions. AMD's 3DNow! is quite similar to MMX, reusing the eight x87 floating point registers as 64-bit wide SIMD registers. Intel's SSE has new 128-bit wide registers and instructions, although in the Pentium III implementation each 128-bit instruction is actually decoded into two 64-bit wide micro-ops. This gives the Pentium III more overall register space to work with, although it has the same number of architectural registers. For floating point operations, the AMD Athlon eliminates the only deficiency it had relative to the Pentium III on SIMD integer code - namely, slightly higher instruction latency. From Table 5 we can see that overall the Athlon's SIMD code does 19% better than average, while the Pentium III's is 11% better than the average of all of the multimedia instruction sets. Both chips perform well, but the Athlon matches or outperforms the Intel chip on all but one SIMD floating point kernel (quantize).
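To illustrate the instruction in question, the following fragment uses the pshufw intrinsic to reverse the four 16-bit words of a 64-bit MMX register in a single operation; the function and its use are ours for illustration, not code from the DCT kernel:

    #include <xmmintrin.h>   /* pshufw intrinsic (SSE extensions to MMX) */

    /* Reverse the four 16-bit words of an MMX register with one pshufw.
       On the Pentium III this shuffle has 1-cycle latency; on the Athlon
       it takes 2 cycles, which is what hurt our DCT scheduling.          */
    static __m64 reverse_words(__m64 v)
    {
        return _mm_shuffle_pi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    }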


Quantization is a process by which a real value is converted to a discrete integer representation. In the quantization kernel, an array of double precision floating point values, xr[], is converted to an array of 32-bit integers, ix[], according to the function

    ix[i] = √( xr[i] × √xr[i] ) + 0.4054

that is, each ix[i] is xr[i] raised to the 3/4 power plus a rounding bias, truncated to an integer. Note that the original C implementation utilizes a lookup table for small values (the table's initialization is not shown in Listing 3), along the following lines:

    static INT32  lutab[10000];
    INT32         l_end;
    FLOAT64       xr[];
    INT32         ix[];
    FLOAT64      *istep_p;
    FLOAT32       temp;

    for (i = 0; i < l_end; i++) {
        temp = (*istep_p) * xr[i];              /* scale by quantizer step */
        if (temp < 10000.0)
            ix[i] = lutab[(INT32)temp];         /* small values via lookup */
        else
            ix[i] = (INT32)(sqrt(temp * sqrt(temp)) + 0.4054);
    }
    Listing 3. Quantization

Although SIMD floating point instruction sets include approximations for 1/√x rather than √x, the equivalence √x = x × (1/√x) can be used. The reason for AMD's lackluster performance on the quantize kernel is that AMD's reciprocal square root instructions are actually scalar: they compute the result only for the lower packed element. This costs 3DNow! performance on this highly square-root-intensive kernel.
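A minimal sketch of the identity with SSE intrinsics (the function name is ours; our kernels themselves were hand-scheduled assembly):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Approximate sqrt of four packed floats at once via the reciprocal
       square root estimate, using sqrt(x) = x * (1/sqrt(x)).  rsqrtps
       yields a ~12-bit estimate; a Newton-Raphson step would refine it. */
    static __m128 sqrt4_via_rsqrt(__m128 x)
    {
        return _mm_mul_ps(x, _mm_rsqrt_ps(x));
    }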

6.2.2. DEC Alpha 21264 From Table 5 we see that the DEC Alpha platform far outstrips the other systems in terms of compiled C performance. It has been claimed that a broader multimedia instruction set would not be useful on Alpha, as an extension like Intel's MMX only fixes x86 legacy-architecture performance deficiencies which are not present in the Alpha architecture [33]. Our performance comparison shows this argument to be false: the kernels programmed with the extensions from AMD, Intel and Motorola not only matched the performance of those on the Alpha 21264, but often exceeded it. This indicates that sophisticated multimedia instruction sets are far more than a means of overcoming x86 architectural limitations.

6.2.3. Motorola G4 Motorola's AltiVec was the only multimedia extension architected from the ground up; all of the others in some way leverage existing resources. The instructions included in AltiVec are far more general than those required by multimedia applications: almost every possible operation and data type is supported, which should allow AltiVec to be applied to other application domains as well. A 128-bit register width, combined with latencies that are at least as good as those found on the other (64-bit) architectures, gives AltiVec the best overall performance (23.7% better than average on SIMD accelerated code, per Table 5). However, performance on a few of the kernels is still poor, especially add block and FFT. The add block kernel's problem has already been discussed: the 128-bit register width is simply too long for this kernel, causing unnecessary data to be loaded from memory, since AltiVec has no 64-bit short load (although there are 8-, 16-, and 32-bit short loads). The reason for the poor FFT performance is not entirely clear, although revisiting our code revealed that the way in which we coded stores to unaligned memory could be improved slightly.
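For reference, the standard AltiVec idiom for a misaligned 16-byte load uses two aligned loads and a permute; our unaligned accesses were built from sequences of this kind, with unaligned stores requiring a messier read-modify-write from the same primitives. A generic sketch, not our kernel code:

    #include <altivec.h>

    /* Load 16 bytes from an arbitrarily aligned address: fetch the two
       aligned quadwords straddling p, then splice the wanted bytes with
       a permute vector generated from the misalignment of p.            */
    static vector unsigned char load_unaligned(const unsigned char *p)
    {
        vector unsigned char lo   = vec_ld(0, p);    /* quadword containing p  */
        vector unsigned char hi   = vec_ld(15, p);   /* next quadword          */
        vector unsigned char mask = vec_lvsl(0, p);  /* alignment permute mask */
        return vec_perm(lo, hi, mask);
    }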

6.2.4. Sun UltraSPARC IIi The performance of the VIS multimedia extension was mediocre at best, even when scaled from 360 to 500 MHz. VIS is the oldest multimedia instruction set we have examined, the second ever released, after HP's MAX-1 (1996). As we pointed out earlier, the instruction set suffers from some odd instruction choices (e.g. its multiplication primitives), missing functionality (no subword shift operations), and an over-utilized graphics status register that becomes a bottleneck.

7. Future Directions for Multimedia Instruction Sets SIMD instructions split traditional wide scalar data paths (e.g. 32 or 64 bits) into multiple narrow data paths which perform the same operation on each packed element. This is of significant performance benefit for multimedia applications, where data types are relatively narrow and algorithms often perform the same operation over a large number of elements. The greatest performance limiting factor facing current SIMD implementations is dealing with data that is not natively in a format amenable to SIMD processing. We have found two major sources of mismatch: storage data types which are inappropriate for computation (necessitating conversion overhead before, and likely also after, computation), and operands which are not adjacent in memory; recall our earlier example of the matrix transpose step in the DCT kernel. The first has led us to propose "fat subwords" (Section 7.1), while the latter suggests "strided" memory operations (Section 7.2).

7.1. Fat Subwords With current SIMD instruction sets, data is loaded into registers in the same packed format in which it is stored in memory. Frequently, unpacking is required before operations are performed, and the unpacked data of course no longer fits within a single register. We therefore propose a "fat subword" architecture with the following characteristics: 1) registers (and therefore subwords) that are wider than the data to be loaded into them, 2) implicit unpack with load, and 3) implicit pack (and saturate) with store.

A fat subword architecture is in some ways similar to the as yet unimplemented MIPS MDMX instruction set [23]. The MDMX extension has a single 192-bit accumulator register as its architectural cornerstone, with the more usual style of register-to-register SIMD operations also included; non-accumulator SIMD operations share the 64-bit floating point datapath. The destination of normal SIMD instructions can be either another SIMD register or the accumulator (to be loaded or accumulated). When accumulating packed unsigned bytes, the accumulator is partitioned into eight 24-bit unsigned subwords; packed 16-bit operations split the accumulator into four 48-bit subwords.

What is good about the MDMX accumulator approach is that implicit width promotion provides an elegant solution to overflow and the other issues caused by packing data as tightly as possible. For example, multiplication is semantically difficult to deal with on current SIMD architectures because the result of a multiply is longer than either operand [22]. The MDMX accumulator avoids this problem because there is enough extra space to prevent overflow. Fixed point arithmetic is also more precise, because full precision results can be accumulated and the total rounded only once, at the end of the loop. Similarly, most scalar multimedia algorithms specify saturation only at the end of computation because it is more precise. Unfortunately, in SIMD architectures where the packed elements maximally fill the available register space, the choice is either to be imprecise (compute with saturation at every step) or to lose parallelism (explicitly promote the input data to be wider and saturate at the end). Saturating arithmetic can also produce unexpected results since, unlike ordinary addition, the order of operations matters (see the worked example after Table 6).

While the MDMX solution may seem elegant in principle, it ignores the architectural problem of actually making such a design fast. The accumulator is a singular, shared resource, and as such tends to limit instruction level parallelism. We found that the existence of only a single accumulator would be a severe handicap in avoiding data dependencies. On a non-accumulator but otherwise superscalar architecture it is usually possible to perform some other useful, non data dependent operation in parallel, so that processing can proceed at the greatest degree of instruction level parallelism possible. On MDMX, all computations that use the accumulator must proceed serially, since each accumulation is inherently dependent on the previous result.

128-bit total load/store width        64-bit total load/store width
Storage      Computation              Storage      Computation
8U           12S x 16                 8U           24S x 8
16S          24S x 8                  16S          48S x 4
32S          48S x 4                  -            -
32FP         32FP x 4                 -            -

Table 6. Supported Data Types - data types are divided into storage (as stored in memory) and computation (the width of arithmetic and other instruction elements) widths.
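The order dependence of saturating arithmetic can be seen in a minimal C sketch (the helper is ours, modeling a signed 8-bit saturating add such as MMX's paddsb):

    #include <stdio.h>
    #include <stdint.h>

    /* Signed 8-bit saturating add: clamp the true sum to [-128, 127]. */
    static int8_t sat_add8(int8_t a, int8_t b)
    {
        int16_t s = (int16_t)a + (int16_t)b;
        if (s > 127)  return 127;
        if (s < -128) return -128;
        return (int8_t)s;
    }

    int main(void)
    {
        /* (100 + 50) - 50: the intermediate sum saturates to 127. */
        printf("%d\n", sat_add8(sat_add8(100, 50), -50));  /* prints 77  */
        /* 100 + (50 - 50): no intermediate overflow occurs.       */
        printf("%d\n", sat_add8(100, sat_add8(50, -50)));  /* prints 100 */
        return 0;
    }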

7.1.1. Fat Subword Data Types To avoid the problems of MDMX, we suggest a conventional register file (as opposed to a single accumulator), with the entire SIMD register file made wider than the width of loads and stores. For the sake of example we have chosen a width of 192 bits for our datapath (registers and functional units), corresponding to a load/store width of either 64 or 128 bits. A 64-bit-only implementation would also be possible, although with a lesser degree of potential parallelism. Another possibility, should very wide registers prove infeasible, is the use of register sets (e.g. triplets in the case of 64-bit wide registers); a similar approach was used by Cyrix in their extension to MMX (found in their 6x86MX/M-II processor) to compensate for the eight-register encoding limit inherent in x86 based architectures. Supported storage data types include those that we have seen to be important as multimedia storage formats: 8-bit unsigned, 16-bit signed, 32-bit signed and single precision floating point. The data types supported by packed arithmetic operations for computation are different: 12-bit signed, 24-bit signed, 48-bit signed and single precision floating point. Depending on the algorithm, it may be desirable to access memory at either 64-bit or 128-bit width (see Table 6); the loaded data would then be unpacked to different computational widths: a smaller number of high-precision subwords (64-bit storage width) or a larger number of low-precision subwords (128-bit storage width). This allows for better matching with the natural width of each algorithm. In the case of packed floating point operations we suggest operating on four single precision values in parallel, leaving the remaining 16 bits of each subword idle; this allows SIMD computations to precisely match scalar results while still taking advantage of the same data rearrangement capabilities used for the integer data types.


7.1.2. Fat Subword Operations In [36] we present our proposed instruction set (97 instructions in total) for a fat subword SIMD multimedia extension. Unlike existing instruction sets, which are fundamentally byte-based, this instruction set is centered on quantities that are multiples of 12 bits wide; this can be thought of as four extra bits of precision for every byte of actual storage data loaded. Because we have fundamentally changed the treatment of multimedia data types, several types of typical instructions are no longer useful, including operations such as average, pack/unpack and truncating multiplication.

7.1.3. Fat Subword Example A small example of how to code for the proposed architecture is shown in Figure 1. Note that load and store instructions specify both the computational and the storage data type.

    memory at 0x1000:  |  3 |   2 |  46 |  7 |   9 |  0 | 23 |  2 |
    memory at 0x1008:  | 15 | 253 |  34 |  5 | 255 |  2 |  2 |  0 |

    lvx8u24s v0,[0x1000]  ;; v0  <- |  3 |   2 |   46 |  7 |    9 |  0 | 23 |  2 |  (each 8U element expanded to a 24S subword)
    lvx8u24s v1,[0x1008]  ;; v1  <- | 15 | 253 |   34 |  5 |  255 |  2 |  2 |  0 |
    mult24s  v2,v0,v1     ;; v2  <- | 45 | 506 | 1564 | 35 | 2295 |  0 | 46 |  0 |
    stv24s8u v2,[0x1000]  ;; mem <- | 45 | 255 |  255 | 35 |  255 |  0 | 46 |  0 |  (saturated to 8U)

Figure 1. Fat Subword Example - two vectors of eight 8-bit unsigned integers are loaded from addresses 0x1000 and 0x1008 into the 192-bit vector registers v0 and v1 respectively, with each element implicitly expanded to 24-bit signed notation. v0 and v1 are then multiplied and the result placed in v2; overflow cannot occur because the width of each product is at most 16 bits (the sum of the operand widths), which fits easily within a 24-bit subword. The result vector is then simultaneously repacked (with saturation) to 8-bit unsigned notation and stored out to memory address 0x1000.
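The load/store semantics of Figure 1 can be stated precisely as a behavioral C model (our own sketch; the instruction names follow the proposal, while the C types are illustrative):

    #include <stdint.h>

    /* lvx8u24s: load eight unsigned bytes, zero-extending each into a
       24-bit signed subword (modeled here as an int32_t lane).          */
    static void lvx8u24s(int32_t v[8], const uint8_t *addr)
    {
        for (int i = 0; i < 8; i++)
            v[i] = addr[i];
    }

    /* stv24s8u: store eight 24-bit signed subwords as unsigned bytes,
       saturating each lane to the range [0, 255].                       */
    static void stv24s8u(uint8_t *addr, const int32_t v[8])
    {
        for (int i = 0; i < 8; i++)
            addr[i] = (uint8_t)(v[i] < 0 ? 0 : v[i] > 255 ? 255 : v[i]);
    }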

7.2. Strided Memory Access We propose that SIMD architectures implement strided load and store instructions to make the gathering of non-adjacent data elements more efficient. This is similar to the prefetch mechanism in AltiVec, except that the data elements would be assembled together by the hardware into a single register, rather than simply loaded into the cache. Such a memory operation would necessarily be slower than a traditional one, but it would cut down immensely on the overhead of reorganizing data as it is loaded from memory. Strided loads and stores would have three operands, vD, rA and rB, where vD is the destination fat-subword register, rA is the base address and rB contains a description of the memory access pattern:

· width - width of the data elements to be loaded [2 bits]: 00 = 8 bits, 01 = 16 bits, 10 = 32 bits, 11 = 64 bits
· offset - number of bytes offset from the base address at which to begin loading [6 bits], interpreted as a signed value
· stride - number of bytes between the effective address of one element in the sequence and the next [24 bits], interpreted as a signed value

The color space conversion kernel is an excellent example of where strided load and store instructions could be used. Pixel data consists of one or more channels or bands, with each channel representing some independent value associated with a given pixel's (x,y) position. A single channel, for example, represents grayscale, while a three (or more) channel image is typically color. The band data may be interleaved (each pixel's red, green and blue values are adjacent in memory) or separated (e.g. the red values of adjacent pixels are adjacent in memory). In image processing algorithms such as color space conversion we operate on each channel in a different way, so the band separated format is the most convenient for SIMD processing.

Converting from the RGB to the YCbCr color space is done through the conversion coefficients shown in Listing 4.

    UINT8 *rowp, *y_p, *u_p, *v_p;
    INT32 red   = *rowp++;
    INT32 green = *rowp++;
    INT32 blue  = *rowp++;
    *y_p++ = +0.2558*red + 0.5022*green + 0.0975*blue +  16.5;
    *u_p++ = -0.1476*red - 0.2899*green + 0.4375*blue + 128.5;
    *v_p++ = +0.4375*red - 0.3664*green - 0.0711*blue + 128.5;
    Listing 4. Color Space Conversion

Listing 5 replaces thirty-eight instructions in the original AltiVec color space conversion kernel (the corresponding code fragment is listed in [36]). In the original AltiVec code, it was necessary to load six permute control vectors (each 128 bits wide) before executing the six vperm instructions required to rearrange the data into band separated format.

    rgb_to_yuv:
        oris    r11,r11,RED_PATTERN
        oris    r12,r12,GREEN_PATTERN
        oris    r13,r13,BLUE_PATTERN
        lvxstrd v28,0,r3,r11   ;; v28: |r0|r1|r2|r3|r4|r5|r6|r7|r8|r9|rA|rB|rC|rD|rE|rF|
        lvxstrd v29,0,r3,r12   ;; v29: |g0|g1|g2|g3|g4|g5|g6|g7|g8|g9|gA|gB|gC|gD|gE|gF|
        lvxstrd v30,0,r3,r13   ;; v30: |b0|b1|b2|b3|b4|b5|b6|b7|b8|b9|bA|bB|bC|bD|bE|bF|
    Listing 5. Modified Color Space Conversion
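The semantics of such a strided load can be sketched as a behavioral C model (ours, for illustration): with a stride of 3 and offsets of 0, 1 and 2, three loads gather the red, green and blue bands of interleaved RGB pixels, as in Listing 5.

    #include <stdint.h>

    /* Behavioral model of a strided load for the 8-bit element width:
       gather 16 bytes, starting at base+offset, stepping by stride.     */
    static void lvxstrd_8(uint8_t dst[16], const uint8_t *base,
                          int32_t offset, int32_t stride)
    {
        const uint8_t *p = base + offset;
        for (int i = 0; i < 16; i++) {
            dst[i] = *p;       /* one band element per iteration */
            p += stride;
        }
    }

    /* Band-separate 16 interleaved RGB pixels (48 bytes of input). */
    static void separate_rgb(const uint8_t *rgb,
                             uint8_t r[16], uint8_t g[16], uint8_t b[16])
    {
        lvxstrd_8(r, rgb, 0, 3);   /* red band   */
        lvxstrd_8(g, rgb, 1, 3);   /* green band */
        lvxstrd_8(b, rgb, 2, 3);   /* blue band  */
    }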


Traditional SIMD data communication operations have trouble with data that is not aligned on power-of-two boundaries; in the case of color space conversion, visually adjacent pixels within each band are spaced 3 bytes apart. Strided loads and stores are by definition unaligned, so alignment would need to be handled by the load/store hardware in the CPU. It would also make sense to have additional versions of these instructions that hint to the memory hardware when data elements will not be of near-term use (especially if the stride is of the same or greater order than the cache line size). Another approach to strided data access is presented in [43], which discusses the Impulse memory system architecture; Impulse allows for a strided mapping that creates dense cache lines from array elements that are not contiguous in memory.

8. Conclusion Specialized instructions have been introduced by microprocessor vendors to support the unique computational demands of multimedia applications. The mismatch between wide data paths and the relatively short data types found in multimedia applications has led the industry to embrace SIMD (single instruction, multiple data) style processing. A SIMD approach has the significant added benefit of simplifying control circuitry, typically one of the most difficult aspects of superscalar microprocessor design.

We have found that the greatest performance limiting factor facing current SIMD implementations is dealing with data that is not natively in a format amenable to SIMD processing. Storage and computation have opposing goals: the first seeks to use a minimal number of bits in order to conserve space, while the latter requires extra bits to maintain precision during intermediate calculations. With current SIMD instruction sets, data is loaded into registers in the same packed format in which it is stored in memory. This is inefficient because unpacking is frequently required before operations are performed, and the unpacked data of course no longer fits within a single register; when it is time to write SIMD data back to memory, the data must be repacked and saturated to fit within a typically narrower storage data type. Our proposed fat-subword register architecture reduces the amount of unpacking, packing and saturation overhead that SIMD computation must incur by encoding both the computational and the storage data type in SIMD load/store instructions, and then implicitly performing the requisite unpacking or packing and sign extension or saturation.

SIMD instructions perform the same operation on multiple data elements. Because all of the data within a register must be treated identically, the ability to efficiently rearrange data bytes within and between registers has become critical to performance. To this end we have suggested strided memory operations, which have the potential to greatly simplify access to sparse data structures as well as the conversion between band-interleaved and band-separated storage formats. If these strided operations can be made fast in hardware (this seems likely, given that strided vector loads and stores have long been integral components of traditional vector supercomputers, so their implementation should already be well understood), one of the major impediments to efficient SIMD processing will be lifted.

We have presented a unified comparison of five multimedia instruction sets for general purpose microprocessors, discussing the advantages and disadvantages of specific features of current instruction sets. In addition to examining the five instruction sets from a functional standpoint, we have also empirically compared the performance of specific implementations of these instruction sets on real hardware. This has allowed us to measure typical speedups relative to a highly tuned compiler on a given platform, as well as to make cross-architecture comparisons. Today most microprocessors found in desktop workstations are equipped with multimedia extensions that remain unused due to the lack of compiler support for these instruction sets [20] (our analysis has been based entirely on measurements of hand tuned SIMD kernels). A valuable extension to this study would be an analysis of what characteristics would make SIMD instructions most amenable to use by a compiler. In addition, multimedia applications, and therefore our target workload, are constantly evolving: MPEG-4 and MP3-Pro audio are only two examples of up-and-coming algorithms that demand further examination. As these and other algorithms and applications come to light, we will need to revisit our design choices for multimedia instruction sets.

References [1] G.E. Allen, B.L. Evans, L.K. John, “Real-Time High-Throughput Sonar Beamforming Kernels Using Native Signal Processing and Memory Latency Hiding Techniques,” Proc. 33rd IEEE Asilomar Conf. on Signals, Systems and Computers, Pacific Grove, Calif., 1999, pp. 137-141

[2] Advanced Micro Devices, Inc., “AMD Athlon Processor x86 Code Optimization Guide,” Publication #22007, Rev. G, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf, (current Apr. 2000)


[3] Advanced Micro Devices, Inc., “AMD Athlon Processor Technical Brief,” Publication #22054, Rev. D, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22054.pdf, (current Apr. 2000)

[4] Advanced Micro Devices, “3DNow! Technology vs. KNI,” White Paper, http://www.amd.com/products/cpg/3dnow/vskni.html, (current Apr. 2000)

[5] R. Bhargava, L.K. John, B.L. Evans, R. Radhakrishnan, “Evaluating MMX Technology Using DSP and Multimedia Applications,” Proc. 31st IEEE Intl. Symp. on Microarchitecture (MICRO-31), Dallas, Texas, 1998, pp. 37-46

[6] T. Burd, “General Processor Info,” CPU Info Center, http://bwrc.eecs.berkeley.edu/CIC/summary/local/summary.pdf, (current Apr. 2000)

[7] D.A. Carlson, R.W. Castelino, R.O. Mueller, “Multimedia Extensions for a 550-MHz RISC Microprocessor,” IEEE J. of Solid-State Circuits, vol. 32, no. 11, Nov. 1997, pp. 1618-1624

[8] W. Chen, H. John Reekie, S. Bhave, E.A. Lee, “Native Signal Processing on the UltraSparc in the Ptolemy Environment,” Proc. 30th Ann. Asilomar Conf. on Signals, Systems, and Computers, Pacific Grove, California, 1996, vol. 2, pp. 1368-1372

[9] Compaq Computer Corporation, “Alpha 21264 Microprocessor Hardware Reference Manual,” Part No. DS-0027A-TE, Feb. 2000, http://www.support.compaq.com/alpha-tools/documentation/current/21264_EV67/ds-0027a-te_21264_hrm.pdf, (current Apr. 2000)

[10] T.M. Conte, P.K. Dubey, M.D. Jennings, R.B. Lee, A. Peleg, S. Rathnam, M. Schlansker, P. Song, A. Wolfe, "Challenges to Combining General-Purpose and Multimedia Processors," IEEE Computer, vol. 30, no. 12, Dec. 1997, pp. 33-37

[11] C. Hansen, “MicroUnity’s MediaProcessor Architecture,” IEEE Micro, vol. 16, no. 4, Aug. 1996, pp. 34-41

[12] Intel Corporation, “Intel Introduces The Pentium Processor With MMX Technology,” January 8, 1997 Press Release, http://www.intel.com/pressroom/archive/releases/dp010897.htm, (current Apr. 2000)

[13] Intel Corporation, “Intel Architecture Software Developer’s Manual, Volume 1: Basic Architecture,” Publication 243190, http://developer.intel.com/design/pentiumii/manuals/24319002.PDF, (current Apr. 2000)

[14] Intel Corporation, “IA32 Intel Architecture Software Developer’s Manual with Preliminary Intel Processor Information,” Volume 1: Basic Architecture, http://developer.intel.com/design/processor/future/manuals/24547001.pdf, (current Sept. 2000)


[15] Intel Corporation, “Intel Announces New NetBurst Micro-Architecture for Pentium IV Processor,” http://www.intel.com/pressroom/archive/releases/dp082200.htm, (current Sept. 2000)

[16] J. Keshava, V. Pentkovski, “Pentium III Processor Implementation Tradeoffs,” Intel Tech. J., Quarter 2, 1999, http://developer.intel.com/technology/itj/q21999/pdf/impliment.pdf, (current Apr. 2000)

[17] T. Kientzle, “Implementing Fast DCTs,” Dr. Dobb’s J., vol. 24, no. 3, March 1999, pp. 115-119

[18] L. Kohn, G. Maturana, M. Tremblay, A. Prabhu, G. Zyner, “The Visual Instruction Set (VIS) in UltraSPARC,” Proc. Compcon ’95, San Francisco, 1995, pp. 462-469

[19] I. Kuroda, T. Nishitani, “Multimedia Processors,” Proc. IEEE, vol. 86, no. 6, June 1998, pp. 1203-1221

[20] S. Larsen, S. Amarasinghe, “Exploiting Superword Level Parallelism with Multimedia Instruction Sets,” Proc. ACM SIGPLAN ‘00 Conf. on Programming Language Design and Implementation, Vancouver, BC Canada, 2000, pp. 145-156

[21] R.B. Lee, “Subword Permutation Instructions for Two-Dimensional Multimedia Processing in MicroSIMD Architectures,” Proc. IEEE Intl. Conf. on Application-Specific Systems, Architectures, and Processors, Boston, 2000, pp. 3-14

[22] R.B. Lee, “Multimedia Extensions for General Purpose Processors,” Proc. IEEE Workshop on VLSI Signal Processing, Leicester, UK, 1997, pp. 1-15

[23] MIPS Technologies, Inc., “MIPS Extension for Digital Media with 3D,” White Paper, Mar. 12, 1997, http://www.mips.com/Documentation/isa5_tech_brf.pdf, (current Apr. 2000)

[24] Motorola Inc., “MPC7400 RISC Microprocessor User’s Manual, Rev. 0,” Document MPC7400UM/D, http://www.mot.com/SPS/PowerPC/teksupport/teklibrary/manuals/MPC7400UM.pdf, (current Apr. 2000)

[25] J. Nakashima, K. Tallman, “The VIS Advantage: Benchmark Results Chart VIS Performance,” White Paper, Oct. 1996, http://www.sun.com/microelectronics/vis/, (current Apr. 2000)

[26] H. Nguyen, L.K. John, “Exploiting SIMD Parallelism in DSP and Multimedia Algorithms Using the AltiVec Technology,” Proc. 1999 Int’l Conf. on Supercomputing, Rhodes, Greece, 1999, pp. 11-20

[27] K. Noer, “Heat Dissipation Per Square Millimeter Die Size Specifications,” http://home.worldonline.dk/~noer/, (current Apr. 2000)


[28] K.B. Normoyle, M.A. Csoppenszky, A. Tzeng, T.P. Johnson, C.D. Furman, J. Mostoufi, “UltraSPARC-IIi: Expanding the Boundaries of a System on a Chip,” IEEE Micro, vol. 18, no. 2, Mar./Apr. 1998, pp. 14-24

[29] A.D. Pimentel, P. Struik, P. van der Wolf, L.O. Hertzberger, “Hardware versus Hybrid Data Prefetching in Multimedia Processors: A Case Study,” Proc. IEEE Intl. Performance, Computing and Communications Conference, Phoenix, Ariz., 2000, pp. 525-531

[30] P. Ranganathan, S. Adve, N.P. Jouppi, “Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions,” Proc. 26th Ann. Intl. Symp. on Computer Architecture, Atlanta, 1999, pp. 124-135

[31] S. Rathnam, G. Slavenburg, “An Architectural Overview of the Programmable Multimedia Processor, TM-1,” Proc. COMPCON ’96, Santa Clara, Calif., 1996, pp. 319-326

[32] D.S. Rice, “High-Performance Image Processing Using Special-Purpose CPU Instructions: The UltraSPARC Visual Instruction Set,” Master’s Report, Univ. of California at Berkeley, Mar. 1996

[33] P. Rubinfeld, B. Rose, M. McCallig, “Motion Video Instruction Extensions for Alpha,” White Paper, Oct. 1996, http://www.digital.com/alphaoem/papers/pmvi-abstract.htm, (current Apr. 2000)

[34] N.T. Slingerland, A.J. Smith, “Design and Characterization of the Berkeley Multimedia Workload,” University of California at Berkeley Technical Report CSD-00-1122, Dec. 2000

[35] N.T. Slingerland, A.J. Smith, “Multimedia Instruction Sets for General Purpose Microprocessors: A Survey,” University of California at Berkeley Technical Report CS-00-1124, Dec. 2000

[36] N.T. Slingerland, A.J. Smith, “Measuring the Performance of Multimedia Instruction Sets,” University of California at Berkeley Technical Report CSD-00-1125, Dec. 2000

[37] A. Stiller, “Architecture Contest,” c’t Magazine, vol. 16/99, http://www.heise.de/ct/english/99/16/092/, (current Dec. 2000)

[38] P. Struik, P. van der Wolf, A.D. Pimentel, “A Combined Hardware/Software Solution for Stream Prefetching in Multimedia Applications,” Proc. 10th Ann. Symp. on Electronic Imaging (Multimedia Hardware Architectures track), San Jose, Calif., 1998, pp. 120-130

[39] M.T. Sun, ed., “IEEE Standard Specifications for the Implementation of 8x8 Inverse Discrete Cosine Transform,” IEEE Standard 1180-1990, IEEE, 1991

[40] Sun Microsystems Inc., “UltraSPARC-IIi User’s Manual,” 1997, Part No. 805-0087-01, http://www.sun.com/microelectronics/UltraSPARC-IIi/docs/805-0087.pdf, (current Apr. 2000)


[41] S. Thakkar, T. Huff, “The Internet Streaming SIMD Extensions,” Intel Tech. J., Q2 1999, http://developer.intel.com/technology/itj/q21999.htm, (current Apr. 2000)

[42] M. Tremblay, J.M. O’Connor, V. Narayanan, L. He, “VIS Speeds New Media Processing,” IEEE Micro, vol. 16, no. 4, Aug. 1996, pp. 10-20

[43] L. Zhang, J.B. Carter, W. Hsieh, S.A. McKee, “Memory System Support for Image Processing,” Proc. 1999 International Conference on Parallel Architectures and Compilation Techniques, Newport Beach, Calif., 1999, pp. 98-107
