Performance Analysis of Instruction Set Architecture Extensions for Multimedia§

Nathan Slingerland, Apple Computer, Inc. ([email protected])
Alan Jay Smith, University of California at Berkeley ([email protected])

Abstract

Many instruction sets include instructions for accelerating multimedia applications such as DVD playback, speech recognition and 3D graphics. Despite general agreement on the need to support this emerging workload, there are considerable differences between the instruction sets that have been designed to do so. In this paper we study the performance of five instruction sets on kernels extracted from a broad multimedia workload. Each kernel was recoded in the assembly language for each of the five multimedia extensions. We compare the performance of contemporary implementations of each extension against each other as well as to the original compiled C performance. From our analysis we determine how well multimedia workloads map to current instruction sets, noting what was useful and what was not. We also propose two enhancements to current architectures: strided memory operations, and fat subwords.

Keywords: SIMD, subword parallel, multimedia, instruction set, performance, measurement, MMX, SSE, AltiVec, VIS, MVI

1. Introduction

Specialized instructions have been introduced by microprocessor vendors to support the unique computational demands of multimedia applications. The mismatch between wide data paths and the relatively short data types found in multimedia applications has led the industry to embrace SIMD (single instruction, multiple data) style processing. Unlike traditional forms of SIMD computing in which multiple individual processors execute the same instruction, multimedia instructions are executed by a single processor, and pack multiple short data elements into a single wide (64 or 128-bit) register, with all of the sub-elements being operated on in parallel.

The goal of this paper is to quantify how architectural differences between multimedia instruction sets translate into differences in performance. Prior studies have primarily focused on a single instruction set in isolation and have measured the performance of kernels taken from vendor provided libraries [1], [5], [8], [26], [27], [31], [33]. Our contribution is unique in that we do not focus exclusively on a single architecture, and we study the performance of kernels derived from a real, measured, general purpose multimedia workload. Our results are obtained from actual hardware measurements rather than through simulation, instilling confidence in our results.

Section 2 summarizes the multimedia workload we studied, and details the sixteen computationally important kernels which we extracted from it. Our methodology for recoding the kernels with multimedia instructions and their measurement is described in Section 3. An overview of the five instruction sets, and their implementations, is given in Section 4.

Our analysis is divided into two parts. Section 5 reflects our experience in coding the kernels, and lends insight into those instruction set features that we found useful. In Section 6 we compare the performance of five different processors, each of which implements a particular multimedia instruction set. We compare the performance of the five architectures both against one another, as well as their relative improvement over compiled (optimized) C code. Finally, in Section 7 we propose two new directions for multimedia architectures on general purpose microprocessors: strided memory operations and fat subwords.

2. Workload

The lack of a standardized multimedia benchmark has meant that workload selection is the most difficult aspect of any study of multimedia. It was for this reason that we developed the Berkeley multimedia workload, which is described in [35]. [35] is highly recommended reading for any reader concerned about how applications were selected or the technical merit and relevance of the Berkeley multimedia workload. It details why we felt existing multimedia benchmarks were unsuitable, develops and characterizes our workload, and extracts the kernels that we study here. The Berkeley multimedia workload was developed for use in this research as well as our study of multimedia cache behavior in [36]. In selecting the component applications, we strove to cover as many types of media processing as possible: image compression (DjVu, JPEG), 3D graphics (Mesa, POVray), document rendering (Ghostscript), audio synthesis (Timidity), audio compression

§Funding for this research has been provided by the State of California under the MICRO program, and by AT&T, Cisco Corporation, Fujitsu Microelectronics, IBM Corporation, Maxtor Corporation, Microsoft Corporation, Toshiba Corporation and Veritas Software Corporation.


Kernel Name              Source Application  Data Type   Sat. Arith.  Native Width  Src Lines  Static Instr  % Static Instr  % CPU Cycles
Add Block                MPEG-2 Decode       8-bit (U)   Yes          64-bits       46         191           0.1%            13.7%
Block Match              MPEG-2 Encode       8-bit (U)                128-bits      52         294           0.4%            59.8%
Clip Test and Project    Mesa                FP                       limited       95         447           0.1%            0.8%
Color Space Conversion   JPEG Encode         8-bit (U)                limited       22         78            0.1%            9.8%
DCT                      MPEG-2 Encode       16-bit (S)               128-bits      14         116           0.1%            12.3%
FFT                      LAME                FP                       limited       208        981           4.4%            14.5%
Inverse DCT              MPEG-2 Decode       16-bit (S)  Yes          128-bits      75         649           0.3%            29.7%
Max Value                LAME                32-bit (S)               unlimited     8          39            0.2%            12.0%
Mix                      Timidity            16-bit (S)  Yes          unlimited     143        829           1.0%            35.7%
Quantize                 LAME                FP                       unlimited     55         312           1.4%            15.3%
Short Term Anal Filter   GSM Encode          16-bit (S)  Yes          128-bits      15         79            0.1%            20.2%
Short Term Synth Filter  GSM Decode          16-bit (S)  Yes          128-bits      15         114           0.2%            72.7%
Subsample Horizontal     MPEG-2 Encode       8-bit (U)   Yes          88-bits       35         244           0.3%            2.6%
Subsample Vertical       MPEG-2 Encode      8-bit (U)   Yes          unlimited     76         478           0.6%            2.1%
Synthesis Filtering      Mpg123              FP          Yes          512-bits      67         348           0.4%            39.6%
Transform & Normalize    Mesa                FP                       limited       51         354           0.1%            0.7%

Table 1. Multimedia Kernels Studied - from left to right, the columns list 1) primary data type specified as N-bit ({Unsigned, Signed}) integer or floating point (FP), 2) whether saturating arithmetic is used, 3) native width: the longest width which does not load/store/compute excess unused elements, 4) static C source line count (for those lines which are executed), 5) static instruction count (for those instructions which are executed), 6) percentage of total static instructions, 7) percentage of total CPU time spent in the kernel. The latter three statistics are machine specific and are for the original C code on a Compaq DS20 (dual 500 MHz Alpha 21264, Tru64 Unix v5.0 Rev. 910) machine.

(ADPCM, LAME, mpg123), video compression (MPEG-2 at DVD and HDTV resolutions), speech synthesis (Rsynth), speech compression (GSM), speech recognition (Rasta) and video game (Doom) applications. Open source software was used both for its portability (allowing for cross platform comparisons) and the fact that we could analyze the source code directly.

It is usually neither time nor resource efficient to hand optimize every line of code in an application. Instead, optimization should focus on execution hot spots or computational kernels, thereby limiting the recoding effort to those portions of the code that have the most potential to improve overall performance. In order to evaluate the multimedia instruction sets, kernels were distilled from the workload based on their computational significance and amenability to hand optimization with SIMD instructions; [35] discusses the kernel selection process in detail. Each kernel was hand coded in the assembly language of each of the five instruction sets. Table 1 lists the kernel codes examined.

3. Methodology

3.1. Berkeley Multimedia Kernel Library

From a performance standpoint, no piece of code can realistically be extracted and studied in isolation. In order to measure the performance of the multimedia instruction sets, we distilled from our Berkeley multimedia workload a set of computationally important kernel functions, which when taken together form the Berkeley multimedia kernel library (BMKL). All of the parent applications in the Berkeley multimedia workload were modified to make calls to the BMKL rather than their original internal functions. By measuring our kernel codes from within real applications we could realistically include the effects of the shared resources of our test systems (e.g. CPU, caches, TLB, memory, and registers).

Encapsulating the kernels within a library with a well defined interface allowed for low overhead measurement code to be placed around library functions, making measurements as non-invasive as possible. This also allowed us to quickly test new revisions of the code by linking against different versions of the library. When adding a new architecture to our study it was possible to simply start with a copy of the C reference version of the library and then implement and debug replacement SIMD assembly functions one at a time.

3.2. Coding Process

Ideally, high level language compilers would be able to systematically and automatically identify parallelizable sections of code and generate the appropriate SIMD instruction sequence. SIMD optimizations would then not just be limited to multimedia applications, but could be more generally applied to any application exhibiting the appropriate type of data parallelism. [21] has proposed strategies for extracting parallelism from multimedia workloads with compilers. Recently, in fact, the very first commercial compilers have become available for Intel's SSE ([16]) and Motorola's AltiVec ([45]), although their effectiveness is still an open question. The lack of languages which allow programmers to specify data types and overflow semantics at variable declaration time has hindered the development of automated compiler support for multimedia instruction sets [10]. In addition, current SIMD extensions are clearly targeted for use by an expert applications programmer coding by hand in assembly or through high-level language macros.

3.2.1. Hand Coding

As is the case with digital signal processors (DSPs), the most efficient way to program with multimedia extensions is to have an expert programmer tune software using assembly language [20]. Although this is more tedious and error prone than other methods such as hand coded standard shared libraries or automatic code generation by a compiler, it is a method that is available on every platform, and allows for the greatest flexibility and precision when coding.


                                AMD Athlon             DEC Alpha 21264  Intel Pentium III    Motorola 7400 (G4)  Sun UltraSPARC IIi
Clock (MHz)                     500                    500              500                  500                 360
SIMD Extension(s)               MMX/3DNow!+            MVI              MMX/SSE              AltiVec             VIS
SIMD Instructions               57/24                  13               57/70                162                 121
First Shipped                   6/1999                 2/1998           2/1999               8/1999              11/1998
Transistors (x10^6)             22.0                   15.2             9.5                  6.5                 5.8
Process (um) [Die Size (mm^2)]  0.25 [184]             0.25 [225]       0.20 [105]           0.20 [83]           0.25 [148]
L1 $I Cache, $D Cache (KB)      64, 64                 64, 64           16, 16               32, 32              32, 64
L2 Cache (MB)                   0.5                    4                0.5                  1                   2
Register File (# x width)       FP (8x64b)/FP (8x64b)  Int (31x64b)     FP (8x64b)/8x128b    32x128b             FP (32x64b)
Reorder Buffer Entries          72                     35               40                   6                   12
Int, FP Multimedia Units        2 (combined)           2, 0             2 (combined)         3, 1                2, 0
Int Add Latency                 2 [64b/cycle]          -                1 [64b/cycle]        1 [128b/cycle]      1 [64b/cycle]
Int Multiply Latency            3 [64b/cycle]          -                3 [64b/cycle]        3 [128b/cycle]      3 [64b/cycle]
FP Add Latency                  4 [64b/cycle]          -                4 [64b/cycle]        4 [128b/cycle]      -
FP Multiply Latency             4 [64b/cycle]          -                5 [64b/cycle]        4 [128b/cycle]      -
FP ~1/sqrt Latency              3 [32b/cycle]          -                2 [64b/cycle]        4 [128b/cycle]      -

Table 2. Microprocessors Studied - All parameters are for the actual chips used in this study. A SIMD register file may be shared with existing integer (Int) or floating point (FP) registers, or be separate. Note that some of the processors implement several multimedia extensions (e.g. the Pentium III has MMX and SSE) - the corresponding parameters for each are separated with "/". A dash (-) indicates that a particular operation is not available on a given architecture, and so no latency and throughput numbers are given. DEC's MVI extension (on the 21264) does not include any of the listed operations, but all MVI instructions have a latency of 3 cycles [64b/cycle]. [2], [3], [4], [6], [7], [9], [12], [13], [17], [19], [25], [28], [29], [42]

All of the assembly codes in our study were written by the same programmer (Slingerland) with an approximately equal amount of time spent on each platform. We chose to code in assembly language ourselves rather than measure vendor supplied optimized libraries because the latter would make a fair cross-platform comparison next to impossible: each vendor implements a different set of functions which presume particular (and likely different) data formats. For example, a vendor supplied color space conversion function might assume band-interleaved format when the program data to be accelerated uses a band-separated structure. Some overhead would then be needed to transform data structures into the expected format, perturbing the system under test and adding unnecessary overhead.

Coding ourselves had the supplemental benefit of preventing any differences in programmer ability and time spent coding between the vendors from potentially skewing our comparison. This also gave us considerable insight into the reasons for performance differences, where we would otherwise be limited to simply reporting our measurements. It is important to stress that before even a single line of code was rewritten in assembly, the kernel algorithms and applications in question were studied in considerable detail. Our level of understanding made it possible to play the role of applications expert and thereby make educated choices about SIMD algorithms, sufficient precision and required data types, and is reflected in the development and characterization of our workload in [35].

For reasons of practicability, we limited our optimizations to the kernel level; we did not rewrite entire applications. This had ramifications for data alignment and data structure layout on some architectures. Tools that aid in precise per-cycle instruction scheduling were not used, as they are not available on all platforms. To use them would mean that we would also be measuring the available tools and development environment of a given platform, rather than just the multimedia instruction set. Instructions were scheduled sanely: data production was separated as far as possible from data consumption, loops were unrolled when of performance benefit, and instructions were manually scheduled so as to take advantage of multiple functional units.

In some cases it was not practical to code a SIMD version of a kernel if an instruction set lacked the requisite functionality. For example, Sun's VIS and DEC's MVI do not support packed floating point, so floating point kernels were not recoded for these platforms. DEC's MVI also does not contain any data communication (e.g. permute, mix, merge) or subword parallel integer multiply instructions. If there was no compelling opportunity for performance gain, kernels were not recoded from C. The C version was considered to be a particular platform's "solution" to a kernel when the needed SIMD operations were not provided. Situations which called for a C "solution" are clearly denoted in our results, as the typically lesser performance of compiler generated code must be kept in mind when comparing against the other (hand-tuned) kernels.

3.2.2. C Compilers and Optimization Flags

The most architecturally tuned compiler on each architecture was used to compile the C reference version of the Berkeley Multimedia Kernel Library. The optimization flags used were those which give the best general speedup and highest level of optimization without resorting to tuning the compiler flags for a particular kernel. The specific compiler and associated optimization flags used on each platform are listed in [38].

4. Instruction Sets

In this paper we study five architectures with different multimedia instruction sets (Table 2 lists the parameters for the exact parts). Many of the instruction sets (AMD's 3DNow!, DEC's MVI, Intel's MMX, Sun's VIS) use 64-bit wide registers, while

Motorola's AltiVec and Intel's SSE are 128-bits wide. The size of the register file available on each architecture varies widely, ranging from eight 64-bit registers on the x86-based AMD Athlon and Intel Pentium III to thirty-two 128-bit wide registers with Motorola's AltiVec extension.

All of the multimedia extensions support integer operations, although the types and widths of available operations vary greatly. The earliest multimedia instruction sets (e.g. Sun's VIS and DEC's MVI) had design goals limited by the immaturity of their target workload and unproven benefit to performance. Any approach which greatly modified the overall architecture or significantly affected die size was out of the question, so leveraging as much functionality as possible from the existing architecture was a high priority. Thus, the Sun and DEC extensions do not implement subword parallel floating point instructions. Witness also the difference between Intel's first multimedia extension, MMX (1997), which had as one of its primary design goals to not require any new operating system support, and Intel's SSE (1999), which added new operating system maintained state (eight 128-bit wide registers) for the first time since the introduction of the Intel386 instruction set in 1985.

DEC

The DEC (now Compaq, but we will refer to it as DEC for historical consistency) Alpha 21264 processor includes the SIMD integer Motion Video Instructions (MVI) multimedia extension. It is the smallest of the multimedia instruction sets, weighing in with only 13 instructions. MVI shares the existing Alpha 32-register integer register file. Notably, no SIMD saturating addition/subtraction, multiplication, or shift instructions are included.

Motorola

Motorola's MPC7400 (also known as the G4) processor implements the 128-bit wide SIMD AltiVec extension, which supports a wide variety of integer data types, as well as subword parallel single precision floating point. A dedicated 32 x 128-bit register file is included, along with four non-identical parallel, pipelined vector execution units. Hardware assisted software prefetching is implemented, whereby a prefetch stream is set up by software, and fetched into the cache independently by hardware.

Sun

Sun's UltraSPARC IIi processor incorporates the VIS multimedia extension. VIS is a set of SIMD integer instructions that share UltraSPARC's scalar floating point register file. Subword parallel multiplication is accomplished through 8-bit multiplication primitive instructions, and a graphics status register (GSR) is used to support data alignment and scaling for pack operations.

Intel

Intel's Pentium III processor includes their original SIMD integer MMX extension, as well as SSE, a newer SIMD floating point instruction set. MMX is mapped onto the existing x87 floating point architecture and registers and introduces no new architectural state (registers or exceptions). SSE is primarily a SIMD floating point extension but also incorporates feedback on MMX from software vendors in the form of new integer instructions. Unlike MMX, the floating point side of SSE adds new architectural state to the Intel architecture with the addition of an 8 x 128-bit register file and exceptions to support IEEE compliant floating point operations. Note that SSE is only 128-bits wide for packed floating point operations; integer operations are still 64-bits wide like MMX. Also, although the SSE instruction set architecture and register file are defined to be 128-bits wide, the Pentium III SSE execution units are actually 64-bits (two single precision floating point elements) wide in hardware. The instruction decoder translates 4-wide (128-bit) SSE instructions into pairs of 2-wide (64-bit) internal micro-ops.

AMD

The AMD Athlon processor implements MMX (which was licensed from Intel), in addition to AMD's own 3DNow! extension, which utilizes the same x87 floating point registers and basic instruction formats as MMX, and adds a subword parallel single precision floating point data type. The Athlon processor actually includes an enhanced version of 3DNow! which adds a handful of instructions in order to make 3DNow! functionally equivalent to Intel's SSE.

5. Instruction Set Analysis

This section encompasses our experience coding each of the sixteen kernels with five different multimedia extensions. We point out those instruction set features (data types and operations) that we found to be useful in accelerating our workload, as well as those for which we found no application or that created significant bottlenecks in current multimedia architectures. Those readers primarily interested in our performance measurements of specific implementations of these instruction sets can skip ahead to Section 6 and return here to Section 5 at their leisure.

Illustrating our discussion are code fragments both from the original C source code of each kernel algorithm, as well as the different SIMD implementations. The code fragments consist of a few of the key central lines of code from a given kernel. This gives a good idea of the types of operations and data types used. The data types of all of the variables in our sample C code are specified in a platform independent way such that the prefix indicates the type (INT: signed integer, UINT: unsigned integer, FP: floating point), followed by N, the number of bits. The complete original C source code for each kernel, as well as the assembly language SIMD implementations of the BMKL, are available on the web at http://www.cs.berkeley.edu/~slingn/research/.

5.1. Register File and Data Path

Multimedia instruction sets can be broadly categorized according to the location and geometry of the register file upon which SIMD instructions operate. Alternatives include the reuse of the existing integer or floating point register files, or implementing an entirely separate one. The type of register file affects the width and therefore the number of packed elements that can be operated on simultaneously (vector length).

5.1.1. Shared Integer Data Path

Implementing multimedia instructions on the integer data path has the advantage that the functional units for shift and logical operations need not be replicated. Subword parallel addition and subtraction are easily created by blocking the appropriate carry bits.

INT16 *bp;
UINT8 *rfp;
INT32 tmp;

tmp = *bp++ + *rfp;                         /* Add  */
*rfp = tmp>255 ? 255 : (tmp<0 ? 0 : tmp);   /* Clip */

Listing 1. Add Block - computed for each pixel in a sub-block
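For context, the per-pixel operation of Listing 1 executes inside a loop over the 8 x 8 sub-block. A scalar C sketch of that enclosing loop follows; the 8x8 geometry and the row increment iincr are inferred from the AltiVec fragment in Listing 2, and the function name is illustrative:

```c
/* Scalar sketch of the enclosing Add Block loop: each 16-bit
 * prediction-error value (*bp) is added to the corresponding 8-bit
 * reference pixel (*rfp) and the sum is clipped back to 0..255.
 * The 8x8 loop bounds and row increment follow Listing 2. */
typedef short         INT16;
typedef unsigned char UINT8;
typedef int           INT32;

void add_block(INT16 *bp, UINT8 *rfp, int iincr)
{
    for (int row = 0; row < 8; row++) {
        for (int col = 0; col < 8; col++) {
            INT32 tmp = *bp++ + *rfp;                        /* Add  */
            *rfp++ = tmp > 255 ? 255 : (tmp < 0 ? 0 : tmp);  /* Clip */
        }
        rfp += iincr;   /* skip to the next row of the reference frame */
    }
}
```

A SIMD version replaces the inner loop with one (128-bit) or two (64-bit) packed saturating additions per row, as Listing 2 shows for AltiVec.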

;; unaligned vector load
lvx     v3, 0, r3       ; v3: vector MSQ for initial bp0..bp7 vector
lvx     v4, r11, r3     ; v4: vector LSQ
;; unaligned vector load
lvx     v0, 0, r4       ; v0: vector MSQ
lvx     v1, r11, r4     ; v1: vector LSQ
lvsl    v2, 0, r4       ; v2: vector alignment mask for vperm
vxor    v10, v10, v10   ; v10: 0
vperm   v0, v0, v1, v2  ; v0: rfp:|0|1|2|3|4|5|6|7|X|X|X|X|X|X|X|X|
vperm   v3, v3, v4, v5  ; v3: | bp0 | bp1 | bp2 | bp3 | bp4 | bp5 | bp6 | bp7 |
addi    r3, r3, 16      ; r3: bp += 8 (pointer to INT16)
vmrghb  v1, v10, v0     ; v0: | rfp0| rfp1| rfp2| rfp3| rfp4| rfp5| rfp6| rfp7|
vaddshs v1, v3, v1      ; v1: bp + rfp [0..7]
vpkshus v1, v1, v1      ; v1: rfp:|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
stvewx  v1, 0, r4       ; store rfp [0..3]
stvewx  v1, r12, r4     ; store rfp [4..7]
add     r4, r4, r5      ; r4: rfp += (iincr + 8) (pointer to UINT8)
vmov    v3, v4          ; move current LSQ to next MSQ

Listing 2. AltiVec Add Block Fragment

However, modifications to the integer data path to accommodate multimedia instructions can potentially adversely affect the critical path of the integer data-path pipeline [20]. In addition, on architectures such as x86 (AMD, Intel) and PowerPC (Motorola) the integer data path is 32-bits wide, making a shared integer data path approach less compelling due to the limited amount of data level parallelism possible.

5.1.2. Shared Floating Point Data Path

The reuse of floating point rather than integer registers has the advantage that they need not be shared with pointers, loop counters, and other variables. In addition, multimedia and floating point instructions are not typically used simultaneously [20], so there is seldom any added register pressure. Finally, many architectures support at least double-precision (64-bit wide) operations, which is often wider than the integer data path.

5.1.3. Separate Data Path

A separate data path that is not shared with existing integer or floating point data paths has the advantage of simpler pipeline control and potentially an increased number of registers. Disadvantages include the need for saving and restoring the new registers during context switches, as well as the relative difficulty and high overhead of moving data between register files. This can potentially limit performance if scalar code depends upon a vector result.

5.1.4. SIMD Register Width

One of the most critical factors in SIMD instruction set design is deciding how long vectors will be. If an existing data path is to be reused, there is little choice, but when a new data path is to be designed it makes sense to ask how wide is wide enough. A vector length that is too short limits the ability to exploit data parallelism, while an excessively long vector length can degrade performance due to insufficiently available data parallelism in an algorithm, as well as increase the amount of clean up code overhead. The "native width" column of Table 1 specifies how each of the multimedia kernels fits into one of the following categories:

unlimited width - Kernels that operate on data elements which are truly independent and are naturally arranged so as to be amenable to SIMD processing. The inner loops of these kernels can be strip mined at almost any length, with the increase in performance of a longer vector being directly proportional to the increase in vector length.

limited width - Although data elements are independent, there is overhead involved in rearranging input data so that it may be operated on in a SIMD manner with longer vectors. Thus, the performance advantage of longer vectors is limited by the overhead (which typically increases with vector length) required to employ them.

exact width - A kernel which has a precise natural width which can be considered to be the right match for the kernel. This width is the longest width that does not load, store, or compute excess, unused elements.

Long vectors can sometimes be a problem when their length exceeds the natural width of an algorithm. A good example of this problem is the add block kernel, which operates on MPEG sub-blocks (8 x 8 arrays of pixels). One input array (bp) consists of signed 16-bit integer values and the other (rfp) of unsigned 8-bit integers (Listing 1).

During the block reconstruction phase of motion compensation in the decoder, a block of pixels is reconstituted by summing the pixels in different sub-blocks. For example, in the AltiVec implementation of the add block kernel (Listing 2), each time a row of the 8x8 rfp sub-block is loaded on a 128-bit wide architecture, one half of the vector is useless data which will be thrown away when the rfp values are promoted to 16-bits. This problem could be avoided on a 128-bit architecture with a half

INT32 x0,x1,x2,x3,x4,x5,x6,x7,x8;

x7 = x8 + x3;
x8 -= x3;
x3 = x0 + x2;
x0 -= x2;
x2 = (181*(x4+x5)+128)>>8;
x4 = (181*(x4-x5)+128)>>8;

Listing 3. Inverse DCT - only row computation shown

extern FLOAT64 c[8][8];   /* transform coefficients */
INT16 block[8][8];
FLOAT64 sum;
FLOAT64 tmp[8][8];

for (INT32 i=0; i<8; i++)
    for (int j=0; j<8; j++) {
        sum = 0.0;
        for (int k=0; k<8; k++)
            sum += c[j][k] * block[i][k];
        tmp[i][j] = sum;
    }

Listing 4. DCT

length 64-bit load, an operation which other 128-bit SIMD architectures (such as Intel's SSE) have.

As we saw in Table 1, most multimedia kernels are either unlimited/limited width, or most often have an exact required width of 128-bits. The remaining exact native widths (64-, 88- and 512-bits) come up only once each. Thus, we consider a total vector length of 128-bits to be optimal for most multimedia algorithms, especially with the inclusion of short load/store operations.

5.1.5. Number of Registers

Multimedia applications and their kernels can generally take advantage of quite large register files. This rule of thumb is echoed in the architectures of dedicated media processors such as MicroUnity's chip, which has a 64 x 64-bit register file (also accessible as 128 x 32-bits) [11], while the Philips Trimedia TM-1 has a 128 x 32-bit register file [32].

The DCT and IDCT kernels (fragments of the original codes are given in Listing 3 and Listing 4) are excellent examples of where the importance of a large register file can be seen. The discrete cosine transform (DCT) is the algorithmic centerpiece of many lossy image compression methods. It is similar to the discrete Fourier transform (DFT) in that it maps values from the time domain to the frequency domain, producing an array of coefficients representing frequencies [18]. The inverse DCT maps in the opposite direction, from the frequency to the time domain. A 2D DCT or IDCT is efficiently computed by 1D transforms on each row followed by 1D transforms on each column and then scaling the resulting matrix appropriately. A SIMD approach requires that multiple data elements from several iterations be operated on in parallel for the greatest efficiency, so computation is straightforward for the 1D column DCT, since the corresponding elements of each loop iteration are adjacent in memory (assuming a row-major storage format). Performing a 1D row DCT is more problematic since the corresponding elements of adjacent rows are not adjacent in memory. The solution is to employ a matrix transposition (making corresponding "row" elements adjacent in memory), perform the desired computation, and then re-transpose the matrix to put the result matrix back into the correct configuration. We found this technique to only be of performance benefit on those architectures with register files that were able to hold the entire 16-bit 8x8 matrix. Since the DCT and IDCT both operate on 8x8 2D matrices of 16-bit signed values, they require at least 16 64-bit or 8 128-bit registers.

5.2. Data Types

The data types supported by the different multimedia instruction sets include {signed, unsigned} {8, 16, 32, 64} bit values, as well as single precision floating point. Most of the instruction sets do not support all of these types, and usually support only a subset of operations on each. To determine which data types and operations are useful, we looked at the dynamic SIMD instruction counts for the kernels when running the Berkeley multimedia workload. We then broke down the dynamic SIMD instruction counts on each architecture in two ways: data type distribution per instruction class (e.g. add, multiply) and per kernel. Here we discuss only the salient features of these categorizations, although the full set of results is available in [38].

In general, the video and imaging kernels (add block, block match, color space, DCT, IDCT, subsample horizontal, subsample vertical) utilize 8- and 16-bit operations. Audio kernels (FFT, max val, mix stereo, quantize, short term analysis filtering, short term synthesis filtering, synthesis filtering) either use 16-bit values or single precision floating point, while the 3D kernels (clip test, transform) are limited almost exclusively to single precision floating point.

5.2.1. Integer

Image and video data is typically stored as packed unsigned 8-bit values, but intermediate processing usually requires precision greater than 8-bits. We found that, other than for width promotion, most 8-bit functionality was wasted on our set of multimedia kernels. The signed 16-bit data type is the most heavily used because it is both the native data type for audio and speech data, as well as the typical intermediate data type for video and imaging. On the wide end of the spectrum, 32-bit and longer data types are typically only used for accumulation and simple data communication operations such as alignment. Operations on 32-bit values tend to be limited to addition and subtraction (for accumulation), width promotion and demotion (for converting to a narrower output data type) and shifts (for data alignment).
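The transpose, compute, re-transpose strategy described above for the 1D row transform can be sketched in scalar C. Only the transposition is concrete here; transform_columns() is a hypothetical stand-in for a real column-oriented 1D DCT:

```c
/* Sketch of the matrix transposition technique described above:
 * transpose so that "row" elements become adjacent in memory,
 * apply the column-oriented 1D transform, then transpose back. */
typedef short INT16;

static void transpose8x8(INT16 m[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = i + 1; j < 8; j++) {
            INT16 t = m[i][j];
            m[i][j] = m[j][i];
            m[j][i] = t;
        }
}

void row_transform_via_transpose(INT16 m[8][8],
                                 void (*transform_columns)(INT16 (*)[8]))
{
    transpose8x8(m);       /* row elements now column-adjacent */
    transform_columns(m);  /* SIMD-friendly column transform   */
    transpose8x8(m);       /* restore original orientation     */
}
```

On SIMD hardware the transpose itself is done in registers with merge/unpack or permute instructions, which is why the technique only pays off when the whole 16-bit 8x8 matrix fits in the register file.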


pmaxsw mm0, [CLIP_MIN]   ; compute first element
pminsw mm0, [CLIP_MAX]   ; in order to free mm0
movq   [esi + 0], mm0    ; store x0 [0..3]
movq   mm0, [CLIP_MIN]   ; clip to -256
pmaxsw mm1, mm0
pmaxsw mm2, mm0
pmaxsw mm3, mm0
pmaxsw mm4, mm0
pmaxsw mm5, mm0
pmaxsw mm6, mm0
pmaxsw mm7, mm0
movq   mm0, [CLIP_MAX]   ; clip to +255
pminsw mm1, mm0
pminsw mm2, mm0
pminsw mm3, mm0
pminsw mm4, mm0
pminsw mm5, mm0
pminsw mm6, mm0
pminsw mm7, mm0

Listing 5. Intel MMX/SSE IDCT

5.2.2. Floating Point

Single precision floating point plays an important role in many of the multimedia kernels, such as the geometry stage of 3D rendering (the clipping and transform kernels) and the FFT, where the wide dynamic range of floating point is required. Double precision SIMD is uncommon; currently it is only available with Intel's recently announced SSE2 extension on the Pentium IV processor. This data type is targeted more towards applications such as scientific and engineering workloads rather than multimedia [14], [15], and would have no use in the kernels we examine.

5.3. Operations

One of the primary goals of this work is to separate useful SIMD operations from those that are not. The large differences in current multimedia instruction sets for general purpose processors are fertile ground for making such a determination because many different design choices have been made. Please note the very important caveat that this determination is made only with respect to supporting our multimedia workload. We explicitly comment on those instructions which we found to be "useless" for our workload but for which we believe the designers had some other purpose or workload in mind.

5.3.1. Arithmetic

5.3.1.1. Modulo/Saturation

Modulo arithmetic "wraps around" to the next representable value when overflow occurs, while saturating arithmetic clamps the output value to the highest or lowest representable value for positive and negative overflow, respectively. Saturation is useful because of its desirable aesthetic result in imaging and audio computations. For example, when adding pixels in an imaging application, modulo addition is undesirable because if overflow occurs, a small change in operand values may result in a glaring visual difference (e.g. the addition of two white pixels results in a black pixel). Saturating arithmetic also allows overflow in multiple packed elements to be dealt with efficiently. If overflow within packed elements were to be handled as in traditional scalar arithmetic, an overflowing SIMD operation would have to be repeated and tested serially to determine which element overflowed. The only added cost of saturating arithmetic is separate instructions for signed and unsigned operands, since the values must be interpreted by the hardware as a particular data type. This is unlike modulo operations, for which the same instruction can be used for both unsigned and signed 2's complement values.

Despite the utility of saturating operations, modulo computations are still important because they allow results to be numerically identical to those of existing scalar algorithms. This is sometimes a consideration for the sake of compatibility and comparability. The kernels in the BMKL that employ saturating arithmetic are noted in Table 1. From this it is clear that the most important types of saturating arithmetic are unsigned 8-bit and signed 16-bit integers. The IDCT kernel clamps to a signed 9-bit range [-256..+255], which can be accomplished through a pair of max/min operations; we discuss max and min in more detail later. Saturating 32-bit operations are of little value since overflow is not often a concern for such a wide data type.

5.3.1.2. Shift

SIMD shift operations provide an inexpensive way to perform multiplication and division by powers of two. They are also crucial for supporting fixed point integer arithmetic, in which a common sequence of operations is the multiplication of an N-bit integer by an M-bit fixed point fractional constant, producing an (N+M)-bit result with a binary point at the Mth most significant bit. At the end of computation, the final result is rounded by adding the fixed point fraction representing 0.5, and then shifted right M bits to eliminate the fractional bits.


FLOAT32 ex = vEye[i][0], ey = vEye[i][1],
        ez = vEye[i][2], ew = vEye[i][3];
FLOAT32 cx = m0 * ex + m8 * ez,
        cy = m5 * ey + m9 * ez;
FLOAT32 cz = m10 * ez + m14 * ew,
        cw = -ez;
UINT8 mask = 0;
vClip[i][0] = cx; vClip[i][1] = cy;
vClip[i][2] = cz; vClip[i][3] = cw;
if      (cx >  cw) mask |= CLIP_RIGHT_BIT;
else if (cx < -cw) mask |= CLIP_LEFT_BIT;
if      (cy >  cw) mask |= CLIP_TOP_BIT;
else if (cy < -cw) mask |= CLIP_BOTTOM_BIT;
if      (cz >  cw) mask |= CLIP_FAR_BIT;
else if (cz < -cw) mask |= CLIP_NEAR_BIT;

Listing 6. Clip Test and Project

;; v1: | cx | cy | cz | cw |
vspltw   v2, v1, 3       ; v2: | cw | cw | cw | cw |
vcmpbfp. v3, v1, v2      ; v3: bit mask of clip comparisons
;; set cr6 to 0x2 if all test values are within boundaries
li       r12, 0
mcrf     cr0, cr6
bc       COND_TRUE, 0x2, fast_out
vcmpgtsw v4, v0, v3      ; v4: > test, - mask if clipping
vcmpgtsw v5, v3, v0      ; v5: < test, + mask if clipping
vand     v4, v4, v27
vand     v5, v5, v28
vor      v4, v4, v5      ; v4: | mask0 | mask1 | mask2 | mask3 |
vsldoi   v5, v4, v4, 8
vor      v4, v4, v5
vsldoi   v5, v4, v4, 4
vor      v4, v4, v5      ; v4: | mask | mask | mask | mask |
vspltb   v4, v4, 15      ; v4: | M | M | ... | M | M | [0..15]
stvebx   v4, 0, r5       ; store mask to mask_temp
lbz      r12, 0(r5)      ; r12: mask
lbz      r15, 0(r7)      ; r15: clipMask[i]
or       r15, r15, r12
stb      r15, 0(r7)      ; r15: clipMask[i] |= mask
or       r13, r13, r12   ; r13: tmpOrMask |= mask
fast_out:
addi     r7, r7, 1       ; r7: clipMask++ (pointer to UINT8)
and      r14, r14, r12   ; r14: tmpAndMask &= mask
addic.   r3, r3, -1      ; n--
bc       COND_FALSE, ZERO_RESULT, loop

Listing 7. Motorola G4 Clip Test and Project

5.3.1.3. Min/Max

Min and max output the minimum or maximum values of the corresponding elements in two input registers, respectively. A max instruction is of obvious utility in the maximum value search kernel, which searches through an array of signed 32-bit integers for the greatest absolute value. Max and min instructions have other, less apparent uses as well. Signed minimum and maximum operations are often used with a constant second operand to saturate results to arbitrary ranges. This is the case in the IDCT kernel, which clips its output value range to -256...+255 (9-bit signed integer), a range which does not correspond to the saturating data types supported by any of the multimedia extensions. Listing 5 demonstrates this clamping to arbitrary boundaries for the Intel implementation of the IDCT.

An arbitrary clamping operation can also be simulated with packed signed saturating addition by summing with the appropriate constants. Similarly, max and min can be synthesized through simpler comparison operations, but the additional execution cost is usually considerable. Architecturally, the implementation cost of max and min instructions should be low, since the necessary comparators must already exist for saturating arithmetic; the only difference is that instead of comparing to a constant, a register value is used. We have found that in many cases where comparisons seem to be required, max and min instructions are sufficient.

5.3.1.4. Comparisons

We have found that integer control flow instructions such as packed comparisons are seldom needed, except on architectures without max/min operations (e.g. Sun's VIS). One specific type of floating point comparison was of significant benefit to the clip test and project kernel. In that kernel, 3D objects are first mapped to 2D space through a matrix multiplication of 1x4 vectors and 4x4 matrices. Objects are then clipped to the viewable area to avoid unnecessary rendering. The code fragment listed in Listing 6 is computed for each vertex in a 3D scene every time a frame is rendered.


Processor           Instruction   Latency : Throughput
AMD Athlon          PSADBW        3 cycles : 64b every 1 cycle
DEC Alpha           PERR          2 cycles : 64b every 1 cycle
Intel Pentium III   PSADBW        5 cycles : 64b every 2 cycles
Sun UltraSPARC IIi  PDIST         3 cycles : 64b every 1 cycle

Table 3. SAD Instruction Survey

Motorola's AltiVec includes a specialized comparison instruction, vcmpbfp, which deals with boundary testing. This is done by testing all of the clip values in parallel to see if any clipping is needed, and then branching as a fast out if no clipping is necessary. This technique is extremely effective because no clipping is the common case, with most vertices within screen boundaries. An example of this is shown in the clip test kernel from Mesa's 3D rendering pipeline (Listing 7). For the Mesa "gears" application, the fast out case held true for 61946 of the 64560 (96.0%) clipping tests performed in our application run of 30 frames rendered at 1024x768 resolution.

5.3.1.5. Sum of Absolute Differences

A sum of absolute differences (SAD) instruction operates on a pair of packed 8-bit unsigned input registers, summing the absolute differences between the respective packed elements and placing (or accumulating) the scalar sum in another register. The block match kernel is the only one in which SAD instructions are used. Block match sums the absolute differences between the corresponding pixels of two 16x16 macroblocks (Listing 8 lists the core of the block match kernel utilized by MPEG-2 encoding).

Although DEC's MVI extension is quite small (only 13 instructions), one of the few operations that DEC did include was SAD. DEC's architects found (in agreement with our experience) that this operation provides the most performance benefit of all multimedia extension operations [34]. Interestingly, Intel's MMX, although a much richer instruction set, did not include this operation (it was later included in both AMD's enhanced 3DNow! and Intel's SSE extensions to MMX). Sun's VIS also includes a sum of absolute differences instruction.

The Motorola G4 microprocessor was the only CPU in our survey that did not include some form of SAD operation, forcing us to synthesize it from other instructions (Listing 10). Although Intel's SSE extension (see Listing 12) includes the psadbw instruction, this offers only a one cycle performance advantage over the AltiVec implementation. Despite the fact that Intel has a performance win with a dedicated SAD instruction, it requires several additional instructions because it is a 64-bit wide extension. To estimate the latency of a hypothetical SAD instruction for a 128-bit extension such as AltiVec, we first try to understand the latency of this instruction on other

UINT8 blk_1[16][16];
UINT8 blk_2[16][16];
INT32 sad = 0;
INT32 diff;

for (j = 0; j < 16; j++)
    for (i = 0; i < 16; i++) {
        diff = blk_1[j][i] - blk_2[j][i];
        sad += (diff < 0) ? -diff : diff;
    }

Listing 8. Block Match

FLOAT64 tx, ty, tz, len, scale;
FLOAT32 ux = u[i][0], uy = u[i][1], uz = u[i][2];

tx = ux * m[0] + uy * m[1] + uz * m[2];
ty = ux * m[4] + uy * m[5] + uz * m[6];
tz = ux * m[8] + uy * m[9] + uz * m[10];
len = sqrt( tx*tx + ty*ty + tz*tz );
scale = (len > 1E-30) ? (1.0 / len) : 1.0;
v[i][0] = tx * scale;
v[i][1] = ty * scale;
v[i][2] = tz * scale;

Listing 9. Transform and Normalize

;; v1: block #1, line #1, pixels [0..F]
;; v4: block #1, line #2, pixels [0..F]
;; v6: block #2, line #1, pixels [0..F]
vavgub   v1, v1, v4     ; v1: vertically interpolated p0..pF
vmaxub   v7, v1, v6     ; v7: max [0..F]
vminub   v8, v1, v6     ; v8: min [0..F]
vsububs  v7, v7, v8     ; v7: abs_diffs [0..F]
vsum4ubs v31, v7, v31   ; v31: | SAD_0 | SAD_1 | SAD_2 | SAD_3 |
vsumsws  v31, v31, v0   ; v31: |   0   |   0   |   0   |  SAD  |

Listing 10. Motorola G4 Block Match - SAD (starting with vmaxub) takes 9 cycles


; xmm3: tx^2+ty^2+tz^2 [0..3]  (len = sqrt(tx^2+ty^2+tz^2))
; xmm7: 0.5 [0..3]
; xmm5: 3.0 [0..3]
rsqrtps xmm4, xmm3    ; xmm4: rsqrtps(a)
movaps  xmm6, xmm3
mulps   xmm6, xmm4    ; xmm6: a*rsqrtps(a)
mulps   xmm6, xmm4    ; xmm6: a*rsqrtps(a)*rsqrtps(a)
mulps   xmm4, xmm7    ; xmm4: 0.5*rsqrtps(a)
subps   xmm5, xmm6    ; xmm5: 3.0 - a*rsqrtps(a)*rsqrtps(a)
mulps   xmm4, xmm5    ; xmm4: |1/len3|1/len2|1/len1|1/len0|
;; sqrt(a) = a*(1/sqrt(a))
mulps   xmm3, xmm4    ; xmm3: | len3 | len2 | len1 | len0 |

Listing 11. Intel Approximated Square Root - 25 cycles

;; mm1: block #1, line #1, pixels [0..7]
;; mm5: block #1, line #1, pixels [8..F]
;; mm3: block #1, line #2, pixels [0..7]
;; mm4: block #1, line #2, pixels [8..F]
;; mm2: block #2, line #1, pixels [0..7]
;; mm6: block #2, line #1, pixels [8..F]
pavgb  mm1, mm3    ; mm1: vertically interpolated p0..p7
pavgb  mm5, mm4    ; mm5: vertically interpolated p8..pF
psadbw mm1, mm2    ; mm1: | 0 | 0 | 0 |SAD0|  pixels 0..7
psadbw mm5, mm6    ; mm5: | 0 | 0 | 0 |SAD1|  pixels 8..F
paddd  mm1, mm5    ; mm1: | 0 | 0 | 0 | SAD|

Listing 12. Intel Pentium III Block Match - SAD (starting with first psadbw) takes 8 cycles

; xmm3: tx^2+ty^2+tz^2 [0..3]  (len = sqrt(tx^2+ty^2+tz^2))
; xmm3: a [0..3]
sqrtps xmm3, xmm3    ; xmm3: sqrt(a)

Listing 13. Intel Full Precision Square Root - 36 cycles

architectures (Table 3). The latency of a 128-bit instruction would be higher than that of the 64-bit instructions listed in the table because a SAD instruction requires a cascade of adders to sum (reduce) the differences between the elements. An N-bit SAD instruction (M = N/8) can be broken down into steps: 1) calculate M 8-bit differences, 2) compute the absolute value of the differences, 3) perform log2(M) cascaded summations. The architects of the 64-bit DEC MVI extension comment that a 3-cycle implementation of PERR would have been easily achievable, but in the end they achieved a more aggressive 2-cycle instruction [7]. If a SAD operation were to be implemented in Motorola's AltiVec, we estimate it would have a latency of 4 cycles; certainly a superior solution to the 9 cycle sequence shown in Listing 10.

5.3.1.6. Average

In addition to computing the sum of absolute differences, half-pixel interpolation, of which MPEG-2 encoding offers three varieties, is also important. Interpolation is done by averaging a set of pixel values with pixels spatially offset by one horizontally, vertically or both. Vertical interpolation is shown in Listing 10 and Listing 12, and is done through 8-bit unsigned average instructions. The original C MPEG-2 code first interpolates and then computes the sum of absolute differences on the result, but a similar interpolation operation can be done by averaging the results of several SAD operations using scalar integer arithmetic (since the result of a SAD instruction is a scalar value). This second approach was used when programming for DEC's MVI extension, as it does not include a packed average instruction. Integer average operations were used only in the block match kernel (for interpolation). This kernel operates on 8-bit unsigned values, and the 8-bit unsigned average was the only type of "average" instruction used within our workload.

5.3.1.7. High Latency Function Approximation

3D rendering applications use floating point math functions such as reciprocal and square root that are typically very high latency. In the transform and normalize kernel, graphics primitives are transformed to the viewer's frame of reference through matrix multiplications. The transform kernel has at its heart a floating point reciprocal square root operation and is shown in Listing 9. This code is computed for each vertex in a 3D scene.

On some architectures scalar versions of these functions (reciprocal and square root) are computed in software, while others have hardware instructions. The computation of these functions is iterative, so an operation's latency is directly proportional to the number of bits of precision returned; fully IEEE compliant operations return 24 bits of mantissa.

All of the SIMD floating point extensions (AMD's 3DNow!, Intel's SSE and Motorola's AltiVec) include approximation instructions for 1/x and 1/√x. These are typically implemented as hardware lookup tables, returning k bits of precision. In Intel's SSE, for example, the approximate reciprocal (rcp) and reciprocal square root (rsqrt) instructions return 12 bits of mantissa.


; mm7: tx^2+ty^2+tz^2 [0]  (len = sqrt(tx^2+ty^2+tz^2))
pfrsqrt  mm4, mm7    ; mm4: |~1/len0 |~1/len0 |
movq     mm5, mm4
pfmul    mm4, mm4
pfrsqit1 mm4, mm7
pfrcpit2 mm4, mm5    ; mm4: | 1/len0 | 1/len0 |
pfmul    mm7, mm4    ; mm7: |  len0  |  len0  |

Listing 14. AMD Approximated Square Root - 20 cycles per element

One unique aspect of Intel's SSE instruction set is that it includes not only approximations of 1/x and 1/√x, but also full precision (24 bits of mantissa) versions of x/y and √x. Of course, this added precision comes at a price, namely much higher latency than the pipelined approximations (in fact, Intel's full precision instructions are not pipelined on the Pentium III).

All of the floating point multimedia extension vendors, including Intel, point to the Newton-Raphson method for improving the accuracy of approximations through specially derived functions. In the case of 1/√a it is possible to iteratively increase the precision of an initial approximation x0 through the equation:

x1 = x0 - (0.5 × a × x0^3 - 0.5 × x0) = 0.5 × x0 × (3.0 - a × x0^2)

Employing an approximation instruction in conjunction with the Newton-Raphson method to achieve full precision is actually faster than the full precision version of the instruction that Intel provides; compare the code fragments in Listing 11 and Listing 13, which have execution times of 25 vs. 36 cycles, respectively. One iteration of the Newton-Raphson method is enough to improve a 12-bit approximation to the full 24-bit precision of IEEE single precision.

The added cost of the Newton-Raphson method is the register space needed to hold intermediate values and constants. AMD's 3DNow! extension circumvents this by including instructions which internally perform the Newton-Raphson method, rather than having the programmer explicitly implement it (Listing 14). The only odd thing about AMD's reciprocal square root instructions is that they are actually scalar; they produce only one result value, based on the lower packed element. These instructions are a part of 3DNow! and not the normal x87 floating point; they expect input data to be in packed form (64 bits comprised of two 32-bit single precision subwords). It is necessary to execute two Newton-Raphson operations to compute both packed elements when using AMD's 3DNow! extension.

5.3.2. Exceptions

The techniques which SIMD instruction sets employ to take care of exception handling are very similar to those used when dealing with overflow. Checking result flags or generating an exception from a packed operation requires considerable time to determine which packed element caused the problem. Fortunately, in most cases where an exception might be raised it is possible to fill in a value which will give reasonable results for most applications and continue computation without interruption. This speeds execution because no explicit error condition checking need be done, and is similar to saturating integer arithmetic, where maximum or minimum result values are substituted rather than checking for and reporting positive or negative overflow.

Only Intel's SSE implements fully IEEE compliant SIMD floating point exceptions, and includes a control/status register (MXCSR) to mask or unmask packed floating point numerical exceptions.

5.3.3. Floating Point Rounding

Fully compliant IEEE floating point supports four rounding modes, but most real time 3D applications are not particularly sensitive to rounding error. It is often acceptable to incur a slight loss in precision and only use the flush to zero (FTZ) rounding mode [43]. Flush to zero clamps to a minimum representable result in the event of underflow (a number too small to be represented in single precision floating point). Intel's SSE offers a fully IEEE compliant rounding mode as well as a faster flush to zero mode. 3DNow! supports only truncated rounding (round to zero). All of Motorola's AltiVec floating point arithmetic instructions use the IEEE default rounding mode of round to nearest, but the IEEE directed rounding modes are not provided.

5.3.4. Type Conversion

It is useful to distinguish an application's storage data type (how data is stored in memory or on disk) from its computational data type (the intermediate data type used to compute a result). In general, storage data types, although narrow and therefore potentially offering the greatest degree of parallelism, are insufficiently wide for intermediate computations to occur without overflow.

Width promotion is the expansion of an N-bit value to some larger width. For unsigned fixed point numbers this requires zero extension (filling any additional bits with zeros). Zero extension is usually not specified as such in a multimedia architecture because it overlaps in functionality with data rearrangement instructions such as unpack or merge: packed values are merged with another register which has been zeroed prior to merging. Signed element unpacking is not as simple, and is rarely supported directly by hardware; only the AltiVec instruction set includes it. It can be synthesized with multiplication by one, since a multiplication yields a result that is the overall width of both its operands.


[Figure 1 diagrams byte-level examples on 64-bit registers vA (bytes 0x20..0x27), vB (bytes 0x40..0x47), a control vector vC, and a 32-bit scalar rA = 0x00000005:
merge_left vD,vA,vB interleaves the upper four bytes of vA and vB;
rotate_left vD,vA,vB,0x2 shifts the concatenated vA:vB byte sequence left by two bytes;
insert vB,rA,0x1 places the low byte of rA into byte 1 of vB;
extract rD,vB,0x2 reads byte 2 of vB into rD (0x00000042);
permute vD,vA,vB,vC selects bytes of vA and vB according to vC (index 0x0i selects byte i of vA, 0x1i selects byte i of vB).]

Figure 1. Data Communication Operations – source registers are shown at the top, operations are considered independently.

Video and imaging algorithms often utilize an 8-bit unsigned data type for storage, but 16 bits for computation; 8-bit to 16-bit zero extension is therefore useful. Audio and speech algorithms, on the other hand, typically employ signed 16-bit values, but because multiplication by a fractional fixed point constant is a common operation, these values are often unpacked as a natural consequence of computation. So, although a signed unpack operation would likely be faster than multiplication by 1, it is seldom necessary to resort to this in practice.

All data types that occur in multimedia should be supported for packing and unpacking, even those widths not directly supported by arithmetic operations. It should always be possible to convert to a width that is supported for computation or required for storage.

5.3.5. Data Rearrangement

SIMD instructions perform the same operation on multiple data elements. Because all of the data within a register must be treated identically, the ability to efficiently rearrange data bytes within and between registers is critical to performance. We will refer to this class of instructions as "data communication"; the various flavors are demonstrated in Figure 1. Interleave or merge instructions (also referred to as mixing or unpacking) merge alternate data elements from the upper or lower halves of two source registers. Align or rotate operations allow for arbitrary byte-boundary realignment of the data in two source registers; essentially a shift operation performed in byte multiples. Both interleave and align type operations have hard coded data communication patterns. Insert and extract operations allow a specific packed element to be extracted as a scalar, or a scalar value to be inserted at a specified location. Shuffle (also called permute) operations allow for greater flexibility than operations with fixed communication patterns, but this added flexibility requires that the communication pattern be specified either in a third source register or as an immediate value in part of the instruction encoding.

The sufficiency of simple data communication operations is to some degree dependent on the vector length employed. For example, 128-bit AltiVec vectors contain up to sixteen elements, while a shorter extension such as Intel's 64-bit MMX contains at most eight of the same type of element. This means that simple data rearrangement operations cover a relatively larger fraction of all possible mappings for a shorter vector length.

A related class of instructions that we found to be quite useful when programming with the AltiVec extension was "splat" instructions. A splat operation places either an immediate scalar or a specified element from a source register into every element of the destination register. This was very useful when constants were

fmul8sux16 %f0, %f2, %f4
fmul8ulx16 %f0, %f2, %f6
fpadd16    %f4, %f6, %f8

Listing 15. Sun VIS 16-bit x 16-bit → 16-bit Multiply

fmuld8sux16 %f0, %f2, %f4
fmuld8ulx16 %f0, %f2, %f6
fpadd32     %f4, %f6, %f8

Listing 16. Sun VIS 16-bit x 16-bit → 32-bit Multiply

required; on other architectures it is necessary to statically store these types of values in memory, and then load them into a register as needed.

[22] presents a novel set of simple data communication primitives which can perform all 24 permutations of a 2x2 matrix in a single cycle on a processor with dual data communication functional units. This is useful because any larger data communication problem can be decomposed into 2x2 matrices. Although this approach might make some very complex data communication patterns slow to compute, we have found that most multimedia algorithms have patterns which are relatively simple. Because of this we endorse [22]'s technique for supplying the data communication needs of multimedia applications.

5.3.6. Prefetching

Prefetching is a hardware or software technique which tries to predict data access needs in advance, overlapping memory access with useful computation. Although we will not otherwise examine the performance implications of prefetch instructions (which would be a useful extension to this study), we mention them briefly because they are often a part of multimedia instruction sets due to the highly predictable nature of data accesses in multimedia applications.

Prefetching can be implemented in software, hardware or both, with each technique distinguished by its methods for detection (discovering when prefetching will be of benefit) and synchronization (the means for ensuring that data is available in the cache before it is needed, but not so far in advance that it pollutes the cache by purging still needed items) [40].

Many multimedia instruction sets take a purely software approach by providing prefetch instructions that fetch only the cache block at a specified address into the cache from main memory (without blocking normal load/store instruction accesses). Determining the ideal location for this type of prefetch instruction depends on many architectural parameters. Unfortunately, these include such things as the number of CPU clocks for memory latency and the number of CPU clocks to transfer a cache line, which are both highly machine dependent and not readily available to the programmer.

Another technique is a software-hardware hybrid approach in which a software-directed hardware prefetching scheme is used. One variation of this is implemented by Motorola's AltiVec and its data stream touch (dst) instruction, which indicates the memory sequence or pattern that is likely to be accessed. We will refer to this hybrid of hardware and software prefetching as software directed prefetching, to indicate that a separate prefetch instruction need not be issued for each data element. Hardware optimizes the number of cache blocks to prefetch, so it is not necessary for the programmer to know the parameters of the cache system. Currently there is no means for synchronization in AltiVec implementations (other than periodically reissuing dst instructions, thereby limiting the potential benefit of this technique over pure software). Several hardware techniques for maintaining the synchronization of data prefetch and consumption in hybrid solutions, as well as pure hardware prefetching, are discussed in [30].

5.4. Bottlenecks and Unused Functionality

In this section we discuss those features which appear in multimedia instruction sets but do not appear to be useful, and are not "free"; i.e. they are not a low (or no) cost side effect of some other useful feature.

5.4.1. Instruction Primitives

The VIS instruction set does not include full 16-bit multiply instructions. It instead offers multiplication primitives, the results of which must be combined through addition (see Listing 15 and Listing 16). The reason that the architects of VIS divided up 16-bit multiplication in this way was to decrease die area: not providing a full 16x16 multiplier subunit cut the size of the arrays in half [44]. Unfortunately, dividing an operation into several instructions which are not otherwise useful in and of themselves increases register pressure, decreases instruction decoding bandwidth and creates additional data dependencies. Splitting SIMD instructions (which have been introduced for their ability to extract data parallelism) can actually cripple a superscalar processor's ability to extract instruction level parallelism. A multi-cycle operation can be a better solution than a multi-instruction operation because instruction latencies can be transparently upgraded in future processors, while poor instruction semantics cannot be repaired without adding new instructions.

5.4.2. Unused High Latency Approximation Instructions

Floating point 1/x approximation instructions, although available on several platforms, did not find application in the kernels we studied. AltiVec includes approximate log2 and exp2 instructions, which can potentially find application in the lighting stage of a 3D rendering pipeline. Lighting calculations are currently handled by 3D accelerator cards, and not the CPU, and so were not included in the BMKL.

5.4.3. Unused Pixel Conversion Instructions

Motorola's AltiVec extension includes pixel pack (vpkpx) and pixel unpack (vupkhpx, vupklpx) instructions for converting between 32-bit true color and 16-bit color representations. These did not find application within the BMKL, although it is possible that they might be of utility in situations where AltiVec needs to operate on 16-bit color data; many video games use 16-bit textures, for example.

5.4.4. Unused Memory Access Instructions

Sun's VIS includes two sets of instructions for accelerating multimedia operations with sophisticated memory addressing needs. The first, edge8, edge16, and edge32, produce bit vectors to be used in conjunction with partial store instructions to

                         AMD       DEC          Intel        Motorola   Sun
                         Athlon    Alpha 21264  Pentium III  7400 (G4)  UltraSPARC IIi
Int Load/Store           5.4%      20.3%        6.2%         2.4%       2.6%
Int Arithmetic           12.1%     29.2%        12.3%        19.1%      26.6%
Int Control Flow         12.8%     11.3%        13.4%        13.8%      9.3%
Int Data Communication   2.3%      6.7%         2.1%         1.1%       0.0%
FP Load/Store            3.2%      11.2%        4.0%         7.4%       8.0%
FP Arithmetic            0.0%      18.1%        0.0%         2.3%       8.4%
FP Control Flow          0.0%      0.1%         0.0%         0.0%       6.2%
FP Data Communication    0.0%      0.0%         0.0%         0.0%       0.0%
SIMD Load/Store          19.4%     --           21.0%        17.9%      11.7%
SIMD Data Communication  12.3%     --           12.6%        17.0%      9.7%
SIMD Int Arithmetic      12.5%     3.1%         18.8%        11.8%      13.6%
SIMD Int Control Flow    0.0%      --           0.0%         0.0%       3.5%
SIMD FP Arithmetic       19.9%     --           9.3%         6.9%       --
SIMD FP Control Flow     0.0%      --           0.0%         0.0%       --
Control State            0.2%      0.0%         0.2%         0.3%       0.4%
Total Count              4.49E+09  7.45E+09     3.52E+09     4.53E+09   5.60E+09

Table 4. Overall Instruction Mix – values are percentages of the total counts listed at the bottom of each column. Control state instructions are those that interact with special purpose registers. Control flow instructions include branches as well as conditional moves and comparisons. Dashes indicate functionality that was not implemented by a particular architecture (as opposed to implemented but unused, which is indicated by 0.0%).

deal with the boundaries in 2D images. The second group of addressing instructions includes array8, array16 and array32, which find use in volumetric imaging (the process of displaying a two dimensional slice of three dimensional data). An array instruction converts (x,y,z) coordinates into a memory address. The Berkeley multimedia workload does not include any volumetric imaging applications, so it is unsurprising that these instructions found no use in our workload.

5.4.5. Singular, Highly Utilized Resources

Although we usually think of SIMD architectures as extracting data level parallelism, all of the implementations of the instruction sets we have examined are also superscalar, with multiple parallel SIMD functional units. In fact, unless the SIMD vector length is long enough to hold the entire data set being operated on, there is almost always the potential to extract instruction as well as data level parallelism.

Sun's VIS architecture does not include subword parallel shift instructions. Instead, the graphics status register (GSR) has a 3-bit addr_offset field that is used implicitly for byte granularity alignment, as well as a 4-bit scale_factor field for packing/truncation operations. The VIS GSR is a serializing bottleneck because any time packing or alignment functionality is needed, it must be pushed through the GSR. Because VIS lacks subword shift operations, we found ourselves synthesizing such operations with the packing and alignment operations whenever we could not otherwise avoid them. Even with careful planning of packing and alignment operations, it was often necessary to write to the GSR (a slow operation) several times during each iteration of the loops of our multimedia kernels. The serializing effect of this singular resource prevented VIS operations from proceeding at the fullest possible degree of (instruction) parallelism.

5.4.6. Transparently Forced Alignment

Hardware support to efficiently handle memory accesses that are not aligned is expensive in both area and timing [43]. Ideally, data would always be aligned through software by the compiler or run-time architecture, but in some situations it is impossible to guarantee alignment. For example, in the motion compensation step of MPEG video coding, unaligned memory access is needed depending on the motion vector [20], as the addresses of the reference macroblock can be random depending on the type of motion search being performed.

Allowing only aligned memory accesses (and synthesizing unaligned accesses in software) can potentially perform better than unaligned access implemented in hardware. The AltiVec instruction set architecture does not provide for alignment exceptions when loading and storing data. Alignment is maintained by forcing the lower four bits of any address to zero. This is transparent to the programmer, so the programmer is responsible for guaranteeing alignment; otherwise incorrect data may be loaded or stored. We believe it is better that performance and correctness issues due to alignment be made explicit: the loading of incorrect data due to a mistaken assumption about alignment is extremely difficult to track down.

5.5. Overall Instruction Mix

Table 4 shows the total mix of dynamic instructions executed by each architecture. Counts are for the code within the kernels only, and do not include instructions from the rest of each application. Control state instructions are those that interact with special purpose machine registers (e.g. the GSR on Sun's VIS). Branches

C Time (ns)
 Kernel            Athlon 500  Alpha 500  P-III 500  G4 500    USIIi 360  USIIi 500*  Average
 Add Block         1,087       669        969        1,054     1,662      1,197       1,088
 Block Match       1,801       979        1,855      1,466     2,408      1,734       1,702
 Clip              7,374       8,591      7,938      8,906     12,222     8,800       9,006
 Color Space       1.22E+08    2.24E+07   6.01E+07   5.80E+07  6.20E+07   4.47E+07    6.49E+07
 DCT               24,332      12,475     38,192     33,176    30,922     22,264      27,819
 FFT               125,484     67,321     147,047    232,272   111,258    80,106      136,677
 IDCT              2,999       1,276      2,772      3,326     3,782      2,723       2,831
 Max Val           2,407       1,143      2,536      4,110     6,165      4,439       3,272
 Mix               956         616        1,065      1,710     3,550      2,556       1,579
 Quantize          34,233      19,549     37,100     29,488    34,928     25,148      31,059
 Short Term Anal   15,268      11,120     16,129     24,156    36,773     26,477      20,689
 Short Term Synth  21,849      19,208     43,582     23,721    48,660     35,035      31,404
 Subsmp Horiz      1.60E+07    9.21E+06   1.58E+07   1.75E+07  1.77E+07   1.27E+07    1.52E+07
 Subsmp Vert       2.55E+07    1.47E+07   2.20E+07   2.93E+07  3.99E+07   2.87E+07    2.63E+07
 Synth Filt        7,308       4,900      8,349      4,917     7,138      5,140       6,522
 Transform         11,527      7,990      20,491     81,985    14,455     10,408      27,290
 Arith. Mean       1.0E+07     2.9E+06    6.1E+06    6.6E+06   7.5E+06    5.4E+06     6.7E+06
 Geom. Mean        3.7E+04     2.2E+04    4.1E+04    4.7E+04   5.3E+04    3.8E+04     4.2E+04

SIMD Time (ns)
 Kernel            Athlon 500  Alpha 500  P-III 500  G4 500    USIIi 360  USIIi 500*  Average
 Add Block         114         350        152        423       342        246         276
 Block Match       257         379        381        437       852        613         461
 Clip              4,559       -          6,336      3,427     -          -           4,774
 Color Space       1.29E+07    1.30E+07   1.08E+07   1.30E+07  2.30E+07   1.66E+07    1.45E+07
 DCT               1,438       -          1,228      907       3,353      2,415       1,732
 FFT               63,446      -          84,661     116,828   -          -           88,312
 IDCT              827         -          993        847       2,508      1,806       1,294
 Max Val           604         1,168      669        618       8,715      6,275       2,355
 Mix               341         -          528        446       1,716      1,235       758
 Quantize          18,557      -          15,010     14,335    -          -           15,967
 Short Term Anal   9,887       -          11,003     3,729     15,009     10,806      9,907
 Short Term Synth  7,425       -          5,759      3,672     15,202     10,945      8,015
 Subsmp Horiz      1.13E+07    9.26E+06   1.17E+07   1.08E+07  1.95E+07   1.41E+07    1.25E+07
 Subsmp Vert       1.08E+07    4.49E+06   1.00E+07   3.07E+06  1.05E+07   7.53E+06    7.77E+06
 Synth Filt        2,585       -          4,718      2,727     -          -           3,343
 Transform         4,597       -          4,338      2,998     -          -           3,978
 Arith. Mean       2.2E+06     4.5E+06    2.0E+06    1.7E+06   4.8E+06    3.5E+06     2.2E+06
 Geom. Mean        1.1E+04     6.6E+04    1.2E+04    1.0E+04   3.2E+04    2.3E+04     1.5E+04

SIMD:C Performance Ratio
 Kernel            Athlon 500  Alpha 500  P-III 500  G4 500    USIIi 500*  Average
 Add Block         9.6x        1.9x       6.4x       2.5x      4.9x        5.0x
 Block Match       7.0x        2.6x       4.9x       3.4x      2.8x        4.1x
 Clip              1.6x        -          1.3x       2.6x      -           1.8x
 Color Space       9.4x        1.7x       5.6x       4.5x      2.7x        4.8x
 DCT               16.9x       -          31.1x      36.6x     9.2x        23.5x
 FFT               2.0x        -          1.7x       2.0x      -           1.9x
 IDCT              3.6x        -          2.8x       3.9x      1.5x        3.0x
 Max Val           4.0x        1.0x       3.8x       6.6x      0.7x        3.2x
 Mix               2.8x        -          2.0x       3.8x      2.1x        2.7x
 Quantize          1.8x        -          2.5x       2.1x      -           2.1x
 Short Term Anal   1.5x        -          1.5x       6.5x      2.5x        3.0x
 Short Term Synth  2.9x        -          7.6x       6.5x      3.2x        5.0x
 Subsmp Horiz      1.4x        1.0x       1.3x       1.6x      0.9x        1.3x
 Subsmp Vert       2.4x        3.3x       2.2x       9.6x      3.8x        4.2x
 Synth Filt        2.8x        -          1.8x       1.8x      -           2.1x
 Transform         2.5x        -          4.7x       27.4x     -           11.5x
 Arith. Mean       4.5x        1.9x       5.1x       7.6x      3.1x        5.0x
 Geom. Mean        3.4x        1.7x       3.3x       4.6x      2.5x        3.6x

C %DM
 Kernel            Athlon 500  Alpha 500  P-III 500  G4 500    USIIi 500*
 Add Block         -9%         33%        3%         -6%       -20%
 Block Match       -15%        38%        -18%       6%        -11%
 Clip              11%         -3%        5%         -7%       -6%
 Color Space       -99%        64%        2%         6%        27%
 DCT               7%          52%        -46%       -27%      15%
 FFT               4%          48%        -13%       -78%      39%
 IDCT              -14%        51%        -6%        -27%      -4%
 Max Val           18%         61%        13%        -40%      -52%
 Mix               31%         55%        23%        -24%      -85%
 Quantize          -18%        33%        -27%       -1%       14%
 Short Term Anal   18%         40%        13%        -30%      -42%
 Short Term Synth  24%         33%        -52%       17%       -22%
 Subsmp Horiz      -12%        35%        -11%       -23%      10%
 Subsmp Vert       -6%         39%        9%         -22%      -19%
 Synth Filt        -19%        20%        -36%       20%       16%
 Transform         56%         70%        23%        -210%     61%
 Mean              -1%         42%        -7%        -28%      -5%

SIMD %DM
 Kernel            Athlon 500  Alpha 500  P-III 500  G4 500    USIIi 500*
 Add Block         56%         -36%       41%        -65%      4%
 Block Match       38%         8%         8%         -6%       -48%
 Clip              28%         -35%       0%         46%       -39%
 Color Space       3%          2%         19%        2%        -25%
 DCT               -9%         52%        7%         31%       -82%
 FFT               23%         18%        -3%        -42%      3%
 IDCT              28%         -11%       14%        26%       -57%
 Max Val           68%         37%        64%        67%       -236%
 Mix               46%         3%         17%        30%       -95%
 Quantize          0%          -6%        19%        23%       -36%
 Short Term Anal   -6%         -19%       -18%       60%       -16%
 Short Term Synth  21%         -104%      39%        61%       -16%
 Subsmp Horiz      1%          19%        -2%        5%        -23%
 Subsmp Vert       -50%        38%        -40%       57%       -5%
 Synth Filt        36%         -22%       -18%       32%       -28%
 Transform         24%         -32%       28%        51%       -72%
 Mean              19%         -6%        11%        24%       -48%

Table 5. Kernel Performance Statistics - From top to bottom: C time (ns), SIMD time (ns), SIMD speedup relative to C, C percent deviation from the mean (%DM) and SIMD %DM (see text for further details on this metric). A dash (-) indicates a kernel that was not implemented in SIMD due to insufficient instruction set functionality. The DCT kernel was originally coded in floating point, but implemented in fixed point integer for the SIMD codes. (*) Sun UltraSPARC IIi 500 MHz values are interpolated from measurements taken on a 360 MHz system.
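The %DM rows of Table 5 are consistent with each 500 MHz platform's percent deviation from the five-platform mean time for a kernel, with positive values meaning faster than average. A minimal C sketch of this reading of the metric (pct_dm is an illustrative helper of ours, not code from the study):

```c
#include <assert.h>

/* Percent deviation from the mean (%DM): positive means faster
   (better) than the average of the platforms, negative means
   slower. times_ns[] holds one kernel's times, one per platform;
   'which' selects the platform being scored. */
static double pct_dm(const double times_ns[], int n, int which)
{
    double mean = 0.0;
    for (int i = 0; i < n; i++)
        mean += times_ns[i];
    mean /= n;
    return 100.0 * (mean - times_ns[which]) / mean;
}
```

Feeding in the five 500 MHz C times for the add block kernel (1,087, 669, 969, 1,054 and 1,197 ns) reproduces the published -9% (Athlon) and +33% (Alpha) figures.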

and other data dependent codes such as conditional moves are categorized as "control flow" type instructions.

We can see that SIMD kernels utilize a significant number of scalar operations for pointers, loop variables and clean up code, evidence that sharing the integer data path is not a good idea. Intel's SSE is a 128-bit wide extension, as compared to AMD's 64-bit wide 3DNow!, explaining why Intel's overall instruction count is lower by about 1 billion (22%) instructions. The same reasoning applies to Motorola's G4, which has the 128-bit wide (both packed integer and floating point) AltiVec extension. DEC's bloated instruction count is due to the fact that its MVI extension is very limited in functionality (13 instructions in total), and so many operations need to be done with scalar instructions.

Data communication operations represent the overhead necessary to put data in a format amenable to SIMD operations. Ideally, these types of instructions should make up as small a part of the dynamic instruction mix as possible. Table 4 reveals that the Motorola AltiVec extension executes the largest percentage of these instructions. This is due to two factors: 1) the wider register width means that it is less likely that the data is arranged correctly as first loaded from memory, and 2) data communication operations are used by AltiVec to simulate unaligned access to memory.

6. Performance Comparison

6.1. SIMD Speedup over C

In order to establish a baseline for performance, the average execution time of the C and SIMD assembly versions of the BMKL were measured on each platform (Table 5). A speedup from 2x-5x was typical, although some kernels were sped up considerably more. Note that the C compiler for the Apple system is from a pre-release version of MacOS X (developer's preview 3) and is known to be weak, inflating the speedup of AltiVec over C.

The DCT and IDCT algorithms are of the same computational order of magnitude, so it might seem strange that they demonstrate such different performance improvements (Table 5). The reason for this is that the algorithm employed by the original MPEG-2 encoder DCT code (Listing 4) is not very efficient - it uses double precision floating point where fixed point integer should provide sufficient precision (and is typically faster on most architectures) [41]. Unlike the original scalar forward DCT code, which was computed in floating point, the scalar inverse DCT source code (Listing 3) is written in fixed point integer. The SIMD forward and inverse DCT are both computed in fixed-point integer. This meant that it was not appropriate to use the original code as the


static INT32 lutab[10000];
INT32    l_end;
FLOAT64  xr[];
INT32    ix[];
FLOAT64  *istep_p;
FLOAT32  temp;

for (i = 0; i < l_end; i++) {
    temp = xr[i] * (*istep_p);
    /* ix[i] = temp * sqrt(temp) + 0.4054; small values are taken
       from the lutab[] lookup table (not shown) */
    ix[i] = (INT32)(temp * sqrt(temp) + 0.4054);
}

Listing 17. Quantize


latency. From Table 5, we can see that overall SIMD AMD Athlon code does 19.1% better than average, while the Pentium III is 10.9% better than the average of all of the multimedia instruction sets. Both chips perform well, but the AMD Athlon matches or outperforms the Intel chip on all but one SIMD floating point kernel (quantize).

Quantization is a process by which a real value is converted to a discrete integer representation. In the quantization kernel, an array of real double precision floating point values, xr[], is converted to an array of 32-bit integers, ix[], according to the function ix[i] = xr[i] * sqrt(xr[i]) + 0.4054. Note that the original C implementation utilizes a lookup table for some values (not shown in Listing 17). Although SIMD floating point instruction sets include approximations for 1/sqrt(x) rather than sqrt(x), the equivalence sqrt(x) = x * (1/sqrt(x)) can be used. The reason for AMD's lackluster performance on the quantize kernel is that AMD's reciprocal square root instructions are actually scalar - they only compute the result for the lower packed element. This costs 3DNow! performance on this highly square-root intensive kernel.

6.2.2. DEC Alpha 21264

From Table 5 we see that the DEC Alpha platform far outstrips the other systems in terms of compiled C performance. It has been claimed that a broader multimedia instruction set would not be useful on Alpha, as an extension like Intel's MMX only fixes x86 legacy-architecture performance deficiencies which are not present in the Alpha architecture [34]. Our performance comparison shows this argument to be false, as the kernels programmed with the extensions from AMD, Intel and Motorola were able to not only match the performance of those on the Alpha 21264, but often exceed it. This indicates that sophisticated multimedia instruction sets are far more than a means of overcoming x86 architectural limitations.

6.2.3. Motorola G4

Motorola's AltiVec was the only multimedia extension which was architected from the ground up - all of the others in some way leverage existing resources. The instructions included in AltiVec are far more general than those required by multimedia applications. Almost every possible operation and data type is supported, which should allow it to be applied to other application domains as well. A 128-bit register width, combined with latencies that are at least as good as those found on the other (64-bit) architectures, allows AltiVec to come in with the best overall performance (23.7% better than average on SIMD accelerated code, according to Table 5). However, performance on a few of the kernels is still poor, especially add block and FFT. The add block kernel's problem has already been discussed - the 128-bit register width is actually too long for this kernel, causing unnecessary data to be loaded from memory, since there is no 64-bit short load in AltiVec, although there are 8, 16, and 32-bit short loads. The reason for the poor FFT performance is not entirely clear, although revisiting our code revealed that the way in which we coded stores to unaligned memory could be improved slightly.

6.2.4. Sun UltraSPARC IIi

The performance of the VIS multimedia extension was mediocre at best (even when scaled from 360 to 500 MHz). It is the oldest multimedia instruction set we have looked at, the second to be released after HP's MAX-1 (1996). As we have pointed out earlier, this instruction set suffers from some odd instruction choices (e.g. multiplication primitives), missing functionality (no subword shift operations) and an over-utilized graphics status register that is a bottleneck.

7. New Directions

In this section we describe two new ideas for future multimedia extensions based on our coding experience.

7.1. Fat Subwords

With current SIMD instruction sets, data is loaded into registers in the same packed format that it is stored in memory. Frequently, unpacking is required before operations are performed, and the unpacked data of course no longer fits within a single register. We therefore propose:

· registers (and therefore subwords) that are wider than the data to be loaded into them
· implicit unpack with load
· implicit pack (and saturate) with store

A fat subword architecture is in some ways similar to the as yet unimplemented MIPS MDMX instruction set [24]. The MIPS MDMX extension has a single 192-bit accumulator register as its architectural cornerstone, with the more usual style of register to register SIMD operations also included; non-accumulator SIMD operations share the 64-bit floating point datapath. The destination of normal SIMD instructions can be either another SIMD register or the accumulator (to be loaded or accumulated). When accumulating packed unsigned bytes, the accumulator is partitioned into eight 24-bit unsigned slices. Packed 16-bit operations cause the accumulator to be split into four 48-bit sections.

What is good about the MDMX accumulator approach is that implicit width promotion provides an elegant solution to overflow and other issues caused by packing data as tightly as possible. For example, multiplication is semantically difficult to deal with on SIMD architectures because the result of a multiply is longer than either operand [23]. This problem is avoided by the MDMX accumulator, because there is enough extra space to prevent overflow. Fixed point arithmetic is also more precise because full precision results can be accumulated, and the total rounded only once at the end of the loop. Similarly, most scalar multimedia algorithms only specify saturation at the end of computation because it is more precise. For example, if we are adding three signed 16-bit values:

Saturate Every Step: 32760 + 50 - 20 = 32747
Saturate Last Step:  32760 + 50 - 20 = 32767

(In the first case the intermediate sum 32760 + 50 saturates to 32767 before the subtraction; deferring saturation yields 32790, which saturates to 32767.) Unfortunately, in SIMD architectures where the packed elements maximally fill the available register space, the choice is either to be imprecise (compute with saturation at every step) or lose parallelism (explicitly promote the input data to be wider and

UINT8 *rowp, *y_p, *u_p, *v_p;
INT32 red = *rowp++, green = *rowp++, blue = *rowp++;
*y_p++ = +0.2558*red + 0.5022*green + 0.0975*blue +  16.5;
*u_p++ = -0.1476*red - 0.2899*green + 0.4375*blue + 128.5;
*v_p++ = +0.4375*red - 0.3664*green - 0.0711*blue + 128.5;

Listing 18. Color Space Conversion

rgb_to_yuv:
    oris    r11,r11,RED_PATTERN
    oris    r12,r12,GREEN_PATTERN
    oris    r13,r13,BLUE_PATTERN
    lvxstrd v28,0,r3,r11 ;; v28: |r0|r1|r2|r3|r4|r5|r6|r7|r8|r9|rA|rB|rC|rD|rE|rF|
    lvxstrd v29,0,r3,r12 ;; v29: |g0|g1|g2|g3|g4|g5|g6|g7|g8|g9|gA|gB|gC|gD|gE|gF|
    lvxstrd v30,0,r3,r13 ;; v30: |b0|b1|b2|b3|b4|b5|b6|b7|b8|b9|bA|bB|bC|bD|bE|bF|

Listing 19. Modified Color Space Conversion
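The lvxstrd instruction in Listing 19 is hypothetical - it is the strided load proposed in Section 7.2, gathering one color channel out of band-interleaved RGB pixel data. Its effect can be modeled in scalar C (strided_load_u8 is an illustrative model of the semantics, not a proposed encoding):

```c
#include <stdint.h>

/* Software model of a strided byte load: gathers n bytes starting
   at base + offset, advancing stride bytes between elements. With
   stride = 3 and offsets 0, 1 and 2 over interleaved RGB data, it
   produces the band-separated red, green and blue vectors that
   Listing 19 loads into v28, v29 and v30. */
static void strided_load_u8(uint8_t *dst, const uint8_t *base,
                            int offset, int stride, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = base[offset + i * stride];
}
```

One such load per channel replaces the load-permute sequences otherwise needed to band-separate the pixels.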

saturate at the end). Saturating arithmetic can also produce unexpected results since, unlike normal addition, the order of operations matters.

While the MIPS MDMX solution may seem elegant in principle, it ignores the architectural aspect of actually making such a design fast. The accumulator is a singular and unique shared resource, and as such has a tendency to limit instruction level parallelism. We found that the existence of only a single accumulator would be a severe handicap to avoiding data dependencies. On a non-accumulator but otherwise superscalar architecture it is usually possible to perform some other useful, non data dependent operation in parallel so that processing can proceed at the greatest degree of instruction level parallelism possible. On MDMX all computations that need to use the accumulator must proceed serially, since accumulation is inherently dependent on the previous result.

7.1.1. Supported Data Types

To avoid the problems with MDMX, we suggest that a normal register file architecture be used (as opposed to a single accumulator), with the entire SIMD register file made wider than the width of loads and stores. For the sake of example we have chosen a width of 192-bits for our datapath (registers and functional units), corresponding to a load/store width of either 64- or 128-bits (we will discuss both of these possibilities shortly). A 64-bit only implementation would be possible, but of course with a lesser degree of potential parallelism. Another possibility, should very wide registers prove infeasible, is the use of register sets (e.g. triplets in the case of 64-bit wide registers). A similar approach was used by Cyrix in their extension to MMX (found in their 6x86MX/M-II processor) to compensate for the eight register encoding limit inherent in x86 based architectures.

Supported storage data types include those that we have seen to be of importance in multimedia for storage formats: 8-bit unsigned, 16-bit signed, 32-bit signed and single precision floating point. The data types supported by packed arithmetic operations for computation are different: 12-bit signed, 24-bit signed, 48-bit signed and single precision floating point. Depending on the algorithm it may be desirable to access memory at either 64-bit or 128-bit widths (see Table 6), which would then be unpacked to different computational widths: a smaller number of high-precision subwords (64-bit storage width) or a larger number of low-precision subwords (128-bit storage width). This also allows for better matching with the natural width of each algorithm.

          Low Precision Computation          High Precision Computation
          (128-bit total load/store width)   (64-bit total load/store width)
 Storage  Computation                        Computation
 8U       12S x 16                           24S x 8
 16S      24S x 8                            48S x 4
 32S      48S x 4                            -
 32FP     32FP x 4                           -

Table 6. Supported Data Types - data types are divided into storage (as stored in memory) and computation (the width of arithmetic and other instruction elements) widths.

Packed floating point data types present an interesting design choice - we can either operate on:

1. four single precision values in parallel (not using 16 of the bits in each register element)
2. four single precision values which have been expanded to a 48-bit extended precision format
3. six single precision values, exactly filling a 192-bit register

The downside of the second solution is that SIMD results may not exactly match scalar results. In addition, the latency of many floating point operations depends heavily on the precision being computed, so a more precise operation is a higher latency one. The third solution, although attractive for its additional data parallelism, is problematic because it would require its own set of


data rearrangement instructions based on a six element (rather than four or eight element) vector.

7.1.2. Supported Operations

Because we have fundamentally changed the treatment of multimedia data types, we also need to reexamine which operations are still valuable in this new light. Several instructions are no longer useful:

average - the only useful average instruction data type within our workload was 8-bit unsigned. Average is only really useful on existing multimedia architectures because it allows for computation without width promotion. On a fat subword architecture average instructions are unnecessary, since there would already be sufficient precision, and the requisite functionality can be synthesized through shift and add (operations which are useful in and of themselves).

pack/unpack - performed implicitly with loads and stores. Some types of width and type conversion may still be of utility (e.g. conversion between floating point and integer).

truncating multiplication - truncation predefines a set number of result bits to be thrown away. This primarily has application when multiplying n-bit fixed point values with fractional components that together take up a total of n bits of precision. Unfortunately, truncation plays havoc with precision, since it is usually desirable to truncate once, at the end of a series of fixed-point computations, rather than at every step. Because all of the high precision computational data types in our design are more than wide enough to hold the product of two storage data types, overflow should never be a problem.

In [38] we present our proposed instruction set (97 instructions in total) for a fat subword SIMD multimedia extension. Unlike existing instruction sets, which are fundamentally byte-based, this instruction set is centered on quantities which are multiples of 12 bits wide. This can be thought of as four extra bits of precision for every byte of actual storage data loaded.

7.1.3. Fat Subword Example

A small example of how to code for the proposed architecture is shown in Figure 2. Note that load and store instructions specify both computational and storage data types.

memory 0x1000: | 3|  2| 46| 7|  9| 0| 23| 2|   (eight 8-bit unsigned values)
memory 0x1008: |15|253| 34| 5|255| 2|  2| 0|   (eight 8-bit unsigned values)

lvx8u24s v0,[1000]  ;; v0: | 3|  2|  46| 7|   9| 0| 23| 2|  (eight 24-bit signed subwords)
lvx8u24s v1,[1008]  ;; v1: |15|253|  34| 5| 255| 2|  2| 0|  (eight 24-bit signed subwords)
mult24s  v2,v0,v1   ;; v2: |45|506|1564|35|2295| 0| 46| 0|  (eight 24-bit signed subwords)
stv24s8u v2,[1000]  ;; 0x1000: |45|255|255|35|255|0|46|0|   (8-bit unsigned, saturated)

Figure 2. Fat Subword Example - two vectors of eight 8-bit unsigned integers are loaded from addresses 0x1000 and 0x1008 into vector registers v0 and v1 respectively, each element being implicitly expanded to 24-bit signed notation. Vector registers v0 and v1 are then multiplied and the result placed in v2. Overflow cannot occur because the maximum width of the result is 16 bits per multiplication (the sum of the operand widths). The result vector is then simultaneously repacked to 8-bit unsigned notation (with saturation) and stored out to memory address 0x1000.

7.2. Strided Memory Access

We propose that SIMD architectures implement strided load and store instructions to make the gathering of non-adjacent data elements more efficient. This is similar to the prefetch mechanism in AltiVec, except that the data elements would be assembled together by the hardware into a single register, rather than simply loaded into the cache. Of course such a memory operation would necessarily be slower than a traditional one, but it would cut down immensely on the overhead that would have to go into reorganizing data as loaded from memory. Strided loads and stores would have three operands, vD, rA and rB, where vD is the destination fat-subword register, rA is the base address and rB contains a description of the memory access pattern:


width - width of the data elements to be loaded [2 bits]: 00 = 8 bits, 01 = 16 bits, 10 = 32 bits, 11 = 64 bits.

offset - number of bytes offset from the base address from which to begin loading [6 bits], interpreted as a signed value.

stride - number of bytes between the effective address of one element in the sequence and the next [24 bits], interpreted as a signed value.

The color space conversion kernel is an excellent example of where strided load and store instructions could be used. Pixel data consists of one or more channels or bands, with each channel representing some independent value associated with a given pixel's (x,y) position. A single channel, for example, represents grayscale, while a three (or more) channel image is typically color. The band data may be interleaved (each pixel's red, green, and blue data are adjacent in memory) or separated (e.g. the red data for adjacent pixels are adjacent in memory). In image processing algorithms such as color space conversion we operate on each channel in a different way, so band separated format is the most convenient for SIMD processing. Converting from the RGB to YCBCR color space is done through the conversion coefficients shown in Listing 18.

Listing 19 replaces thirty-eight instructions in the original AltiVec color space conversion kernel (the corresponding code fragment is listed in [38]). In the original AltiVec code, it was necessary to load six permute control vectors (each 128-bits wide) before executing the six vperm instructions required to rearrange the data into band separated format.

Traditional SIMD data communication operations have trouble with data which is not aligned on boundaries which are powers of two; in the case of color space conversion, visually adjacent pixels from each band are spaced 3 bytes apart. Strided loads and stores are by definition unaligned, so this would need to be handled by the load/store hardware in the CPU. It would also make sense to have additional versions of these instructions to provide hints to the memory hardware whenever data elements will not be of near-term use (especially if the stride is on the same or greater order as the line size). Another approach to strided data access is presented in [46], which discusses the Impulse system architecture. Impulse allows for a strided mapping that creates dense cache lines from array elements that are not contiguous in memory.

8. Summary

We will next summarize our findings by distilling those architectural features we found to be useful or beneficial and pointing out those which were not. Some "useless" features were simply not exercised within our workload, while others were clearly the source of performance bottlenecks. We distinguish between these two situations in our summary.

Existing architectures attempt to use the same computational width as storage width by default because there is no implicit unpacking or packing of data. This has resulted in a great deal of supporting functionality in order to aid or avoid the packing and unpacking of data. With a fat subword architecture, many of the reasons for these features are eliminated, creating a simpler overall design. We note in our summary where there would be differences introduced by such an approach.

8.1. Useful Features

8.1.1. Register File and Datapath

· Sharing the floating point datapath is usually preferable to sharing the integer data path, because there is no contention with pointer and loop variables, and less chance of affecting the critical path of the processor.

· Barring any other existing architectural considerations, a 128-bit wide data path is optimal for most multimedia algorithms (192 bits in the case of a fat subword architecture, although memory accesses are still 64- or 128-bits wide).

· Multimedia algorithms can take advantage of large register files - we suggest at least 16 128-bit registers, or 32 64-bit registers.

8.1.2. Data Types

· 8-bit signed data types were not used within our workload. 8-bit unsigned data types are often used for storage, but not for computation.

· 16-bit signed data types are the most common; they are the intermediate (computational) data type for video and the storage (and sometimes computational) data type for audio and speech algorithms. Unsigned 16-bit values were not used.

· 32-bit signed values are most often used for accumulation. Unsigned 32-bit values were not used within our workload.

· Single precision floating point (32-bit) is used in audio and 3D graphics (geometry) kernels.

8.1.3. Integer Arithmetic

· Saturation prevents overflow in a fast, numerically acceptable way for SIMD operations.

· Max and min operations are an efficient way to perform SIMD comparisons as well as to clamp to an arbitrary range. They are useful at all computational data widths.

· Average instructions were only used in the MPEG encode block match kernel for interpolation (8-bit unsigned data type).

· Shift operations of all types are useful at all data widths - they are critical for fixed point arithmetic, provide an efficient means for data realignment, and allow division and multiplication by powers of two.

· Sum of absolute difference instructions significantly speed up block matching, which is a large part of MPEG and other block based video encoding algorithms that employ motion compensation. Block matching almost always employs an 8-bit unsigned data type.

8.1.4. Floating Point Arithmetic

· 1/sqrt(x) approximation instructions are useful; through multiplication they can also estimate sqrt(x) and 1/x. A full


precision version of this instruction is probably unnecessary, unless it can be made very fast, as the Newton-Raphson method can be used to rapidly improve the precision of an approximation.

· Exceptions and sophisticated rounding modes are unnecessary for multimedia; in any instance where these might be used it is possible to substitute a reasonable value that will allow computation to continue unhindered, and still produce an acceptable result.

8.1.5. Data Rearrangement

· A full permute operation (as is found in AltiVec) is very flexible, but is unnecessary for most multimedia applications, where only simple data rearrangement patterns are needed. In AltiVec the vperm instruction serves double duty as a means for simulating unaligned memory access, so its capabilities are basically free.

· [22] presents a novel set of simple data communication primitives which can perform all 24 permutations of a 2x2 matrix in a single cycle on a processor with dual data communication functional units. Any larger data communication problem can be decomposed into 2x2 matrices, and data communication patterns are statically defined, so no extra register space need be wasted on permutation control vectors.

8.1.6. Memory Operations

· Silently accepting an unaligned address and forcing it to be aligned (as in AltiVec) is dangerous because it allows intermittent and difficult to track down programming errors to go unnoticed. Instead, an exception should be raised when an unaligned access occurs, or hardware should support unaligned memory access directly.

8.2. Bottlenecks and Unused Functionality

· Instruction primitives (such as the multiplication instruction primitives found in Sun's VIS) decrease instruction decoding bandwidth, increase register pressure, and are not useful in and of themselves. Even if the atomic version of an operation (such as multiplication) is slow, it is preferable to a set of primitives because it is cleaner, and it is easier to upgrade an instruction's latency in the next revision of an architecture than it is to implement an entirely new instruction (rendering useless the previous instruction and any software employing it).

· Motorola's AltiVec extension includes pixel pack and unpack instructions for converting between 32-bit true color and 16-bit color representations which were not used in our workload. AltiVec's approximations for log2 and exp2 also went without application in our workload because they are used in a different stage of 3D rendering (lighting) than the one we examined (due to its lesser computational overhead).

· Sun's VIS includes edge instructions for dealing with boundaries in 2D image processing, as well as array instructions for volumetric imaging. Neither type of instruction was found to be useful within the Berkeley multimedia workload.

· In general, a singular and unique resource, such as a control register, is a potential bottleneck if it will be highly utilized. In the case of Sun's VIS, the graphics status register (GSR) was a severe bottleneck that could have been avoided if SIMD shift instructions and better data communication primitives had been included.

8.3. New Directions

In addition to analyzing how well current multimedia instruction set features map to multimedia workloads, we also proposed two new directions for multimedia instruction sets.

· Because SIMD architectures apply the same operation to all of the elements in a packed register, there are many cases where data is not optimally organized as loaded from memory. Similar to scatter-gather operations from traditional vector architectures, we proposed implementing strided loads and stores for packed registers.

· Based on the observation that storage and computational data types are almost always different, we proposed a fat subword register architecture, which eliminates much of the explicit packing and unpacking overhead that typically makes SIMD processing progress slower than is necessary. We found that this fundamental change in how data types are handled would have significant implications for instruction set design.

9. Conclusion

We have presented a unified comparison of five multimedia instruction sets for general purpose microprocessors which discusses the advantages and disadvantages of specific features of current instruction sets. In addition to examining the five instruction sets from a functional standpoint, we have also empirically compared the performance of specific implementations of these instruction sets on real hardware. This has allowed us to measure typical speedups relative to a highly tuned compiler on a given platform as well as make cross-architecture comparisons. We have also introduced two new ideas for future multimedia instruction sets: strided memory accesses and a fat-subword register architecture.

Our analysis has been based on measurements of hand tuned SIMD kernels. Today most microprocessors found in desktop workstations are equipped with multimedia extensions that remain unused due to the lack of compiler support for these instruction sets [21]. A valuable extension to this study would be an analysis of what characteristics would make SIMD instructions most amenable to use by a compiler. In addition, multimedia applications, and therefore our target workload, are constantly evolving: MPEG-4 and MP3-Pro audio are only two examples of up and coming algorithms which demand further examination. As such new algorithms and applications come to light we will need to revisit our design choices for multimedia instruction sets.

References

[1] G.E. Allen, B.L. Evans, L.K. John, "Real-Time High-Throughput Sonar Beamforming Kernels Using Native Signal Processing and Memory Latency Hiding Techniques," Proc. 33rd IEEE Asilomar Conf. on Signals, Systems and Computers, Pacific Grove, Calif., 1999, pp. 137-141.


[2] Advanced Micro Devices, Inc., “AMD Athlon Processor x86 Code Optimization Guide,” Publication #22007, Rev. G, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf, (current Apr. 2000)
[3] Advanced Micro Devices, Inc., “AMD Athlon Processor Technical Brief,” Publication #22054, Rev. D, http://www.amd.com/products/cpg/athlon/techdocs/pdf/22054.pdf, (current Apr. 2000)
[4] Advanced Micro Devices, Inc., “3DNow! Technology vs. KNI,” White Paper, http://www.amd.com/products/cpg/3dnow/vskni.html, (current Apr. 2000)
[5] R. Bhargava, L.K. John, B.L. Evans, R. Radhakrishnan, “Evaluating MMX Technology Using DSP and Multimedia Applications,” Proc. 31st IEEE Intl. Symp. on Microarchitecture (MICRO-31), Dallas, Texas, 1998, pp. 37-46
[6] T. Burd, “General Processor Info,” CPU Info Center, http://bwrc.eecs.berkeley.edu/CIC/summary/local/summary.pdf, (current Apr. 2000)
[7] D.A. Carlson, R.W. Castelino, R.O. Mueller, “Multimedia Extensions for a 550-MHz RISC Microprocessor,” IEEE J. of Solid-State Circuits, vol. 32, no. 11, Nov. 1997, pp. 1618-1624
[8] W. Chen, H. John Reekie, S. Bhave, E.A. Lee, “Native Signal Processing on the UltraSparc in the Ptolemy Environment,” Proc. 30th Ann. Asilomar Conf. on Signals, Systems, and Computers, Pacific Grove, Calif., 1996, vol. 2, pp. 1368-1372
[9] Compaq Computer Corporation, “Alpha 21264 Microprocessor Hardware Reference Manual,” Part No. DS-0027A-TE, Feb. 2000, http://www.support.compaq.com/alpha-tools/documentation/current/21264_EV67/ds-0027a-te_21264_hrm.pdf, (current Apr. 2000)
[10] T.M. Conte, P.K. Dubey, M.D. Jennings, R.B. Lee, A. Peleg, S. Rathnam, M. Schlansker, P. Song, A. Wolfe, “Challenges to Combining General-Purpose and Multimedia Processors,” IEEE Computer, vol. 30, no. 12, Dec. 1997, pp. 33-37
[11] C. Hansen, “MicroUnity’s MediaProcessor Architecture,” IEEE Micro, vol. 16, no. 4, Aug. 1996, pp. 34-41
[12] Intel Corporation, “Intel Introduces The Pentium Processor With MMX Technology,” Press Release, Jan. 8, 1997, http://www.intel.com/pressroom/archive/releases/dp010897.htm, (current Apr. 2000)
[13] Intel Corporation, “Intel Architecture Software Developer’s Manual, Volume 1: Basic Architecture,” Publication 243190, http://developer.intel.com/design/pentiumii/manuals/24319002.PDF, (current Apr. 2000)
[14] Intel Corporation, “IA32 Intel Architecture Software Developer’s Manual with Preliminary Intel Pentium 4 Processor Information, Volume 1: Basic Architecture,” http://developer.intel.com/design/processor/future/manuals/24547001.pdf, (current Sept. 2000)
[15] Intel Corporation, “Intel Announces New NetBurst Micro-Architecture for Pentium IV Processor,” http://www.intel.com/pressroom/archive/releases/dp082200.htm, (current Sept. 2000)
[16] Intel Corporation, “Pentium 4 Processor – Development Tools,” http://developer.intel.com/design/pentium4/devtools/, (current July 2001)
[17] J. Keshava, V. Pentkovski, “Pentium III Processor Implementation Tradeoffs,” Intel Tech. J., Quarter 2, 1999, http://developer.intel.com/technology/itj/q21999/pdf/impliment.pdf, (current Apr. 2000)
[18] T. Kientzle, “Implementing Fast DCTs,” Dr. Dobb’s J., vol. 24, no. 3, March 1999, pp. 115-119
[19] L. Kohn, G. Maturana, M. Tremblay, A. Prabhu, G. Zyner, “The Visual Instruction Set (VIS) in UltraSPARC,” Proc. Compcon ’95, San Francisco, 1995, pp. 462-469
[20] I. Kuroda, T. Nishitani, “Multimedia Processors,” Proc. IEEE, vol. 86, no. 6, June 1998, pp. 1203-1221
[21] S. Larsen, S. Amarasinghe, “Exploiting Superword Level Parallelism with Multimedia Instruction Sets,” Proc. ACM SIGPLAN ’00 Conf. on Programming Language Design and Implementation, Vancouver, BC, Canada, 2000, pp. 145-156
[22] R.B. Lee, “Subword Permutation Instructions for Two-Dimensional Multimedia Processing in MicroSIMD Architectures,” Proc. IEEE Intl. Conf. on Application-Specific Systems, Architectures, and Processors, Boston, 2000, pp. 3-14
[23] R.B. Lee, “Multimedia Extensions for General Purpose Processors,” Proc. IEEE Workshop on VLSI Signal Processing, Leicester, UK, 1997, pp. 1-15
[24] MIPS Technologies, Inc., “MIPS Extension for Digital Media with 3D,” White Paper, Mar. 12, 1997, http://www.mips.com/Documentation/isa5_tech_brf.pdf, (current Apr. 2000)
[25] Motorola Inc., “MPC7400 RISC Microprocessor User’s Manual, Rev. 0,” Document MPC7400UM/D, http://www.mot.com/SPS/PowerPC/teksupport/teklibrary/manuals/MPC7400UM.pdf, (current Apr. 2000)
[26] J. Nakashima, K. Tallman, “The VIS Advantage: Benchmark Results Chart VIS Performance,” White Paper, Oct. 1996, http://www.sun.com/microelectronics/vis/, (current Apr. 2000)
[27] H. Nguyen, L.K. John, “Exploiting SIMD Parallelism in DSP and Multimedia Algorithms Using the AltiVec Technology,” Proc. 1999 Int’l Conf. on Supercomputing, Rhodes, Greece, 1999, pp. 11-20
[28] K. Noer, “Heat Dissipation Per Square Millimeter Die Size Specifications,” http://home.worldonline.dk/~noer/, (current Apr. 2000)


[29] K.B. Normoyle, M.A. Csoppenszky, A. Tzeng, T.P. Johnson, C.D. Furman, J. Mostoufi, “UltraSPARC-IIi: Expanding the Boundaries of a System on a Chip,” IEEE Micro, vol. 18, no. 2, Mar./Apr. 1998, pp. 14-24
[30] A.D. Pimentel, P. Struik, P. van der Wolf, L.O. Hertzberger, “Hardware versus Hybrid Data Prefetching in Multimedia Processors: A Case Study,” Proc. IEEE Intl. Performance, Computing and Communications Conference, Phoenix, Ariz., 2000, pp. 525-531
[31] P. Ranganathan, S. Adve, N.P. Jouppi, “Performance of Image and Video Processing with General-Purpose Processors and Media ISA Extensions,” Proc. 26th Ann. Intl. Symp. on Computer Architecture, Atlanta, 1999, pp. 124-135
[32] S. Rathnam, G. Slavenberg, “An Architectural Overview of the Programmable Multimedia Processor, TM-1,” Proc. of COMPCON ’96, Santa Clara, Calif., 1996, pp. 319-326
[33] D.S. Rice, “High-Performance Image Processing Using Special-Purpose CPU Instructions: The UltraSPARC Visual Instruction Set,” Univ. Calif. at Berkeley, Master’s Report, Mar. 1996
[34] P. Rubinfeld, B. Rose, M. McCallig, “Motion Video Instruction Extensions for Alpha,” White Paper, Oct. 1996, http://www.digital.com/alphaoem/papers/pmvi-abstract.htm, (current Apr. 2000)
[35] N.T. Slingerland, A.J. Smith, “Design and Characterization of the Berkeley Multimedia Workload,” University of California at Berkeley Technical Report CSD-00-1122, Dec. 2000
[36] N.T. Slingerland, A.J. Smith, “Cache Performance for Multimedia Applications,” Proc. of the 15th IEEE Intl. Conf. on Supercomputing, Sorrento, Italy, June 16-21, 2001, pp. 204-217
[37] N.T. Slingerland, A.J. Smith, “Multimedia Instruction Sets for General Purpose Microprocessors: A Survey,” University of California at Berkeley Technical Report CSD-00-1124, Dec. 2000
[38] N.T. Slingerland, A.J. Smith, “Measuring the Performance of Multimedia Instruction Sets,” University of California at Berkeley Technical Report CSD-00-1125, Dec. 2000
[39] A. Stiller, “Architecture Contest,” c’t Magazine, vol. 16/99, http://www.heise.de/ct/english/99/16/092/, (current Dec. 2000)
[40] P. Struik, P. van der Wolf, A.D. Pimentel, “A Combined Hardware/Software Solution for Stream Prefetching in Multimedia Applications,” Proc. 10th Ann. Symp. on Electronic Imaging (Multimedia Hardware Architectures track), San Jose, Calif., 1998, pp. 120-130
[41] M.T. Sun, ed., “IEEE Standard Specifications for the Implementation of 8x8 Inverse Discrete Cosine Transform,” IEEE Standard 1180-1990, IEEE, 1991
[42] Sun Microsystems Inc., “UltraSPARC-IIi User’s Manual,” 1997, Part No. 805-0087-01, http://www.sun.com/microelectronics/UltraSPARC-IIi/docs/805-0087.pdf, (current Apr. 2000)
[43] S. Thakkar, T. Huff, “The Internet Streaming SIMD Extensions,” Intel Tech. J., Q2 1999, http://developer.intel.com/technology/itj/q21999.htm, (current Apr. 2000)
[44] M. Tremblay, J.M. O’Connor, V. Narayanan, L. He, “VIS Speeds New Media Processing,” IEEE Micro, vol. 16, no. 4, Aug. 1996, pp. 10-20
[45] Veridian Systems Division, “Veridian Systems – VAST/AltiVec,” http://www.psrv.com/ .html, (current July 2001)
[46] L. Zhang, J.B. Carter, W. Hsieh, S.A. McKee, “Memory System Support for Image Processing,” Proc. 1999 International Conference on Parallel Architectures and Compilation Techniques, Newport Beach, Calif., 1999, pp. 98-107
