AltiVec Extension to PowerPC Accelerates Media Processing

Designed around the premise that multimedia will be the primary consumer of processing cycles in future PCs, AltiVec, which Apple calls the Velocity Engine, increases performance across a broad spectrum of media processing applications.

Keith Diefendorff, Microprocessor Report
Pradeep K. Dubey, IBM Research Division
Ron Hochsprung, Apple Computer
Hunter Scales, Motorola Corporation

IEEE Micro, March–April 2000. 0272-1732/00/$10.00 © 2000 IEEE

There is a clear trend in personal computing toward multimedia-rich applications. These applications will incorporate a wide variety of multimedia technologies, including audio and video compression, 2D image processing, 3D graphics, speech and handwriting recognition, media mining, and narrow-/broadband signal processing for communication.

In response to this demand, major microprocessor vendors have announced architectural extensions to their general-purpose processors in an effort to improve their multimedia performance. Intel extended IA-32 with MMX¹ and SSE (alias KNI),² Sun enhanced Sparc with VIS,³ Hewlett-Packard added MAX⁴ to its PA-RISC architecture, Silicon Graphics extended the MIPS architecture with MDMX,⁵ and Digital (now Compaq) added MVI to Alpha.

This article describes the most recent, and what we believe to be the most comprehensive, addition to this list: PowerPC's AltiVec.⁶,⁷ AltiVec speeds not only media processing but also nearly any application in which data parallelism exists, as demonstrated by a cycle-accurate simulation of Motorola's MPC 7400, the heart of Apple G4 systems.

Highlights and performance summary

Like all the other extensions, AltiVec is a SIMD (single-instruction, multiple-data) extension to a general-purpose architecture. But the similarity ends there. Whereas the other extensions were obviously constrained by backward compatibility and a desire to limit silicon investment to a small fraction of the processor die area, the primary goal for AltiVec was high functionality. It was designed from scratch around the premise that multimedia will become the primary consumer of processing cycles⁸ in future PCs and therefore deserves first-class treatment in the CPU.

Unlike most other extensions, which overload their floating-point (FP) registers to accommodate multimedia data, AltiVec dedicates a large new register file exclusively to it. Although overloading the FP registers avoids new architectural state, eliminating the need to modify the operating system, it also significantly compromises performance, which was not acceptable for AltiVec.

AltiVec treats multimedia data as first-class data in the form of vectors. Vector elements include all of the major data types found in 3D graphics, image processing, digital audio and video, speech recognition, data mining, and other multimedia applications.

AltiVec's powerful data reorganization capability goes far beyond that of any previous SIMD engine, making AltiVec uniquely well suited to the bit-parallel algorithms found in digital signal processing (DSP) domains. These include error correction, bit-packing kernels, and many others.

AltiVec extends the scalar PowerPC architecture with a powerful new set of SIMD instructions. These instructions execute from the same instruction stream as the PowerPC's scalar integer, floating-point, and branch instructions.

AltiVec's major architectural characteristics include

• fixed-length 128-bit vectors, each comprising four, eight, or 16 data elements;
• a separate vector register file with a 32-register namespace, each register holding one 128-bit vector;
• vector-element data types of 8-, 16-, and 32-bit signed or unsigned integers, as well as IEEE single-precision floats;
• 162 new SIMD-style instructions optimized for digital signal processing;
• saturation or modulo arithmetic;
• a four-operand, nondestructive instruction format (three sources, one destination); and
• modeless operation for zero-overhead use of AltiVec instructions.

SIMD parallelism is well matched to the parallelism found in the packed-data streams of media applications. To use SIMD processing, algorithms typically break long data streams into sequences of short fixed-length vector operands. SIMD instructions then process these vectors iteratively in loops, each instruction performing the same operation on all corresponding elements in the source-operand vectors in parallel. With AltiVec's long 128-bit vector, loop overheads tend to be small, giving AltiVec processors performance approaching that of true vector machines.

On the basis of cycle-accurate simulations of more than 40 media processing kernels, we found that AltiVec delivered an average speedup of 6.5 on integer kernels and 5.1 on floating-point kernels, over the same PowerPC processor without AltiVec. The speedups often approach, and sometimes even exceed, the theoretical SIMD parallelism, which is 16 on 8-bit data (for example, video), eight on 16-bit data (for example, modem filters), and four on 32-bit integers and floats (for example, 3D graphics and high-fidelity audio). Speedups greater than the theoretical parallelism arise from the ability to use new algorithms that are inappropriate for scalar processors or for less capable SIMD processors.

AltiVec architecture

One of the attributes that enable large speedups across such a broad spectrum of media processing applications is AltiVec's support for all of the important media data types. Table 1 shows the various data types that a processor must support if it is to perform well on media processing tasks. To date, AltiVec is the only SIMD architectural extension to support all these types.

Table 1. Data types for various media tasks.
Data-type columns: 8-bit integer (unsigned, signed); 16-bit integer (unsigned, signed); single-precision float (signed).
Entries are listed per task in left-to-right order; their column positions were lost in extraction.

Task                  Entries
Video                 Low quality, High quality
Audio                 Low quality, High quality
Image processing      Low quality, High quality
3D graphics           Low quality, High quality
Speech recognition    Low quality, High quality
Communication         Crypto, Crypto
Media mining          High quality

AltiVec's large vector register file provides quick access to a large number of values, such as the transform or filter coefficients that are accessed frequently in signal processing loops. The large register namespace facilitates software pipelining and loop unrolling necessary to cover the long latencies associated with media streams. With a separate register file, the general-purpose and floating-point registers are not encumbered with multimedia data, so media processing doesn't interfere with scalar processing. The separate file also permits the vector registers to be physically optimized for the wide SIMD execution units.

Another important AltiVec feature is its four-operand instruction format (three source operands, one destination). This feature gives each instruction extraordinarily high operand bandwidth and supports the encoding of powerful instructions such as multiply-add, permute, and select (described later). Since the four-operand format is nondestructive, it also eliminates the excess register shuffling and copying that comes with destructive two-operand formats like that of the x86 architecture. Thus, AltiVec's instruction format allows programs to use registers efficiently, minimizing spill/fill traffic to memory and producing a short instruction path, which are both important for efficient signal processing loops.

AltiVec is based on a simple RISC-style load/store architecture, but instructions operate on vector operands rather than on the simple scalar operands of classical RISC engines. The AltiVec instruction set was distilled from many digital-media-processing algorithms into a set of generalized primitives that support common operations such as saturation arithmetic. Using this approach, the design can support a wide spectrum of media applications while avoiding the highly specialized instructions commonly found in traditional DSPs. Counting all variations of data types and arithmetic (modulo, saturation, signed, and unsigned), AltiVec adds 162 new instructions to the PowerPC architecture, as summarized in Table 2.

Table 2. AltiVec instruction-set summary.*
Column groups: arithmetic (signed, unsigned, modulo, saturate); operands; source elements (bytes, halfwords, words, floats, vectors); destination elements (bytes, halfwords, words, floats, vectors).
The per-column X marks could not be recovered from the flattened layout; only the instruction classes and operand counts are listed.

Instruction class       Operands
Load/store              -
Stream prefetch         -
Add/sub                 2
Multiply                2
Multiply-add            3
Multiply-sum            3
Sum across              2
Partial sum across      2
Average                 2
Logicals                2
Rotate/shift            2
Compare                 2
Select                  2
Pack                    2
Unpack/merge            2
Splat                   2
Permute                 3
Shift elements          2
Round to integer        1
Convert w/scale         1
Max/min                 2
1/x estimate            1
1/sqrt(x) estimate      1
Log/power estimates     1

*This table summarizes AltiVec capabilities in a concise form. Not all combinations shown are available for every instruction in a given class.

The AltiVec design criteria called for all instructions to be easily pipelined and suitable for superscalar, out-of-order dispatch. All AltiVec processors are expected to implement the full architectural vector width and to fully [...] vector instructions with PowerPC scalar instructions.

Permute power

Much of AltiVec's performance and flexibility derives from the permute instruction (vperm), illustrated in Figure 1a.

[Figure 1a: vperm example. Control vector VC = 01 04 08 00 1F 15 09 0A 05 1F 02 03 07 0D 0B 0E, with source vectors VA and VB and byte positions 0 through F.]
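As a rough illustration of the byte selection that vperm performs, the following Python sketch models a 16-byte permute in software. It assumes that each byte of the control vector VC uses its low five bits to index the 32-byte concatenation of VA and VB. The function name `vperm` and the sample contents of `va` and `vb` are hypothetical; the control bytes are the VC values that appear in Figure 1a. This is a reference model of the semantics under those assumptions, not AltiVec code.

```python
def vperm(va: bytes, vb: bytes, vc: bytes) -> bytes:
    """Software model of a 16-byte permute (hypothetical helper, not a real API).

    Each control byte in vc picks one byte from the 32-byte table va + vb.
    """
    if not (len(va) == len(vb) == len(vc) == 16):
        raise ValueError("each operand must be exactly 16 bytes")
    # Indices 0x00-0x0F address va, indices 0x10-0x1F address vb.
    table = va + vb
    # Assumption: only the low five bits of each control byte are significant.
    return bytes(table[c & 0x1F] for c in vc)


if __name__ == "__main__":
    va = b"ABCDEFGHIJKLMNOP"  # hypothetical contents of VA
    vb = b"abcdefghijklmnop"  # hypothetical contents of VB
    # Control vector taken from the VC row of Figure 1a.
    vc = bytes.fromhex("010408001F15090A051F0203070D0B0E")
    print(vperm(va, vb, vc))  # prints b'BEIApfJKFpCDHNLO'
```

Because the control vector is ordinary register data, the same operation can express table lookups, byte reordering, and merging of two vectors simply by changing the contents of `vc`.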
