Exploiting the AltiVec Unit for Commercial Applications

Daniel Citron
IBM Haifa Labs, Haifa University Campus, Haifa 31905, Israel
[email protected]

Hiroshi Inoue, Takao Moriyama, Motohiro Kawahito, Hideaki Komatsu, Toshio Nakatani
IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken, 242-8502 Japan
{inouehrs,moriyama,jl25131,komatsu,nakatani}@jp.ibm.com

Abstract

The introduction of the PowerPC 970 JS20 blade server opens opportunities for vectorizing commercial applications using the integrated AltiVec unit. We examined the vectorization of applications from diverse fields such as XML parsing, UTF-8 encoding, life sciences, string manipulations, and sorting. We obtained performance speedups (over optimized scalar code) for string comparisons (2-3), XML delimiter lookup (1.5-5), and UTF-8 conversion (2-4).

The focus of this paper is on the process rather than on the results. Vectorizing commercial applications vastly differs from vectorizing graphic and image processing applications. In addition to the results achieved, we describe the pitfalls encountered, the advantages and disadvantages of the AltiVec unit, and what is missing in its current implementation.

Sorting presents an interesting example. Vectorizing the quicksort algorithm was not successful due to low parallelism and misaligned data accesses. Vectorization of the combsort algorithm was very successful, with speedups of 5.0, until the data spilled from the L2 cache. Combining both approaches, by first partitioning the input using quicksort and then continuing with combsort, yielded speedups of over 2.0.

This research led to several patent disclosures, many algorithmic enhancements, and an insight into the correct integration of software with the AltiVec unit. The wealth of information collected during this study is being conveyed to the auto-vectorization teams of the relevant compilers.

1 Introduction

The AltiVec® [9] unit is the Single Instruction Multiple Data (SIMD) unit of the PowerPC® architecture. It was designed jointly by IBM®, Motorola®, and Apple®.¹ Historically, such vector units have been used for applications that manipulate large amounts of small data items [12, 5] and have been integrated into PowerPC processors since 1999 [7]. For desktop computers that sported such processors, graphics and image processing are classic examples of AltiVec use.

¹ It is also known as VMX or the Velocity Engine®.

However, the relative weakness of the host processors precluded the exploitation of the AltiVec unit for commercial applications. The IBM PowerPC 970FX [17] is the first processor to combine 64-bit computing, large caches, high instruction level parallelism (ILP), and the AltiVec unit. This paper describes the process of vectorizing a collection of such applications on commercial servers manufactured by IBM.

Commercial applications differ from media applications in several key features that hamper straightforward, and in some cases efficient, vectorization. The object of this paper is to show how these obstacles were approached, overcome (or not), and what should be done to enhance the AltiVec unit in particular, and the vectorization process in general. The main issues that must be addressed are:

Data Layout: Data elements in commercial applications are usually heterogeneous; the data to be vectorized is embedded in structures and must be extracted first. This is opposed to media data, which is usually homogeneous and streamed to the execution unit. Another problem is alignment; AltiVec loads and stores data only at 16-byte boundaries.

Element Size: In media applications, data sizes of 8 and 16 bits are abundant; this allows high levels of parallelism. In commercial applications, 32- and even 64-bit values are commonplace, limiting the degree of parallelism that vectorization can achieve.

I/O bound: Many commercial applications are I/O bound and the benefits of vectorization are not clear. Expending effort on vectorizing non-critical code is not cost effective.

Correctness: Many media applications can produce lossy results, which commercial applications cannot allow. Furthermore, they are full of tests and checks that hamper effective vectorization.

Consistency: Vectorization may change the order in which elements are processed. This can cause results to have small inconsistencies which are unacceptable for commercial applications.

Application Analysis: Many commercial applications are complex and composed of many modules. Finding the bottlenecks that can benefit from vectorization is difficult. Even finding the right developer to approach is no simple task in enterprise applications.

While the successes and failures are of interest (and described in Section 2), we believe that the insights obtained from the experience are of far greater importance. Section 3 lists the advantages and disadvantages of the AltiVec unit, what can be done to improve it, and the pitfalls to avoid when vectorizing a complex commercial application. The rest of this section gives a brief overview of the AltiVec unit and the evaluation methods used in this paper.

1.1 Vector Processing on the PowerPC Architecture

The AltiVec unit contains 32 128-bit registers. Operations can be applied to 8/16/32-bit signed and unsigned integer values, single precision (32-bit) floating-point (FP) values, and 16/32-bit pixel values. Thus, each instruction performs 16, 8, or 4 operations. The instructions are divided into five groups that are executed by different functional units. Figure 1 shows a block level diagram of the AltiVec unit on the PowerPC 970FX. Although there are four distinct functional units, only two instructions can be issued per cycle, and one instruction is to the permute unit. All units are fully pipelined. The units, instruction groups, and latencies are shown in Table 1. Memory accesses are performed in 16-byte quantities that have to be aligned on 16-byte boundaries. This disadvantage is overcome by a versatile permute instruction that can reorder misaligned data effectively. A complete description of the instruction set is available in [9].

Figure 1. The AltiVec unit on the PowerPC 970FX (excerpted from [17])

Unit     Instructions     Latency
VSIU     ALU              2
VCIU     mul, sum, max    5
VFPU     FP               8
VPERM    permute          2
LSU†     load, store      4

Table 1. Latencies of AltiVec instructions on the PowerPC 970FX [1]
† Load/Store instructions are handled by the processor's Load/Store Unit (LSU).

1.2 Evaluation Infrastructure

Vectorization was performed manually on C source code using AltiVec intrinsics [1]. Compiler-generated auto-vectorization in GCC [14] and XLC [19] could not vectorize the analyzed applications. Indeed, one of the major goals of the project is to transfer the accumulated wealth of data to the auto-vectorization teams of the GCC, XLC, and J9/TR Java JIT compilers. (Some preliminary results are available in [11] and [15].)

The PowerPC 970FX is currently used in two major systems: IBM's JS20 blade servers and Apple's G5 computers. These systems run various versions of Linux and Mac OS and possess many variants of the GCC, XLC, and other compilers. Experimentation was performed using many of the aforementioned options. However, for the sake of uniformity we display results using the following configuration:

Computer: IBM JS20 BladeCenter with two 2.2GHz PowerPC 970FX processors

Operating System: SuSE Linux SLES9, 2.6.5-7.97-pseries64 kernel

Vectorization Method: Manual vectorization using AltiVec intrinsics

Compiler: GCC 4.0.1 compiled with -O3 -maltivec -mabi=altivec -mcpu=970 -mtune=970 -mpowerpc64 -funroll-loops -falign-functions=32

Measurements: When short functions are measured, they are called 1,000,000 times with varying alignments (lower bits range from 0 to 127). In addition, they are called via pointers in order to equalize the overhead of standard library calls.

Metrics: Speedup compared to optimized scalar code is shown for all graphs. For short functions, CPU time is measured, and for full applications, elapsed time is measured.

The choice of applications was influenced by availability, interest to parties in IBM, and diversity. The specific vectorization points in each application were determined by runtime profile analysis, source code examination, and prior knowledge.

2.1 String Operations

The ubiquitous string (zero-terminated array of characters) operations exemplify the advantages and disadvantages of the AltiVec unit. For example, the strlen function requires finding the first occurrence of a character with a value of zero. A single, two-cycle, instruction determines if a vector of sixteen characters contains a zero. However, prologue and epilogue computations are relatively complex; the data first has to be aligned on 16-byte boundaries and obtaining the exact location of the zero in the vector is not supported by a single instruction.

Complicating matters even more is the fact that many strings are shorter than sixteen bytes, and this is not known a priori. Thus, any vector version has to be competitive even for short strings. Figure 2 shows that when using the GCC compiler, nice speedups (relative to GCC's standard scalar implementation) are obtained for string lengths as short as 10 characters. For longer string lengths, the speedups improve and surpass 2.0. However, when the XLC compiler is used, the speedup (relative to XLC's standard scalar implementation) disappears. This discrepancy is due to the superior scalar implementation of XLC; the use of 64-bit logic enables 'vectorization' using scalar instructions. This stresses yet another important point: vectorization must be compared with the best scalar effort.

[Figure 2: plot of speedup over scalar strlen (vertical axis, 0 to 2.5) against string length in bytes (horizontal axis, 1 to 61), with one series each for gcc 4.0 and xlc 7.0.]

Figure 2.
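The idea of 'vectorization' using 64-bit scalar logic can be illustrated with a short sketch in portable C. This is not XLC's library code or the implementation measured in the paper; the function name swar_strlen is hypothetical, and the zero-byte test is a well-known bit trick chosen here for illustration. The sketch also shows, at a smaller width, the same structure that the AltiVec version needs: a prologue that reaches an aligned address, a main loop that tests a whole word at once, and an epilogue that locates the exact zero byte. It assumes that reading the full aligned 8-byte word containing the terminator is safe, which holds on typical platforms but goes beyond what strict ISO C guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration: strlen using 64-bit scalar logic.
 * Assumption: an aligned 8-byte read that contains the terminator
 * may touch bytes after it; this never crosses an 8-byte boundary,
 * but it is not sanctioned by strict ISO C. */
size_t swar_strlen(const char *s)
{
    assert(s != NULL);
    const char *p = s;

    /* Prologue: step byte by byte until p is 8-byte aligned. */
    while (((uintptr_t)p & 7u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Main loop: test eight characters per iteration.
     * (w - ones) & ~w & highs is nonzero exactly when some byte
     * of w is zero. */
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    for (;;) {
        uint64_t w;
        memcpy(&w, p, sizeof w);   /* aligned 8-byte load */
        if (((w - ones) & ~w & highs) != 0)
            break;
        p += 8;
    }

    /* Epilogue: the word at p holds a zero; find its exact position. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The byte-wise epilogue keeps the sketch independent of byte order. A tuned version would more likely locate the zero with a count-leading-zeros or count-trailing-zeros instruction, which is the kind of step the text notes is not available as a single AltiVec instruction.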
