Exploiting the AltiVec Unit for Commercial Applications

Daniel Citron
IBM Haifa Labs
Haifa University Campus, Haifa 31905, Israel
citron@il..com

Hiroshi Inoue, Takao Moriyama, Motohiro Kawahito, Hideaki Komatsu, Toshio Nakatani
IBM Tokyo Research Laboratory
1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken, 242-8502 Japan
{inouehrs,moriyama,jl25131,komatsu,nakatani}@jp.ibm.com

Abstract

The introduction of the PowerPC 970 JS20 blade server opens opportunities for vectorizing commercial applications using the integrated AltiVec unit. We examined the vectorization of applications from diverse fields such as XML parsing, UTF-8 encoding, life sciences, string manipulations, and sorting. We obtained performance speedups (over optimized scalar code) for string comparisons (2-3), XML delimiter lookup (1.5-5), and UTF-8 conversion (2-4).

The focus of this paper is on the process rather than on the results. Vectorizing commercial applications vastly differs from vectorizing graphic and image processing applications. In addition to the results achieved, we describe the pitfalls encountered, the advantages and disadvantages of the AltiVec unit, and what is missing in its current implementation.

Sorting presents an interesting example. Vectorizing the quicksort algorithm was not successful due to low parallelism and misaligned data accesses. Vectorization of the combsort algorithm was very successful, with speedups of 5.0, until the data spilled from the L2 cache. Combining both approaches, by first partitioning the input using quicksort and then continuing with combsort, yielded speedups of over 2.0.

This research led to several patent disclosures, many algorithmic enhancements, and an insight into the correct integration of software with the AltiVec unit. The wealth of information collected during this study is being conveyed to the auto-vectorization teams of the relevant compilers.

1 Introduction

The AltiVec® [9] unit is the Single Instruction Multiple Data (SIMD) unit of the PowerPC® architecture. It was designed jointly by IBM®, Motorola®, and Apple®.¹ Historically, such vector units have been used for applications that manipulate large amounts of small data items [12, 5] and have been integrated into PowerPC processors since 1999 [7]. For desktop computers that sported such processors, graphics and image processing are classic examples of AltiVec use.

¹ It is also known as VMX or the Velocity Engine®.

However, the relative weakness of the host processors precluded the exploitation of the AltiVec unit for commercial applications. The IBM PowerPC 970FX [17] is the first processor to combine 64-bit computing, large caches, high instruction level parallelism (ILP), and the AltiVec unit. This paper describes the process of vectorizing a collection of such applications on commercial servers manufactured by IBM.

Commercial applications differ from media applications in several key features that hamper straightforward, and in some cases efficient, vectorization. The object of this paper is to show how these obstacles were approached, overcome (or not), and what should be done to enhance the AltiVec unit in particular, and the vectorization process in general. The main issues that must be addressed are:

Data Layout Data elements in commercial applications are usually heterogeneous; the data to be vectorized is embedded in structures and must be extracted first. This is opposed to media data, which is usually homogeneous and streamed to the execution unit. Another problem is alignment; AltiVec loads and stores data only at 16-byte boundaries.

Element Size In media applications, data sizes of 8 and 16 bits are abundant; this allows high levels of parallelism. In commercial applications, 32- and even 64-bit values are commonplace, limiting the degree of parallelism that vectorization can achieve.

I/O bound Many commercial applications are I/O bound and the benefits of vectorization are not clear. Expending effort on vectorizing non-critical code is not cost effective.

Correctness Many media applications can produce lossy results, which commercial applications cannot allow. Furthermore, commercial applications are full of tests and checks that hamper effective vectorization.

Consistency Vectorization may change the order in which elements are processed. This can introduce small inconsistencies in the results, which are unacceptable for commercial applications.

Application Analysis Many commercial applications are complex and composed of many modules. Finding the bottlenecks that can benefit from vectorization is difficult. Even finding the right developer to approach is no simple task in enterprise applications.

Figure 1. The AltiVec unit on the PowerPC 970FX (excerpted from [17])
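The Data Layout issue listed above can be illustrated with a minimal C sketch. It is not taken from any of the studied applications: the `record` structure, its fields, and the helper names are hypothetical, chosen only to show a key embedded in a heterogeneous structure being copied into a dense array, and a check for the 16-byte alignment that vector loads and stores require.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: the key we would like to vectorize over is
 * surrounded by unrelated fields of other types and sizes. */
struct record {
    char    name[12];
    int32_t key;
    double  balance;
};

/* Copy the keys out of the structures into a dense, homogeneous array.
 * This extraction pass is pure overhead that a media-style (already
 * homogeneous) data stream would not have to pay. */
static void gather_keys(const struct record *recs, size_t n, int32_t *keys)
{
    for (size_t i = 0; i < n; i++)
        keys[i] = recs[i].key;
}

/* Nonzero when p sits on a 16-byte boundary, i.e. when a 16-byte
 * vector load could read from it without realignment. */
static int is_vector_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Whether the extraction pays off depends on how much vector work is then done per extracted element; for a single pass over the keys the copy may well cost more than it saves.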

While the successes and failures are of interest (and described in Section 2), we believe that the insights obtained from the experience are of far greater importance. Section 3 lists the advantages and disadvantages of the AltiVec unit, what can be done to improve it, and the pitfalls to avoid when vectorizing a complex commercial application. The rest of this section gives a brief overview of the AltiVec unit and the evaluation methods used in this paper.

Unit     Instructions     Latency
VSIU     ALU              2
VCIU     mul, sum, max    5
VFPU     FP               8
VPERM    permute          2
LSU*     load, store      4

Table 1. Latencies of AltiVec instructions on the PowerPC 970FX [1]
* Load/Store instructions are handled by the processor's Load/Store Unit (LSU).

1.1 Vector Processing on the PowerPC Architecture

The AltiVec unit contains 32 128-bit registers. Operations can be applied to 8/16/32-bit signed and unsigned integer values, single precision (32-bit) floating-point (FP) values, and 16/32-bit pixel values. Thus, each instruction performs 16, 8, or 4 operations. The instructions are divided into five groups that are executed by different functional units. Figure 1 shows a block level diagram of the AltiVec unit on the PowerPC 970FX. Although there are four distinct functional units, only two instructions can be issued per cycle, one of which is directed to the permute unit. All units are fully pipelined. The units, instruction groups, and latencies are shown in Table 1. Memory accesses are performed in 16-byte quantities that have to be aligned on 16-byte boundaries. This disadvantage is overcome by a versatile permute unit that can reorder misaligned data effectively. A complete description of the instruction set is available in [9].

1.2 Evaluation Infrastructure

Vectorization was performed manually on source code using AltiVec intrinsics [1]. Compiler-generated auto-vectorization in GCC [14] and XLC [19] could not vectorize the analyzed applications. Indeed, one of the major goals of the project is to transfer the accumulated wealth of data to the auto-vectorization teams of the GCC, XLC, and J9/TR Java JIT compilers. (Some preliminary results are available in [11] and [15].)

The PowerPC 970FX is currently used in two major systems: IBM's JS20 blade servers and Apple's G5 computers. These systems run various versions of Linux and Mac OS and possess many variants of the GCC, XLC, and other compilers. Experimentation was performed using many of the aforementioned options. However, for the sake of uniformity we display results using the following configuration:

Computer: IBM JS20 BladeCenter with two 2.2GHz PowerPC 970FX processors
Operating System: SuSE Linux SLES9, 2.6.5-7.97-pseries64 kernel
Vectorization Method: Manual vectorization using AltiVec intrinsics
Compiler: GCC 4.0.1, with code compiled using -O3 -maltivec -mabi=altivec -mcpu=970 -mtune=970 -mpowerpc64 -funroll-loops -falign-functions=32
Measurements: When short functions are measured, they are called 1,000,000 times with varying alignments (lower bits range from 0 to 127). In addition, they are called via pointers in order to equalize the overhead of standard library calls.
Metrics: Speedup compared to optimized scalar code is shown for all graphs. For short functions, CPU time is measured, and for full applications, elapsed time is measured.

2 Applications and Results

This section surveys the vectorization process of several commercial applications. Where important insights, pitfalls, or improvements are encountered they are highlighted using bold text. The analyzed applications are taken from four major domains:

• String manipulation operations
• Life Science applications
• Online document handling
• Database auxiliary functions

The choice of applications was influenced by availability, interest to parties in IBM, and diversity. The specific vectorization points in each application were determined by run-time profile analysis, source code examination, and prior knowledge.

2.1 String Operations

The ubiquitous string (zero-terminated array of characters) operations exemplify the advantages and disadvantages of the AltiVec unit. For example, the strlen function requires finding the first occurrence of a character with a value of zero. A single, two-cycle instruction determines if a vector of sixteen characters contains a zero. However, prologue and epilogue computations are relatively complex; the data first has to be aligned on 16-byte boundaries, and obtaining the exact location of the zero in the vector is not supported by a single instruction.

Complicating matters even more is the fact that many strings are shorter than sixteen bytes, and this is not known a priori. Thus, any vector version has to be competitive even for short strings. Figure 2 shows that when using the GCC compiler, nice speedups (relative to GCC's standard scalar implementation) are obtained for string lengths as short as 10 characters. For longer string lengths, the speedups improve and surpass 2.0. However, when the XLC compiler is used, the speedup (relative to XLC's standard scalar implementation) disappears. This discrepancy is due to the superior scalar implementation of XLC; the use of 64-bit logic enables 'vectorization' using scalar instructions. This stresses yet another important point: vectorization must be compared with the best scalar effort.

[Figure 2 plot: speedup over scalar strlen versus string length in bytes, for gcc 4.0 and xlc 7.0]

Figure 2. Vectorization of strlen using GCC 4.0 vs. XLC 7.0

The strchr function returns the position of the first occurrence of a given character. This emphasizes another obstacle to optimal vectorization in the AltiVec unit: the transfer of data between general purpose registers (GPRs) and vector registers (VRs). This transfer must be performed via memory, as the architecture doesn't support direct transfer. Thus, additional overhead is added.

However, the hardware implementation allows us to hurdle this obstacle. The lvsl instruction receives an address and uses the four LSBs to create a permute vector that aligns the data to be read. By repeating the call to lvsl it is possible to transfer nibble (4-bit) amounts between a GPR and a VR. The reverse direction can also be enhanced by exploiting the properties of the 970's Memory Management Unit (MMU). If a store and subsequent load are differentiated by one dispatch group, the processor forwards the data directly to the reading register. The GCC 4.0.1 compiler automatically inserts nops to ensure this. Figure 3 shows that when naive data transfer between VRs and GPRs is used (diamonds), no speedup over the scalar version is achieved. When the data transfer is enhanced (squares), the vector version is competitive, albeit starting from string lengths of 32. This emphasizes the importance of efficiently vectorizing prologue and epilogue computations in addition to the kernel computation itself.

Figure 3. Speedup of vectorized strchr over a scalar version when memory shortcuts are enabled (shortcuts) and not enabled (via memory)

Conversely, string comparison functions perform much better. Even though the alignment overhead is higher, since both strings must be aligned from the first byte onwards, the vector versions outperform the scalar version for string lengths of seven and higher. Apparently, comparing two different vectors is cost effective even when short strings are compared; the vector version accesses memory less than the scalar version. This is in contrast to comparing a varying vector to a constant vector (such as with strlen and strchr), where the ratio of loads to comparisons is lower.

2.2 Life Science Applications

The field of life science is on the frontier of scientific computing. Molecular and DNA manipulations require vast computing resources. It seems natural to harness the AltiVec unit to such tasks. We examined three applications in the domain.

2.2.1 Spectrum Comparison

A given spectrum is compared to a database of spectra. Each spectrum is represented as a vector of points, where each point has a mass and intensity. There are two major obstacles to vectorization:

• All scalar data elements are in classes and/or structures. Accessing them requires transforming them into homogeneous arrays.
• The spectra are sparse and non-linear (i.e., not all elements of one spectrum must match all elements of another, and the matches may occur at different positions).

The first issue was overcome by a simple data transformation. However, the second was harder to overcome; the quality of the comparisons degraded when a vector comparison was performed. In a different candidate function for vectorization, other problems arose:

• The AltiVec unit doesn't have an integer or float divide operation; the closest instruction returns the estimated reciprocal of an FP value.
• AltiVec doesn't possess a 32-bit integer multiply instruction either, only a single precision FP multiply-add instruction.
• Indirection in accessing data greatly hampers straightforward vectorization.

Thus, vectorization of this application didn't yield any speedup, but rather a list of many important insights.

2.2.2 HMMER

HMMER [8] is an open source life science application that spends 97% of its execution time in the P7Viterbi() function. This function has indeed been vectorized using AltiVec by Erik Lindahl of Stanford University and is part of the standard distribution. Speedups of up to 10.0 are reported using this vectorized version [4], due to a high level of data parallelism. The original code uses 32-bit integer words and 32-bit (single-precision) FP. All vector code processed four items per operation.

We attempted to use 16-bit integer values and vectorize 8 items per operation. Several sample input files supported this premise: all integer values fall in the 16-bit range [-32768, 32767]. However, after several attempts we realized that the data has entropy that is larger than what 16 bits can represent. Many of the integer values in the input are ratios that are converted into FP values. The loss of precision when using 16 bits was intolerable.

This case exemplifies another possible limitation of the AltiVec unit (although not for this specific application). A high degree of parallelism can be achieved when 8 and 16 data elements are processed per operation. When this drops to 4 (due to 32-bit data values) the benefits over the scalar units diminish, yet are still considerable. A possible implementation might be to support six 21-bit operations or five 25-bit operations.

2.2.3 BLAST

BLAST (Basic Local Alignment Search Tool) [2] is a utility that is maintained by the National Center for Biotechnology Information (NCBI). BLAST is a set of similarity search programs designed to explore all the available sequence databases regardless of whether the query is protein or DNA.

The CEPBA-IBM Research Institute (CIRI) has vectorized some of the key functions of BLAST [3]. We analyzed their code, validated some of their results, and came up with some interesting conclusions. When a word size of 22 is used (W=22), nice speedups are obtained (top row in Table 2). However, code analysis shows that most of the speedup is due to the higher bandwidth of the AltiVec unit (128 bits vs. 64 bits per load). This analysis is shared with Alex Ramirez of UPC [16], who added that the optimal scalar results were achieved with a word size of 11 (W=11). When using this word size, the results are very different (bottom row of Table 2). Therefore, we can conclude that enhanced bandwidth using AltiVec is useful: reading a chunk of 128 bits per operation is faster than two simultaneous 64-bit reads. The same work is done with fewer instructions, fewer effective address computations, and fewer register file accesses.

Word Size    Input 1    Input 2    Input 3
W=22         1.58       1.63       2.85
W=11         0.99       0.99       1.01

Table 2. Speedups of BLAST with different word sizes

2.3 Online Document Processing

The ever increasing bandwidth to the Internet makes it crucial to quickly process downloaded data and display it to the user. The AltiVec unit effectively improves the processing time of two important applications.

2.3.1 XML Delimiter Lookup

Recently, the Extensible Markup Language (XML) has been widely used for various purposes, including web services for commercial workloads. This places a burden on the XML parser, which converts text data into XML format. The XML parser consumes valuable processing time in finding the positions of a group of certain values (delimiters) in the input text. The following code snippet shows the scalar delimiter lookup:

    table[256] = {1, 0, 0, 0, 0, ...
                  0, 0, 0, 0, 0, ...
                  ... };
    for (i = 0; i < DATASIZE; i++)
        if (table[data[i]] != 1) { /* found delimiter */ }

A 'normal' character is represented by a 1 and a delimiter by a 0 in a 'table' array. In our method, a 256-byte array is condensed into a 256-bit (32-byte) bitmap. One permutation and two vector shift instructions can effectively look up 16 characters at once.

The performance of the delimiter lookup code is shown in Figure 4. The vectorized version outperforms the scalar version for every search length and reaches speedups of up to 5.0. This technique was used to enhance an XML parser. In this case, Amdahl's law comes into play and the speedups for the whole application are relatively modest, with a maximum of 1.55 and an average of 1.13 over thirteen real world inputs. The same techniques used here can also be used for general table lookup. A 256-byte lookup table can be stored in 16 vector registers and accessed using a series of permutes and shifts. This reduces lookup time by half and frees valuable cache lines.

Figure 4. Speedup of vector version of delimiter lookup

2.3.2 UTF-8 to Unicode Conversion

UTF-8 (Unicode Transformation Format-8) is a popular encoding scheme [18], particularly for Kanji-based languages. The standard way to convert a UTF-8 character sequence to Unicode is to:

1. Fetch the first byte.
2. Examine its prefix bits and determine the length of the expected byte sequence.
3. Fetch the following bytes.
4. Construct the Unicode data by extracting bit fields from each byte and concatenating them into a 16-bit code or a pair of 16-bit codes.
5. Repeat as long as data is available.

The algorithm stresses one of the problems of vectorization: the number of elements that fit in a 16-byte vector varies. Sequences of one-byte UTF-8 characters are handled easily, sixteen bytes at a time. Sequences of three-byte UTF-8 characters pose more of a problem: the five characters have to be isolated within the 16-byte vector. However, a mix of 1- and 3-byte UTF-8 characters can hamper any vectorization attempt. Nevertheless, coding efforts improved the UTF-8 conversion routine in an XML parser (C code). Speedups of 3.8 for 1-byte characters and 1.4 for a combination of 1- and 3-byte characters were achieved. The inputs and speedups are shown in Figure 5.

Figure 5. Speedups of UTF-8 to Unicode conversions

2.4 Database Applications

Databases are one of the key business applications processed by high-end servers. Enabling them to use the AltiVec unit can be beneficial for computer manufacturers and their customers. However, the sheer size and complexity of such systems makes it very difficult to identify the CPU bound bottlenecks that can benefit from vectorization. It is very hard to profile such a system or find a developer who can point out code segments that are potentially vectorizable. And even when they are identified, the code segments are covered with layers of legacy code that cannot be altered for the sake of backward compatibility.

Thus, in our experience, fragments of the application should be vectorized independently and the results presented to the DB architects, who can then choose whether and how to integrate them into their applications. Sorting algorithms are elementary in DB systems and thus are likely candidates for vectorization. An attempt to vectorize the quicksort algorithm failed due to two main reasons:

• The frequent fragmenting of the buffers ensures that loading and storing the data is usually misaligned.
• If all items in a vector are not smaller/larger than the pivot, the vectorized version is serialized.

Thus, an algorithm that is independent of the data values must be used: combsort [13] accesses memory in a sequential order independent of the input data. To the best of our knowledge, combsort is the only known algorithm that has both a deterministic memory access pattern and a computational complexity of O(N log N). Combsort improves on bubblesort by comparing two elements with a gap (the distance from one another) of more than one, while a gap of one is always used in bubblesort. The ability to process several elements per instruction and the elimination of branches using the vector min and max instructions greatly enhance performance.

However, combsort performs worse for large data sets that cannot fit into the L2 cache. This is due to poor access locality. Thus, we combined the vectorized combsort and scalar quicksort for large data sets. Elements are divided into smaller data sets by quicksort, and combsort is used when each data set is small enough to fit into the L2 cache of the processor.

Another pitfall that must be avoided is that sorting is used mostly to reorder data structures according to their keys. Sorting only the keys is ineffective. However, we can easily extend our method to sort pairs of 32-bit integer keys and 32-bit pointers. Even though half the bandwidth is lost, the sorting is effective and performance still surpasses other fast implementations (scalar quicksort and vectorized bitonic sort [6]) as shown in Figure 6.

Figure 6. Speedup of vectorized combsort and vectorized bitonic sort relative to scalar quicksort

Further exploration succeeded in improving this algorithm even more: data chunks that fit in the L2 cache are sorted using combsort and then merged using an innovative vectorized mergesort algorithm. Thus, the technique can sustain a high speedup even when data spills from the L2 cache. This is explained in detail in [10].

3 Insights and Conclusions

The breadth of the analyzed applications, the vectorization successes, and the subsequent failures lead to many conclusions. These can benefit future architecture and micro-architecture designers, compiler writers, and source level developers:

1. Faster transfer between vector registers and general purpose registers: The current implementation, where data must use memory as an intermediary, results in significant performance loss. The capability to transfer data directly between a GPR and a VR partially exists by using the lvsl instruction and exploiting knowledge of the pipeline internals. However, the need for a true architectural solution is self-evident. Other possible implementations are instructions that count the number of zeros/leading zeros/ones in a vector and write the result into a GPR.

2. Additional instructions: Many candidate applications were not vectorized due to the lack of suitable instructions. The absence of 32-bit integer multiplication and of any kind of division rules out the vectorization of many applications. The lack of double precision FP capabilities denies vectorization to almost all numeric applications. This, and the indirect data transfer, put the AltiVec unit at a disadvantage when compared to other vector architectures (such as Intel's SSE3).

3. Homogeneous data layout: Data must be laid out homogeneously for the AltiVec unit to read and process several items simultaneously. If data is abstracted into heterogeneous structures and classes, the overhead of extracting and pre-processing it to a vectorizable form may be too high. Compiler writers and programmers should be aware of this problem when building and representing their data.

4. Utilizing unique instructions: The permute, min, max, and many other instructions are unique to SIMD units. Utilizing them intelligently yields benefits that are beyond the speedup obtained by parallelism.

5. Expanded bandwidth: Vector loads and stores are in 128-bit quantities. These can double the memory to processor bandwidth without clogging up the load/store queues of the processor. This offers an advantage over issuing two scalar LD/ST instructions per cycle. The down side is that data must be aligned on 16-byte boundaries. Misaligned data can greatly hamper performance. Care should be taken to have aligned data at all times. This is one reason that an algorithm such as combsort is easily vectorized, while quicksort is not.

6. Data parallelism: The smaller the data element, the higher the parallelism that AltiVec can offer. Unfortunately, most data in commercial applications is in 32- or even 64-bit elements. Thought should be given to how to partition these elements into smaller ones, or how to add capabilities to AltiVec (such as processing several vectors together).

7. Optimizing scalar code: Although this should be obvious, it is not always performed. In many cases, the scalar source code to be compared with isn't available. Every effort should be made to access it, or alternatively, the code should be written anew. The effort to vectorize and the utilization of the vector resources are wasted on code that can achieve similar or higher performance using scalar units.

References

[1] AltiVec Instruction Cross-Reference. http://developer.apple.com/hardware/ve.
[2] BLAST. http://www.ncbi.nih.gov/BLAST/.
[3] CIRI BLAST. http://www.ciri.upc.es/cela pblade/BLAST.htm.
[4] CIRI HMMER. http://www.ciri.upc.es/cela pblade/HMMER.htm.
[5] K. Diefendorff and P. K. Dubey. How Multimedia Workloads Will Change Processor Design. Computer, 30(9):43–45, 1997.
[6] N. K. Govindaraju, N. Raghuvanshi, and D. Manocha. Fast and Approximate Stream Mining of Quantiles and Frequencies using Graphics Processors. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 611–622, 2005.
[7] L. Gwennap. G4 Is First PowerPC With AltiVec. Microprocessor Report, 12(15), November 1998.
[8] HMMER. http://hmmer.wustl.edu/.
[9] IBM. PowerPC Microprocessor Family: AltiVec(TM) Technology Programming Environments Manual, 2.0 edition, July 2003.
[10] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. A New Sorting Algorithm for Exploiting SIMD Instructions. IBM Research Report, RT0640, 2006.
[11] M. Kawahito, H. Komatsu, T. Moriyama, H. Inoue, and T. Nakatani. A New Idiom Recognition Framework for Exploiting Accelerators. IBM Research Report, RT0639, 2006.
[12] C. Kozyrakis and D. Patterson. Overcoming the Limitations of Conventional Vector Processors. In ISCA '03: Proceedings of the 30th Annual International Symposium on Computer Architecture, pages 399–409, June 2003.
[13] S. Lacy and R. Box. A Fast, Easy Sort. Byte Magazine, page 315, April 1991.
[14] D. Naishlos. Autovectorization in GCC. In Proceedings of the 2004 GCC Developer's Summit, June 2004. http://www.gccsummit.org/2004/.
[15] D. Nuzman and R. Henderson. Multiplatform Vectorization. In Proceedings of the 4th Annual International Symposium on Code Generation and Optimization (CGO), March 2006.
[16] A. Ramirez. Private communication.
[17] P. Sandon. PowerPC 970: First in a New Family of 64-bit High Performance PowerPC Processors. http://www.simdtech.org/altivec/documents/IBM PPC970 MPF2002.pdf, 2002.
[18] UTF & BOM. http://www.unicode.org/faq/utf bom.html.
[19] P. Wu, A. E. Eichenberger, and A. Wang. Efficient SIMD Code Generation for Runtime Alignment and Length Conversion. In Proceedings of the 3rd International Symposium on Code Generation and Optimization (CGO 2005), pages 153–164, March 2005.
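As a closing illustration of the sorting discussion in Section 2.4, the following is a minimal scalar C sketch of combsort, written so that each compare-exchange uses only min and max and never branches on the data. It is not the vectorized implementation described in the paper or in [10]: the function names are hypothetical, and the shrink factor of 1.3 (computed as gap*10/13) is an assumption based on the commonly used value for combsort. The point is only that the sequence of memory accesses depends on the array length and not on the values, which is the property that makes the algorithm amenable to SIMD min/max instructions.

```c
#include <stddef.h>
#include <stdint.h>

static inline int32_t min_i32(int32_t a, int32_t b) { return a < b ? a : b; }
static inline int32_t max_i32(int32_t a, int32_t b) { return a < b ? b : a; }

/* Scalar combsort sketch (hypothetical helper, ascending order).
 * Elements that are 'gap' apart are compare-exchanged; the gap shrinks
 * every pass until it reaches 1, after which passes repeat until no
 * exchange occurs. */
static void combsort_i32(int32_t *a, size_t n)
{
    size_t gap = n;
    int swapped = 1;

    while (gap > 1 || swapped) {
        gap = (gap * 10) / 13;          /* assumed shrink factor of 1.3 */
        if (gap < 1)
            gap = 1;
        swapped = 0;

        for (size_t i = 0; i + gap < n; i++) {
            /* Branch-free compare-exchange: the smaller value goes to
             * the lower index, the larger one to the higher index. */
            int32_t lo = min_i32(a[i], a[i + gap]);
            int32_t hi = max_i32(a[i], a[i + gap]);
            swapped |= (lo != a[i]);
            a[i] = lo;
            a[i + gap] = hi;
        }
    }
}
```

In a vector version, the two scalar helpers would presumably be replaced by vector min and max operations applied to several elements at once; how the final passes and misaligned accesses are handled is described in [10] and is not reproduced here.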