Sorting on a Cell Broadband Engine SPU ∗†
Shibdas Bandyopadhyay and Sartaj Sahni
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611
shibdas@ufl.edu, [email protected]fl.edu
March 17, 2009

Abstract

We adapt merge sort for a single SPU of the Cell Broadband Engine. This adaptation takes advantage of the vector instructions supported by the SPU. Experimental results indicate that our merge sort adaptation is faster than other sort algorithms (e.g., AA-sort, CellSort, quick sort) proposed for the SPU, as well as faster than our SPU adaptations of shaker sort and brick sort. An added advantage is that our merge sort adaptation is a stable sort, whereas none of the other sort adaptations is stable.

Keywords: Cell Broadband Engine, SPU sorting, merge sort.

1 Introduction

The Cell Broadband Engine (CBE) is a heterogeneous multicore architecture developed by IBM, Sony, and Toshiba. A CBE (Figure 1) consists of a Power PC (PPU) core, eight Synergistic Processing Elements or Units (SPEs or SPUs), and associated memory transfer mechanisms [6]. The SPUs are connected in a ring topology and each SPU has its own local store. However, SPUs have no local cache and no branch prediction logic. Data may be moved between an SPU's local store and central memory via a DMA transfer, which is handled by a Memory Flow Controller (MFC). Since the MFC runs independently of the SPUs, data transfer can be done concurrently with computation. The absence of branch prediction logic in an SPU and the availability of SIMD instructions that operate on vectors of 4 numbers pose a challenge when developing high-performance CBE algorithms.

Recently, two sorting algorithms, AA-sort [8] and CellSort [4], were proposed for the CBE. AA-sort is an adaptation of comb sort, which was originally proposed by Knuth [9] and rediscovered by Dobosiewicz [2] and Box and Lacey [1]. CellSort is an adaptation of bitonic sort (e.g., [9]).
Both AA-sort and CellSort are based on sorting algorithms that are inefficient on a single processor. Hence, parallelizing these algorithms begins with a handicap relative to the fastest serial sorting algorithms: merge sort for worst-case behavior and quick sort for average behavior. Comb sort is known to have a worst-case complexity that is O(n^2) [3]. Although the best upper bound known for its average complexity is also O(n^2), experimental results indicate an average complexity of O(n log n) [1, 3]. On the other hand, the average complexity of quick sort is known to be O(n log n). Since experiments indicate that comb sort runs in about twice as much time on a single processor as does quick sort [3], attempts such as [8] to develop a fast average-case sort for a single SPU of the CBE that begin with comb sort are handicapped by a factor of two compared to attempts that begin with quick sort. This handicap needs to be overcome by the CBE adaptation. For integers and floats, the CBE supports 4-way parallelism within a single SPU, as 4 integers (floats) may be stored in each of the SPU's 128-bit vector registers. Hence, we expect at best a two-fold speedup over a conventional implementation of quick sort. However, due to possible anomalous behavior resulting from such factors as the absence of branch prediction, we may actually observe a greater speedup [10].

Figure 1: Architecture of the Cell Broadband Engine [5]

∗This research was supported, in part, by the National Science Foundation under grant ITR-0326155.
†The authors acknowledge Georgia Institute of Technology, its Sony-Toshiba-IBM Center of Competence, and the National Science Foundation for the use of Cell Broadband Engine resources that have contributed to this research.
Similarly, attempts such as [4] to develop a fast worst-case sort for a single SPU starting with bitonic sort are handicapped relative to starting with merge sort, because the worst-case complexity of bitonic sort is O(n log^2 n) while that of merge sort is O(n log n).

As noted in [4], a logical way to develop a sorting algorithm for a heterogeneous multicore computer such as the CBE is to (1) begin with a sorting algorithm for a single SPU, then (2) using this as a core, develop a sort algorithm for the case when data fits in all available cores, then (3) use this multi-SPU algorithm to develop a sort algorithm for the case when the data to be sorted does not fit in the local stores of all available SPEs but fits in main memory. Our long-term strategy is to extend this hierarchical plan to the case where the data to be sorted is so large that it is distributed over the main memories of a cluster of CBEs. An alternative strategy is to use the master-slave model, in which the PPU serves as the master processor and the SPUs are the slave processors. The PPU partitions the data to be sorted and sends each partition to a different SPU; the SPUs sort their partitions using a single-SPU sort; the PPU merges the sorted data from the SPUs so as to complete the sort of the entire data set. This strategy is used in [14] to sort on the nCube hypercube and in [13] to sort on the CBE.

Regardless of whether we sort large data sets using the hierarchical strategy of [4] or the master-slave strategy of [13, 14], it is important to have a fast algorithm to sort within a single SPU. The absence of any branch prediction capability and the availability of vector instructions that support SIMD parallelism on an SPU make the development of a competitive merge sort a challenge, a challenge that we address in this paper.
Our development of a competitive merge sort is specifically for the case of sorting integers and floats and makes use of the fact that 4 integers or floats may be stored in a single register of an SPU. In Section 2, we describe the SIMD functions that are used in the specification of our SPU algorithms. Comb sort and its SPU adaptation AA-sort are reviewed in Section 3; in that section, we also develop SPU adaptations of brick sort and shaker sort. Our SPU adaptation of merge sort is developed in Section 4, and experimental results comparing the various SPU sorting algorithms are presented in Section 5. In the remainder of this paper, we use the term number to refer to an integer or a float. Note that 4 numbers may be stored in a 128-bit vector.

2 SPU Vector Operations

We use several SIMD functions that operate on a vector of 4 numbers to describe the SPU adaptations of the sorting algorithms. We describe these functions in this section. In the following, v1, v2, min, max, and temp are vectors, each comprised of 4 numbers, and p, p1, and p2 are bit patterns. Function names that begin with spu_ are standard C/C++ Cell SPU intrinsics, while those that begin with mySpu_ are defined by us. Our description of these functions is tailored to the sorting application of this paper.

1. spu_shuffle(v1, v2, p) · · · This function returns a vector comprised of a subset of the 8 numbers in v1 and v2. The returned subset is determined by the bit pattern p. Let W, X, Y, and Z denote the 4 numbers (left to right) of v1 and let A, B, C, and D denote those of v2. The bit pattern p = XCCW, for example, returns a vector comprised of the second number in v1, followed by two copies of the third number of v2, followed by the first number in v1. In the following, we assume that constant patterns such as XYZD have been pre-defined.

2. spu_cmpgt(v1, v2) · · · A 128-bit vector representing the pairwise comparison of the 4 numbers of v1 with those of v2 is returned.
If an element of v1 is greater than the corresponding element of v2, the corresponding 32 bits of the returned vector are 1s; otherwise, these bits are 0s.

3. spu_add(v1, v2) · · · Returns the vector obtained by pairwise adding the numbers of v1 to the corresponding numbers of v2.

4. spu_sub(v1, v2) · · · Returns the vector obtained by pairwise subtracting the numbers of v2 from the corresponding numbers of v1.

5. spu_and(p1, p2) · · · Returns the vector obtained by pairwise ANDing the bits of p1 and p2.

6. mySpu_not(p) · · · Returns the vector obtained by complementing each of the bits of p. Although the CBE does not have a not instruction, we can perform this operation using the nor function, which is supported by the CBE and computes the complement of the bitwise OR of two vectors. It is easy to see that spu_nor(p, v0), where v0 is an all-zero vector, correctly computes the complement of the bits of p.

7. spu_select(v1, v2, p) · · · Returns a vector whose ith bit comes from v1 (v2) when the ith bit of p is 0 (1).

8. spu_slqwbyte(v1, n) · · · Returns a vector obtained by shifting the bytes of v1 m bytes to the left, where m is the number represented by the 5 least significant bits of n. The left shift is done with zero fill, so the rightmost m bytes of the returned vector are 0.

9. spu_splat(s) · · · Returns a vector comprised of 4 copies of the number s.