Register Level Sort Algorithm on Multi-Core SIMD Processors
Total Page:16
File Type:pdf, Size:1020Kb
Register Level Sort Algorithm on Multi-Core SIMD Processors Tian Xiaochen, Kamil Rocki and Reiji Suda Graduate School of Information Science and Technology The University of Tokyo & CREST, JST {xchen, kamil.rocki, reiji}@is.s.u-tokyo.ac.jp ABSTRACT simultaneously. GPU employs a similar architecture: single- State-of-the-art hardware increasingly utilizes SIMD paral- instruction-multiple-threads(SIMT). On K20, each SMX has lelism, where multiple processing elements execute the same 192 CUDA cores which act as individual threads. Even a instruction on multiple data points simultaneously. How- desktop type multi-core x86 platform such as i7 Haswell ever, irregular and data intensive algorithms are not well CPU family supports AVX2 instruction set (256 bits). De- suited for such architectures. Due to their importance, it veloping algorithms that support multi-core SIMD architec- is crucial to obtain efficient implementations. One example ture is the precondition to unleash the performance of these of such a task is sort, a fundamental problem in computer processors. Without exploiting this kind of parallelism, only science. In this paper we analyze distinct memory accessing a small fraction of computational power can be utilized. models and propose two methods to employ highly efficient Parallel sort as well as sort in general is fundamental and bitonic merge sort using SIMD instructions as register level well studied problem in computer science. Algorithms such sort. We achieve nearly 270x speedup (525M integers/s) on as Bitonic-Merge Sort or Odd Even Merge Sort are widely a 4M integer set using Xeon Phi coprocessor, where SIMD used in practice. However, implementing sort algorithm level parallelism accelerates the algorithm over 3 times. Our on SIMD processors efficiently, especially on Xeon Phi or method can be applied to any device supporting similar x86 CPU remains a challenging job. Until now, many algo- SIMD instructions. rithms have been proposed for GPU or CPU. GPU version of the algorithm cannot be directly transplanted a proces- sor supporting 512-bit wide vector instructions such as Xeon Categories and Subject Descriptors Phi. Nvidia GPUs have fast, explicitly manageable on-chip C.1.2 [Computer Systems Organization]: PROCESSOR shared memory and their cores are more functional and ex- ARCHITECTURES|Single-instruction-stream, multiple-data- ecute individual threads concurrently with SIMT synchro- stream processors (SIMD) nization on hardware level. Programming Xeon Phi poses a different challenge - the instruction restrictions of memory accesses (discussed in Section 3.1) needed during SIMD sort General Terms or merge. Previous algorithms designed for SSE or CELL Algorithms, Performance, Design processor require too many registers or not strong scalable enough for long SIMD instructions. On Xeon Phi, proces- Keywords sor's resources can easily run out with previous methods. A general, strong scaling algorithm for Xeon Phi or future SIMD, Parallel, Sort, Xeon Phi, Register, Irregular SIMD instruction capable processors is necessary. In this paper, we focus on developing parallel sort and 1. INTRODUCTION merge algorithms only in register, under the constraints of Multi-core architecture has been widely used in state-of- memory accessing model within registers. We propose two art processors. For example, NVIDIA's Tesla K20 GPU in-register sort/merge algorithms which only take a constant has 14 Stream Multiprocessors(SMX) and Intel's Xeon Phi number of registers (2 or 3) no matter how long SIMD in- has 60 cores. Xeon Phi supports 512 bits SIMD instruc- struction is. Then, we present theoretical and empirical tion i.e. AVX-512 which can process 16 integers (4 bytes) analysis of the execution time. We chose AVX-512 and Xeon Phi as our main experiment environment as a extreme case of SIMD parallelism combined with restricted access pat- Permission to make digital or hard copies of all or part of this work for per- terns and limited bandwidth. We also implemented the al- sonal or classroom use is granted without fee provided that copies are not gorithm on Intel E5-2670 with AVX instructions. The same made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components algorithm can be derived for other SIMD processors. of this work owned by others than the author(s) must be honored. Abstract- ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a 2. RELATED WORK fee. Request permissions from [email protected]. First algorithm that needs to be mentioned is parallel SC13 November 17-21, 2013, Denver CO, USA Radix sort[5] as it is often used. The main advantage of the Copyright is held Copyright is held by the owner/author(s). ACM 978-1-4503-2503-5/13/11 method is that if the size of digits is fixed, it only needs O(n) http://dx.doi.org/10.1145/2535753.2535762. time complexity, where n is the length of data. The idea is to sort data digit by digit. Intel Laboratories' report[2] is the othy Furtak[7], it generates the code with search method. first published result presenting performance of radix sort on The main idea is to search a set of SSE instructions which Xeon Phi. Later in the paper, we are presenting our results follow exactly the same logic as Odd-Even Merge Sort at compared to those shown in this paper. Its main disadvan- each step. The advantage of this method is that the gener- tage on the other hand is that it is limited to the integer type ated code can be locally optimal. However they ignore the of data and exploits its decimal character. Contrary to more usage of the number of registers and the speed of genera- general, comparative sort for example, float numbers need tion. Finally, AA sort[10] is a different approach which does extra transformation[13] using radix sort. Furthermore, if not employ odd-even or any other sorting network. Comb the data type size is longer, the algorithm becomes slower. Sort[11] is their choice. However, Comb Sort needs O(n2) Nadathur Satish et al.[6] studied how radix sort and merge in worst case, also this algorithm requires the same number sort performs when size of data type varies. Though both of registers as the aforementioned Jatin's algorithm. of them get slower when data type changes from 32bit to We propose two strong scaling register-level-sort algorithms 128bit, radix sort get higher deceleration. which only need constant number of registers. The two algo- Comparison based sort is a more general sort algorithm. rithms are both derived from the code generation approach Ω(nlog(n)) is the lower bound of its time complexity. Many which is based on constructing methods. It can be finished theoretical results have been published in the field of parallel in polynomial time (the same as bitonic merge sort). sorting. Bitonic-Merge Sort[3] or Odd-Even Merge Sort[8] are just two algorithms usually used in practice. They are 2.2 Multi-Core Level Sort also sorting networks traversed in O(log2(n)) steps. Though In principle, Multi-Core Level Sort focuses on utilizing log(n)-step algorithm has been invented[9], due to the large multiple cores. MIMD (Multiple Instruction, Multiple Data) overhead it is hardly used in practice. Our algorithm keeps and PRAM are the appropriate hardware models. On GPU, the generality of a comparative sort while achieving speedup Odd-Even Merge Sort is still a widely chosen algorithm [12, only slightly worse than with radix sort. 14]. It has high parallelism, but suffers from high work There is a large number of research papers which describe complexity. Theoretically Bitonic-Merge Sort can be per- implementing parallel sort. We classify the algorithms into formed in O(nlogn) time by using a subtree(i.e. an address two types: Register Level Sort and Multi-Core Level Sort. pointer) exchanging algorithm[16] utilizing a tree data struc- This classification is made by the way of synchronization in ture. Contrary to merge, divide-and-conquer is another idea processors. In register level, SIMD instructions play the role of sorting elements. For example, Bucket Sort, the quick of cores. Synchronization is not necessary(GPU may need sort's parallel version can divide the data into a numbers of synchronization explicitly) since SIMD computation is natu- bucket by selecting several pivots. The data in each bucket rally synchronized after each instruction. In the situation of can be sorted by other sorting algorithms, since the buck- Multi-Core level, communication happens in slow cache or ets are ordered by pivots. An GPU implementation[15] uses RAM. The cost of synchronization becomes more expensive. hybrid algorithm of bucket sort and merge sort. However the algorithm may not be well balanced since buckets' sizes 2.1 Register Level Sort depend on the chosen of pivots. Some buckets may be too Register level sort means sorting a part of a small amount large or too small. A typical x86-family processor, such of data only using SIMD instructions. Usually the data as Intel i7 does not have as many core as common GPUs. should be loaded into a register from the main memory and Multi-way merge is an optional algorithm in such a CPU then sorted in registers in parallel. Finally the data should implementation[1]. Although, multi-way merge can be as be copied back to the main memory. Some implementations efficient as merge sort from time complexity aspect. If there may use Odd-Even Merge Sort at this point, for example an are too many cores, the communication between cores will algorithm using SSE instructions[1] by Jatin Chhugani et al.