Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture

⋆ Jatin Chhugani† William Macy† Akram Baransi ⋆ Anthony . Nguyen† Mostafa Hagog Sanjeev Kumar† Victor W. Lee† Yen-Kuang Chen† Pradeep Dubey†

Contact: Jatin.Chhugani@.com † Applications Research Lab, Corporate Technology Group, Intel Corporation ⋆ Architecture - Israel, Mobility Group, Intel Corporation

ABSTRACT orders data, it is also used for other database operations such Sorting a list of input numbers is one of the most fundamen- as creation of indices and binary searches. Sorting facilitates statistics related applications including finding closest pair, tal problems in the field of science in general and th high-throughput database applications in particular. Al- determining an element’s uniqueness, finding k largest el- though literature abounds with various flavors of sorting ement, and identifying outliers. Sorting is used to find the , different architectures call for customized im- convex hull, an important used in computational plementations to achieve faster sorting times. geometry. Other applications that use sorting include com- This paper presents an efficient implementation and de- puter graphics, computational biology, supply chain man- tailed analysis of MergeSort on current CPU architectures. agement and . Our SIMD implementation with 128-bit SSE is 3.3X faster Modern shared-memory computer architectures with mul- than the scalar version. In addition, our algorithm performs tiple cores and SIMD instructions can perform high perfor- an efficient multiway merge, and is not constrained by the mance sorting, which was formerly possible only on message- memory bandwidth. Our multi-threaded, SIMD implemen- passing machines, vector , and clusters [23]. tation sorts 64 million floating point numbers in less than 0.5 However, efficient implementation of sorting on the latest seconds on a commodity 4-core Intel . This mea- processors depends heavily on careful tuning of the algo- sured performance compares favorably with all previously rithm and the code. First, although SIMD has been shown published results. as an efficient way to achieve good power/performance, it Additionally, the paper demonstrates performance scal- restricts how the operations should be performed. For in- ability of the proposed with respect to stance, conditional execution is often not efficient. Also, certain salient architectural features of modern chip multi- re-arranging the data is very expensive. Second, although processor (CMP) architectures, including SIMD width and multiple processing cores integrated on a single silicon chip core-count. Based on our analytical models of various ar- allow one program with multiple threads to run side-by- chitectural configurations, we see excellent of our side, we must be very careful about data partitioning so implementation with SIMD width scaling up to 16X wider that multiple threads can work effectively together. This is than current SSE width of 128-bits, and CMP core-count because multiple threads may compete for the shared scaling well beyond 32 cores. Cycle-accurate simulation of and memory bandwidth. Moreover, due to partitioning, the Intel’s upcoming many-core Larrabee architecture con- starting elements for different threads may not align with firms scalability of our proposed algorithm. the cacheline boundary. In the future, microprocessor architectures will include many cores. For example, there are commercial products 1. INTRODUCTION that have 8 cores on the same chip [10], and there is a re- Sorting is used by numerous computer applications [15]. search prototype with 80 cores [24]. This is because multi- It is an internal database operation used by SQL operations, ple cores increase computation capability with a manageable and hence all applications using a database can take advan- thermal/power envelope. 
In this paper, we will examine how tage of an efficient sorting algorithm [4]. Sorting not only various architecture parameters and multiple cores affect the Permission to make digital or hard copies of portions of this work for tuning of a implementation. personal or classroom use is granted without fee provided that copies The contributions of this paper are as follows: Permissionare not made to copy or distr withoutibuted fee for all pr oro partfit or of co thismmercial material ad isvantage granted and provided • We present the fastest sorting performance for mod- thatthat the copies copies bear are th notis notice made orand distributed the full citation for direct on the commercial first page. advantage, ern computer architectures. theCopyright VLDB copyright for components notice and of thethis title work of theowned publication by others and than its date VLD appear,B • Our multiway merge implementation enables the run- andEndowment notice is must given be that honored. copying is by permission of the Very Large Data ning time to be independent of the memory bandwidth. Base Endowment. To copy otherwise, or to republish, to post on servers Abstracting with credit is permitted. To copy otherwise, to republish, • Our algorithm also avoids the expensive unaligned load/store ortoto post redistribute on servers to or lists, to r requiresedistribute a fee to and/orlists requires special prior permission specific from the operations. publisher,permission ACM. and/or a fee. Request permission to republish from: VLDBPublications ‘08, August Dept. 24-30,, ACM 2008,, Inc. F Auckland,ax +1 (212 New)869-0481 Zealand or • We provide analytical models of MergeSort that accu- [email protected]. 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

PVLDB '08, August 23-28, 2008, Auckland, New Zealand Copyright 2008 VLDB Endowment, ACM 978-1-60558-306-8/08/08

1313 rately track empirical data. the local memory. AA-sort [11], implemented on a PowerPC • We examine how various architectural parameters affect 970MP, proposed a multi-core SIMD algorithm based on the Mergesort algorithm. This provides insights into op- [14] and mergesort. During the first phase of the timizing this sorting algorithm for future architectures. algorithm, each sorts the data assigned to it using • We compare our performance with the most efficient al- comb sort based algorithm, and during the second phase, the gorithms on different platforms, including and GPUs. sorted lists are merged using an odd-even merge network. The rest of the paper is organized as follows. Section 2 Our implementation is based on mergesort for sorting the presents related work. Section 3 discusses key architectural complete list. parameters in modern architectures that affect Mergesort performance. Section 4 details the algorithm and imple- 3. ARCHITECTURE SPECIFICATION mentation of our Mergesort. Section 5 provides an analyti- Since the inception of the microprocessor, its performance cal framework to analyze Mergesort performance. Section 6 has been steadily improving. Advances in semiconductor presents the results. Section 7 concludes. manufacturing capability is one reason for this phenomenal change. Advances in is another rea- 2. RELATED WORK son. Examples of architectural features that brought sub- Over the past few decades, large number of sorting algo- stantial performance improvement include instruction-level rithms have been proposed [13, 15]. In this section, we focus parallelism (ILP, e.g., pipelining, out-of-order super-scalar mainly on the algorithms that exploit the SIMD and multi- execution, excellent branch prediction), data-level parallelism core capability of modern processors, including GPUs, Cell, (DLP, e.g., SIMD, vector computation), thread-level paral- and others. lelism (TLP, e.g., simultaneous multi-threading, and multi- has been one of the fastest algorithms used in core), and memory-level parallelism (MLP, e.g., hardware practice. However, its efficient implementation for exploit- prefetcher). In this section, we examine how ILP, DLP, TLP, ing SIMD is not known. In contrast, Bitonic sort [1] is imple- and MLP impact the performance of sorting algorithm. mented using a that sets all comparisons in advance without unpredictable branches and permits multi- 3.1 ILP ple comparisons in the same cycle. These two characteristics First, modern processors with super-scalar architecture make it well suited for SIMD processors. [13] can can execute multiple instructions simultaneously on differ- also be SIMDfied effectively, but its performance depends on ent functional units. For example, on Intel Core 2 Duo pro- the support for handling simultaneous updates to a memory cessors, we can execute a min/max and a shuffle instruction location within the same SIMD register. on two separate units simultaneously [3]. As far as utilizing multiple cores is concerned, there exist Second, modern processors with pipelining can issue a algorithms [16] that can scale well with increasing number of new instruction to the corresponding functional unit per cores. Parikh et al. [17] propose a load-balanced scheme for cycle. The most-frequently-used instructions in our Merge- parallelizing quicksort using the hyperthreading technology sort implementation, Min, Max and Shuffle, have one-cycle available on modern processors. An intelligent scheme for throughput on Intel Core 2 Duo processors. 
Nonetheless, splitting the list is proposed that works well in practice. not all the instructions have one-cycle latency. Nakatani et al. [16] present bitonic sort based on a k-way While the shuffle instruction can be implemented with decomposition. We use their algorithm for performing load a single cycle latency, the Min and Max instructions have balanced merges when the number of threads is greater than three cycle latencies [3]. The difference in instruction laten- the number of arrays to be sorted. cies often leads to bubbles in execution, which affects the The high compute density and bandwidth of GPUs have performance. We will examine the effect of the execution also been targeted to achieve faster sorting times. Sev- bubbles in Section 4.2 and Section 5.2. eral implementations of bitonic sort on GPUs have been Third, another factor to consider in a super-scalar ar- described [8, 18]. Bitonic sort was selected for early sort- chitecture is the forwarding latency between various func- ing implementations since it mapped well to their fragment tional units. Since the functional units are separate entities, processors with data parallel and set comparison character- the communication between them may be non-uniform. We istics. GPUTeraSort [8], which is based on bitonic sort, uses must take this into account while scheduling instructions. by representing data as 2-D arrays or tex- Rather than out-of-order execution, some commercially- tures, and hides memory latency by overlapping pointer and available processors have in-order execution. Our analysis fragment processor memory accesses. Furthermore, GPU- shows that the Mergesort algorithm has plenty of instruction ABiSort [9] was proposed, that is based on adaptive bitonic level parallelism and we can schedule the code carefully so sort [2] and rearranges the data using bitonic trees to re- that its performance on a dual-issue in-order processors is duce the number of comparisons. Recently added GPU the same as that of a multi-issue out-of-order processors. capabilities like scattered writes, flexible comparisons and Thus, for the rest of the paper, we would not differentiate atomic operations on memory have enabled methods com- between in-order and out-of-order processors. bining radixsort and mergesort to achieve faster performances on modern GPUs [19, 21, 22]. 3.2 DLP Cell processor is a heterogeneous processor with 8 SIMD- First, Single-Instruction-Multiple-Data (SIMD) execution only special purpose processors (SPEs). CellSort [7] is based performs the same operation on multiple data simultane- on a distributed bitonic merge with a bitonic sort kernel ously. Today’s processors have 128-bit wide SIMD instruc- implemented with SIMD instructions. It exploits the high tions, e.g., SSE, SSE2, SSE3, etc. Majority of the remaining bandwidth for cross-SPE communication by executing as paper will focus on 128-bit wide SIMD as it is commonly many iterations of sorting as possible while the data is in available. To simply the naming, we refer to it as 4-wide

1314 SIMD, since 128-bit wide SIMD can operate on four single- Although there will be bigger caches and higher memory precision floating-point simultaneously. bandwidth in the future, they must be managed and be Second in the future, we will have 256-bit wide SIMD in- used in the most efficient manner. First, when the size of structions [12] and even wider [20]. Widening the SIMD the total data set is larger than the size cache, we must re- width to more elements in parallel will improve the use the data in the cache as many times as possible before compute density of the processor for code with a large amount the data is put back to the main memory. Second, in the of data-level parallelism. Nevertheless, for wider SIMD, application, we should minimize the number of accesses to shuffles will become more complicated and will probably the main memory, as we will see in Section 4. have longer latencies. This is because the chip area to imple- ment a shuffle network is proportional to the square of the in- 3.4 MLP put sizes. On the other hand, operations, like min/max, that Because today’s processors have hardware prefetchers that process the elements of the vector independently will not can bring data from the next-level to the suffer from wider SIMD and should have the same through- closer cache, memory latency is not an issue for our algo- put/latency. rithm. During the process of merging large arrays of data, Besides the latency and throughput issues mentioned in we must load the data from main memory to caches (with Section 3.1, instruction definition can be restrictive and per- a typical latency of a few hundreds cycles), before it can formance limiting in some cases. For example, in current be read into the registers. The access pattern of the merge SSE architecture, the functionality of the two-register shuf- operation is highly predictable – basically sequential access. fle instructions are limited - the first elements from the first Hence, hardware prefetchers in modern CPU’s can capture source can go to the lower 64-bits of the register while the such access patterns, and hide most of the latency of our elements from the second source (destination operand) can kernel. Furthermore, we can hide the latency by preemp- go to the high 64-bits of the destination. As input we have tively issuing software prefetch instructions that can pre- two vectors: load the data into the caches. In our experiment, hardware A1 A2 A3 A4 and B1 B2 B3 B4 prefetchers can improve the performance by 15% and the We want to compare A1 to A2, A3 to A4, B1 to B2 and software prefetches can improve the performance by another B3 to B4. To achieve, that we must shuffle the two vectors 3%. Thus, for the rest of the paper, we will not address into a form where elements from the two vectors can be memory latency any more. Instead, memory bandwidth, as compared directly. Two accepted form of the vectors are: mentioned in Section 3.3, is a concern. A1 B1 A3 B3 A2 B2 A4 B4 or 4. ALGORITHMIC DETAILS A1 B2 A3 B4 For the remainder of the paper, we use the following no- A2 B1 A4 B3 tation: The current SSE shuffle that takes two sources does not N : Number of input elements. allow the lower 64-bits of output from both sources. Thus, P : Number of processors. instead of two shuffle instructions, we have to use a three K : SIMD width. instruction sequence with the Blend instruction. The in- : Cache size (L2) in bytes. struction sequence for shuffling the elements becomes. 
M : Block size (in elements) that can reside in the cache. C = Blend (A, B, 0xA) // gives: A1 B2 A3 B4 E : Size of each element (in bytes). D = Blend (B, A, 0xA) // gives: B1 A2 B3 A4 BWm: Memory Bandwidth to the core (in bytes per cycle). D = Shuffle (D, D, 0xB1) // gives: A2 B1 A4 B3 This results in a sub-optimal performance. As we will We have mapped the Mergesort algorithm on a multi- see in Section 4.2.1, this is actually a pattern that a bitonic core, SIMD CPU for several reasons. First, Mergesort has 1 merge network uses in the lowest level. We expect the shuffle a runtime complexity of O(N log N ) , which is the optimal instruction in the future to provide such capability and thus complexity for any comparison based sorting algorithm [13]. improve the sort performance. For the rest of the paper, The scalar version of the algorithm executes log N itera- we use this three-instruction sequence for the real machine tions, where each iteration successively merges sequences performance, while assuming only two instructions for the of two sorted lists, and finally ends up with the complete 2 analytical model. sorted sequence . Second, since we are targeting multiple cores, there exist algorithms [16] that can efficiently paral- 3.3 TLP lelize mergesort across a large number of cores. In addition, there exist merging networks [1] that can be mapped to a For the multi-core architectures, getting more cores and SIMD architecture to further the merging process. more threads is a clear trend. With the increase in computa- tion capability, the demand for data would increase propor- 4.1 Algorithm Overview tionally. Two key architectural challenges are the external Each iteration of mergesort essentially reads in an element memory bandwidth and the cache design. We won’t be able once, and writes it back at its appropriate location after the to get more performance if our applications reach the limit of merge process. Thus, it is essentially a streaming kernel. A the memory bandwidth. One way to bridge the enormous naive implementation would load (and store) data from (and gap between processor bandwidth requirements and what the memory subsystem can provide, most processor designs 1Unless otherwise stated, log refers to logarithm with base today are equipped with several levels of caches. The per- 2 (log2). formance of many sorting algorithms is limited by either 2To reduce the branch misprediction for scalar implementa- the size of the cache or the external memory bandwidth. tion, we use conditional move instructions.

1315 to) the main memory and would be bandwidth bound for • bitonic merge network large datasets. Recall that modern CPU’s have a substantial Figure 1 depicts the odd-even merge network, while Fig- size of shared caches (Section 3.3), which can be exploited ure 2 shows the bitonic merge network (both for merging to reduce the number of trips to main memory. To take sequences of length 4). Odd-even merge requires fewer com- advantage of these caches, we divide the dataset into chunks parisons than bitonic merge. However, not all the elements (or blocks), where each block can reside in the cache. We are compared at every step, thereby introducing the over- sort each block separately, and then merge them in the usual head of data movement and element masking. fashion. Assuming a cache size of C bytes, the block size (denoted by M) would be C/2E, where E is the size of each element to be sorted. We now give a high level overview of our algorithm, which consists of two broad phases.

Phase 1. In this phase, we evenly divide the input data into blocks of size M, and sort each of them individually. Sorting each block is accomplished using a two step process. 1. First, we divide each block of data amongst the avail- able threads (or processors). Each thread sorts the data assigned to it using a SIMD implementation of merge- sort. We use merging networks to accomplish it. Merg- ing networks expose the data-level parallelism that can be mapped onto the SIMD architecture. Each thread sorts its allocated list of numbers to produce one sorted list. There is an explicit barrier at the end of the first Figure 1: Odd-Even merge network for merging se- step, before the next step starts. At the end of this step, quences of length 4 elements each. there are P sorted lists. 2. As a second step, we now merge these P sorted lists to On the other hand, bitonic merge compares all the el- produce a single sorted list of size M elements. This re- ements at every network step, and overwrites each SIMD quires multiple threads to work simultaneously to merge lane. Note that the pattern of comparison is much simpler, two lists. For example, for the first iteration of this step, and the resultant data movement can easily be captured we have P sorted lists, and P threads. For merging ev- using the existing register shuffle instructions. Therefore, ery two consecutive lists, we partition the work between we mapped the bitonic merging network using the current the two threads to efficiently utilize the available cores. SSE instruction set (Section 4.2). In this subsection, we de- Likewise, in the next iteration, 4 threads share the work scribe in detail our SIMD algorithm. We start by explaining of merging two sorted sequences. Finally, the last iter- the three important kernels, namely the bitonic merge ker- ation consists of all available threads merging the two nel, the in-register sorting kernel, and the merging kernel. lists to obtain the sorted sequence. This step consists of This is followed by the algorithm description and various log P iterations and there will be an explicit barrier at design choices. For ease of explanation, we focus on single the end of each iteration. thread for this subsection. This is later extended to multiple At the end of the first phase, we have N /M sorted blocks, threads in the next subsection. each of size M. Note that since each of the blocks individ- ually resided in the cache, we read (and write) any element 4.2.1 Bitonic Merge Kernel only once to the main memory.

Phase 2. The second phase consists of (log N -log M) iter- ations. In each iteration, we merge pairs of lists to obtain sorted sequences of twice the length than the previous iter- ation. All P processors work simultaneously to merge the pairs of lists, using an algorithm similar to 1(b). However, for large N , the memory bandwidth may dictate the run- time, and prevent the application from scaling. For such cases, we use multiway merging, where the N /M lists are merged simultaneously in a hierarchical fashion, by reading (and writing) the data only once to the main memory. 4.2 Exploiting SIMD For a merging network, two sorted sequences of length K each are fed as inputs, and a sequence of length 2K is produced. A network executes multiple steps, with each step Figure 2: Bitonic merge network for merging se- performing simultaneous comparisons of elements, thereby quences of length 4 elements each. making them amenable to a SIMD implementation. There are two commonly described merging networks [1] that can Figure 2 shows the pattern for merging two sequences of be implemented with the current set of SIMD instructions: four elements. Arrays A and B are are held in SIMD reg- • odd-even network isters. Initially A and B are sorted in the same (ascending)

1316 order. Bitonic merge requires that one sequence be sorted 4.2.2 In-Register Sorting in ascending order, and the other be sorted in descending We use an algorithm similar to the one proposed by [11]. order. The order of the inputs A and B are shown after We load K2 numbers into K SIMD registers. Here, we refer the sorted order of B has been reversed. Each level of the to each slot within a SIMD register as a lane (for a total sorting network compares elements in parallel using SIMD of K lanes). We next sort values within each lane of the min and max operations. For example, in the first level A 1 registers, by simply executing a series of comparisons using and B 2 are compared and the smaller value will be in the min/max operations. These comparisons re-order the values SIMD register with the min results, designated by L, and so that values within any particular lane are sorted. We the larger will be in the SIMD register with the max results, use an odd-even sorting network (for sorting K values) to designated by H. accomplish this. This requires 2(K - 1 + (K(log K)(log K − After the comparison operation, the data is shuffled so 1))/4) min/max operations. This is followed by a transpose that appropriate comparisons are made in the next level. operation so that each bunch of K contiguous values are For example, consider the first level of comparisons. If A 1 sorted. The transpose operation again requires a series of is greater than B 2, then it will end up in H 1. Similarly, if shuffle instructions. It can be easily shown that this stage B 0 is greater than A 3, then it will end up in H 3. How- requires K log K shuffle operations. ever, the second level needs to compare these two numbers (on the extreme right of level 2 of Figure 2). Hence, we need a shuffle instruction that can move a value from the 2nd lo- cation of a register (from the left), to the 4th location (from the left). Of course, other locations must be shuffled to different locations. We expect a general shuffle instruction that takes two registers as inputs and produces an output register with the selected values from the input registers at user-specified locations. Hence two shuffle instructions are necessary between levels to position elements for subsequent comparisons. Similarly, the second level of comparisons are executed and the data are shuffled3. After the third stage, elements from the register L must be interleaved with ele- ments in the H register to obtain the final sequence. This again requires 2 shuffle instructions.

Figure 4: In-Register Sorting for 4-wide SIMD

Figure 4 depicts the whole process for a 4-wide SIMD. A total of 10 min/max and 8 shuffle instructions are required. 4.2.3 Merging Two Sorted Lists Assume there are two sorted lists, X and Y, that must be merged to produce one sorted list. Further, we assume that the generated array is stored into a different array (Z) of appropriate size. We also assume a 4-wide SIMD for the discussion, although the same algorithm holds for any K- wide SIMD. Figure 3: Pseudo code for implementing the 4-wide We start by loading 4 values each from the two arrays and bitonic merge network (assuming general shuffle in- run the sorting network code (Figure 3). O1 and O2 form structions). a sorted sequence of length 8. We store O1 into the output array; and O2 becomes one of the input registers for the next The pseudo-code for the above example is shown in Fig- call to the merging network. We now must load the other ure 3. Input registers A and B are merged to form output input register from one of the two arrays. This requires a comparison between the next value in X and Y arrays registers O1 and O2. A total of 6 min/max operations and 7 shuffle operations are required. In general, a K-wide network respectively. The array with the minimum next value loads has (log 2K) levels, with total number of (2 log 2K) min/max 4 values into the register, and the merging network code is operations, and (1 + 2 log 2K) shuffle instructions. executed. We continue with this process till the end of one The use of the above merging network requires the values of the arrays is reached. Thus, the values from the other within a register to be already sorted. To initiate the sorting array are loaded and the merging network code executed till process, we begin by producing sorted elements of length K, the end of this array is reached. L described next. For inputs of size (each), the network code is executed (2L/4 - 1) times. The corresponding number for a K-wide 3We require 3 instructions to capture this shuffle pattern on SIMD is (2L/K - 1). current SSE4 architecture, as mentioned in the example in As explained in Section 3.1, each of the SIMD instruc- Section 3.2. tions has a certain latency and throughput. On current

1317 Intel architecture [3], the latency of a min/max operation is networks provides the four independent 4 by 4 networks 3 cycles, and a shuffle instruction has a latency of 1 cycle. that we require. However, note that an 8 by 8 network The throughout of either of them is 1 per cycle. In addition, has one extra level, so we execute 4 extra min/max oper- the inter-functional-unit latency is 1 cycle. Let us take a ations as compared to independent 4 by 4 networks. The closer look at Figure 3. detailed analysis yields 2.5 cycles per element (section Say the execution starts at the 1st cycle. The first shuffle 5.2), a 2.5X speedup over the previous execution time. instruction takes 1 cycle. However, the execution of the The last but one iteration of mergesort can resort to a next min operation has to wait for 1 additional cycle (due 8 by 8 network for getting optimal results. By a similar to the latency of moving a value from one functional unit analysis, the last iteration can use a 16 by 16 network, to another). So the first min operation is executed at cycle that yields 3 cycles per element, a 2.2X speedup. #3, followed by the max at cycle #4. However, since the Hence, by carefully adapting the network width, we can max operation has a latency of 3 cycles, the result is not exploit the available hardware resources and improve our available till the end of cycle #6. And the first shuffle cannot merging performance. Profiling our implementation shows execute till cycle #8. Thus, at the end of the 10th cycle, we that majority of the execution time is spent in the merging can start with the next level of the network. Continuing kernels. Thus, it is imperative to optimize the performance with this analysis yields 26 cycles for executing one merging of these kernels. network, which implies 6.5 cycles per element. Also, note Moreover, each platform has slightly different architec- that we have not taken advantage of the multiple execution tural parameters. Hence for optimal performance, fine tun- units (Section 3.1) that can execute the min/max and shuffle ing is required for each platform. Another additional benefit instructions in parallel. of fine tuning the code is that once it is optimized, it would work well for both out-of-order and in-order processors. The reason is that during the optimization process, we typically take into account the instruction latencies and dependen- cies. Once these factors are considered, the code generated would run well without the expensive out-of-order engine. 4.2.4 Complete SIMD Algorithm The complete algorithm is as follows: 1. Divide the input (size N ) into chunks of size M. For each chunk: (a) Perform In-Register Sort to obtain sorted sequences of length K. (b) for itr ← [(log K) .. (log M - 3)] • Simultaneously merge 4 sequences (using a K by K network) of length 2itr to obtain sorted sequences of length 2itr+1. (c) Simultaneously merge 2 sequences (using a 2K by Figure 5: Bitonic merge network for merging se- 2K network) of length M/4 to obtain sorted se- quences of length 16 elements each quences of length M/2. (d) Merge the two resultant sequences (using a 4K by The main reason the poor performance is that the min/max 4K network) of length M/2 to obtain the final and the shuffle instructions have an inherent dependency sorted sequence of length M. that exposes the min/max latency. There are 2 ways to overcome it: 2. for itr ← [(log M) .. (log N - 1)] 1. Simultaneously merging multiple lists. 
Since merge (a) Merge sequences of length 2itr to obtain sorted se- operations for different lists are independent of each other, +1 quences of length 2itr . we can hide the latency exposed above by interleaving the min/max and shuffle instructions for the different During Step 1, we load the complete dataset from main lists. In fact, for the current latency numbers, it is suffi- memory only once. Each iteration of Step 2 needs to load/store cient (and necessary) to execute four merge operations complete dataset from/to the main memory. Total num- simultaneously (details in Section 5.2) to hide the la- ber of instructions required for the complete algorithm is tency. The resultant number of cycles is 32 cycles, which O(N log(N ) log(2K)/K). For cases where Step 2 becomes implies 2 cycles per element, a 3.25X speedup over the bandwidth bound, we present a scheme in Section 4.4 to previous execution time. The encouraging news is that reduce the number of read/write passes to just one. in case of the log N iterations of mergesort, all except the last two iterations have at least 4 independent pairs 4.3 Exploiting Multiple Cores of lists that can be merged. In this section, we discuss how we adapt our algorithm to 2. Use a wider network. For example, consider a 16 by an architecture with multiple cores. In mutli-core platform, 16 bitonic merge network (Figure 5). The interesting multiple threads can share last-level caches. Instead of us- observation here is that the last 4 levels (L2, L3, L4 and ing the block size of M (for each thread) as in the serial ′ L5) basically execute 2 independent copies of the 8 by implementation, we use a block size of M = M/P. Each ′ 8 network. The levels L3, L4 and L5 are independent thread sorts its block of size M , and then we have merge copies of 4 by 4 network. Hence, executing two 8 by 8 the resultant P sorted lists at the end of Step 1(d) in the

1318 previous section, before moving on to the next block. This ending points. The above discussion can be easily extended is described next as Step 1(e). to multiple threads cooperating to merge 2 sorted arrays. Step 1(e). Threads cooperate in this step to merge the Hence we do not have any unaligned loads in our algo- P individually sorted lists into one sorted list. This consists rithm. of log P iterations. Let us consider the first iteration. In order to keep all the threads busy, we assign each set of two 4.4 Multiway Merging threads to merge each consecutive pairs of lists. Consider For larger problem sizes, bandwidth becomes a bottleneck, the first two threads. Let the two lists be denoted by X even for a single core execution. With multiple cores (where and Y, each of size N ′. To generate independent work for the available bandwidth increases more slowly than the core the threads, we first compute the median of the merged count), the bandwidth becomes the dominant bottleneck. list. Although we are yet to merge the lists, since the lists The problem stems from the fact that the off-chip traffic for are individually sorted to begin with, the median can be large problem sizes (N ≥ 2M) involves (log N -log M) reads computed in log 2N ′ steps [5]. This computation also assigns and writes of the entire data. the starting location for the second thread in the two lists. The first thread starts with the beginning of the two lists and generates N ′ elements, while the second thread starts Head ′ Tail with the locations calculated above, and also generates N Count R elements. Since the second thread started with the median element, the two generated lists are mutually exclusive, and ′ Head together produce a sorted sequence of length 2N . Note that N N Tail this scheme seamlessly handles all boundary cases, with any Count particular element being assigned to only one of the threads. It is important to compute the median since that divides N N N N Head the work equally amongst the threads. At the end of the it- Tail Count eration, there is an explicit barrier, before the next iteration L L L L L L L L starts. Similarly, in the next iteration, 4 threads cooperate to sort two lists, by computing the starting points in the two th th th lists that correspond to the the 1/4 , 2/4 and the 3/4 Figure 6: Multiway Merging quantile, respectively. There are a total of log P iterations in this step. The partitioning is done in a way that each We devised multiway merging to address the bandwidth thread (except possibly the last one) is assigned a multiple bottleneck. After computing the individual lists, we merge of K elements to merge and write to the output list. them incrementally [6] to compute the final sorted list. This In addition, for each iteration of the merge operations is achieved by using a binary tree (Figure 6), where each of in Step 2, all P threads work simultaneously as described the leaves (labeled L) point to a sorted chunk from Step 1 above. (Section 4.2.4) and the root (labeled R) points to the desti- There are a couple of implementation details for the mul- nation array that will contain the sorted output at the end tiple threads case. In a single thread, all memory load ad- of Step 2. Each internal node (labeled N) has a small FIFO dresses are exactly aligned with the size of the SIMD width. queue (implemented as a circular buffer) to store partial re- However, in the case of multiple threads working simultane- sults for each intermediate iteration. 
At any node, if there ously to merge two lists, the threads start from the median is an empty slot in its FIFO queue and each of its children points that may not be aligned locations. Functionality- has at least one element in its queue (i.e. the node is ready), wise, this can be taken care of by issuing unaligned memory the smaller of those two elements can be moved from the loads instead of the regular aligned loads. Performance-wise, child’s queue into the nodes queue. While there are ready unaligned loads are often slower than aligned loads. In some nodes available, this basic step is applied to one of those cases, we noticed a slowdown of up to 15%. nodes. When no ready nodes remain, the destination array We devised a technique that eliminates unaligned loads will have the sorted data. during the simultaneous multi-thread merging phase. This Since sorting one pair at a time incurs high overheads, follows from the fact that the number of elements assigned to in practice, each node has a buffer big enough to store hun- each thread is a multiple of the SIMD width. Without loss of dreds of elements and is deemed ready only when each child’s generality, we use two threads for the following discussion. buffer is at least half full so that FIFO size/ 2 elements can The outputs of the threads are always aligned stores and be processed at a time. To exploit SIMD, the elements from the first thread starts with two aligned addresses. Although the two lists are merged using a 4K by 4K network. The the starting points of the second thread may be unaligned, working set of the multiway merging is essentially the ag- the sum of the number of unaligned elements (till the SIMD gregate size of all the FIFO queues in the internal nodes. width boundary) must be equal to the SIMD width. Consequently, the size of the FIFO queue is selected to be With this observation, we handle the boundary case sep- as large as possible while the working set fits in the cache. arately. That is, we merge first few elements (from the two At the leaves, only the head element is part of the working lists) outside the loop. After that, the thread loads the re- set and has a minimal impact on the working set. maining data in the usual fashion by issuing aligned memory This approach also accommodates parallelization in a nat- accesses in the loop. Furthermore, our technique can han- ural way. The ready nodes can be processed in parallel, using dle the boundary case via SIMD and aligned loads—as the synchronization to maintain a global pool of ready nodes. In computation of the boundary case of second thread’s start- practice, this scheme works well for the current multi-core ing points can be used as the boundary case of first thread’s processors.

1319 4.5 Sorting (key, data) pairs Note that memory accesses of one block can be completely So far, we have focussed on sorting an input list of num- hidden by the computation of another block. Thus, Tls of bers (keys). However, database applications typically sort the above equation effectively become insignificant, giving tuples of (key, data), where data represents the pointer to the following equation. the structure containing the key. Our algorithm can be eas- ily extended using one of the following ways to incorporate Tserial−phase1 = N * log(M)* Tse this case. 1. Treating the (key, data) tuple as a single entity (e.g., 64- Once all blocks have been merged through the first log(M) bit value). Our sorting algorithm can seamlessly handle iterations, the remaining log(N /M) iterations are executed it by comparing only the first 32-bits of the 64-bits for in the next part using multiway merging (Section 4.4). The computing the min/max, and shuffling the data appro- time taken in the second phase can be modeled as follows: priately. Effectively, the SIMD width reduces by a factor of 2X (2-wide instead of 4-wide), and the performance Tserial−phase2 = (log(N ) − log(M)) * N * Tse slows down by 1.5X-2X for SSE, as compared to sorting just the keys. To account for cases where the bandwidth isn’t large enough 2. Storing the key and the data values in different SIMD to cover for computation time, the expression becomes: registers. The result of the min/max is used to set a mask that is used for blending the data register (i.e., Tserial−phase2 = N * Max(log(N /M) Tse, 2E/BWm) moving the data into the same position as their corre- sponding keys). The shuffle instructions permute the The total execution time of our scalar serial mergesort al- keys and data in the same fashion. Hence, the final gorithm is given by Tserial = Tserial−phase1 + Tserial−phase2. result consists of the sorted keys with the data at the appropriate locations. The performance slows down by 5.2 Single-Thread SIMD Implementation ˜2X. We now build a model to estimate the running time for Henceforth, we report our analysis and results for sorting the merging network (using SIMD instructions). Let a, b an input list of key values. and c represent the min/max latency, shuffle latency and the inter-functional-unit latency respectively ((a ≥ 1), (b ≥ 1) and (c ≥ 0)). The aim is to project the actual running 5. ANALYSIS time based on these three parameters, and K. This section describes a simple yet representative analyt- ical model that characterizes the performance of our algo- rithm. We first analyze the single threaded scalar imple- mentation, and enhance it with the SIMD implementation, followed by the parallel SIMD implementation. 5.1 Single-Thread Scalar Implementation The aim of the model is to project the running time of the algorithm, given the time taken (in cycles) to compute one element per iteration of the mergesort algorithm. If the number of elements is small so that they all fit into on- chip caches, the execution time will be characterized mainly by core computation capability. The initial loads and final stores contribute little to the execution time overall. The following expression models the execution time of this class of problems:

Tserial = N log(N )* Tse + N log(N )* Tls

Tse is the execution time per element per iteration, while Tls is the time spent in the load and store of each element. If the elements cannot fit into the cache, then loop-blocking Figure 7: Timing diagram for the two functional is necessary to achieve good performance. We divide the units during the execution of the merging network. data elements into chunks (or blocks) of M elements each, such that two arrays of size M can reside in the cache. For Using the pseudo-code in Figure 3 as an example, Figure 7 N elements, there will be N /M blocks. depicts the snapshot of a processor with two function exe- As discussed in Section 4, we sort the elements in two cution units during the execution of the network. Consider phases. The first phase performs log(M) iterations for each the min/max functional unit in Figure 7. After issuing the block, thereby producing sorted lists of size M. The execu- min and max operation, this unit does not execute the next tion time is similar to the expression above, except that the instruction for (a - 1) + c + (b + 1) + c, equal to a+b+2c load and store from memory only happens once. cycles. Similarly, the shuffle functional unit also remains idle for a+b+2c cycles. Adding up the total time for the Tserial−phase1 = N /M * M log(M)* Tse + N * Tls network in the 4 by 4 merging network on a 4-wide SIMD yields 3(a+b+2c+2) + (b+c) cycles. The general expres-

1320 sion for a K by K network on a K-wide SIMD architecture In addition, the threads are working with the complete is (a+b+2c+2)log 2K + (b+c) cycles. data in the cache. Thus, the bandwidth of concern here is Different scenarios exist depending on the exact values not the external memory bandwidth but the interconnect of the latencies. We consider a representative scenario for bandwidth. There are log P iterations in this part. The fol- today’s architecture below and will discuss scenarios of up- lowing equation models this part of the execution: coming architectures in Section 6.5. On today’s Intel architecture [3], the latencies of min/max, Tparallel−phase1b = log(P)*[MTpe + Tsync] shuffle, and cross-functional unit bypass are 3, 1 and 1, re- spectively. In other words, (a = 3), (b = 1) and (c = 1). Since there are N /M blocks, the total time spent in st The time taken for a 4 by 4 network on SSE yields 26 cy- 1 phase is Tparallel−phase1 = N /M *(Tparallel−phase1a + cles, implying 6.5 cycles per element. As noted above, each Tparallel−phase1b). functional unit remains idle for 6 cycles. In the best case, we can issue 3 pairs of min/max operations to keep the unit During the second phase, various threads cooperate with busy. One way to accomplish this is to run 4 independent in- each other to merge the sorted blocks of size M into a fi- stances of the merging network. Thus, the total time taken nal sorted list. The following equation models the execution to produce 1 element drops to 2 cycles. The generalized time of this phase: expression is ((a+b+2c+2)log 2K + (b+c+6))/4K cycles. In case of mergesort, there may exist iterations where 4 Tparallel−phase2 = log(N /M)*[N Max(log(N /M)Tpe, 2E/ independent pairs of lists do not exist. In case of 2 indepen- BWm) + Tsync] dent lists, one can execute two networks of 2K by 2K (on a K-wide SIMD). Although this generates 4 independent K The total execution time of our parallel mergesort algo- by K networks, we have to execute eight extra min/max op- rithm is given by Tparallel = Tparallel−phase1 + Tparallel−phase2. erations (Figure 5). This increases the cycles per element to ((a+b+2c+2)log 2K + (b+c+14))/4K cycles, which eval- 6. RESULTS uates to 2.5 cycles per element on 4-wide SSE. In this section, we present the data of our MergeSort al- Executing a 4K by 4K network results in gorithm. The input dataset was a random distribution of ((a+b+2c+2)log 2K + (b+c+22))/4K cycles, which evalu- single precision floating point numbers (32-bits each). The ates to 3 cycles per element on the current 4-wide SSE. runtime is measured on a system with a single Intel Q9550 5.3 Multi-threaded Implementation quad-core processor, as described in Table 1. Given the execution time (in cycles) per element for each System Parameters iteration of the single-threaded SIMDfied mergesort, and Core clock speed 3.22GHz other system parameters, this models aims at computing Number of cores 4 the total time taken by a multi-threaded implementation. L1 Cache 32KB/core As explained in Section 4.3, the algorithm has two phases. L2 Cache 12MB In the first phase, we divide the input list into N /M blocks. Front-side Speed 1333MHz Let us now consider the computation for one block. Front-side Bus Bandwidth 10.6GB/s To exploit the P threads, the block is further divided into Memory Size 4GB P lists, wherein each thread sorts its M′ = M/P elements. Table 1: System parameters of a quad-core Q9550 In this part, there is no communication between threads, machine. 
but there is a barrier at the end. However, since the work division is constant and the entire application is running in We implemented serial and parallel versions of the Merge- a lock-step fashion; we do not expect too much load imbal- Sort algorithm. For each version, we have implemented ance, and hence the barrier cost is almost negligible. The scalar and SIMD versions. All versions are written in C and execution time for this phase can be modeled as: optimized with SSE intrinsics. The parallel version uses the pthread . All implementations are compiled with T M M′ T parallel−phase1a = log( )* pe 10.1 version of the Intel Compiler. We first describe results for the different implementations on the latest Intel IA-32 T pe is the execution time per element per iteration. The architecture. Second, we compare our results with the an- P threads now cooperate to form one sorted list. This in- alytical model presented in Section 5. Then, we compare troduces two new components to the execution time. our measured results with other platforms. Finally, we use • Computation by the threads to compute the starting lo- the analytical model to study how the various architectural cations in the arrays before merging (Textra). parameters affect performance. • Synchronization cost while the threads work simultane- ously to merge arrays (Tsync). 6.1 Single-Thread Performance

The Textra component is relatively small compared to the Table 2 shows the execution times in seconds for the serial computation time per iteration and can be ignored. The versions of the scalar and SIMD MergeSort. The dataset synchronization time: Tsync includes the time wasted be- sizes vary from 512K to 256M elements. Note that datasets cause of load imbalance and the barrier synchronization time of size up to only 1M elements fit in the cache. at the end of this part. The barrier time is proportional to The execution of MergeSort is extremely regular and it is the number of cores in the system. For a small system with possible to report performance at a finer granularity – the four or eight processors, the barrier time is negligible. But as number of clock cycles to process each element in a single it- the core count increases, the barrier time can be substantial. eration. Table 3 shows the normalized performance in cycles

1321 Cell 8600 8800 PowerPC[11] Intel Our Implementation

[7] GTS[22] GTX[19] 5600[19] 1C 4C 2C[7] 1CScalar 1CSSE 2C 4C BW (GB/sec) 25.6 32.0 86.4 76.8 16.8 16.8 10.6 10.6 10.6 10.6 10.6 Peak GFLOPS 204.8 86.4 345.6 345.6 20.0 80.0 51.2 25.8 25.8 51.6 103.0

512K - 0.05 0.0067 0.0067 - - - 0.0314 0.0088 0.0046 0.0027 1M 0.009 0.07 0.0130 0.0130 0.031 0.014 0.098 0.0660 0.0195 0.0105 0.0060 2M 0.023 0.13 0.0257 0.0258 0.070 0.028 0.205 0.1385 0.0393 0.0218 0.0127 4M 0.056 0.25 0.0514 0.0513 0.147 0.056 0.429 0.2902 0.0831 0.0459 0.0269 Number 8M 0.137 0.50 0.1030 0.1027 0.311 0.126 0.895 0.6061 0.1763 0.0970 0.0560 Of 16M 0.317 - 0.2056 0.2066 0.653 0.251 1.863 1.2636 0.3739 0.2042 0.1170 Elements 32M 0.746 - 0.4108 0.4131 1.363 0.505 3.863 2.6324 0.7894 0.4292 0.2429 64M 1.770 - 0.8476 0.8600 2.872 1.118 7.946 5.4702 1.6738 0.9091 0.4989 128M 4.099 - - 1.8311 6.316 2.280 16.165 11.3532 3.6151 1.9725 1.0742 256M ------23.5382 7.7739 4.3703 2.4521

Table 2: Performance comparison across various platforms. Running times are in seconds. (1C represents 1-core, 2C represents 2-cores and 4C represents 4-cores.) per element per iteration. We observed two things. First, caches and the case where it does not. the performance for the scalar version is constant across all dataset sizes. The cycles per element for each iteration is 4 10.1 cycles. This means the scalar MergeSort is essentially Scalar SIMD compute bound (for the range of problems we considered). 3 In addition, this also implies that our implementation is able 1M Elements 256M Elements to hide the memory latency pretty well. 2 1

No. of 512K 1M 4M 16M 64M 256M Parallel Speedup Elements 0 Scalar 10.1 10.1 10.1 10.1 10.1 10.1 1 2 4 1 2 4 SIMD 2.9 2.9 2.9 3.0 3.1 3.3 Number of Cores

Table 3: Single-Thread Performance (in cycles per Figure 8: Parallel performance of the scalar and element per iteration). SIMD implementations.

Second, the performance of the (4-wide) SIMD implemen- We observe several things from the figure. First, our im- tation is 3.0X - 3.6X faster than the scalar version. The cy- plementation scales well with the number of cores – around cles per element (per iteration) time grows from 2.9 cycles to 3.5X on 4 cores for scalar code, and 3.2X for SSE code. In 3.3 cycles (from 512K to 256M elements). Without our mul- particular, our multiway merge implementation is very ef- tiway merge implementation, the number varied between 2.9 fective. Without our multiway merge implementation, the cycles (for 512K elements) to 3.7 cycles (for 256M elements, SSE version for 256M elements scales only 1.75X because of with the last 8 iterations being bandwidth bound). There- being limited by the memory bandwidth. Second, our im- fore, our efficient multiway merge implementation ensures plementation is able to efficiently utilize the computational that the performance for sizes as large as 256M elements is resources. The scaling of the larger size (256M), which does not bandwidth bound, and increases by only 5-10% as com- not fit in the cache, is almost as good as the scaling of the pared to sizes that fit in the L2 cache. The increase in run- smaller size (1M), which fits in the cache. The scaling for time with the increase in datasize is due to the fact that our the SSE code is slightly lower than the scalar code due to multiway merge operation uses a 16 by 16 network, which the larger synchronization overhead. consumes 30% more cycles than the optimal combination of merging networks (Section 5.2). For an input size of 256M 6.3 Comparison with Analytical Model elements, the last 8 iterations are performed in the multiway Using the analytical model from Section 5, we were able to merge, that leads to the reported increase in runtime. predict the performance of MergeSort. For the scalar imple- mentation, the error of the model is within 2% of measured 6.2 Multi-Thread Performance data. This provides good confidence on our scalar single- Figure 8 shows the parallel speedup of the scalar and the threaded model to predict the running time, given the value SIMD versions of MergeSort for 1M and 256M elements. We Tse for the specific architecture (Section 5.1). choose those two data points to show how our implementa- In case of the SIMD model (Section 5.2), we analytically tion scales for the case where the data fits into the on-chip compute the actual cycles per element per iteration, and de-

1322 duce the resultant running time. For different sizes, our run- latency of the min/max operation to reduce (recall in ning times are around 30% larger than our predicted values. Section 5.2 that the latency of min/max is 3 cycles for This can be attributed to the following two reasons. First, the current Intel architecture). For these new specific an extra shuffle instruction after the second level of compar- latencies (i.e., min/max latency of 1), the functional isons (for a 4 by 4 network) accounts for around 10% of the units remain idle for 4 cycles. Hence, we need 3 in- extra cycles. Second, extra move instructions are generated, dependent sets of K by K network to fill in which may be executed on either the min/max or the shuf- bubbles. However, since the increase in width brings fle functional units, accounting for around 15% overhead on 2 extra levels, the most optimal run times are obtained average. By accounting for these two reasons, our analytical by running 4 independent merging networks. The resul- model reasonably predicts the actual performance. tant expression is (a+2b+3c+#Shuffles-4)/K, which is Both of these experiments confirm the soundness of our equal to (a+2b+3c+8(log 2K))/4K cycles per element. analysis for the sorting algorithm, and provide confidence For 4-wide SSE, we obtain 1.875 cycles per element. that the analytical model developed can be used for pre- Executing two 2K by 2K networks yields (a+2b+3c+ dicting performance on various architectural configurations. 8(log 4K))/4K cycles per element, equivalent to 2.375 cycles per element for 4-wide SIMD. 6.4 Comparison with Other Platforms Executing a 4K by 4K network results in (a+2b+3c+ Table 2 lists the best reported execution times4 (in sec- 8(log 8K))/4K which evaluates to 2.875 cycles per ele- onds) for varying dataset sizes on various architectures( IBM ment on the 4-wide SIMD. Cell [7], Nvidia 8600 GTS [22], Nvidia 8800 GTX and Quadro • (a = 1), (b = 1), and (c = 0): This represents the FX 5600 [19], Intel 2-core Xeon with Quicksort [7], and limit study, with each of the parameters being assigned IBM PowerPC 970MP [11]). Our performance numbers are their minimum possible values. In this case, we only faster than those reported on other architectures. Since need 2 independent instances of the K by K merging sorting performance depends critically on both the com- network to execute simultaneously for optimal perfor- pute power and the bandwidth available, we also present mance. On a 4-wide SIMD, our model predicts 1.75 cy- the peak GFLOPS (Billions of floating point operations per cles per element as the most optimal running time. The second) and the peak bandwidth (GB/sec) for a fair com- corresponding time for executing two 2K by 2K networks parison. The columns at the end show running times of our yields 2.25 cycles per element (on 4-wide SIMD), and a implementation for 1 core and 4 cores. There are two key 4K by 4K network results in 2.69 cycles per element on observations: 4-wide SIMD. • Our single-threaded implementation is 7X–10X faster than previously reported numbers for 1-core IA-32 plat- The above discussed cases cover a substantial portion of forms [7]. This can be attributed to our efficient imple- the various scenarios that can exist. We can draw a cou- mentation. In particular, our cache-friendly implemen- ple of interesting conclusions. 
The above cases cover a substantial portion of the scenarios that can exist, and we can draw a couple of interesting conclusions. First, reducing the latency of the min/max operations, a, has minimal impact on performance. Second, by carefully scheduling the instructions, we are able to keep both the min/max and shuffle functional units busy most of the time. Currently, we assume a throughput of 1 instruction per cycle for each. With the increase in SIMD width, it is reasonable to expect the execution time of shuffle operations to increase, which will proportionately increase the execution time of the merging network. For example, a throughput of 1 shuffle instruction every 2 cycles will approximately double the execution times for all the cases discussed above.
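To make the min/max and shuffle instruction mix concrete, the fragment below shows one compare-exchange level of a 4-wide merging network written with SSE intrinsics. It is a hypothetical illustration rather than the paper's kernel: a full merge of 2K elements chains roughly log(2K) such levels, and the exact shuffle pattern differs from level to level.

// One compare-exchange level: the min/max pair executes on the min/max unit,
// while the two shuffles (which re-pair lanes for the next level) execute on
// the shuffle unit. The shuffle masks shown are placeholders for one level.
#include <xmmintrin.h>

static inline void compare_exchange_level(__m128& lo, __m128& hi) {
    const __m128 new_lo = _mm_min_ps(lo, hi);
    const __m128 new_hi = _mm_max_ps(lo, hi);
    lo = _mm_shuffle_ps(new_lo, new_hi, _MM_SHUFFLE(1, 0, 1, 0));
    hi = _mm_shuffle_ps(new_lo, new_hi, _MM_SHUFFLE(3, 2, 3, 2));
}

With a throughput of one instruction per cycle on each unit, the min/max and shuffle operations of independent networks can overlap, which is the scheduling effect discussed above.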

Figure 9: Speedup as SIMD width increases from 4-wide to 64-wide for different latencies of (a, b, c). (The plot shows the speedup over a 4-wide SIMD with (a, b, c) = (3,1,1) for the configurations (3,1,1), (1,1,1) and (1,1,0), with SIMD width on the x-axis.)

In Figure 9, we plot the speedup in execution time (in cycles per element per iteration) with varying SIMD width for different values of a, b and c. The data size N is fixed at 16M elements. All numbers are plotted w.r.t. a 4-wide SIMD with (a, b, c) = (3,1,1). As evident from the graph, for any particular SIMD width, the speedup varies only by a few percent for different values of a, b and c. This is attributed to our scheduling algorithm, which extracts the maximum performance for any given set of values of these parameters. In addition, the speedup increases near-linearly with the increase in SIMD width and is, in fact, consistent with the O(N log(N) log(2K)/K) running time of the algorithm.

Next, we study how MergeSort would perform on a future CMP system and project performance for Intel's upcoming Larrabee – a many-core x86 architecture with simple in-order cores and 512-bit SIMD instructions [20]. For our experiments, we use a cycle-accurate, execution-driven simulator of the Larrabee architecture with an assumed frequency of 1 GHz. Table 4 lists the projected runtimes for dataset sizes of 64M and 256M elements, with the number of cores varying from 8 to 32. Our algorithm scales near-linearly on Larrabee, well beyond 32 cores.

Number Of Elements    8C      16C     32C
64M                   0.636   0.324   0.174
256M                  2.742   1.399   0.751

Table 4: Projected performance (in seconds) on Larrabee architecture [20].
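The near-linear scaling can be read off Table 4 directly; the following reader-side check (not from the paper) computes the speedup obtained from each doubling of the core count:

// Speedups implied by the Table 4 runtimes when doubling the core count.
#include <cstdio>

int main() {
    const double t64m[]  = {0.636, 0.324, 0.174};  // 64M elements:  8C, 16C, 32C
    const double t256m[] = {2.742, 1.399, 0.751};  // 256M elements: 8C, 16C, 32C
    const int    cores[] = {8, 16, 32};

    for (int i = 1; i < 3; ++i) {
        std::printf("%dC -> %dC: 64M %.2fx, 256M %.2fx\n",
                    cores[i - 1], cores[i],
                    t64m[i - 1] / t64m[i], t256m[i - 1] / t256m[i]);
    }
    // Each doubling of the core count yields a 1.86x-1.96x speedup.
    return 0;
}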
Additionally, we use the analytical model to study how changes in the cache size and the processor frequency would affect the performance. The cache size determines the block size that each thread uses to sort in the first phase, followed by the multiway merging during the second phase. The larger the cache size, the fewer the iterations in the second phase. Since each iteration of the second phase is only slightly more expensive than an iteration of the first phase, doubling the cache size improves the execution time by only 2–3%. On the other hand, increasing the frequency of the processor proportionately improves the overall performance, since the execution times per element per iteration (Tse and Tpe in the equations in Section 5) are proportionately reduced.

7. CONCLUSIONS
We have presented an efficient implementation and detailed analysis of MergeSort on current CPU architectures. This implementation exploits salient architectural features of modern processors to deliver significant performance benefits. These features include cache blocking to minimize access latency, vectorizing for SIMD to increase compute density, partitioning work and load balancing among multiple cores, and multiway merging to eliminate the bandwidth-bound stages for large input sizes. In addition, our implementation uses either a wider sorting network or multiple independent networks to increase parallelism and hide back-to-back instruction dependences. These optimizations enable us to sort 256M numbers in less than 2.5 seconds on a 4-core processor. We project near-linear scalability of our implementation with up to 64-wide SIMD, and well beyond 32 cores.

8. ACKNOWLEDGEMENTS
We would like to thank Hiroshi Inoue for sharing the PowerPC performance numbers, and Professor John Owens, Mark Harris and Nadathur Rajagopalan Satish for providing the Nvidia GPU numbers. We are also thankful to Ronny Ronen for his comments on an initial version of the paper, and to the anonymous reviewers for their insightful comments and suggestions.

9. REFERENCES
[1] K. E. Batcher. Sorting networks and their applications. In Spring Joint Computer Conference, pages 307–314, 1968.
[2] G. Bilardi and A. Nicolau. Adaptive bitonic sorting: An optimal parallel algorithm for shared-memory machines. SIAM J. Comput., 18(2):216–228, 1989.
[3] Intel 64 and IA-32 Architectures Optimization Reference Manual. http://www.intel.com/products/processor/manuals.
[4] M. V. de Wiel and H. Daer. Sort Performance Improvements in Oracle Database 10g Release 2. An Oracle White Paper, 2005.
[5] R. Francis and I. Mathieson. A Benchmark Parallel Sort for Multiprocessors. IEEE Transactions on Computers, 37:1619–1626, 1988.
[6] R. S. Francis, I. D. Mathieson, and L. Pannan. A Fast, Simple Algorithm to Balance a Parallel Multiway Merge. In PARLE, pages 570–581, 1993.
[7] B. Gedik, R. R. Bordawekar, and P. S. Yu. CellSort: High Performance Sorting on the Cell Processor. In VLDB '07, pages 1286–1297, 2007.
[8] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. In Proceedings of the ACM SIGMOD Conference, pages 325–336, 2006.
[9] A. Gress and G. Zachmann. GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures. In International Parallel and Distributed Processing Symposium, April 2006.
[10] M. Gschwind. Chip multiprocessing and the Cell Broadband Engine. In CF '06: Proceedings of the 3rd Conference on Computing Frontiers, pages 1–8, 2006.
[11] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In PACT '07, pages 189–198, 2007.
[12] Intel Advanced Vector Extensions Programming Reference. http://softwarecommunity.intel.com/isn/downloads/intelavx/Intel-AVX-Programming-Reference-31943302.pdf/, 2008.
[13] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). 1998.
[14] S. Lacey and R. Box. A Fast, Easy Sort. Byte Magazine, pages 315–320, 1991.
[15] W. A. Martin. Sorting. ACM Computing Surveys, 3(4):147–174, 1971.
[16] T. Nakatani, S.-T. Huang, B. Arden, and S. Tripathi. K-way Bitonic Sort. IEEE Transactions on Computers, 38(2):283–288, Feb 1989.
[17] R. Parikh. Accelerating QuickSort on the Intel Pentium 4 Processor with Hyper-Threading Technology. softwarecommunity.intel.com/articles/eng/2422.htm/, 2008.
[18] T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon mapping on programmable graphics hardware. In ACM SIGGRAPH 2005 Courses, 2005.
[19] N. R. Satish, M. Harris, and M. Garland. Designing Efficient Sorting Algorithms for Manycore GPUs. Submitted to SuperComputing 2008.
[20] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: A Many-Core x86 Architecture for Visual Computing. Proceedings of SIGGRAPH, 27(3):to appear, 2008.
[21] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In ACM Symposium on Graphics Hardware, pages 97–106, 2007.
[22] E. Sintorn and U. Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. In Workshop on General Purpose Processing on Graphics Processing Units, 2007.
[23] P. Tsigas and Y. Zhang. A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000. In Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP), page 372, 2003.
[24] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar. An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS. In Proceedings of the Solid-State Circuits Conference, pages 98–589, 2007.
