Sorting on a Cell Broadband Engine SPU ∗†

Total Page:16

File Type:pdf, Size:1020Kb

Sorting on a Cell Broadband Engine SPU ∗† Sorting On A Cell Broadband Engine SPU ∗† Shibdas Bandyopadhyay and Sartaj Sahni Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611 shibdas@ufl.edu, [email protected]fl.edu March 17, 2009 Abstract We adapt merge sort for a single SPU of the Cell Broadband Engine. This adaptation takes advantage of the vector instructions supported by the SPU. Experimental results indicate that our merge sort adaptation is faster than other sort algorithms (e.g., AA sort, Cell sort, quick sort) proposed for the SPU as well as faster than our SPU adaptations of shaker sort and brick sort. An added advantage is that our merge sort adaptation is a stable sort whereas none of the other sort adaptations is stable. Keywords Cell broadband engine, SPU sorting, merge sort. 1 Introduction The Cell Broadband Engine (CBE) is a heterogeneous multicore architecture developed by IBM, Sony, and Toshiba. A CBE (Figure 1) consists of a Power PC (PPU) core, eight Synergistic Processing Elements or Units (SPEs or SPUs), and associated memory transfer mechanisms [6]. The SPUs are connected in a ring topology and each SPU has its own local store. However, SPUs have no local cache and no branch prediction logic. Data may be moved between an SPUs local store and central memory via a DMA transfer, which is handled by a Memory Flow Control (MFC). Since the MFC runs independent of the SPUs, data transfer can be done concurrently with computation. The absence of branch prediction logic in an SPU and the availability of SIMD instructions that can operate on vectors that are comprised of 4 numbers poses a challenge when developing high performance CBE algorithms. Recently, two sorting algorithms–AA-sort [8] and CellSort [4]–were proposed for the CBE. AA-sort is an adap- tation of comb sort, which was originally proposed by Knuth [9] and rediscovered by Dobosiewicz [2] and Box and Lacey [1]. CellSort is an adaptation of bitonic sort (e.g., [9]). Both AA-sort and CellSort are based on sorting algorithms that are inefficient on a single processor. Hence, parallelizing these algorithms begins with a handicap relative to the fastest serial sorting algorithms–merge sort for worst-case behavior and quick sort for average behav- ior. Comb sort is known to have a worst-case complexity that is O(n2) [3]. Although the best upper bound known for its average complexity is also O(n2), experimental results indicate an average complexity of O(n log n) [1, 3]. On the other hand, the average complexity of quick sort is known to be O(n log n). Since experiments indicate that comb sort runs in about twice as much time on a single processor as does quick sort [3], attempts such as [8] to develop a fast average-case sort, for a single SPU of the CBE that begin with comb sort, are handicapped by a factor of two compared to attempts that begin with quick sort. This handicap needs to be overcome by the CBE adaptation. For integers and floats, the CBE supports 4-way parallelism within a single SPU as 4 integers (floats) may be stored in each of the SPU’s 128-bit vector registers. Hence, we expect at best two-fold speed up over a conventional implementation of quick sort. However, due to possible anomalous behavior resulting from such factors as the ∗This research was supported, in part, by the National Science Foundation under grant ITR-0326155 †The authors acknowledge Georgia Institute of Technology, its Sony-Toshiba-IBM Center of Competence, and the National Science Foundation, for the use of Cell Broadband Engine Resource that have contributed to this research 1 Figure 1: Architecture of the Cell Broadband Engine [5] absence of branch prediction, we may actually observe a greater speed up [10]. Similarly, attempts such as [4] to develop a fast worst-case sort for a single SPU starting with bitonic sort are handicapped relative to starting with 2 merge sort because the worst-case complexity of bitonic sort is O(n log n) while that of merge sort is O(n log n). As noted in [4], a logical way to develop a sorting algorithm for a heterogeneous multicore computer such as the CBE is to (1) begin with a sorting algorithm for a single SPU, then (2) using this as a core, develop a sort algorithm for the case when data fits in all available cores, then (3) use this multi-SPU algorithm to develop a sort algorithm for the case when the data to be sorted does not fit in the local stores of all available SPEs but fits in main memory. Our long-term strategy is to extend this hierarchical plan to the case where data to be sorted is so large that it is distributed over the main memories of a cluster of CBEs. An alternative strategy to sort is to use the master-slave model in which the PPU serves as the master processor and the SPUs are the slave processors. The PPU partitions the data to be sorted and sends each partition to a different SPU; the SPUs sort their partition using a single SPU sort; the PPU merges the sorted data from the PPUs so as to complete the sort of the entire data set. This strategy is used in [14] to sort on the nCube hypercube and in [13] to sort on the CBE. Regardless of whether we sort large data sets using the hierarchical strategy of [4] or the master-slave strategy of [13, 14], it is important to have a fast algorithm to sort within a single SPU. The absence of any branch prediction capability and the availability of vector instructions that support SIMD parallelism on an SPU make the development of a competitive merge sort a challenge, a challenge that we address in this paper. Our development of a competitive merge sort is specifically for the case of sorting integers and floats and makes use of the fact that 4 integers or floats may be stored in a single register of an SPU. In Section 2, we describe SIMD functions that are used in the specification of our SPU algorithms. Comb sort and its SPU adaptation AA-sort are reviewed in Section 3. In this section, we develop SPU adaptations for brick sort and shaker sort as well. Our SPU adaptation of merge sort is developed in Section 4 and experimental results comparing various SPU sorting algorithms are presented in Section 5. In the remainder of this paper, we use the term number to refer to an integer or a float. Note that 4 numbers may be stored in a 128-bit vector. 2 SPU Vector Operations We shall use several SIMD functions that operate on a vector of 4 numbers to describe the SPU adaptation of sorting algorithms. We describe these in this section. In the following, v1, v2, min, max, and temp are vectors, each comprised of 4 numbers and p, p1, and p2 are bit patterns. Function names that begin with spu are standard C/C++ Cell SPU intrinsics while those that begin with mySpu are defined by us. Our description of these functions is tailored to the sorting application of this paper. 1. spu shuffle(v1, v2,p) · · · This function returns a vector comprised of a subset of the 8 numbers in v1 and v2. The returned subset is determined by the bit pattern p. Let W , X, Y , and Z denote the 4 numbers 2 (left to right) of v1 and let A, B, C, and D denote those of v2. The bit pattern p = XCCW , for example, returns a vector comprised of the second number in v1 followed by two copies of the third number of v2 followed by the first number in v1. In the following, we assume that constant patterns such as XYZD have been pre-defined. 2. spu cmpgt(v1, v2) · · · A 128-bit vector representing the pairwise comparison of the 4 numbers of v1 with those of v2 is returned. If an element of v1 is greater than the corresponding element of v2, the corresponding 32 bits of the returned vector are 1; otherwise, these bits are 0. 3. spu add(v1, v2) · · · Returns the vector obtained by pairwise adding the numbers of v1 with the corresponding numbers of v2. 4. spu sub(v1, v2) · · · Returns the vector obtained by pairwise subtracting the numbers of v2 from the corre- sponding numbers of v1. 5. spu and(p1,p2) · · · Returns the vector obtained by pairwise anding the bits of p1 and p2. 6. mySpu not(p) · · · Returns the vector obtained by complementing each of the bits of p. Although the CBE does not have a not instruction, we can perform this operation using the nor function that is supported by the CBE and which computes the complement of the bitwise or of two vectors. It is easy to see that spu nor(p, v0) where v0 is an all zero vector, correctly computes the complement of the bits of p. 7. spu select(v1, v2,p) · · · Returns a vector whose ith bit comes from v1 (v2) when the ith bit of p is 0 (1). 8. spu slqwbyte(v1,n) · · · Returns a vector obtained by shifting the bytes of v1 m bytes to the left, where m is the number represented by the 5 least significant bits of n. The left shift is done with zero fill. So, the rightmost m bytes of the returned vector are 0. 9. spu splat(s) · · · Returns a vector comprised of 4 copies of the number s.
Recommended publications
  • Coursenotes 4 Non-Adaptive Sorting Batcher's Algorithm
    4. Non Adaptive Sorting Batcher’s Algorithm 4.1 Introduction to Batcher’s Algorithm Sorting has many important applications in daily life and in particular, computer science. Within computer science several sorting algorithms exist such as the “bubble sort,” “shell sort,” “comb sort,” “heap sort,” “bucket sort,” “merge sort,” etc. We have actually encountered these before in section 2 of the course notes. Many different sorting algorithms exist and are implemented differently in each programming language. These sorting algorithms are relevant because they are used in every single computer program you run ranging from the unimportant computer solitaire you play to the online checking account you own. Indeed without sorting algorithms, many of the services we enjoy today would simply not be available. Batcher’s algorithm is a way to sort individual disorganized elements called “keys” into some desired order using a set number of comparisons. The keys that will be sorted for our purposes will be numbers and the order that we will want them to be in will be from greatest to least or vice a versa (whichever way you want it). The Batcher algorithm is non-adaptive in that it takes a fixed set of comparisons in order to sort the unsorted keys. In other words, there is no change made in the process during the sorting. Unlike other methods of sorting where you need to remember comparisons (tournament sort), Batcher’s algorithm is useful because it requires less thinking. Non-adaptive sorting is easy because outcomes of comparisons made at one point of the process do not affect which comparisons will be made in the future.
    [Show full text]
  • An Evolutionary Approach for Sorting Algorithms
    ORIENTAL JOURNAL OF ISSN: 0974-6471 COMPUTER SCIENCE & TECHNOLOGY December 2014, An International Open Free Access, Peer Reviewed Research Journal Vol. 7, No. (3): Published By: Oriental Scientific Publishing Co., India. Pgs. 369-376 www.computerscijournal.org Root to Fruit (2): An Evolutionary Approach for Sorting Algorithms PRAMOD KADAM AND Sachin KADAM BVDU, IMED, Pune, India. (Received: November 10, 2014; Accepted: December 20, 2014) ABstract This paper continues the earlier thought of evolutionary study of sorting problem and sorting algorithms (Root to Fruit (1): An Evolutionary Study of Sorting Problem) [1]and concluded with the chronological list of early pioneers of sorting problem or algorithms. Latter in the study graphical method has been used to present an evolution of sorting problem and sorting algorithm on the time line. Key words: Evolutionary study of sorting, History of sorting Early Sorting algorithms, list of inventors for sorting. IntroDUCTION name and their contribution may skipped from the study. Therefore readers have all the rights to In spite of plentiful literature and research extent this study with the valid proofs. Ultimately in sorting algorithmic domain there is mess our objective behind this research is very much found in documentation as far as credential clear, that to provide strength to the evolutionary concern2. Perhaps this problem found due to lack study of sorting algorithms and shift towards a good of coordination and unavailability of common knowledge base to preserve work of our forebear platform or knowledge base in the same domain. for upcoming generation. Otherwise coming Evolutionary study of sorting algorithm or sorting generation could receive hardly information about problem is foundation of futuristic knowledge sorting problems and syllabi may restrict with some base for sorting problem domain1.
    [Show full text]
  • Investigating the Effect of Implementation Languages and Large Problem Sizes on the Tractability and Efficiency of Sorting Algorithms
    International Journal of Engineering Research and Technology. ISSN 0974-3154, Volume 12, Number 2 (2019), pp. 196-203 © International Research Publication House. http://www.irphouse.com Investigating the Effect of Implementation Languages and Large Problem Sizes on the Tractability and Efficiency of Sorting Algorithms Temitayo Matthew Fagbola and Surendra Colin Thakur Department of Information Technology, Durban University of Technology, Durban 4000, South Africa. ORCID: 0000-0001-6631-1002 (Temitayo Fagbola) Abstract [1],[2]. Theoretically, effective sorting of data allows for an efficient and simple searching process to take place. This is Sorting is a data structure operation involving a re-arrangement particularly important as most merge and search algorithms of an unordered set of elements with witnessed real life strictly depend on the correctness and efficiency of sorting applications for load balancing and energy conservation in algorithms [4]. It is important to note that, the rearrangement distributed, grid and cloud computing environments. However, procedure of each sorting algorithm differs and directly impacts the rearrangement procedure often used by sorting algorithms on the execution time and complexity of such algorithm for differs and significantly impacts on their computational varying problem sizes and type emanating from real-world efficiencies and tractability for varying problem sizes. situations [5]. For instance, the operational sorting procedures Currently, which combination of sorting algorithm and of Merge Sort, Quick Sort and Heap Sort follow a divide-and- implementation language is highly tractable and efficient for conquer approach characterized by key comparison, recursion solving large sized-problems remains an open challenge. In this and binary heap’ key reordering requirements, respectively [9].
    [Show full text]
  • Sorting Partnership Unless You Sign Up! Brian Curless • Homework #5 Will Be Ready After Class, Spring 2008 Due in a Week
    Announcements (5/9/08) • Project 3 is now assigned. CSE 326: Data Structures • Partnerships due by 3pm – We will not assume you are in a Sorting partnership unless you sign up! Brian Curless • Homework #5 will be ready after class, Spring 2008 due in a week. • Reading for this lecture: Chapter 7. 2 Sorting Consistent Ordering • Input – an array A of data records • The comparison function must provide a – a key value in each data record consistent ordering on the set of possible keys – You can compare any two keys and get back an – a comparison function which imposes a indication of a < b, a > b, or a = b (trichotomy) consistent ordering on the keys – The comparison functions must be consistent • Output • If compare(a,b) says a<b, then compare(b,a) must say b>a • If says a=b, then must say b=a – reorganize the elements of A such that compare(a,b) compare(b,a) • If compare(a,b) says a=b, then equals(a,b) and equals(b,a) • For any i and j, if i < j then A[i] ≤ A[j] must say a=b 3 4 Why Sort? Space • How much space does the sorting • Allows binary search of an N-element algorithm require in order to sort the array in O(log N) time collection of items? • Allows O(1) time access to kth largest – Is copying needed? element in the array for any k • In-place sorting algorithms: no copying or • Sorting algorithms are among the most at most O(1) additional temp space.
    [Show full text]
  • Sorting Algorithm 1 Sorting Algorithm
    Sorting algorithm 1 Sorting algorithm In computer science, a sorting algorithm is an algorithm that puts elements of a list in a certain order. The most-used orders are numerical order and lexicographical order. Efficient sorting is important for optimizing the use of other algorithms (such as search and merge algorithms) that require sorted lists to work correctly; it is also often useful for canonicalizing data and for producing human-readable output. More formally, the output must satisfy two conditions: 1. The output is in nondecreasing order (each element is no smaller than the previous element according to the desired total order); 2. The output is a permutation, or reordering, of the input. Since the dawn of computing, the sorting problem has attracted a great deal of research, perhaps due to the complexity of solving it efficiently despite its simple, familiar statement. For example, bubble sort was analyzed as early as 1956.[1] Although many consider it a solved problem, useful new sorting algorithms are still being invented (for example, library sort was first published in 2004). Sorting algorithms are prevalent in introductory computer science classes, where the abundance of algorithms for the problem provides a gentle introduction to a variety of core algorithm concepts, such as big O notation, divide and conquer algorithms, data structures, randomized algorithms, best, worst and average case analysis, time-space tradeoffs, and lower bounds. Classification Sorting algorithms used in computer science are often classified by: • Computational complexity (worst, average and best behaviour) of element comparisons in terms of the size of the list . For typical sorting algorithms good behavior is and bad behavior is .
    [Show full text]
  • Visvesvaraya Technological University a Project Report
    ` VISVESVARAYA TECHNOLOGICAL UNIVERSITY “Jnana Sangama”, Belagavi – 590 018 A PROJECT REPORT ON “PREDICTIVE SCHEDULING OF SORTING ALGORITHMS” Submitted in partial fulfillment for the award of the degree of BACHELOR OF ENGINEERING IN COMPUTER SCIENCE AND ENGINEERING BY RANJIT KUMAR SHA (1NH13CS092) SANDIP SHAH (1NH13CS101) SAURABH RAI (1NH13CS104) GAURAV KUMAR (1NH13CS718) Under the guidance of Ms. Sridevi (Senior Assistant Professor, Dept. of CSE, NHCE) DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING NEW HORIZON COLLEGE OF ENGINEERING (ISO-9001:2000 certified, Accredited by NAAC ‘A’, Permanently affiliated to VTU) Outer Ring Road, Panathur Post, Near Marathalli, Bangalore – 560103 ` NEW HORIZON COLLEGE OF ENGINEERING (ISO-9001:2000 certified, Accredited by NAAC ‘A’ Permanently affiliated to VTU) Outer Ring Road, Panathur Post, Near Marathalli, Bangalore-560 103 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING CERTIFICATE Certified that the project work entitled “PREDICTIVE SCHEDULING OF SORTING ALGORITHMS” carried out by RANJIT KUMAR SHA (1NH13CS092), SANDIP SHAH (1NH13CS101), SAURABH RAI (1NH13CS104) and GAURAV KUMAR (1NH13CS718) bonafide students of NEW HORIZON COLLEGE OF ENGINEERING in partial fulfillment for the award of Bachelor of Engineering in Computer Science and Engineering of the Visvesvaraya Technological University, Belagavi during the year 2016-2017. It is certified that all corrections/suggestions indicated for Internal Assessment have been incorporated in the report deposited in the department library. The project report has been approved as it satisfies the academic requirements in respect of Project work prescribed for the Degree. Name & Signature of Guide Name Signature of HOD Signature of Principal (Ms. Sridevi) (Dr. Prashanth C.S.R.) (Dr. Manjunatha) External Viva Name of Examiner Signature with date 1.
    [Show full text]
  • Evaluation of Sorting Algorithms, Mathematical and Empirical Analysis of Sorting Algorithms
    International Journal of Scientific & Engineering Research Volume 8, Issue 5, May-2017 86 ISSN 2229-5518 Evaluation of Sorting Algorithms, Mathematical and Empirical Analysis of sorting Algorithms Sapram Choudaiah P Chandu Chowdary M Kavitha ABSTRACT:Sorting is an important data structure in many real life applications. A number of sorting algorithms are in existence till date. This paper continues the earlier thought of evolutionary study of sorting problem and sorting algorithms concluded with the chronological list of early pioneers of sorting problem or algorithms. Latter in the study graphical method has been used to present an evolution of sorting problem and sorting algorithm on the time line. An extensive analysis has been done compared with the traditional mathematical methods of ―Bubble Sort, Selection Sort, Insertion Sort, Merge Sort, Quick Sort. Observations have been obtained on comparing with the existing approaches of All Sorts. An “Empirical Analysis” consists of rigorous complexity analysis by various sorting algorithms, in which comparison and real swapping of all the variables are calculatedAll algorithms were tested on random data of various ranges from small to large. It is an attempt to compare the performance of various sorting algorithm, with the aim of comparing their speed when sorting an integer inputs.The empirical data obtained by using the program reveals that Quick sort algorithm is fastest and Bubble sort is slowest. Keywords: Bubble Sort, Insertion sort, Quick Sort, Merge Sort, Selection Sort, Heap Sort,CPU Time. Introduction In spite of plentiful literature and research in more dimension to student for thinking4. Whereas, sorting algorithmic domain there is mess found in this thinking become a mark of respect to all our documentation as far as credential concern2.
    [Show full text]
  • Sorting Algorithm 1 Sorting Algorithm
    Sorting algorithm 1 Sorting algorithm A sorting algorithm is an algorithm that puts elements of a list in a certain order. The most-used orders are numerical order and lexicographical order. Efficient sorting is important for optimizing the use of other algorithms (such as search and merge algorithms) which require input data to be in sorted lists; it is also often useful for canonicalizing data and for producing human-readable output. More formally, the output must satisfy two conditions: 1. The output is in nondecreasing order (each element is no smaller than the previous element according to the desired total order); 2. The output is a permutation (reordering) of the input. Since the dawn of computing, the sorting problem has attracted a great deal of research, perhaps due to the complexity of solving it efficiently despite its simple, familiar statement. For example, bubble sort was analyzed as early as 1956.[1] Although many consider it a solved problem, useful new sorting algorithms are still being invented (for example, library sort was first published in 2006). Sorting algorithms are prevalent in introductory computer science classes, where the abundance of algorithms for the problem provides a gentle introduction to a variety of core algorithm concepts, such as big O notation, divide and conquer algorithms, data structures, randomized algorithms, best, worst and average case analysis, time-space tradeoffs, and upper and lower bounds. Classification Sorting algorithms are often classified by: • Computational complexity (worst, average and best behavior) of element comparisons in terms of the size of the list (n). For typical serial sorting algorithms good behavior is O(n log n), with parallel sort in O(log2 n), and bad behavior is O(n2).
    [Show full text]
  • Comparison Sorts Name Best Average Worst Memory Stable Method Other Notes Quicksort Is Usually Done in Place with O(Log N) Stack Space
    Comparison sorts Name Best Average Worst Memory Stable Method Other notes Quicksort is usually done in place with O(log n) stack space. Most implementations on typical in- are unstable, as stable average, worst place sort in-place partitioning is case is ; is not more complex. Naïve Quicksort Sedgewick stable; Partitioning variants use an O(n) variation is stable space array to store the worst versions partition. Quicksort case exist variant using three-way (fat) partitioning takes O(n) comparisons when sorting an array of equal keys. Highly parallelizable (up to O(log n) using the Three Hungarian's Algorithmor, more Merge sort worst case Yes Merging practically, Cole's parallel merge sort) for processing large amounts of data. Can be implemented as In-place merge sort — — Yes Merging a stable sort based on stable in-place merging. Heapsort No Selection O(n + d) where d is the Insertion sort Yes Insertion number of inversions. Introsort No Partitioning Used in several STL Comparison sorts Name Best Average Worst Memory Stable Method Other notes & Selection implementations. Stable with O(n) extra Selection sort No Selection space, for example using lists. Makes n comparisons Insertion & Timsort Yes when the data is already Merging sorted or reverse sorted. Makes n comparisons Cubesort Yes Insertion when the data is already sorted or reverse sorted. Small code size, no use Depends on gap of call stack, reasonably sequence; fast, useful where Shell sort or best known is No Insertion memory is at a premium such as embedded and older mainframe applications. Bubble sort Yes Exchanging Tiny code size.
    [Show full text]
  • A Comparative Study of Sorting Algorithms Comb, Cocktail and Counting Sorting
    International Research Journal of Engineering and Technology (IRJET) e‐ISSN: 2395 ‐0056 Volume: 04 Issue: 01 | Jan ‐2017 www.irjet.net p‐ISSN: 2395‐0072 A comparative Study of Sorting Algorithms Comb, Cocktail and Counting Sorting Ahmad H. Elkahlout1, Ashraf Y. A. Maghari2. 1 Faculty of Information Technology, Islamic University of Gaza ‐ Palestine 2 Faculty of Information Technology, Islamic University of Gaza ‐ Palestine ‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐***‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐ Abstract –The sorting algorithms problem is probably one java programming language, in which every algorithm sort of the most famous problems that used in abroad variety of the same set of random numeric data. Then the execution time during the sorting process is measured for each application. There are many techniques to solve the sorting algorithm. problem. In this paper, we conduct a comparative study to evaluate the performance of three algorithms; comb, cocktail The rest part of this paper is organized as follows. Section 2 and count sorting algorithms in terms of execution time. Java gives a background of the three algorithms; comb, count, and programing is used to implement the algorithms using cocktail algorithms. In section 3, related work is presented. numeric data on the same platform conditions. Among the Section4 comparison between the algorithms; section 5 three algorithms, we found out that the cocktail algorithm has discusses the results obtained from the evaluation of sorting the shortest execution time; while counting sort comes in the techniques considered. The paper is concluded in section 6. second order. Furthermore, Comb comes in the last order in term of execution time. Future effort will investigate the 2.
    [Show full text]
  • Register Level Sort Algorithm on Multi-Core SIMD Processors
    Register Level Sort Algorithm on Multi-Core SIMD Processors Tian Xiaochen, Kamil Rocki and Reiji Suda Graduate School of Information Science and Technology The University of Tokyo & CREST, JST {xchen, kamil.rocki, reiji}@is.s.u-tokyo.ac.jp ABSTRACT simultaneously. GPU employs a similar architecture: single- State-of-the-art hardware increasingly utilizes SIMD paral- instruction-multiple-threads(SIMT). On K20, each SMX has lelism, where multiple processing elements execute the same 192 CUDA cores which act as individual threads. Even a instruction on multiple data points simultaneously. How- desktop type multi-core x86 platform such as i7 Haswell ever, irregular and data intensive algorithms are not well CPU family supports AVX2 instruction set (256 bits). De- suited for such architectures. Due to their importance, it veloping algorithms that support multi-core SIMD architec- is crucial to obtain efficient implementations. One example ture is the precondition to unleash the performance of these of such a task is sort, a fundamental problem in computer processors. Without exploiting this kind of parallelism, only science. In this paper we analyze distinct memory accessing a small fraction of computational power can be utilized. models and propose two methods to employ highly efficient Parallel sort as well as sort in general is fundamental and bitonic merge sort using SIMD instructions as register level well studied problem in computer science. Algorithms such sort. We achieve nearly 270x speedup (525M integers/s) on as Bitonic-Merge Sort or Odd Even Merge Sort are widely a 4M integer set using Xeon Phi coprocessor, where SIMD used in practice.
    [Show full text]
  • Parallel Implementation of Sorting Algorithms
    IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012 ISSN (Online): 1694-0814 www.IJCSI.org 164 Parallel Implementation of Sorting Algorithms Malika Dawra1, and Priti2 1Department of Computer Science & Applications M.D. University, Rohtak-124001, Haryana, India 2Department of Computer Science & Applications M.D. University, Rohtak-124001, Haryana, India Abstract Usage of memory and other computer resources is also a factor in classifying the A sorting algorithm is an efficient algorithm, which sorting algorithms. perform an important task that puts elements of a list in a certain order or arrange a collection of elements into a Recursion: some algorithms are either particular order. Sorting problem has attracted a great recursive or non recursive, while others deal of research because efficient sorting is important to may be both. optimize the use of other algorithms such as binary search. This paper presents a new algorithm that will Whether or not they are a comparison sort split large array in sub parts and then all sub parts are examines the data only by comparing two processed in parallel using existing sorting algorithms elements with a comparison operator. and finally outcome would be merged. To process sub There are several sorting algorithms available to parts in parallel multithreading has been introduced. sort the elements of an array. Some of the sorting algorithms are: Keywords: sorting, algorithm, splitting, merging, multithreading. Bubble sort 1. Introduction In the bubble sort, as elements are sorted they gradually “bubble” (or rise) to their proper location Sorting is the process of putting data in order; either in the array.
    [Show full text]