Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture

Total Page:16

File Type:pdf, Size:1020Kb

Jatin Chhugani†, William Macy†, Akram Baransi⋆, Anthony D. Nguyen†, Mostafa Hagog⋆, Sanjeev Kumar†, Victor W. Lee†, Yen-Kuang Chen†, Pradeep Dubey†

Contact: [email protected]

† Applications Research Lab, Corporate Technology Group, Intel Corporation
⋆ Microprocessor Architecture - Israel, Mobility Group, Intel Corporation

ABSTRACT

Sorting a list of input numbers is one of the most fundamental problems in the field of computer science in general and high-throughput database applications in particular. Although literature abounds with various flavors of sorting algorithms, different architectures call for customized implementations to achieve faster sorting times.

This paper presents an efficient implementation and detailed analysis of MergeSort on current CPU architectures. Our SIMD implementation with 128-bit SSE is 3.3X faster than the scalar version. In addition, our algorithm performs an efficient multiway merge, and is not constrained by the memory bandwidth. Our multi-threaded, SIMD implementation sorts 64 million floating point numbers in less than 0.5 seconds on a commodity 4-core Intel processor. This measured performance compares favorably with all previously published results.

Additionally, the paper demonstrates performance scalability of the proposed sorting algorithm with respect to certain salient architectural features of modern chip multiprocessor (CMP) architectures, including SIMD width and core-count. Based on our analytical models of various architectural configurations, we see excellent scalability of our implementation with SIMD width scaling up to 16X wider than current SSE width of 128-bits, and CMP core-count scaling well beyond 32 cores. Cycle-accurate simulation of Intel's upcoming x86 many-core Larrabee architecture confirms scalability of our proposed algorithm.

Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for components of this work owned by others than VLDB Endowment must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists requires prior specific permission and/or a fee. Request permission to republish from: Publications Dept., ACM, Inc. Fax +1 (212)869-0481 or [email protected].
PVLDB '08, August 23-28, 2008, Auckland, New Zealand
Copyright 2008 VLDB Endowment, ACM 978-1-60558-306-8/08/08

1. INTRODUCTION

Sorting is used by numerous computer applications [15]. It is an internal database operation used by SQL operations, and hence all applications using a database can take advantage of an efficient sorting algorithm [4]. Sorting not only orders data, it is also used for other database operations such as creation of indices and binary searches. Sorting facilitates statistics related applications including finding closest pair, determining an element's uniqueness, finding kth largest element, and identifying outliers. Sorting is used to find the convex hull, an important algorithm used in computational geometry. Other applications that use sorting include computer graphics, computational biology, supply chain management and data compression.

Modern shared-memory computer architectures with multiple cores and SIMD instructions can perform high performance sorting, which was formerly possible only on message-passing machines, vector supercomputers, and clusters [23]. However, efficient implementation of sorting on the latest processors depends heavily on careful tuning of the algorithm and the code. First, although SIMD has been shown as an efficient way to achieve good power/performance, it restricts how the operations should be performed. For instance, conditional execution is often not efficient. Also, re-arranging the data is very expensive. Second, although multiple processing cores integrated on a single silicon chip allow one program with multiple threads to run side-by-side, we must be very careful about data partitioning so that multiple threads can work effectively together. This is because multiple threads may compete for the shared cache and memory bandwidth. Moreover, due to partitioning, the starting elements for different threads may not align with the cacheline boundary.

In the future, microprocessor architectures will include many cores. For example, there are commercial products that have 8 cores on the same chip [10], and there is a research prototype with 80 cores [24]. This is because multiple cores increase computation capability with a manageable thermal/power envelope. In this paper, we will examine how various architecture parameters and multiple cores affect the tuning of a sort implementation.

The contributions of this paper are as follows:
• We present the fastest sorting performance for modern computer architectures.
• Our multiway merge implementation enables the running time to be independent of the memory bandwidth.
• Our algorithm also avoids the expensive unaligned load/store operations.
• We provide analytical models of MergeSort that accurately track empirical data.
• We examine how various architectural parameters affect the Mergesort algorithm. This provides insights into optimizing this sorting algorithm for future architectures.
• We compare our performance with the most efficient algorithms on different platforms, including Cell and GPUs.

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 discusses key architectural parameters in modern architectures that affect Mergesort performance. Section 4 details the algorithm and implementation of our Mergesort. Section 5 provides an analytical framework to analyze Mergesort performance. Section 6 presents the results. Section 7 concludes.

2. RELATED WORK

Over the past few decades, large number of sorting algorithms have been proposed [13, 15]. In this section, we focus mainly on the algorithms that exploit the SIMD and multi-core capability of modern processors, including GPUs, Cell, and others.

Quicksort has been one of the fastest algorithms used in practice. However, its efficient implementation for exploiting SIMD is not known. In contrast, Bitonic sort [1] is implemented using a sorting network that sets all comparisons in advance without unpredictable branches and permits multiple comparisons in the same cycle. These two characteristics make it well suited for SIMD processors. Radix Sort [13] can also be SIMDfied effectively, but its performance depends on the support for handling simultaneous updates to a memory location within the same SIMD register.

As far as utilizing multiple cores is concerned, there exist algorithms [16] that can scale well with increasing number of cores. Parikh et al. [17] propose a load-balanced scheme for parallelizing quicksort using the hyperthreading technology available on modern processors. An intelligent scheme for splitting the list is proposed that works well in practice.

[...] the local memory. AA-sort [11], implemented on a PowerPC 970MP, proposed a multi-core SIMD algorithm based on comb sort [14] and mergesort. During the first phase of the algorithm, each thread sorts the data assigned to it using comb sort based algorithm, and during the second phase, the sorted lists are merged using an odd-even merge network. Our implementation is based on mergesort for sorting the complete list.

3. ARCHITECTURE SPECIFICATION

Since the inception of the microprocessor, its performance has been steadily improving. Advances in semiconductor manufacturing capability is one reason for this phenomenal change. Advances in computer architecture is another reason. Examples of architectural features that brought substantial performance improvement include instruction-level parallelism (ILP, e.g., pipelining, out-of-order super-scalar execution, excellent branch prediction), data-level parallelism (DLP, e.g., SIMD, vector computation), thread-level parallelism (TLP, e.g., simultaneous multi-threading, and multi-core), and memory-level parallelism (MLP, e.g., hardware prefetcher). In this section, we examine how ILP, DLP, TLP, and MLP impact the performance of sorting algorithm.

3.1 ILP

First, modern processors with super-scalar architecture can execute multiple instructions simultaneously on different functional units. For example, on Intel Core 2 Duo processors, we can execute a min/max and a shuffle instruction on two separate units simultaneously [3].

Second, modern processors with pipelining can issue a new instruction to the corresponding functional unit per cycle. The most-frequently-used instructions in our Mergesort implementation, Min, Max and Shuffle, have one-cycle throughput on Intel Core 2 Duo processors. Nonetheless, not all the instructions have
Recommended publications
  • 2.5 Classification of Parallel Computers
    2.5 Classification of Parallel Computers
    2.5.1 Granularity
    In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronization events.
    • fine level (same operations with different data)
      ◦ vector processors
      ◦ instruction level parallelism
      ◦ fine-grain parallelism:
        – Relatively small amounts of computational work are done between communication events
        – Low computation to communication ratio
        – Facilitates load balancing
        – Implies high communication overhead and less opportunity for performance enhancement
        – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.
    • operation level (different operations simultaneously)
    • problem level (independent subtasks)
      ◦ coarse-grain parallelism:
        – Relatively large amounts of computational work are done between communication/synchronization events
        – High computation to communication ratio
        – Implies more opportunity for performance increase
        – Harder to load balance efficiently
    2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1)
    With N elements in the pipeline and L clock cycles per element, the calculation takes L + N cycles; without a pipeline it takes L * N cycles.
    Example of good code for pipelining:
      do i = 1, k
        z(i) = x(i) + y(i)
      end do
    Vector processors: fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but, e.g., recursion is hard to optimise for vector processors.
    Example: Intel MMX, a simple vector processor.
  • Coursenotes 4 Non-Adaptive Sorting Batcher's Algorithm
    4. Non Adaptive Sorting: Batcher’s Algorithm
    4.1 Introduction to Batcher’s Algorithm
    Sorting has many important applications in daily life and in particular, computer science. Within computer science several sorting algorithms exist such as the “bubble sort,” “shell sort,” “comb sort,” “heap sort,” “bucket sort,” “merge sort,” etc. We have actually encountered these before in section 2 of the course notes. Many different sorting algorithms exist and are implemented differently in each programming language. These sorting algorithms are relevant because they are used in every single computer program you run ranging from the unimportant computer solitaire you play to the online checking account you own. Indeed without sorting algorithms, many of the services we enjoy today would simply not be available. Batcher’s algorithm is a way to sort individual disorganized elements called “keys” into some desired order using a set number of comparisons. The keys that will be sorted for our purposes will be numbers and the order that we will want them to be in will be from greatest to least or vice versa (whichever way you want it). The Batcher algorithm is non-adaptive in that it takes a fixed set of comparisons in order to sort the unsorted keys. In other words, there is no change made in the process during the sorting. Unlike other methods of sorting where you need to remember comparisons (tournament sort), Batcher’s algorithm is useful because it requires less thinking. Non-adaptive sorting is easy because outcomes of comparisons made at one point of the process do not affect which comparisons will be made in the future.
  • SIMD Extensions
    SIMD Extensions
    PDF generated at: Sat, 12 May 2012 17:14:46 UTC
    Contents
    Articles: SIMD; MMX (instruction set); 3DNow!; Streaming SIMD Extensions; SSE2; SSE3; SSSE3; SSE4; SSE5; Advanced Vector Extensions; CVT16 instruction set; XOP instruction set
    References: Article Sources and Contributors; Image Sources, Licenses and Contributors
    Article Licenses: License
    SIMD
    Flynn's taxonomy:
                      | Single instruction | Multiple instruction
      Single data     | SISD               | MISD
      Multiple data   | SIMD               | MIMD
    Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism.
    History
    The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel.
  • An Evolutionary Approach for Sorting Algorithms
    ORIENTAL JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY, ISSN: 0974-6471. An International Open Free Access, Peer Reviewed Research Journal. Published By: Oriental Scientific Publishing Co., India. www.computerscijournal.org. December 2014, Vol. 7, No. (3): Pgs. 369-376.
    Root to Fruit (2): An Evolutionary Approach for Sorting Algorithms
    PRAMOD KADAM and SACHIN KADAM, BVDU, IMED, Pune, India. (Received: November 10, 2014; Accepted: December 20, 2014)
    ABSTRACT
    This paper continues the earlier evolutionary study of the sorting problem and sorting algorithms (Root to Fruit (1): An Evolutionary Study of Sorting Problem) [1] and concludes with a chronological list of the early pioneers of the sorting problem and its algorithms. Later in the study, a graphical method is used to present the evolution of the sorting problem and sorting algorithms on a timeline.
    Key words: Evolutionary study of sorting, History of sorting, Early sorting algorithms, List of inventors for sorting.
    INTRODUCTION
    In spite of plentiful literature and research in the sorting algorithm domain, the documentation is in a mess as far as credit is concerned [2]. Perhaps this problem arises from a lack of coordination and the unavailability of a common platform or knowledge base in the domain. An evolutionary study of sorting algorithms or the sorting problem is the foundation of a future knowledge base for the sorting problem domain [1].
    [...] name and their contribution may be skipped from the study. Therefore readers have every right to extend this study with valid proofs. Ultimately, our objective behind this research is very clear: to strengthen the evolutionary study of sorting algorithms and move towards a good knowledge base that preserves the work of our forebears for the upcoming generation. Otherwise the coming generation could receive hardly any information about sorting problems, and syllabi may be restricted to some
  • An Introduction to Gpus, CUDA and Opencl
    An Introduction to GPUs, CUDA and OpenCL
    Bryan Catanzaro, NVIDIA Research
    Overview
    • Heterogeneous parallel computing
    • The CUDA and OpenCL programming models
    • Writing efficient CUDA code
    • Thrust: making CUDA C++ productive
    Heterogeneous Parallel Computing
    • Latency-Optimized CPU: Fast Serial Processing
    • Throughput-Optimized GPU: Scalable Parallel Processing
    Why do we need heterogeneity?
    • Why not just use latency optimized processors?
      – Once you decide to go parallel, why not go all the way
      – And reap more benefits
    • For many applications, throughput optimized processors are more efficient: faster and use less power
      – Advantages can be fairly significant
    Why Heterogeneity?
    • Different goals produce different designs
      – Throughput optimized: assume work load is highly parallel
      – Latency optimized: assume work load is mostly sequential
    • To minimize latency experienced by 1 thread:
      – lots of big on-chip caches
      – sophisticated control
    • To maximize throughput of all threads:
      – multithreading can hide latency … so skip the big caches
      – simpler control, cost amortized over ALUs via SIMD
    Latency vs. Throughput
      Specifications                 | Westmere-EP (32nm)                          | Fermi (Tesla C2050)
      Processing Elements            | 6 cores, 2 issue, 4 way SIMD @3.46 GHz      | 14 SMs, 2 issue, 16 way SIMD @1.15 GHz
      Resident Strands/Threads (max) | 6 cores, 2 threads, 4 way SIMD: 48 strands  | 14 SMs, 48 SIMD vectors, 32 way SIMD: 21504 threads
      SP GFLOP/s                     | 166                                         | 1030
      Memory Bandwidth               | 32 GB/s                                     | 144 GB/s
      Register File                  | ~6 kB                                       | 1.75 MB
      Local Store/L1 Cache           | 192 kB                                      | 896 kB
      L2 Cache                       | 1.5 MB                                      | 0.75 MB
  • Cuda C Programming Guide
    CUDA C PROGRAMMING GUIDE
    PG-02829-001_v10.0 | October 2018
    Design Guide
    CHANGES FROM VERSION 9.0
    ‣ Documented restriction that operator-overloads cannot be __global__ functions in Operator Function.
    ‣ Removed guidance to break 8-byte shuffles into two 4-byte instructions. 8-byte shuffle variants are provided since CUDA 9.0. See Warp Shuffle Functions.
    ‣ Passing __restrict__ references to __global__ functions is now supported. Updated comment in __global__ functions and function templates.
    ‣ Documented CUDA_ENABLE_CRC_CHECK in CUDA Environment Variables.
    ‣ Warp matrix functions now support matrix products with m=32, n=8, k=16 and m=8, n=32, k=16 in addition to m=n=k=16.
    TABLE OF CONTENTS
    Chapter 1. Introduction
      1.1. From Graphics Processing to General Purpose Parallel Computing
      1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
      1.3. A Scalable Programming Model
      1.4. Document Structure
    Chapter 2. Programming Model
      2.1. Kernels
      2.2. Thread Hierarchy
  • Investigating the Effect of Implementation Languages and Large Problem Sizes on the Tractability and Efficiency of Sorting Algorithms
    International Journal of Engineering Research and Technology. ISSN 0974-3154, Volume 12, Number 2 (2019), pp. 196-203 © International Research Publication House. http://www.irphouse.com Investigating the Effect of Implementation Languages and Large Problem Sizes on the Tractability and Efficiency of Sorting Algorithms Temitayo Matthew Fagbola and Surendra Colin Thakur Department of Information Technology, Durban University of Technology, Durban 4000, South Africa. ORCID: 0000-0001-6631-1002 (Temitayo Fagbola) Abstract [1],[2]. Theoretically, effective sorting of data allows for an efficient and simple searching process to take place. This is Sorting is a data structure operation involving a re-arrangement particularly important as most merge and search algorithms of an unordered set of elements with witnessed real life strictly depend on the correctness and efficiency of sorting applications for load balancing and energy conservation in algorithms [4]. It is important to note that, the rearrangement distributed, grid and cloud computing environments. However, procedure of each sorting algorithm differs and directly impacts the rearrangement procedure often used by sorting algorithms on the execution time and complexity of such algorithm for differs and significantly impacts on their computational varying problem sizes and type emanating from real-world efficiencies and tractability for varying problem sizes. situations [5]. For instance, the operational sorting procedures Currently, which combination of sorting algorithm and of Merge Sort, Quick Sort and Heap Sort follow a divide-and- implementation language is highly tractable and efficient for conquer approach characterized by key comparison, recursion solving large sized-problems remains an open challenge. In this and binary heap’ key reordering requirements, respectively [9].
  • Sorting Partnership Unless You Sign Up! Brian Curless • Homework #5 Will Be Ready After Class, Spring 2008 Due in a Week
    CSE 326: Data Structures
    Sorting
    Brian Curless, Spring 2008
    Announcements (5/9/08)
    • Project 3 is now assigned.
    • Partnerships due by 3pm
      – We will not assume you are in a partnership unless you sign up!
    • Homework #5 will be ready after class, due in a week.
    • Reading for this lecture: Chapter 7.
    Sorting
    • Input
      – an array A of data records
      – a key value in each data record
      – a comparison function which imposes a consistent ordering on the keys
    • Output
      – reorganize the elements of A such that
        • For any i and j, if i < j then A[i] ≤ A[j]
    Consistent Ordering
    • The comparison function must provide a consistent ordering on the set of possible keys
      – You can compare any two keys and get back an indication of a < b, a > b, or a = b (trichotomy)
      – The comparison functions must be consistent
        • If compare(a,b) says a<b, then compare(b,a) must say b>a
        • If compare(a,b) says a=b, then compare(b,a) must say b=a
        • If compare(a,b) says a=b, then equals(a,b) and equals(b,a) must say a=b
    Why Sort?
    • Allows binary search of an N-element array in O(log N) time
    • Allows O(1) time access to kth largest element in the array for any k
    • Sorting algorithms are among the most
    Space
    • How much space does the sorting algorithm require in order to sort the collection of items?
      – Is copying needed?
    • In-place sorting algorithms: no copying or at most O(1) additional temp space.
  • Threading SIMD and MIMD in the Multicore Context the Ultrasparc T2
    COMP8320 Lecture 2: Multicore Architecture and the T2 (2011)
    Overview
    ● (note: Tute 02 this Weds - handouts)
    ● multicore architecture concepts
      ■ hardware threading
      ■ SIMD vs MIMD in the multicore context
    ● T2: design features for multicore
      ■ system on a chip
      ■ execution: (in-order) pipeline, instruction latency
      ■ thread scheduling
      ■ caches: associativity, coherence, prefetch
      ■ memory system: crossbar, memory controller
      ■ intermission
      ■ speculation; power savings
      ■ OpenSPARC
    ● T2 performance (why the T2 is designed as it is)
    ● the Rock processor (slides by Andrew Over; ref: Tremblay, IEEE Micro 2009)
    SIMD and MIMD in the Multicore Context
    ● Flynn’s Taxonomy
                      | Single Instruction | Multiple Instruction
      Single Data     | SISD               | MISD
      Multiple Data   | SIMD               | MIMD
    ● for SIMD, the control unit and processor state (registers) can be shared
    ● however, SIMD is limited to data parallelism (through multiple ALUs)
      ■ algorithms need a regular structure, e.g. dense linear algebra, graphics
      ■ SSE2, Altivec, Cell SPE (128-bit registers); e.g. 4×32-bit add
          Rx: x3 x2 x1 x0
        + Ry: y3 y2 y1 y0
        = Rz: z3 z2 z1 z0   (zi = xi + yi)
      ■ design requires massive effort; requires support from a commodity environment
      ■ massive parallelism (e.g. nVidia GPGPU) but memory is still a bottleneck
    ● multicore (CMT) is MIMD; hardware threading can be regarded as MIMD
      ■ higher hardware costs also includes larger shared resources (caches, TLBs) needed ⇒ less parallelism than for SIMD
    Hardware (Multi)threading
    ● recall concurrent execution on a single CPU: switch between threads (or processes) requires the saving (in memory) of thread state (register values)
      ■ motivation: utilize CPU better when thread stalled for I/O (6300 Lect O1, p9–10)
      ■ what are the costs? do the same for smaller stalls? (e.g.
    The UltraSPARC T2: System on a Chip
    ● OpenSparc Slide Cast Ch 5: p79–81,89
    ● aggressively multicore: 8 cores, each with 8-way hardware threading (64 virtual CPUs)
  • Sorting Algorithm 1 Sorting Algorithm
    Sorting algorithm
    In computer science, a sorting algorithm is an algorithm that puts elements of a list in a certain order. The most-used orders are numerical order and lexicographical order. Efficient sorting is important for optimizing the use of other algorithms (such as search and merge algorithms) that require sorted lists to work correctly; it is also often useful for canonicalizing data and for producing human-readable output. More formally, the output must satisfy two conditions:
    1. The output is in nondecreasing order (each element is no smaller than the previous element according to the desired total order);
    2. The output is a permutation, or reordering, of the input.
    Since the dawn of computing, the sorting problem has attracted a great deal of research, perhaps due to the complexity of solving it efficiently despite its simple, familiar statement. For example, bubble sort was analyzed as early as 1956.[1] Although many consider it a solved problem, useful new sorting algorithms are still being invented (for example, library sort was first published in 2004). Sorting algorithms are prevalent in introductory computer science classes, where the abundance of algorithms for the problem provides a gentle introduction to a variety of core algorithm concepts, such as big O notation, divide and conquer algorithms, data structures, randomized algorithms, best, worst and average case analysis, time-space tradeoffs, and lower bounds.
    Classification
    Sorting algorithms used in computer science are often classified by:
    • Computational complexity (worst, average and best behaviour) of element comparisons in terms of the size of the list (n). For typical sorting algorithms good behavior is O(n log n) and bad behavior is O(n²).
  • Visvesvaraya Technological University a Project Report
    VISVESVARAYA TECHNOLOGICAL UNIVERSITY, “Jnana Sangama”, Belagavi – 590 018
    A PROJECT REPORT ON “PREDICTIVE SCHEDULING OF SORTING ALGORITHMS”
    Submitted in partial fulfillment for the award of the degree of BACHELOR OF ENGINEERING IN COMPUTER SCIENCE AND ENGINEERING
    BY RANJIT KUMAR SHA (1NH13CS092), SANDIP SHAH (1NH13CS101), SAURABH RAI (1NH13CS104), GAURAV KUMAR (1NH13CS718)
    Under the guidance of Ms. Sridevi (Senior Assistant Professor, Dept. of CSE, NHCE)
    DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NEW HORIZON COLLEGE OF ENGINEERING (ISO-9001:2000 certified, Accredited by NAAC ‘A’, Permanently affiliated to VTU), Outer Ring Road, Panathur Post, Near Marathalli, Bangalore – 560103
    CERTIFICATE
    Certified that the project work entitled “PREDICTIVE SCHEDULING OF SORTING ALGORITHMS” carried out by RANJIT KUMAR SHA (1NH13CS092), SANDIP SHAH (1NH13CS101), SAURABH RAI (1NH13CS104) and GAURAV KUMAR (1NH13CS718), bonafide students of NEW HORIZON COLLEGE OF ENGINEERING, in partial fulfillment for the award of Bachelor of Engineering in Computer Science and Engineering of the Visvesvaraya Technological University, Belagavi during the year 2016-2017. It is certified that all corrections/suggestions indicated for Internal Assessment have been incorporated in the report deposited in the department library. The project report has been approved as it satisfies the academic requirements in respect of Project work prescribed for the Degree.
    Name & Signature of Guide (Ms. Sridevi); Name & Signature of HOD (Dr. Prashanth C.S.R.); Signature of Principal (Dr. Manjunatha)
    External Viva: Name of Examiner, Signature with date 1.
  • Fast Parallel GPU-Sorting Using a Hybrid Algorithm
    Fast Parallel GPU-Sorting Using a Hybrid Algorithm
    Erik Sintorn, Department of Computer Science and Engineering, Chalmers University Of Technology, Gothenburg, Sweden. Email: [email protected]
    Ulf Assarsson, Department of Computer Science and Engineering, Chalmers University Of Technology, Gothenburg, Sweden. Email: uffe at chalmers dot se
    Abstract: This paper presents an algorithm for fast sorting of large lists using modern GPUs. The method achieves high speed by efficiently utilizing the parallelism of the GPU throughout the whole algorithm. Initially, a parallel bucketsort splits the list into enough sublists then to be sorted in parallel using merge-sort. The parallel bucketsort, implemented in NVIDIA’s CUDA, utilizes the synchronization mechanisms, such as atomic increment, that is available on modern GPUs. The mergesort requires scattered writing, which is exposed by CUDA and ATI’s Data Parallel Virtual Machine[1]. For lists with more than 512k elements, the algorithm performs better than the bitonic sort algorithms, which have been considered to be the fastest for GPU sorting, and is more than twice as fast for 8M elements.
    [...] that supports scattered writing. The GPU-sorting algorithms are highly bandwidth-limited, which is illustrated for instance by the fact that sorting of 8-bit values [10] are nearly four times faster than for 32-bit values [2]. To improve the speed of memory reads, we therefore design a vector-based mergesort, using CUDA and log n render passes, to work on four 32-bit floats simultaneously, resulting in a nearly 4 times speed improvement compared to merge-sorting on single floats. The Vector-Mergesort of two four-float vectors is achieved by using a custom designed parallel compare-and-swap algorithm, on
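
The two-phase structure described in the last entry above (a bucket split followed by an independent merge sort of each sublist) can be sketched sequentially. This is only an illustration under stated assumptions, not the cited GPU implementation: the names `hybrid_sort` and `merge_sort` are hypothetical, the pivots are simply spaced evenly between the minimum and maximum, and the buckets are sorted one after another rather than in parallel.

```python
from bisect import bisect_right
from heapq import merge


def merge_sort(values):
    """Plain top-down merge sort; heapq.merge combines two sorted halves."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    return list(merge(merge_sort(values[:mid]), merge_sort(values[mid:])))


def hybrid_sort(values, num_buckets=4):
    """Split values into range buckets, sort each bucket, then concatenate."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        return list(values)
    # Assumption for this sketch: evenly spaced pivots between min and max.
    step = (hi - lo) / num_buckets
    pivots = [lo + step * i for i in range(1, num_buckets)]
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        # bisect_right yields an index in 0..num_buckets-1, so every value
        # in bucket i is smaller than every value in bucket i + 1.
        buckets[bisect_right(pivots, v)].append(v)
    result = []
    for bucket in buckets:
        # Each bucket is independent, so this loop is the part that a
        # parallel implementation would distribute across threads.
        result.extend(merge_sort(bucket))
    return result


print(hybrid_sort([5.0, 1.5, 9.25, 3.0, 7.5, 0.5, 3.0]))
```

Because the buckets cover disjoint value ranges, no final merge across buckets is needed; concatenating the sorted buckets already gives the sorted list. Evenly spaced pivots balance poorly on skewed data, which is why a practical version would choose pivots from the data distribution.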