Extending the BSP Model for Multi-Core and Out-Of-Core Computing: MBSP
Total pages: 16. File type: PDF, size: 1020 KB.
Recommended publications
Parallelizing the Data Cube
PARALLELIZING THE DATA CUBE. By Todd Eavis. Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Dalhousie University, Halifax, Nova Scotia, June 27, 2003. © Copyright by Todd Eavis, 2003. Dalhousie University, Department of Computer Science. The undersigned hereby certify that they have read and recommend to the Faculty of Graduate Studies for acceptance a thesis entitled "Parallelizing the Data Cube" by Todd Eavis in partial fulfillment of the requirements for the degree of Doctor of Philosophy. External examiner: Virendra Bhavsar. Research supervisor: Andrew Rau-Chaplin. Examining committee: Qigang Gao, Evangelos Milios. Degree: Ph.D. Convocation: October 2003. Dedication: To the two women in my life: Amber and Bailey.
Improving the Performance of Bubble Sort Using a Modified Diminishing Increment Sorting
Scientific Research and Essay Vol. 4 (8), pp. 740-744, August 2009. Available online at http://www.academicjournals.org/SRE. ISSN 1992-2248 © 2009 Academic Journals. Full Length Research Paper.

Improving the performance of bubble sort using a modified diminishing increment sorting. Oyelami Olufemi Moses, Department of Computer and Information Sciences, Covenant University, P.M.B. 1023, Ota, Ogun State, Nigeria. Accepted 17 February, 2009.

Sorting involves rearranging information into either ascending or descending order. There are many sorting algorithms, among which is Bubble Sort. Bubble Sort is not known to be a very good sorting algorithm because it is beset with redundant comparisons. However, efforts have been made to improve its performance. With Bidirectional Bubble Sort, the average number of comparisons is slightly reduced, and Batcher's Sort, similar to Shellsort, performs significantly better than Bidirectional Bubble Sort by carrying out comparisons in a novel way so that no propagation of exchanges is necessary. Batcher also presented Bitonic Sort, whose strong point is that it is very suitable for a hard-wired implementation using a sorting network. This paper presents a meta-algorithm called Oyelami's Sort that combines the technique of Bidirectional Bubble Sort with a modified diminishing increment sorting. The results from the implementation of the algorithm, compared with Batcher's Odd-Even Sort and Batcher's Bitonic Sort, showed that the algorithm performed better than the two in the worst case. The implication is that the algorithm is faster.
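For concreteness, here is a minimal C++ sketch of the Bidirectional Bubble Sort (cocktail sort) technique the abstract builds on. This is the baseline idea only, not Oyelami's Sort itself, whose combination with diminishing increments is not detailed in this excerpt.

```cpp
#include <utility>
#include <vector>

// Bidirectional Bubble Sort (cocktail sort): alternating left-to-right
// and right-to-left passes, which removes some of plain Bubble Sort's
// redundant comparisons near both ends of the array.
void bidirectionalBubbleSort(std::vector<int>& a) {
    int lo = 0, hi = static_cast<int>(a.size()) - 1;
    bool swapped = true;
    while (swapped && lo < hi) {
        swapped = false;
        for (int i = lo; i < hi; ++i)      // forward pass bubbles the max right
            if (a[i] > a[i + 1]) { std::swap(a[i], a[i + 1]); swapped = true; }
        --hi;
        for (int i = hi; i > lo; --i)      // backward pass bubbles the min left
            if (a[i - 1] > a[i]) { std::swap(a[i - 1], a[i]); swapped = true; }
        ++lo;
    }
}
```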
A Proposed Solution for Sorting Algorithms Problems by Comparison Network Model of Computation
International Journal of Scientific & Engineering Research, Volume 3, Issue 4, April 2012. ISSN 2229-5518.

A Proposed Solution for Sorting Algorithms Problems by Comparison Network Model of Computation. Mr. Rajeev Singh, Mr. Ashish Kumar Tripathi, Mr. Saurabh Upadhyay, Mr. Sachin Kumar Dhar Dwivedi.

Abstract: In this paper we propose a new solution for sorting algorithms. Early sorting algorithms were designed for serial computers (random access machines, or RAMs), which allow only one operation to be executed at a time. We investigate sorting algorithms based on a comparison-network model of computation, in which many comparison operations can be performed simultaneously.

Index Terms: sorting algorithms, comparison network, sorting network, the zero-one principle, bitonic sorting network.

1 Introduction. A sorting network is an abstract mathematical model of a network of wires and comparator modules that is used to sort a sequence of numbers; its output is a permutation, or reordering, of the input. There are many algorithms for solving sorting problems with such networks. Each comparator connects two wires and sorts the values by outputting the smaller value to one wire and the larger to the other. A sorting network thus consists of two kinds of items, comparators and wires: each wire carries a value, and each comparator takes two wires as input and output. This independence of comparison sequences is useful for parallel execution of the algorithms. [The excerpt's worked example traces a bubble-sort network on the input 8, 25, 9, 3, 6.]
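As an illustration of the comparator model described above, the following sketch expresses plain bubble sort as a fixed network of compare-exchange modules on five wires, applied to the excerpt's example input 8, 25, 9, 3, 6; the function names are illustrative, not from the paper.

```cpp
#include <array>
#include <utility>

// A comparator module: outputs the smaller value on the first wire
// and the larger on the second, as described in the abstract.
inline void comparator(int& a, int& b) {
    if (a > b) std::swap(a, b);
}

// Bubble sort expressed as a fixed comparator network on 5 wires.
// Comparators touching disjoint wire pairs could fire in parallel.
std::array<int, 5> bubbleNetwork(std::array<int, 5> v) {
    comparator(v[0], v[1]); comparator(v[1], v[2]); comparator(v[2], v[3]); comparator(v[3], v[4]);
    comparator(v[0], v[1]); comparator(v[1], v[2]); comparator(v[2], v[3]);
    comparator(v[0], v[1]); comparator(v[1], v[2]);
    comparator(v[0], v[1]);
    return v; // {8, 25, 9, 3, 6} -> {3, 6, 8, 9, 25}
}
```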
Bitonic Sorting Algorithm: a Review
International Journal of Computer Applications (0975-8887), Volume 113, No. 13, March 2015.

Bitonic Sorting Algorithm: A Review. Megha Jain, Sanjay Kumar, V. K. Patle. S.O.S. in CS & IT, Pt. Ravi Shankar Shukla University, Raipur (C.G.), India.

ABSTRACT: Batcher's bitonic sorting algorithm is a parallel sorting algorithm used for sorting numbers on modern parallel machines. Among the various parallel sorting algorithms (radix sort, bitonic sort, etc.), it is one of the efficient ones because of its load-balancing property, and it is widely used in scientific and engineering applications. Various researchers have worked on the bitonic sorting algorithm in order to improve the performance of Batcher's original algorithm. This paper reviews the contributions made by these researchers.

2.1 Bitonic Sequence. A sequence of n elements x0, ..., xn-1 is bitonic if either (a) there exists an index i, 0 ≤ i ≤ n-1, such that x0, ..., xi is monotonically increasing and xi, ..., xn-1 is monotonically decreasing, or (b) there exists a cyclic shift of indices after which (a) holds [10]. The bitonic split operation produces two bitonic sequences, to which merge and sort operations are then applied (illustrated in Fig. 1 of the paper).
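A minimal sketch of the bitonic split-and-merge step defined in Section 2.1, assuming (as the classic construction does) that the input length is a power of two; this is a generic rendering, not code from the review.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Batcher's bitonic merge: the split compare-exchanges x[i] with
// x[i + n/2], yielding two bitonic halves in which every element of
// the first half is <= every element of the second; recursing on the
// halves sorts the whole (ascending) sequence.
void bitonicMerge(std::vector<int>& x, std::size_t lo, std::size_t n) {
    if (n <= 1) return;
    std::size_t half = n / 2;
    for (std::size_t i = lo; i < lo + half; ++i)   // bitonic split
        if (x[i] > x[i + half]) std::swap(x[i], x[i + half]);
    bitonicMerge(x, lo, half);                     // each half is again bitonic
    bitonicMerge(x, lo + half, half);
}
```

All n/2 compare-exchanges of a split touch disjoint index pairs, which is exactly the independence that makes the scheme attractive for parallel machines.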
Sorting Algorithm
In computer science, a sorting algorithm is an algorithm that puts elements of a list in a certain order. The most-used orders are numerical order and lexicographical order. Efficient sorting is important for optimizing the use of other algorithms (such as search and merge algorithms) that require sorted lists to work correctly; it is also often useful for canonicalizing data and for producing human-readable output. More formally, the output must satisfy two conditions:
1. The output is in nondecreasing order (each element is no smaller than the previous element according to the desired total order);
2. The output is a permutation, or reordering, of the input.
Since the dawn of computing, the sorting problem has attracted a great deal of research, perhaps due to the complexity of solving it efficiently despite its simple, familiar statement. For example, bubble sort was analyzed as early as 1956.[1] Although many consider it a solved problem, useful new sorting algorithms are still being invented (for example, library sort was first published in 2004). Sorting algorithms are prevalent in introductory computer science classes, where the abundance of algorithms for the problem provides a gentle introduction to a variety of core algorithm concepts, such as big O notation, divide and conquer algorithms, data structures, randomized algorithms, best, worst and average case analysis, time-space tradeoffs, and lower bounds.

Classification
Sorting algorithms used in computer science are often classified by:
• Computational complexity (worst, average and best behaviour) of element comparisons in terms of the size n of the list. For typical sorting algorithms, good behavior is O(n log n) and bad behavior is O(n²).
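The two formal output conditions translate directly into a checker; the sketch below uses only the C++ standard library and assumes integer keys for simplicity.

```cpp
#include <algorithm>
#include <vector>

// Checks the two formal conditions above: the output is nondecreasing,
// and it is a permutation (same multiset) of the input.
bool isValidSort(std::vector<int> input, std::vector<int> output) {
    if (!std::is_sorted(output.begin(), output.end()))
        return false;                       // condition 1: nondecreasing order
    std::sort(input.begin(), input.end());  // canonical order of the multiset
    return input == output;                 // condition 2: same multiset
}
```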
Two-Level Main Memory Co-Design: Multi-Threaded Algorithmic Primitives, Analysis, and Simulation
Two-Level Main Memory Co-Design: Multi-Threaded Algorithmic Primitives, Analysis, and Simulation. Michael A. Bender (Stony Brook University; Tokutek, Inc.), Jonathan Berry (Sandia National Laboratories), Simon D. Hammond (Sandia), K. Scott Hemmert (Sandia), Samuel McCauley (Stony Brook University), Branden Moore (Sandia), Benjamin Moseley (Washington University in St. Louis), Cynthia A. Phillips (Sandia), David Resnick (Sandia), and Arun Rodrigues (Sandia).

Abstract: A fundamental challenge for supercomputer architecture is that processors cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. As the number of cores per chip increases, and traditional DDR DRAM speeds stagnate, the problem is only getting worse. A variety of non-DDR 3D memory technologies (Wide I/O 2, HBM) offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. However, such a packaging scheme cannot contain sufficient memory capacity for a node. It seems likely that future systems will require at least two levels of main memory: high-bandwidth, low-power memory near the processor and low-bandwidth, high-capacity memory further away. With memory close to the processor, there can be a higher number of connections between the memory and caches, enabling higher bandwidth than current technologies. While the term scratchpad is overloaded within the computer architecture field, we use it throughout this paper to describe a high-bandwidth, local memory that can be used as a temporary storage location. The scratchpad cannot replace DRAM entirely. Due to the physical constraints of adding the memory directly to the chip, the scratchpad cannot be as large as DRAM, although it will be much larger than cache, having gigabytes of storage capacity.
Minimizing Writes in Parallel External Memory Search
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence.

Minimizing Writes in Parallel External Memory Search. Nathan R. Sturtevant and Matthew J. Rutherford, University of Denver, Denver, CO, USA.

Abstract: Recent research on external-memory search has shown that disks can be effectively used as secondary storage when performing large breadth-first searches. We introduce the Write-Minimizing Breadth-First Search (WMBFS) algorithm, which is designed to minimize the number of writes performed in an external-memory BFS. WMBFS is also designed to store the results of the BFS for later use. We present the results of a BFS on a single-agent version of Chinese Checkers and the Rubik's Cube edge cubes, state spaces with about 1 trillion states each. In evaluating against a comparable approach, WMBFS reduces the I/O for the […] speed in Chinese Checkers, and is 3.5 times faster in Rubik's Cube.

WMBFS is motivated by several observations: First, most large-scale BFSs have been performed to verify the diameter (the number of unique depths at which states can be found) or the width (the maximum number of states at any particular depth) of a state space, after which any computed data is discarded. There are, however, many scenarios in which the results of a large BFS can be later used for other computation. In particular, a breadth-first search is the underlying technique for building pattern databases (PDBs), which are used as heuristics for search. PDBs require that the depth of each state in the BFS be stored for later usage.
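To illustrate the pattern-database observation, here is a toy in-memory sketch that records the BFS depth of every state for later use as a heuristic. It deliberately ignores the external-memory and write-minimizing machinery that is WMBFS's actual contribution, and the state-space interface (numStates, startState, successors) is a hypothetical stand-in.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Toy pattern-database construction: a BFS that stores the depth of
// every reachable state. Depths are assumed to fit in a byte (< 255),
// the usual compact PDB encoding.
std::vector<uint8_t> buildDepthTable(
        std::size_t numStates, std::size_t startState,
        std::vector<std::size_t> (*successors)(std::size_t)) {
    const uint8_t UNSEEN = 0xFF;
    std::vector<uint8_t> depth(numStates, UNSEEN);
    std::queue<std::size_t> frontier;
    depth[startState] = 0;
    frontier.push(startState);
    while (!frontier.empty()) {
        std::size_t s = frontier.front(); frontier.pop();
        for (std::size_t t : successors(s)) {
            if (depth[t] == UNSEEN) {          // first visit gives the BFS depth
                depth[t] = depth[s] + 1;
                frontier.push(t);
            }
        }
    }
    return depth;   // depth[s] later serves as an admissible heuristic
}
```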
I. Sorting Networks Thomas Sauerwald
I. Sorting Networks. Thomas Sauerwald, Easter 2015.

Outline: Introduction to Sorting Networks; Batcher's Sorting Network; Counting Networks.

Outline of this course ((tentative) list of topics, in the context of Algorithms (I, II), Complexity Theory and Advanced Algorithms):
I. Sorting Networks (Sorting, Counting, Load Balancing)
II. Matrix Multiplication (Serial and Parallel)
III. Linear Programming (Formulating, Applying and Solving)
IV. Approximation Algorithms: Covering Problems
V. Approximation Algorithms via Exact Algorithms
VI. Approximation Algorithms: Travelling Salesman Problem
VII. Approximation Algorithms: Randomisation and Rounding
VIII. Approximation Algorithms: MAX-CUT Problem

The lectures closely follow the book and use the same numbering of theorems/lemmas etc.
Adaptive Bitonic Sorting
Adaptive Bitonic Sorting. Gabriel Zachmann, Clausthal University, Clausthal-Zellerfeld, Germany. (From the Encyclopedia of Parallel Computing, 2011.)

Definition: Adaptive bitonic sorting is a sorting algorithm suitable for implementation on EREW parallel architectures. Like bitonic sorting, it is based on merging, which is recursively applied to obtain a sorted sequence. In contrast to bitonic sorting, it is data-dependent. Adaptive bitonic merging can be performed in O(n/p) parallel time, p being the number of processors, and executes only O(n) operations in total. Consequently, adaptive bitonic sorting can be performed in O((n log n)/p) time, which is optimal. So, one of its advantages is that it executes a factor of O(log n) fewer operations than bitonic sorting.

One of the main differences between "regular" bitonic sorting and adaptive bitonic sorting is that regular bitonic sorting is data-independent, while adaptive bitonic sorting is data-dependent (hence the name). As a consequence, adaptive bitonic sorting cannot be implemented as a sorting network, but only on architectures that offer some kind of flow control. Nonetheless, it is convenient to derive the method of adaptive bitonic sorting from bitonic sorting. Sorting networks have a long history in computer science research (see the comprehensive survey []). One reason is that sorting networks are a convenient way to describe parallel sorting algorithms on CREW-PRAMs or even EREW-PRAMs (the latter also called PRACs, for "parallel random access computer"). In the following, let n denote the number of keys to be sorted, and p the number of processors. For the sake of clarity, n will always be assumed to be a power of two.
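For contrast with the data-dependent adaptive variant, here is a sketch of classic, data-independent bitonic sort, assuming n is a power of two. The comment on where adaptive bitonic sorting saves its O(log n) factor paraphrases the entry; the code itself is a generic textbook rendering, not the encyclopedia's.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Classic (non-adaptive) bitonic sort: a fixed, data-independent
// pattern of compare-exchanges, O(n log^2 n) operations in total.
// Adaptive bitonic sorting replaces each O(n log n)-operation bitonic
// merge with an O(n)-operation, data-dependent one (using bitonic
// trees), which is where the O(log n) factor is saved.
void bitonicSort(std::vector<int>& x, std::size_t lo, std::size_t n, bool ascending) {
    if (n <= 1) return;
    std::size_t half = n / 2;
    bitonicSort(x, lo, half, true);           // ascending first half
    bitonicSort(x, lo + half, half, false);   // descending second half -> bitonic
    for (std::size_t len = n; len > 1; len /= 2) {   // bitonic merge of x[lo..lo+n)
        std::size_t h = len / 2;
        for (std::size_t base = lo; base < lo + n; base += len)
            for (std::size_t i = base; i < base + h; ++i)
                if ((x[i] > x[i + h]) == ascending)
                    std::swap(x[i], x[i + h]);
    }
}
// Usage: bitonicSort(v, 0, v.size(), true) sorts v ascending.
```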
Bitonic Sort and Quick Sort
Assignment of bachelor's thesis. Title: Development of parallel sorting algorithms for GPU. Student: Xuan Thang Nguyen. Supervisor: Ing. Tomáš Oberhuber, Ph.D. Study program: Informatics. Branch/specialization: Computer Science. Department: Department of Theoretical Computer Science. Validity: until the end of summer semester 2021/2022.

Instructions:
1. Study the basics of programming GPU using CUDA.
2. Learn the fundamentals of the development of parallel algorithms with the TNL library (www.tnl-project.org).
3. Learn and understand parallel sorting algorithms, namely bitonic sort and quick sort.
4. Implement both algorithms into the TNL library to run on CPU and GPU.
5. Implement unit tests for testing the correctness of the implemented algorithms.
6. Perform measurements of speed-up compared to sorting algorithms in the STL library and the GPU implementation [1].

[1] https://github.com/davors/gpu-sorting

Electronically approved by doc. Ing. Jan Janoušek, Ph.D. on 30 November 2020 in Prague.

Bachelor's thesis: Development of parallel sorting algorithms for GPU. Nguyen Xuan Thang, Department of Theoretical Computer Science. Supervisor: Ing. Tomáš Oberhuber, Ph.D. May 13, 2021.

Acknowledgements: I would like to thank my supervisor Ing. Tomáš Oberhuber, Ph.D. for his support, guidance and advice throughout the whole time of creating this thesis. My gratitude also goes to my family, who helped me during these hard times.

Declaration: I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis. I acknowledge that my thesis is subject to the rights and obligations stipulated by the Act No.
A Single SMC Sampler on MPI That Outperforms a Single MCMC Sampler
A SINGLE SMC SAMPLER ON MPI THAT OUTPERFORMS A SINGLE MCMC SAMPLER. Alessandro Varsi, Lykourgos Kekempanos (Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, L69 3GJ, UK), Jeyarajan Thiyagalingam (Scientific Computing Department, Rutherford Appleton Laboratory, STFC, Didcot, Oxfordshire, OX11 0QX, UK), Simon Maskell (University of Liverpool).

ABSTRACT: Markov Chain Monte Carlo (MCMC) is a well-established family of algorithms which are primarily used in Bayesian statistics to sample from a target distribution when direct sampling is challenging. Single instances of MCMC methods are widely considered hard to parallelise in a problem-agnostic fashion and hence unsuitable to meet both constraints of high accuracy and high throughput. Sequential Monte Carlo (SMC) Samplers can address the same problem, but are parallelisable: they share with Particle Filters the same key tasks and bottleneck. Although a rich literature already exists on MCMC methods, SMC Samplers are relatively underexplored, such that no parallel implementation is currently available. In this paper, we first propose a parallel MPI version of the SMC Sampler, including an optimised implementation of the bottleneck, and then compare it with single-core Metropolis-Hastings. The goal is to show that SMC Samplers may be a promising alternative to MCMC methods with high potential for future improvements. We demonstrate that a basic SMC Sampler with 512 cores is up to 85 times faster or up to 8 times more accurate than Metropolis-Hastings.
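For context, here is a minimal random-walk Metropolis-Hastings sketch of the single-core baseline the paper compares against; the target density, seed, and all names are illustrative assumptions, not the paper's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical unnormalised log-density; a standard normal for illustration.
double logTarget(double x) { return -0.5 * x * x; }

// Random-walk Metropolis-Hastings: propose a Gaussian step, accept with
// probability min(1, target(y)/target(x)); the symmetric proposal makes
// the proposal densities cancel in the acceptance ratio.
std::vector<double> metropolisHastings(std::size_t steps, double x0, double stddev) {
    std::mt19937 rng(42);
    std::normal_distribution<double> proposal(0.0, stddev);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> chain;
    chain.reserve(steps);
    double x = x0;
    for (std::size_t i = 0; i < steps; ++i) {
        double y = x + proposal(rng);                   // propose a move
        double logAlpha = logTarget(y) - logTarget(x);  // log acceptance ratio
        if (std::log(unif(rng)) < logAlpha) x = y;      // accept or reject
        chain.push_back(x);                             // chain samples the target
    }
    return chain;
}
```

The inherently sequential accept/reject loop is exactly why single MCMC chains are hard to parallelise, which is the gap the SMC Sampler targets.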
The Parallel Persistent Memory Model
The Parallel Persistent Memory Model. Guy E. Blelloch, Phillip B. Gibbons, Yan Gu, Charles McGuffey (Carnegie Mellon University), Julian Shun (MIT CSAIL).

ABSTRACT: We consider a parallel computational model, the Parallel Persistent Memory model, comprised of P processors, each with a fast local ephemeral memory of limited size, and sharing a large persistent memory. The model allows for each processor to fault at any time (with bounded probability), and possibly restart. When a processor faults, all of its state and local ephemeral memory is lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are nearly as fast as existing random access memory, are accessible at the granularity of cache lines, and have the capability of surviving power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual.

1 INTRODUCTION. In this paper, we consider a parallel computational model, the Parallel Persistent Memory (Parallel-PM) model, that consists of P processors, each with a fast local ephemeral memory of limited size M, and sharing a large slower persistent memory. As in the external memory model [4, 5], each processor runs a standard instruction set from its ephemeral memory and has instructions for transferring blocks of size B to and from the persistent memory. The cost of an algorithm is calculated based on the number of such transfers. A key difference, however, is that the model allows for individual processors to fault at any time. If a processor faults, all of its processor state and local ephemeral memory is lost, but the persistent memory remains.
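A toy sketch of the model's cost accounting under stated assumptions: computation reads and writes persistent memory in blocks of size B, and the cost measure counts those transfers. Fault handling and the multi-processor aspects of the model are omitted, and all names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy cost accounting for the Parallel-PM model: work happens in small
// ephemeral buffers, and the model charges one unit per block of size B
// moved to or from the (fault-surviving) persistent memory.
struct PersistentMemory {
    std::vector<int> data;          // large; survives processor faults
    std::size_t B;                  // block size
    std::size_t transfers = 0;      // the model's cost measure

    // Read one block into a (small) ephemeral buffer.
    std::vector<int> readBlock(std::size_t blockIndex) {
        ++transfers;
        auto begin = data.begin() + blockIndex * B;
        return std::vector<int>(begin, begin + B);
    }

    // Write an ephemeral buffer back as one block.
    void writeBlock(std::size_t blockIndex, const std::vector<int>& buf) {
        ++transfers;
        std::copy(buf.begin(), buf.end(), data.begin() + blockIndex * B);
    }
};
```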