Theoretical Parallel Computing

Total Pages: 16

File Type: PDF, Size: 1020 KB

Arnaud Legrand, CNRS, University of Grenoble
LIG laboratory, [email protected]
November 1, 2009
(Several slides courtesy of Henri Casanova.)

Outline
1. Parallel RAM: Introduction; Pointer Jumping; Reducing the Number of Processors; PRAM Model Hierarchy; Conclusion
2. Combinatorial Networks: Merge Sort; 0-1 Principle; Odd-Even Transposition Sort; FFT
3. Conclusion

Models for Parallel Computation
- We have seen how to implement parallel algorithms in practice, and we have come up with performance analyses.
- In traditional algorithm complexity work, the Turing machine makes it possible to compare algorithms precisely, to establish precise notions of complexity, etc.
- Can we do something like this for parallel computing? Parallel machines are complex, with many hardware characteristics (e.g., the network) that are difficult to take into account in algorithm work. Is it hopeless?
- The most famous theoretical model of parallel computing is the PRAM model.
- We will see that many principles of the model are really at the heart of the more applied things we have seen so far.

The PRAM Model
- Parallel Random Access Machine (PRAM): an imperfect model that only tangentially relates to the performance of a real parallel machine, just as a Turing machine only tangentially relates to the performance of a real computer.
- Goal: make it possible to reason about and classify parallel algorithms, and to obtain complexity results (optimality, minimal complexity, etc.).
- One way to look at it: the model makes it possible to determine the "maximum parallelism" in an algorithm or a problem, and to devise new algorithms.
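To make the notion of "maximum parallelism" concrete, here is a small Python sketch (my own illustration, not from the slides) of a synchronous PRAM-style reduction: with roughly n/2 processors, one per pair of cells, n values are summed in O(log n) super-steps, whereas a single processor needs O(n) steps.

    # Minimal sketch (not from the slides): a synchronous PRAM-style sum.
    # Each pass of the while loop stands for one READ/COMPUTE/WRITE
    # super-step in which every active processor touches distinct cells
    # (EREW-style), so all its iterations could run in the same cycle.
    def pram_sum(values):
        mem = list(values)              # shared memory
        n = len(mem)
        stride, rounds = 1, 0
        while stride < n:
            for i in range(0, n - stride, 2 * stride):
                mem[i] += mem[i + stride]   # processor at cell i is active
            stride *= 2
            rounds += 1
        return mem[0], rounds           # O(log n) rounds, ~n/2 processors

    print(pram_sum(range(1, 9)))        # (36, 3) for n = 8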
The PRAM Model (continued)
- Memory size is infinite and the number of processors is unbounded; but nothing prevents you from "folding" this back to something more realistic.
- No direct communication between processors: they communicate via the memory, and they can operate in an asynchronous fashion.
- Every processor accesses any memory location in 1 cycle.
- Typically all processors execute the same algorithm in a synchronous fashion: a READ phase, a COMPUTE phase, a WRITE phase.
- Some subset of the processors can stay idle (e.g., even-numbered processors may not work while odd-numbered processors do, and conversely).
[Figure: processors P1, P2, P3, ..., PN all connected to a single shared memory.]

Memory Access in PRAM
- Exclusive Read (ER): p processors can simultaneously read the contents of p distinct memory locations.
- Concurrent Read (CR): p processors can simultaneously read the contents of p' memory locations, where p' < p.
- Exclusive Write (EW): p processors can simultaneously write to p distinct memory locations.
- Concurrent Write (CW): p processors can simultaneously write to p' memory locations, where p' < p.

PRAM CW?
What ends up being stored when multiple writes occur?
- Priority CW: processors are assigned priorities, and the top-priority processor is the one whose write counts for each group write.
- Fail common CW: if the values are not equal, no change.
- Collision common CW: if the values are not equal, write a "failure value".
- Fail-safe common CW: if the values are not equal, the algorithm aborts.
- Random CW: non-deterministic choice of the value written.
- Combining CW: write the sum, average, max, min, etc. of the values.
- etc.
The above means that when you write an algorithm for a CW PRAM you can use any of these rules, even at different points in time. CW does not correspond to any hardware in existence; it is just a logical/algorithmic notion that could be implemented in software. In fact, most algorithms end up not needing CW.

Classic PRAM Models
- CREW (concurrent read, exclusive write): the most commonly used.
- CRCW (concurrent read, concurrent write): the most powerful.
- EREW (exclusive read, exclusive write): the most restrictive; unfortunately, probably the most realistic.
Theorems exist that prove the relative power of these models (more on this later).
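The concurrent-write policies listed above are simply rules for combining the values that land on one cell during the same super-step. The helper below is a hypothetical illustration (the function name and the policies-as-strings are my own, not an existing API) of how a few of them resolve a group write:

    # Minimal sketch: resolve simultaneous writes to one shared-memory cell.
    def resolve_write(writes, policy):
        """writes: list of (processor_id, value) pairs targeting one cell."""
        values = [v for _, v in writes]
        if policy == "priority":        # lowest processor id wins
            return min(writes)[1]
        if policy == "common":          # only written if all values agree
            return values[0] if len(set(values)) == 1 else None
        if policy == "collision":       # disagreement stores a failure value
            return values[0] if len(set(values)) == 1 else "FAIL"
        if policy == "combining-sum":   # combine the values, e.g. with a sum
            return sum(values)
        raise ValueError("unknown policy")

    writes = [(0, 5), (3, 5), (7, 2)]
    print(resolve_write(writes, "priority"))        # 5 (processor 0 wins)
    print(resolve_write(writes, "collision"))       # FAIL (values differ)
    print(resolve_write(writes, "combining-sum"))   # 12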
PRAM Example 1
Problem:
- We have a linked list of length n.
- For each element i, compute its distance to the end of the list:
      d[i] = 0                  if next[i] = NIL
      d[i] = d[next[i]] + 1     otherwise
- The sequential algorithm runs in O(n).
- We can define a PRAM algorithm that runs in O(log n): associate one processor with each element of the list, and at each iteration split the list in two, with odd-placed and even-placed elements in different lists.

Principle:
- Look at the next element.
- Add its d value to yours.
- Point to the next element's next element.
[Figure: d values for a 6-element list after each step: 1 1 1 1 1 0, then 2 2 2 2 1 0, then 4 4 3 2 1 0, then 5 4 3 2 1 0.]
- The size of each list is divided by 2 at each step, hence the O(log n) complexity.

Algorithm:
    forall i
        if next[i] == NIL then d[i] ← 0 else d[i] ← 1
    while there is an i such that next[i] ≠ NIL
        forall i
            if next[i] ≠ NIL then
                d[i] ← d[i] + d[next[i]]
                next[i] ← next[next[i]]
What about the correctness of this algorithm?

forall loop
- At each step, the updates must be synchronized so that pointers point to the right things: next[i] ← next[next[i]].
- This is ensured by the semantics of forall:
      forall i
          A[i] = B[i]
  really means
      forall i
          tmp[i] = B[i]
      forall i
          A[i] = tmp[i]
- Nobody really writes it out this way, but one must not forget that this is really what happens underneath.

while condition
- while there is an i such that next[i] ≠ NIL
- How can one do such a global test on a PRAM? It cannot be done in constant time unless the PRAM is CRCW: at the end of each step, each processor writes TRUE or FALSE to the same memory location depending on whether next[i] is equal to NIL, and one then takes the AND of all the values (to resolve the concurrent writes).
- On a CREW PRAM, one needs O(log n) steps to do a global test like the above.
- In this case, one can simply rewrite the while loop as a for loop, because we have analyzed how the iterations go: for step = 1 to log n ...
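As a concrete illustration of both the algorithm and the forall semantics, here is a sequential Python sketch (my own, assuming the list is given as a next array with None marking the tail) that simulates the synchronous rounds by snapshotting d and next before any write:

    # Minimal sketch of pointer jumping (list ranking). The explicit
    # snapshots d_old / nxt_old mimic the PRAM's synchronous READ phase:
    # every read of a round happens before any write of that round.
    def list_ranking(next_):
        n = len(next_)
        d = [0 if next_[i] is None else 1 for i in range(n)]
        nxt = list(next_)
        while any(p is not None for p in nxt):      # the global test
            d_old, nxt_old = list(d), list(nxt)     # READ phase snapshot
            for i in range(n):                      # one processor per element
                if nxt_old[i] is not None:
                    d[i] = d_old[i] + d_old[nxt_old[i]]
                    nxt[i] = nxt_old[nxt_old[i]]
        return d

    # List 0 -> 1 -> 2 -> 3 -> 4 -> 5, where element 5 is the tail.
    print(list_ranking([1, 2, 3, 4, 5, None]))      # [5, 4, 3, 2, 1, 0]

On this six-element list the successive d vectors are 1 1 1 1 1 0, 2 2 2 2 1 0, 4 4 3 2 1 0 and finally 5 4 3 2 1 0, matching the figure above.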