Introduction, Dr. Ralf-Peter Mundani, CeSIM / IGSSE

Total Pages: 16

File Type: pdf, Size: 1020 KB

Parallel Programming and High-Performance Computing
Part 1: Introduction
Dr. Ralf-Peter Mundani, CeSIM / IGSSE, Technische Universität München
Summer Term 2008

General Remarks
• materials: http://www5.in.tum.de/lehre/vorlesungen/parhpp/SS08/
• Ralf-Peter Mundani
  – email [email protected], phone 289-25057, room 3181 (city centre)
  – consultation hour: Tuesday, 4:00-6:00 pm (room 02.05.058)
• Ioan Lucian Muntean
  – email [email protected], phone 289-18692, room 02.05.059
• lecture (2 SWS): weekly, Tuesday, starting at 12:15 pm, room 02.07.023
• exercises (1 SWS): fortnightly, Wednesday, starting at 4:45 pm, room 02.07.023

Content
– part 1: introduction
– part 2: high-performance networks
– part 3: foundations
– part 4: programming memory-coupled systems
– part 5: programming message-coupled systems
– part 6: dynamic load balancing
– part 7: examples of parallel algorithms

Overview
• motivation
• classification of parallel computers
• levels of parallelism
• quantitative performance evaluation

"I think there is a world market for maybe five computers." (Thomas Watson, chairman of IBM, 1943)

Motivation
• numerical simulation: from phenomena to predictions; starting from a physical phenomenon or technical process, the simulation pipeline spans the disciplines of mathematics, computer science, and the application area:
  1. modelling: determination of parameters, expression of relations
  2. numerical treatment: model discretisation, algorithm development
  3. implementation: software development, parallelisation
  4. visualisation: illustration of abstract simulation results
  5. validation: comparison of results with reality
  6. embedding: insertion into the working process
• why parallel programming and HPC?
  – complex problems (especially the so-called "grand challenges") demand more computing power
    • climate or geophysics simulation (e.g. tsunamis)
    • structure or flow simulation (e.g. crash tests)
    • development systems (e.g. CAD)
    • analysis of large data sets (e.g. the Large Hadron Collider at CERN)
    • military applications (e.g. cryptanalysis)
    • …
  – performance increases come from
    • faster hardware, more memory ("work harder")
    • more efficient algorithms, optimisation ("work smarter")
    • parallel computing ("get some help")
• objectives (assuming all resources were available N times over)
  – throughput: compute N problems simultaneously
    • run N instances of a sequential program with different data sets ("embarrassing parallelism"); e.g. SETI@home
    • drawback: limited resources of single nodes
  – response time: compute one problem in a fraction (1/N) of the time
    • run one instance (i.e. N processes) of a parallel program that jointly solves the problem; e.g. finding prime numbers (see the sketch after this list)
    • drawback: a parallel program must be written; communication overhead
  – problem size: compute one problem with N-times larger data
    • run one instance (i.e. N processes) of a parallel program, using the sum of all local memories to compute larger problem sizes; e.g. iterative solution of a system of linear equations (SLE)
    • drawback: a parallel program must be written; communication overhead
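To make the response-time objective concrete, here is a minimal sketch in C with OpenMP (our own illustration; the lecture prescribes no particular code, and the bound and schedule are arbitrary choices): the iterations of a prime-counting loop are split among the available cores, so the wall-clock time drops towards 1/N of the sequential time, minus parallel overhead.

    #include <stdio.h>

    static int is_prime(long n) {
        if (n < 2) return 0;
        for (long d = 2; d * d <= n; d++)
            if (n % d == 0) return 0;
        return 1;
    }

    int main(void) {
        const long limit = 1000000;   /* toy bound, chosen arbitrarily */
        long count = 0;

        /* the loop iterations are distributed over all cores; the reduction
           clause merges the per-thread partial counts at the end */
        #pragma omp parallel for reduction(+:count) schedule(dynamic, 1000)
        for (long n = 2; n < limit; n++)
            count += is_prime(n);

        printf("%ld primes below %ld\n", count, limit);
        return 0;
    }

Built with an OpenMP-capable compiler (e.g. gcc -fopenmp), the same source also compiles sequentially when the pragma is ignored, which is typical for such loop-level parallelisations.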
Classification of Parallel Computers
• definition: "a collection of processing elements that communicate and cooperate to solve large problems" (Almasi and Gottlieb, 1989)
• possible appearances of such processing elements
  – specialised units (e.g. the steps of a vector pipeline)
  – parallel features in modern monoprocessors (superscalar architectures, instruction pipelining, VLIW, multithreading, multicore, …)
  – several uniform arithmetical units (e.g. the processing elements of array computers)
  – processors of a multiprocessor computer (i.e. the actual parallel computers)
  – complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers)
  – parallel computers or clusters connected via WAN (so-called metacomputers)
• reminder: dual core, quad core, manycore, and multicore
  – observation: frequency (and thus core voltage) has been increasing over the past years
  – problem: thermal power dissipation increases linearly with frequency and with the square of the core voltage
  – a 25% reduction in frequency (and thus core voltage) leads to a 50% reduction in dissipation
    [figure: dissipation and performance of a normal CPU compared with a frequency- and voltage-reduced CPU]
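A quick sanity check of this rule of thumb (our own arithmetic, not part of the slides): taking the stated dependencies literally, dissipation P scales with f times V squared, and with the core voltage scaling roughly in proportion to the frequency,

\[ P \propto f \cdot V^2 \propto f^3, \qquad \frac{P_{\text{reduced}}}{P_{\text{normal}}} \approx 0.75^3 \approx 0.42, \]

so a 25% reduction in frequency and voltage cuts dissipation by more than half, while peak performance drops only with the frequency, i.e. by roughly 25%. This is the budget exploited on the next slide: two such reduced cores dissipate about as much as one full-speed core.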
  – idea: install two cores per die with the same dissipation as a single-core system
    [figure: dissipation and performance of a single-core die compared with a dual-core die at reduced frequency]
• commercial parallel computers
  – manufacturers: starting from 1983, both big players and small start-ups (see the table below; "out of business" means no longer in the parallel business)
  – names have been coming and going rapidly
  – in addition: several manufacturers of vector computers and non-standard architectures

    company                   country   year   status in 2003
    Sequent                   U.S.      1984   acquired by IBM
    Intel                     U.S.      1984   out of business
    Meiko                     U.K.      1985   bankrupt
    nCUBE                     U.S.      1985   out of business
    Parsytec                  Germany   1985   out of business
    Alliant                   U.S.      1985   bankrupt
    Encore                    U.S.      1986   out of business
    Floating Point Systems    U.S.      1986   acquired by SUN
    Myrias                    Canada    1987   out of business
    Ametek                    U.S.      1987   out of business
    Silicon Graphics          U.S.      1988   active
    C-DAC                     India     1991   active
    Kendall Square Research   U.S.      1992   bankrupt
    IBM                       U.S.      1993   active
    NEC                       Japan     1993   active
    SUN Microsystems          U.S.      1993   active
    Cray Research             U.S.      1993   active

• arrival of clusters
  – in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices
  – this made them increasingly attractive as building blocks for parallel computers
  – 1994: Beowulf, the first parallel computer built completely out of commodity hardware
    • NASA Goddard Space Flight Centre
    • 16 Intel DX4 processors
    • multiple 10 Mbit Ethernet links
    • Linux with GNU compilers
    • MPI library
  – 1996: a Beowulf cluster performing more than 1 GFlops
  – 1997: a 140-node cluster performing more than 10 GFlops
  – 2005: InfiniBand cluster at TUM
    • 36 Opteron nodes (quad boards)
    • 4 Itanium nodes (quad boards)
    • 4 Xeon nodes (dual boards) for interactive tasks
    • InfiniBand 4x switch, 96 ports
    • Linux (SuSE and Red Hat)
• supercomputers
  – supercomputing, or high-performance scientific computing, is the most important application of the big number crunchers
  – national initiatives due to huge budget requirements
    • Accelerated Strategic Computing Initiative (ASCI) in the U.S.
      – launched in the wake of the nuclear testing moratorium of 1992/93
      – decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.
      – start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world's first TFlops computer)
      – then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, …
    • meanwhile: a new high-end computing memorandum (2004)
Recommended publications
  • Parallel Prefix Sum (Scan) with CUDA
    Parallel Prefix Sum (Scan) with CUDA. Mark Harris, [email protected], April 2007 (initial release February 14, 2007). Abstract: Parallel prefix sum, also known as parallel scan, is a useful building block for many parallel algorithms, including sorting and building data structures. In this document we introduce scan and describe step by step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain the best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads. Contents: Inclusive and Exclusive Scan; Sequential Scan; A Naïve Parallel Scan; A Work-Efficient …
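    For reference, a minimal C sketch of the sequential exclusive scan that such documents use as the O(n) baseline against which parallel scans are measured (array contents and names are illustrative, not taken from the paper):

        #include <stdio.h>

        /* Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0. */
        void exclusive_scan(const int *in, int *out, int n) {
            int sum = 0;
            for (int i = 0; i < n; i++) {
                out[i] = sum;    /* everything strictly before element i */
                sum += in[i];
            }
        }

        int main(void) {
            int in[8] = {3, 1, 7, 0, 4, 1, 6, 3};
            int out[8];
            exclusive_scan(in, out, 8);
            for (int i = 0; i < 8; i++) printf("%d ", out[i]);  /* 0 3 4 11 11 15 16 22 */
            printf("\n");
            return 0;
        }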
  • CSE373: Data Structures & Algorithms Lecture 26
    CSE373: Data Structures & Algorithms. Lecture 26: Parallel Reductions, Maps, and Algorithm Analysis. Aaron Bauer, Winter 2014. Outline. Done: how to write a parallel algorithm with fork and join, and why divide-and-conquer with lots of small tasks is best (it combines results in parallel, assuming the library can handle "lots of small threads"). Now: more examples of simple parallel programs that fit the "map" or "reduce" patterns; a teaser on going beyond maps and reductions; asymptotic analysis for fork-join parallelism; Amdahl's Law; final exam and victory lap. What else looks like this? We saw that summing an array went from O(n) sequential to O(log n) parallel (assuming a lot of processors and a very large n), an exponential speed-up in theory (the speed-up factor is n / log n). [figure: balanced binary reduction tree over the array elements] Anything that can use results from two halves and merge them in O(1) time has the same property. Examples: the maximum or minimum element; whether some element satisfies a property (e.g., is there a 17?); the left-most element satisfying some property (e.g., the first 17), where one must decide what the recursive tasks return and how the results are merged; the corners of a rectangle containing all points (a "bounding box"); counts, for example the number of strings that start with a vowel, which is just summing with a different base case, and many problems are. Computations of this form are called reductions (see the sketch below).
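    A divide-and-conquer reduction in the fork/join spirit of the lecture, sketched in C with OpenMP tasks rather than the course's own fork/join library (cutoff, array size, and names are our illustrative choices):

        #include <stdio.h>

        #define CUTOFF 1000                      /* below this size, sum sequentially */

        long sum(const int *a, int lo, int hi) { /* sums a[lo..hi) */
            if (hi - lo < CUTOFF) {
                long s = 0;
                for (int i = lo; i < hi; i++) s += a[i];
                return s;
            }
            int mid = lo + (hi - lo) / 2;
            long left, right;
            #pragma omp task shared(left)        /* fork: left half as a new task */
            left = sum(a, lo, mid);
            right = sum(a, mid, hi);             /* current task handles the right half */
            #pragma omp taskwait                 /* join: wait for the forked half */
            return left + right;
        }

        int main(void) {
            enum { N = 1000000 };
            static int a[N];
            for (int i = 0; i < N; i++) a[i] = 1;
            long total;
            #pragma omp parallel
            #pragma omp single                   /* one thread starts the recursion */
            total = sum(a, 0, N);
            printf("%ld\n", total);              /* 1000000 */
            return 0;
        }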
  • CSE 613: Parallel Programming Lecture 2
    CSE 613: Parallel Programming. Lecture 2: Analytical Modeling of Parallel Algorithms. Rezaul A. Chowdhury, Department of Computer Science, SUNY Stony Brook, Spring 2019. (Definitions and figures after "Introduction to Parallel Computing", 2nd edition, Grama et al.) Parallel execution time and overhead: the parallel running time on p processing elements is T_p = t_end - t_start, where t_start is the starting time of the processing element that starts first and t_end is the termination time of the processing element that finishes last. Sources of overhead (with respect to serial execution): interprocess interaction (interacting and communicating data, e.g. intermediate results); idling (due to load imbalance, synchronization, the presence of serial computation, etc.); excess computation (the fastest serial algorithm may be difficult or impossible to parallelize). The overhead function, or total parallel overhead, is T_O = p T_p - T, where p is the number of processing elements and T is the time spent doing useful work (often the execution time of the fastest serial algorithm). Speedup: let T_p be the running time using p identical processing elements; then the speedup is S_p = T_1 / T_p. Theoretically, S_p <= p (why?). The speedup is called perfect, linear, or ideal if S_p = p. Speedup example: consider adding n numbers using n identical processing elements. Serial runtime, …
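    A small worked example of these definitions (the numbers are ours, chosen for illustration, not from the lecture):

        \[ T_1 = 100\,\mathrm{s},\quad p = 4,\quad T_4 = 30\,\mathrm{s}
           \;\Rightarrow\;
           S_4 = \frac{T_1}{T_4} = \frac{100}{30} \approx 3.3,\qquad
           T_O = p\,T_p - T_1 = 4 \cdot 30 - 100 = 20\,\mathrm{s}. \]

    The 20 s of total overhead is the work the four processing elements spend on interaction, idling, and excess computation rather than on useful work.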
  • A Review of Multicore Processors with Parallel Programming
    International Journal of Engineering Technology, Management and Applied Sciences, www.ijetmas.com, September 2015, Volume 3, Issue 9, ISSN 2349-4476. A Review of Multicore Processors with Parallel Programming. Anchal Thakur (Research Scholar, CSE Department, L.R. Institute of Engineering and Technology, Solan, India) and Ravinder Thakur (Assistant Professor, CSE Department, L.R. Institute of Engineering and Technology, Solan, India). ABSTRACT: When computers were first introduced to the market, they came with single processors, which limited their performance and efficiency. The classic way of overcoming this performance issue was to use bigger processors to execute the data at higher speed. Bigger processors did improve performance to a certain extent, but they consumed a lot of power, which started overheating the internal circuits. To achieve efficiency and speed simultaneously, CPU architects developed multicore processor units in which two or more processors are used to execute a task. Multicore technology offers better response times when running big applications, better power management, and faster execution times. Multicore processors also give developers the opportunity to use parallel programming to execute tasks in parallel. These days parallel programming is used to execute a task by splitting it into smaller units of work and executing them on different cores. By using parallel programming, the complex tasks carried out in a multicore environment can be executed with higher efficiency and performance. Keywords: Multicore Processing, Multicore Utilization, Parallel Processing. INTRODUCTION: From the day computers were invented, great importance has been attached to their efficiency in executing tasks.
  • Parallel Algorithms and Parallel Program Design
    Introduction to Parallel Algorithms and Parallel Program Design. Parallel Computing CIS 410/510, Department of Computer and Information Science, University of Oregon (IPCC), Lecture 12. Methodological design (after I. Foster, "Designing and Building Parallel Programs", Addison-Wesley, 1995; the book is online, see the webpage): partition (task/data decomposition), communication (coordination of task execution), agglomeration (evaluation of the structure), and mapping (resource assignment). Partitioning: the partitioning stage is intended to expose opportunities for parallel execution; the focus is on defining a large number of small tasks to yield a fine-grained decomposition of the problem. A good partition divides both the computational tasks associated with a problem and the data on which the tasks operate into small pieces. Domain decomposition focuses on the computation's data, functional decomposition focuses on the computation's tasks, and mixing domain and functional decomposition is possible. [figures: domain decomposition of a 2D/3D grid; functional decomposition of a climate model] Partitioning checklist: Does your partition define at least an order of magnitude more tasks than there are processors in your target computer? If not, you may lose design flexibility. Does your partition avoid redundant computation and storage requirements? If not, it may not be scalable. Are tasks of comparable size? If not, it may be hard to allocate each processor equal amounts of work (a simple block-partitioning sketch follows below).
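    A minimal C sketch of block (domain) decomposition in the spirit of the partitioning stage above: n rows are split among p tasks so that the block sizes differ by at most one, which addresses the "tasks of comparable size" item on the checklist. Function and variable names are our own, not Foster's.

        #include <stdio.h>

        /* Compute the row range [first, first+count) owned by a given task. */
        void block_range(int n, int p, int rank, int *first, int *count) {
            int base = n / p, rem = n % p;
            *count = base + (rank < rem ? 1 : 0);          /* first 'rem' tasks get one extra row */
            *first = rank * base + (rank < rem ? rank : rem);
        }

        int main(void) {
            int n = 10, p = 4;
            for (int rank = 0; rank < p; rank++) {
                int first, count;
                block_range(n, p, rank, &first, &count);
                printf("task %d: rows %d..%d (%d rows)\n",
                       rank, first, first + count - 1, count);
            }
            return 0;
        }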
  • CSE 30321 – Lecture 23 – Introduction to Parallel Processing
    CSE 30321 – Lecture 23: Introduction to Parallel Processing. University of Notre Dame. Suggested readings: Hennessy & Patterson, Chapter 7 (over the next two weeks). Goal: explain and articulate why modern microprocessors now have more than one core and how software must adapt to accommodate the now prevalent multicore approach to computing. Topics: processor components; multicore processors and programming; processor comparison; introduction and overview; writing more efficient code; high-level-language code translation; the right hardware for the right application. Pipelining and "parallelism": [figure: four instructions, starting with a load, overlapping in the fetch, register, ALU, data-memory, and write-back stages over time] instruction execution overlaps (pseudo-parallel), but the instructions of the program are still issued sequentially. Multiprocessing (parallel) machines: Flynn's …
  • 3Dfx Oral History Panel Gordon Campbell, Scott Sellers, Ross Q. Smith, and Gary M. Tarolli
    3dfx Oral History Panel Gordon Campbell, Scott Sellers, Ross Q. Smith, and Gary M. Tarolli Interviewed by: Shayne Hodge Recorded: July 29, 2013 Mountain View, California CHM Reference number: X6887.2013 © 2013 Computer History Museum 3dfx Oral History Panel Shayne Hodge: OK. My name is Shayne Hodge. This is July 29, 2013 at the afternoon in the Computer History Museum. We have with us today the founders of 3dfx, a graphics company from the 1990s of considerable influence. From left to right on the camera-- I'll let you guys introduce yourselves. Gary Tarolli: I'm Gary Tarolli. Scott Sellers: I'm Scott Sellers. Ross Smith: Ross Smith. Gordon Campbell: And Gordon Campbell. Hodge: And so why don't each of you take about a minute or two and describe your lives roughly up to the point where you need to say 3dfx to continue describing them. Tarolli: All right. Where do you want us to start? Hodge: Birth. Tarolli: Birth. Oh, born in New York, grew up in rural New York. Had a pretty uneventful childhood, but excelled at math and science. So I went to school for math at RPI [Rensselaer Polytechnic Institute] in Troy, New York. And there is where I met my first computer, a good old IBM mainframe that we were just talking about before [this taping], with punch cards. So I wrote my first computer program there and sort of fell in love with computer. So I became a computer scientist really. So I took all their computer science courses, went on to Caltech for VLSI engineering, which is where I met some people that influenced my career life afterwards.
  • PACKET 7 BOOKSTORE 433 Lecture 5 Dr W IBM OVERVIEW
    "Processors" and multiprocessors: excerpt from the Hennessy computer architecture book, with edits by JT Wunderlich PhD, plus Dr W's material on IBM research and development. Section 7.14, Historical Perspective and Further Reading: There is a tremendous amount of history in multiprocessors; in this section we divide our discussion by both time period and architecture. We start with the SIMD (single instruction, multiple data) approach and the Illiac IV. We then turn to a short discussion of some other early experimental multiprocessors and progress to a discussion of some of the great debates in parallel processing. Next we discuss the historical roots of the present multiprocessors and conclude by discussing recent advances. SIMD computers: attractive idea, many attempts, no lasting successes. "The cost of a general multiprocessor is, however, very high and further design options were considered which would decrease the cost without seriously degrading the power or efficiency of the system. The options consist of recentralizing one of the three major components. Centralizing the [control unit] gives rise to the basic organization of [an] array processor such as the Illiac IV." (Bouknight et al. [1972]) The SIMD model was one of the earliest models of parallel computing, dating back to the first large-scale multiprocessor, the Illiac IV. The key idea in that multiprocessor, as in more recent SIMD multiprocessors, is to have a single instruction that operates on many data items at once, using many functional units (see Figure 7.14.1). Although successful in pushing several technologies that proved useful in later projects, it failed as a computer.
  • A Review of Parallel Processing Approaches to Robot Kinematics and Jacobian
    Technical Report 10/97, University of Karlsruhe, Computer Science Department, ISSN 1432-7864. A Review of Parallel Processing Approaches to Robot Kinematics and Jacobian. Dominik Henrich, Joachim Karl, and Heinz Wörn, Institute for Real-Time Computer Systems and Robotics, University of Karlsruhe, D-76128 Karlsruhe, Germany, e-mail: [email protected]. Abstract: Due to continuously increasing demands in the area of advanced robot control, it has become necessary to speed up the computation. One way to reduce the computation time is to distribute the computation onto several processing units. In this survey we present different approaches to the parallel computation of robot kinematics and the Jacobian, discussing both the forward and the reverse problem. We introduce a classification scheme and classify the references by this scheme. Keywords: parallel processing, Jacobian, robot kinematics, robot control. 1 Introduction: Due to continuously increasing demands in the area of advanced robot control, it has become necessary to speed up the computation. Since it should be possible to control the motion of a robot manipulator in real time, the computation time must be reduced to less than the cycle time of the control loop. One way to reduce the computation time is to distribute the computation over several processing units. There are other overviews and reviews of parallel processing approaches to robotic problems. Earlier overviews include [Lee89] and [Graham89]. Lee takes a closer look at parallel approaches in [Lee91], trying to find common features in the different problems of kinematics, dynamics, and Jacobian computation. The latest summary is from Zomaya et al. [Zomaya96].
  • A Survey on Parallel Multicore Computing: Performance & Improvement
    Advances in Science, Technology and Engineering Systems Journal, Vol. 3, No. 3, 152-160 (2018), ASTESJ, www.astesj.com, ISSN: 2415-6698. A Survey on Parallel Multicore Computing: Performance & Improvement. Ola Surakhi*, Mohammad Khanafseh, Sami Sarhan. University of Jordan, King Abdullah II School for Information Technology, Computer Science Department, 11942, Jordan. Article history: received 18 May 2018, accepted 11 June 2018, online 26 June 2018. Keywords: Distributed System, Dual core, Multicore, Quad core. ABSTRACT: A multicore processor combines two or more independent cores onto one integrated circuit. Although it offers good performance in terms of execution time, there are still many factors, such as the number of cores, power, and memory, that affect multicore performance and reduce it. This paper gives an overview of the evolution of the multicore architecture, with a comparison between single-, dual-, and quad-core processors. We then summarize some recent related works implemented using multicore architectures and show the factors that affect the performance of parallel multicore architectures based on their results. Finally, we cover some distributed parallel system concepts and present a comparison between them and the characteristics of multiprocessor systems. 1. Introduction: … [6]. Heterogeneous cores comprise more than one type of core; the cores are not identical, and each core can handle a different application. The latter has better performance in terms of lower power consumption, as will be shown later [2].
  • Parallelizing Multiple Flow Accumulation Algorithm Using CUDA and Openacc
    International Journal of Geo-Information, Article: Parallelizing Multiple Flow Accumulation Algorithm using CUDA and OpenACC. Natalija Stojanovic * and Dragan Stojanovic, Faculty of Electronic Engineering, University of Nis, 18000 Nis, Serbia. * Correspondence: [email protected]. Received: 29 June 2019; Accepted: 30 August 2019; Published: 3 September 2019. Abstract: Watershed analysis, as a fundamental component of digital terrain analysis, is based on the Digital Elevation Model (DEM), which is a grid (raster) model of the Earth's surface and topography. Watershed analysis consists of computationally and data-intensive algorithms that need to be implemented by leveraging parallel and high-performance computing methods and techniques. In this paper, the Multiple Flow Direction (MFD) algorithm for watershed analysis is implemented and evaluated on multi-core Central Processing Units (CPU) and many-core Graphics Processing Units (GPU), which provides significant improvements in performance and energy usage. The implementation is based on NVIDIA CUDA (Compute Unified Device Architecture) for the GPU, as well as on OpenACC (Open ACCelerators), a parallel programming model and standard for parallel computing. Both phases of the MFD algorithm, (i) iterative DEM preprocessing and (ii) the iterative MFD computation itself, are parallelized and run on a multi-core CPU and a GPU. The proposed solutions are evaluated with respect to execution time, energy consumption, and the programming effort required for algorithm parallelization for different sizes of input data. The experimental evaluation has shown not only the advantage of OpenACC over CUDA programming when implementing watershed analysis on a GPU, in terms of performance, energy consumption, and programming effort, but also significant benefits of implementing it on a multi-core CPU.
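    The abstract describes the MFD computation only at a high level. As a rough illustration of its per-cell, data-parallel nature (our own simplified sketch in C with an OpenACC pragma, not the authors' code), the following computes, for every interior DEM cell, the sum of positive elevation drops towards its eight neighbours; an MFD scheme would use such a term to split the cell's outflow among its downhill neighbours. Grid size and names are assumptions.

        #include <stdio.h>

        #define ROWS 512
        #define COLS 512

        static float dem[ROWS][COLS];       /* elevation grid (toy data) */
        static float drop_sum[ROWS][COLS];  /* sum of positive drops per cell */

        void positive_drop_sums(void) {
            /* each interior cell is independent, so the loop nest is a natural
               candidate for OpenACC offloading to a GPU (or multi-core CPU) */
            #pragma acc parallel loop collapse(2) copyin(dem) copyout(drop_sum)
            for (int i = 1; i < ROWS - 1; i++) {
                for (int j = 1; j < COLS - 1; j++) {
                    float s = 0.0f;
                    for (int di = -1; di <= 1; di++)
                        for (int dj = -1; dj <= 1; dj++) {
                            if (di == 0 && dj == 0) continue;
                            float drop = dem[i][j] - dem[i + di][j + dj];
                            if (drop > 0.0f) s += drop;  /* only downhill neighbours count */
                        }
                    drop_sum[i][j] = s;
                }
            }
        }

        int main(void) {
            for (int i = 0; i < ROWS; i++)
                for (int j = 0; j < COLS; j++)
                    dem[i][j] = (float)((i * 31 + j * 17) % 100);  /* arbitrary test surface */
            positive_drop_sums();
            printf("drop_sum[1][1] = %f\n", drop_sum[1][1]);
            return 0;
        }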
  • Chapter 3. Parallel Algorithm Design Methodology
    CSci 493.65 Parallel Computing, Prof. Stewart Weiss. Chapter 3: Parallel Algorithm Design. "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian Kernighan [1] 3.1 Task/Channel Model: The methodology used in this course is the same as that used in the Quinn textbook [2], which is the task/channel methodology described by Foster [3]. In this model, a parallel program is viewed as a collection of tasks that communicate by sending messages to each other through channels. Figure 3.1 shows a task/channel representation of a hypothetical parallel program. A task consists of an executable unit (think of it as a program), together with its local memory and a collection of I/O ports. The local memory contains program code and private data, i.e., the data to which the task has exclusive access. An access to this memory is called a local data access. The only way that a task can send copies of its local data to other tasks is through its output ports, and conversely, it can only receive data from other tasks through its input ports. An I/O port is an abstraction; it corresponds to some memory location that the task will use for sending or receiving data. Data sent or received through a channel is called a non-local data access. A channel is a message queue that connects one task's output port to another task's input port (see the message-passing sketch below).
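    The task/channel model maps naturally onto message passing. A minimal MPI sketch in C (our own illustration, not code from the chapter): task 0 pushes a copy of its private data through its output port with MPI_Send, and the matching MPI_Recv on task 1 plays the role of the channel's receiving end.

        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (rank == 0) {
                int local_result = 42;   /* private data of task 0 */
                MPI_Send(&local_result, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                int received;            /* non-local data arriving on the input port */
                MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("task 1 received %d over the channel\n", received);
            }

            MPI_Finalize();
            return 0;
        }

    Run with two processes (typically something like mpicc followed by mpirun -np 2); with more tasks and several send/receive pairs, the same pattern reproduces an arbitrary task/channel graph.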