Parallel Computing: a Brief Discussion

Marzia Rivi
Astrophysics Group, Department of Physics & Astronomy
Research Programming Social – Nov 10, 2015

What is parallel computing?
Traditionally, software has been written for serial computation:
– to be run on a single computer having a single core;
– a problem is broken into a discrete series of instructions;
– instructions are executed one after another;
– only one instruction may execute at any moment in time.
Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
– a problem is broken into discrete parts that can be solved concurrently;
– instructions from each part are executed simultaneously on different cores.

Why parallel computing?
• Save time and/or money:
  – in theory, the more resources we use, the shorter the time to finish, with potential cost savings.
• Solve larger problems:
  – some problems are so large and complex that it is impossible to solve them on a single computer, e.g. "Grand Challenge" problems requiring PetaFLOPS and PetaBytes of computing resources (en.wikipedia.org/wiki/Grand_Challenge);
  – many scientific problems can be tackled only by increasing processor performance;
  – highly complex or memory-greedy problems can be solved only with greater computing capabilities.
• Limits to serial computing: physical and practical reasons.

Who needs parallel computing?
• Researcher 1: has a large number of independent jobs (e.g. processing video files, genome sequencing, parametric studies); uses serial applications; large amounts of processing spread over a long time.
• Researcher 2: developed serial code and validated it on small problems; to publish, needs some "big problem" results.
• Researcher 3: needs to run large parallel simulations fast (e.g. molecular dynamics, computational fluid dynamics, cosmology).
Two paradigms address these needs:
• High Throughput Computing (HTC), on many computers: dynamic environment, multiple independent small-to-medium jobs, loosely connected resources (e.g. a grid).
• High Performance Computing (HPC), on a single parallel computer: static environment, single large-scale problems, tightly coupled parallelism.

How to parallelise an application?
Automatic parallelisation tools:
• compiler support for vectorisation of operations (SSE and AVX) and thread parallelisation (OpenMP);
• specific tools exist, but they are of limited practical use;
• all successful applications require intervention and steering.
Parallel code development requires:
• programming languages (with support for parallel libraries, APIs);
• parallel programming standards (such as MPI and OpenMP);
• compilers;
• performance libraries/tools (both serial and parallel).
But, more than anything, it requires understanding:
• the algorithms (program, application, solver, etc.);
• the factors that influence parallel performance.

How to parallelise an application?
• First, make it work!
  – Analyse the key features of your parallel algorithms:
    • parallelism: the type of parallel algorithm that can use parallel agents;
    • granularity: the amount of computation carried out by parallel agents;
    • dependencies: algorithmic restrictions on how the parallel work can be scheduled.
  – Re-program the application to run in parallel and validate it.
• Then, make it work well!
  – Pay attention to the key aspects of an optimal parallel execution:
    • data locality (computation vs. communication);
    • scalability (linear scaling is the holy grail: execution time inversely proportional to the number of processors).
  – Use profilers and performance tools to identify problems.

Task parallelism
Thread (or task) parallelism is based on executing different parts of the algorithm concurrently. If an algorithm is implemented as a series of independent operations, these can be spread across the processors, thus parallelising the program.
Features:
• Different independent sets of instructions are applied to a single set (or multiple sets) of data.
• May lead to work imbalance and may not scale well; performance is limited by the slowest process.
(Diagram: four tasks distributed across four CPUs between program begin and end.)
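As an illustration of task parallelism (a sketch that is not part of the original slides), independent parts of an algorithm can be expressed with OpenMP sections; the three solver routines below are hypothetical placeholders:

  #include <omp.h>
  #include <stdio.h>

  /* Hypothetical, mutually independent pieces of work standing in for parts of an algorithm. */
  void solve_gravity(void)   { printf("gravity on thread %d\n", omp_get_thread_num()); }
  void solve_fluid(void)     { printf("fluid on thread %d\n", omp_get_thread_num()); }
  void solve_radiation(void) { printf("radiation on thread %d\n", omp_get_thread_num()); }

  int main(void) {
      /* Each section is an independent task; the OpenMP runtime assigns sections to threads. */
      #pragma omp parallel sections
      {
          #pragma omp section
          solve_gravity();
          #pragma omp section
          solve_fluid();
          #pragma omp section
          solve_radiation();
      }
      return 0;
  }

With three or more threads the sections can run concurrently; with fewer, some run one after another, and the overall time is set by the slowest task, which is exactly the imbalance caveat above.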
Data parallelism
Data parallelism means spreading the data to be computed across the processors: the same set of instructions is applied to different (parts of the) data. In practice this often means distributing array elements across the computing units.
Features:
• Processors execute the same operations, but on different data sets.
• Processors work only on the data assigned to them and communicate when necessary.
• Easy to program, scales well.
• Inherent in program loops.
(Diagram: a loop over a four-element array, with its iterations distributed across four CPUs.)

Granularity
• Coarse:
  – parallelise large amounts of the total workload;
  – in general, the coarser the better;
  – minimal inter-processor communication;
  – can lead to imbalance.
• Fine:
  – parallelise small amounts of the total workload (e.g. inner loops);
  – can lead to unacceptable parallel overheads (e.g. communication).

Dependencies
Dependencies dictate the order of operations. Data dependence imposes limits on parallelism and requires parallel synchronisation.
In the following example the i index loop can be parallelised:

  Fortran:
    DO I = 1, N
      DO J = 1, N
        A(J,I) = A(J-1,I) + B(J,I)
      END DO
    END DO

  C/C++:
    for (i = 1; i < n; i++) {
      for (j = 1; j < n; j++) {
        a[j][i] = a[j-1][i] + b[j][i];
      }
    }

In this loop, parallelisation depends on the value of K:

  Fortran:
    DO I = M, N
      A(I) = A(I-K) + B(I)/C(I)
    END DO

  C/C++:
    for (i = m; i < n; i++) {
      a[i] = a[i-k] + b[i]/c[i];
    }

If K > N-M or K < M-N, parallelisation is straightforward.
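For illustration (a sketch not taken from the slides; the array declarations are assumptions), the i loop of the first example above can be handed to OpenMP because its iterations are independent, while the dependent j loop stays serial inside each iteration:

  #define N 1000
  double a[N][N], b[N][N];

  void update(void) {
      int i, j;
      /* The i iterations are independent, so they are shared out among the threads;
         the j loop carries the dependence on a[j-1][i] and runs serially within each i. */
      #pragma omp parallel for private(j)
      for (i = 1; i < N; i++) {
          for (j = 1; j < N; j++) {
              a[j][i] = a[j-1][i] + b[j][i];
          }
      }
  }

Parallelising the j loop instead would be wrong here, because each j iteration reads the value written by the previous one.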
Concepts and terminology
• Shared Memory = a computer architecture where all processors have direct access to common physical memory. It also describes a programming model in which parallel tasks can directly address and access the same logical memory locations.
• Distributed Memory = network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.

Parallel computing models
Shared memory:
• Each processor has direct access to common physical memory (e.g. multi-processors, cluster nodes).
• Agent of parallelism: the thread (a program is a collection of threads).
• Threads exchange information implicitly by reading/writing shared variables.
• Programming standard: OpenMP.
Distributed memory:
• Local processor memory is invisible to all other processors; memory access is network based (e.g. computer clusters).
• Agent of parallelism: the process (a program is a collection of processes).
• Exchanging information between processes requires communications.
• Programming standard: MPI.

OpenMP
http://www.openmp.org
An API instructing the compiler what can be done in parallel (high-level programming).
• Consisting of:
  – compiler directives;
  – runtime library functions;
  – environment variables.
• Supported by most compilers for Fortran and C/C++.
• Usable as serial code (threading is ignored by serial compilation).
• By design, suited for data parallelism.
(Diagram: the user application, with directives and environment variables, sits on top of the compiler and runtime library, which map onto threads in the operating system.)

OpenMP execution model
Threads are generated automatically at runtime and scheduled by the OS. Execution alternates between serial parts, run by the master thread only, and parallel regions, run by all threads.
• Thread creation/destruction has an overhead: minimise the number of times parallel regions are entered/exited.

  Fortran syntax:
    !$OMP PARALLEL
      ! parallel region: all threads
    !$OMP END PARALLEL

  C syntax:
    #pragma omp parallel
    {
      /* parallel region: all threads */
    }

OpenMP – example
Objective: vectorise a loop, mapping the sin operation to vector x in parallel.
Idea: instruct the compiler on what to parallelise (the loop) and how (private and shared data), and let it do the hard work.

  In C:
    #pragma omp parallel for shared(x, y, J) private(j)
    for (j = 0; j < J; j++) {
      y[j] = sin(x[j]);
    }

  In Fortran:
    !$omp parallel do shared(x, y, J) private(j)
    do j = 1, J
      y(j) = sin(x(j))
    end do
    !$omp end parallel do

The number of threads is set with the environment variable OMP_NUM_THREADS or programmatically with the runtime library function omp_set_num_threads().

MPI – Message Passing Interface
http://www.mpi-forum.org/
MPI is a specification for a distributed-memory API designed by a committee for the Fortran, C and C++ languages.
• Two versions:
  – MPI 1.0, quickly and universally adopted
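To make the message-passing model concrete, a minimal MPI program (an illustrative sketch, not taken from these slides) in which every process reports its rank could look as follows:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int rank, size;

      MPI_Init(&argc, &argv);                /* start the MPI runtime            */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id, 0 .. size-1   */
      MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes        */

      printf("Hello from process %d of %d\n", rank, size);

      MPI_Finalize();                        /* shut the MPI runtime down        */
      return 0;
  }

Compiled with an MPI wrapper compiler (e.g. mpicc) and launched with mpirun -np 4, four independent processes execute this program and can exchange data only through MPI calls.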