Parallel Programming &

Total Pages: 16

File Type: PDF, Size: 1020 KB

Parallel Programming &
Murat Keçeli

Why do we need it now?
http://www.gotw.ca/publications/concurrency-ddj.htm

Why do we need it now?
Intel® Xeon Phi™ Coprocessor 7120X (16GB, 1.238 GHz, 61 core)
http://herbsutter.com/2012/11/30/256-cores-by-2013/

Flynn's Taxonomy (1966)
Computer architectures
http://users.cis.fiu.edu/~prabakar/cda4101/Common/notes/lecture03.html

Multiple instruction multiple data (MIMD)
• Shared memory
– All processors are connected to a "globally available" memory.
– Your laptop, a smartphone, a single node in a cluster.
– Easier to implement, but not scalable.
• Distributed memory
– Each processor has its own individual memory location.
– Single processors at different nodes.
– Data is shared through messages. Harder to implement.
• Hybrid (clusters, grid computing)
• Distributed shared memory (Distributed Global Address Space)

Grid and Cloud Computing
• Scalable solutions for loosely coupled jobs.
• Cloud is the evolved version of grid computing (in terms of efficiency, QoS, and reliability).
• Crowd-sourcing: SETI@HOME, FOLDIT@HOME.
• The Clean Energy Project: 2.3 million organic compounds screened by volunteers to discover the next generation of solar cell materials (World Community Grid, IBM).
• We can write proposals for thermochemistry calculations for aromatic hydrocarbons.

Goals of parallel programming
• Linear speedup: a problem of a given size is solved N times faster on N processors.
– You can reduce time and cost.
– Speedup: S_N = t_1 / t_N (serial execution time over parallel execution time); linear speedup means S_N = N.
– Efficiency: E_N = S_N / N, with 0 < E_N ≤ 1.
• Scalability: a problem that is N times bigger is solved in the same amount of time on N processors.
– You can attack larger problems.

Amdahl's law
For a parallel portion p (0 ≤ p ≤ 1), the speedup on N processors is S_N = 1 / ((1 − p) + p/N). For example, even with p = 0.95 the speedup can never exceed 1/(1 − p) = 20, no matter how many processors are used.
http://en.wikipedia.org/wiki/Amdahl's_law

Parallelization Tools
• Auto-parallelization
• Libraries (Intel Threading Building Blocks, Intel MKL, Boost)
• Cilk, Unified Parallel C, Coarray Fortran
• Functional programming languages (Lisp, F#)
• OpenMP (Open Multi-Processing, shared memory)
• MPI (Message Passing Interface, distributed memory)
• Java is designed for thread-level parallelism: java.util.concurrent
• Python (https://wiki.python.org/moin/ParallelProcessing)
– Global interpreter lock: the mechanism that ensures only one thread executes Python bytecode at a time.

How to do parallel programming
• Start with the chunk that takes the most time.
• Decide on the parallelization scheme based on the available hardware and software.
• Divide the chunk into subtasks such that:
– dependencies between subtasks are minimal (minimizes communication),
– each process has its own data (data independence),
– no process needs to wait for another's functions to finish (functional independence),
– the workload is distributed equally (minimizes latency).

SCOOP
• Scalable COncurrent Operations in Python (SCOOP) is a distributed task module allowing concurrent parallel programming on various environments, from heterogeneous grids to supercomputers.
– The future is parallel;
– Simple is beautiful;
– Parallelism should be simpler.
http://code.google.com/p/scoop/

Hello World
• Results of a map are always ordered, even if their computation was made asynchronously on multiple computers.
http://code.google.com/p/scoop/
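The Hello World slide refers to SCOOP's futures.map. A minimal sketch of such a script is shown below, assuming SCOOP is installed and the program is launched through SCOOP's worker launcher (python -m scoop); it illustrates the idea rather than reproducing the slide's exact code.

    # hello_scoop.py -- minimal SCOOP "hello world" sketch (illustrative).
    # Run with:  python -m scoop hello_scoop.py
    from scoop import futures

    def hello(index):
        # Each call may execute on a different worker, possibly on another host.
        return "Hello from task {}".format(index)

    if __name__ == "__main__":
        # futures.map distributes the calls, but results come back in the original
        # order even when they finish asynchronously, as the slide points out.
        for message in futures.map(hello, range(8)):
            print(message)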
RMG & Thermochemistry
• Thermochemical parameters (enthalpy, entropy, heat capacity) are important for reaction equilibrium constants, kinetic parameter estimates, and thermal effects.
• They affect both the mechanism generation process and the behavior of the final model.
• Estimates are based on the group additivity approach of Benson.
– This method is fast and can be improved by adding more parameterization.
– It is harder to parallelize: hierarchical search, database sharing.
– It currently fails for aromatic species and is liable to fail for any species outside its parameterization scope.
– As the applications of RMG start to vary, this module needs to be updated for ad hoc corrections.

QMTP (Greg Magoon)
• The quantum mechanics thermodynamic property (QMTP) module is designed for on-the-fly quantum and force-field calculations of thermochemical parameters.
– It must be linked to third-party programs.
– Error checking is required.
– It is slow; speed depends on the calculation method and the software chosen.
– The calculations are uncoupled (embarrassingly parallel), so this module is much easier to parallelize (see the sketch after these slides).
– Both speed and reliability improvements come from outside.

QMTP Design (Greg Magoon's thesis, 2012)

1,3-Hexadiene without QM
Serial run: 3 minutes.
[Profiling call graph; extraction residue removed. The graph gives a cumulative-time breakdown over RMG routines such as model:enlarge, thermo:getThermoDataFromGroups and thermo:estimateThermoViaGroupAdditivity, family:generateReactions, the pressure-dependence (pdep) network update, and statistical-mechanics fitting (statmechfit).]
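The QMTP slide above notes that the jobs are uncoupled (embarrassingly parallel), so a task-parallel map is enough to fan them out. The sketch below uses SCOOP again; the job function, its return values, and the species names are illustrative placeholders, not RMG's actual QMTP interface.

    # parallel_qmtp.py -- sketch of farming out uncoupled thermochemistry jobs.
    # Run with:  python -m scoop parallel_qmtp.py
    # run_qm_job and species_list are illustrative placeholders, not the real
    # RMG/QMTP interface.
    from scoop import futures

    def run_qm_job(species):
        # Placeholder: in reality this would launch an external QM or force-field
        # calculation and parse out enthalpy, entropy, and heat capacities.
        return {"species": species, "H298": 0.0, "S298": 0.0}

    if __name__ == "__main__":
        species_list = ["benzene", "naphthalene", "pyrene"]
        # Each species is an independent task, so the map parallelizes trivially.
        for result in futures.map(run_qm_job, species_list):
            print(result)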
Recommended publications
  • Other APIs: What's Wrong with OpenMP?
    Threaded Programming: Other APIs. What's wrong with OpenMP? • OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming CPU cycles. – cannot arbitrarily start/stop threads – cannot put threads to sleep and wake them up later • OpenMP is good for programs where each thread is doing (more-or-less) the same thing. • Although OpenMP supports C++, it’s not especially OO friendly – though it is gradually getting better. • OpenMP doesn’t support other popular base languages – e.g. Java, Python. What’s wrong with OpenMP? (cont.) Threaded programming APIs • Essential features – a way to create threads – a way to wait for a thread to finish its work – a mechanism to support thread private data – some basic synchronisation methods – at least a mutex lock, or atomic operations • Optional features – support for tasks – more synchronisation methods – e.g. condition variables, barriers,... – higher levels of abstraction – e.g. parallel loops, reductions. What are the alternatives? • POSIX threads • C++ threads • Intel TBB • Cilk • OpenCL • Java (not an exhaustive list!) POSIX threads • POSIX threads (or Pthreads) is a standard library for shared memory programming without directives. – Part of the ANSI/IEEE 1003.1 standard (1996) • Interface is a C library – no standard Fortran interface – can be used with C++, but not OO friendly • Widely available – even for Windows – typically installed as part of OS – code is pretty portable • Lots of low-level control over behaviour of threads • Lacks a proper memory consistency model. Thread forking: #include <pthread.h> int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg); • Creates a new thread: – first argument returns a pointer to a thread descriptor.
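    The "essential features" listed above (thread creation, joining, thread-private data, a mutex) have rough analogues in Python's standard threading module. The sketch below is an illustrative Python analogue of those features, not Pthreads itself.

        # Rough Python analogue of the "essential features" above, using the
        # standard library's threading module (illustrative, not Pthreads).
        import threading

        counter = 0
        lock = threading.Lock()    # basic synchronisation: a mutex
        tls = threading.local()    # thread-private data

        def worker(name):
            global counter
            tls.name = name        # each thread sees its own copy of tls.name
            with lock:             # protect the shared counter
                counter += 1

        threads = [threading.Thread(target=worker, args=("t%d" % i,)) for i in range(4)]
        for t in threads:
            t.start()              # create (fork) the threads
        for t in threads:
            t.join()               # wait for each thread to finish its work
        print(counter)             # -> 4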
  • L22: Parallel Programming Language Features (Chapel and MapReduce)
    L22: Parallel Programming Language Features (Chapel and MapReduce), December 1, 2009
    Administrative
    • Schedule for the rest of the semester
    - "Midterm Quiz" = long homework
    - Handed out over the holiday (Tuesday, Dec. 1)
    - Return by Dec. 15
    - Projects
    - 1 page status report on Dec. 3 – handin cs4961 pdesc <file, ascii or PDF ok>
    - Poster session dry run (to see material) Dec. 8
    - Poster details (next slide)
    • Mailing list: [email protected]
    Poster Details
    • I am providing:
    - Foam core, tape, push pins, easels
    - Plan on 2ft by 3ft or so of material (9-12 slides)
    • Content:
    - Problem description and why it is important
    - Parallelization challenges
    - Parallel Algorithm
    - How are two programming models combined?
    - Performance results (speedup over sequential)
    Outline
    • Global View Languages
    • Chapel Programming Language
    • Map-Reduce (popularized by Google)
    • Reading: Ch. 8 and 9 in textbook
    • Sources for today's lecture
    - Brad Chamberlain, Cray
    - John Gilbert, UCSB
    Shifting Gears
    • What are some important features of parallel programming languages (Ch. 9)?
    - Correctness
    - Performance
    - Scalability
    - Portability
    Global View Versus Local View
    • P-Independence
    - If and only if a program always produces the same output on the same input regardless of number or arrangement of processors
    • Global view
    - A language construct that preserves P-independence
    - Example (today's lecture)
    • Local view
    - Does not preserve P-independent program behavior
    - Example from previous lecture? And what about ease
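    The outline above mentions Map-Reduce (popularized by Google). As a purely illustrative toy, the two phases of the model can be sketched sequentially in Python with a word count; a real framework would distribute the map and reduce steps across many machines.

        # Toy word count in the map-reduce style (sequential, illustrative only).
        from collections import Counter
        from functools import reduce

        def mapper(line):
            # Map phase: emit a partial word count for one input record.
            return Counter(line.split())

        def reducer(a, b):
            # Reduce phase: merge two partial counts.
            a.update(b)
            return a

        lines = ["the quick brown fox", "the lazy dog", "the fox"]
        word_counts = reduce(reducer, map(mapper, lines), Counter())
        print(word_counts)  # e.g. Counter({'the': 3, 'fox': 2, ...})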
  • Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++
    Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++ Christopher S. Zakian, Timothy A. K. Zakian, Abhishek Kulkarni, Buddhika Chamith, and Ryan R. Newton Indiana University - Bloomington, fczakian, tzakian, adkulkar, budkahaw, [email protected] Abstract. Library and language support for scheduling non-blocking tasks has greatly improved, as have lightweight (user) threading packages. However, there is a significant gap between the two developments. In previous work, and in today's software packages, lightweight thread creation incurs much larger overheads than tasking libraries, even on tasks that end up never blocking. This limitation can be removed. To that end, we describe an extension to the Intel Cilk Plus runtime system, Concurrent Cilk, where tasks are lazily promoted to threads. Concurrent Cilk removes the overhead of thread creation on threads which end up calling no blocking operations, and is the first system to do so for C/C++ with legacy support (standard calling conventions and stack representations). We demonstrate that Concurrent Cilk adds negligible overhead to existing Cilk programs, while its promoted threads remain more efficient than OS threads in terms of context-switch overhead and blocking communication. Further, it enables development of blocking data structures that create non-fork-join dependence graphs, which can expose more parallelism, and better supports data-driven computations waiting on results from remote devices. 1 Introduction Both task-parallelism [1, 11, 13, 15] and lightweight threading [20] libraries have become popular for different kinds of applications. The key difference between a task and a thread is that threads may block, for example when performing IO, and then resume again.
  • Parallel Programming
    Parallel Programming Libraries and implementations Outline • MPI – distributed memory de-facto standard • Using MPI • OpenMP – shared memory de-facto standard • Using OpenMP • CUDA – GPGPU de-facto standard • Using CUDA • Others • Hybrid programming • Xeon Phi Programming • SHMEM • PGAS MPI Library Distributed, message-passing programming Message-passing concepts Explicit Parallelism • In message-passing all the parallelism is explicit • The program includes specific instructions for each communication • What to send or receive • When to send or receive • Synchronisation • It is up to the developer to design the parallel decomposition and implement it • How will you divide up the problem? • When will you need to communicate between processes? Message Passing Interface (MPI) • MPI is a portable library used for writing parallel programs using the message passing model • You can expect MPI to be available on any HPC platform you use • Based on a number of processes running independently in parallel • HPC resource provides a command to launch multiple processes simultaneously (e.g. mpiexec, aprun) • There are a number of different implementations but all should support the MPI 2 standard • As with different compilers, there will be variations between implementations but all the features specified in the standard should work. • Examples: MPICH2, OpenMPI Point-to-point communications • A message sent by one process and received by another • Both processes are actively involved in the communication – not necessarily at the same time • Wide variety of semantics provided: • Blocking vs. non-blocking • Ready vs. synchronous vs. buffered • Tags, communicators, wild-cards • Built-in and custom data-types • Can be used to implement any communication pattern • Collective operations, if applicable, can be more efficient Collective communications • A communication that involves all processes • “all” within a communicator, i.e.
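    The point-to-point and collective operations described in this excerpt can be exercised from Python through mpi4py, assuming an MPI implementation and the mpi4py package are available; a small sketch:

        # mpi_demo.py -- point-to-point and collective communication via mpi4py.
        # Assumes an MPI library plus mpi4py; run with e.g.:  mpiexec -n 4 python mpi_demo.py
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        # Point-to-point: rank 0 sends, rank 1 receives (both blocking calls).
        if rank == 0:
            comm.send({"greeting": "hello"}, dest=1, tag=11)
        elif rank == 1:
            msg = comm.recv(source=0, tag=11)
            print("rank 1 received", msg)

        # Collective: every rank contributes its rank number; the sum arrives on rank 0.
        total = comm.reduce(rank, op=MPI.SUM, root=0)
        if rank == 0:
            print("sum of ranks =", total)

    The lowercase send/recv/reduce methods pickle arbitrary Python objects; mpi4py also provides buffer-based Send/Recv variants for data such as NumPy arrays.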
  • BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btq011
    Vol. 26 no. 5 2010, pages 705–707 BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btq011 Databases and ontologies Advance Access publication January 19, 2010 ProteinWorldDB: querying radical pairwise alignments among protein sets from complete genomes Thomas Dan Otto1,2,∗, Marcos Catanho1, Cristian Tristão3, Márcia Bezerra3, Renan Mathias Fernandes4, Guilherme Steinberger Elias4, Alexandre Capeletto Scaglia4, Bill Bovermann5, Viktors Berstis5, Sergio Lifschitz3, Antonio Basílio de Miranda1 and Wim Degrave1 1Laboratório de Genômica Funcional e Bioinformática, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil, 2Pathogen Genomics, Wellcome Trust Genome Campus, Hinxton, UK, 3Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, 4IBM Brasil, Hortolândia, São Paulo, Brazil and 5IBM, Austin, TX, USA Associate Editor: Alfonso Valencia ABSTRACT nomenclature or might have no value when inferred from previous Motivation: Many analyses in modern biological research are incorrectly annotated sequences. Hence, secondary databases based on comparisons between biological sequences, resulting such as Swiss-Prot (http://www.expasy.ch/sprot/), PFAM (http:// in functional, evolutionary and structural inferences. When large pfam.sanger.ac.uk) or KEGG (http://www.genome.ad.jp/kegg), to numbers of sequences are compared, heuristics are often used mention only a few, have been implemented to analyze specific resulting in a certain lack of accuracy. In order to improve functional aspects and to improve the annotation procedures and and validate results of such comparisons, we have performed results. radical all-against-all comparisons of 4 million protein sequences Dynamic programming algorithms, or a fast approximation, belonging to the RefSeq database, using an implementation of the have been successfully applied to biological sequence comparison Smith–Waterman algorithm.
  • Parallelism in Cilk Plus
    Cilk Plus: Language Support for Thread and Vector Parallelism Arch D. Robison Intel Sr. Principal Engineer Outline Motivation for Intel® Cilk Plus SIMD notations Fork-Join notations Karatsuba multiplication example GCC branch Multi-Threading and Vectorization are Essential to Performance Latest Intel® Xeon® chip: 8 cores 2 independent threads per core 8-lane (single prec.) vector unit per thread = 128-fold potential for single socket Intel® Many Integrated Core Architecture >50 cores (KNC) ? independent threads per core 16-lane (single prec.) vector unit per thread = parallel heaven Importance of Abstraction Software outlives hardware. Recompiling is easier than rewriting. Coding too closely to hardware du jour makes moving to new hardware expensive. C++ philosophy: abstraction with minimal penalty Do not expect compiler to be clever. But let it do tedious bookkeeping. “Three Layer Cake” Abstraction Message Passing exploit multiple nodes Fork-Join exploit multiple cores exploit parallelism at multiple algorithmic levels SIMD exploit vector hardware Composition Message Driven compose via send/receive Fork-Join compose via call/return SIMD compose sequentially
  • Outro to Parallel Computing
    Outro To Parallel Computing John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Purpose of this talk Now that you know how to do some real parallel programming, you may wonder how much you don’t know. With your newly informed perspective we will take a look at the parallel software landscape so that you can see how much of it you are equipped to traverse. How parallel is a code? ⚫ Parallel performance is defined in terms of scalability Strong Scalability Can we get faster for a given problem size? Weak Scalability Can we maintain runtime as we scale up the problem? Weak vs. Strong scaling: more processors with weak scaling give more accurate results; more processors with strong scaling give faster results (Tornado on way!) Your Scaling Enemy: Amdahl’s Law How many processors can we really use? Let’s say we have a legacy code such that it is only feasible to convert half of the heavily used routines to parallel: Amdahl’s Law If we run this on a parallel machine with five processors: Our code now takes about 60s. We have sped it up by about 40%. Let’s say we use a thousand processors: We have now sped our code by about a factor of two. Is this a big enough win? Amdahl’s Law ⚫ If there is x% of serial component, speedup cannot be better than 100/x. ⚫ If you decompose a problem into many parts, then the parallel time cannot be less than the largest of the parts. ⚫ If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use.
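    The arithmetic in this excerpt (half of the code parallelized, run on 5 and then 1000 processors) can be checked with a few lines of Python; the 100-second serial baseline is an assumption chosen to reproduce the quoted 60 s figure.

        # Amdahl's law check for the legacy-code example above.
        # Assumes a 100 s serial baseline with half of the work parallelizable.
        def parallel_time(serial_time, parallel_fraction, processors):
            serial_part = serial_time * (1.0 - parallel_fraction)
            parallel_part = serial_time * parallel_fraction / processors
            return serial_part + parallel_part

        t1 = 100.0   # assumed serial runtime in seconds
        p = 0.5      # half of the heavily used routines converted to parallel

        for n in (5, 1000):
            tn = parallel_time(t1, p, n)
            print("N=%4d: %6.2f s, speedup %.2fx" % (n, tn, t1 / tn))
        # N=   5:  60.00 s, speedup 1.67x   (about 40% faster)
        # N=1000:  50.05 s, speedup 2.00x   (about a factor of two)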
  • The Continuing Renaissance in Parallel Programming Languages
    The continuing renaissance in parallel programming languages Simon McIntosh-Smith University of Bristol Microelectronics Research Group [email protected] Didn’t parallel computing use to be a niche? When I were a lad… But now parallelism is mainstream Samsung Exynos 5 Octa: • 4 fast ARM cores and 4 energy efficient ARM cores • Includes OpenCL programmable GPU from Imagination HPC scaling to millions of cores Tianhe-2 at NUDT in China 33.86 PetaFLOPS (33.86×10^15), 16,000 nodes Each node has 2 CPUs and 3 Xeon Phis 3.12 million cores, $390M, 17.6 MW, 720m2 A renaissance in parallel programming Metal C++11 OpenMP OpenCL Erlang Unified Parallel C Fortress XC Go Cilk HMPP CHARM++ CUDA Co-Array Fortran Chapel Linda X10 MPI Pthreads C++ AMP Groupings of || languages: Partitioned Global Address Space (PGAS): Fortress, X10, Chapel, Co-array Fortran, Unified Parallel C. GPU languages: OpenCL, CUDA, HMPP, Metal. Object oriented: C++ AMP, CHARM++. CSP: XC. Message passing: MPI. Multi-threaded: Cilk, Go, C++11. Shared memory: OpenMP. Emerging GPGPU standards • OpenCL, DirectCompute, C++ AMP, … • Also OpenMP 4.0, OpenACC, CUDA… Apple's Metal • A "ground up" parallel programming language for GPUs • Designed for compute and graphics • Potential to replace OpenGL compute shaders, OpenCL/GL interop etc. • Close to the "metal" • Low overheads • "Shading" language based
  • C Language Extensions for Hybrid CPU/GPU Programming with StarPU
    C Language Extensions for Hybrid CPU/GPU Programming with StarPU Ludovic Courtès arXiv:1304.0878v2 [cs.MS] 10 Apr 2013 Project-Team Runtime Research Report n° 8278 — April 2013 — 22 pages ISSN 0249-6399 ISRN INRIA/RR--8278--FR+ENG Abstract: Modern platforms used for high-performance computing (HPC) include machines with both general-purpose CPUs, and “accelerators”, often in the form of graphical processing units (GPUs). StarPU is a C library to exploit such platforms. It provides users with ways to define tasks to be executed on CPUs or GPUs, along with the dependencies among them, and by automatically scheduling them over all the available processing units. In doing so, it also relieves programmers from the need to know the underlying architecture details: it adapts to the available CPUs and GPUs, and automatically transfers data between main memory and GPUs as needed. While StarPU’s approach is successful at addressing run-time scheduling issues, being a C library makes for a poor and error-prone programming interface. This paper presents an effort started in 2011 to promote some of the concepts exported by the library as C language constructs, by means of an extension of the GCC compiler suite. Our main contribution is the design and implementation of language extensions that map to StarPU’s task programming paradigm. We argue that the proposed extensions make it easier to get started with StarPU, eliminate errors that can occur when using the C library, and help diagnose possible mistakes.
  • Unified Parallel C (UPC)
    Unified Parallel C (UPC) Vivek Sarkar Department of Computer Science Rice University [email protected] COMP 422 Lecture 21 March 27, 2008 Acknowledgments • Supercomputing 2007 tutorial on “Programming using the Partitioned Global Address Space (PGAS) Model” by Tarek El-Ghazawi and Vivek Sarkar — http://sc07.supercomputing.org/schedule/event_detail.php?evid=11029 Programming Models (process/thread vs. address space): Message Passing – MPI; Shared Memory – OpenMP; DSM/PGAS – UPC. The Partitioned Global Address Space (PGAS) Model • Aka the Distributed Shared Memory (DSM) model • Concurrent threads with a partitioned shared space — memory partition Mi has affinity to thread Thi [figure: threads Th0 … Thn-1, each with an associated memory partition M0 … Mn-1] • (+)ive: — Data movement is implicit — Data distribution simplified with global address space • (-)ive: — Computation distribution and synchronization still remain programmer’s responsibility — Lack of performance transparency of local vs. remote accesses • UPC, CAF, Titanium, X10, … What is Unified Parallel C (UPC)? • An explicit parallel extension of ISO C • Single global address space • Collection of threads — each thread bound to a processor — each thread has some private data — part of the shared data is co-located with each thread • Set of specs for a parallel C — v1.0 completed February of 2001 — v1.1.1 in October of 2003 — v1.2 in May of 2005 • Compiler implementations by industry and academia UPC Execution Model • A number of threads working independently in a SPMD fashion — MYTHREAD specifies
  • IBM XL Unified Parallel C User's Guide
    IBM XL Unified Parallel C for AIX, V11.0 (Technology Preview): IBM XL Unified Parallel C User’s Guide, Version 11.0. Note: Before using this information and the product it supports, read the information in “Notices” on page 97. © Copyright International Business Machines Corporation 2010. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
    Contents:
    Chapter 1. Parallel programming and Unified Parallel C . . . 1 (Parallel programming, 1; Partitioned global address space programming model, 1; Unified Parallel C introduction, 2)
    Chapter 2. Unified Parallel C programming model . . . 3 (Distributed shared memory programming, 3; Data affinity and data distribution, 3; Memory consistency, 5; Synchronization mechanism, 6)
    Chapter 3. Using the XL Unified Parallel C compiler . . . 9 (Compiler options, 9)
    Declarations, 39; Type qualifiers, 39; Declarators, 41; Statements and blocks, 42; Synchronization statements, 42; Iteration statements, 46; Predefined macros and directives, 47; Unified Parallel C directives, 47; Predefined macros, 48
    Chapter 5. Unified Parallel C library functions . . . 49 (Utility functions, 49; Program termination, 49; Dynamic memory allocation, 50; Pointer-to-shared manipulation, 56)
  • A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization
    TREES: A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization Blake A. Hechtman, Andrew D. Hilton, and Daniel J. Sorin Department of Electrical and Computer Engineering Duke University
    Abstract — We have developed a task-parallel runtime system, called TREES, that is designed for high performance on CPU/GPU platforms. On platforms with multiple CPUs, Cilk’s “work-first” principle underlies how task-parallel applications can achieve performance, but work-first is a poor fit for GPUs. We build upon work-first to create the “work-together” principle that addresses the specific strengths and weaknesses of GPUs. The work-together principle extends work-first by stating that (a) the overhead on the critical path should be paid by the entire system at once and (b) work overheads should be paid co-operatively. We have implemented the TREES runtime in OpenCL, and we experimentally evaluate TREES applications on a CPU/GPU platform.
    … targeting CPUs are a poor fit for GPUs. To understand why this mismatch exists, we must first understand the performance of an idealized task-parallel application (with no runtime) and then how the runtime’s overhead affects it. The performance of a task-parallel application is a function of two characteristics: its total amount of work to be performed (T1, the time to execute on 1 processor) and its critical path (T∞, the time to execute on an infinite number of processors). Prior work has shown that the runtime of a system with P processors, TP, is bounded by TP = O(T1/P) + O(T∞) due to the greedy offline scheduler bound [3][10]. A task-parallel runtime introduces overheads and, for purposes of performance analysis, we distinguish
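    The T1/T∞ bound quoted in this excerpt is easy to explore numerically; the following small calculator is illustrative only, and the work and critical-path values in it are made-up examples rather than measurements from the paper.

        # Illustrative check of the greedy-scheduler bound T_P <= T_1/P + T_inf.
        def greedy_bound(t1, t_inf, p):
            # Upper bound on runtime with p processors under a greedy scheduler.
            return t1 / p + t_inf

        t1, t_inf = 1000.0, 10.0   # total work and critical path (arbitrary units)
        for p in (1, 8, 64, 1024):
            bound = greedy_bound(t1, t_inf, p)
            print("P=%4d  time bound=%8.2f  speedup >= %6.2f" % (p, bound, t1 / bound))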