Parallel Programming &

Total Pages: 16

File Type: PDF, Size: 1020 KB

Parallel Programming &
Murat Keçeli

Why do we need it now?
http://www.gotw.ca/publications/concurrency-ddj.htm

Why do we need it now?
Intel® Xeon Phi™ Coprocessor 7120X (16GB, 1.238 GHz, 61 core)
http://herbsutter.com/2012/11/30/256-cores-by-2013/

Flynn's Taxonomy (1966)
Computer architectures
http://users.cis.fiu.edu/~prabakar/cda4101/Common/notes/lecture03.html

Multiple instruction multiple data (MIMD)
• Shared memory
– All processors are connected to a "globally available" memory.
– Your laptop, a smartphone, a single node in a cluster.
– Easier to implement, but not scalable.
• Distributed memory
– Each processor has its own individual memory location.
– Single processors at different nodes.
– Data is shared through messages. Harder to implement.
• Hybrid (clusters, grid computing)
• Distributed shared memory (Distributed Global Address Space)

Grid and Cloud Computing
• Scalable solutions for loosely coupled jobs.
• Cloud is the evolved version of grid computing (in terms of efficiency, QoS, and reliability).
• Crowd-sourcing: SETI@HOME, FOLDIT@HOME.
• The Clean Energy Project: 2.3 million organic compounds screened by volunteers to discover the next generation of solar cell materials (World Community Grid, IBM).
• We can write proposals for thermochemistry calculations for aromatic hydrocarbons.

Goals of parallel programming
• Linear speedup: a problem of a given size is solved N times faster on N processors.
– You can reduce time and cost.
– Speedup: S_N = t_1 / t_N (serial execution time over parallel execution time); linear speedup means S_N = N.
– Efficiency: E_N = S_N / N, with 0 < E_N ≤ 1.
• Scalability: a problem that is N times bigger is solved in the same amount of time on N processors.
– You can attack larger problems.

Amdahl's law
For a parallel portion p (0 ≤ p ≤ 1), the speedup on N processors is S_N = 1 / ((1 − p) + p/N). For example, even with p = 0.95 the speedup can never exceed 1/(1 − p) = 20, no matter how many processors are used.
http://en.wikipedia.org/wiki/Amdahl's_law

Parallelization Tools
• Auto-parallelization
• Libraries (Intel Threading Building Blocks, Intel MKL, Boost)
• Cilk, Unified Parallel C, Coarray Fortran
• Functional programming languages (Lisp, F#)
• OpenMP (Open Multi-Processing, shared memory)
• MPI (Message Passing Interface, distributed memory)
• Java is designed for thread-level parallelism: java.util.concurrent
• Python (https://wiki.python.org/moin/ParallelProcessing)
– Global interpreter lock: the mechanism that ensures only one thread executes Python bytecode at a time.

How to do parallel programming
• Start with the chunk that takes the most time.
• Decide on the parallelization scheme based on the available hardware and software.
• Divide the chunk into subtasks such that:
– dependencies between subtasks are minimal (minimizes communication),
– each process has its own data (data independence),
– no process needs to wait for another's functions to finish (functional independence),
– the workload is distributed equally (minimizes latency).

SCOOP
• Scalable COncurrent Operations in Python (SCOOP) is a distributed task module allowing concurrent parallel programming on various environments, from heterogeneous grids to supercomputers.
– The future is parallel;
– Simple is beautiful;
– Parallelism should be simpler.
http://code.google.com/p/scoop/

Hello World
• Results of a map are always ordered, even if their computation was made asynchronously on multiple computers.
http://code.google.com/p/scoop/
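The Hello World slide refers to SCOOP's futures.map. A minimal sketch of such a script is shown below, assuming SCOOP is installed and the program is launched through SCOOP's worker launcher (python -m scoop); it illustrates the idea rather than reproducing the slide's exact code.

    # hello_scoop.py -- minimal SCOOP "hello world" sketch (illustrative).
    # Run with:  python -m scoop hello_scoop.py
    from scoop import futures

    def hello(index):
        # Each call may execute on a different worker, possibly on another host.
        return "Hello from task {}".format(index)

    if __name__ == "__main__":
        # futures.map distributes the calls, but results come back in the original
        # order even when they finish asynchronously, as the slide points out.
        for message in futures.map(hello, range(8)):
            print(message)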
RMG & Thermochemistry
• Thermochemical parameters (enthalpy, entropy, heat capacity) are important for reaction equilibrium constants, kinetic parameter estimates, and thermal effects.
• They affect both the mechanism generation process and the behavior of the final model.
• Estimates are based on the group additivity approach of Benson.
– This method is fast and can be improved by adding more parameterization.
– It is harder to parallelize: hierarchical search, database sharing.
– It currently fails for aromatic species and is liable to fail for any species outside its parameterization scope.
– As the applications of RMG start to vary, this module needs to be updated for ad hoc corrections.

QMTP (Greg Magoon)
• The quantum mechanics thermodynamic property (QMTP) module is designed for on-the-fly quantum and force-field calculations of thermochemical parameters.
– It must be linked to third-party programs.
– Error checking is required.
– It is slow; speed depends on the calculation method and the software chosen.
– The calculations are uncoupled (embarrassingly parallel), so this module is much easier to parallelize (see the sketch after these slides).
– Both speed and reliability improvements come from outside.

QMTP Design (Greg Magoon's thesis, 2012)

1,3-Hexadiene without QM
Serial run: 3 minutes.
[Profiling call graph; extraction residue removed. The graph gives a cumulative-time breakdown over RMG routines such as model:enlarge, thermo:getThermoDataFromGroups and thermo:estimateThermoViaGroupAdditivity, family:generateReactions, the pressure-dependence (pdep) network update, and statistical-mechanics fitting (statmechfit).]
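The QMTP slide above notes that the jobs are uncoupled (embarrassingly parallel), so a task-parallel map is enough to fan them out. The sketch below uses SCOOP again; the job function, its return values, and the species names are illustrative placeholders, not RMG's actual QMTP interface.

    # parallel_qmtp.py -- sketch of farming out uncoupled thermochemistry jobs.
    # Run with:  python -m scoop parallel_qmtp.py
    # run_qm_job and species_list are illustrative placeholders, not the real
    # RMG/QMTP interface.
    from scoop import futures

    def run_qm_job(species):
        # Placeholder: in reality this would launch an external QM or force-field
        # calculation and parse out enthalpy, entropy, and heat capacities.
        return {"species": species, "H298": 0.0, "S298": 0.0}

    if __name__ == "__main__":
        species_list = ["benzene", "naphthalene", "pyrene"]
        # Each species is an independent task, so the map parallelizes trivially.
        for result in futures.map(run_qm_job, species_list):
            print(result)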
Recommended publications
  • Other APIs: What's Wrong with OpenMP?
    Threaded Programming: Other APIs. What's wrong with OpenMP? • OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming CPU cycles. – cannot arbitrarily start/stop threads – cannot put threads to sleep and wake them up later • OpenMP is good for programs where each thread is doing (more-or-less) the same thing. • Although OpenMP supports C++, it’s not especially OO friendly – though it is gradually getting better. • OpenMP doesn’t support other popular base languages – e.g. Java, Python. What’s wrong with OpenMP? (cont.) Threaded programming APIs • Essential features – a way to create threads – a way to wait for a thread to finish its work – a mechanism to support thread private data – some basic synchronisation methods – at least a mutex lock, or atomic operations • Optional features – support for tasks – more synchronisation methods – e.g. condition variables, barriers,... – higher levels of abstraction – e.g. parallel loops, reductions. What are the alternatives? • POSIX threads • C++ threads • Intel TBB • Cilk • OpenCL • Java (not an exhaustive list!) POSIX threads • POSIX threads (or Pthreads) is a standard library for shared memory programming without directives. – Part of the ANSI/IEEE 1003.1 standard (1996) • Interface is a C library – no standard Fortran interface – can be used with C++, but not OO friendly • Widely available – even for Windows – typically installed as part of OS – code is pretty portable • Lots of low-level control over behaviour of threads • Lacks a proper memory consistency model. Thread forking: #include <pthread.h> int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg); • Creates a new thread: – first argument returns a pointer to a thread descriptor.
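    The "essential features" listed above (thread creation, joining, thread-private data, a mutex) have rough analogues in Python's standard threading module. The sketch below is an illustrative Python analogue of those features, not Pthreads itself.

        # Rough Python analogue of the "essential features" above, using the
        # standard library's threading module (illustrative, not Pthreads).
        import threading

        counter = 0
        lock = threading.Lock()    # basic synchronisation: a mutex
        tls = threading.local()    # thread-private data

        def worker(name):
            global counter
            tls.name = name        # each thread sees its own copy of tls.name
            with lock:             # protect the shared counter
                counter += 1

        threads = [threading.Thread(target=worker, args=("t%d" % i,)) for i in range(4)]
        for t in threads:
            t.start()              # create (fork) the threads
        for t in threads:
            t.join()               # wait for each thread to finish its work
        print(counter)             # -> 4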
  • L22: Parallel Programming Language Features (Chapel and MapReduce)
    L22: Parallel Programming Language Features (Chapel and MapReduce), December 1, 2009
    Administrative
    • Schedule for the rest of the semester
    - "Midterm Quiz" = long homework
    - Handed out over the holiday (Tuesday, Dec. 1)
    - Return by Dec. 15
    - Projects
    - 1 page status report on Dec. 3 – handin cs4961 pdesc <file, ascii or PDF ok>
    - Poster session dry run (to see material) Dec. 8
    - Poster details (next slide)
    • Mailing list: [email protected]
    Poster Details
    • I am providing:
    - Foam core, tape, push pins, easels
    - Plan on 2ft by 3ft or so of material (9-12 slides)
    • Content:
    - Problem description and why it is important
    - Parallelization challenges
    - Parallel Algorithm
    - How are two programming models combined?
    - Performance results (speedup over sequential)
    Outline
    • Global View Languages
    • Chapel Programming Language
    • Map-Reduce (popularized by Google)
    • Reading: Ch. 8 and 9 in textbook
    • Sources for today's lecture
    - Brad Chamberlain, Cray
    - John Gilbert, UCSB
    Shifting Gears
    • What are some important features of parallel programming languages (Ch. 9)?
    - Correctness
    - Performance
    - Scalability
    - Portability
    Global View Versus Local View
    • P-Independence
    - If and only if a program always produces the same output on the same input regardless of number or arrangement of processors
    • Global view
    - A language construct that preserves P-independence
    - Example (today's lecture)
    • Local view
    - Does not preserve P-independent program behavior
    - Example from previous lecture? And what about ease
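    The outline above mentions Map-Reduce (popularized by Google). As a purely illustrative toy, the two phases of the model can be sketched sequentially in Python with a word count; a real framework would distribute the map and reduce steps across many machines.

        # Toy word count in the map-reduce style (sequential, illustrative only).
        from collections import Counter
        from functools import reduce

        def mapper(line):
            # Map phase: emit a partial word count for one input record.
            return Counter(line.split())

        def reducer(a, b):
            # Reduce phase: merge two partial counts.
            a.update(b)
            return a

        lines = ["the quick brown fox", "the lazy dog", "the fox"]
        word_counts = reduce(reducer, map(mapper, lines), Counter())
        print(word_counts)  # e.g. Counter({'the': 3, 'fox': 2, ...})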
  • Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++
    Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++ Christopher S. Zakian, Timothy A. K. Zakian, Abhishek Kulkarni, Buddhika Chamith, and Ryan R. Newton Indiana University - Bloomington, fczakian, tzakian, adkulkar, budkahaw, [email protected] Abstract. Library and language support for scheduling non-blocking tasks has greatly improved, as have lightweight (user) threading packages. However, there is a significant gap between the two developments. In previous work, and in today's software packages, lightweight thread creation incurs much larger overheads than tasking libraries, even on tasks that end up never blocking. This limitation can be removed. To that end, we describe an extension to the Intel Cilk Plus runtime system, Concurrent Cilk, where tasks are lazily promoted to threads. Concurrent Cilk removes the overhead of thread creation on threads which end up calling no blocking operations, and is the first system to do so for C/C++ with legacy support (standard calling conventions and stack representations). We demonstrate that Concurrent Cilk adds negligible overhead to existing Cilk programs, while its promoted threads remain more efficient than OS threads in terms of context-switch overhead and blocking communication. Further, it enables development of blocking data structures that create non-fork-join dependence graphs, which can expose more parallelism, and better supports data-driven computations waiting on results from remote devices. 1 Introduction Both task-parallelism [1, 11, 13, 15] and lightweight threading [20] libraries have become popular for different kinds of applications. The key difference between a task and a thread is that threads may block, for example when performing IO, and then resume again.
  • Parallel Programming
    Parallel Programming Libraries and implementations Outline • MPI – distributed memory de-facto standard • Using MPI • OpenMP – shared memory de-facto standard • Using OpenMP • CUDA – GPGPU de-facto standard • Using CUDA • Others • Hybrid programming • Xeon Phi Programming • SHMEM • PGAS MPI Library Distributed, message-passing programming Message-passing concepts Explicit Parallelism • In message-passing all the parallelism is explicit • The program includes specific instructions for each communication • What to send or receive • When to send or receive • Synchronisation • It is up to the developer to design the parallel decomposition and implement it • How will you divide up the problem? • When will you need to communicate between processes? Message Passing Interface (MPI) • MPI is a portable library used for writing parallel programs using the message passing model • You can expect MPI to be available on any HPC platform you use • Based on a number of processes running independently in parallel • HPC resource provides a command to launch multiple processes simultaneously (e.g. mpiexec, aprun) • There are a number of different implementations but all should support the MPI 2 standard • As with different compilers, there will be variations between implementations but all the features specified in the standard should work. • Examples: MPICH2, OpenMPI Point-to-point communications • A message sent by one process and received by another • Both processes are actively involved in the communication – not necessarily at the same time • Wide variety of semantics provided: • Blocking vs. non-blocking • Ready vs. synchronous vs. buffered • Tags, communicators, wild-cards • Built-in and custom data-types • Can be used to implement any communication pattern • Collective operations, if applicable, can be more efficient Collective communications • A communication that involves all processes • “all” within a communicator, i.e.
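    The point-to-point and collective operations described in this excerpt can be exercised from Python through mpi4py, assuming an MPI implementation and the mpi4py package are available; a small sketch:

        # mpi_demo.py -- point-to-point and collective communication via mpi4py.
        # Assumes an MPI library plus mpi4py; run with e.g.:  mpiexec -n 4 python mpi_demo.py
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        # Point-to-point: rank 0 sends, rank 1 receives (both blocking calls).
        if rank == 0:
            comm.send({"greeting": "hello"}, dest=1, tag=11)
        elif rank == 1:
            msg = comm.recv(source=0, tag=11)
            print("rank 1 received", msg)

        # Collective: every rank contributes its rank number; the sum arrives on rank 0.
        total = comm.reduce(rank, op=MPI.SUM, root=0)
        if rank == 0:
            print("sum of ranks =", total)

    The lowercase send/recv/reduce methods pickle arbitrary Python objects; mpi4py also provides buffer-based Send/Recv variants for data such as NumPy arrays.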
  • BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btq011
    Vol. 26 no. 5 2010, pages 705–707 BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btq011 Databases and ontologies Advance Access publication January 19, 2010 ProteinWorldDB: querying radical pairwise alignments among protein sets from complete genomes Thomas Dan Otto1,2,∗, Marcos Catanho1, Cristian Tristão3, Márcia Bezerra3, Renan Mathias Fernandes4, Guilherme Steinberger Elias4, Alexandre Capeletto Scaglia4, Bill Bovermann5, Viktors Berstis5, Sergio Lifschitz3, Antonio Basílio de Miranda1 and Wim Degrave1 1Laboratório de Genômica Funcional e Bioinformática, Instituto Oswaldo Cruz, Fiocruz, Rio de Janeiro, Brazil, 2Pathogen Genomics, Wellcome Trust Genome Campus, Hinxton, UK, 3Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, 4IBM Brasil, Hortolândia, São Paulo, Brazil and 5IBM, Austin, TX, USA Associate Editor: Alfonso Valencia ABSTRACT nomenclature or might have no value when inferred from previous Motivation: Many analyses in modern biological research are incorrectly annotated sequences. Hence, secondary databases based on comparisons between biological sequences, resulting such as Swiss-Prot (http://www.expasy.ch/sprot/), PFAM (http:// in functional, evolutionary and structural inferences. When large pfam.sanger.ac.uk) or KEGG (http://www.genome.ad.jp/kegg), to numbers of sequences are compared, heuristics are often used mention only a few, have been implemented to analyze specific resulting in a certain lack of accuracy. In order to improve functional aspects and to improve the annotation procedures and and validate results of such comparisons, we have performed results. radical all-against-all comparisons of 4 million protein sequences Dynamic programming algorithms, or a fast approximation, belonging to the RefSeq database, using an implementation of the have been successfully applied to biological sequence comparison Smith–Waterman algorithm.
  • Parallelism in Cilk Plus
    Cilk Plus: Language Support for Thread and Vector Parallelism Arch D. Robison Intel Sr. Principal Engineer Outline Motivation for Intel® Cilk Plus SIMD notations Fork-Join notations Karatsuba multiplication example GCC branch Multi-Threading and Vectorization are Essential to Performance Latest Intel® Xeon® chip: 8 cores 2 independent threads per core 8-lane (single prec.) vector unit per thread = 128-fold potential for single socket Intel® Many Integrated Core Architecture >50 cores (KNC) ? independent threads per core 16-lane (single prec.) vector unit per thread = parallel heaven Importance of Abstraction Software outlives hardware. Recompiling is easier than rewriting. Coding too closely to hardware du jour makes moving to new hardware expensive. C++ philosophy: abstraction with minimal penalty Do not expect compiler to be clever. But let it do tedious bookkeeping. “Three Layer Cake” Abstraction Message Passing exploit multiple nodes Fork-Join exploit multiple cores exploit parallelism at multiple algorithmic levels SIMD exploit vector hardware Composition Message Driven compose via send/receive Fork-Join compose via call/return SIMD compose sequentially
  • Outro to Parallel Computing
    Outro To Parallel Computing John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Purpose of this talk Now that you know how to do some real parallel programming, you may wonder how much you don’t know. With your newly informed perspective we will take a look at the parallel software landscape so that you can see how much of it you are equipped to traverse. How parallel is a code? ⚫ Parallel performance is defined in terms of scalability Strong Scalability Can we get faster for a given problem size? Weak Scalability Can we maintain runtime as we scale up the problem? Weak vs. Strong scaling: more processors with weak scaling give more accurate results; more processors with strong scaling give faster results (Tornado on way!) Your Scaling Enemy: Amdahl’s Law How many processors can we really use? Let’s say we have a legacy code such that it is only feasible to convert half of the heavily used routines to parallel: Amdahl’s Law If we run this on a parallel machine with five processors: Our code now takes about 60s. We have sped it up by about 40%. Let’s say we use a thousand processors: We have now sped our code by about a factor of two. Is this a big enough win? Amdahl’s Law ⚫ If there is x% of serial component, speedup cannot be better than 100/x. ⚫ If you decompose a problem into many parts, then the parallel time cannot be less than the largest of the parts. ⚫ If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use.
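    The arithmetic in this excerpt (half of the code parallelized, run on 5 and then 1000 processors) can be checked with a few lines of Python; the 100-second serial baseline is an assumption chosen to reproduce the quoted 60 s figure.

        # Amdahl's law check for the legacy-code example above.
        # Assumes a 100 s serial baseline with half of the work parallelizable.
        def parallel_time(serial_time, parallel_fraction, processors):
            serial_part = serial_time * (1.0 - parallel_fraction)
            parallel_part = serial_time * parallel_fraction / processors
            return serial_part + parallel_part

        t1 = 100.0   # assumed serial runtime in seconds
        p = 0.5      # half of the heavily used routines converted to parallel

        for n in (5, 1000):
            tn = parallel_time(t1, p, n)
            print("N=%4d: %6.2f s, speedup %.2fx" % (n, tn, t1 / tn))
        # N=   5:  60.00 s, speedup 1.67x   (about 40% faster)
        # N=1000:  50.05 s, speedup 2.00x   (about a factor of two)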
  • The Continuing Renaissance in Parallel Programming Languages
    The continuing renaissance in parallel programming languages Simon McIntosh-Smith University of Bristol Microelectronics Research Group [email protected] Didn’t parallel computing use to be a niche? When I were a lad… But now parallelism is mainstream Samsung Exynos 5 Octa: • 4 fast ARM cores and 4 energy efficient ARM cores • Includes OpenCL programmable GPU from Imagination HPC scaling to millions of cores Tianhe-2 at NUDT in China 33.86 PetaFLOPS (33.86×10^15), 16,000 nodes Each node has 2 CPUs and 3 Xeon Phis 3.12 million cores, $390M, 17.6 MW, 720m2 A renaissance in parallel programming Metal C++11 OpenMP OpenCL Erlang Unified Parallel C Fortress XC Go Cilk HMPP CHARM++ CUDA Co-Array Fortran Chapel Linda X10 MPI Pthreads C++ AMP Groupings of || languages: Partitioned Global Address Space (PGAS): Fortress, X10, Chapel, Co-array Fortran, Unified Parallel C. GPU languages: OpenCL, CUDA, HMPP, Metal. Object oriented: C++ AMP, CHARM++. CSP: XC. Message passing: MPI. Multi-threaded: Cilk, Go, C++11. Shared memory: OpenMP. Emerging GPGPU standards • OpenCL, DirectCompute, C++ AMP, … • Also OpenMP 4.0, OpenACC, CUDA… Apple's Metal • A "ground up" parallel programming language for GPUs • Designed for compute and graphics • Potential to replace OpenGL compute shaders, OpenCL/GL interop etc. • Close to the "metal" • Low overheads • "Shading" language based
  • C Language Extensions for Hybrid CPU/GPU Programming with StarPU
    C Language Extensions for Hybrid CPU/GPU Programming with StarPU Ludovic Courtès arXiv:1304.0878v2 [cs.MS] 10 Apr 2013 Project-Team Runtime Research Report n° 8278 — April 2013 — 22 pages ISSN 0249-6399 ISRN INRIA/RR--8278--FR+ENG Abstract: Modern platforms used for high-performance computing (HPC) include machines with both general-purpose CPUs, and “accelerators”, often in the form of graphical processing units (GPUs). StarPU is a C library to exploit such platforms. It provides users with ways to define tasks to be executed on CPUs or GPUs, along with the dependencies among them, and by automatically scheduling them over all the available processing units. In doing so, it also relieves programmers from the need to know the underlying architecture details: it adapts to the available CPUs and GPUs, and automatically transfers data between main memory and GPUs as needed. While StarPU’s approach is successful at addressing run-time scheduling issues, being a C library makes for a poor and error-prone programming interface. This paper presents an effort started in 2011 to promote some of the concepts exported by the library as C language constructs, by means of an extension of the GCC compiler suite. Our main contribution is the design and implementation of language extensions that map to StarPU’s task programming paradigm. We argue that the proposed extensions make it easier to get started with StarPU, eliminate errors that can occur when using the C library, and help diagnose possible mistakes.
  • Unified Parallel C (UPC)
    Unified Parallel C (UPC) Vivek Sarkar Department of Computer Science Rice University [email protected] COMP 422 Lecture 21 March 27, 2008 Acknowledgments • Supercomputing 2007 tutorial on “Programming using the Partitioned Global Address Space (PGAS) Model” by Tarek El-Ghazawi and Vivek Sarkar — http://sc07.supercomputing.org/schedule/event_detail.php?evid=11029 Programming Models (process/thread vs. address space): Message Passing – MPI; Shared Memory – OpenMP; DSM/PGAS – UPC. The Partitioned Global Address Space (PGAS) Model • Aka the Distributed Shared Memory (DSM) model • Concurrent threads with a partitioned shared space — memory partition Mi has affinity to thread Thi [figure: threads Th0 … Thn-1, each with an associated memory partition M0 … Mn-1] • (+)ive: — Data movement is implicit — Data distribution simplified with global address space • (-)ive: — Computation distribution and synchronization still remain programmer’s responsibility — Lack of performance transparency of local vs. remote accesses • UPC, CAF, Titanium, X10, … What is Unified Parallel C (UPC)? • An explicit parallel extension of ISO C • Single global address space • Collection of threads — each thread bound to a processor — each thread has some private data — part of the shared data is co-located with each thread • Set of specs for a parallel C — v1.0 completed February of 2001 — v1.1.1 in October of 2003 — v1.2 in May of 2005 • Compiler implementations by industry and academia UPC Execution Model • A number of threads working independently in a SPMD fashion — MYTHREAD specifies
  • IBM XL Unified Parallel C User's Guide
    IBM XL Unified Parallel C for AIX, V11.0 (Technology Preview): IBM XL Unified Parallel C User’s Guide, Version 11.0. Note: Before using this information and the product it supports, read the information in “Notices” on page 97. © Copyright International Business Machines Corporation 2010. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
    Contents:
    Chapter 1. Parallel programming and Unified Parallel C . . . 1 (Parallel programming, 1; Partitioned global address space programming model, 1; Unified Parallel C introduction, 2)
    Chapter 2. Unified Parallel C programming model . . . 3 (Distributed shared memory programming, 3; Data affinity and data distribution, 3; Memory consistency, 5; Synchronization mechanism, 6)
    Chapter 3. Using the XL Unified Parallel C compiler . . . 9 (Compiler options, 9)
    Declarations, 39; Type qualifiers, 39; Declarators, 41; Statements and blocks, 42; Synchronization statements, 42; Iteration statements, 46; Predefined macros and directives, 47; Unified Parallel C directives, 47; Predefined macros, 48
    Chapter 5. Unified Parallel C library functions . . . 49 (Utility functions, 49; Program termination, 49; Dynamic memory allocation, 50; Pointer-to-shared manipulation, 56)
  • A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization
    TREES: A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization Blake A. Hechtman, Andrew D. Hilton, and Daniel J. Sorin Department of Electrical and Computer Engineering Duke University
    Abstract — We have developed a task-parallel runtime system, called TREES, that is designed for high performance on CPU/GPU platforms. On platforms with multiple CPUs, Cilk’s “work-first” principle underlies how task-parallel applications can achieve performance, but work-first is a poor fit for GPUs. We build upon work-first to create the “work-together” principle that addresses the specific strengths and weaknesses of GPUs. The work-together principle extends work-first by stating that (a) the overhead on the critical path should be paid by the entire system at once and (b) work overheads should be paid co-operatively. We have implemented the TREES runtime in OpenCL, and we experimentally evaluate TREES applications on a CPU/GPU platform.
    … targeting CPUs are a poor fit for GPUs. To understand why this mismatch exists, we must first understand the performance of an idealized task-parallel application (with no runtime) and then how the runtime’s overhead affects it. The performance of a task-parallel application is a function of two characteristics: its total amount of work to be performed (T1, the time to execute on 1 processor) and its critical path (T∞, the time to execute on an infinite number of processors). Prior work has shown that the runtime of a system with P processors, TP, is bounded by TP = O(T1/P) + O(T∞) due to the greedy offline scheduler bound [3][10]. A task-parallel runtime introduces overheads and, for purposes of performance analysis, we distinguish
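    The T1/T∞ bound quoted in this excerpt is easy to explore numerically; the following small calculator is illustrative only, and the work and critical-path values in it are made-up examples rather than measurements from the paper.

        # Illustrative check of the greedy-scheduler bound T_P <= T_1/P + T_inf.
        def greedy_bound(t1, t_inf, p):
            # Upper bound on runtime with p processors under a greedy scheduler.
            return t1 / p + t_inf

        t1, t_inf = 1000.0, 10.0   # total work and critical path (arbitrary units)
        for p in (1, 8, 64, 1024):
            bound = greedy_bound(t1, t_inf, p)
            print("P=%4d  time bound=%8.2f  speedup >= %6.2f" % (p, bound, t1 / bound))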