Algorithmic Skeletons for Exact Combinatorial Search at Scale

Total Page:16

File Type:pdf, Size:1020Kb

Algorithmic Skeletons for Exact Combinatorial Search at Scale Archibald, Blair (2018) Algorithmic skeletons for exact combinatorial search at scale. PhD thesis. https://theses.gla.ac.uk/31000/ Copyright and moral rights for this work are retained by the author A copy can be downloaded for personal non-commercial research or study, without prior permission or charge This work cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given Enlighten: Theses https://theses.gla.ac.uk/ [email protected] ALGORITHMIC SKELETONS FOR EXACT COMBINATORIAL SEARCH AT SCALE BLAIR ARCHIBALD SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy SCHOOL OF COMPUTING SCIENCE COLLEGE OF SCIENCE AND ENGINEERING UNIVERSITY OF GLASGOW NOVEMBER 4, 2018 © BLAIR ARCHIBALD Abstract Exact combinatorial search is essential to a wide range of application areas including constraint optimisation, graph matching, and computer algebra. Solutions to combinatorial problems are found by systematically exploring a search space, either to enumerate solutions, determine if a specific solution exists, or to find an optimal solution. Combinatorial searches are computationally hard both in theory and practice, and efficiently exploring the huge number of combinations is a real challenge, often addressed using approximate search algorithms. Alternatively, exact search can be parallelised to reduce execution time. However, parallel search is challenging due to both highly irregular search trees and sensitivity to search order, leading to anomalies that can cause unexpected speedups and slowdowns. As core counts continue to grow, parallel search becomes increasingly useful for improving the performance of existing searches, and allowing larger instances to be solved. A high-level approach to parallel search allows non-expert users to benefit from increasing core counts. Algorithmic Skeletons provide reusable implementations of common parallelism patterns that are parameterised with user code which determines the specific computation, e.g. a particular search. We define a set of skeletons for exact search, requiring the user to provide in the minimal case a single class that specifies how the search tree is generated and a parameter that specifies the type of search required. The five are: Sequential search; three general-purpose parallel search methods: Depth-Bounded, Stack-Stealing, and Budget; and a specific parallel search method, Ordered, that guarantees replicable performance. We implement and evaluate the skeletons in a new C++ parallel search framework, YewPar. YewPar provides both high-level skeletons and low-level search specific schedulers and utilities to deal with the irregularity of search and knowledge exchange between workers. YewPar is based on the HPX library for distributed task-parallelism potentially allowing search to execute on multi-cores, clusters, cloud, and high performance computing systems. Underpinning the skeleton design is a novel formal model, MT 3, a parallel operational semantics that describes multi-threaded tree traversals, allowing reasoning about parallel search, e.g. describing common parallel search phenomena such as performance anomalies. YewPar is evaluated using seven different search applications (and over 25 specific instances): Maximum Clique, k-Clique, Subgraph Isomorphism, Travelling Salesperson, Binary Knap- sack, Enumerating Numerical Semigroups, and the Unbalanced Tree Search Benchmark. The search instances are evaluated at multiple scales from 1 to 255 workers, on a 17 host, 272 core Beowulf cluster. The overheads of the skeletons are low, with a mean 6.1% slowdown compared to hand-coded sequential implementation. Crucially, for all search applications YewPar reduces search times by an order of magnitude, i.e hours/minutes to minutes/seconds, and we commonly see greater than 60% (average) parallel efficiency speedups for up to 255 workers. Comparing skeleton performance reveals that no one skeleton is best for all searches, highlighting a benefit of a skeleton approach that allows multiple parallelisations to be explored with minimal refactoring. The Ordered skeleton avoids slowdown anomalies where, due to search knowledge being order dependent, a parallel search takes longer than a sequential search. Analysis of Ordered shows that, while being 41% slower on average (73% worse-case) than Depth-Bounded, in nearly all cases it maintains the following replicable performance properties: 1) parallel executions are no slower than one worker sequential executions 2) runtimes do not increase as workers are added, and 3) variance between repeated runs is low. In particular, where Ordered maintains a relative standard deviation (RSD) of less than 15%, Depth-Bounded suffers from an RSD greater than 50%, showing the importance of carefully controlling search orders for repeatability. Acknowledgements Utmost thanks goes to Phil Trinder, Patrick Maier, and Rob Stewart, for their unfaltering encouragement, guidance and ideas. All have been simultaneously supervisors, colleagues, mentors and friends. This thesis would not have been possible without their support. Thank you to everyone in the School of Computing science at Glasgow, my home for the last seven years. Special thanks to Jeremy Singer, for kindling my interest in research, and Ciaran McCreesh for sending me down the rabbit-hole of parallel search problems and for feedback on this thesis. Also to Stephen McQuistin for feedback on an earlier draft. To all those at EPCC, especially Michele Weiland and Nick Johnson, thank you for hosting me during my internship. I learned a great deal about high performance computing and, of course, cryptic crosswords! Thanks to the Software Sustainability Institute who taught me the wider importance of computing and ensured I didn’t miss the forest for the (search) trees. I am especially grateful to my parents, family, friends, and Jena who have formed the best support network I could ask for. Thank you. Finally, thanks to Chris Jefferson, David Manlove, and Michel Steuwer for making the viva an enjoyable experience. This work was funded from EPSRC grants: EP/K503058/1, EP/L50497X/1 and EP/M508056/1. Table of Contents 1 Introduction 1 1.1 Contributions . .2 1.2 Publications and Authorship . .5 2 Background 7 2.1 Combinatorial Search . .8 2.1.1 General Combinatorial Search . .9 2.1.2 Backtracking Search . 11 2.1.3 Types of Search . 12 2.1.4 Search Ordering and Heuristics . 12 2.2 Parallelism . 15 2.2.1 Parallel Architectures . 15 2.2.2 Parallelism Models . 16 2.2.3 Load-Balancing . 18 2.2.4 High-Level Approaches to Parallelism . 20 2.2.5 Parallel Performance Metrics . 21 2.3 Parallel Combinatorial Search . 22 2.3.1 Methods of Parallelising Combinatorial Search . 22 2.3.2 Challenges of Space-Splitting Search . 23 2.4 Existing Parallel Space-Splitting Searches . 25 2.4.1 Static Approaches . 26 2.4.2 Dynamic Approaches . 29 2.4.3 Scalability of Existing Approaches . 36 2.4.4 The Need For A New Search Skeleton Framework . 36 3 Formalising Parallel Tree Search 41 3.1 Modelling Search Trees . 42 3.1.1 Ordered Trees . 44 3.2 Example: Tree Generators for the Travelling Salesperson Problem . 45 3.3 Semantics of Parallel Tree Search . 47 3.3.1 Creating an MT 3 Model . 48 3.3.2 Rule Ordering . 49 3.3.3 Search State . 49 3.3.4 Traversal Rules . 50 3.3.5 Example Reductions for TSP . 50 3.3.6 Node Processing Rules . 51 3.3.7 Pruning Rules . 54 3.3.8 Spawn Rules . 57 3.4 Lazy Node Generators: A Functional Interface for Tree Search . 57 3.4.1 Node Generators . 58 3.4.2 Node Generators as a Programming Interface . 58 3.4.3 Lazy Node Generators . 61 3.5 Summary . 62 4 Search Skeletons 65 4.1 Skeletons as a High-Level Parallelism Approach . 66 4.1.1 Creating Search Skeletons . 67 4.2 An Abstract Parallel Framework for Tree Search . 67 4.3 Search Coordination Methods . 70 4.3.1 Design Choices . 71 4.3.2 Pseudocode . 72 4.3.3 Sequential Search Coordination . 72 4.3.4 Depth-Bounded Search Coordination . 73 4.3.5 Stack-Stealing Search Coordination . 76 4.3.6 Budget Search Coordination . 81 4.3.7 Search Coordination Summary . 84 4.4 Workpool Design Choices . 85 4.5 Search Types . 87 4.5.1 Enumeration Search Type . 87 4.5.2 Decision Search Type . 88 4.5.3 Optimisation Search Type . 89 4.6 Summary . 90 5 Skeleton Implementation and Case Studies 93 5.1 Implementation . 93 5.1.1 A Prototype Haskell Framework for Tree Search . 93 5.1.2 YewPar: A Framework for Distributed-Memory Tree Search . 95 5.1.3 YewPar Features . 96 5.2 Case Study Applications . 104 5.2.1 Enumeration Case Studies . 104 5.2.2 Decision Case Studies . 108 5.2.3 Optimisation Case Studies . 112 6 Evaluation 117 6.1 Experimental Setup . 117 6.2 Caveats when Evaluating Parallel Search . 118 6.3 The Cost of Generality . 119 6.4 Global Incumbent Management . 121 6.5 Depth-Bounded . 123 6.5.1 Workpool Choice . 129 6.5.2 Depth-Bounded Work-Stealing Performance . 130 6.5.3 Depth-Bounded Summary . 132 6.6 Stack-Stealing . 132 6.6.1 Use of Chunking . 134 6.6.2 Stack-Stealing Work-stealing Performance . 136 6.6.3 Stack-Stealing Summary . 136 6.7 Budget . 138 6.7.1 Budget Work-Stealing Performance . 144 6.7.2 Budget Summary . 145 6.8 Skeleton Comparison . 146 6.8.1 Cumulative Statistics . 148 6.9 Scalability . 149 6.9.1 Finite Geometry: H(4; 32) ...................... 153 6.10 Generality and Ease of Use . 153 6.11 Summary . 154 7 Replicable Branch and Bound Search 157 7.1 Limits of Replicable Search .
Recommended publications
  • Toward Optimised Skeletons for Heterogeneous
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by ROS: The Research Output Service. Heriot-Watt University Edinburgh TOWARD OPTIMISED SKELETONS FOR HETEROGENEOUS PARALLEL ARCHITECTURE WITH PERFORMANCE COST MODEL By Khari A. Armih Submitted for the Degree of Doctor of Philosophy at Heriot-Watt University on Completion of Research in the School of Mathematical and Computer Sciences July, 2013 The copyright in this thesis is owned by the author. Any quotation from the thesis or use of any of the information contained in it must acknowledge this thesis as the source of the quotation or information. I hereby declare that the work presented in this the- sis was carried out by myself at Heriot-Watt University, Edinburgh, except where due acknowledgement is made, and has not been submitted for any other degree. Khari A. Armih (Candidate) Greg Michaelson, Phil Trinder (Supervisors) (Date) ii Abstract High performance architectures are increasingly heterogeneous with shared and distributed memory components, and accelerators like GPUs. Programming such architectures is complicated and performance portability is a major issue as the architectures evolve. This thesis explores the potential for algorithmic skeletons integrating a dynamically parametrised static cost model, to deliver portable performance for mostly regular data parallel programs on heterogeneous archi- tectures. The first contribution of this thesis is to address the challenges of program- ming heterogeneous architectures by providing two skeleton-based programming libraries: i.e. HWSkel for heterogeneous multicore clusters and GPU-HWSkel that enables GPUs to be exploited as general purpose multi-processor devices. Both libraries provide heterogeneous data parallel algorithmic skeletons including hMap, hMapAll, hReduce, hMapReduce, and hMapReduceAll.
    [Show full text]
  • Algorithmic Skeletons and Parallel Design Patterns in Mainstream Parallel Programming
    International Journal of Parallel Programming https://doi.org/10.1007/s10766-020-00684-w(0123456789().,-volV)(0123456789().,-volV) Algorithmic Skeletons and Parallel Design Patterns in Mainstream Parallel Programming 1 1 1 Marco Danelutto • Gabriele Mencagli • Massimo Torquati • 2 3 Horacio Gonza´lez–Ve´lez • Peter Kilpatrick Received: 13 August 2019 / Accepted: 8 October 2020 Ó The Author(s) 2020 Abstract This paper discusses the impact of structured parallel programming methodologies in state-of-the-art industrial and research parallel programming frameworks. We first recap the main ideas underpinning structured parallel programming models and then present the concepts of algorithmic skeletons and parallel design patterns. We then discuss how such concepts have permeated the wider parallel programming community. Finally, we give our personal overview—as researchers active for more than two decades in the parallel programming models and frameworks area—of the process that led to the adoption of these concepts in state-of-the-art industrial and research parallel programming frameworks, and the perspectives they open in relation to the exploitation of forthcoming massively-parallel (both general and special-purpose) architectures. Keywords Algorithmic skeletons Á Parallel design patterns Á High performance computing Á Multi-core architecture Á Parallel computing 1 Introduction In the last two decades, the number of parallel architectures available to the masses has substantially increased. The world has moved from clusters/networks of workstations—composed of individual nodes with a small number of CPUs sharing a common memory hierarchy—to the ubiquitous presence of multi-core CPUs coupled with different many-core accelerators, typically interconnected through high-bandwidth, low-latency networks.
    [Show full text]
  • A Divide-And-Conquer Parallel Skeleton for Unbalanced and Deep Problems
    13th International Symposium on High-Level Parallel Programming and Applications (HLPP 2020) Porto, Portugal July 9-10, 2020 A Divide-and-conquer Parallel Skeleton for Unbalanced and Deep Problems Mill´anA. Mart´ınez Basilio B. Fraguela · · Jos´eC. Cabaleiro Abstract The Divide-and-conquer (D&C) pattern appears in a large number of problems and is highly suitable to exploit parallelism. This has led to much research on its easy and efficient application both in shared and distributed memory parallel systems. One of the most successful approaches explored in this area consists in expressing this pattern by means of parallel skeletons which automate and hide the complexity of the parallelization from the user while trying to provide good performance. In this paper we tackle the develop- ment of a skeleton oriented to the efficient parallel resolution of D&C problems with a high degree of unbalance among the subproblems generated and/or a deep level of recurrence. Our evaluation shows good results achieved both in terms of performance and programmability. Keywords Algorithmic skeletons Divide-and-conquer Template metapro- gramming Load balancing · · · 1 Introduction Divide-and-conquer [1], hence denoted D&C, is a strategy widely used to solve problems whose solution can be obtained by dividing them into subproblems, separately solving those subproblems, and combining their solutions to com- pute the one of the original problem. In this pattern the subproblems have the same nature as the original one, thus this strategy can be recursively ap- plied to them until base cases are found. The independence of the subproblems Mill´anA.
    [Show full text]
  • Towards an Algorithmic Skeleton Framework for Programming the Intel R Xeon Phitm Processor
    Hélder de Almeida Marques Licenciado em Engenharia Informática Towards an Algorithmic Skeleton Framework for Programming the Intel R Xeon PhiTM Processor Dissertação para obtenção do Grau de Mestre em Engenharia Informática Orientador : Hervé Paulino, Prof. Auxiliar, Universidade Nova de Lisboa Júri: Presidente: Prof. Doutor Miguel Pessoa Monteiro Universidade Nova de Lisboa Arguente: Prof. Doutor Nuno Roma Universidade de Lisboa Vogal: Prof. Doutor Hervé Miguel Cordeiro Paulino Universidade Nova de Lisboa Dezembro, 2014 iii Towards an Algorithmic Skeleton Framework for Programming the Intel R Xeon PhiTM Processor Copyright c Hélder de Almeida Marques, Faculdade de Ciências e Tecnologia, Univer- sidade Nova de Lisboa A Faculdade de Ciências e Tecnologia e a Universidade Nova de Lisboa têm o direito, perpétuo e sem limites geográficos, de arquivar e publicar esta dissertação através de ex- emplares impressos reproduzidos em papel ou de forma digital, ou por qualquer outro meio conhecido ou que venha a ser inventado, e de a divulgar através de repositórios científicos e de admitir a sua cópia e distribuição com objectivos educacionais ou de in- vestigação, não comerciais, desde que seja dado crédito ao autor e editor. iv Aos meus pais e amigos que tornaram esta tese possível vi Acknowledgements I would like to start by thanking so much to my adviser Hervé Paulino for all the sup- port, availability and constant meetings throughout the course of this thesis that really helped me a lot to develop this work. I would like to thank the department of Com- puter Sciences of Faculdade de Ciências e Tecnologias da Universidade Nova de Lis- boa, for the facilities and its constant availability to always have a friendly environment to work.
    [Show full text]
  • UNIVERSITY of PISA and SCUOLA SUPERIORE SANTANNA Master
    UNIVERSITY OF PISA AND SCUOLA SUPERIORE SANTANNA Master Degree in Computer Science and Networking Master Thesis Structured programming in MPI A streaming framework experiment Candidate : Leta Melkamu Jifar Supervisor : Prof. Marco Danelluto Academic Year : 2012/2013 To my family, who emphasized the importance of education, who has been my role-model for hard work, persistence and personal sacrifices, and who instilled in me the inspiration to set high goals and the confidence to achieve them. Who has been proud and supportive of my work and who has shared many uncertainties and challenges for completing this thesis and the whole masters program. To Hatoluf, to Dani, to Basho, to Ebo and to mom! Acknowledgments I would like to express my gratitude to my supervisor, Prof. Marco Danelluto, whose expertise, understanding, and patience, added considerably to my graduate experience. I appreciate his vast knowledge and skill in many areas (e.g., parallel and distributed computing, structured Skeleton programming, Multi/Many-Core and Heterogeneous Architectures, Autonomic computing), and his assistance in keeping me in the right track both in the implementation and thesis writing parts. I would also like to mention his humble support in making the experimental machine available and installing all tools that are important for the experiment of this thesis. And I also want to thank him for understanding my situation and be able to available for discus- sion; giving me time from his busy schedule to serve many students, attending conferences and conducting research works. Finally, I would like to thank my family, my classmates and my friends for being supportive and proud.
    [Show full text]
  • A Theoretical Model for Global Optimization of Parallel Algorithms
    mathematics Article A Theoretical Model for Global Optimization of Parallel Algorithms Julian Miller 1,* , Lukas Trümper 1,2 , Christian Terboven 1 and Matthias S. Müller 1 1 Chair for High Performance Computing, IT Center, RWTH Aachen University, 52074 Aachen, Germany; [email protected] (L.T.); [email protected] (C.T.); [email protected] (M.S.M.) 2 Huddly AS, Karenslyst Allé 51, 0279 Oslo, Norway * Correspondence: [email protected] Abstract: With the quickly evolving hardware landscape of high-performance computing (HPC) and its increasing specialization, the implementation of efficient software applications becomes more challenging. This is especially prevalent for domain scientists and may hinder the advances in large-scale simulation software. One idea to overcome these challenges is through software abstraction. We present a parallel algorithm model that allows for global optimization of their synchronization and dataflow and optimal mapping to complex and heterogeneous architectures. The presented model strictly separates the structure of an algorithm from its executed functions. It utilizes a hierarchical decomposition of parallel design patterns as well-established building blocks for algorithmic structures and captures them in an abstract pattern tree (APT). A data-centric flow graph is constructed based on the APT, which acts as an intermediate representation for rich and automated structural transformations. We demonstrate the applicability of this model to three representative algorithms and show runtime speedups between 1.83 and 2.45 on a typical Citation: Miller, J.; Trümper, L.; heterogeneous CPU/GPU architecture. Terboven, C.; Müller, M.S. A Theoretical Model for Global Keywords: parallel programming; parallel patterns; program optimization; optimizing framework; Optimization of Parallel Algorithms.
    [Show full text]
  • Arxiv:2005.04094V1 [Cs.DC] 5 May 2020 National University of Defense Technology E-Mail: {J.Fang, Chunhuang, Taotang84}@Nudt.Edu.Cn Z
    manuscript No. (will be inserted by the editor) Parallel Programming Models for Heterogeneous Many-Cores : A Survey Jianbin Fang · Chun Huang B · Tao Tang · Zheng Wang Received: date / Accepted: date Abstract Heterogeneous many-cores are now an integral part of modern comput- ing systems ranging from embedding systems to supercomputers. While heteroge- neous many-core design offers the potential for energy-efficient high-performance, such potential can only be unlocked if the application programs are suitably par- allel and can be made to match the underlying heterogeneous platform. In this article, we provide a comprehensive survey for parallel programming models for heterogeneous many-core architectures and review the compiling techniques of improving programmability and portability. We examine various software opti- mization techniques for minimizing the communicating overhead between hetero- geneous computing devices. We provide a road map for a wide variety of different research areas. We conclude with a discussion on open issues in the area and po- tential research directions. This article provides both an accessible introduction to the fast-moving area of heterogeneous programming and a detailed bibliography of its main achievements. Keywords Heterogeneous Computing · Many-Core Architectures · Parallel Programming Models 1 Introduction Heterogeneous many-core systems are now commonplace [158,159]. The combina- tion of using a host CPU together with specialized processing units (e.g., GPGPUs, XeonPhis, FPGAs, DSPs and NPUs) has been shown in many cases to achieve orders of magnitude performance improvement. As a recent example, Google's Ten- sor Processing Units (TPUs) are application-specific integrated circuits (ASICs) J. Fang, C. Huang (B), T.
    [Show full text]
  • Skelcl -- a Portable Multi-GPU Skeleton Library
    ANGEWANDTE MATHEMATIK UND INFORMATIK SkelCL – A Portable Multi-GPU Skeleton Library Michel Steuwer, Philipp Kegel, and Sergei Gorlatch 04/10 - I UNIVERSIT A¨ T M UNSTER¨ SkelCL { A Portable Multi-GPU Skeleton Library Michel Steuwer, Philipp Kegel, and Sergei Gorlatch Department of Mathematics and Computer Science University of M¨unster,M¨unster,Germany November 30, 2010 Abstract Modern Graphics Processing Units (GPU) are increasingly used as general- purpose processors. While the two currently most widely used programming models for GPUs, CUDA and OpenCL, are a step ahead as compared to ex- tremely laborious shader programming, they still remain effort-demanding and error-prone. Memory management is complex due to the separate memories of CPU and GPU and GPUs support only a subset of common CPU programming languages. Both, CUDA and OpenCL lack support for multi-GPU systems, such that additional challenges like communication and synchronization arise when using multiple GPUs. We present SkelCL { a novel library for simplifying OpenCL programming of single and multiple GPUs. SkelCL provides for the application programmer two main abstractions: Firstly, a unified memory management that implicitly transfers data transfers between the system's and the GPU's memory. Lazy copying techniques pre- vent unnecessary data transfers and increase performance. Secondly, SkelCL provides so-called algorithmic skeletons to describe computations. Algorithmic skeletons encapsulate parallel implementations of recurring programming pat- terns. Skeletons are configured by providing one or several parameters which adapt a skeleton parallel program pattern according to an application. In ad- dition a set of basic skeletons, SkelCL also provides means to easily compose skeletons, to allow for more complex applications.
    [Show full text]
  • Algorithmic Skeletons and Parallel Design Patterns in Mainstream Parallel Programming
    Algorithmic Skeletons and Parallel Design Patterns in Mainstream Parallel Programming Danelutto, M., Mencagli, G., Torquati, M., González–Vélez, H., & Kilpatrick, P. (2020). Algorithmic Skeletons and Parallel Design Patterns in Mainstream Parallel Programming. International Journal of Parallel Programming. https://doi.org/10.1007/s10766-020-00684-w Published in: International Journal of Parallel Programming Document Version: Publisher's PDF, also known as Version of record Queen's University Belfast - Research Portal: Link to publication record in Queen's University Belfast Research Portal Publisher rights © 2020 The Authors. This is an open access article published under a Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium, provided the author and source are cited. General rights Copyright for the publications made accessible via the Queen's University Belfast Research Portal is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights. Take down policy The Research Portal is Queen's institutional repository that provides access to Queen's research output. Every effort has been made to ensure that content in the Research Portal does not infringe any person's rights, or applicable UK laws. If you discover content in the Research Portal that you believe breaches copyright or violates any law,
    [Show full text]
  • Nested Parallelism with Algorithmic Skeletons A
    NESTED PARALLELISM WITH ALGORITHMIC SKELETONS A Thesis by ALIREZA MAJIDI Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Chair of Committee, Lawrence Rauchwerger Committee Members, Timothy Davis Diego Donzis Head of Department, Scott Schaefer August 2020 Major Subject: Computer Science and Engineering Copyright 2020 Alireza Majidi ABSTRACT New trend in design of computer architectures, from memory hierarchy design to grouping computing units in different hierarchical levels in CPUs, pushes developers toward algorithms that can exploit these hierarchical designs. This trend makes support of nested-parallelism an impor- tant feature for parallel programming models. It enables implementation of parallel programs that can then be mapped onto the system hierarchy. However, supporting nested-parallelism is not a trivial task due to complexity in spawning nested sections, destructing them and more importantly communication between these nested parallel sections. Structured parallel programming models are proven to be a good choice since while they hide the parallel programming complexities from the programmers, they allow programmers to customize the algorithm execution without going through radical changes to the other parts of the program. In this thesis, nested algorithm com- position in the STAPL Skeleton Library (SSL) is presented, which uses a nested dataflow model as its internal representation. We show how a high level program specification using SSL allows for asynchronous computation and improved locality. We study both the specification and per- formance of the STAPL implementation of Kripke, a mini-app developed by Lawrence Livermore National Laboratory.
    [Show full text]
  • Algorithmic Skeletons for Python Jolan Philippe, Frédéric Loulergue
    PySke: Algorithmic Skeletons for Python Jolan Philippe, Frédéric Loulergue To cite this version: Jolan Philippe, Frédéric Loulergue. PySke: Algorithmic Skeletons for Python. The 2019 Interna- tional Conference on High Performance Computing & Simulation (HPCS), Jul 2019, Dublin, Ireland. 10.1109/HPCS48598.2019.9188151. hal-02317127 HAL Id: hal-02317127 https://hal.archives-ouvertes.fr/hal-02317127 Submitted on 15 Oct 2019 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. PySke: Algorithmic Skeletons for Python Jolan Philippe Fred´ eric´ Loulergue School of Informatics Computing and Cyber Systems School of Informatics Computing and Cyber Systems Northern Arizona University Northern Arizona University Flagstaff, AZ, USA Flagstaff, AZ, USA [email protected] [email protected] Abstract—PySke is a library of parallel algorithmic skeletons b) Approach and Goals: To ease the development of in Python designed for list and tree data structures. Such parallel programming, Murray Cole has introduced the skeletal algorithmic skeletons are high-order functions implemented in parallelism approach [1]. A skeleton is a parallel implementa- parallel. An application developed with PySke is a composition of skeletons. To ease the write of parallel programs, PySke does tion of a computation pattern.
    [Show full text]
  • Copyrighted Material
    k INDEX k k A benchmarks, 118, 314, 386, 425 aborts, 83 block access, 4 adaptive locking protocol, 466 BlockLib, 139 adaptive scheduling, 422 Brook, 145, 201 algorithmic skeleton, 263 bus interconnection, 431 algorithm view, 35 allocation, 418 C all-or-nothing transaction, 167 CABAC, 282 Amdahl’s law, 372 cache, 72, 237 analytical model, 442 cache behavior, 95 application-centric models, 44 cache coherence, 348 array, 67 cascading aborts, 172 Array-OL, 146 CAVLC, 282 atomicity, 84, 326 Cell/B.E. processor, 39 automated parallelization,COPYRIGHTED 233 Cell Superscalar, MATERIAL 43 auto-tuning, 194–195 Charm++,45 Cilk, 37 B cloud computing, 452 bag, 65 cluster, 113, 311 bandwidth, 425 code optimization, 292 Programming Multicore and Many-core Computing Systems, First Edition. Edited by Sabri Pllana and Fatos Xhafa. © 2017 John Wiley & Sons, Inc. Published 2017 by John Wiley & Sons, Inc. 481 k k 482 Index code profiling, 291 diagnostic tools, 358 collector, 268 dictionary, 68 combinability, 62 dining philosopher, 166 communication, 36 distributed desk checking, 331 communication congestion, 431 distributed programming, 451 communication links, 444 distributed real-time applications, 464 commutativity, 177 distributive review, 325 compiler, 109, 388 DOACROSS parallelism, 205 computational accelerators, 407 DOALL parallelism, 205 computing performance, 3 dynamic data structure, 218 concurrency, 36, 228, 232 dynamic scheduling, 421 concurrent code, 81 concurrent conflicting transactions, 83 E concurrent data structures, 59 eager update, 88,
    [Show full text]