Workshop on Programming Languages for High Performance Computing (HPCWPL) Final Report

Total Pages: 16

File Type: PDF, Size: 1020 KB

SANDIA REPORT SAND2007-XXXX
Unlimited Release
Printed

Prepared by Sandia National Laboratories, Albuquerque, New Mexico 87185 and Livermore, California 94550.

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under Contract DE-AC04-94-AL85000.

Approved for public release; further dissemination unlimited.

Issued by Sandia National Laboratories, operated for the United States Department of Energy by Sandia Corporation.

NOTICE: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government, nor any agency thereof, nor any of their employees, nor any of their contractors, subcontractors, or their employees, make any warranty, express or implied, or assume any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represent that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government, any agency thereof, or any of their contractors or subcontractors. The views and opinions expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or any of their contractors.

Printed in the United States of America. This report has been reproduced directly from the best available copy.

Available to DOE and DOE contractors from:
U.S. Department of Energy, Office of Scientific and Technical Information, P.O. Box 62, Oak Ridge, TN 37831
Telephone: (865) 576-8401; Facsimile: (865) 576-5728; E-Mail: [email protected]
Online ordering: http://www.osti.gov/bridge

Available to the public from:
U.S. Department of Commerce, National Technical Information Service, 5285 Port Royal Rd, Springfield, VA 22161
Telephone: (800) 553-6847; Facsimile: (703) 605-6900; E-Mail: [email protected]
Online ordering: http://www.ntis.gov/help/ordermethods.asp?loc=7-4-0#online

Workshop on Programming Languages for High Performance Computing (HPCWPL) Final Report
Richard C. Murphy, HPCWPL Technical Chair

Abstract

This report summarizes the deliberations and conclusions of the Workshop on Programming Languages for High Performance Computing (HPCWPL) held at the Sandia CSRI facility in Albuquerque, NM on December 12-13, 2006.

Acknowledgment

This report would be impossible without the efforts of a large number of people. Each of the participants and presenters provided invaluable expert input into our effort to understand the state of applications, architecture, and programming models. We are very thankful for their active participation.

The workshop's steering and program committees were invaluable in shaping its content and direction: Ron Brightwell, Bill Carlson, David Chase, Bill Gropp, Bruce Hendrickson, Mike Heroux, Fred Johnson, Peter Kogge, Bob Lucas, Bob Numrich, Marc Snir, Thomas Sterling, Jeffrey Vetter, Christoph von Praun, Zhaofang Wen, and Hans Zima.
We are thankful to the chairs for providing direction for each of their sessions: Peter Kogge served as the workshop chair, Thomas Sterling chaired the architecture session, Jeffrey Vetter chaired the applications session, Marc Snir chaired the HPCS Languages Panel (which served as the basis for many of the recommendations given in this report), Bill Gropp chaired the programming models session, and Bob Lucas chaired the final conclusions session. Each of the deputy chairs from Sandia also provided invaluable input into the final report: Keith Underwood, Mike Heroux, Arun Rodrigues, and Ron Brightwell.

Richard C. Murphy, Technical Chair
Neil D. Pundit, Sandia Host

Executive Summary

This report describes the critical need for additional research in programming languages for high performance computing to enable new applications and the architectures to support those applications. It is the result of a synthesis of the contents of the Workshop on Programming Languages for High Performance Computing (HPCWPL) held at Sandia National Laboratories' Computer Science Research Institute on December 12-13, 2006. This report makes the following observations:

1. Today's High Performance Computing (HPC) applications consist of large, slowly evolving software frameworks. These frameworks are typically decades old and have already survived large transitions in computer architecture (e.g., Vector to MPP).

2. Emerging HPC applications are very different from their traditional scientific computing counterparts. As a result, the "lessons" from scientific computing do not address the problems of these applications.

3. Programmers within the DOE are typically experts. Consequently, performance is more important than rapid prototyping.

4. HPC workloads share little in common with commercial workloads. Thus, very little reuse can be made of commercial programming models. For example, Transactional Memory is touted as the correct mechanism for programming multicore architectures, but it is inappropriate for HPC workloads.

5. The memory wall dominates computer architecture and is increasingly problematic. Even in the commodity space, the transition to multicore architectures will tremendously increase the demands on memory systems. Memory latency dominates the performance of individual processor cores, while memory bandwidth dominates the number of cores that can be put on a die. Both are problems.

6. While supercomputer scaling was previously dominated by interconnect performance, future machine scaling will be dominated by memory. Because current programming models focus on interconnection networks rather than memory, this could significantly limit the ability to scale applications.

7. Both conventional and emerging architectures require additional concurrency to be easily specified. MPI is virtually certain not to be able to provide all of the required concurrency.

8. Emerging heterogeneous architectures (such as Cell) pose a significant workload partitioning problem. The job of performing this partitioning is currently the programmer's, and there is little hope for automation on the horizon.

As a result, the following recommendations are proffered:

I. We should pursue an evolutionary rather than revolutionary approach to HPC languages. There are both historical precedents (in the form of MPI) and practical requirements (in the form of a multi-billion dollar investment in existing code) that dictate this path. These languages must be compatible with existing tools and legacy code.
II. The evolutionary path should address application requirements first and machine architecture requirements second (but should address both). New programming models (and architectures) have already been adopted to solve problems that cannot be addressed using the conventional MPI/MPP model. There are common features among radically different problems that cannot be solved today that should be the focus of new programming language extensions. These features include requirements for fine-grained parallelism, synchronization, and data access.

III. An evolutionary language should introduce a minimal set of abstractions, which should match HPC's unique requirements. MPI succeeded by introducing a very small number of key routines that mapped well onto applications and architectures. In the commercial space, Transactional Memory introduces small extensions to introduce nondeterminism into programs (and find parallelism). However, HPC applications typically require deterministic parallelism.

IV. Languages need to support large numbers of efficient, light-weight, locality-aware threads. This serves to increase concurrency, address the core problem of the memory wall, and begin to address the cache coherency problem. (An illustrative sketch of layering such concurrency on existing MPI code appears after Table 1 below.)

V. We must identify problems that cannot be solved today, and develop programming languages (and architectures) that solve those problems. This supports recommendations I-IV by enabling the creation of a significant user base to support the early development of these languages, and by focusing language development on satisfying a specific need.

We are currently at a unique point in the evolution of HPC. Application and underlying technology requirements dictate programming models to support new ways of solving humanity's problems. Today, architectures and applications are evolving faster than the programming models needed to support them. These recommendations are intended as a first-cut strategy for addressing those needs and enabling the solution of new, larger-scale problems.

Table 1. HPCWPL Participants

Berry, Jon (SNL)                 Lucas, Bob (ISI)
Brightwell, Ron (SNL)            Maccabe, Barney (UNM)
Chamberlain, Brad (Cray)         Mehrotra, Piyush (NASA)
Chase, David (Sun)               Murphy, Richard (SNL)
Chtchelkanova, Almadena (NSF)    Numrich, Bob (U. of Minnesota)
DeBenedictis, Erik (SNL)         Olukotun, Kunle (Stanford)
DeRose, Luiz (Cray)              Perumalla, Kalyan (ORNL)
Doerfler, Doug (SNL)             Poole, Stephen (ORNL)
Gao, Guang (U. of Delaware)      Pundit, Neil (SNL)
Gokhale, Maya (LANL)             Rodrigues, Arun (SNL)
Gropp, Bill (ANL)                Snir, Marc (UIUC)
Hall, Mary (USC/ISI)             Sterling, Thomas (LSU)
Hecht-Nielsen, Robert (UCSD)     Underwood, Keith (SNL)
Hemmert,
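The following is a minimal, hedged sketch (not taken from the report) of the kind of evolutionary layering referred to in observation 7 and recommendation IV: an existing MPI decomposition is left intact, and additional on-node concurrency is supplied by OpenMP threads, which stand in here for the lighter-weight, locality-aware threads the report calls for. All names and parameters are illustrative; compile with something like "mpicc -fopenmp sketch.c".

    /* Hypothetical sketch (not from the report): one MPI rank per node, with
     * many OpenMP threads supplying the extra on-node concurrency that MPI
     * alone does not express. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        int rank, nranks;
        double local[N], sum = 0.0, global = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* The existing MPI decomposition stays untouched (evolutionary path):
         * each rank owns a block of data. */
        for (int i = 0; i < N; i++)
            local[i] = (double)(rank * N + i);

        /* Additional fine-grained concurrency is layered on within the rank. */
        #pragma omp parallel for reduction(+:sum) schedule(static)
        for (int i = 0; i < N; i++)
            sum += local[i] * local[i];

        /* Inter-node coordination remains plain MPI. */
        MPI_Reduce(&sum, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum of squares = %f (across %d ranks)\n", global, nranks);

        MPI_Finalize();
        return 0;
    }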
Recommended publications
  • The State of the Chapel Union
    The State of the Chapel Union. Bradford L. Chamberlain, Sung-Eun Choi, Martha Dumler, Thomas Hildebrandt, David Iten, Vassily Litvinov, Greg Titus. Cray Inc., Seattle, WA 98164. chapel [email protected]

    Abstract: Chapel is an emerging parallel programming language that originated under the DARPA High Productivity Computing Systems (HPCS) program. Although the HPCS program is now complete, the Chapel language and project remain very much alive and well. Under the HPCS program, Chapel generated sufficient interest among HPC user communities to warrant continuing its evolution and development over the next several years. In this paper, we reflect on the progress that was made with Chapel under the auspices of the HPCS program, noting key decisions made during the project's history. We also summarize the current state of Chapel for programmers who are interested in using it today. And finally, we describe current and ongoing work to evolve it from prototype to production-

    Cray's entry in the DARPA HPCS program concluded successfully in the fall of 2012, culminating with the launch of the Cray XC30 system. Even though the HPCS program has completed, the Chapel project remains active and growing. This paper's goal is to present a snapshot of Chapel as it stands today at the juncture between HPCS and the next phase of Chapel's development: to review the Chapel project, describe its status, summarize its successes and lessons learned, and sketch out the next few years of the language's life cycle. Generally speaking, this paper reports on Chapel as of version 1.7.0, released on April 18th, 2013.
  • Parallel Languages & CUDA
    CS 470, Spring 2019. Mike Lam, Professor. Parallel Languages & CUDA. Graphics and content taken from the following: http://dl.acm.org/citation.cfm?id=2716320 http://chapel.cray.com/papers/BriefOverviewChapel.pdf http://arxiv.org/pdf/1411.1607v4.pdf https://en.wikipedia.org/wiki/Cilk

    Parallel languages. Writing efficient parallel code is hard. We've covered two generic paradigms (shared-memory and distributed message-passing) and three specific technologies, but all in C: Pthreads, OpenMP, and MPI. Can we make parallelism easier by changing our language? Similarly: can we improve programmer productivity?

    Productivity. Economic definition: Productivity = Output / Input. What does this mean for parallel programming? How do you measure input? Bad idea: size of the programming team ("The Mythical Man Month" by Frederick Brooks). How do you measure output? Bad idea: lines of code.

    Productivity vs. Performance. General idea: produce better code faster. "Better" can mean a variety of things: speed, robustness, etc. "Faster" generally means time/personnel investment. Problem: productivity often trades off with performance, e.g., Python vs. C or Matlab vs. Fortran, or garbage collection and thread management.

    Why? Complexity. The core issue is handling complexity, a tradeoff between developer effort and system effort. Hiding complexity from the developer increases the complexity of the system: a higher burden on compiler and runtime systems, implicit features that cause unpredictable interactions, and more middleware increases chance of interference and
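    As a hedged illustration of the point made in this excerpt (not part of the lecture itself), the following C fragment shows how a single OpenMP directive parallelizes a loop that would otherwise require explicit thread management; the file name and build line are assumptions.

        /* Illustrative only: the same loop, serial then parallelized with a
         * single OpenMP directive, showing how a language-level annotation can
         * raise productivity. Build with, e.g., cc -fopenmp dot.c */
        #include <omp.h>
        #include <stdio.h>

        #define N 1000000

        int main(void)
        {
            static double a[N], b[N];
            double dot = 0.0;

            for (int i = 0; i < N; i++) {   /* serial initialization */
                a[i] = 0.5 * i;
                b[i] = 2.0;
            }

            /* One directive expresses the parallelism; thread creation,
             * scheduling, and the reduction are left to compiler and runtime. */
            #pragma omp parallel for reduction(+:dot)
            for (int i = 0; i < N; i++)
                dot += a[i] * b[i];

            printf("dot = %f (threads available: %d)\n", dot, omp_get_max_threads());
            return 0;
        }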
  • Parallel Programming Models
    Lecture 3: Parallel Programming Models. Dr. Wilson Rivera. ICOM 6025: High Performance Computing, Electrical and Computer Engineering Department, University of Puerto Rico.

    Outline. Goal: understand parallel programming models, covering programming models, application models, parallel computing frameworks (OpenMP, MPI, CUDA, Map-Reduce), and parallel computing libraries.

    Parallel Programming Models. A parallel programming model represents an abstraction of the underlying hardware or computer system. Programming models are not specific to a particular type of memory architecture; in fact, any programming model can in theory be implemented on any underlying hardware.

    Programming Language Models: the threads model, the data parallel model, the task parallel model, the message passing model, and the partitioned global address space model.

    Threads Model. A single process can have multiple, concurrent execution paths. The thread-based model requires the programmer to manually manage synchronization, load balancing, and locality, and to avoid deadlocks and race conditions. Implementations include POSIX Threads (specified by the IEEE POSIX 1003.1c standard, 1995, and commonly referred to as Pthreads), a library-based explicit parallelism model, and OpenMP, a compiler directive-based model that supports parallel programming in C/C++ and FORTRAN.

    Data Parallel Model. Focus on data partition. Each task performs the same operation
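    As an illustration of what "manually manage synchronization" means in the threads model described above (a sketch of my own, not from the lecture), the C code below uses Pthreads with an explicit mutex to protect a shared counter and avoid a race condition; all names are hypothetical.

        /* Illustrative sketch: in the raw threads model the programmer creates
         * threads and manages synchronization by hand; a mutex guards a shared
         * counter to avoid a race condition. Build with, e.g., cc -pthread ... */
        #include <pthread.h>
        #include <stdio.h>

        #define NTHREADS 4
        #define INCS_PER_THREAD 100000

        static long counter = 0;
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        static void *worker(void *arg)
        {
            (void)arg;
            for (int i = 0; i < INCS_PER_THREAD; i++) {
                pthread_mutex_lock(&lock);   /* explicit, programmer-managed sync */
                counter++;
                pthread_mutex_unlock(&lock);
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t tid[NTHREADS];

            for (int t = 0; t < NTHREADS; t++)
                pthread_create(&tid[t], NULL, worker, NULL);
            for (int t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);

            printf("counter = %ld (expected %d)\n", counter, NTHREADS * INCS_PER_THREAD);
            return 0;
        }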
  • High Performance Computing Using MPI and OpenMP on Multi-Core
    High performance computing using MPI and OpenMP on multi-core parallel systems. Haoqiang Jin, Dennis Jespersen, Piyush Mehrotra, Rupak Biswas (NAS Division, NASA Ames Research Center, Moffett Field, CA 94035, United States); Lei Huang, Barbara Chapman (Department of Computer Sciences, University of Houston, Houston, TX 77004, United States). Parallel Computing 37 (2011) 562-575. Available online 2 March 2011. Keywords: hybrid MPI + OpenMP programming; multi-core systems; OpenMP extensions; data locality.

    Abstract. The rapidly increasing number of cores in modern microprocessors is pushing the current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems – distributed memory across nodes and shared memory with non-uniform memory access within each node – poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems – a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and two full applications using this approach on several multi-core based systems including an SGI Altix 4700, an IBM p575+ and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures. Published by Elsevier B.V.

    1. Introduction. Multiple computing cores are becoming ubiquitous in commodity microprocessor chips, allowing the chip manufacturers to achieve high aggregate performance while avoiding the power demands of increased clock speeds.
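    The basic shape of the hybrid approach described in this abstract can be sketched as follows; this is my own hedged illustration in C, not code from the paper. MPI is initialized with an explicit thread-support request, ranks map to nodes or sockets, and OpenMP threads fill each node.

        /* Hedged sketch of a hybrid MPI + OpenMP program structure. */
        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int provided, rank;

            /* FUNNELED: only the master thread of each rank makes MPI calls. */
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            if (provided < MPI_THREAD_FUNNELED && rank == 0)
                fprintf(stderr, "warning: MPI library lacks requested thread support\n");

            #pragma omp parallel
            {
                /* Node-level (shared-memory) work goes here. */
                #pragma omp master
                printf("rank %d running %d OpenMP threads\n", rank, omp_get_num_threads());
            }

            /* Inter-node (distributed-memory) communication stays in MPI,
             * issued outside (or from the master of) the parallel region. */
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Finalize();
            return 0;
        }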
  • Single-Sided PGAS Communications Libraries
    Single-Sided PGAS Communications Libraries. Parallel Programming Languages and Approaches.

    Contents: a little bit of history; non-parallel programming languages; early parallel languages; current status of parallel programming; parallelisation strategies; mainstream HPC; alternative parallel programming languages; single-sided communication; PGAS; final remarks and summary.

    Non-Parallel Programming Languages. Serial languages remain important, both for general scientific computing and as the basis for parallel languages. [Bar chart: PRACE survey response counts (0 to 250) by language: Fortran, C, C++, Python, Perl, Java, Chapel, Co-array Fortran, other.] The PRACE survey indicates that nearly all applications are written in Fortran (well suited for scientific computing) and C/C++ (allows good access to hardware), supplemented by scripts using Python, PERL and BASH; PGAS languages are starting to be used.

    Data Parallel. Processors perform similar operations across elements in an array. A higher-level programming paradigm, characterised by single-threaded control, a global name space, loosely synchronous processes, parallelism implied by operations applied to data, and compiler directives. Data parallel languages are generally a serial language (e.g., F90) plus compiler directives (e.g., for data distribution), first-class language constructs to support parallelism, and new intrinsics and library functions. The paradigm was well suited to a number of early (SIMD) parallel computers. Connection
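    To illustrate the single-sided communication model named in this excerpt, here is a hedged sketch in C using MPI's one-sided (RMA) interface as a stand-in for PGAS-style get/put libraries; it is not code from the slides, and it assumes at least two ranks (e.g., mpirun -np 2).

        /* Single-sided communication: rank 0 writes into rank 1's window
         * without rank 1 posting a matching receive. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, nranks;
            int buf = -1;               /* window memory exposed to remote puts */
            MPI_Win win;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nranks);

            /* Every rank exposes one int for remote access. */
            MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                           MPI_COMM_WORLD, &win);

            MPI_Win_fence(0, win);      /* open an access epoch */
            if (rank == 0 && nranks > 1) {
                int value = 42;
                /* One-sided put: no corresponding MPI_Recv on rank 1. */
                MPI_Put(&value, 1, MPI_INT, 1 /* target rank */,
                        0 /* displacement */, 1, MPI_INT, win);
            }
            MPI_Win_fence(0, win);      /* close the epoch; data is now visible */

            if (rank == 1)
                printf("rank 1 received %d via one-sided put\n", buf);

            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }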
  • Towards High Performance Computing (HPC) Through Parallel Programming Paradigms and Their Principles
    Towards High Performance Computing (HPC) Through Parallel Programming Paradigms and Their Principles. Dr. Brijender Kahanwal, Department of Computer Science & Engineering, Galaxy Global Group of Institutions, Dinarpur, Ambala, Haryana, India. International Journal of Programming Languages and Applications (IJPLA), Vol. 4, No. 1, January 2014.

    Abstract. Nowadays, we need to find solutions to huge computing problems very rapidly. This brings in the idea of parallel computing, in which several machines or processors work cooperatively on computational tasks. Over the past decades, perceptions of the importance of parallelism in computing machines have varied considerably, and it has been observed that parallel computing is a superior answer to many computing limitations, such as speed and density; non-recurring and high cost; and power consumption and heat dissipation. Commercial multiprocessors have emerged with lower prices than mainframe machines and supercomputers. In this article, high performance computing (HPC) through parallel programming paradigms (PPPs) is discussed, along with their constructs and design approaches.

    Keywords: parallel programming languages, parallel programming constructs, distributed computing, high performance computing.

    1. Introduction. Numerous computationally intensive tasks in computer science, such as weather forecasting, climate research, oil and gas exploration, molecular modelling, quantum mechanics, and physical simulations, are performed by supercomputers as well as mainframe computers. These days, due to advances in technology, multiprocessor or multi-core processor systems are coming to perform the kinds of computations previously carried out by supercomputing machines. Owing to recent advances in hardware technology, we are leaving the von Neumann computation model and adopting distributed computing models, which include peer-to-peer (P2P), cluster, cloud, grid, and jungle computing models [1].
  • A Brief Overview of Chapel
    A Brief Overview of Chapel (pre-print of an upcoming book chapter). Bradford L. Chamberlain, Cray Inc. January 2013, revision 1.0. This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0001. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.

    Chapter 9: Chapel. Bradford L. Chamberlain, Cray Inc. Chapel is an emerging parallel language whose design and development has been led by Cray Inc. under the DARPA High Productivity Computing Systems program (HPCS) from 2003 to the present. Chapel supports a multithreaded execution model, permitting the expression of far more general and dynamic styles of computation than the typical single-threaded Single Program, Multiple Data (SPMD) programming models that became dominant during the 1990's. Chapel was designed such that higher-level abstractions, such as those supporting data parallelism, would be built in terms of lower-level concepts in the language, permitting the user to select amongst differing levels of abstraction and control as required by their algorithm or performance requirements. This chapter provides a brief introduction to Chapel, starting with a condensed history of the project (Section 9.1). It then describes Chapel's motivating themes (Section 9.2), followed by a survey of its main features (Section 9.3), and a summary of the project's status and future work (Section 9.4).

    9.1 A Brief History of Chapel. 9.1.1 Inception. DARPA's HPCS program was launched in 2002 with five teams, each led by a hardware vendor: Cray Inc., Hewlett-Packard, IBM, SGI, and Sun.
  • Irregular Computations in Fortran: Expression and Implementation Strategies
    Irregular Computations in Fortran: Expression and Implementation Strategies. Jan F. Prins, Siddhartha Chatterjee, and Martin Simons. Department of Computer Science, The University of North Carolina, Chapel Hill, NC 27599-3175. {prins,sc,simons}@cs.unc.edu

    Abstract. Modern dialects of Fortran enjoy wide use and good support on high-performance computers as performance-oriented programming languages. By providing the ability to express nested data parallelism, modern Fortran dialects enable irregular computations to be incorporated into existing applications with minimal rewriting and without sacrificing performance within the regular portions of the application. Since performance of nested data-parallel computation is unpredictable and often poor using current compilers, we investigate threading and flattening, two source-to-source transformation techniques that can improve performance and performance stability. For experimental validation of these techniques, we explore nested data-parallel implementations of the sparse matrix-vector product and the Barnes-Hut n-body algorithm by hand-coding thread-based (using OpenMP directives) and flattening-based versions of these algorithms and evaluating their performance on an SGI Origin 2000 and an NEC SX-4, two shared-memory machines.

    1 Introduction. Modern science and engineering disciplines make extensive use of computer simulations. As these simulations increase in size and detail, the computational costs of naive algorithms can overwhelm even the largest parallel computers available today. Fortunately, computational costs can be reduced using sophisticated modeling methods that vary model resolution as needed, coupled with sparse and adaptive solution techniques that vary computational effort in time and space as needed. Such techniques have been developed and are routinely employed in sequential computation, for example, in cosmological simulations (using adaptive n-body methods) and computational fluid dynamics (using adaptive meshing and sparse linear system solvers).
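    The sparse matrix-vector product named in this abstract is a standard example of an irregular computation. The sketch below is a C analogue of the thread-based (OpenMP) approach, written by me as an illustration rather than taken from the paper's Fortran code; the irregularity lies in the inner loop, whose length varies per row.

        /* CSR sparse matrix-vector product with a thread-based OpenMP outer loop. */
        #include <omp.h>
        #include <stdio.h>

        /* y = A * x for an n-row CSR matrix:
         *   rowptr[i]..rowptr[i+1]-1 indexes the nonzeros of row i,
         *   colind[k] and val[k] give their column and value. */
        void spmv_csr(int n, const int *rowptr, const int *colind,
                      const double *val, const double *x, double *y)
        {
            /* Rows are independent, so the outer loop is parallelized; a dynamic
             * schedule helps when nonzero counts differ widely across rows. */
            #pragma omp parallel for schedule(dynamic, 64)
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int k = rowptr[i]; k < rowptr[i + 1]; k++)  /* irregular length */
                    sum += val[k] * x[colind[k]];
                y[i] = sum;
            }
        }

        int main(void)
        {
            /* Small 3x3 example: [10 0 0; 0 20 30; 40 0 50], x = (1,2,3). */
            int rowptr[] = {0, 1, 3, 5};
            int colind[] = {0, 1, 2, 0, 2};
            double val[] = {10, 20, 30, 40, 50};
            double x[] = {1, 2, 3}, y[3];

            spmv_csr(3, rowptr, colind, val, x, y);
            printf("y = %g %g %g (expected 10 130 190)\n", y[0], y[1], y[2]);
            return 0;
        }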
  • Array Optimizations for High Productivity Programming Languages
    RICE UNIVERSITY. Array Optimizations for High Productivity Programming Languages, by Mackale Joyner. A thesis submitted in partial fulfillment of the requirements for the degree Doctor of Philosophy. Approved, thesis committee: Vivek Sarkar (Co-Chair), E.D. Butcher Professor of Computer Science; Zoran Budimlić (Co-Chair), Research Scientist; Keith Cooper, L. John and Ann H. Doerr Professor of Computer Engineering; John Mellor-Crummey, Professor of Computer Science; Richard Tapia, University Professor, Maxfield-Oshman Professor in Engineering. Houston, Texas, September 2008.

    Abstract. While the HPCS languages (Chapel, Fortress and X10) have introduced improvements in programmer productivity, several challenges still remain in delivering high performance. In the absence of optimization, the high-level language constructs that improve productivity can result in order-of-magnitude runtime performance degradations. This dissertation addresses the problem of efficient code generation for high-level array accesses in the X10 language. The X10 language supports rank-independent specification of loop and array computations using regions and points. Three aspects of high-level array accesses in X10 are important for productivity but also pose significant performance challenges: high-level accesses are performed through Point objects rather than integer indices, variables containing references to arrays are rank-independent, and array subscripts are verified as legal array indices during runtime program execution. Our solution to the first challenge is to introduce new analyses and transformations that enable automatic inlining and scalar replacement of Point objects. Our solution to the second challenge is a hybrid approach. We use an interprocedural rank analysis algorithm to automatically infer ranks of arrays in X10.
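    To make the idea of "inlining and scalar replacement of Point objects" concrete, here is a rough C analogue of my own (the thesis targets X10, where Point is a heap-allocated object, so this is only an approximation of the effect the optimization recovers automatically); all names are hypothetical.

        /* Before/after view of scalar replacement of a point-like index object. */
        #include <stdio.h>

        #define N 4

        typedef struct { int i, j; } Point2;   /* stand-in for an X10 Point */

        /* Before: every array access goes through a Point2 object. */
        static double sum_with_points(const double a[N][N])
        {
            double sum = 0.0;
            for (int i = 0; i < N; i++) {
                for (int j = 0; j < N; j++) {
                    Point2 p = { i, j };       /* an index object per iteration */
                    sum += a[p.i][p.j];
                }
            }
            return sum;
        }

        /* After scalar replacement: the object is gone, only scalars remain. */
        static double sum_with_scalars(const double a[N][N])
        {
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    sum += a[i][j];
            return sum;
        }

        int main(void)
        {
            double a[N][N];
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    a[i][j] = i + 0.25 * j;

            printf("before: %f  after: %f\n", sum_with_points(a), sum_with_scalars(a));
            return 0;
        }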