Optimizing the LULESH Stencil Code Using Concurrent Collections


Chenyang Liu
Purdue University
465 Northwestern Avenue
West Lafayette, Indiana
[email protected]

Milind Kulkarni
Purdue University
465 Northwestern Avenue
West Lafayette, Indiana
[email protected]

ABSTRACT

Writing scientific applications for modern multicore machines is a challenging task. There are a myriad of hardware solutions available for many different target applications, each having its own advantages and trade-offs. An attractive approach is Concurrent Collections (CnC), which provides a programming model that separates the concerns of the application expert from those of the performance expert. CnC uses a data and control flow model paired with philosophies from previous data-flow programming models and tuple-space influences. By following the CnC programming paradigm, the runtime will seamlessly exploit available parallelism regardless of the platform; however, there are limitations to its effectiveness depending on the algorithm. In this paper, we explore ways to optimize the performance of the proxy application Livermore Unstructured Lagrange Explicit Shock Hydrodynamics (LULESH), written using Concurrent Collections. The LULESH algorithm is expressed as a minimally-constrained set of partially-ordered operations with explicit dependencies. However, performance is plagued by scheduling overhead and synchronization costs caused by the fine granularity of computation steps. For LULESH and similar stencil codes, we show that an algorithmic CnC program can be tuned by coalescing CnC elements through step fusion and tiling to become a well-tuned, scalable application on multicore systems. With these optimizations, we achieve up to 38x speedup over the original implementation, with good scalability on machines with up to 48 processors.

Keywords

Concurrent Collections, LULESH, Tiling

1. INTRODUCTION

Effectively programming scientific applications is a very challenging task. There are many research fields dedicated to producing new models and methods for simulating real-world phenomena. However, the domain scientists who develop these models require expertise in parallel programming to run their applications effectively and take advantage of modern and future architectures. An intuitive way to solve this problem is to use an architecture-agnostic programming language that separates the responsibilities of the domain scientist from those of the programming expert. This can be achieved using Concurrent Collections (CnC), a parallel programming model that combines ideas from previous work on tuple spaces, streaming languages, and dataflow languages [2, 7, 13].

In CnC, algorithmic programs are expressed as a set of partially-ordered operations with explicitly defined dependencies. The CnC runtime exploits potential parallelism, so long as it satisfies the dependencies set by the programmer. The responsibility of the programmer is minimal, only requiring them to provide the basic semantic scheduling constraints of their problem. The advantages of simplified programmability come at the expense of sub-optimal fine-grained parallelism for these programs. In this paper, we explore how to rewrite CnC programs with specific equivalent-state transformations to boost parallel performance. We investigate the effects of these transformation techniques on the Livermore Unstructured Lagrange Explicit Shock Hydrodynamics (LULESH) mini-app, a hydrodynamics code created in the DARPA UHPC program [9, 11].

We discover that there are many opportunities to tune complex CnC programs. Algorithms with many unique computational steps can be transformed to alter the working-set granularity through step fusion or distribution. These transformations are similar to classic loop fusion and distribution, providing benefits through computation reordering while preserving execution semantics. Furthermore, we look at tiling the computation steps to increase the working-set size and reduce the amount of fine-grained parallelism. These techniques help reduce scheduling overhead and improve data locality for stencil codes such as LULESH. In our experiments, we show that step fusion and tiling greatly impact performance and scalability in the mini-app, giving up to 38x speedup over the sequential baseline implementation on 48 cores for a moderate (60³) simulation size.
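To make the loop analogy concrete, the toy sketch below (our illustration, not code from the paper) shows the classic loop-fusion transformation that step fusion mirrors at the CnC level: two passes over the data become one, reordering the computation while preserving its semantics and improving locality.

```cpp
// Loop distribution vs. fusion, as an analogy for CnC step fusion.
#include <cstdio>
#include <vector>

int main() {
    const int n = 8;
    std::vector<double> a(n), b(n);

    // Distributed form: two passes, analogous to two fine-grained CnC steps.
    for (int i = 0; i < n; ++i) a[i] = 2.0 * i;     // "step A" produces a
    for (int i = 0; i < n; ++i) b[i] = a[i] + 1.0;  // "step B" consumes a

    // Fused form: one pass, analogous to a single coarser CnC step. Each
    // a[i] is still produced before b[i] consumes it, so semantics are
    // preserved while scheduling overhead and cache traffic drop.
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0 * i;
        b[i] = a[i] + 1.0;
    }

    std::printf("b[%d] = %.1f\n", n - 1, b[n - 1]);
    return 0;
}
```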
2. OVERVIEW OF CNC

In this section, we provide a brief overview of Concurrent Collections (CnC). We summarize the basic concepts and highlight the important aspects that make CnC a compelling programming model for scientific applications such as LULESH. A more in-depth description of CnC can be found in previous related work [2, 3, 14].

The CnC programming model separates the algorithmic semantics of a program from how the program is executed. This is an attractive solution for a domain scientist, whose primary concern is algorithmic correctness and stability. CnC relies on a tuning expert, whose expertise is in performance, to properly map the application to a specific platform. This separation of concerns simplifies the roles of each party. The domain scientist can program without needing to reason about parallelism or architecture constraints, while the tuning expert can focus on performance, given maximum freedom to map the execution onto the hardware.

In exchange, CnC programs must adhere to a certain set of rules. Firstly, computation units (steps) in CnC must execute statelessly, so they cannot modify any global data. Instead, the language provides built-in put/get operations to store and retrieve data items, which assist the runtime with scheduling dependencies. Secondly, all data items exhibit the dynamic single-assignment property, meaning they are immutable once written. This requirement prevents data races during parallel execution and helps resolve data dependencies at runtime.
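A minimal sketch of these two rules, written against the Intel CnC C++ API (all names here are our own illustrative choices, not from the paper):

```cpp
#include <cnc/cnc.h>

struct my_context;

// Rule 1: a step is a stateless functor. It may not touch global data; it
// communicates only through put/get on the context's collections.
struct my_step {
    int execute(const int &tag, my_context &ctx) const;
};

struct my_context : public CnC::context<my_context> {
    CnC::step_collection<my_step>     steps;
    CnC::tag_collection<int>          tags;
    CnC::item_collection<int, double> items;

    my_context()
        : CnC::context<my_context>(),
          steps(*this), tags(*this), items(*this)
    {
        tags.prescribes(steps, *this);  // each tag creates one step instance
    }
};

int my_step::execute(const int &tag, my_context &ctx) const {
    double in;
    ctx.items.get(tag, in);            // declares a consumer dependence
    ctx.items.put(tag + 1, 2.0 * in);  // Rule 2: item tag+1 is written exactly
                                       // once (dynamic single assignment)
    return CnC::CNC_Success;
}
```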
In the following section, we give a brief overview of the CnC programming model.

CnC Programming Model

We describe the CnC language using the simple example of a reduction routine. Assume that we wish to sum values from four different sources, F = Σ_{i=1}^{4} x_i. Figure 1 shows a CnC representation of how a domain expert may express the computation and its dependencies.

Figure 1: Reduction Operation in CnC

There are three types of nodes in this figure: computation steps, data items, and control tags. [...] in the tag collection. In our example, a tag is created for every reduction step that needs to execute and is represented as the yellow triangle. Finally, the program data is stored in the (data) item collection, shown in Figure 1 by the rectangular nodes. The item collection contains instances of data that steps use as inputs or outputs. Each data item is dynamic and is atomically assigned once in its lifetime (dynamic single assignment) in order to provide the runtime with definitive information and prevent race conditions. These data items are also important for scheduling during execution; many steps consume data items produced by previous steps as well as produce their own, indicating program dependencies.

In a typical CnC program, a step collection is defined similarly to a traditional C function, except that it cannot hold any state and must receive data from other steps through data items. For our reduction, the program first prescribes the reduction computation step using a control tag t_i. Once prescribed, the step will be scheduled to run once all of its data dependencies are resolved. In the example, the reduction step waits to get 4 values, x1 to x4, indicating that the function will consume those values produced by a previous computation. Just as the get construct indicates a consumer dependence, the producers of the data make those values available by initiating a put. Once the computation is performed in the reduction step, it likewise stores the result F by initiating a put to its respective data item under a tag for future retrieval. The get and put constructs assist the runtime in scheduling steps and resolving dependencies.

A step collection can only be associated with a single tag collection/space; however, a single tag collection can prescribe multiple step collections. Steps are created when a control tag prescribes a particular instance of the step collection. While the existence of a tag prescribed for a specific step means that instance of the step will execute, it does not specify when that step executes. Execution of the step depends on the availability of data items (consumer dependencies) as well as on the CnC scheduler, which determines how tasks are handled. How and when tags are assigned is left up to the programmer. CnC allows tags to be prescribed in advance, letting the scheduler manage dormant tags, as well as dynamically as each step instance's data becomes ready. Previous work shows that tradeoffs exist for each approach, but this is a flexible parameter that can be tuned [4].

Tasks are scheduled as the program executes, with steps producing data and sometimes prescribing tags. A step becomes available as soon as its tag is prescribed, and produced data items become ready once they are atomically written. Whenever its item dependencies are ready and its tag has been prescribed, a step becomes enabled and may be executed. The runtime exploits parallelism by scheduling available step instances as soon as they become enabled. Once all prescribed steps have been executed, the CnC program completes.

In principle, CnC programs should be well-tailored to run [...]
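Returning to the reduction of Figure 1, the self-contained sketch below shows how this example might look in the Intel CnC C++ API. It is our own illustrative reconstruction (the paper does not give source code for the figure): the environment puts x1 to x4, a single control tag prescribes the reduction step, and the step gets the four items and puts F.

```cpp
#include <cnc/cnc.h>

struct reduce_ctx;

struct reduce_step {
    int execute(const int &tag, reduce_ctx &ctx) const;
};

struct reduce_ctx : public CnC::context<reduce_ctx> {
    CnC::step_collection<reduce_step> steps;
    CnC::tag_collection<int>          tags;  // control tags (yellow triangle)
    CnC::item_collection<int, double> x;     // input items x1..x4
    CnC::item_collection<int, double> f;     // result item F

    reduce_ctx()
        : CnC::context<reduce_ctx>(),
          steps(*this), tags(*this), x(*this), f(*this)
    {
        tags.prescribes(steps, *this);
    }
};

int reduce_step::execute(const int &tag, reduce_ctx &ctx) const {
    double sum = 0.0;
    for (int i = 1; i <= 4; ++i) {
        double xi;
        ctx.x.get(i, xi);  // four consumer dependences; the runtime re-runs
        sum += xi;         // the step once all four items are available
    }
    ctx.f.put(tag, sum);   // store F under the step's tag (single assignment)
    return CnC::CNC_Success;
}

int main() {
    reduce_ctx ctx;
    ctx.tags.put(0);  // prescribe one reduction instance; it is available
                      // now but not yet enabled, since no items exist
    for (int i = 1; i <= 4; ++i)
        ctx.x.put(i, static_cast<double>(i));  // producers put x1..x4
    ctx.wait();       // run all enabled steps to completion
    double F = 0.0;
    ctx.f.get(0, F);  // F = 1 + 2 + 3 + 4 = 10
    return 0;
}
```

Note that the relative order of the tag put and the item puts in main does not matter: prescribing the tag only makes the step available, and the step becomes enabled, and is executed, once all four gets can be satisfied, matching the scheduling behavior described above.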