Intel Concurrent Collections C++ API and Runtime


Concurrent Collections (CnC)
Kath Knobe, [email protected]
2012/03/29

Outline
• Intro to CnC
• Tuning CnC
• CnC for distributed memory

The Big Idea
• Don't think about what operations can run in parallel
  – Difficult, and depends on the target
• Think only about the semantic ordering constraints
  – Easier, and depends only on the application

Exactly Two Sources of Ordering Requirements
• Producer must execute before consumer
  – Producer-consumer: COMPUTE STEP step1 → DATA ITEM item → COMPUTE STEP step2
• Controller must execute before controllee
  – Controller-controllee: COMPUTE STEP step1 → CONTROL TAG tony → COMPUTE STEP step2

Cholesky factorization
• The running example: three compute step collections, Cholesky, Trisolve, and Update (slightly simplified).

4 simple steps to a CnC application

1: The white board level
• Data
• Computations
• Producer/consumer relations among them
• I/O
(Collections: COMPUTE STEP Cholesky, COMPUTE STEP Trisolve, COMPUTE STEP Update, DATA ITEM Array)

2: Distinguish among the instances
• Distinct instances (chunks of data or computation) may be distinctly scheduled and mapped
• A collection is a statically named set of dynamic instances
• Optional: tag functions relating the tags, e.g., Cholesky: iter → Array: iter, iter, iter
(Collections: COMPUTE STEP Cholesky: iter; COMPUTE STEP Trisolve: row, iter; COMPUTE STEP Update: col, row, iter; DATA ITEM Array: col, row, iter)

Control tag
(Example: CONTROL TAG Ceasar prescribes COMPUTE STEP foo)
• A tuple, a unique identifier
  – <row, col, iteration>
  – <socialSecurityNumber, date>
  – <treeNode>
• Meaning of the control relationship
  – For each row/col/iteration tuple in the collection Ceasar, the corresponding instance of foo will execute sometime.
  – An instance of foo has access to the values of row, col, and iteration.
• Control dependence and data dependence are on the same footing
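At this point the deck has introduced step, item, and control tag collections. In the C++ API these become members of a context class, with the controller-controllee relation declared in the constructor. The sketch below is a minimal illustration of that pattern using the release's CnC::context, CnC::step_collection, CnC::item_collection, and CnC::tag_collection templates; the step class, the int tags, and the double items (my_step, my_context) are hypothetical placeholders, not part of the deck.

```cpp
#include <cnc/cnc.h>

struct my_context; // forward declaration

// A step is a class with an execute method; the runtime calls it once
// for every tag instance prescribed to the step collection.
struct my_step
{
    int execute( const int & tag, my_context & c ) const;
};

// The context ties the collections together, mirroring the white-board graph.
struct my_context : public CnC::context< my_context >
{
    CnC::step_collection< my_step >     steps; // compute step collection
    CnC::item_collection< int, double > items; // data items, keyed by an int tag
    CnC::tag_collection< int >          tags;  // control tags

    my_context() : steps( *this ), items( *this ), tags( *this )
    {
        tags.prescribes( steps, *this ); // controller-controllee relation
    }
};

// Producer/consumer: get the inputs, compute, put the outputs.
int my_step::execute( const int & tag, my_context & c ) const
{
    double x;
    c.items.get( tag, x );           // runs when the input item is avail
    c.items.put( tag + 1, 2 * x );   // dynamic single assignment: fresh tag per put
    return CnC::CNC_Success;
}
```

The runtime then executes one instance of my_step for each tag put into tags, exactly the control relationship described above.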
3: What are the control tag collections?
• CONTROL TAG CholeskyTag: iter → COMPUTE STEP Cholesky: iter
• CONTROL TAG TrisolveTag: row, iter → COMPUTE STEP Trisolve: row, iter
• CONTROL TAG UpdateTag: col, row, iter → COMPUTE STEP Update: col, row, iter
• DATA ITEM Array: col, row, iter

4: Who produces control?
(Diagram: the same graph, now with arrows from the steps that produce each control tag collection.)

Actual app: cell tracker
(Steps in parentheses, items in brackets, control tags in angle brackets.)
<K>, (Cell detector: K), [Cell candidates: K], [Histograms: K], [Input Image: K], (Cell tracker: K), [Labeled cells initial: K], (Arbitrator initial: K), [State measurements: K], <Cell tags Initial: K, cell>, (Correction filter: K, cell), [Motion corrections: K, cell], [Labeled cells final: K], (Arbitrator final: K), [Final states: K], <Cell tags Final: K, cell>, (Prediction filter: K, cell), [Predicted states: K, cell]

Sample step (cholesky)

```cpp
// Perform unblocked Cholesky factorization on the input block.
// Output is a lower triangular matrix.
int cholesky::execute( const int & t, cholesky_context & c ) const
{
    tile_const_ptr_type A_block;
    tile_ptr_type       L_block;

    int b = c.b;
    const int iter = t;

    // Get data used by this step
    // (the tile types, triple, col, and row come from the surrounding sample).
    c.Array.get( triple( iter, col, row ), A_block ); // get the input tile
    L_block = std::make_shared< tile_type >( b );     // allocate the output tile

    // Normal computation
    for( int k_b = 0; k_b < b; k_b++ ) {
        // do all the math on this tile
    }

    // Put data produced by this step
    c.Array.put( triple( iter+1, col, row ), L_block ); // write the output tile
    return CnC::CNC_Success;
}
```

• Steps have no side effects
• Steps are atomic
• Items are dynamic single assignment
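The environment that surrounds a graph follows the same get/put discipline. Below is a hedged sketch, reusing the hypothetical my_context from the earlier sketch, of how a main program might drive a graph: put the initial items and control tags, wait for the graph to quiesce, then get results. CnC::context::wait() and the put/get calls are from the release API; the particular tags and values are illustrative.

```cpp
#include <iostream>
// Assumes my_context and my_step from the earlier sketch.

int main()
{
    my_context ctx;

    ctx.items.put( 0, 3.14 );   // environment produces the initial item
    ctx.tags.put( 0 );          // prescribe step instance 0

    ctx.wait();                 // block until all prescribed steps have executed

    double result;
    ctx.items.get( 1, result ); // step instance 0 put the item tagged 1
    std::cout << "result = " << result << std::endl;
    return 0;
}
```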
Shared memory Spike Performance (James Brodman, UIUC / Intel)
• Parallel solver for banded linear systems of equations
• Complex algorithm, but "easily" maps onto CnC
• Same code for shared and distributed memory
(Chart: runtime in seconds for MKL, MKL+OMP, HTA (TBB), CnC, HTA (MPI), and DistCnC on a system of size 1048576 with 128 super/sub diagonals and 32 partitions; distributed-memory runs on 4 nodes over GigE.)

Semantics of CnC
• Semantics are defined in terms of tag attributes of instances.
• The current state is described by the current set of instances and their attributes.
(Diagram: a step instance becomes controlReady and dataReady, then ready, then executed; an item instance becomes avail.)

CnC is a coordination language, paired with a computation language
• Existing: C++ (Intel), Java (Rice), Haskell (Indiana)
• In the works: Fortran (Rice), Chapel (UIUC)
• Next up: Python (Rice), Matlab (Rice), Scala (Rice), APL (Indiana)
(Rice = Vivek Sarkar's group, Indiana = Ryan Newton's group, UIUC = David Padua's group)

CnC is a high-level model: it allows for a wide variety of runtime approaches

                   grain     distribution   schedule
HP TStreams        static    static         static
HP TStreams        static    static         dynamic
Intel CnC          static    dynamic        dynamic
Rice CnC           static    dynamic        dynamic
Georgia Tech CnC   dynamic   dynamic        dynamic

Possible expressions of the model
• Graphical interface
• Textual representation
  – Rice version
  – Previous Intel versions
• API
  – Current Intel release

Key properties
• Think about the meaning of your program, not about parallelism
• Deterministic and serializable, hence easy to debug
• Separation of concerns simplifies everything
  – Computation from coordination
  – Domain spec from tuning spec
• The domain spec is free of tuning and largely independent of architecture
• Unified programming for shared and distributed memory
  – Not like the MPI/OpenMP combination

Intel® Concurrent Collections for C++
• New release on Intel's What If site
  – Header files, shared libraries
  – Samples
  – Documentation: CnC concepts, using CnC, tutorials
• Template library with runtime
  – For Windows (IA-32/Intel 64) and Linux (Intel 64)
  – Shared and distributed memory

Outline
• Intro to CnC
• Tuning CnC
• CnC for distributed memory

Tuning (joint work with Rice: Mike Burke, Sanjay Chatterjee, Zoran Budimlic)
• Without tuning
  – Very good performance
  – Typically more than enough parallelism
  – Asynchronous execution
• Multiple tuning specs for each domain spec
  – The tuning spec is additional and separate
  – Semantic constraints from the domain spec remain requirements
  – A tuned execution is one of the many legal executions of the domain spec. Goal: guide away from the bad ones.
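The hierarchical affinity groups described next are the tuning formulation being developed with Rice. In the current Intel C++ release, tuning hooks are instead expressed as tuner classes attached to collections; a minimal sketch of that mechanism follows, assuming the CnC::step_tuner interface from the release (my_tuner and its round-robin distribution are purely illustrative, not from the deck).

```cpp
#include <cnc/cnc.h>

// A tuner refines how step instances are distributed and scheduled
// without changing the semantics of the domain spec.
struct my_tuner : public CnC::step_tuner<>
{
    // Map a step instance (identified by its tag) to a process;
    // an illustrative round-robin distribution.
    int compute_on( const int & tag, my_context & /*c*/ ) const
    {
        return tag % numProcs();
    }
};
```

Such a tuner is attached by naming it as the second template argument of the step collection, e.g. CnC::step_collection< my_step, my_tuner >, leaving the domain spec itself untouched.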
Foundation for tuning: hierarchical affinity groups
• Affinity groups
  – Avoid mappings with poor locality (temporal + spatial)
  – No benefit if too far away in either time or space
  – The foundation doesn't distinguish space from time
  – On top of this foundation we allow (but don't require) time-specific and space-specific control
• Hierarchy
  – Same low-level group => tight affinity
  – Same higher-level group => weaker affinity

Cholesky domain spec
(Diagram: the domain spec from above: CholeskyTag, TrisolveTag, and UpdateTag prescribing Cholesky, Trisolve, and Update, with DATA ITEM Array: col, row, iter.)

Hierarchical affinity structure
(Diagram: the steps Cholesky, Trisolve, and Update nested inside affinity groups.)

Hierarchical affinity groups have names
• AFFINITY GROUP GroupC: COMPUTE STEP Cholesky
• AFFINITY GROUP GroupTU: COMPUTE STEP Trisolve, COMPUTE STEP Update

We distinguish among instances of affinity groups
• AFFINITY GROUP GroupC: iter
  – COMPUTE STEP Cholesky: iter
• AFFINITY GROUP GroupTU: row, iter
  – COMPUTE STEP Trisolve: row, iter
  – COMPUTE STEP Update: col, row, iter

Specify the instances of each affinity group
• AFFINITY GROUP GroupC: iter
  – CONTROL TAG CholeskyTag: iter
  – COMPUTE STEP Cholesky: iter
• AFFINITY …
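The deck's remaining outline topic, CnC for distributed memory, is cut off in this extraction, but the release described above covers shared and distributed memory with the same source code. As a hedged sketch of the documented launch pattern: include cnc/dist_cnc.h instead of cnc/cnc.h and declare a CnC::dist_cnc_init object before creating any context (my_context is again the hypothetical context from the earlier sketches).

```cpp
#include <cnc/dist_cnc.h>

// Declaring dist_cnc_init before creating the context switches the
// runtime into distributed mode; the same program runs unchanged on
// shared memory when built against cnc/cnc.h instead.
int main()
{
    CnC::dist_cnc_init< my_context > dc_init;

    my_context ctx;
    ctx.tags.put( 0 ); // environment code runs on the host process only
    ctx.wait();
    return 0;
}
```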