Analysis of Automatic Parallelization Methods for Multicore Embedded Systems


Degree Project in Information and Communication Technology, Second Level
Stockholm, Sweden 2015

Fredrik Frantzen
2015-01-06
Master's Thesis

Examiner: Mats Brorsson
Academic adviser: Detlef Scholle

KTH Royal Institute of Technology
School of Information and Communication Technology (ICT)
Department of Communication Systems
SE-100 44 Stockholm, Sweden

Acknowledgement

I want to thank my examiner Mats Brorsson and my two supervisors Detlef Scholle and Cheuk Wing Leung for their helpful advice and for making this report possible. I also want to thank the other two thesis workers, Andreas Hammar and Anton Hou, who have made the time at Alten really enjoyable.

Abstract

There is a demand for reducing the cost of porting legacy code to different embedded platforms. One such platform is the multicore system, which offers higher performance at lower energy consumption and is a popular solution in embedded systems. In this report, I evaluate a number of open-source tools that support the parallelization effort. The evaluation uses a set of small, highly parallel programs and two complex face recognition applications, which show the current advantages and disadvantages of the different parallelization methods. The results show that the parallelization tools are not able to parallelize code automatically without substantial human involvement; it is therefore more profitable to parallelize by hand. The outcome of the study is a set of guidelines for developers on how to parallelize their programs, and a set of requirements that serves as a basis for designing an automatic parallelization tool for embedded systems.

Sammanfattning

Det finns ett behov av att minska kostnaderna för portning av legacykod till olika inbyggda system. Ett sådant system är de flerkärniga systemen, som möjliggör högre prestanda med lägre energiförbrukning och är en populär lösning i inbyggda system. I denna rapport har jag utfört en utvärdering av ett antal open source-verktyg som hjälper till med arbetet att parallelisera kod. Detta görs med hjälp av små paralleliserbara program och två komplexa ansiktsigenkännings-applikationer som visar vilka för- och nackdelar de olika parallelliseringsmetoderna har. Resultaten visar att parallelliseringsverktygen inte klarar av att parallellisera automatiskt utan avsevärd mänsklig inblandning. Detta medför att det är lönsammare att parallelisera för hand. Utfallet av denna studie är ett antal riktlinjer för hur man ska göra för att parallelisera sin kod, samt ett antal krav som agerar som bas till att designa ett automatiskt paralleliseringsverktyg för inbyggda system.

Contents

Acknowledgement
List of Tables
List of Figures
Abbreviations
1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Team goal
  1.4 Approach
  1.5 Delimitations
  1.6 Outline
2 Parallel software
  2.1 Programming Parallel Software
    2.1.1 Where to parallelize
    2.1.2 Using OpenMP for shared memory parallelism
    2.1.3 Using MPI for distributed memory parallelism
    2.1.4 Using vector instructions for spatially close data
    2.1.5 Offloading to accelerators
  2.2 To code for different architectures
    2.2.1 Use of hybrid shared and distributed memory
    2.2.2 Tests on accelerator offloading
  2.3 Conclusion
3 Parallelizing methods
  3.1 Using dependency analysis to find parallel loops
    3.1.1 Static dependency analysis
    3.1.2 Dynamic dependency analysis
  3.2 Profiling
  3.3 Transforming code to remove dependencies
    3.3.1 Privatization of variables to remove dependencies
    3.3.2 Reduction recognition
    3.3.3 Induction variable substitution
    3.3.4 Alias analysis
  3.4 Parallelization methods
    3.4.1 Traditional parallelization methods
    3.4.2 Polyhedral model
    3.4.3 Speculative threading
  3.5 Auto-tuning
  3.6 Conclusion
4 Automatic parallelization tools
  4.1 Parallelizers
    4.1.1 PoCC and Pluto
    4.1.2 PIPS-Par4all
    4.1.3 LLVM-Polly
    4.1.4 LLVM-Aesop
    4.1.5 GCC-Graphite
    4.1.6 Cetus
    4.1.7 Parallware
    4.1.8 CAPS
  4.2 Translators
    4.2.1 OpenMP2HMPP
    4.2.2 Step
  4.3 Assistance
    4.3.1 Pareon
  4.4 Comparison of tools and reflection
    4.4.1 Polyhedral optimizers and performance
    4.4.2 Auto-tuning incorporation and performance
    4.4.3 Functional differences
  4.5 Conclusion
5 Programming guidelines for automatic parallelizers
  5.1 How to structure loop headers and bounds
  5.2 Static control parts
  5.3 Loop bodies
  5.4 Array accesses and allocation
  5.5 Variable scope
  5.6 Function calls and stubbing
  5.7 Function pointers
  5.8 Alias analysis problems: Pointer arithmetic and type casts
  5.9 Reductions
  5.10 Conclusion
6 Implementation
  6.1 Implementation approach
  6.2 Requirements
7 The applications to parallelize
  7.1 Face recognition applications
    7.1.1 Training application
    7.1.2 Detector application
  7.2 PolyBench benchmark applications
8 Results from evaluating the tools
  8.1 Compilation flags
  8.2 PolyBench results
  8.3 Parallelization results on the face recognition applications
  8.4 Discussion
9 Requirements fulfilled by automatic parallelizers
  9.1 Code handling and parsing
  9.2 Reliability and exposing parallelism
  9.3 Maintenance and portability
  9.4 Parallelism performance and tool efficiency
10 Conclusions
  10.1 Limitations of parallelization tools
  10.2 Manual versus Automatic parallelization
  10.3 Future work
References

List of Tables

4.1 Functional differences in the tools.
4.2 A rough overview of what the investigated tools take as input and what they can output.
6.1 The list of requirements for an automatic parallelization tool.
8.1 Compilation flags for the individual tools.
8.2 Refactoring time and validity of the parallelized training application.
8.3 Refactoring time and validity of the parallelized classification application.

List of Figures

2.1 Two parallel tasks are in separate critical sections, each holding a resource; when each requests the other's resource, a deadlock is created.
2.2 Parallelism in a loop.
2.3 A false sharing situation.
2.4 A sequential program split up into pipeline stages.
2.5 Pipeline parallelism, displaying different balancing of the stages.
2.6 Thread creation and deletion in OpenMP. [1]
2.7 A subset of OpenMP pragma directives.
2.8 Dynamic and static scheduling.
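As a minimal illustration of the kind of loop-level, directive-based parallelization the thesis evaluates (shared-memory parallelism with OpenMP, Section 2.1.2, and reduction recognition, Section 3.3.2), the C sketch below shows a sequential dot-product loop and a hand-parallelized version. The example and its names are illustrative only and do not come from the thesis; it assumes a compiler with OpenMP support (e.g. built with -fopenmp under GCC).

#include <stdio.h>

/* Sequential dot product: the loop carries a reduction on 'sum', a
 * dependency that an automatic parallelizer must recognize (cf. Section
 * 3.3.2) before it can legally run the iterations in parallel. */
static double dot_seq(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-parallelized version: a single OpenMP pragma (cf. Figure 2.7)
 * expresses both the loop parallelism and the reduction, so each thread
 * accumulates a private partial sum that is combined after the loop. */
static double dot_par(const double *a, const double *b, int n)
{
    double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

int main(void)
{
    enum { N = 1000000 };
    static double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }
    printf("seq=%f par=%f\n", dot_seq(a, b, N), dot_par(a, b, N));
    return 0;
}

The reduction clause stands in for the kind of dependency-removing transformation that, according to the thesis results, the evaluated automatic tools often cannot apply without substantial human involvement.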