Optimization and Programming Guide for Little Endian Distributions

Total pages: 16

File type: PDF; size: 1020 KB

IBM XL C/C++ for Linux, V13.1.6
Optimization and Programming Guide for Little Endian Distributions
Version 13.1.6
SC27-6560-05

Note: Before using this information and the product it supports, read the information in "Notices" on page 117.

First edition. This edition applies to IBM XL C/C++ for Linux, V13.1.6 (Program 5765-J08; 5725-C73) and to all subsequent releases and modifications until otherwise indicated in new editions. Make sure you are using the correct edition for the level of the product.

© Copyright IBM Corporation 1996, 2017. US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

About this document
  Who should read this document
  How to use this document
  How this document is organized
  Conventions
  Related information
    Available help information
    Standards and specifications
    Other IBM information
    Other information
  Technical support
  How to send your comments

Chapter 1. Using XL C/C++ with Fortran
  Identifiers
  Corresponding data types
  Character and aggregate data
  Function calls and parameter passing
  Pointers to functions
  Sample program: C/C++ calling Fortran

Chapter 2. Aligning data
  Using alignment modes
    Alignment of aggregates
    Alignment of bit-fields
  Using alignment modifiers

Chapter 3. Handling floating-point operations
  Floating-point formats
  Handling multiply-and-add operations
  Compiling for strict IEEE conformance
  Handling floating-point constant folding and rounding
    Matching compile-time and runtime rounding modes
  Handling floating-point exceptions

Chapter 4. Constructing a library
  Compiling and linking a library
    Compiling a static library
    Compiling a shared library
    Linking a library to an application
    Linking a shared library to another shared library
  Initializing static objects in libraries (C++)

Chapter 5. Optimizing your applications
  Distinguishing between optimization and tuning
  Steps in the optimization process
  Basic optimization
    Optimizing at level 0
    Optimizing at level 2
  Advanced optimization
    Optimizing at level 3
    An intermediate step: adding -qhot suboptions at level 3
    Optimizing at level 4
    Optimizing at level 5
  Tuning for your system architecture
    Getting the most out of target machine options
  Using high-order loop analysis and transformations
    Getting the most out of -qhot
  Using shared-memory parallelism (SMP)
    Getting the most out of -qsmp
  Using interprocedural analysis
    Getting the most from -qipa
  Using profile-directed feedback
    Compiling with -qpdf1
    Training with typical data sets
    Recompiling or linking with -qpdf2
  Marking variables as local or imported
    Getting the most out of -qdatalocal
  Using compiler reports to diagnose optimization opportunities
    Parsing compiler reports with development tools
  Other optimization options

Chapter 6. Debugging optimized code
  Detecting errors in code
  Understanding different results in optimized programs
  Debugging in the presence of optimization

Chapter 7. Coding your application to improve performance
  Finding faster input/output techniques
  Reducing function-call overhead
  Managing memory efficiently (C++ only)
  Optimizing variables
  Manipulating strings efficiently
  Optimizing expressions and program logic
  Optimizing operations in 64-bit mode
  The C++ template model
  Using delegating constructors (C++11)
  Using rvalue references (C++11)
  Using visibility attributes (IBM extension)
    Types of visibility attributes
    Rules of visibility attributes
    Propagation rules (C++ only)
    Specifying visibility attributes using the -fvisibility option
    Specifying visibility attributes using pragma preprocessor directives

Chapter 8. Exploiting POWER9 technology
  POWER9 compiler options
  High-performance libraries that are tuned for POWER9
  POWER9 built-in functions

Chapter 9. Using the high performance libraries
  Using the Mathematical Acceleration Subsystem (MASS) libraries
    Using the scalar library
    Using the vector libraries
    Using the SIMD libraries
    Compiling and linking a program with MASS
  Using the Basic Linear Algebra Subprograms - BLAS
    BLAS function syntax
    Linking the libxlopt library

Chapter 10. Using vector technology

Chapter 11. Parallelizing your programs
  Countable loops
  Enabling automatic parallelization
  Data sharing attribute rules
  Using OpenMP directives
  Shared and private variables in a parallel environment
  Reduction operations in parallelized loops

Chapter 12. Offloading computations to the NVIDIA GPUs
  System prerequisites for offloading to the NVIDIA GPUs
  Programming with OpenMP device constructs
    Extensions from OpenMP Technical Reports
  Using IBM XL C/C++ for Linux with NVCC
  Limitations

Chapter 13. Vector element order toggling

Notices
  Trademarks

Index

About this document

This guide discusses advanced topics related to the use of IBM® XL C/C++ for Linux, V13.1.6, with a particular focus on program portability and optimization. The guide provides both reference information and practical tips for getting the most out of the compiler's capabilities through recommended programming practices and compilation procedures.

Who should read this document

This document is addressed to programmers building complex applications, who already have experience compiling with XL C/C++ and would like to take further advantage of the compiler's capabilities for program optimization and tuning, support for advanced programming language features, and add-on tools and utilities.

How to use this document

This document uses a task-oriented approach to present the topics by concentrating on a specific programming or compilation problem in each section. Each topic contains extensive cross-references to the relevant sections of the reference guides in the IBM XL C/C++ for Linux, V13.1.6 documentation set, which provides detailed descriptions of compiler options, pragmas, and specific language extensions.

How this document is organized

This guide includes the following chapters:
• Chapter 1, "Using XL C/C++ with Fortran," on page 1 discusses considerations for calling Fortran code from XL C/C++ programs.
• Chapter 2, "Aligning data," on page 7 discusses options available for controlling the alignment of data in aggregates, such as structures and classes (a small standard-C++ illustration follows this list).
• Chapter 3, "Handling floating-point operations," on page 13 discusses options available for controlling how floating-point operations are handled by the compiler.
• Chapter 4, "Constructing a library," on page 17 discusses how to compile and link static and shared libraries and how to specify the initialization order of static objects in C++ programs.
• Chapter 5, "Optimizing your applications," on page 21 discusses various options provided by the compiler for optimizing your programs, and it provides recommendations on how to use these options.
• Chapter 6, "Debugging optimized code," on page 49 discusses the potential usability problems of optimized programs and the options that can be used to debug optimized code.
• Chapter 7, "Coding your application to improve performance," on page 53 discusses recommended programming practices and coding techniques to enhance program performance and compatibility with the compiler's optimization capabilities.
• Chapter 9, "Using the high performance libraries," on page 81 discusses two performance libraries that are shipped with XL C/C++: the Mathematical Acceleration Subsystem (MASS), which contains tuned versions of standard math library functions, and the Basic Linear Algebra Subprograms (BLAS), which contains basic functions for matrix multiplication.
• Chapter 10, "Using vector technology," on page 97 provides an overview of the vector technology, which can be used to accelerate the performance-driven, high-bandwidth communications and computing applications.
• Chapter 11, "Parallelizing your programs," on page 99 provides an overview of options offered by XL C/C++ for creating multi-threaded programs, including OpenMP language constructs.
• Chapter 12, "Offloading computations to the NVIDIA GPUs," on page 107 discusses how to offload computations to the NVIDIA GPUs by using OpenMP device constructs.
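Chapter 2's subject is easy to preview in portable C++. The following minimal sketch is our own illustration, not code from the guide, and it uses only standard C++11 alignas/alignof rather than the IBM-specific alignment modes and modifiers the chapter covers:

#include <cstdio>

struct Plain {
    char c;
    int  i;      // normally padded so that 'i' sits on a 4-byte boundary
};

struct alignas(16) Aligned16 {
    char c;
    int  i;      // the whole struct is rounded up to a 16-byte boundary
};

int main() {
    std::printf("sizeof(Plain)=%zu alignof(Plain)=%zu\n",
                sizeof(Plain), alignof(Plain));
    std::printf("sizeof(Aligned16)=%zu alignof(Aligned16)=%zu\n",
                sizeof(Aligned16), alignof(Aligned16));
    return 0;
}

On a typical 64-bit Linux system the first struct reports an alignment of 4 and the second an alignment of 16; this layout difference is exactly the kind of thing the chapter's alignment options control.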
Recommended publications
  • Program Analysis and Optimization for Embedded Systems
Program Analysis and Optimization for Embedded Systems. Mike Jochen and Amie Souter ({jochen, souter}@cis.udel.edu), Computer and Information Sciences, University of Delaware, Newark, DE 19716, (302) 831-6339.

1 Introduction. A proliferation in use of embedded systems is giving cause for a rethinking of traditional methods and effects of program optimization and how these methods and optimizations can be adapted for these highly specialized systems. This paper attempts to raise some of the issues unique to program analysis and optimization for embedded system design and to address some of the solutions that have been proposed to address these specialized issues. This paper is organized as follows: section 2 provides pertinent background information on embedded systems, delineating the special needs and problems inherent with embedded system design; section 3 addresses one such inherent problem, namely limited space for program code; section 4 discusses current approaches and open issues for code minimization.

Embedded systems must:
• have low cost
• have low power consumption
• require as little physical space as possible
• meet rigid time constraints for computation completion

These requirements place constraints on the underlying architecture's design, which together affect compiler design and program optimization techniques. Each of these criteria must be balanced against the other, as each will have an effect on the other (e.g. making a system smaller and faster will typically increase its cost and power consumption). A typical embedded system, as illustrated in figure 1, consists of a processor core, a program ROM (Read Only Memory), RAM (Random Access Memory), and an ASIC (Application Specific Integrated Circuit).
  • Stochastic Program Optimization by Eric Schkufza, Rahul Sharma, and Alex Aiken
Research highlights. DOI: 10.1145/2863701. Stochastic Program Optimization. By Eric Schkufza, Rahul Sharma, and Alex Aiken.

Abstract. The optimization of short sequences of loop-free, fixed-point assembly code sequences is an important problem in high-performance computing. However, the competing constraints of transformation correctness and performance improvement often force even special purpose compilers to produce sub-optimal code. We show that by encoding these constraints as terms in a cost function, and using a Markov Chain Monte Carlo sampler to rapidly explore the space of all possible code sequences, we are able to generate aggressively optimized versions of a given target code sequence. Beginning from binaries compiled by llvm -O0, we are able to produce provably correct code sequences that either match or outperform the code produced by gcc -O3, icc -O3, and in some cases expert handwritten assembly.

Although the technique sacrifices completeness, it produces dramatic increases in the quality of the resulting code. Figure 1 shows two versions of the Montgomery multiplication kernel used by the OpenSSL RSA encryption library. Beginning from code compiled by llvm -O0 (116 lines, not shown), our prototype stochastic optimizer STOKE produces code (right) that is 16 lines shorter and 1.6 times faster than the code produced by gcc -O3 (left), and even slightly faster than the expert handwritten assembly included in the OpenSSL repository.

2. RELATED WORK. Although techniques that preserve completeness are effective within certain domains, their general applicability remains limited. The shortcomings are best highlighted in the context of the code sequence shown in Figure 1.
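The recipe in the abstract, cost terms for correctness and performance, random mutation, and probabilistic acceptance, can be previewed with a toy Metropolis sampler. The sketch below is our own illustration of the general MCMC structure; Rewrite, mutate, and both cost functions are hypothetical stand-ins, not STOKE's actual code:

#include <cmath>
#include <random>
#include <vector>

struct Rewrite { std::vector<int> instrs; };   // toy stand-in for a code sequence

// Stand-in cost terms: shorter sequences count as "faster", and summing to a
// fixed target stands in for passing the correctness test cases.
double performance_cost(const Rewrite& r) {
    return static_cast<double>(r.instrs.size());
}
double correctness_cost(const Rewrite& r) {
    long sum = 0;
    for (int v : r.instrs) sum += v;
    return std::abs(sum - 42.0);               // toy: "correct" means summing to 42
}
// Both constraints become terms of one cost function, as in the paper.
double cost(const Rewrite& r) {
    return correctness_cost(r) + performance_cost(r);
}

Rewrite mutate(Rewrite r, std::mt19937& rng) { // random local edit
    std::uniform_int_distribution<int> val(-8, 8);
    if (!r.instrs.empty() && rng() % 2 == 0)
        r.instrs[rng() % r.instrs.size()] += val(rng);
    else
        r.instrs.push_back(val(rng));
    return r;
}

Rewrite mcmc_optimize(Rewrite current, int steps, double beta) {
    std::mt19937 rng(1234);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    for (int i = 0; i < steps; ++i) {
        Rewrite proposal = mutate(current, rng);
        double delta = cost(proposal) - cost(current);
        // Metropolis acceptance: always keep improvements, occasionally accept
        // regressions so the walk can escape local minima.
        if (delta <= 0.0 || u(rng) < std::exp(-beta * delta))
            current = proposal;
    }
    return current;
}

The acceptance rule is what distinguishes this from hill climbing: a worse rewrite is occasionally kept, with probability decaying exponentially in the cost increase, which lets the search cross regions of bad code on its way to better ones.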
  • Generalizing Loop-Invariant Code Motion in a Real-World Compiler
Imperial College London, Department of Computing. Generalizing loop-invariant code motion in a real-world compiler. Author: Paul Colea. Supervisor: Fabio Luporini. Co-supervisor: Prof. Paul H. J. Kelly. MEng Computing Individual Project, June 2015.

Abstract. Motivated by the perpetual goal of automatically generating efficient code from high-level programming abstractions, compiler optimization has developed into an area of intense research. Apart from general-purpose transformations which are applicable to all or most programs, many highly domain-specific optimizations have also been developed. In this project, we extend such a domain-specific compiler optimization, initially described and implemented in the context of finite element analysis, to one that is suitable for arbitrary applications. Our optimization is a generalization of loop-invariant code motion, a technique which moves invariant statements out of program loops. The novelty of the transformation is due to its ability to avoid more redundant recomputation than normal code motion, at the cost of additional storage space. This project provides a theoretical description of the above technique which is fit for general programs, together with an implementation in LLVM, one of the most successful open-source compiler frameworks. We introduce a simple heuristic-driven profitability model which manages to successfully safeguard against potential performance regressions, at the cost of missing some speedup opportunities. We evaluate the functional correctness of our implementation using the comprehensive LLVM test suite, passing all of its 497 whole-program tests. The results of our performance evaluation using the same set of tests reveal that generalized code motion is applicable to many programs, but that consistent performance gains depend on an accurate cost model.
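To make the baseline transformation concrete, here is a minimal before/after view of ordinary loop-invariant code motion in C++ (our illustration; the project's generalization goes further by also trading storage space for avoided recomputation):

// Before: the subexpression a*b is recomputed on every iteration,
// even though neither a nor b changes inside the loop.
void scale_before(float* x, int n, float a, float b) {
    for (int i = 0; i < n; ++i)
        x[i] = x[i] * (a * b);
}

// After loop-invariant code motion: the invariant computation is
// hoisted to the loop pre-header at the cost of one temporary.
void scale_after(float* x, int n, float a, float b) {
    const float ab = a * b;   // hoisted invariant
    for (int i = 0; i < n; ++i)
        x[i] = x[i] * ab;
}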
  • A Tiling Perspective for Register Optimization Fabrice Rastello, Sadayappan Ponnuswany, Duco Van Amstel
A Tiling Perspective for Register Optimization. Łukasz Domagała, Fabrice Rastello, Sadayappan Ponnuswany, Duco van Amstel. Research Report RR-8541, Inria, May 2014, 21 pages. Project-Team GCG. HAL Id: hal-00998915, https://hal.inria.fr/hal-00998915, submitted on 3 Jun 2014.

Abstract: Register allocation is a much studied problem. A particularly important context for optimizing register allocation is within loops, since a significant fraction of the execution time of programs is often inside loop code. A variety of algorithms have been proposed in the past for register allocation, but the complexity of the problem has resulted in a decoupling of several important aspects, including loop unrolling, register promotion, and instruction reordering.
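One of the decoupled aspects the abstract names, register promotion, looks like this in source form; this is a sketch of ours, not code from the report:

// Before: *sum is loaded and stored on every iteration, so the
// accumulator lives in memory rather than in a register.
void accumulate_before(const int* a, int n, int* sum) {
    for (int i = 0; i < n; ++i)
        *sum += a[i];
}

// After register promotion: the value lives in a local scalar (a
// register candidate) for the whole loop and is written back once.
void accumulate_after(const int* a, int n, int* sum) {
    int s = *sum;             // promoted into a register for the loop
    for (int i = 0; i < n; ++i)
        s += a[i];
    *sum = s;                 // single store after the loop
}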
  • Efficient Symbolic Analysis for Optimizing Compilers*
Efficient Symbolic Analysis for Optimizing Compilers. Robert A. van Engelen, Dept. of Computer Science, Florida State University, Tallahassee, FL 32306-4530, [email protected].

Abstract. Because most of the execution time of a program is typically spent in loops, loop optimization is the main target of optimizing and restructuring compilers. An accurate determination of induction variables and dependencies in loops is of paramount importance to many loop optimization and parallelization techniques, such as generalized loop strength reduction, loop parallelization by induction variable substitution, and loop-invariant expression elimination. In this paper we present a new method for induction variable recognition. Existing methods are either ad-hoc and not powerful enough to recognize some types of induction variables, or existing methods are powerful but not safe. The most powerful method known is the symbolic differencing method as demonstrated by the Parafrase-2 compiler on parallelizing the Perfect Benchmarks®. However, symbolic differencing is inherently unsafe and a compiler that uses this method may produce incorrectly transformed programs without issuing a warning. In contrast, our method is safe, simpler to implement in a compiler, better adaptable for controlling loop transformations, and recognizes a larger class of induction variables.

1 Introduction. It is well known that the optimization and parallelization of scientific applications by restructuring compilers requires extensive analysis of induction variables and dependencies.
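A small C++ sketch of what induction variable recognition enables (our illustration, with hypothetical function names): once j = 4*i is recognized as a derived induction variable, generalized strength reduction replaces the per-iteration multiply with an addition carried across iterations.

// Before: the index expression 4*i is a derived induction variable,
// recomputed from i by a multiplication on every iteration.
// (The caller must provide at least 4*n - 3 elements in a.)
void fill_before(int* a, int n) {
    for (int i = 0; i < n; ++i)
        a[4 * i] = i;
}

// After induction variable substitution / strength reduction: the
// multiplication becomes an addition on the recognized variable j.
void fill_after(int* a, int n) {
    int j = 0;                        // j tracks 4*i incrementally
    for (int i = 0; i < n; ++i) {
        a[j] = i;
        j += 4;                       // strength-reduced update
    }
}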
  • Using Static Single Assignment SSA Form
Using Static Single Assignment (SSA) Form (CS502 lecture slides).

Last time: the basic definition of SSA, why it is useful, and how to build it. Today: loop optimizations, namely induction variables (standard vs. SSA) and loop-invariant code motion (SSA based).

Running example: a branch B1 if (...) leads to B2 (x ← 5) or B3 (x ← 3), which join at B4 (y ← x). In SSA form this becomes B2 (x0 ← 5), B3 (x1 ← 3), and B4 (x2 ← φ(x0, x1); y ← x2).

Loop optimization. Loops are important: they execute often, typically with some regular access pattern. Regularity means opportunity for improvement; repetition means savings are multiplied. Assumption: loop bodies execute 10^depth times.

Classical loop optimizations: loop-invariant code motion, induction variable recognition, strength reduction, linear test replacement, loop unrolling.

Other loop optimizations: scalar replacement, loop interchange, loop fusion, loop distribution, loop skewing, loop reversal.

Loop-invariant code motion (SSA based). Build the SSA graph (semi-pruned). φ-nodes must be inserted where needed: if two non-null paths x →+ z and y →+ z converge at node z, and nodes x and y contain assignments to t (in the original program), then a φ-node for t must be inserted at z (in the new program), and t must be live across some basic block. Simple invariance test: if, for a statement s ≡ [x ← y ⊗ z], none of the operands y, z refer to a φ-node or a definition inside the loop, then the statement is invariant. Transform: assign the invariant computation a new temporary name, t ← y ⊗ z, move it to the loop pre-header, and assign x ← t.
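Acted out in C++ (our illustration, with the slides' SSA statements shown as comments), the simple test and the transform look like this:

// Before: in x <- y (*) z below, neither y nor z is defined by a phi-node
// or by any statement inside the loop, so the computation is invariant.
int licm_before(const int* a, int n, int y, int z) {
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        int x = y * z;        // invariant: y and z never change in the loop
        acc += a[i] + x;
    }
    return acc;
}

// The slides' transform: give the invariant computation a new temporary
// name t <- y (*) z, move it to the loop pre-header, and assign x <- t.
int licm_after(const int* a, int n, int y, int z) {
    int t = y * z;            // pre-header: t <- y (*) z
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        int x = t;            // x <- t
        acc += a[i] + x;
    }
    return acc;
}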
  • The Effect of Code Expanding Optimizations on Instruction Cache Design
May 1991. UILU-ENG-91-2227, CRHC-91-17. Center for Reliable and High-Performance Computing.

THE EFFECT OF CODE EXPANDING OPTIMIZATIONS ON INSTRUCTION CACHE DESIGN. William Y. Chen, Pohua P. Chang, Thomas M. Conte, and Wen-mei W. Hwu. Coordinated Science Laboratory, College of Engineering, University of Illinois at Urbana-Champaign. Approved for public release; distribution unlimited.
  • Loop Transformations and Parallelization
Loop transformations and parallelization. Claude Tadonki, LAL/CNRS/IN2P3, University of Paris-Sud, [email protected]. December 2010.

Introduction. Most of the time, the most time-consuming part of a program is in loops. Thus, loop optimization is critical in high performance computing. Depending on the target architecture, the goals of loop transformations are: improving data reuse and data locality; efficient use of the memory hierarchy; reducing the overheads associated with executing loops; instruction pipelining; and maximizing parallelism. Loop transformations can be performed at different levels by the programmer, the compiler, or specialized tools. At a high level, some well-known transformations are commonly considered: loop interchange, loop (node) splitting, loop unswitching, loop reversal, loop fusion, loop inversion, loop skewing, loop fission, loop vectorization, loop blocking, loop unrolling, and loop parallelization.

Dependence analysis. Extracting and analyzing the dependencies of a computation from its polyhedral model is a fundamental step toward loop optimization or scheduling.

Definition. For a given variable V and given indexes I1, I2, if the computation of X(I1) requires the value of X(I2), then I1 − I2 is called a dependence vector for variable V. Drawing all the dependence vectors within the computation polytope yields the so-called dependence diagram.

Example. The dependence vectors are (1, 0), (0, 1), (−1, 1).

Scheduling. Definition. The computation on the entire domain of a given loop can be performed following any valid schedule. A timing function tV for variable V yields a valid schedule if and only if

    t(x) > t(x − d),  for all d ∈ DV,    (1)

where DV is the set of all dependence vectors for variable V.
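To connect the definition, the example, and condition (1), here is a sketch of ours: a recurrence whose dependence vectors are exactly (1, 0), (0, 1), and (−1, 1), executed according to the timing function t(i, j) = i + 2j, which satisfies (1) because t(x) − t(x − d) equals 1, 2, and 1 for the three vectors.

#include <vector>

// Recurrence with the example's dependence vectors: computing x[i][j]
// requires x[i-1][j], x[i][j-1] and x[i+1][j-1], giving d = (1,0), (0,1)
// and (-1,1). No plain row- or column-major loop order satisfies all
// three, so we execute by wavefronts of the valid schedule t = i + 2j.
constexpr int N = 16;

void wavefront(std::vector<std::vector<double>>& x) {
    // x must be N x N and initialized by the caller.
    for (int t = 3; t <= (N - 2) + 2 * (N - 1); ++t)   // increasing t = i + 2j
        for (int i = 1; i <= N - 2; ++i) {
            int twoJ = t - i;
            if (twoJ % 2 != 0) continue;               // j must be an integer
            int j = twoJ / 2;
            if (j < 1 || j > N - 1) continue;          // stay inside the domain
            x[i][j] = x[i - 1][j] + x[i][j - 1] + x[i + 1][j - 1];
        }
}

Iterations sharing the same t are independent (their difference has a zero dot product with (1, 2), while every dependence vector has a positive one), so each wavefront could run in parallel.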
  • Compiler-Based Code-Improvement Techniques
Compiler-Based Code-Improvement Techniques. KEITH D. COOPER, KATHRYN S. MCKINLEY, and LINDA TORCZON.

Since the earliest days of compilation, code quality has been recognized as an important problem [18]. A rich literature has developed around the issue of improving code quality. This paper surveys one part of that literature: code transformations intended to improve the running time of programs on uniprocessor machines. This paper emphasizes transformations intended to improve code quality rather than analysis methods. We describe analytical techniques and specific data-flow problems to the extent that they are necessary to understand the transformations. Other papers provide excellent summaries of the various sub-fields of program analysis. The paper is structured around a simple taxonomy that classifies transformations based on how they change the code. The taxonomy is populated with example transformations drawn from the literature. Each transformation is described at a depth that facilitates broad understanding; detailed references are provided for deeper study of individual transformations. The taxonomy provides the reader with a framework for thinking about code-improving transformations. It also serves as an organizing principle for the paper. Copyright 1998, all rights reserved. You may copy this article for your personal use in Comp 512. Further reproduction or distribution requires written permission from the authors.

1 INTRODUCTION. This paper presents an overview of compiler-based methods for improving the run-time behavior of programs, often mislabeled code optimization. These techniques have a long history in the literature. For example, Backus makes it quite clear that code quality was a major concern to the implementors of the first Fortran compilers [18].
  • Autotuning for Automatic Parallelization on Heterogeneous Systems
Autotuning for Automatic Parallelization on Heterogeneous Systems. Dissertation approved for the academic degree of Doktor der Ingenieurwissenschaften by the KIT Department of Informatics of the Karlsruhe Institute of Technology (KIT), submitted by Philip Pfaffe. Date of the oral examination: 24.07.2019. First referee: Prof. Dr. Walter F. Tichy. Second referee: Prof. Dr. Michael Philippsen.

Abstract. To meet the surging demand for high-speed computation in an era of stagnating increase in performance per processor, systems designers resort to aggregating many and even heterogeneous processors into single systems. Automatic parallelization tools relieve application developers of the tedious and error-prone task of programming these heterogeneous systems. For these tools, there are two aspects to maximizing performance: optimizing the execution on each parallel platform individually, and executing work on the available platforms cooperatively. To date, various approaches exist targeting either aspect. Automatic parallelization for simultaneous cooperative computation with optimized per-platform execution however remains an unsolved problem. This thesis presents the APHES framework to close that gap. The framework combines automatic parallelization with a novel technique for input-sensitive online autotuning. Its first component, a parallelizing polyhedral compiler, transforms implicitly data-parallel program parts for multiple platforms. Targeted platforms then automatically cooperate to process the work. During compilation, the code is instrumented to interact with libtuning, our new autotuner and second component of the framework. Tuning the work distribution and per-platform execution maximizes overall performance. The autotuner enables always-on autotuning through a novel hybrid tuning method, combining a new efficient search technique and model-based prediction.
  • Programming Parallel Machines
Programming Parallel Machines. Prof. R. Eigenmann, ECE563, Spring 2013. engineering.purdue.edu/~eigenman/ECE563

Parallelism - Why Bother? Hardware perspective: parallelism is everywhere, at the instruction level, the chip level (multicores), in co-processors (accelerators, GPUs), and at the multi-processor, multi-computer, and distributed-system levels. Big question: can all this parallelism be hidden?

Hiding Parallelism - For Whom and Behind What? For the end user: in HPC applications, parallelism is not really apparent; sometimes the user needs to tell the system on how many "nodes" the application shall run. In collaborative applications and remote resource accesses, the user may want to see the parallelism. For the application programmer: we can try to hide parallelism using parallelizing compilers, or using parallel libraries or software components; in reality, we partially hide parallelism behind a good API (see the sketch after this section).

Different Forms of Parallelism. An operating system has parallel processes to manage the many parallel activities that are going on concurrently. A high-performance computing application executes multiple parts of the program in parallel in order to get the job done faster. A bank that performs a transaction with another bank uses parallel systems to engage multiple distributed databases. A multi-group research team that uses a satellite downlink in Boston, a large computer in San Diego, and a "Cave" for visualization at the Univ. of Illinois needs collaborative parallel systems.

How important is parallel programming in academia in 2013? It's a hot topic again; there was significant interest in the 1980s and the first half of the 1990s.
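What "partially hiding parallelism behind a good API" can look like, sketched in C++ with OpenMP (our illustration, not code from the slides): the caller sees an ordinary function, and the pragma degrades gracefully to serial code when OpenMP is not enabled.

#include <vector>

// The caller sees a plain dot product; the parallelism is hidden behind
// the API and expressed declaratively with an OpenMP pragma.
// Assumes a.size() == b.size().
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        sum += a[i] * b[i];
    return sum;   // compiles to a serial loop without -fopenmp
}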
  • ARM Optimizing C/C++ Compiler V20.2.0.LTS User's Guide (Rev. V)
ARM Optimizing C/C++ Compiler v20.2.0.LTS User's Guide. Literature Number: SPNU151V. January 1998, revised February 2020.

Contents:
Preface
1 Introduction to the Software Development Tools
  1.1 Software Development Tools Overview
  1.2 Compiler Interface
  1.3 ANSI/ISO Standard
  1.4 Output Files
  1.5 Utilities
2 Using the C/C++ Compiler
  2.1 About the Compiler
  2.2 Invoking the C/C++ Compiler
  2.3 Changing the Compiler's Behavior with Options