Serial & Vector Optimization


Serial & Vector Optimization
Lucas A. Wilson
TACC Summer Supercomputing Institute
June 16, 2014

Overview
• Optimization
  – Optimization process
  – Measuring improvement
• Scalar optimization
  – Compiler options
• Vector optimization

How much of the code should be tuned?
Optimization is an iterative process:
• Profile the code
• Tune the most time-intensive portions of the code
• Re-evaluate the performance increase
• Repeat until the performance is sufficient or the available tuning time runs out

For specific algorithms, derive an ideal architectural objective, but recognize its limitations as you optimize. Over successive optimization cycles the ideal expectation usually decreases while the realistic (measured) performance increases.

How is code performance measured? Some useful performance metrics:

1. FLOPS
• Floating-point operations per second (GFLOPS/MFLOPS)
• Important for numerically intensive code
• Example: dot product

    program prog_dot
      integer, parameter :: n=1024*1024*1024
      real*8 :: x(n), y(n), sum
      ...
      do i=1,n
        sum=sum+x(i)*y(i)    ! two FLOPs per iteration
      end do                 ! total work = 2 Gflop
    end program

If the execution time is 1.0 s, the program achieves 2.0 GFLOPS. How do you measure this?
Ideally, CPUs can execute 4-8 FLOPs per clock period (CP):
4-8 FLOPs/CP x CPU speed (in CPs per second, i.e. Hz) = peak performance

2. MIPS
• Millions of instructions per second
• A good indicator of efficiency (high MIPS means efficient use of the functional units)
• But MIPS is not necessarily indicative of "productive" work

    ! Code A: every 9 floating-point operations incur loop overhead
    do i1=1,n
      do i2=i1+1,n
        rdist=0.0
        do idim=1,3
          dx=x(idim,i2)-x(idim,i1)
          rdist=rdist+dx**2
        end do
        ...
      end do
    end do

    ! Code B: 8 floating-point operations, no inner-loop overhead
    do i1=1,n
      do i2=i1+1,n
        rdist=(x(1,i2)-x(1,i1))**2 + &
              (x(2,i2)-x(2,i1))**2 + &
              (x(3,i2)-x(3,i1))**2
        ...
      end do
    end do

Both codes do the "same" work, but code A gives rise to more instructions.

3. External timer (stopwatch)
CPU (user) time + system (sys) time = execution of your instructions + kernel time spent executing instructions on your behalf. Application times:

    % /usr/bin/time -p ./a.out
    real 0.18
    user 0.16
    sys  0.02

4. Instrument the code (with wall-clock timers)
To measure code-block times, use system_clock in Fortran and clock in C:

    program MAIN
      integer :: ic0, ic1, icr
      real*8 :: dt
      ...
      call system_clock(count=ic0,count_rate=icr)
      call jacobi_iter(psi0,psi,m,n)
      call system_clock(count=ic1)
      dt=dble(ic1-ic0)/dble(icr)
      ...
    end program

    #include <time.h>
    int main(int argc, char* argv[]){
      clock_t t0, t1;
      float dt;
      ...
      t0=clock();
      jacobi_iter(psi,psi0,m,n);
      t1=clock();
      dt=(float)(t1-t0)/CLOCKS_PER_SEC;
      ...
    }

What are the tools for measuring metrics?
• Profilers (routine level) – gprof
• Timers (application/block) – time, system_clock, clock, gettimeofday, getrusage, ...
• Analyzers (profile, line-level timings, and trace) – mpiP, TAU, VTune, Paraver, ...
• Expert systems – PerfExpert (Burtscher, Texas State; Browne & Rane, UT & TACC)
• Hardware counters – PAPI

Compiler Options
• Three important categories:
  – Optimization level
  – Interprocedural optimization
  – Architecture specification
You should always have (at least) one option from each category!
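For example, a compile line that takes one option from each category might look like the following (an illustrative command, not from the slides; prog.f90 is a placeholder file name, and the flags are the Intel options discussed below):

    % ifort -O3 -ipo -xHOST prog.f90    (optimization level, interprocedural, architecture)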
Compiler Optimization Levels
The optimization levels span a range from -O0 up to -O6 (where supported): debugging and quick compilation at the low end, standard optimizations and interprocedural optimization in the middle, and aggressive optimizations at the high end, including loop unrolling, loop interchange, high-order transforms (blocking, routine substitution), and transformations that may change program semantics.
-O0: Debug
-O1: Fast compilation
-O2: Standard optimizations
-O3: More aggressive optimizations

Interprocedural Analysis
Interprocedural analysis (IPA) allows compilers to optimize across routine boundaries. Benefits can include automatic procedure inlining (across modules); copy propagation, in which a flow dependency is eliminated by propagating a copy of the variable causing the (false) dependency; and vectorization. IPA typically provides automatic recognition of standard libraries, procedure inlining, whole-program alias analysis, and pointer analysis.

IPA compiler options:
  Intel:    -ip (single file), -ipo (multi-file)
  PGI:      -Mipa
  IBM xlf:  -qipa, inline=(auto|noauto), inline=<func_name>, inline=threshold=<num>
  Sun:      -xipo (all objects)
  SGI:      -IPA

System-Specific Optimizations
• Include architectural information on the command line; otherwise many compilers use a "generic" instruction set.
• Vector loops:
  – can be sent to a SIMD unit
  – can be unrolled and pipelined
  – can be parallelized through OpenMP directives
  – can be automatically parallelized
• Vector hardware examples: Power6, G4/G5 Velocity Engine (AltiVec); Intel/AMD MMX, SSE, SSE2, SSE3, SSE4 (SIMD); Cray vector units
• Compiler options for x86 on TACC machines: Intel -xHOST (for Stampede and Lonestar)

Optimizations performed by the compiler

• Copy propagation
  – Involves getting rid of unnecessary dependencies
  – Aids instruction-level parallelism (ILP)
  – e.g.

        x=y                  x=y
        z=1.0+x      =>      z=1.0+y    (the two statements can now execute simultaneously)

• Constant folding
  – Involves replacement of variable references with a constant to generate faster code
  – The programmer can help the compiler by declaring such constants in PARAMETER statements in Fortran or by using the const modifier in C

• Strength reduction
  – Replacement of an expensive calculation with a cheaper one:
      use x*x instead of x**2.0
      use 0.5*x instead of x/2.0
      use tmp=exp(t); (tmp-tmp*tmp) instead of (exp(t)-exp(2*t))
      use n logical shifts (<< or >>) instead of 2**n

• Common subexpression elimination
  – The compiler generates a temporary variable to save recalculating a/b:

        x=a/b                temp=a/b
        ...          =>      x=temp
        y=y0+a/b             ...
                             y=y0+temp

• Loop-invariant code motion
  – Involves moving loop-independent calculations out of the loop:

        for (i=0;i<n;i++) {          temp=a*b;
          x[i]=y[i]+a*b;      =>     for (i=0;i<n;i++) {
          last=x[n];                   x[i]=y[i]+temp;
        }                            }
                                     last=x[n];

• Induction variable simplification
  – Replacement of complicated expressions involving induction variables with simpler ones:

        for (i=0;i<n;i++) {          j=3;
          j=i*2+5;            =>     for (i=0;i<n;i++) {
          ...                          j+=2;     /* j now equals i*2+5 */
        }                              ...
                                     }

Example: use of a macro instead of a function

Use of a function:

    float min(float x, float y);

    int main(int argc, char* argv[])
    {
      float x1, x2, xmin;
      ...
      xmin=min(x1,x2);
      ...
    }

    float min(float x, float y)
    {
      return (x<y)?x:y;
    }

Use of a macro:

    #define min(x,y) ((x)<(y)?(x):(y))

    int main(int argc, char* argv[])
    {
      float x1, x2, xmin;
      ...
      xmin=min(x1,x2);
      ...
    }

Example: procedure inlining

Function dist is called niter times:

    program MAIN
      integer, parameter :: ndim=2, niter=10000000
      real*8 :: x(ndim), x0(ndim), r
      integer :: i, j
      ...
      do i=1,niter
        ...
        r=dist(x,x0,ndim)
        ...
      end do
      ...
    end program

    real*8 function dist(x,x0,n)
      real*8 :: x0(n), x(n), r
      integer :: j,n
      r=0.0
      do j=1,n
        r=r+(x(j)-x0(j))**2
      end do
      dist=r
    end function

Function dist is expanded inline inside the loop:

    program MAIN
      integer, parameter :: ndim=2, niter=10000000
      real*8 :: x(ndim), x0(ndim), r
      integer :: i, j
      ...
      do i=1,niter
        ...
        r=0.0
        do j=1,ndim
          r=r+(x(j)-x0(j))*(x(j)-x0(j))
        end do
        ...
      end do
      ...
    end program
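As a small complement (a sketch, not from the slides; the function name dist2, the array sizes, and the data values are illustrative), the same effect can be obtained in C by declaring the helper static inline, which lets the compiler expand the call in place much as in the Fortran example above:

    #include <stdio.h>

    /* Small distance-squared helper.  "static inline" is a hint that the
       compiler may expand the call body at each call site, removing the
       call overhead inside hot loops. */
    static inline double dist2(const double *x, const double *x0, int n)
    {
        double r = 0.0;
        for (int j = 0; j < n; j++) {
            double d = x[j] - x0[j];
            r += d * d;
        }
        return r;
    }

    int main(void)
    {
        double x[2]  = {3.0, 4.0};
        double x0[2] = {0.0, 0.0};
        double r = 0.0;

        /* With optimization (and/or -ip / -ipo), the call below can be
           expanded inline inside the loop. */
        for (int i = 0; i < 100000; i++)
            r = dist2(x, x0, 2);

        printf("r = %f\n", r);
        return 0;
    }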
Loop Optimization

Loop bisection: splits the iteration space in half and combines the operations of the two halves in a single loop of half the original length, using independent partial sums:

    do i=1,n                      do i=1,n/2
      sum=sum+a(i)+b(i)     =>      sum0=sum0+a(i)+b(i)
    end do                          sum1=sum1+a(i+n/2)+b(i+n/2)
                                  end do
                                  sum=sum0+sum1

The addition pipeline is fuller, and two more streams are read from memory.

Loop fusion: combines two or more loops over the same iteration space (loop length) into a single loop:

    for (i=0;i<n;i++){            for (i=0;i<n;i++){
      a[i]=x[i]+y[i];       =>      a[i]=x[i]+y[i];
    }                               b[i]=1.0/x[i]+z[i];
    for (i=0;i<n;i++){            }
      b[i]=1.0/x[i]+z[i];
    }

Only n memory accesses are made for the x array, and five streams are created. Division is costly (at least 30 CP) and may not be pipelined; if a multiplication by the inverse is used instead, the addition pipeline is fuller.

Loop fission: the opposite of loop fusion is loop distribution, or fission. Fission splits a single loop with independent operations into multiple loops:

    do i=1,n                      do i=1,n
      a(i)=b(i)+c(i)*d(i)           a(i)=b(i)+c(i)*d(i)
      e(i)=f(i)-g(i)*h(i)+p(i)    end do
      q(i)=r(i)+s(i)        =>    do i=1,n
    end do                          e(i)=f(i)-g(i)*h(i)+p(i)
                                  end do
                                  do i=1,n
                                    q(i)=r(i)+s(i)
                                  end do

Memory Tuning
Low-stride memory access is preferred because it uses more of the data in each cache line retrieved from memory and helps trigger streaming.

High-stride access (poor):

    do i=1,n                      for (j=0;j<n;j++) {
      do j=1,n                      for (i=0;i<n;i++) {
        C(i,j)=A(i,j)+B(i,j)          C[i][j]=A[i][j]+B[i][j];
      end do                        }
    end do                        }

Low-stride access (preferred):

    do j=1,n                      for (i=0;i<n;i++) {
      do i=1,n                      for (j=0;j<n;j++) {
        C(i,j)=A(i,j)+B(i,j)          C[i][j]=A[i][j]+B[i][j];
      end do                        }
    end do                        }

Loop Optimization
Strip mining: transforms a single loop into two loops to ensure that each element of a cache line is used once it is brought into the cache. In the example below, the loop over j is split into two:

    ! Original code
    do j=1,n
      do i=1,n
        a(i,j)=a(i,j)+s*b(j,i)
      end do
    end do

    ! "Intermediate code": the j loop is strip-mined
    do jouter=1,n,8
      do j=jouter,min(jouter+7,n)
        do i=1,n
          a(i,j)=a(i,j)+s*b(j,i)
        end do
      end do
    end do

    ! Blocked code: the strip loop is interchanged with the i loop
    do jouter=1,n,8
      do i=1,n
        do j=jouter,min(jouter+7,n)
          a(i,j)=a(i,j)+s*b(j,i)
        end do
      end do
    end do

Memory Tuning
Array blocking (matrix multiplication on nb x nb blocks of a, b, and c):

    real*8 a(n,n), b(n,n), c(n,n)
    do ii=1,n,nb
      do jj=1,n,nb
        do kk=1,n,nb
          do i=ii,min(n,ii+nb-1)
            do j=jj,min(n,jj+nb-1)
              do k=kk,min(n,kk+nb-1)
                c(i,j)=c(i,j)+a(i,k)*b(k,j)
              end do
            end do
          end do
        end do
      end do
    end do

Much more efficient implementations exist in HPC scientific libraries (ESSL, MKL, ACML, ATLAS, ...).

Loop interchange can help in the case of a DAXPY-type loop:

    integer, parameter :: nkb=16, kb=1024, n=nkb*kb/8
    real*8 :: x(n), a(n,n), y(n)
    ..
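As a minimal illustration of the same loop-ordering idea in C (a sketch, not from the slides; the dimension N and the data values are arbitrary), a DAXPY-type update y = y + A*x reads A with stride 1 when the column index j is the inner loop, because C stores arrays row-major, the mirror image of the Fortran case:

    #include <stdio.h>

    #define N 2048

    static double a[N][N], x[N], y[N];

    int main(void)
    {
        /* Initialize with some data (values are arbitrary). */
        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            y[i] = 0.0;
            for (int j = 0; j < N; j++)
                a[i][j] = 0.001 * (i + j);
        }

        /* DAXPY-type update y = y + a*x.  In C, a[i][j] is stored row-major,
           so keeping j as the inner loop gives stride-1 access to a.
           Interchanging the loops would access a with stride N and make
           poor use of each cache line. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i] += a[i][j] * x[j];

        printf("y[0] = %f\n", y[0]);
        return 0;
    }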