AMD's x86 Open64 Compiler

Michael Lai
AMD

CONTENTS

Brief History
AMD and Open64
Compiler Overview
Major Components of Compiler
Important Optimizations
Recent Releases
Performance
Applications and Libraries
Heterogeneous Computing
More Information

AMD’s x86 Open64 Compiler | June 2011

BRIEF HISTORY

Started as SGI® MIPSpro/Pro64 Compiler in the 1990s
Open sourced in 2000 as Pro64 Compiler; later renamed to Open64 Compiler
Has been re-targeted to many architectures (MIPS, IA-64, x86-64, ARM, …)
Popular among industry and academia; used for both production and research
Open64 Steering Group (with members from industry and universities)
Major contributors include: AMD, Intel, HP, PathScale, Tsinghua University, Chinese Academy of Sciences, University of Houston, University of Delaware, SimpLight, …

AMD AND OPEN64

AMD’s x86 Open64 Compiler:
  – Pull down from www.open64.net (leverage open source community)
  – Work on bug fixes, new development and infrastructure, advanced optimizations
  – Keep in sync with www.open64.net
  – Check changes back into www.open64.net (contribute to open source community)
http://developer.amd.com:
  – First AMD release was version 4.2.2 in April 2009
  – Most recent AMD release was version 4.2.5 in April 2011
Active participant in the open source community:
  – Member of the Open64 Steering Group (OSG)
  – Many AMD global and local gatekeepers (design and code discussions and reviews)
  – Release management and testing
  – Present at workshops, tutorials, forums

COMPILER OVERVIEW

Language standards
  – ANSI C99, ISO C++98
      Conforms to ISO/IEC 9899: 1999, Programming Languages – C standard
      Conforms to ISO/IEC 14882: 1998(E), Programming Languages – C++ standard
      Compatible with gcc
  – Fortran 77, 90, 95
      Conforms to ISO/IEC 1539-1: 1997, Programming Languages – Fortran
  – Inter-language calling support
Platform highlights
  – x86 32-bit and x86 64-bit code generation
  – Large file support on 32-bit systems
  – Vector and scalar SSE* code generation
  – AVX, XOP, FMA4 code generation
  – Optimized C/C++ and math libraries
  – Optimized AMD Core Math Library (ACML)
  – MPICH2 for distributed and shared memory systems
  – IEEE 754 floating point support
  – OpenMP 2.5 for shared memory systems

Global optimizations, e.g.
  – Partial redundancy elimination
  – Constant propagation and code motion
  – Strength reduction and expression simplification
  – Dead code elimination and common subexpression elimination
Interprocedural analyses and optimizations, e.g.
  – Function inlining and cloning
  – Alias analysis
  – Data re-layout optimizations for structures and arrays
  – Constant propagation and dead code elimination
Multi-core scalability optimizations
OpenMP support and automatic parallelization
Feedback-directed optimizations, e.g.
  – Code layout
  – Function inlining and de-virtualization
  – Register allocation
  – Value specialization
Loop-nest optimizations, e.g.
  – Loop fusion and distribution
  – Loop interchange and cache locality optimization
  – Vectorization for SSE*/AVX code generation
  – Software prefetching
Code generation and optimizations, e.g.
  – Advanced register allocation
  – Loop unrolling, peephole optimizations
  – Instruction selection and scheduling

MAJOR COMPONENTS OF COMPILER

Frontend
  – Generates a WHIRL file from each input source file
Backend
  – Generates an object file from each WHIRL file
Linker
  – Generates an executable file from the object files
IPA
  – Pass 1: ipl
  – Pass 2: ipa_link

[Diagram, compilation without IPA: each source file -> frontend -> WHIRL -> backend -> .o; all .o files -> linker -> a.out]

[Diagram, compilation with IPA: each source file -> frontend -> WHIRL -> ipl -> .o; all .o files -> ipa_link -> WHIRL files -> backend -> .o; all .o files -> linker -> a.out]

IMPORTANT OPTIMIZATIONS

Backend
  – LNO (loop nest optimization)
      Traditional loop transformations such as loop blocking, interchange, fusion, distribution
      Software prefetching
      Vectorization
  – WOPT (global optimization)
      Build control flow graphs
      Data flow analysis
      Traditional global scalar optimizations such as constant folding, partial redundancy elimination, etc.
  – CG (code generation)
      Instruction selection and scheduling
      Machine dependent optimizations such as address mode optimization and peephole optimization
      Emit instructions for the target machine
IPA (interprocedural analysis)
  – Pass 1: ipl
      Local analysis
  – Pass 2: ipa_link
      Whole program analysis
      Data layout optimizations
      Function inlining, cloning
      Constant propagation
      Dead function elimination
Profile feedback directed optimization
  – -fb-create
  – -fb-opt

RECENT RELEASES

Release 4.2.2 (April 2009)
  – Support for 2 MB huge pages
  – Improved loop fusion (proactive loop fusion) and loop unrolling
  – Improved head/tail duplication, if-merging, scalar replacement and constant folding optimizations
  – Improved interprocedural alias analysis
  – Improved partial inlining and inlining of virtual functions
  – More advanced re-layout optimization for structure members
  – Improved instruction selection and instruction scheduling
  – Improved tuning of library functions
Release 4.2.3 (December 2009)
  – Improved interprocedural analysis to include structure array copy optimization and array remapping optimization
  – Improved loop optimizations: loop unrolling, loop unroll and jam, triangular loops, proactive loop interchange, loop distribution, loop peeling
  – Improved redundancy elimination optimizations for stores and memory initialization; better integration of reassociation and common subexpression elimination; enhanced expression factorization
  – Improved instruction selection and addressing code generation
  – Improved vectorization
  – Extended prefetching to include arrays with inductive base addresses
  – Enhanced loop multi-versioning
  – Improved OpenMP and auto-parallelization code generation
  – Improved tuning of OpenMP and parallel runtime library functions
  – Introduced advanced optimizations to improve scalability/bandwidth utilization of multi-core processors (-mso)
Release 4.2.4 (June 2010)
  – Improved function inlining heuristics and enhanced inline expansion of library functions
  – Enhanced framework for multi-versioning
  – Improved inductive expression simplification and if-merging optimization
  – Improved code generation for the % operator
  – Improved interprocedural analysis for indirect function calls, virtual functions, and functions with "noreturn" attribute
  – Optimized exception handling
  – Optimized processing of Fortran 90 temporary arrays
  – Improved processor affinity mapping in the OpenMP and parallel runtime library
  – Added support for 1 GB huge pages
Release 4.2.5 (March 2011)
  – Optimized code generation for the new AMD Opteron Family 15h processors ("Bulldozer" core) (including instruction groups SSE*, AVX, XOP, FMA4) (-march=bdver1)
  – Support for iso_c_binding, a Fortran 2003 feature
  – Enhanced framework to support better vectorization
  – Improved vectorization for outer loops and loops containing conditionals
  – Enhanced framework to support better aliasing
  – Modified -O3 to enable more powerful floating-point optimizations by default
  – Improved compatibility with newer versions of gcc for function prototype definitions under OpenMP
  – Compiler build infrastructure enhanced to be similar to other Linux application builds involving configure, make and make install
  – Incremental improvements to many generic optimizations such as loop fusion, dead code elimination, if merging, if conversion, function inlining, register pressure tuning, structure splitting, etc.
  – Incremental improvements for C++ applications such as function de-virtualization, exception handling, etc.
  – General correctness improvements including bug fixes for problems in Fortran intrinsics, Fortran frontend, Fortran I/O, x86 alignment, OpenMP
  – General improvements to reduce the compilation times of large C++/Fortran applications

PERFORMANCE

Used in benchmark submission, for example:
  – HP®
  – Dell™
  – IBM®
  – Sun® (Oracle®)
  – SGI®
Performance on AMD platforms:
  – Best performing compiler
      Both integer and floating point benchmark suites
Performance on Intel platforms:
  – Among the best performing compilers
      Both integer and floating point benchmark suites

APPLICATIONS AND LIBRARIES

Libraries and utilities, for example:
  – ACML (Fortran)
  – BLAST (C/C++)
  – Charm++ (C++)
  – CLHEP (C++)
  – FFTW (C)
  – Goto BLAS (Fortran)
  – MPICH/MPICH2 (Fortran, C/C++)
  – NetCDF (Fortran, C/C++)
  – LAM/MPI (Fortran, C/C++)
  – OpenMPI (Fortran, C/C++)
  – GSL (C/C++)
Large applications, for example:
  – GEANT4 (C/C++)
  – GROMACS (Fortran, C/C++)
  – NAMD (C/C++)
  – NWChem (Fortran, C/C++)
  – POP (Fortran)
  – POV-Ray (C++)
  – WRF (Fortran)
Benchmarks, for example:
  – HPCC (Fortran, C/C++)
  – SPEC CPU2006 (Fortran, C/C++)
  – SPEC OMP2001 (Fortran, C/C++)

HETEROGENEOUS COMPUTING

Existing optimizations –
