
AMD’S X86 OPEN64 COMPILER
Michael Lai
AMD

CONTENTS
Brief History
AMD and Open64
Compiler Overview
Major Components of Compiler
Important Optimizations
Recent Releases
Performance
Applications and Libraries
Heterogeneous Computing
More Information

AMD’s x86 Open64 Compiler | June 2011

BRIEF HISTORY
Started as SGI® MIPSpro/Pro64 Compiler in the 1990s
Open sourced in 2000 as Pro64 Compiler; later renamed to Open64 Compiler
Has been re-targeted to many architectures (MIPS, IA-64, x86-64, ARM, …)
Popular among industry and academia; used for both production and research
Open64 Steering Group (with members from industry and universities)
Major contributors include: AMD, Intel, HP, PathScale, Tsinghua University, Chinese Academy of Sciences, University of Houston, University of Delaware, SimpLight, …

AMD AND OPEN64
AMD’s x86 Open64 Compiler:
  – Pull down from www.open64.net (leverage open source community)
  – Work on bug fixes, new development and infrastructure, advanced optimizations
  – Keep in sync with www.open64.net
  – Check changes back into www.open64.net (contribute to open source community)
http://developer.amd.com:
  – First AMD release was version 4.2.2 in April 2009
  – Most recent AMD release was version 4.2.5 in April 2011
Active participant in the open source community:
  – Member of the Open64 Steering Group (OSG)
  – Many AMD global and local gatekeepers (design and code discussions and reviews)
  – Release management and testing
  – Present at workshops, tutorials, forums

COMPILER OVERVIEW
Language standards
  – ANSI C99, ISO C++98
      Conforms to ISO/IEC 9899:1999, Programming Languages – C standard
      Conforms to ISO/IEC 14882:1998(E), Programming Languages – C++ standard
  – Compatible with gcc
  – Fortran 77, 90, 95
      Conforms to ISO/IEC 1539-1:1997, Programming Languages – Fortran
  – Inter-language calling support
Platform highlights
  – x86 32-bit and x86 64-bit code generation
  – Large file support on 32-bit systems
  – Vector and scalar SSE* code generation
  – AVX, XOP, FMA4 code generation
  – Optimized C/C++ and math libraries
  – Optimized AMD Core Math Library (ACML)
  – MPICH2 for distributed and shared memory systems
  – IEEE 754 floating point support
  – OpenMP 2.5 for shared memory systems
Global optimizations, e.g.
  – Partial redundancy elimination
  – Constant propagation and code motion
  – Strength reduction and expression simplification
  – Dead code elimination and common subexpression elimination
Interprocedural analyses and optimizations, e.g.
  – Function inlining and cloning
  – Alias analysis
  – Data re-layout optimizations for structures and arrays
  – Constant propagation and dead code elimination
Multi-core scalability optimizations
OpenMP support and automatic parallelization
Feedback-directed optimizations, e.g.
  – Code layout
  – Function inlining and de-virtualization
  – Register allocation
  – Value specialization
Loop-nest optimizations, e.g.
  – Loop fusion and distribution
  – Loop interchange and cache locality optimization
  – Vectorization for SSE*/AVX code generation
  – Software prefetching
Code generation and optimizations, e.g.
  – Advanced register allocation
  – Loop unrolling, peephole optimizations
  – Instruction selection and scheduling

MAJOR COMPONENTS OF COMPILER
Frontend
  – Generates a WHIRL file from each input source file
Backend
  – Generates an object file from each WHIRL file
Linker
  – Generates an executable file from the object files
IPA
  – Pass 1: ipl
  – Pass 2: ipa_link

[Figure: compilation flow without IPA. Each source file goes through: source -> frontend -> WHIRL -> backend -> .o; all .o files -> linker -> a.out]

[Figure: compilation flow with IPA. Each source file goes through: source -> frontend -> WHIRL -> ipl -> .o; all .o files -> ipa_link -> WHIRL files -> backend -> .o files -> linker -> a.out]

IMPORTANT OPTIMIZATIONS
Backend
  – LNO (loop nest optimization)
      Traditional loop transformations such as loop blocking, interchange, fusion, distribution
      Software prefetching
      Vectorization
  – WOPT (global optimization)
      Build control flow graphs
      Data flow analysis
      Traditional global scalar optimizations such as constant folding, partial redundancy elimination, etc.
  – CG (code generation)
      Instruction selection and scheduling
      Machine dependent optimizations such as address mode optimization and peephole optimization
      Emit instructions for the target machine
IPA (interprocedural analysis)
  – Pass 1: ipl
      Local analysis
  – Pass 2: ipa_link
      Whole program analysis
      Data layout optimizations
      Function inlining, cloning
      Constant propagation
      Dead function elimination
Profile feedback directed optimization
  – -fb-create
  – -fb-opt

RECENT RELEASES
Release 4.2.2 (April 2009)
  – Support for 2 MB huge pages
  – Improved loop fusion (proactive loop fusion) and loop unrolling
  – Improved head/tail duplication, if-merging, scalar replacement and constant folding optimizations
  – Improved interprocedural alias analysis
  – Improved partial inlining and inlining of virtual functions
  – More advanced re-layout optimization for structure members
  – Improved instruction selection and instruction scheduling
  – Improved tuning of library functions
Release 4.2.3 (December 2009)
  – Improved interprocedural analysis to include structure array copy optimization and array remapping optimization
  – Improved loop optimizations: loop unrolling, loop unroll and jam, triangular loops, proactive loop interchange, loop distribution, loop peeling
  – Improved redundancy elimination optimizations for stores and memory initialization; better integration of reassociation and common subexpression elimination; enhanced expression factorization
  – Improved instruction selection and addressing code generation
  – Improved vectorization
  – Extended prefetching to include arrays with inductive base addresses
  – Enhanced loop multi-versioning
  – Improved OpenMP and auto-parallelization code generation
  – Improved tuning of OpenMP and parallel runtime library functions
  – Introduced advanced optimizations to improve scalability/bandwidth utilization of multi-core processors (-mso)
Release 4.2.4 (June 2010)
  – Improved function inlining heuristics and enhanced inline expansion of library functions
  – Enhanced framework for multi-versioning
  – Improved inductive expression simplification and if-merging optimization
  – Improved code generation for the % operator
  – Improved interprocedural analysis for indirect function calls, virtual functions, and functions with "noreturn" attribute
  – Optimized exception handling
  – Optimized processing of Fortran 90 temporary arrays
  – Improved processor affinity mapping in the OpenMP and parallel runtime library
  – Added support for 1 GB huge pages
Release 4.2.5 (March 2011)
  – Optimized code generation for the new AMD Opteron Family 15h processors ("Bulldozer" core) (including instruction groups SSE*, AVX, XOP, FMA4) (-march=bdver1)
  – Support for iso_c_binding, a Fortran 2003 feature
  – Enhanced framework to support better vectorization
  – Improved vectorization for outer loops and loops containing conditionals
  – Enhanced framework to support better aliasing
  – Modified -O3 to enable more powerful floating-point optimizations by default
  – Improved compatibility with newer versions of gcc for function prototype definitions under OpenMP
  – Compiler build infrastructure enhanced to be similar to other Linux application builds involving configure, make and make install
  – Incremental improvements to many generic optimizations such as loop fusion, dead code elimination, if-merging, if-conversion, function inlining, register pressure tuning, structure splitting, etc.
  – Incremental improvements for C++ applications such as function de-virtualization, exception handling, etc.
  – General correctness improvements including bug fixes for problems in Fortran intrinsics, Fortran frontend, Fortran I/O, x86 alignment, OpenMP
  – General improvements to reduce the compilation times of large C++/Fortran applications

PERFORMANCE
Used in benchmark submissions, for example:
  – HP®
  – Dell™
  – IBM®
  – Sun® (Oracle®)
  – SGI®
Performance on AMD platforms:
  – Best performing compiler
      Both integer and floating point benchmark suites
Performance on Intel platforms:
  – Among the best performing compilers
      Both integer and floating point benchmark suites

APPLICATIONS AND LIBRARIES
Libraries and utilities, for example:
  – ACML (Fortran)
  – BLAST (C/C++)
  – Charm++ (C++)
  – CLHEP (C++)
  – FFTW (C)
  – Goto BLAS (Fortran)
  – MPICH/MPICH2 (Fortran, C/C++)
  – NetCDF (Fortran, C/C++)
  – LAM/MPI (Fortran, C/C++)
  – OpenMPI (Fortran, C/C++)
  – GSL (C/C++)
Large applications, for example:
  – GEANT4 (C/C++)
  – GROMACS (Fortran, C/C++)
  – NAMD (C/C++)
  – NWChem (Fortran, C/C++)
  – POP (Fortran)
  – POV-Ray (C++)
  – WRF (Fortran)
Benchmarks, for example:
  – HPCC (Fortran, C/C++)
  – SPEC CPU2006 (Fortran, C/C++)
  – SPEC OMP2001 (Fortran, C/C++)

HETEROGENEOUS COMPUTING
Existing optimizations –