Survey of Compiler Technology at the IBM Toronto Laboratory

An (incomplete) Survey of Compiler Technology at the IBM Toronto Laboratory
Bob Blainey
March 26, 2002

Target systems
  Sovereign (Sun JDK-based) Just-in-Time (JIT) Compiler
    zSeries (S/390): OS/390, Linux; resettable, shareable
    pSeries (PowerPC): AIX 32-bit and 64-bit, Linux
    xSeries (x86 or IA-32): Windows, OS/2, Linux, 4690 (POS)
    IA-64 (Itanium, McKinley): Windows, Linux
  C and C++ Compilers
    zSeries: OS/390
    pSeries: AIX
  Fortran Compiler
    pSeries: AIX

Key Optimizing Compiler Components
  TOBEY (Toronto Optimizing Back End with Yorktown)
    Highly optimizing code generator for S/390 and PowerPC targets
  TPO (Toronto Portable Optimizer)
    Mostly machine-independent optimizer for Wcode intermediate language
    Interprocedural analysis, loop transformations, parallelization
  Sun JDK-based JIT (Sovereign)
    Best of breed JIT compiler for client and server applications
    Based very loosely on Sun JDK

Inside a Batch Compilation
  [Diagram. Recovered labels: C source, C++ source, Fortran source, Other source; C Front End, C++ Front End, Fortran Front End, Other Front Ends; Wcode, Wcode++; Scalarizer; TPO; TOBEY Back End; Object Code.]

TOBEY Optimizing Back End
  Project started in 1983 targeting S/370
  Later retargeted to ROMP (PC-RT), Power, Power2, PowerPC, SPARC, and ESAME/390 (64 bit)
  Experimental retargets to i386 and PA-RISC
  Shipped in over 40 compiler products on 3 different platforms with 8 different source languages
  Primary vehicle for compiler optimization since the creation of the RS/6000 (pSeries)
  Implemented in a combination of PL.8 ("80% of PL/I") and C++ on an AIX reference platform

Inside TOBEY
  [Diagram with OPT(0) and OPT(2) paths. Recovered labels, order approximate: Wcode; Wcode-to-XIL Translator; Early Optimization; Simple Optimization; Control Flow Straightening; Local Commoning; Value Numbering; Redundancy Elimination; Reassociation; Dead Store Elimination; Early Macro Expansion; Late Optimization; Value Numbering; Commoning/Code Motion; Dead Code Elimination; Late Macro Expansion; Instruction Scheduling and Register Allocation; Fast Register Allocation; Final Assembly.]

TPO (Toronto Portable Optimizer)
  Project started in 1994 as an interprocedural optimizer for RS/6000
  Shipped first as an interprocedural optimizer for the OS/390 C compiler in 1996
  Later shipped as part of C, C++ and Fortran compilers on AIX, the C++ compiler on OS/390 and as a linker enhancement on OS/400
  Key optimization driver for the ASCI Blue and White projects and PowerPC SPEC benchmark performance
  Provides OpenMP explicit parallel support and automatic loop parallelization on RS/6000
  Being adapted to optimize large scale commercial software such as DB2, Oracle and SAP
  Implemented in C++ on an AIX reference platform

Inside TPO Compile Time Optimization
  [Diagram. Recovered labels, order approximate: Wcode from FE; Decode; Control Flow Analysis; Alias Analysis; "Control or Alias Changed?"; Intraprocedural Optimizations (Store Motion, Constant Propagation, Redundant Condition Elimination, Copy Propagation, Loop Normalization, Loop Unswitching, Dead Store Elimination, Loop Unrolling); Loop Optimizations (Loop Fusion, Scalar Replacement, Loop Distribution, Loop Parallelization, Unimodular Trans, Loop Vectorization, Unroll-and-jam); Code Motion and Commoning; Collection; Encode; Wcode to BE.]

Loop Optimization
  [Diagram. Recovered labels, order approximate: Scalar Optimization; Control Flow Optimization; Data Flow Optimization; Loop Nest Canonization (Loop Normalization, Aggressive Copy Propagation, Maximal Loop Fusion); High Level Transformations (Loop Nest Partitioning, Loop Interchange, Loop Unroll and Jam, Loop Parallelization); Parallel Loops; Serial Loops; Parallel Loop Outlining; Low Level Transformations (Inner Loop Unrolling, Loop Vectorization, Strength Reduction, Redundancy Elimination, Code Motion).]
  All information subject to change without notice

Inside a Link-time Compilation
  [Diagram. Recovered labels: Object files; Libraries; Other link information; TPO; Wcode partitions; TOBEY; Object Files; Linker; Executable or shared library.]

Inside TPO Link Time Optimization
  [Diagram with LEVEL(0), LEVEL(1) and LEVEL(2) markers. Recovered labels, order and grouping approximate:]
    Symbol Resolution
    Alias Analysis
    Call Graph Completion
    parameter & global def/use; backward properties
    Forward Data-flow Analysis: copy and constant propagation, pointer alias analysis, dead code elimination
    Inlining
    Alias Closure: closure of context sensitive pointer alias relationships
    Data Coalescing
    Backward Data-flow Analysis: invariant code motion, common subexpression elimination, loop optimization
    Function Partitioning

Selected TPO Optimizations
  Interprocedural constant propagation, pointer alias analysis and dead code elimination
  Partially invariant code motion
  Forward and backward store motion
  Partial constant propagation
  Redundant condition elimination
  Code and data partitioning
  Loop partitioning

Some Compiler Changes for Power4
  Instruction scheduling for dispatch
  Register-constrained modulo scheduling
  Avoid microcoded and some cracked instructions
  Generate stream touch instructions
  Eliminate small branch sequences using CA bit
  Tune loop optimization for 8 prefetch buffers
  Procedure and loop code alignment
  Use static branch prediction override with PDF
  Inline pointer glue and set BH for virtual and pointer calls
  Bias CR allocation to get same source/target for CR logic

Platform Neutral Improvements
  Profile directed interprocedural optimization
  Profiling and specialization of function pointer calls
  F90 MATMUL/TRANSPOSE improvements
  Interprocedural loop optimization
  Profile directed outlining

Results: Regatta vs. Competition
  [Bar chart, vertical axis 0 to 1200, grouped by SPECint2000 and SPECfp2000. Series: PA-8700 750MHz, SPARC 900MHz, Itanium 800MHz, Pentium 2.2GHz, Alpha 1.0GHz, Power4 1.3GHz, Power4 + compiler. Callouts: +12%, +21%.]
  * Note: Power4 measurements NOT official

2002 Performance Plan
  Themes
    Middleware performance (DB2)
    Practical SP Performance
    Continuing Power4 and follow-on support
  Optimization Priorities
    Low Level Optimization and Code Generation
    Loop Transformations
    Array Analysis
    Interprocedural Optimization
    C++ Optimization

2002 Optimization Highlights
  Shrink wrapping
  Loop fusion, distribution and index-set splitting
  Loop unrolling for machine balance and bandwidth utilization
  Interprocedural register allocation
  Superblock scheduling
  Profile-driven commoning and code motion
  Array data flow analysis and privatization
  Optimization of C++ exceptions, virtual dispatch and templates
  Data dependence analysis for complex indexing
  Interprocedural type-based analysis

Sovereign Java Architecture
  [Diagram. Recovered labels: Java Application; Java Compiler (javac, jikes); Byte Code; Sovereign Java Virtual Machine; MMI; JIT Compiler; Native Code. Targets: PowerPC, S/390, IA-32, IA-64. Platforms: Linux (under each target), AIX32, AIX64, Win32, Win64, OS/390, OS/2, OS/400, 4690.]

Sovereign JIT Compilation Cycle
  [Diagram. Recovered labels, order approximate: Class Loader; Byte code; MMI Interpreter; Compiler; Recompilation Controller; Sampler; Native Code Execution; method queue; Recompile hot method; Recompile invalid code; Compiled code; Data Transfer; code samples.]
  Interpreter
    Fast startup
    Class & method resolution
    Class initialization
    Method invocation counts
    Conditional path info
    Loop detection
  Sampler/Compiler
    Good code for warm methods
    Best code for hot methods
    Specialized hot methods
    Hot methods
    Common parameters

Inside the Sovereign JIT
  [Diagram. Recovered labels, order approximate: Java Bytecode; Bytecode Optimization; Quadruple Generation; Quadruple Optimization; DAG (SSA) Optimization; ILP Optimization (IA64); Native Code Generation; Instruction Scheduling; Register Allocation.]

Bytecode Optimization
  [Diagram. Recovered labels, order approximate: Java Bytecode; Class Flow Analysis; Guarded Devirtualization; Method Inlining; JSR Inlining; Field Privatization; Dataflow Optimization (NULL check elimination, Field privatization, Type flow analysis, Array check elimination); output to Quadruple Optimization.]

Quadruple Optimization
  [Diagram. Recovered labels, order and grouping approximate: Quads; Code Straightening; Escape Analysis; Copy Propagation; Check Elimination; Busy Code Motion; Dead Store Elimination; Redundant Class Init Elimination; Dataflow Optimization; Class Flow Analysis; Architecture Transfer; Redundancy Elimination; Lazy Code Motion; Induction Variables; DAG Optimization; Impact Analysis; Loop Versioning; Loop Striding; Late Architecture Transfer; ILP Optimization (IA64); Native Code.]

Instruction-Level Parallel Optimization (IA-64)
  [Diagram. Recovered labels, order approximate: Quad; PDG Generation; Parallelism-aware Register Allocation; Hyperblock Selection; IF Conversion and Predicate Analysis; Control Speculation; Data Speculation; Exception Speculation; Critical Path Reduction; Instruction Parallelization (Bundle Formation); Generate Native Code.]
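
Illustration: loop unswitching
  The deck names loop unswitching among the TPO compile-time optimizations but does not show what it does. The sketch below is a minimal illustration of the general transformation, not TPO's implementation: TPO works on Wcode, and Python is used here only for readability. The function names and the do_scale flag are hypothetical.

```python
def scale_or_copy_original(src, factor, do_scale):
    """Loop as written: the loop-invariant test runs on every iteration."""
    out = []
    for x in src:
        if do_scale:              # do_scale never changes inside the loop
            out.append(x * factor)
        else:
            out.append(x)
    return out


def scale_or_copy_unswitched(src, factor, do_scale):
    """After unswitching: the invariant test is hoisted out of the loop,
    leaving one branch-free copy of the loop per outcome."""
    out = []
    if do_scale:                  # tested once, before the loop
        for x in src:
            out.append(x * factor)
    else:
        for x in src:
            out.append(x)
    return out
```

  Both versions compute the same result. The unswitched form trades code size (two loop bodies) for the removal of a branch from each iteration, which can also make each resulting loop simpler for later loop optimizations to handle.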
