An (incomplete) Survey of Compiler Technology at the IBM Toronto Laboratory

Bob Blainey March 26, 2002 Target systems Sovereign (Sun JDK-based) Just-in-Time (JIT) Compiler zSeries (S/390) OS/390, Linux Resettable, shareable pSeries (PowerPC) AIX 32-bit and 64-bit Linux xSeries (x86 or IA-32) Windows, OS/2, Linux, 4690 (POS) IA-64 (Itanium, McKinley) Windows, Linux C and C++ Compilers zSeries OS/390 pSeries AIX Fortran Compiler pSeries AIX Key Components TOBEY (Toronto Optimizing Back End with Yorktown) Highly optimizing code generator for S/390 and PowerPC targets TPO (Toronto Portable Optimizer) Mostly machine-independent optimizer for Wcode intermediate language Interprocedural analysis, loop transformations, parallelization Sun JDK-based JIT (Sovereign) Best of breed JIT compiler for client and server applications Based very loosely on Sun JDK Inside a Batch Compilation

C source C++ source Fortran source Other source

C++ Front Fortran C Front End Other Front End Front End Ends

Wcode Wcode Wcode++ Wcode Wcode TPO Wcode Scalarizer

Wcode

TOBEY Wcode Back End

Object Code TOBEY Optimizing Back End Project started in 1983 targetting S/370 Later retargetted to ROMP (PC-RT), Power, Power2, PowerPC, SPARC, and ESAME/390 (64 bit) Experimental retargets to i386 and PA-RISC Shipped in over 40 compiler products on 3 different platforms with 8 different source languages Primary vehicle for compiler optimization since the creation of the RS/6000 (pSeries) Implemented in a combination of PL.8 ("80% of PL/I") and C++ on an AIX reference platform Inside TOBEY

Wcode

OPT(0) Wcode-to-XIL Translator OPT(2) Local Commoning Control Flow Simple Early Redundancy Elimination Straightening Optimization Optimization Reassociation Elimination Early Macro Expansion OPT(2) Value Numbering Late OPT(0) Commoning/Code Motion Optimization Late Macro Expansion OPT(0) OPT(2) Instruction Fast Register Scheduling Allocation and

Final Assembly TPO (Toronto Portable Optimizer) Project started in 1994 as an interprocedural optimizer for RS/6000 Shipped first as an interprocedural optimizer for the OS/390 C compiler in 1996 Later shipped as part of C, C++ and Fortran compilers on AIX, the C++ compiler on OS/390 and as a linker enhancement on OS/400 Key optimization driver for the ASCI Blue and White projects and PowerPC SPEC benchmark performance Provides OpenMP explicit parallel support and automatic loop parallelization on RS/6000 Being adapted to optimize large scale commercial software such as DB2,Oracle and SAP Implemented in C++ on an AIX reference platform Inside TPO Compile Time Optimization

Wcode from FE Decode

Control Flow Analysis Store Motion Intraprocedural Constant Propagation Redundant Condition Elimination Control or Copy Propagation Loop Normalization Alias Changed? Optimizations Loop Unswitching Dead Store Elimination

Loop Fusion Scalar Replacement Loop Loop Distribution Loop Parallelization Unimodular Trans Loop Vectorization Optimizations Unroll-and-jam Code Motion and Commoning

Collection

Wcode Encode to BE

Scalar Control Flow Optimization Data Flow Optimization Optimization Loop Normalization

Loop Nest Aggressive Copy Propagation Canonization Maximal Loop Fusion

Parallel Loops Loop Nest Partitioning High Level Transformations Loop Unroll and Jam Loop Parallelization

Serial Parallel Loop Loops Outlining

Inner Loop Unrolling Low Level Loop Vectorization Transformations Redundancy Elimination Code Motion

All information subject to change without notice Inside an Link-time Compilation

Object files Libraries

OTHER LINK TPO INFORMATION

Wcode partitions

TOBEY

Object Files

Linker

Executable or shared library Inside TPO Link Time Optimization

Symbol Backward parameter & global def/use Resolution Alias backward properties Analysis

Call Graph LEVEL(2) Completion Forward copy and constant propagation Data-flow pointer alias analysis LEVEL(1) Analysis dead code elimination

Inlining Alias closure of context sensitive LEVEL(0) Closure pointer alias relationships Data Coalescing Backward invariant code motion Data-flow common subexpression elimination Analysis loop optimization Function Partitioning Selected TPO Optimizations Interprocedural constant propagation, pointer alias analysis and dead code elimination Partially invariant code motion Forward and backward store motion Partial constant propagation Redundant condition elimination Code and data partitioning Loop partitioning Some Compiler Changes for Power4 for dispatch Register-conctrained modulo scheduling Avoid microcoded and some cracked instructions Generate stream touch instructions Eliminate small branch sequences using CA bit Tune loop optimization for 8 prefetch buffers Procedure and loop code alignment Use static branch prediction override with PDF Inline pointer glue and set BH for virtual and pointer calls Bias CR allocation to get same source/target for CR logic Platform Neutral Improvements Profile directed interprocedural optimization Profiling and specialization of function pointer calls F90 MATMUL/TRANSPOSE improvements Interprocedural loop optimization Profile directed outlining Results: Regatta vs. Competition

1200

1000 +12%

PA-8700 750MHz 800 SPARC 900MHz +21% Itanium 800MHz 600 Pentium 2.2GHz Alpha 1.0GHz 400 Power4 1.3GHz Power4 + compiler

200

0 SPECint2000 SPECfp2000

* Note: Power4 measurements NOT official 2002 Performance Plan Themes Middleware performance (DB2) Practical SP Performance Continuing Power4 and follow-on support Optimization Priorities Low Level Optimization and Code Generation Loop Transformations Array Analysis Interprocedural Optimization C++ Optimization 2002 Optimization Highlights Shrink wrapping Loop fusion, distribution and index-set splitting Loop unrolling for machine balance and bandwidth utilization Interprocedural register allocation Superblock scheduling Profile-driven commoning and code motion Array data flow analysis and privatization Optimization of C++ exceptions, virtual dispatch and templates Data dependence analysis for complex indexing Interprocedural type-based analysis Sovereign Java Architecture

Java Application Java Compiler (javac, jikes) Sovereign Byte MMI Java Virtual Code Machine

JIT Native Compiler Code

PowerPC S/390 IA-32 IA-64 Linux Linux Linux Linux AIX32 Win32 Win64 AIX64 OS/390 OS/2 OS/400 4690 Sovereign JIT Compilation Cycle

method Recompile hot queue Recompile method invalid code

Recompilation Class Compiler Controller Loader

Code Compiled Data MMI Byte code samples code samples Transfer Native Sampler Interpreter Code Execution

Interpreter Method invocation counts Fast startup Conditional path info Class & method resolution Loop detection Class initialization

Sampler/Compiler Good code for warm methods Hot methods Best code for hot methods Common parameters Specialized hot methods Inside the Sovereign JIT

Java Bytecode Bytecode Optimization

Quadruple Quadruple Generation Optimization ILP DAG (SSA) Optimization Optimization (IA64)

Instruction Scheduling Native Code Register Allocation Generation Bytecode Optimization

Java Class Flow Guarded Method Bytecode Analysis Devirtualization Inlining

JSR Inlining

Field Privatization

NULL check elimination Dataflow Field privatization Optimization Type flow analysis Array check elimination Quadruple Optimization

Code Escape Quads Straightening Analysis

Copy Propagation Check Busy Code Motion Dead Store Elimination Dataflow Elimination Redundant Class Init Optimization Class Flow Analysis Architecture Redundancy Transfer Elimination Lazy Code Motion

Induction Variables Late DAG Impact Loop Versioning Architecture Loop Striding Optimization Analysis Transfer ILP Optimization IA64 Native Code Instruction-Level Parallel Optimization (IA-64)

Parallelism-aware Quad Register PDG Allocation

Hyperblock IF Conversion Selection and Predicate Analysis Generation

Control Speculation Critical Path Data Speculation Reduction Exception Speculation

Generate Instruction Native Code Parallelization (Bundle Formation)