An (incomplete) Survey of Compiler Technology at the IBM Toronto Laboratory
Bob Blainey
March 26, 2002

Target systems
  Sovereign (Sun JDK-based) Just-in-Time (JIT) Compiler
    zSeries (S/390): OS/390, Linux; resettable, shareable
    pSeries (PowerPC): AIX 32-bit and 64-bit, Linux
    xSeries (x86 or IA-32): Windows, OS/2, Linux, 4690 (POS)
    IA-64 (Itanium, McKinley): Windows, Linux
  C and C++ Compilers
    zSeries: OS/390
    pSeries: AIX
  Fortran Compiler
    pSeries: AIX

Key Optimizing Compiler Components
  TOBEY (Toronto Optimizing Back End with Yorktown)
    Highly optimizing code generator for S/390 and PowerPC targets
  TPO (Toronto Portable Optimizer)
    Mostly machine-independent optimizer for Wcode intermediate language
    Interprocedural analysis, loop transformations, parallelization
  Sun JDK-based JIT (Sovereign)
    Best of breed JIT compiler for client and server applications
    Based very loosely on Sun JDK

Inside a Batch Compilation
[Figure: batch compilation flow.
  Inputs: C source, C++ source, Fortran source, other source
  Front ends: C Front End, C++ Front End, Fortran Front End, Other Front Ends
  Optimizer stages: TPO, Scalarizer
  Back end: TOBEY Back End
  Output: Object Code
  Edges labeled Wcode connect the front ends, TPO, the Scalarizer and the TOBEY Back End; one edge is labeled Wcode++.]

TOBEY Optimizing Back End
  Project started in 1983 targeting S/370
  Later retargeted to ROMP (PC-RT), Power, Power2, PowerPC, SPARC, and ESAME/390 (64 bit)
  Experimental retargets to i386 and PA-RISC
  Shipped in over 40 compiler products on 3 different platforms with 8 different source languages
  Primary vehicle for compiler optimization since the creation of the RS/6000 (pSeries)
  Implemented in a combination of PL.8 ("80% of PL/I") and C++ on an AIX reference platform

Inside TOBEY
[Figure: TOBEY phase flow, with alternative paths marked OPT(0) and OPT(2).
  Input: Wcode
  Wcode-to-XIL Translator
  Simple Optimization / Early Optimization
    Passes shown: Local Commoning, Value Numbering, Control Flow Straightening, Redundancy Elimination, Reassociation, Dead Store Elimination
  Early Macro Expansion
  Late Optimization
    Passes shown: Value Numbering, Commoning/Code Motion, Dead Code Elimination
  Late Macro Expansion
  Fast Register Allocation / Instruction Scheduling and Register Allocation
  Final Assembly]

TPO (Toronto Portable Optimizer)
  Project started in 1994 as an interprocedural optimizer for RS/6000
  Shipped first as an interprocedural optimizer for the OS/390 C compiler in 1996
  Later shipped as part of C, C++ and Fortran compilers on AIX, the C++ compiler on OS/390 and as a linker enhancement on OS/400
  Key optimization driver for the ASCI Blue and White projects and PowerPC SPEC benchmark performance
  Provides OpenMP explicit parallel support and automatic loop parallelization on RS/6000
  Being adapted to optimize large scale commercial software such as DB2, Oracle and SAP
  Implemented in C++ on an AIX reference platform

Inside TPO Compile Time Optimization
[Figure: TPO compile-time flow.
  Wcode from FE -> Decode
  Intraprocedural Optimizations (repeated while the "Control or Alias Changed?" test holds)
    Control Flow Analysis
    Constant Propagation
    Copy Propagation
    Alias Analysis
    Dead Store Elimination
    Store Motion
    Redundant Condition Elimination
    Loop Normalization
    Loop Unswitching
    Loop Unrolling
  Loop Optimizations
    Loop Fusion
    Loop Distribution
    Unimodular Trans
    Unroll-and-jam
    Scalar Replacement
    Loop Parallelization
    Loop Vectorization
    Code Motion and Commoning
  Collection
  Encode -> Wcode to BE]

Loop Optimization
[Figure: loop optimization stages.
  Scalar Optimization
    Control Flow Optimization
    Data Flow Optimization
    Loop Normalization
  Loop Nest Canonization
    Aggressive Copy Propagation
    Maximal Loop Fusion
  High Level Transformations
    Loop Nest Partitioning
    Loop Interchange
    Loop Unroll and Jam
    Loop Parallelization
  Serial Loops / Parallel Loops
    Parallel Loop Outlining
  Low Level Transformations
    Inner Loop Unrolling
    Loop Vectorization
    Strength Reduction
    Redundancy Elimination
    Code Motion]
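As an illustration of one of the high level transformations named above, Loop Unroll and Jam, the following Python sketch applies the transformation by hand to a matrix-vector product. It is only a source-level analogy, assuming an unroll factor of 2; the function names are hypothetical and nothing here reflects how TPO implements or selects the transformation.

```python
def matvec_baseline(a, x):
    """y[i] = sum over j of a[i][j] * x[j], written as a plain loop nest."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


def matvec_unroll_and_jam(a, x):
    """Same computation with the outer loop unrolled by 2 (an assumed factor)
    and the two resulting copies of the inner loop fused ("jammed") into one."""
    n, m = len(a), len(x)
    y = [0.0] * n
    i = 0
    while i + 1 < n:
        for j in range(m):
            xj = x[j]                      # one read of x[j] now feeds two rows
            y[i] += a[i][j] * xj
            y[i + 1] += a[i + 1][j] * xj
        i += 2
    # Remainder iteration, needed when n is odd.
    if i < n:
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y
```

Both functions accumulate each y[i] in the same order, so they return identical results; the jammed version reads each x[j] once per pair of rows, which is the kind of reuse the transformation is meant to expose.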
All information subject to change without notice

Inside a Link-time Compilation
[Figure: link-time compilation flow.
  Inputs: object files, libraries, other link information
  TPO
  Wcode partitions
  TOBEY
  Object files
  Linker
  Output: executable or shared library]

Inside TPO Link Time Optimization
[Figure: TPO link-time phases.
  Phases: Symbol Resolution, Call Graph Completion, Inlining, Data Coalescing, Function Partitioning
  Levels shown: LEVEL(0), LEVEL(1), LEVEL(2)
  Analyses:
    Backward Alias Analysis: parameter & global def/use, backward properties
    Forward Data-flow Analysis: copy and constant propagation, pointer alias analysis, dead code elimination
    Alias Closure: closure of context sensitive pointer alias relationships
    Backward Data-flow Analysis: invariant code motion, common subexpression elimination, loop optimization]

Selected TPO Optimizations
  Interprocedural constant propagation, pointer alias analysis and dead code elimination
  Partially invariant code motion
  Forward and backward store motion
  Partial constant propagation
  Redundant condition elimination
  Code and data partitioning
  Loop partitioning

Some Compiler Changes for Power4
  Instruction scheduling for dispatch
  Register-constrained modulo scheduling
  Avoid microcoded and some cracked instructions
  Generate stream touch instructions
  Eliminate small branch sequences using CA bit
  Tune loop optimization for 8 prefetch buffers
  Procedure and loop code alignment
  Use static branch prediction override with PDF
  Inline pointer glue and set BH for virtual and pointer calls
  Bias CR allocation to get same source/target for CR logic

Platform Neutral Improvements
  Profile directed interprocedural optimization
  Profiling and specialization of function pointer calls
  F90 MATMUL/TRANSPOSE improvements
  Interprocedural loop optimization
  Profile directed outlining

Results: Regatta vs. Competition
[Figure: bar chart of SPECint2000 and SPECfp2000 scores (vertical axis 0 to 1200) for PA-8700 750MHz, SPARC 900MHz, Itanium 800MHz, Pentium 2.2GHz, Alpha 1.0GHz, Power4 1.3GHz, and Power4 + compiler. The chart carries the annotations +12% and +21%.]
* Note: Power4 measurements NOT official

2002 Performance Plan
  Themes
    Middleware performance (DB2)
    Practical SP Performance
    Continuing Power4 and follow-on support
  Optimization Priorities
    Low Level Optimization and Code Generation
    Loop Transformations
    Array Analysis
    Interprocedural Optimization
    C++ Optimization

2002 Optimization Highlights
  Shrink wrapping
  Loop fusion, distribution and index-set splitting
  Loop unrolling for machine balance and bandwidth utilization
  Interprocedural register allocation
  Superblock scheduling
  Profile-driven commoning and code motion
  Array data flow analysis and privatization
  Optimization of C++ exceptions, virtual dispatch and templates
  Data dependence analysis for complex indexing
  Interprocedural type-based analysis

Sovereign Java Architecture
[Figure: Sovereign Java architecture.
  Java Application -> Java Compiler (javac, jikes) -> Byte Code
  Sovereign Java Virtual Machine: MMI, JIT Compiler -> Native Code
  Supported platforms:
    PowerPC: Linux, AIX32, AIX64, OS/400
    S/390: Linux, OS/390
    IA-32: Linux, Win32, OS/2, 4690
    IA-64: Linux, Win64]

Sovereign JIT Compilation Cycle
[Figure: JIT compilation cycle.
  Components: Class Loader, MMI Interpreter, Sampler, Recompilation Controller, Compiler, Native Code Execution
  Edge labels: Byte code, method queue, Compiled code, Transfer, Code samples, Data samples, Recompile hot method, Recompile invalid code]

  Interpreter
    Method invocation counts
    Fast startup
    Conditional path info
    Class & method resolution
    Loop detection
    Class initialization
  Sampler/Compiler
    Good code for warm methods
    Hot methods
    Best code for hot methods
    Common parameters
    Specialized hot methods

Inside the Sovereign JIT
[Figure: Sovereign JIT phases.
  Java Bytecode
  Bytecode Optimization
  Quadruple Generation
  Quadruple Optimization
  DAG (SSA) Optimization
  ILP Optimization (IA64)
  Native Code Generation: Instruction Scheduling, Register Allocation]

Bytecode Optimization
[Figure: bytecode optimization phases.
  Java Bytecode
  Class Flow Analysis
  Guarded Devirtualization
  Method Inlining
  JSR Inlining
  Field Privatization
  Dataflow Optimization
    NULL check elimination
    Field privatization
    Type flow analysis
    Array check elimination]

Quadruple Optimization
[Figure: quadruple optimization phases.
  Input: Quads
  Code Straightening
  Escape Analysis
  Dataflow Optimization
    Copy Propagation
    Dead Store Elimination
    Class Flow Analysis
    Check Elimination
    Busy Code Motion
    Redundant Class Init
    Redundancy Elimination
    Lazy Code Motion
  Architecture Transfer
  DAG Optimization
    Induction Variables
    Loop Versioning
    Loop Striding
  Impact Analysis
  Late Architecture Transfer
  ILP Optimization (IA64)
  Output: Native Code]

Instruction-Level Parallel Optimization (IA-64)
[Figure: IA-64 instruction-level parallel optimization phases.
  Input: Quad
  PDG
  Parallelism-aware Register Allocation
  Hyperblock Selection and Generation
  IF Conversion, Predicate Analysis
  Critical Path Reduction
    Control Speculation
    Data Speculation
    Exception Speculation
  Instruction Parallelization (Bundle Formation)
  Generate Native Code]
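IF Conversion replaces branches with predicated instructions, so that both arms of a conditional become straight-line code. The Python sketch below models the idea with a toy predicated instruction format. The tuple layout, the register names and the `run_predicated` helper are hypothetical and are not the Sovereign JIT's representation; the always-true guard `p0` mirrors the IA-64 convention that predicate register p0 reads as true.

```python
import operator

# Operations understood by this toy machine (an assumption of the sketch).
OPS = {"sub": operator.sub, "gt": operator.gt, "le": operator.le}


def run_predicated(program, regs):
    """Execute straight-line code. Each instruction is
    (guard, dest, op, lhs, rhs); it is nullified when its guard is false."""
    regs = dict(regs, p0=True)  # p0 is the always-true guard
    for guard, dest, op, lhs, rhs in program:
        if regs[guard]:
            regs[dest] = OPS[op](regs[lhs], regs[rhs])
    return regs


# Branchy source:  if a > b: r = a - b  else: r = b - a
# After IF conversion there are no branches, only guarded instructions.
ABS_DIFF = [
    ("p0", "p1", "gt", "a", "b"),   # p1 = a > b
    ("p0", "p2", "le", "a", "b"),   # p2 = complement of p1
    ("p1", "r", "sub", "a", "b"),   # (p1) r = a - b
    ("p2", "r", "sub", "b", "a"),   # (p2) r = b - a
]


def abs_diff(a, b):
    """Run the if-converted sequence and return the result register."""
    return run_predicated(ABS_DIFF, {"a": a, "b": b})["r"]
```

Because exactly one of `p1` and `p2` is true, only one of the two subtractions commits its result, yet the instruction stream contains no control flow. That is what lets a scheduler place both arms in the same instruction groups.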