GCC Yesterday, Today and Tomorrow
Diego Novillo [email protected]
Red Hat Canada

2nd HiPEAC GCC Tutorial, Ghent, Belgium, January 2007

Yesterday

Brief History

● GCC 1 (1987)
  – Inspired by Pastel (Lawrence Livermore Labs)
  – Only C
  – Translation done one statement at a time
● GCC 2 (1992)
  – Added C++
  – Added RISC architecture support
  – Closed development model challenged
  – New features difficult to add

Brief History

● EGCS (1997)
  – Fork from GCC 2.x
  – Many new features: Chill, numerous embedded ports, new scheduler, new optimizations, integrated libstdc++
● GCC 2.95 (1999)
  – EGCS and GCC 2 merge back into GCC
  – Type-based alias analysis
  – Chill front end
  – ISO C99 support

Brief History

● GCC 3 (2001)
  – Integrated libjava
  – Experimental SSA form on RTL
  – Functions as trees (crucial feature)
● GCC 4 (2005)
  – Internal architecture overhaul (Tree SSA)
  – Fortran 95
  – Automatic vectorization

GCC Growth¹

[Chart: lines of code per release, from 1.21 (1988) through 4.1 (2006), split into total, front ends, runtime libraries (libstdc++, libjava), compiler and ports, with annotations marking the addition of C, C++, Java, Ada, Fortran 95, Objective C++ and Tree SSA. The total grows from well under 200,000 lines to over 2,000,000 lines by 4.1.]

¹ Generated using David A. Wheeler's 'SLOCCount'.

Core Compiler Growth¹

[Chart: lines of code in the compiler proper and the ports alone, over the same releases 1.21 (1988) through 4.1 (2006), growing from under 100,000 to several hundred thousand lines, with an annotation marking the arrival of Tree SSA.]

¹ Generated using David A. Wheeler's 'SLOCCount'.

Growing pains

● Monolithic architecture
● Front ends too close to back ends
● Little or no internal interfaces
● RTL inadequate for high-level optimizations

[Diagram: the C, C++, Java and Fortran front ends feed directly into RTL, which the back end turns into assembly.]

Tree SSA Design

● Goals
  – Separate FE from BE
  – Evolution vs Revolution
● Needed IL layers
  – Two ILs to choose from: Tree and RTL
  – Moving down the abstraction ladder seemed simpler
● Started with FUD chains over C Trees
  – Every FE had its own variant of Trees
  – Complex grammar, side effects, language dependencies

Tree SSA Design

● SIMPLE (McGill University)
  – Used same tree data structure
  – Simplified and restrictive grammar (example below)
  – Started with C and followed with C++
  – Later renamed to GIMPLE
  – Still not enough
● GENERIC
  – Target IL for every FE
  – No grammar restrictions
  – Only required to remove language dependencies
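A small illustration of the restricted grammar (my example, using made-up temporary names): GIMPLE statements hold at most one operation, so nested expressions are flattened into compiler temporaries:

  /* C source */
  a = b + c + d * e;

  /* After gimplification */
  t1 = d * e;
  t2 = b + c;
  a  = t2 + t1;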

Tree SSA Design

● FUD chains unpopular
  – No overlapping live ranges (OLR)
  – Limits some transformations (e.g., copy propagation; see the example below)
● Replaced FUD chains with rewriting form
  – Kept FUD chains for memory expressions (Virtual SSA)
  – Needed out-of-SSA pass to cope with OLR
● Several APIs
  – CFG, statement, operand manipulation
  – Pass manager
  – Call graph manager
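A made-up example of the copy-propagation limitation: with a rewriting SSA form, each definition gets a fresh name, so a propagated value can legally stay live across a later redefinition of the same variable, and out-of-SSA then keeps the two names in separate storage:

  b_1 = ...
  a_2 = b_1            /* copy                                     */
  b_3 = c_4 + 1        /* b redefined                              */
  x_5 = a_2 + 2        /* copy propagation turns this into
                          x_5 = b_1 + 2, making b_1 live across
                          the definition of b_3 (overlapping live
                          ranges of "b")                           */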

Tree SSA Timeline

[Timeline: level of activity on the Tree SSA branches, from the private ast-optimizer development branch (Oct/00) through the tree-ssa-20020619 branch to the mainline merge on 13/May/2004. Milestones along the way include SSA on C trees (19/Jun/02), SIMPLE for C (Sep/02) and for C++, the pretty printer, GIMPLE and GENERIC (Jan/03), mudflap, profiling, DSE, NRV, PRE (Jul/03), rewriting SSA (Nov/03), out-of-SSA, EH lowering, SRA, statement lists and iterators, the pass manager, the operands API, GIMPLE for Java, Fortran 95, complex lowering, DOM, points-to analysis (including flow-sensitive PTA), PHI opt, CCP/DCE, forward propagation and unnesting.]

Today

Compiler pipeline

[Diagram: the C, C++, Java and Fortran front ends produce GENERIC, which the middle end lowers to GIMPLE for the inter-procedural and SSA optimizers; GIMPLE is then lowered to RTL for the back end's RTL optimizer and final code generation into assembly. A call graph manager and a pass manager drive the whole pipeline.]

Major Features

● De-facto system compiler for Linux
● No central planning (controlled chaos)
● Supports
  – All major languages
  – An insane number of platforms
● SSA-based high-level global optimizer
● Automatic vectorization (example below)
● OpenMP support
● Pointer checking instrumentation for C/C++
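As a concrete illustration of the vectorizer (my example, not from the slides), a simple loop such as the one below can be turned into SIMD code when compiled with the -ftree-vectorize flag provided by the 4.x series:

  /* vec.c: a loop the tree vectorizer can turn into SIMD code
     (possibly guarded by a runtime alias check on a, b, c).  */
  void add (int n, float *a, const float *b, const float *c)
  {
    int i;
    for (i = 0; i < n; i++)
      a[i] = b[i] + c[i];
  }

  $ gcc -O2 -ftree-vectorize -c vec.c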

Major Issues

● Popularity
  – Caters to a wide variety of user communities
  – Moving in different directions at once
● No central planning (controlled chaos)
● Not enough engineering cycles
● Risk of “featuritis”
  – Too many flags
  – Too many passes
● Warnings depending on optimization levels (example below)
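A familiar instance of the last point (my example, not from the slides): with the 4.x compilers, -Wuninitialized relies on dataflow information computed by the optimizers, so it typically stays silent unless optimization is enabled:

  /* warn.c */
  int f (int x)
  {
    int y;
    if (x)
      y = 1;
    return y;   /* may be used uninitialized when x == 0 */
  }

  $ gcc -c -Wuninitialized warn.c        # no warning at -O0
  $ gcc -c -O2 -Wuninitialized warn.c    # warning: 'y' may be used uninitialized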

Longer Release Cycles

[Chart: number of days per release cycle, split into Stage 3 and release, for GCC 3.1 (2002) through 4.2 (2007, estimated as of Feb/07). The release cycle has almost doubled since 3.1.]

A Few Current Projects

● RTL cleanups
● OpenMP
● Interprocedural analysis framework
● GIMPLE tuples
● Link-time optimization
● Memory SSA
● Sharing alias information across ILs
● Scheduling

RTL Cleanups

● Removal of duplicate functionality
  – Mostly done
  – Goal is for RTL to only handle low-level issues
● Dataflow analysis
  – Replace ad-hoc flow analysis with a generic DF solving framework
  – Increased accuracy allows more aggressive transformations
  – Support for incremental dataflow information
  – Back ends need taming

OpenMP

● Pragma-based annotations to specify parallelism
● New hand-written recursive-descent parser eased pragma recognition
● GIMPLE extended to support concurrency
● Supports most platforms with threading support
● Available now in Fedora Core 6's compiler
● Official release will be in GCC 4.2

OpenMP

  #include <stdio.h>
  #include <omp.h>

  int main (void)
  {
    #pragma omp parallel
      printf ("[%d] Hello\n", omp_get_thread_num ());
    return 0;
  }

  $ gcc -fopenmp -o hello hello.c
  $ ./hello
  [2] Hello
  [0] Hello
  [3] Hello
  [1] Hello

  $ gcc -o hello hello.c
  $ ./hello
  [0] Hello   ← Master thread

Interprocedural Analysis

● Keep the whole call graph in SSA form
● Increased precision for inter-procedural analyses and transformations
● Unify internal APIs to support inter/intra-procedural analysis
  – No special casing in the pass manager
  – Improve callgraph facilities
● Challenges
  – Memory consumption
  – Privatization of global attributes

GIMPLE Tuples

● GIMPLE shares the tree data structure with the FEs (illustrated below)
● Too much redundant information
● A separate tuple-like data structure provides
  – Increased separation from the FEs
  – Memory savings
  – Potential compile-time improvements
● Challenges
  – Shared infrastructure (e.g., fold())
  – Compile-time increase due to conversion
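A purely hypothetical sketch of the intent (these are not GCC's actual declarations): a GIMPLE assignment needs little more than an opcode and its operands, whereas the shared tree node also carries front-end fields that are dead weight in the middle end:

  /* Hypothetical illustration only, not GCC's real data structures.  */
  struct gimple_tuple            /* enough for  a = b + c            */
  {
    unsigned code;               /* e.g. the PLUS operation          */
    struct gimple_tuple *next;   /* statements chained in a list     */
    void *ops[3];                /* lhs, rhs1, rhs2                  */
  };
  /* The tree node it would replace also drags along type, chain,
     location, language flags and annotations that GIMPLE rarely
     needs at this point in the pipeline.  */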

Link Time Optimization

● Ability to stream the IL to support IPO across
  – Multiple compilation units
  – Multiple languages
● Streamed IL representation treated like any other language
● Challenges
  – Original language must be represented explicitly
  – Complete compiler state must be preserved
  – Conflicting flags used in different modules

Alias Analysis

Overview

● GIMPLE represents alias information explicitly
● Alias analysis is just another pass
  – Artificial symbols represent memory expressions (virtual operands)
  – FUD chains computed on virtual operands → Virtual SSA
● Transformations may prove a symbol non-addressable
  – Promoted to GIMPLE register
  – Requires another aliasing pass

Symbolic Representation of Memory

● Pointer P is associated with memory tag MT
  – MT represents the set of variables pointed-to by P
● So *P is a reference to MT

  if (...)               p points-to {a, b}
    p = &a               p has memory tag MT
  else
    p = &b
  *p = 5                 interpreted as MT = 5

Associating Memory with Symbols

● Alias analysis
  – Builds points-to sets and memory tags
● Structural analysis
  – Builds field tags (sub-variables)
● Operand scanner
  – Scans memory expressions to extract tags
  – Prunes alias sets based on expression structure

Alias Analysis

● Points-to alias analysis (PTAA)
  – Based on constraint graphs
  – Field and flow sensitive, context insensitive
  – Intra-procedural (inter-procedural in 4.2)
  – Fairly precise
● Type-based analysis (TBAA)
  – Based on input language rules (example below)
  – Field sensitive, flow insensitive
  – Very imprecise
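A small example of a language rule TBAA can exploit (mine, not from the slides): C does not allow an int object to be accessed through a float lvalue, so the two stores below are assumed not to interfere:

  void f (int *i, float *f)
  {
    *i = 1;      /* under C's type rules *i and *f cannot overlap,  */
    *f = 2.0f;   /* so the value stored through i may stay in a
                    register across the store through f.            */
  }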

Alias Analysis

● Two kinds of pointers are considered
  – Symbols: points-to is flow-insensitive
    ● Associated with Symbol Memory Tags (SMT)
  – SSA names: points-to is flow-sensitive (example below)
    ● Associated with Name Memory Tags (NMT)
● Given a pointer dereference *ptr
  – If ptr has an NMT, use it
  – If not, fall back to the SMT associated with ptr
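A made-up fragment showing why flow-sensitive, per-name points-to information is more precise than the per-symbol variety:

  p_1 = &a                 /* NMT(p_1) = { a }                  */
  if (cond_2)
    p_3 = &b               /* NMT(p_3) = { b }                  */
  p_4 = PHI <p_1, p_3>     /* NMT(p_4) = { a, b }               */
  *p_4 = 5                 /* only here must both a and b be
                              considered; a flow-insensitive SMT
                              for p would say { a, b } at every
                              dereference of p                   */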

Structural Analysis

● Separate structure fields are assigned distinct symbols

  struct A {
    int x;
    int y;
    int z;
  };
  struct A a;

● Variable a will have 3 sub-variables { SFT.1, SFT.2, SFT.3 }
● References to each field are mapped to the corresponding sub-variable

IL Representation

  foo (i, a, b, *p)              foo (i, a, b, *p)
  {                              {
    p = (i > 10) ? &a : &b         p = (i > 10) ? &a : &b
    *p = 3                         # a = VDEF <a>
    return a + b                   # b = VDEF <b>
  }                                *p = 3
                                   # VUSE <a>
                                   t1 = a
                                   # VUSE <b>
                                   t2 = b
                                   t3 = t1 + t2
                                   return t3
                                 }

Virtual SSA Form

  foo (i, a, b, *p)
  {
    p_2 = (i_1 > 10) ? &a : &b

    # a_4 = VDEF <a_3>
    a = 9;

    # a_5 = VDEF <a_4>
    # b_7 = VDEF <b_6>
    *p = 3;

    # VUSE <a_5>
    t1_8 = a;

    t3_10 = t1_8 + 5;
    return t3_10;
  }

● VDEF operands are needed to maintain DEF-DEF links
● They also prevent code movement that would cross stores after loads
● When alias sets grow too big, a static grouping heuristic reduces the number of virtual operators in aliased references

Virtual SSA – Problems

● Big alias sets → many virtual operators
  – Unnecessarily detailed tracking
  – Memory
  – Compile time
  – SSA name explosion
● Need new representation that can
  – Degrade gracefully as detail level grows
  – Or, better, show no degradation with highly detailed information

Memory SSA

● New representation of aliasing information
● Big alias sets → many virtual operators
  – Unnecessarily detailed tracking
  – Memory, compile time, SSA name explosion
● Main idea
  – Stores to many locations create a single name
  – Factored name becomes reaching definition for all symbols involved in the store

Memory SSA

  p_3 points-to { a, b, c }
  q_4 points-to { n, o, p }

  # .MEM_10 = VDEF <.MEM_0>
  *p_3 = ...

  # .MEM_11 = VDEF <.MEM_0>
  *q_4 = ...

  # b_12 = VDEF <.MEM_10>
  b = ...

  # .MEM_13 = VDEF <.MEM_10, b_12>
  *p_3 = ...

  # VUSE <.MEM_13>
  t_14 = b

  # VUSE <.MEM_11>
  t_15 = o

● At most one VDEF and one VUSE per statement
● Virtual operators may refer to more than one operand
● Factored stores create “sinks” that group multiple incoming names

Memory SSA

● Challenges
  – Overlapping live ranges create a problem for some passes
  – Requires more detailed tracking in SSA rewriting
  – Dynamic association between Φ nodes and symbols
● Memory partitions
  – Have more than one .MEM object
  – Create static associations between symbols
  – Association is independent from alias relations
  – Heuristics control partitioning (sketch below)
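A rough, made-up sketch of what partitioning buys, reusing the notation above: symbols are grouped under a few memory partition names, and a store creates a new SSA name only for its own partition:

  /* Hypothetical partitioning of { a, b, c, d }:
       MPT.1 = { a, b }       MPT.2 = { c, d }        */

  # MPT.1_7 = VDEF <MPT.1_5>    /* store into a or b            */
  a = ...

  # MPT.2_8 = VDEF <MPT.2_6>    /* store into c or d leaves the
                                   reaching definition of MPT.1
                                   untouched                     */
  c = ...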

Tomorrow

Future Work

● Wide variety of projects
● A small sample
  – Plug-in support
  – Scheduling
  – Register pressure reduction
  – Register allocation
  – Incremental compilation
  – Dynamic compilation
  – Dynamic optimization pipeline

Plug-in Support

● Extensibility mechanism to allow 3rd-party tools
● Wrap some internal APIs for external use
● Allow loading of external shared modules
  – Loaded module becomes another pass
  – Compiler flag determines location
● Versioning scheme prevents mismatches
● Useful for
  – Static analysis
  – Experimenting with new transformations

Scheduling

● Several concurrent efforts targeting 4.3 and 4.4
  – Schedule over larger regions for increased parallelism
  – Most target IA64, but benefit all architectures
● Enhanced selective scheduling
● Treegion scheduling
● Superblock scheduling
● Improvements to swing modulo scheduling

Register Allocation

● Several efforts over the years
● Complex problem
  – Many different targets to handle
  – Interactions with reload and scheduling
● YARA (Yet Another Register Allocator)
  – Experimented with several algorithms
● IRA (Integrated Register Allocator)
  – Priority coloring, Chaitin-Briggs and region based
  – Expected in 4.4
  – Currently works on x86, x86-64, ppc, IA64, sparc, s390

Register pressure reduction

● SSA may cause excessive register pressure
  – Pathological cases → ~800 live registers
  – RA battle lost before it begins
● Short-term project to cope with RA deficiencies
● Implement register pressure reduction in GIMPLE before going to RTL
  – Pre-spilling combined with live range splitting
  – Load
  – Tie RTL generation into out-of-SSA to allow for better spills and rematerialization

Dynamic compilation

● Delay compilation until runtime (JIT)
  – Emit bytecodes
  – Implement a virtual machine with optimizing transformations
● Leverage existing infrastructure (LLVM, LTO)
● Not appropriate for every case
● Challenges
  – Still active research
  – Different models/costs for static and dynamic compilation

Incremental Compilation

● Speed up the edit-compile-debug cycle
● Speeds up ordinary compiles by compiling a given header file “once”
● Incremental changes fed to a compiler daemon
● Incremental linking as well
● Side effects
  – Refactoring
  – Cross-referencing
  – Compile-while-you-type (e.g., Eclipse)

Dynamic Optimization Pipeline

● Phase ordering not optimal for every case
● Current static ordering difficult to change
● Allow external re-ordering
  – Ultimate control
  – Allow experimenting with different orderings
  – Define -On based on common orderings
● Problems
  – Probability of finding bugs increases
  – Enormous search space
