gcc link time optimization and the kernel

Andi Kleen Intel OTC

Apr 2013 [email protected]

Acknowledgments

● Lots of people helped/contributed

● Ralf Baechle, Richard Biener, Tim Bird, Honza Hubicka, H.J. Lu, Joe Mario, Markus Trippelsdorf, Changlong Xie, others

Why LTO?

● Optimize over the whole binary – Not just function (3.0) or file (<4.5)

● Avoid inline dependency hell in header files

● Without changing Makefiles significantly

gcc 4.7+ LTO WHOPR crash course

parses files, writes GIMPLE to object files (LGEN) – Function optimization summaries are computed ● Linker calls lto1 with all files for sequential whole program analysis (WPA) – Merges types, Generates global callgraph, IPA optimization summaries, writes partitions ● Generate code per partition in parallel (LTRANS) – Inline inside partition, run IPA optimizations for real, run per function optimizations, generate object code

LTO IPA optimizations (4.8)

● Inlining between files ● De-virtualization

● Function cloning for ● Change ABI (SSE) specific arguments ● Constant propagation ● Remove unused code (fwhole-program) ● Scalar replacement of aggregates ● Increase alignment ● Constructor / ● Discover pure/const destructor merging ● Keep globals/statics

alive over calls green: does not benefit kernel today

Build time

Build time small User time

800 small config 600

) 4000 s (

400 e 2000 m i

200 t

r

e 0 0 s Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto u Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto

Faults small config Parallelism 60.00 e 15.00 m i t l ) 40.00 a

10.00 M e ( r

/ s

t l e

5.00 u 20.00 a m i t F r

e 0.00 s 0.00

U Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto

LTO is slow and not parallel enough

Parallelism small build 4.7 Object file generation Parsing / LGEN Runnable processes Modules without job server LTO kernel build small config

60

50

40 LTRANS

) code generation p

( r 30 e l b a n n u

r 20

10 WPA + real linker 0 Type merging 11 31 51 71 91 111131151171191211231251271291311331351371391411431451471491511531551 1 21 41 61 81 101121141161181201221241261281301321341361381401421441461481501521541 4 times

time (s) Small config parallelism

User time / Runtime

15 10 ) s ( 5 e WHOPR still has poor parallelism m i t 0 Gcc 4.7 Gcc 4.7 no lto gcc 4.8 Gcc 4.8 no lto Multilink

● vmlinux links 2-4x: runs LTO that often ● Generates integrated symbol table (kallsyms) ● So far not fixed – KALLSYMS can be disabled

● One of those unexpected quirks of real build systems

6000.00

4000.00

2000.00

0.00 Gcc 4.8 w/o kallsyms Gcc 4.8 no lto Gcc 4.7 with kallsyms Memory usage 4.7 small build

Active memory

Kernel LTO build small config

8

7

6

) 5 B G ( 4 m e m 3 e v i t

a 2

1

0 9 25 41 57 73 89 105121137153169185201217233249265281297313329345361377393409425441457473489505521537553 1 17 33 49 65 81 97 113129145161177193209225241257273289305321337353369385401417433449465481497513529545

time (s)

Faults small config

60.00 Even small build swaps in 4GB system

) 40.00 M ( Memory peaks all in WPA s t l 20.00 u a F 0.00 Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto Memory consumption

● Temporary data can be a problem, together with WPA – Early many swap storms with /tmp = tmpfs – Partitioning algorithm was improved – Use TMPDIR=objdir – With modules need to avoid too large -j* for parallel WPA – Jobserver has to be disabled, makes it worse

Large build 4.8 (allyes)

Linux allyes nosym runnable 4.8 -j16

60 Kallsyms disabled 50 (disables various features) 40 ) p ( 30 e l b a n 20 ~ 42mins n u r ~1h 11min with kallsyms 10 ~15GB peak 0 81 237 393 549 705 861 1017 1173 1329 1485 1641 1797 1953 2109 2265 2421 3 159 315 471 627 783 939 1095 1251 1407 1563 1719 1875 2031 2187 2343 2499

time (s) Linux allyes nosym gcc 4.8 memory

14.00 12.00 10.00 ) G (

8.00 m

e 6.00 m

e

v 4.00 i t c

a 2.00 0.00 87 255 423 591 759 927 109512631431159917671935210322712439 3 171 339 507 675 843 1011117913471515168318512019218723552523 time (s) WPA cycle profile 4.8

Samples: 257K of event 'cycles', Event count (approx.): 183142453583

● ­ 9.71% lto1­wpa lto1 [.] htab_find_slot_with_hash ◆

● ­ htab_find_slot_with_hash ▒

● + 61.79% gimple_type_hash(void const*) ▒

● + 26.54% gimple_register_type_1(tree_node*, bool) ▒

● + 4.79% visit(tree_node*, sccs*, unsigned int, vec

● + 1.53% iterative_hash_canonical_type(tree_node*, unsigned int) ▒

● + 0.98% iterative_hash_gimple_type(tree_node*, unsigned int, vec

● + 0.92% symtab_get_node(tree_node const*) ▒

● + 0.77% gimple_type_eq(void const*, void const*) ▒

● + 0.53% build_int_cst_wide(tree_node*, unsigned long, long) ▒

● + 0.53% streamer_string_index(output_block*, char const*, unsigned int, b▒

● + 9.00% lto1­wpa libc­2.17.so [.] _int_malloc ▒

● + 7.34% lto1­wpa lto1 [.] iterative_hash_hashval_t(unsigned i▒

● + 7.33% lto1­wpa lto1 [.] gimple_type_eq(void const*, void co▒

● + 6.66% lto1­wpa libc­2.17.so [.] __memset_sse2 ▒

● + 5.35% lto1­wpa libc­2.17.so [.] malloc_consolidate ▒

● + 4.11% lto1­wpa libc­2.17.so [.] _int_free ▒

● + 4.10% lto1­wpa lto1 [.] tree_map_base_eq(void const*, void ▒

● + 2.85% lto1­wpa lto1 [.] pointer_map_insert(pointer_map_t*, ▒

● + 2.76% lto1­wpa lto1 [.] inflate_fast ▒

● + 2.17% lto1­wpa lto1 [.] lto_output_tree(output_block*, tree▒

Most time spent in hashing types for type merging

Type merging

Complicated iterative hash consisting struct foo ** of 4 hash tables (including two “hash cache hashes”)

Each LGEN has its own types, need pointer_type to be unified

pointer_type

Struct foo Members (record_type)

Hashing improvements

● Use tcmalloc instead of glibc malloc – build time -9%; peak mem +35% ● Start with large type hash tables / caches ● Use murmur for hashing / fix iterative / fix pointer hash to handle all bits – (collisions 90%->70%->64%) ● Splay trees for type merging (bad idea) 12.00% Build time Improvement hash changes 10.00% a

8.00%t 4.8, build time small config l e

6.00%) s (

4.00%e m i t 2.00% 0.00% tcmalloc murmur fullhash Possible future hash improvements

● Precompute hash in LGEN – And stream types out in hash order – Requires even larger hash tables to avoid collisions (or a good tree) ● Only merge per partition? – And do that in parallel ● Parallelize WPA further?

Incremental linking

● Kernel build system uses ld -r / ar extensively

● Linker merges all sections with the same name – Added random postfixes to LTO sections ● Still problem with pure assembler files – .S assembler code dropped during LTO – Originally attempted to fix with slim LTO ● Then fixed in Linux binutils – Unfortunately changes not merged into mainline binutils – Use Linux binutils for kernel

Kernel modifications (excerpt)

● Section attributes ● Disable LTO for some files – const __initconst char * vs ● Workarounds for 4.7 __initconst char * const partitioning bugs (visible) ● Toplevel inline assembler ● Gcc-ld – Need globals and visible for any symbols ● A few fixes for optimizations (rare) ● Lots of externally_visible ● Various changes for dot NUMBER symbols

Kernel changes:

108 files changed, 740 insertions(+), 313 deletions(-)

(v3.8; without most section attributes) Debugging

● Kernel debugging: more difficult with more inlining – Especially with initialization functions – May need inline aware backtracer (“dwarf kallsyms”)

● Compiler – No simple test cases. Lots of regressions; Complicated delta. – Complex multi process – Hard to find right process to attach/profile – -dH (core dumps) are most useful

LTO has challenges for debugging

Global Optimizations (small config)

21k new inlines between files Inlining 13k unique

Scalar replacement No change

Partial Inline +16%

Constant Propagation +40%

266->1052 const clones Cloning for constant propagation (CONFIG_LTO_CLONE) +2% text (~200k)

Static instruction changes

Generated code 4.8 small LTO

Changes in static instructions (vmlinux) More = Better

12.00%

10.00%

8.00% a

t 6.00% l e d 4.00%

2.00%

0.00% calls functions fsize avg push pop mov

instruction

Generated code

Runtime

● Runtime – Mixed results

LKP dbench tbench AIM7 OLTP Sandy Bridge 0% 0% +3% 0% Nehalem 0% 0% +4% +6%

● Likely need some more optimizations – Open: Atom testing (more sensitive to bad code)

Text size

● Text size – Slight increase for large configs, but lower for small configs

● small config +3% ● allno shrinks -50k

– The smaller the kernel the less infrastructure is used. But EXPORT_SYMBOL defeats it again.

ARM embedded improvements (Tim Bird)

vmlinux.non-lto => vmlinux.lto

● baseline other change percent

● ------

● text: 5242000 4869340 -372660 -7%

● data: 490184 476200 -13984 -2%

● bss: 121824 122112 288 0%

● total: 5854008 5467652 -386356 -6%

● Kernel: non-lto lto

● ------

● time for full build: 1m5.87s 3m22s

● time for incremental build: 0m5.24s 2m12s

● kernel image size: 5854008 5467652

● compressed uImage size: 2849920 2694224

● system meminfo MemTotal: 17804 kB 18188 kB

● system meminfo MemFree: 11020 kB 11260 kB

● boot time: 2.466369 2.414428

gcc LTO improvements wish list

● Per file options (partial patch) ● Fix passing through of options ● Standard gcc-ld ● Don't error for section mismatches ● Better procedure for LTO error reporting ● Better messages for type mismatches ● No .XXXX postfixes ● -fno-toplevel-reorder by symbol? ● Automatic procedure for externally_visible discovery? ● Better syntax for “symbol visible to toplevel assembler”

Status

● Tested on , ARM, MIPS ● Uptodate with Linux 3.8 ● Some options disabled (function-trace, gcov, modversions)

References

● LTO kernel tree http://github.com/andikleen/linux-misc – Branch lto-3.x (x increasing) ● Linux binutils http://www.kernel.org/pub/linux/devel/binutils/

● Tcmalloc http://code.google.com/p/gperftools/

[email protected] http://halobates.de

Backup

Future kernel improvements

● Avoid multi-link ● Use constructors to allow disabling -fno- toplevel-reorder ● Support new option to disable instrument_function

Quick HOWTO for LTO kernel

● Get git://github.com/andikleen/linux-misc lto-3.x tree ● Need gcc 4.7+ set up for LTO / Linux binutils ● Enable CONFIG_LTO – Make sure FUNCTION_PROFILER et.al./MODVERSIONS are disabled ● make CC=compiler LD=linux-ld AR=gcc-ld -jX – Check LTO does not get disabled ● See Documentation/lto-build for more details