Gcc Link Time Optimization and the Linux Kernel Andi Kleen Intel
Total Page:16
File Type:pdf, Size:1020Kb
gcc link time optimization and the Linux kernel Andi Kleen Intel OTC Apr 2013 [email protected] Acknowledgments ● Lots of people helped/contributed ● Ralf Baechle, Richard Biener, Tim Bird, Honza Hubicka, H.J. Lu, Joe Mario, Markus Trippelsdorf, Changlong Xie, others Why LTO? ● Optimize over the whole binary – Not just function (3.0) or file (<4.5) ● Avoid inline dependency hell in header files ● Without changing Makefiles significantly gcc 4.7+ LTO WHOPR crash course ● Compiler parses files, writes GIMPLE to object files (LGEN) – Function optimization summaries are computed ● Linker calls lto1 with all files for sequential whole program analysis (WPA) – Merges types, Generates global callgraph, IPA optimization summaries, writes partitions ● Generate code per partition in parallel (LTRANS) – Inline inside partition, run IPA optimizations for real, run per function optimizations, generate object code LTO IPA optimizations (4.8) ● Inlining between files ● De-virtualization ● Function cloning for ● Change ABI (SSE) specific arguments ● Constant propagation ● Remove unused code (fwhole-program) ● Scalar replacement of aggregates ● Increase alignment ● Constructor / ● Discover pure/const destructor merging ● Keep globals/statics alive over calls green: does not benefit kernel today User time / real time 10.00 15.00 0.00 5.00 200 400 600 800 0 Gcc 4.7 Gcc 4.7 Build time small time Build Parallelism LTO is and slow not enough parallel LTO Gcc 4.8 Gcc 4.8 4.7 -lto 4.7 4.7 -lto 4.7 4.8 -lto 4.8 4.8 -lto 4.8 Build time Faults (M) 20.00 40.00 60.00 0.00 user time (s) 2000 4000 Gcc 4.7 0 Faults small config small Faults Gcc 4.7 Gcc 4.8 smallconfig User time User Gcc 4.8 4.7 -lto 4.7 4.7 -lto 4.7 4.8 -lto 4.8 4.8 -lto 4.8 Parallelism small build 4.7 Object file generation Parsing / LGEN Runnable processes Modules without job server LTO kernel build small config 60 50 40 LTRANS ) code generation p ( r 30 e l b a n n u r 20 10 WPA + real linker 0 Type merging 11 31 51 71 91 111131151171191211231251271291311331351371391411431451471491511531551 1 21 41 61 81 101121141161181201221241261281301321341361381401421441461481501521541 4 times time (s) Small config parallelism User time / Runtime 15 10 ) s ( 5 e WHOPR still has poor parallelism m i t 0 Gcc 4.7 Gcc 4.7 no lto gcc 4.8 Gcc 4.8 no lto Multilink ● vmlinux links 2-4x: runs LTO that often ● Generates integrated symbol table (kallsyms) ● So far not fixed – KALLSYMS can be disabled ● One of those unexpected quirks of real build systems 6000.00 4000.00 2000.00 0.00 Gcc 4.8 w/o kallsyms Gcc 4.8 no lto Gcc 4.7 with kallsyms Memory usage 4.7 small build Active memory Kernel LTO build small config 8 7 6 ) 5 B G ( 4 m e m 3 e v i t c a 2 1 0 9 25 41 57 73 89 105121137153169185201217233249265281297313329345361377393409425441457473489505521537553 1 17 33 49 65 81 97 113129145161177193209225241257273289305321337353369385401417433449465481497513529545 time (s) Faults small config 60.00 Even small build swaps in 4GB system ) 40.00 M ( Memory peaks all in WPA s t l 20.00 u a F 0.00 Gcc 4.7 Gcc 4.8 4.7 -lto 4.8 -lto Memory consumption ● Temporary data can be a problem, together with WPA – Early many swap storms with /tmp = tmpfs – Partitioning algorithm was improved – Use TMPDIR=objdir – With modules need to avoid too large -j* for parallel WPA – Jobserver has to be disabled, makes it worse Large build 4.8 (allyes) Linux allyes nosym runnable 4.8 -j16 60 Kallsyms disabled 50 (disables various features) 40 ) p ( 30 e l b a n 20 ~ 42mins n u r ~1h 11min with kallsyms 10 ~15GB peak 0 81 237 393 549 705 861 1017 1173 1329 1485 1641 1797 1953 2109 2265 2421 3 159 315 471 627 783 939 1095 1251 1407 1563 1719 1875 2031 2187 2343 2499 time (s) Linux allyes nosym gcc 4.8 memory 14.00 12.00 10.00 ) G ( 8.00 m e 6.00 m e v 4.00 i t c a 2.00 0.00 87 255 423 591 759 927 109512631431159917671935210322712439 3 171 339 507 675 843 1011117913471515168318512019218723552523 time (s) WPA cycle profile 4.8 Samples: 257K of event ©cycles©, Event count (approx.): 183142453583 ● - 9.71% lto1-wpa lto1 [.] htab_find_slot_with_hash ◆ ● - htab_find_slot_with_hash ▒ ● + 61.79% gimple_type_hash(void const*) ▒ ● + 26.54% gimple_register_type_1(tree_node*, bool) ▒ ● + 4.79% visit(tree_node*, sccs*, unsigned int, vec<tree_node*, va_heap, v▒ ● + 1.53% iterative_hash_canonical_type(tree_node*, unsigned int) ▒ ● + 0.98% iterative_hash_gimple_type(tree_node*, unsigned int, vec<tree_nod▒ ● + 0.92% symtab_get_node(tree_node const*) ▒ ● + 0.77% gimple_type_eq(void const*, void const*) ▒ ● + 0.53% build_int_cst_wide(tree_node*, unsigned long, long) ▒ ● + 0.53% streamer_string_index(output_block*, char const*, unsigned int, b▒ ● + 9.00% lto1-wpa libc-2.17.so [.] _int_malloc ▒ ● + 7.34% lto1-wpa lto1 [.] iterative_hash_hashval_t(unsigned i▒ ● + 7.33% lto1-wpa lto1 [.] gimple_type_eq(void const*, void co▒ ● + 6.66% lto1-wpa libc-2.17.so [.] __memset_sse2 ▒ ● + 5.35% lto1-wpa libc-2.17.so [.] malloc_consolidate ▒ ● + 4.11% lto1-wpa libc-2.17.so [.] _int_free ▒ ● + 4.10% lto1-wpa lto1 [.] tree_map_base_eq(void const*, void ▒ ● + 2.85% lto1-wpa lto1 [.] pointer_map_insert(pointer_map_t*, ▒ ● + 2.76% lto1-wpa lto1 [.] inflate_fast ▒ ● + 2.17% lto1-wpa lto1 [.] lto_output_tree(output_block*, tree▒ Most time spent in hashing types for type merging Type merging Complicated iterative hash consisting struct foo ** of 4 hash tables (including two “hash cache hashes”) Each LGEN has its own types, need pointer_type to be unified pointer_type Struct foo Members (record_type) Hashing improvements ● Use tcmalloc instead of glibc malloc – build time -9%; peak mem +35% ● Start with large type hash tables / caches ● Use murmur for hashing / fix iterative / fix pointer hash to handle all bits – (collisions 90%->70%->64%) ● Splay trees for type merging (bad idea) 12.00% Build time Improvement hash changes 10.00% a 8.00%t 4.8, build time small config l e d 6.00%) s ( 4.00%e m i 2.00%t 0.00% tcmalloc murmur fullhash Possible future hash improvements ● Precompute hash in LGEN – And stream types out in hash order – Requires even larger hash tables to avoid collisions (or a good tree) ● Only merge per partition? – And do that in parallel ● Parallelize WPA further? Incremental linking ● Kernel build system uses ld -r / ar extensively ● Linker merges all sections with the same name – Added random postfixes to LTO sections ● Still problem with pure assembler files – .S assembler code dropped during LTO – Originally attempted to fix with slim LTO ● Then fixed in Linux binutils – Unfortunately changes not merged into mainline binutils – Use Linux binutils for kernel Kernel modifications (excerpt) ● Section attributes ● Disable LTO for some files – const __initconst char * vs ● Workarounds for 4.7 __initconst char * const partitioning bugs (visible) ● Toplevel inline assembler ● Gcc-ld – Need globals and visible for any symbols ● A few fixes for optimizations (rare) ● Lots of externally_visible ● Various changes for dot NUMBER symbols Kernel changes: 108 files changed, 740 insertions(+), 313 deletions(-) (v3.8; without most section attributes) Debugging ● Kernel debugging: more difficult with more inlining – Especially with initialization functions – May need inline aware backtracer (“dwarf kallsyms”) ● Compiler – No simple test cases. Lots of regressions; Complicated delta. – Complex multi process – Hard to find right process to attach/profile – -dH (core dumps) are most useful LTO has challenges for debugging Global Optimizations (small config) 21k new inlines between files Inlining 13k unique Scalar replacement No change Partial Inline +16% Constant Propagation +40% 266->1052 const clones Cloning for constant propagation (CONFIG_LTO_CLONE) +2% text (~200k) Static instruction changes Generated code 4.8 small LTO Changes in static instructions (vmlinux) More = Better 12.00% 10.00% 8.00% a t 6.00% l e d 4.00% 2.00% 0.00% calls functions fsize avg push pop mov instruction Generated code Runtime ● Runtime – Mixed results LKP dbench tbench AIM7 OLTP Sandy Bridge 0% 0% +3% 0% Nehalem 0% 0% +4% +6% ● Likely need some more optimizations – Open: Atom testing (more sensitive to bad code) Text size ● Text size – Slight increase for large configs, but lower for small configs ● small config +3% ● allno shrinks -50k – The smaller the kernel the less infrastructure is used. But EXPORT_SYMBOL defeats it again. ARM embedded improvements (Tim Bird) vmlinux.non-lto => vmlinux.lto ● baseline other change percent ● -------- ----- ------ ------- ● text: 5242000 4869340 -372660 -7% ● data: 490184 476200 -13984 -2% ● bss: 121824 122112 288 0% ● total: 5854008 5467652 -386356 -6% ● ● Kernel: non-lto lto ● --------------------------- ------- -------- ● time for full build: 1m5.87s 3m22s ● time for incremental build: 0m5.24s 2m12s ● kernel image size: 5854008 5467652 ● compressed uImage size: 2849920 2694224 ● system meminfo MemTotal: 17804 kB 18188 kB ● system meminfo MemFree: 11020 kB 11260 kB ● boot time: 2.466369 2.414428 gcc LTO improvements wish list ● Per file options (partial patch) ● Fix passing through of options ● Standard gcc-ld ● Don't error for section mismatches ● Better procedure for LTO error reporting ● Better messages for type mismatches ● No .XXXX postfixes ● -fno-toplevel-reorder by symbol? ● Automatic procedure for externally_visible discovery? ● Better syntax for “symbol visible to toplevel assembler” Status ● Tested on x86, ARM, MIPS ● Uptodate with Linux 3.8 ● Some options disabled (function-trace, gcov, modversions) References ● LTO kernel tree http://github.com/andikleen/linux-misc – Branch lto-3.x (x increasing) ● Linux binutils http://www.kernel.org/pub/linux/devel/binutils/ ● Tcmalloc http://code.google.com/p/gperftools/ ● [email protected] http://halobates.de Backup Future kernel improvements ● Avoid multi-link ● Use constructors to allow disabling -fno- toplevel-reorder ● Support new option to disable instrument_function Quick HOWTO for LTO kernel ● Get git://github.com/andikleen/linux-misc lto-3.x tree ● Need gcc 4.7+ set up for LTO / Linux binutils ● Enable CONFIG_LTO – Make sure FUNCTION_PROFILER et.al./MODVERSIONS are disabled ● make CC=compiler LD=linux-ld AR=gcc-ld -jX – Check LTO does not get disabled ● See Documentation/lto-build for more details .