Gcc Link Time Optimization and the Linux Kernel Andi Kleen Intel

Andi Kleen, Intel OTC
Apr 2013
[email protected]

Acknowledgments
● Lots of people helped/contributed
● Ralf Baechle, Richard Biener, Tim Bird, Honza Hubicka, H.J. Lu, Joe Mario, Markus Trippelsdorf, Changlong Xie, others

Why LTO?
● Optimize over the whole binary
  – Not just function (3.0) or file (<4.5)
● Avoid inline dependency hell in header files
● Without changing Makefiles significantly

gcc 4.7+ LTO WHOPR crash course
● Compiler parses files, writes GIMPLE to object files (LGEN)
  – Function optimization summaries are computed
● Linker calls lto1 with all files for sequential whole program analysis (WPA)
  – Merges types, generates global callgraph, IPA optimization summaries, writes partitions
● Generate code per partition in parallel (LTRANS)
  – Inline inside partition, run IPA optimizations for real, run per function optimizations, generate object code

LTO IPA optimizations (4.8)
● Inlining between files
● Function cloning for specific arguments
● Constant propagation
● Scalar replacement of aggregates
● Discover pure/const
● Keep globals/statics alive over calls
● De-virtualization
● Change ABI (SSE)
● Remove unused code (-fwhole-program)
● Increase alignment
● Constructor / destructor merging
Legend: green = does not benefit kernel today

Build time
[Figure: four bar charts for the small config, each comparing Gcc 4.7, 4.7 -lto, Gcc 4.8 and 4.8 -lto: build time, parallelism (user time / real time), faults (M), and user time (s).]
LTO is slow and not enough parallel

Parallelism small build 4.7
[Figure: "LTO kernel build small config", runnable processes (0 to 60) over time (s). Annotated phases: Parsing / LGEN (object file generation), Type merging, WPA + real linker, LTRANS code generation, Modules without job server. Annotation: "4 times".]

Small config parallelism
[Figure: user time / runtime for Gcc 4.7, Gcc 4.7 no lto, Gcc 4.8 and Gcc 4.8 no lto.]
WHOPR still has poor parallelism

Multilink
● vmlinux links 2-4x: runs LTO that often
● Generates integrated symbol table (kallsyms)
● So far not fixed
  – KALLSYMS can be disabled
● One of those unexpected quirks of real build systems
[Figure: bar chart (axis 0 to 6000) comparing Gcc 4.8 w/o kallsyms, Gcc 4.8 no lto and Gcc 4.7 with kallsyms.]

Memory usage 4.7 small build
[Figure: "Kernel LTO build small config", active memory (GB, 0 to 8) over time (s).]
[Figure: "Faults small config", faults (M) for Gcc 4.7, Gcc 4.8, 4.7 -lto and 4.8 -lto.]
Even small build swaps in 4GB system
Memory peaks all in WPA

Memory consumption
● Temporary data can be a problem, together with WPA
  – Early many swap storms with /tmp = tmpfs
  – Partitioning algorithm was improved
  – Use TMPDIR=objdir
  – With modules need to avoid too large -j* for parallel WPA
  – Jobserver has to be disabled, makes it worse

Large build 4.8 (allyes)
[Figure: "Linux allyes nosym runnable 4.8 -j16", runnable processes (0 to 60) over time (s).]
● Kallsyms disabled (disables various features)
● ~42 mins
● ~1h 11min with kallsyms
● ~15GB peak
[Figure: "Linux allyes nosym gcc 4.8 memory", active memory (GB, 0 to 14) over time (s).]

WPA cycle profile 4.8
Samples: 257K of event 'cycles', Event count (approx.): 183142453583
-  9.71%  lto1-wpa  lto1          [.] htab_find_slot_with_hash
   - htab_find_slot_with_hash
      + 61.79% gimple_type_hash(void const*)
      + 26.54% gimple_register_type_1(tree_node*, bool)
      +  4.79% visit(tree_node*, sccs*, unsigned int, vec<tree_node*, va_heap, v
      +  1.53% iterative_hash_canonical_type(tree_node*, unsigned int)
      +  0.98% iterative_hash_gimple_type(tree_node*, unsigned int, vec<tree_nod
      +  0.92% symtab_get_node(tree_node const*)
      +  0.77% gimple_type_eq(void const*, void const*)
      +  0.53% build_int_cst_wide(tree_node*, unsigned long, long)
      +  0.53% streamer_string_index(output_block*, char const*, unsigned int, b
+  9.00%  lto1-wpa  libc-2.17.so  [.] _int_malloc
+  7.34%  lto1-wpa  lto1          [.] iterative_hash_hashval_t(unsigned i
+  7.33%  lto1-wpa  lto1          [.] gimple_type_eq(void const*, void co
+  6.66%  lto1-wpa  libc-2.17.so  [.] __memset_sse2
+  5.35%  lto1-wpa  libc-2.17.so  [.] malloc_consolidate
+  4.11%  lto1-wpa  libc-2.17.so  [.] _int_free
+  4.10%  lto1-wpa  lto1          [.] tree_map_base_eq(void const*, void
+  2.85%  lto1-wpa  lto1          [.] pointer_map_insert(pointer_map_t*,
+  2.76%  lto1-wpa  lto1          [.] inflate_fast
+  2.17%  lto1-wpa  lto1          [.] lto_output_tree(output_block*, tree
Most time spent in hashing types for type merging

Type merging
● Complicated iterative hash consisting of 4 hash tables (including two “hash cache hashes”)
● Each LGEN has its own types, need to be unified
[Diagram: struct foo ** as pointer_type -> pointer_type -> struct foo (record_type) -> members]

Hashing improvements
● Use tcmalloc instead of glibc malloc
  – build time -9%; peak mem +35%
● Start with large type hash tables / caches
● Use murmur for hashing / fix iterative / fix pointer hash to handle all bits
  – (collisions 90%->70%->64%)
● Splay trees for type merging (bad idea)
[Figure: "Build time improvement, hash changes" (4.8, build time small config), time (s) delta from 0% to 12% for tcmalloc, murmur and fullhash.]

Possible future hash improvements
● Precompute hash in LGEN
  – And stream types out in hash order
  – Requires even larger hash tables to avoid collisions (or a good tree)
● Only merge per partition?
  – And do that in parallel
● Parallelize WPA further?

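The "fix pointer hash to handle all bits" item on the hashing slide can be illustrated with a small sketch. This is not the gcc change itself: `truncating_hash` and `mixing_hash` are hypothetical names, keys are modeled as plain 64-bit integers rather than real pointers, and the mixing step is assumed to be the 64-bit finalizer from MurmurHash3 (`fmix64`), chosen only because the slide mentions murmur.

```c
#include <stdint.h>

/* Hypothetical "before" hash: drop alignment bits, then truncate to 32 bits.
   Key bits 35 and above never reach the result, so keys that differ
   only in those bits always collide. */
uint32_t truncating_hash(uint64_t key)
{
    return (uint32_t)(key >> 3);
}

/* 64-bit finalizer from MurmurHash3. Every step is invertible, so two
   distinct inputs always give two distinct 64-bit outputs. */
uint64_t fmix64(uint64_t k)
{
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

/* Hypothetical "after" hash: mix all 64 key bits before truncating,
   so high bits influence the low 32 bits that index the table. */
uint32_t mixing_hash(uint64_t key)
{
    return (uint32_t)fmix64(key);
}
```

For example, the keys `(1ULL << 40) | 0x1000` and `(1ULL << 41) | 0x1000` get the same value from `truncating_hash`, while `fmix64` is guaranteed to separate them before the final truncation. Truncating to 32 bits can still collide occasionally, which matches the slide's point that collisions were reduced rather than eliminated.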
Incremental linking
● Kernel build system uses ld -r / ar extensively
● Linker merges all sections with the same name
  – Added random postfixes to LTO sections
● Still problem with pure assembler files
  – .S assembler code dropped during LTO
  – Originally attempted to fix with slim LTO
● Then fixed in Linux binutils
  – Unfortunately changes not merged into mainline binutils
  – Use Linux binutils for kernel

Kernel modifications (excerpt)
● Section attributes
  – const __initconst char * vs __initconst char * const
● Toplevel inline assembler
  – Need globals and visible for any symbols
● Lots of externally_visible
● Disable LTO for some files
● Workarounds for 4.7 partitioning bugs (visible)
● Gcc-ld
● A few fixes for optimizations (rare)
● Various changes for dot NUMBER symbols
Kernel changes: 108 files changed, 740 insertions(+), 313 deletions(-) (v3.8; without most section attributes)

Debugging
● Kernel debugging: more difficult with more inlining
  – Especially with initialization functions
  – May need inline aware backtracer (“dwarf kallsyms”)
● Compiler
  – No simple test cases. Lots of regressions; complicated delta.
  – Complex multi process
  – Hard to find right process to attach/profile
  – -dH (core dumps) are most useful
LTO has challenges for debugging

Global Optimizations (small config)
  Inlining                                21k new inlines between files, 13k unique
  Scalar replacement                      No change
  Partial Inline                          +16%
  Constant Propagation                    +40%
  Cloning for constant propagation        266->1052 const clones, +2% text (~200k)
  (CONFIG_LTO_CLONE)

Static instruction changes
[Figure: "Generated code 4.8 small LTO: changes in static instructions (vmlinux)", delta from 0% to 12% for calls, functions, avg fsize, push, pop and mov. More = Better.]

Runtime
● Runtime
  – Mixed results

  LKP            dbench   tbench   AIM7   OLTP
  Sandy Bridge   0%       0%       +3%    0%
  Nehalem        0%       0%       +4%    +6%

● Likely need some more optimizations
  – Open: Atom testing (more sensitive to bad code)

Text size
● Text size
  – Slight increase for large configs, but lower for small configs
● small config +3%
● allno shrinks -50k
  – The smaller the kernel the less infrastructure is used. But EXPORT_SYMBOL defeats it again.

ARM embedded improvements (Tim Bird)
vmlinux.non-lto => vmlinux.lto

           baseline     other    change   percent
           --------     -----    ------   -------
  text:     5242000   4869340   -372660       -7%
  data:      490184    476200    -13984       -2%
  bss:       121824    122112       288        0%
  total:    5854008   5467652   -386356       -6%

  Kernel:                         non-lto     lto
  ---------------------------     -------     --------
  time for full build:            1m5.87s     3m22s
  time for incremental build:     0m5.24s     2m12s
  kernel image size:              5854008     5467652
  compressed uImage size:         2849920     2694224
  system meminfo MemTotal:        17804 kB    18188 kB
  system meminfo MemFree:         11020 kB    11260 kB
  boot time:                      2.466369    2.414428

gcc LTO improvements wish list
● Per file options (partial patch)
● Fix passing through of options
● Standard gcc-ld
● Don't error for section mismatches
● Better procedure for LTO error reporting
● Better messages for type mismatches
● No .XXXX postfixes
● -fno-toplevel-reorder by symbol?
● Automatic procedure for externally_visible discovery?
● Better syntax for “symbol visible to toplevel assembler”

Status
● Tested on x86, ARM, MIPS
● Up to date with Linux 3.8
● Some options disabled (function-trace, gcov, modversions)

References
● LTO kernel tree http://github.com/andikleen/linux-misc
  – Branch lto-3.x (x increasing)
● Linux binutils http://www.kernel.org/pub/linux/devel/binutils/
● Tcmalloc http://code.google.com/p/gperftools/
● [email protected] http://halobates.de

Backup

Future kernel improvements
● Avoid multi-link
● Use constructors to allow disabling -fno-toplevel-reorder
● Support new option to disable instrument_function

Quick HOWTO for LTO kernel
● Get git://github.com/andikleen/linux-misc lto-3.x tree
● Need gcc 4.7+ set up for LTO / Linux binutils
● Enable CONFIG_LTO
  – Make sure FUNCTION_PROFILER et al./MODVERSIONS are disabled
● make CC=compiler LD=linux-ld AR=gcc-ld -jX
  – Check LTO does not get disabled
● See Documentation/lto-build for more details
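
As a footnote to the "Section attributes" item under kernel modifications (const __initconst char * vs __initconst char * const), the following is a minimal user-space sketch of the const placement that item refers to. It assumes a GNU-compatible compiler on an ELF target. `DEMO_INITCONST`, `boot_modes` and `boot_mode_name` are hypothetical names; the macro is only modeled on the kernel's `__initconst` and is not the kernel definition.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's __initconst: place the object
   in a dedicated read-only section. */
#define DEMO_INITCONST __attribute__((section(".init.rodata")))

/* The array itself is const (note the second const), so its contents
   are consistent with a read-only section. */
static const char * const boot_modes[] DEMO_INITCONST = { "quiet", "debug" };

/* The problematic form is the same declaration without the second const:
       static const char *boot_modes[] DEMO_INITCONST = { ... };
   There the array is writable data placed in a read-only section, which
   the compiler can reject as a section type conflict once it sees other
   const objects in the same section. */

/* Return the name for a mode index, or NULL if the index is out of range. */
const char *boot_mode_name(size_t idx)
{
    size_t count = sizeof(boot_modes) / sizeof(boot_modes[0]);
    return idx < count ? boot_modes[idx] : NULL;
}
```

Calling `boot_mode_name(0)` returns "quiet" and an out-of-range index returns NULL. In a per-file build such a mismatch can go unnoticed when each file only has one kind of object in the section, whereas LTO sees all of them together, which is one plausible reason the talk lists this among the required kernel changes.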
