Binary Analysis for Measurement and Attribution of Program Performance

Total Page:16

File Type:pdf, Size:1020Kb

Binary Analysis for Measurement and Attribution of Program Performance Binary Analysis for Measurement and Attribution of Program Performance Nathan R. Tallent John M. Mellor-Crummey Michael W. Fagan Rice University ftallent,johnmc,[email protected] Abstract obtain top performance. To support tuning of such code, perfor- Modern programs frequently employ sophisticated modular de- mance analysis tools must pinpoint context-sensitive inefficiencies signs. As a result, performance problems cannot be identified from in fully optimized applications. costs attributed to routines in isolation; understanding code perfor- Several contemporary performance tools measure and attribute mance requires information about a routine’s calling context. Ex- execution costs to calling context in some form. However, when isting performance tools fall short in this respect. Prior strategies applied to fully optimized applications, existing tools fall short for attributing context-sensitive performance at the source level ei- for two reasons. First, current calling context measurement tech- ther compromise measurement accuracy, remain too close to the bi- niques are unacceptable because they either significantly perturb nary, or require custom compilers. To understand the performance program optimization and execution with instrumentation, or rely of fully optimized modular code, we developed two novel binary on compiler-based information that is sometimes inaccurate or un- analysis techniques: 1) on-the-fly analysis of optimized machine available, which causes failures while gathering certain calling con- code to enable minimally intrusive and accurate attribution of costs texts. Second, by inlining procedures and transforming loops, op- to dynamic calling contexts; and 2) post-mortem analysis of op- timizing compilers introduce a significant semantic gap between timized machine code and its debugging sections to recover its the binary and source code. Thus, prior strategies for attributing program structure and reconstruct a mapping back to its source context-sensitive performance at the source level either compro- code. By combining the recovered static program structure with mise measurement accuracy or remain too close to the object code. dynamic calling context information, we can accurately attribute To clarify, we consider the capabilities of some popular tools performance metrics to calling contexts, procedures, loops, and in- using three related categories: calling context representation, mea- lined instances of procedures. We demonstrate that the fusion of surement technique and attribution technique. this information provides unique insight into the performance of Calling context representation. Performance tools typically complex modular codes. This work is implemented in the HPC- attribute performance metrics to calling context using a call graph 1 or call path profile. Two widely-used tools that collect call graph TOOLKIT performance tools. profiles are gprof [11] and Intel’s VTune [15]. A call graph profile Categories and Subject Descriptors C.4 [Performance of sys- consists of a node for each procedure and a set of directed edges tems]: Measurement techniques, Performance attributes. between nodes. An edge exists from node p to node q if p calls q. To represent performance measurements, edges and nodes are General Terms Performance, Measurement, Algorithms. weighted with metrics. Call graph profiles are often insufficient for modular applications because a procedure p that appears on Keywords Binary analysis, Call path profiling, Static analysis, multiple distinct paths is represented with one node, resulting in Performance tools, HPCTOOLKIT. shared paths and cycles. Consequently, with a call graph profile it is in general not possible to assign costs to p’s full calling context, 1. Introduction or even to long portions of it. To remove this imprecision, a call path profile [12] represents the full calling context of p as the path Modern programs frequently employ sophisticated modular de- of calls from the program’s entry point to p. Call path profiling is signs that exploit object-oriented abstractions and generics. Com- necessary to fully understand the performance of modular codes. position of C++ algorithm and data structure templates typically Measurement technique. There are two basic approaches for yields loop nests spread across multiple levels of routines. To im- obtaining calling context profiles: instrumentation and statistical prove the performance of such codes, compilers inline routines and sampling. Instrumentation-based tools use one of three principal optimize loops. However, careful hand-tuning is often necessary to instrumentation techniques. Tools such as Tau [26] use source code 1 instrumentation to insert special profiling code into the source pro- HPCTOOLKIT is an open-source suite of performance tools available gram before compilation. In contrast, VTune [15] uses static bi- from http://hpctoolkit.org. nary instrumentation to augment application binaries with profiling code. (gprof’s [11] instrumentation, though traditionally inserted by a compiler, is effectively in this category.) The third technique Permission to make digital or hard copies of all or part of this work for personal or is dynamic binary instrumentation. classroom use is granted without fee provided that copies are not made or distributed While source-level instrumentors collect measurements that are for profit or commercial advantage and that copies bear this notice and the full citation easily mapped to source code, their instrumentation can interfere on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. with compiler optimizations such as inlining and loop transforma- PLDI’09, June 15–20, 2009, Dublin, Ireland. tions. As a result, measurements using source-level instrumenta- Copyright © 2009 ACM 978-1-60558-392-1/09/06. $5.00 tion may not accurately reflect the performance of fully optimized code [28]. Binary instrumentors may also compromise optimiza- Much prior work on loop attribution either compromises mea- tion. For example, in some compilers gprof-instrumented code surement accuracy by relying on instrumentation [22, 26] or is cannot be fully optimized. based on context-less measurement [19]. A few sampling-based An important problem with both source and static binary in- call path profilers [2, 14, 22] identify loops, but at the binary level. strumentation is that they require recompilation or binary rewriting Moseley et al. [22] describe a sampling-based profiler (relying of a program and all its libraries. This requirement poses a sig- on unwind information) that additionally constructs a dynamic nificant inconvenience for large, complex applications. More criti- loop/call graph by placing loops within a call graph. However, cally, the need to see the whole program before run time can lead to by not accounting for loop or procedure transformations, this tool ‘blind spots,’ i.e., portions of the execution that are systematically attributes performance only to binary-level loops and procedures. excluded from measurement. For instance, source instrumentation Also, by using a dynamic loop/call graph, it is not possible to un- fails to measure any portion of the application for which source derstand the performance of procedures and loops in their full call- code is unavailable; this frequently includes critical system, math ing context. and communication libraries. For Fortran programs, this approach can also fail to associate costs with intrinsic functions or compiler- To understand the performance of modular programs, as part of inserted array copies. Static binary instrumentation is unable to the HPCTOOLKIT performance tools we built hpcrun, a call path cope with shared libraries dynamically loaded during execution. profiler that measures and attributes execution costs of unmodified, The third approach, dynamic binary instrumentation, supports fully optimized executables to their full calling context, as well fully optimized binaries and avoids blind spots by inserting instru- as loops and inlined code. Achieving this result required novel mentation in the executing application [4]. Intel’s recently-released solutions to three problems: Performance Tuning Utility (PTU) [14], includes a call graph pro- filer that adopts this approach by using Pin [18]. However, dynamic 1. To measure dynamic calling contexts, we developed a context- instrumentation remains susceptible to systematic measurement er- free on-line binary analysis for locating procedure bounds and ror because of instrumentation overhead. computing unwind information. We show its effectiveness on Indeed, all three instrumentation approaches suffer in two dis- applications in the SPEC CPU2006 suite compiled with Intel, tinct ways from overhead. First, instrumentation dilates total execu- Portland Group and PathScale compilers using peak optimiza- tion time, sometimes enough to preclude analysis of large produc- tion. tion runs or force users to a priori introduce blind spots via selec- 2. To attribute performance to user-level source code, we devel- tive instrumentation. For example, because of an average slowdown oped a novel post-mortem analysis of the optimized object code factor of 8, VTune requires users to limit measurement to so-called and its debugging sections to recover its program structure and ‘modules of interest’ [15]. Moreover, overhead is even more acute reconstruct a mapping back to its source code. The ability to if loops are instrumented. A recent Pin-based
Recommended publications
  • Executable Code Is Not the Proper Subject of Copyright Law a Retrospective Criticism of Technical and Legal Naivete in the Apple V
    Executable Code is Not the Proper Subject of Copyright Law A retrospective criticism of technical and legal naivete in the Apple V. Franklin case Matthew M. Swann, Clark S. Turner, Ph.D., Department of Computer Science Cal Poly State University November 18, 2004 Abstract: Copyright was created by government for a purpose. Its purpose was to be an incentive to produce and disseminate new and useful knowledge to society. Source code is written to express its underlying ideas and is clearly included as a copyrightable artifact. However, since Apple v. Franklin, copyright has been extended to protect an opaque software executable that does not express its underlying ideas. Common commercial practice involves keeping the source code secret, hiding any innovative ideas expressed there, while copyrighting the executable, where the underlying ideas are not exposed. By examining copyright’s historical heritage we can determine whether software copyright for an opaque artifact upholds the bargain between authors and society as intended by our Founding Fathers. This paper first describes the origins of copyright, the nature of software, and the unique problems involved. It then determines whether current copyright protection for the opaque executable realizes the economic model underpinning copyright law. Having found the current legal interpretation insufficient to protect software without compromising its principles, we suggest new legislation which would respect the philosophy on which copyright in this nation was founded. Table of Contents INTRODUCTION................................................................................................. 1 THE ORIGIN OF COPYRIGHT ........................................................................... 1 The Idea is Born 1 A New Beginning 2 The Social Bargain 3 Copyright and the Constitution 4 THE BASICS OF SOFTWARE ..........................................................................
    [Show full text]
  • The LLVM Instruction Set and Compilation Strategy
    The LLVM Instruction Set and Compilation Strategy Chris Lattner Vikram Adve University of Illinois at Urbana-Champaign lattner,vadve ¡ @cs.uiuc.edu Abstract This document introduces the LLVM compiler infrastructure and instruction set, a simple approach that enables sophisticated code transformations at link time, runtime, and in the field. It is a pragmatic approach to compilation, interfering with programmers and tools as little as possible, while still retaining extensive high-level information from source-level compilers for later stages of an application’s lifetime. We describe the LLVM instruction set, the design of the LLVM system, and some of its key components. 1 Introduction Modern programming languages and software practices aim to support more reliable, flexible, and powerful software applications, increase programmer productivity, and provide higher level semantic information to the compiler. Un- fortunately, traditional approaches to compilation either fail to extract sufficient performance from the program (by not using interprocedural analysis or profile information) or interfere with the build process substantially (by requiring build scripts to be modified for either profiling or interprocedural optimization). Furthermore, they do not support optimization either at runtime or after an application has been installed at an end-user’s site, when the most relevant information about actual usage patterns would be available. The LLVM Compilation Strategy is designed to enable effective multi-stage optimization (at compile-time, link-time, runtime, and offline) and more effective profile-driven optimization, and to do so without changes to the traditional build process or programmer intervention. LLVM (Low Level Virtual Machine) is a compilation strategy that uses a low-level virtual instruction set with rich type information as a common code representation for all phases of compilation.
    [Show full text]
  • Application Performance Optimization
    Application Performance Optimization By Börje Lindh - Sun Microsystems AB, Sweden Sun BluePrints™ OnLine - March 2002 http://www.sun.com/blueprints Sun Microsystems, Inc. 4150 Network Circle Santa Clara, CA 95045 USA 650 960-1300 Part No.: 816-4529-10 Revision 1.3, 02/25/02 Edition: March 2002 Copyright 2002 Sun Microsystems, Inc. 4130 Network Circle, Santa Clara, California 95045 U.S.A. All rights reserved. This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers. Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd. Sun, Sun Microsystems, the Sun logo, Sun BluePrints, Sun Enterprise, Sun HPC ClusterTools, Forte, Java, Prism, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the US and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. The OPEN LOOK and Sun™ Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry.
    [Show full text]
  • Architectural Support for Scripting Languages
    Architectural Support for Scripting Languages By Dibakar Gope A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical and Computer Engineering) at the UNIVERSITY OF WISCONSIN–MADISON 2017 Date of final oral examination: 6/7/2017 The dissertation is approved by the following members of the Final Oral Committee: Mikko H. Lipasti, Professor, Electrical and Computer Engineering Gurindar S. Sohi, Professor, Computer Sciences Parameswaran Ramanathan, Professor, Electrical and Computer Engineering Jing Li, Assistant Professor, Electrical and Computer Engineering Aws Albarghouthi, Assistant Professor, Computer Sciences © Copyright by Dibakar Gope 2017 All Rights Reserved i This thesis is dedicated to my parents, Monoranjan Gope and Sati Gope. ii acknowledgments First and foremost, I would like to thank my parents, Sri Monoranjan Gope, and Smt. Sati Gope for their unwavering support and encouragement throughout my doctoral studies which I believe to be the single most important contribution towards achieving my goal of receiving a Ph.D. Second, I would like to express my deepest gratitude to my advisor Prof. Mikko Lipasti for his mentorship and continuous support throughout the course of my graduate studies. I am extremely grateful to him for guiding me with such dedication and consideration and never failing to pay attention to any details of my work. His insights, encouragement, and overall optimism have been instrumental in organizing my otherwise vague ideas into some meaningful contributions in this thesis. This thesis would never have been accomplished without his technical and editorial advice. I find myself fortunate to have met and had the opportunity to work with such an all-around nice person in addition to being a great professor.
    [Show full text]
  • CDC Build System Guide
    CDC Build System Guide Java™ Platform, Micro Edition Connected Device Configuration, Version 1.1.2 Foundation Profile, Version 1.1.2 Optimized Implementation Sun Microsystems, Inc. www.sun.com December 2008 Copyright © 2008 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. Sun Microsystems, Inc. has intellectual property rights relating to technology embodied in the product that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more of the U.S. patents listed at http://www.sun.com/patents and one or more additional patents or pending patent applications in the U.S. and in other countries. U.S. Government Rights - Commercial software. Government users are subject to the Sun Microsystems, Inc. standard license agreement and applicable provisions of the FAR and its supplements. This distribution may include materials developed by third parties. Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and in other countries, exclusively licensed through X/Open Company, Ltd. Sun, Sun Microsystems, the Sun logo, Java, Solaris and HotSpot are trademarks or registered trademarks of Sun Microsystems, Inc. or its subsidiaries in the United States and other countries. The Adobe logo is a registered trademark of Adobe Systems, Incorporated. Products covered by and information contained in this service manual are controlled by U.S. Export Control laws and may be subject to the export or import laws in other countries. Nuclear, missile, chemical biological weapons or nuclear maritime end uses or end users, whether direct or indirect, are strictly prohibited.
    [Show full text]
  • Memory Subsystem Profiling with the Sun Studio Performance Analyzer
    Memory Subsystem Profiling with the Sun Studio Performance Analyzer CScADS, July 20, 2009 Marty Itzkowitz, Analyzer Project Lead Sun Microsystems Inc. [email protected] Outline • Memory performance of applications > The Sun Studio Performance Analyzer • Measuring memory subsystem performance > Four techniques, each building on the previous ones – First, clock-profiling – Next, HW counter profiling of instructions – Dive deeper into dataspace profiling – Dive still deeper into machine profiling – What the machine (as opposed to the application) sees > Later techniques needed if earlier ones don't fix the problems • Possible future directions MSI Memory Subsystem Profiling with the Sun Studio Performance Analyzer 6/30/09 2 No Comment MSI Memory Subsystem Profiling with the Sun Studio Performance Analyzer 6/30/09 3 The Message • Memory performance is crucial to application performance > And getting more so with time • Memory performance is hard to understand > Memory subsystems are very complex – All components matter > HW techniques to hide latency can hide causes • Memory performance tuning is an art > We're trying to make it a science • The Performance Analyzer is a powerful tool: > To capture memory performance data > To explore its causes MSI Memory Subsystem Profiling with the Sun Studio Performance Analyzer 6/30/09 4 Memory Performance of Applications • Operations take place in registers > All data must be loaded and stored; latency matters • A load is a load is a load, but > Hit in L1 cache takes 1 clock > Miss in L1, hit in
    [Show full text]
  • Using Ld the GNU Linker
    Using ld The GNU linker ld version 2 January 1994 Steve Chamberlain Cygnus Support Cygnus Support [email protected], [email protected] Using LD, the GNU linker Edited by Jeffrey Osier (jeff[email protected]) Copyright c 1991, 92, 93, 94, 95, 96, 97, 1998 Free Software Foundation, Inc. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided also that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another lan- guage, under the above conditions for modified versions. Chapter 1: Overview 1 1 Overview ld combines a number of object and archive files, relocates their data and ties up symbol references. Usually the last step in compiling a program is to run ld. ld accepts Linker Command Language files written in a superset of AT&T’s Link Editor Command Language syntax, to provide explicit and total control over the linking process. This version of ld uses the general purpose BFD libraries to operate on object files. This allows ld to read, combine, and write object files in many different formats—for example, COFF or a.out. Different formats may be linked together to produce any available kind of object file. See Chapter 5 [BFD], page 47, for more information. Aside from its flexibility, the gnu linker is more helpful than other linkers in providing diagnostic information.
    [Show full text]
  • Oracle Solaris Studio 12.2 Performance Analyzer MPI Tutorial
    Oracle® Solaris Studio 12.2: Performance Analyzer MPITutorial Part No: 820–5551 September 2010 Copyright © 2010, Oracle and/or its affiliates. All rights reserved. This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software or related software documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable: U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are “commercial computer software” or “commercial technical data” pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms setforth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065.
    [Show full text]
  • Oracle® Solaris Studio 12.4: Performance Analyzer Tutorials
    ® Oracle Solaris Studio 12.4: Performance Analyzer Tutorials Part No: E37087 December 2014 Copyright © 2014, Oracle and/or its affiliates. All rights reserved. This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable: U.S. GOVERNMENT END USERS. Oracle programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users are "commercial computer software" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, use, duplication, disclosure, modification, and adaptation of the programs, including any operating system, integrated software, any programs installed on the hardware, and/or documentation, shall be subject to license terms and license restrictions applicable to the programs. No other rights are granted to the U.S. Government. This software or hardware is developed for general use in a variety of information management applications.
    [Show full text]
  • Openmp Performance 2
    OPENMP PERFORMANCE 2 A common scenario..... “So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?”. Most of us have probably been here. Where did my performance go? It disappeared into overheads..... 3 The six (and a half) evils... • There are six main sources of overhead in OpenMP programs: • sequential code • idle threads • synchronisation • scheduling • communication • hardware resource contention • and another minor one: • compiler (non-)optimisation • Let’s take a look at each of them and discuss ways of avoiding them. 4 Sequential code • In OpenMP, all code outside parallel regions, or inside MASTER and SINGLE directives is sequential. • Time spent in sequential code will limit performance (that’s Amdahl’s Law). • If 20% of the original execution time is not parallelised, I can never get more that 5x speedup. • Need to find ways of parallelising it! 5 Idle threads • Some threads finish a piece of computation before others, and have to wait for others to catch up. • e.g. threads sit idle in a barrier at the end of a parallel loop or parallel region. Time 6 Avoiding load imbalance • It’s a parallel loop, experiment with different schedule kinds and chunksizes • can use SCHEDULE(RUNTIME) to avoid recompilation. • For more irregular computations, using tasks can be helpful • runtime takes care of the load balancing • Note that it’s not always safe to assume that two threads doing the same number of computations will take the same time.
    [Show full text]
  • Openmp 4.0 Support in Oracle Solaris Studio
    OpenMP 4.0 Support In Oracle Solaris Studio Ruud van der Pas Disnguished Engineer Architecture and Performance, SPARC Microelectronics SC’14 OpenMP BOF November 18, 2014 Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 1 Safe Harbor Statement The following is intended to outline our general product direcEon. It is intended for informaon purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or funcEonality, and should not be relied upon in making purchasing decisions. The development, release, and Eming of any features or funcEonality described for Oracle’s products remains at the sole discreEon of Oracle. Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 2 What Is Oracle Solaris Studio ? • Supported on Linux too • Compilers and tools, • It is actually a comprehensive so-ware suite, including: – C, C++ and Fortran compilers – Various tools (performance analyzer, thread analyzer, code analyzer, debugger, etc) • Plaorms supported – Processors: SPARC, x86 – Operang Systems: Solaris, Linux Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 3 hp://www.oracle.com/technetwork/ server-storage/solarisstudio/overview/index.html Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 4 OpenMP Specific Support/1 • Compiler – Various opEons and environment variables – Autoscoping – Compiler Commentary • General feature to inform about opEmizaons, but also specific to OpenMP • Performance Analyzer – OpenMP “states”, metrics, etc – Task and region specific informaon Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 5 Performance Analyzer - Profile Comparison Copyright © 2014, Oracle and/or its affiliates. All rights reserved. 6 A Comparison– More Detail Copyright © 2014, Oracle and/or its affiliates.
    [Show full text]
  • Optimizing Function Placement for Large-Scale Data-Center Applications
    Optimizing Function Placement for Large-Scale Data-Center Applications Guilherme Ottoni Bertrand Maher Facebook, Inc., USA ottoni,bertrand @fb.com { } Abstract While the large size and performance criticality of such Modern data-center applications often comprise a large applications make them good candidates for profile-guided amount of code, with substantial working sets, making them code-layout optimizations, these characteristics also im- good candidates for code-layout optimizations. Although pose scalability challenges to optimize these applications. recent work has evaluated the impact of profile-guided intra- Instrumentation-based profilers significantly slow down the module optimizations and some cross-module optimiza- applications, often making it impractical to gather accurate tions, no recent study has evaluated the benefit of function profiles from a production system. To simplify deployment, placement for such large-scale applications. In this paper, it is beneficial to have a system that can profile unmodi- we study the impact of function placement in the context fied binaries, running in production, and use these data for of a simple tool we created that uses sample-based profiling feedback-directed optimization. This is possible through the data. By using sample-based profiling, this methodology fol- use of sample-based profiling, which enables high-quality lows the same principle behind AutoFDO, i.e. using profil- profiles to be gathered with minimal operational complexity. ing data collected from unmodified binaries running in pro- This is the approach taken by tools such as AutoFDO [1], duction, which makes it applicable to large-scale binaries. and which we also follow in this work. Using this tool, we first evaluate the impact of the traditional The benefit of feedback-directed optimizations for some Pettis-Hansen (PH) function-placement algorithm on a set of data-center applications has been evaluated in some pre- widely deployed data-center applications.
    [Show full text]