Ball-Larus Path Profiling Across Multiple Loop Iterations
Total Page:16
File Type:pdf, Size:1020Kb
Ball-Larus Path Profiling Across Multiple Loop Iterations Daniele Cono D’Elia Camil Demetrescu Dept. of Computer, Control and Management Dept. of Computer, Control and Management Engineering Engineering Sapienza University of Rome Sapienza University of Rome tifac r t * A Comple t * te [email protected] [email protected] * n * A te is W E s A e n C l l L o D C S o * * c P u e m s E u O e e n v R t e o O d t a y * s E a * l u d a e t Abstract General Terms Algorithms, Measurement, Performance. Identifying the hottest paths in the control flow graph of Keywords Profiling, dynamic program analysis, instru- a routine can direct optimizations to portions of the code mentation. where most resources are consumed. This powerful method- ology, called path profiling, was introduced by Ball and Larus in the mid 90’s [4] and has received considerable at- 1. Introduction tention in the last 15 years for its practical relevance. A Path profiling is a powerful methodology for identifying per- shortcoming of the Ball-Larus technique was the inability formance bottlenecks in a program. The approach consists to profile cyclic paths, making it difficult to mine execu- of associating performance metrics, usually frequency coun- tion patterns that span multiple loop iterations. Previous re- ters, to paths in the control flow graph. Identifying hot paths sults, based on rather complex algorithms, have attempted can direct optimizations to portions of the code that could to circumvent this limitation at the price of significant per- yield significant speedups. For instance, trace scheduling can formance losses even for a small number of iterations. In improve performance by increasing instruction-level paral- this paper, we present a new approach to multi-iteration path lelism along frequently executed paths [13, 22]. The sem- profiling, based on data structures built on top of the original inal paper by Ball and Larus [4] introduced a simple and Ball-Larus numbering technique. Our approach allows the elegant path profiling technique. The main idea was to im- profiling of all executed paths obtained as a concatenation of plicitly number all possible acyclic paths in the control flow up to k Ball-Larus acyclic paths, where k is a user-defined graph so that each path is associated with a unique compact parameter. We provide examples showing that this method path identifier (ID). The authors showed that path IDs can can reveal optimization opportunities that acyclic-path pro- be efficiently generated at runtime and can be used to update filing would miss. An extensive experimental investigation a table of frequency counters. Although in general the num- on a large variety of Java benchmarks on the Jikes RVM ber of acyclic paths may grow exponentially with the graph shows that our approach can be even faster than Ball-Larus size, in typical control flow graphs this number is usually due to fewer operations on smaller hash tables, producing small enough to fit in current machine word-sizes, making compact representations of cyclic paths even for large val- this approach very effective in practice. ues of k. While the original Ball-Larus approach was restricted to Categories and Subject Descriptors C.4 [Performance of acyclic paths obtained by cutting paths at loop back edges, Systems]: Measurement Techniques; D.2.2 [Software Engi- profiling paths that span consecutive loop iterations is a de- neering]: Tools and Techniques—programmer workbench; sirable, yet difficult, task that can yield better optimization D.2.5 [Software Engineering]: Testing and Debugging— opportunities. Consider, for instance, the problem of elim- diagnostics, tracing inating redundant executions of instructions, such as loads and stores [7], conditional jumps [6], expressions [9], and array bounds checks [8]. A typical situation is that the same instruction is redundantly executed at each loop iteration, Permission to make digital or hard copies of all or part of this work for personal or which is particularly common for arithmetic expressions and classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation load operations [7, 9]. To identify such redundancies, paths on the first page. Copyrights for components of this work owned by others than ACM that extend across loop back edges need to be profiled. An- must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a other application is trace scheduling [22]: if a frequently exe- fee. Request permissions from [email protected]. cuted cyclic path is found, compilers may unroll the loop and OOPSLA ’13, October 29–31, 2013, Indianapolis, Indiana, USA. Copyright c 2013 ACM 978-1-4503-2374-1/13/10. $15.00. perform trace scheduling on the unrolled portion of code. http://dx.doi.org/10.1145/2509136.2509521 Tallam et al. [20] provide a comprehensive discussion of the 373 benefits of multi-iteration path profiling. We provide further Ball-Larus path Program numbering and examples in Section 3. tracing framework Different authors have proposed techniques to profile cyclic Static Instrumented paths by modifying the original Ball-Larus path numbering analysis program scheme in order to identify paths that extend across multi- (probes added) ple loop iterations [17, 19, 20]. Unfortunately, all known so- Control Execution lutions require rather complex algorithms that incur severe flow graph emit r performance overheads even for short cyclic paths, leaving Instrumentation the interesting open question of finding simpler and more efficient alternative methods. Stream of Ball-Larus path IDs r Contributions. In this paper, we present a novel approach to multi-iteration path profiling, which provides substan- count[r]++ Forest tially better performance than previous techniques even for construction long paths. Our method stems from the observation that any BL path k-iteration cyclic execution path in the control flow graph of a routine frequencies path forest can be described as a concatenation of Ball-Larus acyclic BL path profiler k-iter. path profiler paths (BL paths). In particular, we show how to accurately profile all executed paths obtained as a concatenation of up to k BL paths, where k is a user-defined parameter. The main Figure 1: Overview of our approach: classical Ball-Larus profiling results of this paper can be summarized as follows: versus k-iteration path profiling, cast in a common framework. 1. we reduce multi-iteration path profiling to the problem of counting n-grams, i.e., contiguous sequences of n Techniques. Differently from previous approaches [17, 19, items from a given sequence. To compactly represent 20], which rely on modifying the Ball-Larus path numbering collected profiles, we organize them in a forest of prefix to cope with cycles, our method does not require any modifi- trees (or tries) [14] of depth up to k, where each node cation of the original numbering technique described in [4]. is labeled with a BL path, and paths in a tree represent The main idea behind our approach is to fully decouple the concatenations of BL paths that were actually executed task of tracing Ball-Larus acyclic paths at run time from the by the program, along with their frequencies. We also task of concatenating and storing them in a data structure present an efficient construction algorithm based on a to keep track of multiple iterations. The decoupling is per- variant of the k-SF data structure presented in [3]; formed by letting the Ball-Larus profiling algorithm issue 2. we discuss examples of profile-driven optimizations that a stream of BL path IDs (see Figure 1), where each ID is can be achieved using our approach. In particular, we generated when a back edge in the control flow graph is tra- show how the profiles collected with our method can help versed or the current procedure is abandoned. As a conse- to restructure the code of an image processing applica- quence of this modular approach, our method can be imple- tion, achieving speedups up to about 30% on an image mented on top of existing Ball-Larus path profilers, making filter based on convolution kernels; it simpler to code and maintain. Our profiler introduces a technical shift based on a 3. to evaluate the effectiveness of our ideas, we developed smooth blend of the path numbering methods used in in- a Java performance profiler in the Jikes Research Virtual traprocedural path profiling with data structure-based tech- Machine [1]. To make fair performance comparisons with niques typically adopted in interprocedural profiling, such state-of-the-art previous profilers, we built our code on as calling-context profiling. The key to the efficiency of our top of the BLPP profiler developed by Bond [10, 15], approach is to replace costly hash table accesses, which are which provides an efficient implementation of the Ball- required by the Ball-Larus algorithm to maintain path coun- Larus acyclic-path profiling technique. Our Java code ters for larger programs, with substantially faster operations was endorsed by the OOPSLA 2013 Artifact Evaluation on trees. With this idea, we can profile paths that extend Committee and is available on the Jikes RVM Research across many loop iterations in comparable time, if not faster, Archive. We also provide a C implementation of our than profiling acyclic paths on a large variety of industry- profiler for manual source code instrumentation; strength benchmarks. 4. we performed a broad experimental study on a large suite of prominent Java benchmarks on the Jikes Research Organization of the paper.