Portable Section-level Tuning of Parallelized Applications

Dheya Mustafa and Rudolf Eigenmann
Electrical and Computer Engineering, Purdue University, West Lafayette, USA
Email: {dmustaf, eigenman}@purdue.edu

Abstract—Automatic parallelization of sequential programs combined with tuning is an alternative to manual parallelization. This method has the potential to substantially increase productivity and is thus of critical importance for exploiting the increased computational power of today's multicores. A key difficulty is that parallelizing compilers are generally unable to estimate the performance impact of an optimization on a whole program or a program section at compile time; hence, the ultimate performance decision today rests with the developer. Building an autotuning system to remedy this situation is not a trivial task. This work presents a portable empirical autotuning system that operates at program-section granularity and partitions the compiler options into groups that can be tuned independently.

To our knowledge, this is the first approach delivering an autoparallelization system that ensures performance improvements for nearly all programs, eliminating the users' need to experiment with such tools to strive for highest application performance.

I. INTRODUCTION

Despite the enormous potential and importance of automatic parallelization for increasing software productivity in today's multicore era, current compilers do not use this capability as their default behavior. As we will show, both state-of-the-art commercial and advanced research compilers, while capable of improving the performance of many programs, may also degrade the performance of individual codes. This behavior would be unacceptable as the default of a compiler. The primary reason is that there is insufficient knowledge at compile time to make adequate optimization decisions. Even though many advanced parallelization techniques exist today, including several runtime decision methods, three decades of autoparallelization research have not yet delivered the needed, consistent compilation tools for multicores [1]–[5]. In this paper, we will show that by building on advances in automatic tuning, and by further improving this technology, compilers can deliver automatically parallelized applications that ensure performance greater than or equal to their source programs. Our results will demonstrate that this can be done with current commercial as well as research compilers.

We advance the employed automatic tuning methods in two ways. First, our new techniques support the optimization of individual program sections. Doing so is both important and challenging. We will show that section-level tuning can outperform whole-program tuning by up to 82%. The challenge is in managing the drastic growth in the search space of optimization variants. Empirical tuning methods work by evaluating many optimization variants and choosing the one that performs best at runtime [6]–[10]. During this process, they need to consider interactions between the optimization variants. This issue is already severe in existing, whole-program tuning methods. The effect of an optimization may depend significantly on the presence of another (e.g., unrolling and vectorization influence each other). Therefore, a large number of combinations of program optimizations needs to be evaluated. In our section-based tuning system, interactions also occur between optimizations applied to different program sections. Therefore, combinations of program sections in different optimization variants need to be explored, further growing the already large search space. We use a key observation to manage this additional growth: interactions between program sections happen primarily among near neighbors. We will consider potential interactions within windows of program sections, but not across them. We will show that the window size can be kept small, significantly limiting the increase in program optimization time of our section-level tuning method.

Second, we introduce a portable tuning framework and demonstrate its application to all parallelizers that we could obtain. Creating portable tuning systems is critical, as the development effort and thus cost is large. Re-usability in tuning has not been addressed so far; the many proposed systems and techniques use substantially different frameworks and components [11]–[16]. In particular, different approaches use specialized optimization search spaces, which are custom-built and navigated with specialized tuning engines. Our approach is two-fold. We use a framework that conceptualizes the essential elements of empirical optimization, and we employ a tuning definition system, through which the user (typically the compiler developer) customizes the framework for the task at hand. We will evaluate the implementation of this approach for a range of parallelizing compilers and show that the customization efforts are small.

This paper makes the following contributions.

• We introduce a novel, fast tuning algorithm that is able to optimize individual program sections. To keep tuning time low, it partitions the program into windows; it fully accounts for the interaction of optimizations of different sections within windows, but not across windows. Independent optimization groups and code sections are tuned simultaneously. We will show that section-based tuning improves performance significantly, that a small window size is sufficient, and that tuning times are reasonable.

• We present a portable, empirical tuning framework for automatic parallelization, which we implement and evaluate on our Cetus source-to-source translator [3] as well as on the OpenUH [28], Rose [17], PGI [18], and ICC [19] compilers. We will show that our tuning system improves the performance of all these compilers, and that the porting effort to each compiler is small.

• Applying our tuning method turns all compilers into tools where automatic parallelization can be used as the default behavior. We will show that the performance of all programs in our test suite (the NAS Parallel Benchmarks) for all compilers mentioned above is at least as good as the performance of the serial programs. Hence, users can benefit from the performance gains of automatic parallelization without risk of performance degradation. This eliminates the need for users to experiment with such tools. We will also show that, relative to the hand-parallelized programs, tuned automatic parallelization gains significant performance in half of our benchmark suite and in one case exceeds the manually optimized code.

This work was supported, in part, by the National Science Foundation under grants No. CNS-0720471, 0707931-CNS, 0833115-CCF, and 0916817-CCF.

Fig. 1. High-level model of the automatic tuning system. Search Space Navigation picks a next program version to be tried; Version Generation compiles this version; and Empirical Measurement evaluates its performance at runtime. The search space is generated and pruned based on the specifications from the tuning definition file. (The diagram shows the three components Search Space Navigation, Version Generation, and Empirical Measurement, configured by the System Specifications in the Tuning Definition File.)
The remainder of the paper is organized as follows: The next section provides a system overview, including the tuning model, system definition process, tuning algorithm, and search space pruning method. Section III applies this tuning framework to five compilers, Cetus, ICC, PGI, Rose, and OpenUH, demonstrating its portability. Section IV evaluates our work using the NAS Parallel Benchmarks. We discuss related work in Section V. Section VI draws conclusions and discusses future work.

II. PORTABLE TUNING SYSTEM

One aim of creating a portable tuning system is to make it easy for compiler developers to plug their existing compiler into a framework that applies empirical tuning. In this way, compilers can be enhanced to find the best combination of their optimization capabilities, using an offline process that is similar to profile-based compilation. The system should allow the compiler developer to define compiler make information and all possible optimization options. From this information, the tuning system determines the best-performing program variant for each code section. Recall a key challenge in doing so: because of the interactions of compiler optimizations within and across program sections, the sheer number of possible optimization combinations would make a brute-force approach infeasible, leading to extremely long tuning times. This paper pursues an approach that exploits locality of optimization interactions [20] – distant program sections do not influence each other significantly, allowing substantial pruning of the number of optimization variants that need to be explored.

The focus of this paper is on developing such a portable tuning system for improving automatic parallelization. To this end, the following introduces a generic tuning model and prototype that are applicable to a wide variety of scenarios. We will demonstrate the portability to two commercial and three research parallelizing compilers.

A. Tuning Model

Figure 1 shows the high-level model of our portable tuning system. Program tuning proceeds in three steps. Search Space Navigation iteratively traverses the space of all feasible combinations of program optimizations, until a suitable "best" variant has been found. In each iteration, the current combination is passed on to the Version Generator – typically a compiler – which produces the code. Next, the performance gets evaluated by the Empirical Measurement component. The performance results are fed back to the Search Space Navigator, which decides on a next iteration or terminates tuning.

This model can be configured through a System Specification by means of a Tuning Definition File (TDF), which defines all optimization options. The specifications include compiler options for global as well as for section-level optimizations and allow the tuning system developer to indicate which optimizations may interact with each other. In defining these interactions, the tuning developer can be conservative; independent optimizations will speed up the tuning process, whereby specification inaccuracies may affect the performance outcome but not correctness. The specification also expresses strategies for partitioning programs into windows (typically one to a few loops) that will be optimized independently. Available strategies are the choice of a fixed or an adaptive window size. A window size of one would amount to ignoring all interactions between program sections. A window size of ∞ would consider all interactions, even between distant sections, leading to long tuning times. We will show that a small window size is sufficient. The TDF also defines the make file information for compilers and runtime environments. The portability of the proposed model lies in this configurability and in the applicability to many tuning scenarios of compilers and other areas.
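The three components of Figure 1 map onto a simple driver loop. The sketch below is ours and purely illustrative; the helpers `generate_version` and `run_and_measure` are hypothetical stand-ins for the compiler invocation and the instrumented program run, and it assumes the simplest (exhaustive) navigation strategy:

```python
# Minimal sketch of the Figure 1 tuning loop (illustrative only).

def tune(search_space, generate_version, run_and_measure):
    """search_space     -- iterable of optimization variants
    generate_version -- Version Generation: variant -> executable
    run_and_measure  -- Empirical Measurement: executable -> seconds
    """
    best_variant, best_time = None, float("inf")
    for variant in search_space:          # Search Space Navigation
        exe = generate_version(variant)   # Version Generation
        t = run_and_measure(exe)          # Empirical Measurement
        if t < best_time:                 # result fed back to the navigator
            best_variant, best_time = variant, t
    return best_variant
```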
```
# this line is a comment
# a new section begins with a ! sign
!Loop-Level Optimization Options
loop_tile 1 tile_size [4:256:*=4]
loop_unroll 1 unroll_size [2:16:*=2]
loop_parallelize 0
vec 1 vec_threshold [50:100:+=10]
!Program Level Optimization Options
reduction 0
!Options' Dependencies
loop_tile loop_parallelize
!Windowing Strategy
fixed 3
!Environment Variables
OMP_NUM_THREADS [1:8]
!Make Definition
```

Fig. 2. Example of a Tuning Definition File (TDF). The TDF is composed of several sections for specific sets of tuning options and their formats. Each line describes one optimization technique. Dependencies are specified by indicating the involved pair of techniques.

The Automatic Search Space Builder processes the TDF specifications and creates the search space of optimization variants. The raw search space would be the full factorial of all tuning option values and program sections; pruning is crucial. The key information enabling pruning is the program windows (or p-windows), splitting the code into independently tunable sections, and the declared optimization dependences, allowing groups of non-interacting optimizations to be formed (called o-windows). The Search Space Builder creates an independent search space for each p-window. Across o-windows, only the sum of all optimization variants needs to be explored, rather than their Cartesian product.

For a program with $N$ loops and $M$ binary optimization options, the exhaustive search space size is $2^{N+M}$. Our tuning system, using p-window size $n$, prunes this space to size $2^n \cdot OptSpace$, where $2^n$ is independent of the program size (i.e., 8 in our prototype), and $OptSpace$ is the size of the optimization search space. To define $OptSpace$, let us assume that the tuning system has $L$ o-windows, with the $i$-th o-window of size $m_i$. Then $OptSpace = \sum_{i=1}^{L} 2^{m_i}$. Knowing that $2^M = \prod_{i=1}^{L} 2^{m_i}$, it is clear that $OptSpace$ is less than $2^M$, as it represents only the number of combinations of interacting techniques. In practice, the resulting search space is small enough to allow efficient navigation.

An example shows the importance of pruning: a program that has 6 loops with two binary tuning options (loop-parallelization and function-inlining) and two discrete tuning options (loop-tiling and permutation) with 4 values each would have a raw search space of $(2^6-1)(2^2)(4^2) = 4032$. With p-window size 3, only 448 variants are navigated during tuning in the worst case, which is about 11% of the raw search space. For larger applications, the raw search space grows exponentially, while the number of variants being navigated (i.e., the tuning time) remains constant.
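The arithmetic of this example is easy to verify mechanically; the following few lines are our illustration, checking both the example and the sum-versus-product effect of o-windows:

```python
# Checking the arithmetic of the pruning example (our illustration only).

raw = (2**6 - 1) * (2**2) * (4**2)       # whole program: 6 loops
windowed = (2**3 - 1) * (2**2) * (4**2)  # worst-case single p-window of 3 loops
print(raw, windowed, windowed / raw)     # 4032 448 0.111... (about 11%)

# OptSpace: for L o-windows of sizes m_1..m_L, the subspaces add up
# instead of multiplying out.
def opt_space(m):
    return sum(2**m_i for m_i in m)

print(opt_space([2, 1]), 2**3)  # 6, versus 8 = 2^M for M = 3 binary options
```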
The Search Space Navigator traverses all independent search spaces concurrently and determines, for each p-window, a next optimization variant that will be empirically evaluated. It calls the Version Generator to create the needed program code. Once a final "best" program version has been found, the Search Space Navigator calls the Version Generator one more time to create the final combination of all best-performing p-windows. Note that this final version may be different from all program variants made during the tuning process, as it combines the individually best-performing program sections.

The Version Generator creates the program variant as requested by the Search Space Navigator. The necessary make information for the needed compilation steps and the mapping of optimization variants to command line flags is taken from the TDF. Not all compilers are able to apply optimizations to specific program sections. Our model includes a postprocessing step, allowing compilers that generate OpenMP output to enable or disable individual parallel loops. This is one of the most important section-level tuning decisions. Version Generation also instruments each p-window with timing calls (or, in general, performance counters). For compilers that do not provide such instrumentation, the postprocessor provides this feature as well.

The Empirical Measurement step executes the program variant and evaluates its performance. It sets runtime parameters either as fixed values (given by TDF make information) or as tunable parameters chosen by the Search Space Navigator (e.g., the number of threads). Independent measurements are taken for each of the n p-windows. Thus, in a single program run, an optimization variant is evaluated for n program sections, leading to an additional reduction of the tuning time. The n measurements are passed on to the Search Space Navigator for deciding on the next iteration of the tuning process.

The following section presents our prototype tuning system, based on the described model.

B. Tuning System Prototype

This section describes the implementation of the above model in our portable tuning system. We discuss the TDF, Search Space Generation and Navigation, Version Generation, and Empirical Measurement.

1) TDF and Search Space Generation: Figure 2 shows an example TDF. The TDF language contains the following categories of information:

• Category A defines binary and discrete optimization techniques (also known as tuning options, tunables, parameters, control points, or degrees of freedom in related work). The optimizations may be applicable at the section level or the program level. All optimization options have discrete integer values. Most common are binary options, switching a technique on or off. An example of a multi-valued option is loop unrolling, which may have a parameter of 2, 4, 8, or 16. Optimizations that are prerequisites for others are combined into one multi-valued option. For example, if the on/off option forward-substitution needs to be on for the option substitute-only-integers/substitute-all to have an effect, the two are combined into a three-valued option off/substitute-integers-only/substitute-all. Techniques that belong to this category are represented by the loop-level/program-level optimization options sections in the TDF.
• Category B defines the possible interactions among optimization techniques as a list of pairs. This category is represented by the Options' Dependencies section in the TDF.

• Category C describes the windowing strategy used to partition a program into sections and tune each section individually. It can be either fixed or adaptive. This category is represented by the Windowing Strategy section in the TDF.

• Category D includes two groups of options:
  – tunable environment variables, such as the number of threads used in the machine;
  – make definitions and fixed back-end compiler options, as well as fixed environment variables; we use a generic parametrized MakeFile.
This category is represented by the Environment Variables and Make Definition sections in the TDF.

The Automatic Search Space Builder processes this information. Its key algorithm prunes the potentially combinatorial number of optimization variants to a manageable size. To this end, it partitions the search space into windows that will be tuned individually. The partitioning happens for both the program space (p-windows) and the optimization space (o-windows). Our tuning prototype currently supports a fixed p-window size. We will show that, in practice, a small size of up to three loops is sufficient. Subroutines are boundaries for windows; thus the partitioned programs may include p-windows of one, two, and three loops. O-windows are formed based on the TDF-provided dependence information for the optimizations, as described below. The optimizations within an o-window are dependent; all combinations of their options need to be explored. The optimizations of different o-windows are independent; the combined search space of two o-windows is only the sum of the individual search spaces, not their product.

Figure 3 shows the Automatic Search Space Builder. For simplicity, we discuss the two partitioning steps separately; in the implementation, the two steps form a single pruning algorithm.

Fig. 3. The Automatic Search Space Builder reads the TDF information provided by the tuning developer and generates a pruned search space. Pruning exploits the fact that distant program sections have little influence on each other and that non-interacting optimizations, per the TDF, can be evaluated independently. (The diagram shows an Optimization Partitioner generating the optimization space from the TDF's techniques and a Program Partitioner generating the program space, with the two merged into the final search space.)

Algorithm 1 – Generation of the Optimization Space. After partitioning options into o-windows based on dependencies, each o-window has its own subspace that can be navigated independently of all other subspaces.

```
GenOptimizationSpace()
input:  tuning options (V1, V2, ..., Vn)
        tuning option dependencies (Vi, Vj), i, j in [1:n]
output: set of optimization subspaces
begin:
  Map options to vertices
  Map dependencies to edges
  Generate a forest graph G
  Partition G into connected components
  for each connected component Gcc of G do
    Start a new optimization subspace S-Gcc
    Color the vertices in Gcc
    for each vertex V in Gcc do
      for each argument associated with V do
        add a new dimension to S-Gcc
      end for
      Merge same-colored dimensions in S-Gcc
    end for
  end for
  return set of optimization subspaces
end
```
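In executable form, Algorithm 1 comes down to connected components plus a greedy coloring. The following Python rendering is our illustration, not the system's code; it returns, per o-window, the options it contains and the number of independent dimensions the coloring yields:

```python
# Illustrative rendering of Algorithm 1: options are vertices, declared
# dependencies are edges; each connected component becomes one o-window,
# and a greedy coloring bounds the combinations per component.

from collections import defaultdict

def gen_optimization_subspaces(options, dependencies):
    graph = defaultdict(set)
    for v in options:
        graph[v]                      # ensure isolated options appear
    for a, b in dependencies:
        graph[a].add(b)
        graph[b].add(a)

    # Partition into connected components (one o-window each).
    seen, components = set(), []
    for v in options:
        if v in seen:
            continue
        comp, stack = [], [v]
        seen.add(v)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in graph[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(comp)

    # Greedy coloring per component; same-colored (mutually independent)
    # options can share one dimension of the subspace.
    subspaces = []
    for comp in components:
        color = {}
        for u in comp:
            used = {color[w] for w in graph[u] if w in color}
            color[u] = next(c for c in range(len(comp)) if c not in used)
        subspaces.append({"options": comp,
                          "dimensions": len(set(color.values()))})
    return subspaces

# Example from the text: V1-V2 and V2-V3 interact; two colors suffice,
# so 2^2 = 4 combinations instead of 2^3 = 8.
print(gen_optimization_subspaces(["V1", "V2", "V3"],
                                 [("V1", "V2"), ("V2", "V3")]))
```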
Algorithm 1 shows the optimization space partitioning. The algorithm constructs a dependency graph, where optimizations represent vertices and options' dependencies form edges. Next, the graph is partitioned into a set of connected components, where each component represents an o-window of dependent optimization options.

In practice, most generated graphs are not strongly connected, which means that not every pair of techniques within an o-window possibly interacts. Instead of trying all technique combinations exhaustively, the algorithm uses graph coloring to decide the minimum number of combinations that captures all interactions. Graph coloring finds the minimum number of colors such that no two adjacent vertices share the same color; independent techniques are mapped to vertices with the same color. For a graph with $n$ colors, $2^n$ combinations of the techniques composing that graph are needed. For example, if $V_1$ interacts with $V_2$, and $V_2$ interacts with $V_3$, two colors are sufficient for marking the resulting connected component: $V_1$ and $V_3$ share one color, and $V_2$ gets the other. Toggling the two color classes jointly, only four combinations are needed ($\overline{V_1}\,\overline{V_2}\,\overline{V_3}$, $V_1\overline{V_2}V_3$, $\overline{V_1}V_2\overline{V_3}$, $V_1V_2V_3$, where a bar denotes a disabled technique) instead of eight (assuming $V_1$, $V_2$, and $V_3$ are binary techniques). This algorithm guarantees a window search space size equal to the size of the largest strongly connected component in the graph (i.e., $(V_1,V_2)$ or $(V_2,V_3)$ in our example).

After partitioning is completed, each search space variant is written to a file in the form of a list of loop IDs, the command line tuning options applied to each loop, and the p-window ID. This file is passed on to the Search Space Navigator.
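For illustration, such a variant record might look as follows. The exact layout is hypothetical – the text only fixes the three fields – and the loop IDs follow the subroutine-name-plus-position scheme described under Version Generation below:

```
# loop ID        options applied to the loop        p-window ID
compute_rhs#1    loop_parallelize=1 loop_tile=0               1
compute_rhs#2    loop_parallelize=1 loop_tile=1 tile_size=64  1
x_solve#1        loop_parallelize=0                           2
```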
2) Search Space Navigation: All p-windows have separate search spaces. The Search Space Navigator traverses them independently, picking the next variant of each window to be evaluated and calling the Version Generator with this information. After each evaluation step, the measured points in the search space are annotated with their performance results; in addition, the best performance and the corresponding point are recorded for each p-window.

Because of the extensive pruning, the resulting search spaces in our experiments were small enough that exhaustive search of all points was feasible. Exhaustive search guarantees finding the best version and demonstrates an upper bound on tuning time. Heuristic algorithms are known that navigate faster, but do not guarantee optimality [21]–[23]. These algorithms can be substituted in a straightforward way.

Because windows can have different sizes, their search spaces differ in size as well. After the navigation of smaller p-windows has completed, their best version is used for the remaining calls to the Version Generator. For final version generation, the Search Space Navigator simply passes the recorded best variants for each p-window to the Version Generator. Whole-program optimizations are the same for all p-windows.

3) Version Generation: A program version is expressed by the combination of all its p-window variants and the whole-program optimizations. Optimization options are expressed as command line options; a loop is identified by an automatically generated loop ID, which is composed of the name of the subroutine containing that loop and a serial number reflecting the loop position.

We extended the Cetus translator to support section-level tuning by adding a Selective Transform Pass that applies transformations to individual loops, as specified by the p-window variants. To support enabling and disabling of OpenMP parallel loops, we implemented a postprocessor in Cetus that selectively removes OpenMP parallel loop directives based on the variant description obtained from the Space Navigator. Other compilers do not support section-level tuning. We will show later how these capabilities are also used to support section-level tuning and enabling/disabling of individual parallel loops for the Rose compiler.

The proposed tuning system uses automatic source code instrumentation implemented in Cetus. It measures application performance by annotating the source code before and after loop nests with timer calls. It provides enough timing information about each section in the program with negligible overhead.
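To give an idea of the directive-removal step described above, the sketch below shows the core idea in our own illustrative Python; the actual implementation is a pass inside Cetus, and the helper `loop_id_of` is a hypothetical stand-in for the translator's loop-ID bookkeeping:

```python
# Illustrative sketch (not the actual Cetus postprocessor): drop the
# OpenMP directive in front of loops whose variant entry disables
# parallelization, leaving all other lines untouched.

import re

def disable_parallel_loops(source_lines, disabled_loop_ids, loop_id_of):
    """source_lines      -- OpenMP-annotated C source, one string per line
    disabled_loop_ids -- loop IDs the navigator chose to run serially
    loop_id_of        -- maps a line index to the ID of the loop that
                         starts on the following line (e.g., "x_solve#1"),
                         or None if no loop follows
    """
    out = []
    for i, line in enumerate(source_lines):
        is_omp = re.match(r"\s*#pragma\s+omp\s+parallel\s+for", line)
        if is_omp and loop_id_of(i + 1) in disabled_loop_ids:
            continue  # omit the directive; the loop body stays serial
        out.append(line)
    return out
```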
4) Empirical Measurement: The code made by the Version Generator is already instrumented for taking performance measurements. These may be overall timing measurements for compilers that only support whole-program optimizations, or there may be section-level instrumentation for evaluating individual p-windows. The Empirical Measurement step sets the environment parameters according to the information contained in the TDF and executes the program. Performance results are gathered and passed on to the Search Space Navigator. Empirical Measurement measures all p-windows in one program run; the largest p-window dominates the tuning time.

To minimize the effect of measurement variations, the system can be configured via the TDF to set the number of measurement repetitions. The system chooses the minimum value of the repeated measurements, since delays are usually caused by undesirable system effects. Performance results are reported to the Search Space Navigator as a vector of loop IDs and corresponding execution times.
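The repetition policy amounts to a few lines; this sketch is ours, with `run_once` a hypothetical stand-in for one instrumented program execution:

```python
# Sketch of the repetition policy (illustrative): run each variant a
# configurable number of times and keep the minimum, since outliers are
# typically caused by transient system noise rather than by the code.

def measure(run_once, repetitions):
    """run_once    -- executes the variant once and returns per-loop times,
                      e.g. {"compute_rhs#2": 1.83, "x_solve#1": 0.41}
    repetitions -- measurement count taken from the TDF
    """
    best = {}
    for _ in range(repetitions):
        for loop_id, t in run_once().items():
            best[loop_id] = min(t, best.get(loop_id, float("inf")))
    return best  # vector of loop IDs and execution times for the navigator
```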
Through the TDF, our prototype system can be configured to create a specific instance of a tuning system. The following section describes how this has been done for five different parallelizing compilers.

III. CASE STUDIES IN TUNED AUTOPARALLELIZATION

We applied the described portable tuning system to five compilers. ICC and PGI are commercial compilers, while Cetus, OpenUH, and Rose are research compilers.

A. Tuning the Cetus Autoparallelizer

This section presents a customized tuning system for the C-to-OpenMP translator that is part of the Cetus compiler infrastructure. We briefly describe its techniques together with their needs and opportunities for tuning.

The Cetus compiler infrastructure is a source-to-source translator for C programs. Input source code is automatically parallelized using advanced static analysis, such as scalar and array privatization, symbolic data dependence testing, reduction recognition, and induction variable substitution. The translator supports the detection of loop-level parallelism and the generation of C code annotated with OpenMP parallel directives [24].

Cetus uses a set of optimization techniques, some of which are applied at the loop level, while others are global optimizations applied at the program level. We include three categories of such techniques: techniques that assist program analysis include symbolic analysis and subroutine inlining; parallelism-enabling techniques include data privatization, reduction parallelization, and induction variable recognition; locality-enhancement techniques include loop interchange and tiling. Specifics of these techniques are described in [3].

There are many tuning opportunities in Cetus. Eager parallelization of small, inner loops could add significant overhead. Aggressive inlining can lead to code with complex expressions, reducing the parallel coverage. Tiling strongly interacts with parallelization [25]; because tiling introduces additional code and control overhead, performance degradation is expected if both techniques are applied indiscriminately. Loop interchange also interacts with parallelization, as moving a parallel loop to an inner position increases parallel loop overheads. The reduction transformation is sometimes expensive because of inefficiently added code that affects back-end compilers and other techniques. The closed-form expressions produced by induction variable substitution introduce more costly operations. Even though symbolic analysis may help recognize parallel loops, these loops may be too small to improve performance and introduce overhead instead [26].

We define the portable system to tune profitable loop-parallelization [27] and function-inlining techniques by selectively applying these techniques to loop nests when they improve performance. Inlining is often applied as a pre-pass to optimization, making it difficult to tune; our system tunes inlining efficiently. It also tunes some architecture-dependent transformation techniques, such as loop tiling and loop permutation. Cetus is tuned at section-level granularity.

B. Tuning the OpenUH Compiler

The OpenUH compiler is a branch of the Open64 compiler maintained by the High Performance Computing Tools (HPCTools) group of the University of Houston. OpenUH is a robust, optimizing, portable OpenMP compiler, which translates OpenMP 2.5 (www.openmp.org) directives in conjunction with C, C++, and FORTRAN 77/90 (and some FORTRAN 95). The OpenUH compiler's major optimization components are the inter-procedural analyzer, the loop nest optimizer, and the global optimizer. In order to achieve portability while preserving most optimizations, the compiler includes IR-to-source translators that produce compilable code immediately before the code generator [4], [28].

We tune auto-parallelization (apo), loop unrolling (unroll), loop nest optimization (interchange, blocking, prefetch), function inlining (inline), the optimization level (On), and inter-procedural analysis (alias, constant propagation, dead function elimination). OpenUH does not support section-level optimization; it is tuned at the program level.

C. Tuning the ICC Compiler

The auto-parallelization feature of the Intel C++ Compiler, invoked by the '-parallel' option, automatically translates serial portions of the input program into equivalent multithreaded code. The autoparallelizer analyzes the dataflow of the program's loops and generates multithreaded code for those loops that can be safely and efficiently executed in parallel. This enables the potential exploitation of the parallel architecture found in symmetric multiprocessor (SMP) systems [19].

The ICC compiler also includes a heuristic for deciding when a parallel loop benefits performance. Via a user-defined threshold, ICC tries to balance the overhead of creating multiple threads against the amount of work available to be shared among the threads.

The ICC compiler options included in our tuning system are auto-parallelization (parallel), loop unrolling (unroll), loop blocking (blocking), vectorization (vec), data prefetch (prefetch), scalar replacement (scalar-rep), data alignment (align), the optimization level (On), and function inlining (inline). The ICC compiler does not support section-level optimization; it is tuned at the program level.

D. Tuning the PGI Compiler

The PGI compiler incorporates global optimization, vectorization, software pipelining, and shared-memory parallelization capabilities. Vectorization transforms loops to improve memory access performance and makes use of packed SSE instructions, which perform the same operation on multiple data items concurrently. Unrolling replicates the body of loops to reduce loop branching overhead and to provide better opportunities for local optimization, vectorization, and instruction scheduling. The performance of loops on systems with multiple processors may also improve using the parallelization features of the PGI compilers [18], [29]. The PGI compiler is tuned at the program level.

The compiler options being tuned include auto-parallelization (concur), vectorization (vect), function inlining (inline), loop unrolling (unroll), inter-procedural analysis and optimization (ipa), cache alignment (align), partial redundancy elimination (PRE), and the optimization level (On). For a more detailed description of these options, we refer to the PGI user guide [18].

E. Tuning the Rose Compiler

Rose is an open-source compiler infrastructure for building source-to-source program transformation and analysis tools for large-scale Fortran 77/95/2003, C, C++, OpenMP, and UPC applications [17]. Rose provides several optimizations, including autoparallelization, loop unrolling, loop blocking, loop fusion, loop fission, and inlining. Loop optimizations and autoparallelization are separate projects under Rose that are not yet integrated; currently, we tune autoparallelization only. We used the Cetus postprocessor to support section-level tuning of the Rose autoparallelizer. Rose successfully compiles five out of the eight NAS Parallel Benchmarks.

F. Tuning System Building Time

Most of the effort and time spent when building the tuning systems for these compilers was in reading the compilers' documentation, understanding the behaviour of the underlying techniques, and determining the possible interactions among them. On average, building a tuning system consumed around five working days, from downloading and installing the compiler to the beginning of the tuning process. Writing the tuning definition files consumed minor time, as the TDFs comprise a few tens of lines of code. We expect this to be the typical customization time for a developer who is not yet familiar with the underlying compiler.

All compilers discussed previously have rich sets of options affecting their performance. Playing the role of the compiler writer, we have chosen the most important options and corresponding argument ranges to include in the tuning system specifications. We excluded options that have a consistently negative effect. To do so, we performed a preliminary step using the following algorithm: starting from the base case, where all techniques are turned on, we turned off one technique at a time. Techniques with consistently negative performance effects were not included as tuning options. This algorithm has a complexity of O(n), where n is the number of optimizations.
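As we read it, this pre-pass amounts to the following; the sketch is ours and simplifies the "consistent" check to a single measurement (the paper checks consistency across the benchmark suite), with `benchmark_time` a hypothetical measurement helper:

```python
# The O(n) exclusion pre-pass (illustrative): starting from the base case
# with every technique enabled, toggle one technique off at a time and
# drop techniques whose removal helps, i.e., whose presence hurts.

def select_tuning_options(options, benchmark_time):
    """options        -- candidate technique names
    benchmark_time -- maps a set of enabled options to a measured runtime
    """
    base = benchmark_time(set(options))
    kept = []
    for opt in options:
        without = benchmark_time(set(options) - {opt})
        if without < base:
            continue  # turning `opt` off helps: consistently negative effect
        kept.append(opt)
    return kept
```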

IV. PERFORMANCE EVALUATION

This section presents a comprehensive experimental evaluation of our portable tuning system using the metric "speedup over the serial program". The results show that the tuning system ensures non-degrading performance in all cases and often leads to significant speedups. Section-level tuning improves substantially over whole-program tuning and over untuned optimization. The results also quantify characteristics of the tuning system itself. They show that our system feasibly navigates large search spaces and that a small p-window size is sufficient.

Fig. 4. Speedup of the tuned Cetus, OpenUH, PGI, Rose, and ICC compilers over the serial version of the NAS Parallel Benchmarks. We use Class W for training, while Classes A and B are production datasets. Hand Parallel refers to the speedup of the original NAS Parallel Benchmarks. Rose speedups are not available for CG, EP, and IS.

A. Benchmark Characteristics

The NAS Parallel Benchmarks (NPB) are designed to help evaluate the performance of parallel supercomputers. The benchmarks are derived from Computational Fluid Dynamics (CFD) applications. Four problem classes are used in the evaluation: Class W, Class A, Class B, and Class C. The classes differ mainly in the sizes of the principal arrays, which generally affects the number of iterations of the contained loops [30], [31]. Because of the availability of OpenMP program variants, the NPB suite is favorable for evaluating parallelizing compilers, providing hand-optimized reference points. We use the NPB suite implemented in the C language [32].

B. Experiment Setup

We conducted the experiments on a single-user x86-64 machine with two 2.5 GHz Quad-Core AMD 2380 processors and 32 GB of memory. The operating system is Red Hat Enterprise Linux. We used the Intel ICC compiler version 11.1, UHCC version 4.2, PGCC version 9.0-3 (64 bit), and the Rose compiler version 0.9.5a. We used all eight NPB programs. The W data set is used during tuning; A, B, and C are used as production datasets.

C. Overall Evaluation of Optimizing Compilers

This section evaluates the five tuned compilers described in Section III. These include Cetus, two important commercial compilers (Intel's ICC and the PGI compiler), as well as the Rose and OpenUH research compilers. The hand-parallelized versions of the programs are used as reference points. The serial versions of NPB were obtained by compiling the codes with the serial compiler options. Hand Parallel code refers to the original parallel benchmarks. Figure 4 shows the overall results of the tuned compilers.

Cetus shows significantly better speedups over serial execution than the ICC and PGI compilers. This indicates that not all of the advanced parallelization techniques developed in research compilers have yet been transferred to industrial products. In four of the programs, the best Cetus version matches, and in one of these cases (SP) outperforms, the hand-parallelized codes. We find that, in roughly 50% of scientific/engineering codes, parallelizers yield substantial improvements.

One important reason for the superior performance of the tuned SP code is that Hand Parallel chooses to parallelize an important loop at the second level, while our system selects the outer loop to be parallel. Another reason is that our system parallelizes profitable small loops that were considered inefficient by the programmer of the benchmark.

OpenUH reaches hand-parallel performance in CG and outperforms it in SP. PGI and ICC show a noticeable speedup – close to hand parallel – in SP. The Rose autoparallelizer we downloaded was unable to compile three of the benchmarks: CG, EP, and IS.

D. Efficiency of Section-Level Tuning

This section compares the performance of tuning at the program level, which is the common practice of all previous work, with the proposed section-level tuning. To do that, we created instances of our tuning system that tune the Cetus and Rose compilers at the program level, using the same optimization techniques as described in Section III.
Figure 5 compares the average speedup of Cetus and Rose tuned at the section level versus program-level tuning, relative to the NAS serial programs, for datasets W, A, B, and C.

The detailed results revealed that section-level tuning clearly outperforms program-level tuning in two benchmarks: LU and SP. Section-level tuning improves performance by 42% over program-level tuning in the best case.

Fig. 5. Speedup of tuned programs at the section level versus the program level, relative to the NAS serial programs, using the Cetus and Rose compilers. Measurements are averaged over the NAS suite. Class W is used for training, while Classes A, B, and C are used as production datasets.

E. Tuning Impact on Performance

This section quantifies the impact of tuning for the different compilers. We compare the average performance of the tuned NAS benchmarks with untuned and profile-based tuned programs supported by some of the compilers under study. Untuned programs are generated by turning on all optimizations. The Profile Based Tuned versions are generated using a compile-time profitability test in Cetus, profile feedback optimization in the PGI compiler, feedback directed optimization in OpenUH, and profile guided optimization in ICC. Rose does not include a profitability test; we used the described postprocessor to tune profitability in Rose. Figure 6 shows the results.

Fig. 6. Speedup contribution of our tuning system (Empirically Tuned) versus Untuned and Profile Based Tuned programs for the Cetus, OpenUH, PGI, Rose, and ICC compilers, on average over the NAS suite. Profile-based tuning is shown for the compilers that support it.

Detailed results revealed that, without tuning, all compilers degrade performance in some of the benchmarks. For example, the ICC compiler degrades performance to 82% in LU, OpenUH degrades performance to 78% in BT, the PGI compiler degrades performance to 83% in MG, Rose degrades performance to 25% in FT, and Cetus degrades performance to 21% in LU.

Untuned performance does not degrade in the PGI, ICC, and OpenUH compilers on average, which we attribute to their conservative optimizations. OpenUH uses a built-in profitability test for its techniques, explaining the minor effect of profile-based optimization. On the other side, the aggressive optimization capabilities in Cetus provide ample opportunity for the tuner to improve performance, as shown in Figure 4.

Comparing the results in Figures 6 and 5, we found that the gain of tuning over profiling is smaller than the gain over whole-program tuning. This means that whole-program tuning performs worse than profile-based optimization. Profile-based optimization lies between program-level and section-level tuning: it is similar to section-level tuning with a p-window size of 1 and a constant parallelization threshold for all loops in the program. This optimization method does not account for interactions among loops.

F. Tuning Window Size

The choice of the program window size is important. We provide empirical evidence that a small p-window size of three is a good choice, which we have used in all implementations. Figure 7 compares the speedup of Cetus-tuned applications as a function of the p-window size, while keeping all other optimization options the same. The average performance of the NAS suite reaches a maximum value at p-window size three, validating our approach.

Fig. 7. Effect of the p-window size on performance for Cetus, on average over the NAS benchmarks, using dataset A.
G. Tuning Time

Table I compares the number of variants considered in tuning the five compilers under study. It also shows the ratio of measured variants to the raw search space size. Rose and Cetus are tuned at the section level with p-window size three, while the other compilers are tuned at the program level.

TABLE I. Tuning time in terms of the measured search space size. Cetus and Rose are tuned at the section level; all other compilers are tuned at the program level. Column 3 shows the ratio of the pruned search space to the raw search space.

Compiler | Pruned Search Space Size | % of Space Considered | Tuning Granularity | Avg Raw Space Size
OpenUH   |  36 | 1%   | Program-level |   3600
PGI      |  60 | 3.9% | Program-level |   1536
ICC      | 540 | <1%  | Program-level | 311040
Cetus    | 705 | 2%   | Section-level |  35250
Rose     |   8 | <1%  | Section-level |    900

The tuning process for the ICC, PGI, and OpenUH compilers consumed less than 15 minutes on average over the whole benchmark suite. Cetus and Rose required more than an hour in the worst case; this is due to the high parallelization overhead of some search space variants as well as the high source-to-source translation times of some techniques, such as inlining.

V. RELATED WORK

A large body of research considers autoparallelization and tuning. To our knowledge, portability and re-usability across different applications have not been addressed before. Our system is the first to support section-level tuning and to ensure performance no less than that of the original programs.

A scalable auto-tuning system for compiler optimization [11] using Active Harmony [33] reduces tuning time by evaluating multiple search space variants concurrently using a cluster of nodes [34]. This technique is known as parallel search. Using a cluster of 64 nodes, within 10 tuning steps, they can exhaustively evaluate 640 variants. Their system tunes permute, tile, unroll, data copy, split, and non-singular, on kernels only. Our system considers possible interactions between loops and prunes the search space. Our tuning system evaluates multiple search space variants by combining them into a single run on a single node; parallel search could further reduce tuning time.

POET (Parametrized Optimizations for Empirical Tuning) provides an embedded scripting language for parametrizing complex code transformations [12], but cannot overcome an exponential search space. It empirically tunes loop interchange, blocking, and unrolling of two linear algebra kernels. We tune a wider range of transformations on a full benchmark suite.

Recent work uses POET to tune the Rose loop optimizer [35]. All optimizations were re-implemented to support parameterization; the fine-grained parameterization allows capturing optimization interactions. Our system does not require optimization re-implementation: a new optimization can be added to our portable system by means of a tuning definition file.

Different research projects exploit user-provided information to enhance tuning. Dooley and Kale tune control points that affect the MPI architecture and application characteristics at runtime [36]. The user is required to define a control point's effect on performance (higher is better or lower is better).

The Tuning System for Software-Managed Memory Hierarchies automatically tunes general applications to machines with software-managed memory hierarchies [13]. It reduces the search space dimensionality by grouping tunables (based on the memory level affected) and searching each group separately. Our work assumes no prior knowledge of the optimization options and tunes a wide range of techniques beyond memory-related optimizations. Our system groups optimization options based on their dependencies, provided by the compiler writer, and supports portability across a wide range of applications.

Iterative compilation aims at selecting the best parameterization of the optimization options for a given program or application domain. It typically affects optimization flags (switches), parameters (e.g., loop unrolling, tiling), phase ordering, the heuristic itself, or the hybridization of multiple heuristics [37]–[43]. Our generic system is built on iterative tuning.

Some tuning systems targeting large-scale applications use outlining to extract hot code sections (kernels), tune them off-stage, and then plug them back into the code [44]–[46]. This is an orthogonal issue and could be combined with our approach.

Other research projects work on the empirical optimization of linear algebra kernels and domain-specific libraries. ATLAS [14] uses the technique to generate highly optimized BLAS routines. It uses a near-exhaustive orthogonal search (searching one dimension at a time while keeping the rest of the parameters fixed). The order of the options being tuned may affect performance if a following option interacts with a previous one. Instead, our method searches multiple dimensions within a group of parameters simultaneously with other independent groups. Our work tunes faster, which makes it feasible to tune a wide range of applications.

The OSKI (Optimized Sparse Kernel Interface) library [15] provides automatically tuned computational kernels for sparse matrices. FFTW [47] and SPIRAL [48] are domain-specific libraries: FFTW combines static models with empirical search to optimize FFTs, and SPIRAL generates empirically tuned Digital Signal Processing (DSP) libraries. Rather than focusing on one particular domain, our system aims at providing a general-purpose, customizable, compiler-based approach to tuning.

Many compiler optimization space navigators use a feedback-directed approach, which iteratively uses information from the current step to decide the next experimental optimization combinations, until a convergence criterion is reached. Optimization Space Exploration (OSE) iteratively constructs new optimization combinations using "unions" of the ones in the previous iteration [21]. Statistical Selection (SS) uses orthogonal arrays to compute the main effect of the optimizations based on a statistical analysis of profile information, which in turn is used to find the best optimization [22]. Combined Elimination iteratively identifies harmful optimizations and removes them in a batch [23]. All these approaches are orthogonal to our work and can be incorporated to navigate the pruned space, further reducing tuning time.

Combined empirical tuning of loop fusion with model-based tuning of loop tiling, vectorization, and parallelization in an automatic parallelization framework is proposed in [16]; kernels were used in its evaluation. In contrast to our work, this system does not address portability and tunes a specific set of techniques. Our work tunes all techniques empirically. Also, we performed an extended evaluation using the full suite of NAS benchmarks.
Loop tiling and unrolling are simultaneously tuned in [49] using iterative tuning. One of the search algorithms used there is named window search: initially, the window is the whole 2D space; after each iteration, a smaller window is defined based on the samples taken. The validity of this algorithm is tied to architectural attributes of both techniques. Our window-based tuning is a different concept; their window navigation algorithm could replace the exhaustive search within windows in our approach.

VI. CONCLUSIONS AND FUTURE WORK

We have presented an automatic, portable, window-based tuning system and applied it to a set of research as well as commercial parallelizing compilers. Our tuning system partitions both the program and the optimization options into windows. It then tunes each window independently, significantly reducing the search space of optimization variants and thus the tuning time. The results show that the presented tuning techniques are able to efficiently navigate the large search space of parallelization techniques and of the individual loops on which to apply them. The combined parallelizer-tuning systems are able to ensure performance greater than or equal to serial execution in all programs, and they outperform the hand-parallelized programs in one benchmark.

We found parallelizers to be reasonably successful in about half of the given science-engineering programs. On average over all compilers, tuning improves performance by 170% over the untuned applications. Section-level tuning improves performance by 20% over program-level tuning on average.

We are working on porting the proposed generic system to tune more compilers, such as OpenMP-to-GPU and OpenMP-to-MPI translators. We are studying the interactions of adjacent p-windows and supporting more windowing strategies. We will plug in different space navigation algorithms for the pruned search space, such as Combined Elimination and Parallel Search, further accelerating the tuning process.

REFERENCES

[1] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu, "Parallel programming with Polaris," IEEE Computer, pp. 78–82, 1996.
[2] R. Eigenmann, J. Hoeflinger, and D. Padua, "On the automatic parallelization of the Perfect Benchmarks," IEEE Trans. Parallel Distrib. Syst., pp. 5–23, 1998.
[3] C. Dave, H. Bae, S.-J. Min, S. Lee, R. Eigenmann, and S. Midkiff, "Cetus: A source-to-source compiler infrastructure for multicores," IEEE Computer, pp. 36–42, 2009.
[4] C. Liao, O. Hernandez, B. Chapman, W. Chen, and W. Zheng, "OpenUH: An optimizing, portable OpenMP compiler," Concurrency and Computation: Practice and Experience, vol. 19, no. 18, pp. 2317–2332, 2007.
[5] H. Vandierendonck, S. Rul, and K. De Bosschere, "The Paralax infrastructure: Automatic parallelization with a helping hand," in PACT, 2010, pp. 389–400.
[6] C. Chen, J. Chame, Y. L. Nelson, P. Diniz, M. Hall, and R. Lucas, "Compiler-assisted performance tuning," Journal of Physics: Conference Series, p. 012024, 2007.
[7] S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello, M. Sigler, and O. Temam, "Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies," International Journal of Parallel Programming, pp. 261–317, 2006.
[8] Z. Pan and R. Eigenmann, "PEAK—a fast and effective performance tuning system via compiler optimization orchestration," ACM Trans. Program. Lang. Syst., vol. 30, no. 3, pp. 1–43, 2008.
[9] G. Tournavitis, Z. Wang, B. Franke, and M. F. P. O'Boyle, "Towards a holistic approach to auto-parallelization: Integrating profile-driven parallelism detection and machine-learning based mapping," SIGPLAN Not., pp. 177–187, 2009.
[10] Z. Wang and M. F. O'Boyle, "Mapping parallelism to multi-cores: A machine learning based approach," in Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '09), 2009, pp. 75–84.
[11] A. Tiwari, C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth, "A scalable auto-tuning framework for compiler optimization," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing. Washington, DC, USA: IEEE Computer Society, 2009, pp. 1–12.
[12] Q. Yi, K. Seymour, H. You, R. Vuduc, and D. Quinlan, "POET: Parameterized optimizations for empirical tuning," in IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), March 2007, pp. 1–8.
[13] M. Ren, J. Y. Park, M. Houston, A. Aiken, and W. J. Dally, "A tuning framework for software-managed memory hierarchies," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08). New York, NY, USA: ACM, 2008, pp. 280–291.
[14] R. C. Whaley and J. J. Dongarra, "Automatically tuned linear algebra software," in Conference on High Performance Networking and Computing. IEEE Computer Society, 1998, pp. 1–27.
[15] R. Vuduc, J. W. Demmel, and K. A. Yelick, "OSKI: A library of automatically tuned sparse matrix kernels," Institute of Physics Publishing, 2005.
[16] L.-N. Pouchet, U. Bondhugula, C. Bastoul, A. Cohen, J. Ramanujam, and P. Sadayappan, "Combined iterative and model-driven optimization in an automatic parallelization framework," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10), 2010.
[17] D. Quinlan, C. Liao, T. Panas, R. Matzke, M. Schordan, R. Vuduc, and Q. Yi, "ROSE user manual: A tool for building source-to-source translators," 2012. http://rosecompiler.org/ROSE_UserManual/ROSE-UserManual.pdf
[18] "PGI compiler user's guide," 2012. http://www.pgroup.com/doc/pgiug.pdf
[19] "Intel C++ compiler for Linux systems user's guide." http://denali.princeton.edu/intel_cc_docs/c_ug/index.htm
[20] D. Mustafa, Auranzeb, and R. Eigenmann, "Performance analysis and tuning of automatically parallelized OpenMP applications," in Proc. of the International Workshop on OpenMP (IWOMP), 2011, pp. 150–164.
[21] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I. August, "Compiler optimization-space exploration," in Proceedings of the International Symposium on Code Generation and Optimization (CGO '03), 2003, pp. 204–215.
[22] R. P. J. Pinkers, P. M. W. Knijnenburg, M. Haneda, and H. A. G. Wijshoff, "Statistical selection of compiler options," in Proceedings of the IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, 2004, pp. 494–501.
[23] Z. Pan and R. Eigenmann, "Fast and effective orchestration of compiler optimizations for automatic performance tuning," in The 4th Annual International Symposium on Code Generation and Optimization (CGO), Mar. 2006, pp. 319–330.
[24] "OpenMP." http://openmp.org/wp/
[25] Z. Pan, B. Armstrong, H. Bae, and R. Eigenmann, "On the interaction of tiling and automatic parallelization," in First International Workshop on OpenMP, 2005, pp. 24–35.
[26] H. Bae and R. Eigenmann, "Performance analysis of symbolic analysis techniques for parallelizing compilers," in Proc. of the Workshop on Languages and Compilers for Parallel Computing (LCPC '02), Aug. 2002, pp. 280–294.
[27] C. Dave and R. Eigenmann, "Automatically tuning parallel and parallelized programs," in Proceedings of the 22nd International Workshop on Languages and Compilers for Parallel Computing (LCPC '09), 2009.
[28] "OpenUH compiler suite: User guide," 2007. http://www2.cs.uh.edu/~openuh/OpenUHUserGuide.pdf
[29] "PGI compiler reference manual: Parallel Fortran, C and C++ for scientists and engineers," 2012. http://www.pgroup.com/doc//pgiref.pdf
[30] E. Barszcz, J. Barton, L. Dagum, P. Frederickson, T. Lasinski, R. Schreiber, V. Venkatakrishnan, S. Weeratunga, D. Bailey, D. Browning, R. Carter, S. Fineberg, and H. Simon, "The NAS Parallel Benchmarks," The International Journal of Supercomputer Applications, Tech. Rep., 1991.
[31] R. F. Van der Wijngaart, "NAS Parallel Benchmarks version 2.4," Computer Sciences Corporation, NASA Advanced Supercomputing (NAS) Division, Tech. Rep., 2002.
[32] S. Satoh, "NAS Parallel Benchmarks 2.3 OpenMP C version," 2000. http://www.hpcs.cs.tsukuba.ac.jp/omni-openmp
[33] C. Tapus, I.-H. Chung, and J. K. Hollingsworth, "Active Harmony: Towards automated performance tuning," in Proceedings from the Conference on High Performance Networking and Computing, 2003, pp. 1–11.
[34] A. Tiwari, J. K. Hollingsworth, C. Chen, M. W. Hall, C. Liao, D. J. Quinlan, and J. Chame, "Auto-tuning full applications: A case study," IJHPCA, vol. 25, no. 3, pp. 286–294, 2011.
[35] Q. Yi, "Automated programmable control and parameterization of compiler optimizations," in 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2011), April 2011, pp. 97–106.
[36] I. Dooley and L. V. Kale, "Control points for adaptive parallel performance tuning," November 2008.
[37] F. Bodin, T. Kisuki, P. Knijnenburg, M. O'Boyle, and E. Rohou, "Iterative compilation in a non-linear optimisation space," 1998.
[38] L. Almagor, K. D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman, "Finding effective compilation sequences," SIGPLAN Not., pp. 231–239, 2004.
[39] P. A. Kulkarni, S. R. Hines, D. B. Whalley, J. D. Hiser, J. W. Davidson, and D. L. Jones, "Fast and efficient searches for effective optimization-phase sequences," ACM Trans. Archit. Code Optim., pp. 165–198, 2005.
[40] J. Cavazos, J. E. B. Moss, and M. F. P. O'Boyle, "Hybrid optimizations: Which optimization algorithm to use?" in 15th International Conference on Compiler Construction (CC 2006), 2006.
[41] P. Kulkarni, W. Zhao, S. Hines, D. Whalley, X. Yuan, R. van Engelen, K. Gallivan, J. Hiser, J. Davidson, B. Cai, and M. Bailey, "VISTA: VPO interactive system for tuning applications," ACM Transactions on Embedded Computing Systems, vol. 5, 2005.
[42] C. Dubach, T. M. Jones, E. V. Bonilla, G. Fursin, and M. F. P. O'Boyle, "Portable compiler optimisation across embedded programs and microarchitectures using machine learning," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42), 2009, pp. 78–88.
[43] G. Tournavitis, Z. Wang, B. Franke, and M. F. O'Boyle, "Towards a holistic approach to auto-parallelization: Integrating profile-driven parallelism detection and machine-learning based mapping," SIGPLAN Not., pp. 177–187, 2009.
[44] C. Liao, D. J. Quinlan, R. W. Vuduc, and T. Panas, "Effective source-to-source outlining to support whole program empirical optimization," in LCPC '09, 2009, pp. 308–322.
[45] Y.-J. Lee and M. W. Hall, "A code isolator: Isolating code fragments from large programs," in LCPC '04, 2004, pp. 164–178.
[46] P. Zhao and J. N. Amaral, "Ablego: A function outlining and partial inlining framework," Softw. Pract. Exper., vol. 37, no. 5, pp. 465–491.
[47] I.-H. Chung and J. K. Hollingsworth, "Using information from prior runs to improve automated tuning systems," in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04). Washington, DC, USA: IEEE Computer Society, 2004.
[48] J. Xiong, J. Johnson, R. Johnson, and D. Padua, "SPL: A language and compiler for DSP algorithms," 2001.
[49] T. Kisuki, P. M. W. Knijnenburg, and M. F. P. O'Boyle, "Combined selection of tile sizes and unroll factors using iterative compilation," in IEEE PACT '00, 2000, pp. 237–248.