
Procedia Computer Science 4 (2011) 2156–2165

International Conference on Computational Science, ICCS 2011

Extensive Parameterization and Tuning of Architecture-Sensitive Optimizations 1

Qing Yi, Jichi Guo

Computer Science, University of Texas at San Antonio

Abstract. The complexity of modern architectures requires compilers to apply an increasingly large collection of architecture-sensitive optimizations, e.g., parallelization and cache optimizations, which interact with each other in unpredictable ways. We present a framework to support fine-grained parameterization of these optimizations and flexible tuning of their configuration space. Instead of directly generating optimized code, we extend an optimizing compiler to output its optimization decisions in POET, a scripting language designed for extensive parameterization of source-to-source program transformations. We then use a transformation-aware (TA) search algorithm to support flexible tuning of the parameterized transformation scripts to achieve portable high performance. We have used our framework to apply six highly interactive optimizations (parallelization via OpenMP, cache blocking, array copying, unroll-and-jam, scalar replacement, and loop unrolling) and present results of exploring their combined configuration space.

1. Introduction







 

Emerging multi-core architectures require a large collection of optimizations, including both thread-level parallelization and memory locality optimizations, for scientific applications to achieve high performance. Iterative compilation [1, 2, 3, 4] can use runtime feedback from evaluating differently optimized code to automatically select promising optimization configurations on modern architectures. However, as a single optimized code is generated as output, the optimizations cannot be later reconfigured for a different architecture, and developers have limited control over how a compiler may parameterize or tune the configurations of different optimizations.

We present a framework, shown in Figure 1, to support more extensive parameterization and tuning of architecture-sensitive optimizations. Instead of directly generating optimized code, we extend an optimizing compiler to output its optimization decisions as extensively parameterized program transformations in POET, a scripting language designed for applying source-to-source program transformations. The POET output can then be ported together with an annotated input program to different machines, where an empirical transformation engine can dynamically interpret the POET scripts with different optimization configurations until satisfactory performance is achieved.

[Figure 1: The optimization workflow]

1 This research is funded by NSF awards CCF-0747357 and CCF-0833203, and DOE award DE-SC0001770.

void dgemm_test(const int M, const int N, const int K, const double alpha,
                const double *A, const int lda, const double *B, const int ldb,
                const double beta, double *C, const int ldc)
{
    int i, j, l;
    for (j = 0; j <= -1 + N; j += 1) /*@ BEGIN(nest1=Nest) @*/
        for (i = 0; i <= -1 + M; i += 1) /*@ BEGIN(nest3=Nest) @*/
        {
            C[j * ldc + i] = beta * C[j * ldc + i];
            for (l = 0; l <= -1 + K; l += 1) /*@ BEGIN(nest2=Nest) @*/
                C[j * ldc + i] = C[j * ldc + i] + alpha * A[(l * lda) + i] * B[(j * ldb) + l];
        }
}

Figure 2: Matrix multiplication code with POET annotations

The merit of our overall approach lies in its unique integration of programmable control by developers, automated optimization by compilers, and empirical tuning of the optimization space by search engines. It permits different levels of automation and programmer intervention, from fully automated tuning to semi-automated optimization to fully programmable control. The auto-generated POET transformations are extensively parameterized so that each optimization can be turned on or off independently for each relevant array or code region, and arbitrary integers can be given as the blocking or unrolling factor for each loop being transformed. The granularity of external control is far beyond that supported by existing iterative compilation frameworks [1, 2, 3, 4, 5]. Independent search engines can be substituted with ease, and developers can easily intervene by modifying the auto-generated POET scripts.

We have used our framework to apply six highly interactive optimizations: parallelization via OpenMP, cache blocking, array copying, unroll-and-jam, scalar replacement, and loop unrolling. The configuration parameters of these optimizations form a considerably large, complex, and unpredictably interacting multi-dimensional space. To study this optimization space, we have developed a transformation-aware (TA) search algorithm based on a number of commonly adopted compiler heuristics. Our goal is to discover important patterns in how compiler optimizations interact with each other and to gain insight into how to significantly reduce empirical search time without sacrificing application performance. Our main contributions include the following.

• We propose an empirical tuning framework to support extensive parameterization and flexible tuning of architecture-sensitive compiler optimizations.
• We study the configuration space of six highly interactive optimizations and present results of using commonly adopted compiler heuristics to explore this space. Our study provides answers to questions such as whether a large and complex configuration space of interacting optimizations can be effectively explored by tuning one optimization at a time, how various optimizations should be ordered within such a search strategy, what heuristics can be used to efficiently explore the configuration space of each optimization, and how sensitive the various heuristics are when facing unpredictable interactions between different optimizations.

The rest of this paper is organized as follows. Sections 2 and 3 present an overview of our tuning framework and our transformation-aware search algorithm. Section 4 presents the experimental design and results of applying our framework to tune several linear algebra routines. Sections 5 and 6 present related work and conclusions.

2. Tuning Infrastructure

Our tuning framework is shown in Figure 1 and includes two main components: the ROSE analysis engine, which we built by extending the ROSE loop optimizer [6] to automatically produce POET scripts as output without affecting how it works otherwise [7], and the POET transformation engine, which includes the POET language interpreter and a configuration search engine. The ROSE analysis engine analyzes the source code of an input program, discovers optimization opportunities, and then produces two outputs: a slightly modified source code in which POET annotations are inserted to tag the code regions to be transformed, and a parameterized POET transformation script which can be invoked to apply a long sequence of program transformations with different optimization configurations. Figure 3 shows the skeleton of a POET script auto-generated by our ROSE optimizer after analyzing the code in Figure 2, which includes POET annotations automatically inserted by the optimizer to tag various code regions.

[Figure 3 is only partially recoverable from the extracted text. The auto-generated script begins with "include opt.pi" (line 1); declares command-line tuning parameters with descriptive messages such as "number of threads to parallelize loop nest1", "block size to run by each thread for nest1", and "Blocking factor for loop nest nest1"; and defines the transformations <*OMP parallelization*>, <*loop blocking*>, <*copy array A*>, <*copy array B*>, <*copy array C*>, <*unroll & jam*>, <*scalarRepl A*>, <*scalarRepl B*>, and <*scalarRepl C*>, which are later applied in sequence.]

Figure 3: Auto-generated POET scripts for Figure 2

2.1. The Analysis Engine

Within our framework, the ROSE optimizing compiler has essentially delegated the actual program transformations to POET, and it only modifies the input code to tag code regions which may be transformed later. Details of how we adapt the ROSE optimizer to ensure both the correctness and safety of the auto-generated POET scripts are presented in [7] and are beyond the scope of this paper. The auto-generated POET script can be modified by a developer if necessary to change the ordering of transformations or to integrate additional domain-specific optimizations. The final script, together with the tagged input source code, can then be ported to a variety of different machines and empirically tuned.

2.2. The POET Language

POET is an interpreted program transformation language designed for parameterizing general-purpose compiler optimizations for auto-tuning [8]. To optimize an input program, a POET script needs to specify exactly which input files to parse using which language syntax descriptions, what transformations to apply to the input code after parsing, and how to invoke each transformation. Each POET script can be extensively parameterized, and values for the parameters can be flexibly reconfigured via command-line options when invoking the POET interpreter. For example, line 3 of Figure 3 parses the matrix multiplication code in Figure 2 using the C syntax descriptions specified in file Cfront.code and then stores the resulting AST to a global variable named target. The output command at line 37 serves to unparse the optimized AST to standard output. The inclusion of file opt.pi at line 1 ensures that the POET opt library, which supports a large collection of compiler transformations, can be invoked by the given script.

POET provides strong programming support for flexibly combining a long sequence of heavily parameterized program transformations. In Figure 3, the 11 optimizations that will later be applied to transform the input code are defined at lines 25-35 using the DELAY operator. The configurations of all transformations are extensively controlled by the command-line parameters declared at lines 4-14. Lines 26-36 then apply the 11 pre-defined transformations one after another using the APPLY operator, providing developers a clear view of all the potential optimizations that the compiler has discovered. Developers can modify the optimization decisions made by the compiler if necessary, e.g., by adjusting the order in which the different transformations are applied at lines 26-36 or by adding additional optimizations.

2.3. The Search Engine

Our search engine in Figure 1 works with the POET language interpreter to automatically explore the optimization configuration space defined in an auto-generated POET script. It orchestrates the whole tuning process by iteratively determining what parameter values to use to properly configure each optimization, invoking the POET interpreter with the parameter values, compiling the optimized code using a vendor compiler (e.g., gcc), running the compiled code, and evaluating the empirical feedback to guide the future search. Our search engine currently uses the search algorithm described in Section 3. However, since the optimization space is explicitly made available for external control, alternative search algorithms can easily be substituted.

Input: tuneParams: tuning parameters declared in the POET script;
Output: config_res: a set of configurations found for tuneParams;
Algorithm:
Step 1: /* initialization */
    cur_config = new configuration(tuneParams);
    For (each parameter p in tuneParams):
        set cur_config(p) = default_value(p);
    Group tuneParams by the loop nests they optimize;
    opts = {parallelization, blocking, inner unroll, unroll&jam, array copy, scalar repl};
    cur_opt = first_entry(opts);
Step 2: /* Set up the current tuning space. */
    Set cur_tune = ∅;
    For (each config in config_res and each loop nest L being optimized by config):
        cur_tune = cur_tune ∪ { gen_tune_space(tuneParams(L), cur_opt, config) };
    cur_config = gen_first_config(cur_tune);
Step 3: /* Apply and evaluate each optimization configuration */
    Invoke the POET transformation engine with cur_config;
    Verify the correctness of the optimized code;
    cur_score = Evaluate the optimized code on the hardware machine;
Step 4: /* Modify config_res if necessary */
    If (cur_score is better than or close to those in config_res):
        config_res = config_res ∪ {(cur_config, cur_score)};
    If (cur_score is better than those in config_res):
        Eliminate weak configurations from config_res;
        config_res = {cur_config};
Step 5: /* Try the next configuration of cur_tune */
    cur_config = gen_next_config(cur_config, cur_score, cur_tune);
    If (cur_config ≠ null): go to Step 3;
Step 6: /* Try to tune the next optimization */
    cur_opt = next_entry(cur_opt, opts);
    If (cur_opt ≠ null): go to Step 2;
Step 7: return config_res; /* return result */

Figure 4: The transformation-aware search algorithm
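As a concrete illustration of the orchestration loop just described, the following C sketch shows the shape of a single configuration evaluation. All names here are our own assumptions, not part of the paper: a hypothetical POET interpreter executable pcg with an invented command-line syntax, a hypothetical script dgemm.pt, and a timing driver dgemm_run that prints the achieved MFLOPS to stdout. The authors' actual driver is written in Perl and will differ in detail.

/* Evaluate one optimization configuration: regenerate the optimized source,
 * compile it, run it, and return the measured MFLOPS (-1.0 on failure). */
#include <stdio.h>
#include <stdlib.h>

static double evaluate_config(const char *param_bindings) {
    char cmd[1024];

    /* 1. Invoke the POET interpreter with the chosen parameter values.
     *    Executable name, option syntax, and script name are illustrative. */
    snprintf(cmd, sizeof(cmd), "pcg %s dgemm.pt > dgemm_opt.c", param_bindings);
    if (system(cmd) != 0) return -1.0;

    /* 2. Compile with the vendor compiler (-O2, as in Section 4.1). */
    if (system("gcc -O2 -fopenmp dgemm_opt.c timer.c -o dgemm_run") != 0)
        return -1.0;

    /* 3. Run the timing driver and read back the reported MFLOPS. */
    FILE *out = popen("./dgemm_run", "r");
    if (out == NULL) return -1.0;
    double mflops = -1.0;
    if (fscanf(out, "%lf", &mflops) != 1) mflops = -1.0;
    pclose(out);
    return mflops;
}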

2.4. The Overall Infrastructure

Our tuning infrastructure currently supports the following six optimizations.

• Loop parallelization via OpenMP, where blocks of iterations of an outermost loop are allocated to different threads to evaluate. The optimization is parameterized by the number of threads to run in parallel and the size of each iteration block to allocate to different threads.
• Loop blocking for cache locality, where iterations of a loop nest are partitioned into smaller blocks so that data accessed within each block can be reused in the cache. The optimization is parameterized by the blocking factor for each dimension of the loop nest.
• Array copying and strength reduction, where selected arrays accessed within a blocked loop nest are copied into a separate buffer to avoid cache conflict misses, and strength reduction is applied to reduce the cost of array address calculation. For each array, the optimization is parameterized with a three-way switch to turn on both array copying and strength reduction (switch=1), strength reduction only (switch=2), or neither (switch=0).
• Loop unroll-and-jam, where, given a loop nest, selected outer loops are unrolled by a small number of iterations, and the unrolled iterations are jammed inside the innermost loop to promote register reuse. It is parameterized by the number of loop iterations unrolled (the unroll factor) for each outer loop.
• Scalar replacement combined with strength reduction, where array references are replaced with scalar variables when possible to promote register reuse. The configuration of scalar replacement is similar to that of array copying.
• Loop unrolling, where an innermost loop is unrolled by a number of iterations to create a larger loop body. The optimization is parameterized by the loop unrolling factor (i.e., the number of iterations unrolled).

Each POET script applies the above optimizations in the order in which they are discussed above. The set of standardized parameter declarations, illustrated at lines 4-14 of Figure 3, is automatically extracted by the search engine and used as input to the transformation-aware search algorithm to determine proper configurations for the parameters.

Compared to existing iterative compilation frameworks, our infrastructure offers better modularity, flexibility, and portability, as compiler optimizations are completely opened up for programmable control by developers and tuning by independent search engines. The optimizing compiler does not need to reside on the same machine that the user application is optimized for, and an explicit parameter space can be tuned using arbitrary independent search engines.
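To make the combined effect of these optimizations concrete, the following hand-written C sketch shows one plausible shape of the Figure 2 loop nest after OpenMP parallelization, cache blocking with a shared factor BF on the j and l loops, scalar replacement of C, and innermost loop unrolling by 2. The names dgemm_tuned, BF, and NTHREADS, the chosen factors, and the exact loop structure are our illustration only; the actual POET-generated code differs in detail.

enum { BF = 72, NTHREADS = 8 };   /* illustrative blocking factor and thread count */

void dgemm_tuned(int M, int N, int K, double alpha, const double *A, int lda,
                 const double *B, int ldb, double beta, double *C, int ldc)
{
    int i, j, l, jj, ll;
    /* nest1 (j) is blocked and its block loop is parallelized across threads */
    #pragma omp parallel for num_threads(NTHREADS) private(i, j, l, ll)
    for (jj = 0; jj < N; jj += BF)
        for (ll = 0; ll < K; ll += BF) {            /* nest2 (l) is blocked */
            int jmax = jj + BF < N ? jj + BF : N;
            int lmax = ll + BF < K ? ll + BF : K;
            for (j = jj; j < jmax; j++)
                for (i = 0; i < M; i++) {
                    /* scalar replacement of C[j*ldc+i]; the beta scaling is
                     * performed only on the first l-block */
                    double c0 = (ll == 0) ? beta * C[j * ldc + i]
                                          : C[j * ldc + i];
                    for (l = ll; l + 2 <= lmax; l += 2) {   /* unrolled by 2 */
                        c0 += alpha * A[l * lda + i] * B[j * ldb + l];
                        c0 += alpha * A[(l + 1) * lda + i] * B[j * ldb + l + 1];
                    }
                    for (; l < lmax; l++)                   /* remainder loop */
                        c0 += alpha * A[l * lda + i] * B[j * ldb + l];
                    C[j * ldc + i] = c0;
                }
        }
}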

3. The Transformation-Aware Search Algorithm

Figure 4 shows our transformation-aware search algorithm (implemented in Perl), which takes as input a collection of POET tuning parameters and returns a set of desirable configurations for these parameters as the result of empirical tuning. In contrast to generic search algorithms, which look for the maxima/minima of a multi-dimensional generic space, our search algorithm is optimization-specific in that it has full knowledge of how each POET parameter is used to control the optimizations, and it tunes their configurations in a deliberate fashion through the following steps.

Step 1. The algorithm initializes each optimization parameter with a default value given by the POET script and groups all the parameters by the loop nests that they optimize. Each optimization is then tuned independently one after another in a predetermined order. Note that the order of tuning individual optimizations when exploring their configuration space is determined by the TA search algorithm and is different from the order of applying these optimizations in the POET scripts (discussed in Section 2.4). In Figure 4, the default tuning order is defined when initializing the variable opts at Step 1 and is based on the following strategies.

• Loop parallelization determines the overall data size operated on by each thread and thus needs to be tuned before all the other sequential optimizations.
• Architecture-sensitive optimizations such as loop blocking, unrolling, and unroll&jam should be tuned before more predictable optimizations such as scalar replacement (almost always beneficial) and array copying (rarely beneficial due to its high overhead).
• Optimizations with more significant impact should be tuned early. For example, loop blocking is tuned immediately after parallelization, as it critically determines whether data can be reused in the cache and thus impacts how the other sequential optimizations should be configured.

Section 4 evaluates the effectiveness of the above strategies together with other varying heuristics.

Steps 2-7. These steps repetitively tune each optimization following the predetermined tuning order, where the variable cur_opt keeps track of the current optimization being tuned, and config_res keeps track of the group of best optimization configurations found so far. In particular, Step 2 generates a new tuning space (cur_tune) by expanding each item in config_res with a set of new configurations to tune for cur_opt. Step 3 invokes the POET transformation engine with each new configuration in cur_tune and then collects empirical feedback from running the optimized code. Step 4 modifies config_res, the set of desirable optimization configurations, based on the performance feedback of running each configuration in cur_tune. Step 5 ensures that all necessary configurations in cur_tune are experimented with. Step 6 ensures that all optimizations have been tuned. Finally, Step 7 returns config_res as the result. Note that Steps 5 and 6 can skip optimizations that are known to have a negative impact based on previous experiments.

Summary. Our TA search algorithm essentially tunes the configurations of a number of optimizations one after another, where optimizations that have a bigger performance impact are evaluated before the less significant ones. By tuning each optimization independently of the others, our search algorithm allows us to easily experiment with different heuristics for tuning each optimization. For example, to reduce tuning time, the default search algorithm uses the same blocking factor for all the dimensions of a loop nest when tuning cache blocking and uses a user-specified increment (by default, 16) to select different blocking factors to try. Further, at Step 4 of the algorithm, we limit the number of top configurations in config_res to be less than a user-specified small constant (by default, at most 10 configurations are kept in config_res).
Since only a small constant number of best configurations is kept after tuning each optimization, the overall tuning time is proportional to the sum of the times needed to tune each optimization independently. The algorithm is therefore fairly efficient and requires only a small number of iterations to terminate. Section 4.2 studies the effectiveness and performance tradeoffs of these heuristics.
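As one reading of the Step 4 bookkeeping, the following minimal C sketch maintains the bounded set of top configurations. The data structures and names (Config, ConfigSet, update_config_res) are hypothetical, the default bound of 10 comes from the text above, and interpreting "close" as within 10% of the best score follows Section 4.2; this is our sketch, not the authors' Perl code.

#define MAX_KEEP 10          /* user-specified bound on kept configurations */
#define CLOSE_RATIO 0.90     /* "close": within 10% of the best MFLOPS */

typedef struct { int params[16]; double score; } Config;
typedef struct { Config kept[MAX_KEEP]; int n; double best; } ConfigSet;

/* Step 4: add cur to the result set if it is better than or close to the
 * best score seen so far, evicting entries that fall out of the window. */
static void update_config_res(ConfigSet *s, const Config *cur) {
    if (cur->score < CLOSE_RATIO * s->best) return;      /* too weak: drop */
    if (cur->score > s->best) {
        s->best = cur->score;
        int m = 0;                            /* re-filter the kept set */
        for (int k = 0; k < s->n; k++)
            if (s->kept[k].score >= CLOSE_RATIO * s->best)
                s->kept[m++] = s->kept[k];
        s->n = m;
    }
    if (s->n < MAX_KEEP) {
        s->kept[s->n++] = *cur;
    } else {
        int w = 0;                            /* replace the weakest entry */
        for (int k = 1; k < s->n; k++)
            if (s->kept[k].score < s->kept[w].score) w = k;
        if (cur->score > s->kept[w].score) s->kept[w] = *cur;
    }
}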

4. Experimental Results

Our empirical tuning framework has essentially exposed all the optimization decisions made by a compiler for external control, by having the compiler produce a collection of parameterized program transformations in the POET language. While arbitrary search engines can be used to explore the optimization configuration space, generic search algorithms such as simulated annealing [9] are likely to get stuck at local minima/maxima due to unpredictable interactions between the large number of different architecture-sensitive optimizations. In contrast, conventional compiler heuristics for selecting configurations of these optimizations have been fairly effective for a large number of applications and may be similarly effective when used in empirical tuning. This section evaluates the effectiveness and performance tradeoffs of various optimization-specific heuristics when used in our transformation-aware search algorithm, described in Section 3. In particular, we aim to provide insights into the following open questions.

Optimization     Search space for each dimension        Default   Additional constraints
parameter        gemm        gemv        ger            value     if applied to the same loop
omp-thread       (1,16)      (1,16)      (1,16)         1
omp-block        (16,256)    (16,256)    (16,256)       72
cache-block      (1,128)^3   (1,128)^2   (1,128)^2      72        omp-block % cache-block = 0
unroll-jam       (1,16)^2    (1,16)      (1,16)         1         cache-block % unroll-jam = 0
loop-unrolling   (1,32)      (1,32)      (1,32)         1         cache-block % loop-unrolling = 0
array-copy       {0,1,2}^3   {0,1,2}^2   {0,1,2}^2      2         array-copy > 0 if blocking is applied
scalar-repl      {0,1,2}^3   {0,1,2}^2   {0,1,2}^2      1

(l,u): every integer i such that l ≤ i ≤ u; {0,1,2}: the three integers 0, 1, and 2.

Table 1: The optimization search space of gemm, gemv and ger
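The constraints in the last column of Table 1 act as a validity filter on candidate configurations. A small C sketch of such a filter for gemm is shown below; the struct, field names, and the reading of "blocking is applied" as a cache-block factor greater than 1 are our own assumptions, since the paper does not show the search engine's internal representation.

#include <stdbool.h>

typedef struct {
    int omp_thread, omp_block, cache_block, unroll_jam, loop_unroll;
    int array_copy, scalar_repl;        /* three-way switches: 0, 1, or 2 */
} GemmConfig;

/* Reject configurations that violate the Table 1 constraints. */
static bool config_is_valid(const GemmConfig *c) {
    if (c->omp_block % c->cache_block != 0) return false;     /* omp-block % cache-block = 0 */
    if (c->cache_block % c->unroll_jam != 0) return false;    /* cache-block % unroll-jam = 0 */
    if (c->cache_block % c->loop_unroll != 0) return false;   /* cache-block % loop-unrolling = 0 */
    if (c->cache_block > 1 && c->array_copy == 0) return false; /* array-copy > 0 if blocked */
    return true;
}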

• How effective are the various strategies used in our search algorithm, and what are their performance tradeoffs? Is tuning one optimization at a time effective, and how should the various optimizations be ordered?
• How sensitive are the various heuristics when facing complex interactions between optimizations? What are the common groups of interacting optimizations? What alternative heuristics can be used?

Note that while an exhaustive search of the entire configuration space could extract the best performance possible using our optimizations, it is out of the question due to the astronomical amount of time required. As an alternative, we significantly increase the number of configurations tried within the sub-dimensions of an optimization to evaluate the performance impact of heuristics specific to that optimization. To model interactions among multiple optimizations, we compare the default heuristics adopted by the TA search with alternative, more expensive strategies. While a promising optimization configuration could still be pruned prematurely by the more expensive strategies, the comparison illustrates the cost-effectiveness of our heuristics in terms of achieving relatively high performance without requiring overly prolonged tuning time.
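To see why an exhaustive search is out of the question, consider a rough count of the gemm space in Table 1 (our own arithmetic, ignoring the divisibility constraints, which would shrink but not tame the space):

16 (omp-thread) × 241 (omp-block) × 128^3 (cache-block) × 16^2 (unroll-jam) × 32 (loop-unrolling) × 3^3 (array-copy) × 3^3 (scalar-repl) ≈ 4.8 × 10^16 configurations.

Even at one second per empirical evaluation, enumerating this space would take over a billion years.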

4.1. Experimental Design

We evaluate our framework using three matrix computation kernels: gemm (matrix-matrix multiply), gemv (matrix-vector multiply), and ger (vector-vector multiply). We have selected these benchmarks for two reasons.

• All of them are computationally intensive, and their efficiency can be improved significantly via the collection of source-to-source optimizations (see Section 2.4) supported by our framework.
• They vary significantly in their computation/data-access ratios. In particular, gemm is compute-bound, as it reuses every data item a large number of times during evaluation; gemv is memory-bound, as only a small fraction of the data is reused; ger is severely memory-bound, as no data item is reused in the computation. Consequently, these benchmarks are expected to be representative of different behaviors demonstrated by scientific codes.

We automatically generated the POET script to optimize each code using our ROSE analysis engine [7]. Each benchmark is tuned using both small (100×100) and large (1000×1000) matrices as input data. Table 1 shows the optimization search space for each of the benchmark kernels. The search space and default values for the tuning parameters are determined by combining optimization-specific knowledge with architectural parameters (e.g., L1/L2/L3 cache capacity and the number of processing cores) of the underlying machine.

We tuned each benchmark on two multi-core machines: a quad-core machine running Linux with two dual-core 3 GHz AMD Opteron 2222 processors (each with 128KB L1 and 1MB L2 cache per core), and an eight-core machine running MacOS with two quad-core 3 GHz Intel processors (each with 32KB L1 cache per core and a unified 4MB L2 cache). All benchmarks are compiled with the -O2 option using gcc 4.2.4 on the AMD machine and gcc 4.0.1 on the Intel machine. We used the -O2 instead of the -O3 option to prevent gcc from applying overly aggressive loop optimizations to our already heavily optimized code. Each optimized code is first tested for correctness and then linked with its timing driver, which sets up the execution environment, repetitively invokes the optimized routine a pre-configured number of times to ensure that the evaluation time is always above the clock resolution, and then reports the elapsed time and the MFLOPS achieved across multiple runs of the target routine. For this paper, all the search heuristics use the reported MFLOPS as performance feedback from empirical evaluation.
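The following C sketch illustrates the timing-driver behavior just described, assuming the standard 2*M*N*K flop count of gemm and square n-by-n inputs. The function name, the 0.5-second threshold, and the doubling strategy are our assumptions; the paper's actual driver is not shown and may differ.

#include <omp.h>   /* omp_get_wtime: wall-clock timer suitable for OpenMP code */

/* the optimized routine under test, as in Figure 2 */
extern void dgemm_test(const int M, const int N, const int K, const double alpha,
                       const double *A, const int lda, const double *B,
                       const int ldb, const double beta, double *C, const int ldc);

/* Repeat the kernel until the elapsed time is well above the clock
 * resolution, then report MFLOPS averaged over all repetitions. */
double measure_mflops(int n, const double *A, const double *B, double *C) {
    int reps = 1;
    double elapsed = 0.0;
    while (elapsed < 0.5) {                 /* 0.5s: illustrative threshold */
        reps *= 2;
        double start = omp_get_wtime();
        for (int r = 0; r < reps; r++)
            dgemm_test(n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        elapsed = omp_get_wtime() - start;
    }
    double flops = 2.0 * n * n * n * reps;  /* 2*M*N*K per invocation */
    return flops / elapsed / 1e6;           /* MFLOPS, as reported to the search */
}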

4.2. Evaluating Optimization-specific Heuristics

Our TA search algorithm uses three heuristics to significantly prune the tuning space: tuning individual optimizations one after another in a predetermined order, significant pruning of the cache blocking optimization space, and maintaining a small constant number of top configurations after tuning each optimization.

[Figure 5: Best performance achieved by different TA search heuristics. Two panels plot MFLOPS on the 8-core Intel and 4-core AMD machines for gemm, gemv, and ger with small and large inputs, comparing the strategies below.]

TA-default: use the default TA search heuristics;
TA-fineBlock: try consecutive blocking factors in increments of 2 instead of 16;
TA-recBlock: allow different dimensions of a loop nest to have distinct cache blocking factors;
TA-tune3: tune each optimization three times in a round-robin fashion;
TA-Par-Block: tune parallelization and cache blocking together as a group;
TA-Block-UJ: tune cache blocking and loop unroll-and-jam together as a group.

# of evals        TA-default                TA-fineBlock              TA-recBlock
(small/large)  gemm      gemv     ger       gemm     gemv     ger      gemm      gemv     ger
8-core Intel   338/392   171/258  165/260   359/811  235/644  237/681  1093/4151 248/656  247/659
4-core AMD     367/383   222/246  235/246   502/748  124/649  192/640  3260/4141 165/687  692/665

# of evals        TA-tune3                  TA-Par-Block              TA-Block-UJ
(small/large)  gemm      gemv     ger       gemm     gemv     ger      gemm      gemv     ger
8-core Intel   901/753   428/685  335/642   384/456  215/312  191/303  353/1025  184/415  184/429
4-core AMD     1035/735  508/639  630/642   409/407  200/303  158/289  521/1031  174/419  152/418

Table 2: Number of configurations tried by different search heuristics

These heuristics are based on common practice (e.g., the search space for blocking is too large without significant pruning), domain-specific knowledge (e.g., parallelization should be tuned before sequential optimizations), and extensive experiments (e.g., different values were used to limit the number of top configurations before settling on the most cost-effective one). To evaluate the effectiveness of these search heuristics, Figure 5 compares the best performance achieved via these heuristics with that achieved by alternative, more expensive search strategies. Table 2 shows the number of different optimization configurations tried by each search strategy.

Tuning cache blocking. By default, when multiple loops are blocked for cache locality in a loop nest, our TA search algorithm assigns the same blocking factor to all loop dimensions. Further, it increases the current blocking factor by 16 each time to find the next value to try (see the sketch at the end of this discussion). In Figure 5, these heuristics are implemented by the TA-default search. As shown in Figure 5, the performance loss from these heuristics is minimal in most cases when compared with alternatively using a finer-grained blocking factor increment (TA-fineBlock) or supporting a different blocking factor for each loop dimension (TA-recBlock). Using the same blocking factor for all dimensions of a loop nest is effective because most of the loops within a single nest are symmetric, e.g., they typically access different dimensions of an array but behave similarly otherwise. The three matrix kernels we picked certainly demonstrate this property. Although we may miss more desirable configurations by dramatically reducing the number of different blocking factors to try, the likelihood is low, as minor differences in the cache block size are usually insignificant.

Table 2 shows that the number of optimization configurations tried by the TA search increases significantly when using the alternative, more expensive strategies (TA-fineBlock and TA-recBlock), except for gemv and ger using small matrices (100×100), where cache blocking has minimal performance impact due to the lack of data reuse. Note that the wildly differing numbers of configurations tried by the TA search are due to the dynamic pruning of top configurations after tuning each optimization, discussed in Section 3.

In Figure 5, the more expensive strategies for tuning cache blocking have resulted in noticeably better performance in a few cases, e.g., using TA-recBlock for gemm-large and using TA-fineBlock for gemv-small. However, the performance difference is actually not a direct result of better tuning for cache blocking. From Figure 6, which shows the best performance achieved by the alternative strategies after tuning each optimization, the performance achieved immediately after tuning cache blocking is almost identical across the different strategies, and the ultimate performance difference is the result of later interactions between cache blocking and other optimizations, specifically array copying (gemm-large and gemm-small) and unroll-and-jam (ger-small and gemv-small).
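The default enumeration of blocking factors referenced above is simple enough to state in a few lines of C; the function name and bounds handling are our own sketch, with the upper bound taken from Table 1's (1, 128) range:

/* Advance the shared blocking factor used for all dimensions of a nest.
 * increment is 16 under TA-default and 2 under TA-fineBlock; returns -1
 * when the upper bound (128 for the matrix kernels) is exceeded. */
static int next_block_factor(int cur, int increment, int upper_bound) {
    int next = (cur <= 1) ? increment : cur + increment;
    return (next <= upper_bound) ? next : -1;
}

Under TA-recBlock, this single shared factor would instead become a vector of per-dimension factors, multiplying the number of candidates for gemm by roughly the cube of the per-dimension count, which is consistent with the blow-up visible in Table 2.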

[Figure 6: Best performance achieved at different optimization tuning phases. Two panels plot MFLOPS on the 8-core Intel and 4-core AMD machines as each tuning phase completes (parallel, block, copy-array, unr-jam, scalar-repl, unroll), comparing the default, recBlock, and fineBlock strategies on gemm-large, gemm-small, gemv-small, and ger-small.]

Ordering of tuning different optimizations. By default, our TA search algorithm tunes each optimization independently one after another, in the order enumerated by the horizontal axis of Figure 6. This default tuning order was determined after experimenting with various alternative orderings. Although no particular order is optimal, the one in Figure 6 has been highly effective when reasonable initial configurations are given for each optimization and when a sufficient number of top configurations is maintained after tuning each optimization. To evaluate the effectiveness of our optimization tuning order, Figure 5 compares the best performance it achieved with that achieved by alternatively tuning each optimization multiple (3) times in a round-robin fashion (TA-tune3). For all the benchmarks, the top configurations have stabilized after tuning each optimization twice. From Figure 5, the performance improvement from the extra round of tuning is mostly minor, except for gemm-small on the 4-core AMD and gemv-large on the 8-core Intel, where interactions between optimizations have resulted in dramatically different top configurations being selected. Note that the tuning time (reflected by the number of trial evaluations) also increases significantly to find the better performance in these cases, as shown in Table 2.

Interactions among optimizations. After tuning each optimization, our TA search selects at most 10 top configurations that are within 10% of each other in performance. While significantly reducing tuning time, this strategy sometimes fails to select a promising configuration that could result in dramatically better performance when a later optimization is tuned. These configurations can be recovered if the interacting optimizations are tuned together. We have experimented with grouping various optimizations to be tuned together and have identified three optimizations, OpenMP parallelization, cache blocking, and loop unroll-and-jam, as the most likely to interact with each other. These optimizations target multi-threading, memory, and register-level performance respectively, and their configurations often need to be coordinated to be effective. Further, loop unroll-and-jam may interact with innermost loop unrolling, both of which impact register-level performance. Cache blocking may occasionally interact with array copying, as shown in Figure 6 when different loop dimensions are given distinct blocking factors. Due to space constraints, Figure 5 shows only the interactions among parallelization, blocking, and unroll-and-jam.

In Figure 5, TA-Par-Block shows the result of tuning parallelization and blocking together, which dramatically enhanced the performance of gemm using small matrices. TA-Block-UJ shows the result of tuning blocking and unroll-and-jam together, which made a noticeable performance improvement for gemv-large on the 8-core Intel machine. From Table 2, grouping these optimizations together does not significantly increase tuning time except for gemm using large matrices, so the groupings are fairly cost-effective. However, grouping blocking and unroll-and-jam together can occasionally degrade performance due to interactions with loop unrolling.
In particular, TA-Block-UJ performed significantly worse than TA-default for gemm-large on the 4-core AMD because the best configurations from the combined tuning resulted in loop unrolling being turned off in the end. Note that loop unrolling was tuned in between blocking and unroll-and-jam in TA-default but was moved to go after unroll-and-jam in TA-Block-UJ.

5. Related Work

The initial design of the POET language was published by Yi et al. [8]. Yi and Whaley demonstrated that, by manually writing POET scripts to optimize several linear algebra kernels, they could achieve performance comparable to that achieved by manually written assembly in ATLAS [10]. Yi [7] extended a source-to-source optimizing compiler, the ROSE loop optimizer [6], to automatically produce parameterized POET scripts. Rahman, Guo, and Yi [11] used a similar auto-tuning infrastructure to tune the power consumption of applications. This paper extends the work by Yi [7] to present the empirical tuning infrastructure as a whole and to investigate the effectiveness of different search heuristics in exploring the complex configuration space of the auto-generated POET scripts.

Empirical performance tuning has been used to build many successful scientific libraries, including ATLAS [12], PHiPAC [13], OSKI [14], FFTW [15], and SPIRAL [16], among others, which use specialized kernel generators to parameterize and orchestrate differently optimized code. More recent research on iterative compilation has empirically modified the configurations of general-purpose compiler optimizations based on performance feedback [1, 2, 3, 4, 5]. While most of these compilers support parameterization of architecture-sensitive optimizations so that these optimizations can be reconfigured based on performance feedback, the parameterization is typically internal to the compiler and cannot be easily controlled externally by developers or independent search engines. The work by Hall et al. [17] allows developers or search engines to provide a sequence of loop transformation recipes to guide transformations performed by an optimizing compiler. The X language [18] uses C/C++ pragmas to guide the application of a pre-defined collection of compiler optimizations. Instead of asking developers or search engines to guide transformations applied by a compiler, we adapt an optimizing compiler to output its optimization transformations to be perused by developers or independent search engines. The degree of parameterization in our auto-generated POET scripts is much more extensive than that supported by other existing approaches.

Previous auto-tuning research has adopted a wide variety of search algorithms, including both optimization-specific algorithms that are custom-made for a tuning framework [12, 13, 14, 19] and generic algorithms that are oblivious to the optimizations being tuned [20, 2, 21, 22], combined with model-driven search where compiler models are used to prune the space before tuning [23, 2, 24, 25]. Seymour, You, and Dongarra [22] studied the relative efficiency of six different generic search algorithms in terms of their ability to find the best-performing candidates under varying time limits. We focus on studying a transformation-aware search algorithm and investigate the effectiveness of various optimization-specific heuristics when used to explore the configuration space of a large number of optimizations that interact with each other in unpredictable ways, using knowledge typically available within a compiler.

Static performance models have been used in both domain-specific tuning frameworks [19, 26] and general-purpose iterative compilation [24, 27, 3] to improve the efficiency of tuning. Chen, Chame, and Hall [24] used models within a compiler to prune the search space before using generic search algorithms to tune memory optimizations such as tiling, unroll-and-jam, array copying, and scalar replacement.
Recent research has adopted predictive modeling from machine learning to statically build optimization models from a representative training set of programs [28, 29]. The learned models are then used to automatically determine optimization configurations for future applications without any additional tuning. We similarly aim to identify important patterns of interacting optimizations to effectively prune the search space, but our focus is on identifying the performance tradeoffs of various optimization-specific heuristics.

6. Conclusions and Future Work

This paper presents a modular framework to support extensive parameterization and tuning of architecture-sensitive optimizations. Within our framework, the optimization decisions of compilers are output as parameterized scripts in a program transformation language, POET, so that developers have full programmable control over the auto-generated scripts, and independent search engines can be used to explore the optimization configuration space. We have used our framework to apply six highly interactive optimizations (parallelization via OpenMP, cache blocking, array copying, unroll-and-jam, scalar replacement, and loop unrolling) and to study the effectiveness of a number of commonly adopted compiler heuristics in exploring their configuration space. Our framework can be applied to similarly optimize other regular scientific kernels such as triangular dense matrix solvers and stencil computations. Supporting irregular applications such as sparse-matrix computations and graph algorithms would require a different set of optimizations to be integrated within the ROSE analysis engine and the POET transformation engine, which we leave to future work.

References

[1] T. Kisuki, P. Knijnenburg, M. O'Boyle, H. Wijshoff, Iterative compilation in program optimization, in: Compilers for Parallel Computers, 2000, pp. 35–44.
[2] A. Qasem, K. Kennedy, J. Mellor-Crummey, Automatic tuning of whole applications using direct search and a performance-based transformation system, The Journal of Supercomputing 36 (2) (2006) 183–196.
[3] N. Baradaran, J. Chame, C. Chen, P. Diniz, M. Hall, Y.-J. Lee, B. Liu, R. Lucas, ECO: An empirical-based compilation and optimization system, in: International Parallel and Distributed Processing Symposium, 2003.
[4] G. Pike, P. Hilfinger, Better tiling and array contraction for compiling scientific programs, in: SC, Baltimore, MD, USA, 2002.
[5] Z. Pan, R. Eigenmann, Fast automatic procedure-level performance tuning, in: Proc. Parallel Architectures and Compilation Techniques, 2006.
[6] Q. Yi, D. Quinlan, Applying loop optimizations to object-oriented abstractions through general classification of array semantics, in: The 17th International Workshop on Languages and Compilers for Parallel Computing, West Lafayette, Indiana, USA, 2004.
[7] Q. Yi, Automated programmable control and parameterization of compiler optimizations, in: CGO: ACM/IEEE International Symposium on Code Generation and Optimization, 2011.
[8] Q. Yi, K. Seymour, H. You, R. Vuduc, D. Quinlan, POET: Parameterized optimizations for empirical tuning, in: Workshop on Performance Optimization for High-Level Languages and Libraries, 2007.
[9] S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983) 671–680.
[10] Q. Yi, C. Whaley, Automated transformation for performance-critical kernels, in: ACM SIGPLAN Symposium on Library-Centric Software Design, 2007.
[11] S. F. Rahman, J. Guo, Q. Yi, Automated empirical tuning of scientific codes for performance and power consumption, in: HiPEAC: High-Performance and Embedded Architectures and Compilers, Heraklion, Greece, 2011.
[12] R. C. Whaley, A. Petitet, J. Dongarra, Automated empirical optimizations of software and the ATLAS project, Parallel Computing 27 (1) (2001) 3–25.
[13] J. Bilmes, K. Asanovic, C.-W. Chin, J. Demmel, Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, in: Proc. the 11th International Conference on Supercomputing, ACM Press, New York, NY, USA, 1997, pp. 340–347. doi:10.1145/263580.263662.
[14] R. Vuduc, J. Demmel, K. Yelick, OSKI: An interface for a self-optimizing library of sparse matrix kernels, 2005.
[15] M. Frigo, S. Johnson, FFTW: An adaptive software architecture for the FFT, in: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 3, 1998, p. 1381.
[16] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, K. Chen, R. W. Johnson, N. Rizzolo, SPIRAL: Code generation for DSP transforms, Proceedings of the IEEE, special issue on Program Generation, Optimization, and Adaptation 93 (2) (2005).
[17] M. Hall, J. Chame, C. Chen, J. Shin, G. Rudy, M. M. Khan, Loop transformation recipes for code generation and auto-tuning, in: LCPC, 2009.
[18] S. Donadio, J. Brodman, T. Roeder, K. Yotov, D. Barthou, A. Cohen, M. J. Garzarán, D. Padua, K. Pingali, A language for the compact representation of multiple program versions, in: LCPC, 2005.
[19] B. Fraguela, Y. Voronenko, M. Püschel, Automatic tuning of discrete Fourier transforms driven by analytical modeling, in: PACT '09: Parallel Architectures and Compilation Techniques, Raleigh, NC, 2009.
[20] Z. Pan, R. Eigenmann, Fast and effective orchestration of compiler optimizations for automatic performance tuning, in: The 4th Annual International Symposium on Code Generation and Optimization, 2006.
[21] H. You, K. Seymour, J. Dongarra, An effective empirical search method for automatic software tuning, Tech. Rep. ICL-UT-05-02, Dept. of Computer Science, University of Tennessee, May 2005.
[22] K. Seymour, H. You, J. Dongarra, Comparison of search heuristics for empirical code optimization, in: iWAPT: International Workshop on Automatic Performance Tuning, 2008.
[23] A. Tiwari, C. Chen, J. Chame, M. Hall, J. K. Hollingsworth, A scalable auto-tuning framework for compiler optimization, in: IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IEEE Computer Society, Washington, DC, USA, 2009, pp. 1–12. doi:10.1109/IPDPS.2009.5161054.
[24] C. Chen, J. Chame, M. Hall, Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy, in: International Symposium on Code Generation and Optimization, 2005.
[25] P. Kulkarni, S. Hines, J. Hiser, D. Whalley, J. Davidson, D. Jones, Fast searches for effective optimization phase sequences, in: PLDI: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, ACM, New York, NY, USA, 2004, pp. 171–182. doi:10.1145/996841.996863.
[26] K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, P. Stodghill, A comparison of empirical and model-driven optimization, Proceedings of the IEEE, special issue on Program Generation, Optimization, and Adaptation.
[27] A. Cohen, M. Sigler, S. Girbal, O. Temam, D. Parello, N. Vasilache, Facilitating the search for compositions of program transformations, in: ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing, ACM, New York, NY, USA, 2005, pp. 151–160. doi:10.1145/1088149.1088169.
[28] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O'Boyle, J. Thomson, M. Toussaint, C. K. I. Williams, Using machine learning to focus iterative optimization, in: CGO: Proceedings of the International Symposium on Code Generation and Optimization, IEEE Computer Society, Washington, DC, USA, 2006, pp. 295–305. doi:10.1109/CGO.2006.37.
[29] C. Dubach, T. M. Jones, E. V. Bonilla, G. Fursin, M. F. P. O'Boyle, Portable compiler optimisation across embedded programs and microarchitectures using machine learning, in: MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ACM, New York, NY, USA, 2009, pp. 78–88. doi:10.1145/1669112.1669124.