Resource-Constrained Software Pipelining

Alexandru Nicolau
Department of Information and Computer Science
University of California, Irvine
Irvine, CA
email: nicolau@ics.uci.edu

Alexander Aiken
Computer Science Division
University of California, Berkeley
Berkeley, CA
email: aiken@cs.berkeley.edu

Steven Novack
Department of Information and Computer Science
University of California, Irvine
Irvine, CA
email: snovack@ics.uci.edu

Abstract

This paper presents a software pipelining algorithm for the automatic extraction of fine-grain parallelism in general loops. The algorithm accounts for machine resource constraints in a way that smoothly integrates the management of resource constraints with software pipelining. Furthermore, generality in the software pipelining algorithm is not sacrificed to handle resource constraints, and scheduling choices are made with truly global information. Proofs of correctness and the results of experiments with an implementation are also presented.

1 Introduction

Recently there has been considerable interest in a class of parallelization techniques known collectively as software pipelining. Software pipelining algorithms compute a static parallel schedule, overlapping the operations of a loop body in much the same way that a hardware pipeline overlaps operations in a dynamic instruction stream. The schedule computed by a software pipelining algorithm is suitable for execution on a synchronous, tightly-coupled parallel machine, such as a superscalar or VLIW (Very Long Instruction Word) machine.

Software pipelining algorithms are interesting for at least three reasons. The first reason is that superscalar and VLIW machines are being built: IBM's System/6000 can execute four operations in parallel, and Intel's i860 and i960 chips can execute three operations in parallel. The largest tightly-coupled synchronous machine built to date is Multiflow's TRACE, which has 28 functional units. Several computer manufacturers (e.g., HP, Philips, Siemens) are also developing VLIW or superscalar architectures. The second reason is that these tightly-coupled machines must be programmed at a very low level.

Someone writing a program for a tightly-coupled machine must develop a parallel schedule, which means that person must know about and account for details of the hardware design, such as instruction timings and resource conflicts between functional units. This task is extremely time-consuming and error-prone; compilation techniques are needed to translate programs written at a reasonably high level into good parallel schedules.

The final reason is that software pipelining techniques hold the promise of producing better code with faster compilation time than other scheduling techniques. This potential is illustrated by the example in Figure 1. Figure 1(a) shows a simple sequential loop, and Figures 1(b) and 1(c) show two different parallel schedules for the loop. For convenience, we label the operations in the original loop a through d and refer only to these labels in the parallel loops. In this example, some parallelism is present within the loop body, because operations b and c can be executed simultaneously, as well as across iterations, because d from one iteration can overlap with a from the next iteration. The classical approach to scheduling the loop in Figure 1(a) is to unroll the loop body some number of times and then apply scheduling heuristics within the unrolled loop body [Fis], as illustrated in Figure 1(b). While this approach allows parallelism to be exploited between some iterations of the original loop, there is still sequentiality imposed between iterations of the unrolled loop body. In general, if the loop could be fully unrolled, all parallelism, both inside and across iterations, could be exploited by this approach. However, full unrolling is usually impossible or impractical to obtain. Software pipelining provides a direct way of exploiting parallelism inside and across all iterations of a loop; hence software pipelining achieves the effect of scheduling with full unrolling. A software-pipelined version of the original loop is given in Figure 1(c).

1.1 Previous Work

One body of work on software pipelining has focused on establishing the formalism required to adequately address what software pipelining algorithms can and cannot achieve. Results in this line of development include a software pipelining algorithm that generates optimal code for loops without conditional tests [ANa] and a proof that optimal software pipelining is impossible in general [SGE]. However, this work has largely ignored resource constraints.

Existing software pipelining algorithms handle resource constraints in a variety of ways. Some algorithms deal with only weak forms of resource constraints (e.g., the number of operations that can be executed in parallel). Others assume resource constraints are handled in a separate fixup phase after software pipelining [NPA]. Several software pipelining algorithms account for resource constraints directly as part of the software pipelining algorithm (e.g., [RG, Lam]). However, in most such algorithms the treatment of resource constraints is intimately connected to software pipelining; that is, the software pipelining is not separable from the handling of resource constraints. One of our interests is to separate what is really intrinsic to software pipelining from other, orthogonal concerns. A more extensive discussion of previous and related work is included in Section 9.

[Figure 1: Loop Unrolling and Software Pipelining. (a) An example loop with operations a through d; (b) the loop unrolled twice and scheduled; (c) the pipelined loop.]

1.2 Our Approach

In this paper we present an algorithm that smoothly integrates software pipelining with the treatment of resource constraints, while at the same time maintaining a structured design that separates orthogonal concerns. Our algorithm serves two purposes. First, we believe the algorithm represents a practical direction and can form the basis of implementations of software pipelining; we discuss an implementation of our algorithm in Section 7. Second, the algorithm represents a summary of many of the most interesting aspects of our investigation of software pipelining over the last several years [Nic, ANa, ANb, Aik, Aik, AN]. Our algorithm has several novel features:

1. The handling of resource constraints is orthogonal to the software pipelining.

2. At each step, the algorithm has global information about the operations that can be scheduled.

3. In a technical sense (defined precisely in Section 8), given sufficient resources, our algorithm can produce code arbitrarily close to the theoretical optimum.

The advantage of the first point is that the treatment of resources could be modified, say for a different machine, and no changes would be required in the overall algorithm. The second and third points together imply that the quality of the final pipelined loop is limited only by the ability to make good resource allocation decisions (see Section 6), and not by the design of the software pipelining algorithm.

Our software pipelining algorithm is built from two components: a scheduler and a dependence analyzer. The machine-dependent scheduler is used to incrementally build a parallelized loop from a sequential loop. For each parallel instruction, the scheduler selects operations to schedule based on the set of operations available for scheduling in that instruction and on the available resources. The set of available operations is maintained by a global dependence analyzer; as the scheduler makes decisions about where to place operations, the set of available operations is updated incrementally. Together, the scheduler and the dependence analyzer encapsulate all machine-dependent information. As the parallelized loop is constructed, the software pipelining algorithm checks for repeating states that can be pipelined. The software pipelining algorithm itself is very simple; the difficulty lies in establishing minimal restrictions on the scheduler and dependence analyzer that guarantee the correctness and termination of the software pipelining algorithm.

The rest of this paper is divided into nine sections. Section 2 defines the model of parallel computation used to develop the algorithm. Section 3 works through a small example to give an intuitive idea of how the software pipelining algorithm works. Section 4 describes the algorithm and presents a proof of correctness. Section 5 gives an algorithm for incrementally maintaining the set of available operations. Section 6 describes the integration of resource constraints into the algorithm; handling resources well is critical in realistic applications of software pipelining. Section 7 briefly describes an implementation of our algorithm, some additional optimizations, and some experimental results. The experimental results bear out the strengths of our approach and point out some weaknesses; both are discussed at length. Section 8 presents a result that suggests our algorithm can achieve the best schedules possible in the presence of resource constraints. A discussion of related work is in Section 9. The final section summarizes and presents some conclusions.

2 Basic Terminology

This section develops a simple model of a tightly-coupled synchronous parallel machine. The formalism is used to explain our software pipelining algorithm and to provide a basis for a proof of correctness.

A program is an automaton ⟨X, n₀, N⟩. X is a set of n operations {x₁, ..., xₙ}. Operations are divided into assignments, which read and write a global store; tests, boolean-valued functions that affect the flow of control; and a distinguished operation stop.

The body of the program is a set N of states {n₀, ..., n_m}; the state n₀ is the start state of the program. Associated with each state n is ops(n), the operations of n, which are elements of X. The states represent parallel instructions: intuitively, when control reaches a state n, all operations in ops(n) are executed simultaneously. To simplify the presentation, we assume that all operations execute in unit time. Extensions to multicycle operations and pipelined functional units are discussed in Section 7.

A configuration is a pair ⟨n, s⟩, where n is a state and s is a store (the contents of memory locations and registers). The transition function δ maps configurations into configurations. An execution is a sequence of configurations ⟨..., ⟨n_i, s_i⟩, ...⟩ such that δ(⟨n_i, s_i⟩) = ⟨n_{i+1}, s_{i+1}⟩.

The transition function δ describes how a tightly-coupled synchronous machine actually executes a parallel instruction. We deliberately avoid defining a transition function in any detail; the transition functions of superscalar and VLIW machines are complex and vary considerably from machine to machine. The greatest source of complexity is defining what it means to execute more than one test in parallel (multiway jumps). As an example, in one possible model, tests within a state n are always organized as a binary decision tree with a unique root. One branch of each test in the decision tree is labeled true; the other is labeled false. Each leaf of the decision tree is a pointer to another state. When the state is executed, all of the tests are evaluated in parallel in the store. The next state to be executed is the leaf that terminates the unique path from the root where every branch is labeled by the value of that test in the store. There are other possible implementations of multiway jumps; many mechanisms have been proposed and implemented [Fis, KN, AAG, Ebc]. The software pipelining algorithm we present applies to any of these control-flow mechanisms.

We use the following abstraction of control flow throughout this paper. We assume that control flow is determined entirely by tests; that is, the result of evaluating the tests in a state determines the next state. A branch of a state n is a truth assignment ⟨x₁ = true, ..., x_k = false⟩ to the tests x₁, ..., x_k in n. The set of all branches of n is branch(n); if n has no tests, then branch(n) is the singleton set {⟨⟩} consisting of the empty truth assignment. The function succonbranch maps a state n and a branch c ∈ branch(n) to a successor node n′; the name succonbranch stands for "successor on branch". (Note that in most cases a node with k tests will not have 2^k distinct successors. For generality we treat each of the 2^k branches separately in our algorithm; in an implementation for a particular control-flow mechanism, many branches can be merged.) We assume that if succonbranch(n, c) = n′ and the evaluation of the tests in configuration ⟨n, s⟩ satisfies the truth assignment c, then δ(⟨n, s⟩) = ⟨n′, s′⟩ for some store s′.

The set of successors succ(n) of a state n is {n′ | ∃c s.t. succonbranch(n, c) = n′}. When n is executed, control is transferred to some n′ ∈ succ(n). A state that contains the operation stop cannot contain other operations and cannot have any successors.

We next define a meaning function for programs, which is used in the proof that our software pipelining algorithm is correct, i.e., that it preserves the meaning of the original program.

Definition 1 Let P be a program ⟨X, n₀, N⟩. If there is an execution ⟨⟨n₀, s⟩, ..., ⟨n_k, s_k⟩⟩ such that ops(n_k) = {stop}, then P(s) = s_k. If no such execution exists, then P(s) = ⊥.

Programs P and P′ are equivalent (P ≡ P′) if ∀s. P(s) = P′(s).

Software pipelining is a loop parallelization technique, so we must describe the loops we are interested in parallelizing. For convenience we use the following definition. A sequential loop is a program with i operations x₁, ..., x_i and i states n₁, ..., n_i, where ops(n_j) = {x_j}. All backedges go to the start state n₁; that is, if n_i ∈ succ(n_j) and i ≤ j, then i = 1. Every state is assumed to be reachable from the start state.
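To make the model concrete, the following is a minimal sketch of this program representation in Python (the names State, ops, and succ_on_branch are ours; the paper prescribes no implementation). It encodes states with operation sets, branches as truth assignments, and succonbranch as a per-state map from branches to successors.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

# A branch is a truth assignment to the tests of a state,
# e.g. (("x1", True), ("x2", False)); () is the empty truth assignment.
Branch = Tuple[Tuple[str, bool], ...]

@dataclass(eq=False)
class State:
    ops: FrozenSet[str]                                # ops(n)
    succ_on_branch: Dict[Branch, "State"] = field(default_factory=dict)

    def succ(self):
        # succ(n) = { n' | there is a c with succonbranch(n, c) = n' }
        return set(self.succ_on_branch.values())

def sequential_loop(op_names):
    """Build a sequential loop: one operation per state, with the
    backedge going to the start state n1."""
    states = [State(frozenset({x})) for x in op_names]
    for a, b in zip(states, states[1:]):
        a.succ_on_branch[()] = b                       # straight-line flow
    states[-1].succ_on_branch[()] = states[0]          # backedge to n1
    return states
```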

3 An Example

Given a sequential loop L, our software pipelining algorithm incrementally builds a parallelized loop from L. Initially the parallelized loop is empty (has no states), and the algorithm chooses a set of operations from the sequential loop L that legally can be scheduled as the start state of the parallel loop. After scheduling a subset of the available operations as the start state, the algorithm recursively schedules the successors of the start state by considering what operations can be scheduled in the successor states, and so on. The main difficulty is guaranteeing that this procedure terminates. We show that eventually the scheduled states must fall into a detectable repeating pattern, at which point a loop can be constructed from this pattern of repeating states.

An important data structure used by the algorithm is an incrementally maintained set A of available operations. At each step, A contains a set of operations available for scheduling in the current state being constructed. How this set is built and maintained is discussed in Section 5. For now, it is only important to understand that the set A contains all operations that could be scheduled legally in the current state without violating program semantics.

Initially the new program graph is empty and A contains all operations available for scheduling in the first state. Consider the program in Figure 2. We display programs as control-flow graphs, with the convention that true branches of tests are to the left and false branches are to the right. Not all operations can be scheduled in the first state; for example, c must be scheduled after b, since c references a value that b writes. In standard compiler terminology, there is a data dependence from b to c [KKP].

[Figure 2: An example loop. Operation a assigns A[i] := f(A[i]); b assigns to j from i; c tests A[j], branching to d (true) or e (false); d and e both assign to B[j] from A[j]; f tests i against n to decide whether to loop; g is stop.]

For this example we assume a machine model in which all reads take place before any writes during execution of a state, and write conflicts are not permitted. In this model the operations a, b, and f are all available for scheduling in the first state. Because the algorithm may overlap operations from different iterations, we superscript operations with the scheduled iteration from which they came. In addition, we subscript available operation sets to keep track of different values for different states. Thus, initially, A₁ = {a¹, b¹, f¹}.

Another component of the pipelining algorithm is the scheduler. The scheduler selects from A a set of operations to schedule in the current state. Together, the procedure to maintain the set of available operations and the scheduler encapsulate all machine-dependent information; the software pipelining algorithm itself is built on top of these two components.

A pipelined version of the loop in Figure 2 is given in Figure 3; in Figure 3 the state n_i is labeled by the integer i. The rest of this section describes how the software pipelining algorithm computes this parallel schedule from the sequential loop. For the first state n₁, assuming that the machine has sufficient resources, the scheduler could choose to schedule all available operations. Because f¹ is a test, there will be two successors of the first state: one for the case where f¹ evaluates to true and one for the case where f¹ evaluates to false. The sets of available operations are different for the two successors.

Consider the successor n₂ of n₁ for the case where f¹ evaluates to false. This case is easy, as the program terminates on this branch. The new set A₂ of available operations is {c¹, d¹, e¹}, reflecting the fact that a¹, b¹, and f¹ have been scheduled and that this branch of f¹ is the loop exit. Because write conflicts are not permitted, d¹ and e¹ cannot be scheduled in the same state, but both are available.

[Figure 3: The loop after software pipelining. State n₁ = {a, b, f}. On f's false branch the exit path runs through n₂ = {c, d}, then n₃ = {e} (c false), then n₄ = {g}. On f's true branch the steady state is n₅ = {c, d, a}, with successors n₆ = {b, f} (c true) and n₇ = {e, b, f} (c false); from n₆ and n₇, the true branch of f loops back to n₅ and the false branch exits to n₂.]

At this point, all dependences on the two statements have been satisfied. Assume that the scheduler selects operations c¹ and d¹ for state n₂. Operation c¹ is a test, so there are two successors of this state. For the successor n₃ where c¹ evaluates to false, the set of available operations A₃ is {e¹}. Assume that the scheduler places e¹ in n₃. For the single successor n₄ of n₃, the set of available operations A₄ is just {g¹}, the stop operation; thus n₄ contains only g¹. Backing up to n₂, the set of available operations for the branch where c¹ evaluates to true is also {g¹}, so the successor of n₂ on this path is also n₄.

This completes the terminating path from n₁. On the other path, where f¹ evaluates to true, the new set of available operations A₅ is {c¹, d¹, e¹, a²}. Note that the operation a² from the second iteration is available for scheduling in parallel with statements from the first iteration. A subtle point is that operation b² is not available for scheduling, even though all reads take place before all writes and all operations from the first iteration that read variable j are available in A₅. Operation b² is not available because, as before, d¹ and e¹ cannot be scheduled in the same state. Even though all operations that read j are in A₅, not all of these can be scheduled in n₅, and this fact prevents statements that write j from being available.

Assume that the scheduler selects operations c¹, d¹, and a² for state n₅. Operation c¹ is a test, so there are two successors of this state. For the successor n₆ where c¹ evaluates to true, the set A₆ is {b², f²}. Assuming that the scheduler places both operations in n₆, the set of available operations for the successor of n₆ on the path where f² is true is {c², d², e², a³}. Note that, except for the superscripts, this set is exactly the same as A₅. The superscripts are just a way of keeping track of the iteration of each operation; the sets have the same operations. Rather than continue scheduling at this point, the pipelining algorithm simply makes n₅ a successor of n₆. Similarly, the set of available operations for the successor of n₆ where f² evaluates to false is {c², d², e²}. Except for superscripts, this is exactly the same as A₂. As before, the pipelining algorithm makes n₂ a successor of n₆.

[Figure 4: Another example loop. (a) The example loop: operation a computes j from i with an integer division, b and c assign to elements of A using indices involving i and j, and d increments i. (b) An incorrect schedule produced if repeating states are detected too early.]

Backing up, the pipelining algorithm next considers the successor n₇ of n₅ where c¹ evaluates to false. The set of available operations A₇ is {e¹, b², f²}. Assuming that the scheduler places all three operations in n₇, the sets of available operations for the two successors of n₇ are the same as for n₆, and scheduling proceeds just as it did for n₆. The algorithm terminates with the schedule in Figure 3.

There are three technically difficult aspects of the software pipelining algorithm. The first problem is justifying the step where previously scheduled states are reused, such as when the pipelining algorithm decided to make n₅ the successor of n₆. We have simply implied that this is correct, and in the example it happens to be correct, but in general this step is not correct. Intuitively, the problem is that just because two sets of available operations happen to be the same for two different states, that does not by itself guarantee that all subsequent sets of available operations would be the same in all successors of those states.

We illustrate this problem with the loop in Figure 4(a). To make the example as simple as possible, there are no conditional statements or exits from the loop. Assuming that the variable i is always zero upon entering the loop, note that statements b and c are independent for an initial run of iterations and data dependent for the iterations that follow. If dependence analysis recognizes that b and c are independent for the initial iterations, then, as the parallelized loop is built, the scheduler could place b and c together in those iterations. Following the pipelining strategy of the previous example, repeating states would be detected in the second iteration, leading to the parallelized program in Figure 4(b), which is clearly incorrect. In this example, irregular dependencies make it difficult to detect repeating behavior. Section 4 formalizes the software pipelining algorithm and provides constraints on the scheduler and available operation information that guarantee the correctness and termination of the software pipelining algorithm.

The second problem is computing the sets of available operations. An algorithm for maintaining these sets incrementally was first presented in [EN] for programs without loops (i.e., with acyclic control-flow graphs). In Section 5 we present a detailed description of the computation and maintenance of available operations for use in software pipelining of loops; our presentation is simpler and easier to understand and implement than the algorithm in [EN].

The third significant problem is managing finite resources. While resource allocation does not bear directly on the correctness of our software pipelining algorithm, good resource usage is obviously important if the algorithm is to be useful in practice. In Section 6 we show how the handling of finite resources is integrated with software pipelining in our system.

4 The Software Pipelining Algorithm

The example in Section 3 illustrates that the key step in our algorithm is discovering when states can be reused to form a software pipeline. Recognizing patterns in the scheduled operations is not trivial, and is in fact not valid if the scheduler and the available operation analysis are not constrained in some way. For example, if the scheduler merely selects operations to schedule at random, no repeating behavior can be inferred. Similarly, even if the scheduler is well-behaved, the example in Figure 4 shows that if the available operation analysis does not exhibit a detectable pattern, software pipelining is not possible.

In this section we present constraints on the scheduler and available operation analysis that make software pipelining possible. These constraints are quite weak and are easily satisfied in practice. After presenting the constraints, we present the software pipelining algorithm itself and prove its correctness. Finally, we discuss termination of the software pipelining algorithm.

4.1 The Constraints

Recall that x_i^c denotes the instance of operation x_i from iteration c of a loop. The following definition is used in the discussion of the constraints.

Definition 2 Let X = {..., x_i^j, ...} be a set of operations. The set X^{+c} is the set {..., x_i^{j+c}, ...}.

As discussed in Section 3, one component of the software pipelining algorithm is a scheduler for a specific machine. The following constraint requires that the scheduler is a function, that the scheduler must schedule some operation in every state, and that the operation chosen can depend on the set of operations available and on the relative distance in iterations between the operations available, but not on the actual iterations of the operations available.

Constraint 1 Let X be a set of operations. The scheduler must be a function mapping a set of already scheduled operations and a set of available operations to a single operation or the value none. In addition, schedule(X, A) ≠ none if X = ∅. We also require that, for all i,

schedule(X, A) = x_j^k implies schedule(X^{+i}, A^{+i}) = x_j^{k+i}, and
schedule(X, A) = none implies schedule(X^{+i}, A^{+i}) = none.

j

j j

In our algorithm X is the set of op erations already scheduled in the state currently under considera

k

tion The op eration x returned by the scheduler is an additional instruction to b e scheduled in the same

j

state The primary restriction imp osed by Constraint is that the scheduler is a function of op erations

available in the state b eing scheduled This constraint is weak b ecause the set of available op erations pro

vides global information ab out the programthe scheduler can choose any statement that could b e legally

scheduled in the current state This particular constraint also has a signicant design b enet it cleanly

separates the scheduler from the rest of the algorithm thus isolating the most machinedependent p ortion

of the co de Any scheduler satisfying Constraint will work with the software pip elini ng algorithm In

Section we show how to generalize Constraint to include resource constraints
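For illustration, here is a minimal greedy scheduler in this style (a sketch; the issue-width parameter and the (name, iteration) encoding of operations are our assumptions, not the paper's). It satisfies Constraint 1 because its choice depends only on operation names and on iteration offsets relative to the earliest available iteration, so shifting every iteration by i shifts the result by i.

```python
def schedule(X, A, width=4):
    """Return one more operation to place in the current state, or None.
    X: operations already scheduled in the state; A: available operations.
    Operations are (name, iteration) pairs."""
    if not A or len(X) >= width:            # state full or nothing available
        return None
    base = min(it for (_, it) in A)         # earliest iteration in A
    # The key uses only the offset (it - base), never the absolute
    # iteration, which gives the shift invariance Constraint 1 demands.
    return min(A, key=lambda op: (op[1] - base, op[0]))
```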

The scheduler is used by the software pipelining algorithm to repeatedly select operations for scheduling in a state. When the scheduler returns none, the state is finished and successors of the state are scheduled. In Section 3 we presented a simplified example in which the scheduler chooses a subset of the available operations for scheduling. However, the iterative method described here is necessary in general, because the operations available for scheduling in a state n can depend on the set of operations already scheduled in n. For example, consider the simple program fragment in Figure 5. Assuming that the parallel machine performs reads before writes, it is clear that both a and b can be scheduled together in the first (and only) state n. However, b cannot be scheduled in n unless a is also scheduled in n; that is, b is not available for scheduling in n unless a is scheduled in n. Otherwise, if the set of available operations were simply {a, b}, then the scheduler could choose to schedule b in n and a in n's successor, which is incorrect.

A second constraint is placed on the available operations. At any moment there is a set of operations A available for scheduling associated with a state n. There are two ways that A can be updated. First, the procedure call updateone(n, A, x^i) returns a pair consisting of the updated state, with operations ops(n) ∪ {x^i}, and the new set of available operations, given that x^i has been scheduled. Second, when n is complete, we wish to compute the sets of operations available in the successors of n. The procedure call next(n, A) maps n and A to a set of pairs {⟨n_j, A_j⟩}, where for every branch c_j ∈ branch(n), n_j is a new empty node, n_j = succonbranch(n, c_j), and A_j is the set of operations available in n_j. Implementations of procedures updateone and next are given in Section 5.

(One can imagine even more powerful schedulers, for example a scheduler having global information about not just one state at a time but all states at all times. Because scheduling is inherently a very hard problem, however, it is not clear that this extra theoretical power translates into any practical advantage over the scheme presented here; see Section 8.)

a: j := i
b: i := i + 1

Figure 5: A simple program

Constraint 2 Consider an arbitrary set of available operations A, a state n, and an operation x^j. Then there exists a set of operations B such that, for all i,

updateone(m, A^{+i}, x^{j+i}) = ⟨m′, B^{+i}⟩, where ops(m) = ops(n)^{+i} and ops(m′) = ops(n)^{+i} ∪ {x^{j+i}}.

Furthermore, there exist sets of operations A_j and states n_j for 1 ≤ j ≤ |branch(n)| such that, for all i,

next(m, A^{+i}) = {..., ⟨n_j, A_j^{+i}⟩, ...}, where ops(m) = ops(n)^{+i}.

Constraint 2 says that the operations available may depend on which operations have already been scheduled and on the relative distance in iterations between operations already scheduled, but it cannot depend on the actual values of the iterations of operations already scheduled. In the implementation of updateone, the result node m′ is simply n updated to include the operation x^{j+i} (see Section 5). Whether Constraint 2 is satisfied or not depends on the form of the data dependence analysis used to maintain operation availability information. Standard data dependence graphs satisfy Constraint 2, as do extensions to dependence graphs such as labeling edges with constant distance vectors [PBJ]. In fact, as far as we know, every proposed representation of dependence information satisfies this constraint. Constraint 2 is needed to rule out pathological cases like Figure 4, where irregular dependence analysis leads to incorrect schedules.

4.2 The Algorithm

The software pipelining algorithm is given in Figure 6. Given an initial set of available operations, the procedure pipeline invokes the procedure schedulestate to build a single state, then builds states for all the branches of that state, and so on. If at any point the algorithm encounters the same set of available operations (modulo iteration numbers) a second time, it uses the previously scheduled state. The algorithm never backtracks to explore alternative schedules. While a backtracking version could be designed easily, we feel a backtracking algorithm would be too slow to be practical.

The order in which pipeline processes the successors of a scheduled state is unspecified and makes no difference in the final parallel program.

procedure schedulestate(n, A)
    while schedule(ops(n), A) ≠ none do
        let x = schedule(ops(n), A) in
            ⟨n, A⟩ := updateone(n, A, x)
    return ⟨n, A⟩

procedure pipeline(A)
    ∀X. scheduledbefore(X) := no
    let r be an empty node in todo := {⟨r, A⟩}
    while ∃⟨n, A⟩ ∈ todo do
        if ∃j s.t. scheduledbefore(A^{+j}) ≠ no then
            n := scheduledbefore(A^{+j})
            todo := todo − {⟨n, A⟩}
        else
            let ⟨n′, A′⟩ = schedulestate(n, A) and
                {..., ⟨n_i, A_i⟩, ...} = next(n′, A′) in
                scheduledbefore(A) := n′
                todo := (todo ∪ {..., ⟨n_i, A_i⟩, ...}) − {⟨n, A⟩}

Figure 6: The software pipelining algorithm

The order in which states are scheduled can make a difference in the efficiency of the available operations computation; in Section 5 we present a slightly modified version of the algorithm in Figure 6 that processes states in an efficient order.
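The driver loop can be made concrete as in the following sketch (Python; `schedule`, `updateone`, `next_states`, and `redirect` are assumed helpers standing for the paper's schedule, updateone, next, and the edge redirection implied by reusing a state, and `canonical` normalizes iteration numbers so that A and A^{+j} compare equal):

```python
def canonical(A):
    """Hashable key equal for A and any shift A+j of it."""
    if not A:
        return frozenset()
    base = min(it for (_, it) in A)
    return frozenset((name, it - base) for (name, it) in A)

def schedule_state(n, A):
    # Fill one state: keep asking the scheduler until it returns None.
    x = schedule(n.ops, A)
    while x is not None:
        n, A = updateone(n, A, x)
        x = schedule(n.ops, A)
    return n, A

def pipeline(root, A):
    scheduled_before = {}                  # canonical availability -> state
    todo = [(root, A)]
    while todo:
        n, A = todo.pop()
        key = canonical(A)
        if key in scheduled_before:
            # Same availability seen before (modulo shift): reuse the old
            # state, creating the backedge that closes the pipeline.
            redirect(n, scheduled_before[key])
        else:
            n2, A2 = schedule_state(n, A)
            scheduled_before[key] = n2
            todo.extend(next_states(n2, A2))
```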

We use Constraints 1 and 2 to prove the correctness of the software pipelining algorithm in Figure 6. Let L be a sequential loop and let L′ be the result of software pipelining; we show that L ≡ L′. As a first step in the proof, we must assume that the available operation analysis is correct. Intuitively, the available operation analysis is correct if any schedule that is consistent with the analysis preserves program semantics. We use the program in Figure 7 to formalize this intuition. This program is identical to the one in Figure 6, except that it does not reuse previously scheduled states. Let L∞ be the infinite parallel program defined by this algorithm for a loop L. The available operation analysis is correct if, for any choice of scheduler, L ≡ L∞.

The essential step in proving the correctness of procedure pipeline is to show that every execution of L′ is also an execution of L∞.

Lemma 1 Let L′ = pipeline(A) and let L∞ = pipeline∞(A). For all k, if there is an execution ⟨⟨n₀, s₀⟩, ..., ⟨n_k, s_k⟩⟩ of L′, then there is an execution ⟨⟨n₀′, s₀⟩, ..., ⟨n_k′, s_k⟩⟩ of L∞.

Proof: The proof is by induction on the length of an execution. For the base case, let e = ⟨⟨n₀, s⟩⟩ be an execution of L′. Consider how the initial states of L′ and L∞ are built. The initial set of available operations A is the same for both. Now, in procedure pipeline we have scheduledbefore(A) = no, because initially no states are scheduled. Then schedulestate(n₀, A) = ⟨n₀′, B⟩ in both algorithms, so ops(n₀) = ops(n₀′). Clearly e′ = ⟨⟨n₀′, s⟩⟩ is an execution of L∞.

procedure pipeline∞(A)
    let r be an empty node in
        todo := {⟨r, A⟩}
    (* the condition of the while is always true *)
    while ∃⟨n, A⟩ ∈ todo do
        let ⟨n′, A′⟩ = schedulestate(n, A) and
            {..., ⟨n_i, A_i⟩, ...} = next(n′, A′) in
            todo := (todo ∪ {..., ⟨n_i, A_i⟩, ...}) − {⟨n, A⟩}

Figure 7: An algorithm that defines an infinite parallel program

For the induction step, assume that e = ⟨⟨n₀, s₀⟩, ..., ⟨n_i, s_i⟩⟩ is an execution of L′ and that e′ = ⟨⟨n₀′, s₀⟩, ..., ⟨n_i′, s_i⟩⟩ is an execution of L∞. Furthermore, assume that there exists a k such that, when the states n_i and n_i′ were scheduled, the sets of available operations were A in procedure pipeline and A^{+k} in procedure pipeline∞, respectively. Finally, assume that ops(n_i)^{+k} = ops(n_i′). It is easy to check that all of these assumptions hold after the base case.

If ops(n_i) = {stop}, then n_i and n_i′ are final states and we are done. Otherwise, in the next transition we have

δ(⟨n_i, s_i⟩) = ⟨n_{i+1}, s_{i+1}⟩
δ(⟨n_i′, s_i⟩) = ⟨n_{i+1}′, s_{i+1}′⟩

The stores s_{i+1} and s_{i+1}′ must be the same in the two transitions, since by hypothesis n_i and n_i′ have the same operations. Let c be the branch taken in state ⟨n_i, s_i⟩; note that c is also taken in state ⟨n_i′, s_i⟩, because n_i and n_i′ have the same operations evaluated in the same store. To finish the proof, we need to show that n_{i+1} and n_{i+1}′ have the same operations, possibly differing in the iteration numbers used by the pipelining algorithm. That is, we must show that ops(n_{i+1})^{+j} = ops(n_{i+1}′) for some j.

Consider once more the state of the two software pipelining algorithms when n_{i+1} and n_{i+1}′ are scheduled. By Constraint 2 and the induction hypothesis, ⟨m, B⟩ ∈ next(n_i, A) and ⟨m′, B^{+k}⟩ ∈ next(n_i′, A^{+k}), where m and m′ are fresh, empty states on branch c from n_i and n_i′, respectively. Now there are two cases.

For the first case, assume ∀j. scheduledbefore(B^{+j}) = no when ⟨m, B⟩ is removed from the todo list by pipeline. In L′, let schedulestate(m, B) = ⟨n_{i+1}, C⟩. Then by Constraints 1 and 2, in L∞ we have schedulestate(m′, B^{+k}) = ⟨n_{i+1}′, C^{+k}⟩ and ops(n_{i+1})^{+k} = ops(n_{i+1}′).

For the second case, assume that when ⟨m, B⟩ is removed from the todo list by pipeline, there is a j such that scheduledbefore(B^{+j}) = n_{i+1}. Then n_{i+1} was scheduled in L′ using available operations B^{+j}. The rest of the argument is symmetric to the case above, using B^{+j} in place of B and the fact that n_{i+1} = scheduledbefore(B^{+j}). □

Constraints 1 and 2 are needed to prove Lemma 1. These constraints ensure that having the same operations available for two states implies that all possible branches from those states are also the same. Combining the correctness condition for the available operation analysis with Lemma 1 gives a proof of correctness.

Theorem 1 If procedure pipeline produces a loop L′ from an initial loop L, then L ≡ L′.

Proof: To prove L ≡ L′, we must show ∀s. L(s) = L′(s). First, L(s) = L∞(s), since by assumption the available operations analysis is correct. By Lemma 1, every execution of L′ is also an execution of L∞, so L′(s) = L∞(s) = L(s). □

4.3 Termination

Theorem 1 proves that the software pipelining algorithm produces only correct results, but it does not show that the algorithm always terminates. To show termination, we must prove that the todo set in procedure pipeline is eventually empty. The todo set decreases in size when there is a pair ⟨n, A⟩ such that, for some j, A^{+j} has been scheduled previously. Let ≈ be the equivalence relation on sets of operations defined by A ≈ B ⟺ ∃j s.t. A^{+j} = B. If we assume that the procedure schedulestate always terminates, then to prove termination it is sufficient to show that there are only finitely many equivalence classes under ≈.

Unfortunately, there may be infinitely many equivalence classes, and in fact the procedure pipeline is not necessarily terminating under the constraints given so far. Consider, for example, what happens if the A sets simply increase in size on each recursive call. A necessary condition for A ≈ B is that |A| = |B|; if there are sets of unbounded cardinality, then there are infinitely many equivalence classes. An additional constraint is placed on the availability information to limit the size of the set of operations available for scheduling.

Constraint 3 There is a constant k such that, for all possible availability sets A, if x^j ∈ A, then y^h ∉ A for any h ≥ j + k.

This constraint states that operations can be available from at most k consecutive iterations at one time. Thus the scheduler has a sliding window of operations, and until the operations in the first iteration of the window are scheduled, the window cannot be shifted to include a new iteration at the end.

Lemma 2 Constraint 3 ensures that there are only finitely many equivalence classes of sets of operations under ≈.

Proof: If there are n operations in a loop body and k consecutive iterations can appear in A, then every available operation set is a subset of {x_i^{c+j} | 1 ≤ i ≤ n and 0 ≤ j < k} for some c. □

The value k of Constraint 3 is a parameter of the software pipelining algorithm. It need not be the same for every loop scheduled (i.e., it can be computed dynamically), but it must have a maximum value for any particular loop. Also, it is not necessary to make the window an integral number of iterations; partial iterations work just as well, although the details of the implementation are a bit more complex.

While Constraint 3 is motivated by the need to guarantee termination, it also leads to a good implementation of the procedure pipeline. The most expensive part of pipeline is checking whether, for some j, the set A^{+j} has ever been scheduled before, where A is the current set of available operations. For a window size of k iterations, operation availability information for iterations j through j + k − 1 can be represented as a bit vector of length k·n, where n is the number of operations in the sequential loop. The bit h·n + i is 1 if operation x_i^{j+h} is available for scheduling; otherwise it is 0. When iteration j has been completely scheduled (this occurs when the first n bits are all 0), the bit vector is shifted n bits, discarding the information for iteration j, and the last n bits are set to reflect the availability of operations in iteration j + k. With this representation, checking whether the same availability information has been seen before only requires checking whether the same bit vector has been seen before, which can be implemented very efficiently through hashing.
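A sketch of this representation (Python; the class name is ours, and the bit vector is modeled as an arbitrary-precision integer):

```python
class AvailabilityWindow:
    """Bits for iterations base .. base+k-1; bit h*n + i is 1 iff
    operation x_i of iteration base+h is available."""
    def __init__(self, n_ops, k):
        self.n, self.k = n_ops, k
        self.bits = 0
        self.base = 0                        # first iteration in the window

    def set_bit(self, i, iteration, value):
        h = iteration - self.base
        assert 0 <= h < self.k               # Constraint 3: inside the window
        mask = 1 << (h * self.n + i)
        self.bits = (self.bits | mask) if value else (self.bits & ~mask)

    def shift_if_done(self):
        # Once the first iteration is fully scheduled (its n bits are all 0),
        # discard it and slide the window forward one iteration; availability
        # bits for the newly exposed iteration are then set with set_bit.
        low_mask = (1 << self.n) - 1
        while self.bits and (self.bits & low_mask) == 0:
            self.bits >>= self.n
            self.base += 1

    def key(self):
        # Equal keys mean equal availability modulo an iteration shift, so
        # this integer can be hashed to find previously scheduled states.
        return self.bits
```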

5 Available Operations

Available operations analysis plays a role in our algorithm similar to the role global dataflow analysis plays in traditional optimizing compilers. An algorithm for computing available operations was first given in [EN]; for historical reasons, available operations were termed unifiable-ops in [EN]. In this section we give a new presentation of available operations. While functionally equivalent to the algorithms of [EN], our presentation is both simpler and more direct, and the final algorithms are easier to implement. The development is divided into two parts. First, we show how to compute the initial set of available operations. Second, we show how to incrementally update the information in response to decisions made by the scheduler. At the end of the section we prove the correctness of the analysis and discuss some efficiency considerations.

5.1 Computing Available Operations

Recall that Constraint 3 forces the available operations to span no more than k iterations of a loop. Therefore, to compute the operations available for scheduling, it is sufficient to examine at most k iterations of a loop. Since any number of unrolled iterations forms a loopless (acyclic) program, we restrict the problem of computing available operations to an analysis of loopless programs.

Computing available operations requires the use of dependence analysis between operations. There are many variations on dependence analysis in the literature that satisfy our requirements (Constraint 2), and it is beyond the scope of this paper to include them here [KKP, FOW, PBJ]. The algorithms in this section are presented using an abstract mechanism for dependence; by using a particular dependence analysis representation, the algorithms can be made more efficient. We use the following definitions to model dependence analysis.

Definition 3 A location is either a memory address or a register. For operations x and y and sets of operations X, we define:

write(x) = the set of locations x may write; write(X) = ∪_{x ∈ X} write(x)
kill(x) = the set of locations x must write; kill(X) = ∪_{x ∈ X} kill(x)
read(x) = the set of locations x may read; read(X) = ∪_{x ∈ X} read(x)
depends(x, y) ⟺ write(x) ∩ (read(y) ∪ write(y)) ≠ ∅; depends(X, y) ⟺ ∃x ∈ X s.t. depends(x, y)

[Figure 8: Operation b can kill a reference live at the root. A test a branches to operation b, which assigns to j, on one side and to operation c, which reads j, on the other.]

The set write(x) (resp. read(x)) must include every location x could ever write (resp. read). The set kill(x) must include only locations x always writes. Two different sets write(x) and kill(x) are defined because dependence analysis must be conservative: in general, it is not always possible to know at compile time exactly which locations an operation may read or write. The predicate depends(x, y) is true if there may be a dependence from x to y.

Defining correct available operation analysis requires identifying the operations that cannot be available because of potential data dependence violations. Assume that x precedes y on a path and depends(x, y) is true. Then clearly y cannot be available on that path until x is scheduled, or else y could be scheduled before x, resulting in a dependence violation. The following dataflow equation specifies the operations reachable from state n that are not data dependent on an intervening operation:

nodeps(n) = ops(n) ∪ ((∪_{n′ ∈ succ(n)} nodeps(n′)) − {x | depends(ops(n), x)})

Since the program fragment P being analyzed is loopless, nodeps(n) can be computed for all n by a single bottom-up traversal of the control-flow graph for P.
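A direct transcription of this equation (a sketch; `depends` and the reverse-topological ordering of the loopless program are assumed from the surrounding definitions):

```python
def compute_nodeps(states_bottom_up):
    """states_bottom_up: states of a loopless program in reverse topological
    order, so every successor is processed before its predecessors."""
    nodeps = {}
    for n in states_bottom_up:
        reach = set()
        for m in n.succ():
            reach |= nodeps[id(m)]
        # Remove operations that depend on something executed in n.
        reach = {x for x in reach if not any(depends(y, x) for y in n.ops)}
        nodeps[id(n)] = reach | set(n.ops)
    return nodeps
```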

The program in Figure 8 illustrates another situation in which an operation cannot be available. In this case, operation b cannot be available for scheduling in the first state, because its definition of j could change the value read by the reference to j in operation c. In standard compiler terminology, location j is live at the first state, and b can kill c's reference to j. Clearly, any operation that can kill a live reference cannot be available.

The second component of the available operations analysis is a computation of live references. A reference to location l is live at a state n if there is a state reachable from n where l is potentially read and there is no intervening write of l. A conventional live reference analysis is not sufficient for our purposes; instead, we wish to compute live references discounting the effect of a particular operation x. More precisely, we wish to know the set of live references assuming that x has been moved to the root state of the program. In this case, to say x has been moved to the root means that all occurrences of x that can potentially move to the root are not counted in the live variable computation. The intuitive justification behind this computation is that, when moving an operation x in the schedule, it is necessary to check if x will kill live references in its new position. However, in deciding whether or not x will kill live references in its new position, one should not count references of x itself in its current position.

The following dataflow equation defines the set of locations live at state n, modulo operation x:

live(n, x) = read(Y) ∪ ((∪_{n′ ∈ succ(n)} Z_{n′}) − kill(Y))

where Y = ops(n) − {x} and

Z_{n′} = live(n′, x) if x ∈ nodeps(n′)
Z_{n′} = live(n′, stop) otherwise

The two cases in the definition of Z_{n′} distinguish between the cases where occurrences of x can or cannot be blocked by data dependencies. If there is an occurrence of x on a path that is not blocked by data dependencies (i.e., x ∈ nodeps(n′)), then that occurrence of x is discounted in the live reference computation (i.e., Z_{n′} = live(n′, x)). If there is no occurrence of x that can potentially move, then all live references are counted (i.e., Z_{n′} = live(n′, stop), which counts all references, since stop has no effect on the store). As with the computation of nodeps, live(n, x) can be computed for all states n and operations x by a single bottom-up traversal of the control-flow graph for P. Some further improvements to the efficiency of this procedure are discussed at the end of the section.
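The same bottom-up pass computes live (a sketch; `read_set` and `kill_set` stand for the read and kill functions of Definition 3, and `live_stop` is the live(·, stop) table, computed by this same routine with x = stop):

```python
def compute_live(states_bottom_up, nodeps, live_stop, x):
    """live(n, x): locations live at n, discounting occurrences of x that
    could move to the root (x in nodeps of the successor)."""
    live = {}
    for n in states_bottom_up:
        Y = set(n.ops) - {x}
        inflow = set()
        for m in n.succ():
            if x in nodeps[id(m)]:
                inflow |= live[id(m)]        # movable occurrence: discount x
            else:
                inflow |= live_stop[id(m)]   # count all references
        live[id(n)] = read_set(Y) | (inflow - kill_set(Y))
    return live
```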

Let r be the initial state, or root, of P. An operation x is available for scheduling in r if it satisfies three conditions: x is in nodeps(r); x does not kill any live reference in operations other than x; and x is in the sliding window of operations. (Recall that Constraint 3 requires that operations from no more than k consecutive iterations be available for any state.) For a set of available operations A, let minit(A) be the minimum iteration number of any operation in A. Then

available(r) = {x^i | x^i ∈ R and i < minit(R) + k}
where R = nodeps(r) − {x | write(x) ∩ live(r, x) ≠ ∅}

Note that we are concerned with live references only at the root r; operations that potentially kill live references at an internal state n are included in nodeps(n). The program in Figure 9 illustrates this situation. Unlike Figure 8, the reference to j in operation c is not live at the root, because it reads the value written by d. The important observation is that any operation that can kill a reference that is not live at the root (e.g., b can kill c's reference to j) must be dependent on some preceding operation (e.g., depends(d, b)). That is, a reference to j that is not live at the root must be preceded by an operation that writes j; this operation prevents other operations that could write j from being available at the root.
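Putting the pieces together, available(r) can be computed as in the following sketch (Python; `write_set` stands for write of Definition 3, and `live_of` maps an operation to its live(r, ·) set at the root, e.g., built from the previous sketch):

```python
def available(root, nodeps, live_of, write_set, k):
    """Operations in nodeps(r) that kill no reference live at r,
    restricted to the k-iteration sliding window of Constraint 3."""
    R = {op for op in nodeps[id(root)]
         if not (write_set(op) & live_of(op))}
    if not R:
        return R
    min_it = min(it for (_, it) in R)      # minit(R)
    return {op for op in R if op[1] < min_it + k}
```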

The most expensive part of computing available(r) is computing live(r, x) for every operation x. The efficiency of the naive procedure described above can be improved in two ways. First, it is not necessary to compute live(r, x) for every x; it is sufficient to compute it only for those operations in nodeps(r), because nodeps(r) is a superset of the operations available for scheduling. Second, operations that kill live references could be detected earlier in the computation instead of checking only at the root; we have not given this alternative, to simplify the presentation.

[Figure 9: Operation b can kill j, but j is not live at the root. An assignment d writes j before the test a; on one branch b writes j, while on the other c reads the value of j written by d.]

5.2 Maintaining Available Operations

We first describe at a high level how available operations are maintained, after which we give implementations of the procedures next and updateone. Let P be a sequential loopless program. We add an empty state r (a state with no operations) to P and make it the root; r is the initial state in procedure pipeline (see Figure 6). This empty state r will be filled with operations chosen by the scheduler.

The next step is to compute the dataflow analysis of Section 5.1. At this point, the set of operations available for scheduling in state r is available(r).

Once the initial global analysis of P is completed, we are ready to begin scheduling states. When a state n is scheduled, it is first filled with operations by schedulestate, and then ⟨n, A⟩ is removed from the todo set and n's successors are added to the todo set. A state in the todo set is a frontier state. At any point in the incremental development of the parallelized loop, every frontier state of the parallel loop has the property that its known predecessors have been scheduled and its successors have yet to be scheduled. (The unknown predecessors are those that are added through backedges inserted to complete the software-pipelined loop.) Available operations are needed only for the frontier states; predecessors of frontier states are never modified. When procedure pipeline terminates, there are no frontier states, and the modified program P is the parallel loop.

Figure 10 gives a generic snapshot of the algorithm's data structures during scheduling. The states above the line labeled A are parallel states already scheduled by the algorithm; these states are arranged as a tree, except where backedges have been added by pipelining. The states between the lines A and B are the frontier states: empty states that have yet to be filled with operations. The states below line B are states of the original sequential loop. Conceptually, these states are not part of the parallel loop; they are used instead to compute the available operations information for the frontier states.

The first todo set is {⟨r, available(r)⟩}; thus, initially, r is the only frontier state. The procedure pipeline selects the pair ⟨r, available(r)⟩ from todo and fills r with operations by calling schedulestate(r, available(r)). The procedure schedulestate in turn calls updateone one or more times to choose the scheduled operations (see Figure 6).

[Figure 10: A snapshot of software pipelining during scheduling. States above line A are already-scheduled parallel states; states between lines A and B are frontier states; states below line B belong to the original sequential loop.]

The procedure call updateone(r, available(r), x) performs two tasks. First, x is deleted from the interior states of P and x is added to the frontier state r; thus this transformation moves x to r from its original place in the sequential schedule. (Some copies of x may have to remain in interior states of P if x cannot move on all paths to the frontier state; see the discussion below.) When a test is moved to r, the control flow of P must be modified to preserve P's semantics. Second, the sets nodeps and live are updated where necessary.

An important fact is that both the deletion of x and the updating of the nodeps and live sets can be restricted to a relatively small subset of the states of P; this property makes the incremental cost of maintaining available operations reasonable. The new set of available operations is the updated available(r).

When the scheduling of r is complete, next(r, A) = {..., ⟨r_i, A_i⟩, ...} gives the set of empty successors r_i of r and the corresponding sets of available operations A_i. We implement next(r, A) by inserting a new empty state r_i before each n_i = succonbranch(r, c_i) on branch c_i ∈ branch(r). The set available(r_i) is exactly the set of operations available for scheduling on branch c_i from r. Note that the r_i are new frontier states of P. This implementation of next allows P, and therefore the available operation analysis, to be shared among all elements of the todo set. As scheduling proceeds, there will be multiple frontier states in P, one for each element in todo. An implementation of next is given in Figure 11.

Lemma 3 Let P′ be P with the modifications performed by next. Then P′ ≡ P.

Proof: Procedure next only inserts empty nodes in P. □

To complete the description of available operations, we must give an implementation of procedure updateone. We could do this trivially in terms of the local transformations of Percolation Scheduling [Nic], but for completeness we describe a direct implementation that is closer to the way it should be done in practice.

procedure next(r, A)
    for each c_i ∈ branch(r) do
        let r_i be a fresh empty state and n_i = succonbranch(r, c_i) in
            succonbranch(r, c_i) := r_i
            succonbranch(r_i, ⟨⟩) := n_i
            nodeps(r_i) := nodeps(n_i)
            live(r_i, x) := live(n_i, x) for all operations x
    return {⟨r_i, available(r_i)⟩ | r_i as defined above}

Figure 11: Implementation of next

Let r be a frontier state of P, let x = schedule(ops(r), available(r)), and assume that x ≠ none. We first describe how x is deleted from P and added to r when x is an assignment; this is the easier case. Moving an operation x while preserving P's semantics is a little subtle, because x may be available at the frontier state but still blocked by data dependencies on some paths. The program in Figure 12(a) illustrates this situation. In this example, c is available at the root because c does not kill any references live at the root, and because there is a path from the root to c (in this case, passing through the false branch of a) such that c is not dependent on any operation on the path. However, there may be other paths from the root to c (in this case, passing through the true branch of a) such that c is dependent on some operation on the path; clearly c cannot be deleted from such a path. In addition, there may even be paths from other frontier states to c (represented by the incoming edge from e). If c is moved to the root in Figure 12, it still must be preserved on paths from other frontier states.

As illustrated in Figure 12(b), this problem can be resolved by duplicating states so that no instance of the operation being moved is shared between paths where it can move and paths where it cannot move. In this example, only the single state containing c needs to be duplicated, but in general multiple states may have to be duplicated. In Figure 12(c), the state containing c has been deleted and the operation moved to the root. It is easy to verify that the program in Figure 12(c) is equivalent to the program in Figure 12(a).

To formalize which states are duplicated and which states are deleted, we need some additional definitions and notation. A path is a sequence of states ⟨n₁, ..., n_k⟩ such that n_{i+1} ∈ succ(n_i) for all 1 ≤ i < k. A state n_k is covered by a state n for operation x if there is a path ⟨n, ..., n_k⟩ such that operation x is in nodeps(n_i) for every state n_i on the path:

covered(n, x) = {n_k | there exists a path ⟨n₁, ..., n_k⟩ with n₁ = n such that, for all 1 ≤ i ≤ k, x ∈ nodeps(n_i)}

We say a path is covered by ⟨n, x⟩ if every state of the path is in covered(n, x). When an operation x is moved to a frontier state r, it should be deleted only from paths that are covered by ⟨r, x⟩; other paths should be left unchanged. The simplest case is when every path to x is covered by ⟨r, x⟩. We say a loopless program P is delete consistent for ⟨r, x⟩ if, for every n ∈ covered(r, x) such that ops(n) = {x}, every path from a frontier state to n is covered by ⟨r, x⟩. If P is delete consistent for ⟨r, x⟩, then x is not blocked by data dependencies on any path to the frontier state r. Hence we can delete each state n with ops(n) = {x} in covered(r, x), update the predecessors of n to point to n's successor, and add x to r.

[Figure 12: Moving an assignment. (a) The loop before c is moved; (b) the loop made delete consistent by duplicating the state containing c; (c) the loop after c is moved to the root.]

Lemma 4 Let x be an assignment such that x ∈ available(r), and let N = {n | ops(n) = {x} and n ∈ covered(r, x)}. Assume that P is delete consistent for ⟨r, x⟩. Let P′ be P with the following changes (recall that ⟨⟩ is the empty branch; see Section 2):

1. Modify each n′ where succonbranch(n′, c) = n for some n ∈ N so that succonbranch(n′, c) = succonbranch(n, ⟨⟩).

2. Delete every n ∈ N.

3. Let ops(r) = ops(r) ∪ {x}.

Then P′ ≡ P.

Proof: For brevity, we only sketch the proof. The transformation can be implemented by a sequence of semantics-preserving Percolation Scheduling transformations between adjacent nodes [Nic]. Since each individual Percolation Scheduling transformation preserves program semantics, the entire sequence preserves program semantics. □

Of course, Lemma 4 only applies if P is delete consistent. We next show how to make an arbitrary loopless program delete consistent for ⟨r, x⟩. The set of predecessors of a state n is pred(n) = {n′ | n ∈ succ(n′)}. The following lemma gives an easy test for determining whether P is delete consistent.

Lemma 5 P is delete consistent for ⟨r, x⟩ iff, for all n ∈ covered(r, x), pred(n) ⊆ covered(r, x).

[Figure 13: Moving a test. (a) The loop before test b is moved; (b) the loop after b is moved to the root, with the state containing a duplicated on b's true and false branches to preserve control flow.]

Proof: If every predecessor of a member of covered(r, x) is in covered(r, x), then clearly every path from r to an n ∈ covered(r, x) is covered by ⟨r, x⟩. For the other direction, assume that there is an n ∈ covered(r, x) and, for some n′ ∈ pred(n), n′ ∉ covered(r, x). Then there must be a path from some frontier state r′ of the form ⟨r′, ..., n′, n⟩. This path is not covered by ⟨r, x⟩. □

The following algorithm makes P delete consistent for ⟨r, x⟩. Let C = covered(r, x). Iterate the following two steps until no n is chosen in step 1:

1. Choose n ∈ C such that some p ∈ pred(n) is not also in C.

2. Let n′ be a duplicate of n, and for every p ∈ pred(n) such that p ∉ C, if succonbranch(p, c) = n, then modify p so that succonbranch(p, c) = n′.

Note that this algorithm copies the minimum number of states needed to make P delete consistent. Once P is delete consistent, the steps of Lemma 4 can be applied to move x to the frontier state.
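A sketch of this duplication loop (Python; `duplicate` is an assumed helper that copies a state together with its outgoing succonbranch edges, and `preds` maps a state to its current predecessors):

```python
def make_delete_consistent(C, preds):
    """Duplicate states until every predecessor of a state in
    C = covered(r, x) is itself in C (the test of Lemma 5)."""
    C = set(C)
    changed = True
    while changed:
        changed = False
        for n in list(C):
            outside = [p for p in preds(n) if p not in C]
            if not outside:
                continue
            n2 = duplicate(n)                    # copy kept outside C
            for p in outside:
                for c, s in list(p.succ_on_branch.items()):
                    if s is n:
                        p.succ_on_branch[c] = n2 # redirect p around C
            changed = True
    return C
```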

All that remains is to update the nodeps and live sets. States that are duplicated in making P delete consistent retain the nodeps and live information of the original state. The set of states for which the analysis can change is covered(r, x); since both nodeps and live are computed bottom-up, the analysis can be updated in a single bottom-up pass over the paths covered by ⟨r, x⟩.

Finally, we show how to update the available operations in the case where the operation chosen by updateone is a test. Let r be a frontier state of P and let x = schedule(ops(r), available(r)). For updateone to move a test x while preserving P's semantics, it is necessary to modify the control flow of P. Intuitively, we duplicate all the covered paths from r to x; the original set of paths leads to the successor on x's true branch, and the duplicate set of paths leads to the successor on x's false branch. This transformation is illustrated in Figure 13. The program in Figure 13(a) is already delete consistent for ⟨root, b⟩. When b is moved to the root in Figure 13(b), state a is duplicated on b's true and false branches to preserve control flow. Recall that a branch c is a truth assignment to tests ⟨x₁ = b₁, ..., x_n = b_n⟩, where each b_i is one of true or false. The following lemma shows how to move a test to a frontier state.

Lemma. Let P be a program with frontier state r, and let x ∈ available(r), where x is a test. Assume P is delete consistent for (r, x), and let X = covered(r, x) in P. Program P′ is P with the following modifications, performed in order:

(1) For each n ∈ X, let n′ be a state such that ops(n′) = ops(n), and for each c ∈ branch(n), with m = succonbranch(n, c):

    succonbranch(n′, c) = m                                if m ∉ covered(r, x)
    succonbranch(n′, c) = m′                               if m ∈ covered(r, x) and ops(m) ≠ {x}
    succonbranch(n′, c) = succonbranch(m, ⟨(x, false)⟩)    if m ∈ covered(r, x) and ops(m) = {x}

(2) For each n ∈ X, if succonbranch(n, c) = m and ops(m) = {x}, then modify n so that succonbranch(n, c) = succonbranch(m, ⟨(x, true)⟩).

(3) For each c ∈ branch(r), where c = ⟨(x_1, b_1), ..., (x_n, b_n)⟩ and m = succonbranch(r, c), do:

    succonbranch(r, ⟨(x, true), (x_1, b_1), ..., (x_n, b_n)⟩) = m
    succonbranch(r, ⟨(x, false), (x_1, b_1), ..., (x_n, b_n)⟩) = m′ if m ∈ X, and m if m ∉ X

(4) Let ops(r) = ops(r) ∪ {x}.

Then P′ ≡ P.

Proof: Again, for brevity we only sketch the proof. It is easy to verify that P′ preserves the control flow of P. As in the proof of the previous lemma, the transformation can be expressed as a sequence of local transformations between adjacent nodes. □

Part (1) of the lemma duplicates covered paths by creating a copy n′ of every state in covered(r, x) and by assigning successors so that the paths formed by the n′ lead to the false branch of x. Part (2) modifies the original states in covered(r, x) so that they lead to the true branch of x. Part (3) modifies the branches of r to point to the original nodes if x is true and to the copied nodes if x is false. A description of procedure updateone is given in the figure below.

The implementation we have described is somewhat naive, and there are inefficiencies that can be eliminated at the cost of greater complexity in the algorithm. Most of the potential problems are related to space explosion, either in the size of the final code or in the size of intermediate data structures used by the algorithm. Some states that are initially different may become identical as a result of scheduling operations. This observation applies both to states in the parallel schedule and to states that have yet to be scheduled. A good implementation should merge states that are identical and are on identical paths. When performed on the states of the parallel schedule, this optimization reduces the size of the final code.

(Note that in this construction the original states containing x are not removed from P, but they become unreachable because control flow is redirected around them.)

procedure updateone(r, A, x)
    make P delete consistent for (r, x)
    X := covered(r, x)
    if x is an assignment then
        perform the steps of the lemma for moving an assignment
    else
        perform the steps of the lemma for moving a test
    update nodeps and live for n ∈ X and any states added in making P delete consistent
    return ⟨r, available(r)⟩

Figure: Implementation of procedure updateone.

A separate potential problem lies in the definition of delete consistency. Making the sequential program delete consistent prior to moving an operation x may result in duplicating many states of the sequential program. These duplicates cannot subsequently be merged, because x occurs on one set of paths in its original position (i.e., on those paths where x was blocked by a data dependence) and not on the set of paths where x was moved. A partial solution is to move x as far as possible on the paths where it is blocked by a data dependence, thus allowing some sharing of common paths. Some scheduling systems have this property [Nic, ME]. However, this optimization may be of marginal value in our algorithm, because the duplicated states of the sequential program are soon eliminated by subsequent scheduling anyway.

Another approach to improving the efficiency of the techniques presented here is to use a representation other than the control-flow graph for computing available operations. The obvious alternative is to use some form of the program dependence graph, which admits more efficient algorithms for some purposes; see [LA, AJLS] for uses of program dependence graphs in the context of software pipelining. We have presented our techniques using a control-flow graph representation for simplicity only; there is no barrier to using other, potentially faster, representations in an implementation.

Correctness of the Analysis

Earlier we assumed the available operations analysis was correct in order to prove the correctness of the software pipelining algorithm. Recall that the available operations analysis is correct if L ≡ L_∞, where L is a sequential loop and L_∞ is the infinite parallel program computed by pipeline. In this section we prove that the implementation of available operations given in the preceding sections is correct.

Lemma. Let L′ be the program defined by pipeline for some loop L. Then for any scheduler, L ≡ L′.

Proof: Let P be the infinite acyclic program formed by full unrolling of L. Apply pipeline using P for the available operations analysis, and let L′ be the final program. Each transformation of P by next or updateone preserves the semantics of P, by the preceding lemmas. Therefore L ≡ P ≡ L′. □

Managing the Window

For performance reasons it is obviously desirable to minimize the number of iterations of L that are actually used in the available operations analysis. It is possible to use only a few iterations of P because the scheduling constraint forces the available operations analysis for any state to span no more than k iterations of loop L. In this section we show how the number of iterations needed for available operations analysis can be limited to k.

The only problem with limiting the number of iterations used in the analysis is that different frontier nodes may require available operations from different iteration windows. For example, for a frontier state r, operations may be available from iterations i to i+k, but for another frontier state r′, operations may be available from iterations i+c to i+k+c. In a naive implementation, P must contain operations from iterations i through i+k+c to cover both frontier states. Fortunately, this is not necessary. We can first schedule r using iterations i to i+k, along with any other states that have operations available from iteration i. Once all states with operations available from iteration i are scheduled, a new iteration of L can be added to P and the window shifted to i+1 to i+k+1.

The figure below is a modified version of pipeline. In this implementation, all frontier states that have operations available from iteration i are scheduled before any frontier states whose operations are available only from later iterations. A new iteration is added to P only when every state that has operations available from iteration i is already scheduled. Thus P always contains the minimal number of iterations, and iterations are added to P as infrequently as possible.

There is one detail omitted from the figure: when the i-th iteration of L is added to P, the live sets of the leaf states (i.e., the states at the end of iteration i) must be initialized to the set of locations live at the end of iteration i.
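The bookkeeping this requires is small: work is keyed on the earliest iteration an available set draws from. A sketch, where minit, schedule_state, and add_iteration are passed in as stand-ins for the corresponding pieces of the algorithm:

    def pipeline_windowed(todo, minit, schedule_state, add_iteration):
        itnum = 0
        while todo:
            # schedule every frontier state still drawing from iteration itnum
            current = {s for s in todo if minit(s) == itnum}
            while current:
                state = current.pop()
                todo.discard(state)
                todo |= set(schedule_state(state))   # successors re-enter todo
                current = {s for s in todo if minit(s) == itnum}
            itnum += 1
            add_iteration(itnum)   # window now covers iterations itnum .. itnum + k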

Resources

Resource allocation is a critical issue for software pipelining algorithms. In this section we show how the allocation of functional units can be smoothly integrated into our software pipelining algorithm. Our approach to incorporating functional resources is similar to the reservation table methods used in dynamic scheduling algorithms [Bae]. To describe the modifications to the algorithm that accommodate functional resources, we require some additional definitions. Let {f_1, ..., f_n} be the set of functional units for a machine. We drop the assumption that every operation executes in a single cycle, and assume that c is the greatest number of cycles required by any operation. A reservation table is an n × c array of boolean values, where entry (i, j) is true iff resource f_i is busy at cycle j. For an operation x, the reservation table resources(x) describes the resources required by x in each cycle of x's execution.


procedure pipeline(A0)
    for all X: scheduledbefore[X] := no
    P := k iterations of L with empty root r
    todo := {⟨r, A0⟩}
    itnum := 1
    while todo ≠ ∅ do
    begin
        while there is a ⟨n, A⟩ ∈ todo such that minit(A) = itnum do
        begin
            if there is a j such that scheduledbefore[A^j] ≠ no then
                n′ := scheduledbefore[A^j]
                todo := todo − {⟨n, A⟩}
            else
                let ⟨n′, A′⟩ = schedulestate(n, A) and
                    {⟨n_i, A_i⟩} = next(n′, A′) in
                todo := (todo ∪ {⟨n_i, A_i⟩}) − {⟨n, A⟩}
                scheduledbefore[A′] := n′
        end
        itnum := itnum + 1
        add one iteration of L to the end of P and update the analysis
    end

Figure: The modified software pipelining algorithm.

There is one difficulty in extending the model to multi-cycle operations. If an operation x requires multiple cycles to complete, then its result is not available for multiple cycles. However, data dependencies and resource constraints alone do not prevent operations that depend on x's result from being scheduled in the cycle after x is initiated. We resolve this problem by treating an i-cycle operation x as i one-cycle operations; operations that depend on the result of x are made dependent on the last operation in the chain. To guarantee legal schedules, it is necessary to constrain the i unit-cycle operations to be scheduled in successive cycles without interruption. This constraint can be encapsulated entirely within the policy for selecting operations to schedule, and thus does not affect the overall structure of the software pipelining algorithm.
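A sketch of this expansion, assuming each original operation records its latency, its dependence predecessors (deps), and its consumers; OneCycleOp is a hypothetical wrapper for one cycle of the original operation:

    class OneCycleOp:
        def __init__(self, op, cycle):
            self.op, self.cycle = op, cycle
            self.deps = set()
            self.must_follow_immediately = None

    def expand_multicycle(op):
        pieces = [OneCycleOp(op, cycle=c) for c in range(op.latency)]
        pieces[0].deps = set(op.deps)          # first piece inherits op's inputs
        for a, b in zip(pieces, pieces[1:]):
            b.deps = {a}                       # chain the pieces in order
            b.must_follow_immediately = a      # scheduling policy, not a data dep
        for user in op.consumers:              # consumers wait for the last piece
            user.deps.discard(op)
            user.deps.add(pieces[-1])
        return pieces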

To allocate functional units, the software pipelining algorithm is modified so that when a state n is scheduled, there is a reservation table associated with n describing resource usage at that point in the schedule. The scheduler is modified so that it chooses an operation that is both available and for which resources can be allocated. Two reservation tables R_1 and R_2 are compatible if they do not require the same functional unit in the same cycle, i.e., there is no entry (i, j) such that R_1[i, j] = true = R_2[i, j]. If the reservation table R is associated with state n, then the scheduler must choose an operation x to schedule in n such that compatible(R, resources(x)). The following constraint modifies the earlier constraint on the scheduler to include reservation tables.

Constraint. Let X and A be sets of operations and let R be a reservation table. The scheduler is a function that takes a set of already scheduled operations, a set of available operations, and a reservation table, and returns an operation. In addition, if X ≠ ∅ and there exists an x ∈ A such that compatible(resources(x), R), then schedule(ops(n), A, R) ≠ none; i.e., the scheduler must choose an operation in every state if possible. Finally, we require that the scheduler's choice be invariant under shifts of the iteration index: for all i,

    schedule(X^{+i}, A^{+i}, R) = x^{j+i}, where x^{j+i} ∈ A^{+i} and compatible(resources(x^j), R), if schedule(X, A, R) = x^j, and
    schedule(X^{+i}, A^{+i}, R) = none if schedule(X, A, R) = none,

where X^{+i} denotes X with all iteration indices shifted by i.

The procedures next and updateone must also be modified to update reservation tables to reflect the changes in available resources when operations are scheduled. The procedure call next(n, A, R) should advance the reservation table R by one cycle, to reflect the fact that in successors of n the resources used in the first cycle of R are no longer reserved. The procedure call updateone(n, A, R, x) should not only update n and A, but also update R by adding the resources required by x.

The next constraint modifies the earlier constraint on updateone and next to include reservation tables. The logical or of two reservation tables R_1 and R_2 is a table R such that R[i, j] = R_1[i, j] ∨ R_2[i, j]. The reservation table advance(R) is a table R′ such that R′[i, j] = R[i, j+1] for j < c, and R′[i, c] = false.
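All three table operations are straightforward. The sketch below represents a reservation table as a list of n rows (one per functional unit) of c booleans (one per cycle):

    def compatible(R1, R2):
        # true iff no unit is claimed by both tables in the same cycle
        return all(not (a and b)
                   for row1, row2 in zip(R1, R2)
                   for a, b in zip(row1, row2))

    def table_or(R1, R2):
        # logical or: R[i][j] = R1[i][j] or R2[i][j]
        return [[a or b for a, b in zip(row1, row2)]
                for row1, row2 in zip(R1, R2)]

    def advance(R):
        # shift one cycle: R'[i][j] = R[i][j+1], last column becomes false
        return [row[1:] + [False] for row in R]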

Constraint. Consider an arbitrary set of available operations A, a state n, an operation x, and a reservation table R. Then there exists a set of operations B such that for all i:

(1) updateone(m, A^{+i}, R, x^{+i}) = ⟨m′, B^{+i}, R ∨ resources(x^{+i})⟩, where ops(m) = ops(n)^{+i} and ops(m′) = ops(n)^{+i} ∪ {x^{+i}}.

Furthermore, there exist sets of operations A_j and states n_j for 1 ≤ j ≤ |branch(n)| such that for all i:

(2) next(m, A^{+i}, R) = {⟨n_j, A_j^{+i}, advance(R)⟩}, where ops(m) = ops(n)^{+i}.

procedure schedulestate(n, A, R)
    while schedule(ops(n), A, R) ≠ none do
        let x = schedule(ops(n), A, R) in
            ⟨n, A, R⟩ := updateone(n, A, R, x)
    return ⟨n, A, R⟩

procedure pipeline(A0)
    for all X and R: scheduledbefore[X, R] := no
    let r be an empty node and R0 be an empty reservation table in
        todo := {⟨r, A0, R0⟩}
    while there is a ⟨n, A, R⟩ ∈ todo do
        if there is a j such that scheduledbefore[A^j, R] ≠ no then
            n′ := scheduledbefore[A^j, R]
            todo := todo − {⟨n, A, R⟩}
        else
            let ⟨n′, A′, R′⟩ = schedulestate(n, A, R) and
                {⟨n_i, A_i, R_i⟩} = next(n′, A′, R′) in
            todo := (todo ∪ {⟨n_i, A_i, R_i⟩}) − {⟨n, A, R⟩}
            scheduledbefore[A′, R′] := n′

Figure: The software pipelining algorithm with reservation tables.

The figure above gives a modified version of the software pipelining algorithm that includes reservation tables. For simplicity, the modifications are presented relative to the original algorithm rather than the more efficient windowed version. Note that the detection of repeating states now involves both the set of available operations and the reservation table. Using the two constraints above, it is straightforward to adapt the original proof of correctness of software pipelining to prove the correctness of the algorithm in the figure. Termination is still guaranteed because there are only a finite number of reservation tables, and therefore repeating states are guaranteed to occur.
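The scheduler's obligation under the first constraint, to choose some compatible available operation whenever one exists, can be met by a simple greedy filter. In this sketch, priority is an assumed heuristic ranking and resources is passed in; compatible is as defined above:

    def schedule(X, A, R, priority, resources):
        # X (the already scheduled operations) is available to smarter heuristics
        candidates = [x for x in A if compatible(resources(x), R)]
        if not candidates:
            return None        # schedulestate then stops filling this state
        return max(candidates, key=priority)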

Register Allocation

Registers are another critical resource that must be utilized effectively to achieve good results in practice. Traditional register allocation can interact very badly with software pipelining. If register assignment is performed before scheduling (the usual practice), then software pipelining may produce poor results, because the register allocator may unnecessarily reuse registers, thus adding data dependences to the program. Our approach is to modify an initial register allocation on the fly during software pipelining.

The basic technique is easy to describe; it is based on a similar technique of Ebcioglu [Ebc].

[Figure: Dynamically improving register allocation. (a) Instruction b is unavailable: b's destination register is an operand register of a. (b) After renaming registers: b writes a spare register, and a register move c restores the original destination.]

Consider the program fragment in part (a) of the figure above. In this example, operation b is not available for scheduling at the root because its target register is one of the operand registers of operation a. However, if there is a spare register, then the dependence can be broken by renaming the destination register of b, as in part (b). Now operation b is available for scheduling. It is necessary to insert a register move c into the program to restore the machine state after operation a. This transformation is a heuristic: it assumes that the advantage gained in eliminating the dependence outweighs the cost of the extra copy. This is usually true, and almost always the copy operation can be removed by a later global pass of generalized copy propagation [PNW].
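The renaming step itself is mechanical. A sketch, with a hypothetical Operation record carrying a dest register and a srcs list:

    from dataclasses import dataclass, field

    @dataclass
    class Operation:                        # hypothetical three-address operation
        opcode: str
        dest: str
        srcs: list = field(default_factory=list)

    def rename_destination(a, b, free_regs):
        # break the dependence of b on a by renaming b's destination
        if b.dest not in a.srcs or not free_regs:
            return None                     # nothing to do, or no spare register
        old, fresh = b.dest, free_regs.pop()
        b.dest = fresh                      # b now writes the spare register
        # the returned copy c restores the machine state after operation a
        return Operation(opcode="move", dest=old, srcs=[fresh])

A later pass of generalized copy propagation can usually delete the returned move.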

There is an additional problem with the register allocation scheme described above. Including register allocation in the software pipelining algorithm requires that registers be taken into account when determining when two states are the same. A sufficient condition is that two states s_1 and s_2 can be considered the same only if each register that holds a value generated by an operation x^i in state s_1 holds the value generated by the corresponding shifted operation x^{i+j} in state s_2 (for a fixed shift j). This condition guarantees correctness and termination, and is analogous to the similar requirements for available operations and functional units.

In practice, it appears that this condition alone permits an impractical number of states to be generated in some loops, and a repeating pattern fails to emerge within a reasonable number of steps. The problem is that even for small register files, the number of possible assignments of values to registers is astronomical. To accelerate convergence of the pipelining algorithm, it is necessary to limit the space of possible register assignments in some way. The solution we use is as follows. A register file r′ is a renaming of register file r if r can be mapped to r′ by some set of register-to-register transfers. In the software pipelining algorithm, two states are considered to be equivalent if the register files in the two states are renamings of each other and the other conditions on operations and functional units are satisfied.
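Under this design, the equivalence test reduces to a check on the values held, plus the transfers to issue on the back edge. A simplified sketch: register files are dicts from register names to symbolic values, and cyclic permutations, which need a scratch register, are ignored.

    def is_renaming(rf1, rf2, live):
        # rf2 is a renaming of rf1 if every live value in rf2 is held
        # somewhere in rf1, so transfers can move it into place
        held = set(rf1.values())
        return all(v in held for v in rf2.values() if v in live)

    def back_edge_copies(rf1, rf2):
        # (source, destination) moves that turn rf1 into rf2
        where = {v: reg for reg, v in rf1.items()}
        return [(where[v], reg)
                for reg, v in rf2.items()
                if v in where and rf1.get(reg) != v]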

This design identifies many register files with one another, thereby accelerating convergence of the algorithm. The cost is that register-to-register transfers must be issued on the back edges of pipelined loops to move values into the correct registers. These copy operations can be eliminated by a separate copy-elimination optimization pass after software pipelining. Leaving the copy operations in the code is reasonable as well, as they incur only a minor performance penalty (see the experimental results below).

Implementation and Experiments

The software pipelining algorithm described here has been implemented as part of a compiler project at the University of California, Irvine. The compiler is a version of the GNU C compiler (GCC), modified to accommodate our methods. GCC is used as a front end to translate the C source into an intermediate representation. This translation includes an initial register allocation and a number of common optimizations, such as jump optimization (e.g., removing jumps to jumps) and common subexpression elimination; GCC's loop unrolling and inlining are disabled and replaced by our software pipelining and scheduling algorithm.

A number of incremental optimizations, e.g., incremental tree-height reduction, are beneficial in conjunction with software pipelining. For the results presented in this paper, only dynamic renaming (see the previous section) and load-after-store elimination are performed together with pipelining. Load-after-store elimination identifies loads that depend on a unique store; such loads can be eliminated in favor of uses of the value being stored. In some cases the store can be removed as well, if it is known that the eliminated load is the only read of the location written by the store. Load-after-store elimination is useful because it removes register spill code that becomes dead as a result of dynamic renaming. Both dynamic renaming and load-after-store elimination are an inherent part of our on-the-fly register allocation scheme.

The strength of our software pipelining framework is the flexibility to exploit whatever fine-grain parallelism is available in a loop. Restrictions placed on code motions are designed to be as weak as possible while still guaranteeing correctness and termination. As discussed below, this flexibility does in fact translate into very good speedups across a variety of architectural models.

The downside of the weak restrictions of our system is that there are a huge number of potential states, even for small loops and machines with modest resources. The huge state space can cause slow convergence of pipelining to a pattern and large final loops. There is a clean solution to this problem: the scheduler should be designed to minimize code explosion by restricting code motions that increase code size. For the purposes of this paper, we have focused the experiments to reveal information about the software pipelining algorithm, not about particular smart scheduling heuristics. Thus we have used only very simple greedy list-scheduling heuristics that make no effort to take account of the impact of code motions on code size. As we shall see shortly, in the majority of cases the size of the state space is not a problem, and software pipelining converges quickly to a pattern even with naive scheduling.

In some cases, using an iteration window that is large enough to maximally exploit the available parallelism results in unjustifiably slow convergence. To identify these cases, in this experiment we find it useful to introduce the notion of cutoff convergence, which constrains the maximum number of iterations scheduled to some fixed amount; the remainder of any paths that have not converged after the cutoff number of iterations are simply scheduled sequentially. We stress that cutoff convergence is a creature of our experiment: its purpose is to identify when code explosion is a problem. In practice, one should prefer scheduling heuristics designed to prevent code explosion; this topic is discussed further below.

Name     Latency   Description
ALU      1 cycle   integer add/sub and logical operations
SHIFT    1 cycle   arithmetic and logical shifts
FALU     cycles    floating point add/sub and logical operations
MUL      cycles    integer and floating point multiply
DIV      cycles    integer and floating point divide
MEM      cycles    cache read (a cache miss stalls the processor); 1 cycle for a cache write
BRANCH   cycles    conditional branch

Table: Functional unit kind and latency (the multi-cycle latency counts were lost).

Architectural Models

Two pipelined VLIW architecture models are used for the experiments: one with homogeneous functional units and one with heterogeneous functional units. Both models assume a single wide register file shared by all functional units. With the exception of two unlimited-resource experiments used to measure threshold performance results, the register file is assumed to have a fixed number of registers.

Operation latencies for both models, given in the table above, are similar to those of the Motorola superscalar. An instance of the heterogeneous model has one of three fixed numbers, or an unlimited number, of each of the functional units defined in the table. An instance of the homogeneous model has one of three fixed numbers, or an unlimited number, of homogeneous functional units, where each homogeneous functional unit can perform any of the functions defined in the table.

Each VLIW instruction specifies one (possibly NOP) operation for each functional unit. Each operation has the optional side effect of advancing the pipeline. For both models there are no hardware interlocks for detecting data or control hazards, so the compiler is entirely responsible for ensuring that all hazards are avoided at run time.

Experimental Results

The tables below show the dynamic speedup measured for both target architecture models on the Livermore Loops. The speedups are with respect to running the unscheduled code sequentially on the target architecture. Thus the speedups reflect both the exploitation of multiple functional units and pipelining within a functional unit.

[Table: Homogeneous Multi-cycle Functional Units: SPEEDUP. One row per Livermore kernel (LL) plus the average; one column per homogeneous functional-unit configuration, plus the Infx and InfxInf threshold configurations. Numeric entries lost.]

[Table: Heterogeneous Multi-cycle Functional Units: SPEEDUP. One row per Livermore kernel (LL) plus the average; one column per heterogeneous configuration (units of each type), plus the Infx and InfxInf threshold configurations. Numeric entries lost.]

The first three columns of the homogeneous speedup table show the speedups for the homogeneous model with the fixed register file and the three homogeneous functional-unit configurations, respectively. The heterogeneous table shows the same information for the heterogeneous model configured with the three counts of units of each type, again assuming the fixed register file. The last two columns of each table show threshold performance levels that are discussed below.

For the results presented in the first columns of each table, loops were pipelined with progressively larger iteration windows until there was no noticeable increase in the average speedup over all benchmarks. The numbers in parentheses at the top of each column show the smallest iteration window sizes for which the highest average performance was attained for each configuration, and for which the speedups shown in the tables were generated. Notice that all of the window sizes are fairly small. For the results shown in the first three columns, a fixed maximum number of iterations is scheduled, i.e., this is the cutoff. In almost all cases only a fraction of this number is needed for convergence to a pattern. The conv type column in the miscellaneous performance tables below shows how the algorithm terminated: P indicates convergence to a pattern, and C indicates cutoff convergence on at least one path.

The last two columns, which are identical for both tables, show the speedups obtained assuming an unlimited number of functional units and either the fixed register file (column Infx) or an unlimited number of registers (column InfxInf). For both columns we want to show the maximum speedup that can be obtained for the specific architecture configuration, given the fixed code motion capabilities, scheduling heuristics, and front-end optimizations used in our system. Therefore, for the Infx and InfxInf columns, the iteration window size and cutoff limits were set to the number of iterations that would be executed by each loop at run time. Thus loops that exhibit natural convergence are guaranteed to be optimal in the sense defined for the theorem of the section on optimal software pipelining below, and the few loops that do not converge are optimal in the same sense because they are fully unrolled and scheduled. Note that at the largest finite configurations, the speedups are already optimal with respect to the Infx numbers, but were obtained with a small iteration window. For many of the loops, the largest-configuration speedups are optimal even with respect to the InfxInf numbers, which shows that even optimal register allocation for these loops cannot increase performance.

A wide range of speedups appears in the tables. What we have tried to show is that, given a fixed set of code motion capabilities, scheduling heuristics, and front-end optimizations such as those produced by GCC, our software pipelining algorithm is able to achieve the same performance as fully unrolling and scheduling the loop. Furthermore, despite the generality of our approach, the algorithm manages to achieve good utilization of resources even with naive scheduling heuristics. The overall performance of these benchmarks, with either pipelining or complete unrolling, could be improved in a number of ways that are orthogonal to our software pipelining approach, e.g., by improving memory reference disambiguation.

(Although, due to cyclic dependencies involving long-latency operations such as floating-point division and/or procedure calls, the performance of some of these loops would not likely improve significantly even with perfect disambiguation.)

[Table: Homogeneous Multi-cycle Functional Units: MISC PERFORMANCE MEASURES. For each homogeneous configuration and for the Infx and InfxInf configurations, the table lists, per Livermore kernel: conv type, reg use, min loop, max loop, and total size. Numeric entries lost.]

[Table: Heterogeneous Multi-cycle Functional Units: MISC PERFORMANCE MEASURES. For each heterogeneous configuration (units of each type) and for the Infx and InfxInf configurations, the table lists, per Livermore kernel: conv type, reg use, min loop, max loop, and total size. Numeric entries lost.]

In the rest of this section we discuss and interpret the performance results in more detail. To aid in interpreting the results shown in the speedup tables, we present the following performance measures in the miscellaneous performance tables:

conv type: Convergence type. P means that pipelining converged on a pattern, and C means that it converged on the cutoff.
reg use: The maximum number of registers used at any instruction (i.e., state).
min loop: The number of instructions on the shortest path through the pipelined loop.
max loop: The number of instructions on the longest path through the pipelined loop.
total size: The total number of instructions in the benchmark, including inner loop instructions as well as all code preceding and succeeding the loop.

As discussed in the register allocation section, software pipelining sometimes inserts register-to-register transfers in order to speed convergence of the algorithm. Because these transfers can be eliminated by copy propagation, albeit at the cost of an increase in code size, we have not counted them in the speedup figures. Even if the copies are not eliminated, the figures in the tables show that the performance penalty is low: the worst case is that all registers are live and must be copied on the back edge, which costs only a small fraction of the length of the pipelined loop body. For large loops the penalty is smaller still; for small loops the overhead can be reduced by unrolling the pipelined loop body.

There are two interesting anomalies in the speedup tables. The first is that for a few benchmarks in both tables, the speedup actually decreases slightly after some increases in the number of resources. One cause is that even though two pipelined loops may exhibit the same asymptotic speedup, the overhead from their pre-loop and/or post-loop code can differ (e.g., the speedup of one kernel decreases slightly when going from Infx to InfxInf). The other cause of some small decreases in performance when resources increase is overly simplistic scheduling heuristics. For instance, the list scheduling heuristics currently used in the compiler allow operations to be scheduled much earlier than their next use, potentially saturating the register file at subsequent states and thus preventing the removal, via renaming, of false dependencies that might otherwise allow operations on the critical path to be scheduled earlier. A couple of the kernels provide good examples of this effect. The fewer resources there are, the fewer unimportant (i.e., off the critical path) operations are scheduled far ahead of their next uses, and the less likely it is that the register file becomes saturated with unimportant values. This problem can be alleviated with different scheduling heuristics; in any case, this issue is orthogonal to software pipelining itself.


(Since there is only a single path through most of these loops, min loop and max loop are usually equal, and in that case they give the total number of instructions in the pipelined loop.)

The other anomaly occurs for a few loops in each table. Factoring out considerations like the above scheduling anomaly and other heuristic aspects such as speculative scheduling, we would expect speedup to increase linearly with the number of functional units until some threshold speedup is reached. Thus, for each doubling in the number of functional units, we would expect the speedup to be the lesser of twice the old speedup and the maximum unlimited speedup; however, for this second class of anomaly, at the largest doubling in each table we see that the speedup is slightly less than this expected value. The reason is that while the iteration window size was chosen to maximize the average speedup shown in the tables, the performance of a few of the loops in each table would have improved with a larger window size, though without any significant effect on the average speedup over all loops.

Finally, it is interesting to consider the circumstances under which the algorithm fails to converge to a pattern before the cutoff is reached. An analysis of the kernels with type C convergence (see the miscellaneous performance tables) shows that the problem arises in vectorizable loops or, more generally, loops with very few flow dependencies. In this case the only constraints are resource constraints, and operations are free to move almost anywhere in the schedule. In this situation the lack of dependence structure in the program, combined with greedy scheduling heuristics, tends to lead to an explosion in the set of states, slowing convergence. Variations on a device of Ebcioglu's may show how to modify the scheduler to avoid this problem [Ebc]. The basic idea is to introduce artificial dependencies that don't harm parallelism extraction but dramatically reduce the number of potential states the scheduler may explore. For example, a rule of thumb for vectorizable loops could be that operation x^i must be scheduled no later than the time of x^{i+1}. Since the loop is vectorizable, there is no reason to prefer scheduling one before the other; eliminating some orderings reduces the overall number of potential states.
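A sketch of that rule of thumb, where ops_by_iteration is an assumed list of dicts mapping an operation's name to its instance in each unrolled iteration; the function returns ordering edges (x^i, x^{i+1}) meaning the second may not be scheduled before the first:

    def add_artificial_deps(ops_by_iteration):
        # constrain x in iteration i to be scheduled no later than
        # x in iteration i+1, pruning interchangeable orderings
        edges = []
        for earlier, later in zip(ops_by_iteration, ops_by_iteration[1:]):
            for name, x_i in earlier.items():
                if name in later:
                    # an ordering hint only, not a data dependence
                    edges.append((x_i, later[name]))
        return edges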

Notice that in most cases the total size of the final loop is an order of magnitude larger than the shortest and longest paths through the loop. Because loop control conditionals from succeeding iterations are scheduled in parallel with operations from preceding iterations, a new loop exit path is usually created for each iteration scheduled. In some cases this is simply the cost side of a cost vs. performance tradeoff inherent to scheduling conditional jumps, the benefits of which are to allow strictly control dependent operations (e.g., operations like stores that cannot be renamed) to be scheduled earlier, and to commit to alternative control paths as early as possible so as to minimize the amount of speculative scheduling. Fortunately, in many cases, such as for loop exit code, this cost can often be significantly reduced by merging multiple identical control paths into a single shared path. In the context of available operations scheduling, this is accomplished by merging states with identical available operations sets, an optimization we have not implemented.

(To guarantee the preservation of correct semantics when scheduling a conditional above operations that precede it, it is necessary to duplicate those operations onto each branch of the conditional after it has been scheduled.)

On Optimal Software Pipelining

In this section we briefly review research on the limitations of software pipelining, especially a result showing that optimal software pipelining is unachievable [SGE]. Given this result, we show that our algorithm is as good as possible, in the sense that it can produce arbitrarily good schedules.

Research in software pipelining has naturally focused on discovering algorithms for computing pipelined schedules, both in general and for specific machines. Concurrently, researchers have investigated the theoretical limitations of software pipelining. One of the central theoretical questions is whether or not there is a software pipelining algorithm that produces optimal pipelined schedules for an arbitrary loop. Because scheduling algorithms are based on preserving data dependences, the natural meaning of optimality is with respect to the length of dependence chains.

Definition. A program L is time optimal if, for every execution ⟨⟨X_1, s_1⟩, ..., ⟨{stop}, s_n⟩⟩ of L, n is the length of the longest dependence chain in the execution.

The obvious form of the optimality question is stated as follows: is there an algorithm which takes as input a machine description (i.e., resource constraints, instruction timings, etc.) and a loop, and produces a time optimal schedule for that machine? This problem statement is not very useful, however, because scheduling problems with finite resources are computationally intractable even without software pipelining. To gain some insight into software pipelining itself, researchers have usually abstracted the problem as: given sufficient resources and a loop L, is there an algorithm which computes a time optimal schedule for L?

The answer to this question is trivially no for some programs, such as the example given earlier in the paper. Recall that in that example, instructions d and e cannot be scheduled in the same instruction because they write the same store location. One branch of the test must always be optimized at the expense of the other branch, and thus there does not exist a parallel version that is time optimal. The conflict between d and e is usually classified as another type of dependence, an output dependence [KKP]. To avoid this problem, we can rephrase the question again: given unbounded resources and a loop L without output dependences, is there an algorithm which computes a time optimal schedule for L? This question has been resolved negatively [SGE]. Again, the problem is that for some loops an optimal closed-form parallel version does not exist.

While the definition is natural, it appears that so many qualifications are required to apply it in the analysis of general software pipelining algorithms that it ceases to be useful. For our purposes, we adopt a different definition of what it means for a software pipelining algorithm to be as good as possible. Recall from the correctness discussion that the loop L_∞ is the infinite parallel program that results from scheduling with complete information about available operations. While L_∞ may not be optimal, it represents the best that can be done with global knowledge of the program and the ability to fully unroll loops. The following theorem shows that as the window size k of the software pipelining algorithm increases, the quality of the code approaches that of L_∞.

Theorem. Let L be a loop, and let L_k be the result of applying pipeline with a scheduling window of k iterations. Let s be any store on which the execution of L terminates. Define t(L, s) to be the length of the execution of L in store s. Then

    lim_{k→∞} t(L_k, s) = t(L_∞, s).

Proof: Let i be the largest index of any iteration in the execution of L on store s. For any k ≥ i, programs L_k and L_∞ have identical executions on store s. □

The theorem is a theoretical result, since in practice the scheduling window k cannot cover more than a few iterations. However, it does show that within the framework of our algorithm it is possible to generate arbitrarily good code, subject to the ability of the scheduler to make good scheduling decisions for finite resources.

Related Work

Software pipelining is actually a relatively old idea: programmers in the microcode community software pipelined code by hand for decades [Kog]. The first semi-automatic technique for software pipelining was proposed by Charlesworth [Cha]. For an overview of the history of instruction level parallelism, see [RF].

Today there are a variety of algorithms and frameworks for software pipelining. We describe each and discuss its relationship to our own work. Because of the large amount of work in the area, our discussion of each proposal is necessarily brief.

Modulo Scheduling

Modulo scheduling is an important software pipelining technique introduced by Rau and Glaeser [RG] and subsequently used as the basis for numerous other algorithms [Lam, Jon, RTS, Huf, WMHR]. Modulo scheduling has been used in compilers for the FPS series [Tou], the polycyclic machine [RG], and Cydrome's Cydra [Cyd].

A basic modulo scheduling algorithm works as follows. Consider a loop L that requires a resource k times per iteration of the loop body. If the target machine has t copies of the resource, then an upper bound on the throughput is one iteration of L every ⌈k/t⌉ cycles. Let the initiation interval s be the maximum of ⌈k/t⌉ over all resources. In modulo scheduling, the loop body is heuristically scheduled one statement at a time. When a statement is scheduled at time c, the instance of that statement in iteration i is scheduled at time c + is. If at any point a statement cannot be added to the schedule due to resource or dependency constraints, then the schedule is abandoned and the algorithm either backtracks or tries a larger initiation interval.
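The resource bound on the initiation interval can be computed directly. A small sketch, with uses and counts as assumed dicts keyed by resource name: for instance, a loop issuing four memory operations per iteration on a machine with two memory units cannot initiate iterations faster than one every ceil(4/2) = 2 cycles.

    import math

    def initiation_interval(uses, counts):
        # s = max over resources of ceil(k/t): k uses per iteration,
        # t copies of the resource on the machine
        return max(math.ceil(k / counts[res]) for res, k in uses.items())

    def modulo_slot(c, i, s):
        # a statement placed at time c runs at time c + i*s in iteration i
        return c + i * s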

Modulo scheduling smoothly integrates the simultaneous treatment of resource constraints and software pipelining. The primary disadvantage of modulo scheduling is that it does not apply directly to loops with conditional tests in the loop body. Two extensions have been proposed to overcome this limitation. Lam introduced hierarchical reduction to combine modulo scheduling with complex control flow [Lam]. In hierarchical reduction, the then and else branches of a conditional test are first scheduled independently. The shorter branch is padded with no-ops to make it the same length as the longer branch, and the scheduler encapsulates the entire if-then-else construct as a single statement. Hierarchical reduction suffers from several drawbacks: first, some paths are padded with no-ops, which may slow execution; second, treating the if-then-else as a single statement necessarily overestimates resource requirements; and third, preserving the control structure of the program restricts possible code motions.

A second proposal for integrating modulo scheduling with conditional tests is to use if-conversion [AKPW] before modulo scheduling and reverse if-conversion [WHB, WMHR] after modulo scheduling. When a loop is if-converted, the expression of control flow is changed from explicit jumps to guarded operations, where each operation of the original loop is guarded by the predicates of the conditionals that control its execution. In this way, all non-trivial control flow in the loop is replaced by data dependences.

Modulo scheduling with if-conversion appears to improve upon modulo scheduling with hierarchical reduction [WMHR]. However, if-conversion retains the undesirable features of hierarchical reduction to a considerable degree. First, because control flow is expressed as data dependence, speculative execution of operations (i.e., moving operations above conditionals) is not possible, nor is it possible to reorder conditionals, for the same reason. Thus the possible code motions are restricted. (This limitation is noted in [WMHR]; no indication is given of how it can be overcome.) In addition, performing if-conversion greatly hinders the management of limited resources during scheduling. If-conversion schedules all the operations in the original loop body in a single basic block. These operations compete for resources during scheduling, including operations that could never execute simultaneously because they appear on different control paths in the original loop. Thus straightforward modulo scheduling of if-converted loops overestimates resource requirements.

For the case of loops without control flow and unlimited resources, there is considerable commonality between our algorithm and modulo scheduling. For example, in [ANa] it was shown that a simplified version of our algorithm produces optimal code for loops without conditionals in the body and for machines with sufficient resources. Despite the differences in conception between the two algorithms, this result was later shown to hold for a small modification of modulo scheduling as well [Jon].

In short, our algorithm combines software pipelining, resource constraints, and handling of control flow with a flexibility not matched by current modulo scheduling techniques. In our opinion, the significant practical advantage of modulo scheduling at this time is that in cases where both techniques produce equally fast schedules, the schedules produced by modulo scheduling are generally more concise [Jon].

Pipeline Scheduling

The work most closely related to our own is that of Ebcioglu [Ebc] and Ebcioglu and Nakatani [NE, EN], and later Moon and Ebcioglu [ME]. Pipeline scheduling differs from our approach in that the loop body is not constructed by scheduling and testing for repeating states. Instead, the original loop is incrementally transformed to create a parallel schedule. Software pipelining is achieved by moving operations across the back edge of the loop; this has the effect of moving an operation between loop iterations. The handling of control flow is based on the same principles as our own approach and is equally general. A scheduling window of operations is also used [NE], although the purpose is to reduce code explosion rather than to guarantee termination.

An advantage of pipeline scheduling is that the loop is always equivalent to the original loop, and therefore it is legal to apply any semantics-preserving transformation to the loop at any time, even transformations that have little to do directly with scheduling. Ebcioglu and Nakatani exploit this property by aggressively renaming registers and performing strength reduction optimizations, which substantially alter the dependence structure of the loop. We also apply some of these optimizations (see the implementation section), but cannot apply them as generally as pipeline scheduling, because of our need to guarantee regular dependences for correctness. As an aside, to the best of our knowledge, modulo scheduling implementations do not perform any transformations that modify the dependence graph.

An advantage of our algorithm over the current pipeline scheduling algorithm is in the handling of resource constraints. Pipeline scheduling uses only local transformations to move operations from one state to another. Thus, at some points, resource constraints may need to be violated in the schedule as an operation moves through one state on its way to another. To deal with this property, pipeline scheduling has a moderately elaborate phase structure in which resource constraints are alternately enforced and relaxed on specific portions of the loop body. Our algorithm treats resource constraints in a more direct and uniform way.

GURPR

GURPR (for Global Unrolling, Pipelining and Rerolling) is a software pipelining technique proposed by Su, Ding and Xia [SDX]. The technique is based on URPR, an algorithm for pipelining loops without tests [SDX]. Given a loop L, the first step of GURPR is to apply URPR to each path through the original loop body. The separate pipelined paths are then put together to form the pipelined loop, with compensation code added at points where execution could jump from one path to another.

The approach is similar in philosophy to trace scheduling: paths are first optimized as basic blocks, ignoring jumps into and out of the path, and then fixup code is added to ensure correctness. GURPR is also subject to the same criticism as trace scheduling: there is no reason why the execution of a program should repeatedly follow the same path through the loop body. Our approach, and the approach of Ebcioglu and Nakatani, is more uniform, overlapping iterations on all paths rather than on a subset of paths.

Petri Net Techniques

Recently there has been interest in using Petri Nets to formalize the software pipelining problem [GWN, RA]. There is a natural mapping from operations, dependences, and resource constraints into Petri Nets, thus combining all of these features in a single, well understood formalism. This approach has been shown to be competitive with modulo scheduling with hierarchical reduction [RA] and appears promising.

The weakness of current algorithms based on Petri Net techniques is that control flow is handled in a way very similar to if-conversion. The net effect of the mapping into the Petri Net model is that control flow is enforced just like data dependences, and thus speculative execution of operations is not possible. Furthermore, the rate of execution of iterations is determined by the length of the longest path through the loop body, even when shorter paths through the loop are taken during execution.

Conclusions

We have presented a simple but fairly detailed description of a compaction-based software pipelining algorithm that handles resource constraints. The novel aspect of our algorithm is that it cleanly (in fact, completely) separates issues specific to software pipelining, such as detecting repeating pipeline states and termination, from other orthogonal issues, such as the computation of available operations and scheduling decisions. We hope that this makes two contributions to the state of the art. First, our algorithm explains in a fairly simple way what software pipelining is about and what its unique characteristics are. Second, the modular and simple design of our algorithm should facilitate the development of general, retargetable implementations of software pipelining.

References

[AAG] M. Annaratone, E. Arnould, T. Gross, H. T. Kung, M. Lam, O. Menzilcioglu, K. Sarocky, and J. A. Webb. Warp architecture and implementation. In Proceedings of the Annual International Symposium on Computer Architecture, June.

[Aik] A. Aiken. Compaction-Based Parallelization. PhD thesis, Cornell University Department of Computer Science. Technical Report.

[Aik] A. Aiken. A theory of compaction-based parallelization. Theoretical Computer Science.

[AJLS] V. H. Allan, J. Janardhan, R. M. Lee, and M. Srinivas. Enhanced region scheduling on a program dependence graph. In Proceedings of the International Symposium and Workshop on Microarchitecture (MICRO), December.

[AKPW] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proceedings of the Symposium on Principles of Programming Languages, January.

[ANa] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.

[ANb] A. Aiken and A. Nicolau. Perfect Pipelining: A new loop parallelization technique. In Proceedings of the European Symposium on Programming. Springer-Verlag Lecture Notes in Computer Science, March.

[AN] A. Aiken and A. Nicolau. A realistic resource-constrained software pipelining algorithm. In Advances in Languages and Compilers for Parallel Processing. MIT Press.

[Bae] J. L. Baer. Computer Systems Architecture. Computer Science Press.

[Cha] A. E. Charlesworth. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. IEEE Computer.

[Cyd] Cydrome, Inc., Palo Alto, CA. Technical Summary.

[Ebc] K. Ebcioglu. A compilation technique for software pipelining of loops with conditional jumps. In Proceedings of the Annual Workshop on Microprogramming, December.

[EN] K. Ebcioglu and A. Nicolau. A global resource-constrained parallelization technique. In Proceedings of the ACM SIGARCH International Conference on Supercomputing, June.

[EN] K. Ebcioglu and T. Nakatani. A new compilation technique for parallelizing loops with unpredictable branches on a VLIW architecture. In Languages and Compilers for Parallel Computing. MIT Press.

[Fis] J. Fisher. 2^n-way jump microinstruction hardware and an effective instruction binding method. In Proceedings of the Annual Workshop on Microprogramming, December.

[Fis] J. A. Fisher. Trace Scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, July.

[FOW] J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, June.

[GWN] G. Gao, Y. Wong, and Q. Ning. A timed Petri-Net model for fine-grain loop scheduling. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.

[Huf] R. A. Huff. Lifetime-sensitive modulo scheduling. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.

[Jon] R. B. Jones. Constrained Software Pipelining. Master's thesis, Department of Computer Science, Utah State University, Logan, UT, September.

[KKP] D. J. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Proceedings of the SIGACT/SIGPLAN Symposium on Principles of Programming Languages, January.

[KN] K. Karplus and A. Nicolau. Efficient hardware for multiway jumps and prefetches. In Proceedings of the Annual Workshop on Microprogramming, December.

[Kog] P. M. Kogge. The microprogramming of pipelined processors. In Proceedings of the Annual International Symposium on Computer Architecture.

[LA] R. M. Lee and V. H. Allan. Advanced software pipelining and the program dependence graph. In Fourth IEEE Symposium on Parallel and Distributed Processing, December.

[Lam] M. Lam. A Systolic Array Optimizing Compiler. PhD thesis, Carnegie Mellon University.

[ME] S. M. Moon and K. Ebcioglu. An efficient resource-constrained global scheduling technique for superscalar and VLIW processors. In Proceedings of the International Symposium and Workshop on Microarchitecture (MICRO), December.

[NE] T. Nakatani and K. Ebcioglu. Combining as a compilation technique for VLIW architectures. In Proceedings of the Annual Workshop on Microprogramming.

[NE] T. Nakatani and K. Ebcioglu. Using a lookahead window in a compaction-based parallelizing compiler. In Proceedings of the Annual Workshop on Microprogramming.

[Nic] A. Nicolau. Uniform parallelism exploitation in ordinary programs. In Proceedings of the International Conference on Parallel Processing, August.

[NPA] A. Nicolau, K. Pingali, and A. Aiken. Fine-grain compilation for pipelined machines. Journal of Supercomputing, August.

[PBJ] K. Pingali, M. Beck, R. Johnson, M. Moudgill, and P. Stodghill. Dependence flow graphs: An algebraic approach to program dependences. In Proceedings of the Symposium on Principles of Programming Languages, January.

[PNW] R. Potasman, A. Nicolau, and H. G. Wang. Register allocation, renaming and their impact on fine-grain parallelism. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing. Springer-Verlag Lecture Notes in Computer Science, April.

[RA] M. Rajagopalan and V. H. Allan. Efficient scheduling of fine grain parallelism in loops. In Proceedings of the Annual International Symposium on Microarchitecture, December.

[RF] B. R. Rau and J. Fisher. Instruction-level parallel processing: History, overview and perspective. Journal of Supercomputing, January.

[RG] B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the Annual Workshop on Microprogramming, October.

[RTS] B. R. Rau, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.

[SDX] B. Su, S. Ding, and J. Xia. URPR: an extension of URCR for software pipelining. In Proceedings of the Annual Workshop on Microprogramming, October.

[SDX] B. Su, S. Ding, and J. Xia. GURPR: a method for global software pipelining. In Proceedings of the Annual Workshop on Microprogramming, December.

[SGE] U. Schwiegelshohn, F. Gasperoni, and K. Ebcioglu. On optimal parallelization of arbitrary loops. Journal of Parallel and Distributed Computing.

[Tou] R. F. Touzeau. A Fortran compiler for the FPS scientific computer. In Proceedings of the ACM SIGPLAN Symposium on Compiler Construction, June.

[WHB] N. J. Warter, G. E. Haab, and J. W. Bockhaus. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the International Symposium and Workshop on Microarchitecture (MICRO), December.

[WMHR] N. J. Warter, S. A. Mahlke, W. W. Hwu, and B. R. Rau. Reverse if-conversion. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June.