OPTIMAL SOFTWARE PIPELINING: INTEGER LINEAR PROGRAMMING APPROACH

by Artour V. Stoutchinin

School of Computer Science, McGill University, Montréal

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES AND RESEARCH

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

Copyright © by Artour V. Stoutchinin


Acknowledgements

First, I thank my family - my Mom, Nina Stoutchinina, my brother, Mark Stoutchinine, and my sister-in-law, Nina Denisova, for their love, support, and the infinite patience that comes with having to put up with someone like myself. I also thank my father, Viatcheslav Stoutchinin, who is no longer with us. Without them this thesis could not have happened.

My advisor, Professor Guang R. Gao, has directed me throughout this work. The great knowledge that he shared with his students, his high standards in class and in research, and his constant encouragement made this thesis possible. I am honored to have worked with him.

I also benefited a lot from discussions with Dr. Erik Altman from the IBM Watson Research Center and Dr. Govind Ramaswamy from the Indian Institute of Technology, who provided me with useful directions in my research. I feel lucky to have been able to work with them.

I have worked closely with Dana Tarlescu since my first day in school on many topics that led to the completion of this thesis. Dana's ingenuity and hard work inspired me, and I am grateful for all of her help during my first difficult steps in computer science, and for the wonderful friendship that I am proud of.

Professor Laurie Hendren, although not directly involved with my thesis, often helped me during this work by asking challenging questions and providing useful advice.

Many other people have made my life in Montreal a happy one. My roommates, Yuri Kroukov and Katya Baranovskaya, endured my presence for a long time while we were sharing an apartment. I will always treasure the time spent in their company. I am grateful to Igor and Lena Fomenko, Christina Parent, and Benoit and Anne de Dinechin for the help and support that they gave me in a variety of ways. I also thank my friends from our graduate office, Peter Alleyne, Fai Jacqueline Yeung, Xinan Tang and Khalil El-Khatib, for making my stay at McGill more than enjoyable.

My fellow students from the ACAPS group have given me a great deal of help during all this time: Andres Marquez, Luis Lozano, V.C. Sreedhar, Christopher Lapkowski, Shamir Merali, Kevin Theobald, Zhu Yingchun.

I also wish to thank Dr. John Ruttenberg, Woody Lichtenstein, David Lively, Verna Lee, Dr. Ross Towle, Violet Jen, Bettina Le Veille and others from the Developers Magic Group at Silicon Graphics Computer Systems in Mountain View, California, where the bulk of my research was done in the summer of 1995. Their expertise guided me and helped me to complete this work.

Finally, I owe special thanks to Mike Sung from the Massachusetts Institute of Technology for proofreading and correcting the final draft, and to Anne de Dinechin for translating the abstract of this thesis into French.

Abstract

In optimizing code for high-performance processors, software pipelining of innermost loops is of fundamental importance. In order to benefit from software pipelining, it is essential to: (i) find a rate-optimal legal schedule, and (ii) allocate registers to the found schedule (it must fit into the limited number of available machine registers). This thesis deals with the development of a software pipeliner that produces the best possible schedules in terms of required registers, thus assisting register allocation.

Software pipelining and register allocation can be formulated as an integer linear programming (ILP) problem, aiming to produce optimal schedules. In this thesis, we discuss the application of integer linear programming to software pipelining and design a pipeliner for the MIPS R8000 superscalar processor. We extended the previously developed ILP framework into a full software pipelining implementation by: (1) establishing an ILP model for the R8000 processor, (2) implementing the model in the Modulo Scheduling ToolSet (MOST), (3) integrating it into the MIPSpro compiler, (4) successfully producing real code and gathering runtime statistics, and (5) developing and implementing a model for optimization of the memory system behavior on the R8000 processor.

The ILP-based software pipeliner was tested as a functional replacement for the original MIPSpro software pipeliner. Our results indicate a need for improving the ILP formulation and its solution: (1) the existing technique failed to produce results for loops with large instruction counts, (2) it was not able to guarantee register optimality for many interesting and important loops, for which optimal scheduling is necessary in order to avoid spilling, and (3) the branching order, in which an ILP solver traverses the branch-and-bound tree, was the single most significant factor affecting the ILP solution time, leading to the conclusion that exploiting the scheduling problem structure is essential for improving the efficiency of ILP problem solving in the future.

Résumé

Le pipeline logiciel de boucles internes est d'une importance fondamentale dans l'optimisation des codes pour processeurs hautes-performances. Pour bénéficier du pipeline logiciel, il est essentiel: (i) de trouver l'ordonnancement valide à débit optimal, et (ii) d'allouer des registres à l'ordonnancement obtenu (en utilisant le nombre limité de registres machine disponibles). Cette thèse a pour objet le développement d'un pipelineur logiciel qui produise les meilleurs ordonnancements possibles en termes de registres requis, tout en facilitant l'allocation des registres.

Dans le but de produire des ordonnancements optimaux, le pipeline logiciel et l'allocation des registres peuvent être formulés comme un problème de programmation linéaire en nombres entiers (PLNE). Dans cette thèse, nous discutons de l'application de la programmation linéaire en nombres entiers au pipeline logiciel, et nous dérivons un pipelineur pour le processeur MIPS R8000. Nous étendons le cadre initial de la PLNE à une implantation complète d'un pipelineur logiciel (1) en établissant un modèle de PLNE pour le processeur R8000, (2) en implantant le modèle au MOST (Modulo Scheduling ToolSet), (3) en l'intégrant au compilateur MIPSpro, (4) en produisant avec succès du code réel et en collectionnant des statistiques d'exécution, (5) en développant et implantant un modèle pour l'optimisation du comportement du système mémoire sur le processeur R8000.

Le pipelineur logiciel PLNE a été testé en tant que remplaçant fonctionnel du pipelineur logiciel MIPS. Nos résultats montrent le besoin d'améliorer la formulation du PLNE et sa solution : (1) la technique existante ne peut produire de résultats pour des boucles avec un grand nombre d'instructions, (2) elle n'est pas capable de garantir l'optimisation des registres pour de nombreuses boucles intéressantes et importantes, pour lesquelles un ordonnancement optimal est nécessaire afin d'éviter tout débordement, (3) l'ordre de séparation, selon lequel un solveur traverse l'arbre de séparation-évaluation, s'avère être le facteur principal qui règle le temps pour obtenir une solution, nous faisant conclure qu'une exploitation de la structure du problème est essentielle pour améliorer l'efficacité de la méthode de résolution par PLNE pour les problèmes futurs.

Contents

Acknowledgements

Abstract

Résumé

1 Introduction

1.1 McGill ILP Formulation
1.2 Thesis Contributions
1.3 Thesis Organization

2 Software Pipelining Basics

2.1 Simple Example
2.2 Dependence
2.3 Basic Definitions
2.4 Modulo Scheduling

3 The R8000 Processor Design

3.1 Processor Overview
3.2 Instruction Pipeline
3.2.1 Instruction Fetch
3.2.2 Integer and Address Generation Pipelines
3.2.3 Floating Point Execution Pipeline
3.3 CPU - FPU Interface
3.3.1 Floating Point Queueing Mechanism
3.3.2 TBus
3.4 Memory System
3.4.1 Streaming Cache
3.5 Instruction Set Architecture

4 ILP Model for a Superscalar Processor

4.1 ILP Formulation
4.2 ILP Formulation for Superscalars
4.2.1 Modified Resource Constraints
4.2.2 Objective Function
4.2.3 Upper Bound on the Number of Registers
4.2.4 Lower Bound on the Number of Registers
4.2.5 Loop Overhead Optimization
4.3 R8000 Memory System Optimization
4.3.1 Memory Reference Analysis
4.3.2 Memory Constraints for the ILP Formulation

5 ILP Model for the MIPS R8000

5.1 Resource Scheduling on the MIPS R8000
5.2 R8000 Machine Description
5.3 Software Pipelining Algorithm

6 Experimental Results

6.1 Experimental Framework
6.2 Highlights of Experimental Results
6.3 Results and Analysis
6.3.1 Memory Stalls
6.3.2 Performance Comparison of ILP vs SGI Pipeliner
6.3.3 Minimizing Register Requirements
6.4 Branching Order
6.5 Short Trip Count Performance

7 Conclusions and Future Work

7.1 Summary
7.2 Future Work

Appendix A: Reservation Tables

Appendix B

List of Figures

2.1 Types of Dependence Arcs
2.2 The DDG for the loop in Section 2.1
3.1 R8000 Microprocessor
3.2 Streaming Cache Access
4.1 (a) Reservation table for a single-precision divide; (b) its CRT for II=11
4.2 (a) CRT of a two-stage function unit; (b) A matrix of a schedule
4.3 CRT for a single-precision divide for II=6
5.1 The Flow-Chart of the Software Pipeliner
6.1 Relative performance of ILP over SGI
6.2 Improvement in the ILP performance due to memory system optimization
6.3 Relative performance of the ILP over SGI on Livermore Loops

Chapter 1

Introduction

This introduction assumes a basic familiarity with the software pipelining and modulo scheduling techniques. If this is not the case, Chapter 2 reviews their fundamentals.

In recent years, the concept of instruction-level parallelism has played a central role in the microprocessor designs of all the major CPU manufacturers. Several processors, such as the DEC Alpha 21064, the IBM RS/6000, the MIPS R8000, the Intel i860 and i960, and the Sun Microsystems SPARC, derive their benefit from instruction-level parallelism.

Instruction-level parallel processors take advantage of the parallelism in programs by performing multiple machine-level operations simultaneously. A typical such processor (a superscalar or VLIW processor) provides multiple pipelined function units in parallel, thus allowing the simultaneous issue of multiple operations per clock cycle [39, 47, 25]. In order to take advantage of instruction-level parallelism, these machines have to be programmed at a very low level, and those who write programs for them must be familiar with the details of the hardware design, such as instruction timings and resource usage patterns. This is a very tedious, time-consuming and error-prone task. Compilation techniques are needed to expose parallelism in programs written in a high-level language.

Instruction scheduling [46, 24, 21] is a compiler parallelization technique used to facilitate the hardware's task of exploiting instruction-level parallelism, of which loops are the largest source. Because scheduling loops is of primary importance, much work is being done on this, and different loop scheduling algorithms, such as trace scheduling after a loop has been unrolled [18, 17], software pipelining [9, 40], percolation scheduling [a], and hierarchical reduction [29], have been developed.

This thesis studies software pipelining, which has received a lot of attention recently and has been successfully implemented in production compilers [44]. There are two reasons for this. First, software pipelining offers better code quality, i.e., achieves higher speedups, compared to the other scheduling techniques. And second, although the problem of finding the throughput-optimal software pipelined schedule is in general NP-hard [11], scheduling heuristics have been shown to effectively turn software pipelining into a practical and efficient compilation method.

The throughput is of primary concern in software pipelining and, although heuristics produce good quality parallel schedules, none of them can guarantee optimality of the throughput. Several distinct approaches to software pipelining exist, all of them trying to solve this problem. One approach is to move individual instructions across the loop branch [15, 14, 28, 20, 30]. The decision to move a particular operation is arbitrary, and it is not clear how to achieve optimal results. Another possibility is to unroll the loop and search for a repeating pattern [1, 2]. This approach sometimes requires too large an unrolling before such a pattern is found. One more approach, and the one that has been successfully implemented in production compilers, is Modulo Scheduling [40]. Modulo scheduling encompasses a broad class of algorithms and implementations, with some of them assuming special hardware support [48, 6, 41], or developed for conventional architectures [29] (see Chapter 2).

One question that such heuristic approaches leave unanswered is "How well do these methods do their job?" Indeed, how close do we get to the optimal schedule, is there any room for improvement, are we getting all that we want from the existing technology? These types of questions led to the development of exact scheduling methods that guarantee the optimality of their results. The main idea behind the exact methods is to represent the scheduling problem as an optimization problem with a set of linear scheduling constraints and an objective function minimizing some cost criterion (linear programming and integer linear programming) [32]. A number of interesting results of using the linear and integer linear programming approach for software pipelining have been published recently [35, 23, 5, 4, 16].

1.1 McGill ILP Formulation

The interest in software pipelining at McGill stemmed from work on register allocation for loops on data-flow machines. This work culminated in a mathematical formulation of the problem in a linear periodic form [19]. It was soon discovered that this formulation could also be applied to software pipelining for conventional architectures. The formulation was then used to prove an interesting theoretical result: the minimum storage assignment problem for rate-optimal software pipelined schedules can be solved using an efficient polynomial time method, provided the target machine has enough function units so that resource constraints can be ignored [35, 34]. This method used a linear programming formulation for finding, from the set of all schedules for a loop, the fastest schedule using minimum buffers. Buffers are an approximation to registers, and buffer minimization serves as a good approximation to register minimization. The drawback of this method was that it ignored the resource constraints of the target architecture.

This limitation was overcome in the following work [5, 23], where an integer linear programming formulation was proposed that guarantees finding the fastest software pipelined schedules that satisfy all the resource constraints of the target machine. Finally, register considerations were taken into account, and an integer linear programming formulation was developed that, in addition to producing the fastest schedules, guaranteed that they use the minimum number of registers [4].

This work was never intended to become a full software pipelining implementation: its output was a set of static quality measures, not runnable code, and its only targets were machine models that exhibited certain interesting properties, never a real commercial high-performance processor. Thus, it was not clear whether this method could be used for practical compiling. How would it compare with a heuristic implementation? How much better would its results be?

1.2 Thesis Contributions

This thesis concentrates on the development of an integer linear programming software pipeliner for the MIPS R8000 microprocessor, based on the previous work by E. Altman [4], and on implementing it in the Modulo Scheduling ToolSet (MOST). In developing this software pipeliner, our main interests were to study how well the ILP approach would work when targeted to a real processor in the setting of a production compiler, and to study the feasibility of a full implementation that would generate runnable code.

The first important contribution of this thesis is the development of a complete ILP model for the MIPS R8000 processor, taking full advantage of its superscalar capabilities, and the design and implementation of a software pipeliner based on this model. To our knowledge, most existing implementations of software pipelining that use the ILP approach represent research tools rather than real code generators, and have been too preliminary for detailed measurement and evaluation¹. We are not aware of any other ILP implementation targeting a real processor. In particular, we believe this thesis to be the first measurement of runtime performance for ILP-based code generation for software pipelines. In order to achieve this, the ILP software pipeliner was embedded in the MIPSpro compiler over the summer of 1995. In that context, the ILP pipeliner served as a functional replacement for the original MIPSpro software pipeliner, and enjoyed a proven pipelining framework - a full set of optimizations and analyses before pipelining, and a robust post-processing implementation to integrate the pipelined code correctly back into the program.

The second important contribution of this thesis is an assessment of the quality of exact software pipelining methods, and a study of the impact of various aspects of software pipelined schedules on the resulting performance. The main drawback of existing heuristic methods is their inability to guarantee optimality, specifically register optimality, of the schedules they generate. Exact methods that guarantee register optimality of the resulting code, on the other hand, are computationally complex, and optimal scheduling of many important loops is very expensive and often out of reach [4, 16]. However, we showed that register optimality only matters if its absence leads to register allocation failure. This allowed us to trade strict register optimality in exchange for broader application of our method. We also show that addressing R8000 memory organization issues has a greater impact on the resulting performance than minimizing the number of registers required for a schedule. We also studied overhead reduction issues where optimal code generation for short trip count loops is of interest.

¹Also, ILP software pipelining models mostly targeted VLIW-like architectures and did not deal with the issues of scheduling for superscalar architectures.

We compared the performance of the code scheduled by the ILP software pipeliner against that of the MIPSpro heuristic software pipeliner. It took designers nearly four years to develop and implement this heuristic, which is at the core of the MIPSpro compiler's code generation and whose basic goal is generating very high quality software pipelined inner loop code. At the time the R8000-based systems with the MIPSpro compiler were shipped, they had the best reported performance on a number of important benchmarks, in particular for the floating point SPEC92. The compiler's software pipelining capabilities played the central role in delivering this performance on the R8000. Our ILP software pipeliner was able to achieve performance comparable to that of the MIPSpro heuristics.

This work shows the possibility of using exact software pipelining methods in the setting of a production compiler. In our experiments, 743 loops from the SPEC92 floating point benchmark were scheduled, and code was generated for these loops. We discovered that the order in which the instructions of the loop are scheduled has a significant impact on the efficiency of the integer linear program solving, thus showing that exploiting the problem structure is essential for improving our pipeliner's efficiency.

In summary the main contributions of this thesis are:

• we developed a complete ILP model for the MIPS R8000 processor, including optimization of the R8000 processor's memory system behavior;

• we designed and implemented an ILP software pipeliner based on these models;

• we integrated the developed ILP software pipeliner with the MIPSpro production compiler;

• we evaluated the quality of the ILP software pipelining approach compared to the leading heuristic technology. Our results indicate the importance of optimizing the MIPS R8000 memory system behavior for good performance; we also show that optimal solution methods have the potential to improve upon existing heuristic methods in the performance of loops with large instruction counts and severe register pressure, and in loops with short trip counts.

1.3 Thesis Organization

The rest of this thesis is organized as follows. In Chapter 2 we describe software pipelining and modulo scheduling basics. The MIPS R8000 microprocessor architecture is described in Chapter 3. In Chapter 4 we present the integer linear programming formulation for modulo scheduling and our modification of it with an integer linear programming formulation for optimizing the R8000 memory system behavior. The integer linear programming based software pipeliner is described in Chapter 5, and the results of its performance evaluation using the SPEC92 floating point benchmark suite are presented in Chapter 6. Finally, we put forward some conclusions and give suggestions for future work after discussing related research in this area.

Chapter 2

Software Pipelining Basics

Software pipelining was proposed originally as a hand-coding technique that improved the quality of innermost loop code on pipelined machines [9]. The general formulation of the software pipelining problem for a single basic block appeared in [40]. Since then, software pipelining has become one of the most important optimizations performed by today's compilers for high-performance architectures. In Section 2.1 we illustrate the basic concept of software pipelining; dependencies are overviewed in Section 2.2; a basic introduction to software pipelining and modulo scheduling is given in Sections 2.3 and 2.4.

2.1 Simple Example

The principle of software pipelining can be illustrated by the following example. Suppose we have a loop:

x = load A(i)
z = z + x
A(i) = store z

Assume that the value of x is ready 2 cycles after the load is issued, and the result of the addition is ready on the next cycle after the add is issued.

The shortest sequence for the loop body is:

1. x = load A(i)
2.
3. z = z + x
4. A(i) = store z

Executed sequentially this loop needs 4 clock cycles per iteration to complete.

Notice that a new iteration can be initiated every 2 clock cycles (overlapped execution):

1. load
2.
3. add      load
4. store
5.          add
6.          store

The software pipelined loop body appears at clock cycles 3 and 4, and is called a Kernel or a Steady State. Each loop iteration needs only 2 clock cycles to complete, the instructions in clock cycles 1-2 (the Prologue) and 5-6 (the Epilogue) being the overhead needed to fill up and drain the pipeline. The frequency with which iterations of the loop are initiated is limited by the dependencies between the loop statements and by resource availability.
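The overlap above can be reproduced mechanically: given the issue cycles of one iteration and an initiation interval, laying successive iterations out II cycles apart yields the prologue, kernel, and epilogue. The sketch below is only illustrative (the cycle numbers come from the example; the function and variable names are ours):

```python
# Sketch: lay out overlapped iterations of the example loop for II = 2.
# One iteration issues: load at cycle 1, add at cycle 3, store at cycle 4
# (the add must wait 2 cycles for the load's result).
ITERATION = {1: "load", 3: "add", 4: "store"}
II = 2

def overlap(n_iters, ii=II):
    """Return {cycle: [ops]} for n_iters iterations started ii cycles apart."""
    schedule = {}
    for it in range(n_iters):
        for cycle, op in ITERATION.items():
            schedule.setdefault(cycle + it * ii, []).append(op)
    return schedule

sched = overlap(2)
for cycle in sorted(sched):
    print(cycle, " ".join(sched[cycle]))
# Cycles 3-4 hold the kernel ("add load", then "store"), matching the steady
# state shown above; cycles 1-2 form the prologue and 5-6 the epilogue.
```

With more iterations the same kernel simply repeats every II cycles, which is exactly why the kernel alone becomes the new loop body.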

2.2 Dependence

Dependence between two statements in a program can be viewed as a precedence relation between these two statements, i.e., if a statement T is dependent on a statement S, then T may only be executed after S has finished its execution (more precisely, after S's results are computed and available). Two types of dependencies occur in programs: control and data dependencies. In this section we treat the subject of dependence briefly for the purpose of the further discussion (a more comprehensive overview can be found in [49]).

A control dependence exists, for example, between each of the statements on the two alternative paths of an if statement and the if statement itself. Thus, the statements on the then path are control dependent on the if statement. In software pipelining, a loop is assumed to be executed a large number of times. Therefore, the flow of control is assumed to move sequentially from one loop iteration to another. For the rest of the discussion we will focus on loops consisting of a single basic block, i.e., not containing conditional statements that cause control dependencies to occur. Many loops can be transformed into such a form using if-conversion [36, 13].

There are three kinds of data dependencies that prevent reordering of the statements in programs, thereby limiting scheduling possibilities. These are flow, anti, and output dependencies (the fourth kind of dependence, the input dependence, does not prevent statements from being reordered, and we will not consider such dependencies here).

A flow dependence occurs when a value computed (defined) in statement S is read (used) in some statement T; we say that T is data flow-dependent on S. This type of data dependence shows how data flows between the statements of the program. For example, consider the following situation:

S: x = a + b
T: y = x + c

Statement S computes the value of x that statement T uses in its computation of y; therefore T is data flow-dependent on S. This kind of dependence is also called a read after write (RAW) dependence, and a scheduling algorithm cannot reorder these two statements because, otherwise, statement T might access the wrong value of x.

An anti-dependence occurs when a variable is used in statement S before that variable is reassigned (redefined) in statement T; we say that T is data anti-dependent on S. For example, consider this:

S: y = x + a
T: x = b

Statement T is data anti-dependent on S since it redefines the value of x after it has been used by statement S. A scheduling algorithm cannot reorder these two statements because, otherwise, S and any statement between S and T that uses (reads) x would access the wrong value of x. This kind of dependence is also called a write after read (WAR) dependence.

An output dependence occurs when a variable is assigned (defined) in statement S before that variable is reassigned (redefined) in some statement T; we say that T is data output-dependent on S. For example:

S: y = a
T: y = b

Statement T is data output-dependent on statement S because it writes into the variable y after statement S does. Again, a scheduler cannot reorder them because, in that case, all the statements that use variable y after T might use the unupdated value. This dependence is also called a write after write (WAW) dependence.

Figure 2.1: Types of Dependence Arcs

Data dependencies in the program must not be violated in order for a schedule to be legal (i.e., to produce correct results). When performing software pipelining, the loop body is modeled as a directed graph G = {V, E}, where V is the set of nodes that correspond to the statements of the loop body, and E is the set of arcs that correspond to the data dependencies between these nodes¹. The three types of dependence arcs, corresponding to the three types of data dependencies, are shown in Figure 2.1.

Such a graph is called the Data Dependence Graph (DDG). For each directed arc (i, j) we call xi its tail and xj its head. The dependence relationship is specified by assigning to each arc (i, j) in the DDG a pair of attributes (Ωij, dij):

• dij, or delay, is the time in clock cycles required for the tail operation xi to produce the result that the head operation xj will use (in the case of a flow or output dependence), or the time in clock cycles required for the tail operation xi to read its operand defined by the head operation xj (in the case of an anti-dependence).

• Ωij, or Iteration Distance, is the number of iterations that separates an instance of the tail operation xi from an instance of the head operation xj. For example, if Ωij = 0, the dependence between xi and xj exists within the same iteration, and is called a loop-independent dependence. If, on the other hand, Ωij = 2, then xj depends on xi from two iterations ahead, and such a dependence is called a loop-carried dependence with iteration distance 2.

¹A statement, an instruction, an operation, and a DDG node are used interchangeably throughout the text.

The DDG for the loop in Section 2.1 is shown in Figure 2.2.

Figure 2.2: The DDG for the loop in Section 2.1
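In code, a DDG of this kind reduces to a set of arcs labeled with (Ω, d) pairs. Below is one plausible encoding of the example loop's graph: the delays follow the latencies assumed in Section 2.1, and the distance-1 arc reflects z carrying its value between iterations. This is our reading of the example, not a transcription of Figure 2.2:

```python
# Sketch: the DDG of the Section 2.1 loop as arcs (tail, head) -> (omega, d).
DDG = {
    ("load", "add"):  (0, 2),  # flow: x is ready 2 cycles after the load issues
    ("add", "store"): (0, 1),  # flow: z is ready 1 cycle after the add issues
    ("add", "add"):   (1, 1),  # loop-carried flow: z = z + x reads last iteration's z
}

for (tail, head), (omega, d) in DDG.items():
    kind = "loop-carried" if omega > 0 else "loop-independent"
    print(f"{tail} -> {head}: omega={omega}, d={d} ({kind})")
```

Representing arcs this way makes the scheduling constraints of Section 2.4 directly checkable, since each constraint reads its (Ω, d) pair straight off the arc.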

2.3 Basic Definitions

The objective of software pipelining is to achieve the highest throughput possible for a given loop by overlapping the execution of consecutive iterations.

Assume a set V of N statements of a loop body {x1, x2, ..., xN}. Let X denote the number of loop iterations. A set of times at which each iteration of each loop statement is initiated is called a schedule, i.e.

Definition 2.1 An integer schedule is a set σ = {σ1, σ2, ..., σN}, where each σi is an integer function σi: {1, 2, ..., X} → ℕ, i = 1, 2, ..., N, such that for 1 ≤ k1 < k2 ≤ X, σi(k1) < σi(k2); here ℕ is the set of nonnegative integers.

If the loop iterations are initiated at a constant rate (computation rate), the schedule is said to be periodic.

Definition 2.2 A schedule σ is periodic with period II > 0 if there exist integer numbers ti such that:

σi(k) = ti + (k − 1) · II,   i = 1, 2, ..., N,  k = 1, 2, ..., X.   (2.1)

In equation (2.1), ti corresponds to the time of the initiation of the operation xi in the first iteration of the loop. The length of the interval at which iterations are initiated is called the Initiation Interval (II) [40]. Smaller values of II correspond to higher computation rates and higher throughputs.
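Equation (2.1) makes a periodic schedule fully determined by the first-iteration issue times ti and the period II; the monotonicity required by Definition 2.1 then holds automatically for II ≥ 1. A small illustrative sketch, using the issue times of the example loop from Section 2.1:

```python
# Sketch: a periodic schedule sigma_i(k) = t_i + (k - 1) * II  (equation 2.1).
def sigma(t_i, II):
    return lambda k: t_i + (k - 1) * II

II = 2
t = {"load": 1, "add": 3, "store": 4}            # first-iteration issue times
s = {op: sigma(ti, II) for op, ti in t.items()}

print(s["add"](1), s["add"](2), s["add"](3))     # 3 5 7: one add every II cycles
# Definition 2.1's monotonicity: sigma_i(k1) < sigma_i(k2) whenever k1 < k2.
assert all(s["load"](k) < s["load"](k + 1) for k in range(1, 10))
```

The whole scheduling problem thus collapses to choosing the integers ti and II, which is what makes the linear-programming formulations of Chapter 4 possible.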

Definition 2.3 A schedule σ is legal if it obeys the data dependence constraints and the resource constraints of the target architecture.

Definition 2.4 A schedule σ is rate-optimal if it is legal and achieves the maximal computation rate.

For arbitrary loops, rate-optimal software pipelining is impossible [45]. However, for loops consisting of a single basic block, schedules in the periodic form (2.1) are asymptotically rate-optimal. Thus, in our discussion we will focus on finding rate-optimal schedules of the form (2.1).

2.4 Modulo Scheduling

There are two basic approaches to finding the periodic schedules described in the previous section: the Operational approach and Modulo Scheduling. The Operational approach consists in the simultaneous unrolling and scheduling of a loop body until a repeating pattern is found. Although this approach has the potential to produce very good schedules, it is not guaranteed that the pattern will appear quickly, and the amount of information that needs to be maintained during the scheduling is fairly large [1, 2, 3].

Unlike the above approach, Modulo Scheduling encompasses a large number of algorithms and implementations that attempt to find a periodic schedule under the Modulo Constraints. There are two types of constraints that a schedule must obey in order to be legal:

1. Data Precedence constraints.

Let σ be a periodic schedule. σ_i(k) = t_i + (k - 1) · II means that operation x_i begins its k-th iteration at time t_i + (k - 1) · II under schedule σ.

σ is a legal schedule if for every pair of nodes x_i and x_j in the DDG that are connected by a dependence arc i → j : (d_ij, Ω_ij), the following is true:

t_j - t_i ≥ d_ij - Ω_ij · II, ∀(i, j) ∈ E (2.3)

This means that operation x_j must be issued at least d_ij cycles after operation x_i from the Ω_ij-th previous iteration. If Ω_ij = 0, operation x_j must follow operation x_i from the same iteration by at least d_ij cycles.
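The precedence test above is mechanical, so it can be sketched in a few lines (an illustration of ours, with illustrative latencies; the edge format and names are not from the thesis):

```python
# Sketch: verify the modulo precedence constraint
# t_j - t_i >= d_ij - Omega_ij * II for every dependence arc.
def obeys_precedence(t, edges, II):
    """t: dict node -> first-iteration start time t_i;
    edges: list of (i, j, d_ij, omega_ij) dependence arcs."""
    return all(t[j] - t[i] >= d - omega * II for i, j, d, omega in edges)

# Example in the spirit of the load/add/store loop: i1 -> i2 with
# delay 3 (same iteration), i2 -> i2 with delay 1 and distance 1,
# i2 -> i3 with delay 1 (same iteration). Latencies are assumed.
edges = [(1, 2, 3, 0), (2, 2, 1, 1), (2, 3, 1, 0)]
t = {1: 0, 2: 3, 3: 4}
print(obeys_precedence(t, edges, II=2))  # True
```

Shrinking t_2 to 1 would violate the first arc (1 - 0 < 3 - 0), so the same function returns False for that schedule.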

2. Resource constraints.

An operation x_i using a certain resource at time t in its k-th iteration uses the same resource at times ..., t - 2·II, t - II, t, t + II, t + 2·II, .... Thus, the placement of an operation in the schedule must avoid resource conflicts with future iterations of previously scheduled operations.

Because the form of the schedule and the scheduling constraints are defined in terms of the II, scheduling becomes too hard if the II is not known at scheduling time. To avoid such complications, Modulo Scheduling algorithms use an iterative approach, i.e. they begin by establishing lower and upper bounds on the II, and then search, among the possible values, for the smallest Initiation Interval for which a legal schedule can be found. The lower bound on the II, MinII, for a given loop is determined by two factors: the existence of recurrence cycles in the loop and the number of resources available for that loop's execution.

1. Recurrence cycles.

The recurrence cycles are possible because of the loop-carried dependencies between the loop statements. Consider a cycle C in the DDG:

x_1 → x_2 → ... → x_m → x_1

The corresponding dependence edges are:

(d_12, Ω_12), (d_23, Ω_23), ..., (d_m1, Ω_m1)

The corresponding data precedence constraints for each edge are:

t_2 - t_1 ≥ d_12 - Ω_12 · II, ..., t_1 - t_m ≥ d_m1 - Ω_m1 · II

Add equations (2.3) to obtain:

II ≥ (Σ_{(i,j)∈C} d_ij) / (Σ_{(i,j)∈C} Ω_ij)

The ratio of the sum of delays along the cycle C to the sum of iteration distances along the cycle C limits the rate at which the iterations of a loop containing C can be initiated. The cycle of the loop with the maximal such ratio is called a critical cycle and is in fact the most restrictive cycle in that loop (there may be more than one such cycle). Thus, the II of a loop cannot be smaller than the MinII determined by the critical cycles:

MinII_rec = ⌈(Σ_{(i,j)∈C_c} d_ij) / (Σ_{(i,j)∈C_c} Ω_ij)⌉

where C_c denotes one of the critical cycles of the loop.
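The recurrence bound can be computed by enumerating the simple cycles of the DDG and taking the maximal delay/distance ratio. The sketch below (ours, with illustrative data) uses a naive brute-force enumeration that is adequate only for tiny graphs:

```python
# Sketch: recurrence-based lower bound MinII_rec = max over cycles C
# of ceil(sum of delays along C / sum of distances along C).
from itertools import permutations
from math import ceil

def rec_min_ii(nodes, edges):
    """edges: dict (i, j) -> (delay d_ij, distance omega_ij)."""
    best = 0
    # brute force: try every ordering of every subset size (tiny graphs only)
    for r in range(1, len(nodes) + 1):
        for cyc in permutations(nodes, r):
            arcs = list(zip(cyc, cyc[1:] + cyc[:1]))
            if all(a in edges for a in arcs):
                d = sum(edges[a][0] for a in arcs)
                omega = sum(edges[a][1] for a in arcs)
                if omega > 0:
                    best = max(best, ceil(d / omega))
    return best

# Single recurrence i2 -> i2 with delay 1, distance 1: MinII_rec = 1.
edges = {(1, 2): (3, 0), (2, 2): (1, 1), (2, 3): (1, 0)}
print(rec_min_ii([1, 2, 3], edges))  # 1
```

A production compiler would use a polynomial minimum-cost-to-time-ratio cycle algorithm instead of enumeration, but the quantity computed is the same.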

2. Available Resources

Modulo Scheduling requires each operation of the loop to be executed exactly once during each II. Consider some resource s, such as an ALU, a write port of one of the register files, or an instruction issue slot. Let χ(s) be the set of instructions that use this resource s, and let s_i be the total number of clock cycles for which the instruction x_i ∈ χ(s) uses the resource s. The total number of cycles during which the resource s is used by all instructions x_i ∈ χ(s) in the loop during each loop iteration is Σ_{x_i∈χ(s)} s_i. If the number of copies of resource s in the target architecture is F_s ≥ 1, such that they may be used by different instructions in parallel, then at least ⌈(Σ_{x_i∈χ(s)} s_i) / F_s⌉ cycles are needed for all instructions in one loop iteration to complete their usage of s. Because each instruction must execute exactly once during II cycles, the II must be at least:

II ≥ ⌈(Σ_{x_i∈χ(s)} s_i) / F_s⌉

Taking the maximum over all the resources of the target architecture is equivalent to saying that the length of the II is bounded by the resource for which there is the greatest contention among the operations of the loop.
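The resource bound is a one-liner once per-resource usage is tabulated; this sketch (ours, with assumed resource names and counts) computes it:

```python
# Sketch: resource-based lower bound. For every resource s with F_s
# identical copies, II >= ceil(total use of s per iteration / F_s);
# the bound is the maximum over all resources.
from math import ceil

def res_min_ii(usage, copies):
    """usage: dict resource -> list of per-instruction cycle counts s_i;
    copies: dict resource -> number of copies F_s."""
    return max(ceil(sum(u) / copies[s]) for s, u in usage.items())

# e.g. four instructions use an ALU for 1 cycle each, with 2 ALUs;
# three instructions use a single store port for 1 cycle each.
usage = {"alu": [1, 1, 1, 1], "store_port": [1, 1, 1]}
copies = {"alu": 2, "store_port": 1}
print(res_min_ii(usage, copies))  # 3 (the store port is the bottleneck)
```

The overall MinII is then max(MinII_rec, MinII_res), and the iterative search for a schedule starts there.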

Definition 2.5 The Initiation Rate of a schedule equals 1/II.

Definition 2.6 (Modulo Scheduling Problem): Given a loop (its Data Dependence Graph), the number of available resources in the architecture, and an Initiation Rate, find a legal schedule for this loop.

Summary

In this Chapter the fundamentals of software pipelining and modulo scheduling were reviewed. Software pipelining is a code generation technique that

overlaps execution of different loop iterations in order to take advantage of the instruction-level parallelism in loops and of the target machine's computing capabilities. Modulo scheduling is one of the most successful approaches for performing software pipelining and has been implemented in production compilers. Under modulo scheduling, operations of the loop body are packed inside a time window whose size equals the interval between initiations of consecutive loop iterations. Dependencies between loop operations must be preserved and conflicts between operations due to hardware usage must be avoided. The modulo scheduling problem is, in general, NP-hard.

Chapter 3

The R8000 Processor Design

The R8000 microprocessor is a superscalar implementation of the MIPS 64-bit architecture with a target frequency of 75 MHz. The MIPS R8000 Microprocessor Chip Set, implemented using separate integer and floating point devices, delivers a peak performance of 300 MFLOPS. The R8000 CPU contains 2.6 million transistors. The R8010 Floating Point Unit contains 830 thousand transistors. The R8000 processor uses a superscalar machine organization that dispatches up to four instructions each clock cycle to two floating-point execution units, two memory load/store units, and two integer execution units. A split level cache structure reduces cache misses by directing integer data references to a 16 KByte on-chip cache while floating point data references are directed to a 4 MByte off-chip cache. Limited out-of-order execution for floating-point operations allows the R8000 processor to achieve performance comparable to that of a low-end vector supercomputer on floating-point intensive computation. In this Chapter, features of the MIPS R8000 Microprocessor that are relevant from the scheduling point of view are presented. For a more detailed architecture description, see [43, 26].

Figure 3.1: R8000 Microprocessor

3.1 Processor Overview

The diagram of the R8000 Microprocessor is shown in Figure 3.1.

The shaded areas identify the function units on the integer unit (R8000) and floating-point unit (R8010) chips. Additionally there are two tag RAM chips and a set of static RAMs making up the external cache.

Instructions are fetched from an on-chip 16 KByte instruction cache (Instruction cache). This cache is direct mapped with a line size of 32 bytes. Four instructions are fetched per cycle. There is a branch prediction cache (Branch cache) associated with the instruction cache. The branch cache is also direct mapped and contains 1K entries.

Instructions from the cache are realigned before going to the dispatch logic. Up to four instructions chosen from two integer, two memory, and four floating-point instruction types may be dispatched per cycle. Floating-point instructions are dispatched into a queue (FPQ) where they can wait for resource contention and data dependencies to clear without holding up the integer dispatching. In particular, the FP Queue decouples the floating-point unit to hide the latency of the external cache.

Integer and memory instructions get their operands from a 13-port integer register file (Integer Register File). Integer function units consist of two integer ALUs, one shifter, and one multiply-divide unit. Up to two integer operations may be initiated every cycle. Memory instructions go through the address generation units (Address Generator) and to the TLB. The TLB is a three-way set-associative cache containing 354 entries. It is dual ported so that two independent memory instructions can be supported per cycle. Integer loads and stores go to the on-chip data cache (Data cache). It, too, is dual ported

to support two loads or one load and one store per cycle. The Data cache is 16 KBytes.

Both the Instruction cache and the Data cache refill from the external cache (Streaming Cache). The Data cache refill time is 7 cycles: five to go through the external cache RAMs (described later) and two cycles to write a 32 byte line (the external cache delivers 16 bytes at a time). The Instruction cache refill time is 11 cycles (because of the branch prediction and the TLB).

Floating-point loads and stores go directly to the off-chip external cache described later. Its refill time depends on the system implementation. The floating-point unit contains two execution datapaths (Floating Point Units), each capable of fused multiply-adds, simple multiplies, adds, divides, square roots and conversions. A twelve-port register file (Floating Point Register File) feeds the execution datapaths. The two datapaths are completely symmetric and indistinguishable to the software - the compiler simply knows that it can schedule two floating-point operations per cycle.

3.2 Instruction Pipeline

Execution of each instruction is broken up into five stages:

(F) - Fetch and partial decode. Branch prediction.
(D) - Decode, read register file, scoreboard and dependency checks.
(A) - Generate address.
(E) - Execution.
(W) - Write the result into the register file.

Each stage takes one clock cycle except for the execution stage, which may take several clock cycles.

Such a pipeline stage sequence eliminates the load delay slot, i.e. an instruction that immediately follows a load and depends on the load's result does not incur a delay of one cycle. On the other hand, a one cycle delay happens whenever a load or a store immediately follows an instruction it depends on. Also, branches are resolved one stage later and, therefore, the branch miss penalty is increased. To overcome this problem the branch prediction is done very early in the pipeline, during the Fetch stage. Thus, an instruction at a branch target is fetched in the cycle following the branch, i.e. there is no branch delay slot. For backwards compatibility, however, one instruction immediately following a branch is executed and may be considered to be in the "branch delay slot".

3.2.1 Instruction Fetch

The R8000 has superscalar dispatch which allows issue of up to four instructions per clock to multiple execution units of the R8000 CPU and R8010 FPU. There are no boundary alignment restrictions. Instructions are fetched from the Instruction cache, predecoded, and placed into the instruction queue that acts as temporary storage for instructions waiting to be executed. There are four instructions in the dispatch unit at any given clock, chosen among all the instructions in the instruction queue. Should a situation arise where the queue is full, a stall is issued and the instruction cache will cease to fetch instructions until there is room in the queue. Instructions are sent out from the dispatch unit to execution depending on the resources available from cycle to cycle and the interdependencies between any of the four instructions in the dispatch unit.

3.2.2 Integer and Address Generation Pipelines

The R8000 handles all the integer operations. The R8000 contains two arithmetic logic units (ALUs) and two address generation units, capable of executing a maximum of four instructions per cycle. The ALUs and the address generation units have 1-cycle latencies. Dedicated busses allow each unit to run independently of the others. The R8000 has an on-chip dual ported 16 KByte Data Cache, with separate address and data busses for each port. This allows multiple simultaneous accesses to the cache. The R8000 has an on-chip single ported 16 KByte Instruction Cache. In addition to the Data and Instruction caches, the R8000 also contains Branch and Translation Lookaside Buffer (TLB) caches, giving a total of four caches on-chip. The Branch Cache implements a single prediction bit scheme to predict the branch outcome. A single TLB services both the Data and Instruction caches.

3.2.3 Floating Point Execution Pipeline

The R8010 Floating Point Unit (FPU) performs all the floating point functions. The R8010 has two execution units, allowing two arithmetic and two floating point memory operations per clock. The Floating Point Register File contains 8 read ports and 4 write ports. The R8010 has no on-chip cache and uses the Streaming Cache, which is a second level cache for the R8000, as its memory. Normally the R8010 FPU is controlled by the R8000 CPU. Dispatching of instructions, floating point loads and stores to the Streaming Cache, integer stores to the Streaming Cache, etc. are all under the control of the R8000 CPU.

3.3 CPU - FPU Interface

The CPU - FPU interface consists of floating-point instruction and data queues and the TBus that connects the CPU to the FPU. High throughput on the R8000 microprocessor is achieved through decoupling of the integer and floating point units. The streaming cache, which is accessed by floating point memory references, has a latency of 5 cycles on a hit. This presents a problem for floating-point code: a straightforward implementation would have floating-point loads casting a shadow at least five cycles or twenty instructions long (if a floating point load instruction is immediately followed by a compute instruction that uses the result of the load, the compute instruction would have to wait for five cycles until the data are fetched through the external cache pipeline). Streaming cache latency is hidden by decoupling the R8010 from the R8000 pipeline. The decoupling of the integer and floating-point functions allows the execution in the R8010 FPU to "slide" behind the execution in the R8000 CPU, thus hiding the streaming cache latency. Because floating-point operations are decoupled from the R8000, long floating-point operations or accesses to the main memory can be executed in parallel with other integer operations. Floating-point instructions are dispatched into a queue before going to the R8010. If a floating-point load instruction is immediately followed by a compute instruction that uses the result of the load, the queue allows both instructions to be dispatched together as if the load had no latency at all. The integer pipeline is immediately free to continue on to other instructions. The load instruction proceeds down the streaming cache pipeline and deposits the load data in the load data queue after 5 cycles. In the meantime the compute instruction waits in the floating point instruction queue until the load data is available.

By decoupling floating-point operations, a limited form of out-of-order execution of floating-point instructions is achieved. The R8000 is not held up, allowing vector start-up time to be reduced. For example, in transitioning from one loop to another, while the R8010 FPU is completing the first loop, the R8000 can begin processing the overhead code and get started on the second loop, even though the R8010 is not yet finished with the first loop.

3.3.1 Floating Point Queueing Mechanism

The Floating Point Queue Mechanism consists of an instruction queue and a load data queue, which together allow the R8000 CPU to run ahead of the R8010 FPU coprocessor.

Floating-point instructions are dispatched into a queue before going to the R8010. If a floating-point load instruction is immediately followed by a compute instruction that uses the result of the load, the queue allows both instructions to be dispatched together as if the load had no latency at all. The load instruction proceeds down the external cache pipeline and deposits the load data in the load data queue after five cycles. In the meantime the compute instruction waits in the floating-point instruction queue until the load data is available.

The TBus connects the R8000 Microprocessor, the R8010 Floating Point Unit, and the Cache Controller. In this subsection we only discuss the communication between the CPU and the FPU.

There are four types of instruction transfers dispatched by the R8000 CPU to the R8010 FPU through the TBus. Table 3.1 shows the content of the first

74 bits of the TBus bandwidth during each of the four types of transmissions (table notation is explained below):

Transfer Mode   73-65      64-56      55-28     27-0
Normal          MemSpecA   MemSpecB   FpOP-A    FpOP-B
MoveFrom        MfSpec     MemSpecB   FpOP-A    FpOP-B
IntStore        IStSpec    Data
MoveTo          MtSpec     Data

Table 3.1: Usage of the TBus Bandwidth (columns give TBus pin ranges)

Normal transfer

A normal dispatch contains two floating point arithmetic operations (FpOP-A and FpOP-B) and two floating point memory reference operations (MemSpecA and MemSpecB).

MoveFrom transfer

This is similar to Normal mode except that one of the memory reference operations is replaced by a move-from operation (MfSpec). This operation moves data from a floating point register to a general purpose register in the R8000.

IntStore transfer

The IntStore operation (IStSpec) supports integer stores to the streaming cache (see section 3.4). In this mode the TBus contains the 64-bit integer data

(Data) along with some store alignment information. No other operation can be placed on the TBus pins at the same time.

MoveTo transfer

The MoveTo operation (MtSpec) moves data from a general purpose register in the R8000 CPU to a floating point register. This format is similar to the IntStore format except that instead of store alignment information, the floating point register destination is transmitted along with the 64-bit data.

3.4 Memory System

In this section the cache organization of the R8000 is described. The cache system consists of a 16 KByte integer-only first level cache on chip and a 4 MByte second level Streaming Cache. The latter acts as a second level cache for the R8000 CPU and as a first level cache for the R8010 FPU coprocessor. Separation of the integer and the floating point data helps to achieve higher efficiency for floating point intensive code.

3.4.1 Streaming Cache

The Streaming Cache data RAMs have separate load and store data busses. Both a read and a write can be on their respective busses at the same time, although only one of them can be performed by the data RAM at a time. Having separate busses eliminates any bus turnaround time, which occurs when a write immediately follows a read, and allows read and write data to be pipelined to the RAM. The pipelined access to the streaming cache is shown in Figure 3.2.

Figure 3.2: Streaming Cache Access

There are five stages in the external streaming cache pipeline. Addresses are sent from the R8000 to the tag RAM in the first stage. The tags are looked up and hit/miss information is encoded in the second stage. The third stage is used for chip crossing from the tag RAM to the data RAMs. The data RAMs are accessed internally within the chip in the fourth stage. Finally, data is sent back to the R8000 and R8010 in the fifth stage. In the case of a cache hit, the total streaming cache access time is 5 cycles.

The 4 MByte, 4-way set associative cache implementation is split between the even and odd banks, each containing 2 MBytes and each having a dedicated Tag Table, allowing them to operate independently of each other. Each set consists of 2048 lines; each line contains sixty-four 64-bit words, divided as 32 words per bank.

Interleaving the streaming cache doubles the available bandwidth from the cache. However, two simultaneous accesses to the same bank will cause a one cycle stall. The compiler can mitigate bank conflicts by careful code generation. The hardware also is designed to help out the compiler by adding a one-deep queue called the "address bellow". Referring to Figure 3.2, immediately following the integer pipeline there is logic for sorting even and odd references, and a register that forms the address bellow. The address bellow resolves bank conflicts when both accesses alternate between odd and even. Imagine a sequence of pairs of both even references, followed in the next cycle by both odd references, followed in the next cycle by both even references, and so on. Without the address bellow, the interleaved cache would only be able to process one half of a pair of references per cycle - the pipeline would be stalled every other cycle and so the machine would run at half the speed. The address bellow slightly reorders the reference pattern, improving the chances of even and odd references lining up in time. For example, the second even reference in a cycle is enqueued in the address bellow and paired with the first odd reference from the next cycle, and so forth.
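The effect described above can be demonstrated with a rough behavioral model of our own devising (the real hardware's stall and queueing rules certainly differ in detail): the cache services at most one even and one odd reference per cycle, and the one-deep bellow lets a single leftover reference wait and pair up with an opposite-bank reference from the next cycle.

```python
# Rough model (an illustration, not the hardware specification):
# count cycles to service a stream of per-cycle reference pairs,
# each reference hitting the even ('e') or odd ('o') bank.
def service_cycles(pairs, use_bellow=True):
    incoming = [list(p) for p in pairs]
    waiting = []   # unserviced references; the bellow holds at most one
    cycles = 0
    while incoming or waiting:
        cycles += 1
        # a new pair enters if nothing is waiting, or (with the bellow)
        # if at most one deferred reference is waiting
        if incoming and (not waiting or (use_bellow and len(waiting) <= 1)):
            waiting += incoming.pop(0)
        # service at most one even and one odd reference this cycle
        for bank in ('e', 'o'):
            if bank in waiting:
                waiting.remove(bank)
    return cycles

pattern = [('e', 'e'), ('o', 'o'), ('e', 'e'), ('o', 'o')]
print(service_cycles(pattern, use_bellow=False))  # 8 (stall every other cycle)
print(service_cycles(pattern, use_bellow=True))   # 5 (near full rate)
```

With the bellow the even-even/odd-odd pattern drains in 5 cycles instead of 8, matching the text's observation that the machine would otherwise run at roughly half speed on such a pattern.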

3.5 Instruction Set Architecture

Incorporation of conditional move and fused floating-point multiply-add instructions into the instruction set of the R8000 microprocessor is essential for its enhanced performance.

The addition of four conditional move instructions helps avoid unnecessary branches. These instructions allow representation of IF-THEN-ELSE constructs without branches. The results of THEN and ELSE are computed unconditionally (speculatively) and placed in temporary registers. Then, depending on the condition, one of them is moved to the permanent register.

For example, the Fortran statement

    if (A(i).gt.big) idx = i

can be compiled into the following straight-line assembly code:

    fload   %f2 = A[%r1]            -- i in %r1
    cmp.gt  %cc1 = %f2 > %f1        -- big in %f1
    cmove   %r2 = %cc1 ? %r1 : %r2  -- idx in %r2

The addition of floating point multiply-add/subtract instructions allows two floating point computations to be performed with one instruction. Moreover, since no intermediate rounding is performed, lower latency and higher precision are possible.

Summary

In this chapter the design features of the R8000 microprocessor relevant to scheduling for this processor were described. Some of the more innovative features are: the integer pipeline has no load delay slot; code density is preserved by the alignment-free instruction dispatching mechanism that does not require nop instruction padding to achieve high issue rates; the split-level cache structure fetches large floating-point array reference streams from the external cache into the processor without thrashing the on-chip cache; interleaved cache bandwidth is enhanced with the address bellow mechanism that reorders conflicting accesses; and a limited form of out-of-order execution is supported by decoupling the floating-point unit from the integer pipeline.

Chapter 4

ILP Model for a Superscalar Processor

In section 4.1 the ILP formulation of the modulo scheduling problem, developed by R. Govindarajan, E. Altman, and G. Gao [5], is introduced. Based on this formulation, in section 4.2 we develop an ILP model for the MIPS R8000 microprocessor.

4.1 ILP Formulation

Modulo scheduling can be formulated as an integer linear programming (ILP) problem. As such, it defines a set of linear constraints imposed on a legal solution by dependencies in the program and by resource constraints of the target architecture. Out of the many legal schedules that satisfy such constraints, the best is chosen according to a certain optimality criterion.

Consider the example loop from Chapter 2:

i1: x = load A(i)
i2: z = z + x
i3: A(i) = store z

Table 4.1 shows a periodic schedule σ for this loop of the form (2.1):

Table 4.1: Periodic schedule for the example loop

Here II = 2, and the kernel appears in cycles 3, 4; t_i1 = 0, t_i2 = 3, t_i3 = 4.

For each t_i a pair of values is defined, k_i and o_i:

k_i = ⌊t_i / II⌋ and o_i = (t_i mod II)

If we consider a time window (frame) of size II, corresponding to the repetitive pattern, then this expression means that the operation x_i is initiated at the o_i-th clock cycle of the repetitive pattern, during the k_i-th time window from the beginning of the execution.
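The decomposition is just integer division and remainder; applying it to the start times of the schedule in Table 4.1 reproduces the vectors K and O used below:

```python
# Illustration: decompose each start time t_i into the slot o_i inside
# the repetitive pattern and the window index k_i, so t_i = o_i + k_i * II.
II = 2
t = [0, 3, 4]                  # t_i1, t_i2, t_i3 from Table 4.1
k = [ti // II for ti in t]     # k_i = floor(t_i / II)
o = [ti % II for ti in t]      # o_i = t_i mod II
print(k, o)                    # [0, 1, 2] [0, 1, 0]
assert all(oi + ki * II == ti for oi, ki, ti in zip(o, k, t))
```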

Each t_i can now be written in the following way:

t_i = o_i + k_i · II (4.2)

where k_i ≥ 0 and 0 ≤ o_i ≤ II - 1.

Since each o_i lies in [0, II - 1]:

o_i = Σ_{t=0}^{II-1} t · a_{t,i} (4.4)

where A = [a_{t,i}] is a 0-1 matrix:

a_{t,i} = 1, if o_i = t; a_{t,i} = 0, otherwise

In other words, a_{t,i} = 1 if operation x_i is issued at clock cycle t from the beginning of the repetitive pattern.

In our example, matrix A is:

A = [ 1 0 1
      0 1 0 ]

and the vectors O and K are: O = [0 1 0], K = [0 1 2].

Substituting (4.4) into (4.2) we obtain the matrix form of a periodic schedule σ:

II · K + [0, 1, ..., II - 1] × A = T (4.5)

Because each operation is allowed to execute only once within the repetitive pattern, the following condition applies:

Σ_{t=0}^{II-1} a_{t,i} = 1, i = 1, ..., N (4.6)

Figure 4.1: (a) Reservation table for a single-precision divide; (b) its CRTs for II=11

σ is legal if it satisfies the linear precedence constraints (2.3):

In our example:

To determine resource requirements at each clock cycle t, we need to know not only when each operation is initiated, but also the usage of the various pipeline stages during execution. For this purpose circular reservation tables (CRTs) of the pipelines are used.

A CRT of a pipeline is defined as follows:

• For each pipeline whose execution time d < II, we extend its reservation table to II columns by adding (II - d) zero column-vectors. Since d < II, each entry in the CRT is at most 1.

• For the case d > II, we fold the reservation table to II columns such that the t-th column of the original reservation table is added to the (t mod II)-th column in the CRT. Under the Modulo Scheduling Constraint no instruction can use a particular pipeline stage at clock cycles congruent modulo II (i.e. at clock cycles t, t + II, t + 2·II, etc.), and each entry in the CRT is at most 1. The above assumption is necessary for a fixed function unit assignment, i.e. all iterations of an instruction in the pipeline are initiated on a particular execution unit. Fixed function unit assignment is required for generating correct code on VLIW processors. Some IIs would be impossible because of the modulo scheduling constraint.

Figure 4.1 shows a reservation table of a single-precision floating point divide pipeline and its corresponding CRT for II = 11: the reservation table has been folded.
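The folding step can be sketched directly (an illustration of ours, with a toy reservation table rather than the divide pipeline of Figure 4.1):

```python
# Sketch: fold a reservation table (rows = pipeline stages, columns =
# cycles after initiation) into a circular reservation table (CRT) of
# II columns, adding column t of the original into column t mod II.
def fold_crt(table, II):
    stages = len(table)
    crt = [[0] * II for _ in range(stages)]
    for s in range(stages):
        for t, used in enumerate(table[s]):
            crt[s][t % II] += used
    return crt

# A toy 2-stage unit: stage 1 busy in cycles 0-1, stage 2 in cycles 2-3.
# Folded for II = 3; under the modulo scheduling constraint every entry
# of the folded table must remain at most 1, as it does here.
table = [[1, 1, 0, 0],
         [0, 0, 1, 1]]
print(fold_crt(table, 3))  # [[1, 1, 0], [1, 0, 1]]
```

Summing a row of the folded table over the instructions issued in each slot is exactly how the resource requirement R(t) below is obtained.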

The s-th row of the CRT of the pipeline p specifies the usage of pipeline stage s. This row is denoted by CRT_p^s, a vector of length II. An entry CRT_p^s[t] is 1 if stage s is required (t mod II) clock cycles after the initiation of an instruction in that pipeline. For each instruction x_i that uses p, the resource usage matrix U^s of stage s is defined as follows:

U^s[t, i] = Σ_{q=0}^{II-1} CRT_p^s[q] · a_{(t-q) mod II, i} (4.7)

Let χ(p) denote the set of operations being executed in p. Then, the resource requirements of pipeline stage s at clock cycle t in the repetitive pattern are:

R_p^s(t) = Σ_{x_i ∈ χ(p)} U^s[t, i]

where R_p^s(t) represents the number of instructions using stage s of the pipeline

Figure 4.2: (a) CRT of a two-stage function unit; (b) The A matrix of a schedule

p at time t. This number must not exceed R_p, the number of pipelines of type p in the architecture.

The linear resource constraints, therefore, are:

R_p^s(t) ≤ R_p, ∀t ∈ [0, II - 1], for every stage s of every pipeline type p

The linear scheduling constraints described above define a set of feasible periodic schedules for a given loop and the target architecture. An optimization objective can be set to obtain schedules that satisfy certain criteria. Two proposed objectives are:

- minimizing the number of required execution pipelines;

- minimizing the number of required registers.

Minimizing the number of pipelines is fairly straightforward [5]. Minimizing register usage is more complicated, but also more interesting, because the resulting schedule, even a rate-optimal one, is useless if it does not fit into the available number of machine registers. Minimizing register usage also requires additional register constraints to be included in the ILP formulation, making it complex and very hard to solve efficiently [4].

4.2 ILP Formulation for Superscalars

The ILP formulation presented in the previous section was implemented in MOST - the Modulo Scheduling ToolSet. MOST is a collection of modulo scheduling implementations that contains different scheduling heuristics such as Huff's lifetime-sensitive scheduling [27], Gasperoni's method [20], Decomposed Software Pipelining (DESP) [48], and the Exhaustive Enumeration method [4]. MOST was developed as a research tool for studying and analyzing many diverse approaches to software pipelining. Because it was never intended to be a component of a real compiler, this implementation was not a full software pipelining implementation: its output was a set of static quality measures, principally the II of the schedule found and the number of registers required, not a piece of runnable code; register allocation was not implemented; its only targets were models that exhibited certain interesting properties, never a real commercial high-performance processor.

In this thesis we take MOST one step further and establish an ILP model for the existing architecture of the MIPS R8000 superscalar processor.

4.2.1 Modified Resource Constraints

Definition 4.1 Two operations x_i and x_j are of the same type λ if their resource usage patterns are identical; we say x_i, x_j ∈ λ.

Each machine resource can be thought of as a stage in the instruction pipeline. All machine resources can be classified into two categories:

• shared stages are used by different pipelines;

• non-shared stages are used only by a particular pipeline.

The previous formulation assumed no sharing of machine resources (stages) among different pipelines. However, a realistic instruction pipeline includes such resources as instruction issue logic, busses, register file ports, etc., shared by all or some of the execution pipelines. Thus, instead of using the CRTs that represent the resource usage patterns of different pipelines, we use modified CRTs that represent the resource usage patterns of different instruction types. This way the machine resources are looked at as stages during instruction execution rather than as stages of the machine's execution units. We will describe modified CRTs later in this section.

The modulo scheduling constraint that forces each instruction to be "bound" to a particular execution unit is no longer required in the superscalar environment and may be excessively restrictive. Consider the single-precision divide operation in Figure 4.1(a). It uses the divide stage for 11 consecutive cycles, and under the modulo scheduling constraint the MinII for a loop containing such an instruction will be at least 11 cycles, no matter how many divide pipelines are available in the machine. However, one can easily see that if different iterations of this instruction are allowed to be initiated in different pipelines (which is the case with superscalar architectures), iterations can be initiated at a rate of one every 6 cycles, if two pipelines of that type are available.

In order to allow sharing of processor resources and to avoid unnecessary restrictions on the II, modified CRTs, specifying the resource usage of instructions of type λ, are constructed as follows:

- For each operation type λ whose execution time d < II, we extend its reservation table to II columns by adding (II − d) zero column-vectors.

- For the case d > II, we fold the reservation table to II columns such that the t-th column of the original reservation table is added to the (t mod II)-th column in the modified CRT. By dropping the modulo scheduling constraint we allow entries in the modified CRT to be greater than 1.

Figure 4.3: CRT for a single-precision divide for II = 6

The modified CRT for instructions using the single-precision divide pipeline from figure 4.1, for II = 6, is shown in Figure 4.3. Because the second iteration of the loop begins before the divide operation from the first iteration is completed, two divide execution pipelines are needed for the first five cycles of that second iteration: the one in which the divide from the first iteration is being completed, and the other one to execute the divide operation from the second iteration of the loop.
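The folding rule above can be sketched in a few lines (the function name and the list representation are ours, not the thesis's):

```python
def modified_crt(row, ii):
    """Fold one reservation-table row (the busy pattern of a single
    resource, cycle by cycle) into II columns: column t of the original
    table is added to column t mod II.  Entries may exceed 1, meaning
    that several in-flight iterations need that many units of the
    resource in the same modulo cycle."""
    folded = [0] * ii
    for t, busy in enumerate(row):
        folded[t % ii] += busy
    # A row shorter than II simply leaves the remaining columns at zero
    # (the "extend with zero column-vectors" case).
    return folded

# Single-precision divide: the divide stage is busy for 11 consecutive
# cycles, folded at II = 6 as in Figure 4.3.
print(modified_crt([1] * 11, 6))   # [2, 2, 2, 2, 2, 1]
```

The peak entry of 2 matches the observation above: two divide pipelines must be available during the first five cycles of the repetitive pattern.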

Let us denote the s-th row in the modified CRT for the type λ by CRT_s^λ, a vector of length II. An entry CRT_s^λ[t] = m if resource s is required by an instruction of type λ at clock cycles t, t + II, …, t + (m − 1)·II after the initiation of that instruction. Thus, for each instruction x_i ∈ λ we define a resource usage matrix Ū_s (similar to the matrix U_s from equation 4.7, except that it is computed using the modified CRTs) specifying the usage of resource s by this instruction:

The constraints of 4.1 are modified slightly to reflect the fact that any machine resource may be used as a part of any instruction pipeline. At each clock cycle t in the repetitive pattern, the resource requirements for a particular resource s are:

ρ_s(t) = Σ_λ Σ_{x_i ∈ λ} Ū_s[t, i],  ∀t ∈ [0, II − 1]
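A sketch of how this per-cycle requirement could be evaluated for one resource, given the folded CRTs and the issue cycles of the instructions (the data layout is our assumption):

```python
def resource_usage(schedule, crts, ii):
    """Demand on one resource at each cycle of the repetitive pattern.

    schedule: list of (instr_type, issue_cycle) pairs;
    crts: instr_type -> folded CRT row of length ii for this resource.
    The schedule is feasible for the resource when every entry of the
    result is <= the number of units the machine provides."""
    usage = [0] * ii
    for typ, t_issue in schedule:
        for c, need in enumerate(crts[typ]):
            # An instruction issued at t_issue uses the resource `need`
            # times at modulo cycle (t_issue + c) mod ii.
            usage[(t_issue + c) % ii] += need
    return usage

divide_crt = {"div.s": [2, 2, 2, 2, 2, 1]}   # folded row from Figure 4.3
print(resource_usage([("div.s", 0)], divide_crt, 6))   # [2, 2, 2, 2, 2, 1]
```

With two units of the divide resource, one divide per window fits; issuing a second divide one cycle later would push the demand to 4 units and violate the constraint.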

4.2.2 Objective Function

Because the number of available machine resources is fixed in the architecture, minimizing the number of pipelines used by the schedule is not of interest. On the other hand, obtaining register-wise optimal schedules is very important for loops whose register requirements are anywhere near the limit of available registers. In such cases a schedule that uses one extra register (compared to the optimal one) may not be register allocatable, and this will lead to spills and rescheduling, degrading the II.[1]

Unfortunately, using the integrated formulation for register minimization, mentioned in Section 4.1, is prohibited by its computational complexity. Interesting loops (those with register pressure near the limits of available registers) are usually medium and large size loops, and the optimal solution to the integrated formulation cannot be achieved in any reasonable time.

Instead, two different approximations for the register requirements of a schedule were used, leading to very similar results. One of the approximations corresponds to the upper bound, and the other to the lower bound on the register requirements of the schedule. Although minimizing the upper bound represents a certain backing off from strict optimality, it still produces good near-optimal schedules. On the other hand, using the lower bound on registers as a cutoff point, one can hope to reduce the scheduling search space. If the lower bound is tight enough, obtained schedules will be readily register allocatable. Using the lower bound on registers also represents giving up optimality in exchange for a shorter solution time.

[1] Register allocation is described in terms of generalized registers. However, the implementation distinguishes between all kinds of registers present in the target machine, which is a fairly straightforward extension.

First, we need to introduce some additional definitions. There are three types of variables in loops:

- loop variant variables are defined in each loop iteration and cause flow dependencies between the loop statements;

- loop invariant variables are defined outside and used inside the loop body. Such variables must be kept in registers, and, therefore, their presence reduces the number of registers available for allocation. Loop invariant variables do not cause any dependencies between the statements in the loop body;

- keeper variables (such as the stack pointer, for example) are defined in the loop body, but have a particular physical register associated with them. Such variables cannot be allocated to any other register and, therefore, cannot be renamed. They cause flow, anti, and output dependencies between statements in the loop body.

The set E′ of edges in the DDG is the subset of all flow dependence edges not associated with keeper variables.

The non-allocatable variables are not considered when a schedule's register requirements are being minimized, because they already have certain physical registers assigned to them and do not need to be allocated. However, they still occupy registers. Thus, when considering the number of registers available for allocation, the relevant quantities are:

N_T — the number of all registers available for allocation in the target architecture;
N_LI — the number of loop invariants;
N_NA — the number of keeper variables in the loop body.

We also assume that before scheduling, variable renaming has been performed and, except for the keeper variables and anti-dependencies due to conditional moves, the DDG is in Static Single Assignment (SSA) form [12].

4.2.3 Upper Bound on the Number of Registers

The upper bound on the number of registers required by a schedule is given by the sum of the buffer sizes of all values kept in registers [35]. The buffer size of a variable corresponds to the number of iterations this variable's lifetime spans, and can be defined as follows:

L(x_i) denotes the lifetime of the variable into which instruction x_i writes its result. Because the program is in SSA form and we do not consider "keeper" variables, i.e. there are no output dependencies, this variable's lifetime corresponds exactly to x_i's longest flow dependence.

Buffers overestimate register pressure. For example, let the loop's II be 10 cycles. A variable whose lifetime spans 3 loop iterations needs a buffer of size 3, b_i = 3. Since a buffer is allocated for a variable for the duration of the entire II, such a variable will be kept in the buffer for 30 cycles. However, if the actual lifetime of this variable is less than 30 cycles, say 23 cycles, it does not need to be kept in a register for the entire 30 cycles. It should also be noticed that registers can be shared by different values when their lifetimes do not overlap, which is not reflected by the buffers. Linear buffer constraints that approximate the buffer size of a variable look like the following [35]:

II·b_i + t_i − t_j ≥ II·(a_ij + 1) − 1,  ∀(i, j) ∈ E′,  b_i integer  (4.13)

and the objective function is:

min Σ_{x_i : ∃(i,j) ∈ E′} b_i

The complete formulation for software pipelining with buffer minimization is given in Appendix B.1.
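The overestimate discussed above is easy to reproduce numerically; a minimal sketch, assuming the buffer size is simply the number of II-windows the lifetime spans:

```python
import math

def buffer_size(lifetime, ii):
    """Number of iterations (windows of size II) a value's lifetime
    spans; the buffer bound charges a full register slot for each
    window, even when the value dies partway through the last one."""
    return math.ceil(lifetime / ii)

ii = 10
print(buffer_size(23, ii))        # 3 buffers are allocated ...
print(buffer_size(23, ii) * ii)   # ... holding the value for 30 cycles,
                                  # though it is live for only 23
```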

4.2.4 Lower Bound on the Number of Registers

Once a loop has been scheduled, an absolute lower bound on the schedule's register pressure can be found by computing the maximum number of variables that are live at any cycle of the schedule. This register pressure can be approximated by the average cumulative lifetime, i.e. the total length of all lifetimes divided by II, which is a schedule-independent lower bound on the loop's register pressure [27]:

As was mentioned earlier, when a program is in SSA form and "keeper" variables are not considered, variables' lifetimes correspond exactly to the longest flow dependencies of their defining instructions, thus:

MinAvg, the schedule's average cumulative lifetime, gives the lower bound on the register requirements of a schedule. For example, if II = 5 and x_1, x_2 are two variables whose lifetimes overlap, each live for 2 cycles, it is obvious that 2 registers are needed for this schedule, but MinAvg = ⌈(2 + 2)/5⌉ = 1. Thus, if a schedule's MinAvg exceeds the number of registers in the target machine, this schedule is not register allocatable. On the other hand, the fact that the MinAvg of a schedule is less than the number of available registers does not guarantee allocatability of this schedule. R. Huff estimated that for the majority of loops MinAvg is very close to the loop's real register requirements, and that such a lower bound is quite tight. Our experiments showed that MinAvg is a somewhat worse approximation than buffers.

The complete formulation for software pipelining under the limited average cumulative lifetime is given in Appendix B.2.
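The MinAvg bound can be sketched directly, reproducing the small example above:

```python
import math

def min_avg(lifetimes, ii):
    """Schedule-independent lower bound on register pressure: the total
    length of all lifetimes averaged over the II window, rounded up."""
    return math.ceil(sum(lifetimes) / ii)

# Two overlapping variables, each live 2 cycles, II = 5: two registers
# are really needed, but MinAvg = ceil((2 + 2) / 5) = 1.
print(min_avg([2, 2], 5))   # 1
```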

4.2.5 Loop Overhead Optimization

The ILP pipeliner strives to minimize the number of registers used in the software pipelined steady state. The motivation behind such an approach is to produce schedules that can be register allocated. However, there is another factor that affects the resulting performance - the time to enter and exit a pipelined steady state, or loop overhead. This overhead is constant relative to the trip count of the loop, and thus increases in importance as the trip count decreases and asymptotically disappears in importance as the trip count increases. If we ignore dynamic factors such as the memory system effects discussed later, different schedules of a loop with identical IIs and different register requirements differ at most in overhead, so long as they both fit in the machine's available registers.

[Figure: example DDG with two alternative schedules, Schedule 1 and Schedule 2]

Register usage affects the loop overhead because some registers must be saved before the loop is entered and restored after it is exited, which is done in the prologue and the epilogue. A more important factor than register usage that influences pipeline overhead is the number of instructions (or the number of cycles) in the loop's prologue and epilogue. The number of instructions in the prologue and epilogue is a function of how deeply pipelined each instruction in the loop is and whether the less deeply pipelined instructions can be executed speculatively during the final few iterations of the loop.

The pipeline depth of the instruction x_i in the schedule is precisely the number of the window of size II, corresponding to the repetitive pattern, in which it appears for the first time from the beginning of the schedule. In the ILP formulation this number is denoted by the variable k_i from (4.1). Thus, minimizing the depth of the pipeline is equivalent to minimizing:

Consider the example DDG in Figure 4.2.5 (a). In schedule 1 (Figure 4.2.5 (b)), instructions were scheduled in order and iterations were overlapped with II = 3. The first iteration of instruction A appears in the prologue, before the steady state is formed, with a pipeline depth of 1. Schedule 2 (Figure 4.2.5 (b)) does not have a prologue and the pipeline depth of all instructions is 0, because the execution of instructions B and D was overlapped (in this case, the schedule resembles a simple basic block schedule).

The complete formulation for software pipelining with pipeline depth minimization is given in Appendix B.3.
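The depth computation itself is just integer division of the issue time by II; a sketch with hypothetical issue times (the times below are ours, chosen to mimic Schedule 1, not taken from the figure):

```python
def pipeline_depths(issue_times, ii):
    """k_i from (4.1): the index of the II-window in which instruction
    x_i first appears.  The prologue must fill max(k_i) windows before
    the steady state is reached."""
    return {x: t // ii for x, t in issue_times.items()}

depths = pipeline_depths({"A": 0, "B": 1, "C": 3, "D": 4}, 3)
print(depths)                # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
print(max(depths.values()))  # 1: one prologue window
```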

4.3 R8000 Memory System Optimization

The MIPS R8000 provides a simple implementation of an architecture supporting more than one memory reference per cycle. The processor can issue two references per cycle, and the memory (specifically the second-level streaming cache) is divided into two banks of double-words, the even address bank and the odd address bank. If two references in the same cycle both address the same bank, one is serviced immediately, and the other is queued for service in the one-element queue called the bellow (see Chapter 3). If this hardware configuration cannot keep up with the stream of references, the processor stalls. In the worst case there are two references every cycle all addressing the same bank, and the processor stalls once on each cycle, so that it ends up running at half speed.

The MIPSpro compiler attempts to find known even-odd pairs of references to schedule in the same cycle - it does not model the bellow feature of the memory system. The ILP pipeliner adopted a similar strategy. Before scheduling begins, for each memory reference m a list P(m) of all other references m', for which (m, m') is known to be an even-odd pair, is formed. A set of linear memory constraints is added to the ILP formulation that forces each reference m to be scheduled either with its pair m' or with no other memory references in the same cycle. This can lead to a failure to find a legal schedule at a given II. In such cases, the memory constraints are omitted and the scheduling proceeds for the same II but without any consideration of the memory system.

In order to build the list P(m) of pairable candidates for each memory reference m, a memory analysis is performed.

4.3.1 Memory Reference Analysis

Consider the following loop which references the array A:

Loop:
    x_i: ... = A[a·I + b]
    x_j: ... = A[c·I + d]
EndLoop

Suppose that the I-th iteration of x_i and the (I + δk_ij)-th iteration of x_j have been scheduled in the same cycle:

Loop:
    x_i: ... = A[a·I + b]
    x_j: ... = A[c·(I + δk_ij) + d]
EndLoop

To be pairable, these two memory references must address opposite memory banks. Because the banks contain double-words, two addresses map into opposite banks if they are separated in memory by a number of bytes divisible by 8, but not by 16. Thus, two references address the opposite memory banks when:

(c − a)·I + c·δk_ij + (d − b) ≡ 8 (mod 16)  (4.17)

Loop counter I is a variable in equality 4.17. Because b, d and δk_ij are constants for a given schedule (b and d define the addressed memory locations, and δk_ij depends on the schedule), equality 4.17 is true for all I, i.e. holds in all iterations of the loop, if and only if

(c − a)·I is a multiple of 16 for all I

(c − a) % 16 = 0

Otherwise there may exist an iteration I such that equation (4.17) does not hold.
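The double-word bank interleave can be stated directly on byte addresses; a small sketch (the function names are ours):

```python
def bank(addr):
    """Streaming-cache bank of a byte address: with an 8-byte
    (double-word) interleave, bit 3 of the address selects the even
    vs. the odd bank."""
    return (addr >> 3) & 1

def opposite_banks(addr_a, addr_b):
    """Two addresses fall into opposite banks iff their distance is a
    multiple of 8 bytes but not a multiple of 16."""
    d = abs(addr_a - addr_b)
    return d % 8 == 0 and d % 16 != 0

print(opposite_banks(0, 8))    # True  (even bank vs. odd bank)
print(opposite_banks(0, 16))   # False (same bank again)
```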

Because c ≥ 0 and δk_ij ≥ 0:

c·δk_ij + |d − b| = 8 + 16·m_ij,  m_ij = 0, 1, 2, …  (4.19)

From equation 4.19 we can derive the following non-interference rules for δk_ij:

1. if c%16 = 0 and (d − b)%16 = 8, then the two memory references A[aI + b] and A[c(I + δk_ij) + d] never interfere in cache, independent of the value of δk_ij;

2. if c%16 = 8 and (d − b)%16 = 0, then δk_ij ∈ {1, 3, 5, …} guarantees that A[aI + b] and A[c(I + δk_ij) + d] do not interfere in cache; in the same way:

3. if c%16 = 8 and (d − b)%16 = 8, then δk_ij ∈ {0, 2, 4, …};

4. if c%16 = 4 and (d − b)%16 = 0, then δk_ij ∈ {2, 6, 10, …};

5. if c%16 = 4 and (d − b)%16 = 8, then δk_ij ∈ {0, 4, 8, …};

6. if c%16 = 4 and (d − b)%16 = 4, then δk_ij ∈ {1, 5, 9, …};

7. if c%16 = 4 and (d − b)%16 = 12, then δk_ij ∈ {3, 7, 11, …}.

These rules merely state which values of δk_ij satisfy equality 4.19 for certain values of c, b and d. Any other values of c, b and d do not allow us to conclude whether the two memory references are pairable.

Definition 4.2 (Sufficient Conditions for Pairability) Two array references of the form A[aI + b] and A[c(I + δk_ij) + d] are pairable (i.e. they are mapped into the opposite memory banks) if:

- they reference the same array in memory;

- (c − a) % 16 = 0;

- δk_ij satisfies the non-interference rules.

From the non-interference rules, δk_ij has the form

δk_ij = off_ij + step_ij · m_ij,  m_ij = 0, 1, 2, …  (4.20)

For example, for δk_ij ∈ {3, 7, 11, …}, step_ij = 4 and off_ij = 3.
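Rather than tabulating the seven rules by hand, the admissible δk_ij values (and the resulting off_ij and step_ij) can be recovered by brute force from the congruence behind equation 4.19; a sketch, under the assumption that |d − b| enters only through its residue mod 16:

```python
def noninterference(c_mod, db_mod):
    """Solve c*dk + |d - b| = 8 (mod 16) for dk by enumeration, given
    c % 16 and |d - b| % 16.  Returns (off, step) describing the
    admissible dk = off + step*m, m = 0, 1, 2, ..., or None when no
    dk satisfies the congruence."""
    sols = [dk for dk in range(64) if (c_mod * dk + db_mod) % 16 == 8]
    if not sols:
        return None
    step = sols[1] - sols[0] if len(sols) > 1 else None
    return sols[0], step

print(noninterference(4, 12))   # (3, 4): dk in {3, 7, 11, ...} - rule 7
print(noninterference(8, 0))    # (1, 2): dk in {1, 3, 5, ...}  - rule 2
print(noninterference(0, 8))    # (0, 1): any dk works          - rule 1
```

Enumerating all residue combinations this way reproduces the rule table above, including δk_ij ∈ {2, 6, 10, …} for the c%16 = 4, (d − b)%16 = 0 case.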

Let M = {x_i}, i = 1, 2, …, m, be the set of instructions that represent memory references in the loop. For each x_i ∈ M, a vector P of 3-tuples (x_j, off_ij, step_ij) is built:

- each x_j ∈ M, x_j ≠ x_i, is a memory reference pairable with x_i;

- off_ij and step_ij define the set of acceptable δk_ij values that satisfy the sufficient conditions.

4.3.2 Memory Constraints for the ILP formulation

Linear memory constraints have been added to the ILP formulation that enforce the following rule: two memory references may be scheduled in the same cycle only if they represent a known memory pair, i.e. they satisfy the sufficient conditions.

1. a memory reference x_i that has no pairable candidates must not share a cycle with any other memory reference x_j:

a_{t,i} + a_{t,j} ≤ 1,  ∀x_i ∈ M without a memory pair candidate, ∀x_j ∈ M, ∀t ∈ [0, II − 1]  (4.21)

2. for every memory reference x_i and its pair candidate x_j the following must be true:

|k_i − k_j| = off_ij + step_ij · m_ij  (4.22)

if the k_i-th iteration of x_i and the k_j-th iteration of x_j are scheduled in the same cycle.

The condition (4.22) may be expressed in the following linear form:

k_i − k_j = (2·w_ij − 1)·off_ij + step_ij · (m_ij + (w_ij − 1)·maxm_ij),

∀t ∈ [0, II − 1] such that a_{t,i} = a_{t,j} = 1  (4.23)

where maxm_ij is the upper limit on m_ij, which will be computed later, and w_ij ∈ {0, 1} encodes the sign of k_i − k_j.

which can be expressed in the form of inequalities as:

Since these inequalities should only hold whenever for some t ∈ [0, II − 1] we have a_{t,i} = a_{t,j} = 1, we can write:

Inequalities (4.27) and (4.28) are always satisfied when a_{t,i} and a_{t,j} are not both 1, and therefore do not impose any constraints on k_i and k_j. When there exists a time t ∈ [0, II − 1] at which a_{t,i} = a_{t,j} = 1, however, inequalities (4.27) and (4.28) require that k_i and k_j be such that the I-th iteration of x_i and the (I + |k_i − k_j|)-th iteration of x_j are pairable.

It only remains to compute the value of maxm_ij. From equation (4.20), we notice that:

max |k_i − k_j| = off_ij + step_ij · maxm_ij  (4.29)

Consequently,

where

is the upper bound on the relative distance between the iterations from which the two references come. asap(x_i) and alap(x_i) are the earliest and the latest possible times when the instruction x_i may be issued.

The full set of linear constraints for the ILP formulation with memory constraints and buffer minimization is shown in Appendix B.4.
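One plausible way to bound maxm_ij, assuming (as the discussion around equation 4.30 suggests) that the largest iteration distance is limited by how far apart the two issue times can drift within their asap/alap ranges - this is our reading, not a verbatim transcription of 4.30:

```python
def max_m(off, step, asap_i, alap_i, asap_j, alap_j, ii):
    """Upper limit on m_ij in dk = off + step*m: the iteration distance
    |k_i - k_j| cannot exceed the asap/alap slack of the two references
    measured in windows of size II."""
    max_dist = (max(alap_i, alap_j) - min(asap_i, asap_j)) // ii
    return max(0, (max_dist - off) // step)

# With issue times confined to [0, 30] and II = 6, at most 5 windows
# separate the two references; for off = 3, step = 4 only m = 0 fits.
print(max_m(3, 4, 0, 30, 0, 30, 6))   # 0
```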

Summary

In this Chapter the integer linear programming (ILP) formulation of modulo scheduling was introduced. Contrary to heuristic methods, the ILP approach guarantees optimality of its solution. Within the ILP framework, the modulo scheduling problem is formulated as an optimization problem defined by a set of linear constraints and given some cost criterion to minimize. Minimizing the number of registers required by the schedule is important, because it may help to avoid spilling registers in loops with high register pressure. Unfortunately, using the integrated formulation for scheduling and register allocation is impossible due to its computational complexity. Two approximations to gauge register pressure are buffers [35] and average cumulative lifetime [27].

In this Chapter we developed the ILP formulation for modulo scheduling on a superscalar processor. This formulation assists register allocation by minimizing a schedule's register pressure, and takes into consideration memory system behavior by minimizing memory stalls due to simultaneous references to the same secondary cache bank.

Chapter 5

ILP Model for the MIPS R8000

In this chapter the ILP-based software pipeliner for the MIPS R8000 microprocessor is presented. In section 5.1, we discuss resource scheduling for the R8000 processor, in section 5.2 we describe the R8000 machine model, and in section 5.3 we present the design of the ILP software pipeliner.

5.1 Resource Scheduling on MIPS R8000

Resource conflicts arise when the hardware cannot support a given combination of instructions, because they might simultaneously require more units of some hardware resource (register write ports, for example) than are available in the machine. Instructions in the R8000 microprocessor are executed inside the five-stage pipeline. The following pipeline resources of the R8000 microprocessor must be considered during scheduling (see Chapter 3):

Resource    Notes

Instruction dispatch resources:
COMDISP     The superscalar issue/dispatch logic allows four instructions to be issued per cycle;
BRADISP     One branch instruction can be decoded per cycle;
INSTO       One integer store can be issued per cycle;

Integer execution and address generation resources:
ALUDISP     Two arithmetic logic units are available for performing integer computations;
ALUSHIF     One shifter is available;
ALUXILO     One high/low register pair for performing integer multiply and divide operations is available;
MEMDISP     Two address generators for address computations for integer and floating-point data are available;

Floating-point queue resources:
FPUDISP     No more than four floating-point instructions may be on the TBus at a time;
FPMDISP     Two floating-point memory reference instructions are allowed on the TBus at a time;
FPADISP     Two floating-point arithmetic instructions are allowed on the TBus at a time;

Floating-point execution resources:
- Two floating-point execution pipelines for performing floating-point multiply/add, divide, square root, and reciprocal operations are available;
- Two floating-point register file write ports are available;

Special resources:
- One special register used by a move instruction from the FPU to the CPU;
- One special register used by a move instruction from the CPU to the FPU;

Other hardware resources, although used, do not have to be scheduled, since scheduling one of the above-named resources guarantees conflict-free access for each instruction to the rest of the hardware. For example, the integer register file write port does not have to be scheduled because the ALU execution is fully pipelined and, therefore, once an instruction is issued, no further conflicts are possible (see Chapter 3).

Some hardware resources are not directly programmed by the user. Because loops containing instructions that manipulate these resources are not subject to software pipelining in the MIPSpro compiler, usage of such resources does not have to be scheduled. For example, there is the MiscBus internal to the R8000 CPU, which is used for data transfers in situations where dedicated buses are not available. Instructions which use the MiscBus include JAL, MFC0 and MTC0. This bus is not controlled by the scoreboard and there are certain restrictions that apply when using it. However, the MFC0 and MTC0 instructions are privileged instructions executed only with kernel permissions by the operating system software, which is not subject to software pipelining.

5.2 R8000 Machine Description

The machine model of the target architecture must specify:

1. A collection of the machine's schedulable resources, where the machine has {n_1, n_2, …, n_r} units of each resource. Each instruction type in the code follows its associated resource usage pattern during execution.

2. A collection of latencies associated with each instruction type. A program instruction requires an integral number of machine cycles to be executed, called the instruction's latency. Also, there are pipeline constraints imposed on the execution of instructions. Altogether, these define how many cycles are needed for the result of an instruction to be computed and become available for subsequent use by other instructions. Instruction latencies are modeled by integer delays assigned to the data dependence edges of the DDG. Although most of the instructions are executed in one cycle, there exist additional delays, such as:

- a delay of one cycle between an instruction and a memory reference that uses its result register, because of the pipeline structure;

- an additional delay of 6 cycles between an instruction and a floating-point instruction, that models the floating-point instruction queue;

- delays corresponding to the fully-pipelined execution with multi-cycle latency of such instructions as a move from the CPU to the FPU, floating-point multiply-add, and floating-point conditional move instructions;

- delays corresponding to non-pipelined execution patterns of such instructions as integer multiply/divide and some floating-point instructions;

- delays corresponding to the execution pattern with potential hazards (moves from the FPU to the CPU and select pseudo-operations);

- zero delays for the floating-point load instructions, because of the decoupled architecture of the R8000, and for stores, because they do not cause any delays.

Instructions in the MIPS IV instruction set [37] are classified into twenty-one types according to their resource usage patterns. Resource usage of each instruction type is modeled using reservation tables. Reservation tables for all instruction types are given in Appendix 7.2.
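A machine description in this style reduces to two tables: unit counts per schedulable resource and a reservation table per instruction type. The sketch below uses a tiny illustrative subset (the unit counts follow the table above, but the reservation rows are simplified single-cycle stand-ins, not the thesis's actual tables):

```python
# units available for each schedulable resource (subset of the table above)
UNITS = {"COMDISP": 4, "ALUDISP": 2, "MEMDISP": 2}

# instruction type -> {resource: busy bits, cycle by cycle}
RESERVATION = {
    "integer_add": {"COMDISP": [1], "ALUDISP": [1]},
    "load":        {"COMDISP": [1], "MEMDISP": [1]},
}

def fits_in_one_cycle(instr_types):
    """Can this group of instruction types be issued in the same cycle
    without oversubscribing any resource in its first cycle?"""
    demand = {}
    for i in instr_types:
        for res, row in RESERVATION[i].items():
            demand[res] = demand.get(res, 0) + row[0]
    return all(demand[r] <= UNITS[r] for r in demand)

print(fits_in_one_cycle(["load", "load", "integer_add"]))  # True
print(fits_in_one_cycle(["load", "load", "load"]))         # False: only two MEMDISP units
```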

5.3 Software Pipelining Algorithm

The ILP-based software pipeliner takes as input:

- the DDG of an innermost loop from the control flow graph of a program, produced by the MIPSpro compiler,

- the machine description that contains the instruction types, reservation tables for each instruction type, and the information about the number of available resources.

If successful, the software pipeliner produces a new loop body and all the necessary information for generating the loop's prologue and epilogue. The new loop is returned to the control flow graph of the program. If the software pipeliner fails, no changes are made to the control flow graph.

The flow-chart of the ILP-based software pipeliner is shown in Figure 5.1.


Figure 5.1: The Flow-Chart of the Software Pipeliner

1. First, the Ning-Gao linear programming (LP) formulation for buffer-optimal software pipelining without resource constraints is solved (see Appendix B.5). It is proven to always have a solution and to take low-degree polynomial time. If its solution obeys the resource constraints of the R8000 architecture, the schedule is accepted. More often, however, such a solution violates one or more resource constraints.

2. If the Ning-Gao solution fails, the integer linear programming formulation (ILP) with resource constraints but without an optimization criterion is attempted (see Appendix B.6). A solution to such a formulation is one of many possible legal schedules, not necessarily optimal. This formulation is solved faster than a formulation which minimizes buffers, because the search for a schedule stops after the first legal schedule is found. If a solution cannot be found, there is little hope that a solution to the more complex formulation will be obtained. Thus, the scheduling attempt starts from scratch with an increased value of the II. A solution to the ILP, on the other hand, is often not register allocatable, in which case the integer linear formulation with buffer minimization (ILPB) is attempted.

3. The ILPB formulation is more complicated than the two other formulations. It strives to find a schedule which uses the fewest buffers. If the optimal solution cannot be found (or proven to be optimal) in 3 minutes, the best schedule found so far is accepted. If a schedule that can be register allocated cannot be found, the II is incremented, and the scheduling attempt restarts from scratch.

Formulation using average cumulative lifetime:

When the formulation that uses average cumulative lifetime is being solved, the ILP in step 2 is augmented with the corresponding constraints (see Chapter 4), and step 3 is disabled. It is the subject of future work to try optimizing schedules obtained using such a formulation, in case they do not fit into the available number of registers.

Minimization of the pipeline depth:

For short trip count loops the ILP formulation aims at minimizing the depth of the software pipeline. In this case, after a solution to the ILP in step 2 is found, the ILP in step 3 that minimizes the pipeline depth of the most deeply pipelined instruction (ILPP) is applied (see Chapter 4).
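The three-stage driver described above can be summarized as follows; the four callables stand in for the real LP/ILP solvers and the register allocator, so this is only a control-flow sketch:

```python
def pipeline(min_ii, max_ii, try_lp, try_ilp, try_ilpb, allocatable):
    """Scheduling driver: for increasing II, try the Ning-Gao LP first,
    then the plain feasibility ILP, then the buffer-minimizing ILPB;
    move on to the next II only when no acceptable schedule is found."""
    for ii in range(min_ii, max_ii + 1):
        s = try_lp(ii)         # step 1: LP without resource constraints;
        if s is not None:      # accepted when it is also resource-feasible
            return ii, s
        s = try_ilp(ii)        # step 2: any legal resource-constrained schedule
        if s is None:
            continue           # no legal schedule at this II: increment II
        if allocatable(s):
            return ii, s
        s = try_ilpb(ii)       # step 3: minimize buffers (time-limited in practice)
        if s is not None and allocatable(s):
            return ii, s
    return None

print(pipeline(3, 6,
               try_lp=lambda ii: None,
               try_ilp=lambda ii: "legal" if ii >= 4 else None,
               try_ilpb=lambda ii: "tight" if ii >= 5 else "legal",
               allocatable=lambda s: s == "tight"))   # (5, 'tight')
```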

Summary

In this section the ILP software pipeliner for the MIPS R8000 processor was described. Multiple pipelined resources of this 4-way issue superscalar machine must be carefully scheduled in order to avoid processor stalls. The machine model of the R8000 processor consists of a collection of reservation tables that describe resource usage by all machine instructions, together with a collection of latencies associated with the execution of each machine instruction.

The ILP software pipeliner attempts to find a legal schedule using the Ning-Gao formulation. A solution to this formulation may not satisfy all the resource constraints of the R8000. If such is the case, a resource-constrained schedule is sought using the ILP formulation without considering register allocation issues. If a solution can be allocated registers, it is accepted. If the found solution has too much register pressure, the ILP formulation that minimizes buffers is used to produce a register allocatable schedule. The ILP software pipeliner was integrated and tested as a functional replacement of the MIPSpro software pipeliner.

Chapter 6

Experimental Results

In this chapter, the performance of the ILP pipeliner is analyzed. In section 6.1 we describe the experimental framework used for testing the ILP software pipeliner. The main results of the study are presented in sections 6.2 and 6.3. The significance of the scheduling order of operations for efficient ILP solving is discussed in section 6.4. Finally, the results of minimizing the loop overhead via reducing the software pipeline depth of a schedule are discussed in Section 6.5.

6.1 Experimental Framework

The ILP formulation for software pipelining was developed as an alternative to heuristic methods, in the hope that an integrated solution to software pipelining and register allocation would result in code of better quality. Unfortunately, using the ILP formulation of the integrated register allocation and scheduling problem [4] was too slow and unacceptably limited the size of loops that could be scheduled.[1]

To overcome this difficulty, two approaches were considered:

1. Simplifying the integrated formulation so that it contains fewer integer variables and fewer constraints, thus reducing the computational burden on the ILP solver. Thus, to guarantee register optimality, we substituted the buffer equations [35] for the coloring formulation from [4] (see Chapter 4).[2]

2. The number of different subproblems solved by the branch-and-bound algorithm is an important measure of the complexity of an ILP formulation. In our case this number was quite large. However, restructuring the formulation so that the amount of branching done by the solver is minimized would lead to faster solutions. The possibility of deriving a well-structured ILP formulation is discussed in Chapter 7, and its implementation is left for future work.

Thus, in order to generate register optimal software pipelined schedules, we used the simplified ILP formulations described in Chapter 4. Because we wanted to answer the question of whether optimizing schedules improves performance relative to heuristically generated schedules, comparison with one of the leading production compilers seemed reasonable. In order to measure its performance, the ILP software pipeliner was embedded in the Silicon Graphics MIPSpro compiler [44]. Thus, Silicon Graphics' software pipeliner served as a reference for evaluating the effectiveness of the ILP software pipeliner.

[1] In our experiments we used a 3 minute time limit for ILP solving, and our experience shows that there is not much to gain by increasing this time limit.
[2] E. Altman estimated that for 94% of the loops, buffer-optimal schedules correspond to register optimal schedules.

Unfortunately, there are some problems that could distort the outcome of the experiment by favoring one pipeliner over another.

1. One problem that could strongly favor the heuristic pipeliner is the fact that not every loop that can be scheduled by the MIPS compiler can also be scheduled by the ILP pipeliner within a reasonable time. This is a particular problem because the penalty for not pipelining can be very high. Because the simplified ILP formulation using Ning's buffers is very difficult to solve to optimality, the ILP software pipeliner failed to schedule 45 loops in the SPEC92 floating-point benchmark suite that were successfully scheduled by the SGI software pipeliner. This problem was addressed by using the SGI pipeliner as a backup for the ILP pipeliner: instead of falling back to the single block scheduler used when the MIPSpro compiler fails to schedule and register allocate a loop, the ILP pipeliner falls back to the MIPSpro pipeliner itself. This should only reveal the deficiencies in the MIPSpro compiler, and demonstrate how much improvement is gained by using the ILP pipeliner.

2. Another problem that could favor one of the two pipeliners is the random factor introduced by the memory system of the R8000 processor (see Chapter 3). Unbalanced memory references cause stalls, and thus two different schedules with the same II may have different dynamic performance. Because these memory effects account for up to 10% of the performance loss [44], their impact on the experiment had to be reduced in order to conduct meaningful measurements. This problem was addressed by introducing heuristics in both pipeliners that minimize the likelihood of memory accesses causing stalls. The integer linear programming constraints for reducing unbalanced memory accesses on the R8000 chip are described in Chapter 4.
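The bank-balance issue described in item 2 can be made concrete with a small check. The sketch below is not the ILP constraint set from Chapter 4; it only counts, for a hypothetical schedule (operation names and bank assignments are made up), pairs of memory operations that hit the same bank in the same steady-state cycle, i.e. whose issue cycles collide modulo II:

```python
from collections import defaultdict

# Count pairs of memory operations that access the same memory bank in
# the same steady-state cycle of a software pipelined loop: with
# initiation interval ii, operations issued at cycles t1 and t2 overlap
# in the kernel whenever t1 == t2 (mod ii).
def bank_conflicts(schedule, bank_of, ii):
    """schedule: op -> issue cycle; bank_of: op -> memory bank index."""
    by_slot = defaultdict(list)
    for op, t in schedule.items():
        by_slot[(bank_of[op], t % ii)].append(op)
    # every pair sharing a (bank, cycle mod ii) slot is a potential stall
    return sum(len(ops) * (len(ops) - 1) // 2 for ops in by_slot.values())
```

For example, two loads of the same bank issued at cycles 0 and 2 with II = 2 collide in the kernel, while moving one of them to an odd cycle removes the conflict.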

Figure 6.1: Relative performance of ILP over SGI

6.2 Highlights of Experimental Results

In our experiment the ILP pipeliner and the SGI pipeliner were used in turn for compiling the SPEC92 benchmark suite, consisting of 14 benchmark programs. The SGI pipeliner scheduled 798 loops in total in this benchmark suite and the ILP pipeliner scheduled 753 loops. We compared the execution time of programs scheduled by the ILP software pipeliner to the execution time of the same programs scheduled by the MIPSpro software pipeliner.

Figure 6.1 shows the relative improvement of ILP schedules over SGI schedules for each of the fourteen SPEC92 programs. The Y-axis measures how much faster the ILP scheduled code is compared to the SGI code in percentage points.

Our experiment revealed the following:

1. Memory system behavior is one of the major factors affecting the performance of programs on the R8000. The ILP formulation that minimizes random memory stalls proved to be very effective in generating high quality code.

2. Code scheduled by the ILP pipeliner is just a little slower than code scheduled by the SGI pipeliner because (1) the 3 minute time limit imposed on the duration of ILP solving sometimes prevents the ILP solver from finding a good solution, and (2) the R8000 memory system introduces a random factor that can affect the resulting performance.

3. The ILP pipeliner could not consistently produce schedules superior to the SGI pipeliner in terms of required registers, because the ILP solving time limit prevented many loops from being optimally scheduled.

4. One important result of our experiments is the discovery that the order in which the branch-and-bound tree is traversed is by far the most important factor affecting whether or not the ILP problem can be solved.

5. An ILP formulation for short trip count loops is able to successfully minimize the loop overhead. However, this did not directly translate into a performance gain for such loops. Other factors, such as cache and memory issues, have a greater impact on the performance of short trip count loops than the pipeline loop overhead.

6.3 Results and Analysis

In this section the results of benchmarking are analyzed. Performance effects of minimizing registers, minimizing memory stalls, and minimizing loop

pipeline overhead are discussed. The search order of the branch-and-bound algorithm is shown to be an important factor in solving the ILP formulation.

Table 6.1 summarizes the results of our experiments. We ran the 14 SPEC92 floating-point programs scheduled by the ILP software pipeliner with and without the memory optimization described in Chapter 4:

Benchmark | Execution time in seconds (memory optimization on) | Execution time in seconds (memory optimization off)

Table 6.1: Execution times of the SPECfp92 benchmarks with and without memory optimization (table data not legible in this copy)

Effects of the memory optimization are discussed later in this section.

As expected, the ILP software pipeliner is slow. Table 6.2 shows the number of loops scheduled in less than 1 second, 10 seconds, 12 minutes, and over 12 minutes out of 753 loops in the SPEC92 floating-point benchmark suite:

Under 1 sec. | Under 10 secs | Under 12 mins | Over 12 mins

Table 6.2: ILP time for scheduling loops in the SPECfp92 benchmark suite (table data not legible in this copy)

Roughly estimating, all loops scheduled by the ILP pipeliner in less than 12 minutes were scheduled optimally. Sometimes the ILP solver needed this much time because it tried four different priority orders in turn, each of which was given up to 3 minutes of time. It sometimes happened that only during the last priority order attempt was the optimal solution found³.

6.3.1 Memory Stalls

Figure 6.2 shows the effect of memory stall reduction for the ILP pipeliner. It depicts the relative performance of the ILP pipeliner with memory stall optimization enabled over the ILP pipeliner with memory stall optimization disabled.

The majority of benchmarks benefited from memory optimization. The average improvement due to minimizing memory stalls is 7 percent. Three programs (mdljdp2, alvinn, and mdljsp2) run significantly faster when scheduled with memory optimization; these programs are very sensitive to the memory

³This does not mean that all loops scheduled in over 12 minutes were scheduled suboptimally, but optimality in those cases is very unlikely.

Figure 6.2: Improvement in the ILP performance due to memory system optimization

system behavior. However, some of the programs suffered a slight loss in performance. Specifically, this performance loss was noticeable in two programs, swm256 and fpppp. Why? There are a couple of reasons for this. Additional memory stall constraints sometimes allow the ILP formulation to find a register allocatable schedule at a lower II than it would have found without these constraints. However, it could be that a schedule at a higher II generates fewer stalls than the one at a lower II, or it could be that without these additional constraints the scheduler spills some variables and, ironically, comes up with an even lower II on a spilled loop. These problems can be dealt with at the cost of additional search in the scheduling space, a search that is too expensive for the ILP pipeliner.

6.3.2 Performance Comparison of ILP vs SGI Pipeliner

Referring to Figure 6.1, the SGI pipeliner slightly outperforms the ILP pipeliner in the majority of benchmarks. How can this be? The design of the experiment should have prevented the ILP pipeliner from ever finding a worse schedule than could be found by MIPSpro.

The main reason for this is the time limit imposed on the ILP pipeliner's search for a schedule. Because of the exponential computational complexity of our ILP formulation, the ILP solver's search was restricted to the 3 minute time limit, after which it would accept the best suboptimal solution found. Heuristics were available as a fall-back for the ILP pipeliner only when it could not find any schedule. However, sometimes the ILP pipeliner could not find a schedule at a given II within the 3 minute time limit, but was successful at a higher II. At this point it did not fall back to the heuristic pipeliner and accepted the best schedule found. Thus, the heuristic pipeliner sometimes found schedules at better IIs than the ILP pipeliner.

For example, fifteen loops were scheduled by both pipeliners (all of the software pipelineable loops) in swm256. For two of them, the loops in line 273 and in line 324, the ILP pipeliner scheduled at a higher II than did the SGI pipeliner. These two loops have the greatest trip counts in the program and clearly dominate its execution time. As a result, the code scheduled by the heuristic pipeliner runs faster. The SGI pipeliner rejected a schedule at II = 6 for the loop in line 344, which appears to give an advantage to the ILP pipeliner, because of possible stalls due to the memory system (SGI's heuristic is better tuned than the ILP at this point to handle such subtleties). The IIs at which the loops in swm256 were scheduled by both software pipeliners, along with the trip counts of these loops, are shown in Table 6.3:

Software Pipelined Loop    Trip Count          ILP's II   SGI's II
swm256: line 228           256 * 256           11         11
swm256: line 228           256 * 256           3          3
swm256: line 235           256                 4          4
swm256: line 239           256                 4          4
swm256: line 246           257                 3          3
swm256: line 273           256 * 256 * 1200    14         12
swm256: line 286           256 * 1200          4          4
swm256: line 292           256 * 1200          4          4
swm256: line 324           256 * 256 * 1200    14         11
swm256: line 339           256 * 1200          6          6
swm256: line 344           256 * 1200          6          8
swm256: line 370           257 * 257           6          6
swm256: line 399           256 * 256 * 1199    8          8
swm256: line 412           256 * 1199          6          6
swm256: line 420           256 * 1199          6          6

Table 6.3: IIs in the swm256 Benchmark

Another reason why code scheduled by the SGI pipeliner runs faster is the random behavior of the memory system. The SGI memory heuristic performs some additional search looking for a schedule with, perhaps, a greater II but fewer possible memory stalls, so that the overall performance is improved. Such search is too expensive for the ILP pipeliner. On the other hand, although the random factor introduced by the memory system is considerably reduced, it may still favor either one of the pipeliners in different programs.

Investigation of the three benchmarks on which the ILP pipeliner performs better than the SGI pipeliner (spice2g6, tomcatv, and ear) showed that memory system behavior may be the reason for the ILP code being faster. For example, Table 6.4 shows static quality measures of the performance of both pipeliners in the ear benchmark:

Software Pipelined Loop     ILP II   SGI II   ILP reg.   SGI reg.
correlate.c: line 205       2        2        9          11
correlate.c: line 305       89       88       22         19
correlate.c: line 374       4        4        9          11
correlate.c: line 407       22       20       26         21
correlate.c: line 468       4        4        24         25
correlate.c: line 504       29       29       21         18
correlate.c: line 512       4        4        9          11
correlate.c: line 534       4        4        18         20
correlate.c: line 557       4        4        23         22
correlate.c: line 666       6        6        39         31
correlate.c: line 670       4        4        9          11
ear.c: line 138             7        7        23         17
ear.c: line 158             6        6        20         22
earfilters.c: line 74       2        2        4          4
earfilters.c: line 66       4        4        7          8
earfilters.c: line 95       4        4        9          11
earfilters.c: line 101      6        6        21         18
earfilters.c: line 139      6        6        30         36
earfilters.c: line 163      8        8        39         29
earfilters.c: line 172      5        5        14         17
earfilters.c: line 190      11       11       33         28
earfilters.c: line 209      2        2        4          4
earfilters.c: line 213      6        6        20         24
earfilters.c: line 230      8        8        24         25
fft.c: line 70              68       68       22         14
fft.c: line 99              20       20       16         17
fft.c: line 116             4        4        7          7
file.c: line 174            4        4        11         9
file.c: line 286            4        4        11         9
utilities.c: line 408       20       20       14         15

Table 6.4: Static Scheduling Quality for the Ear Benchmark

All loops were scheduled by both pipeliners at the same II, except for the loops in line 305 and in line 407 in the "correlate.c" program, which the SGI pipeliner scheduled at a better II. Because the performance of these programs is dominated by the loops with long trip counts, and because the ILP pipeliner neither scheduled any loops at a better II than the SGI compiler, nor has a clear advantage in the number of required registers, we conclude that the ILP just generated fewer stalls by chance. The major source of such stalls is the memory system; therefore, it must be the main reason for the ILP scheduled code running faster in this benchmark.

6.3.3 Minimizing Register Requirements

As mentioned earlier, minimizing register pressure in software pipelined loops is rather costly in terms of compile time. Moreover, as long as the schedule is register allocatable, the number of registers it uses should not really matter. After all, a couple of spills before the loop is entered and a couple of restores after the exit can not significantly slow down a program. In this respect, minimizing register usage should only be important for loops with high register pressure, where a schedule that uses some extra registers compared to the optimal one may lead to a register allocation failure and spills. Unfortunately, the exponential nature of the ILP formulation prevented the ILP pipeliner from finding optimal schedules for many interesting loops with high register pressure. Register usage in loops scheduled by both pipeliners in the mdljdp2 benchmark is shown in Table 6.5.

In 4 out of 20 loops, the ILP pipeliner failed to find and register allocate a schedule. In 6 loops, the ILP pipeliner produced schedules that require fewer registers than schedules produced by the SGI pipeliner, and schedules produced by the SGI pipeliner use fewer registers in 4 loops. Because neither

Software Pipelined Loop: mdljdp2, lines 779, 902, 1751, 2067, 2074, 2084, 2248, 2307, 2336, 2370, 4132, 4144, 4156, and 4168 (the II and register columns of this table are not legible in this copy).

Table 6.5: Static Measurements of the Schedule Quality for mdljdp2

of the two schedulers outperforms the other in terms of required registers, there is no clear evidence of performance being affected by this factor.

The ILP formulation including constraints on the average cumulative lifetime (see Chapter 4) aimed at reducing the compile time cost of finding register allocatable schedules. Huff [27] argued that the average cumulative lifetime of a schedule is a very good approximation of the loop's register pressure. Table 6.6 shows the compile time of 755 loops scheduled using the ILP formulation when constraints on the average cumulative lifetime were included:

Under 1 sec. | Under 10 secs | Under 12 mins | Over 12 mins

Table 6.6: ILP time for scheduling loops in the SPECfp92 benchmark suite (table data not legible in this copy)

Notice that 2 loops were scheduled where the ILP formulation with buffer minimization failed, and 35 loops were scheduled at a better II compared to the formulation that uses buffer minimization. Also, 464 loops were scheduled in less than 10 seconds, while only 241 loops were scheduled in less than 10 seconds using the ILP formulation with buffer minimization. 16 loops,

however, were scheduled at a higher II by the formulation that constrains the average cumulative lifetime. This shows that buffer minimization is important in some cases. Perhaps a hybrid pipeliner, using the ILP formulation with the average cumulative lifetime as a first step and the ILP formulation that minimizes buffers whenever the former fails to produce a register allocatable schedule, would have greater efficiency than either of the two separately.
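The hybrid scheme suggested above amounts to a two-step fallback. A minimal sketch, with the two scheduler functions and the allocation test passed in as parameters (all of them assumptions, not part of the thesis' implementation; both schedulers are assumed to return None on failure):

```python
# Hybrid pipelining driver: try the cheap average-cumulative-lifetime
# (ACL) formulation first; fall back to the slower buffer-minimization
# formulation only when the ACL schedule cannot be register allocated.
def hybrid_pipeline(loop, acl_scheduler, buffer_scheduler, allocatable):
    schedule = acl_scheduler(loop)
    if schedule is not None and allocatable(schedule):
        return schedule                  # fast path: ACL schedule fits
    return buffer_scheduler(loop)        # register-optimal fallback
```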

6.4 Branching Order

Surprisingly, for a given loop, one ordering may lead to the optimal solution in a very short time, while another may not find any solution at all. Why? The ILP pipeliner calls the CPLEX mixed integer linear solver⁴ to solve the ILP formulation. CPLEX is a very powerful and flexible tool, but it does not allow us to fully exploit the problem's structure. Making the integer solver aware of such structure increases the effectiveness of integer problem solving.

The SGI pipeliner uses multiple scheduling orders to facilitate register allocation [44]. The idea behind it is that at least one of the different scheduling orders should produce a schedule which fits into the available registers. Facilitating the search for register allocatable schedules was also the original motivation for trying, in turn, many different orders in which the ILP

⁴CPLEX is a trademark of CPLEX Optimization, Inc.

solver traverses the branch-and-bound tree. However, we soon discovered that traversing the branch-and-bound tree in a "good" order significantly reduces the time spent searching for the optimal solution.

The four orders that were used by the ILP pipeliner are⁵:

1. Folded depth-first ordering with the final memory sort - In the simple cases, this is just a depth first ordering starting with the roots (stores) of the calculation. However, when there are difficult to schedule operations (operations with complex resource usage patterns, such as the floating-point divide) or large strongly connected components, they are folded and become virtual roots. Then the depth first search proceeds outward from the fold points, backward to the leaves (loads) and forward to the roots (stores). Finally, stores with no successors and loads with no predecessors are pulled to the end of the list.

2. Data precedence graph heights with the final memory sort - The operations are ordered in terms of the maximum sum of the latencies along any path to a root (store). This traversal order corresponds to scheduling in the topological order of the data dependence graph. Finally, stores with no successors and loads with no predecessors are pulled to the end of the list.

3. Reversed heights with the final memory sort - The data precedence list may be reversed. This ordering corresponds to scheduling backward in the topological order of the data dependence graph. Finally, stores with no successors and loads with no predecessors are pulled to the end of the list. In the SGI compiler, it was useful for loops other than those in the SPEC92 floating-point benchmark.

⁵These orders are built by the SGI compiler.

4. Folded depth-first ordering without the final memory sort.
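Order 2 above (data precedence graph heights) is straightforward to sketch: an operation's height is the maximum sum of latencies along any path to a root, and operations are tried in decreasing height. A minimal version over a toy dependence graph (the operation names and latencies are illustrative, not taken from the R8000 model):

```python
# Compute data precedence graph heights: height(op) = latency(op) plus
# the maximum height of any successor, i.e. the longest latency path
# from op down to a root (store) of the calculation.
def heights(succs, latency):
    """succs: op -> list of successor ops; latency: op -> cycles."""
    memo = {}
    def h(op):
        if op not in memo:
            memo[op] = latency[op] + max((h(s) for s in succs[op]), default=0)
        return memo[op]
    return {op: h(op) for op in succs}

def priority_order(succs, latency):
    # Try the highest operations first: this is a topological order of
    # the dependence graph, weighted by path latency.
    hs = heights(succs, latency)
    return sorted(succs, key=lambda op: -hs[op])
```

For a chain load -> mul -> store with latencies 2, 3, 1, the heights come out as 6, 4, 1, so the load is placed first, matching scheduling in topological order.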

Table 6.7 shows how many loops out of the total 753 loops were scheduled by the ILP pipeliner, and how many of the resulting schedules were successfully allocated registers, using each of the four branching orders:

FDFO w/memory sort | Data precedence heights | Reversed heights | FDFO

Table 6.7: Number of loops scheduled by each searching order out of the total 753 loops (table data not legible in this copy)

No branching order was able to successfully schedule all the loops. The most efficient was the folded depth-first search order. Because of the "folding", placement of the difficult to schedule operations is attempted first, and this allows the ILP solver to detect earlier in the branch-and-bound tree that such operations can not be scheduled and not to spend much time exploring their subtrees. Final sorting of the loads and stores, which are easy to schedule in the R8000 architecture, is another way of trying to place operations that are more difficult to schedule ahead in the scheduling order. However, when this is done, the effectiveness of searching in the scheduling solution space is significantly reduced. This shows that scheduling operations out of their topological order in the data dependence graph causes the ILP solver to explore parts of the branch-and-bound tree that do not contain any solution.

A more interesting discovery was that different search orders work well with different loop bodies. No single search order works best with all loop bodies. However, folded depth-first ordering works better than others for a vast majority of loops. Thus, scheduling in the backward topological order of the data

dependence graph and giving proper consideration to the properties of the target architecture, such as placing the most difficult to schedule operations before the others, noticeably improved the effectiveness of the ILP solution.

These results show that further improvements in ILP solving have to come from exploiting the problem properties, such as relations between different operations in the data dependence graph and characteristics of the target architecture.

6.5 Short Trip Count Performance

The Livermore Loops benchmark was used to measure the impact of minimizing the depth of the software pipeline on the performance of short trip count loops (see Chapter 4). This benchmark is particularly well suited for this measurement. It measures the performance on each of 24 floating point kernels for short, medium, and long trip counts. Figure 6.3 shows the relative performance of the ILP over the SGI pipeliner on 18 out of 24 kernels⁶:

These results show better performance for the SGI scheduler in nearly all cases. But as we have just seen, these results can be distorted by the effects of the machine's memory system. We would like a way to make a more direct comparison.

Table 6.8 gives some static performance information about the individual loops in the benchmark. Relative performance of the two pipeliners is given in terms of:

1. initiation intervals for each loop,

⁶These were software pipelined by both pipeliners.

Figure 6.3: Relative performance of the ILP over SGI on Livermore Loops

2. register usage, measured in the total number of both floating point and integer registers used,

3. depth of the resulting software pipeline, and

4. overall pipeline overhead, measured in cycles required to enter and exit the loop.

The length of the initiation interval and the overall pipeline overhead directly affect performance, while register usage is important only insofar as it impacts pipeline overhead. However, this chart shows that:

Loop               II      Reg.     Pipeline Depth   Overhead
Kernel 1 (218)     4/4     37/28    3/5              27/30
Kernel 2 (239)     6/6     33/26    3/4              39/43
Kernel 9 (354)     7/6     39/36    4/7              59/74
Kernel 10 (368)    10/10   38/33    3/4              67/71
Kernel 11 (400)    8/8     33/25    4/5              61/69
Kernel 12 (413)    4/4     22/22    1/2              7/12
Kernel 14 (455)    6/6     22/23    3/4              38/43
Kernel 14 (464)    8/8     40/43    6/6              97/91

Table 6.8: Static performance on the Livermore Loops (values given as ILP/SGI; only part of the original table is legible in this copy)

1. Minimizing software pipeline depth directly translates into a reduced loop overhead penalty, and the ILP pipeliner consistently produces better schedules in terms of overhead cycles. This proves that the number of overhead cycles is directly affected by the pipeline depth.

2. There is no clear correlation between register usage and overhead. For 13 out of 26 loops, the schedule with smaller overhead did not use fewer registers. This proves that the number of registers used is not the most important parameter for optimizing short trip count loops (saving and restoring registers used by the steady state is only one of

the things that need to be done in the prologue and epilogue, and the R8000 is able to save or restore two registers per clock cycle).

3. Although minimizing software pipeline depth reduces the resulting modulo expansion, it does not seem to affect register requirements. Not one loop was scheduled by the SGI compiler that is less deeply pipelined than the loop scheduled by the ILP pipeliner, and yet the SGI pipeliner uses fewer registers in 10 out of 26 loops. This proves that modulo expansion is not a very significant factor in causing high register pressure.

4. Reducing loop overhead does not seem to translate directly into a performance gain. This shows that there are more important parameters than loop overhead affecting short trip count loop performance. Some of them are cache and memory issues. This is not as surprising as it may seem at first. One secondary cache miss can have much heavier consequences for a short trip count loop than it would for a long trip count loop, because the miss penalty is not amortized over many iterations by the R8000 integer floating-point decoupling mechanism. Future research on improving the performance of short trip count loops needs to focus on cache and memory issues as well as on the overall loop overhead.

Summary

In this Chapter the results of the experimental testing of the ILP software pipeliner for the MIPS R8000 processor were presented. Its performance was compared to the performance of the MIPSpro heuristic pipeliner. As expected, the ILP pipeliner is much slower than the heuristic-based one. This was especially important for large loop bodies, where this factor limited the

ILP pipeliner's functionality. Because heuristic techniques can handle modest-sized loops near-optimally, very rarely did the optimal technique schedule and allocate a loop at a lower II than the heuristic. However, there is significant room for improvement in scheduling loops with large bodies and high register pressure. In order to achieve such improvements, the efficiency of the ILP solution must be increased. Our experiments demonstrated the importance for ILP solving of the order in which the solution space is searched. This proves that exploiting problem structure is essential for improving the ILP results.

Our experiments also showed that issues related to the memory system organisation can not be ignored. By using the ILP formulation for minimizing processor stalls due to simultaneous accesses to the same memory bank, an average performance improvement of 7% was obtained for the programs in the SPEC92 floating-point benchmark suite. Moreover, the performance of short trip count loops was affected much more significantly by cache and memory system issues than by the overall loop's pipeline overhead.

Future work in this direction will include improving the efficiency of the branch-and-bound techniques via exploitation of the problem's structure, and exploring the possibilities of including optimizations related to the memory system in the existing modulo scheduling framework.

Chapter 7

Conclusions and Future Work

7.1 Summary

Scheduling operations for instruction-level parallelism allows the creation of faster, more efficient code sequences. In loops, the most parallelism is achieved by simultaneously executing operations from different iterations of the loop. One method for generating such code is software pipelining. Modulo scheduling is one of the possible approaches to software pipelining. In modulo scheduling a new loop body is constructed by packing loop instructions into a time window of the size of the interval that separates the initiation of successive loop iterations. The size of the window, or initiation interval, is known before the scheduling starts. Finding the best software pipelined schedules under limited resources is NP-hard, and heuristics have been developed that attempt to overcome this difficulty. When a schedule produced by a heuristic for a given initiation interval can not be accepted (can not be allocated registers, for example), the initiation interval has to be increased in order to find an acceptable schedule. Thus, the schedule's quality degrades, even though better

schedules might exist. The integer linear programming framework offers an optimal solution to the software pipelining problem. This thesis deals with the design of an ILP-based software pipeliner, and, in particular, with the design and implementation of the software pipeliner for the MIPS R8000 superscalar microprocessor.
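The modulo scheduling window described in the summary above implies the usual modulo reservation rule: an operation issued at cycle t occupies its resource in slot t mod II, so two operations needing the same resource must not collide modulo II. A minimal sketch of that check, assuming single-slot resources and made-up operation names:

```python
# Check a candidate schedule against a modulo reservation table: with
# initiation interval ii, an operation issued at cycle t uses its
# resource in slot t % ii, and each (resource, slot) pair may only be
# occupied once across the whole loop body.
def modulo_reservation_ok(schedule, resource_of, ii):
    """schedule: op -> issue cycle; resource_of: op -> resource name."""
    table = {}
    for op, t in schedule.items():
        slot = (resource_of[op], t % ii)
        if slot in table:
            return False      # two operations collide on a resource mod ii
        table[slot] = op
    return True
```

For instance, a multiply at cycle 1 and an add at cycle 3 on the same functional unit collide at II = 2 (both map to slot 1) but are legal at II = 3.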

We developed a complete ILP model of the R8000 processor. This machine has multiple execution pipelines that share certain stages, and an in-order superscalar dispatching mechanism that issues instructions as soon as their operands are ready and there are sufficient resources for their execution. We also developed the ILP formulation that optimizes code for better memory system performance. In order to evaluate its performance, the ILP software pipeliner was embedded in the Silicon Graphics' MIPSpro compiler. Embedding the ILP software pipeliner in the MIPSpro compiler was a perfect opportunity to answer many questions: how well would the ILP work when targeted to a real processor? Were there any unexpected problems standing in the way of a full implementation, one that would generate runnable code? How would it compare with a more specialized heuristic implementation? It was bound to be slower, but how much better would its results be? Because heuristic approaches can have near-linear running time, they would certainly be able to handle larger loops. How much larger? The bulk of this job was conducted at Silicon Graphics during the summer of 1995.

What did we discover? So long as a software pipelined loop actually fits in a machine's registers, the number of registers used by the loop's steady state is not the most important parameter to optimize. We discovered that requiring strict register optimality is often unnecessary, and other factors, e.g. memory system behavior, may have a more significant performance impact. In fact, minimizing memory stalls that result from multiple simultaneous references to the

same memory bank gave an average performance improvement of 7 percent, and up to 43 percent for one of the benchmarks that we used. Nevertheless, certain software pipelined loops reach high register pressure, and finding the optimal schedule in terms of the required registers is important as it allows us to avoid spilling in those loops. However, the exponential nature of integer linear programming prevented the ILP pipeliner from scheduling many interesting and important loops optimally in an acceptable amount of time. Some loops could not be scheduled at all because of their size. In order to produce results in such cases, the ILP pipeliner had to use a heuristic fallback. Altogether, the ILP pipeliner scheduled 753 out of 798 loops in the SPEC92 floating-point benchmark suite. 9 loops were scheduled by the ILP pipeliner at a lower II than by the SGI pipeliner in the course of this study, and in those cases some increase in the backtracking limits of the heuristic method equalized the situation.

7.2 Future Work

Future work in this direction should focus on reducing the ILP solution time. Careful investigation showed that exploitation of the problem structure must be the foundation of such an improvement.

The largest loop solved to optimality in our experiments contains only 14 instructions. The ILP formulation of resource-constrained scheduling (which does not address register allocation) can be solved considerably faster¹. However, as our results indicate, addressing the register allocation problem is important. Can an ILP formulation of software pipelining that is amenable to fast optimal

¹As a matter of fact, for the majority of loops the time to solve this formulation is comparable with the time of the heuristic software pipeliner.

solution be derived?

The ILP framework has been effective in solving other computationally difficult problems [32, 31, 38, 7, 22]. In these problems, formulating a "good" model is of crucial importance to solving that model [33]. Therefore, analyzing the structure of scheduling constraints and developing a well-structured formulation can serve as a solid theoretical foundation for future improvement in solution efficiency. The key observation is that 0-1 ILP problems can be solved much more efficiently by traditional branch-and-bound methods than ILP problems with arbitrary integer variables. Our formulation is not a 0-1 integer linear formulation. It has been shown that the cyclic scheduling problem can be formulated as a 0-1 integer linear problem [10].

A 0-1 ILP formulation is written as:

    minimize c·x subject to x ∈ P_I, where P_I = {x : Ax ≤ b, x ∈ [0,1]^n, x integer}

where c is an (n × 1) real vector, b is an (m × 1) vector of integers, and A is an (m × n) matrix of integers.

A legal schedule Q may be expressed as a 0-1 vector x^Q = (x^Q_{1,0}, ..., x^Q_{n,II-1}):

    x^Q_{i,t} = 1 if operation x_i is issued at time t, and 0 otherwise.

Then the precedence and resource scheduling constraints can be formulated in terms of the x^Q_{i,t} variables.
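As a toy sketch of this encoding (hypothetical instance and helper names, not the thesis implementation), a schedule can be packed into 0-1 variables and a precedence constraint checked over them:

```python
# Sketch: a schedule as 0-1 variables x[i][t] (toy instance; names are illustrative).
# x[i][t] == 1 iff operation i is issued at time t; each row has exactly one 1.

def encode(schedule, horizon):
    """Turn a dict {op: issue_time} into rows of 0-1 variables."""
    x = {i: [0] * horizon for i in schedule}
    for i, t in schedule.items():
        x[i][t] = 1
    return x

def issue_time(x, i):
    """Recover t_i from the 0-1 row (exactly one entry is 1)."""
    return x[i].index(1)

def precedence_ok(x, i, j, latency):
    """Precedence constraint t_j - t_i >= latency, stated over the 0-1 variables."""
    return issue_time(x, j) - issue_time(x, i) >= latency

# Toy chain: op 1 depends on op 0 (latency 2); op 2 depends on op 1 (latency 3).
x = encode({0: 0, 1: 2, 2: 5}, horizon=8)
assert all(sum(row) == 1 for row in x.values())   # each op issued exactly once
assert precedence_ok(x, 0, 1, 2) and precedence_ok(x, 1, 2, 3)
```

The "issued exactly once" check corresponds to the assignment constraint that each operation's row of 0-1 variables sums to one.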

Such problems are solved using branch-and-bound techniques, where a linear relaxation of the ILP problem is solved first:

    minimize c·x subject to x ∈ P_F, where P_F = {x : Ax ≤ b, x ∈ [0,1]^n}

Because x is no longer required to be integer, P_I ⊆ P_F. Additional constraints are added at each step of the branch-and-bound, producing a new relaxation, in order to gradually reduce P_F to P_I and guarantee the integrality of the solution. If P_I = P_F, the linear relaxation is called tight. Obviously, the tighter the initial relaxation is, the less work is performed by the branch-and-bound, and the faster the problem is solved. Thus, obtaining a tight formulation from the beginning is of interest. It has also been shown [10] that the linear relaxations of the precedence constraints alone, as well as of the resource constraints alone, are tight. Their union, however, does not produce a tight relaxation. The authors of [10] reported that they were able to optimally schedule a benchmark with over 40 instructions reasonably fast. The above-mentioned formulation did not address the register allocation problem. Conceivably, it can be extended to do so, resulting perhaps in an efficient scheduler able to optimally handle bigger loops than the ILP pipeliner handles so far.
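The branch-and-bound mechanics can be sketched on a toy 0-1 maximization problem (a knapsack-style instance, not the scheduling formulation itself); a fractional relaxation plays the role of P_F, and the prune test shows why a tighter bound means less branching. The instance and all names are illustrative:

```python
# Sketch of branch-and-bound over 0-1 variables, using a fractional relaxation
# as the bound (toy knapsack-style maximization, not the scheduling model).

def relaxation_bound(values, weights, cap, fixed):
    """Upper bound: the not-yet-fixed variables may take fractional values in [0, 1]."""
    total_v = sum(v for k, v in enumerate(values) if fixed.get(k) == 1)
    total_w = sum(w for k, w in enumerate(weights) if fixed.get(k) == 1)
    if total_w > cap:
        return float("-inf")                      # infeasible partial assignment
    free = [k for k in range(len(values)) if k not in fixed]
    free.sort(key=lambda k: values[k] / weights[k], reverse=True)
    bound, room = total_v, cap - total_w
    for k in free:
        take = min(1.0, room / weights[k])        # fractional values allowed here
        bound += take * values[k]
        room -= take * weights[k]
        if room <= 0:
            break
    return bound

def branch_and_bound(values, weights, cap):
    best = 0
    stack = [{}]                                  # partial 0-1 assignments
    while stack:
        fixed = stack.pop()
        if relaxation_bound(values, weights, cap, fixed) <= best:
            continue                              # prune: relaxation can't beat incumbent
        if len(fixed) == len(values):
            best = sum(values[k] for k, v in fixed.items() if v == 1)
            continue
        k = len(fixed)                            # branch on the next 0-1 variable
        for v in (0, 1):
            stack.append({**fixed, k: v})
    return best

print(branch_and_bound([60, 100, 120], [10, 20, 30], 50))  # prints 220
```

The prune test `bound <= best` is exactly where a tighter relaxation pays off: the closer the fractional bound is to the best integer value, the more subtrees are cut before they are expanded.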

The integer linear programming framework is very malleable and easy to use, and can be employed to improve different aspects of the schedules. The lower bound on the initiation interval of a schedule has always given us an idea of how much room there was for performance improvement. On the other hand, this does not necessarily apply to software pipelined loops when the trip count is short. In this thesis, we showed that an ILP formulation can be developed that optimizes the loop overhead more directly than by minimizing register usage. However, performance of short trip count loops is affected by other important factors, such as cache and memory system behavior. Issues related to the performance improvement of short trip count loops are

currently being investigated. Short trip count performance becomes especially important in light of loop nest transformations such as tiling.

There is also room for improvement in the usage of the machine's memory system. Perhaps an ILP formulation can be made that minimizes processor stalls due to cache misses and simultaneous memory bank accesses. Cache issues, for example, can no longer be ignored because of the relatively high cache miss penalties and, therefore, the significant performance improvement opportunities that optimal cache utilization may lead to. A good solution to the problem of optimizing the memory system could have interesting implications for the design of such systems.

Appendix A: Reservation Tables

The machine model of the MIPS R8000 processor includes a set of reservation tables that define instruction resource usage patterns. The reservation table descriptions below consist of:

- the instruction type with the corresponding resource usage,

- a table of resources required for a given instruction type, along with the time in clock cycles (from the start of an instruction of this type) at which it uses a particular resource, and the number of required units of that resource.

For example, the single-precision integer multiply instruction uses one unit of the COM-DISP resource, the issue stage of the pipeline (see Chapter 5), at clock cycle 0 after it has been issued; the ALU-DISP resource, the arithmetic logic unit, at clock cycle 0; and the ALU-HILO resource, the high/low pair of registers, for four consecutive cycles, from cycle 0 to cycle 3.
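As an illustrative sketch (not the actual compiler data structures; the availability counts and names below are assumptions), a reservation table can be stored as a per-resource vector of units used at each cycle offset, and a modulo check at a candidate II sums every instruction's pattern into II slots:

```python
# Sketch: reservation tables as {resource: [units used at each cycle offset]}.
# Unit counts in AVAILABLE are assumed for illustration, not taken from the manual.

RESERVATION = {
    "int_multiply": {"COM_DISP": [1, 0, 0, 0],
                     "ALU_DISP": [1, 0, 0, 0],
                     "ALU_HILO": [1, 1, 1, 1]},   # HI/LO pair busy cycles 0..3
    "alu":          {"COM_DISP": [1],
                     "ALU_DISP": [1]},
}

AVAILABLE = {"COM_DISP": 4, "ALU_DISP": 2, "ALU_HILO": 1}  # assumed unit counts

def modulo_conflict_free(issues, ii):
    """issues: list of (instruction_type, issue_time). Sum per-resource demand
    into II modulo slots and compare against the available units."""
    demand = {r: [0] * ii for r in AVAILABLE}
    for kind, t in issues:
        for res, pattern in RESERVATION[kind].items():
            for offset, units in enumerate(pattern):
                demand[res][(t + offset) % ii] += units
    return all(d <= AVAILABLE[r] for r in demand for d in demand[r])

# The multiply holds ALU_HILO for 4 cycles, so at II=2 it collides with its
# own next iteration; at II=4 it fits, even alongside an ALU op.
assert not modulo_conflict_free([("int_multiply", 0)], ii=2)
assert modulo_conflict_free([("int_multiply", 0), ("alu", 1)], ii=4)
```

This is the same modulo-reservation reasoning the scheduler's resource constraints encode: total demand on each resource in each of the II slots must not exceed the units available.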

Integer load type:

COM-DISP 1
MEM-DISP 1

Integer merge type:

COM-DISP 1
MEM-DISP 1
IE-STORE 1

Integer store type:

COM-DISP 1
MEM-DISP 1
FPU-DISP 4
IE-STORE 1

Move to/from HiLo registers type:

COM-DISP 1
ALU-DISP 1
ALU-HILO 1

ALU type:

COM-DISP 1
ALU-DISP 1

Shift type:

COM-DISP 1
ALU-DISP 1
ALU-SHIF 1

Single-precision integer multiply type:

COM-DISP 1 0 0 0
ALU-DISP 1 0 0 0
ALU-HILO 1 1 1 1

Double-precision integer multiply type:

COM-DISP 1 0 0 0 0 0
ALU-DISP 1 0 0 0 0 0
ALU-HILO 1 1 1 1 1 1

Signed integer divide type:

COM-DISP 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ALU-DISP 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ALU-HILO 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Unsigned integer divide type:

COM-DISP 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ALU-DISP 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ALU-HILO 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Branch type:

COM-DISP 1 0
ALU-DISP 1 0
BRA-DISP 1 1

FPU memory type:

COM-DISP 1 0 0 0 0 0 0 0
MEM-DISP 1 0 0 0 0 0 0 0
FPU-DISP 1 0 0 0 0 0 0 0
FPM-DISP 0 0 0 0 0 0 0 1

Move to FPU type:

COM-DISP 1 0 0 0 0 0 0 0
FPU-DISP 4 0 0 0 0 0 0 0
IE-STORE 1 0 0 0 0 0 0 0
FPM-DISP 0 0 0 0 0 0 0 1

Move from FPU type:

COM-DISP 1 0 0 0 0 0 0 0 0 0 0 0
FPU-DISP 1 0 0 0 0 0 0 0 0 0 0 0
ISSUE-03 0 0 0 0 0 0 0 1 0 0 1 0
ISSUE-24 0 0 0 0 0 0 0 1 0 1 0 1
FPM-DISP 0 0 0 0 0 0 0 1 0 0 0 0

FPU 1 cycle type:

COM-DISP 1 0 0 0 0 0 0 0 0 0 0
FPU-DISP 1 0 0 0 0 0 0 0 0 0 0
FPA-DISP 0 0 0 0 0 0 0 1 0 0 0
FPU-WREG 0 0 0 0 0 0 0 0 0 0 1

FPU conditional move type:

COM-DISP 1 0 0 0 0 0 0 0 0 0 0
ALU-DISP 1 0 0 0 0 0 0 0 0 0 0
FPU-DISP 1 0 0 0 0 0 0 0 0 0 0
FPA-DISP 0 0 0 0 0 0 0 1 0 0 0
FPU-WREG 0 0 0 0 0 0 0 0 0 0 1

FPU compare type:

COM-DISP 1 0 0 0 0 0 0 0 0 0
ALU-DISP 1 0 0 0 0 0 0 0 0 0
FPU-DISP 1 0 0 0 0 0 0 0 0 0
FPA-DISP 0 0 0 0 0 0 0 1 0 0
FPU-BREG 0 0 0 0 0 0 0 0 0 1

FPU multicycle type:

COM-DISP 1 0 0 0 0 0 0 0 0 0 0 0 0 0
FPU-DISP 1 0 0 0 0 0 0 0 0 0 0 0 0 0
FPA-DISP 0 0 0 0 0 0 0 1 0 0 0 0 0 0
FPU-BREG 0 0 0 0 0 0 0 0 0 1 0 0 0 0
FPU-WREG 0 0 0 0 0 0 0 0 0 0 0 0 0 1

FPU single-precision type:

COM-DISP 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FPU-DISP 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FPA-DISP 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FPU-BREG 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0
FPU-WREG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

FPU double-precision type:

COM-DISP 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FPU-DISP 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FPA-DISP 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FPU-BREG 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
FPU-WREG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Appendix B

The ILP software pipeliner uses a number of ILP formulations for modulo scheduling.

B.1 ILP Formulation With Buffers

minimize Σ_{x_i : ∃(i,j) ∈ E′} b_i

subject to

1. Precedence constraints:

2. Resource constraints:

R_s ≥ Σ_{X ∈ ISA} Σ_{x_i ∈ X} Σ_l a_{((t−l) mod II), i} · CRT_X[l], ∀t ∈ [0, II−1], ∀ scheduled resource s

3. Buffer constraints:

II·b_i + t_i − t_j ≥ II·(Ω_ij + 1) − 1, ∀(i,j) ∈ E′

t_i ≥ 0 are real; k_i ≥ 0, 0 ≤ a_{t,i} ≤ 1, b_i ≥ 0 are integers

B.2 ILP Formulation With Bounded Lifetimes

find a legal schedule subject to

1. Precedence constraints:

2. Resource constraints:

R_s ≥ Σ_{X ∈ ISA} Σ_{x_i ∈ X} Σ_l a_{((t−l) mod II), i} · CRT_X[l], ∀t ∈ [0, II−1], ∀ scheduled resource s

3. Average cumulative lifetime:

MinAvg ≤ Number of registers available for allocation

MinAvg = (1/II) · Σ L_i, over all x_i such that ∃(i,j) ∈ E′

L_i ≥ t_j − t_i + II·Ω_ij, ∀(i,j) ∈ E′

t_i ≥ 0 are real; k_i ≥ 0, 0 ≤ a_{t,i} ≤ 1, L_i ≥ 0 are integers
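A hypothetical sketch of how these bounded-lifetime quantities can be evaluated for a given schedule (instance and names assumed, not the thesis implementation): each L_i is the largest t_j − t_i + II·Ω_ij over the flow edges leaving x_i, and MinAvg averages the lifetimes over one initiation interval:

```python
# Sketch: cumulative register-lifetime bound from a schedule (toy instance).
# L_i >= t_j - t_i + II * omega_ij for every flow edge (i, j);
# MinAvg = sum(L_i) / II averages the lifetimes over one initiation interval.

def lifetimes(t, edges, ii):
    """t: {op: issue_time}; edges: list of (i, j, omega) flow dependences."""
    L = {}
    for i, j, omega in edges:
        L[i] = max(L.get(i, 0), t[j] - t[i] + ii * omega)  # L_i >= 0 enforced by default
    return L

def min_avg(L, ii):
    return sum(L.values()) / ii

t = {0: 0, 1: 2, 2: 3}
edges = [(0, 1, 0), (1, 2, 0), (2, 0, 1)]   # (2, 0) is carried across one iteration
L = lifetimes(t, edges, ii=4)
print(L, min_avg(L, ii=4))
```

The loop-carried edge contributes II·Ω to the lifetime because the value must survive until its consumer issues Ω iterations later.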

B.3 ILP Formulation For Short Loops

subject to

1. Min-max constraint:

2. Precedence constraints:

3. Resource constraints:

R_s ≥ Σ_{X ∈ ISA} Σ_{x_i ∈ X} Σ_l a_{((t−l) mod II), i} · CRT_X[l], ∀t ∈ [0, II−1], ∀ scheduled resource s

t_i ≥ 0 are real; k_i ≥ 0, 0 ≤ a_{t,i} ≤ 1 are integers

B.4 ILP Formulation With Buffers and Memory Constraints

minimize Σ_{x_i : ∃(i,j) ∈ E′} b_i

subject to

1. Precedence constraints:

2. Resource constraints:

R_s ≥ Σ_{X ∈ ISA} Σ_{x_i ∈ X} Σ_l a_{((t−l) mod II), i} · CRT_X[l], ∀t ∈ [0, II−1], ∀ scheduled resource s

3. Buffer constraints:

4. Memory constraints:

a_{t,i} + a_{t,j} ≤ 1, ∀x_i without a memory pair, ∀x_j ∈ M

t_i ≥ 0 are real; k_i ≥ 0, 0 ≤ a_{t,i} ≤ 1, b_i ≥ 0, m_i ≥ 0 are integers

B.5 Ning-Gao LP Formulation

minimize Σ b_i

subject to

1. Precedence constraints:

2. Buffer constraints:

t_i ≥ 0 are real; k_i ≥ 0, 0 ≤ a_{t,i} ≤ 1, b_i ≥ 0 are integers

B.6 Resource-constrained ILP Formulation

find a legal schedule

subject to

1. Precedence constraints:

2. Resource constraints:

t_i ≥ 0 are real; k_i ≥ 0, 0 ≤ a_{t,i} ≤ 1 are integers

Bibliography

[l] Alexander Aiken and Alexandru Nicolau. Optimal loop parallelization. In Conference on Programming Language Design and Implementation, pages 308-317, Atlanta, GA, June 1988.

[2] Alexander Aiken and Alexandru Nicolau. A realistic resource-constrained software pipelining algorithm. Advances in Languages and Compilers for Parallel Processing, pages 274 - 290, 1991.

[3] Alexander Aiken and Alexandru Nicolau. Resource-constrained software pipelining. IEEE Transactions on Parallel and Distributed Systems, 6:1248 - 1270, December 1995.

[4] Erik R. Altman. Optimal Software Pipelining with Function Unit and Register Constraints. PhD thesis, McGill University, Montreal, Quebec, 1995.

[5] Erik R. Altman, R. Govindarajan, and Guang R. Gao. Scheduling and mapping: Software pipelining in presence of structural hazards. In Conference on Programming Language Design and Implementation, pages 139 - 150, La Jolla, CA, June 1995. ACM SIGPLAN.

[6] Gary R. Beck, David W. L. Yen, and Thomas L. Anderson. The cydra 5 mini-supercomputer: Architecture and implementation. The Journal of Supercomputing (Special Issue on Instruction-Level Parallelism), 7(1/2):143 - 180, 1993.

[7] R. Bixby, Ken Kennedy, and Uwe Kremer. Automatic data layout using 0-1 integer linear programming. In Conference on Parallel Architectures and Compilation Techniques, pages 111 - 122, August 1994.

[8] C. C. Foster and E. M. Riseman. Percolation of code to enhance parallel dispatching and distribution. IEEE Transactions on Computers, C-21:1411 - 1415, December 1972.

[9] A. E. Charlesworth. An approach to scientific array processing: The architectural design of the ap-120b/fps 164 family. Computer, 14(9):18 - 27, September 1981.

[10] Samit Chaudhuri, Robert A. Walker, and John E. Mitchell. Analyzing and exploiting the structure of the constraints in the ILP approach to the scheduling problem. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2(4), December 1994.

[11] E. G. Coffman. Computer and Job-Shop Scheduling Theory. John Wiley & Sons, New York, 1976.

[12] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, and Mark N. Wegman. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451 - 490, October 1991.

[13] James C. Dehnert and Ross A. Towle. Compiling for cydra 5. The Journal of Supercomputing (Special Issue on Instruction-Level Parallelism), 7(1/2), July 1993.

[14] K. Ebcioglu and T. Nakatani. A new compilation technique for parallelizing loops with unpredictable branches on a vliw architecture. Languages and Compilers for Parallel Computing, pages 213 - 229, 1989.

[15] Kemal Ebcioglu. A compilation technique for software pipelining of loops with conditional jumps. In 20th Annual Workshop on Microprogramming, pages 69 - 79, Colorado Springs, Colorado, December 1987.

[16] Alexandre E. Eichenberger, David S. Davidson, and Santosh G. Abraham. Optimum modulo schedules for minimum register requirements. In International Conference on Supercomputing, pages 31 - 40, Barcelona, Spain, July 1995. ACM SIGARCH.

[17] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. The MIT Press, Cambridge, Massachusetts, 1987.

[18] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Transactions on Computers, C-30:478 - 490, July 1981.

[19] G. R. Gao and Q. Ning. Loop storage optimization for dataflow machines. In 4th Annual Workshop on Languages and Compilers for Parallel Computing, pages 359 - 373, August 1991.

[20] Franco Gasperoni and Uwe Schwiegelshohn. Scheduling loops on parallel processors: A simple algorithm with close to optimal performance. In International Conference CONPAR, pages 625 - 636, 1992.

[21] P. B. Gibbons and S. S. Muchnick. Efficient instruction scheduling for a pipelined processor. In SIGPLAN'86 Symposium on Compiler Construction, pages 11 - 16, Palo Alto, CA, June 1986. ACM.

[22] David W. Goodwin and Kent D. Wilken. Optimal and near-optimal global register allocation using 0-1 integer programming. Software - Practice and Experience, 1996.

[23] R. Govindarajan, Erik R. Altman, and Guang R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In 27th Annual International Symposium on Microarchitecture, pages 85 - 94, San Jose, CA, November - December 1994.

[24] J. L. Hennessy and T. R. Gross. Postpass code optimization of pipeline constraints. ACM Transactions on Programming Languages and Systems, 5(3):422 - 448, July 1983.

[25] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 1995.

[26] Peter Yan-Tek Hsu. Design of the R8000 Microprocessor. MIPS Technologies Inc., June 1994.

[27] Richard A. Huff. Lifetime-sensitive modulo scheduling. In Conference on Programming Language Design and Implementation, pages 258 - 267, Albuquerque, N. M., June 1993. ACM SIGPLAN.

[28] Suneel Jain. Circular scheduling: A new technique to perform software pipelining. In Conference on Programming Language Design and Implementation, pages 219 - 228, Toronto, ON, June 1991. ACM SIGPLAN.

[29] Monica S. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Conference on Programming Language Design and Implementation, pages 318 - 328, Atlanta, GA, June 1988. ACM SIGPLAN.

[30] S.-M. Moon and Kemal Ebcioglu. An efficient resource-constrained global scheduling technique for superscalar and VLIW processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture, Portland, Oregon, December 1992.

[31] G. Nemhauser. The age of optimization: Solving large-scale real world problems. Operations Research, 42(1):5 - 13, January - February 1994.

[32] G. Nemhauser and L. Wolsey. Integer and Combinatorial Optimization. John Wiley & Sons, 1988.

[33] G. L. Nemhauser and L. A. Wolsey. Handbooks in Operations Research and Management Science: Optimization, volume 1. Elsevier Science, New York, 1989. ch. 6.

[34] Qi Ning. Register Allocation for Optimal Loop Scheduling. PhD thesis, McGill University, Montreal, Quebec, 1993.

[35] Qi Ning and Guang R. Gao. A novel framework of register allocation for software pipelining. In 20th Annual International Symposium on Principles of Programming Languages, pages 29 - 42, January 1993.

[36] Joseph C. H. Park and Mike Schlansker. On predicated execution. Technical Report HPL-91-58, Hewlett Packard Software and Systems Laboratory, May 1991.

[37] Charles Price. MIPS IV Instruction Set. Silicon Graphics Computer Systems, January 1995. Revision 3.1.

[38] W. Pugh. The omega test: A fast and practical integer programming algorithm for dependence analysis. In Supercomputing, pages 18 - 22, November 1991.

[39] B. R. Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and prospective. The Journal of Supercomputing (Special Issue on Instruction-Level Parallelism), 7(1/2):9 - 50, 1993.

[40] B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In 14th Annual Workshop on Microprogramming, pages 183 - 198, October 1981.

[41] B. R. Rau, M. S. Schlansker, and P. P. Tirumalai. Code generation schemas for modulo scheduled loops. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 158 - 169, Portland, Oregon, December 1992.

[42] B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle. The cydra 5 departmental supercomputer: Design philosophies, decisions and tradeoffs. Computer, 22(1):12 - 35, January 1989.

[43] Steve Rhodes. MIPS R8000 Microprocessor Chip Set. Users Manual. Silicon Graphics Computer Systems, July 1994. Revision 3.0.

[44] John C. Ruttenberg, Guang R. Gao, Artour Stoutchinin, and Woody Lichtenstein. Software pipelining showdown: Optimal vs heuristic methods in a production compiler. In Conference on Programming Language Design and Implementation, pages 1 - 11, Philadelphia, PA, May 1996. ACM SIGPLAN.

[45] Uwe Schwiegelshohn, Franco Gasperoni, and Kemal Ebcioglu. On optimal parallelization of arbitrary loops. Journal of Parallel and Distributed Computing, 11(2):130 - 134, February 1991.

[46] R. Sites. Instruction ordering for the cray-1 computer. Technical Report 78-CS-023, Department of Computer Science, University of California, San Diego, July 1979.

[47] Harold S. Stone. High-Performance Computer Architecture. Addison-Wesley Publishing, 3rd edition, 1993.

[48] Jian Wang, Christine Eisenbeis, Martin Jourdan, and Bogong Su. Decomposed software pipelining: A new perspective and a new approach. International Journal of Parallel Programming, 22(3):357 - 379, 1994.

[49] Michael Wolfe. Optimizing Supercompilers for Supercomputers. The MIT Press, Cambridge, Massachusetts, 1989.
