Global Instruction Scheduling for Superscalar Machines

David Bernstein    Michael Rodeh

IBM Israel Scientific Center, Technion City, Haifa 32000, Israel

Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada, June 26-28, 1991.

Abstract

To improve the utilization of machine resources in superscalar processors, the instructions have to be carefully scheduled by the compiler. As internal parallelism and pipelining increase, it becomes evident that scheduling should be done beyond the basic block level. A scheme for global (intra-loop) scheduling is proposed, which uses the control and data dependence information summarized in a Program Dependence Graph to move instructions well beyond basic block boundaries. This novel scheduling framework is based on a parametric description of the machine architecture, which spans a range of superscalar and VLIW machines, and exploits speculative execution of instructions to further enhance the performance of general code. We have implemented our algorithms in the IBM XL family of compilers and have evaluated them on the IBM RISC System/6000 machines.

1. Introduction

Starting in the late seventies, a new approach for building high speed processors emerged which emphasizes streamlining of program instructions; subsequently this direction in computer architecture was called RISC [P85]. It turned out that in order to take advantage of pipelining so as to improve performance, instructions have to be rearranged, usually at the intermediate language or assembly code level. The burden of such transformations, called instruction scheduling, has been placed on optimizing compilers.

Previously, scheduling algorithms at the instruction level were suggested for processors with several functional units [BJR89], pipelined machines [BG89, BRG89, HG83, GM86, W90] and Very Large Instruction Word (VLIW) machines [E85, E88]. While for machines with n functional units the idea is to be able to execute as many as n instructions each cycle, for pipelined machines the goal is to issue a new instruction every cycle, effectively eliminating the so-called NOPs (No Operations). However, for both types of machines, the common feature required from the compiler is to discover in the code instructions that are data independent, allowing the generation of code that better utilizes the machine resources.

It was a common view that such data independent instructions can be found within basic blocks, and there is no need to move instructions beyond basic

block boundaries. Virtually all of the previous work on the implementation of instruction scheduling for pipelined machines concentrated on scheduling within basic blocks [HG83, GM86, W90]. Even for basic RISC architectures, this type of scheduling may result in code with many NOPs for certain Unix¹-type programs that include many small basic blocks terminated by unpredictable branches. On the other hand, for scientific programs the problem is not so severe, since there basic blocks tend to be larger.

Recently, a new type of architecture has been evolving that extends RISC by the ability to issue more than one instruction per cycle [GO89]. This type of high speed processor, called superscalar or superpipelined architecture, poses more serious challenges to compilers, since instruction scheduling at the basic block level is in many cases not sufficient to allow generation of code that utilizes machine resources to a desired extent [JW89].

One recent effort to pursue instruction scheduling for superscalar machines was reported in [GR90], where code replication techniques for scheduling beyond the scope of basic blocks were investigated, resulting in fair improvements of the running time of the compiled code. Also, one can view a superscalar processor as a VLIW machine with a small number of resources. There are two main approaches for compiling code for VLIW machines that were reported in the literature: trace scheduling [F81, E85] and enhanced percolation scheduling [EN89].

In this paper, we present a technique for global instruction scheduling which permits the movement of instructions well beyond basic block boundaries within the scope of the enclosing loop. The method employs a novel data structure, called the Program Dependence Graph (PDG), that was recently proposed by Ferrante et al. [FOW87] to be used in compilers to expose parallelism for the purposes of vectorization and generation of code for multiprocessors. We suggest combining the PDG with the parametric description of a family of superscalar machines, thereby providing a powerful framework for global instruction scheduling by optimizing compilers.

While trace scheduling assumes the existence of a main trace in the program (which is likely in scientific computations, but may not be true in symbolic or Unix-type programs), global scheduling (as well as enhanced percolation scheduling) does not depend on such an assumption. However, global scheduling is capable of taking advantage of branch probabilities, whenever available (e.g. computed by profiling). As for enhanced percolation scheduling, our opinion is that it is more targeted towards a machine with a large number of computational units, like VLIW machines.

Using the information available in a PDG, we distinguish between useful and speculative execution of instructions. Also, we identify the cases where instructions have to be duplicated in order to be scheduled. Since we are currently interested in machines with a small number of functional units (like the RISC System/6000 machines), we established a conservative approach to instruction scheduling. First we try to exploit the machine resources with useful instructions; next we consider speculative instructions, whose effect on performance depends on the probability of branches to be taken, and scheduling with duplication, which might increase the code size, incurring additional costs in terms of instruction cache misses.

1 Unix is a trademark of AT&T Bell Labs.

Also, we do not overlap the execution of instructions that belong to different iterations of the loop. This more aggressive type of instruction scheduling, which is often called software pipelining [L88], is left for future work.

For speculative instructions, it was previously suggested that they have to be supported by the machine architecture [E88, SLH90]. Since architectural support for speculative execution carries a significant run-time overhead, we are evaluating techniques for replacing such support with compile-time analysis of the code, still retaining most of the performance effect promised by speculative execution.

We have implemented our scheme in the context of the IBM XL family of compilers for the IBM RISC System/6000 (RS/6K for short) computers. The preliminary performance results for our scheduling prototype were based on a set of SPEC benchmarks [S89].

The rest of the paper is organized as follows. In Section 2 we describe our generic machine model and show how it is applicable to the RS/6K machines. Then, in Section 3 we bring a small program that will serve as a running example. In Section 4 we discuss the usefulness of the PDG, while in Section 5 several levels of scheduling, including speculative execution, are presented. Finally, in Section 6 we bring some performance results and conclude in Section 7.

2. Parametric machine description

Our model of a superscalar machine is based on the description of a typical RISC processor whose only instructions that reference memory are load and store instructions, while all the computations are done in registers. We view a superscalar machine as a collection of functional units of m types, where the machine has n1, n2, ..., nm units of each type. Each instruction in the code can be potentially executed by any of the units of a specified type.

For instruction scheduling purposes, we assume that there is an unbounded number of symbolic registers in the machine. Subsequently, during the register allocation phase of the compiler, the symbolic registers are mapped onto the real machine registers, using one of the standard (coloring) algorithms. Throughout this paper we will not deal with register allocation at all. For a discussion of the relationship between instruction scheduling and register allocation see [BEH89].

A program instruction requires an integral number of machine cycles to be executed by one of the functional units of its type. Also, there are pipeline constraints imposed on the execution of instructions, which are modelled by integer delays assigned to the data dependence edges of the computational graph.

Let I1 and I2 be two instructions such that the edge (I1,I2) is a data dependence edge. Let t (t ≥ 1) be the execution time of I1 and d (d ≥ 0) be the delay assigned to (I1,I2). For performance purposes, if I1 is scheduled to start at time k, then I2 should be scheduled to start no earlier than k + t + d. Notice, however, that if I2 is scheduled (by the compiler) to start earlier than mentioned above, this would not affect the correctness of the program, since we assume that the machine implements hardware interlocks to guarantee the delays at run time. More information about the notion of delays due to pipeline constraints can be found in [BG89, BRG89].
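To make the model concrete, here is a minimal C sketch of the parametric description and of the k + t + d rule; the names (machine_desc, earliest_start) are ours and purely illustrative, not taken from the XL compilers.

    #include <stdio.h>

    /* Illustrative sketch only: a machine is m functional unit types,
       with n[i] units of type i, plus integer delays on dependence edges. */
    typedef struct {
        int m;          /* number of functional unit types */
        const int *n;   /* n[i] = number of units of type i */
    } machine_desc;

    /* If I1 starts at cycle k, takes t cycles (t >= 1), and the edge (I1,I2)
       carries delay d (d >= 0), I2 should start no earlier than k + t + d.
       Starting earlier is still correct: hardware interlocks enforce the delay. */
    static int earliest_start(int k, int t, int d) { return k + t + d; }

    int main(void)
    {
        static const int units[3] = { 1, 1, 1 };
        machine_desc m = { 3, units };   /* e.g. the configuration of Section 2.1 */
        printf("unit types=%d, earliest profitable start: cycle %d\n",
               m.m, earliest_start(0, 1, 1));   /* a delayed load: 0 + 1 + 1 = 2 */
        return 0;
    }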

2.1 The RS/6K model

Here we show how our generic model of a superscalar machine is configured to fit the RS/6K machine. The RS/6K processor is modelled as follows:

● m = 3, there are three types of functional units: fixed point, floating point and branch types.
● n1 = 1, n2 = 1, n3 = 1, there is a single fixed point unit, a single floating point unit and a single branch unit.
● Most of the instructions are executed in one cycle; however, there are also multi-cycle instructions, like multiplication, division, etc.
● There are four main types of delays (tabulated in the sketch after this section):
  – a delay of one cycle between a load instruction and the instruction that uses its result register (delayed load);
  – a delay of three cycles between a fixed point compare instruction and the branch instruction that uses the result of that compare²;
  – a delay of one cycle between a floating point instruction and the instruction that uses its result;
  – a delay of five cycles between a floating point compare instruction and the branch instruction that uses the result of that compare.
  There are a few additional delays in the machine whose effect is secondary.

In this paper we concentrate on fixed point computations only. Therefore, only the first and the second types of the above mentioned delays will be considered.
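The same parameters can be written down as plain tables; the following is only a sketch of how the Section 2.1 numbers might be encoded (the enum and array names are hypothetical):

    enum unit_type { FIXED_POINT, FLOATING_POINT, BRANCH, NUM_UNIT_TYPES };

    static const int units_of_type[NUM_UNIT_TYPES] = { 1, 1, 1 };  /* n1 = n2 = n3 = 1 */

    /* The four main delay types of the RS/6K model, in cycles. */
    enum delay_type { LOAD_TO_USE, FXCMP_TO_BRANCH, FP_TO_USE, FPCMP_TO_BRANCH };

    static const int delay_cycles[] = {
        [LOAD_TO_USE]     = 1,   /* delayed load                     */
        [FXCMP_TO_BRANCH] = 3,   /* fixed point compare -> branch    */
        [FP_TO_USE]       = 1,   /* floating point result -> use     */
        [FPCMP_TO_BRANCH] = 5,   /* floating point compare -> branch */
    };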

3. A program example

Next, we present a small program (written in C) that computes the minimum and the maximum of an array. This program is shown in Figure 1 and will serve us as a running example.

In this program, concentrating on the loop which is marked in Figure 1, we notice that two elements of the array a are fetched every iteration of the loop. Next, these elements of a are compared one to another (if (u > v)), and subsequently they are compared to the max and min variables, updating the maximum and the minimum, if needed. The RS/6K pseudo-code for the loop, which corresponds to the real code created by the IBM XL-C compiler³, is presented in Figure 2.

For convenience, we number the instructions in the code of Figure 2 (I1-I20) and annotate them with the corresponding statements of the program of Figure 1. Also, we mark the ten basic blocks (BL1-BL10) of which the code of Figure 2 comprises for the purposes of future discussion. For simplicity of notation, the registers mentioned in the code are real. However, as was mentioned in Section 2, we prefer to invoke the global scheduling algorithm before the register allocation is done (at this stage there is an unbounded number of registers in the code), even though conceptually there is no problem to activate the instruction scheduling after the register allocation is completed.

2 More precisely, usually the three cycle delay between a fixed point compare and the respective branch instruction is encountered only when the branch is taken. However, here for simplicity we assume that such delay exists whether the branch is taken or not.

3 The only feature of the machine that was disabled in this example is that of keeping the iteration variable of the loop in a special counter register. Keeping the iteration variable in this register allows it to be decremented and tested for zero in a single instruction, effectively reducing the overhead of loop control instructions.

/* find the largest and the smallest number
   in a given array */
minmax(a,n)
{
  int i,u,v,min,max,n,a[SIZE];
  min=a[0]; max=min;
  i=1;
  /****************** LOOP STARTS *************/
  while (i<n-1) {
    u=a[i]; v=a[i+1];
    if (u>v) {
      if (u>max) max=u;
      if (v<min) min=v;
    }
    else {
      if (v>max) max=v;
      if (u<min) min=u;
    }
    i=i+2;
  }
  /****************** LOOP ENDS ***************/
  printf("min=%d max=%d\n",min,max);
}

Figure 1. A program computing the minimum and the maximum of an array

      max is kept in r30
      min is kept in r28
      i   is kept in r29
      n   is kept in r27
      address of a[i] is kept in r31

      ... more instructions here ...
CL.0:
(I1)  L   r12=a(r31,4)        load u
(I2)  LU  r0,r31=a(r31,8)     load v and update address
(I3)  C   cr7=r12,r0          u > v
(I4)  BF  CL.4,cr7,0x2/gt
------END BL1
(I5)  C   cr6=r12,r30         u > max
(I6)  BF  CL.6,cr6,0x2/gt
------END BL2
(I7)  LR  r30=r12             max = u
------END BL3
CL.6:
(I8)  C   cr7=r0,r28          v < min
(I9)  BF  CL.9,cr7,0x1/lt
------END BL4
(I10) LR  r28=r0              min = v
(I11) B   CL.9
------END BL5
CL.4:
(I12) C   cr6=r0,r30          v > max
(I13) BF  CL.11,cr6,0x2/gt
------END BL6
(I14) LR  r30=r0              max = v
------END BL7
CL.11:
(I15) C   cr7=r12,r28         u < min
(I16) BF  CL.9,cr7,0x1/lt
------END BL8
(I17) LR  r28=r12             min = u
------END BL9
CL.9:
(I18) AI  r29=r29,2           i = i+2
(I19) C   cr4=r29,r27         i < n-1
(I20) BT  CL.0,cr4,0x1/lt
------END BL10
      ... more instructions here ...

Figure 2. The RS/6K pseudo-code for the loop of the program of Figure 1

Every instruction in the code of Figure 2, except for branches, requires one cycle in the fixed point unit, while the branches take one cycle in the branch unit. There is a one cycle delay between instructions I2 and I3, due to the delayed load feature of the RS/6K. Notice the special form of the load with update instruction in I2: in addition to assigning to r0 the value of the memory location at address (r31) + 8, it also increments r31 by 8 (post-increment). Also, there is a three cycle delay between the fixed point compare I3 and the branch I4 that uses its result.

4. The Program Dependence Graph

The program dependence graph (PDG) is a convenient way to summarize both the control dependences and data dependences among the code instructions. While the concept of data dependence, which carries the basic idea of one instruction computing a data value and another instruction using this value, was employed in compilers a long time ago, the notion of control dependence was introduced quite recently [FOW87]. In what follows we discuss the notions of control and data dependence separately.

4.1. Control dependence

We describe the idea of control dependence using the program example of Figure 1. In Figure 3 the control flow graph of the loop of Figure 2 is described, where each node corresponds to a single basic block in the loop. The numbers inside the circles denote the indices of the ten basic blocks BL1-BL10. We augment the graph of Figure 3 with unique ENTRY and EXIT nodes for convenience. Throughout this discussion we assume a single entry node in the control flow graph, i.e., there is a single node (in our case BL1) which is connected to ENTRY. However, several exit nodes that have edges leading to EXIT may exist. In our case BL10 is a (single) exit node. For strongly connected regions (which represent loops in this context), the assumption of a control flow graph having a single entry corresponds to the assumption that the control flow graph is reducible.

Figure 3. The control flow graph of the loop of Figure 2

The meaning of an edge from a node A to a node B in a control flow graph is that the control of the program may flow from the basic block A to the basic block B. (Usually, edges are annotated with the conditions that control the flow of the program from one basic block to another.) From the graph of Figure 3, however, it is not apparent which basic block will be executed under which condition.

The control subgraph of the PDG (CSPDG) of the loop of Figure 2 is shown in Figure 4. As in Figure 3, each node of the graph corresponds to a basic block of the program. Here, a solid edge from a node A to a node B has the following meaning:

1. there is a condition COND at the end of A that is evaluated to either TRUE or FALSE, and
2. if COND is evaluated to TRUE, B will definitely be executed, otherwise B will not be executed.

The control dependence edges are annotated with the corresponding conditions as for the control flow graph. In Figure 4, solid edges designate control dependence edges, while dashed edges will be discussed below. For example, in Figure 4 the edges emanating from BL1 indicate that BL2 and BL4 will be executed if the condition at the end of BL1 evaluates to TRUE, while BL6 and BL8 will be executed when the same condition is FALSE.

Figure 4. The forward control subgraph of the PDG of the loop of Figure 2. (BL1 has control dependence edges labelled TRUE to BL2 and BL4 and labelled FALSE to BL6 and BL8; BL2, BL4, BL6 and BL8 have TRUE edges to BL3, BL5, BL7 and BL9, respectively; dashed edges connect the equivalent pairs, such as BL1 and BL10.)

As was mentioned in the introduction, currently we schedule instructions within a single iteration of a loop. So, for the purposes of this type of instruction scheduling, we follow [CHH89] and build the forward control dependence graph only, i.e., we do not compute the control dependences that result from or propagate through the back edges in the control flow graph. The CSPDG of Figure 4 is a forward control dependence graph. In the following we discuss forward control dependence graphs only. Notice that forward control dependence graphs are acyclic.

The usefulness of the control subgraph of the PDG stems from the fact that basic blocks that have the same set of control dependences (like BL1 and BL10, or BL2 and BL4, or BL6 and BL8 in Figure 4) can be executed in parallel up to the existing data dependences. For our purposes, the instructions of such basic blocks can be scheduled together.

Now let us introduce several definitions that are required to understand our scheduling framework. Let A and B be two nodes of a control flow graph such that B is reachable from A, i.e., there is a path in the control flow graph from A to B.

Definition 1. A dominates B if and only if A appears on every path from ENTRY to B.

Definition 2. B postdominates A if and only if B appears on every path from A to EXIT.

Definition 3. A and B are equivalent if and only if A dominates B and B postdominates A.

Definition 4. We say that moving an instruction from B to A is useful if and only if A and B are equivalent.

Definition 5. We say that moving an instruction from B to A is speculative if B does not postdominate A.

Definition 6. We say that moving an instruction from B to A requires duplication if A does not dominate B.

It turns out that CSPDGs are helpful while doing useful scheduling. To find equivalent nodes, we search a CSPDG for nodes that are identically control dependent, i.e., they depend on the same set of nodes under the same conditions. For example, in Figure 4, BL1 and BL10 are equivalent, since they do not depend on any node. Also, BL2 and BL4 are equivalent, since both of them depend on BL1 under the TRUE condition. In Figure 4 we mark the equivalent nodes with dashed edges; the direction of these edges provides the dominance relation between the nodes. For example, for the equivalent nodes BL1 and BL10, we conclude that BL1 dominates BL10.
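As a sketch of how Definitions 1-3 might be applied (the data layout and names below are ours, not the paper's implementation), equivalent nodes can be detected either from the dominator relations directly or by comparing control dependence sets in the CSPDG:

    #include <stdbool.h>
    #include <string.h>

    #define MAX_BLOCKS 64

    typedef struct {
        int  nblocks;
        bool dominates[MAX_BLOCKS][MAX_BLOCKS];      /* Definition 1          */
        bool postdominates[MAX_BLOCKS][MAX_BLOCKS];  /* Definition 2          */
        char ctrl_dep[MAX_BLOCKS][MAX_BLOCKS];       /* 'T'/'F'/0: CSPDG edge */
    } cspdg;

    /* Definition 3: A and B are equivalent iff A dominates B and B postdominates A. */
    static bool equivalent(const cspdg *g, int a, int b)
    {
        return g->dominates[a][b] && g->postdominates[b][a];
    }

    /* In the CSPDG, equivalent nodes are exactly those that are identically
       control dependent (same parents under the same conditions). */
    static bool identically_control_dependent(const cspdg *g, int a, int b)
    {
        return memcmp(g->ctrl_dep[a], g->ctrl_dep[b], sizeof g->ctrl_dep[a]) == 0;
    }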

The CSPDG is useful also for speculative scheduling. It provides "the degree of speculativeness" of moving instructions from one block to another. When scheduling a speculative instruction, we always "gamble" on the outcome of one or more branches; only when we guess the direction of these branches correctly does the moved instruction become profitable. The CSPDG provides, for every pair of nodes, the number of branches we gamble on (in case of speculative scheduling). For example, when moving instructions from BL8 to BL1, we gamble on the outcome of a single branch, since when moving from BL8 to BL1 in Figure 4 we cross a single edge. (This is not obvious from the control flow graph of Figure 3.) Similarly, moving from BL5 to BL1 gambles on the outcome of two branches, since we cross two edges of Figure 4.

Definition 7. We say that moving instructions from B to A is n-branch speculative if there exists a path in the CSPDG from A to B of length n.

Notice that useful scheduling is 0-branch speculative.
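The following sketch (function and type names are illustrative, not from the paper) classifies a candidate motion from block B up into block A according to Definitions 4-7:

    typedef enum { MOVE_USEFUL, MOVE_SPECULATIVE, MOVE_NEEDS_DUP } move_kind;

    /* dominates(A,B), postdominates(B,A) and cspdg_path_len(A,B) are assumed
       to be provided by the dominator and CSPDG computations. */
    move_kind classify_motion(int A, int B,
                              int (*dominates)(int, int),
                              int (*postdominates)(int, int),
                              int (*cspdg_path_len)(int, int),
                              int *n_branches)
    {
        if (!dominates(A, B))
            return MOVE_NEEDS_DUP;            /* Definition 6: duplication needed       */
        if (postdominates(B, A)) {
            *n_branches = 0;                  /* useful moves are 0-branch speculative  */
            return MOVE_USEFUL;               /* Definition 4                           */
        }
        *n_branches = cspdg_path_len(A, B);   /* Definition 7: n-branch speculative     */
        return MOVE_SPECULATIVE;              /* Definition 5                           */
    }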

4.2. Data dependence

While control dependences are computed at the basic block level, data dependences are computed on an instruction by instruction basis. We compute both intrablock and interblock data dependences. A data dependence may be caused by the usage of registers or by accessing memory locations.

Let a and b be two instructions in the code. A data dependence edge from a to b is inserted into the PDG in one of the following cases:

● A register defined in a is used in b (flow dependence);
● A register used in a is defined in b (anti-dependence);
● A register defined in a is defined in b (output dependence);
● Both a and b are instructions that touch memory (loads, stores, calls to subroutines) and it is not proven that they address different locations (memory disambiguation).

Only the data dependence edges leading from a definition of a register to its use carry a (potentially non-zero) delay, which is a characteristic of the underlying machine, as was mentioned in Section 2. The rest of the data dependence edges carry zero delays. To minimize the number of anti and output data dependences, which may unnecessarily constrain the scheduling process, the XL compiler does certain renaming of registers, which is similar in effect to the static single assignment form [CFRWZ].

To compute all the data dependences in a basic block, essentially every pair of instructions there has to be considered. However, to reduce the compilation time, we take advantage of the following observation. Let a, b and c be three instructions in the code. Then, if we discover that there is a data dependence edge from a to b and from b to c, there is no need to compute the edge from a to c. To use this observation, the basic block instructions are traversed in an order such that when we come to determine the dependency between a and c, we have already considered the pairs (a,b) and (b,c), for every possible b in the basic block. (Actually, we compute the transitive closure of the data dependence relation in a basic block.)

Next, for each pair A and B of basic blocks such that B is reachable from A in the control flow graph, the interblock data dependences are computed. The observation in the previous paragraph helps to reduce the number of pairs of instructions that are considered during the computation of the interblock data dependences as well.
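A sketch of the intrablock computation with this pruning might look as follows; conflict() stands for the flow/anti/output/memory tests above, and all names are illustrative rather than the XL compiler's:

    #include <stdbool.h>

    #define MAX_INSTR 256

    static bool dep[MAX_INSTR][MAX_INSTR];   /* dep[a][b]: recorded edge a -> b */

    /* Is b already reachable from a through recorded edges? */
    static bool reaches(int a, int b)
    {
        if (dep[a][b]) return true;
        for (int k = a + 1; k < b; k++)
            if (dep[a][k] && reaches(k, b)) return true;
        return false;
    }

    /* Visit pairs by increasing distance so that (a,k) and (k,b) are known
       before (a,b); a transitively implied edge is then never computed. */
    void build_intrablock_deps(int n, bool (*conflict)(int, int))
    {
        for (int span = 1; span < n; span++)
            for (int a = 0; a + span < n; a++) {
                int b = a + span;
                if (!reaches(a, b) && conflict(a, b))
                    dep[a][b] = true;
            }
    }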

Let us demonstrate the computation of data dependences for BL1; we will reference the instructions by their numbers from Figure 2. There is an anti-dependence from (I1) to (I2), since (I1) uses r31 and (I2) defines a new value for r31. There is a flow data dependence from both (I1) and (I2) to (I3), since (I3) uses r12 and r0, defined in (I1) and (I2), respectively. The edge ((I2),(I3)) carries a one cycle delay, since (I2) is a load instruction (delayed load), while ((I1),(I3)) is not computed since it is transitive. There is a flow data dependence edge from (I3) to (I4), since (I3) sets cr7 which is used in (I4). This edge has a three cycle delay, since (I3) is a compare instruction and (I4) is the corresponding branch instruction. Finally, both ((I1),(I4)) and ((I2),(I4)) are transitive edges.

It is important to notice that, since both the control and data dependences we compute are acyclic, the resultant PDG is acyclic as well. This facilitates convenient scheduling of instructions, which is discussed next.

5. The scheduling framework

The global scheduling framework consists of the top-level process, which tries to schedule instructions cycle by cycle, and of a set of heuristics which decide what instruction will be scheduled next, in case there is a choice. While the top-level process is suitable for the range of machines discussed here, it is suggested that the set of heuristics and their relative ordering should be tuned for the specific machine at hand. We present the top-level process in Section 5.1, while the heuristics are discussed in Section 5.2.

5.1. The top-level process

We schedule instructions in the program on a region by region basis. In our terminology a region represents either a strongly connected component that corresponds to a loop (which has at least one back edge) or the body of a subroutine without the enclosed loops (which has no back edges at all). Since currently we do not overlap the execution of different iterations of a loop, there is no difference in the process of scheduling the body of a loop and the body of a subroutine.

Innermost regions are scheduled first. There are a few principles that govern our scheduling process:

● Instructions are never moved out of or into a region.
● All the instructions are moved in the upward direction, i.e., they are moved against the direction of the control flow edges.
● The original order of branches in the program is preserved.

Also, there are several limitations that characterize the current status of our implementation of global scheduling. These include:

● No duplication of code is allowed (see Definition 6 in Section 4.1).
● Only 1-branch speculative instructions are supported (see Definition 7 in Section 4.1).
● No new basic blocks are created in the control flow graph during the scheduling process.

These limitations will be removed in future work.

We schedule instructions in a region by processing its basic blocks one at a time. The basic blocks are visited in topological order, i.e., if there is a path in the control flow graph from A to B, A is processed before B.

Let A be the basic block to be scheduled next, and let EQUIV(A) be the set of blocks that are equivalent to A and are dominated by A (see Definition 3). We maintain a set C(A) of candidate blocks for A, i.e., a set of basic blocks which can contribute instructions to A. Currently there are two levels of scheduling (a sketch of building C(A) follows this list):

1. Useful instructions only: C(A) = EQUIV(A);
2. 1-branch speculative: C(A) includes the following blocks:
   a. the blocks of EQUIV(A);
   b. all the immediate successors of A in the CSPDG;
   c. all the immediate successors of blocks in EQUIV(A) in the CSPDG.
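For illustration, the candidate-block set C(A) of the two levels could be assembled as in the sketch below; equiv_and_dominated(), cspdg_edge() and num_blocks are hypothetical helpers, not the actual implementation:

    #include <stdbool.h>

    #define MAX_BLOCKS 64

    extern int  num_blocks;
    extern bool equiv_and_dominated(int A, int B);  /* B is in EQUIV(A)             */
    extern bool cspdg_edge(int A, int B);           /* B is an immediate CSPDG succ */

    enum level { USEFUL_ONLY, ONE_BRANCH_SPECULATIVE };

    void candidate_blocks(int A, enum level lvl, bool C[MAX_BLOCKS])
    {
        for (int B = 0; B < num_blocks; B++) {
            C[B] = equiv_and_dominated(A, B);                 /* level 1: EQUIV(A)     */
            if (lvl == ONE_BRANCH_SPECULATIVE && !C[B]) {
                if (cspdg_edge(A, B))                         /* successor of A        */
                    C[B] = true;
                for (int E = 0; !C[B] && E < num_blocks; E++) /* successor of EQUIV(A) */
                    if (equiv_and_dominated(A, E) && cspdg_edge(E, B))
                        C[B] = true;
            }
        }
    }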

Once we initialize the set of candidate blocks, we compute the set of candidate instructions for A. An instruction I is a candidate for scheduling in block A if it belongs to one of the following categories:

● I belonged to A in the first place.
● I belongs to one of the blocks in C(A) and:
  1. I belongs to one of the blocks in EQUIV(A) and it may be moved beyond its basic block boundaries. (There are instructions that are never moved beyond basic block boundaries, like calls to subroutines.)
  2. I does not belong to one of the blocks in EQUIV(A) and it is allowed to schedule it speculatively. (There are instructions that are never scheduled speculatively, like stores to memory.)

During the scheduling process we maintain a list of ready instructions, i.e., candidate instructions whose data dependences are fulfilled. Every cycle we pick from the ready list as many instructions to be scheduled next as required by the machine architecture, by consulting the parametric machine description. If there are too many ready instructions, we choose the "best" ones based on priority criteria. Once an instruction is picked to be scheduled, it is moved to the proper place in the code, and its data dependences to the following instructions are marked as fulfilled, potentially enabling new instructions to become ready. Once all the instructions of A are scheduled, we move to the next basic block. The net result is that the instructions in A are reordered, and there might be instructions external to A that are physically moved into A.

It turns out that the global scheduler does not always create the best schedule for each individual basic block. This is mainly due to the following two reasons:

● The parametric machine description of Section 2 does not cover all the secondary features of the machine;
● The global decisions are not necessarily optimal in a local context.

To solve this problem, the basic block scheduler is applied to every single basic block of the program after the global scheduling is completed. The basic block scheduler has a more detailed model of the machine, which allows more precise decisions for reordering the instructions within the basic blocks.

5.2. Scheduling heuristics

The heart of the scheduling scheme is a set of heuristics that provide the relative priority of an instruction to be scheduled next. There are two integer-valued functions that are computed locally (within a basic block) for every instruction in the code; these functions are used to set the priority of instructions in the program.

Let I be an instruction in a block B. The first function D(I), called the delay heuristic, provides a measure of how many delay slots may occur on a path from I to the end of B. Initially, D(I) is set to 0 for every I in B. Assume that J1, J2, ... are the immediate data dependence successors of I in B, and let the delays on those edges be d(I,J1), d(I,J2), .... Then, by visiting I after visiting its data dependence successors, D(I) is computed as follows:

    D(I) = max( D(J1) + d(I,J1), D(J2) + d(I,J2), ... )

The second function CP(I), called the critical path heuristic, provides a measure of how long it will take to complete the execution of the instructions that depend on I in B, including I itself, assuming an unbounded number of computational units. Let E(I) be the execution time of I. First, CP(I) is initialized to E(I) for every I in B. Then, again by visiting I after visiting its data dependence successors, CP(I) is computed as follows:

    CP(I) = max( CP(J1) + d(I,J1), CP(J2) + d(I,J2), ... ) + E(I)
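The two heuristics can be computed in one backward pass over the block, as in this sketch (the array names and the successor representation are ours, chosen only for illustration):

    #define MAX_I 256

    /* Per-instruction inputs (hypothetical layout): data dependence successors
       within the block, edge delays, and execution times. */
    extern int nsucc[MAX_I], succ[MAX_I][MAX_I], edge_delay[MAX_I][MAX_I], exec_time[MAX_I];

    int D[MAX_I], CP[MAX_I];

    void compute_heuristics(int n)   /* instructions 0 .. n-1 in block order */
    {
        /* Successors always come later in the block, so a reverse scan visits
           every instruction after its data dependence successors. */
        for (int i = n - 1; i >= 0; i--) {
            D[i]  = 0;
            CP[i] = exec_time[i];
            for (int k = 0; k < nsucc[i]; k++) {
                int j = succ[i][k], d = edge_delay[i][k];
                if (D[j] + d > D[i])
                    D[i] = D[j] + d;                       /* D(I)  rule */
                if (CP[j] + d + exec_time[i] > CP[i])
                    CP[i] = CP[j] + d + exec_time[i];      /* CP(I) rule */
            }
        }
    }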

During the decision process, we schedule useful instructions before speculative ones. For the same class of instructions (useful or speculative) we pick an instruction which has the biggest delay heuristic (D). For instructions of the same class and delay we pick one that has the biggest critical path heuristic (CP). Finally, we try to preserve the original order of instructions.

To put it formally, let A be the block that is currently scheduled, and let I and J be two instructions that (should be executed by a functional unit of the same type and) are ready at the same time in the scheduling process, and one of them has to be scheduled next. Also, let U(A) = A ∪ EQUIV(A), and let B(I) and B(J) be the basic blocks to which I and J belong. Then, the decision is made in the following order:

1. If B(I) ∈ U(A) and B(J) ∉ U(A), then pick I;
2. If B(J) ∈ U(A) and B(I) ∉ U(A), then pick J;
3. If D(I) > D(J), then pick I;
4. If D(J) > D(I), then pick J;
5. If CP(I) > CP(J), then pick I;
6. If CP(J) > CP(I), then pick J;
7. Pick the instruction that occurred in the code first.

Notice that the current ordering of the heuristic functions is tuned towards a machine with a small number of resources. This is the reason for always preferring to schedule a useful instruction before a speculative one, even though a speculative instruction may cause a longer delay. In any case, experimentation and tuning are needed for better results.
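Rules 1-7 amount to the following comparator; this is only a sketch, and in_UA(), D_of(), CP_of() and code_position() are assumed helpers rather than the paper's own interfaces:

    extern int in_UA(int i);            /* 1 iff B(i) is in U(A) = A union EQUIV(A) */
    extern int D_of(int i), CP_of(int i), code_position(int i);

    /* Return whichever of the two ready instructions should be scheduled next. */
    int pick_next(int I, int J)
    {
        if (in_UA(I) && !in_UA(J)) return I;   /* 1: useful before speculative */
        if (in_UA(J) && !in_UA(I)) return J;   /* 2                            */
        if (D_of(I)  > D_of(J))    return I;   /* 3: larger delay heuristic    */
        if (D_of(J)  > D_of(I))    return J;   /* 4                            */
        if (CP_of(I) > CP_of(J))   return I;   /* 5: longer critical path      */
        if (CP_of(J) > CP_of(I))   return J;   /* 6                            */
        return code_position(I) < code_position(J) ? I : J;  /* 7: original order */
    }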

5.3. Speculative scheduling

In the global scheduling framework, while doing non-speculative scheduling, to preserve the correctness of the program it is sufficient to respect the data dependences as they were defined in Section 4.2. It turns out that for speculative scheduling this is not true, and a new type of information has to be maintained. Examine the following excerpt of a C program:

    . . .
    if (cond) x=5;
    else x=3;
    printf("x=%d", x);
    . . .

The control flow graph of this piece of code looks as follows:

          B1: if (cond)
           /          \
      B2: x=5       B3: x=3
           \          /
         B4: printf("x=%d", x)

Instruction x=5 belongs to B2, while x=3 belongs to B3. Each of them can be (speculatively) moved into B1, but it is apparent that both of them are not allowed to move there, since a wrong value may be printed in B4. Data dependences do not prevent the movement of these instructions into B1.

To solve this problem, we maintain information about the (symbolic) registers that are live on exit from a basic block. If an instruction that is being considered to be moved speculatively to a block B computes a new value for a register that is live on exit from B, such speculative movement is disallowed. Notice that this type of information has to be updated dynamically, i.e., after each speculative motion this information has to be updated. Thus, let us say, x=5 is first moved to B1. Then, x (or actually a symbolic register that corresponds to x) becomes live on exit from B1, and the movement of x=3 to B1 will be prevented. A more detailed description of the speculative scheduling and its relationship to the PDG-based global scheduling is outside the scope of this paper.
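The live-on-exit test described above can be sketched as follows; the data structures are illustrative, and in the implementation the registers are symbolic and the liveness information is updated after every speculative motion:

    #include <stdbool.h>

    #define MAX_BLOCKS 64
    #define MAX_REGS   1024

    static bool live_on_exit[MAX_BLOCKS][MAX_REGS];

    extern int defined_reg(int instr);   /* register defined by instr, or -1 if none */

    /* Disallow a speculative move into B of an instruction that defines a
       register live on exit from B. */
    bool speculative_move_ok(int instr, int B)
    {
        int r = defined_reg(instr);
        return r < 0 || !live_on_exit[B][r];
    }

    /* After a speculative move (e.g. x=5 into B1), its target register becomes
       live on exit from B, which then blocks x=3 from moving into the same block. */
    void note_speculative_move(int instr, int B)
    {
        int r = defined_reg(instr);
        if (r >= 0)
            live_on_exit[B][r] = true;
    }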

      ... more instructions here ...
********** LOOP STARTS ************
CL.0:
(I1)  L   r12=a(r31,4)
(I2)  LU  r0,r31=a(r31,8)
(I18) AI  r29=r29,2
(I3)  C   cr7=r12,r0
(I19) C   cr4=r29,r27
(I4)  BF  CL.4,cr7,0x2/gt
(I5)  C   cr6=r12,r30
(I8)  C   cr7=r0,r28
(I6)  BF  CL.6,cr6,0x2/gt
(I7)  LR  r30=r12
CL.6:
(I9)  BF  CL.9,cr7,0x1/lt
(I10) LR  r28=r0
(I11) B   CL.9
CL.4:
(I12) C   cr6=r0,r30
(I15) C   cr7=r12,r28
(I13) BF  CL.11,cr6,0x2/gt
(I14) LR  r30=r0
CL.11:
(I16) BF  CL.9,cr7,0x1/lt
(I17) LR  r28=r12
CL.9:
(I20) BT  CL.0,cr4,0x1/lt
********** LOOP ENDS **************
      ... more instructions here ...

Figure 5. The results of applying the useful scheduling to the program of Figure 2

      ... more instructions here ...
*********** LOOP STARTS *************
CL.0:
(I1)  L   r12=a(r31,4)
(I2)  LU  r0,r31=a(r31,8)
(I18) AI  r29=r29,2
(I3)  C   cr7=r12,r0
(I19) C   cr4=r29,r27
(I5)  C   cr6=r12,r30
(I12) C   cr5=r0,r30
(I4)  BF  CL.4,cr7,0x2/gt
(I8)  C   cr7=r0,r28
(I6)  BF  CL.6,cr6,0x2/gt
(I7)  LR  r30=r12
CL.6:
(I9)  BF  CL.9,cr7,0x1/lt
(I10) LR  r28=r0
(I11) B   CL.9
CL.4:
(I15) C   cr7=r12,r28
(I13) BF  CL.11,cr5,0x2/gt
(I14) LR  r30=r0
CL.11:
(I16) BF  CL.9,cr7,0x1/lt
(I17) LR  r28=r12
CL.9:
(I20) BT  CL.0,cr4,0x1/lt
*********** LOOP ENDS ***************
      ... more instructions here ...

Figure 6. The results of applying the useful and speculative scheduling to the program of Figure 2

5.4. Scheduling examples

Let us demonstrate the effect of useful and speculative scheduling on the example of Figure 2. The result of scheduling useful instructions only is presented in Figure 5. During the scheduling of BL1, the only instructions that were considered to be moved there were those of BL10, since only BL10 ∈ EQUIV(BL1). The result is that two instructions of BL10 (I18 and I19) were moved into BL1, filling in the delay slots of the instructions there. Similarly, I8 was moved from BL4 to BL2, and I15 was moved from BL8 to BL6. The resultant program of Figure 5 takes 12-13 cycles per iteration, while the original program of Figure 2 was executing in 20-22 cycles per iteration.

Figure 6 shows the result of applying both the useful and the (1-branch) speculative scheduling to the same program. In addition to the motions that were described above, two additional instructions (I5 and I12) were moved speculatively to BL1, to fill in the three cycle delay between I3 and I4. Interestingly enough, since I5 and I12 belong to basic blocks that are never executed together in any

single execution of the program, only one of these two instructions will carry a useful result. All in all, the program of Figure 6 takes 11-12 cycles per iteration, a one cycle improvement over the program of Figure 5.

6. Performance results

A preliminary evaluation of the global scheduling scheme was done on the IBM RS/6K machine whose abstract model is presented in Section 2.1. For experimentation purposes, the global scheduling has been embedded into the IBM XL family of compilers. These compilers support several high-level languages, like C, Fortran, Pascal, etc.; however, we concentrate only on the C programs.

The evaluation was done on the four C programs in the SPEC benchmark suite [S89]. In the following discussion LI denotes the Lisp Interpreter benchmark, GCC stands for the GNU C Compiler, while EQNTOTT and ESPRESSO are two programs that are related to minimization and manipulation of Boolean functions and equations.

The basis for all the following comparisons (denoted by BASE in the sequel) is the performance results of the same IBM XL C compiler in which the global scheduling was disabled. Please notice that the base compiler includes two types of instruction scheduling of its own (aside from all the possible machine independent and peephole optimizations), as follows:

● a sophisticated basic block scheduler similar to that of [W90], and
● a set of code replication techniques that solve certain loop-closing delay problems [GR90].

So, in some sense, certain improvements due to the global scheduling overlap those of the scheduling techniques that were already part of the base compiler.

Next we describe how the global scheduling scheme was configured so as to exploit the trade-off between the compile-time overhead and the run-time improvement to a maximum extent. The following design decisions characterize the current status of the global scheduling prototype:

● Only two inner levels of regions are scheduled. So, we distinguish between inner regions (i.e., regions that do not include other regions) and outer regions (i.e., regions that include only inner regions).
● Only "small" reducible regions are scheduled. "Small" regions are those that have at most 64 basic blocks and 256 instructions.
● In a preparation step, before the global scheduling is applied, the inner regions that represent loops with up to 4 basic blocks are unrolled once (i.e., after unrolling they include two iterations of a loop instead of one).
● After the global scheduling is applied to the inner regions, such regions that represent loops with up to 4 basic blocks are rotated, by copying their first basic block after the end of the loop. By applying the global scheduling a second time to the rotated inner loops, we achieve a partial effect of software pipelining, i.e., some of the instructions of the next iteration of the loop are executed within the body of the previous iteration.
● The general flow of the global scheduling is as follows (a driver for this flow is sketched after this list):
  1. certain inner loops are unrolled;
  2. the global scheduling is applied the first time to the inner regions only;
  3. certain inner loops are rotated;
  4. the global scheduling is applied the second time to the rotated inner loops and the outer regions.
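A driver for this flow might look like the sketch below; the pass names are hypothetical and simply stand for the steps enumerated above:

    extern void unroll_small_inner_loops(void);            /* step 1 */
    extern void schedule_inner_regions(void);              /* step 2 */
    extern void rotate_small_inner_loops(void);            /* step 3 */
    extern void schedule_rotated_and_outer_regions(void);  /* step 4 */

    void global_scheduling_driver(void)
    {
        unroll_small_inner_loops();             /* loops with up to 4 blocks, once     */
        schedule_inner_regions();               /* first global scheduling pass        */
        rotate_small_inner_loops();             /* copy first block after the loop     */
        schedule_rotated_and_outer_regions();   /* second pass, partial sw pipelining  */
    }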

The compile-time overhead of the above described scheme is shown in Figure 7. The column marked BASE gives the compilation times of the programs in seconds, as measured on the IBM RS/6K machine, model 530, whose cycle time is 40ns. The column marked CTO (Compile-Time Overhead) provides the increase in the compilation times in percents. This increase in the compilation time includes the time required to perform all of the above mentioned steps (including loop unrolling, loop rotation, etc.).

    PROGRAM     BASE    CTO
    LI           206    13%
    EQNTOTT       78    17%
    ESPRESSO     465    12%
    GCC         2457    13%

Figure 7. Compile-time overheads for the global scheduling

There are two levels of scheduling that we distinguish at the moment, namely useful only and useful and speculative scheduling. The run-time improvement (RTI) for both types of scheduling is presented in Figure 8 in percents, relative to the running time of the code compiled with the base compiler, which is shown in seconds. The accuracy of the measurements is about 0.5% - 1%.

    PROGRAM     BASE    RTI USEFUL    RTI SPECULATIVE
    LI           312        2.0%           6.9%
    EQNTOTT       45        7.1%           7.3%
    ESPRESSO     106       -0.5%           0%
    GCC           76       -1.5%           0%

Figure 8. Run-time improvements for the global scheduling

We notice in Figure 8 that for EQNTOTT most of the improvement comes from the useful scheduling only, while for LI the speculative scheduling is dominant. On the other hand, for both ESPRESSO and GCC, no improvement is observed. (Actually, there is a slight degradation in performance for both benchmarks when the global scheduling is restricted to useful scheduling only.)

To summarize our short experience with the global scheduling, we notice that the achieved improvement in run-time is modest, due to the fact that the base compiler has already been optimized towards the existing architecture. We may expect even bigger payoffs on machines with a larger number of computational units. As for the compile-time overhead, we consider it reasonable, especially since no major steps were taken to reduce it except for the control over the size of the regions that are being scheduled.

7. Summary

The proposed scheme allows the global scheduling of instructions by an optimizing compiler for better utilization of machine resources for a range of superscalar processors. It is based on a data structure proposed for parallel/parallelizing compilers (the PDG), a parametric machine description and a flexible scheduling framework that employs a set of useful heuristics. The results of evaluating the scheduling scheme on the IBM RS/6K machine are quite encouraging. We are going to extend our work by supporting more aggressive speculative scheduling, and scheduling with duplication of code.

Acknowledgements. We would like to thank Kemal Ebcioglu, Hugo Krawczyk and Ron Y. Pinter for many helpful discussions, and Irit Boldo and Vladimir Rainish for their help in the implementation.

References

[BG89] Bernstein, D., and Gertner, I., "Scheduling expressions on a pipelined processor with a maximal delay of one cycle", ACM Transactions on Prog. Lang. and Systems, Vol. 11, Num. 1 (Jan. 1989), 57-66.

[BRG89] Bernstein, D., Rodeh, M., and Gertner, I., "Approximation algorithms for scheduling arithmetic expressions on pipelined machines", Journal of Algorithms, 10 (Mar. 1989), 120-139.

[BJR89] Bernstein, D., Jaffe, J.M., and Rodeh, M., "Scheduling arithmetic and load operations in parallel with no spilling", SIAM Journal of Computing, (Dec. 1989), 1098-1127.

[BEH89] Bradlee, D.G., Eggers, S.J., and Henry, R.R., "Integrating register allocation and instruction scheduling for RISCs", to appear in Proc. of the Fourth ASPLOS Conference, (April 1991).

[CHH89] Cytron, R., Hind, M., and Hsieh, W., "Automatic generation of DAG parallelism", Proc. of the SIGPLAN Annual Symposium, (June 1989), 54-68.

[CFRWZ] Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N., and Zadeck, F.K., "An efficient method for computing static single assignment form", Proc. of the Annual ACM Symposium on Principles of Programming Languages, (Jan. 1989), 25-35.

[E88] Ebcioglu, K., "Some design ideas for a VLIW architecture for sequential-natured software", Proc. of the IFIP Conference on Parallel Processing, (April 1988), Italy.

[EN89] Ebcioglu, K., and Nakatani, T., "A new compilation technique for parallelizing regions with unpredictable branches on a VLIW architecture", Proc. of the Workshop on Languages and Compilers for Parallel Computing, (August 1989).

[E85] Ellis, J.R., "Bulldog: A compiler for VLIW architectures", Ph.D. thesis, YaleU/DCS/RR-364, Yale University, Feb. 1985.

[FOW87] Ferrante, J., Ottenstein, K.J., and Warren, J.D., "The program dependence graph and its use in optimization", ACM Transactions on Prog. Lang. and Systems, Vol. 9, Num. 3 (July 1987), 319-349.

[F81] Fisher, J., "Trace scheduling: A technique for global microcode compaction", IEEE Trans. on Computers, C-30, No. 7 (July 1981), 478-490.

[GM86] Gibbons, P.B., and Muchnick, S.S., "Efficient instruction scheduling for a pipelined architecture", Proc. of the SIGPLAN Annual Symposium, (June 1986), 11-16.

[GR90] Golumbic, M.C., and Rainish, V., "Instruction scheduling beyond basic blocks", IBM J. Res. Dev., (Jan. 1990), 93-98.

[GO89] Groves, R.D., and Oehler, R., "An IBM second generation RISC processor architecture", Proc. of the IEEE Conference on Computer Design, (October 1989), 134-137.

[HG83] Hennessy, J.L., and Gross, T., "Postpass code optimization of pipeline constraints", ACM Trans. on Programming Languages and Systems 5 (July 1983), 422-448.

[JW89] Jouppi, N.P., and Wall, D.W., "Available instruction-level parallelism for superscalar and superpipelined machines", Proc. of the Third ASPLOS Conference, (April 1989), 272-282.

[L88] Lam, M., "Software pipelining: An effective scheduling technique for VLIW machines", Proc. of the SIGPLAN Annual Symposium, (June 1988), 318-328.

[P85] Patterson, D.A., "Reduced instruction set computers", Comm. of ACM, (Jan. 1985), 8-21.

[SLH90] Smith, M.D., Lam, M.S., and Horowitz, M.A., "Boosting beyond static scheduling in a superscalar processor", Proc. of the Computer Architecture Conference, (May 1990), 344-354.

[S89] "SPEC Newsletter", Systems Performance Evaluation Cooperative, Vol. 1, Issue 1, (Sep. 1989).

[W90] Warren, H., "Instruction scheduling for the IBM RISC System/6000 processor", IBM J. Res. Dev., (Jan. 1990), 85-92.
