Dynamic Instruction Scheduling and the Astronautics ZS-1

James E. Smith
Astronautics Corporation of America

Computer, July 1989. 0018-9162/89/0700-0021$01.00 © 1989 IEEE

Dynamic instruction scheduling resolves control and data dependencies at runtime. This extends performance over that possible with static scheduling alone, as shown by the ZS-1.

Pipelined instruction processing has become a widely used technique for implementing high-performance computers. Pipelining first appeared in supercomputers and large mainframes, but can now be found in less expensive systems. For example, most of the recent reduced instruction set computers use pipelining.[1,2] Indeed, a major argument for RISC architectures is the ease with which they can be pipelined. At the other end of the spectrum, computers with more complex instruction sets, such as the VAX 8800,[3] make effective use of pipelining as well.

The ordering, or "scheduling," of instructions as they enter and pass through an instruction pipeline is a critical factor in determining performance. In recent years, the RISC philosophy has become pervasive in computer design. The basic reasoning behind the RISC philosophy can be stated as "simple hardware means faster hardware, and hardware can be kept simple by doing as much as possible in software." A corollary naturally follows, stating that instruction scheduling should be done by software at compile time. We refer to this as static instruction scheduling, and virtually every new computer system announced in the last several years has followed this approach.

Many features of the pioneering CDC 6600[4] have found their way into modern pipelined processors. One noteworthy exception is the reordering of instructions at runtime, or dynamic instruction scheduling. The CDC 6600 scoreboard allowed hardware to reorder instruction execution, and the memory system stunt box allowed reordering of some memory references as well. Another innovative computer of considerable historical interest, the IBM 360/91,[5] used dynamic scheduling methods even more extensively than the CDC 6600.

As the RISC philosophy becomes accepted by the design community, the benefits of dynamic instruction scheduling are apparently being overlooked. Dynamic instruction scheduling can provide performance improvements simply not possible with static scheduling alone.

This article has three major purposes:

(1) to provide an overview of and survey solutions to the problem of instruction scheduling for pipelined computers,
(2) to demonstrate that dynamic instruction scheduling can provide performance improvements not possible with static scheduling alone, and
(3) to describe a new high-performance computer, the Astronautics ZS-1, which uses new methods for implementing dynamic scheduling and which can outperform computers using similar-speed technologies that rely solely on state-of-the-art static scheduling techniques.

Introduction to pipelined computing

Pipelining decomposes instruction processing into assembly line-like stages. Figure 1 illustrates a simple example pipeline. In Figure 1a, the pipeline stages are:

(1) Instruction fetch -- for simplicity assume that all instructions are fetched in one cycle from a cache memory.
(2) Instruction decode -- the instruction's opcode is examined to determine the function to perform and the resources needed. Resources include general-purpose registers, buses, and functional units.
(3) Instruction issue -- resource availability is checked and resources are reserved. That is, pipeline control interlocks are maintained at this stage. Assume that operands are read from registers during the issue stage.
(4) Instruction execution -- instructions are executed in one or several execution stages. Writing results into the general-purpose registers is done during the last execution stage. In this discussion, consider memory load and store operations to be part of execution.

    (b)        time ->
    inst. 1    FDIEEEE
    inst. 2     FDIEEEE
    inst. 3      FDIEEEE
    inst. 4       FDIEEEE

Figure 1. Pipelined instruction processing: (a) a typical pipeline; (b) ideal flow of instructions through the pipeline.

Figure 1b illustrates an idealized flow of instructions through the pipeline. Time is measured in clock periods and runs from left to right. The diagram notes the pipeline stage holding an instruction each clock period. F denotes the instruction fetch stage, D denotes the decode stage, I denotes the issue stage, and E denotes the execution stages.

In theory, the clock period for a p-stage pipeline would be 1/p the clock period for a nonpipelined equivalent. Consequently, there is the potential for a p times throughput (performance) improvement. There are several practical limitations, however, on pipeline performance. The limitation of particular interest here is instruction dependencies. Instructions may depend on results of previous instructions and may therefore have to wait for the previous instructions to complete before they can proceed through the pipeline. A data dependence occurs when instructions use the same input and/or output operands; for example, when an instruction uses the result of a preceding instruction as an input operand. A data dependence may cause an instruction to wait in the pipeline for a preceding instruction to complete. A control dependence occurs when control decisions (typically as conditional branches) must be made before subsequent instructions can be executed.

Figure 2 illustrates the effects of dependencies on pipeline performance. Figure 2a shows a sequence of machine instructions that a compiler might generate to perform the high-level language statements X = Y + Z and A = B * C. Assume load and store instructions take four execution clock periods while floating-point additions and multiplications take three. (These timing assumptions represent a moderate level of pipelining. In many RISC processors fewer clock periods are needed. On the other hand, the Cray-1 requires 11 clock periods for a load, and floating-point additions take six. Cray-2 pipelines are about twice the length of the Cray-1's.)

    (a)
    R1 ← (Y)         Load register R1 from memory location Y
    R2 ← (Z)         Load register R2 from memory location Z
    R3 ← R1 +f R2    Floating add registers R1 and R2
    (X) ← R3         Store the result into memory location X
    R4 ← (B)         Load register R4 from memory location B
    R5 ← (C)         Load register R5 from memory location C
    R6 ← R4 *f R5    Floating multiply registers R4 and R5
    (A) ← R6         Store the result into memory location A

    (b)              time ->
    R1 ← (Y)         FDIEEEE
    R2 ← (Z)          FDIEEEE
    R3 ← R1 +f R2      FD...IEEE
    (X) ← R3            F...D..IEEEE
    R4 ← (B)                F..DIEEEE
    R5 ← (C)                   FDIEEEE
    R6 ← R4 *f R5               FD...IEEE
    (A) ← R6                     F...D..IEEEE

Figure 2. Pipelined execution of X=Y+Z and A=B*C: (a) machine code; (b) pipeline timing.

Figure 2b illustrates the pipeline timing. A simple in-order method of instruction issuing is used; that is, if an instruction is blocked from issuing due to a dependence, all instructions following it are also blocked. The same letters as before are used to denote pipeline stages. A period indicates that an instruction is blocked or "stalled" in a pipeline stage and cannot proceed until either the instruction ahead of it proceeds, or, at the issue stage, until all resources and data dependencies are satisfied.

The first two instructions issue on consecutive clock periods, but the add is dependent on both loads and must wait three clock periods for the load data before it can issue. Similarly, the store to location X must wait three clock periods for the add to finish due to another data dependence. There are similar blockages during the calculation of A. The total time required is 18 clock periods. This time is measured beginning when the first instruction starts execution until the last starts execution. (We measure time in this way so that pipeline "fill" and "drain" times do not unduly influence relative timings.)

Instruction scheduling

An important characteristic of pipelined processors is that using equivalent, but reordered, code sequences can result in performance differences. For example, the code in Figure 3a performs the same function as that in Figure 2a except that it has been reordered, or "scheduled," to reduce data dependencies. Furthermore, registers have been allocated differently to eliminate certain register conflicts that appear to the hardware as dependencies. Figure 3b illustrates the pipeline timing for the code in Figure 3a. Note that there is considerably more overlap, and the time required is [...]

    (a)
    R1 ← (Y)         Load register R1 from memory location Y
    R2 ← (Z)         Load register R2 from memory location Z
    R4 ← (B)         Load register R4 from memory location B
    R5 ← (C)         Load register R5 from memory location C
    R3 ← R1 +f R2    Floating add Y and Z
    R6 ← R4 *f R5    Floating multiply B and C
    (X) ← R3         Store R3 to memory location X
    (A) ← R6         Store R6 to memory location A

    (b)              time ->
    R1 ← (Y)         FDIEEEE
    R2 ← (Z)          FDIEEEE
    R4 ← (B)           FDIEEEE
    R5 ← (C)            FDIEEEE
    R3 ← R1 +f R2        FD.IEEE
    R6 ← R4 *f R5         F.D.IEEE
    (X) ← R3                F.DIEEEE
    (A) ← R6                  FD.IEEEE

Figure 3. Reordered code to perform X=Y+Z and A=B*C: (a) machine code; (b) pipeline timing.

[Figure 4. Block diagram of the CDC 6600-style processor.]

[...] [sched]uling or reordering of instructions that can be done by a compiler prior to execution. Most, if not all, pipelined computers today use some form of static scheduling by the compiler.

[...] years ago, it is rarely used in today's pipelined processors.

Dynamic instruction scheduling: [...]
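The in-order issue rules described for Figures 2b and 3b can be sketched as a small timing model. The Python below is an illustration only, not the article's method: the names Instr, simulate, total_time, and diagram are hypothetical, and the model assumes one instruction per stage, no functional-unit or bus conflicts, and that an instruction may issue in the same clock period in which its producer is in its last execution stage. Under those assumptions it reproduces the 18 clock periods quoted for the code in Figure 2a, and it gives 11 clock periods for the reordered code in Figure 3a.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Execution latencies in clock periods, as assumed in the text.
LATENCY = {"load": 4, "store": 4, "fadd": 3, "fmul": 3}

@dataclass
class Instr:
    op: str                       # key into LATENCY
    dest: Optional[str]           # register written, or None for a store
    srcs: Tuple[str, ...] = ()    # registers read at the issue stage

def simulate(program):
    """Return (fetch, decode, issue) clock periods for each instruction."""
    ready = {}                    # register -> period of producer's last E stage
    times = []
    prev_decode, prev_issue = 1, 0   # chosen so the first instruction fetches at 1
    for ins in program:
        fetch = prev_decode                   # enter F when predecessor enters D
        decode = max(fetch + 1, prev_issue)   # enter D when predecessor enters I
        operands = max((ready.get(r, 0) for r in ins.srcs), default=0)
        issue = max(decode + 1, operands)     # in-order issue, wait for operands
        if ins.dest is not None:
            ready[ins.dest] = issue + LATENCY[ins.op]
        times.append((fetch, decode, issue))
        prev_decode, prev_issue = decode, issue
    return times

def total_time(times):
    """Periods from the first start of execution to the last, inclusive."""
    return times[-1][2] - times[0][2] + 1

def diagram(program, times):
    """Render one row per instruction using the F, D, I, E and '.' notation."""
    return [" " * (f - 1) + "F" + "." * (d - f - 1) + "D" + "." * (i - d - 1)
            + "I" + "E" * LATENCY[ins.op]
            for ins, (f, d, i) in zip(program, times)]

# Figure 2a: X = Y + Z followed by A = B * C, in source order.
FIG2A = [
    Instr("load", "R1"), Instr("load", "R2"),
    Instr("fadd", "R3", ("R1", "R2")), Instr("store", None, ("R3",)),
    Instr("load", "R4"), Instr("load", "R5"),
    Instr("fmul", "R6", ("R4", "R5")), Instr("store", None, ("R6",)),
]
# Figure 3a: the same instructions with all four loads hoisted to the front.
FIG3A = [FIG2A[k] for k in (0, 1, 4, 5, 2, 6, 3, 7)]

for name, prog in (("Figure 2a", FIG2A), ("Figure 3a", FIG3A)):
    t = simulate(prog)
    print(name, "takes", total_time(t), "clock periods")
    print("\n".join(diagram(prog, t)))
```

Because the model tracks only true data dependences through registers, it says nothing about the register conflicts or resource interlocks that real issue logic must also check.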
