
INSTRUCTION ISSUE LOGIC FOR PIPELINED SUPERCOMPUTERS

Shlomo Weiss Computer Sciences Department University of Wisconsin-Madison Madison, WI 53706

James E. Smith Department of Electrical and Computer Engineering University of Wisconsin-Madison Madison, WI 53706

Abstract

Basic principles and design tradeoffs for control of pipelined processors are first discussed. We concentrate on register-register architectures like the CRAY-1, where pipeline control logic is localized to one or two pipeline stages and is referred to as "instruction issue logic". Design tradeoffs are explored by giving designs for a variety of instruction issue methods that represent a range of complexity and sophistication. These vary from the original CRAY-1 issue logic to a version of Tomasulo's algorithm, first used in the IBM 360/91 floating point unit. Also studied are Thornton's "scoreboard" algorithm used on the CDC 6600 and an algorithm we have devised. To provide a standard for comparison, all the issue methods are used to implement the CRAY-1 scalar architecture. Then, using a simulation model and the Lawrence Livermore Loops compiled with the CRAY FORTRAN compiler, performance results for the various issue methods are given and discussed.

1. Introduction

Although modern supercomputers are closely associated with high speed vector operation, it is widely recognized that scalar operation is at least of equal importance, and pipelining [KOGG81] is the predominant technique for achieving high scalar performance. In a pipelined computer, instruction processing is broken into segments and processing proceeds in an assembly line fashion with the execution of several instructions being overlapped. Because of data and control dependencies in a scalar instruction stream, interlock logic is placed between critical pipeline segments to control instruction flow through the pipe. In a register-register architecture like the CDC 6600 [THOR70], the CDC 7600 [BONS69], and the CRAY-1 [CRAY77, CRAY79, RUSS78], most of the interlock logic is localized to one segment early in the pipeline and is referred to as "instruction issue" logic.

It is the purpose of this paper to highlight some of the tradeoffs that affect pipeline control, with particular emphasis on instruction issue logic. The primary vehicle for this discussion is a simulation study of different instruction issue methods with varying degrees of complexity. These range from the simple and straightforward, as in the CRAY-1, to the complex and sophisticated, as in the CDC 6600 and the IBM 360/91 floating point unit [TOMA67]. Each is used to implement the CRAY-1 scalar architecture, and each implementation is simulated using the 14 Lawrence Livermore Loops [MCMA72] as compiled by the Cray Research FORTRAN compiler (CFT).

1.1. Tradeoffs

We begin with a discussion of design tradeoffs that centers on four principal issues:
(1) clock period,
(2) instruction scheduling,
(3) issue logic complexity, and
(4) hardware cost, debugging, and maintenance.
Each of these issues will be discussed in turn.

Clock Period. In a pipelined computer, there are a number of segments containing combinational logic with latches separating successive segments. All the latches are synchronized by the same clock, and the pipeline is capable of initiating a new instruction every clock period. Hence, under ideal conditions, i.e. no dependencies or resource conflicts, pipeline performance is directly related to the period of the clock used to synchronize the pipe. Even with data dependencies and resource conflicts, there is a high correlation between performance and clock period.

Historically, pipelined supercomputers have had shorter clock periods than other computers. This is in part due to the use of the fastest available logic technologies, but it is also due to designs that minimize logic levels between successive latches.

Scheduling of Instructions. Performance of a pipelined processor depends greatly on the order of the instructions in the instruction stream. If consecutive instructions have data and control dependencies and contend for resources, then "holes" in the pipeline will develop and performance will suffer. To improve performance, it is often possible to arrange the code, or schedule it, so that dependencies and resource conflicts are minimized. Registers can also be allocated so that register conflicts are reduced (register conflicts caused by data dependencies can not be eliminated in this way, however). Because of their close relationship, in the remainder of the paper we will group code scheduling and register allocation together and refer to them collectively as "code scheduling".

There are two different ways that code scheduling can be done. First, it can be done at compile time by the software. We refer to this as "static" scheduling because it does not change as the program runs. Second, it can be done by the hardware at run time. We refer to this as "dynamic" scheduling. These two methods are not mutually exclusive.

Most compilers for pipelined processors do some form of static scheduling to avoid dependencies. This adds a new dimension to the optimization problems faced by a compiler, and occasionally a programmer will hand code inner loops in assembly language to arrive at a better schedule than a compiler can provide.

Issue Logic Complexity. By using complex issue logic, dynamic scheduling of instructions can be achieved. This allows instructions to begin execution "out-of-order" with respect to the compiled code sequence. This has two advantages. First, it relieves some of the burden on the compiler to generate a good schedule. That is, performance is not as dependent on the quality of the compiled code. Second, dynamic scheduling at issue time can take advantage of dependency information that is not available to the compiler when it does static scheduling. Complex issue logic does require longer control paths, however, which can lead to a longer clock period.

Hardware Cost, Debugging, and Maintenance. Complex issue methods lead to additional hardware cost. More logic is needed, and design time is increased. Complex control logic is also more expensive to debug and maintain. These problems are aggravated by issue methods that dynamically schedule code because it may be difficult to reproduce exact issue sequences.

1.2. Historical Perspective

It is interesting to review the way the above tradeoffs have been dealt with historically. In the late 1950's and early 1960's there was rapid movement toward increasingly complex issue methods. Important milestones were achieved by STRETCH [BUCH62] in 1961 and the CDC 6600 in 1964. Probably the most sophisticated issue logic used to date is in the IBM 360/91 [ANDE67], shipped in 1967. After this first rush toward more and more complex methods, there was a retreat toward simpler instruction issue methods that are still in use today. At CDC, the 7600 was designed to issue instructions in strict program sequence with no dynamic scheduling. The clock period, however, was very fast, even by today's standards. The more recent CRAY-1 and CRAY X-MP [CRAY82] differ very little from the CDC 7600 in the way they handle scalar instructions. The CDC CYBER 205 [CDC81] scalar unit is also very similar. At IBM, and later at Amdahl Corp., pipelined implementations of the 360/370 architecture following the 360/91 have issued instructions strictly in order.

As for the future, both the debug/maintenance problem and the hardware cost problem may be significantly alleviated by using VLSI, where logic is much less expensive and where replaceable parts are such that fault isolation does not need to be as precise as with SSI. In addition, there is a trend toward moving software problems into hardware, and code scheduling seems to be a candidate. Consequently, tradeoffs are shifting, and instruction issue logic that dynamically schedules code deserves renewed study.

1.3. Paper Overview

The tradeoffs just discussed lead to a spectrum of instruction issue algorithms. Through simulation we can look at the performance gains that are made possible by dynamic code scheduling. Other issues like clock period and hardware cost and maintenance are more difficult and require detailed design and construction to make quantitative assessments. In this paper, we do discuss the control functions that need to be implemented to facilitate qualitative judgements. Section 2 examines one endpoint of the spectrum: the CRAY-1. The CRAY-1 uses simple issue logic with a fast clock and static code scheduling only. Section 3 examines the other endpoint of the spectrum: Tomasulo's algorithm. Tomasulo's algorithm is capable of considerable dynamic code scheduling via a complex issue mechanism. Sections 4 and 5 then discuss two intermediate points. The first is a variation of Thornton's "scoreboard" algorithm used in the CDC 6600. Thornton's algorithm is also used to implement the CRAY-1 scalar architecture. The second is an algorithm we have devised to allow dynamic scheduling while doing away with some of the associative compares required by the other methods that perform dynamic scheduling. Each of the four CRAY-1 implementations is simulated over the same set of benchmarks to allow performance comparisons. Section 6 contains a further discussion on the relationship between software and hardware code scheduling in pipelined processors, and Section 7 contains conclusions.

2. The CRAY-1 Architecture and Instruction Issue Algorithm

2.1. Overview of the CRAY-1

The CRAY-1 architecture and organization are used throughout this paper as a basis for comparison. The CRAY-1 scalar architecture is shown in Figure 1. It consists of two sets of registers and functional units for (1) address processing and (2) scalar processing. The address registers are partitioned into two levels: eight A registers and sixty four B registers. The integer add and multiply functional units are dedicated to address processing. Similarly, the scalar registers are partitioned into two levels: eight S registers and sixty four T registers. The B and T register files may be used as a programmer-manipulated data cache, although this feature is largely unused by the CFT compiler. Four functional units are used exclusively for scalar processing. In addition, three floating point functional units are shared with the vector processing section (vector registers are not shown in Fig. 1).

Figure 1 -- The CRAY-1 Scalar Architecture.

The instruction set is designed for efficient pipeline processing. Being a register-register architecture, only load and store instructions can access memory. The rest of the instructions use operands from registers. Instructions for the B and T registers are restricted to memory access and copies to and from register files A and S, respectively.

The information flow is from memory to registers A (S), or to the intermediate registers B (T). From file A (S) data is sent to the functional units, from which it returns to file A (S). Then data can be further processed by functional units, stored into memory, or saved in file B (T). Block transfers of operands between memory and registers B and T are also available, thus reducing the number of memory access instructions.
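As a point of reference for the simulations that follow, the scalar register organization just described can be captured in a few lines of C. This is only an illustrative sketch; the type and field names are ours, the register widths are simplified, and no claim is made that the simulators used in this study were structured this way.

#include <stdint.h>
#include <stdbool.h>

/* Sketch of the CRAY-1 scalar register state as a simulator might model it.
   Register counts follow the text (8 A, 64 B, 8 S, 64 T); widths are
   simplified and the per-register reservation flags are illustrative only. */
typedef struct {
    uint32_t a[8];          /* address registers                              */
    uint32_t b[64];         /* intermediate (backup) address registers        */
    uint64_t s[8];          /* scalar registers                               */
    uint64_t t[64];         /* intermediate (backup) scalar registers         */
    bool a_reserved[8];     /* set while an issued instruction will write Ai  */
    bool s_reserved[8];     /* set while an issued instruction will write Si  */
} cray1_scalar_state;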

2.2. CRAY-1 Issue Logic

Instructions are fetched from instruction buffers at the rate of one parcel (16 bits) per clock period. Individual instructions are either one or two parcels long. After a clock period is spent for instruction decoding, the issue logic checks interlocks. If there is any conflict, issue is blocked until the conflict condition goes away.

For scalar instructions, the following are the primary interlock checks made at the time of instruction issue:
(1) registers; both the source and destination registers must not be reserved;
(2) result bus; the A and S register files have one bus each over which data can be written into the files. Based on the completion time of the particular instruction, a check is made to determine if the bus will be available at the clock period when the instruction completes.
(3) functional unit; due to vector instructions, a functional unit may be busy when a scalar instruction wishes to use it. Since we are considering scalar performance only, this type of conflict will not occur. The memory system can also be viewed as a functional unit; it can occasionally become busy due to a memory bank conflict, but for scalar code this is a very infrequent occurrence and does not affect performance in any appreciable way.

If all its interlocks pass, an instruction issues and causes the following to take place.
(1) The destination register is reserved; this reservation is removed only when the instruction completes.
(2) The result bus is reserved for the clock period when the instruction completes.

Memory accesses have one further interlock to be checked: memory bank busy. This check must be delayed until the indexing register is read and the effective address is computed. Hence, the bank busy check is performed two clock periods after a load or store instruction issues. If the bank happens to be busy, the memory "functional unit" is busied, and no further loads or stores can be issued. Because all loads and stores are two parcels long, they can issue at a maximum rate of one every two clock periods. This means that a bank busy blockage catches a subsequent load or store before it is issued. After a load instruction passes the bank busy check, it places its reservation for the appropriate result bus. A load can only conflict for the bus with a previously issued reciprocal approximation instruction, so the additional interlocking done at that point is minimal.
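The interlock checks and issue actions listed above amount to a few flag tests and reservations per instruction. The following C fragment sketches them; the names (issue_ok, do_issue, BUS_WINDOW) are ours and the result-bus model is deliberately simplified.

#include <stdbool.h>

#define BUS_WINDOW 64   /* future clock periods of bus reservations tracked; size is arbitrary */

typedef struct {
    bool reg_reserved[8];          /* destination reservations on the A (or S) register file          */
    bool bus_reserved[BUS_WINDOW]; /* result-bus reservation per future clock, indexed mod BUS_WINDOW */
    bool unit_busy;                /* functional unit busy (vector use, or memory bank busy)          */
} issue_state;

/* Return true if a scalar instruction may issue in the current clock period. */
static bool issue_ok(const issue_state *st, int src1, int src2, int dst, int completion_clock)
{
    if (st->reg_reserved[src1] || st->reg_reserved[src2] || st->reg_reserved[dst])
        return false;                                        /* check (1): registers       */
    if (st->bus_reserved[completion_clock % BUS_WINDOW])
        return false;                                        /* check (2): result bus      */
    if (st->unit_busy)
        return false;                                        /* check (3): functional unit */
    return true;
}

/* Actions taken when the instruction issues. */
static void do_issue(issue_state *st, int dst, int completion_clock)
{
    st->reg_reserved[dst] = true;                            /* cleared when the instruction completes */
    st->bus_reserved[completion_clock % BUS_WINDOW] = true;
}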
2.3. CRAY-1 Performance

In this paper performance is measured by simulating the first 14 Lawrence Livermore Loops. These are excerpts from large FORTRAN programs that have been judged to provide a good measure of large scale computer performance. The loops were compiled using the CFT compiler, and instruction trace tapes were generated. These were then simulated with a performance simulator written in C, running on a VAX 11/780. With bank busies and instruction buffer misses modeled, the simulator agrees exactly with actual CRAY-1 timings, except when there is a difference in the way a loop fits into the instruction buffers. This particular difference is a function of where the loader chooses to place a program in memory, and for practical purposes has to be viewed as a nondeterminism.

Since we are interested in scalar performance, the CFT compiler was run with the "vectorizer" turned off so that no vector instructions were produced. When the vectorizer is on, half of the 14 loops contain a substantial amount of vector code, and half remain scalar.

For the simulations reported here, we have made the following simplifications:
(1) There are no memory bank conflicts.
(2) All loops fit in the instruction buffers.
One reason for this simplification was to simplify the simulator design for the alternative CRAY-1 issue methods to be given later; as mentioned earlier, our original CRAY-1 simulator is capable of modeling bank conflicts and instruction buffers. Also, this allows us to concentrate on the performance differences caused by issue logic and to filter the "noise" introduced by other factors (e.g. instruction buffer crossings). Table 1 shows the scalar performance of the CRAY-1 for the first 14 Lawrence Livermore Loops.

 Loop   # instructions executed   # clock cycles
   1            7217                  18046
   2            8448                  18918
   3           14015                  38039
   4            9783                  22198
   5            8347                  21707
   6            9350                  23045
   7            4573                  10361
   8            4031                   7841
   9            4918                  10146
  10            4412                  10230
  11           12002                  30011
  12           11999                  29999
  13            8846                  18858
  14            9915                  22391

Table 1 -- CRAY-1 execution times for the 14 Lawrence Livermore Loops; 1 parcel instructions are issued in 1 cycle, 2 parcel instructions in 2 cycles.

3. Tomasulo's Algorithm

The CRAY-1 forces instructions to issue strictly in program order. If an instruction is blocked from issuing due to a conflict, all instructions following it in the instruction stream are also blocked, even if they have no conflicts. In contrast, the scheme in this section allows instructions to begin execution out of program order. It is a variation of the instruction issue algorithm first presented in [TOMA67]. Although the original algorithm was devised for the floating point unit of the IBM 360/91, we show how it can be adapted and extended to control the entire pipeline structure of a CRAY-1 implementation.

Figure 2 illustrates the essential elements of a tag-based mechanism for issue of instructions out-of-order. Fig. 3 illustrates the full CRAY-1 implementation.

Figure 2 -- Tag based mechanism to issue out-of-order.

Figure 3 -- A modified CRAY-1 scalar architecture to issue instructions out-of-order.

Each register in the A and S register files is augmented by a ready bit (R) and a tag field. Associated with each functional unit is a small number of reservation stations. Each reservation station can store a pair of operands; each operand has its own tag field and ready bit. A reservation station also holds a destination tag (DTG). When an instruction is issued, a new tag is stored into DTG (see Fig. 2).

New destination tags are assigned from a "tag pool" that consists of some finite set of tags. These are associated with an instruction from the time the instruction is issued to a reservation station until the time it produces a result and completes. The tag is returned to the pool when an instruction finishes. In the original Tomasulo's algorithm, the tags were in 1-to-1 correspondence with the reservation stations. This particular way of assigning tags is not essential, however. Any method will work as long as tags are assigned and released to the pool as described above.
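The tagged registers, reservation stations, and tag pool described above might be represented as follows. This is a sketch under our own naming conventions; the tag-pool size of 32 is borrowed from the DTS study in Section 5 purely for illustration and is not a property of Tomasulo's scheme.

#include <stdint.h>
#include <stdbool.h>

#define NUM_TAGS    32   /* any finite pool works; 32 is illustrative */
#define RS_PER_UNIT 8    /* matches the 8 stations per functional unit used in the simulations */

typedef struct {         /* one A or S register, augmented with a ready bit and a tag field */
    uint64_t value;
    bool     ready;      /* R bit */
    uint8_t  tag;        /* meaningful only while !ready */
} tagged_reg;

typedef struct {         /* one operand slot of a reservation station */
    uint64_t value;
    bool     ready;
    uint8_t  tag;
} rs_operand;

typedef struct {         /* one reservation station */
    bool       busy;
    rs_operand op[2];
    uint8_t    dtg;      /* destination tag, taken from the pool at issue time */
} reservation_station;

typedef struct {         /* tag pool; a tag is returned when its instruction completes */
    bool in_use[NUM_TAGS];
} tag_pool;

static int alloc_tag(tag_pool *p)
{
    for (int t = 0; t < NUM_TAGS; t++)
        if (!p->in_use[t]) { p->in_use[t] = true; return t; }
    return -1;           /* pool empty: issue must stall */
}

static void free_tag(tag_pool *p, int t)
{
    p->in_use[t] = false;
}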

We treat the register files B and T as a unit, with one busy bit per file, since it is not practical to assign tags to so many registers. When one of these registers awaits an operand, the whole file is set to busy.

To facilitate transfer of operands between the register files, special copy units (AS, SA, AB, and ST) are introduced. These are treated as functional units, with reservation stations and an execution time of one clock cycle. These reservation stations (and some others, e.g. reciprocal approximation) have only one operand.

The memory unit appears to the issue logic as a (somewhat more complex) functional unit. Instead of one set of reservation stations, the memory unit has three: Load reservation stations, Store reservation stations, and a Conflict Queue. When a new memory instruction Ii is issued, its effective address (if available) is checked against addresses in the Load and Store reservation stations. If there is a conflict with instruction Ij, Ii is issued and queued in the Conflict Queue. When Ij is eventually processed and the conflict disappears, Ii is transferred from the Conflict Queue to a Load or Store reservation station. If Ii uses an index register that is not ready, the effective address is unknown and there is no way to check for conflicts. In this case, Ii is stored in the Conflict Queue, where it waits for its index register to become ready.

Therefore, instructions from the Load and Store units can be processed asynchronously, since they never conflict with each other (no two instructions in these units have the same effective address). On the other hand, instructions from the Conflict Queue are processed in the order of arrival. This guarantees that two instructions with the same effective address are processed in the right order. The above mechanism takes care of any read after write, write after read, or write after write hazards. The Conflict Queue is the only unit in the system in which instructions are strictly processed in the order of their arrival.

This is a simpler mechanism than the one employed by the IBM 360/91 Storage System [BOLA67]. The latter has a similar queue for resolving memory conflicts, but instructions stored in this queue can be processed out of order; only two or more requests for a particular address are kept in sequence.
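The routing decision for a newly issued memory instruction -- to a Load or Store reservation station, or to the Conflict Queue -- can be sketched as below. The structure names are ours, the station counts are arbitrary, and for brevity the sketch omits the in-order handling of entries already waiting in the Conflict Queue.

#include <stdint.h>
#include <stdbool.h>

#define N_LOAD  8        /* numbers of Load and Store reservation stations; sizes are illustrative */
#define N_STORE 8

typedef struct { bool busy; uint32_t addr; } mem_rs;   /* entries here always have a known address */

typedef struct {
    mem_rs load_rs[N_LOAD];
    mem_rs store_rs[N_STORE];
} memory_unit;

/* Returns true if a new memory instruction may enter a Load or Store station,
   false if it must wait in the Conflict Queue (effective address unknown, or it
   matches the address of an access already pending in the stations). */
static bool can_enter_load_store(const memory_unit *mu, bool addr_known, uint32_t ea)
{
    if (!addr_known)                 /* index register not ready yet */
        return false;
    for (int i = 0; i < N_LOAD; i++)
        if (mu->load_rs[i].busy && mu->load_rs[i].addr == ea)
            return false;            /* conflict: queue behind the earlier access */
    for (int i = 0; i < N_STORE; i++)
        if (mu->store_rs[i].busy && mu->store_rs[i].addr == ea)
            return false;
    return true;
}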

The tag mechanism described above allows decoded instructions to issue to functional units with little regard for dependencies. There are three conditions that must be checked before an instruction can be sent to a functional unit, however.
(1) The requested functional unit must have an available reservation station.
(2) There must be an available tag from the tag pool. In Tomasulo's implementation, conditions 1 and 2 are equivalent.
(3) A source register being used by the instruction must not be loaded with a just-completed result during the same clock period as the instruction issues to a reservation station. This hazard condition is often neglected when discussing Tomasulo's algorithm. If it is not handled properly, the source register contents and the instruction that uses the source register will both be transferred during the same clock period. Because the instruction is not in the reservation station at the time of the register transfer, the register's contents will not be correctly sent to the reservation station. When this hazard condition is detected, instruction issue is held for one clock period.

In our simulations, each functional unit had 8 reservation stations. The reason we had a relatively large number of reservation stations was to monitor their usage; in fact most of them were not required. Few functional units need more than one or two reservation stations. Those that need more, usually because they wait for instructions with long latency, such as load or floating point multiply and add, would do very well with 4 reservation stations. Although in our scheme reservation stations are statically allocated to each functional unit, it is possible to reduce their number and optimize their usage by clustering them in a common pool and then allocating as needed.

When an instruction is issued, the following actions take place.
(1) The instruction's source register(s) contents are copied into the requested functional unit's reservation station.
(2) The instruction's source register(s) ready bits are copied into the reservation station.
(3) The instruction's source register(s) tag fields are copied into the reservation station.
(4) A tag allocated from the tag pool is placed in the result register's tag field (if there is a result), the register's ready bit is cleared, and the tag is written into the DTG field of the reservation station.
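Putting conditions (1) and (2) and actions (1)-(4) together, issue of a two-operand instruction to a reservation station might look like the sketch below (same illustrative types as before; the same-clock writeback hazard of condition (3) is assumed to have been checked separately by the caller).

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t value; bool ready; uint8_t tag; } tagged_reg;
typedef struct { uint64_t value; bool ready; uint8_t tag; } rs_operand;
typedef struct { bool busy; rs_operand op[2]; uint8_t dtg; } reservation_station;

/* Issue one instruction to a free station of the requested unit.  Returns false
   (issue stalls) if no station is free or no tag could be allocated. */
static bool issue_to_rs(reservation_station *rs, int n_rs,
                        const tagged_reg *src1, const tagged_reg *src2,
                        tagged_reg *dst, int new_tag /* -1 if the tag pool is empty */)
{
    if (new_tag < 0)
        return false;                          /* condition (2): a tag must be available   */
    for (int i = 0; i < n_rs; i++) {
        if (rs[i].busy)
            continue;                          /* condition (1): a station must be free    */
        rs[i].busy = true;
        rs[i].op[0].value = src1->value;       /* action (1): copy source contents         */
        rs[i].op[1].value = src2->value;
        rs[i].op[0].ready = src1->ready;       /* action (2): copy ready bits              */
        rs[i].op[1].ready = src2->ready;
        rs[i].op[0].tag   = src1->tag;         /* action (3): copy tag fields              */
        rs[i].op[1].tag   = src2->tag;
        dst->tag   = (uint8_t)new_tag;         /* action (4): tag the result register ...  */
        dst->ready = false;                    /* ... clear its ready bit ...              */
        rs[i].dtg  = (uint8_t)new_tag;         /* ... and record the destination tag (DTG) */
        return true;
    }
    return false;
}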

Thus, if a source register's ready bit is set, the reservation station will hold a valid operand. Otherwise, it will hold a tag that identifies the expected operand.

In order for an instruction waiting in a reservation station to begin execution, the following must be satisfied.
(1) All its operands must be ready.
(2) It must gain access to the required functional unit; this may involve contention with other reservation stations belonging to the same functional unit that also have all operands ready.
(3) It must gain access to the result bus for the clock period when its result will be ready; again, this may involve contention with other instructions issuing to the same or any other functional unit that will complete at the same time.

When an instruction begins execution, it does the following.
(1) It releases its reservation station.
(2) It reserves the result bus for the clock period when it will complete. Reserving the bus in advance avoids the implementation problems of stopping the pipeline if the bus is busy. An alternative is to request the bus a short time before the end of execution and to provide buffering at the output of the functional unit [TOMA67].
(3) It copies its destination tag into the functional unit control pipeline because the destination tag must be attached to the result when the instruction completes.

When either a load or a functional unit instruction completes, its result and corresponding destination tag appear on the result bus. (In a practical implementation, the tag will probably precede the data by one clock period.) The data is stored in all the reservation stations and registers that have the ready bit clear and a tag that matches the tag of the result. Then the ready bit is set to signal a valid operand.
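The broadcast of a completed result is where the associative tag comparison appears. A software sketch of it, using the same illustrative types as before, is simply a scan over the registers and the waiting operands:

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t value; bool ready; uint8_t tag; } tagged_reg;
typedef struct { uint64_t value; bool ready; uint8_t tag; } rs_operand;
typedef struct { bool busy; rs_operand op[2]; uint8_t dtg; } reservation_station;

/* Every register and every waiting operand whose ready bit is clear and whose
   tag matches the result's destination tag takes the value and becomes ready.
   In hardware this comparison is done associatively in one clock period. */
static void broadcast_result(tagged_reg *regs, int n_regs,
                             reservation_station *rs, int n_rs,
                             uint8_t result_tag, uint64_t result)
{
    for (int i = 0; i < n_regs; i++)
        if (!regs[i].ready && regs[i].tag == result_tag) {
            regs[i].value = result;
            regs[i].ready = true;
        }
    for (int i = 0; i < n_rs; i++)
        for (int j = 0; j < 2; j++)
            if (rs[i].busy && !rs[i].op[j].ready && rs[i].op[j].tag == result_tag) {
                rs[i].op[j].value = result;
                rs[i].op[j].ready = true;
            }
}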
Because instruction issue takes place in two phases (the first moves an instruction to a reservation station and the second moves it on to the functional unit for execution), we assume that each phase takes a full clock period. In the CRAY-1 there is only a one clock period delay in moving an instruction to a functional unit. We recognize the one clock period difference in our simulation model, so that the minimum time for an instruction to complete is one clock period greater than in the CRAY-1. This takes into account some of the lost time due to the more complex control decisions that are required. The primary factor that may lead to longer control paths is the contention that takes place among the reservation stations for functional units and busses when more than one are simultaneously ready to initiate an instruction.

When branches are taken, issue is held for at least 5 clock cycles. Since branches test the contents of register A0 for the condition specified in the branch instruction, A0 should not be busy in the previous 2 cycles. These assumptions are in accord with the CRAY-1 implementation and the assumptions made to produce Table 1.

3.1. Performance Results

Table 2 shows the results of simulating the CRAY-1 implemented with Tomasulo's algorithm. The total speedup achieved was 1.43. We recognize that these are, in a sense, "theoretical maximum speedups"; any lengthening of the clock period due to longer control paths will diminish this speedup.

 Loop   # clock cycles on CRAY-1   # clock cycles for Tomasulo's algorithm   speedup
   1          18046                     10838                                 1.67
   2          18918                     14102                                 1.34
   3          38039                     30017                                 1.27
   4          22198                     16534                                 1.34
   5          21707                     16925                                 1.26
   6          23045                     15042                                 1.53
   7          10361                      6513                                 1.59
   8           7841                      6780                                 1.16
   9          10146                      8238                                 1.23
  10          10230                      7421                                 1.38
  11          30011                     20006                                 1.50
  12          29999                     20000                                 1.50
  13          18858                     12314                                 1.53
  14          22391                     12780                                 1.75
 total       281790                    197510                                 1.43

Table 2 -- Performance with Tomasulo's Algorithm; one parcel issued per clock period.

We noticed while doing the simulations that limiting instruction fetches to the maximum rate of one parcel per clock period appeared to be restricting performance. Hence, we modified the implementation so that a full instruction could be fetched and issued to a reservation station each clock period. This would be slightly more expensive to implement, but for Tomasulo's algorithm it gives a significant performance improvement over one parcel per clock period.

To keep comparisons fair, we went back and modified the original CRAY-1 simulation model so that it, too, could issue instructions at the higher rate. Table 3 gives the results of these simulations, and compares them with the one parcel per clock period results given earlier. Here, the performance improvement is small. This is an interesting result in itself, and shows the wisdom of opting for simpler instruction fetch logic in the original CRAY-1.

 Loop   # clock cycles, 1 parcel/cp   # clock cycles, 1 instr/cp   speedup
   1          18046                        17244                    1.05
   2          18918                        17717                    1.07
   3          38039                        37037                    1.03
   4          22198                        21163                    1.05
   5          21707                        20709                    1.05
   6          23045                        22045                    1.05
   7          10361                        10241                    1.01
   8           7841                         6874                    1.14
   9          10146                         9744                    1.04
  10          10230                         9430                    1.08
  11          30011                        28013                    1.07
  12          29999                        28001                    1.07
  13          18858                        17957                    1.05
  14          22391                        22087                    1.01
 total       281790                       268262                    1.05

Table 3 -- Comparison of the 14 Lawrence Livermore Loops on the CRAY-1: one instruction per clock period vs. one parcel per clock period.

Because the higher instruction fetch rate does appear to alleviate a bottleneck that reduces the efficiency of Tomasulo's algorithm, we incorporated it into the model for the studies to follow, and use the CRAY-1 results of Table 3 (1 instruction per clock period) as a basis for further comparisons.

Table 4 shows simulation results for Tomasulo's algorithm with one instruction issued per clock period. The speedup achieved is in the range 1.23 - 2.02 (total 1.58). In three out of the four loops whose speedup was less than 1.5 (i.e. loops 3, 4 and 8), issue was halted due to a busy T file. With an architecture that doesn't have the large number of registers the CRAY-1 has, it would be possible to use tags for all the registers, thus increasing the speedup of the above loops.

 Loop   # clock cycles on CRAY-1   # clock cycles for Tomasulo's algorithm   speedup
   1          17244                      8832                                 1.95
   2          17717                     11062                                 1.60
   3          37037                     28012                                 1.32
   4          21163                     15418                                 1.37
   5          20709                     16898                                 1.23
   6          22045                     12956                                 1.70
   7          10241                      5069                                 2.02
   8           6874                      5195                                 1.32
   9           9744                      6332                                 1.54
  10           9430                      5318                                 1.77
  11          28013                     17871                                 1.57
  12          28001                     16001                                 1.75
  13          17957                      9364                                 1.92
  14          22087                     11713                                 1.89
 total       268262                    170041                                 1.58

Table 4 -- Performance of Tomasulo's Algorithm; one instruction issued per clock period.

3.2. Example

Figure 4 shows a timing diagram of Tomasulo's algorithm compared with that of the CRAY-1, both executing loop 12. On the left appear the instructions as generated by the CFT compiler. The timing for two consecutive loop iterations is shown next to each other. Each "H" represents one clock period. From the standpoint of issue logic, when store instructions are initiated they go to memory and are no longer considered. Therefore, stores are shown to execute only for the clock period they are initiated. Solid lines indicate that the respective instruction is in execution. For Tomasulo's algorithm, a dotted line shows that an instruction has been issued to a reservation station and is waiting for operand(s).

With the original CRAY-1 issue algorithm, all instructions in a loop must begin execution and a loop-terminating conditional branch instruction must complete before the next loop iteration can begin. With Tomasulo's algorithm, loop iterations can be "telescoped"; a second loop iteration can begin after all instructions have been sent to reservation stations and the conditional branch is completed. The instructions belonging to the first loop iteration do not necessarily need to begin execution. One could also view this as dynamic rescheduling of the branch instruction. Obviously, the branch instruction is one instruction that can not be moved earlier in the loop as would have to be the case with static rescheduling. The significant speedup of Tomasulo's algorithm for this loop (1.75, see Table 4) is due mainly to the overlap of the loading of registers S6 and S1 with the floating point difference (-F) which uses S6 and S1 as operands. Although -F cannot be executed until the operands return from memory, it can be issued to a reservation station; this allows following instructions to proceed. On the other hand, the CRAY-1 executes the load of S1 and the floating point difference strictly in order.

1:  S5 <- T00        ;COPY T00 TO S5
    A1 <- S5         ;COPY S5 TO A1
    S6 <- off1,A1    ;LOAD S6 (ADDRESS INDEXED BY A1)
    S1 <- off2,A1    ;LOAD S1 (ADDRESS INDEXED BY A1)
    S4 <- S6 -F S1   ;FLOATING DIFFERENCE OF S6 AND S1 TO S4
    S3 <- S5 + S7    ;INTEGER SUM OF S5 AND S7 TO S3
    A2 <- B02        ;COPY B02 TO A2
    A0 <- A2 + 1     ;INTEGER SUM OF A2 AND 1 TO A0
    off3,A1 <- S4    ;STORE S4 (ADDRESS INDEXED BY A1)
    T00 <- S3        ;COPY S3 TO T00
    B02 <- A0        ;COPY A0 TO B02
    JAM 1            ;BRANCH TO LOOP ENTRY

(a) LLL 12 in CRAY assembly language; arrows inserted for readability.

(b) Timing diagram for the CRAY-1.

(c) Timing diagram for Tomasulo's algorithm.

Figure 4 -- Timing diagrams for Lawrence Livermore Loop 12.

One can see from the timing diagrams that with Tomasulo's algorithm only 2 reservation stations are used for more than 1 clock period. The store instruction needs a reservation station that is released a short period of time before the next store is issued. Another reservation station is used extensively by the floating point difference instruction.

4. Thornton's Algorithm

Tomasulo's algorithm leads to complex issue logic and may be quite expensive to implement. Therefore, in this section and in the next section, we consider ways to reduce the cost. One major cost is the associative hardware needed to match tags. When an operand and its attached tag appear on a bus, register files A and S and all the reservation stations have to be searched simultaneously. The operand is stored in any register or reservation station with a matching tag.

In this section, we implement the CRAY-1 scalar architecture with an issue method that is a derivative of Thornton's "scoreboard" algorithm used in the CDC 6600. Here, control is more distributed (there is no global scoreboard), and reservation stations have been added to functional units. The primary difference between Thornton's algorithm and Tomasulo's is that instruction issue is halted when the destination of an instruction is a register that is busy. This simplifies the issue logic hardware in the following ways.
(1) The associative compare with the register file tags is eliminated.
(2) Tag allocation and de-allocation hardware is eliminated because the result register designator acts as the tag.

Figure 5 -- Thornton's issue logic.

Figure 5 illustrates the algorithm. Most reservation stations hold two operands (although some need only one, depending on the functional unit) and the address of the destination register (DR). Attached to each operand is the source register designator (SR) and a ready flag (R). Also attached to each register in the register file is a ready bit (R). (These are the same as the reserved bits used in the original CRAY-1 control.)

The following operations are performed when an instruction is issued:
(1) A reservation station of the requested functional unit, if available, is reserved. Otherwise, issue is blocked.
(2) If a source register is ready, it is copied into the reservation station and the ready bit is set. Otherwise, the ready bit is cleared and the source register's designator is stored into the SR field.

The conditions for moving an instruction from a reservation station to begin execution are the same as given for Tomasulo's algorithm in the previous section. When a functional unit is finished with an instruction, the result register designator is matched against all the SR fields in the reservation stations, and the result is written into the reservation stations where there is a match. The corresponding operand ready bits are then set. The result is also stored into the destination register, and it is set ready.
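Because the result register designator takes the place of a tag, the writeback step of this scheme can be sketched as follows (again with our own type names; SR and DR follow the figure).

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t value; bool ready; uint8_t sr; } th_operand;  /* SR: source register designator */
typedef struct { bool busy; th_operand op[2]; uint8_t dr; } th_station; /* DR: destination register        */
typedef struct { uint64_t value; bool ready; } plain_reg;

/* Writeback: match the result register designator against the SR fields of all
   waiting operands, deliver the result there, and also write the destination
   register and mark it ready (clearing its reservation). */
static void thornton_writeback(plain_reg *regs, th_station *rs, int n_rs,
                               uint8_t dest_reg, uint64_t result)
{
    for (int i = 0; i < n_rs; i++)
        for (int j = 0; j < 2; j++)
            if (rs[i].busy && !rs[i].op[j].ready && rs[i].op[j].sr == dest_reg) {
                rs[i].op[j].value = result;
                rs[i].op[j].ready = true;
            }
    regs[dest_reg].value = result;
    regs[dest_reg].ready = true;
}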

Although the above is derived from Thornton's original algorithm, there are some differences. The functional units of the CDC 6600 do not have reservation stations (one could say that they have reservation stations of depth 1 that are not able to hold operands, only control information). This imposes some additional restrictions. An instruction to be executed on a particular functional unit can be issued even if its source registers are not ready, but a second instruction requiring the same functional unit will block issue until the first one is done. On the other hand, with reservation stations, several instructions can wait for operands at the input of the functional unit. For example, in

S1 <- S2 *F S3
S4 <- S5 *F S6

with the CDC 6600 scoreboard the second instruction will block, while with the algorithm here it will be issued.

Another restriction of the scoreboard is the following. In this example,

S1 <- S2 *F S3
S2 <- S4 + 1

we assume that S2 and S4 are ready, while S3 is not (e.g. it awaits an operand from memory). The second instruction will be completed before the first one, but on the CDC 6600 the result cannot be stored into S2 since it serves as a source register for the first instruction. Therefore, the add functional unit will remain busy until the first instruction completes as well. On the other hand, with the algorithm we have given, S2 has been copied into the floating point multiply reservation station when the first instruction was issued, and therefore the second instruction can complete before the first one.

4.1. Performance Results

We originally planned to simulate the scoreboard as designed by Thornton, but decided that it would lead to a more interesting comparison if multiple reservation stations were allowed. The results of the simulation are shown in Table 5. The total speedup is 1.28. We discuss ways this can be improved by static code scheduling in Section 6.

 Loop   # clock cycles on CRAY-1   # clock cycles for Thornton's algorithm   speedup
   1          17244                     12434                                 1.39
   2          17717                     13500                                 1.31
   3          37037                     28020                                 1.32
   4          21163                     19578                                 1.06
   5          20709                     17030                                 1.22
   6          22045                     18370                                 1.20
   7          10241                      8671                                 1.16
   8           6874                      6381                                 1.08
   9           9744                      8634                                 1.13
  10           9430                      7820                                 1.21
  11          28013                     18001                                 1.56
  12          28001                     16003                                 1.75
  13          17957                     14870                                 1.21
  14          22087                     20273                                 1.09
 total       268262                    209585                                 1.28

Table 5 -- Performance of Thornton's Algorithm; one instruction issued per clock period.

5. An Issue Method Using a Direct Tag Search

In this section we propose an alternative issue algorithm that is related to Tomasulo's algorithm, but which eliminates the need for associative tag comparison hardware in the reservation stations. This algorithm instead uses a direct tag search (DTS), and will be referred to as the "DTS" algorithm. The DTS algorithm imposes the restriction that a particular tag can be stored in only one reservation station. This is easily implemented by associating with each tag in the tag pool a used bit. Whenever a register that is not ready is accessed for the first time, its tag is copied to the respective reservation station and the used bit is set. A second attempt to use the same tag will block issue.

Figure 6 -- Tag Search Table for the DTS Algorithm.

The DTS algorithm allows implementation of the tag search mechanism by a table indexed by tags (Fig. 6), rather than associative hardware. For each tag there is one entry in the table that stores the address of a reservation station. The table is small since there are few tags (we used 5 bits for each tag; there are 32 tags).
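The table indexed by tags might be coded as below; the routine names are ours. dts_record blocks a second use of a tag, and dts_lookup replaces the associative search of Tomasulo's scheme with a direct index.

#include <stdint.h>
#include <stdbool.h>

#define NUM_TAGS 32            /* 32 tags of 5 bits each, as in the text */

typedef struct {
    bool    used[NUM_TAGS];    /* a tag may wait in at most one reservation station */
    uint8_t station[NUM_TAGS]; /* which station that is */
} dts_table;

/* Record that `tag` is now awaited by reservation station `rs_index`.
   Returns false if the tag is already in use; in that case issue must block. */
static bool dts_record(dts_table *t, uint8_t tag, uint8_t rs_index)
{
    if (t->used[tag])
        return false;
    t->used[tag] = true;
    t->station[tag] = rs_index;
    return true;
}

/* When a result tagged `tag` completes, find the single waiting station by a
   direct table lookup rather than an associative compare; -1 if none waits. */
static int dts_lookup(const dts_table *t, uint8_t tag)
{
    return t->used[tag] ? (int)t->station[tag] : -1;
}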

5.1. Performance Results

A comparison of the results for the DTS algorithm with Tomasulo's algorithm reveals that for 9 out of the 14 loops the DTS algorithm achieves speedup similar to Tomasulo's algorithm. This shows that the restriction imposed by the DTS algorithm, namely that a tag can be in no more than one reservation station at a given time, has only a limited effect on performance. The reason is that the following pattern is quite common:

S1 <- off1,A1
S2 <- off2,A1
S3 <- S1 *F S2
S5 <- S3 +F S4

That is, two registers are loaded and are sent to a functional unit whose result is input to another functional unit, and so on. The DTS algorithm is able to process such code at full speed.

On the other hand, if a register is used as an input to a functional unit and at the same time has to be stored (in the memory, or temporarily in the T or B file), then the register, and its tag, have to be used twice, so the DTS algorithm blocks issue. Such cases account for the lower speedup of 5 out of 14 loops.

 Loop   # clock cycles on CRAY-1   # clock cycles for the DTS algorithm   speedup
   1          17244                      8832                              1.95
   2          17717                     11083                              1.60
   3          37037                     28020                              1.32
   4          21163                     20534                              1.03
   5          20709                     17690                              1.17
   6          22045                     17027                              1.29
   7          10241                      5069                              2.02
   8           6874                      5361                              1.28
   9           9744                      6732                              1.45
  10           9430                      8518                              1.11
  11          28013                     17871                              1.57
  12          28001                     16001                              1.75
  13          17957                     14356                              1.25
  14          22087                     17421                              1.27
 total       268262                    194515                              1.38

Table 6 -- Performance of the DTS Issue Logic.

Figure 7 illustrates an example, extracted from Lawrence Livermore Loop 4. The instruction "T02 <- S5" saves register S5 for use during the next pass through the loop. For the DTS issue logic, this instruction has to be blocked, since it attempts to use register S5 as a source for the second time (register S5 is not ready and was used for the first time by the previous instruction). However, neither S5 nor T02 is used before the branch instruction, so this interlock could be postponed by moving instruction "T02 <- S5" down, just before the branch. The 4 clock cycles thus saved, multiplied by the number of iterations through the loop, result in a 4% performance improvement.

      do 175 l=7,107,50
      lw=l
      do 4 j=30,870,5
      x(l-1) = x(l-1) - x(lw)*y(j)
    4 lw = lw+1
      x(l-1) = y(5)*x(l-1)
  175 continue

(a) Fortran code for Lawrence Livermore Loop 4 (banded linear equations).

LOOP4: S6 <- T00
       S1 <- T01
       A3 <- S1
       A2 <- S6
       S3 <- off2,A3   ; x(lw)
       S5 <- off3,A2   ; y(j)
       S4 <- S5 *R S3
       S3 <- T02
       S5 <- S3 -F S4
       S4 <- S1 + S2
       A1 <- B02
       S3 <- S6 + S7
       off4,A7 <- S5   ; x(l-1)
       T02 <- S5
       T01 <- S4
       A0 <- A1 + 1
       T00 <- S3
       B02 <- A0
       JAM LOOP4

(b) Compiled object code (the inner loop) extracted from Lawrence Livermore Loop 4.

Figure 7 -- Example of interlock for the DTS Issue Logic.

6. Further Comments on Code Scheduling

For all the various issue logic simulations, we used as input the object code generated by the CRAY-1 optimizing FORTRAN compiler, without any changes. Therefore, the level of code optimization and scheduling is realistic for the CRAY-1. Tomasulo's algorithm is less sensitive to the order of the instructions, since it does a great deal of dynamic scheduling. It is also capable of dynamic register re-allocation, so it is not as susceptible to the compiler's register allocation method. On the other hand, static scheduling and register allocation have a significant impact on the performance of the DTS issue logic and Thornton's algorithm. Since we have not reorganized the code for the latter two schemes, many dependencies and resource conflicts that appear in the compiled code contribute to lower performance as compared with Tomasulo's algorithm.

Figure 8 shows another example, extracted from loop 1. Registers S2, S6, S1, S3 and S4 are designated as destination registers for the first time in the upper half of the loop, and then for the second time, almost in the same order, in the lower half of the loop. The second usage of S2 (S2 <- S4 *R S6) as a destination register causes an immediate blockage for Thornton's algorithm, since S2 is still busy from the previous load instruction (S2 <- off1,A1). Any independent instruction inserted just before instruction "S2 <- S4 *R S6" will execute for free. There are 3 such instructions before the branch:

A2 <- B02
A0 <- A2 + 1
B02 <- A0

This simple rescheduling gives a performance improvement of 10.7%.

      q = 0.0
      do 1 k = 1,400
    1 x(k) = q + y(k)*(r*z(k+10) + t*z(k+11))

(a) Fortran code for Lawrence Livermore Loop 1 (hydro excerpt).

LOOP1: S5 <- T00
       A1 <- S5
       S2 <- off1,A1   ; z(k+10)
       S6 <- off2,A1   ; z(k+11)
       S1 <- T02
       S3 <- S1 *R S2
       S4 <- T01
       S2 <- S4 *R S6
       S6 <- off3,A1
       S1 <- S2 +F S3
       S4 <- S6 *R S1
       S3 <- S5 + S7
       A2 <- B02
       A0 <- A2 + 1
       off4,A1 <- S4
       T00 <- S3
       B02 <- A0
       JAM LOOP1

(b) Assembly code extracted from Lawrence Livermore Loop 1.

Figure 8 -- Example of interlock for Thornton's Algorithm.

Since there is a large gap between Tomasulo's and Thornton's algorithms for loop 1 (1.95 vs 1.39), we tried to see if this gap could be closed by reallocating registers for Thornton's algorithm. The following code does the processing of loop 1, except for index and address calculations:

S1 <- off1,A1
S2 <- off2,A1
S4 <- S1 *F S3
S6 <- S2 *F S5
S7 <- S4 +F S6
S8 <- off3,A1
S9 <- S7 *F S8
off4,A1 <- S9

Since registers are not re-used during the same pass through the loop, Thornton's algorithm would run as fast as Tomasulo's. But we need 9 scalar registers, more than are available on the CRAY-1. The high speedup achieved by Tomasulo's algorithm demonstrates the importance of its ability to reallocate registers dynamically.

7. Summary and Conclusions

We have discussed design tradeoffs for control of pipelined processors. Performance of a pipelined processor depends greatly on its clock period and the order in which instructions are executed. Simple control schemes allow a short clock period and place the burden of code scheduling on the compiler. Complex control schemes generally require a longer clock period, but are less susceptible to the order of the instructions generated by the compiler. The latter are also able to take advantage of information only available at run time and "dynamically" reschedule the instructions. Additional factors to be considered are hardware cost, debugging and maintenance.

We have presented a quantitative measure of the speedup achievable by sophisticated issue logic schemes. The CRAY-1 scalar architecture is used as a basis for comparison. Simulation results of the 14 Lawrence Livermore Loops executed on 4 different issue logic mechanisms show the performance gain achievable by various degrees of issue logic complexity. Tomasulo's algorithm gives a total speedup of 1.58. The direct tag search (DTS) algorithm, introduced in this paper, allows dynamic scheduling while eliminating the need for associative tag comparison hardware. The DTS issue logic achieves a total speedup of 1.38, and thus retains much of the performance gain of Tomasulo's algorithm. A derivative of Thornton's algorithm gives a total gain of 1.28.

In our model, the large intermediate register files of the CRAY-1 (B and T) are treated as a unit, since it is not practical to assign tags to so many registers. With Tomasulo's algorithm, this was the cause of a relatively low performance improvement for three loops. With an architecture that does not have the large number of registers the CRAY-1 has, it would be possible to use tags for all the registers, thus increasing the speedup of the above loops.

Finally, we have discussed the impact of code scheduling on the simulation results. Tomasulo's algorithm is less sensitive to the order of the instructions, since it does a great deal of dynamic scheduling and register re-allocation. On the other hand, static scheduling has a significant impact on the performance of the DTS and Thornton's algorithms. We gave specific examples of how the performance of the latter two algorithms can be improved even by simple code scheduling.

8. Acknowledgements

This is material based upon work supported by the National Science Foundation under Grant ECS-8207277. The authors would like to thank Nick Pang for the CRAY-1 simulators used for generating the results in Tables 1 and 3.

9. References

[ANDE67] D.W. Anderson, F.J. Sparacio, R.M. Tomasulo, "The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling," IBM Journal, V 11, Jan 1967.

[BOLA67] L.J. Boland, G.D. Granito, A.U. Marcotte, B.U. Messina, J.W. Smith, "The IBM System/360 Model 91: Storage System," IBM Journal, V 11, Jan 1967.

[BONS69] P. Bonseigneur, "Description of the 7600 Computer System," Computer Group News, May 1969.

[BUCH62] W. Buchholz, ed., Planning a Computer System, McGraw-Hill, New York, 1962.

[CDC81] "CDC CYBER 200 Model 205 Computer System Hardware Reference Manual," Control Data Corp., Arden Hills, MN, 1981.

[CRAY77] "The CRAY-1 Computing System," Cray Research, Inc., Publication number 2240008 B, 1977.

[CRAY79] "CRAY-1 Computer Systems, Hardware Reference Manual," Cray Research, Inc., Chippewa Falls, WI, 1979.

[CRAY82] "CRAY X-MP Computer Systems Mainframe Reference Manual," Cray Research, Inc., Chippewa Falls, WI, 1982.

[KOGG81] P.M. Kogge, The Architecture of Pipelined Computers, McGraw-Hill, 1981.

[MCMA72] F.H. McMahon, "FORTRAN CPU Performance Analysis," Lawrence Livermore Laboratories, 1972.

[RUSS78] R.M. Russell, "The CRAY-1 Computer System," Comm. ACM, V 21, N 1, January 1978, pp. 63-72.

[THOR70] J.E. Thornton, Design of a Computer - The Control Data 6600, Scott, Foresman and Co., Glenview, IL, 1970.

[TOMA67] R.M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal, V 11, Jan 1967.
