Multiple Instruction Issue in the NonStop Cyclone System
Robert W. Horst, Richard L. Harris, Robert L. Jardine

Technical Report 90.6
June 1990
Part Number: 48007

Multiple Instruction Issue in the NonStop Cyclone Processor¹

Tandem Computers Incorporated
19333 Vallco Parkway
Cupertino, CA 95014

Abstract

This paper describes the architecture for issuing multiple instructions per clock in the NonStop Cyclone Processor. Pairs of instructions are fetched and decoded by a dual two-stage prefetch pipeline and passed to a dual six-stage pipeline for execution. Dynamic branch prediction is used to reduce branch penalties. A unique microcode routine for each pair is stored in the large duplexed control store. The microcode controls parallel data paths optimized for executing the most frequent instruction pairs. Other features of the architecture include cache support for unaligned double precision accesses, a virtually-addressed main memory, and a novel precise exception mechanism.

¹ A previous version of this paper was published in the conference proceedings of The 17th Annual International Symposium on Computer Architecture, May 28-31, 1990, Seattle, Washington.

1. Introduction

The NonStop Cyclone system is a fault-tolerant mainframe targeted at transaction processing, query processing and batch. Each system consists of four to sixteen processors that are connected by dual high-speed busses (Figure 1). Sections of four processors may be geographically distributed and interconnected by fiber optic cables. Each processor has its own memory and drives two to four I/O channels. Fault detection is performed primarily by the hardware, and fault recovery is performed by the message-based operating system. The system can tolerate a single fault in a processor, peripheral controller, power supply, or cooling system. Failed components can be serviced on-line without disrupting processing.

[Figure 1. Cyclone System Architecture. Recoverable labels: Dynabus+, Dynabus X, Dynabus Y, 20 MB/s parallel, 100 Mbit/s serial fibers, CPU 0 through CPU 15, MEMORY, I/O PROC 0 and 1, DISK CTRL, TAPE CTRL, Section 0 through Section 3.]

Five generations of Tandem computers (NonStop II, TXP, VLX, CLX and Cyclone) are object-code compatible and have been kept current through microcode updates downloaded to writeable control store. The Tandem instruction set has approximately 300 fixed-length (16-bit) instructions, ranging from simple RISC-like instructions to very complex instructions, such as block moves and inter-processor sends, which may take hundreds of clocks to complete. Most operations are zero-address with operands on the top of an eight-word register stack. The basic memory reference instructions are load and store instructions with address displacements relative to a stack pointer or segment base register.

The Cyclone processor is over three times faster than its predecessor. Approximately half of the performance improvement is due to higher clock rates, and the other half is due to the new microarchitecture. Much of the architectural improvement stems from the ability to issue up to two instructions per clock cycle. Other improvements are due to parallel data paths and new designs for the caches and main memory.

This paper describes the architectural aspects of the NonStop Cyclone processor. In particular, it concentrates on the features that have been included to support multiple-instruction issue.

2. Overview

In recent years, advances in technology and computer architecture have allowed the design of processors in which simple instructions can be executed in a single clock cycle. Once that point is reached, further architectural performance improvements must be made by executing more than one instruction per clock. Some previous scientific machines were capable of issuing multiple instructions per clock, but this was done through simultaneous execution of integer and floating point operations. When the instruction set can be partitioned into independent operations that share few resources, then it is possible to design independent function units and to assign each instruction to one of these units. Several instructions can be issued to the function units simultaneously [1].

Issuing multiple integer instructions per clock is more difficult because most integer instructions require use of the same resources. Typically, nearly all instructions access the same register file, and there are many inter-instruction data dependencies. There is no simple partitioning that would easily allow execution of multiple instructions per clock.

Very Long Instruction Word (VLIW) machines use sophisticated compiler technology to generate wide object code to control parallel data paths [2]. Typically, each VLIW implementation has its own unique object code format. While VLIW is useful in some situations, our environment demands object-code compatibility between generations of machines. It was essential to find a way to detect the parallelism at run-time rather than at compile-time.

The term "superscalar" was recently coined to describe machines that issue multiple instructions per clock, yet produce the same results as machines that execute instructions sequentially [3]. At about the same time the NonStop Cyclone system was announced, superscalar microprocessors were announced by Intel and IBM. The primary difference between the Cyclone processor and other superscalar designs is in the selection of which sets of instructions are to be issued simultaneously. Other machines have divided the instruction set into categories, such as branches, memory reference, and execution operators. In those machines, at most one instruction from each category can be issued simultaneously.

During the design of the Cyclone processor, we recognized that there may be many cases where several sequential instructions from the same category (or even the same instruction) should be issued simultaneously. For instance, in our stack-based machine, it is common to sequentially load two literal constants onto the register stack with Load Immediate (LDI) instructions. This pair of instructions, LDI&LDI, could easily be executed in a single clock with appropriate data path flexibility and enough register file ports. However, there was no obvious way to partition the machine into independent function units to which instructions could be assigned. Some pairs could benefit from separate ALUs, while others could benefit from separate partitions for memory reference and ALU. A few operations even suggest a bit-partitioning; one frequent pair has separate instructions to load a full-word literal into a register from left and right half-word literals.

Rather than partitioning the processor into independent function units, we chose to use firmware control and to program the microcode routines for each unique pair individually. In this way, there are no artificial restrictions on which instructions can be paired. In addition, by using microcode control, we do not restrict pairable operators to ones that can execute in a single clock cycle. For instance, instructions that use indirect addressing make two sequential accesses to the data cache and require three clocks to complete. However, it is still beneficial to pair indirect operators with other instructions. It takes three clocks to perform an indirect load, yet takes no more clocks when the indirect load is paired with a branch, immediate, or add instruction.

Once we decided to control pair execution with unique microcode routines, we could decide on a case-by-case basis whether to include the hardware support to be able to execute a pair in a single clock cycle. A hardware performance monitor was built, and instruction-pair frequencies were gathered for transaction processing applications. We then examined the frequencies to determine which hardware would gain the most performance for the least cost.

Figure 2 shows the pairing matrix for some representative instructions. Of the pairs shown, all except those in the last row execute in a single clock. The indirect loads require three clocks. In the current microcode, the full table of 2014 pairs has 38 "first" instructions (out of a possible 64) and 53 "second" instructions (out of a possible 127). In future microcode releases, more pairs may be added for improved performance.

[Figure 2: pairing matrix. Axes: FIRST INSTRUCTION (columns) and SECOND INSTR (rows). Recoverable column labels: BCC, LDI, LOAD, STOR, DADD, RRM. The BCC row has five cells marked x.]

The most important data path additions for the support of pairing were the inclusion of a nine-port register file and two ALUs that could be controlled independently or linked together for double-precision arithmetic. The flexibility of the data paths also turned out to be of great benefit in the execution of long instructions, such as those that move or scan blocks of data, and those that send or receive messages.

In some cases, we chose not to include data path support for pairing. For instance, support for the pairing of memory reference instructions would have required more than twice the area and cost of a simpler cache. The frequency of successive memory references did not warrant such a cost. Instead, we determined that a greater payoff would result from supporting fast access to unaligned cache data for double-words.

The following sections describe in more detail the support for multiple instruction issue in key parts of the processor: the instruction fetch unit, the control store, the data paths, and the memory.

3. Instruction Fetch Unit

The Cyclone Instruction Fetch Unit (IFU) has four main functions: 1) to fetch instructions from memory, 2) to decode these instructions to determine whether they are candidates for paired execution, 3) to provide the beginning address for microcode execution of the instruction or pair, and 4) to assist in the execution of branching instructions and exception handling.
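
The second and third IFU functions (deciding whether two adjacent instructions are candidates for paired execution, and supplying the microcode starting address for the instruction or pair) can be illustrated with a small table-lookup sketch in Python. This is only an illustration under stated assumptions, not the Cyclone decode logic: the class indices, the base addresses PAIR_BASE and SINGLE_BASE, the BLOCKMOVE mnemonic, the dense row-major layout, and the greedy pairing walk are all hypothetical. The only figures taken from the text are the counts of 38 "first" and 53 "second" instructions, whose product (38 x 53 = 2014) matches the stated size of the full pair table.

```python
from typing import List, Optional, Sequence, Tuple

# Counts quoted in the text for the current microcode.
NUM_FIRST = 38    # instructions allowed in the "first" slot
NUM_SECOND = 53   # instructions allowed in the "second" slot

# Hypothetical class indices. The mnemonics BCC, LDI, LOAD, STOR and DADD
# appear in the paper; the index values are invented for illustration.
FIRST_CLASS = {"BCC": 0, "LDI": 1, "LOAD": 2, "STOR": 3, "DADD": 4}
SECOND_CLASS = {"BCC": 0, "LDI": 1}

# Hypothetical control-store layout: single-instruction routines first,
# then one routine per (first, second) pair in row-major order.
SINGLE_BASE = 0x0000
PAIR_BASE = 0x1000
SINGLE_ENTRY = {"BCC": 0, "LDI": 1, "LOAD": 2, "STOR": 3, "DADD": 4,
                "BLOCKMOVE": 5}  # BLOCKMOVE is a made-up unpairable opcode


def pair_entry(first: str, second: str) -> Optional[int]:
    """Return the pair routine's entry address, or None if not pairable."""
    f = FIRST_CLASS.get(first)
    s = SECOND_CLASS.get(second)
    if f is None or s is None:
        return None
    return PAIR_BASE + f * NUM_SECOND + s


def issue(stream: Sequence[str]) -> List[Tuple[int, Tuple[str, ...]]]:
    """Walk the stream greedily, pairing adjacent instructions when possible.

    Returns one (entry address, mnemonics) tuple per issued routine.
    """
    issued: List[Tuple[int, Tuple[str, ...]]] = []
    i = 0
    while i < len(stream):
        if i + 1 < len(stream):
            entry = pair_entry(stream[i], stream[i + 1])
            if entry is not None:
                issued.append((entry, (stream[i], stream[i + 1])))
                i += 2
                continue
        # Fall back to the single-instruction routine.
        issued.append((SINGLE_BASE + SINGLE_ENTRY[stream[i]], (stream[i],)))
        i += 1
    return issued


if __name__ == "__main__":
    for entry, mnemonics in issue(["LDI", "LDI", "BLOCKMOVE", "LOAD", "BCC"]):
        print(hex(entry), "&".join(mnemonics))
```

For the hypothetical stream LDI, LDI, BLOCKMOVE, LOAD, BCC, the walk issues three routines for five instructions: the pair LDI&LDI, BLOCKMOVE alone, and the pair LOAD&BCC. A dense table indexed by class numbers is used here only because it makes the 38 x 53 arithmetic visible; the paper itself states only that each pair has its own microcode routine.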