§1 1 1. Introduction. This Program Is the Heart of The
Total Page:16
File Type:pdf, Size:1020Kb
§1 MMIX-PIPE INTRODUCTION 1 1. Introduction. This program is the heart of the meta-simulator for the ultra-configurable MMIX pipeline: It defines the MMIX run routine, which does most of the work. Another routine, MMIX init , is also defined here, and so is a header file called mmix_pipe.h. The header file is used by the main routine and by other routines like MMIX config , which are compiled separately. Readers of this program should be familiar with the explanation of MMIX architecture as presented in the main program module for MMMIX. A lot of subtle things can happen when instructions are executed in parallel. Therefore this simulator ranks among the most interesting and instructive programs in the author’s experience. The author has tried his best to make everything correct ... but the chances for error are great. Anyone who discovers a bug is therefore urged to report it as soon as possible; please see http://mmix.cs.hm.edu/bugs/ for instructions. It sort of boggles the mind when one realizes that the present program might someday be translated by a C compiler for MMIX and used to simulate itself. 2 INTRODUCTION MMIX-PIPE §2 2. This high-performance prototype of MMIX achieves its efficiency by means of “pipelining,” a technique of overlapping that is explained for the related DLX computer in Chapter 3 of Hennessy & Patterson’s book Computer Architecture (second edition). Other techniques such as “dynamic scheduling” and “multiple issue,” explained in Chapter 4 of that book, are used too. One good way to visualize the procedure is to imagine that somebody has organized a high-tech car repair shop according to similar principles. There are eight independent functional units, which we can think of as eight groups of auto mechanics, each specializing in a particular task; each group has its own workspace with room to deal with one car at a time. Group F (the “fetch” group) is in charge of rounding up customers and getting them to enter the assembly-line garage in an orderly fashion. Group D (the “decode and dispatch” group) does the initial vehicle inspection and writes up an order that explains what kind of servicing is required. The vehicles go next to one of the four “execution” groups: Group X handles routine maintenance, while groups XF, XM, and XD are specialists in more complex tasks that tend to take longer. (The XF people are good at floating the points, while the XM and XD groups are experts in multilink suspensions and differentials.) When the relevant X group has finished its work, cars drive to M station, where they send or receive messages and possibly pay money to members of the “memory” group. Finally all necessary parts are installed by members of group W, the “write” group, and the car leaves the shop. Everything is tightly organized so that in most cases the cars move in synchronized fashion from station to station, at regular 100-nanocentury intervals. In a similar way, most MMIX instructions can be handled in a five-stage pipeline, F–D–X–M–W, with X replaced by XF for floating-point addition or conversion, or by XM for multiplication, or by XD for division or square root. Each stage ideally takes one clock cycle, although XF, XM, and (especially) XD are slower. If the instructions enter in a suitable pattern, we might see one instruction being fetched, another being decoded, and up to four being executed, while another is accessing memory, and yet another is finishing up by writing new information into registers; all this is going on simultaneously during one clock cycle. Pipelining with eight separate stages might therefore make the machine run up to 8 times as fast as it could if each instruction were being dealt with individually and without overlap. (Well, perfect speedup turns out to be impossible, because of the shared M and W stages; the theory of knapsack programming, to be discussed in Section 7.7 of The Art of Computer Programming, tells us that the maximal achievable speedup is at most 8 − 1/p − 1/q − 1/r when XF, XM, and XD have delays bounded by p, q, and r cycles. But we can achieve a factor of more than 7 if we are very lucky.) Consider, for example, the ADD instruction. This instruction enters the computer’s processing unit in F stage, taking only one clock cycle if it is in the cache of instructions recently seen. Then the D stage recognizes the command as an ADD and acquires the current values of $Y and $Z; meanwhile, of course, another instruction is being fetched by F. On the next clock cycle, the X stage adds the values together. This prepares the way for the M stage to watch for overflow and to get ready for any exceptional action that might be needed with respect to the settings of special register rA. Finally, on the fifth clock cycle, the sum is either written into $X or the trip handler for integer overflow is invoked. Although this process has taken five clock cycles (that is, 5υ), the net increase in running time has been only 1υ. Of course congestion can occur, inside a computer as in a repair shop. For example, auto parts might not be readily available; or a car might have to sit in D station while waiting to move to XM, thereby blocking somebody else from moving from F to D. Sometimes there won’t necessarily be a steady stream of customers. In such cases the employees in some parts of the shop will occasionally be idle. But we assume that they always do their jobs as fast as possible, given the sequence of customers that they encounter. With a clever person setting up appointments—translation: with a clever programmer and/or compiler arranging MMIX instructions—the organization can often be expected to run at nearly peak capacity. In fact, this program is designed for experiments with many kinds of pipelines, potentially using additional functional units (such as several independent X groups), and potentially fetching, dispatching, and executing several nonconflicting instructions simultaneously. Such complications make this program more difficult than a simple pipeline simulator would be, but they also make it a lot more instructive because we can get a better understanding of the issues involved if we are required to treat them in greater generality. §3 MMIX-PIPE INTRODUCTION 3 3. Here’s the overall structure of the present program module. #include <stdio.h> #include <stdlib.h> #include <math.h> #include "abstime.h" h Preprocessor definitions i h Header definitions 6 i h Type definitions 11 i h Global variables 20 i h External variables 4 i h Internal prototypes 13 i h External prototypes 9 i h Subroutines 14 i h External routines 10 i 4. The identifier Extern is used in MMIX-PIPE to declare variables that are accessed in other modules. Actually all appearances of ‘Extern’ are defined to be blank here, but ‘Extern’ will become ‘extern’ in the header file. #define Extern /∗ blank for us, extern for them ∗/ format Extern extern h External variables 4 i ≡ Extern int verbose; /∗ controls the level of diagnostic output ∗/ See also sections 29, 59, 60, 66, 69, 77, 86, 87, 98, 115, 136, 150, 168, 207, 211, 214, 242, 247, 284, and 349. This code is used in sections 3 and 5. 5. The header file repeats the basic definitions and declarations. h mmix−pipe.h 5 i ≡ #define Extern extern h Header definitions 6 i h Type definitions 11 i h External variables 4 i h External prototypes 9 i 6. Subroutines of this program are declared first with a prototype, as in ANSI C, then with an old-style C function definition. The following preprocessor commands make this work correctly with both new-style and old-style compilers. h Header definitions 6 i ≡ #ifdef __STDC__ #define ARGS(list ) list #else #define ARGS(list )() #endif See also sections 7, 8, 52, 57, 129, and 166. This code is used in sections 3 and 5. 4 INTRODUCTION MMIX-PIPE §7 7. Some of the names that are natural for this program are in conflict with library names on at least one of the host computers in the author’s tests. So we bypass the library names here. h Header definitions 6 i +≡ #define random my random #define fsqrt my fsqrt #define div my div 8. The amount of verbosity depends on the following bit codes. h Header definitions 6 i +≡ #define issue bit (1 ¿ 0) /∗ show control blocks when issued, deissued, committed ∗/ #define pipe bit (1 ¿ 1) /∗ show the pipeline and locks on every cycle ∗/ #define coroutine bit (1 ¿ 2) /∗ show the coroutines when started on every cycle ∗/ #define schedule bit (1 ¿ 3) /∗ show the coroutines when scheduled ∗/ #define uninit mem bit (1 ¿ 4) /∗ complain when reading from an uninitialized chunk of memory ∗/ #define interactive read bit (1 ¿ 5) /∗ prompt user when reading from I/O location ∗/ #define show spec bit (1 ¿ 6) /∗ display special read/write transactions as they happen ∗/ #define show pred bit (1 ¿ 7) /∗ display branch prediction details ∗/ #define show wholecache bit (1 ¿ 8) /∗ display cache blocks even when their key tag is invalid ∗/ 9. The MMIX init ( ) routine should be called exactly once, after MMIX config ( ) has done its work but before the simulator starts to execute any programs. Then MMIX run( ) can be called as often as the user likes. h External prototypes 9 i ≡ Extern void MMIX init ARGS((void)); Extern void MMIX run ARGS((int cycs , octa breakpoint )); See also sections 38, 161, 175, 178, 180, 209, 212, and 252.