The HEP Parallel Processor

by James W. Moore

Although there is an abundance of concepts for parallel computing, there is a dearth of experimental data delineating their strengths and weaknesses. Consequently, for the past three years personnel in the Laboratory's Computing Division have been conducting experiments on a few parallel computing systems. The data thus far are uniformly positive in supporting the idea that parallel processing may yield substantial increments in computing power. However, the amount of data that we have been able to collect is small because the experiments had to be conducted at sites away from the Laboratory, often in "software-poor" environments.

We recently leased a Heterogeneous Element Processor (HEP) manufactured by Denelcor, Inc. of Denver, Colorado. This machine (first developed for the Army Ballistic Research Laboratories at Aberdeen) is a parallel processor suitable for general-purpose applications. We and others throughout the country will use the HEP to explore and evaluate parallel-processing techniques for applications representative of future supercomputing requirements. Only the beginning steps have been taken, and many difficulties remain to be resolved as we move from experiments that use one or two of this machine's processors to those that use many. But what are the principles of the HEP?

Parallel processing can be used separately or concurrently on two types of information: instructions and data. Much of the early parallel processing concentrated on multiple-data streams. However, computer systems such as the HEP can handle both multiple-instruction streams and multiple-data streams. These are called MIMD machines.

The HEP achieves MIMD with a system of hardware and software that is one of the most innovative architectures since the advent of electronic computing. In addition, it is remarkably easy to use with the FORTRAN language. In its maximum configuration it will be capable of executing up to 160 million instructions per second.

The Architecture

Figure 1 indicates the general architecture of the HEP. The machine consists of a number of process execution modules (PEMs), each with its own data memory bank, connected to the HEP switch. In addition, there are other processors connected to the switch, such as the operating system processor and the disk processor. Each PEM can access its own data memory bank directly, but access to most of the memory is through the switch.

In a MIMD architecture, entire programs or, more likely, pieces of programs, called processes, execute in parallel, that is, concurrently. Although each process has its own independent instruction stream operating on its own data stream, processes cooperate by sharing data and solving parts of the same problem in parallel. Thus, throughput can be increased by a factor of N, where N is the average number of operations executed concurrently.

The HEP implements MIMD with up to sixteen PEMs, each PEM capable of executing up to sixty-four processes concurrently. It should be noted, however, that these upper limits may not be the most efficient configuration for a given, or even for most, applications. Any number of PEMs can cooperate on a job, or each PEM may be running several unrelated jobs. All of the instruction streams and their associated data streams are held in main memory while the associated processes are active.

How is parallel processing handled within an individual PEM? This is done by "pipelining" instructions so that several are in different phases of execution at any one moment. A process is selected for execution each machine cycle, a single instruction for that process is started, and the process is made unavailable for further execution until that instruction is complete. Because most instructions require eight cycles to complete, at least eight processes must be executed concurrently in order to use a PEM fully. However, memory access instructions require substantially more than eight cycles. Thus, in practice, about twelve concurrent processes are needed for full utilization, and a single HEP PEM can be considered a "virtual" 8- to 12-processor machine. If a given application is formulated and executed using p processes (where p is an integer from 1 to 12), then execution time for the application will be inversely proportional to p.

Now what happens when individual PEMs are linked together? Each PEM has its own program memory to prevent conflicts in accessing instructions, and all PEMs are connected to the large number of other data memory banks through the HEP switch (a high-speed, packet-switched network). One result of these connections is that the number of switch nodes increases more rapidly than the number of PEMs. As one changes from a 1-PEM system toward the maximum 16-PEM configuration, the transmittal time through the HEP switch, called latency, quickly becomes substantial.

Fall 1983 LOS ALAMOS SCIENCE Frontiers of Supercomputing

[Fig. 2. An example of HEP FORTRAN. The first portion of the program sets up the variables to process a 100-column array using 12 concurrent processes: $FIN remains set at empty throughout the computations, $COL will count up through the 100 columns, and $PROCS will count down as the 12 processes die off. A DO loop CREATEs the 12 processes, each of which does its computations using SUBROUTINE COL. A statement, otherwise trivial, stops the main program if $FIN is not yet set to full. The final portion terminates each process, counting down with $PROCS and then setting $FIN to full as the last process is killed.]
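The per-PEM pipelining arithmetic above (one instruction issued per cycle, a typical instruction taking eight cycles, hence eight or more processes for full utilization) can be sketched as a small timing model. This is our back-of-the-envelope reconstruction, not anything from the HEP itself; the function name and the 9600-instruction workload are invented for illustration, and only the eight-cycle figure comes from the article.

```python
# A rough model of a HEP-style barrel pipeline: the PEM issues at most one
# instruction per cycle, and a process may not issue again until its previous
# instruction completes, LATENCY cycles later.
LATENCY = 8  # cycles for a typical (non-memory) HEP instruction

def cycles_to_finish(total_instructions, p):
    """Cycles to retire a fixed workload split evenly over p processes."""
    per_process = total_instructions // p
    # With p processes interleaved round-robin, a given process gets an issue
    # slot every max(LATENCY, p) cycles: pipeline depth limits it when p is
    # small, and competition for the single issue port limits it when p is
    # large.
    return per_process * max(LATENCY, p)

# A workload of 9600 instructions:
one    = cycles_to_finish(9600, 1)    # 76800 cycles
four   = cycles_to_finish(9600, 4)    # 19200 cycles -- 4x faster
eight  = cycles_to_finish(9600, 8)    # 9600 cycles  -- 8x faster, pipeline full
twelve = cycles_to_finish(9600, 12)   # 9600 cycles  -- no further gain
```

Execution time falls inversely with p until the eight-stage pipeline is full, after which extra processes hide latency but add no speed. With memory instructions taking more than eight cycles, the break-even point shifts toward twelve, matching the "virtual 8- to 12-processor" description.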
Such latency increases the number of processes that must be running in each PEM to achieve full utilization. Although there is not enough data yet to provide good estimates on how fast latency actually increases with the number of PEMs, experience with a 2-PEM system suggests that a 4-PEM system will require about twenty concurrent processes in each PEM.

Process Synchronization

A critical issue in MIMD machines is the synchronization of processes. The HEP solves this problem in a simple and elegant manner. Each 64-bit data word has an extra bit that is set to full each time a datum is stored and is cleared to empty each time a datum is fetched. In addition, two sets of memory instructions are employed. One set is used normally throughout most of the program. This set ignores the extra bit and will fetch or store a data word regardless of whether the bit is full or empty. Data may then be used as often as required in a process.

The second set of instructions is defined through the use of asynchronous variables. Typically, this set is used only at the start or finish of a process, acting as a barrier against interference from other processes. The set will not fetch from an empty word or store into a full word. Thus, to synchronize several

FORTRAN Extensions to Support Parallelism

Only two extensions to standard FORTRAN are required to exploit the parallelism inherent in the HEP: process creation and asynchronous variables. Standard FORTRAN can, in fact, handle both, but the current HEP FORTRAN has extensions specifically tailored to do so. These extensions allow the programmer to create processes in parallel as needed and then let them disappear once they are no longer needed. Also, the number of PEMs being used will vary with the number of processes that are created at any given moment.

Process creation syntax is almost identical to that for calling subroutines: CALL is replaced by CREATE. However, in a normal program, once a subroutine is CALLed, the main program stops; in HEP FORTRAN, the main program may continue while a CREATEd process is being worked on. If the main program has no other work, it may CALL a subroutine and put itself on equal footing with the other processes until it exits the subroutine. A process is eliminated when it reaches the normal RETURN statement.

We can illustrate these techniques by showing how the HEP is used to process an array in which each column needs to be processed in the same manner but independently of the other columns (Fig. 2). A process that is not computing checks the first of these variables both to see if it can start a computation and, if so, which column is next in line. At the end of that computation, and regardless of what stage any other process has reached, the process checks again to see if there are further columns to be dealt with. If not, the process is terminated. A second asynchronous variable counts down as the processes die off. When the last operating process completes its computation, a number is stored in a third, previously empty asynchronous variable, setting its extra bit to full. This altered variable is a signal to the main program that it may use the recently generated data. This method tends to smooth irregularities in process execution time arising from disparities in the amount of processing done on the individual columns and, further, does not require changes if the dimension of the array is changed.

More elegant syntactic constructs can be devised, but the HEP extensions are workable. For well-structured code, conversion to the HEP is quite easy. For more complex programs the main difficulty is verifying that the parallel processes defined are in fact independent. No tools currently exist to help with this verification.

Early Experience with the HEP

Several relatively small FORTRAN codes have been used on the HEP at Los Alamos.
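HEP FORTRAN itself can no longer be run, but the full/empty discipline and the counting scheme of Fig. 2 translate directly into modern threads. The sketch below uses Python threading as a stand-in: AsyncVar mimics a HEP asynchronous variable (a fetch blocks until the word is full and leaves it empty; a store blocks until it is empty and leaves it full), and a trivial squaring loop stands in for the real column computation of SUBROUTINE COL. All of the names here are our inventions, not HEP's.

```python
# Sketch of the Fig. 2 scheme with Python threads (not HEP FORTRAN):
# one variable hands out column numbers, one counts processes down as they
# die off, and one signals the main program when the last process is done.
import threading

class AsyncVar:
    """One word with a full/empty bit; get() empties it, put() fills it."""
    def __init__(self):
        self._full = False
        self._value = None
        self._cv = threading.Condition()

    def get(self):                      # will not fetch from an empty word
        with self._cv:
            self._cv.wait_for(lambda: self._full)
            self._full = False
            self._cv.notify_all()
            return self._value

    def put(self, value):               # will not store into a full word
        with self._cv:
            self._cv.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cv.notify_all()

NCOLS, NPROCS = 100, 12                 # the numbers used in Fig. 2
col, procs, fin = AsyncVar(), AsyncVar(), AsyncVar()
result = [0] * NCOLS

def worker():                           # plays the role of SUBROUTINE COL
    while True:
        c = col.get()                   # which column is next in line?
        if c >= NCOLS:
            col.put(c)                  # no columns left: let the others see
            break                       # that too, then die off
        col.put(c + 1)
        result[c] = c * c               # stand-in for the real column work
    n = procs.get() - 1                 # count down as processes die off
    procs.put(n)
    if n == 0:
        fin.put(True)                   # last one out sets $FIN to full

col.put(0)                              # set up the shared variables first
procs.put(NPROCS)
threads = [threading.Thread(target=worker) for _ in range(NPROCS)]
for t in threads:                       # CREATE the 12 processes
    t.start()
done = fin.get()                        # main program blocks until $FIN is full
for t in threads:
    t.join()
```

As in the article's description, work distribution is self-scheduling: each idle worker grabs the next column number, so uneven column costs are smoothed automatically, and changing NCOLS requires no other change to the code.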