
The HEP Parallel Processor

by James W. Moore

Although there is an abundance of concepts for parallelism, there is a dearth of experimental data delineating their strengths and weaknesses. Consequently, for the past three years personnel in the Laboratory's Computing Division have been conducting experiments on a few parallel computing systems. The data thus far are uniformly positive in supporting the idea that parallel processing may yield substantial increments in computing power. However, the amount of data that we have been able to collect is small because the experiments had to be conducted at sites away from the Laboratory, often in "software-poor" environments.

We recently leased a Heterogeneous Element Processor (HEP) manufactured by Denelcor, Inc. of Denver, Colorado. This machine (first developed for the Army Ballistic Research Laboratories at Aberdeen) is a parallel processor suitable for general-purpose applications. We and others throughout the country will use the HEP to explore and evaluate parallel-processing techniques for applications representative of future supercomputing requirements. Only the beginning steps have been taken, and many difficulties remain to be resolved as we move from experiments that use one or two of this machine's processors to those that use many.

But what are the principles of the HEP? Parallel processing can be used separately or concurrently on two types of information: instructions and data. Much of the early parallel processing concentrated on multiple-data streams. However, computer systems such as the HEP can handle both multiple-instruction streams and multiple-data streams. These are called MIMD machines.

The HEP achieves MIMD with a system of hardware and software that is one of the most innovative architectures since the advent of electronic computing. In addition, it is remarkably easy to use with the FORTRAN language. In its maximum configuration it will be capable of executing up to 160 million instructions per second.

The Architecture

Figure 1 indicates the general architecture of the HEP. The machine consists of a number of process execution modules (PEMs), each with its own data memory bank, connected to the HEP switch. In addition, there are other processors connected to the switch, such as the operating system processor and the disk processor. Each PEM can access its own data memory bank directly, but access to most of the memory is through the switch.

In a MIMD architecture, entire programs or, more likely, pieces of programs, called processes, execute in parallel, that is, concurrently. Although each process has its own independent instruction stream operating on its own data stream, processes cooperate by sharing data and solving parts of the same problem in parallel. Thus, throughput can be increased by a factor of N, where N is the average number of operations executed concurrently.

The HEP implements MIMD with up to sixteen PEMs, each PEM capable of executing up to sixty-four processes concurrently. It should be noted, however, that these upper limits may not be the most efficient configuration for a given, or even for most, applications. Any number of PEMs can cooperate on a single job, or each PEM may be running several unrelated jobs. All of the instruction streams and their associated data streams are held in main memory while the associated processes are active.

How is parallel processing handled within an individual PEM? This is done by "pipelining" instructions so that several are in different phases of execution at any one moment. A process is selected for execution each machine cycle, a single instruction for that process is started, and the process is made unavailable for further execution until that instruction is complete. Because most instructions require eight cycles to complete, at least eight processes must be executed concurrently in order to use a PEM fully. However, memory access instructions require substantially more than eight cycles. Thus, in practice, about twelve concurrent processes are needed for full utilization, and a single HEP PEM can be considered a "virtual" 8- to 12-processor machine. If a given application is formulated and executed using p processes (where p is an integer from 1 to 12), then execution time for the application will be inversely proportional to p.

Now what happens when individual PEMs are linked together? Each PEM has its own program memory to prevent conflicts in accessing instructions, and all PEMs are connected to the large number of other data memory banks through the HEP switch (a high-speed, packet-switched network). One result of these connections is that the number of switch nodes increases more rapidly than the number of PEMs. As one changes from a 1-PEM system toward the maximum 16-PEM configuration, the transmittal time through the HEP switch, called the latency,
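The pipeline arithmetic above implies a simple throughput model: with roughly twelve issue slots to keep busy, a PEM running p processes is utilized in proportion to p until the pipeline saturates. The sketch below is an illustration of that reasoning, not HEP documentation; the 12-slot figure is the in-practice number quoted in the text.

```python
# Rough throughput model of a pipelined HEP PEM, following the text:
# one instruction issues per cycle, and a process cannot issue again
# until its previous instruction completes (~12 cycles in practice,
# once memory-access instructions are included).

SLOTS = 12  # concurrent processes needed for full utilization (from the text)

def utilization(p):
    """Fraction of machine cycles doing useful work with p processes."""
    return min(p, SLOTS) / SLOTS

def relative_time(p, work=1.0):
    """Execution time relative to a fully utilized PEM."""
    return work / utilization(p)

for p in (1, 4, 8, 12, 16):
    print(f"p={p:2d}  utilization={utilization(p):.2f}  time={relative_time(p):.2f}")
```

For p from 1 to 12 the model gives execution time proportional to 1/p, matching the article's claim; beyond twelve processes the pipeline is already full and extra processes buy nothing on a single PEM.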

72 Fall 1983 LOS ALAMOS SCIENCE Frontiers of Supercomputing


This portion of the program sets up the variables to process a 100-column array using 12 concurrent processes. $FIN remains set at empty throughout the computations, $COL will count up through the 100 columns, and $PROCS will count down as the 12 processes die off. This DO loop CREATEs the 12 processes. Each process does its computations using COL.

This statement, otherwise trivial, stops the main program if $FIN is not yet set to full.

This portion terminates each process, counting down with $PROCS, and then setting $FIN to full as the last process is killed.
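HEP FORTRAN itself cannot be run today, but the self-scheduling scheme these annotations describe can be emulated with ordinary threads. In the sketch below, a lock-guarded counter plays the role of $COL, a live-process count plays the role of $PROCS, an event plays the role of $FIN, and the per-column computation is a stand-in for the subroutine of Fig. 2.

```python
import threading

NCOLS, NPROCS = 100, 12
result = [0.0] * NCOLS
lock = threading.Lock()   # guards the shared counters, like asynchronous access
col = 0                   # next column to claim (plays the role of $COL)
procs = NPROCS            # live-process count (plays the role of $PROCS)
fin = threading.Event()   # set "full" by the last process (plays the role of $FIN)

def compute(c):
    # stand-in for the sequential per-column subroutine
    result[c] = float(c) * 2.0

def worker():
    global col, procs
    while True:
        with lock:             # claim the next column, if any remain
            if col >= NCOLS:
                break
            c = col
            col += 1
        compute(c)             # the real work proceeds outside the lock
    with lock:                 # this process dies off
        procs -= 1
        if procs == 0:
            fin.set()          # last process signals the main program

threads = [threading.Thread(target=worker) for _ in range(NPROCS)]
for t in threads:
    t.start()
fin.wait()                     # main program blocks until $FIN is "full"
for t in threads:
    t.join()
```

As in the article, each process grabs the next available column rather than being assigned a fixed share, so uneven per-column costs are smoothed automatically and nothing changes if NCOLS changes.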

Fig. 2. An example of HEP FORTRAN.


quickly becomes substantial. Such latency increases the number of processes that must be running in each PEM to achieve full utilization. Although there is not enough data yet to provide good estimates on how fast latency actually increases with the number of PEMs, experience with a 2-PEM system suggests that a 4-PEM system will require about twenty concurrent processes in each PEM.

Process Synchronization

A critical issue in MIMD machines is the synchronization of processes. The HEP solves this problem in a simple and elegant manner. Each 64-bit data word has an extra bit that is set to full each time a datum is stored and is cleared to empty each time a datum is fetched. In addition, two sets of memory instructions are employed. One set is used normally throughout most of the program. This set ignores the extra bit and will fetch or store a data word regardless of whether the bit is full or empty. Data may then be used as often as required in a process.

The second set of instructions is defined through the use of asynchronous variables. Typically, this set is used only at the start or finish of a process, acting as a barrier against interference from other processes. The set will not fetch from an empty word or store into a full word. Thus, to synchronize several processes, an asynchronous variable is defined that can only be accessed, using the second set of instructions, at the appropriate time by each process. A process that needs to fetch an asynchronous variable will not do so if the extra bit is empty and will not proceed until another process stores into the variable, setting it full. Because the full and empty properties of this extra bit are implemented in the HEP hardware at the user level, requiring no operating-system intervention, the usual synchronization methods (semaphores, etc.) can be used, and process synchronization is very efficient.

FORTRAN Extensions to Support Parallelism

Only two extensions to standard FORTRAN are required to exploit the parallelism inherent in the HEP: process creation and asynchronous variables. Standard FORTRAN can, in fact, handle both, but the current HEP FORTRAN has extensions specifically tailored to do so. These extensions allow the programmer to create processes in parallel as needed and then let them disappear once they are no longer needed. Also, the number of PEMs being used will vary with the number of processes that are created at any given moment.

Process creation syntax is almost identical to that for calling subroutines: CALL is replaced by CREATE. However, in a normal program, once a subroutine is CALLed, the main program stops; in HEP FORTRAN, the main program may continue while a CREATEd process is being worked on. If the main program has no other work, it may CALL a subroutine and put itself on equal footing with the other processes until it exits the subroutine. A process is eliminated when it reaches the normal RETURN.

We can illustrate these techniques by showing how the HEP is used to process an array in which each column needs to be processed in the same manner but independently of the other columns (Fig. 2). First, one defines a subroutine that can process a single column in a sequential fashion. We could use this subroutine by creating a process for each column and then running all the processes in parallel, but there is a limit on the number of processes that each PEM can handle. A better technique would be to CREATE eight to twelve processes per PEM and let the processes self-schedule. Each process selects a column from the array, does the computation for that column, then looks for additional columns to work on.

Several asynchronous variables are the key to this technique. Each process that is not computing checks the first of these variables both to see if it can start a computation and, if so, which column is next in line. At the end of that computation, and regardless of what stage any other process has reached, the process checks again to see if there are further columns to be dealt with. If not, the process is terminated. A second asynchronous variable counts down as the processes die off. When the last operating process completes its computation, a number is stored in a third, previously empty asynchronous variable, setting its extra bit to full. This altered variable is a signal to the main program that it may use the recently generated data. This method tends to smooth irregularities in process execution time arising from disparities in the amount of processing done on the individual columns and, further, does not require changes if the dimension of the array is changed.

More elegant syntactic constructs can be devised, but the HEP extensions are workable. For well-structured code, conversion to the HEP is quite easy. For more complex programs the main difficulty is verifying that the parallel processes defined are in fact independent. No tools currently exist to help with this verification.

Early Experience with the HEP

Several relatively small FORTRAN codes have been used on the HEP at Los Alamos. One such code is SIMPLE, a 2000-line, two-dimensional, Lagrangian, hydro-diffusion code. By partitioning this code into processes, about 99 per cent of it can be executed in parallel. The speedup on a 1-PEM system is close to linear for up to eleven processes and then flattens out, indicating that the PEM is fully utilized. To achieve this degree of speedup on any MIMD machine requires a very high percentage of parallel code. How difficult it will be to achieve a high percentage of parallelism on full-scale production codes is an open question, but the potential payoff is significant. ■
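The full/empty-bit discipline described under Process Synchronization can be emulated in software. The following sketch is a modern analogue, not HEP hardware behavior: a one-word cell whose synchronized store waits for the empty state and whose synchronized fetch waits for the full state, exactly the producer/consumer handshake the article attributes to asynchronous variables.

```python
import threading

class AsyncVariable:
    """One word plus a full/empty bit, emulating a HEP asynchronous variable."""

    def __init__(self):
        self._cv = threading.Condition()
        self._full = False   # the "extra bit": False = empty, True = full
        self._value = None

    def store(self, value):
        """Synchronized store: will not store into a full word."""
        with self._cv:
            while self._full:
                self._cv.wait()
            self._value = value
            self._full = True          # storing sets the bit to full
            self._cv.notify_all()

    def fetch(self):
        """Synchronized fetch: will not fetch from an empty word."""
        with self._cv:
            while not self._full:
                self._cv.wait()
            self._full = False         # fetching clears the bit to empty
            self._cv.notify_all()
            return self._value

# Handshake: the consumer blocks on fetch until the producer stores.
cell = AsyncVariable()
out = []
consumer = threading.Thread(target=lambda: out.append(cell.fetch()))
consumer.start()
cell.store(42)
consumer.join()
print(out[0])  # 42
```

Because the waiting happens per word rather than through a separate semaphore object, higher-level primitives can be built directly on such cells, which is the efficiency argument the article makes for the hardware implementation.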
