Frontiers of Supercomputing

by B. L. Buzbee, N. Metropolis, and D. H. Sharp

Electronic computers were developed, in fact, during and after World War II to meet the need for numerical simulation in the design of nuclear weapons, aircraft, and conventional ordnance. Today, the availability of supercomputers ten thousand times faster than the first electronic devices is having a profound impact on all branches of science and engineering—from astrophysics to elementary particle physics, from fusion energy research to automobile design. The reason is clear: supercomputers extend enormously the range of problems that are effectively solvable.

Although the last forty years have seen a dramatic increase in computer performance, the number of users and the range of applications have been increasing at an even faster rate, to the point that demands for greater performance now far outstrip the improvements in hardware. Moreover, the broadened community of users is also demanding improvements in software that will permit a more comfortable interface between user and machine.

Scientific supercomputing is now at a critical juncture. While the demand for higher speed grows daily, computers as we have known them for the past twenty to thirty years are about as fast as we can make them. The growing demand can only be met by a radical change in computer architecture, a change from a single serial processor whose logical design goes back to Turing and von Neumann to an aggregation of up to a thousand parallel processors that can perform many independent operations concurrently. This radical change in hardware will necessitate improvements in programming languages and software. If made, they could significantly reduce the time needed to translate difficult problems into machine-computable form. What now takes a team of scientists two years to do may be reduced to two months of effort. When this happens, practice in many fields of science and technology will be revolutionized.

These radical changes would also have a large and rapid impact on the nation's economy and security. The skill and effectiveness with which supercomputers can be used to design new and more economical civilian aircraft will determine whether there is employment in Seattle or in a foreign city. Computer-aided design of automobiles is already playing an important role in Detroit's effort to recapture its position in the automobile market. The speed and accuracy with which information can be processed will bear importantly on the effectiveness of our national intelligence activities.

Japan and other foreign countries have already realized the significance of making this quantum jump in supercomputer development. Japan is, in fact, actively engaged in a highly integrated effort to realize it. At this juncture the United States is in danger of losing its long-held leadership in supercomputing.

Various efforts are being made in this country to develop the hardware and software necessary for the change to massively parallel architectures. But these efforts are only loosely coordinated.

How can the United States maintain its worldwide supremacy in supercomputing? What policies and strategies are needed to make the revolutionary step to massively parallel supercomputers? How can individual efforts in computer architecture, software, and languages be best coordinated for this purpose?

To discuss these questions, the Laboratory and the National Security Agency sponsored a conference in Los Alamos on August 15-19, 1983. Entitled "Frontiers of Supercomputing," the conference brought together leading representatives from industry, government, and universities who shared the most recent technical developments as well as their individual perspectives on the tactics needed for rapid progress. Before presenting highlights of the conference, we will review in more depth the importance of supercomputing and the trends in performance and architecture that form the background for the conference discussions.

The Importance of Supercomputers

The term "supercomputer" refers to the most powerful scientific computer available at a given time. The power of a computer is measured by its speed, storage capacity (memory), and precision. Today's commercially available supercomputers, the Cray-1 from Cray Research and the CYBER 205 from Control Data Corporation, have a peak speed of over one hundred million operations per second, a memory of about four million 64-bit "words," and a precision of at least 64 bits.

There are presently about seventy-five of these supercomputers in use throughout the world. They are being used at national laboratories to solve complex scientific problems in weapon design, energy research, meteorology, oceanography, and geophysics. In industry they are being used for design and simulation of very large-scale integrated circuits, design of aircraft and automobiles, and exploration for oil and minerals. To demonstrate why we need even greater speed, we will examine some of these uses in more detail.

The first step in scientific research and engineering design is to model the phenomena being studied. In most cases the phenomena are complex and are modeled by equations that are nonlinear and singular and, hence, refractory to analysis. Nevertheless, one must somehow discern what the model predicts, test whether the predictions are correct, and then (invariably) improve the model. Each of these steps is difficult, time-consuming, and expensive. Supercomputers play an important role in each step, helping to make practical what would otherwise be impractical.

The demand for improved performance of supercomputers stems from our constant desire to bring new problems over the threshold of practicality.

As an example of the need for supercomputers in modeling complex phenomena, consider magnetic fusion research. Magnetic fusion requires heating and compressing a magnetically confined plasma to the extremes of temperature and density at which thermonuclear fusion will occur. During this process unstable motions of the plasma may occur that make it impossible to attain the required final conditions. In addition, state variables in the plasma may change by many orders of magnitude. Analysis of this process is possible only by means of large-scale numerical computations, and scientists working on the problem report the need for computers one hundred times faster and with larger memories than those now available.

Another example concerns computing the equation of state of a classical one-component plasma. A one-component plasma is an idealized system of one ionic species immersed in a uniform sea of electrons such that the whole system is electrically neutral. If the equation of state of this "simple" material could be computed, it would provide a framework from which to study the equations of state of more complicated substances. Before the advent of the Cray-1, the equation of state of a one-component plasma could not be computed throughout the parameter range of interest. Using the Cray-1, scientists developed a better, more complex approximation to the interparticle potential. Improved software and very efficient algorithms made it possible for the Cray-1 to execute the new calculation at about 90 million operations per second. Even so, it took some seven hours; but, for the first time, the equation of state for a one-component plasma was calculated throughout the interesting range.

There are numerous ways in which supercomputers play a crucial role in testing models. Models of the circulation of the oceans or the atmosphere or the motion of tectonic plates cannot be tested in the laboratory, but can be simulated on a large computer. Many phenomena are difficult to investigate experimentally in a way that will not modify the behavior being studied. An example is the flow of reactants and products within an internal combustion engine. Supercomputers are currently being used to study this problem.

Testing a nuclear weapon at the Nevada test site costs several million dollars. Running a modern wind tunnel to test airfoil designs costs 150 million dollars a year. Supercomputers can explore a much larger range of designs than can actually be tested, and expensive test facilities can be reserved for the most promising designs.

Supercomputers are needed to interpret test results. For example, in nuclear weapons tests many of the crucial physical parameters are not accessible to direct observation, and the measured signals are only indirectly related to the underlying physical processes. D. Henderson of Los Alamos characterized the situation well: "The crucial linkage between these signals, the physical processes, and the device design parameters is available only through simulation."

Finally, supercomputers permit scientists and engineers to circumvent some real-world constraints. In some cases environmental considerations impose constraints. Clearly, the safety of nuclear reactors is not well suited to experimental study. But with powerful computers and reliable models, scientists can simulate reactor accidents, either minor or catastrophic, without endangering the environment.

Time can also impose severe constraints on the scientist. Consider, for example, chemical reactions that take place within microseconds. One of the advantages of large-scale numerical simulation is that the scientist using it can "slow the clock" and, with graphic display, observe the associated phenomena in slow motion. At the other end of the spectrum are processes that take years or decades to complete. Here numerical simulation can provide an accelerated picture that may help, for example, in assessing the long-term effects of increasing the percentage of carbon dioxide in our atmosphere.

Trends in Performance and Architecture

Most scientists engaged in solving the complex problems outlined above feel that an increase of speed of at least two orders of magnitude is required to make significant progress. The accompanying figure shows that the increase in speed, while very rapid at the start, now appears to be leveling off.

[Figure: computer speed versus year, extrapolated into the future.]

The rapid growth has been due primarily to advances in microelectronics. First came the switch from cumbersome and capricious vacuum tubes to small and reliable semiconductor transistors. Then in 1958 Jack Kilby invented a method for fabricating many transistors on a single silicon chip a fraction of an inch on a side, the so-called integrated circuit. In the early 1960s computer switching circuits were made of chips each containing about a dozen transistors.


This number increased to several thousand (medium-scale integration) in the early 1970s and to several hundred thousand (very large-scale integration, or VLSI) in the early 1980s. Furthermore, since 1960 the cost of transistor circuits has decreased by a factor of about 10,000.

[Photo: Admiral B. R. Inman in his keynote address stressed that progress in the information industry will be a critical factor in meeting our security and economic needs.]

The increased circuit density and decreased cost have had two major impacts on computer power. First, it became possible to build very large, very fast memories at a tolerable cost. Large memories are essential for complex problems and for problems involving a large data base. Second, increased circuit density reduced the time needed for each cycle of logical operations in the computer.

Until recently a major limiting factor on computer cycle time has been the gate, or switch, delays. For vacuum tubes these delays are 10^-5 second, for single transistors 10^-7 second, and for integrated circuits 10^-9 second. With gate delays reduced to less than a nanosecond, cycle times are now limited by the time required for signals to propagate from one part of the machine to another. The cycle times of today's supercomputers, which contain VLSI components, are between 10 and 20 nanoseconds and are roughly proportional to the linear dimensions of the computer, that is, to the length of the longest wire in the machine.

The figure summarizes the history of computer operation, and the data have been extrapolated into the future by approximation with a modified Gompertz curve. The asymptote to that curve, which represents an upper limit on the speed of a single-processor machine, is about three billion operations per second. Is this an accurate forecast in view of forthcoming developments in integrated circuit technology? The technology with the greatest potential for high speed is Josephson-junction technology. However, it is estimated that a supercomputer built with that technology would have a speed of at most one billion operations per second, which is greater than the speed of the Cray-1 or the CYBER 205 by only a factor of ten.

Thus, supercomputers appear to be close to the performance maximum based on our experience with single-processor machines. If we are to achieve an increase in speed of two orders of magnitude or more, we must look to machines with multiple processors arranged in parallel architectures, that is, to machines that perform many operations concurrently.

Three types of parallel architecture hold promise of providing the needed hundredfold increase in performance: lockstep vector processors, tightly coupled parallel processors, and massively parallel devices.

The first type may be the least promising. It has been shown that achieving maximum performance from a lockstep vector processor requires vectorizing at least 90 percent of the operations involved, but a decade of experience with vector processors has revealed that achieving this level of vectorization is difficult.

The second type of architecture employs tightly coupled systems of a few high-performance processors. In principle, collaboration of these processors on a common task can produce the desired hundredfold increase in speed. But important architectural issues, such as the communication geometry between processors and memories, remain unresolved. Analysis of the increase in speed possible from parallel processing shows that the research challenge is to find algorithms, languages, and architecture that, when used as a system, allow a large percentage of work to be processed in parallel with only a minimum number of additional instructions. (See "The Efficiency of Parallel Processing.") However, formulating algorithms for this second type is somewhat easier than for lockstep vector processors.

Recent work on tightly coupled parallel processors has concentrated on systems with two to four vector processors sharing a large memory. Such machines have been used successfully for parallel processing of scientific computation. Logically, the next steps are systems with eight, sixteen, even sixty-four processors; however, scientists may not be able to find sufficient concurrent tasks to achieve high parallelization with sixty-four processors. The problem lies in the granularity of the task, that is, the size of the pieces into which the problem can be broken. To achieve high performance on a given processor, granularity should be large. However, to provide a sufficient number of concurrent tasks to keep a large number of processors busy, granularity will have to decrease, and high performance may be lost.

The situation is even more challenging when we consider a massively parallel system with thousands of processors communicating with thousands of memories. In general, scientists cannot find and manage parallelism for such large numbers of processors. Rather, the software must find it, map it onto the architecture, and manage it. Therein lies a formidable research issue. In fact, the issues of concurrency are so great that some scientists are suggesting that we will have to forego the familiar ordering of computation intrinsic in sequential processing. This has revolutionary implications for algorithms.

In summary, the architecture of supercomputers is likely to undergo fundamental changes within the next few years, and these changes may affect many aspects of large-scale computation. To make this transition successfully, a substantial amount of basic research and development will be needed.
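The point that cycle time is now limited by signal propagation over the longest wire can be made concrete with a rough calculation. The Python sketch below is our illustration, not part of the article; the assumed effective signal speed of about half the speed of light is a common rule of thumb for wiring, and pipelined machines deliver several operations per cycle, so operations per second can exceed the cycle rate.

```python
# Rough illustration of the propagation limit on single-processor cycle time.
# Assumption (not from the article): signals travel at roughly 0.5 c in wiring.

C = 3.0e8                # speed of light in vacuum, m/s
SIGNAL_FRACTION = 0.5    # assumed effective propagation speed as a fraction of c

def min_cycle_time(longest_wire_m: float) -> float:
    """Lower bound (seconds) on cycle time set by signal propagation."""
    return longest_wire_m / (C * SIGNAL_FRACTION)

for wire_m in (3.0, 1.0, 0.3):
    t = min_cycle_time(wire_m)
    print(f"longest wire {wire_m:3.1f} m -> cycle time >= {t * 1e9:5.2f} ns "
          f"(<= {1.0 / t / 1e6:6.1f} million cycles per second)")
```

With a longest wire of a few meters, the bound lands in the 10 to 20 nanosecond range quoted above for today's supercomputers.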

[Photos: Robert S. Cooper, DARPA; Sidney Fernbach; Jacob T. Schwartz, Courant Institute.]

Conference Highlights

National and Industry Perspectives. In his opening comments, Laboratory Director Donald Kerr characterized the present state of affairs as presenting challenges on two frontiers—the intellectual frontier concerned with scientific and technological progress and a leadership frontier concerned with national policies and strategies aimed at maintaining U.S. leadership in supercomputing.

U.S. Senator Jeff Bingaman (New Mexico) raised several questions that were a focus of attention during the conference. How should our overall national effort in supercomputing be coordinated? What is the proper role of government? Are recent initiatives such as the formation of SRC and MCC and the DARPA project sufficient? (SRC, the Semiconductor Research Corporation, and MCC, the Microelectronics and Computer Technology Corporation, are nonprofit research cooperatives drawn from private industry. DARPA, the Pentagon's Defense Advanced Research Projects Agency, has recently inaugurated a Strategic Computing and Survivability project.) Senator Bingaman also stressed that Congress needs to be made more keenly aware of the importance of supercomputing.

In his keynote address Admiral B. R. Inman (U.S. Navy, retired, and now president of MCC) outlined challenges and opportunities facing the United States in coming decades. In the military sphere the USSR will continue to pose a formidable challenge, especially in view of their increasingly effective and mobile conventional forces. (This point was also stressed in the remarks of DARPA director R. S. Cooper and Undersecretary of Defense R. DeLauer.) Robust economic prosperity, however, offers the opportunity for progress toward world stability. Admiral Inman stressed that a critical factor in meeting our security and economic needs will be progress in the information industry: information processing, supercomputing, automation, and robotics. He cited numerous actions that are required to put the United States in a better position to seize and exploit opportunities as they arise. These include greater investments at universities for graduate training in science and mathematics, new organizations for pooling scarce talent and resources, revised governmental procedures (such as three-year authorization bills) to facilitate commitments to longer range research projects, new antitrust legislation, in part to enable the pooling of talent, and a consensus on national security policy.

Both Senator Bingaman and Admiral Inman stressed that the key to rapid progress in the supercomputing field lies in effective collaboration among the academic, industrial, and governmental sectors. Several speakers addressed the question of how the different components of this triad could best support and reinforce each other's activities.

The viewpoint of the supercomputer manufacturers was presented by William Norris of Control Data Corporation and by John Rollwagen of Cray Research. Both speakers called attention to the thinness of the current supercomputer market and pointed out that a larger market is required to sustain an accelerated research and development program. Mr. Norris called for pooling of research and revisions in antitrust laws. He took the occasion to announce the formation of a new firm, ETA Systems, Inc., whose goal is to develop, for delivery by the end of 1986, a computer capable of executing ten billion floating point operations per second.

Mr. Rollwagen called attention to the fact that the market for supercomputers might increase substantially if software were available that made them easier to use. Norris and Rollwagen also suggested that a government program to place supercomputers in universities would not only help the market but would produce other benefits as well.

This point was strongly reinforced from the university standpoint by Nobel laureate Kenneth Wilson. He stressed that the potential market for supercomputers was much greater than was generally supposed and that the key to developing this market was adequate software. Researchers at universities can make a major contribution in this area by, for example, expanding the use of modular programming for large, complex problems. In addition, he emphasized that one of the best ways to create a market for supercomputers is to train many people in their use. This can be done if the universities have supercomputers.

Wilson also pointed out that at present universities are largely isolated from the supercomputer community. For example, while supercomputer manufacturers are planning evolutionary steps in machine design, workers at universities are concentrating on the revolutionary massively parallel architectures and appropriate software and algorithms to make such systems usable. Would not each sector gain from closer interaction? How should this be accomplished?

[Photos: Richard DeLauer, Undersecretary of Defense; Kenneth G. Wilson, Cornell University; John Rollwagen, Cray Research; James F. Decker, DOE.]

Several speakers were concerned about the fact that highly reliable VLSI components, which form an essential element of modern supercomputers, are often difficult to procure from domestic vendors. The result is that more and more of these components are being purchased from Japanese manufacturers. In responding to the suggestion that the U.S. government provide more support to the domestic semiconductor industry, DeLauer pointed out that his agency was already furnishing some help. Moreover, there are numerous domestic industries calling for aid, and available funds are limited. Hence the semiconductor industry as well as the supercomputer industry must find effective ways to leverage the support that the DOD can give.

There was a consensus at the conference on the need for clearly defined national supercomputer goals and on a strategic plan to reach them. The Federal Coordinating Committee for Science and Engineering Technology has asked the DOE to prepare such a plan. The current status of this plan was reviewed by James Decker of the DOE. The plan provides for the government to accelerate its use of supercomputers, to acquire and use experimental systems, and to encourage long-range research and development with tax incentives and increased support.

Throughout the conference the Japanese initiatives in supercomputing were much on people's minds. J. Worlton reviewed Japan's activities in the supercomputer field and assessed the strengths and weaknesses of their development strategies. The effective cooperation that they have achieved between government, industry, and academia was viewed as a substantial asset. While no one suggested that Japanese organizational methods could provide a detailed role model for the United States, a concerted effort to achieve comparable cooperation among the various sectors of the U.S. supercomputing community is certainly in order.

Scientific Developments in Architecture, Software, Algorithms, and Applications. The scientific talks presented at the conference were organized into sessions covering advanced architectures, supercomputing at Los Alamos, software, and algorithms and applications.

The trend toward parallel architectures was abundantly evident in presentations by computer manufacturers and academic researchers. By the end of the decade, commercially supplied systems with eight or more processors will be available. B. Smith of Denelcor presented a new industrial entry, the first attempt, ab initio, at modest parallelism.

Most academic projects make extensive use of VLSI, exploiting 32-bit microprocessors and 256K memory chips. Many of them can be classified as "dancehall" machines. Dancehall systems have processors aligned along one side, memories on the other, and a communication geometry to bind them together. High performance on these systems necessitates algorithms that keep all of the processors "dancing" all of the time.

James Browne of the University of Texas, Austin, observed that there are three general models of massively parallel systems based on whether decisions concerning the communication, synchronization, and granularity of parallel operations are fixed at design time, compile time, or execution time.

These decisions are made at design time in fixed-net architectures. Systolic arrays containing hardware components designed to carry out specific algorithms are one example of a fixed-net architecture. Such systems provide fast execution of the algorithms for which they are designed, but other algorithms may have to be adapted to fit the architecture.

In bind-at-compile-time architectures the topology of the interconnection of processors and memory can be reconfigured to suit a particular algorithm but remains in that configuration until explicitly reconfigured. The TRAC machine at the University of Texas at Austin is an example of bind-at-compile-time architecture. Reconfigurable architectures must have more complex control systems than those that bind at design time, and the reconfiguration process does take time. The trade-off is that reconfigurable architectures can be used for many different algorithms.

Bind-at-execution-time architectures do not explicitly specify topology but, through use of control structures and shared memory, allow any topology. These systems are the most flexible since no algorithm is constrained by an inappropriate architecture. The price is the overhead required to synchronize and communicate information. Dataflow machines are an example of bind-at-execution-time systems.

Today, effective use of a multiprocessor system (even a vectorizing system) requires intimate knowledge of the hardware, the compiler's mapping of code onto the hardware, and the problem at hand. On systems with one hundred or more processors, software must free the user from these details because their complexity will defy human management. Many issues involving algorithms, models, architectures, and languages for massively parallel systems have yet to be resolved.

John Armstrong of IBM, in his discussion of VLSI technology, indicated that as components become more integrated, the need for basic research and development in materials science and packaging technology increases. Further, he believes that the current level of research in these areas is inadequate.

Several speakers noted that supercomputer systems require high-speed peripherals (in particular, disks) and that, in general, little work is being done to advance their performance. Resolution of this problem will require collaboration between manufacturers and government laboratories.

The current status of large-scale computation in nuclear weapons design (D. Henderson, Los Alamos), fluid dynamics (J. Glimm, Courant Institute), fusion reactor design (D. Nelson, DOE), aircraft and spacecraft design (W. Ballhaus, NASA-Ames Research Center), and oil reservoir simulation (G. Byrne, Exxon Research) was reviewed.

An observation of general applicability was made by Glimm. Problems in all these areas are three-dimensional, time-dependent, nonlinear, and singular. Simple algorithms converge slowly when applied to such problems. Hence, most interesting problems are undercomputed. To achieve an increase in resolution by an order of magnitude in a three-dimensional, time-dependent problem requires an increase in computing power by a factor of 10,000 (roughly a factor of 1000 from refining each of the three spatial dimensions and another factor of 10 from the correspondingly larger number of time steps), far beyond projected hardware improvements. Thus, to bridge the gap between needs and projected hardware capabilities, one must look to the development of better algorithms and more powerful software for their implementation.

Several new applications of supercomputing were also discussed. These included design of special-purpose, very large-scale integrated circuits (D. Rose, Bell Laboratories), robotics (J. Schwartz, Courant Institute), CAD/CAM (J. Heiney, Ford Motor Co.), and, finally, an unexpected topic, the study of cooperative phenomena in human behavior (F. Harlow, Los Alamos). Also presented was an exhibition of animated computer graphics, an application whose full realization demands far more of computers than today's supercomputers can provide. Quite apart from the dazzle, animated graphics represents a key mode of interaction between the computer and the brain, which if properly exploited could progress far beyond the already impressive development.

Conference Summary

A broad spectrum of interests and points of view was expressed during the week. The participants concurred on many general points, and these were well summarized by K. Speierman of the National Security Agency. Among them were the following critical issues.

Systems Approach. All aspects of massively parallel systems—architecture, algorithms, and software—must be developed in concert.

VLSI Supply. Highly reliable VLSI components are often difficult to procure from domestic vendors. Further, associated research and development in materials science and packaging technology are inadequate.

High-Speed Peripherals. High-speed peripherals are essential to supercomputing systems, and very little research and development is being done in the United States in this area.

Future Market. Many of the participants foresee a much larger market for supercomputers due to growing industrial use.

Nearly forty years ago the art of computing made a discontinuous leap from electromechanical plodding to crisp, electronic swiftness. Many aspects of life have been both broadened and rendered sophisticated by the advent of electronic computation. The quantum jump envisioned today is from the ultimate in serial operation to a coordinated parallel mode. ■

The proceedings of the conference are to be published by the University of California Press as a volume in the Los Alamos Series in Basic and Applied Sciences.

The Efficiency of Parallel Processing

Parallel processing, or the application of several processors to a single task, is an old idea with a relatively large literature. The advent of very large-scale integrated technology has made testing the idea feasible, and the fact that single-processor systems are approaching their maximum performance level has made it necessary. We shall show, however, that successful use of parallel processing imposes stringent performance requirements on algorithms, software, and architecture.

The so-called asynchronous systems that use a few tightly coupled high-speed processors are a natural evolution from high-speed single-processor systems. Indeed, systems with two to four processors will soon be available (for example, the Cray X-MP, the Cray-2, and the Control Data System 2XX). Systems with eight to sixteen processors are likely by the early 1990s. What are the prospects of using the parallelism in such systems to achieve high speed in the execution of a single application? Early attempts with vector processing have shown that plunging forward without a precise understanding of the factors involved can lead to disastrous results. Such understanding will be even more critical for systems now contemplated that may use up to a thousand processors.

The key issue in the parallel processing of a single application is the speedup achieved, especially its dependence on the number of processors used. We define speedup (S) as the factor by which the execution time for the application changes; that is,

S = (execution time for one processor) / (execution time for p processors).

To estimate the speedup of a tightly coupled system on a single application, we use a model of parallel computation introduced by Ware. We define a as the fraction of work in the application that can be processed in parallel. Then we make a simplifying assumption of a two-state machine; that is, at any instant either all p processors are operating or only one processor is operating. If we normalize the execution time for one processor to unity, then

S(p, a) = 1 / [(1 - a) + a/p].

Note that the first term in the denominator is the execution time devoted to that part of the application that cannot be processed in parallel, and the second term is the time for that part that can be processed in parallel.

How does speedup vary with a? In particular, what is this relationship for a = 1, the ideal limit of complete parallelization? Differentiating S, we find that

dS/da = p(p - 1) / [p - a(p - 1)]^2,

which at a = 1 equals p(p - 1).

The accompanying figure shows the Ware model of speedup as a function of a for a 4-processor, an 8-processor, and a 16-processor system. The quadratic dependence of the derivative on p results in low speedup for a less than 0.9. Consequently, to achieve significant speedup, we must have highly parallel algorithms. It is by no means evident that algorithms in current use on single-processor machines contain the requisite parallelism, and research will be required to find suitable replacements for those that do not. Further, the highly parallel algorithms available must be implemented with care. For example, it is not sufficient to look at just those portions of the application amenable to parallelism because a is determined by the entire application. For a close to 1, changes in those few portions less amenable to parallelism will cause small changes in a, but the quadratic behavior of the derivative will translate those small changes in a into large changes in speedup.

[Figure: Speedup as a function of parallelism (a) and number of processors.]

Ware's model is inadequate in that it assumes that the instruction stream executed on a parallel system is the same as that executed on a single processor. Seldom is this the case because multiple-processor systems usually require execution of instructions dealing with synchronization of the processes and communication between processors. Further, parallel algorithms may inherently require additional instructions. To correct for this inadequacy, we add a term, call it s(p), representing the overhead of the parallel implementation; it is at best nonnegative, is usually monotonically increasing with p, and depends on the algorithm, the architecture, and even on the application. The modified model is then

S(p, a) = 1 / [(1 - a) + a/p + s(p)].

If the application can be put completely in parallel form, then

S(p, 1) = 1 / [1/p + s(p)],

which is less than p whenever s(p) is positive. In other words, the maximum speedup of a real system is less than the number of processors p, and it may be significantly less. Also note that, whatever the value of a, S will have a maximum for sufficiently large p if s(p) continues to increase.

Those who have experience with vector processors will note a striking similarity between the Ware curves and plots of vector processor performance versus the fraction of vectorizable computation. This similarity is due to the assumption in the Ware model of a two-state machine, since a vector processor can also be viewed in that manner. In one state it is a relatively slow, general-purpose machine, and in the other state it is capable of high performance on vector operations.

Thus the research challenge in parallel processing involves finding algorithms, programming languages, and parallel architectures that, when used as a system, allow a large amount of work to be processed in parallel (large a) at the expense of a minimum number of additional instructions.
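The model is easy to evaluate numerically. The short Python sketch below is our illustration (the sidebar gives only the formulas and a figure); it tabulates S(p, a) for the processor counts shown in the figure and then adds an assumed overhead term to show how quickly the speedup of a real system falls below p.

```python
# Evaluate the Ware model S(p, a) = 1 / ((1 - a) + a/p),
# optionally with an assumed overhead term s(p).

def ware_speedup(p: int, a: float, overhead: float = 0.0) -> float:
    """Speedup on p processors when a fraction `a` of the work is parallel."""
    return 1.0 / ((1.0 - a) + a / p + overhead)

for p in (4, 8, 16):
    row = ", ".join(f"a={a:.2f}: S={ware_speedup(p, a):5.2f}"
                    for a in (0.50, 0.90, 0.99, 1.00))
    print(f"p = {p:2d}  ->  {row}")

# With an assumed overhead s(p) = 0.01, even a fully parallel application
# on 1000 processors is held to a speedup of about 91, far below p.
print(f"p = 1000, a = 1.0, s = 0.01  ->  S = {ware_speedup(1000, 1.0, 0.01):.1f}")
```

The a = 0.90 column makes the sidebar's point concrete: on 16 processors, 90 percent parallelism yields a speedup of only about 6.4.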

The HEP Parallel Processor

by James W. Moore

Although there is an abundance of concepts for parallel computing, there is a dearth of experimental data delineating their strengths and weaknesses. Consequently, for the past three years personnel in the Laboratory's Computing Division have been conducting experiments on a few parallel computing systems. The data thus far are uniformly positive in supporting the idea that parallel processing may yield substantial increments in computing power. However, the amount of data that we have been able to collect is small because the experiments had to be conducted at sites away from the Laboratory, often in "software-poor" environments.

We recently leased a Heterogeneous Element Processor (HEP) manufactured by Denelcor, Inc. of Denver, Colorado. This machine (first developed for the Army Ballistic Research Laboratories at Aberdeen) is a parallel processor suitable for general-purpose applications. We and others throughout the country will use the HEP to explore and evaluate parallel-processing techniques for applications representative of future supercomputing requirements. Only the beginning steps have been taken, and many difficulties remain to be resolved as we move from experiments that use one or two of this machine's processors to those that use many.

But what are the principles of the HEP? Parallel processing can be used separately or concurrently on two types of information: instructions and data. Much of the early parallel processing concentrated on multiple-data streams. However, computer systems such as the HEP can handle both multiple-instruction streams and multiple-data streams. These are called MIMD machines.

The HEP achieves MIMD with a system of hardware and software that is one of the most innovative architectures since the advent of electronic computing. In addition, it is remarkably easy to use with the FORTRAN language. In its maximum configuration it will be capable of executing up to 160 million instructions per second.

The Architecture

Figure 1 indicates the general architecture of the HEP. The machine consists of a number of process execution modules (PEMs), each with its own data memory bank, connected to the HEP switch. In addition, there are other processors connected to the switch, such as the operating system processor and the disk processor. Each PEM can access its own data memory bank directly, but access to most of the memory is through the switch.

In a MIMD architecture, entire programs or, more likely, pieces of programs, called processes, execute in parallel, that is, concurrently. Although each process has its own independent instruction stream operating on its own data stream, processes cooperate by sharing data and solving parts of the same problem in parallel. Thus, throughput can be increased by a factor of N, where N is the average number of operations executed concurrently.

The HEP implements MIMD with up to sixteen PEMs, each PEM capable of executing up to sixty-four processes concurrently. It should be noted, however, that these upper limits may not be the most efficient configuration for a given, or even for most, applications. Any number of PEMs can cooperate on a job, or each PEM may be running several unrelated jobs. All of the instruction streams and their associated data streams are held in main memory while the associated processes are active.

How is parallel processing handled within an individual PEM? This is done by "pipelining" instructions so that several are in different phases of execution at any one moment. A process is selected for execution each machine cycle, a single instruction for that process is started, and the process is made unavailable for further execution until that instruction is complete. Because most instructions require eight cycles to complete, at least eight processes must be executed concurrently in order to use a PEM fully. However, memory access instructions require substantially more than eight cycles. Thus, in practice, about twelve concurrent processes are needed for full utilization, and a single HEP PEM can be considered a "virtual" 8- to 12-processor machine. If a given application is formulated and executed using p processes (where p is an integer from 1 to 12), then execution time for the application will be inversely proportional to p.

Now what happens when individual PEMs are linked together? Each PEM has its own program memory to prevent conflicts in accessing instructions, and all PEMs are connected to the large number of other data memory banks through the HEP switch (a high-speed, packet-switched network). One result of these connections is that the number of switch nodes increases more rapidly than the number of PEMs. As one changes from a 1-PEM system toward the maximum 16-PEM configuration, the transmittal time through the HEP switch, called latency, quickly becomes substantial.


[Fig. 1. The general architecture of the HEP.]

This portion of the program sets up the variables to process a 100-column array using 12 concurrent processes. $FIN remains set at empty throughout the computations, $COL will count up through the 100 columns, and $PROCS will count down as the 12 processes die off.

This DO loop CREATEs the 12 processes. Each process does its computations using SUBROUTINE COL.

This statement, otherwise trivial, stops the main program if $FIN is not yet set to full.

This portion terminates each process, counting down with $PROCS, and then setting $FIN to full as the last process is killed.

Fig. 2. An example of HEP FORTRAN.
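The HEP FORTRAN listing itself is not reproduced here, but the self-scheduling scheme the annotations describe can be sketched in a modern language. The Python threading sketch below is our analogue, not HEP FORTRAN: a lock-protected shared counter plays the role that the asynchronous variables $COL and $PROCS play in Fig. 2, and the names NWORKERS, process_column, and worker are ours.

```python
# Self-scheduling analogue of Fig. 2: twelve workers repeatedly claim the next
# unprocessed column of a 100-column array until none remain.

import threading

NCOLS, NWORKERS = 100, 12
next_col = 0                      # plays the role of $COL
lock = threading.Lock()           # stands in for full/empty synchronization
results = [None] * NCOLS

def process_column(j: int) -> float:
    """Placeholder for the per-column computation (SUBROUTINE COL)."""
    return float(j) ** 0.5

def worker() -> None:
    global next_col
    while True:
        with lock:                # claim the next column atomically
            j = next_col
            next_col += 1
        if j >= NCOLS:            # no columns left: this worker dies off
            return
        results[j] = process_column(j)

threads = [threading.Thread(target=worker) for _ in range(NWORKERS)]
for t in threads:
    t.start()
for t in threads:                 # joining all workers plays the role of $FIN
    t.join()
print("processed", sum(r is not None for r in results), "columns")
```

Because idle workers claim whichever column is next, uneven per-column work is smoothed out automatically, which is the property the article emphasizes.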


Such latency increases the number of processes that must be running in each PEM to achieve full utilization. Although there is not enough data yet to provide good estimates on how fast latency actually increases with the number of PEMs, experience with a 2-PEM system suggests that a 4-PEM system will require about twenty concurrent processes in each PEM.

Process Synchronization

A critical issue in MIMD machines is the synchronization of processes. The HEP solves this problem in a simple and elegant manner. Each 64-bit data word has an extra bit that is set to full each time a datum is stored and is cleared to empty each time a datum is fetched. In addition, two sets of memory instructions are employed. One set is used normally throughout most of the program. This set ignores the extra bit and will fetch or store a data word regardless of whether the bit is full or empty. Data may then be used as often as required in a process.

The second set of instructions is defined through the use of asynchronous variables. Typically, this set is used only at the start or finish of a process, acting as a barrier against interference from other processes. The set will not fetch from an empty word or store into a full word. Thus, to synchronize several processes an asynchronous variable is defined that can only be accessed, using the second set of instructions, at the appropriate time by each process. A process that needs to fetch an asynchronous variable will not do so if the extra bit is empty and will not proceed until another process stores into the variable, setting it full. Because the full and empty properties of this extra bit are implemented in the HEP hardware at the user level, requiring no operating-system intervention, the usual synchronization methods (semaphores, etc.) can be used, and process synchronization is very efficient.

FORTRAN Extensions to Support Parallelism

Only two extensions to standard FORTRAN are required to exploit the parallelism inherent in the HEP: process creation and asynchronous variables. Standard FORTRAN can, in fact, handle both, but the current HEP FORTRAN has extensions specifically tailored to do so.

These extensions allow the programmer to create processes in parallel as needed and then let them disappear once they are no longer needed. Also the number of PEMs being used will vary with the number of processes that are created at any given moment.

Process creation syntax is almost identical to that for calling subroutines: CALL is replaced by CREATE. However, in a normal program, once a subroutine is CALLed, the main program stops; in HEP FORTRAN, the main program may continue while a CREATEd process is being worked on. If the main program has no other work, it may CALL a subroutine and put itself on equal footing with the other processes until it exits the subroutine. A process is eliminated when it reaches the normal RETURN statement.

We can illustrate these techniques by showing how the HEP is used to process an array in which each column needs to be processed in the same manner but independently of the other columns (Fig. 2). First, one defines a subroutine that can process a single column in a sequential fashion. We could use this subroutine by creating a process for each column and then scheduling all the processes in parallel, but there is a limit on the number of processes that each PEM can handle. A better technique would be to CREATE eight to twelve processes per PEM and let the processes self-schedule. Each process selects a column from the array, does the computation for that column, then looks for additional columns to work on. Several asynchronous variables are the key to this technique. Each process that is not computing checks the first of these variables both to see if it can start a computation and, if so, which column is next in line. At the end of that computation, and regardless of what stage any other process has reached, the process checks again to see if there are further columns to be dealt with. If not, the process is terminated. A second asynchronous variable counts down as the processes die off. When the last operating process completes its computation, a number is stored in a third, previously empty asynchronous variable, setting its extra bit to full. This altered variable is a signal to the main program that it may use the recently generated data. This method tends to smooth irregularities in process execution time arising from disparities in the amount of processing done on the individual columns and, further, does not require changes if the dimension of the array is changed.

More elegant syntactic constructs can be devised, but the HEP extensions are workable. For well-structured code, conversion to the HEP is quite easy. For more complex programs the main difficulty is verifying that the parallel processes defined are in fact independent. No tools currently exist to help with this verification.

Early Experience with the HEP

Several relatively small FORTRAN codes have been used on the HEP at Los Alamos. One such code is SIMPLE, a 2000-line, two-dimensional, Lagrangian, hydro-diffusion code. By partitioning this code into processes, about 99 per cent of it can be executed in parallel. The speedup on a 1-PEM system is close to linear for up to eleven processes and then flattens out, indicating that the PEM is fully utilized. To achieve this degree of speedup on any MIMD machine requires a very high percentage of parallel code. How difficult it will be to achieve a high percentage of parallelism on full-scale production codes is an open question, but the potential payoff is significant. ■
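The full/empty discipline described above can also be emulated in a modern threaded language. The sketch below, in Python rather than HEP FORTRAN, is our illustration of the idea only: a cell that can be written only when empty and read only when full, so that a fetching process waits until another process stores into the variable. The class and variable names are ours, not the HEP's.

```python
# Emulation of a HEP-style asynchronous variable: store blocks until the cell
# is empty, fetch blocks until it is full, and fetching empties the cell.

import threading

class AsyncCell:
    """One-word cell with full/empty semantics, built on a condition variable."""
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def store(self, value):
        """Wait until empty, then store the value and mark the cell full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value, self._full = value, True
            self._cond.notify_all()

    def fetch(self):
        """Wait until full, then take the value and mark the cell empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            value, self._full = self._value, False
            self._cond.notify_all()
            return value

cell = AsyncCell()
worker = threading.Thread(target=lambda: cell.store(42))  # signals like $FIN
worker.start()
print("main program received", cell.fetch())              # blocks until full
worker.join()
```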


LA-UR-83-2940