


1992 — 1993 ACADEMIC TRAINING PROGRAMME

LECTURE SERIES

SPEAKER: M. Smith / Edinburgh Parallel Computing Centre
TITLE: Introduction to massively parallel computing in High Energy Physics
DATES: 15, 16, 17, 18 & 19 March, from 11.00 to 12.00
PLACE: CERN Auditorium

ABSTRACT: Ever since computers were first used for scientific and numerical work, there has existed an "arms race" between the technical development of faster computing hardware and the desires of scientists to solve larger problems in shorter time scales. However, the vast leaps in performance achieved through advances in semiconductor science have reached a hiatus as the technology comes up against the physical limits of the speed of light and quantum effects. This has led all high performance computer manufacturers to turn towards a parallel architecture for their new machines. In these lectures we will introduce the history and concepts behind parallel computing, and review the various parallel architectures and environments currently available. We will then introduce programming methodologies that allow efficient exploitation of parallel machines, and present case studies of the parallelization of typical High Energy Physics codes for the two main classes of parallel computing architecture (SIMD and MIMD). We will also review the expected future trends in the technology, with particular reference to the high performance computing initiatives in Europe and the United States, and the drive towards standardization of parallel computing software environments.



An Introduction to

Massively Parallel Computing in High Energy Physics

Mark Smith

Edinburgh Parallel Computing Centre

Contents

1 An Introduction to Parallel Computing
  1.1 An Introduction to the EPCC
  1.2 Parallel Computer Architectures
      1.2.1 A Taxonomy for Computer Architectures
      1.2.2 SIMD: Massively Parallel Machines
      1.2.3 MIMD Architectures
      1.2.4 Shared Memory MIMD
      1.2.5 Distributed Memory MIMD
  1.3 Architectural Trends
      1.3.1 Future Trends
  1.4 General Purpose Parallelism
      1.4.1 Proven Price/Performance
      1.4.2 Proven Service Provision
      1.4.3 Parallel Software
      1.4.4 In Summary

2 Decomposing the Potentially Parallel
  2.1 Potential Parallelism
      2.1.1 Granularity
      2.1.2 Load Balancing
  2.2 Decomposing the Potentially Parallel
      2.2.1 Trivial Decomposition
      2.2.2 Functional Decomposition
  2.3 Data Decomposition
      2.3.1 Geometric Data Decomposition
      2.3.2 Dealing with Unbalanced Problems
  2.4 Summary

3 An HEP Case Study for SIMD Architectures
  3.1 SIMD Architecture and Programming Languages
  3.2 Case Study: Lattice QCD Codes
      3.2.1 Generation of Gauge Configurations
      3.2.2 The Implementation
      3.2.3 Calculation of Quark Propagators

4 An HEP Case Study for MIMD Architectures
  4.1 A Brief History of MIMD Computing
      4.1.1 The Importance of the Transputer
      4.1.2 The Need for Message Passing
  4.2 Case Study: Experimental Monte Carlo Codes
      4.2.1 Physics Event Generation
      4.2.2 Task Farming GEANT
      4.2.3 Geometric Decomposition of the Detector
      4.2.4 Summary: A Hierarchy of Approaches

5 High Performance Computing: Initiatives and Standards
  5.1 High Performance Computing and Networking Initiatives
      5.1.2 United States
      5.1.3 Europe
  5.2 Emerging Key Technologies
      5.2.1 CHIMP
      5.2.2 PUL
      5.2.3 NEVIS
  5.3 Parallel Computing Standards Forums
      5.3.1 High Performance Fortran Forum
      5.3.2 Message Passing Interface Forum

Acknowledgements

These lectures and their accompanying notes have been produced, in part, from the EPCC Parallel Systems training course material. I am therefore grateful to my EPCC colleagues Neil MacDonald, Arthur Trew, Mike Norman, Kevin Collins, David Wallace, Nick Radcliffe and Lyndon Clarke for the material they produced for that course. In addition, I must thank Ken Peach, Steve Booth, David Henty and Mark Parsons of the University of Edinburgh Physics Department for their help in the HEP aspects of these notes, and CERN's Fabrizio Gagliardi for the original invitation to present this material.

1 An Introduction to Parallel Computing

1.1 An Introduction to the EPCC

The Edinburgh Parallel Computing Centre was established during 1990 as a focus for the University's various interests in high performance supercomputing. It is interdisciplinary, and combines a decade's experience of parallel computing applications with a strong research activity and a successful industrial affiliation scheme. The Centre provides a national service to around 300 registered users, on state-of-the-art commercial parallel machines worth many millions of pounds.

As parallel computers become more common, the Centre's task is to accelerate the effective exploitation of high performance parallel computing systems throughout academia, industry and commerce. The Centre, housed at the University's King's Buildings, has a staff of more than 50, in three divisions: Service, Consultancy & Development, and Applications Development. The team is headed by Professors David Wallace (Physics), Roland Ibbett (Computer Science) and Jeff Collins (formerly of Electrical Engineering).

Edinburgh's involvement with parallel computing began in 1980, when physicists in Edinburgh used the ICL Distributed Array Processor (DAP) at Queen Mary College in London to run molecular dynamics and high energy physics simulations. Their pioneering results and wider University interest enabled Edinburgh to acquire two of these machines for its own use, and Edinburgh researchers soon achieved widespread recognition across a range of disciplines: high energy physics, molecular dynamics, phase transitions, neural network models, protein crystallography, stellar dynamics, image processing and meteorology, among others.

With the imminent decommissioning of the DAPs, a replacement resource was needed and a transputer machine was chosen in 1986. This machine was the new multi-user Meiko Computing Surface, consisting of domains of transputers, each with its own local memory, and interconnected by programmable switching chips. The Edinburgh Concurrent Supercomputer Project was then established in 1987 to create a national parallel computing facility, available to academics and industry throughout the UK.

In 1990 this project evolved into the Edinburgh Parallel Computing Centre (EPCC) to bring together the many strands of parallel research and applications activity in Edinburgh University, and to provide the basis for a broadening of these activities. The Centre now houses a range of machines including two third-generation DAPs, a number of Meiko machines (both transputer- and i860-based), a Thinking Machines Corporation CM-200, and a large network of Sun and other workstations. Plans are currently being made to purchase two further parallel machines in 1993; these will act as an internal development resource, and will be state-of-the-art architectures.

The applications development work within the EPCC takes place within the Numerical Simulations Group and the Information Systems Group. These teams perform all of the contract work for our industrial and commercial collaborators. Current projects include work on parallelising computational fluid dynamics code for the aerospace, oil and nuclear power industries; parallel geographical information systems; distributed memory implementations of oil reservoir simulators; spatial interaction modelling on data parallel machines; parallel network modelling and analysis; development of a production environment for seismic processing; and also parallel implementation of speech recognition algorithms.

Non-contract software development effort is focused around two Key Technology Programmes (KTPs), which are the direct successors to research work performed at Edinburgh over the last few years. The Common High-Level Interface to Message Passing (CHIMP) KTP is working towards the development of a common interface to message passing available on a range of different manufacturers' machines. The primary aim of the interface is to provide source code compatibility for applications and systems programmers across a range of commercially available parallel computers. The objective of the Parallel Utilities Library (PUL) KTP is the development and maintenance of a library to support applications development and free the programmer from re-implementing basic parallel utilities. To this end, PUL builds upon the machine independence offered by CHIMP. The PUL programme currently contains four projects, which provide extensions to system software (extended message passing and global file access, for example), and support for standard MIMD programming paradigms (e.g. task farming or grid-based decompositions).

Figure 1 shows the flow of technology through the EPCC, and details how we build upon research and Key Technology Programmes to provide portable, efficient parallel implementations.

1.2 Parallel Computer Architectures

Throughout the history of scientific and numerical computing there has been a desire to achieve the ultimate performance from a computer. The term "supercomputer" was coined in the early 1970s to describe a machine that performed calculations at a rate far in excess of that achieved by standard computers. The term is probably over-used today, as all manufacturers clamour to identify their particular product as a supercomputer. However, one thing that is now certain is that all computer manufacturers are looking towards some form of parallel computing architecture for their high performance machines.

Figure 1: Technology flow through the EPCC structure. This diagram also shows the main sources of funding for the Centre, and the main beneficiaries of our work.

The idea of parallel processing is not, however, a new one. Most human and natural systems operate in an inherently parallel manner, and it is perhaps unfortunate that early computing machines introduced the notion of sequentialisation for the sake of simplicity; programmers have been stuck with that framework ever since. The quote below, from "Weather Prediction by Numerical Process" by Lewis F. Richardson, dates from 1922 and shows how parallel thinking seemed to come naturally to a scientist, long before mechanical computers came into existence. The computers referred to by Richardson are in fact individuals put to work on a particular calculation.

If the time-step were 3 hours, then 32 individuals could just compute two points so as to keep pace with the weather, if we allow nothing for the very great gain in speed which is invariably noticed when a complicated operation is divided up into simpler parts, upon which individuals specialize. If the co-ordinate chequer were 200 km square in plan, there would be 3200 columns on the complete map of the globe. In the tropics the weather is often foreknown, so that we may say 2000 active columns. So that 32 × 2000 = 64,000 computers would be needed to race the weather for the whole globe. That is a staggering figure. Perhaps in some years' time it may be possible to report a simplification of the process. But in any case, the organization indicated is a central forecast-factory for the whole globe, or for portions extending to boundaries where the weather is steady, with individual computers specializing on the separate equations. Let us hope for their sakes that they are moved on from time to time to new operations. After so much hard reasoning, may one play with a fantasy?

1.2.1 A Taxonomy for Computer Architectures

In 1972 Michael Flynn proposed a taxonomy for computer architectures as follows:

SISD: Single Instruction Single Data. The theoretical basis for the simple single processor machine (from PCs to Crays). This is the conventional von Neumann architecture.

SIMD: Single Instruction Multiple Data. Here many processors simultaneously execute the same instructions, but on different data. This is the basis for massively parallel machines like the AMT Distributed Array Processor (DAP), the MasPar MP-series, or Thinking Machines' Connection Machine, all of which use many thousands of very simple processors and can achieve "supercomputer" performance on certain problems.

MISD: Multiple Instruction Single Data. Such a machine would apply many instructions to each datum fetched from memory. No computers that conform strictly to this model have yet been constructed, although there is some debate as to whether the Dataflow machine could fit into this category of Flynn's taxonomy.

MIMD: Multiple Instruction Multiple Data. This is an evolutionary step forward from SISD technology. An MIMD computer contains several independent (and usually equi-powerful) processors, each of which executes its individual program. There are several ways of building such a machine, the differences lying in how processors are linked together for communications, and how each is linked to memory.

Of these classifications, only SIMD and MIMD are currently relevant to the parallel computing industry.

1.2.2 SIMD: Massively Parallel Machines

The basic architecture of an SIMD machine can be seen in Figure 2. The large number of simple processors all execute the same piece of code, but operate upon the data in their individual memory stores. There are simple extensions to standard languages that are usually used when programming such machines. These extensions allow the transfer of data between neighbouring processors, as well as the masking of operations allowing selective execution of some tasks. This is necessary since we know that the vast majority of problems have some inherently sequential sections.

Figure 2: Typical SIMD architecture

The SIMD architecture is a fairly simple concept, which can produce very powerful results with the right application. It uses a straightforward programming model, and thus often provides an easy route to parallelisation for many physical systems. The current development of High Performance Fortran (HPF) and Fortran 90 is planned to dove-tail nicely into data parallel SIMD computing, and will certainly provide the path of least resistance for the porting of numerical Fortran codes. However, the simplicity of the model does have its drawbacks, and SIMD systems are not the most flexible parallel computers currently available. The insistence on total synchronisation of all processing elements, although beneficial to some problems, can cause excessive implementation overheads for others. In such cases we must turn to the more flexible MIMD machines.

1.2.3 MIMD Architectures

The basic model of independent processors executing separate programs can be satisfied in many ways. The primary distinction we would make is between computers with a small number of powerful processors, those with larger numbers of smaller processors, and those that lie in between. Other characteristics, such as the relationship of processors to memory and to each other, follow on from this initial distinction. The range of MIMD machine types can therefore be divided into three groups.

Machines with small numbers of powerful processors have tended to evolve from existing computers, a good example being the Cray Y-MP, which has eight extremely powerful vector processors combined in a single machine with a single memory store. All the parallelism is produced by the compilers, and thus sequential code from previous machines can run without adaptation. However, Cray machines come at a high price, and automatic parallelisation is currently only possible for certain well defined problems, and produces results well below optimal.

The next class of MIMD machine is typified by the computers produced by manufacturers such as Sequent or Alliant. These machines use standard microprocessors (typically 80386) in small to middling numbers. These are attached to a single memory store by bus-based links, and produce a machine with mainframe performance at minicomputer cost. Like the Cray machines these computers provide coarse-grained parallelism, unlike the SIMD machines that can provide parallelisation that is very fine-grained. These machines allow much existing software to be re-used, as well as implementing many well-understood ideas about managing simple concurrency. The major draw-back with this class of MIMD machine is that there is an upper limit on the number of processors that can be used with bus-based processor-to-memory links.

The third class of MIMD computer contains those with much larger numbers of processors, and thus is typically thought of as medium-grained parallelism. This category includes the transputer based machines from companies such as Parsytec, Parsys and Meiko, as well as the hypercube machines produced by Intel and NCube. These machines avoid the memory access bottle-necks of the previous categories by distributing memory, giving some to each processor. However this approach does bring some problems, with processors having to communicate to find data that is not stored locally.

1.2.4 Shared Memory MIMD

A typical shared memory architecture is shown schematically in Figure 3. Each processor has a direct link via some bus mechanism to a single global memory store (although increasingly some use is being made of local memory caches). Processors can "communicate" through objects placed in global memory. This is conceptually easy to implement, but has severe implications for the amount of message traffic through the single bus link to memory. There are also problems with memory access control. For example: which process should update an item of memory when two wish to do so concurrently; or memory may on occasion need to be locked to outside processes when I/O is being performed.


Figure 3: Typical shared memory MIMD architecture

Shared memory computers are attractive primarily because they are relatively simple to program. Most techniques developed for multi-tasking computers, such as semaphores, can be used directly on shared memory machines. However these machines do have one great flaw: they cannot be scaled up infinitely. As the number of processors trying to access memory increases, so do the odds that processors will be contending for such access. Very quickly, access to memory becomes a bottleneck to the speed of the computer. Machines like the BBN Butterfly try to avoid this problem by dividing memory into as many sections as there are processors, and connecting all segments to processors through a high-performance switching network. However this eventually leads to the same contention crippling the performance of the switch.
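The locking problem described above can be made concrete with a small example. The sketch below uses present-day POSIX threads purely for illustration (an assumption on our part; the shared memory machines of this period offered vendor-specific semaphore and lock primitives instead): several threads update one object in shared memory, and a lock serialises the updates.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NSTEPS   100000

    static long counter = 0;                 /* the shared object in global memory */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NSTEPS; i++) {
            pthread_mutex_lock(&lock);       /* only one processor may update at a time */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld (expected %d)\n", counter, NTHREADS * NSTEPS);
        return 0;
    }

Without the lock the final count is unpredictable; with it, every update is serialised through the memory system, which is exactly the contention that limits how far such machines can be scaled.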

The escape route for shared-memory architectures is to introduce memory caches on each processor. This is of course simply the first step on the road to full distributed memory machines.

1.2.5 Distributed Memory MIMD

Figure 4 shows schematically the architecture of a distributed memory machine. The reasons for distributing memory were discussed earlier as the case against shared memory. What we must now consider is how to solve the main problem that distributed memory raises: how will the processors be connected and communicate?

Figure 4: Schematic architecture of a distributed memory MIMD machine

Connecting all the processors to a bus or through a single switch only brings back the bottlenecks of shared memory systems. Introducing connections from each processor to all other processors is completely infeasible for large numbers of processors, since the number of necessary connections rises as the square of the number of processors, and therefore soon gets out of hand. The only practical solution comes from connecting each processor to some small subset of its fellows. Many computers have been built that do exactly this, with a fixed topology of inter-processor links, for example the hypercube- or grid-based machines from Intel. The alternative is to use switching chips between processors that allow the user to adapt the topology to suit the particular program being run. This technology was pioneered by Meiko in their Computing Surface machines, based upon standard four-link transputers. We will return to this subject later in Section 4.
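The scaling argument can be made quantitative. As a short illustration of our own (not in the original notes), compare the total number of links needed by a fully connected machine with that of a hypercube:

\[
L_{\mathrm{full}}(n) = \binom{n}{2} = \frac{n(n-1)}{2}, \qquad
L_{\mathrm{hypercube}}(n) = \frac{n}{2}\log_2 n .
\]

For $n = 1024$ processors this gives $L_{\mathrm{full}} = 523\,776$ links against $L_{\mathrm{hypercube}} = 5\,120$, which is why fixed sparse topologies or programmable switches are used in practice.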

1.3 Architectural Trends

It is relatively easy to look back over the past 40 years and see the ways in which computer architecture has advanced in order to provide the extra computational power desired by scientists. It is a somewhat more difficult task to predict how architectures will advance in the future, although it is certain that they must continue to provide more and more power. An excellent mechanism to highlight architectural trends coupled with increases in computational power is to study the computing requirements of the UK Meteorological Office and how they were satisfied; see Figure 5.

Figure 5: UK Met. Office computing requirements, plotted as FLOPS against year (1950-2000), showing machines from the KDF9 through the Cyber 205 to the expected peaks of proposed future systems. Source: UK Met. Office.

This graph indicates that an approximate hundred-fold increase in power is required every ten years, and details some of the machines that made this possible for the UKMO. In order that this kind of performance be achieved in the scientific world in general, we see the following architectures as having played major roles over the last 20 years.

Vector supercomputers: During the 1970s Cray took hold of the supercomputer market following the introduction of the Cray-1 vector machine. These machines use pipelining of arithmetic operations to allow greatly increased throughput when dealing with long vectors of data. This architecture has been imitated by others, such as NEC and Fujitsu, and although still uni-processor in nature these machines provided the first real leap forward from conventional performance.

Special purpose SIMD machines: By the early 1980s the first real parallel computers were coming onto the market. These were typically distributed memory machines, often of SIMD design, and usually targeted at a specific application. In the UK, ICL produced six of its Distributed Array Processors. At a similar time in the United States, a group headed by Geoffrey Fox and Charles Seitz at the California Institute of Technology produced the first machine with a hypercube architecture. The "Cosmic Cube" was intended for use in planetary dynamics calculations, but soon found much wider application fields, and led to production machines from Intel and NCube using the hypercube design.

Shared memory machines: The mid-1980s saw the arrival of parallelism into the general purpose and high performance computing markets through the use of shared memory architectures. The processors in these machines, as long as the number remains small, can all be connected to a single memory store. This approach is attractive as it provides facilities to directly re-use existing software, and also the programming techniques involved are relatively simple. The increase in power available through the use of several processors rather than one led Cray to produce multi-processor machines such as the X-MP and Y-MP, each with up to eight very powerful vector processors. For the more general purpose market, manufacturers such as Sequent and Alliant used standard microprocessors to produce machines with mainframe performance at minicomputer prices.

Distributed memory machines: In order to overcome the bottleneck problems of shared-memory multicomputers, manufacturers of parallel systems now distribute memory, giving some to each of the processors in the machine. This approach leads to computers like the Intel iPSC, transputer-based machines like the Meiko Computing Surface, and in fact large SIMD architecture machines such as Thinking Machines' Connection Machine. This approach seems the most likely to be able to give parallel machines the scalability necessary to keep supercomputer power increasing as the advances of serial processor technology reach their hiatus.

1.3.1 Future Trends

If we accept that the future of high performance architectures lies in distributed memory multicomputers, the only problem we face is deciding what development and programming environment is made available on these future machines. We have already seen the manufacturers of transputer-based machines moving away from the Occam language towards providing tools to allow multiple sequential processes, written in C or Fortran, to run in parallel.

It seems likely that further in the future, machines will become available that hide the finer details of the machine architecture from the programmer. The provision of high-bandwidth, low-latency communications systems should enable high-level programming models to become more efficient, thus leading users to a virtual shared-memory machine. This would combine the power benefits of a distributed memory design with the ease-of-use of a shared-memory system. Other potential innovations would be an efficient, possibly hardware-based, process-to-process message passing harness common to all architectures. This would have the effect of making parallel versions of software portable between different machines, with possibly very different architectures. This same goal is currently being explored with the provision of an HPF standard that would provide extensions for parallel data structures, and thus allow porting of code between any machine with a suitable HPF compiler.

1.4 General Purpose Parallelism

Today, Cray Computer Corporation still have the majority of all supercomputer sales, and Fujitsu, who make a Cray look-alike, are second in this league. The top parallel manufacturer, Thinking Machines Corporation, only manages to take third place. Why is this? Is it because parallel machines are difficult to find, or buy? Is it because they are an impossible platform on which to provide a service? Or is it because the software technology is still either non-existent or primitive? We contend that it is the last alternative which is the major reason for the still small penetration of parallel computing into the general marketplace, despite all of its other advantages. This claim is made on the basis of the experience which EPCC gained through using, testing, programming and running a service on parallel machines.

1.4.1 Proven Price/Performance

In Table 1 we compare the prices and performance of a "conventional" machine with a parallel one, at the top and middle of the scientific computing market.

Given these price/performance figures, and the rationale for parallelism given above, it would seem that at the top and middle of the market parallel computers can be a very attractive option for computation. Not all users, however, wish to perform calculations; databases form a very large and growing market sector. Traditionally, hardware for this field has been dominated by the giants, IBM and DEC, but recently NCube and Oracle have announced a new version of the Oracle database software which runs on a 64-node NCube-2 at 1,073 TPS. This is over 2.5× faster than the previous record, held by an Amdahl mainframe, and costs approximately 5% as much per TPS (TPC-B benchmark).

  Conventional                              Parallel
  Price (pounds)   Machine                  Price (pounds)   Machine

  15 M             Cray Y-MP/8              5 M              TMC CM-200 (64k)
                   2.7 GFLOPS (peak)                         40 GFLOPS (peak)
                   2.1 GFLOPS (sustained)                    21 GFLOPS (sustained)

  1.2 M            IBM ES/9000 320VF        160 k            Intel iPSC/860 (8 node)
                   125 MFLOPS (peak)                         640 MFLOPS (peak)
                   110 MFLOPS (sustained)                    200 MFLOPS (sustained)

Table 1: Comparison of conventional and parallel computers.
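Dividing price by sustained performance (our own arithmetic on the figures quoted in Table 1) makes the point starkly:

\[
\frac{\pounds 15\,\mathrm{M}}{2.1\ \mathrm{GFLOPS}} \approx \pounds 7.1\,\mathrm{M}\ \text{per GFLOPS}
\qquad\text{against}\qquad
\frac{\pounds 5\,\mathrm{M}}{21\ \mathrm{GFLOPS}} \approx \pounds 0.24\,\mathrm{M}\ \text{per GFLOPS},
\]

roughly a thirty-fold advantage in sustained price/performance for the parallel machine at the top of the market.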

While it is possible to obtain even lower costs per TPS by running Oracle on a PC, it is then not possible to obtain more than about 10 TPS. Only by using parallelism is it possible to mix performance with cost effectiveness.

At the low end of the performance spectrum parallelism is, however, in some trouble. Who wants to link 40 T800 transputers together, when the same peak performance can be achieved from an Intel i860 chip or an IBM RS/6000? Advances in chip technology, however, do not obviate the need for parallel processing; they simply provide opportunities for ever more powerful machines.

1.4.2 Proven Service Provision

Our experience has shown that, even with a large number of varied users spread across the UK, we can maintain service availability consistently in excess of 95%. In addition, the EPCC now houses the two most powerful supercomputers in the country, and CPU time on these machines is as good as saturated. We contend that our situation shows that it is possible to run a consistent service to a very geographically scattered user population. However, providing the highest level of service availability does require extra effort. In 1980 a typical 1 MIP mainframe consisted of a number of cabinets, each containing many boards. Today, an equivalent machine, for example a Sequent Symmetry, has approximately 15 boards housed within a single cabinet. The ECS has five cabinets, each holding approximately 40 boards. In terms of numbers of components, parallel computers thus lag behind their sequential counterparts. As machine reliability is inversely proportional to the number of chips, it is probable that parallel computers will suffer from rather more hardware problems until the technology becomes more standard.

1.4.3 Parallel Software

Due to the lack of suitable software standards for parallel computing there is considerable scope for "re-inventing the wheel" as users try to port their applications onto different machines. One way round this problem is to follow the lead given by sequential programming and produce libraries containing basic numerical routines. However, this is only a partial solution because not only do parallel users require numerical libraries, they also need environments within which to run their applications.

There have been a number of attempts, for example Strand88 or Linda, to produce higher level virtual machine interfaces. Although these provide a standard across different architectures they impose a CPU overhead. Since performance is still the critical issue for most users, and the reason why they use a parallel computer, they are prepared to devote effort to use lower-level and machine-specific functions.

Ultimately, it might be expected that the task of mapping the application onto the hardware would be performed automatically by a compiler. While language definitions, such as Fortran 90, are extending the language syntax to include parallel constructs, the era of the parallelising compiler is still some years away. Since users cannot wait for that length of time, some action must be taken now to enhance code portability and maximise the re-usability of existing programs. At EPCC we have established a number of Key Technology Programmes (KTPs) to address these precise issues.

1.4.4 In Summary

Parallel computation provides the only means by which any known technology can deliver the orders of magnitude increases in performance required by many applications. Moreover, as greater power becomes available new fields of research become technically feasible, and so the range of applications expands. The current aim of most large parallel computer manufacturers is the TeraFLOPS machine: one with a performance of 1,000 GFLOPS, or 500× the peak sustainable speed of a Cray Y-MP/8. No-one believes that this goal can be achieved by any route other than massive parallelism, and the first of these machines is expected within the next three to five years.

2 Decomposing the Potentially Parallel

2.1 Potential Parallelism

In general any computer program can be split into two parts: some sections that are inherently sequential (e.g. accepting input data from a user), and other sections that are potentially parallel (e.g. independent operations on that data). In addition it is often possible to identify separate sequential streams that can be executed concurrently. The main objective of parallelisation is to reduce execution time by identifying areas of potential parallelism, and thus decomposing the problem to allow parallel execution.

Common sources of potential parallelism are iterative constructs, usually tied to parallelisable data structures, and independent or successive functions. It is vitally important that full consideration be taken of data dependencies within and between such identified sections of our program. Functions that use the same data will need to communicate new values between each other, and iterative constructs cannot always be simply decomposed without consideration being taken of the inter-dependencies of data values. Later in this section we will discuss various methods to perform a decomposition of a problem. However, first we must introduce a few ideas that will ultimately govern the success of any such decomposition.

2.1.1 Granularity

On the assumption that the decomposition has been performed, we must now ask how efficiently the decomposed program can be run. Is the best performance achieved through use of maximum parallelism, i.e. throwing as many processors as we can at the problem? There is a considerable cost involved with splitting a problem into an unreasonably large number of sub-problems. Besides the cost of problem decomposition itself (which will increase with the number of sub-problems), we must also consider the necessity to communicate between the separate sub-problems, both to pass data around as well as to synchronise execution. All communications cost execution time, and thus too many messages can severely damage efficiency. It is therefore vitally important that we choose the correct granularity for the decomposition of the problem. Too large and we do not extract all the available parallelism; too small and we swamp ourselves in communication and decomposition costs.
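A toy cost model makes this trade-off concrete. The sketch below is our own illustration (the constant per-grain overhead is an assumption; real communication costs depend on the machine and the decomposition): it splits a fixed amount of work into g grains over p processors and charges a fixed overhead per grain.

    #include <stdio.h>

    /* Toy model: total work W is split into g equal grains spread over p
     * processors; each grain costs a fixed decomposition/communication
     * overhead c.  All quantities are in arbitrary "work units".          */
    static double parallel_time(double W, int g, int p, double c)
    {
        int grains_per_proc = (g + p - 1) / p;     /* worst-loaded processor */
        return grains_per_proc * (W / g + c);
    }

    int main(void)
    {
        const double W = 1.0e6, c = 50.0;
        const int p = 64;
        printf("%8s %14s %10s\n", "grains", "time", "efficiency");
        for (int g = p; g <= 65536; g *= 4) {
            double t = parallel_time(W, g, p, c);
            printf("%8d %14.1f %10.3f\n", g, t, W / (p * t));
        }
        return 0;
    }

Running it shows efficiency falling steadily as the grain count, and with it the communication and decomposition overhead, grows; the load-balancing benefits of smaller grains, discussed next, pull in the opposite direction.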


Figure 6: Freeing processor resource using load balancing. As the granularity of the decomposition is reduced, we can also find a better balance between processor loads and hence reduce overall computation time.

2.1.2 Load Balancing

The other major consideration for successful parallelisation is whether the sub-problems that result from decomposition all require the same execution time. If the sub-problems are allocated one per processor, then the total execution time is dependent upon the execution time of the largest individual task (Amdahl's Law). If this task is inherently sequential then it will always be the lower limit for execution, and no additional parallelism can hope to improve the situation. However, efficiency can be improved by better use of the other processors, potentially freeing computing resource to work on other problems (see Figure 6). Also, by placing more than one task on each processor we are potentially reducing the inter-processor communications.
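In its usual form (our formulation; the notes invoke the law only informally), Amdahl's Law bounds the achievable speedup on p processors by the fraction of work that cannot be parallelised:

\[
S(p) \;=\; \frac{T(1)}{T(p)} \;\le\; \frac{1}{f_s + (1 - f_s)/p} \;<\; \frac{1}{f_s},
\]

where $f_s$ is the inherently sequential fraction of the execution time. A code with $f_s = 0.05$, for example, can never be sped up by more than a factor of 20, however many processors are used.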

The best level of efficiency is achieved through having sub-problems spread evenly across the available processors. This is the art (or rather science) of load balancing. Achieving this balance is dependent upon the decisions made during problem decomposition and the size of granularity used. If either of these is poor then the load balance can rarely recover and efficiency will be reduced. Figure 6 also shows how, by reducing granularity further, most efficient use of the available processors can be made.

Figure 7: Decomposition techniques. Trivial, functional (e.g. pipeline) and data decomposition; data decomposition divides further into balanced (geometric) and unbalanced (scattered spatial decomposition, task farm) cases.

2.2 Decomposing the Potentially Parallel

Suppose we have a sequential program which we wish to run faster by executing it on parallel hardware, in particular distributed memory multicomputers. We consider all sequential programs to be composed of inherently sequential parts and potentially parallel parts, so the runtime of a sequential program is the runtime of its inherently sequential parts plus the runtime of its potentially parallel parts. By somehow splitting up, or decomposing, the work involved in a potentially parallel sub-program in such a way that a number of processors can work concurrently on the problem, we aim to reduce the runtime of the sub-program. However, decomposition frequently introduces an overhead: for example, message passing between processors. We have to consider this overhead in the execution time of the parallel version of the sub-program, and it must be outweighed by the reduction in execution time as a result of the use of (many) processors in parallel. If we achieve this goal, we will have reduced the runtime of the sub-program and therefore the runtime of the whole program.

Three important decomposition techniques and their derivatives are shown in Figure 7.

Figure 8: A pipeline. Each input image passes through successive stages, producing a smoothed image, then feature descriptions, then object descriptions.

2.2.1 Trivial Decomposition

The simplest technique is trivial decomposition, which doesn't really involve decomposition at all. Suppose you have a sequential program which has to be run independently on lots of different inputs. You can clearly introduce some parallelism by doing a number of runs of the sequential program in parallel. Since there are no dependencies between different runs, the number of processors which can be used is limited only by the number of runs to be performed. In this case the execution time of the set of runs will be the execution time of the most time consuming run in the set. Clearly, trivial parallelism can be exploited to provide almost linear speedup if runs take a similar length of time.
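A minimal POSIX sketch of trivial decomposition on a multiprocessor or workstation farm (our own illustration; ./simulate is a hypothetical sequential program that takes an input file name as its argument):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(int argc, char *argv[])
    {
        /* argv[1..] name the input files: one independent run per file. */
        for (int i = 1; i < argc; i++) {
            pid_t pid = fork();
            if (pid == 0) {                  /* child: run the unmodified sequential code */
                execlp("./simulate", "simulate", argv[i], (char *)NULL);
                perror("execlp");
                _exit(1);
            } else if (pid < 0) {
                perror("fork");
                exit(1);
            }
        }
        /* The elapsed time of the whole set is that of the slowest run. */
        while (wait(NULL) > 0)
            ;
        return 0;
    }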

2.2.2 Functional Decomposition

Functional decomposition, the first true decomposition technique, breaks up a program into a number of sub-programs.

A simple form of functional decomposition is the pipeline (Figure 8), in which each input passes through each of the sub-programs in a given order. Parallelism is introduced by having several inputs moving through the pipeline simultaneously. Consider Figure 8. The pipeline is initially empty. The first data element flows into the first stage of the pipeline (smoothing). Once this element has been smoothed, it is passed on to the next stage of the pipeline (feature extraction). While this first element is being processed through the feature extraction stage, the second data element can flow into the smoothing stage. Parallelism is introduced as the second data element's progress through the pipeline is overlapped with that of the first element. This filling process continues until every element of the pipeline is processing a data element. When the data set is exhausted there is an analogous draining period in which the number of busy stages in the pipeline falls to zero.

Parallelism in a pipeline is limited by the number of stages in the pipeline. For greatest efficiency we want to keep all the stages busy. This requires the time taken for each stage of processing to be equal; the pipeline is then balanced.
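For a balanced pipeline the fill and drain periods can be quantified with the standard formula (our own addition): with s stages each taking time t and a stream of n data elements,

\[
T_{\mathrm{pipe}} = (s + n - 1)\,t, \qquad
S = \frac{n\,s\,t}{(s + n - 1)\,t} = \frac{n\,s}{s + n - 1} \xrightarrow{\;n \gg s\;} s ,
\]

so the speedup only approaches the number of stages when the data stream is long compared with the pipeline.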

In general, functional decomposition is very problem dependent, and the amount of parallelism is limited by the program. This means that as the size of the input dataset grows, it may not be easy (or indeed possible) to exploit any more parallelism. As a result, the individual data items in a large dataset are unlikely to be processed any faster than those in a small dataset, and the larger dataset takes a proportionally longer time to be processed.

2.3 Data Decomposition

Rather than decomposing the program itself, or not decomposing at all, we can consider decomposing the data. Many problems involve applying similar or identical operations to different parts of a large data set. In such circumstances, it is often appropriate to split the dataset up over many processors.

If there are no data dependencies in the data set so that we can compute the result of the operation on a single data item without knowledge of the rest of the data set, then the principal factor limiting parallelism is the number of available processors.

It is useful to think in terms of "processing grains", corresponding to the work involved in processing a number of basic data items. The grain size is the number of basic data items in the processing grain. Independent data sets allow small grain sizes to be used.

In most cases, however, in order to compute the result at one point in the data set, we require knowledge of other data points. Most commonly, it is neighbouring points in the data set which we require knowledge of. If processing grains are clusters of data points then we can compute the results for most of these points without reference to the rest of the data set, in exactly the same way as we would in the corresponding sequential program. It is only points on the boundary of the processing grain which require special treatment.

Figure 9: "Volume to Surface Area" Ratio

Processing boundary points requires us to gather information from other processing grains, which typically reside on other processors. We gather this information by passing messages around the machine. There is extra work involved in passing these messages, which steals otherwise useful processor cycles. Moreover, messages take time to travel between processors, introducing possible latencies whereby processors are unable to proceed with the main computation until data arrives. In general, these overheads increase with the distances messages travel.

There is thus a conflict between the desire to use many processors to exploit the parallelism in the problem and the desire to minimise the processing overheads associated with breaking up the data set. The most important of these overheads are the communications costs, leading to the idea that we wish to maximise the ratio of time spent on useful computation to time spent on communications. (This is the famous "calc-to-comms ratio".) We can think of this in terms of a "volume-to-surface area" ratio (Figure 9) for the processing grain. The volume refers to the number of points in the grain and the surface area to the number of points on the boundary which therefore require communications.
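For a cubic grain of side m points in a three-dimensional problem with nearest-neighbour dependencies, the ratio is easy to write down (our own illustration):

\[
\frac{\text{volume}}{\text{surface}} = \frac{m^3}{6\,m^2} = \frac{m}{6},
\]

so doubling the linear size of the grain doubles the calc-to-comms ratio, while splitting the same data set over eight times as many processors (grains of side m/2) halves it.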

2.3.1 Geometric Data Decomposition

Given n processors, we generally maximise the volume-to-surface area ratio (and thus the calc-to-comms ratio) by dividing the data set into n processing grains and allocating one grain to each processor. It is usually not the case that every processor is directly connected to every other processor, so we try to allocate neighbouring processing grains to neighbouring processors. In this way we reduce latencies and congestion. We call this "geometric" decomposition, which is one of the most important and commonly-used techniques for exploiting parallel machines. Perhaps its most obvious application is in modelling physical systems such as a tank of water, where we simply chop the water into blocks for the purposes of processing and place adjacent blocks on adjacent processors. It is only at the interfaces between the blocks that the parallel program differs from the sequential program, by requiring information to be passed across the boundary.
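The boundary traffic of a geometric decomposition usually takes the form of a "halo" exchange. The sketch below (our own, for a one-dimensional decomposition) uses placeholder routines msg_send and msg_recv to stand for whatever message-passing primitives the machine or harness provides; they are not a real API, and are assumed to buffer messages so that the ordering cannot deadlock.

    #include <string.h>

    #define NLOC 256    /* grid points owned by this processor */

    /* Hypothetical message-passing primitives supplied by the harness
     * (CHIMP, a vendor library, ...); assumed to buffer the outgoing
     * message so the send/recv ordering below cannot deadlock.        */
    extern void msg_send(int dest, const double *buf, int n);
    extern void msg_recv(int src, double *buf, int n);

    /* u[1..NLOC] is the local block; u[0] and u[NLOC+1] are halo copies
     * of the neighbouring processors' edge points (a neighbour id < 0
     * marks a physical boundary handled by the boundary condition).    */
    static void halo_exchange(double *u, int left, int right)
    {
        if (right >= 0) msg_send(right, &u[NLOC], 1);
        if (left  >= 0) msg_recv(left,  &u[0], 1);
        if (left  >= 0) msg_send(left,  &u[1], 1);
        if (right >= 0) msg_recv(right, &u[NLOC + 1], 1);
    }

    /* One relaxation sweep: identical to the sequential loop except for
     * the halo exchange at the start of the iteration.                  */
    static void sweep(double *u, double *unew, int left, int right)
    {
        halo_exchange(u, left, right);
        for (int i = 1; i <= NLOC; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        memcpy(&u[1], &unew[1], NLOC * sizeof(double));
    }

Apart from the exchange, the sweep over the local block is identical to the sequential loop, which is precisely why geometric decomposition ports so directly from sequential codes.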

In early 1988 Geoffrey Fox, then at Caltech and now at Syracuse, performed a survey of the real parallel applications programs that he was aware of in the literature. We can paraphrase his results by considering the ways in which such applications would naturally be programmed on distributed memory parallel machines. We assume that we can divide programming methodologies into three classes that can be arranged in order of difficulty as follows.

Farming < Regular Decomposition < Anything Else

The following table shows the proportion of his applications that would be programmed in various ways.

  Strategy                  Proportion
  Task Farm                 14%
  Regular Decomposition     76%
  Something Else            10%

This shows that three quarters of all surveyed applications could be programmed by regular decomposition. Current hearsay understanding is that the NASA Ames lab believes that around 50% of its applications can be programmed by regular decomposition, and that the proportion is dropping, but even so regular decomposition is a good point at which to start a consideration of parallelisation.

2.3.2 Dealing with Unbalanced Problems

Simple geometric decomposition as described above works extremely well and can provide very high processor efficiencies provided that each processing grain takes the same time to process. This is usually the case when, for example, we are solving a differential equation for every point in some space.

Figure 10: Geometric decomposition of an unbalanced problem

Consider, however, the problem of identifying and mapping out edges in an image. It may well be that if we use a simple geometric decomposition some processors will be allocated blocks which do not contain any part of any edge, and therefore will require minimal processing, while the blocks on other processors contain many edges and thus involve much work (Figure 10). In this situation the processors with relatively empty blocks will inevitably lie idle much of the time, while only a few processors perform useful work. This is a simple example of a poorly balanced problem.

Our basic approach in such cases is to work with smaller grains, and to allocate more than one grain to each processor. According to the nature of the processing grains, we usually adopt one of two standard techniques: scattered spatial decomposition or task farming (Figure 11).

Figure 11: Decomposing an unbalanced problem

In scattered spatial decomposition, we simply allocate a random selection of these smaller processing grains to each processor and trust that, on average, each processor will have an approximately equal work-load. There is a direct trade-off between the volume-to-surface area ratio, smaller grains involving larger relative amounts of communication, and the balance we can achieve between the work-loads of different processors. We call this a "load-balancing problem", and describe scattered spatial decomposition as a "static" load-balancing technique because once the grains have been allocated there is no further attempt to distribute work evenly over the processors. Practical experience shows that it is possible to find grain sizes which achieve good overall load-balances without prohibitive communications costs over a surprisingly wide range of problems.

If grains can be processed completely independently of one another, then we can consider using "task farming" to provide "dynamic" load-balancing. Again, relatively small processing grains are chosen, and one processor (the "task master") maintains a set of unprocessed grains. A number of "worker" processors repeatedly request a grain from the master, process it and then dispatch the results to a results collector, which may or may not be the task master. This has the advantage that no prior assumptions need to be made about the data set: provided that the processing grains are sufficiently small, an even workload is ensured, since a worker which happens to receive many grains requiring little work will simply request more grains. It is often possible to construct independent tasks from a problem in which there are data dependencies by including extra information (for example, values from neighbouring boundary points) with the processing grain.

Task farms do, however, require a constant flow of grain requests and replies between the workers and the master. The costs associated with maintaining the dynamic load balance are particularly relevant in cases where many iterations are being performed over the data set. Static data allocation to processors generally requires only a single transmission of data from the master to the workers, while a task farm would effectively involve retransmission for every iteration.
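A skeleton of the task-farm pattern described above, in C. The message primitives msg_send and msg_recv_any, the grain structure and the tag values are all invented for this illustration; a real code would use the primitives of CHIMP, a vendor library, or whatever harness is available.

    #include <stddef.h>

    #define TAG_WORK   1
    #define TAG_RESULT 2
    #define TAG_STOP   3

    typedef struct { int id; double data[64]; } grain_t;

    /* Hypothetical primitives: a tagged send, and a receive that accepts a
     * message from any source and reports who sent it.                     */
    extern void msg_send(int dest, int tag, const void *buf, int nbytes);
    extern int  msg_recv_any(int *tag, void *buf, int nbytes);   /* returns source */

    void task_master(int nworkers, int ngrains, grain_t *grains)
    {
        int sent = 0, done = 0;

        /* Prime every worker with one grain (or tell it to stop). */
        for (int w = 1; w <= nworkers; w++) {
            if (sent < ngrains)
                msg_send(w, TAG_WORK, &grains[sent++], sizeof(grain_t));
            else
                msg_send(w, TAG_STOP, NULL, 0);
        }

        /* Hand out the rest on demand: a fast worker simply asks more often. */
        while (done < ngrains) {
            grain_t result;
            int tag;
            int src = msg_recv_any(&tag, &result, sizeof(grain_t));
            done++;                              /* collect/store the result here */
            if (sent < ngrains)
                msg_send(src, TAG_WORK, &grains[sent++], sizeof(grain_t));
            else
                msg_send(src, TAG_STOP, NULL, 0);
        }
    }

    void task_worker(int master)
    {
        for (;;) {
            grain_t g;
            int tag;
            (void)msg_recv_any(&tag, &g, sizeof(grain_t));
            if (tag == TAG_STOP)
                break;
            /* ... process the grain g ... */
            msg_send(master, TAG_RESULT, &g, sizeof(grain_t));
        }
    }

The dynamic balance comes entirely from the request loop: a worker whose grains happen to be cheap simply comes back for more, so no prior estimate of the cost of each grain is needed.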

2.4 Summary

Trivial decomposition is appropriate for parallelising an independent set of program executions.

Functional decomposition yields limited parallelism, and cannot effectively exploit massively parallel hardware.

Data decomposition techniques fall into two classes:

1. Those appropriate for predictably balanced problems

Geometric decomposition

2. Those appropriate for unbalanced problems.

Scattered spatial decomposition

Task farming

3 An HEP Case Study for SIMD Architectures

There are two main forms of parallel processing: Single Instruction Multiple Data-stream (SIMD) and Multiple Instruction Multiple Data-stream (MIMD). The latter is the subject of Section 4 of these notes, but it is important to understand the essential difference between the two paradigms. SIMD imposes a lock-step synchronisation on the processors, which forces all processors to execute the same instruction at any one time. The only freedom enjoyed by the application is to let the individual processors execute a given instruction or not. In MIMD computing the processes are much more loosely coupled, with potentially many different programs running on different nodes of the computer.

Why then should one be concerned with such a rigid model, when a more flexible alternative is readily available? There are two main reasons. First, the much simpler programming model, with only one thread of control, makes code generation and testing much easier. Since most physical problems require repeated calculation of the same functions at different points in a space (e.g. CFD or finite element analysis), and provided these calculations are well load balanced (that is, that they all require roughly equal iterations), the SIMD model is both appropriate and efficient. There are applications which do not meet these criteria, ray-tracing being an obvious example, and the MIMD model is then more appropriate.

The second reason for interest in SIMD is that the greater simplicity of the model and the processors permits much larger SIMD arrays to be constructed than would be feasible for a MIMD machine. This feeds through into performance; today the fastest machine in the world is the SIMD Connection Machine (CM) made by Thinking Machines Corporation (TMC). In its largest configuration it has 65,536 processing elements and a peak performance of 40 GFLOPS. In applications it has returned a sustained performance of 21 GFLOPS, about seven times greater than the peak power of a Cray Y-MP/8. For the highest performance on many problems SIMD processing is still unsurpassed.

SIMD computing has a substantial history that has led to three major players currently in the data parallel marketplace: AMT, Thinking Machines Corporation and MasPar Corporation. Figure 12 shows the development of these companies.

3.1 SIMD Architecture and Programming Languages

The basic SIMD architecture was introduced in Section 1 and is shown in greater detail for one particular machine in Figure 13. The master controller (MCU) executes scalar instructions itself, and issues array instructions to the PEs in the PE array. The controller interacts with the host system as required, for example for I/O.

Figure 12: Origins of the DAP, Connection Machine, and MP-1.

  1972  Research begins at ICL
  1976  First DAP commissioned
  1980  First commercial DAP shipped
  1983  Thinking Machines Corporation founded
  1985  First miniDAP shipped
  1986  AMT formed; first CM-1 shipped
  1987  DAP-5xx
  1988  DAP-6xx; CM-2 with floating-point hardware; MasPar Computer Corporation founded
  1990  CP8 coprocessor; CM-200 (increased clock speed); MasPar MP-1 family
  1991  CM-5

The Connection Machine series differs slightly from this basic design, in that PEs are connected in a hypercube topology (rather than a mesh), and in that single floating-point accelerators are shared between groups of 32 PEs. However, this has little impact on the user unless programming is being done at a machine-code level.

SIMD machines are typically programmed in some special variant of Fortran 77, or for optimal CPU performance in an assembly level language. A parallel version of C is available on the CM series, and under development on the AMT DAP. A data parallel Fortran (CMF, say) includes an extended range of data structures designed to match the machine architecture, and provides logical masks to block PEs from updating their data if required. A comparison of some of these features with standard serial Fortran code is shown in the program fragment below (see Figure 14).

Clearly, an advantage of such parallel languages is the greater simplicity with which many problems can be coded. As software support advances, users have been able to use parallel arrays of any size and dimension (in the past there was often a restriction to multiples of machine size). Compilers will now take responsibility for the sensible distribution of arrays across processors, although it can still be the case that most efficient execution is obtained through a wise choice of data structures. In the general case, we can consider our machine to be an array of virtual processors, one for each element of our parallel data structure. These virtual processors are then mapped onto actual processors by our compiler.
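As a concrete instance of the virtual-processor idea (our own arithmetic for a hypothetical configuration): mapping a 100 × 100 parallel array onto a 32 × 32 PE grid gives each physical processor a block of at most

\[
\left\lceil \tfrac{100}{32} \right\rceil \times \left\lceil \tfrac{100}{32} \right\rceil = 4 \times 4 = 16
\]

virtual processors, which it services by looping over them for every array instruction the MCU issues.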

Figure 13: A schematic diagram of SIMD hardware (the AMT DAP 610c, showing registers, the 1-bit processor array with its array memory, and the 8-bit co-processors).

3.2 Case Study: Lattice QCD Codes

In order to give a flavour of the techniques involved in writing data parallel programs for an SIMD machine, let us look at one particular HEP application: the lattice formulation of quantum chromodynamics (QCD). In the quenched approximation this process can be divided into three distinct phases: the generation of pure gauge configurations; the generation of quark propagators for each configuration; and the combination of these propagators to produce approximations to real particles. The third stage of this process does not have a particularly high computational requirement; however, the first two phases involve the generation and manipulation of very large arrays (typically 650,000 instances of a 3 × 3 matrix). For this reason lattice QCD computations have been at the forefront of high performance computing Grand Challenge projects, and are therefore serious users of leading-edge parallel computing technology. We hope here to give an introduction to a data parallel implementation of lattice QCD; although we cannot go into the details of the theory, we hope that enough is presented to provide a structure within which to discuss SIMD parallelisation.

CMF:

      C sky**2 comparison
      C with zero-point
            real, array(100,100) :: sky
            real z_p

            z_p = 0.5
            call evaluate(sky)

      C compare; if sky**2 < z_p
      C then set sky = z_p
            where (sky**2 .lt. z_p)
               sky = z_p
            endwhere

Fortran 77:

      C sky**2 comparison
      C with zero-point
            real sky(100,100)
            real z_p

            z_p = 0.5
            call evaluate(sky)

      C compare; if sky**2 < z_p
      C then set sky = z_p
            do 10 i = 1, 100
               do 10 j = 1, 100
                  if (sky(j,i)**2 .lt. z_p)
     &               sky(j,i) = z_p
   10       continue

Figure 14: A comparison of a small fragment of data parallel Fortran and Fortran 77.

QCD theory describes the structure of hadrons by considering confined elementary particles (quarks) with three charges (or colours), and eight charged gauge bosons (gluons) which can undergo strong interactions. Unfortunately, unlike quantum electrodynamics (QED), which has one charge, one uncharged non-interacting boson (the photon) and a weak interaction strength, QCD cannot be solved analytically at low energies and we must look to numerical solutions. We therefore formulate a discretisation of space-time on a four-dimensional lattice. In this formulation gluons live on lattice links and are represented as 3 × 3 complex matrices, whereas quarks inhabit lattice sites and are represented as complex vectors of length 12. In this way QCD can be formulated as a four-dimensional (typically 24³ × 48) statistical mechanics problem with a "temperature" inversely related to the strength of the quark-gluon coupling. Numerical solutions must first develop a background gluon field, and through this then study the movement of quarks.

3.2.1 Generation of Gauge Configurations

A pure gauge configuration consists of all values of the gauge fields (gluons) on the space-time lattice at any instant in real time; the lattice itself contains a time dimension, and we do not take sections through it, so a particular configuration consists of all lattice values at all lattice times. Although we have stated that the gauge fields are thought of as living on the links, for computational purposes they are placed on the lattice sites.

There are two common ways to begin a sequence of these configurations: hot and cold starts. The former corresponds to a disordered system where all of the gauge fields are set to random values; the latter corresponds to an ordered system where all of the gauge fields are set to the same value. From this starting configuration we must then use Monte Carlo techniques to evolve the whole lattice.

Updating a single link within the lattice is the fundamental step in Monte Carlo computations on the lattice. To obtain a new configuration we successively place each link in contact with a 'heat-bath' that selects a new link value stochastically with a Boltzmann probability. All gauge links are updated one by one in a complete sweep through the lattice, and after many sweeps (typically > 10,000 from a hot start) a gauge configuration in thermal equilibrium is obtained. This configuration is then recorded, and a succession of further configurations is generated (typically in steps of a further 2,000 sweeps from the equilibrium state) to build a database of gluon configurations.

There is clearly potential for two types of parallelisation in this application. On the one hand we could consider a trivial parallelisation using one processor to generate each configuration (provided our processors have a large enough memory capacity), and on the other we could parallelise the generation of each individual configuration. In fact we could also use a combination of these two approaches. However, let us for now consider the parallelisation of individual configuration generation, as this fits rather well with a data parallel approach, and is much more realistic in terms of current memory limitations. We should therefore consider the data decomposition of our QCD lattice across our array of processors.

The energy determination for each site in our lattice depends on plaquettes, the elementary cycles of links in the lattice; this is shown for two dimensions in Figure 15. This dependence means that all links within the same plaquette must be updated independently. In addition, links with the same direction vector from sites with differing parity (where parity = x + y + z + t) must also be updated independently.

Since all other links from each lattice site, as well as similarly directed links from sites of differing parity (all neighbouring sites have opposite parity), have to be held constant while a chosen link is updated, this restricts the number of links that can be simultaneously updated to half of all the lattice links in a given direction. We therefore identify all alternate links along each axis as making up a set of simultaneously updatable links; this is known as RED/BLACK (or ODD/EVEN) preconditioning (or decomposition). Figure 16 shows the structure of a two-dimensional slice through our four-dimensional lattice, with certain links highlighted. These thicker links represent those that can be simultaneously updated, since none are common to the same plaquette. It is a relatively straightforward process to extend this concept to four dimensions.

Figure 15: The cycle of links corresponding to the two plaquettes that contain the gauge-field vector U(x). Within any particular plaquette all link updates must be performed independently.

Figure 16: Red-black preconditioning of the lattice. Only the thick links in the x direction may be updated at the same instant. This leads to the division of the lattice into two sub-lattices of ODD and EVEN parity.

We can see that, even allowing for the restriction that only one in every eight links can be simultaneously updated (one direction out of four, and one of two parities), our lattice is so large that there is still plenty of scope for parallelisation. The Monte Carlo update of the link state is exactly the same for each set of current links, and is thus ideally suited to a data parallel implementation. We must note that, should the number of links in our lattice be close to the number of processors available in our target machine, this type of exclusive update could result in inefficient use of the machine. However, for the large QCD lattices this is not a problem: each processor is loaded with many lattice sites (i.e. virtual processors) of both parities, and therefore each will be equally busy for every type of link update.
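As a concrete, much simplified illustration of such a masked update, the Fortran 90 sketch below performs one half-sweep over a two-dimensional lattice: a logical parity mask selects the even sites, which are all updated in a single data parallel statement while the odd sites are held constant. The 'update' is a dummy stand-in for the real heat-bath step; the fragment is only a sketch of the idea, not the production code.

      program red_black_sketch
        implicit none
        integer, parameter :: nx = 8, ny = 8
        real    :: field(nx, ny), trial(nx, ny)
        logical :: even(nx, ny)
        integer :: x, y

        ! Parity mask: .true. where x + y is even.
        do y = 1, ny
          do x = 1, nx
            even(x, y) = mod(x + y, 2) == 0
          end do
        end do

        call random_number(field)
        call random_number(trial)      ! stand-in for heat-bath proposals

        ! One half-sweep: every even-parity site is updated simultaneously;
        ! all odd-parity sites are masked out and held constant.
        where (even) field = trial

        print *, 'updated ', count(even), ' of ', nx*ny, ' sites'
      end program red_black_sketch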

3.2.2 The Connection Machine Implementation

The CM-200 at EPCC has 16,384 bit-serial processors plus 512 floating-point accelerator co-processors, with a total of 0.5 gigabytes of memory. The machine is generally programmed in slicewise Fortran, which is based on Fortran 77 with the array-handling facilities of Fortran 90 and some CM-specific extensions. When declaring a multi-dimensional array, not only must the bounds be specified on all indices as normal, but also how the array is to be laid out on the machine, e.g.

      COMPLEX, ARRAY(ncolour,ncolour,nxby2,ny,nz,nt) :: lattice
CMF$  LAYOUT lattice( :SERIAL, :SERIAL, :NEWS, :NEWS, :NEWS, :NEWS )

The first statement simply declares the array, called lattice, to be of dimensions ncolour × ncolour × nxby2 × ny × nz × nt. The second statement specifies that the first two indices (:SERIAL) are to live on the same processor, i.e. there will be ncolour × ncolour complex numbers of that array living on each processor, and that the last four indices (:NEWS) are to be distributed across the various processors. To visualise this, imagine an ncolour × ncolour matrix defined at each point in space-time.

Since there are not enough PEs to have one for each space-time point (unless a very small lattice or a bigger CM is being used), the compiler operates in terms of Virtual Processors (VPs). One VP is allocated to each distributed lattice point. For one parity/direction on a 24³ × 48 lattice on the full EPCC machine we get a VP ratio of 648 (one parity gives (24³ × 48)/2 = 331,776 sites, which spread over the 512 floating-point nodes is 648 per node). It is assumed that such an array is distributed evenly over the processors; this is left to the compiler. However the distribution can be examined, and has been found to be fairly efficient. If necessary, the distribution can be altered by weighting the different axes, although this has not been done so far in practice. In fact, with a straightforward labelling scheme for lattice sites, each physical processor will contain a set of sites that are all (whether of odd or even parity) nearest-neighbour locations. At this stage this ensures that each processor will have an even distribution of sites and link types. However the distribution proves even more important for the efficiency of the next stage of parallelisation.

3.2.3 Calculation of Quark Propagators

Once we have generated our gluon configurations, the next stage of the QCD calculation is to produce quark propagation matrices for each of these configurations. These matrices can be thought of as describing the probability of quark motion from the origin of our four-dimensional lattice to every other lattice site. This calculation can be reduced to a matrix inversion problem, using, at the lowest level, the multiplication of vectors and matrices.

Although our lattice matrix is very large, since all motion is to nearest-neighbour sites we have a very sparsely populated transition matrix; in fact this will be a banded matrix with a diagonal and eight off-diagonal bands. The parallelisation of algorithms to perform such calculations is a well established process, and can be performed with high efficiency even on fully synchronous machines like the Connection Machine. CM Fortran actually includes library routines to perform this type of calculation.

The QCD group at Edinburgh have written their own version of these routines, in order to make use of the processor-locality of neighbouring site values. CM Fortran provides functionality (CSHIFT) to move data values around the virtual processor array in local steps, and these shifts can be used to locate nearest-neighbour values. Figure 17 shows the conceptual nature of the CSHIFT operation, and how we can imagine data migrating through our array structure. This is most easily thought of as a large number of simultaneous transfers between our virtual processors. If VPs in communication with each other are mostly on the same physical processor (which the Edinburgh group have ensured) we can greatly reduce the costs of these transfers. The first Edinburgh implementation runs at an overall speed of > 550 MFLOPS.
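To make the CSHIFT idea concrete, the self-contained Fortran 90 sketch below uses the standard CSHIFT intrinsic (which mirrors the Connection Machine routine) so that every element of a small array gathers copies of its four neighbours, with periodic boundaries, and combines them in a simple nearest-neighbour stencil. This is only a stand-in for the real banded matrix-vector multiply used in the propagator calculation.

      program cshift_sketch
        implicit none
        integer, parameter :: n = 4
        real :: v(n, n), east(n, n), west(n, n), north(n, n), south(n, n)
        real :: stencil(n, n)

        call random_number(v)

        ! Each (virtual) processor fetches the value held by its neighbour;
        ! the shifts are circular, giving periodic boundary conditions.
        east  = cshift(v, shift =  1, dim = 1)
        west  = cshift(v, shift = -1, dim = 1)
        north = cshift(v, shift =  1, dim = 2)
        south = cshift(v, shift = -1, dim = 2)

        ! A simple nearest-neighbour combination, standing in for the
        ! banded matrix-vector product described in the text.
        stencil = east + west + north + south - 4.0 * v

        print *, 'sum of stencil = ', sum(stencil)
      end program cshift_sketch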

3.2.4 Summary

We have seen that data parallel computing has great advantages in its programming simplicity, and for the right type of problem can show near-optimal use of massively parallel computers. It is fortunate that many physical problems are well suited to this data parallel approach, where we typically have a regular data structure representing some spatial/temporal lattice.

Our QCD case study reports on the success already achieved in using SIMD supercomputers for a High Energy Physics problem. We have detailed the general techniques involved in a data parallelisation, along with a few of the problems that can be encountered in striving for maximum efficiency. Hopefully this has provided some motivation towards the future parallelisation of other suitable HEP codes.

Figure 17: A schematic representation of the CSHIFT operation in Connection Machine Fortran. These routines are used to perform banded matrix-vector calculations by ensuring that the required nearest-neighbour matrix data is in adjacent locations in the virtual processor array.

4 An HEP Case Study for MIMD Architectures

4.1 A Brief History of MIMD Computing

As we discussed in Section 1, it is now generally accepted that truly scalable parallel computing requires machine architectures to follow a distributed memory structure. If we attempt to use any form of centralised memory store, access bottlenecks will lead to execution inefficiencies as we use larger numbers of processors. Although this may not be a problem for general purpose or workstation computing, in high performance computing we will always be seeking out ultimate performance, and this will require the use of the largest possible number of the most powerful processors.

The development of MIMD computing progressed in different directions on either side of the Atlantic. In the United States, the pioneering work of Geoffrey Fox and Charles Seitz at the California Institute of Technology led to the development of hypercube-based machines such as those marketed by Intel and Ncube. In Europe, however, distributed memory computing only became a realisable possibility following the invention of the transputer in the mid 1980s. This complete microprocessor, on a single piece of silicon, was the first to place calculation units, memory store and communication units on the same chip. Although Inmos, the inventors of the transputer, initially aimed their product at embedded control systems (for devices as simple and common as household washing machines, for example), the transputer's greater potential as a component for highly parallel computing systems was soon realised. A group of Inmos employees formed the company Meiko Scientific in order to design and build such machines. Meiko have gone on to become arguably the most successful European manufacturer of parallel computers.

4.1.1 The Importance of the Transputer

The general philosophy behind the transputer is to provide a family of compatible components which are able to communicate with one another using minimum external logic. Transputers communicate via point-to-point links, of which each transputer has four. In transputer-based multiprocessors these serial communication links and their interconnection topology constitute the message transfer system. The T800 is the second generation of the transputer. The design (see Figure 18) is essentially the same as that of the original chips (T414), but has extra on-chip memory and an on-chip floating-point unit. Extra instructions are provided to support floating-point data types and to give direct support to graphics operations. The fixed and floating-point units operate independently, so that a limited amount of implicit overlap of instructions can occur. Synchronisation between the two units occurs when data are moved into or out of the floating-point unit. This permits integer address calculations to proceed in parallel with floating-point calculations.

Figure 18: The conceptual structure of the Inmos T800 transputer (processor, on-chip RAM, external memory expansion, and four serial links).

Although the transputer has now been surpassed by newer technology hardware (i860s, SPARCs, Alpha), its concepts and influence over parallel computing development are undeniable. In particular the topology flexibility created by using an electronic switching chip between processors, rather than the fixed processor topology of the hypercube machines, seems at last to be gaining world-wide support.

4.1.2 The Need for Message Passing

In order to implement an application on such a DM-MIMD computer, a computation must be divided into tasks that do not need to share memory. We will, however, almost always require these separate tasks to communicate with each other in order to exchange data values and to synchronise. A multi-computer is therefore usually programmed using a message passing system to provide this functionality. This currently has its disadvantages because of the typically primitive and non-intuitive nature of available software interfaces, and the lack of a standard interface across various manufacturers' machines. It has the advantage of imposing relatively low software overheads (unlike higher level abstractions such as Strand88 or Linda). Since performance is still the critical issue for most users of parallel machines, the pain of dealing with the machine in this way is usually considered worthwhile.

A complete message passing system must have the following six components:

o An addressing scheme for the destinations of messages.

o A technique for avoiding deadlock in the message passing system.

o A policy for deciding when messages can be sent.

o A mechanism for sending messages between tasks.

o A mechanism for selecting between messages at destination.

o A mechanism for locating tasks on processors.

These requirements are met by available message passing systems in different ways, and one of the greatest hindrances to the rapid spread of parallel programming techniques is the present lack of a common interface to message passing that would allow code to migrate between machine architectures and manufacturers. This problem is currently being addressed by the Edinburgh Parallel Computing Centre through the production of a Common High-level Interface to Message Passing (CHIMP), and hopefully our experiences will make a useful contribution to the current MPI standards initiative (see Section 5).
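To illustrate two of these components, the addressing scheme and message selection at the destination, here is a deliberately toy Fortran 90 sketch in which 'tasks' are addressed by integer ids and messages are selected by integer tags. The mailbox lives inside a single program, so no real communication takes place; the subroutine names and the interface are invented for this illustration and do not correspond to CHIMP or to any other real message passing library.

      module toy_mp
        ! A toy, single-program mailbox standing in for a message passing
        ! system: integer task ids provide the addressing scheme, and
        ! integer tags allow selection between messages at the destination.
        implicit none
        integer, parameter :: maxmsg = 64, maxlen = 16
        integer :: msg_dest(maxmsg) = 0, msg_tag(maxmsg) = 0, msg_len(maxmsg) = 0
        real    :: msg_data(maxlen, maxmsg) = 0.0
        integer :: nmsg = 0
      contains
        subroutine toy_send(dest, tag, buf, len)
          integer, intent(in) :: dest, tag, len
          real,    intent(in) :: buf(len)
          nmsg = nmsg + 1
          msg_dest(nmsg) = dest
          msg_tag(nmsg)  = tag
          msg_len(nmsg)  = len
          msg_data(1:len, nmsg) = buf
        end subroutine toy_send

        subroutine toy_recv(me, tag, buf, len)
          integer, intent(in)  :: me, tag, len
          real,    intent(out) :: buf(len)
          integer :: i
          buf = 0.0
          do i = 1, nmsg
            if (msg_dest(i) == me .and. msg_tag(i) == tag) then
              buf = msg_data(1:len, i)
              return
            end if
          end do
        end subroutine toy_recv
      end module toy_mp

      program toy_mp_demo
        use toy_mp
        implicit none
        real :: x(4), y(4)
        call random_number(x)
        call toy_send(dest = 2, tag = 7, buf = x, len = 4)   ! "task 1" sends to task 2
        call toy_recv(me = 2, tag = 7, buf = y, len = 4)     ! task 2 selects on the tag
        print *, 'received: ', y
      end program toy_mp_demo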

4.2 Case Study: Experimental Monte Carlo Codes

As an example of a typical CERN application code that is suitable for MIMD parallelisation, we have chosen to consider Monte Carlo simulation codes, such as those used as part of a LEP experiment (e.g. Aleph). These codes currently use vast amounts of computing time in order to provide simulation results to compare with actual experimental data. To a limited extent parallel computing has already been taken up in this work, in that the event reconstruction code (Julia) processes real experimental data on a dedicated VAX cluster. However, the current codes for collision simulation, particle evolution (GEANT), detector electronic response simulation, and event reconstruction (Julia) all run on conventional supercomputers. The loading on such machines is already very high, and future plans for the Large Hadron Collider will certainly increase computing demand well beyond present resources.

These Monte Carlo codes provide an ideal example for the discussion of parallelisation techniques, as they present a variety of possible approaches, and indeed will benefit most from a selection of parallel techniques.

4.2.1 Physics Event Generation

The simulation of e+e- collisions forms the first stage of experimental Monte Carlo work at LEP. These codes perform a simulation of particle collisions that produce an intermediate short-lived, high-energy particle (in these examples we will consider the process Z0 → qq̄ → hadrons). These bosons then undergo a succession of almost immediate decays as a shower of quarks and gluons (see Figure 19).

Following the shower phase, the products of the Z0 decay event undergo a process of hadronisation that produces jets of resultant particles. These final jets are the end result of the physics event generation codes, and are stored as an array of (usually) around 100 particles, each described by its name, mass, energy and momentum.

Figure 19: A schematic diagram of Z0 boson decay and hadronisation (the shower phase followed by hadronisation into jets of decay products).

This stochastic simulation code makes a series of probabilistic calculations to track each event through the tree of various possible decays. Although the calculation time for each event is relatively small (< 1 CPU second on the CERN Cray), the necessity to perform many simulations (~40,000 events are placed on each data tape) can result in a high computational load. This is always going to be far less of a concern than the computing requirements for the later stages of the experimental work. For many theoreticians event simulation stops here, and they can begin to study the distributions and nature of the decay products.

Parallelisation of this section of the Monte Carlo codes, although not a high priority, is a relatively straightforward task for MIMD implementations. The problem fits very well into the trivial parallelism mould, with massive duplication of identical programs, each writing its results to some central data store. On the other hand a data parallel implementation, although possible, would be more difficult to achieve and to implement efficiently: the inherent branching nature of such probabilistic codes would require fully synchronised codes to use processor masking at each node in the decision tree.
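A minimal sketch of this trivial decomposition is given below in ordinary Fortran 90, with a dummy 'event generator' standing in for the real physics code. Every worker runs an identical program, differing only in the random number seed it is given and the file it writes, so no communication is needed during the run; here the workers are simply iterations of a loop rather than separate processors.

      program trivial_event_farm
        implicit none
        integer, parameter :: nworkers = 8, nevents = 1000
        integer :: worker, ev, seedsize
        integer, allocatable :: seed(:)
        real :: weight
        character(len=16) :: fname

        ! In a real run each worker would be a separate copy of the program
        ! on its own processor; here they are just iterations of a loop.
        do worker = 1, nworkers
          call random_seed(size = seedsize)
          allocate(seed(seedsize))
          seed = 12345 + worker              ! a different seed for each worker
          call random_seed(put = seed)
          deallocate(seed)

          write(fname, '(a,i3.3,a)') 'events', worker, '.dat'
          open(10, file = trim(fname), status = 'replace')
          do ev = 1, nevents
            call random_number(weight)       ! stand-in for generating one event
            write(10, *) ev, weight
          end do
          close(10)
        end do
      end program trivial_event_farm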

4.2.2 Task Farming GEANT

Experimental Monte Carlo work begins in earnest with the results from the decay code discussed above. The objective now is, for each event, to simulate the motion of each of the decay products from the point of initial Z0 creation through each section of the detection chamber. The Galeph package of codes (based on the GEANT library) reads the initial detector structure into memory, and then takes individual particles and tracks them through the detector.


By using a task farm to implement this application we can easily avoid the efficiency problems produced by different events requiring differing amounts of computation time. We can also implement such techniques in a flexible and efficient manner: allowing the number of workers, or even of sources and sinks, to vary; using a demand-driven farm to reduce message-passing overheads and contention; and implementing buffers and queues on the workers to hide communication latencies. The technique also allows a relatively straightforward route for code porting, since each worker will run the complete GEANT and detector response codes essentially unchanged; we simply wrap a communication shell around the code. This method is close to the trivial parallelism used for the theoretical codes, but with the addition of the task farming processes to ensure efficient use of all processors.
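The control logic of such a demand-driven farm is sketched below in ordinary Fortran 90, with the message passing replaced by a direct simulation of worker requests: the source simply hands the next unprocessed event to whichever worker becomes idle first, so expensive events do not hold the others up. The event costs are random placeholders, not real GEANT timings.

      program demand_driven_farm
        implicit none
        integer, parameter :: nworkers = 4, nevents = 20
        integer :: done(nworkers) = 0        ! events completed by each worker
        real    :: busy(nworkers) = 0.0      ! simulated time at which each worker is free
        real    :: cost
        integer :: ev, w(1)

        do ev = 1, nevents
          ! The worker that becomes idle first "requests" the next event.
          w = minloc(busy)
          call random_number(cost)           ! stand-in for a variable event cost
          busy(w(1)) = busy(w(1)) + 0.5 + cost
          done(w(1)) = done(w(1)) + 1
        end do

        print *, 'events per worker:    ', done
        print *, 'simulated busy times: ', busy
      end program demand_driven_farm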

However there are potential problems in using this simple approach, due to the size of the executable and of the detector description needed on each processor within the machine. As mentioned earlier, the description of the whole Aleph detector requires a massive amount of memory. Typically in massively parallel supercomputers, although the overall memory of the machine may be large, the amount allocated to each processor is limited (especially if we consider retaining all necessary data on-chip). Therefore, as we consider parallelisation schemes, we must keep in mind the practical aspects of our target machine architecture. This becomes even more important as we move on to studying potentially larger problems such as the LHC project.

4.2.3 Geometric Decomposition of the Detector

In order to use massively parallel distributed memory computers to run our Monte Carlo codes with high memory requirements, we must look to decomposition techniques that are more complex than task farming. We must develop methodologies that decompose our data structure across the available processors in a sensible and efficient manner, allowing us to obtain optimal granularity and load balancing. As an example of one approach to this problem, let us look at a geometric decomposition of the Aleph detector (see Figure 21).

It seems relatively straightforward to decompose our solution codes according to the area of the detector material in which they work. In fact the GEANT codes are already functionally decomposed in this manner. Each code will therefore only need to store data relating to its particular domain of responsibility, be this the inner tracking chamber (ITC), the time projection chamber (TPC), the calorimeters, the outer muon detectors, or possibly within the beam pipe. The decomposition could even be taken a step further, if necessary, and large detector units broken down under some regular domain decomposition scheme.

Figure 21: An outline of the detector structure for LEP experiments (muon detectors, hadron calorimeter, electromagnetic calorimeter, ITC, and superconducting magnets).

This approach provides us with a set of codes responsible for tracking and response simulation within specific regions of the detector equipment. Therefore, when provided with a particle entering its domain of responsibility (through the arrival of a data message), a processor can run its GEANT code until the particle decays or leaves its domain. The detector response code could run on the same processor in conjunction with the GEANT code, or could potentially run on another processor, with detector events being passed on as message traffic. The output from the detector response sections can then be passed to a sink process for collation.
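The routing decision itself can be sketched as a small owner function mapping a particle's radial position to the group of processors responsible for that detector region. The Fortran 90 fragment below is purely illustrative: the radial boundaries are invented placeholders, not the real Aleph geometry.

      program region_router
        implicit none
        real :: r
        call random_number(r)
        r = 250.0 * r                        ! a made-up radial position in cm
        print *, 'radius ', r, ' -> detector region ', region_of(r)
      contains
        integer function region_of(radius)
          ! Illustrative radial boundaries only (not the real detector).
          real, intent(in) :: radius
          if      (radius <  10.0) then
            region_of = 1                    ! beam pipe
          else if (radius <  30.0) then
            region_of = 2                    ! inner tracking chamber (ITC)
          else if (radius < 180.0) then
            region_of = 3                    ! time projection chamber (TPC)
          else if (radius < 230.0) then
            region_of = 4                    ! calorimeters
          else
            region_of = 5                    ! muon detectors
          end if
        end function region_of
      end program region_router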

Obviously, at the start of processing for a particular event there will be a heavy bias of work towards those processors responsible for tracking and response calculations within the beam pipe and inner chamber. The bulk of the workload will then move to those processors responsible for the TPC, and later to others for the outermost areas of the detector. This may at first seem rather inefficient, since we would want to keep each processor on a single task, to avoid the need to load new detector data. However, once we realise that we have very many such events to process, with a well implemented source program we can begin a pipeline of events. Once the first event's decays are outside the beam pipe, those processors can start to process the initial motion of decays from the second event, then the third, and so on. In this way we can combine a geometric data decomposition of our problem with both a pipeline and a task farm, thus achieving an efficient parallel implementation.

4.2.4 Summary: A Hierarchy of Approaches

In order to achieve a flexible and optimal parallelisation of the CERN experimental Monte Carlo codes, we have suggested the combined use of a variety of the decomposition techniques discussed earlier in Section 1. This has not been engineered purely for the sake of example, but will, we believe, provide the best route to an efficient implementation.

We have moved from the trivial parallelisation of the event generation codes, through task farming GEANT, to the probable necessity of a geometric decomposition of the detector. The resulting set of codes can then be implemented as a set of task farms arranged in a pipeline in order to gain reasonable efficiency. In fact this whole structure can then be duplicated to make use of more processors (either in a trivial manner, or even within a higher level task farm). With the large number of events that must be processed, we can ignore the start-up and close-down inefficiencies of our pipeline, and we can experiment with the granularity of the detector decomposition, and the relative numbers of workers within each domain, in order to gain optimum granularity and load balance.

This hierarchy of parallelisation schemes is best summarised by use of a diagram, and this is shown in Figure 22 where we have attempted to detail all the levels we have discussed.

Once the raw data is on the final tape it must be processed by the Julia program. This is currently done for experimental Monte Carlo work using conventional computing techniques; however, for real data analysis use is already being made of coarse-grained parallelism as the first step towards a massively parallel approach.

Figure 22: The hierarchy of parallelisation schemes for the experimental LEP Monte Carlo codes. The diagram shows three parallel pipelines, each of which contains a stream of sources, each responsible for sections of the detector. These pass out tasks to dedicated workers running GEANT routines; the results of these runs are subsequently passed to detector response (DR) workers for analysis. In turn these pass results to a response sink that collates and writes to tape.

5 High Performance Computing: Initiatives and Standards

5.1 High Performance Computing and Networking Initiatives

One undeniable truth about the race to achieve ever more performance from our supercomputers is that the research and development effort required is far too great for any single manufacturer to bear. Therefore, throughout the developed world, governments are being prompted to fund initiatives to provide the necessary knowledge with which manufacturers can work. These high performance computing initiatives are under way in Japan, the United States and now also in Europe.

5.1.1 Japan

Japan has a good track record of special purpose machines emerging from government-assisted projects. For example the QCDPAX, a machine designed for Japan's efforts on quantum chromodynamics calculations, is currently in use and running at around 3 GFLOPS. In February 1991 the Japanese government released a proposal for a five-year follow-on project (CP-PACS: computational physics by parallel array computer system) between academic physicists and industry, aimed at demonstrating several hundred GFLOPS performance. This $12M project is based in Japan's national centre for computational physics.

Japan is also involved in the competition to produce the first TeraFLOPS machine. Fujitsu have announced that their AP1000 project has a goal of 1 TFLOPS, and have placed five of these machines in Japanese universities. More recently the Japanese government has funded the Real World Computing Initiative, which has technology tracks aimed at massive parallelism and neural computing models, as well as other areas.

5.1.2 United States

Figure 23: Some grand challenges and their predicted computational requirements (memory requirements from about 1M to 10G words plotted against processing speeds from 100 MFLOPS to 1 TFLOPS; the examples range from 2D plasma modelling and oil reservoir modelling up to climate modelling, fluid turbulence, the human genome and an estimate of the Higgs boson mass).

The United States seems to have made the most concerted effort to support high performance architecture research, and can already be seen to be halfway to the goal of 1 TFLOPS. The Intel Delta at the California Institute of Technology is currently performing at 40 GFLOPS, and the Intel Paragon is expected to produce 150 GFLOPS once acceptance tests are passed. With Thinking Machines Corporation, the US QCD community and $40M of capital available, the full project would seem to be on course for its scheduled production of 1 TFLOPS in 1994. This seems to be confirmed by the recent announcement of the CM-5 design and its latest LINPACK benchmark results of > 59 GFLOPS sustained, which scales to 0.944 TFLOPS for a full (16k) CM-5, although it must be remembered that the Los Alamos 1k machine cost $25M. With five US supercomputing centres receiving around $60M of government support per annum, the CM-5 at Los Alamos, and the Intel Paragon at Oak Ridge, the United States can be seen to be taking high performance computing very seriously. Indeed this recent quote from the new Vice President, Al Gore, highlights their attitude:

The nation which most completely assimilates high performance comput ing into its economy will very likely emerge as the dominant intellectual, economic, and technological force in the next century.

What is even more impressive is that the previous US government had already taken high performance computing seriously. The Federal High Performance Computing and Communications (HPCC) Program (published in 1991) details how the US government sees investment in computer research as enabling US competitiveness. The program also details a set of grand challenges in science, and shows that the computing power to solve them is as yet unavailable (see Figure 23).

The solution, the program suggests, is massive additional investment (around $1 billion) by the government in many fields of computer science research and development.

Figure 24: Supercomputing revenues predicted by the Gartner Group report (1991). The two panels (scenario A, no funding; scenario B, funding) plot projected revenues, in millions of dollars, for parallel supercomputers, US vector supercomputers and Japanese supercomputers from the mid-1980s to 2000.

These stretch from system design tools and advanced prototype development, through computational techniques and software components, to funding for a new national research and education communications network. All in all, additional funding is suggested for 597 different projects over five years.

Since the HPCC report, the US Department of Energy and the Los Alamos National Laboratory have employed the Gartner Group of Stamford to assess the effects of supporting the Federal HPCC Program. Their report shows startling predictions of the rise in parallel supercomputing revenues (see Figure 24), whether the program is taken up (scenario B) or not (scenario A).

The Gartner report also stresses the importance to industry of moving between cascading technology curves (Figure 25), whether the federal proposals are acted upon or not. This graphically shows the importance of moving away from one technology that is suffering from stagnation as far as advancement is concerned (point A), and moving to another that has future potential (point B). Many companies are now making this move away from their conventional vector machines towards general and special purpose parallel computers.

The bottom line of the Gartner Group report is that they estimate the gross national product (GNP) of the United States will increase by between 172 and 502 billion dollars over the next ten years if the proposals of the Federal HPCC program are put into operation.

Figure 25: Cascading technology curves (vector supercomputers and parallel supercomputers) as shown in the Gartner Group report.

5.1.3 Europe

The European Community has funded research into advanced information technology for many years. The Esprit programme of joint research projects between universities and industry has strengthened Europe's position, particularly in the areas of advanced computer technology and applications. Unfortunately Esprit and Esprit-2 provided little funding specifically for parallel processing work, the two notable exceptions being the Parsys/Telmat Informatique transputer-based SuperNode machine, and the GP-MIMD project. It does seem, however, that in Esprit-3, the latest round of funding, more will be directed at parallel themes, many of them applications oriented.

The EC also commissioned Prof. Carlo Rubbia, Nobel prize winner for Physics and Director General of CERN, to report on the state of high performance computing in Europe. This report points out that although Europe consumes around 30% of the world's supercomputers, it produces almost none. The Rubbia report therefore contains proposals very similar to those in the Federal HPCC report. However the time-scale for action in Europe has always been somewhat behind the United States, with the first extra funding only becoming available in 1993.

Although later in starting, the European initiatives do now seem to be making progress. Within the general Esprit banner there are now two main domains of interest to us: firstly the Human Capital and Mobility (HCM) scheme, and secondly the High Performance Computing and Networking (HPCN) initiative. HCM simply aims to increase the human resources available for research and technological development in all fields, with a specific area of interest in Information Technology. The programme will spend 500M ECU over four years to fund fellowships, networking, large scale facilities and conferences.

The HPCN initiative is obviously more focused on our area of specific interest. The remit of this programme is to bring about the implementation of the Rubbia 1 recommendations. This means the creation of domestic market conditions that are favourable to the emergence of a competitive European supply industry. The programme has specific aims and action lines that specify a user- and application-driven approach, with a specific aim of exploiting parallel systems. The HPCN funding comes through the general Esprit banner, and signals a definite sea change in the Esprit focus. We are now seeing this funding become far more market oriented, with the main objective being the parallelisation of the 10-15 most industrially relevant codes. It is specifically stated that this work must result in implementations that are as portable and scalable as possible.

5.2 Emerging Key Technologies

The basic concept behind the EPCC Key Technology Programmes (KTPs) is to enable the development of commonly used parallel programming tools and libraries which will be used by applications as required. We have worked extensively on three KTPs: CHIMP, PUL, and NEVIS, and these are described below. The applicability of the software generated by the KTPs to industrial projects within the EPCC is shown in Table 2, and this provides encouraging evidence of the viability of re-usable parallel software libraries when implemented on a portable message passing interface.

Table 2: The exploitation of Key Technology Programmes in current industrial EPCC projects. (Rows: CHIMP (UNIX), CHIMP (T800), CHIMP (i860), PUL-EM, PUL-TF, PUL-RD, PUL-GF, PUL-SM, NEVIS-EML, NEVIS-PDGL. Columns: AEA, BT, Shell, Cairntech, GMAP, GIS, Intera, RR.)

5.2.1 CHIMP

Communication between processes on distributed memory MIMD computers, or clustered workstations, is accomplished using a message passing system. While many such systems exist, they tend to be vendor-specific. In order to increase the portability of applications it is necessary to define and implement a common interface on a range of different platforms. This is precisely the remit of the Common High-level Interface to Message Passing (CHIMP) KTP.

The CHIMP project started in January 1991, and prototype software was implemented by the end of that year. This enabled the current interface to be defined (Version 1.0), and this has been ported to a range of platforms. The CHIMP interface has both Fortran and C language bindings and is available on:

Meiko T800 Computing Surface

Meiko i860 Computing Surface (both MK086 and MK096 hardware)

Sun, RS6000, HP, and SGI UNIX workstations

Fujitsu AP1000

Generic transputer-based machines (C only)

Intel iPSC/860 and iPSC/2

Thinking Machines CM-5 (CMMD Version 3.0)

CHIMP gives a basic message passing system for the transfer of single messages between end points. This means that it provides blocking and non-blocking communications, multi-casting of messages, and message selection. CHIMP provides a connectionless reliable datagram service and has a process naming scheme with a two-level hierarchy to enable the placement of names into groups. This convention provides a general and efficient programming model in which higher level message passing functions are implemented on top of CHIMP within the Parallel Utilities Library (PUL) KTP.

Currently CHIMP uses a pseudo-dynamic configuration strategy; that is, all of the processes must register their names and then call a synchronisation function. After this point it is not possible to add new names to the system, only to pass messages between processes. A new version of the interface has been developed and is currently under test. This new release (Version 2.0) will permit communications and name registration to be intermingled, producing a much more flexible programming model. In addition, the new version of the interface includes basic support for heterogeneous computing.

Perhaps more than either of the other two KTPs, CHIMP has a number of competitors in the marketplace. Two of these, PVM and PARMACS, have received support in the US and Europe respectively. We believe that CHIMP offers superior functionality in many respects, and its widespread use for industrial projects is unparalleled. One of the principal forums for promoting the design principles adopted within the KTPs, and CHIMP in particular, will be the MPI forum (see later). Until such time as the MPI standard emerges, the EPCC will continue to develop and support CHIMP for use in applications development activities. Because of our close contact with the MPI forum we believe that we will have an implementation of the standard very soon after it is agreed; we should then be in a position to offer CHIMP implementations on top of MPI, or separate MPI implementations.

5.2.2 PUL

Figure 26: The PUL software hierarchy. Applications sit on top of domain-specific utilities (MESH: SM, DM), paradigm-specific utilities (EVENT: TF, DC; GRID: RD, SD) and non-specific utilities (SYSTX: EM, GF), all of which are built on the message passing layer (CHIMP).

The objective of the Parallel Utilities Library (PUL) KTP is the development and maintenance of higher-level libraries that free the applications programmer from re-implementing basic parallel utilities. To this end, PUL builds upon the machine independence offered by CHIMP (see Section 5.2.1 and Figure 26).

The PUL programme currently contains four projects, each of which will provide one or more utilities, which are described below:

SYSTX provides extensions to the underlying system software. Two of the utilities are:

o EM: an extended message passing utility which provides additional functionality to CHIMP, for example scan operations;

o GF: a global file access utility whereby parallel processes can sensibly share files (including files distributed across parallel disks). This utility is targeted primarily at data parallel groups of processes.

EVENT deals with the event parallel programming paradigm, in which a given problem can be subdivided into a large number of similar and independent subproblems. There are two utilities within EVENT:

o TF provides support for a classical task farm in which one source generates tasks which are executed by a number of workers;

o DC provides support for the divide and conquer method, in which it is undesirable for a single process to create the tasks as in TF.

GRID provides support for grid-based problems, where the data are distributed across a surface or volume. The two utilities within GRID are:

o RD supports regular domain decomposition in which the data distribution is uniform;

o SD is an extension of the RD utility to support scattered domain decomposition for problems with uneven loading across the grid;

MESH will support unstructured mesh problems, such as finite element methods. MESH is a collaborative project between PUL and the Numerical Simulations Group. Two utilities are planned; these will provide functionality to assist programmers with the parallelisation of static mesh-based problems, and will include dynamic load-balancing routines to enable efficient implementation of codes with mesh refinement.

o SM supports static unstructured meshes;

o DM supports dynamic or adaptive unstructured meshes.

5.2.3 NEVIS

The NEVIS (Networked Visualisation) KTP is studying the combined use of parallel computers and graphics workstations for interactive visualisation of parallel applications. The long-term aim is to improve software support for the use of networked visualisation with parallel computers.

The work completed to date includes an assessment of the feasibility of parallelising the interfaces to the GL Graphics Library from Silicon Graphics, providing support for parallel modules within modular visualisation environments such as AVS and IRIS Explorer, and developing parallel implementations of various visualisation algorithms. Various demonstration applications have also been developed to show the use of networked visualisation. Work on the Explorer visualisation environment has been generalised to include the development of a local prototype parallel environment, the Modular Visualisation Environment (MVE).

Another significant effort was the design and initial implementation of the Parallel Distributed Graphics Library (PDGL). This is a GL-like interface which would enable DGL calls to be made from separate nodes within a parallel computer. Further ports of DGL and VOGL (a public-domain implementation of SGI’s GL interface with an X driver) have been made to allow them to be used on all of EPCC’s equipment, including a port to the Transputer.

5.3 Parallel Computing Standards Forums

As parallel computing hardware has evolved over the past years, we have seen drastic changes to the software environments supported on different platforms. As we stated earlier, we believe that it is uncertainty in the direction of parallel software that has most prevented the wider use of the technology. However, it now seems that this uncertainty is being removed; as we see hardware technology mature and standardise, so we see efforts come together to standardise software. There is now a clear movement towards two main branches of parallel computing methodology: data parallel programming, and programming with message passing. As industry and vendors have become more certain about the future of these methods, standardisation of each paradigm begins to look like a real possibility.

5.3.1 High Performance Fortran Forum

Although Fortran has been the main language of scientific computing for over 20 years, it does not have (in either the Fortran 77 or Fortran 90 versions) the necessary constructs to allow full exploitation of modern computer architectures, i.e. parallel architectures. The High Performance Fortran Forum was founded as a coalition of industrial and academic working groups to suggest a standard set of extensions to Fortran to provide the information necessary for:

o Opportunities for parallel execution

o Type of available parallelism (MIMD, SIMD)

o Allocation of data among individual processor memories

o Placement of data within a single processor

The intent of HPFF has been to develop extensions to Fortran which provide support for high performance programming on a wide variety of machines, including massively parallel SIMD and MIMD systems and vector processors. From its beginning, HPFF included most vendors delivering parallel machines, government labs, and many university research groups. Public input has always been encouraged, and the result of the project is the recently (November 1992) released language specification for HPF. This specification provides a language that is portable from workstations to massively parallel supercomputers, while also being able to express the algorithms necessary to achieve high performance on specific architectures.
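As a flavour of the kind of extension involved, the fragment below sketches the style of data mapping directives described in the HPF specification (the array names and sizes are invented for illustration). The directives live in comment lines, so a plain Fortran 90 compiler simply ignores them and runs the code serially, while an HPF compiler uses them to distribute the arrays over a logical processor grid.

      program hpf_sketch
        implicit none
        real :: a(256, 256), b(256, 256)
!HPF$   PROCESSORS procs(4, 4)
!HPF$   DISTRIBUTE a(BLOCK, BLOCK) ONTO procs
!HPF$   ALIGN b(i, j) WITH a(i, j)

        a = 1.0
        b = 2.0 * a        ! a single array assignment, executed in parallel under HPF
        print *, sum(b)
      end program hpf_sketch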

5.3.2 Message Passing Interface Forum

Following the apparent success of the HPF forum in producing an agreed standard in a short time-scale, the end of 1992 saw the formation of a forum to provide a de facto standard for a message passing interface: MPI. The forum meets every six weeks in Dallas, and communicates via email lists between meetings. Current movements seem very positive, and the target date for the standard definition (July 1993) seems an achievable goal. Following this, parallel machine manufacturers will be expected to provide implementations of the standard on their systems, and in this way provide a means for complete code portability for MIMD systems.

The MPI forum is structured as a set of sub-committees, each with responsibility for the production of a section of the final standard specification. These sub-committees discuss and develop the following topics:

Point-to-point communications

Process topologies

Collective communications

Communication contexts

Language bindings

Environmental enquiry

Representatives of all parallel computer vendors, most MIMD software developers, and many academic establishments take part in the MPI sub-committees. No single institution is allowed to sit on more than three sub-committees, and each has only one vote in the final full committee meetings. Progress towards the standard seems to be rapid, and the EPCC representative at these meetings feels that they will produce a good standard in the specified time. The EPCC also intends to have an implementation of the standard running by the summer of 1993, and will aim to use this as a demonstrator of the portability benefits that such an interface can bring.