SIMD Computing: An Introduction

C. J. C. Schauble

September 12, 1995

High Performance Scientific Computing
University of Colorado at Boulder

Copyright 1995 by the HPSC Group of the University of Colorado

The following are members of the HPSC Group of the Department of Computer Science at the University of Colorado at Boulder:

Lloyd D. Fosdick
Elizabeth R. Jessup
Carolyn J. C. Schauble
Gitta O. Domik


Contents

1 General architecture
  1.1 The CM-2
    1.1.1 Characteristics
    1.1.2 Performance
  1.2 The MasPar MP-2
    1.2.1 Characteristics
    1.2.2 Performance

2 Programming issues
  2.1 Architectural organization considerations
    2.1.1 Homes
  2.2 CM Fortran, MPF, and Fortran 90
    2.2.1 Arrays
    2.2.2 Array sections
    2.2.3 Alternate DO loops
    2.2.4 WHERE statements
    2.2.5 FORALL statements
  2.3 Built-in functions for CM Fortran and Fortran 90
    2.3.1 Intrinsic functions
    2.3.2 Masks
    2.3.3 Special functions
  2.4 Compiler directives
    2.4.1 CM Fortran LAYOUT
    2.4.2 MasPar MPF MAP
    2.4.3 CM Fortran ALIGN
    2.4.4 CM Fortran COMMON
    2.4.5 MasPar MPF ONDPU
    2.4.6 MasPar MPF ONFE

3 Acknowledgements

References


Trademark Notice

- DECstation, ULTRIX, VAX are trademarks of Digital Equipment Corporation.
- Goodyear MPP is a trademark of Goodyear Rubber and Tire Company, Inc.
- ICL DAP is a trademark of International Computers Limited.
- MasPar Fortran, MasPar MP-1, MasPar MP-2, MasPar Programming Environment, MPF, MPL, MPPE, X-net are trademarks of MasPar Computer Corporation.
- X-Window System is a trademark of The Massachusetts Institute of Technology.
- MATLAB is a trademark of The MathWorks, Inc.
- IDL is a registered trademark of Research Systems, Inc.
- Symbolics is a trademark of Symbolics, Inc.
- C*, CM, CM-1, CM-2, CM-5, CM Fortran, Connection Machine, DataVault, *Lisp, Paris, Slicewise are trademarks of Thinking Machines Corporation.
- UNIX is a trademark of UNIX Systems Laboratories, Inc.


SIMD Computing: An Introduction

C. J. C. Schauble

September 12, 1995

According to the Flynn computer classification system [Flynn 72], a SIMD computer is a Single-Instruction, Multiple-Data machine. In other words, all the processors of a SIMD multiprocessor execute the same instruction at the same time, but each executes that instruction with different data.

The computers we discuss in this tutorial are SIMD machines with distributed memories (DM-SIMD). They are sometimes referred to as processor arrays or as massively-parallel computers.

This tutorial is divided into two main parts. In the first section, we discuss the general architecture of SIMD multiprocessors. Then we consider how these general features are embodied in two particular SIMD machines: the Thinking Machines CM-2 and the MasPar MP-2.

In the second section, we look into programming issues for SIMD multiprocessors, both architectural and language-oriented. In particular, we describe useful features of Fortran 90 and CM Fortran.

This work has been partially supported by the National Center for Atmospheric Research (NCAR) and utilized the TMC CM-2 at NCAR in Boulder, CO. NCAR is supported by the National Science Foundation.

This work has been partially supported by the National Center for Supercomputing Applications under the grants TRA930330N and TRA930331N, and utilized the Connection Machine Model-2 (CM-2) at the National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign.

This work has been supported by the National Science Foundation under an Educational Infrastructure grant, CDA-9017953. It has been produced by the HPSC Group, Department of Computer Science, University of Colorado, Boulder, CO 80309. Please direct comments or queries to Elizabeth Jessup at this address or e-mail [email protected].

Copyright 1995 by the HPSC Group of the University of Colorado

For detailed information on how to login and program specific SIMD computers such as the CM-2 and the MasPar MP-1, refer to the documents in the /pub/HPSC directory at the cs.colorado.edu anonymous ftp site.

1 General architecture

Each of the processors in a distributed-memory SIMD machine has its own local memory to store the data it needs. Also, each processor is connected to other processors in the computer and may send or receive data to or from any of them. In many respects, these computers are similar to distributed-memory MIMD (multiple instruction, multiple data) multiprocessors.

As stated above, the term SIMD implies that the same instruction is executed on multiple data. Hence the distinguishing feature of a SIMD machine is that all the processors act in concert. Each processor performs the same instruction at the same time as all the other processors, but each processor uses its own local data for this execution.

The array of processors is usually connected to the outside world by a sequential computer or workstation. The user accesses the processor array through this front end or host machine.

Using a SIMD computer for scientific computing means that many elements of an array can be computed simultaneously. Unlike vector processors (see the tutorial on vector computing [Schauble 95] for more information on vector processors), the computation of these elements is not pipelined with different portions of neighboring elements being worked on at the same time. Instead, large groups of elements go through the same computation in parallel.

In the following, we discuss the architectural features of SIMD multiprocessors, concentrating on two computers in this class: the Connection Machine CM-2 by Thinking Machines Corporation and the MasPar MP-2 by MasPar Computer Corporation. Similar computers include the Digital Equipment Corporation MPP series (technically the same as the MasPar machines), the Goodyear MPP, and the ICL DAP.


1.1 The Connection Machine CM-2

The CM-2 Connection Machine is a SIMD supercomputer manufactured by Thinking Machines Corporation (TMC). Data parallel programming is the natural paradigm for this machine, allowing each processor to handle one data element or set of data elements at a time.

The initial concept of the machine was set forth in a Ph.D. dissertation by W. Daniel Hillis [Hillis 85]. The first commercial version of this computer was called the CM-1 and was manufactured in 1986. It contained up to 65,536 (or 64K) processors capable of executing the same instruction concurrently. As shown in figure 1, sixteen one-bit processors with 4K bits of memory apiece are on one chip of the machine. These chips are arranged in a hypercube pattern (see the tutorial on MIMD computing [Jessup 95] for more information on a hypercube). Thus the machine was available in units of 2^d processors, where d = 12 through 16.

One of the original purposes of the computer was artificial intelligence; the eventual goal was a thinking machine. Each processor is only a one-bit processor. The idea was to provide one processor per pixel for image processing, one processor per transistor for VLSI simulation, or one processor per concept for semantic networks.

The first high-level language implemented for the machine was *Lisp, a parallel extension of Lisp. The design of portions of the *Lisp language is discussed in the Hillis dissertation.

As the first version of this supercomputer came onto the market, TMC discovered that there was also significant interest and money for supercomputers that can be used for numerical and scientific computing. Hence a faster version of the machine was produced in 1987; named the CM-2, this machine was the first of the CM-200 series of computers. It included floating-point hardware, a faster clock, and increased the memory to 64K bits per processor. These models emphasized the use of data-parallel programming. Both C* and CM Fortran were available on this machine in addition to *Lisp.

Announced in November 1991, a more recent machine is the CM-5. This is a MIMD machine that embodies many of the earlier Connection Machine concepts with more powerful processors, techniques, and I/O.

The following subsections discuss the characteristics and the performance of the CM-2. For further information, see the Connection Machine CM-200 Series Technical Summary [TMC 91d]; Parallel Supercomputing in SIMD Architectures [Hord 90], chapter 7; or Computer Architecture: Case Studies [Baron & Higbie 92], chapter 18.






Figure 1: A representative blowup of one of the 64^2 processor chips in a Thinking Machines CM-1 or CM-2. Each CM-1/CM-2 chip contains 16 one-bit processors. The router connects the processor chips with other processor chips. Memory chips are associated with each CM-1/CM-2 chip.


Figure 2: The two main parts of a Thinking Machines Connection Machine system.

1.1.1 Characteristics

A Connection Machine (CM) may be considered as two main parts; these are depicted in figure 2. The parallel processing unit (PPU) is the SIMD portion of the machine and contains up to 64K single-bit processors; this part is the CM itself. The front end (FE) of the machine acts as a controller or host for the PPU and provides access to the rest of the world.

The FE is usually a small computer; for the CM, this may be a UNIX workstation, a Symbolics Lisp workstation, or a VAX. A CM may have up to four FE's; these do not need to all be the same type of machine. Programs are compiled, stored, and executed serially on the FE; any parallel operations in the program are recognized and pipelined to the PPU for execution there, with the serial execution filling the pipeline and continuing until a response is needed back from the PPU. Thus CM programs have the familiar sequential control flow and do not require additional primitives as do programs for other multiprocessors.

Depending on the configuration of the machine, each one-bit PPU processor has 64K, 256K, or 1024K bits of memory, an arithmetic-logic unit (ALU), four one-bit registers, and interfaces for two forms of communication and I/O. These processors all work in parallel on simple instructions. In fact, the PPU can be thought of as a large, synchronized drill team while the FE is the sergeant who yells out the commands. For example, all the PPU processors fetch something from their individual memories to their ALU's at one command; they all add something to that at the next command; and they all store their individual results into their own memories at the next command.

The language Paris (PARallel Instruction Set) is used to express the parallel operations that are to be run on the PPU. All *Lisp, CM Fortran, or C* parallel commands are compiled into Paris instructions. Such operations include parallel arithmetic operations (both floating-point and fixed), vector summation and other reduction operations (see the tutorial on vector computing [Schauble 95] for more information on reduction operations), sorting, and matrix multiplication.

An alternate run-time system exists for machines with 64-bit floating-point accelerators. This is called the Slicewise model and provides a different viewpoint of the CM than the Paris model. It can be used only with CM Fortran programs but allows more efficient execution.

The PPU may be broken up into two or four sections, as shown in figure 3. Each section has its own sequencer and can be used as a sub-PPU by itself or can be grouped with other sections. The nexus switch provides the pathway between a given FE and its current sections of the PPU.

A sequencer receives Paris instructions from the FE and breaks them down into a sequence of low-level instructions that can be handled by the one-bit processors. When that is done, the sequencer broadcasts these instructions to all the processors in its section.


Each processor then executes the instructions in parallel with the other processors. When the execution of the low-level instructions is completed, control is returned to the FE. Used independently, each section sets up its own grid layout for computation and communication for each array; if the sections are grouped together, one grid per array is laid over all the processors. These grids may be altered dynamically during the execution of the program.

Figure 3: Breakdown of a Thinking Machines CM PPU into 2 sections with sequencers. Two front end machines, the nexus switch, and the I/O system are also shown.

Figure 4: A pair of Thinking Machines CM-2 processor chips communicate with a single FPA.

Virtual processors are another feature of the CM. If the number of elements for the data of a given program is larger than the number of processors, the machine acts as if there were enough processors, providing virtual processors by assigning the data elements across the PPU in an efficient manner. In other words, each physical processor may act as one or more virtual processors.
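For example (an illustrative calculation, not a figure from the CM-2 documentation), a 1024 x 1024 array holds 1,048,576 elements; on an 8K-processor machine this gives a virtual-processor ratio of 1,048,576 / 8,192 = 128, so each physical processor does the work of 128 virtual processors.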

Floating-point accelerators are optional and come in two types: 32-bit or 64-bit. A 32-bit FPA interprets the bits from two chips of sixteen one-bit processors as a single floating-point number, allowing single precision arithmetic computations. A 64-bit FPA also works with the contents of the processors contained on two chips and provides double precision arithmetic. For this reason, the processor chips are grouped in pairs with a floating-point and memory interface shared by each pair, as shown in figure 4; it is common to think of each pair of chips as being equivalent to a floating-point processor. The addition of a floating-point unit to every 32 processors to form an FPA speeds up the processing of floating-point computations on the CM by more than a factor of twenty.

The Paris model of computation assigns full 32-bit or 64-bit words to the memory of each processor. These are passed to the FPA bit by bit when needed. The Slicewise model assigns different bits of a 32-bit or 64-bit word to each of the 32 processors on the two chips connected to a single FPA. When the FPA needs one of these words, each processor can send its bits concurrently with the other processors. In other words, one 32-bit word can be sent to the FPA in one load cycle as one bit from each of the 32 processors is sent to the FPA simultaneously. A 64-bit word requires only two load cycles since the data paths are 32 bits wide. Hence the amount of time spent passing data to and from the FPA is much less for the Slicewise model.

Data may be read and written serially through the FE to the PPU, but this is not a productive method for large amounts of data because of the cost of communication between the FE and PPU. An optional input/output system allows peak rates of up to 320 Mbytes per second for transfers of data between the PPU processors and the I/O system buffers in parallel. One solution for speeding up peripherals is to attach multiple devices (i.e., disks) to the CM I/O system in parallel. Each device is connected to a separate CMIO bus and may transfer data in parallel with the other devices. Thus every 64 bits of data (the width of the bus) is transferred to a different device; this is called file or disk striping. Alternatively, these I/O buffers may attach to Thinking Machines Data Vaults; each Data Vault may contain 5 to 60 Gbytes of data. A maximum of eight data vaults can be on a given system, each transferring data at a peak rate of 40 Mbytes per second or an average rate of 20 Mbytes per second.
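At those rates, eight Data Vaults running at their 40-Mbyte-per-second peak together match the 320-Mbyte-per-second peak of the I/O system (8 x 40 = 320 Mbytes per second).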

Another peripheral available through the I/O system is a graphical display for output generated by the PPU; this uses the Thinking Machines CM framebuffer, managing up to one Gbyte of data per second. The high-resolution graphical display is a 19" color monitor. The framebuffer for the display attaches directly to the CM backplane, like an I/O controller, and holds raster image data. Alternatively, the image can be displayed in an X-window on a remote workstation or terminal. The size of the window determines the number of pixels; each PPU processor generates data for one pixel. Either 8-bit mode or 24-bit mode can be used.

Figure 5: Part of a hypercube network.

In addition to the serial communication available between the FE and the PPU, the PPU processors may communicate with each other in different ways. The general form of communication is via the router; as shown in figure 1, a router is present on each chip of sixteen processors, permitting each processor to communicate with any other processor on the PPU. The router also contains the logic to handle virtual processors.

Communication between processors on the same chip is called local or on-chip communication. Clearly this is faster than any method of off-chip communication.

The interconnection of the processor chips across the PPU is based on a hypercube network, where each node of the hypercube is a processor chip containing sixteen processors. The operation of the router on a processor chip is broken up into twelve subcycles, where twelve is the maximum dimension of a CM-2 hypercube network. At subcycle k, the router is able to communicate with other processor chips (or nodes) along the kth dimension of the hypercube. Thus this interconnection pattern is sometimes called a cube-connected cycle. Figure 5 shows a 4-dimensional hypercube where the vertices of the network represent processor chips and the arcs represent connections to neighboring processor chips in the network.

Figure 6: A two-dimensional NEWS grid.

When data can be treated as elements of a mesh or grid spread across the PPU, a second and faster method of off-chip communication may be used. This is called NEWS (North-East-West-South) communication and allows the processors of each processor chip to communicate easily with the processors on the nearest-neighbor chips on that mesh. However, this type of communication is limited to nearest-neighbor transfers and cannot be used for broadcasting information to all the processors. An example of a two-dimensional NEWS grid is shown in figure 6. The CM-1 only allowed a fixed, two-dimensional grid, but the CM-2 permits NEWS grids with up to 31 dimensions. Each chip on the PPU has an interface to the NEWS grid; processors with data elements required by or provided by their neighbors store pointers to those neighbors. When two or more sections of the PPU are being used independently, each section may have its own NEWS grid setup. When the sections are all ganged together, there is a single NEWS grid across the sections for each array or parallel structure.


Operation (Paris Model)     Single Precision (Mflops)
Fl. Pt. Addition            4,000
Fl. Pt. Multiplication      4,000
Fl. Pt. Division            1,500
Dot Product                 10,000
4K x 4K Matrix Mult         3,500

Table 1: CM-2 performance of single precision floating-point operations on a full 64K model, taken from [TMC 88].

1.1.2 Performance

A full CM-2 with 64K processors and double precision FPA's can perform arithmetic operations under the Paris model of execution at the speeds shown in table 1. (These figures were taken from the Connection Machine Model CM-2 Technical Summary [TMC 88], pp. 59-60.)

The speed of memory reads and writes between each processor and its memory is greater than or equal to 5 Mbits/second. Each processor has 64K to 1024K bits (8K to 128K bytes) of memory, and a full machine has 64K processors, providing a total memory of 512 Mbytes to 8 Gbytes. Hence, given the minimum memory access speed of 5 Mbits/second for a single processor, the CM-2 can perform memory reads and writes at about 300 Gbits/second in aggregate (64K processors x 5 Mbits/second).

Other performance tests were done at the University of Colorado using the CM-2 at the National Center for Atmospheric Research (NCAR). This machine was just one-eighth the size of a full CM-2, with merely 8192 processors, and contained only single precision FPA's; it also used the Paris model of execution. Standard CM Fortran routines were used exclusively for these tests. The results are summarized in table 2.

Operation (Paris Model)     Single Precision (Mflops)
Fl. Pt. Addition            180
Fl. Pt. Multiplication      200
Fl. Pt. Division            70
Dot Product                 56
Cosine                      65
Exponential                 60
Square Root                 83

Table 2: Performance of NCAR CM-2 single precision floating-point operations.

1.2 The MasPar MP-2

Manufactured by the MasPar Computer Corporation, the MasPar MP-1 and MP-2 are two other examples of DM-SIMD multiprocessors. Introduced in 1990, the MasPar MP-1 is the original MasPar machine. The newer MasPar MP-2 was brought out in 1992. While the MP-2 is larger and faster than the MP-1, the architectures of the two machines are similar.

1.2.1 Characteristics

Like the Thinking Machines CM, the MasPar MP-2 is broken up into two main parts: a front end (or host) machine and a processor array containing the SIMD processors.

The front end of this machine may be a VAX or a DECstation 5000. Both types of host machines run ULTRIX. The host machine permits communication of the SIMD processor array with the user. As with the CM-2, scalar processing is done on the front end.

The Data Parallel Unit (DPU) performs all the parallel computation and corresponds to the CM PPU; it is comprised of two parts. The first is the array of Processor Elements (PEs), incorporating 2^10 (1024) to 2^14 (16384) processors. Unlike the one-bit processors of the CM, a MasPar MP-2 processor (or PE) is a CPU capable of handling 32-bit values. Each of these processors contains 64 32-bit registers as well as an ALU and its own local memory. Thirty-two of the processors fit on a single chip in the PE array. The older MasPar MP-1 contains only 4-bit processors but also fits 32 processors onto one chip.

The other part of the DPU is the Array Control Unit (ACU). This is a processor in itself, with its own memory and instructions. The function of the ACU is to manage the operations of the PEs; it plays a role similar to that of the sequencers in the CM. An instruction is sent out by the ACU to all the processors in the PE array at the same time, and all the processors that are active simultaneously execute that instruction with their own data. The ACU is also the contact with the front end host machine.

Figure 7: In a MasPar MP-2, each processing element (PE) is connected by the X-net to neighboring PE's along the north-south, east-west, northwest-southeast, and southwest-northeast axes.

Communication between the PEs is handled by the DPU in two ways. Efficient nearest-neighbor communication is done by the X-net; this is similar to the NEWS method of communication on the CM but goes in eight directions, as shown in figure 7. X-net communication should only be used for data communications in a regular pattern. The MasPar MP-2 processors are also connected by the Global Router. Like the CM router communication, communication via the Global Router is slower but permits passing data between any two processors in the DPU.

As with the CM architecture, there exist methods for allowing data on disks to be read or written directly to or from the DPU. A framebuffer connection to the DPU is also present and permits animation on graphical displays generated directly from the DPU.

Two main programming languages are available for the MasPar MP-2: MPF (MasPar Fortran) and MPL (MasPar Programming Language).


The first of these, MPF, is Fortran 90 with some extensions for the MasPar architecture; it includes special functions for passing data between the front end host machine and the DPU. The second language, MPL, is a version of C with parallel extensions; it is a low-level language and requires the programmer to have a good understanding of the machine architecture. The MasPar VAST-2 preprocessor allows the conversion of Fortran 77 programs to MPF programs for this machine. In addition, a programming environment called MPPE (MasPar Programming Environment) is installed on the host machine, providing extra interactive support for the user to write, test, debug, and monitor parallel programs. Libraries for mathematical computation, data display, and image processing are also available.

For more information on the MasPar SIMD computers, refer to [MasPar 92].

1.2.2 Performance

Table 3 compares the performance of various models of the MasPar MP-1 and MasPar MP-2 with differing numbers of processors. These figures are for double precision floating-point operations and are taken from [Dongarra 94]. Also included for comparison are Thinking Machines CM-2 figures for single precision floating-point operations. In this table, R_peak represents the theoretical peak performance of a machine with the given number of processors (# Procs); R_max is the best performance achieved, for a problem of size N_max; and N_1/2 gives the size of the problem that executes at half the R_max performance.

MasPar      # Procs   R_max   N_max   N_1/2   R_peak
MP-2216     16384     1.6     11264   1920    2.4
MP-1216     16384     .473    11264   1280    .55
MP-1        16384     .44     5504    1180    .58
MP-2204     4096      .374    5632    896     .60
MP-1204     4096      .116    5632    640     .138
MP-2201     1024      .092    2816    448     .15
MP-1201     1024      .029    2816    320     .034
TMC CM-2    2048      10.4    33920   14000   28.

Table 3: Performance (in Gflops) of the MasPar MP-1 and MP-2, from [Dongarra 94]. The MasPar figures are for 64-bit numbers, while the Thinking Machines CM-2 values are for 32-bit numbers.


2 Programming issues

Parallel programming is usually a greater challenge than programming for a sequential machine. In this section, we consider some of the problems and solutions associated with programming a SIMD multiprocessor. We also discuss some of the Fortran 90 features that are applicable to this type of architecture.

This discussion focuses on the CM-2 as a sample architecture using CM Fortran. Most of the concepts can easily be applied to the MasPar MP-2 and other DM-SIMD architectures. In particular, CM Fortran was a forerunner of Fortran 90 and contains most of the data-parallel constructs present in that language. Some additional constructs exist as well; those in that category are so indicated below. Some MPF constructs are also discussed. With a few exceptions, such as reduction operations, data-parallel operations behave in an elementwise fashion: each SIMD processor acts only on the array elements contained in its own memory.

2.1 Architectural organization considerations

When programming for the CM-2, the MP-2, or other SIMD architectures, it is best to think of the machine in the two parts shown in figure 2:

1. the front end, to handle all the scalar operations, and

2. the array of processors, to execute all parallel operations.

For usual scientific computing applications, the front end (FE) is simply a sequential UNIX computer or workstation. (Some installations use Symbolics Lisp machines as the FE to the CM.) This is the machine to which the user logs in, the machine that compiles programs for the CM, and the machine that controls the execution of programs on the CM. All program variables used only in a sequential or scalar fashion are stored on the front end. The parallel processing unit (PPU) with its SIMD architecture handles all parallel or array operations, each processor taking care of one piece in unison with the other processors.

Since the editing and compiling of programs are done on the FE, programming on the CM is much like programming on any UNIX machine. In fact, if a program does not use the CM parallel constructs, it executes entirely on the FE, completely ignoring the PPU.

ARRAYS

Offset   Size   Type/Class     Home   Name
0        2048   REAL*4 local   CM     A
2048     2048   REAL*4 local   CM     B
4096     2048   REAL*4 local   CM     C

Figure 8: ARRAYS section of a CM Fortran listing.

On the MasPar MP-2, the FE usually starts the program and sets up the initial data arrays; it also completes the program and collects the final results. Since communication is expensive between the FE and the DPU, it is best to confine most of the execution of a program to the DPU.

2.1.1 Homes

All program variables in CM Fortran programs are assigned a home. This is simply where the variable is stored. Since scalar variables can only be on the FE, that is their home. However, arrays may be stored on the FE or on the PPU, depending on whether or not they are used in parallel operations. If they are used in both serial and parallel operations, they are stored on the PPU and copied to and from the FE for the serial operations. The exception to the rules above is for arrays of type CHARACTER; these are always stored on the FE.

To see where homes have been assigned for your variables, check the last part of your program listing. The two sections VARIABLES and ARRAYS provide the name, type, and size of the scalar variables and the arrays used in your program. The ARRAYS section also lists the Home for each array. Under this heading, the term CM refers to the PPU and FE to the FE. It is wise to double check this part of the program listing to make sure the arrays have been assigned as expected. A portion of a sample listing showing the ARRAYS section is given in figure 8.


Homes for variables in every program unit are assigned individually. In other words, the array Z in SUBROUTINE MYSUB may not be assigned the same home as Z in FUNCTION MYFTN. Each program module is treated as a unit. Homes of actual and dummy arguments must match. If you pass an array to a function or subroutine, the dummy array argument must have the same home as the incoming array parameter. Otherwise, unpredictable results are possible. There are a number of ways to force arrays to be assigned homes on the PPU (a short sketch combining them follows the list):

- Declare the array in COMMON, as all COMMON arrays are placed on the PPU;

- Put parallel operations in every program module for the appropriate arrays, even if they do nothing useful;

- Use the LAYOUT compiler directive:

      CMF$ LAYOUT Z(:NEWS)

  With :NEWS as an argument, Z MUST be in the PPU. Other possible arguments are :SERIAL and :SEND. More on the LAYOUT compiler directive can be found in section 2.4.1.
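As a concrete illustration of these three methods, the fragment below is only a sketch; the subroutine name FORCEHOMES, the common block name PDATA, and the array names are invented for this example.

      SUBROUTINE FORCEHOMES
C     Method 1: COMMON arrays are always placed on the PPU.
      REAL A(1024)
      COMMON /PDATA/ A
C     Method 3: a LAYOUT directive forces Z onto the PPU.
      REAL Z(1024)
CMF$  LAYOUT Z(:NEWS)
C     Method 2: an otherwise useless parallel operation gives W a PPU home.
      REAL W(1024)
      W = 0.0
      A = 1.0
      Z = 2.0
      RETURN
      END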

2.2 CM Fortran, MPF, and Fortran 90

The version of Fortran used on the CM is called CM Fortran. It is based on Fortran 77 and extended by parallel constructs; most of these constructs are contained in a subset of Fortran 90. On the MasPar MP-2, MPF (MasPar Fortran) is a version of Fortran 90 with extensions; many of the constructs are the same as those in CM Fortran.

On a CM, it is important to know that the control flow of a CM Fortran program is handled by the FE, as are all scalar statements like those found in Fortran 77. All data-parallel statements, including Fortran 90 statements, are executed on the PPU.

The following subsections introduce a few elements of CM Fortran, MPF, and Fortran 90 to get you started. For further reference, see the CM manuals [TMC 91a], [TMC 91b], and [TMC 91c]; the MasPar manuals [MasPar 93b] and [MasPar 93a]; and some Fortran 90 references such as [Brainerd et al 90] and [Adams et al 92].


2.2.1 Arrays

An array on a SIMD multiprocessor may be considered a data-parallel object; this is true for CM Fortran arrays as well. In fact, the only CM Fortran variables stored on the PPU are those arrays used in parallel operations; all scalars and all arrays not involved with parallel operations are stored on the FE of the CM.

The properties of an array are rank and shape. (In this context, rank has a different meaning than its customary mathematical definition.) The rank of an array is the number of its dimensions; e.g., the array declared as S(5,10) has rank 2. The shape of an array is its dimensions; so the shape of S(5,10) is 5 x 10. Two arrays with the same shape are said to be conformable. Most parallel operations require that the arrays involved be conformable.

Once an array has been declared, the use of the name of the array by itself (not subscripted) denotes the entire array with all its elements. Such usage implies a parallel operation is to be applied to the array. For instance, the statement

      S = 0.0

sets all the elements of S to zero in parallel on the PPU.

Subsections of an array can be specified by triples. The general form of a triple is

      firstvalue : lastvalue : increment

(This differs from the triple form used by MATLAB: firstvalue : increment : lastvalue.) For example, if S is declared as above, then S(1:5:2,1:10) refers to the odd rows of S. The triple 1:5:2 specifies that rows 1, 1+2 = 3, and 3+2 = 5 are to be used; the triple 1:10 has an implied increment of 1 and so specifies all ten columns. This second triple 1:10 could have been replaced by a single colon, as in S(1:5:2,:), to imply that all the columns be used for the chosen rows.

As in Fortran 77 and Fortran 90, CM Fortran arrays may be declared by DIMENSION, COMMON, or type statements. In CM Fortran, they may also be declared using array attribute statements. For example, assuming the array S is a real array, it could have been defined by the following array attribute statement:


      REAL, ARRAY(5,10) :: S, T

In Fortran 90, the following statement would have the same effect:

      REAL, DIMENSION(5,10) :: S, T

This simply says that both S and T are real arrays with 5 rows and 10 columns. Notice the comma after the type indicator REAL; this is how the array attribute statement is recognized by the compiler. The double colon :: is also a requirement; it must be placed between the array definition and the array names. You should recall that blank spaces in Fortran are traditionally ignored; hence any number of spaces can be added to this statement (even between the colons) or deleted from the statement.

Array constructors can be used to initialize the elements of an array in parallel. For instance, if Z has been declared in CM Fortran by the statement

      REAL, ARRAY(N) :: Z

then the statement

      Z = REAL([1:N])

assigns 1.0 to Z(1), 2.0 to Z(2), and REAL(N) to Z(N). The corresponding Fortran 90 statements follow:

      REAL, DIMENSION(N) :: Z

and

      Z = (/ (REAL(I), I=1,N) /)

Array constructors can also be included in array attribute statements. In CM Fortran, this is done by adding a DATA parameter to the statement; thus the following CM Fortran statement has the effect of defining and initializing Z at once:

      REAL, ARRAY(N), DATA :: Z = [1:N]

The same effect can be achieved in Fortran 90 by the following statement:

      REAL, DIMENSION(N) :: Z = (/ (REAL(I), I=1,N) /)

This is an efficient way of assigning initial values to an array, as it is done at load time. A limitation on the array constructor in CM Fortran is that it can only be used for one-dimensional arrays.
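In Fortran 90 (though not in CM Fortran), a multidimensional array can be filled from a one-dimensional constructor with the RESHAPE intrinsic; the fragment below is a sketch of that standard Fortran 90 idiom, not a CM Fortran feature.

      INTEGER I
      REAL, DIMENSION(3,2) :: W
C     Fill W in column-major order with 1.0, 2.0, ..., 6.0.
      W = RESHAPE( (/ (REAL(I), I=1,6) /), (/ 3, 2 /) )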


Figure 9: The subsection M(4:6,5:10) of the array M(12,12).

2.2.2 Array sections

Most of the parallel array facilities of Fortran 90 are part of CM Fortran. As mentioned in section 2.2.1, the ability to work with all the elements of an array, or subsections of an array, in parallel is provided; however, in CM Fortran, all of the arrays involved in this type of parallel expression must be parallel arrays with homes on the PPU.

Using the name of an array implies using the entire array in parallel. For instance, the statement

      Y = Z**2

causes Y(1) to be set to the value of Z(1)**2, Y(2) to Z(2)**2, etc. This also works for constant assignments; the statement

      Y = -1.0

means that all the elements of Y are set to -1.0.

Subsections of arrays can be used in assignment statements as well. In the following statement

      Y(1:10) = Z(11:20)

the first ten elements of Y are set to the second ten elements of Z. Such statements execute in parallel.

An example of a subsection of a two-dimensional array is shown in figure 9. Here the array M has been declared as

      REAL M(12,12)

and the 3 x 6 subsection is defined by

      M(4:6,5:10)

2.2.3 Alternate DO loops

Additional control constructs exist in CM Fortran; these are similar to those in Fortran 90. The first of these are alternate forms of DO loops, as demonstrated below:

      N = 4096
      DO WHILE (N .GT. 0)
         Z(1:N) = ...
         N = N/2
      ENDDO

      KK = 1
      DO N TIMES
         KK = KK*K
      ENDDO

The first of these loops assigns values to the first N elements of the array Z for N equal to decreasing powers of two. The second loop terminates when KK is equal to K^N, where N is a non-negative integer. This form of the DO loop is useful when the loop index is not needed within the body of the loop. Note that the DO WHILE construct is a legal Fortran 90 construct; the second form of the DO loop, DO N TIMES, is not.
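Since DO N TIMES is a CM Fortran extension, code meant to run under standard Fortran 90 can express the same loop with an ordinary counted DO; this is a small portability sketch using the variable names from the example above.

      KK = 1
      DO I = 1, N
C        The index I is never used in the body; it only counts iterations.
         KK = KK*K
      ENDDO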

2.2.4 WHERE statements

The WHERE statements provide a means for working with a subset of a full array, still as a parallel operation:

      WHERE (Z .GT. 0.0) Y = SQRT(Z)

Here all the CM processors actually compute Y = SQRT(Z). However, only those processors with a value of Z greater than zero store the result. The intrinsic function SQRT is used on the entire array. If we wished to set the other (negative and zero) elements to zero at the same time, this operation could be programmed as follows:

      WHERE (Z .GT. 0.0)
         Y = SQRT(Z)
      ELSEWHERE
         Y = 0.0
      ENDWHERE

In this set of statements, the elements of Y are set to zero when the corresponding element of Z is not greater than zero. Note that this construct acts in two steps. First, all the processors compute Y = SQRT(Z), but only the values for elements of Y corresponding to positive elements of Z are stored. Then all the processors compute Y = 0, but only the values corresponding to zero or negative elements of Z are stored. In other words, the construct appears similar to the IF..THEN..ELSE..ENDIF statement but behaves a little differently.
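A serial rendering of the same masked update may make the effect easier to see; the loop below is only an explanatory sketch of what the WHERE/ELSEWHERE block above computes for arrays Y and Z of length N, not how the PPU actually executes it.

      DO I = 1, N
         IF (Z(I) .GT. 0.0) THEN
            Y(I) = SQRT(Z(I))
         ELSE
            Y(I) = 0.0
         ENDIF
      ENDDO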

2.2.5 FORALL statements

The FORALL construct is not a part of Fortran 90; however, it is included in both CM Fortran and MPF. Such statements are very convenient for SIMD computers.

A FORALL statement, as in

      FORALL (I=1:N) Y(I) = I

can only contain one assignment. This statement is equivalent to the following DO loop:

      DO I = 1,N
         Y(I) = I
      ENDDO

Notice that there is a colon between the start and stop values of the FORALL index; this is in a triple format like the subscripts discussed earlier. An increment may be used as well, after another colon, as the third element of the triple.
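For instance (an invented one-liner following the triple form just described), the statement

      FORALL (I=1:N:2) Y(I) = 0.0

sets the odd-indexed elements Y(1), Y(3), Y(5), ... to zero and leaves the even-indexed elements unchanged.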


More than one index can be used within the FORALL statement; for instance, the following statement

      FORALL (I=1:N, J=1:N) S(I,J) = I

sets all the elements of the Ith row of S to I.

Often individual elements of an array need to be initialized to specific values. If the home of the array is on the PPU, it is best to use a FORALL statement for this purpose. For instance, the statement

      S(6,1) = S(6,2) + S(6,3)

is executed on the FE, since it is essentially a scalar operation. However, if we rewrite this statement as a FORALL statement,

      FORALL (I=6:6, J=1:1) S(I,J) = S(I,J+1) + S(I,J+2)

it is executed on the PPU. In effect, the sum of S(I,J+1) and S(I,J+2) is computed for every element in the array, but only the processor containing the element in the first column of the sixth row of S stores this value into its array element.

A FORALL statement with dependencies that cannot be resolved is executed serially. Check for restrictions on the FORALL statement in the appropriate manual.

2.3 Built-in functions for CM Fortran and Fortran 90

The CM Fortran intrinsic functions are for the most part the same as the intrinsic functions described for Fortran 90. Additional functions for handling data-parallel data types are also present in both CM Fortran and Fortran 90. Some of these are described below.

To aid in the explanation of these built-in functions, assume the following arrays have the values given below:

      A = ( 1  2 )
          ( 3  4 )
          ( 5  6 )

      B = ( 2  4  5 )
          ( 3  8  5 )

      C = ( 1, -2, 3, -4, 5, -6 )


2.3.1 Intrinsic functions

The usual Fortran intrinsic functions are available in CM Fortran and Fortran 90. Moreover, most of them can be used in a parallel fashion. For instance, if A has been declared as above, then

      MOD(A,5)

returns a matrix of the same type and shape as A containing the values of the elements of A mod 5:

      MOD(A,5) = ( 1  2 )
                 ( 3  4 )
                 ( 0  1 )

Similarly, the SQRT function can handle a whole array at once.

      SQRT(A) = ( 1.000  1.414 )
                ( 1.732  2.000 )
                ( 2.236  2.449 )

2.3.2 Masks

Masks are logical arrays created by performing a relational operation on all the elements of a given array. Both the mask and the original array must conform. Consider the following examples.

      A .GT. 0 = ( T  T )
                 ( T  T )
                 ( T  T )

      B .EQ. 5 = ( F  F  T )
                 ( F  F  T )

      C .LT. 0 = ( F, T, F, T, F, T )
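A mask built this way can also be stored in a LOGICAL array and then handed to the masked functions of section 2.3.3; the fragment below is a small sketch (the array name LA is invented here).

      LOGICAL, ARRAY(3,2) :: LA
      LA = A .GT. 3
C     Sum only the elements of A that are greater than 3: 4 + 5 + 6 = 15.
      ASUM = SUM(A, MASK=LA)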


2.3.3 Special functions

In addition to the normal Fortran intrinsic functions, CM Fortran provides several special functions to aid in the parallel operation of the machine. For the examples of the functions described below, assume the following: ARRAY is the name of any array of type real, integer, or logical; DIM is an integer denoting which particular dimension of the array (if any) the function is to be applied to; MASK is a logical array with the same shape as ARRAY telling which particular elements the function is to use; V, V1, and V2 are one-dimensional arrays or vectors; M1 and M2 are two-dimensional arrays or matrices; and SHIFT is an integer or integer array describing the shift to be made. For some of the following functions, DIM and MASK may be used as keyword parameters.

Reduction operations: The following functions perform commonly used reduction operations. Except where noted, they are common to both CM Fortran and Fortran 90.

- SUM (ARRAY [, DIM] [, MASK]): This function computes the sum of all the elements of ARRAY, according to the values of DIM and MASK. Note: in the last example, MASK is used as a keyword parameter, since the second parameter DIM is missing.

      SUM(A) = 21
      SUM(B, 1) = ( 5, 12, 10 )
      SUM(B, 2) = ( 11, 16 )
      SUM(C, MASK=C.GT.0) = 9

- PRODUCT (ARRAY [, DIM] [, MASK]): This function computes the product of all the elements of ARRAY according to the values of DIM and MASK. Note: in the last example, MASK is used as a keyword parameter, since the second parameter DIM is missing.

      PRODUCT(A) = 720
      PRODUCT(B, 1) = ( 6, 32, 25 )
      PRODUCT(B, 2) = ( 40, 120 )
      PRODUCT(C, MASK=C.GT.0) = 15

- DOTPRODUCT (V1, V2): This function computes the dot product of the two vectors (one-dimensional arrays) V1 and V2. (The corresponding Fortran 90 intrinsic is spelled DOT_PRODUCT.)

      DOTPRODUCT(A(1,:), B(:,2)) = 20
      DOTPRODUCT(A(:,1), B(2,:)) = 52
      DOTPRODUCT(C, C) = 91

- MAXVAL (ARRAY [, DIM] [, MASK]): This function finds the maximum value of all the elements of ARRAY according to the values of DIM and MASK.

      MAXVAL(A) = 6
      MAXVAL(B(:,1)) = 3
      MAXVAL(C) = 5
      MAXVAL(C, 1, C.LT.0) = -2

- MINVAL (ARRAY [, DIM] [, MASK]): This function finds the minimum value of all the elements of ARRAY according to the values of DIM and MASK.

      MINVAL(A) = 1
      MINVAL(B(:,1)) = 2
      MINVAL(C) = -6
      MINVAL(C, 1, C.GT.0) = 1

- MAXLOC (ARRAY [, MASK]): This function returns an integer value or integer array representing the subscripts of the maximum values of all the elements of ARRAY according to the values of MASK. If more than one such location exists, which subscript is returned is non-deterministic.

      MAXLOC(A) = ( 3, 2 )
      MAXLOC(B(:,3)) = ( 1 )   (could also be 2)
      MAXLOC(C) = ( 5 )
      MAXLOC(C, C.LT.0) = ( 2 )

- MINLOC (ARRAY [, MASK]): This function returns an integer value or integer array representing the subscripts of the minimum values of all the elements of ARRAY according to the values of MASK. If more than one such location exists, which subscript is returned is non-deterministic.

      MINLOC(A) = ( 1, 1 )
      MINLOC(B(:,3)) = ( 1 )   (could also be 2)
      MINLOC(C) = ( 6 )
      MINLOC(C, C.GT.0) = ( 1 )


- COUNT (MASK [, DIM]): This function returns the number of elements for which the MASK holds true.

      COUNT(A.GT.0) = 6
      COUNT(A.GT.0, 1) = ( 3, 3 )
      COUNT(B.EQ.5) = 2
      COUNT(C.LE.0) = 3

- ANY (MASK [, DIM]): This function returns True if the MASK holds true for any of the elements.

      ANY(A.GT.0) = T
      ANY(A.GT.0, 1) = ( T, T )
      ANY(B.EQ.5) = T
      ANY(C.LE.0) = T

- ALL (MASK [, DIM]): This function returns True if the MASK holds true for all of the elements.

      ALL(A.GT.0) = T
      ALL(A.GT.0, 1) = ( T, T )
      ALL(B.EQ.5) = F
      ALL(C.LE.0) = F

Functions for matrices: The following built-in functions in CM Fortran and Fortran 90 are used for manipulating matrices:

- TRANSPOSE (M1): This function returns the transpose of the matrix (two-dimensional array) M1.

      TRANSPOSE(A) = ( 1  3  5 )
                     ( 2  4  6 )

- MATMUL (M1, M2): This function returns the result of the matrix multiplication of M1 by M2. Note that the expression M1*M2 does not perform matrix multiplication. Instead, it produces element-by-element multiplication; that is, (M1*M2)(i,j) = M1(i,j) * M2(i,j) for all i, j.

      MATMUL(A, B) = (  8  20  15 )
                     ( 18  44  35 )
                     ( 28  68  55 )

      MATMUL(B, A) = ( 39  50 )
                     ( 52  68 )

- DIAGONAL (ARRAY [, FILL]): This function is only in CM Fortran. It creates a diagonal matrix from the vector ARRAY. The elements of the vector are placed on the diagonal, and the value of FILL (if any) is placed in the other elements of the matrix. If there is no FILL value, the value 0 (or .FALSE., if logical) is used.

      DIAGONAL(C) = (  1   0   0   0   0   0 )
                    (  0  -2   0   0   0   0 )
                    (  0   0   3   0   0   0 )
                    (  0   0   0  -4   0   0 )
                    (  0   0   0   0   5   0 )
                    (  0   0   0   0   0  -6 )

      DIAGONAL(C, 99) = (  1  99  99  99  99  99 )
                        ( 99  -2  99  99  99  99 )
                        ( 99  99   3  99  99  99 )
                        ( 99  99  99  -4  99  99 )
                        ( 99  99  99  99   5  99 )
                        ( 99  99  99  99  99  -6 )

Other useful functions: In addition to the reduction operations listed above, both CM Fortran and Fortran 90 contain other functions directed toward handling data-parallel data objects. Some of these are given here:

- RANK (ARRAY): This CM Fortran function returns the rank of the given scalar or ARRAY. A similar Fortran 90 function is named SIZE.

      RANK(100) = 0
      RANK(A) = 2
      RANK(B) = 2
      RANK(C) = 1

- DSHAPE (ARRAY): This CM Fortran function returns the shape of the given scalar or ARRAY. In Fortran 90, this function is named SHAPE.

      DSHAPE(-1) = ( )   (a scalar has an empty shape)
      DSHAPE(C) = ( 6 )
      DSHAPE(A) = ( 3, 2 )
      DSHAPE(B) = ( 2, 3 )

- REPLICATE (ARRAY, DIM, NCOPIES): This CM Fortran function adds NCOPIES of the ARRAY along the given DIMension. The resultant array has the same rank as the original ARRAY, but the shape is greater in the given dimension.

      REPLICATE(A, 1, 2) = ( 1  2 )
                           ( 3  4 )
                           ( 5  6 )
                           ( 1  2 )
                           ( 3  4 )
                           ( 5  6 )

      REPLICATE(A, 2, 3) = ( 1  2  1  2  1  2 )
                           ( 3  4  3  4  3  4 )
                           ( 5  6  5  6  5  6 )

      REPLICATE(A, 1, 0) = ( )   (an empty array)

      REPLICATE(C, 1, 2) = ( 1, -2, 3, -4, 5, -6, 1, -2, 3, -4, 5, -6 )

- SPREAD (ARRAY, DIM, NCOPIES): This function is in both CM Fortran and Fortran 90. It produces NCOPIES of the ARRAY along DIM. The resultant array has rank one greater than that of the original ARRAY. This can also be used to make a vector from a scalar.

      SPREAD(-1, 1, 6) = ( -1, -1, -1, -1, -1, -1 )

      SPREAD(-1, 1, 0) = ( )   (an empty array)

      SPREAD(A, 1, 2) = two copies of A along a new first dimension
                        (a 2 x 3 x 2 array):
                        ( 1  2 )    ( 1  2 )
                        ( 3  4 )    ( 3  4 )
                        ( 5  6 )    ( 5  6 )

      SPREAD(A, 2, 3) = three copies of A along a new second dimension
                        (a 3 x 3 x 2 array):
                        ( 1  2 )    ( 1  2 )    ( 1  2 )
                        ( 3  4 )    ( 3  4 )    ( 3  4 )
                        ( 5  6 )    ( 5  6 )    ( 5  6 )

      SPREAD(C, 1, 2) = ( 1  -2   3  -4   5  -6 )
                        ( 1  -2   3  -4   5  -6 )

      SPREAD(C, 2, 3) = (  1   1   1 )
                        ( -2  -2  -2 )
                        (  3   3   3 )
                        ( -4  -4  -4 )
                        (  5   5   5 )
                        ( -6  -6  -6 )

- PACK (ARRAY, MASK [, V]): This function gathers elements from the ARRAY under the control of the MASK. If the vector V is specified, the result is placed on top of the values already there.

      PACK(B, B.GT.4) = ( 8, 5, 5 )
      PACK(A, A.GT.3, C) = ( 5, 4, 6, -4, 5, -6 )
      PACK(A, A.GT.0, C) = ( 1, 3, 5, 2, 4, 6 )

- UNPACK (V, MASK, ARRAY): This function scatters elements from the vector V under the control of the MASK into the ARRAY.

      UNPACK(C, C.LT.0, [6[0.0]]) = ( 0, 1, 0, -2, 0, 3 )
      UNPACK(C, C.LT.0, [6[-1.0]]) = ( -1, 1, -1, -2, -1, 3 )
      UNPACK(C, C.GT.0, C) = ( 1, -2, -2, -4, 3, -6 )

- CSHIFT (ARRAY, DIM, SHIFT): This function does a Circular SHIFT on ARRAY, returning a result that has the same type and shape as ARRAY.

      CSHIFT(C, 1, 1) = ( -2, 3, -4, 5, -6, 1 )
      CSHIFT(C, 1, -1) = ( -6, 1, -2, 3, -4, 5 )
      CSHIFT(C, 1, 2) = ( 3, -4, 5, -6, 1, -2 )
      CSHIFT(C, 1, -3) = ( -4, 5, -6, 1, -2, 3 )

      CSHIFT(A, 1, 2) = ( 5  6 )
                        ( 1  2 )
                        ( 3  4 )

      CSHIFT(A, 2, 1) = ( 2  1 )
                        ( 4  3 )
                        ( 6  5 )

      CSHIFT(A, 1, [0, 1]) = ( 1  4 )
                             ( 3  6 )
                             ( 5  2 )

- EOSHIFT (ARRAY, DIM, SHIFT [, BOUNDARY]): This function does an End-Off SHIFT on ARRAY, returning a result that has the same type and shape as ARRAY. The value of BOUNDARY (if any) is used to fill up the spaces made by shifting away from the edges; otherwise zero (or .FALSE., if logical) is used.

      EOSHIFT(C, 1, 1) = ( -2, 3, -4, 5, -6, 0 )
      EOSHIFT(C, 1, -1) = ( 0, 1, -2, 3, -4, 5 )
      EOSHIFT(C, 1, 3, 9) = ( -4, 5, -6, 9, 9, 9 )

      EOSHIFT(A, 2, -1) = ( 0  1 )
                          ( 0  3 )
                          ( 0  5 )

      EOSHIFT(A, 1, [-1, 0]) = ( 0  2 )
                               ( 1  4 )
                               ( 3  6 )

      EOSHIFT(A, 1, 2, 99) = (  5   6 )
                             ( 99  99 )
                             ( 99  99 )

      EOSHIFT(A, 1, 2, REAL([1:2])) = ( 5  6 )
                                      ( 1  2 )
                                      ( 1  2 )


There are many additional intrinsic functions that can be used to manipulate arrays in parallel, such as RESHAPE (in both CM Fortran and Fortran 90) and PROJECT (available only in CM Fortran). Refer to the CM Fortran Reference Manual [TMC 91b] or the MasPar Fortran Reference Manual [MasPar 93a] for more information on these and other functions. Reference books on Fortran 90, such as the Fortran 90 Handbook [Adams et al 92] and the Programmer's Guide to Fortran 90 [Brainerd et al 90], may be helpful as well.

2.4 Compiler directives

Compilers for SIMD computers make use of compiler directives to define the assignment of array elements to different processors. The compiler directives in CM Fortran are used in two ways. The first is to specify the location of the elements of arrays with respect to each other; the second is to specify the use and homes of arrays in the common blocks of the program. The MasPar MPF language also provides compiler directives for mapping array elements to processors and for specifying the location of arrays and common blocks. Further MPF compiler directives assist with the calling of routines in other languages.

In this section, we discuss some of the compiler directives for both these machines. This is done to provide examples of how such directives are used; in many cases, these examples may be extended or modified for use on other SIMD machines. In the following, we briefly present the use and purpose of three CM Fortran compiler directives and three MPF compiler directives,

      CMF$ LAYOUT args
      CMPF MAP args
      CMF$ ALIGN args
      CMF$ COMMON args
      CMPF ONDPU args
      CMPF ONFE args

as representative SIMD compiler directives.

All the CM Fortran compiler directives begin with the characters CMF$; they appear as comment lines to other Fortran compilers. The continuation of such lines (if needed) differs from the normal Fortran statement continuation; since the compiler directives look like comments, an ampersand (&) needs to be placed on the end of the line to be continued, instead of the continuation character in column six. MPF directives begin with the characters CMPF.

2.4.1 CM Fortran LAYOUT

In this section and the next, a reference to a processor means a virtual processor. If an array has more elements than the number of physical processors, each processor acts as a virtual processor for more than one element. The ordering of the assignment of elements to virtual processors depends on the layout or ordering chosen for the given dimension of the array.

In the normal or default allocation of arrays on the PPU, each element is placed on a single processor. When there are more elements than processors, each element is mapped to a virtual processor. The intent is to allow each processor to have its own piece of the action. Such a mapping for the array V(0:N) is shown in figure 10.

Figure 10: Normal layout for the array V(0:N): element V(i) is placed on processor P(i).

However, there are cases where it is more efficient to allow each processor to have a set of elements of a given array, as those elements are used together in the same computation. For instance, in the case of arrays of two or more dimensions, it may be best to have all the elements of one of the dimensions be placed on the same processor. For instance, suppose the array Q is declared by the following DIMENSION statement

      DIMENSION Q(3,0:N)

Further suppose that

      Q(3,i) = F( Q(1,i), Q(2,i) )

where F is some function of the two one-dimensional arrays that make up the first two rows of Q.


Then it might be desirable to have all three elements of each column of the array on the same processor. This type of layout is shown in figure 11. The compiler directive that assigns this layout is

      CMF$ LAYOUT Q(:SERIAL,:NEWS)

Here the term SERIAL means that the elements of the array in this dimension should all be on the same processor. The term NEWS implies that the elements of the second dimension (each column) should be spread across the processors in an order that allows efficient communication between nearest neighbors.

Figure 11: Layout for the array Q(:SERIAL,:NEWS): each processor P(i) holds the entire column Q(1:3,i).

The general format of this directive is

      CMF$ LAYOUT ARRAY(weight_1:order_1, weight_2:order_2, ..., weight_N:order_N)

This specifies an order, and a weight for that order, for each dimension of the given ARRAY. The order is as defined above; the weight is any constant expression that indicates the importance of that ordering for the given dimension in relation to the other dimensional orders. If the weight is missing, it is assumed to be 1; if the order is missing, it is assumed to be NEWS. The sample statement above,

      CMF$ LAYOUT Q(:SERIAL,:NEWS)

provides no weights. Weights have no meaning for serial ordering, since the elements in that dimension should all be on the same processor.

There are three different types of orders. As mentioned in the last paragraph, the SERIAL order tells the compiler that all elements of the given dimension should be arranged sequentially on the same processor.

The NEWS order is the default ordering; it specifies that elements of the given dimension of the array should be stored on processors in such a way as


to provide quick nearest-neighbor communication. This is most often used in grid computations where each element of the array needs to be updated using the values of its neighboring elements, as in the following two-dimensional expression:

    X(i,j) = F( X(i,j), X(i-1,j), X(i+1,j), X(i,j-1), X(i,j+1) )

This two-dimensional form of a grid is shown in figure 12.

Figure 12: A two-dimensional NEWS grid; the element X(i,j) communicates with its four nearest neighbors X(i-1,j), X(i+1,j), X(i,j-1), and X(i,j+1).

So the default layout

CMF$ LAYOUT X(1.0:NEWS,1.0:NEWS)

would be best for this computation. Were the layout defined as

CMF$ LAYOUT X(1000:NEWS,1:NEWS)

the compiler assumes that elements along the first axis of the array communicate far more often than those along the second axis. Thus it may assign elements of the first axis to the same chip, if possible, to make the communication local.
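As a concrete sketch of the grid computation above, the update of X might be written with the Fortran 90 CSHIFT intrinsic; the grid size, the periodic boundary implied by CSHIFT, and the simple averaging used in place of F are all illustrative assumptions:

      PARAMETER (M = 256)
      DIMENSION X(M,M)
CMF$ LAYOUT X(:NEWS, :NEWS)
C     Each circular shift fetches a nearest neighbor; with the NEWS
C     layout these shifts map onto fast nearest-neighbor communication.
      X = 0.25 * ( CSHIFT(X, SHIFT=-1, DIM=1)
     &           + CSHIFT(X, SHIFT=+1, DIM=1)
     &           + CSHIFT(X, SHIFT=-1, DIM=2)
     &           + CSHIFT(X, SHIFT=+1, DIM=2) )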

The last ordering, SEND, arranges the elements of the array according to the unique addresses preassigned to each processor. This communication method uses the underlying hypercube of the CM, allowing each element to be communicated quickly to any other element in the array. This ordering is best when the communication is between processors that are not


nearest neighbors. An example of a program that would benefit from this ordering is an FFT computation.

It is assumed that arrays declared in the same DIMENSION statement should have their first elements assigned to the same processor. Hence, if the arrays Q and V have been declared together, as in the statement

DIMENSION Q(3,0:N), V(0:N)

and Q has the same LAYOUT directive as above, then the elements of both arrays are assigned to the processors in a manner similar to that shown in figure 13.

Figure 13: Layout for the arrays Q(:SERIAL,:NEWS) and V(:NEWS); each processor P(i) holds the column Q(1,i), Q(2,i), Q(3,i) together with the element V(i).

2.4.2 MasPar MPF MAP

The MAP compiler directive for the MasPar MP-2 is similar to the CM Fortran LAYOUT directive. It defines a mapping of the given array onto the processor-array elements. For example, the statements

CMPF MAP Q(MEMORY, ALLBITS)
CMPF MAP V(ALLBITS)

also map the Q and V arrays as shown in figure 13. Here the parameter MEMORY has a meaning similar to that of :SERIAL for the CM; the elements in each column are to be put in the same processor memory. The parameter ALLBITS corresponds to the CM Fortran :NEWS; along that dimension of the array, the elements are to be spread across the processor-element array.
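Combined with the declarations, a minimal MPF sketch might therefore read as follows; the bound N and the final assignment are illustrative assumptions, not taken from the original:

      PARAMETER (N = 1023)
      DIMENSION Q(3,0:N), V(0:N)
CMPF MAP Q(MEMORY, ALLBITS)
CMPF MAP V(ALLBITS)
C     Each column of Q sits in one PE's memory, and V(i) is spread
C     across the PE array in step with the columns of Q.
      V = Q(1,:) + Q(2,:)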


Figure 14: Alignments of arrays S and T: (a) the default alignment of S(N,N) and T(N); (b) CMF$ ALIGN S(1,I) WITH T(I); (c) CMF$ ALIGN S(4,I+2) WITH T(I).

2.4.3 CM Fortran ALIGN

The ALIGN compiler directive is used to align the elements of two arrays with each other. For example, suppose the following arrays are declared:

DIMENSION S(N,N), T(N)

By default, the elements of the arrays would be laid out so that S(1,1) and T(1) are on the same processor, S(2,1) and T(2) are on the next processor, and so on. This is the ordering shown in figure 14(a). However, it might be desirable to have T(2) on the same processor as S(1,2), as shown in figure 14(b). This can be accomplished by the following ALIGN statement:

CMF$ ALIGN S(1,I) WITH T(I)


Similarly, the statement

CMF$ ALIGN S(4,I+2) WITH T(I)

would produce a layout as shown in figure 14(c).
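As a sketch, the first of these alignments might be used as follows; the bound N and the update of T are illustrative assumptions, not taken from the original:

      PARAMETER (N = 512)
      DIMENSION S(N,N), T(N)
CMF$ ALIGN S(1,I) WITH T(I)
C     T(I) is now stored with S(1,I), so adding the first row of S
C     to T involves no interprocessor communication.
      T = T + S(1,:)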

2.4.4 CM Fortran COMMON

The COMMON compiler directive is used to define a default home for the arrays in a given common block. The three possible forms of this directive are as follows:

CMF$ COMMON [, CMONLY] /blkname/

CMF$ COMMON FEONLY /blkname/

CMF$ COMMON INITIALIZE /blkname/

The first form of the directive tells the compiler to put the arrays contained in the given common block blkname on the PPU. The term CMONLY is optional and is used for clarity.

The second form of the directive informs the compiler that the arrays should be placed on the FE; otherwise, the normal default home for arrays in COMMON blocks would be the PPU.

The third and final form of the directive also instructs the compiler to make the PPU the home of the arrays in the common block blkname. In addition, however, it allocates space on the FE for the common block as well, to allow for the static initialization of the arrays. Such arrays may be initialized by DATA statements.
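For example, a common block of coefficients might be initialized as in the following sketch; the block name, the array, and the placement of the directive inside a BLOCK DATA unit are illustrative assumptions, not taken from the original:

      BLOCK DATA INIT
CMF$ COMMON INITIALIZE /COEFS/
      COMMON /COEFS/ C(4)
C     C lives on the PPU, but space is also allocated on the FE so
C     that the DATA statement can initialize it statically.
      DATA C / 1.0, 2.0, 3.0, 4.0 /
      END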

2.4.5 MasPar MPF ONDPU

To force an array or common block in MPF to be on the DPU, use the ONDPU compiler directive:

CMPF ONDPU A, DPUCBLK

This command asserts that the array A and the common block DPUCBLK must reside on the DPU. This happens whether or not the array or common block is used in any parallel operations.


2.4.6 MasPar MPF ONFE

The ONFE compiler directive performs the reverse of the ONDPU directive. The command

CMPF ONFE B, FECBLK

ensures that the array B and the common block FECBLK are placed on the front end.
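Putting the two MPF directives together, a routine might pin one set of data to the DPU and another to the front end as in the sketch below; the array sizes and the contents of the common blocks are illustrative assumptions, not taken from the original:

      DIMENSION A(1024), B(100)
      COMMON /DPUCBLK/ W(1024)
      COMMON /FECBLK/ IP(10)
CMPF ONDPU A, DPUCBLK
CMPF ONFE  B, FECBLK
C     A and /DPUCBLK/ reside on the DPU even if they never appear in a
C     parallel operation; B and /FECBLK/ stay on the front end.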


3 Acknowledgements

We would like to thank Steven Goldhaber and Steve Hammond of Thinking Machines Corporation at the National Center for Atmospheric Research for their assistance and patient explanations of the operation and architecture of the CM-2.

We would also like to thank MasPar Computer Corporation for donating computer time on a MasPar MP-1.

References

[Adams et al 92] ADAMS, JEANNE C., WALTER S. BRAINERD, JEANNE T. MARTIN, BRIAN T. SMITH, AND JERROLD L. WAGENER. [1992]. Fortran 90 Handbook. Intertext Publications, McGraw-Hill Book Company, New York, NY.

[Almasi & Gottlieb 94] ALMASI, GEORGE S. AND ALLAN GOTTLIEB. [1994]. Highly Parallel Computing. The Benjamin/Cummings Publishing Company, Inc., Redwood City, CA, 2nd edition.

[Baron & Higbie 92] BARON, ROBERT J. AND LEE HIGBIE. [1992]. Computer Architecture: Case Studies. Addison-Wesley Publishing Company, New York, NY.


[Brainerd et al 90] BRAINERD, WALTER S., CHARLES GOLDBERG, AND JEANNE C. ADAMS. [1990]. Programmer's Guide to Fortran 90. McGraw-Hill Book Company, New York, NY.

[Dongarra 94] DONGARRA, J. J. [1994]. Performance of various computers using standard linear equations software. Technical Report CS-89-85, Oak Ridge National Laboratory, Oak Ridge, TN 37831. Netlib version as of November 1, 1994.

[Eberlein & Eastridge 91] EBERLEIN, MARY AND ERIC EASTRIDGE. [Aug 1991]. A Beginner's Guide to the MasPar MP-2. Joint Institute for Computational Science (JICS), University of Tennessee, Knoxville, TN.

[Flynn 72] FLYNN, MICHAEL J. [Sep 1972]. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948-960.

[Hillis 85] HILLIS, W. DANIEL. [1985]. The Connection Machine. The MIT Press, Cambridge, MA.

[Hord 90] HORD, R. MICHAEL. [1990]. Parallel Supercomputing in SIMD Architectures. CRC Press, Inc., Boston, MA.

[Hwang 93] HWANG, KAI. [1993]. Advanced Computer Architecture. McGraw-Hill, Inc., New York, NY.

[Jessup 95] JESSUP, ELIZABETH R. [1995]. Distributed-memory MIMD computing: An introduction. HPSC Course Notes.

[MasPar 92] MasPar Computer Corporation, Sunnyvale, CA. [Jul 1992]. MasPar System Overview. Part Number 9300-0100, Rev. A5.

[MasPar 93a] MasPar Computer Corporation, Sunnyvale, CA. [May 1993]. MasPar Fortran Reference Manual. Part Number 9303-0000, Revision A6.

[MasPar 93b] MasPar Computer Corporation, Sunnyvale, CA. [May 1993]. MasPar Fortran User Guide. Part Number 9303-0100, Revision A5.


[Schauble 95] SCHAUBLE, CAROLYN J. C. [1995]. Vector computing: An introduction. HPSC Course Notes.

[TMC 88] [May 1988]. Connection Machine Model CM-2 Technical Summary. Technical Report HA87-4, Thinking Machines Corporation, Cambridge, MA.

[TMC 91a] Thinking Machines Corporation, Cambridge, MA. [Jan 1991]. Connection Machine: Fortran Programming Guide. Version 1.0.

[TMC 91b] Thinking Machines Corporation, Cambridge, MA. [Jul 1991]. Connection Machine: Fortran Reference Manual. Versions 1.0 and 1.1.

[TMC 91c] Thinking Machines Corporation, Cambridge, MA. [Jul 1991]. Connection Machine: Fortran User's Guide. Versions 1.0 and 1.1.

[TMC 91d] [Jun 1991]. Connection Machine CM-200 Series Technical Summary. Technical report, Thinking Machines Corporation, Cambridge, MA.

[van der Steen 94] VAN DER STEEN, AAD J. [Sep 1994]. Overview of recent supercomputers. Technical report, Stichting Nationale Computer Faciliteiten, The Netherlands. Fourth revised edition, netlib.
