SIMD Computing:
An Introduction

C. J. C. Schauble
September 12, 1995

High Performance Scientific Computing
University of Colorado at Boulder

Copyright © 1995 by the HPSC Group of the University of Colorado

The following are members of
the HPSC Group of the Department of Computer Science
at the University of Colorado at Boulder:

    Lloyd D. Fosdick
    Elizabeth R. Jessup
    Carolyn J. C. Schauble
    Gitta O. Domik
Contents

1 General architecture
  1.1 The Connection Machine CM-2
    1.1.1 Characteristics
    1.1.2 Performance
  1.2 The MasPar MP-2
    1.2.1 Characteristics
    1.2.2 Performance
2 Programming issues
  2.1 Architectural organization considerations
    2.1.1 Homes
  2.2 CM Fortran, MPF, and Fortran 90
    2.2.1 Arrays
    2.2.2 Array sections
    2.2.3 Alternate DO loops
    2.2.4 WHERE statements
    2.2.5 FORALL statements
  2.3 Built-in functions for CM Fortran and Fortran 90
    2.3.1 Intrinsic functions
    2.3.2 Masks
    2.3.3 Special functions
  2.4 Compiler directives
    2.4.1 CM Fortran LAYOUT
    2.4.2 MasPar MPF MAP
    2.4.3 CM Fortran ALIGN
    2.4.4 CM Fortran COMMON
    2.4.5 MasPar MPF ONDPU
    2.4.6 MasPar MPF ONFE
3 Acknowledgements
References
Trademark Notice

DECstation, ULTRIX, and VAX are trademarks of Digital Equipment Corporation.
Goodyear MPP is a trademark of Goodyear Rubber and Tire Company, Inc.
ICL DAP is a trademark of International Computers Limited.
MasPar Fortran, MasPar MP-1, MasPar MP-2, MasPar Programming Environment, MPF, MPL, MPPE, and X-net are trademarks of MasPar Computer Corporation.
X-Window System is a trademark of The Massachusetts Institute of Technology.
MATLAB is a trademark of The MathWorks, Inc.
IDL is a registered trademark of Research Systems, Inc.
Symbolics is a trademark of Symbolics, Inc.
C*, CM, CM-1, CM-2, CM-5, CM Fortran, Connection Machine, DataVault, *Lisp, Paris, and Slicewise are trademarks of Thinking Machines Corporation.
UNIX is a trademark of UNIX Systems Laboratories, Inc.
SIMD Computing:
An Introduction *†‡

C. J. C. Schauble
September 12, 1995
According to the Flynn computer classification system [Flynn 72], a SIMD
computer is a Single-Instruction, Multiple-Data machine. In other words, all
the processors of a SIMD multiprocessor execute the same instruction at the
same time, but each executes that instruction with different data.

The computers we discuss in this tutorial are SIMD machines with distributed
memories (DM-SIMD). They are sometimes referred to as processor
arrays or as massively-parallel computers.

This tutorial is divided into two main parts. In the first section, we
discuss the general architecture of SIMD multiprocessors. Then we consider
how these general features are embodied in two particular SIMD machines:
the Thinking Machines CM-2 and the MasPar MP-2.

In the second section, we look into programming issues for SIMD multiprocessors,
both architectural and language-oriented. In particular, we describe
useful features of Fortran 90 and CM Fortran.
* This work has been partially supported by the National Center for Atmospheric
Research (NCAR) and utilized the TMC CM-2 at NCAR in Boulder, CO. NCAR is
supported by the National Science Foundation.

† This work has been partially supported by the National Center for Supercomputing
Applications under the grants TRA930330N and TRA930331N, and utilized the Connection
Machine Model-2 (CM-2) at the National Center for Supercomputing Applications,
University of Illinois at Urbana-Champaign.

‡ This work has been supported by the National Science Foundation under an Educational
Infrastructure grant, CDA-9017953. It has been produced by the HPSC
Group, Department of Computer Science, University of Colorado, Boulder, CO 80309.
Please direct comments or queries to Elizabeth Jessup at this address or e-mail.

Copyright © 1995 by the HPSC Group of the University of Colorado
For detailed information on how to log in to and program specific SIMD
computers such as the CM-2 and the MasPar MP-1, refer to the documents in
the /pub/HPSC directory at the cs.colorado.edu anonymous ftp site.
1 General architecture
Each of the processors in a distributed-memory SIMD machine has its own
local memory to store the data it needs. Also, each processor is connected to
other processors in the computer and may send or receive data to or from
any of them. In many respects, these computers are similar to distributed-memory
MIMD (multiple instruction, multiple data) multiprocessors.

As stated above, the term SIMD implies that the same instruction is executed
on multiple data. Hence the distinguishing feature of a SIMD machine
is that all the processors act in concert. Each processor performs the same
instruction at the same time as all the other processors, but each processor
uses its own local data for this execution.

The array of processors is usually connected to the outside world by a
sequential computer or workstation. The user accesses the processor array
through this front end or host machine.
Using a SIMD computer for scientific computing means that many elements
of an array can be computed simultaneously. Unlike vector processors,[1]
the computation of these elements is not pipelined with different portions of
neighboring elements being worked on at the same time. Instead, large groups
of elements go through the same computation in parallel.

In the following, we discuss the architectural features of SIMD multiprocessors,
concentrating on two computers in this class: the Connection
Machine CM-2 by Thinking Machines Corporation and the MasPar MP-2
by MasPar Computer Corporation. Similar computers include the Digital
Equipment Corporation MPP series (technically the same as the MasPar
machines), the Goodyear MPP, and the ICL DAP.
[1] See the tutorial on vector computing [Schauble 95] for more information on vector processors.
1.1 The Connection Machine CM-2
The CM-2 Connection Machine is a SIMD supercomputer manufactured by
Thinking Machines Corporation (TMC). Data parallel programming is the
natural paradigm for this machine, allowing each processor to handle one data
element or set of data elements at a time.

The initial concept of the machine was set forth in a Ph.D. dissertation by
W. Daniel Hillis [Hillis 85]. The first commercial version of this computer was
called the CM-1 and was manufactured in 1986. It contained up to 65,536 (or
64K) processors capable of executing the same instruction concurrently. As
shown in figure 1, sixteen one-bit processors with 4K bits of memory apiece
are on one chip of the machine. These chips are arranged in a hypercube[2]
pattern. Thus the machine was available in units of 2^d processors, where
d = 12 through 16.

One of the original purposes of the computer was artificial intelligence;
the eventual goal was a thinking machine. Each processor is only a one-bit
processor. The idea was to provide one processor per pixel for image
processing, one processor per transistor for VLSI simulation, or one processor
per concept for semantic networks.
The first high-level language implemented for the machine was *Lisp, a
parallel extension of Lisp. The design of portions of the *Lisp language is
discussed in the Hillis dissertation.

As the first version of this supercomputer came onto the market, TMC
discovered that there was also significant interest and money for supercomputers
that could be used for numerical and scientific computing. Hence a
faster version of the machine was produced in 1987; named the CM-2, this
machine was the first of the CM-200 series of computers. It included floating-point
hardware, a faster clock, and increased the memory to 64K bits per
processor. These models emphasized the use of data-parallel programming.
Both C* and CM Fortran were available on this machine in addition to *Lisp.

Announced in November 1991, a more recent machine is the CM-5. This
is a MIMD machine that embodies many of the earlier Connection Machine
concepts with more powerful processors, routing techniques, and I/O.
The following subsections discuss the characteristics and the performance
of the CM-2. For further information, see the Connection Machine CM-200
Series Technical Summary [TMC 91d], Parallel Supercomputing in SIMD
Architectures [Hord 90], chapter 7, or Computer Architecture: Case Studies
[Baron & Higbie 92], chapter 18.

[2] See the tutorial on MIMD computing [Jessup 95] for more information on a hypercube.
Figure 1: A representative blowup of one of the 64^2 processor chips in a Thinking
Machines CM-1 or CM-2. Each CM-1/CM-2 chip contains 16 one-bit processors.
The router connects the processor chips with other processor chips. Memory chips
are associated with each CM-1/CM-2 chip.
Figure 2: The two main parts of a Thinking Machines Connection Machine
system: the parallel processing unit and the front end.
1.1.1 Characteristics
A Connection Machine (CM) may be considered as two main parts; these are
depicted in figure 2. The parallel processing unit (PPU) is the SIMD portion
of the machine and contains up to 64K single-bit processors; this part is the
CM itself. The front end (FE) of the machine acts as a controller or host for
the PPU and provides access to the rest of the world.

The FE is usually a small computer; for the CM, this may be a UNIX
or a Symbolics Lisp workstation or a VAX. A CM may have up to four
FEs; these need not all be the same type of machine. Programs are
compiled, stored, and executed serially on the FE; any parallel operations
in the program are recognized and pipelined to the PPU for execution there,
with the serial execution filling the pipeline and continuing until a response is
needed back from the PPU. Thus CM programs have the familiar sequential
control flow and do not require additional synchronization primitives as do
programs for other multiprocessors.

Depending on the configuration of the machine, each one-bit PPU processor
has 64K, 256K, or 1024K bits of memory, an arithmetic-logic unit
(ALU), four one-bit registers, and interfaces for two forms of communication
and I/O. These processors all work in parallel on simple instructions. In
fact, the PPU can be thought of as a large, synchronized drill team while the
FE is the sergeant who yells out the commands. For example, all the PPU
processors fetch something from their individual memories to their ALUs at
one command; they all add something to that at the next command; and
they all store their individual results into their own memories at the next
command.
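This lockstep execution maps naturally onto whole-array operations. As a
minimal illustration (our own Fortran 90 sketch, with invented array names),
each assignment below corresponds to one command that every processor
carries out on its own local elements:

    PROGRAM lockstep
      REAL, DIMENSION(1000) :: A, B, C
      A = 1.0          ! every processor sets its own elements of A
      B = 2.0          ! ... and of B
      C = A + B        ! fetch, add, and store proceed in lockstep,
                       ! each processor using only its local data
      PRINT *, C(1), C(1000)
    END PROGRAM lockstep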
The language Paris (PARallel Instruction Set) is used to express the parallel
operations that are to be run on the PPU. All *Lisp, CM Fortran, or
C* parallel commands are compiled into Paris instructions. Such operations
include parallel arithmetic operations (both floating-point and fixed), vector
summation and other reduction operations,[3] sorting, and matrix multiplication.

[3] See the tutorial on vector computing [Schauble 95] for more information on reduction operations.

An alternate run-time system exists for machines with 64-bit floating-point
accelerators. This is called the Slicewise model and provides a different
viewpoint of the CM than the Paris model. It can be used only with CM
Fortran programs but allows more efficient execution.

The PPU may be broken up into two or four sections as shown in figure 3.
Each section has its own sequencer and can be used as a sub-PPU by itself or
can be grouped with other sections. The nexus switch provides the pathway
between a given FE and its current sections of the PPU.
A sequencer receives Paris instructions from the FE and breaks them
down into a sequence of low-level instructions that can be handled by the
one-bit processors. When that is done, the sequencer broadcasts these instructions
to all the processors in its section. Each processor then executes
the instructions in parallel with the other processors. When the execution
of low-level instructions is completed, control is returned to the FE. Used
independently, each section sets up its own grid layout for computation and
communication for each array; if the sections are grouped together, one grid
per array is laid over all the processors. These grids may be altered dynamically
during the execution of the program.

Figure 3: Breakdown of a Thinking Machines CM PPU into two sections with
sequencers. Two front-end machines, the nexus switch, and the I/O system are
also shown.
Figure 4: A pair of Thinking Machines CM-2 processor chips communicate with a
single FPA.
Virtual processors are another feature of the CM. If the number of elements
for the data of a given program is larger than the number of processors,
the machine acts as if there were enough processors, providing virtual processors
by assigning the data elements across the PPU in an efficient manner.
In other words, each physical processor may act as one or more virtual processors.

Floating-point accelerators are optional and come in two types: 32-bit or
64-bit. A 32-bit FPA interprets the bits from two chips of sixteen one-bit processors
as a single floating-point number, allowing single precision arithmetic
computations. A 64-bit FPA also works with the contents of the processors
contained on two chips and provides double precision arithmetic. For this
reason, the processor chips are grouped in pairs with a floating-point and
memory interface shared by each pair, as shown in figure 4; it is common to
think of each pair of chips as being equivalent to a floating-point processor.
The addition of a floating-point unit to every 32 processors to form an FPA
speeds up the processing of floating-point computations on the CM by more
than a factor of twenty.

The Paris model of computation assigns full 32-bit or 64-bit words to the
memory of each processor. These are passed to the FPA bit by bit when
needed. The Slicewise model assigns different bits of a 32-bit or 64-bit word
to each of the 32 processors on the two chips connected to a single FPA.
When the FPA needs one of these words, each processor can send its bits
concurrently with the other processors. In other words, one 32-bit word
can be sent to the FPA in one load cycle, as one bit from each of the 32
processors is sent to the FPA simultaneously. A 64-bit word requires only
two load cycles since the data paths are 32 bits wide. Hence the amount of
time spent passing data to and from the FPA is much less for the Slicewise
model.
Data may be read and written serially through the FE to the PPU, but
this is not a productive method for large amounts of data because of the
cost of communication between the FE and PPU. An optional input/output
system allows peak rates up to 320 Mbytes per second for transfers of data
between the PPU processors and the I/O system buffers in parallel. One
solution for speeding up peripherals is to attach multiple devices (i.e., disks)
to the CM I/O system in parallel. Each device is connected to a separate
CMIO bus and may transfer data in parallel with the other devices. Thus
every 64 bits of data (the bandwidth of the bus) is transferred to a different
device; this is called file or disk striping. Alternatively, these I/O buffers may
attach to Thinking Machines Data Vaults; each Data Vault may contain 5
to 60 Gbytes of data. A maximum of eight Data Vaults can be on a given
system, each transferring data at a peak rate of 40 Mbytes per second or an
average rate of 20 Mbytes per second.

Another peripheral available through the I/O system is a graphical display
for output generated by the PPU; this uses the Thinking Machines
CM framebuffer, managing up to one Gbyte of data per second. The high-resolution
graphical display is a 19" color monitor. The framebuffer for the
display attaches directly to the CM backplane, like an I/O controller, and
holds raster image data. Alternatively, the image can be displayed in an
X-window on a remote workstation or terminal. The size of the window
determines the number of pixels; each PPU processor generates data for one
pixel. Either 8-bit mode or 24-bit mode can be used.
In addition to the serial communication available between the FE and
the PPU, the PPU processors may communicate with each other in different
ways. The general form of communication is via the router; as shown in
figure 1, a router is present on each chip of sixteen processors, permitting
each processor to communicate with any other processor on the PPU. The
router also contains the logic to handle virtual processors.

Communication between processors on the same chip is called local or
on-chip communication. Clearly this is faster than any method of off-chip
communication.
The interconnection of the processor chips across the PPU is based on
a hypercube network, where each node of the hypercube is a processor chip
containing sixteen processors. The operation of the router on a processor
chip is broken up into twelve subcycles, where twelve is the maximum dimension
of a CM-2 hypercube network. At subcycle k, the router is able
to communicate with other processor chips (or nodes) along the kth dimension
of the hypercube. Thus this interconnection pattern is sometimes called
a cube-connected cycle. Figure 5 shows a 4-dimensional hypercube where
the vertices of the network represent processor chips and the arcs represent
connections to neighboring processor chips in the network.

Figure 5: Part of a hypercube network.
When data can be treated as elements of a mesh or grid spread across
the PPU, a second and faster method of off-chip communication may be
used. This is called NEWS (North-East-West-South) communication and
allows the processors of each processor chip to communicate easily with the
processors on the nearest-neighbor chips on that mesh. However, this type of
communication is limited to nearest-neighbor messages and cannot be used
for broadcasting information to all the processors. An example of a two-dimensional
NEWS grid is shown in figure 6. The CM-1 only allowed a
fixed, two-dimensional grid, but the CM-2 permits NEWS grids with up to
31 dimensions. Each chip on the PPU has an interface to the NEWS grid;
processors with data elements required by or provided by their neighbors
store pointers to those neighbors. When two or more sections of the PPU
are being used independently, each section may have its own NEWS grid
setup. When the sections are all ganged together, there is a single NEWS
grid across the sections for each array or parallel structure.

Figure 6: A two-dimensional NEWS grid.
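In data-parallel Fortran, regular nearest-neighbor motion of the kind NEWS
supports is typically written with the standard Fortran 90 intrinsic CSHIFT
(circular shift), which a compiler can map onto grid communication. The
following is a minimal sketch of ours (array names invented) that averages
each grid point with its four NEWS neighbors on a periodic grid:

    PROGRAM news_average
      REAL, DIMENSION(64,64) :: T, TNEW
      T = 1.0
      ! Average each element with its north, south, west, and east
      ! neighbors; CSHIFT wraps around at the grid boundaries.
      TNEW = (CSHIFT(T,  1, DIM=1) + CSHIFT(T, -1, DIM=1) &
            + CSHIFT(T,  1, DIM=2) + CSHIFT(T, -1, DIM=2)) / 4.0
      PRINT *, TNEW(1,1)
    END PROGRAM news_average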
    Operation                 Single Precision (Mflops)
    (Paris Model)
    ------------------------  --------------------------
    Fl. Pt. Addition                   4,000
    Fl. Pt. Multiplication             4,000
    Fl. Pt. Division                   1,500
    Dot Product                       10,000
    4K x 4K Matrix Mult                3,500

Table 1: CM-2 performance of single precision floating-point operations on a full
64K model, taken from [TMC 88].
1.1.2 Performance
A full CM-2 with 64K processors and double precision FPAs can perform
arithmetic operations at the speeds shown in table 1 under the Paris model
of execution.[4]

[4] These figures were taken from the Connection Machine Model CM-2 Technical Summary [TMC 88], pp. 59-60.
The speed of memory reads and writes between each processor and its
memory is greater than or equal to 5 Mbits/second. Each processor has 64K
to 1024K bits (or 8K to 128K bytes) of memory, and a full machine has 64K
processors, providing a total memory of 512 Mbytes to 8 Gbytes. Hence, given
the minimum memory access speed of 5 Mbits/second for a single processor,
the CM-2 can perform memory reads and writes at about 300 Gbits/second
in aggregate (64K processors x 5 Mbits/second each is roughly 320 Gbits/second,
or about 40 Gbytes/second).
Other performance tests were done at the University of Colorado using
the CM-2 at the National Center for Atmospheric Research (NCAR). This
machine was just one-eighth the size of a full CM-2, with merely 8192 processors,
and contained only single precision FPAs; it also used the Paris model
of execution. Standard CM Fortran routines were used exclusively for these
tests. The results are summarized in table 2.

    Operation                 Single Precision (Mflops)
    (Paris Model)
    ------------------------  --------------------------
    Fl. Pt. Addition                     180
    Fl. Pt. Multiplication               200
    Fl. Pt. Division                      70
    Dot Product                           56
    Cosine                                65
    Exponential                           60
    Square Root                           83

Table 2: Performance of NCAR CM-2 single precision floating-point operations.
1.2 The MasPar MP-2
Manufactured by the MasPar Computer Corporation, the MasPar MP-1 and
MP-2 are two other examples of DM-SIMD multiprocessors. Introduced in
1990, the MasPar MP-1 is the original MasPar machine. The newer MasPar
MP-2 was brought out in 1992. While the MP-2 is larger and faster than the
MP-1, the architectures of the two machines are similar.
1.2.1 Characteristics
Like the Thinking Machines CM, the MasPar MP-2 is broken up into two
main parts: a front end or host machine, and a processor array containing
the SIMD processors.

The front end of this machine may be a VAX or a DECstation 5000.
Both types of host machines run ULTRIX. The host machine permits
communication of the SIMD processor array with the user. As with the
CM-2, scalar processing is done on the front end.

The Data Parallel Unit (DPU) performs all the parallel computation and
corresponds to the CM PPU; it is comprised of two parts. The first is the
array of Processor Elements (PEs), incorporating 2^10 (1024) to 2^14 (16384)
processors. Unlike the one-bit processors of the CM, a MasPar MP-2 processor
(or PE) is a CPU capable of handling 32-bit values. Each of these
processors contains 64 32-bit registers as well as an ALU and its own local
memory. Thirty-two of the processors fit on a single chip in the PE array. The
older MasPar MP-1 contains only 4-bit processors but also fits 32 processors
onto one chip.

The other part of the DPU is the Array Control Unit (ACU).
This is a processor in itself, with its own memory and instructions. The function of
the ACU is to manage the operations of the PEs; it plays a similar role to
that of the sequencers in the CM. An instruction is sent out by the ACU to
all the processors in the PE array at the same time, and all the processors
that are active simultaneously execute that instruction with their own data.
The ACU is also the contact with the front end host machine.

Figure 7: In a MasPar MP-2, each processing element (PE) is connected by the
X-net to neighboring PEs along the north-south, east-west, northwest-southeast,
and southwest-northeast axes.
Communication between the PEs is handled by the DPU in two ways.
Efficient nearest-neighbor communication is done by the X-net; this is similar
to the NEWS method of communication on the CM but goes in eight
directions, as shown in figure 7. X-net communication should only be used
for data communications in a regular pattern. The MasPar MP-2 processors
are also connected by the Global Router. Like the CM router communication,
communication via the Global Router is slower but permits message passing
between any two processors in the DPU.

As with the CM architecture, there exist methods that allow data on
disks to be read or written directly to or from the DPU. A framebuffer
connection to the DPU is also present and permits animation on graphical
displays generated directly from the DPU.

Two main programming languages are available for the MasPar MP-2:
MPF (MasPar Fortran) and MPL (MasPar Programming Language). The
    MasPar     Procs    R_max    N_max    N_1/2    R_peak
    MP-2216    16384     1.6     11264     1920     2.4
    MP-1216    16384      .473   11264     1280      .55
    MP-1       16384      .44     5504     1180      .58
    MP-2204     4096      .374    5632      896      .60
    MP-1204     4096      .116    5632      640      .138
    MP-2201     1024      .092    2816      448      .15
    MP-1201     1024      .029    2816      320      .034
    TMC CM-2    2048    10.4     33920    14000    28.

Table 3: Performance of the MasPar MP-1 and MP-2, from [Dongarra 94]. R_max
and R_peak are in Gflops. The MasPar figures are for 64-bit numbers, while the
Thinking Machines CM-2 values are for 32-bit numbers.
first of these, MPF, is Fortran 90 with some extensions for the MasPar architecture;
it includes special functions for passing data between the front
end (host) machine and the DPU. The second language, MPL, is a version
of C with parallel extensions; it is a low-level language and requires the programmer
to have a good understanding of the machine architecture. The
MasPar VAST-2 preprocessor allows the conversion of Fortran 77 programs
to MPF programs for this machine. In addition, a programming environment
called MPPE (MasPar Programming Environment) is installed on the
host machine, providing extra interactive support for the user to write, test,
debug, and monitor parallel programs. Software libraries for mathematical
computation, data display, and image processing are also available.

For more information on the MasPar SIMD computers, refer to [MasPar 92].
1.2.2 Performance
Table 3 compares the performance of various models of the
MasPar MP-1 and MasPar MP-2 with differing numbers of processors. These
figures are for double precision floating-point operations and are taken from
[Dongarra 94]. Also included for comparison are Thinking Machines CM-2
figures for single precision floating-point operations. In this table, R_max
represents the best performance achieved for a problem of size N_max;
N_1/2 provides the size of the problem that executes at half the R_max
performance; and R_peak is the theoretical peak performance of a machine
with the given number of processors (Procs).
2 Programming issues
Parallel programming is usually a greater challenge than programming for a
sequential machine. In this section, we consider some of the problems and
solutions associated with programming a SIMD multiprocessor. We also
discuss some of the Fortran 90 features that are applicable to this type of
architecture.

This discussion focuses on the CM-2 as a sample architecture, using CM
Fortran. Most of the concepts can easily be applied to the MasPar MP-2 and
other DM-SIMD architectures. In particular, CM Fortran was a forerunner of
Fortran 90 and contains most of the data-parallel constructs present in that
language. Some additional constructs exist as well; those in that category
are so indicated below. Some MPF constructs are also discussed. With a few
exceptions such as reduction operations, data-parallel operations behave in
an elementwise fashion: each SIMD processor acts only on the array elements
contained in its own memory.
2.1 Architectural organization considerations
When programming for the CM-2, the MP-2, or other SIMD architectures,
it is best to think of the machine in the two parts shown in figure 2:

1. the front end, to handle all the scalar operations; and

2. the array of processors, to execute all parallel operations.

For usual scientific computing applications, the front end (FE) is simply
a sequential UNIX computer or workstation.[5] This is the machine to which
the user logs in, the machine that compiles programs for the CM, and the
machine that controls the execution of programs on the CM. All program
variables used only in a sequential or scalar fashion are stored on the front
end. The parallel processing unit (PPU) with its SIMD architecture handles
all parallel or array operations, each processor taking care of one piece in
unison with the other processors.

[5] Some installations use Symbolics Lisp machines as the FE to the CM.

Since the editing and compiling of programs are done on the FE, programming
on the CM is much like programming on any UNIX machine.
In fact, if a program does not use the CM parallel constructs, it executes entirely
on the FE, completely ignoring the PPU.

On the MasPar MP-2, the FE usually starts the program and sets up
the initial data arrays; it also completes the program and collects the final
results. Since communication between the FE and the DPU is expensive, it
is best to confine most of the execution of a program to the DPU.
2.1.1 Homes
All program variables in CM Fortran programs are assigned a home. This is
simply where the variable is stored. Since scalar variables can only be on the
FE, that is their home. However, arrays may be stored on the FE or on the
PPU, depending on whether or not they are used in parallel operations. If
they are used in both serial and parallel operations, they are stored on the
PPU and copied to and from the FE for the serial operations. The exception
to the rules above is for arrays of type CHARACTER; these are always stored
on the FE.

To see where homes have been assigned for your variables, check the
last part of your program listing. The two sections VARIABLES and ARRAYS
provide the name, type, and size of the scalar variables and the arrays used in
your program. The ARRAYS section also lists the Home for each array. Under
this heading, the term CM refers to the PPU and FE to the FE. It is wise to
double-check this part of the program listing to make sure the arrays have
been assigned as expected. A portion of a sample listing showing the ARRAYS
section is given in figure 8.

    ARRAYS
    Offset    Size    Type      Block/Class    Home    Name
    0         2048    REAL*4    local          CM      A
    2048      2048    REAL*4    local          CM      B
    4096      2048    REAL*4    local          CM      C

Figure 8: ARRAYS section of a CM Fortran listing.
Homes for variables in every program unit are assigned individually. In
other words, the array Z in SUBROUTINE MYSUB may not be assigned the same
home as Z in FUNCTION MYFTN. Each program module is treated as a unit.
Homes of actual and dummy arguments must match. If you pass an array
to a function or subroutine, the dummy array argument must have the same
home as the incoming array parameter. Otherwise, unpredictable results are
possible. There are a number of ways to force arrays to be assigned homes
on the PPU, as illustrated in the sketch after this list:

- Declare the array in COMMON, as all COMMON arrays are placed on the
  PPU;

- Put parallel operations in every program module for the appropriate
  arrays, even if they do nothing useful;

- Use the LAYOUT compiler directive:

      CMF$ LAYOUT Z(:NEWS)

  With :NEWS as an argument, Z MUST be in the PPU. Other possible
  arguments are :SERIAL and :SEND. More on the LAYOUT compiler
  directive can be found in section 2.4.1.
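For example, the following minimal CM Fortran sketch (the routine and array
names are our own; see the CM Fortran manuals for the precise directive
rules) forces a dummy array argument onto the PPU so that the whole-array
assignment runs there in parallel:

          SUBROUTINE DOUBLE(Z, N)
          INTEGER N
          REAL Z(N)
    CMF$  LAYOUT Z(:NEWS)
    C     Z now has its home on the PPU; the whole-array
    C     assignment below therefore executes there in parallel.
          Z = 2.0 * Z
          RETURN
          END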
2.2 CM Fortran, MPF, and Fortran 90
The version of Fortran used on the CM is called CM Fortran. It is based on
Fortran 77 and extended by parallel constructs; most of these constructs are
contained in a subset of Fortran 90. On the MasPar MP-2, MPF (MasPar
Fortran) is a version of Fortran 90 with extensions; many of the constructs
are the same as those in CM Fortran.

On a CM, it is important to know that the control flow of a CM Fortran
program is handled by the FE, as are all scalar statements like those found
in Fortran 77. All data-parallel statements, including Fortran 90 statements,
are executed on the PPU.

The following subsections introduce a few elements of CM Fortran, MPF,
and Fortran 90 to get you started. For further reference, see the CM manuals
[TMC 91a], [TMC 91b], and [TMC 91c], the MasPar manuals [MasPar 93b]
and [MasPar 93a], and some Fortran 90 references such as [Brainerd et al 90]
and [Adams et al 92].
2.2.1 Arrays
An array on a SIMD multiprocessor may be considered a data-parallel object;
this is true for CM Fortran arrays as well. In fact, the only CM Fortran
variables stored on the PPU are those arrays used in parallel operations; all
scalars and all arrays not involved with parallel operations are stored on the
FE of the CM.

The properties of an array are rank and shape. The rank[6] of an array is
the number of its dimensions; e.g., the array declared as S(5,10) has rank
2. The shape of an array is its dimensions; so the shape of S(5,10) is 5 x 10.
Two arrays with the same shape are said to be conformable. Most parallel
operations require that the arrays involved be conformable.

[6] In this context, rank has a different meaning than its customary mathematical definition.

Once an array has been declared, the use of the name of the array by
itself (not subscripted) denotes the entire array with all its elements. Such
usage implies a parallel operation is to be applied to the array. For instance,
the statement

    S = 0.0

sets all the elements of S to zero in parallel on the PPU.
Subsections of an array can be specified by triples. The general form of
a triple is[7]

    firstvalue : lastvalue : increment

For example, if S is declared as above, then S(1:5:2,1:10) refers to the odd
rows of S. The triple 1:5:2 specifies that rows 1, 1 + 2 = 3, and 3 + 2 = 5 are
to be used; the triple 1:10 has an implied increment of 1 and so specifies all
ten columns. This second triple 1:10 could have been replaced by a single
colon, as in S(1:5:2,:), to imply that all the columns be used for the chosen
rows.

[7] This differs from the triple form used by MATLAB: firstvalue : increment : lastvalue.
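As a quick, self-contained illustration of triples (our own Fortran 90 example),
the following program zeroes exactly the odd rows discussed above:

    PROGRAM triples
      REAL :: S(5,10)
      S = 1.0              ! whole-array assignment: all 50 elements
      S(1:5:2,:) = 0.0     ! rows 1, 3, and 5 only, via the triple 1:5:2
      PRINT *, SUM(S)      ! prints 20.0: two untouched rows of ten ones
    END PROGRAM triples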
As in Fortran 77 and Fortran 90, CM Fortran arrays may be declared by
DIMENSION, COMMON, or type statements. In CM Fortran, they may also be
declared using array attribute statements. For example, assuming the array
S is a real array, it could have been defined by the following array attribute
statement:
    REAL, ARRAY(5,10) :: S, T

In Fortran 90, the following statement would have the same effect:

    REAL, DIMENSION(5,10) :: S, T

This simply says that both S and T are real arrays with 5 rows and 10
columns. Notice the comma after the type indicator REAL; this is how the
array attribute statement is recognized by the compiler. The double colon ::
is also a requirement; it must be placed between the array definition and the
array names. You should recall that blank spaces in Fortran are traditionally
ignored; hence any number of spaces can be added to this statement (even
between the colons) or deleted from the statement.
Array constructors can be used to initialize the elements of an array in
parallel. For instance, if Z has been declared in CM Fortran by the statement

    REAL, ARRAY(N) :: Z

then the statement

    Z = REAL([1:N])

assigns 1.0 to Z(1), 2.0 to Z(2), and REAL(N) to Z(N). The corresponding
Fortran 90 statements follow:

    REAL, DIMENSION(N) :: Z

and

    Z = (/ (REAL(I), I=1,N) /)

Array constructors can also be included in array attribute statements. In
CM Fortran, this is done by adding a DATA parameter to the statement; thus
the following CM Fortran statement has the effect of defining and initializing
Z at once:

    REAL, ARRAY(N), DATA :: Z = [1:N]

The same effect can be achieved in Fortran 90 by the following statement:

    REAL, DIMENSION(N) :: Z = (/ (REAL(I), I=1,N) /)

This is an efficient way of assigning initial values to an array, as it is done
at load time. A limitation on the array constructor in CM Fortran is that it
can only be used for one-dimensional arrays.
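A short, self-contained Fortran 90 example of ours combining the declaration
and constructor styles above (here N is fixed so the program compiles on its
own):

    PROGRAM construct
      INTEGER, PARAMETER :: N = 6
      INTEGER :: I
      REAL :: Z(N)
      Z = (/ (REAL(I), I = 1, N) /)   ! Z becomes 1.0, 2.0, ..., 6.0
      PRINT *, Z
    END PROGRAM construct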
2.2.2 Array sections
Most of the parallel array facilities of Fortran 90 are part of CM Fortran.
As mentioned in section 2.2.1, the ability to work with all the elements of
an array or subsections of an array in parallel is provided; however, in CM
Fortran, all of the arrays involved in this type of parallel expression must be
parallel arrays with homes on the PPU.

Using the name of an array implies using the entire array in parallel. For
instance, the statement

    Y = Z**2

causes Y(1) to be set to the value of Z(1)**2, Y(2) to Z(2)**2, etc. This
also works for constant assignments; the statement

    Y = -1.0

means that all the elements of Y are set to -1.0.

Subsections of arrays can be used in assignment statements as well. In
the following statement

    Y(1:10) = Z(11:20)
the first ten elements of Y are set to the second ten elements of Z. Such
statements execute in parallel.

An example of a subsection of a two-dimensional array is shown in figure 9.
Here the array M has been declared as

    REAL M(12,12)

and the 3 x 6 subsection is defined by

    M(4:6,5:10)

Figure 9: The subsection M(4:6,5:10) of array M(12,12).
2.2.3 Alternate DO loops
Additional control constructs exist in CM Fortran; these are similar to those
in Fortran 90. The first of these are alternate forms of DO loops, as demonstrated
below:

    N = 4096
    DO WHILE (N .GT. 0)
       Z(1:N) = ...
       N = N/2
    ENDDO

    KK = 1
    DO N TIMES
       KK = KK*K
    ENDDO

The first of these loops assigns values to the first N elements of the array Z
for N equal to decreasing powers of two. The second loop terminates when KK
is equal to K**N, where N is a non-negative integer. This form of the DO loop is
useful when the loop index is not needed within the body of the loop. Note
that the DO WHILE construct is a legal Fortran 90 construct; the second form
of the DO loop, DO N TIMES, is not.
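Since DO N TIMES is a CM Fortran extension, portable programs must spell
the repetition out. A standard Fortran 90 equivalent of the second loop (our
sketch; the index I is simply unused in the body):

    KK = 1
    DO I = 1, N
       KK = KK*K      ! after the loop, KK equals K**N
    ENDDO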
2.2.4 WHERE statements
The WHERE statements provide a means for working with a subset of a full
array, still as a parallel operation:

    WHERE (Z .GT. 0.0) Y = SQRT(Z)
Here all the CM processors actually compute Y = SQRT(Z). However, only
those processors with a value of Z greater than zero store the result. The
intrinsic function SQRT is used on the entire array. If we wished to set the
other (negative and zero) elements to zero at the same time, this operation
could be programmed as follows:

    WHERE (Z .GT. 0.0)
       Y = SQRT(Z)
    ELSEWHERE
       Y = 0.0
    ENDWHERE

In this set of statements, the elements of Y are set to zero when the corresponding
element of Z is not greater than zero. Note that this construct acts
in two steps. First, all the processors compute Y = SQRT(Z), but only the
values for elements of Y corresponding to positive elements of Z are stored.
Then all the processors compute Y = 0.0, but only the values corresponding to
zero or negative elements of Z are stored. In other words, the construct appears
similar to the IF..THEN..ELSE..ENDIF statement but behaves a little
differently.
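The construct is easy to try in standard Fortran 90, where WHERE has the
same form. A small self-contained example of ours, using the masked square
root above:

    PROGRAM wheredemo
      REAL, DIMENSION(6) :: Z = (/ -1.0, 4.0, 9.0, -16.0, 25.0, 0.0 /)
      REAL, DIMENSION(6) :: Y
      WHERE (Z .GT. 0.0)
         Y = SQRT(Z)       ! stored only where Z is positive
      ELSEWHERE
         Y = 0.0           ! stored where Z is zero or negative
      ENDWHERE
      PRINT *, Y           ! prints 0.0 2.0 3.0 0.0 5.0 0.0
    END PROGRAM wheredemo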
2.2.5 FORALL statements
The FORALL construct is not a part of Fortran 90; however, it is included in
both CM Fortran and MPF. Such statements are very convenient for SIMD
computers.

FORALL statements, as in

    FORALL (I=1:N) Y(I) = I

can only contain one assignment. This statement is equivalent to the following
DO loop:

    DO I = 1,N
       Y(I) = I
    ENDDO

Notice that there is a colon between the start and stop values of the FORALL
index; this is in a triple format like the subscripts discussed earlier. An
increment may be used as well after another colon, as the third element of
the triple.
More than one index can be used within the FORALL statement; for instance,
the following statement

    FORALL (I=1:N, J=1:N) S(I,J) = I

sets all the elements of the Ith row of S to I.

Often individual elements of an array need to be initialized to specific
values. If the home of the array is on the PPU, it is best to use a FORALL
statement for this purpose. For instance, the statement

    S(6,1) = S(6,2) + S(6,3)

is executed on the FE since it is essentially a scalar operation. However, if
we rewrite this statement as a FORALL statement,

    FORALL (I=6:6, J=1:1) S(I,J) = S(I,J+1) + S(I,J+2)

it is executed on the PPU. In effect, the sum of S(I,J+1) and S(I,J+2)
for each I and J is computed for all the elements in the array, but only the
processor containing the element in the first column of the sixth row of S
stores this value into its array element.

A FORALL statement with dependencies that cannot be resolved is executed
serially. Check for restrictions on the FORALL statement in the appropriate
manual.
2.3 Built-in functions for CM Fortran and Fortran 90
The CM Fortran intrinsic functions are for the most part the same as the
intrinsic functions described for Fortran 90. Additional functions for handling
data-parallel data types are also present in both CM Fortran and Fortran 90.
Some of these are described below.

To aid in the explanation of these built-in functions, assume the following
arrays have the values given below:

    A = ( 1  2 )
        ( 3  4 )
        ( 5  6 )

    B = ( 2  4  5 )
        ( 3  8  5 )

    C = ( 1, -2, 3, -4, 5, -6 )
2.3.1 Intrinsic functions
The usual Fortran intrinsic functions are available in CM Fortran and Fortran
90. Moreover, most of them can be used in a parallel fashion. For instance, if
A has been declared as above, then

    MOD(A,5)

returns a matrix of the same type and shape as A containing the values of
the elements of A mod 5:

    MOD(A,5) = ( 1  2 )
               ( 3  4 )
               ( 0  1 )

Similarly, the SQRT function can handle a whole array at once.

    SQRT(A) = ( 1.000  1.414 )
              ( 1.732  2.000 )
              ( 2.236  2.449 )
2.3.2 Masks
Masks are logical arrays created by performing a relational operation on all
the elements of a given array. Both the mask and the original array must
conform. Consider the following examples.

    A .GT. 0 = ( T  T )
               ( T  T )
               ( T  T )

    B .EQ. 5 = ( F  F  T )
               ( F  F  T )

    C .LT. 0 = ( F, T, F, T, F, T )
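Masks can be stored in LOGICAL arrays and reused, for example with the
WHERE construct or with the reduction functions of the next subsection. A
small Fortran 90 sketch of ours, using the array C above:

    PROGRAM maskdemo
      REAL, DIMENSION(6) :: C = (/ 1.0, -2.0, 3.0, -4.0, 5.0, -6.0 /)
      LOGICAL, DIMENSION(6) :: NEG
      NEG = C .LT. 0.0          ! the mask (F, T, F, T, F, T)
      PRINT *, COUNT(NEG)       ! 3 negative elements
      WHERE (NEG) C = 0.0       ! zero out the negative elements
      PRINT *, C                ! 1.0 0.0 3.0 0.0 5.0 0.0
    END PROGRAM maskdemo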
2.3.3 Special functions
In addition to the normal Fortran intrinsic functions, CM Fortran provides
several special functions to aid in the parallel operation of the machine. For
the examples of the functions described below, assume the following: ARRAY
is the name of any array of type real, integer, or logical; DIM is an integer
denoting which particular dimension of the array (if any) the function is to be
applied to; MASK is a logical array with the same shape as ARRAY telling which
particular elements the function is to use; V, V1, and V2 are one-dimensional
arrays (or vectors); M1 and M2 are two-dimensional arrays (or matrices); and
SHIFT is an integer or integer array describing the shift to be made. For some
of the following functions, DIM and MASK may be used as keyword parameters.

Reduction Operations: The following functions perform commonly used
reduction operations. Except where noted, they are common to both CM
Fortran and Fortran 90.
SUM(ARRAY [, DIM] [, MASK]): This function computes the sum of
all the elements of ARRAY, according to the values of DIM and MASK.
Note: in the last example, MASK is used as a keyword parameter, since
the second parameter DIM is missing.

    SUM(A) = 21
    SUM(B, 1) = (5, 12, 10)
    SUM(B, 2) = (11, 16)
    SUM(C, MASK=C.GT.0) = 9
PRODUCT(ARRAY [, DIM] [, MASK]): This function computes the
product of all the elements of ARRAY, according to the values of DIM
and MASK. Note: in the last example, MASK is used as a keyword parameter,
since the second parameter DIM is missing.

    PRODUCT(A) = 720
    PRODUCT(B, 1) = (6, 32, 25)
    PRODUCT(B, 2) = (40, 120)
    PRODUCT(C, MASK=C.GT.0) = 15
DOTPRODUCT(V1, V2): This function computes the dot product of the
two vectors (one-dimensional arrays) V1 and V2.

    DOTPRODUCT(A(1,:), B(:,2)) = 20
    DOTPRODUCT(A(:,1), B(2,:)) = 52
    DOTPRODUCT(C, C) = 91
MAXVAL(ARRAY [, DIM] [, MASK]): This function finds the maximum
value of all the elements of ARRAY, according to the values of
DIM and MASK.

    MAXVAL(A) = 6
    MAXVAL(B(:,1)) = 3
    MAXVAL(C) = 5
    MAXVAL(C, 1, C.LT.0) = -2
MINVAL(ARRAY [, DIM] [, MASK]): This function finds the minimum
value of all the elements of ARRAY, according to the values of
DIM and MASK.

    MINVAL(A) = 1
    MINVAL(B(:,1)) = 2
    MINVAL(C) = -6
    MINVAL(C, 1, C.GT.0) = 1
MAXLOC(ARRAY [, MASK]): This function returns an integer value or
integer array representing the subscripts of the maximum value of all
the elements of ARRAY, according to the values of MASK. If more than one
such location exists, which subscript is returned is non-deterministic.

    MAXLOC(A) = (3, 2)
    MAXLOC(B(:,3)) = (1)   (could also be 2)
    MAXLOC(C) = (5)
    MAXLOC(C, C.LT.0) = (2)
MINLOC(ARRAY [, MASK]): This function returns an integer value or
integer array representing the subscripts of the minimum value of all
the elements of ARRAY, according to the values of MASK. If more than one
such location exists, which subscript is returned is non-deterministic.

    MINLOC(A) = (1, 1)
    MINLOC(B(:,3)) = (1)   (could also be 2)
    MINLOC(C) = (6)
    MINLOC(C, C.GT.0) = (1)
COUNT(MASK [, DIM]): This function returns the number of elements
for which the MASK holds true.

    COUNT(A.GT.0) = 6
    COUNT(A.GT.0, 1) = (3, 3)
    COUNT(B.EQ.5) = 2
    COUNT(C.LE.0) = 3
ANY(MASK [, DIM]): This function returns True if the MASK holds true
for any of the elements.

    ANY(A.GT.0) = T
    ANY(A.GT.0, 1) = (T, T)
    ANY(B.EQ.5) = T
    ANY(C.LE.0) = T
ALL(MASK [, DIM]): This function returns True if the MASK holds true
for all of the elements.

    ALL(A.GT.0) = T
    ALL(A.GT.0, 1) = (T, T)
    ALL(B.EQ.5) = F
    ALL(C.LE.0) = F
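The reductions above that survive into standard Fortran 90 can be checked
directly. A short self-contained program of ours (note that standard Fortran
90 spells DOTPRODUCT as DOT_PRODUCT):

    PROGRAM reductions
      REAL, DIMENSION(3,2) :: A = RESHAPE((/ 1., 3., 5., 2., 4., 6. /), (/ 3, 2 /))
      REAL, DIMENSION(6)   :: C = (/ 1., -2., 3., -4., 5., -6. /)
      PRINT *, SUM(A)                   ! 21.0
      PRINT *, SUM(C, MASK=C.GT.0.)     ! 9.0
      PRINT *, MAXVAL(A), MINVAL(C)     ! 6.0  -6.0
      PRINT *, MAXLOC(C)                ! 5
      PRINT *, DOT_PRODUCT(C, C)        ! 91.0
    END PROGRAM reductions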
Functions for matrices: The following built-in functions in CM Fortran
and Fortran 90 are used for manipulating matrices:
TRANSPOSE(M1): This function returns the transpose of the matrix
(two-dimensional array) M1.

    TRANSPOSE(A) = ( 1  3  5 )
                   ( 2  4  6 )
MATMUL(M1, M2): This function returns the result of the matrix multiplication
of M1 by M2. Note that the expression M1*M2 does not perform
matrix multiplication. Instead, it produces element-by-element
multiplication; that is, for all i,j: (M1*M2)(i,j) = M1(i,j) * M2(i,j).
    MATMUL(A, B) = (  8  20  15 )
                   ( 18  44  35 )
                   ( 28  68  55 )

    MATMUL(B, A) = ( 39  50 )
                   ( 52  68 )
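The distinction between MATMUL and * is easy to demonstrate in standard
Fortran 90. A small example of ours (the element-by-element product requires
conformable shapes, so we square A elementwise):

    PROGRAM matdemo
      REAL, DIMENSION(3,2) :: A = RESHAPE((/ 1., 3., 5., 2., 4., 6. /), (/ 3, 2 /))
      REAL, DIMENSION(2,3) :: B = RESHAPE((/ 2., 3., 4., 8., 5., 5. /), (/ 2, 3 /))
      PRINT *, MATMUL(A, B)    ! true 3x3 matrix product
      PRINT *, A * A           ! elementwise: each A(i,j) squared
    END PROGRAM matdemo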
DIAGONAL(ARRAY [, FILL]): This function is only in CM Fortran.
It creates a diagonal matrix from the vector ARRAY. The elements of
the vector are placed on the diagonal, and the value of FILL (if any) is
placed in the other elements of the matrix. If there is no FILL value,
the value of 0 (or .FALSE., if logical) is used.

    DIAGONAL(C) = (  1   0   0   0   0   0 )
                  (  0  -2   0   0   0   0 )
                  (  0   0   3   0   0   0 )
                  (  0   0   0  -4   0   0 )
                  (  0   0   0   0   5   0 )
                  (  0   0   0   0   0  -6 )

    DIAGONAL(C, 99) = (  1  99  99  99  99  99 )
                      ( 99  -2  99  99  99  99 )
                      ( 99  99   3  99  99  99 )
                      ( 99  99  99  -4  99  99 )
                      ( 99  99  99  99   5  99 )
                      ( 99  99  99  99  99  -6 )
Other useful functions: In addition to the reduction operations listed
above, both CM Fortran and Fortran 90 contain other functions directed
toward handling data-parallel data objects. Some of these are given here:

RANK(ARRAY): This CM Fortran function returns the rank of the given
scalar or ARRAY. A similar Fortran 90 function is named SIZE.
    RANK(100) = 0
    RANK(A) = 2
    RANK(B) = 2
    RANK(C) = 1

DSHAPE(ARRAY): This CM Fortran function returns the shape of the
given scalar or ARRAY. In Fortran 90, this function is named SHAPE.

    DSHAPE(-1) = ( )
    DSHAPE(C) = (6)
    DSHAPE(A) = (3, 2)
    DSHAPE(B) = (2, 3)
REPLICATE(ARRAY, DIM, NCOPIES): This CM Fortran function adds
NCOPIES of the ARRAY along the given DIMension. The resultant array
has the same rank as the original ARRAY, but the shape is greater in
the given DIMension.

    REPLICATE(A, 1, 2) = ( 1  2 )
                         ( 3  4 )
                         ( 5  6 )
                         ( 1  2 )
                         ( 3  4 )
                         ( 5  6 )

    REPLICATE(A, 2, 3) = ( 1  2  1  2  1  2 )
                         ( 3  4  3  4  3  4 )
                         ( 5  6  5  6  5  6 )

    REPLICATE(A, 1, 0) = ( )

    REPLICATE(C, 1, 2) = ( 1, -2, 3, -4, 5, -6, 1, -2, 3, -4, 5, -6 )
SPREAD(ARRAY, DIM, NCOPIES): This function is in both CM Fortran
and Fortran 90. It produces NCOPIES of the ARRAY along DIM. The
resultant array has rank one greater than that of the original ARRAY.
This can also be used to make a vector from a scalar.

    SPREAD(-1, 1, 6) = ( -1, -1, -1, -1, -1, -1 )

    SPREAD(-1, 1, 0) = ( )

    SPREAD(A, 1, 2) is a 2 x 3 x 2 array holding two copies of A:

        ( 1  2 )        ( 1  2 )
        ( 3  4 )  and   ( 3  4 )
        ( 5  6 )        ( 5  6 )

    SPREAD(A, 2, 3) is a 3 x 3 x 2 array holding three copies of A:

        ( 1  2 )   ( 1  2 )   ( 1  2 )
        ( 3  4 )   ( 3  4 )   ( 3  4 )
        ( 5  6 )   ( 5  6 )   ( 5  6 )

    SPREAD(C, 1, 2) = ( 1  -2  3  -4  5  -6 )
                      ( 1  -2  3  -4  5  -6 )
    SPREAD(C, 2, 3) = (  1   1   1 )
                      ( -2  -2  -2 )
                      (  3   3   3 )
                      ( -4  -4  -4 )
                      (  5   5   5 )
                      ( -6  -6  -6 )