This paper appeared in: Proc. of the Workshop on Heterogeneous Processing, held with the International Parallel Processing Symposium, Newport Beach, CA, April.

Triton: A Massively-Parallel Mixed-Mode Computer Designed to Support High-Level Languages

Christian G. Herter, Thomas M. Warschko, Walter F. Tichy, and Michael Philippsen
University of Karlsruhe, Dept. of Informatics, Karlsruhe, Germany

Abstract

We present the architecture of Triton, a scalable mixed-mode (SIMD/MIMD) parallel computer. The novel features of Triton are:

- Support for high-level, machine-independent programming languages;
- Fast SIMD/MIMD mode switching;
- Special hardware for barrier synchronization of multiple groups;
- A self-routing, deadlock-free perfect-shuffle interconnect with latency hiding.

The architecture is the outcome of an integrated design in which a machine-independent programming language, an optimizing compiler, and a parallel computer were designed hand in hand.

Introduction

The goal of adequate programmability of parallel machines can best be achieved by tightly coupling the design of machine-independent programming languages, compilers, and parallel hardware. In the past, the development of parallel computers has been driven mainly by hardware considerations, without regard to the law of the weakest link: the overall performance of a parallel computer is determined by the performance of its slowest part with respect to the requirements of the software. Ignoring software requirements has resulted in unsatisfactory performance of these machines on machine-independent parallel programs. To avoid these shortcomings, and to show that high-level parallel programming does not necessarily lead to poor performance, we specifically analyzed the requirements of programming languages and compilers before designing the hardware.

The following section outlines the parallel programming language Modula-2* and derives general requirements for parallel computers. The remaining sections describe the architecture of Triton; we emphasize those features which arose from software requirements.

Modula-2*

Modula-2* (pronounced Modula-2-star) is a small extension of Modula-2 for massively parallel programming. The programming model of Modula-2* incorporates both data and control parallelism and allows mixed synchronous and asynchronous execution. Modula-2* is problem-oriented in the sense that the programmer can choose the degree of parallelism and mix the control mode (SIMD- or MIMD-like) as needed by the intended algorithm. Parallelism may be nested to arbitrary depth. Procedures may be called from sequential or parallel contexts and can themselves generate parallel activity without any restrictions. Most Modula-2* programs can be translated into efficient code for both SIMD and MIMD architectures.

Overview of language extensions

Modula-2* extends Modula-2 with the following two language constructs:

- Parallelism can be created in Modula-2* programs only by means of the FORALL statement. There are two versions of this statement, a synchronous and an asynchronous one.
- The distribution of array data may be optionally specified by so-called allocators in a machine-independent way. Allocators do not have any semantic meaning; they are hints for the compiler.

Because of the compactness and simplicity of the extensions, they could easily be incorporated in other imperative programming languages such as Fortran, C, or Ada.

A synchronous FORALL statement creates a set of synchronously running processes. The number of processes is determined by the FORALL statement and is not limited to the number of PEs of the machine. As long as there is no branch in the control flow of the statements inside a FORALL statement, the semantics of the execution is equivalent to a SIMD machine executing those statements. If there are branches in the control flow that define alternatives, like IF-THEN-ELSE or CASE, the processes are partitioned into several groups. Each of those groups executes one branch of the control flow. The processes belonging to one group execute synchronously in SIMD style, but groups are allowed to execute concurrently with respect to each other. There is no assumption about the relative speeds of two processes in different groups.

In contrast to the synchronous FORALL statement, an asynchronous FORALL statement creates a set of independent processes running at an unspecified relative speed. Common to both variants is that the termination of a FORALL statement is determined by the termination of the last process created by the FORALL statement. The end of a FORALL statement therefore always defines a synchronization barrier. For further details about the language see the Modula-2* report by Tichy and Herter; compilation techniques and optimizations are discussed in detail by Philippsen, Tichy, and Herter (see the references).

Software requirements for parallel computers

On massively parallel machines, the distribution of array data over the available processors is a central problem. Two conflicting goals, data locality and maximum degree of parallelism, must be reconciled. Data locality means that data elements should be stored local to the processors that need them, to minimize communication costs. Perfect data locality could be achieved by employing a single processor. Parallelism, which reduces runtime, may unfortunately reduce locality and increase communication costs. Additional goals for data distribution are exploiting special communication patterns supported by hardware and generating simple address calculations, to prevent addressing from becoming a dominant cost.

Even with optimal layout of the data, there will still be communication in the general case. In fact, in most massively parallel applications that are not trivially parallel, communication is almost as frequent as computation. Therefore,

  the network MIPS measure must approach the CPU MIPS measure.

A second performance-oriented recommendation is an

  independently operating network with asynchronous message delivery,

since it allows the delivery of packets concurrently with computation. That would enable the compiler to interleave computation and communication and thus to hide some of the communication latency.
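As a rough illustration of the latency hiding this recommendation enables, the sketch below (Python; the background thread merely stands in for an independently operating network, and all names are illustrative) starts a transfer, keeps computing while the packet is in flight, and waits only when the transfer must have completed.

    import threading, time

    def network_send(destination, payload, done):
        time.sleep(0.01)          # stand-in for network transfer latency
        done.set()                # delivery completes without CPU involvement

    done = threading.Event()
    threading.Thread(target=network_send, args=(7, b"data block", done)).start()

    partial = sum(i * i for i in range(100_000))   # useful work overlaps the send
    done.wait()                                    # synchronize only when required
    print("overlapped computation done:", partial)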

On the other hand, we recommend a

  shared address space.

A shared address space does not imply shared memory; it only means that every processor can generate addresses for the entire memory in the system. System-wide addresses are especially important for pointers, because otherwise they would have to be simulated quite inefficiently in software. Even the memory of the control processor, e.g. the frontend of a SIMD machine, should be part of the shared address space. Furthermore, and similar to the above, we call for

  uniform memory access instructions.

Many parallel machines today provide a set of instructions for accessing local memory, a second one for accessing memory in neighbors, and a third set for accessing distant memory units. The differences in speed are significant and therefore require that the compiler detect the faster cases. However, it is often impossible to know statically for which case to optimize. For instance, we found that in many cases it was impossible to determine in the compiler whether a procedure would access local or non-local memory. The generated code thus has to check all three cases at runtime. Such simple, frequent, and dynamic analyses could be done more efficiently in hardware.
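The runtime check described above amounts to a small dispatch on every access whose target cannot be classified statically. The sketch below (Python; the function and its three cases are illustrative only and do not reflect Triton's actual instruction set) shows the kind of code a compiler is forced to emit when the hardware does not provide uniform access instructions.

    def load(global_addr, my_pe, neighbours, local_mem, remote_fetch):
        """Dispatch one read to the local, neighbour, or distant access path."""
        pe, offset = global_addr
        if pe == my_pe:                      # fast case: local memory
            return local_mem[offset]
        if pe in neighbours:                 # medium case: a neighbouring PE
            return remote_fetch(pe, offset, fast_path=True)
        return remote_fetch(pe, offset, fast_path=False)   # slow case: distant PE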

Barrier synchronization is extremely frequent. Basically, every communication requires a barrier synchronization. The reasoning is as follows: communication sends or receives data. Unless communication is redundant, there must be a write between two successive communication calls to the same cell. It follows that a synchronization operation must be placed somewhere between one of the communication calls and the write in order to avoid race conditions. If many processes communicate at once, as in massively parallel machines, this type of synchronization amounts to a barrier: no process may proceed until all processes have completed either their write or their communication operation. Because of its frequency, we need

  fast barrier synchronization.

In principle, the communication network could be used for barrier synchronization. However, communication networks usually have high latency, which makes them too slow for fast barrier synchronization. These nets are optimized for transporting data, while barrier synchronization requires the transport of only one or two bits, but must also implement a reduction or scan operation on these bits.

An additional complication is that there are usually several groups of processes that need to communicate among themselves, necessitating multiple non-overlapping barriers. Consider, for instance, a pipelined architecture in which one set of processes passes data to another set. Each set may have to synchronize internally, independent of the other. Similarly, an IF statement within a FORALL divides a set of processes into two subsets which may have to synchronize independently. Thus we need

  barrier synchronization for multiple independent sets of processes.
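The following sketch (Python; threading.Barrier stands in for the hardware barrier argued for above) shows the write/communicate pattern behind this reasoning: each process writes its own cell and then reads a neighbour's cell, and the barrier between the two phases is what prevents a read from racing ahead of the corresponding write.

    import threading

    N = 4
    cells = [None] * N
    barrier = threading.Barrier(N)

    def worker(i):
        cells[i] = i * i                 # write phase: update the local cell
        barrier.wait()                   # no process proceeds before all writes are done
        neighbour = cells[(i + 1) % N]   # communication phase: read a remote cell
        print(f"process {i} reads {neighbour}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
    for t in threads: t.start()
    for t in threads: t.join()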

Triton

The poor programmability of today's parallel machines is a consequence of the fact that the design of these machines has been driven mostly by hardware considerations. Programmability seems to have been a secondary issue, resulting in languages designed specifically for a particular machine model. Such languages do not satisfy the needs of programmers who need to write machine-independent applications.

Triton matches most of the recommendations of the previous section. From a general point of view, Triton is determined by the following statements.

General Architecture: Triton is a SAMD (synchronous/asynchronous instruction streams, multiple data streams) machine: it runs in SIMD mode where strict synchrony is necessary, and it can switch to MIMD mode where concurrent execution of different tasks is beneficial. It is even possible to run a subset of the processors in SIMD mode and the others in MIMD mode. Thus Triton is truly SAMD, i.e., mixed-mode, not just switched-mode. Only a few research prototypes of mixed-mode machines have been built: OPSILA, TRAC, and PASM. Triton provides support for switching rapidly between the two modes and a high-level language to control both modes effectively.

Fast barrier synchronization is supported by special hardware. The synchronization hardware can be used in both operating modes. Synchronization with hardware support overcomes the necessity of coarse-grained parallelism.

Network: We chose the De Bruijn network for Triton because it has several desirable properties (logarithmic diameter, fixed degree, etc.), is cost-effective to build, and can be made to operate extremely fast and reliably. Performance figures are presented in the section on the communications network below.

Scalability and balance: Parallel machines should scale in performance by varying the number of processors; furthermore, the performance of the individual components (processor, memory, network, and I/O) should harmonize. Scalability is mainly a property of the network. The most popular networks today, hypercubes and grids, do not scale well: hypercubes are too expensive because they have variable degree, while grids cause high latency because of their large diameter. Triton's De Bruijn net has none of these problems and scales well. It is also well matched to the speed of the processors.

I/O capabilities: I/O must also scale with the number of processors. Few parallel machines today provide for scalable I/O. Triton implements a massively parallel I/O architecture: one disk per processor. For large sets of disks we have extended the traditional notion of a file to what we call a vector file. Massively parallel I/O also provides the basis for research in parallel operating systems, such as virtual memory, parallel paging strategies, and true multi-user environments. Results in these areas are required to bring parallel machines into widespread and general-purpose use.

Architecture of Triton

Triton is divided into a frontend and a backend portion. The frontend typically consists of a UNIX workstation with a memory-mapped interface that connects via the instruction bus and the control bus to the backend portion. The backend portion consists of the processing elements, the network, and the I/O system.

The Triton prototype will be built with an Intel-based PC running BSD UNIX as frontend. The prototype will contain a fixed complement of PEs, most of them supplied with a disk; the majority of the PEs are provided for computation, and the remaining PEs serve as hot standbys. These standby PEs can be configured under software control into the network if other PEs fail. The reconfiguration involves changing the PE numbers consistently and recomputing the routing tables in the network processors. The disks are logically organized in groups, where each group contains several data disks and one parity disk; a parity-disk RAID organization is used for error handling.
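A sketch of the parity protection used within such a disk group (Python; the group size and block contents are illustrative, and the RAID-3-style layout is an assumption, since the exact RAID level is not reproduced in this copy): the parity disk stores the bytewise XOR of the data disks, so any single failed disk in the group can be rebuilt from the remaining disks.

    from functools import reduce

    def parity(blocks):
        """Bytewise XOR of equally sized blocks, one per disk in the group."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    data = [b"\x01\x02", b"\x04\x08", b"\x10\x20", b"\x40\x80"]   # four data disks
    p = parity(data)                                              # the parity disk

    # Rebuild disk 2 from the surviving data disks plus the parity disk.
    rebuilt = parity([data[0], data[1], data[3], p])
    assert rebuilt == data[2]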

Figure 1 gives an overview of the logical organization of Triton.

[Figure 1: Triton Architecture. The frontend (FE), attached to an Ethernet, drives the PEs over the instruction bus and the control bus; the PEs are interconnected by the network.]

In SIMD mode, the frontend produces the instruction stream and controls the backend portion at instruction level. In MIMD mode, the frontend is responsible for downloading the code and initiating the program. To decouple frontend and backend, i.e., to reduce the time the frontend spends waiting for the backend to become ready (or vice versa), the instruction stream is sent through a FIFO. The handshake signals necessary to control the instruction stream are part of the control bus.

For debugging it is a good idea to have direct access from the frontend to all parts of the machine, especially to the main memory distributed among the processing elements. As a direct consequence of Triton's common address space, this is possible via the so-called analyze mode. To support the analyze mode, the control bus includes address lines and several dedicated control signals; the instruction bus is used for data transport. While in analyze mode, all PEs release their local busses to enable frontend access via direct memory access.

The processing elements are designed as universal computing elements, capable of performing computation as well as service functions. Each PE consists of a Motorola microprocessor, a memory management unit, a numeric coprocessor, main memory, a SCSI interface, and a network processor; Figure 2 gives an overview. No extra controllers for mass storage access or any other I/O are necessary.

[Figure 2: Triton Processing Element. CPU, FPU, MMU, RAM, SCSI interface, network unit (NET), and frontend (FE) interface, connected by the local address and data busses.]

The network of Triton is built up of the network processors included in the PEs, the interconnection lines (realized with flat cables), and FIFO buffers for intermediate buffering of data packets. The network is able to route data packets from their source to their destination without interfering with the PEs. This non-interference permits latency hiding techniques to be applied. Again for reasons of decoupling, the interface between a PE and its respective network processor is implemented with FIFOs.

Parity checking of main memory, network links, and mass storage implements error detection. Periodic signature tests locate malfunctioning elements.

Detailed discussion of selected hardware aspects

In order to give a better idea of the architectural features of Triton, it is necessary to look into some implementation details of the hardware.

The implementation of the instruction bus and the control bus is quite naturally done by a hierarchy of bus drivers for signals running from the frontend to the backend. For the opposite direction, global wired-or lines are emulated by explicitly OR-combining the signals from the individual PEs. In SIMD mode, all PEs execute the same instruction at a time (or idle), including reading the instruction at the same time. The reading of instructions by the PEs is controlled via three control signals. Instruction strobe signals the frontend that all PEs currently listening to the instruction stream are ready to read an instruction. The frontend then asserts the instruction and answers with instruction transfer acknowledge. If all PEs currently listening have accepted the instruction, instruction fetch done is asserted, which signals the end of an instruction transfer. This three-way handshake introduces a non-negligible amount of delay, because the signals traverse the complete bus hierarchy several times. To reduce that delay, we introduced an instruction buffer at each driver level of the bus hierarchy, reducing the delay for the instruction fetch by two clock cycles in the normal case. The handshake described above is thus executed between every two hierarchy levels rather than between the frontend and the PEs.

As mentioned above, the global address space is spanned by a single wide address. The least significant bits are used to select the memory and the memory-mapped I/O within a PE, a further group of bits identifies the PE to be accessed, and one bit distinguishes frontend and backend. The identification of the PEs is twofold. Each PE has a hardware identification, which is selected by a switch setting; the hardware identification is used to select the PE in analyze mode for debugging. Additionally, each PE has a software identification, which is used while computing. The software id is initially set to the same value as the hardware id, but can be altered for reasons of hardware error handling. However, implementing a concept with a full-width address space does not automatically imply computing with full-width addresses all the time: in the majority of cases, computing with shorter addresses suffices, reducing the time spent on address calculation.
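A minimal sketch of such an address decomposition (Python). The actual field widths are not reproduced in this copy of the paper, so the 1-bit frontend/backend flag, 9-bit PE number, and 22-bit local offset used below are assumptions chosen only to make the example concrete.

    OFFSET_BITS = 22      # assumed width of the local memory / memory-mapped I/O field
    PE_BITS = 9           # assumed width of the PE number field

    def decode(addr):
        """Split a system-wide address into (side, PE number, local offset)."""
        offset = addr & ((1 << OFFSET_BITS) - 1)
        pe = (addr >> OFFSET_BITS) & ((1 << PE_BITS) - 1)
        frontend = (addr >> (OFFSET_BITS + PE_BITS)) & 1
        return ("frontend" if frontend else "backend", pe, offset)

    print(decode(0x00C00123))   # PE 3 on the backend, local offset 0x123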

Another point of interest is mode switching. Though MIMD mode is the more natural one for the processor, the system is started in SIMD mode; this is done to save additional hardware for the startup code. In SIMD mode, the function codes of the processor are used to determine whether the processor intends a data or a program access; accordingly, the processor bus is connected to the local memory or to the instruction bus, respectively. If a PE is selected not to execute an instruction, the local signal "listen to instruction stream" is turned off and the processor of that PE is not notified of instructions, except if the instruction is unconditional. The value of the processor's program counter is completely ignored in SIMD mode.

In order to switch to MIMD mode, the program to be executed has to be downloaded to the memory of the PEs. This is done via the instruction stream in SIMD mode. Thus the distribution of code is, in contrast to many other MIMD machines, done in time proportional to the length of the code, independent of the size of the machine. The switch from SIMD to MIMD mode is performed by two instructions. The first instruction, a JMP, sets the program counter to the location of the program to be executed in MIMD mode. With the second instruction, the SIMD request bit in the command register local to the PE is deactivated. The PE then switches to MIMD mode at the end of the current cycle and commences execution of the local code without delay. To switch from MIMD back to SIMD mode, the SIMD request bit in the local command register is simply activated, which causes the PE to switch to SIMD mode at the end of the current cycle. The next instruction is then expected from the instruction stream.

While some PEs are executing in MIMD mode, the rest of the PEs may execute in SIMD mode. This is achieved by activating the instruction transfer handshake in the case of MIMD operation. If no PE is left executing in SIMD mode but some instructions still remain in the instruction FIFO, the handshake signals automatically empty the buffer.
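The two-step SIMD-to-MIMD switch can be pictured as a small state machine on each PE. The toy model below (Python; register and field names are illustrative, not the actual command-register layout) captures the sequence described above: a JMP sets the program counter, clearing the SIMD request bit takes effect at the end of the current cycle, and setting the bit again returns the PE to the instruction stream.

    class PE:
        def __init__(self):
            self.simd_request = True      # the machine starts up in SIMD mode
            self.mode = "SIMD"
            self.pc = 0

        def end_of_cycle(self):
            # The request bit is sampled at the end of every cycle.
            self.mode = "SIMD" if self.simd_request else "MIMD"

        def switch_to_mimd(self, entry_point):
            self.pc = entry_point         # first instruction: JMP sets the PC
            self.simd_request = False     # second instruction: clear the request bit
            self.end_of_cycle()

        def switch_to_simd(self):
            self.simd_request = True      # next instruction comes from the stream
            self.end_of_cycle()

    pe = PE()
    pe.switch_to_mimd(0x1000)
    print(pe.mode, hex(pe.pc))            # -> MIMD 0x1000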

Data transfer is an important point in every parallel computer, and there are several different data paths to consider. The most important one is the data transport between the PEs; that task is performed by the network, which is described later in detail. Another important point is the transport of data between the frontend and the backend, with different possibilities in each direction. To transport data from the frontend to the backend, the easiest way is to send the data as immediate data via the instruction stream in SIMD mode; with this method, any subset of the PEs can be the destination of the data. Unfortunately, only unidirectional access is possible. The second possibility of transferring data is direct memory access within the analyze mode, where data can be transferred in both directions. The drawback of the analyze mode is that no computation can take place and that no more than one PE can be accessed concurrently. The third possibility of data transport is via the network: there is one dedicated network node which is connected to the frontend. This is especially useful when more than a few bytes have to be transported from different PEs to the frontend, e.g., picture data. A further advantage of a network node attached to the frontend is that computation can continue while data is being transported.

Fast barrier synchronization in MIMD mode

An important problem is the realization of barrier synchronization in the case that several different sets of processes are distributed randomly over the PEs. A set, or group, of processes is defined as executing the same part of code (e.g., a procedure) and therefore sharing common variables. If there is only one set of processes requesting synchronization, the barrier synchronization is easily done by means of a global wired-or line: each PE sets its ready bit on the line to true as soon as it reaches the synchronization point. Approximately one clock cycle after the last PE sets its bit, the frontend is able to recognize the result, and the PEs are notified via the result line.

The problem with using a single global wired-or line for synchronization in MIMD mode is that a global wired-or line cannot be partitioned arbitrarily. In the general case, more than one group of processes exists. Each of these groups shares common variables to which accesses have to be regulated. In most cases the groups of processes are distributed randomly over the set of PEs, so that they cannot be separated by partitioning the backend.

To enable hardware-supported synchronization with several groups of processes running in MIMD mode, the global wired-or line is administered by the frontend as a synchronization resource in the following way. Each group of processes is identified by a unique process group number. Initially the synchronization line is not in use, and each PE is allowed to request it on behalf of a group. The request is performed by the first PE reaching a barrier: that PE signals the frontend via the service request line and sends the group identification via the analyze circuits. If more than one PE reaches a barrier at once, the analyze circuits select one of them at random. The frontend then knows which group demands the synchronization line. Next, the frontend interrupts all PEs and forces them into SIMD mode to perform a barrier setup. The PEs not belonging to the requesting group are prohibited from requesting the sync line themselves; they also turn on their ready bits. The PEs belonging to the requesting group set their ready bit to true if they have already reached the sync point, and to false otherwise. After this setup phase, the PEs return to MIMD mode, and all PEs continue computation independent of their group membership. As soon as the last ready bit is turned on, the group owning the global wired-or line synchronizes and then releases the sync line. The frontend then lifts the request prohibition in order to enable other groups to synchronize.

This discussion glossed over the difficulties that arise if a PE virtualizes, i.e., executes several threads or processes which may belong to different groups. In this case the ready bits have to be virtualized as well. The details depend on the virtualization strategy, which may be a mixture of looping and context switching.
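The following sketch (Python, not cycle-accurate; names are illustrative) models the setup and completion of one such barrier: during setup, the ready bits of PEs outside the requesting group are forced to true, members raise theirs when they reach the sync point, and the barrier completes exactly when every ready bit is set.

    class WiredOrBarrier:
        """Toy model of the frontend-administered wired-or barrier for one group."""

        def __init__(self, num_pes, group):
            self.group = set(group)
            # Barrier setup (performed in SIMD mode under frontend control):
            # non-members start with ready = True, members with ready = False.
            self.ready = [pe not in self.group for pe in range(num_pes)]

        def reach_sync_point(self, pe):
            self.ready[pe] = True
            # The group synchronizes once the conjunction of all ready bits holds.
            return all(self.ready)

    bar = WiredOrBarrier(num_pes=8, group=[1, 4, 6])
    print(bar.reach_sync_point(4))   # False: PEs 1 and 6 are still computing
    print(bar.reach_sync_point(1))   # False
    print(bar.reach_sync_point(6))   # True: the group synchronizes, line released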

Communications Network

The Triton network is based on the generalized De Bruijn net. The number N of nodes in the network is not limited to powers of two. The maximum diameter is ⌈log_d N⌉, and the average diameter is well below log_d N; in practice it is quite close to the theoretical lower bound, the average diameter of directed Moore graphs.

In our implementation we use degree d = 2, which makes our net a perfect shuffle (see Figure 3). In comparison with other frequently used networks, this design has the benefit of a constant degree per node and a small average diameter. Data transport is done via a table-based, self-routing packet-switching method which allows wormhole routing and load-dependent detouring of packets. Every node is equipped with its own routing table and with four buffers: two for intermediate storage of data packets coming from other nodes, and two to communicate with its associated processing element. The buffering temporally decouples the network from local processing.

[Figure 3: A De Bruijn net with a small number of nodes.]
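Self-routing in such a net can be pictured as shifting the destination address into the current node number, one digit per hop. The sketch below (Python) does this for the binary case, d = 2, and assumes N is a power of two for simplicity; the generalized net also admits other values of N, and the real network additionally consults per-node routing tables and may detour packets under load.

    def route(src, dst, N):
        """Path from src to dst in a binary De Bruijn net with N = 2**m nodes."""
        m = N.bit_length() - 1          # number of address bits (= max diameter)
        path = [src]
        node = src
        for k in reversed(range(m)):
            # Edge i -> (2*i + b) mod N: shift in the next destination bit b.
            node = ((node << 1) | ((dst >> k) & 1)) % N
            path.append(node)
        return path

    print(route(src=5, dst=3, N=16))    # -> [5, 10, 4, 9, 3]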

The communications processor is able to route packets without interfering with the local processing element. Optimal routes are stored in a routing table per communications processor. The network can thus transport data in parallel with the operation of the processing elements. This feature can be exploited by the compiler to overlap communication and processing time by rearranging code.

In order to analyze the behavior of the network, we built a simulator based on the measured performance of a single communications processor and simulated the overall performance of the network in various modes and over a range of network sizes. Figure 4 presents the results of a series of experiments with a random communication pattern (random permutations): both the sender and the receiver were chosen randomly, with the restriction that the number of data packets to be transported is the same as the number of nodes in the network. The simulation shows that the network scales well; the delay introduced by the network is within O(log N), with N denoting the number of nodes and messages.

[Figure 4: Performance on random communication. Transfer time in network cycles versus the number of nodes, compared with the maximum and the average diameter.]

The robustness against overload is surprisingly good: even if all processing elements send a great number of packets simultaneously, the overall throughput of the network does not decrease. Irregular permutations are performed especially fast. All hard patterns known from the literature (e.g., transposition of a matrix, butterfly, and bit reversal) perform well, too; the delivery time for those is equal to or lower than the delivery time for random permutations.

Conclusion

The integrated approach of designing language, compiler, and hardware together has led to a parallel architecture that supports higher-level languages adequately. Fast barrier synchronization for multiple process groups, SAMD mode, a shared address space, and a fast, independently operating network should make parallel computers run efficiently even when programmed in a machine-independent fashion.

Status and schedule of Triton

The fully functional prototype of a PE board was completed in October. The individual components (communications processor, processing element, and control processor interface) are tested and running according to specifications. The manufacturing of the printed circuit boards is in progress. The final assembly of Triton will be completed early in the coming year.

References

M. Auguin and F. Boeri. The OPSILA computer. In M. Cosnard et al., editors, Parallel Languages and Architectures. Elsevier Science Publishers (North-Holland).

N. G. De Bruijn. A combinatorial problem. In Proc. of the Sect. of Sciences, Akademie van Wetenschappen, Amsterdam, June 1946.

Makoto Imase and Masaki Itoh. Design to minimize diameter on building-block network. IEEE Transactions on Computers, June 1981.

David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proc. of the ACM SIGMOD Conference on Management of Data, Chicago, June 1988.

Michael Philippsen, Walter F. Tichy, and Christian G. Herter. Modula-2* and its compilation. In First International Conference of the Austrian Center for Parallel Computation, Salzburg, Austria, September. Springer Verlag, Lecture Notes in Computer Science.

H. J. Siegel, T. Schwederski, J. T. Kuehn, and N. J. Davis. An overview of the PASM parallel processing system. In D. D. Gajski, V. M. Milutinovic, H. J. Siegel, and B. P. Furht, editors, Computer Architecture. IEEE Computer Society Press, Washington, DC.

Walter F. Tichy and Christian G. Herter. Modula-2*: An Extension of Modula-2 for Highly Parallel, Portable Programs. Technical Report (Interner Bericht), Department of Informatics, University of Karlsruhe, January.