This paper appeared in: Proc. of the Workshop on Heterogeneous Processing, held with the International Parallel Processing Symposium, Newport Beach, CA, April.

Triton: A Massively-Parallel Mixed-Mode Computer Designed to Support High-Level Languages

Christian G. Herter, Thomas M. Warschko, Walter F. Tichy, and Michael Philippsen
University of Karlsruhe, Dept. of Informatics, Karlsruhe, Germany

Abstract

We present the architecture of Triton, a scalable mixed-mode (SIMD/MIMD) parallel computer. The novel features of Triton are:

- Support for high-level, machine-independent programming languages;
- Fast SIMD/MIMD mode switching;
- Special hardware for barrier synchronization of multiple groups;
- A self-routing, deadlock-free perfect-shuffle interconnect with latency hiding.

The architecture is the outcome of an integrated design in which a machine-independent programming language, an optimizing compiler, and a parallel computer were designed hand in hand.

Introduction

The goal of adequate programmability of parallel machines can best be achieved by tightly coupling the design of machine-independent programming languages, compilers, and parallel hardware. In the past, the development of parallel computers has been driven mainly by hardware considerations, without regard to the law of the weakest link: the overall performance of a parallel computer is determined by the performance of its slowest part with respect to the requirements of the software. Ignoring software requirements has resulted in unsatisfactory performance of these machines on machine-independent parallel programs. To avoid these shortcomings, and to show that high-level parallel programming does not necessarily lead to poor performance, we specifically analyzed the requirements of programming languages and compilers before designing the hardware.

The following section outlines the parallel programming language Modula-2* and derives general requirements for parallel computers. The remaining sections describe the architecture of Triton; we emphasize those features which arose from software requirements.

Modula-2*

Modula-2* (pronounced Modula-2-star) is a small extension of Modula-2 for massively parallel programming. The programming model of Modula-2* incorporates both data and control parallelism and allows mixed synchronous and asynchronous execution. Modula-2* is problem-oriented in the sense that the programmer can choose the degree of parallelism and mix the control mode (SIMD- or MIMD-like) as needed by the intended algorithm. Parallelism may be nested to arbitrary depth. Procedures may be called from sequential or parallel contexts and can themselves generate parallel activity without any restrictions. Most Modula-2* programs can be translated into efficient code for both SIMD and MIMD architectures.

Overview of language extensions

Modula-2* extends Modula-2 with the following two language constructs:

- Parallelism can be created in Modula-2* programs only by means of the FORALL statement. There are two versions of this statement, a synchronous and an asynchronous one.
- The distribution of array data may be optionally specified by so-called allocators in a machine-independent way. Allocators do not have any semantic meaning; they are hints for the compiler.

Because of the compactness and simplicity of the extensions, they could easily be incorporated in other imperative programming languages such as Fortran, C, or Ada.

A synchronous FORALL statement creates a set of synchronously running processes. The number of processes is determined by the FORALL statement and is not limited to the number of PEs of the machine. As long as there is no branch in the control flow of the statements inside a FORALL statement, the semantics of the execution is equivalent to a SIMD machine executing those statements. If there are branches in the control flow that define alternatives, like IF-THEN-ELSE or CASE, the processes are partitioned into several groups. Each of those groups executes one branch of the control flow. The processes belonging to one group execute synchronously in SIMD style, but groups are allowed to execute concurrently with respect to each other. There is no assumption about the relative speeds of two processes in different groups.

In contrast to the synchronous FORALL statement, an asynchronous FORALL statement creates a set of independent processes running at an unspecified relative speed. Common to both variants is that the termination of a FORALL statement is determined by the termination of the last process created by the FORALL statement. The end of a FORALL statement therefore always defines a synchronization barrier. For further details about the language see the Modula-2* report by Tichy and Herter; compilation techniques and optimizations are discussed in detail by Philippsen, Tichy, and Herter (see the references).

Software requirements for parallel computers

On massively parallel machines, the distribution of array data over the available processors is a central problem. Two conflicting goals, data locality and maximum degree of parallelism, must be reconciled. Data locality means that data elements should be stored local to the processors that need them, to minimize communication costs. Perfect data locality could be achieved by employing a single processor. Parallelism, which reduces runtime, may unfortunately reduce locality and increase communication costs. Additional goals for data distribution are exploiting special communication patterns supported by hardware and generating simple address calculations, to prevent addressing from becoming a dominant cost.

Even with optimal layout of the data, there will still be communication in the general case. In fact, in most massively parallel applications that are not trivially parallel, communication is almost as frequent as computation. Therefore,

  the network MIPS measure must approach the CPU MIPS measure.

A second performance-oriented recommendation is an

  independently operating network with asynchronous message delivery,

since it allows the delivery of packets concurrently with computation. That would enable the compiler to interleave computation and communication and thus to hide some of the communication latency.
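As a rough illustration of the latency hiding this recommendation enables, the sketch below (Python; the background thread merely stands in for an independently operating network, and all names are illustrative) starts a transfer, keeps computing while the packet is in flight, and waits only when the transfer must have completed.

    import threading, time

    def network_send(destination, payload, done):
        time.sleep(0.01)          # stand-in for network transfer latency
        done.set()                # delivery completes without CPU involvement

    done = threading.Event()
    threading.Thread(target=network_send, args=(7, b"data block", done)).start()

    partial = sum(i * i for i in range(100_000))   # useful work overlaps the send
    done.wait()                                    # synchronize only when required
    print("overlapped computation done:", partial)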

On the other hand, we recommend a

  shared address space.

A shared address space does not imply shared memory; it only means that every processor can generate addresses for the entire memory in the system. System-wide addresses are especially important for pointers, because otherwise they would have to be simulated quite inefficiently in software. Even the memory of the control processor, e.g. the frontend of a SIMD machine, should be part of the shared address space. Furthermore, and similar to the above, we call for

  uniform memory access instructions.

Many parallel machines today provide a set of instructions for accessing local memory, a second one for accessing memory in neighbors, and a third set for accessing distant memory units. The differences in speed are significant and therefore require that the compiler detect the faster cases. However, it is often impossible to know statically for which case to optimize. For instance, we found that in many cases it was impossible to determine in the compiler whether a procedure would access local or non-local memory. The generated code thus has to check all three cases at runtime. Such simple, frequent, and dynamic analyses could be done more efficiently in hardware.
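The runtime check described above amounts to a small dispatch on every access whose target cannot be classified statically. The sketch below (Python; the function and its three cases are illustrative only and do not reflect Triton's actual instruction set) shows the kind of code a compiler is forced to emit when the hardware does not provide uniform access instructions.

    def load(global_addr, my_pe, neighbours, local_mem, remote_fetch):
        """Dispatch one read to the local, neighbour, or distant access path."""
        pe, offset = global_addr
        if pe == my_pe:                      # fast case: local memory
            return local_mem[offset]
        if pe in neighbours:                 # medium case: a neighbouring PE
            return remote_fetch(pe, offset, fast_path=True)
        return remote_fetch(pe, offset, fast_path=False)   # slow case: distant PE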

Barrier synchronization is extremely frequent. Basically, every communication requires a barrier synchronization. The reasoning is as follows: communication sends or receives data. Unless communication is redundant, there must be a write between two successive communication calls to the same cell. It follows that a synchronization operation must be placed somewhere between one of the communication calls and the write in order to avoid race conditions. If many processes communicate at once, as in massively parallel machines, this type of synchronization amounts to a barrier: no process may proceed until all processes have completed either their write or their communication operation. Because of its frequency, we need

  fast barrier synchronization.

In principle, the communication network could be used for barrier synchronization. However, communication networks usually have high latency, which makes them too slow for fast barrier synchronization. These nets are optimized for transporting data, while barrier synchronization requires the transport of only one or two bits, but must also implement a reduction or scan operation on these bits.

An additional complication is that there are usually several groups of processes that need to communicate among themselves, necessitating multiple non-overlapping barriers. Consider, for instance, a pipelined architecture in which one set of processes passes data to another set. Each set may have to synchronize internally, independent of the other. Similarly, an IF statement within a FORALL divides a set of processes into two subsets which may have to synchronize independently. Thus we need

  barrier synchronization for multiple independent sets of processes.
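The following sketch (Python; threading.Barrier stands in for the hardware barrier argued for above) shows the write/communicate pattern behind this reasoning: each process writes its own cell and then reads a neighbour's cell, and the barrier between the two phases is what prevents a read from racing ahead of the corresponding write.

    import threading

    N = 4
    cells = [None] * N
    barrier = threading.Barrier(N)

    def worker(i):
        cells[i] = i * i                 # write phase: update the local cell
        barrier.wait()                   # no process proceeds before all writes are done
        neighbour = cells[(i + 1) % N]   # communication phase: read a remote cell
        print(f"process {i} reads {neighbour}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
    for t in threads: t.start()
    for t in threads: t.join()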

Triton

The poor programmability of today's parallel machines is a consequence of the fact that the design of these machines has been driven mostly by hardware considerations. Programmability seems to have been a secondary issue, resulting in languages designed specifically for a particular machine model. Such languages do not satisfy the needs of programmers who need to write machine-independent applications.

Triton matches most of the recommendations of the previous section. From a general point of view, Triton is determined by the following statements.

General Architecture: Triton is a SAMD (synchronous/asynchronous instruction streams, multiple data streams) machine: it runs in SIMD mode where strict synchrony is necessary, and it can switch to MIMD mode where concurrent execution of different tasks is beneficial. It is even possible to run a subset of the processors in SIMD mode and the others in MIMD mode. Thus Triton is truly SAMD, i.e., mixed-mode, not just switched-mode. Only a few research prototypes of mixed-mode machines have been built: OPSILA, TRAC, and PASM. Triton provides support for switching rapidly between the two modes and a high-level language to control both modes effectively.

Fast barrier synchronization is supported by special hardware. The synchronization hardware can be used in both operating modes. Synchronization with hardware support overcomes the necessity of coarse-grained parallelism.

Network: We chose the De Bruijn network for Triton because it has several desirable properties (logarithmic diameter, fixed degree, etc.), is cost-effective to build, and can be made to operate extremely fast and reliably. Performance figures are presented in the section on the communications network below.

Scalability and balance: Parallel machines should scale in performance by varying the number of processors; furthermore, the performance of the individual components (processor, memory, network, and I/O) should harmonize. Scalability is mainly a property of the network. The most popular networks today, hypercubes and grids, do not scale well: hypercubes are too expensive because they have variable degree, while grids cause high latency because of their large diameter. Triton's De Bruijn net has none of these problems and scales well. It is also well matched to the speed of the processors.

I/O capabilities: I/O must also scale with the number of processors. Few parallel machines today provide for scalable I/O. Triton implements a massively parallel I/O architecture: one disk per processor. For large sets of disks we have extended the traditional notion of a file to what we call a vector file. Massively parallel I/O also provides the basis for research in parallel operating systems, such as virtual memory, parallel paging strategies, and true multi-user environments. Results in these areas are required to bring parallel machines into widespread and general-purpose use.

Architecture of Triton

Triton is divided into a frontend and a backend portion. The frontend typically consists of a UNIX workstation with a memory-mapped interface that connects via the instruction bus and the control bus to the backend portion. The backend portion consists of the processing elements, the network, and the I/O system.

The Triton prototype will be built with an Intel-based PC running BSD UNIX as frontend. The prototype will contain a fixed complement of PEs, most of them supplied with a disk; the majority of the PEs are provided for computation, and the remaining PEs serve as hot standbys. These standby PEs can be configured under software control into the network if other PEs fail. The reconfiguration involves changing the PE numbers consistently and recomputing the routing tables in the network processors. The disks are logically organized in groups, where each group contains several data disks and one parity disk; a parity-disk RAID organization is used for error handling.
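A sketch of the parity protection used within such a disk group (Python; the group size and block contents are illustrative, and the RAID-3-style layout is an assumption, since the exact RAID level is not reproduced in this copy): the parity disk stores the bytewise XOR of the data disks, so any single failed disk in the group can be rebuilt from the remaining disks.

    from functools import reduce

    def parity(blocks):
        """Bytewise XOR of equally sized blocks, one per disk in the group."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    data = [b"\x01\x02", b"\x04\x08", b"\x10\x20", b"\x40\x80"]   # four data disks
    p = parity(data)                                              # the parity disk

    # Rebuild disk 2 from the surviving data disks plus the parity disk.
    rebuilt = parity([data[0], data[1], data[3], p])
    assert rebuilt == data[2]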

Figure 1 gives an overview of the logical organization of Triton.

[Figure 1: Triton Architecture. The frontend (FE), attached to an Ethernet, drives the PEs over the instruction bus and the control bus; the PEs are interconnected by the network.]

In SIMD mode, the frontend produces the instruction stream and controls the backend portion at instruction level. In MIMD mode, the frontend is responsible for downloading the code and initiating the program. To decouple frontend and backend, i.e., to reduce the time the frontend spends waiting for the backend to become ready (or vice versa), the instruction stream is sent through a FIFO. The handshake signals necessary to control the instruction stream are part of the control bus.

For debugging it is a good idea to have direct access from the frontend to all parts of the machine, especially to the main memory distributed among the processing elements. As a direct consequence of Triton's common address space, this is possible via the so-called analyze mode. To support the analyze mode, the control bus includes address lines and several dedicated control signals; the instruction bus is used for data transport. While in analyze mode, all PEs release their local busses to enable frontend access via direct memory access.

The processing elements are designed as universal computing elements, capable of performing computation as well as service functions. Each PE consists of a Motorola microprocessor, a memory management unit, a numeric coprocessor, main memory, a SCSI interface, and a network processor; Figure 2 gives an overview. No extra controllers for mass storage access or any other I/O are necessary.

[Figure 2: Triton Processing Element. CPU, FPU, MMU, RAM, SCSI interface, network unit (NET), and frontend (FE) interface, connected by the local address and data busses.]

The network of Triton is built up of the network processors included in the PEs, the interconnection lines (realized with flat cables), and FIFO buffers for intermediate buffering of data packets. The network is able to route data packets from their source to their destination without interfering with the PEs. This non-interference permits latency hiding techniques to be applied. Again for reasons of decoupling, the interface between a PE and its respective network processor is implemented with FIFOs.

Parity checking of main memory, network links, and mass storage implements error detection. Periodic signature tests locate malfunctioning elements.

Detailed discussion of selected hardware aspects

In order to give a better idea of the architectural features of Triton, it is necessary to look into some implementation details of the hardware.

The implementation of the instruction bus and the control bus is quite naturally done by a hierarchy of bus drivers for signals running from the frontend to the backend. For the opposite direction, global wired-or lines are emulated by explicitly OR-combining the signals from the individual PEs. In SIMD mode, all PEs execute the same instruction at a time (or idle), including reading the instruction at the same time. The reading of instructions by the PEs is controlled via three control signals. Instruction strobe signals the frontend that all PEs currently listening to the instruction stream are ready to read an instruction. The frontend then asserts the instruction and answers with instruction transfer acknowledge. If all PEs currently listening have accepted the instruction, instruction fetch done is asserted, which signals the end of an instruction transfer. This three-way handshake introduces a non-negligible amount of delay, because the signals traverse the complete bus hierarchy several times. To reduce that delay, we introduced an instruction buffer at each driver level of the bus hierarchy, reducing the delay for the instruction fetch by two clock cycles in the normal case. The handshake described above is thus executed between every two hierarchy levels rather than between the frontend and the PEs.

As mentioned above, the global address space is spanned by a single wide address. The least significant bits are used to select the memory and the memory-mapped I/O within a PE, a further group of bits identifies the PE to be accessed, and one bit distinguishes frontend and backend. The identification of the PEs is twofold. Each PE has a hardware identification, which is selected by a switch setting; the hardware identification is used to select the PE in analyze mode for debugging. Additionally, each PE has a software identification, which is used while computing. The software id is initially set to the same value as the hardware id, but can be altered for reasons of hardware error handling. However, implementing a concept with a full-width address space does not automatically imply computing with full-width addresses all the time: in the majority of cases, computing with shorter addresses suffices, reducing the time spent on address calculation.
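A minimal sketch of such an address decomposition (Python). The actual field widths are not reproduced in this copy of the paper, so the 1-bit frontend/backend flag, 9-bit PE number, and 22-bit local offset used below are assumptions chosen only to make the example concrete.

    OFFSET_BITS = 22      # assumed width of the local memory / memory-mapped I/O field
    PE_BITS = 9           # assumed width of the PE number field

    def decode(addr):
        """Split a system-wide address into (side, PE number, local offset)."""
        offset = addr & ((1 << OFFSET_BITS) - 1)
        pe = (addr >> OFFSET_BITS) & ((1 << PE_BITS) - 1)
        frontend = (addr >> (OFFSET_BITS + PE_BITS)) & 1
        return ("frontend" if frontend else "backend", pe, offset)

    print(decode(0x00C00123))   # PE 3 on the backend, local offset 0x123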

Another point of interest is mode switching. Though MIMD mode is the more natural one for the processor, the system is started in SIMD mode; this is done to save additional hardware for the startup code. In SIMD mode, the function codes of the processor are used to determine whether the processor intends a data or a program access; accordingly, the processor bus is connected to the local memory or to the instruction bus, respectively. If a PE is selected not to execute an instruction, the local signal "listen to instruction stream" is turned off and the processor of that PE is not notified of instructions, except if the instruction is unconditional. The value of the processor's program counter is completely ignored in SIMD mode.

In order to switch to MIMD mode, the program to be executed has to be downloaded to the memory of the PEs. This is done via the instruction stream in SIMD mode. Thus the distribution of code is, in contrast to many other MIMD machines, done in time proportional to the length of the code, independent of the size of the machine. The switch from SIMD to MIMD mode is performed by two instructions. The first instruction, a JMP, sets the program counter to the location of the program to be executed in MIMD mode. With the second instruction, the SIMD request bit in the command register local to the PE is deactivated. The PE then switches to MIMD mode at the end of the current cycle and commences execution of the local code without delay. To switch from MIMD back to SIMD mode, the SIMD request bit in the local command register is simply activated, which causes the PE to switch to SIMD mode at the end of the current cycle. The next instruction is then expected from the instruction stream.

While some PEs are executing in MIMD mode, the rest of the PEs may execute in SIMD mode. This is achieved by activating the instruction transfer handshake in the case of MIMD operation. If no PE is left executing in SIMD mode but some instructions still remain in the instruction FIFO, the handshake signals automatically empty the buffer.
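The two-step SIMD-to-MIMD switch can be pictured as a small state machine on each PE. The toy model below (Python; register and field names are illustrative, not the actual command-register layout) captures the sequence described above: a JMP sets the program counter, clearing the SIMD request bit takes effect at the end of the current cycle, and setting the bit again returns the PE to the instruction stream.

    class PE:
        def __init__(self):
            self.simd_request = True      # the machine starts up in SIMD mode
            self.mode = "SIMD"
            self.pc = 0

        def end_of_cycle(self):
            # The request bit is sampled at the end of every cycle.
            self.mode = "SIMD" if self.simd_request else "MIMD"

        def switch_to_mimd(self, entry_point):
            self.pc = entry_point         # first instruction: JMP sets the PC
            self.simd_request = False     # second instruction: clear the request bit
            self.end_of_cycle()

        def switch_to_simd(self):
            self.simd_request = True      # next instruction comes from the stream
            self.end_of_cycle()

    pe = PE()
    pe.switch_to_mimd(0x1000)
    print(pe.mode, hex(pe.pc))            # -> MIMD 0x1000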

Data transfer is an important point in every parallel computer, and there are several different data paths to consider. The most important one is the data transport between the PEs; that task is performed by the network, which is described later in detail. Another important point is the transport of data between the frontend and the backend, with different possibilities in each direction. To transport data from the frontend to the backend, the easiest way is to send the data as immediate data via the instruction stream in SIMD mode; with this method, any subset of the PEs can be the destination of the data. Unfortunately, only unidirectional access is possible. The second possibility of transferring data is direct memory access within the analyze mode, where data can be transferred in both directions. The drawback of the analyze mode is that no computation can take place and that no more than one PE can be accessed concurrently. The third possibility of data transport is via the network: there is one dedicated network node which is connected to the frontend. This is especially useful when more than a few bytes have to be transported from different PEs to the frontend, e.g., picture data. A further advantage of a network node attached to the frontend is that computation can continue while data is being transported.

Fast barrier synchronization in MIMD mode

An important problem is the realization of barrier synchronization in the case that several different sets of processes are distributed randomly over the PEs. A set, or group, of processes is defined as executing the same part of code (e.g., a procedure) and therefore sharing common variables. If there is only one set of processes requesting synchronization, the barrier synchronization is easily done by means of a global wired-or line: each PE sets its ready bit on the line to true as soon as it reaches the synchronization point. Approximately one clock cycle after the last PE sets its bit, the frontend is able to recognize the result, and the PEs are notified via the result line.

The problem with using a single global wired-or line for synchronization in MIMD mode is that a global wired-or line cannot be partitioned arbitrarily. In the general case, more than one group of processes exists. Each of these groups shares common variables to which accesses have to be regulated. In most cases the groups of processes are distributed randomly over the set of PEs, so that they cannot be separated by partitioning the backend.

To enable hardware-supported synchronization with several groups of processes running in MIMD mode, the global wired-or line is administered by the frontend as a synchronization resource in the following way. Each group of processes is identified by a unique process group number. Initially the synchronization line is not in use, and each PE is allowed to request it on behalf of a group. The request is performed by the first PE reaching a barrier: that PE signals the frontend via the service request line and sends the group identification via the analyze circuits. If more than one PE reaches a barrier at once, the analyze circuits select one of them at random. The frontend then knows which group demands the synchronization line. Next, the frontend interrupts all PEs and forces them into SIMD mode to perform a barrier setup. The PEs not belonging to the requesting group are prohibited from requesting the sync line themselves; they also turn on their ready bits. The PEs belonging to the requesting group set their ready bit to true if they have already reached the sync point, and to false otherwise. After this setup phase, the PEs return to MIMD mode, and all PEs continue computation independent of their group membership. As soon as the last ready bit is turned on, the group owning the global wired-or line synchronizes and then releases the sync line. The frontend then lifts the request prohibition in order to enable other groups to synchronize.

This discussion glossed over the difficulties that arise if a PE virtualizes, i.e., executes several threads or processes which may belong to different groups. In this case the ready bits have to be virtualized as well. The details depend on the virtualization strategy, which may be a mixture of looping and context switching.
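The following sketch (Python, not cycle-accurate; names are illustrative) models the setup and completion of one such barrier: during setup, the ready bits of PEs outside the requesting group are forced to true, members raise theirs when they reach the sync point, and the barrier completes exactly when every ready bit is set.

    class WiredOrBarrier:
        """Toy model of the frontend-administered wired-or barrier for one group."""

        def __init__(self, num_pes, group):
            self.group = set(group)
            # Barrier setup (performed in SIMD mode under frontend control):
            # non-members start with ready = True, members with ready = False.
            self.ready = [pe not in self.group for pe in range(num_pes)]

        def reach_sync_point(self, pe):
            self.ready[pe] = True
            # The group synchronizes once the conjunction of all ready bits holds.
            return all(self.ready)

    bar = WiredOrBarrier(num_pes=8, group=[1, 4, 6])
    print(bar.reach_sync_point(4))   # False: PEs 1 and 6 are still computing
    print(bar.reach_sync_point(1))   # False
    print(bar.reach_sync_point(6))   # True: the group synchronizes, line released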

Communications Network

The Triton network is based on the generalized De Bruijn net. The number N of nodes in the network is not limited to powers of two. The maximum diameter is ⌈log_d N⌉, and the average diameter is well below log_d N; in practice it is quite close to the theoretical lower bound, the average diameter of directed Moore graphs.

In our implementation we use degree d = 2, which makes our net a perfect shuffle (see Figure 3). In comparison with other frequently used networks, this design has the benefit of a constant degree per node and a small average diameter. Data transport is done via a table-based, self-routing packet-switching method which allows wormhole routing and load-dependent detouring of packets. Every node is equipped with its own routing table and with four buffers: two for intermediate storage of data packets coming from other nodes, and two to communicate with its associated processing element. The buffering temporally decouples the network from local processing.

[Figure 3: A De Bruijn net with a small number of nodes.]
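Self-routing in such a net can be pictured as shifting the destination address into the current node number, one digit per hop. The sketch below (Python) does this for the binary case, d = 2, and assumes N is a power of two for simplicity; the generalized net also admits other values of N, and the real network additionally consults per-node routing tables and may detour packets under load.

    def route(src, dst, N):
        """Path from src to dst in a binary De Bruijn net with N = 2**m nodes."""
        m = N.bit_length() - 1          # number of address bits (= max diameter)
        path = [src]
        node = src
        for k in reversed(range(m)):
            # Edge i -> (2*i + b) mod N: shift in the next destination bit b.
            node = ((node << 1) | ((dst >> k) & 1)) % N
            path.append(node)
        return path

    print(route(src=5, dst=3, N=16))    # -> [5, 10, 4, 9, 3]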

The communications processor is able to route packets without interfering with the local processing element. Optimal routes are stored in a routing table per communications processor. The network can thus transport data in parallel with the operation of the processing elements. This feature can be exploited by the compiler to overlap communication and processing time by rearranging code.

In order to analyze the behavior of the network, we built a simulator based on the measured performance of a single communications processor and simulated the overall performance of the network in various modes and over a range of network sizes. Figure 4 presents the results of a series of experiments with a random communication pattern (random permutations): both the sender and the receiver were chosen randomly, with the restriction that the number of data packets to be transported is the same as the number of nodes in the network. The simulation shows that the network scales well; the delay introduced by the network is within O(log N), with N denoting the number of nodes and messages.

[Figure 4: Performance on random communication. Transfer time in network cycles versus the number of nodes, compared with the maximum and the average diameter.]

The robustness against overload is surprisingly good: even if all processing elements send a great number of packets simultaneously, the overall throughput of the network does not decrease. Irregular permutations are performed especially fast. All hard patterns known from the literature (e.g., transposition of a matrix, butterfly, and bit reversal) perform well, too; the delivery time for those is equal to or lower than the delivery time for random permutations.

Conclusion

The integrated approach of designing language, compiler, and hardware together has led to a parallel architecture that supports higher-level languages adequately. Fast barrier synchronization for multiple process groups, SAMD mode, a shared address space, and a fast, independently operating network should make parallel computers run efficiently even when programmed in a machine-independent fashion.

Status and schedule of Triton

The fully functional prototype of a PE board was completed in October. The individual components (communications processor, processing element, and control processor interface) are tested and running according to specifications. The manufacturing of the printed circuit boards is in progress. The final assembly of Triton will be completed early in the coming year.

References

M. Auguin and F. Boeri. The OPSILA computer. In M. Cosnard et al., editors, Parallel Languages and Architectures. Elsevier Science Publishers (North-Holland).

N. G. De Bruijn. A combinatorial problem. In Proc. of the Sect. of Sciences, Akademie van Wetenschappen, Amsterdam, June 1946.

Makoto Imase and Masaki Itoh. Design to minimize diameter on building-block network. IEEE Transactions on Computers, June 1981.

David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proc. of the ACM SIGMOD Conference on Management of Data, Chicago, June 1988.

Michael Philippsen, Walter F. Tichy, and Christian G. Herter. Modula-2* and its compilation. In First International Conference of the Austrian Center for Parallel Computation, Salzburg, Austria, September. Springer Verlag, Lecture Notes in Computer Science.

H. J. Siegel, T. Schwederski, J. T. Kuehn, and N. J. Davis. An overview of the PASM parallel processing system. In D. D. Gajski, V. M. Milutinovic, H. J. Siegel, and B. P. Furht, editors, Computer Architecture. IEEE Computer Society Press, Washington, DC.

Walter F. Tichy and Christian G. Herter. Modula-2*: An Extension of Modula-2 for Highly Parallel, Portable Programs. Technical Report (Interner Bericht), Department of Informatics, University of Karlsruhe, January.