
A Study of Software Multithreading in Distributed Systems

T.A. Marsland and Yaoqing Gao                  Francis C.M. Lau
Computing Science Department                   Computer Science Department
University of Alberta                          University of Hong Kong
Edmonton, Canada T6G 2H1                       Hong Kong
{tony,gaoyq}@cs.ualberta.ca                    fmclau@cs.hku.hk

Technical Report TR

November

Abstract

Multiple threads can be used not only as a mechanism for tolerating unpredictable communication latency but also for facilitating dynamic scheduling and load balancing. Multithreaded systems are well suited to highly irregular and dynamic applications, such as tree search problems, and provide a natural way to achieve performance improvement through such new concepts as active messages and remote memory copy. Although already popular in single-processor and shared-memory processor systems, multithreading on distributed systems encounters more difficulties and needs to address new issues such as communication, scheduling and migration between threads located in separate addressing spaces. This paper addresses the key issues of multithreaded systems and investigates existing approaches for distributed concurrent computations.

Keywords: Multithreaded computation, virtual processor, thread safety, active message, data-driven

This research was supported by the Natural Sciences and Engineering Research Council of Canada.


Introduction

Distributed computing on interconnected high performance workstations has been increasingly prevalent over the past few years, and provides a cost-effective alternative to the use of supercomputers. Much effort has been put into developing distributed programming systems such as PVM, P4, MPI and Express. Most existing distributed software systems support only a process-based message-passing paradigm, i.e., processes with local memory that communicate with each other by sending and receiving messages. This paradigm fits well on separate processors connected by a communication network, where a process has only a single thread of execution, i.e., it is doing only one thing at a time. However, the process-based paradigm suffers a performance penalty through communication latency and high context-switch overhead, especially in the case where computing work is frequently created and destroyed. To overcome these problems, lightweight processes or threads (subprocesses in a sheltered environment, having a lower context-switch cost) have received much attention as a mechanism for (a) providing a finer granularity of concurrency, (b) facilitating dynamic load balancing, (c) making virtual processors independent of the physical processors, and (d) overlapping computation and communication to tolerate communication latency. In addition, multithreading provides a natural way to implement non-blocking communication operations, active messages and remote memory copy.

Multithreading can be supported by software or even hardware (e.g., MTA, Alewife, EARTH, START). Although it has been popular in single-processor and shared-memory processor systems, multithreading on distributed systems encounters more difficulties, mainly because of the high-latency and low-bandwidth communication links underlying distributed memory.

In this paper we first address some major issues in designing and implementing multithreading in distributed systems. Then we investigate typical existing multithreaded systems for exploiting the potential of multicomputers and workstation networks, and compare their capabilities. To provide a framework, we use a uniform terminology of thread, process and processor, but provide at the end a table showing the varied terminology used by the originating researchers. From this study we identify three areas where further work is necessary to advance the performance of multithreaded distributed systems.

Major issues

A runtime system defines the compiler's or the programmer's view of a machine: it can be used either as a compiler target or for programmer use. It is responsible for management of computational resources, communication, and synchronization among parallel tasks in a program. Designing and implementing multiple threads in distributed systems involves the following major issues: how to implement threads; how to provide communication and scheduling mechanisms for threads in separate addressing spaces; what kinds of programming paradigms are supported for ease of concurrent programming; and how to make the system thread-safe.

Threads

A process may be defined loosely as an address space together with a current state consisting of a stack, register values and a program counter. A process has only one program counter and does one thing at a time (a single thread of control). As we know, one of the obvious obstacles to the performance of workstation networks is the relatively slow network interconnection hardware. One of the motivations for introducing the notion of threads is to provide greater concurrency within a single process. This is possible through more rapid switching of control of the CPU from one thread to another, because context switching is simplified. When one thread is waiting for data from another processor, other threads can continue executing. This improves performance if the overhead associated with the use of multiple threads is less than the communication latency masked by computation. From the viewpoint of programmers, it is desirable for reasons of clarity and simplicity that programs are coded using a known number of virtual processors, irrespective of the availability of the physical processors. Threads support a scenario where each process is a virtual processor and multiple threads run in the context of a process. The low overhead of thread creation and destruction is especially suitable for highly dynamic, irregular problems where threads are frequently created and destroyed. Threads can be supported:


- at the system level, where all functionality is part of the operating system kernel;

- by user-level code, where all functionality is part of the user program and can be linked in; and

- by a mixture of the above.

User-level library implementations, such as Cthreads, PCR, FastThreads and pthreads, multiplex a potentially large number of user-defined threads on top of a single kernel-implemented process. This approach can be more efficient than relying on operating system kernel support, since the overhead of entering and leaving the kernel at each call and the context-switch overhead of kernel threads are high. User-level threads are, on the other hand, not only flexible but can also be customized to the needs of the language or user without kernel modification. However, a user-level library implementation complicates signal handling and some thread operations such as synchronization and scheduling. Two different scheduling mechanisms are needed: one for kernel-level processes and the other for user-level threads. User-level threads built on top of traditional processes may exhibit poor performance, and even incorrect behavior in some cases, when operating system activities such as I/O and page faults distort the mapping from virtual processors to physical processors. Kernel thread support, such as in Topaz and Mach, simplifies control over thread operations and signal handling, but like traditional processes these kernel threads carry too much overhead. A mixture of the above can combine the functionality of kernel threads with the performance and flexibility of user-level threads. The key issue is to provide a two-way communication mechanism between the kernel and user threads, so that they can exchange information such as scheduling hints. The Psyche system provides such a mechanism by the following approaches: shared data structures for asynchronous communication between the kernel and the user; software interrupts to notify the user level of some kernel events; and a scheduler interface for interactions between dissimilar thread packages. Another example is Scheduler Activations, through which the kernel can notify the thread scheduler of every event affecting the user, and the user can notify the kernel of the subset of user-level events affecting processor allocation decisions. In addition, Scheduler Activations address some problems not handled by the Psyche system, such as page faults and upward compatibility with traditional kernel threads.
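As a concrete illustration of the user-level approach, the minimal sketch below creates several POSIX threads inside one process and waits for them; the pthread calls are standard, while the worker function and its argument are ours for illustration only:

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical worker: each thread performs one unit of work. */
    static void *worker(void *arg)
    {
        long id = (long) arg;
        printf("thread %ld running\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[4];
        long i;

        /* Create four threads inside this single process; no
         * separate kernel process is spawned per thread. */
        for (i = 0; i < 4; i++)
            pthread_create(&tid[i], NULL, worker, (void *) i);

        /* Wait for all of them to finish. */
        for (i = 0; i < 4; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }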


Thread safety

A software system is said to be thread-safe if multiple threads in the system can execute successfully without data corruption and without interfering with each other. If a system is not thread-safe, then each thread must have mutually exclusive access to the system, forcing serialization of other threads. To support multithreaded programming efficiently, a system needs to be carefully engineered in the following ways to make it thread-safe: use as little global state information as possible; explicitly manage any global state that cannot be removed; use reentrant functions that may be safely invoked by multiple threads concurrently without yielding erroneous results, race conditions or deadlocks; and provide atomic access to non-reentrant functions.
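A minimal sketch of the last two points, using a shared counter (our own example, not code from any system surveyed here): the first increment is a classic read-modify-write race, while the second serializes access with a mutex, the usual way to provide atomic access to global state that cannot be removed.

    #include <pthread.h>

    static long counter = 0;   /* shared global state */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* NOT thread-safe: two threads may read the same old value
     * and one increment is then lost. */
    void increment_unsafe(void) { counter = counter + 1; }

    /* Thread-safe: the mutex makes the read-modify-write atomic. */
    void increment_safe(void)
    {
        pthread_mutex_lock(&lock);
        counter = counter + 1;
        pthread_mutex_unlock(&lock);
    }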

Thread scheduling

Scheduling plays an important role in distributed multithreaded systems. There are two kinds of thread scheduling: scheduling within a process and scheduling among processes. The former is similar to the priority-based, preemptive, or round-robin strategies in traditional single-processor systems. Here we focus on the more complicated thread scheduling among processes, especially for a dynamically growing multithreaded computation. Although some compilers can cope with the scheduling issues statically, their scope is often limited due to the complexity of the analysis and the lack of runtime information. For efficient multithreaded computations of dynamic, irregular problems, runtime-supported dynamic scheduling must meet the following requirements: provide enough active threads to prevent a processor from sitting idle; limit resource consumption (i.e., keep the total number of active threads within the space constraint); and provide computation locality (i.e., try to keep related threads on the same processor to minimize the communication overhead).

There are typically two classes of dynamic scheduling strategies: work sharing (sender-initiated) and work stealing (receiver-initiated). In the work-sharing strategy, whenever a processor generates new threads the scheduler decides whether to migrate some of them to other processors, on the basis of the current system state. Control over the maintenance of the state information may be centralized in a single processor or distributed among the processors. In the work-stealing strategy, whenever processors become idle or under-utilized they attempt to steal threads from busier processors. Many work-sharing and work-stealing strategies have been proposed, for example Chowdhury's greedy load sharing, Eager et al.'s adaptive load-sharing algorithm, and Karp et al.'s work-stealing algorithm. These algorithms vary in the amount of system information they use in making a task-migration decision. Usually the work-stealing strategy outperforms the work-sharing strategy, since if all processors have plenty of work to do there is no thread migration in the work-stealing strategy, while threads are always migrated in the work-sharing strategy. Eager et al. have shown that the work-sharing strategy outperforms work stealing at light to moderate system loads, but the latter is preferable at high system loads, when the costs of task transfer under the two strategies are comparable.
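The receiver-initiated idea can be sketched as a scheduler loop; all types and helper routines below are hypothetical, not the API of any system surveyed in this paper:

    /* A minimal work-stealing (receiver-initiated) scheduler loop. */
    typedef struct thread thread_t;          /* opaque thread descriptor */

    thread_t *ready_queue_pop(int self);     /* local work; may return NULL */
    thread_t *try_steal_from(int victim);    /* one steal attempt; may fail */
    int       random_victim(int self);       /* pick some other processor */
    void      run_thread(thread_t *t);       /* run the thread to completion */

    void scheduler_loop(int self)
    {
        for (;;) {
            thread_t *t = ready_queue_pop(self);   /* prefer local work */
            if (t == NULL)                         /* idle: become a thief */
                t = try_steal_from(random_victim(self));
            if (t != NULL)
                run_thread(t);
        }
    }

A work-sharing scheduler inverts the roles: after creating new threads, the owning processor would decide, from its view of the system state, whether to push some of them to other processors.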

Multithreaded programming paradigms

A programming paradigm provides a method for structuring programs to reduce programming complexity. Multithreaded systems can be classified into shared-memory, explicit message-passing, active message-passing, and implicit data-driven programming paradigms. Multiple threads of control are popular in uniprocessors, shared-memory multiprocessors, and tightly-coupled multicomputers with the support of distributed systems, where a process has multiple threads and each thread has its own register set and stack. All the threads within a process share the same address space, and shared variables are used as a means of communication among threads. These systems support the shared-memory parallel programming paradigm: an application consists of a set of threads which cooperate with each other through shared global variables, as Figure 1(a) shows.

Most loosely-coupled multicomputers and networks of workstations are process-based. Their software supports only message passing as the means of communication between processors, because there are no shared global variables. When threads are introduced to these systems, they are able to draw the performance and concurrency benefits from multithreaded computing, but usually a shared address space is precluded (see Figure 1(b)).

Multiple threads increase the convenience and efficiency of the implementation of active messages. An active message contains not only the data but also the address of a user-level handler (a code segment) which is executed on message arrival. It provides a mechanism to reduce the communication latency, and an extended message-passing paradigm as well, as illustrated in Figure 1(c). There are fundamental differences between remote procedure calls and active messages: the latter have no acknowledgment or return value from the call. Active messages are also used to facilitate remote memory copying, which can be thought of as part of the active-message paradigm, typified by put and get operations on such parallel machines as the CM-5, nCUBE/2, Cray T3D and Meiko CS-2.
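In outline, an active message carries a handler reference plus its arguments, and the receiver simply invokes the handler on arrival; no receive is posted and no reply is implied. The sketch below is illustrative only, with an invented packet layout rather than any real interface:

    /* Sketch of the active-message idea. */
    typedef void (*am_handler_t)(void *payload, int len);

    typedef struct {
        am_handler_t handler;      /* user-level handler run on arrival */
        int          len;          /* number of payload bytes */
        char         payload[64];  /* message data */
    } am_packet_t;

    /* The receiving side invokes the handler directly on arrival;
     * no receive is posted and no acknowledgment is sent back. */
    void am_dispatch(am_packet_t *pkt)
    {
        pkt->handler(pkt->payload, pkt->len);
    }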

Figure 1: Multithreaded programming paradigms

In addition to the traditional shared-memory, message-passing and active message-passing parallel programming paradigms, there could be a data-driven, non-traditional paradigm (see Figure 1(d)), which is somewhat related to dataflow computing. There an application consists of a collection of threads, but with data dependencies among the threads; a thread can be activated only when all its data are available. This paradigm simplifies the development of distributed concurrent programs by providing implicit synchronization and communication.

Multithreaded runtime systems

Since varied and conflicting terminology is used, confusingly, in the literature, below we compare the different systems using a uniform terminology for continuity.

Cilk

Cilk is a C-based runtime system for multithreaded parallel programming; it provides mechanisms for thread communication, synchronization and scheduling, as well as primitives callable within Cilk programs. It was developed by a group of researchers at the MIT Laboratory for Computer Science.

Figure 2: Threads, closures and continuations in Cilk

The Cilk language provides both a procedure abstraction in implicit continuation-passing style and a thread abstraction in explicit continuation-passing style. A continuation describes what to do when the currently executing computation terminates. The programming paradigm provided by Cilk is a dataflow-like procedure-call tree, where each procedure consists of a collection of threads. The entire computation can be represented by a directed acyclic graph of threads. A thread is a piece of code implemented as a non-blocking C function. When a thread is spawned, the runtime system allocates a closure (a thread control table entry) for it. A closure is a data structure that consists of a pointer to the thread's code, all argument slots needed to enable the thread, and a join count indicating the number of missing arguments that the thread depends on. The Cilk preprocessor translates the thread into a C function of one argument (a pointer to a closure) and void return type. Each thread is activated only after all its arguments are available, i.e., its corresponding closure changes from the waiting state to the full state; it runs to completion without waiting or suspending once it has been activated. A thread generates parallelism at runtime by spawning a child thread, so that the calling thread may execute concurrently with its child threads (for an example, see Figure 2). The creation relationships between threads form a spawn tree. Threads may wait for some arguments to arrive in the future. Explicit continuation-passing is used as a communication mechanism among threads: a continuation is a global reference to an empty argument slot that a thread is waiting for.
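The flavor of explicit continuation-passing can be seen in the familiar Fibonacci example, adapted here from the published Cilk papers (details may differ from any particular Cilk release): fib spawns a successor thread sum whose closure waits on two argument slots, and each child eventually fills one slot through its continuation.

    thread fib (cont int k, int n)
    {
        if (n < 2)
            send_argument(k, n);        /* fill the waiting slot k */
        else {
            cont int x, y;
            spawn_next sum (k, ?x, ?y); /* successor waits on x and y */
            spawn fib (x, n - 1);       /* children send results back */
            spawn fib (y, n - 2);
        }
    }

    thread sum (cont int k, int x, int y)
    {
        send_argument(k, x + y);        /* pass result to parent's slot */
    }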

Cilk's scheduler employs a work-stealing strategy for load balancing. Each processor maintains a local ready queue to hold full closures (threads whose data are available). Each thread has a level, which is a refinement of the distance to the root of the spawn tree. A processor always selects a thread with the deepest level to run from its local ready queue. When a processor runs out of work, it selects a remote processor at random and steals the shallowest ready thread from that processor, based on the heuristic that a shallower thread is likely to carry heavier work than a deep one. Computations in Cilk are fully strict, i.e., a procedure only sends values to its parent. This reflects ordinary procedure call/return semantics and has good time and space efficiency. Cilk also allows the programmer to have control over the runtime system.

Preliminary versions of the Cilk system have been implemented on massively parallel machines (the CM-5 MPP, the Intel Paragon MPP and the Silicon Graphics Power Challenge SMP) and on networks of workstations (the MIT Phish). Cilk has demonstrated good performance for dynamic, asynchronous, tree-like MIMD computations, but not yet for more traditional data-parallel applications. The explicit continuation-passing style results in a simple implementation, but it needs more frequent copying of arguments and puts some burden on the programmer. Although the procedure abstraction with implicit continuation-passing is provided to simplify parallel programming, it is less good at controlling the runtime system.

Multithreaded Parallel Virtual Machine

PVM has emerged as one of the most popular message-passing systems for workstations, mainly because of its interoperability across networks and the portability of its TCP- and XDR-based implementation. PVM supports only the process-based message-passing paradigm, where processes are both the unit of parallelism and the unit of scheduling. TPVM and LPVM are two PVM-based runtime systems. They aim at enhancing PVM by introducing threads: to reduce task initiation and scheduling costs, to overlap computation and communication, and to obtain a finer load balance.

Threads-oriented PVM

The TPVM (Threads-oriented PVM) system was developed by Ferrari and Sunderam at the University of Virginia and Emory University. It is an extension of the PVM system that models threads-based distributed concurrent computation: an application's components form a collection of instances (threads) that cooperate through message passing. These instances, the threads, are the basic units of parallelism and scheduling in TPVM.

TPVM supports user-level threads, and provides both the traditional process-based explicit message-passing paradigm and a data-driven paradigm. In the traditional interacting-sequential-processes paradigm, threads execute in the context of disjoint address spaces, and message passing is the only means of communication between threads. This paradigm typically forces the programmer to explicitly manage all thread creation and synchronization in a program. To implement this paradigm, PVM processes are first spawned and registered with the PVM system. The PVM processes then export entry points describing threads, as Figure 3 shows. Threads can be activated at appropriate locations by a thread startup mechanism (the tpvm_spawn primitive).


Figure 3: TPVM message-passing paradigm

The data-driven paradigm provides automatic synchronization and communication based on data dependencies. Threads are declared with named data dependencies: they are initially created at appropriate locations, but are actually activated only when a set of specified messages (data) is available, as shown in Figure 4. As soon as a thread is activated, it can perform non-blocking receives to obtain its required inputs.

TPVM is implemented as three separate modules. The first is the library: the user interfaces with the TPVM system via library calls, namely the export functions (tpvm_export, tpvm_unexport), the message-passing functions (tpvm_send, tpvm_recv) and the spawn function (tpvm_spawn or tpvm_invoke). The second is the portable thread scheduling module, covering thread creation, exiting the running thread, yielding the running thread, obtaining and releasing mutual exclusion, and determining a unique identifier associated with the running thread. The third is the thread server module: a centralized thread server (a task/process) provides support for thread export, scheduling and data-driven thread creation. The scheduling of threads is done in a round-robin fashion on the appropriate processes.
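In rough outline, usage then looks as follows. The function names below come from the module description above, but the argument lists and types are our assumptions for illustration, not TPVM's actual signatures:

    /* Illustrative TPVM-style usage; signatures are hypothetical. */
    extern int tpvm_export(char *entry_point);
    extern int tpvm_spawn(char *entry_point, int where);
    extern int tpvm_send(int tid, int tag);
    extern int tpvm_recv(int tid, int tag);

    void worker_process(void)
    {
        tpvm_export("solver");            /* advertise a thread entry point */
    }

    void master_process(void)
    {
        int t = tpvm_spawn("solver", 0);  /* activate the exported thread */
        tpvm_send(t, 1);                  /* threads cooperate by messages */
        tpvm_recv(t, 2);
    }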


Figure 4: TPVM data-driven paradigm

Preliminary experiments on a small network of Sparc computers show that some applications can benefit from TPVM's finer load balance, whereas many SPMD (Single-Program Multiple-Data) style applications with very regular communication patterns do not.

TPVM is built on top of PVM and is portable, but the lack of thread safety is still not addressed. As a result, the entire PVM library is treated as a critical region: semaphores are used to ensure that only one active thread accesses the library at a time. Such an exclusive-access requirement on the library is restrictive; e.g., the use of blocking PVM calls is precluded. The current version of TPVM manages threads through a centralized server, and this may become a bottleneck as the number of processors increases.

Lightweight-process PVM

LPVM (Lightweight-process PVM) was developed by Zhou and Geist at Oak Ridge National Laboratory. It is a modified version of the PVM system and is implemented on shared-memory multiprocessors.

In PVM, INET-domain UDP sockets are used to communicate among tasks. When one task sends a message to another, three copying operations are involved: from the sender's user space to the system's send buffer, from there to the system's receive buffer, and from there to the receiver's user space. With the availability of physical shared memory, the communication latency and bandwidth can be greatly improved. To retain the traditional PVM interface, the whole user program in LPVM is one process with multiple tasks; each task is a thread in LPVM, instead of being a process as in PVM. However, LPVM supports only the process-based message-passing paradigm.

To make LPVM thread-safe, PVM was modified extensively to remove many global states: all the memory and buffer management code of PVM was modified, and pvm_spawn was changed so that the PVM library spawns new tasks instead of the PVM daemon. The user interface is changed accordingly: e.g., most LPVM calls need to have the caller's task ID as a parameter, and use it as a key to locate resources related to the task; the pvm_pack routines explicitly require the send buffer id returned by pvm_initsend, so that they can pack messages into the buffer safely.
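The flavor of the change can be sketched as follows. The pvm_initsend, pvm_pkint and pvm_send calls are the standard PVM 3 interface; the lpvm_* variants in the comment are only our approximation of the explicit-state style described above, not the real LPVM signatures:

    #include "pvm3.h"   /* standard PVM 3 interface */

    /* Standard PVM: the current send buffer is implicit global state,
     * which is exactly what makes the library unsafe for threads. */
    void send_value_pvm(int desttid, int value)
    {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&value, 1, 1);
        pvm_send(desttid, 9 /* message tag */);
    }

    /* LPVM style (illustrative only): state is passed explicitly, so
     * concurrent threads cannot corrupt each other's buffers, e.g.
     *
     *   int bufid = lpvm_initsend(mytid, PvmDataDefault);
     *   lpvm_pkint(mytid, bufid, &value, 1, 1);
     *   lpvm_send(mytid, bufid, desttid, 9);
     */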

LPVM is an experimental system, and its current version runs on IBM SMP and Sun SMP systems with a single host. Threads are supported at the system level, and thread scheduling within a process is fully taken care of by the underlying operating system. Thread scheduling among different hosts has not yet been addressed in the current version, because LPVM does not support TPVM's export of thread entry points to another host.

Nexus

The Nexus runtime system was developed at Argonne National Laboratory and the California Institute of Technology. It is designed primarily as a good compilation target to support task-parallel and mixed data- and task-parallel computation in heterogeneous environments.

Nexus supports the active message-passing paradigm. It provides five core abstractions: the processor, process, thread, global pointer, and remote service request. A processor is a physical processing resource. A process is an address space plus an executable program, and corresponds to a virtual processor. A computation consists of a set of threads; all threads in the same process communicate with each other through shared variables. One or more processes can be mapped onto a single node. Processes are created and destroyed dynamically; once created, a process cannot be migrated between nodes. A thread can be created in the same process as, or in a different process from, the currently executing thread. Nexus provides the operations of thread creation, termination and yielding. To support a global name space for the compiler, a global pointer is introduced, which consists of three components: processor, process and address. Threads are created in a remote process pointed to by a global pointer by issuing a remote service request (RSR). Since Nexus does not require a uniform address space on all processors, an RSR specifies the name, rather than the address, of the handler function. It is similar to an active message, but less restrictive. In a remote service request, data can be passed as an argument to a remote RSR handler by means of a buffer (see Figure 5).
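Schematically, issuing an RSR means naming a handler, packing its arguments into a buffer, and sending both to the process a global pointer designates. Every identifier in the sketch below is a hypothetical stand-in for the Nexus operations just described, not the actual Nexus API:

    /* Schematic RSR issue sequence; all names are invented. */
    typedef struct { int processor; int process; void *address; } global_ptr_t;
    typedef struct buffer buffer_t;

    buffer_t *buffer_new(void);
    void      buffer_put_int(buffer_t *b, int *v, int n);
    void      rsr_issue(global_ptr_t dest, const char *handler_name,
                        buffer_t *args);

    void remote_spawn(global_ptr_t dest, int arg)
    {
        buffer_t *args = buffer_new();       /* pack the handler argument */
        buffer_put_int(args, &arg, 1);
        /* The handler is named, not addressed: the destination process
         * resolves "spawn_handler" in its own address space. */
        rsr_issue(dest, "spawn_handler", args);
    }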

Figure 5: Threads, processes and processors in Nexus

The Nexus implementation encapsulates thread and communication functions in threads and protocol modules to support heterogeneity. Threads are implemented by a user-level library. It is the user's and compiler's responsibility to map computation to physical processors, which involves both the mapping of threads to processes and the mapping of processes to processors. Therefore Nexus is more suitable as a compiler target than for programmer use. Nexus is operational on TCP/IP networks of workstations, the IBM SP-2, and the Intel Paragon (using NX). It has been used to implement two task-parallel programming languages, CC++ and Fortran M. Experiments showed that Nexus is competitive with PVM despite its lack of optimization.

Threaded Abstract Machine

TAM is a Threaded Abstract Machine for fine-grain parallelism, developed at the University of California, Berkeley. It serves as a framework for implementing a general-purpose parallel programming language. TAM provides efficient support for the dynamic creation of multiple threads of control, for locality of reference (through the utilization of the storage hierarchy and scheduling mechanisms), and for efficient synchronization through dataflow variables. TAM supports a novel data-driven paradigm.

Figure 6: TAM structure


A TAM program is a collection of code-blocks, where each code-block consists of several non-blocking, non-branching instruction sequences and inlets (message handlers). When a caller invokes its child code-block it does not suspend itself, while an activation frame (roughly comparable to a thread) is allocated for the child code-block. The activation frame provides storage for the local variables and for the resources required for synchronization and scheduling of instruction sequences. The dynamic call structure forms a thread tree. A TAM instruction sequence runs in the context of an activation frame without any suspension or branching. Argument and result passing is represented in terms of inter-thread communication: the caller sends arguments to predefined inlets of the callee, which in turn sends results back to inlets the caller specifies. An inlet is a compiler-generated message handler which processes the receipt of a message for a target thread (see Figure 6).

Threads are the basic unit of parallelism between processors. A thread is ready to run if it has some enabled instruction sequences waiting to be executed; a thread has a continuation vector holding the addresses of all its enabled instruction sequences. Each processor maintains a scheduling queue of ready threads. Finer parallelism within a thread is represented by instruction sequences, which are used to reduce communication latency. Furthermore, parallelism within an instruction sequence can be used to mask processor pipeline latency.

TAM provides mechanisms to control the location of threads, and the compiler may determine this mapping statically. For highly irregular parallel programs, the runtime system needs to provide dynamic load-balancing techniques.

To validate the effectiveness of the TAM execution model, a threaded machine language for TAM, called TL0, was implemented on the CM-5 multiprocessor. The fine-grain parallelism achieved makes TAM especially suitable for tightly-coupled multicomputers.

Multithreaded Message Passing Interface

MPI (Message Passing Interface) tries to collect the best features of many existing message-passing systems, and to improve and standardize them. MPI is a library which encompasses point-to-point and collective message passing, communication scoping, virtual topologies, datatypes, profiling, and environment inquiry. It supports a process-based message-passing paradigm. Processes in MPI belong to groups, and contexts are used to restrict the scope of message passing. A group of processes plus a safe communication context form a communicator, which guarantees that one communication is isolated from another. There are two types of communicators: intra-communicators, for operations within a single group of processes, and inter-communicators, for point-to-point communication between two non-overlapping groups of processes.
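For example, the usual MPI point-to-point exchange inside the default intra-communicator looks like the following minimal sketch (standard MPI-1 calls; the tag and message contents are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank within the group */

        if (rank == 0) {
            value = 42;
            /* Send to rank 1 inside MPI_COMM_WORLD; the communicator
             * isolates this traffic from any other communication. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }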

Haines et al. at the NASA Langley Research Center developed a runtime system called Chant, which is layered on top of MPI and POSIX pthreads. In Chant, threads are supported at the user level, and pthreads are extended to support two kinds of global threads: the chanter, and the rope (a collection of chanters). A global thread identifier is composed of a process identifier (a process rank within the group) and a thread rank within the process. Communication operations resemble the MPI operations. Communication between threads within the same process can use shared-memory primitives, while communication between threads in different processes utilizes point-to-point primitives. In point-to-point communication, Chant employs user-level polling for outstanding messages. Like Nexus, Chant supports remote service requests, for remote fetch, remote state update, and remote procedure calls. The existing pthreads primitives are extended to provide a coherent interface for global threads. Chant supports the message-passing and active message-passing paradigms, but it has not addressed the issue of thread scheduling across different addressing spaces.

Chant is currently running both on a network of workstations and on the Intel Paragon multicomputer. It is being used to support Opus, a superset of High Performance Fortran (HPF) with extensions that support parallel tasks and shared data abstractions.

Conclusion

Threads are an emerging model for exploiting multiple levels of concurrency and computational efficiency in both parallel machines and networks of workstations. This paper investigates typical existing multithreaded distributed systems, using a uniform terminology for clarity. Table 1 summarizes the varied terminology of the different research teams, and provides the association with the terms thread, process and processor which we use. Compared to the single-processor and shared-memory multiprocessor cases, multithreading on distributed systems encounters more difficulties, due to the underlying physically distributed memories and the need to address new issues such as thread scheduling, migration and communication across separate addressing spaces.

Table 1: Threads, processes and processors

Term                 | Cilk                 | TPVM                 | LPVM                 | Nexus                | TAM                  | Chant
Instruction sequence | instruction sequence | instruction sequence | instruction sequence | instruction sequence | thread               | instruction sequence
Thread               | thread (closure)     | thread               | thread               | thread               | activation frame     | chanter (rope)
Process              | -                    | task                 | task                 | context              | -                    | process
Processor            | processor            | processor            | host                 | node                 | processor            | processor

The systems we discuss here are either built on top of such existing systems as PVM and MPI, through the integration of a user-level threads package, or are self-contained. Because of the nature of the underlying systems, usually there is no shared address space among threads, even if those threads belong to the same process. The multithreaded programming environments supported by these systems include explicit message-passing, active message-passing and implicit data-driven paradigms. Thread scheduling can be taken care of by the software systems themselves, or by the underlying operating system, or simply left to users and compilers.

TPVM supports both the message-passing and data-driven paradigms. Threads have no shared data, even if they belong to the same PVM process. This makes it easier to activate a thread at any appropriate location, but communication between threads within the same process by message passing, instead of through shared memory, is clearly not natural and may result in extra overhead. In TPVM the lack of thread and signal safety has not yet been addressed, and its centralized server may become a bottleneck for large applications. Modifications in LPVM make it thread- and signal-safe, and thread scheduling is completely taken care of by the operating system. To retain the user interface, LPVM supports only the message-passing paradigm, i.e., threads cannot share data, even though the underlying machines have physical shared memories. The merit of LPVM's approach of combining PVM with multithreading on shared-memory multiprocessors remains unclear: LPVM reduces the communication latency by taking advantage of shared memory in a multiprocessor, but seems not yet to have addressed multithreading on multiple hosts.

Although most user thread packages, such as pthreads, support communication only between threads located in the same address space, Chant extends POSIX pthreads by adding a new global thread, the chanter, and provides point-to-point and remote-service-request communication operations by utilizing the underlying MPI system. Threads within the same process can share data. Furthermore, Chant supports collective operations among thread groups using a scoping mechanism called ropes. Chant supports the message-passing paradigm. In contrast, Nexus is a multithreaded system mainly used as a compiler target. It supports dynamic processor acquisition, dynamic address-space creation, a global memory model, and remote service requests; Nexus supports the active message-passing paradigm. The burden of mapping threads to processes and processes to physical processors is left to users or compilers. Finally, Cilk is a C-based multithreaded system suitable for dynamic tree-like computations. It supports the data-driven paradigm, and thread scheduling is automatically taken care of by the runtime system. Cilk has not yet shown its efficiency in data-parallel applications, and its continuation-passing style results in extra copying overhead, but it benefits from its work-stealing strategy. TAM also supports the data-driven paradigm; compared with Cilk it exploits even finer-grain parallelism, and is therefore more suitable for tightly-coupled distributed-memory multicomputers. Table 2 summarizes the important features of the systems discussed here.

For further advances in multithreading on distributed systems, the following three topics should be addressed:

- Integration of multiple programming paradigms, to support both data-parallel and task-parallel applications. Multithreading has demonstrated its suitability for task-parallel computations, but most existing multithreaded systems have not yet obtained good efficiency for data-parallel problems.

- Coordination between thread and process scheduling. In multithreading systems both thread and process scheduling are supported, but little attention has been paid so far to coordinating these two activities for system-wide global load balancing. Thread scheduling can be viewed as a mechanism for short-term load balancing within a single user process, while process scheduling helps long-term load balancing. One must therefore recognize when a workstation is disconnected or newly available, and migrate processes accordingly.

- Fast communication mechanisms. The protocols of existing communication networks are heavyweight, and the software overhead of communication dominates the hardware cost. Improvements in communication performance can dramatically increase the applicability of multithreading.

In our view, advances in each of these areas will have a dramatic effect on the utility and performance of thread-based systems.


Table 2: Capability comparison of multithreading systems

Capability | Cilk | TPVM (PVM-based) | LPVM (PVM-based) | Nexus | TAM | Chant (MPI-based)
Parallel programming paradigms | data-driven | message passing, data-driven | message passing | active message passing | data-driven | active and message passing
Languages supported | Cilk (a C-based language) | C language | C language | mainly used as a compiler target (for pC++, HPF, Fortran D) | TL0, an abstract machine language for Id | Opus (Fortran-based)
Thread implementation | user-level | user-level | system-level | user-level | user-level | user-level
Underlying software systems required | no special software system required | PVM without any modification; GNU REX thread library | PVM with minimal modification for thread safety; OS-supported threads | POSIX, DCE, C and Solaris threads; various protocols (TCP, Intel NX) | no special software system required | MPI; pthreads with extensions
Thread scheduling strategies | work stealing (select one processor at random) | dynamic scheduling, at thread creation or at the request of users | taken care of by the OS | taken care of by users or compilers | compiler-controlled and runtime-supported frame and thread scheduling | taken care of by users or compilers
Underlying architectures | multiprocessors, multicomputers, a cluster of workstations | heterogeneous distributed environment | shared-memory multiprocessors | heterogeneous distributed environments | tightly-coupled multicomputers | multicomputers, a network of workstations
Strength | efficient tree-like computations; dynamic scheduling strategy | supports two paradigms; portable on any PVM system | takes advantage of shared memory; modifies PVM for thread safety | global name space; supports mixed parallel computations; remote service requests | extended data-driven paradigm; compiler-controlled scheduling | asynchronous remote service requests
Weakness | frequent argument-copying overhead; not suitable for data-parallel problems | only a subset of the library usable, due to thread safety; a bottleneck caused by the server | no support for thread migration from one host to another | no support for an automatic scheduling strategy; not suitable for direct programming | only suitable for tightly-coupled machines, due to fine parallelism | users take care of thread scheduling


References

The Express way to distributed processing. Supercomputing Review, May.

Anant Agarwal, Ricardo Bianchini, David Chaiken, and six others. The MIT Alewife machine: Architecture and performance. In Proceedings of ISCA.

Robert Alverson, David Callahan, Daniel Cummings, and three others. The Tera computer system. In Proceedings of the International Conference on Supercomputing, June.

Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems, February.

David L. Black. Scheduling support for concurrency and parallelism in the Mach operating system. IEEE Computer, May.

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, July.

Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), Santa Fe, NM, USA, November.

Ralph Butler and Ewing Lusk. User's guide to the p4 parallel programming system. Technical report, Argonne National Laboratory, October.

C. Thacker, L. Stewart, and E. Satterthwaite, Jr. Firefly: A multiprocessor workstation. IEEE Transactions on Computers, August.

K. Mani Chandy and Carl Kesselman. CC++: A declarative concurrent object-oriented programming notation. In Gul Agha, Peter Wegner, and Aki Yonezawa, editors, Research Directions in Concurrent Object-Oriented Programming. MIT Press, Cambridge, MA.

S. Chowdhury. The greedy load sharing algorithm. Journal of Parallel and Distributed Computing, May.

E. Cooper and R. Draves. C threads. Technical Report CMU-CS, Department of Computer Science, Carnegie Mellon University.

David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser, and Thorsten von Eicken. TAM: A compiler controlled threaded abstract machine. Journal of Parallel and Distributed Computing, Special Issue on Dataflow, June.

D. Cheriton. The V distributed system. Communications of the ACM, March.

D.L. Eager, E.D. Lazowska, and J. Zahorjan. Adaptive load sharing in homogeneous distributed systems. IEEE Transactions on Software Engineering, May.

D.L. Eager, E.D. Lazowska, and J. Zahorjan. A comparison of receiver-initiated and sender-initiated adaptive load sharing. Performance Evaluation.

Adam Ferrari and V.S. Sunderam. TPVM: Distributed concurrent computing with lightweight processes. In Proceedings of IEEE High Performance Distributed Computing, August.

Ian Foster, Carl Kesselman, and Steven Tuecke. Nexus: Runtime support for task-parallel programming languages. Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, August.

Ian T. Foster and K. Mani Chandy. Fortran M: A language for modular parallel programming. Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, June.

Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidyalingam S. Sunderam. PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Network Parallel Computing. Scientific and Engineering Computation Series, MIT Press, Cambridge, MA.

Matthew Haines, David Cronk, and Piyush Mehrotra. On the design of Chant: A talking threads package. In Proceedings of Supercomputing. Also as ICASE Report, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center.

M. Homewood and M. McLaren. Meiko CS-2 interconnect Elan-Elite design. In Proceedings of Hot Interconnects, August.

Herbert H.J. Hum, Olivier Maquelin, Kevin B. Theobald, and fourteen others. A design study of the EARTH multiprocessor. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, June.

Michael J. Beckerle. Overview of the START multithreaded computer. In Proceedings of IEEE COMPCON, February.

Richard M. Karp and Yanjun Zhang. Randomized parallel algorithms for backtrack search and branch-and-bound computation. Journal of the ACM, July.

M. Weiser, A. Demers, and C. Hauser. The portable common runtime approach to interoperability. In Proceedings of the ACM Symposium on Operating Systems Principles, Litchfield Park, Arizona.

Brian D. Marsh, Michael L. Scott, Thomas J. LeBlanc, and Evangelos P. Markatos. First-class user-level threads. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), October.

D.E. Maydan, J.L. Hennessy, and M.S. Lam. Efficient and exact data dependence analysis. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, June.

Piyush Mehrotra and Matthew Haines. An overview of the Opus language and runtime system. Technical report, ICASE, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, May.

Frank Mueller. A library implementation of POSIX threads under UNIX. In Proceedings of the USENIX Conference.

Anthony Skjellum, Nathan E. Doss, Kishore Viswanathan, Aswini Chowdappa, and Purushotham V. Bangalore. Extending the message passing interface (MPI). In Anthony Skjellum and Donna S. Reese, editors, Proceedings of the Scalable Parallel Libraries Conference II. IEEE Computer Society Press, October.

T. Anderson, E. Lazowska, and H. Levy. The performance implications of thread management alternatives for shared-memory multiprocessors. IEEE Transactions on Computers, December.

Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active Messages: A mechanism for integrated communication and computation. In Proceedings of the International Symposium on Computer Architecture, Gold Coast, Australia, May.

Honbo Zhou and Al Geist. LPVM: A lightweight process (thread) based PVM for SMP systems. Technical report, Oak Ridge National Laboratory.