
IV. Parallel Operating Systems

João Garcia, Paulo Ferreira, and Paulo Guedes

IST/INESC, Lisbon, Portugal

1. Introduction
2. Classification of Parallel Computer Systems
   2.1 Parallel Computer Architectures
   2.2 Parallel Operating Systems
3. Operating Systems for Symmetric Multiprocessors
   3.1 Process Management
   3.2 Scheduling
   3.3 Process Synchronization
   3.4 Examples of SMP Operating Systems
4. Operating Systems for NORMA environments
   4.1 Architectural Overview
   4.2 Distributed Process Management
   4.3 Parallel File Systems
   4.4 Popular Message-Passing Environments
5. Scalable Shared Memory Systems
   5.1 Shared Memory MIMD Computers
   5.2 Distributed Shared Memory
References

Summary. Parallel operating systems are the interface between parallel computers (or computer systems) and the applications (parallel or not) that are executed on them. They translate the hardware's capabilities into concepts usable by programming languages.

Great diversity marked the beginning of parallel architectures and their operating systems. This diversity has since been reduced to a small set of dominating configurations: symmetric multiprocessors running commodity applications and operating systems (UNIX and Windows NT) and multicomputers running custom kernels and parallel applications. Additionally, there is some (mostly experimental) work done towards the exploitation of the shared memory paradigm on top of networks of workstations or personal computers.

In this chapter, we discuss the operating system components that are essential to support parallel systems and the central concepts surrounding their operation: scheduling, synchronization, multi-threading, inter-process communication, memory management and fault tolerance.

Currently, SMP computers are the most widely used multiprocessors. Users find it a very interesting model to have a computer which, although it derives its processing power from a set of processors, does not require any changes to applications and only minor changes to the operating system. Furthermore, the most popular parallel programming languages have been ported to SMP architectures, also enabling the execution of demanding parallel applications on these machines.

However, users who want to exploit parallel processing to the fullest use those same parallel programming languages on top of NORMA computers. These multicomputers with fast interconnects are the ideal hardware support for message-passing parallel applications. The surviving commercial models with NORMA architectures are very expensive machines, which one will find running calculus-intensive applications, such as weather forecasting or fluid dynamics modelling.

We also discuss some of the experiments that have been made both in hardware (DASH, Alewife) and in software systems (TreadMarks, Shasta) to deal with the issues of maintaining consistency in shared-memory systems and to prove their applicability on a large scale.


1. Introduction

Parallel operating systems are primarily concerned with managing the resources of parallel machines. This task faces many challenges: application programmers demand all the performance possible, many hardware configurations exist and change very rapidly, yet the operating system must increasingly be compatible with the mainstream versions used in personal computers and workstations, due both to user pressure and to the limited resources available for developing new versions of these systems.

In this chapter, we describe the major trends in parallel operating systems. Since the architecture of a parallel operating system is closely influenced by the hardware architecture of the machines it runs on, we have divided our presentation in three major groups: operating systems for small scale symmetric multiprocessors (SMP), operating system support for large scale distributed memory machines, and scalable distributed shared memory machines.

The first group includes the current versions of Unix and Windows NT, where a single operating system centrally manages all the resources. The second group comprehends both large scale machines connected by special purpose interconnections and networks of personal computers or workstations, where the operating system of each node locally manages its resources and collaborates with its peers via message passing to globally manage the machine. The third group is composed of non-uniform memory access (NUMA) machines, where each node's operating system manages the local resources and interacts at a very fine grain level with its peers to manage and share global resources.

We address the major architectural issues in these operating systems, with a greater emphasis on the basic mechanisms that constitute their core: process management, file system, memory management and communications. For each component of the operating system, the presentation covers its architecture, describes some of the interfaces included in these systems to provide programmers with a greater level of flexibility and power in exploiting the hardware, and addresses the implementation efforts to modify the operating system in order to exploit the parallelism of the hardware. These topics include support for multi-threading, internal parallelism of the operating system components, distributed process management, parallel file systems, inter-process communication, low-level resource sharing and fault isolation. These features are illustrated with examples from both commercial operating systems and research proposals from industry and academia.

2. Classification of Parallel Computer Systems

2.1 Parallel Computer Architectures

Operating systems were created to present users and programmers with a view of computers that allows the use of computers abstracting away from the details of the machine's hardware and the way it operates. Parallel operating systems in particular enable user interaction with computers with parallel architectures. The physical architecture of a computer system is therefore an important starting point for understanding the operating system that controls it. There are two famous classifications of parallel computer architectures: Flynn's [Fly72] and Johnson's [Joh88].

2.1.1 Flynn's classification of parallel architectures. Flynn divides computer architectures along two axes, according to the number of data sources and the number of instruction sources that a computer can process simultaneously. This leads to four categories of computer architectures:

SISD - Single Instruction Single Data. This is the most widespread architecture, where a single processor executes a sequence of instructions that operate on a single stream of data. Examples of this architecture are the majority of computers in use nowadays, such as personal computers (with some exceptions such as Intel Dual Pentium machines), workstations and low-end servers. These computers typically run either a version of the Microsoft Windows operating system or one of the many variants of the UNIX operating system (Sun Microsystems' Solaris, IBM's AIX, ...), although these operating systems include support for multiprocessor architectures and are therefore not strictly operating systems for SISD machines.

MISD - Multiple Instruction Single Data. No existing computer corresponds to the MISD model, which was included in this description mostly for the sake of completeness. However, we could envisage special purpose machines that could use this model, for example a "cracker computer" with a set of processors where all are fed the same stream of ciphered data, which each tries to crack using a different algorithm.

SIMD - Single Instruction Multiple Data. The SIMD architecture is used mostly for supercomputer vector processing and calculus. The programming model of SIMD computers is based on distributing sets of homogeneous data among an array of processors. Each processor executes the same operation on a fraction of a data element. SIMD machines can be further divided into pipelined and parallel SIMD. Pipelined SIMD machines execute the same instruction for an array of data elements by fetching the instruction from memory only once and then executing it in an overlapped way on a pipelined high performance processor (e.g. Cray 1). In this case, each fraction of a data element (vector) is in a different stage of the pipeline. In contrast to pipelined SIMD, parallel SIMD machines are made from dedicated processing elements connected by high-speed links. The processing elements are special purpose processors with narrow data paths (some as narrow as one bit) which execute a restricted set of operations. Although there is a design effort to keep processing elements as cheap as possible, SIMD machines are quite rare due to their high cost, limited applicability and to the fact that they are virtually impossible to upgrade, thereby representing a "once in a lifetime" investment for many companies or institutions that buy one.


An example of a SIMD machine is the Connection Machine (e.g. CM-2) [TMC90a], built by the Thinking Machines Corporation. It has up to 64k processing elements, where each of them operates on a single bit and is only able to perform very simple logical and mathematical operations. For each set of 32 processing cells there are four memory chips, a numerical processor for floating point operations and a router connected to the cell interconnection. The cells are arranged in a 12-dimensional hyper-cube topology. The concept of having a vast amount of very simple processing units served three purposes: to eliminate the classical memory access bottleneck by having processors distributed all over the memory, to experiment with the scalability limits of parallelism and to approximate the brain model of computation.

The MasPar MP-2 [Bla90] by the MasPar Computer Corporation is another example of a SIMD machine. This computer is based on a 2-dimensional lattice of up to 16k processing elements driven by a VAX computer that serves as a front end to the processing elements. Each processing element is connected to its eight neighbours on the 2D mesh to form a torus. The user processes are executed on the front-end machine, which passes the parallel portions of the code to another processor called the Data Parallel Unit (DPU). The DPU is responsible for executing operations on singular data and for distributing the operations on parallel data among the processing elements that constitute the lattice.

MIMD - Multiple Instruction Multiple Data. These machines constitute the core of the parallel computers in use today. The MIMD design is a more general one where machines are composed of a set of processors that execute different programs which access their corresponding datasets (although not necessarily in mutual exclusion). These architectures have been a success because they are typically cheaper than special purpose SIMD machines, since they can be built with off-the-shelf components. To further detail the presentation of MIMD machines it is now useful to introduce Johnson's classification of computer architectures.

2.1.2 Johnson's classification of parallel architectures. Johnson's classification [Joh88] is oriented towards the different memory access methods. This is a much more practical approach since, as we saw above, all but the MIMD class of Flynn's are either virtually extinct or never existed. We take the opportunity of presenting Johnson's categories of parallel architectures to give some examples of MIMD machines in each of those categories. Johnson divides computer architectures into:

UMA - Uniform Memory Access. UMA machines guarantee that all processors use the same mechanism to access all memory positions and that those accesses are performed with similar performance. This is achieved with a shared memory architecture where all processors access a central memory unit. In a UMA machine, processors are the only parallel element (apart from caches, which we discuss below). The remaining architecture components (main memory, file system, networking...) are shared among the processors. Each processor has a local cache where it stores recently accessed memory positions. Most major hardware vendors sell UMA machines in the form of symmetric multiprocessors with write-through caches. This means that writes to the cache are immediately written to main memory. In the alternative policy of write-back caching, writes to the cache are only written to main memory when the cache line containing the write is evicted from the processor's cache.

In order to keep these processor caches coherent, and thereby give each processor a consistent view of the main memory, a hardware coherence protocol is implemented. There are two techniques for maintaining cache coherence: directory-based and bus snooping protocols. Bus snooping is possible only when the machine's processors are connected by a common bus. In this case, every processor can see all memory accesses requested on the common bus and update its own cache accordingly. When snooping is not possible, a directory-based protocol is used. These protocols require that each processor keep a map of how memory locations are distributed among all processors, in order to be able to find them among the machine's processors when requested. Typically, bus snooping is used in shared memory SMPs and directory-based protocols are used in architectures such as the SGI Origin 2000.

The AlphaServer family [Her97], developed by the Digital Equipment Corporation (now owned by Compaq), together with Digital UNIX, has been one of the most commercially successful server and operating system solutions. AlphaServers are symmetric multiprocessors based on Alpha processors, up to 4 of them, connected by a snoopy bus to a motherboard with up to 4 GByte of memory. The AlphaServer has separate address and data buses and uses a cache write-back protocol. When a processor writes to its cache, the operation is not immediately passed on to main memory, but only when that data element has to be used by another memory block. Therefore, processors have to snoop the bus in order to reply to other processors that might request data whose more recent version is in their cache and not in main memory.

A more recent example of a UMA multiprocessor is the Enterprise 10000 Server [Cha98], which is Sun Microsystems' latest high-end server. It is configured with 16 interconnected system boards. Each board can have up to 4 UltraSPARC 336MHz processors, up to 4GB of memory and 2 I/O connections. All boards are connected to four address buses and to a 16x16 connection Gigaplane-XB interconnection matrix. This architecture, called Starfire, is a mixture between a point-to-point processor/memory topology and a snoopy bus. The system boards' memory addresses are interleaved among the four address buses, which are used to send packets with memory requests from the processors to the memory. The requests are snooped by all processors connected to the bus, which allows them to implement a snooping coherency protocol. Since all processors snoop all requests and update their caches, the replies from the memory with the data requested by the processors do not have to be seen by all processors, and are sent on a dedicated high-throughput Gigaplane connection between the memory bank and the processor, thereby optimising memory throughput. Hence, the architecture's bottleneck is the address bus, which cannot handle as many requests as the memory and the Gigaplane interconnection. The Enterprise 10000 runs Sun's Solaris operating system.

NUMA - Non-Uniform Memory Access. NUMA computers are machines where each processor has its own memory. However, the processor interconnection allows processors to access another processor's local memory. Naturally, an access to local memory has a much better performance than a read or write of a remote memory on another processor. The NUMA category includes replicated memory clusters, massively parallel processors, cache-coherent NUMA (CC-NUMA, such as the SGI Origin 2000) and cache-only memory architectures. Two recent examples of NUMA machines are the SGI Origin 2000 and Cray Research's T3E. Although NUMA machines give programmers a global memory view of the processors' aggregate memory, performance-conscious programmers still tend to try to locate their data on the local memory portion and avoid remote memory accesses. This is one of the arguments used in favour of MIMD shared memory architectures.

The SGI Origin 2000 [LL94] is a cache-coherent NUMA multiprocessor designed to scale up to 512 nodes. Each node has one or two R10000 195MHz processors, up to 4GB of coherent memory and a controller called the hub. The nodes are connected via an SGI SPIDER router chip which routes traffic between six Craylink network connections. These connections are used to create a cube topology where each cube vertex has a SPIDER router with two nodes appended to it, and three additional SPIDER connections are used to link the vertex to its neighbours. This is called a bristled cube. The remaining SPIDER connection is used to extend these 32 processors (4 processors x 8 vertices) by connecting it to other cubes. The Origin 2000 uses a directory-based coherence protocol similar to the Stanford DASH protocol [LLT+93]. The Origin 2000 is one of the most innovative architectures and is the setting of many recent research papers (e.g. [JS98], [BW97]).

The Cray T3E is a NUMA massively parallel multicomputer and one of the most recent machines by the Cray Research Corporation (now owned by SGI). The Cray T3E has from 6 to 2048 Alpha processors (some configurations use 600MHz processors) connected in a 3D cube interconnection topology. It has a 64-bit architecture with a maximum distributed globally addressable memory of 4 TBytes, where remote memory access messages are automatically generated by the hardware. Although remote memory can be addressed, this is not transparent to the programmer. Since there is no hardware mechanism to keep the Cray's processor caches consistent, they contain only local data. Each processor runs a copy of Cray's UNICOS microkernel, on top of which language-specific message-passing libraries support parallel processing. The I/O system is distributed over a set of nodes on the surface of the cube.

NORMA - No Remote Memory Access. In a NORMA machine, a processor is not able to access the memory of another processor. Consequently, all inter-processor communication has to be done by the explicit exchange of messages between processors and there is no global memory view. NORMA architectures are more cost-effective since they do not need all the hardware infrastructure associated with allowing multiple processors to access the same memory unit, and therefore provide more raw processing power for a lower price, at the expense of the friendliness of the system's programming model.

The CM-5 [TMC90b] by the Thinking Machines Corporation is a MIMD machine, which contrasts with the earlier SIMD CM-1 and CM-2, a further sign of the general tendency of vendors to concentrate on MIMD architectures. The CM-5 is a private memory NORMA machine with a special type of processor interconnection called a fat tree. In a fat tree architecture, processors are the leaves of an interconnection tree where the bandwidth is bigger at the branches placed higher on the tree. This special interconnection reduces the bandwidth degradation implied by an increase of the distance between processors, and it also enables bandwidth to remain constant as the number of processors grows. Each node in a CM-5 has a SPARC processor, a 64k cache and up to 32MB of local 64-bit-word memory. The CM-5 processors, which run a microkernel, are controlled by a control processor running UNIX, which serves as the user front end.

The Intel Paragon Supercomputer [Div95] is a 2D rectangular mesh of up to 4096 Intel i860 XP processors. This supercomputer, with its 50MHz 42 MIPS processors connected point-to-point at 200MByte/sec full duplex, competed for a while with the CM-2 for the title of the "world's fastest supercomputer". The processor mesh is supported by I/O processors connected to hard disks and other I/O devices. The Intel Paragon runs the OSF/1 operating system, which is based on Mach.

In conclusion, it should be said that, after a period when very varied designs coexisted, parallel architectures are converging towards a generalized use of shared memory MIMD machines. There are calculus-intensive domains where NORMA machines, such as the CRAY T3-E, are used, but the advantages offered by the SMP architecture, such as the ease of programming, standard operating systems, use of standard hardware components and software portability, have overwhelmed the market.

2.2 Parallel Operating Systems

There are several components in an operating system that can be parallelized. Most operating systems do not approach all of them and do not support parallel applications directly. Rather, parallelism is frequently exploited by some additional software layer such as a distributed file system, distributed shared memory support, or libraries and services that support particular parallel programming languages, while the operating system manages concurrent task execution.

The convergence in parallel computer architectures has been accompanied by a reduction in the diversity of operating systems running on them. The current situation is that most commercially available machines run a flavour of the UNIX OS (Digital UNIX, IBM AIX, HP UX, Sun Solaris, Linux). Others run a UNIX-based microkernel with reduced functionality to optimise the use of the CPU, such as Cray Research's UNICOS. Finally, a number of shared memory MIMD machines run Microsoft Windows NT (soon to be superseded by the high end variant of Windows 2000).

There are a number of core aspects to the characterization of a parallel computer operating system: general features such as the degrees of coordination, coupling and transparency; and more particular aspects such as the type of process management, inter-process communication, parallelism and synchronization and the programming model.

2.2.1 Coordination. The type of coordination among processors in parallel operating systems is a distinguishing characteristic, which conditions how applications can exploit the available computational nodes. Furthermore, application parallelism and operating system parallelism are two distinct issues: while application concurrency can be obtained through operating system mechanisms or by a higher layer of software, concurrency in the execution of the operating system is highly dependent on the type of processor coordination imposed by the operating system and the machine's architecture. There are three basic approaches to coordinating processors:

- Separate supervisor - In a separate supervisor parallel machine, each node runs its own copy of the operating system. Parallel processing is achieved via a common process management mechanism allowing a processor to create processes and/or threads on remote machines. For example, in a multicomputer like the Cray T3E, each node runs its own independent copy of the operating system. Parallel processing is enabled by a concurrent programming infrastructure such as an MPI library (see Section 4. and the sketch after this list), whereas I/O is performed by explicit requests to dedicated nodes. Having a front-end that manages I/O and that dispatches jobs to a back-end set of processors is the main motivation for separate supervisor operating systems.

- Master-slave - A master-slave parallel operating system assumes that the operating system will always be executed on the same processor and that this processor will control all shared resources and in particular process management. A case of a master-slave operating system, as we saw above, is the CM-5. This type of coordination is particularly adapted to single purpose machines running applications that can be broken into similar concurrent tasks. In these scenarios, central control may be maintained without any penalty to the other processors' performance since all processors tend to be beginning and ending tasks simultaneously.


- Symmetric - Symmetric OSs are the most common configuration currently. In a symmetric parallel OS, any processor can execute the operating system kernel. This leads to concurrent accesses to operating system components and requires careful synchronization. In Section 3. we will discuss the reasons for the popularity of this type of coordination.
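To make the separate supervisor scenario concrete, the following minimal sketch shows the kind of program that runs on every node under such an arrangement; it assumes a standard MPI installation and is purely illustrative, not the interface of any particular machine discussed above.

    #include <mpi.h>
    #include <stdio.h>

    /* Each node's operating system copy starts this program independently;
       the MPI library supplies the common process-management and
       communication infrastructure referred to in the text. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* join the parallel job        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process' identity       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes    */
        printf("process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }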

2.2.2 Coupling and transparency. Another important characteristic of parallel operating systems is the system's degree of coupling. Just as in the case of hardware, where we can speak of loosely coupled (e.g. network of workstations) and highly coupled (e.g. vector parallel multiprocessors) architectures, parallel OSs can be meaningfully separated into loosely coupled and tightly coupled operating systems.

Many current distributed operating systems have a highly modular architecture. There is therefore a wide spectrum of distribution and parallelism in different operating systems. To see how influential coupling is in forming the abstraction presented by a system, consider the following extreme examples: on the highly coupled end, a special purpose vector computer dedicated to one parallel application, e.g. weather forecasting, with master-slave coordination and a parallel file system; and on the loosely coupled end, a network of workstations with shared resources (printer, file server), each running its own application and maybe sharing some client-server applications.

Within the spectrum of operating system coupling there are three landmark types:

- Network Operating Systems - These are implementations of loosely coupled operating systems on top of loosely coupled hardware. Network operating systems are the software that supports the use of a network of machines and provides users, who are aware of using a set of computers, with facilities designed to ease the use of remote resources located over the network. These resources are made available as services and might be printers, processors, file systems or other devices. Some resources, of which dedicated hardware devices such as printers, tape drives, etc. are the classical example, are connected to and managed by a particular machine and are made available to other machines in the network via a service or daemon. Other resources, such as disks and memory, can be organized into true distributed systems, which are seamlessly used by all machines. Examples of basic services available on a network operating system are starting a session on a remote computer (remote login), running a program on a remote machine (remote execution) and transferring files to and from remote machines (remote file transfer).

- Distributed Operating Systems - True distributed operating systems correspond to the concept of highly coupled software using loosely coupled hardware. Distributed operating systems aim at giving the user the possibility of transparently using a virtual uniprocessor. This requires having an adequate distribution of all the layers of the operating system and providing a global unified view over process management, file system and inter-process communication, thereby allowing applications to perform transparent migration of data, computations and/or processes. Distributed file systems (e.g. AFS, NFS) and distributed shared memories (e.g. TreadMarks [KDC+94], Midway [BZ91], DiSoM [GC93]) are the most common supports for migrating data. Remote procedure call mechanisms (e.g. Sun RPC, Java RMI, Corba) are used for migration of computations. Process migration is a less common feature; however, some experimental platforms have supported it [Nut94], e.g. Sprite, Emerald.

- Multiprocessor Timesharing OS - This case represents the most common configuration of highly coupled software on top of highly coupled hardware. A multiprocessor is seen by the user as a powerful uniprocessor since it hides away the presence of multiple processors and an interconnection network. We will discuss some design issues of SMP operating systems in the following section.

2.2.3 HW vs. SW. As we mentioned, a parallel operating system provides users with an abstract computational model over the underlying hardware. It is worthwhile showing that this view can be achieved either by the computer's parallel hardware architecture or by a software layer that unifies a network of processors or computers. In fact, there are implementations of every computational model in both hardware and software systems:

- The hardware version of the shared memory model is represented by symmetric multiprocessors, whereas the software version is achieved by unifying the memory of a set of machines by means of a distributed shared memory layer.
- In the case of the distributed memory model, there are multicomputer architectures where accesses to local and remote data are explicitly different and, as we saw, have different costs. The equivalent software abstractions are explicit message-passing inter-process communication mechanisms and programming languages.

- Finally, the SIMD computation model of massively parallel computers is mimicked by software through data parallel programming. Data parallel programming is a style of programming geared towards applying parallelism to large data sets, by distributing data over the available processors in a "divide and conquer" mode (a sketch of this style follows this list). An example of a data parallel programming language is HPF (High Performance Fortran).
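Since HPF itself is a Fortran dialect, the following sketch illustrates the same data-parallel style in C with POSIX threads instead: the same operation is applied to disjoint slices of a large array, one slice per worker. The array size and thread count are arbitrary illustrative values, not taken from the text.

    #include <pthread.h>
    #include <stdio.h>

    #define N        8000
    #define NTHREADS 4

    static double a[N], b[N], c[N];

    /* Every worker applies the same operation to its own slice of the data,
       mimicking the SIMD/data-parallel model in software. */
    static void *slice_add(void *arg)
    {
        long id = (long)arg;
        long chunk = N / NTHREADS;
        for (long i = id * chunk; i < (id + 1) * chunk; i++)
            c[i] = a[i] + b[i];
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long id = 0; id < NTHREADS; id++)
            pthread_create(&t[id], NULL, slice_add, (void *)id);
        for (long id = 0; id < NTHREADS; id++)
            pthread_join(t[id], NULL);
        printf("c[0] = %f\n", c[0]);
        return 0;
    }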

2.2.4 Protection. Parallel computers, being multi-processing environments, require that the operating system provide protection among processes and between processes and the operating system, so that erroneous or malicious programs are not able to access resources belonging to other processes. Protection is the access control barrier which all programs must pass before accessing operating system resources. Dual mode operation is the most common protection mechanism in operating systems. It requires that all operations that interfere with the computer's resources and their management be performed under operating system control, in what is called protected or kernel mode (in opposition to unprotected or user mode). The operations that must be performed in kernel mode are made available to applications as an operating system API. A program wishing to enter kernel mode calls one of these system functions via an interruption mechanism, whose hardware implementation varies among different processors. This allows the operating system to verify the validity and authorization of the request and to execute it safely and correctly using kernel functions, which are trusted and well-behaved.
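As an illustration of crossing this barrier on a UNIX-like system, the sketch below issues the same kernel-mode operation twice: once through the ordinary library wrapper and once through the generic system-call entry point. The syscall(2) interface used in the second request is Linux-specific and is shown only as an assumption for the example.

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        /* The write() wrapper executes a trap instruction that switches the
           processor into kernel mode; the kernel validates the arguments and
           performs the I/O on the caller's behalf before returning to user mode. */
        write(1, "hello\n", 6);

        /* The same request issued through the generic system-call entry point. */
        syscall(SYS_write, 1, "hello\n", 6);
        return 0;
    }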

3. Operating Systems for Symmetric Multiprocessors

Symmetric Multiprocessors (SMP) are currently the predominant type of parallel computers. The root of their popularity is the ease with which both operating systems and applications can be ported onto them. Applications programmed for uniprocessors can be run on an SMP unchanged, while porting a multitasking operating system onto an SMP requires only minor changes. Therefore, no other parallel architecture is as flexible at running both parallel and commodity applications on top of generalist operating systems (UNIX, NT). The existence of several processors raises scalability and concurrency issues, e.g. hardware support for cache coherence (see Section 2.), for the initialisation of multiple processors and for the optimisation of the processor interconnection. However, regarding the operating system, the specific aspects to be considered in an SMP are process synchronization and scheduling.

3.1 Process Management

A process is a program's execution context. It is a set of data structures that specify all the information needed to execute a program: its execution state (global variables), the resources it is using (files, processes, synchronization variables), its security identity and accounting information.

There has been an evolution in the type of tasks executed by computers. The state description of a traditional process contains much more information than what is needed to describe a simple flow of execution, and therefore commuting between processes can be a considerably costly operation. Hence, most operating systems have extended processes with lightweight processes (or threads) that represent multiple flows of execution within the process addressing space. A thread needs to maintain much less state information than a process; typically it is described by its stack, CPU state and a pointer to the function it is executing, since it shares process resources and parts of the context with other threads within the same process. The advantage of threads is that, due to their low creation and switching performance penalties, they become a very interesting way to structure computer programs and to exploit parallelism.


There are operating systems which provide kernel threads (Mach, Sun Solaris), while others implement threads as a user-level library. In particular, user threads, i.e. threads implemented as a user-level library on top of a single process, have very low commuting costs. The major disadvantage of user-level threads is the fact that the assignment of tasks to threads at user level is not seen by the operating system, which continues to schedule processes. Hence, if a thread in a process is blocked on a system call (as is the case in many UNIX calls), all other threads in the same process become blocked too.

Therefore, support for kernel threads has gradually appeared in conventional computer operating systems. Kernel threads overcome the above disadvantages of user-level threads and increase the performance of the system.

A crucial aspect of thread performance is the locking policy within the thread creation and switching functions. Naive implementations of these features guarantee their execution as critical sections by protecting the thread package with a single lock. More efficient implementations use finer and more careful locking schemes. A short example of threads sharing process state under a lock is sketched below.
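The following sketch uses the POSIX threads library to show the basic properties discussed above: the two threads share the process's global data (and must therefore synchronize), while each keeps its local variables on its own stack. The iteration count and names are arbitrary.

    #include <pthread.h>
    #include <stdio.h>

    static int shared_counter = 0;                 /* shared: one copy per process        */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        int local = 0;                             /* private: lives on this thread's stack */
        for (int i = 0; i < 100000; i++) {
            local++;
            pthread_mutex_lock(&m);                /* critical section around shared state  */
            shared_counter++;
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%d\n", shared_counter);            /* 200000: no updates were lost */
        return 0;
    }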

3.2 Scheduling

Scheduling is the activity of assigning processor time to the active tasks (processes or threads) in a computer and of commuting them. Although sophisticated schedulers implement complex process state machines, there is a basic set of process (or thread) states that can be found in any operating system: a process can be running, ready or blocked. A running process is currently executing on one of the machine's processors, whereas a ready process is able to be executed but has no available processor. A blocked process is unable to run because it is waiting for some event, typically the completion of an I/O operation or the occurrence of a synchronization event.

The run queue of an operating system is a data structure in the kernel that contains all processes that are ready to be executed and await an available processor. Typically, processes are enqueued in the run queue and are scheduled onto the earliest available processor. The scheduler executes an algorithm which places the running task in the run queue, selects a ready task from the run queue and starts its execution. At the operating system level, what unites the processors in an SMP is the existence of a single run queue. Whenever a process is to be removed from a processor, this processor runs the scheduling algorithm to select a new running process. Since the scheduler can be executed by any processor, it must be run as a critical section, to avoid the simultaneous choice of a process by two separate executions of the scheduler (a minimal sketch of such a queue follows). Additionally, most SMPs provide methods to allow a process to run only on a given processor.
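A minimal sketch of such a shared run queue, reduced here to a FIFO list protected by a single lock; the structure fields and function names are illustrative and do not come from any particular kernel.

    #include <pthread.h>
    #include <stddef.h>

    struct task { struct task *next; /* saved registers, state, ... */ };

    static struct task *head = NULL, *tail = NULL;
    static pthread_mutex_t runq_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Any processor may call these, so the queue is manipulated as a
       critical section to avoid two processors picking the same task. */
    void enqueue_ready(struct task *t)
    {
        pthread_mutex_lock(&runq_lock);
        t->next = NULL;
        if (tail) tail->next = t; else head = t;
        tail = t;
        pthread_mutex_unlock(&runq_lock);
    }

    struct task *pick_next(void)
    {
        pthread_mutex_lock(&runq_lock);
        struct task *t = head;
        if (t) { head = t->next; if (!head) tail = NULL; }
        pthread_mutex_unlock(&runq_lock);
        return t;                /* NULL means no ready task is available */
    }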

The simplest scheduling algorithm is first-come first-served. In this case, processes are ordered according to their creation order and are scheduled in that order, typically by means of a first-in first-out (FIFO) queue. A slight improvement in overall performance can be achieved by scheduling jobs according to their expected duration. Estimating this duration can be a tricky issue.

Priority scheduling is the approach taken in most modern operating systems. Each job is assigned a priority and scheduled according to that value. Jobs with equal priorities are scheduled in FIFO order.

Scheduling can be preemptive or non-preemptive. In preemptive scheduling, a running job can be removed from the processor if another job with a higher priority becomes active. The advantage of non-preemptive scheduling is simplicity: by avoiding preempting a job, the operating system knows that the process has left the processor correctly, without leaving its data structures or the resources it uses corrupted. The disadvantage of non-preemption is the slowness of reaction to hardware events, which cannot be handled until the running process has relinquished the processor.

There are a number of measures of the quality of an OS's scheduling algorithm. The first is CPU utilization: we would like to optimise the amount of time that the machine's processors are busy. CPU utilization is not, however, a user-oriented metric. Users are more interested in other measures, notably the amount of time a job needs to run and, in an interactive system, how responsive the machine is, i.e. whether a user-initiated request yields results quickly or not. A measure that translates all these concerns well is the average waiting time of a process, i.e. the amount of time a process spends in the waiting queue.

The interested reader may find more about scheduling strategies in Chapter VI.

3.3 Process Synchronization

In an operating system with multiple threads of execution, accesses to shared data structures require synchronization in order to guarantee that these are used as critical sections. Gaining permission to enter a critical section, which involves testing whether another process already has exclusive access to this section and, if not, locking all other processes out of it, has to be performed atomically. This requires hardware mechanisms to guarantee that this sequence of actions is not preempted. Mutual exclusion in critical sections is achieved by requiring that a lock is taken before entering the critical section and by releasing it after exiting the section.

The two most common techniques for manipulating locks are the test-and-set and swap operations. Test-and-set atomically sets a lock to its locked state and returns the previous state of that lock to the caller process. If the lock was not set, it is set by the test-and-set call. If it was locked, the calling process will have to wait until it is unlocked.

The swap operation exchanges the contents of two memory locations. Swapping the location of a lock with a memory location containing the value corresponding to its locked state is equivalent to a test-and-set operation.
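A lock built on test-and-set can be sketched as follows. The GCC-style atomic builtins stand in for the processor's actual atomic instruction, and the type and function names are illustrative assumptions.

    /* Busy-waiting (spin) lock built on an atomic test-and-set operation. */
    typedef volatile int spinlock_t;

    void spin_lock(spinlock_t *l)
    {
        /* __sync_lock_test_and_set atomically writes 1 and returns the previous
           value; looping until the previous value is 0 means the lock was taken. */
        while (__sync_lock_test_and_set(l, 1))
            ;   /* spin: keeps probing the lock over the memory interconnect */
    }

    void spin_unlock(spinlock_t *l)
    {
        __sync_lock_release(l);     /* write 0 back, releasing the lock */
    }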


Synchronizing processes or threads on a multiprocessor poses an additional requirement. Having a busy-waiting synchronization mechanism, also known as a spin lock, in which a process is constantly consuming bandwidth on the computer's interconnection to test whether the lock is available or not, as in the case of test-and-set and swap, is inefficient. Nevertheless, this is how synchronization is implemented at the hardware level in most SMPs, e.g. those using Intel processors. An alternative to busy-waiting is suspending processes which fail to obtain a lock and sending interrupt calls to all suspended processes when the lock is available again. Another synchronization primitive which avoids spin locks is the semaphore. Semaphores consist of a lock and a queue, manipulated in mutual exclusion, containing all the processes that failed to acquire the lock. When the lock is freed, it is given to one of the enqueued processes, which is sent an interrupt call. There are other variations of synchronization primitives, such as monitors [Hoa74], eventcounts and sequencers [RK79], guarded commands [Dij75] and others, but the examples above illustrate clearly the issues involved in SMP process synchronization.
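A blocking counting semaphore in the spirit described above can be sketched with POSIX mutexes and condition variables; in this approximation the queue of suspended threads is the condition variable's wait queue and the wake-up is a signal on it rather than an interrupt. The structure and function names are illustrative.

    #include <pthread.h>

    /* Blocking (non-spinning) counting semaphore: a lock plus a queue of waiters. */
    struct sem {
        int count;
        pthread_mutex_t lock;
        pthread_cond_t  nonzero;
    };

    void sem_create(struct sem *s, int initial)
    {
        s->count = initial;
        pthread_mutex_init(&s->lock, NULL);
        pthread_cond_init(&s->nonzero, NULL);
    }

    void sem_wait_op(struct sem *s)        /* P operation */
    {
        pthread_mutex_lock(&s->lock);
        while (s->count == 0)
            pthread_cond_wait(&s->nonzero, &s->lock);  /* suspend, do not spin */
        s->count--;
        pthread_mutex_unlock(&s->lock);
    }

    void sem_signal_op(struct sem *s)      /* V operation */
    {
        pthread_mutex_lock(&s->lock);
        s->count++;
        pthread_cond_signal(&s->nonzero);  /* wake one enqueued waiter */
        pthread_mutex_unlock(&s->lock);
    }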

3.4 Examples of SMP Operating Systems

Currently, the most important examples of operating systems for parallel machines are UNIX and Windows NT, running on top of the most ubiquitous multi-processor machines, which are symmetric multiprocessors.

3.4.1 UNIX. UNIX is one of the most popular and certainly the most influential operating system ever built. Its development began in 1969 and from then on countless versions of UNIX have been created by most major computer companies (AT&T, IBM, Sun, Microsoft, DEC, HP). It runs on almost all types of computers, from personal computers to supercomputers, and will continue to be a major player in operating system practice and research. There are currently around 20 different flavours of UNIX being released. This diversity has led to various efforts to standardize UNIX, and so there is an ongoing effort by the Open Group, supported by all major UNIX vendors, to create a unified interface for all UNIX flavours.

Architecture. UNIX was designed as a time-sharing operating system where simplicity and portability are fundamental. UNIX allows for multiple processes that can be created asynchronously and that are scheduled according to a simple priority algorithm. Input and output in UNIX is as similar as possible among files, devices and inter-process communication mechanisms. The file system has a hierarchical structure and includes the use of demountable volumes. UNIX was initially a very compact operating system. As technology progressed, there have been several additions to it, such as support for graphical interfaces, networking and SMP, that have increased its size considerably, but UNIX has always retained its basic design traits.


UNIX is a dual mode operating system: the kernel is composed of a set of components that are placed between the system call interface provided to the user and the kernel's interface to the computer's hardware. The UNIX kernel is composed of:

- Scheduler - The scheduler is the algorithm which decides which process is run next (see Process management below). The scheduling algorithm is executed by the process occupying the processor when it is about to relinquish it. The context switch between processes is performed by the swapper process, which is one of the two processes constantly running (the other one is the init process, which is the parent process of all other processes on the machine).

- File system - A file in UNIX is a sequence of bytes, and the operating system does not recognize any additional structure in it. UNIX organizes files in a hierarchical structure that is captured in a metadata structure called the inode table. Each entry in the inode table represents a file. Hardware devices are represented in the UNIX file system in the /dev directory, which contributes to the interface standardization of UNIX kernel components. This way, devices can be read and written just like files. Files can be known in one or more directories by several names, which are called links. Links can point directly to a file within the same file system (hard links) or simply refer to the name of another file (symbolic links). UNIX supports mount points, i.e. any file can be a reference to another physical file system. Inodes have a bit indicating whether they are mount points or not. If a file that is a mount point is accessed, a table with the equivalence between filenames and mounted file systems, the mount table, is scanned to find the corresponding file system. UNIX filenames are sequences of names separated by '/' characters. Since UNIX supports mount points and symbolic links, filename parsing has to be done on a name by name basis. This is required because any name along the path to the file being ultimately referred to can be a mount point or a symbolic link and redirect the path to the wanted file.

- Inter-process communication (IPC) mechanisms - Pipes, sockets and, in more recent versions of UNIX, shared memory are the mechanisms that allow processes to exchange data (see below).

- Signals - Signals are a mechanism for handling exceptional conditions. There are 20 different signals that inform a process of events such as arithmetic overflow, invalid system calls, process termination, terminal interrupts and others. A process checks for signals sent to it when it leaves kernel mode after ending the execution of a system call or when it is interrupted. With the exception of signals to stop or kill a process, a process can react to signals as it wishes by defining a function, called a handler, which is executed the next time it receives a particular signal.

- Memory management - Initially, memory management in UNIX was based on a swapping mechanism. Current versions of UNIX also resort to paging to manage memory. In non-paged UNIX, processes were kept in a contiguous address space. To decide where to create room for a process to be brought from disk into memory (swapped in), the swapper process used a criterion based on the amount of idle time or the age of processes in memory to choose which one to send to disk (swap out). Conversely, the amount of time a process had been swapped out was the criterion to consider it for swap-in.

  Paging has two great advantages: it eliminates external memory fragmentation and it eliminates the need to have complete processes in memory, since a process can request additional pages as it goes along. When a process requests a page and there isn't a page frame in memory to place it, another page will have to be removed from memory and written to disk.

  The target of any page replacement algorithm is to replace the page that will not be used for the longest time. This ideal criterion cannot be applied because it requires knowledge of the future. An approximation to this algorithm is the least-recently-used (LRU) algorithm, which removes from memory the page that hasn't been accessed for the longest time. However, LRU requires hardware support to be efficiently implemented. Since most systems provide a reference bit on each page that is set every time the page is accessed, this can be used to implement a related page replacement algorithm, found in some implementations of UNIX, e.g. 4.3BSD: the second chance algorithm. It is a modified FIFO algorithm: when a page is selected for replacement, its reference bit is first checked; if it is set, it is unset and the page gets a second chance; if that page is later found with its reference bit still unset, it will be replaced (a sketch of this algorithm follows this list).

  Several times per second, there is a check to see if it is necessary to run the page replacement algorithm. If the number of free page frames falls beneath a threshold, the pagedaemon runs the page replacement algorithm.

  In conclusion, it should be mentioned that there is an interaction between swapping, paging and scheduling: as a process loses priority, accesses to its pages become more infrequent, its pages are more likely to be paged out and the process risks being swapped out to disk.
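A minimal sketch of the second chance scan over the page frames; the array, the mirrored reference bits and the function name are illustrative, and a real kernel would work over richer per-frame structures.

    /* Second-chance replacement over a circular list of page frames.
       frame_referenced[i] mirrors the hardware reference bit of frame i. */
    #define NFRAMES 1024

    static int frame_referenced[NFRAMES];
    static int hand = 0;                           /* position of the FIFO "clock" hand */

    int choose_victim_frame(void)
    {
        for (;;) {
            if (frame_referenced[hand]) {
                frame_referenced[hand] = 0;        /* give the page a second chance     */
                hand = (hand + 1) % NFRAMES;
            } else {
                int victim = hand;                 /* not referenced since the last pass */
                hand = (hand + 1) % NFRAMES;
                return victim;                     /* the page in this frame is replaced */
            }
        }
    }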

Process management. In the initial versions of UNIX, the only execution contexts that existed were processes. Later, several thread libraries were developed (e.g. POSIX threads). Currently, there are UNIX releases that include kernel threads (e.g. Sun's Solaris 2) and the corresponding kernel interfaces to manipulate them.

In UNIX, a process is created using the fork system call, which generates a process identifier (or pid). A process that executes the fork system call receives a return value which is the pid of the new process, i.e. its child process, whereas the child process itself receives a return code equal to zero. As a consequence of this process creation mechanism, processes in UNIX are organized as a process tree. A process can also replace the program it is executing with another one by means of the execve system call. Parent and child processes can be synchronized by using the wait and exit calls. A child process can terminate its execution when it calls the exit function, and the parent process can synchronize itself with the child process by calling wait, thereby blocking its execution until the child process terminates.
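A small illustration of this process creation and synchronization model; execl is used here as a convenience wrapper over execve, and the program being launched is an arbitrary example.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid = fork();                  /* create a child process */
        if (pid == 0) {
            /* child: fork returned 0, replace its program with /bin/ls */
            execl("/bin/ls", "ls", "-l", (char *)NULL);
            _exit(1);                        /* only reached if execl failed */
        } else if (pid > 0) {
            int status;
            wait(&status);                   /* parent: block until the child exits */
            printf("child %d finished\n", (int)pid);
        } else {
            perror("fork");
        }
        return 0;
    }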

Scheduling in UNIX is performed using a simple dynamic priority algorithm which benefits interactive programs and where larger numbers indicate a lower priority. Process priority is calculated using:

    Priority = Processor time + Base priority [+ nice]

The nice factor is available for a process to give other processes a higher priority. For each quantum that a process isn't executed, its priority improves via:

    Processor time = Processor time / 2

Processes are assigned a CPU slot, called a quantum, which expires by means of a kernel timeout calling the scheduler. Processes cannot be preempted while executing within the kernel. They relinquish the processor either because they blocked waiting for I/O or because their quantum has expired.
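A toy restatement of these two rules in code form, purely illustrative; a real UNIX scheduler applies them to per-process accounting fields on every clock tick.

    /* Larger result = lower priority; interactive processes accumulate little
       processor time, so their priority value stays small (i.e. better). */
    int recompute_priority(int processor_time, int base_priority, int nice)
    {
        return processor_time + base_priority + nice;
    }

    /* Applied for each quantum a process spends waiting instead of running. */
    int decay_processor_time(int processor_time)
    {
        return processor_time / 2;
    }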

Inter-process communication. The main inter-process communication mechanism in UNIX is the pipe. A pipe is created by the pipe system call, which establishes a reliable unidirectional byte stream between two processes. A pipe has a fixed size and therefore blocks writer processes trying to exceed its size. Pipes are usually implemented as files, although they do not have a name. However, since pipes tend to be quite small, they are seldom written to disk. Instead, they are manipulated in the block cache, thereby remaining quite efficient. Pipes are frequently used in UNIX as a powerful tool for concatenating the execution of UNIX utilities. UNIX shell languages use a vertical bar to indicate that a program's output should be used as another program's input, e.g. listing a directory and then printing it (ls | lpr).
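The same mechanism used directly from C, as a short sketch; the message text is arbitrary and error checking is omitted.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd[2];
        char buf[64];
        const char *msg = "hello through the pipe";

        pipe(fd);                            /* fd[0]: read end, fd[1]: write end */
        if (fork() == 0) {                   /* child inherits both ends          */
            close(fd[0]);
            write(fd[1], msg, strlen(msg));  /* child writes into the pipe        */
            _exit(0);
        }
        close(fd[1]);
        ssize_t n = read(fd[0], buf, sizeof buf);   /* parent blocks until data arrives */
        printf("%.*s\n", (int)n, buf);
        return 0;
    }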

Another UNIX IPC mechanism is the socket. A socket is a communication endpoint that provides a generic interface to pipes and to networking facilities. A socket has a domain (UNIX, Internet or Xerox Network Services), which determines its address format. A UNIX socket address is a file name, while an Internet socket uses an Internet address. There are several types of sockets. Datagram sockets, supported on the Internet UDP protocol, exchange packets without guarantees of delivery, duplication or ordering. Reliable delivered message sockets should implement reliable datagram communication, but they are not currently supported. Stream sockets establish a data stream between two sockets which is duplex, reliable and sequenced. Raw sockets allow direct access to the protocols that support other socket types, e.g. to access the IP or Ethernet protocols in the Internet domain. Sequenced packet sockets are used in the Xerox AF NS protocol and are equivalent to stream sockets with the addition of record boundaries.

A socket is created with a call to socket, which returns a socket descriptor. If the socket is a server socket, it must be bound to a name (bind system call) so that client sockets can refer to it. Then, the socket informs the kernel that it is ready to accept connections (listen) and waits for incoming connections (accept). A client socket that wishes to connect to a server socket is also created with a socket call, and establishes a connection to a server socket whose name it knows by performing a connect call. Once the connection is established, data can be exchanged using ordinary read and write calls.
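The server side of this sequence, sketched for an Internet-domain stream socket; the port number is an arbitrary example and error checking is omitted.

    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);   /* stream socket, Internet domain */

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(5000);                 /* illustrative port number */

        bind(srv, (struct sockaddr *)&addr, sizeof addr);  /* give the socket a name */
        listen(srv, 5);                                     /* ready to accept        */

        int conn = accept(srv, NULL, NULL);          /* blocks until a client connects   */
        write(conn, "hello\n", 6);                   /* ordinary read/write from here on */
        close(conn);
        close(srv);
        return 0;
    }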

A system call complementary to the UNIX IPC mechanisms is the select call. It is used by a process that wishes to multiplex communication on several file or socket descriptors. A process passes a group of descriptors into the select call, which returns the first one on which activity is detected.
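A short sketch of that usage with two descriptors; fd_a and fd_b are whatever descriptors the caller already holds (pipes, sockets or files).

    #include <sys/select.h>

    /* Block until one of the two descriptors becomes readable and return it. */
    int first_ready(int fd_a, int fd_b)
    {
        fd_set rset;
        FD_ZERO(&rset);
        FD_SET(fd_a, &rset);
        FD_SET(fd_b, &rset);

        int maxfd = (fd_a > fd_b ? fd_a : fd_b) + 1;
        select(maxfd, &rset, NULL, NULL, NULL);      /* no timeout: wait indefinitely */

        return FD_ISSET(fd_a, &rset) ? fd_a : fd_b;
    }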

Shared memory support has been introduced in several recent UNIX versions, e.g. Solaris 2. Shared memory mechanisms allow processes to declare shared memory regions (shmget system call) that can then be referred to via an identifier. This identifier allows processes to attach (shmat) and detach (shmdt) a shared memory region to and from their addressing space. Alternatively, shared memory can be achieved via memory-mapped files. A user can choose to map a file (mmap/munmap) onto a memory region and, if the file is mapped in a shared mode, other processes can write to that memory region, thereby communicating among themselves.
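A short System V shared memory sketch in that style; the region size and message are arbitrary, and shmctl is used only to remove the segment afterwards.

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Declare a 4 KB shared region and attach it to this address space. */
        int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        char *region = shmat(id, NULL, 0);

        if (fork() == 0) {
            strcpy(region, "written by the child");   /* child writes into the region */
            shmdt(region);
            _exit(0);
        }
        wait(NULL);
        printf("%s\n", region);                        /* parent sees the child's data */
        shmdt(region);
        shmctl(id, IPC_RMID, NULL);                    /* remove the segment           */
        return 0;
    }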

Internal parallelism. In a symmetric multiprocessor system, all processors may execute the operating system kernel. This leads to synchronization problems in the access to the kernel data structures. The granularity at which these data structures are locked is highly implementation dependent. If an operating system has a high locking cost, it might prefer to lock at a higher granularity but lock less often. Other systems with a lower locking cost might prefer to have complex fine-grained locking algorithms, thereby reducing to a minimum the probability of having processes blocked on kernel locks. Another necessary adaptation, besides locking of kernel data structures, is to modify the kernel data structures to reflect the fact that there is a set of processors.

3.4.2 Microsoft Windows NT. Windows NT [Cus93] is the high end of Microsoft Corporation's operating systems range. It can be executed on several different architectures (Intel x86, MIPS, Alpha AXP, PowerPC) with up to 32 processors. Windows NT can also emulate several OS environments (Win32, OS/2 or POSIX), although the primary environment is Win32. NT is a 32-bit operating system with separate per-process address spaces. Scheduling is thread based and uses a preemptive multiqueue algorithm. NT has an asynchronous I/O subsystem and supports several file systems: FAT, the high-performance file system (HPFS) and the native NT file system (NTFS). It has integrated networking capabilities which support 5 different transport protocols: NetBeui, TCP/IP, IPX/SPX, AppleTalk and DLC.

Architecture. Windows NT is a dual mode operating system with user-level and kernel components. It should be noted that not the whole kernel is executed in kernel mode. The Windows NT components that are executed in kernel mode are:

- Executive - The executive is the upper layer of the operating system and provides generic operating system services for managing processes, threads and memory, performing I/O, IPC and ensuring security.

{ Device Drivers - Device drivers provide the executive with a uniform in-

terface to access devices indep endently of the hardware architecture the

manufacturer uses for the device.

{ Kernel - The kernel implements pro cessor dep endent functions and related

functions such as thread scheduling and context switching, exception and

interrupt dispatching and OS synchronization primitives

{ Layer (HAL) - All these comp onents are layered

on top of a hardware abstraction layer. This layer hides hardware sp eci c

details, suchasI/Ointerfaces and interrupt controllers, from the NT ex-

ecutive thereby enhancing its p ortability.For example, all cache coherency

and ushing in a SMP is hidden b eneath this level.

The mo dules that are executed in user mo de are:

{ System & Service Pro cesses - These op erating system pro cesses provide

several services such as session control, logon management, RPC and event

logging.

{ User Applications

{ Environment Subsystems - The environment subsystems use the generic

op erating system services provided by the executive to give applications

the interface of a particular op erating system. NT includes environment

subsystems for MS-DOS, 16 bit Windows, POSIX, OS/2 and for its main

environment, the Win32 API.

In NT, operating system resources are represented by objects. Kernel objects include processes and threads, file system components (file handles, logs), synchronization mechanisms (semaphores, mutexes, waitable timers) and inter-process communication resources (pipes, mailboxes, communication devices). In fact, most of the objects just mentioned are NT executive interfaces to low-level kernel objects. The other two types of Windows NT OS objects are the user and graphical interface objects, which compose the graphical user interface and are private to the Win32 environment.

Calls to the kernel are made via a protected mechanism that causes an exception or interrupt. Once the system call is done, execution returns to user level by dismissing the interrupt. These are calls to the environment subsystem libraries; the real calls to the kernel libraries are not visible at user level.

Security in NT is based on two mechanisms: processes, files, printers and other resources have an access control list, and active OS objects (processes and threads) have security tokens. These tokens identify the user on whose behalf the process is executing and the privileges that user possesses, and can therefore be used to check access control when an application tries to access a resource.

Process management. A Windows NT process is a data structure where the resources held by an execution of a program are managed and accounted for. This means that a process keeps track of the virtual address space and the corresponding working set its program is using. Furthermore, an NT process includes a security token identifying its permissions, a table containing the kernel objects it is using and a list of threads. A process' handle table is placed in the system's address space and is therefore protected from user-level "tampering".

In NT, a process does not represent a flow of execution; program execution is always performed by a thread. In order to trace an execution, a thread is composed of a unique identifier, a set of registers containing processor state, a stack for use in kernel mode and a stack for use in user mode, a pointer to the function it is executing, a security token and its current access mode.

The Windows NT kernel has a multi-queue scheduler with 32 different priority levels, of which the upper 16 are reserved for real-time tasks. Scheduling is preemptive and strictly priority driven. There is no guaranteed execution period before preemption, and threads can be preempted in kernel mode. Within a priority level, scheduling is time-sliced and round robin. If a thread is preempted, it goes back to the head of its priority level queue; if it exhausts its CPU quantum, it goes back to the tail of that queue.

Regarding multiprocessor scheduling, each thread has an ideal processor, which is assigned, at the thread's creation, in a round-robin fashion among all processors. When a thread is ready and more than one processor is available, the ideal processor is chosen; otherwise, the last processor where the thread ran is used. Finally, if the thread has not been waiting for too long, the scheduler checks whether the next thread in the ready state can run on its ideal processor, and if it can, this second thread is chosen. A thread can run on any processor unless an affinity mask is defined. An affinity mask is a sequence of bits, each representing a processor, which can be set to restrict the processors on which a thread is allowed to execute. However, this forced thread placement, called hard affinity, may lead to reduced execution of the thread.
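
In the Win32 API, hard affinity is set with the SetThreadAffinityMask call; the short fragment below restricts the calling thread to processors 0 and 1. The mask value is chosen purely for illustration, and the DWORD_PTR type reflects current SDK headers.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Bit i of the mask corresponds to processor i; 0x3 selects CPUs 0 and 1. */
        DWORD_PTR mask = 0x3;

        /* Returns the previous affinity mask, or 0 on failure. */
        DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), mask);
        if (previous == 0)
            printf("could not set the affinity mask\n");
        else
            printf("previous mask was 0x%llx\n", (unsigned long long)previous);
        return 0;
    }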

Inter-process communication. Windows NT provides several different inter-process communication (IPC) mechanisms:

Windows sockets. The Windows Sockets API provides a standard Windows interface to several communication protocols with different addressing schemes, such as TCP/IP and IPX. The Windows Sockets API was developed to accomplish two things: one was to migrate the sockets interface, developed for UNIX BSD in the early 1980s, into the Windows NT environments; the other was to establish a new standard interface capable of supporting emerging network capabilities such as real-time communications and QoS (quality of service) guarantees. Windows Sockets, currently at version 2, are an interface for networking, not a protocol. As they do not implement a particular protocol, Windows Sockets do not affect the data being transferred through them. The underlying transport protocol can be of any type; besides the usual datagram and stream sockets, it can, for example, be a protocol designed for multimedia communications. The transport protocol and name providers are placed beneath the Windows Sockets layer.

Local Procedure Calls. LPC is a message-passing facility provided by the NT executive. The LPC interface is similar to that of standard RPC, but it is optimised for communication between two processes on the same machine. LPC communication is based on ports. A port is a kernel object that can be of two types: connection port and communication port. Connections are established by sending a connect request to another process' connection port. If the connection is accepted, communication is then established via two newly created ports, one on each process. LPC can be used to pass messages directly through the communication ports or to exchange pointers into memory regions shared by both communicating processes.

Remote Procedure Calls (RPC). Apart from the conventional inter-computer invocation procedure, the Windows NT RPC facility can use other IPC mechanisms to establish communication between the computers on which the client and the server portions of the application exist. If the client and server are on the same computer, the Local Procedure Call (LPC) mechanism can be used to transfer information between processes and subsystems. This makes RPC the most flexible and portable IPC choice in Windows NT.

Named pipes and mailslots. Named pipes provide connection-oriented messaging that allows applications to share memory over the network. Windows NT provides a special application programming interface (API) that increases security when using named pipes.

The mailslot implementation in Windows NT is a subset of the Microsoft OS/2 LAN Manager implementation: Windows NT implements only second-class mailslots. Second-class mailslots provide connectionless messaging for broadcast messages. Delivery of the message is not guaranteed, though the delivery rate on most networks is high. They are most useful for identifying other computers or services on a network; the Computer Browser service under Windows NT uses mailslots. Named pipes and mailslots are actually implemented as file systems and, as file systems, they share common functionality, such as security, with the other file systems. Local processes can use named pipes and mailslots without going through the networking components.

NetBIOS. NetBIOS is a standard programming interface in the PC environment for developing client-server applications; it has been used as an IPC mechanism since the early 1980s. From a programming aspect, higher-level interfaces such as named pipes and RPC are superior in their flexibility and portability. A NetBIOS client-server application can communicate over various protocols: the NetBEUI protocol (NBF), NWLink NetBIOS (NWNBLink), and NetBIOS over TCP/IP (NetBT).

Synchronization. In Windows NT, the concept of having entities through which threads can synchronize their activities is integrated in the kernel object architecture. The synchronization objects, i.e. objects on which a thread can wait and be signalled, in the NT executive are: process, thread, file, event, event pair, semaphore, timer and mutant (seen at Win32 level as a mutex). Synchronization on these objects differs in the conditions needed for the object to signal the threads that are waiting on it. The circumstances in which synchronization objects signal waiting threads are: processes when their last thread terminates, threads when they terminate, files when an I/O operation finishes, events when a thread sets them, semaphores when their counter drops to zero, mutants when the holding threads release them, and timers when their set time expires. Synchronization is made visible to Win32 applications via a generic wait interface (WaitForSingleObject or WaitForMultipleObjects).
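
A minimal Win32 fragment showing the wait interface on a mutant/mutex object follows; error handling is omitted for brevity.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* A mutant object, exposed to Win32 programs as a mutex. */
        HANDLE mutex = CreateMutex(NULL, FALSE, NULL);

        /* Block until the object is signalled, i.e. the mutex is free. */
        if (WaitForSingleObject(mutex, INFINITE) == WAIT_OBJECT_0) {
            printf("inside the critical section\n");
            ReleaseMutex(mutex);   /* releasing signals the next waiting thread */
        }

        CloseHandle(mutex);
        return 0;
    }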

4. Operating Systems for NORMA environments

In Section 2. we showed that parallel architectures have been restricted mainly to symmetric multiprocessors, multicomputers, computer clusters and networks of workstations. There are some examples of true NORMA machines, notably the Cray T3E. However, since networks of workstations and/or personal computers are the most challenging NORMA environments today, we chose to approach the problem of system-level support for NORMA parallel programming by raising some of the problems that arise in the network-of-workstations environment. Naturally, most of those problems are also highly prominent in NORMA multiprocessors.

While in a multiprocessor every processor is exactly like every other in capability, resources, software and communication speed, in a computer network that is not the case. As a matter of fact, the computers in a network are most probably heterogeneous, i.e. they have different hardware and/or operating systems. More precisely, the heterogeneity aspects that must be considered when exploiting a set of computers in a network are the following: architecture, data format, computational speed, machine load and network load.

Heterogeneity of architecture means that the computers in the network can be Intel personal computers, workstations, shared-memory multiprocessors, etc., which poses the problems of incompatible binary formats and different programming methodologies.

Data format heterogeneity is an important obstacle to parallel computing because it prevents computers from correctly interpreting the data exchanged among them.


The heterogeneity of computational speed may result in unused processing power, because the same task can be accomplished much faster on a multiprocessor than on a personal computer, for example. Thus, the programmer must be careful when splitting the tasks among the computers in the network so that no computer sits idle.

The computers in the network have different usage patterns, which lead to different loads. The same reasoning applies to the network load. This may lead to a situation in which a computer takes a long time to execute a simple task, because it is too loaded, or stays idle, because it is waiting for some message.

All these problems must be solved in order to take advantage of the benefits of parallel computing on networks of computers, which can be summarized as follows:

- low price, given that the computers already exist;
- optimised performance, by assigning the right task to the most appropriate computer (taking into account its architecture, operating system and load);
- the resources available can easily grow by simply connecting more computers to the network (possibly with more advanced technologies);
- programmers can still use familiar tools, such as those of a single computer (editors, compilers, etc.).

4.1 Architectural Overview

Currently, stand-alone workstations and personal computers are ubiquitous and almost always connected by means of a network. Each machine provides a reasonable amount of processing power and memory which remains unused for long periods of time. The challenge is to provide a platform that allows programmers to take advantage of such resources easily. Thus, the computational power of such computers can be applied to solve a variety of computationally intensive applications. In other words, this network-based approach can be effective in coupling several computers, resulting in a configuration that might be economically and technically difficult to achieve with supercomputer hardware.

At the other extreme of the spectrum of NORMA systems are, of course, multicomputers, which generally run separate supervisor kernels on each node. These machines derive their performance from high-performance interconnections and from very compact, and therefore very fast, operating systems. The parallelism is exploited not directly by the operating system but by some language-level functionality. Since each node is a complete processor, and not a simplified processing node as in SIMD supercomputers, the tasks it performs can be reasonably complex.

System-level support for NORMA architectures involves providing mechanisms to overcome the physical processor separation and to coordinate activities on processors running independent operating systems. This section discusses some of these basic mechanisms: distributed process management, parallel file systems and distributed recovery.

4.2 Distributed Process Management

In contrast with the SMPs discussed in the previous section, which are generally used as glorified uniprocessors in the sense that they execute an assorted set of applications, NORMA systems are targeted at parallel applications. Load balancing is an important feature to increase the performance of many parallel applications. However, efficient load balancing, in particular receiver-initiated load balancing protocols, where processors request tasks from others, requires an underlying support for process migration.

4.2.1 Process migration. Process migration [Nut94] is the ability to move an executing process from one processor to another. It is a very useful functionality, although it is costly to execute and challenging to implement. Basically, migration consists of halting a process in a consistent state on the current processor, packing a complete description of its state, choosing a new processor for the process, shipping the state description to the destination processor and restarting the process there. Process migration is most frequently motivated by the need to balance the load of a parallel application. However, it is also useful for reducing inter-process communication (processes communicating with increasing frequency should be moved to the same processor), reconfiguring a system for administrative reasons or exploiting a special capability specific to a particular processor. Several systems have implemented process migration, such as Sprite [Dou91], Condor [BLL92] or Accent [Zay87]. More recently, as migration capabilities were introduced in object-oriented distributed and parallel systems, the interest in migration has shifted to object migration, or mobile objects, as in Emerald [MV93] or COOL [Lea93]. However, the core difficulties remain the same: ensuring that the migrating entity's state is completely described, and capturing, redirecting and replaying all interactions that would have taken place during the migration procedure.

4.2.2 Distributed recovery. Recovery of a computation in spite of node failures is achieved via checkpointing, which is the procedure of saving the state of an ongoing distributed computation so that it can be recovered from that intermediate state in case of failure. It should be noted that checkpointing allows a computation to recover from a system failure; however, if an execution fails due to an error induced by the process' execution itself, the error will repeat itself after recovery. Another type of failure not addressed by checkpointing is secondary storage failure: hard disk failures have to be addressed by replication using mirrored disks or RAID (cf. Section 8.4 of Chapter ).

Checkpointing and process migration require very similar abilities from the system point of view, since the complete state of a process must be saved so that a processor (the current one in the case of checkpointing, or another one in the case of process migration) can be correctly loaded and restarted with the process' saved state. There are four basic approaches to checkpointing [EJW96]:

- Coordinated checkpointing - In coordinated checkpointing, the processes involved in a parallel execution take their checkpoints simultaneously in order to guarantee that the saved system state is consistent. Thus, the processes must periodically cooperate in computing a consistent global checkpoint. Coordinated checkpoints were devised because simplistic protocols lacking coordination tend to create a so-called domino effect. The domino effect is due to the existence of orphan messages. An orphan message appears when a process, after sending messages to other processes, needs to roll back to a previous checkpoint. In that case, there will be processes that have received messages which, from the point of view of the failed process that has rolled back, have not been sent. Suppose that in a group of processes each one saves its state independently. Then, if a failure occurs and a checkpoint has to be chosen to roll back to, it is difficult to find a checkpoint where no process has orphan messages; as a result, the system will regress indefinitely trying to find a checkpoint that is consistent for all processors. Coordinated checkpointing basically eliminates the domino effect, given that processes always restart from the most recent global checkpoint. This coordination even allows processes to have non-deterministic behaviours: since all processors are restarted from the same moment in the computation, the new computation may take a different execution path. The main problem with this approach is that it requires the coordination of all the processes, i.e. process autonomy is sacrificed, and there is a performance penalty to be paid when all processes have to be stopped in order to save a globally consistent state.

- Uncoordinated checkpointing - With this technique, processes create checkpoints asynchronously and independently of each other. Thus, this approach allows each process to decide independently when to take checkpoints. Given that there is no synchronization among computation, communication and checkpointing, this optimistic recovery technique can tolerate the failure of an arbitrary number of processors. For this reason, when failures are very rare, this technique yields better throughput and response time than other general recovery techniques. When a processor fails, processes have to find a set of previous checkpoints representing a consistent system state. The main problem with this approach is the domino effect, which may force processes to undo a large amount of work; in order to avoid it, all the messages received since the last checkpoint was created would have to be logged.

- Communication-induced checkpointing - This technique complements uncoordinated checkpoints and avoids the domino effect by ensuring that all processes verify a system-wide constraint before they accept a message from the outside world. A number of such constraints that guarantee the avoidance of the domino effect have been identified, and this method requires that any incoming message be accompanied by the information necessary to verify the constraint. If the constraint is not verified, then a checkpoint must be taken before passing the message on to the parallel application. An example of a constraint for communication-induced logging is the one proposed in Programmer Transparent Coordination (PTC) [KYA96], which requires that each process take a checkpoint if the incoming message makes it depend on a checkpoint that it did not previously depend upon.

- Log-based rollback recovery - This approach requires that processes save their state when they take a checkpoint and, additionally, save all data regarding interactions with the outside world, i.e. emission and reception of messages. This stops the domino effect because the message log bridges the gap between the checkpoint to which the failed process rolled back and the current state of the other processes. The technique can even accommodate non-deterministic events in the process' execution, if these are logged too. A common way to perform logging is to save each message that a process receives before passing it on to the application code. During recovery the logged events are replayed exactly as they occurred before the failure.
  The difference between this approach and the previous ones is that, in both coordinated and uncoordinated checkpointing, the system restarts the failed processes by restoring a consistent state; the recovery of a failed process is not necessarily identical to its pre-failure execution, which complicates the interaction with the outside world. On the contrary, log-based rollback recovery does not have this problem, since it has recorded all previous interactions with the outside world.
  There are three variants of this approach: pessimistic logging, optimistic logging, and causal logging. In pessimistic logging, the system logs the information regarding each non-deterministic event before it can affect the computation, e.g. a message is not delivered until it is logged (a sketch follows this list). Optimistic logging takes a more relaxed but riskier approach: non-deterministic events are not written to stable storage but saved in a faster volatile log and only periodically written, in an asynchronous way, to stable storage. This is a more efficient approach, since it avoids constantly waiting for costly writes to stable storage; however, should a process fail, the volatile log will be lost and there is the possibility of thereby creating orphan messages and inducing a domino effect. Causal logging uses the Lamport happened-before relation [Lam78] to establish which events caused a process' current state, and guarantees that all those events are either logged or available locally to the process.
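
The following fragment sketches the pessimistic variant in its simplest form: a message is forced to a log on stable storage before it is handed to the application, so that a recovering process can replay its reception history. The file name and the delivery routine are invented for the example.

    #include <stdio.h>

    static FILE *message_log;

    /* Stand-in for the application's message handler. */
    static void deliver_to_application(const char *msg, size_t len)
    {
        printf("delivered: %.*s\n", (int)len, msg);
    }

    /* Pessimistic logging: the message is written and flushed to the log
     * before it is allowed to affect the computation. A real system would
     * also force the data to disk (e.g. with fsync) to make it stable. */
    static void receive_message(const char *msg, size_t len)
    {
        fwrite(&len, sizeof(len), 1, message_log);
        fwrite(msg, 1, len, message_log);
        fflush(message_log);

        deliver_to_application(msg, len);
    }

    int main(void)
    {
        message_log = fopen("messages.log", "ab");
        receive_message("example message", 15);
        fclose(message_log);
        return 0;
    }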


4.3 Parallel File Systems

Parallel file systems address the performance problem faced by current network client-server file systems: the read and write bandwidth for a single file is limited by the performance of a single server (memory bandwidth, processor speed, etc.). Parallel applications suffer from the above-mentioned performance problem because they present I/O loads equivalent to those of traditional supercomputers, which are not handled by current file servers. If users resort to parallel computers in search of fast execution, the file systems of parallel computers must also aim at serving I/O requests rapidly.

4.3.1 Striped file systems. Dividing a file among several disks enables an application to access that file more quickly by issuing simultaneous requests to several disks [CK93]. This technique is called declustering. In particular, if the file blocks are divided in a round-robin way among the available disks, it is called striping. In a striped file system, either individual files are striped across the servers (per-file striping), or all the data is striped across the servers independently of the files to which it belongs (per-client striping). In the first case, only large files benefit from the striping. In the second case, given that each client forms its new data for all files into a sequential log, even small files benefit from the striping. With per-client striping, servers are used efficiently regardless of file sizes: large amounts of writes can be striped and performed in parallel, while small writes are batched.
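
The round-robin placement underlying striping is a simple arithmetic mapping from a file offset to a server and an offset on that server. The fragment below illustrates it for an arbitrary example configuration of 4 I/O servers and 64 KB stripe units; the numbers are illustrative and not taken from any of the systems described next.

    #include <stdio.h>

    /* Where a byte of a striped file lives: which server, and at what offset. */
    struct location { int server; long server_offset; };

    /* Round-robin (striped) layout: stripe unit k goes to server k mod N. */
    static struct location map_offset(long file_offset, int num_servers, long stripe_unit)
    {
        long stripe  = file_offset / stripe_unit;   /* which stripe unit       */
        long in_unit = file_offset % stripe_unit;   /* offset inside that unit */
        struct location loc;
        loc.server        = (int)(stripe % num_servers);
        loc.server_offset = (stripe / num_servers) * stripe_unit + in_unit;
        return loc;
    }

    int main(void)
    {
        long offsets[] = { 0, 70000, 300000 };
        for (int i = 0; i < 3; i++) {
            struct location l = map_offset(offsets[i], 4, 64 * 1024);
            printf("file offset %ld -> server %d, local offset %ld\n",
                   offsets[i], l.server, l.server_offset);
        }
        return 0;
    }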

4.3.2 Examples of parallel file systems. CFS (Concurrent File System) [Pie89] is one of the first commercial multiprocessor file systems. It was developed for the Intel iPSC and Touchstone Delta multiprocessors. The basic idea of CFS is to decluster files across several I/O processors, each one with one or more disks. Caching and prefetching are completely under the control of the file system; thus, the programmer has no way to influence them. The same happens with the clustering of a file across disks, i.e. it is not predictable by the programmer.

CFS constitutes the secondary storage of the Intel Paragon. The nodes use the CFS for high-speed simultaneous access to secondary storage. Files are manipulated using the standard UNIX system calls in either C or FORTRAN and can be accessed by using a file pointer common to all applications, which can be used in four sharing modes:

- Mode 0: Each node process has its own file pointer. It is useful for large files to be shared among the nodes.
- Mode 1: The compute nodes share a common file pointer, and I/O requests are serviced on a first-come-first-served basis.
- Mode 2: Reads and writes are treated as global operations and a global synchronization is performed.
- Mode 3: A synchronous ordered mode is provided, but all write operations have to be of the same size.


The PFS (Parallel File System) [Roy93] is a striped file system which stripes files (but not directories or links) across the UNIX file systems that were assembled to create it. The size of each stripe, the stripe unit, is determined by the system administrator.

Contrary to the previous two file systems, the file system for the nCUBE multiprocessor [PFD+89] allows the programmer to control many of its features. In particular, the nCUBE file system, which also uses file striping, allows the programmer to manipulate the striping unit size and the distribution pattern.

IBM's Vesta file system [CFP+93], a striped file system for the SP-1 and SP-2 multiprocessors, allows the programmer to control the declustering of a file when it is created. It is possible to specify the number of disks, the record size, and the stripe-unit size. Vesta provides users with a single name space. Users can dynamically partition files into subfiles, by breaking a file into rows, columns, blocks, or more complex cyclic partitionings. Once partitioned, a file can be accessed in parallel from a number of different processes of a parallel application.

4.4 Popular Message-Passing Environments

4.4.1 PVM. PVM [GBD94] supports parallel computing on current computers by providing the abstraction of a parallel virtual machine which comprehends all the computers in a network. For this purpose, PVM handles message routing, data conversion and task scheduling among the participating computers.

The programmer develops his programs as a collection of cooperating tasks. Each task interacts with PVM by means of a library of standard interface routines that provide support for initiating and finishing tasks on remote computers, sending and receiving messages and synchronizing tasks.

To cope with heterogeneity, the PVM message-passing primitives are strongly typed, i.e. they require type information for buffering and transmission.

PVM allows the programmer to start or stop any task, or to add or delete any computer in the network from the parallel virtual machine. In addition, any process may communicate or synchronize with any other. PVM is structured around the following notions (a minimal usage example follows the list):

- The user of an application running on top of PVM can decide which computers will be used. This set of computers, a user-configured host pool, can change during the execution of an application, by simply adding or deleting computers from the pool. This dynamic behaviour is very useful for fault tolerance and load balancing.
- Application programmers can specify which tasks are to be run on which computers. This allows the explicit exploitation of special hardware or operating system capabilities.


- In PVM, the unit of parallelism is a task, which often, but not necessarily, is a UNIX process. There is no process-to-processor mapping implied or enforced by PVM.
- Heterogeneity: varied networks and applications are supported. In particular, PVM allows messages containing more than one data type to be exchanged between machines with different data representations.
- PVM uses native message-passing support on those computers where the hardware provides optimised message primitives.
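
As an illustration of the library interface, a master task might spawn a worker and exchange a typed message with it as sketched below. The program name "worker" and the message tags are invented for the example; only standard PVM calls are used.

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int mytid = pvm_mytid();           /* enrol this process in PVM         */
        int worker_tid;
        int n = 42;

        /* Start one instance of the (hypothetical) "worker" program somewhere
         * in the user-configured host pool. */
        pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &worker_tid);

        /* Messages are strongly typed: data is packed with type-specific calls. */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&n, 1, 1);
        pvm_send(worker_tid, 1);           /* message tag 1, chosen arbitrarily */

        /* Wait for the worker's typed reply (tag 2). */
        pvm_recv(worker_tid, 2);
        pvm_upkint(&n, 1, 1);
        printf("task %x received %d\n", mytid, n);

        pvm_exit();                        /* leave the parallel virtual machine */
        return 0;
    }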

4.4.2 MPI. MPI [MPI95] stands for Message Passing Interface. The goal of MPI is to become a widely used standard for writing message-passing programs. It is a standard that defines both the syntax and the semantics of a message-passing library that can be implemented efficiently on a wide range of computers. The defined interface attempts to establish a practical, portable, efficient, and flexible standard for message passing. The MPI standard provides MPI vendors with a clearly defined base set of routines that they can implement efficiently or, in some cases, provide hardware support for, thereby enhancing scalability.

MPI has been strongly influenced by work at the IBM T. J. Watson Research Center, Intel's NX/2, Express, nCUBE's Vertex, and p4.

Blocking communication is the standard communication mode in MPI: by default, the sender of a message is blocked until the receiver calls the appropriate function to receive the message. However, MPI does support non-blocking communication. Although MPI is a complex standard, six basic calls are enough to create an application (a minimal program using them is sketched after the list):

- MPI_Init - initiate an MPI computation
- MPI_Finalize - terminate a computation
- MPI_Comm_size - determine the number of processes
- MPI_Comm_rank - determine the current process' identifier
- MPI_Send - send a message
- MPI_Recv - receive a message
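
A two-process program that uses only those six calls to pass one integer from rank 0 to rank 1 could look as follows; the tag value and the use of MPI_COMM_WORLD are the usual conventions rather than requirements of the text.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, value = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);                  /* initiate the MPI computation */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes          */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process' identifier     */

        if (rank == 0) {
            value = 99;
            /* Blocking send: returns once the buffer can be reused. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Blocking receive of the matching message (tag 0) from rank 0. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank %d of %d received %d\n", rank, size, value);
        }

        MPI_Finalize();                          /* terminate the computation    */
        return 0;
    }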

MPI is not a complete software infrastructure that supports distributed computing. In particular, MPI provides neither process management, i.e. the ability to create remote tasks, nor support for I/O. Thus, MPI is a communications interface layer that is built upon the native facilities of the underlying hardware platform (certain data transfer operations may be implemented at a level close to the hardware). For example, MPI could be implemented on top of PVM.

5. Scalable Shared Memory Systems

Traditionally, parallel computing has been based on the message-passing model. However, a number of experiments with distributed shared memory have shown that there is not necessarily a great performance penalty to pay for this more amenable programming model. The great advantage of shared memory is that it hides the details of interprocessor communication from the programmer. Furthermore, this transparency allows a number of performance-enhancing techniques, such as caching and data migration and replication, to be used without changes to application code. Of course, the main advantage of shared memory, transparency, is also its major flaw, as pointed out by the advocates of message passing: a programmer with application-specific knowledge can better control the placement and exchange of data than an underlying system can do in an implicit, automatic way. In this section, we present some of the most successful scalable shared memory systems.

5.1 Shared Memory MIMD Computers

5.1.1 DASH. The goal of the Stanford DASH [LLT+93] project was to explore large-scale shared-memory architecture with hardware directory-based cache coherency. DASH consists of a two-dimensional mesh of clusters. Each cluster contains four processors (MIPS R3000 with two levels of cache), up to 28 MB of main memory, a directory controller, which manages the memory metadata, and a reply controller, which is a board that handles all requests issued by the cluster. Within each cluster, cache consistency is maintained with a snoopy bus protocol; between clusters, cache consistency is guaranteed by a directory-based protocol. To enforce this protocol a full-map scheme is implemented: each cluster has a complete map of the memory indicating whether a memory location is valid in the local memory or whether the most recent value is held at a remote cluster. Requests for memory locations that are locally invalid are sent to the remote cluster, which replies to the requesting cluster. The DASH prototype is composed of 16 clusters with a total of 64 processors. Due to the various caching levels, locality is a relevant factor for the performance of DASH, which degrades as application working sets increase.

5.1.2 MIT Alewife. The MIT Alewife architecture [ACJ+91] is similar to that of the Stanford DASH. It is composed of a 2D mesh of nodes, each with a processor (a modified SPARC called Sparcle), memory and a communication switch. However, Alewife uses some interesting techniques to improve performance. The Sparcle processor is able to switch context quickly between threads, so when a thread accesses data that is not available locally, the processor, using independent register sets, quickly switches to another thread. Another interesting feature of Alewife is its LimitLESS cache coherence technique. LimitLESS emulates a directory scheme similar to that of the DASH but uses only a very small map of cache line descriptors. Whenever a cache line outside that limited set is requested, the request is handled in software. This leads to a much smaller performance degradation than might be expected and has the obvious advantage of reducing memory occupation.

5.2 Distributed Shared Memory

Software distributed shared memory (DSM) systems aim at capturing the growing computational power of workstations and personal computers, and the improvements in the networking infrastructure, to create a parallel programming environment on top of them. We present two particularly successful implementations of software DSM: TreadMarks and Shasta.

5.2.1 TreadMarks. TreadMarks [KDC+94] is a DSM platform developed at Rice University which uses lazy release consistency as its memory consistency model. A memory consistency model defines under which circumstances memory is updated in a DSM system. The release consistency model depends on the existence of synchronization variables whose accesses are divided into acquire and release. The release consistency model requires that accesses to regular variables be followed by the release of a synchronization variable, which must be acquired by any other process before it accesses the regular variables again. In particular, the lazy release consistency model postpones the propagation of memory updates performed by a process until another process acquires the corresponding synchronization variable.

Another fundamental option in DSM is deciding how to detect that applications are attempting to access memory. If the DSM layer wants to do the detection automatically by using memory page faults, then the unit of consistency, i.e. the smallest amount of memory that can be updated in a consistency operation, will be a memory page. Alternatively, smaller consistency units can be used, but in this case the system can no longer rely on page faults, and changes have to be made to the application code to indicate where data accesses are going to occur.
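
Page-fault-based detection is usually built on the virtual memory protection interface of the host operating system. The fragment below is a generic POSIX-style illustration of the idea, not TreadMarks code: a page is mapped without write permission, and the first write is caught in a signal handler, where a real DSM layer would create the twin and note the page as modified. It assumes a Unix system that supports anonymous mappings.

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *page;
    static long  page_size;

    /* The first write to the protected page lands here: re-enable writes and
     * (in a real DSM layer) record the page as dirty and create its twin. */
    static void on_fault(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        char *addr = (char *)info->si_addr;
        if (addr >= page && addr < page + page_size)
            mprotect(page, page_size, PROT_READ | PROT_WRITE);
        else
            abort();    /* a genuine fault outside the shared region */
    }

    int main(void)
    {
        page_size = sysconf(_SC_PAGESIZE);
        page = mmap(NULL, page_size, PROT_READ,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_flags = SA_SIGINFO;
        sa.sa_sigaction = on_fault;
        sigaction(SIGSEGV, &sa, NULL);

        page[0] = 1;    /* faults once; afterwards the page is writable */
        printf("page updated, first byte = %d\n", page[0]);
        return 0;
    }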

The problem with big consistency units is false sharing. False sharing happens when two processes access different regions of a common consistency unit: in this case, they will be communicating frequently to exchange updates of that unit although, in fact, they are not sharing data at all.

The designers of TreadMarks chose the page-based approach. However, they use a technique called lazy diff creation to reduce the amount of data exchanged to update a memory page. When an application first accesses a page, TreadMarks creates a copy of the page (a twin). When another process requests an update of the page, a record of the modifications made to the page (a diff) is created by comparing the two twin pages, the initial one and the current one. It is this diff that is sent to update the requesting process' page. TreadMarks has been used for several applications, such as genetic research, where it was particularly successful.
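
The diff itself can be produced by a byte-wise comparison between the twin and the current page. The sketch below simplifies the encoding (one entry per modified byte, where a real implementation would encode runs), but shows why only a fraction of the page needs to be transmitted.

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* One modified byte: its offset in the page and its new value. */
    struct diff_entry { int offset; unsigned char value; };

    /* Compare the twin (copy taken at first access) with the current page
     * and emit only the bytes that changed. Returns the number of entries. */
    static int make_diff(const unsigned char *twin, const unsigned char *current,
                         struct diff_entry *diff)
    {
        int n = 0;
        for (int i = 0; i < PAGE_SIZE; i++)
            if (twin[i] != current[i]) {
                diff[n].offset = i;
                diff[n].value  = current[i];
                n++;
            }
        return n;
    }

    int main(void)
    {
        static unsigned char twin[PAGE_SIZE], page[PAGE_SIZE];
        static struct diff_entry diff[PAGE_SIZE];

        memcpy(twin, page, PAGE_SIZE);  /* twin created when the page is first accessed */
        page[10] = 42;                  /* the application then modifies the page       */
        page[99] = 7;

        int n = make_diff(twin, page, diff);
        printf("diff has %d entries, sent instead of the whole page\n", n);
        return 0;
    }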


5.2.2 Shasta. Shasta [SG97] is a software DSM system that runs on AlphaServer SMPs connected by a Memory Channel interconnection. The Memory Channel is a memory-mapped network that allows a process to transmit data to a remote process, without any operating system overhead, via a simple store into a mapped page. Processes check for incoming memory transfers by polling a variable.

Shasta uses variable-sized memory units, which are multiples of 64 bytes. Memory units are managed using a per-processor table where the state of each block is kept. Additionally, each processor knows, for each of its local memory units (the ones initially placed in its memory), which processors are holding those memory units.

Shasta implements its memory consistency protocol by instrumenting the application code, i.e. altering the binary code of the compiled applications. Each access to application data is bracketed in code that enforces memory consistency. The added code doubles the size of applications; however, using a set of optimisations, such as instrumenting sequences of contiguous memory accesses as a single access, this overhead drops to an average of 20%. Shasta is a paradigmatic example of a non-intrusive transition of a uniprocessor application to a shared memory platform: the designers of Shasta were able to execute the Oracle database transparently on top of Shasta.
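
The effect of the inline checks inserted by Shasta's binary instrumentation can be imitated in source form. The sketch below is only an illustration of the idea; the state table, the block size constant and the miss handler are invented for the example and do not reflect Shasta's actual data structures.

    #include <stdio.h>

    #define BLOCK_SIZE 64              /* consistency unit: a multiple of 64 bytes */
    #define NUM_BLOCKS 1024

    enum block_state { INVALID, SHARED, EXCLUSIVE };

    /* Per-processor state table for the blocks (illustrative only). */
    static enum block_state state_table[NUM_BLOCKS];
    static char memory[NUM_BLOCKS * BLOCK_SIZE];

    /* Stand-in for the protocol code that fetches a block from its holder. */
    static void fetch_block(int block)
    {
        printf("miss on block %d: fetching from remote holder\n", block);
        state_table[block] = SHARED;
    }

    /* The check that instrumentation places before a load: if the block is
     * not valid locally, run the protocol code before performing the access. */
    static char checked_load(long addr)
    {
        int block = (int)(addr / BLOCK_SIZE);
        if (state_table[block] == INVALID)
            fetch_block(block);
        return memory[addr];
    }

    int main(void)
    {
        char v = checked_load(130);    /* first access to block 2 misses */
        v = checked_load(140);         /* second access to block 2 hits  */
        printf("loaded %d\n", v);
        return 0;
    }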

References

[ACJ+91] Agarwal, A., Chaiken, D., Johnson, K., Kranz, D., Kubiatowicz, J., Kurihara, K., Lim, B., Maa, G., Nussbaum, D., The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor, Scalable Shared Memory Multiprocessors, Kluwer Academic Publishers, 1991.
[Bac93] Bacon, J., Concurrent Systems, An Integrated Approach to Operating Systems, Database, and Distributed Systems, Addison-Wesley, 1993.
[BZ91] Bershad, B.N., Zekauskas, M.J., Midway: Shared memory parallel programming with entry consistency for distributed memory multiprocessors, Technical Report CMU-CS-91-170, School of Computer Science, Carnegie Mellon University, 1991.
[Bla90] Blank, T., The MasPar MP-1 architecture, Proc. 35th IEEE Computer Society International Conference (COMPCON), 1990, 20-24.
[BLL92] Bricker, A., Litzkow, M., Livny, M., Condor technical summary, Technical Report, Computer Sciences Department, University of Wisconsin-Madison, 1992.
[BW97] Brooks III, E.D., Warren, K.H., A study of performance on SMP and distributed memory architectures using a shared memory programming model, SC97: High Performance Networking and Computing, San Jose, 1997.
[Cha98] Charlesworth, A., Starfire: Extending the SMP envelope, IEEE Micro 18, 1998.
[CFP+93] Corbett, P.F., Feitelson, D.G., Prost, J.P., Johnson-Baylor, S.B., Parallel access to files in the Vesta file system, Proc. Supercomputing '93, 1993, 472-481.
[CK93] Cormen, T.H., Kotz, D., Integrating theory and practice in parallel file systems, Proc. of the 1993 DAGS/PC Symposium, 1993, 64-74.
[Cus93] Custer, H., Inside Windows NT, Microsoft Press, Washington, 1993.
[Dij75] Dijkstra, E.W., Guarded commands, nondeterminacy and the formal derivation of programs, Communications of the ACM 18, 1975, 453-457.
[Dou91] Douglis, F., Transparent process migration: Design alternatives and the Sprite implementation, Software Practice and Experience 21, 1991, 757-785.
[EJW96] Elnozahy, E.N., Johnson, D.B., Wang, Y.M., A survey of rollback-recovery protocols in message-passing systems, Technical Report, School of Computer Science, Carnegie Mellon University, 1996.
[Fly72] Flynn, M.J., Some computer organizations and their effectiveness, IEEE Transactions on Computers 21, 1972, 948-960.
[GBD94] Geist, A., Beguelin, A., Dongarra, J., PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge, 1994.
[GC93] Guedes, P., Castro, M., Distributed shared object memory, Proc. Fourth Workshop on Workstation Operating Systems (WWOS-IV), Napa, 1993, 142-149.
[GST+93] Gupta, A., Singh, J.P., Truman, J., Hennessy, J.L., An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors, Proc. Supercomputing '93, 1993, 214-225.
[Her97] Herdeg, G.A., Design and implementation of the AlphaServer 4100 CPU and memory architecture, Digital Technical Journal 8, 1997, 48-60.
[Hoa74] Hoare, C.A.R., Monitors: An operating system structuring concept, Communications of the ACM 17, 1974, 549-557.
[Div95] Intel Corporation Supercomputer Systems Division, Paragon System User's Guide, 1995.
[JS98] Jiang, D., Singh, J.P., A methodology and an evaluation of the SGI Origin2000, Proc. SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1998, 171-181.
[Joh88] Johnson, E.E., Completing an MIMD multiprocessor taxonomy, Computer Architecture News 16, 1988, 44-47.
[KDC+94] Keleher, P., Dwarkadas, S., Cox, A.L., Zwaenepoel, W., TreadMarks: Distributed shared memory on standard workstations and operating systems, Proc. of the Winter 94 Usenix Conference, 1994, 115-131.
[KYA96] Kim, K.H., You, J.H., Abouelnaga, A., A scheme for coordinated independent execution of independently designed recoverable distributed processes, Proc. IEEE Fault-Tolerant Computing Symposium, 1996, 130-135.
[Lam78] Lamport, L., Time, clocks and the ordering of events in a distributed system, Communications of the ACM 21, 1978, 558-565.
[LL94] Laudon, J., Lenoski, D., The SGI Origin: A ccNUMA highly scalable server, Technical Report, Silicon Graphics, Inc., 1994.
[Lea93] Lea, R., Cool: System support for distributed programming, Communications of the ACM 36, 1993, 37-46.
[LLT+93] Lenoski, D., Laudon, J., Truman, J., Nakahira, D., Stevens, L., Gupta, A., Hennessy, J., The DASH prototype: Logic overhead and performance, IEEE Transactions on Parallel and Distributed Systems 4, 1993, 41-61.
[MPI95] Message Passing Interface Forum, MPI: A message-passing interface standard - version 1.1, 1995.
[MV93] Moos, H., Verbaeten, P., Object migration in a heterogeneous world - a multi-dimensional affair, Proc. Third International Workshop on Object Orientation in Operating Systems, Asheville, 1993, 62-72.
[Nut94] Nuttall, M., A brief survey of systems providing process or object migration facilities, ACM SIGOPS Operating Systems Review 28, 1994, 64-80.
[Pie89] Pierce, P., A concurrent file system for a highly parallel mass storage system, Proc. of the Fourth Conference on Hypercube Concurrent Computers and Applications, 1989, 155-160.
[PFD+89] Pratt, T.W., French, J.C., Dickens, P.M., Janet, S.A., Jr., A comparison of the architecture and performance of two parallel file systems, Proc. Fourth Conference on Hypercube Concurrent Computers and Applications, 1989, 161-166.
[RK79] Reed, D.P., Kanodia, R.K., Synchronization with eventcounts and sequencers, Communications of the ACM 22, 1979, 115-123.
[Roy93] Roy, P.J., Unix file access and caching in a multicomputer environment, Proc. of the Usenix Mach III Symposium, 1993, 21-37.
[SG97] Scales, D.J., Gharachorloo, K., Towards transparent and efficient software distributed shared memory, Proc. 16th ACM Symposium on Operating Systems Principles (SIGOPS'97), ACM SIGOPS Operating Systems Review 31, 1997, 157-169.
[SG94] Silberschatz, A., Galvin, P.B., Operating System Concepts, Addison-Wesley, 1994.
[SS94] Singhal, M., Shivaratri, N.G., Advanced Concepts in Operating Systems, McGraw-Hill, Inc., New York, 1994.
[SH97] Smith, P., Hutchinson, N.C., Heterogeneous process migration: The Tui system, Technical Report, Department of Computer Science, University of British Columbia, 1997.
[TMC90a] Thinking Machines Corporation, Connection Machine Model CM-2 Technical Summary, 1990.
[TMC90b] Thinking Machines Corporation, Connection Machine Model CM-5 Technical Summary, 1990.
[Zay87] Zayas, E.R., Attacking the process migration bottleneck, Proc. Symposium on Operating Systems Principles, Austin, 1987, 13-22.