INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

A Bell & Howell Information Com pany 300 North Zeeb Road. Ann Arbor, Ml 48106-1346 USA 313/761-4700 800/521-0600 A l g o r it h m s t o I m p l e m e n t S e m a p h o r e s in D is t r ib u t e d E nvironments

dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate

School of The Ohio State University

By

Mahendra A. Ramachandran, B.S., M.S.

The Ohio State University

1995

Dissertation Committee: Approved by

Dr. Mukesh Singhal Dr. Ming-Tsan Liu Adviser Dr. Thomas W. Page Jr. Department of Computer and Information Science UMI Number: 9526079

Copyright 1995 by RAMACHANDRAN, MAHENDRA A. All eights reserved.

UMI Microform 9526079 Copyright 1995, by UMI Company. All rights reserved.

This microform edition is protected against unauthorized copying under Title 17, United States Code.

UMI 300 North Zeeb Road Ann Arbor, MI 48103 © Copyright by

Mahendra A. Ramachandran

1995 To Mom and Dad for the great start you have given me in life

and to Anita and Shaila for your constant love and support.

ii A cknowledgements

This thesis would never have seen fruition without the inspiration and guidance of

Dr. Mukesh Singhal. I am deeply grateful to him for taking me under his wing, for

his foresight in making me change my thesis topic and for the guidance and encour­

agement he has shown along the way. My appreciation to my committee members,

Professors Ming-Tsan Liu and Tom Page for their valuable comments and suggestions

which have contributed to improving this dissertation .

It is in no small part thanks to Dr. and Mrs. Jerrold Voss that I came to Columbus

to pursue a bachelors degree and it is because of to them that I stayed as long as I

did. There is no way to measure the help, support and kindness they have shown me

over the years. Nor can I adequately convey my gratitude for having a home away

from home. So, allow me to say in Kiswahili, Asante Sana Rafiki Yangu.

I am grateful for the support and guidance I have received from my well wisher,

Professor Merv Muller, former Chair of the department. Thanks to the faculty of this department, especially Professors Jayasimha, Krueger, Sadayappan, and Soundarara- jan for their support during various stages of my studies. Mrs. Elley Quinlan, the TA supervisor, has always been extremely helpful. My supervisor at ATS, Mr. Clifford

Collins has been very understanding and patient when I was writing this dissertation.

The CIS office staff have been very helpful to me on many occasions.

iii I have made many a good friends over the years here. It is not easy to maintain a

good friendship with a college roommate, but Tony Kaplanis and I have managed to

do just that. Julie Hartigan, my first friend in graduate school has remained a close

friend over the years. John Mudd, with whom I enjoyed cooking out and watching the

Buckeyes on TV. . Ramanujam, who taught me how to make real coffee, and a source

of inspiration during my general exams. J. Ramachandran, Niranjan Shivaratri, and

M. G. Sriram - the CIS 727 gang, have been close friends throughout my graduate

program. Manas Mandal, my colleague since the ALPS research group days, has done

a fantastic job in maintaining the dissertation style file for LaTex. Deb Shands who

made me an aquarist, Jeff Martens with whom I have spent many hours at BW-3,

Paolo Bucci who has out lasted everyone else from the aerobics group. My office-

mates, Kalluri Eswar, Loren Schwiebert and Wang-Chien Lee who have had to put

up with me for a few years. Walid Mostafa and Amr Ellsadani who gave me sound

advice about finishing my degree before starting a job. Thank you all.

Finally, I would like to thank my family. Dr. and Mrs. Ramachandran, my

parents, have always placed the greatest emphasis on my education without which I would never have made it this far. While dad’s support throughout my education is

immeasurable, I would like to thank him specifically for proof reading my dissertation

and offering several useful criticisms. My mom has been very keen that I become the third doctorate holder in our family and I am glad to finally deliver. When I was debating whether to pursue a degree in medicine or computer science, it was my brother who convinced me to pursue the latter and for that I am very grateful. I thank my parents-in-law, Mr. and Mrs. P. A. Singaravelu, for their patience and for never loosing faith while I finished this degree. It is a miracle that I am writing this despite all those sleepless nights thanks to Shaila, nevertheless, I am so glad she was born in time to share this academic milestone with me. Anita, thanks for your support during all those ups and downs that are a part of the dissertation drudgery, for putting up with me during my onerous mood swings and for being the best friend

I could ever have hoped for. Now it is your turn to finish your degree.

v V ita

September 16, 1965 ...... Born - Madras, INDIA

1987 ...... B.S. Computer Science Ohio State University Columbus, Ohio 1989 ...... M.S. Computer Science Ohio State University Columbus, Ohio 1989-1990 ...... Graduate Research Associate, The Ohio State University. 1990-1994 ...... Graduate Teaching Associate, The Ohio State University. 1994-1995 ...... Graduate Research Associate, The Ohio State University.

Publications

Mahendra Ramachandran and Mukesh Singhal “Consensus-Based Approach to Im­ plementing Semaphores in a Distributed System”. Technical Report OSU-CISRC- 1/95-TR1, Jan 1995.

Mahendra Ramachandran and Mukesh Singhal “On the Synchronization Mechanisms in Distributed Shared Memory Systems”. Submitted to Journal of Parallel and Dis­ tributed Computing. Technical Report OSU-CISRC-10/94-TR54, Oct 1994.

Mahendra Ramachandran and Mukesh Singhal “Decentralized Semaphore Support in a Virtual Shared Memory System”. To appear in Journal of Supercomputing Vol. 9, No. 1, 1995. Mahendra Ramachandran and Mukesh Singhal “Distributed Semaphores”. Submit­ ted to The International Conference on Parallel Processing. Technical Report OSU- CISRC-6/94-TR34, June 1994.

M. Mandal, M. Ramachandran, and P. Vishnubhotla. “The ALPS Kernel for Pro­ cessor Networks”. Technical Report OSU-CISRC-ll/93-TR 41, Nov 1993.

M. Ramachandran, M. Mandal, and P. Vishnubhotla. “Topology-Independent Map­ ping for Transputer Networks”. In Transputer Research and Applications 4, David L. Fielding, editor, pages 9-17. I.O.S. Press, Amsterdam, October 1990.

P. Vishnubhotla, M. Mandal, and M. Ramachandran. “Portable Parallel Program­ ming: The ALPS Approach”. Technical Report OSU-CISRC-ll/89-TR 53, Nov 1989 (Revised July, 1990).

P. Vishnubhotla, M. Mandal, J. Mudd, A. Mitschele-Thiel, M. Ramachandran, et.al. “The Alps Kernel”. Technical Report OSU-CISRC-8/89-TR 37, Aug 1989 (Revised July, 1990).

Fields of Study

Major Field: Computer and Information Science

Studies in: Distributed Systems Prof. Mukesh Singhal Programming Languages Prof. Neelam Soundararajan Computer Architecture Prof. D. N. Jayasimha Operating Systems Prof. Prasad Vishnubhotla T a b l e o f C o n t e n t s

DEDICATION ...... ii

ACKNOWLEDGEMENTS ...... iii

VITA ...... vi

LIST OF TABLES...... xi

LIST OF FIGURES ...... xii

CHAPTER PAGE

I Introduction ...... 1

1.1 Background and Motivations ...... 1 1.2 Contributions of the Dissertation ...... 8 1.2.1 The Semaphore-page Approach...... 8 1.2.2 The Clustered Approach ...... 9 1.2.3 The Consensus-based Approach ...... 9 1.2.4 Comparative Performance S tu d y ...... 10 1.3 Organization of Dissertation ...... 11

II A Taxonomy of Synchronization Mechanisms in Distributed Shared Mem­ ory Systems ...... 12

2.1 Introduction ...... 12 2.2 Prelim inaries ...... 14 2.2.1 DSM System s ...... 14 2.2.2 Basic Synchronization M echanism s ...... 17 2.3 A Taxonomy of Synchronization Mechanisms in DSM Systems . . . 19 2.4 Synchronization Schemes in Existing DSM System s ...... 24 2.4.1 Im plem entations ...... 24 2.4.2 Hardware Im plem entations ...... 30 2.5 Discussion and Summary ...... 34

III The Semaphore Page Approach...... 37

3.1 Introduction ...... 37 3.2 The Semaphore-page Approach...... 38 3.2.1 System Model and Definitions ...... 39 3.2.2 P Operation ...... 41 3.2.3 V O p e ra tio n ...... 45 3.2.4 Updating the Structures in the Semaphore-page ...... 45 3.2.5 Techniques for Performance Im provem ent ...... 52 3.3 C orrectness ...... 52 3.3.1 Safety P roperty ...... 53 3.3.2 Liveness Property ...... 54 3.4 A Performance Study ...... 57 3.4.1 Simulation M odel ...... 58 3.5 S u m m ary ...... 61

IV The Clustered Semaphore Approach ...... 67

4.1 Introduction ...... 67 4.2 System Model and D efinition ...... 68 4.3 Accessing a Semaphore...... 70 4.3.1 Intracluster A ccess ...... 71 4.3.2 InterCIuster Access ...... 75 4.3.3 Combining Intracluster and Intercluster Operations 80 4.4 C orrectness ...... 81 4.4.1 Mutually Exclusive P and V Operations ...... 82 4.4.2 Deadlock F reed o m ...... 82 4.4.3 Starvation Freedom ...... 84 4.5 Performance of the Clustering Approach ...... 86 4.5.1 Complexity of the Clustering Scheme...... 86 4.5.2 Simulation R e su lts ...... 87 4.6 S u m m ary ...... 89

ix V Consensus-Based Approach ...... 94

5.1 Introduction ...... 94 5.2 System Model and Definition ...... 95 5.3 A Distributed Implementation of Semaphores ...... 96 5.4 A Proof of the Correctness ...... 102 5.4.1 Deadlock Freedom ...... 102 5.4.2 , Starvation Freedom ...... 103 5.5 Performance Enhancement Techniques ...... 104 5.6 Performance S tu d y ...... 107 5.6.1 Simulation R e su lts ...... 108 5.7 S um m ary ...... 110

VI Comparative Performance Evaluation ...... 115

6.1 Introduction ...... 115 6.2 System With One Binary Semaphore ...... 116 6.3 Multiple Binary Semaphores ...... 118 6.4 Multiple Resource Counting Semaphores ...... 120 6.5 S u m m ary ...... 121

VII Summary and Directions ...... 129

7.1 Summary of Results ...... 130 7.1.1 Taxonomy of Synchronization Mechanisms in DSM Systems 130 7.1.2 The Semaphore-page A pproach...... 131 7.1.3 The Clustered Approach ...... 132 7.1.4 The Consensus-based Approach ...... 133 7.1.5 Comparative Performance S t u d y ...... 134 7.2 Future Directions ...... 135

BIBLIOGRAPHY ...... 137

x L is t o f T a b l e s

TABLE PAGE

1 A Summary of Synchronization Mechanisms ...... 35

2 Maximum Throughput Versus Number of Processors...... 63

3 Average and Maximum Number of Messages Required to Perform a P Operation ...... 63 L is t o f F ig u r e s

FIGURE PAGE

1 Processor-Memory Model in a Multiprocessor System ...... 3

2 Processor-Memory Model in a Distributed System...... 4

3 Processor-Memory Model in a Distributed Shared Memory System ...... 15

4 Definition of P and V operations on a semaphore ...... 18

5 Contents of a Semaphore Structure ...... 40

6 State of Semaphore-page SPj on Processor Pi...... 42

7 Code Executed to Access a Semaphore and Perform a P Operation. . . . 43

8 Code to Service a Request for a Semaphore ...... 44

9 Code to Perform a V Operation ...... 46

10 Holder_status Array for Semaphore-page SPj on Processor Pi...... 48

11 Modified Code to Access a Semaphore for a P Operation ...... 49

12 Code to Update the Semaphore-page ...... 50

13 Modified Code to Service a Request for a Semaphore...... 50

14 Modified Code to Perform a V Operation ...... 51

xii 15 Response Time vs. XCST. Semaphore-page and Centralized Schemes. One Binary Semaphore (10 nodes, CST= 4.0 msec, TZA= 10.0 msec)...... 60

16 Response Time vs. XCST. Semaphore-page and Centralized Schemes. One Binary Semaphore (35 nodes, CST= 4.0 msec, TZA= 10.0 msec)...... 61

17 Response Time vs. XCST. Semaphore-page and Centralized Schemes. Mul­ tiple Binary Semaphores (10 nodes, CST— 4.0 msec, TIA= 10.0 msec). . . 62

18 Response Time vs. XCST . Semaphore-page and Centralized Schemes. Mul­ tiple Binary Semaphores (35 nodes, CST= 4.0 msec, TZA= 10.0 msec). . . 64

19 Response Time vs. XCST. Semaphore-page and Centralized Schemes. Mul­ tiple Resource Counting Semaphores (10 nodes, CST= 4.0 msec, 7ZA= 10.0 msec)...... 64

20 Response Time vs. XCST. Semaphore-page and Centralized Schemes. Mul­ tiple Resource Counting Semaphores (35 nodes, CST= 4.0 msec, 7lA= 10.0 msec)...... 65

21 Speedup Attained Over the Centralized Scheme for Various System Sizes (CST= 4.0 msec, 1ZA= 10.0 msec)...... 65

22 Speedup Attained Over the Centralized Scheme for Various System Sizes (iCST= 10.0 msec, 7ZA= 10.0 msec)...... 66

23 Clustering the Processors for semaphore Si ...... 69

24 Executed by a node to access a semaphore to perform a P operation. . . . 73

25 Executed by a node on receipt of a P-request message...... 74

26 Executed by a node to perform a V operation ...... 75

27 Executed by a node on receiving of a V.request message...... 75

28 Executed by a node to initiate a P operation (both intra and intercluster). 78

xiii 29 Executed by a proxy node on receiving a P.proxyrequest message...... 79

30 Response Time vs XCST. Binary Semaphores on 10 Processors ( CST= 4 msec, TZA= 10 msec)...... 89

31 Response Time vs XCST. Binary Semaphores on 35 Processors ( CST = 4 msec, 7ZA= 10 msec)...... 90

32 Response Time vs XCST. Resource Counting Semaphores on 10 Processors (CST= 4 msec, 7ZA= 10 msec) ...... 91

33 Response Time vs XCST. Resource Counting Semaphores on 35 Processors (CST= 4 msec, 7ZA= 10 msec) ...... 91

34 Relative Speedup of the Cluster scheme over the Distributed and Centralized Schemes for 10 Processors...... 92

35 Relative Speedup of the Cluster scheme over the Distributed and Centralized Schemes for 35 Processors...... 92

36 Average Number of Messages Required to Access a Semaphore for the Clus­ tered and Nonclustered Approaches (10 Processors) ...... 93

37 Average Number of Messages Required to Access a Semaphore for the Clus­ tered and Nonclustered Approaches (35 Processors) ...... 93

38 Code to perform a P operation ...... 97

39 Executed by a node on receiving a P_request message ...... 98

40 Definition of Precedence Relationship of Two P.request Messages...... 99

41 Executed by a Node to Perform a V Operation ...... 101

42 Executed by a Node on Receiving a V_request Message ...... 101

43 Code to perform a P operation ...... 106

xiv 44 Executed by a node on receiving a P operation request ...... 107

45 Executed by a node to perform a V Operation ...... 108

46 Response Time Versus XCST. Consensus-based and Centralized Approaches, One Binary Semaphore on 10 Processors (CST= 4 msec, 7ZA= 10 msec) . 110

47 Response Time Versus XCST. Consensus-based and Centralized Approaches, One Binary Semaphore on 35 Processors (CST= 4 msec, 72.-4= 10 msec) . I l l

48 Response Time Versus XCST. Consensus-based and Centralized Approaches, Multiple Binary Semaphores on 10 Processors ( CST= 4 msec, 7ZA= 10 msec) 112

49 Response Time Versus XCST. Consensus-based and Centralized Approaches, Multiple Binary Semaphores on 35 Processors ( CST= 4 msec, 71A= 10 msec) 112

50 Response Time Versus XCST. Consensus-based and Centralized Approaches, Multiple Resource Counting Semaphores on 10 Processors ( CST= 4 msec, 7ZA= 10 m sec)...... 113

51 Response Time Versus XCST. Consensus-based and Centralized Approaches, Multiple Resource Counting Semaphores on 35 Processors ( CST= 4 msec, 7ZA= 10 m sec)...... 113

52 Speedup of the Consensus-based Approach over the Centralized Approach. Multiple Binary Semaphores on 10 and 35 Processors ( CST = 4 msec, 7ZA= 10 m se c )...... 114

53 Speedup of the Consensus-based Approach over the Centralized Approach. Multiple Resource Counting Semaphores on 10 and 35 Processors ( CST = 4 msec, 7lA= 10 msec)...... 114

54 Response Time vs. XCST. One Binary Semaphore on 10 Processors ( CST= 4 msec, 7ZA= 10 m sec)...... 117

55 Response Time vs. XCST. One Binary Semaphore on 35 Processors (CST= 4 msec, 7ZA= 10 m sec)...... 118

xv 56 Response Time vs. XCST. One Binary Semaphore on 10 Processors (i CST= 20 msec, TZA= 10 m se c )...... 119

57 Response Time vs. XCST. One Binary Semaphore on 35 Processors ( CST = 20 msec, %A= 10 m se c )...... 120

58 Response Time vs. XCST. Multiple Binary Semaphores on 10 Processors (CST= 4 msec, %A= 10 msec) ...... 121

59 Response Time vs. XCST. Multiple Binary Semaphores on 35 Processors (CST= 4 msec, 1ZA= 10 msec) ...... 122

60 Response Time vs. XCST. Multiple Binary Semaphores on 10 Processors (iCST= 20 msec, TlA= 10 m sec)...... 123

61 Response Time vs. XCST. Multiple Binary Semaphores on 35 Processors (CST= 20 msec, 'RA= 10 m sec)...... 124

62 Average Number of Messages Required to Acquire the Semaphore. Multiple Binary Semaphores on 10 Processors (C

63 Average Number of Messages Required to Acquire the Semaphore. Multiple Binary Semaphores on 35 Processors (CST= 4 msec, TlA= 10 msec) . . . 125

64 Response Time vs. XCST. Multiple Resource Counting Semaphores on 10 Processors (CST= 4 msec, 7ZA= 10 msec)...... 125

65 Response Time vs. XCST. Multiple Resource Counting Semaphores on 35 Processors (CST= 4 msec, TZA= 10 msec)...... 126

66 Response Time vs. XCST. Multiple Resource Counting Semaphores on 10 Processors (CST= 20 msec, 7ZA= 10 m sec)...... 126

67 Response Time vs. XCST- Multiple Resource Counting Semaphores on 35 Processors (CST= 20 msec, 7ZA= 10 m se c)...... 127

xvi 68 Average Number of Messages Required to Acquire the Semaphore. Multiple Resource Counting Semaphores on 10 Processors {CST = 4 msec, 7ZA= 10 m sec)...... 127

69 Average Number of Messages Required to Acquire the Semaphore. Multiple Resource Counting Semaphores on 35 Processors (CST = 4 msec, 7ZA= 10 m sec)...... 128

xvii CHAPTER I

Introduction

1.1 Background and Motivations

If only one process may exist at a time in a computer system, this process may access

resources such as memory, files and other peripheral devices without the interven­

tion of the operating system. However, if more than one process can exist in the

system concurrently, the resources may be shared by these processes and the proper

handling of access to these shared resources becomes an important issue. Operating

systems that support multiprogramming must provide mechanisms through which the

concurrent processes can synchronize with each other when accessing these resources.

Uniprocessor systems, computer systems with one processor (or CPU), can provide

a synchronization mechanism which is developed completely in software and imple­

mented in the operating system kernel. One of the earliest such mechanisms, proposed

by Dekker, allowed two processes to synchronize with one another to gain mutually exclusive access to a shared resource [14]. Dijkstra, subsequently, extended Dekker’s algorithm to achieve mutual exclusion among N processes. Numerous mechanisms have been proposed since then, including semaphores by Dijkstra [15], eventcounts

1 2 by Reed and Kanodia [45], monitors by Hoare [26], and serializers by Hewitt and

Atkinson [25].

A shared-memory multiprocessor system is composed of multiple CPUs connected through an interconnection (IC) network to a commonly shared memory (Figure

1). The IC network may be a shared bus such as in the Balance, Symmetry and

FX from Sequent Computer Systems, and the Multimax from Encore Corporation; these machines are called uniform memory architectures. The IC network may be a multistage network such as in the BBN Butterfly. A few experimental systems have been designed that are composed of clusters of processors and memories connected by busses or multistage networks (e.g., C.mmp and Cm* from CMU [28], and Cedar from

Illinois [29]). These systems are called non-uniform memory architectures because, while the memory is completely shared by all the processors, the costs associated with accessing different memory locations may vary [1],

In a multiprocessor system, a typical application is composed of multiple processes that interact with each other through shared variables. In order to ensure correct access to these shared variables, synchronization mechanisms are necessary. Purely software based solutions (such as Dekker’s and Dijkstra’s algorithms) are impractical for such systems because concurrent processes executing the synchronization code in parallel may result in a heavy load on the interconnection network. To reduce the load on the network, hardware instruction such as test-and-set and fetch-and-add have been designed. These instructions read the current contents of a memory location and write to the same memory location in one machine cycle. This allows a process 3 to test a variable and assign it a new value atomically. Synchronization mechanisms like semaphores, eventcounts and monitors can be implemented on multiprocessors using these hardware instructions.

CPU Memory

Memory

CPU IC

Network Memory

CPU Memory

Memory CPU

Figure 1: Processor-Memory Model in a Multiprocessor System.

Synchronization in Distributed Memory Systems

Another class of multiprocessor systems, known as distributed systems or proces­ sor networks, consist of CPUs with their own private memory interconnected via a network (Figure 2). The network may have a static topology such as a hypercube or a mesh with a fixed number of processing elements, or it may be a more dynamic topology such as an ethernet network (linear or star) or ATM network (arbitrary topology) with varying number of processing elements. The processing elements in the latter category are typically full-fledged workstations. Distributed systems are 4

Memory CPU CPU Memory

Memory CPU Comm. CPU Memory

Network

Memory CPU CPU Memory

Memory CPU CPU Memory

Figure 2: Processor-Memory Model in a Distributed System. characterized by a lack of shared memory where communications between the nodes is achieved through message passing only.

A distributed system is composed of n computers which do not share a global memory or a global clock. Each computer consists of a processor and private memory with its own copy of the operating system kernel (or micro-kernel) running on it. The computers communicate with one another through messages only. It is assumed that the communication network is reliable (i.e., failure resilient) and delivers the messages in a FIFO order. 5

Distributed Synchronization

A parallel computation in such environments is composed of a set of processes,

each with its private data space. The processes communicate with each other through

the exchange of messages. The processes may need to synchronize with each other to

access the shared resources. In distributed systems, the lack of shared memory makes

is difficult to implement synchronization mechanisms based on shared variables such

as those mentioned above. A synchronization mechanism must be implemented using

the message passing paradigm. Typically, a single node in the system can be chosen

as the site where the mechanism is implemented and synchronization requests from

other nodes are sent as messages to this site. Such an implementation has the obvious drawbacks of any centralized scheme; it is a single point of failure and can become

a bottle-neck under a high rate of synchronization accesses, cause congestion of the network links near the controller node, and the synchronization delay is twice the message delay.

Thus, it is desirable to have efficient, decentralized algorithms to implement shared variable based synchronization mechanisms in a distributed system.

To provide a mutually exclusive access to a critical section in a distributed system, several decentralized algorithms based on message passing have been proposed [30,

46, 36, 51, 39, 48]. A survey of various mutual exclusion algorithms appears in [49].

Requesting and acquiring permission to execute a critical section is like performing a

P operation and passing this permission to another site is analogous to a V operation. 6

However, these algorithms are not easily extendable to general problems such as process synchronization and -ary exclusion. Note that these mutual exclusion algorithms indirectly implement a binary semaphore in distributed systems. There is still a need for general purpose synchronization mechanisms such as resource counting semaphores.

While distributed memory architectures such as hypercubes and meshes are ap­ pealing because they scale better than shared memory multiprocessors, they are not as easy to program as the latter. This has motivated research into what is known as Distributed Shared Memory systems (DSM) [31, 40]. Such systems provide a shared memory programming paradigm on a distributed memory multiprocessor sys­ tem or a network of computers. The rationale is that the combination of the shared memory paradigm and the distributed memory architectures eases the tasks of pro­ gramming and portability, while maintaining scalability. This makes it all the more important to provide shared variable-like synchronization mechanisms in distributed systems. Most DSM systems provide centralized implementations of mechanisms such as semaphores.

Synchronization in Distributed Systems

A distributed database system consists of a set of computers or sites that are connected by a communication medium (similar to distributed systems, described above). There is no globally shared memory and sites communicate through messages.

The database is partitioned among the sites of the system. A data object is located on any one of the sites of the system. The data object may be located on more than one site if availability of the database is a major concern. Semantic relationships exist

among the objects in the database and it must be preserved by the execution of the

transactions.

A program (or transaction) in a database system may read and update a collection

of data objects. The actions performed by a transaction preserve the consistency of

the database. Typically, multiple transactions may execute concurrently in a database

system and for efficiency, the actions performed by the transactions are interleaved.

Though a serial execution of the transactions will preserve database consistency, their

concurrent execution may not. This raises the issue of how to maintain the integrity

of the database [4, 50].

Providing mutually exclusive access to the shared resources is insufficient because

it is possible for for the actions of concurrent transactions to interleave in such a man­

ner that database consistency is violated. This is because mutually exclusive access to

the objects does not prevent transactions from observing an inconsistent state of the database. Database concurrency control mechanisms are used to restrict interleav­ ing of transactions to ensure database consistency is maintained. These mechanisms ensure that only those interleavings whose resulting behavior is identical to a serial execution of the transactions are allowed to execute. Several concurrency control algorithms have been proposed to implement such mechanisms [17, 5]. Database systems require stronger synchronization mechanisms than are needed for process synchronization in general distributed applications. The algorithms presented in this 8

dissertation are intended for general distributed systems and may not be suitable for

distributed database systems.

1.2 Contributions of the Dissertation

In this dissertation, we first examine synchronization mechanisms provided by a va­

riety of distributed shared memory systems, both software and hardware based. We

classify the synchronization mechanisms supported in existing DSM systems accord­

ing to three criteria: software versus hardware, centralized versus distributed, and

a mechanism that is integrated into the DSM versus one which is not. We review

synchronization mechanisms provided in the existing DSM systems and categorize

the systems according to the developed classification.

Then three decentralized algorithms to implement resource counting semaphores

in a distributed system are presented. These algorithms are briefly described below.

1.2.1 The Semaphore-page Approach

In the Semaphore-page approach, the semaphores are grouped into pages called

semaphore-pages. These pages can be similar to the data pages in DSM systems,

however, the semaphore-pages are replicated on the processors of the system and

algorithms to maintain their consistency are described. This approach enables pro­

cesses to exploit the locality of reference which exists in accessing semaphore variables

by caching them on their processors. Each page maintains information pertaining to

each semaphore; either the semaphore itself or an indication about its current local­ ity. When a pair of nodes communicate, they can exchange information about the semaphores in their respective pages. This improves the locality information main­ tained in the pages by the nodes. A proof of the correctness of the algorithm is presented. Performance studies using simulations indicate that this approach per­ forms better than the centralized approach. Furthermore, it eliminates the single point of failure and the bottleneck problems inherent in centralized schemes.

1.2.2 The Clustered Approach

In the Clustered approach, the processors in the system are partitioned into clusters.

The number and size of each cluster is a function of the semaphore’s initial value.

Each cluster of processors maintains a copy of the semaphore. Within each cluster, the semaphore migrates from site to site, somewhat similar to a token in mutual exclusion algorithms. Note that the passing of the semaphore from one processor to another does not signify the V operation, unlike in mutual exclusion algorithms.

Aside from the intra-cluster access to the semaphore, which is performed most of the time, inter-cluster access is sometimes necessary. This is performed only if the clus­ ter’s semaphore is zero. This movement of semaphore values across clusters produces a load-balancing effect in the system with a greater concentration of semaphore values in those clusters where the demand is high. A deadlock and livelock freedom proof is presented. Performance studies based on simulations have shown that clustering plays a significant role in the efficiency of this approach. 10

1.2.3 The Consensus-based Approach

The Consensus-based approach to implementing semaphores is similar to the consensus-

based approach known for mutual exclusion algorithms. In this approach, a node

sends timestamped requests to other nodes in the system and completes the P oper­

ation only when it receives confirmations from all other sites. Techniques to improve

the performance of this approach, including an algorithm that forms a consensus in a

clustered environment are discussed. A proof of deadlock freedom and starvation free­

dom is presented. Performance of the consensus-based approach in both the clustered

and unclustered environments is presented.

1.2.4 Comparative Performance Study

Performance studies of the three algorithms are presented and compared with each

other. We investigate how the three algorithms perform for varying numbers of pro­

cessors, rates of access, semaphore access patterns, and critical section durations.

In order to investigate how the number and types of semaphores affects the per­

formance, simulations were performed for one semaphore that is accessed by all the

user processes as well as for multiple semaphores that are accessed with even prob­

ability. These experiments were performed for semaphores that were initialized as either binary or resource counting.

The critical section time is the amount of time spent by a process after successfully performing a P operation and before performing the V operation. Experiments were performed for two different critical section time values - short (4 msec) and long (20 11

msec). We define short and long with respect to the cost of a message between a pair

of processors in the system. This cost, called the remote access cost, was fixed at 10

msec.

For each type of semaphore, simulations were performed for the different critical

section times for 10 and 35 processor systems. Plots of the response time, average

number of messages required, and speedup are presented.

1.3 Organization of Dissertation

The rest of the dissertation is organized as follows: In Chapter II, the basic synchro­

nization mechanisms are described. A characterization and classification taxonomy

of the various synchronization mechanisms supported by various Distributed Shared

Memory systems is presented followed by descriptions of the actual mechanisms sup­

ported by specific DSM systems.

In Chapters III, the Semaphore-page approach is presented with proofs of the

correctness of the approach and performance results. In Chapter IV, the Clustered

approach to implementing a token-based scheme is presented along with its correct­

ness proofs and simulation based performance studies. In Chapter V the algorithms for the Consensus-based approach together with proof of its correctness and simula­ tion results are presented.

In order to compare the behavior of the three approaches, a comparative perfor­ mance study is presented in Chapter VI. A summary the results of this research along with a few thoughts on future directions is discussed in Chapter VII. CHAPTER II

A Taxonomy of Synchronization Mechanisms in Distributed Shared Memory Systems

2.1 Introduction

Prom an architectural point of view, distributed memory systems are more attractive than shared memory multiprocessors because they perform better, are cheaper to build with off-the-shelf components, and are more scalable. However, the develop­ ment of large application software and tools such as compilers is considerably more advanced in shared memory multiprocessors. This is primarily due to the ease of programming of the shared memory paradigm.

It is therefore appealing to provide the shared memory programming paradigm in distributed systems. The rationale is that the combination of the shared memory paradigm and the distributed memory architecture eases the tasks of programming and portability, while maintaining efficiency and scalability. This has motivated re­ search into what is referred to as Distributed Shared Memory systems (DSM).

A DSM system supports the shared memory programming paradigm on a dis­ tributed system or processor network. The DSM system may be either an operating system kernel, a runtime library, dedicated hardware or a combination of these. A

12 13

parallel application written for a DSM system consists of a set of processes which

access shared data similar to a shared memory application. The DSM system, trans­

parently handles the access of the shared data from remote locations of the distributed

system.

As in any concurrent programming environment, it is essential to provide mecha­

nisms through which the processes in a DSM application can access shared data for

communication and synchronization in a coordinated manner. While in shared mem­

ory systems, the synchronization mechanisms are based on atomic access to shared variables which are located in the same address space as other variables, in DSM systems, the variables used for synchronization cannot be placed in the same address space as other shared variables. This is because thrashing of the shared page between remote sites competing for synchronization variable would result in significant per­ formance degradation. Therefore, in DSM systems the synchronization mechanisms must be provided through some other means.

In this chapter we examine the types of synchronization mechanisms supported by DSM systems and investigate their implementation strategies. A taxonomy of the various synchronization mechanisms in DSM systems has been developed. The representative DSM systems are classified according to this taxonomy.

The rest of this chapter is organized as follows. Section 2.2, examines a DSM system in greater detail and describes basic synchronization mechanisms supported by DSM systems. In Section 2.3, a taxonomy of synchronization mechanisms in DSM systems is presented. The synchronization mechanisms supported in representative 14

DSM systems are described and classified according to this taxonomy in Section 2.4.

Sec 2.5 summarizes the chapter.

2.2 Preliminaries

In this section, the structural model of DSM systems is described and a need for syn­ chronization mechanisms in such systems is motivated. Some of the basic synchro­ nization mechanisms incorporated in various DSM systems are also briefly described.

2.2.1 DSM Systems

A DSM is a combination of a distributed system (or processor network) and a shared- memory multiprocessor system. Physically, it is a processor network; processors with their own memory connected by an interconnection network (similar to Figure 2).

Logically, it is a shared-memory system; processes can access memory on any proces­ sor just as if it were a local memory access. The processes interact through shared variables just as the parallel processes in shared-memory multiprocess systems do.

Figure 3 gives a pictorial representation of a DSM system. Support for the globally shared memory view is provided either by the operating system kernel, a runtime library, dedicated hardware or a combination of these.

An application in a DSM system is composed of a set of concurrent processes. A process may access private data which typically resides on the same processor as the process. The process may also access data shared with other processes. This data may reside on the same site as the process, or it may reside on some other processor. 15

CPU CPU CPU CPU

MemoryMemory Memory Memory

Globally Shared Memory

Figure 3: Processor-Memory Model in a Distributed Shared Memory System.

It is the responsibility of the DSM system to retrieve data from remote sites for an

application in a transparent manner.

Data Coherence

Typically, in order to improve the performance of the system, shared data or pages

are cached on demand, similar to the manner in which pages are brought into main

memory in traditional virtual memory systems. However, the pages in DSM systems

are resources that are competed for by concurrent processes that are executing on different processors. In order to reduce the effects of thrashing and to improve the efficiency of program execution, it is desirable to replicate the shared pages and place copies on various processors of the system. While this replication allows concurrent 16

local access to pages by the processes, the copies become outdated when one of the

pages is modified by its processor. A similar problem arises in shared memory multi­

processors when multiple copies of shared data are cached by individual processors.

While hardware-based solutions are used in multiprocessor systems to maintain the

coherence of the caches, DSM systems must solve this problem in software. Maintain­

ing the coherence of shared pages has been the focus of the research is DSM systems

[2, 5, 11, 38, 50], While a coherence maintenance protocol ensures that replicated

data is updated in an application transparent manner, it is insufficient to ensure the

correct execution of a parallel program.

Synchronization

Aside from the issue of maintaining the coherency of shared data, it is important

for a DSM system to provide mechanisms through which access to shared data by pro­

cesses can be synchronized. A typical application on DSM systems consists of several

concurrent processes that access shared variables for synchronization and communica­

tion, and this concurrent access to the shared data must be synchronized to maintain

the integrity of the shared data. In shared memory multiprocessor systems, these synchronization mechanisms are based on atomic access to shared variables (e.g., semaphores and monitors). An excellent survey of these synchronization methods may be found in [15]. In DSM systems, if the variables used for synchronization were placed in shared pages, significant performance degradation would result due to the thrashing of the shared page between the processors as a result of false sharing. The alternative is to place the synchronization variables in memory that is not part of the 17 shared page. This requires all synchronization operations to be sent to the processor

in whose memory the synchronization variable has been mapped. This results in a

centralized scheme. The obvious disadvantage is that the central server can become a bottle-neck, is a single point of failure, and can cause congestion of the network links near the central server node. In Section 2.4, alternative methods to implementing synchronization mechanisms are discussed.

2.2.2 Basic Synchronization Mechanisms

Semaphores

The concept of semaphores was proposed by Dijkstra to solve a variety of synchro­ nization problems, namely, mutual exclusion, K-ary mutual exclusion, and process synchronization in the uniprocessor system [14]. A semaphore is a synchronization variable which is initialized to one for a binary semaphore or greater than one for a resource counting semaphore. Binary semaphores are used for mutual exclusion and process synchronization whereas resource counting semaphores are used for K-ary mutual exclusion.

There are two atomic operations that can be performed on Semaphores. These are called P for Proberen which means “to test” and V for Verhogen which stands for “to increment” in Dutch (Dijkstra is from the Netherlands.) These operations are defined in Figure 4.

Eventcounts

The eventcount synchronization mechanism was proposed by Reed and Kanodia

[43] as an alternative to the mutual exclusion type of synchronization provided by 18

P(S): if (S > 1) V(S): if (semaphore queue is not empty) th en th e n S:=S- 1; de-queue a process; else else block the process on S:=S + 1; the semaphore queue;

Figure 4: Definition of P and V operations on a semaphore.

mechanisms such as binary semaphores and monitors [24]. An eventcount is a mono-

tonically increasing integer variable. Whenever a process needs to signal the occur­

rence of an event to the other processes in the system, it does so by incrementing

the eventcount variable. To increment the eventcount variable, a process executes the

advance(eventcount) primitive.

A process can observe the current state of the variable eventcount by executing

the read(eventcount) primitive. If a process must wait for the variable eventcount to reach a particular value, the process need not repeatedly perform read operations in a loop. Instead, it can block and wait for the variable to reach the specified value by executing the await(eventcount, value) primitive.

Locks

In locking mechanisms, a lock is associated with every shared resource. A process gets mutually exclusive access to a resource by acquiring the lock on the resource.

This is done by performing an acquire operation on the particular lock variable. When the process succeeds in acquiring the lock, it is the sole possessor of the resource until 19

it relinquishes the lock. This is done by performing a release operation. When a lock

is released, any of the processes waiting to acquire the lock is granted the lock.

Barriers

A barrier is a synchronization mechanism that allows multiple processes to syn­

chronize at certain points during program execution [52, 35]. This allows multiple

processes to complete a phase of computation, synchronize to ensure that all other

processes have also completed the phase, and then proceed to the next phase of com­

putation. When each process arrives at a barrier, it blocks until all other processes

have arrived at the barrier. At that point, all the blocked processes are unblocked

and proceed with their computations.

2.3 A Taxonomy of Synchronization Mechanisms in DSM Systems

Synchronization mechanisms provided in existing DSMs can be classified along three

criteria.

1. It may be a hardware or a software-based solution.

2. A mechanism may be implemented in a centralized or a distributed manner.

3. A mechanism may be integrated into the DSM system or non-integrated.

We next discuss these criteria in detail.

Hardware versus Software

A DSM system can be implemented either in hardware or software. Early DSMs were software based and were implemented on multicomputers whose architecture 20 provided no support for a shared memory environment [29, 36, 42, 6]. Increasingly, support for DSM is being provided in hardware for efficiency [5,16,17, 3, 22]. This has enabled the development of distributed memory computers with hardware support for shared memory.

Software implemented systems are designed to provide the DSM on existing dis­ tributed memory systems. These systems are implemented either as an entire operat­ ing system or as a library of user callable routines and a runtime system which ran on top of the operating system [42, 29]. The libraries provide the primitives supported by the DSM and the runtime system implements the mechanisms that access the shared data and perform synchronization. While such systems serve as valuable testbeds to validate a particular mechanism, the overhead incurred is high. This overhead can be reduced by incorporating these mechanisms in the kernel [33]. The architec­ ture on which software based DSMs are implemented typically provide rudimentary synchronization (e.g., lock-unlock, semaphores or just a hardware Test&Set) on a per-node basis. It is the responsibility of the DSM to provide global synchronization mechanisms to synchronize processes across the network.

Recently, considerable effort has been directed into building large-scale distributed memory multiprocessors with hardware support for shared memory. These DSM systems provide mechanisms for accessing the distributed memory as if it were a single address space. This support is provided in the hardware and in some cases include software-based mechanisms as well [17]. A variety of architectures exist. In some systems, local memory of a processor is used as its cache and the local memories of the rest of the system are viewed as the global memory for that processor [16, 3, 9]. Other systems support a hierarchical view of memory similar to the cache-main-secondary memory hierarchy in traditional operating systems. In these systems, the memory hierarchy is divided into memory that is local to a processor, local to a cluster of processors, and global to the entire network of processors. Most of the systems provide some form of synchronization beyond the traditional per-processor synchronization found in multicomputers. Some systems provide global synchronization and others a restricted or hierarchical synchronization. We will describe these synchronization mechanisms when we describe some representative systems in Section 2.4.

Centralized versus Distributed

A synchronization mechanism can be implemented either in a centralized or a distributed manner. In centralized schemes, a particular node in the system is defined to be the central manager node and all requests for synchronization must be sent to this node. An advantage of a centralized scheme is the ease of implementation of the synchronization mechanism. Unlike in distributed schemes, there is no need to locate the owner of the synchronization variable. However, a centralized scheme will result in a bottleneck at the manager node when the access frequency of synchronization variables is high [39]. It also makes the manager the single point of failure in the system.

It is possible to improve on the centralized scheme by using more than one man­ ager in the system. In this scheme, known as the static (or fixed) distributed scheme, there are multiple managers in the system, each of which manages a subset of the 22 synchronization variables in the system. While this helps reduce the congestion in­ herent with a single manager, it would degenerate to the performance of the single manager scheme if high access frequencies occur to one or more synchronization vari­ ables belonging to one manager. This is because the static distributed manager scheme is still a centralized scheme.

In distributed schemes, the idea of a manager is no longer applicable; instead the node that currently holds the synchronization variable is considered the holder or owner [21]. When the variable is migrated to another node, the ownership of the synchronization variable is transferred to that site as well. Since the ownership of the resource changes dynamically, a requesting site may not know the identity of the current owner of the resource. In order for a site to send its request message to the owner, it may broadcast the message to all the sites in the network, or it may multicast the request to a subset of the sites or it may send the message to the site which it considers to be the owner. Broadcasting has the advantage of being able to deliver the request with just one message hop. However it will generate a lot of network traffic which would have a negative impact on synchronization delay.

Multicasting, while not generating excessive message traffic as with a broadcast, may still require additional messages if none of the sites that received the message is the owner. Lastly, by sending a request message to the probable owner, it is possible to get a request granted with the fewest total number of messages generated. However, it is possible that ownership has moved to another site. In such a case the site receiving the request must forward the request message to a more recent owner of the variable 23 until it eventually arrives at the current owner. These techniques have been used in locating other resources (including DSM pages) in distributed systems and are not unique to synchronization variables. Fowler presents a comprehensive exposition on locating objects in a distributed environment [21].

Integrated versus Non-integrated Synchronization Mechanisms

A DSM system supports an integrated synchronization mechanism if the program­ ming model (or the communication model supported by it) defines the synchronization available for user processes. If a a synchronization mechanism is implemented similar to the way other shared data access is implemented on a particular DSM, then the

DSM supports an integrated mechanism for synchronization. For instance, if a DSM system supports shared pages as the mechanism through which global data is shared, and it supports semaphores in shared semaphore pages (possibly with different se­ mantics) then we claim that the semaphore mechanism is integrated with this DSM.

Another example is a hardware-based DSM which provides caching of shared data and synchronization variables, perhaps with different cache-coherence protocols, and provides an integrated synchronization mechanism.

A non-integrated synchronization mechanism is one whose choice was not influ­ enced by the programming model, or by the manner in which other shared data is implemented by the DSM system. One synchronization mechanism can be replaced with another with little or no effect on the programming model. The programming en­ vironment and the synchronization mechanism share no specific design goals or traits.

For instance, suppose the DSM which supports shared pages described above provides 24 static distributed semaphores, then there is a clear distinction between shared data and the semaphore variables supported by this DSM. Another example of a non- integrated mechanism is a hardware-based DSM like the one described above which allocates all semaphore variables to one processor and all P and V operations are executed by this processor in a centralized manner.

2.4 Synchronization Schemes in Existing DSM Systems

In this section, we discuss synchronization schemes in a set of representative DSM systems. We first describe synchronization schemes for software-based systems, fol­ lowed by those for hardware-based systems. For each system, a brief description of the system is first presented, followed by the synchronization mechanism supported by the system.

2.4.1 Software Implementations

Software based DSM implementations are available on a variety of platforms. Some systems have been built on mini-computers connected by Ethernet; Mirage, for in­ stance, is built on a network of Vax 11/750 computers [20]. Many others are built on a network of workstations or even personal computers, e.g., Mether and Mirage+

[36, 19]. Since most software implementations are academic research projects, the choice of the platforms is typically dictated by the existing infrastructure. Neverthe­ less, the performance results reported give us insight on the viability of DSM systems on such platforms. A few systems are built on existing multicomputers such as the

Intel iPSC/2 hypercube [31]. 25

IVY

The IVY system, one of the early DSM systems, is implemented on a network of

Apollo workstations connected via Ethernet [32]. IVY supports shared pages which are replicated on the nodes of the system to reduce latency. Strict consistency is used to maintain the coherence of the replicated pages [29].

In the IVY system, eventcounts are used for synchronization. The choice was motivated by the fact that the Aegis operating system, on top of which IVY was implemented, already provided eventcounts. In an earlier version of IVY, IVY I, eventcounts were implemented using remote procedure calls. All eventcount vari­ ables were placed in the local (non-shared) memory of a processor and operations on eventcount variables were sent as remote calls to this processor. A variation to this scheme is a static distributed technique [32]. The RPC based implementation is essentially a centralized scheme which is not integrated with the programming model supported by IVY.

IVY II supports an integrated synchronization mechanism where the eventcount variables are located in the shared pages of the shared virtual memory space. It is a dynamic distributed scheme; the pages containing the eventcount variables migrate among the processors where they are used. To ensure the atomicity of the eventcount operations, the shared page containing the eventcount variables is kept on a processor by locking it (referred to as “wiring” the page [32]) during the operation. Li claims that this mechanism not only has cleaner semantics than the RPC version, but also is more efficient than the RPC implementation of IVY I when there are multiple 26

processes per processor. Since the page containing the eventcount is cached on the

processor where it is accessed, the accesses by the multiple processes would be local

to that processor, thus yielding better performance. However, if different eventcount

variables on the same shared page are frequently accessed on different processors,

thrashing of the page may result due to false sharing.

Clouds

Clouds is a DSM system developed at Georgia Tech on a network of workstation connected via Ethernet [41]. It supports an object-based model of programming and allows for the migration of objects and other shared data to the processors where they are needed. The object model of computation integrates the synchronization mechanism by providing atomicity in the form of remote procedure calls. An object in Clouds is composed of shared-data segments which can be migrated among pro­ cessors as well. Clouds ensures strict consistency of these segments by combining the locking and unlocking the data segment with the accessing and releasing operations, respectively.

In order to support inter-process synchronization that is independent of the RPC mechanism, semaphores are supported in the kernel of the Clouds system [42]. The kernel, known as Ra, groups the semaphores into semaphore segments which reside on the processor that created them. The P and V operations are performed on that processor in a centralized manner. The semaphore segment also provides linked lists to queue blocked P requests. An alternative, proposed in [42], is to move the entire segment of semaphores to the requesting site. This improves the performance of semaphore access if multiple processes reside on a processor; however, false sharing of the semaphore segment by processes on different processors can increase overheads.

Munin and TreadMarks

The Munin system, developed at Rice University, is implemented on a network of workstations interconnected via Ethernet [7]. The system consists of a runtime system which interacts with a modified version of the V kernel operating system [8]. Unlike IVY and Clouds, Munin's design philosophy is that different shared data require varying degrees of consistency, and performance can be improved by using a spectrum of consistency maintenance protocols for these shared variables. To do so, Munin defines nine shared-data types, each with its own coherence protocol [6]. Munin supports an integrated synchronization mechanism; one of the shared-data types is the synchronization type. This data type is implemented as a distributed lock.

Aside from locks, centralized implementations of barriers and condition variables are also provided in Munin [7]. A condition variable makes it possible to implement synchronization based on monitors [24]. Processes block by executing a wait primitive on a condition variable. When a process executes the signal primitive, one of the blocked processes is unblocked and allowed to proceed. The definition of the signal primitive in Munin differs from this standard definition: all the blocked processes are unblocked together.

Locks are implemented using a variation of the dynamic distributed scheme. When a processor needs the lock, it sends a request to the lock's owner. The lock owner may have passed the lock to another processor, in which case it will forward the request message to that processor. This forwarding results in the request eventually arriving at the current owner of the lock. When the current owner receives the request, it either grants the lock if it is free or queues the request if the lock is currently in use. If the owner already has another request queued for the lock, it sends the request to that site to be queued there. In this manner, a distributed queue is created on the network, with every waiting processor queuing the processor that is to receive the lock after it.
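The essence of this forwarding scheme can be sketched as a per-node handler for incoming lock requests. The following C fragment is our reconstruction of the behavior described above, not Munin code; the node_t type and the send primitives are assumptions.

    typedef int node_t;
    #define NONE (-1)

    typedef struct {
        int    held_here;   /* nonzero while this node is inside the lock */
        node_t probable;    /* best guess at the lock's current owner */
        node_t next_waiter; /* the one request queued locally, if any */
    } dlock;

    void send_request(node_t to, node_t requester);  /* assumed primitives */
    void send_grant(node_t to);

    void on_lock_request(dlock *l, node_t requester, node_t self) {
        if (l->probable != self) {
            /* not the owner: forward toward the probable owner */
            send_request(l->probable, requester);
        } else if (!l->held_here && l->next_waiter == NONE) {
            /* owner and lock is free: grant it and remember the new owner */
            send_grant(requester);
            l->probable = requester;
        } else if (l->next_waiter == NONE) {
            /* owner but lock busy: queue the requester locally */
            l->next_waiter = requester;
        } else {
            /* a request is already queued here: pass this one along to that
             * site, extending the distributed queue */
            send_request(l->next_waiter, requester);
        }
    }

On release, the owner would grant the lock to next_waiter (if any) and update probable accordingly, so each waiting node points to the node that follows it in the distributed queue.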

TreadMarks, also developed at Rice University, supports locks and barriers for synchronization in a manner similar to Munin [10]. Unlike Munin, however, this system uses a single memory consistency model called lazy release consistency [10], which allows deferring the propagation of data updates (or invalidations) until processes synchronize. In order to facilitate this delayed propagation, the lock request must be timestamped by the sending processor. This timestamp is used by the lock holder to determine which data elements' updates/invalidations must be forwarded along with the lock. This synchronization mechanism is unusual in that the coherence maintenance protocol is combined into the synchronization mechanism and not vice versa.

TreadMarks has been implemented on a network of workstations connected by an ATM network (capable of transferring data at ten times the rate of Ethernet).

Mirage

The Mirage system is implemented in the kernel of a modified Locus operating system on a group of VAX 11/750 computers connected by Ethernet [20]. More recently, a modified version of the system, Mirage+, has been ported to a network of personal computers running AIX [19]. Mirage and Mirage+ support paged segmentation, where the unit of data migration or distribution is the page. The coherence protocol used is similar to IVY's, with one important distinction: the use of a tunable parameter Δ to reduce the effects of thrashing. When a page is cached on a processor, it is not swapped out of the processor until Δ time units have expired.

Synchronization in Mirage is provided as semaphores, implemented in the Locus kernel [18]. The Locus system was designed to provide reliability in a distributed system, and thus the Locus semaphore mechanism is fault-tolerant to processor failures. The semaphores are grouped into semaphore sets, and operations are performed atomically on an entire semaphore set in a centralized manner by the manager of the set. The manager also maintains undo logs for each process that performs operations on a semaphore set. If a process is aborted or the site where it resides crashes, the manager uses the undo log to clean up any half-finished semaphore operations.

Shiva

Shiva, a DSM with a coherence protocol similar to IVY's, has been developed on the Intel iPSC/2 hypercube multicomputer [31]. The prototype was implemented on NX/2, the minimal operating system executed on each node of the iPSC/2.

The system provides binary semaphores. These semaphores are accessed by using send/receive primitives to the central semaphore controller. It is also possible to implement the mechanism using the dynamic distributed technique to cache the semaphore on the processor performing P and V operations [31]. Since only the semaphore being accessed is cached, this implementation avoids the false sharing problem inherent in mechanisms which cache a group of semaphores. On the other hand, this mechanism is not integrated into the DSM system.

2.4.2 Hardware Implementations

Early hardware implementations of DSM consisted of a group of processors or workstations connected via a network with a specially designed network interface. For instance, MemNet was composed of processors connected to a token ring by an interface device called a MemNet device [12]. This device was responsible for performing remote access requests on behalf of its processor as well as servicing the remote requests of other processors in the system.

More recent implementations encompass processor networks where the nodes themselves are clusters or multiprocessors. The DASH system, for instance, is a two-dimensional mesh where the nodes are Silicon Graphics 4D/240 bus-based shared memory multiprocessors [16]. The network consists of a pair of meshes, one for sending remote requests and the other for receiving replies. DASH uses a directory-based scheme to maintain the coherence of shared data. This scheme ensures that updates to synchronization variables (locks) are propagated, when a lock is released, to the clusters that have cached the lock variable. DASH also provides the fetch-and-increment and fetch-and-decrement atomic operations. These are performed at the memory locations of the synchronization variables without caching them on the processors. Thus, DASH supports an integrated, dynamic distributed mechanism as well as a non-integrated, static distributed mechanism.

Some systems implement the synchronization mechanism in software even though data consistency is maintained in hardware. For instance, the FLASH system, which is designed to test a variety of coherence maintenance protocols [17], implements synchronization primitives at the operating system level.

We next describe a few representative systems.

Plus

The Plus system, a hardware implementation of a DSM, is composed of a group of Plus nodes connected via a “fast interconnection network” [5]. Each Plus node consists of a processor with its cache, memory (for local and replicated global data), and a Memory Coherence Manager (MCM) connected by a bus. Shared data in Plus are grouped into pages, which are replicated on the nodes where they are required. The MCM on each node ensures that when a data element is written by the processor, the update is propagated to all the other nodes which have cached that page. A subsequent read of that data element by the processor is blocked until its MCM receives confirmation that the propagation of updates has completed. However, writes to other data elements can proceed concurrently. Sequential consistency, if necessary, is ensured by using the user-callable write-fence operation to inhibit concurrent writes from the same processor.

Synchronization in the Plus system is through semaphores. In keeping with the design philosophy of distinguishing between the issuing of updates and the reading of the updated data location, Plus supports the concept of “delayed operations”. In this approach, when a processor issues a P operation, it need not block and wait for the operation to complete; instead, it can perform other computations that do not depend on the synchronization request. The processor can read the result of the synchronization operation after a certain delay from its local memory (where it would be cached by the MCM). This is similar to the delayed branch technique used in programming RISC architectures. Judicious placement of the synchronization request and read operations in the program allows for concurrency in the computation and synchronization. This helps reduce the effect of the synchronization delay on the performance of the program. Since the synchronization operations are performed at the node where the synchronization variable is located, this mechanism uses a static distributed scheme.

Willow

The architecture of Willow is a tree-based hierarchical multiprocessor [3]. All intermediate nodes are memory modules and each leaf node of the tree is a processor module. Both types of modules (memory and processor) have caches associated with them. The connection between a node and its children is through a shared bus. Thus, at the lowest level are groups of bus-based shared memory multiprocessors with four processors and a memory module.

Synchronization is provided in the form of Conditional Test&Set operations. Every cache in the system has a separate unit which is reserved for synchronization variables.

When a processor performs the Conditional Test&Set operation on a synchronization variable in its cache, the set phase of the operation attempts to write through to all replicas of the synchronization variable. The operation is completed only after the successful completion of this phase. If multiple processors attempt to perform the operation on the same variable, the one that acquires the shared bus at the highest level of the tree common to them succeeds in completing the operation. The other processors invalidate the result of the conditional test phase of the operation and wait for the synchronization variable's value to be reset to zero. The ideal case in this scheme is when the multiple processors accessing a synchronization variable belong to the same multiprocessor, since this requires only the bus within that multiprocessor. In the worst case, if two processors at the two extreme leaves of the tree contend for the same variable, the synchronization must be handled by the bus at the root of the tree. This scheme is a distributed one which has been integrated into the hardware of the system.
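The control flow of this two-phase operation can be sketched as follows. The primitives cached_value(), write_through_all(), and await_reset() are hypothetical stand-ins for actions performed by the Willow cache hardware; the sketch only mirrors the protocol's shape, under those assumptions.

    int  cached_value(int var);             /* read the locally cached copy */
    int  write_through_all(int var, int v); /* nonzero iff this node won the bus
                                               arbitration and updated every
                                               replica of the variable */
    void await_reset(int var);              /* wait until the variable is 0 again */

    int conditional_test_and_set(int var) {
        for (;;) {
            if (cached_value(var) != 0) {   /* conditional test phase fails */
                await_reset(var);
                continue;
            }
            if (write_through_all(var, 1))  /* set phase: write-through wins */
                return 1;                   /* operation completed */
            /* lost arbitration: discard the test result and wait for reset */
            await_reset(var);
        }
    }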

Convex Exemplar

The Exemplar architecture, developed by Convex Corp. [9], is a hierarchical distributed memory architecture. It is composed of clusters called hypernodes. Each hypernode has eight processors and shared memory connected via a crossbar. Each processor also has a private portion of that shared memory for its own exclusive use. The hypernodes are connected in a two-dimensional torus network. A part of each hypernode's memory, the global memory, is accessible to other hypernodes.

Hardware synchronization is provided as semaphores and barriers. To perform these operations, a variety of memory access instructions are provided, namely fetch, fetch-and-clear, fetch-and-increment, fetch-and-decrement, and load-and-clear. The first four instructions are performed atomically at the semaphore variable's memory location, while the last one results in caching the semaphore variable on the requesting processor. All the fetch instructions return the current value of the semaphore variable to the requesting processor. While the first primitive does not alter the value of the variable, the remaining three clear it, increment it by one, or decrement it by one, respectively. The load-and-clear instruction caches the variable on the processor and clears the variable's memory location. This enables subsequent accesses by the processor to be performed locally without network traffic. However, if another processor tries to access the variable, the cached variable must be flushed. Implementing the P operation using the various fetch-and-op primitives would result in the processor repeatedly testing the status of the variable and thus causing excessive traffic on the network. This is essentially a static distributed scheme, since all accesses are directed to the synchronization variable's memory location.
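For concreteness, a P operation built from such a fetch-and-decrement primitive might look as follows, with C11 atomics standing in for the Exemplar instructions. The busy-wait loop makes the repeated-testing problem visible: every failed probe is another access directed at the semaphore's home memory location.

    #include <stdatomic.h>

    void sem_P(atomic_int *s) {
        for (;;) {
            int old = atomic_fetch_sub(s, 1);  /* fetch-and-decrement */
            if (old > 0)
                return;                        /* a unit was available */
            atomic_fetch_add(s, 1);            /* undo the decrement and retry */
        }
    }

    void sem_V(atomic_int *s) {
        atomic_fetch_add(s, 1);                /* fetch-and-increment */
    }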

2.5 Discussion and Summary

In the distributed shared memory programming model, cooperating processes access shared variables, and coordinating this access is important for the correct execution of a parallel program. Synchronization mechanisms provide the support for this coordination and are supported by all DSM systems, irrespective of the programming model and the coherence protocol used.

In this chapter we examined the synchronization mechanisms in a variety of DSM systems. We classified the mechanisms according to whether the mechanism is implemented in hardware or software, whether it uses a centralized or distributed scheme, and lastly whether it is integrated into the DSM system or not. We summarize the classification of the synchronization mechanisms in Table 1.

Table 1: A Summary of Synchronization Mechanisms

DSM System    Implementation               Mechanism                       Integration
Clouds        Software                     Centralized                     Both
DASH          Hardware                     Static Distributed              Non-integrated
Exemplar      Hardware                     Static Distributed              Non-integrated
FLASH         Hardware (software mech.)    -                               -
IVY           Software                     Centralized & Distributed       Non-integrated
Mirage        Software                     Centralized                     Non-integrated
Munin         Software                     Distributed                     Integrated
Plus          Hardware                     Static & Dynamic Distributed    Both
Sesame        Hardware                     Distributed                     Integrated
Shiva         Software                     Distributed                     Non-integrated
TreadMarks    Software                     Distributed                     Integrated
Willow        Hardware                     Static Distributed              Integrated


In software-based systems, the synchronization mechanism is implemented either by the runtime system or by the kernel, in a centralized, static distributed, or dynamic distributed manner. If it is dynamic distributed, only one copy of the synchronization variable is maintained, and this copy is migrated (like a token) among the processors in the system.

In hardware-based DSM systems, the mechanism is typically implemented with hardware support; a notable exception is the FLASH system, in which it is implemented by the operating system kernel. The distributed schemes in such systems take advantage of the hardware support to cache the synchronization variable at the processors and rely on cache-coherence protocols to maintain the consistency of the replicated copies. Some hardware-based systems implement the mechanism in a static distributed manner to eliminate the need for maintaining multiple cache-coherence protocols (one for synchronization variables and another for other shared data). Such systems use fetch-and-op type operations to reduce the number of messages generated by processors accessing the synchronization variable.

CHAPTER III

The Semaphore Page Approach

3.1 Introduction

In the previous chapter, the need for synchronization mechanisms in a DSM system was motivated, and existing centralized and distributed algorithms to implement these synchronization mechanisms were discussed. Among the software-based approaches, semaphores were implemented in a centralized manner. One exception is Shiva, which uses a static distributed approach to provide binary semaphores that can be cached on the processor performing P and V operations.

Decentralized algorithms that provide synchronized access to data have been proposed. These algorithms, however, provide only mutually exclusive access to the shared data (i.e., lock and unlock mutual exclusion). Concurrent tasks need a mechanism through which to synchronize or handshake at certain points during their execution. Mutual exclusion is insufficient for these synchronization needs. Semaphores elegantly provide process synchronization, mutual exclusion, as well as general K-ary exclusion [13, 14].

In this chapter, a decentralized semaphore mechanism which eliminates the single-point-of-failure and bottleneck problems inherent in centralized schemes is presented. In this scheme, semaphores are grouped into pages called semaphore-pages. These pages are similar to the data pages in DSM systems; however, the semaphore-pages are replicated on the processors of the system, and specific algorithms to maintain their consistency are presented. This approach enables processes to exploit the locality of reference which exists in accessing semaphore variables by caching them on their processors.

The rest of the chapter is organized as follows: Section 3.2 describes the decentralized scheme to maintain semaphores in DSM systems. A proof of correctness is presented in Section 3.3. A performance study of the semaphore-page approach based on simulation is discussed in Section 3.4. Section 3.5 summarizes the chapter.

3.2 The Semaphore-page Approach

In the semaphore-page approach, semaphores in the system are grouped into pages, called semaphore-pages. A semaphore-page is divided into semaphore structures. Each structure contains either the semaphore itself or information regarding the location of the semaphore in the network (see Figure 6). While the semaphore-pages are replicated on each processor, the semaphores are not. By grouping the semaphores into semaphore-pages, the processors in the system are able to update their view of the locations of the semaphores when they communicate with each other.

When a process tries to access a semaphore, it first checks the semaphore-page on its processor. If the semaphore is found, the desired (P or V) operation is performed. If the semaphore is not available locally, its semaphore structure will point to the processor on which the semaphore is likely to reside. A remote access is performed by sending a request to obtain the semaphore. When a processor which currently holds a semaphore receives a request, depending on the state of the semaphore it either queues the request or sends the semaphore to the requesting processor. If a processor which does not hold the semaphore receives the request, it forwards the request to the processor likely to hold the semaphore. There are other methods for propagating requests, such as broadcasting and multicasting, but these methods cause greater message traffic on the network. Fowler, in his dissertation, discusses these issues [21].

When processors exchange messages with each other to access a semaphore, they exchange additional information regarding the locations of all the semaphores of a particular semaphore-page. We make use of timestamps in determining which processor has the more recent information regarding the location of a particular semaphore. This enables the processors to update their semaphore-pages to reflect the current state of the system. We describe this updating of semaphore-pages in detail in Section 3.2.4. A description of the model and a detailed description of the approach follow.

3.2.1 System Model and Definitions

The system consists of n processors labeled P1,...,Pn. The processors are connected by an interconnection network or a local area network and communicate via message-passing primitives. It is assumed that the communication network is reliable; all messages are delivered within a finite time delay. Failure of a processor would require the system to be restarted.

There are two types of semaphores in the system: binary and resource counting. A binary semaphore is initialized to the value 1, and a resource counting semaphore is initialized to some value N. A P operation and its corresponding V operation on a semaphore can be performed on two different processors. The P and V operations were defined in Figure 4 in Chapter II.

Each processor maintains a copy of p semaphore-pages, denoted SP1,...,SPp. This, however, does not imply that the semaphores themselves are replicated; only one processor holds a particular semaphore. Initially, the semaphores are mapped onto different processors in some manner that is not dependent on the mechanism. The other processors maintain information pertaining to the location of each semaphore. There are m semaphores per semaphore-page, denoted S0,...,Sm−1. Semaphore Sj in page SPi is identified by a tuple, (SPi, Sj), that identifies the page and the semaphore index within that page. Each semaphore in a semaphore-page consists of the structure shown in Figure 5.

Sem             : semaphore;
probable_holder : integer;
blocked_req     : queue of processor_id;
timestamp       : integer;
requested       : boolean;

Figure 5: Contents of a Semaphore Structure.

When a semaphore is cached on a processor, its semaphore structure holds the semaphore. Otherwise, the Sem field is NULL. The probable_holder field in the structure holds the ID of the processor where the semaphore is likely to be located. The blocked_req queue holds blocked requests which are waiting to be granted. The timestamp field associates an age with the information in the probable_holder field. The use of timestamps is discussed in conjunction with updating semaphore-pages in Section 3.2.4. The last field, requested, is true when there is an outstanding request for the semaphore and false otherwise. Since multiple processes can execute on a processor, this field can be used to block all except one request for the same semaphore.
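For concreteness, the structure of Figure 5 can be rendered as a C struct as follows; the field types are our own choices for illustration.

    typedef struct req_node { int proc_id; struct req_node *next; } req_node;

    typedef struct {
        int      *sem;             /* the semaphore value, or NULL when not cached here */
        int       probable_holder; /* ID of the processor likely to hold the semaphore */
        req_node *blocked_req;     /* queue of blocked P requests */
        int       timestamp;       /* age of the probable_holder information */
        int       requested;       /* nonzero while a request is outstanding */
    } sem_struct;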

As an example, Figure 6 depicts the state of four semaphore structures (S0, S2, S4, and S5) in semaphore-page SPj on processor Pi. Semaphores S2 and S5 are cached locally, and semaphores S0 and S4 are on processors P4 and P7, respectively. The blocked_req, timestamp, and requested fields are left blank.

3.2.2 P Operation

When a process tries to perform a P operation, it first checks the local semaphore-page on its processor for the presence of the semaphore. If the semaphore is found locally, the P operation is performed. Otherwise, the probable_holder field indicates where the semaphore is likely to be cached, and by following the probable_holder fields on successive processors, the current holder of the semaphore can be located. Thus, to acquire the semaphore, a request is sent to the processor specified in the probable_holder field, and the P operation remains blocked until the semaphore has been cached locally. At this point the semaphore can be decremented to complete the P operation.

[Figure 6 shows semaphore-page SPj on processor Pi as a column of semaphore structures; the four structures of interest contain:

    S0:  Sem: NULL            Probable_holder: 4
    S2:  Sem: cached locally  Probable_holder: i
    S4:  Sem: NULL            Probable_holder: 7
    S5:  Sem: cached locally  Probable_holder: i

with the Blocked_req, Timestamp, and Requested fields left blank.]

Figure 6: State of Semaphore-page SPj on Processor Pi.

A binary semaphore is not moved to the requesting processor if it has a pending V operation on the processor that currently holds it. In the case of a resource counting semaphore, it is not moved from the processor where it is held unless it has a value greater than zero. Subsequent V operations are forwarded to the processor where the semaphore has moved. V operations are discussed in detail in Section 3.2.3.

A request message contains the ID or tag of the semaphore being requested as well as the ID of the processor that initiated the request (e.g., [REQUEST, (SPi, Sj), Pk]). When the request is satisfied, the semaphore's probable_holder field must be updated to reflect the current location of the semaphore. The code that is executed on a processor to perform a P((SPi, Sj)) operation is shown in Figure 7.

«
if ((SPi, Sj).probable_holder ≠ My_ProcId) then
    if ((SPi, Sj).requested ≠ TRUE) then
        SEND(REQUEST, (SPi, Sj), My_ProcId) to (SPi, Sj).probable_holder;
        (SPi, Sj).requested := TRUE;
    endif;
»
await reply;
«
    (SPi, Sj).Sem := semaphore received in reply;
    (SPi, Sj).probable_holder := My_ProcId;
    (SPi, Sj).requested := FALSE;
    concat(blocked_req queue from sender, (SPi, Sj).blocked_req);
endif;
»
P((SPi, Sj).Sem);

Figure 7: Code Executed to Access a Semaphore and Perform a P Operation.

Servicing a Remote Access Request for a Semaphore

When a processor receives a request for a semaphore, it checks if the semaphore is locally cached. If it is available, the semaphore is sent to the requesting processor provided its value is greater than zero. If the semaphore is not cached, the request is forwarded to the processor indicated by the probable_holder field. If a processor waiting to access a semaphore receives a request for the same semaphore from another processor, it will queue the request on the blocked_req queue for that semaphore.

It is possible that more than one request is queued for a semaphore on a processor. When the semaphore is released, it is sent to the processor at the head of the queue along with the rest of the blocked_req queue. The code to service a request of the form [REQUEST, (SPi, Sj), Pk] is given in Figure 8.

« iff" {SPi, Sj) .probablejiolder = My-Procld) th e n iff {SPi, Sj).Sem > 1) th e n SEND (REPLY, {SPi,Sj), {SPi,Sj).Sem) to Pk; {SPi, Sj) .probablejiolder := Pk; {SPi,Sj).Sem := NULL; else append( {SPi, Sj) .blockedjreq , Pk); else iff {SPi, Sj). requested TRUE) th en SEND (REQUEST, {SPi,Sj), Pk) to {SPt, Sj). probablejiolder; {SPi,Sj). requested := TRUE; else append( {SPi, Sj). blocked.req , Pk); endif; »

Figure 8: Code to Service a Request for a Semaphore.

It is noteworthy that the manner in which a site responds to a request does not depend on whether the semaphore is binary or resource counting. The initial value that was assigned to the semaphore determines its type. This, of course, assumes that every process that performs a P will eventually perform a V as well.

3.2.3 V Operation

When a process performs a V operation on a semaphore, two possibilities arise. First, the V operation is on a binary semaphore. In this case the operation can always be performed locally, because the semaphore could not have been moved while it had a V operation pending. Second, the V operation is on a resource counting semaphore. In this case, it is possible that the semaphore has been moved to another processor, and the V operation must then be forwarded to that processor. As in the case of the semaphore access request, it may be necessary to forward the V operation until it reaches the current holder of the semaphore. The processor that holds the semaphore may have requests waiting on the blocked_req queue of the semaphore. In this case, after performing the V operation, the processor sends the semaphore to the first request on the blocked_req queue along with any other blocked requests.

The V((SPi, Sj)) operation described in Figure 9 can perform local and remote V operations as well as forward such operations to another site.

3.2.4 Updating the Structures in the Semaphore-page

We now discuss how the semaphore-pages are updated. This updating is done when messages are exchanged between the processors while accessing a semaphore or performing a V operation. The timestamp field of the semaphore structures plays an important role in this process. This field indicates how current the probable_holder information in a particular semaphore structure is compared to the rest of the system. By comparing the timestamps of a particular semaphore in two semaphore-pages, we can determine which semaphore-page has the more up-to-date information about the semaphore's location. This is used to update the older of the two semaphore structures. The motivation behind updating the semaphore structure is that, on average, it shortens the length of the path taken by a request to access a semaphore.

«
if ((SPi, Sj).probable_holder = My_ProcId) then
    V((SPi, Sj).Sem);
    if ((SPi, Sj).blocked_req ≠ NIL) then
        (SPi, Sj).probable_holder := head((SPi, Sj).blocked_req);
        SEND(REPLY, (SPi, Sj), (SPi, Sj).Sem, tail((SPi, Sj).blocked_req))
            to head((SPi, Sj).blocked_req);
        (SPi, Sj).blocked_req := NIL;
        (SPi, Sj).Sem := NULL;
    endif;
else
    SEND(VOP, (SPi, Sj)) to (SPi, Sj).probable_holder;
endif;
»

Figure 9: Code to Perform a V Operation.

When a processor requests a semaphore from another processor, it also sends the probable holder information (details below) for the other semaphores of that semaphore-page as part of the request message. This is used by the other processors to update their semaphore-pages. As the request is forwarded to the holder, the processors along the path participate in updating their respective semaphore-pages. The requesting processor eventually updates its own semaphore-page when it receives this information along with the semaphore from the processor which holds the semaphore.

As an example, suppose processor P1 sends a request for (SPi, Sj) to P2, which forwards it to P3. P1 sends the probable holder information for the semaphores of semaphore-page SPi to P2. Processor P2 uses this information to update its copy of SPi and then sends this updated version of the probable holder information of SPi to P3. Processor P3 similarly updates its copy of SPi. When it sends the semaphore to P1, it also sends the SPi probable holder information, which enables P1 to update its own copy of SPi. In this manner, a processor gets the holder information from the processor prior to it in the path and passes its updated information to the next processor in the path.

Specifics of the Update Scheme

An array of tuples, called holder_status, is now included as part of the messages. This array has m elements, one tuple per semaphore of a semaphore-page. Each tuple consists of the probable_holder and timestamp fields (see Figure 10) that were described in the semaphore structure in Section 3.2.1. This array is created from the semaphore-page prior to issuing any message.

A processor that wishes to access a semaphore that is not available locally issues the following request message: [REQUEST, (SPi, Sj), {holder_status of SPi}, Pk]. The processor then waits for the semaphore and the holder_status array of the page SPi from the holder of the semaphore. It uses the holder_status array to update the other semaphores of its semaphore-page. It also increments the timestamp of the semaphore it received. This ensures that the timestamp at the holder of a semaphore is the largest of all the timestamps for that semaphore in the system.
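The holder_status array can be rendered in C as follows, reusing the sem_struct sketch given earlier; the page size M is an assumed constant.

    #define M 64   /* semaphores per semaphore-page (assumed) */

    typedef struct { int probable_holder; int timestamp; } holder_tuple;

    /* Snapshot the page's location information for inclusion in an outgoing
     * message, one (probable_holder, timestamp) tuple per semaphore. */
    void build_holder_status(const sem_struct page[M], holder_tuple out[M]) {
        for (int i = 0; i < M; i++) {
            out[i].probable_holder = page[i].probable_holder;
            out[i].timestamp       = page[i].timestamp;
        }
    }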

[Figure 10 depicts the holder_status array for semaphore-page SPj on processor Pi; each of its m entries is a (probable_holder, time_stamp) tuple, e.g., (4, 25), (6, 12), (i, 50), ..., with the Sm−1 entry being (0, 234).]

Figure 10: Holder_status Array for Semaphore-page SPj on Processor Pi.

The code from Section 3.2.2 is modified to reflect this change in Figure 11.

When a processor receives a message, it uses the holder_status array sent in the message to determine which semaphore structures in its semaphore-page to update. It determines which probable_holder entry is more recent by comparing the timestamps supplied in each tuple of the holder_status array with those in the semaphore structures of SPi. The code to update the local semaphore-page SPi is given in Figure 12.

When a processor has to forward a request message, it first updates its own semaphore-page using the holder_status array. It then creates a new holder_status array based on its semaphore-page to send with the forwarded message. The modified code to service a request for a semaphore is given in Figure 13.

«
if ((SPi, Sj).probable_holder ≠ My_ProcId) then
    if ((SPi, Sj).requested ≠ TRUE) then
        SEND(REQUEST, (SPi, Sj), holder_status of SPi, My_ProcId)
            to (SPi, Sj).probable_holder;
        (SPi, Sj).requested := TRUE;
    endif;
»
await reply;
«
    Update SPi using the holder_status array received with the reply;
    (SPi, Sj).Sem := semaphore received in reply;
    (SPi, Sj).probable_holder := My_ProcId;
    (SPi, Sj).timestamp := (SPi, Sj).timestamp + 1;
    (SPi, Sj).requested := FALSE;
    concat(blocked_req queue from sender, (SPi, Sj).blocked_req);
endif;
»
P((SPi, Sj).Sem);

Figure 11: Modified Code to Access a Semaphore and Perform a P Operation.

When a processor performs a V operation that requires it to send a message to another processor (i.e., for a resource counting semaphore which has been moved), it includes the holder_status array of the semaphore-page in the V message. So, the message structure is [VOP, (SPi, Sj), {holder_status of SPi}].

This enables the processor receiving the V operation to update its semaphore-page when it performs the V on the semaphore. If it is necessary to forward the V operation message via a chain of processors, the processors along the path update their respective semaphore-pages prior to forwarding the V message.

«
for (index := 0 to m−1) do
    if (holder_status[index].timestamp > (SPi, Sindex).timestamp) then
        (SPi, Sindex).probable_holder := holder_status[index].probable_holder;
        (SPi, Sindex).timestamp := holder_status[index].timestamp;
    endif;
»

Figure 12: Code to Update the Semaphore-page.

«
Update SPi using the holder_status array received;
if ((SPi, Sj).probable_holder = My_ProcId) then
    if ((SPi, Sj).Sem ≥ 1) then
        (SPi, Sj).probable_holder := Pk;
        (SPi, Sj).timestamp := (SPi, Sj).timestamp + 1;
        SEND(REPLY, (SPi, Sj), (SPi, Sj).Sem, holder_status of SPi) to Pk;
        (SPi, Sj).Sem := NULL;
    else
        append((SPi, Sj).blocked_req, Pk);
    endif;
else
    if ((SPi, Sj).requested ≠ TRUE) then
        SEND(REQUEST, (SPi, Sj), holder_status of SPi, Pk)
            to (SPi, Sj).probable_holder;
    else
        append((SPi, Sj).blocked_req, Pk);
    endif;
endif;
»

Figure 13: Modified Code to Service a Request for a Semaphore.

The code in Figure 14 is a modified version of Figure 9, presented in Section 3.2.3. While the version presented earlier was uniform for both local and remote V operations, it is now necessary to differentiate between them. If a processor is able to perform a V operation locally, it executes all the statements in Figure 14 except the first.

«
Update semaphore-page SPi using the holder_status array received;
if ((SPi, Sj).probable_holder = My_ProcId) then
    V((SPi, Sj).Sem);
    if ((SPi, Sj).blocked_req ≠ NIL) then
        (SPi, Sj).probable_holder := head((SPi, Sj).blocked_req);
        (SPi, Sj).timestamp := (SPi, Sj).timestamp + 1;
        SEND(REPLY, (SPi, Sj), (SPi, Sj).Sem, tail((SPi, Sj).blocked_req),
            holder_status of SPi) to head((SPi, Sj).blocked_req);
        (SPi, Sj).blocked_req := NIL;
        (SPi, Sj).Sem := NULL;
    endif;
else
    SEND(VOP, (SPi, Sj), holder_status of SPi) to (SPi, Sj).probable_holder;
endif;
»

Figure 14: Modified Code to Perform a V Operation.

3.2.5 Techniques for Performance Improvement

In this section, two ways to reduce the number of messages required to access a semaphore are proposed. First, when a processor forwards a remote P request, it can update its probable_holder field to point at the node whose request it forwarded. This reduces the number of messages required by the next request.

The second method involves a different way of updating the probable_holder field when a processor sends the semaphore to another site. Rather than setting the field to point to the site where the semaphore has been sent, the processor sets the site whose request is at the end of the semaphore queue as the probable_holder. This also effectively reduces the number of message hops required before a message reaches the semaphore holder or a processor waiting for the semaphore.

However, both these methods require the use of two probable_holder fields (P_probable_holder and V_probable_holder) to avoid a potential deadlock. The V_probable_holder field would be set in the same manner as the probable_holder field was set in the previous sections. This ensures that a V operation's request follows the same path as that of the semaphore.

3.3 Correctness

To prove the correctness of the scheme, we show that the following two properties hold. First, we show that P and V operations are performed correctly (i.e., the safety property is satisfied). Second, we show that a requested P or V operation is satisfied in finite time (i.e., the liveness property is satisfied). We show that these properties hold for one semaphore, say (SP0, S0); the same argument applies to all the semaphores.

We use Pk((SP0, S0)) to refer to the entry for S0 in SP0 on processor Pk. Let PH denote the probable_holder field of the semaphore.

3.3.1 Safety Property

The PH fields on the processors induce a directed graph where the nodes are the processors and the PH fields are the edges. If Pk((SP0, S0).PH) = Pl, then there is a directed edge from Pk to Pl.

Observation 1 Initially only one copy of each semaphore exists in the system. For instance, let us assume that S0 is on processor P0. Then, for all Pk, Pk((SP0, S0).PH) = P0. P0 is the root of a tree that spans the processors in the network.

Lemma 1 At any time, only one processor performs P or V operations on a semaphore.

Proof: As shown in Figures 11, 13, and 14, only a processor Pk for which Pk((SP0, S0).PH) = Pk can perform a P or V on the semaphore. What needs to be proven, however, is that only one processor in the system has the property Pk((SP0, S0).PH) = Pk at any time. From Observation 1, this is true initially. Whenever Pk forwards the semaphore to another node Pl, it sets Pk((SP0, S0).PH) = Pl (in Figures 13 and 14). Processor Pl sets Pl((SP0, S0).PH) = Pl only when it receives the semaphore (in Figure 11). □

Lemma 2 Two or more processors will not form a cycle while waiting for a semaphore.

Proof: A cycle forms when two or more processors block each other's P requests. If processor Pi's request is blocked at processor Pk, this is denoted by Pi → Pk. The proof is by contradiction. Suppose there exists a cycle Pi → Pj → Pk → Pi. Pi → Pj implies that Pj held the semaphore more recently than Pi. Similarly, Pj → Pk implies that Pk held the semaphore more recently than Pj. Lastly, Pk → Pi implies that Pi held the semaphore more recently than Pk. However, by transitivity, Pi → Pj → Pk implies that Pk held the semaphore more recently than Pi. Thus Pk → Pi cannot hold. □

Theorem 1 The P and V operations are performed correctly.

Proof: From Lemma 1, only one processor holds the semaphore at a time; therefore, only one processor will perform P or V on the semaphore at any time. From Lemma 2, pending requests will not form a cycle. □

3.3.2 Liveness Property

Lemma 3 A processor's request message will never loop back to it.

Proof: If Pi((SP0, S0).PH) = Pj, Pi sends the request message to Pj. Let the sending of the request message be denoted by Pi ⇒ Pj. Similarly, if Pj forwards the message to Pk, then Pi ⇒ Pj ⇒ Pk. For a cycle to result from the forwarding of the request message, Pi ⇒ Pj ⇒ Pk ⋯ Pq ⇒ Pm ⇒ Pi must exist. The existence of such a processor Pm along the path would imply (by transitivity) that Pm received the semaphore more recently than Pi. But then Pm((SP0, S0).PH) ≠ Pi, so Pm ⇒ Pi cannot exist. □

Lemma 4 Every P and V REQUEST message eventually reaches the semaphore holder or a processor that will become the semaphore holder.

Proof: Pi sends its P or V request to Pj provided Pi((SP0, S0).PH) = Pj. We denote this by Pi ⇒ Pj. If Pj holds the semaphore, then the request has reached the holder; if Pj is itself blocked waiting for the semaphore, Pi's request has reached an eventual semaphore holder (i.e., Pi → Pj). Otherwise, the message is forwarded to the processor Pj((SP0, S0).PH) (say, Pk). Pk similarly is either the holder, blocked waiting to become the holder, or simply a forwarding processor. By constructing such a path using the PH field at each of a finite number of processors, and from the absence of cycles in such a path (Lemma 3), Pi's request arrives at the semaphore holder or at a processor blocked waiting for the semaphore. □

Observation 2 Consider two processors Pi and Pj whose requests are blocked at processors waiting for the semaphore. It is impossible for Pi or Pj to acquire the semaphore twice before the other gets the semaphore once.

Lemma 5 A blocked request will be granted the semaphore in finite time.

Proof: A request from processor Pi is blocked on the blocked_req queue at either the processor holding the semaphore or an intermediate processor which is itself waiting to acquire the semaphore.

Suppose the request is queued at the semaphore holder. There can be at most n − 2 requests ahead of Pi's request in the semaphore queue. The processor at the head of the queue (say, Pj) will get the semaphore from the holder in at most (CS execution time + message delay) time. Pj will also get the rest of the blocked_req queue from the holder (to which it may append its own local queue). Now, Pi's request has at most n − 3 requests ahead of it. Again, after a delay of at most (CS execution time + message delay), Pj will forward the semaphore and the queue, now containing n − 4 nodes, to the processor at the head of the queue. Thus, after at most (n − 2) × (CS execution time + message delay) time, Pi will become the semaphore holder.

Suppose Pi's request is queued at an intermediate processor, Pj. Assuming Pj's request is queued at the semaphore holder, then by similar reasoning as above, after at most n − 3 processors Pj will become the semaphore holder. If Pj is itself queued at an intermediate processor Pk, then Pk will become the semaphore holder after at most n − 4 processors. By induction and Observation 2, Pi will get the semaphore after a maximum of n − 2 processors have held the semaphore for a finite amount of time each. □

Theorem 2 Every request for a semaphore is eventually satisfied.

Proof: The proof is a direct result of Lemmas 4 and 5. □

3.4 A Performance Study

The performance is measured by the response time, the speedup, the maximum allowable throughput, and the message traffic generated, as well as by other properties such as reliability. The response time is the time taken by a requesting site to complete a P operation. Speedup is the improvement in response time over a base case (the centralized algorithm in the experiments). The maximum allowable throughput is the maximum number of critical section executions that can be performed per unit time. The message traffic generated is a measure of the number of messages required to perform a P or a V operation.

While the centralized algorithm uses the fewest messages to perform a P or V operation, it creates a bottleneck at the central processor, which may result in poor response time. The maximum throughput attainable in such a system is 1/(CS execution time + 2 × message transfer time). Furthermore, it suffers from the problem of a single point of failure.

In distributed algorithms, the number of messages required to locate a semaphore is often greater than in the centralized algorithm (on average, O(log n) messages are required to find the current holder). However, the response time is better under a high semaphore access rate, since fewer message transfers are involved between two successive P operations in the system. The maximum attainable throughput is 1/(CS execution time + message transfer time).

3.4.1 Simulation Model

We simulated the semaphore-page approach and the centralized algorithm to study the relative performance improvement of our system. The simulators were written in C using the CSIM simulation package (an event-driven simulator from MCC) [45]. Each processor in the system executes one user process. Each user process performs the following actions in a loop: it acquires the semaphore, performs a P operation, executes the critical section, executes a V, and then performs other computations outside the critical section. The amounts of time spent inside and outside the critical section were assumed to be exponentially distributed with means CST and ICST, respectively.
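The structure of each simulated user process can be sketched as follows. Here exp_delay() and hold() stand in for the simulator's facilities for sampling an exponential delay and consuming simulated time, and acquire_and_P() and do_V() stand for the semaphore operations under study; all of these names are illustrative.

    double exp_delay(double mean);   /* assumed: sample an exponential delay */
    void   hold(double t);           /* assumed: consume t units of simulated time */
    void   acquire_and_P(int sem);   /* locate the semaphore and perform P */
    void   do_V(int sem);            /* perform V (forwarding it if necessary) */

    void user_process(int sem, long iterations, double cst, double icst) {
        for (long i = 0; i < iterations; i++) {
            acquire_and_P(sem);      /* may involve request/forward messages */
            hold(exp_delay(cst));    /* critical section, mean CST */
            do_V(sem);
            hold(exp_delay(icst));   /* non-critical computation, mean ICST */
        }
    }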

The simulation experiments were run in the following manner: the statistics for the first 500 iterations of each user process in each simulation run were ignored to ensure that the system had reached steady state. After that, each user process executed its loop for 10,000 iterations. We collected statistics on the response time, the throughput, and the average and maximum number of messages (for the distributed scheme) for different numbers of processors. We recorded the maximum number of messages required by P operations. In the case of binary semaphores, since the semaphore is not released until the V is performed, the number of messages required to perform a V is zero. We also computed the speedup over the centralized scheme.

The simulated user processes accessed binary semaphores as well as resource counting semaphores. To examine the behavior of the updating scheme, simulation experiments were conducted where only one semaphore per page is accessed as well as where multiple semaphores per page are accessed.

Response Time

In Figures 15, 16, 17, 18, 19, and 20, we plot the response times of the centralized and semaphore-page algorithms for one binary semaphore, multiple binary semaphores, and multiple resource counting semaphores, respectively. For each case, we plot the data gathered for n = 10 and n = 35 processors. We find that the semaphore-page approach performs better than the centralized scheme when the rate of semaphore access is high. This improvement in performance is more pronounced with larger numbers of processors. When the semaphore access rate is low, however, the centralized scheme fares better since it requires fewer messages to locate and retrieve the semaphore. Since the semaphore-page approach performs better under heavy load conditions (i.e., when the semaphore access rate is high) and scales well with the number of processors, it is of greater interest to typical real-life systems.

Speedup

Speedup is defined as the relative improvement in response time of the semaphore-page approach with respect to the centralized approach. Figures 21 and 22 plot the speedup in the average response time attained over the centralized scheme as a function of ICST for various system sizes. Figure 21 shows the pattern when the critical section execution time is short, and Figure 22 is for long critical section execution time.


Figure 15: Response Time vs. ICST. Semaphore-page and Centralized Schemes. One Binary Semaphore (10 nodes, CST = 4.0 msec, RA = 10.0 msec).

Under a high semaphore access rate, the semaphore-page scheme is as much as 60% faster than the centralized scheme. Under a low semaphore access rate, the centralized algorithm performs better due to its lower delay in locating the semaphore.

It is the heavy-load case that is of greatest interest, and the semaphore-page approach performs better in those situations.

Maximum Throughput

The maximum throughput computed for both algorithms is listed in Table 2. We computed the maximum throughput for networks ranging from small (10 processors) to large (45 processors). The maximum throughput for the semaphore-page approach is consistently higher than for the centralized scheme, implying that the semaphore-page approach can sustain a higher load.


Figure 16: Response Time vs. ICST. Semaphore-page and Centralized Schemes. One Binary Semaphore (35 nodes, CST = 4.0 msec, RA = 10.0 msec).

Message Traffic

Table 3 shows the average and maximum number of messages that were required to perform remote operations, listed according to the number of processors in the system. As ICST increases, the access rate for the semaphore decreases and, as a result, the probable_holder information in the processors is not kept up to date as often. This is why the number of messages required to perform remote P operations increases for large ICST values. Nevertheless, on average, the number of messages scales well with the number of processors.

3.5 Summary

This chapter described an approach to providing semaphores in distributed systems. This approach differs from previous work in that it decentralizes the management of the semaphores.


Figure 17: Response Time vs. ICST. Semaphore-page and Centralized Schemes. Multiple Binary Semaphores (10 nodes, CST = 4.0 msec, RA = 10.0 msec).

This decentralization eliminates the inherent problems of centralized schemes: the single point of failure and the bottleneck. A proof of the correctness of the approach was also presented.

By grouping semaphores into semaphore-pages, this approach enables processes to exploit the locality of reference which exists in accessing semaphore variables by caching them on their processors. By updating information about the locations of all the semaphores in a semaphore-page, the semaphore location algorithm becomes simple yet efficient. Extensive simulation results indicate that under heavy load conditions (i.e., frequent accesses to the semaphore), this approach clearly outperforms the centralized scheme. This improvement in performance is more pronounced as the number of processors increases, a highly desirable feature in real-life systems.

Table 2: Maximum Throughput Versus Number of Processors.

Processors    Centralized    Semaphore-page
    10           34.47            45.54
    15           34.18            45.55
    35           33.83            45.72
    45           33.66            45.62

Table 3: Average and Maximum Number of Messages Required to Perform a P Operation.

          10 Processors     15 Processors     35 Processors     45 Processors
ICST     Average    Max    Average    Max    Average    Max    Average    Max
10       1.000000    3     1.000461    3     1.000508    3     1.000512    3
50       1.116330    8     1.004305    6     1.005104    6     1.003548    6
100      2.027623    9     1.227352   12     1.015166    8     1.010280    8
150      2.600059    9     2.076194   12     1.025298    9     1.034397    9
200      2.765846    9     2.812010   13     1.062730    8     1.024986    6
250      2.777570    9     3.308638   13     1.130911    8     1.070668    8
300      2.791722    9     3.528131   13     1.364121   15     1.074752    9
350      2.790633    9     3.621452   14     2.115637   19     1.206929   11
400      2.770553    9     3.674616   13     2.993569   20     1.409659   16
450      2.765169    9     3.671097   12     3.459618   25     2.086521   19
500      2.773464    9     3.686671   13     3.856686   21     3.039955   20
750      2.728747    9     3.667449   13     5.654947   23     5.100494   27
1000     2.703597    9     3.645407   13     6.106209   24     6.553216   25


Figure 18: Response Time vs. ICST. Semaphore-page and Centralized Schemes. Multiple Binary Semaphores (35 nodes, CST = 4.0 msec, RA = 10.0 msec).


Figure 19: Response Time vs. ICST. Semaphore-page and Centralized Schemes. Multiple Resource Counting Semaphores (10 nodes, CST = 4.0 msec, RA = 10.0 msec).


Figure 20: Response Time vs. ICST. Semaphore-page and Centralized Schemes. Multiple Resource Counting Semaphores (35 nodes, CST = 4.0 msec, RA = 10.0 msec).

[Curves are plotted for 10, 15, 35, and 45 nodes.]

Figure 21: Speedup Attained Over the Centralized Scheme for Various System Sizes (CST = 4.0 msec, RA = 10.0 msec).

[Curves are plotted for 10, 15, 35, and 45 nodes.]

Figure 22: Speedup Attained Over the Centralized Scheme for Various System Sizes (CST = 10.0 msec, RA = 10.0 msec).

CHAPTER IV

The Clustered Semaphore Approach

4.1 Introduction

Several algorithms that implement mutual exclusion in distributed systems employ a migrating token. The token may be viewed as a key to enter a critical section or to access a shared resource. Since there is just one token in the system, any node which currently holds the token is granted mutually exclusive access to the resource.

These token-based mutual exclusion algorithms implement a restricted form of binary semaphores. They are restricted in the sense that every node that performs a P operation must also perform the corresponding V operation. Binary semaphores, on the other hand, can be used for other types of synchronization as well, for instance, process coordination.

In this chapter, a decentralized approach to implementing the semaphore mechanism is proposed; it is similar to token-based algorithms in that the semaphore, like a token, migrates from node to node. Furthermore, in this approach, the nodes in the system are clustered into a two-level hierarchy based on the initial value of the semaphore. Each cluster in the system maintains its own semaphore-token, and intracluster P and V operations are performed on the cluster's copy. Intercluster access is necessary if the cluster's copy of the semaphore is zero. Simulation studies show that clustering significantly improves the performance of the algorithm over the unclustered version.

The rest of the chapter is organized as follows. Section 4.2 describes the system model. The algorithm for accessing a semaphore is given in Section 4.3. A proof of correctness of the algorithm is given in Section 4.4. Section 4.6 summarizes the chapter.

4.2 System Model and Definition

The distributed system is composed of n processors or nodes, labeled P0,...,Pn−1. For the sake of clarity in describing the algorithm, let us assume that each processor has only one process executing on it. Each processor queues up requests for the same semaphore and services them one by one.

The semaphore variables in the system are labeled S1,...,Sk. Each semaphore variable has an initial value assigned to it (e.g., S1 = 1 and S2 = 9; S1 is a binary semaphore and S2 is a resource counting semaphore). In general, the initial value of Si is si. To simplify the description, let us assume that si is a perfect square. Later we show that the model is easily extended when the value si is not a perfect square. For the sake of clarity, let us also assume for now that √si divides n.

For a semaphore Si with initial value si, we cluster the nodes in the system into √si clusters. Each cluster contains n/√si nodes. The first cluster, C0, contains the first n/√si nodes, the second cluster contains the second n/√si nodes, and so on (so, C0 = {P0,...,P(n/√si)−1} and C1 = {Pn/√si,...,P(2n/√si)−1}, etc.). Figure 23 depicts the clusters that result for a system with 24 nodes and a semaphore Si with an initial value of 16.

[Figure 23 depicts the 24 nodes P0 through P23 partitioned into four clusters, C0 through C3, of six nodes each; the global semaphore value is si = 16.]

Figure 23: Clustering the Processors for Semaphore Si.

A copy of the semaphore Si is maintained in each cluster of the system. However, the initial value of the copy in a cluster is √si, not si. Each node has two request sets defined: the local request set and the global request set. The local request set (LR) of a node contains all the nodes in its cluster. The global request set (GR) contains one node from each of the √si clusters. Formally, we define the request sets as follows:

LRi = { Pl | Pl is in the same cluster as Pi }
GRi = { Pl | l = i ± k · (n/√si), k = 0, ..., √si }

If a semaphore Si's initial value si is not a perfect square, we divide the system into ⌊√si⌋ clusters, each with n/⌊√si⌋ nodes. If each cluster's copy of the semaphore were initialized to ⌊√si⌋, the global value of the semaphore would be ⌊√si⌋², and the remaining value si − ⌊√si⌋² needs to be distributed among the clusters. Hence, the local copy of the semaphore is set to ⌊√si⌋ plus a share of the remainder si − ⌊√si⌋². An alternative solution is to divide the network into ⌈√si⌉ clusters; then ⌊√si⌋ of the clusters have their semaphores initialized to ⌊√si⌋ and the last cluster has its semaphore initialized to si − ⌊√si⌋².

Similarly, if the number of nodes in the system is not evenly divisible by $\sqrt{s_i}$, we assign $\lfloor n/\sqrt{s_i}\rfloor$ nodes to each of the clusters, and the remaining $n - \sqrt{s_i}\lfloor n/\sqrt{s_i}\rfloor$ nodes are distributed among the clusters. This results in a difference of at most one node between any two clusters.
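The following helper (a Python sketch of the sizing rules just described; the names are ours, not the dissertation's) applies both adjustments at once. Since the leftover semaphore value may not divide evenly among the clusters, the sketch distributes it one unit at a time, which preserves the global value exactly:

import math

def partition(n, s):
    # floor(sqrt(s)) clusters; spread the leftover semaphore value and
    # leftover nodes one unit/node at a time so the totals are preserved.
    c = math.isqrt(s)
    base_val, extra_val = divmod(s, c)
    base_n, extra_n = divmod(n, c)
    sizes = [base_n + (1 if k < extra_n else 0) for k in range(c)]
    values = [base_val + (1 if k < extra_val else 0) for k in range(c)]
    assert sum(sizes) == n and sum(values) == s
    return sizes, values

print(partition(24, 10))  # ([8, 8, 8], [4, 3, 3]): 3 clusters, 4+3+3 = 10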

4.3 Accessing a Semaphore

When a process wishes to perform a P operation on a semaphore, it is important that the P operation be performed in a mutually exclusive manner to maintain the correctness of the semaphore. In shared memory multiprocessors, this P operation can be performed mutually exclusively through the use of test&set instructions on a shared memory location. In a distributed system with no central controller to maintain the semaphore, it is necessary for the nodes of the system to take extra measures to ensure that the P operation is performed in mutual exclusion. Likewise, a V operation should also be performed in mutual exclusion.

In the previous section, we described how the nodes in the system are divided into clusters. This was done to localize message traffic on the network to nodes within the cluster when accessing semaphores. Furthermore, each cluster's copy of the semaphore, which is initialized to $\sqrt{s_i}$, is mapped onto any node in the cluster. The semaphore is migrated to the node which needs to access it. We next describe how a node acquires a semaphore to perform P and V operations. We first describe the manner in which a node acquires a semaphore from within its cluster (i.e., intracluster access) and then we extend these algorithms to include intercluster access as well.

4.3.1 Intracluster Access

P operation

In order to perform a P operation on a semaphore $S_i$, a node $P_j$ must first acquire the semaphore. If the node already holds the semaphore, it performs the P($S_i$) without generating any network traffic. Otherwise, the node must acquire it from the node holding the semaphore in its cluster, since only the holder of the semaphore can perform a P operation on it.

Acquiring a semaphore requires that the node which currently holds the semaphore be determined. This can be accomplished in several ways. One approach is to broadcast the request message to every node in the cluster [49]. An alternative approach is to maintain a probable_holder variable on each node that indicates the node likely to hold the semaphore [37]. This variable at a node indicates the node to which it passed the semaphore most recently. The request message is sent to the node specified by the probable_holder variable. That node in turn may have to forward the message until it arrives at the current holder of the semaphore. Both methods have their trade-offs. The broadcast approach can acquire the semaphore with two message delays, whereas the probable_holder method would involve on average $O(\log(n/\sqrt{s_i})) + 1$ message delays. The broadcast approach, however, results in excessive message traffic ($|LR| - 1$ messages per request) on the network. A reasonable compromise might be to maintain a history of recent holders of the semaphore and multicast the request message to a subset of the nodes in the LR set [46]. In the rest of this section, we use the probable_holder method to access the semaphore.
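To make the forwarding concrete, here is a minimal sketch (Python; the data structures are hypothetical stand-ins, not the dissertation's implementation) of chasing the probable_holder chain until a request reaches the actual holder:

def route_to_holder(start, probable_holder, holds_semaphore):
    # Follow probable_holder pointers one hop at a time; each hop models
    # one forwarded P_request message.
    node, hops = start, 0
    while not holds_semaphore[node]:
        node = probable_holder[node]
        hops += 1
    return node, hops

# Example: P_0 last passed the semaphore to P_2, which passed it to P_5.
probable_holder = {0: 2, 2: 5, 5: 5}
holds_semaphore = {0: False, 2: False, 5: True}
print(route_to_holder(0, probable_holder, holds_semaphore))  # (5, 2)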

When the requesting node becomes the holder of the semaphore, it performs the P operation. If the semaphore's value is greater than zero, the P operation is completed immediately. Otherwise, the node must wait for V($S_i$) to be performed by another node before it can complete the P operation. The code to acquire and perform a P operation on a semaphore is shown in Figure 24. The symbols « and » in the code denote a per-node critical section.

If a P_request message arrives at a node that currently holds the semaphore, the node sends the semaphore to the requesting site provided it is not using the semaphore. If it is performing the P operation or is itself waiting for the semaphore, it queues up the request. Upon completion of the P operation, the semaphore is sent to the requesting site.

if (semaphore is not available) then
    send P_request message for S_i to the probable_holder node;
    wait for the semaphore;
endif;
« if (S_i > 0) then
      S_i := S_i - 1; »
else
    block the process.

Figure 24: Executed by a node to access a semaphore to perform a P operation.

If there is more than one request queued, the semaphore along with the rest of the requests is sent to the site of the request at the head of the queue. In this manner, blocked requests are forwarded along with the semaphore and eventually, when a request gets to the head of the queue, the node which had initiated that request becomes the next recipient of the semaphore.

The semaphore is forwarded to the next requesting node regardless of whether the holder node has yet to perform a V operation on the semaphore or whether the semaphore's value is zero. It is tempting to suggest an improvement wherein the request is blocked until the semaphore's value becomes non-zero as a result of a local or remote V operation on the semaphore. However, we do not propose it because it may conflict with the intercluster access we describe in the following section.

If a node that receives a request no longer holds the semaphore, it forwards the message to the probable holder as specified by its local variable. However, if it is in the process of acquiring the semaphore, it blocks the request until it acquires the semaphore and finishes its P operation. This is described by the pseudocode presented below in Figure 25.

V operation

A node that wishes to perform V($S_i$) may find the semaphore not available locally. If the semaphore is available locally, the node performs the V($S_i$) immediately. Otherwise, rather than acquiring the semaphore to perform the V, it sends a V_request message to the node currently holding the semaphore. This is done by sending the message to the probable_holder node, just as is done by a node to acquire the semaphore. The message may need to be forwarded until it arrives at the current holder of the semaphore, which then performs the V. The request can be viewed as a remote or proxy V operation that is performed on behalf of the requesting node.

Figures 26 and 27 show the code executed by the nodes in initiating and servicing V_request messages.

« On receipt of a P_request message for S_i from P_j:
  if (performing a P operation on S_i) then
      queue the P_request message.
  else if (S_i is locally available) then
      SEND S_i to P_j;
      probable_holder := P_j;
  else
      SEND P_request to probable_holder;
»

Figure 25: Executed by a node on receipt of a P_request message.

« if (S_i is available locally) then
      S_i := S_i + 1;
  else
      SEND V_request message for S_i to probable_holder;
»

Figure 26: Executed by a node to perform a V operation.

On receipt of a V_request from P_j for S_i:
« if (S_i is available locally) then
      if (blocked for S_i) then
          unblock the process;
      else
          S_i := S_i + 1;
  else
      SEND V_request message for S_i to probable_holder;
»

Figure 27: Executed by a node on receipt of a V_request message.

4.3.2 Intercluster Access

If only intracluster access is used to implement P and V operations, a scenario such as the following is possible. Suppose node $P_i$ in cluster $C_j$ acquires the cluster copy of semaphore $S_i$, finds its value to be zero, and blocks for a V($S_i$) operation to be performed. In this situation, if every other node in the cluster issues a P operation before doing a V operation, then $P_i$ will be forced to wait indefinitely, even though the value of $S_i$ in the whole system may be greater than zero. By acquiring the semaphore from another cluster in the system, such situations can be avoided.

Another motivation for intercluster access is efficiency. A node which cannot complete the P operation on a semaphore because the value is zero need not wait for one of the other nodes in its cluster to execute a V. Instead, it should be able to access the copy of the semaphore from another cluster, provided one exists with a non-zero value, expediting the execution of the P operation.

In order to perform P and V operations on a copy of the semaphore in another cluster, a node makes use of the nodes in its GR set as proxy nodes in each of the other clusters in the network. When a node is unable to complete the P operation on a local semaphore (because its value is zero), it sends a P_proxyrequest message to one of the nodes in its GR set. It waits for a response before trying the next node in GR.

A node which acts as a proxy on behalf of the requesting node accesses the semaphore in its cluster in the same manner as described in the previous section. If it finds the semaphore is zero, it responds to the P_proxyrequest message with a deny message (it does not initiate another intercluster access). If it succeeds in performing the P on the semaphore, it responds with a success message to the requesting node. If the proxy node was not successful in performing a P operation on its behalf, the requesting node sends a request message to the next node in the GR set.
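The retry sequence over the GR set can be summarized by the following sketch (Python; send_proxy_request is a hypothetical blocking call standing in for the P_proxyrequest/reply exchange):

def intercluster_P(gr_nodes, send_proxy_request):
    # Poll one proxy per remote cluster, waiting for each reply before
    # trying the next, exactly as described above.
    for proxy in gr_nodes:
        if send_proxy_request(proxy) == "success":
            return proxy  # remember who performed the P on our behalf
    return None           # every cluster's copy was zero

# Example with a stub that succeeds only at node 16:
print(intercluster_P([4, 16, 22], lambda p: "success" if p == 16 else "deny"))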

To perform a proxy V operation on the semaphore, the node sends a V_request message to the node which performed the proxy P operation. There is no difference between the request message used here and the one used for the intracluster remote V operation, because the node receiving the request treats both messages in the same manner.

Rather than have a proxy node perform a P and a V on behalf of a requesting node, we could treat the intercluster access as a transfer of a semaphore value from one cluster to the other. Thus, after a proxy P operation, the requesting node will not send a V_request message to its proxy node. Instead, it will initiate a V($S_i$) operation on its cluster's copy of the semaphore. This results in the local semaphore effectively gaining a value. This migration of semaphore values results in a dynamic distribution/sharing of semaphore values among clusters, favoring clusters where higher access rates to semaphores are observed. Even in applications with homogeneous access patterns to semaphores, there could be statistical fluctuations which make such dynamic sharing attractive. In applications with heterogeneous access patterns, the effectiveness of this scheme should be even more pronounced.

While awaiting a response to a proxy request, it is possible that a requesting node receives a V_request message. This message may either be an intracluster V request or an intercluster proxy-V request. By servicing such a request, it would be possible for the blocked node to complete its P operation locally. Thus, if it receives a deny message from a proxy node, it should first check for the presence of a V_request message before sending a P_proxyrequest to another proxy node. On the other hand, if the requesting node receives both a success and a V_request message, it would proceed with the completion of the P operation based on the success message. After completion of the P, it would process the V_request message that it received concurrently. If necessary, the node would forward the semaphore to another node in its cluster.

The algorithms for performing P operations and for servicing remote requests are presented in Figures 28 and 29, respectively. The algorithms in Figures 26 and 27 for initiating a V operation and for servicing a V request remain the same. Servicing P_request messages is still done by the code in Figure 25.

if (semaphore is not available) then
    send P_request message for S_i to the probable_holder node;
    wait for the semaphore;
endif;
« if (S_i > 0) then
      S_i := S_i - 1; »
else
    while (NOT done) do
        SEND P_proxyrequest message for S_i to a node in GR.
        AWAIT reply.
        « if (reply = "success") then
              done := TRUE;
          else if (V_request message has arrived) then
              done := TRUE;
              execute the code described in Figure 27.
          endif »
    endwhile

Figure 28: Executed by a node to initiate a P operation (both intracluster and intercluster).

« On receipt of a P_proxyrequest message for S_i from P_j:
  if (S_i is NOT locally available) then
      SEND P_request to probable_holder;
      wait for the semaphore;
  else if (requesting the semaphore myself) then
      queue the P_proxyrequest message until the semaphore is available.
»
« if (S_i > 0) then
      S_i := S_i - 1;
      SEND "success" to P_j;
  else
      SEND "deny" to P_j;
»

Figure 29: Executed by a proxy node on receiving a P_proxyrequest message.

4.3.3 Combining Intracluster and Intercluster Operations

In the previous section, an intercluster operation is performed only after an intracluster access to the semaphore is done and the semaphore's value is found to be zero. This increases the delay in performing a P operation because of the serial execution of the intracluster and intercluster accesses. This delay can be reduced by combining the intracluster and the intercluster operations into one phase. However, doing so increases the number of messages required to perform P operations. While this increase in the number of messages may be tolerable under moderate semaphore access patterns, it may degrade performance under heavy loads. This combined access may still be beneficial in cases where the semaphore within the cluster is typically at or close to zero.

The combined access strategy works as follows. When a node performs a P operation, it broadcasts the P_request message to all the nodes in its LR set and sends a P_proxyrequest message to a node in its GR set. The nodes in the GR set act as proxy nodes for the node making the request. The proxy node acquires the semaphore from within its cluster and, if possible, performs the P($S_i$). It then replies to the requesting node with either a success or a deny message to indicate whether or not it succeeded.

The requesting node may either receive the semaphore followed by the reply from the proxy node, or the reply followed by the semaphore, or it could receive both simultaneously. We consider the three cases separately below.

If it receives the semaphore first and is able to perform the P operation on it, the node must ensure that the subsequent arrival of a proxy success message is responded to with a V_request message. If the proxy message is a deny, it can be discarded. If it was unable to perform the P operation on the local semaphore, it must await the result of the proxy request. If the reply is a deny, it issues proxy requests to the other nodes in its GR set as was done in the previous section.

If the requesting node receives the proxy node's response before the semaphore, it determines whether the proxy node was successful in performing the proxy P request. If it was, then the requesting node's P operation is complete. If not, the requesting node issues a new proxy request to another node in its GR set. After completion of the P operation, the semaphore is forwarded to the site whose request is at the head of the queue.

Lastly, if the response and the semaphore arrive simultaneously, the proxy node’s response is first examined. If the message is a success, the P operation is complete and the semaphore is not examined. If the message is a deny, the semaphore is examined and if possible, the P operation is performed on it. If the semaphore is already zero, then a new proxy request is sent to another node in the GR set.

4.4 Correctness

We now present a correctness proof for the algorithm. We first prove that P and V operations are done in a mutually exclusive manner. To prove deadlock freedom, we show that no set of nodes forms a cycle while trying to acquire the semaphore to perform a P operation. Lastly, we show that any process wishing to perform a P operation will do so in finite time (i.e., starvation freedom).

We denote the probable_holder field of a node $P_i$ as $PH(P_i)$ for brevity. For example, $PH(P_i) = P_j$ means that $P_i$'s probable_holder field indicates that $P_j$ is the probable holder of the semaphore.

4.4.1 Mutually Exclusive P and V Operations

Observation 3 Only one copy of the semaphore exists in a cluster. There may be $\sqrt{s_i}$ copies in the system, but a semaphore copy never leaves its cluster.

Theorem 3 The P and V operations are done in a mutually exclusive manner.

Proof: There is only one copy of the semaphore per cluster, and only the node holding it can perform P and V operations on it. Mutual exclusion is guaranteed by the node performing the operations within a critical section, as shown in the code. □

4.4.2 Deadlock Freedom

To prove the absence of deadlock in the algorithm, we must show that two or more nodes will not form a cycle in trying to acquire a semaphore from each other. We also show that a P_request by a node will not loop back to it and cause it to self-deadlock. We first state a few observations which simplify the proof.

Observation 4 Initially, $PH(P_i) = P_j$ for every $P_i \in C_j$, where $P_j$ is the holder of the semaphore in cluster $C_j$.

Observation 5 $PH(P_i) = P_i$ implies that $P_i$ holds the semaphore at present and is at the root of the tree induced by the $PH$ fields of the other nodes in its cluster. This tree spans the nodes of its cluster.

The above observations state that the probable_holder fields induce a spanning tree within each cluster. The initial tree is like a star, with the root holding the semaphore and all the other nodes being leaves of the tree. As requests are made and the semaphore is propagated, the root of the tree changes in concert with the semaphore propagation.

Lemma 6 A node's request for a P or a V operation will not form a cycle.

Proof: $P_i$ sends its request to $P_j$ provided $PH(P_i) = P_j$. This traversal of the request will be denoted by $P_i \rightarrow P_j$. For a cycle to result from the forwarding of the request message, the following path must be traversed by the request: $P_i \rightarrow P_j \rightarrow P_k \rightarrow \cdots \rightarrow P_q \rightarrow P_m \rightarrow P_i$.

A node such as $P_m$ cannot exist along the path traversed by the request. For $P_m$ to be on the path taken by the request message would imply that $P_m$ was on the path along which the semaphore propagated after leaving $P_i$ (the last time). But then $P_m$ must either still hold the semaphore or have passed it to a node other than $P_i$, contradicting $PH(P_m) = P_i$. □

Lemma 7 Two or more nodes will not form a cycle while waiting for a semaphore.

Proof: We can assume that none of the nodes in question holds the semaphore, since a node which currently holds a semaphore will not make a request for it. Thus, if there exists a cycle, it must be a knot, since none of the nodes involved is a sink.

For two nodes $P_i$ and $P_j$ to form a cycle while trying to access a semaphore, $PH(P_i) = P_j$ and $PH(P_j) = P_i$. But $PH(P_i) = P_j$ implies that $P_j$ received the semaphore from $P_i$. Similarly, $PH(P_j) = P_i$ implies that $P_i$ received the semaphore from $P_j$. Furthermore, as stated in Observation 5, it is impossible for either $P_i$ or $P_j$ to maintain an old value of $PH$. Thus it is impossible for both $PH(P_i) = P_j$ and $PH(P_j) = P_i$ to hold.

For three nodes $P_i$, $P_j$, and $P_k$ to form a cycle, we must have each node waiting on another, for instance $P_i \rightarrow P_j \rightarrow P_k \rightarrow P_i$. For this to occur, $PH(P_i) = P_j$, $PH(P_j) = P_k$, and $PH(P_k) = P_i$ must hold. This implies that $P_i$ held the semaphore prior to $P_j$, and $P_j$ held it before $P_k$. By transitivity, $P_i$ must then have held it before $P_k$, which contradicts the statement $PH(P_k) = P_i$. This argument can be extended to more nodes. □

Theorem 4 The algorithm is deadlock free.

Proof: The preceding lemmas prove that a node will not block its own request and that two or more nodes will not block on each other and form a cycle. Thus, the algorithm is deadlock free. □

4.4.3 Starvation Freedom

Lemma 8 Every P_request message eventually reaches the semaphore holder.

Proof: Suppose $P_i$ issues a P_request message. This message is sent to the node specified by $PH(P_i)$. The node specified by $PH(P_i)$ is the one to which $P_i$ sent the semaphore last. If node $PH(P_i)$ holds the semaphore, then the request has reached the holder. If the semaphore is no longer at node $PH(P_i)$, then the message is forwarded to the node $PH(PH(P_i))$. This third node similarly is the one to which node $PH(P_i)$ sent the semaphore last. By applying this argument recursively, we see that the $PH$ field of each node defines a directed edge in a path from $P_i$ to the current holder of the semaphore. □

Lemma 9 A request message will not be queued up indefinitely.

Proof: V_request messages are processed as soon as the current holder of the semaphore can do so, and are thus never queued up as P_request messages sometimes can be.

A P_request message by a node $P_i$ may be queued at the holder of the semaphore or at an intermediate node which is itself waiting for the semaphore.

Suppose $P_i$'s request is queued up at the holder of the semaphore and there are $k$ nodes ahead of it on the queue. Note that $k \le n/\sqrt{s_i} - 2$. After the semaphore (along with the queue) has been forwarded to each of the $k$ nodes blocked ahead of it in the queue, $P_i$ will receive the semaphore.

Suppose $P_i$'s request is queued up at an intermediate node, $P_j$. Assuming $P_j$'s request is queued up at the semaphore holder, then after a delay of at most $n/\sqrt{s_i} - 3$ nodes, $P_j$ will receive the semaphore. If $P_i$'s request is the $k$th on the queue at $P_j$, and if the queue sent with the semaphore is of length $k'$, then as per the algorithm the queue at $P_j$ will be appended to the queue it received. Thus, after node $P_j$ has received the semaphore, $P_i$ will be the $(k'+k)$th node to receive the semaphore. Similarly, if $P_j$ itself was queued at an intermediate node, a similar argument shows that after at most $n/\sqrt{s_i} - 2$ nodes, each with a local queue of blocked requests of length at most $n/\sqrt{s_i} - 3$, $P_i$ will get the semaphore (though the actual number of requests ahead of it will be bounded by $n/\sqrt{s_i} - 2$). □

Lemma 10 A proxy request message will not be queued up indefinitely.

Proof: A proxy request is sent by a node $P_i$ to a node $P_{pr}$ in its GR set. In response, $P_{pr}$ issues an intracluster request for the semaphore which, as proven in the above lemma, will not be delayed indefinitely. Upon checking the state of the semaphore, $P_{pr}$ will respond with a success or a deny message. Thus, $P_i$'s request is responded to in finite time. □

Theorem 5 Every request for a P or a V operation on a semaphore is eventually satisfied.

Proof: The preceding lemmas have proven that every request eventually reaches the semaphore holder and that a request will not be queued up indefinitely. □

4.5 Performance of the Clustering Approach

We are interested in how our algorithm performs in terms of the number of messages required to perform P and V operations. To do so, we would like to compare the clustered approach to a non-clustered approach. We are also interested in the response times for P operations of the clustered scheme and in seeing how they compare with the unclustered and centralized approaches.

4.5.1 Complexity of the Clustering Scheme

In considering the average case complexity of performing P and V operations, there are two cases to be considered: intracluster and intercluster. If a node has to acquire a semaphore (or send a V_request for a V operation), in the best case the semaphore is held by the node specified by the probable_holder field. In the worst case, however, the request message must be forwarded to every node in the cluster. This would require $n/\sqrt{s_i} - 1$ messages to get to the holder node. As was shown in [37], on average we expect $O(\log(n/\sqrt{s_i}))$ message transfers.

If the holder node does not have any other requests pending, the current request can be satisfied immediately, incurring one additional message. However, if there are blocked requests queued at the holder, then the current request would be added to that queue and be serviced when it comes to the head of the queue. If there are $Q$ requests ahead of it, then the total delay in accessing the semaphore would be

$Q \times (\text{average P\&V execution delay} + \text{message delay}) + 1 \text{ message delay}$

It is worth noting that a request may be blocked at a non-holder node, which is itself blocked either at the holder or at yet another node, and so on. The above equation would still apply, with $Q < n/\sqrt{s_i}$. If an intercluster proxy operation is necessary, it would require an additional round-trip message delay plus the intracluster delay.

4.5.2 Simulation Results

The experiments conducted to observe the performance of this approach are similar to those conducted for the Semaphore-page approach and are discussed in Section 3.4 of Chapter III.

In Figures 30 and 31 we plot the response time of the three models for binary semaphores, and in Figures 32 and 33 for resource counting semaphores, as functions of the ICST (a low ICST indicates a high semaphore access rate and vice versa).

When we use general resource counting semaphores, we observe that the non-clustered token-based approach performs poorly (Figures 32, 33). This is most apparent when the number of processors is small (10 processors), where the centralized approach outperforms the non-clustered approach throughout. Even when the number of processors is increased to 35, the non-clustered approach performs only marginally better than the centralized approach. We find our scheme performs better than the other two schemes, particularly when the rate of semaphore access is high. This improvement is more pronounced for a larger number of processors. When the semaphore access rate is low, however, the centralized scheme fares better, since it requires fewer messages to locate and retrieve the semaphore. However, it is the performance at high load that matters the most.

The percentage improvement in the response time of one approach over another is defined as the speedup. Figures 34 and 35 show the speedup in the average response time attained by the clustering scheme over the other two schemes as a function of ICST for 10 and 35 processors. The cluster-based approach performed as much as 45% and 65% faster than the centralized and non-clustered approaches, respectively, for 10 processors. With 35 processors, the cluster-based approach outperformed the other two approaches by as much as 74%.

Figure 30: Response Time vs. ICST. Binary Semaphores on 10 Processors (CST = 4 msec, RA = 10 msec)

The average number of messages required to acquire a semaphore for the clustered and non-clustered approaches is shown in Figures 36 and 37.

4.6 Summary

In this chapter, schemes for performing P and V operations efficiently on semaphores in a distributed environment were presented. The nodes in a distributed system are grouped into a two-level hierarchy called clusters. The sizes of the clusters depend on the semaphore's value. A node acquires and updates its cluster's copy of the semaphore, thus enabling concurrent access to the semaphore across clusters. Furthermore, this reduces the number of messages required to access the semaphore.

To reduce the amount of information that needs to be maintained for each semaphore, the system can define a few cluster sizes (instead of one for each semaphore) and associate the most appropriate one with each semaphore. As semaphore accesses exhibit temporal locality, the concept of dynamically migrating semaphore values from one cluster to another to achieve load sharing is appealing.

Figure 31: Response Time vs. ICST. Binary Semaphores on 35 Processors (CST = 4 msec, RA = 10 msec)

Simulations of the schemes indicate that the clustered approach performs well when the access rate to the semaphore is high, and is thus more appealing than the centralized approach for typical distributed applications. As the access rate decreases (increasing ICST), the performance advantage shrinks and eventually the centralized approach performs better, as expected. We also notice that while the approach behaves like the non-clustered token-based scheme for binary semaphores, it is clearly superior to the non-clustered approach for general (resource counting) semaphores.

We discuss these issues in greater detail in Chapter VI.

Figure 32: Response Time vs. ICST. Resource Counting Semaphores on 10 Processors (CST = 4 msec, RA = 10 msec)

Figure 33: Response Time vs. ICST. Resource Counting Semaphores on 35 Processors (CST = 4 msec, RA = 10 msec)

Figure 34: Relative Speedup of the Cluster Scheme over the Distributed and Centralized Schemes for 10 Processors

Figure 35: Relative Speedup of the Cluster Scheme over the Distributed and Centralized Schemes for 35 Processors

Figure 36: Average Number of Messages Required to Access a Semaphore for the Clustered and Non-clustered Approaches (10 Processors)

Figure 37: Average Number of Messages Required to Access a Semaphore for the Clustered and Non-clustered Approaches (35 Processors)

CHAPTER V

Consensus-Based Approach

5.1 Introduction

In this chapter, a distributed implementation of semaphores which performs P and V operations through consensus is presented. In this approach, every node in the system maintains a copy of the semaphore. A P operation is complete when the requesting site receives confirmation from all the other nodes indicating that they have completed the P on their respective copies. If multiple nodes try to perform a P operation on the same semaphore, the competing sites arrive at a consensus on the ordering of the P requests. The ordering is determined by timestamped priorities of the requests. All the nodes must also be notified when a particular node performs a V operation so that they may perform the V on their respective semaphores.

This approach bears some similarity to mutual exclusion algorithms such as those of Lamport, Ricart-Agrawala, Maekawa, and Singhal [28, 44, 34, 46]. These algorithms, unlike the token-based mutual exclusion algorithms described in previous chapters, do not rely on a token. Instead, mutual exclusion is requested from and granted by a set of nodes in the system. Hence, these algorithms are referred to as non-token-based or consensus-based mutual exclusion algorithms. It is important to reiterate that the algorithm presented in this chapter can be used for a variety of synchronization needs for which the above mentioned algorithms cannot.

The rest of the chapter is organized as follows. Section 5.2 describes the system model. The algorithms for performing P and V operations on semaphores are given in Section 5.3. In Section 5.4, we give a proof of the correctness of the algorithm. In Section 5.5, we suggest ways to enhance the performance of the algorithm. A simulation-based performance study of the algorithm has been conducted; the results of this study are presented in Section 5.6. Section 5.7 concludes the chapter.

5.2 System Model and Definition

The distributed system we consider is composed of a set of $n$ processors, labeled $P_0, \ldots, P_{n-1}$. For the sake of clarity in exposition, we assume that each processor has only one process. If the system had multiple processes per processor, the operating system would queue up multiple requests for the same semaphore and serve them sequentially. Each processor also maintains a counter which is used to timestamp the P and V request messages that the processor broadcasts to the other sites in the system. Every node maintains a logical clock which is updated every time the node communicates with another, as per the definition of timestamps proposed by Lamport [28]. We assume that this is performed implicitly and leave it out of our algorithms.

The semaphores in the system are labeled $S_1, \ldots, S_k$. Each semaphore has an initial value assigned to it (e.g., $S_1 = 1$, $S_2 = 4$). The semaphores are replicated on each node in the system.

5.3 A Distributed Implementation of Semaphores

Each node in the system runs a user process as well as system processes. We assume that there exists a dedicated process to handle semaphore operation requests from the user processes. The semaphores are initialized appropriately on every node in the system. User processes make P and V requests to the operating system on their processor, which then initiates the operation on the system.

In order to perform a P or a V operation on a semaphore, a node must not only update its copy of the semaphore, but also ensure that all the other nodes update their respective copies. Thus, a P operation is complete only after the node that initiated the operation has received confirmation of its success from all the other nodes in the system. Similarly, a V operation is complete when all the sites have been informed of the operation. In this case, however, the initiating site need not wait for the completion of the V operation on the remote sites, as is the case with the P operation. In the detailed descriptions that follow, assume that the term node refers to the system process that performs the P and V operations on behalf of the user process.
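The completion rule for a P operation can be sketched as follows (Python; receive_ack is a hypothetical blocking receive returning the sender of each ACK):

def await_consensus(n, receive_ack):
    # A P operation completes only after ACKs from all n-1 peers.
    acked = set()
    while len(acked) < n - 1:
        acked.add(receive_ack())

# Example with canned ACKs from P_1, P_2, and P_3 in a 4-node system:
acks = iter([1, 2, 3])
await_consensus(4, lambda: next(acks))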

Performing a P Operation

To perform a P operation on a semaphore $S_i$, a node $P_j$ first examines its local copy of the semaphore. If the value of the semaphore is greater than zero, the node decrements it and broadcasts a timestamped P_request message to all the nodes in the system. It then awaits acknowledgments from all the nodes, signaling a consensus that its P operation has been successfully completed on all the nodes in the system.

If the semaphore's value is less than one, the node postpones executing the consensus phase of the P operation until the semaphore's value is greater than zero. Figure 38 shows the operation.

L1: if (S_i < 1) then
        while (S_i < 1) No-Op;
    « if (S_i > 0) then
          S_i := S_i - 1;
          Broadcast [P_request, TimeStamp, S_i.id] to all nodes in the system. »
    else
        goto L1;
    endif
    Await replies from all (n-1) nodes.

Figure 38: Code to perform a P operation.

Responding to a P Operation

A node which receives a P_request message may or may not itself be in the process of performing a P operation on the same semaphore. If it is not, it decrements its copy of the semaphore by one, irrespective of the current value of the semaphore, and responds to the P_request message with an ACK message. Note that it is possible for a copy of the semaphore to acquire values in the negative range. This, however, does not cause a problem, as a node will not initiate a P request if it finds its own semaphore's value to be less than one (see Figure 38). It should also be pointed out here that the operations on a local semaphore copy do not use the same semantics as the P operation given in Section 2.2.2, but the semantics observable by user processes still follows that definition.

On receiving a P_request message from P_i for S_i:
« if (NOT performing a P on S_i) then
      S_i := S_i - 1;
      Send ACK reply message to P_i;
  else
      S_i := S_i - 1;
      if (S_i >= 0) then
          Send ACK reply message to P_i;
      else if (my P_request ≺ P_i's P_request) then
          Enqueue P_i's P_request on the timestamp-prioritized waitq.
      else
          Send ACK reply message to P_i;
      endif
  endif »

Figure 39: Executed by a node on receiving a P_request message.

On the other hand, a node could receive a P_request message while in the process of performing a P operation on the same semaphore (i.e., it has broadcast P_request messages and is awaiting ACK messages). In such a situation, depending on the value of the semaphore, a couple of alternatives arise. If the semaphore's value is greater than zero, the node decrements it and ACKs the request. Otherwise, it is necessary to impose an order on the two concurrent requests.

Ordering of the competing P operation requests is done by determining whether the node's own pending request or the one for which it has received the P_request message has precedence. The timestamps associated with the requests are used to determine the precedence. If the timestamps are equal, node ids are used. The symbol ≺ is used to denote precedence and is defined as follows:

$P_i$'s P_request ≺ $P_j$'s P_request if:
    $P_i$'s P_request.timestamp < $P_j$'s P_request.timestamp, OR
    ($P_i$'s P_request.timestamp = $P_j$'s P_request.timestamp AND $i < j$)

Figure 40: Definition of the Precedence Relationship Between Two P_request Messages.
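Because the precedence rule compares the timestamp first and falls back to the node id, it amounts to a lexicographic comparison of (timestamp, id) pairs, as this sketch shows (Python; illustrative only):

def precedes(req_a, req_b):
    # req = (timestamp, node_id); True iff req_a ≺ req_b per Figure 40.
    return req_a < req_b  # tuple comparison: timestamp first, then id

print(precedes((5, 2), (7, 0)))  # True: earlier timestamp wins
print(precedes((5, 2), (5, 3)))  # True: equal timestamps, lower id wins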

If the precedence relationship specifies that the node that received the P_request message has the higher priority, it queues the P_request message (on a local queue called the waitq) and postpones the ACK until after it has performed a V operation. If the request message it received has the higher priority, it ACKs that message immediately. The code is given in Figure 39.

Performing a V Operation

To perform a V operation, a node must complete the operation on its copy of the semaphore and also inform the other nodes in the system to do the same on their respective copies. To do this, it increments its semaphore's value and broadcasts a V_request message to the other nodes in the system.

It is possible that the node has queued the P_request messages of other nodes on its waitq. Since the node is no longer performing a P operation, it need not participate in the serialization of multiple requests anymore and can ACK all the nodes queued on its waitq. This does not pose any problems, because the nodes that were queued on its waitq are ordered according to the precedence relationship among each other, and hence only one of them will be able to complete its P on receipt of the ACK from this node. To illustrate this point, consider the following scenario.

Suppose nodes $P_i$, $P_j$, and $P_k$ initiated concurrent P operations on a binary semaphore $S_i$. Furthermore, suppose their requests defined the following precedence: $P_i \prec P_j \prec P_k$, as defined in Figure 40. In this scenario, $P_j$ will enqueue $P_k$'s request, and $P_i$ will enqueue both $P_j$'s and $P_k$'s. Since $P_i$ has the highest precedence of the three, it will receive ACK messages from the two nodes and will be the first among the three to complete its P operation. When $P_i$ performs the V operation, it can dequeue and send ACK messages to both $P_j$ and $P_k$. Since $P_j \prec P_k$, $P_j$ had enqueued $P_k$'s request and $P_k$ had ACKed $P_j$. Thus, $P_i$'s ACK results in the completion of $P_j$'s P operation. Observe that any other node that might have initiated a P operation would either have preceded $P_i$ or will succeed $P_k$.

It is important to note that while the above scenario assumed a binary semaphore and that the same node initiates both the P and V operations, these are not limitations of the approach. Similar cases can be developed for resource counting semaphores as well as for remote V operations.

« S_i := S_i + 1;
  Broadcast a V_request message to the other nodes.
  for each (P_request in waitq)
      Dequeue the P_request and send an ACK message to that node.
»

Figure 41: Executed by a Node to Perform a V Operation.

On receiving a V_request message, a node performs the V operation on its copy of the semaphore. It then sends ACK messages to all nodes that were queued on its waitq. Thus, in the example above, if $P_i$ had received a V_request message, it would have ACKed the other nodes. The code executed by a node to perform a V operation is shown in Figure 41. Figure 42 shows the code for handling a V_request message.

One way to improve the performance of the algorithm is to combine the broadcasting of V_request messages and the ACK messages. Rather than waiting for explicit ACK messages alone, a node waiting for an ACK message from another node could treat a V_request message from that site with a timestamp greater than its own as an implicit ACK message as well.
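A sketch of that test (Python; the message record is a hypothetical stand-in for whatever the transport delivers):

from collections import namedtuple

Msg = namedtuple("Msg", "kind sender timestamp")

def is_implicit_ack(msg, my_request_ts):
    # A V_request stamped later than our own P_request shows the sender
    # has already processed our request, so it doubles as an ACK.
    return msg.kind == "V_request" and msg.timestamp > my_request_ts

print(is_implicit_ack(Msg("V_request", 3, 9), 5))  # True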

On receiving a V_request message from P_i for S_i:
« S_i := S_i + 1;
  for each (P_request in waitq)
      Dequeue the P_request and send an ACK message to that node.
»

Figure 42: Executed by a Node on Receiving a V_request Message.

5.4 A Proof of Correctness

We now present a correctness proof for the algorithm. To prove deadlock freedom, we show that a set of nodes will never form a cycle while trying to perform P operations. We also show that any process wishing to perform a P operation will succeed in doing so in finite time (i.e., starvation freedom).

5.4.1 Deadlock Freedom

To prove the absence of deadlock, we must show that two or more nodes will not form a cycle while trying to complete their P operations. A cycle can arise if two (or more) nodes block each other's P_request messages on their respective waitq queues and wait for each other's ACK messages. Since none of the nodes can complete its respective P operation, the requests would be queued indefinitely and a deadlock would occur.

Lemma 11 Concurrent P_requests by two nodes ($P_i$ and $P_j$) for a semaphore will not be queued on the waitq queues at both sites.

Proof: From the definition of precedence in Figure 40, either the P_request of $P_i$ has precedence over that of $P_j$ or vice versa.

If the semaphore's value is greater than zero, both nodes will ACK each other regardless of the precedence. Suppose the semaphore is zero or below at one of the nodes (say, for instance, $P_i$). Then $P_i$ will enqueue $P_j$'s request if $P_i$'s P_request ≺ $P_j$'s P_request, and ACK it otherwise. A similar argument applies if $P_j$ is the site whose semaphore has a value less than one. If the semaphore at both nodes is less than one and $P_i$'s P_request ≺ $P_j$'s P_request, then $P_i$ will enqueue $P_j$'s request and $P_j$ will ACK $P_i$. If $P_j$'s P_request ≺ $P_i$'s P_request, then the converse will hold (i.e., $P_j$ will enqueue $P_i$'s request and $P_i$ will ACK $P_j$). □

Lemma 12 Concurrent P_requests by three or more nodes will not form a cycle of waiting requests.

Proof: Suppose nodes $P_i$, $P_j$, and $P_k$ perform concurrent P operations on the same semaphore. If the semaphore's value is less than one, some of the requests may need to be queued. Suppose the following cycle results: $P_i$ queues $P_j$'s request, $P_j$ queues $P_k$'s request, and $P_k$ queues $P_i$'s. Node $P_i$ queues $P_j$'s request provided $P_i$'s P_request ≺ $P_j$'s. Similarly, $P_j$ queues $P_k$'s if $P_j$'s P_request ≺ $P_k$'s. If $P_i$'s P_request ≺ $P_j$'s and $P_j$'s P_request ≺ $P_k$'s, then $P_i$'s P_request ≺ $P_k$'s, since the precedence relationship is transitive. Thus, $P_k$ cannot enqueue $P_i$'s request. By similar arguments, we can see that no set of nodes can queue each other's requests and form a cycle. □

Theorem 6 The algorithm is deadlock-free.

Proof: Lemma 11 proved that no two nodes can block each other's requests, and Lemma 12 proved that three or more nodes cannot form a cycle of blocked requests. □

5.4.2 Starvation Freedom

Starvation occurs if a node awaiting ACK replies to complete its P operation waits indefinitely while some other nodes repeatedly succeed in performing P operations. This can occur only if the node's P_request message is queued on the waitq of some node in the system and is never dequeued and sent an ACK message.

Observation 6 For the correct execution of application programs, every P operation on a semaphore has a corresponding V operation performed on it. The two operations, however, need not be initiated by the same user process.

Theorem 7 A queued P_request will be ACKed in finite time.

Proof: Suppose nodes $P_i$ and $P_j$ initiate concurrent P operations, and $P_i$'s P_request ≺ $P_j$'s. Assuming the semaphore's value is less than one, $P_i$'s request will be ACKed by $P_j$ when it receives the request, while $P_j$'s request will be enqueued by $P_i$.

Provided node $P_i$ or some other node in the system eventually initiates a V operation, node $P_i$ will perform the V operation on its copy of the semaphore. At this point, it will dequeue all the nodes on its waitq and send ACK messages to them. □

5.5 Performance Enhancement Techniques

One way to improve the performance of the algorithm is to partition the nodes in the system into clusters. The clustering can be either a function of the number of nodes in the system (e.g., if there are $n$ nodes in the system, partition them into $\sqrt{n}$ clusters with $\sqrt{n}$ nodes each), or it could be a function of the initial value of a semaphore (e.g., create clusters with $n/\lfloor\sqrt{s_i}\rfloor$ nodes in each). The value of the semaphore on each node will have to be modified to reflect this processor hierarchy (for instance, if the clusters are a function of the semaphore's value, each node's copy of the semaphore should be initialized to $\lfloor\sqrt{s_i}\rfloor$). The P and V requests would be multicast to the nodes within a particular cluster and would not be broadcast to the entire system. The algorithm must provide for intercluster proxy-P and proxy-V operations as well. A similar strategy was proposed for token-based implementations of semaphores [40]. The modified algorithms using clustering are presented below.

Initiating a P Operation

When a node performs a P operation on a semaphore $S_i$, it first examines its local copy of the semaphore. If the value of the semaphore is greater than zero, it proceeds with essentially the same algorithm as in Figure 38. However, if the semaphore is zero, it sends a proxy P_request message to its proxy node in another cluster to perform the P on its behalf. If the proxy node is successful in doing so, it responds with an ACK. Otherwise, it responds with a NOACK message. Each time it receives a NOACK message, the node tries another proxy node in another cluster.

It repeats this procedure until it has tried each of its proxy sites in the other clusters; if unsuccessful in every case, it tests its own semaphore again and starts over. This modified algorithm is shown in Figure 43.

When a node receives a P_request message, it first checks whether the request is a proxy request or a broadcast request from within its own cluster. If the request is a proxy request, it checks the value of the semaphore to determine whether it can succeed in performing the proxy request. If the semaphore is less than one, it responds to the proxy request with a NOACK message. If the semaphore is greater than zero, it initiates an intracluster P operation on behalf of the requesting node. If the proxy node is itself in the process of performing a P operation, rather than initiate another P operation concurrently, it responds to the proxy request with a NOACK message. Figure 44 presents the modified operations.

« L1: if (S_i > 0) then
          S_i := S_i - 1;
          Broadcast [P_request, TimeStamp, S_i.id] to all nodes in the system. »
      Await replies from all (n-1) nodes.
  else
      for each (Proxy_P_l in Cluster_j AND NOT success) do
          Send [Proxy P_request, TimeStamp, S_i.id] to Proxy_P_l.
          Await reply.
          if (reply = ACK) then
              success := TRUE;
          endif
      endfor
      if (NOT success) then goto L1 endif

Figure 43: Code to perform a P operation.

Performing a V Operation

When a node performs a V operation, depending on whether it performed a proxy or an intracluster P, it either sends a proxy V_request message to the proxy node which performed the P operation on its behalf, or it broadcasts a V_request message to the other nodes in its cluster. In either case, it also sends ACK messages to all the nodes whose requests it queued on its waitq while it was executing its own P operation phase. The actions performed are shown in Figure 45. On receiving a V_request message, a proxy node sends V_request messages to the other nodes in its cluster, dequeues any pending requests on its waitq, and ACKs them.

On receiving a P_request message from P_i for S_i:
if (performing a P operation OR (P_request is a proxy request AND S_i <= 0)) then
    Send NOACK reply message to P_i;
endif
« S_i := S_i - 1;
  if (performing a P on S_i AND S_i < 0) then
      if (my P_request ≺ P_i's P_request) then
          Enqueue P_i's P_request on the timestamp-prioritized waitq.
      else
          Send ACK reply message to P_i;
      endif
  else
      Send ACK reply message to P_i;
»

Figure 44: Executed by a node on receiving a P operation request.

5.6 Performance Study

We are interested in how our algorithm performs in terms of the number of messages required to perform P and V operations. We are also interested in determining the delay involved in these operations. There are two ways of measuring these: analysis and simulation. Since analytical modeling of these algorithms is extremely difficult for the general case, we study the performance of the algorithm using simulation.

Since a P operation involves a broadcast followed by a reply from each node, $2(n-1)$ messages are needed. This is the same message complexity as that of the Ricart-Agrawala algorithm for obtaining mutually exclusive access.

« if (proxy P_request) then
      Send V_request to the proxy node.
  else
      S_i := S_i + 1;
      Broadcast a V_request message to the other nodes.
  endif
  for each (P_request in waitq)
      Dequeue the P_request and send an ACK message to that node.
»

Figure 45: Executed by a node to perform a V Operation.

The V operation needs only $n-1$ messages, since there are no acknowledgments involved. By clustering the nodes as described in the previous section, we can reduce the number of messages involved as follows.

In considering the average case complexity of performing P and V operations, there are two cases to be considered: intracluster and intercluster. If a node is able to complete a P operation through consensus within its own cluster, it requires $2(n/\sqrt{s_i} - 1)$ messages, and $n/\sqrt{s_i} - 1$ messages are needed to perform a V operation. If it performs an intercluster P operation, it requires at most an additional $\sqrt{s_i} - 1$ messages to poll a proxy node in each of the other clusters to perform the P operation on its behalf.
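For concreteness (illustrative numbers, not taken from the dissertation's experiments), with $n = 36$ and $s_i = 16$, so that $n/\sqrt{s_i} = 9$:

\[
\underbrace{2(n-1)}_{\text{unclustered }P} = 70,\qquad
\underbrace{2\left(\tfrac{n}{\sqrt{s_i}}-1\right)}_{\text{intracluster }P} = 16,\qquad
\underbrace{\tfrac{n}{\sqrt{s_i}}-1}_{\text{intracluster }V} = 8,
\]

with at most $\sqrt{s_i} - 1 = 3$ additional proxy messages for an intercluster P.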

5.6.1 Simulation Results

The consensus-based approach was simulated using the CSIM package and compared to the centralized approach. The experiments conducted to observe the performance of this approach are similar to those conducted for the Semaphore-page approach and are discussed in Section 3.4 of Chapter III.

Response Time

In Figures 46, 47, 48, 49, 50, and 51 we plot the response times of the consensus-based approach and the centralized approach as functions of the ICST (a low ICST indicates a high semaphore access rate and vice versa). Figures 46 and 47 are the plots for a system where a single binary semaphore is accessed by all the user processes in the system. Figures 48 and 49 are the plots for systems where five binary semaphores are accessed by user processes with uniform probability. Figures 50 and 51 are for systems with five resource counting semaphores that are accessed by the processes with uniform probability. The consensus-based approach performs better than the centralized approach when the access rate to the semaphore is high; when the semaphore access rate is low, the centralized approach performs marginally better.

The consensus-based approach performs better than the centralized scheme for multiple semaphores as well. In the case of multiple binary semaphores, this is more pronounced (Figures 48 and 49) than for multiple resource counting semaphores (Figures 50 and 51).

Speedup

Figures 52 and 53 plot the relative speedup of the consensus-based approach over the centralized approach for binary and resource counting semaphores.

Figure 46: Response Time Versus ICST. Consensus-based and Centralized Approaches, One Binary Semaphore on 10 Processors (CST = 4 msec, RA = 10 msec)

5.7 Summary

In this chapter, a consensus-based algorithm to implement semaphores in a distributed system was presented. In this approach, the semaphores are replicated on all the nodes of the system. Each node performs P and V semaphore operations on its local copy; however, for the operation to be complete, the node performing the operation must obtain a consensus from the other nodes in the system. In performing a P operation, a node first informs all the other nodes of its intention to perform the P. When it receives an acknowledgment from the nodes, its P is complete. Similarly, a V is also broadcast to all the nodes. The node initiating the V, however, need not wait for a response from the others. All requests are timestamped in order to impose a precedence ordering on concurrent requests and prevent deadlocks. The algorithm was shown to be deadlock and livelock free. A method wherein the nodes are clustered to reduce the number of messages in the system was also presented.

Figure 47: Response Time Versus ICST. Consensus-based and Centralized Approaches, One Binary Semaphore on 35 Processors (CST = 4 msec, RA = 10 msec)

Performance studies using simulation indicate that this approach performs well for both binary and resource counting semaphores. While the approach is clearly superior to the centralized approach when access rates to semaphores are high, it remains competitive at lower access rates as well.

Figure 48: Response Time Versus ICST. Consensus-based and Centralized Approaches, Multiple Binary Semaphores on 10 Processors (CST = 4 msec, RA = 10 msec)

Figure 49: Response Time Versus ICST. Consensus-based and Centralized Approaches, Multiple Binary Semaphores on 35 Processors (CST = 4 msec, RA = 10 msec)

Figure 50: Response Time Versus ICST. Consensus-based and Centralized Approaches, Multiple Resource Counting Semaphores on 10 Processors (CST = 4 msec, RA = 10 msec)

Figure 51: Response Time Versus ICST. Consensus-based and Centralized Approaches, Multiple Resource Counting Semaphores on 35 Processors (CST = 4 msec, RA = 10 msec)

Figure 52: Speedup of the Consensus-based Approach over the Centralized Approach. Multiple Binary Semaphores on 10 and 35 Processors (CST = 4 msec, RA = 10 msec)

Figure 53: Speedup of the Consensus-based Approach over the Centralized Approach. Multiple Resource Counting Semaphores on 10 and 35 Processors (CST = 4 msec, RA = 10 msec)

CHAPTER VI

Comparative Performance Evaluation

6.1 Introduction

In the previous chapters, performance studies of the three approaches were presented. It was observed that under high rates of access to the semaphore, the decentralized approaches performed better than the centralized approach. As the access rate decreased, the centralized approach improved and eventually performed better than the decentralized approaches.

This chapter compares the performance of the three approaches presented in this dissertation with each other. We would like to investigate the effects of various parameters on the performance of the respective approaches. As was done in the performance studies in the previous chapters, the number of processors and the access rates are varied for each experiment. The effects of the time spent in the critical section and of the access pattern of the user processes on the response times of the approaches are investigated. The critical section time (CST) can be either short or long and is relative to the remote access time (RA). The remote access time is the time taken to send a message between a pair of nodes in the system. For an RA of 10 msec, a short CST is 4 msec and a long CST is 20 msec.


Simulations of the three approaches for both a single semaphore and for multiple semaphores were conducted. The single semaphore case is used to maximize the access rate generated by user processes to a semaphore. The multiple semaphores case is representative of typical parallel applications. For both the single and the multiple semaphore cases, experiments were conducted for binary and resource counting semaphores.

The rest of this chapter is organized as follows. In Section 6.2, simulation results of the three approaches for a single binary semaphore are presented. In Section 6.3, the performance of the approaches for multiple binary semaphores is discussed. Section 6.4 presents the results for multiple resource counting semaphores. Finally, Section 6.5 summarizes the chapter.

6.2 System With One Binary Semaphore

The first set of experiments performed on the three approaches assumes that all the user processes in the system access a single binary semaphore and perform P and V operations on that semaphore. The number of nodes in the system was varied and the results presented are for 10 and 35 node systems. Simulations for both short CST and long CST were conducted. Figures 54 and 55 plot the response times of the three approaches for 10 and 35 node systems for a CST of 4 msec. In Figures 56 and 57 the results of the same experiments for a CST of 20 msec are plotted.

One can observe that the semaphore-page and the clustered approaches perform almost identically to each other. This is quite understandable and expected for the following reason. When there is only one semaphore being used in the system, the semaphore-page approach's updating scheme is ineffective. Since the single semaphore is a binary semaphore, the clustered approach doesn't cluster the system at all. Thus both approaches behave similarly to the simple token-based approach described in Chapter IV. The consensus-based approach performs better than the other two approaches for low access rates since it requires fewer messages in performing the P operation. Furthermore, this approach is less sensitive to the length of the critical section than are the other two approaches.


Figure 54: Response Time vs. ICST. One Binary Semaphore on 10 Processors (CST = 4 msec, RA = 10 msec)


Figure 55: Response Time vs. ICST. One Binary Semaphore on 35 Processors (CST = 4 msec, RA = 10 msec)

6.3 Multiple Binary Semaphores

Another set of experiments performed on the three approaches tested their behavior with multiple binary semaphores. The user processes in the system access any of five binary semaphores with uniform probability. Figures 58, 59, 60, and 61 plot the response times of the approaches for short and long CST for 10 and 35 processors, respectively.

In these experiments we find that the semaphore-page approach performs consistently better than the clustered approach for both short and long CST values and for both system sizes. This is because the updating of the semaphore-pages is now benefiting its response times, whereas the clustered approach still treats the entire system as one cluster since all the semaphores are binary.


Figure 56: Response Time vs. ICST. One Binary Semaphore on 10 Processors (CST = 20 msec, RA = 10 msec)

This is apparent when we plot the average number of messages required to acquire the semaphore in Figures 62 and 63. It should be noted that the maximum number of messages required can vary by as much as ten hops in a 35 node system.

Unlike the previous experiment with one binary semaphore, we observe here that the semaphore-page and clustered approaches perform significantly better than the consensus-based approach when the semaphore access rates are high. This is because the multiple semaphores have a less pronounced effect on the latter approach, since all the nodes must participate in the consensus irrespective of the particular semaphore involved.


Figure 57: Response Time vs. ICST. One Binary Semaphore on 35 Processors (CST = 20 msec, RA = 10 msec)

6.4 Multiple Resource Counting Semaphores

These experiments assume that there are multiple resource counting semaphores in the system, accessed with uniform probability. The results of the experiments for short CST are displayed in Figures 64 and 65 for 10 and 35 processor systems, and for long CST in Figures 66 and 67.

In these experiments, it is apparent that both the semaphore-page and clustered approaches have much better response times for high access rates than the consensus-based approach. At low access rates, the consensus-based approach continues to perform better, just as it does in the other experiments.

It is also apparent that the clustered approach is benefiting from the clustering that is possible with resource counting semaphores.


Figure 58: Response Time vs. ICST. Multiple Binary Semaphores on 10 Processors (CST = 4 msec, RA = 10 msec)

The performance of this approach is better than that of the semaphore-page approach for high access rates for both system sizes. This is because nodes in different clusters are able to access the same semaphore concurrently. When the access rates are low, the two approaches perform virtually identically. Figures 68 and 69 plot the average number of messages required in accessing the semaphores.

6.5 Summary

In this chapter, performance studies of the three approaches were presented. The experiments were divided into short and long critical sections. For each critical section length, experimental results for one semaphore and for multiple semaphores are presented. The semaphores were both binary and resource counting. For each experiment, results were generated for 10 and 35 node systems.


Figure 59: Response Time vs. ICST. Multiple Binary Semaphores on 35 Processors (CST = 4 msec, RA = 10 msec)

For the semaphore-page and clustered approaches, the experiments indicate that while the length of the critical section is a factor in the response times when the semaphore access rates are high, it is insignificant when the access rates are low. The consensus-based approach is not as affected by the length of the critical section.

The experiments also indicate that when multiple semaphores are accessed with uniform distribution, the response times are significantly lower than if only one semaphore is being accessed. This pattern is more pronounced when the access rate is high, but holds for all access rates.

The clustered approach performs the best among the three approaches in the experiments dealing with resource counting semaphores. This can be explained by the fact that the clustered approach replicates the semaphore and enables concurrent access, thereby reducing the contention for the semaphore.


Figure 60: Response Time vs. ICST. Multiple Binary Semaphores on 10 Processors (CST = 20 msec, RA = 10 msec)

The semaphore-page approach performs well for multiple semaphores since it is able to utilize the page updating strategy to its advantage. For a system consisting of multiple binary semaphores, this approach is superior to the clustered approach.

The consensus-based approach consistently performs the best when the access rates for semaphores are low. This is due to the fact that, unlike the other two approaches, the current location of the semaphore need not be determined using multiple message hops.


Figure 61: Response Time vs. ICST. Multiple Binary Semaphores on 35 Processors (CST = 20 msec, RA = 10 msec)


Figure 62: Average Number of Messages Required to Acquire the Semaphore. Multiple Binary Semaphores on 10 Processors (CST = 4 msec, RA = 10 msec)


Figure 63: Average Number of Messages Required to Acquire the Semaphore. Multiple Binary Semaphores on 35 Processors (CST = 4 msec, RA = 10 msec)


Figure 64: Response Time vs. ICST. Multiple Resource Counting Semaphores on 10 Processors (CST = 4 msec, RA = 10 msec)


Figure 65: Response Time vs. ICST. Multiple Resource Counting Semaphores on 35 Processors (CST = 4 msec, RA = 10 msec)


Figure 66: Response Time vs. ICST. Multiple Resource Counting Semaphores on 10 Processors (CST = 20 msec, RA = 10 msec)


Figure 67: Response Time vs. ICST. Multiple Resource Counting Semaphores on 35 Processors (CST = 20 msec, RA = 10 msec)


Figure 68: Average Number of Messages Required to Acquire the Semaphore. Multiple Resource Counting Semaphores on 10 Processors (CST = 4 msec, RA = 10 msec)


Figure 69: Average Number of Messages Required to Acquire the Semaphore. Multiple Resource Counting Semaphores on 35 Processors (CST = 4 msec, RA = 10 msec)

CHAPTER VII

Summary and Directions

This dissertation investigated an important problem in distributed systems: the need for general purpose synchronization mechanisms. While this problem has been studied extensively in the context of uniprocessor and shared memory multiprocessor systems, it has been largely uninvestigated in the context of processor networks. The main reason for this is the lack of shared memory in such systems. Synchronization mechanisms such as distributed mutual exclusion algorithms, which rely on message passing, have been proposed for distributed systems. Mutual exclusion provides a restricted form of the binary semaphore, namely a lock. While locks are necessary for mutually exclusive access to a shared resource, parallel and distributed applications require other forms of synchronization as well. For instance, process coordination, handshaking, and K-ary exclusion cannot be implemented elegantly using locks. Implementing general purpose synchronization mechanisms such as semaphores in a centralized manner has the obvious drawbacks of the central point being a bottleneck, congestion of the network near the central point, and a single point of failure.

Thus it is necessary to provide such mechanisms in a decentralized manner.


Another motivation for supporting general purpose synchronization mechanisms in distributed systems is the growing popularity and prevalence of Distributed Shared Memory (DSM) systems. A DSM system provides a shared memory programming abstraction on top of a distributed system. This enables us to write parallel applications using the shared memory programming paradigm but execute the programs on a distributed system, which eases programming and improves portability while maintaining scalability and efficiency. In the DSM programming model, cooperating processes access shared variables, and coordinating this access is important for the correct execution of a parallel program. Synchronization mechanisms provide the support for this coordination.

This dissertation presented three algorithms that implement semaphores in a distributed environment. Proofs of correctness and performance studies based on simulations were presented for each algorithm. A comparative performance study of the three algorithms was also presented. A brief summary of each of the contributions of this research is presented below.

7.1 Summary of Results

7.1.1 Taxonomy of Synchronization Mechanisms in DSM Systems

In Chapter II, we examined the synchronization mechanisms in a variety of DSM systems. We classified the mechanisms according to whether the mechanism is implemented in hardware or software, whether it uses a centralized or distributed scheme, and lastly whether it is integrated into the DSM system or not. We summarized the classification of the synchronization mechanisms in various existing DSM systems in Table 1.

In software-based systems, the synchronization mechanism is implemented either by the runtime system or by the kernel. It is implemented in a centralized, statically distributed, or dynamically distributed manner. If it is dynamically distributed, only one copy of the synchronization variable is maintained and this copy is migrated (like a token) among the processors in the system.
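A minimal sketch of the dynamically distributed case is given below. All identifiers (the message tags, the send callback, the probable_owner field) are illustrative assumptions rather than the API of any particular DSM system, and queuing of waiters and collapsing of forwarding chains are omitted.

    # Sketch of a dynamically distributed synchronization variable that
    # migrates, like a token, among processors. Only the current owner
    # holds the single copy; everyone else keeps a "probable owner" hint.

    class MigratingSemaphore:
        def __init__(self, node_id, initial_owner, value=1):
            self.node_id = node_id
            self.probable_owner = initial_owner   # best guess at current holder
            self.value = value if node_id == initial_owner else None

        def holds_copy(self):
            return self.value is not None

        def request(self, send):
            """Ask for the variable; send(dest, msg) is a messaging stub."""
            if not self.holds_copy():
                send(self.probable_owner, ("REQUEST", self.node_id))

        def on_message(self, send, msg):
            kind, arg = msg
            if kind == "REQUEST":
                if self.holds_copy():
                    send(arg, ("COPY", self.value))  # migrate the single copy
                    self.value = None
                    self.probable_owner = arg
                else:
                    send(self.probable_owner, msg)   # forward along the chain
            elif kind == "COPY":
                self.value = arg                     # this node is now the owner
                self.probable_owner = self.node_id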

In hardware-based DSM systems, the mechanism is typically implemented with hardware support; a notable exception is the FLASH system, in which it is implemented by the operating system kernel. The distributed schemes in such systems take advantage of the hardware support to cache the synchronization variable at the processors and rely on cache-coherence protocols to maintain the consistency of the replicated copies. Some hardware-based systems implement the mechanism in a statically distributed manner to eliminate the need for maintaining multiple cache-coherence protocols (one for synchronization variables and another for other shared data). Such systems use fetch-and-op type operations to reduce the number of messages generated by processors accessing the synchronization variable.
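The message-saving effect of fetch-and-op can be illustrated with the following sketch. The lock-based cell below merely emulates what is, in a hardware DSM, a single atomic transaction at the home memory module; the class and function names are ours.

    import threading

    class FetchAndOpCell:
        """Software stand-in for a memory module supporting fetch-and-add.
        In a hardware DSM the update below is one atomic transaction at the
        home memory module, i.e., a single request/reply message pair."""

        def __init__(self, value):
            self._value = value
            self._lock = threading.Lock()

        def fetch_and_add(self, delta):
            with self._lock:
                old = self._value
                self._value = old + delta
                return old

    # A resource-counting P built on fetch-and-add: one atomic decrement
    # replaces the separate read, test, and write messages.
    def P(cell):
        while True:
            if cell.fetch_and_add(-1) > 0:
                return                  # acquired a unit of the resource
            cell.fetch_and_add(+1)      # undo and retry (busy-wait sketch)

    def V(cell):
        cell.fetch_and_add(+1)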

7.1.2 The Semaphore-page Approach

In Chapter III, we presented the Semaphore-page approach to providing semaphores on distributed shared memory systems. This approach differs from previous work in that it decentralizes the management of the semaphores. This decentralization eliminates the inherent problems in centralized schemes: the single point of failure and the bottleneck problems.

By grouping semaphores into semaphore-pages, the approach enables processes to exploit the locality of reference which exists in accessing semaphore variables by caching them on their processors. By updating information about the locations of all the semaphores in a semaphore-page, the semaphore location algorithm becomes simple yet efficient. Extensive simulation results indicate that under heavy load conditions (i.e., frequent accesses to the semaphore), this scheme has a better response time than the centralized scheme. This improvement in performance becomes more pronounced as the number of processors increases. This is a highly desirable feature for real-life systems because they are expected to handle heavy load and must be scalable.
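A minimal sketch of the page-grouping idea follows; the data structures and the PAGE_SIZE constant are our own simplifications, not the dissertation's exact algorithm.

    # Sketch of grouping semaphores into semaphore-pages. The location hint
    # for every semaphore on a page is refreshed whenever the page is seen,
    # which is what keeps the location algorithm simple yet efficient.

    PAGE_SIZE = 8   # semaphores per semaphore-page (illustrative)

    class Page:
        def __init__(self, holders):
            self.holders = holders      # sem_id -> node holding that semaphore

    class SemaphorePageCache:
        def __init__(self):
            self.location_hint = {}     # sem_id -> last known holder

        def page_of(self, sem_id):
            return sem_id // PAGE_SIZE

        def on_page_received(self, page):
            """When a semaphore-page arrives (because one of its semaphores
            was accessed), update the hints for *all* semaphores on it."""
            for sem_id, holder in page.holders.items():
                self.location_hint[sem_id] = holder

        def locate(self, sem_id, local_node):
            # Locality of reference: a recently used semaphore usually has
            # a fresh hint, often pointing at the local node itself.
            return self.location_hint.get(sem_id, local_node)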

7.1.3 The Clustered Approach

In Chapter IV, the Clustered approach was presented. In this approach, the processors in the system are grouped into a two-level hierarchy of clusters. The sizes of the clusters depend on the semaphore's value. A copy of the semaphore is maintained by each cluster. Within a cluster, the semaphore migrates from site to site, somewhat like a token in a mutual exclusion algorithm. Nodes acquire and update their cluster's copy of the semaphore, thus enabling concurrent access to the semaphore across clusters. Furthermore, this reduces the number of messages required to access the semaphore. Aside from intra-cluster accesses, it is necessary to perform inter-cluster accesses when the cluster's copy of the semaphore is zero. This movement of semaphore values across clusters produces a load-balancing effect in the system, with a greater concentration of semaphore values in those clusters where the demand is high.
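The intra/inter-cluster split can be sketched as follows; the borrowing step and all identifiers are simplified assumptions rather than the exact protocol, and blocking of a P that fails system-wide is left to the caller.

    # Sketch of the clustered approach for a resource-counting semaphore.
    # Each cluster holds a share of the semaphore's value; a P that finds
    # the local share exhausted borrows a unit from another cluster, so
    # value migrates toward clusters where demand is high.

    class ClusterCopy:
        def __init__(self, cluster_id, share):
            self.cluster_id = cluster_id
            self.share = share          # this cluster's portion of the count

        def P(self, other_clusters):
            if self.share > 0:
                self.share -= 1         # intra-cluster access: cheap and local
                return True
            # Inter-cluster access: borrow a unit from a cluster with surplus.
            for other in other_clusters:
                if other.share > 0:
                    other.share -= 1    # the borrowed unit is consumed by this P
                    return True
            return False                # exhausted system-wide; caller blocks

        def V(self):
            self.share += 1             # release into the local cluster's copy

    # Two clusters sharing a semaphore initialized to 3:
    a, b = ClusterCopy(0, share=2), ClusterCopy(1, share=1)
    assert a.P([b]) and a.P([b]) and a.P([b])   # the third P borrows from b
    assert not a.P([b])                         # the count is now exhausted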

To reduce the amount of information that needs to be maintained for each semaphore, the system can define a few cluster sizes (instead of one for each semaphore) and associate the most appropriate one with each semaphore. As semaphore access exhibits temporal locality, the concept of dynamically migrating semaphore values from one cluster to another to achieve access-load sharing is appealing.

Simulations of the scheme indicate that the clustered approach performs well when the access rate to the semaphore is high and thus is more appealing than the centralized approach for typical distributed applications. As the access rate decreases, the performance advantage shrinks and eventually the centralized approach performs better, as expected. While there is no distinction between the clustered and the unclustered approaches for binary semaphores, the clustered approach is clearly superior to the unclustered approach for general (resource counting) semaphores.

7.1.4 The Consensus-based Approach

In the Consensus-based approach presented in Chapter V, the semaphore is replicated on all the nodes of the system and each node manipulates its own copy. In this approach, a node in the system uses timestamped P and V request messages to inform the other nodes of its intention to update its copy of the semaphore. The other sites use this message to update their respective semaphore copies. Concurrent requests from different sites are serialized by the competing sites. By grouping the nodes in the system into clusters, an intra-cluster consensus-based approach is developed. This approach has the advantage of a reduced number of messages in achieving consensus, since only the nodes within the cluster participate in the formation of a consensus, not the entire system.
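A sketch of the timestamped-update idea is given below. It uses Lamport-style clocks so that all nodes compute the same total order of requests; the real algorithm must additionally ensure that a node has seen every earlier-timestamped request before granting, a step this sketch omits. All identifiers are ours.

    # Sketch of the consensus-based approach: the semaphore is replicated on
    # every node, and timestamped P/V requests are applied in one total order.

    class ConsensusNode:
        def __init__(self, node_id, initial_value):
            self.node_id = node_id
            self.clock = 0
            self.value = initial_value   # this node's copy of the semaphore
            self.pending = []            # waiting P requests: (timestamp, node)

        def broadcast(self, network, op):
            """Inform all nodes (self included) of an intended P or V."""
            self.clock += 1
            network.broadcast((self.clock, self.node_id, op))

        def on_request(self, timestamp, sender, op):
            # Lamport-style clock update keeps timestamps totally ordered,
            # with ties broken by node id, so all nodes serialize concurrent
            # requests identically without a central coordinator.
            self.clock = max(self.clock, timestamp) + 1
            if op == "V":
                self.value += 1          # a release takes effect immediately
            else:
                self.pending.append((timestamp, sender))
                self.pending.sort()      # same queue order on every node
            self._grant()

        def _grant(self):
            # Grant queued P requests in timestamp order while the value
            # remains positive; every node computes the same grants.
            while self.pending and self.value > 0:
                ts, node = self.pending.pop(0)
                self.value -= 1          # node's P succeeds at this point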

A proof of correctness of the approach was presented. Performance studies of both the clustered and unclustered approaches and comparisons with the centralized approach indicate that the consensus-based approach performs well under high semaphore access rates.

7.1.5 Comparative Performance Study

In Chapter VI, performance studies of the three algorithms are presented and compared with each other. We investigated how the three algorithms perform for varying numbers of processors, rates of access, semaphore access patterns, and critical section durations.

Simulations were performed for one semaphore that is accessed by all the user processes, as well as for multiple semaphores that are accessed with uniform probability. The semaphores were initialized as both binary and resource counting.

The critical section time is the amount of time spent by a process after successfully performing a P operation and before performing the V operation. Experiments were performed for three different critical section time values: small (4 msec), medium (10 msec), and large (20 msec). We define small, medium, and large with respect to the cost of a remote access, which was fixed at 10 msec.

For each type of semaphore described above, simulations were performed for the different critical section times for 10 processor systems and 35 processor systems. We observe that the clustered approach performs better than the others since replication of the semaphore reduces the contention.

7.2 Future Directions

The problems involved in implementing synchronization mechanisms in distributed systems have been explored in this dissertation. Several algorithms that implement semaphores in such systems were presented. Simulations of these algorithms indicate that they perform better than simple centralized implementations during heavy system load. Thus these algorithms have a niche in "real-world" practical systems, and are not merely academic pursuits. These algorithms, however, are not fault-tolerant. In the event of a node or link failure, it is assumed that the underlying system either buffers messages and re-transmits them, or that the whole system is restarted after the failure has been corrected. While this assumption is acceptable in typical non-critical systems, it makes these algorithms unsuitable for systems where the penalty of faults can be catastrophic. Distributed systems which are fault-tolerant have received much research attention. Other research initiatives have focused on making Distributed Shared Memory systems tolerant of faults [51, 30, 25]. It is very desirable to develop similar techniques for implementing semaphores.

Another area for further investigation is the development of general purpose synchronization mechanisms that can be employed in time-critical systems (i.e., real-time systems). While locks (mutual exclusion), barriers, and now semaphores have been studied in the context of distributed systems, there exist several other synchronization mechanisms that still require centralized implementations; this is another area of further study. Lastly, an increasing number of commercial, hardware-based DSM systems are being developed. These systems focus on data coherence issues and on developing efficient hardware-based protocols for their implementations. These protocols are nevertheless not ideal candidates for semaphores or other synchronization variables. Efficient protocols that support synchronization variables in these DSM systems are highly desirable.

Bibliography

[1] G. S. Almasi and A. Gottlieb. Highly Parallel Computing. Benjamin-Cummings, 1988.

[2] J. Bennett, J. Carter, and W. Zwaenepoel. “Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence”. In Symp. on Principles and Practice of Parallel Programming, pages 168-175, March 1990.

[3] J. Bennett, S. Dwarkadas, J. Greenwood, and E. Speight. “Willow: A Scalable Shared Memory Multiprocessor”. In Proceedings of Supercomputing ’92, pages 336-345, Nov 1992.

[4] P. Bernstein and N. Goodman. “Concurrency Control in Distributed Database Systems”. ACM Computing Surveys, 13(2), June 1981.

[5] P. Bernstein and D. W. Shipman. “The Correctness of Concurrency Control Mechanisms in a System for Distributed Databases (SDD-1)”. ACM Transactions on Database Systems, 5(1), March 1980.

[6] R. Bisiani and M. Ravishankar. “Plus: A Distributed Shared-Memory System”. In International Symp. on Computer Architecture, pages 115-124, May 1990.

[7] J. Carter, J. Bennett, and W. Zwaenepoel. “Implementation and Performance of Munin”. In ACM Symp. on Operating System Principles, pages 152-164, October 1991.

[8] John Carter. “Efficient Distributed Shared Memory Based On Multi-Protocol Release Consistency”. PhD thesis, Rice University, Sep 1993.

[9] David R. Cheriton and Willy Zwaenepoel. Distributed Process Groups in the V Kernel. ACM Transactions on Computer Systems, 3(2):77-107, May 1985.

[10] Convex Press, Richardson, TX. “Convex Exemplar Architecture”, Nov 1993.


[11] A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. “Software Versus Hardware Shared-memory Implementation: A Case Study”. In 21st Annual Int. Symp. on Computer Architecture, pages 106-117, April 1994.

[12] P. Dasgupta, R. LeBlanc, and W. Appelbe. “The Clouds Operating System: Functional Description, Implementation Details and Related Work”. In International Conference on Distributed Computing Systems, pages 2-9, June 1988.

[13] G. Delp, D. Farber, R. Minnich, J. Smith, and I. Tam. “Memory as a Network Abstraction”. In T. Cassavant and M. Singhal, editors, Readings in Distributed Computing Systems, pages 409-423. IEEE, 1994.

[14] E. W. Dijkstra. “Solution of a Problem in Concurrent Programming Control”. Communications of the ACM, page 569, Sep 1965.

[15] E. W. Dijkstra. “Cooperating Sequential Processes”. In F. Genuys, editor, Programming Languages, pages 43-112. Academic Press, London, 1968.

[16] A. Dinning. “A Survey of Synchronization Methods for Parallel Computers”. IEEE Computer, 22(7):66-77, July 1989.

[17] K. P. Eswaran, J. N. Gray, et al. “The Notions of Consistency and Predicate Locks in a Database System”. Communications of the ACM, 19(11):624-633, Nov 1976.

[18] D. Lenoski et al. “The Stanford Dash Multiprocessor”. IEEE Computer, 25(3):63-79, March 1992.

[19] J. Kuskin et al. “The Stanford FLASH Multiprocessor”. In International Symp. on Computer Architecture, pages 302-313, April 1994.

[20] B. Fleisch. “Distributed System V IPC in Locus: A Design and Implementation Retrospective”. In Sigcomm ’86 Symposium, pages 386-396, 1986.

[21] B. Fleisch, R. Hyde, and N. Juul. “Mirage+: A Kernel Implementation of Distributed Shared Memory on a Network of Personal Computers”. Software - Practice and Experience, 1994. To appear.

[22] B. Fleisch and G. Popek. “Mirage: A Coherent Distributed Shared Memory Design”. In Proceedings of the Eleventh ACM Symp. on Operating System Principles, pages 211-223, December 1989.

[23] Robert Fowler. “Decentralized Object Finding Using Forwarding Addresses”. PhD thesis, University of Washington, 1985.

[24] G. Hermannsson and L. Wittie. “Optimistic Synchronization in Distributed Shared Memory”. In International Conference on Distributed Computing Systems, 1994.

[25] C. E. Hewitt and R. R. Atkinson. “Specification and Proof Techniques for Serializers”. IEEE Transactions on Software Engineering, pages 10-23, Jan 1979.

[26] C. A. R. Hoare. “Monitors: An Operating System Structuring Concept”. Communications of the ACM, 17(10):549-557, October 1974.

[27] Y. Huang and C. Kintala. “Software Implemented Fault Tolerance Technologies and Experience”. In Intl. Symposium on Fault-Tolerant Computing, pages 2-9, Toulouse, France, June 1993.

[28] Anita K. Jones and Peter Schwartz. “Experience Using Multiprocessor Systems - A Status Report”. ACM Computing Surveys, 12(2), 1980.

[29] David J. Kuck, Edward S. Davidson, Duncan H. Lawrie, and Ahmed H. Sameh. “Parallel Supercomputing Today and the Cedar Approach”. Science, pages 967-974, February 1986.

[30] L. Lamport. “Time, Clocks, and the Ordering of Events in a Distributed System”. Communications of the ACM, July 1978.

[31] K. Li and P. Hudak. “Memory Coherence in Shared Virtual Memory Systems”. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.

[32] K. Li, J. Naughton, and J. Plank. “Checkpointing Multicomputer Applications”. In Symposium on Reliable Distributed Systems, pages 2-11, Pisa, Italy, Oct 1991.

[33] K. Li and R. Schaefer. “A Hypercube Shared Virtual Memory System”. In International Conference on Parallel Processing, pages 1125-1132, 1989.

[34] Kai Li. “Shared Virtual Memory on Loosely Coupled Multiprocessors”. PhD thesis, Yale University, September 1986.

[35] V. Lo. “Operating Systems Enhancements for Distributed Shared Memory”. In M. Yovits, editor, Advances in Computers, pages 191-237. Academic Press, 1994.

[36] M. Maekawa. “A √N Algorithm for Mutual Exclusion in Decentralized Systems”. ACM Transactions on Computer Systems, May 1985.

[37] J. Mellor-Crummey and M. Scott. “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors”. ACM Trans. on Computer Systems, pages 21-65, Feb 1991.

[38] R. Minnich and D. Farber. “The Mether System: Distributed Shared Memory for SunOS 4.0”. In Proceedings of the Summer 1989 Usenix Conference, pages 51-60, Summer 1989.

[39] M. Naimi and M. Tréhel. “An Improvement of the log(n) Distributed Mutual Exclusion Algorithm”. In Proc. of the 7th International Conference on Distributed Computing Systems, 1987.

[40] B. Nitzberg and V. Lo. “Distributed Shared Memory: A Survey of Issues and Algorithms”. IEEE Computer, pages 52-60, August 1991.

[41] M. Ramachandran and M. Singhal. “Distributed Semaphores”. Technical Report OSU-CISRC-6/94-TR34, The Ohio State University Computer and Information Science Research Center, June 1994.

[42] M. Ramachandran and M. Singhal. “Decentralized Semaphore Support in a Virtual Shared Memory System”. Journal of Supercomputing, 1995. To appear.

[43] U. Ramachandran, M. Ahamad, and Y. Khalidi. “Coherence of Distributed Shared Memory: Unifying Synchronization and Data Transfer”. In International Conference on Parallel Processing, pages 160-169, Aug 1989.

[44] U. Ramachandran and Y. Khalidi. “An Implementation of Distributed Shared Memory”. Software - Practice and Experience, 21(5):443-464, May 1991.

[45] D. Reed and R. Kanodia. “Synchronization with Eventcounts and Sequencers”. Communications of the ACM, Feb 1979.

[46] G. Ricart and A. K. Agrawala. “An Optimal Algorithm for Mutual Exclusion in Computer Networks”. Communications of the ACM, Jan 1981.

[47] Herb Schwetman. “Users’ Guide (for use with CSIM Rev. 16)”. Microelectronics and Computer Technology Corporation, 3500 West Balcones Center Drive, Austin, TX 78759.

[48] M. Singhal. “A Heuristically-Aided Algorithm for Mutual Exclusion in Dis­ tributed Systems”. IEEE Transactions on Computers, 38(5), May 1989.

[49] M. Singhal. “A Taxonomy of Distributed Mutual Exclusion”. Journal of Parallel and Distributed Computing, May 1993.

[50] M. Singhal and N. Shivaratri. Advanced Concepts in Operating Systems. McGraw-Hill, 1994.

[51] I. Suzuki and T. Kasami. “A Distributed Mutual Exclusion Algorithm”. ACM Transactions on Computer Systems, Nov 1985.

[52] Ming-Chit Tam, Jonathan M. Smith, and David J. Farber. “A Taxonomy-Based Comparison of Several Distributed Shared Memory Systems”. Operating Systems Review, 24(3):40-67, July 1990.

[53] V. Tam and M. Hsu. “Fast Recovery in Distributed Shared Virtual Memory Systems”. In Proceedings of the International Conference on Distributed Computing Systems, pages 38-45, 1990.

[54] P. Tang and P.-C. Yew. “Processor Self-scheduling for Multiple-nested Parallel Loops”. In International Conference on Parallel Processing, pages 528-535, 1986.