OO Multi-Threaded Servers: Design with Reusable Components, Performance Measurements and Analysis

Gurudas Somadder, B.E.

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Engineering

Ottawa-Carleton Institute for Electrical Engineering, Faculty of Engineering, Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada

February 12, 1997

© 1997, Gurudas Somadder


Abstract

The first object of this research was to develop an efficient approach to build multi-threaded servers by using reusable components provided in the object-oriented framework Adaptive Communication Environment (ACE) described in [Sch94Apr1]. The thesis presents the pattern-based design, implementation and performance measurements of different alternatives of multi-threaded servers for distributed applications. The multi-threaded approaches are compared based on their performance under different loads and system architectures, their execution overheads and consumption of shared system resources (such as memory and I/O ports). It is shown that for certain system architectures, server multi-threading can improve the overall performance of the system by allowing for a more efficient use of the hardware resources. The impact of assigning priorities to the service requests (as opposed to the clients) and a rationale for establishing such priorities depending on the workload characteristics are also discussed. Another objective of the research was to find a way to exploit the design patterns used in object-oriented systems for performance analysis. Design patterns describe recurring solutions to common problems from various application domains, and provide a more abstract view of the system behavior, as they concentrate on the main interactions between the system components. It is shown that patterns can guide developers in choosing what to measure, in instrumenting the code for measurements and in interpreting the results. It is emphasized that an abstract behavioral model, such as the one provided by design patterns, is especially useful in systems built with reusable components, both for understanding what the reused software does and its contribution to the overall performance.

Acknowledgments

I would like to express my deep gratitude to my supervisor, Dr. D. C. Petriu, for her valuable guidance, advice and encouragement throughout this research.

I would also like to thank Dr. C.M. Woodside and Dr. J. Rolia of the Real-Time and Distributed Systems (RADS) Group of the Department of Systems and Computer Engineering for their support and advice during the research.

Many thanks to my colleagues, Istabrak, Fahim, Hesham, Marc, Alex, Cheryl and Leslie for their support, advice, help and encouragement during the course of this thesis.

The financial assistance of the Telecommunications Research Institute of Ontario (TRIO), through its operating grant program, is gratefully acknowledged.

Finally I would like to thank my parents, sisters and last but not the least, my brothers-in-law, for their invaluable support, patience, understanding and encouragement, which made this thesis possible.

Contents

1 Introduction

2 Threads and Threading Models
  2.1 Introduction ... 10
  2.2 Solaris 2.5 Multi-threading Architecture ... 12
    2.2.1 Processes and Threads ... 13
    2.2.2 Solaris Multi-threaded Architecture ... 14
  2.3 Threading Models ... 17
  2.4 Thread Priorities and Thread-based Measurements ... 18
    2.4.1 Priorities ... 18
    2.4.2 Thread-based Measurements ... 19
  2.5 Benefits/Drawbacks of Thread-based Concurrent Programming ... 20
    2.5.1 Benefits of using Thread-based Concurrent Programming ... 20
    2.5.2 Drawbacks of using Thread-based Concurrent Programming ... 21

3 The Adaptive Communication Environment ... 24
  3.1 Introduction
  3.2 The ACE Toolkit
    3.2.1 C++ Wrappers
    3.2.2 Class Categories and Frameworks
  3.3 Design Patterns: Concept and Use
    3.3.1 The Reactor Pattern
    3.3.2 The Acceptor Pattern
    3.3.3 The Connector Pattern
    3.3.4 The Active Object Pattern

4 Client/Server Design and Implementation
  4.1 Introduction
  4.2 Client/Server Architecture from an ACE Viewpoint
    4.2.1 The Client
    4.2.2 The Server
      4.2.2.1 Single-Threaded Server
      4.2.2.2 The Thread-per-Request Server
      4.2.2.3 The Thread-per-Client Server
      4.2.2.4 The Thread-Pool Server
  4.3 System Architecture
    4.3.1 Single-Threaded C/S Architecture
    4.3.2 Thread-per-Request C/S Architecture
    4.3.3 Thread-per-Client C/S Architecture
    4.3.4 Thread-Pool C/S Architecture ... 64
  4.4 A Use Case Map Representation of the System Execution Cycle ... 64
  4.5 Systems with Layered Servers ... 66

5 Experimental Setup and Measurements ... 68
  5.1 Introduction ... 68
  5.2 Experimental Setup ... 69
    5.2.1 Experiment Controller and Data Logger ... 70
    5.2.2 DECALS Architecture ... 71
  5.3 Measurement Results and Performance Analysis ... 72
  5.4 Pattern-guided Measurements and Analysis ... 79
    5.4.1 Frequently executed patterns ... 80
      5.4.1.1 Reactor Pattern ... 80
      5.4.1.2 Active Object Pattern
    5.4.2 Detailed Measurement Results and Performance Analysis

  Comparison between the Thread-per-Client and the Thread-Pool Server Models
  6.3 Service Priority ... 99
  6.4 Experimental Setup ... 100
  6.5 Measurement Results and Analysis for Service Priority ... 101
    6.5.1 Measurement Results for the Pure Server Architecture ... 102
      6.5.1.1 Effect of Changing Service Time Ratios ... 106
      6.5.1.2 Impact of Changing the Number of Threads in the Thread-Pool
    6.5.2 Measurement Results for the Layered Server Architecture ... 109
  6.6 Measurement Results and Analysis for Class Priority ... 112

7 Conclusions ... 115
  7.1 Summary ... 115
  7.2 Future Work ... 117

Appendix A Notations for Use Case Maps and UML ... 118
  A.1 Use Case Map Notations ... 118
  A.2 UML Notation ... 118

Appendix B Batch Script for Conducting Experiments ... 120
  B.1 Overview ... 120
  B.2 Batch Scripts ... 120
  B.3 Logging Data ... 125
  B.4 Overview of Data Gathering ... 125

References ... 126

List of Figures

Fig. 1 Solaris Multi-Threading Architecture
Fig. 2 Structure of the Adaptive Communication Environment Toolkit
Fig. 3 ACE Class Categories
Fig. 4 Participants in the Reactor Pattern
Fig. 5 Participants in the Acceptor Pattern
Fig. 6 Participants in the Connector Pattern
Fig. 7 Participants in the Active Object Pattern
Fig. 8 Active/Passive Connection Roles
Fig. 9 A High Level Hardware View of the System
Fig. 10 Class Components for the Client
Fig. 11 Interaction Diagram for the Client
Fig. 12 Class Components for the Single-Threaded Server
Fig. 13 Interaction Diagram for the Single-Threaded Server
Fig. 14 Interaction Diagram for the Thread-per-Request Server
Fig. 15 Interaction Diagram for the Thread-per-Client Server
Fig. 16 Interaction Diagram for the Thread-Pool Server
Fig. 17 Single-Threaded C/S Architecture
Fig. 18 Thread-per-Client C/S Architecture
Fig. 19 Thread-Pool C/S Architecture
Fig. 20 Use Case Map Representation of the System for a Single-Threaded Server
Fig. 21 Pure/Layered Server Architecture
Fig. 22 Experimental Setup
Fig. 23 Measurement Results for the Pure Server Architecture
Fig. 24 Server Utilizations for the Pure Server Architecture
Fig. 25 Measurement Results for the Layered Server Architecture
Fig. 26 Mid-level Server Utilization for the Single-Threaded Server
Fig. 27 Mid-level Server Utilization for the Thread-Pool Server
Fig. 28 Execution Sequence Chart for the Reactor Pattern
Fig. 29 Execution Sequence Chart for the Active Object Pattern
Fig. 30 Measurement Results for the handle_events() loop
Fig. 31 Measurement Results for the register_handler()/remove_handler() methods
Fig. 32 Measurement Results for the insert()/remove() methods
Fig. 33 Memory Consumption due to Threads
Fig. 34 Effect of Different Service Times and Message Sizes
Fig. 35 Heap Implementation of the Priority Queue
Fig. 36 Measurement Results for the Pure Server Arch. (Low Priority, Low Service Times)
Fig. 37
Fig. 38
Fig. 39

List of Symbols and Terminology

ACE  Adaptive Communication Environment
API  Application Programming Interface
ASX  Adaptive Service Executive
COOTS  Conference on Object-Oriented Technologies and Systems
C/S  Client/Server
DECALS  Distributed Experiment Control and Logging System
EuroPLoP  European Conference on Pattern Languages of Programming
HWP  Heavyweight Process
ICDP  International Conference on Distributed Platforms
ICODP  International Conference on Open Distributed Processing
IP  Internet Protocol
IPC  Inter-process Communication
LWP  Lightweight Process
MT  Multi-Threaded
OO  Object-Oriented
OS  Operating System
RPC  Remote Procedure Call
SAP  Service Access Point
ST  Single-threaded
TPC  Thread-per-client
TPOOL  Thread-Pool
TPR  Thread-per-request
UML  Unified Modeling Language


Chapter 1

Introduction

1.1 Research Background

A distributed system is a combination of hardware, software, and network components in which the software components execute on two or more computers (also known as nodes) and communicate via a network. A distributed system possesses certain unique advantages over computing with multi-user systems (typically mainframes and minicomputers) and single-user systems (typically personal computers and workstations). It provides the advantages of a multi-user system that are lacking in a single-user system, like resource sharing, widespread access to data, centralized services, multi-user applications etc. It preserves the advantages of single-user systems that are lost in multi-user systems, like price to performance ratio, standard software packages, ability to use current computing platforms etc. It also has some inherent advantages like flexibility (in terms of location of hardware, mobility, reconfigurability etc.), scalability (in terms of additional components that can be added), efficiency (due to the fact that they consist of heterogeneous components which can be so selected that they are optimal for the task at hand) and availability (as critical components can be duplicated and thus the system can be made immune to single points of failure).

In order to take advantage of distributed computing, a new paradigm was developed called the Client/Server (C/S) paradigm, which involved distributing applications onto different nodes in a network, in which certain applications (the clients) would generate requests for service from other applications (the servers) which would service these requests.

Although experience has shown [Sch95Jan, Sch95Feb, Dilley95] that distributed computing, based on a C/S paradigm, can indeed offer these benefits when applied properly, developing distributed applications whose components collaborate efficiently, reliably, transparently and scalably is a complex task. Much of this complexity arises from the limitations of conventional tools and techniques used to develop distributed application software. For example, most standard network programming mechanisms (such as BSD sockets and Windows NT named pipes) lack typesafe, portable, reentrant and extensible interfaces. A typical example of this is the weakly-typed I/O handles used by sockets and named pipes to identify endpoints of communication. These handles increase the potential for subtle run-time errors since compilers cannot detect type mismatches at compile-time.

Another factor that makes distributed systems more difficult to implement is their involvement of multiple technologies that are the result of the partitioning of computer technology into specialized areas. Each area deals with solving the problems of its own domain using concepts and terminology that have been developed to serve its specialized needs. Although this would have been fine in a typical monolithic system that can be built using tools from just one of these specialties, distributed systems invariably involve several of them. It is often difficult even to understand how a concept from one field works with a concept from another. It is also difficult to design a system that mandates an interface between two technologies when you cannot determine if they are even intended to be compatible. These, and other such problems arising due to distributed computing, are most apparent in the area of computer communication, which involves communication between heterogeneous nodes in a network, each with its own individual data formats and communication protocols. In order to overcome these difficulties, middleware standards and products have appeared to facilitate distribution, like the Distributed Computing Environment (DCE), proposed by the Open Software Foundation (OSF) group [Lockhart94], and the Common Object Request Broker Architecture (CORBA), proposed by the Object Management Group (OMG) [Siegel96].

However, although the middleware products based on these standards comprise a wide variety of tools and services, they are quite often heavyweight and slow in nature [Sch95Jun]. Due to this fact, there was felt a need for lightweight libraries of reusable components which can be used as building blocks for distributed applications. An example of such a product is the Adaptive Communication Environment (ACE), which is used in this thesis [Sch94Apr1].

1.1.1 OO Technology for Distributed Systems

OO technology provides distributed computing with many of the same benefits (such as encapsulation, interface inheritance and reuse, parameterized types, object-based exception handling, portability and extensibility) as it does for non-distributed computing.

Encapsulation promotes the separation of interface from implementation, which is crucial for developing highly extensible architectures that decouple reusable application-independent mechanisms from application-specific policies. Interface inheritance and parameterized types promote reuse and emphasize commonality in a design. Object-based exception handling often simplifies program logic by decoupling error-handling code from normal application processing. In addition, the inherently decentralized nature of OO computing provides a natural way to design distributed applications, which are themselves based on a decentralized architecture. For example, most distributed applications do not have direct access to global resources, as a result of which they interoperate by passing messages. This message passing mechanism is very similar to method invocation on an object in OO programming. Thus, OO technology is ideally suited to the development of extensible, robust, portable and efficient distributed applications.

Even with OO technology, developing communication software that is reusable across OS platforms is hard. Constraints imposed by the underlying OS platforms may make it impractical to reuse existing algorithms, detailed designs, interfaces and implementations directly. However, most solutions to fundamental problems in communication software follow certain basic patterns. Such patterns, known as design patterns [Gam94, Sch96, Coplien94, Gilbert95], concentrate on the key collaborations between participants in a software architecture, instead of overwhelming the developer with details. Design patterns capture the static and dynamic structures of solutions that occur repeatedly when developing applications in a particular context, help in experience reuse at the architectural level, and thus ease development efforts and costs. Due to these benefits, developers are turning more and more to an OO design pattern based approach for the design of communication software.

1.2 Research Motivation

Most distributed applications, based on a C/S paradigm, have an inherently concurrent nature, as more than one service request from different clients may arrive at the servers at the same time. For example, some distributed applications use software servers like name servers, network file servers etc. They can benefit from using a concurrent model of execution to perform their functions, as queuing delays caused by contention between requests for services are reduced when several requests can be processed at the same time.

Although concurrency can be provided by using multiprocessor platforms, the complexity and cost of such applications usually makes it infeasible. Instead, many uni-processor applications rely on multi-threading to provide concurrent services. A thread is an independent series of instructions executed concurrently with other threads within a single process address space. Multi-threaded server applications handle concurrent requests by allowing each thread to deal with one request at a time, in parallel with other threads. This not only simplifies the code and improves its understanding, but also, for some system architectures, improves the overall performance of the system. Various communication middlewares (like DCE and CORBA implementations) make use of multi-threading to provide concurrency. However, different implementation techniques introduce different overheads and consume different system resources. This thesis compares several approaches for implementing multi-threaded servers in a distributed environment using the thread-encapsulation library of the Adaptive Communication Environment (ACE) reusable component object-oriented toolkit [Sch94Apr1], and identifies cases where multi-threading yields performance dividends.


1.3 Research Goals

The main goal of the thesis is to analyze and compare different approaches for building multi-threaded servers in a distributed environment by using the reusable components contained in ACE. The conclusions of such an analysis may be used by future developers of distributed applications.

The analysis and comparison takes into account several factors, such as the overall system response times and throughputs obtained under different workloads; the execution overheads of various threading models and their causes; and the use of other system resources (such as memory, I/O ports etc.). This thesis also takes the relationship between design patterns and performance a step further, promoting the idea that patterns can be used not only to document and explain reusable components, but also to guide the performance measurements and analysis of such components and of the systems incorporating them. The thesis also assesses the role played by OO design patterns in building distributed systems with reusable components and in analysing the performance of such systems. This research also studies the impact of assigning priorities to service requests, and comes up with a rationale for assigning priorities to client requests [Li95], depending on the workload characteristics, for different server architectures.

1.4 Contributions and Publications

Research contributions of the thesis:

Experiment in software reuse: Designed and implemented several multi-threaded servers based on the ACE reusable component toolkit. Built distributed test beds for layered C/S systems.


Developed a technique to automate system measurements (like CPU usage, memory consumption, etc.) on a per-thread and per-process basis, based on UNIX batch scripts. These scripts set up, launch, control and clean up after the different sets of distributed experiments. They change the different workload parameters (like number of threads, service times, think times, number of clients etc.) on the fly for each set of experiments, collect statistics and process them by computing averages and confidence intervals.

Compared several approaches for implementing multi-threaded servers in the ACE environment and identified cases where multi-threading yields performance dividends. The results are presented in the technical report SCE-97-02 (also accepted at the International Conference on Open Distributed Processing, ICODP/ICDP, 1997).

Used design patterns to identify key interactions between components in the software architecture, and thus guide performance measurements. This thesis emphasizes that an abstract model, such as provided by design patterns, is especially useful when a system is built with reusable components, both for understanding what the reused software does, and what is its contribution to system performance. This idea is presented in the technical report SCE-97-03 (also submitted to the Conference on Object Oriented Technologies and Systems, COOTS, 1997).

Studied the impact of assigning priorities to service requests, and put forward a rationale for assigning priorities to service requests, depending on workload characteristics.

Publications issued from the thesis are:

G. Somadder and D.C. Petriu, "Performance Measurements of Multi-Threaded Servers in a Distributed Environment," Report SCE-97-02, Dept. of Systems and Computer Eng., Carleton University - accepted at ICODP/ICDP'97, Toronto, Canada, May 1997.

G. Somadder and D.C. Petriu, "Pattern-guided Performance Measurements and Analysis," Report SCE-97-03, Dept. of Systems and Computer Eng., Carleton University - submitted to the USENIX COOTS'97, Portland, U.S.A., June 1997.

G. Somadder and D.C. Petriu, "Performance Analysis Patterns for Multi-Threaded Servers," - in progress for submission to the European Conference on Pattern Languages of Programming, EuroPLoP, June 1997.

1.5 Thesis Overview

Chapter 2 provides an overview of concurrent programming and multi-threading, defines key terminology, highlights the salient points of the various alternative mechanisms available for concurrent programming and discusses the measurement techniques used in collecting data for the case studies.

Chapter 3 has two parts. The first part describes the structure and functionality of the ACE toolkit, while the second part presents the concept of design patterns, and discusses some of the ACE design patterns relevant to the thesis.

Chapter 4 provides an in-depth discussion of the design and implementation of the C/S system implemented and used for measurement purposes.

Chapter 5 has two parts: the first describes the experimental setup used for conducting performance measurements, and the second presents a detailed analysis of the results obtained, first from a system level viewpoint, and then from a pattern-guided viewpoint.


Chapter 6 studies the design and implementation of the priority software servers, and presents a detailed analysis of the results obtained. It comes up with a rationale for assigning priorities to client requests depending on the workload characteristics for different server architectures.

Finally, Chapter 7 presents conclusions and directions for future research.


Chapter 2

Threads And Threading Models

This chapter provides an overview of concurrent programming and multi-threading, defines key terminology, highlights the salient points of the various alternative mechanisms available for concurrent programming and discusses the measurement techniques used in collecting data for the case studies. It concentrates only on the Solaris 2.5 operating system because the ACE toolkit (discussed in Chapter 3) used in this thesis was compiled on Solaris 2.5.

2.1 Introduction

Distributed applications are designed to take advantage of the connectivity, inter-networking, parallel processing, replication, extensibility and cost effectiveness offered by distributed computing. Some distributed applications use software servers like name servers, network file servers etc. These applications are difficult to develop using single threaded processes. For example, a single-threaded network file server cannot block for extended time periods handling one client request, since the responsiveness for other clients would suffer. Such applications can benefit from using a concurrent model of execution to perform their functions, as queuing delays caused by contention between requests for services are reduced when several requests can be processed at the same time. In order to provide concurrent services, various workarounds have been developed, some of which are outlined below:

Event demultiplexer and dispatcher - This technique is widely used to manage multiple input devices in single-threaded frameworks. The main event demultiplexer/dispatcher detects an incoming event, demultiplexes the event to the appropriate event handler, and then dispatches an application-specific callback method associated with the event handler. The primary drawback with this approach is that long duration services drastically degrade the responsiveness of the system.
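To make the demultiplexing/dispatching cycle concrete, the following is a minimal sketch of such a loop built directly on the select() system call. It is an illustration only, not code from this thesis or from ACE: the Event_Handler interface, the handler table and the handle_input() callback are hypothetical names.

    // Minimal single-threaded event demultiplexer/dispatcher built on
    // select(). Event_Handler, the handler table and handle_input() are
    // hypothetical illustrations.
    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/select.h>

    class Event_Handler {
    public:
      virtual ~Event_Handler() {}
      // Application-specific callback, invoked when fd becomes readable.
      virtual void handle_input(int fd) = 0;
    };

    void event_loop(Event_Handler *handlers[FD_SETSIZE]) {
      for (;;) {
        fd_set read_fds;
        FD_ZERO(&read_fds);
        int max_fd = -1;
        // Build the wait set from the registered handlers.
        for (int fd = 0; fd < FD_SETSIZE; ++fd)
          if (handlers[fd]) {
            FD_SET(fd, &read_fds);
            if (fd > max_fd) max_fd = fd;
          }
        // Demultiplex: block until at least one handle has an event.
        if (select(max_fd + 1, &read_fds, 0, 0, 0) <= 0)
          continue;
        // Dispatch the callback of every ready handle. A long-running
        // callback stalls all other clients, the drawback noted above.
        for (int fd = 0; fd <= max_fd; ++fd)
          if (handlers[fd] && FD_ISSET(fd, &read_fds))
            handlers[fd]->handle_input(fd);
      }
    }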

User-level co-routines - This technique enables tasks to voluntarily suspend their own execution until another co-routine resumes them at a later point. It involves developing non-preemptive, user-level co-routine packages that explicitly save and restore context information. An example of this system is the multi-tasking mechanisms available on Windows 3.1 systems. However, co-routines are difficult to develop on account of their complexity (due to task preemption). It is also difficult to program co-routines correctly, since developers must modify their programming style to avoid certain operating system (OS) features (such as asynchronous signals). Also, they are mostly suitable for short duration tasks, as otherwise they would lead to significant queuing delays.

Multi-processing - This technique tries to alleviate the complexity of designing single-threaded applications meant to handle concurrent requests, by making use of the coarse grained multi-processing capabilities provided by the UNIX fork and exec system calls. Fork spawns a separate child process that executes a task concurrently with its parent. Inter-process communication (IPC) is achieved by using IPC mechanisms such as shared memory and memory-mapped files. However, the overhead and inflexibility of creating and using processes using fork and exec makes dynamic process invocation prohibitively expensive and overly complicated for most applications. Moreover, it is difficult to exert fine-grain control over scheduling and process priority using fork and exec.
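As an illustration of the cost structure just described, here is a hedged sketch of a fork-per-request server; accept_request() and serve_request() are hypothetical application routines. Every iteration pays for a full duplication of the parent's address space, which is the overhead that makes this approach expensive.

    // Coarse-grained multi-processing: one child process per request.
    // accept_request() and serve_request() are hypothetical routines.
    #include <sys/types.h>
    #include <signal.h>
    #include <unistd.h>

    extern int accept_request(int listen_fd);  // hypothetical
    extern void serve_request(int fd);         // hypothetical

    void fork_per_request_server(int listen_fd) {
      signal(SIGCHLD, SIG_IGN);      // let exited children be reaped
      for (;;) {
        int fd = accept_request(listen_fd);
        pid_t pid = fork();          // duplicates the parent's address space
        if (pid == 0) {              // child: serve one request, then exit
          close(listen_fd);
          serve_request(fd);
          close(fd);
          _exit(0);
        }
        close(fd);                   // parent: keep accepting
      }
    }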

Multi-threading - Multi-threading mechanisms provide a more elegant, and sometimes more efficient, way to overcome the limitations of the traditional concurrent processing techniques described above.

Every thread deals with one request at a time, in parallel with the other threads. This simplifies the code and improves its understanding. An added bonus is that, for some system architectures, multi-threading services improves the overall performance of the system. Intuitively, multi-threading is a means to improve the performance of software servers by allowing concurrent processing of requests. Various communication middlewares (like DCE and CORBA implementations) make use of multi-threading to provide concurrency (it should be mentioned at this point that CORBA itself does not have threads, but its implementations can have, by incorporating them from standard packages). However, different implementation techniques introduce different overheads and consume different system resources. The material presented in the rest of the chapter discusses the strategies and tactics of concurrent programming techniques using threads, and describes different threading models which can be used to effectively implement multi-threading.

2.2 Solaris 2.5 Multi-threading Architecture

This section summarizes relevant background material on the multi-threading (MT) architecture of Solaris 2.5.

2.2.1 Processes and Threads


A traditional UNIX process is a collection of resources like virtual memory, I/O descriptors, a run-time stack, signal handlers, user and group ids, and access control tokens executing in a single thread of control. Such a process is also sometimes known as a heavy-weight process (HWP). A thread, on the other hand, is an independent series of instructions executed within a single process address space. In addition to its own instruction pointer, a thread contains other resources such as a run-time stack of function activation records, a set of general purpose registers, and thread-specific data. A multi-threaded process consists of one or more independently executing threads. These threads, often known as light-weight processes (LWP), maintain minimal state information, require less overhead to spawn and synchronize, and inter-communicate via shared memory rather than message passing. Conventional workstation operating systems (such as variants of UNIX and Windows NT) support the concurrent execution of multiple processes, each of which may contain one or more threads. A process serves as the unit of protection and resource allocation within a separate hardware protected address space. A thread serves as the unit of execution that runs within a process address space that is shared by zero or more threads.

2.2.2 Solaris Multi-threaded Architecture

A typical Solaris 2.5 multi-threading architecture [Pow91, POSIX] operates at two levels (kernel space and user space) and consists of the following components, as shown in Fig. 1:

CPUs - They execute user-level and kernel-level instructions. The semantics of the Solaris MT model are intended to work for both uni-processors and symmetrical multi-processors on shared memory hardware.

Kernel Threads - These are the fundamental entities that are scheduled and executed by the CPU in kernel space. Each kernel thread has a small data structure and stack, maintained by the OS kernel, and thus makes context switching between kernel threads relatively fast, as virtual memory information remains unchanged.

Lightweight Processes (LWPs) - These are associated with kernel threads, and can be thought of as a "virtual CPU", upon which application threads are scheduled and multiplexed by a user-level thread library. In Solaris 2.5, each UNIX process, instead of having a single thread of control, contains one or more LWPs, each of which is associated with a kernel thread. Application level threads are scheduled by the Solaris 2.5 kernel level scheduler using LWPs. Context switching between LWPs is relatively slow, compared to context switching between kernel level threads, as each LWP maintains a large amount of state (virtual memory address ranges, timers, register data etc.). LWPs are scheduled by the kernel onto the available CPU resources according to their class and priority. However, switching between LWPs in a process is still much faster than switching between processes.

[Fig. 1 Solaris Multi-Threading Architecture: application threads at the user level are multiplexed onto LWPs (many threads onto a single LWP, a group of threads onto a group of LWPs, or threads bound one-to-one to LWPs); each LWP is associated with a kernel thread, which executes on a CPU at the hardware level.]

Application Threads - Application threads are spawned by the application. Each application thread has a unique stack and register set, although it shares the process address space with other threads. Application threads are scheduled and multiplexed onto available LWPs by a user-level thread library. Within a process, each of these application threads executes independently, although not necessarily in parallel (depending on the hardware). Solaris 2.5 provides a multi-level concurrency model that permits application threads to be spawned and scheduled using one of the following two modes:

1. Bound - Each bound thread is mapped onto its corresponding LWP. If the underlying hardware consists of multiple CPUs, then programmers can take advantage of multi-processing by forcing independent tasks to execute in parallel on multiple CPUs. Thus, if two application threads are running on separate LWPs (and thus separate kernel threads), they may execute in parallel, provided that they are running on a multiprocessor or using asynchronous I/O. A kernel-level context switch is required to re-schedule bound threads. Also, OS kernel intervention is required for synchronization operations on bound threads. The number of bound threads is limited by the underlying kernel resources, as each bound thread requires the allocation of kernel resources.


2. Unbound - These are scheduled onto a pool of LWPs by a user-defined thread library which implements a non-preemptive cooperative multi-tasking concurrency model. Preemption would require saving the state of each thread preempted, in order for the thread to resume execution from the same state, which would lead to highly complex library designs with no obvious benefits due to the overheads involved. The library invokes LWPs as needed and assigns them to execute runnable threads (there may be a many-to-one relationship between unbound threads and runnable LWPs). After assuming the state of the thread, the LWP executes its instructions until completion, or till it gets blocked on a synchronization mechanism, at which time the thread library schedules another LWP to run. Unbound threads consume relatively less system resources and incur relatively lower overheads to spawn, context switch, and synchronize in comparison with bound threads. Also, since each unbound thread does not allocate kernel resources, it is possible to allocate a very large number of unbound threads without significantly degrading performance.
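The difference between the two modes is visible at thread-creation time. The sketch below, compiled against the Solaris threads library (linked with -lthread), spawns one unbound and one bound thread; the worker routine is a hypothetical placeholder and error handling is omitted.

    // Spawning unbound vs. bound threads with the Solaris threads API.
    #include <thread.h>

    extern "C" void *worker(void *arg);   // hypothetical thread body

    void spawn_examples() {
      thread_t tid;

      // Unbound (default): multiplexed onto the pool of LWPs by the
      // user-level threads library.
      thr_create(0, 0, worker, (void *)1, 0, &tid);

      // Bound: permanently attached to its own LWP, and therefore
      // visible to and scheduled by the kernel.
      thr_create(0, 0, worker, (void *)2, THR_BOUND, &tid);

      thr_join(0, 0, 0);   // wait for any thread to terminate
    }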

2.3 Threading Models

Multi-threaded servers are designed to handle multiple client requests simultaneously. They help not only in simplifying program design (as they allow multiple server tasks to proceed independently using conventional programming abstractions) but also in improving system performance by using the parallel processing capabilities of multi-processor hardware platforms and overlapping computation with communication. This section discusses some of the common models [Sch96Feb] adopted in the design of multi-threaded servers.


Thread-per-request - In this model, the server spawns off a thread for each request from a client, which means that the creational overhead for a thread is incurred for every request. In order to minimize consumption of OS resources (when many clients make requests simultaneously) and the overheads associated with spawning a thread, this model is useful only for long-duration requests from clients.

Thread-per-client - The server spawns off a thread for each client, which is exclusively associated with it for the entire period during which the client is connected to the server. This amortizes the cost of spawning a thread across multiple requests. It is useful in scenarios where the servers carry on long-duration conversations with multiple clients. Given sufficient OS resources, and multiple CPUs, this model can achieve the highest throughput. This model reduces to the thread-per-request model for scenarios in which the client makes only one request.

Thread pool - This model tries to alleviate the cost of spawning a thread by pre-spawning a pool of threads, whose number may be fixed, or changed dynamically as per requirements. It is useful for scenarios in which OS resources are limited, so that resource consumption is minimized. However, it has the overhead of queuing delays to gain access to threads, when the number of simultaneous client requests exceeds the number of threads in the pool, at which time all the threads are busy. This model requires the most effort by the programmer to function as intended (a sketch of this model is given at the end of this section).

Thread-per-object - In this model, each service in the server, associated with an object, is executed in a separate thread of control. This model is useful when the client requests are distributed evenly among the objects; otherwise objects receiving a higher percentage of requests will become a performance bottleneck.
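To make the thread-pool model above concrete, the following sketch uses the Solaris threads and synchronization primitives: a fixed number of pre-spawned workers block on a condition variable and pick requests off a shared queue. The queue, its fixed bound and the serve_request() routine are hypothetical simplifications, not the servers implemented in this thesis.

    // Thread-pool sketch: POOL_SIZE pre-spawned workers consume request
    // handles from a shared, mutex-protected queue. Queue overflow
    // handling and error checking are omitted for brevity.
    #include <thread.h>
    #include <synch.h>

    const int POOL_SIZE = 4;
    const int QUEUE_LEN = 64;

    static int queue[QUEUE_LEN];     // pending request handles
    static int head = 0, count = 0;
    static mutex_t lock;             // serializes queue access
    static cond_t  not_empty;

    extern "C" void serve_request(int fd);   // hypothetical

    extern "C" void *pool_worker(void *) {
      for (;;) {
        mutex_lock(&lock);
        while (count == 0)                  // sleep until work arrives
          cond_wait(&not_empty, &lock);
        int fd = queue[head];
        head = (head + 1) % QUEUE_LEN;
        --count;
        mutex_unlock(&lock);
        serve_request(fd);                  // done outside the lock
      }
      return 0;
    }

    void enqueue_request(int fd) {          // called by the acceptor
      mutex_lock(&lock);
      queue[(head + count) % QUEUE_LEN] = fd;
      ++count;
      cond_signal(&not_empty);              // wake one idle worker
      mutex_unlock(&lock);
    }

    void start_pool() {
      mutex_init(&lock, USYNC_THREAD, 0);
      cond_init(&not_empty, USYNC_THREAD, 0);
      for (int i = 0; i < POOL_SIZE; ++i)
        thr_create(0, 0, pool_worker, 0, 0, 0);
    }

When all workers are busy, new requests simply wait in the queue, which is exactly the queuing delay attributed to this model above.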

2.4 Thread Priorities and Thread-based Measurements

This section provides a brief overview of the different system calls and techniques used to set thread priorities and conduct measurements on a per-thread basis.

2.4.1 Thread Priorities

Solaris threads are implemented as a library, using underlying LWPs which are supported by the kernel. The threads library schedules threads on a pool of LWPs in the process, in much the same way as the kernel schedules LWPs on a pool of processors. LWPs in the system are scheduled by the kernel onto the available CPU resources according to their scheduling class and priority. Scheduling of threads is controlled through the Solaris threads library functions thr_setconcurrency(), thr_setprio(), and the THR_NEW_LWP option of thr_create() [Solaris]. Unbound thread scheduling uses simple priority levels with no adjustments and no kernel involvement. Thread scheduling regulates only how threads are assigned to LWPs. It has no effect on scheduling of LWPs by the kernel. The LWP's system priority is usually inherited from the creating process.

The Solaris 2.5 kernel has three classes of scheduling which are maintained for each LWP. Each scheduling class maps the priority of the LWP it is scheduling to an overall dispatching priority according to the configurable priority of the scheduling class. The three scheduling classes, in terms of descending order of priority, are: real time (RT), system, and timeshare (TS - the default scheduling class). When a process is created, its one initial LWP inherits the scheduling class and priority of the parent process. Unbound threads in a process have the same scheduling class and priority. On the other hand, bound threads have the scheduling class and priority of their underlying LWPs. Each bound thread in a process can have a unique scheduling class and priority that is visible to the kernel. Bound threads are scheduled with respect to all other LWPs in the system.

The scheduling class (i.e. RT or TS) for both bound and unbound threads can be set by the priocntl(2) system call. This system call is also used to change the priority of a bound thread, by changing the priority of the LWP to which it is bound. In order to dynamically change the priority of an unbound thread, the thr_setprio() system call can be used by passing the calling thread's identifier and the required priority. Currently, Solaris supports a priority level ranging from 0 to 127. The target thread will preempt lower priority threads, and will yield to higher priority threads in their contention for LWPs, not for CPUs.

Thread priorities set in this manner regulate access to LWPs, not CPUs, and hence are different from real-time priorities, which regulate and enforce access to CPU resources. A thread's priority set via these system calls is more like a hint in terms of guaranteed access to execution resources. The case studies performed on priority servers, in Chapter 6, use thr_setprio() to change the priorities of unbound threads only.
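As a small illustration of these calls, the sketch below raises the user-level priority of an unbound worker thread with thr_setprio(); the priority value is arbitrary and, as stated above, only a hint that regulates contention for LWPs.

    // Changing an unbound thread's user-level priority (0..127 on
    // Solaris 2.5). The value 100 is an arbitrary illustration.
    #include <thread.h>

    void prioritize(thread_t worker_tid) {
      thr_setprio(worker_tid, 100);      // hint: preferred access to LWPs

      int prio;
      thr_getprio(thr_self(), &prio);    // read back this thread's priority
    }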

2.4.2 Thread-based Measurements

One of the major problems in thread-based experiments is the collection of data like CPU usage etc. on a per-thread basis. This arises due to the fact that most system calls used for measuring various parameters, like getrusage() which is used for measuring the CPU usage, are on a per-process basis, and do not provide data on a per-thread basis. However, Solaris does provide system calls which can measure resource usage on a per-LWP basis (for e.g. this thesis uses the ioctl() system call, which is a catchall of all system calls under Solaris). Thus, in order to measure resource usage on a per-thread basis, there has to be a one to one relationship between the thread and the LWP, which in turn means that the only threads which can be used to conduct per-thread measurements have to be bound threads. In Chapter 5, which explains the behavior of multi-threaded servers from a design pattern perspective, the ioctl() system call has been used to obtain the process identifier for the servers, which in turn is fed back into the ioctl() function to obtain LWP identifiers. These identifiers are then used to calculate resource usage for the bound threads associated with them.
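The per-LWP measurement path can be sketched as follows. The ioctl codes (PIOCOPENLWP, PIOCUSAGE) and the prusage_t fields reflect one reading of the ioctl-based procfs(4) interface of Solaris 2.x and should be checked against <sys/procfs.h>; this is an assumption-laden sketch, not the instrumentation code used for the experiments.

    // Rough sketch: per-LWP CPU usage through the ioctl()-based /proc
    // interface of Solaris 2.x. Ioctl codes and structure fields are
    // assumptions to be verified against <sys/procfs.h>.
    #include <sys/types.h>
    #include <sys/procfs.h>
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    void print_lwp_usage(pid_t pid, id_t lwpid) {
      char path[32];
      sprintf(path, "/proc/%ld", (long)pid);
      int proc_fd = open(path, O_RDONLY);
      if (proc_fd < 0)
        return;

      // Obtain a descriptor that designates one specific LWP ...
      int lwp_fd = ioctl(proc_fd, PIOCOPENLWP, &lwpid);
      if (lwp_fd >= 0) {
        prusage_t usage;
        // ... and query its resource usage (CPU times, faults, etc.).
        if (ioctl(lwp_fd, PIOCUSAGE, &usage) == 0)
          printf("LWP %ld: user %lds, system %lds\n", (long)lwpid,
                 (long)usage.pr_utime.tv_sec, (long)usage.pr_stime.tv_sec);
        close(lwp_fd);
      }
      close(proc_fd);
    }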

2.5 Benefits/Drawbacks of Thread-based Concurrent Programming

When used correctly, multi-threading provides a more elegant and potentially more efficient means to overcome the limitations of the other concurrent processing techniques described in Section 2.1. Some of the common advantages and disadvantages of using multi-threaded applications are outlined below:

2.5.1 Benefits of using Thread-based Concurrent Programming

Creational Overhead - Spawning a thread consumes relatively fewer system resources as compared to forking off a new process, as it does not require duplicating the parent's address space memory or setting up new kernel data structures. In addition, it avoids using up a process slot in order to perform a subtask within a larger application.


Context Switching Overhead - As threads maintain minimal state information, context switching overhead is correspondingly reduced, as less state information must be stored and retrieved, in comparison with context switching between UNIX heavy-weight processes. Moreover, threads that run strictly at user-level do not incur any kernel-level context switching overhead at all, as they are managed by the user-defined thread library.

Synchronization Overhead - Thread synchronization is less expensive than process synchronization, as it may not be necessary to switch between kernel-mode and user-mode when scheduling and executing an application thread. Also, entities being synchronized are most often local ones, so they do not involve any kernel intervention.

Communication Overhead - Communication between separate threads usually takes place using shared memory, which is much faster than IPC message passing for inter-process communication, as it avoids the overhead of explicit data copying. A common example of this is cooperating database services that frequently reference common memory-resident data structures implemented using threads. In general, using the shared address space of a process to communicate between threads is easier and more efficient than using shared memory mechanisms to communicate between processes.

2.5.2 Drawbacks of using Thread-based Concurrent Programming

Concurrency Control Overhead - Regardless of the hardware platform, i.e. uni-processor or multi-processor, programmers must ensure that access to shared resources (such as files, database records, network devices, terminals, shared memory etc.) is serialized to prevent race conditions - an erroneous condition arising due to the execution of two or more concurrent LWPs. Race conditions may be eliminated by using the Solaris 2.5 synchronization mechanisms, which introduce additional overheads (like queuing delays for gaining access to shared resources), and increase the complexity of the applications.
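A minimal example of the serialization being described: a counter shared by several threads must be updated under a mutex, since an unprotected increment is a read-modify-write race. The lock is also where the queuing overhead mentioned above comes from.

    // Serializing access to shared state with a Solaris mutex.
    #include <synch.h>

    static long shared_count = 0;
    static mutex_t count_lock;      // initialize once with mutex_init()

    void record_request() {
      mutex_lock(&count_lock);      // threads queue here under contention
      ++shared_count;               // read-modify-write, now atomic
      mutex_unlock(&count_lock);
    }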

Stable Applications - Multi-threaded application robustness may be reduced due to the fact that separate threads execute within the same process address space and are not protected from each other (in order to reduce context switching and synchronization overhead), which might lead to one (or more) thread(s) inadvertently corrupting the address space of other threads, thereby leading to unpredictable system behavior. It requires a great deal of programming effort to develop stable complex multi-threaded applications.

Security - Since all threads in a process share the same userid and access privileges to files and other protected resources, it may not be possible to prevent accidental or intentional access to unauthorized resources. As a result of this, network services like Internet ftp and telnet, which base their security mechanisms on process ownership, are typically implemented in separate processes.

System Performance - A common misconception associated with threads is that multi-threading an application would automatically improve its performance. In many scenarios, as will be shown in the following chapters, multi-threading does not improve performance, especially for compute-bound applications running on a uni-processor workstation, as it is the CPU itself which saturates and becomes the system bottleneck.

To conclude, programming single threaded distributed applications meant to handle concurrency is difficult, especially for server applications. A single threaded application must be so designed that service requests are not starved, either by quickly handling all requests or by using heavyweight mechanisms like fork and exec. However, in practice, most non-trivial requests can't be served quickly enough to avoid starvation, nor can new processes be created such that system resource consumption is minimized, while at the same time keeping program design simple.

On the other hand, with multiple threads, each request can be serviced in its own thread, independent of other requests, thereby ensuring that clients are not starved. At the same time, system resource consumption is kept to a minimum, thereby freeing up resources for some other useful purposes. Thus, although multi-threaded systems are difficult to design and debug, their benefits often outweigh their drawbacks and help in creating simpler designs and implementations than single-threaded programming.

Subsequent chapters will introduce the concept of ACE design patterns and will explain and evaluate the design and implementation of the above multi-threaded server models based on these design patterns.


Chapter 3

The Adaptive Communication Environment

The Adaptive Communication Environment (ACE) toolkit was developed at the Department of Computer Science, Washington University, St. Louis, U.S.A., by Douglas C. Schmidt and his team [Sch94Apr1]. It implements a set of reusable C++ wrappers, classes and frameworks, based on a set of fundamental design patterns, that perform common communication tasks such as event demultiplexing, event handler dispatching, connection establishment, routing, dynamic configuration of application services and concurrency control.

The primary goal of ACE is to simplify the development of concurrent event-driven communication software by simplifying the use of OS mechanisms that provide interprocess communication (IPC), communication port demultiplexing, explicit dynamic linking, and concurrency.

This chapter provides an overview of the structure and functionality of the ACE toolkit, introduces the concept of design patterns, and discusses some of the ACE design patterns relevant to the thesis [Sch94Apr1, Sch94Apr2, Sch95, Sch95Jan, Sch95Sept].

3.1 Introduction

Developing distributed applications whose components collaborate efficiently, reliably, transparently, and scalably is a highly complex task, mainly due to the limitations of conventional tools and techniques used to develop these applications. For instance, many standard network programming mechanisms (such as BSD sockets and Windows NT named pipes) and reusable component libraries (such as Sun RPC) lack typesafe, portable, reentrant, and extensible interfaces. A typical example would be the weakly typed I/O handles used to identify communication endpoints by both sockets and named pipes. These handles increase the potential for subtle run-time errors since compilers can't detect type mismatches at compile time. In addition, since most distributed applications are developed using functional decomposition techniques, extending such applications becomes a highly complex task. One of the solutions to this problem is to shift to object-oriented technology, which provides distributed applications with benefits like encapsulation (which promotes the separation of interface from implementation, thereby decoupling reusable application-independent mechanisms from application-specific policies), reuse (which emphasizes commonality in a design), portability and extensibility (by providing wrappers around actual system calls, thereby enabling developers to program distributed applications using familiar techniques such as method calls on objects) etc.

The next section provides an overview of the structure of the object-oriented ACE toolkit and describes its functionality.

3.2 The ACE Toolkit

The ACE toolkit is designed using a layered architecture, providing C++ wrappers that encapsulate OS mechanisms, as well as higher level class categories and frameworks which implement strategic and tactical communication software design patterns (by extending the C++ wrappers).


3.2.1 C++ Wrappers

The ACE toolkit improves application robustness by encapsulating OS communication, threading, and virtual memory mechanisms within type-secure C++ wrappers, thereby reducing the need to access the underlying weakly-typed C library interfaces directly, which in turn allows the C++ compilers to detect type system violations at compile time.

[Fig. 2 Structure of the Adaptive Communication Environment Toolkit]

These C++ wrappers help to simplify the design and implementation of robust, compact, portable, and extensible communication software.

3.2.2 Class Categories and Frameworks

A framework is an integrated collection of components that collaborate to produce a reusable architecture for a family of related applications. Frameworks enable larger-scale reuse of software components than is otherwise obtained by reusing individual classes and stand-alone functions.

The lower level C++ wrappers, shown in Fig. 2, are grouped into class categories. These class categories consist of a collection of components that collaborate to provide a set of related interfaces and services [Sch93]. Some of the primary class categories supported by ACE, shown in Fig. 3 using the UML notation [Appendix A], include:

Reactor - This class category performs event demultiplexing and event handler dispatching by encapsulating OS event demultiplexing mechanisms (such as select and poll). The Reactor shields application developers from error-prone, low-level details associated with programming existing I/O demultiplexing system calls (such as setting and clearing bitmasks, handling time-outs and interrupts, and dispatching callback methods). To facilitate application portability, the Reactor provides the same interface regardless of the event demultiplexing mechanism used (such as select and poll), thus allowing developers to concentrate on higher-level application-related issues, rather than repeatedly wrestling with lower-level event demultiplexing details. Moreover, in order to work correctly in a multi-threaded event processing environment, the Reactor also contains mutual exclusion mechanisms designed to perform callback-style dispatching correctly and efficiently. In addition, the Reactor class category supports transparent extensibility by using inheritance, dynamic binding and parameterized types, which decouples the lower-level I/O demultiplexing and service dispatching mechanisms (such as detecting events on multiple I/O descriptors, handling timer expiration etc.) from the higher-level application processing policies (such as connection establishment, data transmission and reception, processing service requests from other participating hosts etc.).
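In application code, the division of labor looks roughly as follows. The class and method names (Event_Handler, Reactor, register_handler(), handle_events()) follow the ACE papers of this period, but header paths and exact signatures vary between ACE releases, so this is a sketch rather than compilable ACE code.

    // Sketch of Reactor usage: derive a handler, register it for READ
    // events, then drive the event loop. Names follow the ACE papers;
    // exact signatures are release-dependent.
    #include "Reactor.h"   // header name/path varies across ACE releases

    class Request_Handler : public Event_Handler {
    public:
      // Called back by the Reactor when this handle becomes readable.
      virtual int handle_input(int fd) {
        // ... read and process one service request ...
        return 0;
      }
    };

    void run_event_loop(Reactor &reactor, Request_Handler &handler) {
      reactor.register_handler(&handler, Event_Handler::READ_MASK);
      for (;;)
        reactor.handle_events();   // demultiplex and dispatch callbacks
    }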

[Fig. 3 ACE Class Categories]

IPC-SAP - The Inter-Process Communication Service Access Point (IPC-SAP) class category encapsulates the standard OS I/O handle-based local and remote IPC mechanisms that offer connection-oriented and connectionless protocols, such as the SOCK-SAP (which encapsulates the socket API), TLI-SAP (which encapsulates the TLI API), SPIPE-SAP (which encapsulates the UNIX SunOS 5.x STREAM pipe API) and the FIFO-SAP (which encapsulates the UNIX named pipe API). Each of these subclasses provides a well-defined interface to a subset of local or remote communication mechanisms which, when put together, comprise the overall functionality of a particular communication abstraction (such as the Internet-domain or UNIX-domain protocol families). Compared to stand-alone functions, these subclasses help to simplify network programming by shielding applications from error-prone details (by providing a type-secure C++ interface), combining several operations into one (for example, the SOCK-Acceptor class constructor performs the various socket system calls like socket, bind and listen), parameterizing IPC mechanisms into applications (C++ features such as default parameter values and templates help to develop applications, parameterized at compile time, to operate correctly over either a socket-based or TLI-based transport interface), and finally, by enhancing code sharing (inheritance-based hierarchical decomposition increases the amount of common code that is shared amongst the various IPC mechanisms).
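To make the preceding description concrete, the following is a minimal sketch of a one-shot echo service written against the SOCK-SAP wrappers. The ACE_SOCK_* class names follow later ACE releases and are assumptions here, so the thesis-era names may differ slightly.

// Hedged sketch: a one-shot echo service using the SOCK-SAP wrappers.
// The ACE_SOCK_* names follow later ACE releases (an assumption).
#include "ace/SOCK_Acceptor.h"
#include "ace/SOCK_Stream.h"
#include "ace/INET_Addr.h"
#include <cstdio>

int run_once_echo(unsigned short port) {
    ACE_INET_Addr addr(port);
    ACE_SOCK_Acceptor acceptor(addr);   // performs socket(), bind() and listen()
    ACE_SOCK_Stream peer;
    if (acceptor.accept(peer) == -1)    // wait for one client connection
        return -1;
    char buf[BUFSIZ];
    ssize_t n = peer.recv(buf, sizeof buf);
    if (n > 0)
        peer.send_n(buf, n);            // echo the data back through the wrapper
    peer.close();
    return acceptor.close();
}

Note how the acceptor's constructor combines several socket system calls into one type-safe operation, which is exactly the simplification described above.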

Connection - It encapsulates active/passive connection establishment mechanisms. Its two subclasses are the Connector (which actively initiates a connection to a communication endpoint) and the Acceptor (which passively establishes an endpoint of communication). These subclasses decouple the passive and active connection roles, once the connection has been established, from the services performed. They help to enable flexible strategies for executing network services concurrently (once a connection is established, peer applications use the connection to exchange data to perform some type of service), reusing existing initialization code for new services (as initialization strategies tend to remain unchanged, even though service characteristics may change), efficiently establishing connections with large numbers of peers (by using asynchrony to initiate and complete connections in non-blocking mode) and making connection establishment code portable across platforms containing different network programming interfaces (by parameterizing platform-dependent mechanisms for accepting and initiating connections).

Concurrency - This class category encapsulates the corresponding Solaris and POSIX Pthreads multi-threading and synchronization mechanisms. It automates the initialization of synchronization objects that appear as fields in C++ classes and also simplifies typical usage patterns for the threading and synchronization mechanisms. Each mutual exclusion wrapper subclass (encapsulating the mutex, condition, semaphore and reader/writer lock mechanisms) shares a common interface (i.e. acquire/release), but possesses different serialization and performance properties. ACE also provides a Thread-Manager class that contains a set of mechanisms to manage groups of threads that collaborate to implement collective actions (like creating, suspending and resuming threads).
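The acquire/release interface mentioned above is most often exercised through a scoped guard, as in the following minimal sketch; the ACE_Thread_Mutex and ACE_Guard names follow later ACE releases and are assumptions here.

// Hedged sketch of the mutual exclusion wrappers' acquire()/release()
// interface; ACE_Thread_Mutex and ACE_Guard are later-ACE names (assumed).
#include "ace/Thread_Mutex.h"
#include "ace/Guard_T.h"

static ACE_Thread_Mutex counter_lock;
static int request_counter = 0;

void count_request() {
    // The guard calls acquire() in its constructor and release() in its
    // destructor, so the lock is released on every exit path.
    ACE_Guard<ACE_Thread_Mutex> guard(counter_lock);
    ++request_counter;   // critical section protected by the wrapper
}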

Service Configurator - It is responsible for automatic configuration and reconfiguration by encapsulating explicit dynamic linking mechanisms (such as dlopen and dlsym). Dynamic linking, in contrast to static linking (which only allows object file binding at compile-time and/or static link-time), enables the addition and/or deletion of object files into the address space of a process at initial program invocation or at any point later during runtime. Although SunOS 4.x and 5.x support both implicit dynamic linking (used to implement shared object files or libraries) and explicit dynamic linking (which provides interfaces that allow applications to obtain, utilize, and/or remove the run-time address bindings of symbols defined in shared object files), the ACE Service Configurator class only supports the explicit dynamic linking mechanisms of SunOS. Using the other ACE components, the Service Configurator class extends the functionality of conventional daemon configuration and control frameworks (such as listen, inetd etc.) by providing automated support for static and dynamic configuration of concurrent, multi-service communication software, monitoring sets of communication ports for I/O activity, and dispatching incoming messages received on monitored ports to the appropriate application-specified services.
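The explicit dynamic linking mechanisms that the Service Configurator encapsulates can be sketched directly in terms of dlopen and dlsym; the service library path and the service_init entry-point symbol below are illustrative assumptions.

// Hedged sketch of explicit dynamic linking with dlopen()/dlsym(), the
// mechanisms encapsulated by the Service Configurator. The library name
// and the "service_init" symbol are hypothetical.
#include <dlfcn.h>
#include <cstdio>

typedef int (*service_init_t)(void);

int load_service(const char* path) {
    void* handle = dlopen(path, RTLD_NOW);      // add an object file at runtime
    if (handle == 0) {
        std::fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }
    // Obtain the run-time address binding of the service's entry point.
    void* sym = dlsym(handle, "service_init");
    if (sym == 0) {
        dlclose(handle);                        // remove the binding again
        return -1;
    }
    service_init_t init = reinterpret_cast<service_init_t>(sym);
    return init();                              // configure the new service
}

// Hypothetical usage: load_service("./libecho_service.so");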

Stream - This class category is the primary focus of the ACE toolkit. It contains the Adaptive Service eXecutive (ASX) framework [Sch94Apr2], which integrates the lower-level C++ wrapper components (like IPC-SAP) and higher-level class categories (like the Service Configurator). The ASX framework embodies, encapsulates, and implements key design patterns (design patterns help to enhance software quality by addressing fundamental challenges in large-scale development, as explained in the subsequent sections) that are commonly used to develop communication software.

Moreover, the ASX framework separates communication software development into two distinct categories: (i) application-independent - common to all (or most) communication software (such as event demultiplexing, queuing, port monitoring etc.), and (ii) application-specific - which depends on an individual application. This helps in reusing the C++ wrappers and frameworks provided by ACE, thereby allowing developers to concentrate on the key higher-level functional requirements and design concerns that constitute a particular application, instead of spending their time on reinventing solutions to commonly recurring tasks. It permits applications to consolidate one or more services into a single administrative unit, thereby simplifying development by performing common service initialization activities automatically, reducing OS resource consumption by spawning service handlers on demand, and allowing application services to be updated without modifying existing source code or terminating an executing dispatcher process (such as the inetd superserver). It allows run-time services to be performed by several different types of process and thread execution agents, and thus increases the range of application concurrency configuration alternatives available to developers (as service functionality is decoupled from the execution agent).

3.3 Design Patterns: Concept and Use

Christopher Alexander, an architect, developed the idea of a pattern language to enable people to design their own homes and communities [Alex77]. A pattern language is a set of patterns, each of which describes how to solve a particular kind of problem.

Alexander's pattern language, which starts by explaining from a world viewpoint, then moves to a nation view, then to smaller regions in a nation, until it finally reaches a room in a house, is supposed to be a document that non-architects can use to design their own communities and homes, without requiring any specialized training. It focuses on common design problems that non-architects encounter (like building bedrooms), rather than uncommon ones (like building cathedrals). Although Alexander talks in terms of pattern languages, it is not a formal language (like a context-free language), but is more like a structured essay. Thus, his term has been replaced by the more generic term pattern.

Any recurring solution to a design problem can be characterized by a specific pattern which captures the static and dynamic structures of its solution. Such a pattern is known as a design pattern [Sch95Aug, Gam94, Sch96]. Design patterns facilitate architectural-level reuse by providing blueprints that guide the definition, composition, and evaluation of key components in a software system. In general, a large amount of experience reuse is possible at the architectural level. Reuse of such patterns helps in developing better software architectures due to their emphasis on the strategic collaborations between key participants in a software architecture, without overwhelming developers with excessive details. If used properly, design patterns can significantly reduce and simplify development efforts.

These ideas have been transferred to the field of OO design. Objects are the design elements that form patterns; the patterns are discernible in classes of cooperating objects, linked by certain relationships, that are repeated in the solutions to a particular class of problems. Originally, the OO paradigm promoted component reuse at the system-component level by establishing libraries of reusable classes organized in hierarchies of inheritance (frameworks of classes). However, the focus is now more on the relationships between patterns in a system design, i.e. on architecture reuse rather than component reuse.

The subsequent sections describe, in detail, the structure, collaborations and working of the different participants in the ACE design patterns relevant to this thesis.

3.3.1 The Reactor Pattern

The Reactor is an object that decouples event demultiplexing and event handler dispatching from the services performed in response to events. It provides a portable interface to an integrated collection of extensible, reusable and type-secure C++ classes that encapsulate and enhance the OS select and poll I/O demultiplexing mechanisms [Sch95]. By decoupling the policies (which are specific to an application) from the mechanisms (which may be independent of an application, and thus reusable), a number of software quality factors, such as the reusability and extensibility of system components, are increased.


The Reactor pattern addresses a key issue in single-threaded communication software - multiplexing different types of events from multiple event sources within a single thread of control. This is achieved by providing a coarse-grained concurrency control that serializes application event handling within a process at the event demultiplexing level, thereby minimizing the need for more complicated synchronization or locking within an application.

Fig. 4 Participants in the Reactor Pattern

Fig. 4 shows the collaborations between the different participants in the Reactor pattern, using the UML notation [Appendix A]. The key participants are:

Reactor - The Reactor, which defines an interface for registering, removing and dispatching Event_Handler objects, provides a set of application-independent mechanisms which perform event demultiplexing and dispatching of application-specific event handlers in response to events.

Event Handler - It defines an interface used by the Reactor to dispatch callback methods defined by objects that are pre-registered to handle certain events.

Concrete Event Handler - It is responsible for implementing the callback methods that process events in an application-specific manner.

WORKING - A Reactor maintains a table of objects that are derived from the Event_Handler base class. Its interface provides public methods to register (register_handler()) and remove (remove_handler()) these objects from this table at runtime. The Reactor's dispatching mechanism is typically implemented as the main event loop of an event-driven application. Its dispatch() method may be implemented using an OS event demultiplexing system call (like select or poll). This method blocks on the OS event demultiplexing system call until one or more events occur. On occurrence of any such event(s), it returns from this demultiplexing call and dispatches the handle_event() method on any Event_Handler object(s) that are registered to handle these events. This callback method executes user-defined code and returns control to the Reactor when it completes.
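The dispatch cycle just described can be condensed into the following single-file sketch, which demultiplexes standard input with select(). The Reactor and Event_Handler skeletons mirror the pattern's participants; they are illustrative and not the actual ACE source.

// Hedged sketch of the Reactor pattern's register/dispatch cycle, built
// directly on select(). Illustrative only; not the ACE implementation.
#include <sys/select.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

class Event_Handler {
public:
    virtual ~Event_Handler() {}
    virtual int get_handle() const = 0;   // descriptor to demultiplex on
    virtual void handle_event() = 0;      // application-specific callback
};

class Reactor {
    std::vector<Event_Handler*> handlers_;
public:
    void register_handler(Event_Handler* h) { handlers_.push_back(h); }
    // One iteration of the event loop: block in select(), then dispatch.
    void handle_events() {
        fd_set fds;
        FD_ZERO(&fds);
        int maxfd = -1;
        for (Event_Handler* h : handlers_) {
            FD_SET(h->get_handle(), &fds);
            if (h->get_handle() > maxfd) maxfd = h->get_handle();
        }
        select(maxfd + 1, &fds, 0, 0, 0);  // OS demultiplexing call
        for (Event_Handler* h : handlers_)
            if (FD_ISSET(h->get_handle(), &fds))
                h->handle_event();         // callback into user-defined code
    }
};

class Stdin_Echo : public Event_Handler {
public:
    int get_handle() const { return STDIN_FILENO; }
    void handle_event() {
        char buf[128];
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        if (n > 0) fwrite(buf, 1, n, stdout);   // echo the input
    }
};

int main() {
    Reactor reactor;
    Stdin_Echo echo;
    reactor.register_handler(&echo);
    for (;;) reactor.handle_events();           // main event loop
}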

Due to this pattern, the application-independent mechanisms become reusable components whose main function is to demultiplex events and dispatch the appropriate callback methods defined by Event_Handlers, while the application-specific mechanisms provide a specific type of service. Moreover, it provides applications with a coarse-grained form of concurrency control by serializing the invocation of Event_Handlers at the level of event demultiplexing and dispatching within a process or thread, thereby minimizing the need for more complicated synchronization or locking mechanisms within an application.

3.3.2 The Acceptor Pattern

The Acceptor is an object that decouples passive connection establishment from the service performed once the connection has been established [Sch95Dec]. This decoupling of services enables the application-specific portion of a service to vary independently of the mechanism used to establish the connection, i.e. it enables the tasks performed by network services to evolve independently of the strategies used to passively initialize the services. By decoupling service initialization from service processing, this pattern enables the creation of reusable, extensible and efficient network services. This pattern is most often used when connection-oriented applications provide services whose behavior does not depend on the steps required to passively initialize a service, and when continuous polling (or blocking) mechanisms to check for concurrent connection requests are inefficient. This pattern leverages the Reactor pattern's Reactor to passively establish multiple connections within a single thread of control.

Fig. 5 Participants in the Acceptor Pattern

Reactor - As explained in Section 3.3.1, the Reactor is responsible for demultiplexing connection requests received on one or more communication endpoints to the appropriate Acceptor. It allows multiple Acceptors to listen for connections from peers within a single thread of control.

Acceptor - It implements the strategy for establishing connections with peers. It is parameterized by concrete types that conform to the interfaces of the formal template arguments SVC_HANDLER (which performs a service in conjunction with a connected peer) and PEER_ACCEPTOR (which is the underlying IPC mechanism used to passively establish the connection). The Acceptor's handle_event() method implements the strategy for initializing a Svc_Handler by passively connecting it with a peer. The Reactor performs a callback to this method automatically when a connection arrives for the Acceptor.

Svc_Handler - The Svc_Handler is a concrete type that defines a generic interface for a service. It inherits from the Event_Handler (shown in Fig. 4), which allows it to be dispatched by the Reactor when connection events occur. The Svc_Handler is parameterized by a PEER_STREAM endpoint, which is used by the Acceptor to associate it with its peer when a connection is established successfully. Implementors supply concrete types for these parameters to produce an instantiated Acceptor.

WORKING - Among the different key participants in this pattern, the Event_Handler and Reactor, which are reused from the Reactor pattern, encapsulate the OS event demultiplexing system calls (like select or poll), and provide the same functionality as described in Section 3.3.1. The Acceptor is responsible for creating a service handler (Svc_Handler) object, passively connecting this handler to its peer, and activating the handler once it is connected. The open() method of a Svc_Handler object is called by the Acceptor after a connection is established (this is a pure virtual function which has to be defined by a subclass, and which performs service-specific initializations). Since the Acceptor inherits from the Event_Handler class, the Reactor can automatically call back to the Acceptor's handle_event() method when a connection arrives from a peer. The open() method of the Acceptor is passed the local network address used to listen for connections. It forwards this address to the passive connection acceptance mechanism defined by the PEER_ACCEPTOR. This mechanism initializes the listener endpoint, which then advertises its Internet Protocol (IP) address and port number (i.e. Service Access Point) to clients interested in connecting with the Acceptor. After the listener endpoint has been initialized, the open() method registers itself with the Reactor, which then uses the get_handle() method to obtain the underlying I/O file descriptor (or handle). This handle is then used by the Reactor to detect and demultiplex incoming connections from clients in order to dispatch the Acceptor's handle_event() method, which implements the strategies for creating a new Svc_Handler, accepting a connection into it, and activating the service.
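In code, instantiating the pattern amounts to binding the SVC_HANDLER and PEER_ACCEPTOR template arguments, roughly as in the following sketch; the ACE_* names follow later ACE releases and the Echo_Handler service is an illustrative assumption, not the thesis code.

// Hedged sketch: binding SVC_HANDLER and PEER_ACCEPTOR to produce a
// concrete Acceptor. ACE_* names are from later ACE releases (assumed).
#include "ace/Acceptor.h"
#include "ace/SOCK_Acceptor.h"
#include "ace/SOCK_Stream.h"
#include "ace/Svc_Handler.h"
#include "ace/Reactor.h"
#include "ace/INET_Addr.h"

// Application-specific service handler; PEER_STREAM is ACE_SOCK_STREAM.
class Echo_Handler
    : public ACE_Svc_Handler<ACE_SOCK_STREAM, ACE_NULL_SYNCH> {
public:
    int open(void*) {   // called by the Acceptor once a connection is accepted
        return reactor()->register_handler(this,
                                           ACE_Event_Handler::READ_MASK);
    }
    int handle_input(ACE_HANDLE) {   // echo back whatever the peer sends
        char buf[BUFSIZ];
        ssize_t n = peer().recv(buf, sizeof buf);
        if (n <= 0) return -1;       // peer closed; Reactor removes the handler
        return peer().send_n(buf, n) == n ? 0 : -1;
    }
};

// Bind the template arguments to produce a concrete Acceptor type.
typedef ACE_Acceptor<Echo_Handler, ACE_SOCK_ACCEPTOR> Echo_Acceptor;

int main() {
    ACE_INET_Addr addr(7000);                      // assumed listen port
    Echo_Acceptor acceptor(addr, ACE_Reactor::instance());
    ACE_Reactor::instance()->run_reactor_event_loop();  // dispatch forever
}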

3.3.3 The Connector Pattern

The Connector pattern, which is a complement of the Acceptor pattern, enables the tasks performed by network services to evolve independently of the mechanisms that actively initialize the services, i.e. it decouples active service initialization from the tasks performed once a service is initialized [Sch96Jan]. This pattern permits key characteristics of services (such as the concurrency strategy or the data format) to evolve independently and transparently of the mechanisms used to establish the connections, which in turn helps in increasing code reuse, as connection establishment mechanisms change far less often than service characteristics. Like the Acceptor pattern, it makes use of asynchrony to actively establish connections with a large number of peers efficiently.

The structure of the key participants in the Connector pattern is shown in Fig. 6.

Fig. 6 Participants in the Connector Pattern

Connector - It connects and activates a Svc_Handler. The connect() method of the Connector is used to actively initiate a connection between a Svc_Handler and its remote peer, while the handle_output() method is used to activate the Svc_Handlers whose connections were initiated and completed asynchronously.

Svc_Handler - As in the Acceptor pattern, it defines a generic interface for a service. The Svc_Handler contains a communication endpoint (PEER_STREAM) that encapsulates an I/O file descriptor (i.e. handle). This endpoint is used to exchange data between the Svc_Handler and its connected peer. The Connector uses the open() method of the Svc_Handler to activate its endpoint when a connection completes successfully.


Reactor - As stated earlier in Section 3.3.1, the Reactor handles the completion of connections that were initialized asynchronously. The Reactor allows multiple Svc_Handlers to have their connections initiated and completed asynchronously by a Connector configured in a single thread of control.

WORKING - As in the Acceptor pattern, the Event_Handler and Reactor, reused from the Reactor pattern, encapsulate the OS event demultiplexing calls (like select or poll) and provide application-independent (Reactor) mechanisms that perform event demultiplexing and dispatching of application-specific (Svc_Handler) event handlers in response to events. The Connector uses the open() method of a Svc_Handler to perform service-specific initializations once a connection is completed. On completion of a connection, the Reactor can automatically call back to the Connector's handle_event() method (as the Connector inherits from the Event_Handler). One of the parameters of the Connector, the PEER_CONNECTOR, provides the transport mechanism used by the Connector to actively establish the connection synchronously or asynchronously, while the other, SVC_HANDLER, provides the service that processes data exchanged with its connected peer. The Svc_Handler can be activated either in a synchronous or asynchronous mode (by setting the underlying socket mechanisms into a blocking or non-blocking mode respectively). The Connector maintains a map of Svc_Handlers whose asynchronous connections are pending completion. Once an asynchronous connection completes successfully, the Reactor makes a callback to the Connector's handle_event() method, which in turn finds and removes the connected Svc_Handler from its internal map, transfers the I/O handle to the Svc_Handler and initializes it by calling the activate_svc_handler() method. Once the initialization is completed, the Reactor makes appropriate callbacks to the pre-registered concrete Svc_Handlers to implement the application-specific service functionalities.
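The client-side instantiation is symmetric, as the following sketch shows; the ACE_* names follow later ACE releases, and the Request_Issuer handler and server address are illustrative assumptions.

// Hedged sketch: binding SVC_HANDLER and PEER_CONNECTOR to produce a
// concrete Connector. ACE_* names are from later ACE releases (assumed).
#include "ace/Connector.h"
#include "ace/SOCK_Connector.h"
#include "ace/SOCK_Stream.h"
#include "ace/Svc_Handler.h"
#include "ace/Reactor.h"
#include "ace/INET_Addr.h"

class Request_Issuer
    : public ACE_Svc_Handler<ACE_SOCK_STREAM, ACE_NULL_SYNCH> {
public:
    int open(void*) {   // called once the connection completes
        const char req[] = "PING\n";
        return peer().send_n(req, sizeof req - 1) > 0 ? 0 : -1;
    }
};

typedef ACE_Connector<Request_Issuer, ACE_SOCK_CONNECTOR> Request_Connector;

int main() {
    ACE_INET_Addr server_addr(7000, "localhost");  // assumed server endpoint
    Request_Connector connector(ACE_Reactor::instance());
    Request_Issuer* handler = 0;
    // Actively initiate the connection; on success the Connector calls
    // handler->open() to perform service-specific initialization.
    if (connector.connect(handler, server_addr) == -1)
        return 1;
    ACE_Reactor::instance()->run_reactor_event_loop();
}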

3.3.4 The Active Object Pattern

This pattern is used for simplifying synchronized access to a shared resource by methods invoked in different threads of control. It decouples method execution from method invocation, so that independently executing threads can access data (modeled as a single object) in an interleaved fashion [Lav95Sept]. This pattern is well suited to a broad class of producer/consumer and reader/writer problems and is commonly used in distributed systems requiring multi-threaded servers. Even though it is possible to build robust single-threaded servers handling concurrent requests for services, it requires complex concurrent programming (to ensure that undue serialization is not caused) and it may not be possible to alleviate performance bottlenecks (like queuing delays caused by the serialization of requests) arising from a single-threaded approach. The Active Object pattern provides a solution to the above problem by allowing a method to execute in a different thread than the one that invoked the method originally (in contrast, passive objects execute in the same thread as the object that called a method on the passive object). Using the Active Object pattern helps in simplifying flow control (since each active object has its own thread of control, and hence can block, waiting for flow control to abate, without affecting other executing objects), simplifies concurrent programming (since it allows multiple methods to be executed in parallel) and takes advantage of parallelism (on multi-processor platforms). Moreover, it shields applications from low-level synchronization mechanisms, rather than having them acquire and release locks explicitly, and allows methods which are invoked asynchronously to be executed according to synchronization policies, and not in the order of invocation.

As shown in Fig. 7, the key participants in the Active Object pattern are:

Fig. 7 Participants in the Active Object Pattern

Client Interface - This is a proxy object that presents a method interface to client applications. The invocation of a method defined by it triggers the construction and queuing of a Method Object.

Method Object - These are objects that are created for any method call that requires synchronized access to a shared resource managed by the Scheduler. Each of these objects maintains the context information necessary for executing an operation following a method invocation and for returning the results of that invocation through the Client Interface.


Activation Queue - It is managed by the Scheduler. It maintains a priority queue of pending method invocations, which are represented as Method Objects created by the Client Interface.

Scheduler - It is responsible for managing the Activation Queue containing the Method Objects requiring execution (based on mutual exclusion and condition synchronization constraints).

Resource Representation - It represents the shared resource that is being modeled as an Active Object. It typically defines the methods that are declared in the Client Interface, in addition to methods that are used by the Scheduler to compute runtime synchronization conditions for determining the scheduling order.

Result Handle - When a method is invoked on the Client Interface, a Result Handle is returned to the caller. It allows the method result value to be obtained after the Scheduler finishes executing the method.

WORKING - The client application invokes a method defined by the Client Interface, which triggers the creation of a Method Object. Each Method Object maintains the argument bindings to the method, in addition to any other bindings required to execute the method and return a result. On creation of Method Objects, the Scheduler acquires a mutual exclusion lock, consults the Activation Queue to determine which Method Object(s) meet the synchronization constraints, binds the Method Object(s) to the current Resource Representation and allows the method to access/update this Resource Representation and create a Result Handle. This Result Handle is used to bind the result value, if any, to a future object which passes return values back to the caller when the method finishes executing. A future object [Lis88Jun] is a synchronization object that enforces write-once, read-many synchronization. Subsequently, any readers that rendezvous with the future will evaluate the future object and obtain the result value. The future and the Method Object(s) are garbage collected when they are no longer needed.
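The mechanics above can be condensed into a small example. The following sketch uses modern C++ standard library facilities (threads, futures) rather than the period's ACE classes, purely for brevity and as a stated assumption: the scheduler thread plays the Scheduler role, the internal queue is the Activation Queue, and the returned future is the Result Handle.

// Hedged Active Object sketch in modern C++ (an assumption; the thesis-era
// code would use ACE classes). Method invocation enqueues work; a single
// scheduler thread executes it; callers synchronize through futures.
#include <condition_variable>
#include <functional>
#include <future>
#include <iostream>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

class ActiveCounter {
    int value_ = 0;                              // Resource Representation
    std::queue<std::function<void()>> queue_;    // Activation Queue
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread scheduler_;                      // Scheduler thread

    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                if (done_ && queue_.empty()) return;
                job = std::move(queue_.front());
                queue_.pop();
            }
            job();                               // execute the Method Object
        }
    }
public:
    ActiveCounter() : scheduler_(&ActiveCounter::run, this) {}
    ~ActiveCounter() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        scheduler_.join();
    }
    // Client Interface: returns immediately with a future (Result Handle);
    // the increment itself runs later in the scheduler thread.
    std::future<int> increment() {
        auto task = std::make_shared<std::packaged_task<int()>>(
            [this] { return ++value_; });
        std::future<int> result = task->get_future();
        {
            std::lock_guard<std::mutex> lk(m_);
            queue_.push([task] { (*task)(); }); // enqueue the Method Object
        }
        cv_.notify_one();
        return result;
    }
};

int main() {
    ActiveCounter counter;
    std::future<int> r = counter.increment();    // invocation returns at once
    std::cout << r.get() << "\n";                // rendezvous with the future
}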

To sum up, design patterns help in developing communication software components and frameworks that are reusable across OS platforms by facilitating the reuse of abstract architectures that are decoupled from concrete realizations of these architectures. In addition, they provide a structured means of documenting software architectures by capturing the structure and collaboration of participants in a software architecture at a higher level than source code, which helps in capturing the essential architectural interactions while suppressing unnecessary details.

Chapter 4 discusses the design and implementation of the multi-threaded server models, introduced in Chapter 2, based on the ACE design patterns.


Chapter 4

Client/Server Design and Implementation

This chapter, using Chapters 2 and 3 as background, provides an in-depth discussion of the design and implementation of the Client/Server (C/S) system used for measurement purposes.

4.1 Introduction

This thesis uses a C/S architecture to study the impact of using different multi-threaded server models to implement concurrent services.

Fig. 8 Active/Passive Connection Roles

In every C/S system, communication channels have to be established between the clients and servers. As shown in Fig. 8, establishing connections between endpoints involves the following two roles:

Passive Role Element - which initializes an endpoint of communication at a particular address and waits passively for the other endpoint(s) to connect with it.


Active Role Element - which actively initiates a connection to one or more endpoints that are playing the passive role.

Chapter 3 introduced the ACE toolkit and highlighted the usefulness of pattern-based design. Using the background provided by Chapter 3, one can form a picture of how the design patterns discussed there can be used to implement a C/S system based on the ACE toolkit, as shown in Fig. 9:

The server uses the Acceptor pattern to establish an endpoint of communication (passive role), and advertises its local port and address. The generic clients make use of the Connector pattern to actively initiate a connection (active role) to the server, on its pre-defined local port. The server accepts these connections and creates Svc_Handlers, responsible for providing the requested service, for each client. Both the client and server make use of the Reactor pattern for demultiplexing I/O events and dispatching the appropriate method(s) in the pre-registered Svc_Handlers. As explained in Section 3.3.4, the thread-pool server makes use of the Active Object pattern to implement a pool of threads, each of which services a client request in parallel with the other threads in the pool.

4.2 Client/Server Architecture from an ACE Viewpoint

This section discusses the design and implementation of the client and servers from an ACE viewpoint. The client implementation is generic, with the same implementation being used for all the different servers based on the different threading models. The entire discussion is from a design-pattern perspective because, as explained in Section 3.3, design patterns help to focus on the interactions between key participants in a software architecture, rather than going into all the nitty-gritty details and overwhelming the reader.

4.2.1 The Client

The client, due to its active role in the experimental setup, is based on the Connector/Reactor patterns [Sch96Jan, Sch95]. The Reactor encapsulates the operating system select mechanism, while the Connector encapsulates the various UNIX-domain socket mechanisms. Due to its generic role in the experiments, i.e. it is responsible only for sending service requests to the server, sleeping and sending again, its implementation remains the same for all the different multi-threaded server models. As shown in Sections 3.3.1 and 3.3.3, the Reactor and Connector patterns can be broken down into different modules, and since the client is based on these patterns, its design can also be broken down into similar modules. Fig. 10 shows the object-oriented design of the client, modularized into its class components, using the UML notation developed by Booch and Rumbaugh [Appendix A].

As shown in Fig. 10, the client components are divided into three layers:

Fig. 10 Class Components for the Client

The Reactive Layer - This layer, along with the Connection layer, performs generic, application-independent strategies for handling events and establishing connections. The two participants at this layer, the Reactor and Event Handler, are reused from the Reactor pattern [Sch95]. The Reactor defines an interface for registering, removing and dispatching Event_Handler objects (such as the Connector and Svc_Handler). Its interface provides a set of application-independent mechanisms that perform event demultiplexing and dispatching of application-specific handlers in response to events.

The Connection Layer - This layer is responsible for actively connecting a Svc_Handler to its peer and activating the handler once it is connected. This layer delegates to the Reactor pattern in order to establish connections asynchronously without requiring multi-threading. The Svc_Handler abstract class provides a generic interface for processing services, which are customized by applications. The open() method of a Svc_Handler is called by the Connector after a connection is established, for performing service-specific initializations. A subclass of the Svc_Handler is responsible for determining the service's concurrency strategy (e.g. a Svc_Handler might employ the Reactor pattern to demultiplex events in a single thread of control, or it might employ the Active Object pattern for providing multi-threaded services). The Connector abstract class implements the generic strategy for initializing network services. Since the Connector inherits from the Event_Handler, the Reactor can automatically call back to the Connector's handle_event() method when a connection completes. The Connector is parameterized by a particular type of PEER_CONNECTOR (which provides the transport mechanism used by the Connector to actively establish the connection synchronously or asynchronously) and SVC_HANDLER (which provides the service that processes data exchanged with its connected peer).

The Application Layer - The application layer is responsible for supplying a concrete IPC mechanism and a concrete service handler. The SOCK_Connector class is used to encapsulate the OS socket mechanisms for actively initiating a connection to a peer. The Concrete Svc_Handler implements the application-specific service activated by a Concrete Connector. Each Concrete Svc_Handler is instantiated with a specific type of C++ IPC wrapper that exchanges data with its connected peer. The Concrete Connector class instantiates the generic Connector factory with concrete parameterized type arguments for SVC_HANDLER (which is the Request_Issuer here) and PEER_CONNECTOR (which is the SOCK_Connector here).


EXECUTION CYCLE - Fig. 11 depicts the execution cycle of the client using an interaction diagram. Each client has the following two major phases:

Fig. 11 Interaction Diagram for the Client

Connection and Service Initialization Phase - In this phase, one or more Svc_Handler(s) are actively connected with their peers, either synchronously or asynchronously. The main client driver program uses the Connector::connect() method to actively initiate a connection to the server, creating and registering a service handler (Request_Issuer) with the Reactor, thus ensuring that the Reactor performs application-specific callbacks as events concerning it occur. The open() method of the Svc_Handler performs service-specific initialization.

Service Processing Phase - Once the connection has been established actively and the service has been initialized, the client enters into a service processing phase. It blocks on the Reactor::handle_events() loop until an event occurs (like I/O on the port it is listening on). As events occur, the Reactor automatically calls the appropriate method (i.e. the svc() loop) in the previously registered Request_Issuer to perform the necessary application-specific services.

4.2.2 The Server

This section provides a brief synopsis of the design and implementation of the different server models. In contrast to the client, the server fulfills the passive role in the experimental setup and uses the Acceptor pattern [Sch95Dec] to accomplish it. Event demultiplexing is handled by the Reactor pattern [Sch95]. The design of the different threaded-model servers remains essentially the same, although the implementation details vary, as will be shown in the subsequent sections.

Two main issues in server design relate to the demultiplexing [SchMar94, Stevens90] of service requests from clients, and to concurrency. Accordingly, server designs fall into two fundamental categories:

Single Threaded Servers - These handle client service requests sequentially by iterating through an infinite loop. While processing the current request, an iterative server typically queues new client requests. An iterative design is most suitable for short-duration services that exhibit relatively little variation in their execution time (as otherwise queuing delays would reduce the level of concurrent services provided).

Concurrent Servers - These servers, either by using multi-processing or multi-threading techniques, handle multiple requests from clients simultaneously. They help to improve the system throughput when the arrival rate of requests is greater than the rate at which they are processed. They are most useful for I/O-bound or variable execution time services, as each client incurs a relatively smaller queuing delay for getting access to the server.

4.2.2.1 Single Threaded Server

The single threaded server implements the Acceptor/Reactor patterns. As shown in Fig. 12, the class components of the iterative server are similar to the ones in the client, with the Acceptor pattern replacing the Connector pattern:

Fig. 12 Class Components for the Single-Threaded Server

The Reactive Layer - This layer performs the same services as those described in Section 4.2.1. The Reactor, which demultiplexes connection requests received on one or more communication endpoints to the appropriate Acceptor, allows multiple Acceptors to listen for connections from peers.

The Connection Layer - This layer is responsible for creating a Svc_Handler, passively connecting it to its peer, and activating it once it is connected. Due to the generic behavior of this layer, the classes in this layer delegate to the concrete IPC mechanism and concrete SVC_HANDLER instantiated by the application layer. The Svc_Handler abstract class provides a generic interface for processing services, which are customized by applications. The open() method of a Svc_Handler is called by the Acceptor after a connection is established, for performing service-specific initializations. As in the case of the client, a subclass of the Svc_Handler is responsible for determining the service's concurrency strategy (e.g. a Svc_Handler might employ the Reactor pattern to demultiplex events in a single thread of control, or it might employ the Active Object pattern for providing multi-threaded services). The Acceptor abstract class implements the generic strategy for passively initializing network services. Since the Acceptor inherits from the Event_Handler, the Reactor can automatically call back to the Acceptor's handle_event() method when a connection arrives. The Acceptor is parameterized by a particular type of PEER_ACCEPTOR (which provides the transport mechanism used by the Acceptor to passively establish the connection) and SVC_HANDLER (which provides the service that processes data exchanged with its connected peer).


The Application Layer - The application layer is responsible for supplying a concrete IPC mechanism and a concrete service handler. The SOCK_Acceptor class is used to encapsulate the OS socket mechanisms for passively accepting a connection request from a peer. The Concrete Svc_Handler implements the application-specific service activated by a Concrete Acceptor. Each Concrete Svc_Handler is instantiated with a specific type of C++ IPC wrapper that exchanges data with its connected peer. The Concrete Acceptor class instantiates the generic Acceptor factory with concrete parameterized type arguments for SVC_HANDLER (which is the Request_Handler here) and PEER_ACCEPTOR (which is the SOCK_Acceptor here). The execution cycle of the single threaded server goes through the following two phases, as shown in Fig. 13:


Endpoint and Service Initialization Phase - In this phase, a passive-mode endpoint is created and bound to a network address, such as an IP address and port number. The passive-mode endpoint listens for connection requests from peers. The main server driver program uses the Acceptor::open() method to passively establish an endpoint of communication. On receiving connection requests, it creates, activates (by calling the Svc_Handler's open() method) and registers a Svc_Handler with the Reactor, to facilitate callbacks to the Acceptor's handle_event() method. The open() method of the Svc_Handler then performs service-specific initialization.

Service Processing Phase - This phase performs application-specific tasks that process the data exchanged between the Svc_Handler(s) and its connected peer(s). The server enters the Reactor::handle_input() method, which blocks till an I/O event occurs. On occurrence of these pre-registered events, the Reactor automatically calls back to the previously registered Svc_Handler (Request_Handler in this case) to perform the necessary application-specific services.

4.2.2.2 The Thread-per-Request Server

The thread-per-request server, like the single-threaded server, also implements the Acceptor/Reactor patterns. However, unlike the single-threaded server, which has a single thread of control only, the thread-per-request server handles each and every service request in a separate thread of control. Each callback by the Reactor to the appropriate method in the Svc_Handler (Request_Handler in this case) triggers the spawning of a thread, which processes the request and returns the reply to the client in a separate thread of control.

Threads are spawned and managed by the ACE_Thread_Manager [Sch95Sept] class of the ACE thread library, which encapsulates a set of mechanisms to manage a group of threads that collaborate to implement collective actions. The ACE_Thread_Manager class also shields applications from many incompatibilities between different types of multi-threading mechanisms (such as Solaris, POSIX, and Win32 threads). As shown in Fig. 14, the key phases in the thread-per-request server are:

Fig. 14 Interaction Diagram for the Thread-per-Request Server

Endpoint and Service Initialization Phase - This phase is identical to the corresponding phase in the single-threaded server. The main server driver program uses the Acceptor::open() method to passively establish an endpoint of communication. On receiving connection requests, it creates, activates (by calling the Svc_Handler's open() method) and registers a Svc_Handler with the Reactor, to facilitate callbacks to the Acceptor's handle_event() method. The open() method of the Svc_Handler then performs service-specific initialization.

Service Processing Phase - Once the connection has been established passively and the service has been initialized, the server enters into a service processing phase. The Reactor, which blocks on the handle_event() method, makes callbacks to the appropriate Svc_Handler(s) on occurrence of events concerning those Svc_Handlers. The Svc_Handlers, in turn, invoke the Thread_Manager::spawn() method for each and every request. This method spawns off a thread to handle that particular request; the thread processes the request and returns the results, and is then destroyed.
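The dispatch step can be sketched as follows; the ACE_Thread_Manager spawn interface is real, while the Request structure and the surrounding service routine are illustrative assumptions.

// Hedged sketch of thread-per-request dispatch via ACE_Thread_Manager
// (the ACE_ prefix follows later releases). Request is a hypothetical
// stand-in for an application request.
#include "ace/Thread_Manager.h"
#include "ace/Log_Msg.h"

struct Request { int id; };

// Thread entry point: service one request, then let the thread die.
static ACE_THR_FUNC_RETURN service_request(void* arg) {
    Request* req = static_cast<Request*>(arg);
    ACE_DEBUG((LM_DEBUG, "servicing request %d\n", req->id));
    delete req;   // thread-per-request: request state dies with the thread
    return 0;
}

// Invoked once per incoming request, e.g. from handle_input().
int dispatch_in_new_thread(int request_id) {
    // THR_DETACHED: the thread cleans itself up, so no join is required.
    return ACE_Thread_Manager::instance()->spawn(
        service_request, new Request{request_id},
        THR_NEW_LWP | THR_DETACHED);
}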

4.2.2.3 The Thread-per-Client Server

As explained in Section 2.3, the thread-per-client server is similar to the thread-per-request server, except that a thread is spawned for each client, and that thread processes requests from the client for the whole interval during which the client is connected to the server [Win96Jun]. One of the major differences between the server described in Section 4.2.2.2 and this one is that, since each thread exists for the duration of a client's connection to the server, the corresponding Svc_Handler is not registered with the Reactor: it can function in its own thread of control and does not require callbacks by the Reactor to invoke the appropriate method in the Svc_Handler. Fig. 15 shows the major phases in the execution cycle of the thread-per-client server:


Fig. 15 Interaction Diagram for the Thread-per-Client Server

Endpoint and Service Initialization Phase - The implementation of this phase is similar to the other models, with one major difference. This phase is responsible for passive-mode endpoint initialization and Svc_Handler creation and activation. However, unlike the other models, the Svc_Handlers are not registered with the Reactor. The reason is that, since each Svc_Handler has its own thread of control, it can block on its own OS demultiplexing call without affecting the execution of other Svc_Handlers. The main server driver program, after initializing and activating Svc_Handlers, invokes the Thread_Manager::spawn() method to spawn off threads which are associated with a Svc_Handler, and hence its corresponding peer, for the whole interval during which the peer remains connected to the Svc_Handler.


Service Processing Phase - In this phase, each Svc_Handler performs application-specific services in its own independent thread of control. When the Acceptor terminates, either due to errors or due to a proper shutdown, the Reactor calls the Acceptor::handle_close() method to release dynamically acquired resources.

4.2.2.4 The Thread-Pool Server

The thread-pool server, which has the most complex design of all the models considered, is based on the Acceptor/Reactor patterns. In addition, it implements the Active Object pattern to provide a functional pool of reusable threads [Sch96Apr]. It pre-spawns a fixed number of threads (a design decision) at start-up, which process all incoming requests. Its design, which is a compromise between the single-threaded server and the thread-per-client server, consists of initializing and activating Svc_Handlers, similar to the other models, and processing service requests as they occur. However, a major difference is that although each Svc_Handler executes in a separate thread of control, the number of threads is limited, and hence the response time seen by the client also includes the queuing delay incurred by the Svc_Handlers to get a thread of control. The major phases in the execution cycle of the thread-pool server are shown in Fig. 16:

Endpoint and Service Initialization Phase - This phase is similar to the initialization phase described in the other server models. The Acceptor::open() method, invoked by the main driver program, creates a passive-mode endpoint that is bound to a network address. This endpoint listens for connection requests from peers. When a connection arrives, the Reactor makes a callback to the Acceptor::handle_event() method, which performs Svc_Handler-specific initialization by calling the open() method of the Svc_Handler.

Service Processing Phase - In this phase, as events occur, the Reactor makes a callback to the handle_input() method of the appropriate pre-registered Svc_Handler. This Svc_Handler, in turn, enqueues the request onto a global queue, from which it is dequeued by any of the inactive threads in the thread pool for service processing and for returning results back to its peer.
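The enqueue/dequeue path can be sketched with the ACE_Task class mentioned in Section 4.3.4 below; the modern ACE_ names and the pool size are assumptions, and process_request stands in for the application-specific work.

// Hedged sketch of the thread pool as an Active Object built on ACE_Task,
// whose svc() threads share the task's message queue (ACE_ names assumed).
#include "ace/Task.h"
#include "ace/Message_Block.h"

class Request_Pool : public ACE_Task<ACE_MT_SYNCH> {
public:
    // Pre-spawn a fixed number of pool threads, each running svc().
    int open(void* = 0) {
        const int POOL_SIZE = 4;               // design decision (assumed)
        return this->activate(THR_NEW_LWP, POOL_SIZE);
    }
    // Each pool thread blocks on the shared queue, then services requests.
    int svc() {
        ACE_Message_Block* mb = 0;
        while (this->getq(mb) != -1) {         // dequeue the next request
            process_request(mb->rd_ptr());
            mb->release();
        }
        return 0;
    }
private:
    void process_request(const char*) { /* application-specific work */ }
};

// In the main (Reactor) thread, the Svc_Handler's handle_input() enqueues
// the request so the Reactor is freed immediately:  pool.putq(mb);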

Fig. 16 Message Sequence Chart for the Thread-Pool Server

4.3 System Architecture

The previous section provided an overview of the design and implementation issues involved in the different servers based on different threading models. This section further expands on the previous section, and provides an overall view of the system architecture.

4.3.1 Single-Threaded C/S Architecture

Fig. 17 shows the system architecture for the single-threaded server. The server starts its execution cycle by creating an Acceptor object, which advertises its local address. Each client begins its execution cycle by issuing a connect request to the server. On receiving such a connection request, the server creates a Request_Handler object (which is the application-specific concrete instance of the generic Svc_Handler). It accepts the connection into the Request_Handler and activates it. It then informs the client about the acceptance of its connection request, at which point the client starts its service cycle. In case the client doesn't receive this acknowledgment within a certain time period (a design decision), it performs an exponential backoff and retries. If, after a given number of retries, it is not able to connect, it informs the experiment controller about its status and shuts down, thereby aborting the experiments.
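The acknowledgment time-out and retry behavior mentioned above can be sketched as follows; the retry cap of 20 comes from the use case map in Section 4.4, while the base delay and the try_connect callback are illustrative assumptions.

// Hedged sketch of the client's connect-with-retry policy: exponential
// backoff between attempts, giving up after a fixed number of retries.
#include <unistd.h>

bool connect_with_backoff(bool (*try_connect)()) {
    const int MAX_RETRIES = 20;    // per the use case map in Section 4.4
    unsigned delay = 1;            // seconds; initial backoff (assumed)
    for (int attempt = 0; attempt < MAX_RETRIES; ++attempt) {
        if (try_connect())
            return true;           // acknowledgment received in time
        sleep(delay);              // wait before retrying
        delay *= 2;                // exponential backoff
    }
    return false;                  // report status and shut down
}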


4.3.2 Thread-per-Request C/S Architecture

This architecture is very similar to the thread-per-client architecture shown in Fig. 18, the only difference being in the way requests are handled. In this case, each service request causes the server to spawn off a thread, which services the request and is destroyed at the end of the service for that request.

4.3.3 Thread-per-Client C/S Architecture

In contrast to the thread-per-request architecture, the thread-per-client server spawns off threads which service a client for the entire duration during which it is connected to the server. When a connection request arrives from a client, the server creates, accepts and activates a Request_Handler.

Fig. 18 Thread-per-Client C/S Architecture

In addition to service-specific initialization, the Request_Handler::open() method also spawns off a thread, which remains associated with it as long as it has a connected peer. Each Request_Handler does its own event demultiplexing and hence does not register itself with the Reactor, as it does not need any callbacks to be performed.

4.3.4 Thread-Pool C/S Architecture

As can be seen from Fig. 19, the thread-pool server implements the Active Object pattern, in addition to the Acceptor/Reactor patterns.

Fig. 19 Thread-Pool C/S Architecture

The startup of the experiments follows the same pattern as in the other cases. However, an additional part of the server initialization is the spawning of a pre-determined number of threads to form the thread pool. This thread pool is managed by the Active Object, which is an instance of the ACE_Task class and contains a global message queue in addition to the thread pool. When the server receives a connection request from a client, it creates, accepts and activates a Request_Handler object. On receiving an acknowledgment from the server, the client issues a service request, which is directed by the Reactor (similar to the single-threaded server case), in the main thread, to the appropriate Request_Handler. The Request_Handler, instead of receiving the actual message in the main thread, enqueues a pointer to it in the global message queue. The pointer is then taken off the global queue by one of the inactive threads in the thread pool, which then goes on to receive the actual request, process it and return the result back to the client. In this way, the Reactor, which would otherwise block till a Request_Handler finishes executing, is freed up as soon as possible to handle service requests from other clients, and the actual service processing is done in a separate thread of control.

In order to give a feel for the execution cycle of the system, a dynamic diagramming technique is presented in the next section.

4.4 A Use Case Map Representation of the System Execution Cycle

This section provides another way to describe the system execution cycle. It presents a study of the execution cycle of the C/S system based on the single-threaded server model, using a diagramming technique known as Use Case Maps [Buhr96, Hubb96]. A use case map consists of superimposing paths of related scenarios of system operation on diagrams of the system architecture. Paths in use case maps trace cause-effect sequences that traverse a system from points of input to points of output. The structural components of the system are represented by differently shaped boxes, and the behavioral parts are shown using paths superimposed on such boxes. Use case maps are easy to comprehend due to their visual design and are particularly helpful in providing an insight into the structural and behavioral patterns of complex software systems.

As shown by the use case paths in Fig. 20, the server initialization starts with a UNIX command to start up the server, which causes the server to create a Request_Acceptor object, an instance of the ACE_Acceptor class. This stimulus, interpreted by the Request_Acceptor, causes it to establish a passive endpoint of communication and instantiate a Request_Handler object, an instance of the ACE Svc_Handler class, which processes requests received from clients.

Fig. 20 Use Case Map Representation of the System for a Single-Threaded Server. Path labels: gs1: Create an instance of Request_Acceptor; gs2: Create an instance of Request_Connector; gs3: Initiate a connection to the server; gs4: Detect error condition; gs5: Complete initialization of connection; gs6: Send and receive requests.

As a result of this initialization process, the server is ready to service requests from clients. The client initialization proceeds in the same manner, except that this time a Request_Connector object is instantiated, which actively initiates a connection to the server, and creates an instance of a Request_Issuer object, which sends requests to (and receives replies from) the server.

In addition to the above, some error handling has been shown by the use case paths (represented by gs4), in which a client makes a maximum of 20 connection retries, shown by a timer-like object. On detection of a communication error, it performs retries based on an exponential backoff, after which it shuts down. Use case maps for the other models follow essentially the same pattern and so have not been shown. The previous sections, including this one, provided an overview of the design and implementation details of the client and servers of the C/S system used for conducting the experiments. The next section provides an overview of the distributed layered server architecture, which was used as a case study.


4.5 Systems with Layered Servers

This section provides a brief overview of the distributed layered server system considered as a case study. Many distributed applications decompose naturally into a series of hierarchically cooperating tasks that fit the Client/Server paradigm. For instance, a database system can be designed and implemented using a layered architecture in which applications, acting as clients running on one layer, make use of the services of other applications, acting as servers, running below them. These servers, in turn, may require the services of other servers further below, i.e. the same module acting as a server in one layer may become a client of a server running on another layer.

Fig. 21 Pure/Layered Server Architecture

Fig. 21 represents an example of a layered server architecture. Software entities are represented as parallelograms, and hardware devices as circles. Accordingly, we have two types of software servers:

Pure Server - A pure software server executes on its own the requests of its clients, needing only a CPU to run on. It does not perform any I/O, nor does it issue any requests for services to other servers. If the server requires more servers than its own CPU, it becomes a mid-level server.

Mid-level Software Server - In this case, the server requires the services of other server(s) below it (besides its own CPU) while servicing the requests of its own clients. The example shown in Fig. 21 contains two software servers, a front-end server (mid-level) and a back-end server (low-level). The mid-level server receives a request from a client above it, processes part of the request, sends down the remaining portion to the low-level server, waits for the result to come back, and then sends the reply back to the client.


This chapter provided an overview of the design and implementation issues involved in designing the C/S system. The next chapter discusses the experimental setup and analyses the results obtained.


Chapter 5

Experimental Setup and Measurements

Software performance characteristics (response times, throughputs, etc.) are crucial in most application domains, especially distributed ones. Furthermore, it is important to understand the dependence of these characteristics on various factors such as design alternatives, system load, resource allocation, distribution, etc. In order to analyze the performance of a system one should be able to fully understand its runtime behavior and to identify frequent, critical execution paths. Object-oriented (OO) patterns offer invaluable support in those steps, as they focus on the interactions between key participants in a software architecture, rather than overwhelming the developer with details. This chapter describes the experimental setup used for conducting performance measurements and presents a detailed analysis of the results obtained, first from a system-level viewpoint, and then from a pattern-guided viewpoint, using Chapters 3 and 4 as background.

5.1 Introduction

Design patterns describe recurring solutions to common problems that occur in various application domains. By capturing the static and dynamic nature of these solutions, patterns help to design and implement new systems either from scratch or based on reusable components. In the latter case, patterns are useful not only to the designers and implementors of the components meant for reuse, but also to those who build software systems using such components. For example, in [Johnson92] it is shown how patterns can help to explain and document a framework (i.e. a hierarchy of reusable classes), and to teach application programmers how to use the framework for building new systems.

It is a well known fact that the performance characteristics of a system are very important in some application domains (like embedded controllers, telecommunications, distributed databases, etc.). It is only natural that some of the patterns applied to the design and implementation of such systems reflect a strong concern for efficiency, as shown in [Coplien96, Sch95Aug, Gam94, McKenny95].

As will be shown in the subsequent sections, this thesis takes the relationship between patterns and performance a step further, promoting the idea that patterns can be used not only to document and explain reusable components, but also to guide the performance measurements and analysis of such components and of the systems incorporating them.

The overall performance measures depend on all the parts composing a system (application software, operating system, hardware devices, etc.) and on their runtime interaction and behavior. For example, in order to gain insight into the factors contributing to the response time for different types of requests coming to a system, one has to identify the execution paths through the system for each type of request, the software entities executed along the path, and the main causes for waiting and delays. Patterns can help in this process because they provide a more abstract view of the system behavior, concentrating on the main interactions between the system components and getting away from implementation details. Moreover, patterns can guide developers in choosing what to measure, in instrumenting the code for performance measurements, and in interpreting the results.

5.2 Experimental Setup

This section describes the experimental setup of the system which was used for evaluating its performance. The system consisted of two SPARCstation-2, one SPARCstation-5,


three SPARCstation-10/20 and one SPARC Ultra machines. The network was based on an Ethernet topology, having a maximum bit rate of 10 Mbps. The clients were distributed among the different machines; the server(s), however, were run alone.

5.2.1 Experiment Controller and Data Logger

In order to set up the server(s) and clients on different machines and synchronize their data gathering, DECALS (Distributed Experiment Control and Logging System) - a set of tools and libraries used to run and control distributed applications over a network and collect measurement and trace related data - was used [Elgillani94]. DECALS reads a set of configuration files (for the system configuration), a list-of-experiments file - LOE (which specifies the number of different experiments to be conducted) and one or more experiment description files - EXP (which indicate what is to be done for the experiments to be performed). An experiment, in DECALS terminology, is the allocation of application processes to network nodes, the execution until completion of those application processes, and the collection of measurement and trace information from those processes. Once the configuration information has been read, DECALS performs each experiment in turn (Fig. 22).

Fig. 22 Experimental Setup


For each experiment, DECALS executes and synchronizes all the application processes (server(s) and clients), and simultaneously runs the Collector, which is used to collect the application's measurements and trace information through a procedural interface provided by the DECALS libraries. In order to terminate the experiments, DECALS provides a time-out period, in addition to the natural death (completion) of the experiments.

5.2.2 DECALS Architecture

DECALS consists of many processes, as shown in Fig. 22:

Experiment Controller - This is the main process in the DECALS system. It is in charge of running and initializing all other processes. There is exactly one experiment controller in operation for a given set of experiments, and it may be executed on any network node.

RPTD Processes - These processes are used by the experiment controller to invoke the remaining processes. Initially, there is one RPTD process per workstation node, which acts as a daemon server. It then forks a copy of itself to deal with each application process to be run on its node, according to the instructions received from the experiment controller. These forked RPTD processes remain until the application processes die.

Application Processes - These are the set of processes for which the measurements and analysis are being performed. They communicate with the Experiment Controller and the Collector via the services provided by the Collector library.

Collector - This process is responsible for collecting measurements from the application processes and storing them in a file. Exactly one Collector process is running on each node in the network being used by the current application.


However, it was found that a number of problems cropped up as the number of users of DECALS increased. So, although the first half of this chapter uses DECALS to set up and control experiments, the remainder of the thesis uses UNIX batch scripts to implement the same approach [explained in Appendix B].

5.3 Measurement Results and System Performance Analysis

A series of tests were conducted for all the different threading models, corresponding to the two architectures. The message size was kept constant at 4 bytes, while the socket size was fixed at the default size of 8K. Although it is known from previous studies

[Courtois96] that message size variation has a strong impact on the total response time, we kept the message size fixed and small for most of our experiments because our goal was to study multi-threading overheads. In [Courtois96, Pozzetti95], it is shown that the

CPU cost-per-byte is non-monotonic and causes large fluctuations in the service time, and we wanted to avoid such fluctuations, which would hide the effect of the overheads we were interested in. Each client execution cycle consisted of an exponential think time with a 100 msecs average, and an active time, during which it sent requests and received replies from the server. The server service time was fixed at 10 msecs in order to highlight the impact of the various multi-threading overheads, which would be of the order of a few milliseconds. As will be shown in Fig. 34, larger service times and larger message sizes give the same overall performance behavior, but mask the differences between the various multi-threading techniques. Although the think time could be used to vary the load on the system (as the arrival rate of service requests would change), the load on the system was varied by varying the number of clients in the system (a sketch of the client cycle is given after this paragraph). Each test consisted of 10


replications of 300 sends/recvs each, in order to obtain reasonable accuracy and to account for performance variation due to transient load on the networks and hosts. The test results obtained were within a confidence interval of ± 2% of the mean at a 95% confidence level [Appendix B].
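As an illustration of this workload model (a reconstruction under the stated parameters, not the thesis client code; the send_request callback is a hypothetical stand-in for the request/reply exchange):

// One client: exponential think time (mean 100 ms), then an active phase.
#include <chrono>
#include <functional>
#include <random>
#include <thread>

void client_cycle (int n_cycles, const std::function<void()>& send_request)
{
  std::mt19937 rng (std::random_device{} ());
  std::exponential_distribution<double> think (1.0 / 100.0);  // mean 100 ms
  for (int i = 0; i < n_cycles; ++i)                          // e.g. 300 sends/recvs
  {
    std::this_thread::sleep_for (
        std::chrono::duration<double, std::milli> (think (rng)));  // think time
    send_request ();                                          // active time: request + reply
  }
}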

Fig. 23 Measurement Results for the Pure Server Architecture (panels: Client Cycle Time v/s System Load and System Throughput v/s System Load, Pure Server Arch.; x-axis: Number of Clients, 1-25; Server Service Time: 10 msecs, Client Think Time: 100 msecs (mean); ST - Single Threaded, TPC - Thread-per-Client, TPR - Thread-per-Request, TPool - Thread Pool with number of threads = 15)


Fig. 23 summarizes the performance results for the pure server architecture, for the single-threaded case and all three threading models. The single-threaded server shows the lowest cycle times of the four models compared, which corresponds to the highest throughput achieved in the system. Since the pure-server architecture comprises a single CPU, the single-threaded model avoids the overheads of context switching between threads incurred by the other models (although context switching between processes is present for all four models).

Fig. 24 Server Utilizations for the Pure Server Architecture (Server Utilization v/s System Load, Pure Server Arch.; x-axis: Number of Clients; Server Service Time: 10 msecs, Client Think Time: 100 msecs (mean))

In the case of the threaded models, the cycle times of all three models are close to each other. The higher cycle times can be explained by the extra overheads incurred due to thread creation, scheduling and context switching, since all threads are time-multiplexed onto a single CPU. Detailed performance measurements in Section 5.4 will show that, on average, a typical call to the ACE_Thread::spawn() method (which encapsulates the OS thread spawn system call) consumes around 3 msec of CPU time, which, compared to the overall service time, constitutes a highly significant overhead. This explains the fact that

the thread-per-request model exhibits the poorest performance, as can be seen from its throughput. Fig. 24 shows the utilization of the server instances and their CPU for the single-threaded and thread-pool servers respectively. It can be seen that although the maximum CPU utilization for both cases is limited to around 50%, the cumulative effect of the different overheads (shown, and explained, by the elapsed time graphs in Section

5.4) causes the software server instance of the thread-pool server to saturate at around 9 clients. For the single-threaded case, the server takes from its queue only one request at a time, which produces significant queuing delays, thereby reducing its throughput. To sum up, multi-threading servers in a pure-server architecture does not provide dividends in terms of system performance, and for cases where service requests are CPU bound, it is the CPU itself which becomes the system bottleneck, i.e. we have a hardware bottleneck.

However, the performance results for a mid-level server are entirely different. Fig. 25 highlights the superior performance of the threaded servers compared to the single-threaded server for the layered server architecture. The service requested by the clients is assumed to consist of a CPU-intensive operation performed by the mid-level server, and an

I/O-intensive operation performed by the low-level server, although the total service time seen by the client remains the same as in the case of the pure server architecture. The low-level server is based on the thread-per-client model and remains the same for all the models being compared. The highest throughput is achieved by the thread-per-client model, followed by the thread-pool model. In the case of the thread-per-client model, each thread can perform an I/O operation, from the mid-level server to the low-level server, in parallel, and since there are as many threads as clients, the interval of time for which a


client gets blocked is reduced, thereby increasing the overall throughput of the system. It also minimizes the overheads of thread creation, unlike the thread-per-request model, since each thread is created once, for the interval during which a client remains connected to the server.

Fig. 25 Measurement Results for the Layered Server Architecture (panels: Client Cycle Time v/s System Load and System Throughput v/s System Load, Layered Arch.; x-axis: Number of Clients, 1-25; Server Service Time: 10 msecs, Client Think Time: 100 msecs (mean); ST - Single Threaded, TPC - Thread-per-Client, TPR - Thread-per-Request, TPool - Thread Pool with number of threads = 15)

The single-threaded model, as expected, shows the lowest throughput. In this case, the


mid-level server takes from its queue only one request at a time. The server is busy even when it is waiting for a reply from the low-level server, thereby producing a queuing delay which leads to its low throughput. Therefore, the mid-level server becomes the system bottleneck, even if its CPU and the low-level server are not used at their full capacity. Such a bottleneck is known as a software bottleneck [Neilson95]. This effect is illustrated in Fig. 26, which shows the utilization of the mid-level server and the low-level server for the single-threaded model.

Fig. 26 Mid-level Server Utilization for the Single-Threaded Server (Server Utilization v/s System Load, Layered Arch., Single-Threaded Mid-Level Server; x-axis: Number of Clients; curves: Mid-Level server and its CPU)

The utilization of a resource represents the percentage of time the resource is busy. For example, the mid-level server is busy with a request from the moment it accepts it to the moment it completes its service, including the time it waits for the low-level server. As can be seen in the single-threaded mid-level server case, the system bottleneck is the mid-level server (i.e. we have a software bottleneck). Both the mid-level server and the low-level server have equal service times, due to which their curves overlap in Fig. 26.
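As a worked illustration of this definition (the numbers are made up for the example): over an observation interval of length $T$, a resource that is busy for a total time $B$ has utilization

\[ U = \frac{B}{T}, \]

so a mid-level server that is busy (executing or waiting for the low-level server) for 8 out of every 10 seconds has $U = 0.8$, even though its CPU alone may be much less utilized.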


The mid-level server causes the system to saturate at around 11 clients (Fig. 26). Its CPU and the low-level server remain under-utilized, with the maximum utilization at around

Fig. 27 Mid-level Server Utilization for the Thread-Pool Server (Server Utilization v/s System Load, Layered Arch., Thread-Pool Mid-Level Server; x-axis: Number of Clients, 1-25; curves: Mid-Level server and its CPU)

In such situations, the bottleneck can be removed or mitigated by increasing the number of threads in the mid-level server. Stated in another manner, we push the request queue down to the low-level servers (i.e. the low-level software server instance and its CPU), thereby increasing their utilization, as demonstrated by the mid-level server utilization for the thread-pool server in Fig. 27. In the case of the thread-pool model, the mid-level server still remains the bottleneck. However, the system now saturates at around 19 clients and the low-level server utilization increases to around 33%.

The system-level performance analysis presented in this section, based on end-to-end measurements, does not explain the differences in performance between the different threading models. In order to gain insight into these issues, we will have to conduct more


fine-grained measurements. These measurements are based on design patterns, as explained in the subsequent sections.

5.4 Pattern-guided Measurements and Analysis

Chapter 4 discusses the patterns used to implement the generic client and the different servers (based on the different threading models). Out of the patterns discussed, the Acceptor/Connector patterns are executed only during the initialization of the system under consideration. The Acceptor pattern initializes a passive-mode endpoint of communication that is bound to a network address (such as an IP address and port number), which is used to listen for communication requests from peers. Once the connection has been established passively and the service has been initialized, the

Acceptor's role ends and the application enters into a service processing phase. Similarly, the Connector pattern is used to actively initiate a connection to its remote peer, either synchronously or asynchronously. Like the Acceptor, its role ends as soon as the application enters into the service processing phase. Thus, the Acceptor/Connector patterns do not have a significant impact on the system performance, as they form part of the startup services, and so have not been instrumented for measurement purposes.

The Reactor pattern is invoked at each and every service request issued by a client, as it is responsible for event demultiplexing and for performing call-backs to the appropriate registered service handlers, which provide the application-specific functionality. In addition, the

Active Object pattern is also involved in every service request, as it manages a pool of threads which provide the actual service requested by the clients. Thus, both the Reactor


and Active Object patterns have a significant impact on system performance due to their repeated invocations.

5.4.1 Frequently Executed Patterns

This section presents a discussion of the design and implementation of the Reactor and Active Object patterns and provides insight into issues critical to system performance.

Fig. 28 Execution Sequence Chart for the Reactor Pattern

5.4.1.1 Reactor Pattern

The Reactor pattern, as discussed in Section 3.3.1, is an object behavioral pattern used to perform event demultiplexing and event handler dispatching. This pattern is responsible for demultiplexing events (such as I/O events, timers, and UNIX signals) and dispatching the appropriate methods of pre-registered event handler(s) to process these events. An execution sequence chart (which further elaborates on the interaction diagram charts by showing the different components of the execution cycle) is shown in Fig. 28, which


shows the interactions between key participants in the Reactor pattern.

As explained in Section 3.3.1, the Reactor defines an interface for registering, removing and dispatching Event Handler objects, and provides a set of application-independent event demultiplexing mechanisms. It triggers Event Handler methods in response to events (an Event Handler object specifies an interface used by the Reactor to dispatch callback methods defined by objects that are pre-registered to handle certain events). When the Reactor registers an Event Handler subclass object using the register_handler() method, it obtains the object's underlying file descriptor (or handle) by invoking the get_handle() method of the corresponding Event Handler. This handle is then used by the Reactor, along with the other registered Event Handler handles, to wait for I/O events to occur. The Reactor's dispatch() method blocks on the OS event demultiplexing system call (select or poll) until one or more events occur. On the occurrence of such events, the Reactor invokes the handle_events() method, which uses these handles to perform callbacks to the pre-registered Event Handlers, which perform the application-specific functionality, after which control returns back to the Reactor.

Thus, the critical execution path (or the most frequent path of execution) in the Reactor pattern is the dispatch() - select() - handle_events() path. In the case of multi-threaded applications, as explained below, this path also includes the remove_handler() and register_handler() methods, as they are used for unblocking the Reactor after each call-back, and for subsequent re-registering after the application-specific execution, to ensure further call-backs.
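The shape of this critical path can be sketched as follows (a simplified, illustrative reactor over select(), not ACE's implementation; the class and method names are stand-ins):

// Simplified Reactor-style event loop over select(); POSIX only.
#include <sys/select.h>
#include <map>

struct Event_Handler
{
  virtual int get_handle () const = 0;   // underlying file descriptor
  virtual void handle_event () = 0;      // application-specific callback
  virtual ~Event_Handler () {}
};

struct Mini_Reactor
{
  std::map<int, Event_Handler*> handlers;  // handle -> registered handler

  void register_handler (Event_Handler* h) { handlers[h->get_handle ()] = h; }
  void remove_handler (Event_Handler* h)   { handlers.erase (h->get_handle ()); }

  // The dispatch() - select() - handle_events() critical path.
  void handle_events ()
  {
    fd_set rd;
    FD_ZERO (&rd);
    int maxfd = -1;
    for (const auto& kv : handlers)        // build the wait set: O(handles)
    {
      FD_SET (kv.first, &rd);
      if (kv.first > maxfd) maxfd = kv.first;
    }
    if (select (maxfd + 1, &rd, nullptr, nullptr, nullptr) <= 0)
      return;                              // block until one or more events
    for (auto it = handlers.begin (); it != handlers.end (); )
    {
      Event_Handler* h = (it++)->second;   // advance first: callback may remove itself
      if (FD_ISSET (h->get_handle (), &rd))
        h->handle_event ();                // callback to pre-registered handler
    }
  }
};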


Overheads Associated With The Reactor Pattern

From the critical execution path description, it can be seen that the Reactor pattern has the following major overheads associated with it:

Event Handler registration/de-registration - Event Handlers registered with the Reactor are not pre-empted while they are executing, which means that they should not perform blocking I/O or long-duration operations such as bulk data transfer. Also, since the Reactor is blocked until a particular Event Handler finishes execution, it is imperative that the Reactor be freed as early as possible, in order to service the other registered Event Handlers. Otherwise, this serialization effect of the Reactor will degrade the performance of multi-threaded servers, as the Reactor will become the bottleneck. For this reason, the Reactor just informs an Event_Handler about a service request and returns; if the Reactor were instead to receive the message itself and pass the service request on to the worker threads, larger message sizes would make the Reactor do the work of the actual message reception (and thus increase its serializing effect). The main advantage of this approach therefore becomes apparent when message sizes are large. The Event_Handler, on being informed about a pending message, enqueues a service request, which is taken off by one of the worker threads in the thread pool. This worker thread, in parallel with the other threads, then does the actual message reception. To free up the Reactor as quickly as possible, it is necessary for the Event Handler to de-register itself, by invoking the remove_handler() method, as soon as it is invoked by the handle_events() loop of the Reactor, thereby removing itself from the Reactor's internal tables. However, at the end of its service, it is necessary for it to re-register itself using the register_handler() method, in order to facilitate call-backs once more. As shown in the previous section, this overhead of registering/de-registering is most apparent for small service times and short messages. However, this technique pays off for large message sizes. A sketch of this hand-off is given below.
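A minimal sketch of the hand-off, under the assumptions above (the queue, request and function names are simplified stand-ins for the ACE classes; real code would block on a condition variable rather than spin):

// Callback hand-off between the Reactor thread and the worker threads.
#include <deque>
#include <mutex>

struct Request { int handle; };            // identifies the ready connection

std::mutex q_mtx;
std::deque<Request> request_queue;         // global queue, guarded by q_mtx

// Invoked from the Reactor's handle_events() loop when data is pending.
void on_input (int handle)
{
  // 1. De-register from the Reactor so it can serve other handlers
  //    (reactor.remove_handler(handle) in the real code).
  // 2. Enqueue a service request; no message data is read here.
  std::lock_guard<std::mutex> g (q_mtx);
  request_queue.push_back (Request{handle});
}

// Body of each worker thread in the pool.
void worker_loop ()
{
  for (;;)
  {
    Request r;
    {
      std::lock_guard<std::mutex> g (q_mtx);
      if (request_queue.empty ())
        continue;                          // sketch only; real code waits on a condvar
      r = request_queue.front ();
      request_queue.pop_front ();
    }
    // 3. The worker, not the Reactor, does the actual message reception,
    //    performs the service, sends the reply, and then re-registers the
    //    handler (reactor.register_handler(r.handle)) for further callbacks.
  }
}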

Locking Overheads - In order to ensure atomicity of registration/de-registration and other modifications of the Reactor's internal tables, invoked from different threads of control in multi-threaded applications, mutexes have to be obtained and released for each and every call. This introduces locking/unlocking overheads, which may prove to be significant if incurred repeatedly.

Event demultiplexing Overheads - The I/O semantics of the underlying OS significantly affect the performance of applications implementing the Reactor pattern. For this case study, the select mechanism of the operating system was used for I/O demultiplexing. The select mechanism indicates which file descriptors (or handles) out of the registered set have become ready for I/O. Thus, as the number of registered handles increases, the overhead associated with the select call increases correspondingly.

5.4.1.2 Active Object Pattern

This pattern enables a method to execute in a different thread than the one that invoked the method originally (as opposed to passive objects, which execute in the same thread of control as the object that called a method on them), i.e. the order of method execution can differ from the order of method invocation [Lav95Sept].

An execution sequence chart for the Active Object pattern is shown in Fig. 29, which represents the key interactions between the participants in that pattern:


The client application invokes a method defined by the client interface (a proxy that provides a method interface to client applications), whose invocation creates a queuing method object, which maintains all the context information required to execute the method and return a result. On a request for a particular service from this active object, the scheduler acquires a mutex and binds the method object(s) to the current representation, after determining the method object(s) which meet the synchronization constraints by consulting the activation queue. The method then accesses/updates this representation and creates a result, which is then bound to a future object (a synchronization object that enforces write-once/read-many synchronization), which passes return values back to the caller when the method finishes executing.

As can be seen from the execution sequence chart in Fig. 29, the critical execution path in the Active Object pattern involves the queuing/dequeuing of service requests, i.e. the insert()/remove() operations, which are executed for every service request to the active object.
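A minimal sketch of such an activation queue and its insert()/remove() path (an illustration using a standard mutex/condition-variable queue, not the ACE implementation):

// Activation-queue sketch for the Active Object pattern: the client-side
// proxy insert()s method objects; the scheduler/worker threads remove() them.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

class Activation_Queue
{
  std::mutex mtx;
  std::condition_variable not_empty;
  std::deque<std::function<void()>> q;     // queued method objects
public:
  void insert (std::function<void()> m)    // called by the invoking thread
  {
    std::lock_guard<std::mutex> g (mtx);   // locking overhead per request
    q.push_back (std::move (m));
    not_empty.notify_one ();
  }
  std::function<void()> remove ()          // called by a pool worker thread
  {
    std::unique_lock<std::mutex> g (mtx);
    not_empty.wait (g, [this] { return !q.empty (); });
    auto m = std::move (q.front ());
    q.pop_front ();
    return m;                              // executed in the worker's thread
  }
};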


Overheads Associated With The Active Object Pattern

Scheduling and Execution Overheads - This pattern can potentially increase context switching, data movement and synchronization overheads, depending on how the scheduler is implemented, due to the scheduling and execution overheads (in user space or kernel space) of multiple active objects. With an increase in the number of independently executing threads, this overhead might become more significant than the gains obtained by increasing the level of concurrency.

Locking/Unlocking Overheads - Since this pattern consists of multiple threads of control executing independently, mutual exclusion mechanisms are required to ensure atomicity of operations on shared resources like global data, variables, etc.

Fig. 30 Measurement Results for the handle_events() loop of the Reactor Pattern (panels: Elapsed Time v/s System Load and CPU Usage v/s System Load, handle_events() for the Reactor Pattern; Server Service Time = 10 msecs, Client Think Time = 100 msecs (mean); x-axis: Number of Clients, 1-25)

This implies that this pattern has a locking/unlocking overhead associated with each service request, whose impact depends on the manner in which these mutual exclusion mechanisms have been implemented.

5.4.2 Detailed Measurement Results and Performance Analysis

This section describes the experimental setup of the system which was used for conducting measurements on the Reactor and Active Object patterns, and the analysis of the measurements obtained. The system parameters were kept the same as discussed in Section 5.3, with the server service time fixed at 10 msecs and the exponential mean of the client think time fixed at 100 msecs. At this point it should be mentioned that the accuracy of the measurements presented in this section is not very good (around a 90% confidence interval) due to the low order of magnitude of the measured overheads and to the fact that the operating system collects this type of measurement by sampling. Even though the curves are noisy, the general trend can be easily observed.

Measurements on the Reactor pattern - As discussed in Section 5.4.1.1, the execution path of interest in the Reactor pattern is the dispatch()-select()-handle_events() operation, including the register_handler() and remove_handler() operations. Accordingly, these methods were instrumented both for elapsed time (which captures queuing delays, including those caused by mutexes) and for the actual CPU execution time of these methods.
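The general bracketing technique can be sketched as follows (an illustration using standard POSIX calls with per-process granularity; the thesis measurements themselves used the ioctl()-based facility described in Section 2.4.2):

// Bracketing a measured method with wall-clock and CPU-time probes.
#include <sys/resource.h>
#include <sys/time.h>
#include <cstdio>

static double wall_secs ()
{
  timeval tv;
  gettimeofday (&tv, nullptr);
  return tv.tv_sec + tv.tv_usec * 1e-6;
}

static double cpu_secs ()                     // user + system CPU time
{
  rusage ru;
  getrusage (RUSAGE_SELF, &ru);
  return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec * 1e-6 +
         ru.ru_stime.tv_sec + ru.ru_stime.tv_usec * 1e-6;
}

void measured_call (void (*method) ())
{
  double w0 = wall_secs (), c0 = cpu_secs ();
  method ();                                  // e.g. the handle_events() loop
  // Elapsed time includes blocking (select(), mutex waits);
  // CPU time counts only actual execution.
  std::printf ("elapsed=%.6f s  cpu=%.6f s\n",
               wall_secs () - w0, cpu_secs () - c0);
}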


Fig. 30 shows the measurement results for the handle_events() loop of the Reactor pattern. This loop is the main event driver for the Reactor pattern and it consists of the wait_for_multiple_events() method (which encapsulates the select() mechanism), as well as the dispatch() method, which performs the actual call-back to the appropriate Event Handler. Two separate sets of measurements were conducted, one including the select() system call and one without it. From the graph of CPU usage, it can be seen that the Reactor pattern overhead is constant, but the overhead of the select() system call increases as the number of clients goes up, due to the increase in the number of registered file descriptors (or handles) that select() has to check. From the graph of elapsed time, it can be seen that the elapsed time with select() is consistently higher than the one without select(), which can be attributed to the fact that the application blocks on the select() system call, waiting for events to occur. This measurement result is more dependent on the system-wide behavior than on the pattern itself (as it depends on the frequency of arrival of service requests).

As shown in Fig. 31, the CPU demand for registering a particular Event Handler with the Reactor, as well as the demand for de-registering, are constant. For this particular implementation, as mentioned before in Section 5.4.1.1, the main thread (which enqueues a service request) is alone responsible for de-registering a particular Event Handler, using the remove_handler() method, and hence does not incur any queuing overheads for acquiring mutexes (as shown in the graph of elapsed time v/s system load). On the other hand, a thread which is carrying out a particular service for an Event Handler is responsible for re-registering it, using the register_handler() method, and so has to compete with other threads for acquiring mutexes (in order to ensure atomicity of updates of the Reactor's internal tables).


Fig. 31 Measurement Results for the register_handler()/remove_handler() methods (panels: Elapsed Time v/s System Load and CPU Usage v/s System Load, register_handler()/remove_handler() for the Reactor Pattern; Server Service Time = 10 msecs, Client Think Time = 100 msecs (mean))

Thus, re-registering an Event Handler by a thread involves blocking on mutexes acquired by other threads, the effect of which is felt in the elapsed time (Fig. 31). However, the actual CPU overhead is insignificant compared to the total overall overhead (as summarized in Fig. 39).

From these results, we can further explain the differences between the different threading

models shown in Fig. 23 and Fig. 24. The Reactor is responsible for event demultiplexing and for performing call-backs to the appropriate Event Handler object. The thread-pool server requires such an event demultiplexer to act as a work producer for the threads in the thread pool. When an event occurs, the Reactor calls back to a pre-registered Event Handler, which enqueues the request received in order for it to be served by one of the threads in the thread pool. This introduces an additional queue in the server, in addition to the system-level queue. This repeated event demultiplexing and call-back, enqueuing and dequeuing, and the associated acquiring and releasing of mutexes introduces additional overheads in the thread-pool server compared to the thread-per-client server, due to which its throughput is correspondingly lower.

Fig. 32 Measurement Results for the insert()/remove() methods (CPU Usage v/s System Load, insert()/remove() for the Active Object Pattern; Server Service Time = 10 msecs, Client Think Time = 100 msecs (mean); x-axis: Number of Clients, 1-25)

On the other hand, the thread-per-client server has a thread permanently associated with each client, each of which is assigned an independent socket to listen to. Hence, it does not require an event demultiplexer like the Reactor, and thus avoids the above-mentioned overheads incurred by the thread-pool server.


Measurements on the Active Object Pattern - Section 5.4.1.2 shows that the most frequent path of execution, and hence the path causing the greatest impact on system performance, involves the enqueuing and dequeuing of service requests, performed by the insert()/remove() methods. As in the Reactor pattern, these methods were instrumented, and the results are shown below. As can be seen from Fig. 32, the CPU demands for both the insert() and remove() methods are constant and similar to each other. The enqueuing of a service request is done by the main thread, while the dequeuing of a service request, and the subsequent service, is carried out by the threads in the thread pool. Due to this, all threads in the thread pool try to acquire mutexes to execute the remove() method, with the result that one thread in the pool has acquired the mutex and is blocking for a service request to be enqueued, while all the other threads are blocked waiting to acquire the mutex. This would lead to large variations in the measured elapsed time for dequeuing requests, and so elapsed time measurements were not done.

Fig. 32 provides further insight into the system under study. The thread-pool server requires a global queue in which the Event Handlers can enqueue service requests, and from which the worker threads in the pool can dequeue them for providing service. Each enqueue/dequeue operation is synchronized by mutexes to ensure atomic operations, which, however, incurs the additional overhead of acquiring/releasing mutexes. These overheads, combined with the ones mentioned above, lead to the better performance of the thread-per-client server compared to the thread-pool server.

In all the cases considered, the thread-per-request model's performance is consistently poorer than the others'. This can be explained by the fact that this model incurs the overhead of creating a thread for each and every request, the overhead of cleaning up after each request has been serviced (around 3 msec of CPU), as well as the overheads associated with threaded servers, like thread scheduling, context switching between threads (in addition to context switching between processes), acquiring/releasing mutexes, etc., all of which contribute to its poorer performance. Also, if each spawned thread allocated additional system resources like virtual memory etc., the performance of the thread-per-request server would degrade further.

Thus, as can be seen from the previous sections, design patterns can be used to guide the performance measurements and analysis of performance-sensitive object-oriented systems, like the present case study. The next section presents a comparison between the thread-per-client model and the thread-pool model, from the point of view of system resource consumption.

Fig. 33 Memory Consumption due to Threads (Memory Consumption v/s Number of Threads Spawned; curves: total and Resident Size; x-axis: Number of Threads)


5.5 Comparison between the Thread-per-Client and the Thread-Pool Server Models

In general, whenever the issue of multi-threaded servers arises, it is usually a case of making a choice between a thread-pool server and a thread-per-client (also known as a thread-per-session) server. Spawning a thread consumes vital system resources. Fig. 33 shows the increase in the size of a process (both in memory and resident size, for Solaris 2.5) as the number of threads spawned increases. This consumption of system resources can be an extremely strong incentive for choosing the thread-pool model implementation over the thread-per-client (even if the latter has better performance than the former, as shown in the previous section).

Fig. 34 Effect of Different Service Times and Message Sizes (Comparison of Thread-Per-Client and Thread-Pool Models; x-axis: Number of Clients)

If the system load is so high that it causes an unacceptable amount of context switching


between threads in the thread-per-client model, it is highly probable that the thread-pool model's performance will be on par with, or even better than, the former's. In the previous implementation of the thread-pool model, the message size itself was insignificant, thereby causing the Reactor overhead to become significant. As shown in Fig. 34, two more experiments were performed for the layered server architecture: (i) the message size was increased to 1024 bytes (as compared to the previous cases, when it was 4 bytes) and (ii) the server service times were increased twentyfold to 200 msecs, keeping all the rest of the parameters constant. In the case of the increased service times, it can be seen that the thread-pool model's performance equals that of the thread-per-client model, which can be explained by the fact that the Reactor overhead now becomes insignificant.

In the case of the increased message size, the thread-pool model performs as well as the thread-per-client until the system load increases beyond 15 clients. This can be attributed to the fact that the CPU demand of the operating system's read/write calls (carried out by the threads) is significant compared to the overhead incurred by the Reactor's demultiplexing of I/O events, and since this cost is incurred by both models, they are both equally affected. Two factors contribute to the thread-pool model's poor performance, compared to the thread-per-client, beyond 15 clients: (i) the Reactor overhead of registering and de-registering (3 msecs) is significant compared to the overall service time of 10 msecs, and (ii) the number of threads is fixed at 15, which means that all of them become busy as soon as the system load increases beyond 15 clients, thereby causing internal queuing in the mid-level server, in addition to queuing at the operating system level.


Another deciding factor in choosing between the two models could be the number of system file descriptors available. These experiments were conducted on a system running Solaris 2.5, which had a maximum of 64 file descriptors. In the case of the thread-per-client model, the number of file descriptors required equals the number of clients requesting services. This limitation arises due to the fact that there is no event demultiplexer, like the Reactor in the thread-pool model, so each thread has to be assigned a unique socket for communication with its client (and hence a unique file descriptor). The thread-pool model, on the other hand, can avoid this limitation by assigning a socket to one or more clients, thereby freeing file descriptors for other operations, like communicating with other servers.

To sum up, multi-threading a pure server does not produce performance gains, and in fact may cause performance to deteriorate due to the hardware system bottleneck. In order to gain benefits from multi-threading, a layered server architecture has to be adopted, so that the bottleneck can be pushed down to lower levels by increasing the concurrency level (i.e. the number of threads) of the mid-level server. Also, whenever the issue of multi-threading arises, as shown in the previous sections, it is often better to opt for the thread-pool model in comparison to the thread-per-client model, especially from the point of view of system resource consumption. Another important point is that an abstract behavioral model, such as that provided by design patterns, is especially useful when a system is built with reusable components, both for understanding what the reused software does and what its contribution to system performance is. The ACE patterns used to design and implement the various multi-threading alternatives helped (i) to understand how the reusable ACE

components work, (ii) to design and implement different multi-threaded servers, (iii) to identify execution paths critical for performance and, finally, (iv) to decide what to measure and how to instrument the code. In this case, the Reactor and Active Object patterns hold the key to understanding the system behavior, as they control the concurrency level in the server.

The next chapter concentrates solely on the thread-pool server and examines the impact of assigning priorities to clients with different workloads.


Chapter 6

Priority Software Servers

In a client/server distributed system, as shown in the previous chapters, the server is often the bottleneck (it could be a hardware bottleneck or a software one). It was shown that it is possible to reduce the effect of the bottleneck by using multiple copies of the servers, either by having multiple independent server processes or by using multi-threading. This chapter studies the impact of priority scheduling on multi-threaded servers. Two cases are considered: (i) associating priorities with client classes (as required in certain applications) and (ii) associating priorities with the different services provided by the server. The idea behind the second approach is to use the priorities in such a manner that the overall client waiting time for a response is reduced.

This chapter studies the design of the head-of-line (HOL) priority server, applied to the thread-pool server, for both the pure and layered server architectures. The HOL software priority server was chosen because it avoids the complex programming required in the case of a pre-emptive software server (which requires saving the state of each service request being pre-empted and reproducing it on its resumption, thereby leading to unnecessary complexity). The chapter provides measurement results for both these architectures for different priority combinations and presents a detailed analysis of the results obtained.

6.1 Priority Server Design

This section provides an overview of the design and implementation of the priority software server, for both the pure and layered server architectures. The priority server is based on the thread-pool server model (explained in Section 4.2.2.3), and implements the Acceptor/Reactor patterns, in addition to the Active Object pattern (used to implement the pool of threads). The design and implementation of the priority server is mostly similar in all respects to the thread-pool server, the only exception being the design of the global message queue used to exchange service requests between the main thread and the worker threads in the thread pool. The next subsection discusses the design of the priority queue.

Fig. 35 Heap Implementation of the Priority Queue

6.2 Priority Queue Design

A standard FIFO queue is simply a big buffer: the first object pushed into the queue is also the first object removed. This is the kind of data structure a simple operating system might use to handle, say, requests to access a hard drive. However, most operating systems want to include the concept of a priority attached to each task. The scheduler would then give the next slot of CPU time to the task (or service request) with the highest priority. A simple implementation of such a queue would be to store all entries, according to priority, in a list and then simply remove the head of the list. Although this is a simple implementation, it is relatively inefficient: the code will run in O(N) time, where N is the length of the queue. A faster solution is to implement the priority queue as a heap, shown in Fig. 35.

Any list containing N elements is considered a heap if each element i is greater than, or equal to, each of its two children, elements 2i+1 and 2i+2. These two nodes can be considered to be children in a binary tree. For example, as shown in Fig. 35, Element[1] has a value of 8, which is greater than or equal to each of its two children, Element[3] and Element[4] respectively. Although it looks like a sorted binary tree, it is not, which is illustrated by Element[4], which is larger than Element[2] and thus violates the rules for a sorted binary tree. Some of the advantages of using a heap as a priority queue are outlined below:

For any given heap, Element[0] must be the largest element in the array. Thus, as long as the data structure is a heap, the highest-priority element in the queue can be accessed in constant time.

There are no additional storage requirements. The shape of the heap is a simple binary tree, with each node's children always residing in the same spot. Children or parents can be found by using simple index math.

Adding a node to the heap is an O(log N) operation.

Removing a node from the tree is an O(log N) operation.

Its performance is superior to other sorted tree implementations because it does not sort the entire tree. Instead, it just ensures that the element at the top of the tree is the largest in the heap. Thus, adding or deleting leaf nodes does not have to shuffle the entire tree. A sketch of such a heap-based queue follows.
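The following minimal sketch (illustrative, not the thesis class) makes the index math explicit: children of element i sit at 2i+1 and 2i+2, insertion bubbles up, and removal sifts down, both in O(log N).

// Binary max-heap used as a priority queue; highest priority at heap[0].
#include <utility>
#include <vector>

class Heap_Queue
{
  std::vector<int> heap;                   // element = priority value
public:
  const int& top () const { return heap[0]; }   // O(1) access to the maximum

  void insert (int v)                      // O(log N): bubble up
  {
    heap.push_back (v);
    for (std::size_t i = heap.size () - 1; i > 0; )
    {
      std::size_t parent = (i - 1) / 2;    // inverse of the 2i+1 / 2i+2 rule
      if (heap[parent] >= heap[i]) break;  // heap property restored
      std::swap (heap[parent], heap[i]);
      i = parent;
    }
  }

  int remove ()                            // O(log N): pop max, sift down
  {                                        // precondition: queue not empty
    int max = heap[0];
    heap[0] = heap.back ();
    heap.pop_back ();
    for (std::size_t i = 0; ; )
    {
      std::size_t l = 2 * i + 1, r = 2 * i + 2, largest = i;
      if (l < heap.size () && heap[l] > heap[largest]) largest = l;
      if (r < heap.size () && heap[r] > heap[largest]) largest = r;
      if (largest == i) break;
      std::swap (heap[i], heap[largest]);
      i = largest;
    }
    return max;
  }
};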

The priority queue used for the experiments is designed as a parameterized class, accepting the type of the mutual exclusion mechanism used, the user-defined Svc_Handler, and the comparison operator (the type of the priority, i.e. integer, double, etc.) as its parameters. It is based on the priority queue implementation provided by the HP Standard Template Library (STL), which contains a set of class templates and their implementations for performing various tasks [Nelson95].
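By way of illustration, a similarly parameterized queue can be built on the standard priority_queue; the template parameters below mirror the description (lock type, handler type, comparison operator), but the class itself is an assumption, not the thesis implementation.

// A parameterized priority queue in the spirit described above.
#include <mutex>
#include <queue>
#include <vector>

template <class LOCK, class SVC_HANDLER, class COMPARE>
class Priority_Request_Queue
{
  struct Entry
  {
    double priority;                 // e.g. an integer or double priority
    SVC_HANDLER* handler;            // reference to the client's Svc_Handler
  };
  struct Cmp                         // adapt COMPARE to the entry type
  {
    bool operator() (const Entry& a, const Entry& b) const
    { return COMPARE () (a.priority, b.priority); }
  };
  LOCK lock;                                               // e.g. std::mutex
  std::priority_queue<Entry, std::vector<Entry>, Cmp> q;   // heap-based
public:
  void insert (double prio, SVC_HANDLER* h)
  {
    std::lock_guard<LOCK> g (lock);  // atomic insert
    q.push (Entry{prio, h});
  }
  SVC_HANDLER* remove ()             // highest-priority entry first
  {
    std::lock_guard<LOCK> g (lock);
    if (q.empty ()) return nullptr;
    SVC_HANDLER* h = q.top ().handler;
    q.pop ();
    return h;
  }
};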

6.3 Service Priority

In some applications, it is necessary to associate different priorities with different classes of clients (depending on how important a client is, how much it pays, etc.). The priority of a client issuing a service request is inherited by the server thread executing the request.

The service priority experiments were conducted with four classes of clients, distinguished by their service requirements, with all classes having the same number of clients. The priority of a client remained constant throughout the duration of the experiment. Also, for simplicity, we assumed that each class of clients called a single service, with different service requirements for each class. In order to study the impact on performance of service priority combined with different service requests, we ordered the client classes by the length of service required from the server, then conducted two sets of experiments: (i) the class having the highest service requirements was given the maximum priority and (ii) the class having the lowest service requirements was given


the maximum priority. The experiments were done for both the pure and layered server architectures.

6.4 Experimental Setup

Although the first half of the thesis used DECALS to set up and control the experiments, UNIX batch scripts were used to conduct these experiments [as explained in Appendix B]. Initially, the priority of all the threads in the thread pool was kept equal, independent of the priority of the client they were serving. However, it was found that the differences between the different cases were more accentuated when the worker threads themselves inherited the priority of the client, so all cases presented here are of this kind.

The ratio of the four different service times was fixed first at 1:2:5:10, with the lowest service time being 10 msecs. However, different service ratios were also studied: a ratio of 1:2:3:4 and one of 1:5:50:100 are also presented in Section 6.5.1.1. The message size was kept constant at 24 bytes (this increase from the 4 bytes in Chapter 5 was due to the additional information which had to be passed back to the client about its priority, for logging purposes) and the socket size was fixed at the default size of 8K. Each client execution cycle consisted of an exponential think time with a 500 msecs average, and an active time, during which it sent requests and received replies from the server. The number of worker threads in the thread-pool server was kept constant at 10, for the following reasons:

The performance gain achieved by increasing the number of threads in the thread pool diminishes as the number of threads increases above a certain level. The increase in throughput achieved by changing the number of threads from 10 to 20 was insignificant.


Cases with 2 and 5 threads were also studied, showing a significant reduction in the achieved throughput compared to the case with 10 threads. Thus, keeping the number of threads at 10 represents a good tradeoff in this system (as shown in Section 6.5.1.2).

For the layered server case, each thread, serving a Svc_Handler in the mid-level server, had an open connection to the corresponding Svc_Handler in the low-level server. Although each Svc_Handler in the mid-level server could have had an open connection to the corresponding Svc_Handler in the low-level server (instead of having connections on a per-thread basis), the number of connections active at a time would be defined by the number of worker threads in the pool, as they were responsible for communicating with the low-level server. Consequently, the fewer the threads, the fewer connections were required, which meant that there were that many more unused file descriptors in the system. Thus, the number of threads was fixed at 10.

Due to the increase in the number of users and the corresponding increase in the network traffic, the number of sends/recvs was reduced from 300 to 100 for every client. All tests were repeated 10 times to account for performance variation due to transient load on the networks and hosts. The test results obtained were within a confidence interval of ± 2% of the mean at a 95% confidence level [Appendix B].

6.5 Measurement Results and Analysis for Service Priority

This section presents the measurement results for the different cases, for the pure and layered server architectures.


Fig. 36 Measurement Results for the Pure Server Arch. (Low Priority - Low Service Times) (Class Throughput v/s System Load, Pure Server Arch., class with lowest service time has lowest priority, Service Priority; x-axis: Number of Clients)

6.5.1 Measurement Results for the Pure Server Architecture

Fig. 36 summarizes the performance results for the pure server architecture, with the highest priority given to the class of clients having the largest service requirements. Initially, the classes with lower service times (and hence lower priority) outperform the classes with higher priorities (and higher service times), which is a combination of two facts: (i) their service times are the lowest, and hence require the least time to complete, and, more importantly, (ii) as the number of clients in the system is fewer than the number


of threads in the pool, there are some threads free to service the lower-priority classes of clients. However, at around 10 clients, when all the threads in the pool (fixed at 10) become busy serving requests, the effect of priority kicks in, and the throughput of the higher-priority class of clients (with higher service times) starts climbing, with the lower-priority clients spending more time waiting in the priority queue and getting a lesser amount of CPU resources (i.e. they undergo service starvation), due to the fact that all threads in the pool start giving priority to service requests coming from the higher-priority classes of clients.

Fig. 37 Measurement Results for the Pure Server Arch. (High Priority - Low Service Times) (panels: Class Cycle Time v/s Number of Clients and Class Throughput v/s System Load, Pure Server Arch., class with lowest service time has highest priority, Service Priority)

On the other hand, as shown in Fig. 37, when the class of clients with the lowest service times is given maximum priority, the results are quite different. In this case, clients with lower service times have higher priority, and so more such clients get serviced, with the result that the throughput achieved is much higher. This can be attributed to the fact that the average waiting time in the service queue, per class of clients, is reduced considerably. Fig. 38 compares the efficiency of CPU usage in the two cases, showing that at lower loads, the CPU is used more efficiently when the class with the highest service request has higher priority (note that the clients with longer services contribute the most to CPU utilization per request). The situation changes at higher loads, when higher priority for the shortest clients leads to a higher CPU utilization and a higher throughput overall. The total CPU utilization, in Fig. 38, is a combination of the CPU utilizations by the different classes of clients, as well as the class independent CPU resources consumed due to overheads.
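In equation form (notation introduced here for clarity, not taken from the original), the total utilization plotted in Fig. 38 decomposes as

$U_{total} = \sum_{c} U_c + U_{ovh}$

where $U_c$ is the CPU utilization due to requests of class $c$ and $U_{ovh}$ accounts for the class independent overheads itemized in Fig. 39.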

Fig. 38 Thread-Pool Server CPU Utilizations for the Pure Server Arch. (With Priority): total CPU utilization v/s system load, for the two priority assignments.

In order to get a better idea of the consumption of CPU resources by different components in the server, a breakup of the class independent server components, and their corresponding CPU requirements, is outlined. When a client issues a service request to the server, the Reactor::handle_events() loop, blocking on a given set of I/O file descriptors, demultiplexes it and calls back to the handle_input() method of the appropriate Svc_Handler. On receiving this callback, this Svc_Handler de-registers itself from the Reactor (as explained in Section 5.4.2), locks the global priority queue, inserts a reference to itself (and hence to the client) at the appropriate place in the heap-based queue, according to its priority, and re-constructs the heap. It then returns, thereby freeing up the Reactor to service other requests. At some later time, one of the idle worker threads in the pool removes the reference to the Svc_Handler from the queue, and then uses this reference to perform the actual message reception, to service the request and to return the results back to the appropriate client. Thus, in addition to the actual service processing done, CPU overheads are incurred in the Reactor::handle_events() (responsible for event demultiplexing), the Svc_Handler::handle_input() itself (which receives the callback from the Reactor), Reactor::remove_handler() (which de-registers the Svc_Handler from the Reactor in order to free it), queue::insert() (which inserts the item in the queue), queue::rebuild_heap() (which rebuilds the heap structure after item insertion), queue::remove() (which removes the head of the queue), Svc_Handler::recv() and Svc_Handler::send() (actual message reception and sending), and finally Reactor::register_handler() (which re-registers the Svc_Handler with the Reactor to facilitate callbacks). These methods were instrumented to find out the actual CPU overheads in the multi-threaded server (measurements were carried out using the ioctl() system call, as explained in Section 2.4.2), which are summarized in Fig. 39 below:

Reactor::handle_events() with select()
handle_input() for the user defined Svc_Handler
Reactor::remove_handler() for de-registering
Queue::insert() for inserting an item into the priority queue
Queue::rebuild_heap() for rebuilding the heap
Queue::remove() for removing an item from the priority queue
Svc_Handler::recv() for receiving a message
Svc_Handler::send() for sending a message
Reactor::register_handler() for re-registering

Total approximate CPU overheads: 2.66

Fig. 39 CPU Overheads of the Different Components of the Thread-Pool Server
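To make this control flow concrete, the following is a minimal C++ sketch of the hand-off between the event loop callback and the pool of worker threads. It is our own illustration, not the thesis code: the Handler type and its methods are hypothetical stand-ins for the ACE Svc_Handler and Reactor operations named above, and a std::priority_queue replaces the hand-built heap.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical stand-in for a per-connection Svc_Handler.
struct Handler {
    int priority;              // class-dependent service priority
    void suspend() {}          // models Reactor::remove_handler()
    void resume() {}           // models Reactor::register_handler()
    void recv_serve_send() {}  // models Svc_Handler::recv()/send() plus the service
};

// Comparator placing the highest-priority handler at the head of the heap.
struct ByPriority {
    bool operator()(const Handler* a, const Handler* b) const {
        return a->priority < b->priority;
    }
};

std::mutex q_lock;
std::condition_variable q_cv;
std::priority_queue<Handler*, std::vector<Handler*>, ByPriority> ready;

// Called back by the event demultiplexing loop when a request arrives.
// Returning quickly frees the loop to demultiplex other requests.
void handle_input(Handler* h) {
    h->suspend();                             // stop further callbacks on this handler
    {
        std::lock_guard<std::mutex> g(q_lock);
        ready.push(h);                        // insert + heap rebuild
    }
    q_cv.notify_one();                        // wake an idle worker
}

// Body executed by each of the (10) pool threads.
void worker_loop() {
    for (;;) {
        Handler* h;
        {
            std::unique_lock<std::mutex> g(q_lock);
            q_cv.wait(g, [] { return !ready.empty(); });
            h = ready.top();                  // highest priority request
            ready.pop();                      // remove the head of the queue
        }
        h->recv_serve_send();                 // receive, service, reply
        h->resume();                          // re-register for future callbacks
    }
}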

6.5.1.1 Effect of Changing Service Time Ratios

Fig. 40 shows the measurement results obtained by changing the ratio of the service times between classes from 1:2:5:10 to 1:2:3:4 and 1:5:50:100, for the server in which lower service time clients have lower priority, with the lowest service time still fixed at 10 msecs.

It can be seen that, although the general pattern of the results is the same as in Fig. 36, the throughputs of the different classes are much closer to each other for the case where the service time ratios are 1:2:3:4, due to a reduction in the impact of service starvation on the lower priority clients. However, in the case where service time ratios were increased to 1:5:50:100, the impact is dramatic. Initially, the throughput of the lower priority clients (with lower service times) is much higher than that of the higher priority clients (with higher service times), as there are still some threads in the thread pool free to serve them. However, at around 12 clients, when all the threads in the pool become busy serving the higher priority clients, the lower priority clients suffer from severe service starvation, and at around 24 clients, when all the threads in the pool are serving only higher priority requests, their throughput drops to near zero.

Fig. 40 Effect of Changing Service Time Ratios: class throughput v/s system load (Pure Server Arch., class with lowest service time has lowest priority), for ratios reduced to 1:2:3:4 and increased to 1:5:50:100.

Thus, although the general pattern of the results remains the same as those obtained in Fig. 36, service time ratios affect the degree of service starvation felt by the lower priority clients.

6.5.1.2 Impact of Changing the Number of Threads in the Thread-pool

In order to achieve optimum performance with the least possible number of threads, the above experiments were repeated with 2, 5, 10 and 20 threads in the pool. Fig. 41 shows the throughputs of the class with the lowest service time (10 msecs) and lowest priority, for 2, 5, 10 and 20 threads. It can be seen that there is a marked improvement in the class throughput when the number of threads in the pool is 10 or 20, compared to the cases with 2 and 5 threads.

In the case where the thread pool has 2 or 5 threads, the bottleneck is the number of threads itself, as there are too few threads in the thread-pool to handle the number of simultaneous requests coming in, thereby increasing the total queuing delay seen by the lower priority clients. However, with 20 threads, the improvement in the class throughput is not very significant compared to the case with 10 threads. This can be attributed to the fact that with 20 threads, the overheads due to scheduling, managing and context switching between these threads are quite significant, compared to the case with 10 threads.

Fig. 41 Impact of Changing the Number of Threads in the Thread-pool: throughput v/s system load (Pure Server Arch.) for the class with the lowest service time (10 msecs) and lowest priority, with 2, 5, 10 and 20 threads.

Thus, the thread-pool with 10 threads is able to achieve a better balance between the overheads incurred and the service requested, and so the number of threads in the thread-pool was chosen to be 10.
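As a brief sketch of how the pool size enters the code (the start_pool function and its default are hypothetical, reusing the worker_loop from the sketch in Section 6.5.1 above):

#include <thread>
#include <vector>

void worker_loop();  // defined as in the earlier sketch

// Spawn a fixed-size pool; 10 was the value chosen empirically above.
void start_pool(int n_threads = 10) {
    std::vector<std::thread> pool;
    for (int i = 0; i < n_threads; ++i)
        pool.emplace_back(worker_loop);
    for (auto& t : pool)
        t.join();    // workers run until process shutdown
}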

6.5.2 Measurement Results for the Layered Server Architecture

Fig. 42 summarizes the results for the layered server architecture, for the case in which clients having lower service times have lower service priorities.

As in Section 5.3, the service requested by the clients is assumed to consist of a CPU intensive operation performed by the mid-level server, and an I/O intensive operation performed by the low-level server, although the total service time seen by clients of a particular class remains the same as in Section 6.5.1. Also, as in Section 5.3, the low-level server is based on the thread-per-client model and has a one-to-one communication channel open with each of the threads in the thread pool, so there is no queuing at the low-level server. As expected, although the pattern of the results obtained remains the same as for the pure server case, the obtained throughput is much higher. This is because, for the layered server architecture, the actual CPU consumption of the mid-level server is halved (as the total service time requested by the client is divided equally between the CPU intensive operation and the I/O intensive operation, although the total is the same as in the pure server architecture); consequently there is that much more free CPU capacity.
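In our notation (not the thesis's), if a class requests a total service time of $S$ per cycle, the mid-level server's CPU demand per request is only $D_{mid} = S/2$, with the remaining $S/2$ spent as I/O at the low-level server, which is why there is correspondingly more free CPU capacity at the mid-level server for the same offered load.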

Fig. 42 Measurement Results for the Layered Server Arch. (Low Priority - Low Service Times): class cycle time and class throughput v/s system load (number of clients), for the case where the class with the lowest service time has the lowest priority.

Fig. 43 presents the measurement results for the layered server architecture, for the case where clients with lower service times have higher priority. In this case also, the service requested by the clients is assumed to consist of a CPU intensive operation performed by the mid-level server, and an I/O intensive operation performed by the low-level server, although the total service time seen by clients of a particular class remains the same as in Section 6.5.1. As explained in Section 6.5.1, this design reduces the overall waiting time in the priority queue seen by the clients with lower priority (i.e. higher service times), due to the increase in the number of lower service time clients (i.e. higher priority) being served. In other words, this design utilizes the system in a better way compared to the case where clients with lower service times have lower priority, which can be seen from Fig. 44, which compares the CPU utilizations of the mid-level server for the two cases.

Fig. 43 Measurement Results for the Layered Server Arch. (High Priority - Low Service Times): class cycle time and class throughput v/s system load (number of clients), for the case where the class with the lowest service time has the highest priority.

Initially, the CPU utilization of the mid-level server for the case where clients with lower service times have lower priority is approximately the same as for the case where clients with lower service times have higher priority.

Fig. 44 Comparison of CPU Utilizations for the Layered Server Arch. (With Priority): mid-level server CPU utilization v/s system load, for classes with lower service times having lower priority and for classes with lower service times having higher priority.

All the cases studied above had clients whose priority was decided according to their service requirements. The next section examines the impact of assigning priorities according to the class of the client, instead of its service requirement.

6.6 Measurement Results and Analysis for Class Priority

This section presents measurement results for the layered server architecture for the case where, unlike the previous cases, each client had a choice of selecting one of two different service times with equal probability, for every service request. The system was assumed to consist of two classes of clients, with their class deciding their priority.

Fig. 45 presents a comparison of the mean response times between the two classes of clients, for the case where priority was decided by the class of the client. The service times requested by the clients were taken to be the extremes of the cases presented in the previous sections, with the ratio between them fixed at 1:20 and the lowest service time being 5 msecs.

Fig. 45 Measurement Results for Class Priority for the Layered Server Arch.: class cycle time and class throughput v/s system load (number of clients); clients have variable service times (5 msecs and 100 msecs), with priority decided according to the class of clients.

It can be seen that the clients belonging to the lower priority class are penalized because of their class, even though their service requirements are the same as those of the clients having higher priority. Such an approach is usually not practical, unless it is the application itself which imposes class priority restrictions. To sum up, if workload dependent priorities are to be imposed by the application, assigning higher priorities to shorter jobs makes the best use of the available system resources. In real systems, it might happen that exact service requirements are not known beforehand. However, even rough estimates can be used to decide which service requests would be of shorter duration, and hence should be served first. For example, in a database application, retrieving data most often consumes less CPU resources than modifying data, which would lead us to assign a higher priority to a retrieve operation as compared to a modify operation.
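A minimal C++ sketch of this shortest-job-first rule (the operation names and time estimates are hypothetical, chosen only to mirror the retrieve/modify example above):

enum class Op { Retrieve, Modify };

// Rough a-priori estimates of service duration (msecs); exact
// service requirements need not be known in advance.
int estimated_service_time(Op op) {
    return op == Op::Retrieve ? 10 : 100;
}

// Shortest-job-first: a smaller estimated service time maps to a
// numerically higher scheduling priority.
int priority_of(Op op) {
    return -estimated_service_time(op);
}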

Furthermore, if system performance is to be optimized, it is better to opt for a layered server architecture, compared to a pure server architecture, as it can take advantage of the processing power of multiple CPUs (although this is true only where network delays are not significant, most modern networks have much higher bandwidth and are as fast, if not faster, than existing CPUs, thereby making the use of such architectures extremely feasible).

Finally, the next chapter presents some concluding remarks and directions for future research.


Chapter 7

Conclusions

Over the past decade, there has been an increasing trend in building distributed applications based on a client/server (C/S) paradigm. Most of these distributed applications have an inherently concurrent nature, as more than one service request from different clients may arrive at the servers at the same time. Such distributed applications can benefit from using a concurrent model of execution to perform their functions, as queuing delays caused by contention between requests for services are reduced when several requests can be processed at the same time. Although concurrency can be provided by using multiprocessor platforms, the complexity of such applications usually makes such implementations infeasible. Instead, most uni-processor applications rely on multi-threading to provide concurrent services. Multi-threaded server applications can handle simultaneous requests for service by allowing each thread to deal with one request at a time, in parallel with other threads. This not only simplifies the code and improves its understanding, but also, for some system architectures, improves the overall performance of the system. Various communication middlewares (like DCE, CORBA and ACE implementations) make use of multi-threading to provide concurrency.

This thesis compares several approaches for implementing multi-threaded servers in a distributed environment using the thread encapsulation library of the ACE object-oriented reusable component communication toolkit, and identifies cases where multi-threading yields performance dividends.

This thesis also introduces the idea of design-pattern guided measurements to analyze the performance of the object-oriented systems under study, and promotes the idea of using patterns to better understand system runtime behavior and to identify frequent, critical execution paths, due to their focus on the interactions between key participants in a software architecture, rather than overwhelming the developer with details. It shows that patterns are very useful not only for documenting reusable components and teaching application programmers how to use them, but also for characterizing their performance and providing a basis for performance measurements and analysis.

In this thesis, several different multi-threaded server models based on two different architectures were examined from a performance viewpoint. In the case of the pure server architecture, multi-threading a software server which does not require the services of any other server except its own CPU does not produce any performance gains, as the system bottleneck is a hardware one, i.e. the CPU itself. Multi-threading in such a case can be detrimental to system performance due to execution overheads, as illustrated by the measurement results. However, in the case of the layered server architecture, the mid-level server uses the services of the low-level server, in addition to its own CPU. In such a case, multi-threading the mid-level server, which represents a software bottleneck, will bring substantial performance gains. By increasing the number of concurrent threads in such a server, the bottleneck is pushed down to lower levels, allowing for a more efficient use of the servers below and thereby increasing the overall system performance. Also, whenever the issue of multi-threading arises, as shown by the measurements, it is often better to opt for the thread-pool model, in comparison to the thread-per-client model, especially from the point of view of system resource consumption.

This thesis also examined the impact of assigning priorities to service requests, for the pure and layered server architectures. Two different cases were studied for both architectures: (i) assigning lower priorities to clients whose jobs had lower service times, and (ii) assigning higher priorities to clients whose jobs had lower service times. It was found that if workload dependent priorities are to be imposed by the application, assigning higher priorities to shorter jobs makes a better use of the available system resources. Also, due to the increasing speed and bandwidth of modern networks, network induced delays are almost insignificant, which makes it better to opt for a layered server architecture, compared to a pure server architecture, as it can take advantage of the processing power of the different workstations present in the network.

7.2 Future Work

The work presented in this thesis can be extended in several ways:

Extend the range of parameters for the different cases studied, and cover more thoroughly the parameter space.

Use results from the thesis to build performance models for multi-threaded servers and to validate those models.

Apply the measuring techniques to real servers (such as database servers, name-servers, etc.).


Appendix A

Notations for Use Case Maps and UML

A.1 Use Case Map Notations

This section shows a subset of the notations used for representing use case maps.

Fig. A1 Use Case Map Notations (path start, path end, or-fork, and-fork, context, component)

A.2 UML Notation

This section presents a subset of the notations used in the unified modelling language (UML) developed by G. Booch and J. Rumbaugh, which combines their individual notations (Booch and OMT, respectively).

Fig. A2 UML Notation (parameterized class, class instantiation, inheritance, dependencies, notes, class category, object instantiation with attribute values)

Appendix B

Batch Script for Conducting Experiments

B.1 Overview

The first half of this thesis used DECALS [Elgillani94] to set up experiments and log data. However, it was found that as the number of users of DECALS increased, it became increasingly difficult to set up and manage experiments due to some technical difficulties. As a result, instead of DECALS, it was decided to use UNIX batch scripts to accomplish the same purpose. These batch scripts, as explained in the subsequent sections in this appendix, were started on the command line in the appropriate directory and accepted input which was very similar to DECALS input, as shown below: