
Instrumentation and Measurement of

Multithreaded Applications

Alexander Voß

January 1997, DA-25-95

Instrumentation and Measurement of

Multithreaded Applications

Studienarbeit (student thesis) in Computer Science

submitted by

Alexander Voß

born 13 August 1972 in Bergisch Gladbach

Prepared at the

Institut für Mathematische Maschinen und Datenverarbeitung IV



Friedrich-Alexander-Universität Erlangen-Nürnberg

Advisors: Prof. Dr. Fridolin Hofmann

Dipl.-Inf. Frank Bellosa

Start of work: 3 April 1996

Submission of work: 3 January 1997

I hereby declare that I have written this thesis without outside help and without using any sources other than those cited, and that the thesis has not been submitted in the same or a similar form to any other examination authority and been accepted by it as part of an examination. All passages taken verbatim or in essence from other works are marked as such.

Erlangen, 28 December 1996

Summary

The goal of this Studienarbeit is to create a programming environment in which programs with multiple threads of control can be developed and measured. Program development and instrumentation should proceed as independently of the target system as possible, while the target's specific properties can still be exploited. Important questions to be answered in the course of this work are:

- How can the desired analysis data be collected and stored without significantly influencing the program's execution?

- Which modifications, if any, must be made to the runtime system in order to save event counters when switching between different threads of control?

- How can this data be presented graphically in a meaningful way, so as to inform the programmer about the runtime behaviour of his program?

- How can the instrumentation be abstracted so that the system can be ported easily?

The capabilities of the developed system are to be demonstrated with an example.

Contents

1 Introduction

2 Multiprocessor Architecture
  2.1 Tightly vs. Loosely Coupled Systems
    2.1.1 Uniform Memory Access (UMA)
    2.1.2 Cache Only Memory Access (COMA)
    2.1.3 Nonuniform Memory Access (NUMA)
  2.2 Interconnection Networks
  2.3 Caching
    2.3.1 Direct Mapped vs. Associative Caches
    2.3.2 Virtual vs. Physical Caches
    2.3.3 Replacement Policies
    2.3.4 Write Policies
    2.3.5 Line Size
    2.3.6 Cache Consistency
    2.3.7 Prefetching
    2.3.8 Non-blocking Operation
  2.4 Memory Models

3 Multithreading
  3.1 Motivation
  3.2 Drawbacks
  3.3 Kernel- vs. Userlevel Threads
  3.4 Scheduling
    3.4.1 Preemptive vs. Nonpreemptive
    3.4.2 Policy vs. Mechanism
    3.4.3 Scheduling Policies
    3.4.4 Scheduling in Multiprocessor Systems
  3.5 Synchronization
    3.5.1 Deadlocks and Starvation
    3.5.2 Synchronization Problems
    3.5.3 Synchronization Mechanisms
    3.5.4 Synchronisation in Multiprocessor Systems

4 Performance Measurement
  4.1 Introduction to Performance Evaluation
  4.2 Measurement Methodology
    4.2.1 Hardware Monitors
    4.2.2 Software Monitors
    4.2.3 Hybrid Monitors
  4.3 Tuning
    4.3.1 Program Behaviour
    4.3.2 Bottlenecks
    4.3.3 Improving Software

5 Socrates
  5.1 Programming with Socrates
    5.1.1 The User Interface
    5.1.2 The Generator
    5.1.3 Support for Multithreading
  5.2 Instrumenting Programs
    5.2.1 Overview
    5.2.2 Function Calls
    5.2.3 User Events
    5.2.4 Operations
    5.2.5 Synchronisation Objects
  5.3 Execution
    5.3.1 Collecting Event Traces
  5.4 Post-Processing and ...
    5.4.1 Example: Barriers
  5.5 Example: All Pairs Shortest Paths
    5.5.1 The ...
    5.5.2 The Implementation
    5.5.3 Measurement
  5.6 Implementation
    5.6.1 Implementation

6 Conclusion

A Socrates Support for Multithreading
  A.1 Thread Management
  A.2 Thread-Local Variables
  A.3 Mutexes
  A.4 Condition Variables
  A.5 Semaphores
  A.6 Barriers

B Existing Performance Measurement Tools
  B.1 CXtrace
  B.2 tha/tnf
  B.3 Paradyn
  B.4 Pablo
  B.5 JED

C Sun Enterprise X3000

Friedrich-Alexander-Universität Erlangen-Nürnberg

Abstract

Instrumentation and Measurement of Multithreaded Applications

by Alexander Voß

How to write an abstract:

- readers of an abstract want to decide whether to read the full thesis

- present the outcomes of your work

- an abstract is a small article in itself and thus stands separate from the thesis

- keep it simple

- do not make claims that are not justified in the paper

- claim some new result unless you write a survey

- avoid "In this paper" or similar introductions

Keywords: Multithreading, Measurement, Instrumentation

Chapter 1

Introduction

Problem Statement

Computer systems continue to become more powerful, but as they do so the demands always grow beyond the reach of any existing system. The performance increase that can be achieved by reducing cycle times is limited because of the finite speed at which information can propagate and because of energy dissipation problems. The only other way to increase the performance of a system is to exploit the parallelism inherent in the application.

Parallelism on the level of micro-instructions has been used for a long time; most of the technology has been in use since the advent of supercomputing. Today, even the cheapest microprocessors have multiple independent execution units and implement pipelining. Optimising compilers that help to exploit this fine-grained parallelism are widely available. As the technology matures, its limits become visible: the parallelism that can be achieved at this level is very limited. Scalar arithmetic instructions, for example, cannot make efficient use of pipelines deeper than fourteen stages [HP96, page 210].

The advent of cheap single-chip microprocessors has made it feasible to build multicomputer and multiprocessor systems that exploit parallelism at a much higher level. Applications are split into multiple threads of control that cooperate in order to achieve a common goal.

Development of applications that contain multiple threads of control is still poorly understood and resembles an art rather than an engineering discipline. Only few tools are available that aid in managing the complexity of concurrent computation. Most of these only support one step in the development cycle, e.g. debugging.

Another problem is that systems become more complex as the number of processors and levels in the memory hierarchy increase. Some of this complexity leaks through even to the user level. Today's systems no longer provide the programmer with a consistent single system image.

Goals of this Thesis

The primary goal of this project is to provide tools that allow the programmer to write multithreaded applications in an architecture-independent way and to evaluate their performance in different environments, i.e. different hardware, operating systems, and runtime systems.

Performance measurement shall provide an insight into the dynamics of the program under development as well as the underlying system. As portability is a primary goal, no additional hardware may be used to record the data. It follows that a software solution has to be found that meets the following criteria:

- The influence of the measurement part of the system on the execution of the application program must be minimised. All blocking operations must be done by separate kernel threads.

- The system must allow the user to specify the set of events that are to be recorded. Producing too much data slows down the process and hides the important facts.

- Instrumentation and visualisation should use the identifiers given in the application's source code. The programmer must be able to see the relation between code and program dynamics.

[more]

Outline of the Thesis

The performance of multithreaded applications heavily depends on the properties of the underlying system. Computer architecture, operating system design, and the design of the runtime system interact with each other in intricate ways. To understand the reasons for an application's behaviour, one has to look at all levels of the system.

Thus the first chapters discuss some selected topics from the fields of computer architecture, operating system design, and runtime support, as well as performance evaluation, that constitute the research domain for this thesis. This material is meant to show possible performance problems and programming errors as well as ways to improve the software.

Beginning with chapter 5, the programming environment Socrates is presented. Its functionality is described and some insight is given into its implementation. A simple example illustrates the process of developing and evaluating multithreaded code. Conclusions are drawn in chapter 6. Appendices inform about topics such as other measurement tools or the software and hardware that has been used.

Chapter 2

Multiprocessor Architecture

This chapter presents some aspects of computer architecture that are closely related to multiprocessor performance. The goal is to show possible performance problems and programming errors. For further information on the architecture of multiprocessor systems see [HB84, HP96, Kai96].

2.1 Tightly vs. Loosely Coupled Systems

Multiprocessors are classified as being tightly or loosely coupled, depending on whether or not they share a global physical memory system. In tightly coupled systems, processors have direct access to all memory locations, and thus memory can be used for communication and synchronisation, whereas in loosely coupled systems message passing has to be used. Tightly coupled architectures can be further subdivided as follows:

2.1.1 Uniform Memory Access (UMA)

In systems with uniform memory access, each processor has direct access to a single shared memory. All memory locations are equidistant from each processor in terms of access times. Most UMA systems incorporate caching to eliminate memory contention, but this mechanism is hidden from applications.

2.1.2 Cache Only Memory Access (COMA)

Systems with cache only memory access do not have a physically shared memory but caches only. These caches constitute the machine's memory and, together, form a single address space. Access times vary depending on whether the requested memory location is in the local cache or in a remote one. Applications may be ignorant of this memory architecture, as the machine behaves very much like a UMA machine with caching.

2.1.3 Nonuniform Memory Access (NUMA)

Systems with nonuniform memory access have a distributed shared physical memory. Each partition of it is directly attached to a node but can be accessed by processors in other nodes via the interconnection network. Thus memory access times differ depending on whether the requested location is local to the node or remote. This added level of complexity may be hidden from application software, but doing so leads to suboptimal performance. To make the best use of the hardware, the programmer must take its architecture into consideration. Caching may be used between processors and local memory as well as between nodes. Machines with coherent caching at the hardware level are called cc-NUMA.

2.2 Interconnection Networks

Interconnection networks can be built using either packet- or circuit-switching technology and can use various topologies. The more parallelism and redundancy is incorporated, the more complex and expensive the interconnection network gets. Most solutions in use today represent a reasonable compromise between performance and cost:

Buses are the least expensive interconnection networks. They are also very easy to manage but provide neither multiple connections at the same time nor failure tolerance. Buses do not scale well; the only way to introduce parallelism is to replicate them. Systems with a very limited number of processors use a bus or a small set of buses. Many larger systems, however, use buses at node level and another topology at the global level.

Rings allow more simultaneous data transfers than buses because all segments (the paths between two nodes) of the ring may be used for communication at the same time. They can also provide fault tolerance if data can be sent in both directions. On the downside, data that is sent between nodes that are not adjacent has to travel through the switches of other nodes before reaching its destination, thereby being delayed. This performance problem grows with the number of nodes connected to the ring.

Cross-bar switches provide multiple exclusive communication lines between processors and memory [1]. If memory is partitioned into n blocks, n concurrent memory accesses can be accommodated. Contention can only occur if two processors try to access the same memory module. Cross-bar switches scale well, but their costs grow quadratically, which makes them prohibitively expensive for larger numbers of processors.

[1] They may also be used to connect multiple processors.

Multistage switches provide a means of interconnection that scales reasonably well in both performance and cost. The complexity of the network grows logarithmically with increasing parallelism. As opposed to cross-bar switches, contention can occur at the network level. Buffers can be introduced to address this problem.

Hypercube networks are often used in loosely coupled systems with store-and-forward communication. Like multistage switches, their complexity increases only logarithmically. They scale well, but the number of processors always has to be a power of two. Contention at the network level can occur but is addressed by using store-and-forward communication.

2.3 Caching

A cache is a buffer used to hide differences in operating speed. The general scheme can be applied in a lot of situations. Introducing caches into the memory system has three advantages: the latency of accesses to main memory is hidden, bandwidth is increased, and contention on interconnection structures is reduced.

Processor speed continues to grow faster than main memory speed, and novel multiprocessor architectures have deep memory hierarchies with latencies that can reach a couple of hundred processor cycles. These developments make caches the central factor in system performance. This section describes the key attributes of cache subsystems and their impact on performance. For further information about caching see [HP96, Kai96, Sch94].

2.3.1 Direct Mapped vs. Associative Caches

In a cache subsystem that uses direct mapping, a data item can be stored in only one cache line. Its address is split into three parts: the first is used to address the line that the data can be found in, the second is stored in the tag RAM to uniquely identify the data in the cache, and the third is the offset of the data within the line.
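To make the address split concrete, here is a minimal sketch of how a direct mapped cache controller might decompose an address; the sizes (512 lines of 32 bytes, i.e. a 16 KB cache, and 32-bit addresses) are assumptions chosen for the example, not properties of any particular machine.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical direct mapped cache: 512 lines of 32 bytes (16 KB).
     * offset: 5 bits, index: 9 bits, tag: the remaining 18 bits.      */
    #define LINE_SIZE 32u
    #define NUM_LINES 512u

    static void decompose(uint32_t addr)
    {
        uint32_t offset = addr % LINE_SIZE;               /* byte within the line  */
        uint32_t index  = (addr / LINE_SIZE) % NUM_LINES; /* which line to check   */
        uint32_t tag    = addr / (LINE_SIZE * NUM_LINES); /* stored in the tag RAM */

        printf("0x%08x -> tag 0x%05x, line %3u, offset %2u\n",
               addr, tag, index, offset);
    }

    int main(void)
    {
        decompose(0x00012345u);
        /* An address exactly one cache size (16 KB) away maps to the
         * same line with a different tag: the two items conflict.    */
        decompose(0x00012345u + LINE_SIZE * NUM_LINES);
        return 0;
    }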

Associative caches use fewer bits to address the cache line than would be necessary to uniquely identify it. Thus data items can map to multiple lines. A cache that can map data to n distinct locations is called n-way associative. In order to identify which data is in the cache, more bits have to be used in the tag RAM. Accessing the cache now requires a search process, as there are multiple lines in which a data item may be stored. In order to keep the speed of the lookup operation at the level of a direct mapped cache, the search must be done in parallel, requiring n comparators for an n-way associative cache.

There are three factors that make associative caches more expensive than direct mapped ones: the comparators needed, the increased size of the tag RAM, and the need for a more complex replacement policy (see below). As these costs increase with the size of the cache, small on-chip caches usually employ a higher level of associativity than the larger external caches. Making a cache associative has its advantages: data items that would map to the same line in a direct mapped cache can now be stored in the cache at the same time. Thrashing is reduced and utilisation improved.

2.3.2 Virtual vs. Physical Caches

Most systems today have a memory management unit (MMU) providing virtual address spaces. The cache can be placed between CPU and MMU, or between MMU and main memory. The former is called a virtual cache, as it only sees virtual addresses, while the latter is referred to as a physical cache. Both solutions have their specific advantages and disadvantages:

- Access to virtual caches need not go through the MMU and is thus faster when a hit occurs.

- Virtual addresses might be larger than physical ones, making virtual caches more expensive.

- Virtual caches require more control logic, as they have to deal with multiple address spaces.

- Physical caches can use bus-snooping in order to provide cache consistency.

Most CPUs today have the memory management unit integrated, so that external caches are always physical. It is possible to improve the performance of a physical cache by using virtual addresses for indexing and physical addresses for tagging. This way, address translation in the MMU may be overlapped with cache operation. The resulting architecture is called a virtual cache with physical tags (see [Sch94]).

2.3.3 Replacement Policies

When new data is written to the cache, old entries may have to be evicted. Choosing a victim in a direct mapped cache is trivial, as there is only one candidate. In associative caches, multiple solutions exist. The algorithm that chooses an item to remove is called the replacement policy. Replacement in cache systems is very much like page replacement in virtual memory systems. The policies used, however, are much simpler, because they have to be implemented in expensive hardware and have to be fast.

LRU works by adding state information to the tag RAM which is updated with every cache reference to the corresponding line. The line that was least recently used is selected for eviction. Strict LRU is usually used for two-way associative caches only, as it would require too much state information for caches with higher associativity.

Pseudo-LRU is used in cache systems with associativity greater than two. The state information recorded is limited but still enough so that lines that are used often do not get evicted.

Random replacement requires no state information at all and still yields good results. It is best suited for large, highly associative caches, because it saves a lot of hardware while still performing well.

2.3.4 Write Policies

Write operations by the CPU may update both memory and cache, or the cache only. The first solution is called write-through and has the advantage that memory always contains up-to-date information. The second policy is termed write-back. It has the advantage of being faster, as the data need not be written to the slow main memory. Its disadvantage is that it leaves the memory in an inconsistent state, requiring additional logic in the case of parallel operations (e.g. DMA). [missing: write-allocate]

2.3.5 Line Size

Another factor that determines the performance of a cache is its line size, which determines the amount of data that is moved between main memory and the cache. Large cache lines mean that data is prefetched, i.e. loaded into the cache before the CPU requests it. Assuming spatial locality of the memory references, this can improve performance and reduce contention on interconnection networks. Smaller lines, however, can be written back to memory faster. Choosing the right line size means finding a tradeoff between possible performance gains and losses. Another factor is cost, as smaller lines mean larger tags and thus require more hardware.

2.3.6 Cache Consistency

Caches introduce replication into the memory system. Whenever memory is accessed by multiple units, care must be taken that the information in the memory system stays consistent. Cache consistency in uniprocessor systems can be controlled by soft- or hardware. The only critical operations are interactions with peripheral devices. A simple solution is to mark all I/O buffers as non-cacheable. The presence of multiple processors introduces new difficulties. Although software solutions would be possible (see [Sch94]), it is necessary to implement consistency protocols in hardware in order to achieve optimal performance.

Consistency of caches is implemented by enforcing a set of rules that determine when a data item may be cached. As long as the data is only read, there is no problem at all. On a write operation, however, measures must be taken to keep the view of the memory system consistent. If the location that is written to is cached, the cache entries have to be either updated or marked as invalid. Thus cache consistency protocols fall into two categories: write-update and write-invalidate.

Write-Update Protocols

[missing]

Write-Invalidate Protocols

[missing]

2.3.7 Prefetching

Multiprocessors today allow programs to prefetch data from main memory into caches by issuing special prefetch instructions. Prefetching is done in parallel to normal CPU operation and can thus help to hide memory latency.

2.3.8 Non-blocking Operation

[missing]

2.4 Memory Models

[missing]

    /* Processor 1 */        /* Processor 2 */
    A = 0;                   B = 0;
      ...                      ...
    A = 1;                   B = 1;
    if (B == 0)              if (A == 0)
        printf("B");             printf("A");

(Under a sequentially consistent memory model, at most one of the two printf calls can execute; weaker memory models may also allow both.)

Chapter 3

Multithreading

This chapter deals with how multithreaded programs can exploit the parallelism provided by multiprocessors and what influence the operating system and the runtime system have on their performance.

3.1 Motivation

One goal of operating system design has always been to improve the utilisation of the underlying hardware. Thus batch processing was superseded by multitasking. Multithreading is a refinement of traditional multitasking that separates the notions of a thread of control and an address space. The following advantages justify the use of this concept:

Increased Throughput: A single-threaded program has to use non-blocking primitives if it wants to overlap communication with its execution, which leads to a dramatic increase in program complexity. Multithreading allows overlapping of communication and execution while using higher-level (i.e. blocking) services.

Efficiency and Ease of Communication: Applications that are distributed over multiple processes have to use some kind of interprocess communication. Multithreaded applications automatically use the most efficient IPC mechanism available: shared memory. In most applications, shared memory is also the most intuitive to program.

Program Structure: Designing an application as a set of cooperating threads helps to clarify program structure. Many problems are parallel by nature; servers might start one thread per request, and numeric applications might divide their data into subsets that are worked on by cooperating threads.

Multiple Processors: Multithreaded applications make use of hardware parallelism at a fine level of granularity. A single application program may be executed in parallel using multiple processors, which reduces its execution time. Compared to traditional multitasking, utilisation of multiprocessor systems is improved.

Availability: The availability of applications is increased by multithreading. A user need not wait for each of his commands to finish execution but can issue new requests while the system works on completing those he recently made. The need to show a "please wait" message decreases.

System Resources: Multithreaded applications make better use of system resources than multiprogrammed applications do. Threads need less state information than processes, and context switches are much cheaper.

3.2 Drawbacks

Like every technique, multithreading has its limitations and drawbacks. The following summarises some aspects that have to be taken into consideration whenever a decision has to be made whether to use multiple threads or not:

- Some applications do not lend themselves very well to multithreading. Synchronisation and communication eat up much of the speedup achieved by parallelisation.

- Parallelising an application is a tedious job that introduces new potential programming errors. It is very difficult to test or prove the correctness of multithreaded code.

- Most operating systems today provide buffering and prefetching for I/O operations. The use of separate threads for I/O does not improve performance very much on such systems.

- Interactivity of an application can be achieved by using an event-driven programming style rather than threads. [Ousterhout]

3.3 Kernel- vs. Userlevel Threads

Threads can be implemented either as an operating system concept or as a userlevel library. Both possibilities have their advantages and disadvantages:

- Kernel threads are more expensive, as they require administrative data in kernel space and have to do their context switches via the kernel. All functionality that any application might need has to be included in the kernel.

- Userlevel threads are not known to the operating system; kernel services can only be used by processes or kernel threads. If a userlevel thread calls a blocking system call, all other threads are blocked with it. Signals cannot be sent to a single userlevel thread.

- With userlevel threads, scheduling is independent of the operating system and can thus be tuned to fit the application's needs. Synchronization between userlevel threads need not use expensive system calls.

To overcome these problems and combine the advantages of both concepts, it is now common to use kernel threads as "virtual processors" for userlevel threads. This, however, introduces a new level of complexity: How is the mapping between userlevel threads, kernel threads, and physical processors to be done? How can the kernel and the userlevel library interact in order to optimize system performance?

3.4 Scheduling

The scheduler is the entity in the operating system or runtime library that maps threads of execution onto processing elements. Its impact on system performance is significant. This section presents some aspects of scheduling systems and shows why traditional scheduling policies are unsuitable for multiprocessor systems. Scheduling in uniprocessor systems is described in [Hof91, Tan92]. Issues that arise in multiprocessors and distributed systems are covered in [SS94].

3.4.1 Preemptive vs. Nonpreemptive

The scheduler may be invoked whenever a thread of execution enters the kernel or the runtime library. If this is the only way to initiate scheduling, the system is said to be nonpreemptive. In a cooperative system, the scheduler only runs when a process explicitly calls it. Although this scheme minimizes context switches, it does not optimize overall system performance, as it has no way of assigning priorities (e.g. to I/O-bound tasks). Another problem is that a thread of execution may monopolize the CPU and decrease the system's interactiveness or even bring it to a standstill.

A solution to these problems is preemptiveness. Timer interrupts are used to periodically force the CPU into kernel mode, where the scheduler may be invoked. This increases flexibility but also means that an application cannot make any assumptions about when scheduling will take place. Some race conditions only occur in preemptive systems. Programmers have to be well aware of these potential problems.

3.4.2 Policy vs. Mechanism

The decision which thread of execution is to be scheduled next is independent of the mechanism that actually invokes the thread. Only the latter part of the scheduler has to be part of the operating system or runtime library. Policy (what is to be done?) may be separated from mechanism (how is it done?). Performance can be increased if scheduling decisions are made by the application programs themselves.

A program that enters a state in which it does a lot of computation might decide not to preempt threads. Thus the tradeoff between interactivity and throughput can be dynamically adjusted, reducing the number of context switches.

3.4.3 Scheduling Policies

The following paragraphs introduce some scheduling policies that are common in modern operating systems and runtime libraries:

Round Robin: Each thread is assigned a time quantum. If it does not voluntarily relinquish the CPU before its quantum has expired, it is preempted and inserted at the end of the queue of runnable processes. When a new thread needs to be selected for execution, the first in the run queue is selected. The only important parameter in this scheme is the length of the quantum. If it is too long, responsiveness will suffer; if it is too short, context switches will degrade performance.

Priority Scheduling: It may be necessary or convenient to give some processes priority over others. With priority scheduling, the process with the highest priority is run. To keep low-priority jobs from starving, the priority of a process that has just run is decreased. Priorities may be assigned statically or dynamically.

Multiple Queues: This scheduling policy was developed for a system that had very expensive task switches. The designers recognised the fact that long-running compute-bound processes should be run less often but with larger quanta. Multiple queues hold processes with different runtime behaviour.

Other Policies: There are a lot of other scheduling policies, most of which are tailored for a specific purpose (e.g. realtime processing). Examples are shortest job first, guaranteed scheduling, and two-level scheduling. Consult [Tan92] for further information.

3.4.4 Scheduling in Multiprocessor Systems

In multiprocessor systems, the scheduler has to map n processes onto m processors. Issues like caching and communication costs have to be taken into consideration. The main goal in today's systems is to keep all CPUs busy by providing data fast enough; CPU power is no longer the limiting factor for system performance.

Affinity Scheduling

Load balancing has to be done in order to keep all CPUs busy, but it removes threads from the caches that hold their data. On NUMA systems, a thread might even be migrated to another node that then has to access its data remotely. Affinity scheduling [BS96, GTU91, TTG95] takes these aspects into account and tries to migrate only those threads that do not have much data in caches. Threads are kept on the same node and processor whenever possible. This scheme is very successful, especially on NUMA systems that have caches between hypernodes (for an example of such an architecture see [Con94]).

3.5 Synchronization

Support for the synchronization of concurrent operations has to be provided at three levels: hardware, operating system, and programming language. Hardware support in the form of atomic instructions is needed if multiple processors access shared resources such as shared physical memory. Multiprogramming operating systems have to deal with mutual exclusion between multiple threads of execution. Programming languages should provide means of abstraction for the services of hardware and operating system.

This section describes synchronisation in multithreaded environments. The basic synchronisation problems are discussed, as well as the mechanisms that runtime systems provide to solve them. Finally, techniques that help to make synchronisation efficient on multiprocessor systems are discussed. Synchronization at the software level is discussed in [Hof91, SS94, Tan92]. Hardware aspects are addressed in [HB84].

3.5.1 Deadlocks and Starvation

A deadlock is a situation in which all processes in a system are blocked waiting for some resource. Because none of these processes will become runnable, no resources can be freed and the system grinds to a standstill. Another problem arises if processes never become runnable although resources are available. This situation is called "starvation". It is likely to occur in a system with static priorities. Deadlocks are examined in depth in [SS94, Tan92]. Deadlocks appear in the following situations:

- AB-BA deadlock

- recursive locking

[more]

3.5.2 Synchronization Problems

This section describes four synchronisation problems, each of which represents a class of similar problems. Almost every situation in which synchronisation is needed can be reduced to one of these "basic" problems:

Mutual Exclusion: The basic problem of synchronization is that of mutual exclusion. Operations on shared data structures have to be "atomic", i.e. at any point in time, only one thread of execution may work on a given set of shared data. If mutual exclusion is not guaranteed, data inconsistencies may occur.

To model mutual exclusion, the notion of a "critical section" has been introduced. If a thread of execution needs to access a shared data object, it first has to acquire a lock (i.e. enter the critical section). Then it performs operations on the data and finally releases the lock (i.e. leaves the critical section) when it is finished and the data is in a coherent state again. Locks may either be associated with the data objects or with the code that modifies them.

Bounded Buffers: This synchronization problem is also known as the producer-consumer problem. Producer threads send messages to consumer threads via a buffer of limited size. Clearly, access to the buffer has to be mutually exclusive. Another problem arises whenever a producer needs to send messages via a full buffer or a consumer wants to read from an empty one. In these cases, the processes have to wait for the buffer's state to change.

Readers/Writers: Shared data may be read by multiple threads, but a thread may only write data if it has exclusive access, i.e. there are neither other writers nor readers. Two cases have to be distinguished: in the first, priority is given to the readers, while in the second the writer may block arriving readers and thus takes priority. If readers are assigned priority, a writer may starve if new readers keep arriving.

Dining Philosophers: Multiple threads may contend for limited resources, each needing more than one resource. The classical description of this problem is that of five philosophers alternately thinking and eating. For each philosopher, there is a bowl of spaghetti, but there are only five forks between the bowls. A philosopher needs both the left and the right fork in order to eat. Clearly, no two neighbouring philosophers may eat simultaneously, and access to the forks needs to be synchronized. A situation like this is likely to cause deadlocks.

3.5.3 Synchronization Mechanisms

This section describes the four most common constructs used for synchronisation in multithreaded environments. They can be found in most runtime libraries and are implemented as normal API calls. Thus no support by the kernel is needed. Programming language constructs for synchronization are discussed in [SS94]. A formal treatment of synchronization mechanisms can be found in [Hof91].

Mutexes

Mutexes are synchronization objects that only have the two states "locked" and "unlocked". Atomic operations can be performed to lock and unlock mutexes. The lock operation blocks a thread until it acquires the mutex. Most implementations provide a trylock function that tries to lock a mutex but does not block on it. If the mutex was already locked, trylock simply returns an error code. This functionality can be used to spin on a mutex.

Even scalar values may have to be protected by mutexes, as access to them might not be atomic on multiprocessors: if the size of the data is larger than the width of the memory subsystem, or in case of misalignment, multiple memory accesses have to be performed in order to manipulate the data. Optimizations may be made if it is known that a scalar is accessed atomically, but code using such optimizations is not portable.

Whenever multiple mutexes have to be acquired, deadlocks may arise. They can be avoided by using lock hierarchies (i.e. acquiring locks only in a predefined order) or by using the trylock operation and releasing all other locks if it fails (a strategy called "backoff").

Mutexes are the basis for all other synchronization mechanisms. Their implementation is straightforward if hardware constructs like test-and-set instructions are used.
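As an illustration of the backoff strategy just described, here is a minimal sketch using the POSIX threads API (the runtime library used in this thesis may offer a different interface):

    #include <pthread.h>

    /* Acquire two mutexes without risking an AB-BA deadlock: take the
     * first, then only *try* to take the second. On failure, release
     * the first again and start over ("backoff").                    */
    void lock_both(pthread_mutex_t *a, pthread_mutex_t *b)
    {
        for (;;) {
            pthread_mutex_lock(a);
            if (pthread_mutex_trylock(b) == 0)
                return;                 /* both locks held */
            pthread_mutex_unlock(a);    /* back off, let others proceed */
        }
    }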

Condition Variables

Condition variables provide a mechanism to wait for a condition to be satisfied (the bounded buffer problem can be solved using condition variables). They are always used in conjunction with a mutex that provides mutual exclusion. After locking the mutex, the condition may be checked. If the condition is satisfied, the program unlocks the mutex and continues execution. Otherwise, it performs a wait operation that atomically inserts the thread into a list of waiting threads associated with the condition variable and unlocks the mutex. Whenever the data that is associated with a condition is changed, a signal operation has to be performed in order to wake up a thread that waits for such a change. Most implementations provide two more operations: a broadcast wakes up all threads waiting on the condition variable, and timedwait allows threads to wait only for a certain time.
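The following sketch shows this canonical usage pattern with POSIX condition variables, applied to a trivial producer/consumer counter (illustrative only; error handling omitted):

    #include <pthread.h>

    static pthread_mutex_t m         = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
    static int items = 0;              /* shared state guarded by m */

    void consume_one(void)
    {
        pthread_mutex_lock(&m);
        while (items == 0)                     /* condition not satisfied */
            pthread_cond_wait(&not_empty, &m); /* atomically unlocks m and waits */
        items--;                               /* mutex is held again here */
        pthread_mutex_unlock(&m);
    }

    void produce_one(void)
    {
        pthread_mutex_lock(&m);
        items++;                               /* change the associated data ... */
        pthread_cond_signal(&not_empty);       /* ... and signal the change */
        pthread_mutex_unlock(&m);
    }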

Semaphores

Semaphores were introduced by Dijkstra in 1968 [Dij68] as a mechanism for mutual exclusion and synchronization. Semaphores can be thought of as an object consisting of a signed integer value, a condition variable, and a mutex. Two methods are defined, namely P and V, both of which are atomic (this is why the mutex is needed). P decrements the integer and blocks on the condition variable if the new value is less than zero. If the semaphore contains a negative number, then this designates the number of processes waiting on it. V increments the integer and unblocks one process from the condition variable if the new value is zero or less.

Mutual exclusion can be achieved with a semaphore by initializing it to one and protecting critical sections with a pair of P and V operations. Semaphores used in this way are called binary semaphores.

Synchronization can be achieved by initializing a semaphore to zero. Now processes can block on the semaphore with P operations, and an event can be signaled to one waiting process by invoking V.

Resource allocation can be controlled by initializing a semaphore to the number of resources available. Now processes can allocate and free resources by invoking P and V, respectively.
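A semaphore can indeed be built from a mutex and a condition variable, as described above. The sketch below simplifies the scheme from the text by keeping the counter non-negative instead of letting it go negative to count waiters; the observable behaviour of P and V is the same (initialisation omitted):

    #include <pthread.h>

    typedef struct {
        int             value;   /* number of available "units", >= 0 */
        pthread_mutex_t m;       /* makes P and V atomic */
        pthread_cond_t  c;       /* where P operations block */
    } sema_t;

    void sema_P(sema_t *s)
    {
        pthread_mutex_lock(&s->m);
        while (s->value == 0)               /* nothing available: block */
            pthread_cond_wait(&s->c, &s->m);
        s->value--;
        pthread_mutex_unlock(&s->m);
    }

    void sema_V(sema_t *s)
    {
        pthread_mutex_lock(&s->m);
        s->value++;
        pthread_cond_signal(&s->c);         /* unblock one waiting thread */
        pthread_mutex_unlock(&s->m);
    }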

Barriers

Barriers are used to synchronize a set of threads at a specific location in the program. Threads reaching a barrier wait until all other threads have also reached it. [more]
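Since this section is still marked as unfinished, here is a hedged sketch of how a reusable counting barrier might be built on top of a mutex and a condition variable; a generation counter distinguishes successive barrier phases (initialisation omitted):

    #include <pthread.h>

    typedef struct {
        int             needed;      /* number of threads to synchronize */
        int             arrived;     /* arrivals in the current phase */
        int             generation;  /* phase counter */
        pthread_mutex_t m;
        pthread_cond_t  c;
    } barrier_t;

    void barrier_wait(barrier_t *b)
    {
        pthread_mutex_lock(&b->m);
        int gen = b->generation;
        if (++b->arrived == b->needed) {   /* last thread to arrive */
            b->arrived = 0;
            b->generation++;               /* open the next phase */
            pthread_cond_broadcast(&b->c); /* release all waiters */
        } else {
            while (gen == b->generation)   /* wait until the phase ends */
                pthread_cond_wait(&b->c, &b->m);
        }
        pthread_mutex_unlock(&b->m);
    }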

3.5.4 Synchronisation in Multiprocessor Systems

[revise]

On uniprocessors, mutual exclusion can be achieved by making the kernel nonpreemptive (thus exactly one process can be in kernel mode) and disabling interrupts during critical sections (thus keeping interrupt handlers from interfering). All other synchronization problems can be solved building upon these concepts, e.g. resources can be protected using the sleep/wakeup mechanism that introduces condition variables into the operating system.

Synchronization on multiprocessors is a more complex theme. The nonpreemption assumption no longer holds, and disabling interrupts on all processors of a system is hardly a good idea. Furthermore, it is desirable to have multiple processors execute kernel code concurrently in order to achieve higher parallelism. For a discussion of multiprocessor kernels see [Sch94].

Most computer systems provide at least some variation of a "test-and-set" operation. This instruction provides atomic access to one memory address: the old value can be read and subsequently modified in one operation. Shared memory multiprocessor systems need at least this basic form of synchronization in order to function properly.

Thread-Level Synchronisation

As parallelism increases, so does the need for synchronization. In a multithreaded environment, very lightweight mechanisms have to be provided in order to achieve a maximal performance improvement. Threads rather than processes should be blocked on resources, and synchronization among userlevel threads should be implemented at this level. [more]

Spinning vs. Blocking

Spinning (aka "busy waiting") is usually avoided, as it wastes precious CPU cycles and memory bandwidth. If, however, the process only has to wait for a short period of time, blocking it would be a bad idea: the cost of a task switch may then be greater than the cost of spinning. If the time that a process has to wait cannot be predetermined, a strategy might be employed that spins for a specific time and then blocks.
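A minimal sketch of such a two-phase ("spin, then block") strategy, again in terms of the POSIX API; the spin limit is an arbitrary illustrative constant that would have to be tuned:

    #include <pthread.h>

    #define SPIN_LIMIT 1000   /* assumed tuning parameter */

    /* Spin for a bounded number of attempts (cheap if the owner
     * releases the lock soon), then fall back to a blocking lock
     * (avoids burning CPU cycles and bandwidth on long waits).   */
    void spin_then_block(pthread_mutex_t *m)
    {
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (pthread_mutex_trylock(m) == 0)
                return;              /* acquired while spinning */
        pthread_mutex_lock(m);       /* give up spinning and block */
    }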

Chapter 4

Performance Measurement

This chapter deals with the art and science of performance evaluation. The first section provides a broad overview of performance evaluation, the context in which measurement stands. The second section describes theoretical aspects of performance measurement as well as current methodology and tools. Tuning of application software and the runtime environment is discussed in the last section. For further information about performance measurement see [FSZ83, McK88, Lan92, Jai91, ea95, SK90].

4.1 Introduction to Performance Evaluation

Performance can be defined as a sum of properties of a system that make it fit for a specific purpose. The individual properties are called performance indices. Different people regard different aspects of a system as important, and each person's view might change in different contexts. Thus, performance is a weighted sum of performance indices:

    P = \sum_i w_i p_i

The goal of performance evaluation is the improvement of the absolute performance or of the cost/performance ratio. Even if absolute performance is the primary goal, only limited resources are available to achieve it. Thus, speedup and efficiency are important:

    S = P^* / P    and    E = P / C

Speedup S is the ratio of the modified system's performance P^* to the original performance P. Efficiency E is performance per cost C.

System performance is limited by the weakest link in the chain, in computer science termed a bottleneck. As we eliminate one bottleneck, others show up, while the costs of improving the performance of a system normally increase with every step of improvement. Thus, every system has a state of optimal operation, where the cost per performance is minimal or where the cost of performance improvement becomes unacceptable.

    Level             | Performance Index
    ------------------+---------------------------------------------------
    System            | throughput, capacity, availability
    Program           | time to completion, responsiveness
    Process           | number of page faults, memory usage
    Language          | block execution time, number of executions
    Operating System  | cost of context switches, resource utilisation
    OS Resource       | utilisation, average number of waiting processes
    Hardware          | cache efficiency, bandwidth, cycle time

    Table 4.1: Examples of performance indices

Information processing systems are usually structured into multiple layers, each of which provides an increased level of abstraction. Accordingly, performance indices can be defined in each of these layers. Some examples are given in table 4.1. Evaluation of a system's performance is needed in all stages of its life cycle:

Design: The developer has to repeatedly evaluate the current state of a design until the goals have been reached.

Improvement: No product coming to market is perfect. Problems not encountered in the design phase now become visible, and new demands develop. Thus the design process has to continue with this new input.

Acquisition: Before a system is bought, it has to be verified that it meets the requirements.

Operation: Most systems are parameterised in some way. To attain maximum performance, these parameters have to be tuned.

Capacity planning: A system's workload will probably increase during operation. The operator has to calculate when the maximum workload will be reached and how much added performance has to be bought, or whether the system has to be replaced.

There are four basic techniques for evaluating the performance of a system, differing greatly in requirements, cost, and precision:

Analysis of mathematical models: A system is described in terms of mathematical formulas that describe its behaviour.

Simulation: A model of the system is generated, and each step of operation is simulated and evaluated.

Measurement: The behaviour of (parts of) the system is measured under reproducible conditions. This requires that the system be operational.

Benchmarking: The time that specific programs need for execution is measured. These programs may be existing applications or artificially designed benchmark programs. Benchmarking can be seen as a special case of measurement.

Of these, measurement is the one examined in this thesis. The following two sections provide an overview of the theory and practice of performance measurement.

4.2 Measurement Methodology

The goal of performance measurement is to gain insight into the dynamics of a system's operation. In contrast to an examination of the static description of a system, performance measurement can provide information about its runtime behaviour and thus lead to performance improvements and the discovery of race conditions. The measurement process consists of five basic steps (a small code sketch of the first steps follows the list):

Instrumenting the system: sensors and stimuli are attached to the system in order to extract information and to influence the target system.

Generating traces and profiles: traces record the chronology of events in the system, while profiles count them or measure durations.

Storing the collected information: interference with the system's operation must be minimised.

Processing the data: measurement will usually yield a huge amount of data that must be post-processed.

Visualising the results of processing: collected data has to be presented to the user in a way that emphasises the important information.
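As a concrete, purely illustrative picture of the first three steps, a software sensor might write fixed-size records into a preallocated buffer (in a real system, one buffer per thread, to avoid synchronizing on it) and defer all expensive work, such as writing to disk, until after the measurement. None of the names below belong to Socrates' actual interface:

    #include <stdint.h>
    #include <time.h>

    /* One fixed-size trace record: what happened, when, in which thread. */
    typedef struct {
        uint32_t event_id;    /* user-defined event type */
        uint32_t thread_id;
        uint64_t nanoseconds; /* timestamp */
    } trace_rec_t;

    #define TRACE_CAP 65536
    static trace_rec_t trace_buf[TRACE_CAP]; /* preallocated: no allocation */
    static unsigned    trace_pos;            /* while measuring             */

    /* Sensor: a timestamp plus a few stores per event; flushing the
     * buffer to a file happens only after measurement has finished.  */
    #define TRACE(id, tid)                                             \
        do {                                                           \
            if (trace_pos < TRACE_CAP) {                               \
                struct timespec ts;                                    \
                clock_gettime(CLOCK_MONOTONIC, &ts);                   \
                trace_buf[trace_pos].event_id    = (id);               \
                trace_buf[trace_pos].thread_id   = (tid);              \
                trace_buf[trace_pos].nanoseconds =                     \
                    (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;  \
                trace_pos++;                                           \
            }                                                          \
        } while (0)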

Common characteristics of measurement tools are: [more]

Interference: Measurement tools influence the measured system. This interference has to be kept to a minimum.

Accuracy: The system must be able to collect as much data as the user wishes. Recorded values must be as precise as possible, and error ranges must be known.

Level of Abstraction: Measurement can record data about objects on any level of a system's hierarchy, from logic gates to complex application programs.

Data reduction capabilities: It must be possible to reduce the amount of data that is collected to a meaningful quantity. Recording all available data might hide the important facts.

Scope: A measurement system that is tightly fitted to just one application cannot be reused, and its price will probably be too high compared with general-purpose systems.

Compatibility: Target and measurement system have to interact and thus either have to be built using the same basic technology or be adapted to each other.

Integration: Measurement is mostly done in the context of a design process which it has to be integrated into. An ideal system does not interrupt the design process but rather accompanies it.

Ease of use: Of course, a measurement system has to have a user interface that makes it easy to use. Another issue is the turnaround time, i.e. the time between the specification of the measurement parameters and the display of the results. So-called offline systems display data only after measurement is completed. Online systems display it directly and may even support interactive changing of measurement parameters.

Costs: Costs are always an issue. A system that costs more than it is worth is unlikely to be used. One must always consider the running costs of the system, which depend on its resource usage and its ease of use.

4.2.1 Hardware Monitors

Hardware monitors provide means to trace the low-level operations of a system. They can collect data almost without interfering with the target system, and their accuracy is very high. Unfortunately, hardware monitors are mostly highly specialised tools that have a limited range of applications, which makes their use very expensive.

4.2.2 Software Monitors

Very often, the advantages and drawbacks of software complement those of hardware. Measurement programs can be far more flexible, adaptable, easy to use, and easy to integrate. They can be significantly cheaper. Their drawbacks are in the fields of interference and accuracy. Software solutions have to use existing hardware and thus contend for resources with the processes that are to be measured. The influence on the target system is high, and the amount of data that can be collected is very limited.

4.2.3 Hybrid Monitors

Hybrid monitors are a way to alleviate the deficiencies of software monitors. Special measurement hardware is introduced into the target system that allows the collection of low-level data that would not be accessible to a pure software measurement system. Another purpose of the hardware component may be to reduce the influence that the measurement system has.

Most microprocessors today have at least some support for performance measurement. There are usually several counters for events like cache misses, page faults, or pipeline stalls (for an example see [WG94]).

4.3 Tuning

[reorder]

Performance is measured in order to improve a system's behaviour. Whenever needs are not satisfied, the system must be changed. One way of improving performance is to buy a more powerful machine. Although this option is theoretically available in most cases, it is seldom the most cost-effective way. There are two causes for bad performance and thus two ways to improve the behaviour of an existing system:

4.3.1 Program Behaviour

The most obvious question when addressing performance problems is: Does the application software make optimal use of the resources? It is so obvious that many people tend to shut their eyes to it. Let me illustrate this with a short example: [find a better example]

Most software today makes no use of parallelism at all. Thus, although multiprocessing can improve the system's overall performance by running applications in parallel, an individual application cannot profit from it. In today's workstation-oriented environments, there mostly is only one active process. A workstation's resources are unused most of the time and overloaded whenever the user makes a request. Nevertheless, the workstation model is quite popular.

Here are some examples of how an individual application program can be improved:

- Parallelism can be introduced by splitting it into multiple processes or threads. This can be a reasonable step even if the program runs on uniprocessors, as it may improve the application's responsiveness and hide the latency of communication operations [].

- Locking of shared resources may be made finer in order to achieve more parallelism [Sun94, page 116].

- Alternative algorithms may be considered. If memory usage is the critical factor, a suboptimal algorithm in terms of theoretical time complexity may be appropriate.

- On a NUMA machine, memory allocation may be modified in order to bring data closer to the processor working on it. []

Another possibility is to improve the behaviour of an application program by changing its environment:

- Most operating systems have a large number of tuneable parameters that may be adjusted to best fit the most commonly used applications.

- Scheduling is an important factor. Most of today's systems use static schemes that do not take the program's behaviour (e.g. its usage of caches) into account. A scheduling policy that adapts to the demands of the running programs can bring great performance improvements [Ste95, page 14].

- In most systems that use kernel threads as virtual processors for the user threads, concurrency is reduced whenever a thread makes a blocking system call. This may even lead to deadlocks if the number of running kernel threads reaches zero. There are ways to keep the number of virtual processors, and thus concurrency, constant [Kop95].

- Synchronisation mechanisms may be made more efficient by introducing a strategy of spinning for some time and then blocking [Ste95, page 46].

Tuning need not be an action that is performed only once. Systems have been built that monitor a system's operation and try to optimise it without intervention by the user [Yan88].

4.3.2 Bottlenecks

Performance is limited by the weakest link in the chain: a bottleneck. Buying faster CPUs will not bring a performance gain if the limiting factor is memory bandwidth. Bottlenecks can be found in both hard- and software. Here are some examples:

Software

- Most libraries have been made thread-safe by simply adding one synchronisation object to each subsystem (e.g. the malloc calls). An application using a lot of dynamic data structures loses parallelism to this simple scheme.

- Centralised kernel data structures may become a major bottleneck in multiprocessor systems. This is especially bad as it affects all processes.

- If all requests to a client-server system are handled by a single process, then this will likely be a bottleneck.

Hardware

- The performance of the memory subsystem and of the communication paths between processors is today's major bottleneck.

- Mass storage may be a bottleneck in applications that require large amounts of data but do not perform complex operations on them.

- An MMU may be a bottleneck if it has to access main memory too often to provide the processor with data fast enough.

How can the bottleneck problem be addressed? There are two basic ways: bottlenecks may be eliminated (e.g. by replicating the critical resource) or they may be hidden (e.g. by introducing caches).

4.3.3 Improving Software

Only seldom is it possible to make changes beneath the level of the application software. Better hardware may be unavailable or too expensive. Changing the operating system may be even more expensive. Even if there were a better system, switching to it would mean that applications would have to be rewritten, people retrained, and so on. A solution that is feasible is performance tuning of the application software.

Optimising Software for Cache Efficiency

The performance of a computer system can be improved if software is optimised to make good use of the cache subsystem. The primary goal is to maximise the cache hit ratio. Causes for cache misses fall into three categories called the "three C's" [HP96]:

Compulsory: Whenever a data item is accessed for the first time, a cache miss occurs. There is no way to prevent these so-called cold start misses.

Capacity: If a program references more data than fits into the cache, misses occur because cache lines have to be evicted to make room for others. A way to alleviate this problem is to increase the temporal locality of the data, e.g. by reordering loops.

Conflict: Data items mapping to the same cache lines may expel each other from the cache. The higher the associativity of the cache, the less severe this problem is. Programmers can avoid collision misses by allocating data structures that are used together in one block.

Compilers can p erform co de optimisations that improve a programs util-

isation of hardware resources. Higher level reordering, however, still has to

b e done by the programmer. Hennessy and Patterson [HP96] present four

basic optimisations that improve cache use:

Merging Arrays: Cache use can be improved if data structures that are accessed together are merged or at least allocated together. The first optimisation increases spatial locality while the second simply reduces the number of collision misses.

Loop Interchange: Programs accessing data in nested loops often thrash caches because the data items are not addressed in the order in which they are stored. Interchanging inner and outer loops can yield great performance improvements.

Loop Fusion: If a program accesses the same data in different loops in order to perform different computations, temporal locality may be improved by joining these loops.

Blocking: The last optimisation technique tries to eliminate capacity misses by dividing the problem that is to be solved into multiple subproblems. A good example for this is matrix multiplication, which results in a very unfortunate access pattern when implemented naively but can be done block-wise, as the sketch below shows.
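A minimal sketch of blocking, assuming square row-major matrices and an illustrative block size BS (in practice BS would be chosen so that the working blocks fit into the cache); this is only the technique described in [HP96], not code taken from there:

    #define BS 32                               /* illustrative block size */

    /* c += a * b for n x n row-major matrices; c must be zero-initialised */
    void matmul_blocked(int n, const double *a, const double *b, double *c)
    {
        int i, j, k, jj, kk;
        for (jj = 0; jj < n; jj += BS)
            for (kk = 0; kk < n; kk += BS)
                for (i = 0; i < n; i++)
                    for (j = jj; j < jj + BS && j < n; j++) {
                        double sum = c[i*n + j];
                        for (k = kk; k < kk + BS && k < n; k++)
                            sum += a[i*n + k] * b[k*n + j];
                        c[i*n + j] = sum;
                    }
    }

Each BS x BS block of b is reused for all n rows of a while it is cache-resident, instead of being reloaded from memory on every pass.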

Improving Multithreaded Code

Re-blocking

Optimizing Synchronisation

Improving the Runtime System

Chapter 5

Socrates

Developing multithreaded programs is a difficult and still ill-understood task. Only few tools are available that aid the programmer in designing correct and efficient code. Socrates is an environment that allows programs to be written in an architecture-independent way and their performance to be evaluated. Figure 5.1 shows an overview of the system.

[Diagram: Interactive Editing, Visualization, and Source Generation around a DB; generated files *.d, *.h, *.c, *.p.c; Compilation & Linking; Execution producing Events; Post-Processing]
Figure 5.1: Overview of the Socrates system

Applications are written in ANSI C using a graphical environment. Support for multithreading and performance measurement is added in an orthogonal way. Instead of extending the language to support parallelism, a generator has been built that inserts the necessary code into the source. Chapter 5.1 will show this in detail.

The user interface allows the programmer to browse and edit the source code and specify instrumentation parameters. The user must specify the set of events that are to be generated during execution. This step is described in chapter 5.2.

The code that is generated can be compiled using an unmodified C compiler but may have to be linked with modified versions of the runtime libraries and be run using an extended system kernel. During the execution of the program, an event trace is recorded and saved in a file in raw form. This process is discussed in chapter 5.3.

The generator produces a post-processing program which inserts the raw data into an SQL database. This allows the programmer to perform complex queries using a standardised language. Socrates does not provide support for the analysis of the data as there exists a wide variety of programs for this purpose. Data may be exported to spreadsheet applications, statistical analysis systems, or plotting software. Chapter 5.4 discusses the benefits and costs of this approach.

An example program which illustrates the whole process of writing multithreaded code, instrumenting it, generating code for a specific platform, and analysing the resulting data, is presented in chapter 5.5.

5.1 Programming with Socrates

5.1.1 The User Interface

The user interface of Socrates allows the programmer to browse the source code and edit it. The code is split into multiple parts: types, variables, functions, and so on.

Figure 5.2: Socrates User-Interface

5.1.2 The Generator

Once a program has been written using Socrates, C source files may be generated that can be compiled using an unmodified C compiler. Declarations and definitions are separated and written to the corresponding files.

Socrates contains a preprocessor that is able to manipulate the code that is generated. It allows arbitrary Tcl code to be inserted into the C source code. Anything between two at-signs is passed to the Tcl interpreter, the return value being inserted into the output of the generator. This scheme makes Socrates very easy to extend and adapt to new environments. It also provides the programmer with more powerful preprocessing functionality than the standard C preprocessor does. In the current version, Tcl commands cannot be nested. This will change in the future, making the preprocessor even more powerful.
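As a small, hypothetical illustration of this mechanism (expr is ordinary Tcl; the surrounding declaration is plain generator input, not an example from the thesis), an input line and the output the generator would produce could look like this:

    /* generator input: the Tcl code between at-signs is evaluated */
    double buffer[@expr 4 * 1024@];

    /* generator output: the Tcl return value has been inserted */
    double buffer[4096];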

5.1.3 Support for Multithreading

Support for writing architecture-independent multithreaded programs is implemented by the use of the preprocessor and Tcl commands that generate code for the chosen platform. In the code browser, threads exist as abstract entities and are thus separated from normal functions. Like functions, they have arguments and variables. Unlike functions, they have attributes that specify whether they are bound to a processor, detached from their parent thread, and so on.

An example of a call to the thread library is @thr_self@ which produces thr_self() when Solaris threads are used and pthread_self() with POSIX threads. Socrates provides the same functionality for all runtime libraries. This does not mean that only the common subset is supported, as some functionality that is missing in a library may be emulated. Solaris threads, for example, do not have a mechanism for barrier synchronization, but code containing barriers may still be generated for this platform. The preprocessor is able to generate code that implements barriers using mutexes and condition variables, along the lines of the sketch below.
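A minimal sketch of such an emulation, written here with POSIX names for readability (the generated Solaris code would use mutex_t and cond_t instead; the struct layout and names are illustrative, not the actual generated code):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t m;
        pthread_cond_t  cv;
        int count;                              /* threads still expected in this round */
        int nthreads;                           /* number of participating threads */
        int round;                              /* distinguishes successive rounds */
    } barrier_t;

    void barrier_init(barrier_t *b, int nthreads)
    {
        pthread_mutex_init(&b->m, NULL);
        pthread_cond_init(&b->cv, NULL);
        b->count = b->nthreads = nthreads;
        b->round = 0;
    }

    void barrier_wait(barrier_t *b)
    {
        int my_round;
        pthread_mutex_lock(&b->m);
        my_round = b->round;
        if (--b->count == 0) {                  /* last thread to arrive */
            b->count = b->nthreads;             /* reset for the next round */
            b->round++;
            pthread_cond_broadcast(&b->cv);     /* release the waiting threads */
        } else {
            while (b->round == my_round)        /* guards against spurious wakeups */
                pthread_cond_wait(&b->cv, &b->m);
        }
        pthread_mutex_unlock(&b->m);
    }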

Appendix A contains a table of all thread operations and synchronisation mechanisms supported by Socrates.

5.2 Instrumenting Programs

5.2.1 Overview

The measurement process can produce too much data. Some systems address this issue by letting the user filter events after the measurement process. As Socrates uses the same hardware as the target program, a goal of the instrumentation must be to reduce the amount of data that is generated. The user may define which classes of events are to be generated. Only the corresponding code is instrumented, as the following sketch illustrates.
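As a hypothetical sketch of what instrumented code could look like (record_event() and the event constants are assumptions, not the actual Socrates runtime interface), the generator could bracket a selected function call with entry and exit events:

    #include <stdio.h>

    enum { EV_CALL_ENTER = 1, EV_CALL_LEAVE = 2 };

    /* stand-in for the real trace-buffer routine (an assumption) */
    static void record_event(int type, const char *name)
    {
        printf("%d %s\n", type, name);
    }

    static void compute(void) { /* user function under study */ }

    void compute_instrumented(void)
    {
        record_event(EV_CALL_ENTER, "compute"); /* inserted by the generator */
        compute();
        record_event(EV_CALL_LEAVE, "compute"); /* inserted by the generator */
    }

If function-call events were not selected by the user, the call would be emitted without the two record_event() lines.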

5.2.2 Function Calls

5.2.3 User Events

5.2.4 Thread Operations

5.2.5 Synchronisation Objects

5.3 Execution

5.3.1 Collecting Event Traces

During execution of the program, an event trace is generated and stored in a file. Here are a few things that the user has to keep in mind:

• In order to use Socrates, it may be necessary to boot special versions of the operating system kernel. Using Solaris, a version of the kernel must be used that allows the manipulation of the performance counter registers and saves their contents during all context switches.

• The event trace file should be stored on the fastest disk that is available. If there is enough memory available, it is best to use a RAM disk. This can speed up the process dramatically and help to reduce interference with the application program.

• The system should be unloaded during measurement. Other processes may interrupt the program under inspection and thus falsify measurement results, e.g. by evicting data from the caches.

5.4 Post-Processing and Visualization

The post-processing program pp transfers the event data into an SQL database. No filtering or other manipulation of the data is done. This is left to the database system because it provides high-level functionality for exactly this type of operation.

As soon as the data is loaded into the DBMS, the user may perform complex queries that are not normally supported by performance evaluation tools. An example might be a program that checks the events for the occurrence of deadlocks or starvation situations. Freedom, of course, means work. Interaction with the system is done using SQL which may be embedded into other programming languages such as C or Tcl.

There are, of course, some programs that perform standard tasks like determining the number of threads in existence, the number of threads waiting on a condition variable, and so on. They work for every application program and can be used as starting-points for more specialised analysis tools.

5.4.1 Example: Barriers

This example program, written in Tcl/Tk, counts the number of threads waiting in a barrier.

Code: barriers.graph

#!/bin/sh
# the next line is executed by sh, but not by tcl
# because for tcl, it is a continuation of this comment \
exec pgtclsh $0 ${1+"$@"}

The pgtclsh is a Tcl interpreter linked with routines that allow access to a Postgres server. A shell is used to find it using the PATH environment variable.

set dbhandle [pg_connect "graph"]

set plotfd [open "graph.db" w]

A connection to the Postgres server is established using the pg_connect command. The resulting handle is stored in the variable dbhandle. Likewise, a file is opened for the results of this script.

set res [pg_exec $dbhandle \
  "select address from barrier_init where name = '${barrier}';"]
set address [pg_result $res -getTuple 0]
pg_result $res -clear

Barriers are identified by their memory address. During initialization, an event is generated that contains an identifier for the barrier. This is now used to map the name of the barrier to its address. In applications that allocate barriers dynamically, the mapping of addresses to identifiers may not be one to one.

set res [pg_exec $dbhandle \
  "select max(time) from barrier_waited where address = ${address};"]
set maxx [pg_result $res -getTuple 0]
pg_result $res -clear
set res [pg_exec $dbhandle \
  "select min(time) from barrier_wait where address = ${address};"]
set minx [pg_result $res -getTuple 0]
pg_result $res -clear

In order to scale the time axis of the graph, the minimum and maximum timestamps are determined.

set res [pg_exec $dbhandle \
  "create table barrier_events (time float8, event char, eid int);"]
pg_result $res -clear
set res [pg_exec $dbhandle \
  "insert into barrier_events select time,'e',eid \
   from barrier_wait where address = ${address};"]
pg_result $res -clear
set res [pg_exec $dbhandle \
  "insert into barrier_events select time,'l',eid \
   from barrier_waited where address = ${address};"]
pg_result $res -clear

A table is created that holds all barrier events with their timestamp, event id, and a tag that identifies whether the barrier was entered or left.

set res [pg_exec $dbhandle \
  "select time - cast('${minx}.0' as float8) as time,event,eid \
   from barrier_events order by time,eid;"]
set ntups [pg_result $res -numTuples]

Now the table of barrier events is read. Time is normalized to begin with zero and the events are totally ordered by time and event id. The resulting handle will be used to transfer the data from the Postgres server.

set nthreads 0

for {set i 0} {$i<$ntups} {incr i} {

set e [pg_result $res -getTuple $i]

if {[lindex $e 1] == "l"} {

incr nthreads -1

} else {

incr nthreads

}

puts $plotfd "[lindex $e 0].0 $nthreads"

}

Iterating over all events, the number of threads in the barrier is determined for every event. The result is written to the output file.

pg_result $res -clear

pg_disconnect $dbhandle

close $plotfd

The last result handle is cleared, the database connection ended, and the output file closed.

The results of such a program may be further processed with programs for statistical analysis like SPSS or Origin. They can be visualised using a data plotting program like gnuplot, xmgr, or plotmtv. Figure 5.3 shows a plot that was generated from the output file of the above example using xmgr.

[Plot: Number of Threads in Barrier k over Process Virtual Time / ns, 0 to 2.0e+09]
Figure 5.3: Barrier Synchronizations.

5.5 Example: All Pairs Shortest Paths

This sample program implements a simple parallelised version of Robert W. Floyd's algorithm [Flo62, Jun94] for finding the shortest path between all pairs of nodes in a directed weighted graph.

5.5.1 The Algorithm

Data Types: Both the graph and the shortest paths found so far are stored in n x n matrices where n is the number of vertices in the graph. The elements store information about the existence and length of a path (i, j) between nodes i and j. A barrier b is provided for synchronisation between the worker threads.

Initialisation: For each pair of nodes i and j, the result matrix is initialised to d if there exists an edge (i, j) with length d and to infinity if no edge (i, j) exists. The barrier b is initialised to n.

Worker i: Each worker is assigned a row of the result matrix that it works on. It then repeats the following code k = 1 ... n times:

• For each j, see if there exists a path from i to j via node k and, if it is shorter than the shortest path found so far, store it in the result matrix.

• Synchronise with the other threads at barrier b.

5.5.2 The Implementation

The following shows an implementation of the algorithm given above.

Thread Worker {
    Arguments {
        int i,
        APSP* apsp
    }
    Variables {
        int j, k;
        int r;
    }
    Definition {
        for (k = 0; k < n; k++)                       /* make k runs through the algorithm */
        {
            for (j = 0; j < n; j++)                   /* for each (i,j) with i fixed per thread */
            {
                if (i != k                            /* nothing to be won in these cases */
                    && j != k
                    && i != j
                    && (apsp->d + apsp->n*i + k)->exist   /* if ex. (i,k) and (k,j) */
                    && (apsp->d + apsp->n*k + j)->exist)
                {
                    float d;
                    if (i > k)
                    {
                        @mutex_lock (apsp->d + apsp->n*i + k)->m r@;
                        @mutex_lock (apsp->d + apsp->n*k + j)->m r@;
                    }
                    else
                    {
                        @mutex_lock (apsp->d + apsp->n*k + j)->m r@;
                        @mutex_lock (apsp->d + apsp->n*i + k)->m r@;
                    }
                    d = (apsp->d + apsp->n*i + k)->dist   /* d = |i,k| + |k,j| */
                      + (apsp->d + apsp->n*k + j)->dist;
                    @mutex_unlock (apsp->d + apsp->n*k + j)->m r@;
                    @mutex_unlock (apsp->d + apsp->n*i + k)->m r@;
                    if ((apsp->d + apsp->n*i + j)->exist)     /* was there a path before? */
                    {
                        if (d < (apsp->d + apsp->n*i + j)->dist)  /* see if d is shorter */
                        {
                            (apsp->d + apsp->n*i + j)->dist = d;  /* new shortest path found */
                        }
                    }
                    else                                      /* new path was found */
                    {
                        @mutex_lock (apsp->d + apsp->n*i + j)->m r@;
                        (apsp->d + apsp->n*i + j)->exist = 1;     /* (i,j) now exists */
                        (apsp->d + apsp->n*i + j)->dist = d;      /* and has length d */
                        @mutex_unlock (apsp->d + apsp->n*i + j)->m r@;
                    }
                }
            }
            @barrier_wait apsp->k@;                   /* synchronize all workers */
            @barrier_wait apsp->k2@;                  /* before starting a new run */
        }
        @barrier_wait apsp->finish@;                  /* synchronize all threads on exit */
        @thr_exit 0@;                                 /* exit */
    }
}

5.5.3 Measurement

Mutex Operations

Every access to the result matrix is protected by a mutex. Figure 5.4 shows the number of threads waiting for or holding a mutex vs. time. As can be seen from the graph, contention for mutexes is high. A solution to this performance problem would be to replace mutexes by reader-writer locks, thus allowing for more concurrency; a sketch of this idea follows below.
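A minimal sketch of this change, assuming POSIX reader-writer locks (the code measured above uses plain mutexes; names are illustrative): reads of a distance entry no longer exclude each other, only updates do.

    #include <pthread.h>

    static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
    static double dist;                         /* stands for one matrix entry */

    double read_dist(void)
    {
        double d;
        pthread_rwlock_rdlock(&lock);           /* many readers may hold this at once */
        d = dist;
        pthread_rwlock_unlock(&lock);
        return d;
    }

    void update_dist(double d)
    {
        pthread_rwlock_wrlock(&lock);           /* writers get exclusive access */
        if (d < dist)
            dist = d;
        pthread_rwlock_unlock(&lock);
    }

Since most accesses in the algorithm only read distances, this should remove much of the contention visible in Figure 5.4.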

Scheduling

Scheduling has a great influence on program performance. Figure 5.5 shows when a userlevel thread changes its LWP and its CPU. As the Solaris thread library does not use memory conscious scheduling, there is only little affinity to the caches.

Cache Utilisation

Modern microprocessors provide hardware support for performance measurement in the form of event counters that are able to record such information as the number of cache references and hits. Figure 5.6 shows the cache usage of a single userlevel thread. The cache hit rate is about 93 percent.

[Plot: Number of Threads Waiting over Process Virtual Time / ns, 0 to 2.0e+09]
Figure 5.4: Number of threads waiting for mutexes.

[Plot: assignment of a userlevel thread to CPUs #1-#3 and LWPs #1, #6, #11 over Process Virtual Time / ns]
Figure 5.5: Scheduling of a userlevel thread.

[Plot: Cache References, Cache Hits, and Cache Hit Ratio of a userlevel thread over Process virtual time / ns]
Figure 5.6: Cache usage of a userlevel thread.

5.6 Implementation


Events that are generated by the instrumented program are collected in buffers. Whenever a buffer is full, it is inserted into a list. A writer-thread which is bound to its own LWP scans the lists and writes their contents to the event trace file. Generation of events does not introduce much overhead into the application's code as almost all blocking operations are done by the writer-thread. Figure 5.7 shows this process; a minimal sketch of the scheme follows below.

[Diagram: threads fill event buffers from free memory; full buffers are queued on event lists; a writer thread drains the lists into the event file]
Figure 5.7: Generating and writing event traces.
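The following is a sketch of this scheme under simplifying assumptions (one shared list instead of per-thread lists, fixed-size buffers, no shutdown handling); the actual Socrates implementation differs in detail:

    #include <pthread.h>
    #include <stdio.h>

    #define BUF_EVENTS 1024                     /* illustrative buffer size */

    struct buffer {
        unsigned long events[BUF_EVENTS];
        int used;
        struct buffer *next;
    };

    static struct buffer *full_list = NULL;     /* buffers ready for writing */
    static pthread_mutex_t list_m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  list_cv = PTHREAD_COND_INITIALIZER;

    /* called by an application thread when its private buffer is full */
    void submit_buffer(struct buffer *b)
    {
        pthread_mutex_lock(&list_m);
        b->next = full_list;
        full_list = b;
        pthread_cond_signal(&list_cv);          /* wake the writer thread */
        pthread_mutex_unlock(&list_m);
    }

    /* the writer thread: all blocking file I/O happens here */
    void *writer(void *arg)
    {
        FILE *trace = (FILE *)arg;
        for (;;) {
            struct buffer *b;
            pthread_mutex_lock(&list_m);
            while (full_list == NULL)
                pthread_cond_wait(&list_cv, &list_m);
            b = full_list;                      /* detach one full buffer */
            full_list = b->next;
            pthread_mutex_unlock(&list_m);
            fwrite(b->events, sizeof(unsigned long), b->used, trace);
            /* the buffer would be returned to a free list here */
        }
        return NULL;
    }

Application threads only touch memory on the fast path; the lock is taken once per full buffer, not once per event.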

Chapter 6

Conclusion

What has been achieved? The system is not yet bug-free, and the user interface lacks integration.

Future Work

As it is, Socrates supports only one platform. Ports are planned to the MThreads library [Ste95] both on UltraSparc machines and the Convex SPP1600. This will allow a comparison between different thread libraries. It is hoped that the benefits of memory conscious scheduling can be shown and different strategies be evaluated.

The user interface lacks integration. The goal is to combine all parts of the system into one application. Ultimately, all steps of program development shall be integrated into a single environment.

Appendix A

Socrates Support for Multithreading

A.1 Thread Management

Data Type      Description
thrid          The type of an identifier for a thread

Function       Parameters   Description
thr_start      name         name of the thread to start
               args*        pointer to the argument structure
               options      list of options
               thrid        thread identifier
               result       result
thr_exit
thr_join
thr_self
thr_suspend
thr_continue

A.2 Thread-Local Variables

Function       Parameters   Description
tgv_create
tgv_delete
tgv_get
tgv_set

A.3 Mutexes

Data Type      Description
mutex

Function        Parameters   Description
mutex_init
mutex_destroy
mutex_lock
mutex_unlock

A.4 Condition Variables

Data Type      Description
condv

Function          Parameters   Description
condv_init
condv_destroy
condv_wait
condv_signal
condv_broadcast

A.5 Semaphores

Data Type      Description
sema

Function       Parameters   Description
sema_init
sema_destroy
sema_wait
sema_trywait
sema_post

A.6 Barriers

Data Type      Description
barrier

Function           Parameters   Description
barrier_init
barrier_destroy
barrier_setnothr
barrier_wait

Appendix B

Existing Performance Measurement Tools

B.1 CXtrace

B.2 tha/tnf

B.3 Paradyn

B.4 Pablo

B.5 JED

Appendix C

Sun Enterprise X3000

The machine used for development and testing of Socrates was a Sun Ultra Enterprise X3000. The following sections provide a broad overview of the system's architecture and the particular configuration used; more information can be found in [Sun96].

Memory Subsystem

On-Chip Instruction Cache: 16 KB, 2-way associative, line-size of 32 bytes

On-Chip Data Cache: 16 KB, direct-mapped, line-size of 32 bytes with 16 byte subblocks, write-through with no allocate on write miss, non-blocking with up to nine outstanding load operations

External Unified Cache: 512 KB, direct-mapped, line-size of 64 bytes with 16 byte subblocks, write-back, cache controller integrated in the CPU, access by the CPU is pipelined

System Bus Architecture

Configuration

• 3 UltraSparc I modules at 167 MHz

Bibliography

[Bac86] Maurice J. Bach. The Design of the UNIX Operating System. Prentice-Hall, Inc., 1986.

[BS96] Frank Bellosa and Martin Steckermeier. The performance implications of locality information usage in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 37(1):113-121, 1996.

[Con94] Convex Press. Convex Exemplar Architecture, 2nd edition, 1994.

[CT96] Alan Chalmers and Jonathan Tidmus. Practical Parallel Processing. International Thomson Computer Press, 1996.

[Dij68] E. W. Dijkstra. Co-operating sequential processes. Programming Languages, pages 43-112, 1968. Citation taken from [CT96].

[ea95] Rainer Klar et al. Messung und Modellierung paralleler und verteilter Rechensysteme. B.G. Teubner, Stuttgart, 1995.

[Flo62] Robert W. Floyd. Shortest path. Communications of the ACM, 5(6):345, 1962.

[FSZ83] Domenico Ferrari, Giuseppe Serazzi, and Alessandro Zeigner. Measurement and Tuning of Computer Systems. Prentice-Hall, Inc., 1983.

[GTU91] Anoop Gupta, Andrew Tucker, and Shigeru Urushibara. The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 19(1), 1991.

[Han95] Olav Hansen. Leistungsanalyse paralleler Programme. Spektrum Akademischer Verlag GmbH, 1995.

[HB84] Kai Hwang and Faye A. Briggs. Computer Architecture and Parallel Processing. McGraw Hill, Inc., 1984.

[Hof91] Fridolin Hofmann. Betriebssysteme: Grundkonzepte und Modellvorstellungen. Leitfäden der Angewandten Informatik. B.G. Teubner, Stuttgart, 1991.

[HP96] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 2nd edition, 1996.

[Jai91] Raj Jain. The Art of Computer Systems Performance Analysis. John Wiley & Sons, 1991.

[Jun94] Dieter Jungnickel. Graphen, Netzwerke und Algorithmen. BI-Wissenschaftsverlag, 3rd edition, 1994.

[Kai96] Richard Y. Kain. Advanced computer architecture: a systems design approach. Prentice-Hall, Inc., 1996.

[Kop95] Christoph Koppe. Sleeping threads: A kernel mechanism for support of efficient user level threads. In M. H. Hamza, editor, Proceedings of the 7th IASTED/ISMM International Conference on Parallel and Distributed Computing and Systems. IASTED-ACTA PRESS, 1995.

[Lan92] Horst Langendörfer. Leistungsanalyse von Rechensystemen. Hanser Studienbücher der Informatik. Carl Hanser Verlag, 1992.

[McK88] Philip McKerrow. Performance Measurement of Computer Systems. International Computer Science Series. Addison-Wesley, 1988.

[Qui94] Michael J. Quinn. Parallel computing: theory and practice. McGraw-Hill, 2nd edition, 1994.

[Sch94] Curt Schimmel. Unix Systems for Modern Architectures. Addison-Wesley Professional Computing Series. Addison-Wesley Publishing Company, 1994.

[SK90] Margaret Simmons and Rebecca Koskela, editors. Performance Instrumentation and Visualization. ACM Press, 1990.

[SS94] Mukesh Singhal and Niranjan G. Shivaratri. Advanced Concepts in Operating Systems. McGraw Hill, Inc., 1994.

[Ste95] Martin Steckermeier. Einbeziehung von Lokalitätsinformation in Mechanismen zur Prozeßverwaltung auf Benutzerebene. Diploma thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, 1995. DA-I4-95-25.

[Sun94] Sun Microsystems, Inc. Multithreaded Programming Guide, 1994.

[Sun96] Sun Microsystems, Inc. Ultra Enterprise X000 Server Family: Architecture and Implementation, 1996.

[Tan92] Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall, Inc., 1992.

[TTG95] Josep Torrellas, Andrew Tucker, and Anoop Gupta. Evaluating the performance of cache-affinity scheduling in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 24:139-151, 1995.

[Wal95] Klaus Waldschmidt, editor. Parallelrechner, Architekturen - Systeme - Werkzeuge. B.G. Teubner, 1995.

[WG94] David L. Weaver and Tom Germond, editors. The SPARC Architecture Manual, Version 9. Prentice-Hall, Inc., 1994.

[Yan88] J. C. Yan. Post-game Analysis - A Heuristic Resource Management Framework for Concurrent Systems. PhD thesis, Department of Electrical Engineering, Stanford University, 1988.