Instrumentation and Measurement of
Multithreaded Applications
Alexander Voß
January 1997 DA-25-95
Instrumentation and Measurement of
Multithreaded Applications
Studienarbeit (student research project) in Computer Science
submitted by
Alexander Voß
born 13 August 1972 in Bergisch Gladbach
Prepared at the
Institut für Mathematische Maschinen und Datenverarbeitung IV
Friedrich-Alexander-Universität Erlangen-Nürnberg
Advisors: Prof. Dr. Fridolin Hofmann
Dipl.-Inf. Frank Bellosa
Start of work: 3 April 1996
Submission: 3 January 1997
I affirm that I have produced this work without outside help and without using sources other than those specified, that the work has not previously been submitted in the same or a similar form to any other examination authority and accepted by it as part of an examination, and that all passages adopted verbatim or in substance from other works are marked as such.
Erlangen, 28 December 1996
Summary
The goal of this Studienarbeit is to create a programming environment in which programs with multiple threads of control can be developed and measured. The processes of program development and instrumentation should be as independent of the target system as possible, while its specific properties can nevertheless be exploited. Important questions to be answered in the course of this work are:
How can the desired analysis data be obtained and stored without significantly influencing the program's execution?
Which modifications, if any, have to be made to the runtime system in order to save event counters when switching between different threads of control?
How can these data be presented graphically in a meaningful way, so as to inform the programmer about the runtime behaviour of his program?
How can the instrumentation be abstracted so that the system is easily portable?
It is planned to demonstrate the capabilities of the developed system with an example.
Contents
1 Introduction 2
2 Multiprocessor Architecture 4
2.1 Tightly vs. Loosely Coupled Systems ...... 4
2.1.1 Uniform Memory Access (UMA) ...... 4
2.1.2 Cache Only Memory Access (COMA) ...... 4
2.1.3 Nonuniform Memory Access (NUMA) ...... 5
2.2 Interconnection Networks ...... 5
2.3 Caching ...... 6
2.3.1 Direct Mapped vs. Associative Caches ...... 6
2.3.2 Virtual vs. Physical Caches ...... 7
2.3.3 Replacement Policies ...... 7
2.3.4 Write Policies ...... 8
2.3.5 Line Size ...... 8
2.3.6 Cache Consistency ...... 8
2.3.7 Prefetching ...... 9
2.3.8 Non-blocking operation ...... 9
2.4 Memory Models ...... 9
3 Multithreading 10
3.1 Motivation ...... 10
3.2 Drawbacks ...... 11
3.3 Kernel- vs. Userlevel Threads ...... 11
3.4 Scheduling ...... 12
3.4.1 Preemptive vs. Nonpreemptive ...... 12
3.4.2 Policy vs. Mechanism ...... 13
3.4.3 Scheduling Policies ...... 13
3.4.4 Scheduling in Multiprocessor Systems ...... 14
3.5 Synchronization ...... 14
3.5.1 Deadlocks and Starvation ...... 14
3.5.2 Synchronization Problems ...... 15
3.5.3 Synchronization Mechanisms ...... 16
3.5.4 Synchronisation in Multiprocessor Systems ...... 17
4 Performance Measurement 19
4.1 Introduction to performance evaluation ...... 19
4.2 Measurement Methodology ...... 21
4.2.1 Hardware monitors ...... 22
4.2.2 Software monitors ...... 22
4.2.3 Hybrid monitors ...... 23
4.3 Tuning ...... 23
4.3.1 Program Behaviour ...... 23
4.3.2 Bottlenecks ...... 24
4.3.3 Improving Software ...... 25
5 Socrates 27
5.1 Programming with Socrates ...... 28
5.1.1 The User Interface ...... 28
5.1.2 The Generator ...... 29
5.1.3 Support for Multithreading ...... 29
5.2 Instrumenting Programs ...... 29
5.2.1 Overview ...... 29
5.2.2 Function Calls ...... 30
5.2.3 User Events ...... 30
5.2.4 Thread Operations ...... 30
5.2.5 Synchronisation Objects ...... 30
5.3 Execution ...... 30
5.3.1 Collecting Event Traces ...... 30
5.4 Post-Processing and Visualization ...... 30
5.4.1 Example: Barriers ...... 31
5.5 Example: All Pairs Shortest Paths ...... 34
5.5.1 The Algorithm ...... 34
5.5.2 The Implementation ...... 35
5.5.3 Measurement ...... 36
5.6 Implementation ...... 38
5.6.1 Implementation ...... 38
6 Conclusion 40
A Socrates Support for Multithreading 41
A.1 Thread Management ...... 41
A.2 Thread-Local Variables ...... 41
A.3 Mutexes ...... 41
A.4 Condition Variables ...... 42
A.5 Semaphores ...... 42
A.6 Barriers ...... 42
B Existing Performance Measurement Tools 43
B.1 CXtrace ...... 43
B.2 tha/tnf ...... 43
B.3 Paradyn ...... 43
B.4 Pablo ...... 43
B.5 JED ...... 43
C Sun Enterprise X3000 44
Friedrich-Alexander-Universität Erlangen-Nürnberg
Abstract
Instrumentation and Measurement of Multithreaded Applications
by Alexander Voß
How to write an abstract:
readers of an abstract want to decide whether to read the full thesis
present the outcomes of your work
an abstract is a small article in itself and thus stands separate from the thesis
keep it simple
do not make claims that are not justified in the paper
claim some new result, unless you write a survey
avoid "In this paper" or similar introductions
Keywords: Multithreading, Measurement, Instrumentation
Chapter 1
Introduction
Problem Statement
Computer systems continue to become more powerful, but as they do so the demands always grow beyond the reach of any existing system. The performance increase that can be achieved by reducing cycle times is limited because of the finite speed at which information can propagate and because of energy dissipation problems. The only other way to increase the performance of a system is to exploit the parallelism inherent in the application.
Parallelism on the level of micro-instructions has been used for a long time; most of the technology has been in use since the advent of supercomputing. Today, even the cheapest microprocessors have multiple independent execution units and implement pipelining. Optimising compilers that help to exploit this fine-grained parallelism are widely available. As technology matures, its limits become visible. The parallelism that can be achieved at this level is very limited. Scalar arithmetic instructions, for example, cannot make efficient use of pipelines deeper than fourteen stages [HP96, page 210].
The advent of cheap single-chip microprocessors has made it feasible to build multicomputer and multiprocessor systems that exploit parallelism at a much higher level. Applications are split into multiple threads of control that cooperate in order to achieve a common goal.
Development of applications that contain multiple threads of control is still poorly understood and resembles an art rather than an engineering discipline. Only few tools are available that aid in managing the complexity of concurrent computation. Most of these support only one step in the development cycle, e.g. debugging.
Another problem is that computer systems become more complex as the number of processors and levels in the memory hierarchy increase. Some of this complexity leaks through even to the user level. Today's machines no longer provide the programmer with a consistent single system image.
Goals of this Thesis
The primary goal of this project is to provide tools that allow the programmer to write multithreaded applications in an architecture-independent way and evaluate their performance in different environments, i.e. different hardware, operating systems, and runtime systems.
Performance measurement shall provide an insight into the dynamics of the program under development as well as the underlying system. As portability is a primary goal, no additional hardware may be used to record the data. It follows that a software solution has to be found that meets the following criteria:
The influence of the measurement part of the system on the execution of the application program must be minimised. All blocking operations must be done by separate kernel threads.
The system must allow the user to specify the set of events that are to be recorded. Producing too much data slows down the process and hides the important facts.
Instrumentation and visualisation should use the identifiers given in the application's source code. The programmer must be able to see the relation between code and program dynamics.
[more]
Outline of the Thesis
Performance of multithreaded applications heavily depends on the properties of the underlying system. Hardware architecture, operating system design, and design of the runtime system interact with each other in intricate ways. To understand the reasons for an application's behaviour, one has to look at all levels of the system.
Thus the first chapters discuss some selected topics from the fields of computer architecture, operating system design, runtime support, as well as performance evaluation that constitute the research domain for this thesis. This material is meant to show possible performance problems and programming errors as well as ways to improve the software.
Beginning with chapter 5, the programming environment Socrates is presented. Its functionality is described and some insight is given into its implementation. A simple example illustrates the process of developing and evaluating multithreaded code. Conclusions are drawn in chapter 6. Appendices inform about topics such as other measurement tools or the software and hardware that has been used.
Chapter 2
Multiprocessor Architecture
This chapter presents some aspects of computer architecture that are closely related to multiprocessor performance. The goal is to show possible performance problems and programming errors. For further information on the architecture of multiprocessor systems see [HB84, HP96, Kai96].
2.1 Tightly vs. Loosely Coupled Systems
Multiprocessors are classified as being tightly vs. loosely coupled, depending on whether they share a global physical memory system or not. In tightly coupled systems, processors have direct access to all memory locations and thus memory can be used for communication and synchronisation, whereas in loosely coupled systems message passing has to be used. Tightly coupled architectures can be further subdivided as follows:
2.1.1 Uniform Memory Access (UMA)
In systems with uniform memory access, each processor has direct access to a single shared memory. All memory locations are equidistant in terms of access times to each processor. Most UMA systems incorporate caching to eliminate memory contention, but this mechanism is hidden from applications.
2.1.2 Cache Only Memory Access (COMA)
Systems with cache only memory access do not have a physically shared memory but caches only. These caches constitute the machine's memory and, together, form a single address space. Access times vary depending on whether the requested memory location is in the local cache or in a remote one. Application software may be ignorant of the system's architecture as the machine behaves very much like a UMA machine with caching.
2.1.3 Nonuniform Memory Access (NUMA)
Systems with nonuniform memory access have a distributed shared physical memory. Each partition of it is directly attached to a node but can be accessed by processors in other nodes via the interconnection network. Thus memory access times differ depending on whether the requested location is local to the node or remote. This added level of complexity may be hidden from application software, but doing so leads to suboptimal performance. To make best use of the hardware, the programmer must take its architecture into consideration. Caching may be used between processors and local memory as well as between nodes. Machines with coherent caching at the hardware level are called cc-NUMA.
2.2 Interconnection Networks
Interconnection networks can be built using either packet- or circuit-switching technology and can use various topologies. The more parallelism and redundancy is incorporated, the more complex and expensive the interconnection network gets. Most solutions in use today represent a reasonable compromise between performance and cost:
Buses are the least expensive interconnection networks. They are also very easy to manage but do not provide multiple connections at the same time or failure tolerance. Buses do not scale well; the only way to introduce parallelism is to replicate them. Systems with a very limited number of processors use a bus or a small set of buses. Many larger systems, however, use buses at node level and another topology at the global level.
Rings allow more simultaneous data transfers than buses because all segments (the paths between two nodes) of the ring may be used for communication at the same time. They can also provide fault tolerance if data can be sent in both directions. On the downside, data that is sent between nodes that are not adjacent has to travel through switches of other nodes before reaching the destination, thereby being delayed. This performance problem grows with the number of nodes connected to the ring.
Cross-bar Switches provide multiple exclusive communication lines between processors and memory (they may also be used to connect multiple processors). If memory is partitioned into n blocks, n concurrent memory accesses can be accommodated. Contention can only occur if two processors try to access the same memory module. Cross-bar switches scale well, but the costs grow quadratically, which makes them prohibitively expensive with larger numbers of processors.
Multistage switches provide a means of interconnection that scales reasonably well in both performance and cost. The complexity of the network grows logarithmically with increasing parallelism. As opposed to cross-bar switches, contention can occur at the network level. Buffers can be introduced to address this problem.
Hypercube networks are often used in loosely coupled systems with store-and-forward communication. Like multistage switches, their complexity increases only logarithmically. They scale well, but the number of processors always has to be a power of two. Contention at network level can occur but is addressed by using store-and-forward communication.
2.3 Caching
A cache is a buffer used to hide differences in operating speed. The general scheme can be applied in a lot of situations. Introducing caches into the memory system has three advantages: the latency of accesses to main memory is hidden, bandwidth is increased, and contention of interconnection structures is reduced.
Processor speed continues to grow faster than main memory speed, and novel multiprocessor architectures have deep memory hierarchies with latencies that can reach a couple of hundred processor cycles. These developments make caches the central factor in system performance. This section describes the key attributes of cache subsystems and their impact on performance. For further information about caching see [HP96, Kai96, Sch94].
2.3.1 Direct Mapped vs. Associative Caches
In a cache subsystem that uses direct mapping, a data item can be stored in only one cache line. Its address is split into three parts: the first is used to address the line that the data can be found in, the second is stored in the tag RAM to uniquely identify the data in the cache, and the third is the offset of the data in the line.
Associative caches use fewer bits to address the cache line than would be necessary to uniquely identify it. Thus data items can map to multiple lines. A cache that can map data to n distinct locations is called n-way associative. In order to identify which data is in the cache, more bits have to be used in the tag RAM. Accessing the cache now requires a search process as there are multiple lines in which a data item may be stored. In order to keep the speed of the lookup operation at the level of a direct mapped cache, the search must be done in parallel, requiring n comparators for an n-way associative cache.
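To make the address split concrete, consider a hypothetical direct mapped cache of 16 KB with 32-byte lines (these parameters are invented for the example): it has 512 lines, so the low 5 address bits give the offset within a line, the next 9 bits select the line, and the remaining bits form the tag. A sketch in C:

    /* Illustrative address decomposition for a hypothetical direct
       mapped cache: 16 KB capacity, 32-byte lines => 512 lines.    */
    #define LINE_SIZE 32UL                 /* bytes per cache line  */
    #define NUM_LINES 512UL                /* 16 KB / 32 bytes      */

    #define OFFSET(a) ((a) & (LINE_SIZE - 1))          /* bits 0-4  */
    #define INDEX(a)  (((a) / LINE_SIZE) % NUM_LINES)  /* bits 5-13 */
    #define TAG(a)    ((a) / (LINE_SIZE * NUM_LINES))  /* bits 14+  */

In an n-way associative cache of the same size, the index would instead select a set of n lines, using log2(n) fewer index bits and a correspondingly larger tag.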
There are three factors that make associative caches more expensive than direct mapped ones: the comparators needed, the increased size of the tag RAM, and the need for a more complex replacement policy (see below). As these costs increase with the size of the cache, small on-chip caches usually employ a higher level of associativity than the larger external caches. Making a cache associative has its advantages: data items that would map to the same line in a direct mapped cache can now be stored in the cache at the same time. Thrashing is reduced and utilisation improved.
2.3.2 Virtual vs. Physical Caches
Most systems today have a memory management unit (MMU) providing virtual address spaces. The cache can be placed between CPU and MMU, or between MMU and main memory. The former is called a virtual cache, as it only sees virtual addresses, while the latter is referred to as a physical cache. Both solutions have their specific advantages and disadvantages:
Access to virtual caches need not go through the MMU and is thus faster when a hit occurs.
Virtual addresses might be larger than physical ones, making virtual caches more expensive.
Virtual caches require more control logic as they have to deal with multiple address spaces.
Physical caches can do bus-snooping in order to provide cache consistency.
Most CPUs today have the memory management unit integrated, so that external caches are always physical. It is possible to improve the performance of a physical cache by using virtual addresses for indexing and physical addresses for tagging. This way, address translation in the MMU may be overlapped with cache operation. The resulting architecture is called a virtual cache with physical tags (see [Sch94]).
2.3.3 Replacement Policies
When new data is written to the cache, old entries may have to be evicted. Choosing a victim in a direct mapped cache is trivial as there is only one candidate. In associative caches, multiple solutions exist. The algorithm that chooses an item to remove is called the replacement policy. Replacement in cache systems is very much like page replacement in virtual memory systems. The policies used, however, are much simpler because they have to be implemented in expensive hardware and have to be fast.
LRU works by adding state information to the tag RAM which is updated with every cache reference on the corresponding line. The line that was least recently used is selected for eviction. Strict LRU is usually used for two-way associative caches only as it would require too much state information for caches with higher associativity.
Pseudo-LRU is used in cache systems with associativity greater than two. The state information recorded is limited but still enough so that lines that are used often do not get evicted.
Random replacement requires no state information at all and still yields good results. It is best suited for large highly associative caches because it saves a lot of hardware while still performing well.
2.3.4 Write Policies
Write operations by the CPU may update both memory and cache, or the cache only. The first solution is called write-through and has the advantage that memory always contains up-to-date information. The second policy is termed write-back. It has the advantage of being faster, as the data need not be written to the slow main memory. Its disadvantage is that it leaves the memory in an inconsistent state, requiring additional logic in the case of parallel operations (e.g. DMA). [missing: write-allocate]
2.3.5 Line Size
Another factor that determines the performance of a cache is its line size. It determines the amount of data that is moved between main memory and the cache. Large cache lines mean that data is prefetched, i.e. loaded into the cache before the CPU requests it. Assuming spatial locality of the memory references, this can improve performance and reduce contention of interconnection networks. Smaller lines, however, can be written back to memory faster. Choosing the right line size means finding a tradeoff between possible performance gains and losses. Another factor is cost, as smaller lines mean larger tags and thus require more hardware.
2.3.6 Cache Consistency
Caches introduce replication into the memory system. Whenever memory is accessed by multiple units, care must be taken that the information in the memory system stays consistent. Cache consistency in uniprocessor systems can be controlled by software or hardware. The only critical operations are interactions with peripheral devices. A simple solution is to mark all I/O buffers as non-cacheable. The presence of multiple processors introduces new difficulties. Although software solutions would be possible (see [Sch94]), it is necessary to implement consistency protocols in hardware in order to achieve optimal performance.
Consistency of caches is implemented by enforcing a set of rules that determine when a data item may be cached. As long as the data is only read, there is no problem at all. On a write operation, however, measures must be taken to keep the view of the memory system consistent. If the location that is written to is cached, the cache entries have to be either updated or marked as invalid. Thus cache consistency protocols fall into two categories: write-update and write-invalidate.
Write-Update Protocols
[missing]
Write-Invalidate Protocols
[missing]
2.3.7 Prefetching
Multiprocessors today allow programs to prefetch data from main memory into caches by issuing special prefetch instructions. Prefetching is done in parallel to normal CPU operation and can thus help to hide memory latency.
2.3.8 Non-blocking operation
[missing]
2.4 Memory Models
[missing]
The following example illustrates the issue: two processors each set one shared flag and then test the other's. Under a sequentially consistent memory model at most one of the two messages can be printed; weaker memory models also allow both writes to be delayed past both reads, so that both messages appear.

    /* Processor 1 */           /* Processor 2 */
    A = 0;                      B = 0;
    ...                         ...
    A = 1;                      B = 1;
    if (B == 0)                 if (A == 0)
        printf("B");                printf("A");
Chapter 3
Multithreading
This chapter deals with how multithreaded programs can exploit the parallelism provided by multiprocessors and what influence the operating system and the runtime system have on their performance.
3.1 Motivation
One goal of operating system design has always been to improve the utilisation of the underlying hardware. Thus batch processing was superseded by multitasking. Multithreading is a refinement of traditional multitasking that separates the notions of a thread of control and an address space. The following advantages justify the use of this concept:
Increased Throughput A single-threaded program has to use non-blocking primitives if it wants to overlap communication with its execution, which leads to a dramatic increase in program complexity. Multithreading allows overlapping of communication and execution while using higher-level (i.e. blocking) services.
Efficiency and Ease of Communication Applications that are distributed over multiple processes have to use some kind of interprocess communication. Multithreaded applications automatically use the most efficient IPC mechanism available: shared memory. In most applications, shared memory is also the most intuitive to program.
Program Structure Designing an application as a set of cooperating threads helps to clarify program structure. Many problems are parallel by nature: servers might start one thread per request and numeric applications might divide their data into subsets that are worked on by cooperating threads.
Multiple Processors Multithreaded applications make use of hardware parallelism on a fine level of granularity. A single application program may be executed in parallel using multiple processors, which reduces its execution time. Compared to traditional multitasking, utilisation of multiprocessor systems is improved.
Availability The interactivity of applications is increased by multithreading. A user need not wait for each of his commands to finish execution but can issue new requests while the system works on completing those he recently made. The need to show a "please wait" message decreases.
System Resources Multithreaded applications make better use of system resources than multiprogrammed applications do. Threads need less state information than processes and context switches are much cheaper.
3.2 Drawbacks
Like every technique, multithreading has its limitations and drawbacks. The following summarises some aspects that have to be taken into consideration whenever a decision has to be made whether to use multiple threads or not:
Some applications do not lend themselves very well to multithreading. Synchronisation and communication eat up much of the speedup achieved by parallelisation.
Parallelising an application is a tedious job that introduces new potential programming errors. It is very difficult to test or prove the correctness of multithreaded code.
Most operating systems today provide buffering and prefetching for I/O operations. The use of separate threads for I/O does not improve performance very much on such systems.
Interactivity of an application can be achieved by using an event-driven programming style rather than threads (Ousterhout).
3.3 Kernel- vs. Userlevel Threads
Threads can be implemented either as an operating system concept or as a user library. Both possibilities have their respective advantages and disadvantages:
Kernel threads are more expensive as they require administrative data in kernel space and have to do their context switches via the kernel. All functionality that any application might need has to be included in the kernel.
Userlevel threads are not known to the operating system; kernel services can only be used by processes or kernel threads. If a userlevel thread calls a blocking system call, all other threads are blocked with it. Signals cannot be sent to a single userlevel thread.
With userlevel threads, scheduling is independent of the operating system and can thus be tuned to fit the application's needs. Synchronization between userlevel threads need not use expensive system calls.
To overcome these problems and combine the advantages of both concepts, it is now common to use kernel threads as "virtual processors" for userlevel threads. This, however, introduces a new level of complexity: How is the mapping between userlevel threads, kernel threads, and physical processors to be done? How can the kernel and the userlevel library interact in order to optimize system performance?
3.4 Scheduling
The scheduler is the entity in the operating system or runtime library that maps threads of execution onto processing elements. Its impact on system performance is significant. This section presents some aspects of scheduling systems and shows why traditional scheduling policies are unsuitable for multiprocessor systems. Scheduling in uniprocessor systems is described in [Hof91, Tan92]. Issues that arise in multiprocessors and distributed systems are covered in [SS94].
3.4.1 Preemptive vs. Nonpreemptive
The scheduler may be invoked whenever a thread of execution enters the kernel or the runtime library. If this is the only way to initiate scheduling, the system is said to be nonpreemptive. In a cooperative system, the scheduler only runs when a process explicitly calls it. Although this scheme minimizes context switches, it does not optimize overall system performance as it has no way of assigning priorities (e.g. to I/O-bound tasks). Another problem is that a thread of execution may monopolize the CPU and decrease the system's interactiveness or even bring it to a standstill.
A solution to these problems is preemptiveness. Timer interrupts are used to periodically force the CPU into kernel mode, where the scheduler may be invoked. This increases flexibility but also means that an application cannot make any assumptions about when scheduling will take place. Some race conditions only occur in preemptive systems. Programmers have to be well aware of these potential problems.
3.4.2 Policy vs. Mechanism
The decision which thread of execution is to be scheduled next is independent of the mechanism that actually invokes the thread. Only the latter part of the scheduler has to be part of the operating system or runtime library. Policy (what is to be done?) may be separated from mechanism (how is it done?). Performance can be increased if scheduling decisions are made by the application programs themselves.
A program that enters a state in which it does a lot of computation might decide not to preempt threads. Thus the tradeoff between interactivity and throughput can be dynamically adjusted, reducing the number of context switches.
3.4.3 Scheduling Policies
The following paragraphs introduce some scheduling policies that are common in modern operating systems and runtime libraries:
Round Robin Each thread is assigned a time quantum. If it does not voluntarily relinquish the CPU before its quantum has expired, it is preempted and inserted at the end of the queue of runnable processes. When a new thread needs to be selected for execution, the first in the runqueue is selected. The only important parameter in this scheme is the length of the quantum. If it is too long, responsiveness will suffer; if it is too short, context switches will degrade performance.
Priority Scheduling It may be necessary or convenient to give some processes priority over others. With priority scheduling, the process with the highest priority is run. To keep low-priority jobs from starving, the priority of a process that has just run is decreased. Priorities may be assigned statically or dynamically.
Multiple Queues This scheduling policy was developed for a system that had very expensive task switches. The designers recognised the fact that long-running compute-bound processes should be run less often but with larger quanta. Multiple queues hold processes with different runtime behaviour.
Other Policies There are a lot of other scheduling policies, most of which are tailored for a specific purpose, e.g. realtime processing. Examples are shortest job first, guaranteed scheduling, and two-level scheduling. Consult [Tan92] for further information.
3.4.4 Scheduling in Multiprocessor Systems
In multiprocessor systems, the scheduler has to map n processes onto m processors. Issues like caching and communication costs have to be taken into consideration. The main goal in today's systems is to keep all CPUs busy by providing data fast enough; CPU power is no longer the limiting factor for system performance.
Affinity Scheduling
Load balancing has to be done in order to keep all CPUs busy, but it removes threads from the caches that hold their data. On NUMA systems, a thread might even be migrated to another node, from which it has to access its data remotely. Affinity scheduling [BS96, GTU91, TTG95] takes these aspects into account and tries to migrate only those threads that do not have much data in caches. Threads are kept on the same node and processor whenever possible. This scheme is very successful, especially on NUMA systems that have caches between hypernodes (for an example of such an architecture see [Con94]).
3.5 Synchronization
Support for synchronization of concurrent operations has to be provided at three levels: hardware, operating system, and programming language. Hardware support in the form of atomic instructions is needed if multiple processors access shared resources such as shared physical memory. Multiprogramming operating systems have to deal with mutual exclusion between multiple threads of execution. Programming languages should provide means of abstraction for the services of hardware and operating system.
This section describes synchronisation in multithreaded environments. The basic synchronisation problems are discussed as well as the mechanisms that runtime systems provide to solve them. Finally, techniques that help to make synchronisation efficient on multiprocessor systems are discussed. Synchronization at the software level is discussed in [Hof91, SS94, Tan92]. Hardware aspects are addressed in [HB84].
3.5.1 Deadlocks and Starvation
A deadlock is a situation in which all processes in a system are blocked waiting for some resource. Because none of these processes will become runnable, no resources can be freed and the system grinds to a standstill. Another problem arises if processes never become runnable although resources are available. This situation is called "starvation". It is likely to occur in a system with static priorities. Deadlocks are examined in depth in [SS94, Tan92]. Deadlocks appear in the following situations:
AB-BA deadlock
recursive locking
[more]
3.5.2 Synchronization Problems
This section describes four synchronisation problems, each of which represents a class of similar problems. Almost every situation in which synchronisation is needed can be reduced to one of these "basic" problems:
Mutual Exclusion The basic problem of synchronization is that of mutual exclusion. Operations on shared data structures have to be "atomic", i.e. at any point of time, only one thread of execution may work on a given set of shared data. If mutual exclusion is not guaranteed, data inconsistencies may occur.
To model mutual exclusion, the notion of a "critical section" has been introduced. If a thread of execution needs to access a shared data object, it first has to acquire a lock, i.e. enter the critical section. Then it performs operations on the data and finally releases the lock, i.e. leaves the critical section, when it is finished and the data is in a coherent state again. Locks may be associated either with the data objects or with the code that modifies them.
Bounded Buffers This synchronization problem is also known as the producer-consumer problem. Producer threads send messages to consumer threads via a buffer of limited size. Clearly, access to the buffer has to be mutually exclusive. Another problem arises whenever a producer needs to send messages via a full buffer or a consumer wants to read from an empty one. In these cases, the processes have to wait for the buffer's state to change.
Readers/Writers Shared data may be read by multiple threads, but a thread may only write data if it has exclusive access, i.e. there are neither other writers nor readers. Two cases have to be distinguished: in the first, priority is given to the readers, while in the second the writer may block arriving readers and thus takes priority. If readers are assigned priority, a writer may starve if new readers keep arriving.
Dining Philosophers Multiple threads may contend for limited resources, each needing more than one resource. The classical description of this problem is that of five philosophers alternately thinking and eating. For each philosopher, there is a bowl of spaghetti but there are only five forks between the bowls. A philosopher needs both the left and the right fork in order to eat. Clearly, no two neighbouring philosophers may eat simultaneously and access to the forks needs to be synchronized. A situation like this is likely to cause deadlocks.
3.5.3 Synchronization Mechanisms
This section describes the four most common constructs used for synchronisation in multithreaded environments. They can be found in most runtime libraries and are implemented as normal API calls. Thus no support by the compiler is needed. Programming language constructs for synchronization are discussed in [SS94]. A formal treatment of synchronization mechanisms can be found in [Hof91].
Mutexes
Mutexes are synchronization objects that have only the two states "locked" and "unlocked". Atomic operations can be performed to lock and unlock mutexes. The lock operation blocks a thread until it acquires the mutex. Most implementations provide a trylock function that tries to lock a mutex but does not block on it. If the mutex was already locked, trylock simply returns an error code. This functionality can be used to spin on a mutex.
Even scalar values may have to be protected by mutexes, as access to them might not be atomic on multiprocessors: if the size of the data is larger than the width of the memory subsystem or in case of misalignment, multiple memory accesses have to be performed in order to manipulate the data. Optimizations may be made if it is known that a scalar is accessed atomically, but code using such optimizations is not portable.
Whenever multiple mutexes have to be acquired, deadlocks may arise. They can be avoided by using lock hierarchies, i.e. acquiring locks only in a predefined order, or by using the trylock operation and releasing all other locks if it fails, a strategy called "backoff".
Mutexes are the basis for all other synchronization mechanisms. Their implementation is straightforward if hardware constructs like test-and-set instructions are used.
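A hedged sketch of the trylock-based backoff strategy just described, written against the POSIX threads API (the thesis itself is not tied to POSIX, and lock_both is an invented name):

    #include <pthread.h>

    pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

    /* Acquire both mutexes without risking an AB-BA deadlock:
       take the first, try the second, and back off on failure. */
    void lock_both(void)
    {
        for (;;) {
            pthread_mutex_lock(&a);
            if (pthread_mutex_trylock(&b) == 0)
                return;                /* both locks held          */
            pthread_mutex_unlock(&a);  /* backoff: release, retry  */
        }
    }

With a lock hierarchy, the loop disappears: every thread simply locks a before b, so the trylock and the backoff path are not needed.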
Condition Variables
Condition variables provide a mechanism to wait for a condition to be satisfied (the bounded buffer problem can be solved using condition variables). They are always used in conjunction with a mutex that provides mutual exclusion. After locking the mutex, the condition may be checked. If the condition is satisfied, the program unlocks the mutex and continues execution. Otherwise, it performs a wait operation that atomically inserts the thread into a list of waiting threads associated with the condition variable and unlocks the mutex. Whenever the data that is associated with a condition is changed, a signal operation has to be performed in order to wake up a thread that waits for such a change. Most implementations provide two more operations: a broadcast wakes up all threads waiting on the condition variable and timedwait allows threads to wait a certain time only.
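The canonical usage pattern can be sketched with POSIX threads as follows (a simplification of the bounded buffer problem: the counter stands in for the buffer and the full-buffer case is omitted):

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    int count = 0;                     /* items in the buffer        */

    void consume(void)
    {
        pthread_mutex_lock(&m);
        while (count == 0)             /* re-check after each wakeup */
            pthread_cond_wait(&c, &m); /* releases m while waiting   */
        count--;                       /* condition holds, m is held */
        pthread_mutex_unlock(&m);
    }

    void produce(void)
    {
        pthread_mutex_lock(&m);
        count++;
        pthread_cond_signal(&c);       /* wake one waiting consumer  */
        pthread_mutex_unlock(&m);
    }

Note that the condition is tested in a loop: between the signal and the wakeup another thread may have consumed the item, so the woken thread must re-check before proceeding.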
Semaphores
were introduced by Dijkstra in 1968 [Dij68] as a mechanism for mutual exclusion and synchronization. Semaphores can be thought of as an object consisting of a signed integer value, a condition variable, and a mutex. Two methods are defined, namely P and V, both of which are atomic (this is why the mutex is needed). P decrements the integer and blocks on the condition variable if the new value is less than zero. If the semaphore contains a negative number, then this designates the number of processes waiting on it. V increments the integer and unblocks one process from the condition variable if the new value is zero or less.
Mutual exclusion can be achieved with a semaphore by initializing it to one and protecting critical sections with a pair of P and V operations. Semaphores used in this way are called binary semaphores.
Synchronization can be achieved by initializing a semaphore to zero. Now processes can block on the semaphore with P operations and an event can be signaled to one waiting process by invoking V.
Resource allocation can be controlled by initializing a semaphore to the number of resources available. Now processes can allocate and free resources by invoking P and V respectively.
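A sketch of this construction in terms of POSIX mutexes and condition variables (not an implementation from the thesis; the wakeups field is an added detail that ensures exactly one blocked thread proceeds per V, even in the presence of spurious wakeups):

    #include <pthread.h>

    typedef struct {
        int             value;    /* negative: number of waiters    */
        int             wakeups;  /* pending releases from V        */
        pthread_mutex_t m;
        pthread_cond_t  c;
    } sema_t;                     /* value set to the initial count,
                                     wakeups to 0, at creation time */
    void P(sema_t *s)
    {
        pthread_mutex_lock(&s->m);
        if (--s->value < 0) {
            do {                           /* block until a V arrives */
                pthread_cond_wait(&s->c, &s->m);
            } while (s->wakeups == 0);
            s->wakeups--;                  /* consume one release     */
        }
        pthread_mutex_unlock(&s->m);
    }

    void V(sema_t *s)
    {
        pthread_mutex_lock(&s->m);
        if (++s->value <= 0) {             /* someone is waiting      */
            s->wakeups++;
            pthread_cond_signal(&s->c);
        }
        pthread_mutex_unlock(&s->m);
    }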
Barriers
are used to synchronize a set of threads at a specific location in the program. Threads reaching a barrier wait until all other threads have also reached it. [more]
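For illustration, a reusable barrier can be built from a mutex and a condition variable along the following lines (a sketch, not the Socrates implementation; the phase counter lets the barrier be used repeatedly):

    #include <pthread.h>

    typedef struct {
        int             needed;   /* number of participating threads */
        int             arrived;  /* threads arrived in this phase   */
        int             phase;    /* generation counter (for reuse)  */
        pthread_mutex_t m;
        pthread_cond_t  c;
    } barrier_t;

    void barrier_wait(barrier_t *b)
    {
        int my_phase;

        pthread_mutex_lock(&b->m);
        my_phase = b->phase;
        if (++b->arrived == b->needed) {  /* last thread to arrive   */
            b->arrived = 0;
            b->phase++;                   /* open the next phase     */
            pthread_cond_broadcast(&b->c);
        } else {
            while (b->phase == my_phase)  /* also guards against     */
                pthread_cond_wait(&b->c, &b->m); /* spurious wakeups */
        }
        pthread_mutex_unlock(&b->m);
    }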
3.5.4 Synchronisation in Multiprocessor Systems
[revise]
On uniprocessors, mutual exclusion can be achieved by making the kernel nonpreemptive (thus exactly one process can be in kernel mode) and disabling interrupts during critical sections (thus keeping interrupt handlers from interfering). All other synchronization problems can be solved building upon these concepts, e.g. resources can be protected using the sleep/wakeup mechanism that introduces condition variables into the operating system.
Synchronization on multiprocessors is a more complex matter. The non-preemption assumption no longer holds and disabling interrupts on all processors of a system is hardly a good idea. Furthermore, it is desirable to have multiple processors execute kernel code concurrently in order to achieve higher parallelism. For a discussion of multiprocessor kernels see [Sch94].
Most computer systems provide at least some variation of a "test-and-set" operation. This instruction provides atomic access to one memory address. The old value can be read and subsequently modified with one operation. Shared memory multiprocessor systems need at least this basic form of synchronization in order to function properly.
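Built on such an instruction, a spin lock can be sketched as follows; test_and_set is an assumed primitive (typically a short assembly stub around an atomic instruction, e.g. SPARC's ldstub), not a standard C function:

    /* test_and_set() atomically sets *p to 1 and returns the
       previous value; provided by hardware or an assembly stub. */
    extern int test_and_set(volatile int *p);

    void spin_lock(volatile int *l)
    {
        while (test_and_set(l) != 0)
            ;                    /* spin until the lock was free   */
    }

    void spin_unlock(volatile int *l)
    {
        *l = 0;                  /* release with an ordinary store */
    }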
Thread-Level Synchronisation
As parallelism increases, so does the need for synchronization. In a multithreaded environment, very lightweight mechanisms have to be provided in order to achieve a maximal performance improvement. Threads rather than processes should be blocked on resources, and synchronization among userlevel threads should be implemented at this level. [more]
Spinning vs. Blocking
Spinning (a.k.a. "busy waiting") is usually avoided as it wastes precious CPU cycles and memory bandwidth. If, however, the process only has to wait for a short period of time, blocking it would be a bad idea. The cost of a task switch may then be greater than the cost of spinning. If the time that a process has to wait cannot be predetermined, a strategy might be employed that spins for a specific time and then blocks, as sketched below.
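A minimal sketch of such a spin-then-block strategy in terms of POSIX mutex operations (SPIN_LIMIT and hybrid_lock are invented names; a real implementation would derive the spin limit from the cost of a context switch):

    #include <pthread.h>

    #define SPIN_LIMIT 1000      /* illustrative value only          */

    void hybrid_lock(pthread_mutex_t *m)
    {
        int i;

        for (i = 0; i < SPIN_LIMIT; i++)
            if (pthread_mutex_trylock(m) == 0)
                return;          /* acquired the lock while spinning */
        pthread_mutex_lock(m);   /* still busy: give up and block    */
    }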
Chapter 4
Performance Measurement
This chapter deals with the art and science of performance evaluation. The first section provides a broad overview of performance evaluation, the context in which measurement stands. The second section describes theoretical aspects of performance measurement as well as current methodology and tools. Tuning of application software and the runtime environment is discussed in the last section. For further information about performance measurement see [FSZ83, McK88, Lan92, Jai91, ea95, SK90].
4.1 Introduction to performance evaluation
Performance can be defined as a sum of properties of a system that make it fit for a specific purpose. The individual properties are called performance indices. Different people regard different aspects of a system as important and each person's view might change in different contexts. Thus, performance is a weighted sum of performance indices:

    P = \sum_i w_i p_i

The goal of performance evaluation is the improvement of the absolute performance or of the cost/performance ratio. Even if absolute performance is the primary goal, only limited resources are available to achieve it. Thus, speedup and efficiency are important:

    S = P^* / P    and    E = P / C

Speedup S is the ratio of the modified system's performance P^* to the original performance P. Efficiency E is performance per cost C.
System performance is limited by the weakest link in the chain, in computer science termed a bottleneck. As we eliminate one bottleneck, others show up, while the costs of improving the performance of a system normally increase with every step of improvement. Thus, every system has a state of optimal operation, where the cost per performance is minimal or where the cost of performance improvement becomes unacceptable.
Information processing systems are usually structured into multiple layers, each of which provides an increased level of abstraction. Accordingly, performance indices can be defined in each of these layers. Some examples are given in table 4.1.

    Level              Performance Index
    System             throughput, capacity, availability
    Program            time to completion, responsiveness
    Process            number of page faults, memory usage
    Language           block execution time, number of executions
    Operating System   cost of context switches, resource utilisation
    OS Resource        utilisation, average number of waiting processes
    Hardware           cache efficiency, bandwidth, cycle time

    Table 4.1: Examples of performance indices

Evaluation of a system's performance is needed in all stages of its life-cycle:
Design: The developer has to repeatedly evaluate the current state of a design until the goals have been reached.
Improvement: No product coming to market is perfect. Problems not encountered in the design phase now become visible and new demands develop. Thus the design process has to continue with this new input.
Acquisition: Before a system is bought, it has to be verified that it meets the requirements.
Operation: Most systems are parameterised in some way. To attain maximum performance, these parameters have to be tuned.
Capacity planning: A system's workload will probably increase during operation. The operator has to calculate when the maximum workload will be reached and how much added performance has to be bought or whether the system has to be replaced.
There are four basic techniques for evaluating the performance of a system, differing greatly in requirements, cost and precision:
Analysis of mathematical models A system is described in terms of mathematical formulas that describe its behaviour.
Simulation A model of the system is generated and each step of operation is simulated and evaluated.
Measurement The behaviour of parts of the system is measured under reproducible conditions. This requires that the system be operational.
Benchmarking The time that specific programs need for execution is measured. These programs may be existing applications or artificially designed benchmark programs. Benchmarking can be seen as a special case of measurement.
Of these, measurement is the one examined in this thesis. The following two sections provide an overview of the theory and practice of performance measurement.
4.2 Measurement Methodology
The goal of performance measurement is to gain insight into the dynamics of a system's operation. In contrast to an examination of the static description of a system, performance measurement can provide information about its runtime behaviour and thus lead to performance improvements and the discovery of race conditions. The measurement process consists of five basic steps:
Instrumenting the system: sensors and stimuli are attached to the system in order to extract information and to influence the target system.
Generating traces and profiles: traces record the chronology of events in the system while profiles count them or measure durations.
Storing the collected information: interference with the system's operation must be minimised.
Processing the data: measurement will usually yield a huge amount of data that must be post-processed.
Visualising the results of processing: collected data has to be presented to the user in a way that emphasises important information.
Common characteristics of measurement tools are: [more]
Interference: Measurement tools influence the measured system. This interference has to be kept to a minimum.
Accuracy: The system must be able to collect as much data as the user wishes. Recorded values must be as precise as possible and error ranges must be known.
Level of Abstraction: Measurement can record data about objects on any level of a system's hierarchy, from logic gates to complex application programs.
Data reduction capabilities: It must be possible to reduce the amount of data that is collected to a meaningful quantity. Recording all available data might hide the important facts.
Scope: A measurement system that is tightly fitted to just one application cannot be reused and its price will probably be too high compared with general-purpose systems.
Compatibility: Target and measurement system have to interact and thus either have to be built using the same basic technology or be adapted to each other.
Integration: Measurement is mostly done in the context of a design process which it has to be integrated into. An ideal system does not interrupt the design process but rather accompanies it.
Ease of use: Of course, a measurement system has to have a user interface that makes it easy to use. Another issue is the turnaround time, i.e. the time between the specification of the measurement parameters and the display of the results. So-called offline systems display data only after measurement is completed. Online systems display it directly and may even support interactive changing of measurement parameters.
Costs: Costs are always an issue. A system that costs more than it is worth is unlikely to be used. One must always consider the running costs of the system, which depend on its resource usage and its ease of use.
4.2.1 Hardware monitors
Hardware monitors provide means to trace the low-level operations of a system. They can collect data almost without interfering with the target system and their accuracy is very high. Unfortunately, hardware monitors are mostly highly specialised tools that have a limited range of applications, which makes their use very expensive.
4.2.2 Software monitors
Very often, the advantages and drawbacks of software complement those of hardware. Measurement programs can be far more flexible, adaptable, easy to use and easy to integrate. They can be significantly cheaper. Their drawbacks are in the fields of interference and accuracy. Software solutions have to use existing hardware and thus contend for resources with the processes that are to be measured. Influence on the target system is high and the amount of data that can be collected very limited.
4.2.3 Hybrid monitors
Hybrid monitors are a way to alleviate the deficiencies of software monitors. Special measurement hardware is introduced into the target system that allows the collection of low-level data that would not be accessible to a pure software measurement system. Another purpose of the hardware component may be to reduce the influence that the measurement system exerts.
Most microprocessors today have at least some support for performance measurement. There are usually several counters for events like cache misses, page faults, or pipeline stalls (for an example see [WG94]).
4.3 Tuning
[reorder]
Performance is measured in order to improve a system's behaviour. Whenever needs are not satisfied, the system must be changed. One way of improving performance is to buy a more powerful machine. Although this option is theoretically available in most cases, it is seldom the most cost-effective way. There are two causes for bad performance and thus two ways to improve the behaviour of an existing system:
4.3.1 Program Behaviour
The most obvious question when addressing performance problems is: Does the application software make optimal use of the computer's resources? It is so obvious that many people tend to shut their eyes to it. Let me illustrate this with a short example: [find a better example]
Most software today makes no use of parallelism at all. Thus, although multiprocessing can improve the system's overall performance by running applications in parallel, an individual application cannot profit from it. In today's workstation-oriented environments, there mostly is only one active process. A workstation's resources are unused most of the time and overloaded whenever the user makes a request. Nevertheless, the workstation model is quite popular.
Here are some examples of how an individual application program can be improved:
Parallelism can be introduced by splitting it into multiple processes or threads. This can be a reasonable step even if the program runs on uniprocessors, as it may improve the application's responsiveness and hide the latency of communication operations [].
Locking of shared resources may be made finer in order to achieve more parallelism [Sun94, page 116].
Alternative algorithms may be considered. If memory usage is the critical factor, a suboptimal algorithm in terms of theoretical time complexity may be appropriate.
On a NUMA machine, memory allocation may be modified in order to bring data closer to the processor working on it. []
Another possibility is to improve the behaviour of an application program by changing its environment:
Most operating systems have a large number of tuneable parameters that may be adjusted to best fit the most commonly used applications.
Scheduling is an important factor. Most of today's systems use static schemes that do not take the program's behaviour (e.g. its usage of caches) into account. A scheduling policy that adapts to the demands of the running programs can bring great performance improvements [Ste95, page 14].
In most systems that use kernel threads as virtual processors for the user threads, concurrency is reduced whenever a thread makes a blocking system call. This may even lead to deadlocks if the number of running kernel threads reaches zero. There are ways to keep the number of virtual processors, and thus concurrency, constant [Kop95].
Synchronisation mechanisms may be made more efficient by introducing a strategy of spinning for some time and then blocking [Ste95, page 46].
Tuning need not be an action that is performed only once. Systems have been built that monitor a system's operation and try to optimise it without intervention by the user [Yan88].
4.3.2 Bottlenecks
Performance is limited by the weakest link in the chain: a bottleneck. Buying faster CPUs will not bring a performance gain if the limiting factor is memory bandwidth. Bottlenecks can be found in both hard- and software. Here are some examples:
Software
Most libraries have been made thread-safe by simply adding one synchronisation object to each subsystem (e.g. the malloc calls). An application using a lot of dynamic data structures loses parallelism to this simple scheme.
Centralised kernel data structures may become a major bottleneck in multiprocessor systems. This is especially bad as it affects all processes.
If all requests to a client-server database are handled by a single process then this will likely be a bottleneck.
Hardware
The performance of the memory subsystem and of the communication paths between processors is today's major bottleneck.
Mass storage may be a bottleneck in applications that require large amounts of data but do not perform complex operations on them.
An MMU may be a bottleneck if it has to access main memory too often to provide the processor with data fast enough.
How can the bottleneck problem be addressed? There are two basic ways: bottlenecks may be eliminated (e.g. by replicating the critical resource) or they may be hidden (e.g. by introducing caches).
4.3.3 Improving Software
Only seldom is it p ossible to makechanges b eneath the level of the ap-
plication software. Better hardware may b e unavailable or to o exp ensive.
Changing the op erating system maybeeven more exp ensive. Even if there
was a b etter system, switching to it would mean that applications would
have to b e rewritten, p eople b e retrained, and so on. A solution that is
feasible is p erformance tuning of the application software.
Optimising Software for Cache Eciency
Performance of a computer system can b e improved if software is optimised
to make go o d use of the cache subsystem. The primary goal is to maximise
the cache hit-ratio. Causes for cache misses fall into three categories called
the \three C's" [HP96]:
Compulsory: Whenever a data item is accessed for the rst time, a cache
miss o ccurs. There is no way to prevent these so-called cold start
misses.
Capacity: If a program references more data than ts into the cache, misses
o ccur b ecause cache-lines have to b e evicted to make ro om for others.
Away to alleviate this problem is to increase the temp oral lo calityof
the data e.g. by reordering lo ops. 25
Con ict: Data items mapping to the same cache lines may exp el each other
from the cache. The higher the asso ciativity of the cache is, the less
severe is this problem. Programmers can avoid collision misses by
allo cating data structures that are used together in one blo ck.
Compilers can p erform co de optimisations that improve a programs util-
isation of hardware resources. Higher level reordering, however, still has to
b e done by the programmer. Hennessy and Patterson [HP96] present four
basic optimisations that improve cache use:
Merging Arrays: Cache use can b e improved if data structures that are
accessed together are merged or at least allo cated together. The rst
optimisation increases spatial lo cality while the second simply reduces
the probability of collision misses.
Lo op Interchange: Programs accessing data in nested lo ops often trash
caches b ecause the data items are not addressed in the order in which
they are stored. Interchanging inner and outer lo ops can yield great
p erformance improvements.
Lo op Fusion: If a program accesses the same data in di erent lo ops in
order to p erform di erent computations, temp oral lo calitymaybe
improved by joining these lo ops.
Blocking: The last optimisation technique tries to eliminate capacity misses by dividing the problem that is to be solved into multiple subproblems. A good example of this is matrix multiplication, which results in a very unfortunate access pattern when implemented naively but can be done block-wise.
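To make the effect of loop interchange concrete, the following fragment sketches both traversal orders for a matrix stored in row-major order, as C arrays are. The array name and size are illustrative and not taken from any program discussed in this work.

/* Cache-hostile: the inner loop walks a column, so consecutive
 * accesses are N*sizeof(double) bytes apart and rarely share a
 * cache line. */
#define N 1024

double a[N][N];

void column_order(void)
{
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] += 1.0;
}

/* Cache-friendly: interchanging the loops makes the inner loop walk
 * a row, so consecutive accesses hit the same cache lines. */
void row_order(void)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] += 1.0;
}

Both functions perform exactly the same arithmetic; only the order of the memory references differs.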
Improving Multithreaded Code
Re-blocking
Optimizing Synchronisation
Improving the Runtime System
Chapter 5
Socrates
Developing multithreaded programs is a difficult and still ill-understood task. Only few tools are available that aid the programmer in designing correct and efficient code. Socrates is an environment that allows programs to be written in an architecture-independent way and their performance to be evaluated. Figure 5.1 shows an overview of the system.
Figure 5.1: Overview of the Socrates system (interactive editing, visualization, and source generation produce the files *.d, *.h, *.c, and *.p.c; these are compiled and linked; execution yields events that are post-processed into a DB).
Applications are written in ANSI C using a graphical environment. Support for multithreading and performance measurement is added in an orthogonal way. Instead of extending the language to support parallelism, a generator has been built that inserts the necessary code into the source. Section 5.1 will show this in detail.
The graphical user interface allows the programmer to browse and edit the source code and specify instrumentation parameters. The user must specify the set of events that are to be generated during execution. This step is described in Section 5.2.
The code that is generated can be compiled using an unmodified C compiler but may have to be linked with modified versions of the runtime libraries and be run using an extended system kernel. During the execution of the program, an event trace is recorded and saved in a file in raw form. This process is discussed in Section 5.3.
The generator produces a post-processing program which inserts the raw data into an SQL database. This allows the programmer to perform complex queries using a standardised language. Socrates does not provide support for the analysis of the data, as there exists a wide variety of programs for this purpose. Data may be exported to spreadsheet applications, statistical analysis systems, or plotting software. Section 5.4 discusses the benefits and costs of this approach.
An example program which illustrates the whole process of writing multithreaded code, instrumenting it, generating code for a specific platform, and analysing the resulting data is presented in Section 5.5.
5.1 Programming with Socrates
5.1.1 The User Interface
The user interface of Socrates allows the programmer to browse and edit the source code. The code is split into multiple parts: types, variables, functions, and so on.
Figure 5.2: Socrates User Interface
5.1.2 The Generator
Once a program has been written using Socrates, C source files may be generated that can be compiled using an unmodified C compiler. Declarations and definitions are separated and written to the corresponding files.
Socrates contains a preprocessor that is able to manipulate the code that is generated. It allows arbitrary Tcl code to be inserted into the C source code. Anything between two at-signs is passed to the Tcl interpreter, and the return value is inserted into the output of the generator. This scheme makes Socrates very easy to extend and adapt to new environments. It also provides the programmer with more powerful preprocessing functionality than the standard C preprocessor does. In the current version, Tcl commands cannot be nested. This will change in the future, making the preprocessor even more powerful.
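To illustrate the mechanism, consider the following hypothetical fragment; the Tcl command used, expr, is a standard Tcl command, but the surrounding C code is made up for this example and does not come from an actual Socrates program.

/* Input as written in the browser: the text between the at-signs is
 * evaluated by the Tcl interpreter at generation time. */
for (i = 0; i < @expr 4 * 8@; i++)
    buf[i] = 0;

/* Output of the generator: the result of the Tcl expression has been
 * substituted into the C source. */
for (i = 0; i < 32; i++)
    buf[i] = 0;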
5.1.3 Support for Multithreading
Support for writing architecture-independent multithreaded programs is implemented by the use of the preprocessor and Tcl commands that generate code for the chosen platform. In the code browser, threads exist as abstract entities and are thus separated from normal functions. Like functions, they have arguments and variables. Unlike functions, they have attributes that specify whether they are bound to a processor, detached from their parent thread, and so on.
An example of a call to the thread library is @thr_self@, which produces thr_self() when Solaris threads are used and pthread_self() with POSIX threads. Socrates provides the same functionality for all runtime libraries. This does not mean that only the common subset is supported, as functionality that is missing in a library may be emulated. Solaris threads, for example, do not have a mechanism for barrier synchronization, but code containing barriers may still be generated for this platform. The preprocessor is able to generate code that implements barriers using mutexes and condition variables, as the sketch below illustrates.
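The following is a minimal sketch of such an emulation, written against POSIX threads; it is not the code actually emitted by the generator, and all names are illustrative.

#include <pthread.h>

typedef struct {
    pthread_mutex_t m;
    pthread_cond_t  c;
    int n;       /* number of participating threads */
    int count;   /* threads still missing in this round */
    int round;   /* round counter, guards against early reuse */
} barrier_t;

void barrier_init(barrier_t *b, int n)
{
    pthread_mutex_init(&b->m, NULL);
    pthread_cond_init(&b->c, NULL);
    b->n = n;
    b->count = n;
    b->round = 0;
}

void barrier_wait(barrier_t *b)
{
    int my_round;

    pthread_mutex_lock(&b->m);
    my_round = b->round;
    if (--b->count == 0) {
        /* the last thread to arrive starts the next round
         * and wakes all waiting threads */
        b->count = b->n;
        b->round++;
        pthread_cond_broadcast(&b->c);
    } else {
        /* the loop guards against spurious wakeups */
        while (b->round == my_round)
            pthread_cond_wait(&b->c, &b->m);
    }
    pthread_mutex_unlock(&b->m);
}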
Appendix A contains a table of all thread operations and synchronisation mechanisms supported by Socrates.
5.2 Instrumenting Programs
5.2.1 Overview
The measurement process can produce too much data. Some systems address this issue by letting the user filter events after the measurement process. As Socrates uses the same hardware as the target program, a goal of the instrumentation must be to reduce the amount of data that is generated. The user may define which classes of events are to be generated. Only the corresponding code is instrumented; a hypothetical illustration follows.
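As a purely hypothetical illustration, instrumentation of a function call could take a form like the following; the event logger, event types, and ids shown are invented for this sketch and are not the names used by Socrates.

/* Illustrative only: the kind of wrapper a generator could insert
 * around a call when function-call events are enabled. */
enum { EV_CALL, EV_RETURN };

void log_event(int type, int id)
{
    /* would append an event record to a trace buffer */
}

int compute(int x) { return x + 1; }   /* some application function */

int instrumented_call(int x)
{
    int result;
    log_event(EV_CALL, 42);     /* 42: id assigned to compute() */
    result = compute(x);
    log_event(EV_RETURN, 42);
    return result;
}

If the class of function-call events is disabled, the calls to the logger are simply not generated, and no overhead remains in the code.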
5.2.2 Function Calls
5.2.3 User Events
5.2.4 Thread Operations
5.2.5 Synchronisation Objects
5.3 Execution
5.3.1 Collecting Event Traces
During execution of the program, an event trace is generated and stored in a file. Here are a few things that the user has to keep in mind:
In order to use Socrates, it may be necessary to boot special versions of the operating system kernel. Under Solaris, a version of the kernel must be used that allows the manipulation of the performance counter registers and saves their contents during all context switches.
The event trace file should be stored on the fastest disk that is available. If there is enough memory available, it is best to use a RAM disk. This can speed up the process dramatically and help to reduce interference with the application program.
The system should be unloaded during measurement. Other processes may interrupt the program under inspection and thus falsify measurement results, e.g. by evicting data from the caches.
5.4 Post-Processing and Visualization
The post-processing program pp transfers the event data into an SQL database. No filtering or other manipulation of the data is done. This is left to the database system because it provides high-level functionality for exactly this type of operation.
As soon as the data is loaded into the DBMS, the user may perform complex queries that are not normally supported by performance evaluation tools. An example might be a program that checks the events for the occurrence of deadlocks or starvation situations. Freedom, of course, means work. Interaction with the system is done using SQL, which may be embedded into other programming languages such as C or Tcl.
There are, of course, some programs that perform standard tasks like determining the number of threads in existence, the number of threads waiting on a condition variable, and so on. They work for every application program and can be used as starting points for more specialised analysis tools.
5.4.1 Example: Barriers
This example program written in Tcl/Tk counts the number of threads waiting in a barrier.
Code: barriers.graph
#!/bin/sh
# the next line is executed by sh, but not by tcl
# because for tcl, it is a continuation of this comment \
exec pgtclsh $0 ${1+"$@"}
pgtclsh is a Tcl shell linked with routines that allow access to a Postgres server. A shell is used so that pgtclsh is located via the PATH environment variable.
set dbhandle [pg_connect "graph"]
set plotfd [open "graph.db" w]
A connection to the Postgres server is established using the pg_connect command. The resulting handle is stored in the variable dbhandle. Likewise, a file is opened for the results of this script.
set res [pg_exec $dbhandle \
    "select address from barrier_init where name = '${barrier}';"]
set address [pg_result $res -getTuple 0]
pg_result $res -clear
Barriers are identified by their memory address. During initialization, an event is generated that contains an identifier for the barrier. This is now used to map the name of the barrier to its address. In applications that allocate barriers dynamically, the mapping of addresses to identifiers may not be one-to-one.
set res [pg_exec $dbhandle \
    "select maxtime from barrier_waited where address = ${address};"]
set maxx [pg_result $res -getTuple 0]
pg_result $res -clear
set res [pg_exec $dbhandle \
    "select mintime from barrier_wait where address = ${address};"]
set minx [pg_result $res -getTuple 0]
pg_result $res -clear
In order to scale the time axis of the graph, the minimum and maximum timestamps are determined.
set res [pg_exec $dbhandle \
    "create table barrier_events (time float8, event char, eid int);"]
pg_result $res -clear
set res [pg_exec $dbhandle \
    "insert into barrier_events select time,'e',eid \
     from barrier_wait where address = ${address};"]
pg_result $res -clear
set res [pg_exec $dbhandle \
    "insert into barrier_events select time,'l',eid \
     from barrier_waited where address = ${address};"]
pg_result $res -clear
A table is created that holds all barrier events with their timestamps, event ids, and a tag that identifies whether the barrier was entered or left.
set res [pg_exec $dbhandle \
    "select time - cast('${minx}.0' as float8) as time, event, eid \
     from barrier_events order by time, eid;"]
set ntups [pg_result $res -numTuples]
Now the table of barrier events is read. Time is normalized to begin with zero and the events are totally ordered by time and event id. The resulting handle will be used to transfer the data from the Postgres server.
set nthreads 0
for {set i 0} {$i<$ntups} {incr i} {
set e [pg_result $res -getTuple $i]
if {[lindex $e 1] == "l"} {
incr nthreads -1
} else {
incr nthreads
}
puts $plotfd "[lindex $e 0].0 $nthreads"
}
Iterating over all events, the number of threads in the barrier is determined for every event. The result is written to the output file.
pg_result $res -clear
pg_disconnect $dbhandle
close $plotfd
The last result handle is cleared, the database connection closed, and the output file closed.
The results of such a program may be further processed with programs for statistical analysis like SPSS or Origin. They can be visualised using a data plotting program like gnuplot, xmgr, or plotmtv. Figure 5.3 shows a plot that was generated from the output file of the above example using xmgr.
Figure 5.3: Barrier Synchronizations (number of threads in the barrier over process virtual time / ns).
5.5 Example: All Pairs Shortest Paths
This sample program implements a simple parallelised version of Robert W. Floyd's algorithm [Flo62, Jun94] for finding the shortest paths between all pairs of nodes in a directed weighted graph.
5.5.1 The Algorithm
Data Types: Both the graph and the shortest paths found so far are stored in n × n matrices, where n is the number of vertices in the graph. The elements store information about the existence and length of a path (i, j) between nodes i and j. A barrier b is provided for synchronisation between the worker threads.
Initialisation: For each pair of nodes i and j, the result matrix is initialised to d if there exists an edge (i, j) with length d and to ∞ if no edge (i, j) exists. The barrier b is initialised to n.
Worker i: Each worker is assigned a row of the result matrix that it works on. It then repeats the following code k = 1...n times:
For each j, see if there exists a path from i to j via node k and, if it is shorter than the shortest path found so far, store it in the result matrix.
Synchronise with the other threads at barrier b.
5.5.2 The Implementation
The following shows an implementation of the algorithm given above.
Thread Worker {
    Arguments {
        int i,
        APSP* apsp
    }
    Variables {
        int j,k;
        int r;
    }
    Definition {
        for (k = 0; k < apsp->n; k++)
        {
            for (j = 0; j < apsp->n; j++)
            {
                if (i != k                                  /* nothing to be won in these cases */
                    && j != k
                    && i != j
                    && (apsp->d + apsp->n*i + k)->exist     /* if (i,k) and (k,j) exist */
                    && (apsp->d + apsp->n*k + j)->exist)
                {
                    float d;
                    if (i > k)                              /* fixed lock order avoids deadlock */
                    {
                        @mutex_lock (apsp->d + apsp->n*i + k)->m@;
                        @mutex_lock (apsp->d + apsp->n*k + j)->m@;
                    }
                    else
                    {
                        @mutex_lock (apsp->d + apsp->n*k + j)->m@;
                        @mutex_lock (apsp->d + apsp->n*i + k)->m@;
                    }
                    d = (apsp->d + apsp->n*i + k)->dist     /* d = |i,k| + |k,j| */
                      + (apsp->d + apsp->n*k + j)->dist;
                    @mutex_unlock (apsp->d + apsp->n*k + j)->m@;
                    @mutex_unlock (apsp->d + apsp->n*i + k)->m@;
                    if ((apsp->d + apsp->n*i + j)->exist)   /* was there a path before? */
                    {
                        if (d < (apsp->d + apsp->n*i + j)->dist)  /* see if d is shorter */
                        {
                            (apsp->d + apsp->n*i + j)->dist = d;  /* new shortest path found */
                        }
                    }
                    else                                    /* new path was found */
                    {
                        @mutex_lock (apsp->d + apsp->n*i + j)->m@;
                        (apsp->d + apsp->n*i + j)->exist = 1;     /* (i,j) now exists */
                        (apsp->d + apsp->n*i + j)->dist = d;      /* and has length d */
                        @mutex_unlock (apsp->d + apsp->n*i + j)->m@;
                    }
                }
            }
            @barrier_wait apsp->k@;       /* synchronize all workers */
            @barrier_wait apsp->k2@;      /* before starting a new run */
        }
        @barrier_wait apsp->finish@;      /* synchronize all threads on exit */
        @thr_exit 0@;                     /* exit */
    }
}
5.5.3 Measurement
Mutex Operations
Every access to the result matrix is protected by a mutex. Figure 5.4 shows the number of threads waiting for or holding a mutex vs. time. As can be seen from the graph, contention for the mutexes is high. A solution to this performance problem would be to replace the mutexes by reader-writer locks, thus allowing for more concurrency (a sketch follows).
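A minimal sketch of this replacement, using the Solaris reader-writer lock interface; the shared variable stands in for a matrix element and is illustrative, not code from the program above.

#include <synch.h>

static rwlock_t m;      /* would replace the per-element mutex */
static float dist;      /* illustrative shared datum */

void example(void)
{
    float d;

    rwlock_init(&m, USYNC_THREAD, NULL);

    /* Readers: many threads may inspect a distance concurrently. */
    rw_rdlock(&m);
    d = dist;
    rw_unlock(&m);

    /* Writers: updating a distance still requires exclusive access. */
    rw_wrlock(&m);
    dist = d;
    rw_unlock(&m);
}

Since the workers read distances far more often than they update them, read locks would rarely block each other, and contention should drop accordingly.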
Scheduling
Scheduling has a great influence on program performance. Figure 5.5 shows when a userlevel thread changes its LWP and its CPU. As the Solaris thread library does not use memory-conscious scheduling, there is only little affinity to the caches.
Cache Utilisation
Modern microprocessors provide hardware support for performance measurement in the form of event counters that are able to record such information as the number of cache references and hits. Figure 5.6 shows the cache usage of a single userlevel thread. The cache hit rate is about 93 percent.
Figure 5.4: Number of threads waiting for mutexes (over process virtual time / ns).
Figure 5.5: Scheduling of a userlevel thread (LWP and CPU assignment over process virtual time / ns).
Figure 5.6: Cache usage of a userlevel thread (cache references, cache hits, and cache hit ratio over process virtual time / ns).
5.6 Implementation
5.6.1 Event Collection
Events that are generated by the instrumented program are collected in buffers. Whenever a buffer is full, it is inserted into a list. A writer thread, which is bound to its own LWP, scans the list and writes its contents to the event trace file. Generation of events does not introduce much overhead into the application's code, as almost all blocking operations are done by the writer thread. Figure 5.7 shows this process; a sketch of the scheme follows.
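The following is a minimal sketch of this buffering scheme, assuming POSIX threads; the buffer layout, names, and sizes are illustrative and do not reflect the actual Socrates implementation.

#include <pthread.h>
#include <stdio.h>

#define BUFSLOTS 1024

typedef struct buffer {
    unsigned long events[BUFSLOTS];   /* raw event records */
    int used;                         /* number of filled slots */
    struct buffer *next;
} buffer_t;

static buffer_t *full_list = NULL;    /* list of filled buffers */
static pthread_mutex_t list_m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  list_c = PTHREAD_COND_INITIALIZER;

/* Called by instrumented threads: hand a full buffer to the writer.
 * Only a short critical section, no file I/O on the application path. */
void submit_buffer(buffer_t *b)
{
    pthread_mutex_lock(&list_m);
    b->next = full_list;
    full_list = b;
    pthread_cond_signal(&list_c);
    pthread_mutex_unlock(&list_m);
}

/* The writer thread (bound to its own LWP) drains the list and does
 * the blocking writes to the event trace file. */
void *writer(void *arg)
{
    FILE *trace = (FILE *)arg;
    for (;;) {
        buffer_t *b;
        pthread_mutex_lock(&list_m);
        while (full_list == NULL)
            pthread_cond_wait(&list_c, &list_m);
        b = full_list;
        full_list = b->next;
        pthread_mutex_unlock(&list_m);
        fwrite(b->events, sizeof(unsigned long), b->used, trace);
        /* the drained buffer would be returned to a free list here */
    }
    return NULL;
}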