Measurement-Based Modeling of Distributed Systems

Meßbasierte Modellierung verteilter Systeme

Submitted to the Faculty of Engineering of the University of Erlangen-Nürnberg for the degree of

DOKTOR-INGENIEUR

by

Kai-Steffen Jens Hielscher

Erlangen, 2008

Approved as a dissertation by the Faculty of Engineering of the University of Erlangen-Nürnberg

Date of submission: 12 March 2008
Date of the doctoral examination: 21 April 2008
Dean: Prof. Dr.-Ing. habil. Johannes Huber
Reviewers: Prof. Dr.-Ing. Reinhard German, Prof. Dr.-Ing. Wolfgang Schröder-Preikschat

Contents

List of Figures 7

List of Tables 9

Abstract 11

Zusammenfassung 15

1 Introduction 19

2 Related Work 23
2.1 Measurements 23
2.2 Time Synchronization 24
2.3 Input Modeling 25
2.4 Performance Evaluation of Web Servers 25

3 The Web Cluster Laboratory 27
3.1 The Linux Virtual Server System 28
3.2 Hardware Setup 30

4 Measurement Concepts 33
4.1 Computer Clocks 38
4.2 Clock Errors 39
4.2.1 Classification in the Frequency Domain 46
4.2.2 Classification in the Time Domain 48
4.3 Reference Clocks 51
4.3.1 NTP 52
4.3.2 Time Sources 54
4.3.3 The PPS API 55


5 Dedicated Measurement Infrastructure 59
5.1 PPS Pulse Latency 63
5.1.1 Echo Feedback 65
5.2 Offline Synchronization 69
5.3 Instrumentation 73
5.3.1 IP Stack Instrumentation 73
5.3.2 Web Server Instrumentation 77
5.3.3 Load Generator Instrumentation 77
5.3.4 Application Server Instrumentation 78
5.3.5 Summary Performance Data 81
5.4 Analysis of the Traces 82
5.5 Example Measurement Results 83

6 Advanced Input Modeling 91
6.1 Traces and Empirical Distributions 92
6.2 Outlier Values 94
6.3 Autocorrelation 95
6.4 Standard Theoretical Distributions 97
6.5 Multimodal Distributions 101
6.6 Multimodal Distributions with Phases 102
6.7 Bézier Distributions 105
6.8 A New Model for Autocorrelated Data 107

7 Simulation Model 117
7.1 Model Structure 118
7.2 TCP 119
7.2.1 RFC 793 120
7.2.2 RFC 1122 123
7.2.3 RFC 1323 124
7.2.4 RFC 2581 125
7.2.5 RFC 2988 126
7.3 Client 128
7.3.1 Application 128
7.3.2 TCP 129
7.3.3 Processor 129
7.4 Network Channels 130
7.5 Load Balancer 132


7.6 Servers 132
7.6.1 Processes 133
7.6.2 System Processes 134
7.6.3 Processor 135
7.7 Utility Classes and Execution Control 136
7.8 Experiments 137

8 Conclusions and Future Work 143

Bibliography 147


List of Figures

3.1 Distributed Web Server Architecture 27
3.2 Load Balancing via NAT 29

4.1 Hardware Monitoring 35
4.2 Software Monitoring 36
4.3 Hybrid Monitoring 37
4.4 Latencies for Reading the Time 39
4.5 Frequency Changes with Temperature 41
4.6 Frequency Variation 42
4.7 Frequency Distribution 43
4.8 Phase Errors 44
4.9 UDP Delays 45
4.10 Power-Law Spectral Densities 47
4.11 Allan Deviation 50
4.12 NTP Time Transfer 52
4.13 NTP Architecture 54
4.14 NTP and the PPS API 56

5.1 Detail of UDP Delays 61
5.2 Synchronization System 62
5.3 Interrupt Latencies 65
5.4 External Clock 66
5.5 Time Deviation 68
5.6 Offline Synchronization 70
5.7 IP Stack Instrumentation 75
5.8 Application Server Instrumentation Architecture 80
5.9 Illustration of Delays in the Object System 86
5.10 Trace Plot of Measured Delays 87
5.11 Trace Plots of Individual Delays 88


5.12 Delay Components for Requests 89
5.13 Summary Statistics of the Delays 90

6.1 Histograms of Observed Delays 93
6.2 Correlation Plots (lag ≤ 500) 96
6.3 Correlation Plots (lag ≤ 40) 98
6.4 Trace Plots Sorted by Real Server 99
6.5 Distribution Comparison for Delay 22 101
6.6 Distribution Comparison for Delay 3 103
6.7 State Chart for Phase Transitions 104
6.8 Distribution Comparison for Delay 19 105
6.9 Screenshot of PRIME 106
6.10 Distribution Comparison for Delay 18 108

6.11 Histogram H0 of the Deltas for Delay 5 109
6.12 Trace Plot of Delay 5 110
6.13 Delta over the Values of Delay 5 111
6.14 3D Histogram of Delta 5 112
6.15 Weighting Areas 113
6.16 Weighting Factors 114
6.17 Original and Weighted Histogram for Delta 5 114
6.18 Distribution Comparison for Delay 5 115

7.1 Conceptual Model 119
7.2 TCP 120
7.3 Model of a TCP Segment 121
7.4 Central TCP State Chart receive_packet 122
7.5 Structure of the Client Object 128
7.6 Conceptual Model of the Network Channels 131
7.7 Server Model and Embedded Objects 133
7.8 Graphical Comparison of the Results 140

List of Tables

4.1 Slope Characteristics 51

5.1 Quantile Summary for Delays in Microseconds 85

6.1 Fitted Standard Theoretical Distributions 100
6.2 Fitted Multimodal Distributions 102
6.3 Fitted Multimodal Distributions with Phases 104

7.1 Core Simulation Parameters 139
7.2 Quantile Comparisons in Milliseconds 141
7.3 CPU Load Comparison 142


Abstract

Nowadays, distributed systems are ubiquitous. Since the delays incurred during processing in such systems are often essential, many research projects deal with performance analyses of these systems. Most of them treat the systems from an abstract point of view and build coarse-grained models. These do not include results of detailed measurement studies of real systems. The goal of this work is to demonstrate a methodology for creating precise models of distributed systems that are parametrized, calibrated and validated from fine-grained measurements of a laboratory setup. The approach is exemplified on a cluster-based web server system. The resulting model contains many details that influence the behavior and performance of the system, such as one-way delays or system activity caused by the hardware.

Since network aspects play a central role in distributed systems, it is important to be able to capture the timing characteristics of packet delays in the network exactly. Therefore, a modular system has been developed for the Linux operating system that records sent and received TCP segments and generates timestamps for the sending and receiving actions. For that purpose, the netfilter framework has been extended to insert packet headers and corresponding timestamps into a ring buffer in kernel space. The timestamps in the resulting event trace are generated using the clock of the object system that is observed. To calculate one-way delays for packets, it is necessary to synchronize the clocks of the nodes, because the timestamps for the sending and the receiving event are taken from different clocks. This can be achieved with standard solutions like the use of NTP during the measurement. An alternative that is more suitable in many situations is a dedicated offline synchronization. This method is a development of our own, based on an algorithm classically used for online synchronization of computer clocks. It allows the cycle counter of the processor (TSC) of the object system to be used for timestamping. Due to this feature, a context switch for obtaining kernel clock timestamps can be avoided, and the latencies for reading the clocks are therefore minimized. To implement this solution, the PPS output of a GPS receiver is connected to the nodes of the object system that need to be synchronized. PPS signals are pulses that mark the beginning of every second. During the measurement, timestamps for these PPS pulses are recorded in an additional trace. The standardized interface for PPS pulse reception, the PPS API, was extended to also use the TSC for timestamp generation. The resulting time trace can be used after the measurement to calculate the offset and the frequency of each individual clock that has been used for timestamping in the event traces. Using this information, the traces can then be related to a global timing reference. Interrupt latencies occur during the generation of the time trace and have a negative effect on the accuracy of the synchronization. Therefore, a hardware module has been developed that measures the time between the PPS pulse and the invocation of the interrupt handler. This innovation allows a correction of the timestamps in the time trace. In addition to the fine-grained event recording, summary performance data are recorded to calibrate and validate the model.

In a second step, the obtained data are processed so that they can be represented in the model in a sensible way. For this purpose, the standard method of fitting theoretical distribution functions is used in the input model. It is supplemented by advanced techniques like multimodal and Bézier distributions. This is not sufficient for all data sets; therefore, distributions are combined so that the correlation of the measured data is represented using a phase approach. For some sets of delays, this does not produce satisfactory results. Due to the buffering of the Ethernet frames in network elements, these delays feature both a high correlation over large lags and a strict upper and lower bound. To represent these values, a new procedure is introduced. It samples the differences of successive values from a part of an empirical distribution function. This part is selected according to the value the random variable has reached. Samples generated in this way exhibit good compliance with the original values regarding both the density and the correlation structure.

The representations of the data are utilized in a detailed simulation model of the complete system. It contains the most important aspects of the web cluster that influence the performance. The model has been realized in AnyLogic, a simulation tool that is based on UML and Java. The model consists of five objects on the root level (client, network channel 1, load balancer, network channel 2 and server nodes). Some of the elements have a multiplicity, i.e. more than one instance of these objects is present. For each HTTP request, an individual TCP connection is simulated. TCP is the lowest level of the protocol stack that is modeled explicitly. An instance of the client object is created for each TCP connection. Besides an instance of a processor model object, each client contains an instance of a TCP object. This object models the TCP stack of the operating system and is a complex sub-model that reproduces the main features of the protocol. It controls the connection establishment and tear-down and the protocol dynamics, and supports message segmentation. Modeled properties include slow start, congestion avoidance, timeout calculation, fast retransmit and fast recovery. Both network channels induce packet delays that represent the measured characteristics. The load balancing object forwards incoming connection requests to specific server nodes according to configurable scheduling algorithms. The server objects are responsible for handling the requests. The processing phases of concurrent requests of different TCP connections are interleaved at the server objects. For that purpose, each server contains an individual instance of the TCP object mentioned before for every TCP connection. The servicing is done in process objects. Processor time is assigned to them by a central processor object. The processing in user mode can be delayed by system activity. Confidence intervals are utilized for the execution control of the simulation. The resulting model allows a fine-grained evaluation of the behavior of the system. The time needed to reach a defined quality of the simulation results is acceptable.

Altogether, we created a solution that is easily applicable and makes it possible to obtain fine-grained measurement data from a laboratory setup of a system, to represent these data adequately with their densities and correlations, and to create a detailed simulation model that contains the quintessential features of the system and allows performance evaluations of different configurations.


Zusammenfassung

Distributed systems are ubiquitous today, and the delay incurred while processing tasks in such systems is often an important quantity. For this reason, various publications deal with the performance evaluation of such systems. Most of these research projects treat the systems with coarse-grained models at an abstract level and largely do without detailed measurements of real systems. The goal of the present work is therefore to demonstrate a methodology with which a precise model of distributed systems can be built that is parametrized, calibrated and validated by fine-grained measurements of a laboratory system. This approach is illustrated using the example of a cluster-based web server system. The resulting model contains many details that influence the behavior and the performance of the system, such as one-way delays in the network and system activities caused by the hardware.

Since network aspects are of central importance in distributed systems, it is important to be able to capture the timing characteristics of packet delays in the network exactly. Therefore, a modular system was developed for Linux that supports the recording and timestamping of sent and received TCP segments. For this purpose, the netfilter framework was extended so that packet headers and the corresponding timestamps can be written into a ring buffer in the address space of the operating system kernel. The timestamps in the resulting event trace are generated with the clock of the observed object system. In order to derive one-way packet delays from them, the clocks of the object system nodes have to be synchronized, because the timestamps for the sending and receiving events are obtained with different clocks. This can be done during the measurement with well-known techniques such as NTP, or after the measurement in an offline synchronization process developed specifically for this purpose. With the help of this method, timestamping can also be performed with the cycle counter (TSC) of the processor of the object system, so that the context switch that is needed in some cases for reading the clock can be avoided. For the purpose of synchronization, the PPS output of a GPS receiver is connected to the nodes of the object system that are to be synchronized. PPS pulses are standardized signals that are generated exactly at the beginning of every second. While events are being recorded, an additional trace with timestamps for the PPS pulses is logged. From this time trace, the offset and the frequency of the clock used for timestamping the events can be determined afterwards. With this information, the traces can then be related to a global time reference. The interrupt latencies that occur during the generation of the time trace affect the synchronization negatively. Therefore, a circuit was developed that makes it possible to determine, for each PPS signal, the time between the pulse and the invocation of the interrupt handler. This time then allows a correction of the timestamps in the time trace. In addition to the fine-grained event recording, summary performance data are collected for calibrating and validating the model. In a second step, the measurement data obtained are prepared so that they can be represented in the model in a sensible way. For this purpose, well-known input modeling methods based on theoretical distribution functions are used. These are supplemented by advanced techniques such as multimodal and Bézier distributions; however, this does not lead to the desired result for all measured values. Therefore, the distribution functions are combined by means of phases so that the autocorrelation of the measured data can also be represented. For certain measured delays, even this approach is not successful, since, due to the buffering of Ethernet frames in the network components, they exhibit a high autocorrelation over large lags and a fixed upper and lower bound. To represent them, a new method was developed in which the differences of successive values are generated from a part of an empirical distribution function. The corresponding part of the distribution function is selected based on the value already reached by the actual random variable. The values generated in this way agree well with the original data both in their density and in their correlation structure.

The representations of the data are used in a detailed simulation model of the complete system. It reproduces the essential aspects of the web cluster that influence its performance. The model was realized in AnyLogic, a simulation tool based on UML and Java. At the top level, the model consists of five objects (client, network channel 1, load balancer, network channel 2 and server nodes), some of which have a multiplicity, i.e. exist in several instances. For each HTTP request, a separate TCP connection is simulated. TCP is the lowest level of the protocol stack that is modeled explicitly. For each TCP connection, a separate instance of the client object is created. Besides a processor instance, each client embeds an instance of a TCP object. It models the TCP stack of the operating system and is a complex sub-model that reproduces the essential properties of the protocol. It handles connection establishment and tear-down and the dynamics of the protocol, and supports the segmentation of messages. Modeled properties include slow start, congestion avoidance, timeout calculation, fast retransmit and fast recovery. The two network channels generate packet delays corresponding to the measured data. The load balancing object assigns incoming connection requests to the server nodes according to configurable scheduling strategies. These server objects then process several requests; the individual processing phases of different TCP connections are interleaved. For this purpose, a server holds an instance of the previously mentioned TCP object for each connection. The processing takes place in process objects to which processor time is allocated. The individual processing phases can be delayed by system processes. Confidence intervals are used for the control of the simulation runs. The resulting model makes it possible to reproduce the system behavior at a fine granularity, while the time needed to reach a given quality of the simulation results remains acceptable. Thus, an easy-to-use solution was created that makes it possible to collect fine-grained measurement data from a laboratory setup of a system, to represent these data adequately with their densities and autocorrelations, and to build a detailed simulation model that contains the essential aspects of the system and allows the performance of different configurations to be assessed.


1 Introduction

During the last ten years, the Internet has become a major economic factor. Electronic business has replaced traditional mail order in many areas. More and more new business models utilizing the Internet emerge. Individuals can start with new ideas to create solutions without the need for excessive funding. One example of such a new platform can be found in [92], where the economic opportunities of an Internet portal for customer referral programs were evaluated with simulation models.

The number of people using the Internet is also constantly growing. According to statistics of the Miniwatts Marketing Group [60], worldwide Internet usage has grown by more than 265% from the year 2000 to the end of 2007. The largest growth can be observed in the Middle East, with a rate of over 920% in the same period. A growing percentage of these users access the Internet over broadband connections. This allows service providers to offer innovative services like Voice-over-IP or IP TV.

Due to these trends, the need for high-performance server systems to fulfill the demands of a growing user base is rising. The success of open source operating systems and the availability of powerful PC hardware at low cost make it possible to handle these challenges cost-effectively by combining commodity server hardware with load balancing mechanisms into high-performance cluster-based web servers.

Customer satisfaction depends mainly on the availability and speed of a service. A method to evaluate the expected delay of user transactions in early design phases of a server architecture helps to dimension the system. As both the hardware and the application have a large impact on the performance, an approach that bases the modeling on measurements of individual components promises more exact results than simpler approaches like the common queuing network models, which often assume Markovian traffic to allow fast and easy evaluation.


When planning measurements, it is important to keep an eye on the model in which the results are to be used. On the other hand, when building a model, it is equally important to be able to parametrize the model according to real-world data.

For this reason, we designed and implemented a solution that measures the most important performance data of distributed systems, represents them in the model and simulates the complete system with a great level of detail. The simulation model not only allows the performance of different architectures to be assessed under various workload conditions, it also helps to understand the influence of operating system aspects like interrupt handling and scheduling.

The object of study is a cluster-based web server that has been installed in our laboratory. It allows us to observe a live system under realistic load conditions. As we have full control over the system, we can change its architecture, modify the operating system and generate load with different characteristics. The application can also be changed from serving static pages with a simple Apache web server to a multi-tier system that implements a book shop with web servers, application servers, databases and even an emulation of the credit card authorization. This flexibility provided an excellent basis for experiments both in the measurement and in the modeling phase.

Our measurement infrastructure is based on GPS to allow measurements of one-way delays over the long-distance paths that often occur in distributed systems on the Internet, as GPS offers a global time base around the globe. Processing the timing information with dedicated hardware and a sophisticated offline time synchronization process mitigates side effects of noise sources like interrupt latency and thermal effects in the time synchronization. A configurable, modular instrumentation of the TCP/IP stack of the Linux operating system helps to gather a high volume of data without significant degradation of the performance of the system. Further, emphasis is put on applying advanced input modeling techniques in order to adequately represent the basic parameters in the model. Special care has been taken to reflect autocorrelation in the input data. A simulation model based on UML has been built. It shows how to represent mechanisms like queuing in buffers, transport control mechanisms and contention for CPU power with other processes. The resulting model thus combines a precise stochastic representation of low-level system parameters with an explicit representation of system behavior at higher observable levels. The simulation allows performance data to be gathered for various configurations under different load characteristics without additional measurements and input modeling.

We illustrated our approach on the basis of a cluster-based web server architecture, but most aspects are not limited to this field of application. Even the most system-related tasks of the measurement concepts have been demonstrated to be applicable in various environments ranging from web portals over wireless local area network transmissions to mobile embedded systems on soccer robots. The measurement studies in [29, 39, 40, 72, 73, 75, 76] are based on the work presented here.

The following chapter 2 presents some related work in the context of our fields of research. It is followed by a brief description of the laboratory setup in chapter 3. Chapter 4 illustrates the basic concepts for performance measurements and shows the problems to deal with during measurement studies. In chapter 5 we present our solution for detailed, fine-grained measurements of distributed systems. Besides the instrumentation of the system, this also includes two approaches to improve the quality of the needed time synchronization in software monitoring: Echo Feedback and Offline Synchronization. Chapter 6 illustrates how the measured data can be represented in a performance model of the system while preserving the most important statistical parameters. A simulation model of the web cluster system based on UML is presented in chapter 7. It includes various details that influence the dynamics and performance of the system. These details are typically not found in classic queuing models of such systems. Chapter 8 concludes the work and gives directions for future research in this area.


2 Related Work

Since the work presented here touches different fields of research, there exist numerous related publications, and only some of the most influential ones are mentioned in the following sections.

2.1 Measurements

Various approaches for performance measurements are presented in [36] by Raj Jain and in [38] by Klar et al. The second book also demonstrates the application of hardware and hybrid monitoring for different distributed systems as they have been implemented at the Department of Computer Science 7 (Computer Networks and Communication Systems) of the University of Erlangen-Nürnberg. The hardware monitor ZM4 was built and utilized for these projects. An extensive, configurable instrumentation of the Linux kernel is the Linux Trace Toolkit (LTT) [100]. The system operates efficiently and provides valuable information, but the level of detail provided makes it hard to filter relevant information. Furthermore, it is implemented as a kernel patch and is thus not easily adaptable to different kernel versions. Due to the extensive instrumentation, this solution is more intrusive and causes more measurement overhead than an instrumentation that is specifically tailored to the observed system. Newer versions of the system are called LTTng [21]. Its applicability to distributed systems has been demonstrated in [95]. Our implementation of the IP stack instrumentation affects only some parts of the network packet processing and can be configured to be applied only to packets of interest. Therefore, the measurement overhead is greatly reduced. Marcus Meyerhöfer's PhD thesis [55] presents a comprehensive performance measurement solution. It was implemented at the Department of Computer Science 6 (Data Management) of the University of Erlangen and serves similar purposes as our AOP-based monitoring of the J2EE application server. Compared to his work, our AOP-based instrumentation is based on standard techniques and, depending on the extent of the instrumentation, is expected to cause less overhead. Nonetheless, more manual effort has to be put into the instrumentation when applying our approach.

2.2 Time Synchronization

The most important solution for computer clock synchronization that also influenced the work presented here is the Network Time Protocol (NTP) [59]. It is intended for time synchronization over the Internet using a hybrid approach based on phase-locked and frequency-locked loops. It also marks the state of the art in this field. We used its algorithms and concepts as a basis for our own implementations and extensions as presented in chapter 5 of this thesis. Its foundations are explained in more detail in section 4.3.1. The National Institute of Standards and Technology has also published a number of algorithms to synchronize clocks to a common time base. We used one of these, the lockclock algorithm [46] by Judah Levine, as a basis for our idea of an offline synchronization solution presented in section 5.2. Some of the more recent research projects concentrate on estimating clock differences and one-way delays from statistical properties of delays measured using unsynchronized clocks [63, 66, 67]. A similar approach for offline time synchronization without a reference clock has previously been published in [31] as the result of research at the Department of Computer Science 7. All these methods are intended for use with Internet packet delays on the order of milliseconds. As our evaluation showed [19], these methods are not applicable for determining exact distribution functions for the one-way delays occurring in our laboratory setup, which are only several microseconds long.

Another solution for time synchronization is the IEEE standard 1588 [34]. It is intended for the synchronization of measurement and control systems on a local area network in the sub-microsecond range. Although this would be an optimum choice for our measurement infrastructure, it can only be implemented using specialized hardware or real-time operating systems. It mainly focuses on Ethernet architectures, but can be used with other LAN technologies, too. Wide-area synchronization over the Internet is not possible using this standard.

Our solution is based on the PPS API [62], which is intended to be used for connecting an external reference time source to one NTP server. We extended the existing solution by distributing one PPS signal to all nodes of our object system. The existing echo feature of the PPS API is intended to measure the latencies involved and to use a mean value of this latency for compensation. Our improvement uses this facility to measure the individual latency of every interrupt handler invocation and to correct the timestamps dynamically.

2.3 Input Modeling

Law and Kelton [45] summarize the most important methods for input modeling. Our input modeling with phases is an adaptation of the process for multimodal distributions presented in the book. We also combined this approach with the methods for Bézier distribution functions of Wilson and Wagner [91, 90]. We also used their tool PRIME for the construction of the curves. While our approach for representing correlated data shares some similarities with time series approaches as they are used in the TES methodology by Melamed [50] and the ARTA processes by Cario and Nelson [13], the specific nature of the buffering effects made it necessary to find a different way to represent the data. The sampling of the differences of successive values is reminiscent of the classical time series approach, but the construction of an empirical distribution function and the sampling of the differences from parts of this distribution according to additional constraints for the upper and lower bound is a novel aspect of our work. Markovian arrival processes (MAPs) are also able to capture the correlation structures of input data to some extent, but compact forms are insufficient to generate autocorrelation over long lags, and they are more suited to analytical models as they are used in traffic-based decomposition. One example of their application can be found in [27].

2.4 Performance Evaluation of Web Servers

A huge number of scientic papers have been published on the topic of performance evaluation of distributed systems. Some of the well-known analysis approaches of web server systems are published in the books of Menascé [¢¦, ¢Õ, ¢ó, ¢ì]. ey allow analytic solutions based on simple queuing networks with classes. e workload of the users is mapped to service demands at the dišerent components of the system. e user behavior is represented in a customer behavior model graph (CBMG). is allows to dene dišerent ways a user can access the system and this leads to dišerent service demand at dišerent nodes. More detailed models of web

ó¢ ó Related Work clusters are included in [Õ¦]. eir simulation study is based on a detailed model for the hardware of the system, but does not include ne-grained measurements of delays inside the system. e authors of [˜ó] investigate the ešect of dišerent load balancing strategies on cluster-based web servers using both a laboratory setup of a cluster and a simulation model. However, they only employ high-level measurements of the load balancer and server service times. Packet delays and network dynamics are not in their focus. In [Õþó], the performance of several load balancing schemes is evaluated. ey use traces of the arrival processes as input for their simulation model, but other aspects of the overall system performance are not based on measurements. Regarding the simulation of TCP, there exist some similarities with the TCP model that is included in the INET framework for OMNeT++ [˜ä]. Even if the level of detail included in the model is still larger than in our simulation, the focus of this tool is more on functional evaluation than on sound statistical analyses that are needed for serious performance evaluations. Even more functionality has been included in an integration of the complete TCP/IP stack of FreeBSD into OMNeT++ [˜].

3 The Web Cluster Laboratory

To evaluate the performance of distributed web servers, we built a laboratory setup of a cluster-based web server [28]. Distributed web servers need at least one load balancing node that distributes incoming user requests to several nodes that process the requests with common web server software like the Apache web server. These nodes are called real servers. Load can either be generated by real clients on the Internet or by a load generator that creates synthetic load and is often located in an internal network. Figure 3.1 shows the basic architecture of a distributed web server with its components.

Figure 3.1: Distributed Web Server Architecture

The most common approach to load balancing is the DNS-based load balancing mechanism, where the host name of the server is resolved to different IP addresses belonging to different machines according to a specified scheduling algorithm. The drawback of this method, known as round-robin DNS, is that the time-to-live entry for the DNS record must be small to avoid asymmetrically balanced load. For this reason, the entry is only cached for a short time and frequent name resolution processes are needed [12]. The main field of application is therefore global load balancing. Its goal is to distribute load originating in specific geographical regions to a nearby web server so that the distance and the delay in the network are minimized. The system uses a table of IP address blocks and geographic locations to resolve the alphanumeric host name of the server to different IP addresses according to the client location. These addresses belong to different web server machines located in the Internet in different geographic regions.

3.1 The Linux Virtual Server System

In our solution, we use a routing-based approach that is more suited for local load balancing, where all servers are located in geographical proximity. The Linux Virtual Server [48] system is an open source project that supports load balancing of various IP-based services and supports out-of-band transmission (e.g. for FTP) and persistent connections (e.g. for SSL). It is a layer-4 switching system where routing decisions are based on fields of TCP or UDP headers like port numbers. The whole distributed web server carries a single IP address called the Virtual IP Address (VIP). Requests sent to this address are balanced among the real servers carrying the different Real IP Addresses (RIPi). Three mechanisms for load balancing are available:

● Network Address Translation,
● IP Tunneling and
● Direct Routing.

Network Address Translation (NAT) is a method specified in RFC 1631 [22] for mapping a group of n IP addresses with their TCP/UDP ports to a group of m different IP addresses (n-to-m NAT). When used for load balancing, the VIP is assigned to the load balancer only. This node receives all incoming packets, selects the IP address of a real server according to a chosen scheduling algorithm, creates an entry in a connection table, changes the destination address of the packet to the chosen RIPi and forwards it to the selected real server. The connection table is used to route packets of the same client session (i.e. TCP connection) to the same real server and the answer packets back to the right client. The load balancer is used as the standard gateway for the real servers in their routing tables. When packets belonging to replies arrive at the load balancer, the source address is changed to the VIP and the packets are forwarded to the client via the Internet. NAT involves rewriting both the packets directed to the real server nodes and those originating from them. As the load balancer has to be used as a gateway for the real server nodes, its use is reasonable only for nodes in geographic proximity. Figure 3.2 exemplifies the functionality of this approach.

Figure 3.2: Load Balancing via NAT
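
To make the forwarding decision concrete, the following minimal sketch shows the connection-table logic described above. All structures, sizes and names are hypothetical and do not mirror the Linux Virtual Server code, which additionally handles hash collisions, connection expiry and the rewriting of reply packets.

```c
/* Illustrative sketch of the NAT forwarding decision; hypothetical names. */
#include <stdint.h>

#define MAX_CONN 4096
#define NUM_REAL 4

struct conn_entry {
    uint32_t client_ip;
    uint16_t client_port;
    int      real_server;   /* index of the chosen real server */
    int      in_use;
};

static struct conn_entry conn_table[MAX_CONN];
static uint32_t real_ip[NUM_REAL];   /* RIP1 .. RIPn        */
static int      rr_next;             /* round-robin pointer */

/* Look up the connection; on the first packet of a TCP connection a new
 * entry is created so that all later packets go to the same real server. */
static int select_real_server(uint32_t client_ip, uint16_t client_port)
{
    unsigned slot = (client_ip ^ client_port) % MAX_CONN;
    struct conn_entry *e = &conn_table[slot];

    if (!e->in_use || e->client_ip != client_ip ||
        e->client_port != client_port) {
        e->client_ip   = client_ip;
        e->client_port = client_port;
        e->real_server = rr_next;            /* round-robin scheduling */
        e->in_use      = 1;
        rr_next = (rr_next + 1) % NUM_REAL;
    }
    return e->real_server;
}

/* NAT step for an incoming packet: the destination address (the VIP) is
 * replaced by the RIP of the selected real server before forwarding. */
static void nat_rewrite(uint32_t *dst_ip, uint32_t client_ip, uint16_t client_port)
{
    *dst_ip = real_ip[select_real_server(client_ip, client_port)];
}
```

For reply packets, the reverse rewrite (RIP back to the VIP as source address) would use the same table.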

Tunneling and Direct Routing cause less overhead because the packets sent by the real servers do not have to pass the load balancer. Since our load balancer does not reach saturation even with the NAT approach, we did most of our measurements with NAT. Details about the other two methods can be found in [48]. The Linux Virtual Server system offers different scheduling algorithms:

● Round Robin,
● Weighted Round Robin,
● Least Connection,
● Weighted Least Connection,
● Locality-based Least Connection,
● Locality-based Least Connection with Replication,
● Destination Hashing and
● Source Hashing scheduling.

While the first four algorithms can be used for any IP-based services, the latter four are intended for cluster-based caching proxy servers. The system is implemented as a Linux kernel patch that is integrated into the netfilter framework. This framework is used for the manipulation of IP packets for firewalling and NAT. The kernel part can be configured using the user mode tool ipvsadm. Only the load balancer needs to run the Linux operating system; the real servers can operate under any OS that supports the necessary features like IP-IP encapsulation for Tunneling or non-ARPing interfaces for Direct Routing. In addition to monitoring the state of the real servers and removing them from the scheduling in case of an error, there are different software add-ons that can be used to implement a fail-over solution for the load balancer for high availability [48]. An identical configuration of all machines simplified the laboratory setup. Therefore, we used Linux with a 2.4.x kernel version on all machines for our measurements. While other operating systems can create non-ARP interfaces without any modification, a special hidden patch for Linux [4] is needed for the real servers with Direct Routing. Most measurements were done serving static content with the Apache web server. The simulation model presented in chapter 7 also implements this configuration. Two of the scheduling disciplines from the list above are sketched below.
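
The following sketch illustrates how a balancer might implement Round Robin and Weighted Least Connection selection; the data structures and names are hypothetical and not taken from the kernel implementation.

```c
/* Hypothetical selection routines for two of the schedulers listed above. */
#include <limits.h>

#define NUM_REAL 4

struct real_server {
    int weight;        /* configured weight of the node                 */
    int active_conns;  /* currently open connections                    */
    int available;     /* 0 if the node was removed by the health check */
};

static struct real_server rs[NUM_REAL];
static int rr_pos;

/* Round Robin: cycle through the available real servers. */
int schedule_rr(void)
{
    for (int i = 0; i < NUM_REAL; i++) {
        rr_pos = (rr_pos + 1) % NUM_REAL;
        if (rs[rr_pos].available)
            return rr_pos;
    }
    return -1;   /* no real server available */
}

/* Weighted Least Connection: pick the node with the fewest open
 * connections per unit of weight. */
int schedule_wlc(void)
{
    int best = -1;
    long best_score = LONG_MAX;

    for (int i = 0; i < NUM_REAL; i++) {
        if (!rs[i].available || rs[i].weight <= 0)
            continue;
        long score = (long)rs[i].active_conns * 1000 / rs[i].weight;
        if (score < best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```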

3.2 Hardware Setup

The hardware we use in our project consists of one load balancer with the following main components:

● SMP mainboard with ServerWorks ServerSet III LE chipset with 64-bit PCI bus,
● two Intel Pentium III processors with 1 GHz each,
● 512 MB SD-RAM PC133 memory,
● two 1000-Base-SX network interface cards with Alteon AceNIC chipset with 64-bit PCI interface,
● on-board 100-Base-TX NIC with Intel chipset for management purposes.

The same hardware setup is utilized for the load generator. We used up to ten real servers and one NTP server with identical hardware:

● mainboard with VIA Apollo KT133 chipset (VT8363A north bridge and VT82C686B south bridge),
● AMD Athlon Thunderbird processor with 900 MHz,
● 256 MB SD-RAM PC133 memory,
● two 3Com 100-Base-TX PCI network interface cards.

A 24-port Cisco Catalyst 3500XL switch with two 1000-Base-SX GBIC modules connects the load generator, the load balancer and the real servers. It supports the use of SNMP and RMON for monitoring the switch internals. The 100-Base-TX NICs used for management purposes are connected to another switch to minimize the influence of management traffic on our measurements. The Gigabit Ethernet ports are connected to the load generator and the load balancer, whereas the real servers are connected to the Fast Ethernet ports of the switch.


4 Measurement Concepts

One method for assessing the performance of computer systems is to conduct measurements. Although a real implementation of the system is needed, typically in a laboratory setup, only this method allows real-world data to be obtained that can be used in further performance studies like analytical or simulation models, since many aspects of the dynamics of systems cannot easily be determined purely from the specifications. This is the more relevant the more complex the studied system is. Measurements are classically characterized by the following categories [36]:

• Active measurements versus passive monitoring
• Event driven measurements versus sampling
• Summary versus event oriented performance evaluation
• Software, hardware and hybrid monitoring

During active measurements, the object system is observed while synthetic load is generated. This allows a well-defined workload to be applied to the system and minimizes the effects of uncontrolled activities.

Passive monitoring is applied to evaluate the system under real-world conditions, where the workload is generated by actual user interaction, without influencing its behavior by applying synthetic load.

Sampling is the process of observing the system at regular time intervals and recording performance data like statistics of resource utilization for each interval. For example, the measurement of the CPU load of a system is often performed by sampling, i.e. by recording the fraction of time the CPU was busy during a certain period.

Event driven measurements are usually used to obtain fine-grained performance data. During this process, timestamps are recorded for relevant points. These points might for example mark the beginning and the end of a calculation. These points are called events and are timeless, whereas the periods of time which are marked with events for the start and the end are called activities. Their duration can be calculated as the difference of the timestamps.

This leads directly to the difference between summary and event oriented performance evaluation: When statistical measures like mean values or quantiles are collected during the measurement, we speak of a summary performance evaluation. This is most common for sampling. The results of an event driven measurement allow for an event oriented performance evaluation, where important aspects are recorded with timestamps. This makes it possible to calculate various performance data after the measurement process and allows the recorded timestamps to be used for a detailed input modeling to be utilized in a performance model of the system. Statistics like the probability density function of the duration of system activities can be calculated in this way.

Event driven performance evaluation is usually done in three steps: event recognition, where the measurement system is triggered to generate a timestamp; the generation of the timestamp itself; and the recording of the event record, which usually consists of the timestamp and an event identifier that makes it possible to distinguish between the different events. The sequence of the event records is referred to as the event trace.

There are three basic ways to perform event driven measurements: hardware monitoring, software monitoring and hybrid monitoring. These monitoring approaches differ in the method used to conduct the three steps mentioned above.

In hardware monitoring, all three steps are done in hardware. That means that a dedicated piece of hardware is needed to recognize an event. In relatively simple systems like small electronic controller units, this step can be as easy as snooping the address bus of the microcontroller and reacting to a certain activity like writing to or reading from a certain address to trigger the recording of an event record. The generation of an event identifier involves obtaining the relevant information from the system using another hardware component. When a write action to a special address marks the beginning of a relevant action, a simple example of a unique event identifier might be this special address, which could be obtained from an additional bus interface that snoops the memory bus of the processor. The resulting event records are recorded by a dedicated hardware event recorder. Figure 4.1 illustrates the hardware monitoring setup of a small system, where the processor bus to the main memory is monitored to trigger the event recording and to obtain the event identifiers using a bus interface. The main advantage is that the performance of the system to be measured, the object system, is not influenced by the monitoring process, since all additional activities required for the monitoring are done in additional hardware. Furthermore, the precision and resolution of the generated timestamps do not depend on the system clock and can thus be influenced by the hardware used to conduct the measurements. But most complex architectures like modern servers or desktop computers have certain characteristics that make this method impractical. For example, all CPUs used in this context use memory management units (MMUs) that introduce a layer of abstraction between memory accesses in the program code and the physical memory. Therefore, accesses to specific memory locations are not easily seen on the address bus. Additionally, multi-level caches in these architectures prevent the CPU from accessing outside memory in some cases at all. Activities on higher levels are thus not easy to recognize using hardware monitoring. For the reasons mentioned, it is also hard to determine proper event identifiers.

Figure 4.1: Hardware Monitoring

In contrast, software monitoring shifts all three steps to software running on the object system itself. While it is easy to trigger the logging of event entries at relevant points in the flow of execution of the program and to generate meaningful event identifiers, it might prove complicated to generate exact timestamps under certain circumstances due to internal delays caused by other components of the object system like operating system processes, especially when the event is asynchronously triggered by external hardware like packets arriving at the network interface. Furthermore, the generation of performance data, in this case called instrumentation, can affect the performance of the system considerably. The accuracy and resolution of the timestamps depend on the properties of the clock source used. Since it has to be a clock in the object system, it is challenging to improve this step. Figure 4.2 shows an example of a software monitoring solution, where the IP stack of the object system has been instrumented to generate events that are timestamped with the operating system clock and recorded in a buffer in kernel space that can be read by a user mode process.

Figure 4.2: Software Monitoring
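
As a minimal sketch of the software monitoring data path just described, an event record and an in-memory trace buffer might look as follows; names and sizes are illustrative, and the actual instrumentation described in chapter 5 resides in kernel space and additionally stores packet headers.

```c
/* Sketch of an event record and an in-memory trace buffer; illustrative only. */
#include <stdint.h>

#define TRACE_CAPACITY 65536

struct event_record {
    uint64_t timestamp;   /* clock reading, e.g. a TSC value            */
    uint32_t event_id;    /* identifies the instrumentation point       */
    uint32_t context;     /* optional context, e.g. a connection number */
};

static struct event_record trace[TRACE_CAPACITY];
static volatile uint32_t   trace_head;    /* next slot to be written */

/* Called at an instrumentation point: event recognition has already
 * happened, this performs the timestamping and recording steps. */
static inline void record_event(uint32_t event_id, uint32_t context, uint64_t now)
{
    uint32_t slot = trace_head % TRACE_CAPACITY;  /* ring buffer: the oldest
                                                     entries are overwritten */
    trace[slot].timestamp = now;
    trace[slot].event_id  = event_id;
    trace[slot].context   = context;
    trace_head++;
}
```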

To reduce the complexity of the event detection and trigger generation in pure hardware monitoring, a combined approach with methods from software monitoring sometimes proves feasible. This combination is often referred to as hybrid monitoring. In this case, all or some events are detected in software. The event recording and timestamping are usually done in hardware. So the software instrumentation on the object system has to provide trigger signals for the event recorder. An example of a hybrid monitoring system with event recognition in software, where the event identifiers and timestamps are created in hardware using a bus interface, is shown in figure 4.3. The event identifier can be determined by a piece of dedicated code or hardware. While this method allows for high precision timestamps and a relatively easy instrumentation for activities on higher levels, dedicated hardware is needed nonetheless.

Figure 4.3: Hybrid Monitoring

Most modern computer architectures include some form of communication. The aspect of communication is not only important for large servers on the Internet; even small embedded devices are often equipped with network interfaces today. Some examples of such systems are electronic control units (ECUs) in automotive applications, where more than 70 devices exchange messages over a number of different bus systems in a current upper-class car, or wireless sensor nodes, small devices that include a low-power central processing unit, a number of sensors to record environmental data and some form of radio communication.

Therefore, communication plays an important role in performance evaluations of computer systems. For a thorough study, it is not enough to assess the performance of one system in isolation; the interaction with other systems has to be taken into account. These aspects need to be handled while conducting measurement studies of distributed systems. One important aspect that arises from this demand is that, if the communication itself is viewed as an important activity, a global time base for all components of the object system is unavoidable, since the event that marks the beginning of a communication activity and the event that marks its end are generated on different components.

In a pure software monitoring process, where all timestamps are created using the different clocks of the components of the object system, these clocks have to be related to each other to determine inter-component communication delays.
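
In its simplest form, relating the clocks means mapping each local timestamp to a common time scale before taking differences. The following sketch uses hypothetical per-node offset and rate values, as they could be obtained from NTP or from the offline synchronization described later.

```c
/* Sketch: one-way delay computation from timestamps of two different clocks. */

/* Map a local clock reading (seconds) to the global time base. */
double to_global_time(double t_local, double offset, double rate_error)
{
    return (t_local - offset) / (1.0 + rate_error);
}

/* Delay of a packet sent at t_send (clock of node A) and received at
 * t_recv (clock of node B). */
double one_way_delay(double t_send, double offset_a, double rate_a,
                     double t_recv, double offset_b, double rate_b)
{
    return to_global_time(t_recv, offset_b, rate_b)
         - to_global_time(t_send, offset_a, rate_a);
}
```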


4.1 Computer Clocks

Using the clocks of the object system to generate timestamps for the events poses numerous problems.

Traditional Unix clocks use an internal structure to represent the time in counters that are incremented in jiffies. A jiffy is generated by the clock interrupt: the programmable interrupt controller is instructed to generate an interrupt every 1/Hz seconds [78]. For Linux kernel versions up to 2.4, the standard value of Hz was 100. Although the possibility to change this number existed in Linux, this was hardly ever done, since it also influenced other system aspects like the granularity of the process scheduler.
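
A small user-space sketch can make this tick granularity visible: it counts how often the value returned by gettimeofday() actually changes during a burst of reads. On a kernel that cannot interpolate between timer ticks, the value advances in steps of roughly 1/Hz; the program itself makes no assumption about the kernel version.

```c
/* Sketch: estimate the apparent granularity of gettimeofday(). */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval prev, cur;
    long steps = 0, reads;

    gettimeofday(&prev, NULL);
    for (reads = 0; reads < 1000000; reads++) {
        gettimeofday(&cur, NULL);
        if (cur.tv_sec != prev.tv_sec || cur.tv_usec != prev.tv_usec) {
            steps++;            /* clock value advanced */
            prev = cur;
        }
    }
    printf("%ld distinct clock values in %ld reads\n", steps, reads);
    return 0;
}
```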

A number of ways have been proposed to interpolate between successive jiffies. The time stamp counter (TSC) of modern CPUs was used in Linux 2.4 for this purpose when it was available. The time stamp counter is a 64-bit cycle counter inside the CPU that is incremented with the internal CPU clock frequency and can be read like a normal CPU register using special opcodes. The kernel calls to read the wall clock time cause a context switch in most operating systems, whereas the TSC can be read both in kernel and user mode without a context switch. The time to read the clock is shown in figure 4.4. This figure was generated using a small program that reads the clock 1,000 times and calculates the difference of successive timestamps. The mean time to read the clock using the gettimeofday() call is around 4 microseconds, whereas the mean time between successive read instructions for the TSC is only about 40 nanoseconds.
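
The experiment behind figure 4.4 can be reproduced with a few lines of user-space code; the rdtsc() helper below reads the TSC with inline assembly on x86 and is only an illustrative stand-in for the kernel's rdtscll() macro.

```c
/* Sketch of the clock-read latency experiment: read each clock 1,000 times
 * and look at the differences of successive readings. */
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    struct timeval tv[1000];
    uint64_t tsc[1000];

    for (int i = 0; i < 1000; i++)
        gettimeofday(&tv[i], NULL);      /* wall clock readings  */
    for (int i = 0; i < 1000; i++)
        tsc[i] = rdtsc();                /* raw cycle counter    */

    for (int i = 1; i < 1000; i++) {
        long us = (tv[i].tv_sec - tv[i - 1].tv_sec) * 1000000L
                + (tv[i].tv_usec - tv[i - 1].tv_usec);
        /* TSC deltas divided by the CPU frequency give the latency in seconds */
        printf("%ld us   %llu cycles\n",
               us, (unsigned long long)(tsc[i] - tsc[i - 1]));
    }
    return 0;
}
```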

Timekeeping in the Linux kernel has undergone several changes in the 2.6 versions. The first one was the increase of the value of Hz from 100 to 1024. This change improved the granularity of both the timer ticks and the scheduler to below one millisecond. Newer versions of the 2.6 kernel series include a number of changes in the handling of timers and a flexible handling of clock event sources implemented in the Generic Time-of-Day subsystem [80]. This system enables the kernel to use different hardware elements like the local APIC to generate clock events. A clock event in this context is similar to the traditional ticks, but using the new subsystem, the scheduling of operating system tasks is decoupled from the generation of timer interrupts by the clock event source. Newer modifications also changed the handling of kernel timers to a large extent [25].


There are several approaches to improve timekeeping in the Linux kernel. One of the most sophisticated ones is the PPS API patch [93] for Linux 2.4 versions developed by Ulrich Windl. This kernel extension is based on the nanokernel [57] by David Mills as implemented in current FreeBSD systems. It uses the TSC to interpolate between timer ticks to provide nanosecond time resolution. Besides decreasing the granularity of the clock, it also implements the PPS API [62], an interface for generating timestamps for PPS pulses (cf. section 4.3.3).
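
For reference, a minimal sketch of how a PPS timestamp can be obtained through the PPS API as standardized in RFC 2783; the device path is system-specific, error handling is shortened, and the snippet is not part of the thesis's own instrumentation.

```c
/* Sketch: fetch the timestamp of the next PPS pulse via the PPS API. */
#include <stdio.h>
#include <fcntl.h>
#include <time.h>
#include <sys/timepps.h>

int main(void)
{
    int fd = open("/dev/pps0", O_RDWR);   /* device name is system-specific */
    pps_handle_t handle;
    pps_info_t info;
    struct timespec timeout = { 3, 0 };

    if (fd < 0 || time_pps_create(fd, &handle) < 0) {
        perror("pps");
        return 1;
    }
    /* Block until the next pulse and read the timestamp taken in the
     * interrupt handler for the asserted edge. */
    if (time_pps_fetch(handle, PPS_TSFMT_TSPEC, &info, &timeout) == 0)
        printf("PPS at %ld.%09ld (sequence %lu)\n",
               (long)info.assert_timestamp.tv_sec,
               info.assert_timestamp.tv_nsec,
               (unsigned long)info.assert_sequence);
    time_pps_destroy(handle);
    return 0;
}
```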

Figure 4.4: Latencies for Reading the Time (left: gettimeofday(), right: rdtscll(); latency in ns over the sample index)

4.2 Clock Errors

In common computer architectures and operating systems, most clock sources are triggered by a central quartz oscillator. This oscillator is often also used as a frequency source for the CPU. For that purpose, a clock multiplier and divider is used to generate higher frequencies from the frequency of the quartz. Due to the manufacturing process, all quartz oscillators have a systematic frequency error. That means that the frequency of the oscillator is higher or lower than the specified nominal frequency. The frequency error is on the order of 100 ppm for the oscillators of common PC hardware. Using more sophisticated manufacturing methods, far more precise quartz oscillators can be made. This involves using a different cut of the quartz crystal and a method to fine-tune the frequency by small amounts using a mechanical or electrical device. The frequency error described above is a systematic error. That means that the frequency difference can be determined by measuring over longer periods of time and compensating for the errors. This can be done by calculating new time readings from the raw clock readings after the measurements, or by changing the amount of time by which the internal clock of the operating system is increased at every timer interrupt before the measurement is done.
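
A sketch of the first variant, the post-measurement correction: a raw reading of a free-running clock such as the TSC is mapped to corrected time using an offset and the measured (not the nominal) frequency. The names are illustrative; the actual offline synchronization in chapter 5 additionally tracks how offset and rate change over time.

```c
/* Sketch: compensate the systematic frequency error after the measurement. */
#include <stdint.h>

struct clock_fit {
    uint64_t tsc0;        /* TSC value at the reference epoch            */
    double   t0;          /* reference time (seconds) at that epoch      */
    double   freq_hz;     /* measured, not nominal, oscillator frequency */
};

/* Convert a raw TSC reading taken during the measurement into seconds on
 * the reference time scale. */
double corrected_time(const struct clock_fit *fit, uint64_t tsc)
{
    return fit->t0 + (double)(tsc - fit->tsc0) / fit->freq_hz;
}
```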

Besides this systematic frequency error, all quartz oscillators have a temperature-dependent error component. This component is some orders of magnitude lower than the systematic error. Nonetheless, it contributes to the error of the clock readings and sums up over time in time measurements. This effect becomes more and more important the longer the measurement takes. Since the frequencies of the oscillators change over time, this effect cannot easily be eliminated. The manufacturers of high-precision oscillators offer devices that use a temperature compensation inside the oscillators. This compensation can be achieved using analog circuits in temperature compensated crystal oscillators (TCXOs) or using digital logic in digitally temperature compensated crystal oscillators (DTCXOs). Another method to eliminate the effect is the use of a small oven that heats the crystal to a constant temperature above room temperature. These components are called oven controlled crystal oscillators (OCXOs) and offer the highest precision of all available quartz oscillators.

Figure 4.5 illustrates the effect a change in temperature has on the CPU frequency of a system. In this experiment, we observed the frequency of the CPU by reading the TSC value of a 900 MHz Athlon CPU (measured mean frequency fn = 908.116134 MHz) in regular intervals of τ0 = 1 s, by generating an interrupt triggered by a signal sent precisely with a frequency of 1 Hz by our GPS hardware. The plotted values are filtered using an averaging algorithm over τ = 32 s to eliminate TSC reading errors caused by the interrupt latency of the system. Since the quartz oscillator had no temperature sensor attached to it, we used the sensor of the southbridge chipset to determine a general temperature tendency of the computer. The reason for the oscillation of the frequency with a period of about 40 minutes was found to be the temperature change caused by the duty cycle of the air conditioning system.
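
The following sketch shows one way such a frequency estimate can be computed, under the assumption that one TSC reading is stored per PPS pulse; averaging the per-second increments over a window of τ = 32 s corresponds to the smoothing mentioned above and suppresses the jitter caused by the interrupt latency.

```c
/* Sketch: estimate the oscillator frequency from TSC samples at PPS pulses. */
#include <stdint.h>

#define TAU 32   /* averaging window in seconds (one sample per pulse) */

/* tsc[] holds one TSC reading per PPS pulse; n is the number of samples.
 * freq[i] receives the average frequency in Hz over the window ending at i,
 * i.e. the mean of the last TAU per-second TSC increments. */
void estimate_frequency(const uint64_t *tsc, int n, double *freq)
{
    for (int i = TAU; i < n; i++)
        freq[i] = (double)(tsc[i] - tsc[i - TAU]) / TAU;
}
```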

Figure 4.6 shows the amount of the frequency variation caused by the temperature change over a period of 95 hours. Besides the cycle mentioned above, the graph shows a lower frequency at times t ≈ 31.2 h, t ≈ 55.2 h and t ≈ 79.2 h. Since the measurements were done in June, the outside temperature increased during those times of the three days to a value the air conditioner was unable to compensate for.


Figure 4.5: Frequency Changes with Temperature (frequency in MHz and southbridge temperature in °C over time in hours)

Due to the directional exposure of the server room, the peaks in temperature and thus in the measured frequency occur at intervals of 24 hours. The frequencies occurring at the three different temperature levels provided by the temperature sensor are shown as histograms in figure 4.7. The effect of the temperature can be determined and used for a frequency correction. But as the figure shows, a higher temperature resolution is needed to mitigate the influence. A more exhaustive evaluation of these effects and of the influences of temperature and power management has been done by Stefan Schreieck in [74]. The results back up the assumption that more precise timekeeping could be achieved in modern operating systems by adjusting the amount of time added to the current value of the system clock according to the actual temperature of the main oscillator. Since software monitoring is based on generating timestamps, it is not the frequency error itself that matters, but the phase error, the offset of the clock with respect to a reference clock.


In the case of measurements of one-way delays in the absence of a reference clock, the difference of the current values of the system clocks is added to each value obtained, since the timestamp for the sending event is generated in the source system and the timestamp for the receive event by the sink. Even when both systems have zero clock offset at the beginning of a measurement, the temperature-dependent frequency errors cause a phase error during the measurement. A remaining frequency difference of merely one ppm leads to a phase difference of one microsecond after one second.

Figure 4.6: Frequency Variation (frequency error in ppm and temperature in °C over time in hours)

Figure 4.8 illustrates this effect: the offsets of the clocks of two PCs compared to a GPS-based reference clock are plotted over time. Both PCs were located in close proximity in an air-conditioned room. The systematic frequency error had already been eliminated before the measurement was started. One would expect the phase errors of both systems to evolve in a similar way, but slight differences in the internal temperature and in the cutting of the quartz crystals during manufacturing cause both systems to behave differently. Since the delays measured in our local area network are around 60 microseconds, this difference influences the measurement considerably.

To evaluate the effect, we transmitted UDP packets between two computers, PC1 and PC2. Timestamps were generated both when a packet was sent and when a packet was received, using the clock of the respective system, as is usual in software monitoring.

Figure 4.7: Frequency Distribution (histograms of the measured frequency in MHz at the three temperature levels 24.6 °C, 25.2 °C, and 25.3 °C)

Assume a packet is sent from PC2 to PC1. Let t̃_{2,s2}(i) be the timestamp generated by PC2 when sending the i-th packet and t̃_{1,r1}(i) the timestamp generated by PC1 when receiving this packet. The measured one-way delay is calculated as

d̃_{2,1}(i) = t̃_{1,r1}(i) − t̃_{2,s2}(i).

Assume PC1 has a constant offset o (phase error difference) compared to PC2, i.e. the time t̃_1(t) on PC1 and the time t̃_2(t) on PC2 at real time t differ by a constant amount o for all values of t:

t̃_1(t) − t̃_2(t) = o(t) = o   ∀t.

When t̃_{1,s2}(i) denotes the time of the clock of PC1 at the moment the packet was sent by PC2, the correct one-way delay d_{2,1}(i) can be calculated as

d_{2,1}(i) = t̃_{1,r1}(i) − t̃_{1,s2}(i).

Since we know that

t̃_1(t) = t̃_2(t) + o, we can determine

d_{2,1}(i) = t̃_{1,r1}(i) − (t̃_{2,s2}(i) + o) = d̃_{2,1}(i) − o.


Figure 4.8: Phase Errors (clock offsets in ms of PC1, PC2, and the difference PC1−PC2 over time in hours)

Similarly, we can analyze packets sent from PC1 to PC2:

d̃_{1,2}(i) = t̃_{2,r2}(i) − t̃_{1,s1}(i)

d_{1,2}(i) = t̃_{2,r2}(i) − t̃_{2,s1}(i) = t̃_{2,r2}(i) − (t̃_{1,s1}(i) − o) = d̃_{1,2}(i) + o.

Thus,

d̃_{2,1}(i) = o + d_{2,1}(i)   and   −d̃_{1,2}(i) = o − d_{1,2}(i).

These two quantities, together with the offset of the clock of PC1 from the clock of PC2, are shown in figure 4.9. The main reason for the phase differences were the different reactions of the quartz oscillators to the change in temperature. The variable part of the frequency error itself can be neglected in the calculation of the one-way delays, as it is below 1 ppm and thus contributes less than 1 ppm to the measurement error of each delay calculation.
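Under the additional assumption of a symmetric link (d_{2,1} ≈ d_{1,2}), these two relations can be combined to recover the constant offset o from the measured delay series. The following sketch shows this textbook estimate with purely illustrative values (not measurement data); chapter 5 argues that this kind of estimate is not precise enough for low-latency LAN links.

def estimate_offset(d21_measured, d12_measured):
    """Estimate the clock offset o of PC1 relative to PC2 from measured
    one-way delays in both directions.  With d21 ~= d + o and d12 ~= d - o
    for a symmetric true delay d, offset and delay follow directly.
    Minima are used to pick the packets least affected by queuing."""
    d21 = min(d21_measured)
    d12 = min(d12_measured)
    offset = (d21 - d12) / 2.0
    delay = (d21 + d12) / 2.0
    return offset, delay

# illustrative delays in milliseconds
o, d = estimate_offset([-2.39, -2.38, -2.40], [2.51, 2.50, 2.52])
# o ~= -2.45 ms, d ~= 0.05 ms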


Figure 4.9: UDP Delays

But not only frequency errors cause phase errors. Delays when reading the clock appear as phase errors, too. This can be the case both when reading the clock for timestamping and during the time synchronization process [59].

More sophisticated approaches to characterize the error involved in time and frequency measurements are presented in the technical note [81], which contains a number of articles on this topic. The most important ones in our context are [47], [33], [1] and [17].

The cited papers are based on the assumption that two clocks with their oscillators are compared. All analyses use a set of data of the fractional frequency or time fluctuations between these clocks.

e rst distinction that has to be made is the one between non-random and random žuctuations.

Non-random fluctuations can be easily determined and predicted. Suppose one oscillator has a constant frequency difference to the other. Then, the time difference between the two clocks will increase linearly.

As the frequency difference can be estimated as the mean value of the frequency measurements, the phase errors (time differences) caused by this type of fluctuation can be predicted. Thus, these effects are called systematic. Another systematic fluctuation would be a linear frequency drift (a linear change of the frequency), which leads to quadratically departing phase fluctuations.

After determining, predicting and eliminating the systematic fluctuations, a set of errors remains in the data. This set contains the random errors and has to be characterized using statistical methods, either in the Fourier frequency domain or in the time domain. When one oscillator is compared to a reference as in the setting described above, y(t) denotes the instantaneous normalized frequency deviation from the nominal frequency ν0 at time t and φ(t) the phase deviation in radians from the nominal phase 2πν0t. They relate to each other as

y(t) = (1/(2πν0)) · dφ(t)/dt = φ̇(t)/(2πν0).

Another important measure is the phase deviation x(t) expressed in units of time

x(t) = φ(t)/(2πν0).

One main observation when dealing with clock readings is that the noise processes that cause errors are often not of Gaussian form, and that the processes are not stationary. This is the reason why traditional measures like the mean or the standard deviation do not provide valid predictions.

4.2.1 Classification in the Frequency Domain

The frequency and phase error processes can be classified in the frequency domain using their one-sided spectral densities, i.e. spectral densities where the Fourier frequencies range in the interval 0 to ∞. These spectral densities can be determined for all quantities defined above. Sy(f) denotes the one-sided spectral density of y(t), Sφ(f) that of φ(t), Sφ̇(f) that of φ̇(t), and Sx(f) that of x(t).


The relation between the different spectral densities can be expressed by the following equations:

Sy(f) = (f²/ν0²) Sφ(f)
Sφ̇(f) = (2πf)² Sφ(f)
Sx(f) = Sφ(f)/(2πν0)².

A common way to characterize the instabilities is to plot the spectral densities over the Fourier frequency. The most important fluctuations are often represented as a sum of five different noise processes using power-law spectral densities for Sy(f):

S_y(f) = \begin{cases} \sum_{\alpha=-2}^{2} h_\alpha f^\alpha & \text{for } 0 < f < f_h \\ 0 & \text{for } f \geq f_h. \end{cases}

In this equation, hα is a scale factor, α an integer between −2 and 2, and fh is the cut-off frequency of a low-pass filter. The five noise processes can be identified in a plot of the logarithm of Sy(f) over the logarithm of the Fourier frequency f as depicted in figure 4.10.
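A direct way to read this model is as a superposition of five power-law terms; the small sketch below evaluates Sy(f) for given scale factors h_α (the function name and all numerical values are arbitrary placeholders, not fitted parameters).

def power_law_psd(f, h, f_h):
    """Evaluate S_y(f) = sum_{alpha=-2}^{2} h_alpha * f**alpha for
    0 < f < f_h and 0 for f >= f_h.  h maps each exponent alpha to its
    scale factor h_alpha."""
    if f <= 0.0 or f >= f_h:
        return 0.0
    return sum(h_alpha * f ** alpha for alpha, h_alpha in h.items())

# arbitrary scale factors for the five noise processes
h = {2: 1e-26, 1: 1e-24, 0: 1e-22, -1: 1e-21, -2: 1e-20}
values = [power_law_psd(f, h, f_h=1e3) for f in (0.01, 0.1, 1.0, 10.0)]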

Figure 4.10: Power-Law Spectral Densities

In this log-log plot, α appears as the slope of the line that relates Sy(f) to f, whereas hα is the amplitude of the corresponding noise process.

In the area of the plot where the slope α is 2, the noise process is white phase (PM) noise, which is often induced by the measurement process. When this noise process is part of the signal, it is mainly caused by the devices used for amplification of the signal. The same cause also produces flicker phase (PM) noise, which appears in the area of the graph where α = 1. Another reason for the presence of this noise component is the use of frequency multipliers, which are often used on PC mainboards and CPUs to generate a higher-frequency signal from a lower-frequency quartz oscillator output. White frequency (FM) noise with α = 0 often appears when a slave quartz oscillator is locked to the frequency output of another device. This is the case for cesium and rubidium oscillators as well as for GPS receivers, where a quartz oscillator is disciplined by the atomic clocks present in the satellites of the GPS constellation. Flicker frequency (FM) noise may be caused by physical resonance mechanisms of active oscillators and by environmental properties. It is identifiable as the area with slope α equal to −1 in the log-log plot. The fifth noise component, random walk frequency (FM) noise or white frequency aging, is visible as an area with slope −2. It is related to the physical environment of the oscillator and can be caused by mechanical shock, vibration and changing temperature. All these effects result in a change of the frequency.

4.2.2 Classification in the Time Domain

As the data observed when characterizing clock errors do not belong to stationary processes, classic measures like the mean and the standard deviation do not provide meaningful results. The standard deviation of clock error measurements will often increase with the number of samples included in the calculation. Therefore, these measures cannot be used to compare the performance of different clocks. A measure commonly used for the classification of clock errors in the time domain is the two-sample Allan variance [2]. An intuitive introduction to the Allan variance is presented in [47]:

From the time differences (phase errors) of two clocks at times t1 and t2 = t1 + τ, denoted by x(t1) and x(t2), the frequency difference during this interval can be estimated as

ȳ_1 = (x(t2) − x(t1))/τ.

The time difference at time t3 = t2 + τ can be estimated as

x̂(t3) = x(t2) + ȳ_1 τ = 2x(t2) − x(t1).


This estimation is based on the assumption that the frequency in the interval from t2 to t3 is the same as in the previous one from t1 to t2. The prediction error

ε = x(t3) − x̂(t3)

is proportional to the difference of the frequency errors ȳ_2 − ȳ_1, where ȳ_2 is the frequency error in the interval t2 to t3. Expressed using time measurements, the prediction error is proportional to

(x(t3) − 2x(t2) + x(t1))/τ.

One half of the mean-square value of this quantity is called the two-sample Allan variance for an averaging time of τ, denoted by σy²(τ). Therefore,

σy²(τ) = ⟨(ȳ_{k+1} − ȳ_k)²/2⟩,

where the angle brackets ⟨ ⟩ denote an infinite time average over the adjacent samples t_{k+1} = t_k + τ, which are thus time difference measurements with a fixed sample rate 1/τ. This results in frequency estimates ȳ_k with zero dead time, where

\bar{y}_k = \frac{1}{\tau}\int_{t_k}^{t_{k+1}} y(t)\,dt = \frac{x(t_{k+1}) - x(t_k)}{\tau}.

From the equations above, it can be seen that a constant frequency offset does not influence the Allan variance; the measure therefore does not cover frequency accuracy. The square root of the Allan variance is called the Allan deviation σy(τ). A more efficient method for calculating the Allan variance [33] can be obtained for measurements with a constant rate 1/τ0 using overlapping estimates as

\sigma_y^2(\tau) = \frac{1}{2(N-2m)\tau^2} \sum_{i=1}^{N-2m} \left( x(t_{i+2m}) - 2x(t_{i+m}) + x(t_i) \right)^2,

where N is the original number of time difference measurements spaced by τ0, M = N − 1 the number of frequency error measurements of sample time τ0, and τ = mτ0.
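The following Python sketch implements this overlapping estimate directly from the formula above (our own evaluations used an R implementation, so this is only an illustration; x is a list of phase samples spaced by tau0 seconds).

def overlapping_allan_variance(x, tau0, m):
    """Overlapping estimate of the two-sample Allan variance for the
    averaging time tau = m * tau0, computed from N phase samples x:

        sigma_y^2(tau) = 1 / (2 (N - 2m) tau^2)
                         * sum_i (x[i+2m] - 2*x[i+m] + x[i])^2
    """
    n = len(x)
    if n <= 2 * m:
        raise ValueError("need more than 2*m phase samples")
    tau = m * tau0
    s = sum((x[i + 2 * m] - 2.0 * x[i + m] + x[i]) ** 2
            for i in range(n - 2 * m))
    return s / (2.0 * (n - 2 * m) * tau ** 2)

def allan_deviation(x, tau0, m):
    """Allan deviation sigma_y(tau), the square root of the Allan variance."""
    return overlapping_allan_variance(x, tau0, m) ** 0.5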


Figure 4.11: Allan Deviation

As is the case for power-law spectral densities, the different error process components can be seen as different slope characteristics in a plot of the logarithm of the Allan variance σy²(τ) or of the Allan deviation σy(τ) over the logarithm of the averaging time τ. Figure 4.11 shows a typical plot of the Allan deviation for the five independent noise processes. Table 4.1 summarizes the error processes and slopes in the different plots. The table also shows that the Allan variance does not exhibit different slopes for white and flicker phase noise processes. For that reason, another two-sample variance has been developed, the modified Allan variance. It is defined as

\mathrm{Mod}\,\sigma_y^2(\tau) = \frac{1}{2\tau^2 n^2} \left\langle \left[ \sum_{i=1}^{n} \left( x_{i+2n} - 2x_{i+n} + x_i \right) \right]^2 \right\rangle

and allows one to distinguish between white and flicker phase noise processes in a log-log plot of Mod σy²(τ) versus τ as areas with slopes −3 and −2, respectively. Like the normal Allan variance, the modified Allan variance can also be determined using overlapping estimates. For N time measurements spaced by τ0, the modified Allan variance can be calculated for a chosen τ = mτ0 as

\mathrm{Mod}\,\sigma_y^2(\tau) = \frac{1}{2\tau^2 m^2 (N-3m+1)} \sum_{j=1}^{N-3m+1} \left[ \sum_{i=j}^{j+m-1} \left( x(t_{i+2m}) - 2x(t_{i+m}) + x(t_i) \right) \right]^2.

Since the time dispersion is the primary concern in our field of application, the time variance (TVAR) [79] is especially useful. It is an estimator for the timing errors caused by frequency variations.


Table 4.1: Slope Characteristics

                           Frequency Domain        Time Domain
Noise Process              Sy(f)    Sφ(f)    σy²(τ)   σy(τ)   Mod σy²(τ)   σx²(τ)
White Phase Noise            2        0        -2      -1        -3          -1
Flicker Phase Noise          1       -1        -2      -1        -2           0
White Frequency Noise        0       -2        -1     -1/2       -1           1
Flicker Frequency Noise     -1       -3         0       0         0           2
White Frequency Aging       -2       -4         1      1/2        1           3

It is designated σx²(τ) and can be calculated from the modified Allan variance as

σx²(τ) = (τ²/3) · Mod σy²(τ).

Another advantage of the time variance, besides being interpretable in the time domain, is that it can be used to easily identify the onset of the domain in which the spectrum is dominated by white frequency noise, namely as the point in the log-log plot where the slope changes from zero to one. The time deviation σx(τ) is defined as the square root of the time variance. Fast algorithms for calculating these measures have been published by Bregni in [10]. We implemented these algorithms in R for our experiments.
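For completeness, the following is a straightforward quadratic-time sketch of the modified Allan variance and the resulting time deviation (it is neither Bregni's fast algorithm nor our R code; x is again a list of phase samples spaced by tau0).

def modified_allan_variance(x, tau0, m):
    """Overlapping estimate of the modified Allan variance for tau = m*tau0,
    following the formula given above."""
    n = len(x)
    if n < 3 * m:
        raise ValueError("need at least 3*m phase samples")
    tau = m * tau0
    outer = 0.0
    for j in range(n - 3 * m + 1):
        inner = sum(x[i + 2 * m] - 2.0 * x[i + m] + x[i]
                    for i in range(j, j + m))
        outer += inner ** 2
    return outer / (2.0 * tau ** 2 * m ** 2 * (n - 3 * m + 1))

def time_deviation(x, tau0, m):
    """Time deviation sigma_x(tau) = sqrt(tau^2 / 3 * Mod sigma_y^2(tau))."""
    tau = m * tau0
    return (tau ** 2 / 3.0 * modified_allan_variance(x, tau0, m)) ** 0.5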

4.3 Reference Clocks

To evaluate the accuracy of clocks and to synchronize them, a reference standard is needed. Nowadays, the most common reference clocks for computer systems are NTP servers [59] that distribute timing information received from other reference clocks over the network. David Mills, the inventor and maintainer of NTP, has evaluated the performance of his system under real-world conditions and discovered that the synchronization accuracy of NTP over LAN links is on the order of 10 µs, with spikes up to 100 µs due to varying network delays caused mainly by queuing delays in network elements like switches and network adapters. Over WAN links, the accuracy is impaired even more by additional components such as routers in the transmission path of the datagrams used for time transfer.


Therefore, one can expect an accuracy of about 5 ms in the Internet, but errors up to 100 ms have also been observed [58]. These aspects lead to the conclusion that the achieved precision is not sufficient to estimate accurate distribution functions for measured one-way delays, even when an NTP server is available in the local area network.

4.3.1 NTP

Despite the limitations mentioned, it is worthwhile to look at the functionality of the NTP protocol and the internal mechanisms of NTP servers, since the system is both used as one component in our solution and provides ideas for our own implementations of time synchronization solutions. This description of NTP follows [58], as this monograph by David Mills is the most comprehensive and detailed work on the topic.

Figure 4.12: NTP Time Transfer

As the internal oscillators of computers were not chosen with precision timekeeping in mind, the undisciplined clocks of different computer systems tend to differ both in phase and frequency. The idea of NTP is to use a network connection between two systems to transfer timing information. The same NTP software is used on both sides of the connection; there is no need for different software on the client and the server. A system that acts as a client to some server can also act as a server to other systems. This creates hierarchies of NTP servers. The roots of these hierarchies are referred to as stratum 1 servers and are usually connected to some reference time source other than NTP. When descending the hierarchy, the stratum number increases, i.e. the next-level servers have stratum 2.

For determining the clock offset, the protocol specifies a protocol data unit (PDU) that, among other information, can hold three timestamps referred to as the origin, receive and transmit timestamps. A 64-bit format with a resolution of 232 ps is used for all packet timestamps. For the time transfer as depicted in figure 4.12, a client generates an NTP PDU, fills the origin timestamp with the current local time T1 and sends the PDU to the server using the connectionless protocol UDP. Upon receiving the datagram, the server fills the receive timestamp field with its current clock reading T2. Just before sending the datagram back to the client, the transmit timestamp is filled with T3. When the client receives the packet, it immediately generates a fourth internal timestamp T4 that enables it to calculate both the time offset

θ = ½ [(T2 − T1) + (T3 − T4)]

and the round-trip delay

δ = (T4 − T1) − (T3 − T2)

of the datagram. Figure 4.12 shows the transfer of timestamps used by NTP. While it is obvious that the calculation of the round-trip time is always correct, the calculation of the offset assumes that the delays in the network are symmetric, i.e. T4 − T3 = T2 − T1. This time transfer is repeated a number of times and the measurement with the smallest round-trip delay is selected for further calculations, as it is assumed that the lowest delay is an indication of the least queuing and thus the most symmetry. Since a client usually has associations with more than one NTP server, these measurements are then processed using filters and selection, clustering and combining algorithms to generate an estimate of the local clock offset from all the values of the different servers. This selected value can then be used to discipline the clock. Successive measurements allow determining the frequency of the local clock over an averaging interval τ. The clock is disciplined by changing the amount of time that is added to the local clock in each clock update cycle. That means that the clock operates as a variable frequency oscillator (VFO). The amount of change is determined by a feedback loop. Depending on the averaging interval over which the current frequency is determined, a phase-locked loop (PLL) or a frequency-locked loop (FLL) is used to control the frequency adjustment. An overview of the architecture of an NTP server is shown in figure 4.13.


The same mechanisms can be used for controlling the clock of stratum 1 servers, but the timing reference is a clock source directly connected to the system without any network in between.

Figure 4.13: NTP Architecture
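As a numerical illustration of the offset and delay calculation, the following sketch evaluates the two formulas for one exchange of timestamps; the function name and the example numbers are purely illustrative.

def ntp_offset_and_delay(t1, t2, t3, t4):
    """Compute clock offset theta and round-trip delay delta from the four
    NTP timestamps: t1 client send (origin), t2 server receive, t3 server
    transmit, t4 client receive.  The offset is only exact if the network
    delays in both directions are symmetric."""
    theta = ((t2 - t1) + (t3 - t4)) / 2.0
    delta = (t4 - t1) - (t3 - t2)
    return theta, delta

# Example: server clock 5 ms ahead, 20 ms one-way delay in each direction
theta, delta = ntp_offset_and_delay(0.000, 0.025, 0.026, 0.041)
# theta = 0.005 s, delta = 0.040 s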

4.3.2 Time Sources

When looking at the reference clocks for stratum 1 NTP servers, two main options exist in Germany: DCF77 and GPS receivers. The DCF77 signal is a coded time signal transmitted on a 77.5 kHz radio frequency. The distributed time is the official German time reference generated by the Physikalisch-Technische Bundesanstalt (PTB). The transmitter is located in Mainflingen near Frankfurt am Main. While the time sources for the signal are highly precise atomic clocks (cesium and rubidium frequency references) that provide an accuracy of 300 ns for the start of each second, the phase and frequency errors caused during the propagation of the long-wave signal are several orders of magnitude higher [69].

A higher precision can be achieved using GPS technology. The Global Positioning System (GPS) [5] uses a number of orbiting satellites (up to 31) equipped with precise atomic clocks that transmit time information. All satellites are synchronized to a common time base. Several ground control stations monitor the satellite clocks with respect to the time base and send correction information to the respective satellite in case of a discrepancy.

For positioning purposes, the reception of time information from four satellites allows the receiver to calculate its current position, since the positions of the satellites and the propagation delays of the signals are known. If the GPS receiver were equipped with a precise time base, the reception of three satellite signals would be sufficient to calculate its position, but since this is not the case for most commercial receivers, four is the minimum number needed [77]. In many places on earth, more than the four satellites needed for positioning are in view all of the time. The system can also be used to obtain highly precise timing information. For a receiver in a known, fixed position, only one satellite has to be in view to acquire the timing information.

4.3.3 The PPS API

RFC 2783 [62] specifies an application programming interface (API) for using the PPS pulses of an external reference clock directly connected to a stratum 1 NTP server for time synchronization. Many clock sources are capable of providing a signal where a level change marks the beginning of a second with high precision. This signal is called pulse-per-second (PPS) output. The PPS API provides a facility to timestamp level changes of signals delivered to a system interface with high resolution. Since the serial port of the system was often used to connect external clocks, the data carrier detect (DCD) pin of a serial port is commonly used for the PPS signal input.

Since the PPS pulse marks the beginning of the second, the timestamps generated allow determining the offset of the local clock with respect to the reference clock, provided that the offset is less than half a second. Larger offsets cannot be determined using the PPS API alone, since from the timestamp itself it cannot be seen to which second (minute, hour, day, month and year) the level change of the signal belongs. Since the time difference between two successive PPS pulses is exactly one second, the frequency error of the receiving system can also be calculated from these measurements.

The PPS API is available as a patch set for Linux kernels of both the 2.4 and the 2.6 series. The Linux PPS API kernel patch for kernel versions 2.4 [93] modifies the Linux serial port driver to detect and timestamp signals delivered to the DCD pin of the serial port. In addition to PPS recognition, this patch also extends the timekeeping resolution of the Linux kernel to one nanosecond by utilizing the timestamp counter (TSC).

In the 2.6 kernel series, the patch just enables the timestamping of the pulses, but does not change the general timekeeping properties such as the resolution of the clock.

The timestamps of the PPS pulses can be used in two ways to discipline the kernel clock: either by using the hardpps() kernel consumer or by using the user-level NTP daemon. Both make use of the kernel model for precision timekeeping as specified in RFC 1589 [56] and estimate the frequency error of the local oscillator by averaging the measured time interval between successive PPS pulses. Figure 4.14 shows how an NTP process uses the timestamps provided by the PPS API to discipline the operating system clock that is used for timestamping the PPS signals.

Figure 4.14: NTP and the PPS API

The main advantage of using the PPS API is the low latency of the timestamping process, as the implementation usually instructs the system to invoke a special interrupt handler to generate the timestamp upon reception of the external pulse causing an interrupt. In a careful implementation, the only variable time interval between the hardware logic level change and the generation of the timestamp is thus just the interrupt latency. This interrupt latency changes with other simultaneous tasks of the system and can become relatively high, especially when other tasks involve communication with hardware. On the other hand, reception of an NTP PDU over the network involves buffering both in network components and in the network interfaces of the sending and receiving systems. These buffer effects are usually several orders of magnitude worse than the interrupt latency.

Additionally, the reception of Ethernet frames often also involves the invocation of interrupt handlers, and thus these measurements also include the interrupt latency as an additional error component.


5 Dedicated Measurement Infrastructure

It was obvious that precise measurements of one-way delays are needed as a basis for a fine-grained performance analysis of distributed systems.

Due to the structure and complexity of the network hardware used, it became clear that it would not be feasible to implement a hardware monitoring solution for this purpose. The most important problem with such a solution is to recognize the beginning of a transmission and to record all relevant header information of the packets. Since HTTP over TCP/IP is used in our setup, recording the source and destination IP addresses, TCP port numbers and sequence numbers is extremely useful for reconstructing the data exchange after the measurement. Even if it were possible to detect transmissions in hardware, recording these data from layers 2 and 3 would require considerable hardware complexity in the event recorder.

Therefore, a hybrid or pure software monitoring approach seemed more promising. We evaluated the possibility of using the hardware monitor ZM4 [18] for a hybrid monitoring approach. The ZM4 system is a distributed hardware monitor that uses a central measurement timing generator (MTG) and a number of distributed dedicated probe units (DPUs). The DPUs are synchronized to the common measurement timing pulses generated by the MTG using a twisted pair connection. The DPUs generate an internal timing signal with a resolution of 100 ns. This local clock of the DPUs is used to timestamp the events which are recognized by the trigger logic and recorded internally in the DPU. The sustained event recording rate is limited to 10,000 events per second.

A fully utilized 1000Base-T Ethernet connection can transmit up to 83,333 packets per second, even if we assume quite large Ethernet frames of 1,500 bytes. Therefore, the hardware we had at hand was not able to handle this event rate.


For that reason, a pure software monitoring approach was the only option. But since the clocks of the object system are used to generate the timestamps and since the nodes of the cluster system are completely independent, a method for synchronizing these clocks is needed. Our experiments have shown that the precision achieved by estimating the clock skew from network delay measurements without a GPS reference clock, as in [66, 63, 67, 31], is not sufficient for determining the distributions of the delays in our system. These techniques were introduced to estimate the phase and frequency difference of clocks used to timestamp packets transmitted over an Internet link, where the resulting transmission delays are several milliseconds long. In an environment with low-latency LAN links, these methods proved not to be effective, as the variations of the transfer delays are relatively large compared to the clock differences. Therefore, all these approaches would lead to false predictions. Most of the methods are based on using statistical characteristics of the transfer delays, usually a minimum of the estimated one-way delay over a certain period of time, and assume a symmetric link. The use of the minimum is justified, as all effects that affect the transmission delays in an unpredictable way, like queuing delays in routers and switches, lead to longer delays. Thus, when using the minimum, the packets which are least affected are chosen for the estimation of the clock differences. A detail of the plot of UDP transmission delays from figure 4.9 is shown in figure 5.1. When looking at this graph, it becomes clear that the calculated minimum depends heavily on the chosen period of time, and that the minimum of the transfer delays is not distributed symmetrically in both directions. Even if the general trend of the clock error difference can be estimated using these approaches, the determined distribution of the calculated transfer delays is not exact enough to be used in a sound input model. A detailed evaluation of these methods can be found in the thesis of Johannes Dodenhoff [19]. In earlier stages of our experiments with the web cluster, we equipped each node of the cluster with a dedicated GPS-based time source, a Meinberg GPS167 PCI card. These receivers need to be installed in the respective nodes and share a common roof-mounted antenna. Using these GPS receivers directly as a time source, where every timestamp needed is generated by the clock of the PCI card, proved not to be efficient, as every reading of the clock caused a context switch from user mode to kernel mode and took time on the order of several microseconds. Another disadvantage of this approach is that every node of the object system on which performance measurements have to be conducted must be equipped with its own GPS receiver.

Besides causing high costs, it also limits the applicability to systems with interfaces like PCI for which dedicated GPS hardware is available.

Figure 5.1: Detail of UDP Delays (phase difference PC1−PC2, delay PC2 to PC1, and negative delay PC1 to PC2 in ms over time in hours)

The solution we found to this problem was to use the standard operating system clocks of the object systems for timestamping the measurement events and to synchronize them to our GPS time source periodically during the measurements. The method is based on standard time synchronization tools such as NTP with GPS receivers, together with our own modifications.

We use a standard NTP server equipped with a GPS receiver to provide coarse timing information to all other nodes of the system over a dedicated synchronization and measurement network. The GPS receiver has a PPS signal output that is documented to deliver a TTL pulse marking the beginning of a second with an uncertainty below 500 ns with respect to GPS time.

The new idea implemented here is to distribute the PPS signal to all nodes of the cluster and use this timing source in combination with the networked NTP server as a precision timing reference. Since the signal levels of TTL are different from those of RS-232, we built a 5V-powered level converter using Maxim MAX3225 chips. These chips were selected because of their relatively low propagation delay. One chip can convert two TTL signals to RS-232 levels, so we used seven chips connected on the TTL side to deliver the PPS signal to all nodes of our cluster plus the NTP server [30]. Figure 5.2 shows the architecture of the whole synchronization system.

Figure 5.2: Synchronization System

As mentioned above, the PPS pulse just marks the beginning of an arbitrary second, but does not contain any information on the absolute time, so all clocks of the cluster nodes must be set to have offsets of less than 500 ms. This is the reason for using a standard NTP server on the network. What makes this solution appealing is that the whole time synchronization during the measurements can be achieved using the ntpdate command before starting PPS synchronization and the hardpps features during the measurements, or by using the NTP daemon with a configuration file that contains two time sources, the PPS clock driver and the NTP server.

The hardpps solution has the advantage that no additional NTP process has to be executed during the measurement, but David Mills recommends using NTP for high-precision synchronization, since the in-kernel implementation does not use floating-point calculations and is thus less accurate than a user-mode NTP process. When using the solution with an NTP process, all nodes of the object system become stratum 1 NTP servers, as the GPS appears to be locally installed through the PPS connection.

When an Internet connection is used as the channel between the different nodes of the object system and the nodes are not placed in the same location, separate GPS receivers are needed at the different locations to generate the PPS signal for the time synchronization system. Since the PPS pulses of GPS receivers are derived from the global GPS clock ensemble, the error caused by the use of an additional receiver that possibly has other GPS satellites in view is minimal. Thus, this architecture can also be used for a geographically distributed object system.

In any case, no modification of the standard software components is needed for this measurement infrastructure. Our new solution can be implemented using only a special configuration file.

5.1 PPS Pulse Latency

In Linux, the recognition of the PPS pulse is done by instructing the hardware to generate an interrupt in case of a signal transition on the DCD pin of the serial port. The interrupt handling routine in the serial port driver is modified by the patch to timestamp every invocation. The PPS API can generate an echo signal on the DSR pin of the serial port to make it possible to estimate the delay between the PPS pulse and the timestamping using an external clock. This delay d_echo is composed of the hardware propagation delay for the incoming PPS pulse d_hwi, the interrupt latency d_lat, a delay between the timestamping and the generation of the echo signal d_ts, and the hardware propagation delay for the outgoing echo signal d_hwo. While the other delays remain more or less constant and can be compensated for, d_lat depends on the state of the system at the time of the signal reception. Thus, if the time of the generation of the n-th PPS pulse is t(n), the time of timestamping this event is

t_ts(n) = t(n) + d_hwi(n) + d_lat(n),

and the time the echo pulse is observable as an external signal transition is

t_echo(n) = t(n) + d_echo(n) = t(n) + d_hwi(n) + d_lat(n) + d_ts(n) + d_hwo(n).

By recording the PPS pulse and the resulting echo with an external clock, the value of d_echo can be determined. The time of the local clock at the n-th echo generation, t_loc,echo(n), is the timestamp generated by the PPS API driver. The time of the local clock at the n-th external PPS signal, t_loc,pps(n), is t_loc,echo(n) − d_ts(n). Thus the difference between two nodes i and k at the time of the n-th PPS pulse can be calculated as

Δt_{i,k}(n) = (t_loc,echo,i(n) − d_ts,i(n)) − (t_loc,echo,k(n) − d_ts,k(n)).

The delay d_ts is not directly observable, but since d_ts(n) and d_hwo(n) can be viewed as constant across different nodes and over time, a reasonable approximation for Δt_{i,k}(n) can be calculated as

Δ̃t_{i,k}(n) = (t_loc,echo,i(n) − d_echo,i(n)) − (t_loc,echo,k(n) − d_echo,k(n))
            = Δt_{i,k}(n) + d_ts,k(n) − d_ts,i(n) + d_hwo,k(n) − d_hwo,i(n).

The assumption that d_hwo is constant for all systems is justified because all signal level converters use the same hardware with low propagation delay and share the same ambient temperature. The generation of the echo signal inside an interrupt handler with other interrupts disabled, together with the identical hardware on all real servers, also makes the assumption of a constant d_ts reasonable. The PPS serial port driver is implemented so that the serial port remains usable for general communication besides PPS recognition. Due to this fact, several instructions are executed between the timestamping and the echo generation. We therefore decided to implement a driver for the parallel port for exclusive use for PPS signal recognition. This also enabled us to avoid the use of signal level converters, since the parallel port uses TTL signals as provided by the GPS receiver. Figure 5.3 shows a reduction in the interrupt latency with our driver using the parallel port compared to the standard PPS serial port driver.


Figure 5.3: Interrupt Latencies (densities of the IRQ latency in µs for the serial port and for the parallel port)

The PPS API patch for the 2.4 Linux kernel series also improves the resolution of the system call do_clock_gettime() to one nanosecond. When using this system call from kernel space, no context switch is involved. The measured mean execution time of one system call on our cluster nodes is 70 ns. This measurement was done by allocating a buffer in kernel space and writing the results of successive invocations of the system call do_clock_gettime() to that buffer. The content was read by a user-mode tool with which we calculated the differences of successive timestamps. It shows that the times d_ts(n) for generating the timestamps once the interrupt handler is invoked are short compared to the interrupt latencies d_lat(n) with our parallel port driver.
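A rough user-space analogue of this measurement can be obtained with Python's clock_gettime_ns (Linux, Python 3.7 or newer); the absolute numbers are much larger than the 70 ns kernel-space figure because every reading crosses the system-call boundary, but the back-to-back differencing principle is the same.

import time

def clock_read_latencies(n=1000):
    """Take n back-to-back clock readings and return the differences of
    successive timestamps in nanoseconds."""
    stamps = [time.clock_gettime_ns(time.CLOCK_MONOTONIC) for _ in range(n)]
    return [b - a for a, b in zip(stamps, stamps[1:])]

lat = clock_read_latencies()
print(min(lat), sum(lat) / len(lat), max(lat))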

5.1.1 Echo Feedback

The interrupt latency d_lat from section 5.1 occurs not only in our setup, but in every system that uses an external reference clock. It makes no significant difference whether the clock is connected to a serial or parallel port or to a system bus like PCI; interrupt latencies occur in any case.

The calculation of Δt_{i,k} led us to the idea of dynamic PPS echo feedback: by measuring the time between the PPS pulse and the generated echo for each pulse with an external clock, we can compensate for the interrupt latency by subtracting d_echo(n) from the timestamp t_loc,echo(n). The resulting timestamp

t′_loc,echo(n) = t_loc,pps(n) − d_ts(n) − d_hwo(n)


is lower than the desired t_loc,pps by d_ts(n) + d_hwo(n), but does not depend on the interrupt latency any more. Since the generation of the echo signal is done in the interrupt handler immediately after generating the timestamp, while other interrupt handling is disabled, the point in time at which the signal is generated is close to the timestamping of the PPS pulse.

Therefore, the delay d_echo(n) is a close approximation of the interrupt latency, and the use of timestamps calculated as described above leads to a considerable improvement of the quality of the synchronization with respect to phase errors and jitter of the timestamps. This novel concept is used in [85] for the implementation of an improved synchronization system by Gükan Uygur. During the work on his thesis, he created an external clock that is intended to measure the interrupt latency. For this purpose, a field programmable gate array (FPGA) is connected to the parallel port of the object system. The FPGA is programmed to increment an internal counter with every tick of an external quartz oscillator connected to the FPGA. The structure of the hardware is shown in figure 5.4.

Figure 5.4: External Clock

Each time the FPGA receives a PPS signal, the current reading of the counter is latched into an internal register and the counter is set to zero. When the FPGA receives an echo signal, the counter is saved to another register, but the counter is not reset; it keeps counting. The contents of both registers can be read by the object system over a parallel port connection. The interrupt handler in our own implementation of a PPS API driver for the parallel port was modified so that, after generating a timestamp for each PPS signal received and generating an echo signal, it reads the values of both FPGA registers.

It then uses the counter value c_pps(n) latched for the PPS signal as the frequency of the external oscillator, since the time between two PPS signals is exactly one second and thus this counter value is the exact frequency of the oscillator during the previous second. Once the frequency is known, the time between the reception of the PPS and the echo signal, d_echo(n), can be estimated using the counter value for the echo signal c_echo(n) as

d̂_echo(n) = (c_echo(n) / c_pps(n)) · 1 s ≈ d_echo(n) = (c_echo(n) / c_pps(n+1)) · 1 s.

The frequency of the external oscillator during the last second is used as an approximation of the current frequency. The error introduced is small, since the frequency changes are also small during the short measurement interval τ0 = 1 s.

The recorded timestamp for the PPS pulse, t_loc,echo(n), can then be modified by subtracting d̂_echo(n). This modified timestamp t̂_loc,pps(n) = t_loc,echo(n) − d̂_echo(n) is provided for later use in applications or the kernel hardpps facility through the API as an estimate of the system clock at the time of the PPS pulse. Please note that another error has been introduced by measuring d̂_echo(n) with the external clock, whereas t_loc,echo is measured using the internal clock of the object system. The error caused by this compensation is only in the range of a hundred parts per million of d̂_echo(n) for an undisciplined local clock. It gets close to zero as the frequency of the clock of the object system is gradually disciplined by the PPS pulses and the frequency of the external clock is also determined using these signals.

For optimal performance, the granularity of the external clock should be at least as fine as the granularity of the object system. For a resolution of one nanosecond, an external quartz oscillator with a frequency of 1 GHz and an FPGA that can handle this frequency would be needed. As this is not feasible, the experiments were conducted using a 50 MHz oscillator. With this setup, we were able to achieve large improvements in timekeeping. For a typical trace, the root mean square (RMS) value of the jitter of the PPS timestamps was reduced from 20080 ns to 5414 ns. The jitter has been measured as the difference of successive differences of PPS timestamps, i.e. when t_loc,pps(n) denotes the timestamp for PPS pulse n, the difference to the next timestamp n + 1 can be calculated as Δ(n) = t_loc,pps(n + 1) − t_loc,pps(n). The jitter is then calculated as j(n) = Δ(n + 1) − Δ(n). While the number of large jitter values decreases considerably when applying the echo feedback mechanism, the number of small jitter values increases. This is caused by the limited granularity of the external clock.
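The correction and the jitter metric described above can be written down compactly; the following sketch (variable names are ours, not from the FPGA driver) applies the echo-feedback correction to one timestamp and computes the RMS jitter of a timestamp trace.

def corrected_pps_timestamp(t_loc_echo, c_echo, c_pps):
    """Dynamic PPS echo feedback: estimate the delay between PPS pulse and
    timestamp from the FPGA counters (c_pps = oscillator ticks during the
    previous second, c_echo = ticks between PPS pulse and echo) and
    subtract it from the recorded timestamp (all times in seconds)."""
    d_echo_estimate = c_echo / float(c_pps)
    return t_loc_echo - d_echo_estimate

def rms_pps_jitter(t_pps):
    """RMS jitter of a PPS timestamp trace, using the difference of
    successive differences as defined in the text."""
    deltas = [b - a for a, b in zip(t_pps, t_pps[1:])]
    jitter = [b - a for a, b in zip(deltas, deltas[1:])]
    return (sum(j * j for j in jitter) / len(jitter)) ** 0.5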


Figure 5.5: Time Deviation (σx(τ) in seconds over the averaging interval τ in seconds, for raw PPS timestamps and for timestamps corrected by echo feedback, both axes logarithmic)

More detailed evaluations involved calculating the time deviation. The time deviation σx(τ) was introduced in section 4.2.2; it is an estimator for the time dispersion due to frequency variation. Figure 5.5 was produced by plotting the time deviation versus the averaging interval τ, as crosses for the raw PPS pulses received by the PPS API with an undisciplined local clock and as dots for the PPS pulses corrected by dynamic PPS echo feedback. Both axes are scaled logarithmically. The measurement process took 24 hours. The frequency of the local clock had been corrected to eliminate the systematic frequency errors as far as possible by determining the overall frequency over a long averaging period of several weeks.

The graph shows that using echo feedback improves the phase errors, especially for small averaging periods. The values of σx(τ) are always below the values for the uncorrected PPS pulses. Please note that both axes are scaled logarithmically by convention to make it possible to identify the different noise processes. Therefore, the effect looks smaller at first glance than it really is.

The graph also shows that a careful selection of the averaging interval τ is crucial to the accuracy of the system. Our measurements imply an optimum value of 32 seconds, but a standard NTP daemon bases the choice of τ on the Allan intercept point, the minimum of the Allan variance.

The averaging interval used by NTP is larger than the optimal choice determined in our setup, since standard NTP and hardpps implementations impose a lower limit of 1024 seconds on τ. This limitation led to the idea of implementing our own synchronization system, tailored to the specific needs of our laboratory environment.

5.2 Offline Synchronization

To obtain an optimal synchronization, we developed a solution that uses the TSC of the CPU to timestamp both the events and the PPS pulses, leaving the system clocks completely unsynchronized. The synchronization is done offline after the measurements have taken place. As explained in section 4.1, the system clock in most operating systems is not based on a free-running oscillator, but on a combination of a number of clock sources (e.g. interrupt controller and cycle counter). This leads to an addition of the individual noise processes of the different oscillators and makes it harder to synchronize the clock to an external reference and to determine optimal parameters for a synchronization system. This can be completely avoided by using only a single free-running high-frequency oscillator. In this case we use the cycle counter that is driven by the internal CPU clock. Another advantage of using the cycle counter for timestamping is shown in figure 4.4: it can be read quickly and no context switch is needed. Therefore, the measurement process influences the performance of the object system less than when the local clock is used for generating timestamps.

Figure 5.6 illustrates the measurement and offline synchronization method. Before the measurement starts, a reference point in time is marked with a TSC timestamp of the object system. This reference timestamp is used to generate absolute timing references. During the measurement, both the events and the PPS pulses arriving at the object system are timestamped using the object system's cycle counter, and these timestamps are recorded in trace files. After the measurement has taken place, the time trace of the PPS pulses can be used together with the initialization information to calculate a synchronized event trace that contains absolute time points for all entries of the original event trace.

Frank Fischer [24] implemented a solution using an exponentially weighted moving average algorithm with weights chosen depending on the optimum averaging period.


Figure 5.6: Offline Synchronization

He showed that the achievable mean accuracy of the synchronization is 603 ns, which is very close to the specified accuracy of the PPS signal used in the setup (500 ns). The solution is based on the lockclock algorithm [46]. Assume the time trace contains a number of readings of the local clock t_k and the corresponding times t_{R,k} of a reference clock. The time offset x_k of the local clock with respect to the reference time is given by

x_k = t_k − t_{R,k}.

When τ_k = t_{R,k} − t_{R,k−1} denotes the current difference of the timestamps as measured by the reference clock, the current frequency error y_k can be estimated as

y_k = (x_k − x_{k−1}) / τ_k.

The lockclock algorithm tries to estimate the current time offset x̂_k from the filtered previous frequency error estimate ȳ_{k−1} as

x̂_k = x_{k−1} + ȳ_{k−1} τ_k,

where

ȳ_k = (ȳ_{k−1} + G y_k) / (1 + G)

with a weighting factor G that is determined by the characteristics of the local clock. Defining

α = G / (1 + G)

allows writing the equation above as

ȳ_k = α y_k + (1 − α) ȳ_{k−1},

which is an exponentially weighted moving average of the calculated y_i with a weighting factor α. To apply this algorithm to our situation, where PPS pulses are used as the main synchronization source and the time between successive reference timestamps is exactly τ0 = 1 s, a sensible weighting factor G has to be determined. Levine reasons in [46] that G depends on the characteristics of the free-running clock. He suggests determining the measurement interval T_nw at which the frequency fluctuations of the free-running clock begin to deviate from a white spectrum, as white frequency noise leads to the best predictions by the algorithm. This can be done by finding the point in a plot of the logarithm of the Allan deviation σy versus the logarithm of the measurement interval τ where the slope changes from −0.5 to 0. An optimized weighting factor G should then be selected so that

G ≈ τ0 / T_nw.

The implementation of this algorithm was done as a Java application. The program uses a text file that contains TSC stamps for PPS pulses plus an initialization text file that contains a wall clock time (date and time) for one specific PPS pulse to generate an event trace with wall clock times and event identifiers from a trace file with TSC stamps for events and the corresponding event identifiers.
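A compact Python sketch of the same filtering scheme (not the Java program itself; variable names and the simple initialization are ours) may help to clarify how the EWMA is applied to PPS-referenced offset readings:

def offline_lockclock(t_local, t_ref, g):
    """Predict local clock offsets in the spirit of the lockclock algorithm.
    t_local and t_ref are local and reference clock readings for the same
    PPS pulses; g is the weighting factor G.  Returns the predicted offsets
    x_hat[k] = x[k-1] + y_bar[k-1] * tau[k]."""
    alpha = g / (1.0 + g)
    x = [tl - tr for tl, tr in zip(t_local, t_ref)]   # measured offsets x_k
    y_bar = 0.0                                       # filtered frequency error
    x_hat = [x[0]]
    for k in range(1, len(x)):
        tau_k = t_ref[k] - t_ref[k - 1]
        y_k = (x[k] - x[k - 1]) / tau_k               # raw frequency error y_k
        x_hat.append(x[k - 1] + y_bar * tau_k)        # prediction before update
        y_bar = alpha * y_k + (1.0 - alpha) * y_bar   # EWMA update
    return x_hat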

Since the events are synchronized to an external clock that provides the PPS pulses, the solution can be applied to an arbitrary number of event traces of different systems. The resulting synchronized traces can then be compared to each other and used, for example, to determine one-way delays.


This approach is also applicable in small embedded systems where an online synchronization would be too time consuming [39]. When using configurable hardware, it is even possible to latch the current cycle counter (TSC) reading in hardware at every PPS pulse. This latched cycle counter can be read in the interrupt service routine for the PPS pulse and used in an offline synchronization process, which completely avoids the negative impact of the interrupt latency.

The system has not only been used for the web cluster; it has also been applied to measure one-way delays for wireless IEEE 802.11b transmissions. For this purpose, laptop computers were equipped with PCMCIA WLAN cards. A PPS pulse from a GPS receiver was delivered to the parallel ports of the computers to record the PPS TSC trace. The whole measurement process has been implemented as an integrated system by Christian Resch [73, 72]. The measurement results were then used to calibrate existing WLAN models in the simulation package ns-2.

Johannes Dodenhoff [20] evaluated an implementation of the NTP algorithms with a VFO and a PLL/FLL for the offline analysis of timestamps. He was unable to produce results better than those of the offline synchronization process of Frank Fischer. Using the PLL approach with a standard NTP parameter set, the 40-minute cycle still remained clearly visible as an oscillation in the time offset. Using optimized parameters (τ = 1024 s), the amplitude of this oscillation was reduced to about 5 µs. He obtained the best results when applying an FLL correction after the trace had already been modified using the PLL approach. The main result was that the quality of the synchronization depends strongly on the set of chosen parameters, such as the bandwidth τ of the PLL. When the parameters are not carefully chosen, the timestamps begin to oscillate and diverge more and more from the reference clock.

The experiments with recorded time traces showed that the offline synchronization implemented by Frank Fischer provided sufficiently good results and that a PLL or FLL approach needs to be tailored to the specific characteristics of the clocks. For a future improvement of the offline time synchronization process, it seems worthwhile to evaluate the performance of a Kalman-filter-based approach.


5.3 Instrumentation

For a software monitoring solution, it is necessary to instrument the code of the software that is executed on the object system. Instrumentation is done by inserting instructions into the code that generate timestamps for events and record them together with an event identifier in an event trace. Events can also mark the beginning and the end of an activity. Therefore, it is possible to calculate the durations of activities that influence the progress of a certain task. Since all the software components used in our laboratory are open source applications, it has been possible to include instrumentation code in the source code and to recompile all necessary components.

5.3.1 IP Stack Instrumentation

Since a primary goal was to include fine-grained measurements of one-way delays for IP packets, it has been necessary to instrument the TCP/IP stack of the operating system kernel. By doing so, it is possible to generate timestamps as soon as the operating system recognizes an IP packet. The closer to the hardware the timestamp is generated, the fewer influences of other tasks disturb the measurements of the packet delays. This proved to be feasible, as we use Linux as the operating system, which enabled us to include our own code statements in the kernel and to recompile and use our own custom kernel.

Starting with version 2.4, the Linux TCP/IP stack contains the netfilter framework [65] for packet filtering and mangling. It provides hooks at several places in the kernel stack where custom code blocks can be registered. These blocks are executed when an incoming or outgoing IP packet is processed by the kernel stack. Our first solution for Linux 2.4 was implemented by Andrey Chepurko [15] and consisted of timestamping code registered at a hook (e.g. NF_IP_LOCAL_OUT on the real server nodes) of the netfilter framework and a kernel-space ring buffer for the recorded timestamps and event identifiers. The event identifiers included the source IP address, the TCP source port and the TCP sequence number for incoming packets from the client. For outgoing packets, destination data was used instead of source data. Another field of the event ID included the TCP flags SYN, ACK, FIN and the direction of the packet (incoming or outgoing).

Special care had to be taken on the load balancing node, since both packets sent from the client and packets destined to the client pass this node twice. Therefore, each packet has to generate two different events, one when entering the node and one when leaving the node. Since our load balancing solution is implemented as a kernel module that adds load balancing functionality to the IP stack, this code was extended to also include timestamping and event recording. The timestamps have been generated using the do_clock_gettime() call. Therefore, we have been able to obtain timestamps with nanosecond resolution when using the PPS API kernel patch. With these 64-bit timestamps, one entry occupies 19 bytes of buffer space. The size of the kernel event trace buffer can be configured; the standard size has been 12 MByte. This buffer has been organized as a ring buffer. That means that when the buffer runs full, the measurement continues and the oldest entries are successively overwritten. The buffer can be read by user-space processes via a character device. The recorded event trace is copied to user mode in raw binary form. IOCTL commands to reset and clear the buffer have been implemented. A user-mode process can read from the device both during and after the measurement and write the trace to a file in binary form. An additional program has been implemented to convert the entries of the binary file to a text file for further evaluations. While this first solution was sufficient for experiments in the web cluster laboratory, it had a number of limitations. The most important problem was that it had only been implemented for Linux kernel version 2.4. Furthermore, even though it is a loadable module, it has been static in many aspects: most options, like the size of the kernel ring buffer, can only be changed at compile time, not when loading the module. What limited its use for other fields of application was the fact that both the packets that are captured and the header fields that are included in the event trace were fixed and tailored to the specific needs of web traffic. This has been done to limit the size needed for an event entry and to increase the speed of the event recording. But as main memory grows, and since a flexible instrumented IP stack proved useful for other applications, we decided to re-implement the netfilter-based capture solution for Linux 2.6 in a more flexible way. Mario Lasch [42] created a flexible packet logging solution for Linux kernel version 2.4 and provided a base for porting his solution to kernel version 2.6. His implementation kept the basic realization of the in-kernel ring buffer and the character device for communication with user-mode processes. His logging module can be attached to any netfilter hook and allows specifying which packets to capture depending on their protocol (TCP, UDP or ICMP) and destination port.

to the previous implementation, the timestamps can not only be generated using the kernel clock when a packet is received, but also using the TSC or by reading the timestamp that is generated by some network interface card drivers. This brings the timestamping closer to the hardware and thus provides more precise results. Another improvement is that the complete IP and transport layer headers can be recorded. The size of the kernel ring buffer can be specified when loading the module without recompilation. Nonetheless, a change of the netfilter hook still requires a recompilation of the module.

Figure 5.7: IP Stack Instrumentation

In a further thesis [43], Mario Lasch implemented a solution based on his prior work. The extended module is usable both for Linux kernel versions 2.4 and 2.6. It is completely integrated into the netfilter framework. Besides being attachable to any netfilter hook without recompilation, the kernel module can also be used as an iptables target. The netfilter framework provides the possibility to implement rule sets for packet filtering. The rules are composed of a number of classifiers (iptables matches) and one connected action (iptables target). The user mode command iptables can be used to insert a new match into a netfilter kernel table and defines

the target for matching packets. The following command inserts a rule that drops all incoming TCP packets in which exactly the SYN and ACK bits are set, i.e. sends them to the target DROP that discards the packets:

# iptables -A INPUT -p tcp --tcp-flags ALL SYN,ACK -j DROP

The new logging solution provides an iptables target named RBUFF. Timestamps are created for all packets that are sent to this target, and the timestamp and corresponding packet header data are recorded in the kernel ring buffer as an event entry. After loading the new module called ipt_RBUFF.ko, the following command can be used to trigger the recording of all incoming TCP packets with destination port 80 in the event trace:

# iptables -A INPUT -p tcp --dport 80 -j RBUFF

In contrast to the DROP target, packets sent to the RBUFF target are not discarded, but remain in the kernel to be processed by other rules and to be finally copied to user mode applications. In addition to the hook and target mode, the module can also be used in a dual mode where the packet is recorded both in the first and the last hook of the IP stack (the PRE_ROUTING and POST_ROUTING hooks). This mode is needed for measuring the time spent in the stack, which is useful in gateway nodes like the load balancer of the web cluster. The use of multi-core architectures and the kernel preemption in Linux version 2.6 made it necessary to include protection for critical sections in the new version of the module. These new kernel versions provided the base for some further improvements. The module can now be monitored and controlled using the sys filesystem (sysfs). For example, when using sysfs, the following command can be used to stop the recording of events in the kernel buffer:

# cat "0" > /sys/bus/platform/drivers/rbuff_driver/record

Since kernel versions 2.6 provide advanced solutions for transferring data from the kernel to user space applications (e.g. the relay subsystem), the character device implementation of the module has also been analyzed and optimized. Likewise, the module now also supports udev, an approach to create device nodes in the /dev tree automatically. Additionally, Mario Lasch implemented a new user mode tool to read the event entries from the kernel ring buffer. In addition to exporting the entries into a text file where the fields are separated by spaces or commas, the entries can also be exported in libpcap format. This format is used by the

tcpdump command and allows network protocol analyzers like wireshark [16] to read the recorded trace. The filtering and display capabilities of wireshark are especially useful when examining and debugging new configurations. The architecture of the current IP stack instrumentation is depicted in figure 5.7.

5.3.2 Web Server Instrumentation

In addition to the kernel level timestamping of IP packets, an instrumentation of the code of the web server application is needed to obtain data for modeling and performance evaluation. When static web pages are served, an instrumentation of the Apache web server has been used to generate application level timestamps. Apache's C API provided a base to implement handlers for certain stages of the processing of an incoming HTTP request. The first handler is the post-read request handler. This handler is called when an arriving request has been fully read by the server application. We implemented an external Apache module that registers a handler in this place to write a timestamp along with the client IP address, the client port and the URI of the request to an event trace file. The event for the completion of the request is triggered when the ap_send_http_header() method is invoked. This point in time marks the beginning of sending the HTTP reply back to the client.

5.3.3 Load Generator Instrumentation

On the client side, an HTTP load generator is needed. httperf, a load generator developed by David Mosberger [64], is able to generate load with different characteristics: sequential requests, requests with a fixed rate, session-oriented traffic with think times and requests according to a recorded trace file. It has the potential to overload a web server by generating a high request rate. Unlike most other load generators, it does not try to simulate a certain number of users. The number of requests and the rate of the requests can be specified on the command line. Since the test client PC has certain limitations on the number of TCP connections that can be held open simultaneously, httperf supports parallel execution on a number of client machines. This load generator is ideal for studying the behavior of a web server in extreme situations in order to find the system's limits.


SURGE [6], on the other hand, emulates the behavior of a configurable number of users as observed by its author Paul Barford when analyzing the log files of web servers. For this purpose, the relative percentage of the number of accesses per file, embedded references, temporal locality of references and inactive periods of the user are determined by an analytical model derived from empirical observations. Some of the probability density functions used in the model are heavy-tailed. The on-off processes which are used to model the users generate bursts and self-similar traffic as observed in recent studies of real-world traffic on the Internet. It is also usable in a distributed environment with more than one load generating node.

We use both load generators and instrumented their request generation phase to timestamp each request on the HTTP layer. In addition to the instrumentation of the load generating software, the IP stack of the load generator nodes has also been instrumented with the solution from section 5.3.1.

5.3.4 Application Server Instrumentation

While static content is still common on most web server systems, generating content dynamically is becoming more and more important in the web. In its most basic form, data in an internal representation is transformed to another output format for presentation to the client. This allows separating the layout of the web pages from the content. One approach to achieve this is the representation of the content as XML and the use of XSL for transformation to an output format like HTML. We evaluated the use of XSLT in our web cluster laboratory in [96]. This dynamic generation of content can be combined with a content management system (CMS) as described in [23]. A content management system often uses a combination of a database for storing the content and an application server for the transformation. Even more dynamic behavior can be expected in a web shop application. Therefore, Markus Preißner [70] implemented an online bookstore according to the TPC-W [83] for use in our lab. He used Enterprise Java Beans (EJB) for the business logic running on a JBoss application server. The database backend was a MySQL server. His implementation was intended for use on the web cluster and allowed distributing the application server and database functionality over a different number of nodes, depending on the configuration used. The only limitation in doing so was that write access was only allowed on one of the multiple database

nodes. For a performance evaluation of a system of this kind, it is necessary to instrument the different stages of the request handling and reply generation on the application server. Patrick Wunderlich [97] implemented an instrumentation for the TPC-W web shop system using aspect-oriented programming (AOP). AOP allows defining an instrumentation aspect, where parts of code that are used for the instrumentation can be kept separate from the business logic. These code parts are called advices. Joinpoints are events in the program execution like the invocation of a method. A pointcut allows selecting specific joinpoints and assigning an advice to them. An advice can then be executed before, after or instead of a method. The process of weaving constructs the complete software system using the core logic and inserting the advices of different aspects according to the defined pointcuts. When using Java as in our case, the weaving of the instrumentation aspect can be done in three different ways:

• A precompiler can be used to insert the source code of the advices into the source code of the program to be instrumented. The resulting code can then be compiled to bytecode using a standard Java compiler and be executed using a standard virtual machine (JVM).

• An AOP compiler can be used to insert the advices into the compiled bytecode of the core logic to produce instrumented bytecode that can be executed like the uninstrumented code.

• A special class loader can be used during the execution of the uninstrumented bytecode of the core logic to insert the code of the advices.

Patrick Wunderlich used AspectWerkz for his instrumentation. Aspects are implemented in pure Java. The pointcuts can be defined using annotations in Java 5, custom doclets in Java 1.3 and 1.4 or an external XML definition. The joinpoint definitions can contain wildcards that match specific classes, methods, constructors or fields. Pointcut definitions can be combined using logical expressions like NOT, OR, AND and can be grouped using parentheses. This allows a flexible selection of the method invocations to be included in the performance evaluation. As the measurement code is inserted by weaving, the original web shop source code remains completely unmodified. Advices were introduced not only into the web shop software on the JBoss application server, but also into the Tomcat servlet container and the Clustered-JDBC database middleware. This middleware component was also utilized by Patrick Wunderlich to mitigate some limitations of the original implementation of Markus Preißner when it is used in a clustered

environment. The event traces of all components on all nodes are sent to a central instrumentation server. This server is also implemented in Java and based on the Tomcat servlet container. It collects the traces and allows the user to filter interesting events using a web frontend. For statistical analyses, the relevant data can then be exported to text files. Figure 5.8 shows the overall architecture of the instrumentation present on each cluster node.

Figure 5.8: Application Server Instrumentation Architecture

The solution not only provided valuable data for a performance analysis of the system, but also allowed gaining insight into the dynamics of the implementation and identifying optimization points.

While this method of instrumentation has been developed for a system where all source code is available, it can also be used for closed source systems. As an example, Patrick Wunderlich explained in his thesis how the database interaction can be observed by instrumenting the JDBC driver. Since each driver has to implement the interfaces of the package java.sql, it is obvious which pointcut definitions have to be used.
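As an illustration of this idea, the following sketch shows what such a timing aspect could look like. It is written with the annotation syntax of AspectJ, a closely related AOP framework, rather than AspectWerkz, and the class name and the trace output are purely illustrative; the actual instrumentation differs in detail.

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Hypothetical aspect: wraps every execute* method of any java.sql.Statement
// implementation and records a start and an end timestamp for it.
@Aspect
public class JdbcTimingAspect {

    @Around("execution(* java.sql.Statement+.execute*(..))")
    public Object recordStatement(ProceedingJoinPoint jp) throws Throwable {
        long start = System.nanoTime();          // timestamp before the driver call
        try {
            return jp.proceed();                 // run the original driver code
        } finally {
            long end = System.nanoTime();        // timestamp after the driver call
            // In the real instrumentation the event would be written to the
            // event trace and sent to the instrumentation server.
            System.out.printf("%s;%d;%d%n", jp.getSignature(), start, end);
        }
    }
}

Woven into the application, for example at load time, such an aspect records timestamps for every statement executed through the JDBC driver without requiring the driver's source code.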

Stefan Schreieck showed the applicability of this approach for a self-service web portal of the University of Applied Sciences in Kempten. The system uses commercial Java class files for which no source code is available. Therefore, Stefan Schreieck used the method we suggested and created an instrumented JDBC driver [75, 76]

for the Informix database to obtain performance data of the system that can be used to parametrize models of the system like the ones presented in [29].

A similar instrumentation was used by Olena Kolisnichenko [40] for the web portal of DATEV eG. DATEV is a Nuremberg-based association for tax counselors, auditors and attorneys. Their online portal is the gateway to different online services provided to their customers. It is a Java 2 Enterprise Edition application that has been programmed by an in-house development department. To identify performance-critical parts in the program execution and to avoid possible problems, a measurement method that can be used during the implementation and testing phases before deployment had to be found. One goal was to keep the instrumentation separate from the core business logic and to provide a solution that allows easily adding and removing the instrumentation. The performance impact introduced by the instrumentation should be kept to a minimum. Previous experiments had shown that traditional profiling was not flexible enough for the automation of the performance tests and had a major impact on the overall performance. Our aspect-oriented approach, which was ported to the DATEV application by Olena Kolisnichenko, proved to fulfill the needs of the development department and will be used during future tests of new implementations.

5.3.5 Summary Performance Data

In addition to the ne-grained performance data obtained by event-oriented soŸ- ware monitoring, summary performance data are oŸen useful to validate the outcome of simulation runs. In a detailed performance analysis, resources are oŸen modeled as a separate entity. e utilization of these resources is not only caused by the application to be evaluated, but other system activities also use the same resources. erefore, tools to measure summary data like resource utilization are a sensible addition. We used the sar and iostat tools that are part of the sysstat utilities [óä]. In the cluster lab we used sar to sample the utilization of the CPU, the memory usage and the network load over intervals of one second length. Especially the CPU utilization has proved to be an important indicator for the correctness of the model, because it allowed us to compare simulated and measured CPU data. When building a simple (conceptual) model, such summary performance data can be su›cient for parametrization.


5.4 Analysis of the Traces

While some of the performance data obtained by conducting measurements with our instrumentation can be used directly for the input modeling, the event traces, especially from the instrumented IP stack, have to be processed to be usable. As described in chapter 4, the event trace produced during event-oriented performance evaluation contains timestamps for events. The events themselves have no duration. The one-way delays that are represented in performance models are activities. The start and the end of each activity is marked by an event. The duration of the activity, and thus the delay, can be calculated as the difference of the corresponding timestamps. In a distributed system, the start and the end are often recorded on different nodes of the object system in different trace files. As the timestamps are obtained from the local clocks of the object system, the start and the end of an activity are often measured with different clocks. Since the calculated delays are used for determining a distribution function in the input modeling process, the local clocks of the machines need to be synchronized with high accuracy. Low phase jitter in time synchronization is a crucial point. As described before, this can be achieved by synchronization during the measurements or by post-processing the event traces using offline synchronization (section 5.2). Once the event traces of the IP stack with synchronized timestamps have been obtained from all nodes of the object system, the event identifiers can be used to reconstruct the way of all TCP packets through the nodes of the system. We wrote a Java application that uses the TCP sequence numbers, flags, ports and IP addresses to identify and calculate the delays each segment of a TCP connection experiences in different places. For example, this allows determining the delay between the reception of an HTTP request and the sending of the first reply packet on a web server node, or the delay in network channels. The end-to-end delay on the application level between a client and a server node can be evaluated by looking at the application layer traces of the respective nodes. For a web server system like the one in our lab, this means relating the load generator and the web server traces to each other and requires matching the request URL, the client IP address and the client port.
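The core of the matching step for the IP stack traces can be sketched as follows; class and field names are hypothetical, only a single pair of traces is considered and retransmissions are ignored. Events from the two nodes are keyed by their packet identifiers and the delay is the difference of the synchronized timestamps.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch: compute one-way delays by matching events of two traces.
public class DelayMatcher {

    // One record of an IP stack event trace (timestamps already synchronized).
    public record PacketEvent(String clientIp, int clientPort, long tcpSeq,
                              int tcpFlags, long timestampNs) {
        // Identifier of the same TCP segment on different nodes.
        String key() {
            return clientIp + ":" + clientPort + ":" + tcpSeq + ":" + tcpFlags;
        }
    }

    private final Map<String, PacketEvent> open = new HashMap<>();

    // Called for every event of the trace where the activity starts.
    public void startEvent(PacketEvent e) {
        open.put(e.key(), e);
    }

    // Called for every event of the trace where the activity ends;
    // returns the delay in nanoseconds or -1 if no matching start was found.
    public long endEvent(PacketEvent e) {
        PacketEvent start = open.remove(e.key());
        return start == null ? -1 : e.timestampNs() - start.timestampNs();
    }
}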

During the implementation of the detailed simulation model, it became obvious that the model can produce event traces that contain exactly the same information as those obtained during the measurements. In the simulation model, the dynamics of the system are often represented as state charts. When a change of the state of a model can be seen in the output of the system, the captured output of a real system can be used in combination with the model of this system to parametrize the model automatically. So far, we have not implemented this approach. For model-based performance testing, we have identified some aspects that can be handled using this method [7].

5.5 Example Measurement Results

For these example measurements, five real server nodes and one load balancer were used in a NAT environment with round-robin scheduling. One test client generated HTTP requests using the httperf load generator. We generated 10,000 HTTP/1.0 requests for a binary file with a size of 1,024 bytes. This resulted in a request size of 65 bytes. The web server added 244 bytes of header information, so the resulting replies had a size of 1,268 bytes. Since this is smaller than the maximum segment size used (1,500 bytes), all replies consisted of exactly one TCP segment. Figure 5.9 provides an illustration of the 27 individual delays in the exchange of TCP segments that contribute to the total processing time of the HTTP request. Time advances along the vertical axis from top to bottom and the vertical bars represent the different delays. The horizontal position shows where the delays are caused: either by the load generator (LG), the network channel between the load generator and the load balancer (C1), the load balancer (LB), the network between the load balancer and the real servers (C2) or by one of the real servers (RS). The delays in the channels C1 and C2 include not only the physical propagation delay and the processing delay in the switches between the hosts, but also the time between the reception of the packet at the node of the cluster and the beginning of the packet processing in the TCP/IP stack of the operating system. The segments that appear during delays 11, 12 and 16 are sent due to TCP protocol mechanisms (Early ACK) and do not mark a state change in the HTTP protocol state machine. The measurements were conducted in a low load situation, which means that queuing was not an issue here. One reason for doing so was that this leads to lower delays for

each packet and separates the different delays so that the activities do not overlap and influence each other. Another reason was that the load generator is not powerful enough to bring the cluster system into an overload situation. Even though both the measurement infrastructure and the load generation software itself allow the use of more than one load generator, this would lead to more complicated traces that are both hard to analyze and hard to visualize. All data shown in this section was obtained using our instrumentation of the IP stack. The ring buffers were configured to be large enough to hold all captured data, so the user mode reading application could be started as soon as the measurement was over. Therefore, it did not influence or disturb the measurements. Time synchronization was achieved using the PPS signals of a GPS receiver connected to all cluster nodes and the load generator as shown in chapter 5. A trace plot of all delays can be seen in figure 5.10. All delays are plotted over the time of their measurement in one single graph. Since this graph provides only a comparison of the different orders of magnitude of the different delays and of the complex nature of the processes, figure 5.11 shows trace plots of the individual delays. The vertical axes have been limited to the 99.5% quantile of the respective delay. The maximum values are not included, because outlier values can become excessively high; the main part of the observed delays would otherwise be reduced to a single line due to the scaling of the axes. Figure 5.12 shows the 27 different delays for the first 50 request-reply pairs from the trace. The delays are displayed as stacked horizontal bars from the left side to the right. The colors of the bars correspond to the colors of the delays in figure 5.9. Therefore, the bars on the left show delay 1 while the rightmost bars represent delay 27. Overall summary statistics are plotted in figure 5.13. The horizontal stacked bars show the minimum, the 0.5% quantile, the first quartile, the mean value, the median, the third quartile, the 99.5% quantile and the maximum value for all 27 delays observed. The fact that 0.5% of the values are much larger than the rest can easily be seen. This indicates that a few measurements are disturbed by undesirable side effects. It also justifies discarding these values in further analyses of the system. Since the delays differ by three orders of magnitude, some of the delays cannot be seen in this plot. Table 5.1 summarizes the same statistics in numerical form. All values are given in microseconds.


Table 5.1: Quantile Summary for Delays in Microseconds

          Min.     0.5%      25%   Median      75%    99.5%     Max.     Mean
Delay 1   54.232   56.208   82.925  112.395  141.611  170.936  297.031  112.407
Delay 2    0.690    0.707    0.741    0.754    0.772    1.443   10.826    0.798
Delay 3   51.444   52.522   54.239   54.940   55.696   72.246  390.614   55.249
Delay 4    4.953    5.437    6.235    6.621    7.060   27.508  452.295    7.066
Delay 5   59.225   61.445   95.070  124.531  149.065  174.716  241.577  121.236
Delay 6    1.057    1.070    1.103    1.115    1.128    1.594    3.866    1.128
Delay 7   43.285   45.597   83.976  106.386  134.612  158.010  164.803  107.930
Delay 8   18.820   19.447   21.064   22.011   24.499   46.814   67.133   24.486
Delay 9   62.210   64.975  102.091  140.350  156.637  172.649  180.240  129.141
Delay 10   0.541    0.544    0.555    0.575    0.626    0.721    3.119    0.595
Delay 11  61.695   63.276   67.195   73.835   75.145   84.464   99.207   72.093
Delay 12 254.489  274.503  302.826  334.700  344.924  665.231  741.078  333.525
Delay 13 264.395  266.088  290.171  319.325  351.589  381.352  799.464  320.875
Delay 14   1.075    1.089    1.117    1.127    1.141    1.402   10.921    1.145
Delay 15  61.550   63.492   87.235  109.059  137.273  174.859  181.055  111.885
Delay 16  30.548   31.625   33.947   35.398   39.263   71.260  115.535   39.115
Delay 17  59.064   61.693  101.631  133.435  147.522  166.668  503.900  124.597
Delay 18   0.581    0.604    0.636    0.672    0.713    0.893    3.170    0.681
Delay 19  50.005   51.021   53.415   59.244   61.039   70.231   84.364   57.612
Delay 20  11.379   11.867   12.844   16.005   16.853   26.047   36.828   15.216
Delay 21  58.028   60.569   93.349  122.280  142.922  173.169  206.766  118.104
Delay 22   1.074    1.083    1.121    1.139    1.158    1.490    3.520    1.146
Delay 23  42.674   44.629   83.178  105.752  134.197  156.626  163.368  107.220
Delay 24   4.211    4.437    4.811    4.993    5.306    8.631   19.907    5.298
Delay 25  51.173   53.125   66.844   96.091  139.416  166.659  204.041  102.336
Delay 26   0.605    0.619    0.649    0.660    0.670    0.791    3.125    0.661
Delay 27  50.649   51.701   53.487   54.186   54.965   58.845   98.571   54.300


Figure 5.9: Illustration of Delays in the Object System


Figure 5.10: Trace Plot of Measured Delays


Figure 5.11: Trace Plots of Individual Delays



Figure 5.12: Delay Components for Requests




Figure 5.13: Summary Statistics of the Delays

6 Advanced Input Modeling

To employ the measured delays in a performance study of the system, the statistical parameters of the data have to be determined. In the input modeling phase, the representation of the real-world data in the model must be chosen. Two different approaches are applicable: trace-driven performance evaluation and the use of distribution functions. In trace-driven modeling, a recorded trace is fed into the model, from which all events are generated in exactly the same order and with the same temporal distance as observed at the real system. While this is extremely useful for checking whether the simulation results are valid and comparable to real-world data, the amount of recorded data is limited in most situations. Therefore, once the trace has been consumed by the model, the only solution is to repeat the process and continue with the start of the trace again. This produces correlated data that are not independent, so care has to be taken when analyzing the results of the performance study. Furthermore, this approach can only be chosen if a real setup of the modeled system is available. This is not always the case, because performance evaluations of several different architectures are often conducted before building a setup, with the aim of deciding which design alternatives to implement.

The use of distribution functions allows generating an infinite number of independent random variates. This approach in its basic form is only valid if the measured data from which the distribution function is to be determined are independent. Nonetheless, different solutions for dealing with correlated input data will be presented in the next sections. For uncorrelated input data, the first step is to determine the distribution (or probability mass) function of the measured data. Once this function has been found, the empirical function can be used directly in the model or a fitting theoretical function can be determined. Figure 6.1 shows histograms of the 27 measured delays from the previous chapter. Each of the delays can be used in a performance model. The theoretical represen-

tations we utilized were implemented in separate simulation models to validate whether they are able to represent the measured values. This was done by generating several thousand sample points and analyzing them with quantile comparisons and a variety of plots like histograms, trace plots, scatter diagrams and correlation plots.

In addition to the implementation of a detailed simulation model, Isabel Wagner implemented and improved the input modeling in two theses [87, 88]. Preliminary results of the modeling have been published in [89].

6.1 Traces and Empirical Distributions

As a rst step, the behavior of a performance model, especially when building a discrete event simulation, can be evaluated using a trace driven approach. Once the structure of the model is implemented, the necessary random variates that inžuence this behavior are directly taken from a recorded trace le. When we have built a model of our IP stack, we were able use the example measurement results from section ¢.¢ to drive the model directly. For this purpose, trace les of the óß delays have been prepared. For each delay occurring in the model, the next delay from the trace le has been used as an input to the model. Once the end of a trace had been reached, the process has been continued from the beginning of the le again. is allowed to check if the model structure represented the measured system, for example by comparing the time from the sending of the rst SYN segment from the client until the reception of the last ACK segment of each TCP connection by a server in the model to the measured values. Some structural model errors can be seen using this validation process. is simple approach is only feasible when the model represents the setup of the real system. It is obvious that it cannot be used for congurations for which no laboratory setup exists and therefore no measurements are available. It also does not allow to easily change the load that is imposed on the system.

Assuming independence of the measured data, empirical distribution functions can be built as the second step in the input modeling process to generate random values from. Since all measured data are available in our case, we can sort the n observations Xi, i ∈ {1, ..., n} in increasing order with X(1) ≤ X(2) ≤ ⋯ ≤ X(n).



Figure 6.1: Histograms of Observed Delays

A continuous piecewise-linear distribution function F(x) can then be defined as

\[
F(x) =
\begin{cases}
0 & \text{if } x < X_{(1)} \\[4pt]
\dfrac{i-1}{n-1} + \dfrac{x - X_{(i)}}{(n-1)\bigl(X_{(i+1)} - X_{(i)}\bigr)} & \text{if } X_{(i)} \le x < X_{(i+1)} \text{ for } i = 1, 2, \ldots, n-1 \\[4pt]
1 & \text{if } X_{(n)} \le x.
\end{cases}
\]
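Random variates can be drawn from this F(x) with the inversion method; the following sketch illustrates the idea (the class name and the use of java.util.Random are illustrative choices):

import java.util.Arrays;
import java.util.Random;

// Sketch: variate generation from the continuous piecewise-linear empirical
// distribution defined above, using inversion of F.
public class EmpiricalDistribution {
    private final double[] sorted;          // X(1) <= X(2) <= ... <= X(n)
    private final Random rng = new Random();

    public EmpiricalDistribution(double[] observations) {
        sorted = observations.clone();
        Arrays.sort(sorted);
    }

    public double sample() {
        double u = rng.nextDouble();             // uniform on [0, 1)
        double pos = u * (sorted.length - 1);    // position on the scale 0 .. n-1
        int i = (int) Math.floor(pos);           // index of the lower order statistic
        double frac = pos - i;                   // linear interpolation weight
        return sorted[i] + frac * (sorted[i + 1] - sorted[i]);
    }
}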

When only grouped data summaries like histograms are available, a different approach has to be taken to construct an approximate empirical distribution function. More details on both methods can be found in [45]. While empirical distribution functions represent the statistical properties of the measured data to some extent, their application has some limitations. When sorting the data, it is obvious that any correlation structure that might be present is lost. Furthermore, the mean of the sampled values X̄ can differ from the mean of the distribution function F(x) due to the piecewise-linear interpolation. Another limitation of empirical distribution functions is visible when looking at the definition of the function: when generating random values, the lowest value that can be generated is the lowest value in the measurement and the largest generated value is the largest measured value. As this is not always desired, there exist approaches to combine an empirical distribution function with a theoretical distribution function for which this is not the case, for example an exponential distribution. For these reasons, empirical distribution functions are useful in early stages of the model design for validation and first experiments, but when the behavior of different configurations of the system under various parameter settings is to be predicted, using theoretical distribution functions overcomes these limitations.

6.2 Outlier Values

The measured data often contain values that are not caused by system behavior but are either too small or too large because of errors caused by the measurement itself or by other undesired system activity. To isolate these effects, these values, called outlier values, have to be eliminated from the traces before distribution fitting. In [94], Winkler describes a method to analyze the statistical properties of the data and to classify the values either as valid or as outliers. For values distributed according to a normal distribution, he suggests removing any value outside of the

interval [µ − 4σ, µ + 4σ], where µ is the sample mean and σ the standard deviation, resulting in a significance level of 0.01. For other distributions, the same approach can be applied, but here the sample median is a good estimator for µ, whereas σ can be estimated as the median of the absolute deviations of the sampled values, as these estimators are insensitive to the magnitude of extreme values and outliers. Winkler states that in typical cases less than 1% of the measured data are removed using his method. We implemented the algorithm in the statistical computing environment R [71] to automate the process of outlier removal.
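The variant for non-normal data can be sketched as follows, with µ estimated by the sample median and σ by the median absolute deviation; this is an illustrative re-implementation in Java, not the R code actually used, and for normally distributed data the sample mean and standard deviation would be used instead:

import java.util.Arrays;

// Sketch of the outlier rule: drop all values outside [mu - 4*sigma, mu + 4*sigma].
public class OutlierFilter {

    static double median(double[] values) {
        double[] v = values.clone();
        Arrays.sort(v);
        int n = v.length;
        return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
    }

    public static double[] removeOutliers(double[] trace) {
        double mu = median(trace);                     // robust location estimate
        double[] dev = new double[trace.length];       // absolute deviations from mu
        for (int i = 0; i < trace.length; i++) {
            dev[i] = Math.abs(trace[i] - mu);
        }
        double sigma = median(dev);                    // robust scale estimate (MAD)
        double lo = mu - 4 * sigma;
        double hi = mu + 4 * sigma;
        return Arrays.stream(trace).filter(x -> x >= lo && x <= hi).toArray();
    }
}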

6.3 Autocorrelation

Some of the mathematical methods used when fitting a distribution to the recorded data are only valid when the observations are independent. This is especially true for the maximum-likelihood estimation and chi-square tests that are used for the parameter estimation once a family of distributions has been selected. There are different techniques to assess the independence of the values. As an example, correlation plots for the 27 delays are shown in figure 6.2. The sample correlation ρ̂j of the observations X1, X2, ..., Xn is defined as

\[
\hat{\rho}_j = \frac{\hat{C}_j}{S^2(n)}, \qquad \text{with} \qquad
\hat{C}_j = \frac{\sum_{i=1}^{n-j} \bigl[X_i - \bar{X}(n)\bigr]\bigl[X_{i+j} - \bar{X}(n)\bigr]}{n-j},
\]
\[
S^2(n) = \frac{\sum_{i=1}^{n} \bigl[X_i - \bar{X}(n)\bigr]^2}{n-1} \quad \text{(sample variance)},
\]
\[
\bar{X}(n) = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{(sample mean)}.
\]

The correlation is plotted for varying values of the lag j ∈ {1, 2, ..., l}. This ρ̂j is an estimate of the true autocorrelation ρj of two observations that are j samples apart in time. If all samples were independent, ρj = 0 for all j ∈ {1, 2, ..., n − 1}, but since the samples are observations of a random variable, the estimator ρ̂j will not be exactly 0 for all j; a significant difference from 0, however, indicates dependence of the observations.
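A direct implementation of this estimator is straightforward; the following sketch (illustrative class and method names) computes ρ̂j for a single lag:

// Sketch of the lag-j sample autocorrelation estimator defined above.
public class Autocorrelation {

    public static double estimate(double[] x, int j) {
        int n = x.length;

        double mean = 0;                               // sample mean X-bar(n)
        for (double v : x) mean += v;
        mean /= n;

        double variance = 0;                           // sample variance S^2(n)
        for (double v : x) variance += (v - mean) * (v - mean);
        variance /= (n - 1);

        double cj = 0;                                 // estimator C-hat_j
        for (int i = 0; i < n - j; i++) {
            cj += (x[i] - mean) * (x[i + j] - mean);
        }
        cj /= (n - j);

        return cj / variance;                          // rho-hat_j
    }
}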

Another graphical method to assess the independence of the samples is the scatter diagram. It displays points with coordinates (Xi, Xi+1) for pairs of successive samples. When the observations are independent, the points are scattered randomly in the plane.



Figure 6.2: Correlation Plots (lag ≤ 500)

If they are dependent, they tend to be located along a line in the plane. Examples of these plots will appear in the distribution fitting sections.

Autocorrelation is often caused by queuing or buffering in system components. When network packets are processed by a single server with a FIFO queue and the server is busy when a packet arrives, the packet has to wait until the previous packets have been processed. The longer the processing of the previous packet takes, the longer the successive packets have to wait. This can be seen as a positive autocorrelation. Delays caused by network transmissions, like numbers 9, 11, 13 and 15, show this typical behavior due to buffering in both the switches and the network interfaces of the system. Another interesting effect can be seen from a correlation plot, especially when looking at smaller lags as depicted in figure 6.3. The autocorrelation is very high for lags that are integer multiples of five for the delays 8, 16 and 24. This indicates that even though the hardware and software of all real server nodes are identical, packets sent to or received from different real servers experience different delays, especially in the load generator node. In figure 6.4, a trace plot sorted by the real server node that is involved in the communication shows that these delays are indeed different for real server node 4. We made some further experiments, but were not able to find any systematic reason for this behavior.

A critical point is how to deal with the correlation. As mentioned before, some mathematical estimators are not valid for correlated data, but most graphical methods for determining the goodness of a fitted distribution are still applicable. Depending on the model structure, correlation can have an effect on the results of a performance evaluation. Therefore, either the model itself can induce a correlation (as is the case when modeling buffers and queues explicitly) or the random variates must be generated so that they exhibit the same correlation as the measured data.

6.4 Standard Theoretical Distributions

Once the outliers are removed, standard theoretical distributions are a good way to represent the data in the model. They offer the advantage that their parametrization can be changed. So they are not only useful to capture the current behavior, but they can also be modified to model the system under different workloads. This can often be done by changing one parameter of the distribution function like the mean value, which is often the most important parameter to characterize a certain distribution.



Figure 6.3: Correlation Plots (lag ≤ 40)


Figure 6.4: Trace Plots Sorted by Real Server


For example, the mean value of an exponential distribution can be changed using a different λ. Other distributions allow even more modifications: the normal distribution allows changing the mean µ and the variance σ². For Weibull and gamma distributions, the shape of the density function can additionally be changed by modifying a shape parameter α. This allows adapting these distributions to account for different situations to be predicted. However, it is not always possible to fit a standard distribution. Most standard distributions are monomodal; therefore, a good fit cannot be expected for multimodal data like delays 2, 3, 8 and 16.

e distribution tting tool ExpertFit automates the process of distribution tting, parameter estimation and goodness-of-t tests [¦¦]. In a rst attempt, we used ExpertFit to t standard theoretical distributions to all óß delays. As explained before, the goodness-of-t tests resulted in a bad t for most of the delays, which is not surprising, considering the shapes of the delays as depicted in gure ä.Õ.

Nonetheless, we were able to achieve an acceptable fit for the seven delays shown in table 6.1. One problem when applying mathematical goodness-of-fit tests is that their results often indicate a bad fit when a high number of input samples is used. In contrast, the accuracy of the fitted distribution becomes higher the more input data are available. For that reason, we used the maximum of 8,000 samples that can be handled by ExpertFit, even though the goodness-of-fit tests show worse results in this case than the graphical methods for distribution comparison indicate.

Table 6.1: Fitted Standard Theoretical Distributions

Delay   Fitted Distribution
1       Uniform
4       Lognormal
6       Log-Logistic
14      Log-Logistic
22      Lognormal
24      Pearson Type V
27      Lognormal

Graphical comparisons of the measured data versus the fitted theoretical distribution are shown exemplarily in figure 6.5. The first row shows a trace plot, the histogram, a correlation plot and a scatter diagram of the samples (delay 22), whereas the second row depicts the same plots for the fitted distribution (lognormal).



Figure 6.5: Distribution Comparison for Delay 22

The first two columns of the plot indicate that the range and the general shape of the density function are a good approximation of the measurement. Furthermore, the correlation is negligible in this case, as indicated by columns three and four of this graph.

6.5 Multimodal Distributions

Some of the measured data sets exhibit a multimodal distribution that is visible as two or more peaks in the corresponding histogram. A multimodal distribution is often caused by a mixture of data from different monomodal distributions. When fitting a distribution function, all monomodal distribution functions have to be handled separately. Since ExpertFit is unable to deal with multimodal data, we determined initial split points between the distributions visually by looking at the histograms. We were then able to fit a distribution to the individual data separated by these thresholds using ExpertFit. Once the monomodal distributions are known and their parameters are estimated, the weighting in the multimodal distribution as a weighted mixture of the individual distributions has to be determined. This can be done by determining the probability mass in each part as the relative frequency of the observations between the split points. Special attention has to be paid to the overlapping areas, as the different monomodal distributions contribute different amounts of probability mass there. Therefore, the resulting multimodal distribution must be compared to the trace file and the split point potentially has to be modified in an iterative process.


When the resulting multimodal distribution is used in a performance model, one of the monomodal distributions is selected at random according to the weights of the distributions for each sample before the sample is generated from the selected distribution. This procedure generates independent random variates with the right density and weights the contributing components correctly. For this reason, the process can only be applied to independent input data that exhibit an autocorrelation near zero for all lags. A method for correlated samples will be presented in the next section.
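A minimal sketch of this selection scheme, with the fitted monomodal generators passed in as arbitrary DoubleSupplier objects and the weights assumed to sum to one, could look like this:

import java.util.Random;
import java.util.function.DoubleSupplier;

// Sketch: a multimodal distribution as a weighted mixture of monomodal
// generators; for every sample one component is selected according to its
// weight and the value is then drawn from that component.
public class MixtureDistribution {
    private final double[] weights;          // component weights, summing to 1
    private final DoubleSupplier[] modes;    // one variate generator per mode
    private final Random rng = new Random();

    public MixtureDistribution(double[] weights, DoubleSupplier[] modes) {
        this.weights = weights;
        this.modes = modes;
    }

    public double sample() {
        double u = rng.nextDouble();
        double cumulative = 0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (u < cumulative) {
                return modes[i].getAsDouble();
            }
        }
        return modes[modes.length - 1].getAsDouble();  // guard against rounding
    }
}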

All distributions of our example for which we succeeded in fitting an uncorrelated multimodal distribution are shown in table 6.2. All of these distributions are composed of two standard theoretical distributions and thus are bimodal.

Table 6.2: Fitted Multimodal Distributions

Delay   Lower Distribution   Upper Distribution   Split Point
2       Log-Logistic         Log-Laplace          0.819 µs
3       Log-Logistic         Log-Logistic         64.500 µs
8       Inverted Weibull     Log-Logistic         28.800 µs
16      Inverted Weibull     Johnson SB           45.500 µs

Figure 6.6 shows a comparison of the measured data of delay 3 with a distribution fitted using this method. According to the histogram, the weights and shapes of the monomodal distributions (both log-logistic) were chosen correctly. Additionally, the scatter diagrams indicate that the number of samples in each mode and the independent transition between the modes were modeled correctly.

6.6 Multimodal Distributions with Phases

In the gure ä.Õ the shape of the histogram for delay ÕÉ looks clearly bimodal, but gure ä.ó indicates high autocorrelation. e reason for this becomes clear from the trace plot of this delay in gure ¢.Õþ: e values occur from one of the modes almost exclusively for a certain time. AŸer this phase, almost only values from the second mode occur in the trace. Other delays also show this behavior. e length of the phases dišers, but there seems to be a minimum length for each.



Figure 6.6: Distribution Comparison for Delay 3

e rst steps in the distribution tting are the same as in the independent mul- timodal case. e trace is split in individual modes and the relative frequency of each mode is determined. But in this case, the minimum length of each of the phases is also a needed parameter.

Once these values are known, bimodal distributions with phases can be modeled as a finite state machine with two states. Samples are generated from the upper mode as long as the state machine is in the upper state. When a number of samples has been generated that corresponds to the minimum length of this phase, a state change to the lower state can happen. The probability of this state change is chosen according to the relative frequency of the measured values in each of the modes. When a state change happens, the state machine is in the lower state and samples are generated from the lower mode. Again, a state change can happen according to the relative frequency of the modes once the number of generated samples reaches an integer multiple of the minimum length of this phase.
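A simplified sketch of this mechanism is given below; the class name, the way the minimum phase lengths and switching probabilities are passed in, and the representation of the modes as DoubleSupplier objects are illustrative assumptions:

import java.util.Random;
import java.util.function.DoubleSupplier;

// Sketch of the two-state phase mechanism: samples come from the mode of the
// current phase; after every full minimum phase length a switch to the other
// phase happens with a probability derived from the relative frequencies.
public class PhasedBimodalSource {
    private final DoubleSupplier lowerMode;
    private final DoubleSupplier upperMode;
    private final int minLower;              // minimum lower phase length (samples)
    private final int minUpper;              // minimum upper phase length (samples)
    private final double leaveLower;         // probability of switching to the upper phase
    private final double leaveUpper;         // probability of switching to the lower phase
    private final Random rng = new Random();

    private boolean inUpper = true;
    private int samplesInPhase = 0;

    public PhasedBimodalSource(DoubleSupplier lowerMode, DoubleSupplier upperMode,
                               int minLower, int minUpper,
                               double leaveLower, double leaveUpper) {
        this.lowerMode = lowerMode;
        this.upperMode = upperMode;
        this.minLower = minLower;
        this.minUpper = minUpper;
        this.leaveLower = leaveLower;
        this.leaveUpper = leaveUpper;
    }

    public double sample() {
        int minLength = inUpper ? minUpper : minLower;
        // A state change is only possible at integer multiples of the minimum length.
        if (samplesInPhase > 0 && samplesInPhase % minLength == 0) {
            double p = inUpper ? leaveUpper : leaveLower;
            if (rng.nextDouble() < p) {
                inUpper = !inUpper;
                samplesInPhase = 0;
            }
        }
        samplesInPhase++;
        return inUpper ? upperMode.getAsDouble() : lowerMode.getAsDouble();
    }
}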

Figure 6.7 shows an illustration of a state chart for this generation scheme as it is used in the simulation tool AnyLogic. The minimum phase lengths are identified by utime and ltime here. When this time has passed (i.e. the minimum required number of samples from the corresponding phase has been generated), a state change to the state denoted by state happens. In this state, another state change happens immediately. The direction of this state change, either back to the original state or to the other mode of the distribution, is chosen randomly according to the relative frequencies of the measured data, denoted by ratio. By modeling the distribution in this way it is ensured that the probability mass is distributed correctly among the



modes of the multimodal distribution even when the process of the state changes behaves differently than in the real system.

Figure 6.7: State Chart for Phase Transitions

Table 6.3: Fitted Multimodal Distributions with Phases

Delay   Lower Distribution   Upper Distribution   Split Point
10      Bézier               Bézier               0.580 µs
11      Lognormal            Log-Logistic         71.000 µs
12      Johnson SB           Log-Logistic         315.000 µs
18      Bézier               Bézier               0.650 µs
19      Pearson Type VI      Log-Logistic         57.000 µs
20      Pearson Type V       Inverted Weibull     15.050 µs
26      Bézier               Bézier               0.542 µs

The approach presented here is applicable to the delays in table 6.3. We were not able to fit standard distributions to all monomodal distributions of the phases. For some of them, we used Bézier curves as described in the next section. An exemplary comparison with the measured values is depicted in figure 6.8. The length and distribution of the phases show a good compliance of the synthetically generated data with the measurements. The autocorrelation is also captured in the model, as indicated by the correlation plot. The scatter diagram shows one aspect of the samples that is not included in the models: the sporadic generation of values from the other mode of the distribution. This is of minor impact for most delays. For delay 26, where this happens more often, this generation from the other mode has been included in the model. The relative frequencies that are



used for transitions in the state machine had to be modified to maintain the right distribution of the probability mass.

Figure 6.8: Distribution Comparison for Delay 19

6.7 Bézier Distributions

When no theoretical distribution function fits the samples, or the part of the samples that belongs to one mode, Bézier distributions are an alternative approach. Classical Bézier curves are often used in computer graphics as an approximation of smooth univariate functions on a bounded interval. They have been adapted for the approximation of distribution functions by Wagner and Wilson [90, 91].

To apply this approach to a set of sampled data X1, X2, ..., Xn that represent a continuous random variable X with a finite range [a, b], a set of control points {p0, p1, ..., pm} has to be placed, where pi = (yi, zi) for i ∈ {1, 2, ..., m − 1}, p0 = (a, 0) and pm = (b, 1). A Bézier distribution function P(t) of degree m is given parametrically by

\[
P(t) = \sum_{i=0}^{m} B_{m,i}(t)\, p_i \quad \text{for } t \in [0, 1],
\]
where the blending function B_{m,i}(t) is the Bernstein polynomial

\[
B_{m,i}(t) =
\begin{cases}
\dfrac{m!}{i!\,(m-i)!}\, t^{i} (1-t)^{m-i} & \text{for } t \in [0, 1] \text{ and } i \in \{0, 1, \ldots, m\} \\[4pt]
0 & \text{otherwise.}
\end{cases}
\]


The resulting Bézier curve passes through the first and the last control point. Setting the control points p0 and pm as noted above ensures that the resulting function will have the value 0 at its lower endpoint a and 1 at its upper endpoint b. Wagner and Wilson show in [90] how to create a Bézier function that also fulfills the monotonically nondecreasing property of a distribution function.

Figure 6.9: Screenshot of PRIME

Their graphical tool PRIME can be used to fit these Bézier distributions to sets of sampled data. Several automated fitting methods based on optimization of the control point coordinates are available, as well as the possibility to adjust the control points manually. The fitting process results in the coordinates of the control points. PRIME is limited to a maximum degree of the Bézier curve of 30. This is not an issue in our example measurement, but can become problematic, for example, when trying to fit an accurate distribution function to a data set with three or more peaks in the histogram. Figure 6.9 shows the empirical distribution function for the measured values of delay 18 and the fitted Bézier curve as it is used in [88].


The possibility of using Bézier distribution functions is also proposed in [45]. Law and Kelton mention that these distributions are a good alternative to empirical distribution functions, but have the drawback that they are not included in most performance evaluation tools. This was also true for AnyLogic, the simulation environment we used to model the cluster system. Therefore, Isabel Wagner [88] implemented the random variate generation approach presented in [90]. Here, a sample from a Bézier distribution function that is given as a set of control points is generated using the method of inversion. The first step is to generate a random number U from the uniform distribution on [0, 1]. The goal is then to find a value tU so that

\[
U = \sum_{i=0}^{m} B_{m,i}(t_U)\, z_i,
\]

i.e. to invert the function above numerically. In our implementation, we used a combination of two numerical root-finding algorithms. First, two approximate solutions are obtained using bisection with two runs of low order (4 and 5) [11]. The final solution is calculated using the secant method [11] with the results of the bisection as initial approximations.

Using this tU, a random variate y(tU) from the Bézier distribution can be generated as

\[
y(t_U) = \sum_{i=0}^{m} B_{m,i}(t_U)\, y_i.
\]

As mentioned in the previous section (table 6.3), modes of the delays 10, 18 and 26 can be represented as Bézier distributions. The multimodal delays are generated from these using the phases approach. Figure 6.10 shows a comparison of synthetically generated random variates with measured data. Again, the scatter diagram shows that the sporadic generation from the other phase has been neglected. The other parameters of the distribution fit well; especially the representation of the probability density function shows the applicability of the Bézier approach.
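The inversion approach can be sketched as follows; for brevity the sketch uses plain bisection instead of the bisection/secant combination described above and evaluates the Bernstein sums directly, and all names are illustrative:

import java.util.Random;

// Sketch: inversion sampling from a Bézier distribution given by control
// points (y_i, z_i), where the z-coordinates form a nondecreasing curve.
public class BezierSampler {
    private final double[] y;
    private final double[] z;
    private final Random rng = new Random();

    public BezierSampler(double[] y, double[] z) {
        this.y = y;
        this.z = z;
    }

    // Sum of the Bernstein polynomials B_{m,i}(t) weighted with coefficients c_i.
    private static double bezier(double[] c, double t) {
        int m = c.length - 1;
        double result = 0;
        for (int i = 0; i <= m; i++) {
            result += binomial(m, i) * Math.pow(t, i) * Math.pow(1 - t, m - i) * c[i];
        }
        return result;
    }

    // Binomial coefficient "m over i".
    private static double binomial(int m, int i) {
        double b = 1;
        for (int k = 1; k <= i; k++) {
            b = b * (m - i + k) / k;
        }
        return b;
    }

    public double sample() {
        double u = rng.nextDouble();
        // Invert u = sum B_{m,i}(t) z_i by bisection on t in [0, 1].
        double lo = 0, hi = 1;
        for (int iter = 0; iter < 60; iter++) {
            double mid = 0.5 * (lo + hi);
            if (bezier(z, mid) < u) lo = mid; else hi = mid;
        }
        double t = 0.5 * (lo + hi);
        // Evaluate y(t) = sum B_{m,i}(t) y_i to obtain the random variate.
        return bezier(y, t);
    }
}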

6.8 A New Model for Autocorrelated Data

Some of the measured delays feature both a high autocorrelation over relatively large lags and a clear upper and lower bound. This is the case for nearly all channel delays. When looking at the trace plots of these delays, a structure of rising or falling bands is clearly visible.





Figure 6.10: Distribution Comparison for Delay 18

A reason for this can be found in the buffering in the network interface at the receiving side. To limit the negative effects of frequent interrupt requests, modern network interfaces buffer received frames until either a reasonable amount of data has been collected in the buffer or no additional frame has been received for a certain time before issuing an interrupt request. The higher the bit rate of the medium, the more frames of a fixed size can be received per time unit. Therefore, the buffers are usually larger in Gigabit Ethernet interfaces than they are in Fast Ethernet interfaces. The frames can only be handled by the IP stack of the operating system after the interrupt has been handled by the driver and the content of the frame has been copied over the interconnecting bus. As the timestamping is done in the IP stack, the timestamps for packet reception will be close together for all packets that were transferred to the driver in the same interrupt handler invocation. The timestamps generated when sending these packets are also generated in the IP stack, but in this case, no buffering occurs before the generation. These effects are visible as bands in the trace plots and the resulting autocorrelation can affect performance studies of the system. To model this behavior of the network interfaces explicitly, it would be necessary to separate the delay components that contribute to the measured delays: the delay in the IP stack of the sender, the transmission delay, the propagation delay, the queuing and processing delays in switches or routers, the time spent in the buffer of the receiving network interface, the interrupt latency of the receiving node and the time in the IP stack of the receiver until the timestamping. Some of these times can be calculated analytically from the physical characteristics of the channel. For example, the propagation delay depends on the length of the interconnecting

medium and its velocity factor, whereas the transmission delay is a function of the frame length and the bit rate of the interconnection. Other factors like the queuing delay, however, are statistically distributed and are influenced by other activity in the system. Measuring these delays individually is not an alternative either, as this would require an enormous amount of hardware at different places along the transmission path. Therefore, the most promising solution is to generate random variates for the overall delays that feature the same statistical properties as the measured values.


Figure 6.11: Histogram H_o of the Deltas for Delay 5

Due to the high autocorrelation, we developed the idea not to generate samples d_i of the delays directly, but to generate random variates for the difference between the current and the next delay value d_{i+1} as δ_i = d_{i+1} − d_i. This allows the next delay sample to be calculated as d_{i+1} = d_i + δ_i from the current delay d_i and the sampled δ_i. In a first step we determined the histogram H_o for all deltas. Figure 6.11 shows this overall histogram. But as the delays feature an upper and a lower bound, the values of delta can clearly not be independent of the current delay value. In figure 6.12, these bounds are plotted for the measured values of delay 5 as dashed horizontal lines. Due to the upper and lower bounds d_max and d_min of the delays, there is only a limited range d_min − d_i ≤ δ_i ≤ d_max − d_i from which to sample the deltas for a given delay d_i. This valid range can be calculated for all values d_i.



Figure 6.12: Trace Plot of Delay 5

In gure ä.Õì we plotted the observed delta values over the values of the current delay for the measured trace of delay ¢. We also determined the valid range for δi for all delay values. is resulted in a parallelogram-shaped area in the plot. All points (di , δi) are located inside this parallelogram. Since the deltas depend on the current delay, it might not be correct to determine the overall histogram Ho and neglect the dependency on the current delay. To see if there are more dependencies of δi inside of the valid area, we constructed a number of histograms that include only the values of δi for a certain range of di. Once these histograms have been obtained, all of them can be combined in one three-dimensional histogram that shows the relative frequency of the occurring δi for ranges of di. As one can see from ä.Õ¦, the height of the surface is almost the same along the delay axis inside of the valid area. is shows that the deltas are to a large extend independent of the current delay inside of the valid area.

Due to this fact, the overall histogram H_o of all δ_i, regardless of d_i, makes sense. But since only values inside the valid area are used to construct the histogram H_o, fewer values contribute to the extreme negative and positive bins than



Figure 6.13: Delta over the Values of Delay 5

to the bins around zero. Therefore, an additional histogram H_w is built, in which the bins of the histogram H_o are weighted with a weighting factor. The weighting factor is chosen so that the resulting histogram H_w is the histogram that would be generated if delta values were observed for all delays d_min ≤ d_i ≤ d_max. That means that we extrapolate the distribution of the δ_i outside the valid area from the values within it.

The factor w_k for bin k can be calculated as the ratio of the area A_k that can contain delta values falling into the respective bin of the extrapolated histogram to the part of that area covered by the valid area, C_k: w_k = A_k / C_k. The area A_k of the delta values that potentially contribute to the bin of H_w is delimited on the vertical axis by the borders of bin k and on the horizontal axis by d_min and d_max. The area C_k is the part of A_k that overlaps with the parallelogram of the valid area and is thus the area that contains the delta values that actually contribute to the respective bin of H_o. Figure 6.15 illustrates these areas for an exemplary bin of the values for delay 5. For an equidistant histogram with bin width b, the area A_k is constant: A_k = b · (d_max − d_min) for all k. The resulting weighting factors for a histogram of the deltas for delay 5 with 40 bins are shown in figure 6.16.



Figure 6.14: 3D Histogram of Delta 5

If o_k denotes the number of observations in bin k of H_o, then the number of observations in bin k of H_w can be calculated as o_k · w_k = o_k · A_k / C_k. The resulting histogram H_w is the histogram that would be generated if delta values were observed for all delays d_min ≤ d_i ≤ d_max, i.e. even outside the valid area. The effect of the weighting process is shown in figure 6.17, where the overall histogram H_o is overlaid with the weighted histogram H_w. As w_k > 1 for all k, the number of potential observations in each bin of H_w is obviously larger than the number of observations in the respective bin of H_o. This is the effect of extrapolating from the valid area to the whole area.
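The following small Java sketch illustrates the weighting factor computation under the geometry described above (equidistant bins and the parallelogram-shaped valid area). It exploits the fact that, for a given δ, the valid delay range has width (d_max − d_min) − |δ|, so C_k can be obtained by integrating this width over bin k; the class name and the simple numerical integration are illustrative choices, not the R functions used in the thesis.

    // Illustrative computation of w_k = A_k / C_k for an equidistant delta histogram.
    public class HistogramWeights {
        public static double[] weights(double dMin, double dMax,
                                       double deltaMin, double deltaMax, int bins) {
            double range = dMax - dMin;                 // height of the rectangle A_k
            double b = (deltaMax - deltaMin) / bins;    // bin width
            double[] w = new double[bins];
            for (int k = 0; k < bins; k++) {
                double lo = deltaMin + k * b;
                double ak = b * range;
                // numerically integrate max(0, range - |delta|) over the bin to get C_k
                int steps = 1000;
                double ck = 0.0, h = b / steps;
                for (int s = 0; s < steps; s++) {
                    double delta = lo + (s + 0.5) * h;
                    ck += Math.max(0.0, range - Math.abs(delta)) * h;
                }
                // bins entirely outside the valid area carry no observations anyway
                w[k] = ck > 0.0 ? ak / ck : 0.0;
            }
            return w;
        }
    }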

The histogram H_w is used to construct an empirical distribution function for grouped data [45]. From this distribution function, values for δ_i are sampled. These are used to calculate the next delay from the previous one as d_{i+1} = d_i + δ_i. But since there is a constraint on the allowed values for δ_i depending on d_i, only a certain part of the empirical distribution function is used for each sample.



Figure 6.15: Weighting Areas

The part is limited by the valid area and chosen so that δ_i ∈ [d_min − d_i; d_max − d_i] for the current delay d_i.

In our example, the bounds for delay 5 are d_min = 64,955.5 ns and d_max = 192,672.0 ns. When, e.g., the delay d_j has reached 190,000.0 ns, δ_j is sampled from the range [d_min − d_j = −125,044.5 ns; d_max − d_j = 2,672.0 ns]. Due to the weighting, a considerable amount of probability mass is found at the extreme negative end of this range, and there is a high probability that the delay jumps from a high d_j to a low d_{j+1} = d_j + δ_j by sampling a large negative δ_j. Figure 6.18 compares the measured data with values generated using this method. The graphs show a close match of all relevant characteristics. We implemented this method of distribution fitting in R. The functions produce Java code that can be used in the discrete event simulation tool AnyLogic to generate random variates according to our new procedure.
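The sketch below condenses the complete generation procedure into a small Java class. It is an illustration of the method, not the Java code emitted by the R functions, and all names are hypothetical; it assumes the weighted histogram H_w is given by its bin edges and weighted counts.

    import java.util.Random;

    // Illustrative delta-based delay generator: sample a delta restricted to the
    // valid range [dMin - d, dMax - d] from the weighted histogram and add it to
    // the current delay.
    public class DeltaDelayGenerator {
        private final double[] edges;    // bin edges of H_w, length counts.length + 1
        private final double[] counts;   // weighted counts o_k * w_k
        private final double dMin, dMax;
        private double current;          // current delay d_i
        private final Random rng = new Random();

        public DeltaDelayGenerator(double[] edges, double[] counts,
                                   double dMin, double dMax, double start) {
            this.edges = edges; this.counts = counts;
            this.dMin = dMin; this.dMax = dMax; this.current = start;
        }

        public double next() {
            double lo = dMin - current, hi = dMax - current;   // valid delta range
            // probability mass of each bin restricted to [lo, hi]
            double[] mass = new double[counts.length];
            double total = 0.0;
            for (int k = 0; k < counts.length; k++) {
                double a = Math.max(edges[k], lo), b = Math.min(edges[k + 1], hi);
                if (b > a) {
                    mass[k] = counts[k] * (b - a) / (edges[k + 1] - edges[k]);
                    total += mass[k];
                }
            }
            // inversion on the restricted, renormalized histogram
            double u = rng.nextDouble() * total, acc = 0.0, delta = lo;
            for (int k = 0; k < mass.length; k++) {
                if (mass[k] == 0.0) continue;
                double a = Math.max(edges[k], lo), b = Math.min(edges[k + 1], hi);
                if (u <= acc + mass[k]) {               // uniform within the bin part
                    delta = a + (u - acc) / mass[k] * (b - a);
                    break;
                }
                acc += mass[k];
            }
            current = current + delta;                  // d_{i+1} = d_i + delta_i
            return current;
        }
    }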



Figure 6.16: Weighting Factors


Figure 6.17: Original and Weighted Histogram for Delta 5



Figure 6.18: Distribution Comparison for Delay 5


7 Simulation Model

When building a performance model of a complex system like our web cluster, the modeling formalism and the level of detail have to be chosen according to the problem to be solved. Since one goal of this study was to gain insight into the internal behavior of the system, we chose to implement a rather detailed model. The measurement process was thus also designed to reflect this level of detail. During the input modeling process it became clear that the probability distributions involved do not allow for the application of analytical methods without great simplifications. Trying to implement the distributions as phase-type distributions would certainly lead to the problem of state-space explosion due to the inherent parallelism of the system. So discrete event simulation appeared to be a sensible method to implement the model.

Earlier modeling approaches were conducted by students of our Simulation and Modeling II class. In their project work, they used a much simpler input model than presented in the previous chapter. Their simulation model was implemented in the process-oriented discrete event simulation environment AutoMod. The model structure is relatively simple and most components used resemble either single or infinite server queues. Nevertheless, this type of model already gives essential hints for the dimensioning of systems during the planning phase, and the results were published in the journal Simulation Modelling Practice and Theory [101]. The modeling tool AnyLogic [98] has been used to build the current detailed simulation model of the web cluster. It has been developed in joint work with Isabel Wagner [87, 88] and the resulting model has been published in [89]. AnyLogic is a simulation tool that supports discrete event and continuous simulation. The main formalisms are UML-based. It does not support standard UML with profiles, but provides its own real-time extensions to standard notations like state charts. It allows seamless integration of Java code in the models. Simulated entities are represented by Active Objects. The Active Objects can have internal behavior that can be

specified using state charts and Java code. They can communicate using ports. Ports can contain a FIFO queue and can be interconnected to transfer user-defined messages. These messages can be arbitrary Java objects. The Active Objects can be hierarchically structured; an Active Object can therefore encapsulate other Active Objects. The objects can also have a multiplicity; in the AnyLogic world, their instances are called replicated objects. Newer versions of the tool are integrated into the Eclipse framework and can be executed on a number of different operating systems like Windows, MacOS and Linux. The executable models are compiled Java bytecode that can be exported as a stand-alone Java applet. Newer versions of the tool also include formalisms that support other modeling paradigms besides the state-chart-based formalism, such as process flow simulation, agent-based simulation and system dynamics. Action charts are a new way to visualize the decision logic and to specify the flow of control in the Active Objects. Besides all these formalisms, the user is supported with a number of pre-defined objects provided in libraries to speed up and simplify the process of model creation. For an efficient model, the user must be aware that all models are transformed to Java code and is advised to take care of the specific characteristics of Java, like garbage collection, to implement models that can be executed and evaluated at high speed. For example, it is advisable not to generate too many objects that are disposed of in quick succession, as this requires frequent invocations of the garbage collector and slows down the execution of the model considerably.

7.1 Model Structure

The global structure, in AnyLogic referred to as the Root Object, is composed of five building blocks that represent entities of the setup as it is used in the web cluster laboratory. The structure of the complete model is shown in figure 7.1. The HTTP requests are generated in the Client objects. The requests are encapsulated as TCP_package objects that represent TCP segments. These objects are transmitted over ports to the Active Object Channel1 that models the network channel between the load generators and the load balancer. The Load Balancer is the next object. It distributes the incoming segments among a number of Server instances that are connected via the second network element, Channel2. The modeled real server nodes process the requests and send TCP segments with reply data back to the client through the network channels and the load balancing node.

The model contains a configurable number of server nodes, as indicated by the stacked graphics in figure 7.1. Since the limitations of a client object should not limit the performance of the complete system, our model contains a separate client object for each HTTP transaction. These Active Objects are created dynamically. As the processing of different HTTP transactions overlaps, there is usually more than one client object present.

Please note that this is just a brief sketch of the conceptual model. The implemented model is more complicated, as it includes TCP dynamics as well as hardware and operating system aspects. For example, TCP requires a connection setup using a three-way handshake before any data can be sent. The details of the various parts involved are presented in the following sections.

Figure 7.1: Conceptual Model

Variable parameters of the simulation are the distributions for the individual delays as shown in chapter 6, the arrival rate of client requests, the sizes of the requested objects, the number of real servers and the load balancing strategy to be employed. Simulation output data includes the individual delays in different elements of the cluster, the total delay and summary statistics like utilization, throughput and mean queue length for the network channels, the load balancer and each of the server processors.

7.2 TCP

The model for TCP is a central aspect of our simulation, since all message transmissions are triggered by the TCP protocol. An Active Object TCP is present in the endpoints of a TCP connection, in our case the client and server nodes. It implements the functionality of the TCP/IP stack of the operating system. For this purpose, it has interfaces for communication with a modeled application and to the network. The interface to the network allows TCP segments to be sent for processing by connected entities. The working principles of TCP are specified in

several Requests for Comments (RFCs) issued by the Internet Engineering Task Force (IETF). The model implements the most important aspects of TCP from the following RFCs.

7.2.1 RFC 793

The basic TCP dynamics have been specified in RFC 793 [84]. It defines the interface of the TCP layer to user mode applications. For that purpose, a number of commands are listed that must be implemented by the stack. The commands open, send, receive, close and abort have been modeled as individual ports of the Active Object over which an application can communicate. Since the status command is not useful in our simulation, it has not been implemented.

Figure 7.2: TCP

According to the specification, a node can use a port named open to open a connection actively or passively, depending on a flag. When a connection is opened actively, the node sends a TCP segment with the SYN flag set to the network to initiate a handshake for a new TCP connection, whereas a passive open enables other systems to actively connect to this node as it is waiting for incoming connections. The close port closes the TCP connection by sending a FIN packet but, as the standard requires, still allows outstanding data to be sent and data to be retransmitted if needed. The abort port allows a connection to be terminated by sending a RST

frame. The receive and send ports are used to transfer data to and from the application, while the error port is used to inform the application about TCP errors. The rcv_flag is used to notify the application when data has arrived. Figure 7.2 shows the Active Object TCP. The ports are displayed as squares on the border of the object. Ports with queues contain a dot in the square. The variables are depicted as circles, state charts are represented as symbols that show two states with transitions, and timers are drawn as a clock with a bell. Embedded Active Objects are displayed as boxes. As the illustration shows, the TCP object has two additional ports. The port total_delay is used to collect statistical information about the time spent in the stack. The port named packet is used for the connection to the network. TCP in the transport layer is the lowest layer that is modeled explicitly. Since the underlying layers do not change the behavior of the model we are interested in, apart from adding delays, all lower layers have been merged into what we call the network.

Figure 7.3: Model of a TCP Segment

Segments sent to the network are implemented as TCP_package Active Objects. As shown in figure 7.3, these objects contain the payload and variables that represent the header fields as specified in RFC 793. For technical reasons, an additional timer delay has been included in the packet. This architecture simplifies the delay handling in the simulation. Two additional variables in this object are used for the purpose of collecting timing statistics. The protocol state machine from the RFC has been directly implemented in the model as a state chart named receive_packet. The states of the RFC are modeled as super-states. The actual processing of packets is done in the internal transitions. Whenever a transition into another super-state is required, a state change variable is set to contain the new state. This variable triggers transitions between super-states. This implementation allowed code fragments to be reused and the state chart to be clearly arranged.


Figure 7.4: Central TCP State Chart receive_packet

Please note that in the illustration of the state chart in figure 7.4, transitions from every super-state back to the state CLOSED, which are present in both the RFC and the model, have been left out for clarity.

Due to the working principle of the state chart, the model supports the sending of early ACK segments that contain no data, as they also occur in the measurements.

The variables for the send and receive window sizes are present, are set and can be read correctly. They are used for flow control to avoid an overflow of packets at the receiving side. Since we did not model buffer limitations in our simulation of the web cluster, these variables are currently of no use.

The maximum segment size (MSS) can be specified when opening a new TCP connection. This value is used for message segmentation. When no value is specified explicitly, a default size of 536 octets is used. Message segmentation means that messages from the application layer are encapsulated in one or more TCP segments with a maximum size of MSS octets.
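As a brief illustration (a hypothetical helper, not part of the model code), segmentation simply splits a message of len octets into ceil(len / MSS) segments of at most MSS octets each:

    // split an application message into TCP segment payload sizes (MSS = 536 by default)
    public static int[] segmentSizes(int len, int mss) {
        int n = (len + mss - 1) / mss;          // ceil(len / mss)
        int[] sizes = new int[n];
        for (int i = 0; i < n; i++)
            sizes[i] = Math.min(mss, len - i * mss);
        return sizes;
    }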


RFC 793 also suggests how to calculate the retransmission timeout based on an estimation of the round-trip time (RTT). The round-trip time is the time from sending a segment until the corresponding acknowledgment arrives. The value of RTT is smoothed using historical data with an exponentially weighted moving average algorithm to calculate a value SRTT that is updated with every ACK received as

SRTT_i = (α · SRTT_{i−1}) + ((1 − α) · RTT). The retransmission timeout RTO is then dynamically determined as

RTO_i = min{UBOUND, max{LBOUND, β · SRTT_i}}, where UBOUND is a predefined upper bound and LBOUND a lower bound for the timeout value so that 1 s ≤ RTO ≤ 1 min. While RTO is dynamically adapted to network characteristics, the smoothing factor α is a constant between 0.8 and 0.9, and the delay variance factor β is also a constant between 1.3 and 2.0. RTO is used to trigger a retransmission of unacknowledged data if no ACK is received for the duration of RTO. As the channels in the model exhibit no loss in the normal case, retransmissions should not be required. But the model allows lossy channels to be used and provides all the necessary mechanisms.
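A minimal Java sketch of this estimator is shown below; the concrete constants (α = 0.85, β = 1.5, bounds of 1 s and 60 s) are example choices from the ranges given in the RFC, not values prescribed by the model.

    // Illustrative RFC 793 retransmission timeout estimator.
    public class Rfc793Rto {
        private double srtt = -1.0;                 // smoothed RTT in seconds
        private static final double ALPHA = 0.85, BETA = 1.5;
        private static final double LBOUND = 1.0, UBOUND = 60.0;

        // update with a new round-trip time measurement and return the new RTO
        public double update(double rtt) {
            srtt = (srtt < 0) ? rtt : ALPHA * srtt + (1 - ALPHA) * rtt;
            return Math.min(UBOUND, Math.max(LBOUND, BETA * srtt));
        }
    }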

Another feature of the TCP model is that out-of-order segments are saved in a buffer. Once the missing segments arrive and the sequence number gap is closed, our implementation of TCP sends a cumulative acknowledgment for all data that are still unacknowledged.

7.2.2 RFC 1122

On the basis of RFC 793, RFC 1122 [9] specifies further requirements for Internet hosts. It clarifies which aspects of TCP are required and which are optional. One of the requirements is that the push flag (PSH) is set when sending the last segment of data that has been stored in the send buffer. It also specifies that data must not be stored in the send buffer for an indefinite time. Both aspects have been integrated into the main TCP state chart. The receiving side must be able to receive options in the header of every segment. Unknown options must be silently ignored. The model supports sending and receiving the maximum segment size in the SYN packet. Path MTU discovery according to RFC 1191 [61] has not been implemented, as this

would require a model of the ICMP protocol that is used for this purpose. RFC 1122 also specifies TCP slow start, congestion avoidance and Karn's algorithm for calculating the retransmission timeout. As later RFCs specify more details about these aspects, a more detailed explanation of the implemented features follows later. The urgent pointer is used in TCP-based interactive applications like telnet; it is therefore not needed in our model and has thus been left out. Since the network channels can lose packets but do not cause bit errors so far, the calculation and checking of the checksum have not been implemented yet. The algorithm for the selection of the initial sequence number and an algorithm to avoid the silly window syndrome are of no use in the modeled environment and were neglected for this reason.

7.2.3 RFC 1323

RFC 1323 [35] specifies several TCP extensions. For links with a large bandwidth-delay product, the maximum size of the receive window may not be large enough to fully utilize the link capacity. Therefore, this RFC adds a window scale parameter and a corresponding TCP option. Another addition for these link types is the selective acknowledgment (SACK) option. As both are not needed for our setup, these options have not been implemented.

An option specied in RFC Õìóì that is both used in the laboratory setup and has also been implemented in the model is the timestamp option. It is intended for round-trip time measurement. e sender can include a timestamp value in the new option header eld TSval in all segments that he sends. When the receiver of such a segment sends a corresponding ACK packet back, it echoes the received TSval value back in the TSecr provided it also supports this option. A receiver of an echoed value can then simply subtract the TSecr from the time when the ACK was received to calculate the round-trip time. Since only the local clock is used for the calculations, time synchronization is not needed. For this reason, dišerent granularities of the clock of both partners do not matter, too. e granularity needs only be ne enough to measure the round-trip time. It is advisable that the original receiver that echoes the time in TSecr includes an own sending timestamp in the TSval eld of the ACK packet so that it obtains own RTT measurements. In TCP implementations without this option, a sender of packets must record timestamps for sent segments along with the corresponding sequence number to

estimate the RTT. But in the case of a retransmission, the sender does not know whether a received ACK packet was sent due to the original or the retransmitted segment, since both had the same sequence number. This is problematic because accurate RTT measurements are especially important in congested networks, and in this environment retransmissions can happen frequently. The Active Object TCP contains the additional variables last_ack_sent and TS_recent to decide for which packets which timestamps are echoed back, as gaps in the sequence number space can lead to wrong estimations of the RTT if these special cases are not treated separately.

7.2.4 RFC 2581

RFC 2581 [3] describes four algorithms to deal with and to avoid congestion in the network. As all of them affect the performance and dynamics of TCP directly, all of them have been included in the model. The first one is the slow start algorithm. It is applied to the first segments sent after the handshake for a new TCP connection is completed. It aims at limiting the amount of traffic sent into the network to avoid immediate congestion. For this purpose, the modeled TCP stack contains two variables, cwnd and ssthresh. The value of the congestion window variable cwnd limits the number of data octets the sender can send into the network until it has to wait for the reception of an acknowledgment. Its initial value must not be larger than two times the sender's maximum segment size (SMSS); the model uses 2 · SMSS initially. For every received ACK, it is increased by SMSS octets until the slow start threshold ssthresh has been reached. The value of this variable is initialized to a high value, but it is recalculated as a result of packet loss. When the congestion window size is lower than the threshold, the slow start algorithm is applied. When the value is higher, a second algorithm, congestion avoidance, becomes effective. The congestion avoidance algorithm aims to achieve an increase of the congestion window of one full-sized segment per RTT. The RFC suggests increasing cwnd by SMSS · SMSS / cwnd for each incoming non-duplicate ACK to approximate the desired behavior. When packet loss is detected, the sender sets its cwnd to the size of SMSS. The slow start threshold is set to ssthresh := max{FlightSize / 2, 2 · SMSS}, where FlightSize is the amount of data that has been sent but not yet acknowledged. Due to these settings, the slow start algorithm becomes effective again.


According to this RFC, the receiver should immediately send a duplicate ACK when an out-of-order segment arrives. A duplicate ACK is an acknowledgment for the segment preceding the sequence number gap. As this segment has already been acknowledged when it arrived, the ACK is a duplicate. When the sender has received three duplicate ACKs for the same sequence number, it sends all packets after the acknowledged one again. Due to the use of this fast retransmit algorithm, the sender does not have to wait for a timeout for the respective segment before the retransmission. The fast recovery algorithm deals with the settings for the congestion window size and the slow start threshold in case of detected loss. The RFC specifies a combined algorithm for fast retransmit and fast recovery for the sender (sketched in code after the list):

1. When the third duplicate ACK is received, set ssthresh to

   max{FlightSize / 2, 2 · SMSS}.

2. Transmit the lost segments, then set cwnd to ssthresh + 3 · SMSS. This is done because the three duplicate ACKs show that three additional segments have been received.

3. For each additional duplicate ACK, cwnd is increased by SMSS, as this indicates the successful transmission of additional segments.

4. If cwnd allows, transmit a new segment. If applicable, the receiver's window size used for flow control must also be taken into account.

5. When the next acknowledgment for new data arrives, set cwnd to the value of ssthresh.

The reason for not performing slow start but congestion avoidance in this case is that the reception of duplicate ACKs indicates that the network is not totally congested and still allows segments to be transmitted. All of these assumptions are only correct if the network does not duplicate ACK packets.
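The following simplified Java sketch condenses the slow start, congestion avoidance, fast retransmit and fast recovery rules listed above into one class; it only tracks cwnd and ssthresh, omits the actual (re)transmissions, and is not the state chart logic of the AnyLogic model.

    // Illustrative RFC 2581 congestion control bookkeeping.
    public class CongestionControl {
        private final int smss;            // sender maximum segment size in octets
        private double cwnd;               // congestion window
        private double ssthresh = Double.MAX_VALUE;
        private int dupAcks = 0;

        public CongestionControl(int smss) { this.smss = smss; this.cwnd = 2 * smss; }

        // called for every non-duplicate ACK
        public void onNewAck() {
            if (dupAcks >= 3) {            // leaving fast recovery (step 5)
                cwnd = ssthresh;
            } else if (cwnd < ssthresh) {  // slow start
                cwnd += smss;
            } else {                       // congestion avoidance
                cwnd += (double) smss * smss / cwnd;
            }
            dupAcks = 0;
        }

        // called for every duplicate ACK; flightSize = unacknowledged octets
        public void onDuplicateAck(double flightSize) {
            dupAcks++;
            if (dupAcks == 3) {            // fast retransmit / fast recovery (steps 1-2)
                ssthresh = Math.max(flightSize / 2, 2 * smss);
                cwnd = ssthresh + 3 * smss;
                // retransmission of the lost segment would be triggered here
            } else if (dupAcks > 3) {
                cwnd += smss;              // each further dup ACK inflates cwnd (step 3)
            }
        }

        // called when the retransmission timer expires
        public void onTimeout(double flightSize) {
            ssthresh = Math.max(flightSize / 2, 2 * smss);
            cwnd = smss;                   // fall back to slow start
            dupAcks = 0;
        }
    }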

7.2.5 RFC 2988

RFC 2988 [68] clarifies some details of the calculation of the retransmission timeout that has been specified in RFC 1122. According to the newer definition, the

variability is not only reflected by the constant β but is calculated from measurements of the round-trip time. For that purpose, a new variable RTTVAR is introduced. Before any measurement is done, the value of the retransmission timeout should be set to RTO_0 = 3 s. When the first measurement has been conducted, a value RTT_1 is obtained. The smoothed RTT value is set to the value of the round-trip time measurement: SRTT_1 = RTT_1. The new variability factor is initialized to RTTVAR_1 = RTT_1 / 2. The retransmission timeout is then set to RTO_1 = SRTT_1 + max{G, K · RTTVAR_1}, where K = 4 and G is the granularity of the clock that is used for round-trip measurements. For any subsequent measurement i, the value of RTTVAR is updated as

RTTVAR_i = (1 − β) · RTTVAR_{i−1} + β · |SRTT_{i−1} − RTT_i| and the current value of the smoothed round-trip time is calculated as

SRTT_i = (1 − α) · SRTT_{i−1} + α · RTT_i. The RFC suggests using α = 1/8 and β = 1/4. The retransmission timeout must then be updated as

RTO_i = SRTT_i + max{G, K · RTTVAR_i}.

An implementation can choose to impose a limit on RTO_i as 1 s ≤ RTO_i ≤ 60 s. According to the RFC, round-trip time measurements have to be conducted using either the timestamp option of RFC 1323 or Karn's algorithm [37]. This algorithm states that RTT samples must not be taken from segments that were retransmitted, as this would lead to ambiguities, because it cannot be known whether the acknowledgment was sent due to the original or due to the retransmitted packet. These requirements have been included in the model, as has the recommendation for retransmission timer management. Every time a packet is sent or retransmitted and the timer is not running, the timer should be started so that it expires after a timeout that corresponds to the current value of RTO. When all outstanding data has been acknowledged, the timer should be stopped. When an ACK for new data is received, the timer should be restarted to expire after time RTO. When the timer expires, the earliest segment that has not been acknowledged should be retransmitted. The current value of RTO must then be set to RTO_i = 2 · RTO_{i−1}; the sender may impose a limit of 60 s. Then the retransmission timer is started with the current value RTO_i. This implements an exponential backoff. Once a new ACK is received, the normal rules for the calculation of RTO apply. This may result in a collapse of RTO back to the value before the backoff.
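A compact Java sketch of this timeout calculation is given below; it uses the suggested constants α = 1/8, β = 1/4 and K = 4, while the clock granularity G and the 60 s cap are example values, and it is not the code used in the model.

    // Illustrative RFC 2988 retransmission timeout calculation.
    public class Rfc2988Rto {
        private static final double ALPHA = 1.0 / 8, BETA = 1.0 / 4, K = 4.0;
        private static final double G = 0.001;            // clock granularity [s]
        private double srtt, rttvar, rto = 3.0;            // RTO_0 = 3 s
        private boolean first = true;

        // feed a new RTT measurement (Karn's rule: never from retransmitted segments)
        public double update(double rtt) {
            if (first) {
                srtt = rtt;
                rttvar = rtt / 2;
                first = false;
            } else {
                rttvar = (1 - BETA) * rttvar + BETA * Math.abs(srtt - rtt);
                srtt = (1 - ALPHA) * srtt + ALPHA * rtt;
            }
            rto = Math.min(60.0, Math.max(1.0, srtt + Math.max(G, K * rttvar)));
            return rto;
        }

        // retransmission timer expired: exponential backoff, capped at 60 s
        public double backoff() {
            rto = Math.min(60.0, 2 * rto);
            return rto;
        }
    }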


7.3 Client

In contrast to the laboratory setup, we did not model a single load generator node. Our HTTP transactions are generated by Client Active Objects. One of these clients is created for every new TCP connection. The client then sends HTTP requests, waits for replies, and the object is discarded once the connection is closed. As we modeled HTTP version 1.0, one client sends exactly one request. Of course, the model can easily be changed to support HTTP version 1.1 with persistent connections and pipelining. We chose this model structure because we were not interested in modeling and evaluating limits of the client nodes, but only of the cluster nodes. Since a synthetic load generator just emulates a number of real HTTP clients as they can be found in the Internet, this architecture is more realistic. A single load generator is often the limiting point during measurements in the real system. Besides these considerations, this layout also simplifies the model and the recording of results. The Root object can be configured to generate new TCP connections, and thus Client objects, with an arbitrary rate or with interarrival times that are distributed according to a configurable distribution function.

Figure 7.5: Structure of the Client Object

As the conceptual model of the Client Active Object in figure 7.5 shows, one client consists of three embedded objects: the application, an instance of TCP and a processor.

7.3.1 Application

The Active Object Application models the HTTP client application, which usually is a browser. It is responsible for generating the requests and receiving the replies. It communicates with an instance of the TCP model described in section 7.2 over ports that represent the commands specified in RFC 793. The first thing a

client does is to actively open a new TCP connection to the server. It is interesting to note that we do not use IP addresses in the model, because the transport layer is the lowest layer to be modeled explicitly. For that reason, the underlying network forwards all packets to the load balancing node of the cluster. The only way to distinguish between the different clients is the port number that is used when opening the TCP connection. This is also the case in a real system with a single load generator. Actively opening a connection leads to sending out a SYN segment without any initial delay. When the connection has been opened, the application sends an HTTP request to the TCP object. The request string that is handed over to TCP directly encodes the size of the object to be requested. When the desired reply packets have been received, the connection is closed and the application terminates. The corresponding Client object can then be discarded.

7.3.2 TCP

The Active Object TCP works as shown in section 7.2. It encodes the request string in standard TCP packets, handles all the dynamics on the network and receives the segments with replies, which are then handed over to the application. Due to the nature of our measurements, it proved to be more efficient not to handle the passing of time directly in the TCP object, but to introduce an additional object that delays the TCP segments before they are sent to the network. For this reason, the network port of the TCP object is not directly connected to the outgoing port of the complete Client object. Instead, an intermediate object Processor is connected, which receives the outgoing TCP_package objects, delays them and finally sends them to a connected network port of the client.

7.3.3 Processor

Despite the name Processor, the third object in the client besides the Active Objects Application and TCP is only responsible for managing the simulated time that passes during packet processing and for generating statistics. It implements a delay queue realized as an infinite server queue that only imposes the measured client delays, but does not limit the performance in other respects. This is done, again, with the intention that the client should never be the limiting factor in the overall performance.


7.4 Network Channels

The delay in the network between two nodes is usually viewed as being composed of four individual delay components [41]. The nodal delay d_nodal is the sum of the processing delay d_proc, the queuing delay d_queue, the transmission delay d_trans and the propagation delay d_prop. The processing delay is caused by the inspection and handling of packets in intermediate network nodes like routers or switches. The queuing delay can also occur in nodes on the path when the packets have to wait in queues of the input or output port. The transmission delay is the time that is needed to put the bits that compose the packet onto a link that has a limited bit rate. For transmitting a packet of size L over a link of rate R, the transmission delay can be calculated as d_trans = L / R. The propagation delay is the time the signal needs to travel from the source to the sink. This delay depends on the distance d between the endpoints and is calculated as d_prop = d / s, where s is the propagation speed. This speed is, depending on the velocity factor of the medium, the speed of light or a little less.
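As a small worked example of these formulas (with illustrative numbers, not measured values): a 1500-byte frame on a Fast Ethernet link and 20 m of cable with a velocity factor of 0.66 give

    // d_trans = L / R and d_prop = d / s for example parameters
    public class DelayComponents {
        public static void main(String[] args) {
            double L = 1500 * 8;            // packet size in bits
            double R = 100e6;               // link rate: 100 Mbit/s
            double dTrans = L / R;          // = 120 microseconds
            double d = 20.0;                // cable length in meters
            double s = 0.66 * 3.0e8;        // propagation speed for velocity factor 0.66
            double dProp = d / s;           // roughly 0.1 microseconds
            System.out.printf("d_trans = %.1f us, d_prop = %.3f us%n",
                              dTrans * 1e6, dProp * 1e6);
        }
    }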

As noted in section 6.8, the measured delays in the network are composed of all these delays plus the time spent in the TCP/IP stacks of the communication partners. As the measurements were conducted at low network load, the queuing delays in the intermediate switches can be assumed to be low if not zero. The propagation delays are in our case so small compared to the other components that they do not contribute significantly to the sum and were left out of the model. But it would be easy to include them if the distance between the nodes were larger. The main contribution to the total delay comes from the transmission delay, which can be determined analytically since the bit rate and packet length are known. Since the bits are sent to the medium sequentially, this component must be modeled as a single server queue. This automatically leads to a queuing delay under heavier load and thus higher utilization of the network channels. The propagation delay affects all bits in parallel and thus has to be modeled as an infinite server queue. As in-network processing of the packets can to some extent be assumed to happen in parallel, this component has also been assumed to occur in an infinite server queue. As we were not able to measure the time in the TCP/IP stacks in isolation, we are aware that we introduce a small error by assuming that this part of the delay is also introduced by the infinite server queue.

The physical channels are full duplex. This fact is represented in our model by two completely independent streams with individual delay components, one for each direction.


Figure 7.6: Conceptual Model of the Network Channels

As illustrated in the conceptual model in figure 7.6, the network channels are constructed as follows. The transmission delay d_trans is calculated analytically, and a single server queue with the transmission delay as the service time processes the incoming packet. This also accounts for the queuing effects at higher utilization levels. Then the packets are processed by an infinite server queue which applies the delays d_node sampled from the fitted distributions, diminished by the determined transmission delay d_trans. According to the laboratory setup, the channel between the load generator and the load balancer is modeled as a channel with a capacity of one gigabit per second in each direction. The channel between the load balancer and the real servers includes a switch that has one Gigabit Ethernet port connected to the load balancer and a number of Fast Ethernet ports connected to the real server nodes. So the model includes a channel of one gigabit per second that leads into a switch where the channel is divided into one 100 Mbps channel for each real server. The switch does not impose an additional delay; it only selects to which server the packet is forwarded. It is interesting to note that, since no layers below TCP are modeled, no MAC or IP addresses exist in the model. As shown in section 7.5, the load balancer modifies the TCP destination port to reflect the number of the real server the packet should be forwarded to. This information is used by the switch module to determine the outgoing port.

Due to the fact that all queues are infinite, buffer overflows, the main reason for packet loss in the Internet, never appear in the model. But to evaluate the behavior of TCP under these circumstances, an artificial packet loss can be configured for all

network channels.

7.5 Load Balancer

The main purpose of the load balancer is to distribute new incoming connection requests, i.e. SYN packets, among the real server nodes. Once a connection has been assigned to a real server node, all TCP segments belonging to this connection must be forwarded to the assigned real server. As indicated in section 5.5, the load balancer causes an additional short delay of the packets. It supports different schemes for distributing the load. Random distribution selects one of the real servers randomly with a uniform distribution. When using round robin scheduling with n servers, a variable rr is increased modulo n with each new incoming connection; the new connection is then assigned to server rr + 1. This scheduling strategy has also been used during the example measurement. Since the load balancer has to keep track of the connections in a table in order to forward packets belonging to a connection to the right server, it also has the ability to use least connection scheduling. Here, each new connection is assigned to the server with the currently lowest number of open connections. Since all servers in the real system were equal, all servers are selected with the same weight in all scheduling algorithms. The system used the NAT mechanism during our measurements. Due to the fact that the IP layer is not modeled, there are no IP addresses in the system. The load balancer reuses the destination port entry in the TCP header to encode the destination real server, whereas the client is identified by the source port entry. Packets coming from the real server side are forwarded into the network after a delay. There is no need to modify the packet headers in the model as is done in the real system.
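The following simplified Java sketch illustrates the three scheduling strategies and the connection table; it is an illustration of the decision logic only, with invented names, and not the load balancer object of the model.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    // Illustrative load balancing logic: random, round robin and least connections,
    // with a connection table keyed by the client source port.
    public class LoadBalancer {
        private final int n;                       // number of real servers
        private int rr = 0;                        // round robin counter
        private final int[] openConnections;       // per-server connection count
        private final Map<Integer, Integer> table = new HashMap<>();
        private final Random rng = new Random();

        public LoadBalancer(int n) { this.n = n; this.openConnections = new int[n]; }

        // called for an incoming SYN packet; returns the selected server index
        public int assign(int clientPort, String strategy) {
            int server;
            switch (strategy) {
                case "random":
                    server = rng.nextInt(n);
                    break;
                case "roundrobin":
                    server = rr;                   // server rr (rr + 1 in 1-based numbering)
                    rr = (rr + 1) % n;
                    break;
                default:                           // "leastconn"
                    server = 0;
                    for (int s = 1; s < n; s++)
                        if (openConnections[s] < openConnections[server]) server = s;
            }
            openConnections[server]++;
            table.put(clientPort, server);         // remember mapping for later segments
            return server;
        }

        // called for all non-SYN segments of an existing connection
        public int forward(int clientPort) { return table.get(clientPort); }

        // called when a connection is closed
        public void release(int clientPort) {
            Integer server = table.remove(clientPort);
            if (server != null) openConnections[server]--;
        }
    }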

7.6 Servers

The most complex building blocks of the model are the Server Active Objects. The processing in the server objects has the most influence on the overall performance of the system. The complete model contains a configurable number of modeled servers. In the current model configuration, all servers share the same

performance characteristics, but the model provides the necessary facilities for simulating heterogeneous cluster architectures.


Figure 7.7: Server Model and Embedded Objects

The leftmost object in figure 7.7 shows the internal structure of one of these server objects. It is composed of three embedded objects. For each request that arrives at this server, a new process object is created by the receive state chart. Because the processing of a packet consumes CPU time in the real system, a process needs to use the modeled resource processor to become active. From time to time, the processor is occupied by system processes; they are represented in the system_proc object.

7.6.1 Processes

As noted before, the receive state chart of the Server object creates new process objects dynamically. For this purpose, all packets arriving at the port packet are inspected. If the SYN flag of an incoming TCP_package is set, a new process object is created. As each TCP connection is uniquely identified by the TCP source port number, incoming segments without the SYN bit are forwarded to the respective process object, as they must belong to a TCP connection that has already been initialized. Once the connection to a client is closed, the processes are deleted.

As shown in the second object from the right in figure 7.7, one instance of TCP and a server_app object are embedded in every process object of the servers. The functionality of the TCP object has been described in detail in section 7.2. It

forwards data encapsulated in the received segments to a port of the server_app. This object models the Apache web server software present in the real system. Therefore, this Active Object is responsible for handling incoming HTTP requests, parsing them and sending responses back to the TCP stack. The size of the reply object is directly encoded in the request. The TCP object is then responsible for generating TCP segments of a preconfigured maximum size MSS. As we used HTTP version 1.0, the connection is closed after one HTTP object has been transmitted and the application can subsequently terminate. Instead of simulating the elapsed time directly in the application, more realistic results have been obtained by handling all timing-related aspects in a central processor object in each server.

7.6.2 System Processes

As system processes occur in the real system from time to time, the measured CPU utilization was higher than in the model before we introduced system processes. These system processes can be classified into kernel mode and user mode system processes. Kernel mode system processes model kernel activity that is not necessarily related to a process; they are intended to represent input and output activity as well as interrupt handler invocations. The CPU time spent in this mode is relatively short, but it interrupts activities in user mode. Since some of the processing time of a TCP segment is included in the channel delays, we generate kernel mode system processes with the same arrival rate at which packets arrive from the network. The processing time was selected to be a fraction of the channel delay. Since no direct measurement of this fraction was available, it was selected so that the resulting CPU utilization approximates the measured values. User mode system processes represent other tasks running on the system. On a real server, it is almost impossible to have only the processes of the HTTP server running; most real-world configurations need other concurrent daemons to provide a usable system. These activities do not interrupt the execution of the HTTP server software processes, but they also occupy the CPU for a certain amount of time from time to time. They are scheduled like the web server processes and can thus lead to an increase of the waiting time (queuing delay) of these processes. Since we did not instrument the complete operating system, as this would lead to

a significant drop in performance and would thus influence the measured performance significantly, we also had no direct measurements for these activities. But since the measured values of the delays in the real server nodes showed a number of outlier values that have been removed in the input modeling phase, we were able to estimate the rate and execution time of the user mode system processes by taking these outliers into account, as the reason for these large delay values is most likely other processes being executed on the respective node. When we estimated the parameters of these processes, we also had to make sure that the mean values for the CPU utilization in the model matched the real values. The Active Object system_proc is responsible for generating both kernel and user mode system activity. Its structure is shown as the second object from the left in figure 7.7. Due to the way the scheduling of the operating system is modeled in the processor object, the state charts generate TCP_package objects with a specified rate to represent the system processes. These objects do not contain any data except for a delay value that is used to occupy the processor for a certain amount of time.

7.6.3 Processor

As noted before, the processor object is responsible for the advance of time in each server node. The rightmost object in figure 7.7 depicts this Active Object. The main state chart is called processing; it is shown at the bottom of this figure. All service requests are caused by TCP_package objects arriving at the port in. Instead of sending TCP segments directly into the network, user mode processes send the generated packets to this port of the processor object. This design simplifies the model, but due to this implementation, system processes must also be represented as TCP segments, which do not occur in the real system. These additional objects are used only internally and are not visible externally, so the external behavior remains consistent with the laboratory setup. To assign CPU time to different processes, the scheduling of the operating system has to be taken into account. We implemented a basic model for preemptive scheduling with time slices. But when we examined the occurring delays, we found that all measured individual delays were well below the scheduling granularity of 10 ms that is used in the Linux 2.4 operating system with the standard value of HZ = 100 (cf. section 4.1). Therefore, the preemptive scheduling has been left out of

the model to speed up the simulation. Instead, we implemented the processor as a single server queue with FIFO queuing discipline for user mode processes. The execution of user mode processes can be interrupted by kernel mode activity. In addition to the kernel mode system processes introduced in section 7.6.2, the delays 4 and 20 are also classified as system processes, since they are caused by the TCP/IP stack of the operating system before and after the actions of the user mode processes due to SYN and FIN flags in the TCP segments. When the processing state chart of the processor object is in the idle state and a user mode process arrives, a state change to busy_user occurs. The delay for the transition back to idle is sampled from a distribution function selected according to the type of packet received. But while the processor is in the state busy_user, an incoming kernel mode process causes a transition to busy_system and thus interrupts the processing in user mode. For this purpose, the TCP segment and the remaining delay have to be saved, because the processing is work conserving. The time spent in this state is encoded in the corresponding TCP_package. The execution of the kernel mode activities is not interrupted by other arriving kernel mode service requests. We did not model kernel preemption since it was also not implemented in the kernel version we used. The transition out of the busy_system state is selected depending on the previous state: when the processor was executing a user mode process, the interrupted work is continued in the busy_user state; otherwise the processor becomes idle again. User mode system processes are treated like normal user mode processes and are also handled in the busy_user state.

7.7 Utility Classes and Execution Control

Since the simulation model contains a number of random processes, it is essential to generate random numbers with high statistical quality. The standard random generator in AnyLogic is a multiplicative congruential generator (MCG). These generators are the simplest form of linear congruential generators (LCGs) and are known to produce periodic pseudo-random numbers that show a lattice structure in a scatter diagram [49]. As this might induce an unwanted correlation in the generated random numbers, we used an implementation of a Mersenne Twister generator that is provided as RanMT in the RngPack [32] collection of random number generators. It has better statistical properties and a very high cycle length


of (2^19937 − 1). The way random number generators are utilized in AnyLogic allows us to use this generator also for generating random variates with the predefined methods.

The model has been implemented so that it generates a trace file similar to the one obtained by the measurements. That means that timestamps need to be recorded for all relevant events. One problem when dealing with a high volume of data like this is that writing to a number of text files, or reading from huge files as necessary in trace-driven simulation, consumes a considerable amount of time. To mitigate this effect, we used an ODBC data source in combination with a Microsoft Access database both for input and output data. The tool R [71] provides the library RODBC to read these data and analyze them statistically. Additionally, the model writes a text file with typical summary performance data, as is usual in simulation studies.

Condence intervals are a crucial factor when judging the quality of simulation study results. ough AnyLogic provides a feature that allows to specify the length of one simulation run and the desired number of replications, the version we have used had no possibility to formulate a functional stopping criterion for one run or to specify the number of replications to be executed. But the AnyLogic Java API allows to dene a method executionControl in which replications can be started. Using this method we were able to implement a code block that uses the mean values of the total delay for TCP connections from one replication and decides on the basis of the relative error if more replications are needed to achieve a desired condence level and relative condence interval half width. If more replications are needed it starts them or eventually stops the execution of the simulation.

7.8 Experiments

The detailed simulation model allowed us to conduct various experiments. To validate the model, we parametrized it so that the configuration reflects the setup of the web cluster system in the laboratory during the measurement process. We then increased the request rate from 50 to 1,000 requests per second and compared the simulation results with measurements. Due to limitations in the load generation process, we were not able to bring our system into overload in the

measurements. The load generation node was not powerful enough to cause a high load on the real server nodes. During all measurements, the CPU utilization in the load balancer was extremely low and did not increase significantly when the load was increased. This becomes clear when looking at the short delays we determined for this node. We estimated that at least 50 real server nodes are needed to handle so many requests that the load balancer starts to show significant resource utilization. When the load is increased further, which is only possible with even more real server nodes, the load balancer might finally become the bottleneck of the system. But this is only true for short requests and replies; otherwise, the network channels will be the limiting factor. The model allows the system to be simulated in a high load situation, but due to the limitations described we were unable to obtain measurements we could compare to the results.

To evaluate the effect of message segmentation, we also simulated requests for objects that are larger than the maximum segment size. To see if the correlation structure and densities had a significant effect on the results, we replaced all distribution functions in the model with exponential distributions with the same mean value and compared the results with those obtained during the measurements and with the ones obtained when using our advanced input modeling process.

The model allows a number of parameters that influence the results of the simulation to be adjusted. The parameters can be divided into two groups. The first one determines the general structure and parametrization of the model and consists of the number of real servers, the request rate in requests per second, the size of the requested reply objects, the load balancing method and the source of input data (traces, empirical or detailed input model). The second group affects the execution of the simulation model: the number of initial replications, the number of requests in each replication and a relative error have to be set. Table 7.1 summarizes these core simulation parameters together with typical values used during the exemplary simulation runs. All simulation runs were done with multiple sequential replications to obtain a confidence level of 95% and a relative error of 5%. The time needed for the experiments was, depending on the setup, around 20 minutes on a standard desktop PC. The simulation speed is acceptable considering the level of detail in the model.

As an example of the modeling results, figure 7.8 shows trace plots, histograms, correlation plots and scatter diagrams for the total delay in different simulation scenarios together with measurement results. The total delay is the time from the sending of the first SYN segment by the client node to the reception of the last ACK segment that concludes the closing of the TCP connection by the real server node (cf. figure 5.9).


Table 7.1: Core Simulation Parameters

  Number of Real Servers       5
  Load Balancing Strategy      Round Robin
  Requests per Second          50; 100; 200; 500; 1,000
  Input Modeling               Detailed Input Model; Exponential Distributions
  Object Size                  1,024 Bytes; 10,000 Bytes
  Initial Replications         5
  Number of Requests per Run   10,000
  Relative Error               5%

e rst and third row of the gure allow to compare the simulation results with the measurements. ey show a good match for the range of values obtained and the mean value. e measurements still exhibit larger autocorrelation than the model as shown in the correlation plots while the scatter diagrams indicate no signicant dišerence. So we concluded that both the general model structure and the input modeling described in chapterä are adequate representations of the real system.

To check whether the detailed input modeling is needed to obtain sensible results, we replaced our detailed input model for all distributions with exponential distributions with the same mean values. The choice of exponential distributions reflects the fact that these distributions are the standard distributions in Markov models, which often allow closed-form solutions due to their memoryless property. When comparing the results of the exponential modeling shown in the second row of figure 7.8 with the measurements and the results obtained with the detailed input model, the values of the exponential model show a much larger dispersion, whereas the measurements and the detailed input model exhibit a narrow spread of the values. In fact, the range of the exponential values is more than four times the range of the measurements. This effect is expected to become larger the more queuing occurs in the system under higher load. The other graphs in the second row also show that no autocorrelation is present. Since quantiles of the response time are the most important quantities when assessing the performance of web server systems,

ÕìÉ ß Simulation Model

Detailed Input Model, 1024 bytes Density Delay i+1 Delay [µs] Correlation 0 1000 2000 0.0 0.4 0.8 0.001 0.004 0.007 0.0010 0.003 0.005 2 4 6 8 0.001 0.002 0.003 0.004 0.005 0 100 200 300 400 500 0.001 0.003 0.005 0.007 Time [s] Delay [µs] Lag Delay i Exponential, 1024 bytes Density Delay i+1 Delay [µs] Correlation 0 200 400 600 0.0 0.4 0.8 0.001 0.004 0.007 0.0010 0.003 0.005 2 4 6 8 0.001 0.002 0.003 0.004 0.005 0 100 200 300 400 500 0.001 0.003 0.005 0.007 Time [s] Delay [µs] Lag Delay i Measurements Density Delay i+1 Delay [µs] Correlation 0.0 0.4 0.8 0 1000 2000 0.001 0.004 0.007 0.0010 0.003 0.005 2 4 6 8 10 0.001 0.002 0.003 0.004 0.005 0 100 200 300 400 500 0.001 0.003 0.005 0.007 Time [s] Delay [µs] Lag Delay i Detailed Input Model, 10k bytes Density Delay i+1 Delay [µs] Correlation 0 500 1500 0.0 0.4 0.8 0.001 0.004 0.007 0.0010 0.003 0.005 2 4 6 8 0.001 0.002 0.003 0.004 0.005 0 100 200 300 400 500 0.001 0.003 0.005 0.007 Time [s] Delay [µs] Lag Delay i Exponential, 10k bytes Density Delay i+1 Delay [µs] Correlation 0 200 400 600 0.0 0.4 0.8 0.001 0.004 0.007 0.0010 0.003 0.005 2 4 6 8 0.001 0.002 0.003 0.004 0.005 0 100 200 300 400 500 0.001 0.003 0.005 0.007 Time [s] Delay [µs] Lag Delay i

Figure ß.˜: Graphical Comparison of the Results the representation with exponential distributions is not appropriate for a detailed simulation model. It also justies the extensive measurement process which is needed to obtain the data for an exact input model.
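The substitution used for this comparison can be pictured as a delay source that either resamples the measured values or draws from an exponential distribution with the same mean, generated by inverse-transform sampling. This is a minimal, hypothetical sketch of the idea, not the random-variate code of the actual model.

import java.util.Random;

/**
 * Minimal sketch of the distribution substitution used in the comparison:
 * a delay source that either resamples an empirical trace or draws from an
 * exponential distribution with the same mean. Hypothetical interface.
 */
public class DelaySource {

    private final double[] trace;      // measured delays (empirical input)
    private final double mean;         // mean of the measured delays
    private final boolean exponential;
    private final Random rng = new Random();

    public DelaySource(double[] trace, boolean useExponential) {
        this.trace = trace.clone();
        double sum = 0.0;
        for (double d : trace) sum += d;
        this.mean = sum / trace.length;
        this.exponential = useExponential;
    }

    /** Draws one delay value. */
    public double sample() {
        if (exponential) {
            // Exponential with the same mean: inverse-transform sampling.
            return -mean * Math.log(1.0 - rng.nextDouble());
        }
        // Empirical alternative: resample the measured trace
        // (ignores autocorrelation, like a plain empirical input model).
        return trace[rng.nextInt(trace.length)];
    }
}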

Table 7.2 compares the quantiles of the total delay for request rates from 50 to 500 requests per second, both for the measurements and for the results of the model, in milliseconds. The numbers support the statements above. The mean values obtained with exponential distributions seem to be a good approximation, but overall the values are spread too far, which is illustrated by the significant deviations in the 15% and 85% quantiles. In contrast, the detailed input model we developed approximates the quantiles of the measurements very well.

Table 7.2: Quantile Comparisons in Milliseconds

                                        Min.    15% Quant.  Median  Mean    85% Quant.  Max.
50 Requests per Second
Detailed Input Model, 1,024 Bytes       1.81    2.08        2.21    2.21    2.33        13.92
Exponential, 1,024 Bytes                0.89    1.59        2.09    2.17    2.79        13.52
Measurements, 1,024 Bytes               1.75    1.90        2.04    2.25    2.18        406.75
Detailed Input Model, 10 kBytes         2.85    3.14        3.29    3.29    3.44        15.16
Exponential, 10 kBytes                  1.88    2.99        3.62    3.71    4.46        15.79
100 Requests per Second
Detailed Input Model, 1,024 Bytes       1.75    2.08        2.20    2.21    2.33        16.4
Exponential, 1,024 Bytes                0.73    1.59        2.11    2.18    2.77        16.71
Measurements, 1,024 Bytes               1.66    1.90        2.01    2.03    2.17        2.68
Detailed Input Model, 10 kBytes         2.79    3.16        3.31    3.31    3.46        17.69
Exponential, 10 kBytes                  1.77    2.97        3.61    3.69    4.43        17.78
Measurements, 10 kBytes                 1.90    2.50        3.50    -       -           715.40
200 Requests per Second
Detailed Input Model, 1,024 Bytes       1.82    2.08        2.20    2.21    2.32        22.30
Exponential, 1,024 Bytes                0.82    1.59        2.09    2.18    2.75        22.16
Measurements, 1,024 Bytes               1.66    1.90        2.06    2.59    2.18        357.22
Detailed Input Model, 10 kBytes         2.81    3.16        3.31    3.32    3.46        23.57
Exponential, 10 kBytes                  1.94    2.95        3.62    3.69    4.39        23.70
500 Requests per Second
Detailed Input Model, 1,024 Bytes       1.74    2.08        2.20    2.38    2.33        63.36
Exponential, 1,024 Bytes                0.76    1.60        2.11    2.40    2.80        64.46
Measurements, 1,024 Bytes               1.63    1.90        2.06    24.48   2.20        902.27
Detailed Input Model, 10 kBytes         2.78    3.17        3.31    3.55    3.47        69.17
Exponential, 10 kBytes                  1.78    2.98        3.61    3.91    4.43        69.61
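Quantile tables of this kind can be reproduced from a delay trace with a standard empirical quantile estimator. The following generic helper uses linear interpolation between order statistics, one common convention (and, for example, the default in R [71]); it is illustrative only and not the analysis code actually used.

import java.util.Arrays;

/**
 * Empirical quantile with linear interpolation between order statistics.
 * Generic helper for reproducing quantile tables from a delay trace;
 * not the analysis code actually used in the dissertation.
 */
public final class Quantiles {

    private Quantiles() {}

    public static double quantile(double[] sample, double p) {
        if (sample.length == 0 || p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("empty sample or p outside [0,1]");
        }
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        double h = (sorted.length - 1) * p;      // fractional index into the order statistics
        int lower = (int) Math.floor(h);
        int upper = (int) Math.ceil(h);
        return sorted[lower] + (h - lower) * (sorted[upper] - sorted[lower]);
    }

    public static void main(String[] args) {
        // Arbitrary example values in milliseconds, only to show the call.
        double[] delays = {1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 13.9};
        System.out.printf("15%% quantile: %.2f ms%n", quantile(delays, 0.15));
        System.out.printf("median:       %.2f ms%n", quantile(delays, 0.50));
        System.out.printf("85%% quantile: %.2f ms%n", quantile(delays, 0.85));
    }
}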

All values show that we were not able to generate a load high enough to cause significant queuing in the system, due to the limitations of the available hardware. Purchasing additional nodes, both to use multiple load-generating nodes and to increase the number of real servers in the system, would be the only solution to this problem. The excessively high maximum delays at 500 requests per second are also caused by measurement errors induced by the utilization of the load generator. As all measurements were done with only one load generator, the processor load on this machine gets close to 100% when 1,000 requests per second are generated, and even before this point is reached, this node sometimes limits the performance of the complete system due to the unavailability of resources. This effect is not represented in the model, and thus lower maximum delays appear there.

Further simulation results concern the performance of the system when the requested object size is increased from the standard 1,024 bytes to 10 kilobytes. The experimental results are shown in rows four and five of figure 7.8. The corresponding quantiles are also given in textual form in table 7.2.

Table 7.3: CPU Load Comparison

Request Rate    System Load    Model Load
25              1.92%          1.08%
50              0.11%          2.22%
100             0.10%          5.02%
150             1.35%          9.53%
200             18.04%         18.99%

Another task was to compare the processor load in the model with data measured in the real system. As seen in table 7.3, the measured processor loads do not increase monotonically with increasing request rates. Even though the model processor loads do increase monotonically and do not match the loads measured in the real system exactly, we decided not to change the model: more reliable measurement data would have been needed for this, and we were not able to produce consistent data due to the changing influence of other processes on the real system. The higher the request rate becomes, the more the resource utilization dominates, and the better the representation in the model matches in these situations. The table also hints that more load might be caused by sporadically occurring system processes in user mode than by kernel-mode system activity, but this effect is also difficult to capture in measurements without degrading the performance through excessive instrumentation.
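Summary processor loads of the kind shown in table 7.3 can be recorded periodically from the kernel's counters, for example by differencing the aggregate cpu line of /proc/stat between two readings on Linux. The following sketch shows the basic computation; it is a generic illustration and not the instrumentation actually used in the laboratory.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/**
 * Computes CPU utilization from two readings of the aggregate "cpu" line
 * in /proc/stat: utilization = 1 - (idle delta / total delta).
 * Generic Linux example; not the instrumentation used in the laboratory.
 */
public class CpuLoadSampler {

    private static long[] readCpuCounters() throws IOException {
        String cpuLine = Files.readAllLines(Paths.get("/proc/stat")).get(0);
        String[] fields = cpuLine.trim().split("\\s+");
        // fields[0] is the literal "cpu"; the rest are jiffy counters
        // (user, nice, system, idle, iowait, irq, softirq, ...).
        long[] counters = new long[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            counters[i - 1] = Long.parseLong(fields[i]);
        }
        return counters;
    }

    public static double utilizationOverInterval(long millis)
            throws IOException, InterruptedException {
        long[] before = readCpuCounters();
        Thread.sleep(millis);
        long[] after = readCpuCounters();
        long totalDelta = 0;
        for (int i = 0; i < before.length; i++) {
            totalDelta += after[i] - before[i];
        }
        long idleDelta = after[3] - before[3];   // fourth counter is "idle"
        return totalDelta == 0 ? 0.0 : 1.0 - (double) idleDelta / totalDelta;
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("CPU utilization over 1 s: %.1f%%%n",
                100.0 * utilizationOverInterval(1000));
    }
}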

8 Conclusions and Future Work

The laboratory setup of the web cluster allows us to conduct fine-grained measurements of the internal behavior of distributed systems, as exemplified on a cluster-based web server. An extensive measurement infrastructure has been implemented. It records event traces for TCP segments that are sent and received by the nodes of the system and consists of an extension of the netfilter framework with a kernel ring buffer that holds timestamps and packet headers and can be read as a device from user-mode programs. For capturing the timing aspects of Enterprise Java Bean implementations on application servers, we designed and implemented an instrumentation based on aspect-oriented programming; this approach in particular proved useful in various other fields of application.

All timestamps can be related to a global time base obtained from a GPS receiver. As an alternative to synchronizing with NTP during the measurements, we established an offline synchronization method based on recording timestamps of the PPS signals of the GPS receiver. We used the standardized PPS API in a non-standard way by connecting the PPS output of a single GPS receiver to all nodes of the cluster; furthermore, the API was extended to generate timestamps for PPS pulses using the cycle counter of modern processor architectures. The resulting time trace is used to synchronize the timestamps of the event trace to the GPS reference time, which is globally valid in the system. Since the varying interrupt latencies involved in the reception of the PPS pulses degrade the synchronization accuracy, we propose the use of an external clock, which we implemented, to measure this latency and compensate for its negative effects. Our new offline synchronization solution also allows the cycle counter to be used for timestamping events and thus minimizes the measurement overhead. Additionally, optimized parameters in the offline synchronization improve the accuracy of the results. The measurement infrastructure and the offline synchronization thus allow precise values for one-way delays of packets to be obtained.
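The core idea of the offline synchronization can be illustrated by a simple post-processing step that maps raw cycle-counter timestamps to GPS reference time by interpolating between the cycle-counter readings recorded at consecutive PPS pulses. The sketch below is a strong simplification under that assumption; the actual procedure additionally compensates interrupt latencies and uses optimized parameters, and all names are hypothetical.

import java.util.TreeMap;

/**
 * Simplified offline synchronization: each PPS pulse associates a local
 * cycle-counter reading with a known GPS second. Event timestamps taken
 * from the same cycle counter are converted to GPS time by piecewise
 * linear interpolation between the two surrounding PPS pulses.
 * Illustrative sketch only; names and structure are hypothetical.
 */
public class OfflineSynchronizer {

    // Maps cycle-counter value at a PPS pulse -> GPS time of that pulse [s]
    private final TreeMap<Long, Double> ppsTrace = new TreeMap<>();

    public void addPpsSample(long cycleCounter, double gpsSeconds) {
        ppsTrace.put(cycleCounter, gpsSeconds);
    }

    /** Converts a raw cycle-counter event timestamp to GPS reference time. */
    public double toGpsTime(long eventCycles) {
        Long lower = ppsTrace.floorKey(eventCycles);
        Long upper = ppsTrace.ceilingKey(eventCycles);
        if (lower == null || upper == null) {
            throw new IllegalArgumentException("event outside synchronized interval");
        }
        if (lower.equals(upper)) {
            return ppsTrace.get(lower);
        }
        // Local interpolation implicitly models the clock rate between the
        // two pulses, which are nominally one second apart.
        double t0 = ppsTrace.get(lower);
        double t1 = ppsTrace.get(upper);
        double fraction = (double) (eventCycles - lower) / (double) (upper - lower);
        return t0 + fraction * (t1 - t0);
    }
}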

In future research projects, an alternative synchronization based on the Kalman filter can be evaluated. We expect that this approach is able to produce even more precise results, although Mills states in [59] that a Kalman filter is not an optimum choice for time synchronization. Summary performance data like CPU utilization or memory usage are recorded periodically for the calibration and validation of performance models. The measurement infrastructure and the offline synchronization process have already proved to be applicable in other configurations, such as embedded systems on communicating mobile robots and WLAN transmissions.

The measured delays are represented in an input model using theoretical distributions and advanced techniques like multimodal distributions with phases and Bézier distributions. For data sets with specific statistical properties, such as high autocorrelation over large lags caused by buffering, we introduced a new method in which the differences of successive values are sampled from a part of an empirical distribution function; this part of the distribution is selected depending on the current value of the random variate. Using a combination of these approaches, we were able to generate synthetic samples that approximate the distribution function and the correlation structure of all data sets obtained in an example measurement of the laboratory setup.
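The difference-sampling idea summarized above can be sketched as follows: the next synthetic value is the current value plus a difference drawn from a region of the empirical distribution of observed differences, where the region depends on the current value. This is a speculative simplification based only on the summary in this chapter; the method itself is defined in chapter 6, and all names are hypothetical.

import java.util.Arrays;
import java.util.Random;

/**
 * Illustrative sketch of generating an autocorrelated synthetic trace:
 * successive differences are drawn from an empirical distribution of
 * observed differences, restricted to a region that depends on the
 * current value (large current values favor negative differences and
 * vice versa). Hypothetical simplification of the method summarized above.
 */
public class DifferenceSampler {

    private final double[] sortedDiffs;   // empirical differences, ascending
    private final double min, max;        // observed value range
    private final Random rng = new Random();

    public DifferenceSampler(double[] observedValues) {
        min = Arrays.stream(observedValues).min().getAsDouble();
        max = Arrays.stream(observedValues).max().getAsDouble();
        sortedDiffs = new double[observedValues.length - 1];
        for (int i = 1; i < observedValues.length; i++) {
            sortedDiffs[i - 1] = observedValues[i] - observedValues[i - 1];
        }
        Arrays.sort(sortedDiffs);
    }

    /** Draws the next value given the current one. */
    public double next(double current) {
        // Position of the current value within the observed range, in [0,1].
        double position = (current - min) / (max - min);
        // Select a window of the sorted differences: high current values
        // sample from the lower (more negative) part of the distribution.
        int window = Math.max(1, sortedDiffs.length / 4);
        int start = (int) Math.round((1.0 - position) * (sortedDiffs.length - window));
        double diff = sortedDiffs[start + rng.nextInt(window)];
        // Keep the synthetic trace inside the observed range.
        return Math.min(max, Math.max(min, current + diff));
    }
}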

A detailed simulation model of the cluster-based web server system has been implemented in AnyLogic. It uses a formalism based on UML and Java. The simulation includes the important aspects of TCP that influence the dynamics: message segmentation, slow start, congestion avoidance, fast retransmit and fast recovery. Additionally, the contention of processes for processor resources and the preemption and interruption by system activity are contained in the model. Simulation experiments with a configuration like the one used during the measurements show that the delays at different system loads are closely reproduced by the model. This enables us to run simulation experiments with various system configurations and system loads, and to analyze the estimated system behavior in detail.

Due to limitations of the load generator, we were not able to cause high load with significant queuing in the real system and therefore could not provide comparisons with the model under these circumstances. To overcome these limitations, more hardware nodes would be needed. With a higher number of nodes it would also be possible to determine experimentally the point at which the load balancer becomes the bottleneck of the architecture. The influence of system activity, both in user and kernel mode, could be represented better in future models, but this would require additional instrumentation of the object system, for example with the Linux Trace Toolkit [99]; we suspect, however, that such massive instrumentation would itself influence the resource utilization.

Future simulation studies can also include the generation of dynamic content. Two systems have already been implemented in the laboratory and instrumented: an XML processing system, in which HTML pages are generated dynamically from XML content, and a complete bookshop implemented according to the TPC-W specifications [83] using Enterprise Java Beans.
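As a rough illustration of the TCP dynamics listed above, the following generic sketch shows the textbook congestion-window update for slow start, congestion avoidance, timeout and fast retransmit in the style of RFC 2581. It is not the TCP class of the AnyLogic model; all names are hypothetical.

/**
 * Textbook congestion-window update on reception of ACKs: exponential
 * growth in slow start, roughly linear growth in congestion avoidance.
 * Generic illustration; not the AnyLogic model's TCP implementation.
 */
public class CongestionWindow {

    private final double mss;        // maximum segment size [bytes]
    private double cwnd;             // congestion window [bytes]
    private double ssthresh;         // slow start threshold [bytes]

    public CongestionWindow(double mss) {
        this.mss = mss;
        this.cwnd = 2 * mss;             // initial window
        this.ssthresh = 64 * 1024;       // initial threshold
    }

    /** Called for every ACK that acknowledges new data. */
    public void onNewAck() {
        if (cwnd < ssthresh) {
            cwnd += mss;                 // slow start: +1 MSS per ACK
        } else {
            cwnd += mss * mss / cwnd;    // congestion avoidance: ~+1 MSS per RTT
        }
    }

    /** Called when loss is detected via retransmission timeout. */
    public void onTimeout() {
        ssthresh = Math.max(2 * mss, cwnd / 2);
        cwnd = mss;                      // restart in slow start
    }

    /** Called when loss is detected via three duplicate ACKs (fast retransmit). */
    public void onTripleDuplicateAck() {
        ssthresh = Math.max(2 * mss, cwnd / 2);
        cwnd = ssthresh;                 // simplified fast recovery
    }

    public double congestionWindow() {
        return cwnd;
    }
}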


Bibliography

[1] D. Allan, H. Hellwig, P. Kartaschoff, J. Vanier, J. Vig, G. M. R. Winkler, and N. Yannoni. Standard Terminology for Fundamental Frequency and Time Metrology. In Proceedings of the 42nd Annual Symposium on Frequency Control, 1988.

[2] D. W. Allan. Statistics of Atomic Frequency Standards. Proc. IEEE, 54:221–230, February 1966.

[3] M. Allman, V. Paxson, and W. R. Stevens. TCP Congestion Control. Request for Comments RFC 2581, Internet Engineering Task Force, April 1999.

[4] J. Anastasov. Linux ARP extensions, February 2008. http://www.ssi.bg/~ja/.

[5] ARINC Engineering Services, LLC. Navstar GPS Space Segment/Navigation User Interfaces IS-GPS-200. GPS Joint Program Office, El Segundo, CA, 7 March 2006. Revision D.

[6] P. Barford and M. Crovella. Generating representative Web workloads for network and server performance evaluation. ACM SIGMETRICS Performance Evaluation Review, 26(1):151–160, 1998.

[7] M. Beyer, W. Dulz, and K.-S. Hielscher. Performance Issues in Statistical Testing. In R. German and A. Heindl, editors, Proc. of 13th GI/ITG Conference on Measurement, Modeling, and Evaluation of Computer and Communication Systems (MMB 2006), pages 191–207, Erlangen, March 2006. VDE Verlag.

[8] R. Bless and M. Doll. Integration of the FreeBSD TCP/IP-Stack into the Discrete Event Simulator OMNeT++. In R. G. Ingalls, M. D. Rossetti, J. S. Smith, and B. A. Peters, editors, Proceedings of the 2004 Winter Simulation Conference, 2004.


[9] R. Braden. Requirements for Internet Hosts – Communication Layers. Request for Comments RFC 1122, Internet Engineering Task Force, October 1989.

[10] S. Bregni. Fast Algorithms for TVAR and MTIE Computation in Characterization of Network Synchronization Performance. In G. Antoniou, N. Mastorakis, and O. Panfilov, editors, Advances in Signal Processing and Computer Technologies. WSES Press, 2001.

[11] R. L. Burden and J. D. Faires. Numerical Analysis. Wadsworth Group, seventh edition, 2001.

[12] V. Cardellini, M. Colajanni, and P. Yu. Dynamic load balancing on Web-server systems. IEEE Internet Computing, 3(3):28–39, May–June 1999.

[13] M. C. Cario and B. L. Nelson. Autoregressive to Anything: Time-Series Input Processes for Simulation. Operations Research Letters, 19:51–58, 1996.

[14] E. Casalicchio and M. Colajanni. A client-aware dispatching algorithm for Web clusters providing multiple services. In Proc. of 10th Int'l Conference, pages 535–544, May 2001.

[15] A. Chepurko. Instrumentieren eines clusterbasierten Webservers zur Leistungsmessung. Master thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2002.

[16] G. Combs. Wireshark, February 2008. http://www.wireshark.org/.

[17] Comité Consultatif International des Radiocommunications. Characterization of Frequency and Phase Noise. Report 580, CCIR, 1986.

[18] P. Dauphin, R. Hofmann, R. Klar, B. Mohr, A. Quick, M. Siegle, and F. Sötz. ZM4/SIMPLE: a General Approach to Performance-Measurement and -Evaluation of Distributed Systems. In T. Casavant and M. Singhal, editors, Readings in Distributed Computing Systems, chapter 6, pages 286–309. IEEE Computer Society Press, Los Alamitos, California, January 1994.


[19] J. Dodenhoff. Vergleichende Bewertung verschiedener Zeitsynchronisationsverfahren. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2005.

[20] J. Dodenhoff. PLL-basierte Verfahren zur Offline-Analyse von Zeitstempeln. Diplomarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2006.

[21] École Polytechnique Montréal. LTTng & LTTV, February 2008. http://ltt.polymtl.ca/.

[22] K. Egevang and P. Francis. The IP Network Address Translator (NAT). Request for Comments RFC 1631, Internet Engineering Task Force, May 1994.

[23] H. Essa. Evaluierung des Einsatzes eines Content-Management-Systems für den Web-Auftritt des Informatik-Lehrstuhls 7. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2004.

[24] F. Fischer. Offline-Synchronisation von Zeitstempeln. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2004.

[25] T. Gleixner and D. Niehaus. Hrtimers and Beyond: Transforming the Linux Time Subsystem. In Proceedings of the Linux Symposium, volume 1, pages 333–346, Ottawa, Ontario, Canada, July 2006.

[26] S. Godard. SYSSTAT Utilities, February 2008. http://pagesperso-orange.fr/sebastien.godard/.

[27] A. Heindl. Analytic moment and correlation matching for MAP(2)s. In Proc. 6th Int. Workshop on Performability Modeling of Computer and Communication Systems (PMCCS), pages 39–42, Monticello, IL, USA, September 2003.

[28] K.-S. Hielscher. Aufbau eines clusterbasierten Webservers zur Leistungsanalyse. Diplomarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2001.

[29] K.-S. Hielscher, S. Schreieck, and R. German. Analyse und Modellierung einer produktiven verteilten Webanwendung. In B. Wolfinger and K. Heidtmann, editors, Leistungs-, Zuverlässigkeits- und Verlässlichkeitsbewertung von Kommunikationsnetzen und verteilten Systemen (3. GI/ITG-Workshop MMBnet 2005), volume 263, pages 99–110, Hamburg, September 2005. Fachbereich Informatik.

[30] K.-S. J. Hielscher and R. German. A Low-Cost Infrastructure for High Precision High Volume Performance Measurements of Web Clusters. In P. Kemper and W. H. Sanders, editors, Proceedings of the 13th Conference on Computer Performance Evaluations, Modelling Techniques and Tools (TOOLS 2003), volume 2794 of Lecture Notes in Computer Science, pages 11–28, Urbana-Champaign, Illinois, September 2–5 2003. Springer.

[31] R. Hofmann and U. Hilgers. Theory and Tool for Estimating Global Time in Parallel and Distributed Systems. In Proc. of the Sixth Euromicro Workshop on Parallel and Distributed Processing PDP'98, pages 173–179, Los Alamitos, January 21–23 1998. Euromicro, IEEE Computer Society.

[32] P. Houle. RngPack 1.1a, November 6 2003. http://www.honeylocust.com/RngPack/.

[33] D. A. Howe, D. W. Allan, and J. A. Barnes. Properties of Signal Sources and Measurement Methods. In Proceedings of the 35th Annual Symposium on Frequency Control, Philadelphia, PA, 1981.

[34] IEEE. 61588:2004 (1588-2002) Precision clock synchronization protocol for networked measurement and control systems. IEEE Standards Association, 2004.

[35] V. Jacobson, B. Braden, and D. Borman. TCP Extensions for High Performance. Request for Comments RFC 1323, Internet Engineering Task Force, May 1992.

[36] R. Jain. The Art of Computer Systems Performance Analysis. Wiley, New York, 1991.


[37] P. Karn and C. Partridge. Improving Round-Trip Time Estimates in Reliable Transport Protocols. ACM Transactions on Computer Systems, 9(4):364–373, 1991.

[38] R. Klar, P. Dauphin, F. Hartleb, R. Hofmann, B. Mohr, A. Quick, and M. Siegle. Messung und Modellierung paralleler und verteilter Rechensysteme. Teubner-Verlag, Stuttgart, 1995.

[39] K. Köker, K.-S. Hielscher, and R. German. A Low-Cost High Precision Time Measurement Infrastructure for Embedded Mobile Systems. In K. Kozlowski, editor, Robot Motion and Control 2007, volume 360 of Lecture Notes in Control and Information Sciences, pages 445–452, Heidelberg, 2007. Springer.

[40] O. Kolisnichenko. Performancemessung und Optimierung von J2EE-Anwendungen. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2007.

[41] J. F. Kurose and K. W. Ross. Computer Networking. Pearson, Boston, fourth edition, 2007.

[42] M. Lasch. Implementierung einer Netfilter-basierten Packet-Logging-Infrastruktur. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2005.

[43] M. Lasch. Erweiterung einer bestehenden netfilter-basierten Packet-Logging-Implementierung. Diplomarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2006.

[44] A. M. Law. ExpertFit Distribution Fitting Software, February 2008. http://www.averill-law.com/.

[45] A. M. Law and W. D. Kelton. Simulation Modeling and Analysis. McGraw-Hill Series in Industrial Engineering and Management Science. McGraw-Hill, Boston, third edition, 2000.

[46] J. Levine. An Algorithm to Synchronize the Time of a Computer to Universal Time. IEEE/ACM Trans. Netw., 3(1):42–50, 1995.


[47] J. Levine. Introduction to time and frequency metrology. Rev. Sci. Instrum., 70:2567–2596, 1999.

[48] Linux Virtual Server Project, February 2008. http://www.linuxvirtualserver.org/.

[49] G. Marsaglia. Random Numbers Fall Mainly in the Planes. Proceedings of the National Academy of Sciences of the United States of America, 61(1):25–28, September 15 1968.

[50] B. Melamed. An Overview of TES Processes and Modeling Methodology. In L. Donatiello and R. Nelson, editors, Performance Evaluation of Computer and Communications Systems, Lecture Notes in Computer Science, pages 359–393, Heidelberg, 1993. Springer.

[51] D. A. Menascé and V. A. F. Almeida. Capacity Planning for Web Performance. Prentice Hall, Upper Saddle River, 1998.

[52] D. A. Menascé and V. A. F. Almeida. Scaling for E-Business. Prentice Hall, Upper Saddle River, 2000.

[53] D. A. Menascé and V. A. F. Almeida. Capacity Planning for Web Services. Prentice Hall, Upper Saddle River, 2002.

[54] D. A. Menascé, V. A. F. Almeida, and L. W. Dowdy. Capacity Planning and Performance Modeling. Prentice Hall, Upper Saddle River, 1996.

[55] M. Meyerhöfer. Messung und Verwaltung von Softwarekomponenten für die Performancevorhersage. Dissertation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 6 (Datenmanagement), Erlangen, 2007.

[56] D. Mills. A Kernel Model for Precision Timekeeping. Request for Comments RFC 1589, Internet Engineering Task Force, March 1994.

[57] D. Mills and P.-H. Kamp. The nanokernel. In Proc. Precision Time and Time Interval (PTTI) Applications and Planning Meeting, pages 423–430, Reston, VA, November 2000.

[58] D. L. Mills. The Network Computer as Precision Timekeeper. In Proc. Precision Time and Time Interval (PTTI) Applications and Planning Meeting, pages 96–108, Reston, VA, November 1996.


[59] D. L. Mills. Time Synchronization: The Network Time Protocol. CRC Press, Boca Raton, 2005.

[60] Miniwatts Marketing Group. Internet World Stats, February 2008. http://www.internetworldstats.com/.

[61] J. Mogul and S. Deering. Path MTU Discovery. Request for Comments RFC 1191, Internet Engineering Task Force, November 1990.

[62] J. Mogul, D. Mills, J. Brittenson, J. Stone, and U. Windl. Pulse-per-second API for Unix-like operating systems, version 1. Request for Comments RFC 2783, Internet Engineering Task Force, March 2000.

[63] S. B. Moon, P. Skelly, and D. Towsley. Estimation and Removal of Clock Skew from Network Delay Measurements. In Proceedings of IEEE INFOCOM '99, March 1999.

[64] D. Mosberger and T. Jin. httperf: A Tool for Measuring Web Server Performance. In First Workshop on Internet Server Performance, pages 59–67. ACM, June 1998.

[65] netfilter.org. netfilter/iptables project homepage, February 2008. http://www.netfilter.org/.

[66] A. Pásztor and D. Veitch. PC based precision timing without GPS. In Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 1–10. ACM Press, 2002.

[67] V. Paxson. On Calibrating Measurements of Packet Transit Times. In Measurement and Modeling of Computer Systems, pages 11–21, 1998.

[68] V. Paxson and M. Allman. Computing TCP's Retransmission Timer. Request for Comments RFC 2988, Internet Engineering Task Force, November 2000.

[69] D. Piester, P. Hetzel, and A. Bauch. Zeit- und Normalfrequenzverbreitung mit DCF77. PTB-Mitteilungen, 114(4):345–368, 2004.

[70] M. Preißner. Erstellung einer E-Commerce-Plattform nach TPC-W. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2003.


[71] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007. ISBN 3-900051-07-0.

[72] C. Resch. Messungsbasierte Modellierung von drahtlosen lokalen Netzen. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2005.

[73] C. Resch, A. Heindl, K.-S. J. Hielscher, and R. German. Measurement-based modeling of end-to-end delays in WLANs with ns-2. In F. Hülsemann, M. Kowarschik, and U. Rüde, editors, Proc. 18th Symposium on Simulation Techniques (ASIM), pages 254–259, Erlangen, September 2005. ASIM.

[74] S. Schreieck. Einflüsse von Temperaturveränderungen auf die Performance eines PCs im Hinblick auf die Genauigkeit von Zeitmessvorgängen. Interner Bericht 06-07, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, December 2006.

[75] S. Schreieck. Entwurf und Implementierung eines aspektorientierten Tools zur Leistungsmessung und zum Debugging im JDBC Umfeld. Interner Bericht 04-07, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, June 2006.

[76] S. Schreieck. Instrumentierung des Selbstbedienungsportals der Fachhochschule Kempten mittels Aspektorientierter Programmierung zur Gewinnung von performancerelevanten Daten. Interner Bericht 05-07, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, June 2006.

[77] G. Seeber. Satellite Geodesy. De Gruyter, Berlin, second edition, 2003.

[78] A. Silberschatz and P. Galvin. Operating System Concepts. Wiley, New York, fifth edition, 1999.

[79] S. R. Stein. Frequency and Time, Their Measurement and Characterization. In Precision Frequency Controls, volume 2, chapter 12. Academic, New York, 1985.


[80] J. Stultz, N. Aravamudan, and D. Hart. We Are Not Getting Any Younger: A New Approach to Time and Timers. In Proceedings of the Linux Symposium, volume 1, pages 219–232, Ottawa, Ontario, Canada, July 2005.

[81] D. Sullivan, D. Allan, D. Howe, and F. Walls. Characterization of Clocks and Oscillators. Technical note 1337, National Institute of Standards and Technology, 1990.

[82] Y. Teo and R. Ayani. Comparison of Load Balancing Strategies on Cluster-based Web Servers. The Journal of the Society for Modeling and Simulation, 77(5-6):185–195, November–December 2001.

[83] Transaction Processing Performance Council. TPC-W Transactional Web e-Commerce Benchmark, April 2005. http://www.tpc.org/tpcw/.

[84] University of Southern California, Information Sciences Institute. Transmission Control Protocol. Request for Comments RFC 793, Defense Advanced Research Projects Agency, Information Processing Techniques Office, September 1981.

[85] G. Uygur. Eine externe Uhr für Linux zur Zeitsynchronisation in Nanosekunden. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2004.

[86] A. Varga. OMNeT++, February 2008. http://www.omnetpp.org/.

[87] I. Wagner. Modellierung des Informatik-7-Webclusters mit UML Statecharts. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2004.

[88] I. Wagner. Modellierung des Informatik-7-Webclusters mit UML Statecharts, Teil II. Diplomarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2005.

[89] I. Wagner, K.-S. Hielscher, and R. German. A Measurement-Based Simulation Model of a Web Cluster. In J. Krüger, A. Lisounkin, and G. Schreck, editors, 3rd Int. Industrial Simulation Conference (ISC'2005, Berlin, Germany), pages 88–92, Ghent, Belgium, June 9–11 2005. EUROSIS-ETI.


[90] M. A. F. Wagner and J. R. Wilson. Using Univariate Bézier Distributions to Model Simulation Input Processes. In 1993 Winter Simulation Conference Proceedings, pages 365–373, Los Angeles, 1993. ACM.

[91] M. A. F. Wagner and J. R. Wilson. Recent developments in input modeling with Bézier distributions. In WSC '96: Proceedings of the 28th Conference on Winter Simulation, pages 1448–1456, Washington, DC, USA, 1996. IEEE Computer Society.

[92] L. Wallner. Simulation der Mehrwertgenerierung durch Kunden-werben-Kunden-Programme auf einer neuartigen Web-Plattform. Diplomarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2007.

[93] U. Windl. PPSKit, August 2006. ftp://ftp.kernel.org/pub/linux/daemons/ntp/PPS/.

[94] G. M. R. Winkler. Introduction to Robust Statistics and Data Filtering, February 2008. http://www.wriley.com/ROBSTAT.htm.

[95] R. W. Wisniewski, R. Azimi, M. Desnoyers, M. M. Michael, J. Moreira, D. Shiloach, and L. Soares. Experiences Understanding Performance in a Commercial Scale-Out Environment. In Euro-Par 2007 Parallel Processing, volume 4641/2007 of Lecture Notes in Computer Science, pages 139–149, Heidelberg, 2007. Springer.

[96] P. Wunderlich. XML Processing System. Studienarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2003.

[97] P. Wunderlich. Instrumentierung und Leistungsmessung an einer E-Commerce-Applikation. Diplomarbeit, Friedrich-Alexander-Universität Erlangen-Nürnberg, Lehrstuhl für Informatik 7 (Rechnernetze und Kommunikationssysteme), Erlangen, 2005.

[98] XJ Technologies Company. AnyLogic, February 2008. http://www.xjtek.com/.

[99] K. Yaghmour and M. Dagenais. Measuring and Characterizing System Behavior Using Kernel-Level Event Logging. In USENIX 2000 Annual Technical Conference, June 18–23 2000.


[100] K. Yaghmour and M. R. Dagenais. Measuring and Characterizing System Behavior Using Kernel-Level Event Logging. In USENIX Annual Technical Conference, pages 13–26, June 18–23 2000.

[101] J. Yang, D. Jin, Y. Li, K.-S. Hielscher, and R. German. Modeling and simulation of performance analysis for a cluster-based Web server. Simulation Modelling Practice and Theory, 14(2):188–200, 2006.

[102] Q. Zhang, A. Riska, W. Sun, E. Smirni, and G. Ciardo. Workload-aware load balancing for clustered Web servers. IEEE Transactions on Parallel and Distributed Systems, 16(3):219–233, March 2005.
