Faculty of Engineering, Degree Programme in Computer Engineering

Master's thesis
Logbus-ng: a logging bus for Field Failure Data Analysis in distributed systems

Academic Year 2009/2010

Supervisors: Prof. Domenico Cotroneo, Prof. Marcello Cinque

Co-supervisor: Ing. Antonio Pecchia

Candidate: Antonio Anzivino, student ID 885/451

Logbus-ng: a software logging bus for Field Failure Data Analysis in distributed systems

Summary

Introduction
Logging
Computer security and accounting
Field Failure Data Analysis
The state of art of logging frameworks
Logging frameworks
An example API
Logging formats
Logging protocols
Open issues
Design of the Logbus-ng project
Source-side interfaces
Monitor-side interfaces
Logging APIs
Core design
Plugin system
Implementation of Logbus-ng
Overview of (.NET/Mono) platform
XML configuration
Determining a host's IP address in "Connect-to-me" protocols
Running a web application from inside a console application
Concurrency issues
Plugin APIs
Field Failure Data Logging support
The Entity Manager plugin
Log4net interoperability
Experimental validation
Unit testing for Syslog parser
Delivery time of messages


Loss of UDP datagrams under stress
Conclusions and future work
Platform bindings
Dealing with protocol drawbacks
Load balancing, fault tolerance
Other work
Appendixes
Appendix Alpha
Appendix Bravo
Appendix Charlie
Bibliography
Acknowledgements



Introduction

Today, critical computer systems are becoming more and more important in key human activities, replacing people in controlling processes and thus achieving lower costs and greater reliability. Such systems are increasingly directly responsible for people's safety, and a failure might, in some cases, bring disastrous consequences. If we want to replace a human controller with an automated computer controller in a critical scenario, like a nuclear power plant or a passenger flight, we must know how "dependable" each hardware and software component of the controller is, where dependability is, by definition, a quantitative indication of its capability to provide a proper service (or of its resistance to faults). Quantitative indices are suitable for engineering approaches, and among these we find the most common and relevant: availability, which is the probability that the system is providing a service at a given time1, and reliability, which is the probability that the system stays up for a continuous time interval2.
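The two indices just introduced can be restated compactly (this is merely a formal rewriting of the definitions above, not an addition):

```latex
A(t) = \Pr[\text{the system delivers correct service at time } t]
\qquad
R(t) = \Pr[\text{no failure occurs in } (0, t]]
```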

Dependability Engineering is a vast field of modern engineering. In relation to computer science, we can distinguish hardware from software dependability, which are not as parallel as we might expect. We will touch on the problems that arise in, and the techniques applied to, both fields, though not comprehensively. Hardware components are physical, tangible components, made with highly consolidated technologies and subject to the laws of Physics, particularly those of Electronics. An electronic component, whether an ALU or an entire CPU, computes electrical input signals into output signals through a deterministic function. One of the teachings of Electronics is that input signals are not fully deterministic: they are affected by noise that, if not properly filtered, may cause a fault of the component. The signals able to affect a hardware circuit's behaviour are not only the electrical ones applied to its input connectors, but also the electromagnetic radiation to which the component is subject in its working environment: for example, you can build a memory and test it indefinitely in the lab without finding any design defect, but once mounted in a space probe traveling towards the Sun, that memory may be affected by the radiation the star produces and possibly show undesired behaviour. Studying the possible failures of hardware components and the techniques that help avoid them is a consolidated subject.

Software components, however, are virtual: they have no mass and, while still contained in physical memories, do not exist in our realm, in the sense that we cannot physically interact with software. Software is thus not subject to the physical laws that regulate hardware, and is therefore slightly more difficult to study. A great advantage of software over hardware is that it is practically immutable: while it is reasonable to believe that, under certain conditions (thermal stress, salt accumulation), a hardware component may be modified in its low-level physical structure and no longer show the same performance as when it left the factory, executable code is not subject to wear or alteration, although software is affected by a specific form of aging, which is beyond the scope of this work.

1 Conversely, it is one minus the probability that a failure occurs at a given time
2 Conversely, it is one minus the probability that a failure occurs during the time interval

The difference between hardware and software faults is that hardware components may actually fail due to unpredictable random phenomena (like radiation), whereas the root cause of a software fault is nothing more than a permanent design or implementation defect, commonly called a bug. Once the activation condition of a fault is found, its activation is deterministic. This apparently simplifies things; however, finding a bug in software can be extremely complicated.



Our goal is neither to model software failures with theoretical approaches, whether black or white box, nor to deal with the problem of tracking down the root cause of a fault starting from a known failure in the system. Given a complex distributed system (business or safety critical), we want to facilitate its performance analysis in operational scenarios and provide administrators with all the tools needed to monitor the system's health and detect possible malfunctioning promptly.

The most basic tools to get execution information in a runtime environment, where you cannot perform debugging to monitor the system's execution flow, are logging and the consequent log analysis, which we deal with in a dedicated paragraph. Classic log analysis is often performed offline on heterogeneous and distributed platforms. Log messages come from different hosts and programs, and if they are formatted according to different formats, they must be processed separately (multiple databases, multiple analyses), even though they are actually correlated, since a failure in one node can propagate as a fault in another node.

Our goal is to provide developers with a tool to perform online log analysis on complex distributed systems, collecting logs in different formats and allowing them to be analysed as a single log trace. We created Logbus-ng, an open source platform capable of collecting and distributing the log messages generated by an entire network of physical or virtual machines, thus allowing effective and reliable online log analysis by monitoring clients.

Logbus-ng is free software, available on SourceForge.net, the best-known open source development website in the world, at https://www.sourceforge.net/projects/logbus-ng, and is released under the OSI-approved Reciprocal License.



Logging

Logging the execution of a software program is a common and long-standing practice, with roots in requirements other than dependability analysis. Depending on how, and on the context in which, logging is done, messages can be used for access control, accounting, profiling, auditing and data mining. When logs are used for security or accounting purposes, they are often subject to specific security requirements, like authentication and confidentiality, both to comply with regulations and to provide means of proof in a contract.

Logging is always done by inserting appropriate calls to a logging library at specific points of a program's source code. The most rudimentary logging mechanism, often used by students to perform debugging, is to output messages to a debug console like the Common Language Runtime's System.Debug (1) or Java's System.err (2), which point to a special output stream that displays such messages either in-line with the program's console output or in a separate window. In POSIX systems, the error stream can always be captured and redirected to another output stream, such as a file, using console commands.

No matter the usage required for log messages, several platforms were created to provide scalable and versatile support for developers' various logging requirements, simplifying code intervention and granting a high level of transparency for low-level operations (file opening, DBMS connection management, network protocols…). We will not hide our particular interest in the open source logging platforms developed by the Apache Software Foundation (3), the same foundation that made the famous HTTP server: these platforms are log4j (4), log4net (5), log4php and log4cxx, respectively made for Java, the CLR, PHP and


C++. Without going into detail, let us focus on the impact they had on the software market: thanks to simple calls that accept a text message, developers can log to different targets defined in the runtime configuration, differentiating by source, importance or type. The platforms only provide the tools needed to store messages on stable memory or forward them to a log server; it is still up to the developer to correctly place logging calls inside the source code. In fact, excluding random damage or loss of log messages, logging quality only

relies on the expertise of the programmer, at the end of comprehensive software engineering work by the designer. He is in charge of logging the right events in a clear, non-ambiguous and possibly structured way. One of the weak points of classic logging is text messages: there are no standards about them, not even de facto ones3, so log messages are made human-readable, forcing log analysis to be manual, or at least mostly manual. Historically speaking, there can be conventions on the structure of the information decorating a log message, mainly the timestamp, but not on the textual descriptive message. The following log

trace was extracted on August 15th from the marcus.zighinetto.org server to provide an example. Messages have been altered to obfuscate potentially sensitive information.

3 Not to be confused with the de facto standards that define log message formatting when dealing with structured data, i.e. web server logs and the Syslog protocol


Aug 15 17:28:24 marcus sshd[17988]: Invalid user condor from 211.47.xxx.xxx
Aug 15 17:28:36 marcus sshd[18200]: Invalid user global from 211.47.xxx.xxx
Aug 15 17:28:40 marcus sshd[18338]: Invalid user upload from 211.47.xxx.xxx
Aug 15 17:28:41 marcus sshd[18348]: Invalid user marine from 211.47.xxx.xxx
Aug 15 17:33:26 marcus syslog-ng[7956]: Log statistics; dropped='pipe(/dev/xconsole)=0', processed='center(queued)=70448', processed='center(received)=48033', processed='destination(messages)=448', processed='destination(mailinfo)=11985', processed='destination(mailwarn)=5215', processed='destination(localmessages)=0', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=5215', processed='destination(console)=0', processed='destination(null)=0', processed='destination(mail)=17200', processed='destination(xconsole)=0', processed='destination(firewall)=30385', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=48033'
Aug 15 17:40:01 marcus /usr/sbin/cron[28606]: (root) CMD (/etc/webmin/system-status/systeminfo.pl)
Aug 15 17:40:57 marcus sshd[31881]: Accepted publickey for djechelon from 94.167.xxx.xxx port 1598 ssh2
Aug 15 17:41:09 marcus su: FAILED SU (to root) djechelon on /dev/pts/0
Aug 15 17:41:18 marcus su: FAILED SU (to root) djechelon on /dev/pts/0
Aug 15 17:41:24 marcus su: (to root) djechelon on /dev/pts/0

After a manual analysis, we can draw some conclusions about the informational content of the messages. The first four messages are from sshd's authentication facility, and apparently report that several users from the same IP address failed authentication over SSH. By experience, those log entries are the signature of an attack against the server4, trying to

exploit weak username/password pairs by trying common user names (e.g. admin, joe, etc.) together with a password attack such as dictionary or brute force. These messages are followed by a diagnostic message from the syslog-ng (6) daemon, which is responsible for the logging infrastructure in UNIX systems. The other messages are again from the authentication system: user djechelon first authenticated with a public key, then had to elevate his privileges to super user (failing twice while typing the password) in order to read the /var/log/messages file

4 Believe us or not, that attack was real and not simulated to get log data. Such attacks on sshd are very common for anyone running a server


containing such logs. Apart from the date, time, machine name, etc., which are structured according to the BSD Syslog protocol (7), the text messages from different applications or kernel modules (hence from different developers) are plain text and at best follow a single developer's specific format. Fortunately, since each unique message is structured in an immutable way, it is easy to automatically parse messages and extract information of interest, such as which IP address fails more than 5 authentication attempts in a short amount of time: as an example, the

fail2ban (8) daemon is an open source monitoring tool that scans logs searching for too many consecutive authentication failures, blocking the potential attacker via the system firewall. Let us now turn to classic logging issues: as shown by (9) and (10), classic logs are often incomplete or redundant, up to the point that for a single event to which developers paid lots of attention5 there are lots of log messages when just one would suffice, while obviously other interesting events leave no trace in the logs because developers did not care

about logging them. These issues, together with log damage, contribute to incoherence: while logs highlight failures in distributed systems (11), they provide no means to track back the root cause. On the other hand, analysing heterogeneous logs, structured according to the judgment of different developers who failed to adopt common rules (12) (9), leads to log chaos.
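The kind of automated parsing described earlier, counting failed SSH logins per source address, can be sketched in a few lines. This is a minimal, hypothetical illustration of the idea, not fail2ban's actual implementation; the sample lines and the threshold are illustrative:

```python
import re
from collections import Counter

# Minimal sketch of fail2ban-style detection: count 'Invalid user'
# sshd entries per source IP and flag addresses above a threshold.
# (Hypothetical example, not fail2ban's actual implementation.)
LOG_LINES = [
    "Aug 15 17:28:24 marcus sshd[17988]: Invalid user condor from 211.47.1.2",
    "Aug 15 17:28:36 marcus sshd[18200]: Invalid user global from 211.47.1.2",
    "Aug 15 17:28:40 marcus sshd[18338]: Invalid user upload from 211.47.1.2",
    "Aug 15 17:40:57 marcus sshd[31881]: Accepted publickey for djechelon",
]

PATTERN = re.compile(r"sshd\[\d+\]: Invalid user \S+ from (\S+)")

def suspicious_ips(lines, threshold=3):
    """Return the IPs with at least `threshold` failed login attempts."""
    failures = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            failures[match.group(1)] += 1
    return {ip for ip, count in failures.items() if count >= threshold}

print(suspicious_ips(LOG_LINES))
```

Note that such a parser only works because the sshd message structure is immutable; the free-text part of arbitrary log messages offers no such guarantee.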

Let us examine some common fields of application for logging.

Computer security and accounting

Logs are often used for computer security or accounting purposes. A telephone bill is basically a log of the outbound calls made by a user. A bank issues periodic reports to account holders based on a log of transactions. Alongside accounting purposes, there are security usages of logs. We have already shown in the previous example that an attack's signature can be detected by log analysis in common

5 For example, an error triggered by a bug that was very hard to discover


cases, but there are also other usages. If an access control system logs each successful and failed authentication with adequate contextual information, in case a security violation is detected, analysts can (or can try to) track down the original offender.

Computer security is not the only field in which logging is applied. Most web servers, for example, generate comprehensive access logs, separate from error logs, that can be mined by system administrators with tools such as AWStats (13) to produce website statistics. This software provides detailed projections of visitors' origin, preferred time of access and preferred pages, aggregating the logs coming from each monitored web server.

In general, logs related to such critical environments are enforced with certain security requirements, which we will not examine deeply in this work: they mostly include cryptography and authentication.

There are also other kinds of security analyses that can be performed using logs: Siewiorek (14) uses a new methodology, based on the representation of a program as a finite state machine with logs representing transactions, to detect a vulnerability in a web server that was unknown to the developers, who were then notified.

Field Failure Data Analysis

Field Failure Data Analysis is a subset of log analysis that aims to characterize dependability features of a system, such as availability and reliability, over a specified time horizon.

FFDA is performed by extracting the log messages that report failures in the system, and often requires huge manual work. Classic offline FFDA based on logs is performed with the following steps:

• Log collection
• Log filtering
• Log coalescence and tupling (14)
• Evaluation of dependability indices (MTBF, MTTR…)



Filtering is required to obtain a log collection that only reports failures in the system, cleaned of informational and debug messages. Coalescence is a process that groups the messages generated inside a time window into a single failure, because a failure propagating along a chain of services often causes other failures to occur; tupling is a form of spatial coalescence, based on the fact that a failure in one node can propagate to other nodes (a power outage can cause a core router to go down, disconnecting several hosts in a network). Coalescence is performed by graphical means.
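The temporal coalescence step just described can be sketched as follows. This is a simplified illustration under stated assumptions (timestamps in seconds, a fixed coalescence window, no spatial tupling):

```python
# Sketch of temporal coalescence: failure events whose timestamps fall
# within `window` seconds of the previous event in a group are merged
# into a single failure. Timestamps and window are illustrative values.

def coalesce(timestamps, window=5.0):
    """Group failure timestamps into coalesced failure groups."""
    groups = []
    for t in sorted(timestamps):
        if groups and t - groups[-1][-1] <= window:
            groups[-1].append(t)   # same propagating failure
        else:
            groups.append([t])     # a new, distinct failure
    return groups

# Three raw events, but only two distinct failures:
print(coalesce([0.0, 2.0, 30.0]))  # [[0.0, 2.0], [30.0]]
```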

The final step is obtaining dependability indices for the system from the clean FFDA log. For example, the MTBF is obtained by computing the statistical mean of the intervals between consecutive failures, and so on.
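The MTBF computation mentioned above reduces to a mean over inter-failure gaps; a sketch with illustrative failure times:

```python
# Sketch: MTBF as the mean interval between consecutive failures.
# Failure instants (in hours) are illustrative values.

def mtbf(failure_times):
    """Mean Time Between Failures from a list of failure instants."""
    times = sorted(failure_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

print(mtbf([0.0, 100.0, 260.0]))  # (100 + 160) / 2 = 130.0
```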

Modern, online FFDA (14) aims to perform fault forecasting, and is based on the interpretation of log data according to known patterns or automated classification modules. An online FFDA tool requires a strict and non-ambiguous logging criterion to apply to the program's source code. Such a criterion can reasonably be implemented by an automated instrumenting compiler designed for the developer's language and logging APIs. The effectiveness of such logging rules has been proven in research works (9) (10). Assuming the program's code is well instrumented with logging statements for error events, and assuming the logging subsystem is reliable, we can get very useful information from logs about the failures occurring in a software system6.

However, FFDA performed through the analysis of classic logs (from now on, "legacy" logs), whether online or offline, is affected by several problems highlighted in research works: Oliner (12) states that legacy logs usually do not provide adequate information for automatic failure detection, that logs are subject to corruption, and that log messages are often inconsistent; Natella (10) states that legacy logs usually do not provide information about timing failures (i.e. hangs, deadlocks), because the code never reaches the logging statements; and finally, Pecchia (9) demonstrates that legacy logs are still useful when used in combination with a set

6 We will not mention hardware logging and hardware FFDA


of logging rules.

In order to perform automated FFD log analysis, we must filter, out of the complete and verbose set of log messages, those related to error conditions that are useful for automated processing. Filtering does not mean deleting: Oliner (12) demonstrates that legacy logs often provide information about the root cause of a failure. Concerning the root cause, let us distinguish two cases.

Finding the root cause in a software system often translates into tracking down and fixing a bug in the code, but this operation requires time and expertise from the maintenance staff, and thus cannot be performed automatically. In the case of critical systems that expose transient or intermittent failures, it might be more useful for a diagnostic process to detect and isolate the faulty component, where the component can be defined at any level of grain: in a coarse-grained approach, we might detect the hardware node that caused the failure, while in a fine-grained approach we might want to isolate only the process that failed, in order to reduce the probability of a subsequent failure.

Common problems of FFD logging are the heterogeneity of log formats and of collection systems. There are a number of platforms available on the market, and all use a proprietary format, preventing integration (while joint analysis is a critical success factor in FFDA). Also, performing FFD logging always results in collecting a large number of useless waste messages that must eventually be dropped from the collection, leaving only those that are considered helpful.

Part of our work is the development of Logbus-ng's source-client APIs which, together with the structured FFDL7 approach, provide the means to achieve our goals, overcoming the difficulties of legacy logging. It is important to underline that Logbus-ng, following the considerations in (12) and (11), is designed to remedy the heterogeneity of logging systems, collecting all log messages in a system for deeper and more comprehensive analyses.

7 Field Failure Data Logging, to distinguish the generation of log messages from their analysis


The state of the art of logging frameworks

A logging framework has two basic requirements: it allows the developer to log messages with simple calls, and it stores log messages in permanent memory according to a well-known format. The choice of format was a critical issue for us, because of the heterogeneity of log messages and the absence of a reference standard. Both because logging has historically been a process managed internally by departments, not subject to customer relationships, and because log messages themselves are made up at the discretion of the single programmer or (in the best case) of the development team, no real de facto standard was ever widely adopted for log message syntax. The need for a logging format is not only due to the need for a common binary representation of data (whether ASCII, Unicode, etc.), but also concerns the type, number and syntax of the information decorating the log message itself, mainly the timestamp. If we ever had to create a brand new logging format, we would have to define three things: the data that is part of the log message, the message syntax, and the encoding. That is exactly what we did not want to do, certain that yet another brand new logging format would have added chaos to chaos. We chose instead to deeply study the existing formats and to refer to possible Internet standards.

Logging frameworks

We found that there are lots of different logging frameworks, both commercial and open source, available on the market.

The first we want to discuss are the Apache log4xxx frameworks, available as separate packages for C++, PHP, Java (4) and .NET (5). They are fully configurable in terms of log message syntax (to ease automatic parsing) and destination, using pluggable components for file system, network, DBMS, email, etc. The most disastrous result we found is that the Apache Software Foundation's logging frameworks and the Apache HTTP server use different logging syntaxes, and the server even uses its own logging framework.

The syntax of both is customizable by the administrator. While this feature has the advantage of being flexible for any automated parsing requirement (think of a web server log that only needs to record referrer URLs for mining purposes, dropping the source IP address for privacy reasons), no native format with the full set of information is defined8. What we want from a "Logbus" is to overcome the logging chaos by allowing transcoding between multiple supported formats, whenever this conversion is technically feasible.

A very interesting logging infrastructure we studied is the Tivoli (15) framework by IBM. It provides both real-time and offline log analysis features, and is designed for distributed systems. Unfortunately, it is a closed and expensive platform, and while its APIs are public, it uses a proprietary logging format.

Another interesting framework that can be used for real-time logging, analysed by Pecchia (9), is the Data Distribution Service (16). However, Pecchia himself demonstrated in his work that DDS's non-functional properties are unacceptable for real-time logging. An explanation for this can be found in the fact that content-based channels are slightly less performant than topic-based channels, and our publish/subscribe model, which we will illustrate later, is based on content-based channels.

We are then forced to build a log distribution service on our own. What pattern should we use, then? Again, Pecchia (9) suggests that the publish/subscribe pattern is a valid choice for the logging bus. This is due to the fact that publish/subscribe middleware components are lighter, since they tend to be stateless, unlike, for example, Tivoli's CEI, which keeps track of historical events.

8 However, there are still some default formats, obtained by not altering the configuration from the installation's defaults
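The content-based publish/subscribe model mentioned above, and the statelessness that makes it lightweight, can be illustrated with a minimal sketch. This is hypothetical code, not Logbus-ng's actual API: subscribers register a predicate over message content, and the bus forwards each published message to every matching subscriber without keeping any history:

```python
# Minimal content-based publish/subscribe sketch (hypothetical, not the
# Logbus-ng API). Subscribers provide a predicate over message content;
# the bus is stateless: it forwards and forgets.

class LogBusSketch:
    def __init__(self):
        self._subscribers = []  # list of (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        self._subscribers.append((predicate, callback))

    def publish(self, message):
        # Content-based routing: evaluate each predicate on the message.
        for predicate, callback in self._subscribers:
            if predicate(message):
                callback(message)

bus = LogBusSketch()
errors = []
# Content-based filter: only messages with severity <= 3 (error or worse).
bus.subscribe(lambda m: m["severity"] <= 3, errors.append)
bus.publish({"severity": 7, "text": "debug detail"})
bus.publish({"severity": 3, "text": "mail subsystem error"})
print(errors)  # [{'severity': 3, 'text': 'mail subsystem error'}]
```

In a topic-based channel, routing would instead use only a fixed topic key; the predicate here may inspect any field of the message, which is more expressive but costlier to evaluate.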

Finally, Microsoft created its own logging frameworks for the Windows operating system and the Internet Information Services application server. The Windows Event Log (17), which we are going to examine, is a sophisticated platform for distributed logging, widely used by system administrators on large clusters and server farms. Windows system administrators can decide to forward logs to various systems in the network, including a single log server that collects the logs from all the machines and is equipped with proprietary log analysis tools. Logging APIs for the Windows Event Log are publicly available on MSDN.

An example API

Let us now analyse the logging APIs other platforms provide, in order to find useful suggestions for the design of ours.

We found that they all look almost the same, and can be synthesized by log4net's (5) APIs, which we took as a reference. log4net was created as a port of log4j (4). The following C# fragment illustrates the main entity in log4net, the ILog interface, conceptually equivalent to those of other APIs.

public interface ILog
{
    void Debug(object message);
    void Debug(object message, Exception exception);
    void DebugFormat(string format, params object[] args);
    void Error(object message);
    void Error(object message, Exception exception);
    void ErrorFormat(string format, params object[] args);
    void Fatal(object message);
    void Fatal(object message, Exception exception);
    void FatalFormat(string format, params object[] args);
    void Info(object message);
    void Info(object message, Exception exception);
    void InfoFormat(string format, params object[] args);
    void Warn(object message);
    void Warn(object message, Exception exception);
    void WarnFormat(string format, params object[] args);
}

Other APIs may differ in that, instead of calling different functions/methods for different logging levels, the level is passed as an argument. According to the documentation, log4net provides only five logging levels, while internally there are many more, including the eight Syslog levels. The documented levels are:

• Debug
• Info
• Warning
• Error
• Fatal

As a convention in all logging platforms, the severity of a message is a valid priority index during the analyses: debug logs are often used to support debugging (by providing contextual (18) (12) (11) information), informational logs to provide statistics and perform data mining, and error logs to report anomalies and perform FFDA. As we said from the beginning, it is all up to the programmers' discretion.

Logging formats

We tried to study the logging formats used by the most common frameworks to highlight their heterogeneity.

First of all, the Apache logging frameworks use a customizable log format, with the syntax specified by the user or by a pluggable component (actually, user-defined syntax is based on a component that lets the user define the syntax with a formatting string). This is suitable for automated parsing (you can format the log to adapt it to the program that will parse it) and is also suitable for the adoption of a common logging format.

BSD Syslog (7) messages are simple in structure: they start with a priority value9 between angle brackets, immediately followed by a timestamp in a semi-English format that does not include the year. The message also contains the host name of the originator, the name of the application that generated the message and, finally, the text payload. One of the major problems with BSD Syslog is that it does not define a rigorous syntax for the message. The RFC 3164 document itself shows examples of unusual Syslog messages that are difficult to parse by a general-purpose parser. This makes the adoption of BSD Syslog on heterogeneous systems almost impossible. We discovered that the Apache logging frameworks natively support Syslog.

9 The meaning of this value will be expanded when dealing with Syslog 2009

Microsoft relies on a code-oriented format for the Windows Event Log (17). This means that logs are not written to a text file, like in other frameworks, but are stored inside a local database in their native structured format, without any form of serialization. The format of Windows Event Log entries is defined by the following C declaration:

typedef struct _EVENTLOGRECORD {
    DWORD Length;
    DWORD Reserved;
    DWORD RecordNumber;
    DWORD TimeGenerated;
    DWORD TimeWritten;
    DWORD EventID;
    WORD  EventType;
    WORD  NumStrings;
    WORD  EventCategory;
    WORD  ReservedFlags;
    DWORD ClosingRecordNumber;
    DWORD StringOffset;
    DWORD UserSidLength;
    DWORD UserSidOffset;
    DWORD DataLength;
    DWORD DataOffset;
} EVENTLOGRECORD, *PEVENTLOGRECORD;

As we can see, this format stores two different timestamps: one for the time the message was generated, and the other for the time it was written to the log. This helps overcome another common problem in logging: distributed clocks. As we know, the clocks in a distributed system may not be accurately synchronized, hence the need for a reference clock.
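The usefulness of a reference clock can be sketched as follows. This is an illustration with hypothetical values, not the Event Log's actual mechanism: if the collector's clock is taken as the reference, a source-local generation timestamp can be corrected by the measured offset between the two clocks.

```python
# Sketch: normalizing source timestamps to a reference (collector) clock.
# Offsets and timestamps are hypothetical illustration values (seconds).

def normalize(time_generated, source_clock_offset):
    """Convert a source-local timestamp to reference-clock time.

    source_clock_offset = source_clock - reference_clock, e.g. as
    estimated when the message reaches the collector.
    """
    return time_generated - source_clock_offset

# A message generated at t=1000 on a source whose clock runs 5 s ahead
# of the collector actually occurred at t=995 on the reference clock.
print(normalize(1000.0, 5.0))  # 995.0
```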

Let us analyse the features of the Syslog (19) protocol and highlight its differences with (7). First of all, (7), more than defining a standard prior to its mass adoption (while RFCs are still de facto standards), is written under the form of a study of common UNIX and Berkeley BSD logging formats, trying to stop the format chaos by defining a uniformed



one actually derived from the others. This is revealed by the presence of statements such as “it has been observed that…”, referring to the separator conventions. The (19) format is instead defined in an imperative manner through the ABNF (20) notation, with particular attention paid to binary encoding; ABNF allows the definition of a non-ambiguous syntax, easing the implementation of a parser. A Syslog message is characterized by a severity and a facility, which, combined, make up a priority value. Severity is basically the importance of the message for analysis

purposes. Its value is between 0 and 7 inclusive, and these values are ordered according to the POSIX priority convention10. It ranges from emergency (0) to debug (7), which is the minimum. Facility is widely used to group messages coming from a certain kind of source, i.e. the kernel, the security system, the mail daemon, etc. The lowest facilities are reserved for user

code generating messages. The priority value is computed as priority = facility × 8 + severity. We do not agree with this convention, or at least with considering this value a priority: surely, a kernel emergency (0) message is very important and needs urgent care, but a kernel debug (7)

message cannot, in our opinion, be more important than an error in the mail system (19). Other information the message may or may not (but usually does) contain includes a timestamp in UTC + offset format11, the host name of the generator, the application name and PID, a message ID for easy grouping of similar messages, and a text part. Moreover, Syslog 2009 defines a special field for structured data, in which each element is identified by a key and contains key/value pairs. This extends the informational content of a message beyond the limits of the other fields, without requiring the programmer to format any additional data into the text part, which remains human-readable. There currently are some standard fields in the structured data whose semantics are defined in RFC 5424, but it is possible for developers to create their own standards. In order to avoid ambiguities when messages are transmitted over the Internet, a mechanism very close to namespaces is used to format KVPs. The key of a structured element (the SD-ID), in fact, must end with a valid SMI Enterprise ID (21) assigned by IANA (22) to enterprises. This is similar to enterprises using

10 The lower the number, the higher the importance
11 Unlike the old BSD Syslog, which contains the local time without a year indication

their website as part of XML namespaces. Currently, the University of Naples owns the SD-ID 8289, so all of our customized structured keys end with “@8289” and we define their semantics. An example of a valid Syslog 2009 message, with its parts explained, follows the grammar below.

The syntax of a Syslog message can be summarized by the following ABNF (20):



SYSLOG-MSG      = HEADER SP STRUCTURED-DATA [SP MSG]
HEADER          = PRI VERSION SP TIMESTAMP SP HOSTNAME SP APP-NAME SP PROCID SP MSGID
PRI             = "<" PRIVAL ">"
PRIVAL          = 1*3DIGIT ; range 0 .. 191
VERSION         = NONZERO-DIGIT 0*2DIGIT
HOSTNAME        = NILVALUE / 1*255PRINTUSASCII
APP-NAME        = NILVALUE / 1*48PRINTUSASCII
PROCID          = NILVALUE / 1*128PRINTUSASCII
MSGID           = NILVALUE / 1*32PRINTUSASCII
TIMESTAMP       = NILVALUE / FULL-DATE "T" FULL-TIME
FULL-DATE       = DATE-FULLYEAR "-" DATE-MONTH "-" DATE-MDAY
DATE-FULLYEAR   = 4DIGIT
DATE-MONTH      = 2DIGIT ; 01-12
DATE-MDAY       = 2DIGIT ; 01-28, 01-29, 01-30, 01-31 based on month/year
FULL-TIME       = PARTIAL-TIME TIME-OFFSET
PARTIAL-TIME    = TIME-HOUR ":" TIME-MINUTE ":" TIME-SECOND [TIME-SECFRAC]
TIME-HOUR       = 2DIGIT ; 00-23
TIME-MINUTE     = 2DIGIT ; 00-59
TIME-SECOND     = 2DIGIT ; 00-59
TIME-SECFRAC    = "." 1*6DIGIT
TIME-OFFSET     = "Z" / TIME-NUMOFFSET
TIME-NUMOFFSET  = ("+" / "-") TIME-HOUR ":" TIME-MINUTE
STRUCTURED-DATA = NILVALUE / 1*SD-ELEMENT
SD-ELEMENT      = "[" SD-ID *(SP SD-PARAM) "]"
SD-PARAM        = PARAM-NAME "=" %d34 PARAM-VALUE %d34
SD-ID           = SD-NAME
PARAM-NAME      = SD-NAME
PARAM-VALUE     = UTF-8-STRING ; characters '"', '\' and ']' MUST be escaped.
SD-NAME         = 1*32PRINTUSASCII ; except '=', SP, ']', %d34 (")
MSG             = MSG-ANY / MSG-UTF8
MSG-ANY         = *OCTET ; not starting with BOM
MSG-UTF8        = BOM UTF-8-STRING
BOM             = %xEF.BB.BF
UTF-8-STRING    = *OCTET ; UTF-8 string as specified in RFC 3629
OCTET           = %d00-255
SP              = %d32
PRINTUSASCII    = %d33-126
NONZERO-DIGIT   = %d49-57
DIGIT           = %d48 / NONZERO-DIGIT
NILVALUE        = "-"
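To make the grammar concrete, the following is a minimal, illustrative parser sketch (in Python, for brevity; the function and regular expression are ours, deliberately cover only the header fields, and omit the length limits and the structured-data and MSG productions):

```python
import re

# Illustrative RFC 5424 header parser: matches PRI, VERSION and the six
# space-separated header fields. NILVALUE ("-") is matched by \S+ as well.
HEADER_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>[1-9]\d{0,2}) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) "
    r"(?P<appname>\S+) (?P<procid>\S+) (?P<msgid>\S+)"
)

def parse_header(msg: str) -> dict:
    m = HEADER_RE.match(msg)
    if m is None:
        raise ValueError("not a valid RFC 5424 header")
    fields = m.groupdict()
    # Split the priority value back into facility and severity.
    fields["facility"], fields["severity"] = divmod(int(fields.pop("pri")), 8)
    return fields

h = parse_header('<165>1 1986-06-02T00:00:00.003Z rosy717.zighinetto.org '
                 'myProcess 5569 ID47 [x@1 a="b"] hello')
# h["facility"] is 20, h["severity"] is 5
```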

The following is an example of a valid and complete Syslog message:

<165>1 1986-06-02T00:00:00.003Z rosy717.zighinetto.org myProcess 5569 ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] BOMHello World! Today is the happiest day in history!!

The message begins with priority value 165, meaning that the facility is 20 (local use 4) and the severity is 5 (notice). The message was generated 3 milliseconds past midnight on June 2nd, 1986 by host rosy717.zighinetto.org12. The process myProcess, with PID 5569, generated the message. The message is labelled ID47, and there are some example fields in the documentation namespace, which do not mean anything to us. The BOM part is the UTF-8 byte-order mark, which is required by the standard. The message ends with the text part, intended for human use.

Logging protocols

In this paragraph, we are going to take a closer look at logging protocols.

The Microsoft Event Log protocol is based on a very simple mechanism: each Windows host runs an Event Log provider that is accessible via code using special system calls. When a new entry (according to the definition given earlier) is logged, the provider automatically writes it to the local log file or sends it to the specified remote host, according to the configuration. Applications can not only write to the Event Log, but also read it programmatically. There are C APIs that allow reading the messages currently stored in the event log. In .NET, it is possible to asynchronously listen for new log messages thanks to the framework's EventLog class, which fires an event once a message is logged to the specified log.

Syslog, more than a format, is a protocol. Log messages are not only meant to be locally stored in a file or database, but also to be directly forwarded to monitoring applications via the network. The Syslog protocol defines three types of entities: originator, relay and collector. The originator is the source of the messages; the relay is used to forward messages to monitoring/analysing clients, which are collectors. The relay is not supposed to alter the

12 Often you will not find fully qualified domain names as hostnames, especially if you work in LANs

contents of the Syslog messages, except in a few cases, such as timestamp or hostname adjustment when justified. For example, a good architecture for a Syslog-based system uses a single relay process on each hardware node, collecting messages from all local processes and forwarding them remotely. In such a case, applications do not need to set the timestamp or the hostname, which are known to the relay. The relay, if it knows details about clock synchronization and attached networks, can add extra information stating whether the reported timestamp is reliable with respect to synchronization to a remote clock, and can add the IP address the machine is currently using. The Syslog protocol itself does not enforce any security: instead, it is clearly stated that the Syslog protocol is not designed to resist common network attacks performed by malicious users, i.e. eavesdropping, forging, deletion, man-in-the-middle, DoS. Syslog also does not clearly define, at least in RFC 5424, the network protocols to use when remotely sending messages. The originator-relay-collector pattern can be implemented using any network transport protocol. The two RFCs following 5424 cover the delivery of Syslog messages via TLS (23) and UDP (24). Let us examine these two means of sending Syslog messages in depth, dealing first with UDP.
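The relay behaviour just described can be sketched in a few lines (an illustrative Python fragment; the field names and the helper are our own invention, not a Syslog or Logbus-ng API): a relay on the same node fills in the timestamp and hostname fields that local originators may leave as the NILVALUE “-”.

```python
import socket
from datetime import datetime, timezone

# Sketch of the relay adjustment described above: fill in NILVALUE ("-")
# timestamp and hostname fields, which local originators may omit.
def relay_adjust(msg_fields: dict) -> dict:
    adjusted = dict(msg_fields)
    if adjusted.get("timestamp", "-") == "-":
        adjusted["timestamp"] = datetime.now(timezone.utc).isoformat()
    if adjusted.get("hostname", "-") == "-":
        adjusted["hostname"] = socket.getfqdn()
    return adjusted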

UDP, as we all know, is a lightweight and unreliable protocol based on datagrams. Once a datagram exits the outbound network interface of a host, everything about its lifecycle is



unknown to the sender. UDP is widely used in all those applications that need performance, like VoIP, video conferencing, streaming, gaming, etc. The drawback of using UDP is that packets are subject to loss, and there are no means of retransmission other than by a higher-level protocol. Sending a Syslog message via UDP is extremely simple: just encode it into UTF-8 and send it directly to the destination; that is all. The rule is “one message per datagram”, and messages cannot span multiple datagrams, thus limiting the size of a Syslog-over-UDP message to 64KB, an extremely large size, actually more than the common applications we care about ever need. Obviously, as stated in RFC 5426, the main drawbacks are reliability and security.
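Both mechanisms are simple enough to sketch in a few lines (illustrative Python; the function names are ours). The first follows the one-message-per-datagram rule of RFC 5426; the second shows the octet-counting framing that the TLS transport, discussed next, prepends to each message.

```python
import socket

def send_syslog_udp(message: str, host: str, port: int = 514) -> None:
    # RFC 5426: one UTF-8 encoded message per datagram, no framing.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(message.encode("utf-8"), (host, port))

def frame_message(message: str) -> bytes:
    # RFC 5425 octet counting: the byte length in ASCII decimal digits,
    # then a space, then the UTF-8 encoded message itself.
    data = message.encode("utf-8")
    return str(len(data)).encode("ascii") + b" " + data
```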

TLS, instead, is a secure protocol. Running over TCP, it achieves reliability, but it also enforces end-to-end security by encrypting and signing data. Sending Syslog data over TLS is still simple, but first the source must establish a secure channel with the destination. This is achieved by TCP's 3-way handshake followed by the TLS handshake, with server-only or mutual authentication. After the TLS handshake, the source can start sending messages to the destination: it first sends the length of the message in bytes, as ASCII decimal digits, then a space character, and finally the message itself encoded in UTF-8. Right after each message comes the next one, with the same syntax. During a TLS conversation, a number of messages are expected to be sent, one by one, at the maximum transfer rate the network allows. The conversation ends with the source terminating the connection after the last message.

Open issues

We have examined the logging platforms and formats that are currently available on the market, highlighting features and drawbacks. The common element that arises from this analysis is that each vendor provides its own format, making it difficult to merge logs from several programs into a single system log without converting them all to a common format, with the possibility of losing data. Moreover, the deep differences in the logs' syntax and semantics force the design of an analysis


tool for each specific system. No tool yet exists that can be reused across different scenarios without deep re-design. Another open issue is the impossibility, for current logging frameworks, of performing a priori filtering and coalescence. These two operations have to be done after the log has been generated, on large collections of messages and with considerable computational effort. Performing prior filtering, resulting in a clean log, would be desirable. Moreover, filtering and coalescence are done at the discretion of the analysts, making it difficult to compare results from different experiments working on the same scenario. It would also be desirable to reuse, in whole or in part, the same analysis tool across several scenarios.

Fortunately, the Apache logging frameworks are based on pluggable components, so once a common format is found, it will be easy to adopt it for all applications that use one of these frameworks by simply re-configuring the runtime.



Design of the Logbus-ng project

The possibility of using a software bus for the collection and distribution of log messages to monitoring clients has been highlighted in (10), and it is now our objective to make it concrete through a software infrastructure that is separate from applications. We have designed and then developed a software framework to collect log messages generated according to some rules (9) and to distribute them to clients that subscribe to some form of channel. The starting formal specification of the software can be synthesized as follows:

Make a software platform for real-time collection and distribution of structured log messages using the publish-subscribe model. Also develop dynamically linked APIs to be used both by clients that produce and by clients that consume log messages
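The publish-subscribe model named in the specification can be illustrated with a deliberately minimal sketch (Python for brevity; the class and method names are ours, not the Logbus-ng API):

```python
from typing import Callable

# Minimal publish-subscribe sketch: subscribers attach a callback to a
# named channel; publishing delivers the message to every subscriber.
class LogBus:
    def __init__(self) -> None:
        self._channels: dict[str, list[Callable[[str], None]]] = {}

    def subscribe(self, channel: str, callback: Callable[[str], None]) -> None:
        self._channels.setdefault(channel, []).append(callback)

    def publish(self, channel: str, message: str) -> None:
        for callback in self._channels.get(channel, []):
            callback(message)
```

A real implementation must, of course, add channel filters, network transports and subscription management, which the rest of this chapter develops.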

We chose that our project's logging format had to ultimately be Syslog 2009, both in order to avoid log format chaos and because we found Syslog to be a linking point between multiple commercial frameworks. However, this choice alone did not satisfy us: even though we were developing brand-new software, we would not be able to collect legacy logs. The need to collect those logs to achieve better analyses is highlighted in (12), so we required the Logbus to support legacy and different-format logs by up-conversion to Syslog 2009. Believing there were no particular reasons to choose differently, we also required the possibility of extending the supported logging formats both inbound and outbound. This way, if an analysis program requires a different logging format, it may receive properly converted messages (think about reusing legacy code).



We chose to name it Logbus-ng, after the conventional name adopted in (9), with the “next-generation” suffix because we designed it from scratch. The final functional requirements of Logbus-ng are the following:
• Logbus-ng collects logs from local and remote Syslog sources (both the old and the new protocol)
• Logbus-ng must be configurable to collect logs in different formats, by using a pluggable component that translates the messages into a common format

• Logbus-ng must be configurable to automatically filter all incoming messages with a preconfigured Boolean criterion
• Logbus-ng must be able to forward all messages to other logging entities, including standard output (console), the file system and other Logbus nodes. Forwarders have to be pluggable components based on a common API
• Logbus-ng must be configurable with one or more dynamically loaded plugins that follow the service's lifecycle and are able to interact with it programmatically. The plugin APIs must be invoked the same way as the core APIs
• Logbus-ng has to offer remote clients the possibility of creating, deleting and subscribing to message channels. A message channel is bound to a Boolean filter, so that all (and only) the messages matching the filter are forwarded to all the clients subscribed to it. A channel also features a coalescence window that can be set for FFDA purposes, in order to forward only one message per time window

• Logbus-ng has to offer subscribing clients the possibility of choosing their preferred network protocol from a list of supported protocols, mainly including (23) and (24). Logbus-ng must use pluggable components to support new protocols
• Logbus-ng's APIs can be interfaced (inbound and outbound) with other existing logging frameworks by using pluggable components, in order to support legacy code already using different logging frameworks
Moreover, the following are the non-functional requirements:
• Portability: the bus node must run on the major hardware/software platforms, if not



any
• Effectiveness: the bus must work under heavy13 workloads, and any loss of log messages must be as small as possible
• Reliability: the bus must support reliable delivery of messages to the clients that require it, without impacting other clients' performance
• Fault tolerance/load balancing: the bus must be capable of being replicated or distributed on multiple nodes

In our architectural vision, the Logbus is divided into three main segments: the source segment, the core segment and the monitor segment, not to be confused with Syslog's originator, relay and collector (though they do support them). The source and monitor segments will be defined by proper interfaces based on design-by-contract and TCP/IP network protocols, in order to ease the implementation of APIs for the most common high-level programming languages. Since we have to support pluggable components, we must also make sure that developers have appropriate tools to develop components of their own, supporting different formats and protocols. This may be the case for legacy applications using proprietary protocols, for which the Logbus can perform a server-side conversion.

[Figure: heterogeneous sources (UNIX Syslog, a log4cxx-powered application, a Cisco router, the Windows Event Log, a log4net-powered application) feed the Logbus, which delivers messages to consumers such as an SQL DBMS, a viewer program and file-system storage.]

13 We will later quantify how heavy a workload can be


In order to support the multi-platform requirement, we had to adopt a commercially available virtual-machine platform: we chose C# as the language, for the Common Language Runtime platform, implementations of which are Microsoft .NET and (25) for POSIX operating systems.

Source-side interfaces

By requirement, Logbus-ng must be able to collect messages both from Syslog-compliant sources and from any other source, provided that there is an implementation of an inbound channel specific to the protocol in use. Before defining the source-side interface, we wanted to examine the current possibilities of remote logging with the Syslog protocol and the Apache logging frameworks. Beyond the “core” Syslog 2009, there are two specific RFC standards for transferring Syslog messages, via UDP (24) on port 514 or TLS (23) on port 6514. We also considered the fact that the syslog-ng (6) daemon can be configured to forward messages (all or part of them) to a remote Syslog host by ASCII encoding14, and to receive messages on port 514 as the classic protocol states. Moreover, the log4net (5) and log4j (4) frameworks are equipped with a RemoteSyslogAppender (26) (27) component that is capable of sending BSD Syslog-formatted messages to remote hosts, encapsulated in UDP datagrams. No handshake is required, and the remote host is always pre-configured in the logging application. When a logged event occurs, a datagram is sent to the log server. The choice between UDP and TLS must be dictated by the reliability requirement: while TLS also addresses security, we observed that almost no logging framework considers security a requirement, mostly because they run on LANs without the usual security threats of the Internet. Since supporting different logging protocols is a requirement, we will ensure that a proper code basis for pluggable components is available in the core Logbus.

Monitor-side interfaces

We believe there are no obstacles preventing us from using the Syslog-over-UDP (24) protocol in this case too, but we must set this aside for now. Requirements for monitor-side

14 Syslog-ng currently supports only the BSD Syslog format


APIs do not force the use of any specific protocol, but let the client decide which protocol to use, as long as the server supports it too. We must now analyse and solve the problem of defining a uniform subscription interface for clients. Our primary choice has been to use Web Services (28), to achieve the best interoperability. However, we do not want to restrict the interface to this technology, allowing future developers to implement the same interfaces using a different technology/middleware.

Log analysis' first phase is filtering, in order to exclude all those messages that do not provide useful information to analyses. While it is true that in human-driven analysis it is useful to keep a record of debug and similar messages, to aid in obtaining better context information, it is also true that it is infeasible for an automated analyser to gather useful information from unknown messages. In order to reduce network traffic and computational effort, we believe it useful for the core segment to perform prior filtering. We then define an outbound channel as a logical channel bound to a Boolean filter. A number of clients can subscribe to the same logical channel; all the clients subscribed to the same channel will receive (assuming no loss/drop of messages on reliable networks) all the messages that match the filter, and only those. Now a new functional requirement arises: a client must be able to create (and, why not, delete) outbound channels by providing a filter of its own. In order to define interoperable filters, we must define a set of standard filters, whose semantics and parameters are well known, in the XML (29) syntax, the same that Web Services are based upon. Let us examine a few of these filters, modelled on the Syslog 2009 format:
• True: accepts any message
• False: accepts no message
• And: composite filter that accepts a message if and only if all of its sub-filters match the message
• Or: composite filter that accepts a message if and only if at least one of its sub-filters matches the message



• Not: composite filter that returns the opposite of its sub-filter
• RegexMatch: the message is accepted only if its text part matches a given Regular Expression
• Severity: compares the message's Severity with the provided value and comparison operator. For example, a Severity filter defined as [>=, Warning] will accept only messages at least as important as Warning, i.e. of numeric severity less than or equal15 to 4
Alongside these default filters there is one more, a custom filter, that can be used through an identification tag and a free set of parameters, which are defined by design contract and not by the language. We have also imposed, for our C# implementation, that these custom filters be made as pluggable components: this has no effect on the interoperability requirement, because a developer who wants to implement the Logbus-ng core according to our specifications, but in his favourite language, can either hard-code all the custom filters he needs or do as we did. The full specifications for the filters have been translated into the XML Schema namespace

http://www.dis.unina.it/logbus-ng/filters, reported in Appendix Alpha. Concerning the delivery of log messages, we chose to support both UDP (24) and TLS (23) in our core release. The UDP approach is the classical approach to remote logging, as we found in other platforms. However, there are serious drawbacks to it:
• No congestion control: if a log source sends messages at an excessive rate (for more, consult our experiments chapter), a tangible network performance decrease can occur, together with packet loss and damage

• Unreliable delivery: UDP datagrams are not subject to retransmission or acknowledgement, so it is possible for the network to drop packets without either host noticing16
• No security: no authentication/encryption techniques are put in place, so delivery

15 In the UNIX priority convention, the same adopted by Syslog, a lower number means higher priority. We chose to design the operator according to the logical meaning of the comparison, but arithmetically it is the opposite
16 Actually, Syslog 2009 defines, within its extensions, a standard mechanism for message enumeration. As long as we expect all messages to be delivered (no dropping because of prior filtering), a missing sequence number is a symptom of packet loss


is subject to eavesdropping and man-in-the-middle attacks, involving the insertion of spurious messages into the channel, malicious editing and malicious dropping
• No ordering guaranteed: UDP datagrams are not ordered during delivery. Unless sequencing is used at a higher protocol level, or strict hypotheses on clock synchronization hold, the logical ordering of messages cannot be guaranteed
We can make the following observations by analysing real-life scenarios.

As far as reliability and congestion control are concerned, let us distinguish automated FFDA from a case we believe to be its mirror image, such as accounting logging. In the accounting case, it is obvious how the loss or damage of even a single packet can result in a smaller or bigger economic loss for the enterprise, so it is reasonable for the source of these logs to put every useful fault-avoidance mechanism in place; if we had to use the Logbus for such purposes, our choice would definitely be reliable transmission on the source side, and that is why we required that to be possible.

On the other hand, when analysing large, complex systems subject to high traffic rates, strict reliability requirements cause a significant performance decrease for the whole system, perhaps right when the logging infrastructure is under full workload pressure because of an ongoing fault; in this case, the law of large numbers suggests that this will not significantly affect the final result of the analysis. To better understand this, let us make a worst-case example: suppose that the network drops 1 packet out of 1000 in normal conditions (99.9% reliability) and 1 out of 5 (80% reliability) during a burst; then suppose that during a fault a large number of log messages is generated, flooding the network, and that, in our worst case, only one message from a single host reports the fault. Then the probability of losing trace of the fault is 1 out of 5. Also, suppose that the MTTR is 1 s and the mean number of faults is 50000/yr: with that loss rate we can detect up to 40000 faults/yr. The effective availability of the system is 0.99844, but we measure it as 0.99873. While the example is invented, it shows that unreliable delivery, even in disastrous cases (rather than worst cases), does not change FFDA results significantly, because the system's availability is still assessed at about 99.8%. Beyond theory, we can use practice to support our statements: most supercomputers are equipped with fast, dedicated network interfaces that are used only for logging, such as Myrinet (30), revealing how efficient it is to use UDP even with the risk of losing some packets. And UDP sockets are even faster than writing logs to disk!
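The arithmetic of the invented example can be checked directly (an illustrative calculation using the constants of the example: 1 s MTTR, 50,000 faults/yr occurring, 40,000 detected):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s

def availability(detected_faults_per_year: float, mttr_s: float) -> float:
    # Availability = 1 - (downtime per year / seconds per year),
    # where downtime is the number of detected faults times the MTTR.
    return 1.0 - detected_faults_per_year * mttr_s / SECONDS_PER_YEAR

real = availability(50_000, 1.0)      # about 0.9984 (all faults counted)
observed = availability(40_000, 1.0)  # about 0.9987 (1 report in 5 lost)
```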

Now the requirement of choosing the delivery protocol at runtime arises. This consists in allowing Logbus-ng to deliver messages to each monitoring client by adopting a protocol that is chosen by the client itself (and is obviously supported by the server). Such protocols can be unicast or multicast. It must be possible to add support for new protocols with pluggable components. This led us to design the C# core as a four-stage pipeline: inbound channels, hub, outbound channels, outbound transports. This is shown in the figure below.

[Figure: the four-stage pipeline. Inbound channels (e.g. UDP, TLS, Web Service) feed the hub; the hub forwards messages to the outbound channels; each outbound channel delivers matching messages through one or more outbound transports. A plugin component is also attached to the core alongside the inbound channels.]
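The pipeline can be summarized by a small sketch (illustrative Python; the real core is written in C#, and the class names here are ours):

```python
from typing import Callable

# Sketch of the four-stage pipeline: inbound channels call Hub.submit();
# the hub fans each message out to every outbound channel; each channel
# applies its Boolean filter and hands matches to its transports.
class OutboundChannel:
    def __init__(self, matches: Callable[[dict], bool]) -> None:
        self.matches = matches                            # the channel's filter
        self.transports: list[Callable[[dict], None]] = []

    def submit(self, msg: dict) -> None:
        if self.matches(msg):
            for deliver in self.transports:
                deliver(msg)

class Hub:
    def __init__(self) -> None:
        self.channels: list[OutboundChannel] = []

    def submit(self, msg: dict) -> None:                  # called by inbound channels
        for channel in self.channels:
            channel.submit(msg)
```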

In order to understand it better, let us make an example: we want to collect messages with Logbus-ng via UDP (24), via TLS (23) and from the Windows Event Log. Each inbound channel features its own implementation details, but all provide outbound Syslog messages (obviously, Windows logs must be converted), so each of these is correctly represented by an Inbound Channel circle in the previous graph. The hub receives messages17 and forwards them to each Outbound Channel, which is responsible only for determining whether a message should be sent to clients, according to its filter. Finally, the Outbound Transport takes care of performing the delivery, implementing the specific protocol chosen

by the client, thus hiding it from the stages that stand before it. Notice that instances of the transport manager are seen as multicast entities by the channel, i.e. each transport is bound to multiple clients and is seen as a single entity by the channel. For example, the TLS transport manager is instantiated only by those channels to which at least one TLS client is bound, but it will handle the TLS delivery of messages to any other client subscribing with TLS. What the transport manager actually does is perform a unicast transmission to each client. This is slightly different from performing a true multicast, and a real multicast transport manager will hide its protocol details the same way the unicast one does. If the channel were to manage the list of connected clients, it would also need to be aware of some protocol details about each client, and of which set of clients were part of a multicast group. We instead chose to hide everything and use delegation. Before translating these specifications into a WSDL interface definition, we must obviously define a robust protocol for client subscription, where robustness means the possibility of supporting protocol extensibility.

Before going ahead, let us quickly summarize the methods we surely need to implement:
• ListChannels, to enumerate the available channels to which one can subscribe
• GetChannelInformation, to get information about an existing channel
• CreateChannel and DeleteChannel, the meaning of which is obvious
• SubscribeChannel and UnsubscribeChannel, self-explanatory names
Surely, the SubscribeChannel method requires a channel ID as a parameter, plus the indication of the transport chosen by the client. What else? What parameters are required to subscribe a client to a Syslog or non-Syslog channel that we designers do not know about yet? Logic says to allow the client to send a collection of untyped parameters that are part of a specific design contract with the pluggable components involved in the transaction. So far, the only entity whose behaviour is not fully documented is the transport manager, which relies on the specific transport protocol it implements. Other entities, such as the hub and the channel, are strict in their specifications.

17 It will also receive channel subscription requests, but this is not shown here
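As an illustration of such a design contract, a hypothetical UDP transport plugin might validate the untyped key/value parameters like this (a Python sketch; the function name, parameter keys and return format are our own invention, not the Logbus-ng API):

```python
# Sketch of the untyped "input instructions" idea: the core passes the
# key/value pairs through untouched; the transport plugin enforces its
# own design contract and rejects subscriptions that violate it.
def subscribe_udp(params: dict[str, str]) -> str:
    required = {"ip", "port"}
    missing = required - params.keys()
    if missing:
        # Corresponds to the "design contract violation" fault below.
        raise ValueError(f"design contract violation: missing {sorted(missing)}")
    return f"udp://{params['ip']}:{params['port']}"
```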

Suppose we want to use UDP (24) as the transport; then these parameters are reasonably the IP address and the port number of the destination. While the IP address can easily be detected in a SOAP transaction, we stated that SOAP must not be the only way to subscribe, and it still gives no clue about the destination port. This case, in which the client sends parameters to the server, does not suffice to cover cases such as multicast UDP, in which the multicast group is created before the client joins and is known only to the server component; unicast UDP also reveals the inefficiency of such a choice. Let us examine these two cases separately. In the case of unicast UDP, because of the features of the protocol itself (31), it is impossible to detect a client fault18, and after the client becomes alive again, it may be unable to recover or cancel its old subscription. Rather than adding a spurious mechanism of keep-alive datagrams (sent to whom?), we chose to add an explicit RefreshSubscription method to the WSDL interface and a consequent time-to-live for the subscription itself, so that if the client does not refresh before the TTL expires, it eventually gets purged and

that if the client does not refresh before the TTL to expire, it gets eventually purged and

will not receive more datagrams. Refresh is not required by TCP-based protocols, because both peers are able to detect a loss of connection and implement a recover mechanism. The opposite occurs when running multicast: it is reasonable to think that it is up to the server to create the multicast group and to tell its address to the client. Here comes a new opportunity: to implement a set of protocols to help clients behind a NAT (32) overcome the limits imposed by TCP and UDP protocols. Since we determined

18 Without an explicit fault-detection mechanism that goes beyond the simple transport protocol, as opposite with TCP that embeds reliability 35

that there are parameters that need to be passed from the server to the client, returning the transport's endpoint among them is a good choice in reverse-connect protocols.
Let us now examine the final specifications of the methods needed to manage the subscription to an already-created channel, together with their parameters:
 SubscribeChannel
  o Input
    . Channel ID
    . Transport mechanism
    . List (0..N) of key/value pairs in string format representing what we called input instructions
  o Output
    . Client ID assigned by server
    . List (0..N) of key/value pairs in string format representing what we called output instructions
  o Exceptions
    . Channel does not exist
    . Unsupported transport
    . Design contract violation (expected different input parameters)
 UnsubscribeChannel
  o Input
    . Client ID
  o Exceptions
    . Client not subscribed
 RefreshSubscription
  o Input
    . Client ID
  o Exceptions
    . Client not subscribed (it may have been unsubscribed by timeout,


then it can be time to try subscribing again)
    . Transport does not support refresh
Full WSDL code is in Appendix Bravo.

Logging APIs

We already dealt with the analysis of existing logging frameworks. Since our objective is to make a software release, it makes no sense to offer the public software you just cannot use readily. This is particularly true in the case of logging APIs, because they provide the basic means for developers to start playing with Logbus-ng and experimenting with its capabilities. Logging APIs can be efficiently used for brand-new software or for legacy software not using an existing logging framework; if the software already exists and uses a logging format other than Syslog, components must be developed to interface it with Logbus-ng. Like log4net (5), the Logbus-ng logging APIs are based on an ILog interface:

public interface ILog
{
    // Methods
    void Alert(string message);
    void Alert(string format, params object[] args);
    void Critical(string message);
    void Critical(string format, params object[] args);
    void Debug(string message);
    void Debug(string format, params object[] args);
    void Emergency(string message);
    void Emergency(string format, params object[] args);
    void Error(string message);
    void Error(string format, params object[] args);
    void Info(string message);
    void Info(string format, params object[] args);
    void Notice(string message);
    void Notice(string format, params object[] args);
    void Warning(string message);
    void Warning(string format, params object[] args);

    // Properties
    ILogCollector Collector { get; set; }
    string LogName { get; set; }
}

One of the basic concepts behind the Logbus-ng APIs is that the interface between a client (whether source or monitor) and the bus is a component attached to the bus. Visually speaking, the puzzle-piece icon is often used in documentation to describe components that are linked to each other. Here we use it to highlight the abstractions they provide.

Source APIs are based on the concept of a collector, which is a component that, from the point of view of user code, collects logs. From that point of view, both the bus and clients are seen as collectors. ILog wraps the collector, providing high-level logging APIs. We defined the ILogCollector interface as follows:

public interface ILogCollector
{
    void SubmitMessage(SyslogMessage message);
}

You can only submit messages to log collectors. Since we want to provide general-use APIs, we did not only define collectors that actually send messages to Logbus-ng, but also components that redirect logs to the console or the file system. A log collector hides implementation details about the delivery of messages.

Client APIs, by contrast, are based on the concept of sources. Again, we look from the point of view of user code: Logbus-ng is a source of logs for it. The source's descriptive interface simply reports that a log message has been received, through an asynchronous call-back mechanism. ILogSource is defined as follows:

public interface ILogSource
{
    event EventHandler MessageReceived;
}

The way these interfaces are defined, they can easily be combined in logging applications that want to perform remote logging without a Logbus-ng server.
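As an illustration of how the two interfaces compose, here is a sketch of a forwarder that listens to any ILogSource and re-submits every received message to an ILogCollector. The event-args type and the generic event signature are assumptions for the example; the original interface uses a plain EventHandler.

```csharp
using System;

public class SyslogMessage { public string Text; }

// Assumed event-args type carrying the received message.
public class SyslogMessageEventArgs : EventArgs
{
    public SyslogMessage Message;
}

public interface ILogCollector { void SubmitMessage(SyslogMessage message); }

public interface ILogSource { event EventHandler<SyslogMessageEventArgs> MessageReceived; }

// Hypothetical glue class: every message received from the source is pushed to the collector.
public class Forwarder
{
    public Forwarder(ILogSource source, ILogCollector sink)
    {
        source.MessageReceived += (s, e) => sink.SubmitMessage(e.Message);
    }
}
```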

Our logging API automatically fills the message with the most common information of the Syslog protocol. One of the differences between our API and the log4net API is exception support: with log4net, logging methods have an explicit parameter dedicated to exceptions, while ours do not. Not only are we referring to the Syslog 2009 format, which has no native support for the concept of exception until one is standardized into structured data, but we also believe that, since exceptions are a language-dependent construct, they should not be part of the high-level API. On the other hand, logging exceptions is very easy, because their message and context information can be part of the log message. We suggest that developers, when catching an unwanted exception, log that event with a higher severity, briefly describing what happened (e.g. "cannot connect to remote host", "unable to access data"), and then log a debug message with the details of the exception that has already been thrown. This will be helpful when performing offline log analysis for debugging, but will also allow monitors to discard the verbose output of the error details and focus on the real fault.

Core design

[Class diagram omitted: it shows the TransportFactory utility (CreateTransport() : IOutboundTransport), the ILogSource and ILogCollector interfaces, IOutboundTransport, ILogBus (Start, Stop, AddInboundChannel, RemoveInboundChannel, AddOutboundChannel, RemoveOutboundChannel), IInboundChannel (Start, Stop, ReceiveMessage() : SyslogMessage), IOutboundChannel (SubscribeClient() : Subscription, UnsubscribeClient, RefreshClient, GetChannelInfo, GetDescription), the SyslogUDPReceiver and SyslogTlsReceiver implementations, and the IFilter interface (IsMatch(Message : SyslogMessage) : boolean).]

The above class diagram provides an overview of how the Logbus-ng core is structured. We voluntarily omitted from the diagram most of the classes that were actually created in the final project, leaving the most important components in clear view. We also did not enumerate all the public methods of the interfaces, otherwise the diagram would have become too complicated to read. The two main interfaces, specular to each other, are ILogCollector and ILogSource. They provide the two-way communication abstraction with the software bus, mainly represented by ILogBus. In this diagram, we only have a view of the core segment of the Logbus architecture as we defined it earlier. Also, the IFilter interface is implemented by several filters in the main package, defined according to the specifications we already provided. ILogBus is bound both to a number of inbound channels, represented by IInboundChannel, and to a number of outbound channels, represented by IOutboundChannel, with the former being instantiated via configuration and the latter being created by monitor clients. Each segment in the chain of responsibility is seen as a collector by the previous stage and as a source by the next one in charge, as in the diagram below:

Source application → Local proxy → Relay receiver → Relay hub → Relay outbound channel → Local proxy → Destination application

Let us now examine how the Logbus-ng core works when performing the two most important operations: channel creation and client subscription. Usually, clients that are interested in certain events first create a channel with their own filter, and then subscribe to it. This design is adopted in the monitoring APIs.


[Collaboration diagram omitted: the Logbus pipe (Inbound Channel, Logbus hub, Outbound Channel) is drawn vertically, with SubmitMessage traffic flowing through it; transversally, 1: the Client requests CreateChannel on the Channel Manager, 2: the Channel Manager forwards CreateChannel to the Logbus hub, 3: the hub calls CreateChannel on the Channel Factory, 4: the hub performs AddChannel on the new outbound channel.]

The above collaboration diagram shows the channel creation phase. The Logbus pipe is displayed vertically, transversal to the set of entities actively involved in the channel creation process. The client requests the creation of the channel, with the APIs described earlier, and this request is directed to the Logbus hub through the Channel Manager object, which acts as the SOAP skeleton of the Logbus hub itself (they implement the same interfaces). The Logbus hub requests that the channel factory create the new channel, then appends the returned object to the list of active channels, while the pipeline is still running. The transaction is fully synchronous, but message delivery is concurrent and independent.
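The hub's side of this transaction can be sketched in a few lines. The entity names follow the diagram; the member signatures and the IChannelFactory abstraction are assumptions for the example, not the actual Logbus-ng API.

```csharp
using System.Collections.Generic;

public interface IOutboundChannel { string Id { get; } }

// Assumed factory abstraction, standing in for the Channel Factory of the diagram.
public interface IChannelFactory
{
    IOutboundChannel CreateChannel(string id, object filter);
}

// Sketch of the hub side of channel creation: the call is synchronous,
// but the channel list is guarded by a lock so that message delivery
// can keep running concurrently while the new channel is appended.
public class LogbusHub
{
    private readonly IChannelFactory _factory;
    private readonly List<IOutboundChannel> _channels = new List<IOutboundChannel>();

    public LogbusHub(IChannelFactory factory) { _factory = factory; }

    public void CreateChannel(string id, object filter)
    {
        IOutboundChannel channel = _factory.CreateChannel(id, filter);
        lock (_channels) // the pipeline keeps delivering while we append
        {
            _channels.Add(channel);
        }
    }
}
```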

After performing channel creation, subscription occurs:


[Collaboration diagram omitted: the Logbus pipe (Inbound Channel, Logbus hub, Outbound Channel) is again drawn vertically, with SubmitMessage traffic flowing through it; transversally, 1: the Client requests Subscribe on the Channel Subscriber, 2: the Channel Subscriber forwards Subscribe to the Logbus hub, 3: SubmitMessage keeps flowing, 4: the Outbound Channel calls GetFactoryForProtocol on the Transport Factory Helper, 5: CreateTransport on the returned TransportFactory, 6: addTransport on the Outbound Channel, 7: Subscribe is delegated to the Transport, 8: UDP datagrams start being sent to the subscribed client.]

This phase involves a different SOAP proxy, the Channel Subscriber (properly, the Channel Subscription Manager), which still delegates its calls to the hub. This time, the channel already exists, so the hub proceeds by delegation to the channel. If this is the first subscription for the channel (true if the previous diagram applies to the sequence), or in general when no client has ever subscribed to the current channel with the transport protocol requested by the current one, then the Outbound Channel asks the general helper of transport factories to return the transport factory for the protocol chosen by the client, if supported; it then asks the factory to create a transport manager, to which the subscription request will be delegated. After that happens, the asynchronous delivery of messages continues, now extended to the just-subscribed client.
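The delegation just described can be sketched as follows. The helper and factory names follow the text; the method signatures and the per-protocol caching dictionary are assumptions, not the actual Logbus-ng implementation.

```csharp
using System.Collections.Generic;

public interface IOutboundTransport
{
    string Subscribe(IDictionary<string, string> inputInstructions);
}

public interface ITransportFactory { IOutboundTransport CreateTransport(); }

// Assumed registry of transport factories, standing in for the Transport Factory Helper.
public class TransportFactoryHelper
{
    private readonly Dictionary<string, ITransportFactory> _factories =
        new Dictionary<string, ITransportFactory>();

    public void Register(string protocol, ITransportFactory factory)
    {
        _factories[protocol] = factory;
    }

    public ITransportFactory GetFactoryForProtocol(string protocol)
    {
        return _factories[protocol]; // throws if the transport is unsupported
    }
}

// Sketch of the outbound channel: one transport manager per protocol,
// created lazily on the first subscription that requests it.
public class OutboundChannel
{
    private readonly Dictionary<string, IOutboundTransport> _transports =
        new Dictionary<string, IOutboundTransport>();
    private readonly TransportFactoryHelper _helper;

    public OutboundChannel(TransportFactoryHelper helper) { _helper = helper; }

    public string SubscribeClient(string protocol, IDictionary<string, string> input)
    {
        IOutboundTransport transport;
        if (!_transports.TryGetValue(protocol, out transport))
        {
            // First subscription with this protocol: obtain the factory and create the transport.
            ITransportFactory factory = _helper.GetFactoryForProtocol(protocol);
            transport = factory.CreateTransport();
            _transports.Add(protocol, transport);
        }
        // The subscription request is delegated to the transport manager.
        return transport.Subscribe(input);
    }
}
```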

Another interesting aspect of Logbus-ng we would like to focus on is the Syslog-over-TLS (23) protocol, complete with channel subscription, described below:

Logbus-ng client / Logbus-ng server exchange:

1. The client opens a TLS server.
2. Client → server: SubscribeChannel(channelId, clientHost, clientPort).
3. The server creates a TLS client and performs TCP connect(clientHost, clientPort).
4. TLS handshake: TLS hello, TLS hello + TLS server certificate, and the rest of the handshake messages (omitted).
5. SubscribeChannel returns clientId.
6. The server sends log messages over the connection (Send Log, repeated).
7. Client → server: UnsubscribeChannel(clientId).
8. The server closes the TLS connection (TLS bye / TCP FIN in both directions).
9. UnsubscribeChannel returns.

It is a quite complex protocol. The original Syslog-over-TLS protocol requires that the originator/relay be a TLS client and the collector a TLS server. But, in our perspective, the roles of client and server are reversed! The monitor client has to first open a TLS server (and it must be provided with an SSL certificate, as required by the transport protocol), then request the subscription to the Logbus server, which acts as TLS client, establishes a connection to the Logbus client, performs the TCP and TLS handshakes, and finally starts sending logs according to the well-known protocol. When the client wants to terminate the conversation, it can ask the Logbus server to unsubscribe it, thus causing the server to terminate the TLS connection.

Plugin system

One of the requirements of Logbus-ng is to support dynamically-loaded plugins. A plugin is an entity that is transversal to the pipeline and is directly hooked to the hub component of the core segment.

[Diagram: Inbound channel → Hub → Outbound channel, with plugins directly attached to the Hub.]

The purpose of a plugin is to support Logbus-ng in its life-cycle and to provide advanced services to clients via the same Web Services APIs used by the core Logbus-ng. The implementation of a plugin is platform-dependent, so we will analyse how C# plugins are implemented for the current version of Logbus-core in a dedicated paragraph, together with examples. The most important point is that each plugin can define a WSDL binding that is automatically made available, together with Logbus-ng's native WSDL API, when the web service application is activated.


Implementation of Logbus-ng

In this chapter, we are going to describe the implementation phase of Logbus-ng, started after planning and early design.

Logbus-ng is currently available as an open source project on Sourceforge.net, at the address https://www.sourceforge.net/projects/logbus-ng. People interested in the source code can either download the released packages, which contain source code and binaries, or check out one of the various projects via Subversion, available at the URL https://logbus-ng.svn.sourceforge.net/svnroot/logbus-ng

Overview of the Common Language Runtime (.NET/Mono) platform

The Common Language Runtime platform was launched in 2002 by Microsoft with the goal of creating a common execution platform on which software works regardless of the hardware architecture and operating system. We know, indeed, that an executable program is compiled into machine instructions that depend both on the architecture type and on the operating system on which it must operate: by architecture we mean not only the CPU instruction sets but also the conventions on addressing, interrupts and I/O; moreover, each executable program can only work on the kernel for which it was compiled, except for compatibility between successive versions of the same kernel. Furthermore, even if you can simply recompile a program to make it run on the same kernel but on a different CPU architecture, this does not mean that porting, i.e. adaptation for use on another OS, is possible without the involvement of developers.

To overcome this obstacle you can use a virtual machine, which is a software layer put between user code and the kernel that allows running the same code on any architecture for which there is an implementation of the VM. The VM must translate the source language, or an appropriate bytecode, into assembly instructions, including kernel system calls when needed (allocating memory, I/O, inter-process communication, mutual exclusion). Like Java, the .NET Framework initiative consists not only of a virtual machine based on a stack-oriented CPU, but of an entire library that implements everything from the most elementary features to many advanced programming facilities, from disk and network I/O to threading, an entire framework for building graphical interfaces (Windows Forms) and even one for web applications (ASP.NET). In the CLR, as in Java, all types derive from System.Object, multiple inheritance is not supported (but you can define and implement all the interfaces you want), types are either value types or reference types, and the linking between libraries is dynamic. Unlike Java, you can (if the language allows it) use pointers. However, sections of code that use pointers are specially tagged as unsafe, and the dynamic class loader could refuse, especially if the source code is unauthenticated and remote, to run a program with such sections. The CLR platform, finally, natively supports events as a language construct, allowing the programmer to implement the observer design pattern. Support for events is provided by delegate types, which are an abstraction of C++'s function pointers, so that an event handler is registered not as an interface (typical of Java), but as a delegate.
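The delegate/event construct just mentioned can be shown in a few lines of plain C#, unrelated to Logbus-ng itself; the class names here are invented for the example.

```csharp
using System;

public class Thermometer
{
    // The event is backed by a delegate type; handlers are methods, not interface implementations.
    public event EventHandler<EventArgs> Overheated;

    public void Measure(int celsius)
    {
        if (celsius > 90 && Overheated != null)
            Overheated(this, EventArgs.Empty); // notify all registered observers
    }
}

public class Demo
{
    public static void Main()
    {
        var t = new Thermometer();
        // A lambda is registered as the handler: the observer pattern with no interface boilerplate.
        t.Overheated += (sender, e) => Console.WriteLine("overheated!");
        t.Measure(95); // prints "overheated!"
    }
}
```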

The C# language is one of the languages initially introduced with the platform, along with Visual Basic .NET and J#, to give programmers the first instruments to produce applications. C# is an ECMA (33) and ISO (34) standard, and has been used to implement the Logbus-ng core and APIs.
Currently, software developed for the .NET platform from Microsoft can work not only on Windows systems, but also on mobile devices running Windows Mobile (now Windows Phone), thanks to the Compact Framework, and to some extent on the Xbox 360 console, thanks to the XNA Framework, which specifically allows DirectX games to run on both Windows and Xbox 360 as Arcade games.

As regards the Mac OS X operating system and all the other UNIX-family OSes, thanks to the opening of the specifications, ports of the original Microsoft .NET were eventually made, including DotGNU (35) of the Free Software Foundation, and Mono (25), initially created by Miguel de Icaza and then fully managed and sponsored by Novell.
One of the great advantages of the CLR platform, and in particular of Mono as an implementation, compared to Java, comes from compiling bytecode into "native code" before execution (ahead of time) rather than at runtime (just in time): in Java, the JIT compiler is the only available compiler, while .NET for Windows performs an AOT build that expires upon machine reboot, and Mono supports a special compilation flag that creates, for any executable or library you pass as an argument, a shared object containing the result of a standing AOT compilation.
One last mention goes to ASP.NET for developing web applications: this technology is the evolution of Active Server Pages, the widespread and much-discussed Microsoft platform for Web applications based on the proprietary interpreted language VBScript (the reason why it was always criticised). ASP.NET was born with the idea of obsoleting ASP, while preserving a certain level of backwards compatibility, using bytecode for performance and security. In this regard, we now explain the benefits of the safety of managed code: the most important is the rigorous control of pointers, which prevents your code from containing exploitable vulnerabilities for accessing memory areas not belonging to the object in use, and therefore prevents attackers from attempting buffer overflows with specially-forged input data. The


fact that the bytecode is compiled under JIT/AOT rules, instead of being interpreted every time, ensures much better performance.
The ASP.NET library is contained in the System.Web assembly and contains all the classes necessary to build Web applications on the model of HTTP Handlers (servlets) or Web Forms. An HTTP Handler is a class that responds to HTTP requests to a specific "page" (with default extension .ashx), much like Java Servlets and PHP scripts. A Web Form is the web equivalent of a Windows Form: the framework provides classes that represent the main components of an interactive HTML page, such as text boxes, tables and buttons, in the form of objects controlled programmatically; each class can also trigger events that can be handled by user code to perform interactions (e.g. when the login button is pressed, compare user name and password against the database and return the result in a textual label). A Web Form usually has the .aspx extension and is a far more sophisticated instrument than classical JSP, when the latter is not expanded with a graphical development framework (tag library). The tools to develop Web Services are provided by the System.Web.Services assembly, kept separate from System.Web to lighten the dependencies for clients that only use proxies. The skeleton of a Web Service consists of a class that inherits from System.Web.Services.WebService and is mapped to a .asmx file in the execution directory. The WSDL is generated dynamically by the ASP.NET runtime, at each request to the skeleton, using introspection (reflection) on the methods and parameters of the class (perhaps this is a point of weakness, as it was noticed that ASP.NET often binds XML namespaces to random prefixes, making processing by less sophisticated client libraries more challenging).
In order to develop Logbus-ng, we had to build a WSDL interface for monitors to ensure portability of the code. We now report on an innovative technique (36) that allows the developer to run an entire Web server without having to deploy his software within an HTTP/application server stack such as Microsoft Internet Information Services. The technique is based on the fact that the ASP.NET Framework is entirely developed in managed


code and contained entirely within the System.Web assembly, and the application server only uses the classes in System.Web to run the ASP.NET pipeline. To make an HTTP application server you must first start a web server; indeed, you can simply use some specific classes of the System.Net namespace that actually wrap the HTTP protocol over TCP sockets, and this server will then be able to forward the HTTP requests it receives to the ASP.NET pipeline, like IIS does. One difficulty is inherent in the fact that such a server must start within a security context other than the user's application, called an application domain (AppDomain), and each object residing in a domain different from the caller's is actually a proxy that alters the standard method-call mechanism so that even reference types are passed by value: for this reason, there must be at least one object "resistant to the transition between domains", inheriting from the MarshalByRefObject marker class, which guarantees reference exchange when the object is passed as a parameter to a method.
Currently, our prototype web server/application, based on the template provided in the article (36), has three limitations: it does not run under Mono, it requires super-user privileges and

it does not optimize performance through multithreading. To work around the first problem, in addition to reporting the problem to the Mono development team, we created a classic web application that works with Apache/mod_mono, so that it instantiates the Logbus-ng server directly in its context (and requires root privileges only if it is configured to listen on port 514). We are currently working on extending this to Mono by using the Mono.WebServer2 assembly, as suggested by some developers in a chat session.

XML configuration

One of the aspects of the Logbus-ng suite we took particular care of in our design is configuration. Since the beginning, we required Logbus-ng to support pluggable components for most of its parts: we support them by making wide use of reflection, a kind of type introspection implemented in .NET that allows instantiating an object of a type that is unknown to the running method until runtime, or even not linked at all with the running assembly. Reflection is based on dynamic class loading, but we will not go deeper into its details: instead, let us focus on the technique we used to ease configuration.


The Common Language Runtime's default means of providing configuration to a program is a special App.config file (Web.config for web applications), which is basically an XML file that is programmatically accessible. When compiling the .NET application, this file gets renamed to AppName.exe.config and is ultimately bound to the executable file19. The contents of the XML configuration can be accessed through the System.Configuration.ConfigurationManager class, which provides some abstractions over the XML markup: in particular, ConfigurationManager makes it easy to retrieve an entire set of key/value pairs of application configuration parameters in a special section, or to retrieve a whole XML node under the root by its name, called a configuration section. A configuration section (here comes the potential of .NET) can be bound to a class through a mapping in the same XML file. To be clearer, a special configSections node is defined in the App.config file like the following (we simplified the syntax for easy reading):
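The fragment being referenced can be sketched as follows. The section name and the handler class name come from the text; the assembly-qualified type string and the surrounding structure are assumptions, in the simplified spirit announced above.

```xml
<configuration>
  <configSections>
    <!-- maps the "logbus-core" section to its handler class
         (assembly name here is a placeholder) -->
    <section name="logbus-core"
             type="LogbusCoreConfigurationSectionHandler, Logbus.Core" />
  </configSections>

  <logbus-core>
    <!-- core configuration markup goes here -->
  </logbus-core>
</configuration>
```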

The meaning of this node is very simple: the "logbus-core" node is a configuration section, mapped to the LogbusCoreConfigurationSectionHandler class, so when retrieving the "logbus-core" section via ConfigurationManager's GetSection() method (which returns Object), the XML markup will be filtered by that class, which must be either an IConfigurationSectionHandler or a ConfigurationSection. The behaviour of these two is slightly different: if you inherit from ConfigurationSection, GetSection() will return an instance of your class, while if you implement IConfigurationSectionHandler, your class is supposed to parse the XML and return an object populated with data (and it is up to the caller to cast it properly).

19 .NET allows, using some tricks, to define a configuration file for DLLs too, which are class containers and not executable programs, but this practice is strongly deprecated by both Microsoft and the community


We chose the interface approach, and we will return to this choice later.

Our problem is now to map XML markup into a class in the easiest way. We immediately believed XSD to be the most straightforward way, as an XSD-to-C# compiler exists: xsd.exe. We then defined the syntax and semantics of the configuration sections in XSD format, used xsd.exe to compile the schema into C# code, and finally got lots of classes as a result. These auto-generated classes, whose code we do not want to reproduce, are full of C# attributes that define the mapping between properties and elements in the XML syntax. What are they for? First of all, C# attributes are classes that can be used to "decorate" a class or one of its members. Attributes are of a type that inherits from System.Attribute and, like other classes, expose public properties that can be set from the constructor or with an explicit-initialization construct that is part of the C# syntax and that we will not discuss. Attributes are only accessible via reflection, once you have a reference to an object or its type. For XML serialization purposes, scanning the attributes of a type is needed to determine the mapping between properties and XML elements: when you define a property in a C# class, you cannot otherwise specify whether it is mapped to an XML attribute or an element, and under what name. XML serialization attributes also help handle fields when the property is null, and define the XML namespace that elements belong to.
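The attribute-driven mapping can be illustrated with a tiny hand-written class (the xsd.exe-generated ones look similar, only far more verbose); the names here are invented for the example, not taken from the Logbus-ng schema.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

[XmlRoot("listener")]
public class ListenerConfig
{
    // Serialized as an XML attribute named "type".
    [XmlAttribute("type")]
    public string Type;

    // Serialized as a child element named "port".
    [XmlElement("port")]
    public int Port;
}

public class Demo
{
    public static void Main()
    {
        // One instruction each way: object -> XML, then XML -> object.
        var serializer = new XmlSerializer(typeof(ListenerConfig));
        var writer = new StringWriter();
        serializer.Serialize(writer, new ListenerConfig { Type = "udp", Port = 514 });
        string xml = writer.ToString();

        var roundTrip = (ListenerConfig)serializer.Deserialize(new StringReader(xml));
        Console.WriteLine("{0} {1}", roundTrip.Type, roundTrip.Port); // prints "udp 514"
    }
}
```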

These compiler-generated classes are used by .NET's XML serialization facility, which is able to translate a class into XML and vice versa using only one instruction! The reason why we chose the interface approach for the section handler is to avoid touching the compiler-generated code, instead delegating a separate class to call the XML serializer. The following code fragment shows an example node of the Logbus-ng server configuration.
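The configuration fragment being discussed has roughly this shape. The element and attribute names below are a sketch inferred from the surrounding description, not the exact syntax of the schema (which is available in Appendix Charlie); only the listener types and port numbers come from the text.

```xml
<logbus-core>
  <in-channels>
    <!-- Syslog-over-UDP listener on the default port -->
    <channel type="SyslogUdpReceiver">
      <param name="port" value="514" />
    </channel>
    <!-- second Syslog-over-UDP listener on a custom port -->
    <channel type="SyslogUdpReceiver">
      <param name="port" value="7514" />
    </channel>
    <!-- Syslog-over-TLS listener on its default port -->
    <channel type="SyslogTlsReceiver">
      <param name="port" value="6514" />
    </channel>
  </in-channels>
</logbus-core>
```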


In this example fragment, we wanted to configure the Logbus-ng core with two Syslog-over-UDP listeners, on the default port (514) and on port 7514 respectively, and another Syslog-over-TLS listener on its default port (6514). The semantics of each param element is actually defined by design contract, allowing us to configure any kind of option for each pluggable component. Even components shipped with Logbus-ng are treated as pluggable components, like third-party ones, except that we allow a special short format for their class names, without namespace and assembly, which is automatically completed by the code with default values.

The following code fragment shows how easy it is to parse a configuration section from C# with such a well-designed schema.


object IConfigurationSectionHandler.Create(object parent, object configContext, XmlNode section)
{
    try
    {
        return new XmlSerializer(typeof(LogbusCoreConfiguration)).Deserialize(new XmlNodeReader(section));
    }
    catch (InvalidOperationException)
    {
        return null;
    }
}

The above fragment deserializes the content of the section object, which corresponds to XML markup similar to the fragment above.

If we ever wanted to do something similar in Java, we would have had to parse the XML syntax manually, which is an error-prone approach. In C#, thanks to xsd.exe, we get clean code and fulfil the DRY20 principle. In fact, the XML syntax is defined only once, in the schema, and the rules to map it to C# classes are created by an automated compiler that takes the schema as an input. We just have to make sure that every time the schema gets changed, it also gets recompiled into C#. The schema is available in Appendix Charlie.

Determining a host's IP address in "Connect-to-me" protocols

We already mentioned that monitor clients can choose among different protocols for the delivery of log messages, the main ones being Syslog over UDP (24) and over TLS (23). These two protocols assume that the Logbus client is a Syslog server. While we perform subscription using SOAP, which in our case is based on HTTP and has perfect knowledge of the initiator's IP address, we deliberately chose not to consider this information when

delegating the transport manager to process the request. The figure below shows the chain of responsibilities when performing the SubscribeClient request:

Client → WSDL proxy → ASP.NET skeleton → Logbus core → Channel → Transport manager

Orange entities are aware of IP addresses as contextual information. Red entities will not

20 Do not Repeat Yourself! This pattern discourages repeating code or markup in multiple places in a project to avoid potential errors 53

Logbus-ng: a software logging bus for Field Failure Data Analysis in distributed systems

know the IP address of the originator unless explicitly passed as parameter. The reason why we keep these other entities unaware of IP address is portability: if you want to per- form subscription using a different protocol, or programmatically, or on networks different from IP21, you cannot rely on an explicit IP address. Another reason why we chose to hide the initiator’s IP address is the possibility to send logs to another network host, maybe us- ing a faster connection. So we had to implement the transport is design contract as a con- nect-to-me protocol: it means that the originator explicitly declares the IP endpoint to use

as the destination in the level-5 PDU22. This translates into using the transport input parameters to configure the endpoint. In this paragraph, we deal with the problems related to sending UDP datagrams and opening TCP connections that we found during development, and for which we implemented a solution.

We already know that it is impossible for clients behind a NAT to receive inbound UDP

datagrams unless the NAT is configured for port forwarding or is UPnP-enabled; this problem only concerns IPv4, since IPv6 does not use address translation and every IPv6 address is a public address. We will now assume, in our examples, that communication between Logbus and clients is possible according to the network topology. All programming languages, including C#, provide means to retrieve the local machine's IP addresses with a simple code fragment, conceptually equivalent to the UNIX ifconfig command:

using System.Net;
IPAddress[] a = Dns.GetHostAddresses(Dns.GetHostName());

It returns more than one address, since a host can be assigned multiple addresses when it is equipped with multiple network cards or uses IP mobility protocols. The problem we found lies in the correct choice of one of the many possible IP addresses.
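As a sketch of the address-selection problem, and of the routing-table query adopted later in this section, the following fragment first enumerates all local addresses and then lets the kernel pick the route-specific one by "connecting" a dummy UDP socket towards the remote endpoint. The helper class and method names are ours; Dns, Socket and related types come from the .NET base class library.

```csharp
using System.Net;
using System.Net.Sockets;

static class LocalAddressHelper
{
    // Naive choice: first non-loopback address returned by DNS.
    // This is exactly the criterion that proves unreliable when
    // the host has multiple network interfaces.
    public static IPAddress FirstNonLoopback()
    {
        foreach (IPAddress addr in Dns.GetHostAddresses(Dns.GetHostName()))
            if (!IPAddress.IsLoopback(addr))
                return addr;
        return IPAddress.Loopback;
    }

    // Smarter choice: "connect" a dummy UDP socket to the remote
    // endpoint (no datagram is actually sent) and read back the local
    // address the kernel bound it to, i.e. the one matching the
    // kernel's routing table for that destination.
    public static IPAddress ForDestination(IPAddress remote)
    {
        using (Socket s = new Socket(remote.AddressFamily,
                                     SocketType.Dgram, ProtocolType.Udp))
        {
            s.Connect(new IPEndPoint(remote, 7)); // any port will do
            return ((IPEndPoint)s.LocalEndPoint).Address;
        }
    }
}
```

Connecting a UDP socket performs no network I/O; it only asks the kernel for a route, which makes it a cheap way to query the routing table from managed code.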

21 You would need to use a different Logbus transport protocol in this case
22 Protocol Data Unit


A greedy approach involves choosing the first IP address in the list, but this choice has proven to be definitely wrong because, in most cases, the first available address is the loopback address 127.0.0.1 (::1 in IPv6). Even choosing the first non-loopback IP, we might encounter problems.

Let us consider the following network topology as an example of things working fine:

[Figure: client and server attached to the same network, 45.0.0.0/8 — server with addresses 127.0.0.1 and 45.2.6.86, client with addresses 127.0.0.1 and 45.13.2.86]

Client and server are both in the same subnet, and the client belongs only to it. Choosing the only non-local IP, 45.13.2.86, is the right choice in this case, as the server can reach that address. If both client and server are connected to the Internet and have WAN addresses, this technique works.

Let us consider a pair of specular cases, a bit more complicated, where the client has multiple network interfaces of different kinds:

[Figure: server with addresses 127.0.0.1 and 45.2.6.86 on network 45.0.0.0/8; client with addresses 127.0.0.1, 192.168.6.56 and 39.66.0.213, attached both to the LAN 192.168.6.0/24 and to the WAN 39.66.0.0/16]



In the first case, represented by the diagram, the client has both LAN (192.168.6.0/24) and WAN addresses, and the server is connected to the Internet. In this case, choosing the WAN address as the favourite address makes the client able to receive datagrams, but this works only as long as we assume that the server can access the WAN too. The specular case, not shown in the diagram, is that in which the server is connected to the LAN and only to it, a common situation for log servers (usually to enforce security by fault avoidance). In this case, choosing the WAN address is a bad choice, as the

server will not be able to reach it. Since there is no unique criterion to choose the IP address to use, we understood that the only viable approach is a smart choice of the address, made either by pre-configuration or by dynamically querying the routing tables. Two problems arise here: the first is avoiding changes to the external interface of ILogClient, the basic interface for Logbus monitoring; the second is activating the lowest-level routing mechanism from C#, possibly remaining within managed code. By doing so we can obtain a local IP address that is a function of the

remote IP address; but, at the same time, if our subscription manager is bound to a protocol different from SOAP/HTTP, we can only try to choose the best WAN address and hope it works. We must not forget that the Logbus-ng APIs are made of interfaces that developers are free to implement in their own way, in order to achieve extensibility over new protocols. Our criterion is an opportunistic scan using reflection: if the client subscriber is our

SOAP/HTTP proxy, we can read the endpoint's URL and get the remote endpoint's address. In order to force the kernel to perform a routing attempt, we create a fake socket, try to connect it to that address, and then read which local address the socket is bound to. Our approach is opportunistic because, if anything in this procedure fails, we fall back to the old criterion of the best WAN address available.

Running a web application from inside a console application

Another important aspect of Logbus-ng is that it is designed to run the SOAP over HTTP protocol (the basis for Web Services) as a server. This involves running an entire web application which is conceptually independent from the Logbus core, and it is easier said than done. The good news is that the ASP.NET pipeline is 100% managed code, so nothing prevents ASP.NET from being run outside the scope of a web server like IIS or XSP (which is basically a console application that spawns a web/application server, nothing more). One of the main features of ASP.NET is that it runs in a special protected memory environment to prevent the parent application from being affected by errors in the web application, making it safer for the host to unload the web application without consequences. These memory environments are called Application Domains, and their usage (one domain per web application) is justified in large hosting environments like IIS. Moreover, the IIS server itself periodically refreshes the AppDomain by reloading the web application in order to prevent software aging problems, even if managed code is unlikely to show such problems. We must deal with this when designing our web application. Another aspect we must take into account is that Logbus-ng supports plugins, which are able to expose a WSDL interface that must be deployed together with the main web application, characterized by the two endpoints LogbusManagement and LogbusSubscription (with the .asmx extension in ASP.NET).

We found a couple of tutorials about running web applications from within console applications; however, the main difficulty is to allow the web application to communicate with Logbus-ng's core object: objects that reside in different AppDomains, for protection reasons, are passed by value when crossing the AppDomain "barrier", even if they are reference types. This is called marshalling by value, and involves a shallow copy of the members of an object, actually resulting in the existence of two live objects, which is what we do not want. We then had to explicitly require .NET to marshal every object we needed to cross the barrier by reference, by inheriting from a special class, MarshalByRefObject, that is able to generate a proxy for the parent object. Let us keep this for later.

In the ASP.NET world, each page, handler, script or web service is bound to a representing file in the web application directory. ASP.NET is based on virtual and physical directories, so we cannot create an in-memory web application and are forced to write into the file system (but we will clean up the mess upon termination). The "representing file" contains a special header and/or code. Assuming we write code only in the scope of classes (which is the best practice), only the special markup header must be written. It declares the nature of the file and binds it to a class. In order to better understand this, let us analyse in depth the three compilation stages of ASP.NET in the best-practice case.

Once you create a web application with pages and code-behind, you can compile it with the usual .NET compiler and get an assembly. The resulting web application contains only the .aspx, .ashx, .asmx etc. files and the binaries in the bin/ directory. Web Forms files (.aspx) also contain page markup that declares the static HTML code and the controls to use. All files contain the header that references the class the element is bound to. When running the web application, a second compilation is performed: each ASP.NET file is compiled into an assembly with only one class, inheriting from the class declared in the header. For Web

Forms, the class that inherits from the user-code's page programmatically defines its elements (i.e. writes static HTML to the output stream, declares controls as objects). For web services, we just need the skeleton class to run them. In our solution, we created a temporary directory (like the real ASP.NET does, to avoid dirtying the original web application's directory) into which we wrote specially forged .asmx files, LogbusManagement.asmx and LogbusSubscription.asmx, pointing to the skeleton classes declared in the It.Unina.Dis.Logbus core assembly. We also did the same for Global.asax, a special file that is used to declare the HttpApplication object responsible for managing the web application's life cycle, so that we have full programmatic control over the web application. We also had to deploy specially forged .asmx files for plugins according to their needs: as we will describe in the following paragraph, plugins can declare zero or more SOAP endpoints and provide skeletons for them, so that the web application can be completely instantiated. The next phase is to get the web application to communicate with the core Logbus, crossing the



AppDomain barrier: we created a design contract in which the Logbus core "drops" a cross-domain proxy to itself into a named collection of the AppDomain class, where the element is simply named "Logbus". Upon startup of the web application, our HttpApplication object tries to retrieve that proxy and sets it as an application-wide environment object. Something very similar is done for plugins.
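A minimal sketch of this hand-off follows. AppDomain, MarshalByRefObject, SetData and GetData are real .NET APIs; the proxy class and its member are hypothetical stand-ins for the real Logbus core interface.

```csharp
using System;

// Inheriting MarshalByRefObject makes .NET pass a transparent proxy,
// rather than a serialized copy, when the object crosses an AppDomain.
public class LogbusProxy : MarshalByRefObject
{
    public void SubmitMessage(string message)
    {
        // forward to the real Logbus core living in the parent domain
    }
}

class WebActivatorSketch
{
    static void Main()
    {
        AppDomain webDomain = AppDomain.CreateDomain("LogbusWebApp");

        // "Drop" the proxy into the web application's named collection
        webDomain.SetData("Logbus", new LogbusProxy());

        // Inside the web domain, the HttpApplication would later do:
        // var logbus = (LogbusProxy)AppDomain.CurrentDomain.GetData("Logbus");
    }
}
```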

One final note is about Mono: our initial code worked only under Windows, for unknown

reasons. After chatting with the Mono developers, we decided to follow their advice and use the same code XSP does, by importing the Mono.WebServer2 assembly into our workspace. We did not discard our old code, but differentiated the releases for the .NET and Mono platforms by playing with compilation flags.

Concurrency issues

In order to guarantee high performance in the Logbus-ng core, we chose to make extensive use of parallelism to perform most of the tasks.

The first aspect we took care of was the main delivery pipeline, already shown in previous figures. We felt the need to parallelize each stage, like in a hardware pipeline or an industrial assembly line. Each entity in the pipeline, then, is allocated at least one worker thread. In order to obtain a performance increase, however, we must not forget some aspects:
1. Concurrency leads to parallelism only on multi-core architectures

2. The benefits of multithreading are greatest with I/O-bound operations

3. The C# event construct may look asynchronous to the event handler, but it is fully synchronous for the event invoker (it behaves like any regular method), so all event handlers are executed in sequence and block the caller, despite their seemingly "concurrent by design" nature
4. Multithreaded operations introduce non-determinism. Logbus-ng is not required to deliver messages in order, so this is not a problem
5. There is a limit, dependent on the hardware and software configuration, to the number of concurrent threads in an application, above which performance decreases rather



than increasing.
In order to provide native concurrency to the pipeline stages, we decided to widely adopt FIFO queues, at least at the entrance of each stage. LogbusService, the class that implements the ILogBus interface and runs as the hub, automatically inherits ILogCollector23 and implements its SubmitMessage() method as an "enqueue" operation on a synchronized FIFO queue, which is read by another thread. By doing so, it frees the caller thread as soon as possible and leaves the responsibility of continuing the processing to the dedicated worker

thread. The same class is an event handler for the MessageReceived event of ILogSource, the interface that every IInboundChannel must implement by design contract: the event handler, which keeps the channel's thread busy, simply queues the message into the buffer. We did not mention, however, that LogbusService uses 4 threads and 4 queues, which are a good trade-off. Each thread is dedicated to one queue, and the queues are filled using a round-robin rule24. We tried to avoid locks at all costs in our code, at least for high-performance operations such as message processing and delivery (we believe a client can

wait a little for a subscription to complete, but we cannot slow down message processing under high rates of log messages), because locks are known to be very slow.
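The enqueue-and-forget pattern described above can be sketched as follows. For brevity, this version uses Monitor-based synchronization, whereas the text above states the goal of keeping locking in the hot path as cheap as possible; class and method names other than Queue, Thread and Monitor are ours.

```csharp
using System.Collections.Generic;
using System.Threading;

class QueueingCollectorSketch
{
    private readonly Queue<string> _queue = new Queue<string>();

    public QueueingCollectorSketch()
    {
        // A dedicated worker thread drains the queue in the background
        Thread worker = new Thread(ProcessLoop);
        worker.IsBackground = true;
        worker.Start();
    }

    // Called by the inbound channel's thread: returns almost immediately
    public void SubmitMessage(string message)
    {
        lock (_queue)
        {
            _queue.Enqueue(message);
            Monitor.Pulse(_queue); // wake the worker
        }
    }

    private void ProcessLoop()
    {
        while (true)
        {
            string message;
            lock (_queue)
            {
                while (_queue.Count == 0)
                    Monitor.Wait(_queue);
                message = _queue.Dequeue();
            }
            Deliver(message); // continue the pipeline off the caller's thread
        }
    }

    private void Deliver(string message) { /* forward to outbound channels */ }
}
```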

However, we could not avoid using locks in some parts of our code. Let us examine the following C# fragment:

foreach (IOutboundChannel chan in OutboundChannels)
    chan.SubmitMessage(newMessage);

This fragment is executed concurrently by the four delivery threads to deliver messages to all the outbound channels. When we perform a channel creation, the following code is executed by another thread:

OutboundChannels.Add(channel);

Let us now suppose that two channel creation operations are always executed in sequence

23 By design contract: "you can submit messages to ILogBus"
24 We previously tried a "random-robin" rule, choosing the queue randomly, but found that the queues easily ended up being filled unevenly


(no conflict between threads modifying the list): the two fragments shown still conflict when run concurrently. This is because the execution of the for-each loop can be interrupted at any time, and the insertion operation can be executed meanwhile. This invalidates the enumerator that the reader thread is using to scan the collection, thus throwing an exception because the list was altered, and consequently losing the message. In order to avoid this, we must lock the list prior to writing. But it would be unacceptable to allow only one thread at a time to scan/write the list. We decided to use the ReaderWriterLock class to allow multiple readers, but only one writer, to operate concurrently on the collection. The reading fragment then becomes

_outLock.AcquireReaderLock(DEFAULT_JOIN_TIMEOUT);
try
{
    foreach (IOutboundChannel chan in OutboundChannels)
        chan.SubmitMessage(newMessage);
}
finally { _outLock.ReleaseReaderLock(); }

and the writing fragment becomes

_outLock.AcquireWriterLock(DEFAULT_JOIN_TIMEOUT);
try
{
    OutboundChannels.Add(channel);
}
finally { _outLock.ReleaseWriterLock(); }

These kinds of statements, widely used in the code, guarantee that there will be no conflicts between readers and writers, while readers can always operate concurrently. We

actually implemented the ReaderWriterLock pattern in a cleaner manner in some cases: by first acquiring a reader lock (to scan the collection), then upgrading it to a writer lock (only when writing is actually needed), then downgrading the writer lock and finally releasing the reader lock. This is the common behaviour of transactional systems.

Plugin APIs

Let us now deal with the implementation of the plugin system's API. As we mentioned


earlier, a plugin is a generic dynamically-loaded component with very few requirements: it must be able to support the Logbus-ng server life cycle. The other strict requirement is that, if the plugin has to expose an interface to clients, Logbus must expose this interface together with its standard WSDL endpoints: this second requirement has been a bit tricky to implement, as we are going to see.

We developed an IPlugin interface, which must be implemented by all custom plugins, and defined a special XML configuration node for plugins. Each child node allocates a plugin for dynamic loading by reflection. No configuration is expected inside the plugin's declaration node: we found it simpler to let plugins define their own customized configuration sections in the App.config file, also because the whole configuration can then be parsed in one shot. In our opinion, allowing free XML markup under the plugin's declaration node would be harder to handle when parsing the configuration in an automated way. This choice also lightens the computational effort required by the configuration parser, since the parsing rules are up to the plugin (which can always perform manual DOM inspection rather than using our object-oriented approach). In order to support the Logbus server life cycle, we need the plugin instance to obtain a valid reference to the server object and possibly to all those in the pipeline. We achieve this by implementing a Register method which takes the current ILogBus instance as an argument and allows the plugin both to store the reference for future use, and to hook up to

Logbus events such as channel creation, client subscription, etc.
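A configuration fragment of the kind just described might look as follows; element and attribute names, as well as the plugin type, are purely illustrative, not the exact Logbus-ng schema:

```xml
<logbus>
  <plugins>
    <!-- each child node names a type to be loaded by reflection;
         no further configuration is allowed inside this node -->
    <plugin type="Example.Plugins.CounterPlugin, Example.Plugins" />
  </plugins>
</logbus>
<!-- plugin-specific settings live in their own App.config sections -->
```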

We must admit, however, that part of the current specification of ILogBus was not the result of a waterfall design technique: when we discovered that plugins need some events to hook up to, we decided to add them to the ILogBus specification. These public members were the result of a requirements change, and you can see them by comparing older SVN revisions. Let us now examine the full definition of the IPlugin interface before dealing with web service


activation:

public interface IPlugin : ILogSupport, IDisposable
{
    void Register(ILogBus logbus);
    void Unregister();
    string Name { get; }
    WsdlSkeletonDefinition[] GetWsdlSkeletons();
    MarshalByRefObject GetPluginRoot();
}

The WsdlSkeletonDefinition struct is basically a pair of a string and a type, which we will discuss soon. IPlugin inherits ILogSupport to be assigned (or better, "injected") a logger by Logbus-ng, which actually logs to Logbus itself or to the logger designated via configuration to handle Logbus-internal messages, and inherits IDisposable to allow easy resource deallocation upon Logbus destruction. We have already seen the Register method. Unregister simply signals the plugin that it is time to stop working and detach from Logbus, because the plugin's life cycle is almost over (the next step is reasonably destruction), but we did not design any scenario in which a plugin gets deallocated dynamically by the Logbus core. The Name property is used to identify a plugin in design contracts. This actually makes all plugins singleton entities, because the only constraint on the name is that no other plugin has already been registered with the same name. This, together with the final members, is part of the WSDL design contract of plugins.
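For illustration, a bare-bones plugin honouring this contract might look like the following; the plugin class itself is hypothetical, and member names not quoted in the text (the injected Log property, the MessageReceived event) are assumptions about the surrounding APIs.

```csharp
using System;

// Hypothetical plugin that only counts bus traffic and
// exposes no SOAP endpoint.
public class CounterPlugin : IPlugin
{
    private ILogBus _logbus;
    private long _count;

    public ILog Log { get; set; }          // assumed ILogSupport member, injected by Logbus-ng

    public string Name { get { return "CounterPlugin"; } }

    public void Register(ILogBus logbus)
    {
        _logbus = logbus;                  // keep the reference for later use
        _logbus.MessageReceived += delegate { _count++; }; // assumed event
    }

    public void Unregister() { _logbus = null; }

    // No WSDL endpoints: both members return null by contract
    public WsdlSkeletonDefinition[] GetWsdlSkeletons() { return null; }
    public MarshalByRefObject GetPluginRoot() { return null; }

    public void Dispose() { }
}
```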

Since SOAP endpoints need a class (inheriting WebService) to be mapped to a script name, and cross-domain calls require a MarshalByRefObject, we use both in our design contract, which is the following:
1. If the plugin does not expose WSDL interfaces, both GetWsdlSkeletons and GetPluginRoot return null
2. For each WSDL endpoint exposed by the plugin (there can be zero or more), GetWsdlSkeletons returns a pair in which the string represents the final part of the SOAP endpoint's URL (the script name, without extension), and the type represents



the skeleton type that will process requests to that endpoint
3. If required by the plugin's design, a cross-domain object can be returned by GetPluginRoot, and it will be stored in the web application's AppDomain context with the plugin's name as unique key
Logbus-ng's web activator, which already creates a dynamic web application from scratch, as we showed earlier, can easily take care of plugins by creating an .asmx template for each endpoint defined by plugins, mapping the strong name of the type used to process SOAP

requests, plus it stores the plugin's cross-domain proxy in the web application's space.

Field Failure Data Logging support

We already dealt with FFDL/FFDA in the early chapters. When designing and developing Logbus-ng, we focused on making it a general-purpose logging bus that also supports FFD logging natively. However, we felt the need to separate the general logging features from the specific subset of FFD logging, to lighten the final package and allow system administrators to use Logbus-ng as a real-time log collector/distributor without caring for

FFDL/FFDA. In order to do this, we created a separate Extensions package containing all the tools needed to perform both FFD logging and analysis.

In this package, a whole namespace is dedicated to FFD logging. APIs are provided both for loggers (FFD-instrumented applications) and monitors (applications that monitor the system's health and raise alerts if needed).

Let us now analyse how to use the FFD logger to correctly generate a sequence of log messages that is suitable for FFDA. Pecchia (9) showed that a set of logging rules should be applied to the code in order to detect timing failures in case a transaction hangs or deadlocks. Suppose we have a couple of methods in two entities that interact with each other, like the following example:



class ClassA {
    public void ServiceX() {
        […]
        _b.ServiceM();
        […]
    }
}

class ClassB {
    public void ServiceM() {
        […]
        Resource.Read();
        […]
    }
}

Logging rules suggest instrumenting the code according to the following template:

class ClassA {
    public void ServiceX() {
        try {
            _logger.LogSST();
            […]
            _logger.LogEIS();
            _b.ServiceM();
            _logger.LogEIE();
            […]
        }
        catch {
            _logger.LogCMP();
            throw;
        }
        finally {
            _logger.LogSEN();
        }
    }
}

class ClassB {
    public void ServiceM() {
        try {
            _logger.LogSST();
            […]
            _logger.LogRIS();
            Resource.Read();
            _logger.LogRIE();
        }
        catch {
            _logger.LogCMP();
            throw;
        }
        finally {
            _logger.LogSEN();
        }
    }
}

Thanks to this instrumentation, the very first event logged by the methods is SST. If the method is expected to throw exceptions (the throws statement is omitted for the Java case), any exception coming out of the method's scope is first caught by the outer catch block, the CMP event is logged, and the exception is re-thrown to the caller; if the method returns, the SEN event is logged. Also, when a method interacts with another entity or resource (the difference is highlighted in (9) and (10), but it can be summarized as follows: entities are part of our analysis and are themselves instrumented with the rules, while resources are still subject to failures but are not part of our model), a specific interaction start event is logged.

When analysing the log trace, missing end events highlight that a problem occurred in the system. Monitoring applications are supposed to trigger alerts. In our model, an alert message (computation, entity interaction or resource interaction) is anonymous, i.e. it does not specify which entity failed and at what point. However, nothing stops the monitor from adding a text log message with detailed information. In the FFD package, we also provide simple tools to help parse FFD messages coming from an ILogSource.

The Entity Manager plugin

We just dealt with FFDL and FFDA. In order to facilitate FFDA, we decided to ship Logbus-ng with a tool that helps monitoring applications identify the entities that are active in the system. This is the Entity Manager plugin, which is developed as a dynamically loaded plugin with SOAP interfaces.

The Entity Manager plugin is built around the concept of logging entity (9),



which we defined as the triple (host name, PID25, logger name). The reason for this choice is that, even though we could perform fine-grained analyses down to the exact method that logged an event, we found it a good trade-off to group all classes/methods with the same logger name (which is mostly hard-coded) into the same entity. The EM plugin also supports legacy logs by handling the case in which not all fields are available, though we assume that the host name is always present.

How does the EM plugin help analyses? It simply gets all log messages from Logbus-ng and scans them for the entity's definition. For each entity found, a row in a table is created, if the entity does not exist yet, or updated, if it already does. The information includes the FFDA capability of the entity: an entity is FFDA-supported if it has ever sent at least one FFD message; otherwise it is assumed (until proven otherwise) that the entity does not support FFDA (at least with our message convention). Clients can query the EM plugin for active entities, either enumerating them all or selecting only those that match a given criterion, particularly on inactivity time. The last activity time is split into two timestamps: the last time the entity sent a log message, and the last time it sent a special heartbeat message. For each entity identified in the system, up to two free26 channels are created: one is always created and broadcasts all log messages from the given entity; the other broadcasts only FFD messages, if the entity is FFDL-capable.

Together with the plugin, we ship a WSDL file that defines its interface (mandatory to develop clients), plus a pre-defined proxy to quickly use the EM interface from C#. The WSDL file can still be compiled into Java code, or into any other language, for cross-platform development. The Entity Manager plugin is mapped to the EntityManagement SOAP endpoint, hence EntityManagement.asmx in the C# Logbus-ng server. Let us conclude this paragraph with a flow diagram that illustrates the algorithm used by the EM plugin when receiving new messages from the Logbus server.

25 Application name is used if the PID is absent
26 Clients save the computational overhead of creating the channels on their own


[Flow diagram: message processing in the Entity Manager plugin. For an incoming message, the Process ID, if specified, is used as the "Process" attribute; otherwise the Application Name is used. If the entity is new, a record and its entity channel are created, and, if the message is an FFDA message, the entity is marked FFDA-enabled and its FFDA entity channel is also created. If the entity already exists, is not yet marked FFDA-enabled, and the message is an FFDA message, the entity is marked FFDA-enabled and the FFDA entity channel is created. Finally, the last-heartbeat timestamp is updated if the message is a heartbeat, the last-action timestamp otherwise (due to the nature of FFDA messages, an FFDA message is never a heartbeat).]
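In pseudo-C# form, the algorithm illustrated by the diagram reduces to the following sketch; all type and member names here are illustrative, not actual Logbus-ng APIs.

```csharp
// Sketch of the Entity Manager update step (names are hypothetical)
void OnMessageReceived(SyslogMessage msg)
{
    // Use the Application Name when the Process ID is absent (footnote 25)
    string process = msg.ProcessId ?? msg.ApplicationName;
    EntityKey key = new EntityKey(msg.Host, process, msg.LoggerName);

    EntityRecord entity;
    if (!_table.TryGetValue(key, out entity))
    {
        entity = new EntityRecord(key);
        _table[key] = entity;
        CreateEntityChannel(entity);          // always created for new entities
        if (IsFfdaMessage(msg))
        {
            entity.FfdaEnabled = true;
            CreateFfdaChannel(entity);        // only for FFDA-capable entities
        }
    }
    else if (IsFfdaMessage(msg) && !entity.FfdaEnabled)
    {
        entity.FfdaEnabled = true;
        CreateFfdaChannel(entity);
    }

    // FFDA messages are never heartbeats, so the two updates are disjoint
    if (IsHeartbeat(msg))
        entity.LastHeartbeat = DateTime.UtcNow;
    else
        entity.LastAction = DateTime.UtcNow;
}
```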

Log4net interoperability

One of the most interesting requirements for Logbus-ng is interoperability. We already mentioned the existence of several logging frameworks on the market, and the log chaos caused by the heterogeneity of formats. We also said that, until a de facto standard is adopted by all frameworks, messages created with each specific framework need to be converted, and also that Syslog 2009 was a good choice because it is the most widely adopted. In order to help developers of legacy applications that already use a logging framework support Logbus-ng, the very last thing we should do is force them to change the source code of their software and start using our logging APIs instead. We already mentioned that the BSD Syslog protocol was adopted by log4net (5), a popular logging framework for the .NET platform, through the usage of a special appender class.

[Figure: log messages flowing from log4net to Logbus-ng]

Now, in order to achieve log4net to Logbus-ng compatibility, we might simply tell developers to use a RemoteSyslogAppender (27) properly configured with the Logbus-ng server IP and port. However, we found that this appender is buggy; it is out of date, since BSD Syslog (7) does not store as much information as Syslog 2009 does; it has to be used together with a specific pattern layout (thus violating the separation of responsibilities principle, or otherwise making it possible to violate the Syslog protocol by choosing a wrong format); and, finally, log4net collects much useful information that we may want available in Logbus-ng, like the class and method that invoked the logging method, which gets dropped in BSD Syslog. We then decided to implement our own message formatter for the Syslog 2009 format. Formatters (or better, layouts) are independent from appenders, because their goal is to format the log message from its logical representation into a serialized string. However, since RemoteSyslogAppender prepends the priority value to the message, it cannot be used in conjunction with our formatter, which is SyslogLayout from the It.Unina.Dis.Logbus.log4net



namespace. Instead, to correctly use the Syslog-over-UDP protocol (24), developers must use log4net's UdpAppender class, which is the base class of RemoteSyslogAppender anyway. Configuring log4net to use our code is as easy as configuring Logbus-ng, because log4net too uses the App.config file to configure loggers. Logs produced by log4net can always be used for FFDA based on legacy logs.
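Such a configuration might look like the following App.config fragment. UdpAppender and its remoteAddress/remotePort parameters are standard log4net; SyslogLayout is the class from the It.Unina.Dis.Logbus.log4net namespace described above; the server address and port shown are examples only.

```xml
<log4net>
  <appender name="LogbusAppender" type="log4net.Appender.UdpAppender">
    <remoteAddress value="10.0.0.1" />  <!-- example Logbus-ng server -->
    <remotePort value="3588" />         <!-- example UDP port -->
    <layout type="It.Unina.Dis.Logbus.log4net.SyslogLayout" />
  </appender>
  <root>
    <level value="ALL" />
    <appender-ref ref="LogbusAppender" />
  </root>
</log4net>
```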

We also chose to implement reverse-compatibility.

[Figure: log messages flowing from Logbus-ng to log4net]

While log4net is older software compared to Logbus-ng, it still provides lots of useful features that we decided (for this very reason) not to reimplement in our code: mainly, lots of useful appenders to send logs via email or store them into a database27. In order to use Logbus-ng's logging APIs and let log4net store or deliver messages according to its pre-existing configuration, we created an ILogCollector (equivalent to an appender) that is called Log4netCollector and can be instantiated like other collectors via configuration, either for logging or to forward messages transiting through the Logbus-ng server.

Our collector has a simple configuration and usage: first, define a proper log4net configuration, possibly specifying one or more logger names to be used; then, in the Logbus configuration, use the Log4netCollector as the log collector for logging APIs or forwarding, and add the "logger" parameter to its configuration, specifying the log4net logger name to use.

27 This is what most logging platforms do. Logbus-ng would not be commercially acceptable if it had no way to store messages in a relational DBMS for easy querying, and that is another reason to rely on log4net


Experimental validation

No software can be implemented and then released to the public without proper validation: Software Engineering requires a comprehensive test phase before the software is ready to be published. In this chapter, we illustrate the main experiments we performed to validate Logbus-ng's functional and non-functional requirements. Functional requirements have been tested using the typical tools of Software Engineering, such as unit testing and system testing. Fortunately, unit testing is widely supported by commercial IDEs, and there are several solutions, both open source and commercial, to perform test

generation and, most importantly, automation. We created a test suite with the Visual Studio 2008 Team System software to perform most of the automated unit tests. In the next paragraph we will take a look at the tests for only one unit.
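As an illustration, one such automated test might look like the following MSTest sketch; SyslogMessage.Parse and the asserted property names are assumptions about the parser's API, while the sample message is the one given in RFC 5424.

```csharp
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class SyslogParserTest
{
    // Example message taken from RFC 5424, section 6.5
    private const string Rfc5424Example =
        "<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 "
        + "- 'su root' failed for lonvick on /dev/pts/8";

    [TestMethod]
    public void ParsesRfc5424ExampleMessage()
    {
        // SyslogMessage.Parse is an assumed entry point of the parser
        SyslogMessage msg = SyslogMessage.Parse(Rfc5424Example);
        Assert.AreEqual("mymachine.example.com", msg.Host);
        Assert.AreEqual("su", msg.ApplicationName);
    }
}
```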

Before going into the details of unit testing, let us take a look at the equipment in the Mobilab laboratory in Naples. The lab is equipped with a 1 Gbps Ethernet cable network, plus a wireless access point for

802.11g protocol, widely used by laptops controlling the experiments, all inside a 10.0.0.0/8 private LAN. We had available 6 computers with Pentium 4 CPU, 2GB of RAM memory and openSUSE 11.3 operating system with desktop-optimized kernel. Native-OS hosts were artax (10.58.6.30), mizard (10.25.4.32), megres (10.144.166.113), marty86ce (10.13.2.86). Virtual hosts were rosy717 (10.2.6.86) and pegasus (10.40.30.20). We used them all in conjunction mainly to perform stress tests. Logbus-ng server was mainly deployed to mizard node to test Mono compatibility.


Unit testing for Syslog parser
In order to validate the correctness of our Syslog (19) parser and UDP (24) listener, we tried to create an environment as realistic as possible, while remaining within a laboratory unit-testing setting. RFC 3164 (7) and RFC 5424 (19) provide some example messages to help developers test their code against the standard format. Unfortunately, these are too few to prove the robustness of a parser. We chose to proceed by automation, collecting real-world log messages from a Linux machine: we created a very simple program that listened on a UDP port for incoming Syslog messages from syslog-ng and wrote their base64 representation into a file (in order to preserve the byte encoding when transferring the file between different kernels). We left the daemon running for several hours and retrieved the results, which are included in our standard test bench. Finally, we created a test bench with Visual Studio in which each message was base64-decoded and transmitted over UDP to the local host. We also activated the SyslogUdpCollector under test, hooking the test bench to its MessageReceived and ParseError events in order to count the occurrences of successful and failed parsing. We knew the final number of messages from the beginning of the test, so in this way we were also able to count any lost UDP datagram. The test bench was designed to be run with multiple threads synchronized by a semaphore. We had to make the sender thread sleep a little before sending the next message, because otherwise the UDP socket on the local host would have been flooded, causing loss of UDP datagrams. This is a topic we will cover soon.
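The capture-and-replay idea described above can be sketched in a few lines. This is a Python rendition of the approach for illustration only: the original tools were written in C#, and the function names here are ours.

```python
import base64
import socket

def capture_to_file(port, path, count):
    """Listen for syslog datagrams and store each one base64-encoded,
    one per line, so the exact bytes survive transfer between machines."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    with open(path, "w") as f:
        for _ in range(count):
            datagram, _addr = sock.recvfrom(65535)
            f.write(base64.b64encode(datagram).decode("ascii") + "\n")
    sock.close()

def replay_from_file(path, host, port):
    """Decode each stored message and resend it over UDP to the parser
    under test; returns the number of messages sent."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    with open(path) as f:
        for line in f:
            sock.sendto(base64.b64decode(line.strip()), (host, port))
            sent += 1
    sock.close()
    return sent
```

Base64 encoding keeps the test corpus independent of line endings and character encodings, which is exactly why we used it when moving the corpus between kernels.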

Anyway, on June 30th 2010 we happily achieved 100% of successfully parsed messages on a workload of 973 messages, many of which were duplicates.

Delivery time of messages
After validating the functional requirements of Logbus-ng, we wanted to validate the non-functional requirements, mainly performance. We performed some stress tests to determine the capability of Logbus-ng to provide proper service even under heavily stressful conditions. There are two kinds of stress Logbus-ng can be subject to: the first is the rate of input messages, which does not tightly depend on the number of clients submitting log messages; the other is the number of outbound channels and/or clients, which affects both computational and network performance. Another interesting issue, which we will analyse and treat separately, arises from the usage of the UDP protocol: message loss, which can affect the validity of log analyses.

In order to perform a stress test, we ran the Logbus-ng server on a dedicated machine, on which no server other than Logbus-ng was running. We then controlled all the other computers so as to produce a set of uncorrelated and "nonsense" log messages, which we called noise, at a rate that was increased during the experiment by 1000 messages per second each hour, making the experiment last a total of about 8 hours. In order to measure the time it takes for Logbus-ng to process log messages, while avoiding problems related to remote clocks, we created a very simple round-trip-time utility that periodically sends a log message marked with a source timestamp and waits for it to come back after having traversed the full Logbus-ng pipeline, through an Outbound channel. To deal with message loss, a timeout was set after which a message is considered lost. During the experiment, we observed the following RTT:

[Figure: measured RTT over the whole duration of the experiment]
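The round-trip-time utility described above can be sketched as follows. This is a simplified Python rendition under stated assumptions: the probe format and the handling of the subscription socket are ours, while the real utility was written in C# and received the probe back through an Outbound channel subscription.

```python
import socket
import time

def measure_rtt(logbus_host, inbound_port, return_sock, timeout=5.0):
    """Send one probe log message into the bus and wait for it to come
    back on `return_sock` (an already-subscribed outbound channel socket).
    Returns the round-trip time in milliseconds, or None on timeout."""
    probe_id = str(time.monotonic_ns())  # marker to recognise our own probe
    message = f"<14>1 - rtt-probe - - {probe_id} - probe".encode("ascii")
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    start = time.monotonic()
    sender.sendto(message, (logbus_host, inbound_port))
    return_sock.settimeout(timeout)
    try:
        while True:
            data, _ = return_sock.recvfrom(65535)
            if probe_id.encode("ascii") in data:  # our probe came back
                return (time.monotonic() - start) * 1000.0
    except socket.timeout:
        return None  # message considered lost
    finally:
        sender.close()
```

Because the same node both sends the probe and receives it back, only one clock is involved, which is precisely how the utility avoids remote-clock skew.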

For better readability, we split the graph by each hour (and rate) of the experiment:


[Figures: RTT over time, one plot per hour of the experiment: (0-1)h at 0 messages/s, (1-2)h at 1000 messages/s, (2-3)h at 2000 messages/s, (3-4)h at 3000 messages/s, (4-5)h at 4000 messages/s, (5-6)h at 5000 messages/s, (6-7)h at 6000 messages/s, (7-8)h at 7000 messages/s]

In order to better analyse the meaning of the data, we computed the statistical properties of the raw results, grouped by workload frequency, and built the following table, to which we added the mean CPU utilization manually observed during the experiment.


Message rate   Mean RTT    Max RTT     Min RTT   RTT variance    CPU usage
0              2.691       11.416      2.400     1.375           2%
1000           3.539       19.221      2.376     6.486           23%
2000           3.888       18.793      2.366     8.340           31%
3000           4.738       24.044      2.335     12.683          45%
4000           9.712       87.223      2.384     125.875         70%
5000           3553.738    17638.012   3.170     37522050.099    85%
6000           4490.144    19500.391   9.912     64426166.628    85%
7000           4150.373    19634.178   40.947    61681965.553    88%

Overall        1383.267    19634.178   2.335     22338794.147

The data shown above reveal that, beyond a certain threshold of the message rate, performance degrades sharply and Logbus-ng's behaviour becomes more and more non-deterministic (see the variance). The non-determinism is mainly due to the fact that Logbus internally uses multiple threads and queues to temporarily store messages, which causes some messages to wait longer than others to be delivered: in fact, the minimum RTT is almost unaffected by the workload.

Loss of UDP datagrams under stress
Another interesting experiment we performed was aimed at understanding the actual maximum workload that Logbus-ng can process using the UDP protocol. TLS, being based on TCP, automatically regulates the transfer rate to prevent packet loss but, on the other hand, decreases the sender's performance. Logging frameworks must have little or no performance impact on the main application, so they often use delivery protocols based on UDP. Our goal is to determine the limit of UDP in Logbus-ng, in order to better understand how log analysis might be affected during a network burst.

In order to obtain quantitative data on packet loss, we designed and created an experimental tool to simulate heavy network activity and detect packet loss with different hardware configurations. The first experiment had the Logbus-ng server deployed on host artax, with host megres acting as client; the second experiment used a dual-core laptop as the server.


Our command-line tool accepts as arguments both the number of messages to use for the experiment and the rate at which to send them. Unfortunately, the model implemented by the tool (messages sent at a constant rate in large quantities) is not realistic at all. In real systems, log messages are transmitted according to stochastic models that vary with the system under examination. Systems often transmit smaller amounts of log messages at a rate even higher than ours, but we can state that our model is suitable for our purpose of validating Logbus-ng's non-functional requirements.

In fact, what we ultimately want to measure is the reliability of the collected log messages for analysis purposes. Since we ensured that Logbus-ng does not internally drop messages, monitor clients can choose reliable protocols, and the file system28 is assumed to be reliable, the only possible source of message loss or corruption is the UDP (24) listener. Loss of datagrams in IP networks is due to congestion, either in routers or in endpoints, causing the receive buffer to overflow and consequently discard new messages. Before performing the experiment, we tweaked Logbus-ng's UDP listener to increase the buffer size to 8 MB, a quantity considered large enough for most uses. Just to give an idea, let us compare the default buffer allocation in common kernels.

Kernel            Default UDP buffer size
Linux             128 KB
Solaris           256 KB
FreeBSD, Darwin   256 KB
AIX               1 MB

Setting 8 MB on Linux is far above the default limit and therefore more than acceptable.
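Enlarging a socket's receive buffer, as we did for the UDP listener, amounts to a single setsockopt call. The sketch below is a Python rendition for illustration (Logbus-ng does the equivalent through the .NET Socket API). Note that the kernel may silently cap the request — on Linux, at net.core.rmem_max — so it is worth reading the effective value back.

```python
import socket

def make_udp_listener(port, rcvbuf_bytes=8 * 1024 * 1024):
    """Create a UDP listening socket with an enlarged receive buffer to
    absorb bursts of datagrams. Returns the socket and the buffer size
    actually granted by the kernel (which may differ from the request)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    # Read back the effective size: the kernel may cap (or, on Linux,
    # double) the requested value.
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    sock.bind(("", port))
    return sock, effective
```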

Our test utility performs the following steps: it first subscribes to a reliable Logbus-ng channel (to get the datagrams back while avoiding additional loss on the return segment), then sends all the messages via UDP, waiting for them to come back through the previously instantiated client. If not all of them arrive, the utility waits up to 30 seconds after the last received message, to tolerate network or processing delays, after which the missing datagrams are considered definitively lost. We performed the experiment by fixing the number of messages to 18000 and varying the transmission rate. We obtained the following results:

28 Configuring a forwarding rule to the FileCollector class allows log messages to be stored in the file system. To achieve the same with a DBMS, we recommend the log4net bindings

Rate29   Loss
1000     0
2000     0
3000     0
4000     0
5000     0
6000     0
7000     0
8000     550
9000     1938
10000    3442
11000    5146
12000    4958
13000    4382
14000    5006
15000    4924
16000    5898
17000    6935
18000    8238
19000    8599
20000    9720

[Figure: packet loss vs. transmission rate (messages per second)]

When we performed the same experiment using the dual-core CPU, we were surprised to lose no messages at all at any of the rates in our experiment.

29 Messages per second


Conclusions and future work

In the previous chapters, we have shown all the analysis, design, development and validation steps that we followed during the development of Logbus-ng. These culminated in the public release of Logbus-ng on Sourceforge.net as a package containing source code, compiled code, documentation and examples.

However, our work on Logbus-ng as a logging tool is not complete. Our C# core is a small part of a wider architecture for effective and efficient logging and analysis. While we showed that Logbus-ng is an effective tool for log-based FFDA, and proved by experiment that its performance is affected mostly by hardware and network infrastructure rather than by our implementation, there are still open issues and opportunities for future academic and non-academic work.

Platform bindings
We implemented the whole Logbus-ng architecture in C#, but from the beginning we designed the architecture to be platform- and language-independent. All the interfaces are expressed in platform-agnostic formalisms that allow future developers to implement components in other programming languages. There is no need to reimplement the server component in another language, since it is separated from the rest of the infrastructure and does not affect computation in any way. However, ports of the source-side APIs to different platforms are highly desirable, to let developers use the Logbus-ng APIs with any software application without having to deal with the inbound channel protocols directly. A port of the Logbus-ng APIs to Java, possibly with log4j (4) integration, is an affordable short-term objective.


Dealing with protocol drawbacks
We have already shown that Logbus-ng is currently based on the Syslog over UDP (24) and TLS (23) protocols, each having its peculiar advantages and drawbacks. The TLS protocol, while not explicitly shown in our work, has a significant performance impact due to cryptographic operations, while the UDP protocol, which was the subject of our deep analysis, is prone to packet loss and corruption, which can affect further analyses.

We were able to mitigate the TLS performance impact on the running application by adopting a buffering strategy in our code that is orthogonal to the protocol itself. This means that developers who choose TLS as the logging protocol in order to prevent message loss must pay particular attention to how they implement the protocol, trying to send the maximum possible payload whenever there are messages to send (thus using the full network bandwidth if needed) and preventing logging methods from blocking on TLS delivery, as happened in our earlier implementations.

If developers are unlikely to use TLS, the Syslog over UDP protocol provides another means to mitigate packet loss: sequence IDs. As shown in the Syslog 2009 (19) protocol paragraph, if the sequenceId extension is adopted, analysers can detect the loss of packets, except in the case in which the last n messages are lost, for which there is currently no detection method. After lost packets are detected, retransmission strategies, beyond the mere Syslog protocol, can be put in place to perform fault recovery.

Load balancing, fault tolerance
We have already shown how Logbus-ng efficiently processes large amounts of messages in our test scenarios. However, there can still be scenarios in which a single Logbus-ng server does not suffice to handle all the messages generated by the various nodes of a cluster. Or, simply, system operators may not want a single Logbus-ng server to be a single point of failure in a dependable infrastructure: it is clearly uncommon for reliability engineers to design a software product to be a single point of failure. In both cases, we want to replicate Logbus-ng on several nodes, possibly masking the presence of multiple instances of the server (replication transparency).
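Going back to the sequenceId mechanism mentioned earlier: on the analyser side, loss detection reduces to finding holes in the sequence of received IDs. A minimal Python sketch (assuming consecutive integer sequenceId values; the trailing-loss blind spot noted above still applies):

```python
def find_gaps(sequence_ids):
    """Given the sequenceId values of the messages received on a channel
    (in arrival order), return the IDs that were never seen, i.e. the
    messages presumed lost. Trailing losses are undetectable: if the
    last n messages are lost, nothing signals that they ever existed."""
    seen = set(sequence_ids)
    first, last = min(seen), max(seen)
    return [i for i in range(first, last + 1) if i not in seen]
```

For example, receiving IDs 1, 2, 4, 7, 5 (out of order, as UDP allows) reveals that messages 3 and 6 were lost.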

Let us first analyse a case of replication that achieves fault tolerance against UDP packet loss without affecting the performance of logging applications:


[Diagrams: two deployment scales. (1) Processes A, B and C log to a local Logbus-ng instance, which forwards the messages over TLS to a central Logbus-ng server. (2) A larger-scale deployment with multiple TLS links.]

The previous diagrams show a possible deployment of Logbus-ng at two different scales: in the first diagram, we highlight the fact that local applications can forward logs via UDP to a locally running instance of Logbus-ng, which forwards all these messages (possibly filtered) to a centralized Logbus-ng server using the reliable TLS protocol. Local UDP delivery is expected not to be subject to packet loss under normal workloads. This way, a centralized log server can work under a controlled workload up to its maximum capability without affecting the monitored applications. However, in this case, if a hardware node fails, the local Logbus-ng server fails too, losing all the messages queued for delivery30.

The above scenario assumes the Logbus-ng server is reliable enough to be left running on its own. In real-life scenarios, we may want to make the log server fault tolerant too. This can be easily achieved by replicating the Logbus-ng server on several nodes. In order to avoid consuming bandwidth, a multicast protocol should be used, such as Syslog over UDP (24) via IP multicast. In order for clients to tolerate faults of one or more log servers, the client APIs must be rewritten to subscribe to more than one Logbus, handle the resulting duplicated messages and, obviously, hide the replication details from the code above them. In order to handle duplicate messages, clients must be aware of the messages already received, by keeping either the messages themselves or meta-information about them.
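A minimal sketch of such duplicate suppression follows (Python for brevity; the key used here, e.g. a (host, sequenceId) pair, is our assumption — the real client APIs would have to define their own notion of message identity). Keeping only a bounded window of keys is a compromise: it avoids storing whole messages, at the cost of possibly re-accepting a very late replica.

```python
from collections import OrderedDict

class DuplicateFilter:
    """Suppress duplicate messages received from replicated Logbus-ng
    servers. Remembers a bounded window of recently seen message keys
    (e.g. source host + sequenceId) rather than whole messages."""

    def __init__(self, window=10000):
        self._seen = OrderedDict()  # insertion-ordered set of keys
        self._window = window

    def accept(self, key):
        """Return True the first time a key is seen, False for replicas."""
        if key in self._seen:
            return False
        self._seen[key] = None
        if len(self._seen) > self._window:
            self._seen.popitem(last=False)  # evict the oldest key
        return True
```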

It could be simpler for the Logbus-ng client APIs to use a round-robin subscription mechanism, with immediate subscription to another server when one is detected to have crashed (with TLS, detection is immediate).

Other work
Finally, we would like to conclude this work with a few ideas that came to mind during the development of Logbus-ng.

30 We still expect the delivery queue to be empty most of the time, depending on hardware performance and network bandwidth


Channel garbage collector plugin: we found that our client subscription protocol lacks robustness when a client crashes. Even if the client is detected to have failed and is unsubscribed, the channel itself may be kept alive forever. We think it could be a good idea to have a plugin periodically scan the outbound channels and purge those to which no client has subscribed for a long time. However, we must note that the Entity Manager plugin creates lots of channels to which usually no client subscribes, and those must not be deleted anyway. Also, the system administrator may want to create public channels that are not constantly subscribed during the execution of Logbus. This must be kept in mind.
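The proposed garbage collector could look roughly like this. It is only a Python sketch: the bookkeeping shown (subscriber counts and last-subscription timestamps per channel) is an assumption about what the plugin would have to track, and the protected set covers both Entity Manager channels and administrator-defined public channels.

```python
import time

def purge_idle_channels(channels, max_idle_seconds, protected_ids):
    """Sketch of the proposed channel garbage collector: delete outbound
    channels with no subscribers for too long, sparing channels that
    must survive (Entity Manager channels, public channels).
    `channels` maps channel id -> (subscriber_count, last_subscribed_at)."""
    now = time.time()
    purged = []
    for cid, (subscribers, last_used) in list(channels.items()):
        if cid in protected_ids:
            continue  # never purge protected channels
        if subscribers == 0 and now - last_used > max_idle_seconds:
            del channels[cid]
            purged.append(cid)
    return purged
```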

Channel persister plugin: if Logbus-ng crashes, all channel definitions are lost. To spare clients from having to re-create their channels manually upon Logbus-ng startup, it might be interesting to have a plugin save and restore channel definitions across a restart (subscriptions would obviously be empty).

User interface for the server: we never developed any kind of GUI for the server component. It would be interesting to be able to configure, control and monitor the active instance of Logbus-ng.

Serious security: our work clearly states that, even though we use the TLS protocol, no security is enforced in Logbus-ng. In order to make Logbus-ng an effective tool for general logging, including accounting logs, proper security requirements must be enforced. The study and development of security policies (including securing the client APIs via HTTPS or WS-Security) can be the subject of future work.


Appendixes


Appendix Alpha
XML Schema logbus-filter.xsd, namespace http://www.dis.unina.it/logbus-ng/filters


Appendix Bravo
WSDL interface logbus-control.wsdl, namespace http://www.dis.unina.it/logbus-ng/wsdl

[Listing: logbus-control.wsdl — embedded documentation of the operations:]
- Channel management service: manages outbound channels for the Logbus service; lists the IDs of available channels; creates a new channel (the ID of the new channel must be unique; faults: generic error while creating the channel, a channel with the given ID already exists); retrieves channel information (returns the full description of the channel and its status; fault: not enough privileges); deletes a channel by ID (faults: not enough privileges, generic error while deleting the channel, channel is not empty and clients must unsubscribe first).
- Subscription service: client APIs used to subscribe to existing channels and receive log data; lists the IDs of available channels; subscribes a client (the subscription "ticket" contains both the client's ID and the transport-dependent instructions it will use to configure itself; faults: channel not found, error configuring the transport); unsubscribes a client from a channel (fault: client was not found, or timed out); refreshes a client's subscription, if required by the transport.


Appendix Charlie
Schema of configuration sections, namespace http://www.dis.unina.it/logbus-ng/configuration/2.0

[Listing: configuration-sections schema — embedded documentation:]
- Common section: basic and language-independent definition of a KVP; holds a KVP; must be a fully-qualified type in the form "Namespace.Type, FullyQualifiedAssemblyName"; marks whether the entity is to be considered default; defines a connector object to Logbus.
- Core section: main type for the Logbus configuration, bound to IConfigurationSectionHandler; main element for the Logbus configuration; configures Inbound channels (each defined by type and properties); configures Custom Filters (each defined by type); defines an assembly to scan for required types (they must be marked with appropriate attributes); configures Outbound Transports and the .NET type to use for creating an Outbound Channel (must implement IOutboundChannelFactory); configures Logbus-ng core plugins; configures log forwarding.
- Source section: configuration for a Logbus source (an element that generates and sends log messages); strong name of the default class implementing ILog; default collector to use when not specified; default heartbeat interval in seconds (if not specified, heartbeating is disabled); unique ID of a logger and ID of the collector it uses; name of a collector (must be unique).
- Client section: configuration for a Logbus client (an element that receives log messages from subscribed channels); defines the location of a Logbus endpoint for channel management and subscription.


Bibliography
1. Microsoft Corporation. System.Diagnostics.Debug. MSDN. [Online] http://msdn.microsoft.com/it-it/library/system.diagnostics.debug.aspx.
2. Oracle. System (Java Platform SE 6). [Online] http://download-llnw.oracle.com/javase/6/docs/api/java/lang/System.html#err.
3. Apache Software Foundation. Apache Software Foundation. [Online] http://www.apache.org.
4. —. Apache log4j. [Online] http://logging.apache.org/log4j/.
5. —. Apache log4net. [Online] http://logging.apache.org/log4net/index.html.
6. BalaBit IT Security. Syslog-ng. [Online] http://www.balabit.com/network-security/syslog-ng/.
7. The BSD Syslog Protocol. Lonvick, Chris. s.l.: Internet Engineering Task Force. RFC 3164.
8. Fail2ban. [Online] http://www.fail2ban.org.
9. Improving FFDA of Web Servers through a Rule-Based Logging Approach. Cinque, Marcello, et al. Naples: s.n., 2008.
10. A Logging Approach for Effective Dependability Evaluation of Complex Systems. Cinque, Marcello, Cotroneo, Domenico and Pecchia, Antonio. Naples: s.n., 2009.
11. Failure Data Analysis of a LAN of Windows NT Based Computers. Kalyanakrishnam, Kalbarczyk and Iyer. Urbana, IL: s.n.
12. What Supercomputers Say: A Study of Five System Logs. Oliner, Adam and Stearley, Jon.
13. Destailleur, Laurent. AWStats. [Online] http://awstats.sourceforge.net/.
14. Reflections on Industry Trends and Experimental Research in Dependability. Siewiorek, Daniel and Kalbarczyk, Zbigniew. s.l.: IEEE, 2004.
15. IBM. Common Event Infrastructure. IBM Tivoli. [Online] http://www-01.ibm.com/software/tivoli/features/cei/.
16. Object Management Group. Data Distribution Service. OMG Data Distribution Portal. [Online] http://portals.omg.org/dds/.
17. Microsoft Corporation. Windows Event Log. MSDN. [Online] http://msdn.microsoft.com/en-us/library/aa385780%28v=VS.85%29.aspx.
18. VAX/VMS Event Monitoring and Analysis. Buckley, Michael and Siewiorek, Daniel.
19. The Syslog Protocol. Gerhards, Rainer. s.l.: Internet Engineering Task Force. RFC 5424.
20. Augmented BNF for Syntax Specifications: ABNF. Crocker, Dave and Overell, Paul. s.l.: Internet Engineering Task Force, 2008. RFC 5234.
21. Internet Assigned Numbers Authority. Enterprise Numbers. [Online] http://www.iana.org/assignments/enterprise-numbers.
22. —. Internet Assigned Numbers Authority. [Online] http://www.iana.org.
23. Transport Layer Security (TLS) Transport Mapping for Syslog. Miao, Fuyou, Ma, Yuzhi and Salowey, Joseph. s.l.: Internet Engineering Task Force, 2009. RFC 5425.
24. Transmission of Syslog Messages over UDP. Okmianski, Anton. 2009. RFC 5426.
25. Novell, Inc. Mono Project. [Online] http://www.mono-project.com.


26. Apache Software Foundation. SyslogAppender. Apache log4j. [Online] http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/net/SyslogAppender.html.
27. —. RemoteSyslogAppender Class. Apache log4net. [Online] http://logging.apache.org/log4net/release/sdk/log4net.Appender.RemoteSyslogAppender.html.
28. World Wide Web Consortium. Web Services Activity. [Online] 2002. http://www.w3.org/2002/ws/.
29. —. eXtensible Markup Language. [Online] 1998. http://www.w3.org/XML/.
30. Myricom. Myrinet-2000 Index Page. [Online] 15 January 2007. http://www.myri.com/myrinet/.
31. User Datagram Protocol. Postel, Jon. 1980. RFC 768.
32. The IP Network Address Translator (NAT). Egevang, Kjeld Borch and Francis, Paul. s.l.: Internet Engineering Task Force, 1994. RFC 1631.
33. Standard ECMA-334 C# Language Specification. [Online] 2003. http://www.ecma-international.org/publications/standards/Ecma-334.htm.
34. ISO/IEC 23270:2003 - Information technology -- C# Language Specification. [Online] 2003. http://www.iso.org/iso/catalogue_detail.htm?csnumber=36768. ISO/IEC 23270:2003.
35. Free Software Foundation. DotGNU Project. [Online] http://www.gnu.org/software/dotgnu/.
36. Skonnard, Aaron. Run ASMX Without IIS. Service Station. [Online] December 2004. http://msdn.microsoft.com/en-us/magazine/cc163879.aspx.


Acknowledgements

So… you've finally read this far or, perhaps, you started reading from here, like a Japanese comic! I hope the former, since otherwise you might have missed the chance to find all the easter eggs hidden in the text! It's surely time to say "thank you" to the many people that helped me, the author, write this Master's thesis. I would first like to thank prof. Domenico Cotroneo and Marcello Cinque of the Department of Computer Science for being my tutors for such a long time (I think they'll be glad I'm leaving). Then the rest of the Mobilab team, including Alessandro the wise, Lelio, who should never again cut his hair too short, and Antonio Bovenzi, whose ass's photo has been used for lots of darts tournaments. A special acknowledgement to Antonio Pecchia, my POLTERGEIST TUTOR, whose voice sounded like a creepy presence in my laptop's speakers.

I wish to thank Microsoft Corporation, the Mono development team and JetBrains Inc. for donating us free licenses of expensive development software, including Visual Studio, Mono Tools for Visual Studio and ReSharper!!

I won't forget to thank Vittorio Alfieri, my desk partner, the man who betrayed his girlfriend for Data Binding, for all the times he made me hit my head against the wall, for all the Sundays and summer days spent on the phone, for all the… whatever… See you soon for the next head-hitting project, bro! And good luck with your Master's thesis too!!


Other people who deserve a mention are Madia Mele (ladies first), the blue-eyed engineer: I hope you get your "fun" soon; Peppe Brunetti, the man who most resembles Barack Obama for being young, handsome… and tanned; Claudio Chiaro, who struggles to become the next Wolverine stunt double: marry soon, and I hope PrancescAlberto will grow up strong like his father.

I also wish to thank Peppe Buiano for struggling in order to find me a job, and my fitness trainer Monica Iacobelli for struggling in order to find me a girlfriend.

LAST… Thank you Christian Barone, the most LAXATIVE human around the world (sorry, I’ll be right back… OK, done!), for being a MILF… Whaaaaaaaat? Hey, pervert, what did you understand? I mean a Man I’d Light Fire!!!

Finally, Rosy Festa: it’s been years of endless tears, but I can love no one but you, and I really mean no one!!
