CERN-THESIS-2014-086 21/07/2014

Master's Thesis

Topic: Monitoring and Diagnostics for C/C++ Real-Time Applications
Name: Yves Fischer
Submission date: 12.07.2014
Examiner (Referent): Prof. Dr. Hoffmann
Second examiner (Korreferent): Prof. Dr. Fuchß
Workplace: CERN, Geneva
Chair of the examination board: Prof. Dr. Ditzinger
Karlsruhe, 13.01.2014

Faculty of Computer Science and Business Information Systems — Statutory Declaration (Eidesstattliche Erklärung)

I hereby declare that no other person's work has been used without due reference. All adopted lines of text, whole passages, tables or images are marked with their source, regardless of whether the source is a book or a publication on the Internet. Direct translations of foreign-language documents are likewise provided with source references. The German version of this declaration is binding.

Prévessin, 12th of July 2014 Yves Johannes Wolfgang Fischer

Acknowledgements

First I would like to thank my supervisor at CERN, Felix Ehm, for his support and helpful guidance. His advice, expertise and understanding added considerably to my graduate experience.

I would like to express my gratitude to Professor Thomas Fuchß, who provided me with many great points to include and gave me advice whenever it was required.

I would also like to thank Stephen Page, who proofread my text and provided me with helpful comments and suggestions.

Abstract

Knowledge about the internal state of computational processes is essential for problem diagnostics as well as for constant monitoring and pre-failure recognition. The CMX library provides monitoring capabilities similar to the Java Management Extensions (JMX) for C and C++ applications.

This thesis provides a detailed analysis of the requirements for monitoring and diagnos- tics of the C/C++ processes at CERN.

The developed CMX library enables real-time C/C++ processes to expose values without harming their normal execution. CMX is portable and can be integrated into different monitoring architectures.

Contents

1 Introduction ...... 1
  1.1 Motivation ...... 1
  1.2 Overview of CERN ...... 2
  1.3 Structure of this Thesis ...... 4

2 Monitoring of C/C++ Systems ...... 5
  2.1 Technical Environment ...... 6
  2.2 Motivation ...... 7
  2.3 Related Work ...... 8

3 Requirements ...... 10
  3.1 Terms ...... 10
  3.2 Functional Requirements ...... 12
  3.3 Technical Requirements ...... 13

4 Existing Technologies and Solutions ...... 16
  4.1 Monitoring Systems ...... 16
  4.2 Logging Systems ...... 19
  4.3 Interprocess Communications ...... 21
    4.3.1 Possibilities ...... 22
    4.3.2 Evaluation ...... 24
  4.4 Existing Solutions ...... 25
  4.5 Conclusions ...... 27

5 Design of CMX Protocol and Data Structures ...... 28
  5.1 Design of CMX Data Structures ...... 28
  5.2 Shared Memory ...... 30
  5.3 Design of CMX Protocol ...... 34
    5.3.1 Real-Time Constraints ...... 35
    5.3.2 Concurrent Access to Shared Memory ...... 36
    5.3.3 Verification ...... 43
  5.4 Comparison with Similar Algorithms ...... 45
  5.5 Verification with Models ...... 48
    5.5.1 Simple Example of a Promela Model ...... 48
    5.5.2 Model of Two Writers ...... 49
    5.5.3 Model of Concurrent Reader/Writer ...... 50
  5.6 Conclusions ...... 52

6 Implementation of CMX ...... 53
  6.1 Platform and Toolchain ...... 53
    6.1.1 Compiler ...... 53
    6.1.2 Atomicity of Operations ...... 55
    6.1.3 Processor Memory Consistency ...... 58
    6.1.4 Processor Cache Coherency ...... 64
    6.1.5 Choosing a Suitable Timesource ...... 64
  6.2 Implementation Overview ...... 66
    6.2.1 The Implementation in C ...... 66
    6.2.2 The C++ API ...... 66
    6.2.3 Independent Usage of CMX ...... 70
    6.2.4 Real-Time Compatibility ...... 70
    6.2.5 Automated Testing ...... 73
    6.2.6 Performance Analysis ...... 75
    6.2.7 Possible Extensions ...... 77
  6.3 Conclusions ...... 78

7 Integration in CERN Infrastructure ...... 79
  7.1 A Remote Agent for CMX ...... 79
    7.1.1 Diagnostic Access in the DIAMON GUI ...... 80
    7.1.2 Monitoring of CMX Enabled Applications in DIAMON ...... 80
  7.2 Interaction of CMX with Build Tools ...... 82
  7.3 Conclusions ...... 85

8 Summary ...... 86

Literature ...... 87

Glossary ...... 91

List of Definitions and Requirements ...... 92

List of Figures ...... 93

List of Tables ...... 95

1 Introduction

High system availability is essential for successfully operating a large industrial facility. For this reason it is important to identify sources of errors and potential problems as early as possible. In the field of computing, system and application monitoring is applied to fulfill this task.

This work describes the implementation of application monitoring and diagnostic tools that are suitable for real-time applications, such as those used in CERN's accelerator control system.

1.1 Motivation

Large installations like particle accelerators or industrial sites are expensive in construction and operation. The cost of building the LHC accelerator was about 6 billion CHF. The experiments which depend on the correct functioning of the accelerator are funded independently; the material costs for the ATLAS experiment alone were 540 million CHF [1, p. 17].

The only time frame in which this investment pays off is when everything is working correctly and, in the case of the LHC, collisions can be delivered to the experiments. Proper operation of the accelerator depends on the reliability of many smaller and bigger hardware and software components.

The BE-CO group, where this work was carried out, is responsible for a large part of the accelerator controls software. Naturally, our primary goal is to provide reliable, fault-tolerant software and, in case of unforeseen events, response times that are as short as possible.

Monitoring plays a critical role in early recognition of possible error conditions and fast identification of problem sources. The monitoring system constantly watches about 2,000 machines and applies many rules to detect problems.


Monitoring is always limited to what developers consider worth being monitored. Hence, enabling developers to expose metrics easily from within their applications in a suitable and standardized way is a key factor for success.

We failed to find any existing solution in this area for C/C++ applications that fulfills our requirements to a large extent and is at the same time compatible with the existing monitoring and diagnostic system. This was the initial reason to develop a new monitoring and diagnostics library for C/C++, called CMX, at CERN.

1.2 Overview of CERN

This project is carried out at CERN in Geneva, where physicists and engineers are researching the fundamental structure of the universe. Founded in 1954, the CERN laboratory is one of Europe's first joint ventures and now has 21 member states.

Figure 1.1: CERN Accelerator Complex [2]

Today CERN hosts many particle physics experiments. The biggest and best known is the particle accelerator LHC with its detectors ATLAS and CMS, most famous for the discovery of the Higgs boson.

A bunch of particles in the LHC that collides in one of the detectors has gone through a cascade of increasingly powerful accelerators (Fig. 1.1) to reach 0.999 999 991 times the speed of light.

(a) “The observed probability (local p-value) that the background-only hypothesis would yield the same or more events as are seen in the CMS data, as a function of the SM Higgs boson mass for the five channels considered. The solid black line shows the combined local p-value for all channels.”

(b) “Event recorded with the CMS detector in 2012 at a proton-proton centre-of-mass energy of 8 TeV. The event shows characteristics expected from the decay of the SM Higgs boson to a pair of photons (dashed yellow lines and green towers). The event could also be due to known standard model background processes.”

Figure 1.2: Pictures related to the discovery of the Higgs Boson, CMS Collaboration [3]

Moreover, at the time of collision, every proton has reached a top energy of 3.5 TeV. Compared to energy emissions in the real world, these energies are still low. However, in the LHC they are concentrated in space like nowhere else. A complete beam (in the LHC, two beams circulate in opposite directions) contains

2 · 3.5 TeV · (1.1 · 10^11 particles per bunch) · (2808 bunches) ≈ 350 MJ,

which is about as energetic as a 400 t train, such as the French TGV, travelling at 150 km/h [1].
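The arithmetic behind the 350 MJ estimate can be checked numerically. The following sketch uses the beam parameters quoted above and the standard eV-to-joule conversion; it is purely illustrative:

```cpp
#include <cmath>

// Stored energy of both LHC beams, from the parameters quoted in the text.
double lhc_stored_energy_joules() {
    const double joule_per_ev      = 1.602176634e-19;        // 1 eV in joules
    const double proton_energy_j   = 3.5e12 * joule_per_ev;  // 3.5 TeV per proton
    const double protons_per_bunch = 1.1e11;
    const double bunches_per_beam  = 2808.0;
    const double beams             = 2.0;                    // two counter-rotating beams
    return beams * proton_energy_j * protons_per_bunch * bunches_per_beam;
}

// Kinetic energy of a 400 t train at 150 km/h, for comparison.
double train_energy_joules() {
    const double mass_kg = 400.0e3;
    const double v_ms    = 150.0 / 3.6;  // km/h -> m/s
    return 0.5 * mass_kg * v_ms * v_ms;
}
```

Both functions yield roughly 3.5 · 10^8 J, confirming the comparison in the text.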

Colliding inside one of the four detectors, the protons or lead ions produce sub-atomic particles. Particle detectors use different devices to identify these particles: for instance, their path in a magnetic field is measured and tracked. Calorimeters finally stop some of the particles and measure their energy. These and additional measurement techniques allow physicists to identify events in which unusual particles appear that may fit into established or new theories.

Fig. 1.2 shows two graphics taken from the paper about “a new boson at a mass of 125 GeV with the CMS experiment at the LHC” [3] by the CMS Collaboration. The plot on the left shows the probability for a subatomic particle with the characteristics of the Higgs boson. The illustration on the right is a visualization of a particle collision with the characteristics of a Higgs boson decay.


1.3 Structure of this Thesis

The following chapter 2 is about monitoring in general and the technical environment at CERN. The usage scenario for a C/C++ monitoring solution will be defined.

Chapter 3 defines requirements for monitoring and diagnostics in C/C++.

Chapter 4 discusses basic technology decisions and existing software solutions.

Chapter 5 describes the technical design of the designated solution in detail.

Chapter 6 focuses on important points of the implementation process.

Chapter 7 describes how the existing monitoring and diagnostics system is extended with the C/C++ monitoring capabilities provided by CMX.

The final chapter 8 gives an overall résumé. More specific conclusions can be found at the end of chapters 4, 5, 6 and 7.

2 Monitoring of C/C++ Systems

CERN's accelerator control system is essential for operating the accelerators (Fig. 1.1); hence its availability, performance and correct functioning are critical. In the Beams Department (BE) of CERN, the Controls Group (BE-CO) [4] is responsible for the specification, design, procurement, integration, installation, commissioning and operation of the controls infrastructure for all CERN accelerators, their transfer lines and the experimental areas.

The controls group provides general services like the front-end software framework (FESA), general machine and beam-synchronous timing generation and distribution, a signal observation system, communication middleware, surveillance and monitoring (DIAMON), alarms, general logging facilities and data management.


Figure 2.1: CERN Accelerator Control System


2.1 Technical Environment

In terms of software, the accelerator control system comprises approximately 3,500 applications written in Java, C and C++. It is composed of three layers (Fig. 2.1):

• Presentation layer: hosts the graphical control interfaces, status displays and the operator consoles.

• Business layer: general services including logging, monitoring, messaging and configuration management.

• Device layer: time-critical control software, mostly written using the FESA Framework.

While the presentation and business layer applications mostly run on Java, the timing-critical applications in the device layer are written in C or C++. In terms of numbers, most C++ applications are written using the FESA Framework. FESA provides an object-oriented description of equipment with standardized basic functionality such as real-time event handling, a standardized interface to device properties, logging, testing and simulation, as well as the necessary tooling for code generation and integration into the Eclipse IDE.

Host infrastructure The C/C++ software runs mostly on Intel x86 and x86-64 based hardware, but some also runs on very old PowerPC based hardware using the LynxOS operating system. The internal support for the PowerPC based hardware is slowly being abandoned and replaced by Intel x86-64 systems. This process will not be finished before the next long shutdown (LS2, around 2018). The newer 32- or 64-bit Intel systems are running Scientific Linux CERN (SLC) 4/32-bit, 5/32-bit and 6/64-bit. It is unlikely that software currently running on PowerPC will be enhanced and re-compiled only to integrate minor new features without moving to the newer systems at the same time. Supporting PowerPC requires that newly written software must compile cleanly with the old gcc version 2.95 from the LynxOS toolchain and at the same time with gcc 4.1 (SLC5) and gcc 4.4 (SLC6).

The C++ Toolbox In terms of application monitoring and diagnostic remote access, the C/C++ landscape is rather fragmented, complicated, and incomplete compared to the Java platform.


The current operational C++ Toolbox consists of:

• Diagnostics: Different programming-framework and task-specific tools

• Debugging: Trigger core dump and/or attaching gdb

• Post-mortem: Analyze core dump with gdb

• Info/Warnings: Centralized logging

• Configuration Management: Centralized framework for tracing messages

In many cases logging is reduced to a minimum to avoid performance degradation which would affect the real-time constraints.

For monitoring a process's health there are currently only simple process existence checks and regular probing of functionality from the outside (e.g. application-specific tests that the process does what it is supposed to do), as well as a manual core dump if a problem is suspected.

There is currently neither a standardized nor an easy way to monitor a specific value from inside a C/C++ process. However, just as in Java, there is a need for monitoring in C++.

2.2 Motivation

We claim that monitoring is even more critical for C/C++ than for Java, since C and C++ are commonly considered low-level programming languages. Generally, applications written in low-level languages are more error-prone, more difficult to verify automatically and have a more complex build process; hence they need more attention than similar Java applications during the development and testing phases.

But not all problems can be identified before testing is finished. Some problems are identified very late, that is, during productive run-time. However, in all these stages there are no appropriate, simple-to-use tools for monitoring and diagnosis available to developers.

Experience from real-world operation revealed, for example, a serious problem: software can be updated during working hours and may introduce faulty code. Although it may work initially, it eventually stops some time later, e.g. when an internal message queue has filled up or a counter has overflowed.


The original author has no chance to identify this issue until the program breaks during the night. No monitoring system has a clue about the approaching problem, and the person operationally responsible for the service is not automatically alerted about the situation.

The consequence: machine operators have to identify the source of the problem themselves, then experts have to be called in, and the resolution of the situation takes much longer than during the day. The experts may have very little information on the C++ programs and often no possibility to inspect a running instance of the program remotely. Entirely crashed programs may, depending on their setting, write a core dump of their last state which can be analyzed. In any case, important historical data of internal metrics, which would help in tracking down the source of the problem, is not always available.

Next steps Improving this situation for the CERN BE-CO environment is constrained in several ways. In the following chapters we will define the targets and requirements for a possible solution. We will look at existing ideas and solutions and evaluate their suitability with respect to environment constraints and especially their compatibility with real-time controls applications.

2.3 Related Work

In the following chapters there are references to related work. This section provides an overview of referenced work. A discussion of existing software products will follow in chapter 4.

The efforts for better C/C++ monitoring at CERN will enhance the capabilities of the larger controls monitoring framework DIAMON [5] (Diagnostics and Monitoring). As with the existing modules for Java, it is foreseen to improve the capabilities of DIAMON for services written in C/C++.

The closest relative to this work is the CERN paper “CMX – A Generic Solution to Explore Monitoring Metrics”[6] for the 14th International Conference on Accelerator & Large Experimental Physics Control Systems, published in autumn 2013. The corresponding source-files, paper and poster can be obtained from the CMX Website.

The paper describes the ideas for C/C++ monitoring and the planned integration into the existing controls monitoring framework. At this time a prototype of CMX existed, but this version suffered from several issues including memory leaks, locking errors, reader-writer access blocking and a flawed OO design for the C++ part.

For the implementation, we refer mainly to the following literature:

• “Is Parallel Programming Hard, And, If So, What Can You Do About It?”[7] written by McKenney explains parallel programming in general, atomic operations, and CPU memory barriers. The Sequence Locks mentioned there in section 8.2 are equivalent to the locking mechanism used in CMX.

• “Timecounters: Efficient and precise timekeeping in SMP kernels.”[8] by Kamp describes a specific use-case in the FreeBSD kernel. The algorithm can be seen as a mix of holding multiple copies of a value and having a sequence lock.

• “Effective synchronization on Linux/NUMA systems”[9] by Lameter describes Se- quence locks as they are implemented in the Linux Kernel.
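The sequence-lock mechanism that these references revolve around can be sketched with C++11 atomics as follows. This is an illustrative sketch, not CMX's actual implementation; a production version needs stricter fencing so that the data accesses themselves are race-free under the C++ memory model:

```cpp
#include <atomic>
#include <cstdint>

// Single-writer sequence lock: readers never block the writer.
struct SeqValue {
    std::atomic<uint32_t> seq{0};  // even = stable, odd = write in progress
    int64_t value = 0;             // the published value
};

// Writer side (real-time thread): constant-time, wait-free for one writer.
void seq_write(SeqValue& s, int64_t v) {
    s.seq.fetch_add(1, std::memory_order_acq_rel);  // seq becomes odd
    s.value = v;
    s.seq.fetch_add(1, std::memory_order_acq_rel);  // seq becomes even again
}

// Reader side (monitoring agent): retries until it sees a stable snapshot.
int64_t seq_read(const SeqValue& s) {
    for (;;) {
        uint32_t before = s.seq.load(std::memory_order_acquire);
        if (before & 1u) continue;           // a write is in progress
        int64_t v = s.value;
        uint32_t after = s.seq.load(std::memory_order_acquire);
        if (before == after) return v;       // no write intervened
    }
}
```

The key property for this thesis is the asymmetry: the writer performs a fixed number of operations regardless of reader activity, while readers bear the full cost of retrying.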

3 Requirements

This chapter defines what we expect from a C/C++ monitoring and diagnostics solution. It provides a formalized overview of the requirements for this project. All requirements and definitions are formatted like:

TYPE N.N (short name) …short description… Followed by a detailed explanation… ⋄

3.1 Terms

In the following, terms are described which are used throughout this chapter.

TERM 1 (Roles). We use roles to characterize different kinds of user groups. A user can have any number of roles. The following are important in our scenarios: Developer, Operator, Monitoring and Expert.

• A Developer creates applications and may use libraries from other developers. In case of unforeseen incidents, the developer may be involved in resolving errors during operation.

• An Operator takes care of machine operation. Operators can call specific experts if additional support is needed.

• A Monitoring system reads values periodically to provide a historical view and trigger alarms.

• An Expert is acquainted with a broad range of systems and can be a developer at the same time. ⋄

TERM 2 (Metric). A metric describes a measure of a property of the system that is being monitored.


A metric is similar to a Key Performance Indicator (KPI): every KPI is a metric, but not every metric is automatically a good candidate for a KPI.

Examples of possible metrics are: “requests processed per second”, “number of connected clients”, “amount of memory in use”, “round-trip time to external peripheral”, “uptime”, “number of crc errors”.

Metrics can be exported from running applications, here the following applies:

• One application has zero to many metrics.

• A machine can execute multiple applications; it can also execute the same application multiple times.

The term 'metric' can be used more precisely as:

Metric             App.no_connections/int     defined by developers
Metric instance    App.no_connections/int     created at program runtime
Metric value       App.no_connections=10      updated during runtime
Metric aggregate   App<1>.no_connections + App<2>.no_connections + App<3>.no_connections = 120   calculated in external system
Metric alarm       App.no_connections==0      calculated in external system

Developers define metrics during the development of their applications. Their names do not have to be compile-time constants. The metrics get registered at program startup. Their values can be static (determined at compile time) or dynamic (updated at run-time).

Using these values, an independent system (e.g. monitoring system) can aggregate metrics or apply rules to trigger alarms. ⋄
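The distinctions above can be illustrated in code. All names in this sketch (MetricRegistry, register_metric, and so on) are invented for illustration and are not the CMX API:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of the metric / instance / value / aggregate terms.
struct MetricRegistry {
    std::map<std::string, int64_t> values;  // metric instance -> current value

    // Metric instances are created (registered) at program startup.
    void register_metric(const std::string& name) { values[name] = 0; }

    // Metric values are updated during runtime.
    void update(const std::string& name, int64_t v) { values.at(name) = v; }
};

// A metric aggregate is computed by an external system over all instances.
int64_t aggregate(const MetricRegistry* apps, int n, const std::string& name) {
    int64_t sum = 0;
    for (int i = 0; i < n; ++i) sum += apps[i].values.at(name);
    return sum;
}

// A metric alarm is likewise a rule evaluated in an external system.
bool alarm_no_connections(const MetricRegistry& app, const std::string& name) {
    return app.values.at(name) == 0;
}
```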

TERM 3 (Real-Time). “A real-time computer system is a computer system where the correctness of the system behavior depends not only on the logical results of the com- putations, but also on the physical time when these results are produced. By system behavior we mean the sequence of outputs in time of a system.” [10, p. 2]

The real-time term is often separated into soft and hard real-time. An example of soft real-time is a live 3D visualization: it runs at about 30 fps to look smooth, but occasional frame skips can be tolerated and will not be noticed by the human eye. On the contrary, a hard real-time system might control an industrial robot. If it fails to send a STOP signal to the robot at exactly the right time, things might get damaged permanently.

If we see a real-time system as a black box receiving inputs and reacting by emitting outputs, then our goal is that this system can fulfill its constraints regarding response times. In fact, our main interest is to avoid harming the real-time properties of existing systems when adding functionality into their execution path.

Real-time is not a strict term. In this work we are not going to make detailed Worst Case Execution Time (WCET) calculations. That means we will not look at influences such as those from caches and peripherals, and we will also not calculate actual execution times based on scheduling priorities. Both are heavily dependent upon the execution platform's hardware and software. Instead, we analyse the complexity and added costs of algorithms and functions. ⋄

3.2 Functional Requirements


Figure 3.1: C++ Monitoring and Diagnostics: Users and their use-cases

The functional requirements are described from a user's point of view. Fig. 3.1 shows the use-cases of the designated users.

FUNC 1 (Monitoring). The monitoring system must be able to collect internal process metrics from C/C++ processes.


The DIAMON monitoring system is able to monitor different sources of information through customized data acquisition agents. Support for monitoring C/C++ metrics shall be provided to the same extent as is currently supported for Java processes through JMX.

It is not required to provide event-processing grade precision where every status change in the monitored application is evaluated by the monitoring system. ⋄

FUNC 2 (Development). The system must provide tools for developers to allow insight into software test-runs during the development.

C/C++ monitoring needs to be accessible with the same comfort as is currently possible for Java/JMX using JConsole or the embedded JMX GUI in the CERN-specific monitoring GUI.

In addition, because C/C++ is often used in a very low-level, system-oriented environment where using Java GUIs can be uncomfortable, there should be powerful command-line tools to access the values exposed by applications.

Developers concentrate on the main functionality of their applications. Monitoring and diagnosis functionality is often only added where needed. To encourage developers to implement metrics, doing so must be as easy as possible. ⋄

FUNC 3 (Diagnosis). Operators and developers must be able to explore all exposed values for error diagnosis.

The differences between monitoring and diagnosis require that the user can decide to re-acquire the metric values at any time, meaning he is not bound to the regular refresh interval of the monitoring system.

For diagnosis the user must be able to access every exposed metric, not only those configured to be surveyed by the monitoring system. ⋄

3.3 Technical Requirements

TECH 1 (Real-Time). The system must never interfere with the main program through blocking execution or non-deterministic duration of function calls.


The real-time processes must not be disturbed if the monitoring system wants to get a regular update of a metric (monitoring aspect) or the user asks for the current value of a metric (diagnostic aspect). The overhead for the real-time thread to update a value needs to be very low and must not block the process or obstruct the execution in any other way. ⋄
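For a single scalar metric, an update satisfying this requirement can be as cheap as one atomic operation: the write path is wait-free and has constant cost. The sketch below assumes a single 64-bit counter and is not CMX's actual mechanism:

```cpp
#include <atomic>
#include <cstdint>

// The real-time thread publishes a metric with a wait-free atomic update;
// the monitoring side samples it without ever blocking the writer.
std::atomic<int64_t> g_requests_processed{0};

void on_request_done() {   // called from the real-time path, constant cost
    g_requests_processed.fetch_add(1, std::memory_order_relaxed);
}

int64_t sample_requests() { // called from the monitoring/diagnostic side
    return g_requests_processed.load(std::memory_order_relaxed);
}
```

Relaxed ordering suffices here because a monitoring sample does not need to be ordered against other memory operations; it only needs to be a valid recent value.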

TECH 2 (Integration). The C/C++ monitoring system must integrate well into existing infrastructures, such as the build process and the data acquisition of the monitoring system.

The system shall be able to use the compile-time information which is already generated in the build process and make it available at run-time. This includes, e.g., the username of the user who compiled a product or the exact release timestamp. Information exposed at run-time must be accessible to the existing accelerator controls monitoring system. ⋄

TECH 3 (Reusability). The C/C++ monitoring system must be usable in all different kinds of C/C++ applications.

C/C++ projects in CERN BE-CO are not standardized; there are several projects from different teams that differ slightly. A monitoring system must not be specific to special use-cases but applicable to all kinds of applications. ⋄

TECH 4 (Portability). The system needs to be separated into modules which are CERN-specific and those which are not. The core libraries and tools must be portable to other similar environments.

It is also preferable if the process is not directly related to the monitoring system in use at CERN, but provides only a clean interface from which many tools can profit. ⋄

TECH 5 (Easy to use). Exposing run-time information must be an easy programming task.

The developer must be bothered as little as possible with implementation details, and the implementation must be as unintrusive as possible. The goal of a library for exposing monitoring and diagnostics information is to hide all implementation details behind a simple API. ⋄


TECH 6 (Datatypes). Support the most common datatypes for metric values.

The following common datatypes must be supported:

• 64-bit signed integer numbers

• 64-bit floating point numbers

• boolean values

• character strings of a maximal length specified by the developer

The values of the different datatypes must be stored without applying any lossy type conversion. The value returned by read calls must be bit-wise equal to the value stored before. ⋄
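The bit-wise requirement rules out numeric casts: a double must be stored and returned bit-for-bit. A minimal sketch of lossless storage in a 64-bit slot (illustrative only, not CMX code):

```cpp
#include <cstdint>
#include <cstring>

// memcpy reinterprets the 64 bits of the double; a cast like (int64_t)d
// would round and violate the bit-wise equality requirement.
int64_t double_to_slot(double d) {
    int64_t slot;
    std::memcpy(&slot, &d, sizeof slot);
    return slot;
}

double slot_to_double(int64_t slot) {
    double d;
    std::memcpy(&d, &slot, sizeof d);
    return d;
}
```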

4 Existing Technologies and Solutions

This chapter highlights existing possible solutions and points out their characteristics.

For enabling processes to expose internal values, there has to be some kind of interprocess communication (IPC) technique. In this chapter we discuss possibilities for interprocess communication.

IPC in general is defined as “different ways of message passing between different processes that are running on some operating system” [11]. In fact, the processes can also run in different places connected via a network.

In this chapter we evaluate IPC mechanisms regarding their suitability to connect real-time processes to monitoring facilities without coupling their execution behavior. This investigation contains an overview of existing technologies and assesses them against the requirements from the previous chapter.

The monitoring patterns and criteria for IPC mechanisms are then applied in a review of existing monitoring software. In this comparison, some software which is popular and powerful in general turns out to be inapplicable or does not provide solutions for the use-case of monitoring real-time applications.

Before doing a detailed analysis of IPC mechanisms, we will next look at the general properties of a monitoring system and compare it to a logging system.

4.1 Monitoring Systems

A general architecture of a monitoring system is shown in Fig. 4.1. In the following, we will describe the components shown in this figure and then focus on the connection between the application and the connector/agent of the monitoring system.



Figure 4.1: General monitoring system architecture

Components The figure shows examples of two applications under monitoring. The left one is monitored through a server-side agent which regularly polls values. It could be a JMX-enabled Java application where JMX attributes are read in specified intervals.

Next to this, on the right side, the application actively pushes information to the monitoring system. The application sends out values as events which are then processed by the rules of the monitoring system.

Each application contributes a metric (named metric1 and metric2). In this example, each metric is evaluated with a rule that yields a boolean result. This result is combined with a logical AND and translated into an integer status code.
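Such a rule evaluation can be stated compactly in code; the thresholds below are illustrative and not taken from an actual DIAMON configuration:

```cpp
// Each metric yields a boolean; the booleans are AND-combined and mapped
// to an integer status code (1 = OK, 2 = problem), as described above.
int evaluate_status(double metric1, double metric2) {
    const bool rule1 = metric1 > 500.0;  // illustrative threshold
    const bool rule2 = metric2 < 14.0;   // illustrative threshold
    return (rule1 && rule2) ? 1 : 2;
}
```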

The result of the evaluation is then used by any kind of status listener. Typical candidates are systems for recording the history of values, status displays for operators, alarm systems and notification systems sending messages via e-mail and SMS.

Temporal behavior The entire stack shown in Fig. 4.1 is driven by events which are propagated from bottom to top (from the source over the event processor to the sink).

If the right application changes a value, it sends out an event. Therefore, all changes in the application reach the monitoring system. There, the update event will invalidate all dependent results and trigger a recalculation.


The application on the left, in contrast, does not itself decide the point in time at which it delivers metric updates. Instead, the updates are collected regularly in fixed intervals determined by the acquisition agent. The acquisition agent then sends the values to the monitoring system, where they will trigger a recalculation if the values have changed.

Real-time applications If we imagine that the application to monitor is a real-time controls application, then both scenarios are suboptimal for the following reasons.

A real-time application can operate at a high frequency. If on every state change it communicates with the monitoring system, the consequences of each change have to be calculated every time. This can create a considerable amount of work for the monitoring system, depending on the complexity of the rules.

Application metrics can be updated with a very high frequency. The resulting network traffic will therefore be highly non-deterministic and can easily overload the network. A possibly fluctuating load on the kernel's network stack is prone to influence the system behavior. Even when using stateless protocols (possibly UDP), the in-kernel work connected with writing to sockets is not tolerable inside the real-time application thread.

Considering querying an application regularly through an acquisition agent (Fig. 4.1 on the left), it is not acceptable to allow an external entity to communicate with a real-time application directly. External requests, especially if they require session state handling and command parsing, may harm the deterministic execution as well.

Figure 4.2: Data acquisition for monitoring systems (schematic: on the left, the monitoring system's acquisition agent polls values from a local agent; on the right, a local agent pushes values in intervals to a connector; "IPC 1" links agent and monitoring system, "IPC 2" links agent and application)

For these reasons, the acquisition has to be decoupled from the normal program flow. Fig. 4.2 extends the previous Fig. 4.1 with this aspect.


Fig. 4.2 shows two IPC connections. The first ("IPC 1") is the same as before, typically a network connection used to send values to the monitoring system.

The second ("IPC 2") is a new, additional IPC connection between the real-time controls program and a lower-prioritized agent. This agent is either queried by the monitoring system's acquisition agent (Fig. 4.2, left) or communicates directly with the monitoring system and pushes values (right).

"IPC 2" needs to be non-blocking from the side of the real-time application, and its overhead must be as low as possible. The IPC mechanisms available on our target systems are discussed in section 4.3.

4.2 Logging Systems

From a high-level perspective, the task of system or application monitoring is similar to system or application logging. Both involve a producer of information (system/application) and a consumer (logging/monitoring system).

In most enterprise landscapes a centralized logging system is installed. At CERN, the accelerator control system uses a low-footprint logging library [12]. The C++ variant of this service sends messages as UDP packets to a central endpoint which injects them into the logging system. From there, user applications can subscribe to messages or inspect the history.

That means there is already a system which allows pushing information to a central instance. The question arises whether it can also be used to transmit C/C++ monitoring information.

Generally the use-cases for logging and monitoring are not the same:

• Logging provides access to the latest log message issued by the process in near real-time, but the point in time at which the next message will be issued/received cannot easily be determined.

• Monitoring keeps track of values over time. Usually "snapshots" of values are taken in predefined intervals.

• Diagnostic access is entirely controlled by the user.


For monitoring systems, evaluating logging messages can be one source of information. For example, the mere number of messages per second can be a useful metric for classifying the health state of a service. More specific information requires knowledge of the data format used by the applications.

However, compared to the characteristics of monitoring real-time applications from the previous section, the technical implementation of logging is incompatible with monitoring. As shown in Table 4.1, logging is traditionally push-based: the log producer sends its messages/events to the log consumer at a time of its own choosing. Therefore, there is no separation like the one named "IPC 2".

             temporal behavior                data format
Logging      determined by producer           mostly unstructured
             (the program)
Monitoring   regularly / by producer          structured and standardized
             or consumer
Diagnostic   determined by user               structured but flexible

Table 4.1: Comparison of Logging, Monitoring and Diagnostics

Table 4.1 summarizes the differences between logging, monitoring and diagnostic systems. This leads to the following arguments against using the logging infrastructure to monitor real-time applications:

• The number of variables may grow so high that pushing every value change becomes inefficient.

• Scaling: logging might work for a small number of hosts, but scaled up to around 2000 hosts constantly logging at least 200 messages per second, it would certainly overload the current logging system.

• The update frequency is very limited and entirely determined by the process. One probably does not want to emit logging messages at a rate higher than about one per second, which would again require rate-limiting or non-deterministic logic on the client side.

• Real-time software does not tolerate interruptions. Logging over the network is therefore not possible and would require a non-blocking queuing mechanism to deliver the monitoring data to non-real-time threads.

• Real-time software often turns logging off because it threatens deterministic behaviour.


• All values have to be sent over the network every time, even if the monitoring system throws some of them away or does not actually monitor a value. The client side does not necessarily know which information the server is interested in.

Conclusion The existing logging system is not suitable for monitoring and diagnostics. The identified differences in communication make an effective implementation of monitoring and diagnostics on top of the logging facilities infeasible for the requested environment.

4.3 Interprocess Communications

Current operating systems provide different kinds of IPC mechanisms, sometimes with different implementations of the same principle. This section provides an overview of the advantages and disadvantages of different IPC solutions.

Because of the huge variety of available IPC mechanisms, we will concentrate in the following on the most applicable ones.

Figure 4.3: Taxonomy of IPC facilities (based on [13, p. 878]). Communication facilities divide into data transfer — byte stream (FIFO/pipe, stream socket, pseudoterminal; application examples: HTTP, MOM) and message transfer — and shared memory (System V shared memory, anonymous mappings). Synchronization facilities comprise signals (standard/realtime), semaphores and file locks (fcntl(), flock()).


To sort out the non-applicable IPC mechanisms, we use the "IPC Taxonomy" (Fig. 4.3) found in "The Linux Programming Interface: A Linux and UNIX System Programming Handbook" [13]. It shows the different categories of IPC mechanisms, actual implementations and applications.

4.3.1 Possibilities

A monitoring communication protocol involves two parties: writer(s) and reader(s). The writer is an application exposing internal information; the reader is the part of the monitoring system accessing the information of this application. Most of the time there is only one writer, but there may be any number of readers.

The basic non-blocking requirement demands that the protocol be designed without states in which the writer has to wait for the reader. A request-response model is also unsuited, because we want to rule out the possibility that the reader disrupts the execution of the writer.

communication → data transfer A very common representative of data-transfer IPC in the "byte stream" subclass are TCP sockets. A solution designed for stream sockets can easily be adapted to fast communication inside one host using UNIX domain sockets, as well as to network-transparent communication with TCP, possibly adding authentication, encryption and compression layers, thus making it scalable for internet-wide usage.

Writing data on sockets requires a system call, which is expensive in terms of execution time. It may also block the execution of the current thread somewhere in the kernel.

With data transfer, a message between the writer (publishing a metric's value) and the reader is destroyed after reading. Consequently, the reader always has to listen for values sent by the publishing applications.

In real-time systems, we expect the reader always to have lower priority than the writer. Hence, if the writer publishes metric values at full speed, for whatever reason, this means a lot of work for the reader. If the reader cannot cope with this rate, the writer will be blocked or some other overflow-mitigation mechanism has to deal with the situation. The reader has to handle incoming updates at all times, even if the reading side is currently not interested in up-to-date values. Conversely, if the reader is unavailable for some reason, the writer needs to be able to deal with this situation as well.

A concept like the traditional client-server model, as implemented e.g. in HTTP (Hypertext Transfer Protocol) [14], is also unsuited as it requires connection handling and state tracking; the rate and volume of requests will therefore certainly influence the execution of the application. This way a low-priority reader process can affect high-priority real-time processes, which is a form of priority inversion and undesirable.

The same applies to the connection to a MOM (Message-Oriented Middleware). While different communication patterns are possible here, they all suffer either from too much client-side work for handling subscriptions or from possibly generating unpredictable amounts of traffic for value updates.

communication → shared memory There are different flavors of shared memory. All allow any number of processes to access the same memory region by mapping it into the processes' virtual address spaces.

Shared memory, in contrast to data transfer, is unsynchronized by design. Therefore, custom synchronization methods have to be implemented.

Reads from shared memory are non-destructive. The values stay the same until they are overwritten, and without additional effort the writing process need not care about readers.

In communication realized by data-transfer mechanisms, both sides agree on the same "protocol", which can be designed to be backward-compatible. In data transfer a protocol is event-oriented: typically, every message either contains a set of information or triggers a state change.

In shared memory there is only one piece of shared information which can be modified at any time by any involved party. This makes the protocol design for shared memory much harder than for data transfer/event oriented communication.

The most important advantage of shared memory is that it is by design the fastest possible inter-process communication technique, without involving in-kernel queues or locks.

In current computer systems there cannot be anything faster than communication directly through memory. The essential communication instructions (read, write) are implemented in hardware, which is the most direct communication possible.


Memory latency and throughput are low but still not constant. Like every other operation, memory access is influenced by various factors, including:

• Size of processor cache

• Number of processors in a SMP system

• Performance of the cache hierarchy, performance of the attached main memory

• Performance of the inter-processor bus, cache coherency protocol

• Current system workload, type of other concurrent applications

• Timing anomalies because of out-of-order execution, prefetching, speculative exe- cution etc. [15]

signal/synchronization These facilities are not usable for communication, because they do not transport any payload data. Signals and locks have to be used with great care, as they cause time delays and unpredictable changes of the program flow. They are listed here for completeness.

4.3.2 Evaluation

In general, all the mechanisms under data transfer are unsuited in the same way as the logging solutions are unsuited for monitoring (see section 4.2).

While a message-driven architecture is probably the most flexible approach, it is hard to make a message-driven system real-time compatible. The overhead and blocking behavior of the message-oriented approaches could be mitigated by separating the creation of a metric update from actually sending it. This can be implemented with a separate, low-priority sender thread: the real-time threads would pass their data to the monitoring thread through something like a size-limited queue and hence remain unaffected by the latency of the communication with the monitoring server.

This kind of decoupling between real-time and non-real-time code can be achieved more easily by using shared memory directly inside the client application's address space.

We consider the approach of letting the applications to be monitored write directly to an IPC shared memory region preferable to less direct communication solutions such as byte transfers over a network or pipes. It also reduces the amount of code and logic that has to be included in every application. Furthermore, because of the reduced complexity, guarantees about real-time suitability become more trustworthy.

Due to the nature of shared memory, if whole data structures are published at once, the implementation becomes less private (no "encapsulation" in OO terms), with the effect that maintaining a fully backward-compatible system is more complicated.

By avoiding the in-kernel work that all other IPC mechanisms trigger, shared memory is very fast and has the lowest possible overhead. It allows us to make value updates nearly as cheap as normal arithmetic operations. SHM is perfectly suited to fulfill the TECH 1 Real-Time requirement.

The integration into the centralized monitoring system will then be done in a separate reader process, which can be scheduled with low priority. It can be started/stopped, up- dated and extended independently. This also makes it easier to integrate into any other monitoring solution in the future (TECH 4 Portability).

4.4 Existing Software Solutions

This section covers a wide range of existing software solutions for monitoring and soft- ware components which might be useful in monitoring tasks.

JMX On the Java platform, the Java Management Extensions (JMX) [16] are often used for remote monitoring of Java processes. JMX has an extensive set of features covering more than merely exposing run-time information.

Run-time information in JMX is organized in different kinds of MBeans. Basically an MBean exposes values and enables remote triggering of operations.

Fig. 4.4 shows attributes of an MBean in the Java VisualVM Tool. In this example, an application exports information about loaded JAR (Java Archive) files as MBeans. The MBeans have two attributes: URL and Properties.

The usage of JMX is restricted to the Java platform; however, it is possible to start a Java virtual machine embedded inside a C++ application solely for the purpose of JMX [17].


Figure 4.4: Java VisualVM showing JMX attributes

xymon The xymon [18] host-monitoring solution uses POSIX shared memory for inter-process communication (see lib/xymond_ipc.c). From looking at the source code, it appears that the xymon developers did not choose shared memory for performance reasons, but for ease of use.

pcp As a full-featured monitoring suite, the Performance Co-Pilot (pcp) [19] supports so-called memory-mapped values (pcp/src/libpcp_mmv). pcp is a complete framework for logging application performance metrics. For the environment at CERN the pcp framework is too intrusive. We also did not find any statements in the documentation or code about how the memory-mapped values in pcp handle concurrent access.

other applications of shared memory We found usages of shared memory in other applications which are less connected to monitoring. Nevertheless, we mention them and describe shortly how they use shared memory.

pvbrowser [20] is a SCADA application framework. It provides a System V shared memory backed value table supporting all common datatypes. The table is protected by a process-shared pthread mutex [21, pthread_mutex_init]. pvbrowser supports the Linux, Windows and VMS operating systems.

localmemcache [22] is a key-value store implementation using POSIX SHM objects as storage back-end. It is mainly intended to be used from programs written in Ruby and tries to emulate a Berkeley DB style access paradigm. The design looks very specific; it uses POSIX named semaphores for exclusive locking of the whole table.


The X11 window manager i3 [23] uses shared memory to provide a debug channel which persists across crashes and stays readable if the program hangs (see src/log.c). It does not lock the shared memory data structures; instead, a pointer to the latest log message is provided, and a pthreads condition variable [21, pthread_cond_init] is used to broadcast a signal when the pointer has been updated after adding a new message.

4.5 Conclusions

This chapter presented monitoring systems in general and highlighted specific aspects of monitoring real-time applications. We identified that there has to be a kind of decoupling layer between the monitoring system and the real-time application, which ensures that activity of the monitoring system cannot harm the application.

In the evaluation of different IPC mechanisms, we designated shared memory for the communication between the real-time application and a monitoring agent on the local machine.

A comparison with existing software solutions showed that no plug-in solution to this problem currently exists. However, some aspects of exposing values through shared memory have been implemented for different use-cases elsewhere.

Consequently, the next chapters will describe the development of CMX, a new solution for monitoring real-time controls applications written in C and C++.

27 5 Design of CMX Protocol and Data Structures

This chapter describes the design phase of the data structures and access protocol of the CMX library. The CMX library is intended to fulfill the requirements described in chapter 3.

In the previous chapter we concluded that a monitoring solution, according to the require- ments, must be implemented using shared memory as the inter-process communication technique. The next section discusses the design of shared memory data structures and a reader-writer access protocol.

5.1 Design of CMX Data Structures

From the definition of a metric (TERM 2), the behavior of our selected IPC mechanism, and our system environment, we derived a model which is shown in Fig. 5.1 as the optimal representation of data:

• A host can execute any number of applications at the same time

• An application can expose independent sets of metrics (Components: TERM 2).

• Some predefined metrics are exposed by default (easy to use: TECH 5).

• Metrics have different data types, depending on their content (types: TECH 6).

The model is a hierarchy written like: Host->Process->Component->Metric.

The Host part is self-evident, since we do not want to communicate metrics over the network using the core CMX library. A Host is a computer system executing multiple Processes and is identified by a host name.


Figure 5.1: CMX Host - Process - Component model. The example shows an application process with a predefined Process component (start_time=1323123123/int64, hostname=ewe-123-fbcdev/string, process_name=TEST-ECW10/string) and a user-defined component TestMetrics (active_users=5/int64, last_sql_stmt="SELECT..."/string, items_processed=120123/int64).

The Processes shall be independent in the way in which they expose their metrics. This allows smooth upgrades in case of improvements to the library and reduces the risk of interference between the processes.

The next separation level is the Component. It maps directly to a shared memory space in which the metrics are organized. Most executables are built from many different libraries, some of them from third parties. By letting every library register its own Components independently, we avoid the risk that one library interferes with others by filling up a shared Component.

Additionally, the CMX monitoring library will automatically register a so-called Process Component for every application to expose some predefined metrics such as start time, host name or resource usage.

Metrics are placed into Components. Every metric has a name, a type and a value. The name is a length-limited character string. The value can have a fixed size (integer, float, boolean) or arbitrary length (TECH 6).

Metrics can be either addressed by their indexes in the component or searched by their names. The search is a simple linear search and should be avoided if possible.

A Component starts empty and has to be filled with Metrics, which are initialized with a neutral value (zero: 0, or the empty string: ""). The developer can then set and get values on the metrics.


It is not planned to implement hash-map access to Components, as a hash map is by design non-deterministic in time and a very dynamic data structure. The implementation effort would be quite high for statically sized shared memory segments and is not feasible for CMX.

By putting the metrics into Components we also avoid name collisions if two instances of the same application register a metric with the same name.

A process can expose the same Component multiple times under different names, e.g. once for each of its client connections or storage subsystems. This approach is similar to JMX, in the sense that one can expose several instances/objects of the same type (MBean) [16].

Implementation of the Hierarchy The planned hierarchy for metrics stored on a computer is Host->Process->Component->Metric. Given that we chose POSIX SHM ob- jects as our memory-backing technology, the first abstraction (Host) is free, provided by the separation through the operating system.

The next separation level is special since POSIX SHM objects are not directly grouped by owning process; instead they are linked to the creator's user-id. Therefore, we write the owning process's PID into the name of the SHM file, e.g. /dev/shm/cmx.2345.ComponentName, where 2345 is the operating system process ID (PID).

The Metrics stored in the Components are organized in “slots”. A slot is usually used for one value (integer, float, bool). In case of character strings many slots can be chained together to support data of arbitrary length.

5.2 Shared Memory

This section discusses some aspects of using shared memory for implementing CMX. It starts with a description of how different SHM implementations identify shared memory regions. Then the usage of pointers inside shared memory is discussed, followed by some general thoughts about the design of shared memory data structures. Finally we focus on the delayed mapping of allocated shared memory to physical pages.


Identification and Handles System V shared memory segments are identified by a handle which is determined when the segment is created. The handle is derived from a numeric key specified by the application; from it the operating system derives a unique SysV id which acts as the handle. The application-chosen numeric key is subject to possible identifier collisions between independent applications.

The derived ID is then used in calls to the control (shmctl) and attach/detach (shmat/shmdt) functions. Special System V command-line utilities are available to create/inspect/destroy shared memory segments.

It is also possible to skip the numeric key and directly ask the operating system to gen- erate the unique ID. The unique ID can be passed to other processes.

POSIX shared memory objects, by contrast, use character-based identifiers. On Linux, the SHM object is created as a file in the directory /dev/shm, which is a filesystem of type tmpfs (a filesystem stored entirely in RAM). POSIX SHM objects are thus equivalent to memory-mapped files backed by an in-memory filesystem.

Re-sizing shared memory: pointer issues While it seems appealing to be able to re-size the amount of memory available for storing metrics according to the actual need, this raises some practical problems and makes the implementation of the data structures more complicated. The most significant problem with re-sizing applies in the same way to the usual memory management using malloc() and realloc() from the C standard library.

For the initial allocation the call to void *malloc(size_t size) returns (if successful) a pointer to the newly allocated memory.

The operating system creates a mapping between the process's virtual memory and the host's physical memory (see Fig. 5.2). A subsequent call to malloc() will likely return the virtual-memory address of the previously returned pointer plus the size of the previously allocated region.

Once allocated, memory can be resized using realloc(). It is possible that the first allocated block cannot be expanded in place because it would grow into the address space of the next allocated block. In that case, the call to realloc() returns a new pointer, which can differ from the previous one.

The same problem applies in principle to mappings of POSIX shared memory objects using shm_open(), mmap() and ftruncate(). Hence, the variable holding the pointer to the shared-memory data structure needs to be protected against concurrent access. If it is not, one thread could trigger the re-size operation and thereby render the current pointer invalid while it is still in use by another thread, leading to an illegal memory access in the second thread. Any implementation is therefore limited to only ever increasing, never decreasing, the shared memory size. Otherwise blocking synchronization would be required to prevent other threads and processes from accessing possibly invalid memory, which would render the whole effort of providing fast, guaranteed non-blocking access to shared memory useless.

Re-sizing shared memory: data structure design Dynamic data structures are complicated to represent in C. In C there is no notion of an "array with a variable size of X". A workaround is to use pointer manipulation; the following struct definition serves as an example:

struct cmx_value {
    int value;               //> numeric value
    char name[64];           //> name of this value
};
struct cmx {
    int process_id;          //> process-id of the creator
    int number_of_values;    //> current number of values (size)
    char component_name[64]; //> name of this collection/component
};

Here we do not declare the actual array of struct cmx_value. The allocation has to calculate their size and add it manually to the overall size:

struct cmx *cmx;
cmx = (struct cmx *) malloc(sizeof(struct cmx)
                            + NO_OF_VALUES * sizeof(struct cmx_value));

Since the value slots are no longer real fields in struct cmx, one has to take the address of the struct cmx, add one to skip to the position where the next array element would be, and cast this pointer to the type struct cmx_value *. This operation can be simplified using C preprocessor macros but stays error-prone because it cannot easily be verified. It gets even more complicated if there is more than one dynamically growing field, for example one array for the value names and a second one for the values themselves; then one either has to build groups or copy a lot of memory. The first prototype of CMX used groups of System V shared memory segments, which were allocated as needed and freed when empty. This is inefficient because it involves many mapping/un-mapping operations and required very strict locking.


Figure 5.2: Virtual to physical memory addresses. Process A and process B each map a shared-memory region of their own virtual address space (0x0000-0xffff) onto the same physical memory, which also holds other processes' memory and unallocated holes.

Mapping of virtual to physical pages The mapping from a virtual memory address to a physical memory page is not set up instantly when calling mmap() or malloc(), respectively.

The memory allocated by ftruncate() or malloc() is only actually wired to physical memory when it is first accessed. The operating system's page-fault handler creates the mapping on demand, in units of at least 4 KB.

Fig. 5.2 shows the simplified mapping of a memory segment shared between process A and process B. The mapping consists of two smaller physical memory regions and also unallocated holes. Holes can be exploited to allocate huge amounts of memory, reserving the range in the virtual address space without actually wasting the same amount of physical memory.

This method works as long as the memory does not get initialized (for instance "zeroed") by default. It works with malloc()-allocated memory, with System V shared memory segments and with POSIX shared memory objects alike. Most modern Linux filesystems also understand the similar concept of "holes" in files, so this is usable with filesystem-backed mappings too.

This behaviour can be changed at operating-system level by calling mlockall(). Called with the flag MCL_CURRENT it faults in all currently mapped resources; with the flag MCL_FUTURE it also affects resources such as shared memory mappings created in the future. The setting can be turned off again by calling munlockall(). The behaviour can also be changed for specific memory regions using mlock() or the mmap() flag MAP_LOCKED.

Criteria                    POSIX SHM        mapped file       SysV SHM
Common name                 SHM object       mmap()-ed file    SHM segment
Available since             1993 (POSIX.1b)  1999 (Linux)      1983 (SysV SVR1)
Identifier                  FS-path          FS-path           SysV key (integer)
Handle                      file descriptor  file descriptor   SysV id (integer)
Follows UNIX I/O design     yes              yes               no
Resizeable                  yes              yes               no
Auto-delete on last detach  no               no                yes
Portability                 ok               ok                very good
Data persistence            reboot           filesystem        reboot

Table 5.1: Comparison of Shared Memory Implementations

Conclusions There are three major implementations of shared memory to choose from. Table 5.1 shows the differences between the three implementations available on SLC5 and SLC6, using common criteria.

Memory-mapped files are, strictly speaking, not shared memory. As files, they are usually stored in a disk-persistent filesystem. However, they can be mapped into process memory as well and then shared between different processes. Mapped files are interesting because they offer persistence across system reboots, which to some extent even includes system crashes. In our target environment, however, most systems mount their filesystems read-only, so they cannot write to the filesystem.

All in all, we see POSIX SHM as the most powerful and suitable solution. Its programming interface is more consistent than that of System V SHM. System V can still be interesting if one needs to support Microsoft Windows, which, through the UNIX compatibility library provided by Microsoft, only supports System V IPC or a totally different Windows-specific interface.

5.3 Design of CMX Protocol

The previous sections described the overall data structure on a high level. The following section is about managing concurrent access to the data. It discusses ways to let a process write its internal values to shared memory without having to fear any obstruction caused by concurrent reads, while at the same time ensuring that a reader can always detect read-corrupted values.

5.3.1 Real-Time Constraints

The basic real-time constraints are roughly described in TECH 1. For the implementation, the following assumptions are made upon this:

• Not all functions need to be real-time suitable. For example, we cannot make guarantees about functions involving system calls.

• The basic get/set operations must be real-time suitable, so that they can be called from real-time threads.

• The creation of CMX Components and the registration of new metric names do not need to be real-time suitable, since these operations can run once at initialization time.

• We assume that only one thread wants to update the very same value at a time; every concurrent update of the same value is allowed to fail.

• A read is allowed to fail if a concurrent write happens. Otherwise it must succeed.

• A write is allowed to fail if a concurrent write happens. Otherwise it must succeed.

The following table shows an overview of the high-level operations needed to publish metric values using CMX. Some of these actions operate on an "object"; for example, a metric is always bound to a component. The column "Real-time required?" marks whether these actions have critical real-time requirements according to TECH 1.

Operation                           Object           Real-time required?
Create/update process information   -                no
Create Component                    -                no
Remove Component                    -                no
Create Metric instance              Component        no
Remove Metric instance              Component        no
Update (set) Metric value           Metric instance  yes
Read (get) Metric value             Metric instance  (limited) yes

35 5 Design of CMX Protocol and Data Structures

5.3.2 Concurrent Access to Shared Memory

For accessing shared memory, a common protocol is mandatory. The role of the protocol is to take care of handling concurrent access, and thus prevent data corruption and loss.

A shared-memory access protocol is different from the commonly known stream-oriented protocols. In shared memory, values can be manipulated at any position, at any time (random access). There is no access synchronization between multiple parties built into POSIX SHM objects; they may access the SHM concurrently and in overlapping operations.

Therefore, the access methods must implement a protocol which guarantees data integrity at all times. This applies equally to read and write operations.

In the following the evolution of such a protocol is described. We start with a naive, broken design and improve it to provide the required guarantees. The final design is described from page 41 onwards.

First approach A very simple example shows the basic usage of POSIX SHM objects and a flawed approach to storing values. The exposed values are of 32-bit integer type and identified by keys (64 characters).

The naive shared memory data structure looks like:

struct cmx_value_t {
    char name[64];
    int32_t value;
};
struct cmx_shm_t {
    char component_name[64];
    int32_t process_id;
    struct cmx_value_t values[1024];
};

The process can then map this structure to a shared-memory region of the same size (error checking omitted):

1  // create shared memory object
2  int fd = shm_open("component-name" /* name */,
3      (O_CREAT | O_EXCL | O_RDWR /* flags */),
4      (S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH /* mode */));
5  // set size of shared memory
6  ftruncate(fd, (off_t) sizeof(struct cmx_shm_t));
7  // map shared memory into virtual address space / create mapping
8  struct cmx_shm_t * cmx_shm_ptr;
9  cmx_shm_ptr = (struct cmx_shm_t *) mmap(0, sizeof(struct cmx_shm_t),
10     PROT_READ | PROT_WRITE,
11     MAP_SHARED, fd, 0);


The shared-memory object is created by calling shm_open() (line 2 in the listing above). The size of the shared-memory object is then set using ftruncate() (line 6) to the size of the cmx_shm_t struct type.

The actual pointer/reference to the shared memory is returned by the mmap() function (line 9). The mapping is configured read-write (PROT_READ | PROT_WRITE) and shared (MAP_SHARED). A private mapping (MAP_PRIVATE) would be a copy-on-write clone of the current state, visible only to the current process.

The returned reference to the shared memory is assigned to the variable cmx_shm_ptr of type pointer to cmx_shm_t (line 9). This pointer can then be used to access the struct as usual. That means up to 1024 values of type cmx_value_t can be stored there (about 70 KB of memory).

This simple approach works to a certain extent, but it lacks data integrity. The next steps document the problems in this solution and improve it gradually to the state which is actually implemented in CMX.

First enhancements The previously described version has many shortcomings of which we will address some in this first enhancement phase.

So far it is undefined when a value is set (contains valid data) or unset. We now define: if the name is empty (it starts with a null byte '\0'), the value is unset.

An update timestamp is also missing, but a timestamp field can easily be added (named mtime, for "modification time"):

struct cmx_value_t {
    char name[64];
    int32_t value;
    uint64_t mtime;
};
...

This raises a new problem: one value is atomic, but setting two values is not. Imagine the following program with two threads:

Update value: write data, then write mtime.
Read value: read data, then read mtime.

With any ordering, it is not guaranteed for the reader that his data (value,mtime) belong together.


In consequence, the following reader/writer pattern can appear. The update is write(data, mtime) -> (); the read is read() -> (data, mtime):

T1: write("A", 0x1)
T2: write("B", 0x2)    read() -> (data_T2, mtime_T1) -> ("B", 0x1)
T3: write("C", 0x3)    read() -> (data_T3, mtime_T2) -> ("C", 0x2)

Here the values read, ("B", 0x1) and ("C", 0x2), do not match any of the value pairs written on the left side.

Second enhancement To preserve the association between value and mtime, we need to prevent the reader from accessing the data while an update is in progress.

The classical approach is to use a simple spin-lock (condition flag) or a reader-writer lock. However, this is undesirable as it adds blocking behavior to the write and/or read functions. In real-time systems it is unreasonable for the writer process to have to wait for a reader to finish its read operation.

A lock requires the reader to interfere with the writer by grabbing the lock. It looks like an easy solution to simply use a lock without actually blocking execution, by failing immediately if the lock flag is set. This does not work: imagine adding a single bit indicating the current state (locked/unlocked). The reader then has to check this bit both on entry and on exit of the critical region in which he reads the value.

This approach is fundamentally broken because a reader can be suspended inside the critical region at any time, for instance because of a process context switch or the processor stalling on data access. During this sleep period an update can occur, and when the reader resumes the value is unlocked again. In this case the reader is unable to detect changes made during his reading phase, and the read data is likely corrupted or inconsistent. This scenario is generally known as the ABA problem [24, p. 235].

Instead of traditional locking, sequence locks can be used. Existing implementations are discussed and references to literature are made in section 5.4.

While this second step adds a classical sequence lock, the final implementation in CMX uses some modifications to suit its particular needs. The description here uses a traditional sequence lock for better understanding.

For counting the sequence value, a new field called ctr (for "counter") is added to the cmx_value_t struct:


struct cmx_value_t {
    char name[64];
    int32_t value;
    uint64_t mtime;
    uint64_t ctr;
};

For the new ctr field the following applies:

• It is initialized to zero.

• If it is even, the current value is valid.

• If it is odd, the current value is invalid (an update is in progress).

Based on the previously described update/read protocol, there are now new steps to set and verify the ctr value. Fig. 5.3 shows the flow diagram of a sequence-lock write and read operation:

Write value:
1. assert(ctr is even) and increment ctr   (atomic)
2. Set value
3. Set mtime
4. Increment ctr

Read value:
1. Read ctr1 <- ctr
2. Read value <- value
3. Read mtime <- mtime
4. Read ctr2 <- ctr
5. If ctr1 == ctr2 and ctr1 is even, return success; otherwise return failure

The first step of the write procedure contains two operations. They need to be executed atomically, that is, in one step, without being interrupted by thread scheduling or concurrent execution on other processors. The Intel architecture, and almost all modern CPU architectures, provide suitable atomic instructions to implement this.

Figure 5.3: Program flow of a sequence lock implementation


[Figure 5.4 shows the cmx_value_t fields over time: name = "test"; value 0 -> 101 -> 102; mtime 0 -> 1400766 -> 1400865; ctr 1 -> 2 -> 3 -> 4 -> 5; the data alternates invalid / valid / invalid / valid / invalid across the add("test"), set(101) and set(102) operations.]

Figure 5.4: Visualization of data in the "Second enhancement" version, over time

When using sequence locks, the data switches from valid to invalid during write access. Fig. 5.4 shows the validity of the data inside the cmx_value_t struct over time. First the struct is initialized (add()), then the value is set to 101 at the fictitious time 1400766 (set(101)). The value is then updated again to 102 (set(102)).

Between those updates there are timeframes in which the value is valid (ctr is even). As long as the reader stays inside such a timeframe, the read will succeed. If the reader starts in one valid timeframe but ends in another, this is detected using the ctr value.

Third enhancement So far there is only support for values of type int32_t (32 bit signed integer). As mentioned in (TECH 6 Datatypes), it is required to be able to store float values and character strings as well.

Support for both float and integer values can be easily added in a type-safe manner by using a union for storing the value. A new separate integer value keeps track of the actual type in the union.

Character strings of arbitrary length can be implemented by allowing the cmx_value_t data structure to be chained like a singly-linked list. Only in the first element of this list are the name and mtime fields relevant; the other elements are only used to store the payload of the character string.

typedef enum cmx_value_type_t
{ // enumeration of possible types
    TYPE_INT64       = 1, //> identifies values of type integer 64-bit
    TYPE_FLOAT64     = 2, //> identifies values of type floating-point 64-bit
    TYPE_STRING      = 3, //> identifies head of character string values
    TYPE_STRING_CONT = 4  //> identifies continuation of string values
} cmx_value_type;

struct cmx_value_t {
    char name[64];           //> name of the value
    uint8_t type;            //> type, see enum cmx_value_type_t
    union {
        int64_t _int64;      //> the integer value (TYPE_INT64)
        float64_t _float64;  //> the float value (TYPE_FLOAT64)
        struct {             //> a struct, so data and next can coexist
            char data[64];   //> data of string value
            int32_t next;    //> index of the next cmx_value_t data structure
        } _string;
    } value;
    uint64_t mtime;          //> modification timestamp
    uint64_t ctr;            //> sequence lock counter
};

The following is an example state of four values in memory. The value of mtime is not relevant, and ctr is always 2 after the first update, so mtime and ctr are omitted here.

cmx_value_t[0] { .name = "val1", .type = TYPE_INT64,       .value._int64 = 1234L };
cmx_value_t[1] { .name = "val2", .type = TYPE_FLOAT64,     .value._float64 = 12.34L };
cmx_value_t[2] { .name = "val3", .type = TYPE_STRING,      .value._string.data = "Hello..[64]", .value._string.next = 3 };
cmx_value_t[3] { .name = "",     .type = TYPE_STRING_CONT, .value._string.data = "World..[64]", .value._string.next = -1 };

Final design The current implementation of CMX contains some more changes compared to the state described here. The definitive data structure used in shared memory is visualized in Fig. 5.5 with a diagram generated from the source code. The definitive protocol is shown at the bottom of Fig. 5.6.

So far, value fields have been addressed by index. Of course, one can also search through all values by name. Since values can be added and removed, a plain index reference could silently point to a different value after the slot has been reused. To detect this, we include a counter value in the reference. This way our reference is effectively a 15-bit index value plus a 16-bit counter value. The remaining single bit is used to indicate errors (the "sign" bit turns the value negative, which is handy for error checking).

Instead of incrementing the ctr twice, a field called state is used. The state can be any of FREE, OCCUPIED, UPDATE, SET and PAYLOAD. OCCUPIED is used in the transition from FREE to SET. PAYLOAD is a new state used when saving strings.

The value_state variable is an array kept separate from the values themselves. This way the value memory is not touched, and thus not initialized, while scanning for a free value in insert or find operations.


[Figure 5.5 content, flattened from the diagram:]

struct cmx_shm_t (2.7 MB):
    CMX_TAG char[8]; process_id int; reserved char[52]; name char[64];
    value_state int[20479]; value array[20479]

union (128.0 B), element of the value array:
    value (struct cmx_shm_slot_value_t); value_payload (struct cmx_shm_slot_payload_t)

struct cmx_shm_slot_value_t (128.0 B):
    id unsigned short; type int; mtime unsigned long; name char[64];
    txn unsigned long; _reserved char[30]; value union

struct cmx_shm_slot_payload_t (128.0 B):
    id unsigned short; next_index unsigned short; next_id unsigned short; data char[116]

union cmx_shm_value_t (8.0 B):
    _int64 long; _float64 double; _bool int; _string struct

struct cmx_shm_value_string_t (8.0 B):
    size unsigned short; current_size unsigned short; next_index unsigned short; next_id unsigned short

Figure 5.5: Memory structures of CMX (generated using a custom tool on top of sparse [25])


5.3.3 Verification

In the verification of the properties of the data-access protocol we want to prove that different constraints are met. The following arguments are written in prose; a verification using machine-based reasoning follows in section 5.5.

The correctness of the algorithm depends on several conditions; we are going to prove them one after another.

The next seven statements are related to Fig. 5.6. For Verify 8 - Verify 10 there are no illustrative figures.

Verify 1 (Read before Write). Succeeds, trivially.

Verify 2 (Read after Write). Succeeds, trivially.

Verify 3 (Read overlaps Write). The read fails in TR2 because TW1 sets state to UPDATE.

Verify 4 (Start of Write overlaps Read partially). This operation succeeds, because the read ends before the write increments the ctr value in TW2. The data read is valid, as the writer has changed nothing before the end of TR4 is reached.

Verify 5 (Start of Write overlaps Read). The read operation fails in TR5, because the writer incremented the ctr value in TW2 before the reader finished TR4.

Verify 6 (Read inside Write). The read operation fails in TR2 because the state variable is set to UPDATE in TW1.

Verify 7 (Write inside Read). The read fails in TR5 because the writer modified the ctr value in TW2.

Verify 8 (The value update is non-blocking). The update (set) operation cannot block because there are no blocking instructions.

Verify 9 (The value update fails if another update is in progress). The atomic compare-and-exchange operation in TW1 acts as mutual exclusion. The update operation cannot be entered by a second process while state ≠ SET.

Verify 10 (The read operation always detects invalid data). An invalid read would happen


[Figure 5.6 shows seven timeline diagrams of the reader steps TR1 to TR5 interleaved with the writer steps TW1 to TW4: (1) read before write, (2) read after write, (3) read overlaps write, (4) write begin overlaps read partially, (5) write begin overlaps read, (6) read inside write, (7) write inside read. The protocol steps are:]

Writer                          Reader
TW1  CAS(state, SET, UPDATE)    TR1  read ctr (ctr^j)
TW2  ctr++                      TR2  check state = SET
TW3  value := /*value*/         TR3  read value
TW4  state := SET               TR4  read ctr (ctr^k)
                                TR5  check ctr^j = ctr^k

Figure 5.6: Overview of the CMX Reader/Writer Protocol with examples


if the read (TR3) were executed after the transaction counter is incremented (TW2), but before the value (value and mtime) is written completely.

For this to happen, the check state = SET (TR2) must execute before TW1. If so, then it is guaranteed that ctr^j (TR1) is read before ctr++ (TW2) and ctr^k is read after TW2; therefore they must be different, and the modification will be detected by the check in TR5.

5.4 Comparison with Similar Algorithms

The reader/writer protocol for CMX is not new or unique. While this specific implementation is customized for the use-case of CMX, the general idea has been described and implemented before. We found three publications or implementations of similar algorithms.

Non-Blocking Write Protocol (NBW) This description of a similar algorithm is from 1993 by Kopetz and Reisinger [26]. They provide correctness arguments and a detailed analysis of scheduling. The usage scenario described there is more abstract: it is not specifically designed for SMP machines but for a network of nodes, each consisting of CPU, memory and a communication controller. Every node can execute many tasks pseudo-parallel; because there is only one CPU, the tasks are executed interleaved, one after another, meaning that they get interrupted from time to time.

The nodes receive updates via broadcasts over a communication processor, which writes the update using direct access to the memory of the node. Thus, this is a single-writer, many-reader situation (every task can be a reader). If a task gets interrupted, it is possible that the communication processor updates the value in the meantime. The non-blocking reader/writer protocol makes sure to detect those cases and inform the reader that his data is now invalid.

The NBW protocol as shown in Fig. 5.7 and as described in the paper [26, p. 3/133] is similar to the one in CMX.

Write message is designed for a single writer, while CMX uses the state value to prevent two writers running at the same time. The writer first increments the CCF counter, then writes the value, and then increments the CCF counter again.


start: CCFi := 0;   // The counter CCFi starts globally at zero

Write messagei:
    start: CCFold := CCFi;
           CCFi := CCFold + 1;
           <write message data>
           CCFi := CCFold + 2;

Read messagei:
    start: CCFbegin := CCFi;
           <read message data>
           CCFend := CCFi;
           if CCFend ≠ CCFbegin or CCFbegin is odd then goto start;

Figure 5.7: Non-Blocking Reader/Writer Protocol [26]

In Read message the reader looks at the CCF-Counter, then the value and then again the CCF-Counter. This is repeated until both reads of the CCF-Counter are equal.

Timecounters in FreeBSD "Timecounters: Efficient and precise timekeeping in SMP kernels" [8] from 2002 is about the management of high-performance timer values in the FreeBSD operating system. The paper is mostly about timekeeping, except for the section "Locking, lack of …", which describes a lock-free single-writer data structure supporting multiple data generations.

In contrast to the previous algorithm, this one keeps multiple copies of the data in a ring buffer. This way a reader which is slower than the writer can still succeed; in fact he has time until the data-item generation he selected is overwritten in the next round of the ring buffer. This reduces the reader fail-rate significantly, especially in scenarios with very high update rates, where the readers would otherwise not be able to read the complete data before the writer starts his next update.

This approach is perfectly suited for exposing time sources to the whole system, because the time is read by every process in the system, from very low to very high priorities. Even a low-priority process which might be suspended for a longer timeslice must succeed in reading its time data. Also, timekeeping is critical for many applications, and retries are more problematic than the negligible increase in memory usage for the ring buffer.

In CMX, we do not expect a situation with many reading clients of different process priorities. Due to the implementation without a ring buffer in CMX, a reader has a higher chance to fail. For example, if a process does nothing else than update its CMX values, like a time source updating its timestamp, a reader might never be able to read this value.


CMX might be used to store larger amounts of data (character strings), where holding multiple copies in memory quickly becomes very expensive. Also, we do not expect CMX users to update metric values at a frequency comparable to time sources: where CMX values will possibly be updated every millisecond under some circumstances, a time source has to be able to handle nanosecond precision.

In contrast to time-source clients, the CMX readers are less critical; they can tolerate retrying a read if it has failed.

Linux Sequence Locks The Linux kernel has provided sequence locks since kernel version 2.6.12 (around April 2005). They are not considered locks in the common sense, but rather an abstraction of a reader/writer protocol like the one found in [26] and described before.

A description of Sequence Locks can be found in “Effective synchronization on Linux/NUMA systems” by Lameter [9, p. 15] and in “Linux device drivers, Third Edition” by Rubini and Corbet [27, chapter 5, p. 127], as well as in the C source of the Linux Kernel itself [28, include/linux/seqlock.h].

The Sequence Lock additionally uses a spinlock to synchronize the write access. CMX locks the write access in a similar way by setting the state variable from SET to UPDATE, but CMX does not spin, rather it fails immediately.

The seqcount struct contains a counter variable of type unsigned.

typedef struct {
    struct seqcount seqcount;
    spinlock_t lock;
} seqlock_t;

The usage of a sequence lock in the Linux kernel looks like the following (assume lock is of type seqlock_t):

Writer:

    write_seqlock(&lock);
    /* modify data */
    write_sequnlock(&lock);

Reader:

    unsigned seq;
    do {
        seq = read_seqbegin(&lock);
        /* read data here */
    } while (read_seqretry(&lock, seq));


Unlike the algorithm described for CMX, the Linux seqlock uses an even/odd check to detect whether the writer is currently active; the CMX algorithm uses the status flag for this. As the status variable needs to be checked anyway in CMX, there is no advantage in checking even/odd instead or additionally. On the contrary, it would halve the time until the transaction counter overflows, although that will not happen to a 2^64 variable during one read-cycle anyway.

5.5 Verification with Models

The previous attempts to verify the data access protocol of CMX by writing proofs in natural language showed how difficult it is to describe parallel algorithms in informal prose. As suggested by McKenney [7], we additionally use the modeling language Promela and its verifier Spin [29] to model some of the core aspects of the algorithms in CMX.

This allows better understanding of the algorithm as it is re-written in a language spe- cialized for parallel computing.

In contrast to the unit tests written in C/C++ (see subsection 6.2.5), where we try to find problematic situations by executing the code sections multiple times concurrently to trigger a race condition, Spin will search the entire possible state space of the algorithm.

This could quickly lead to state-space explosion if we tried to implement a complex algorithm completely: a complex model would create a state space that cannot be searched completely in a lifetime. Instead we focus on key aspects of our algorithms. After a short introduction, a small model will reflect only the writer-writer situation; following that, the next step will be a writer-reader situation.

All the models described in the following were successfully validated by means of the assertions written in the code.

5.5.1 Simple Example of a Promela Model

This is an introductory example unrelated to CMX. Listing 5.1 shows a very simple Promela model. It consists of two so-called processes (keyword proctype).

• A producer toggles the global variable turn to C, triggering the other process consumer to start.


• In a real scenario, the second process would do some real work. In this example, the consumer immediately changes the state back to P, indicating that he finished his work.

mtype = { P, C };
mtype turn = P;

active proctype producer(){
    do
    :: (turn == P) ->
progress_produce:
        printf("Produce\n");
        turn = C;
    od;
}

active proctype consumer(){
    do
    :: (turn == C) ->
progress_consume:
        printf("Consume\n");
        turn = P;
    od;
}

Listing 5.1: Simple model of a producer/consumer scenario

Figure 5.8 shows the state space as generated by Spin from the model.

Figure 5.8: Model states, image generated by Spin from Listing 5.1

5.5.2 Model of Two Writers

The next Promela program is actually linked to the CMX reader/writer algorithm. It is still a pretty obvious case: in a two-writer scenario, is it guaranteed that the two threads cannot access/write at the same time? Does the mutual exclusion through the state variable actually work?

mtype = { STATE_SET, STATE_UPDATE };
mtype state = STATE_SET;
int count;

inline enter() {
    atomic { state == STATE_SET -> state = STATE_UPDATE; }
}

inline leave() {
    atomic { state = STATE_SET; }
}

active [2] proctype writer() {
    do
    :: enter() ->
        count++;
        assert(count == 1);
        count--;
        leave();
progress:
        skip
    od
}

Listing 5.2: Model of two writers/updaters using test-and-set locking with a state variable

The model in Listing 5.2 describes two identical writer processes, named writer(). Spin will explore all possible execution paths of these two processes executed concurrently. For every possibility of a process entering the critical region, Spin checks with assert(count == 1) that the other process is not currently inside the critical region.

5.5.3 Model of Concurrent Reader/Writer

The previous check can be extended to include a reader. Listing 5.3 contains one writer and one reader. The model covers all the states of the CMX protocol as described in Fig. 5.6.

Instead of writing real values, it uses a boolean value which is 'true' if the field contains valid data and 'false' while the writer updates the field inside the critical region.

According to the protocol, the reader makes a copy of ctr and then the value. If ctr matches at the end, then the boolean value flag needs to be true.


mtype = { STATE_SET, STATE_UPDATE };
mtype state = STATE_SET;
int ctr_count = 1;
bool value = true;
int no_writers;

inline enter_update() {
    atomic { state == STATE_SET -> state = STATE_UPDATE; } }

inline leave_update() {
    atomic { state = STATE_SET; } }

active [1] proctype writer() {
    if
    :: enter_update() ->        /* TW_1 CAS(state,SET,UPDATE) */
        no_writers++;
        assert(no_writers == 1);
        ctr_count++;            /* TW_2 ctr++ */
        value = false;          /* TW_3 would happen here */
        assert(no_writers == 1);
        no_writers--;
        value = true;           /* TW_3 ends */
        leave_update();         /* TW_4 state = SET */
    fi;
}

active [1] proctype reader() {
    int ctr_j;
    int ctr_k;
    bool read_value;
    bool success = false;

start:
    ctr_j = ctr_count;          /* TR_1 ctr^j */
    if
    :: (state == STATE_SET) ->  /* TR_2 state == SET */
        read_value = value;     /* TR_3 read value */
        ctr_k = ctr_count;      /* TR_4 ctr^k */
        if
        :: (ctr_j == ctr_k) ->  /* TR_5 ctr^j == ctr^k */
            assert(read_value)  /* read was outside TW_2..TW_3 */
        :: else ->
progress_retry: goto start
        fi;
    fi;
end_success:
}

Listing 5.3: Model of a Writer and a Reader


5.6 Conclusions

This chapter analyzed suitable inter-process communication principles and implementations. Based on shared memory, a data access protocol for shared data structures has been developed. The protocol is well suited to the use-case of CMX: it provides very fast and unobtrusive publishing of run-time values from shared memory. The next steps involve implementing CMX using this protocol, based on POSIX shared memory objects.

The algorithm was tested with different verification techniques. In one approach we reasoned about its correctness in written prose; then models were created, using a specialized language and verification tool. The successful verification of the protocol was a prerequisite for advertising CMX for use in critical applications.

6 Implementation of CMX

This chapter covers different aspects of the implementation process. The first part analyzes properties of the compiler and the target computing platform, including the operating system, processor instructions and memory consistency.

The second section of this chapter introduces the implementation details of the new CMX library.

6.1 Platform and Toolchain

CMX primarily targets Scientific Linux 5 on Intel x86-32 and 6 on Intel x86-64 (SLC5/SLC6). It is written in the C and C++ programming languages. The compiler versions shipped with the operating system releases are gcc 4.1.2 on SLC5 and gcc 4.4.7 on SLC6.

6.1.1 Compiler

Today only the most up-to-date C compilers are aware of concurrent programming. There is an ongoing effort to standardize and implement atomic operations and architecture- aware memory access for the C Language. The new C11 (ISO/IEC 9899, 7.17 Atomics) standard [30] contains a header file named stdatomic.h which offers basic atomic types and operations. The compilers used in SLC5/6, however, do not support these features.

Memory consistency is not the only issue a developer needs to take care of when creating concurrent programs. Because compilers are not aware of concurrent execution, they tend to make code optimizations which change the original intent of the program.


Erroneous Optimization The following example shows that compiler optimization can have undesired results when it comes to concurrent access to a shared state (here an int value). The following program looks legitimate:

int function(int * state) {
    int c = 0;
    if (*state == 1) c += 1;
    if (*state == 1) c += 1;
    if (*state == 1) c += 1;
    return c;
}

The compiler is right to assume that *state does not change inside the function. Both of the next two transformations are therefore valid:

int function(int * state) {
    int c = 0;
    int s = *state;
    if (s == 1) c += 1;
    if (s == 1) c += 1;
    if (s == 1) c += 1;
    return c;
}

int function(int * state) {
    if (*state == 1) return 3;
    else return 0;
}

In the first transformation, the compiler deduced that the three if clauses always read the same memory value. It therefore optimized the access into a single read, which will then supposedly stay in a CPU register. Building on that, the second transformation evaluated the additions to c at compile time.

In the C language, the keyword volatile is used to mark variables that are memory-mapped and can be changed from outside the program's process. As described in "Volatiles Are Miscompiled, and What to Do about It" [31], some compilers unfortunately contain bugs in the handling of volatile data types. For example, a program accessing a volatile int * variable multiple times can still be wrongly compiled to access the variable only once. The authors recommend using read-helper functions to prevent the compilers from optimizing the code; in their analysis, "96% of all the volatile errors we found are fixed through the introduction of helper functions" [31, sections 5 and 6.3].


This also helps the readability of the source code: instead of adding volatile to all relevant variables, the typecast implicitly applied to the function parameter is used:

int read_int(volatile int * v) {
    return *v;
}
int read_int(volatile int * v) __attribute__ ((noinline));

int function(int * state) {
    int c = 0;
    if (read_int(state) == 1)
        c += 1;
    ....
}

With the additional gcc function attribute noinline, we can prevent this function from being inlined by the compiler. This eases the verification of the algorithm, because we can easily trace the order in which operations are executed. On the other hand, one needs to remember that this additional call may have a considerable performance impact.

6.1.2 Atomicity of Operations

With concurrent execution of algorithms, any non-atomic (divisible) operation can expose half-processed data to other threads. This is illustrated by the following example of a small program with two threads:

// Thread 1
while (c /* ... */)
    data[++c] = source[c];

// Thread 2
int a = data[0];
int b = data[1];

• Thread 1 provides data in a global array called data.

• Thread 2 reads data from Thread 1 and stores into two variables named a and b.

If both threads start at the same time, “int b” can contain the value from after the update, while “int a” was read before the update in Thread 1 occurred. This is a very simple scenario; we won't go deeper into solving it at this point, since it was already discussed in subsection 5.3.2.

The essential question is: how do we know that a simple assignment is executed atomically, in one step? For example, given int x = 0xff00ff00;, how do we know that the memory/register backing the variable x never contains, for instance, 0x0000ff00 or 0xff000000?

Simple Atomic Operations If we translate the program to assembler, we would see that the assignment of a 32-bit value maps to exactly one store/load assembler instruction. Still, the processor could translate this internally into many smaller operations, for example:

• Moving the value in 4 steps of 8 bits each

• Loading the constant into a register, then storing the register to the memory location of x.

While the second option is fine in most situations, the first one results in the destination holding a different value for some time. For x86-64 the Intel documentation [32, ch. 8 p. 8] says “The Intel-64 memory-ordering model guarantees that:”

• Constituent memory operations that read or write a double word (4 bytes) whose address is aligned on a 4-byte boundary appear to execute as a single memory access

• … read or write a quad word (8 bytes) whose address is aligned on an 8-byte boundary …

• Any locked instruction appears to execute as an indivisible sequence of load(s), followed by store(s), regardless of alignment.

For x86-32 (Intel Pentium and newer) the documentation [32, ch. 8 p. 2] states that “the following additional memory operations will always be carried out atomically: Reading or writing a quadword aligned on a 64-bit boundary”.

More Complex Atomic Operations If two threads want to communicate with each other about shared data, they need an atomic operation that allows them to compare the value of a shared datum and conditionally modify it in one step (atomically). This can be illustrated by a simple spinlock implementation, used for example in operating systems for low-level synchronization. The naive implementation looks like:


1  bool lock=false; // shared variable
2  PROCESS() {
3    do {
4      if (!lock) {
5        lock = true;
6        break;
7      }
8    } while (1);
9    // critical section
10   lock = false; // unlock
11 }

This solution is wrong because it is not atomic: both hardware threads can execute and pass the if (!lock) check at the same time. We can fix this with an instruction which compares (line 4) and modifies (line 5) in one step. Imagine an instruction, called check_and_set, which does the work of lines 4 to 7 of the previous program in one single atomic step. The resulting program would look like:

bool lock=false; // shared variable
PROCESS() {
    // spin on lock
    while (1) { if (check_and_set(&lock)) { break; } }
    // critical section
    lock = false; // unlock
}

This function has actually been available on the Intel platform since the i486, named compare and swap or compare exchange (instruction mnemonic: cmpxchg). It is implemented as follows [33, pp. 3-148]:

TEMP ← DEST
IF %eax = TEMP
  THEN
    ZF ← 1;
    DEST ← SRC;
  ELSE
    ZF ← 0;
    %eax ← TEMP;
    DEST ← TEMP;
FI;

Usage: lock; cmpxchg DEST, SRC (e.g. lock; cmpxchg r/m32, r32): compare EAX with r/m32; if equal, set ZF and load r32 into r/m32; else, clear ZF and load r/m32 into EAX.

Compare and swap is not part of standard C99 [30, ISO 9899:1999], but it can be wrapped in a C function using inline assembler. In the following short example this wrapper is called cmx_atomic_val_compare_and_swap_int32 and returns the value of TEMP. If the exchange succeeds, this value is equal to the compare operand and has been exchanged (swapped) with the value in the second parameter.

This actual code is used to enter the TW1 state of the writer access protocol:

// State = T_W_1
// Enter state UPDATE
switch (cmx_atomic_val_compare_and_swap_int32(
    &cmx_shm_ptr->value_state[value_index],
    CMX_SLOT_STATE_SET, CMX_SLOT_STATE_UPDATE))
{
case CMX_SLOT_STATE_SET:
    // thats OK, the XCHG succeeded
    break;
case CMX_SLOT_STATE_UPDATE:
    // another update is in progress
    return E_CMX_CONCURRENT_MODIFICATION;
default:
    // invalid other state
    return E_CMX_OPERATION_FAILED;
}

The C compiler gcc [34] in versions ≥ 4.4 provides builtin functions for some atomic primitives. For example, __sync_val_compare_and_swap() is a wrapper to access cmpxchg in a portable way. However, the experience from this project shows that one should always check the generated assembler code before relying on the gcc atomic builtins.

With older compilers (for SLC5, gcc-4.1.2) inline assembler must be used. Newer compilers supporting C11 will likely provide a compare-exchange function through the <stdatomic.h> header.

6.1.3 Processor Memory Consistency

In the previous section we described how the compiler can be prevented from reordering our carefully sequenced program. The next step is to ensure that the processor does not destroy our efforts by executing instructions out of order, causing other processors in an SMP system to see corrupted data.

The ancient DEC Alpha architecture or ARM-based CPUs, now popular in mobile computing, have a very weak memory ordering model. They aggressively try to re-order memory accesses to gain performance [7, appendix C.7.1]. With out-of-order execution, a CPU is free to execute independent instructions until a datum required by a previous load instruction is actually loaded into the processor. This way, independent units inside the processor (e.g. FPU and integer ALU) can continue to work and do not need to idle if the current instruction is unrelated to them.

Today's multi-processor machines can execute hundreds of instructions in the time needed to fetch a single datum from memory, and can thereby achieve an even greater speedup from instruction reordering. The gap between processor and memory performance has increased constantly since 1980 [35, p. 289].

To reduce the communication with the slower memory levels in the hierarchy (CPU registers to L1/L2/L3 cache, cache to main memory), processors have store and load buffers between the registers and the level 1 cache. In contrast to the caches, these buffers hold only a single datum, not an entire cache line mirroring a whole 64 bytes of main memory.


Figure 6.1: Simplified CPU with and without Store Buffer (Figure based on[7])

At the same time, buffers are not involved in the cache coherency protocol that ensures a consistent view of memory between the processors in an SMP system. The important aspect here is that the developer must actually take care that the buffer is flushed to the caches at the right time, if he wants other processors to be able to see the update instantaneously.

The Intel x86-32/64, in contrast to ARM, PowerPC or Alpha CPUs, has a rather strong memory ordering model, while still using store buffers per logical CPU (see Fig. 6.1, right). This means the developers have fewer things to care about when writing parallel code compared to ARM, but the CPU needs more logic to provide a strong memory model while applying optimizations such as out-of-order execution or load/store buffers at the same time.


However, for Intel x86 the manufacturer's reference manual section about the memory model [32, ch. 8 p. 6] is still extensive and complicated.

While the ARM/PowerPC memory model is formally defined, there has been some ef- fort to provide the same for the x86 architecture. The most recent description of the x86 memory model is called x86-TSO (Total Store Order) [36]. The x86-TSO is not an offi- cial documentation from Intel or AMD. It is specified by researchers based on vendor documentation, assumptions and extensive testing on real hardware.

Total Store Order means that the processors agree on a global order of the store (memory write) operations. In contrast to SC (sequentially consistent memory), the strongest memory model (all processors share a consistent view of the memory at any time), the Intel x86 processor is allowed to buffer write (store) operations. As a kind of exception, the processor is also allowed to read (load) its own stored values before the global order is established.

Due to the big changes in the history of the x86 architecture, the x86-TSO model describes a model that is valid for all x86 processors, even though there may exist x86 processors providing a stronger ordered memory model.

Store Buffer Effects with x86-TSO The Intel documentation claims in one place that: “In general, the existence of the store buffer is transparent to software, even in systems that use multiple processors” [32, ch.11 p.20]. This is literally not quite true.

Fig. 6.3 shows a simple data race involving the store buffer. The scenario features two processes (Tp and Tq), operating on the memory locations [x] and [y] and their respective registers %eax. [x], [y] and both %eax registers are initialized with 0.

We see that the load of [x] by Tq, or the load of [y] by Tp respectively, takes place while the stored value is still in the store buffer of the other processor. In consequence, the loads will likely read the initial value (0) instead of 1, even in contradiction with the program order.

To fix this problem, one has to insert so-called fences or barriers. They force the buffer to flush its content ahead of schedule. A corrected version would look like:


Tp:              Tq:
mov [x], $1      mov [y], $1
mov %eax, [y]    mov %eax, [x]

Figure 6.2: x86 assembler test program

Figure 6.3: Timing of the x86 assembler test program (both loads return eax = 0 because they execute while the stores x=1 and y=1 are still in the respective store buffers; the buffers are flushed to RAM only afterwards)

Tp:              Tq:
mov [x], $1      mov [y], $1
mfence           mfence
mov %eax, [y]    mov %eax, [x]

With the insertion of the barriers after each memory write, the thread's store buffer is forced to flush its contents immediately. Thus, both threads share an equivalent view of the memory [37, p. 12].

Store Buffer and CMX/Sequence Locks Applying the theory about data-races with x86-TSO to the algorithm used in CMX reveals no problems regarding concurrency. The store buffer indeed stays transparent to software.

W1: (w,x,1)      R1: (r,x,_)
W2: (w,y,1)      R2: (r,y,_)

Figure 6.4: Read and write values x and y


The diagram in Fig. 6.4 is simplified but equivalent to processes which read and write values x and y using CMX. Here process W is writing to CMX values (x and y) in shared memory and R is reading the very same. As described in section 5.3, the behavior of the CMX algorithm is that a reader never writes to the shared memory.

Because TSO ensures that the stores don't get reordered, the data always stays valid according to the protocol.

However, things can get interesting if we take a look at a special case. It is described in “Reasoning About the Implementation of Concurrency Abstractions on x86-TSO” [37, p. 16] and applies in the same way to the CMX algorithm:

[…] a program where the reading processors only access memory via the code at Reader is trivially TRF [triangular race free]. However, there are data races between the writer and a reader on […ctr, value], and if the reading processor has written to memory before initiating the read, these become triangular races.

Fig. 6.5 shows the data-races described in the quote.

However, on x86, we won't have the chance to observe this theoretical bug with CMX because the locked instruction cmpxchg in TW1 imposes a total memory order. It is used to change from state SET to state UPDATE.

As marked with the blue line (from TW1 on thread 1 to TW4 on thread 0), the TW process on thread 1 will not be able to progress until the SET in TW4 has left the store buffer of thread 0.

To optimize, we could flush the store buffer explicitly at the end of TW4, and thus minimize the time spent in TR on thread 1 waiting for the unlock, and respectively the fail rate on thread 1.

The locking behavior on the state variable makes the race on ctr (red lines) irrelevant.

Conclusion The documentation for the memory ordering on our target x86-32/x86-64 processor architecture is quite complicated, fragmented and leaves room for interpretation. The history of the x86 architecture is long; there have been many improvements over time, and current processors still have to keep compatibility with chips from the 1980s. The x86-TSO model seems to provide a good reference for the processor behaviour, but on the other hand it is highly academic.



Figure 6.5: Reader/Writer with CPU swap

Compared to the weak memory model of other processors, the x86 is easier to target because of its stronger memory model. From the processor's point of view, these guarantees add more dependencies; they increase the need for inter-core communication, so the overall performance is theoretically lower.

Using unnecessary memory barriers instead of relying on specific x86 behavior costs some performance but is much easier to handle and more portable. The needed memory barriers can be issued using gcc builtin functions or assembler instructions, for SLC6 and SLC5 respectively.


6.1.4 Processor Cache Coherency

Multiprocessor systems use a cache coherency protocol to maintain coherency between all the processor caches and the main memory. For high performance and low overhead the cache management communication must be kept as low as possible.

In our target architectures, the cache line size is 64 bytes. That means if a processor wants to cache some memory, it will cache blocks of 64 bytes from main memory at once. This will be important to consider when it comes to defining the layout of data structures later on.

The algorithm in CMX is equivalent to Sequence Locks (see section 5.4) from the cache relevant perspective. It is perfectly suited to keep the cache communication low, because the 'reader' process will never do a write access to the data [9, p.16]. Therefore, the 'reader' will never have to gain exclusive ownership of the cache lines holding the values.

6.1.5 Choosing a Suitable Timesource

The metrics in CMX are supposed to be time-stamped in the event of an update. For now the timestamps are set automatically using the POSIX Realtime API. The clock_gettime() function can select from available clocks defined in the time.h system header.

Both the 32-bit and 64-bit SLC6 systems offer the following clocks (overview based on the clock_gettime() documentation):

• CLOCK_REALTIME System-wide real-time clock. Real-time in the sense that the values represent the amount of time (in seconds and nanoseconds) since the start of the epoch (01.01.1970).

• CLOCK_MONOTONIC Clock that represents monotonic-time, counting from some un- specified starting point.

• CLOCK_PROCESS_CPUTIME_ID Process-specific CPU-time clock.

Starting from Linux 2.6.32 some new clock sources are defined but exposed only in the 64-bit SLC6.

• CLOCK_REALTIME_COARSE A faster but less precise version of CLOCK_REALTIME.

• CLOCK_MONOTONIC_COARSE A faster but less precise version of CLOCK_MONOTONIC.


• CLOCK_MONOTONIC_RAW Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments.

As described in [38, ch. 15.2.1], the _COARSE variants avoid the actual read of the designated timer source and therefore also the switch from user space to kernel space. This has been validated with the perf [39] tool. Instead, they use the so-called vDSO mechanism (see vdso(7)), where actual kernel functions are exposed and then executed in user space.

The performance gain of using the _COARSE variant is noticeable. In the case of the virtualized machines used during the development of CMX, it is even higher, presumably because the clock read-out avoids a call to the hypervisor.

The next table shows the time for updating a 64-bit integer value 5 million times single-threaded in CMX under different conditions. The tests were repeated 10 times with perf; the percentage shows the variation between these runs.

              CLOCK_REALTIME       CLOCK_REALTIME_COARSE
native        0.459 s (± 2.51%)    0.353 s (± 6.87%)
virtualized   4.024 s (± 2.20%)    0.421 s (± 4.10%)

Table 6.1: Comparison of execution times on virtualized/unvirtualized hosts regarding the usage of different clock sources

The tests were made on up-to-date modern hardware. The virtualized tests are run in a Microsoft Hyper-V guest system. The native system is equipped with two Intel Xeon X5660 at 2.8 GHz.

Because of the effects of the processor caches, those results cannot simply be divided by 5 million to get the execution time of a single CMX update. The time spent in a single update can be much longer if the processor first has to load the CMX data structures from main memory.

Conclusion CMX is intended to be used in real-time applications. These machines are very seldom virtualized and have fast time sources available for use with CLOCK_REALTIME.

Depending on requests from users, CMX might provide a way to pass the update times- tamp in the future. It is possible that there are applications of CMX where the users want to pass a timestamp gathered from an external source, e.g. an accelerator timing receiver.


6.2 Implementation Overview

CMX is implemented in C. Additionally, there is a C++ library available which acts as a wrapper for the C functions.

6.2.1 The Implementation in C

The implementation of CMX is split into different modules. These are:

cmx.h             Parent header file, includes the complete public API
shm.h             CMX Components and Values in shared memory
log.h             CMX logging and log redirection functions
process.h         Predefined metrics for the CMX Process Component
registry.h        Lookup of CMX Components
§ atomic.h        Implementation of atomic primitives
§ common.h        CMX error codes and some common functions
§ shm-private.h   CMX data structures in shared memory, not exposed in the API

Table 6.2: Tabular overview of C source-code header files

For each header file (.h) a corresponding implementation file (.c) can be found, except for shm-private.h. The files marked with § are for internal use only and are protected with an #ifdef acting as an include guard.

6.2.2 The C++ API

The API for C++ is a wrapper around the C implementation. It provides a more convenient access to CMX while still focusing on real-time suitability and low overhead.

Fig. 6.6 shows a UML Diagram of the classes in the CMX C++ API.

• CmxRef is the base type of a CMX Value reference. It holds the type of the value as an integer value but is itself untyped.

• CmxImmutableInt64, …Float64, …Bool and …String are typed references to CMX Values. They can be created from CmxRefs using the cmx_cast function.


• CmxInt64 (without Immutable), …Float64, …Bool and …String are the same but extended with write support. They are created by the newInt64, newFloat64, newBool and newString functions of Component.

• ImmutableComponent can be used to open a CmxComponent of another process. It implements the read-only operations on CMX components. If requested, the implementation can hide all error cases from the user by returning neutral values instead of errors.

• Component adds modifying create, remove and set operations to ImmutableComponent. If requested, it can act as a 'dummy' as well, causing all write operations to be no-ops.

• Registry is responsible for listing existing CMX components on the local system. cleanup() removes components from dead processes.

• ProcessComponent can be used to create a CMX Process Component, where predefined metrics about the process are exposed. It should be called by every process before creating CMX components.

• CmxException Various exception classes map the C error codes.

One way of using CMX in C++ is similar to the C API. Fig. 6.7 shows a self-contained example.

The C/C++ API can also be used in a more abstract way: it supports mapping a C++ class to a CMX Component with very little effort from the developer. This approach is shown in Fig. 6.8. The class CmxSupport is not shown in the class diagram above but is also part of CMX-C++.


Figure 6.6: C++ Class diagram


#include
#include
#include
#include

using namespace cmw::cmx;

int main()
{
    struct timespec tm;
    tm.tv_sec = 0;
    tm.tv_nsec = 50000000;

    ProcessComponent::update();

    ComponentPtr component = Component::create("stats");
    CmxInt64 metr_test = component->newInt64("test");

    std::cout << "Enter work-sleep loop" << std::endl;
    for (int i = 0; i < 100; i++)
    {
        metr_test = i; // update metric

        if (i % 500 == 0) ProcessComponent::update(); // update process metrics
        (std::cout << ".").flush();
        nanosleep(&tm, NULL);
    }
}

Figure 6.7: Simple example of using the CMX C++ API

Header:

class Demo1 : CmxSupport
{
    CmxInt64 counterInt;
    CmxFloat64 counterFloat;
    CmxBool counterBool;
    CmxString counterString;
public:
    Demo1();
    void execute();
};

Implementation:

Demo1::Demo1() :
    counterInt(newInt64("Component1", "counter_int")),
    counterFloat(newFloat64("Component1", "counter_float")),
    counterBool(newBool("Component1", "counter_bool")),
    counterString(newString("Component1", "counter_string", 30))
{
    counterInt = 1;
    counterString = "Initalizing...";
}

Figure 6.8: Demo of OO abstraction for CMX (excerpt)


6.2.3 Independent Usage of CMX

CMX can be used independently of the CERN infrastructure. The public source code releases contain configuration files for building the libraries and example applications using the scons [40] build system.

Also included is a Python program, based on the C++ API, which allows inspection of all CMX-enabled applications running on a host via the HTTP [14] protocol. The output is either formatted in HTML, intended for humans, or in JSON [41], for integration into existing monitoring systems.

The integration of CMX into the CERN environment is discussed in chapter 7.

6.2.4 Real-Time Compatibility

CMX is targeted to suit requirements of real-time applications. We defined the term “real time” in TERM 3 and our concrete requirements in TECH 1.

The real-time compatibility of a program can be influenced by different parameters. In the following we give an overview of possible causes that can harm real-time execution and analyze their impact on CMX.

Priority inversion describes the effect of unintentionally changing a thread/process priority in a scheduled system. The effect is pointed out in Fig. 6.9, with an example where 3 Jobs (Threads or processes) access 2 resources concurrently.

In this scenario, the 3 Jobs have different priorities. T1, the job with the highest priority, tries to grab a lock on resource 'brown'. However, this resource is currently owned by the Job T3, having a 'low' priority. This means, high priority jobs must wait for the low priority job to finish.

Again, before unlocking resource 'brown', the low priority job T3 must do some work on resource 'blue'. Resource 'blue' however, is owned by the middle priority job T2. This way the high priority job is affected by two lower priority jobs.

This scenario does not take scheduling effects into account. Depending on the strategy, the effects may vary, but in the end the system is generally more willing to give time to higher-priority jobs which, in case of priority inversion, cannot make any progress and


immediately yield the execution or, even worse, waste valuable compute time by spinning on the lock.

Figure 6.9: Priority Inversion with 3 Threads on 2 Resources

In section 5.1 we concluded that real-time aspects are most important for the get() and set() operations. Because these operations contain no resource accesses vulnerable to priority inversion, and no calls to the operating system, priority inversion is not an issue for them: there is no blocking behavior on the resources in use, hence the effect cannot appear.

In other scenarios, a possible way to solve priority inversion is to implement priority inheritance: the processes which a high-priority process is waiting for temporarily inherit the high priority from the waiting process and hence can finish earlier, so the whole system finishes earlier.

Memory Management The malloc()/free() functions are not used inside the set() or get() functions of CMX. However, memory reserved in shared memory is un-mapped (not wired to physical memory) by default. This means the 4K pages are bound to physical memory only on-demand, in fact the first time they are accessed.

This way a process can reserve any addressable amount of memory without immediate effect; the operating system will map the address to physical memory only at the first real data access. This is detected by catching the page fault resulting from the invalid access. Handling the page fault takes time and has to be avoided in a real-time process.


One can force the mapping simply by calling memset() over the whole data structure, thus triggering the page-fault handling of the operating system, or use mlock(), which also guarantees that the pages will stay in RAM and will not be swapped out to disk. In CMX, we actually want to profit from the on-demand behaviour, since a CMX Component has a fixed size but is rarely filled completely.

On Linux, the mapping to physical memory can be reversed with a madvise(MADV_REMOVE) (memory advise) call.

System Calls System calls in general can harm the real-time execution badly, because the user-space program using CMX has no control over what is happening inside the kernel. For example, the shared memory allocation code uses locks internally to protect the management data structures against concurrent access.

Except for setting up the data structures in shared memory, CMX does not use any system calls.

Foreign/libc-provided Functions In principle, every function called in CMX which is not implemented in CMX itself has to be verified in terms of runtime complexity. Fortunately there are not many, and if we take only the ones involved in the get() and set() operations into account, the following remain:

• printf()/vprintf for the logging functions (can be disabled).

• memcpy() is used to copy multi-byte data such as character strings.

• getpid() is used to validate whether the current process owns a CMX Component.

• clock_gettime() obtains a time value for the modification timestamp.

Logging in CMX involves calling printf() and related functions. String formatting is quite expensive in general; however, depending on the logging level, functions involving printf() are only called in case of errors. Logging can also create side-effects and blocking in case of errors or warnings (depending on the configured log level) when it is configured to write to stderr. To avoid this, logging can be disabled completely at compile time, for specific or all log levels. This also reduces the binary size considerably.

The cost of memcpy() is directly proportional to the amount of memory to be copied, given that this memory is initialized (wired). This is the case after calling add() in CMX.


While getpid() is generally translated into a system call, the value will be cached by libc and only be updated after a fork() call.

The clock_gettime() function uses the vDSO (virtual Dynamic Shared Object) on Linux, which is actual kernel code, linked and executed in user space. This allows a low overhead call and is almost constant in time, since it translates into reading a Linux SeqLock (see section 5.4) protected value.

6.2.5 Automated Testing

CMX is embedded into the BE-CO integration test environment based on Bamboo, a continuous integration (CI) server application by Atlassian Software [42].

The CI server is connected to the source code management system and triggers the exe- cution of test plans according to changes in the source-code.

The automated tests are split into tests of the C API (including test code-coverage report), the C++ API and an integration test performing interprocess communication between all recent CMX versions to ensure backward compatibility.

LCOV code coverage report. Current view: top level - cmw-cmx. Test: unnamed. Date: 2014-04-17. Lines: 481 of 687 (70.0 %). Functions: 47 of 54 (87.0 %).

Filename        Line Coverage          Functions
atomic.c        100.0 % (28 / 28)      100.0 % (11 / 11)
common.c        92.0 %  (23 / 25)      100.0 % (4 / 4)
log.c           100.0 % (48 / 48)      100.0 % (6 / 6)
registry.c      60.8 %  (48 / 79)      100.0 % (6 / 6)
shm-private.h   0.0 %   (0 / 47)       0.0 %   (0 / 6)
shm.c           72.2 %  (328 / 454)    94.4 %  (17 / 18)
shm.h           100.0 % (6 / 6)        100.0 % (3 / 3)

Generated by: LCOV version 1.10

Figure 6.10: Results of the coverage analysis (CMX 2.0.4)


Unit tests using Google Test The tests using the Google Test framework are intended to cover all possible usage scenarios of CMX and ensure the correctness of the code. The coverage of the tests can be verified using code coverage analysis tools such as gcov with lcov [43].

Fig. 6.10 shows the code-coverage result for the CMX-C source-code. The low coverage of registry.c and shm.c is due to error handling, where possible errors cannot be easily triggered in unit tests. The file shm-private.h contains some functions which are not used in production and thus this result can be ignored.

Valgrind tests Valgrind is a machine code execution engine which can apply transformations before actually executing the code.

The most popular tool in the valgrind family is memcheck. It keeps track of memory requested from the operating system. If the program accesses this memory illegally, memcheck prints a detailed error report, including a stack trace.

Furthermore valgrind can be used to apply wrapper functions to already defined symbols. This way system-calls can be easily wrapped without having to deal with dynamic linking.

We created tests in which the ftruncate() function is overridden by a mock implementation. The actual test, which executes cmx_shm_create() while my_ftruncate() is called instead of ftruncate(), then looks like:

WITH_WRAPPER(ftruncate, my_ftruncate)
{
    cmx_shm *cmx_shm_ptr;
    my_ftruncate_result.ret = -1;
    ASSERT_EQ(E_CMX_CREATE_FAILED, cmx_shm_create("foo", &cmx_shm_ptr));
    ASSERT_NE(0, my_ftruncate_result.fd);
    ASSERT_NE(0, my_ftruncate_result.length);
}

The asserts in this example verify the correct error handling and whether the correct parameters have been passed to the wrapper of ftruncate().

Checking of struct size and packing The constant layout of the shared memory structures is crucial for the correct operation of shared memory applications like CMX. Two versions of CMX, where the shared memory structures are compiled differently, will certainly lead to problems at run-time.


The C language provides no syntactic means to guarantee that a compiler will always create the same memory layout. A compiler optimizing for size might choose the densest packing of the fields; another, optimizing for fast access, puts the fields into the best alignment for the processor.

In CMX, checks to verify the constant in-memory layout of data structures are imperatively executed at compile time. All builds must fail if the checks are not successful.

This is implemented in functions which will not be used later on, but have to be syntactically correct anyway. The checks examine the offset and the overall size of all structs and compare them to user-defined expected values.

The size is obtained using sizeof(struct type) and the offset using offsetof(struct type, field). Both are compile-time constant expressions, evaluated by the compiler. A condition over two constant values is therefore resolved at compile time, and the false branch is not compiled.

In the false branch of the condition we place a call to a function which is artificially marked as erroneous using gcc attributes. The whole construct is wrapped in preprocessor macros; if the false branch is taken, gcc aborts the compilation with a message stating the cause of the failure.

A set of tests for a struct looks like:

static_assert_eq(0U, offsetof(cmx_shm_value, _int64), "Check Offset");
static_assert_eq(0U, offsetof(cmx_shm_value, _float64), "Check Offset");
static_assert_eq(0U, offsetof(cmx_shm_value, _bool), "Check Offset");
static_assert_eq(0U, offsetof(cmx_shm_value, _string), "Check Offset");
static_assert_eq(8U, sizeof(cmx_shm_value), "size of cmx_shm_value");
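A portable sketch of how such a macro can be realized. CMX's actual macros use the gcc error attribute described above; this variant uses the classic negative-array-size trick instead, which fails the build in the same way when an expectation is wrong:

```c
/* Sketch of a pre-C11 compile-time equality check; CMX's real macros use a
   gcc error attribute, this is a portable stand-in. */
#include <stddef.h>
#include <stdint.h>

#define CMX_CONCAT_(a, b) a##b
#define CMX_CONCAT(a, b)  CMX_CONCAT_(a, b)

/* A wrong expected value yields an array of size -1, which is ill-formed,
   so the translation unit fails to compile. */
#define static_assert_eq(expected, actual, msg) \
    typedef char CMX_CONCAT(layout_check_, __LINE__)[(expected) == (actual) ? 1 : -1]

/* Example struct with a fixed layout to protect (not the CMX struct). */
struct sample { int32_t a; int32_t b; };

static_assert_eq(0U, offsetof(struct sample, a), "offset of a");
static_assert_eq(4U, offsetof(struct sample, b), "offset of b");
static_assert_eq(8U, sizeof(struct sample),      "size of sample");
```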

6.2.6 Performance Analysis

Latencies The latencies of CMX depend highly on the actual setup. Today's systems are seldom single-processor machines, but most often multi-core systems with four or more processor cores. The performance of shared-memory applications depends on architecture-specific features and can vary between different generations of Intel/AMD processors. Most importantly, it depends on the main workload running on the machine.

As long as a CMX writer thread is the only workload on a system, it obviously runs very fast, since all operations take place inside the processor's Level 1 cache.


Once another thread on a second processor core starts reading the CMX values, that core needs access to the cache lines holding the memory pages in which the CMX values reside. At the same time, the writing thread must regain ownership of the cache line to modify the value: writing to the cache line invalidates all copies previously shared with the reading thread. This activity creates cache-coherence protocol traffic between the cores and hence adds additional latency.

When CMX is embedded into another application, its data will probably not stay in the Level 1 cache. The values may therefore need to be fetched from DRAM first, which adds an additional delay; a DRAM access can be estimated at around 60 ns [44, p. 22]. Int64/Float64/Bool values fit into one cache line. Character values are certainly slower, depending on their size.

In general, the overhead of CMX is comparable to that of ordinary memory operations. Updating a value in CMX is only slightly more expensive than a plain memory write; the overhead stems from the memory fences in the reader/writer access protocol and from the compare-and-swap used to mutually exclude the writers.

Validation of memory behavior As described in section 5.2, we expect the operating system to allocate the shared memory used by CMX on demand. This can easily be verified by creating a CMX component through the CMX API and adding values while recording the current memory usage.

The results are plotted in Fig. 6.11. The values were taken on an SLC6 system, obtained from the getrusage() [21] function, and show the evolution over time while CMX values are added to a CMX component.

The upper plot shows the current memory usage minus its value at start time (blue) and, for comparison, the calculated size of the cmx_value struct multiplied by the number of values.

The bottom plot shows the number of page faults of the process. This aligns nicely with the increase of the actual memory usage: every access to the uninitialized memory triggers a page fault, whereupon the operating system steps in and allocates the memory as needed.


[Figure 6.11 plot: resident set size in kB ("maxrss minus init" vs. the calculated size) and the number of page faults, both plotted over the number of metrics (0 to 4000).]

Figure 6.11: Change of memory usage depending on number of metrics

6.2.7 Possible Extensions

Direct increment operations Feedback from early adopters indicated that it would be handy to have a call for integer values which directly increments the value by one step. Otherwise, unless they store the values themselves somewhere, users must execute a get() operation before every update of the value.

While this is implemented in the C++ part using a separate get() before set(), it would certainly make sense to support this operation in one transaction. Without such an "atomic" increment of CMX values, a multi-threaded writer scenario is not possible without risking the loss of updates.

Garbage collection In the current implementation, the space for the values stays uninitialized until its first use (by add()). After a remove() call, the memory does not become uninitialized again and thus stays in physical memory. Currently this is not a big problem, since so far most users populate their values once at start time and do not change anything later on.

Simply marking one value as unused is not enough to release its memory, since memory can only be returned to the operating system in whole pages. A memory page on Linux/x86 is 4 KB, whereas a CMX value occupies 128 bytes; many values must therefore be freed before their pages can be released. The foreseen approach is to provide a call that searches for freed values in aligned blocks greater than or equal to the page size, marks them with a special garbage-collection status, informs the operating system that this memory is free using madvise(MADV_DONTNEED), and then sets the status to free.

Dynamic character string size The current implementation requires the maximum string length to be fixed at the time a new CMX value is added. When the value is set, the exact length is recorded, but it must be less than or equal to the previously set maximum size.

This behavior could be made more dynamic by allocating the slots for storing the data as needed when updating the value. A string value would then only consume the amount of space it really needs. On the other hand, this raises the danger of update failures at run-time, and the developer should not be obliged to handle such situations.

The same behavior can be emulated by removing the value for a short time and then adding a new one under the same name with a different size. If this happens for more than one value, it also increases the fragmentation of the character value slots.

6.3 Conclusions

This chapter covered the main part of the work in this thesis. Starting from the ideas and the technical foundations prepared in the previous chapters, we transformed the CMX algorithm/protocol from theory into an implementation using the C and C++ programming languages and POSIX SHM objects. During the implementation, we encountered various issues throughout the involved technology stack; we described and solved aspects ranging from hardware characteristics and programming techniques to compiler tricks. Certainly, the whole issue around the memory model could be solved more elegantly if C11 were available, but C11 will remain out of reach for many development setups targeting the SLC5 toolchain throughout the coming years.

78 7 Integration in CERN Infrastructure

This chapter describes the essential steps in the integration of the monitoring and diag- nostic capabilities of CMX into established tools at CERN. This includes the integration into the DIAMON system (see requirements FUNC 1, p. 12) and tools for the diagnostics use case (FUNC 3).

7.1 A Remote Agent for CMX

The CMX library includes command-line tools which can be used to inspect, read and create CMX values on the local machine. In many cases, however, users expect to access the values exposed by CMX remotely. In the Java world, for example, the graphical JConsole tool can inspect a process over the network, implemented by direct communication with the process over a TCP connection. In contrast to JMX, CMX does not provide networking capabilities. There has to be a separate process, a CMX Agent, which attaches to the shared memory data structures managed by CMX and responds to requests over the network.

In the accelerator controls infrastructure there is already a daemon which runs on almost every machine: the "clic" agent. It is a system monitoring daemon and supports a plugin architecture for which hardware-specific modules exist. All "clic" agents report the acquired metrics to a message broker dedicated to monitoring.

Due to the modular architecture of the agent, it was easy to extend its functionality. With the new CMX module, monitoring CMX values is supported in the same standardized way as other system metrics. Additionally, custom commands can be used to list the available CMX components and their values. The metric transmission uses a "clic"-specific, self-describing protocol over a message broker connection, while the CMX-related remote commands respond with JSON [41] formatted messages.


7.1.1 Diagnostic Access in the DIAMON GUI

To provide easy diagnostic access to CMX values, a graphical user interface has been developed. The MX-Viewer is available either as a standalone application (see Fig. 7.2) or integrated into the host-centric view of the DIAMON Console user interface.

To make the workflow more general and to reach higher user acceptance for CMX, the MX-Viewer also supports read-only access to Java programs using JMX. In this respect it duplicates the functionality of the JConsole tool in accessing MBean attributes.

7.1.2 Monitoring of CMX Enabled Applications in DIAMON

DIAMON is based on the C2MON [45, 46] SCADA monitoring system, which uses distinct modules to access different kinds of data sources, for example network equipment through SNMP or Java processes through JMX.

For the integration of CMX into DIAMON, it was not necessary to create a new access module for C2MON, since the one for accessing the “clic” agent can be reused.

CMX metrics are addressed using a generic metric naming format. The agent reads the metric configuration from the database, with fields formatted like:

cmx.metric{arguments}
for example: cmx.metric{process-name,component-name,value-name}
real example: cmx.metric{CGFS_COHAL,COMPONENT,lun1.LatestCall}

The monitoring of CMX values is configured in the CERN Controls database. The config- uration string for identifying the metric can be generated in the MX-Viewer (see Fig. 7.3).

Since CMX integrates tightly into the monitoring system, values from CMX can be analyzed with the provided tools like any other monitored value. The alarm-trigger rules are applicable as well.


[Figure 7.1 diagram: C and C++ processes expose values (e.g. clients=10, packets_lost=0, commits_failed=0, txns_per_s=134, no_threads=10) through the CMX library into shared memory. The CMX Agent reads them and forwards them over STOMP to the accelerator monitoring system for operators; software engineers use the CMX Remote Viewer for remote access or the command line for local access.]

Figure 7.1: CMX remote access using an agent

Figure 7.2: The MX-Viewer, running embedded in the Diamon Console, showing live val- ues from a CERN BE-CO C++ program


Figure 7.3: Configuration of a CMX Metric. The MX Viewer (on top) generated the value specifier for the web-based controls configuration interface (at the bottom).

7.2 Interaction of CMX with Build Tools

C and C++ projects for CERN accelerator controls are built with a common build system based on make. The system enforces some conventions regarding the source directory layout and the structure of released products. The releases are copied to a common repository, similar to maven [47] but far simpler: it does not involve a server process, all data is shared over NFS, only one repository is usable at a time, and there is no dependency management in place.

Compared to the standards known from Java, the build process of C/C++ libraries and applications has serious shortcomings related to release and dependency management; in fact, the system gives no support for these tasks at all. Without exceptional attention it is easy to mix up the dependencies between the various products, and most software projects in BE-CO depend upon many libraries.

C/C++ programs, in contrast to Java with its RuntimeExceptions and detailed error information, do not fail gracefully. They simply crash, without useful information about faults introduced by linking errors. In particular, the C language toolchain ignores function signatures and does not warn about duplicate symbols in the linker's path, so there are even more possibilities to create scenarios which are very difficult to debug.

Dependency verification using a manifest To improve the situation we established a non-intrusive extension to the build system which collects dependency and compiler information throughout the build process. This information is called the Manifest.

The manifest contains metadata about the project, describing the external resources and dependencies used at build time. This data is then propagated through each dependent build process, which in the end gives insight into the complete build chain.

The manifest is also added to the resulting binaries, which were previously copied to the servers without any metadata and hence lacked any self-identifying information. This prevents situations where one cannot determine which version of a software product is actually running on a live system.

Format of a manifest file During the compilation of each product, a manifest file is generated as a plain-text file. The following snippet shows an example manifest file:

$Manifest: name=fesa-deploy-unit/CGAFG_DU:1.0.3@2014-05-06T08:40:09+0000;
  dep=fesa-class/CGAFG:1.0.3@2014-05-06T07:40:25+0000;
  user=...;
  compiler=i386-redhat-linux-g++ (GCC) ...;
  os=Linux 2.6.32-...;
  cpu=L865 $
$ManifestDep:
  name=fesa-class/CGAFG:1.0.3@2014-05-06T07:40:25+0000;
  dep=;
  user=...;
  compiler=i386-redhat-linux-g++ (GCC) ...;
  os=Linux 2.6.32-...;
  cpu=L865 $

The format is oriented towards RCS keywords [48, sec. 2.4], which can be identified by the UNIX tool ident. ident can be executed on any file; it searches for RCS keywords and prints them one per line. RCS keywords are formatted like $keyword: data $. The manifest always uses the keyword "Manifest", or "ManifestDep" for dependent libraries.

The manifest uses one keyword line per product. Each product ships all keywords of its dependent libraries: their $Manifest:$ line is transformed into $ManifestDep:$, and their name= value is added to the list of libraries in the 'dep=' attribute.


The name= attribute is formatted like project-name/product-name:version@timestamp. This is similar to the GAVC identifiers used in Java environments: in this analogy it reads group-id/artifact-id:version@timestamp, where the timestamp has no equivalent in the Java world and the classifier has no equivalent in the C/C++ world. The timestamp is used to detect spurious re-releases, which might create distortions but need not be fatal.

The attribute user= contains the username running the release-process, compiler= the C/C++ compiler name and version, os= the operating system and cpu= the identifier of the target platform.

Figure 7.4: MX-Viewer showing dependency information of a remotely running program

Embedding of structured text The plain-text manifest file is converted into an object-code file named manifest.o using objcopy from binutils [49]. The manifest.o file is put into the static library archive together with the rest of the product's object files. It is also put explicitly into the final executable, enabling the use of ident to identify deployed executables.

The role of CMX in improving the build system for the CERN control system is to make this manifest accessible remotely to operators and general specialists, thereby eliminating the need for command-line tools. With this in place, the current configuration can be verified automatically by comparing it to the software which is actually running.

The MX-Viewer application post-processes the manifest information to give a graphical, easily understandable overview. Fig. 7.4 shows the MX-Viewer focusing on the manifest: the application (here the "clic" agent itself) has two dependencies, of which one, cmw/cmx-cpp (the C++ wrapper for the CMX library), itself depends on the other, cmw/cmx.

A software library developer is able to scan all running programs to find out where his software is running and in which version. This can be used for inventory-taking and can support smooth upgrades of operational software. In problem diagnostics it provides a standardized, easy-to-use interface to compile-time information.

7.3 Conclusions

CMX was successfully integrated into the existing monitoring system DIAMON. Over 2,000 hosts can now simply be configured to make use of the C/C++ monitoring capabilities provided through CMX.

The modular architecture of both the "clic" monitoring agent and CMX enabled a seamless integration of values provided through CMX with other system metrics. The support for CMX in DIAMON was shipped as a regular update of the DIAMON "clic" daemon. CMX proved to be ready for integration into existing infrastructure.

The new possibilities offered by CMX were successfully combined with efforts to improve the C/C++ build system for the CERN accelerator controls software. Supported by CMX, compile-time information becomes accessible at run-time.

85 8 Summary

The target of this thesis was to create a means to expose run-time information from real-time C/C++ applications such as those of CERN's accelerator control system. We collected requirements and surveyed existing solutions, but found no existing product which fits the requirements.

With the CMX software library developed in this thesis, we are able to fulfill the predefined requirements successfully. The solution combines low-overhead inter-process communication using shared memory with a non-blocking communication protocol. We have seen that the run-time overhead of CMX is minimal and suitable for timing-critical processes.

The CMX API allows developers to expose run-time information in a standardized way. This removes the burden of developing and maintaining different, application-specific tools. The CMX C library has a low memory footprint and increases the code size by only 10-20 KB. The C++ wrapper adds syntactic constructs to provide a simpler and more concise API while keeping the performance characteristics of the underlying C library.

CMX is integrated into the CERN accelerator controls monitoring infrastructure. User-friendly tools are provided to simplify access and hence reduce the overall diagnostic time. Additionally, users can easily develop their own tools and scripts. At the time of writing this thesis, CMX is being adopted in accelerator controls software at CERN.

CMX is also ready to be used in any common Linux environment. All source code is published as an open source project under the LGPL license.

Today, CMX presents a good choice for adding monitoring abilities to any kind of timing-critical C/C++ application, thus enabling measures that increase overall availability and lower the mean time to recovery.

86 Literature

[1] CERN. LHC the guide. [Online; accessed 10-June-2014]. url: http://cds.cern.ch/record/999421.
[2] CERN. The Accelerator complex/Complexe des accélérateurs. [Online; accessed 10-June-2014]. url: http://cds.cern.ch/record/1621894.
[3] CMS Collaboration. "Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC". In: Phys. Lett. B716 (2012), pp. 30–61. doi: 10.1016/j.physletb.2012.08.021. arXiv: 1207.7235 [hep-ex].
[4] CERN. The Accelerator Control Group (BE-CO). [Online; accessed 10-June-2014]. url: http://cern.ch/be-dep-co.
[5] W. Buczak et al. Diamon2 - Improved Monitoring of CERN's Accelerator Controls Infrastructure. Tech. rep. CERN-ACC-2013-0234. [Online; accessed 10-June-2014]. Geneva: CERN, Oct. 2013. url: http://cds.cern.ch/record/1611115.
[6] Felix Ehm et al. CMX - A Generic Solution to Explore Monitoring Metrics. Tech. rep. CERN-ACC-2013-0241. [Online; accessed 10-June-2014]. Geneva: CERN, Oct. 2013. url: http://www.cern.ch/cmx.
[7] Paul E. McKenney. Is Parallel Programming Hard, And, If So, What Can You Do About It? First Edition Release Candidate 4. [Online; accessed 10-June-2014]. Linux Technology Center, IBM Beaverton, 2014. url: https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html.
[8] Poul-Henning Kamp. Timecounters: Efficient and precise timekeeping in SMP kernels. Tech. rep. [Online; accessed 10-April-2014]. The FreeBSD Project, 2002. url: phk.freebsd.dk/pubs/timecounter.pdf.
[9] Christoph Lameter. Effective synchronization on Linux/NUMA systems. In: Gelato Conference. [Online; accessed 10-June-2014]. Silicon Graphics, Inc., 2005. url: https://www.kernel.org/pub/linux/kernel/people/christoph/gelato/gelato2005-presentation.pdf.


[10] Hermann Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Applications. 1st Edition. Norwell, MA, USA: Kluwer Academic Publishers, 1997. isbn: 0792398947.
[11] W. Richard Stevens. UNIX Network Programming: Networking APIs: Sockets and XTI. 2nd. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1997. isbn: 013490012X.
[12] F. Ehm and A. Dworak. "A Remote Tracing Facility For Distributed Systems". In: Conf. Proc. C111010.CERN-ATS-2011-200 (Oct. 2011), WEMAU001. 4 p.
[13] Michael Kerrisk. The Linux Programming Interface: A Linux and UNIX System Programming Handbook. 1st. San Francisco, CA, USA: No Starch Press, 2010. isbn: 9781593272203.
[14] R. Fielding and J. Reschke. Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. RFC 7230 (Proposed Standard). [Online; accessed 10-June-2014]. Internet Engineering Task Force, June 2014. url: http://www.ietf.org/rfc/rfc7230.txt.
[15] T. Lundqvist and P. Stenstrom. "Timing anomalies in dynamically scheduled microprocessors". In: Real-Time Systems Symposium, 1999. Proceedings. The 20th IEEE. 1999, pp. 12–21. doi: 10.1109/REAL.1999.818824.
[16] S. Larsen and H. Wong. JSR 3: Java Management Extensions (JMX) Specification. [Online; accessed 10-June-2014]. 1998. url: https://jcp.org/en/jsr/detail?id=3.
[17] Johannes Hölzl. "Monitoring von Anwendungsservern mit Java Management Extensions (JMX)". Diplomarbeit. Institut für Informationsverarbeitung und Mikroprozessortechnik, Universität Linz, 2007.
[18] Henrik Storner et al. Xymon systems and network monitor. [Online; accessed 10-June-2014]. url: http://xymon.sourceforge.net/.
[19] SGI/RedHat. Performance Co-Pilot. [Online; accessed 10-June-2014]. url: http://oss.sgi.com/projects/pcp.
[20] pvbrowser.de. The Process Visualization Browser. [Online; accessed 10-June-2014]. url: https://github.com/pvbrowser/pvb.
[21] The IEEE and The Open Group. The Open Group Base Specifications Issue 7: IEEE Std 1003.1, 2013 Edition. [Online; accessed 11-June-2014]. 2013. url: http://pubs.opengroup.org/onlinepubs/9699919799/.
[22] Sven C. Koehler. localmemcache. [Online; accessed 10-June-2014]. url: https://github.com/sck/localmemcache.


[23] Michael Stapelberg and contributors. i3 - improved tiling wm. [Online; accessed 17-06-2014; v4.8]. url: http://i3-wm.org.
[24] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008. isbn: 9780123705914.
[25] Linus Torvalds, Josh Triplett, and Christopher Li. Sparse - a Semantic Parser for C. [Online; accessed 10-June-2014]. url: https://sparse.wiki.kernel.org/index.php/Main_Page.
[26] Hermann Kopetz and J. Reisinger. "The non-blocking write protocol NBW: A solution to a real-time synchronization problem". In: Real-Time Systems Symposium, 1993. Proceedings. Dec. 1993, pp. 131–137. doi: 10.1109/REAL.1993.393507.
[27] Alessandro Rubini and Jonathan Corbet. Linux Device Drivers, Third Edition. [Online; accessed 10-June-2014]. O'Reilly Media, Inc., 2005. url: https://lwn.net/Kernel/LDD3/.
[28] Linux Kernel Developers. Linux. [Online; accessed 01-July-2014]. url: https://www.kernel.org/.
[29] Gerard Holzmann. The Spin Model Checker: Primer and Reference Manual. First. Addison-Wesley Professional, 2003. isbn: 0-321-22862-6.
[30] ISO IEC JTC1/SC22/WG14 - C. [Online; accessed 11-June-2014]. url: http://www.open-std.org/jtc1/sc22/wg14/.
[31] Eric Eide and John Regehr. Volatiles Are Miscompiled, and What to Do about It. 2008.
[32] Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A/B/C (order numbers 253668, 253669, 326019). 2014.
[33] Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A/B/C (order number 325383-050US). 2014.
[34] GCC, the GNU Compiler Collection. [Online; accessed 11-June-2014]. url: https://gcc.gnu.org.
[35] John L. Hennessy and David A. Patterson. Computer Architecture, Fifth Edition: A Quantitative Approach. 5th. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011. isbn: 9780123838728.
[36] Scott Owens, Susmit Sarkar, and Peter Sewell. "A Better x86 Memory Model: x86-TSO". In: Theorem Proving in Higher Order Logics. Ed. by Stefan Berghofer et al. Vol. 5674. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2009, pp. 391–407. isbn: 978-3-642-03358-2. doi: 10.1007/978-3-642-03359-9_27.


[37] Scott Owens. "Reasoning About the Implementation of Concurrency Abstractions on x86-TSO". In: Proceedings of the 24th European Conference on Object-Oriented Programming. ECOOP'10. Maribor, Slovenia: Springer-Verlag, 2010, pp. 478–503. isbn: 978-3-642-14106-5. url: http://dl.acm.org/citation.cfm?id=1883978.1884011.
[38] Lana Brindley, Alison Young, and Cheryn Tan. Red Hat Enterprise MRG 2 Realtime Reference Guide. 2013.
[39] perf: Linux profiling with performance counters. [Online; accessed 10-June-2014]. url: https://perf.wiki.kernel.org/index.php/Main_Page.
[40] Steven Knight et al. SCons - build your software, better. [Online; accessed 11-June-2014]. url: http://www.scons.org.
[41] T. Bray. The JavaScript Object Notation (JSON) Data Interchange Format. RFC 7159 (Proposed Standard). Internet Engineering Task Force, Mar. 2014. url: http://www.ietf.org/rfc/rfc7159.txt.
[42] Atlassian Software. Bamboo Continuous Integration and Build Server. [Online; accessed 11-June-2014]. url: https://www.atlassian.com/software/bamboo.
[43] Peter Oberparleiter et al. lcov - a graphical GCOV front-end. [Online; accessed 11-June-2014]. url: http://ltp.sourceforge.net/coverage/lcov.php.
[44] David Levinthal. Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 Processors. Intel Corporation, 2009.
[45] CERN. C2MON open source SCADA monitoring system. [Online; accessed 10-June-2014]. url: http://cern.ch/c2mon.
[46] M. Braeger et al. High-Availability Monitoring and Big Data: Using Java Clustering and Caching Technologies to Meet Complex Monitoring Scenarios. 2014.
[47] Apache Maven Project. [Online; accessed 01-May-2014]. url: http://maven.apache.org.
[48] GNU RCS Manual. [Online; accessed 11-June-2014]. url: http://www.gnu.org/software/rcs/manual/.
[49] GNU Binutils. [Online; accessed 11-June-2014]. url: http://www.gnu.org/software/binutils/.

Glossary

ARM refers to the ARMv7/v8 computer architecture, designed by ARM Holdings plc. 58, 59, 60
CERN European Organization for Nuclear Research. French: Organisation Européenne pour la Recherche Nucléaire; originally Conseil Européen pour la Recherche Nucléaire. 2, 5, 8, 14, 19, 26, 70, 79, 80, 82, 86, 91

DIAMON The Diagnostics and Monitoring system ("DIAMON2") is the main monitoring system in CERN's accelerator controls infrastructure. 5, 8, 13, 80, 85
Intel Intel Corporation, inventor of the x86 platform. The term Intel platform is also used to refer to the Intel processor architecture. We use the term x86-32 for the 32-bit architecture (also called i386) and x86-64 for the 64-bit architecture (also called AMD64 or Intel 64). 6, 57, 59, 60
IPC Inter-Process Communication. 16, 19, 21, 24, 27
JMX Java Management Extensions - specification for the monitoring and remote control of applications running on the Java platform. 13, 17, 25, 30, 79, 80
LHC The Large Hadron Collider is the world's largest and most powerful particle collider, located at CERN near Geneva. 1, 2, 3
make Make is a tool designed to automate build processes; a commonly used implementation is GNU Make. Make models the dependencies between source files and their corresponding compile units, as well as any other file-based entities. 82
NTP The Network Time Protocol is widely used to synchronize computer clocks over a network. 65
POSIX Portable Operating System Interface, published by the IEEE. 26, 30, 31, 33, 52, 64, 78
SCADA Supervisory Control and Data Acquisition, a term describing the surveillance and control of industrial processes using a computer system. 26
SLC Scientific Linux CERN, a Linux-based operating system built from the sources of Red Hat Enterprise Linux. 6, 34, 53, 58, 63, 64
SMP Symmetric multiprocessing (SMP) systems consist of multiprocessor computer hardware in which more than one processor connects to a single, shared main memory and is controlled by a single OS instance. 24
System V UNIX System V. Standardized in POSIX 1003.1-2008 [21]. 26, 32, 33, 34

List of Definitions and Requirements

1 TERM: Roles ...... 10
2 TERM: Metric ...... 10
3 TERM: Real-Time ...... 11
1 FUNC: Monitoring ...... 12
2 FUNC: Development ...... 13
3 FUNC: Diagnosis ...... 13
1 TECH: Real-Time ...... 13
2 TECH: Integration ...... 14
3 TECH: Reusability ...... 14
4 TECH: Portability ...... 14
5 TECH: Easy to use ...... 14
6 TECH: Datatypes ...... 15

1 Verify: Read before Write ...... 43
2 Verify: Read after Write ...... 43
3 Verify: Read overlaps Write ...... 43
4 Verify: Start of Write overlaps Read partially ...... 43
5 Verify: Start of Write overlaps Read ...... 43
6 Verify: Read inside Write ...... 43
7 Verify: Write inside Read ...... 43
8 Verify: The value update is non-blocking ...... 43
9 Verify: The value update fails if another update is in progress ...... 43
10 Verify: The read operation always detects invalid data ...... 43

List of Figures

1.1 CERN Accelerator Complex [2] ...... 2
1.2 Pictures related to the discovery of the Higgs Boson, CMS Collaboration [3] ...... 3

2.1 CERN Accelerator Control System ...... 5

3.1 C++ Monitoring and Diagnostics: Users and their use-cases ...... 12

4.1 General monitoring system architecture ...... 17
4.2 Data acquisition for monitoring systems ...... 18
4.3 Taxonomy of UNIX IPC facilities (figure based on [13, p. 878]) ...... 21
4.4 Java VisualVM showing JMX attributes ...... 26

5.1 CMX Host - Process - Component model ...... 29
5.2 Virtual to physical memory addresses ...... 33
5.3 Program flow of a sequence lock implementation ...... 39
5.4 Visualization of data in "Second enhancement" version, over time ...... 40
5.5 Memory structures of CMX (generated using a custom tool on top of sparse [25]) ...... 42
5.6 Overview of the CMX Reader/Writer Protocol with examples ...... 44
5.7 Non-Blocking Reader/Writer Protocol [26] ...... 46
5.8 Model states, image generated by spin from Listing 5.1 ...... 49

6.1 Simplified CPU with and without Store Buffer (figure based on [7]) ...... 59
6.2 x86 assembler test program ...... 61
6.3 Timing of x86 assembler test program ...... 61
6.4 Read and write values x and y ...... 61
6.5 Reader/Writer with CPU swap ...... 63
6.6 C++ class diagram ...... 68
6.7 Simple example of using the CMX C++ API ...... 69
6.8 Demo of OO abstraction for CMX (excerpt) ...... 69


6.9 Priority inversion with 3 threads on 2 resources ...... 71
6.10 Results of the coverage analysis (CMX 2.0.4) ...... 73
6.11 Change of memory usage depending on number of metrics ...... 77

7.1 CMX remote access using an agent ...... 81
7.2 The MX-Viewer, running embedded in the Diamon Console, showing live values from a CERN BE-CO C++ program ...... 81
7.3 Configuration of a CMX metric. The MX-Viewer (on top) generated the value specifier for the web-based controls configuration interface (at the bottom) ...... 82
7.4 MX-Viewer showing dependency information of a remotely running program ...... 84

List of Tables

4.1 Comparison of Logging, Monitoring and Diagnostics ...... 20

5.1 Comparison of Shared Memory Implementations ...... 34

6.1 Comparison of execution times on virtualized/unvirtualized hosts with different clock sources ...... 65
6.2 Tabular overview of C source-code header files ...... 66
