Experiences in Developing a Cluster for Real-Time Simulation

Andrew Robbie; Graeme Simpkin; John Fulton; David Craven
Defence Science and Technology Organisation
[email protected]; [email protected]; [email protected]; [email protected]

Abstract. There has been considerable interest in making use of GNU/Linux as a platform for simulation and training systems. The simulation centre in DSTO's Air Operations Division is developing a GNU/Linux cluster designed primarily to serve as a real-time computational resource, rather than as a graphics display system. In this paper we examine some of the design challenges, such as system configuration, the modifications required to achieve real-time performance under GNU/Linux, the inter-process communication methods used by the simulation middleware, and performance benchmarking.

1. INTRODUCTION

In 1992 the Air Operations Division of the Defence Science and Technology Organisation undertook to consolidate three decades of real-time human-in-the-loop simulation development and expertise into a new capability named the Air Operations Simulation Centre (AOSC) [1].

The primary mission given to the AOSC was to develop the capability for, and conduct, real-time human-in-the-loop simulation research with an emphasis on the air domain [2]. A flexible, modular system architecture was developed to address both immediate and anticipated simulation requirements. The facility demonstrated its initial operational capability in 1994 by conducting an investigation of the tactical utility of symbology presented to aircrew using helmet-mounted displays. The following ten years of development and operation have seen the range of simulation tools expand. Fixed wing and rotary wing cockpits with both single and dual seat configurations have been integrated with corresponding system models to provide the range of crew environments needed for the work of the Division.

Image generation systems and simulation system host computers have often been partitioned into separate hardware configuration items. The AOSC adopted a different approach, using a large multi-processor Silicon Graphics computer to simultaneously provide multi-channel image generation and host system models (flight models, avionics subsystems etc.). Combining these functions within a single system delivered benefits through reduced maintenance and administration of the system and simplified inter-process communications. A simulation software infrastructure was developed by the AOSC to take full advantage of multi-processor machines. Scalability was provided through interconnection of host computers via high-bandwidth reflective memory links.

During its early years of service the AOSC typically ran simulation experiments in a serial fashion. However, the demands of recent projects highlighted a requirement for routinely developing and performing experiments in parallel. Addressing this requirement provided the development team with an opportunity to investigate alternative design solutions afforded by new technology developments. Rapid advancements in the performance of commodity computers made them an attractive design option, meriting further investigation. This paper discusses issues that were identified while exploring the concept of transitioning the successful AOSC architecture to lower-cost computing systems running the GNU/Linux operating system.

2. ARCHITECTURE OF THE AOSC

The broad range of applications for the AOSC dictated a real-time system design that would be flexible, re-configurable, scalable and robust. The research objectives of the simulation imply that behaviour of subsystem models must be both observable and repeatable. Furthermore, the system was designed to undergo continuous development and modification to accommodate changing features of the modelling domain. The architectural solution, developed by the AOSC to meet these requirements, can be described in terms of two design patterns [3] – the façade and the mediator.

Figure 1: A low-cost flying desk system; computations are performed on the AOSC Linux cluster.

Façade

The façade design pattern calls for each subsystem, or user developed module, to be wrapped in a common interface. Individual components of the AOSC are integrated into the simulation system in a unified approach that provides standard input, output and control interfaces. User modules that employ the façade interface include subsystem models (such as flight dynamics and avionics models), audio systems, image generation systems, operator station modules, distributed simulation gateways and cockpit devices.

Mediator

The mediator design pattern calls for a design component that liaises between other components of the system (Figure 2). The AOSC uses a mediator process to transfer data between user modules, and control or schedule subsystem activity. De-coupling user modules in this way greatly facilitates their re-use.

Figure 2: Mediator linking user modules.

Process control

A simulation scenario configuration file is created for each experiment detailing the modules to be used, the resources assigned and the inter-module data-flows. This file is referred to as a global map file. Mediators and user modules use this information to establish run-time interconnections. The mediator coordinates time-wise execution of subsystem processes in accordance with resources specified in the global map file. Data transfers occur on simulation frame boundaries as shown in Figure 3. Robbie [4] describes this behaviour in further detail.

Figure 3: Subsystem scheduling.
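To make the façade and mediator roles concrete, the sketch below shows one way the pattern could look in C. The facade_t interface, the toy flight_model module and the fixed frame loop are hypothetical illustrations of the pattern, not the actual AOSC API.

    /* Hypothetical sketch of the facade/mediator pattern described above.
     * The interface names and the toy module are illustrative only. */
    #include <stdio.h>

    /* Common interface ("facade") that every user module exposes. */
    typedef struct {
        const char *name;
        void (*read_inputs)(void);    /* copy subscribed data into the module */
        void (*step)(double dt);      /* advance the model by one frame       */
        void (*write_outputs)(void);  /* publish results for other modules    */
    } facade_t;

    /* A trivial user module wrapped in the facade interface. */
    static double altitude = 0.0;
    static void fm_read_inputs(void)   { /* e.g. read stick position */ }
    static void fm_step(double dt)     { altitude += 5.0 * dt; }
    static void fm_write_outputs(void) { printf("altitude: %.1f m\n", altitude); }

    static facade_t flight_model = {
        "flight_model", fm_read_inputs, fm_step, fm_write_outputs
    };

    int main(void)
    {
        /* In the real system the module list and frame rate come from the
         * global map file; here they are hard-coded. */
        facade_t *modules[] = { &flight_model };
        const double dt = 1.0 / 60.0;             /* 60 Hz frame rate */

        for (int frame = 0; frame < 3; frame++) {
            /* Mediator: transfer data on the frame boundary, then step
             * each module for the current frame. */
            for (unsigned i = 0; i < sizeof modules / sizeof modules[0]; i++)
                modules[i]->read_inputs();
            for (unsigned i = 0; i < sizeof modules / sizeof modules[0]; i++)
                modules[i]->step(dt);
            for (unsigned i = 0; i < sizeof modules / sizeof modules[0]; i++)
                modules[i]->write_outputs();
        }
        return 0;
    }

Because every module presents the same three entry points, the mediator can schedule and connect modules purely from the global map file, which is what makes them easy to re-use.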

HLA Analogy

When viewed at the abstraction level of design patterns, it can be seen that the AOSC architecture has much in common with the DMSO-sponsored High Level Architecture (HLA) [5]. It should be noted, however, that the HLA initiative developed several years after the AOSC began operations. The approximate mapping between architectural components is shown in Table 1.

Table 1: Similarities between AOSC and HLA

  AOSC                     HLA
  User Module & Façade  ⇒  Federate
  Mediator              ⇒  Run Time Infrastructure (RTI)
  Global Map File       ⇒  Federation Object Model

3. PROJECT ALUMINIUM

A spiral life-cycle model was adopted for concept exploration so that higher-risk options would be examined in earlier work iterations. The initial iteration of work set out to resolve issues that would be introduced by migrating the original architecture onto PC-based systems. The name given to this initial iteration of work was Project Aluminium.

3.1 Design Decisions

Migrate from IRIX to GNU/Linux

GNU/Linux presently enjoys popularity as a well-supported, widely available UNIX-like operating system. The GNU project provides the development toolchain (compilers and core libraries), while the Linux kernel provides the underlying infrastructure, such as virtual memory and scheduling. It was decided early in the project that AOSC simulation software should be migrated to the GNU/Linux platform to achieve the following project goals:

• Establishing a technology roadmap that includes low-cost computing options
• Improving AOSC flexibility by enhancing platform independence
• Minimising risk by maintaining commonality with the original architecture

Implement a compute cluster

It was decided that the multi-processor scalability of the original implementation should be reproduced by implementing a cluster of GNU/Linux nodes. A number of toolkits are available for developing cluster applications. These include message passing APIs, such as MPI [6], and virtual machine systems, such as PVM [7], Mosix [8] and OpenSSI [9]. These tools provide an abstraction layer that hides the distributed nature of the system. Unfortunately, none of these were designed to support real-time systems. Rather, they are focused on massively parallel problems, where every thread is executing the same code, and jobs can run for days or months. There is also a steep learning curve for some toolkits, notably MPI. Hence, these tools were not used.

Reuse system components

Rework was minimized by determining that original AOSC components should be reused wherever possible. Peripheral components able to be reused without alteration included the audio subsystem, instruments, cockpit interfaces and terrain databases.

De-couple from image generation system

The image generation subsystem was the only user module requiring significant redevelopment for this project. It was decided to limit the scope of Project Aluminium to migrating only the simulation compute infrastructure. Accordingly, the new architecture would need to interface with an external image generation system. This approach was believed to enhance the flexibility of the system by encouraging the interchangeability of image generation capabilities.

3.2 Concept Exploration

Selecting a Linux kernel

Obtaining real-time performance requires assistance from the operating system kernel in a number of areas. When we started development we began with the most recent Linux kernel available, release 2.4.3. A number of key features were not present in the mainline kernel, but are released by other developers as patches to specific kernel versions. Tracking bleeding-edge kernel developments is a full-time job, especially when several patches must be applied.

A number of approaches have been taken to enhance the scheduling performance of Linux [24]. The low-latency patch [10] prevents the kernel from spending long periods in any particular system call or internal housekeeping task. The pre-emptive kernel patch [11] tries to make kernel threads pre-emptive, i.e. other kernel threads or user threads can interrupt lower priority work. Both of these techniques aim to reduce the delay between a thread becoming eligible for addition to the run queue and activation of the scheduler, which then decides which thread has the highest priority. The O(1) scheduler patch [12] is designed to make the scheduler run in constant time for most activations, only occasionally running in O(n) time. The low-latency patch was used in the new development system.
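Scheduling latency of this kind can be observed with a simple periodic-wakeup test. The sketch below is a minimal, hypothetical example of such a measurement (in the style of later tools such as cyclictest), not the benchmark used on the AOSC cluster: it sleeps to an absolute deadline and records how late the thread actually woke up.

    /* Minimal wakeup-latency probe: sleep until an absolute deadline and
     * report how far past the deadline the thread actually ran.
     * Illustrative only; build with: gcc -O2 -o latency latency.c -lrt */
    #include <stdio.h>
    #include <time.h>

    #define NSEC_PER_SEC 1000000000L
    #define PERIOD_NS    1000000L      /* 1 ms period */
    #define ITERATIONS   1000

    static void timespec_add_ns(struct timespec *t, long ns)
    {
        t->tv_nsec += ns;
        while (t->tv_nsec >= NSEC_PER_SEC) {
            t->tv_nsec -= NSEC_PER_SEC;
            t->tv_sec++;
        }
    }

    int main(void)
    {
        struct timespec next, now;
        long max_lat_us = 0;

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (int i = 0; i < ITERATIONS; i++) {
            timespec_add_ns(&next, PERIOD_NS);
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);

            long lat_us = (now.tv_sec - next.tv_sec) * 1000000L
                        + (now.tv_nsec - next.tv_nsec) / 1000L;
            if (lat_us > max_lat_us)
                max_lat_us = lat_us;
        }
        printf("worst-case wakeup latency: %ld us\n", max_lat_us);
        return 0;
    }

Run at a real-time priority (see section 3.2.2), the same probe gives an indication of how much the low-latency and pre-emption patches actually help on a given kernel.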
One of the Real-Time Linux variants is RTAI [13]. The RTAI environment is a separate kernel and operating system which runs at a higher priority than the standard Linux kernel. Since RTAI has been designed from the beginning for real-time tasks, it does not suffer the scheduling latency overhead of a general purpose operating system.

The main drawback to using RTAI is that real-time tasks cannot use standard Linux system calls — instead, they communicate with a normal Linux process via real-time FIFOs or message queues, which would be complicated for our application. In addition, there are few real-time mode Network Interface Card (NIC) drivers (none for Gigabit Ethernet). As high speed, low latency communication between nodes is essential, this is a significant issue. Therefore, it was decided that real-time Linux would only be used if normal Linux proved unable to provide the required performance.

Recently the RTAI developers have introduced the LXRT mode, which allows normal Linux threads to switch to a hard real-time scheduler. Additionally, these threads run in user mode rather than kernel mode. This mode of operation makes IPC significantly easier, and may be prototyped in future Aluminium development.

3.2.1 Cluster Implementation Issues

CPUs per node

An evaluation of the price/performance ratio of a number of computer systems revealed that 2-way Symmetric Multi-Processor (SMP) systems were the most competitive. Larger 4- and 8-way SMP systems are generally offered in the ‘Enterprise’ computing sales channel, and seem to be around 5 times more expensive per CPU, partly because they have server-specific features like SCSI hot-swap drives and power supplies. In addition, large SMP systems seem to lag behind the technology curve by several years. These disadvantages were seen as outweighing the reduced communication overhead. Uni-processor systems are cheaper per CPU, but the fixed cost per node (high speed NICs, rack space, power, management) makes them less economical, and they are less suitable for running real-time tasks.

Cluster interconnect

An analysis of the inherent parallelism and communication versus computation profile of the IRIX implementation gave an indication of the bandwidth and latency required of the cluster fabric, and of the middleware mode of operation. For example, although a data item may have many subscribers, and hence might be distributed efficiently using multicast, the overhead of marshalling separate messages for every multicast group may negate the benefit.

The general trend in communications hardware is to offload communications tasks from the CPU. However, to obtain low latencies it is necessary to use operating system bypass — user processes are then able to communicate without having to transfer all data via the kernel. This requires more intelligent NICs and drivers.
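The multicast trade-off mentioned above is easy to picture with a plain UDP example. The sketch below is illustrative only (the group address, port and payload are invented): publishing one data item to a multicast group is a single send, but a publisher with many differently-subscribed groups must marshal and send a separate datagram per group.

    /* Publish one data item to a UDP multicast group.
     * Illustrative sketch; address, port and payload are hypothetical. */
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        /* Keep the traffic on the local segment. */
        unsigned char ttl = 1;
        setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

        struct sockaddr_in group;
        memset(&group, 0, sizeof(group));
        group.sin_family = AF_INET;
        group.sin_addr.s_addr = inet_addr("239.1.2.3");  /* hypothetical group */
        group.sin_port = htons(5400);                    /* hypothetical port  */

        double altitude = 1234.5;   /* one published data item */
        if (sendto(fd, &altitude, sizeof(altitude), 0,
                   (struct sockaddr *)&group, sizeof(group)) < 0)
            perror("sendto");

        /* A publisher with N multicast groups repeats the marshal-and-send
         * step N times; that per-group overhead is the cost weighed above. */
        close(fd);
        return 0;
    }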

A number of interconnect hardware options were evaluated, e.g. Fast and Gigabit Ethernet, ATM, SCI, SCRAMNET, Reflective Memory, Myrinet, Giganet cLAN, InfiniBand, and Quadrics. Key criteria were: API simplicity and suitability, performance, cost, support and maintenance effort.

The interconnect API can have a significant effect on the architecture of the overall system. In Reflective Memory type systems the communication pattern is inherently hidden, possibly at some performance cost. In contrast, message-based APIs, such as Berkeley Sockets, require explicit message receive actions. Myrinet's GM API is message based, but also includes modes such as remote write. Myrinet also provides Sockets-GM, which allows standard TCP/IP code to take advantage of operating system bypass features.

The first cycle of the hardware acquisition process used two off-the-shelf dual-processor PIII systems, using both Fast and Gigabit Ethernet. Although this was a useful system for development, numerous problems were identified. This led to a decision to identify vendors with experience in multi-machine configurations, preferably clusters. A longer process of test and evaluation was also seen as essential in reducing the risk of excessive maintenance costs after system deployment.

Table 2: Driver support under Linux

  Vendor        Vendor support   Open-source and community support
  Myrinet       Excellent        Very good
  Intel         Excellent        Very good – Intel drivers open-sourced
  3Com          Minimal          Good
  D-Link        Poor             Non-existent
  Giganet cLAN  Vendor absorbed, no future support, no source code available

Boot system

A major factor in hardware decisions is the level of vendor support for Linux. Many vendors have drivers for previous versions of the kernel that do not work on recent, development kernels. A number of devices are supported by open-source drivers; these seem to be generally better maintained than vendor drivers, especially when there is an active user community. See Table 2.

The key factor in reducing administration overhead for a cluster is maintaining all systems with an identical software configuration. This can be achieved relatively easily for application software through the use of NFS shared drives. However, upgrading the main operating system components (kernel, core libraries such as glibc, the X11 system) is more of a challenge.

Most large Beowulf [23] clusters use a network boot environment. However, these systems run a limited set of applications and run in fairly static configurations. This makes the overhead of developing a custom boot image less onerous. These systems are often extensively prototyped. As we were only beginning to explore the cluster solution we started with more basic configurations.

The initial attempt involved installing generic Red Hat Linux 7.1 on all nodes of the cluster using the Kickstart automated installer. While this was effective in reducing the installation time, it provides no method to maintain synchronisation.

We explored creating drive images and mirroring the disks, both over the network and making use of removable drive caddies fitted to our systems. However, most tools, e.g. Norton Ghost, do not seem capable of handling combined GNU/Linux and Windows multi-boot configurations, and as they do not recognise the ext3 file system format, blank space is copied, making copying time excessive. The network copy mechanism is not automated, which significantly increases administration overhead.

The most successful technique for system synchronisation used the rsync protocol [14]. Start-up, shutdown and cron scripts were used to automate the process of updating slave nodes from a master node. This method is able to handle updating all kernel and user programs and, as only changes have to be propagated, it is very fast. A null update (checking for any changes) takes around 20 seconds; updating from Red Hat 7.1 to 7.3 took around 10 minutes and required no user intervention. However, if a system update is performed using a broken kernel or system configuration, it can require a complete re-install of the operating system. Also, per-node customisation can significantly complicate the synchronisation script.

The system used at present makes use of network boot features built into the GRUB bootloader [15]. In large, static installations it makes sense to burn NIC ROM chips; however this is not required if booting from a floppy disk is acceptable. The GRUB program on the floppy loads up, then queries the Dynamic Host Configuration Protocol (DHCP) server for an IP address. With this it is able to find the tftp server and load the Linux kernel and initial RAM disk image, which are then booted. The Linux kernel re-queries the DHCP server for the address of its primary NIC; this allows it to derive the IP addresses of the other interfaces. After networking has loaded, the system mounts NFS shares for /usr, /etc and /lib.

Job control

The original implementation used a Tcl/Tk script called mapstarter to parse the global map file and identify all user modules and the mediator for a given host. The mapstarter provided a GUI to allow the experiment operator to start and monitor each process. With minor changes this program also worked on the new implementation.

While the mapstarter functions correctly, the employment of many nodes in the new architecture makes using one mapstarter for each node complicated. It is envisioned that an improved tool would provide a single user interface.

3.2.2 Real-time Performance Issues

Assigning processes to CPUs

Real-time user modules are assigned to dedicated CPUs to provide guaranteed processing time for each module. Non real-time processing, such as asynchronous user interfaces and operating system tasks, is prevented from running on the real-time CPU(s). Thus real-time and non real-time processes do not contend for CPU resources and scheduling is simplified. Brosky and Rotolo [16] describe a similar approach.

IRIX has long provided a rich set of features [17] for supporting this model, including tools and function calls for isolating individual CPUs (from non-real-time processes) and the ability to redirect interrupts to a nominated CPU. A cursory examination of Linux kernel scheduling code showed that a bit-field was already implemented for setting the CPU affinity of each process. However, there was no corresponding interface that enabled a user mode process to assign CPU resources. Interrupt redirection is also possible under Linux; however we have not tested this extensively.

The obvious solution to this problem was to patch the kernel to provide the required interface. Fortunately, several people in the open-source community had already developed a solution. Morton's approach [18] provides an elegant interface via the Linux proc filesystem. At the time of writing another approach has been merged into the experimental Linux kernel tree; this adds the sched_setaffinity system call. Project Aluminium uses the proc interface at present.
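As noted above, Project Aluminium drives Morton's proc interface; the later sched_setaffinity system call exposes the same kernel bit-field directly. The fragment below is a minimal sketch of the system-call route as it appears in current GNU/Linux (not the proc-based code used in the project), pinning the calling process to CPU 1.

    /* Pin the calling process to a single CPU via sched_setaffinity.
     * Sketch of the system-call interface mentioned above; the project
     * itself used the /proc based patch. Build: gcc -o pin pin.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(1, &mask);      /* CPU 1 is reserved for real-time modules */

        /* pid 0 means "the calling process". */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            exit(EXIT_FAILURE);
        }

        printf("pid %d now restricted to CPU 1\n", (int)getpid());
        /* ... real-time user module work would start here ... */
        return 0;
    }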
Scheduling

Scheduling refers to the chronological assignment of processes to processors. Normal UNIX scheduling is pre-emptive, i.e. processes can be paused at any point without notification. The standard UNIX scheduler implements a multi-level feedback queue [19], designed to give a ‘fair’ allocation of resources. However, real-time systems need to meet deadlines, which often requires a guaranteed time allocation per scheduling interval.

As described earlier, process scheduling in the current architecture is managed cooperatively by the mediator-façade implementation as opposed to the operating system. Originally this was achieved by disabling pre-emptive scheduling on processors set aside for simulation processing. Standard GNU/Linux does not support disabling pre-emptive scheduling on a per-processor basis. However, the Linux kernel does cater for POSIX thread scheduling policies, allowing specific processes to be scheduled in a First-In First-Out (FIFO) fashion. FIFO scheduling is considered to be a real-time scheduling policy and so always has higher priority than non real-time processes. Therefore the operating system will never schedule non real-time processes to run on a processor servicing a FIFO process.

To all intents and purposes, running a set of FIFO simulation processes, constrained to a particular processor, achieves the same result as running a set of processes on a processor that has had pre-emptive scheduling disabled and is restricted from running extraneous processes.

This approach was not without its teething problems. Although all the elements appeared to be in place, locally developed kernel patches were initially required to ensure correct FIFO scheduling behaviour. However, these features have been correctly implemented for Linux kernels 2.4.20-pre5 and later.

The Linux Trace Toolkit [20] was used as a means of observing the scheduling behaviour of potential design solutions.
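Combined with the CPU binding shown earlier, switching a module to the FIFO policy is a short call. The sketch below is illustrative (the priority value and the memory locking are our own choices), not the AOSC start-up code: it requests SCHED_FIFO and locks the process's memory so that page faults cannot disturb frame timing.

    /* Switch the calling process to the SCHED_FIFO real-time policy.
     * Illustrative sketch only; must be run as root (or with the
     * appropriate privilege). Build: gcc -o rt rt.c */
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        struct sched_param sp;

        /* A mid-range real-time priority; 1 is lowest, 99 highest. */
        sp.sched_priority = 50;

        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            exit(EXIT_FAILURE);
        }

        /* Lock current and future pages so paging never stalls a frame. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        printf("running with SCHED_FIFO priority %d\n", sp.sched_priority);
        /* ... frame loop of the real-time user module ... */
        return 0;
    }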

3.2.3 Inter-Process Communication (IPC) Issues

The original architecture supported two hosts connected via reflective memory. The architecture was expanded to a multi-host configuration to support many Linux hosts.

Multi-host Communication Changes

An abstraction layer was developed between the mediator and specific implementations of cluster interconnects. This enables networking hardware to be easily changed without requiring any change to the mediator interface.

Minor changes were needed due to differences between reflective memory and socket architectures. Communication using reflective memory is stateless, i.e. reflective memory allows source processes to write at any time and destination processes to read at some later stage. Communication using sockets requires both source and destination processes to co-operate in the transaction.

TCP/IP was chosen over UDP/IP because the unreliable nature of UDP can result in occasional lost packets. Unfortunately, TCP/IP can cause large delays when recovering from packet loss. A higher performance protocol incorporating application-specific error recovery is being developed, using UDP as the underlying transport mechanism.

Multi-host Synchronisation

The original system implementation required a byte of reflective memory to be set or cleared by the nominated master mediator and slave mediators to indicate the state of processing on each machine. The master mediator would then toggle these state flags to schedule the next iteration. The Project Aluminium system replaced these state flags with explicit messages. Slave nodes send iteration-complete messages to the master mediator and wait for a continuation message; this acts as a barrier synchronisation.

Moving to POSIX IPC

The basic infrastructure used SGI-specific shared-memory arenas for IPC. These arenas had several features not shared by basic POSIX shared memory implementations, including automatic growth and broader IPC functionality. Arena functionality was abstracted and replaced using combinations of POSIX shared memory and locks.
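A minimal sketch of the POSIX replacement is shown below. The segment name, the layout and the single mutex are hypothetical, and the sketch omits the arena-style features (automatic growth, richer IPC) that had to be re-implemented on top of these primitives: it simply creates a named shared-memory segment and places a process-shared lock and one data field in it.

    /* Create a named POSIX shared-memory segment holding a process-shared
     * mutex and one data field. Hypothetical layout; the AOSC abstraction
     * layer is more elaborate. Build: gcc -o shm shm.c -lpthread -lrt */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    typedef struct {
        pthread_mutex_t lock;   /* shared between processes */
        double airspeed;        /* example published value  */
    } shared_block_t;

    int main(void)
    {
        /* Hypothetical segment name. */
        int fd = shm_open("/aosc_demo", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); exit(EXIT_FAILURE); }
        if (ftruncate(fd, sizeof(shared_block_t)) != 0) {
            perror("ftruncate"); exit(EXIT_FAILURE);
        }

        shared_block_t *blk = mmap(NULL, sizeof(*blk),
                                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (blk == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }

        /* Initialise the mutex so any process mapping the segment may take
         * it (a creator-only step in a real system). */
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&blk->lock, &attr);

        pthread_mutex_lock(&blk->lock);
        blk->airspeed = 250.0;              /* publish a value */
        pthread_mutex_unlock(&blk->lock);

        printf("wrote airspeed %.1f to shared segment\n", blk->airspeed);
        munmap(blk, sizeof(*blk));
        close(fd);
        shm_unlink("/aosc_demo");           /* clean up the demo segment */
        return 0;
    }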
3.2.4 Compilation Issues

The base distribution used was Red Hat 7.1-3. This shipped the GNU C compiler version 2.96 instead of the more stable 2.95.3, breaking compilation of the kernel and other software dependent on certain low-level code generation behaviour. The current version of Red Hat ships with GCC 3.2.2, which is more standards compliant.

Some convenience functions and macros provided with IRIX were not available under GNU/Linux. Although these functions were highlighted as being operating system specific, they had been utilised in the AOSC implementation. These capabilities had to be reimplemented.

Different levels of standards compliance were observed for compilers of different languages. FORTRAN compilers, in particular, exhibited different features on the two platforms under consideration. Areas of difference included C-like extensions (tolerance of C-style /*comments*/, the ability to treat integer types like Boolean types and tolerance of subroutine prototypes), compiler name-mangling, restrictions on the use of FORMAT statements and the implementation of intrinsic functions. These differences were particularly resource intensive to resolve.

Several standalone systems, such as instrument displays, are hosted on WIN32 operating systems. By default these applications were compiled with double-word alignment. Correct interfacing with these subsystems required double-word alignment for corresponding Project Aluminium modules. Double-word alignment later became the cause of some very mysterious run-time behaviour when linking with non-double-word-aligned third-party libraries. This is an issue to be wary of, because the Linux dynamic linker does not provide informative warnings.

The original simulation hosts (MIPS) were big endian and the new hosts (IA32) were little endian. Only one user module was discovered as having implied endian assumptions.
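Cross-platform data exchange of this kind is usually protected by converting multi-byte fields to an agreed byte order at the interface. The fragment below is a generic illustration (the message layout is invented, not an AOSC structure): integer fields pass through htonl()/ntohl(), so the same bytes decode correctly on big-endian MIPS and little-endian IA32 hosts.

    /* Convert a small message to and from network (big-endian) byte order
     * so big-endian and little-endian hosts interpret it identically.
     * The message layout is an invented example. Build: gcc -o endian endian.c */
    #include <arpa/inet.h>   /* htonl, ntohl */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        uint32_t frame;        /* simulation frame number */
        uint32_t altitude_cm;  /* altitude, centimetres   */
    } msg_t;

    /* Host -> wire: write fields in network byte order into a buffer. */
    static void msg_pack(const msg_t *m, unsigned char buf[8])
    {
        uint32_t f = htonl(m->frame);
        uint32_t a = htonl(m->altitude_cm);
        memcpy(buf,     &f, 4);
        memcpy(buf + 4, &a, 4);
    }

    /* Wire -> host: read the same fields back regardless of host endianness. */
    static void msg_unpack(msg_t *m, const unsigned char buf[8])
    {
        uint32_t f, a;
        memcpy(&f, buf,     4);
        memcpy(&a, buf + 4, 4);
        m->frame = ntohl(f);
        m->altitude_cm = ntohl(a);
    }

    int main(void)
    {
        msg_t out = { 42, 152400 }, in;
        unsigned char wire[8];

        msg_pack(&out, wire);
        msg_unpack(&in, wire);
        printf("frame %u, altitude %u cm\n", in.frame, in.altitude_cm);
        return 0;
    }

Packing fields explicitly into a byte buffer also sidesteps the structure-alignment mismatches described above, since the wire format is byte-addressed rather than laid out by the compiler.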

4. PROJECT ACHIEVEMENTS

At the conclusion of the project, all basic simulation infrastructure and support tools were successfully migrated to the new platform. Minor system dependencies, such as reliance on system-specific convenience libraries, were identified and eliminated. A useful subset of AOSC user modules was ported to the new environment. In many cases this required little more than simple recompilation. The resultant combination of components succeeded in decoupling the original architecture from specific vendor hardware.

The new system has since demonstrated its functionality through a number of exercises, including two human-in-the-loop simulation experiments that incorporated typical user module configurations with full data collection. Additionally, the new system can be used to support the integration of new user modules before migrating them back to the original simulation environment.

5. LESSONS LEARNED

The first lesson to be noted is that the original design goal of reusability, implemented through decoupled modularisation of subsystems, aided the staged migration of functionality across platforms. Once the basic infrastructure was ported, the project team was relatively unrestricted in selecting user modules for migration to the new platform.

The next lesson concerns system-specific capabilities. The AOSC code-base uses abstraction libraries for commonly used system functions, such as serial port interfaces, network communications and message logging. The project team found that the abstraction layer greatly simplified migration to the new platform. Challenging porting issues were discovered where appropriate abstractions had not been developed. The porting process enabled some beneficial code refactoring; additionally, we found that porting revealed some hidden assumptions about host system behaviour.

Some components required no porting effort at all. These were generally tools based on scripting languages such as Perl and Tcl/Tk. FORTRAN-based components required the most porting effort.

A benefit of using an open source operating system to host the simulation was the ability to utilise kernel features before they were fully integrated. This enabled a working prototype to be fielded much earlier than would otherwise have been the case. However, moving away from standard software turned out to be a double-edged sword — the in-house ‘improvements’ needed to be re-applied for every subsequent system upgrade until they became part of the standard GNU/Linux release.

Cluster maintenance, for both software and hardware, proved to be an on-going challenge. The following practical lessons were learned in assembling cluster hardware:

• Specify all systems identically; otherwise the cluster will be harder to maintain
• Specify motherboard features such as ECC memory, boot information on the serial console, PXE network boot support and an onboard NIC
• Rack-mount systems were worth the extra cost
• Choose cases carefully; they need to be easy to work with and fit standard rack-mount slides
• Consider a keyboard-video-mouse (KVM) system
• Maintain a spare parts inventory

The project team began this development without having a reliable estimate of expected system performance. The performance of the prototype system provided encouraging endorsement of the concept of using a commodity cluster for real-time simulation computing.

6. REFERENCES

1. Feik, R. A. and Mason, W. M. (1993) A New Simulation Facility for Research in Military Air Operations, Unpublished DSTO report.
2. Anderson, K. W. and Feik, R. A. (1992) Concept and Functional Description of the Air Operations Simulation Centre, in possession of the Air Operations Division, DSTO.
3. Gamma, E., Helm, R., Johnson, R. and Vlissides, J. (1995) Design Patterns: Elements of Reusable Software, Addison-Wesley Publishing Co. Inc., Massachusetts, USA.
4. Robbie, A. (2001) Design of an Architecture for Reconfigurable Real-Time Simulation, in Proc. SimTecT 2001, p. 251, Canberra.
5. Institute of Electrical and Electronic Engineers (2000) IEEE Standard for Modeling and Simulation (M&S) High Level Architecture (HLA) – Framework and Rules, IEEE Std 1516-2000.
6. The Message Passing Interface (MPI) standard (2003), http://www-unix.mcs.anl.gov/mpi/
7. PVM (Parallel Virtual Machine) home page (2002), http://www.csm.ornl.gov/pvm/pvm_home.html
8. MOSIX home page (2003), http://www.mosix.org/
9. Single System Image Clusters (SSI) for Linux (2003), http://openssi.org/
10. Morton, A., Linux Scheduling Latency, http://www.zip.com.au/~akpm/linux/schedlat.html
11. Love, R. M. (2002) Pre-emptable Kernel Patch, ftp://ftp.kernel.org/pub/linux/kernel/people/rml/cpu-affinity
12. Molnar, I. (2002) O(1) Scheduler Patch, http://people.redhat.com/mingo/
13. Real Time Application Interface for Linux (2003), http://www.aero.polimi.it/~rtai/
14. rsync web pages (2003), http://samba.anu.edu.au/rsync
15. GNU GRUB Manual (2002), http://www.gnu.org/manual/grub/
16. Brosky, S. and Rotolo, S. (2002) Shielded Processors: Guaranteeing Sub-millisecond Response in Standard Linux, Fourth Real-Time Linux Workshop, Boston, USA.
17. Silicon Graphics Inc (1991) Irix with REACT, Technical report, Mountain View, CA.
18. Morton, A., CPUS Allowed patch, http://www.zip.com.au/~akpm/linux/cpus_allowed.patch
19. McKusick, M. K. et al. (1996) The Design and Implementation of the 4.4 BSD Operating System, Addison-Wesley.
20. The Linux Trace Toolkit, http://www.opersys.com/LTT/
21. Butler, L., Atkinson, T. and Miller, E. (2000) Comparing CPU Performance Between and Within Processor Families, 25th International Conference on Computer Measurement and Performance.
22. Butenhof, D. R. (1997) Programming with POSIX Threads, Addison-Wesley.
23. Sterling, T. et al. (1995) Beowulf: A Parallel Workstation for Scientific Computation, Proc. 24th ICPP, Oconomowoc.
24. Williams, C. (2002) Linux Scheduler Latency, Technical report, Red Hat, Inc.