Journal of Systems Architecture 52 (2006) 283–297 www.elsevier.com/locate/sysarc

The design and utility of the ML-RSIM system simulator

Lambert Schaelicke a,*, Mike Parker b

a Intel Corporation, 3400 Harmony Road, Fort Collins, CO 80528, USA b Cray, Inc., PO Box 6000, Chippewa Falls, WI 54729, USA

Received 24 July 2004; accepted 21 July 2005 Available online 25 October 2005

Abstract

Execution-driven simulation has become the primary method for evaluating architectural techniques as it facilitates rapid design space exploration without the cost of building prototype hardware. To date, most simulation systems have either focused on the cycle-accurate modeling of user-level code while ignoring operating system and I/O effects, or have modeled complete systems while abstracting away many cycle-accurate timing details. The ML-RSIM simulation system presented here combines detailed hardware models with the ability to simulate user-level as well as operating system activ- ity, making it particularly suitable for exploring the interaction of applications with the operating system and I/O activity. This paper provides an overview of the design of the simulation infrastructure and discusses its strengths and weaknesses in terms of accuracy, flexibility, and performance. A validation study using LMBench microbenchmarks shows a good cor- relation for most of the architectural characteristics, while operating system effects show a larger variability. By quantifying the accuracy of the simulation tool in various areas, the validation effort not only helps gauge the validity of simulation results but also allows users to assess the suitability of the tool for a particular purpose. Ó 2005 Elsevier B.V. All rights reserved.

Keywords: Simulator validation; Execution-driven simulator; Simulator accuracy

1. Introduction not provided by real hardware. However, the design and implementation of a suitable and robust Simulation has become one of the predominant simulator is in itself a major undertaking that tools for computer architecture research. It allows requires a combination of architectural expertise researchers to rapidly explore novel techniques and software development skills. For this reason, and architectures without incurring the engineering a number of readily available and proven simula- effort and expense of implementing prototype hard- tion systems such as SimpleScalar [2], SimOS ware. In addition, it affords a flexibility normally [16], SimICS [12] and RSIM [15] have found wide- spread use. Understanding the performance, accu- racy, and limitations of simulators is critical when * Corresponding author. selecting a tool to answer a particular research E-mail addresses: [email protected] (L. Schaelicke), [email protected] (M. Parker). question. URLs: http://www.cs.utah.edu/~lambert (L. Schaelicke), This paper describes the design and validation of http://www.cs.utah.edu/~map (M. Parker). ML-RSIM, an execution-driven system simulator

1383-7621/$ - see front matter Ó 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.sysarc.2005.07.001 284 L. Schaelicke, M. Parker / Journal of Systems Architecture 52 (2006) 283–297 that combines detailed hardware models of modern The following section briefly describes the design and server-class machines with a Unix- of ML-RSIM, focusing on the I/O subsystem and like operating system [18]. The ML-RSIM environ- operating system infrastructure. Section 3 presents ment, based on RSIM [15], extends the system the validation methodology used to gauge the accu- features normally represented by simulators to racy of ML-RSIM. Section 4 presents the validation include I/O devices and an operating system with- results and discusses how important design deci- out sacrificing fidelity of the processor model and sions affect the simulatorÕs utility and accuracy. Sec- with only a small impact on simulation speed. A tion 5 contrasts this work with other simulation detailed characterization of the accuracy of the infrastructures and validation efforts. Finally, Sec- hardware models as well as the simulator operating tion 6 summarizes and outlines ongoing and future system highlights key strengths and weaknesses of work. ML-RSIM and helps determine the toolÕs suitability for specific research tasks. 2. ML-RSIM design Simulator design is a trade-off between develop- ment effort, simulation speed, modeling scope, and 2.1. Machine model accuracy. As such, most simulators are targeted towards specific application domains and research The ML-RSIM processor model is based on agendas. For instance, architectural simulators like RSIM v1.0 [15] with extensions to simulate operat- SimpleScalar [2] and RSIM [15] focus on the ing system code and memory-mapped access to I/O cycle-accurate modeling of user-level instructions devices. 
Other hardware models include a two-level by providing highly detailed models of the processor cache hierarchy, a coherent memory bus, a memory and memory hierarchy. These tools are invaluable controller, and several I/O devices. Combined, these for experiments involving compute-intensive appli- components model a contemporary workstation or cations, but they ignore many system-level effects server-class machine, as shown in Fig. 1. such as virtual memory, system calls, interrupt han- The ML-RSIM CPU model implements the dling, and other I/O related effects. System-level SPARC V8 instruction set [21] on a MIPS R12000- simulators like SimOS [16] and SimICS [12],on like dynamically scheduled superscalar processor the other hand, provide complete system models core [14]. It also includes an instruction cache model, and thus account for application, operating system, TLB models, exception handling capabilities, and and I/O effects. However, to achieve acceptable sim- support for privileged instructions and registers. ulation speed, hardware models are usually simpli- Both user-level and privileged instructions are simu- fied and not cycle-accurate. Such system-level lated, so simulation sessions account for both appli- simulators are able to model the interaction of hard- cation and kernel effects. ware, system software, and applications. However, The cache hierarchy is modeled as a conventional due to the abstract hardware models, only coarse- two-level structure with support for snooping cache grain effects can be observed. coherency. An uncached buffer handles load and In contrast, ML-RSIM combines detailed hard- store instructions to I/O addresses, and supports ware models of a microprocessor, memory hierar- store combining to reduce system bus utilization. chy, and I/O subsystem with a simulated operating The system bus implements snooping cache coher- system. As such, it supports detailed explorations ency through a MESI protocol. 
The memory con- of the interaction of applications, operating sys- troller supports SDRAM or RAMBUS memory in tems, and I/O devices. Hardware models include a a variety of configurations and accurately models dynamically scheduled processor with a two-level queueing delays, bank contention, and DRAM pre- cache hierarchy, coherent system bus, and a multi- charge and refresh overhead. bank memory controller. These components are The I/O subsystem is based on the PCI bus and augmented with an I/O subsystem consisting of a consists of a PCI bridge with autoconfiguration sup- PCI bridge, SCSI adapter, SCSI bus, and hard disk. port, a real-time clock, and a number of SCSI host The fully-simulated ‘‘Lamix’’ operating system is adapters with hard disks. The real-time clock is based on NetBSD source code and includes a sys- modeled after a MOSTEK 48T02 clock chip and tem call layer, multiple file systems with buffer implements common date and time functionality cache, device drivers, and multitasking capabilities as well as two independent periodic interrupt [18]. sources. The SCSI adapter is a variation of an L. Schaelicke, M. Parker / Journal of Systems Architecture 52 (2006) 283–297 285

CPU

L1 I-Cache L1 D-Cache

L2 Cache

System Bus

Disk Memory Controller PCI Config

PCI Bridge DRAM . . . . . DRAM Realtime SCSI Clock Adapter SCSI Bus

Fig. 1. ML-RSIM system model.

Adaptec AIC 7770-based SCSI host adapter and ances the need to accurately model a realistic system supports multiple outstanding requests, disconnect with the cost of implementing the minute and often and reconnect transactions, and request queueing obscure details of a particular commercial system. [1]. Each adapter controls one SCSI bus that accounts for arbitration delays, idle times, and data 2.2. Lamix kernel transfer. Each SCSI bus connects an adapter to a configurable number of SCSI disks. These disks The Lamix kernel is a Unix-compatible multi- model seek and transfer times as well as an on- tasking operating system specifically designed to board DRAM buffer. Seek times are calculated run on the ML-RSIM simulator. It consists of an using one of four algorithms: (a) a fitted curve based independently developed multitasking core and sig- on single-track, average, and full-stroke seek times, nificant portions of NetBSD source code that imple- (b) a three-point linear curve based on the same ment file system and device driver functionality. The parameters, (c) a constant seek time, or (d) instanta- system ABI is Solaris-compatible; thus applications neous seeks with zero latency [8,11]. The sector- compiled for Lamix can, in most cases, execute on based disk cache is used for data staging, prefetch- native Solaris hosts without modifications. Lamix ing and optional write buffering. The disk model only requires that applications are statically linked stores non-empty disk sectors in files on the simula- and do not use 64-bit instructions. This flexibility tion host, thus maintaining simulated disk contents allows users to test and debug applications on a across successive simulation sessions. native host at significantly greater speed before sim- The ML-RSIM hardware models intentionally ulating them. Since no special libraries or header do not represent any particular system. 
Instead, var- files are required, applications intended for simula- ious features are modeled after a number of existing tion can be compiled from nearly any programming prototypes, with the goal of producing a simulator language with little restriction on the compiler used. representative of a large class of real machines. However, due to the restriction that applications The system bus has similarities with SGI systems, must be statically linked, arbitrary Solaris binaries while the PCI addressing scheme follows the design may not run on the simulator. of a PowerPC-based workstation. Closely approxi- The Lamix kernel supports Unix-style process mating the design of a particular system would have management, signal handling, basic virtual memory facilitated a relatively straightforward port of an management, file I/O both to simulated disks and to open-source operating system, but would also have simulation host files, and Unix and Internet domain imposed many undesirable idiosyncrasies of that sockets. All traps and interrupts are dispatched to system. The approach taken for ML-RSIM bal- fully simulated trap handlers. Process management 286 L. Schaelicke, M. Parker / Journal of Systems Architecture 52 (2006) 283–297 allows for the dynamic creation and termination of ported from NetBSD with little modification and processes, loading of executable files, and passing of accurately account for interrupt handling, synchro- signals between processes. Virtual memory manage- nization, and request queueing. In addition to these ment is fully functional, but its implementation is physical devices, Lamix includes the RAIDframe somewhat simplified compared to other Unix vari- software RAID driver [6] and a pseudo disk device ants. Page table management and address transla- for striping and device concatenation. tion are accurately modeled, including TLB miss handling via software traps. However, physical 2.3. 
Simulation speed memory management does not support paging to disk. Instead it uses a static allocation scheme, One of the costs of the level of detail provided by resulting in a fixed upper limit on the number of ML-RSIM is simulation speed. Generally, the concurrently active processes. In addition, mem- majority of simulation time is spent executing the ory-mapped files, and consequently dynamic link- processor and cache models, with the instruction ing, are currently not supported. These restrictions decode, register rename and dependency check are partly a result of the incremental design of the stages being a significant portion of that time. Lamix kernel. On the other hand, the static physical The instruction cache model, which is essential memory management scheme helps make results for ML-RSIMÕs support of operating system code, more easily reproducible, since small changes to is a considerable source of simulation overhead. the order of memory allocation requests from appli- Furthermore, changes to the I-cache model param- cations can otherwise have a significant impact on eters have a significant impact on simulation over- overall system behavior due to varying level-2 cache head. To improve simulation speed, instructions conflicts. are decoded into an expanded format while being The vnode-based file system layer demultiplexes fetched into the simulated I-cache. Consequently, file I/O system calls to one of three file systems. as the modeled I-cache becomes more effective, sim- Both the standard BSD fast file system (FFS) and ulation speed increases as well. Traditionally, such a log-structured file system (LFS) are fully simu- pre-decoding occurs only once for the entire pro- lated, including buffer cache and an LFS cleaner gram at the beginning of a simulation run. How- daemon [10]. 
A newly-designed HostFS file system ever, ML-RSIM requires fully dynamic instruction imports the entire file structure of the simulation decoding to support operating system functionality host into the Lamix environment, thus giving appli- such as the loading of new binaries. cations access to files in the userÕs home directory or Other hardware components such as the I/O other locations. While the simulated file systems devices, the system bus model, and the memory con- accurately account for disk access delays, host files troller are not significant contributors to the simula- are accessed instantaneously. The main purposes tion cost, since the event-driven implementation of the HostFS are to provide a mechanism to move invokes these models only when they are active. files between the host and the simulated file system, Simulating a detailed operating system does not and to provide file I/O for applications where accu- significantly impact simulation speed beyond the rately modeling the I/O overhead is not required, small additional cost of support for exception han- such as accessing configuration files. dling and privileged state. Modeling complete sys- The socket layer supports Unix and INET tem calls and interrupt handlers requires more domain sockets. Unix sockets are fully simulated simulation time than a simulator that performs using the BSD mbuf structure, while INET socket instantaneous pseudo-system calls. However, one operations are performed instantaneously. Conse- of the main goals of ML-RSIM is to provide insight quently, INET sockets are most useful for applica- into the interactions between user and system code tions that do not require accurate modeling of and to allow detailed measurements of operating networking overheads and perform only a small system costs. 
Consequently, simulating kernel code amount of communication, such as a database ser- is not considered overhead but an important com- ver that reads relatively short request messages ponent of the simulation process. The kernel com- before performing sizable amounts of disk I/O. pletes initialization and mounts a simulated disk in Underlying the file systems are several device under 50 million simulated cycles, resulting in a run- drivers, including SCSI disk and host adapter driv- time overhead of approximately 1 min on a ers and a PCI bus driver. These device drivers are 900 MHz Sun Blade-1000. Thus, the overhead of L. Schaelicke, M. Parker / Journal of Systems Architecture 52 (2006) 283–297 287 booting Lamix for each simulation run is substan- memory subsystem, and operating system gives fur- tial and does not warrant further optimization. ther insight into the simulatorÕs strengths and weak- nesses in terms of its accuracy and its ability to 3. Validation methodology model a wide range of systems.

In our view, simulator validation is the process of 3.2. Experimental setup quantifying the extent that the simulation tool faith- fully models not only the functional behavior but The validation study presented here is a continu- also the performance of real systems. Even though ation of work previously reported [17]. It uses the most architectural simulators are used to evaluate same benchmark system and simulator configura- new techniques on hypothetical future computer tion, augmented with a second set of experiments systems, a simulator should be able to approximate for a significantly different reference system. the characteristics of existing systems. Such valida- The first reference system is an SGI Octane work- tion helps ensure that conclusions and design station with one 175 MHz processor. trade-offs made with a given simulator are not Although this system no longer represents current overly affected by modeling errors. Equally impor- high-performance workstation-class machines, it tant, validation uncovers the simulatorÕs weaknesses provides a useful comparison point because its and inaccuracies and lets users judge the suitability architectural characteristics are well documented, of the tool for a particular purpose. and because the R10000 processorÕs microarchitec- ture closely matches that modeled by ML-RSIM. 3.1. Methodology Moreover, many architectural aspects of this class of machine have not changed significantly. Com- This work uses LMBench [20] to compare vari- bined with the more modern Sun Blade-1000 refer- ous characteristics of a reference system with the ence system, this system choice can demonstrate a corresponding simulated system. LMBench consists good correspondence between the simulator and a of a set of benchmarks that measure microarchitec- broad class of real architectures. 
Table 1 summa- tural performance and basic operating system rizes the key parameters of the reference system behavior without relying on hardware performance and the matching simulator configuration. Other counters. While such counters are available in parameters such as the number and latency of vari- nearly all modern systems, their functionality varies ous functional units are set to match published data considerably and often does not provide enough on the MIPS processor. detail to adequately compare against a simulated The second reference system is a 900 MHz Sun system. Portable and well-known microbenchmarks Blade-1000 workstation. This machine is based on are able to report the same performance metrics for the superscalar in-order UltraSPARC-III processor. all systems compared here, thus helping to discover Since its instruction set matches that of the simula- the correlation between specific details of the bench- tion system, and the Lamix system call interface is mark system and the simulator, such as cache and compatible with Solaris, identical benchmark execu- memory behavior. Such validation establishes confi- tables can be used in this validation. Table 2 sum- dence that the simulator accurately models the per- marizes the key parameters of both systems. formance of existing systems, and that trends and Again, other microarchitectural parameters are set changes observed on the simulation tool represent to match published values where appropriate. real effects and not modeling artifacts. Overall, these two reference systems cover a wide To this end, two widely different simulator con- range of workstation-class systems, thus providing figurations are compared with two representative more meaningful insights through the validation reference systems: a 175 MHz SGI Octane and a process. While the microarchitecture of the MIPS 900 MHz Sun Blade-1000. 
By quantifying and ana- processor closely matches that modeled by the sim- lyzing the simulators accuracy against such diverse ulator, the instruction set, operating system and systems, the confidence in the validation process compiler differ, posing some challenges for the vali- itself is increased, and users of the tool are better dation. On the other hand, the Sun reference system able to judge the simulators suitability for a specific is based on the same instruction set, but the proces- purpose. Comparing two diverse systems that differ sor microarchitecture differs from the simulator in instruction set, processor microarchitecture, processor model. To approximate the behavior of 288 L. Schaelicke, M. Parker / Journal of Systems Architecture 52 (2006) 283–297

Table 1 Architectural parameters of the SGI Octane reference and the corresponding simulator configuration Parameter SGI Octane ML-RSIM Processor 175 MHz MIPS R10000, MIPS ISA, 175 MHz SPARC ISA, 4-way 4-way superscalar, dynamically superscalar, dynamically scheduled scheduled L-1 D-Cache 32 kbyte, 2-way set-assoc., 32-byte 32 kbyte, 2-way set-assoc., 32-byte lines, 1 cycle latency lines, 1 cycle latency L-1 I-Cache 32 kbyte, 2-way set-assoc., 64-byte 32 kbyte, 2-way set-assoc., 64-byte lines, 1 cycle latency lines, 1 cycle latency L-2 Cache 1 Mbyte, 2-way set-assoc., 128-byte 1 Mbyte, 2-way set-assoc., 128-byte lines, 10 cycles latency lines, 10 cycle latency System bus 87.5 MHz, 64-bit multiplexed 87.5 MHz, 64-bit multiplexed Memory controller SGI proprietary SDRAM, 4 Banks Hard disk IBM UltraStar 9, 9.1 Gbyte 9.1 Gbyte, 7200 rpm, 10 heads, 8420 7200 rpm, 10 heads, 8420 cylinders, cylinders, 1 Mbyte cache with 16 1 Mbyte cache with 16 sectors sectors OS IRIX 6.5, 64 bit System V, Lamix, 32 bit SysV compatible, BSD- multithreaded based Compiler MIPS Pro 6.2 Sun Workshop 6.0

Table 2 Architectural parameters of the Sun Blade-1000 reference and the corresponding simulator configuration Parameter Sun Blade-1000 ML-RSIM Processor 900 MHz UltraSPARC-III 900 MHz SPARC V8plus ISA 4-way SPARC V9 ISA 4-way superscalar dynamically scheduled superscalar in-order L-1 D-Cache 64 kbyte, 4-way set-assoc., 64 kbyte, 4-way set-assoc., 32-byte 32-byte lines, 1 cycle latency lines, 1 cycle latency L-1 I-Cache 32 kbyte, 4-way set-assoc., 32 kbyte, 4-way set-assoc., 32-byte 32-byte lines, 1 cycle latency lines, 1 cycle latency L-2 Cache 8 Mbyte, 2-way set-assoc., 8 Mbyte, 2-way set-assoc., 64-byte 64-byte lines, 14 cycles lines, 14 cycles latency latency System bus 150 MHz, 256-bit data path 150 MHz, 256-bit multiplexed address/data Memory controller SDRAM, integrated onto SDRAM, 2 Banks CPU die Hard disk Seagate Cheetah, 36 Gbyte, 36 Gbyte, 10,000 rpm, 4 heads, 10,000 rpm, 4 heads, 24,620 24,620 cylinders, 4 Mbyte cache with cylinders, 4 Mbyte cache 16 sectors with 16 sectors OS Solaris 8, 64 bit System V, Lamix 32 bit SysV compatible, BSD- multithreaded based Compiler Sun Workshop 6.0 Sun Workshop 6.0 an in-order execution core, the issue window size of ers. Depending on the array size, the observed the simulated processor is set to 12 entries, with at latency corresponds to different levels of cache or most two outstanding memory references. main memory. Fig. 2 shows the resulting graphs for both reference platforms and the simulator mod- 4. Validation results els for a 32-byte stride, as well as the squares of the correlation coefficient for all strides. 4.1. Memory System Performance Generally, the boundaries between the three plateaus correctly identify the level-1 and level-2 Memory System Performance LMBench mea- cache sizes. Furthermore, the load latencies mea- sures memory latency by traversing arrays of point- sured for each memory hierarchy level closely match L. Schaelicke, M. Parker / Journal of Systems Architecture 52 (2006) 283–297 289

Fig. 2. Memory load latency for 32-byte stride and r2 values for all strides: (a) SGI Octane and (b) Sun Blade. the reference and the simulated systems. The most flicting and non-conflicting source and destination significant discrepancy between the reference and arrays. Fig. 3 shows the bandwidth graphs for the the simulated systems is generally observed at the C-library bcopy routine, as well as the r2 values transition points between plateaus and is primarily for all bandwidth measurements. the result of differing interferences of the test array On the reference platforms, the copy bandwidth with instructions and TLB entries in the level-2 shows three distinct regions that correspond to the cache. The r2 values shown in the tables to the right cache and memory hierarchy levels. For the SGI are also known as the coefficient of determination Octane configuration, ML-RSIM fails to reproduce and are computed as the square of the Pearson cor- the steep bandwidth drop at the L-1 cache bound- relation coefficient. Generally, r2 values range from ary, but subsequently follows the reference platform zero to one and indicate the strength of the correla- more closely. For the Sun Blade configuration, the tion between the two systems. A value of one repre- simulated system tracks bandwidth more closely sents a perfect correlation, and a value of zero for small array sizes, but starts observing lower represents no correlation. As the results demon- bandwidth at array sizes below the level-1 cache strate, both ML-RSIM configurations are able to size. Overall, the correlation between the reference closely track the memory latency of the reference system and the simulator is weaker than it is for system for all experiments. the latency experiments, especially for tests that In addition to memory latency, LMBench mea- include a significant fraction of writes. 
Part of this sures memory bandwidth by accessing an array of discrepancy is the result of differences in the cache varying size in a variety of patterns. Similar to the controller implementation that affect when and latency measurements, the resulting curves represent how multiple outstanding requests are coalesced bandwidth at different levels of the memory hierar- and how writebacks are buffered. It is also worth chy. Measurements are taken using the bcopy and noting that the implementation of bcopy used on bzero routines provided in the native C-library as the SGI Octane differs significantly from that used well as using unrolled copy, read, and write loops. on the Sun Blade and on the simulator. This differ- In addition, copy bandwidth is measured for con- ence is largely due to the different instruction sets 290 L. Schaelicke, M. Parker / Journal of Systems Architecture 52 (2006) 283–297

Fig. 3. Memory bandwidth of libc bcopy and r2 values for all bandwidth experiments: (a) SGI Octane and (b) Sun Blade. that result in different optimization points for each hence includes rotational delay. Depending on the system. initial rotational position of the disk, seek latency The reference UltraSPARC system uses a varies considerably between seeks of similar dis- 64 kbyte write-through cache accompanied by a tance. The resulting graphs are scatter plots with a 2 kbyte write cache. Earlier experiments suggested large variation between adjacent data points. To that this configuration is best approximated by a improve readability, Fig. 4 shows seek latency for 64 kbyte write-back cache rather than by the the two systems summarized by polynomial trend write-through cache model provided by ML-RSIM. lines. However, significant differences remain, resulting in In both cases, ML-RSIM is able to approximate the bandwidth difference between the two systems the seek latencies of the reference disk through a fit- for array sizes close to the level-1 cache size. In ted curve based on published single cylinder, aver- many cases such discrepancy at the boundary points age, and full-stroke seek times. Generally, the is not critical. However, care must be taken that simulated disk model overestimates seek latency applications do not continually operate at these for small distances, perhaps because there is insuffi- boundary points, or else small perturbations will cient detail in modeling SCSI bus and controller result in large but meaningless differences in mea- overhead. Note that in both cases, the disk model sured application performance. This same effect is parameterized from published seek latencies and can be observed on real systems as well as other it not tuned to match the reference disk. simulators. The disk bandwidth experiments reveal a short- coming of the simulated disk models. The reference 4.2. 
Disk performance disks are divided into multiple zones of varying sec- tor densities to maximize capacity. Consequently, Complementing the memory system measure- read bandwidth varies from nearly 15 Mbyte/s on ments, LMBench characterizes disk seek latency the outside of the platter to 7.5 Mbyte/s for the and bandwidth. Seek latency is measured by reading innermost cylinders for the IBM UltraStar drive from specific sectors on the raw disk device and found in the SGI Octane, and varies between 48 L. Schaelicke, M. Parker / Journal of Systems Architecture 52 (2006) 283–297 291

20 16

16 Sun SGI 12 12 8 8 ML-RSIM 4 ML- 4 seek (ms) latency seek (ms) latency

0 0 0 3000 6000 9000 0 12500 25000 37500 (a) distance (cylinders) (b) distance (cylinders)

Fig. 4. Hard disk seek latency: (a) SGI Octane and (b) Sun Blade. and 28 Mbyte/s for the Seagate Cheetah drive in the 4.3. System call latency Sun Blade. ML-RSIM currently does not model multiple zones, but rather assumes a fixed sector In addition to microarchitectural characteristics, density for all cylinders. As a result, disk bandwidth LMBench measures the latencies of various system is constant at 10 and 35 Mbyte/s respectively, inde- calls. Results for both systems are shown in Table 3. pendent of the cylinder. These bandwidths closely Relative error is calculated as latency difference match the average bandwidth of both reference divided by the latency of the reference system. disks. Unfortunately, detailed zoning information Overall, system call latencies do not match the is often not available for commercial disks, though reference and simulated systems as well as do the an approximation could be formulated through architectural characteristics. In most other cases, experimentation. Such an approximation would ML-RSIM/Lamix underestimates the cost of system lead to a more accurate disk model, in regard to calls compared to the reference kernel. This error bandwidth. However, it would place additional bur- leads to conservative results for studies that attempt den on the user to correctly parameterize the model. to optimize operating system performance, since Furthermore, adding zoning details to the disk such evaluation uses a more efficient baseline sys- model makes experimentation results sensitive to tem. The reason for the large discrepancies is found the placement of files on the simulated disks. Files in the different structures of the two kernels. 
on an otherwise empty simulated disk will be placed on the outside, whereas in real systems with more heavily utilized disks, the test files may be placed elsewhere, resulting in the simulator overestimating effective bandwidth.

While Lamix is a traditional non-preemptive 32-bit uniprocessor kernel, both Irix and Solaris are internally multithreaded, support application-level threads and 64-bit code, and are designed to scale to large numbers of processors. The resulting synchronization overhead and the cost of supporting thread-specific signals increase system call latencies on the reference systems.

Table 3
System call latencies

Parameter       | Octane (μs) | ML-RSIM (μs) | Error (%) | Blade (μs) | ML-RSIM (μs) | Error (%)
getpid          | 2.61        | 2.57         | 1.53      | 0.75       | 0.67         | 11.33
read /dev/null  | 10.38       | 6.88         | 33.71     | 5.13       | 1.55         | 69.84
read file       | 10.38       | 7.76         | 70.98     | 4.40       | 2.18         | 50.42
write           | 12.24       | 6.21         | 49.27     | 1.53       | 1.54         | −1.12
stat            | 49.45       | 51.12        | −3.38     | 5.60       | 15.34        | −173.94
fstat           | 8.00        | 4.08         | 49.00     | 2.22       | 1.09         | 50.76
open/close      | 73.76       | 59.61        | 19.18     | 7.45       | 17.06        | −128.89
select, 10 FDs  | 11.78       | 8.69         | 26.23     | 8.97       | 2.06         | 77.07
select, 60 FDs  | 43.16       | 37.80        | 12.42     | 44.42      | 8.29         | 81.33
signal install  | 8.16        | 2.78         | 65.88     | 1.42       | 0.55         | 61.50
signal handler  | 31.95       | 8.25         | 74.18     | 16.52      | 2.11         | 87.24
pipe            | 64.94       | 52.98        | 18.42     | 16.37      | 20.69        | −26.43
socket          | 74.24       | 55.68        | 25.00     | 26.82      | 20.39        | 23.96
fork/exit       | 1592.75     | 12785.00     | −702.70   | 416.14     | 7566.00      | −1718.13
fork/exec       | 4894.00     | 13484.00     | −106.53   | 577.80     | 7898.00      | −1266.91

However, in both cases the latency of the simple getpid() system call closely matches that of the reference platform. This system call performs very little actual work, as it retrieves a single value from the process control block; hence this experiment essentially measures the cost of entering and leaving the kernel. In many Unix kernels the relevant code sequence is nearly identical for system calls, external interrupts, and traps. Consequently, the good match between the systems confirms that Lamix implements a realistic model for saving and restoring process state.

Process creation is several times more expensive on Lamix than it is on the reference platforms, which indicates a deficiency in the simulator kernel. Lamix's simple memory management scheme does not support copy-on-write, hence the fork() system call copies the entire address space of the parent process to the child. Given the small code and data size of the microbenchmarks, the stack is a major contributor to the memory footprint of the process. Simulated measurements with a smaller stack lower process creation costs to within a factor of two of the reference systems, confirming that the lack of copy-on-write functionality is the main source of this discrepancy.

This difference is a direct result of the original design goals of ML-RSIM, and shows how simulator design is a trade-off between accuracy and complexity. One of ML-RSIM's original design targets was to provide a platform on which to investigate and optimize the cost of I/O operations. As such, process creation is a necessary feature but does not need to be modeled accurately. Implementing a fully-featured virtual memory subsystem with demand-paging and copy-on-write functionality would have provided no additional benefit towards the original goals, but would have added considerably to the overall complexity of the simulation system.

Overall, the system call latency results bound the application domain that ML-RSIM/Lamix is able to simulate accurately. The cost of most system calls is underestimated under Lamix compared to the reference platforms. Consequently, studies concerned with system call performance find in ML-RSIM an aggressive baseline model and will yield conservative performance gains. However, care must be taken that the unrealistically expensive fork() system call does not overly influence results. Applications that frequently spawn new processes are not well served by the current implementation of this service. The Lamix kernel itself uses this system call at boot time to start the simulated applications, but the gathering of simulation statistics begins only after the application has been loaded and thus does not include the fork() system call.

4.4. Context switch cost

Context switch overhead is another area of large discrepancy between the two systems, but it also shows large variation between systems and between successive experiments. LMBench measures context switch overhead for different numbers of processes with differing working set sizes. In addition to measuring the cost of saving and restoring the process state, the benchmark also includes the cost of cache and TLB misses due to the context switch. Fig. 5 shows the context switch latency for the reference and simulated systems, for a 16 kbyte working set size. Note the different scale on the y-axis between the two graphs.

In the default configuration, ML-RSIM overestimates the context switch cost by a factor of two to six, especially for small numbers of processes or large working sets. Only for large numbers of

processes is ML-RSIM able to approach the switch latency of the reference system.

Fig. 5. Context switch latency: (a) SGI Octane and (b) Sun Blade.

The main source of this discrepancy is the variation in physical memory layout that affects the number of level-2 cache conflicts. By default, ML-RSIM's static memory management scheme aligns the physical address spaces of processes at powers of two, thus maximizing L2-cache conflicts between processes and resulting in worst-case context switch costs. The third bar in each group in Fig. 5 shows context switch costs for Lamix recompiled so as not to align physical address spaces at powers of two. Due to the reduction in L2-cache conflicts, Lamix exhibits context switch costs that approach those of the reference platform to within a factor of two.

In practice, the impact and significance of this difference will vary between experiments. In multiprogramming environments consisting of different processes, level-2 cache conflicts are less likely, since different processes exhibit different memory layouts and operate in different code and data regions. On the other hand, multiple instances of the same program will suffer from the worst-case level-2 cache conflicts and thus overestimate context switch costs. In such cases, changing the default alignment of physical address spaces in Lamix will mitigate this effect.

4.5. System call bandwidth

LMBench measures the bandwidth of several operating system services, including reading of regular files from the file cache and interprocess communication through Unix domain sockets and pipes. Since these operations do not involve any I/O transactions, the experiments essentially measure the cost of copying data between protected address spaces.

When reading from an open file, ML-RSIM overestimates the bandwidth of the SGI Octane reference platform by 20–40%, resulting in a coefficient of determination (r²) of 0.606. Since the microarchitectural bandwidth measurements show a stronger correlation, some of this discrepancy is attributed to different implementations of the copy routines in the two kernels. In addition, differences in physical memory layout and in the resulting L2-cache conflict misses can lead to significant discrepancies, especially for small requests. Indeed, the modeling error is largest for transfers of less than 16 kbytes. When opening and closing the file is included in the overall data transfer, ML-RSIM approximates the performance of the reference platform more closely, with a coefficient of determination of 0.925.

For the Sun Blade platform, ML-RSIM similarly overestimates effective bandwidth, resulting in coefficients of determination of 0.67 and 0.64. For this platform, the overall shape of the bandwidth curves matches better than that for the SGI Octane, but both curves are slightly offset, leading to a similar overall error. File system differences play an insignificant role, as the data transfer occurs only between the file cache and a user buffer.

On the other hand, ML-RSIM generally underestimates interprocess communication bandwidth by approximately 30%. Both Unix domain socket and pipe bandwidth are measured between multiple processes; consequently, results are very sensitive to the context switch cost and the physical memory layout. The same effect that increases context switch cost in Lamix degrades interprocess copy bandwidth: both the source and destination buffers in the two processes map to the same locations in the level-2 cache, increasing the number of conflict misses and reducing bandwidth. Tests on the reference platforms also show a large variation, by nearly a factor of two, between successive measurements due to varying alignments of physical memory in subsequent runs. A similar but smaller variation can be reproduced in the simulator by compiling Lamix with different parameters for the physical memory allocation scheme.

Interestingly, for the Sun Blade system, bandwidth differs by a factor of two between Unix sockets and pipes. Lamix, following the BSD design, treats pipes as a special case of Unix sockets and achieves nearly the same bandwidth in both cases, whereas Solaris achieves significantly higher bandwidth for pipes. In effect, the bandwidth observed on the simulated system is close to the average of the bandwidths measured on the reference system. Again, these observations confirm that different kernel organizations can lead to significantly different behavior, and they highlight the challenge of validating a system simulator.

4.6. File system performance

LMBench measures file system performance as the number of file creations and deletions performed per second. The different file system structures of the systems lead to a large discrepancy, as shown in Table 4.

Table 4
File system performance in creations/deletions per second

File size (kbyte) | SGI Octane (Irix 6.5) | ML-RSIM (Lamix) | Sun Blade (Solaris 8) | ML-RSIM (Lamix) | Sun Ultra 1/140 (Solaris 6) | Sun Ultra 60/450 (Solaris 7)
0                 | 528/350               | 63/120          | 5489/11,959           | 162/167         | 38/87                       | 56/119
1                 | 210/247               | 35/64           | 3780/11,882           | 54/167          | 37/35                       | 63/59
4                 | 207/229               | 33/60           | 4012/11,409           | 56/167          | 28/37                       | 52/63
10                | 218/227               | 31/56           | 718/9864              | 49/89           | 27/38                       | 48/68
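As an illustration of what this metric measures, the metadata workload behind Table 4 can be approximated with a short script. This is a hedged sketch of the benchmark's structure, not LMBench's actual implementation; the function name and its parameters are ours.

```python
import os
import tempfile
import time

def create_delete_rate(nfiles=1000, size=0):
    """Sketch of an LMBench-style metadata benchmark (not LMBench's
    code): create `nfiles` files of `size` bytes in a scratch
    directory, then delete them, and report the observed rates in
    creations and deletions per second."""
    payload = b"x" * size
    with tempfile.TemporaryDirectory() as scratch:
        paths = [os.path.join(scratch, "f%d" % i) for i in range(nfiles)]
        t0 = time.perf_counter()
        for p in paths:
            # Each iteration creates one file and writes its payload.
            with open(p, "wb") as f:
                f.write(payload)
        t1 = time.perf_counter()
        for p in paths:
            os.remove(p)
        t2 = time.perf_counter()
    return nfiles / (t1 - t0), nfiles / (t2 - t1)

creations, deletions = create_delete_rate(nfiles=200, size=1024)
print("%.0f creations/s, %.0f deletions/s" % (creations, deletions))
```

The same loop can yield rates that differ by orders of magnitude depending on whether metadata updates are journaled, soft-updated, or written synchronously in the BSD style, which is the spread visible in Table 4.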

Irix's XFS is a modern journaling file system that achieves significantly higher performance for these metadata operations than does the BSD fast file system. Similarly, the Solaris 8 implementation of UFS achieves extremely high rates of metadata updates, indicating that the original BSD consistency requirement has been relaxed. For a more realistic comparison with other FFS-based commercial file systems, Table 4 also includes results for two SPARC platforms running previous versions of Solaris. These systems exhibit performance similar to that of the simulator and confirm that while Lamix's FFS cannot be compared with XFS or the latest Solaris UFS, it is comparable to previous-generation commercial FFS implementations. A journaling file system as well as the recent soft-updates optimization of UFS could be ported to Lamix if desired. Note, however, that this performance discrepancy holds only for metadata updates such as creating and deleting files. Results from previous sections demonstrate that file read and write bandwidth matches that of the reference systems more closely.

4.7. Validation conclusions

Simulator validation using microbenchmarks provides important insights into the accuracy of various components. Overall modeling error determines the degree of confidence in simulation results. Equally important, the validation results define the application domain that the simulation tool can model with high accuracy. ML-RSIM/Lamix shows good correlation with the reference systems in most architectural aspects. These results are encouraging, since they confirm that compute-intensive applications without a major I/O component can be modeled with high accuracy. The fact that such correlation can be achieved for two widely different reference systems demonstrates that the main design goal of ML-RSIM, to be representative of an entire class of systems, has been achieved.

ML-RSIM/Lamix underestimates the cost of most system calls. This leads to results that are conservative for studies concerned with the performance impact of operating system activity. Process creation is unrealistically expensive due to the simple memory management scheme employed in Lamix. Consequently, the simulation system is currently not suitable for workloads that rely on the frequent creation of processes. A redesign of Lamix's virtual memory subsystem would be necessary if ML-RSIM were to be used in an environment in which these effects are important.

File I/O and interprocess communication bandwidth show a varying error compared to the reference systems. The main source of this error is the differing physical memory management schemes of the two kernels. On the reference platforms, demand-paging, copy-on-write, and interactions with other processes cause considerable variations in physical memory layout between successive measurements, with a resulting difference in copy bandwidth due to varying level-2 cache conflicts. Lamix's simple and deterministic scheme can lead to unrealistically high or low numbers of cache conflicts that do not change between successive simulation runs.

This difference is as much a result of the design goals of ML-RSIM as it is inherent to simulators in general. ML-RSIM is designed to model basic I/O functionality such as interrupt handling, DMA transfers, and system calls. As such, it does not emphasize accurate memory management. While implementing or porting a more complex memory management subsystem is feasible, the resulting kernel would be subject to much of the same non-determinism that can be observed in real operating systems. Small changes in the order of memory allocation requests can lead to very different physical memory layouts, thus making it more difficult to reproduce and compare simulation results. Due to the run-time cost of simulations, it is not usually possible to repeat experiments multiple times to account for variability; thus, a deterministic simulator is essential.

5. Related work

Existing execution-driven simulation tools fall broadly into two categories. The first category, composed of commonly used architectural simulators such as SimpleScalar [2] and RSIM [15], provides detailed processor and memory hierarchy models but focuses on user-level instructions only. These tools are invaluable for studying compute-intensive workloads, but the lack of operating system and I/O device models makes them unsuitable for I/O-intensive applications. The SimpleScalar tool set targets the detailed simulation of single-program user-level code and as such is not able to account for operating system and I/O effects. RSIM is designed to simulate scientific shared-memory multiprocessor workloads and consequently includes a simple network model, but it also lacks support for operating system and I/O effects. ML-RSIM, on the other hand, extends the modeling scope to include several I/O devices, virtual memory, and a fully-simulated operating system.

The second category of simulation tools includes full-system simulators such as SimOS [16], SimOS-PPC [19], Pharmsim [5], SimICS [12], and M5 [3]. These system simulators are able to boot complete operating systems and can thus simulate a much broader set of workloads, including I/O- and operating-system-intensive codes. SimOS and its derivatives as well as SimICS trade model fidelity for simulation speed. Furthermore, SimOS and M5 currently boot commercial operating systems for which source code is not readily available, restricting the modifications that users can make to the hardware models. The M5 system simulator focuses on accurately modeling network effects, thus complementing ML-RSIM's emphasis on disk I/O; it also currently runs only a commercial OS. ML-RSIM, on the other hand, does not provide a complete machine model and thus requires a custom OS. However, the operating system kernel is freely available in source code form, and the smaller scope of the simulation system reduces its complexity and its development and maintenance effort.

Timing-first simulation lowers the development and execution cost of accurate full-system simulators by combining a timing-accurate but functionally incomplete simulator with a functionally correct full-system simulator [13]. Since the usually more complex timing simulator does not need to be functionally correct, implementation of advanced features can proceed incrementally, thus reducing implementation effort. ML-RSIM takes a different approach to reducing implementation cost and does not provide a complete machine model. Consequently, it is not able to fully leverage existing operating systems, but the reduced complexity of the hardware models and system software also results in a system that is relatively easier to maintain.

Validation is an important part of developing a simulation system, as it provides insight into how conclusions and design trade-offs based on simulation may be affected by modeling errors. In addition, it can help establish a level of significance and confidence in the final conclusions.

Several existing simulators have undergone a similar validation process. A version of SimpleScalar has been validated against a Compaq Alpha workstation [7]. Performance counter measurements and statistics collected by the simulator are compared to provide a detailed view of simulator inaccuracies. While this approach is able to pinpoint sources of modeling errors more clearly than the observation of microbenchmark results, it relies on the availability of suitable performance counters. In some cases, the exact specification of these counters is not clear. Using LMBench to characterize microarchitectural characteristics can eliminate or complement the use of performance counters. LMBench's capabilities and scope exceed those of most performance counters, providing statistics related not only to isolated events but also to whole-system behavior. In fact, this study shows that accurate microarchitectural models are not sufficient to ensure that overall system behavior will be accounted for correctly.

The FLASH group validated the simulator used before and during prototype construction against the final hardware [9]. Unlike other validation efforts, this work investigates how accurately the simulator predicts trends, rather than focusing on absolute metrics such as execution time. Given that many simulators are in fact used to predict the relative impact of new techniques, the FLASH approach more closely measures the desired characteristics of the tool. While parallel-architecture speedup and scalability trends are relatively easy to recreate on actual hardware, microarchitectural details such as instruction window size and memory system organization cannot be varied easily in isolation on real hardware. As a result, validation studies that compare absolute metrics are necessary when investigating changes to microarchitectural details.

Calibration, the process of refining simulator models until they match a hardware prototype to a desired degree [4], is similar to validation but has a different objective. Unless the hardware models are general enough to be applicable to a variety of systems, calibration can result in a simulation tool that contains too many idiosyncrasies of one specific system to be broadly applicable. The validation work presented here intentionally does not curve-fit or specialize hardware models, since the simulator is designed to represent a large class of systems.

6. Conclusions

This article presents a high-fidelity simulation system designed for I/O- and operating-system-intensive workloads. ML-RSIM combines accurate models of modern workstation hardware with a fully-functional Unix-compatible operating system. ML-RSIM is based on RSIM v1.0 and extends the original processor model with an I/O subsystem, detailed bus and memory controller models, and support for privileged instructions and virtual memory. These extensions allow ML-RSIM to simulate a Unix-like operating system. This unique combination of features makes the simulation environment well-suited for exploring the interaction of system architecture, system software, and applications.

A validation study comparing the simulator to two diverse reference systems demonstrates the level of accuracy achieved by the system, but more importantly highlights its strengths and weaknesses. The validation process presented here uses LMBench to compare microarchitectural characteristics and operating system performance with an SGI Octane and a Sun Blade-1000 workstation. Microarchitectural parameters generally show a close correlation between ML-RSIM and the reference platforms, indicating that all major hardware components are modeled accurately. System call latencies and other operating system performance parameters show considerable discrepancies between the systems, largely as a result of the different internal structures of the operating system kernels. This understanding and the detailed results presented here help researchers to evaluate the suitability of the tool for specific research questions and to judge the validity of results. In particular, the simulator is appropriate and useful for simulating workloads in which kernel interaction is an important part of the application, and where underestimating the cost of various system calls would not invalidate the experimental results. Currently, ML-RSIM is less appropriate for evaluating workloads that frequently create processes, unless improvements are first made to the memory management subsystem of Lamix.

ML-RSIM is freely available for researchers to use and modify. In addition to several refinements and functionality enhancements, a simultaneous multithreading version is under development. The ML-RSIM tools, like most available simulators, provide the basis on which researchers implement and test their own techniques and enhancements. Ultimately, it is the responsibility of those researchers to ensure that the simulator being used is the appropriate tool for the task and that it provides sufficient fidelity and accuracy in the areas that impact the conclusions drawn from its use.

Acknowledgements

We would like to thank the RSIM project at the University of Illinois at Urbana-Champaign for developing and providing their simulation environment, on which ML-RSIM is built. Many people have contributed to the development of ML-RSIM. In particular, we would like to acknowledge the significant contributions of Lixin Zhang and Branden Moore. This work was in part supported by the Defense Advanced Research Projects Agency under agreement numbers N0003995C0018 and F306029810101, by NSF Research Infrastructure Grant Number CDA9623614, and by a Graduate Research Fellowship from the University of Utah Graduate School for the academic year 2000/2001.

References

[1] Adaptec, Inc., AIC-7770 Data Book, 1992.
[2] T. Austin, E. Larson, D. Ernst, SimpleScalar: an infrastructure for computer system modeling, IEEE Computer 35 (2) (2002) 59–67.
[3] N. Binkert, E. Hallnor, S. Reinhardt, Network-oriented full-system simulation using M5, in: Proc. Sixth Workshop on Computer Architecture Evaluation using Commercial Workloads, February 2003.
[4] B. Black, J. Shen, Calibration of microprocessor performance models, IEEE Computer 31 (5) (1998) 59–65.
[5] H. Cain et al., Precise and accurate processor simulation, in: Proc. Fifth Workshop on Computer Architecture Evaluation using Commercial Workloads, 2002.
[6] W. Courtright II, G. Gibson, M. Holland, L. Neal Reilly, J. Zelenka, RAIDFrame: A Rapid Prototyping Tool for RAID Systems, Version 1.0, tech. report, Carnegie Mellon University, Pittsburgh, PA, 1996.
[7] R. Desikan, D. Burger, S. Keckler, Measuring experimental error in microprocessor simulation, in: Proc. 28th Int'l Symposium on Computer Architecture, ACM Press, New York, NY, 2001, pp. 266–277.
[8] G. Ganger, B. Worthington, Y. Patt, The DiskSim Simulation Environment Version 1.0 Reference Manual, tech. report CSE-TR-358-98, Department of Electrical Engineering and Computer Science, University of Michigan, 1998.
[9] J. Gibson et al., FLASH vs. (simulated) FLASH: closing the simulation loop, in: Proc. 9th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, NY, 2000, pp. 49–58.
[10] M. McKusick, K. Bostic, M. Karels, J. Quarterman, The Design and Implementation of the 4.4BSD Operating System, Addison Wesley Longman, Inc., 1996.
[11] E. Lee, Performance Modeling and Analysis of Disk Arrays, Ph.D. dissertation, tech. report CSD-93-770, University of California, Berkeley, CA, 1993.
[12] P. Magnusson et al., SimICS: a full system simulation platform, IEEE Computer 35 (2) (2002) 50–58.
[13] C. Mauer, M. Hill, D. Wood, Full system timing-first simulation, in: Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, ACM Press, New York, NY, 2002, pp. 108–116.
[14] MIPS Technologies, MIPS R10000 Microprocessor User's Manual, Version 2.0, 1996.
[15] V.S. Pai, P. Ranganathan, S.V. Adve, RSIM Reference Manual, Version 1.0, tech. report 9705, Department of Electrical and Computer Engineering, Rice University, Houston, TX, 1997.
[16] M. Rosenblum et al., Using the SimOS machine simulator to study complex computer systems, ACM TOMACS Special Issue on Computer Simulation (1997) 79–103.
[17] L. Schaelicke, L-RSIM: a simulation environment for I/O-intensive workloads, in: Proc. 3rd Annual IEEE Workshop on Workload Characterization, 2000, pp. 83–89.
[18] L. Schaelicke, M. Parker, ML-RSIM Reference Manual, tech. report TR 02-10, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, 2002.
[19] The SimOS-PPC Project, http://www.cs.utexas.edu/users/cart/simOS/.
[20] L. McVoy, C. Staelin, lmbench: portable tools for performance analysis, in: Proc. USENIX Annual Technical Conference, Usenix Assoc., Berkeley, CA, 1996, pp. 279–294.
[21] D. Weaver, T. Germond, The SPARC Architecture Manual, Version 9, PTR Prentice Hall, Inc., 1994.

Lambert Schaelicke's research interests include high-performance computer architecture and performance modeling and evaluation. As an assistant professor at the University of Notre Dame, he directed a research project that developed and implemented a high-speed network intrusion detection system. In addition, he was involved in the design and evaluation of a novel high-end computer architecture based on merged logic-DRAM technology. He is now working as a performance engineer at Intel. He received a Diploma in Computer Science from the Technical University of Berlin, Germany, and a Ph.D. in Computer Science from the University of Utah.

Mike Parker is a system and processor architect at Cray. His research interests include processor and scalable parallel computer architecture, performance analysis, and VLSI design. Past work includes improving I/O and message-passing performance through direct user-level event notifications; investigating memory system, processor, and synchronization improvements in scalable parallel shared-memory systems; architecture and hardware development of advanced memory systems; and reducing communication latency and overhead in clusters by using efficient protocols and closely coupling the communication hardware with the CPU. He received his BSEE from the University of Oklahoma in 1995 and is expecting his Ph.D. from the University of Utah in 2005.