
The following paper was originally published in the Proceedings of the USENIX 1996 Annual Technical Conference, San Diego, California, January 1996.

lmbench: Portable Tools for Performance Analysis

Larry McVoy, Silicon Graphics, Inc.; Carl Staelin, Hewlett-Packard Laboratories


Abstract

lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simulations, software development, and networking. In almost all cases, the individual tests are the result of analysis and isolation of a customer's actual performance problem. These tools can be, and currently are, used to compare different system implementations from different vendors. In several cases, the benchmarks have uncovered previously unknown bugs and design flaws. The results have shown a strong correlation between memory system performance and overall performance. lmbench includes an extensible database of results from systems current as of late 1995.

1. Introduction

lmbench provides a suite of benchmarks that attempt to measure the most commonly found performance bottlenecks in a wide range of system applications. These bottlenecks have been identified, isolated, and reproduced in a set of small micro-benchmarks, which measure system latency and bandwidth of data movement among the processor and memory, network, file system, and disk. The intent is to produce numbers that real applications can reproduce, rather than the frequently quoted and somewhat less reproducible marketing performance numbers.

The benchmarks focus on latency and bandwidth because performance issues are usually caused by latency problems, bandwidth problems, or some combination of the two. Each benchmark exists because it captures some unique performance problem present in one or more important applications. For example, the TCP latency benchmark is an accurate predictor of the Oracle distributed lock manager's performance, the memory latency benchmark gives a strong indication of Verilog simulation performance, and the file system latency benchmark models a critical path in software development.

lmbench was developed to identify and evaluate system performance bottlenecks present in many machines in 1993-1995. It is entirely possible that computer architectures will have changed and advanced enough in the next few years to render parts of this benchmark suite obsolete or irrelevant.

lmbench is already in widespread use at many sites by both end users and system designers. In some cases, lmbench has provided the data necessary to discover and correct critical performance problems that might have gone unnoticed. lmbench uncovered a problem in Sun's memory management software that made all pages map to the same location in the cache, effectively turning a 512 kilobyte (K) cache into a 4K cache.

lmbench measures only a system's ability to transfer data between processor, cache, memory, network, and disk. It does not measure other parts of the system, such as the graphics subsystem, nor is it a MIPS, MFLOPS, throughput, saturation, stress, graphics, or multiprocessor test suite. It is frequently run on multiprocessor (MP) systems to compare their performance against uniprocessor systems, but it does not take advantage of any multiprocessor features.

The benchmarks are written using standard, portable system interfaces and facilities commonly used by applications, so lmbench is portable and comparable over a wide set of systems. lmbench has been run on AIX, BSDI, HP-UX, IRIX, Linux, FreeBSD, NetBSD, OSF/1, Solaris, and SunOS. Part of the suite has been run on Windows/NT as well.

lmbench is freely distributed under the Free Software Foundation's General Public License [Stallman89], with the additional restriction that results may be reported only if the benchmarks are unmodified.
2. Prior work

Benchmarking and performance analysis is not a new endeavor. There are too many other benchmark suites to list all of them here. We compare lmbench to a set of similar benchmarks.

• I/O (disk) benchmarks: IOstone [Park90] wants to be an I/O benchmark, but actually measures the memory subsystem; all of the tests fit easily in the cache. IObench [Wolman89] is a systematic file system and disk benchmark, but it is complicated and unwieldy. In [McVoy91] we reviewed many I/O benchmarks and found them all lacking because they took too long to run and were too complex a solution to a fairly simple problem. We wrote a small, simple I/O benchmark, lmdd, that measures sequential and random I/O far faster than either IOstone or IObench. As part of [McVoy91] the results from lmdd were checked against IObench (as well as some other Sun internal I/O benchmarks). lmdd proved to be more accurate than any of the other benchmarks. At least one disk vendor routinely uses lmdd to do performance testing of its disk drives.

Chen and Patterson [Chen93, Chen94] measure I/O performance under a variety of workloads that are automatically varied to test the range of the system's performance. Our efforts differ in that we are more interested in the CPU overhead of a single request, rather than the capacity of the system as a whole.

• Berkeley Software Distribution's microbench suite: The BSD effort generated an extensive set of test benchmarks to do regression testing (both quality and performance) of the BSD releases. We did not use this as a basis for our work (although we used ideas) for the following reasons: (a) missing tests — such as memory latency, (b) too many tests, the results tended to be obscured under a mountain of numbers, and (c) wrong copyright — we wanted the Free Software Foundation's General Public License.

• Ousterhout's benchmark: [Ousterhout90] proposes several system benchmarks to measure system call latency, context switch time, and file system performance. We used the same ideas as a basis for our work, while trying to go farther. We measured a more complete set of primitives, including some hardware measurements; went into greater depth on some of the tests, such as context switching; and went to great lengths to make the benchmark portable and extensible.

• Networking benchmarks: Netperf measures networking bandwidth and latency and was written by Rick Jones of Hewlett-Packard. lmbench includes a smaller, less complex benchmark that produces similar results.

ttcp is a widely used benchmark in the Internet community. Our version of the same benchmark routinely delivers bandwidth numbers that are within 2% of the numbers quoted by ttcp.

• McCalpin's stream benchmark: [McCalpin95] has memory bandwidth measurements and results for a large number of high-end systems. We did not use these because we discovered them only after we had results using our versions. We will probably include McCalpin's benchmarks in lmbench in the future.

In summary, we rolled our own because we wanted simple, portable benchmarks that accurately measured a wide variety of operations that we consider crucial to performance on today's systems. While portions of other benchmark suites include similar work, none includes all of it, few are as portable, and almost all are far more complex. Less filling, tastes great.
3. Benchmarking notes

3.1. Sizing the benchmarks

The proper sizing of various benchmark parameters is crucial to ensure that the benchmark is measuring the right component of system performance. For example, memory-to-memory copy speeds are dramatically affected by the location of the data: if the size parameter is too small so the data is in a cache, then the performance may be as much as ten times faster than if the data is in memory. On the other hand, if the memory size parameter is too big so the data is paged to disk, then performance may be slowed to such an extent that the benchmark seems to 'never finish.'

lmbench takes the following approach to the cache and memory size issues:

• All of the benchmarks that could be affected by cache size are run in a loop, with increasing sizes (typically powers of two) until some maximum size is reached. The results may then be plotted to see where the benchmark no longer fits in the cache.

• The benchmark verifies that there is sufficient memory to run all of the benchmarks in main memory. A small test program allocates as much memory as it can, clears the memory, and then strides through that memory a page at a time, timing each reference. If any reference takes more than a few microseconds, the page is no longer in memory. The test program starts small and works forward until either enough memory is seen as present or the memory limit is reached.

3.2. Compile time issues

The GNU C compiler, gcc, is the compiler we chose because it gave the most reproducible results across platforms. When gcc was not present, we used the vendor-supplied cc. All of the benchmarks were compiled with optimization -O except the benchmarks that calculate clock speed and the context switch times, which must be compiled without optimization in order to produce correct results. No other optimization flags were enabled because we wanted results that would be commonly seen by application writers.

All of the benchmarks were linked using the default manner of the target system. For most if not all systems, the binaries were linked using shared libraries.

3.3. Multiprocessor issues

All of the multiprocessor systems ran the benchmarks in the same way as the uniprocessor systems. Some systems allow users to pin processes to a particular CPU, which sometimes results in better cache reuse. We do not pin processes because it defeats the MP scheduler. In certain cases, this decision yields interesting results discussed later.

3.4. Timing issues

• Clock resolution: The benchmarks measure the elapsed time by reading the system clock via the gettimeofday interface. On some systems this interface has a resolution of 10 milliseconds, a long time relative to many of the benchmarks which have results measured in tens to hundreds of microseconds. To compensate for the coarse clock resolution, the benchmarks are hand-tuned to measure many operations within a single time interval lasting for many clock ticks. Typically, this is done by executing the operation in a small loop, sometimes unrolled if the operation is exceedingly fast, and then dividing the loop time by the loop count (see the sketch after this list).

• Caching: If the benchmark expects the data to be in the cache, the benchmark is typically run several times; only the last result is recorded.

If the benchmark does not want to measure cache performance it sets the size parameter larger than the cache. For example, the bcopy benchmark by default copies 8 megabytes to 8 megabytes, which largely defeats any second-level cache in use today. (Note that the benchmarks are not trying to defeat the file or process page cache, only the hardware caches.)

• Variability: The results of some benchmarks, most notably the context switch benchmark, had a tendency to vary quite a bit, up to 30%. We suspect that the operating system is not using the same set of physical pages each time a process is created and we are seeing the effects of collisions in the external caches. We compensate by running the benchmark in a loop and taking the minimum result. Users interested in the most accurate data are advised to verify the results on their own platforms.

Many of the results included in the database were donated by users and were not created by the authors. Good benchmarking practice suggests that one should run the benchmarks as the only user of a machine, without other resource intensive or unpredictable processes or daemons.
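To make the loop-timing approach concrete, here is a minimal sketch (it is not the lmbench source; the iteration count and the timed_op placeholder are ours) that repeats a fast operation enough times to span many clock ticks and divides the elapsed time by the loop count:

    #include <stdio.h>
    #include <sys/time.h>

    /* Placeholder for the operation being measured. */
    static volatile int sink;
    static void timed_op(void) { sink++; }

    int main(void)
    {
        struct timeval start, stop;
        long i, n = 1000000;    /* enough iterations to span many clock ticks */
        double usecs;

        gettimeofday(&start, 0);
        for (i = 0; i < n; i++)
            timed_op();
        gettimeofday(&stop, 0);

        usecs = (stop.tv_sec - start.tv_sec) * 1e6 +
                (stop.tv_usec - start.tv_usec);
        printf("%.3f microseconds per operation\n", usecs / n);
        return 0;
    }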
3.5. Using the lmbench database

lmbench includes a database of results that is useful for comparison purposes. It is quite easy to build the source, run the benchmark, and produce a table of results that includes the run. All of the tables in this paper were produced from the database included in lmbench. This paper is also included with lmbench and may be reproduced incorporating new results. For more information, consult the file lmbench-HOWTO in the lmbench distribution.

4. Systems tested

lmbench has been run on a wide variety of platforms. This paper includes results from a representative subset of machines and operating systems. Comparisons between similar hardware running different operating systems can be very illuminating, and we have included a few examples in our results.

The systems are briefly characterized in Table 1. Please note that the list prices are very approximate, as is the year of introduction. The SPECInt92 numbers are a little suspect since some vendors have been ‘‘optimizing’’ for certain parts of SPEC. We try and quote the original SPECInt92 numbers where we can.

4.1. Reading the result tables

Throughout the rest of this paper, we present tables of results for many of the benchmarks. All of the tables are sorted, from best to worst. Some tables have multiple columns of results and those tables are sorted on only one of the columns. The sorted column's heading will be in bold.

5. Bandwidth benchmarks

By bandwidth, we mean the rate at which a particular facility can move data. We attempt to measure the data movement ability of a number of different

Name used      | Vendor & model  | Multi/Uni | Operating System | CPU          | MHz | Year | SPECInt92 | List price
IBM PowerPC    | IBM 43P         | Uni       | AIX 3.?          | MPC604       | 133 | '95  | 176       | 15k
IBM Power2     | IBM 990         | Uni       | AIX 4.?          | Power2       | 71  | '93  | 126       | 110k
FreeBSD/i586   | ASUS P55TP4XE   | Uni       | FreeBSD 2.1      | Pentium      | 133 | '95  | 190       | 3k
HP K210        | HP 9000/859     | MP        | HP-UX B.10.01    | PA 7200      | 120 | '95  | 167       | 35k
SGI Challenge  | SGI Challenge   | MP        | IRIX 6.2-α       | R4400        | 200 | '94  | 140       | 80k
SGI Indigo2    | SGI Indigo2     | Uni       | IRIX 5.3         | R4400        | 200 | '94  | 135       | 15k
Linux/Alpha    | DEC Cabriolet   | Uni       | Linux 1.3.38     | Alpha 21064A | 275 | '94  | 189       | 9k
Linux/i586     | Triton/EDO RAM  | Uni       | Linux 1.3.28     | Pentium      | 120 | '95  | 155       | 5k
Linux/i686     | Intel Alder     | Uni       | Linux 1.3.37     | Pentium Pro  | 200 | '95  | ~320      | 7k
DEC Alpha@150  | DEC 3000/500    | Uni       | OSF1 3.0         | Alpha 21064  | 150 | '93  | 84        | 35k
DEC Alpha@300  | DEC 8400 5/300  | MP        | OSF1 3.2         | Alpha 21164  | 300 | '95  | 341       | ? 250k
Sun Ultra1     | Sun Ultra1      | Uni       | SunOS 5.5        | UltraSPARC   | 167 | '95  | 250       | 21k
Sun SC1000     | Sun SC1000      | MP        | SunOS 5.5-β      | SuperSPARC   | 50  | '92  | 65        | 35k
Solaris/i686   | Intel Alder     | Uni       | SunOS 5.5.1      | Pentium Pro  | 133 | '95  | ~215      | 5k
Unixware/i686  | Intel Aurora    | Uni       | Unixware 5.4.2   | Pentium Pro  | 200 | '95  | ~320      | 7k

Table 1. System descriptions. facilities: library bcopy, hand-unrolled bcopy, in a load and an add for each word of memory. The direct-memory read and write (no copying), pipes, add is an integer add that completes in one cycle on all TCP sockets, the read interface, and the mmap inter- of the processors. Given that today’s processor typi- face. cally cycles at 10 or fewer nanoseconds (ns) and that memory is typically 200-1,000 ns per cache line, the 5.1. Memory bandwidth results reported here should be dominated by the Data movement is fundamental to any operating memory subsystem, not the processor add unit. system. In the past, performance was frequently mea- The memory contents are added up because almost sured in MFLOPS because floating point units were all C would optimize out the whole loop slow enough that microprocessor systems were rarely when optimization was turned on, and would generate limited by memory bandwidth. Today, floating point far too many instructions without optimization. The units are usually much faster than memory bandwidth, solution is to add up the data and pass the result as an so many current MFLOP ratings can not be main- unused argument to the ‘‘finish timing’’ function. tained using memory-resident data; they are ‘‘cache Memory reads represent about one-third to one- only’’ ratings. half of the bcopy work, and we expect that pure We measure the ability to copy, read, and write reads should run at roughly twice the speed of bcopy. data over a varying set of sizes. There are too many Exceptions to this rule should be studied, for excep- results to report all of them here, so we concentrate on tions indicate a bug in the benchmarks, a problem in large memory transfers. bcopy, or some unusual hardware. We measure copy bandwidth two ways. The first bcopy Bcopy Memory is the user-level library interface. The second System unrolled libc read write is a hand-unrolled loop that loads and stores aligned 8-byte words. In both cases, we took care to ensure IBM Power2 242 171 205 364 that the source and destination locations would not Sun Ultra1 85 167 129 152 DEC Alpha@300 85 80 120 123 map to the same lines if the any of the caches were HP K210 78 57 117 126 direct-mapped. In order to test memory bandwidth Unixware/i686 65 58 23588 rather than cache bandwidth, both benchmarks copy Solaris/i686 52 48 15971 1 an 8M area to another 8M area. (As secondary DEC Alpha@150 46 45 79 91 caches reach 16M, these benchmarks will have to be Linux/i686 42 56 20856 resized to reduce caching effects.) FreeBSD/i586 39 42 7383 The copy results actually represent one-half to Linux/Alpha 39 39 7371 Linux/i586 38 42 7475 one-third of the memory bandwidth used to obtain SGI Challenge 35 36 65 67 those results since we are reading and writing mem- SGI Indigo2 31 32 69 66 ory. If the cache line size is larger than the word IBM PowerPC 21 21 63 26 stored, then the written cache line will typically be Sun SC1000 17 15 38 31 read before it is written. The actual amount of mem- ory bandwidth used varies because some architectures Table 2. Memory bandwidth (MB/s) have special instructions specifically designed for the bcopy function. Those architectures will move twice Memory writing is measured by an unrolled loop as much memory as reported by this benchmark; less that stores a value into an integer (typically a 4 byte advanced architectures move three times as much integer) and then increments the pointer. 
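As an illustration of the style of the read and write measurements just described (a simplified sketch, not the lmbench source; the 8M buffer size and four-way unrolling are ours):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define SIZE (8 * 1024 * 1024)   /* 8M, large enough to defeat most caches */

    int main(void)
    {
        int *buf = malloc(SIZE);
        int i, n = SIZE / sizeof(int);
        int sum = 0;
        struct timeval start, stop;
        double secs;

        /* Write bandwidth: unrolled loop of stores (also faults the pages in). */
        gettimeofday(&start, 0);
        for (i = 0; i + 4 <= n; i += 4) {
            buf[i] = 1; buf[i + 1] = 1; buf[i + 2] = 1; buf[i + 3] = 1;
        }
        gettimeofday(&stop, 0);
        secs = (stop.tv_sec - start.tv_sec) + (stop.tv_usec - start.tv_usec) / 1e6;
        printf("write: %.1f MB/s\n", SIZE / secs / 1e6);

        /* Read bandwidth: unrolled loop of load+add; the sum keeps the
           compiler from optimizing the loop away. */
        gettimeofday(&start, 0);
        for (i = 0; i + 4 <= n; i += 4)
            sum += buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
        gettimeofday(&stop, 0);
        secs = (stop.tv_sec - start.tv_sec) + (stop.tv_usec - start.tv_usec) / 1e6;
        printf("read:  %.1f MB/s (sum %d)\n", SIZE / secs / 1e6, sum);

        free(buf);
        return 0;
    }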
The proces- memory: the memory read, the memory read because sor cost of each memory operation is approximately it is about to be overwritten, and the memory written. the same as the cost in the read case. The bcopy results reported in Table 2 may be The numbers reported in Table 2 are not the raw 2 correlated with John McCalpin’s stream hardware speed in some cases. The Power2 is capa- [McCalpin95] benchmark results in the following man- ble of up to 800M/sec read rates [McCalpin95] and HP ner: the stream benchmark reports all of the mem- PA RISC (and other prefetching) systems also do bet- ory moved whereas the bcopy benchmark reports the ter if higher levels of code optimization used and/or bytes copied. So our numbers should be approxi- the code is hand tuned. mately one-half to one-third of his numbers. The Sun libc bcopy in Table 2 is better because Memory reading is measured by an unrolled loop they use a hardware specific bcopy routine that uses that sums up a series of integers. On most (perhaps instructions new in SPARC V9 that were added specif- all) systems measured the integer size is 4 bytes. The ically for memory movement. loop is unrolled such that most compilers generate The Pentium Pro read rate in Table 2 is much code that uses a constant offset with the load, resulting higher than the write rate because, according to Intel,

1 Some of the PCs had less than 16M of available 2 Someone described this machine as a $1,000 pro- memory; those machines copied 4M. cessor on a $99,000 memory subsystem. the write transaction turns into a read followed by a bcopy hardware unavailable to the C library. write to maintain cache consistency for MP systems. It is interesting to compare pipes with TCP because the TCP benchmark is identical to the pipe 5.2. IPC bandwidth benchmark except for the transport mechanism. Ide- Interprocess communication bandwidth is fre- ally, the TCP bandwidth would be as good as the pipe quently a performance issue. Many Unix applications bandwidth. It is not widely known that the majority of are composed of several processes communicating the TCP cost is in the bcopy, the checksum, and the through pipes or TCP sockets. Examples include the network interface driver. The checksum and the driver groff documentation system that prepared this may be safely eliminated in the loopback case and if paper, the , remote file access, the costs have been eliminated, then TCP should be and World Wide Web servers. just as fast as pipes. From the pipe and TCP results in Unix pipes are an interprocess communication Table 3, it is easy to see that Solaris and HP-UX have mechanism implemented as a one-way byte stream. done this optimization. Each end of the stream has an associated file descrip- Bcopy rates in Table 3 can be lower than pipe rates tor; one is the write descriptor and the other the read because the pipe transfers are done in 64K buffers, a descriptor. TCP sockets are similar to pipes except size that frequently fits in caches, while the bcopy is they are bidirectional and can cross machine bound- typically an 8M-to-8M copy, which does not fit in the aries. cache. Pipe bandwidth is measured by creating two pro- In Table 3, the SGI Indigo2, a uniprocessor, does cesses, a writer and a reader, which transfer 50M of better than the SGI MP on pipe bandwidth because of data in 64K transfers. The transfer size was chosen so caching effects - in the UP case, both processes share that the overhead of system calls and context switch- the cache; on the MP, each process is communicating ing would not dominate the benchmark time. The with a different cache. reader prints the timing results, which guarantees that All of the TCP results in Table 3 are in loopback all data has been moved before the timing is finished. mode — that is both ends of the socket are on the TCP bandwidth is measured similarly, except the same machine. It was impossible to get remote net- data is transferred in 1M page aligned transfers instead working results for all the machines included in this of 64K transfers. If the TCP implementation supports paper. We are interested in receiving more results for it, the send and receive socket buffers are enlarged to identical machines with a dedicated network connect- 1M, instead of the default 4-60K. We hav efound that ing them. The results we have for over the wire TCP setting the transfer size equal to the socket buffer size bandwidth are shown below. produces the greatest throughput over the most imple- mentations. System Network TCP bandwidth SGI PowerChallenge hippi 79.3 System Libc bcopy pipe TCP Sun Ultra1 100baseT 9.5 HP K210 57 93 34 HP 9000/735 fddi 8.8 Linux/i686 56 89 18 FreeBSD/i586 100baseT 7.9 IBM Power2 171 84 10 SGI Indigo2 10baseT .9 Linux/Alpha 39 73 9 HP 9000/735 10baseT .9 Unixware/i686 58 68 -1 Linux/i586@90Mhz 10baseT .7 Sun Ultra1 167 61 51 DEC Alpha@300 80 46 11 Table 4. 
Remote TCP bandwidth (MB/s) Solaris/i686 48 38 20 DEC Alpha@150 45 35 9 The SGI using 100MB/s Hippi is by far the fastest SGI Indigo2 32 34 22 in Table 4. The SGI Hippi interface has hardware sup- Linux/i586 42 34 7 port for TCP checksums and the IRIX operating sys- IBM PowerPC 21 30 17 tem uses virtual memory tricks to avoid copying data FreeBSD/i586 42 23 13 as much as possible. For larger transfers, SGI Hippi SGI Challenge 36 17 31 has reached 92MB/s over TCP. Sun SC1000 15 9 11 100baseT is looking quite competitive when com- Table 3. Pipe and local TCP bandwidth (MB/s) pared to FDDI in Table 4, even though FDDI has packets that are almost three times larger. We wonder bcopy is important to this test because the pipe how long it will be before we see gigabit ethernet write/read is typically implemented as a bcopy into interfaces. the kernel from the writer and then a bcopy from the kernel to the reader. Ideally, these results would be 5.3. Cached I/O bandwidth approximately one-half of the bcopy results. It is Experience has shown us that reusing data in the possible for the kernel bcopy to be faster than the C file system page cache can be a performance issue. library bcopy since the kernel may have access to This section measures that operation through two interfaces, read and mmap. The benchmark here is is virtually the same as the library bcopy case. How- not an I/O benchmark in that no disk activity is ev er, file reread can be faster because the kernel may involved. We wanted to measure the overhead of have access to bcopy assist hardware not available to reusing data, an overhead that is CPU intensive, rather the C library. Ideally, File mmap performance should than disk intensive. approach Memory read performance, but mmap is The read interface copies data from the kernel’s often dramatically worse. Judging by the results, this file system page cache into the process’s buffer, using looks to be a potential area for operating system 64K buffers. The transfer size was chosen to mini- improvements. mize the kernel entry overhead while remaining realis- In Table 5 the Power2 does better on file reread tically sized. than bcopy because it takes full advantage of the The difference between the bcopy and the read memory subsystem from inside the kernel. The mmap benchmarks is the cost of the file and virtual memory reread is probably slower because of the lower clock system overhead. In most systems, the bcopy speed rate; the page faults start to show up as a significant should be faster than the read speed. The exceptions cost. usually have hardware specifically designed for the It is surprising that the Sun Ultra1 was able to bcopy function and that hardware may be available bcopy at the high rates shown in Table 2 but did not only to the operating system. show those rates for file reread in Table 5. HP has the The read benchmark is implemented by reread- opposite problem, they get file reread faster than ing a file (typically 8M) in 64K buffers. Each buffer is bcopy, perhaps because the kernel bcopy has access summed as a series of integers in the user process. to hardware support. The summing is done for two reasons: for an apples- The Unixware system has outstanding mmap to-apples comparison the memory-mapped benchmark reread rates, better than systems of substantially needs to touch all the data, and the file system can higher cost. Linux needs to do some work on the sometimes transfer data into memory faster than the mmap code. processor can read the data. 
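A minimal sketch of the read-interface reread measurement might look like the following (an illustration, not the lmbench code; the file name, the 64K buffer size, and the assumption that the file is already in the page cache are all ours):

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define BUFSIZE (64 * 1024)

    /* Reread a (presumably cached) file in 64K chunks, summing the data so
     * that every byte is actually touched by the processor. */
    int main(int argc, char **argv)
    {
        int fd = open(argc > 1 ? argv[1] : "datafile", O_RDONLY);
        int *buf = malloc(BUFSIZE);
        struct timeval start, stop;
        long total = 0;
        int i, sum = 0;
        ssize_t got;
        double secs;

        if (fd < 0) { perror("open"); return 1; }

        gettimeofday(&start, 0);
        while ((got = read(fd, buf, BUFSIZE)) > 0) {
            for (i = 0; i < got / (int)sizeof(int); i++)
                sum += buf[i];
            total += got;
        }
        gettimeofday(&stop, 0);

        secs = (stop.tv_sec - start.tv_sec) +
               (stop.tv_usec - start.tv_usec) / 1e6;
        printf("reread: %.1f MB/s (sum %d)\n", total / secs / 1e6, sum);
        close(fd);
        free(buf);
        return 0;
    }

The mmap variant maps the whole file into the address space and sums it in place, with no read system calls.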
For example, SGI’s XFS can move data into memory at rates in excess of 500M 6. Latency measurements per second, but it can move data into the cache at only Latency is an often-overlooked area of perfor- 68M per second. The intent is to measure perfor- mance problems, possibly because resolving latency mance delivered to the application, not DMA perfor- issues is frequently much harder than resolving band- mance to memory. width issues. For example, memory bandwidth may be increased by making wider cache lines and increas- Libc File Memory bcopy read read mmap ing memory ‘‘width’’ and interleave, but memory latency can be improved only by shortening paths or IBM Power2 171 187 205 106 increasing (successful) prefetching. The first step HP K210 57 88 117 52 toward improving latency is understanding the current Sun Ultra1 167 85 129 101 DEC Alpha@300 80 67 120 78 latencies in a system. Unixware/i686 58 62 235 200 The latency measurements included in this suite Solaris/i686 48 52 159 94 are memory latency, basic operating system entry cost, DEC Alpha@150 45 40 79 50 signal handling cost, process creation times, context Linux/i686 56 40 208 36 switching, interprocess communication, file system IBM PowerPC 21 40 63 51 latency, and disk latency. SGI Challenge 36 36 65 56 SGI Indigo2 32 32 69 44 FreeBSD/i586 42 30 73 53 6.1. Memory read latency background Linux/Alpha 39 24 73 18 In this section, we expend considerable effort to Linux/i586 42 23 74 9 define the different memory latencies and to explain Sun SC1000 15 20 38 28 and justify our benchmark. The background is a bit tedious but important, since we believe the memory Table 5. File vs. memory bandwidth (MB/s) latency measurements to be one of the most thought- provoking and useful measurements in lmbench. The mmap interface provides a way to access the kernel’s file cache without copying the data. The The most basic latency measurement is memory mmap benchmark is implemented by mapping the latency since most of the other latency measurements entire file (typically 8M) into the process’s address can be expressed in terms of memory latency. For space. The file is then summed to force the data into example, context switches require saving the current the cache. process state and loading the state of the next process. However, memory latency is rarely accurately mea- In Table 5, a good system will have File read as sured and frequently misunderstood. fast as (or even faster than) Libc bcopy because as the file system overhead goes to zero, the file reread case Memory read latency has many definitions; the of at least the load-in-a-vacuum latency. most common, in increasing time order, are memory • Back-to-back-load latency: Back-to-back-load chip cycle time, processor-pins-to-memory-and-back latency is the time that each load takes, assuming that time, load-in-a-vacuum time, and back-to-back-load the instructions before and after are also cache- time. missing loads. Back-to-back loads may take longer • Memory chip cycle latency: Memory chips are than loads in a vacuum for the following reason: many rated in nanoseconds; typical speeds are around 60ns. systems implement something known as critical word A general overview on DRAM architecture may be first, which means that the subblock of the cache line found in [Hennessy96]. 
The specific information we that contains the word being loaded is delivered to the describe here is from [Toshiba94] and pertains to the processor before the entire cache line has been THM361020AS-60 module and TC514400AJS DRAM brought into the cache. If another load occurs quickly used in SGI . The 60ns time is the time enough after the processor gets restarted from the cur- from RAS assertion to the when the data will be avail- rent load, the second load may stall because the cache able on the DRAM pins (assuming CAS access time is still busy filling the cache line for the previous load. requirements were met). While it is possible to get On some systems, such as the current implementation data out of a DRAM in 60ns, that is not all of the time of UltraSPARC, the difference between back to back involved. There is a precharge time that must occur and load in a vacuum is about 35%. after every access. [Toshiba94] quotes 110ns as the lmbench measures back-to-back-load latency random read or write cycle time and this time is more because it is the only measurement that may be easily representative of the cycle time. measured from software and because we feel that it is • Pin-to-pin latency: This number represents the time what most software developers consider to be memory needed for the memory request to travel from the pro- latency. Consider the following C code fragment: cessor’s pins to the memory subsystem and back p = head; again. Many vendors have used the pin-to-pin defini- while (p->p_next) tion of memory latency in their reports. For example, p = p->p_next; [Fenwick95] while describing the DEC 8400 quotes memory latencies of 265ns; a careful reading of that On a DEC Alpha, the loop part turns into three instruc- paper shows that these are pin-to-pin numbers. In tions, including the load. A 300 Mhz processor has a spite of the historical precedent in vendor reports, this 3.33ns cycle time, so the loop could execute in slightly definition of memory latency is misleading since it less than 10ns. However, the load itself takes 400ns ignores actual delays seen when a load instruction is on a 300 Mhz DEC 8400. In other words, the instruc- immediately followed by a use of the data being tions cost 10ns but the load stalls for 400. Another loaded. The number of additional cycles inside the way to look at it is that 400/3.3, or 121, nondependent, processor can be significant and grows more signifi- nonloading instructions following the load would be cant with today’s highly pipelined architectures. needed to hide the load latency. Because superscalar processors typically execute multiple operations per It is worth noting that the pin-to-pin numbers clock cycle, they need even more useful operations include the amount of time it takes to charge the lines between cache misses to keep the processor from going to the SIMMs, a time that increases with the stalling. (potential) number of SIMMs in a system. More SIMMs mean more capacitance which requires in This benchmark illuminates the tradeoffs in pro- longer charge times. This is one reason why personal cessor cache design. Architects like large cache lines, computers frequently have better memory latencies up to 64 bytes or so, because the prefetch effect of than workstations: the PCs typically have less memory gathering a whole line increases hit rate given reason- capacity. able spatial locality. 
Small stride sizes have high spa- tial locality and should have higher performance, but • Load-in-a-vacuum latency: A load in a vacuum is large stride sizes have poor spatial locality causing the the time that the processor will wait for one load that system to prefetch useless data. So the benchmark must be fetched from main memory (i.e., a cache provides the following insight into negative effects of miss). The ‘‘vacuum’’ means that there is no other large line prefetch: activity on the system bus, including no other loads. While this number is frequently used as the memory • Multi-cycle fill operations are typically atomic latency, it is not very useful. It is basically a ‘‘not to ev ents at the caches, and sometimes block other cache exceed’’ number important only for marketing rea- accesses until they complete. sons. Some architects point out that since most pro- • Caches are typically single-ported. Having a large cessors implement nonblocking loads (the load does line prefetch of unused data causes extra bandwidth not cause a stall until the data is used), the perceived demands at the cache, and can cause increased access load latency may be much less that the real latency. latency for normal cache accesses. When pressed, however, most will admit that cache misses occur in bursts, resulting in perceived latencies In summary, we believe that processors are so fast that the average load latency for cache misses will be closer to the back-to-back-load number than to the DEC alpha@182mhz memory latencies load-in-a-vacuum number. We are hopeful that the industry will standardize on this definition of memory 500 latency. n 450 ∆ L a a 400 6.2. Memory read latency t n ∗ 350 +∆× e o • The entire memory hierarchy can be measured, 300 n s including on-board data cache latency and size, exter- e 250 c c Main nal data cache latency and size, and main memory y 200 mem latency. Instruction caches are not measured. TLB o .5MB n 150 cache miss latency can also be measured, as in [Saavedra92], i 8KB but we stopped at main memory. Measuring TLB d 100 cache n s miss time is problematic because different systems 50 •∗+∆× map different amounts of memory with their TLB 0 hardware. 8 10 12 14 16 18 20 22 24 The benchmark varies two parameters, array size and array stride. For each size, a list of pointers is cre- log2(Array size) ated for all of the different strides. Then the list is walked thus: mov r4,(r4) # C code: p = *p; Figure 1. Memory latency The time to do about 1,000,000 loads (the list wraps) missing a cache, while others add another cache to the is measured and reported. The time reported is pure hierarchy. For example, the Alpha 8400 has two on- latency time and may be zero even though the load board caches, one 8K and the other 96K. instruction does not execute in zero time. Zero is The cache line size can be derived by comparing defined as one clock cycle; in other words, the time curves and noticing which strides are faster than main reported is only memory latency time, as it does not memory times. The smallest stride that is the same as include the instruction execution time. It is assumed main memory speed is likely to be the cache line size that all processors can do a load instruction in one pro- because the strides that are faster than memory are cessor cycle (not counting stalls). In other words, if getting more than one hit per cache line. 
the processor cache load time is 60ns on a 20ns pro- Figure 1 shows memory latencies on a nicely cessor, the load latency reported would be 40ns, the DEC 3 made machine, a Alpha. We use this machine as additional 20ns is for the load instruction itself. Pro- the example because it shows the latencies and sizes cessors that can manage to get the load address out to of the on-chip level 1 and motherboard level 2 caches, the address pins before the end of the load cycle get and because it has good all-around numbers, espe- some free time in this benchmark (we don’t know of cially considering it can support a 4M level 2 cache. any processors that do that). The on-board cache is 213 bytes or 8K, while the This benchmark has been validated by logic ana- external cache is 219 bytes or 512K. lyzer measurements on an SGI Indy by Ron Minnich Table 6 shows the cache size, cache latency, and while he was at the Maryland main memory latency as extracted from the memory Research Center. latency graphs. The graphs and the tools for extract- Results from the memory latency benchmark are ing the data are included with lmbench. It is worth- plotted as a series of data sets as shown in Figure 1. while to plot all of the graphs and examine them since Each data set represents a stride size, with the array the table is missing some details, such as the DEC size varying from 512 bytes up to 8M or more. The Alpha 8400 processor’s second 96K on-chip cache. curves contain a series of horizontal plateaus, where We sorted Table 6 on level 2 cache latency because each plateau represents a level in the memory hierar- we think that many applications will fit in the level 2 chy. The point where each plateau ends and the line cache. The HP and IBM systems have only one level rises marks the end of that portion of the memory hier- of cache so we count that as both level 1 and level 2. archy (e.g., external cache). Most machines have sim- Those two systems have remarkable cache perfor- ilar memory hierarchies: on-board cache, external mance for caches of that size. In both cases, the cache cache, main memory, and main memory plus TLB delivers data in one clock cycle after the load instruc- miss costs. There are variations: some processors are tion. 3 In retrospect, this was a bad idea because we calcu- HP systems usually focus on large caches as close late the to get the instruction execution time. as possible to the processor. A older HP multiproces- If the clock rate is off, so is the load time. sor system, the 9000/890, has a 4M, split I&D, 2 way call table to write, verify the user area as readable, Level 1 Level 2 look up the file descriptor to get the vnode, call the cache cache Memory vnode’s write function, and then return. System Clk. lat. size lat. size latency HP K210 8 8 256K -- -- 349 System system call IBM Power2 14 13 256K -- -- 260 Linux/Alpha 2 Unixware/i686 5 5 8K 25 256K 175 Linux/i586 2 Linux/i686 5 10 8K30 256K179 Linux/i686 3 Sun Ultra1 6 6 16K 42 512K 270 Unixware/i686 4 Linux/Alpha 3 6 8K 46 96K 357 Sun Ultra1 5 Solaris/i686 7 14 8K? 48 256K 281 FreeBSD/i586 6 SGI Indigo2 5 8 16K 64 2M 1170 Solaris/i686 7 SGI Challenge 5 8 16K 64 4M 1189 DEC Alpha@300 9 DEC Alpha@300 3 3 8K 66 4M 400 Sun SC1000 9 DEC Alpha@150 6 12 8K 67 512K 291 HP K210 10 FreeBSD/i586 7 7 8K 95 512K 182 SGI Indigo2 11 Linux/i586 8 8 8K 107 256K150 DEC Alpha@150 11 Sun SC1000 20 20 8K 140 1M 1236 IBM PowerPC 12 IBM PowerPC 7 6 16K 164 ? 512K 394 IBM Power2 16 SGI Challenge 24 Table 6. 
Cache and memory latency (ns) set associative cache, accessible in one clock (16ns). Table 7. Simple system call time (microseconds) That system is primarily a database server. Linux is the clear winner in the system call time. The IBM focus is on low latency, high bandwidth The reasons are twofold: Linux is a uniprocessor oper- memory. The IBM memory subsystem is good ating system, without any MP overhead, and Linux is because all of memory is close to the processor, but a small operating system, without all of the ‘‘features’’ has the weakness that it is extremely difficult to evolve accumulated by the commercial offers. the design to a multiprocessor system. Unixware and Solaris are doing quite well, given The 586 and PowerPC motherboards have quite that they are both fairly large, commercially oriented poor second level caches, the caches are not substan- operating systems with a large accumulation of ‘‘fea- tially better than main memory. tures.’’ The Pentium Pro and Sun Ultra second level caches are of medium speed at 5-6 clocks latency 6.4. Signal handling cost each. 5-6 clocks seems fast until it is compared Signals in Unix are a way to tell another process to against the HP and IBM one cycle latency caches of handle an event. They are to processes as interrupts similar size. Given the tight integration of the Pen- are to the CPU. tium Pro level 2 cache, it is surprising that it has such Signal handling is often critical to layered systems. high latencies. Some applications, such as databases, software devel- The 300Mhz DEC Alpha has a rather high 22 opment environments, and threading libraries provide clock latency to the second level cache which is prob- an operating system-like layer on top of the operating ably one of the reasons that they needed a 96K level system, making signal handling a critical path in many 1.5 cache. SGI and DEC have used large second level of these applications. caches to hide their long latency from main memory. lmbench measure both signal installation and signal dispatching in two separate loops, within the 6.3. Operating system entry context of one process. It measures signal handling by Entry into the operating system is required for installing a signal handler and then repeatedly sending many system facilities. When calculating the cost of a itself the signal. facility, it is useful to know how expensive it is to per- Table 8 shows the signal handling costs. Note that form a nontrivial entry into the operating system. there are no context switches in this benchmark; the We measure nontrivial entry into the system by signal goes to the same process that generated the sig- repeatedly writing one word to /dev/null,a nal. In real applications, the signals usually go to pseudo device driver that does nothing but discard the another process, which implies that the true cost of data. This particular entry point was chosen because sending that signal is the signal overhead plus the con- it has never been optimized in any system that we text switch overhead. We wanted to measure signal have measured. Other entry points, typically getpid and context switch overheads separately since context and gettimeofday, are heavily used, heavily opti- switch times vary widely among operating systems. mized, and sometimes implemented as user-level SGI does very well on signal processing, espe- library routines rather than system calls. 
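A minimal sketch of this measurement, assuming only the standard open, write, and gettimeofday interfaces (an illustration, not the lmbench source):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    /* Time a nontrivial kernel entry by repeatedly writing one byte to
     * /dev/null, an entry point that is rarely optimized. */
    int main(void)
    {
        int fd = open("/dev/null", O_WRONLY);
        char c = 0;
        long i, n = 100000;
        struct timeval start, stop;
        double usec;

        if (fd < 0) { perror("/dev/null"); return 1; }

        gettimeofday(&start, 0);
        for (i = 0; i < n; i++)
            write(fd, &c, 1);
        gettimeofday(&stop, 0);

        usec = ((stop.tv_sec - start.tv_sec) * 1e6 +
                (stop.tv_usec - start.tv_usec)) / n;
        printf("%.2f microseconds per write system call\n", usec);
        close(fd);
        return 0;
    }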
A write to cially since their hardware is of an older generation the /dev/null driver will go through the system process into a new application, which. forms the basis System sigaction sig handler of every Unix command line interface, or shell. SGI Indigo2 4 7 lmbench measures this facility by forking a new SGI Challenge 4 9 child and having that child execute a new program — HP K210 4 13 in this case, a tiny program that prints ‘‘hello world’’ FreeBSD/i586 4 21 and exits. Linux/i686 4 22 Unixware/i686 6 25 The startup cost is especially noticeable on (some) IBM Power2 10 27 systems that have shared libraries. Shared libraries Solaris/i686 9 45 can introduce a substantial (tens of milliseconds) IBM PowerPC 10 52 startup cost. Linux/i586 7 52 DEC Alpha@300 6 59 fork fork, exec fork, exec Linux/Alpha 13 138 System & exit & exit sh -c & exit Linux/Alpha 0.7 3 12 Table 8. Signal times (microseconds) Linux/i686 0.4 5 14 than many of the others. Linux/i586 0.9 5 16 Unixware/i686 0.9 5 10 The Linux/Alpha signal handling numbers are so DEC Alpha@300 2.0 6 16 poor that we suspect that this is a bug, especially given IBM PowerPC 2.9 8 50 that the Linux/ numbers are quite reasonable. SGI Indigo2 3.1 8 19 IBM Power2 1.2 8 16 6.5. Process creation costs FreeBSD/i586 2.0 11 19 HP K210 3.1 11 20 Process benchmarks are used to measure the basic DEC Alpha@150 4.6 13 39 process primitives, such as creating a new process, SGI Challenge 4.0 14 24 running a different program, and context switching. Sun Ultra1 3.7 20 37 Process creation benchmarks are of particular interest Solaris/i686 4.5 22 46 in distributed systems since many remote operations Sun SC1000 14.0 69 281 include the creation of a remote process to shepherd the remote operation to completion. Context switch- Table 9. Process creation time (milliseconds) ing is important for the same reasons. • Complicated new process creation. When pro- • Simple process creation. The Unix process cre- grams start other programs, they frequently use one of ation primitive is fork, which creates a (virtually) three standard interfaces: popen, system, and/or exact copy of the calling process. Unlike VMS and execlp. The first two interfaces start a new process some other operating systems, Unix starts any new by invoking the standard command interpreter, process with a fork. Consequently, fork and/or /bin/sh, to start the process. Starting programs this execve should be fast and ‘‘light,’’ facts that many way guarantees that the shell will look for the have been ignoring for some time. requested application in all of the places that the user lmbench measures simple process creation by would look — in other words, the shell uses the user’s creating a process and immediately exiting the child $PATH variable as a list of places to find the applica- process. The parent process waits for the child pro- tion. execlp is a C library routine which also looks cess to exit. The benchmark is intended to measure for the program using the user’s $PATH variable. the overhead for creating a new thread of control, so it Since this is a common way of starting applica- includes the fork and the exit time. tions, we felt it was useful to show the costs of the The benchmark also includes a wait system call generality. in the parent and context switches from the parent to We measure this by starting /bin/sh to start the the child and back again. Given that context switches same tiny program we ran in the last case. 
In Table 9 of this sort are on the order of 20 microseconds and a the cost of asking the shell to go look for the program system call is on the order of 5 microseconds, and that is quite large, frequently ten times as expensive as just the entire benchmark time is on the order of a mil- creating a new process, and four times as expensive as lisecond or more, the extra overhead is insignificant. explicitly naming the location of the new program. Note that even this relatively simple task is very expensive and is measured in milliseconds while most The results that stand out in Table 9 are the poor of the other operations we consider are measured in Sun Ultra 1 results. Given that the processor is one of microseconds. the fastest, the problem is likely to be software. There is room for substantial improvement in the Solaris • New process creation. The preceding benchmark process creation code. did not create a new application; it created a copy of the old application. This benchmark measures the cost of creating a new process and changing that 6.6. Context switching However, because the overhead is measured in a single Context switch time is defined here as the time process, the cost is typically the cost with ‘‘hot’’ needed to save the state of one process and restore the caches. In the Figure 2, each size is plotted as a line, state of another process. with context switch times on the Y axis, number of processes on the X axis, and the process size as the Context switches are frequently in the critical per- data set. The process size and the hot cache overhead formance path of distributed applications. For exam- costs for the pipe read/writes and any data access is ple, the multiprocessor versions of the IRIX operating what is labeled as size=0KB overhead=10. The system use processes to move data through the net- size is in kilobytes and the overhead is in microsec- working stack. This means that the processing time onds. for each new packet arriving at an idle system includes the time needed to switch in the networking process. The context switch time does not include anything other than the context switch, provided that all the Typical context switch benchmarks measure just benchmark processes fit in the cache. If the total size the minimal context switch time — the time to switch of all of the benchmark processes is larger than the between two processes that are doing nothing but con- cache size, the cost of each context switch will text switching. We feel that this is misleading because include cache misses. We are trying to show realistic there are frequently more than two active processes, context switch times as a function of both size and and they usually have a larger working set (cache foot- number of processes. print) than the benchmark processes. Other benchmarks frequently include the cost of the system calls needed to force the context switches. Context switches for For example, Ousterhout’s context switch benchmark Linux i686@167Mhz measures context switch time plus a read and a write on a pipe. In many of the systems measured 450 by lmbench, the pipe overhead varies between 30% 400 .5M 1M and 300% of the context switch time, so we were care- m • • • ful to factor out the pipe overhead. i 350 c • Number of processes. The context switch bench- T 300 i r mark is implemented as a ring of two to twenty pro- m o cesses that are connected with Unix pipes. A token is s 250 e e .5M passed from process to process, forcing context c 200 × × switches. 
The benchmark measures the time needed i o .25M to pass the token two thousand times from process to n n 150 • .25M × process. Each transfer of the token has two costs: the d 100 ∆ context switch, and the overhead of passing the token. s .12M .25M In order to calculate just the context switching time, 50 • .12M ∆ ∆× ∆× ∆ the benchmark first measures the cost of passing the 0 token through a ring of pipes in a single process. This overhead time is defined as the cost of passing the 0 2 4 6 8 10 12 14 16 18 20 22 token and is not included in the reported context size=0KB overhead=10 switch time. size=4KB overhead=19 Processes ∆ size=16KB overhead=66 • Size of processes. In order to measure more realis- × size=32KB overhead=129 • size=64KB overhead=255 tic context switch times, we add an artificial variable size ‘‘cache footprint’’ to the switching processes. The cost of the context switch then includes the cost Figure 2. Context switch times of restoring user-level state (cache footprint). The cache footprint is implemented by having the process Results for an Intel Pentium Pro system running allocate an array of data4 and sum the array as a series Linux at 167 MHz are shown in Figure 2. The data of integers after receiving the token but before passing points on the figure are labeled with the working set the token to the next process. Since most systems will due to the sum of data in all of the processes. The cache data across context switches, the working set for actual working set is larger, as it includes the process the benchmark is slightly larger than the number of and kernel overhead as well. One would expect the processes times the array size. context switch times to stay constant until the working It is worthwhile to point out that the overhead set is approximately the size of the second level cache. mentioned above also includes the cost of accessing The Intel system has a 256K second level cache, and the data, in the same way as the actual benchmark. the context switch times stay almost constant until about 256K (marked as .25M in the graph). 4 All arrays are at the same virtual address in all pro- cesses. • Cache issues The context switch benchmark is a The Sun Ultra1 context switches quite well in part deliberate measurement of the effectiveness of the because of enhancements to the register window han- caches across process context switches. If the cache dling in SPARC V9. does not include the process identifier (PID, also sometimes called an address space identifier) as part 6.7. Interprocess communication latencies of the address, then the cache must be flushed on Interprocess communication latency is important ev ery context switch. If the cache does not map the because many operations are control messages to same virtual addresses from different processes to dif- another process (frequently on another system). The ferent cache lines, then the cache will appear to be time to tell the remote process to do something is pure flushed on every context switch. overhead and is frequently in the critical path of If the caches do not cache across context switches important functions such as distributed applications there would be no grouping at the lower left corner of (e.g., databases, network servers). Figure 2, instead, the graph would appear as a series The interprocess communication latency bench- of straight, horizontal, parallel lines. 
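The token-passing idea behind both the context switch and pipe latency measurements can be sketched as follows (an illustration, not the lmbench source; a two-process ring is shown, and the iteration count is arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <sys/time.h>

    /* Pass a one-byte token back and forth between a parent and child over a
     * pair of pipes.  Each round trip costs two context switches plus the
     * pipe overhead. */
    int main(void)
    {
        int p2c[2], c2p[2];
        char token = 'x';
        long i, n = 10000;
        struct timeval start, stop;
        double usec;

        pipe(p2c);
        pipe(c2p);

        if (fork() == 0) {                     /* child: echo the token back */
            for (;;) {
                if (read(p2c[0], &token, 1) != 1)
                    _exit(0);
                write(c2p[1], &token, 1);
            }
        }

        gettimeofday(&start, 0);
        for (i = 0; i < n; i++) {
            write(p2c[1], &token, 1);
            read(c2p[0], &token, 1);
        }
        gettimeofday(&stop, 0);
        close(p2c[1]);                         /* let the child see EOF and exit */
        wait(0);

        usec = ((stop.tv_sec - start.tv_sec) * 1e6 +
                (stop.tv_usec - start.tv_usec)) / n;
        printf("%.1f microseconds per round trip\n", usec);
        return 0;
    }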
The number of marks typically have the following form: pass a small processes will not matter, the two process case will be message (a byte or so) back and forth between two just as bad as the twenty process case since the cache processes. The reported results are always the would not be useful across context switches. microseconds needed to do one round trip. For one way timing, about half the round trip is right. How- 2 processes 8 processes System 0KB 32KB 0KB 32KB ev er, the CPU cycles tend to be somewhat asymmetric for one trip: receiving is typically more expensive than Linux/i686 6 18 7101 sending. Linux/i586 10 163 13215 Linux/Alpha 11 70 13 78 • Pipe latency. Unix pipes are an interprocess com- IBM Power2 13 16 18 43 munication mechanism implemented as a one-way Sun Ultra1 14 31 20 102 byte stream. Each end of the stream has an associated DEC Alpha@300 14 17 22 41 file descriptor; one is the write descriptor and the other IBM PowerPC 16 87 26 144 the read descriptor. HP K210 17 17 18 99 Unixware/i686 17 17 18 72 Pipes are frequently used as a local IPC mecha- FreeBSD/i586 27 34 33 102 nism. Because of the simplicity of pipes, they are fre- Solaris/i686 36 54 43 118 quently the fastest portable communication mecha- SGI Indigo2 40 47 38 104 nism. DEC Alpha@150 53 68 59 134 SGI Challenge 63 80 69 93 Pipe latency is measured by creating a pair of Sun SC1000 107 142 104 197 pipes, forking a child process, and passing a word back and forth. This benchmark is identical to the Table 10. Context switch time (microseconds) two-process, zero-sized context switch benchmark, except that it includes both the context switching time We picked four points on the graph and extracted and the pipe overhead in the results. Table 11 shows those values for Table 10. The complete set of values, the round trip latency from process A to process B and as well as tools to graph them, are included with back to process A. lmbench. System Pipe latency Note that multiprocessor context switch times are frequently more expensive than uniprocessor context Linux/i686 26 switch times. This is because multiprocessor operat- Linux/i586 33 ing systems tend to have very complicated scheduling Linux/Alpha 34 Sun Ultra1 62 code. We believe that multiprocessor context switch IBM PowerPC 65 times can be, and should be, within 10% of the unipro- Unixware/i686 70 cessor times. DEC Alpha@300 71 Linux does quite well on context switching, espe- HP K210 78 cially on the more recent architectures. By comparing IBM Power2 91 the Linux 2 0K processes to the Linux 2 32K pro- Solaris/i686 101 cesses, it is apparent that there is something wrong FreeBSD/i586 104 SGI Indigo2 131 with the Linux/i586 case. If we look back to Table 6, DEC Alpha@150 179 we can find at least part of the cause. The second SGI Challenge 251 level cache latency for the i586 is substantially worse Sun SC1000 278 than either the i686 or the Alpha. Given the poor second level cache behavior of the Table 11. Pipe latency (microseconds) PowerPC, it is surprising that it does so well on con- text switches, especially the larger sized cases. The time can be broken down to two context switches plus four system calls plus the pipe overhead. The context switch component is two of the small pro- cesses in Table 10. This benchmark is identical to the System UDP RPC/UDP context switch benchmark in [Ousterhout90]. Linux/i686 93 180 Sun Ultra1 197 267 • TCP and RPC/TCP latency. 
• TCP and RPC/TCP latency. TCP sockets may be viewed as an interprocess communication mechanism similar to pipes, with the added feature that TCP sockets work across machine boundaries.

TCP and RPC/TCP connections are frequently used in low-bandwidth, latency-sensitive applications. The default Oracle distributed lock manager uses TCP sockets, and the locks per second available from this service are accurately modeled by the TCP latency test.

Sun's RPC is layered either over TCP or over UDP. The RPC layer is responsible for managing connections (the port mapper), managing different byte orders and word sizes (XDR), and implementing a remote procedure call abstraction. Table 12 shows the same benchmark with and without the RPC layer to show the cost of the RPC implementation.

TCP latency is measured by having a server process that waits for connections and a client process that connects to the server. The two processes then exchange a word between them in a loop. The latency reported is one round-trip time. The measurements in Table 12 are local or loopback measurements, since our intent is to show the overhead of the software. The same benchmark may be, and frequently is, used to measure host-to-host latency.

Note that the RPC layer frequently adds hundreds of microseconds of additional latency. The problem is not the external data representation (XDR) layer: the data being passed back and forth is a byte, so there is no XDR to be done. There is no justification for the extra cost; it is simply an expensive implementation. DCE RPC is worse.

System              TCP    RPC/TCP
Linux/i686          216        346
Sun Ultra1          162        346
DEC Alpha@300       267        371
FreeBSD/i586        256        440
Solaris/i686        305        528
Linux/Alpha         429        602
HP K210             146        606
SGI Indigo2         278        641
IBM Power2          332        649
IBM PowerPC         299        698
Linux/i586          467        713
DEC Alpha@150       485        788
SGI Challenge       546        900
Sun SC1000          855       1386

Table 12. TCP latency (microseconds)
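The shape of the TCP latency measurement is sketched below. This is illustrative only, not the lmbench source: the port number and iteration count are arbitrary, the server is forked in-process for brevity, and short reads are ignored since the message is a single word on a loopback connection.

/*
 * Sketch of a loopback TCP round-trip latency measurement
 * (illustrative only, not the lmbench source).  A forked server
 * echoes a one-word message; the client times round trips.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/time.h>

#define PORT   31234            /* arbitrary port for the sketch */
#define ROUNDS 10000

int main(void)
{
    struct sockaddr_in addr;
    struct timeval start, stop;
    int word = 0, i, sock;
    double usecs;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(PORT);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (fork() == 0) {          /* server: accept one client and echo */
        int one = 1, lsock = socket(AF_INET, SOCK_STREAM, 0), s;

        setsockopt(lsock, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
        listen(lsock, 1);
        s = accept(lsock, NULL, NULL);
        for (i = 0; i < ROUNDS; i++) {
            read(s, &word, sizeof(word));
            write(s, &word, sizeof(word));
        }
        exit(0);
    }

    sleep(1);                   /* crude: give the server time to listen */
    sock = socket(AF_INET, SOCK_STREAM, 0);
    connect(sock, (struct sockaddr *)&addr, sizeof(addr));

    gettimeofday(&start, NULL);
    for (i = 0; i < ROUNDS; i++) {
        write(sock, &word, sizeof(word));
        read(sock, &word, sizeof(word));
    }
    gettimeofday(&stop, NULL);

    usecs = (stop.tv_sec - start.tv_sec) * 1e6 +
            (stop.tv_usec - start.tv_usec);
    printf("TCP latency: %.1f microseconds per round trip\n",
           usecs / ROUNDS);
    return 0;
}

Pointing the client at a remote machine instead of the loopback address turns the same sketch into a host-to-host latency measurement, as described above.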
• UDP and RPC/UDP latency. UDP sockets are an alternative to TCP sockets. They differ in that UDP sockets are unreliable messages that leave the retransmission issues to the application. UDP sockets have a few advantages, however. They preserve message boundaries, whereas TCP does not; and a single UDP socket may send messages to any number of other sockets, whereas TCP sends data to only one place.

UDP and RPC/UDP messages are commonly used in many client/server applications. NFS is probably the most widely used RPC/UDP application in the world.

Like TCP latency, UDP latency is measured by having a server process that waits for connections and a client process that connects to the server. The two processes then exchange a word between them in a loop. The latency reported is the round-trip time. The measurements in Table 13 are local or loopback measurements, since our intent is to show the overhead of the software. Again, note that the RPC library can add hundreds of microseconds of extra latency.

System              UDP    RPC/UDP
Linux/i686           93        180
Sun Ultra1          197        267
Linux/Alpha         180        317
DEC Alpha@300       259        358
Linux/i586          187        366
FreeBSD/i586        212        375
Solaris/i686        348        454
IBM Power2          254        531
IBM PowerPC         206        536
HP K210             152        543
SGI Indigo2         313        671
DEC Alpha@150       489        834
SGI Challenge       678        893
Sun SC1000          739       1101

Table 13. UDP latency (microseconds)

• Network latency. We have a few results for over-the-wire latency, included in Table 14. As might be expected, the most heavily used network interfaces (i.e., Ethernet) have the lowest latencies. The times shown include the time on the wire, which is about 130 microseconds for 10Mbit Ethernet, 13 microseconds for 100Mbit Ethernet and FDDI, and less than 10 microseconds for Hippi.

System                Network     TCP latency    UDP latency
Sun Ultra1            100baseT        280            308
FreeBSD/i586          100baseT        365            304
HP 9000/735           fddi            425            441
SGI Indigo2           10baseT         543            602
HP 9000/735           10baseT         592            603
SGI PowerChallenge    hippi          1068           1099
Linux/i586@90MHz      10baseT        2954           1912

Table 14. Remote latencies (microseconds)

• TCP connection latency. TCP is a connection-based, reliable, byte-stream-oriented protocol. As part of this reliability, a connection must be established before any data can be transferred. The connection is accomplished by a ‘‘three-way handshake,’’ an exchange of packets when the client attempts to connect to the server.

Unlike UDP, where no connection is established, TCP sends packets at startup time. If an application creates a TCP connection to send one message, then the startup time can be a substantial fraction of the total connection and transfer costs. The benchmark shows that the connection cost is approximately half of the total cost.

Connection cost is measured by having a server, registered using the port mapper, waiting for connections. The client figures out where the server is registered and then repeatedly times a connect system call to the server. The socket is closed after each connect. Twenty connects are completed and the fastest of them is used as the result. The time measured will include two of the three packets that make up the three-way TCP handshake, so the cost is actually greater than the times listed.

System           TCP connection
HP K210                238
Linux/i686             263
IBM Power2             339
FreeBSD/i586           418
Linux/i586             606
SGI Indigo2            667
SGI Challenge          716
Sun Ultra1             852
Solaris/i686          1230
Sun SC1000            3047

Table 15. TCP connect latency (microseconds)

Table 15 shows that if the need is to send a quick message to another process, given that most packets get through, a UDP message will cost a send and a reply (if positive acknowledgments are needed, which they are in order to have an apples-to-apples comparison with TCP). If the transmission medium is 10Mbit Ethernet, the time on the wire will be approximately 65 microseconds each way, or 130 microseconds total. To do the same thing with a short-lived TCP connection would cost 896 microseconds of wire time alone.

The comparison is not meant to disparage TCP; TCP is a useful protocol. Nor is the point to suggest that all messages should be UDP. In many cases, the difference between 130 microseconds and 900 microseconds is insignificant compared with other aspects of application performance. However, if the application is very latency sensitive and the transmission medium is slow (such as a serial link or a message through many routers), then a UDP message may prove cheaper.
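The connect-timing loop described above can be sketched roughly as follows. It is an illustration, not the lmbench source; in particular it takes the server's address and port on the command line instead of locating the server through the port mapper.

/*
 * Sketch of a TCP connection-cost measurement (illustrative, not the
 * lmbench source, which registers the server via the port mapper).
 * The server's address and port are taken from the command line and
 * the fastest of twenty connects is reported.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    struct sockaddr_in addr;
    double best = 1e9;
    int i;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <ip-address> <port>\n", argv[0]);
        exit(1);
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(atoi(argv[2]));
    inet_pton(AF_INET, argv[1], &addr.sin_addr);

    for (i = 0; i < 20; i++) {
        struct timeval start, stop;
        double usecs;
        int sock = socket(AF_INET, SOCK_STREAM, 0);

        gettimeofday(&start, NULL);
        if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            exit(1);
        }
        gettimeofday(&stop, NULL);
        close(sock);            /* a brand new connection every iteration */

        usecs = (stop.tv_sec - start.tv_sec) * 1e6 +
                (stop.tv_usec - start.tv_usec);
        if (usecs < best)
            best = usecs;
    }
    printf("TCP connect latency: %.1f microseconds (best of 20)\n", best);
    return 0;
}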
6.8. File system latency

File system latency is defined as the time required to create or delete a zero-length file. We define it this way because in many file systems, such as the BSD fast file system, the directory operations are done synchronously in order to maintain on-disk integrity. Since the file data is typically cached and sent to disk at some later date, file creation and deletion become the bottleneck seen by an application. This bottleneck is substantial: a synchronous update to a disk is a matter of tens of milliseconds. In many cases, this bottleneck is much more of a perceived performance issue than processor speed.

The benchmark creates 1,000 zero-sized files and then deletes them. All the files are created in one directory and their names are short, such as "a", "b", "c", ... "aa", "ab", ....

System           FS        Create    Delete
Linux/i686       EXT2FS       751        45
HP K210          HFS          579        67
Linux/i586       EXT2FS     1,114        95
Linux/Alpha      EXT2FS       834       115
Unixware/i686    UFS          450       369
SGI Challenge    XFS        3,508     4,016
DEC Alpha@300    ADVFS      4,255     4,184
Solaris/i686     UFS       23,809     7,246
Sun Ultra1       UFS       18,181     8,333
Sun SC1000       UFS       25,000    11,111
FreeBSD/i586     UFS       28,571    11,235
SGI Indigo2      EFS       11,904    11,494
DEC Alpha@150    ?         38,461    12,345
IBM PowerPC      JFS       12,658    12,658
IBM Power2       JFS       13,333    12,820

Table 16. File system latency (microseconds)

The create and delete latencies are shown in Table 16. Notice that Linux does extremely well here, 2 to 3 orders of magnitude faster than the slowest systems. However, Linux does not guarantee anything about the disk integrity; the directory operations are done in memory. Other fast systems, such as SGI's XFS, use a log to guarantee the file system integrity. The slower systems, all those with ~10 millisecond file latencies, are using synchronous writes to guarantee the file system integrity. Unless Unixware has modified UFS substantially, they must be running in an unsafe mode, since the FreeBSD UFS is much slower and both file systems are basically the 4BSD fast file system.
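A sketch of this benchmark, using only creat, close, and unlink, is shown below; the file-naming scheme here is a simplification of the one described above.

/*
 * Sketch of the file creation/deletion measurement (illustrative, not
 * the lmbench source).  It creates 1,000 zero-length files in the
 * current directory and then removes them, reporting the average cost
 * of each operation.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define NFILES 1000

static double usecs(struct timeval *a, struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) * 1e6 + (b->tv_usec - a->tv_usec);
}

int main(void)
{
    struct timeval t0, t1, t2;
    char name[32];
    int i, fd;

    gettimeofday(&t0, NULL);
    for (i = 0; i < NFILES; i++) {      /* create zero-length files */
        snprintf(name, sizeof(name), "f%d", i);
        fd = creat(name, 0666);
        if (fd < 0) { perror("creat"); exit(1); }
        close(fd);
    }
    gettimeofday(&t1, NULL);
    for (i = 0; i < NFILES; i++) {      /* now delete them */
        snprintf(name, sizeof(name), "f%d", i);
        unlink(name);
    }
    gettimeofday(&t2, NULL);

    printf("create: %.0f usec/file, delete: %.0f usec/file\n",
           usecs(&t0, &t1) / NFILES, usecs(&t1, &t2) / NFILES);
    return 0;
}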
6.9. Disk latency

Included with lmbench is a small benchmarking program useful for measuring disk and file I/O. lmdd, which is patterned after the Unix utility dd, measures both sequential and random I/O, optionally generates patterns on output and checks them on input, supports flushing the data from the buffer cache on systems that support msync, and has a very flexible user interface. Many I/O benchmarks can be trivially replaced with a perl script wrapped around lmdd.

While we could have generated both sequential and random I/O results as part of this paper, we did not, because those benchmarks are heavily influenced by the performance of the disk drives used in the test. We intentionally measure only the system overhead of a SCSI command, since that overhead may become a bottleneck in large database configurations.

Some important applications, such as transaction processing, are limited by random disk I/O latency. Administrators can increase the number of disk operations per second by buying more disks, until the processor overhead becomes the bottleneck. The lmdd benchmark measures the processor overhead associated with each disk operation, and it can provide an upper bound on the number of disk operations the processor can support. It is designed for SCSI disks, and it assumes that most disks have 32-128K read-ahead buffers and that they can read ahead faster than the processor can request the chunks of data. (This may not always be true: a processor could be fast enough to make the requests faster than the rotating disk. If we take 6MB/second to be the disk speed and divide that by 512 bytes, the minimum transfer size, that is 12,288 I/Os per second, or 81 microseconds per I/O. We do not know of any processor/OS/I/O controller combination that can do an I/O in 81 microseconds.)

The benchmark simulates a large number of disks by reading 512-byte transfers sequentially from the raw disk device (raw disks are unbuffered and are not read ahead by Unix). Since the disk can read ahead faster than the system can request data, the benchmark is doing small transfers of data from the disk's track buffer. Another way to look at this is that the benchmark is doing memory-to-memory transfers across a SCSI channel. It is possible to generate loads of more than 1,000 SCSI operations per second on a single SCSI disk. For comparison, disks under database load typically run at 20-80 operations per second.

System           Disk latency
SGI Challenge         920
SGI Indigo2           984
HP K210              1103
DEC Alpha@150        1436
Sun SC1000           1466
Sun Ultra1           2242

Table 17. SCSI I/O overhead (microseconds)

The resulting overhead number represents a lower bound on the overhead of a disk I/O. The real overhead numbers will be higher on SCSI systems because most SCSI controllers will not disconnect if the request can be satisfied immediately. During the benchmark, the processor simply sends the request and transfers the data, while during normal operation the processor will send the request, disconnect, get interrupted, reconnect, and transfer the data.

This technique can be used to discover how many drives a system can support before the system becomes CPU-limited, because it can produce the overhead load of a fully configured system with just a few disks.
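Stripped of lmdd's options, the core of the measurement is a timed loop of small sequential reads from the raw device. The sketch below is illustrative only; the device path and iteration count are arguments or assumptions, and on some systems raw I/O requires a page-aligned buffer, which this sketch does not arrange.

/*
 * Sketch of the SCSI overhead measurement (illustrative, not lmdd).
 * It reads 512-byte records sequentially from a raw disk device and
 * reports microseconds per operation.  The device must be a raw
 * (unbuffered) device for the result to mean what the text describes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define NREADS 10000

int main(int argc, char **argv)
{
    struct timeval start, stop;
    char buf[512];
    double usecs;
    int fd, i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <raw-disk-device>\n", argv[0]);
        exit(1);
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror(argv[1]); exit(1); }

    gettimeofday(&start, NULL);
    for (i = 0; i < NREADS; i++) {
        if (read(fd, buf, sizeof(buf)) != sizeof(buf)) {
            perror("read");
            exit(1);
        }
    }
    gettimeofday(&stop, NULL);

    usecs = (stop.tv_sec - start.tv_sec) * 1e6 +
            (stop.tv_usec - start.tv_usec);
    printf("%.0f microseconds per 512-byte raw read\n", usecs / NREADS);
    return 0;
}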
7. Future work

There are several known improvements and extensions that could be made to lmbench.

• Memory latency. The current benchmark measures clean-read latency. By clean, we mean that the cache lines being replaced are highly likely to be unmodified, so there is no associated write-back cost. We would like to extend the benchmark to measure dirty-read latency, as well as write latency. Other changes include making the benchmark impervious to sequential prefetching and measuring TLB miss cost.

• MP benchmarks. None of the benchmarks in lmbench is designed to measure any multiprocessor features directly. At a minimum, we could measure cache-to-cache latency as well as cache-to-cache bandwidth.

• Static vs. dynamic processes. In the process creation section, we allude to the cost of starting up processes that use shared libraries. When we figure out how to create statically linked processes on all or most systems, we could quantify these costs exactly.

• McCalpin's stream benchmark. We will probably incorporate part or all of this benchmark into lmbench.

• Automatic sizing. We have enough technology that we could determine the size of the external cache and autosize the memory used such that the external cache had no effect.

• More detailed papers. There are several areas that could yield some interesting papers. The memory latency section could use an in-depth treatment, and the context switching section could turn into an interesting discussion of caching technology.

8. Conclusion

lmbench is a useful, portable micro-benchmark suite designed to measure important aspects of system performance. We have found that a good memory subsystem is at least as important as the processor speed. As processors get faster and faster, more and more of the system design effort will need to move to the cache and memory subsystems.

9. Acknowledgments

Many people have provided invaluable help and insight into both the benchmarks themselves and the paper. The USENIX reviewers were especially helpful. We thank all of them and especially thank: Ken Okin (SUN), Kevin Normoyle (SUN), Satya Nishtala (SUN), Greg Chesson (SGI), John Mashey (SGI), Neal Nuckolls (SGI), John McCalpin (Univ. of Delaware), Ron Minnich (Sarnoff), Chris Ruemmler (HP), Tom Rokicki (HP), and John Weitz (Digidesign).

We would also like to thank all of the people that have run the benchmark and contributed their results; none of this would have been possible without their assistance.

Our thanks to all of the free software community for tools that were used during this project. lmbench is currently developed on Linux, a copylefted Unix written by Linus Torvalds and his band of happy hackers. This paper and all of the lmbench documentation was produced using the groff suite of tools written by James Clark. Finally, all of the data processing of the results is done with perl, written by Larry Wall.

Sun Microsystems, and in particular Paul Borrill, supported the initial development of this project. Silicon Graphics has supported ongoing development that turned into far more time than we ever imagined. We are grateful to both of these companies for their financial support.

10. Obtaining the benchmarks

The benchmarks are available at http://reality.sgi.com/employees/lm_engr/lmbench.tgz as well as via a mail server. You may request the latest version of lmbench by sending email to [email protected] with lmbench-current* as the subject.

References

[Chen94] P. M. Chen and D. A. Patterson, ‘‘A new approach to I/O performance evaluation − self-scaling I/O benchmarks, predicted I/O performance,’’ Transactions on Computer Systems, 12 (4), pp. 308-339, November 1994.

[Chen93] Peter M. Chen and David Patterson, ‘‘Storage performance − metrics and benchmarks,’’ Proceedings of the IEEE, 81 (8), pp. 1151-1165, August 1993.

[Fenwick95] David M. Fenwick, Denis J. Foley, William B. Gist, Stephen R. VanDoren, and Danial Wissell, ‘‘The AlphaServer 8000 series: high-end server platform development,’’ Digital Technical Journal, 7 (1), pp. 43-65, August 1995.

[Hennessy96] John L. Hennessy and David A. Patterson, ‘‘Computer Architecture: A Quantitative Approach, 2nd Edition,’’ Morgan Kaufman, 1996.

[McCalpin95] John D. McCalpin, ‘‘Memory bandwidth and machine balance in current high performance computers,’’ IEEE Technical Committee on Computer Architecture newsletter, to appear, December 1995.

[McVoy91] L. W. McVoy and S. R. Kleiman, ‘‘Extent-like Performance from a Unix File System,’’ pp. 33-43, Proceedings USENIX Winter Conference, January 1991.

[Ousterhout90] John K. Ousterhout, ‘‘Why aren't operating systems getting faster as fast as hardware?,’’ pp. 247-256, Proceedings USENIX Summer Conference, June 1990.

[Park90] Arvin Park and J. C. Becker, ‘‘IOStone: a synthetic file system benchmark,’’ Computer Architecture News, 18 (2), pp. 45-52, June 1990.

[Saavedra95] R. H. Saavedra and A. J. Smith, ‘‘Measuring cache and TLB performance and their effect on benchmark runtimes,’’ IEEE Transactions on Computers, 44 (10), pp. 1223-1235, October 1995.

[Stallman89] Free Software Foundation, Richard Stallman, ‘‘General Public License,’’ 1989. Included with lmbench.

[Toshiba94] Toshiba, ‘‘DRAM Components and Modules,’’ pp. A59-A77, C37-C42, Toshiba America Electronic Components, Inc., 1994.

[Wolman89] Barry L. Wolman and Thomas M. Olson, ‘‘IOBENCH: a system independent IO benchmark,’’ Computer Architecture News, 17 (5), pp. 55-70, September 1989.
Biographical information

Larry McVoy currently works for Silicon Graphics in the Networking Systems Division on high performance networked file systems and networking architecture. His computer interests include hardware architecture, software implementation and architecture, performance issues, and free (GPLed) software issues. Previously at Sun, he was the architect for the SPARC Cluster product line, redesigned and wrote an entire source management system (now productized as TeamWare), implemented UFS clustering, and implemented all of the Posix 1003.1 support in SunOS 4.1. Concurrent with his Sun work, he lectured at Stanford University on Operating Systems. Before Sun, he worked on the ETA Systems supercomputer Unix port. He may be reached electronically via [email protected] or by phone at (415) 933-1804.

Carl Staelin works for Hewlett-Packard Laboratories in the External Research Program. His research interests include network information infrastructures and high performance storage systems. He worked for HP at U.C. Berkeley on the 4.4BSD LFS port, the HighLight hierarchical storage file system, the Mariposa distributed database, and the NOW project. He received his PhD in Computer Science from Princeton University in 1991 in high performance file system design. He may be reached electronically via [email protected].