A Survey of Virtualization Workloads
Andrew Theurer, Karl Rister, Steve Dobbelstein
IBM Linux Technology Center
[email protected] [email protected] [email protected]

Abstract

We survey several virtualization benchmarks, including benchmarks from different hardware and software vendors, comparing their strengths and weaknesses. We also cover the in-progress development of a new virtualization benchmark by a well-known performance evaluation group. We evaluate each benchmark's ease of use, accuracy, and method of quantifying virtualization performance. For each benchmark, we also detail the areas of a virtualization solution that it stresses. In this study, we use Linux where applicable, but also use other operating systems when necessary.

1 Introduction

Although the concept of virtualization is not new [1], there is a recent surge of interest in exploiting it. Virtualization can help with several challenges in computing today, from host and guest management to reduced energy consumption, improved reliability, and serviceability. There are now several virtualization offerings, such as VMware® ESX [2], IBM PowerVM™, Xen [3][4] technology from Citrix, Virtual Iron, Red Hat, and SUSE, and Microsoft® Windows Server® 2008 Hyper-V [5]. As the competition heats up, we are observing a growth of performance competitiveness across these vendors, yielding "marketing collateral" in the form of benchmark publications.

1.1 Why are Virtualization Benchmarks Different?

A key difference in benchmarking a virtualization-based solution is that a hypervisor is included. The hypervisor is responsible for sharing the hardware resources among one or more guests in a safe way. The use of a hypervisor and the sharing of hardware can introduce overhead. One of the goals of benchmarking virtualization is to quantify this overhead and, ideally, to show that virtualization solution X has lower overhead than virtualization solution Y.

Another difference is that virtualization benchmark scenarios can be very different from those run without virtualization. Server consolidation is one such scenario: it is not typically benchmarked without a hypervisor (though that is not out of the realm of possibility; containers may be used, for example). Server consolidation benchmarks strive to show how effectively a virtualization solution can host many guests. Since many guests can be involved, such a scenario may require several benchmarks running concurrently, a concept that is not common in traditional benchmarks.

2 Recently Published Benchmarks

The following are virtualization benchmarks with published specifications and run rules that users can replicate in their own environments. These benchmarks strive to set a standard for virtualization benchmarking. In this section, we discuss their strengths and weaknesses.

2.1 vConsolidate

The vConsolidate benchmark [6] was developed by Intel® to measure the performance of a system running consolidated workloads. As one of the earlier proposals for a virtualization benchmark, vConsolidate was written to prompt the industry to discuss how the performance of a system running with virtualization should be measured.

vConsolidate runs a benchmark for a web server, a mail server, a database server, and a Java™ server, each in a separate guest. There is also a guest that runs no benchmark, which is meant to simulate an idle server. These five guests make up a consolidation stack unit, or CSU, as illustrated in Figure 1.

[Figure 1: Consolidation stack units. Each CSU consists of five guests: Web, Mail, Database, Java, and Idle.]

The tester starts by running 1 CSU, obtaining the benchmark score and the processor utilization of the system. The tester does three iterations and then uses the median score and its processor utilization. The tester incrementally adds CSUs, recording the benchmark score and processor utilization, until the benchmark score for the set of N CSUs is less than the score for N-1 CSUs, or until all the system processors are fully utilized. The final benchmark score is the maximum of the reported scores, along with the number of CSUs and the processor utilization for that score.

The vConsolidate benchmark score is calculated by first summing each of the component benchmark scores across the individual CSUs. Each sum is then normalized against the score of that benchmark running on a reference system, giving a ratio of the sum to the score of the reference platform. The reference system scores can be obtained from a 1-CSU run on any system; it is Intel's desire to define a "golden" reference system for each profile. The vConsolidate score for the test run is the geometric mean of the ratios for each of the benchmarks. Figure 2 shows sample results from a vConsolidate test run. In this example, the maximum score was achieved at 4 CSUs with a processor utilization of 78.3%.

[Figure 2: Sample results for vConsolidate. Benchmark score (left axis, 0-5) and processor utilization (right axis, 0-100%) for runs of 1 through 5 CSUs.]
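To make the run rules and scoring concrete, the following is a minimal Python sketch of the procedure described above. It is not part of vConsolidate: the run_csu_iteration() driver, the reference scores, and the CPU-saturation threshold are hypothetical stand-ins. Only the median-of-three-iterations logic, the stop conditions, and the geometric mean of normalized workload sums follow the text.

    from math import prod

    WORKLOADS = ["web", "mail", "database", "java"]

    # Hypothetical reference scores, obtained from a 1-CSU run (any system
    # may serve as the reference until a "golden" system is defined).
    REFERENCE = {"web": 1500.0, "mail": 950.0, "database": 420.0, "java": 18000.0}

    def vconsolidate_score(per_csu_scores):
        """Sum each workload's score across all CSUs, normalize each sum
        against the reference score, and return the geometric mean."""
        ratios = []
        for workload in WORKLOADS:
            total = sum(csu[workload] for csu in per_csu_scores)
            ratios.append(total / REFERENCE[workload])
        return prod(ratios) ** (1.0 / len(ratios))

    def run_benchmark(run_csu_iteration, max_csus=8, cpu_saturated=99.0):
        """Add CSUs until the score drops below the previous level or the
        processors are fully utilized; report the best result found."""
        best = (0.0, 0, 0.0)  # (score, CSU count, CPU utilization)
        prev_score = 0.0
        for n in range(1, max_csus + 1):
            # Three iterations per CSU count; keep the run with the median score.
            runs = sorted((run_csu_iteration(n) for _ in range(3)),
                          key=lambda r: vconsolidate_score(r["scores"]))
            median = runs[1]
            score = vconsolidate_score(median["scores"])
            if score > best[0]:
                best = (score, n, median["cpu_util"])
            if score < prev_score or median["cpu_util"] >= cpu_saturated:
                break
            prev_score = score
        return best

Here run_csu_iteration(n) is assumed to run all benchmarks for n CSUs and return a dictionary with "scores" (a list of per-CSU, per-workload scores) and "cpu_util". For the sample run in Figure 2, this loop would proceed through 5 CSUs, detect the score drop after 4, and report the 4-CSU result.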
The reporting of processor utilization along with the score is not common; most standard benchmarks simply report the benchmark score and are not concerned with processor utilization. Processor utilization, however, is a useful metric in characterizing the performance of the system running the consolidated workloads. It can also be useful in spotting performance issues in other areas of the system (for example, disk I/O or network), such as when the score starts dropping off before the processors are fully utilized, as seen in Figure 2.

2.1.1 Benchmark Implementation

vConsolidate specifies which benchmark is run for each of the workloads: WebBench™ [7] from PC Magazine for the web server, Exchange Server Load Simulator (LoadSim) [8] from Microsoft for the mail server, SysBench [9] for the database server, and a slightly modified version of SPECjbb2005 [10] for the Java server.

WebBench is implemented with two programs, a controller and a client. The controller coordinates the running of the WebBench client(s). vConsolidate uses only one client program, which runs eight engine threads. The client and the controller can run on the same system because neither is processor intensive. The only interface to WebBench is through its GUI, making it difficult to automate.

vConsolidate indirectly specifies which mail server to run: Microsoft's LoadSim works only against a Microsoft Exchange Server; therefore, the mail server must be Exchange Server running on the Windows operating system. Although it runs in a GUI, LoadSim can be easily automated because it can be started from the command line.

vConsolidate runs a version of SysBench that has been modified to print a period after a certain number of database transactions. The output is redirected to a file, which is processed by vConsolidate to calculate the throughput score.

vConsolidate has instructions for modifying the SPECjbb2005 source to add a think time, so that the test does not run full bore, and to have it print some statistics at a periodic interval. The output is then redirected to a file that is processed by vConsolidate to calculate the throughput score. The benchmark specifies the version of Java to run: BEA® JRockit® 5.0.
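Both modified workloads are thus scored by post-processing redirected output files. As an illustration only, here is how a throughput score could be derived for the SysBench case; the batch size per printed period, the file format, and the function name are assumptions, since the actual constants and processing logic are defined by the vConsolidate kit.

    # Assumed batch size: the modified SysBench prints one '.' per batch
    # of database transactions. The real value comes from the vConsolidate kit.
    TRANSACTIONS_PER_DOT = 100

    def sysbench_throughput(path, elapsed_seconds):
        """Count the periods in a redirected SysBench output file and
        convert the count to transactions per second."""
        with open(path) as f:
            dots = f.read().count(".")
        return dots * TRANSACTIONS_PER_DOT / elapsed_seconds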
One observation about the benchmark implementation is that most of the workloads are modified or configured such that the benchmark does not run all out, but instead simulates a server running with a given load. LoadSim is configured to run with 500 users. SysBench is configured to run only four threads. SPECjbb2005 is modified to add a delay. WebBench, however, is not modified to limit its load: its delay time and think time are both zero. This may be an oversight in the benchmark configuration, or it may be that WebBench cannot saturate the network and/or the processors on the server even with no delay and no think time; that is, WebBench may generate only a moderate load even with the delay and think time set to zero.

Certain components of vConsolidate are portable to other applications and other operating systems. WebBench makes standard HTTP requests; consequently, it does not matter which web server software is used, nor the OS on which it runs.

Some are of the mind that the benchmark should specify the software and benchmarks used so that fair comparisons can be made between different hardware platforms. Others hold that specifying the software and benchmarks favors the software selected, does not allow for comparisons between software stacks, and reduces the openness of the benchmark.

2.1.2 Running the Benchmark

As mentioned above, the test setup requires client machines to run LoadSim and to run the WebBench controller and client. In the current version of vConsolidate (1.1), each client runs one instance of LoadSim and one instance of the WebBench controller and client. Essentially, one machine runs all of the client drivers for one CSU. This is an improvement over the previous version of vConsolidate, which specified separate clients for LoadSim and WebBench. Having one client machine per CSU makes for an easier test setup, not to mention the savings in hardware, energy, and rack space.

The test setup also requires one machine to run the vConsolidate controller. The controller coordinates the start and stop of each test run so that all benchmarks run