Towards Better Performance Per Watt in Virtual Environments on Asymmetric Single-ISA Multi-core Systems

Viren Kumar and Alexandra Fedorova
Simon Fraser University
8888 University Dr, Vancouver, Canada
[email protected], [email protected]

ABSTRACT

Single-ISA heterogeneous multicore architectures promise to deliver plenty of cores with varying complexity, speed and performance in the near future. Virtualization enables multiple operating systems to run concurrently as distinct, independent guest domains, thereby reducing core idle time and maximizing throughput. This paper seeks to identify a heuristic that can aid in intelligently scheduling these virtualized workloads to maximize performance while reducing power consumption.

We propose that the controlling domain in a Virtual Machine Monitor or hypervisor is relatively insensitive to changes in core frequency, and thus scheduling it on a slower core saves power while only slightly affecting guest domain performance. We test and validate our hypothesis and further propose a metric, the Combined Usage of a domain, to assist in future energy-efficient scheduling. Our preliminary findings show that the Combined Usage metric can be used as a starting point to gauge the sensitivity of a guest domain to variations in the controlling domain's frequency.

Categories and Subject Descriptors
D.4.1 [Operating Systems]: Process Management—multiprocessing, scheduling

General Terms
Measurement, Performance, Experimentation

Keywords
virtualization, performance-asymmetric multicore architectures, performance per watt

1. INTRODUCTION

Asymmetric single-ISA (ASISA) multicore processors [21][12] (also known as heterogeneous) can potentially deliver greater performance per watt than homogeneous multicore processors. As power consumption in data centers becomes a growing concern [3], deploying ASISA multicore systems is an increasingly attractive opportunity. These systems perform at their best if application workloads are assigned to heterogeneous cores in consideration of their runtime properties [4][13][12][18][24][21]. Therefore, understanding how to schedule data-center workloads on ASISA systems is an important problem. This paper takes the first step towards understanding the properties of workloads that determine how they should be scheduled on ASISA multicore processors. Since virtual machine technology is a de facto standard for data centers, we study virtual machine (VM) workloads.

ASISA multicore systems will consist of several cores exposing the same ISA but differing in features, complexity, power consumption, and performance. Most likely, they will feature a handful of complex cores running at high frequency and consuming a relatively high amount of power, and a larger number of simple cores running at low frequency and consuming relatively little power [1][13]. The motivation behind ASISA processors is to deliver a higher performance-per-watt ratio. While some workloads experience an increase in performance proportional to the increase in power consumption when they run on a high-frequency core as opposed to a low-frequency core, other workloads experience only a marginal gain in performance while consuming a lot more power. To maximize performance per watt, workloads must be assigned to heterogeneous cores in ASISA systems in consideration of their runtime properties. Workloads whose performance is the most sensitive to the frequency of the core (i.e., those that experience the largest relative performance gain on a high-frequency core vs. a low-frequency core) should be assigned to run on high-frequency cores, while workloads that are less sensitive should run on low-frequency cores.
Several recent studies have demonstrated that this scheduling strategy improves performance per watt [4][13][12][18][24][21].
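As a rough illustration of this policy, the sketch below ranks workloads by their relative performance gain on a fast core and assigns the most sensitive ones to the high-frequency cores. It is a minimal sketch of the idea, not the scheduler studied in this paper; all workload names and numbers are hypothetical.

    # Illustrative sketch: rank workloads by frequency sensitivity and map
    # the most sensitive ones to the fast cores. A real scheduler would
    # measure performance online rather than take it as input.

    def sensitivity(perf_fast, perf_slow):
        """Relative performance gain from running on a high-frequency core."""
        return (perf_fast - perf_slow) / perf_slow

    def assign(workloads, n_fast_cores):
        """workloads maps a name to (perf_fast, perf_slow) measurements."""
        ranked = sorted(workloads,
                        key=lambda w: sensitivity(*workloads[w]),
                        reverse=True)
        return ranked[:n_fast_cores], ranked[n_fast_cores:]

    # Hypothetical throughput, normalized to 100 on the high-frequency core.
    workloads = {
        "spec-cpu": (100.0, 58.0),  # large gain from a fast core
        "netperf":  (100.0, 75.0),
        "ab":       (100.0, 86.0),  # dom0-bound, gains little
    }
    fast, slow = assign(workloads, n_fast_cores=1)
    print("fast cores:", fast, "| slow cores:", slow)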

In this paper, we study properties of virtualized workloads that shed light on how these workloads should be scheduled on ASISA systems. We hope to use our findings to design new scheduling algorithms in virtual machine hypervisors.

As our experimental virtual platform we use the Xen hypervisor [2], a popular system used in data centers. Xen has a single controlling domain, henceforth also referred to as dom0, and many guest domains, also referred to as domUs. The controlling domain in Xen executes code on behalf of the guest domains, usually when the guest domains need access to devices, such as the network interface (NIC) or disk. Frequent device access in dom0 implies several properties about workloads dependent on dom0: (1) in many cases the device, and not the CPU, is the bottleneck, so running such a workload on a slower CPU may not hurt overall performance as much as for other workloads; (2) device access involves frequent execution of system code and handling of interrupts, and according to previous work this kind of workload is less sensitive to changes in CPU performance than other workloads [18]. To this end, we hypothesize that the workloads executed by dom0 are less sensitive to changes in CPU frequency than domU (CPU-bound) workloads. The main contribution of our work is the test of this hypothesis.

We measured the performance of a large number of virtual workloads running on Xen on cores running at various frequencies and computed their sensitivity to frequency changes. To vary the frequency of the cores, we used dynamic voltage and frequency scaling (DVFS) available on our AMD Barcelona system. We found that our hypothesis holds: in general, the performance of workloads that depend on dom0 is less sensitive to variations in CPU performance than that of workloads that do not depend on dom0. However, the degree of sensitivity depends on the nature of the guest workload that drives dom0. By accurately identifying the dom0-dependent workloads that exhibit the lowest sensitivity, we can schedule them on low-frequency cores and maximize the performance-per-watt ratio. Therefore, another contribution of this work is the discovery of heuristics that help us identify the least sensitive dom0-dependent workloads.

The rest of the paper is structured as follows. Section 2 details our methodology and experimental environment. Section 3 presents the experimental results. Section 4 discusses related work. Section 5 summarizes our contributions and discusses future work.

2. METHODOLOGY AND EXPERIMENTS

To test our hypothesis we selected a range of benchmarks that exercised dom0 and benchmarks that did not depend on dom0 (CPU-bound benchmarks). We measured the performance of these benchmarks at various core frequencies and computed their sensitivity to changes in frequency. To measure the performance of CPU-bound workloads, we measured their runtime in the guest domain. We measured the performance of the guest domain that exercised dom0: the frequency of the guest domain remained fixed, so any performance variation was due to performance variation in dom0. Since the dom0-intensive workloads executed almost exclusively in dom0, this measurement technique was a good way to approximate a measurement in dom0; the reason for not measuring performance in dom0 directly was the lack of access to application-level performance metrics in dom0.

For CPU-bound benchmarks that do not depend on dom0 we chose the industry-standard benchmarks SPEC CPU2000, SPEC CPU2006, and SPEC JBB2005 [22]. The process of selecting dom0-intensive benchmarks was more involved. While there are a large number of dom0-intensive benchmarks, not all of them provide good potential for power savings on ASISA processors. For example, dom0-intensive workloads that have low CPU utilization in dom0, due to being highly I/O-bound, cannot be used to deliver substantial savings in CPU power consumption simply because they leave the CPU idle most of the time. Therefore, we focused on workloads that produce high CPU utilization in dom0. This could be accomplished either by choosing benchmarks that naturally produce high CPU utilization in dom0 or by running multiple guest domains that individually produce low CPU utilization but collectively produce moderate to high CPU utilization.

To select the suitable dom0-intensive workloads, we ran a wide range of I/O-intensive workloads in the guest domain and measured the resulting CPU utilization in dom0. We ran the following workloads that utilize disk and network: dbench [25], httperf [19], sysbench (file) [11], lmbench (pagefault) [23], stress [26], bonnie [5], bonnie++ [6], iozone [20], piozone [7], hdparm [15], tiobench [16], volanomark [14], ab (Apache Benchmark) [8], and netperf [10]. We ran them using anywhere from one to four guest domains. For many of these benchmarks, CPU utilization in dom0 was low (below 10%) even when multiple guest domains were used. In particular, all disk-bound benchmarks had low dom0 CPU utilization. We suspect that if we had a more powerful storage system (as opposed to a single disk), we would see higher CPU utilization for these workloads: in our present testing conditions, where we had only one spindle, the disk was overloaded and so CPU utilization was low. The benchmarks that exhibited moderate to high CPU utilization were the following:

• ab with four guest domains (91% CPU utilization in dom0)

• netperf TCP with one guest domain (79% CPU utilization in dom0)

• netperf UDP with one guest domain (65% CPU utilization in dom0)

In the rest of the paper we focus on these benchmarks. Apache Benchmark (ab) sends a fixed number of requests to a running instance of the Apache web server and then calculates the number of requests per second the web server is capable of handling. Netperf is a networking benchmark that consists of many tests, the most commonly used ones being TCP_STREAM, which measures TCP bandwidth, and UDP_STREAM, which measures UDP bandwidth. The bandwidth is measured in megabits per second.

We ran our experiments on a Dell PowerEdge R905 server powered by four quad-core AMD Barcelona processors capable of running at five different frequencies. We used Xen 3.2.1 running Gentoo Linux 2.6.21 as guest domains and in dom0. The details of our experimental system are shown in Table 1.

Table 1: Experimental System
Processors: 4 quad-core AMD Barcelona
L1 Cache: 64 KB per core
L2 Cache: 512 KB per core
L3 Cache: 2 MB (shared)
domU clock speeds: 2.3 GHz, 2.0 GHz, 1.7 GHz, 1.4 GHz, 1.15 GHz
dom0 clock speed: 2.3 GHz
Physical memory: 64 GB
domU memory: 1.0 GB
OS: Gentoo Linux 2.6.21
Xen version: 3.2.1

To measure performance in the controlling domain, we used sysstat [9], a system-wide performance monitoring tool available for Linux. Sysstat gave us the metrics we sought, such as the number of packets processed per second for ab and netperf. Xenoprof [17], the oprofile-based Xen profiling tool, was used to measure the controlling domain's CPU usage during the workload execution phase.
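The experiments fix core frequency at one of the five P-states via DVFS. As a rough illustration of that mechanism, the sketch below pins a core's frequency through the Linux cpufreq sysfs interface using the userspace governor. This assumes bare-metal Linux control of P-states; under Xen the hypervisor may own frequency scaling, so this is a sketch of the general mechanism, not the authors' exact procedure.

    # Pin one core to a fixed frequency via the cpufreq sysfs interface.
    # Requires root; assumes the kernel exposes the userspace governor.
    import pathlib

    def pin_frequency(cpu: int, khz: int) -> None:
        base = pathlib.Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
        # Select the userspace governor so the frequency can be set explicitly.
        (base / "scaling_governor").write_text("userspace")
        (base / "scaling_setspeed").write_text(str(khz))

    # The five P-states of the system in Table 1, in kHz; an experiment
    # would pin one of them for its entire run.
    P_STATES_KHZ = [2300000, 2000000, 1700000, 1400000, 1150000]
    pin_frequency(cpu=0, khz=P_STATES_KHZ[-1])  # e.g., the lowest setting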

3. EXPERIMENTAL RESULTS

3.1 Sensitivity

Figure 1 shows the relative sensitivity of dom0 performance compared to the CPU-bound benchmarks. Measuring the performance of diverse workloads requires different metrics: we chose the standard instructions-per-cycle (IPC) metric for the CPU-bound SPEC workloads, while the networking workloads' performance is measured with a throughput metric, i.e., the number of packets processed by dom0 per second. Workloads that are CPU-bound and spend all their time in the guest domain are penalized the most by frequency changes. At the lowest dynamic frequency and voltage setting, SPEC CPU2006's IPC is 58% of the IPC at the highest setting. On the other hand, workloads dependent on dom0 do not suffer as much when dom0's CPU frequency is adjusted. In particular, ab is not affected as strongly: with ab, the rate of processing packets at the lowest frequency is 86% of the rate at the highest frequency. netperf-TCP and netperf-UDP also suffer significantly less than the CPU-bound workloads: netperf-TCP's performance loss is only 15% and netperf-UDP's loss is 25% compared to the highest-frequency scenario.

[Figure 1: Sensitivity of Workloads. Throughput/IPC versus core frequency (2.3, 2.0, 1.7, 1.4, 1.15 GHz) for SPEC CPU 2000 IPC, SPEC CPU 2006 IPC, SPEC JBB throughput, and dom0 packets/sec for ab, netperf-tcp, and netperf-udp.]
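The sensitivities quoted above can be restated as the fraction of peak performance retained at the lowest DVFS setting; the snippet below is a small worked restatement using the numbers from this section.

    # Fraction of peak performance retained at 1.15 GHz relative to 2.3 GHz,
    # restating the numbers quoted above (retention = perf_lowest / perf_highest).
    retention = {
        "SPEC CPU2006 (IPC)":        0.58,
        "ab (packets/sec)":          0.86,
        "netperf-TCP (packets/sec)": 0.85,  # a 15% loss
        "netperf-UDP (packets/sec)": 0.75,  # a 25% loss
    }
    for name, r in retention.items():
        print(f"{name}: retains {r:.0%}, i.e., loses {1 - r:.0%}")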

3.2 Analysis of Sensitivities

In this section we analyze the sensitivity results. While network saturation in the domUs is not very common in practice, we assume full network utilization in our domUs. Figures 2, 3, and 4 show the CPU utilization in dom0 and the corresponding performance of the benchmarks at the different frequencies for ab, netperf-TCP, and netperf-UDP respectively. While ab and netperf-TCP lose 19% and 20% of performance compared to the highest frequency setting, netperf-UDP experiences a slightly more significant loss of 25%. We attempt to understand this difference in sensitivities in order to develop scheduling heuristics that distinguish more sensitive workloads from less sensitive ones.

Our hypothesis was that ab and netperf-TCP are less sensitive to variations in frequency than netperf-UDP because they better overlap network I/O with CPU computation, so slowing down the CPU has a smaller effect on their overall performance than on that of the UDP workload. To test whether this explains the difference in sensitivities, we compute the degree of overlap between I/O and computation in dom0 as follows: we add CPU utilization and network utilization (computed as the ratio of the benchmark's actual bandwidth to the maximum theoretical bandwidth of 10 Gbits/second). If there is an overlap between I/O and computation, this overlap, i.e., the sum of the utilizations, will be greater than 100% (recall that dom0 runs on a single CPU).

Figures 5, 6, and 7 show the results. Combined usage for ab and netperf-TCP is above 100% for all frequencies. Combined usage for netperf-UDP, on the other hand, is below 100%, showing that I/O and computation do not overlap. (The reason why these numbers do not add up to 100% is that some of the computation is done in the guest domain and in the hypervisor, whose CPU utilization we do not measure.) We suspect that low combined usage correlates with higher sensitivity for the following reason. When the combined usage is low, dom0 is not keeping the device busy. As a result, when CPU frequency is decreased, the rate at which packets are generated decreases even further, slowing down the device even more and decreasing the overall performance. This effect is less noticeable in the workloads where the combined usage, and thus the overlap between I/O and computation, is high.
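The combined-usage computation described above amounts to one addition per sample. The sketch below implements it as defined (dom0 CPU utilization plus measured bandwidth over the 10 Gbit/s theoretical maximum) and flags overlap when the sum exceeds 100%; the sample CPU and bandwidth figures are hypothetical.

    MAX_BANDWIDTH_MBIT = 10_000.0  # 10 Gbit/s theoretical NIC maximum

    def combined_usage(cpu_util_pct, bandwidth_mbit):
        """dom0 CPU utilization (%) plus network utilization (%)."""
        return cpu_util_pct + 100.0 * bandwidth_mbit / MAX_BANDWIDTH_MBIT

    def io_overlaps_computation(cpu_util_pct, bandwidth_mbit):
        # dom0 runs on a single CPU, so a sum above 100% implies overlap.
        return combined_usage(cpu_util_pct, bandwidth_mbit) > 100.0

    # Hypothetical samples: (name, dom0 CPU %, dom0 NIC bandwidth in Mbit/s).
    for name, cpu, bw in [("netperf-TCP", 79.0, 4000.0),
                          ("netperf-UDP", 65.0, 2500.0)]:
        print(name, combined_usage(cpu, bw), io_overlaps_computation(cpu, bw))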

[Figure 2: ab workload. (a) dom0 CPU usage (percent) and (b) domU performance (requests per second) at 2.3, 2.0, 1.7, 1.4, and 1.15 GHz.]

[Figure 3: netperf-TCP workload. (a) dom0 CPU usage and (b) domU performance (Mbit/second) at each frequency.]

[Figure 4: netperf-UDP workload. (a) dom0 CPU usage and (b) domU performance (Mbit/second) at each frequency.]

[Figure 5: Combined Usage (NIC usage plus CPU usage) for netperf-TCP at each frequency (GHz).]

[Figure 6: Combined Usage for ab.]

[Figure 7: Combined Usage for netperf-UDP.]

While combined usage seems like a plausible explanation for the differences in sensitivities, the results that support this theory are still preliminary. In order to thoroughly test the validity of this explanation, we need to run more tests with a wider range of workloads. If this theory is confirmed by additional experiments, combined usage may be used as a heuristic for scheduling dom0-dependent workloads on ASISA systems.

4. RELATED WORK

Our work is similar to that of Mogul et al., who posit that an ASISA system should have slower cores dedicated to OS tasks [18]. Our work differs from theirs in that we take virtualization into account and run the entire controlling domain, i.e. dom0, on a slower core, as opposed to a few select system calls and interrupts. While the authors did briefly theorize that dom0 could run on a slower, "OS-friendly" core, our work provides the first concrete validation of their hypothesis.
Matching a workload to a certain core based on the workload's characteristics builds on the work of Kumar et al. [13]. Our work is also similar to other work by Kumar et al. [12], which trades performance for power savings. Unlike them, we do not switch the entire workload to a different frequency, but only a subset of the workload: the portion that runs in the controlling domain.

5. CONCLUSIONS AND FUTURE WORK

The goal of our work was to analyze the performance sensitivity of virtual machine workloads to variations in CPU frequency. We tested the hypothesis that workloads are less sensitive to changes in frequency when they stress dom0. While our hypothesis was validated, we found that dom0-dependent workloads vary in the degree of their sensitivity. In an attempt to explain this sensitivity, we proposed combined usage, a combined utilization of the CPU and devices, as the heuristic. While combined usage appears to be a plausible factor explaining the differences in sensitivities, more tests are needed to understand whether it has wide applicability. If wide applicability is confirmed, combined usage can be used as a heuristic for scheduling workloads. The hypervisor scheduler can measure the combined usage of the virtual CPUs mapped to dom0 and, if it is found to be high, assign those virtual CPUs to low-power physical cores, improving overall performance per watt.

In the future, we plan to evaluate whether combined usage is a good heuristic for scheduling and, if so, implement and evaluate a scheduling algorithm that uses it.
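As a sketch of how a hypervisor scheduler might apply this heuristic, the code below pins dom0's virtual CPUs to low-power cores when their combined usage is high. This is illustrative Python, not Xen's actual scheduler API; the threshold and the measure/pin hooks are hypothetical.

    HIGH_COMBINED_USAGE = 100.0  # percent; hypothetical threshold

    def place_dom0_vcpus(dom0_vcpus, slow_cores, fast_cores, measure, pin):
        """measure(vcpu) -> combined usage in percent; pin(vcpu, core) binds it."""
        for vcpu in dom0_vcpus:
            cores = (slow_cores if measure(vcpu) >= HIGH_COMBINED_USAGE
                     else fast_cores)
            pin(vcpu, cores[0])  # naive placement: first eligible core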
6. REFERENCES

[1] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 164–177, New York, NY, USA, 2003. ACM Press.

[3] L. A. Barroso. The Price of Performance. Queue, 3(7):48–53, 2005.

[4] M. Becchi and P. Crowley. Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures. In CF '06: Proceedings of the 3rd Conference on Computing Frontiers, pages 29–40, New York, NY, USA, 2006. ACM.

[5] T. Bray. Bonnie Disk Benchmark. http://www.textuality.com/bonnie/.

[6] R. Coker. Bonnie++ Disk Benchmark. http://www.coker.com.au/bonnie++/.

[7] P. Eriksson. Piozone: A Hard Disk Benchmarking Tool. http://www2.lysator.liu.se/~pen/piozone/.

[8] Apache Software Foundation. ab - Apache HTTP Server Benchmarking Tool. http://httpd.apache.org/docs/2.0/programs/ab.html.

[9] S. Godard. Performance Monitoring Tools for Linux. http://pagesperso-orange.fr/sebastien.godard/.

[10] R. Jones. Netperf: A Network Performance Benchmark. http://www.netperf.org.

[11] A. Kopytov. Sysbench Benchmarking Tool. http://sysbench.sourceforge.net/.

[12] R. Kumar, K. Farkas, N. Jouppi, P. Ranganathan, and D. Tullsen. Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003.

[13] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In ISCA '04: Proceedings of the 31st Annual International Symposium on Computer Architecture, page 64, Washington, DC, USA, 2004. IEEE Computer Society.

[14] Volano LLC. VolanoMark Benchmark. http://www.volano.com/benchmarks.html.

[15] M. Lord. hdparm: A Hard Drive Performance and Benchmarking Utility. http://sourceforge.net/projects/hdparm/.

[16] J. Manning and M. Kuoppala. Tiobench: Threaded I/O Benchmark for Linux. http://tiobench.sourceforge.net.

[17] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel. Diagnosing Performance Overheads in the Xen Virtual Machine Environment. In VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, pages 13–23, New York, NY, USA, 2005. ACM.

[18] J. C. Mogul, J. Mudigonda, N. Binkert, P. Ranganathan, and V. Talwar. Using Asymmetric Single-ISA CMPs to Save Energy on Operating Systems. IEEE Micro, 28(3):26–41, 2008.

[19] D. Mosberger and T. Jin. httperf — A Tool for Measuring Web Server Performance. SIGMETRICS Perform. Eval. Rev., 26(3):31–37, 1998.

[20] W. Norcutt. The IOzone Filesystem Benchmark. http://www.iozone.org/.

[21] D. Shelepov, J. C. Saez, S. Jeffery, A. Fedorova, N. Perez, Z. F. Huang, S. Blagodurov, and V. Kumar. HASS: A Scheduler for Heterogeneous Multicore Systems. SIGOPS Operating Systems Review, 43(2):55–75, April 2009.

[22] SPEC. SPEC CPU 2000. http://www.spec.org.

[23] C. Staelin (Hewlett-Packard Laboratories). lmbench: Portable Tools for Performance Analysis. In USENIX Annual Technical Conference, pages 279–294, 1996.

[24] R. Strong, J. Mudigonda, J. C. Mogul, N. Binkert, and D. Tullsen. Fast Switching of Threads Between Cores. SIGOPS Oper. Syst. Rev., 43(2):35–45, 2009.

[25] A. Tridgell. Dbench Benchmarking Tool. http://freshmeat.net/projects/dbench/.

[26] A. Waterland. Stress Workload Generator. http://weather.ou.edu/~apw/projects/stress/.