Quick viewing(Text Mode)

An Interface to Implement NUMA Policies in the Xen Hypervisor Gauthier Voron, Gaël Thomas, Vivien Quema, Pierre Sens

An Interface to Implement NUMA Policies in the Xen Hypervisor Gauthier Voron, Gaël Thomas, Vivien Quema, Pierre Sens

An interface to implement NUMA policies in the Gauthier Voron, Gaël Thomas, Vivien Quema, Pierre Sens

To cite this version:

Gauthier Voron, Gaël Thomas, Vivien Quema, Pierre Sens. An interface to implement NUMA policies in the Xen hypervisor. Twelfth European Conference on Computer Systems, EuroSys 2017, Apr 2017, Belgrade, Serbia. pp.15. ￿hal-01515359￿

HAL Id: hal-01515359 https://hal.inria.fr/hal-01515359 Submitted on 27 Apr 2017

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. An interface to implement NUMA policies in the Xen hypervisor

Gauthier Voron Gael¨ Thomas Vivien Quema´ Sorbonne Universites,´ UPMC Univ Tel´ ecom´ SudParis, Universite´ Univ. Grenoble-Alpes, Grenoble INP Paris 06, CNRS, INRIA, LIP6 Paris-Saclay LIG laboratory

Pierre Sens Sorbonne Universites,´ UPMC Univ Paris 06, CNRS, INRIA, LIP6

Abstract rent operating systems provide distinct NUMA placement While only introduces a small overhead on policies for distinct memory access patterns. machines with few cores, this is not the case on larger ones. Unfortunately, the Xen hypervisor, which simulates sev- Most of the overhead on the latter machines is caused by the eral physical machines through virtual machines, imposes Non-Uniform Memory Access (NUMA) architecture they a single, basic NUMA placement policy. As a result, on a are using. In order to reduce this overhead, this paper shows set of 29 applications evaluated on a 8-node machine with how NUMA placement heuristics can be implemented inside 48 cores, we measure an overhead caused by an inefficient Xen. With an evaluation of 29 applications on a 48-core memory placement of up to 700% with Xen. For 15 of the machine, we show that the NUMA placement heuristics can 29 applications, the overhead is higher than 50%, and for 11 multiply the performance of 9 applications by more than 2. of the applications, the overhead is higher than 100% (see Figure 1). 1. Introduction To decrease this overhead, Amazon EC2, which provides on-the-shelf Haswell multicores with 40 virtual CPUs and Modern large multicore machines have a complex memory 160 GB of memory, proposes to expose the NUMA topology architecture, called Non Uniform Memory Access (NUMA) to the virtual machines. The guest is then architecture. Such machines are formed by a set of nodes able to apply a NUMA policy, and hence improve perfor- connected via a high-speed network, called “interconnect”. mance. This solution is not satisfactory because it prevents Each node contains several CPUs, a memory bank and a an efficient load balancing of the virtual CPUs on the phys- deep cache hierarchy. ical CPUs. Indeed, when the hypervisor migrates a virtual Achieving high performance on a NUMA architecture is CPU to a physical CPU of a new NUMA node, the hypervi- difficult because the interconnect or a memory controller sor dynamically modifies the NUMA topology of the virtual can easily saturate. This saturation comes from an inade- machine, which is not supported by any of the current main- quate memory placement on the NUMA nodes, which leads stream operating systems. to many messages on the interconnect or towards a mem- Instead of exposing the NUMA topology to the virtual ory controller. Preventing saturation is difficult because, as machine, we propose to mitigate the NUMA effects at the shown by Lachaize et al. [23] and confirmed by our evalua- hypervisor level, by allowing various NUMA policies to tion, a single memory placement policy on the NUMA nodes be implemented by the hypervisor. Implementing NUMA is not efficient for all the applications. For this reason, cur- policies in an hypervisor is challenging. A NUMA policy must decide where to place the memory of a process. It has thus to know which part of the memory is used by a process. But an hypervisor is only aware of the virtual CPUs assigned to a , not of the processes or their memory. For this reason, the hypervisor cannot understand which part of the memory belongs to which process, and can thus not easily implement an advanced NUMA policy.

[Copyright notice will appear here once ’preprint’ option is removed.]

1 2017/3/6 In this paper, we propose to enable the efficient imple- ory, called, in Xen, the physical memory of the virtual ma- mentation of NUMA policies inside the Xen hypervisor with chine. a new simple interface. This interface consists in two new hypercalls, i.e., calls from the guest operating system to the 2.1 The memory subsystem hypervisor. The first hypercall is used by the guest operating An hypervisor has to isolate the memory of the virtual ma- to inform the hypervisor that it released some memory. The chines. Modern ensure isolation by leveraging second hypercall is used to select a NUMA policy. an hypervisor page table provided by the hardware. The hy- Based on this interface, we show how we can implement pervisor page table maps the physical memory of a virtual four policies. These policies are based on the first-touch, the machine to the memory of the machine, called, in Xen, the interleaved and the Carrefour [12] policies (see Section 3). machine memory.1 An hypervisor activates the hypervisor We evaluate the policies using 29 applications from 5 dif- page table when it switches the processor to guest mode, i.e., ferent benchmark suites on a 8-node machine comprising 48 a special mode that changes the processor behavior to facil- cores using three setups: a setup with a single virtual ma- itate virtualization. In native mode, an application accesses chine using all 48 cores and two setups with several virtual memory through a single page table that maps a virtual ad- machines sharing the 48 cores. Overall, our evaluation shows dress to a physical address, while in guest mode, the proces- that the NUMA policies we implemented inside Xen can sor additionally translates the physical address of the guest drastically improve the performance of many applications, to the machine address using the hypervisor page table. while hiding the NUMA topology to the virtual machine. More precisely, we found that: 2.2 The I/O subsystem In this section, we provide some background on the I/O sub- • With a single virtual machine spanning 48 cores, using an system, because we later explain why the implementation of efficient NUMA policy divides the completion time of 9 one of the NUMA policies (first-touch) has an incompatibil- applications by more than 2. The maximum improvement ity with the virtualized I/O mechanism provided by modern we observe is of 6 times. processors. • With an evaluation of consolidated workloads of multi- In Xen, a special virtual machine, called dom0, is used ple virtual machines, we show that using an adequate to manage the other virtual machines, called domU virtual NUMA policy also drastically reduces the completion machines. Dom0 can also be used to manage the I/Os for time. Among 11 configurations, an efficient NUMA pol- the domUs. In this case, dom0 directly accesses physical de- icy reduces the completion time of at least one virtual vices, while domU virtual machines access physical devices machine by more than 2 in 9 cases. via the dom0 virtual machine. • With a single virtual machine spanning 48 cores, only 4 2.2.1 Para-virtualization technique applications remain degraded by more than 50% when we compare Xen with its best NUMA policy and Linux Xen implements two complementary techniques to intercept with its best NUMA policy. This result shows that most an access to a virtual device of a domU virtual machine. 
The of the large overhead observed with Xen on large multi- first technique is the full-virtualization technique: the guest core was actually caused by an inefficient NUMA policy. operating system executes a legacy driver, and Xen inter- cepts the low level I/O operations performed by the driver. • Because of a design choice at the hardware level, it is The second technique is the para-virtualization technique: impossible to use an IOMMU [6] with the first-touch the guest operating system executes a modified driver, which policy in the hypervisor without deeply modifying the calls the hypervisor to perform the I/O. In our experiments, I/O stack of the hypervisor. we use the para-virtualized drivers implemented in Linux be- The remainder of the paper is organized as follows. Sec- cause it yields better performance. tion 2 gives some background on Xen. Section 3 describes the NUMA policies studied in the paper. Section 4 presents 2.2.2 The PCI passthrough driver both the interface and the implementation of the NUMA When a domU virtual machine accesses a virtual device, Xen policies. Section 5 reports the evaluation of the policies. Sec- intercepts the access, forwards the access to dom0, which tion 6 discusses related work and Section 7 concludes the executes the request, and gives the result to the domU virtual paper. machine. When a guest operating system starts a Direct Memory Access (DMA), it uses a physical address that has 2. The Xen hypervisor to be translated into a machine address before the access. On This section provides some background on the Xen hyper- 1 In the remainder of the text, we use the Xen terminology to describe the visor. An hypervisor simulates several physical machines different memory levels. In other hypervisors, e.g., in KVM, an address of the machine memory is called a system physical address (SPA) or a host through virtual machines. A virtual machine holds a set of physical address (HPA), and an address in the physical address space of a virtual CPUs (vCPUs), a set of virtual devices, and a mem- virtual machine is called a guest physical address (GPA).

2 2017/3/6 re otasaetyruemmr cesst h appro- in the node to nodes. NUMA accesses NUMA single memory priate a route transparently to to region order each associating uses CPU map Each regions. a NUMA into the space partitions address statically I/O machine hardware bank, The memory CPUs. a and contain controllers may node NUMA Each nodes. useful. the performance are policies a presented with and the all paper architecture that the showing evaluation NUMA in evaluated the policies NUMA presents section study under This policies NUMA 3. itself. it start transfer time to the the perform takes to to it compared takes the negligible time becomes the when transfer increases, that DMA read a fact to the bytes by by of caused number explained overhead is the lower This the KiB virtualization. read, 4 bytes a of reading amount the that measured 186 have takes vir- bus. block we express domU PCI setting, a other this for the use bus With then express can PCI Dom0 machine. a this tual reserve For buses. can express we machine PCI two reason, the has Fortunately, experiments our granularity. in bus used express at PCI machines PCI de- virtual the the to granularity, associate devices device associates the can driver at passthrough IOMMU machines relatively virtual AMD to is vices the driver While passthrough PCI restrictive. the However, driver. over- large the above. involving reducing mentioned without significantly head memory thus hypervisor, physical the the ad- component, physical accesses IOMMU a device an With translate a address. to machine devices a into by dress used that directly unit management be memory can a is component This compo- [6]. IOMMU nent an provide processors modern translation, Linux. 307 takes caching) have the avoid (with we to block experiments, KiB 4 the a reading for that used measured we machine 48-core the Relative overhead UAacietr scmoe yasto NUMA of set a by composed is architecture NUMA A passthrough PCI the with IOMMU an use to able is Xen address the perform hypervisor the letting of Instead

� � � � � � � � � �

��������� ������� µ

sil74 (still s ����������� �������������

iue1. Figure ��������� µ

µ

nLnx.Nt httelarger the that Note Linux). in s

,wiei nytks74 takes only it while s, ����

eaieoeha fteXncmae oteLnx(oe sbetter). is (lower Linux the to compared Xen the of overhead Relative ����

����

���� ���� O

DIRECT

���� ���� µ

flag in s ���� ����

3 ����

h aefrtepgscrepnigt hedsak,but stacks, thread to corresponding typically pages is the This page. for that the thread case to the the accesses often is the time of first most the performs for page a accesses that have not policy. round-robin does a node using another selected from NUMA node memory the the allocates Linux per- If pages, that free access. enough thread the first of the node forms NUMA the from page physical maps page. and physical is a access access to invalid the address the time, virtual of the first intercepts thread the Linux a for invalid. page When thus a pages. accesses Linux the process space, allocate the address physically virtual not new a does creates allocation in it memory used When lazy policy. policy a uses NUMA Linux precisely, default More the Linux. is policy first-touch The policy first-touch The 3.1 system. operating Linux imple- the been for have mented and that policies round-4K NUMA first-touch, are NUMA policies The default) Carrefour Xen. thus in (and accesses. only implemented the memory policy is of policy latency in round-1G average node The NUMA the new improve dy- a to to can order address Carrefour virtual policy: a dynamic remap namically a and last is node, The NUMA (Carrefour) mapping. initial a policy the to change dynamically address not do virtual they a map initially policies: static they are round-4K) and first-touch (round-1G, cies interconnect. the of saturation the avoid to overloaded locality order avoid access in memory to enforce order (ii) and in controllers, load controllers memory the memory balance the (i) all both on should policy placement NUMA a a physical to a node. process NUMA to the a address to virtual belongs of that the page address a mapping precisely, virtual by More node a NUMA table. maps page policy the systems, Linux NUMA on native rely In each policies process. place a NUMA to by node used NUMA address which virtual on choosing of charge in

��

h rttuhplc sotnefiin eas h thread the because efficient often is policy first-touch The the allocates Linux policy, first-touch default the With poli- Three policies. NUMA four study to propose We 26], 17, [12, studies research several by highlighted As is policy, NUMA simply or policy, placement NUMA A

��

�����

���

������

��������

���������

������

���

��

��������

����

��������� ������� 2017/3/6 also for the pages that host a data structure only accessed by the memory access locality, while avoiding contention on the thread that allocated the structure. memory controllers and interconnect links. The first-touch policy is, however, particularly inefficient Carrefour monitors the memory access patterns of the if a single thread allocates and initializes the memory for threads using hardware counters. It implements three heuris- all the other threads. This is typically the case of applica- tics to migrate or replicate the hottest physical pages, i.e., tions implementing a master-slave pattern, in which a mas- the most accessed physical pages. The first heuristic imple- ter thread prepares the memory for the slaves. In such appli- mented by Carrefour is the interleaved heuristic. Carrefour cations, because of the first-touch policy, the master thread executes this heuristic when it detects that memory con- maps a large part of the memory from its NUMA node when trollers are overloaded. It randomly migrates hot pages from it prepares the memory. When the slaves execute, they per- overloaded nodes to underloaded nodes. Carrefour also im- form most of their memory accesses to this NUMA node, plements two heuristics that it triggers when it detects that yielding contention and poor performance. the interconnect saturates. The migration heuristic migrates hot pages that are remotely accessed by only a single node 3.2 The round-4K policy towards the node that performs the accesses. The replication Linux also proposes a policy that we call round-4K. This heuristic replicates hot pages that are accessed in read-only policy allocates physical pages of 4 KiB and selects NUMA mode by a set of threads. In our study, we have discarded nodes on which to allocate these pages using a round-robin the replication heuristic, because it has only a marginal ef- policy. With this policy, memory accesses are usually well- fect on performance, and because implementing this policy balanced on all the NUMA nodes, which presents the advan- within the Xen hypervisor would require radical changes in tage of balancing the load on memory controllers. However, the design of the Xen memory manager. the round-4K policy has the drawback of inducing a large 3.5 A case for each NUMA policy number of remote memory accesses. For applications that tend to saturate a single memory As presented in the previous section, Linux offers several controller with the first-touch policy, typically, applications policies. However, their usefulness remains questionable: designed with a master-slave pattern, the round-4K induces are all these policies important to achieve good perfor- better performance than the first-touch policy. In such appli- mance? If the answer to this question is no, offering an cations, the round-4K policy trades the saturation of a sin- interface to implement several NUMA policies in Xen is gle node with a possible saturation of the interconnect links, useless. If the answer to this question is yes, it probably which yields lower memory access latencies [12, 17]. makes sense to implement the same NUMA policies within the Xen hypervisor. 3.3 The round-1G policy 3.5.1 Evaluation of the NUMA policies in Linux Xen uses a default NUMA policy, that we call the round-1G In order to verify that all the NUMA policies are useful, policy. Xen eagerly allocates the physical memory of a vir- Figure 2 reports the improvement in terms of completion tual machine during its creation. 
Xen tries to pack the mem- time relative to the default first-touch policy when using the ory and the vCPUs of the virtual machine on the minimal different NUMA policies in Linux. The Figure presents the number of underloaded NUMA nodes by reserving a phys- evaluation of 29 applications on AMD48, a 48-core machine 2 ical CPU per vCPU. These NUMA nodes form the home- with 8 NUMA nodes, running a native Linux operating sys- nodes of the virtual machine. tem (see Section 5 for a detailed description of the hard- Xen favors locality by allocating the memory of a virtual ware and setting). We exhaustively evaluate all the machine from its home-nodes. It first tries to allocate the possible combinations of static and dynamic policies imple- memory by regions of 1 GiB with a round-robin algorithm mented in Linux: first-touch, first-touch/Carrefour, round- from the home-nodes. Then, in case of fragmentation or if 4K and round-4K/Carrefour. the virtual machine needs less than 1 GiB (resp. 2 MiB), Xen We can first observe that the NUMA policy has a huge allocates the memory by regions of 2 MiB (resp. 4 KiB). Be- impact on performance for many applications. 17 of the 29 cause of the BIOS and I/O memory regions, the first and last applications are improved by more than 25% when we com- physical GiBs of a virtual machine are always fragmented. pare the best against the worst NUMA policy (12 applica- tions by more than 50% and 5 by more than 100%). We 3.4 The Carrefour policy can also observe that each possible combination is for some Carrefour [12] is a dynamic memory placement policy for applications the one yielding the best possible performance NUMA architectures executing the Linux operating system. (e.g., first-touch for cg.C, first-touch/Carrefour for sp.C, Carrefour dynamically migrates pages in order to increase round-4K for kmeans or round-4K/Carrefour for facesim). This result answers the above question: it is probably worth 2 http://wiki.xen.org/wiki/Xen_4.3_NUMA_Aware_Scheduling# implementing within Xen the NUMA policies that are avail- Soft_Scheduling_Affinity. able within Linux.

4 2017/3/6 sdfie steaeaeo h ecnaeo h bandwidth the of percentage load the interconnect of average The the node. av- as defined per the is around accesses deviation of standard number relative erage the as imbalance defined The registers. is counter performance available all the metrics uses already the Carrefour report because not Carrefour and executing does Ta-when first-touch table the useful, The with policies. are measured round-4K policies the metrics the two all reports 1 why ble understand to order behaviors In the of Analysis 3.5.2 threads. 48 with AMD48 1. Table better). is (higher policy first-touch 2. Figure streamcluster fluidanimate

memcached Relative improvement bodytrack swaptions cassandra mongodb pagerank psearchy facesim kmeans wrmem belief mg.D x264 ep.D ua.C dc.B cg.C sp.C ���� ���� ���� ���� lu.C bt.C sssp ft.C pca bfs wc wr cc � � � �

feto h ttcNM oiisi iu on Linux in policies NUMA static the of Effect

mrvmn ftecmlto ieo aiu UAplce nLnxo M4 ih4 hed eaiet the to relative threads 48 with AMD48 on Linux in policies NUMA various of time completion the of Improvement ��������� First-touch

odimbalance Load 130% 193% 183% 185% 190% 206% 251% 235% 135% 110% 101% 113% 263% 175% 219% 253% 135% 65% 85% 19% 47% 60% 45% 89% 84% 65% �������

5% 8% 7% �����������

Round-4k 102% 116% 180% ������������� 95% 50% 10% 23% 31% 24% 80% 74% 26% 14% 57% 41% 30% 19% 19% 28% 45% 16% 27% 48%

8% 7% 4% 1% 5% 8% ���������

First-touch necnetload Interconnect ���� 61% 52% 10% 18% 18% 14% 43% 12% 18% 17% 48% 10% 11% 51% 17% 31% 18% 39% 16% 14% 17% 17% 17% 17% 19% 13%

6% 4% 9% ���� ��������

Round-4k ���� 46% 42% 41% 11% 18% 17% 37% 58% 51% 41% 46% 22% 46% 35% 13% 18% 16% 16% 14% 14% 11% 11% 11% 12% 10% 12%

9% 5% 8% ����

maac level Imbalance ihfirst-touch with ����

moderate moderate moderate moderate moderate ������������������

high high high high high high high high ���� high high high high high low low low low low low low low low low low

����

���� ����

5 ���� sdo h otlae necnetlnsdrn ahsec- each during links interconnect ond. loaded most the on used lsvl ok h ikwieacsigrmt opnnssc sthe as such components remote accessing ex- controller. request while memory memory remote link it remote the saturates, each link locks because a bandwidth clusively When of packets 80% related only software reaches on re- piggy-backed hardware commands Those send be ma- commands. to can synchronization bandwidth the link the as When of such amplitude. 50% commands uses lated 30% hardware the this only the report idle, to We is relative saturated. chine is bandwidth link the the when of 80% variation and idle is link the when im- 3 is it load because interconnect (the performance locality access their memory their improve proves Car- the 1). to Table of tends in imbalance imbalance refour threads. columns large (see other the policy the prevents first-touch for policy memory round-4K single the a The allocate 3.2, Section to in tends presented thread As first- Linux. the in with policy 130% than touch more of imbalance access memory worst a with for applications, 10% these of than for case average policy in slower NUMA 1% best only the is measured policy have first-touch we the 40%, result, to that a 10% As roughly Interconnect). (from column applications the 11 see the of 7 interconnect for the 4 load locality: by multiplies access roughly policy memory round-4K the decreases degrades it also for because policy round-4K performance locality The run. access the of memory remainder the the the degrading however, of has, migration consequence re- The the does performance. pages As improve the node. migrating not another temporary, only to are page accesses the mote migrates traf- and In interconnect burst nodes. temporary fic a remote observes by Carrefour node accessed case, perfor-this its heavily from temporarily their accessed be degrade mainly may to page tends a Technically, Carrefour mance. that allocated. structures data be- has access perfect, mostly is to is tends policy thread first-touch each 3.1, cause the Section in applications, presented these As with for Linux. 85% than in less policy of ex- first-touch imbalance the applications access “low” memory 11 low The a hibit 1). Table of level imbalance h adaecutr culygv ercwihvr ewe 50% between vary which metric a give actually counters hardware The

tteopst,te1 hg”apiain xii high a exhibit applications “high” 13 the opposite, the At (column groups three in applications the classify can We ��

3 ��

��������

����� ���

ft.C ������

.

�������� ������������������

���������

������

���

��

��������

����

��������� ������� 2017/3/6 4.1 The internal interface Guest operating system External interface The internal interface consists in two functions. The first Internal interface function implements the mechanism to map the virtual page NUMA Policy of a process to a NUMA node. The second function imple- Hypervisor ments the mechanism to migrate the virtual page of a process Figure 3. External and internal interfaces. to a new NUMA node. In an hypervisor, directly placing the virtual address of a process to a NUMA node is difficult because an hypervisor large for 5 applications). As a result, we have measured that executes virtual machines, not processes. A guest operating the round-4K/Carrefour policy is only 2% slower in average system maps a virtual page of a process to a physical page than the best NUMA policy for these applications, with a of a virtual machine. Modifying this mapping from within worst case of 8% for wrmem. the hypervisor is difficult because an hypervisor cannot eas- Finally, the 5 remaining “moderate” applications exhibit a ily know which physical page of a virtual machine is used moderate memory access imbalance between 85% and 130% or not. Moreover, the hypervisor cannot easily synchronize with the first-touch policy in Linux. For these applications, with the operating system in order to avoid concurrent ac- the first-touch policy does not perfectly balance the load cesses to the guest page table. on all the nodes, but ensures a satisfactory memory access For these reasons, we have chosen to implement the mem- locality. Using the round-4K policy degrades performance ory placement mechanism by leveraging the hypervisor page because this policy destroys the memory access locality. The table. In this setting, a NUMA policy lets the guest operat- Carrefour policy is useful for these applications, because it ing system map a virtual page to any physical page. Then, better balances the load and slightly improves the memory the NUMA policy maps the physical page to a NUMA node access locality. As a result, we have measured that the round- by mapping the physical page to a machine page of the node. 4K/Carrefour policy is only 2% slower in average than the For the same reasons, the migration mechanism relies on best NUMA policy for these applications, with a worst case the hypervisor page table. It starts by write protecting the of 5% for mongodb. entry of the physical page in order to avoid concurrent writes To summarize this analysis, we confirm that all the stud- on the copied page. Then, the migration mechanism copies ied NUMA policies are useful. This is explained by the the page to its new location and updates the entry of the fact that different sets of applications have different mem- physical page in the hypervisor page table. ory behaviors. More precisely, the round-4K/Carrefour pol- icy is required for the “high” applications, the first-touch/- Carrefour policy is required for the “moderate” applications, 4.2 The external interface and the first-touch policy is required for the “low” applica- The external interface consists in two functions. The first tions. function is used to select a NUMA policy. The second func- tion is required for the first-touch policy to trap the first ac- 4. Implementing NUMA policies inside Xen cess to a virtual page performed by a process. As all the NUMA policies evaluated in the previous section are useful in Linux, we have implemented all these policies in Xen. 
In order to implement the NUMA policies inside 4.2.1 Selecting the NUMA policy Xen, we have identified three important needs. First, each of An administrator selects a NUMA policy for a whole virtual the policies needs a mechanism to map a virtual page of a machine because Xen is not aware of the processes inside the process to a machine page of any NUMA node. Second, the virtual machine. Notice that, for this reason, an administra- Carrefour policy needs a mechanism to dynamically migrate tor must not colocate different processes with contradictory a virtual page of a process to a new NUMA node. Third, the NUMA access patterns inside the same virtual machine. first-touch policy needs a mechanism to trap the first access By default, a virtual machine boots with the round-4K of a process to a virtual page. policy. We do not provide an interface to dynamically switch These needs have driven the definition of the interface we to the round-1G policy because, as presented in the evalua- propose in this paper. As presented in Figure 3, the interface tion (see Section 5.3.3), the round-1G policy is much less can be split into two parts: the external interface is used by a useful than the other policies. Instead, we provide an option NUMA policy to communicate with the guest operating sys- to boot a virtual machine with the round-1G policy that can tem, whereas the internal interface is used to communicate be used for testing purposes. with the hypervisor. As we wanted to minimize the modifi- Then, when the virtual machine runs, Xen provides a cations performed in the code of the guest operating system, new hypercall to dynamically change the NUMA policy of a we tried to minimize as much as possible the size of the ex- virtual machine. This hypercall can switch to the first-touch ternal interface. policy and activate/deactivate the Carrefour policy.

6 2017/3/6 Old address Space New address space hypercall. The guest operating system triggers this hyper- Virtual page V0 Virtual page V1 call when it adds a page to its free list, in order to inform Xen that the page is not used anymore. However, calling the Old mapping New mapping hypervisor for each page release is inefficient for an applica- Physical page P0 tion that releases physical pages often. This is for example the case of several Mosbench applications that we evaluate. These applications rely on the Streamflow allocator [34] be- Machine page M0 cause it better scales than the default glibc allocator when the Figure 4. Issue to implement first-touch in Xen. number of cores increases. Unfortunately, the streamflow al- locator continuously calls the Linux mmap/munmap functions to manage the memory. As a result, a Mosbench application 4.2.2 Trapping the first access to a page such as wrmem releases a physical page every 15 µs. Even The second hypercall is used to trap the first access to a page. executing an empty hypercall in Xen for each page release It is used by the guest operating system to communicate already divides by 3 the performance of wrmem. This large a queue of recently allocated and released physical pages. slowdown can only annihilate the performance improvement We introduce this function because of a mismatch: the first- brought by the first-touch policy. touch policy has to trap the first access of a process to a virtual page, but the hypervisor manipulates physical pages 4.2.4 Batching the hypercalls that belong to a virtual machine, not to a specific process. We can easily solve the problem of the hypercall cost by This mismatch is problematic because the hypervisor is batching the hypercalls. Instead of calling the hypervisor not involved when the guest operating system releases a when it releases a page, the guest operating system accu- physical page and reallocates the page to a new process. mulates several free pages in a page queue, and sends the Figure 4 illustrates the problem. An hypervisor can easily whole queue after several page releases. trap the first access performed by a virtual machine to a However, batching the hypercalls raises a new challenge. physical page P0 by letting the entry of P0 empty in the The operating system may reallocate a page while it is in hypervisor page table. In this case, during the first access, the the page queue but not yet sent to the hypervisor. When the hypervisor will map P0 to a machine page, for example M0. hypervisor receives the page queue, it can thus not ignore However, when the operating system reallocates P0 from an the content of a page as the page is maybe already reused old virtual page V0 to a new virtual page V1, the hypervisor by a process. Without any other mechanism, the hypervisor is not involved. The first access to P0 through V1 does not would thus have to copy the old content of the page when it generate an hypervisor page fault: in the best case, it only migrates the page to a new NUMA node, which is costly. generates a guest page fault, which is directly caught by the We solve the problem by trapping both the page alloca- guest operating system, without involving the hypervisor. tion and release of the guest operating system. At high level, we define a global queue shared by all the cores and pro- 4.2.3 An hypercall for each page release tected by a lock. 
Each entry in the queue contains a pair (op, We implement the first-touch policy by invalidating, in the page), in which op is the operation (allocation or release) hypervisor page table, the entries of the physical pages that and page is the address of the physical page. are not used by the guest operating system. When the guest When the guest operating system allocates or releases a operating system allocates the physical page to a process, page, it acquires a lock before adding the pair to the queue. the first access of the process generates an hypervisor page Then, before releasing the lock, the guest operating system fault. We use then this hypervisor page fault to implement sends the queue to the hypervisor through an hypercall when the first-touch policy. the queue is full. The guest operating system has to keep In order to implement this mechanism, the hypervisor has the lock during the hypercall in order to ensure that another to know when a process of a guest operating system releases core cannot reallocate a free page of the queue during the a physical page. We might want to use the ballooning driver hypercall. to get that knowledge. The ballooning driver is used by guest When a NUMA policy receives the queue, it starts with operating systems to release pages to the hypervisor, which the most recent operations, i.e., the end of the queue. Then, can then give them to other virtual machines. However, when the NUMA policy keeps a list of the visited pages and only a guest operating system releases a page through the bal- takes into account the most recent operation associated to a looning driver, the guest can no longer use that page. In our page. If the most recent operation is a release, the NUMA case, the guest operating system has to be able to reallocate policy knows that the physical page is no longer used and the free page to a new process at any time, which precludes it can invalidate its entry. If the most recent operation is an using the ballooning driver. allocation, the hypervisor knows that the page may already As the ballooning driver is inadequate, we have chosen be reused by a process. This case is rare and we simply to use a para-virtualization technique by introducing a new handle it by letting a reallocated page on its current node.

7 2017/3/6 Indeed, copying the old content of a page would be too costly ���� ��� ���� ���� in the common case and would thus not be efficient. ������ Finally, using a single global queue protected by a lock is ������

a bottleneck when the virtual machine uses many cores. As ����� a final solution, we partition the global queue in independent � queues. We associate each page address to a single queue by ���� � using the two less significant bits of the page frame number. Time (ns) ����

As a result, each queue has its own independent lock, which ����� increases the parallelism. �� With the partitioned queue, we have measured that 87.5% ����� ��� of the time is spent invalidating the pages during an hyper- Figure 5. IPI cost repartition call, while sending the queue only takes 12.5% of this time. For this reason, we have not used more efficient and scalable queue algorithms [20]. 4.4.1 Incompatibility between first-touch and the IOMMU 4.3 Implementation of the NUMA policies Our implementation is incompatible with the IOMMU With the interfaces provided to NUMA policies, implement- mechanism [6] (see Section 2.2). The IOMMU translates ing the first-touch, the round-4K and the Carrefour policies the physical addresses of a virtual machine into the machine is straightforward. We have implemented the round-4K pol- addresses in order to perform transparent DMA transfers di- icy in Xen by statically allocating 4 KiB pages in a round- rectly to the guest virtual machine memory. However, during robin fashion from the virtual machine’s home nodes when this translation, the IOMMU cannot handle invalid entries in the virtual machine is created. This implementation relies the hypervisor page table. on the internal interface (see Section 4.1). For the first-touch If an entry is invalid, the IOMMU aborts the transfer and policy, we rely on the hypercall provided by the external in- indicates that an I/O error happened. Unfortunately, because terface (see Section 4.2): when the guest operating system of a hardware design choice, the IOMMU asynchronously releases a page, the NUMA policy simply invalidates the en- notifies the hypervisor that an error occurred. As a result, try in the guest page table. the hypervisor may handle this error after the guest operat- In order to implement the Carrefour policy, we have ing system did. In this case, even if the hypervisor maps a ported the code of Carrefour in Xen4. The original Car- machine page in the hypervisor page table, it is too late: the refour implementation defines two components: the system operating system already considered that the I/O was impos- component and the user component. The system component sible and returned an error to the process that performed the runs in the Linux kernel. It gathers the low-level hardware I/O. counters and associates metrics to the hot pages. It also de- In order to trap the first access to a physical page, the first- fines a system interface to migrate a page from one node to touch policy invalidates entries in the hypervisor page table. another and to read the metrics. The user component uses If the physical page is used by the guest operating system these metrics to choose which page to migrate and to select as a DMA buffer, the I/O becomes impossible. For this the destination node. reason, when we evaluate the first-touch policy, we disable In Xen, the user component executes as a process in the the IOMMU. dom0 virtual machine. The system component executes in- side Xen. Instead of using the operating system interface to 4.4.2 Content of the released pages communicate with the system component, the user compo- Our first-touch implementation has a second small limita- nent performs an hypercall, which is trapped by Linux and tion. Before a page release, Linux fills the page with ze- forwarded to Xen. Then, instead of observing the memory ros. For this reason, the free pages in Xen are all equivalent accesses performed by the threads of the processes, the sys- with the same content. 
The first-touch policy can thus map tem component observes the memory accesses performed any page to any physical address of the guest. However, this by the vCPUs of the virtual machines. In order to migrate behavior could be a limitation for an operating system that a page, the system component relies on the internal interface stores per-page housekeeping data in free pages. (see Section 4.2). 5. Performance evaluation 4.4 Limitation This section presents an evaluation of the NUMA policies The implementation of our first-touch policy has two limita- implemented inside Xen. We first present the hardware and tions. then the software setting. We also present Xen+, an im- 4 The code of Carrefour is available at https://github.com/ proved version of Xen used in the evaluation, along with its Carrefour. evaluation. Then, we compare the different NUMA policies

8 2017/3/6 Context Memory Memory Hard drive Cache switches footprint 1 thread 48 threads MB/s k/s MB L1 cache 5 cycles Local 156 cycles 697 cycles bodytrack 0 17.7 7 L2 cache 16 cycles Remote (1 hop) 276 cycles 740 cycles L3 cache 48 cycles Remote (2 hops) 383 cycles 863 cycles facesim 0 11.7 328 fluidanimate 0 4.2 223 Parsec Table 3. Cache and memory access latency on AMD48. streamcluster 0 29.5 106 swaptions 0 0.0 4 x264 0 0.6 1129 dom0 virtual machine are connected to the bus of node 0. bt.C 0 1.2 698 The disk that contains all the benchmarks and the datasets is cg.C 0 5.9 889 dc.B 175 0.1 39273 connected to the bus of node 6. ep.D 0 0.0 49 Table 3 reports the time to access the caches and the mem- NPB ft.C 0 0.3 5156 ory on AMD48. In order to measure the memory access la- lu.C 0 1.5 600 tency, we present two results. With 1 thread, we measure an mg.D 0 1.5 27095 uncontended case, in which a single thread accesses a sin- sp.C 0 2.0 869 gle NUMA node. With 48 threads, we measure a contended ua.C 0 37.4 483 wc 0 3.9 16682 case, in which 48 threads access the same NUMA node. This wr 1 5.2 19016 result shows that the diameter has only a small impact on wrmem 5 7.5 11610 performance (276 cycles for a 1-hop access versus 383 cy- Mosbench pca 0 0.3 5779 cles for a 2-hop access in the uncontended case). On the con- kmeans 0 0.1 4178 trary, we can observe that a contended memory controller psearchy 54 0.8 28576 drastically slows down memory access latency (697 cycles memcached 0 127.1 2205 to access a local NUMA node when this last is contended). belief 234 0.0 12292 bfs 236 0.0 12291 X-Stream cc 249 0.0 12291 5.2 Software setting pagerank 240 0.0 12291 For each experiment presented in this section, we report the sssp 261 0.0 12291 average of 6 runs. We use the Linux 3.9 kernel, along with cassandra 16 10.7 1111 YCSB gcc 4.6.3, libgomp 3.0 and glibc 6. For the analysis, we have mongodb 184 14.6 1092 selected applications often used to measure the performance Table 2. Behavior of the applications. of large multicores [7, 12, 13, 33]. We evaluate 29 applica- tions from the Parsec 2.1 (precompiled version), the NPB in Xen+ with both single and multiple virtual machines. Fi- 3.3 (openMP version), the Mosbench benchmark (Stream- nally, we present a comparison of Linux and Xen+ when we flow allocator [34]), a set of X-Stream applications [33], and use the best possible NUMA policy for each application in the YCSB benchmark [11] on Cassandra and MongoDB. both Linux and Xen+. For the Xen experiments, we use Xen 4.5. We hide the NUMA topology to the guest. The dom0 virtual machine is 5.1 Hardware setting pinned to the CPUs of node 0. As we can only select the We perform our evaluation on an AMD machine, called NUMA policy for a whole virtual machine, we only run a AMD48 hereafter. AMD48 has 8 NUMA nodes with 6 single benchmark application per virtual machine in all the CPUs/16 GiB per node. In total AMD48 has thus 48 cores experiments. and 128 GiB of RAM. 5.3 Xen+ The exact memory system of AMD48 is described in [5]. In summary, AMD48 has four Opteron 6174 sockets, each In order to highlight the effect of the NUMA policies in one containing two NUMA nodes. Each node is connected Xen, we use an improved baseline, called Xen+, in which to 16 GiB of RAM through a memory controller having a we mitigate other well-know virtualization costs: the cost of maximum throughput of 13 GiB/s. The 6 CPUs of a node virtualized I/Os and some cost of virtualized inter-processor share a unified L3 cache of 5 MiB. 
A CPU of a NUMA node interrupts. runs at 2.2 GHz. It has two L1 caches of 64 KiB each (a data and a code cache), and a single unified L2 cache of 512 KiB. 5.3.1 Virtualized I/O cost The nodes are interconnected by HyperTransport links, One of the virtualization bottleneck is caused by I/Os (disk with a maximum distance of two hops. The machine handles or network) [1, 18, 24, 29]. We partially remove this over- L3 cache misses with the HT3 (HT-assist) protocol. The head by activating the IOMMU with the PCI passthrough bandwidth between nodes is asymmetric, with a maximum driver (see Section 2.2). bandwidth of 6 GiB/s. Two of the nodes (nodes 0 and 6) are However, as presented in Section 4.4, we cannot activate connected to a PCI bus. The network and the disk of the both the IOMMU and the first-touch policy. For this reason,

9 2017/3/6 we only activate the PCI passthrough driver for the round-4K LinuxNUMA Xen+NUMA and round-1G policies. bodytrack Round 4K / Carrefour Round 4K / Carrefour facesim Round-4K Round-4K 5.3.2 Virtualized IPI cost fluidanimate Round 4K / Carrefour Round 4K / Carrefour streamcluster Round-4K Round-4K Virtualizing inter-processor interrupts (IPIs) is another well- swaptions Round-4K Round-4K known cause of virtualization overhead. As presented in x264 First-Touch Round-4K Figure 5, in native mode, sending an IPI is relatively fast bt.C First-Touch / Carrefour First-Touch / Carrefour (0.9 µs). However, in guest mode, sending an IPI takes much cg.C First-Touch First-Touch more time (10.9 µs). dc.B First-Touch Round-1G As reported by Ding et al. [16, 35], this overhead is ep.D Round-4K Round-4K especially detrimental for applications that frequently leave ft.C Round-4K Round-4K lu.C Round-4K First-Touch the processor because they wait for an event, e.g., a lock, a mg.D First-Touch First-Touch condition variable or a network packet. When a thread waits sp.C Round 4K / Carrefour Round 4K / Carrefour for an event, the thread goes to the sleep state through a ua.C First-Touch First-Touch context switch. The CPU of the thread may then become idle wc First-Touch / Carrefour Round-4K when no more threads are ready. When the CPU becomes wr First-Touch Round-4K idle, Linux halts the processor up to the next interrupt. In wrmem First-Touch Round-4K this case, when the event arrives, Linux wakes up a sleeping pca Round-4K Round 4K / Carrefour kmeans Round-4K Round-4K CPU by sending an inter-processor interrupt (IPI). psearchy First-Touch Round-4K As presented in Table 2, we can observe that 7 applica- memcached First-Touch Round-1G tions do intentionally frequently leave the CPU (more than belief Round-4K Round 4K / Carrefour 10,000 intentional context switches each second). These ap- bfs Round-4K Round-4K plications may suffer from the cost of virtualized IPIs if the cc Round 4K / Carrefour Round 4K / Carrefour CPU goes to the sleep state. Ding et al. propose to solve the pagerank Round 4K / Carrefour Round 4K / Carrefour sssp Round 4K / Carrefour Round 4K / Carrefour problem with a complex algorithm that can handle consoli- cassandra First-Touch / Carrefour Round-1G dated workloads. mongodb First-Touch / Carrefour Round-1G For the sake of simplicity, we have rather implemented a much simpler solution that only works for non-consolidated Table 4. Best NUMA policies. workloads. Our solution consists in implementing the pthread mutex and condition variable functions with a spin loop us- ing the MCS algorithm [28]. Thanks to this modification, NUMA policy based on the evaluation presented in Figure 2 when a thread synchronizes through a lock or a condition of Section 3.5). variable, it does not leave the processor. Using a spin loop By construction, Xen+ is better than Xen, and Lin- is, of course, not a long-term solution: we only use this tech- uxNUMA is better than Linux. Figure 6 confirms this result. nique to better focus on the NUMA placement problem by When we compare LinuxNUMA with Xen+, we can ob- eliminating an unrelated, well-known virtualization bottle- serve that the overhead of Xen+ relatively to LinuxNUMA neck. 
remains large: 20 of the 29 evaluated applications still ex- This modification significantly improves the performance hibit an overhead higher than 25%, 14 applications exhibit of facesim and streamcluster (see the evaluation pre- an overhead higher than 50%, and 11 applications have an sented in Section 5.3.3). The modification improves facesim overhead higher than 100%. As we have already removed by 30% and streamcluster by 55%. We also measure some overhead caused by I/Os and IPIs in the Xen+, we can that, after the modification, these applications generate zero suspect that the remaining large overhead is actually often intentional context switch per second. We activate thus this caused by NUMA effects. optimization for Xen in the single virtual machine experi- When we compare Xen with Xen+, we can observe that ments for facesim and streamcluster. In order to make facesim and streamcluster are largely improved by the a fair evaluation, we also replace the pthread mutex and con- use of MCS lock. We can also observe that dc.B, bfs, dition variable functions by a spin loop for facesim and cc, pagerank, sssp and mongodb are largely improved by streamcluster in Linux. Xen+. Column Hard drive of Table 2, which reports the disk usage, confirms that these applications heavily use the 5.3.3 Xen+ evaluation disk, which makes the PCI passthrough driver especially Figure 6 reports the overhead of Xen, Xen+ and Linux, rel- useful. atively to LinuxNUMA. We define LinuxNUMA as a Linux We can finally observe that, unexpectedly, for some disk- that uses MCS locks for facesim and streamcluster, intensive applications (belief, bfs, cc, pagerank and and that systematically uses the best possible Linux NUMA ssp), Xen+ is slightly better than Linux. We have not found policy for each application (Table 4 recalls this best Linux why Xen+ improves performance, but we suppose that Xen+

10 2017/3/6 mrvdb oeta 0% ntebs ae for case, best the In 100%. are than applications more 9 by improved performance. the improve drastically can policies. NUMA best the highlighting by The policy. NUMA each column for Xen+ to relative improvement the and CPUs vCPUs. physical the the on on threads physical vCPUs of the number pin the vCPUs, also as We of vCPUs CPUs. many number Each as the uses machine. guest as virtual the threads and single many a as uses run application we experiment, this In machine virtual Single 5.4.1 machines. virtual multiple ma- then virtual and Xen single chine in a evaluate policy first NUMA We performance. efficient goal improves an the selecting has that evaluation showing This dif- of Xen. the of of policies performance NUMA the ferent compare we experiment, policies this NUMA In the of Evaluation 5.4 different of the pages nodes. maps memory NUMA to which guest table, the NUMA of page different addresses hypervisor physical on the distributed to is thanks buffer nodes DMA nodes. NUMA a physical different Xen+, the the on In grain partition coarse hardware at space single the address a because from allocated node only address thus NUMA physical is buffer contiguous DMA a This as space. Linux, allocated In is transfer. buffer DMA DMA a a during parallelism better a offers omneb 7% n h on-Kplc stebs for best the is policy round-4K ft.C the and 170%, by formance for best the the is policy is first-touch policy round-4K/Carrefour for the best 100%, for by best first-touch/- formance the the is example, policy For for Carrefour significant. pol- performance sometimes each by best is brought icy the gain the yields Moreover, applications. policy observe some can NUMA we Moreover, each 6. that by divided is time completion Relative overhead ecnfis bev htuiga fcetNM policy NUMA efficient an using that observe first can We the applications, 29 the of each for reports, 7 Figure � � � � � � � � � � n tipoe efrac y315%. by performance improves it and ,

Xen+NUMA

sp.C ���������

iue6. Figure �������

n tipoe efrac y20,the 290%, by performance improves it and �����������

fTbe4cmlmnsti evaluation this complements 4 Table of �������������

eaieoeha fLnx e n e+a oprdt iuNM lwri better). is (lower LinuxNUMA to compared as Xen+ and Xen Linux, of overhead Relative

��������� ���� kmeans

bt.C ���� ���� n tipoe per- improves it and

n tipoe per- improves it and

���� ����

����� ����

cg.C ����

the ,

���� ���� 11

������� ����

fXn ihti etn,ec hsclCUeeue two executes CPU physical each setting, policy per- this placement avoid With vCPU to Xen. the order of by in CPU caused physical variations single formance a to vCPU each runs. the two compute the we of and time machines, completion swapping virtual average the by by twice, used nodes configuration the each this For execute machine. we we virtual when a reason, for varies nodes performance NUMA that, different applications, observed select the have We of vCPU. some physical single for Each a half. executes other thus the CPU on and nodes machine NUMA virtual the second of half the one on machine virtual first the on policy NUMA default the over Xen+. policy NUMA best improvement the the report of figures The interface. external the by col- (see the policy of NUMA each Xen umn For best the machine. select virtual we thread the applications, many in as vCPUs with available application as single virtual a each executes experiments, machine both In machines. virtual workloads two consolidated eleven with present 9 Figure and 8 Figure machines virtual Multiple 5.4.2 policy. passthrough first-touch PCI the the activate disable we we when that driver behav- fact This the de- Xen+. from to comes drastically ior compared as systematically performance their to grade seems policy first-touch in mrvdb h C ashog rvr( driver passthrough PCI cc the by improved tions more by degraded are applications three 5%. other than only best observe and the we by 10%, degradation policy is performance round-1G maximum the the replace policy, we if col- 7, (see ure applications four for policies umn other only the is than policy round-1G better NUMA The the implemented. than have efficient we less policies much is Xen+ by provided icy , �� nFgr ,ec ita ahn a 8vPs epin We vCPUs. 48 has machine virtual each 9, Figure In pin We vCPUs. 24 has machine virtual each 8, Figure In ial,w a oieta o h ikitnieapplica- disk-intensive the for that notice can we Finally, pol- round-1G default the that observe can we Moreover, pagerank

Xen+NUMA Xen+NUMA ��

������������ ����� ��� ,

sssp

fTbe4.Mroe,a rsne nFig- in presented as Moreover, 4). Table of fTbe4 hog h yecl provided hypercall the through 4) Table of

������

and �������� ���������

mongodb

������ ���

e eto ..) the 5.3.3), Section see ,

��

�������� ����

5.4.2 Multiple virtual machines

Figure 8 and Figure 9 present eleven consolidated workloads with two virtual machines. In both experiments, each virtual machine executes a single application with as many vCPUs as available threads in the virtual machine. For each of the applications, we select the best NUMA policy of Xen+ (see Table 4) through the hypercall provided by the external interface (see Section 5.3.3). The figures report the improvement of the best NUMA policy over the default NUMA policy of Xen+.

In Figure 8, each virtual machine has 24 vCPUs. We pin the first virtual machine on one half of the NUMA nodes of the machine and the second virtual machine on the other half. Each physical CPU thus executes a single vCPU. We have observed that, for some of the applications, performance varies when we select different NUMA nodes for a virtual machine. For this reason, we execute each configuration twice, by swapping the nodes used by the two virtual machines, and we compute the average completion time of the two runs.

In Figure 9, each virtual machine has 48 vCPUs. We pin each vCPU to a single physical CPU in order to avoid performance variations caused by the vCPU placement policy of Xen. With this setting, each physical CPU executes two vCPUs, one belonging to the first virtual machine and one belonging to the second virtual machine.

Figure 8. Relative improvement of Xen+NUMA over Xen+ with 2 colocated VMs (24 cores each, higher is better).

Figure 9. Relative improvement of Xen+NUMA over Xen+ with 2 consolidated VMs (48 cores each, higher is better).

Overall, we can observe that for consolidated workloads, using an efficient NUMA policy also drastically improves performance as compared to Xen+. In the best case, performance is improved by 440%. For 9 of the 11 configurations, using an efficient NUMA policy improves the performance of at least one virtual machine by more than 50%. We can also observe that, in the worst case, a virtual machine is degraded by only 10%, and this happens for a single configuration. These results confirm that using an efficient NUMA policy is also important for consolidated workloads on large multicores.

5.5 Xen+NUMA versus LinuxNUMA

In this experiment, we compare Xen+NUMA with LinuxNUMA. Xen+NUMA is defined as Xen+ with its best NUMA policy for each application (see Table 4). This experiment has the goal of showing that the large overhead of Xen+ is caused by an inefficient NUMA placement and not by the other virtualization overheads. In both Xen+ and Xen+NUMA, we use the single virtual machine setting, in which we pin the vCPUs to the physical CPUs. In LinuxNUMA, the threads are pinned to the physical CPUs.

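As an illustration of the pinning used in the LinuxNUMA baseline, the fragment below pins the calling thread to a single physical CPU through the Linux sched_setaffinity interface. It is a minimal sketch, assuming one software thread per physical CPU; it is not the exact tooling used for the experiments.

```c
/*
 * Minimal sketch: pin the calling thread to a single physical CPU.
 * This illustrates the kind of thread pinning used by the LinuxNUMA
 * baseline; it is not the exact experimental setup.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* A pid of 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    pin_self_to_cpu(0);          /* e.g., thread 0 runs on physical CPU 0 */
    printf("pinned to CPU 0\n");
    return 0;
}
```

Each application thread would be placed with such a call (or an equivalent mechanism), so that the thread-to-core mapping of LinuxNUMA mirrors the vCPU pinning used in the virtualized configurations.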
Figure 10 reports the overhead of Xen+NUMA as compared to LinuxNUMA. We can observe that with efficient NUMA policies, only 4 applications remain degraded by more than 50%, while 14 applications have an overhead of more than 50% with Xen+. This result shows that a large part of the overhead of virtualization on large multicores is caused by the NUMA placement policy. We have shown that, by integrating inside Xen+ efficient NUMA policies that have been implemented within the context of operating systems, we can reduce this overhead, while hiding the NUMA topology to the guest virtual machine.

For the 4 remaining applications, part of the overhead comes from the cost of virtualized IPIs (column CPU context switch of Table 2). These applications are not corrected by our simple solution, which targets the pthread locks and condition variables: memcached, for instance, intensively uses an ad-hoc synchronization mechanism that relies on the futex of Linux. Applications that intensively use the network also frequently leave the CPU because they continuously wait for network packets, and are thus probably slowed down by the cost of virtualized IPIs. Other applications, such as cassandra, intensively use the hard drive (see Table 2) and, despite the use of the IOMMU, may activate a bottleneck in the I/O stack that we have not yet identified.

Xen+NUMA ������� 2017/3/6 4,t vi ah nefrnefo te ita machines virtual 27, other from [22, interference server cache to same avoid 36], to the 9, 44], in [2, [2], workloads locks, handling compatible blocking 42, of colocate interrupt performance 36, of the 30, performance enforce 19, to the [9, enforce performance to I/Os 43], isolate virtual several to between machines: interference mitigating and 31] [10, policies. placement NUMA hy- propose not an do in they architectures but pervisor, NUMA study of a impact present performance [21] the al. of et Han hypervisor. NUMA the several inside implement policies to required interface we et the while policy, Liu on placement focus for NUMA single As a hypervisor. provide they an al., using hiding architecture by NUMA architectures in the NUMA (non-NUMA on systems legacy retro- operating on run 1997) focus to [8] propose al. They et compatibility. Bugnion NUMA hypervisor. classical an implement in pa- to policies our interface an in identify reason, we this should per, For hypervisor policies. the NUMA that several show provide we policy, provide NUMA they policy While single [26]. placement a created NUMA is machine static virtual vCPU a the placement when propose the memory al. by the et on Liu interested policy. focus are we al. com- algorithm, is et scheduling study Rao Our balancing. while load plementary: prevent not does They it performance improves 37%). when locality to NUMA this and up how caches that (by show show performance They improve architecture. can NUMA algorithm [32] a on al. place to vCPUs et scheduler the migration Rao vCPU hypervisor. random bias an the propose inside sev- Overall, policies implement to 32]. NUMA interface the 26, eral identify 21, works these [8, of hypervisors none on focus specifically observation. this interconnect confirms the study or Our im- controllers saturates. memory performance the dramatic either a when a that pact or have show works 15] can These [14, architecture 45]. sub-system operat- NUMA 38, locking [17, an collectors the in garbage in 32], in 25], 26, [12, [21, system hypervisor ing NUMA an of effect in performance architectures: the study works recent Many work Related 6. Relative overhead eea eerhwrsaeitrse nunderstanding in interested are works research Several ones four architectures, NUMA on works the Among

� � � � � � � � � � ���������

iue10. Figure

�������

����������� �������������

eaieoeha fXn n e+UAa oprdt iuNM lwri better). is (lower LinuxNUMA to compared as Xen+NUMA and Xen+ of overhead Relative

���������

����

����

����

����

����

���� ����

���� ���� ����

13 ����

Regarding IPIs, we discuss the Ding et al. algorithm [16] in Section 5.3.2 and, for the sake of simplicity, implement a simpler mechanism that mitigates the virtualized IPI cost for non-consolidated workloads. Our evaluation confirms that the virtualized IPI cost is a large source of overhead for some applications on large multicores.

For I/Os, their cost in a virtualized environment has been studied in different research works [1, 18, 24, 29]. Solutions exist to mitigate this cost, especially the work of Gordon et al. [18], which is able to entirely remove the hypervisor from the I/O path. By using the IOMMU, we remove some of the I/O bottleneck from the hypervisor in order to better highlight the NUMA effects.

7. Conclusion

This paper shows how we can implement the classical NUMA policies of operating systems inside an hypervisor, while hiding the NUMA topology to the guest virtual machines. We identify a small interface with two functions between the guest operating system and the hypervisor that enables these implementations. Our evaluation shows that we can significantly improve the performance of 9 of the 29 evaluated applications by more than 100%. We show that we can reduce the overhead of virtualization to less than 50% for most of the evaluated applications.

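As an illustration of the kind of guest/hypervisor interface summarized above, the declarations below sketch what a two-function paravirtual interface could look like. The names, signatures and stub bodies are assumptions made for this example; they are not the actual Xen+ hypercalls.

```c
/*
 * Hypothetical sketch of a two-function interface between the guest
 * operating system and the hypervisor.  Names, signatures and stub
 * bodies are assumptions for illustration; not the actual Xen+ hypercalls.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t guest_addr_t;   /* guest address                        */
typedef uint32_t pid_guest_t;    /* process identifier inside the guest  */

/*
 * Report to the hypervisor that a range of guest memory belongs to a
 * given process, so that a per-process NUMA policy can be applied.
 * (Stub: a real implementation would issue a hypercall.)
 */
static int hyper_report_memory_range(pid_guest_t pid,
                                     guest_addr_t start, guest_addr_t end)
{
    printf("process %u owns [%#llx, %#llx)\n", pid,
           (unsigned long long)start, (unsigned long long)end);
    return 0;
}

/*
 * Ask the hypervisor to apply a given NUMA policy to a process.
 * (Stub: a real implementation would issue a hypercall.)
 */
static int hyper_set_numa_policy(pid_guest_t pid, unsigned policy_id)
{
    printf("process %u uses NUMA policy %u\n", pid, policy_id);
    return 0;
}

int main(void)
{
    /* Example: describe one process and request a policy for it. */
    hyper_report_memory_range(1234, 0x10000000ULL, 0x20000000ULL);
    hyper_set_numa_policy(1234, 1);
    return 0;
}
```

In such a design, the first function would give the hypervisor the per-process memory knowledge it otherwise lacks, and the second would let the guest select a policy, in the spirit of the hypercall used in Section 5.4.2 to select the best policy for each application.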
Our work opens several perspectives. First, except the default round-1G policy, the NUMA policies presented in this paper only consider small pages of 4 KiB. Handling large pages in order to decrease the number of TLB misses should further improve performance. Second, our implementation may exhibit an overhead when an application often releases pages (see Section 4.2.4). Using an allocator such as scalloc [3] or llalloc [4], which only rarely release pages, should prevent this overhead. Third, we have identified an incompatibility between the IOMMU and the first-touch policy. Finding an algorithm that prevents the invalidation, in the hypervisor, of the pages used by the guest for DMA transfers is a promising perspective. Finally, automatically selecting the most efficient NUMA policy in an hypervisor or in an operating system remains an open subject.

Availability

The code and the results are freely available at https://github.com/gauthier-voron/xen-tokyo/tree/carrefour-improved for the Xen host and https://github.com/gauthier-voron/linux-xen-ft for the modified Linux guest.

References

[1] K. Adams and O. Agesen. A comparison of software and hardware techniques for x86 virtualization. In Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'06, pages 2–13, 2006.
[2] J. Ahn, C. H. Park, and J. Huh. Micro-sliced virtual processors to hide the effect of discontinuous CPU availability for consolidated systems. In Proceedings of the International Symposium on Microarchitecture, MICRO'14, pages 394–405, 2014.
[3] M. Aigner, C. M. Kirsch, M. Lippautz, and A. Sokolova. Fast, multicore-scalable, low-fragmentation memory allocation through large virtual memory and global data structures. In Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'15, pages 451–469, 2015.
[4] llalloc: Lockless memory allocator. http://locklessinc.com/.
[5] Cache hierarchy and memory subsystem of the AMD Opteron processor. http://portal.nersc.gov/project/training/files/XE6-feb-2011/Architecture/Opteron-Memory-Cache.pdf, 2011.
[6] AMD I/O virtualization technology (IOMMU) specification. http://support.amd.com/TechDocs/48882_IOMMU.pdf, 2015.
[7] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. An analysis of Linux scalability to many cores. In Proceedings of the conference on Operating Systems Design and Implementation, OSDI'10, pages 1–16, 2010.
[8] E. Bugnion, S. Devine, and M. Rosenblum. Disco: Running commodity operating systems on scalable multiprocessors. In Proceedings of the Symposium on Operating Systems Principles, SOSP'97, pages 143–156, 1997.
[9] L. Cheng, J. Rao, and F. C. M. Lau. vScale: Automatic and efficient processor scaling for SMP virtual machines. In Proceedings of the European Conference on Computer Systems, EuroSys'16, pages 2:1–2:14, 2016.
[10] L. Cherkasova and R. Gardner. Measuring CPU overhead for I/O processing in the Xen virtual machine monitor. In Proceedings of the Usenix Annual Technical Conference, USENIX ATC'05, pages 24–24, 2005.
[11] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the Symposium on Cloud Computing, SoCC'10, pages 143–154, 2010.
[12] M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth. Traffic management: A holistic approach to memory placement on NUMA systems. In Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'13, pages 381–394, 2013.
[13] F. David, G. Thomas, J. Lawall, and G. Muller. Continuously measuring critical section pressure with the Free-Lunch profiler. In Proceedings of the conference on Object Oriented Programming Systems Languages and Applications, OOPSLA'14, 2014.
[14] T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In Proceedings of the Symposium on Operating Systems Principles, SOSP'13, pages 33–48, 2013.
[15] D. Dice, V. J. Marathe, and N. Shavit. Lock cohorting: A general technique for designing NUMA locks. In Proceedings of the symposium on Principles and Practices of Parallel Programming, PPoPP'12, pages 247–256, 2012.
[16] X. Ding, P. B. Gibbons, M. A. Kozuch, and J. Shan. Gleaner: Mitigating the blocked-waiter wakeup problem for virtualized multicore applications. In Proceedings of the Usenix Annual Technical Conference, USENIX ATC'14, pages 73–84, 2014.
[17] L. Gidra, G. Thomas, J. Sopena, M. Shapiro, and N. Nguyen. NumaGiC: A garbage collector for big data on big NUMA machines. In Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'15, pages 661–673, 2015.
[18] A. Gordon, N. Amit, N. Har'El, M. Ben-Yehuda, A. Landau, A. Schuster, and D. Tsafrir. ELI: Bare-metal performance for I/O virtualization. In Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'12, pages 411–422, 2012.
[19] D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat. Enforcing performance isolation across virtual machines in Xen. In Proceedings of the International Conference on Middleware, Middleware'06, pages 342–362, 2006.
[20] A. Haas, M. Lippautz, T. A. Henzinger, H. Payer, A. Sokolova, C. M. Kirsch, and A. Sezgin. Distributed queues in shared memory: Multicore performance and scalability through quantitative relaxation. In Proceedings of the ACM International Conference on Computing Frontiers, pages 17:1–17:9, 2013.
[21] J. Han, J. Ahn, C. Kim, Y. Kwon, Y.-R. Choi, and J. Huh. The effect of multi-core on HPC applications in virtualized systems. In Proceedings of the European conference on Parallel processing, EuroPar'10, pages 615–623, 2010.
[22] Y. Koh, R. C. Knauerhase, P. Brett, M. Bowman, Z. Wen, and C. Pu. An analysis of performance interference effects in virtual environments. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, ISPASS'07, pages 200–209, 2007.
[23] R. Lachaize, B. Lepers, and V. Quema. MemProf: A memory profiler for NUMA multicore systems. In Proceedings of the Usenix Annual Technical Conference, USENIX ATC'12, pages 53–64, 2012.

[24] J. R. Lange, K. Pedretti, P. Dinda, P. G. Bridges, C. Bae, P. Soltero, and A. Merritt. Minimal-overhead virtualization of a large scale supercomputer. In Proceedings of the international conference on Virtual Execution Environments, VEE'11, pages 169–180, 2011.
[25] AutoNUMA: the other approach to NUMA scheduling. http://lwn.net/Articles/488709/, 2012.
[26] M. Liu and T. Li. Optimizing virtual machine consolidation performance on NUMA server architecture for cloud workloads. In Proceedings of the International Symposium on Computer Architecture, ISCA'14, pages 325–336, 2014.
[27] J. Mars, L. Tang, K. Skadron, M. L. Soffa, and R. Hundt. Increasing utilization in modern warehouse-scale computers using Bubble-Up. IEEE Micro, 32(3):88–99, 2012.
[28] J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'91, pages 269–278, 1991.
[29] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel. Diagnosing performance overheads in the Xen virtual machine environment. In Proceedings of the international conference on Virtual Execution Environments, VEE'05, pages 13–23, 2005.
[30] D. Ongaro, A. L. Cox, and S. Rixner. Scheduling I/O in virtual machine monitors. In Proceedings of the international conference on Virtual Execution Environments, VEE'08, pages 1–10, 2008.
[31] X. Pu, L. Liu, Y. Mei, S. Sivathanu, Y. Koh, and C. Pu. Understanding performance interference of I/O workload in virtualized cloud environments. In Proceedings of the International Conference on Cloud Computing, CLOUD'10, pages 51–58, 2010.
[32] J. Rao, K. Wang, X. Zhou, and C.-Z. Xu. Optimizing virtual machine scheduling in NUMA multicore systems. In Proceedings of the symposium on High Performance Computer Architecture, HPCA'13, pages 306–317, 2013.
[33] A. Roy, I. Mihailovic, and W. Zwaenepoel. X-Stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Symposium on Operating Systems Principles, SOSP'13, pages 472–488, 2013.
[34] S. Schneider, C. D. Antonopoulos, and D. S. Nikolopoulos. Scalable locality-conscious multithreaded memory allocation. In Proceedings of the International Symposium on Memory Management, ISMM'06, pages 84–94, 2006.
[35] X. Song, H. Chen, and B. Zang. Characterizing the performance and scalability of many-core applications on virtualized platforms. Technical report, Parallel Processing Institute, Fudan University, 2010.
[36] B. Teabe, A. Tchana, and D. Hagimont. Application-specific quantum for multi-core platform scheduler. In Proceedings of the European Conference on Computer Systems, EuroSys'16, pages 3:1–3:14, 2016.
[37] B. Teabe, A. Tchana, and D. Hagimont. The lock holder and the lock waiter pre-emption problems: nip them in the bud using informed spinlocks (I-Spinlocks). In Proceedings of the European Conference on Computer Systems, EuroSys'17, 2017.
[38] M. M. Tikir and J. K. Hollingsworth. NUMA-aware Java heaps for server applications. In Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS'05, pages 108–117, 2005.
[39] V. Uhlig, J. LeVasseur, E. Skoglund, and U. Dannowski. Towards scalable multiprocessor virtual machines. In Proceedings of the Virtual Machine Research and Technology Symposium, pages 1–14, 2004.
[40] P. M. Wells, K. Chakraborty, and G. S. Sohi. Hardware support for spin management in overcommitted virtual machines. In Proceedings of the International Conference on Parallel Architectures and Compilation, PACT'06, pages 124–133, 2006.
[41] C. Weng, Q. Liu, L. Yu, and M. Li. Dynamic adaptive scheduling for virtual machines. In Proceedings of the symposium on High-Performance Parallel and Distributed Computing, HPDC'11, pages 239–250, 2011.
[42] C. Xu, S. Gamage, H. Lu, R. Kompella, and D. Xu. vTurbo: Accelerating virtual machine I/O processing using designated turbo-sliced core. In Proceedings of the Usenix Annual Technical Conference, USENIX ATC'13, pages 243–254, 2013.
[43] C. Xu, S. Gamage, P. N. Rao, A. Kangarlou, R. R. Kompella, and D. Xu. vSlicer: Latency-aware virtual machine scheduling via differentiated-frequency CPU slicing. In Proceedings of the symposium on High-Performance Parallel and Distributed Computing, HPDC'12, pages 3–14, 2012.
[44] H. Yang, A. D. Breslow, J. Mars, and L. Tang. Bubble-Flux: Precise online QoS management for increased utilization in warehouse scale computers. In Proceedings of the International Symposium on Computer Architecture, ISCA'13, pages 607–618, 2013.
[45] J. Zhou and B. Demsky. Memory management for many-core processors with software configurable locality policies. In Proceedings of the International Symposium on Memory Management, ISMM'12, pages 3–14, 2012.
