Performance Profiling in a Virtualized Environment

Jiaqing Du (EPFL, Switzerland)    Nipun Sehrawat (IIT Guwahati, India)    Willy Zwaenepoel (EPFL, Switzerland)

Abstract

Virtualization is a key enabling technology for cloud computing. Many applications deployed in a cloud run in virtual machines. However, profilers based on CPU performance counters do not work well in a virtualized environment. In this paper, we explore the possibilities for achieving performance profiling in virtual machine monitors (VMMs) built on paravirtualization, hardware assistance, and binary translation. We present the design and implementation of performance profiling for a VMM based on the x86 hardware extensions, with some preliminary experimental results.

1 Introduction

Virtualization is a key enabling technology for cloud computing. Applications deployed in a virtualization-based cloud, e.g., Amazon's EC2, run inside virtual machines, which are dynamically mapped onto a cluster of physical machines. By providing workload consolidation and migration, virtualization improves the resource utilization and the scalability of a cloud. In a virtualized environment, the virtual machine monitor (VMM) exports a set of virtual machines to the guest operating systems (hereafter referred to as guests) and manages accesses to the underlying hardware resources shared by multiple guests. In this way virtualization makes it possible to run multiple operating systems simultaneously on a single physical machine [11, 3, 8].

In modern processors, the performance monitoring unit (PMU) is indispensable for performance debugging of complex software systems. It is employed by software profilers to monitor the micro-architectural behavior of a program. Typical hardware events monitored include clock cycles, instruction retirements, cache misses, TLB misses, etc. As an example, Table 1 shows a typical output of a profiler. It presents the top five time-consuming functions of the whole system. Developers can refer to the output of a profiler to identify bottlenecks in a program and tune its performance. PMU-based performance profiling in a non-virtualized environment has been well studied. Mature tools built upon PMUs exist in almost every popular operating system [9, 6]. They are used extensively to tune software performance. However, this is not the case in a virtualized environment.

% CYCLE    Function            Module
98.5529    vmx_vcpu_run        kvm-intel.ko
 0.2226    (no symbols)        libc.so
 0.1034    hpet_cpuhp_notify   vmlinux
 0.1034    native_patch        vmlinux
 0.0557    (no symbols)        bash

Table 1: Top five time-consuming functions from profiling both the VMM and the guest.

For applications executing in a public cloud, running a PMU-based profiler directly in a guest does not result in useful output because, as far as we know, none of the current VMMs expose the PMU programming interfaces properly to a guest.¹ As more and more applications are migrated to virtualization-based public clouds, it is necessary to add support for PMU-based profiling in virtual machines, without requiring access to the privileged VMM. Profiling results help customers of public clouds to understand the unique characteristics of their computing environment, identify performance bottlenecks, and fully exploit the hardware and software resources they pay for.

¹ It is possible to obtain limited profiling results by running the profiler in the guest in a timer-driven mode.

In a private cloud, it is possible to run a profiler directly in the VMM, but some of its intermediate output cannot be converted to meaningful final output without the cooperation of the guest. The data in Table 1 result from profiling a computation-intensive application in the guest. The profiler runs in the VMM, and it monitors CPU cycles. The first row shows that the CPU spends more than 98% of its cycles in the function vmx_vcpu_run(), which switches the CPU to run guest code. As the design of the profiler does not consider virtualization, all the instructions consumed by the guest are accounted to this VMM function. Therefore, we cannot obtain detailed profiling data of the guest.

Currently, only XenOprof [10] supports detailed profiling of virtual machines running in Xen, a VMM based on paravirtualization. For VMMs based on hardware assistance and binary translation, no such tools exist. Enabling profiling in the VMM provides users of private clouds and developers of both types of clouds a full-scale view of the whole software stack and its interactions with the hardware. This will help them tune the performance of both the VMM and the applications running in the guest.

In this paper, we address the problem of performance profiling for three different virtualization techniques: paravirtualization, binary translation, and hardware assistance. We categorize profiling techniques in a virtualized environment into two types. Guest-wide profiling discloses the runtime characteristics of the guest kernel and its active applications. It only requires a profiler running in the guest, similar to native profiling, i.e., profiling in a non-virtualized environment. The VMM is responsible for virtualizing the PMU hardware, and the changes introduced to the VMM are transparent to the guest. System-wide profiling discloses the runtime behavior of both the VMM and the active guests. It requires a profiler running in the VMM and provides a full-scale view of the whole software stack: both the VMM and the guest.

The contributions of this paper are as follows. (1) We analyze the challenges of achieving both guest-wide and system-wide profiling for each of the three virtualization techniques. Synchronous virtual interrupt delivery to the guest is necessary for guest-wide profiling. The ability to interpret samples belonging to a guest context into meaningful symbols is required for system-wide profiling. (2) We present profiling solutions for virtualization techniques based on hardware assistance and binary translation. (3) We implement both guest-wide and system-wide profiling for a VMM based on the x86 virtualization extensions and present some preliminary experimental results.

The rest of this paper is organized as follows. In Section 2 we review the structure of a conventional profiler. In Section 3 we analyze the difficulties of supporting guest-wide profiling in a virtualized environment and present solutions for each of the three aforementioned virtualization techniques. We discuss system-wide profiling in Section 4. In Section 5 we present the implementation of both guest-wide and system-wide profiling for a VMM based on the x86 virtualization extensions and discuss some preliminary experimental results. In Section 6 we survey related work, and we conclude in Section 7.
2 Background

As a hardware component, a PMU consists of a set of performance counters, a set of event selectors, and the digital logic that increments a counter after a specified hardware event occurs. When a performance counter reaches a pre-defined threshold, the interrupt controller sends a counter overflow interrupt to the CPU.

Generally, a profiler consists of the following major components:

• Sampling configuration. The profiler registers itself as the counter overflow interrupt handler of the operating system, selects the monitored hardware events, and sets the number of events after which an interrupt should occur. It programs the PMU hardware directly by writing to its registers.

• Sample collection. When a counter overflow interrupt occurs, the profiler handles it synchronously. In the interrupt context, it records the program counter (PC), the virtual address space of the sampled PC value, and some additional system state.

• Sample interpretation. The profiler converts the sampled PC values into routine names of the profiled process by consulting its virtual memory layout and its binary file compiled with debugging information.

Native profiling only involves one OS instance, where all three profiling components reside. They interact with each other through facilities provided by the OS. All of the exchanged information is located in the OS.

In a virtualized environment, multiple OS instances are involved, namely the VMM and the guest(s). The three components of a profiler may be spread among different instances, and their interactions may require communication between them. In addition, not all the conditions necessary for implementing these three components may be satisfied. We present a detailed discussion of implementing both guest-wide and system-wide profiling for each of the three virtualization techniques in the following two sections.
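To make the first two components concrete, the following simplified C sketch shows how a profiler might program one counter and collect samples on an x86 PMU. The wrmsr(), register_overflow_handler(), and record_sample() helpers are assumptions standing in for OS-specific facilities; the MSR numbers and control bits follow Intel's architectural performance monitoring interface, and the event code is only an example.

    #include <stdint.h>

    #define IA32_PERFEVTSEL0    0x186     /* event select register for counter 0 */
    #define IA32_PMC0           0x0C1     /* performance counter 0               */
    #define EVT_UNHALTED_CYCLES 0x3CULL   /* example event: unhalted core cycles */
    #define SEL_USR (1ULL << 16)          /* count events in user mode           */
    #define SEL_OS  (1ULL << 17)          /* count events in kernel mode         */
    #define SEL_INT (1ULL << 20)          /* raise an interrupt on overflow      */
    #define SEL_EN  (1ULL << 22)          /* enable the counter                  */
    #define SAMPLE_PERIOD 100000ULL       /* events between two samples          */

    struct sample { uint64_t pc; uint64_t address_space; };

    /* Assumed OS/architecture helpers, not part of any particular profiler. */
    void wrmsr(uint32_t msr, uint64_t val);
    void register_overflow_handler(void (*handler)(const struct sample *));
    void record_sample(const struct sample *s);

    /* Sample collection: runs in the counter overflow interrupt context; the
     * handler stub is assumed to have captured the interrupted PC and address
     * space into *s before calling us. */
    static void on_counter_overflow(const struct sample *s)
    {
        record_sample(s);                              /* save PC and context */
        wrmsr(IA32_PMC0, (uint64_t)-SAMPLE_PERIOD);    /* re-arm the counter  */
    }

    /* Sampling configuration: select the event, the sampling period, and the
     * overflow handler, then enable the counter. */
    static void configure_sampling(void)
    {
        register_overflow_handler(on_counter_overflow);
        wrmsr(IA32_PMC0, (uint64_t)-SAMPLE_PERIOD);    /* overflow after N events */
        wrmsr(IA32_PERFEVTSEL0,
              EVT_UNHALTED_CYCLES | SEL_USR | SEL_OS | SEL_INT | SEL_EN);
    }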
3 Guest-wide Profiling

Challenges. By definition, guest-wide profiling runs a profiler in the guest and only monitors the guest. Although more information about the whole software stack can be obtained by employing system-wide profiling, sometimes guest-wide profiling is the only way to do performance profiling and tuning in a virtualized environment. As we explained before, users of a public cloud service are not normally granted the privilege to run a profiler in the VMM, which is necessary for system-wide profiling.

To achieve guest-wide profiling, the VMM should provide facilities that enable the implementation of the three profiling components in a guest, as well as PMU multiplexing, i.e., saving and restoring the PMU registers. Since sample interpretation under guest-wide profiling is the same as in native profiling and PMU multiplexing is trivial, we only present the required facilities for the sampling configuration and the sample collection components.

To implement the sampling configuration component, the guest should be able to program the physical PMU registers, either directly or with the assistance of the VMM. To implement the sample collection component, the VMM must support synchronous interrupt delivery to the guest. In other words, if the VMM injects an interrupt into the guest when it is not running, that injected interrupt must be handled immediately when the guest resumes its execution. For performance profiling, when a performance counter overflows, a physical interrupt is generated and handled by the VMM. If the interrupt occurs while guest code is executing, the counter overflow is considered to be contributed by the guest. The VMM injects a virtual interrupt into the guest, which drives the profiler to collect a sample. If the guest handles the injected interrupt synchronously when it resumes execution, it can collect a correct sample as in native profiling. If not, by the time the injected virtual interrupt is handled, the real interrupt context has already been destroyed, and the profiler obtains wrong sampling information.
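The requirement on the VMM side can be summarized by the following sketch in simplified C. The structure and function names are hypothetical and do not refer to any particular VMM; the point is only that the overflow must be attributed to the guest and the virtual interrupt must be delivered before any further guest code runs.

    #include <stdbool.h>
    #include <stdint.h>

    #define COUNTER_OVERFLOW_VECTOR 0xEE        /* example interrupt vector    */

    struct vcpu { uint8_t pending_virq; };      /* minimal stand-in for a vCPU */

    /* Assumed VMM primitives (hypothetical names). */
    void vmm_profiler_collect_sample(void);
    void inject_virtual_interrupt(struct vcpu *vcpu, uint8_t vector);
    void enter_guest(struct vcpu *vcpu);

    /* Called when the physical counter overflow interrupt arrives. */
    void vmm_handle_counter_overflow(struct vcpu *vcpu, bool guest_was_running)
    {
        if (!guest_was_running) {
            vmm_profiler_collect_sample();      /* the overflow belongs to the VMM */
            return;
        }
        /* The overflow is attributed to the guest: remember it so the guest
         * profiler collects the sample itself. */
        vcpu->pending_virq = COUNTER_OVERFLOW_VECTOR;
    }

    /* The pending interrupt must be delivered before any other guest
     * instruction runs; otherwise the interrupted context has already been
     * destroyed and the guest profiler records a wrong sample. */
    void vmm_resume_guest(struct vcpu *vcpu)
    {
        if (vcpu->pending_virq) {
            inject_virtual_interrupt(vcpu, vcpu->pending_virq);  /* synchronous */
            vcpu->pending_virq = 0;
        }
        enter_guest(vcpu);
    }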
Paravirtualization. The major obstacle to implementing guest-wide profiling for VMMs based on paravirtualization is synchronous interrupt delivery to the guest. At least in Xen, this functionality is currently not available. External events are delivered to the guest asynchronously. Mechanisms similar to synchronous signal delivery in a conventional OS should be employed to add this capability to paravirtualization-based VMMs.

Hardware assistance. The x86 virtualization extensions provide facilities that help implement guest-wide profiling. First, the guest can be configured to directly access the PMU registers, which are model-specific registers (MSRs) on the x86. Second, the CPU can be configured to automatically save and restore the relevant MSRs. Third, the guest can be configured to exit when an interrupt occurs. The interrupt number is recorded in the exit information field of a control structure. The VMM injects a virtual interrupt into the guest by setting a field of the control structure. Event delivery to a guest is synchronous, so the guest profiler samples correct system states. We present our implementation of guest-wide profiling based on the x86 hardware extensions in Section 5.1.

Binary translation. For VMMs based on binary translation, synchronous interrupt delivery is also required. Another obstacle is that the sampled PC values point to addresses in the translation cache, not to the memory addresses holding the original instructions. Additional work is required to map the sampled PC values to the original memory addresses. Because only the VMM has enough information to do the translation, without explicit communication between the VMM and the guest, guest-wide profiling is not possible for VMMs based on binary translation. We present more details about this address translation problem and a solution for system-wide profiling in Section 4.

Where to multiplex the PMU? Besides the requirements stated previously, another important question is: what are the right places to save and restore the relevant MSRs? The first answer is to save and restore the relevant registers when the CPU switches from running guest code to VMM code, or vice versa. We call this type of guest-wide profiling CPU switch. Profiling results of CPU switch reflect the characteristics of the virtualized physical hardware such as CPU and memory, but not of the devices emulated by software. This is because the PMU is active only when guest code is being executed. When the CPU switches to execute the VMM code that emulates the effects of a guest I/O operation, although the monitored hardware events are still being contributed by the guest, they are not accounted to the guest because the PMU is off. The second answer is to save and restore the relevant MSRs when the VMM switches execution from one guest to a different one. We call this domain switch. It accounts to a guest all the hardware events triggered by itself and reflects the characteristics of both the virtualized hardware and the VMM. This method may introduce inaccuracy in the profiling results, because the VMM may be interrupted when it is executing on behalf of a guest. Since the registers are only saved and restored on a domain switch, the execution resulting from the interruption will be attributed to that guest. This problem is similar to the resource accounting problem in a conventional operating system [2]. Although sometimes not entirely accurate from the viewpoint of performance profiling, domain switch profiling does give a more complete picture of the overall system. We present the results of guest-wide profiling based on CPU switch and domain switch in Section 5.2.
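The two multiplexing points differ only in where the save/restore hooks are placed. The sketch below, in simplified C with hypothetical hook and structure names, contrasts them; in a real VMM the first pair of hooks would sit on the VM-exit/VM-entry path and the second in the host scheduler's context-switch path.

    #include <stdint.h>

    #define NR_PMU_MSRS 8

    struct pmu_state { uint64_t msrs[NR_PMU_MSRS]; };
    struct guest     { struct pmu_state pmu; };
    struct domain    { struct pmu_state pmu; };
    struct thread    { struct domain *domain; };

    /* Assumed helpers that copy the PMU-related MSRs to/from memory. */
    void save_pmu_msrs(struct pmu_state *s);
    void restore_pmu_msrs(const struct pmu_state *s);

    /* CPU switch: multiplex on every guest/VMM transition. The counters run
     * only while guest code runs, so VMM work such as device emulation done
     * on behalf of the guest is not counted. */
    void on_vm_exit(struct guest *g)  { save_pmu_msrs(&g->pmu); }
    void on_vm_entry(struct guest *g) { restore_pmu_msrs(&g->pmu); }

    /* Domain switch: multiplex only when the host switches between threads
     * that belong to different guests (domains). VMM work done on behalf of a
     * guest is charged to that guest, at the cost of some inaccuracy if the
     * VMM is interrupted while running on the guest's behalf. */
    void on_host_context_switch(struct thread *prev, struct thread *next)
    {
        if (prev->domain == next->domain)
            return;                              /* same domain: keep counting */
        if (prev->domain)
            save_pmu_msrs(&prev->domain->pmu);
        if (next->domain)
            restore_pmu_msrs(&next->domain->pmu);
    }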
4 System-wide Profiling

Challenges. System-wide profiling provides the runtime characteristics of both the VMM and the guests. It first requires that all three components of a profiler run in the VMM. Since the profiler resides in the VMM, it can program the PMU hardware directly and handle counter overflow interrupts synchronously. The major obstacle for system-wide profiling is to interpret samples belonging not to the VMM, but to the guests. This requires at least the sample interpretation component of a profiler to be present in the guest. As a result, explicit communication between the profiler running in the VMM and the sample interpretation component in the guest is required. The interaction rate between the VMM and the guest approaches the rate of counter overflow interrupts. Efficient communication methods, such as zero-copy, should be used to avoid distortions in the profiling results. In addition, the interpretation should be finished in time. Otherwise, if the profiled process terminates before the sample interpretation starts, there is no way to complete it, because the profiler needs to consult the virtual memory layout of the process to do the interpretation.

Interpreting guest samples. One approach to the guest sample interpretation problem is to not let the VMM record samples corresponding to a guest, but to delegate this task to the guest. We call this approach full-delegation. It requires guest-wide profiling to be already supported by the VMM. With this approach, during the profiling process, one profiler runs in the VMM and another one runs in the guest. The VMM profiler is responsible for collecting and interpreting samples from the VMM. For a sample not belonging to the VMM, a counter overflow interrupt is injected into the corresponding guest. The guest profiler collects and interprets samples from the guest. A system-wide profile can then be obtained by merging the outputs of the VMM and the guest profiler.

An alternative approach to this problem is to let the VMM profiler collect all the samples and delegate the interpretation of guest samples to the corresponding guest [10]. We call this approach interpretation-delegation. With this solution, the VMM saves the samples belonging to a guest in a buffer shared with the guest.² Every time it adds a sample to a guest's buffer, the VMM signals that guest that there are pending samples. After a guest receives the signal, it notifies its sample interpretation component to interpret the samples, in the same way as a native profiler. The results are sent back to the VMM and merged with those produced by the VMM to obtain a system-wide profile.

² The cooperation of the guest may also be needed. For example, the VMM cannot directly figure out the corresponding virtual address space of the sampled PC value.
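A minimal sketch of the kind of shared buffer interpretation-delegation relies on is shown below, assuming the VMM and the guest can map the same pages; the layout and names are illustrative and not taken from XenOprof.

    #include <stdint.h>

    #define SAMPLE_RING_ENTRIES 4096

    struct guest_sample {
        uint64_t pc;             /* sampled program counter (guest virtual) */
        uint64_t address_space;  /* identifies the sampled guest process    */
        uint32_t event;          /* which hardware event overflowed         */
    };

    /* One producer (the VMM) and one consumer (the guest's sample
     * interpretation component) share this ring through a common memory
     * mapping, so samples cross the VMM/guest boundary without copying. */
    struct shared_sample_ring {
        volatile uint32_t head;                          /* written by the VMM   */
        volatile uint32_t tail;                          /* written by the guest */
        struct guest_sample entries[SAMPLE_RING_ENTRIES];
    };

    void notify_guest(void);   /* assumed primitive, e.g., a virtual interrupt */

    /* VMM side: append a sample and tell the guest there is work pending. */
    int vmm_push_sample(struct shared_sample_ring *r, const struct guest_sample *s)
    {
        uint32_t next = (r->head + 1) % SAMPLE_RING_ENTRIES;
        if (next == r->tail)
            return -1;            /* ring full: drop the sample, count the loss */
        r->entries[r->head] = *s;
        r->head = next;
        notify_guest();
        return 0;
    }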
dress in the translation cache, there is a corresponding An alternative approach to this problem is to let the entry in the reverse address translation cache, containing VMM profiler collect all the samples and delegate the in- the address of the corresponding original guest instruc- terpretation of guest samples to the corresponding guest tion. For each PC value sampled from a guest context, [10]. We call this approach interpretation-delegation. the VMM is responsible for rewriting it with the origi- With this solution, the VMM saves the samples belong- nal instruction address by looking up the reverse address ing to a guest in a buffer shared with the guest2. Every translation cache. This should be done before the guest time it adds a sample to a guest’s buffer, the VMM sig- profiler reads the sampled PC value. A possible rewriting nals that guest that there are pending samples. After a point is in the virtual interrupt injection procedure. This guest receives the signal, it notifies its sample interpreta- rewriting process is transparent to the sample interpreta- tion component to interpret the samples, in the same way tion component in the guest. as a native profiler. The results are sent back to the VMM and merged with those produced by the VMM to obtain 5 Implementation and Evaluation a system-wide profile. We describe our extensions to the kernel-based virtual 2The cooperation of the guest may also be needed. For example, the VMM cannot directly figure out the corresponding virtual address machine (KVM) [8] that implement both guest-wide and space of the sampled PC value. system-wide profiling. We present some preliminary ex- perimental results to show the feasibility of implement- tualization extensions and supports synchronous virtual ing guest-wide profiling. We leave the implementation of interrupt delivery in the guest. In a profiling session, we system-wide profiling for a VMM based on binary trans- run one unmodified profiler instance in the host and one lation for future work. in each guest. These profiling instances work and coop- erate as we discussed in Section 4. The only changes to KVM are clearing the bit in an advanced programmable 5.1 Profiling for KVM interrupt controller (APIC) register after each VM-exit KVM is a kernel subsystem that leverages hard- and injecting an NMI to a guest when it causes a perfor- ware virtualization extensions to add a virtual machine mance counter overflow. monitor capability to Linux. With KVM, the VMM is a set of kernel modules in the host Linux operating sys- 5.2 Experimental Results tem while each virtual machine resides in a normal user- space process. Although KVM supports multiple hard- We use our guest-wide profiling extensions to KVM to ware architectures, we choose the x86 with virtualization profile a guest in an experiment that measures the TCP extensions to illustrate our implementation, because the receive throughput. The VMM consists of the 2.6.32 x86 version of KVM has the most mature code. with KVM enabled and QEMU [4] 0.11. The virtualization extensions augment the x86 with The guest runs Linux with the 2.6.32 kernel. The pro- two new operation modes: host mode and guest mode. filer is OProfile [9] 0.9.5. Both the guest kernel and the KVM runs in host mode and its guests run in guest mode. profiler remain unmodified. 
5 Implementation and Evaluation

We describe our extensions to the kernel-based virtual machine (KVM) [8] that implement both guest-wide and system-wide profiling. We present some preliminary experimental results to show the feasibility of implementing guest-wide profiling. We leave the implementation of system-wide profiling for a VMM based on binary translation for future work.

5.1 Profiling for KVM

KVM is a kernel subsystem that leverages hardware virtualization extensions to add a virtual machine monitor capability to Linux. With KVM, the VMM is a set of kernel modules in the host Linux operating system, while each virtual machine resides in a normal user-space process. Although KVM supports multiple hardware architectures, we choose the x86 with virtualization extensions to illustrate our implementation, because the x86 version of KVM has the most mature code.

The virtualization extensions augment the x86 with two new operation modes: host mode and guest mode. KVM runs in host mode and its guests run in guest mode. Host mode is compatible with conventional x86, while guest mode is very similar to it but deprivileged in certain ways. Guest mode supports all four privilege levels and allows direct execution of the guest code. The virtual machine control structure (VMCS) controls various behaviors of a virtual machine. Two transitions are defined: a transition from host mode to guest mode, called a VM-entry, and a transition from guest mode to host mode, called a VM-exit. Regarding performance profiling, if a performance counter overflows when the CPU is in guest mode, the currently running guest is forced to exit, i.e., the CPU switches from guest mode to host mode. The VM-exit information in the VMCS indicates whether the current VM-exit is caused by a non-maskable interrupt (NMI). By checking this field, KVM is able to decide whether a counter overflow is contributed by a guest. We assume all NMIs in a profiling session are caused by counter overflows. KVM can also decide this by checking the contents of all performance counters.

Our guest-wide profiling implementation requires no modifications to the guest OS and the profiler. The profiler reads and writes the physical PMU registers directly as it does in native profiling. KVM forwards NMIs due to performance counter overflows to the guest. When CPU switch is enabled, KVM saves the relevant MSRs when a VM-exit happens and restores them when the corresponding VM resume occurs. This can be done automatically in hardware, by configuring certain fields in the VMCS. When domain switch is enabled, we tag all threads belonging to a guest and group them into one domain. When the Linux (host) kernel switches to a thread not belonging to the current domain, it saves and restores the relevant registers in software.

We implement system-wide profiling for KVM with the full-delegation approach, since KVM is built on hardware virtualization extensions and supports synchronous virtual interrupt delivery in the guest. In a profiling session, we run one unmodified profiler instance in the host and one in each guest. These profiling instances work and cooperate as we discussed in Section 4. The only changes to KVM are clearing a bit in an advanced programmable interrupt controller (APIC) register after each VM-exit and injecting an NMI into a guest when it causes a performance counter overflow.
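The KVM-side logic just described is small; the sketch below restates it in simplified C. The function names are hypothetical stand-ins rather than actual KVM symbols, and we assume the APIC bit mentioned above is the mask flag of the performance-counter LVT entry, which the local APIC sets when it delivers the overflow interrupt.

    #include <stdbool.h>

    struct vcpu;   /* opaque here; stands in for KVM's per-vCPU state */

    /* Assumed helpers (hypothetical names, not actual KVM symbols). */
    void apic_clear_perfctr_mask(void);              /* unmask the perf-counter LVT entry */
    bool exit_was_caused_by_nmi(struct vcpu *vcpu);  /* from the VM-exit information      */
    bool performance_counter_overflowed(void);
    void queue_nmi_for_guest(struct vcpu *vcpu);
    void save_guest_pmu_msrs(struct vcpu *vcpu);

    /* Simplified view of the VM-exit path with guest-wide profiling enabled. */
    void handle_vm_exit(struct vcpu *vcpu)
    {
        /* Re-enable the performance counter interrupt, which the local APIC
         * masks when it delivers the overflow interrupt. */
        apic_clear_perfctr_mask();

        if (exit_was_caused_by_nmi(vcpu) && performance_counter_overflowed()) {
            /* The counter overflowed while guest code was running: forward the
             * NMI so the guest profiler collects the sample on the next
             * VM-entry. */
            queue_nmi_for_guest(vcpu);
        }

        /* With CPU switch enabled, the relevant MSRs are saved here (or
         * automatically through the VMCS) and restored on the corresponding
         * VM-entry. */
        save_guest_pmu_msrs(vcpu);
    }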
5.2 Experimental Results

We use our guest-wide profiling extensions to KVM to profile a guest in an experiment that measures TCP receive throughput. The VMM consists of the Linux 2.6.32 kernel with KVM enabled and QEMU [4] 0.11. The guest runs Linux with the 2.6.32 kernel. The profiler is OProfile [9] 0.9.5. Both the guest kernel and the profiler remain unmodified. Our machine is equipped with one Intel Core2 quad-core processor and one Gigabit Ethernet NIC. As much TCP traffic as possible is pushed to the guest from a Gigabit NIC on a different machine.

% INSTR    Function                    Module
14.1047    csum_partial                vmlinux
 8.9527    csum_partial_copy_generic   vmlinux
 6.2500    copy_to_user                vmlinux
 3.9696    ipt_do_table                ip_tables.ko
 3.6318    tcp_v4_rcv                  vmlinux
 3.2095    (no symbols)                libc.so
 2.8716    ip_route_input              vmlinux
 2.7027    tcp_rcv_established         vmlinux

Table 2: Eight functions with the highest number of instruction retirements, with CPU switch enabled.

Table 2 presents the eight functions with the largest number of instruction retirements. In this experiment CPU switch is used for PMU virtualization. The total number of samples is 1184. Most of the functions in the table are from the Linux network stack and carry out packet processing jobs. Table 3 gives the results for the same experiment, but with domain switch enabled for PMU virtualization. The total number of samples is 7286. This shows that more than 80% of the retired instructions involved in receiving packets in the guest are spent outside the guest, inside the device emulation code. This percentage is reasonable since the experiment involves an I/O-intensive application. The top five functions are all related to I/O. The first three are from the RTL8139 NIC driver, and the last two program the APIC. This result shows that the VMM spends a large number of instructions emulating the effects of the I/O operations in the virtual RTL8139 NIC and the virtual APIC.

% INSTR    Function                    Module
31.0321    cp_interrupt                8139cp.ko
18.3365    cp_rx_poll                  8139cp.ko
14.1916    cp_start_xmit               8139cp.ko
 5.7782    native_apic_mem_write       vmlinux
 5.1331    native_apic_mem_read        vmlinux
 2.6215    csum_partial                vmlinux
 1.4411    csum_partial_copy_generic   vmlinux
 1.2901    copy_to_user                vmlinux

Table 3: Eight functions with the highest number of instruction retirements, with domain switch enabled.

The data in both tables demonstrate the feasibility of conducting guest-wide profiling in a virtualized environment based on hardware assistance. Both CPU switch and domain switch are useful, as they show developers different perspectives of how programs behave in a virtualized computing environment.

6 Related Work

XenOprof [10] is the first profiler that supports virtual machines. According to our definitions, it does system-wide profiling. It is specifically designed for Xen, a VMM based on paravirtualization.

Xen HVM is a version of Xen that supports hardware-assisted full virtualization. In its support for profiling, Xen HVM only saves and restores MSRs when it performs domain switches. In Xen's architecture, I/O device emulation for all the guests is done in domain 0. VM exits in a domain that do not require the intervention of domain 0 are handled in the context of that domain. As a result, guest-wide profiling in Xen HVM does not accurately reflect the behavior of a guest.

Linux perf [1] is a new implementation of performance counter support for Linux. It runs in the host Linux and can profile a Linux guest running in KVM. Linux perf is similar to the system-wide profiling discussed in this paper, but it can only interpret samples belonging to the kernel of the Linux guest, not to its user-space applications. This is because Linux perf is aware of the virtual memory layout of the guest Linux kernel, but not of the virtual memory layout of a user process.

VTSS++ [5] demonstrates a profiling technique similar to the guest-wide profiling defined in this paper. It requires the cooperation of a profiler running in the guest and a PMU sampling tool running in the VMM. It relies on sampling time stamps to attribute counter overflows to the corresponding threads in the guest. Its profiling results are not completely precise, because the limited resolution of the timer interval may result in an inaccurate attribution of a counter overflow to a thread. It also requires that the VMM does not virtualize the processor timestamp counter.

VMware vmkperf [7] is a performance monitoring utility for the VMware ESX server. It runs in the VMM and only records how many hardware events happen in a time interval. It does not handle counter overflow interrupts or attribute them to particular procedures. Therefore, it does not support any of the profiling mechanisms presented in this paper.

7 Conclusions

In this paper, we studied the problem of supporting performance profiling in a virtualized environment. We gave solutions that achieve both guest-wide and system-wide profiling for VMMs based on hardware assistance. We also described a solution that implements system-wide profiling for VMMs based on binary translation. In the end, we demonstrated the feasibility of guest-wide and system-wide profiling by implementing them in KVM.

References

[1] Performance Counters for Linux. http://lwn.net/Articles/310176/, 2010.

[2] G. Banga, P. Druschel, and J. C. Mogul. Resource containers: A new facility for resource management in server systems. Operating Systems Review, 33:45–58, 1998.

[3] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, page 177. ACM, 2003.

[4] F. Bellard. QEMU, a fast and portable dynamic translator. In Proceedings of the USENIX 2005 Annual Technical Conference, FREENIX Track, pages 41–46, 2005.

[5] S. Bratanov, R. Belenov, and N. Manovich. Virtual machines: A whole new world for performance analysis. SIGOPS Operating Systems Review, 43(2):46–55, 2009.

[6] Intel Inc. Intel VTune Performance Analyzer. http://software.intel.com/en-us/intel-vtune/, 2010.

[7] VMware Inc. Vmkperf for VMware ESX 4.0, 2010.

[8] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori. kvm: the Linux virtual machine monitor. In Linux Symposium, 2007.

[9] J. Levon and P. Elie. OProfile: A system profiler for Linux. http://oprofile.sourceforge.net, 2010.

[10] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel. Diagnosing performance overheads in the Xen virtual machine environment. In VEE, volume 5, pages 13–23, 2005.

[11] J. Sugerman, G. Venkitachalam, and B. H. Lim. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In USENIX Annual Technical Conference, pages 1–14, 2001.