Paravirtualization Effect on Single- and Multi-Threaded Memory-Intensive Linear Algebra Software

Paravirtualization Effect on Single- and Multi-threaded Memory-Intensive Linear Algebra Software Lamia Youseff Keith Seymour Haihang You Dept. of Computer Science, Dept. of Electrical Engineering Dept. of Electrical Engineering University of California, Santa and Computer Science, and Computer Science, Barbara. University of Tennessee. University of Tennessee. [email protected] [email protected] [email protected] Dmitrii Zagorodnov Jack Dongarra Rich Wolski Dept. of Computer Science, Dept. of Electrical Engineering Dept. of Computer Science, University of California, Santa and Computer Science, University of California, Santa Barbara. University of Tennessee. Barbara. [email protected] [email protected] [email protected] ABSTRACT Keywords Previous studies have revealed that paravirtualization im- Virtual Machine Monitors, Paravirtualization, AutoTuning, poses minimal performance overhead on High Performance BLAS, high performance, linear algebra, cloud computing Computing (HPC) workloads, while exposing numerous benefits for this field. In this study, we are investigating the im- 1. INTRODUCTION pact of paravirtualization on the performance of automatically- tuned software systems. We compare peak performance, Virtualization has historically offered numerous benefits performance degradation in constrained memory situations, for high performance computing. It was, however ignored performance degradation in multi-threaded applications, and in computationally intensive settings as a result of its po- inter-VM shared memory performance. For comparison pur- tential for performance retardation. A recent approach to poses, we examine the proficiency of ATLAS, a quintessen- virtualization, dubbed paravirtualization has emerged as a tial example of an autotuning software system, in tuning possible alternative. Originally developed to help consoli- the BLAS library routines for paravirtualized systems. Our date servers in commercial settings, this OS-based approach results show that the combination of ATLAS and Xen par- to virtualization has grown rapidly in popularity. To pro- avirtualization delivers native execution performance and vide the greatest flexibility, portability, and instrumentation nearly identical memory hierarchy performance profiles in possibilities, it comprises both software and hardware tech- both single and multi-threaded scenarios. Furthermore, we niques. Originally explored by Denali [28] and Xen [22] in show that it is possible to achieve memory sharing among 2003, this technique in which the guest and the host OSs OS instances at native speeds. These results expose new are both strategically modified to provide optimized perfor- benefits to memory-intensive applications arising from the mance is used today in several virtualization systems. ability to slim down the guest OS without influencing the To this end, several studies [19, 29, 30] have measured the system performance. In addition, our findings support a performance ramifications of running general HPC bench- novel and very attractive deployment scenario for computa- marks on paravirtualized systems and they report near-native tional science and engineering codes on virtual clusters and performance for different HPC workloads. At the same time, computational clouds. other studies have focused on the flexibility and functional- ity benefits of using modern virtualization techniques. For example, OS-level check-pointing [20], fault-tolerance and Categories and Subject Descriptors load balancing [8] are very attractive possibilities in HPC D.4.8 [Operating Systems]: Performance; C.4 [Performance environments. In addition, other researchers [25, 21] have of Systems]: Performance attributes looked into dynamic adaptation of the guest OS for performance and application-customized guest OSs for scientific General Terms parallel codes [5, 15, 31]. Although previous studies of paravirtualization have ad- Measurement, Performance dressed the general-case performance characteristics of HPC benchmarks, they lacked the investigation of the performance boundary and performance consequences under scarce Permission to make digital or hard copies of all or part of this work for memory conditions. This has a remarkable importance in personal or classroom use is granted without fee provided that copies are HPC, because of the performance sensitivity of memory- not made or distributed for profit or commercial advantage and that copies intensive codes and autotuning Linear Algebra (LA) pack- bear this notice and the full citation on the first page. To copy otherwise, to ages to the memory characteristics. In this vein, we provide republish, to post on servers or to redistribute to lists, requires prior specific a detailed study of the impact of the paravirtualized execu- permission and/or a fee. HPDC'08, June 23-27, 2008, Boston, Massachusetts, USA. tion on LA codes that use autotuning software for portable Copyright 2008 ACM 978-1-59593-997-5/08/06 ...$5.00. performance. We focus on several areas in which paravirtualization may have potentially deleterious effects: VM in- impact on a memory intensive dense matrix-matrix multipli- terference with raw single-thread performance, interference cation performance and characterizing its RSS (Resident Set between VMs in multi-threaded performance, and VM in- Size), swap and TLB activity. Section 6 continues the exam- terference in shared memory communication efficiency. ination of matrix multiplication to determine whether par- Autotuning has become an important technology for“core” avirtualization has a negative effect on the performance of libraries that are performance-critical since these systems multiple simultaneous instances. Section 7 discusses the im- often require complex and otherwise labor-intensive config- pact of paravirtualization on shared memory performance. uration to achieve maximal performance levels. Thus, our Finally, we discuss the implications of our work in Section 8 goals are two fold. First, we wish to understand how au- and present our conclusions in Section 9. totuning is affected by paravirtualization. For example, we wish to know whether the autotuning software can “sense” 2. BACKGROUND the presence of paravirtualization during the tuning process. Secondly we wish to explore the potential impact paravir- Historically, virtualization has been a technique for pro- tualization may have on highly tuned numerical software. viding secure and flexible hosting of user and system pro- While it may be that a vanilla installation is unaffected as grams spanning several trust levels. In an Internet comput- has been shown in previous studies [30], in this work we in- ing setting, this hosting capability has relied on language se- vestigate the effects of paravirtualization on the performance mantics (e.g. Java) to permit the importation of untrusted boundary. code for execution by a software virtual machine. While In particular, we study the efficacy of Automatically T uned the approach has been successful, incompatibilities between Linear Algebra Software (ATLAS) in detecting system con- the virtual machine mandated by the language and the typ- figurations for paravirtualized systems. We then use DGEMM ical organization of actual computational resources (proces- matrix-matrix multiplication as a memory-intensive code to sors, memory, I/O subsystems, etc.) imposes a performance measure performance in Mflops for double precision arith- penalty on virtualized program execution. Many powerful metic and compare the performance of several OS kernels and elegant optimization techniques have been developed with varying main memory allocations. With this in mind, to minimize this penalty, but at present language virtual- we investigate the different attributes of the memory hier- ized systems still do not achieve native execution speeds for archy of the paravirtualized system. numerically-intensive applications. The main contribution of this paper is an exposition of A recent approach to virtualization, dubbed paravirtual- how paravirtualization (as implemented by Xen) impacts ization has emerged as a possible alternative. Paravirtual- the performance of LA codes, particularly with respect to ization is a software virtualization technique which allows its use of the memory hierarchy. Also, with the increasing the virtual machines to achieve near native performance. prevalence of multi-core architectures, the issues of memory In paravirtualized systems, for example Xen [22], the sys- contention as well as efficient communication and synchro- tem software stack is augmented, as illustrated in Figure 1. nization between cores will be crucial to large-scale applica- The stack on the left shows the traditional OS deployment stack, while the right stack portrays the paravirtualized de- tion performance. Thus, we have studied whether running 1 highly tuned LA code in different VMs causes more perfor- ployment stack. In the latter, the hypervisor occupies a mance perturbation than multiple threads within a single small part of the main memory, and acts as a moderator VM. Finally, as the ever increasing number of cores per chip layer between the hardware and the guest OS kernels. On makes evident the need for low-overhead communication, we top of the hypervisor, two types of guest OSs are run. The study the efficiency of using shared memory between VM in- first type is regarded as a privileged services guest OS, which stances on different cores. Our findings complete along with provides OS services to the other less-privileged OS and has our

Paravirtualization Effect on Single- and Multi-Threaded Memory-Intensive Linear Algebra Software

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support