VMM Emulation of Hardware Transactional Memory

Maciej Swiech, Kyle C. Hale, Peter Dinda
Department of Electrical Engineering and Computer Science, Northwestern University
{m-swiech,k-hale,pdinda}@northwestern.edu

ABSTRACT

We describe the design, implementation, and evaluation of emulated hardware transactional memory, specifically the Intel Haswell Restricted Transactional Memory (RTM) architectural extensions for x86/x64, within a virtual machine monitor (VMM). Our system allows users to investigate RTM on hardware that does not provide it, debug their RTM-based transactional software, and stress test it on diverse emulated hardware configurations, including potential future configurations that might support arbitrary length transactions. Initial performance results suggest that we are able to accomplish this approximately 60 times faster than under a full emulator. A noteworthy aspect of our system is a novel page-flipping technique that allows us to completely avoid instruction emulation, and to limit instruction decoding to only that necessary to determine instruction length. This makes it possible to implement RTM emulation, and potentially other techniques, far more compactly than would otherwise be possible. We have implemented our system in the context of the Palacios VMM. Our techniques are not specific to Palacios, and could be implemented in other VMMs.

1. INTRODUCTION

Hardware transactional memory (HTM) [4] is an enduring concept that holds considerable promise for improving the correctness and performance of concurrent programs on hardware multiprocessors. Today's typical server platforms are already small scale NUMA machines. A mid-range server may have as many as 64 hardware threads spread over 4 sockets. Further, it is widely accepted that the growth of single node performance depends on increased concurrency within the node. For example, the U.S. national exascale efforts are crystallizing around a model of billion-way parallelism [1], of which a factor of 1000 or more is anticipated to be within a single node [12]. Given these trends, correct and efficient concurrency within a single node or server is of overarching importance, not just to systems software, but also to libraries and applications.

HTM promises better correctness in concurrent code by replacing locking with transactions over instructions. Unlike locks, such transactions are composable, meaning that it is less likely to introduce deadlock and livelock bugs as a codebase expands. Furthermore, transactions have the potential for running faster than locks because the hardware is able to detect violations of transaction independence alongside maintaining coherence.

Intel has made HTM a component of its Haswell platform, and chips with the first implementation of this feature are now widely available. This paper focuses on the restricted transactional memory (RTM) component of Intel's specification. RTM is a bit of a misnomer; it might better be called explicit transactional memory. With RTM, the programmer starts, aborts, and completes transactions using new instructions added to the ISA. Our work does not address the other component of Intel's specification, hardware lock elision (HLE), which is a mechanism for promoting some forms of existing lock-based code to transactions automatically, i.e., it is implicit transactional memory.

Our paper focuses on how to extend a virtual machine monitor (VMM) so that it can provide the guest with Intel Haswell RTM capability even if the underlying hardware does not support it. Furthermore, the limitations of this emulated capability can differ from those of the underlying hardware.

There are three primary use cases. The first is in testing RTM code against different hardware models, to attempt to make the code resilient to different and changing hardware. As we describe in more detail in Section 2, transaction aborts are caused not only by the detection of failures of transaction independence, but also by other events that are strongly dependent on specific hardware configurations and implementations. Hence, a transaction that may succeed on one processor model might abort on another model.

The second use case is to consider potential future RTM implementations, including those that might allow arbitrary length transactions. Current RTM hardware implementations limit transaction length due to cache size and write buffer size limitations. Our system is free of such limitations unless they are explicitly configured. Conceivably, this functionality could also be used to bridge RTM and software transactional memory.

The third use case is in debugging RTM code via a controlled environment. Through emulation, it is possible to introduce abort-generating events at specific points and observe the effects. It is also possible to collect detailed trace data from running code.
We have designed, implemented, and evaluated a system for Intel RTM emulation within the Palacios VMM. Our techniques are not specific to Palacios, and could be implemented in other VMMs as well. Our implementation will be available as part of the open-source Palacios codebase when this paper is published. Our contributions are as follows:

• We have designed a page-flipping technique that allows instruction execution while capturing instruction fetches, and data reads and writes. This technique avoids the need for any instruction emulation or complex instruction decoding other than determining instruction length. This greatly simplifies RTM emulation and could be applied to other services.

• We have designed an emulation technique for RTM based around the page-flipping technique, redo-logging, undefined opcode exceptions, and hypercalls. The technique is extensible, allowing for the inclusion of different hardware models, for example different cache sizes and structures.

• We have implemented the RTM emulation technique in Palacios. The entire technique comprises about 1300 lines of C code.

• We have evaluated our VMM-based RTM emulation technique and compared it with Intel's emulator-based implementation of Haswell RTM in the Software Development Emulator [6]. Our implementation is approximately 60 times faster when a transaction is executing, and has full performance when none are.

Related work: Herlihy and Moss introduced HTM [4]. Recent work by Rajwar, Herlihy, and Lai showed how HTM could be extended such that hardware resource limitations would not be programmer visible [11]. Unbounded transactional memory [2] shows that hardware designs that allow arbitrarily long transactions are feasible, and this work also demonstrated that using such transactions would allow for significant speedups in Java and Linux kernel code. Hammond et al. [3] have argued for using such powerful transactions as the basic unit of parallelism, coherence, and consistency. In contrast to such work, our goal is simply to efficiently emulate a specific commercially available HTM system that will have model-specific hardware resource limitations. By using our system, programmers will be able to test how different hardware limits might affect their programs. However, because the conditions under which our system aborts a transaction are software defined, and the core conflict detection process will work for a transaction of any size, provided sufficient memory is available, our system could also be employed to test models such as the ones described above. IBM has produced their own implementation of HTM in the BlueGene/Q architecture [13].

Our system leverages common software transactional data structures, such as hashes and redo logs. Moore et al. developed an undo log-based hardware transactional memory system [10] which lets all writes go through to memory and rolls them back upon conflict detection. Our emulator rolls a redo log forward on a commit.

2. HASWELL TRANSACTIONAL MEMORY

The Haswell generation of Intel processors includes an implementation of hardware transactional memory. The specification for transactional synchronization extensions (TSX) has the goals of providing support for new code that explicitly uses transactions, backward compatibility of some such new code to older processors, and allowing for hardware differences and innovation under the ISA-visible model [5]. There are two models supported by TSX, hardware lock elision (HLE) and restricted transactional memory (RTM). Our focus in this paper is on RTM.

In the RTM model, four additional instructions have been added to the ISA: XBEGIN, XEND, XABORT, and XTEST. If they are executed on hardware which does not support them, a #UD (undefined opcode) exception is generated. Code can use the XTEST instruction to determine if it is executing within a transaction. An RTM transaction is typically written in a form like this:
    start_label:
        XBEGIN abort_label
        ...
        XEND
    success_label:
        ...
    abort_label:
        ...

The XBEGIN instruction signifies the start of a transaction, and provides the hardware with the address to jump to if the transaction is aborted. The body of the transaction then executes. If no abort conditions arise, the transaction is committed by the XEND.

Conceptually, the core on which the body of the transaction, from XBEGIN to XEND, executes performs reads and writes that are independent of those from other cores, and its own writes are not seen by other cores until after the XEND completes successfully. If another core executes a conflicting write or read, breaking the promise of independence, the hardware will abort the transaction, discard all of the completed writes, and jump to the abort label. The code in the body of the transaction may also explicitly abort the transaction using the XABORT instruction. The specific reason is written into RAX so the abort handling code can decide what to do.
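For reference, the same structure is commonly written in C through the RTM intrinsics that GCC exposes in immintrin.h when built with -mrtm (_xbegin, _xend, _xabort, _xtest). The fragment below is our own illustrative sketch, not code from the paper; the txn_add function and its retry policy are hypothetical.

    #include <immintrin.h>   /* RTM intrinsics; build with gcc -mrtm */

    /* Illustrative sketch: add 'delta' to *counter inside an RTM transaction,
       retrying on abort. */
    static void txn_add(volatile long *counter, long delta)
    {
        for (;;) {
            unsigned status = _xbegin();        /* XBEGIN: returns _XBEGIN_STARTED
                                                   on entry, or an abort status   */
            if (status == _XBEGIN_STARTED) {
                *counter += delta;              /* transaction body               */
                _xend();                        /* XEND: attempt to commit        */
                return;
            }
            /* We arrive here at the abort target with the abort reason in
               'status'. A realistic fallback would take a lock after a few
               retries rather than spinning forever. */
        }
    }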
Beyond conflicting memory reads and writes on other cores, and the execution of the XABORT instruction, there are numerous reasons why a transaction may abort. These form three categories: instructions, exceptions, and resource limits. Within each category, there are both implementation-independent and implementation-dependent items. One of the benefits of our emulated RTM system is to allow the testing of RTM code under different implementations.

Instructions: The XABORT, CPUID, and PAUSE instructions are guaranteed to abort in all implementations. In addition, the specification indicates a very diverse set of other instruction classes may also cause aborts, depending on the specific RTM implementation.

An important point is that our system does not require decoding and emulating general instructions, and to abort on one of these classes of instructions we need only decode an instruction sufficiently to identify its class. Any such decoding need happen only for the instructions within the transaction. Furthermore, many of these instructions are already detected in the VMM out of necessity (e.g., control and segment register updates, I/O instructions, virtualization instructions, most privileged instructions), and others can be readily intercepted without decoding (e.g., RFLAGS updates, debug registers, ring transitions, TLB instructions, software interrupts).

Exceptions: An exception on the core executing a transaction generally causes the transaction to abort, although the specification has variable clarity about which exception aborts are implementation-dependent and which are guaranteed. We assume that all exceptions that the Intel specification says "may" cause aborts "will" cause aborts. This behavior can easily be changed within our hardware model. As the VMM is already responsible for injection of general and special interrupts, it can easily detect aborts due to asynchronous exceptions. Detecting synchronous exceptions is slightly more challenging, as we discuss later.

Exception delivery within the context of a transaction abort has unusual, although sensible, semantics. For synchronous exceptions, the abort causes the exception to be suppressed. For example, if a transaction causes a divide-by-zero exception, the hardware will abort the transaction, but eat the exception. For interrupts, the abort causes the interrupt to be held pending until the abort has been processed. For example, if a device interrupt happens during the execution of a transaction, the hardware will abort the transaction, and begin its fetch at abort_label before the interrupt vectors.
Resource limits: The specification indicates that transactions may only involve memory whose type is writeback-cacheable. Use of other memory types will cause an abort. Additionally, pages involved in the transaction may need to be accessed and dirty prior to the start of the transaction on some implementations. Finally, the specification warns against excessive transaction sizes and indicates that "the architecture provides no guarantee of the amount of resources available to do transactional execution and does not guarantee that a transactional execution will ever succeed."

Our interpretation of these parts of the specification is that a typical implementation is expected to be built on top of the cache coherence logic. The implication is that transactions will behave differently on different hardware just due to cache differences. The line size will likely define the conflict granularity for transactions. Two writes to the same cache line, but to different words, will likely conflict. Hence, the larger the cache line, the more likely a transaction is to fall victim to an abort caused by a false conflict.

Our system allows the inclusion of a hardware model that can capture these effects, allowing the bullet-proofing of code that uses transactions, and the evaluation of the effects of different prospective hardware models on the code. Interestingly, because it is a software system, it creates the effect of hardware without resource limits.

3. DESIGN AND IMPLEMENTATION

The implementation of our RTM emulation system is in the context of our Palacios VMM [9], but its overall design could be used within other VMMs. We now describe our system, starting with the assumptions we make and the context of our implementation, followed by an explanation of the page-flipping approach the system is based on, and finally the architecture and operation of the system itself.

3.1 Assumptions

We assume that our system is implemented in the context of a VMM for x86/x64 that implements full system virtualization. Such VMMs can control privileged processor and machine state that is used when the guest OS is running, and can intercept guest manipulations of machine state. We assume the VMM infrastructure provides the following functionality for control and interception:

1. Shadow paging. We assume shadow paging in the VTLB model [7, Chapter 31] is available. It is not essential that the guest run with shadow paging at all times, merely that it is possible to switch to shadow paging during transaction execution.

2. Explicit VTLB invalidation. We assume that the VMM allows us to explicitly invalidate some or all entries in the shadow paging VTLB, independent of normal shadow page fault processing.

3. Shadow page fault hooking. We assume that the VMM allows us to participate in shadow page fault handling. More specifically, we assume it is possible for us to install a shadow page fault handler that is invoked after a shadow page fault has been determined to be valid with respect to guest state. Our handler can then choose whether to fix the relevant shadow page table entry itself, or can defer to the normal shadow page table fixup processing.

4. Undefined opcode exception interception. We assume the VMM allows us to intercept the x86/x64 undefined opcode exception when it occurs in the guest.

5. CPUID interception. We assume the VMM allows us to intercept the CPUID instruction and/or set particular components of the result of CPUID requests.

6. Exception interception. We assume the VMM allows us to selectively enable interception of exceptions and install exit handlers for them.

7. Exception/interrupt injection cognizance. We assume the VMM can tell us when a VM entry will involve the injection of exceptions or interrupts into the guest. If the VMM uses guest-first interrupt delivery, in which an interrupt can vector to guest code without VMM involvement, then it must be possible to disable this for the duration of the transaction so that the VMM can see all interrupt and exception injection activity.

The hardware virtualization extensions provided by Intel and AMD are sufficient for meeting the above assumptions. The same capabilities that the hardware provides could also be implemented in translating VMMs or paravirtualized VMMs. Common VMMs already meet 1, 2, 3, 5, and 7 as a matter of course. Items 4 (undefined opcode interception) and 6 (exception interception) are straightforward to implement. In AMD SVM, for example, there is simply a bit vector in the VMCB where one indicates which exceptions to intercept. On VM exit due to such an interception, the hardware provides the specific exception number.
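To illustrate how small this requirement is, the sketch below shows the style of change involved on AMD SVM. The struct and function names (svm_vmcb, intercept_exception, and so on) are hypothetical stand-ins, not Palacios or vendor-header declarations; the point is only that interception is a per-vector bit in the VMCB control area.

    #include <stdint.h>

    #define UD_VECTOR 6   /* #UD, undefined opcode: how XBEGIN is detected here */

    /* Hypothetical view of the part of the SVM VMCB control area that holds
       the exception intercept bit vector (one bit per exception vector). */
    struct svm_vmcb {
        uint32_t exception_intercepts;
        /* ... remaining control and guest-state fields ... */
    };

    /* Request a VM exit whenever the guest raises the given exception. */
    static inline void intercept_exception(struct svm_vmcb *vmcb, int vector)
    {
        vmcb->exception_intercepts |= (1u << vector);
    }

    /* Stop intercepting it, e.g., when the last transaction completes. */
    static inline void clear_exception_intercept(struct svm_vmcb *vmcb, int vector)
    {
        vmcb->exception_intercepts &= ~(1u << vector);
    }

    /* Usage in the spirit of assumptions 4 and 6:
       intercept_exception(vmcb, UD_VECTOR); */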
Palacios already met most of the assumptions given in Section 3.1. Shadow paging capabilities in Palacios reflect efforts to allow dynamic changes for adaptation. Palacios did not include support for assumptions 4 and 6 (exception interception). Perhaps ironically, our initial implementation of these two is for AMD SVM. However, Intel's VT also provides an exception bitmap to select which exceptions in the guest require a VM exit, so these changes could be readily made for VT. In Palacios, exception/interrupt injection cognizance (assumption 7) is implemented with a check immediately before VM entry, in SVM- or VT-specific code. For the sake of initial implementation simplicity, we focused here again on the SVM version.

3.2 Architecture

Figure 1 illustrates the architecture of our system. It shows a guest with two virtual cores, one executing within a transaction, the other not. The figure illustrates two core elements, the per-core MIME (Section 3.3), which extracts fine-grain access information during execution, and the global RTME (Section 3.4), which implements the Intel RTM model. The RTME configures the MIMEs to feed the memory reference information into the conflict hash data structures (for all cores), and the per-core redo log data structure (for each core executing in a transaction). The conflict hash data structures are used by the RTME to detect inherent memory access conflicts that should cause transaction aborts regardless of the hardware resource limitations. Additionally, the memory references feed a pluggable cache model, which detects hardware-limitation-specific conflicts that should cause transaction aborts.

The RTME is also fed by the instruction sequences from cores operating in transactions, and by intercepted exceptions from the guest and injected exceptions or interrupts from the VMM, which are also needed to assess whether an abort should occur.

When no virtual core is executing in a transaction, we revert to normal execution of instructions by the hardware. The switch to the illustrated mode of operation occurs when an XBEGIN instruction is detected via an undefined opcode exception. Only this particular exception needs to be intercepted during normal (non-transactional) execution.

Figure 1: Overall architecture of the system
3.3 Memory and Instruction Meta-Engine

A core requirement of transactional memory emulation is being able to determine the memory addresses and data used by the reads and writes of individual instructions. When a transaction is active on any core, all cores must log their activities, producing tuples of the form {vcore, sequencenum, rip, address, size, value, type}, where sequencenum orders the tuples of a given vcore, rip is the address of the instruction being executed, address is the address being read or written, size is the size of the read or write, value is the value read or written, and type indicates whether the reference is a read, write, or instruction fetch.
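A direct C rendering of this record might look like the following sketch. The type and field names are ours, chosen to mirror the tuple fields named above; they are not taken from the Palacios extension.

    #include <stdint.h>

    typedef enum { MIME_REF_READ, MIME_REF_WRITE, MIME_REF_FETCH } mime_ref_type_t;

    /* One logged memory reference, matching the tuple
       {vcore, sequencenum, rip, address, size, value, type}. */
    struct mime_ref {
        uint32_t        vcore;        /* virtual core that made the reference */
        uint64_t        sequencenum;  /* per-vcore ordering of the tuples     */
        uint64_t        rip;          /* guest address of the instruction     */
        uint64_t        address;      /* guest address read or written        */
        uint32_t        size;         /* size of the access in bytes          */
        uint64_t        value;        /* value read or written                */
        mime_ref_type_t type;         /* read, write, or instruction fetch    */
    };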
cute, while next is the address of the subsequent instruction, Shadow paging and shadow page faults: It is necessary and target is a branch target if the current instruction is a for the VMM to control the pages of physical memory that control-flow instruction. the guest has access to. Conceptually, with the VMM, there Write-only data flow instruction: Let us make the current are two levels of indirection. Guest virtual addresses (GVAs) instruction more specific, for example, suppose it is are mapped to guest physical addresses (GPAs) by the guest cur: movq %rax, (%rcx) OS’s page tables (the gPT), and GPAs are in turn mapped to host physical addresses (HPAs) by the VMM. This instruction writes the memory location in the register In shadow paging, the GVA→GPA and GPA→HPA map- %rcx with the 8 byte quantity in the register %rax. MIME pings are integrated by the VMM into a single set of shadow executes this instruction, and other instructions in the fol- page tables (the sPT) that express GVA→HPA mappings lowing way. We begin this process with the requirement that Guest No Transaction TRANSACTION


Process: We now describe the MIME process for executing an instruction using the following example:

    prev:   addq %rbx, %rax
    cur:    INSTRUCTION
    next:   movq %rdx, %rbx
            ...
    target:
            ...

Here, cur is the address of the instruction we intend to execute, while next is the address of the subsequent instruction, and target is a branch target if the current instruction is a control-flow instruction.

Write-only data flow instruction: Let us make the current instruction more specific; for example, suppose it is

    cur:    movq %rax, (%rcx)

This instruction writes the memory location in the register %rcx with the 8 byte quantity in the register %rax. MIME executes this instruction, and other instructions, in the following way. We begin this process with the requirement that the sPT is completely empty. Note that the last step in the following reestablishes this for the next instruction.

1. We enter the guest with %rip=cur.

2. The instruction fetch causes a shadow page fault, which exits back to the VMM, which hands it to the MIME.

3. The MIME discovers this is an instruction fetch by comparing the faulting address and the current %rip and noting the fault error code is a read failure. In response, it creates an sPT entry for the page the instruction is on. While the page is fully readable and writable by the MIME, the sPT entry allows the guest only to read it. The MIME then overwrites next with a hypercall, saving the previous content.

4. We enter the guest with %rip=cur.

5. The instruction fetch now succeeds. Instruction execution now succeeds as well, up to the data write. The data write produces a shadow page fault, which exits back to the VMM, which hands it to the MIME.

6. The MIME discovers this is a data write by noting that the fault code is a write failure.

7. In response to the data write, the MIME maps a temporary staging page in the sPT for the faulting address, and it stores the address of the write.

8. We enter the guest with %rip=cur.

9. The instruction fetch and the data write now succeed and the instruction finishes, writing its result in the temporary staging page.

10. %rip advances to next, resulting in the fetch and execution of the hypercall (note that the code page is now mapped in), which exits back to the VMM, which hands it to the MIME.

11. The MIME now reads the value that was written by the instruction on the temporary staging page. It can now make this write available for use by other systems. For example, the RTM system will place it into its own data structures if a transaction is occurring. If no other system is interested, it copies the write back to the actual data page.

12. At this point the MIME has generated two tuples for the record: the instruction fetch and the data write.

13. The MIME now restores the instruction at next.

14. The MIME invalidates all pages in the VTLB. Strictly speaking, only two pages are unmapped from the sPT, the code page and the temporary staging page.

15. If MIME-based execution is to continue with next, go to 1; otherwise we are done.

Read-only data flow instruction: For an instruction like

    cur:    movq (%rcx), %rax

which reads 8 bytes from the memory location given in %rcx and writes that result into the register %rax, execution is quite similar. At stage 5, a shadow page fault due to the data read will occur. In stage 6, the MIME will detect it is a data read and sanity check it if needed. In stage 7, the MIME will map the staging page read-only, and copy the data to be read to it. This data can come from a different system. For example, the RTM system might supply the data if the read is for a memory location that was previously written during the current transaction. After the instruction finishes, it will then provide two tuples for the record: the instruction fetch and the data read.

Read/write data flow instruction: It is straightforward to execute an instruction such as

    cur:    addq %rax, (%rcx)

which reads the 8 bytes from memory at %rcx, adds them to the contents of %rax, and then writes the 8 byte sum back to the memory at the address in %rcx. For all but the final write, the execution is identical to that of the read-only instruction given above. After completing stage 7, the staging page will be mapped read-only, and thus there will be an additional shadow page fault corresponding to the write. This fault will be handled in the same manner as with the write-only instruction. After the instruction finishes execution, it will then provide three tuples for the record: the instruction fetch, the data read, and the data write.
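The heart of stages 3 and 6 is classifying a shadow page fault from nothing more than the faulting address, the guest %rip, and the hardware-reported error code. The following is a condensed, hypothetical sketch of that decision (the names are ours, not Palacios'), and it deliberately simplifies the fetch test:

    #include <stdint.h>
    #include <stdbool.h>

    /* Architectural x86 page-fault error code bit: set when the faulting
       access was a write. */
    #define PF_ERR_WRITE (1u << 1)

    typedef enum { FAULT_FETCH, FAULT_READ, FAULT_WRITE } mime_fault_kind_t;

    /* Classify a shadow page fault as MIME does in stages 3 and 6: a write
       failure is a data write; a read failure at the current %rip is the
       instruction fetch; any other read failure is a data read. The page
       comparison here is a simplification; instructions that straddle a
       page boundary need extra care. */
    static mime_fault_kind_t classify_fault(uint64_t fault_addr,
                                            uint64_t guest_rip,
                                            uint32_t error_code)
    {
        if (error_code & PF_ERR_WRITE) {
            return FAULT_WRITE;
        }
        bool on_rip_page = (fault_addr & ~0xfffULL) == (guest_rip & ~0xfffULL);
        return on_rip_page ? FAULT_FETCH : FAULT_READ;
    }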
Control flow instruction: If a control flow instruction reads data (e.g., an indirect jump) or writes data (e.g., a stack write on a call), these reads and writes are handled in the same manner as the preceding data flow instructions. Since all the conditions to be checked (e.g., flags) are known at this point, we can "emulate" the instruction, placing the hypercall at the jump target.

Generalization: Although the above description uses simple two-operand instructions and the simplest memory addressing mode as examples, it's important to note that the technique works identically for different numbers of operands and for arbitrary addressing modes. Indeed, even for implicit memory operands, the hardware will produce shadow page faults alerting us to their presence. The primary limitation is that an instruction with multiple reads and/or multiple writes to the same page may not have all of its reads and writes captured. We describe this in detail later. All addressing mode computations, as well as segmentation, are done well before a page fault on instruction or data references can result. The hardware does this heavy-lifting.

Instruction decoding and emulation: Step 3 of the processing described above requires basic instruction decoding. The issue is that x86/x64 instructions are of variable length (from 1 to 15 bytes). Hence, in order to determine what next is, we need to be able to determine the size of the current instruction. If the MIME does not need to trace control-flow instructions, this is the only requirement. If control-flow instructions are to be handled by the MIME, then we must further decode control-flow instructions to the point where we can also determine their target address. Our implementation uses the open source Quix86 decoder [8] to do this decoding for us. No emulation is done at all; we rely on the hardware to do instruction execution for us instead.

Read optimization in RTM: In our earlier description of handling read-only instructions and read-write instructions, we describe the use of a staging page during the read: when a data read is detected, we copy the value to be read to a staging page and present this page to the guest. In RTM, this is required when a core executing a transaction reads a value it has previously written during the transaction. At this point, in order to maintain isolation, the written value exists only in a redo log and must be copied from it. For a core that is not executing in a transaction, or for an in-transaction read of a value that was not previously written in the transaction, the staging page can be avoided and the read allowed to use the actual data page. This optimization is included in our RTM system.
3.4 Restricted Transactional Memory Engine

As shown in Figure 1, the RTME uses per-virtual core MIMEs to capture instructions and their memory accesses in a step-by-step manner during execution. The only instructions it needs to emulate are the XBEGIN, XEND, XABORT, and XTEST instructions. Because these instructions are not available in the hardware, they cause an undefined opcode exception which is caught by the VMM and delivered to the RTME.

The initial XBEGIN is emulated by capturing its abort target, advancing RIP to the next instruction, and switching all virtual cores to MIME-based execution. Additionally, the RTME has the VMM enable exiting on all exceptions, with callbacks to the RTME. A special check is enabled in the interrupt/exception injection code which is run before any VM entry. This check tests if an injection will occur on entry, and if so it invokes a callback to the RTME before the entry. Either callback is interpreted by the RTME as requiring a transaction abort for that virtual core.

From the next entry on, MIME-based execution occurs on all virtual cores. On all virtual cores, the writes seen by the MIME are written to the conflict hash data structures. For a virtual core that is not executing in a transaction, the writes are also reflected to guest memory, and all reads are serviced from guest memory. For a virtual core that is executing in a transaction, writes are sent to the redo log instead of to guest memory. Reads are serviced from guest memory, except if they refer to a previous write of the transaction, in which case they are serviced from the redo log. For all cores, reads and writes are also forwarded to the cache model by the RTME.

In addition to the callbacks described earlier, the RTME is also called back by the MIME as it executes its state machine. This allows the RTME to examine each instruction and its memory operations to see if an abort is merited. Instructions are checked against the list given earlier. For all memory operations, the RTME checks the conflict hash data structures and the cache model. The former indicates whether a conflict would have occurred assuming an ideal, infinite cache. For example, if this core is not in a transaction, and has just written a memory location that some other core that is in a transaction previously wrote, a conflict is detected and the other core needs to abort its transaction. The cache model determines if a conflict due to hardware limitations has occurred. For example, if the current write is coming from a core that is executing in a transaction, and that write would cause a cache eviction of a previous write from the transaction, the cache model would detect this conflict and indicate that the current core needs to abort its transaction. A final source of an abort is when the RTME detects the XABORT instruction during the MIME instruction fetch.

Handling a transaction abort is straightforward: the writes in the redo log are discarded, the relevant error code is recorded in a guest register, the guest RIP is set to the abort address, and the guest is re-entered. Transaction commits occur when the XEND instruction is detected, and are also straightforward: the RTME plays the redo log contents into the guest memory, advances the RIP, and re-enters the guest. For either an abort or a commit, we also check if it is the last active transaction. If so, we switch all cores back to regular execution (turning off MIME, callbacks, and exception interception, except for the illegal opcode exception, which is needed to detect the next XBEGIN).
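The abort and commit paths reduce to a few operations over the redo log and the virtual core's register state. The sketch below is illustrative only; the types and helpers (redo_log, guest_regs, redo_log_discard, and so on) are hypothetical names, not the actual extension's API.

    #include <stdint.h>

    struct redo_log;       /* per-core buffered transactional writes        */
    struct guest_regs {    /* the subset of guest state the RTME touches    */
        uint64_t rip;
        uint64_t rax;
    };

    /* Hypothetical helpers over the redo log and guest memory. */
    void redo_log_discard(struct redo_log *log);
    void redo_log_apply_to_guest(struct redo_log *log);  /* play writes into guest RAM */

    /* Abort: throw away buffered writes, report the abort status in RAX,
       and resume the guest at the abort target captured at XBEGIN. */
    static void rtme_abort(struct redo_log *log, struct guest_regs *regs,
                           uint64_t abort_target, uint32_t abort_status)
    {
        redo_log_discard(log);
        regs->rax = abort_status;
        regs->rip = abort_target;
    }

    /* Commit (XEND): make the buffered writes visible, then advance RIP
       past the XEND instruction and resume the guest. */
    static void rtme_commit(struct redo_log *log, struct guest_regs *regs,
                            uint64_t xend_len)
    {
        redo_log_apply_to_guest(log);
        regs->rip += xend_len;
    }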
XTEST instructions are identified by the RTME through a #UD exception; if the instruction is run during an active RTM section, then the ZF flag is set, otherwise it is cleared.

Redo log considerations: Our redo log structure is not, strictly speaking, a log. Rather, it stores only the last write and read to any given location. However, during MIME execution, there exist short periods where the most recent write or read is actually stored on the MIME staging page. A versioning bit is used so that when the MIME-based execution of an instruction completes, it is possible to update the redo log with newer entries on the staging page. These aspects of the design allow us to compactly record all writes and internal reads of a transaction.

Conflict detection: In addition to the conflict hashes, conflict detection in the RTME uses a global transactional memory (TM) state, a global transaction context, a per-core TM state, and a per-core transaction number. The global TM state indicates whether any core is running a transaction, while the per-core TM state indicates whether the specific core is executing a transaction. Each core assigns sequence numbers to its transactions in strictly ascending order, and the per-core transaction number is the number of the current, or most recent, transaction on that core. The global transaction context gives the number of the currently active transaction, or most recently completed (aborted or committed) transaction, on each core.

When any core is running a transaction, all cores must record the memory accesses they make; we accomplish this through the use of two hash tables. The first, called the address context hash, is a chained hash mapping memory locations to timestamped accesses. Each entry in the hashed bucket represents the global transaction context at the time of a memory operation, which acts as an ordering, or timestamp. In this way we are able to both record all memory accesses done by a core, as well as keep track of when they occurred. Since all memory accesses are tagged with the global context, when a core is checking for conflicts it can simply look at accesses made with the same context as its current transaction number. Entries in the hash have the form {addr : (global_ctxt)→(global_ctxt)→...}

The second hash table, called the access type hash, keeps track of the type of memory operation that was run on an address in a given context (read, write, or both). When a memory operation is run by a core, it creates one entry for each core in its hash. Data is duplicated in this manner to facilitate quick lookup on conflict checking as well as garbage collection. Entries in this hash have the form {addr|core_num|t_num : access_type}
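In C, the two tables' entry forms given above can be sketched roughly as follows. This is an illustration of those forms, not the extension's actual declarations; all names are hypothetical.

    #include <stdint.h>

    #define MAX_CORES 2   /* two virtual cores in the running example */

    /* A snapshot of the global transaction context: the current (or most
       recently completed) transaction number of each core. */
    struct global_ctxt {
        uint64_t t_num[MAX_CORES];
    };

    /* Address context hash: addr -> chained list of contexts, one node per
       access, i.e. {addr : (global_ctxt) -> (global_ctxt) -> ...} */
    struct ctxt_node {
        struct global_ctxt ctxt;     /* context when the access happened */
        struct ctxt_node  *next;     /* older accesses to the same addr  */
    };

    struct addr_ctxt_entry {
        uint64_t          addr;
        struct ctxt_node *contexts;
    };

    /* Access type hash: {addr | core_num | t_num : access_type} */
    typedef enum { ACC_READ = 1, ACC_WRITE = 2, ACC_RW = 3 } access_type_t;

    struct access_type_key {
        uint64_t addr;
        uint32_t core_num;
        uint64_t t_num;
    };

    struct access_type_entry {
        struct access_type_key key;
        access_type_t          type;
    };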
Suppose we are running on a guest with two virtual cores, and core 0 begins a transaction. Each core will begin running its MIME and recording its memory accesses. Now suppose core 1 runs an instruction which writes to memory address 0x53. It will first note the global transactional context, and add a node to the bucket for address 0x53 in the address context hash with this context. It will then make two new entries for the access type hash, one for each core in the system. Each entry will map address 0x53, a core number, and that core's transaction number to a structure indicating the access was a write.

If the memory accesses of a core executing a transaction conflict with those of any other core, the transaction must be aborted. To check for conflicts, we use the context hashes. Conflict checking could be done after each instruction, or when attempting to commit a transaction. In our implementation, conflict checking is cheap relative to instruction run time and so we generally do it after each instruction.

In Figure 2 we illustrate the process of conflict detection. At the end of an instruction in a transactional block, core 0 walks its redo log of writes, shown as step 1. In step 2, core 0 checks if any conflicting memory accesses have been made by looking at every other core's address context hash. In the figure, core 0 checks for conflicting accesses to memory address 0x53. In step 3, core 0 finds an entry for 0x53 in core 1's address context hash, and walks the list of contexts during which core 1 accessed 0x53. Core 0's current transaction number is 2, so the entry made during context {2,3} is a potential conflict. In step 4, core 0 checks the entry for address 0x53 in core 1's access type hash to identify the kind of access made. Having found that core 1 wrote to address 0x53 when core 0 was in transaction 2, a conflict is detected, and core 0 must abort its transaction, shown as step 5.

Figure 2: An example of a write/write conflict detected by core 0 on a write from core 1

Garbage collection: Memory use expands during execution as the redo logs and conflict hashes grow in size. Redo logs are garbage collected at the end of each transaction. Since the conflict hashes contain information from multiple generations of transactions, we must answer the questions of how to determine which entries are garbage, and when to perform this garbage collection.

Garbage collection leverages the access type hash. We start by noting the current global transaction context, then we iterate over all the keys of the address context hash, and for each key (an address), we walk over its corresponding list of contexts. If we find a context that is strictly less than the global context, this means that there is no core left that may need to check that memory operation, as it happened during transactions that are no longer active. We can generate from this stale context the corresponding keys for the access type hash, and delete those keys from it. Finally, we delete the stale context from the list, and delete the key from the address context hash if the list is now of zero length. A locking strategy is employed to ensure that a garbage collection and MIME accesses are mutually exclusive.

When to garbage collect is a more difficult question, as when we have an opportunity to do so, we cannot be certain about the state of other cores, or when the next opportunity to collect may occur. Currently in our implementation each core will garbage collect on every transaction completion.
4. EVALUATION

We considered three factors when evaluating our RTM implementation: its size, how it runs code with transactions, and the performance of the implementation relative to the native execution rate of the hardware and compared to a different emulator.

Test environment: All testing was done on a Dell PowerEdge R415. This is a dual socket machine with each socket having a quad-core AMD Opteron 4122 installed, giving a total of 8 physical cores. The machine has 16 GB of memory. It ran Fedora 15 with a 2.6.38 kernel. Our guest environment uses two virtual cores that run a BusyBox environment based on Linux kernel 2.6.38. The virtual cores are mapped one-to-one with unoccupied physical cores. This machine does not have an HTM implementation.

Implementation size: Our implementation of RTM emulation is an optional, compile-time selectable extension to Palacios, and we made an effort to limit changes to the core of Palacios itself. There were two major areas where we had to modify the Palacios core, namely (1) handling of exceptions and interrupts, some of which are needed to drive the RTME, and (2) page fault handling, allowing some page faults to drive the MIME. These changes and the entirety of the extension code comprise 1300 lines of C. Given the size and very clear changes to the Palacios codebase, it should be possible to port our implementation to other VMMs.

Test cases: To test the correctness of our implementation, we needed a test suite which would present the implementation with various behaviors, and an ability to test the outcome. GCC 4.7 includes support for compiling the Haswell transactional instructions, but the test cases shipped with it only evaluate the behavior of software transactional memory. We found we had to write our own test cases, which test the following scenarios: (1) transaction calls XABORT after making no changes to memory, (2) transaction calls XABORT after having "written" to memory, (3) transaction writes memory with an immediate value, (4) transaction reads memory into a register, (5) transaction writes a register to memory, (6) transaction reads and writes the same memory location, (7) transaction thread writes to distinct addresses, and (8) transaction and non-transactional thread write to overlapping addresses. The test cases are written using pthreads. After the threads set their affinity for distinct virtual cores, and synchronize, they then repeat their activity. Hence, over time, various possible orderings of execution are seen, as are aborts due to external causes (e.g., interrupts). These test cases form a "correctness test" of our implementation, which it passes.
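To illustrate the flavor of these tests, scenario (8) can be expressed with two pthreads pinned to different cores, one writing inside a transaction while the other writes the same location non-transactionally. This is a sketch in the same spirit, not the actual suite, and it omits the synchronization barrier described above.

    #define _GNU_SOURCE
    #include <immintrin.h>     /* _xbegin/_xend; build with gcc -mrtm -pthread */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static volatile long shared;          /* the overlapping address */

    static void pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *txn_writer(void *arg)
    {
        (void)arg;
        pin_to_core(0);
        for (int i = 0; i < 1000; i++) {
            if (_xbegin() == _XBEGIN_STARTED) {
                shared = i;               /* transactional write */
                _xend();
            }                             /* on abort, try the next iteration */
        }
        return NULL;
    }

    static void *plain_writer(void *arg)
    {
        (void)arg;
        pin_to_core(1);
        for (int i = 0; i < 1000; i++) {
            shared = -i;                  /* conflicting, non-transactional write */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, txn_writer, NULL);
        pthread_create(&b, NULL, plain_writer, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("final value: %ld\n", shared);
        return 0;
    }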
Performance: The MIME-based execution model must obviously be slower than normal execution under the VMM or at native speeds. To quantify this slowdown over native, we ran an additional microbenchmark on our system and on a new, first-generation Haswell machine, an HP ProLiant DL320e with a single-socket, quad-core Intel Xeon E3-1720v3 and 8GB RAM.

This benchmark consists of one thread pinned to a single core that enters a transaction, writes to a memory location, and then exits the transaction. The benchmark measures the time spent running 10 such transactions, and is intended to typify a common transactional code path. We averaged this runtime over 100 runs. To ensure accurate timing for RTM emulation, we used the TSC in passthrough mode to measure elapsed time. We found that on native, the average time spent running the 10 transactions was 2.57 usec, while under MIME, the average time was 853.88 usec.

Future work on transactional memory emulation will include comparing performance of multi-threaded applications, more complicated transactional semantics including transaction restarts, and testing transactions that the hardware would not be able to support, such as those that exceed architectural limits.

We also ran our test cases on Intel's Software Development Emulator (SDE), where we found a slowdown on the order of 90,000×; our RTME runs approximately 60 times faster during a transaction. Moreover, the SDE's overhead occurs all the time, as one might expect for full emulation. There is a caveat in these numbers, however. The cache model we are using in our RTME is the null model (no aborts due to hardware resource limits), while Intel's is not. That said, we found that the average MIME "step" (the average time to process a read or write) took on the order of 7,000 cycles. This means that we would need to use a cache emulator that took >400,000 cycles per read or write in order for our system to slow down to the speed of emulation.

5. CONCLUSIONS

We developed an implementation of Intel's HTM extensions in the context of a VMM using MIME, a novel page-flipping technique. Our implementation allows the programmer to write code with TSX instructions, allows for bullet-proofing of code for various hardware architectures, as well as allowing tight control of the environment under which a transaction is occurring. We are able to achieve this with limited instruction decoding, and at speeds approximately 60 times faster than under emulation.

6. REFERENCES

[1] Advanced Scientific Computing Advisory Committee. The opportunities and challenges of exascale computing. Technical report, Department of Energy, Fall 2010.
[2] C. Ananian, K. Asanovic, B. Kuszmaul, C. Leiserson, and S. Lie. Unbounded transactional memory. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA '05), February 2005.
[3] L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional memory coherence and consistency. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04), June 2004.
[4] M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA '93), May 1993.
[5] Intel. Intel architecture instruction set extensions programming reference. http://software.intel.com/sites/default/files/m/3/2/1/0/b/41417-319433-012.pdf, 2012.
[6] Intel. Intel Software Development Emulator, v. 5.31.0, November 2012.
[7] Intel. Intel 64 and IA-32 architectures software developer's manual, volume 3c, chapter 32. http://download.intel.com/products/processor/manual/325384.pdf, 2013.
[8] A. Kudryavtsev, V. Koshelev, B. Pavlovic, and A. Avetisyan. Modern HPC cluster virtualization using KVM and Palacios. In Proceedings of the Workshop on Cloud Services, Federation, and the 8th Open Cirrus Summit (FederatedClouds '12), September 2012.
[9] J. R. Lange, P. Dinda, K. C. Hale, and L. Xia. An introduction to the Palacios virtual machine monitor, version 1.3. Technical Report NWU-EECS-11-10, Department of Electrical Engineering and Computer Science, Northwestern University, November 2011.
[10] K. Moore, J. Bobba, M. Moravan, M. Hill, and D. Wood. LogTM: log-based transactional memory. In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA '06), February 2006.
[11] R. Rajwar, M. Herlihy, and K. Lai. Virtualizing transactional memory. In Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA '05), June 2005.
[12] S. Sachs and K. Yelick. ASCR programming challenges for exascale. Technical report, Department of Energy, 2011.
[13] A. Wang, M. Gaudet, P. Wu, J. N. Amaral, M. Ohmacht, C. Barton, R. Silvera, and M. Michael. Evaluation of Blue Gene/Q hardware support for transactional memories. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT '12), September 2012.