Flexible Decoupled Transactional Memory Support∗

Arrvindh Shriraman Sandhya Dwarkadas Michael L. Scott Department of Computer Science, University of Rochester {ashriram,sandhya,scott}@cs.rochester.edu

∗ This work was supported in part by NSF grants CCF-0702505, CCR-0204344, CNS-0411127, CNS-0615139, and CNS-0509270; NIH grant 1 R21 GM079259-01A1; an IBM Faculty Partnership Award; equipment support from Laboratories; and financial support from and Microsoft.

Abstract

A high-concurrency transactional memory (TM) implementation needs to track concurrent accesses, buffer speculative updates, and manage conflicts. We present a system, FlexTM (FLEXible Transactional Memory), that coordinates four decoupled hardware mechanisms: read and write signatures, which summarize per-thread access sets; per-thread conflict summary tables (CSTs), which identify the threads with which conflicts have occurred; Programmable Data Isolation, which maintains speculative updates in the local cache and employs a thread-private buffer (in virtual memory) in the rare event of overflow; and Alert-On-Update, which selectively notifies threads about coherence events. All mechanisms are software-accessible, to enable virtualization and to support transactions of arbitrary length. FlexTM allows software to determine when to manage conflicts (either eagerly or lazily), and to employ a variety of conflict management and commit protocols. We describe an STM-inspired protocol that uses CSTs to manage conflicts in a distributed manner (no global arbitration) and allows parallel commits.

In experiments with a prototype on Simics/GEMS, FlexTM exhibits ∼5× speedup over high-quality software TM, with no loss in policy flexibility. Its distributed commit protocol is also more efficient than a central hardware manager. Our results highlight the importance of flexibility in determining when to manage conflicts: lazy maximizes concurrency and helps to ensure forward progress, while eager provides better overall utilization in a multi-programmed system.

1 Introduction
Transactional Memory (TM) addresses one of the key challenges of programming multi-core systems: the complexity of lock-based synchronization. At a high level, the programmer or compiler labels sections of the code in a single thread as atomic. The underlying system is expected to execute this code atomically, consistently, and in isolation from other transactions, while exploiting as much concurrency as possible.

Most TM systems execute transactions speculatively, and must thus be prepared for data conflicts, when concurrent transactions access the same location and at least one of the accesses is a write. Conflict detection refers to the mechanism by which such conflicts are identified. Conflict management is responsible for arbitrating between conflicting transactions and deciding which should abort. Pessimistic (eager) systems perform both conflict detection and conflict management as soon as possible. Optimistic (lazy) systems delay conflict management until commit time (though they may detect conflicts earlier). TM systems must also perform version management, either buffering new values in private locations (a redo log) and making them visible at commit time, or buffering old values (an undo log) and restoring them on aborts. In the taxonomy of Moore et al. [23], undo logs are considered an orthogonal form of eagerness (they put updates in the "right" location optimistically); redo logs are considered lazy.

The mechanisms required for conflict detection, conflict management, and version management can be implemented in hardware (HTM) [1, 12, 14, 23, 24], software (STM) [9, 10, 13, 19, 25], or some hybrid of the two (HyTM) [8, 16, 22, 30]. Full hardware systems are typically inflexible in policy, with fixed choices for eagerness of conflict management, strategies for conflict arbitration and back-off, and eagerness of versioning. Software-only systems are typically slow by comparison, at least in the common case.

Several systems [5, 30, 36] have advocated decoupling of the hardware components of TM, giving each a well-defined API that allows them to be implemented and invoked independently. Hill et al. [15] argue that decoupling makes it easier to refine an architecture incrementally. Shriraman et al. [30] argue that decoupling helps to separate policy from mechanism, allowing software to choose a policy dynamically. Both groups suggest that decoupling may allow TM components to be used for non-transactional purposes [15] [30, TR version].

Several papers have identified performance pathologies with certain policy choices (eagerness of conflict management; arbitration and back-off strategy) in certain applications [4, 28, 30, 32]. RTM [30] promotes policy flexibility by decoupling version management from conflict detection and management—specifically, by separating data and metadata, and performing conflict detection only on the latter. While RTM hardware provides a single mechanism for both conflict detection and management, software can choose (by controlling the timing of metadata inspection and updates) when conflicts are detected. Unfortunately, metadata management imposes significant software costs.

In this paper, we propose more fully decoupled hardware, allowing us to maintain the separation between version management and conflict management without the need for software-managed metadata. Specifically, our FlexTM (FLEXible Transactional Memory) system deploys four decoupled hardware primitives: (1) Bloom filter signatures (as in Bulk [5] and LogTM-SE [36]) to track and summarize a transaction's read and write sets; (2) conflict summary tables (CSTs) to concisely capture conflicts between transactions; (3) the versioning system of RTM (programmable data isolation—PDI), adapted to directory-based coherence and augmented with an Overflow Table (OT) in virtual memory filled by hardware; and (4) RTM's Alert-On-Update mechanism to help transactions ensure consistency without checking on every memory access.

The hardware primitives are fully visible in software, and can be read and written under software control. This allows us to virtualize these structures when context switching and paging, and to use them for non-TM purposes.
FlexTM separates conflict detection from management, and leaves software in charge of policy. Simply put, hardware always detects conflicts, and records them in the CSTs, but software chooses when to notice, and what to do about it. Additionally, FlexTM employs a commit protocol that arbitrates between transactions in a distributed fashion and allows parallel commits. This enables lazy conflict management without commit tokens [12], broadcast of write sets [5, 12], or ticket-based serialization [7]. FlexTM is, to our knowledge, the first hardware TM in which the decision to commit or abort can be an entirely local operation, even when performed lazily by multiple threads in parallel.

We have developed a 16-core FlexTM CMP prototype on the Simics/GEMS simulation framework, and have investigated performance using benchmarks that stress the various hardware components and software policies. Our results suggest that FlexTM's performance is comparable to that of fixed-policy HTMs, and 2× and 5× better than that of hardware-accelerated STMs and plain STMs, respectively. Experiments with globally-arbitrated commit mechanisms support the claim that CSTs avoid significant latency and serialization penalties. Finally, experiments indicate that lazy conflict management (for which FlexTM is ideally suited) serves to maximize concurrency and encourages forward progress. Conversely, eager management may maximize overall system utilization in a multi-programmed environment. These results underscore the importance of letting software take policy decisions.

2 Related Work

Transactional memory is a very active area. Larus and Rajwar [17] provide an excellent summary as of fall 2006. We discuss only the most relevant proposals here.

The Bulk system of Ceze et al. [5] decouples conflict detection from cache tags by summarizing read/write sets in Bloom filter signatures [2]. To commit, a transaction broadcasts its write signatures, which other transactions compare to their own read and write signatures to detect conflicts. Conflict management (arbitration) is first-come-first-served, and requires global synchronization in hardware to order commit operations.

LogTM-SE [36] integrates the cache-transparent eager versioning mechanism of LogTM [23] with Bulk-style signatures. It supports efficient virtualization (i.e., context switches and paging), but this is closely tied to eager versioning (undo logs), which in turn requires eager conflict detection and management to avoid inconsistent reads. Since LogTM does not allow transactions to abort one another, it is possible for running transactions to "convoy" behind a suspended transaction.

UTM [1] and VTM [24] both perform lazy versioning using virtual memory. On a cache miss (local or forwarded request) in UTM, a hardware controller walks an uncacheable in-memory data structure that specifies access permissions. VTM employs tables maintained in software and uses software routines to walk the table only on cache misses that hit in a locally cached lookaside filter. Like LogTM, both VTM and UTM require eager conflict management.

Hybrid TMs [8, 16] allow hardware to handle common-case bounded transactions, and fall back to software for transactions that overflow time and space resources. Hybrid TMs must maintain metadata compatible with the fallback STM and use policies compatible with the underlying HTM. SigTM [22] employs hardware signatures for conflict detection but uses an (always on) TL2 [9] style software redo-log for versioning. Like hybrid systems, it suffers from per-access metadata bookkeeping overheads. It restricts conflict management policy (specifically, only self aborts) and requires expensive commit-time arbitration on every speculatively written location.

RTM [30] explored hardware acceleration of STM. Specifically, it introduced (1) Alert-On-Update (AOU), which triggers a software handler when pre-specified lines are modified remotely, and (2) Programmable Data Isolation (PDI), which buffers speculative writes in (potentially incoherent) local caches. Unfortunately, to decouple version management from conflict detection and management, RTM software had to segregate data and metadata, retaining much of the bookkeeping cost of all-software TM systems.

3 FlexTM Architecture

FlexTM provides hardware mechanisms for access tracking, conflict tracking, versioning, and explicit aborts. We describe these separately, then discuss how they work together.

3.1 Access Tracking: Signatures

Like Bulk [5], LogTM-SE [36], and SigTM [22], FlexTM uses Bloom filter signatures [2] to summarize the read and write sets of transactions in a concise but conservative fashion (i.e., false positives but no false negatives). Signatures decouple conflict detection from critical L1 tag arrays and enable remote requests to test for conflicts using local processor state without walking in-memory structures, as might be required in [1, 24] in the case of overflow. Every FlexTM processor maintains a read signature (Rsig) and a write signature (Wsig) for the current transaction. The signatures are updated by the processor on transactional loads and stores. They allow the controller to detect conflicts when it receives a remote coherence request.
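To make the signature operations concrete, the following sketch models a 2 Kbit Bloom filter in software. It is illustrative only: the hash functions, the helper names (sig_insert, sig_member), and the use of exactly two hash functions are assumptions of this sketch, not FlexTM's hardware hashing scheme.

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define SIG_BITS   2048          /* 2 Kbit signature, as in Table 3(a) */
    #define LINE_SHIFT 6             /* 64-byte cache lines                */

    typedef struct { uint64_t bits[SIG_BITS / 64]; } signature_t;

    /* Two illustrative hash functions over the cache-line address. */
    static unsigned hash1(uint64_t addr) { return (addr >> LINE_SHIFT) % SIG_BITS; }
    static unsigned hash2(uint64_t addr) { return ((addr >> LINE_SHIFT) * 0x9E3779B1u) % SIG_BITS; }

    /* Called (in hardware) on every TLoad/TStore. */
    static void sig_insert(signature_t *s, uint64_t addr)
    {
        unsigned h1 = hash1(addr), h2 = hash2(addr);
        s->bits[h1 / 64] |= 1ULL << (h1 % 64);
        s->bits[h2 / 64] |= 1ULL << (h2 % 64);
    }

    /* Tested on every forwarded coherence request.
       May return false positives, never false negatives. */
    static bool sig_member(const signature_t *s, uint64_t addr)
    {
        unsigned h1 = hash1(addr), h2 = hash2(addr);
        return ((s->bits[h1 / 64] >> (h1 % 64)) & 1) &&
               ((s->bits[h2 / 64] >> (h2 % 64)) & 1);
    }

    static void sig_clear(signature_t *s) { memset(s->bits, 0, sizeof s->bits); }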
3.2 Conflict Tracking: CSTs

Existing proposals for both eager [1, 23] and lazy [5, 12, 22] conflict detection track information on a cache-line-by-cache-line basis. FlexTM, by contrast, tracks conflicts on a processor-by-processor basis (virtualized to thread-by-thread). Specifically, each processor has three Conflict Summary Tables (CSTs), each of which contains one bit for every other processor in the system. Named R-W, W-R, and W-W, the CSTs indicate that a local read (R) or write (W) has conflicted with a read or write (as suggested by the name) on the corresponding remote processor. On each coherence request, the controller reads the local Wsig and Rsig, sets the local CSTs accordingly, and includes information in its response that allows the requestor to set its own CSTs to match.
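As a software-level illustration of this bookkeeping (the real tables are hardware registers; the helper below and its arguments are our own naming), the responder-side update just described might look like this:

    #include <stdint.h>

    typedef uint16_t cst_t;            /* one bit per other processor (16-way CMP) */

    struct cst_set {
        cst_t r_w;   /* local read  conflicted with a remote write */
        cst_t w_r;   /* local write conflicted with a remote read  */
        cst_t w_w;   /* local write conflicted with a remote write */
    };

    /* Responder side: remote processor `req` issued a request; hit_wsig/hit_rsig
     * say which local signatures it hit, and remote_is_write whether the remote
     * access was a (T)GETX rather than a GETS. */
    static void record_conflict(struct cst_set *me, int req,
                                int hit_wsig, int hit_rsig, int remote_is_write)
    {
        cst_t bit = (cst_t)(1u << req);
        if (hit_wsig)                                /* we have TStored the line */
            *(remote_is_write ? &me->w_w : &me->w_r) |= bit;
        else if (hit_rsig && remote_is_write)        /* we have only TLoaded it  */
            me->r_w |= bit;
        /* read-read sharing is not a conflict: no CST bit is set */
    }

The requestor performs the mirror-image update when the Threatened or Exposed-Read response arrives.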
3.3 Versioning Support: PDI

RTM [30] proposed a lazy versioning mechanism (programmable data isolation (PDI)) that allowed software to exploit incoherence (when desired) by utilizing the inherent buffering capabilities of private caches. Programs use explicit TLoad and TStore instructions to inform the hardware of transactional memory operations: TStore requests isolation of a speculative write, whose value will not propagate to other processors until commit time. TLoad allows local caching of (previous values of) remotely TStored lines. When speculatively modified state fits in the private cache, PDI avoids the latency and bandwidth penalties of logging.

FlexTM adapts PDI to a directory protocol and simplifies the management of speculative reads, adding only two new stable states to the base MESI protocol, rather than the five employed in RTM. Details appear in Figure 1.

FlexTM's base protocol for private L1s and a shared L2 is an adaptation of the SGI ORIGIN 2000 [18] directory-based MESI, with the directory maintained at the L2 (Figure 2). Local L1 controllers respond to both the requestor and the directory (to indicate whether the cache line has been dropped or retained). Requestors issue a GETS on a read (Load/TLoad) miss in order to get a copy of the data, a GETX on a normal write (Store) miss/upgrade in order to gain exclusive access and an updated copy (in case of a miss), and a TGETX on a transactional store (TStore) miss/upgrade.

A TStore results in a transition to the TMI state in the L1 cache (encoded by setting both the T bit and the MESI dirty bit—Figure 2). A TMI line reverts to M on commit (propagating the speculative modifications) and to I on abort (discarding speculative values). On the first TStore to a line in M, TMESI writes back the modified line to L2 to ensure subsequent Loads get the latest non-speculative version. To the directory, the local TMI state is analogous to the conventional E state. The directory realizes that the processor can transition to M (silent upgrade) or I (silent eviction), and any data request needs to be forwarded to the processor to detect the latest state. The only modification required at the directory is the ability to support multiple owners. We do this by extending the existing support for multiple sharers and using the modified bit to distinguish between multiple readers and multiple writers. We add requestors to the sharer list when they issue a TGETX request and ping all of them on other requests. On remote requests for a TMI line, the L1 controller sends a Threatened response, analogous to the Shared response to a GETS request on an S or E line.

In addition to transitioning the cache line to TMI, a TStore also updates the Wsig. TLoad likewise updates the Rsig. TLoads when threatened move to the TI state, encoded by setting the T bit when in the I (invalid) state. (Note that a TLoad from E or S can never be threatened; the remote transition to TMI would have moved the line to I.) TI lines must revert to I on commit or abort, because if a remote processor commits its speculative TMI block, the local copy could go stale. The TI state appears as a conventional sharer to the directory.

FlexTM enforces the single-writer or multiple-reader invariant for non-transactional lines. For transactional lines, it enforces that (1) TStores can only update lines in TMI state, and (2) TLoads that are threatened can cache the block only in TI state. Software is expected to ensure that at most one of the conflicting transactions commits. It can restore coherence to the system by triggering an Abort on the remote transaction's cache, without having to re-acquire exclusive access to store sets. Previous lazy protocols [5, 12] forward invalidation messages to the sharers of the store-set and enforce coherence invariants at commit time. In contrast, TMESI forwards invalidation messages at the time of the individual TStores, and arranges for concurrent transactional readers (writers) to use the TI (TMI) state; software can then control when (and if) invalidation happens.

Transaction commit is requested with a special variant of the CAS (compare-and-swap) instruction. Like a normal CAS, CAS-Commit fails if it does not find an expected value in memory. It also fails if the caller's W-W or W-R CST is nonzero. As a side effect of success, it simultaneously reverts all local TMI and TI lines to M and I, respectively (achieved by flash-clearing the T bits). On failure, CAS-Commit leaves transactional state intact in cache. Software can clean up transactional state by issuing an ABORT to the controller, which reverts all TMI and TI lines to I (achieved by conditionally clearing the M bits based on the T bits and then flash-clearing the T bits).
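The following pseudocode restates the semantics of CAS-Commit from software's point of view; in FlexTM it is a single instruction with hardware side effects, so the function below (and the GCC builtin standing in for the CAS) is purely illustrative.

    #include <stdint.h>

    enum tsw_state { TSW_ACTIVE, TSW_COMMITTED, TSW_ABORTED };

    struct txn_hw_state {                 /* hypothetical view of per-context state */
        uint16_t w_w, w_r;                /* conflict summary tables                */
    };

    void flash_commit_lines(void);        /* TMI -> M, TI -> I (clear T bits)       */

    static int cas_commit(volatile int *tsw, int expected, int desired,
                          struct txn_hw_state *hw)
    {
        if (hw->w_w != 0 || hw->w_r != 0)
            return 0;                     /* unresolved write conflicts: fail,      */
                                          /* speculative lines left intact          */
        if (!__sync_bool_compare_and_swap(tsw, expected, desired))
            return 0;                     /* TSW changed: we were aborted remotely  */
        flash_commit_lines();             /* success: publish buffered TMI state    */
        return 1;
    }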
Conflict Detection. On forwarded L1 requests from the directory, the local cache controller tests the signatures and appends an appropriate message type to its response, as shown in the table in Figure 1. Threatened indicates a write conflict (hit in the Wsig), Exposed-Read indicates a read conflict (hit in the Rsig), and Shared or Invalidated indicate no conflict. On a miss in the Wsig, the result from testing the Rsig is used; on a miss in both, the L1 cache responds as in normal MESI. The local controller also piggybacks a data response if the block is currently in M state. When it sends a Threatened or Exposed-Read message, a responder sets the bit corresponding to the requestor in its R-W, W-W, or W-R CSTs, as appropriate. The requestor likewise sets the bit corresponding to the responder in its own CSTs, as appropriate, when it receives the response.

[Figure 1: TMESI protocol state diagram (MESI states plus the PDI states TMI and TI). Dashed boxes enclose the MESI and PDI subsets of the state space. Notation on transitions is conventional: the part before the slash is the triggering message; after it is the ancillary action ('—' means none). GETS indicates a request for a valid sharable copy; GETX for an exclusive copy; TGETX for a copy that can be speculatively updated with TStore. X stands for the set {GETX, TGETX}. "Flush" indicates a data block response to the requestor and directory. S indicates a Shared message; T a Threatened message. Plain, they indicate a response by the local processor to the remote requestor; parenthesized, they indicate the message that accompanies the response to a request. An overbar means logically "not signaled". The figure also includes the following tables:]

State encoding:

  State   M bit   V bit   T bit
  M       1       0       0
  E       1       1       0
  S       0       1       0
  I       0       0       0
  TMI     1       0       1
  TI      0       0       1

Responses to requests that hit in Wsig or Rsig:

  Request Msg   Hit in Wsig   Hit in Rsig
  GETX          Threatened    Invalidated
  TGETX         Threatened    Exposed-Read
  GETS          Threatened    Shared

Requestor CST set on coherence message:

                Response Message
  Local op.     Threatened    Exposed-Read
  TLoad         R-W           —
  TStore        W-W           W-R
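Read as a decision procedure, the response table above can be written as the small routine below; the enum and function names are ours, introduced only for illustration.

    enum req_type  { REQ_GETS, REQ_GETX, REQ_TGETX };
    enum resp_type { RESP_SHARED, RESP_INVALIDATED, RESP_THREATENED, RESP_EXPOSED_READ };

    /* Responder's choice of message type, given which local signatures the
     * forwarded request hit (cf. the table in Figure 1). */
    static enum resp_type classify_response(enum req_type req, int hit_wsig, int hit_rsig)
    {
        if (hit_wsig)                       /* local transaction has TStored the line */
            return RESP_THREATENED;
        if (hit_rsig) {                     /* local transaction has TLoaded the line */
            if (req == REQ_GETS)  return RESP_SHARED;        /* read-read: no conflict */
            if (req == REQ_TGETX) return RESP_EXPOSED_READ;  /* transactional write    */
            return RESP_INVALIDATED;                         /* ordinary write (GETX)  */
        }
        return (req == REQ_GETS) ? RESP_SHARED : RESP_INVALIDATED;   /* normal MESI */
    }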

3.4 Explicit Aborts: AOU

The Alert-On-Update (AOU) mechanism, borrowed from RTM [30], supports synchronous notification of conflicts. To use AOU, a program marks (ALoads) one or more cache lines, and the cache controller effects a subroutine call to a user-specified handler if the marked line is invalidated. Alert traps require simple additions to the processor pipeline. Modern processors already include trap signals between the Load-Store-Unit (LSU) and Trap-Logic-Unit (TLU) [35]. AOU adds an extra message to this interface and an extra mark bit, 'A', to each line in the L1 cache. (An overview of the FlexTM hardware required in the processor core, the L1 controller, and the L2 controller appears in Figure 2.) RTM used AOU to detect software-induced changes to (a) transaction status words (indicating an abort) and (b) the metadata associated with objects accessed in a transaction (indicating conflicts). FlexTM uses AOU for abort detection only; it performs conflict detection using signatures and CSTs instead of metadata. Hence, FlexTM requires AOU support for only one cache line (i.e., the transaction status word; see Section 3.6) and can therefore use the simplified hardware mechanism (avoiding the bit per cache tag) as proposed in [33]. More general AOU support might still be useful for non-transactional purposes.

[Figure 2: FlexTM Architecture Overview (dark lines surround FlexTM-specific state). The processor core holds the CSTs (R-W, W-R, W-W), the Rsig and Wsig signatures, AOU and PDI control, and the CMPC and AbortPC registers; the private L1 cache controller includes the Overflow Table controller (virtual/physical base addresses, thread id, Osig, overflow count, committed/speculative flag, number of sets and ways); the shared L2 cache controller holds the summary signatures (RSsig, WSsig) and the Cores Summary register for context-switch support.]

3.5 Programming Model

A FlexTM transaction is delimited by BEGIN_TRANSACTION and END_TRANSACTION macros. The first of these establishes conflict and abort handlers for the transaction, checkpoints the processor registers, configures per-transaction metadata, sets the transaction status word (TSW) to active, and ALoads that word (for notification of aborts). The second macro aborts conflicting transactions and tries to atomically update the status word from active to committed using CAS-Commit. In this paper, we assume a subsumption model for nesting, with support for transactional pause [37].

Within a transaction, a processor issues TLoads and TStores when it expects transactional semantics, and conventional loads and stores when it wishes to bypass those semantics. While one could imagine requiring the compiler to generate instructions as appropriate, our prototype implementation follows typical HTM practice and interprets ordinary loads and stores as TLoads and TStores when they occur within a transaction. This convention facilitates code sharing between transactional and nontransactional program fragments. Ordinary loads and stores can be requested within a transaction by issuing special instructions; while not needed in our experiments, these could be used to implement open nesting, update software metadata, or reduce the cost of thread-private updates in transactions that overflow cache resources (Section 4).

As implied in Figure 1, transactional and ordinary loads and stores to the same location can occur concurrently. While we are disinclined to require strong isolation [3] as part of the user programming model (it's hard to implement on legacy hardware, and is of questionable value to the programmer [34]), it can be supported at essentially no cost in HTM systems (FlexTM among them), and we see no harm in providing it. If the GETX request resulting from a nontransactional write miss hits in the responder's Rsig or Wsig, it aborts the responder's transaction, so the nontransactional write appears to serialize before the (retried) transaction. A nontransactional read, likewise, serializes before any concurrent transactions, because transactional writes remain invisible to remote processors until commit time (in order to enforce coherence, the corresponding cache line, which is threatened in the response, is uncached).
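As a rough sketch of how the delimiters described at the start of this section might expand, consider the macros below. The flextm_* wrappers, the descriptor layout, and the lazy_commit helper are hypothetical stand-ins for the hardware primitives, not a real API.

    #include <setjmp.h>

    enum tsw_state { TSW_ACTIVE, TSW_COMMITTED, TSW_ABORTED };

    struct txn_desc {
        volatile int tsw;                 /* transaction status word            */
        jmp_buf      checkpoint;          /* register checkpoint                */
        int          eager;               /* 1 = eager, 0 = lazy detection      */
    };

    /* Hypothetical wrappers around FlexTM primitives (illustrative only). */
    void flextm_aload(volatile int *line);                       /* ALoad a line */
    int  flextm_cas_commit(volatile int *tsw, int exp, int val); /* CAS-Commit   */
    void lazy_commit(struct txn_desc *t);                        /* Figure 3     */

    /* The checkpoint must live in the caller's frame, hence a macro. */
    #define BEGIN_TRANSACTION(t)                                    \
        do {                                                        \
            setjmp((t)->checkpoint);      /* re-entered on abort */ \
            (t)->tsw = TSW_ACTIVE;                                  \
            flextm_aload(&(t)->tsw);      /* alert on remote abort */\
        } while (0)

    #define END_TRANSACTION(t)                                      \
        do {                                                        \
            if ((t)->eager)                                         \
                flextm_cas_commit(&(t)->tsw, TSW_ACTIVE, TSW_COMMITTED); \
            else                                                    \
                lazy_commit(t);   /* abort conflicting peers, then commit */ \
        } while (0)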
The hardware simply updates requestor and respon- As implied in Figure 1, transactional and ordinary loads and der CSTs. To ensure serialization, a Lazy transaction must, stores to the same location can occur concurrently. While we prior to committing, abort every concurrent transaction that are disinclined to require strong isolation [3] as part of the user conflicts with its write-set. It does so using the Commit() rou- programming model (it’s hard to implement on legacy hard- tine shown in Figure 3. ware, and is of questionable value to the programmer [34]), it can be supported at essentially no cost in HTM systems Commit() /* Non-blocking, pre-emptible */ 1. copy-and-clear W-R and W-Wregisters (FlexTM among them), and we see no harm in providing it. 2. foreach i set in W-R or W-W If the GETX request resulting from a nontransactional write 3. abort id = manage conflict(my id, i) miss hits in the responder’s Rsig or Wsig, it aborts the respon- 4. if (abort id 6= NULL) // not resolved by waiting der’s transaction, so the nontransactional write appears to seri- 5. CAS(TSW[abort id], active, aborted) 6. CAS-Commit(TSW[my id], active, committed) alize before the (retried) transaction. A nontransactional read, 7. if (TSW[my id] == active) // failed due to nonzero CST likewise, serializes before any concurrent transactions, because 8. goto 1 transactional writes remain invisible to remote processors until Figure 3: Simplified Commit Routine for Lazy transactions. commit time (in order to enforce coherence, the corresponding cache line, which is threatened in the response, is uncached). All of the work for the Commit() routine occurs in software, 3.6 Bounded Transactions with no need for global arbitration [5, 7, 12], blocking of other In this section, we focus on transactions that fit in the L1 cache transactions [12], or special hardware states. The routine be- and complete within an OS quantum. gins by using a copy and clear instruction (e.g., clruw on the SPARC) to atomically access its own W-R and W-W. In lines Name Description 2–5 of Figure 3, for each of the bits that was set, transaction TSW active / committed / aborted T aborts the corresponding transaction R by atomically chang- running suspended State / ing R’s TSW from active to aborted. Transaction R, of Rsig,Wsig Signatures R-W, W-R, W-W Conflict Summary Tables course, could try to CAS-Commit its TSW and race with T , but OT Pointer to Overflow Table descriptor since both operations occur on R’s TSW, conventional cache AbortPC Handler address for AOU on TSW coherence guarantees serialization. After T has successfully CMPC Handler address for Eager conflicts E/L Eager(1)/Lazy(0) conflict detection. aborted all conflicting peers, it performs a CAS-Commit on its own status word. If the CAS-Commit fails and the failure can Table 1: Transaction Descriptor contents. All fields except be attributed to a non-zero W-R or W-W (i.e., new conflicts), TSW and State are cached in hardware registers for transac- the Commit() routine is restarted. To avoid subsequent spuri- tions running. ous aborts, T may also clean itself out of X’s W-R, where X is the transaction in T ’s R-W (not shown). Every FlexTM transaction is represented by a software de- scriptor (Table 1). Transactions of a given application can op- 4 Unbounded Space Support erate in either Eager or Lazy conflict detection mode. 
4 Unbounded Space Support

For common case transactions that do not overflow the cache, signatures, CSTs, and PDI avoid the need for logging or other per-access software overhead. To provide the illusion of unbounded space, however, FlexTM must support transactions in the presence of (1) L1 cache overflows and (2) physical memory virtualization (i.e., paging).

4.1 Cache Evictions

Cache evictions must be handled carefully in FlexTM. First, signatures rely on forwarded requests from the directory to trigger lookups and provide conservative conflict hints (Threatened and Exposed-Read messages). Second, TMI lines holding speculative values need to be buffered and cannot be merged into the shared level of the cache.

Conventional MESI performs silent eviction of E and S lines to avoid the bandwidth overhead of notifying the directory. In FlexTM, silent evictions of E, S, and TI lines also serve to ensure that a processor continues to receive the coherence requests it needs to detect conflicts. (Directory information is updated only in the wake of L1 responses to L2 requests, at which point any conflict is sure to have been noticed.) When evicting a cache block in M, FlexTM updates the L2 copy but does not remove the processor from the sharer list. Processor sharer information can, however, be lost due to L2 evictions. To preserve the access conflict tracking mechanism, L2 misses result in querying all L1 signatures in order to recreate the sharer list. This scheme is much like the sticky bits used in LogTM [23].

FlexTM employs a per-thread overflow table (OT) to buffer evicted TMI lines. The OT is organized as a hash table in virtual memory. It is accessed both by software and by an OT controller that sits on the L1 miss path.
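One plausible layout for such a table, and the lookaside check performed on an L1 miss that hits in the Osig, is sketched below; the associativity, field names, and lookup helper are our own illustrative choices rather than FlexTM's actual format.

    #include <stdint.h>
    #include <stddef.h>

    #define OT_WAYS   4                 /* illustrative associativity            */
    #define LINE_SIZE 64

    /* One overflowed cache line; tagged with both addresses as described below. */
    struct ot_entry {
        uint64_t paddr;                 /* physical tag, for associative lookup  */
        uint64_t vaddr;                 /* virtual tag, for page-in at copy-back */
        uint8_t  valid;
        uint8_t  data[LINE_SIZE];       /* speculative (redo) value              */
    };

    struct ot_set { struct ot_entry way[OT_WAYS]; };

    struct overflow_table {
        struct ot_set *sets;            /* physically contiguous region          */
        unsigned       nsets;
    };

    /* Miss-path check: only consulted when the address hits in the Osig. */
    static struct ot_entry *ot_lookup(struct overflow_table *ot, uint64_t paddr)
    {
        struct ot_set *set = &ot->sets[(paddr / LINE_SIZE) % ot->nsets];
        for (int w = 0; w < OT_WAYS; w++)
            if (set->way[w].valid && set->way[w].paddr == paddr)
                return &set->way[w];
        return NULL;                    /* fall back to the L2 response          */
    }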
The OT controller implements (1) fast lookups on cache misses, allowing software to be oblivious to the overflowed status of a cache line, and (2) fast cleanup and atomic commit of overflowed state.

The controller registers required for OT support appear in Figure 2. They include a thread identifier, a signature (Osig) for the overflowed cache lines, a count of such lines, a committed/speculative flag, and parameters (virtual and physical base address, number of sets and ways) used to index into the table.

On the first overflow of a TMI cache line, the processor traps to a software handler, which allocates an OT, fills the registers in the OT controller, and returns control to the transaction. To minimize the state required for lookups, the current OT controller design requires the OS to ensure that OTs of active transactions lie in physically contiguous memory. If an active transaction's OT is swapped out, then the OS invalidates the Base-Address register in the controller. If subsequent activity requires the OT, the hardware traps to a software routine that re-establishes a mapping. The hardware needs to ensure that new TMI lines aren't evicted during OT set-up; the L1 cache controller could easily support this routine by ensuring at least one entry in the set is free for non-TMI lines. On a subsequent TMI eviction, the OT controller calculates the set index using the physical address of the line, accesses the set tags of the OT region to find an empty way, and writes the data block back to the OT instead of the L2. The controller tags the line with both its physical address (used for associative lookup) and its virtual address (used to accommodate page-in at commit time; see below). The controller also adds the physical address to the overflow signature (Osig) and increments the overflow count.

The Osig provides quick lookaside checks for entries in the OT. Reads and writes that miss in the L1 are checked against the signature. Signature hits trigger the L1-to-L2 request and the OT lookup in parallel. On OT hits, the line is fetched from the OT, the corresponding OT tag is invalidated, and the L2 response is squashed. This scheme is analogous to the speculative memory request issued by the home memory controller before snoop responses are all collected. When a remote request hits in the Osig of a committed transaction, the controller could perform a lookup in the OT, much as it does for local requests, or it could NACK the request until copy-back completes. Our current implementation does the latter.

In addition to the functions previously described, the CAS-Commit operation sets the Committed bit in the controller's OT state. This indicates that the OT content should be visible, activating NACKs or lookups. At the same time, the controller initiates a microcoded copy-back operation. There are no constraints on the order in which lines from the OT are copied back to their natural locations. This stands in contrast to time-based undo logs [23], which must proceed in reverse order of insertion. Remote requests need to check only committed OTs (since speculative lines are private) and for only a brief span of time (during OT copy-back). On aborts, the OT is returned to the operating system. The next overflowed transaction allocates a new OT. When an OT overflows a way, the hardware generates a trap to the OS, which expands the OT appropriately.

With the addition of the OT controller, software is involved only for the allocation and deallocation of the OT structure. Indirection to the OT on misses, while unavoidable, is performed in hardware rather than in software, thereby reducing the resulting overheads. Furthermore, FlexTM's copy-back is performed by the controller and occurs in parallel with other useful work on the processor.

Virtual Memory Paging. Though presumably infrequent, page faults may nonetheless occur in the middle of a transaction. To accommodate paging of the original locations, OT tags include the virtual addresses of cache blocks. These addresses are used during copy-back, to ensure automatic page-in of any nonresident pages. Though for simplicity we currently require that OTs be physically contiguous, they can themselves be paged, albeit as a single unit. In particular, it makes sense for the OS to swap out the OTs of descheduled threads. A more ambitious FlexTM design could allow physically non-contiguous OTs, with controller access mediated by more complex mapping information.

The two challenges left to consider are (1) when a page is swapped out and its frame is reused for a different page in the application, and (2) when a page is re-mapped to a different frame. Since signatures are built using physical addresses, (1) can lead to false positives, which can cause spurious aborts but not correctness issues. In a more ambitious design, we could solve this problem with virtual address-based conflict detection for non-resident pages.
For (2) we adapt a solution first proposed in LogTM-SE [36]. At the time of the unmap, active transactions are interrupted both for TLB entry shootdown (already required) and to flush TMI lines to the OT. When the page is assigned to a new frame, the OS interrupts all the threads that mapped the page and tests each thread's Rsig, Wsig, and Osig for the old address of each block. If the block is present, the new address is inserted into the signatures. The corresponding tags of the OT entries are also updated with the new physical address.

5 Context Switch Support

STMs provide effective virtualization support because they maintain conflict detection and versioning state in virtualizable locations and use software routines to manipulate them. For common case transactions, FlexTM uses scalable hardware support to bookkeep the state associated with access permissions, conflicts, and versioning while controlling policy in software. In the presence of context switches, FlexTM detaches the transactional state of suspended threads from the hardware and manages it using software routines. This enables support for transactions to extend across context switches (i.e., to be unbounded in time [1]).

Ideally, only threads whose actions overlap with the read and write set of suspended transactions should bear the software routine overhead. To track the accesses of descheduled threads, FlexTM maintains two summary signatures, RSsig and WSsig, at the directory of the system. When suspending a thread in the middle of a transaction, the OS unions (i.e., ORs) the signatures (Rsig and Wsig) of the suspended thread into the current RSsig and WSsig installed at the directory.¹

  ¹ FlexTM updates RSsig and WSsig using a Sig message that uses the L1 coherence request network and carries the processor's Rsig and Wsig. The directory updates the summary signatures and returns an ACK on the forwarding network. This avoids races between the ACK and remote requests that were forwarded to the suspending thread/processor before the summary signatures were updated.

Once the RSsig and WSsig are up to date, the OS invokes hardware routines to merge the current transaction's hardware state into virtual memory. This hardware state consists of (1) the TMI lines in the local cache, (2) the OT registers, (3) the current Rsig and Wsig, and (4) the CSTs. After saving this state (in the order shown), the OS issues an abort instruction, causing the cache controller to revert all TMI and TI lines to I, and to clear the signatures, CSTs, and OT registers. This ensures that any subsequent conflicting access will miss in the private cache and generate a directory request. In other words, for any given location, the first conflict between the running thread and a local descheduled thread always results in an L1 miss. The L2 controller consults the summary signatures on each such miss, and traps to software when a conflict is detected.²

  ² A TStore to an M line generates a write-back (see Figure 1) that also tests the RSsig and WSsig for conflicts. This resolves the corner case in which a suspended transaction TLoaded a line in M and a new transaction on the same processor TStores it.
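The suspend path just described might be organized roughly as follows. Every hw_* routine is a hypothetical wrapper for the corresponding hardware save/union/abort operation, and the saved-state layout is illustrative.

    #include <stdint.h>

    struct saved_txn_state {
        uint64_t rsig[32], wsig[32];    /* 2 Kbit signatures                 */
        uint16_t r_w, w_r, w_w;         /* conflict summary tables           */
        uint64_t ot_regs[8];            /* overflow-table controller regs    */
    };

    void hw_union_summary_sigs(void);             /* OR Rsig/Wsig into RSsig/WSsig */
    void hw_flush_tmi_lines(void);                /* (1) spill TMI lines            */
    void hw_save_ot_regs(uint64_t *dst);          /* (2) OT registers               */
    void hw_save_sigs(uint64_t *rsig, uint64_t *wsig);            /* (3) signatures */
    void hw_save_csts(uint16_t *rw, uint16_t *wr, uint16_t *ww);  /* (4) CSTs       */
    void hw_abort(void);  /* revert TMI/TI lines; clear sigs, CSTs, OT registers    */

    static void deschedule_transaction(struct saved_txn_state *s)
    {
        hw_union_summary_sigs();                  /* publish accesses at directory  */
        hw_flush_tmi_lines();
        hw_save_ot_regs(s->ot_regs);
        hw_save_sigs(s->rsig, s->wsig);
        hw_save_csts(&s->r_w, &s->w_r, &s->w_w);
        hw_abort();   /* later conflicting accesses now miss and reach the directory */
    }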
On summary hits, a software handler mimics hardware operations on a per-thread basis, testing signature membership and updating the CSTs of suspended transactions. No special instructions are required, since the CSTs and signatures of descheduled threads are all visible in virtual memory. Nevertheless, updates need to be performed atomically to ensure consistency when multiple active transactions conflict with a common descheduled transaction and update the CSTs concurrently. The OS helps the handler distinguish among transactions running on different processors. It maintains a global conflict management table (CMT), indexed by processor id, with the following invariant: if transaction T is active, and has executed on processor P, irrespective of the state of the thread (suspended/running), the transaction descriptor will be included in P's portion of the CMT. The handler uses the processor ids in its CST to index into the CMT and to iterate through transaction descriptors, testing the saved signatures for conflicts, updating the saved CSTs (if running in lazy mode), or invoking conflict management (if running in eager mode). Similar perusal of the CMT occurs at commit time if running in lazy mode. As always, we abort a transaction by writing its TSW. If the remote transaction is running, an alert is triggered since it would have previously ALoaded its TSW. Otherwise, the OS virtualizes the AOU operation by causing the transaction to wake up in a software handler that checks and re-ALoads the TSW.

The directory needs to ensure that sticky bits are retained when a transaction is suspended. Along with RSsig and WSsig, the directory maintains a bitmap indicating the processors on which transactions are currently descheduled (the "Cores Summary" register in Figure 2). When the directory would normally remove a processor from the sharers list (because a response to a coherence request indicates that the line is no longer cached), the directory refrains from doing so if the processor is in the Cores Summary list and the line hits in RSsig or WSsig. This ensures that the L1 continues to receive coherence messages for lines accessed by descheduled transactions. It will need these messages if the thread is swapped back in, even if it never reloads the line.

When re-scheduling a thread, if the thread is being scheduled back to the same processor from which it was swapped out, the thread's Rsig, Wsig, CST, and OT registers are restored on the processor. The OS then re-calculates the summary signatures for the currently swapped-out threads with active transactions and re-installs them at the directory. Thread migration is a little more complex, since FlexTM performs write buffering and does not re-acquire ownership of previously written cache lines. To avoid the inherent complexity, FlexTM adopts a simple policy for migration: abort and restart.

Unlike LogTM-SE [36], FlexTM is able to place the summary signature at the directory rather than on the path of every L1 access. This scheme does not require inter-processor interrupts to install summary signatures. Since speculative state is flushed from the local cache when descheduling a transaction, the first access to a conflicting line after re-scheduling is guaranteed to miss, and the conflict will be caught by the summary signature at the directory. Because it is able to abort remote transactions using AOU, FlexTM also avoids the problem of potential convoying behind suspended transactions.
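One plausible shape for the summary-hit handler's walk over the conflict management table described above is sketched below; this is a simplified illustration (the CMT layout and helper names are ours), and a real handler would also synchronize its CST updates as noted earlier.

    #include <stddef.h>

    #define NPROC 16

    struct txn_desc;                      /* saved descriptor of a (possibly suspended) txn */
    struct cmt_node { struct txn_desc *txn; struct cmt_node *next; };
    extern struct cmt_node *CMT[NPROC];   /* per-processor lists of active descriptors      */

    int  sig_conflict(struct txn_desc *t, unsigned long addr, int is_write);
    void update_saved_cst(struct txn_desc *t, int me, int is_write);   /* lazy mode  */
    void invoke_conflict_manager(struct txn_desc *t, int me);          /* eager mode */

    /* Called when an L1 miss hits in the summary signatures at the directory. */
    static void on_summary_hit(int me, unsigned long addr, int is_write, int eager)
    {
        for (int p = 0; p < NPROC; p++) {
            if (p == me)
                continue;
            for (struct cmt_node *n = CMT[p]; n != NULL; n = n->next) {
                if (!sig_conflict(n->txn, addr, is_write))
                    continue;                      /* saved Rsig/Wsig do not match */
                if (eager)
                    invoke_conflict_manager(n->txn, me);
                else
                    update_saved_cst(n->txn, me, is_write);
            }
        }
    }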

6 Area Analysis

In this section, we briefly summarize the area overheads of FlexTM. Further details can be found in a technical report [31]. Area estimates appear in Table 2. We consider processors from a uniform (65nm) technology generation to better understand the microarchitectural tradeoffs. Processor component sizes were estimated using published die images. FlexTM component areas were estimated using CACTI 6.

Only for the 8-way multithreaded Niagara-2 do the Rsig and Wsig have a noticeable area impact: 2.2%; on Merom and Power6 they add only ∼0.1%. CACTI indicates that the signatures should be readable and writable in less than the L1 access latency. These results appear to be consistent with those of Sanchez et al. [27]. The CSTs for their part are full-map bit-vector registers (as wide as the number of processors), and we need only three per hardware context. Also, we do not expect the extra state bits in the L1 to affect the access latency because (a) they have minimal impact on the cache area and (b) the state array is typically accessed in parallel with the higher-latency data array. Finally, the OT controller adds less than 0.5% to core area. Moreover, its state machine is similar to Niagara-2's TSB walker [35]. Overall, FlexTM's add-ons have noticeable area impact (∼2.6%) only in the case of high core multithreading. The overheads imposed on out-of-order CMP cores (Merom and Power6) are well under 1%.

  Processor                Merom [26]   Power6 [11]   Niagara-2 [35]
  Actual Die
    SMT (threads)          1            2             8
    Core (mm^2)            31.5         53            11.7
    L1 D (mm^2)            1.8          2.6           0.4
  CACTI Prediction
    Rsig + Wsig (mm^2)     .033         .066          0.26
    RSsig + WSsig (mm^2)   .033         .033          0.033
    CSTs (registers)       3            6             24
    OT controller (mm^2)   0.16         0.24          0.035
    Extra state bits       2 (T,A)      3 (T,A,ID)    5 (T,A,ID)
    % Core increase        0.6%         0.59%         2.6%
    % L1 Dcache increase   0.35%        0.29%         3.9%

Table 2: Area Estimation. "ID" = SMT context of TMI line.

7 FlexTM Evaluation

7.1 Evaluation Framework

We evaluate FlexTM through full system simulation of a 16-way chip multiprocessor (CMP) with private L1 caches and a shared L2 (see Table 3(a)), on the GEMS/Simics infrastructure [21]. We added support for the FlexTM instructions using the standard Simics "magic instruction" interface. Our base protocol is an adaptation of the SGI ORIGIN 2000 [18] for a CMP, extended to support FlexTM's requirements: signatures, CSTs, PDI, and AOU. Software routines (setjmp) were used to checkpoint registers.

Simics allows us to run an unmodified Solaris 9 kernel. Simics also provides a "user-mode-change" and "exception-handler" interface, which we use to trap user-kernel mode crossings. On crossings, we suspend the current transaction mode and allow the OS to handle TLB misses, register-window overflow, and other kernel activities required by an active user context in the midst of a transaction. On transfer back from the kernel, we deliver any alert signals received during the kernel routine, triggering the alert handler if needed.
7.2 Runtime Systems

We evaluate FlexTM using the seven benchmarks described in Table 3(b). In the data-structure tests, we execute a fixed number of transactions in a single thread to warm up the structure, then fork off threads to perform the timed transactions. Workload set 1 (WS1) interfaces with three TM systems: (1) FlexTM; (2) RTM-F [30], a hardware-accelerated STM system; and (3) RSTM [20], a non-blocking STM for legacy hardware (configured to use invisible readers, with self validation for conflict detection). Workload set 2 (WS2), which uses a different API, interfaces with (1) FlexTM and (2) TL2, a blocking STM for legacy hardware [9]. We use the "Polka" conflict manager [28] across all systems. While all runtime systems execute on our simulated hardware, RSTM and TL2 make no use of FlexTM extensions. RTM-F uses only PDI and AOU. FlexTM uses all the presented mechanisms.
L1 Cache 32KB 2-way split, 64-byte RBTree: Transactions attempt to insert, remove, or delete (33% each) values in the range 0 ... 4095 with equal probability. At blocks, 1 cycle, steady state there are about 2048 objects, with 50% of the values in leaves. Node size is 256 bytes. 32 entry victim buffer, LFUCache: Simulates a web cache using a large (2048) array based index and a smaller (255 entry) priority queue to track the −2 2Kbit signature [5, S14] page access frequency. Pages to be accessed are randomly chosen using a Zipf distribution: p(i) ∝ Σ0

7.3 Throughput and Scalability

Result 1: Separable hardware support for conflict detection, conflict management, and versioning can provide significant acceleration for software-controlled TMs; eliminating software bookkeeping from the common case critical path is essential to realize the full benefits of hardware acceleration.

Figure 4 shows normalized throughput (transactions/sec.) across our applications and systems. FlexTM, RTM-F, and RSTM have all been set up to perform eager conflict management (TL2 is inherently lazy). Throughput is normalized to that of single-thread coarse-grain locks (CGL), which is very close to sequential thread performance. To illustrate the usefulness of CSTs (see the table in Figure 4), we also report the number of conflicts encountered and resolved by an average transaction—the number of bits set in the W-R and W-W CST registers.

STM performance suffers from the bookkeeping required to track data versions (copying), detect conflicts, and guarantee a consistent view of memory (validation). RTM-F exploits AOU and PDI to eliminate validation and copying but still incurs bookkeeping overhead accounting for 40–50% of execution time. FlexTM's hardware tracks conflicts, buffers speculative state, and ensures consistency in a manner transparent to software, resulting in single-thread performance close to that of CGL. FlexTM's main overhead, register checkpointing, involves spilling of local registers into the stack and is nearly constant across thread levels. Eliminating per-access software overheads (metadata tracking, validation, and copying) allows FlexTM to realize the full potential of hardware acceleration, with an average speedup of 2× over RTM-F, 5.5× over RSTM, and 4.5× over TL2.

[Figure 4: Throughput (transactions/10^6 cycles), normalized to 1-thread CGL, for (a) HashTable, (b) RBTree, (c) LFUCache, (d) RandomGraph, (e) Delaunay, (f) Vacation-Low, and (g) Vacation-High. The X-axis specifies the number of threads (up to 16); the systems compared are CGL, FlexTM, RTM-F, and STM. In plots (a)-(e) STM represents RSTM [20]; in (f)-(g) it represents TL2 [9]. The figure also tabulates conflicting transactions per committed transaction (Md = Median, Mx = Maximum):]

  Conflicting Transactions
                 8T           16T
               Md    Mx     Md    Mx
  Hash          0     2      0     3
  RBTree        1     2      1     3
  LFUCache      3     5      6    10
  Graph         2     4      5     9
  Vac-Low       1     2      1     4
  Vac-High      1     3      1     4
  Delaunay      0     2      0     2

HashTable and RBTree both scale well. In RSTM, validation and copying account for 22% of execution time in HashTable and 50% in RBTree; metadata management accounts for 40% and 30%, respectively. Tree rebalancing in RBTree is non-trivial: insertion proceeds bottom-up while searching moves top-down. At higher thread levels, eager conflict management precludes read-write sharing and increases the likelihood of aborts, though the back-off strategy of the Polka conflict manager limits aborts to about 10% of total transactions committed.

LFUCache and RandomGraph do not scale. Conflict for popular keys in the Zipf distribution forces transactions in LFUCache to serialize. Stalled writers lead to extra aborts with larger numbers of threads, but performance eventually stabilizes for all TM systems. In RandomGraph, larger numbers of more random conflicts cause all TM systems to livelock at higher thread levels, due to eager contention management. The average RandomGraph transaction reads ∼80 cache lines and writes ∼15. In RSTM, read-set validation accounts for 80% of execution time. RTM-F eliminates this overhead, after which per-access bookkeeping accounts for 60% of execution time. FlexTM eliminates this overhead as well, to achieve 2.7× the performance of RTM-F at 1 thread. At higher thread levels, all TM systems livelock due to eager conflict management. In the language of Bobba et al. [4], RandomGraph suffers from the FriendlyFire, FutileStall, and DuellingUpgrade pathologies.

Delaunay [29] is fundamentally data parallel (less than 5% of execution time is spent in transactions) and memory bandwidth limited. FlexTM and CGL track closely out to 16 threads. RSTM and RTM-F also track closely, but at half the throughput, because the extra indirection required for metadata bookkeeping induces a ∼2× increase in the number of cache misses.

Vacation is incompatible with the object-based API of RSTM and RTM-F. Therefore, we evaluate its performance on CGL, word-based TL2, and FlexTM. Transactions read ∼100 entries from a database and stream them through an RBTree. TL2 suffers from the bookkeeping required prior to the first read (i.e., for checking write sets), and from post-read and commit-time validation [9]. FlexTM avoids this bookkeeping, yielding 4× the performance of TL2 at 1 thread. Vacation-Low displays good scalability (Figure 4f): 10× CGL's performance at 16 threads. Vacation-High displays less (Figure 4g): 6× at 16 threads. Multiple threads introduce (1) a mix of read-only (e.g., ticket lookup) and read-write (e.g., ticket reservation) tasks and (2) sets of dueling transactions that try to rotate common subtree nodes. These increase the level of conflicts and aborts.

Idealized Overflow. In the event of an overflow, FlexTM buffers new values in a redo log and needs to perform copy-back at commit time. Almost all of our benchmarks use the overflow mechanism sparingly, with a maximum of 5 cache lines overflowed in RandomGraph. Because our benchmarks have small write sets, cache set conflicts account for all cases of overflow. In separate experiments, we extended the L1 with an unbounded victim buffer.
In applications with overflows, we found that redo-logging reduced performance by an average of 7% and a maximum of 13% (in RandomGraph) compared to the ideal case, mainly because re-starting transactions have their accesses queued behind the committed transaction's copy-back phase. As expected, benchmarks that don't overflow the cache (e.g., HashTable) experience no slowdown.

7.4 FlexTM vs. Central-Arbiter Lazy HTM

Result 2: CSTs are useful: transactions rarely conflict, and even when they do, the number of conflicts per transaction is less than the total number of active transactions. FlexTM's distributed commit demonstrates better scalability than a centralized arbiter.

As shown in the table at the end of Figure 4, the number of conflicts encountered by a transaction is small compared to the total number of transactions in the system. Even in workloads that have a large number of conflicts (LFUCache and RandomGraph), a typical transaction encounters only 30% of total transactions as conflicts. Scalable workloads (e.g., HashTable and Vacation) encounter essentially no conflict. This clearly suggests that global arbitration and serialized commits will not only waste bandwidth but also restrict concurrency. CSTs enable local arbitration, and the distributed commit protocol allows parallel commits, thereby unlocking the full concurrency potential of the application.

In this set of experiments, we compare FlexTM's distributed commit against two schemes with centralized hardware arbiters: Central-Serial and Central-Parallel. In both schemes, instead of using CSTs and requiring each transaction to ALoad its TSW, transactions forward their Rsig and Wsig to a central hardware arbiter at commit time. The arbiter orders each commit request, and broadcasts the Wsig to other processors.
sources Eager management may free up resources for other Every recipient uses the forwarded Wsig to check for conflicts useful work. and abort its active transaction; it also sends an ACK as a re- Figures 6(a)-(d) show the potential benefit of lazy conflict sponse to the arbiter. The arbiter collects all the ACKs and management in FlexTM—specifically, the ability to eliminate then allows the committing processor to complete. This pro- the performance pathologies observed in RBTree, Vacation- cess adds 97 cycles to a transaction, assuming unloaded links High, LFUCache, and RandomGraph. In applications with (latencies are listed in Table 3(a)) and arbiter. The Serial ver- very few conflicts (e.g., HashTable and Vacation-Low), eager sion services only one commit request at a time (queuing up any and lazy management yield almost identical results. others), while Parallel services all non-conflicting transactions RBTree and Vacation-High embody similar tradeoffs in con- in parallel (assumes infinite buffers in the arbiter). Central ar- flict management. At low contention levels, Eager and Lazy biters are similar in spirit to BulkSC [6], but serve only to order yield similar performance. Beyond 4 threads Lazy scales bet- commits; they do not interact with the L2 directory. ter than Eager. Lazy permits reader-writer concurrency, which We present results (see Figure 5) for HashTable, Vacation- pays off when the readers commit first. At 16 threads, Lazy’s Low, LFUCache, and RandomGraph (we eliminate Delaunay advantage is 16% in RBTree and 27% in Vacation-High. since the transaction phases have negligible impact on overall LFUCache admits no concurrency, since transactions con- throughput and RBTree since it demonstrates a similar pattern flict on the same cache line with high probability. On conflict, to Vacation). We enumerate the general trends below: (a) HashTable (b) Vacation-Low (c) LFUCache (d) RandomGraph 10 12 1.2 1.2

Figure 5: FlexTM vs. centralized hardware arbiters. Normalized throughput vs. number of threads for (a) HashTable, (b) Vacation-Low, (c) LFUCache, and (d) RandomGraph, comparing FlexTM-CST, Central-Serial, and Central-Parallel.

We present results (see Figure 5) for HashTable, Vacation-Low, LFUCache, and RandomGraph (we omit Delaunay, since its transaction phases have negligible impact on overall throughput, and RBTree, since it shows a pattern similar to Vacation). We enumerate the general trends below:

• Inherently parallel benchmarks (HashTable and Vacation-Low) suffer from serialization of commits in Central-Serial, as transactions wait for predecessors in the queue. Central-Parallel removes the serialization overhead, but still suffers from commit arbitration latency at all thread levels.

• In benchmarks with many conflicts (e.g., LFUCache and RandomGraph) that don't inherently scale, Central's conflict management strategy avoids performance degradation: the transaction being serviced by the arbiter always commits successfully, ensuring progress and livelock freedom. The current distributed protocol admits the possibility of livelock. In practice, however, the CSTs streamline the commit process, narrow the vulnerability window (to essentially the inter-processor message latency), and eliminate the problem as effectively as Central does.

At low conflict levels, a CST-based commit requires mostly local operations, and its performance should be comparable to an ideal Central-Parallel (i.e., one with zero message and arbitration latency). At high conflict levels, Central's penalties are small compared to the overhead of aborts and the workload's inherent serialization. Finally, the influence of commit latency on performance depends on transaction latency: reducing commit latency helps Central-Parallel approach FlexTM's throughput in HashTable, but has negligible impact on RandomGraph's throughput.
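In contrast to the arbiter, a CST-based commit arbitrates only with the threads recorded in its CSTs, mostly through local operations. The sketch below is our software-level approximation of that idea, not FlexTM's actual commit sequence (which is defined earlier in the paper): the CST is shown as a simple bitmap, the TSW as an atomic status word, and all names (tx_t, cst_commit, and so on) are invented.

/* Illustrative, software-level view of a CST-driven lazy commit.  The real
 * mechanism uses hardware CSTs and ALoaded transaction status words (TSWs);
 * the fields and functions here are ours.  Requires C11 atomics; the ctz
 * builtin is GCC/Clang-specific. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NTHREADS 16

typedef enum { TX_ACTIVE, TX_COMMITTED, TX_ABORTED } tx_state_t;

typedef struct {
    _Atomic tx_state_t tsw;   /* transaction status word                       */
    uint32_t           cst;   /* conflict summary: bit i set if thread i holds
                                 a so-far tolerated conflict with us           */
} tx_t;

extern tx_t txs[NTHREADS];

/* Lazy commit: arbitrate only with the threads named in the CST, by trying to
 * flip each enemy's TSW from ACTIVE to ABORTED, then commit locally.  There is
 * no global arbiter and no serialization of non-conflicting commits.          */
static bool cst_commit(int self) {
    uint32_t enemies = txs[self].cst;
    while (enemies) {
        int e = __builtin_ctz(enemies);          /* lowest-numbered enemy      */
        enemies &= enemies - 1;
        tx_state_t expect = TX_ACTIVE;
        /* If the enemy has already committed, we lose and must abort.         */
        if (!atomic_compare_exchange_strong(&txs[e].tsw, &expect, TX_ABORTED)
            && expect == TX_COMMITTED)
            return false;
    }
    tx_state_t expect = TX_ACTIVE;               /* someone may have aborted us */
    return atomic_compare_exchange_strong(&txs[self].tsw, &expect, TX_COMMITTED);
}

Because each transaction negotiates only with its own enemies, non-conflicting transactions never wait on one another, which is what allows commits to proceed in parallel.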

7.5 Conflict Management Timing

Result 3a: When an application has the machine to itself, lazy conflict management exploits the available resources to maximize concurrency and encourage forward progress.

Result 3b: Under multiprogramming, lazy management allows doomed but still-executing transactions to occupy valuable resources; eager management may free up those resources for other useful work.

Figures 6(a)-(d) show the potential benefit of lazy conflict management in FlexTM—specifically, the ability to eliminate the performance pathologies observed in RBTree, Vacation-High, LFUCache, and RandomGraph. In applications with very few conflicts (e.g., HashTable and Vacation-Low), eager and lazy management yield almost identical results.

Figure 6: Eager vs. lazy conflict management in FlexTM. Plots (a) RBTree, (b) Vacation-High, (c) LFUCache, and (d) RandomGraph compare Eager and Lazy, with throughput normalized to FlexTM Eager at 1 thread. Plots (e) RandomGraph+Prime and (f) LFUCache+Prime mix a prime factoring program (P) with RandomGraph and LFUCache, respectively; X;X-Y denotes the throughput of X when X and Y run simultaneously with Eager (E) or Lazy (L) conflict management, normalized to a 1-thread run of X in isolation (curves: P;P-App(E), P;P-App(L), App(E);P-App(E), App(L);P-App(L)).

RBTree and Vacation-High embody similar tradeoffs in conflict management. At low contention levels, Eager and Lazy yield similar performance; beyond 4 threads, Lazy scales better than Eager. Lazy permits reader-writer concurrency, which pays off when the readers commit first. At 16 threads, Lazy's advantage is 16% in RBTree and 27% in Vacation-High.

LFUCache admits no concurrency, since transactions conflict on the same cache line with high probability. On conflict, the conflict manager performs back-off within the active transaction. At high levels of contention, Eager causes a cascade of futile stalls in doomed transactions. It also reduces concurrency by creating a large window of conflict vulnerability (from first access to commit time). In lazy mode, transactions abort enemies at commit time, by which point the likelihood of committing is very high. Hence, with Lazy, throughput improves by 38% at 16 threads relative to Eager, whose performance degrades.

RandomGraph transactions livelock with Eager at higher thread levels: transactions are very likely to conflict on a heavily contended object, giving rise to multi-transaction dueling aborts. With Lazy, once a transaction aborts an enemy at commit time, the remaining window of vulnerability is very small, the transaction is quite likely to commit, and performance remains flat with increasing thread count.
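The difference between the two policies comes down to when a detected conflict is acted on: Eager resolves it (backs off, stalls, or aborts someone) as soon as the hardware flags it, while Lazy merely records the enemy and defers resolution to commit time. A minimal illustration of that knob, with invented names, might look like the following sketch.

/* Minimal illustration of the eager/lazy policy knob discussed above.
 * All names are ours; resolve_conflict_now() is a hypothetical stand-in for
 * whatever back-off/abort decision the contention manager makes in eager
 * mode. */
#include <stdint.h>

typedef enum { POLICY_EAGER, POLICY_LAZY } cm_policy_t;

typedef struct {
    cm_policy_t policy;
    uint32_t    cst;          /* bitmap of threads we have conflicted with     */
} tx_ctx_t;

extern void resolve_conflict_now(tx_ctx_t *tx, int enemy);   /* hypothetical   */

/* Conceptually invoked when a coherence event reveals a conflict with
 * thread 'enemy'.                                                             */
static void on_conflict(tx_ctx_t *tx, int enemy) {
    tx->cst |= 1u << enemy;                  /* always remember the conflict   */
    if (tx->policy == POLICY_EAGER)
        resolve_conflict_now(tx, enemy);     /* act immediately                */
    /* POLICY_LAZY: do nothing now; arbitration happens at commit time, as in
     * the cst_commit() sketch above.                                          */
}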

In a second set of experiments (Figures 6(e) and 6(f)), we analyze the impact of conflict management on background applications. We experimented with both transactional and non-transactional workloads; for brevity we present only the latter here: a CPU-intensive application (prime factorization) sharing the machine with a non-scalable transactional workload (LFUCache or RandomGraph). We minimized extraneous overheads by controlling workload schedules at the user level: on transaction abort, the thread yields the processor to compute-intensive work.

We found that Prime scales better when running alongside eager-mode transactions (∼20% better than with lazy in RandomGraph), because Eager detects doomed transactions earlier and immediately yields the CPU to useful work. Lazy is optimistic and takes longer to detect impending aborts; it also delays the restart of commit-worthy transactions. Significantly, yielding to the background application did not hurt the throughput of the transactional application, since LFUCache and RandomGraph have little concurrency anyway. By effectively serializing transactions, yielding also avoids the livelock encountered by eager RandomGraph.
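The user-level scheduling discipline used here—retry loops that hand the processor to the compute-bound job after every abort—can be approximated as below. This is our reconstruction of the idea, not the authors' harness; tx_begin, tx_body, and tx_commit are placeholder names rather than FlexTM's actual API, and sched_yield() stands in for whatever yield mechanism the experiments used.

/* Our reconstruction of the user-level scheduling policy described above:
 * a transactional thread that aborts yields the CPU to the compute-bound
 * background job (Prime) before retrying. */
#include <sched.h>
#include <stdbool.h>

extern void tx_begin(void);    /* placeholder: start a transaction             */
extern void tx_body(void);     /* placeholder: the transaction's work          */
extern bool tx_commit(void);   /* placeholder: true on commit, false on abort  */

static void run_one_transaction(void) {
    for (;;) {
        tx_begin();
        tx_body();
        if (tx_commit())
            return;            /* committed                                    */
        /* Aborted (eagerly mid-transaction or lazily at commit time): give
         * the processor to the background application rather than spinning
         * and retrying immediately.                                           */
        sched_yield();
    }
}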
8 Conclusions and Future Work

FlexTM introduces Conflict Summary Tables; combines them with Bloom filter signatures, alert-on-update, and programmable data isolation; and virtualizes the combination across context switches, overflow, and page swaps. It (1) decouples conflict detection from conflict management and allows software to control detection time (i.e., eager or lazy); (2) supports a variety of commit protocols by tracking conflicts on a thread-by-thread rather than a location-by-location basis; (3) enables software to dictate policy without the overhead of separate metadata; and (4) permits TM components to be used for non-transactional purposes. To the best of our knowledge, FlexTM is the first hardware TM to admit an STM-like distributed commit protocol, allowing lazy transactions to arbitrate and commit in parallel (it also supports eager transactions).

On a variety of benchmarks, FlexTM outperformed both pure and hardware-accelerated STM systems. It imposed minimal overheads at low thread levels (single-thread latency comparable to CGL) and attained ∼5× more throughput than RSTM and TL2 at all thread levels. Experiments with commit schemes indicate that FlexTM's distributed protocol is free of the arbitration and serialization overheads of central hardware managers. Finally, our experiments confirm that the choice between eager and lazy conflict management is workload dependent, highlighting the value of policy flexibility.

In the future, we hope to enrich our semantics with hardware support for nesting, and to study the interplay of conflict management timing and policies. We have begun to experiment with non-TM uses of our decoupled hardware [30, TR version] [31]; we expect to extend this work by developing more general interfaces and exploring their applications.
9 Acknowledgments

We would like to thank the anonymous reviewers and our shepherd, Onur Mutlu, for suggestions and feedback that helped to improve this paper. Our thanks as well to Virtutech AB for their support of Simics, and to the Wisconsin Multifacet group for their support of GEMS.

References

[1] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie. Unbounded Transactional Memory. 11th Intl. Symp. on High Performance Computer Architecture, Feb. 2005.
[2] B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. Comm. of the ACM, 13(7), July 1970.
[3] C. Blundell, E. C. Lewis, and M. M. K. Martin. Subtleties of Transactional Memory Atomicity Semantics. IEEE Computer Architecture Letters, 5(2), Nov. 2006.
[4] J. Bobba, K. E. Moore, H. Volos, L. Yen, M. D. Hill, M. M. Swift, and D. A. Wood. Performance Pathologies in Hardware Transactional Memory. 34th Intl. Symp. on Computer Architecture, June 2007.
[5] L. Ceze, J. Tuck, C. Cascaval, and J. Torrellas. Bulk Disambiguation of Speculative Threads in Multiprocessors. 33rd Intl. Symp. on Computer Architecture, June 2006.
[6] L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas. BulkSC: Bulk Enforcement of Sequential Consistency. 34th Intl. Symp. on Computer Architecture, June 2007.
[7] H. Chafi, J. Casper, B. D. Carlstrom, A. McDonald, C. C. Minh, W. Baek, C. Kozyrakis, and K. Olukotun. A Scalable, Non-blocking Approach to Transactional Memory. 13th Intl. Symp. on High Performance Computer Architecture, Feb. 2007.
[8] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum. Hybrid Transactional Memory. 12th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 2006.
[9] D. Dice, O. Shalev, and N. Shavit. Transactional Locking II. 20th Intl. Symp. on Distributed Computing, Sept. 2006.
[10] K. Fraser and T. Harris. Concurrent Programming Without Locks. ACM Trans. on Computer Systems, 25(2), May 2007.
[11] J. Friedrich, B. McCredie, N. James, B. Huott, B. Curran, E. Fluhr, G. Mittal, E. Chan, Y. Chan, D. Plass, S. Chu, H. Le, L. Clark, J. Ripley, S. Taylor, J. Dilullo, and M. Lanzerotti. Design of the Power6 Microprocessor. Intl. Solid State Circuits Conf., Feb. 2007.
[12] L. Hammond, V. Wong, M. Chen, B. Hertzberg, B. Carlstrom, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional Memory Coherence and Consistency. 31st Intl. Symp. on Computer Architecture, June 2004.
[13] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software Transactional Memory for Dynamic-sized Data Structures. 22nd ACM Symp. on Principles of Distributed Computing, July 2003.
[14] M. Herlihy and J. E. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. 20th Intl. Symp. on Computer Architecture, San Diego, CA, May 1993.
[15] M. D. Hill, D. Hower, K. E. Moore, M. M. Swift, H. Volos, and D. A. Wood. A Case for Deconstructing Hardware Transactional Memory Systems. TR 1594, Dept. of Computer Sciences, Univ. of Wisconsin–Madison, June 2007.
[16] S. Kumar, M. Chu, C. J. Hughes, P. Kundu, and A. Nguyen. Hybrid Transactional Memory. 11th ACM Symp. on Principles and Practice of Parallel Programming, Mar. 2006.
[17] J. R. Larus and R. Rajwar. Transactional Memory. Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2007.
[18] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. 24th Intl. Symp. on Computer Architecture, June 1997.
[19] V. J. Marathe, W. N. Scherer III, and M. L. Scott. Adaptive Software Transactional Memory. 19th Intl. Symp. on Distributed Computing, Sept. 2005.
[20] V. J. Marathe, M. F. Spear, C. Heriot, A. Acharya, D. Eisenstat, W. N. Scherer III, and M. L. Scott. Lowering the Overhead of Software Transactional Memory. 1st ACM SIGPLAN Workshop on Transactional Computing, Ottawa, ON, Canada, June 2006.
[21] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. ACM SIGARCH Computer Architecture News, Sept. 2005.
[22] C. C. Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper, C. Kozyrakis, and K. Olukotun. An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees. 34th Intl. Symp. on Computer Architecture, June 2007.
[23] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood. LogTM: Log-based Transactional Memory. 12th Intl. Symp. on High Performance Computer Architecture, Feb. 2006.
[24] R. Rajwar, M. Herlihy, and K. Lai. Virtualizing Transactional Memory. 32nd Intl. Symp. on Computer Architecture, June 2005.
[25] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. C. Minh, and B. Hertzberg. McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime. 11th ACM Symp. on Principles and Practice of Parallel Programming, Mar. 2006.
[26] N. Sakran, M. Yuffe, M. Mehalel, J. Doweck, E. Knoll, and A. Kovacs. The Implementation of the 65nm Dual-Core 64b Merom Processor. Intl. Solid State Circuits Conf., Feb. 2007.
[27] D. Sanchez, L. Yen, M. D. Hill, and K. Sankaralingam. Implementing Signatures for Transactional Memory. 40th Intl. Symp. on Microarchitecture, Dec. 2007.
[28] W. N. Scherer III and M. L. Scott. Advanced Contention Management for Dynamic Software Transactional Memory. 24th ACM Symp. on Principles of Distributed Computing, July 2005.
[29] M. L. Scott, M. F. Spear, L. Dalessandro, and V. J. Marathe. Delaunay Triangulation with Transactions and Barriers. IEEE Intl. Symp. on Workload Characterization, Sept. 2007.
[30] A. Shriraman, M. F. Spear, H. Hossain, S. Dwarkadas, and M. L. Scott. An Integrated Hardware-Software Approach to Flexible Transactional Memory. 34th Intl. Symp. on Computer Architecture, June 2007. TR 910, Dept. of Computer Science, Univ. of Rochester, Dec. 2006.
[31] A. Shriraman, S. Dwarkadas, and M. L. Scott. Flexible Decoupled Transactional Memory Support. TR 925, Dept. of Computer Science, Univ. of Rochester, Nov. 2007.
[32] M. F. Spear, V. J. Marathe, W. N. Scherer III, and M. L. Scott. Conflict Detection and Validation Strategies for Software Transactional Memory. 20th Intl. Symp. on Distributed Computing, Sept. 2006.
[33] M. F. Spear, A. Shriraman, H. Hossain, S. Dwarkadas, and M. L. Scott. Alert-on-Update: A Communication Aid for Multiprocessors (poster paper). 12th ACM Symp. on Principles and Practice of Parallel Programming, Mar. 2007.
[34] M. F. Spear, V. J. Marathe, L. Dalessandro, and M. L. Scott. Privatization Techniques for Software Transactional Memory. TR 915, Dept. of Computer Science, Univ. of Rochester, Feb. 2007.
[35] Sun Microsystems Inc. OpenSPARC T2 Core Microarchitecture Specification. July 2005.
[36] L. Yen, J. Bobba, M. R. Marty, K. E. Moore, H. Volos, M. D. Hill, M. M. Swift, and D. A. Wood. LogTM-SE: Decoupling Hardware Transactional Memory from Caches. 13th Intl. Symp. on High Performance Computer Architecture, Feb. 2007.
[37] C. Zilles and L. Baugh. Extending Hardware Transactional Memory to Support Non-Busy Waiting and Non-Transactional Actions. 1st ACM SIGPLAN Workshop on Transactional Computing, June 2006.