Making Nested Parallel Transactions Practical Using Lightweight Hardware Support

Woongki Baek, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun
Computer Systems Laboratory
Stanford University
{wkbaek,nbronson,kozyraki,kunle}@stanford.edu

ABSTRACT

Transactional Memory (TM) simplifies parallel programming by supporting parallel tasks that execute in an atomic and isolated way. To achieve the best possible performance, TM must support the nested parallelism available in real-world applications and supported by popular programming models. A few recent papers have proposed support for nested parallelism in software TM (STM) and hardware TM (HTM). However, the proposed designs are still impractical, as they either introduce excessive runtime overheads or require complex hardware structures.

This paper presents filter-accelerated, nested TM (FaNTM). We extend a hybrid TM based on hardware signatures to provide practical support for nested parallel transactions. In the FaNTM design, hardware filters provide continuous and nesting-aware conflict detection, which effectively eliminates the excessive overheads of software nested transactions. In contrast to a full HTM approach, FaNTM simplifies hardware by decoupling nested parallel transactions from caches using hardware filters. We also describe subtle correctness and liveness issues that do not exist in the non-nested baseline TM.

We quantify the performance of FaNTM using STAMP applications and microbenchmarks that use concurrent data structures. First, we demonstrate that the runtime overhead of FaNTM is small (2.3% on average) when applications use only single-level parallelism. Second, we show that the incremental performance overhead of FaNTM is reasonable when the available parallelism is used in deeper nesting levels. We also demonstrate that nested parallel transactions on FaNTM run significantly faster (e.g., 12.4×) than those on a nested STM. Finally, we show how nested parallelism is used to improve the overall performance of a transactional microbenchmark.

Categories and Subject Descriptors

D.1.3 [Programming Techniques]: Concurrent Programming – parallel programming; C.1.4 [Processor Architectures]: Parallel Architectures

General Terms

Algorithms, Design, Performance

Keywords

Transactional Memory, Nested Parallelism, Parallel Programming

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICS'10, June 2–4, 2010, Tsukuba, Ibaraki, Japan.
Copyright 2010 ACM 978-1-4503-0018-6/10/06 ...$10.00.

1. INTRODUCTION

Transactional Memory (TM) [11] has been proposed as a promising solution to simplify parallel programming. With TM, programmers can simply declare parallel tasks as transactions that appear to execute in an atomic and isolated way. TM manages all concurrency control among concurrent transactions. A large number of TM implementations have been proposed based on hardware [9, 13], software [8, 10, 17], and hybrid [6, 7, 16] techniques.

To date, most TM systems have assumed sequential execution of the code within transactions. However, real-world parallel applications often include nested parallelism in various forms including nested parallel loops, calls to parallel libraries, and recursive function calls [19]. To achieve the best possible performance with the increasing number of cores, it is critical to fully exploit the parallelism available at all levels. Several popular programming models that do not use transactions have already incorporated nested parallelism [1, 18]; TM should be extended to efficiently support the case of nested parallelism.

A few recent papers investigated the semantics of concurrent nesting and proposed prototype implementations in STM [2–4, 15, 21]. While compatible with existing multicore chips, most STM implementations already suffer from excessive runtime overheads of TM barriers even for single-level parallelism [6]. To make the problem worse, supporting nested parallelism solely in software may introduce additional performance overheads due to the use of complicated data structures [2, 4] or the use of an algorithm whose time complexity is proportional to the nesting depth [3]. For example, as shown in our performance evaluation, a single-threaded, transactional version of the red-black tree microbenchmark runs 6.2× slower with single-level transactions and 17.0× slower with nested transactions than a non-transactional, sequential version. Nested parallel transactions in STM will remain impractical unless these performance issues are successfully addressed.

A recent paper investigated how to support nested parallelism in HTM [20]. However, supporting nested parallelism solely in hardware may drastically increase hardware complexity, as it requires intrusive modifications to caches. For instance, apart from the additional transactional metadata bits in tags, the design proposed in [20] requires that caches are capable of maintaining multiple blocks with the same tag but different version IDs, and provide version-combining logic that merges speculative data from multiple ways. Given the current trend in which hardware companies are reluctant to introduce complicated hardware components to implement transactional functionality even for single-level parallelism, this hardware-only approach is unlikely to be adopted.

To address this problem, we propose filter-accelerated, nested transactional memory (FaNTM) that provides practical support for nested parallel transactions using hardware filters. FaNTM extends a baseline hybrid TM (SigTM) [6] to implement nesting-aware conflict detection and data versioning. Since hardware filters provide continuous, nesting-aware conflict detection, FaNTM effectively reduces the excessive runtime overheads of software nested transactions. In contrast to a full HTM approach, FaNTM simplifies hardware by decoupling nested transactions from caches. As a result, FaNTM makes nested parallel transactions practical in terms of both performance and implementation cost.

The specific contributions of this work are:

• We propose FaNTM, a hybrid TM system that supports nested parallel transactions with low overheads. FaNTM provides eager data versioning and conflict detection at cache-line granularity across nested parallel transactions.

• We describe subtle correctness and liveness issues such as a dirty-read problem that do not exist in the non-nested baseline TM. We also propose solutions to address the problems.

• We quantify the performance of FaNTM across multiple use scenarios. First, we demonstrate that the runtime overhead of FaNTM is small when applications use only single-level parallelism. Specifically, FaNTM is slower than the baseline hybrid TM by 2.3% on average when running STAMP applications. Second, we show that the incremental overhead of FaNTM for deeper nesting is reasonable. We also show that nested transactions on FaNTM run significantly faster (e.g., 12.4×) than those on a nested STM. Finally, we demonstrate how FaNTM improves the performance of a transactional microbenchmark using nested parallelism.

The rest of the paper is organized as follows. Section 2 reviews the semantics of concurrent nesting and TM systems. Section 3 presents FaNTM. Section 4 discusses subtle correctness and liveness issues. Section 5 quantifies the performance of FaNTM. Sec-

Field      Description
TID        T's TID
FV         A bit vector that encodes family(T). If a bit is set, the corresponding transaction belongs to family(T)
CTID       The TID of the transaction that conflicted with T
RSig       Read signature
WSig       Write signature
abt        If set, T has a pending abort.
act        If set, this TMB is the active TMB.
nackable   If set, the nackable bit in outgoing memory requests is set.

Table 1: State information stored in each TMB. T denotes the transaction that is mapped on the TMB.

• If T writes to l, it is a conflict if there exists T′ such that T′ ∈ readers(l) ∪ writers(l), T′ ≠ T and T′ ∉ ancestors(T).

If a committing transaction T is not a top-level transaction, its read- and write-sets are merged to its parent. Otherwise (i.e., top-level), the values written by T become visible to other transactions. If any transaction T aborts, all the changes made by T are discarded and previous state is restored [14].

2.2 NesTM

We use NesTM [3] as a proxy for a timestamp-based STM with support for concurrent nesting. While it is an open research issue to formally check the correctness and liveness guarantees of timestamp-based nested STMs, we use NesTM to investigate performance differences between software and hybrid nested TMs. We only provide a brief description and refer to [3] for additional information on NesTM.

NesTM [3] extends an eager variant of TL2 [8] to support concurrent nesting. NesTM uses a global version clock to establish serializability. Each memory word is associated with a version-owner lock that simultaneously encodes the version and owner information. Transactional metadata and barriers are extended to implement nesting-aware conflict detection and data versioning.

Since all the nesting-aware transactional functionality is solely implemented in software, NesTM introduces substantial runtime overheads to nested transactions.
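
The write-conflict rule and the commit behavior stated in the semantics discussion above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the FaNTM hardware mechanism or the NesTM implementation: the `Txn` class and the `write_conflicts` and `commit` helpers are hypothetical names introduced for illustration, and read- and write-sets are modeled as plain Python sets of location names rather than signatures.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(eq=False)  # identity-based equality: each object is a distinct transaction
class Txn:
    """Hypothetical model of one transaction in the nesting tree."""
    name: str
    parent: Optional["Txn"] = None
    read_set: Set[str] = field(default_factory=set)
    write_set: Set[str] = field(default_factory=set)

    def ancestors(self) -> List["Txn"]:
        """Proper ancestors of this transaction, nearest first."""
        out, node = [], self.parent
        while node is not None:
            out.append(node)
            node = node.parent
        return out


def write_conflicts(t: Txn, location: str, active: List[Txn]) -> List[Txn]:
    """Return every T' that conflicts with a write by t to location.

    Follows the rule in the text: T' has read or written the location,
    T' is not t itself, and T' is not an ancestor of t.
    """
    anc = t.ancestors()
    return [
        other for other in active
        if other is not t
        and other not in anc
        and (location in other.read_set or location in other.write_set)
    ]


def commit(t: Txn) -> bool:
    """Commit t; return True only if its writes become globally visible."""
    if t.parent is not None:
        # Nested commit: merge read- and write-sets into the parent.
        t.parent.read_set |= t.read_set
        t.parent.write_set |= t.write_set
        return False
    # Top-level commit: writes become visible to other transactions.
    return True
```

In this model, a child writing to a location that only its parent has read reports no conflict, while a sibling that has read the same location is reported. Once the sibling commits, its sets are folded into the shared parent, so the conflict disappears.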
