SATURATION IN LOCK-BASED CONCURRENT

DATA STRUCTURES

by

Kenneth Joseph Platz

APPROVED BY SUPERVISORY COMMITTEE:

S. Venkatesan, Co-Chair

Neeraj Mittal, Co-Chair

Ivor Page

Cong Liu

Copyright © 2017

Kenneth Joseph Platz

All rights reserved

This dissertation is dedicated to my wife and my parents. They have always believed in me even when I did not.

SATURATION IN LOCK-BASED CONCURRENT

DATA STRUCTURES

by

KENNETH JOSEPH PLATZ, BS, MS

DISSERTATION

Presented to the Faculty of

The University of Texas at Dallas

in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY IN

COMPUTER SCIENCE

THE UNIVERSITY OF TEXAS AT DALLAS

December 2017

ACKNOWLEDGMENTS

I would like to first of all thank my wife Tracy for having the patience and fortitude to stand with me through this journey. Without her support, I could have never started this path much less completed it. Second, I would like to thank my supervising professors, Drs. “Venky” Venkatesan and Neeraj Mittal. They both provided me with frequent guidance and support throughout this entire journey.

Finally, I would like to thank the rest of my dissertation committee, Drs. Ivor Page and Cong Liu. Dr. Page especially asked many pointed (and difficult) questions during my proposal defense, and that helped improve the quality of this work.

November 2017

SATURATION IN LOCK-BASED CONCURRENT

DATA STRUCTURES

Kenneth Joseph Platz, PhD

The University of Texas at Dallas, 2017

Supervising Professors: S. Venkatesan, Co-Chair, and Neeraj Mittal, Co-Chair

For over three decades, computer scientists enjoyed a “free lunch” inasmuch as they could depend on processor speeds doubling every three years. This all came to an end in the mid-2000’s, when manufacturers ceased increasing processor speeds and instead focused on designing processors with multiple independent execution units on each chip. The demands for ever increasing performance continue to grow in this era of multicore and manycore processors. One way to satisfy this demand is to continue developing efficient data structures which permit multiple concurrent readers and writers while guaranteeing correct behavior.

Concurrent data structures synchronize via either locks or atomic read-modify-write instructions (such as Compare-and-Swap). Lock-based data structures are typically less challenging to design, but lock-free data structures can provide stronger progress guarantees.

We first develop two variants of existing lock-based concurrent data structures, a linked list and a skiplist. We demonstrate how we can unroll these data structures to support multiple keys per node. This substantially improves the performance of these data structures when compared to other similar data structures. We next demonstrate how lock-based data structures can saturate, or plateau in performance, at sufficiently high thread counts, dependent upon the percentage of write operations applied to that data structure. We then

discuss how we can apply a new technique involving group mutual exclusion to provide a lock-based data structure which is resilient to saturation. We then demonstrate how this technique can be applied to our implementations of linked lists and skiplists to provide scalable performance to 250 threads and beyond.

Our implementations provide excellent throughput for a wide variety of workloads, outperforming many similar lock-based and lock-free data structures. We further discuss how these techniques might apply to other data structures and provide several avenues for future research.

TABLE OF CONTENTS

ACKNOWLEDGMENTS ...... v
ABSTRACT ...... vi
LIST OF FIGURES ...... xi
LIST OF TABLES ...... xii
CHAPTER 1 INTRODUCTION ...... 1
1.1 Our Contributions ...... 3
1.2 Dissertation Roadmap ...... 3
CHAPTER 2 SYSTEM MODEL ...... 5
2.1 Shared Memory System Architecture ...... 5
2.2 Synchronization Primitives ...... 6
2.3 Correctness Conditions ...... 7
2.3.1 Linearizability ...... 7
2.3.2 Deadlock-Freedom ...... 9
CHAPTER 3 BACKGROUND AND RELATED WORK ...... 10
3.1 Synchronization in Concurrent Data Structures ...... 10
3.1.1 Blocking Techniques ...... 10
3.1.2 Non-Blocking Techniques ...... 11
3.2 Concurrent Linked Lists ...... 12
3.3 Concurrent Skiplists ...... 14
3.4 Group Mutual Exclusion ...... 16
CHAPTER 4 A CONCURRENT UNROLLED LINKED LIST WITH LAZY SYNCHRONIZATION ...... 20
4.1 Algorithm Overview ...... 20
4.2 Algorithm Detail ...... 22
4.3 Correctness Proof ...... 26
4.4 Experimental Evaluation ...... 27
4.4.1 Experiment Setup ...... 27
4.4.2 Experimental Results ...... 29
4.4.3 Expansion of Key and Data Sizes ...... 36
4.5 Introducing a Per-Thread Shortcut Cache ...... 37
4.5.1 Overview of Shortcut Cache ...... 38
4.5.2 Detail of Shortcut Cache ...... 40
4.5.3 Evaluation of Shortcut Cache ...... 42
4.6 Conclusions and Future Work ...... 44
CHAPTER 5 UNROLLING THE OPTIMISTIC SKIPLIST ...... 46
5.1 Introduction ...... 46
5.2 Algorithm Overview ...... 47
5.3 Detail of an Unrolled Skiplist ...... 49
5.3.1 Scan ...... 49
5.3.2 Lookup ...... 50
5.3.3 Insert ...... 51
5.3.4 Remove ...... 53
5.4 Correctness Proof ...... 57
5.5 Experiment Setup ...... 61
5.5.1 Concurrent Implementations ...... 62
5.5.2 Simulation Parameters ...... 62
5.5.3 Test System ...... 64
5.5.4 Experimental Methodology and Measurements ...... 64
5.6 Experimental Results ...... 64
5.6.1 Results for 1 million keys ...... 65
5.6.2 Results for 10 million keys ...... 68
5.6.3 Discussion ...... 70
5.7 Conclusions and Future Work ...... 72
CHAPTER 6 THE SATURATION PROBLEM ...... 73
6.1 Introduction ...... 73
6.2 Demonstrating Saturation on Manycore Systems ...... 73
6.3 Saturation in Unrolled Linked Lists ...... 74
6.4 Saturation in Unrolled Skiplists ...... 76
6.5 Further Exploration of Saturation ...... 78
CHAPTER 7 INCREASING CONCURRENCY WITH GROUP MUTUAL EXCLUSION ...... 79
7.1 About Intra-Node Concurrency ...... 79
7.2 Evaluation of Group Mutual Exclusion Algorithms ...... 81
7.2.1 Survey of Potential GME Algorithms ...... 82
7.2.2 Selection of GME Algorithm ...... 85
7.3 Introducing Intra-Node Concurrency to the Unrolled Linked List ...... 85
7.3.1 Algorithm Overview ...... 85
7.3.2 Algorithm Detail ...... 86
7.3.3 Experimental Evaluation ...... 90
7.3.4 Conclusions ...... 92
7.4 Introducing Intra-Node Concurrency to the Unrolled Skiplist ...... 93
7.4.1 Algorithm Detail ...... 93
7.4.2 Experiment Setup ...... 95
7.4.3 Intel Xeon System ...... 96
7.4.4 Intel Xeon Phi System ...... 103
7.5 An In-Depth Evaluation of the GME-enabled Skiplist ...... 110
7.6 Analysis and Conclusions ...... 113
CHAPTER 8 CONCLUSION ...... 114
REFERENCES ...... 116
BIOGRAPHICAL SKETCH ...... 124
CURRICULUM VITAE

LIST OF FIGURES

2.1 Example History of Concurrent Memory Location ...... 8
3.1 Layout of a skiplist ...... 14
4.1 Layout of the unrolled linked list ...... 21
4.2 Experimental Results on System A in Operations per Microsecond ...... 30
4.3 Experimental Results on System B in Operations per Microsecond ...... 31
4.4 Effect of Compiler Optimizations on System A ...... 34
4.5 Effect of Compiler Optimizations on System B ...... 35
4.6 Impact of Node Size on Throughput in Operations per Microsecond ...... 36
4.7 Impact of Shortcut Cache ...... 43
5.1 Layout of the unrolled skiplist ...... 48
5.2 Experimental Results on Intel Xeon System for one million keys ...... 65
5.3 Results on Intel Xeon System for 10 million keys ...... 68
6.1 Performance of Unrolled Linked Lists on Intel Xeon Phi System ...... 74
6.2 Performance of Unrolled Skiplist on Intel Xeon Phi System ...... 76
6.3 Performance of Unrolled Skiplist on Intel Xeon Phi System with 10 million keys ...... 77
7.1 Performance of Unrolled Linked Lists on Intel Xeon Phi System ...... 91
7.2 Results on Intel Xeon System for uniform distribution and one million keys. Throughput is reported in operations per microsecond ...... 97
7.3 Results on Intel Xeon System for Zipfian distribution and one million keys. Throughput is reported in completed operations per microsecond ...... 99
7.4 Results on Intel Xeon System for 10 million keys and uniform distribution. Throughput is reported in operations per microsecond ...... 100
7.5 Results on Intel Xeon System for 10 million keys and Zipfian distribution. Throughput is reported in operations per microsecond ...... 102
7.6 Skiplist performance on Intel Xeon Phi System with K = 32 and one million keys. Throughput is reported in operations per microsecond ...... 103
7.7 Skiplist performance on Intel Xeon Phi System with K = 192 and one million keys. Throughput is reported in operations per microsecond ...... 104
7.8 Skiplist performance on Intel Xeon Phi System with K = 32 and ten million keys. Throughput is reported in operations per microsecond ...... 107
7.9 Skiplist performance on Intel Xeon Phi System with K = 192 and ten million keys. Throughput is reported in operations per microsecond ...... 108

LIST OF TABLES

3.1 Summary of GME algorithms. All algorithms satisfy P1, P2, and P3 ...... 19
4.1 Throughput on System A with respect to the Lazy algorithm at 24 threads ...... 31
4.2 Throughput on System B with respect to the Lazy algorithm at 64 threads ...... 32
5.1 Summary of results with 1 million keys and uniform distribution ...... 66
5.2 Summary of results with 1 million keys and Zipfian distribution ...... 67
5.3 Summary of results with ten million keys and uniform distribution ...... 69
5.4 Summary of results with ten million keys and Zipfian distribution ...... 70
7.1 Summary of results on Intel Xeon System with 1 million keys and uniform distribution ...... 98
7.2 Summary of results on Intel Xeon System with 1 million keys and Zipfian distribution ...... 98
7.3 Summary of results on Intel Xeon System with ten million keys and uniform distribution ...... 101
7.4 Summary of results for Intel Xeon System with ten million keys and Zipfian distribution ...... 101
7.5 Summary of results on Intel Xeon Phi System with 1 million keys and uniform distribution ...... 105
7.6 Summary of results on Intel Xeon Phi System with 1 million keys and Zipfian distribution ...... 106
7.7 Summary of results on Intel Xeon Phi System with 10 million keys and uniform distribution ...... 109
7.8 Summary of results on Intel Xeon Phi System with 10 million keys and Zipfian distribution ...... 110
7.9 Operation timings for uniform distribution on Intel Xeon System with 1 million keys ...... 111
7.10 Operation timings for uniform distribution on Intel Xeon System with 10 million keys ...... 112

CHAPTER 1

INTRODUCTION

The past decade has seen considerable changes in the processor manufacturing industry. Up until the mid-2000's, most processors consisted of a single, fast execution unit. Due to the limits of Dennard scaling [34], manufacturers can no longer rely on ever-faster clock speeds and have instead been placing two or more independent cores on each chip. In this era of multicore and manycore system architectures, designing efficient data structures that permit concurrent read and write operations while maintaining correct behavior becomes increasingly important. In such a data structure, contention between different processes must be managed in such a way that all operations complete correctly and leave the data structure in a valid state.

Concurrency is most often managed through locks. A process holding a lock is guaranteed exclusive access to the data structure (coarse-grained locking) or a portion of it (fine-grained locking) until it releases the lock. This makes it easier to perform (potentially conflicting) updates to the data structure because they are implemented in a mutually exclusive manner and hence serialized. Lock-based algorithms for many important data structures have been developed, including linked lists [44, 50], queues [69, 50], hash tables [54, 30, 62, 51], and search trees [12, 61, 18, 6, 22].

One of the main drawbacks of a lock-based algorithm is that locks are blocking; while a process holds a lock, no other process can modify the portion protected by the lock. If the locking process stalls, the lock may not be released for a long time. As a result, a lock-based algorithm may be vulnerable to problems such as deadlock, priority inversion, and convoying [50].

Non-blocking algorithms avoid the pitfalls of locks by using special (hardware-supported) read-modify-write instructions such as load-link/store-conditional (LL/SC) [46] and compare-and-swap (CAS) [50]. These algorithms can provide stronger progress guarantees since a stalled process cannot block other processes. Non-blocking algorithms have been developed

for many data structures such as queues, stacks, linked lists, hash tables, and search trees (e.g., [46, 41, 67, 48, 33, 87, 13, 36, 29, 50, 77, 16, 53, 72, 73, 81]).

In spite of its drawbacks, the lock-based approach is commonly used for designing a

concurrent data structure because it is much easier to design, analyze, implement and debug

than a non-blocking algorithm. Furthermore, in 2016 David and Guerraoui observed that

lock-based data structures can be practically wait-free [24], the strongest progress guarantee that can be provided by a non-blocking algorithm. Wait-freedom guarantees that every process completes its operation in a finite number of steps.

One well known lock-based approach involves lazy synchronization [44]. In this approach,

a process first locates its window of interest, or subset of nodes that may be impacted by its

operation. It then (i) acquires locks on all the nodes in its window, (ii) validates the window

to make sure that it is “correct”, (iii) manipulates the window as needed, and (iv) releases

all the locks. A node is removed from the data structure in two steps: it is first marked

for removal (logical deletion) and then removed from the data structure (physical deletion).

Search operations typically scan the data structure without acquiring any locks; this permits

some well-designed lock-based algorithms to provide higher read throughput than their lock-

free counterparts.

In this work we first consider linked lists, one of the most ubiquitous data structures in

computer science. A linked list consists of a sequence of nodes, each of which consists of two

(or more) fields: a key, a next pointer, and optional satellite data; furthermore these nodes are typically stored in ascending order of keys (but this is not required). Linked lists can be useful for both storing small data sets and also as a “black box” subroutine to be used by more complex data structures (such as graphs and hash tables [21]).

Next we shall consider the skiplist, a fundamental data structure for storing and managing ordered data [79]. It is a probabilistic data structure that supports lookup, insert and remove operations whose expected running time is logarithmic in list size. Unlike balanced search

trees, skiplists do not require expensive balancing actions to achieve logarithmic expected running time. Several algorithms for a concurrent skiplist have been developed using both blocking and non-blocking approaches [35, 33, 47, 50, 78]. The algorithm presented in [47] is based on the aforementioned lazy synchronization approach.

1.1 Our Contributions

In this work, we explore two main avenues of improving the performance of lock-based concurrent data structures. Both of these techniques utilize the aforementioned lazy synchronization strategy. The first technique, unrolling, involves storing multiple keys in each node of a data structure. We will demonstrate how unrolling a data structure can provide significant gains in throughput and performance using lock-based linked lists and skiplists as case studies. The second technique involves using group mutual exclusion (GME) to permit multiple "compatible operations" (which we will define later) to operate concurrently within the same segment of a data structure. This provides much of the benefit of a lock-free data structure without the additional overhead frequently associated with lock-free algorithms. We describe a set of conditions under which our group mutual exclusion techniques can apply, and we provide GME-based implementations for unrolled linked lists and unrolled skiplists.

1.2 Dissertation Roadmap

The rest of the text is organized as follows. Chapter 2 describes the system model upon which we base our algorithms. In Chapter 3 we discuss prior related work on concurrent data structures, linked lists, skiplists, unrolled data structures, and group mutual exclusion. In Chapter 4 we describe our unrolled linked list, provide the algorithms for its implementation, and analyze its performance against other linked lists [76]. In Chapter 5 we extend our ideas from the previous chapter to skiplists. In Chapter 6 we discuss how data structures can

saturate, or peak in performance, as the number of threads gets extremely large. In Chapter 7 we extend our previous algorithms to increase write throughput with group mutual exclusion. We then conclude this dissertation in Chapter 8 and present opportunities for future work.

CHAPTER 2

SYSTEM MODEL

Before presenting our algorithms and methods, we describe a model of the system upon which they execute. In this chapter, we point out several key aspects of the system architecture and describe the specific low-level software components upon which we construct our solutions.

2.1 Shared Memory System Architecture

We assume an asynchronous shared memory system where a finite set of processes running on a finite set of independent execution units communicate by applying read, write and synchronization operations to shared memory locations. We assume no bounds on relative speed between processes. Processes may run simultaneously or be arbitrarily delayed. We furthermore assume no relationship between the number of processes and the number of cores.

We also assume that our system maintains a memory hierarchy which involves a large, slow main memory and one or more layers of smaller, faster cache memories. These cache memories may be shared by one or more processors or they may be unique to each processor.

At the time of this writing, multicore processors generally contain two or three cache levels; the lowest level cache (that is, farthest away from the processor) is typically shared amongst multiple cores, while higher levels are dedicated to a single core.

In such a cache-based multiprocessor system, some method must be used to ensure coherence amongst the various separate caches. To this end, many cache coherence protocols have been developed [75]. These protocols ensure that the various caches maintain up-to-date information as the several cores modify information at different levels in the memory hierarchy.

While many of the details of these protocols differ, they generally operate under the same single-writer, multiple-reader (SWMR) invariant [65]. This invariant requires that at any point in time either one core may hold a block of data in its cache for writing, or multiple cores may hold the block for reading, but not both. When a processor writes to a block of data in its cache, the cache coherence protocol must take action in order to maintain the SWMR invariant. This may force one or more cores to evict data blocks from their own caches and (possibly) re-read them from lower level memories.

2.2 Synchronization Primitives

The main primitive our algorithms use to manage contention is an exclusive lock that satisfies the following correctness properties:

• Mutual Exclusion: At most one process can hold the lock at any time.

• Deadlock-Freedom: If the lock is free and one or more processes attempt to acquire the lock, then some process is eventually able to acquire the lock.

Furthermore, some locks may be reentrant, that is, if a thread owns a given lock, it may in turn acquire that lock multiple times. In this event the locking thread must also release the lock an equal number of times. We shall utilize reentrant locks in Chapter 5. For the sake of brevity we assume the C++11 locking conventions in our pseudocode [55]. As such, acquiring a lock involves creating an object of local scope, while destroying that object frees the lock. Other programming environments may utilize different locking constructs, either system-provided or user-specified.

In addition to locks, our algorithm also uses two special atomic hardware instructions, namely compare-and-swap (CAS) and fetch-and-add (FAA), to manage contention. A compare-and-swap instruction takes three arguments: address, old, and new. It compares the contents of a memory location (address) to an expected value (old) and, only if they are the same, updates the contents to the new value (new). It returns true if the contents were modified and false otherwise. Some variants of CAS may also modify the old value to reflect the value "seen" during a failed update. While this side effect may be useful, we will not assume these semantics in our algorithms.

A fetch-and-add instruction takes two arguments: address and amount. It atomically

adds a given value (amount) to the contents of a memory location (address) and returns

the value of the prior contents of that location.
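To make these conventions concrete, the fragment below is a minimal C++11-style sketch (the variable names are illustrative only) of the three primitives as we use them: an exclusive lock held through an object of local scope, a CAS, and an FAA. The C++ CAS shown here happens to exhibit the "old value" side effect mentioned above, although our algorithms do not rely on it.

#include <atomic>
#include <mutex>

std::mutex       nodeLock;     // an exclusive lock protecting some shared state
std::atomic<int> counter{0};   // a shared word manipulated with CAS and FAA

void primitiveExamples() {
    {
        // Acquiring the lock creates an object of local scope; when the
        // object is destroyed at the end of the scope, the lock is released.
        std::lock_guard<std::mutex> guard(nodeLock);
        // ... critical section protected by nodeLock ...
    }

    // Compare-and-swap: update counter from 0 to 1 only if it still holds 0.
    // On failure, C++ writes the value actually "seen" back into expected.
    int expected = 0;
    bool swapped = counter.compare_exchange_strong(expected, 1);

    // Fetch-and-add: atomically add 5 and obtain the prior contents.
    int prior = counter.fetch_add(5);

    (void)swapped;
    (void)prior;
}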

2.3 Correctness Conditions

In order to demonstrate that an implementation of a sequential object is correct, one can

show that, given a known valid initial state, each method call will result in a valid ending

state. However, for concurrent objects multiple threads may be performing modification

operations simultaneously on the object. In fact, a sequence of concurrent operations may

easily be constructed such that the object is never truly in a final state.

This is why we introduce two correctness conditions: a safety property and a progress

property. The safety property guarantees that our object never enters an invalid state. A

progress property requires that some set of pending operations “eventually” completes. We

shall use linearizability [52] for the safety property and deadlock-freedom [50] for the liveness

property. Broadly speaking, linearizability requires that an operation should appear to take

effect at some instant during its execution. Deadlock-freedom requires that some process

with a pending operation is able to eventually complete its operation.

2.3.1 Linearizability

Of the many consistency models, we elect to use linearizability because it is compositional.

That is, one can build a linearizable object provided that all of the methods that access that object are linearizable. Likewise, we can compose a linearizable system with one or more linearizable objects.

Figure 2.1: Example History of Concurrent Memory Location (three threads concurrently issuing write(1), write(2), write(3), read(1), read(2), and read(3) operations)

In a sequential system, we can consider method calls as point-in-time events. When we move to a multithreaded system, we must consider each method call as an invocation event followed by a matching response event. As threads move forward in time, we may encounter events where two (or more) method calls are active at a given time point; we define these as concurrent method calls.

For a method call to be linearizable, we must be able to select a distinct linearization point at which this method appears to take effect. This linearization point must occur between the method’s invocation event and its corresponding response event. These linearization points allow us to impose a sequential ordering of the method calls. In a linearizable system, these linearization points create a legal history of the system; that is, this history adheres to the sequential specification of the system. In Figure 2.1, we display a sample linearizable history of a concurrent memory location.

In this figure, we have three independent threads operating on a single memory location.

We assume that reads and writes to this location are atomic, that is, they complete indivisibly in a single step.1 The bars delineate each method's invocation and associated response event, while the lines denote the associated interval. The dot within each interval indicates the (arbitrarily selected in this example) linearization point for each operation. As we see in the figure, each of the three threads performs both reads and writes on the memory location. The result of each read operation depends both on the linearization points for the read operation and on other concurrent write operations. If we select different, valid, linearization points for each operation, the results of the read operations potentially change. In order to claim that our data structures are linearizable, we will establish linearization points for each operation.

1This is a safe assumption when writing single machine words on modern hardware, but does not necessarily hold when writing multiple machine words.

2.3.2 Deadlock-Freedom

Deadlock-freedom requires that at any moment at least one thread is eligible to make progress. This eligible thread may or may not be actively progressing or may in fact be suspended by the operating system’s scheduler. However, this guarantee suffices to ensure our system progresses. Deadlocks can occur when a set of cyclical dependencies occur be- tween processes, such as thread A waiting on thread B to release a resource while thread B also waits on thread A to release a (different) resource. In practice these cycles may involve many more than two processes and resources, and as such may be difficult to detect. One common method, which we shall utilize in our algorithms, involves imposing an ordering upon resources and only acquiring resources sequentially according to that ordering [28]. Now that we have described the system upon which we shall be operating, we can move forward to describing prior relevant work in this field. This will help lay a foundation to describe our works and their significance.

CHAPTER 3

BACKGROUND AND RELATED WORK

In order to lay a foundation for understanding our contributions and place them into an appropriate perspective, we highlight several salient areas of prior work. These include an overview of synchronization in concurrent data structures, a survey of concurrent linked lists and skiplists, and an introduction to the group mutual exclusion problem.

3.1 Synchronization in Concurrent Data Structures

In order to provide correct implementations, concurrency must be managed through syn- chronization primitives. Concurrent data structures can be broken down into two major categories: blocking and non-blocking.

3.1.1 Blocking Techniques

Blocking data structures synchronize by utilizing locks. These techniques may be either coarse-grained or fine-grained. In a coarse-grained data structure, a single lock protects the entire data structure during access. This allows for a very simple implementation but provides no true concurrency. Fine-grained techniques utilize multiple locks within the data structure (typically one lock per node). These techniques involve acquiring one or more locks during the course of an operation. The performance of these algorithms varies greatly depending upon the number of locks required for an operation and the granularity of each lock, i.e., the fraction of the data structure protected by each lock.

One well-known fine-grained technique, called lazy synchronization, defers expensive synchronization operations until necessary. Several well-known data structures have been implemented using this technique, including linked lists [44], queues [69], priority queues [64], and skiplists [47]. One major benefit of lazy synchronization involves wait-free reads; in

10 other words, a thread can search the data structure without performing any synchronization

operations.

Lazy synchronization involves scanning a data structure for a window of nodes to operate upon and locking one or more nodes (depending on the algorithm). Once the initiating thread acquires the locks, it validates that the window is unchanged; another thread may modify the window while the initiating thread waits to acquire the locks. Finally, the initiating thread either inserts or removes the operand. In lazy synchronization, node removal involves a two-stage process. First the thread marks the node for removal, then it manipulates the pointer(s) to physically remove the node from the data structure. This allows other threads to readily determine whether a node is still present in the list.

3.1.2 Non-Blocking Techniques

Non-blocking algorithms utilize atomic Read-Modify-Write (RMW) instructions to provide synchronization. These algorithms typically operate in three phases. The prologue involves scanning the data structure for the correct segment to modify. In the injection step, a thread utilizes a single RMW operation to modify a segment of the data structure. This single operation must include enough information so that other threads may (potentially) complete this operation. Finally, the cleanup phase completes any additional steps of the operation. Any thread traversing this segment of the data structure may perform the cleanup phase.

Non-blocking algorithms can provide stronger progress guarantees, including lock-freedom and wait-freedom, than blocking algorithms. A lock-free algorithm guarantees that at any point in time at least one thread is actively making progress, while a wait-free algorithm guarantees that at any point in time every active thread is making progress.

These attractive progress guarantees do not come free, however. Many non-blocking algorithms require considerably more overhead than their lock-based counterparts. Furthermore, a non-blocking algorithm may be harder to design, implement, and debug than a similar blocking algorithm.

3.2 Concurrent Linked Lists

A linked list is one of the fundamental data structures of computer science. It implements the standard set operations: insert, remove, and lookup. A linked list is typically implemented as a sequence of nodes, each of which contains a key, possibly a data element, and a pointer to the next node in the sequence. Linked lists and their variants are frequently used to manipulate small data sets which require insertion and removal at arbitrary locations in the list (for example, they are used extensively within the Linux kernel [37]). Furthermore, many other data structures (such as graphs and hash tables) use linked lists as "black box" subroutines [21]. Linked lists have been well-studied from a concurrency perspective; several efficient lock-based algorithms exist. The simplest implementation consists of a single lock which protects all accesses to the list, but this does not allow for any true concurrency. Improvements have been seen with fine-grained locking, where each node contains its own lock. These fine-grained algorithms can scan the list for a node of interest and acquire the lock on that node (and possibly other nodes). Two algorithms that use this technique include an "optimistic" algorithm by Herlihy and Shavit [50] and a "lazy" algorithm by Heller, et al [44]. In the optimistic list, a thread scans the list for a window of two nodes to operate upon, locks those nodes, and then re-scans the list to ensure those nodes are still present in the list. The lazy list decouples the remove phase of the algorithm by first marking a node for removal followed by manipulating the pointer leading to that node. Another thread can then check this mark to determine if a node has been removed from the list. Several lock-free algorithms have also been proposed by Harris [41], Michael [68], and Valois [90]. These implementations operate in a very similar manner as the aforementioned

"lazy" algorithm, except instead of locks, they use atomic CAS operations. More recently

Timnat, et al [89] proposed a wait-free linked list that operates by first prepending its operation to the head of the list and then traversing the entire list to determine the final outcome of the operation. While this implementation is wait-free in that an operation appears to take effect immediately, the algorithm dictates that the initiating thread traverse the entire list to determine the operation’s eventual outcome.

Linked lists, while extremely useful, do suffer from several disadvantages. One major disadvantage to a linked list is that any operation, on average, traverses half the nodes in the list. Each step in this traversal dereferences that node's next pointer and accesses a memory location that may be far removed from the prior node [92]. This pointer-chasing

problem makes poor use of the memory hierarchy found in today’s systems. Specifically, this

violates the temporal and spatial locality assumptions upon which today’s memory hierarchies

rely [27].

Several attempts have been made to increase the efficiency of linked lists by combining

multiple keys into a single node. These “unrolled” lists, first described by Shao, et al [83],

improve performance in two ways. First, unrolling reduces the number of pointers which must

be followed to find an item. Second, this groups multiple successive elements in sequential

memory locations and better conforms to the principle of spatial locality [26, 75].

More recently Braginsky and Petrank developed a “chunked” lock-free linked list [15].

Their algorithm improves the locality of memory accesses by storing a sequential subset

of key/data pairs within a contiguous block of memory. As time elapses and elements are

inserted and removed from the list, their algorithm splits full chunks and combines sparsely

populated ones. An operation can quickly locate the appropriate chunk, and searches within

a chunk exhibit favorable spatial locality.

Figure 3.1: Layout of a skiplist (levels of linked lists over the keys −∞, 5, 7, 8, 11, 13, +∞)

3.3 Concurrent Skiplists

The skiplist is a probabilistic data structure that provides expected O(lg(n)) performance for insert, lookup, and remove operations [78, 79]. A skiplist consists of a hierarchy of "levels"

(Figure 3.1). The lowest level of the skiplist consists of a linked list containing every element in the set. Each higher level list consists of a subset of the list immediately below it. This inclusion invariant requires that every element present at level N must also be present at levels 0..(N − 1).

In order to locate a given element, a process starts at the highest list and locates the (half-open) window (pred, curr]; by convention curr indicates the "current" node and pred its predecessor

(if the element is not present at this level, curr indicates the first node greater than the element). The process stores this window for the current level, then drops down to the next-lowest list and, starting at pred, searches for the window associated with that level.

This continues until the process finds the (pred, curr] window at the lowest level of the list.

For example, a scan(11) of the skiplist in Figure 3.1 would return (pred = [8, 7, −∞, −∞], curr = [11, 11, 11, +∞]). Inserting an element in this skiplist involves randomly selecting a level, finding the array of (pred, curr] windows, and inserting the element into all of the appropriate levels starting from the bottom level and working upwards (thus preserving the inclusion invariant). Removing an element operates similarly, except the element is removed from the topmost list first, proceeding downwards to the bottom list.
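The level-by-level search described above can be sketched as follows. The node layout, the fixed four-level height, and the helper names are assumptions made only for this illustration; they mirror the small skiplist of Figure 3.1 rather than any particular implementation.

constexpr int MaxLevel = 4;          // the example skiplist of Figure 3.1 has four levels

struct SkipNode {
    long      key;                   // sentinel values stand in for −∞ (head) and +∞ (tail)
    SkipNode* next[MaxLevel];        // next[l] is this node's successor at level l
};

// For every level, record the window (pred, curr] around `key`, starting at the
// topmost level and reusing pred as the entry point one level down.
void scan(SkipNode* head, long key, SkipNode* preds[MaxLevel], SkipNode* currs[MaxLevel]) {
    SkipNode* pred = head;
    for (int level = MaxLevel - 1; level >= 0; --level) {
        SkipNode* curr = pred->next[level];
        while (curr->key < key) {    // stop at the first node whose key is >= key
            pred = curr;
            curr = curr->next[level];
        }
        preds[level] = pred;
        currs[level] = curr;
    }
}

With the keys of Figure 3.1, a search for 11 leaves preds = [8, 7, −∞, −∞] and currs = [11, 11, 11, +∞], matching the example above.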

Several concurrent skiplists have been developed. Fraser has described three different implementations, using software transactional memory (STM), multi-word compare-and-swap (MWCAS), and (single-word) compare-and-swap [35]. Fraser's first implementation utilizes a software transactional memory library which permits multiple (possibly non-contiguous) words to be read and written speculatively [49, 84]. Once all of the writes are staged, then the initiating thread can commit those writes such that they appear to other threads atomically. Many such libraries exist, but further discussion of STM is beyond the scope of this dissertation. Fraser's second implementation utilizes MWCAS, which permits a thread to read multiple (possibly disjoint) memory locations and then atomically update them to a set of new values. As of the time of this writing, there are no pure hardware implementations of MWCAS. In a practical sense, MWCAS can be constructed from single-word CAS instructions [42, 70]. Lev, et al developed an "optimistic" lock-based skiplist [47]. Their algorithm first scans the skiplist for the set of (pred, curr) windows. Inserting an element involves first determining a level for the new node and locking the pred nodes up to (and including) the level of the new node. The inserting thread then splices the new node into the skiplist from the bottom to top. Removing a node operates in much the same manner; once the thread locks the appropriate node, it removes the node from the skiplist starting from the topmost level. This same general sequence was utilized by Lev, et al to develop a lock-free skiplist [47], expanding upon Fraser's CAS-based skiplist. Recently, Crain, et al developed a "No Hot Spot" lock-free skiplist [23]. This skiplist differs from prior implementations in that it decouples the modification operations from maintaining the skiplist hierarchy. Specifically, all threads insert nodes at the lowest level, and a dedicated background thread (or threads) is tasked with maintaining the topology of the skiplist. Likewise, a removing thread simply marks a node for removal while the background thread in turn performs the physical removal and garbage collection of the affected node(s).

Avni, et al recently designed a "LeapList", which combines a skiplist, unrolling, and transactional memory [11]. This skiplist consists of nodes that can contain up to K (a constant parameter) immutable keys and utilizes software transactional memory for synchronization. This implementation provides very good lookup throughput at the cost of slow update operations.

3.4 Group Mutual Exclusion

Group mutual exclusion (GME) was first proposed by Joung [59] as a natural extension to the mutual exclusion problem. In a GME system, each request for the critical section also includes a session-id. The primary difference between the GME problem and the mutual exclusion problem is that different processes which request the critical section may coexist in the critical section, provided they request the same session-id. If every process requests a unique ID for its session-id, the GME problem reduces to the mutual exclusion problem. We can likewise construct a reader-writer lock by designating a single session-id that all readers share while each writer requests its own unique session-id (a construction we sketch below, after the criteria). Many algorithms have been proposed to implement group mutual exclusion in the shared memory model [14, 39, 40, 58, 59, 60, 88]. At a bare minimum, any correct solution to the group mutual exclusion problem must satisfy the following two criteria:

(P1) Mutual Exclusion: If two processes are in the critical section (CS) concurrently, then they must have requested the same session-id.

(P2’) Deadlock Freedom: If no process is in the CS and one or more processes request a session-id, then at least one will eventually succeed.1
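To illustrate the reader-writer reduction mentioned above, the sketch below assumes a hypothetical GME lock exposing enter(session) and exit(); this interface is an assumption for the example, not the API of any algorithm surveyed here. All readers request the same session-id and may therefore share the critical section, while each writer requests a session-id used by no other process.

struct GmeLock {
    // A real group mutual exclusion algorithm would implement these two
    // operations; empty stubs keep the sketch self-contained.
    void enter(int /*session*/) {}
    void exit() {}
};

constexpr int READ_SESSION = 0;      // shared by every reader

void readerSection(GmeLock& gme) {
    gme.enter(READ_SESSION);         // fellow processes: all concurrent readers
    // ... read the shared data ...
    gme.exit();
}

void writerSection(GmeLock& gme, int writerId) {
    gme.enter(1 + writerId);         // a session-id requested by no other process
    // ... write the shared data ...
    gme.exit();
}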

We can use several criteria to measure the efficiency of group mutual exclusion algorithms; some of these can be readily derived from the algorithm description, while others may be somewhat subjective and dependent upon application. The most readily calculable metric involves the space required to implement an algorithm. Other metrics may include the number of steps required for a thread to enter the CS, but this may be dependent upon the degree of contention; that is, the number of threads actively attempting to enter the critical section. Another, more subtle, measurement involves counting Remote Memory References (RMRs) [10]; this criterion measures the number of times a processor accesses a memory location not already in its (nearest) cache, thus factoring in possible adverse effects associated with cache coherence protocols.

1In other publications, P2 generally implies a stronger condition, which we will discuss later.

Processes participating in GME algorithms generally transition between five states. The

Non-Critical Section (NCS) involves operations where synchronization is not necessary between processes. The Doorway Section involves the processes acquiring a conceptual “ticket” to establish priorities amongst processes. The Waiting Room section involves a process waiting for permission to transition to the Critical Section (CS). Finally, the Exit Section involves a process returning to the NCS and (potentially) releasing one or more processes from the Waiting Room. Based upon this five-state model, we define fellow processes as those processes that request the same session-id. Likewise, conflicting processes are those requesting different session-ids.

Furthermore, processes in the NCS have no (active) session-id. We also say that process p1 doorway-precedes process p2 if p1 completes its Doorway Section first; that is, it received a higher-priority “ticket” than p2. GME algorithms that adhere to this five-state model can support additional, stronger correctness criteria. These criteria include:

(P2) Starvation Freedom: Any process that enters the Doorway Section will eventu- ally enter the CS.

(P3) Bounded Exit: Any process that enters the Exit Section returns to the NCS within a bounded number of its own steps.

17 (P4) Concurrent Entering: If there are no conflicting processes, any process that enters

the Doorway Section will enter the CS in a bounded number of its own steps.

(P4’) Strong Concurrent Entering: If p1 completes its Doorway Section before all conflicting processes, then it can enter the CS in a bounded number of its own steps.

(P5) First-Come-First-Served (FCFS): If process p1 doorway-precedes conflicting pro-

cess p2 then p2 does not enter the CS before p1.

(P5’) Relaxed FCFS: If p1 doorway-precedes conflicting process p2, then p2 shall not enter

the CS before p1 unless there exists a p3 such that (i) p2 and p3 are fellow processes

(ii) p3 overlaps p2’s attempt and (iii) p1 does not doorway-precede p3.

(P6) Pulling: If p1 is in the CS at time t and doorway-precedes all conflicting processes,

then any process p2 that is in the waiting room at time t may enter the CS in a bounded number of its own steps.

Several of these criteria (P5, P5’) introduce fairness requirements, while others attempt to introduce additional concurrency within the critical section (P5’, P6). Clearly these are both desirable properties. Unfortunately these are also conflicting requirements; in order to strictly enforce one, we must sacrifice the other.

Joung provided the first solution to the GME problem, providing guarantees P1, P2, and

P3 [59]. Joung’s solution implements a “round robin” technique; that is, each session-id is enumerated. Once the last process exits a given session, it examines the outstanding requests and selects the request with the next-lowest session-id (modulo s, where s is the

total number of session-ids).
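The round-robin choice can be sketched as follows; the bookkeeping (one pending flag per session-id) is an illustrative assumption rather than the exact data Joung's algorithm maintains.

#include <vector>

// After session `finished` drains, pick the next session-id in cyclic order
// (modulo s) that has at least one outstanding request, or -1 if none exists.
int nextSession(int finished, const std::vector<bool>& pending) {
    int s = static_cast<int>(pending.size());
    for (int step = 1; step <= s; ++step) {
        int candidate = (finished + step) % s;
        if (pending[candidate]) return candidate;
    }
    return -1;
}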

Since Joung described his first solution, Keane and Moir [60]; Hadzilacos [39]; Jayanti,

et al [58]; Takamura and Igarashi [88]; Hadzilacos and Danek [40]; Bhatt and Huang [14];

and He, et al [43] have all proposed solutions varying in methodology, guarantees, and

efficiency. Several solutions attempt to provide strict fairness (potentially) at the expense of concurrency [39, 40, 58]. On the other hand, others relax the fairness constraints somewhat in the interests of increased concurrency [60, 40, 14]. Table 3.1 summarizes many of the proposed GME algorithms.

Table 3.1: Summary of GME algorithms. All algorithms satisfy P1, P2, and P3.

Algorithm                                  RMR               Space   P4    P5    P5'   P6
Joung [59]                                 ∞                 O(n)    Yes   No    No    No
Keane & Moir [60]                          O(log n)          O(n)    No    No    No    No
Hadzilacos [39]                            O(n)              O(n²)   Yes   Yes   Yes   No
Takamura & Igarashi [88] (Algorithm 1)     ∞                 O(n)    No    No    No    No
Takamura & Igarashi [88] (Algorithm 2)     O(n)              O(n)    Yes   No    No    No
Takamura & Igarashi [88] (Algorithm 3)     O(n)              O(n)    Yes   No    No    No
Jayanti, et al [58]                        O(n)              O(n)    Yes   Yes   Yes   No
Danek & Hadzilacos [40] (Algorithm 1)      O(n)              O(n²)   Yes   Yes   Yes   No
Danek & Hadzilacos [40] (Algorithm 2)      O(n)              O(n²)   Yes   No    Yes   No
Danek & Hadzilacos [40] (Algorithm 3)      O(log n log m)    O(n²)   Yes   No    No    No
Bhatt & Huang [14]                         O(min(log n, k))  O(n)    Yes   Yes   Yes   Yes
He, et al [43] (GLB)                       O(n)              O(n)    Yes   Yes   Yes   No
He, et al [43] (BWBGME)                    O(n)              O(n)    Yes   Yes   Yes   No

CHAPTER 4

A CONCURRENT UNROLLED LINKED LIST WITH LAZY

SYNCHRONIZATION

In the previous chapter, we discussed linked lists and explored several concurrent implementations of them. Furthermore, we explained how lists could be "unrolled" to support multiple key-data pairs in each node. In this chapter, we demonstrate how to do so in a lock-based concurrent setting. We provide algorithms to construct and manipulate a concurrent unrolled linked list, prove their correctness, and provide experimental analysis to demonstrate the performance of our algorithms compared to other well-known concurrent lists.

4.1 Algorithm Overview

Our unrolled linked list maintains a singly-linked list of nodes and stores keys in partially sorted order. Each node contains (i) a count of the number of elements in the node, (ii) an anchor key, (iii) an array of key-data pairs, (iv) a pointer to the next node in the list, (v) a marked flag indicating a node's logical removal, and (vi) an exclusive lock. The anchor key helps a thread determine if a given key exists within a single node; during operations we maintain the invariant that all keys within a given node should be greater than or equal to its anchor key and strictly less than the next node's anchor key. The lock conceptually protects access to the next pointer, allowing most operations to complete while holding a single lock. We define the parameter K to indicate the maximum number of key/data pairs, MinFull for the desired minimum number of keys per node, and MaxMerge for the combined size below which two adjacent sparse nodes are merged (Algorithm 4.6). The data structure keeps track of the head pointer which points to the first element in the list. We maintain two invariants: (i) the anchor key of each node is strictly less than the anchor key of its successors and (ii) all keys in a node are greater than or equal to that node's anchor key; however, the anchor key of a node may not be currently present in the list. We do not impose any ordering among keys within a node; attempting to keep keys in sorted order would penalize write performance and complicate wait-free lookups. The layout of the data structure is depicted in Figure 4.1.

Figure 4.1: Layout of the unrolled linked list
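A C++ rendering of this layout is sketched below. The field types are illustrative only (our evaluation code in Section 4.4 packs each key/data pair into a single 64-bit word), and next and marked are declared atomic so that lookups, which traverse the list without locking, remain well defined in C++.

#include <atomic>
#include <mutex>

constexpr int K = 8;                      // maximum key/data pairs per node (a tunable parameter)

struct UnrolledNode {
    int  count;                           // number of occupied key/data slots
    long anchor;                          // every key stored in this node is >= anchor
    long keys[K];                         // unsorted; unused slots hold the ⊥ sentinel
    long data[K];                         // data[i] is associated with keys[i]
    std::atomic<UnrolledNode*> next;      // successor in the list; guarded by lock
    std::atomic<bool> marked{false};      // set to true when the node is logically removed
    std::mutex lock;                      // the node's exclusive lock
};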

We define three sentinel values of −∞, +∞, and ⊥. The sentinel keys −∞ and +∞ denote the head and tail of the list, while ⊥ indicates unused key-data slots within a node.

We instantiate the list with three nodes with anchor keys of −∞, −∞, and +∞.

Each operation scans the list to find a target node and its immediate predecessor. A lookup operation returns either the data element associated with the key or nil if the key is

not present. Both insert and remove operations lock the predecessor node (thus protecting its next pointer) and invoke validate to ensure that no local structural changes have occurred.

The insert operation replaces a sentinel key with our new key-data pair and returns true if successful or false if the element is already in the list. A remove replaces a key with a sentinel key, returning true for success or false for failure (i.e., the element was not found in the list).

If an insert operation detects a node is full, it will lock and split the target node into two new nodes. Likewise, if a remove operation encounters a sparse node, it will acquire

additional locks and either merge the node with its successor or redistribute the successor's keys between the node and a single new node.

4.2 Algorithm Detail

Scan. The first step in any operation on the list involves invoking scan (Algorithm 4.1). We maintain three pointers during this scan: prev, curr, and succ. We scan through the list until succ contains an anchor key greater than our key of interest. Once succ meets this criterion, scan returns the pair (prev, curr).

ALGORITHM 4.1: Scan

1   Function scan(item) : (Node, Node)
2       prev ← head
3       curr ← prev.next
4       succ ← curr.next
5       while succ.anchor ≤ item do
6           prev ← curr
7           curr ← succ
8           succ ← succ.next
9       return (prev, curr)

Lookup. The lookup method (Algorithm 4.2) invokes scan to acquire the (prev, curr) pair of nodes. We then perform a single pass through curr's keys looking for item. At each slot, we read the key/data pair atomically (line 4). We can either select key and data elements that collectively fit within a machine word or use atomic snapshots [5, 9]. If we encounter item during our scan, we return the associated data element. Otherwise, we return nil.
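One way to obtain this atomic per-slot read, and the layout our evaluation adopts (Section 4.4.1), is to pack a 32-bit key and a 32-bit data element into one 64-bit word. The helpers below are an illustrative sketch, not the exact code of our implementation.

#include <atomic>
#include <cstdint>

using Slot = std::atomic<std::uint64_t>;  // one key/data pair stored in a single machine word

inline std::uint64_t pack(std::uint32_t key, std::uint32_t data) {
    return (static_cast<std::uint64_t>(key) << 32) | data;
}

inline void unpack(std::uint64_t word, std::uint32_t& key, std::uint32_t& data) {
    key  = static_cast<std::uint32_t>(word >> 32);
    data = static_cast<std::uint32_t>(word & 0xFFFFFFFFu);
}

// Read the slot once, indivisibly, and test it against the key being sought.
bool slotMatches(const Slot& slot, std::uint32_t item, std::uint32_t& dataOut) {
    std::uint32_t key, data;
    unpack(slot.load(std::memory_order_acquire), key, data);
    if (key != item) return false;
    dataOut = data;
    return true;
}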

Validate. Our insert and remove functions depend upon a validate function (Algorithm 4.3) identical to that proposed by Heller, et al [44]. We must perform this validation because another thread may still manipulate prev or curr during the interval between requesting the lock on prev and actually acquiring it. We validate by checking that neither prev nor curr is marked for removal and that prev.next still points to curr.

22 ALGORITHM 4.2: Lookup

1   Function lookup(item) : Data
2       (prev, curr) ← scan(item)
        // Perform a linear search of the node for the requested key
3       for i ← 0 to K − 1 do
            // Read the key/data element atomically
4           (key, data) ← (curr.keys[i], curr.data[i])
5           if key = item then return data
6       return nil

ALGORITHM 4.3: Validate

1   Function validate(prev, curr) : boolean
2       return ¬prev.marked ∧ ¬curr.marked ∧ prev.next = curr

Insert. The insert operation (Algorithm 4.4) performs a scan to locate an appropriate

insertion point, locks prev, and invokes validate. If the validation fails, the operation returns

to the head of the list and scans again. Once a validation succeeds, it checks curr for three

conditions. If curr already contains item it leaves the node unchanged and returns false. If

there is at least one empty slot in curr (denoted by the sentinel ⊥), it atomically replaces the

sentinel key and its associated data element with the new key-data pair, increments count,

and returns true. If there are no available slots, we split the node.

To split a node (Algorithm 4.5), we first lock curr. This will not require another validation

since no other thread can modify prev.next. Next we allocate two new nodes, new1 and new2.

We copy all of the key/data pairs from curr to new1, sort them1, and then copy the upper

half to new2. Finally, we replace the upper half of new1’s keys with ⊥ and set the next

pointers appropriately.

Remove. Removing an element operates in a similar manner (Algorithm 4.6). We perform

a scan to locate the (prev, curr) pair, lock prev, and invoke validate. If this succeeds, we

locate item in curr.keys (returning false if not present). If item is present, we replace item

1While there is a O(n) algorithm to determine the median and partition a set of values, in real-world situations, an efficient sorting algorithm is faster [21].

23 ALGORITHM 4.4: Insert

1   Function insert(key, data) : boolean
2       while true do
3           (prev, curr) ← scan(key)
4           prev.lock()
5           if ¬validate(prev, curr) then
6               continue                      /* Return to head and re-scan */
7           if curr.contains(key) ≠ nil then return false
8           slot ← curr.contains(⊥)
9           if slot ≠ nil then
10              (curr.keys[slot], curr.data[slot]) ← (key, data)
11              curr.count ← curr.count + 1
12          else
13              curr.lock()
14              (new1, new2) ← split(curr)
15              if key < new2's anchor key then
16                  (new1.keys[⌈K/2⌉], new1.data[⌈K/2⌉]) ← (key, data)
17                  new1.count ← new1.count + 1
18              else
19                  (new2.keys[⌊K/2⌋], new2.data[⌊K/2⌋]) ← (key, data)
20                  new2.count ← new2.count + 1
21              curr.marked ← true
22              prev.next ← new1
23          return true

ALGORITHM 4.5: Split

1   Function split(node) : (Node, Node)
2       Allocate two new nodes, new1 and new2
3       Copy all key/data pairs from node to new1
4       Sort all key/data pairs in new1 ascending by key
5       Copy the upper ⌊K/2⌋ key/data pairs from new1 to new2
6       new2.anchor ← new2.keys[0]
7       Replace the upper ⌊K/2⌋ keys in new1 with ⊥
8       new1.next ← new2
9       new2.next ← node.next
10      new1.count ← ⌈K/2⌉; new2.count ← ⌊K/2⌋
11      return (new1, new2)

with the sentinel ⊥. At this time, we also decrement the node’s count. If our node now

has fewer than MinFull keys, some additional checking is required. Specifically, we neither

merge with the tail (line 13) nor an empty node (line 15)2. Otherwise, we either merge with

2This can happen if succ is the tail.

24 ALGORITHM 4.6: Remove

1   Function remove(item) : boolean
2       while true do
3           (prev, curr) ← scan(item)
4           prev.lock()
5           if ¬validate(prev, curr) then continue
6           slot ← curr.contains(item)
7           if slot = nil then return false
8           curr.keys[slot] ← ⊥
9           curr.count ← curr.count − 1
10          if curr.count < MinFull then
11              succ ← curr.next
12              // Never merge with the tail sentinel
13              if succ.anchor = +∞ then return true
14              // If curr is now empty, simply unlink it
15              if curr.count = 0 then
16                  curr.marked ← true
17                  prev.next ← succ
18                  return true
19              succ.lock()
20              if curr.count + succ.count < MaxMerge then
21                  merge(curr, succ)
22              else
23                  new1 ← redistribute(curr, succ)
24                  prev.next ← new1
25          return true

our successor node (Algorithm 4.7) or redistribute the keys from succ into curr and one new

node (Algorithm 4.8).

Rebalancing does temporarily violate two invariants; however, a concurrent thread will

still observe correct behavior. First, while a small set of keys will be duplicated between two

nodes, a thread will not attempt to seek a duplicate key within curr until such a time as the

modifying thread updates the curr.next pointer. Likewise, curr will temporarily contain a

small set of keys with values greater than succ’s anchor key; however no concurrent thread

will attempt to seek those keys until the original thread updates curr.next.

25 ALGORITHM 4.7: Merge

1   Function merge(curr, succ)
2       Copy valid key/data pairs from succ to curr
3       succ.marked ← true
4       curr.next ← succ.next

ALGORITHM 4.8: Redistribute

1   Function redistribute(curr, succ) : Node
2       Create one new node new1
3       Copy valid key/data pairs from succ to new1
4       new1.count ← ⌈(curr.count + succ.count)/2⌉
5       tomove ← succ.count − new1.count
6       Sort all key/data pairs in new1 ascending by key
7       Copy the lower tomove key/data pairs from new1 to curr
8       new1.anchor ← new1.keys[tomove]
9       Replace the lower tomove keys in new1 with ⊥
10      curr.count ← curr.count + tomove
11      new1.next ← succ.next
12      succ.marked ← true
13      return new1

Optimization

We can further modify the above algorithms to keep all valid keys at the head of the node; this requires minor changes to remove. Instead of replacing the affected key with ⊥, we would replace that key with the last valid key in the node and replace the last valid key with ⊥ (this is symmetric to removing the anchor key). This effectively can cause a valid key to move forward within a node; therefore, a lookup would need to scan from right-to-left to correctly identify whether the key is present.
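A sketch of the modified removal step follows; the free-standing helper and the BOTTOM constant are illustrative stand-ins for the node's key/data arrays and the ⊥ sentinel.

constexpr long BOTTOM = -1;               // stand-in for the ⊥ sentinel in this sketch

// Overwrite the removed slot with the node's last valid key/data pair, then
// blank that last slot, so the valid pairs always occupy slots [0, count).
void removeSlot(long keys[], long data[], int& count, int slot) {
    int last = count - 1;
    keys[slot] = keys[last];
    data[slot] = data[last];
    keys[last] = BOTTOM;
    count      = last;
}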

4.3 Correctness Proof

Here we prove that our algorithm is correct, using deadlock-freedom as the liveness property and linearizability as the safety property. We assume that garbage nodes are never reclaimed (all memory accesses are safe). We also assume that our key space is finite; any traversal of the list will terminate. We will make use of the following terms: a write operation shall consist of an insert or a remove, an active node is a node currently reachable from the head

26 of the list, and a passive node is a node that is no longer active. We similarly categorize

lookup operations as either lookup-hit or lookup-miss. We can treat a failed insert operation as a lookup-hit, and we treat a failed remove operation as a lookup-miss.

At any moment in time, one or more threads may hold locks on nodes. We can order these threads in head-to-tail order according to the lock(s) they hold. Since our key space is finite, our list is of finite length; therefore, one thread will hold the rightmost lock. Since a thread always acquires locks from head-to-tail, this thread will always be able to progress.

We can also define a linearization point for every operation.

• A successful insert operation linearizes either to the point when the key/data is written or (for a split operation) to the point when the prev.next pointer is updated.

• A successful remove operation linearizes to the point where ⊥ is written.

• Any lookup operation which operates on a passive node can be linearized to the point at which the node becomes passive.

• A lookup-hit operating on an active node can be linearized at the point it reads the key/data pair.

• A lookup-miss operating on an active node has two subcases. If the key is not present when the thread starts scanning, we can linearize to the instant the scan begins. If the key is present at that point, a successful remove operation must have removed it. We can therefore linearize the lookup-miss to immediately follow the point of the remove.

4.4 Experimental Evaluation

4.4.1 Experiment Setup

We completed our experiments on the following two systems:

System A: a 2-processor AMD Opteron 6180SE system (24 total hardware threads) with a clock speed of 2.5GHz and 64GB of memory running Linux (kernel 2.6.43).

System B: a 4-processor AMD Opteron 6276 system (64 total hardware threads) with a clock speed of 2.3GHz and 256GB of memory running Linux (kernel 4.5.5).

All of our evaluation code was written in C++ and compiled with gcc-6.2.1 using the same set of optimizations (-O3 -funroll-loops -march=native). We evaluated the following list implementations:

1. Lazy: The lazy linked list by Heller, et al [44].

2. LockFree: A lock-free linked list by Harris[41] and Michael[67, 68].

3. Chunked: The chunked linked list by Braginsky and Petrank[15]3.

4. Unrolled: The unrolled linked list described in this work.

For this implementation, we combined a 32-bit key and 32-bit data element to fit into a single 64-bit machine word. This layout mirrors the implementation described by Braginsky and Petrank [15].

Each implementation used hazard pointers for garbage collection. For our initial experiments we tested node sizes ranging from 8 to 512 keys per node, key ranges from 1,024 to 1 million, thread counts ranging from 1 to twice the number of cores, and multiple synthetic workloads. Based on our initial observations, we selected the following parameters.

1. Node Size: For the chunked and unrolled lists, we evaluated the performance with

K of 8 and 64 keys per node, MinFull of K/4 and MaxMerge of 3K/4.

3Source code was obtained with permission from Braginsky and Petrank.

2. Workload Distribution: We evaluated performance against three representative workloads: Write-Dominant with no lookups, 50% inserts, 50% removes; Mixed with 70% lookups, 20% inserts, 10% removes; and Read-Dominant with 90% lookups, 9% inserts, 1% removes.

3. Degree of Concurrency: We evaluated the performance in 4 thread increments up to the maximum number of hardware threads.

4. Maximum List Size: Keys were selected uniformly from the half-open interval [0, 5000).

Each experiment was conducted by initially creating a list with 2,500 entries. We then spawned the specified number of threads and ran them concurrently. Each simulation began with a two second “warm-up” phase to eliminate effects of cache loading. Following this, each simulation ran for two seconds. Each thread executed as many operations as possible using the specified mix of operations, and we recorded the total number of operations completed. Each experiment was repeated fifteen times. System throughput is reported in operations per microsecond.
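The measurement loop itself is straightforward; the sketch below shows one possible shape for the driver just described (a warm-up phase, a fixed-length timed phase, and per-thread operation counts aggregated into operations per microsecond). The helper names and the std::function workload hook are illustrative assumptions, not the actual harness.

    #include <atomic>
    #include <chrono>
    #include <functional>
    #include <thread>
    #include <vector>

    // Spawn `threads` workers that run `op` continuously, and report the
    // throughput (operations per microsecond) observed during the timed phase.
    double measure(int threads, std::chrono::seconds warmup,
                   std::chrono::seconds timed, const std::function<void()>& op) {
        std::atomic<bool> counting{false}, stop{false};
        std::atomic<long> total{0};
        std::vector<std::thread> pool;
        for (int t = 0; t < threads; ++t) {
            pool.emplace_back([&] {
                long local = 0;
                while (!stop.load(std::memory_order_relaxed)) {
                    op();                                   // one lookup/insert/remove
                    if (counting.load(std::memory_order_relaxed)) ++local;
                }
                total.fetch_add(local);
            });
        }
        std::this_thread::sleep_for(warmup);   // warm-up phase: ops run but are not counted
        counting.store(true);
        std::this_thread::sleep_for(timed);    // measured phase
        stop.store(true);
        for (auto& th : pool) th.join();
        return static_cast<double>(total.load()) / (timed.count() * 1e6);
    }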

4.4.2 Experimental Results

Figures 4.2 and 4.3 depict the results of our experiments. The graphs at the top depict results for 8 keys per node while those on the bottom show 64 keys per node. From left to right we display results for the write-dominant, mixed, and read-dominant workloads, respectively. We also summarize relative throughput at each machine’s maximum hardware threads in Tables 4.1 and 4.2. On System A, we see that the relative performance of each algorithm remains consistent for every workload and thread count. Specifically, we can rank them fastest to slowest: our unrolled algorithm, the Chunked algorithm, the LockFree algorithm, and the Lazy algorithm.


Figure 4.2: Experimental Results on System A in Operations per Microsecond.

Furthermore, as we expect, both our algorithm and the Chunked algorithm provide improved throughput at K = 64 over K = 8. We do see one item of interest with regards to our algorithm: it shows a dip in performance at 2 threads. This is due to effects of cache coherence; each time an entry is changed, it invalidates entries in other processors' caches. Our experiments indicate that simulations with two threads incur approximately 25% more frequent L1 data cache misses.

On System B, the results are considerably more interesting. In the upper graphs, we see that Braginsky and Petrank’s chunked list closely mirrors our algorithm in performance up


Figure 4.3: Experimental Results on System B in Operations per Microsecond.

Table 4.1: Throughput on System A with respect to the Lazy algorithm at 24 threads

Workload         Lazy   Lock-Free   Chunked (K=8)   Unrolled (K=8)   Chunked (K=64)   Unrolled (K=64)
Write-Dominant    100         112             143              291              311             1012
Mixed             100         108             177              435              665             1745
Read-Dominant     100         114             160              406             1086             2211

until a certain threshold (12 threads for write-dominant, 40 threads for the mixed workload, and 60 threads for the read-dominant). However, after hitting that threshold, the performance of the chunked list levels out and then rapidly drops off. It should also be noted that for K = 8, the chunked list exhibited stability issues at higher thread counts, which resulted

Table 4.2: Throughput on System B with respect to the Lazy algorithm at 64 threads

Workload         Lazy   Lock-Free   Chunked (K=8)   Unrolled (K=8)   Chunked (K=64)   Unrolled (K=64)
Write-Dominant    100          48               9             417              302             1588
Mixed             100          92              14             358              718             2233
Read-Dominant     100          61             374             391             1638             2873

in aborted runs4 and large variances in results. On the K = 64 graphs, our algorithm continued to scale well up to 64 threads; performance increases for the chunked list tapered off as the thread counts increased. On System B, our algorithm also experienced a performance drop due to cache coherence issues when moving to two threads. Finally, on System B, the lazy algorithm does outperform the lock-free algorithm.

Intuitively we can divide the four algorithms into two groups. The lazy and lock-free lists operate in a single phase; an operation examines every entry in the list until the appropriate window is found. The chunked list and unrolled list take a two-phase approach. The first phase skips over multiple entries while seeking the correct node, and the second phase examines entries within that node.

When we consider the performance differential between our data structure and the chunked list, we should consider the following points:

1. Our node layout provides nearly double the data density: each slot within our node

stores a single key-data pair (for a total of 64 bits); in the chunked list, each slot stores

a key-data pair and its associated (64-bit) next pointer.

2. A lookup in the chunked list (once the node has been identified) involves repeatedly

dereferencing a pointer and accessing a different area of the chunk. Our list scans

sequentially through an array. This provides the added benefit of spatial locality [45].

Compilers can also aggressively optimize array scans using techniques such as loop

4Throughput for aborted runs was not included in these results.

unrolling, cache prefetching, and software pipelining [7]. On certain systems a compiler can also use vector instructions to perform multiple comparisons concurrently (see the sketch following this list).

3. In order to perform a split, merge, or rebalance, the chunked list first freezes and stabilizes the affected node(s). Freezing involves visiting each entry and setting a freeze bit (using CAS) while stabilizing requires traversing the chunk and removing any partially-deleted nodes. Our list only requires two calls to the copy library routine and one call to the sort routine to perform either operation. These library routines are typically aggressively optimized for performance.
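As a concrete illustration of the second point, a lookup within one of our nodes is nothing more than the loop below over a contiguous key array, which an optimizing compiler can unroll and, with -O3 -march=native, often vectorize. The field names and the value of K are assumptions made for the sketch.

    #include <cstdint>

    constexpr int K = 64;              // keys per node (assumed)

    struct Node {
        uint64_t keys[K];              // unsorted, packed key slots
    };

    // Linear scan over the node's key array; returns the slot index of the
    // first match, or -1 if the key is not present.
    int contains(const Node& n, uint64_t key) {
        for (int i = 0; i < K; ++i)
            if (n.keys[i] == key) return i;
        return -1;
    }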

Effect of Compiler Optimizations

In order to measure the effect of compiler optimizations on our list and the chunked list, we disabled all optimizations and recompiled. We then re-ran our experiments using the balanced workload with 8 and 64 keys per node. We then compared the performance and calculated the speedup percentage for each degree of concurrency (Figures 4.4 and 4.5). The results confirm our hypothesis. On System A, our list achieved a minimum of a 180% improvement for each thread count, with a maximum improvement of 330% at 2 threads and K = 8. The chunked list did exhibit substantial improvements (150%-200% in many cases) except for K = 8 on System B. The experiments did exhibit two pairs of outlying data points. At 2 threads on System A, the chunked list exhibited a 430% improvement for K = 8 and 330% improvement at K = 64. Likewise, for a single thread on System B, the chunked list showed a 271% speedup for K = 8 and 299% speedup for K = 64. On System B, the results were rather surprising. Our algorithm demonstrated a 200% optimization speedup (or better) throughout almost every data point on both graphs. The chunked algorithm did benefit from the compiler optimization at K = 64. However, it incurred a significant penalty due to compiler optimizations at K = 8. We believe that this is related to the stability issues we mentioned earlier.


Figure 4.4: Effect of Compiler Optimizations on System A

Optimal Node Sizes

Next we consider how best to select the value of K. We expect to traverse O(n/K) nodes to find the correct node; following that, we expect to scan O(K) keys. This results in O(n/K + K) steps per operation. If we select K = O(√n), we should maximize the throughput for our (unrolled) algorithms. In order to evaluate this, we executed the same tests as in Figures 4.2 and 4.3 using the 70/20/10 "mixed" workload, a maximum key size of 5,000 (and therefore a bound


Figure 4.5: Effect of Compiler Optimizations on System B

on n), and varied the node size from 8 to 512 keys per node. The results are depicted in

Figure 4.6. As expected, each algorithm exhibited peak performance near our predicted value of √n.

Specifically, the “chunked” algorithm peaked out at 64 keys per node, while our unrolled algorithm continued to scale well up until 256 keys per node (at 12 threads) or 128 keys per node (at 24 threads and above).
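As a quick check of the K = √n heuristic, minimizing the per-operation cost described above gives

    f(K) = n/K + K,    f′(K) = 1 − n/K² = 0   ⟹   K = √n,

which for n = 5,000 yields K ≈ 71, consistent with the observed peaks between 64 and 128 keys per node.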


Figure 4.6: Impact of Node Size on Throughput in Operations per Microsecond

4.4.3 Expansion of Key and Data Sizes

In order to provide a fair comparison to the prior work of Braginsky and Petrank, we chose to limit the size of our key-data pairs to fit within a single machine word. However, this restriction in turn limits the usefulness of this data structure. Expanding each of the key and data elements to a full machine word allows us to store arbitrary information using pointers.

However, this task is not as trivial as it may at first seem.

When utilizing a key-data pair, care must be taken that reads and writes are atomic; in other words, the entire operation occurs in an indivisible manner (from the perspective of other threads). Reading or writing a single machine word is atomic by nature; accessing multiple machine words (contiguous or otherwise) is not. Brown, et al; Harris, et al; and Israeli and Rappaport have all described methods to construct multiple-word CAS operations from the single-word CAS [19, 42, 56]. These methods are sufficiently general that they can be implemented on any processor that supports either CAS or LL/SC operations. Some platforms, such as Intel's x86-64 architecture, offer a double-wide CAS (DWCAS) operation which accepts the parameters loc, the memory location to access; old, the expected (16-byte) value; and new, the new value. If the current value at loc matches old, then it is (atomically) updated to new. Some implementations have the beneficial side effect that, if the value does not match, old is replaced with the value present in loc. Therefore, this technique can also be used to atomically read two contiguous machine words.
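For example, on x86-64 a 16-byte compare-and-swap can be reached from C++ through the compiler's generic atomic support. The sketch below assumes GCC or Clang with -mcx16 (and possibly -latomic) and uses std::atomic over a 16-byte aligned pair; it illustrates the DWCAS behavior described above and is not code from our implementation.

    #include <atomic>
    #include <cstdint>

    // A key/data pair packed into 16 bytes; 16-byte alignment is required for
    // a double-wide compare-and-swap (cmpxchg16b) to be applicable.
    struct alignas(16) Pair {
        uint64_t key;
        uint64_t data;
    };

    // Atomically replace `slot` with `desired` iff it still equals `expected`.
    // On failure, compare_exchange_strong writes the observed 16-byte value
    // back into `expected`, so a failed DWCAS doubles as an atomic 16-byte read.
    bool dwcas(std::atomic<Pair>& slot, Pair& expected, const Pair& desired) {
        return slot.compare_exchange_strong(expected, desired);
    }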

4.5 Introducing a Per-Thread Shortcut Cache

We can also optimize the seek phase by attempting to reduce the number of nodes a thread must traverse in order to find the appropriate (prev, curr) pair. One way of doing this is by implementing a small per-thread cache which provides shortcuts to various locations in the list. Our design goals in implementing this cache are twofold: (i) the cache should not require any information about the overall structure of the list, and (ii) the cache should not require communication between threads. We believe the unrolled layout of our list makes this technique profitable, since this organization permits us to insert and remove elements within a node while changing the list structure relatively infrequently. This also implies that any nodes we insert into our cache should have relatively long lifespans (compared to a traditional linked list) and as such we do not expect to invalidate them frequently.

4.5.1 Overview of Shortcut Cache

In our concept, each thread maintains a small constant number of shortcut pointers into the list. To determine a (prev, curr) window, a thread first locates an appropriate shortcut pointer where it can begin a scan. This permits us to skip large segments of the list with very little cost in terms of time and space. This should provide significant performance improvements for lists with relatively small values of K. Furthermore, as K increases in size, the expected lifespan of a given node increases; we therefore anticipate handling marked cache entries only infrequently.

A thread’s cache consists of an array of elements, each of which contains a pointer to a node in the list. We maintain the elements in ascending order according to the underlying node’s anchor key. We can now use this cache as a front-end to the scan method; we can

first iterate over each entry in the cache to find an appropriate starting node (instead of the head), and then we continue our scan as per our original algorithm.

In order to maximize the benefit realized from this shortcut cache, we prefer to cache nodes relatively far apart from each other in the underlying list. Furthermore, we want to be able to select “good” entries without inspecting the entire list. In order to select an entry, we can count the entries we traverse in the underlying list and cache an entry once the count exceeds a (per-thread) threshold.

To further satisfy our design goals, we allow each thread to dynamically select a threshold value based upon its own observations of the list. If a thread attempts to add an entry to a full cache, it should increase its threshold value. Likewise, when removing entries from a cache, we decrease the associated threshold value.

This bears some resemblance to techniques used in both Shalev and Shavit's split-ordered hash table [82] and Herlihy et al's optimistic skiplist [47], so we will briefly discuss these in order to highlight the differences.

In the split-ordered hash table, we maintain a (lock-free) linked list of items stored in reverse bit-order. In other words, given the keys 0, 3, 4, 5, 7, we would store them in the order 0 (with a reversed bit-order of 000), 4 (001), 5 (101), 3 (110), 7 (111). The table also maintains an array of pointers that provide entry points into the linked list. In order to find an entry in the hash table, one finds the appropriate entry point into the linked list and scans the list until either the element is found or the next entry point is encountered. The hash table is flexible in that it can grow or shrink as elements are added and removed. Growth is handled via a special case of table doubling, where pointers for new buckets are allocated, but not actually constructed until they are referenced.

The skiplist is a probabilistic data structure consisting of a hierarchy of "levels". The lowest level of the skiplist consists of a linked list containing every element in the set. Each higher level consists of a list that is a subset of the list immediately below it. In order to locate a given element, a thread starts at the highest list and locates the (half-open) (pred, curr] window containing the element in question. The thread stores this window for the current level, then drops down to the next-lowest list and, starting at pred, searches for the window associated with that level. This continues until the thread finds the (pred, curr] window at the lowest level of the list. Inserting an element in this skiplist involves randomly selecting a level, finding the array of (pred, curr] windows, locking the appropriate elements, validating, and then manipulating the pointers. Likewise, removing an element also requires locating the array of windows, locking, and manipulating the appropriate pointers.

Our shortcut cache provides a lightweight, best-effort means of improving throughput

within our underlying unrolled linked list. Unlike the aforementioned split-ordered hash

tables and skiplists, our cache requires neither any additional coordination between threads

nor any knowledge of the overall size or structure of the list.

4.5.2 Detail of Shortcut Cache

Each thread maintains a small cache of up to S entries5 which consist of two fields: (i) a skipcount, or estimated distance from the previous cache entry, and (ii) a node within the list proper. Each cache stores entries in ascending order according to the referenced node's anchor key and implements the following trivial methods: (i) insert(skipcount, node) creates a new cache entry, inserts it into that thread's cache, and deletes the entry with the smallest skipcount (if the cache was previously full); (ii) remove(entry) removes an entry from that thread's cache; (iii) getsmallest() retrieves the entry with the lowest skipcount; and (iv) isfull() returns a boolean value indicating whether the cache is currently full. Finally, we seed each thread's cache with entries pointing to the head and tail of the list. Each cache maintains a threshold variable to estimate the size of the underlying list. We also utilize a constant GrowthFactor6 to indicate how much to grow (or shrink) the threshold based upon observations of the list.

The getsmallest() method can be implemented in one of two ways. One can perform a linear scan of the cache to find the smallest entry in O(S) time. Alternatively, we can maintain an auxiliary priority queue for each cache. This would require an additional O(S) space per thread and allow us to perform getsmallest() in constant time. Adding or removing entries from this priority queue can generally be done in O(log(S)) time, depending on the underlying representation [21]. Our per-thread shortcut cache provides a lightweight alternative to both the aforementioned split-ordered hash table and skiplist. In addition to the cache mentioned earlier, we also modify the scan algorithm (Algorithm 4.9) to utilize and manage each thread's shortcut cache. The thread first determines the threshold to skip before caching an entry. The threshold is initialized to (that thread's) smallest skipcount, and if the cache is full, we multiply by

5We evaluated between 2 and 16 entries per thread.

6We determined a GrowthFactor of 1.25 to give favorable results.
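A minimal C++ sketch of one thread's cache appears below; it is illustrative only (the Node stub, the vector-backed storage, and the method signatures are assumptions), but it shows the small interface that Algorithm 4.9 relies on. Seeding the cache with the list's head and tail, and scaling threshold by GrowthFactor, then follow the description above.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Node { long anchor; };               // stand-in for an unrolled-list node

    struct CacheEntry {
        long  skipcount;                        // estimated distance from previous cached node
        Node* node;                             // shortcut into the list
    };

    class ShortcutCache {
    public:
        explicit ShortcutCache(std::size_t capacity) : capacity_(capacity) {}

        bool isfull() const { return entries_.size() >= capacity_; }

        // Keep entries sorted by the referenced node's anchor key; when the
        // cache is already full, evict the entry with the smallest skipcount.
        void insert(long skipcount, Node* node) {
            if (isfull()) entries_.erase(getsmallest());
            CacheEntry e{skipcount, node};
            auto pos = std::lower_bound(entries_.begin(), entries_.end(), e,
                [](const CacheEntry& a, const CacheEntry& b) {
                    return a.node->anchor < b.node->anchor;
                });
            entries_.insert(pos, e);
        }

        void remove(std::vector<CacheEntry>::iterator it) { entries_.erase(it); }

        // Linear O(S) scan; an auxiliary priority queue would make this O(1).
        std::vector<CacheEntry>::iterator getsmallest() {
            return std::min_element(entries_.begin(), entries_.end(),
                [](const CacheEntry& a, const CacheEntry& b) {
                    return a.skipcount < b.skipcount;
                });
        }

        double threshold = 1.0;                 // skip distance before caching a node

    private:
        std::size_t capacity_;
        std::vector<CacheEntry> entries_;
    };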

ALGORITHM 4.9: Scan With Shortcut Cache

Function cachedScan(item) : (node, node)
    // tid refers to the current thread ID
    entry ← cache[tid].head
    next ← entry.next
    // Find a suitable entry point into the list
    while next.node.anchor ≤ item do
        // Remove marked nodes and adjust threshold
        if next.node.isMarked() then
            if cache[tid].isfull() then
                cache[tid].threshold ← cache[tid].getsmallest().skipcount
            else
                cache[tid].threshold ← cache[tid].threshold / GrowthFactor
            cache[tid].remove(entry)
        else
            entry ← next
        next ← entry.next
    // Begin phase two.
    prev ← entry.node, curr ← prev.next, succ ← curr.next
    skipcount ← 0
    while succ.key ≤ item do
        prev ← curr
        curr ← succ
        succ ← succ.next
        if skipcount > cache[tid].threshold then
            skipcount ← 0
            cache[tid].insert(curr, skipcount)
            if cache[tid].isfull() then
                cache[tid].threshold ← cache[tid].threshold ∗ GrowthFactor
        skipcount ← skipcount + 1
    return (prev, curr)

GrowthFactor. The operation then steps through the cache entries, deleting any entries that refer to marked nodes. The operation stops once it finds an entry with a key greater

than the target. That entry’s immediate predecessor (in the cache) becomes our starting

point for the second phase of the scan.

The second phase of the scan closely mirrors the original scan (Algorithm 5.1), with two

exceptions. First, we begin at the entry.node we discovered in phase one. Second, we count

nodes we encounter (in skipcount) as we traverse the list. Anytime that skipcount exceeds

threshold, we add that node to our cache, reset skipcount, and (if the cache is full) adjust

threshold. The scan terminates in the same manner as the original, returning the pair (prev, curr) once succ.key exceeds the target key.

4.5.3 Evaluation of Shortcut Cache

In order to evaluate the effect of our shortcut cache, we performed additional experiments on the same two evaluation systems. We utilized the following sets of parameters to measure the effects of these optimizations:

• Node Size: We decided to contrast small nodes versus large nodes, and therefore

evaluated lists with 8 and 64 keys per node.

• Workload Distribution: We evaluated these additional techniques against the mixed

workload of 70% lookup, 20% insert, and 10% remove.

• Degree of Concurrency: We measured performance at 4 thread increments up to

the maximum number of hardware threads.

• Maximum List Size: Keys were uniformly selected within the half-open range

[0..5000).

• Cache Size: We evaluated caches of 2, 4, 6, 8, and 16 nodes.

As before we initialized the data structure with 2,500 keys, spawned the desired number of threads, and allowed each thread to complete as many operations as possible for ten seconds. We performed each experiment fifteen times and recorded the total throughput in operations per microsecond. We evaluated the effects of our per-thread shortcut cache, reporting results in Figure 4.7. The top graphs show results for System A, while the bottom plots depict results on System B. We show results for K = 8 on the left and K = 64 on the right.


Figure 4.7: Impact of Shortcut Cache

For the K = 8 experiments, we see a drastic improvement in performance on both systems. On System A, we observed an 89% improvement for a 2-node cache and 221% improvement for a 16-node cache. While performance continued to increase with the size of the cache, the difference between a 6-node cache and 16-node cache was a mere 13%.

Likewise, on System B, we saw an 85% improvement with the 2-node cache, up to a 300%

improvement at 16 nodes. We also observed diminishing returns from adding nodes to the

cache; when increasing the cache from 6 nodes to 16 nodes, throughput only increased by

11% on System A and 25% on System B.

The results for K = 64 tell a very different tale. On System A, we see moderate improvements across the thread range, with a maximum of a 29% improvement at 24 threads for an 8-node cache. However, increasing the cache beyond 8 nodes caused a decrease in performance. On System B, we saw up to a 51% improvement with a 2-node cache, a 75% improvement at 4 nodes, and marginal improvements beyond that. The peak throughput gain of 88% was observed with an 8-node cache.

In summary, our lightweight shortcut cache provides substantial improvements in throughput with very little storage and zero synchronization overhead. In fact, in all of our experiments, we can derive much of the potential benefit from a very small cache (6 nodes in most cases). Our further experiments indicate that little to no benefit can be gained beyond 16 nodes per thread on smaller nodes and 8 nodes per thread on larger nodes.

nodes per thread on smaller nodes and 8 nodes per thread on larger nodes.

4.6 Conclusions and Future Work

Braginsky and Petrank described a means to reorganize a lock-free linked list to improve

locality of memory access. In our work, we have described how a lock-based algorithm

can manipulate a similarly-organized linked list with considerably less overhead and achieve

substantially higher throughput. By storing multiple keys in a node and skipping nodes that

cannot contain our key, we improve performance within a constant factor over traditional

linked lists. Storing the entries in an unsorted array allows us to sequentially scan these

entries, a task which compilers can aggressively and effectively optimize. We have also

demonstrated how to further improve upon these results with a per-thread shortcut cache.

Our results are extremely encouraging and suggest that further research should be done in this area. In the next chapter, we will explore the possibilities of unrolling a more complex data structure, the skiplist. Since the skiplist is (to a certain degree) an extension of the linked list, we should be able to extend these techniques to a skiplist. Unlike the linked list, the skiplist provides expected logarithmic performance, and as such, can efficiently support much larger data sets than linked lists.

CHAPTER 5

UNROLLING THE OPTIMISTIC SKIPLIST

5.1 Introduction

In the previous chapter, we demonstrated how we could unroll a concurrent linked list utilizing lazy synchronization to provide a data structure that scales well and provides improved throughput over other list-based sets. While linked lists are very useful for small data sets, they struggle with larger data sets due to their expected O(n) performance. In order to handle data sets beyond a few thousand elements, we should look to a hierarchical data structure.

To this end, we turn our attention to the skiplist. Our endeavors begin with the lock-based optimistic skiplist presented by Herlihy, et al [47], so we shall briefly describe its operations to provide a starting point. All operations in the skiplist begin with a scan to locate a set of windows. Each window consists of two nodes that bracket the half-open interval (pred, curr] that should contain the key of interest. A thread first locates the window in the topmost level and then utilizes that window to locate the window in the next lower level.

In order to insert a node, the thread obtains the set of windows via a scan operation, and

(randomly) determines which level to insert the new node. The thread then locks and verifies the predecessor nodes up to and including the topmost level of the new node. Finally, the thread then manipulates the pointers to insert the new node in each (pred, curr] window, beginning with the lowest level.

Removing an element operates in a similar manner; the thread first invokes scan to locate the set of windows, then locks the predecessor nodes up to (and including) the topmost level of the target node and marks the target node for removal. The physical removal proper proceeds from the topmost level (of the target node) downwards. The operation's linearization point occurs when the bottom-most entry is removed from the list.

5.2 Algorithm Overview

In order to unroll the skiplist to support multiple keys per node, care must be taken with regards to several subtleties. For example, in the unrolled linked list, locks are acquired in head-to-tail order to avoid deadlock. On the other hand, a skiplist acquires locks on its predecessor nodes in a bottom-to-top fashion; this also implies that locks proceed from the tail to the head. We maintain this tail-to-head order to avoid deadlocks, but this necessitates substantial changes to remove operations to enforce this ordering. This lock ordering also requires a change in our concept of locking. In the unrolled linked list, locks protect only the next pointers, but locks in our unrolled skiplist protect the entire contents of a node.

Finally, maintaining the inclusion invariant requires strict ordering when manipulating the next pointers while splitting or merging nodes.

Each node of our unrolled skiplist consists of the following fields (see Figure 5.1): (i) an exclusive reentrant lock1, (ii) a marked flag to indicate logical removal, (iii) a count of the number of valid keys, (iv) a nodeLevel value indicating the highest-level list that this node belongs, (v) an anchor key, (vi) an array of keys, (vii) a fullyLinked flag to indicate that all next pointers have been set, and (viii) an array of next pointers. The unrolled skiplist is organized as a set of “levels”. The lowest level (level zero) consists of an unrolled linked list, and each successive level contains a subset of its next-lower level.

Each node contains keys which are greater than or equal to its anchor key and strictly less than the successor node’s anchor key, although the anchor key may not be a valid key in the list. Unused slots in the keys array contain the sentinel ⊥2. An exclusive lock protects the contents of each node from concurrent modifications; a majority of operations only require one lock. The data structure maintains a head pointer to the first node of the list.

1We present our algorithms with reentrant locks for clarity, but these are not necessary.

2In practice we use +∞, since it compares greater than every valid key.


Figure 5.1: Layout of the unrolled skiplist
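In C++ terms, the layout of Figure 5.1 corresponds roughly to the structure below; the concrete types, the fixed TopLevel and K constants, and the use of std::recursive_mutex are assumptions made for illustration rather than the evaluated implementation.

    #include <atomic>
    #include <mutex>

    constexpr int TopLevel = 15;   // maximum skiplist level (assumed value)
    constexpr int K        = 32;   // maximum keys per node (assumed value)

    struct SkipNode {
        std::recursive_mutex   lock;                 // exclusive reentrant lock
        std::atomic<bool>      marked{false};        // logical-removal flag
        int                    count{0};             // number of valid keys
        int                    nodeLevel{0};         // highest level containing this node
        long                   anchor;               // lower bound on the keys stored here
        long                   keys[K];              // key slots; unused slots hold the sentinel
        std::atomic<bool>      fullyLinked{false};   // all next pointers have been set
        std::atomic<SkipNode*> next[TopLevel + 1];   // one successor pointer per level
    };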

We initialize the skiplist with three tunable parameters: TopLevel, K, and MinFull.

TopLevel indicates the maximum level of the skiplist, K indicates the maximum number of keys per node, and MinFull indicates the minimum number of keys per node. The list is instantiated with three (empty) nodes, each at level TopLevel with anchors of −∞, −∞, and +∞.

We maintain three invariants. The inclusion invariant requires that every element present at level N also be present in levels 0..(N − 1). The placement invariant requires that once we insert an element at a location within a node, that element remains at the same slot within that node. Finally, the set invariant requires the skiplist to only contain one instance of a given key at a time. We shall see that maintaining these invariants necessitates careful ordering of pointer manipulations during insert and remove operations.

Each operation (insert, lookup, or remove) begins with a scan of the list to find the node that should contain our target key; we henceforth call this our target node. The scan starts at the head of the topmost list, searching each level for a (pred, succ) window. At each level

succ indicates a node with an anchor greater than that of the target node, and pred indicates

After locating a target node, a lookup operation performs a linear search of the node's keys array. Insert and remove operations lock the target node and validate that the node is unmarked. An insert operation replaces a sentinel slot with a valid key, while a remove operation replaces the target key with a sentinel. Threads maintain the nodes' size constraints by splitting full nodes or merging sparse nodes with their neighbors. All operations return true upon success or false upon failure.

5.3 Detail of an Unrolled Skiplist

All operations begin by invoking scan (Algorithm 5.1) for the key in question. While our scan generates the same results as that in the optimistic skiplist presented by Herlihy, et al, the unrolled structure of our list requires extra intermediate steps.

5.3.1 Scan

The algorithm initially scans the top-level list beginning at head, maintaining the three pointers pred, curr, and succ. The scan traverses the uppermost list until succ.anchor > key.

At this point, we store pred, curr, and succ in the arrays temp, preds, and succs, respectively.

At this point, the scan drops down to the next-lower level list and continues from pred.

After traversing the lowermost level, preds[0] contains our target node. We copy preds[0] to succs[0..preds[0].nodeLevel] and copy the matching nodes from temp into preds. This provides us with our desired (preds, succs) array, and we return the target node succs[0].

ALGORITHM 5.1: Scan

Function scan(key : int; preds, succs : Node[0..TopLevel]) : Node
    pred ← head
    Declare array temp[0..TopLevel]
    // Scan through list at each level, maintaining
    // predecessor, current, and successor node at each level
    for level ← TopLevel downto 0 do
        curr ← pred.next[level], succ ← curr.next[level]
        while succ.anchor ≤ key do
            pred ← curr
            curr ← succ
            succ ← curr.next[level]
        temp[level] ← pred          // Pred's predecessor
        preds[level] ← curr
        succs[level] ← succ
    baseLevel ← preds[0].nodeLevel
    // Fix up the succs array such that the lowest
    // 0..baseLevel entries contain our target node
    for level ← 0 to baseLevel do
        succs[level] ← preds[level]
        preds[level] ← temp[level]
    return succs[0]

5.3.2 Lookup

The lookup operation (Algorithm 5.2) first executes a scan to locate the target node. We

then invoke a contains method that performs a linear search of the keys array and returns

the location of the first occurrence of key (or nil otherwise). If contains indicates the key is

present, it returns true; otherwise, it returns false.

ALGORITHM 5.2: Lookup

Function lookup(key) : boolean
    Declare arrays preds[0..TopLevel], succs[0..TopLevel]
    curr ← scan( key, preds, succs )
    slot ← contains( curr, key )
    return ( slot ≠ nil )

5.3.3 Insert

An insert operation (Algorithm 5.3) first invokes scan to locate the set of ( preds, succs )

(and curr). The operation then acquires the lock on curr and ensures curr is a valid node in the list. It suffices to check that a node is unmarked (additional validation will be required prior to splitting nodes). Presuming curr is unmarked, we attempt to add the key to curr

(Algorithm 5.4). Depending upon the return code of addKey we either (i) return true if addKey returns Success, (ii) return false if addKey returns Collision, or (iii) proceed to split curr otherwise.

If the node is full, we split the node into two new nodes. We create the first node at the original node’s level and the second node at a new randomly-selected newLevel; this effectively adds one new node at a random level and preserves the distribution of node levels. We lock and validate our predecessor nodes up to and including the maximum of either curr’s level or newLevel (Algorithm 5.5). Once we succeed in locking the nodes we need, we invoke split (Algorithm 5.6) to create two new nodes. We then add key to the appropriate node and mark curr for deletion. We next unlink curr from top to bottom but leave it in the lowest-level list to preserve the inclusion invariant. Finally, we link in node1 from bottom to top (noting that replacing the preds[0] pointer also removes curr from the skiplist), and if node2 is taller than node1 we continue linking in the upper levels of node2.

Adding a key to a node (Algorithm 5.4) makes use of a contains convenience function to determine if the key already exists in the node; attempting to add a duplicate key will fail and return the constant Collision. Otherwise, we locate the sentinel ⊥ and replace the sentinel with key. If no such sentinel is found, we return NodeFull.

We utilize the lockNodes convenience function to lock and validate a set of nodes (Algorithm 5.5). In this method, we lock the nodes of preds from the lowest level up to and including lockLevel. This also implies we acquire locks from the tail of the list to the head,

ALGORITHM 5.3: Insert

Function insert(key) : boolean
    Declare arrays preds[0..TopLevel], succs[0..TopLevel]
    while true do
        curr ← scan( key, preds, succs )
        curr.lock()
        if ( curr.isMarked() ) then continue
        ret ← addKey( curr, key )
        if ret = Success then return true
        else if ret = Collision then return false
        // Node is full, split into two
        newLevel ← randomLevel()
        lockLevel ← max(newLevel, curr.nodeLevel)
        // Attempt to lock predecessor nodes up to lockLevel
        if ( ¬lockNodes(preds, succs, lockLevel) ) then continue
        (node1, node2) ← split( curr, newLevel, preds, succs )
        if key < node2.anchor then
            addKey(node1, key)
        else
            addKey(node2, key)
        curr.marked ← true
        level ← curr.nodeLevel
        // Unlink curr from top to bottom
        // This leaves it only in the bottommost level
        while level ≥ 1 do
            preds[level].next[level] ← curr.next[level]
            level ← level − 1
        // Link in node1 from bottom to top (Note: level = 0.)
        while level ≤ node1.nodeLevel do
            preds[level].next[level] ← node1
            level ← level + 1
        node1.fullyLinked ← true
        // Continue linking in node2 [if necessary]
        while level ≤ node2.nodeLevel do
            node2.next[level] ← succs[level]
            preds[level].next[level] ← node2
            level ← level + 1
        node2.fullyLinked ← true
        return true

thus avoiding deadlocks. After acquiring each lock, we verify both that preds[level] is unmarked and that the associated next pointer still points to succs[level]. We need not check whether succs[level] is marked; the sequence of marking and removing succs[level] occurs while a thread holds the lock to preds[level]. If we acquire and validate all of the locks, we return true; otherwise, we release the locks and return false.

ALGORITHM 5.4: AddKey

Function addKey( node, key ) : int
    if ( contains( node, key ) ) then return Collision
    slot ← contains( node, ⊥ )
    if ( slot ≠ nil ) then
        node.keys[slot] ← key
        node.count ← node.count + 1
        return Success
    return NodeFull

ALGORITHM 5.5: LockNodes

Function lockNodes(preds, succs, lockLevel) : boolean
    level ← 0
    valid ← true
    // Lock predecessors from level 0 to lockLevel
    // and validate correctness after locking each node
    while valid ∧ (level ≤ lockLevel) do
        preds[level].lock()
        valid ← ¬preds[level].marked ∧ preds[level].next[level] = succs[level]
        level ← level + 1
    // If any validation fails, release all locks
    if ¬valid then
        while level > 0 do
            level ← level − 1
            preds[level].unlock()
    return valid

In order to split a node, we first create two new nodes, copy the keys from curr to node1,

sort them, and copy the upper half to node2. We then set up the next pointers for these two

nodes; these fall into two cases depending on whether the new node’s nodeLevel is greater

than the current node or not. In either case, we establish the next pointers from bottom to

top to provide consistency with the insert method. Once this is complete, we return this

structure of two new nodes with all of their next pointers set.

5.3.4 Remove

A remove operation begins by performing a scan, obtaining the preds and succs arrays,

locking curr, ensuring it is not marked for removal. We then invoke removeKey to remove the

ALGORITHM 5.6: Split

Function split( curr, newLevel, preds, succs ) : (Node, Node)
    node1 ← new node of level curr.nodeLevel
    node2 ← new node of level newLevel
    Copy all keys from curr to node1
    Sort keys in node1 in increasing order
    node1.anchor ← curr.anchor
    node1.count ← ⌊K/2⌋
    Copy node1.keys[⌈K/2⌉..K] to node2
    node2.anchor ← node2.keys[0]
    Fill node1.keys[⌈K/2⌉..K] with ⊥
    node2.count ← ⌈K/2⌉
    // Set up next pointers for node1 and node2
    level ← 0
    if ( node1.nodeLevel ≥ node2.nodeLevel ) then
        while ( level ≤ node2.nodeLevel ) do
            node1.next[level] ← node2
            node2.next[level] ← curr.next[level]
            level ← level + 1
        while ( level ≤ node1.nodeLevel ) do
            node1.next[level] ← curr.next[level]
            level ← level + 1
    else
        while ( level ≤ node1.nodeLevel ) do
            node1.next[level] ← node2
            node2.next[level] ← curr.next[level]
            level ← level + 1
        while ( level ≤ node2.nodeLevel ) do
            node2.next[level] ← succs[level]
            level ← level + 1
    return (node1, node2)

key from the node. This method (Algorithm 5.8) can return three values: Success indicates

successful removal, NotFound indicates that the key was not present, and MustMerge

indicates that removing key would violate the MinFull parameter. In the first two cases, we return true or false (respectively). The third case requires structural modifications to the

list. In the unrolled linked list, a merge or redistribute operation is guaranteed to succeed

since the initiating thread holds a lock on the predecessor node, thus preventing other threads

from manipulating the next pointer that references the current node. This is not the case

in our unrolled skiplist; other threads may concurrently modify the next pointers on our

54 predecessor nodes which could cause a validation to fail. We therefore do not perform the

actual removal until we have locked and validated all required predecessor nodes.

ALGORITHM 5.7: Remove

Function remove(key) : boolean
    Declare arrays preds[0..TopLevel] and succs[0..TopLevel]
    while true do
        curr ← scan(key, preds, succs)
        curr.lock()
        if ( curr.marked ) then continue
        ret ← removeKey( curr, key )
        if ret = NotFound then return false
        else if ret = Success then return true
        // Node removal failed due to underfull node.
        // Merge/Redistribute as needed
        pred ← preds[0], pred.lock()
        if pred.isMarked() then continue
        if curr.count + pred.count < K then
            if ¬lockNodes(preds, succs, curr.nodeLevel) then continue
            else
                mergeAndRemove( key, pred, curr, preds, succs )
                return true
        else
            Declare array ppreds[0..TopLevel]
            lockLevel ← max(pred.nodeLevel, curr.nodeLevel)
            findPredsOf( ppreds, preds )
            if ( ¬lockNodes(ppreds, preds, lockLevel) ) then continue
            else
                redistribute( key, pred, curr, preds, succs )
                return true

To maintain our minimum node fullness, we first lock preds[0], ensure it is unmarked, and

determine whether we can merge curr with preds[0]. If this is the case, we invoke lockNodes

to lock the predecessor nodes3, invoke mergeAndRemove (Algorithm 5.10), and return true.

Otherwise, we redistribute the keys of curr and preds[0] among two nodes; note that this

requires extra work due to our deadlock avoidance strategy. We first invoke findPredsOf

(Algorithm 5.11) to find the predecessors to the preds array; this performs a partial re-scan

3Locking pred again will succeed, since we utilize reentrant locks.

55 of the skiplist. We then invoke lockNodes on these new predecessors. If this succeeds, we

replace pred and succ with two new nodes (sans our target key) by invoking redistribute

(Algorithm 5.12).

ALGORITHM 5.8: RemoveKey

Function removeKey(curr, key) : int
    slot ← contains( curr, key )
    if ( slot = nil ) then return NotFound
    if ( curr.count ≤ MinFull ) then return MustMerge
    curr.keys[slot] ← ⊥
    curr.count ← curr.count − 1
    return Success

Removing a key (Algorithm 5.8) involves invoking contains to locate the key, ensuring

that the node is sufficiently full to support its removal, and replacing the key with ⊥. If all

these operations succeed, removeKey returns Success. If contains fails to locate the key, we

return NotFound. Likewise, if removing the key would violate the MinFull parameter,

we return MustMerge. Merging two nodes (Algorithm 5.10) involves copying all valid

keys except key from node2, marking node2 for removal, and then unlinking node2 from top

to bottom to preserve the inclusion invariant. This sequence also preserves the placement

invariant, since keys do not move within a node.

ALGORITHM 5.10: mergeAndRemove

Function mergeAndRemove( key, node1, node2, preds, succs )
    Copy all valid keys except key from node2 to node1
    node1.count ← node1.count + node2.count − 1
    node2.markNode()
    level ← node2.topLevel
    while level > node1.topLevel do
        preds[level].next[level] ← node2.next[level]
        level ← level − 1
    while level ≥ 0 do
        node1.next[level] ← node2.next[level]
        level ← level − 1

ALGORITHM 5.11: FindPredsOf

Function findPredsOf( ppreds, preds )
    curr ← preds[0]
    if curr.topLevel = maxLevel then node1 ← head
    else node1 ← preds[curr.topLevel + 1]
    for level ← curr.topLevel downto 0 do
        node2 ← node1.next[level]
        while node2.anchor < curr.anchor do
            node1 ← node2
            node2 ← node1.next[level]
        ppreds[level] ← node1

5.4 Correctness Proof

For the purposes of the correctness proof, we assume that the memory of nodes that have become garbage is not reclaimed. A node shall be active if it is reachable from the head node of the skiplist. An active node becomes passive once it is deleted and is no longer reachable from the head node. The target key shall refer to the key of an operation, and the target node shall refer to the node in the skiplist returned by the traversal procedure that contains (or should contain) the target key. The following propositions are easily proved:

Proposition 1. If a node is not marked, then it is active. If a node is passive, then it is also marked.

Proposition 2. The target node of an operation was active at some point during its traver- sal.

All Executions are Linearizable

We show that an arbitrary execution of our algorithm is linearizable by specifying the lin- earization point of each operation [52], or the point during its execution at which an operation appears to take effect. Our algorithm supports three types of operations: lookup, insert and remove. We subdivide the lookup operations into two types: lookup-hit if the operation finds

ALGORITHM 5.12: Redistribute

Function redistribute( key, old1, old2, preds, succs ) : (Node, Node)
    Allocate array tmp[1..2K]
    Copy all keys from old1 to tmp[1..K]
    Copy all keys except key from old2 to tmp[K + 1..2K]
    Sort tmp[1..2K] in ascending order
    new1 ← new Node( old1.anchor, old1.topLevel )
    new1.count ← ⌊(old1.count + old2.count)/2⌋
    Copy keys tmp[1..new1.count] to new1
    new2 ← new Node( tmp[new1.count + 1], old2.topLevel )
    new2.count ← ⌈(old1.count + old2.count)/2⌉
    Copy tmp[new1.count + 1..(new1.count + new2.count)] to new2
    // Move from top to bottom, setting up the next pointers
    // in new1 and new2 and also unlinking old1 and old2
    level ← max(old1.topLevel, old2.topLevel)
    old1.marked ← true, old2.marked ← true
    if old2.topLevel > old1.topLevel then
        while level > old1.topLevel do
            new2.next[level] ← old2.next[level]
            preds[level].next[level] ← old2.next[level]
            level ← level − 1
    else
        while level > old2.topLevel do
            new1.next[level] ← old1.next[level]
            preds[level].next[level] ← old1.next[level]
            level ← level − 1
    while level ≥ 1 do
        new2.next[level] ← old2.next[level]
        new1.next[level] ← new2
        preds[level].next[level] ← old2.next[level]
        level ← level − 1
    // Link in new1 and new2, going from bottom-up
    while level ≤ old1.topLevel do
        preds[level].next[level] ← new1
        level ← level + 1
    while level ≤ old2.topLevel do
        preds[level].next[level] ← new2
        level ← level + 1
    return (new1, new2)

its target key) and lookup-miss otherwise. For the ease of exposition, we treat insert and

remove operations that do not modify the skiplist as lookup-hit and lookup-miss operations,

respectively. We now specify the linearization point of each operation.

Insert operation: There are two cases depending on whether the operation performs a split on the target node or not. If it does not perform a split, then the linearization point shall be the point at which the operation copies its target key to an empty slot in the target node. Otherwise, it shall be the point at which the operation updates the lowest-level next

field of a predecessor node.

Remove operation: The linearization point shall be the point at which it replaces a slot in the target node containing its target key with a sentinel value.

Lookup-hit operation: If the target node was not active when the operation read the contents of the slot containing the target key, then the linearization point shall be the point at which the node became passive. Otherwise, it shall be the point at which it read the contents of the slot.

Lookup-miss operation: If the target node was not active when the operation finished scanning the node, then the linearization point shall be the point at which the target node became passive. Otherwise, it can be argued that the key was not present in the target node at some point between when the scan started and when the scan ended. In this case, the linearization point shall be any such point.

It can be verified that the sequential history obtained by ordering operations based on their linearization points is legal, that is, all operations in the sequence satisfy their specifications. Further, by choice, the linearization point of any operation lies between its invocation and response events. This implies that:

Theorem 1. Every execution of our algorithm is linearizable.

All Executions are Deadlock-Free

We say that the system is in a quiescent state if no modify operation completes hereafter.

Note that quiescence is a stable condition; once the system is in a quiescent state, it stays in a quiescent state. We first show the following:

Lemma 1. If the system is in a quiescent state, the following holds: (a) no node in the skiplist is marked, (b) no node is undergoing linking or unlinking, and (c) no pointers in any of the linked lists are changing.

Proof. Assume not. Then it implies that there is a process that has successfully locked all the nodes in the window of its modify operation, performed the validation, and is now modifying the skiplist. Such a modify operation is guaranteed to eventually complete, thereby contradicting the assumption that the system is in a quiescent state.

Per the above lemma, once the system has reached a quiescent state, the skiplist cannot undergo any structural changes. This also implies that any future traversal of the skiplist is guaranteed to terminate, and hence that any lookup operation in this quiescent state is guaranteed to eventually complete. Therefore we focus on modify operations. The system shall be in a potent state if it has one or more pending modify operations. We show that our algorithm is deadlock-free by proving that a potent state is necessarily non-quiescent.

Assume, by way of contradiction, that there is an execution of the system in which the system eventually reaches a state that is potent as well as quiescent. From Lemma 1, we have:

Lemma 2. Any validation test invoked by a modify operation that started its traversal after the system reached a quiescent state is guaranteed to succeed.

We now argue that acquiring of locks by a process is deadlock-free.

Lemma 3. If one or more processes are trying to lock the subset of nodes in their windows, then one of the processes is eventually able to lock all the nodes in its window successfully.

Proof. Processes acquire locks on nodes from tail to head. We assign each node a sequence number in the bottom list by assigning the tail node sequence number 0, its predecessor sequence number 1, and so on. Each process locks a subset of nodes before proceeding to the next step. In our algorithm, every process tries to lock nodes in increasing order of sequence numbers. This is a well-known way to avoid deadlocks and guarantees that some process is eventually able to lock all its nodes successfully [28].

Note that once a process is able to lock all the nodes in its window successfully, then it either eventually completes its modify operation (if validation succeeds) or aborts (if validation fails) and traverses the skiplist again. As argued earlier, once the system has reached a quiescent state, all traversals of the skiplist are guaranteed to eventually complete and, moreover, acquiring of locks is deadlock-free. Therefore, we can conclude the following:

Lemma 4. Assume that the system is in a potent and quiescent state. Then, eventually, there is a modify operation that starts its traversal after the system has reached a quiescent state and is able to lock all the nodes in its window successfully.

Clearly, using Lemmas 4 and 2, such a modify operation is guaranteed to complete successfully. Thus, we have:

Theorem 2. Every execution of our algorithm is deadlock-free.

5.5 Experiment Setup

In order to evaluate the effectiveness of unrolling on a skiplist, we again performed a set of experiments involving a synthetic workload on several different systems.

5.5.1 Concurrent Skip List Implementations

We implemented our algorithm in C++ and compared against three other similar data

structures. Specifically, we compared the following implementations:

1. The optimistic lock-based algorithm proposed by Lev et al [47](LockBased-SL)4.

2. The lock-free skiplist as presented by Fraser [35](Fraser-SL)4.

3. The “No Hot Spot” skiplist as proposed by Crain, et al [23](NoHotSpot-SL)4.

4. The unrolled algorithms presented in this work implemented with pthreads exclusive locks,

with 32 and 192 keys per node (Unrolled-32 and Unrolled-192).

All implementations were in C and C++ and used the jemalloc library for dynamic memory management [31]. Our implementations utilized the ThreadScan library to perform memory reclamation [8], while the synchrobench implementations used an epoch-based reclamation scheme [20]. In the interest of comparing only the algorithms (and not the garbage collection schemes), we disabled garbage collection for all implementations.

5.5.2 Simulation Parameters

In order to determine optimal parameters to construct our data structure, we performed a number of exploratory experiments. We evaluated several different level distributions of our unrolled skiplist, that is, at what probability do we select a given level. For example, with p = 1/4, we would create a new node at level 0 with probability 1 − 1/4 = 3/4, at level at most 1 with probability 1 − 1/4² = 15/16, at level at most 2 with probability 1 − 1/4³ = 63/64, etc. We explored probabilities of 1/2, 1/e, and 1/4, and determined that for our data structure, p = 1/e provided optimal results.
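A level generator matching this scheme is a short geometric-trial loop; the C++ sketch below (the function name and the hard-coded cap are illustrative) promotes a node one level with probability p, so a node reaches level L or higher with probability p^L.

    #include <random>

    int randomLevel(std::mt19937_64& rng, double p = 1.0 / 2.718281828459045,
                    int topLevel = 15) {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        int level = 0;
        while (level < topLevel && coin(rng) < p)   // promote with probability p
            ++level;
        return level;
    }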

4Implementation provided by the synchrobench toolset [38].

Second, we explored the effect of degrees of unrolling. We found that our unrolled skiplist was extremely resilient to this selection; we observed excellent results with an unrolling degree between 32 and 256. Our final set of evaluation parameters is as follows:

1. Maximum Skiplist Size: This depends on the size of the key space. We considered key space sizes of 10^6 (1M) and 10^7 (10M) keys.

2. Maximum Skiplist Level: Our intuition suggests that the maximum height of our skiplist should be approximately the natural logarithm of the keyspace size, which gives us approximately 13.8 for 10^6 keys and 16.1 for 10^7 keys. Our empirical evaluation indicated that optimal results could be had with maximum levels slightly less than these (due to the unrolling factors). We therefore evaluated our unrolled skiplists with maximum levels of 12 and 15. The optimistic, lock-free, and no hot spot skiplists are designed to automatically adjust their maximum level as needed.

3. Relative Distribution of Operations: We considered three workload distributions:

   (a) Read-Dominated: 90% search, 5% insert, and 5% remove operations.
   (b) Balanced: 50% search, 25% insert, and 25% remove operations.
   (c) Write-Dominated: 0% search, 50% insert, and 50% remove operations.

4. Maximum Degree of Concurrency: We varied the number of worker threads from one to the maximum number of hardware threads supported by the architecture.

5. Key Distribution: We used two different key distributions in our experiments:

   (a) Uniform: All keys occur with equal probability.
   (b) Zipfian: This power-law distribution follows Zipf's Law, characterized by the distribution function

       p(k, α, N) = (1/k^α) / (Σ_{i=1..N} 1/i^α),

       where k is the index of the element, α is a tunable parameter, and N is the total number of elements in the distribution [4, 17, 32, 74, 93]. Our experiments are based upon α = 0.5 (a sampling sketch follows this list).

63 6. Degree of Unrolling: We evaluated two degrees of unrolling, specifically K = 32 and

K = 192.
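The following is a minimal sketch of how such Zipfian keys could be generated; the cumulative-weight table and binary search are implementation choices of this sketch, not necessarily those of the dissertation's test harness.

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Draws ranks in [1, N] following the Zipf distribution p(k) proportional to 1/k^alpha
// by precomputing cumulative weights and binary-searching a uniform draw.
class ZipfGenerator {
 public:
  ZipfGenerator(std::size_t n, double alpha) : cdf_(n) {
    double sum = 0.0;
    for (std::size_t k = 1; k <= n; ++k) {
      sum += 1.0 / std::pow(static_cast<double>(k), alpha);
      cdf_[k - 1] = sum;
    }
    for (double& c : cdf_) c /= sum;   // normalize into a proper CDF
  }
  std::size_t next(std::mt19937_64& rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    auto it = std::lower_bound(cdf_.begin(), cdf_.end(), u(rng));
    return static_cast<std::size_t>(it - cdf_.begin()) + 1;   // 1-based rank
  }
 private:
  std::vector<double> cdf_;
};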

5.5.3 Test System

We performed our experiments on one test system. We compiled all evaluation code using the target system's native version of the g++ compiler with the optimization flags -O3 -march=native -funroll-loops. Our evaluation system has the following specifications:

Intel Xeon E5-2698 v3: This system is based upon the Intel 4th generation Core (“Haswell”) microarchitecture. It contains two processors, each with sixteen physical cores (32 hardware threads), for a total of 64 hardware threads, and 512 GB of DDR4 RAM. This system runs Linux kernel version 3.0.101, and its software stack is based upon GCC version 5.2.0.

5.5.4 Experimental Methodology and Measurements

For each experiment, we constructed an instance of the data structure in question, pre-populated with 50% of the total keys in the keyspace (uniformly selected). We then spawned the desired number of threads and initiated a two-second “warm-up” period to minimize the effect of initial caching. Following the warm-up period, we gathered statistics for ten seconds, measuring the overall throughput of the system. All results are measured in number of completed operations per microsecond and represent the average results of ten experiments.
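As an illustration of this methodology, a stripped-down driver might look like the sketch below; the mutex-guarded ConcurrentSet, the thread count, and the read-dominant operation mix are placeholders standing in for the actual data structures and parameters, not the dissertation's harness.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <random>
#include <set>
#include <thread>
#include <vector>

// Mutex-guarded stand-in for whichever concurrent set implementation is under test.
struct ConcurrentSet {
  bool insert(long k)   { std::lock_guard<std::mutex> g(m); return s.insert(k).second; }
  bool remove(long k)   { std::lock_guard<std::mutex> g(m); return s.erase(k) > 0; }
  bool contains(long k) { std::lock_guard<std::mutex> g(m); return s.count(k) > 0; }
  std::set<long> s;
  std::mutex m;
};

// Runs `threads` workers against `set`: a 2 s warm-up followed by a 10 s measured
// window, then returns completed operations per microsecond.
double runBenchmark(ConcurrentSet& set, int threads, long keyspace) {
  std::atomic<bool> measuring{false}, stop{false};
  std::atomic<long> completed{0};
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&, t] {
      std::mt19937_64 rng(t + 1);
      std::uniform_int_distribution<long> key(0, keyspace - 1);
      std::uniform_int_distribution<int> op(0, 99);
      long local = 0;
      while (!stop.load(std::memory_order_relaxed)) {
        int o = op(rng);                        // 90% search, 5% insert, 5% remove
        if (o < 90)      set.contains(key(rng));
        else if (o < 95) set.insert(key(rng));
        else             set.remove(key(rng));
        if (measuring.load(std::memory_order_relaxed)) ++local;
      }
      completed.fetch_add(local, std::memory_order_relaxed);
    });
  }
  std::this_thread::sleep_for(std::chrono::seconds(2));   // warm-up period
  measuring.store(true);
  std::this_thread::sleep_for(std::chrono::seconds(10));  // measurement window
  stop.store(true);
  for (auto& w : workers) w.join();
  return completed.load() / 1e7;   // 10 s = 1e7 microseconds
}

int main() {
  ConcurrentSet set;
  for (long k = 0; k < 500000; ++k) set.insert(2 * k);    // pre-populate to ~50%
  std::printf("throughput: %.2f ops/us\n", runBenchmark(set, 8, 1000000));
}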

5.6 Experimental Results

We display the results in Figures 5.2 and 5.3. We display the uniform distribution in the top row and the Zipfian distribution in the bottom row. Within each row, from left to right, we display the read-dominant, balanced, and write-dominant workloads.

[Figure 5.2: six panels in a 2×3 grid plotting System Throughput against Thread Count. Columns (left to right): Read-Dominant, Balanced, and Write-Dominant workloads; rows: Uniform (top) and Zipfian (bottom) distributions. Series: Optimistic-SL, NoHotSpot-SL, Fraser-SL, Unrolled-32, Unrolled-192.]

Figure 5.2: Experimental Results on Intel Xeon System for one million keys.

5.6.1 Results for 1 million keys

Uniform Distribution: We first consider the results from Figure 5.2 and the uniform distribution. At the maximal number of evaluated threads, the algorithms generally rank in the following order (from lowest to highest throughput): (i) Optimistic-SL, (ii) Unrolled-192, (iii) NoHotSpot-SL, (iv) Unrolled-32, and (v) Fraser-SL. We further summarize these results in Table 5.1 (all results are reported in completed operations per microsecond).

First let us consider Unrolled-32 and Unrolled-192. For both the read-dominant and balanced workloads, our data structure scaled very well up until 16 threads; then from 16

through 32 threads the performance plateaued somewhat. Once the thread count exceeded 32 threads, the performance once again increased (albeit somewhat slower than before). When we look at the write-dominant workload, the performance increases rapidly to 8 threads, then tapers off to a certain degree, but still continues to improve up to 64 threads. In the majority of situations, the instance with smaller nodes (Unrolled-32) provided substantially higher throughput, which indicates that larger nodes do not always provide better performance.

Table 5.1: Summary of results with 1 million keys and uniform distribution.

Algorithm        Read-Dominant             Balanced                  Write-Dominant
                 32 Threads  64 Threads    32 Threads  64 Threads    32 Threads  64 Threads
Optimistic-SL       7.87        8.69          2.10        2.19          1.37        1.34
Unrolled-192        5.51        9.88          5.53        8.20          5.23        5.92
Unrolled-32         9.72       15.37          8.56       13.32          7.44        9.54
NoHotSpot-SL        8.34       15.06          7.13       12.39          6.50       11.79
Fraser-SL          11.23       16.28          8.98       12.57          7.72       10.74

The Optimistic-SL appears to scale extremely poorly on both the write-dominated and balanced workloads; in fact, in the write-dominated workload this implementation failed to provide any performance gains beyond two threads. Our Unrolled-32 provided much better performance throughout the experimental range, specifically providing 76%/508%/611% better performance in the read-dominant/balanced/write-dominant workloads, respectively

(at 64 threads).

Our algorithm exhibited mixed results when compared against both NoHotSpot-SL and Fraser-SL. At lower thread counts (up to 16 threads), we provided significantly higher throughput, but once our Unrolled-32 exceeded 16 threads, its performance gains tapered off somewhat. When comparing to NoHotSpot-SL at 64 threads, Unrolled-32 provided 2% and 7% better performance in read-dominant and balanced workloads, but was outperformed by 23% on write-dominant workloads. Likewise, the Fraser-SL implementation provided 6%/12% better performance in the read-dominant and write-dominant workloads, respectively, while Unrolled-32 outperformed it by 6% in balanced workloads.

The performance of NoHotSpot-SL surprised us somewhat, since in the work by Crain, et al they claimed, “our implementation can be more than twice as fast as the JDK skip list” [23] (which in turn is based upon works by Fraser, Harris, and Michael [35, 41, 67]).

However, in our unbiased tests, we discovered that in many cases, Fraser-SL actually performed better than NoHotSpot-SL.

Zipfian Distribution: Under the Zipfian distribution (α = 0.5), the relative performances change somewhat. We can rank the algorithms at the maximal thread count as follows:

(i) Optimistic-SL, (ii) Unrolled-192, (iii) Unrolled-32, (iv) NoHotSpot-SL, and (v) Fraser-SL. We further summarize these results in Table 5.2.

Table 5.2: Summary of results with 1 million keys and Zipfian distribution.

Algorithm        Read-Dominant             Balanced                  Write-Dominant
                 32 Threads  64 Threads    32 Threads  64 Threads    32 Threads  64 Threads
Optimistic-SL       5.25        5.94          1.11        1.05          0.61        0.57
Unrolled-192        5.39       10.24          5.43        8.36          4.84        5.60
Unrolled-32         7.02       10.52          6.20        9.38          5.35        7.02
NoHotSpot-SL        6.14       10.66          6.35       10.14          5.49        9.31
Fraser-SL           8.55       12.40          7.53       10.54          6.76        9.39

As we compare Unrolled-32 and Unrolled-192 in the Zipfian distribution, we see that the node size has much less of an impact on performance with this distribution. In the read-dominant workload at 32 threads we see a 30% throughput advantage for Unrolled-32; however, at 64 threads that differential narrows to a mere 3%. We also see very similar comparisons to the lock-based Optimistic-SL. In the read-dominant/balanced/write-dominant workloads at 64 threads, our Unrolled-32 outperforms Optimistic-SL by 77%/866%/1132%. In the Zipfian distribution, we also notice that our algorithms perform significantly

worse for write throughput than any of the lock-free versions. The NoHotSpot-SL implementation outperforms Unrolled-32 by 8% for the balanced workload and 32.6% on the write-dominant workload, but we do provide roughly equivalent (within 1%) throughput

for read-dominant workloads. Likewise, Fraser-SL outperforms our Unrolled-32 by 18%/12%/34% on the read-dominant/balanced/write-dominant workloads.

[Figure 5.3: six panels in a 2×3 grid plotting System Throughput against Thread Count. Columns (left to right): Read-Dominant, Balanced, and Write-Dominant workloads; rows: Uniform (top) and Zipfian (bottom) distributions. Series: Optimistic-SL, NoHotSpot-SL, Fraser-SL, Unrolled-32, Unrolled-192.]

Figure 5.3: Results on Intel Xeon System for 10 million keys.

5.6.2 Results for 10 million keys

Our results for the experiments with the keyspace of 10 million keys appear in Figure 5.3. At first glance, the results appear very similar to the earlier results for one million keys, but we point out several key differences.

Uniform Distribution: We first consider the results from Figure 5.3 and the uniform distribution. At the maximal number of evaluated threads, the algorithms generally rank in the following order: (i) Optimistic-SL, (ii) Fraser-SL, (iii) Unrolled-192, (iv) Unrolled-32, and (v) NoHotSpot-SL. We further summarize these results in Table 5.3 (all results are reported in completed operations per microsecond).

Table 5.3: Summary of results with ten million keys and uniform distribution.

Algorithm        Read-Dominant             Balanced                  Write-Dominant
                 32 Threads  64 Threads    32 Threads  64 Threads    32 Threads  64 Threads
Optimistic-SL       3.12        3.12          1.11        1.06          0.61        0.57
Fraser-SL           5.19        9.50          5.11        9.07          5.10        8.90
Unrolled-192        5.88       10.71          6.28        9.44          5.96        6.81
Unrolled-32         5.86       10.71          6.41       10.57          6.21        8.13
NoHotSpot-SL        6.41       10.65          6.35       10.14          6.03       10.03

Our first observation suggests that for the read-dominant and balanced workloads node

size makes very little difference. The performance for Unrolled-32 and Unrolled-192

very closely mirror each other in the read-dominant workload. Likewise, in the balanced

workload we only see significant performance differences beyond 30 threads; at 64 threads

Unrolled-32 does enjoy a 12% throughput advantage. In the write-dominant workload,

however, the smaller node size (32 keys) does demonstrate a 19% performance advantage. As

we saw with earlier experiments, our Unrolled-32 demonstrates an enormous performance

advantage over Optimistic-SL; at 64 threads we measured 243%/897%/1326% performance

advantages in the read-dominant/balanced/write-dominant workloads.

With the larger key space sizes, our algorithms begin to compare considerably more favorably to the lock-free skiplists. In the read-dominant and balanced workloads, our algorithms demonstrate roughly equivalent performance (within 4%) to NoHotSpot-SL, and a 13% and 16.5% advantage over Fraser-SL. The lock-free algorithms do still outperform Unrolled-32 in the write-dominant workloads; Fraser-SL outperforms us by 9% and NoHotSpot-SL outperforms us by 23%.

Zipfian Distribution: When considering the Zipfian distribution for 10 million keys, the

relative rankings come in as: (i) Optimistic-SL, (ii) Fraser-SL, (iii) Unrolled-192,

(iv) Unrolled-32, and (v) NoHotSpot-SL. We summarize these results in Table 5.4.

Table 5.4: Summary of results with ten million keys and Zipfian distribution.

Algorithm        Read-Dominant             Balanced                  Write-Dominant
                 32 Threads  64 Threads    32 Threads  64 Threads    32 Threads  64 Threads
Optimistic-SL       2.52        2.55          0.89        0.89          0.39        0.37
Fraser-SL           4.44        7.32          4.26        6.87          4.06        6.28
Unrolled-192        5.27       10.07          5.47        8.04          5.96        6.81
Unrolled-32         5.46       10.02          5.32        8.64          6.21        8.13
NoHotSpot-SL        5.33        9.34          5.08        8.96          4.86        8.38

Again, here the performance differences between Unrolled-32 and Unrolled-192 are substantially smaller than with the smaller keyspace. The performance of the two instances is nearly identical in the read-dominant workload, and in the balanced and write-dominant workload, Unrolled-32 demonstrates 7% and 19% improved throughputs, respectively.

In this set of experiments, Unrolled-32 outperformed the lock-based Optimistic-SL by 292%/870%/2097% in the three workloads (read/balanced/write).

In the read-dominant workload, our implementation compared extremely favorably to the two lock-free skiplists; we report 7% and 37% improvements over NoHotSpot-SL and

Fraser-SL, respectively. When we move to the balanced and write-dominant workloads,

NoHotSpot-SL outperforms us by 7% in the balanced workload and 3% in the write-dominant workload; Unrolled-32 still substantially outperforms Fraser-SL in both of these workloads, however.

5.6.3 Discussion

In our experiments, we demonstrated that the unrolled skiplist provides substantially higher throughput than the lock-based optimistic skiplist in every instance we tested. Furthermore, our implementations compared very favorably to two current lock-free implementations in the read-dominant workloads. Our lock-based algorithms do suffer from degraded performance

in write-intensive workloads, however. Part of the reason for this is the effective decrease in lock granularity due to unrolling. In other words, every lock in our unrolled skiplist locks a larger subset of the data structure.

Processor Affinity: The thresholds of 16 and 32 threads are significant points in our system and denote transitions in how the system allocates cores and memory. The system in question is a cc-NUMA (Cache Coherent Non-Uniform Memory Access) system with two processors. Each processor consists of 16 physical cores, and each core supports two virtual cores via hyperthreading [1]. Furthermore, each processor maintains a three-level cache hierarchy. Each physical core maintains a private cache for the lowest two levels, and all of the cores within a processor share a level three cache. Finally, each processor controls half of the system's memory. An interconnect network maintains connections amongst the cores of a single processor, and a second interconnect handles traffic between the two sockets. The processors run a cache coherence protocol to ensure that each core's cache can access the data as needed.

If an application running on Processor A requires memory that is controlled by Processor B (and we assume that it is not stored in any level of Processor A's cache), then Processor A must initiate a sequence of messages to transfer the data between the processors. This necessarily causes a stall on Processor A, and it also consumes memory bandwidth on Processor B (especially if the data in question is not present in Processor B's cache).

The subject of processor affinity helps us explain this behavior [85, 91]. Specifically, the processor affinity determines how the operating system attempts to schedule multiple threads within a single process. The default behavior, or compact affinity, first attempts to schedule one task on each (physical) core of the same processor. Following that, the scheduler begins adding one thread per core of the second processor, and so forth. Once all of the physical cores are consumed, the scheduler then begins adding a second thread to each physical core to take advantage of hyperthreading.

This scheduling policy explains the behavior we see from 16 to 32 threads; during this sequence, the operating system is starting to schedule threads across both processors. This in turn penalizes our performance somewhat. Once we cross the next threshold of 32 threads, we see the benefit of additional threads outweighing the overhead involved in scheduling across multiple processors.
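For completeness, pinning each worker thread to a specific hardware thread can be done along the following lines on Linux; the simple identity mapping from worker index to CPU index is an assumption for illustration, since the core numbering (and therefore a true compact placement) is system-dependent.

#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

// Pin the calling thread to a single hardware thread (CPU index `cpu`).
static void pinToCpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  // pthread_setaffinity_np operates on the underlying pthread handle.
  if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
    std::fprintf(stderr, "failed to pin thread to CPU %d\n", cpu);
}

int main() {
  const int workers = 8;
  std::vector<std::thread> pool;
  for (int i = 0; i < workers; ++i)
    pool.emplace_back([i] {
      pinToCpu(i);   // identity mapping: worker i -> CPU i (illustrative only)
      /* ... run this worker's share of the benchmark here ... */
    });
  for (auto& t : pool) t.join();
}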

5.7 Conclusions and Future Work

As we saw in this chapter, our unrolled skiplist provides enormous throughput advantages over the existing optimistic lock-based skiplist. As we saw with unrolled linked lists, we have fewer pointers to traverse, and lookups can scan sequentially through consecutive memory locations. Additionally, our locking mechanism permits the vast majority of operations to complete while holding a single exclusive lock.

Our techniques are not ideal, however. We demonstrated that existing lock-free algorithms can outperform our algorithms in write-intensive workloads due to the decrease in granularity of our locks. In the next two chapters we will explore this behavior in more detail, and we will discuss a new locking technique which can overcome these limitations.

CHAPTER 6

THE SATURATION PROBLEM

6.1 Introduction

As new processors are introduced, core counts (and associated hardware thread counts) continue to rise. As of this writing, Intel offers the Xeon Phi processor with a maximum of 72 cores (288 threads total) [2] and a Xeon Platinum processor with a maximum of 28 cores that supports up to 8 processors per system (448 threads total) [3]. In this chapter, we demonstrate how our two lock-based data structures behave in a manycore environment. As we already saw in the previous experiments, our unrolled skiplist ceased demonstrating performance improvements beyond a certain thread count for several workloads (primarily write-intensive ones). In this section we show that this behavior also applies to the unrolled linked list, and we discuss why this is the case.

In this chapter we introduce the concept of saturation in lock-based data structures. When we consider a given data structure (including whatever parameters are used to construct it), we may encounter a situation where we can no longer increase the throughput in the data structure by increasing the thread count. Furthermore, for some data structures and thread counts, we may even see the performance decrease as we add additional threads. We define this thread count as the saturation point or saturation range.

6.2 Demonstrating Saturation on Manycore Systems

In order to verify this hypothesis, we replicated our previous experiments from Chapters 4 and 5 on a manycore system. Specifically, we performed our experiments on an Intel Xeon Phi 7250 system containing 68 physical cores, each operating at 1.4 GHz with 4 hardware threads per core (272 hardware threads total). The system contains 16 GB of “near” Multi-Channel DRAM and 112 GB of “far” DDR4 RAM running Linux version 3.1.0. Its software stack is based upon g++ version 7.1.0. In addition to the prior compiler flags, we utilized the option -march=knl to generate code optimized for the Intel Xeon Phi system.

We used the Intel Xeon Phi for three reasons. First, the increased thread count showcases how our algorithm performs at the limit of current manycore technologies. Second, the Intel

Xeon Phi platform is steadily gaining more acceptance in the community; three out of the top 10 sites in the Top500 list now utilize the Intel Xeon Phi [86]. Third, Intel Xeon Phi is a preview of future offerings; shared-memory systems now support nearly 400 hardware threads, and future systems will further increase that number.

6.3 Saturation in Unrolled Linked Lists

[Figure 6.1: three panels plotting System Throughput against Thread Count for the Read-Dominant, Balanced, and Write-Dominant workloads (left to right). Series: K = 8, K = 64, K = 128.]

Figure 6.1: Performance of Unrolled Linked Lists on Intel Xeon Phi System.

In addition to evaluating the unrolled linked list with 8-key and 64-key nodes, we also included nodes with up to 128 key-data pairs. The results of our experiments for the unrolled linked lists can be found in Figure 6.1. From left to right we display the read-dominant, balanced, and write-dominant workloads.

We first consider the instance with the smallest nodes (K = 8). In the read-dominant

workload, throughput peaks at approximately eighty threads, then tapers off through 120

threads. After 120 threads, the performance actually decreases somewhat. The balanced workload exhibits a similar behavior, except the performance peaks at approximately one hundred threads and plateaus through 140 threads. Finally, we see that the write throughput peaks at approximately 120 threads and remains fairly steady throughout the remainder of the range. This may seem rather counterintuitive at first; however, in the read-dominant workload the overall throughput is generally considerably higher throughout the experimental range.

The medium-sized nodes (K = 64) present a somewhat different picture. In the read-dominant workload, throughput increases rapidly up through approximately 120 threads, at which point performance levels off to a certain degree but does not drop significantly. When measuring the balanced workload, the performance again scales quite well through approximately 80 threads, at which point it plateaus through the 200-thread mark and then begins to decline. Finally, the write-dominant workload plateaus at approximately 60 threads and begins to decline at approximately 120 threads.

Finally we consider the largest nodes (K = 128). These behaved quite differently from either of the two previous node sizes. In the read-dominant workload, we see very similar

(and from approximately 120-220 threads, slightly better) performance to the medium-sized nodes. However, the performance tends to drop off more significantly after 200 threads.

In both the balanced and write-dominant workloads, however, the largest nodes provided substantially lower throughput than the medium-sized nodes (and in some cases, even the smallest nodes).

[Figure 6.2: six panels in a 2×3 grid plotting System Throughput against Thread Count. Columns (left to right): Read-Dominant, Balanced, and Write-Dominant workloads; rows: Uniform (top) and Zipf (bottom) distributions. Series: K = 32, K = 192.]

Figure 6.2: Performance of Unrolled Skiplist on Intel Xeon Phi System.

6.4 Saturation in Unrolled Skiplists

The results for our experiments for the unrolled skiplists can be found in Figures 6.2 and 6.3.

In each figure, we display the read-dominant, balanced, and write-dominant workloads from left to right. Furthermore, we display the uniform distribution in the top row and the Zipfian distribution in the bottom row.

When we consider the smaller keyspace (10⁶ keys), we can clearly see that both instances of the unrolled skiplist scale extremely well in the read-dominant workloads up until approximately two hundred threads, at which point the performance appears to taper off somewhat.

[Figure 6.3: six panels in a 2×3 grid plotting System Throughput against Thread Count. Columns (left to right): Read-Dominant, Balanced, and Write-Dominant workloads; rows: Uniform (top) and Zipf (bottom) distributions. Series: K = 32, K = 192.]

Figure 6.3: Performance of Unrolled Skiplist on Intel Xeon Phi System with 10 million keys.

However, as we increase the write percentage into the balanced and write-dominant workloads, the skiplists begin to saturate shortly after one hundred threads for the balanced workload and at approximately 32 threads in the write-dominant workload.

As we migrate to the larger keyspace (10⁷ keys), the balanced workload appears to scale somewhat better than in the smaller keyspace, continuing to improve up through approximately 200 threads. However, we see a very similar saturation behavior in the write-dominant workload; it ceases to improve throughput past approximately 80 threads.

6.5 Further Exploration of Saturation

One property of exclusive locks is that while a thread owns that lock, no other threads can manipulate the segment of the data structure that it protects. This is advantageous in that it permits complex operations to occur in a thread-safe manner. However, if multiple threads find themselves contending for a single lock, at least one thread will be forced to block. We use the term point contention to indicate the quantity and frequency of this occurrence. When there are relatively few threads in relation to the number of locks in a data structure, the probability of multiple threads contending for a single lock is very low. As this proportion increases, the rate of point contention increases, and as the point contention rate increases, threads find themselves spending more time blocked on exclusive locks instead of performing useful work. Our unrolled linked lists and skiplists are especially susceptible to this phenomenon due to their layout of multiple key-value pairs. Since this layout essentially reduces the granularity of each lock, we increase the probability that multiple threads will compete for a given lock.

Cache coherence also contributes to this saturation effect. Whenever a thread issues a write to memory, the underlying cache coherence protocol forces any other processor to evict the associated cache line from its local cache. Therefore, the next time a process reads the memory associated with that cache line, it must in turn fetch it from much slower memory (generally either the last-level cache or main memory). These accesses tend to be at least one or two orders of magnitude slower than accessing primary cache [71]. In the next chapter, we will explore how we can alleviate this problem by allowing multiple threads to operate safely on a given node.

CHAPTER 7

INCREASING CONCURRENCY WITH GROUP MUTUAL EXCLUSION

As we saw in previous chapters, lock-based data structures can provide excellent throughput for a variety of workloads. Furthermore, we have designed unrolled variants of two concurrent data structures which provide substantially increased throughput over their antecedents.

However, we also demonstrated that such a data structure will saturate at a certain degree of concurrency. In this chapter we will introduce a concept called intra-node concurrency, demonstrate how it can be implemented using group mutual exclusion, select appropriate

GME algorithms, and demonstrate its benefits over our previous data structures.

7.1 About Intra-Node Concurrency

Unrolling a data structure does have its drawbacks. Since each node contains multiple keys, locking a single node effectively locks many elements in the data structure. One way of ameliorating this effect would be to implement a lock-free variant of this data structure.

To create a lock-free version of this data structure, we need to manage concurrent linearizable insert and remove operations while maintaining the set invariant. Several techniques exist for this; however, these techniques may either require substantial extra work (such as read-copy-update [63]) or disallow the reuse of deleted entries within the data structure (e.g., [15, 16]). This raises the question, “what if we permit concurrent insert or remove operations to manipulate the same node, but not both?” Our technique answers this question.

If we consider a sequence of successful insert operations on a node with an arbitrary number of available slots, we observe that the first operation replaces the first sentinel, the second operation replaces the second sentinel, and so forth. As long as there are no intervening remove operations, this pattern will continue. We make use of this observation to introduce intra-node concurrency to our algorithm.

We begin by replacing the exclusive lock in each node with a GME object that supports T + 2 types (where T is the number of threads in the system). We allocate one type to each process plus the shared types TypeInsert and TypeRemove (which are available to all processes). A process can request one of three types: requesting its unique type shall grant exclusive access, while requesting TypeInsert or TypeRemove shall permit other processes access so long as they request the same type.

To insert a key into a node, a process enters a TypeInsert session and attempts to use a CAS operation to replace the first sentinel in the node with its desired key, and the process checks the return code. We continue attempting this CAS until either (i) our CAS succeeds (meaning we succeeded in inserting our value), (ii) our CAS fails and a re-read of the slot indicates another process inserted our key, or (iii) we run out of space in the node. Removing a key operates in a similar manner. The process acquires a TypeRemove session, locates the key, and attempts to remove it via CAS. In the case of a remove, repeated CAS operations are unnecessary; a single failed CAS indicates that another process successfully removed the target key.

Since we permit either concurrent insert operations or concurrent remove operations within a node, but never both at once, we avoid the complexity of handling both types of operations concurrently. We handle node splits and merges under the purview of exclusive locks, thus obviating the need to perform the extra work associated with lock-free data structures. This creates a lock-based data structure which supports higher degrees of concurrency within each node without introducing most of the complexity of a lock-free data structure.

While investigating intra-node concurrency, we observed that both insert and remove operations operate in three phases: a seek phase, a simple phase, and (possibly) a complex phase. The seek phase operates lock-free, the simple phase requires a single lock, and the complex phase requires (possibly multiple) additional locks. The majority of operations will complete in the simple phase; therefore, we shall optimize that phase with GME. If we cannot complete our operation in the simple phase, we must proceed to the complex phase. We utilize an exclusive session to acquire the additional locks and complete the

operation. Ideally, we want only a single concurrent process (per node) operating in the complex phase. At first glance, this might suggest that we need some means of converting our previous session to an exclusive session (either by providing a mechanism to convert a session from one type to another or by releasing and reacquiring the lock). We will explain later how we avoid this issue.

7.2 Evaluation of Group Mutual Exclusion Algorithms

In order to effectively implement our ideas, we will construct a mutex object that implements

a GME algorithm. We anticipate our algorithm executing on systems with very high thread

counts (in excess of 200), but we do not anticipate high degrees of contention in the expected

case. Based upon these observations, we developed these requirements for our algorithm:

Mutual Exclusion: If two processes are in the critical section (CS) concurrently, then they must have requested the same session-id.

Deadlock Freedom: If no process is in the CS and one or more processes request a session-id, then at least one will eventually succeed.

Memory Efficiency: Since we will be creating many of these GME objects, we require that they be memory-efficient. Specifically, we require the ability to construct objects in O(1) per-object space (plus optionally up to O(T) overhead shared among all objects).

Low Best-Case RMR Count: Research in GME algorithms tends towards optimizing RMR complexity in the worst case. However, in our skiplist (and to a lesser degree our unrolled linked list), we do not anticipate high degrees of contention in the expected case. Therefore, we believe it is equally important that an algorithm possess excellent best-case RMR performance as well.

Note that the mutual exclusion and deadlock freedom requirements are identical to criteria P1 and P2' from Section 3.4. In addition to the four above requirements, we would prefer that our algorithm also satisfies P2, P3, P4, and at least one of the fairness criteria from that section.
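Throughout the remainder of this chapter, we assume each candidate GME object exposes an interface along the following lines; this is a sketch with illustrative names rather than the dissertation's exact API. Threads that request the same session-id may occupy the critical section together.

#include <cstdint>

using session_t = uint32_t;

class GroupMutex {
 public:
  virtual ~GroupMutex() = default;
  virtual void lock(session_t session) = 0;  // block until `session` is compatible
  virtual void unlock() = 0;                 // leave the current session
};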

7.2.1 Survey of Potential GME Algorithms

A TTAS-based algorithm

The first algorithm we consider is arguably one of the simplest to implement (Algorithm 7.1); it is based on the Test-and-Test-and-Set (TTAS) mutual exclusion algorithm [80]. Our lock object consists of a state variable subdivided into two fields: type indicates the current session type (or ⊥, indicating no active type) and count indicates the number of processes active in the critical section.¹

ALGORITHM 7.1: A TTAS Based Algorithm

 1  Function lock(type) : void
 2    repeat
 3      repeat
 4        expect ← state.load()
 5      until (expect.type = type ∨ expect.type = ⊥)
 6      update ← (type, expect.count + 1)
 7    until (state.CAS(expect, update))

 8  Function unlock() : void
 9    repeat
10      expect ← state.load()
11      if (expect.count = 1) then
12        update ← (⊥, 0)
13      else
14        update ← expect
15        update.count ← update.count − 1
16    until (state.CAS(expect, update))

When attempting to enter, the process reads state until the lock reflects a compatible session. We then create a new state variable by incrementing the previous count and (possibly) setting the session type to our requested type. If we succeed in updating the state with CAS, we return. Releasing the lock reverses the sequence.

This lock satisfies mutual exclusion and deadlock freedom and consumes constant space (on our system, we used a single 64-bit machine word); furthermore, in the absence of contention a process can acquire the lock in a constant number of its own steps. However, this lock is clearly not fair, and processes may starve in high-contention environments.

¹ In practice, we utilized a single 64-bit machine word partitioned into two 32-bit integers.
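To make the footnote's packing concrete, here is a minimal C++ sketch of this TTAS-based GME lock, assuming the session type occupies the high 32 bits of a single 64-bit word and the participant count the low 32 bits (with type 0 standing in for ⊥); it illustrates Algorithm 7.1 and is not the dissertation's exact code.

#include <atomic>
#include <cstdint>

class TTASGroupLock {
 public:
  void lock(uint32_t type) {
    for (;;) {
      uint64_t expect = state_.load(std::memory_order_acquire);
      uint32_t cur_type = static_cast<uint32_t>(expect >> 32);
      uint32_t count = static_cast<uint32_t>(expect);
      if (cur_type != 0 && cur_type != type) continue;   // spin until compatible
      uint64_t update = (static_cast<uint64_t>(type) << 32) | (count + 1);
      if (state_.compare_exchange_weak(expect, update,
                                       std::memory_order_acq_rel)) return;
    }
  }
  void unlock() {
    for (;;) {
      uint64_t expect = state_.load(std::memory_order_acquire);
      uint32_t cur_type = static_cast<uint32_t>(expect >> 32);
      uint32_t count = static_cast<uint32_t>(expect);
      uint64_t update = (count == 1)
          ? 0   // last participant clears the session type
          : (static_cast<uint64_t>(cur_type) << 32) | (count - 1);
      if (state_.compare_exchange_weak(expect, update,
                                       std::memory_order_acq_rel)) return;
    }
  }
 private:
  std::atomic<uint64_t> state_{0};
};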

Keane and Moir’s queue-based algorithm

Our second lock implements the algorithm by Keane and Moir [60]. This algorithm satisfies the properties of Mutual Exclusion (P1) and Starvation Freedom (P2), but it violates the other desirable properties listed in Table 3.1. Our implementation maintains the following variables per lock instance: a gatekeeper mutex object, session to indicate the current session, count to indicate the number of processes active in that session, and a local queue for waiting processes. Our algorithm utilizes a queuing approach similar to the well-known MCS mutual exclusion algorithm [66]. Specifically, our implementation maintains a global array of queue nodes, each of which contains two variables: wants (which is set to ⊥ when no session is desired) and next (to indicate the next entry in the queue). This approach reduces the per-object memory requirements to a constant, thus satisfying one of our requirements for a GME algorithm.

When requesting session-id S, the process acquires the gatekeeper lock and examines session and the head of the queue. If there are no processes in the queue and either the lock is free or the lock is currently executing session S, then the process can increment the count of participating processes and release the gatekeeper lock. Otherwise, the process sets the wants field of its queue node to S, adds its own node to the queue, releases the gatekeeper lock, and spins until another process wakes it up by setting its wants field to ⊥.

When exiting the lock, a process acquires the gatekeeper lock, decrements the count, and checks whether the count is zero. If so, the process determines the session-id S that the process at the head of the queue is awaiting. The unlocking process then examines every node in the queue, and if a node is awaiting session S, it sets that process's wants slot to ⊥ (which awakens that process) and removes it from the queue. The unlocking process then releases the gatekeeper lock. This implies that a thread needs at most M RMRs to initiate a new session (where M is the number of threads actively contending for the lock).

Bhatt and Huang’s f-array based algorithm

Bhatt and Huang proposed an algorithm based upon the concept of f-arrays, originally proposed by Jayanti [14, 57]. An f-array object stores the result of some function computed over values contributed by the participating threads in the system. For example, an f-array can be implemented to calculate sums, averages, minimums, or maximums; this function can be updated (wait-free) in either O(N) or O(log N) time, depending on the underlying organization. The result can be retrieved in constant time.

Bhatt and Huang utilize an f-array object to implement a priority queue, and their algorithm breaks session-ids down into two “sides”, each with an associated waiting room. When requesting a session-id, a process picks the preferred side for that session-id unless that side is currently active and other processes are waiting. It then updates its (own) entry in the underlying f-array. Upon exiting the critical section, a thread determines whether it is the last process to leave its session; if that is the case, it determines the desired session-id and side of the highest-priority thread in the f-array, and awakens all threads in the appropriate waiting room.

While Bhatt and Huang's algorithm does provide logarithmic (specifically O(log N)) worst-case performance and satisfies most of the fairness properties in Table 3.1, we do not believe this will be a good choice for our GME object. First of all, the underlying f-array object requires O(N) per-instance storage, and each GME instance requires an instance of this object. Second of all, we do not expect this algorithm to provide good performance in the (expected) lightly-contested case. In other words, even if there is only one thread attempting to enter a session-id, the algorithm will still require O(log N) RMRs in order to update the f-array.

7.2.2 Selection of GME Algorithm

Based upon the criteria in Table 3.1 and the previous analysis, we selected the TTAS-based algorithm and Keane and Moir's queue-based algorithm. We implemented both of these algorithms as C++ objects and constructed GME-based locking data structures with them.

7.3 Introducing Intra-Node Concurrency to the Unrolled Linked List

7.3.1 Algorithm Overview

In order to increase concurrency in the unrolled linked list (from Chapter 4), we first modify the node structure to support controlled concurrency. By controlled concurrency, we mean that multiple threads may attempt to modify a given memory location, and either all of the operations will succeed in some arbitrary order (for example, if incrementing an integer) or only one operation will succeed while others fail (for example, when updating a key-data pair). We implement this by changing the count field to an equivalent atomic type that supports the fetch-and-add (FAA) operation, and we change the key-data pairs to an atomic type that supports compare-and-swap (CAS) (as per Section .
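A minimal sketch of a node prepared for this controlled concurrency might look as follows, assuming 64-bit keys and a reserved value standing in for ⊥; the field and constant names are illustrative, the associated data values are omitted, and the GME object (the second modification, described next) is not shown.

#include <atomic>
#include <cstdint>

constexpr int kKeysPerNode = 64;            // degree of unrolling (assumed)
constexpr uint64_t kEmpty = UINT64_MAX;     // sentinel playing the role of ⊥

struct UnrolledNode {
  std::atomic<uint64_t> keys[kKeysPerNode]; // key slots, updated with compare-and-swap
  std::atomic<int> count{0};                // occupancy, updated with fetch-and-add
  std::atomic<bool> marked{false};          // logical-deletion flag
  UnrolledNode* next{nullptr};
  // ... GME object replacing the exclusive lock goes here ...

  UnrolledNode() {
    for (auto& k : keys) k.store(kEmpty, std::memory_order_relaxed);
  }
};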

Our second modification permits us to allow either concurrent insert operations or con- current remove operations (but not both) to operate within a node. We accomplish this by replacing the lock field with a GME object that supports multiple session types. For our purposes, we shall refer to TypeInsert as a shared type that threads can request to per- form concurrent insert operations, TypeRemove as a shared type that threads can request for concurrent remove operations, and TypeExclusive as a type exclusive to that thread.

The TypeExclusive session may be implemented by assigning each thread a unique type, or the GME object may implement this with a dedicated code path. Our implementation utilizes the former method, but any GME object that supports these three types will suffice.

At a high level, when a thread performs an insert operation, it first acquires a TypeInsert session on the predecessor's lock. Once this lock is acquired, it attempts to insert the key-data pair into the node. If this succeeds, we return true; however, if we must split the node, we must serialize the operation through an exclusive lock.

Likewise, when a thread performs a remove operation, it acquires a TypeRemove session on the predecessor's lock and attempts to remove the key-data pair. If this removal results in an excessively sparse node, we again serialize the merge or redistribute operation through an exclusive lock.

7.3.2 Algorithm Detail

Insert

Our modified insert algorithm (Algorithm 7.2) operates as follows. As per the original algorithm, we invoke scan to locate an appropriate (prev, curr) window. Second, we acquire a TypeInsert session on the lock. We then locate the first instance of ⊥ in the node and then check to see if key already exists (lines 8-9). Note that these lines must be executed in this order; otherwise, a violation of the set invariant may occur. For example, consider two threads (A and B) that concurrently attempt to insert the same key into the list. If thread A checks if key exists, does not find it, and then gets suspended before locating ⊥, then thread B may insert key before thread A locates ⊥. Therefore, when thread A wakes up, it would start inserting keys to the right of B's insertion point, and a violation of the set invariant could occur.

ALGORITHM 7.2: Modified Insert

 1  Function insert(key, data) : boolean
 2    keydata ← make-pair(key, data)
 3    while true do
 4      (prev, curr) ← scan(item)
 5      prev.lock(TypeInsert)
 6      if ¬validate(prev, curr, item) then
 7        continue                          /* Re-scan due to change */
 8      slot ← first location of ⊥ in curr (or +∞)
 9      if curr.contains(item) then
10        return false
        // Walk through the list and try to CAS in the new value
11      while slot < K do
12        expect ← make-pair(⊥, ⊥)
13        if (CAS(curr.keys[slot], expect, keydata)) then
14          FAA(curr.count, 1)
15          return true
16        else if curr.keys[slot] = key then
17          return false
          // Find the next free slot
18        while (slot < K and curr.keys[slot] ≠ ⊥) do
19          slot ← slot + 1
20        if (curr.keys[slot] = key) then
21          return false
        // Node is full. Split per the original algorithm
22      curr.lock(TypeExclusive)
23      if ¬validate(prev, curr, item) then
24        continue                          /* Re-scan due to change */
25      Split curr into two new nodes
26      Insert keydata into the appropriate node
27      curr.marked ← true
28      prev.next ← first new node
29      return true

After locating an initial insertion point and asserting that key does not exist in the node, we begin trying to replace a sentinel with the new key-value pair via compare-and-swap (CAS). Several possibilities can occur. First, we could attempt to replace a sentinel with our key-value pair and succeed; we then increment count (with fetch-and-add) and return true. Second, we could attempt to invoke CAS and fail; in this case we check the new value of the slot. If this new value equals key, we return false (since another thread has successfully inserted our key-data pair); otherwise, we scan for the next open slot. If we reach the end of the node, we must split the node.

Splitting the node initiates the complex phase of our operation; we require this phase to be serialized. We explored several methods of accomplishing this. First, we could release the TypeInsert session and acquire a TypeExclusive session. However, this introduces the possibility of interloper threads performing an arbitrary sequence of operations in the intervening time period. The second possibility involves introducing an “upgrade” capability to our GME object, similar to the UpgradeLockable concept in the Boost libraries [25].

We instead utilized a third option: we serialize our operations upon acquiring the lock on the current node (line 22). This necessitates an additional validate operation to detect whether a competing thread has split the node. If this succeeds, the operation completes as per the original algorithm. This approach works because once the node is full, every thread will attempt to (exclusively) lock curr before it can modify the list, effectively serializing the operations.

Remove

The modified remove procedure (Algorithm 7.3) operates as follows. We first locate the appropriate window, request a TypeRemove session on prev, and invoke validate. We then scan curr for the key and remove it with CAS if it exists. If either the key does not exist or the CAS fails (indicating another thread removed the key first), we return false. Otherwise, our operation has succeeded, but some housekeeping work may remain.

We decrement the count (using fetch-and-add), and if the node is still sufficiently full, we can return true. Otherwise, we must merge the node; however, we cannot serialize through curr's lock. Other threads may still be actively removing keys from the node; therefore, we choose to release the lock on prev and re-acquire a TypeExclusive session. When this succeeds, we first revalidate and recheck the density, since other threads may have performed additional

ALGORITHM 7.3: Modified Remove

 1  Function remove(item) : boolean
 2    while true do
 3      (prev, curr) ← scan(item)
 4      prev.lock(TypeRemove)
 5      if ¬validate(prev, curr, item) then
 6        continue
 7      slot ← curr.contains(item)
 8      if slot is not defined then
 9        return false
10      expect ← key/data contents of curr.keys[slot]
11      if expect.key ≠ item then
12        return false
13      update ← make-pair(⊥, ⊥)
        // Attempt to remove the key. If this fails, another thread already removed it
14      if ¬CAS(curr.keys[slot], expect, update) then
15        return false
16      FAA(curr.count, -1)
17      if curr.count ≥ MinFull then
18        return true
        // Node dropped below the minimum density. May need to merge or rebalance
19      prev.unlock(); prev.lock(TypeExclusive)
20      if (curr.count > MinFull ∨ ¬validate(prev, curr, item)) then
21        return true
22      curr.lock(TypeExclusive)
23      if curr.count = 0 then
24        curr.markNode()
25        prev.next ← curr.next
26        return true
27      succ ← curr.next
28      succ.lock(TypeExclusive)
29      if curr.count + succ.count > K then
30        (node1, node2) ← rebalance(curr, succ)
31        node2.next ← succ.next; node1.next ← node2
32      else
33        node1 ← merge(curr, succ)
34      curr.markNode(); succ.markNode()
35      prev.next ← node1
36      return true

operations in the meantime. We also consider the case where (multiple) concurrent threads remove all keys from a node; we address this by simply removing the entire node from the list (and returning true). If our checks indicate that merging is still appropriate, then we either merge or rebalance per the original algorithm.

7.3.3 Experimental Evaluation

We replicated the earlier experiments from Chapter 6 to compare our GME-enabled linked list with our earlier linked list implementation. Specifically, we evaluated against the same Intel Xeon Phi System with the following parameters:

1. Implementations: We evaluated our unrolled linked list (Unrolled) alongside its GME-enabled counterparts. We implemented both the TTAS-based GME algorithm (TTAS-GME) and Keane and Moir's queue-based GME algorithm (Queue-GME).

2. Node Size: We evaluated the performance with K of 8, 64, and 128 keys per node.

3. Workload Distribution: We evaluated performance against three representative workloads: Write-Dominant with no lookups, 50% inserts, 50% removes; Mixed with 50% lookups, 25% inserts, 25% removes; and Read-Dominant with 90% lookups, 5% inserts, 5% removes.

4. Degree of Concurrency: We evaluated the performance in 8 thread increments up to the maximum number of hardware threads.

5. Maximum List Size: Keys were selected uniformly from the half-open interval [0, 5000).

As before, each experiment consisted of pre-populating the list to 50% full utilizing keys uniformly selected within the interval. Following that, we spawned the desired number of threads and allowed them to run for ten seconds, after which we measured the number of completed operations. Each data point represents the average of ten experiments. The results are reported in Figure 7.1.

[Figure 7.1: nine panels in a 3×3 grid plotting System Throughput against Thread Count. Columns (left to right): Read-Dominant, Balanced, and Write-Dominant workloads; rows: K = 8, K = 64, and K = 128. Series: Unrolled, TTAS-GME, Queue-GME.]

Figure 7.1: Performance of Unrolled Linked Lists on Intel Xeon Phi System.

First we consider the smallest node size (K = 8). In such a small node, we do not expect much opportunity for increased concurrency; even with over two hundred threads, we expect collisions to occur relatively infrequently. However, we do see noticeable improvements in performance for each simulated workload as compared to the base unrolled list. At 272 threads, Queue-GME exhibited a 141%/65%/17% improvement in the read-dominant/balanced/write-dominant workloads. Furthermore, the throughput for this instance continues to improve throughout most of the tested range (up until approximately 180 threads). The TTAS-GME instance provided improved throughput over Queue-GME throughout the entire lower and middle thread range. However, as the thread count exceeded 200 threads, the performance gap between TTAS-GME and Queue-GME narrowed substantially. At 272 threads, TTAS-GME exhibited a 142%/87%/41% improvement over the original unrolled list in the read-dominant/balanced/write-dominant workloads.

Next we look at the medium-sized node (K = 64). In this larger node, we expect more opportunities for increased concurrency with group mutual exclusion. However, we also see that the data structure appears to exhibit greater degrees of saturation even in read-dominant workloads. At 272 threads, we see that Queue-GME actually performs slightly worse than the original unrolled linked list. When we consider 100 threads, we do see improvements of 18%/13%/4% in the read-dominant/balanced/write-dominant workloads. The (albeit unfair, but simpler) TTAS-GME algorithm performed even better, providing 3%/110%/88% improvements at 272 threads, and 25%/54%/84% improvements at 96 threads.

7.3.4 Conclusions

Our GME-enabled variants of the unrolled linked list do exhibit considerable throughput increases over certain parts of the experimental range. However, even with intra-node concurrency enabled, we do still see the effects of saturation in this data structure. Consider, however, that with K = 64 we can expect approximately 32 keys per node and approximately 2,500 keys in the data structure (since we have filled the data structure 50% full and are performing equal numbers of insert and remove operations). This gives us approximately 78 expected nodes in the data structure. Therefore, when we consider executing 200 threads, we can expect that, on average, 2.5 threads are contending for every node.
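A rough back-of-the-envelope version of this estimate, under the stated assumptions (2,500 resident keys, half-full 64-slot nodes, 200 active threads), is:
$$\text{nodes} \approx \frac{2{,}500\ \text{keys}}{32\ \text{keys per node}} \approx 78, \qquad \text{contention} \approx \frac{200\ \text{threads}}{78\ \text{nodes}} \approx 2.5\ \text{threads per node}.$$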

92 Therefore, to expect this data structure to scale well beyond 200 threads is most likely unreasonable. Our solution does in fact provide a substantial gain in throughput through the majority of the experimental range. In the next section, we shall extend this idea to the unrolled skiplist and demonstrate how a less-densely packed data structure can benefit significantly from this technique.

7.4 Introducing Intra-Node Concurrency to the Unrolled Skiplist

Introducing intra-node concurrency into the unrolled skiplist follows directly from our implementation on the unrolled linked list. We make the same sets of changes to each node of the list: we replace the exclusive lock with a GME object, we replace the keys array with a compatible type that supports CAS, and we replace the count member with an integral type that supports FAA. As in the unrolled linked list, we focus on increasing the concurrency of simple operations, and implement complex operations under the purview of exclusive locks.

7.4.1 Algorithm Detail

Insert: Our modified insert operation changes the addKey method (Algorithm 7.4). First, we locate an appropriate slot for insertion, then scan the node for key. If such a slot exists and key is not found, we attempt to insert key into the node. We first check if a slot contains ⊥; if so, we attempt to CAS our key into place. If that CAS succeeds, we atomically increment count and return Success. Otherwise, we check if the slot contains key; if so, we return Collision. We must execute lines 2 and 3 in this particular order to avoid violating the set invariant (as we previously saw in the unrolled linked list). Once sharedAddKey is in place, we modify insert (Algorithm 5.3, line 5) to acquire a TypeInsert session, and this suffices to enable GME for simple operations. As with the unrolled linked lists, we serialize complex insert operations through the second (exclusive) lock. Unlike the unrolled linked list, we may need to acquire significantly more than one additional exclusive lock (depending on the heights of the new nodes).

ALGORITHM 7.4: SharedAddKey

 1  Function sharedAddKey(node, key) : int
 2    slot ← contains(node, ⊥)
 3    if (contains(node, key) ≠ nil) then return Collision
 4    if (slot = nil) then return NodeFull
 5    while (slot < K) do
 6      expected ← node.keys[slot]
 7      if (expected = ⊥ ∧ CAS(node.keys[slot], expected, key)) then
 8        FAA(node.count, 1)
 9        return Success
10      expected ← node.keys[slot]
11      if (expected = key) then return Collision
12      slot ← slot + 1
13    return NodeFull

Remove: Modifying the removeKey method (Algorithm 5.8) requires changing the write to utilize a CAS and check the result (Algorithm 7.5). If the CAS succeeds, we atomically decrement the count and return Success; otherwise, we return NotFound (since another process concurrently removed that key).

We can utilize this sharedRemoveKey convenience function to update the simple phase of remove (Algorithm 5.7). Again, we acquire a TypeRemove session in line 5 and invoke the sharedRemoveKey method. If this completes, we return the appropriate value; otherwise, we proceed to the complex phase. Unlike the insert method, we cannot effectively serialize our writes through the preds[0] lock, because other processes may be concurrently removing more elements. We are faced with the choice to either release the lock on curr and re-acquire it as an exclusive session or implement an upgrade mechanism in our GME lock.

We instead check curr for fullness prior to acquiring its lock. If the operation seems liable to reduce curr's count below MinFull, we instead acquire an exclusive session on curr and proceed with the merge or rebalance. By using this method, we do accept that nodes may drop below the

ALGORITHM 7.5: SharedRemoveKey

 1  Function sharedRemoveKey(curr, key) : int
 2    slot ← contains(curr, key)
 3    if (slot = nil) then return NotFound
 4    if (curr.count ≤ MinFull) then return MustMerge
 5    if (CAS(curr.keys[slot], curr.keys[slot], ⊥)) then
 6      FAA(curr.count, -1)
 7      return Success
 8    return NotFound

MinFull threshold by up to T. In our opinion, this is outweighed by avoiding the cost of the other two options.

7.4.2 Experiment Setup

We replicated the earlier experiments from Chapters 5 and 6 to compare our GME-enabled skiplist with our earlier skiplist implementation. Specifically, we evaluated against the same Intel Xeon System and Intel Xeon Phi System with the following parameters:

1. Implementations: We evaluated the following data structures:

(a) the lock-based optimistic skiplist by Herlihy et al. (Optimistic-SL) [47],

(b) a lock-free skiplist described by Fraser (Fraser-SL) [35],

(c) a lock-free “No Hot Spot” skiplist described by Crain et al. (NoHotSpot-SL) [23],

(d) our unrolled skiplist described in Chapter 5 (Unrolled-SL),

(e) the GME-enabled skiplist from this chapter, using our TTAS-based GME algorithm (TTAS-SL),

(f) the GME-enabled skiplist from this chapter, using Keane and Moir's queue-based GME algorithm [60] (Queue-SL).

2. Node Size: We evaluated the performance with K of 32 and 192 keys per node.

95 3. Workload Distribution: We evaluated performance against three representative

workloads: Write-Dominant with no lookups, 50% inserts, 50% removes; Mixed with

50% lookups, 25% inserts, 25% removes; and Read-Dominant with 90% lookups, 5%

inserts, 5% removes.

4. Degree of Concurrency: We evaluated the performance in 8 thread increments up

to the maximum number of hardware threads.

5. Maximum List Size: We performed two sets of experiments. In the first set, keys were selected from the half-open interval [0, 1,000,000), and in the second set, keys were selected from [0, 10,000,000).

7.4.3 Intel Xeon System

As before, each experiment consisted of pre-populating the list to 50% full utilizing keys uniformly selected within the interval. Following that, we spawned the desired number of threads and allowed them to run for ten seconds, after which we measured the number of completed operations. Each data point represents the average of ten experiments; we report these results in Figures 7.2–7.5.

Since we already explored the performance of the unrolled version of this data structure in Chapter 5, we shall focus on the performance of the TTAS-SL and Queue-SL implementations and their relationships with the previous implementations we analyzed.

One Million Keys

We first focus on the uniform distribution; we summarize these results in Table 7.1. We list results in completed operations per microsecond and indicate a 90% confidence interval. We observe that in this set of experiments, the smaller nodes (K = 32) generally outperformed the larger nodes (in some cases by a substantial margin). At 64 threads, we observed a

[Figure 7.2: six panels in a 2×3 grid plotting System Throughput against Thread Count. Columns (left to right): Read-Dominant, Balanced, and Write-Dominant workloads; rows: K = 32 (top) and K = 192 (bottom). Series: Optimistic, Fraser-SL, No Hot Spot, Unrolled, TTAS-GME, Queue-GME.]

Figure 7.2: Results on Intel Xeon System for uniform distribution and one million keys. Throughput is reported in operations per microsecond.

performance differential of 52%/48%/15% when considering TTAS-SL-32 versus TTAS-SL-192. However, unlike in the unrolled linked lists, we observed very similar results between TTAS-SL and Queue-SL.

When we consider how TTAS-SL-32 compares against Unrolled-32, we notice substantial improvements over the majority of the experimental range. Specifically, we see 34%/25%/17% improvements at 32 threads and 21%/20%/44% improvements at 64 threads.

The TTAS-based GME now performs very favorably against the other lock-free skiplists, outperforming NoHotSpot-SL by as much as 58% (read-dominant, 32 threads), Fraser-SL by as much as 28% (write-dominant, 64 threads), and Optimistic-SL by as much as 926% (write-dominant, 64 threads).

Table 7.1: Summary of results on Intel Xeon System with 1 million keys and uniform distribution.

Algorithm        Read-Dominant                Balanced                     Write-Dominant
                 32 Threads    64 Threads     32 Threads    64 Threads     32 Threads    64 Threads
Optimistic-SL    7.87 ±0.20    8.69 ±0.25     2.10 ±0.23    2.19 ±0.27     1.37 ±0.19    1.34 ±0.25
NoHotSpot-SL     8.34 ±0.36   15.06 ±0.53     7.13 ±0.31   12.39 ±0.34     6.50 ±0.36   11.79 ±0.52
Fraser-SL       11.23 ±0.43   16.28 ±0.17     8.98 ±0.21   12.57 ±0.14     7.72 ±0.16   10.74 ±0.13
Unrolled-192     5.51 ±0.18    9.88 ±0.39     5.53 ±0.23    8.20 ±0.30     5.23 ±0.18    5.92 ±0.39
TTAS-SL-192      8.14 ±0.18   11.92 ±0.35     6.84 ±0.20   10.58 ±0.31     8.14 ±0.17   11.91 ±0.35
Queue-SL-192     8.12 ±0.31   12.21 ±0.37     6.90 ±0.18   10.78 ±0.27     8.13 ±0.30   12.22 ±0.37
Unrolled-32      9.72 ±0.58   15.37 ±0.69     8.56 ±0.68   13.32 ±0.74     7.44 ±0.33    9.54 ±0.29
TTAS-SL-32      13.00 ±0.82   18.55 ±1.27    10.69 ±0.47   15.93 ±1.08     8.69 ±0.82   13.75 ±0.94
Queue-SL-32     11.94 ±0.81   19.10 ±0.75    10.63 ±0.98   15.98 ±0.94     8.48 ±1.01   14.12 ±0.84

Table 7.2: Summary of results on Intel Xeon System with 1 million keys and Zipfian distribution.

Read-Dominant Balanced Write-Dominant Algorithm 32 Threads 64 Threads 32 Threads 64 Threads 32 Threads 64 Threads Optimistic-SL 5.25 ±0.08 5.94 ±0.10 1.11 ±0.17 1.05 ±0.13 0.61 ±0.17 0.57 ±0.16 NoHotSpot-SL 6.14 ±0.16 10.66 ±0.24 6.35 ±0.24 10.14 ±0.24 5.49 ±0.28 9.31 ±0.31 Fraser-SL 8.55 ±0.33 12.40 ±0.14 7.53 ±0.18 10.54 ±0.12 6.76 ±0.33 9.39 ±0.14 Unrolled-192 5.39 ±0.23 10.24 ±0.46 5.43 ±0.20 8.36 ±0.36 4.84 ±0.10 5.60 ±0.05 TTAS-SL-192 7.98 ±0.33 12.78 ±0.46 6.73 ±0.23 11.32 ±0.32 5.45 ±0.17 10.06 ±0.25 Queue-SL-192 8.27 ±1.45 12.59 ±0.54 6.48 ±0.42 11.04 ±0.44 5.34 ±0.21 9.52 ±0.32 Unrolled-32 7.02 ±0.44 10.52 ±0.33 6.20 ±0.25 9.38 ±0.29 5.35 ±0.18 7.02 ±0.08 TTAS-SL-32 8.55 ±0.31 14.08 ±0.30 7.20 ±0.33 11.69 ±0.25 5.48 ±0.21 9.87 ±0.47 Queue-SL-32 8.55 ±0.31 13.77 ±0.45 6.95 ±0.27 11.70 ±0.32 5.56 ±0.25 9.79 ±0.29

We next summarize the results for the Zipfian distribution in Table 7.2. In these scenarios TTAS-SL-32 provides the best performance of our GME algorithms, outperforming TTAS-SL-192 by 10% and 3% in read-dominant and balanced workloads, respectively. However, in the write-dominant workloads the larger nodes performed slightly better than the smaller nodes (by 2%). When we compared against Unrolled-32, the TTAS-based GME algorithm provided 34%/25%/41% better performance across the three synthetic workloads. Furthermore, the throughput difference between the TTAS-based and queue-based skiplists was very small (less than 10%).

Figure 7.3: Results on Intel Xeon System for Zipfian distribution and one million keys. Throughput is reported in completed operations per microsecond. Each panel plots system throughput against the number of threads, with K = 32 in the top row and K = 192 in the bottom row, under the Read-Dominant, Balanced, and Write-Dominant workloads.

We also observed that TTAS-SL-32 performed admirably against the other skiplist implementations, outperforming each of them in almost every instance. It outperformed Optimistic-SL by as much as 1,631% in the write-dominant workload for 64 threads. The peak gain (40%) over NoHotSpot-SL occurred at 32 threads under read-dominant workloads, and the peak gain over Fraser-SL (13%) occurred at 64 threads in the balanced workload. When compared to the Unrolled-32 algorithm, we observe performance improvements of 24%/20%/41% in the read-dominant/balanced/write-dominant workloads.

Figure 7.4: Results on Intel Xeon System for 10 million keys and uniform distribution. Throughput is reported in operations per microsecond. Each panel plots system throughput against the number of threads, with K = 32 in the top row and K = 192 in the bottom row, under the Read-Dominant, Balanced, and Write-Dominant workloads.

Ten Million Keys

Next we move on to the experiments with a ten million key space; we summarize the uniform distribution results in Table 7.3. In this set of experiments, we see very marginal differences when considering node size (at most 6% in the read-dominant workload at 64 threads). Additionally, we see very little difference between the TTAS-based and queue-based GME algorithms (less than 4% at almost every data point). When compared against the original Unrolled-32, the GME-enabled version provides substantial throughput improvements across the experimental range (up to 41% write-dominant at 64 threads).

Table 7.3: Summary of results on Intel Xeon System with ten million keys and uniform distribution.

Algorithm        Read-Dominant                Balanced                     Write-Dominant
                 32 Threads     64 Threads    32 Threads     64 Threads    32 Threads     64 Threads
Optimistic-SL     3.12 ±0.09     3.12 ±0.09    1.11 ±0.13     1.06 ±0.13    0.61 ±0.05     0.57 ±0.02
Fraser-SL         5.19 ±0.20     9.50 ±0.28    5.11 ±0.18     9.07 ±0.25    5.10 ±0.05     8.90 ±0.07
NoHotSpot-SL      6.41 ±0.21    10.65 ±0.34    6.35 ±0.13    10.14 ±0.33    6.03 ±0.15    10.03 ±0.18
Unrolled-192      5.88 ±0.25    10.71 ±0.38    6.28 ±0.20     9.44 ±0.23    5.96 ±0.09     6.81 ±0.06
TTAS-SL-192       8.25 ±0.28    12.51 ±0.35    7.97 ±0.23    12.17 ±0.37    6.59 ±0.16    11.48 ±0.34
Queue-SL-192      8.02 ±0.30    12.28 ±0.32    7.56 ±0.27    12.02 ±0.14    6.45 ±0.27    10.96 ±0.36
Unrolled-32       5.86 ±0.25    10.71 ±0.31    6.41 ±0.24    10.57 ±0.34    6.21 ±0.24     8.13 ±0.36
TTAS-SL-32        8.97 ±0.34    13.32 ±0.31    8.31 ±0.24    12.66 ±0.10    6.34 ±0.24    11.47 ±0.24
Queue-SL-32       8.96 ±0.28    13.40 ±0.29    8.10 ±0.32    12.80 ±0.10    6.43 ±0.24    11.29 ±0.24

TTAS-SL-32 measures very favorably against the other skiplist algorithms. We observed up to 1,912% improvement over Optimistic-SL (at 64 threads, write-dominant), up to 78% improvement over Fraser-SL (at 32 threads, read-dominant), and 40% over NoHotSpot-SL (at 32 threads, read-dominant).

Table 7.4: Summary of results for Intel Xeon System with ten million keys and Zipfian distribution.

Algorithm        Read-Dominant                Balanced                     Write-Dominant
                 32 Threads     64 Threads    32 Threads     64 Threads    32 Threads     64 Threads
Optimistic-SL     2.52 ±0.09     2.55 ±0.05    0.89 ±0.04     0.89 ±0.04    0.39 ±0.03     0.37 ±0.02
Fraser-SL         4.44 ±0.13     7.32 ±0.07    4.26 ±0.11     6.87 ±0.08    4.06 ±0.04     6.28 ±0.07
NoHotSpot-SL      5.33 ±0.17     9.34 ±0.19    5.08 ±0.20     8.96 ±0.21    4.86 ±0.15     8.38 ±0.18
Unrolled-192      5.27 ±0.19    10.07 ±0.41    5.47 ±0.23     8.04 ±0.21    5.96 ±0.08     6.81 ±0.06
TTAS-SL-192       7.30 ±0.21    11.75 ±0.37    6.35 ±0.13    10.65 ±0.25    5.48 ±0.14     9.73 ±0.21
Queue-SL-192      6.96 ±0.29    11.44 ±0.35    6.22 ±0.17    10.57 ±0.14    5.51 ±0.11     9.47 ±0.16
Unrolled-32       5.46 ±0.17    10.02 ±0.30    5.32 ±0.10     8.64 ±0.21    6.21 ±0.05     8.13 ±0.06
TTAS-SL-32        7.65 ±0.25    12.98 ±0.31    6.54 ±0.17    11.10 ±0.19    5.47 ±0.22     9.67 ±0.26
Queue-SL-32       7.46 ±0.16    12.71 ±0.20    6.42 ±0.14    10.90 ±0.24    5.45 ±0.16     9.45 ±0.17

Now we consider the Zipfian distribution, as summarized in Table 7.4. As with our previous experiments, the smaller nodes provide slightly higher throughput than the larger ones in the read-dominant and balanced workloads; however, the larger nodes provide marginally better performance in the write-dominant workloads. Furthermore, TTAS-SL-32 outperforms Unrolled-32 by 29%/28%/18% in the read-dominant/balanced/write-dominant workloads.

Figure 7.5: Results on Intel Xeon System for 10 million keys and Zipfian distribution. Throughput is reported in operations per microsecond. Each panel plots system throughput against the number of threads, with K = 32 in the top row and K = 192 in the bottom row, under the Read-Dominant, Balanced, and Write-Dominant workloads.

In this set of workloads, the GME-enabled algorithms provide improved throughput over other skiplists. TTAS-SL-32 provides up to 2,614% improvement over Optimistic-SL (write-dominant, 64 threads), up to 77% improvement over Fraser-SL (read-dominant, 64 threads), and up to 40% improvement over NoHotSpot-SL (32 threads, read-dominant workload).

7.4.4 Intel Xeon Phi System

We repeated the same sets of experiments from Chapter 6 and included our two GME-enabled variants. The results for these experiments can be found in Figures 7.6–7.9. Since we have not yet explored the unrolled data structure on this system in detail (at least in relation to other skiplists), we shall go into somewhat more detail here.

One Million Keys

Figure 7.6: Skiplist performance on Intel Xeon Phi System with K = 32 and one million keys. Throughput is reported in operations per microsecond. Each panel plots system throughput against thread count, with the uniform distribution in the top row and the Zipfian distribution in the bottom row, under the Read-Dominant, Balanced, and Write-Dominant workloads.

Figure 7.7: Skiplist performance on Intel Xeon Phi System with K = 192 and one million keys. Throughput is reported in operations per microsecond. Each panel plots system throughput against thread count, with the uniform distribution in the top row and the Zipfian distribution in the bottom row, under the Read-Dominant, Balanced, and Write-Dominant workloads.

Uniform Distribution: We summarize the results for the uniform distribution in Table 7.5. On this system, we see very little difference between Unrolled-32 and Unrolled-192 in terms of throughput. In general, Unrolled-32 outperforms all of the other data structures in the read-dominant tests (at 272 threads, excepting the GME-enabled versions), exceeding Optimistic-SL by 1,902%, surpassing NoHotSpot-SL by 10%, and outperforming Fraser-SL by 28%. Unrolled-32 does not fare as well in the balanced or write-dominant test cases, however. NoHotSpot-SL exceeds its performance by 62% and 321% in the balanced and write-dominant workloads, respectively. Fraser-SL also outperforms Unrolled-32 by 26% and 211% in the same tests.

On this manycore system, we do see substantial benefit from the GME algorithms, especially as the write percentages increase. In general, we see that the saturation problem almost completely disappears with both GME-based algorithms; each instance appears to scale extremely well up until 200 threads (and in many cases beyond). In each case, the smaller nodes appear to perform slightly better (although in the read-dominant workload the results are nearly identical). In most cases, we observe better performance from the TTAS-based skiplist than from the queue-based skiplist. Specifically, we observed that TTAS-SL-32 outperformed Unrolled-32 by 9% in read-dominant, 72% in balanced, and 308% in write-dominant workloads (at 272 threads).

Table 7.5: Summary of results on Intel Xeon Phi System with 1 million keys and uniform distribution.

Algorithm        Read-Dominant                  Balanced                       Write-Dominant
                 136 Threads    272 Threads     136 Threads    272 Threads     136 Threads    272 Threads
Optimistic-SL     0.87 ±0.12     0.91 ±0.08      0.29 ±0.01     0.29 ±0.01      0.27 ±0.01     0.27 ±0.01
NoHotSpot-SL     11.74 ±0.06    16.60 ±0.24     11.94 ±0.02    16.48 ±0.16     11.94 ±0.05    16.37 ±0.23
Fraser-SL        10.93 ±0.19    14.27 ±0.58     10.67 ±0.07    12.81 ±0.18      9.95 ±0.06    11.84 ±0.35
Unrolled-192     13.93 ±0.18    18.49 ±0.39      8.38 ±0.23     9.21 ±0.07      3.70 ±0.08     3.56 ±0.05
TTAS-SL-192      14.68 ±0.24    19.84 ±0.35     14.43 ±0.24    17.85 ±0.21     13.02 ±0.15    16.54 ±0.14
Queue-SL-192     14.41 ±0.21    19.79 ±0.37     13.95 ±0.26    16.94 ±0.39     10.91 ±0.13    13.32 ±0.18
Unrolled-32      13.44 ±0.26    18.22 ±0.17      8.91 ±0.05    10.15 ±0.07      4.29 ±0.07     3.88 ±0.04
TTAS-SL-32       14.36 ±0.36    19.91 ±0.26     14.26 ±0.17    17.54 ±0.22     12.53 ±0.17    15.84 ±0.16
Queue-SL-32      14.16 ±0.41    19.85 ±0.54     13.48 ±0.32    17.00 ±0.23     10.37 ±0.16    12.83 ±0.12

We further observe that TTAS-SL-32 outperforms all other skiplists under the majority of workloads. It outperforms Optimistic-SL by 2,088%/5,948%/5,767% (at 272 threads in read-dominant/balanced/write-dominant workloads) and Fraser-SL by 40%/37%/33%. We observe that NoHotSpot-SL did outperform us in the write-dominant workload by 3%, but TTAS-SL-32 exceeded its performance by 20% in read-dominant and 6% in balanced workloads (at 272 threads). The performance differential was even higher at 136 threads: 22% for write-dominant and 19% for balanced.

Table 7.6: Summary of results on Intel Xeon Phi System with 1 million keys and Zipfian distribution.

Algorithm        Read-Dominant                  Balanced                       Write-Dominant
                 136 Threads    272 Threads     136 Threads    272 Threads     136 Threads    272 Threads
Optimistic-SL     0.87 ±0.07     0.87 ±0.07      0.25 ±0.01     0.23 ±0.01      0.19 ±0.01     0.19 ±0.03
NoHotSpot-SL     11.61 ±0.13    15.47 ±0.17     11.63 ±0.11    15.32 ±0.16     11.55 ±0.11    15.23 ±0.16
Fraser-SL        10.89 ±0.09    13.56 ±0.38     10.00 ±0.07    12.03 ±0.37      9.38 ±0.07    11.17 ±0.19
Unrolled-192     13.85 ±0.13    17.49 ±0.10      8.38 ±0.23     8.64 ±0.03      3.37 ±0.05     3.25 ±0.03
TTAS-SL-192      15.52 ±0.28    19.06 ±0.19     14.43 ±0.24    16.95 ±0.19     12.57 ±0.19    15.53 ±0.25
Queue-SL-192     15.49 ±0.17    19.09 ±0.17     13.95 ±0.26    16.30 ±0.13     10.46 ±0.08    12.82 ±0.14
Unrolled-32      13.58 ±0.07    17.22 ±0.18      8.91 ±0.05     9.73 ±0.32      3.83 ±0.05     3.53 ±0.03
TTAS-SL-32       15.35 ±0.25    19.00 ±0.19     14.26 ±0.17    16.95 ±0.27     12.11 ±0.07    15.19 ±0.23
Queue-SL-32      15.31 ±0.22    19.07 ±0.18     13.48 ±0.32    16.68 ±0.19      9.87 ±0.09    12.43 ±0.07

Zipfian Distribution: The results at one million keys for the Zipfian distribution (see Table 7.6) show even more substantial gains for our GME-enabled skiplists. As we discussed in Chapter 6, our unrolled skiplist scales poorly in both the balanced and write-dominant workloads. Again, Unrolled-32 generally outperformed Unrolled-192, except in the read-dominant workload (where the larger nodes provided marginally better performance). In the read-dominant workload, Unrolled-32 outperformed Optimistic-SL by 1,879%, exceeded NoHotSpot-SL by 11%, and surpassed Fraser-SL by 27%. In the balanced and write-dominant workloads, Unrolled-32 outperforms Optimistic-SL. However, the lock-free skiplists do provide considerably higher throughput in these workloads.

Adding GME to our skiplist provides substantial improvements as well, and as before, we generally get the best performance from the TTAS-SL-32 instance. This instance improves performance over Unrolled-32 by 10% in read-dominant, 74% in balanced, and 330% in write-dominant workloads. Furthermore, the GME-enabled algorithms now scale extremely well up to and beyond 200 threads.

In the read-dominant workloads (at 272 threads), TTAS-SL-32 outperforms Optimistic-SL by 2,084%, Fraser-SL by 40%, and NoHotSpot-SL by 23%. When evaluating the balanced workload, TTAS-SL beats Optimistic-SL by 7,270%, Fraser-SL by 41%, and NoHotSpot-SL by 11%. Finally, in the write-dominant workload, TTAS-SL-32 outperforms Optimistic-SL by 7,895% and Fraser-SL by 36%. Ultimately, TTAS-SL-32, NoHotSpot-SL, and TTAS-SL-192 all provided write-dominant throughput within each other's margins of error.

Ten Million Keys

As with our experiments on Intel Xeon System, we also ran our experiments with a keyspace of ten million keys. We graph our results in Figures 7.8–7.9.

Figure 7.8: Skiplist performance on Intel Xeon Phi System with K = 32 and ten million keys. Throughput is reported in operations per microsecond. Each panel plots system throughput against thread count, with the uniform distribution in the top row and the Zipfian distribution in the bottom row, under the Read-Dominant, Balanced, and Write-Dominant workloads.

Figure 7.9: Skiplist performance on Intel Xeon Phi System with K = 192 and ten million keys. Throughput is reported in operations per microsecond. Each panel plots system throughput against thread count, with the uniform distribution in the top row and the Zipfian distribution in the bottom row, under the Read-Dominant, Balanced, and Write-Dominant workloads.

Uniform Distribution: After summarizing our results in Table 7.7, we can make the following observations. We first observe that, across the board, Unrolled-32 provides slightly better performance than Unrolled-192 (by up to 10% in write-dominant workloads at 272 threads). Unrolled-32 does provide reasonably good performance in the read-dominant workload, outperforming Optimistic-SL by 1,484%. However, the other lock-free skiplists do provide slightly better read-dominant throughput (1% for Fraser-SL and 12% for NoHotSpot-SL). As we move to the balanced and write-dominant workloads, Unrolled-32 does demonstrate the effects of saturation, with performance peaking at roughly 180 and 50 threads, respectively.

As in our previous experiments, introducing GME does improve performance and alleviate the effects of saturation. In the read-dominant workload, we observed the highest throughput from Queue-SL-32, which outperformed Unrolled-32 by 25% (at 272 threads). However, in the balanced and write-dominant workloads, TTAS-SL-192 provided the best performance, outperforming Unrolled-32 by 91% in balanced and 365% in write-dominant workloads.

When compared to the other data structures, Queue-SL-32 provides the highest throughput in the read-dominant case, outperforming Optimistic-SL by 1,891%, Fraser-SL by 24%, and NoHotSpot-SL by 12% (at 272 threads). In the balanced workload, TTAS-SL-192 provides the highest throughput, outperforming Optimistic-SL by 6,396%, Fraser-SL by 29%, and NoHotSpot-SL by 7%. Finally, in the write-dominant workload, TTAS-SL-192 also outperformed all others, beating Optimistic-SL by 8,111%, Fraser-SL by 34%, and NoHotSpot-SL by 3%.

Table 7.7: Summary of results on Intel Xeon Phi System with 10 million keys and uniform distribution.

Algorithm        Read-Dominant                  Balanced                       Write-Dominant
                 136 Threads    272 Threads     136 Threads    272 Threads     136 Threads    272 Threads
Optimistic-SL     0.87 ±0.14     0.95 ±0.10      0.22 ±0.02     0.25 ±0.01      0.18 ±0.01     0.18 ±0.01
NoHotSpot-SL     12.38 ±0.22    16.92 ±0.18     10.98 ±0.16    15.12 ±0.09     10.43 ±0.13    14.34 ±0.13
Fraser-SL        10.93 ±0.19    15.31 ±0.14      9.77 ±0.08    12.55 ±0.38      8.71 ±0.13    11.01 ±0.44
Unrolled-192     11.57 ±0.09    14.83 ±0.13      6.60 ±0.02     7.63 ±0.07      3.00 ±0.07     2.89 ±0.03
TTAS-SL-192      14.76 ±0.31    18.72 ±0.08     13.22 ±0.19    16.24 ±0.18     11.38 ±0.11    14.78 ±0.14
Queue-SL-192     14.69 ±0.42    18.56 ±0.15     12.60 ±0.12    15.71 ±0.08      9.46 ±0.07    12.07 ±0.09
Unrolled-32      10.95 ±0.28    15.05 ±0.51      6.87 ±0.13     8.50 ±0.13      3.39 ±0.08     3.18 ±0.04
TTAS-SL-32       14.20 ±0.42    18.17 ±0.48     12.34 ±0.22    15.75 ±0.19     10.24 ±0.16    13.45 ±0.38
Queue-SL-32      14.37 ±0.44    18.92 ±0.26     12.25 ±0.38    15.71 ±0.40      8.44 ±0.14    11.42 ±0.48

Zipfian Distribution: After summarizing our results in Table 7.8, we notice the following. First, among the (non-GME) unrolled skiplists, the larger nodes provided marginally better read performance, while the smaller nodes provided slightly better write performance (note that the performance difference is less than 3% in every case). In this skewed distribution, Unrolled-192 outperforms Optimistic-SL by 1,600% and Fraser-SL by 6%, while NoHotSpot-SL does provide 7% better throughput in reads. As we move to the balanced and write-dominant workloads, we notice saturation effects similar to those we saw with our uniform distributions.

Again, introducing GME to our unrolled skiplist drastically reduces the effects of saturation. In this set of experiments, we observed the best performance from Queue-SL-32 in read-dominant workloads, and from TTAS-SL-192 in balanced and write-dominant workloads. In our read-dominant workload, Queue-SL-32 outperformed Optimistic-SL by 2,020%, Fraser-SL by 33%, and NoHotSpot-SL by 16%. As we move to the balanced and write-dominant workloads, TTAS-SL-192 outperforms Optimistic-SL by 7,086% in balanced and 7,116% in write-dominant; Fraser-SL by 37% in balanced and 44% in write-dominant; and NoHotSpot-SL by 7% in balanced and 5% in write-dominant workloads.

Table 7.8: Summary of results on Intel Xeon Phi System with 10 million keys and Zipfian distribution.

Algorithm        Read-Dominant                  Balanced                       Write-Dominant
                 136 Threads    272 Threads     136 Threads    272 Threads     136 Threads    272 Threads
Optimistic-SL     0.79 ±0.07     0.82 ±0.08      0.21 ±0.01     0.21 ±0.01      0.18 ±0.01     0.19 ±0.01
NoHotSpot-SL     11.18 ±0.25    14.95 ±0.19     10.62 ±0.17    14.11 ±0.10      9.91 ±0.16    13.03 ±0.13
Fraser-SL        10.48 ±0.11    13.09 ±0.23      9.19 ±0.07    11.02 ±0.38      8.05 ±0.05     9.53 ±0.31
Unrolled-192     10.56 ±0.07    13.94 ±0.17      5.95 ±0.04     7.11 ±0.10      2.83 ±0.05     2.75 ±0.01
TTAS-SL-192      14.11 ±0.07    17.24 ±0.35     12.41 ±0.09    15.09 ±0.17     10.66 ±0.10    13.71 ±0.13
Queue-SL-192     13.96 ±0.11    17.24 ±0.26     11.88 ±0.08    14.75 ±0.37      8.90 ±0.06    11.52 ±0.07
Unrolled-32       9.85 ±0.07    13.77 ±0.13      6.06 ±0.07     7.11 ±0.09      3.17 ±0.02     2.99 ±0.03
TTAS-SL-32       13.44 ±0.12    16.93 ±0.22     11.68 ±0.10    14.52 ±0.19      9.24 ±0.42    12.63 ±0.28
Queue-SL-32      13.65 ±0.21    17.39 ±0.11     11.27 ±0.11    14.44 ±0.19      7.72 ±0.19    10.50 ±0.32

7.5 An In-Depth Evaluation of the GME-enabled Skiplist

In order to better understand and explain the characteristics of TTAS-SL and Queue-SL, we enabled tracing of method calls and measured detailed performance characteristics in several representative executions. We began our detailed analysis by inserting timing code to measure the average duration of lookup, remove, and insert operations for each implementation (in nanoseconds); we summarize these results for the uniform distribution on the Intel Xeon System in Tables 7.9 and 7.10.

Table 7.9: Operation timings (in nanoseconds) for uniform distribution on Intel Xeon System with 1 million keys. No lookup timings are reported for the write-dominant workload, which issues no lookups.

Algorithm      Operation   Read-Dominant          Balanced               Write-Dominant
                           32 Thr.    64 Thr.     32 Thr.    64 Thr.     32 Thr.    64 Thr.
Fraser-SL      Lookup      2810       1359        3006       1491        —          —
               Insert      3239       1883        3355       2172        3454       2399
               Remove      50002      26807       12619      7280        6745       4374
NoHotSpot-SL   Lookup      3156       1447        3038       1577        —          —
               Insert      3322       1683        3121       1690        3065       1754
               Remove      55334      27755       12324      6851        6174       3592
Unrolled-192   Lookup      3003       1754        3017       2166        —          —
               Insert      4536       2933        4655       5976        4018       6931
               Remove      4225       2819        4413       5663        3966       6413
TTAS-SL-192    Lookup      3750       2375        2875       2115        —          —
               Insert      3925       2455        3211       2675        3875       3752
               Remove      3825       2401        3105       2511        3711       3496
Queue-SL-192   Lookup      3771       2388        2999       2175        —          —
               Insert      3995       2491        3213       2553        3799       3704
               Remove      3861       2399        3111       2471        3743       3518
Unrolled-32    Lookup      3514       1775        3271       1825        —          —
               Insert      3733       2183        3877       2336        4655       6531
               Remove      3644       1941        3644       2696        4183       6231
TTAS-SL-32     Lookup      2946       1673        2926       1803        —          —
               Insert      3113       1711        3155       1819        4013       2833
               Remove      3055       1665        3072       1794        3977       2761
Queue-SL-32    Lookup      2977       1673        3077       1810        —          —
               Insert      3111       1785        3310       1850        3987       2896
               Remove      3075       1715        3199       1773        3855       2783

As we examine the results of our executions, one detail stands out when comparing our algorithms to the other algorithms. Specifically, the remove calls in the other implementations take (in some cases) an entire order of magnitude more time to complete than lookup or insert calls. In our algorithms, remove calls run slightly faster than insert calls. When one compares an insert call against a remove call in our algorithm, the reasoning is clear.

In order to insert an item, a thread must first ascertain that the item is not already present before performing the actual insertion. However, to remove an item, the thread must only seek the first (and only) instance of that item and replace it with a sentinel.

Table 7.10: Operation timings (in nanoseconds) for uniform distribution on Intel Xeon System with 10 million keys. No lookup timings are reported for the write-dominant workload, which issues no lookups.

Algorithm      Operation   Read-Dominant          Balanced               Write-Dominant
                           32 Thr.    64 Thr.     32 Thr.    64 Thr.     32 Thr.    64 Thr.
Fraser-SL      Lookup      3073       1811        3185       1888        —          —
               Insert      3425       2366        3555       2484        3676       2910
               Remove      54384      34394       13326      8696        7141       5254
NoHotSpot-SL   Lookup      3564       2406        3462       2419        —          —
               Insert      3730       2738        3772       2613        3595       2799
               Remove      62228      43955       14690      10177       7160       5494
Unrolled-192   Lookup      3056       2110        3170       1889        —          —
               Insert      3922       3754        4734       6476        4574       7582
               Remove      3796       3533        4576       6253        4445       7385
TTAS-SL-192    Lookup      3215       1773        3119       1952        —          —
               Insert      3501       1890        3521       2258        1483       4088
               Remove      3423       1824        3402       2067        1353       3943
Queue-SL-192   Lookup      3207       1825        2266       1843        —          —
               Insert      3271       2419        2859       4345        2960       3900
               Remove      3169       1964        2572       3860        2404       3499
Unrolled-32    Lookup      3598       1997        3550       2164        —          —
               Insert      4091       3232        4218       4655        4239       6223
               Remove      4903       2815        4129       4482        4192       6117
TTAS-SL-32     Lookup      3281       1877        2364       1862        —          —
               Insert      3294       2052        2572       3357        2739       3180
               Remove      3500       2118        2459       2122        2648       3064
Queue-SL-32    Lookup      3039       1835        2697       1933        —          —
               Insert      3138       2522        2774       4489        2625       4199
               Remove      3099       1997        2777       3999        2473       3713
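The sketch below illustrates this insert/remove asymmetry on a single unrolled node. It is a simplified C++ illustration only: the fixed-size keys array, the EMPTY sentinel value, and the helper names are ours and do not reproduce the exact node layout, locking, or splitting logic of our implementation.

#include <array>

constexpr int  K     = 32;     // unrolling factor: keys per node
constexpr long EMPTY = -1;     // sentinel marking an unused (or removed) slot

struct Node {
    std::array<long, K> keys;  // unsorted keys; EMPTY marks free slots
    int count = 0;             // number of occupied slots
    Node() { keys.fill(EMPTY); }
};

// Insert must scan the entire node to rule out a duplicate before it can
// claim a free slot, so every insert pays for a full scan of K slots.
bool insertIntoNode(Node& n, long key) {
    int freeSlot = -1;
    for (int i = 0; i < K; ++i) {
        if (n.keys[i] == key) return false;               // already present
        if (n.keys[i] == EMPTY && freeSlot < 0) freeSlot = i;
    }
    if (freeSlot < 0) return false;                       // node full; caller must split
    n.keys[freeSlot] = key;
    ++n.count;
    return true;
}

// Remove can stop at the first (and only) matching slot and overwrite it with
// the sentinel, so on average it examines only about half of the node.
bool removeFromNode(Node& n, long key) {
    for (int i = 0; i < K; ++i) {
        if (n.keys[i] == key) {
            n.keys[i] = EMPTY;
            --n.count;
            return true;
        }
    }
    return false;                                         // key not present
}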

We drilled down into our performance somewhat further², and we determined that at 32 keys per node and a keyspace of one million, we required only an average of 1.001 locks for an insert operation and 1.02 locks for a remove operation. At 192 keys per node and a keyspace of one million, those numbers dropped to 1.0002 locks per insert and 1.0000001 locks per remove. When we increase the keyspace to ten million, we require 1.008 (1.0007) locks per insert (remove) with 32 keys per node, and 1.0004 (1.000001) locks with 192 keys per node. This confirms our claim that the vast majority of our operations can complete by locking a single node, inclusive of any restarts.

²These results are specific to TTAS-SL, but the other implementations only differ slightly.
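As an aside, per-operation lock counts of this kind can be gathered with very lightweight instrumentation. The fragment below is a minimal sketch of one way to do it; the counter and helper names are ours rather than the tracing hooks used in these experiments, and a production version would keep per-thread, cache-line-padded counters instead of shared atomics.

#include <atomic>

// Shared tallies, incremented from the insert path of the data structure.
std::atomic<long> locksAcquired{0};     // node locks (or GME sessions) taken,
                                        // including those repeated after restarts
std::atomic<long> insertsCompleted{0};  // insert operations finished

inline void noteLockAcquired()    { locksAcquired.fetch_add(1, std::memory_order_relaxed); }
inline void noteInsertCompleted() { insertsCompleted.fetch_add(1, std::memory_order_relaxed); }

// Average number of locks needed per insert, inclusive of any restarts.
// (Assumes at least one insert has completed before it is called.)
inline double averageLocksPerInsert() {
    return static_cast<double>(locksAcquired.load()) /
           static_cast<double>(insertsCompleted.load());
}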

7.6 Analysis and Conclusions

As we have explored in this chapter, we can utilize group mutual exclusion to create a new type of lock-based data structure. In order to showcase this technique, we derived two new data structures based upon our prior work on unrolled linked lists and unrolled skiplists. These two GME-enabled data structures resist the deleterious effects of saturation, which impact many lock-based data structures. By adjusting the unrolling factor and the underlying GME algorithm, we are able to obtain performance which meets or exceeds that of other known skiplist algorithms. As we have seen, we cannot provide a "perfect" set of parameters that delivers optimal performance in all situations. However, we have also demonstrated that our data structures provide excellent performance for a wide variety of settings (including unrolling factor and choice of GME algorithm) on multiple workloads and keyspace distributions. In the final chapter, we shall summarize our findings and suggest several avenues for further research.

CHAPTER 8

CONCLUSION

In this work we have explored the topic of saturation in lock-based concurrent data structures. To that end, we first demonstrated how list-based concurrent data structures could be unrolled, which increases the storage density of the data structure and can substantially improve their throughput. We then unrolled two data structures, namely the linked list and the skiplist, and we explored in some detail how these implementations compared against other similar concurrent data structures. In each case, our data structure substantially outperformed all other lock-based data structures. However, we did see room for improvement in write performance, especially when comparing against lock-free data structures.

We then explored the behavior of our data structures on systems with very high core counts, in our case a Xeon Phi system. We determined that in this manycore environment, data structures can saturate at high thread counts, especially in write-intensive workloads.

We illustrated this behavior on both our unrolled linked list and unrolled skiplist.

Next, we demonstrated how we could replace our exclusive locks with group mutual exclusion objects and introduce intra-node concurrency to these data structures. We explored the feasibility of several GME algorithms and settled on a TTAS-based algorithm and a queue-based algorithm based upon Keane and Moir's work. We discussed how we could allow concurrent reads or writes to safely execute within a node, and we demonstrated the value of our research by comparing our algorithms against other existing skiplists.
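To give a flavor of the kind of per-node synchronization object this involves, the sketch below shows a highly simplified TTAS-style GME word in C++: threads requesting a session that matches the currently open one (or an idle node) may enter together, while a conflicting session forces them to spin. This is an illustrative sketch only; it omits the fairness and forward-progress machinery of the algorithms actually used in this work, and the class and method names are ours.

#include <atomic>
#include <cstdint>

// A simplified per-node group mutual exclusion object in TTAS style.
// The low two bits of 'word' hold the active session type; the remaining
// bits count how many threads are currently inside that session.
class GmeWord {
public:
    enum Session : uint64_t { NONE = 0, READ = 1, WRITE = 2 };

    void enter(Session s) {
        for (;;) {
            uint64_t w = word.load(std::memory_order_acquire);        // "test"
            if (type(w) == NONE || type(w) == s) {
                uint64_t next = pack(s, count(w) + 1);
                if (word.compare_exchange_weak(w, next,               // "test-and-set"
                                               std::memory_order_acq_rel))
                    return;        // joined an empty node or a compatible session
            }
            // otherwise spin until the conflicting session drains
        }
    }

    void exit() {
        for (;;) {
            uint64_t w = word.load(std::memory_order_acquire);
            uint64_t next = (count(w) == 1) ? pack(NONE, 0)
                                            : pack(type(w), count(w) - 1);
            if (word.compare_exchange_weak(w, next, std::memory_order_acq_rel))
                return;
        }
    }

private:
    static uint64_t pack(uint64_t t, uint64_t c) { return (c << 2) | t; }
    static uint64_t type(uint64_t w)  { return w & 3; }
    static uint64_t count(uint64_t w) { return w >> 2; }

    std::atomic<uint64_t> word{0};
};

Under such a scheme, a lookup would bracket its scan of a node with enter(GmeWord::READ) and exit(), while an insert or remove would use enter(GmeWord::WRITE) and exit(), so that multiple readers or multiple writers (but never both at once) can occupy the same node.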

We envision two major avenues for future research in this area. The first avenue is the pursuit of new scalable GME algorithms. Many GME algorithms have been proposed, but most either lack fairness guarantees, require the use of a "gatekeeper" lock, or possess infeasible per-instance space requirements. We believe that such an algorithm is possible, but we have not as yet successfully developed one. Our final avenue is to introduce this technique to other hierarchical data structures. We believe that many such candidates exist, such as k-ary search trees, B-trees, and certain types of hash tables.

REFERENCES

[1] Lonestar 5 User Guide. https://portal.tacc.utexas.edu/user-guides/lonestar5. Accessed 2017-01-27.

[2] (2017a). Intel Xeon Phi x200 Product Family. https://ark.intel.com/products/series/92650/Intel-Xeon-Phi-x200-Product-Family. Accessed 2017-07-11.

[3] (2017b). Intel Xeon Scalable Processors. https://ark.intel.com/products/series/125191/Intel-Xeon-Scalable-Processors. Accessed 2017-08-18.

[4] Adamic, L. A. and B. A. Huberman (2002). Zipf’s Law and the Internet. Glottometrics 3 (1), 143–150.

[5] Afek, Y., H. Attiya, D. Dolev, E. Gafni, M. Merritt, and N. Shavit (1993, September). Atomic snapshots of shared memory. J. ACM 40 (4), 873–890.

[6] Afek, Y., H. Kaplan, B. Korenfeld, A. Morrison, and R. E. Tarjan (2012). CBTree: A Practical Concurrent Self-Adjusting Search Tree. In Proceedings of the Symposium on Distributed Computing (DISC), pp. 1–15.

[7] Aho, A. V., R. Sethi, M. S. Lam, and J. S. Ullman (2006). Compilers: Principles, Techniques, and Tools (2nd ed.). Addison-Wesley.

[8] Alistarh, D., W. M. Leiserson, A. Matveev, and N. Shavit (2015, June). ThreadScan: Automatic and Scalable Memory Reclamation. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 123–132.

[9] Anderson, J. H. (1993). Composite registers. In Distributed Computing (DC), pp. 15–30.

[10] Anderson, J. H., Y.-J. Kim, and T. Herman (2003). Shared-memory mutual exclusion: Major research trends since 1986. Distributed Computing (DC) 16, 75–110.

[11] Avni, H., N. Shavit, and A. Suissa (2013). Leaplist: Lessons learned in designing TM-supported range queries. In Proceedings of the 2013 ACM Symposium on Principles of Distributed Computing (PODC), pp. 299–308.

[12] Bayer, R. and M. Schkolnick (1977). Concurrency of Operations on B-Trees. Acta Informatica 9, 1–21.

[13] Bender, M. A., J. T. Fineman, S. Gilbert, and B. C. Kuszmaul (2005, July). Concurrent Cache-Oblivious B-Trees. In Proceedings of the 17th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 228–237.

[14] Bhatt, V. and C. Huang (2010). Group mutual exclusion in O(log n) RMR. In Proceedings of the 29th ACM Symposium on Principles of Distributed Computing (PODC), pp. 45–54.

[15] Braginsky, A. and E. Petrank (2011). Locality-Conscious Lock-Free Linked Lists. In Proceedings of the 12th International Conference on Distributed Computing and Networking (ICDCN), pp. 107–118.

[16] Braginsky, A. and E. Petrank (2012). A Lock-Free B+tree. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 58–67.

[17] Breslau, L., P. Cao, L. Fan, G. Phillips, and S. Shenker (1999, March). Web Caching and Zipf-like Distributions: Evidence and Implications. In Proceedings of the 18th IEEE Conference on Computer Communications (INFOCOM), pp. 126–134.

[18] Bronson, N. G., J. Casper, H. Chafi, and K. Olukotun (2010, January). A Practical Concurrent Binary Search Tree. In Proceedings of the 15th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 257–268.

[19] Brown, T., F. Ellen, and E. Ruppert (2013). Pragmatic Primitives for Non-blocking Data Structures. In Proceedings of the 32nd ACM Symposium on Principles of Distributed Computing (PODC), pp. 207–221.

[20] Brown, T. A. (2015, July). Reclaiming Memory for Lock-Free Data Structures: There Has to Be a Better Way. In Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing (PODC), pp. 261–270.

[21] Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2009). Introduction to Algorithms (3rd ed.). The MIT Press.

[22] Crain, T., V. Gramoli, and M. Raynal (2013a). A Contention-Friendly Binary Search Tree. In Proceedings of the European Conference on Parallel and Distributed Computing (Euro-Par), Aachen, Germany, pp. 229–240.

[23] Crain, T., V. Gramoli, and M. Raynal (2013b). No hot spot non-blocking skip list. In Proceedings of the 33rd IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 196–205.

[24] David, T. and R. Guerraoui (2016, July). Concurrent Search Data Structures Can Be Blocking and Practically Wait-Free. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 337–348.

[25] Dawes, B. and D. Abrahams. Boost C++ Libraries.

[26] Demaine, E. (2002, June). Cache-Oblivious Algorithms and Data Structures. Lecture Notes from the EEF Summer School on Massive Data Sets.

[27] Denning, P. J. (2005, July). The locality principle. Communications of the ACM (CACM) 48 (7), 19–24.

[28] Dijkstra, E. W. (1971, October). Hierarchical Ordering of Sequential Processes. Acta Informatica 1 (2), 115–138.

[29] Ellen, F., P. Fatourou, E. Ruppert, and F. van Breugel (2010, July). Non-Blocking Binary Search Trees. In Proceedings of the 29th ACM Symposium on Principles of Distributed Computing (PODC), pp. 131–140.

[30] Ellis, C. (1987). Concurrency in Linear Hashing. ACM Transactions on Database Systems (TODS) 12 (2), 195–217.

[31] Evans, J. (2011). Scalable Memory Allocation using jemalloc. https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919/.

[32] Faloutsos, C. and H. V. Jagadish (1992, August). On B-tree Indices for Skewed Distributions. In Proceedings of the 18th International Conference on Very Large Data Bases (VLDB), pp. 364–373.

[33] Fomitchev, M. and E. Ruppert (2004, July). Lock-Free Linked Lists and Skiplists. In Proceedings of the 23rd ACM Symposium on Principles of Distributed Computing (PODC), pp. 50–59.

[34] Frank, D. J., R. H. Dennard, E. Nowak, P. M. Solomon, Y. Taur, and H.-S. P. Wong (2001, March). Device scaling limits of si mosfets and their application dependencies. Proceedings of the IEEE 89 (3), 259–288.

[35] Fraser, K. (2004, February). Practical Lock-Freedom. PhD dissertation, University of Cambridge.

[36] Fraser, K. and T. L. Harris (2007, May). Concurrent Programming Without Locks. ACM Transactions on Computer Systems 25 (2).

[37] Galvin, P., G. Gagne, and A. Silberschatz (2013). Operating System Concepts (9th ed.). John Wiley and Sons, Incorporated.

[38] Gramoli, V. (2015, February). More than you ever wanted to know about synchronization: synchrobench, measuring the impact of the synchronization on concurrent algorithms. In Proceedings of the 20th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 1–10.

[39] Hadzilacos, V. (2001, August). A Note on Group Mutual Exclusion. In Proceedings of the 20th ACM Symposium on Principles of Distributed Computing (PODC).

[40] Hadzilacos, V. and R. Danek (2004). Local-Spin Group Mutual Exclusion Algorithms. In Proceedings of the 18th Symposium on Distributed Computing (DISC), pp. 71–85. Springer-Verlag.

[41] Harris, T. (2001). A Pragmatic Implementation of Non-blocking Linked-lists. Distributed Computing (DC), 300–314.

[42] Harris, T. L., K. Fraser, and I. A. Pratt (2002). A practical multi-word compare-and-swap operation. In Proceedings of the 16th Symposium on Distributed Computing (DISC), pp. 265–279.

[43] He, Y., K. Gopalakrishnan, and E. Gafni (2016). Group mutual exclusion in linear time and space. In Proceedings of the 17th International Conference on Distributed Computing and Networking (ICDCN), pp. 22:1–22:10.

[44] Heller, S., M. Herlihy, V. Luchangco, M. Moir, W. N. Scherer III, and N. Shavit (2005). A Lazy Concurrent List-Based Set Algorithm. In Proceedings of the 9th International Conference on Principles of Distributed Systems (OPODIS), Pisa, Italy, pp. 3–16.

[45] Hennessy, J. L. and D. A. Patterson (2011). Computer Architecture: A Quantitative Approach (5th ed.). Morgan Kaufmann.

[46] Herlihy, M. (1991, January). Wait-Free Synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS) 13 (1), 124–149.

[47] Herlihy, M., Y. Lev, V. Luchangco, and N. Shavit (2007, June). A Simple Optimistic Skiplist Algorithm. In Proceedings of the 14th International Colloquium on Structural Information and Communication Complexity (SIROCCO), Castiglioncello, Italy, pp. 124–138.

[48] Herlihy, M., V. Luchangco, and M. Moir (2003). Obstruction-Free Synchronization: Double-Ended Queues as an Example. In Proceedings of the 23rd IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 522–529.

[49] Herlihy, M. and J. Moss (1993). Transactional Memory: Architectural Support for Lock-Free Data Structures. In Proceedings of the 20th International Symposium on Computer Architecture (ISCA), pp. 289–300.

[50] Herlihy, M. and N. Shavit (2012). The Art of Multiprocessor Programming, Revised Reprint. Morgan Kaufmann.

[51] Herlihy, M., N. Shavit, and M. Tzafrir (2007). Concurrent Cuckoo Hashing. Technical report, Brown University, Providence, Rhode Island, USA.

[52] Herlihy, M. and J. M. Wing (1990, July). Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems (TOPLAS) 12 (3), 463–492.

[53] Howley, S. V. and J. Jones (2012, June). A Non-Blocking Internal Binary Search Tree. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 161–171.

[54] Hsu, M. and W. P. Yang (1986). Concurrent Operations in Extendible Hashing. In Proceedings of the International Conference on Very Large Data Bases (VLDB), San Francisco, California, USA, pp. 241–247.

[55] ISO/IEC (2011). International Standard ISO/IEC 14882:2011(E) Programming Language C++. https://isocpp.org/std/the-standard.

[56] Israeli, A. and L. Rappoport (1994). Disjoint-access-parallel implementations of strong shared memory primitives. In Proceedings of the 13th Symposium on Distributed Computing (DISC), pp. 151–160.

[57] Jayanti, P. (2002). F-arrays: Implementation and applications. In Proceedings of the 21st ACM Symposium on Principles of Distributed Computing (PODC), pp. 270–279. ACM.

[58] Jayanti, P., S. Petrovic, and K. Tan (2003). Fair Group Mutual Exclusion. In Proceedings of the 22nd ACM Symposium on Principles of Distributed Computing (PODC), New York, NY, USA, pp. 275–284. ACM.

[59] Joung, Y.-J. (2000). Asynchronous Group Mutual Exclusion. Distributed Computing (DC) 13 (4), 189–206.

[60] Keane, P. and M. Moir (1999). A Simple Local-Spin Group Mutual Exclusion Algorithm. In ACM Symposium on Principles of Distributed Computing (PODC), pp. 23–32.

[61] Kim, J. H., H. Cameron, and P. Graham (2006). Lock-Free Red-Black Trees Using CAS. Concurrency and Computation: Practice and Experience, 1–40.

[62] Kumar, V. (1990). Concurrent Operations on Extendible Hashing and its Performance. Communications of the ACM (CACM) 33 (6), 681–694.

[63] Kung, H. T. and P. L. Lehman (1980, September). Concurrent Manipulation of Binary Search Trees. ACM Transactions on Database Systems 5 (3), 354–382.

[64] Lotan, I. and N. Shavit (2000, April). Skiplist-Based Concurrent Priority Queues. In Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS), pp. 263–268.

[65] Martin, M., M. Hill, and D. Wood (2003). Token coherence: Decoupling performance and correctness. In Proceedings of the 30th International Symposium on Computer Architecture (ISCA), pp. 182–193.

[66] Mellor-Crummey, J. M. and M. L. Scott (1991, February). Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems (TOCS) 9 (1), 21–65.

[67] Michael, M. M. (2002). High Performance Dynamic Lock-Free Hash Tables and List-based Sets. In Proceedings of the 14th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 73–82.

[68] Michael, M. M. (2004). Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. IEEE Transactions on Parallel and Distributed Systems (TPDS) 15 (6), 491– 504.

[69] Michael, M. M. and M. L. Scott (1996). Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. In Proceedings of the 15th ACM Symposium on Principles of Distributed Computing (PODC), pp. 267–275.

[70] Moir, M. (1997). Practical implementations of non-blocking synchronization primitives. In Proceedings of the 16th ACM Symposium on Principles of Distributed Computing (PODC), pp. 219–228.

[71] Molka, D., D. Hackenberg, and R. Schöne (2014). Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer. In Proceedings of the Workshop on Memory Systems Performance and Correctness (MSPC), pp. 4:1–4:10.

[72] Natarajan, A. and N. Mittal (2013, October). Brief Announcement: A Concurrent Lock-Free Red-Black Tree. In Proceedings of the 27th Symposium on Distributed Computing (DISC), Jerusalem, Israel.

[73] Natarajan, A., L. H. Savoie, and N. Mittal (2013, November). Concurrent Wait-Free Red-Black Trees. In Proceedings of the 15th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), Osaka, Japan, pp. 45–60.

[74] Newman, M. E. J. (2005). Power Laws, Pareto Distributions and Zipf’s Law. Contemporary Physics, 323–351.

[75] Patterson, D. A. and J. L. Hennessy (2013). Computer Organization and Design: The Hardware/Software Interface (5th ed.). Morgan Kaufmann.

[76] Platz, K., N. Mittal, and S. Venkatesan (2014). Practical Concurrent Unrolled Linked Lists Using Lazy Synchronization. In Proceedings of the 18th International Conference on Principles of Distributed Systems (OPODIS), pp. 388–403.

[77] Prokopec, A., N. G. Bronson, P. Bagwell, and M. Odersky (2012). Concurrent Tries with Efficient Non-Blocking Snapshots. In Proceedings of the 17th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 151–160.

[78] Pugh, W. (1989). Concurrent Maintenance of Skip Lists. Technical Report CS-TR-2222.1, Institute for Advanced Computer Studies, Department of Computer Science, University of Maryland.

[79] Pugh, W. (1990). Skip Lists: A Probabilistic Alternative to Balanced Trees. Communications of the ACM (CACM) 33 (6), 668–676.

[80] Rudolph, L. and Z. Segall (1984, January). Dynamic Decentralized Cache Schemes for MIMD Parallel Processors. In Proceedings of the 11th International Symposium on Computer Architecture (ISCA), New York, NY, USA, pp. 340–347. ACM.

[81] Shafiei, N. (2013, July). Non-blocking Patricia Tries with Replace Operations. In Proceedings of the 33rd IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 216–225.

[82] Shalev, O. and N. Shavit (2006, May). Split-ordered lists: Lock-free extensible hash tables. Journal of the ACM (JACM) 53 (3), 379–405.

[83] Shao, Z., J. H. Reppy, and A. W. Appel (1994). Unrolling Lists. In Proceedings of the ACM Conference on LISP and Functional Programming (LFP), New York, NY, USA, pp. 185–195. ACM.

[84] Shavit, N. and D. Touitou (1995). Software Transactional Memory. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing (PODC), pp. 204–213.

[85] Squillante, M. S. and E. D. Lazowska (1993, February). Using processor-cache affinity information in shared-memory multiprocessor scheduling. IEEE Transactions on Parallel and Distributed Systems (TPDS) 4 (2), 131–143.

[86] Strohmaier, E., J. Dongarra, H. Simon, and M. Meuer (2017). TOP500 List (June 2017). https://www.top500.org/lists/2016/06/.

[87] Sundell, H. and P. Tsigas (2004, March). Scalable and Lock-Free Concurrent Dictionaries. In Proceedings of the 19th Annual Symposium on Selected Areas in Cryptography, New York, NY, USA, pp. 1438–1445. ACM.

[88] Takamura, M. and Y. Igarashi (2003). Group mutual exclusion algorithms based on ticket orders. In Proceedings of the 9th Annual International Conference on Computing and Combinatorics (COCOON), pp. 232–241.

[89] Timnat, S., A. Braginsky, A. Kogan, and E. Petrank (2012). Wait-free linked-lists. In Proceedings of the 17th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 309–310.

[90] Valois, J. D. (1999). Lock-Free Linked Lists using Compare-And-Swap. In Proceedings of the 14th ACM Symposium on Principles of Distributed Computing (PODC), pp. 214– 222.

[91] Vaswani, R. and J. Zahorjan (1991). The implications of cache affinity on processor scheduling for multiprogrammed, shared memory multiprocessors. In Proceedings of the 13th ACM Symposium on Operating Systems Principles (SOSP).

[92] Yang, C., A. Lebeck, H. Tseng, and C. Lee (2004, December). Tolerating memory latency through push prefetching for pointer-intensive applications. ACM Transactions on Architecture and Code Optimization (TACO) 1 (4), 445–475.

[93] Zipf, G. K. (1950). Human behavior and the principle of least effort. American Anthropologist 52 (2), 268–270.

BIOGRAPHICAL SKETCH

Kenneth J. Platz was born on July 12, 1972. He received his Bachelor of Science degree from the University of Illinois at Urbana-Champaign in 1994. He worked at various positions in software engineering and information technology infrastructure for 20 years. He joined The University of Texas at Dallas in 2010 on a part-time basis, earning his Master of Science in Computer Science in 2014. He thereafter entered the PhD program in Computer Science at The University of Texas at Dallas in 2014 (full-time). His wife, Tracy, is also a graduate of The University of Texas at Dallas (Master of Science in Physics, 2003). Kenneth and Tracy have one son, Zachary, who was born in 2012.

CURRICULUM VITAE

Kenneth J. Platz November 3, 2017

Contact Information:
Department of Computer Science
The University of Texas at Dallas
800 W. Campbell Rd.
Richardson, TX 75080-3021, U.S.A.
Voice: (214) 460-1927
Email: [email protected]

Educational History:
B.S., Computer Science, University of Illinois at Urbana-Champaign, 1994
M.S., Computer Science, The University of Texas at Dallas, 2014
Ph.D., Computer Science, The University of Texas at Dallas, 2017
Saturation in Lock-Based Concurrent Data Structures
Ph.D. Dissertation
Computer Science Department, The University of Texas at Dallas
Advisors: Dr. Neeraj Mittal and Dr. S. Venkatesan

Employment History:
Software Engineer, NetApp SolidFire, September 2017 – Present
Intern, vSphere Integrated Containers, VMware, Inc., June 2016 – August 2016
Graduate Teaching and Research Assistant, The University of Texas at Dallas, August 2014 – August 2017
Systems Administrator Leader, Computer Sciences Corporation, October 2005 – August 2014
Consultant, Adea Corporation, December 2004 – October 2005
Senior Systems Administrator, Colonial BancGroup, December 2003 – November 2004
UNIX Primary Delivery Engineer, Hewlett-Packard, September 2001 – October 2003
Response Center Engineer, Hewlett-Packard, September 1997 – September 2001
Consultant, Bradford and Galt Consulting Services, September 1996 – September 1997
Hardware Engineering Analyst, EDSI Corporation, January 1996 – September 1996
Associate Software Engineer, NCI Incorporated, May 1995 – December 1995
Junior Software Engineer, SAIC, August 1994 – May 1995

Professional Recognitions and Honors:
Eta Kappa Nu Honor Society, The University of Texas at Dallas, 2016
Phi Kappa Phi Honor Society, The University of Texas at Dallas, 2014
Certificate of Academic Achievement, The University of Texas at Dallas, 2014
HP “Teams Work” award, Hewlett-Packard, 1999

Professional Memberships:
Institute of Electrical and Electronics Engineers (IEEE), 2016–present
Association for Computing Machinery (ACM), 2014–present