
Blocking and non-blocking concurrent hash tables in multi-core systems

ÁKOS DUDÁS, SÁNDOR JUHÁSZ
Budapest University of Technology and Economics
Department of Automation and Applied Informatics
1117 Budapest, Magyar Tudósok krt. 2 QB207
HUNGARY
[email protected], [email protected]

Abstract: Widespread use of multi-core systems demands highly parallel applications and algorithms in everyday computing. Parallel data structures, which are basic building blocks of concurrent algorithms, are hard to design in a way that they remain both fast and simple. By using mutual exclusion they can be implemented with little effort, but blocking synchronization has many unfavorable properties, such as delays, performance bottlenecks and being prone to programming errors. Non-blocking synchronization, on the other hand, promises solutions to the aforementioned drawbacks, but often requires a complete redesign of the algorithm and the underlying data structure to accommodate the needs of atomic instructions. Implementations of parallel data structures satisfy different progress conditions: lock based algorithms can be deadlock-free or starvation-free, while non-blocking solutions can be lock-free or wait-free. These properties guarantee different levels of progress, either system-wise or for all threads. We present several parallel implementations, satisfying different types of progress conditions and having various levels of implementation complexity, and discuss their behavior. The performance of different blocking and non-blocking implementations will be evaluated on a multi-core machine.

Key–Words: blocking synchronization, non-blocking synchronization, hash table, mutual exclusion, lock-free, wait-free

1 Introduction

Parallel programs have been around for more than a quarter of a century, and recently they have moved into the focus of research again. The general availability of multi-core processors brings the demand for high performance parallel applications into mainstream computing, promoting new design concepts in parallel programming [1].

The most crucial decision when designing parallel data structures and algorithms is finding an optimal balance between performance and complexity. Shavit in [1] argues that "concurrent data structures are difficult to design." Performance enhancements often result in complicated algorithms requiring mathematical reasoning to verify their correctness, whereas simpler solutions fall short of the required performance or have other unfavorable properties.

Critical sections assure correct behavior by limiting access to the shared resources. Every runtime environment supports various types of synchronization primitives supporting this type of parallel programming, and they usually require only a few changes in code or data. Critical sections and other primitives correspond to the level of abstraction of the programming environment, making the life of programmers easier. But simplicity comes with the cost of unfavorable properties, such as the risk of deadlocks, livelocks, priority inversion, and unexpected delays due to blocking [2, 3, 4, 5], not mentioning the increased risk of programming errors [6].

Non-blocking synchronizations promote themselves as true alternatives to locking. Such algorithms employ atomic read-modify-write operations supported by most modern CPUs. The basic idea is that instead of handling concurrency in a pessimistic way using mutual exclusion, concurrent modifications of a data structure are allowed as long as there is a fail-safe guarantee for detecting concurrent modification made by another thread. (Rinard used the term "optimistic synchronization" to describe this behavior [5].) Complex modifications are performed in local memory and then in an atomic replace step they are committed into the data structure.

The biggest advantage of non-blocking synchronization is complete lock-freeness. As there are no locks, the problems mentioned at blocking synchronization do not arise.


These methods also provide better protection from programming errors and from faulty threads not releasing locks, resulting in higher robustness and fault tolerance [7]. The tradeoff with non-blocking algorithms is that they are more complex and harder to design, even when they seem simple [8].

The right choice depends on the required performance and expected guarantees. There are two types of requirements: safety and liveness. Safety means that data structures work "correctly" without losing data or breaching data integrity. The results of concurrent operations performed on a parallel data structure are correct, according to the linearizability condition [9, 10], if every operation appears to take effect instantaneously at some point between its invocation and its response. The liveness property captures the fact whether the interfering parallel operations will eventually finish their work. Liveness conditions are different for blocking and non-blocking algorithms. For blocking algorithms, usually deadlock-freedom or starvation-freedom are expected, while non-blocking algorithms can be lock-free [11, 12] or wait-free [13, 14].

In this paper hash tables are used to illustrate the wide range of design options for parallel data structures. We discuss the implementation details and the performance costs of applying various approaches designing a concurrent hash table.

The rest of the paper is organized as follows. Section 2 presents the related literature of parallel algorithms and hash tables. Lock based solutions are discussed in Section 3, followed by the non-blocking algorithms in Section 4. After the implementation details, their performance is evaluated in Section 5 through experiments. The conclusions are discussed in Section 6.

2 Blocking- and non-blocking synchronization in hash tables

Hash tables [17] store and retrieve items identified with a unique key. A hash table is basically a fixed-size table, where the position of an item is determined by a hash function. The hash function is responsible for distributing the items evenly inside the table in order to provide O(1) lookup time. The difference between the two basic types of hash tables (bucket and open address) is the way they handle collisions, i.e. when more than one item is hashed (mapped) to the same location. Bucket hash tables chain the colliding items using external storage (external means not storing within the original hash table body), while open-address hashing finds a secondary location within the table itself.

Hash tables are perfect candidates for optimization in parallel environments. The hash table itself is an abstract concept and there are various implementation options, which all demand different parallelization approaches. Best performance is observed when the hash table uses arrays as external storage [18, 19, 20], which will require structural changes in case of a non-blocking implementation.

The most notable downside of blocking synchronization is the level of parallelism it provides, or rather fails to provide. Limiting the number of parallel data accesses to a single thread at one time by a single critical section guarding the entire hash table significantly reduces overall performance. To overcome this issue, finer grained locking can be applied. Larson et al. in [21] use two lock levels: a global table lock and a separate lightweight lock (a flag) for each bucket. The implementation by Lea in [22] uses a more sophisticated locking scheme with a smaller number of higher level locks (allocated for hash table blocks including multiple buckets) allowing concurrent searching and resizing of the table. When considering large hash tables with a couple million or more buckets, the use of locks in high quantities can be challenging. Most environments do not support this many locks, and with custom implementations one has to be concerned with memory footprint and false sharing limiting the efficiency of caches [24, 25, 26].

There is no general consensus about blocking synchronization. Some argue that mutual exclusion locks degrade performance without any doubt [5, 11] and scale poorly [16], while others state that contention for the locks is rare in well designed systems [2, 27], therefore contention does not really impact performance [28]. The strongest reservation about blocking synchronization, and what is probably most relevant for advocating non-blocking solutions, is that they build on assumptions, such as critical sections are short, and the operating system scheduler is not likely to suspend the thread while it is in the critical section. These assumptions, however, may not always hold.

To ensure greater robustness and less dependency on the operating system, a parallel data structure can be implemented using the compare-and-exchange/compare-and-swap (CAS) operations. This method is founded on the fact that all modern architectures provide hardware support for atomic combined read-write operations (e.g. the CMPXCHG instruction in the x86 architecture) that allow the non-destructive manipulation (setting the value of a memory location while somehow preserving its old content) of a single machine word. Herlihy describes the basic concept of transforming sequential algorithms into non-blocking protocols this way [29]: first the object to be modified is copied into local memory in its current state, then the modifications are performed on this local copy instance, and finally the change is committed back using the CAS instruction. If another thread has modified the same data in the meantime, the operation is retried.
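This copy-modify-commit cycle maps directly onto the atomic primitives of modern C++. The following minimal sketch (our illustration, not code from the paper) applies the pattern to a single shared machine word using std::atomic; the counter and names are only for demonstration:

    #include <atomic>

    std::atomic<long> shared_value{0};   // a single shared machine word

    void add(long delta) {
        long expected = shared_value.load();   // copy the current state
        long desired;
        do {
            desired = expected + delta;        // modify the local copy
            // commit: the write succeeds only if the word still holds
            // 'expected'; on failure 'expected' is reloaded and we retry
        } while (!shared_value.compare_exchange_weak(expected, desired));
    }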


Figure 1: Progress conditions of parallel data structures described by Herlihy and Shavit [33].

With this method complete lock-free data structures and algorithms can be built, as it was applied to linked lists by Harris [30], Valois [12] and Michael [25], whose works were further extended to resizable hash tables by Shalev and Shavit in [31]. Both Gao et al. [8] and Purcell and Harris [32] built lock-free open address hash tables.

Non-blocking algorithms are generally hard to design. This is in fact the case for parallel hash tables built on the CAS primitives. The entire organization of the data has to be revised: arrays, having given the best performance in the single threaded case, are no longer a storage option, as an additional level of indirection has to be introduced into accessing an element in order to allow transactional modification with a single CAS operation. If the CAS fails, the complete transaction has to be retried, which requires restructuring the code to allow this fallback execution path. All in all, a completely new hash table is required, with careful design of placement and alignment in memory.

Other concerns about non-blocking implementations include the fact that atomic operations are more expensive than simple writes [8]. Michael even went as far as saying that they are "inefficient" [11], and so are the algorithms that avoid the use of locks [16].

There is an inherent hierarchy, or rather structure, among both the blocking and the non-blocking algorithms with regards to the assumptions they are based on and the level of progress they provide. Several liveness conditions have been known in the literature for quite some time, yet there was no handle on their relation until recently. Herlihy and Shavit presented a "unified explanation" [33] of the progress properties. They placed these properties in a two-dimensional structure, where one axis defines maximal or minimal progress (informally: expect at least a single thread to make progress - minimal, or all threads to complete actions - maximal); the other axis defines the assumptions about the operating system scheduler. Algorithms are either blocking or non-blocking, and they can depend on the scheduler or can guarantee the level of progress regardless of scheduler support. Fig. 1 describes this structure of progress conditions (as in [33]).

This system of classification of parallel algorithms helps with comparing different implementations, focusing on their merits both in terms of performance and in robustness. Wait-free algorithms are the "best" but usually the most complex, while lock-free algorithms are simpler to create. There is a similar relationship between starvation-free and deadlock-free algorithms, given that they are also more dependent on the operating system. The practical importance of the obstruction-free and clash-free conditions is yet to be discovered.

3 Blocking hash tables

Blocking synchronization mechanisms are usually easier to implement. Locks and critical sections are an inherent part of all programming languages and environments, and they are described on an abstraction level the programmer is familiar with, therefore it is straightforward to use them.

3.1 Deadlock-free hash table

Being conceptually simpler does not mean that blocking algorithms require no design. The basic expectation from lock-based data structures is to be deadlock-free, that is, threads cannot delay each other forever. The simplest way of guaranteeing deadlock-freedom is using a single lock in the data structure, which basically serializes all accesses. This single lock protects the data structure at each entry point. Alg. 1 shows the implementation of the insert method in a deadlock-free hash table.


Algorithm 1 Deadlock-free concurrent hash table using a single table lock.

initialize:
    table_lock: mutex

insert( data ):
    table_lock.enter()          // acquire lock, blocks until successful
    insert_internal( data )     // perform the insert as usual
    table_lock.release()        // release the lock
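For comparison, the same single-lock guard can be expressed in C++ with a std::mutex. This is only a sketch; the Data type and insert_internal are placeholder names, not the paper's implementation:

    #include <mutex>

    struct Data { long key; long value; };

    class LockedHashTable {
        std::mutex table_lock;                  // single lock for the whole table

    public:
        void insert(const Data& data) {
            std::lock_guard<std::mutex> guard(table_lock); // blocks until acquired
            insert_internal(data);              // perform the insert as usual
        }                                       // lock released when guard is destroyed

    private:
        void insert_internal(const Data& data); // sequential insert, not shown
    };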

Algorithm 2 Deadlock-free concurrent hash table using fine-grained locking.

initialize:
    bucket_locks[]: mutex

insert( data ):
    bucketIdx = hash( data )             // maps the item to a bucket
    bucket_locks[bucketIdx].enter()      // acquire lock, blocks until successful
    insert_internal( data )              // perform the insert as usual
    bucket_locks[bucketIdx].release()    // release the lock
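A possible C++ rendering of the group-based fine-grained locking is sketched below; the group count, the Data type and the hash-based group mapping are illustrative assumptions rather than the paper's code:

    #include <array>
    #include <cstddef>
    #include <functional>
    #include <mutex>

    struct Data { long key; long value; };

    class BucketLockedTable {
        static constexpr std::size_t kGroups = 1024;  // illustrative group count
        std::array<std::mutex, kGroups> group_locks;  // one lock per bucket group

        std::size_t group_of(long key) const {        // maps an item to its lock group
            return std::hash<long>{}(key) % kGroups;
        }

    public:
        void insert(const Data& data) {
            std::lock_guard<std::mutex> guard(group_locks[group_of(data.key)]);
            insert_internal(data);                    // sequential insert, not shown
        }

    private:
        void insert_internal(const Data& data);
    };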

A similar guard is to be placed around each operation to make the data structure completely thread-safe and deadlock-free.

More sophisticated locking mechanisms use multiple locks. Fine-grained locking can deliver a much better level of parallelism and performance. For example, in case of bucket hash tables each bucket can be protected with a dedicated lock. Since each basic operation (except for rename) works in a single bucket, the implementation is still deadlock-free. In case of large hash tables, it might not be feasible to use tens of thousands, or even millions of locks. In such cases, the buckets can be assigned to groups, where each bucket is only a member of one group but one group contains multiple buckets [23]. Protecting each group with a single lock delivers a good level of parallelism (depending on the number of threads and the number of groups). Alg. 2 shows the deadlock-free hash table with fine-grained locking.

The only requirement about the locks used in the aforementioned implementations is that they work correctly, that is, they allow a single thread to reside within the critical section. There is no fairness guarantee, such as allowing the access to the critical section only in the same order as it was requested. Simple spin-locks (also known as busy-wait loops) spinning on a test-and-set instruction [34, 24] as well as mutexes offered by the programming environment can be used. Other options include using reader-writer locks, which allow concurrent readers and a single writer. This type of lock, as long as only a single one is accessed by each operation, is still deadlock-free; the only difference is that the appropriate type of access (shared or exclusive) is determined by the operation (i.e. find requires read access while insert requires write access).

3.2 Starvation-free hash table

Deadlock-freedom guarantees only that there are threads that make progress (specifically the one that currently owns the lock). It does not specify that all threads must complete their operations. A more restrictive condition is starvation-freedom, which requires that all threads get access to the critical section eventually. With test-and-set locks this cannot be guaranteed, as it is possible that a thread always loses the race for the lock.

In order to guarantee starvation freedom with mutual exclusion, threads should enter the critical section in the same order as they request it. Queue-locks satisfy this requirement. Using Mellor-Crummey and Scott's ticket lock [24] the bucket hash table is modified as shown in Alg. 3 for a single table lock; fine-grained bucket level locks can be implemented similarly. (The interlocked-increment instruction increases the value of the integer in an atomic step and returns the new value.)


Algorithm 3 Starvation-free concurrent hash table using a single table lock.

initialize:
    currently_served: int, initially 1
    ticket_counter: int, initially 0

insert( data ):
    my_ticket = interlocked_increment( ticket_counter )  // get ticket
    while( my_ticket != currently_served )               // wait until my ticket is served
        nop                                              // no operation, spin-wait
    insert_internal( data )                              // perform the insert as usual
    interlocked_increment( currently_served )            // notify next in line
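In C++ the ticket lock can be sketched with two atomic counters, fetch_add playing the role of the interlocked increment; an illustrative sketch of Alg. 3 as a reusable lock, not the paper's implementation:

    #include <atomic>

    // Threads are served strictly in ticket order, which makes the
    // guarded hash table starvation-free.
    class TicketLock {
        std::atomic<unsigned> ticket_counter{0};    // next ticket to hand out
        std::atomic<unsigned> currently_served{0};  // ticket currently allowed in

    public:
        void enter() {
            unsigned my_ticket = ticket_counter.fetch_add(1);  // take a ticket
            while (currently_served.load(std::memory_order_acquire) != my_ticket) {
                // spin-wait until our ticket is called
            }
        }

        void release() {
            // notify the next thread in line
            currently_served.fetch_add(1, std::memory_order_release);
        }
    };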

4 Non-blocking hash tables

Their undesired properties often limit the usability of blocking synchronization. Even in the absence of deadlocks and starvation, the level of parallelism is limited: threads hinder each other's progress, causing priority inversion and problems that occur when a thread crashes and does not release a lock by error. Non-blocking synchronization mechanisms promote themselves as alternatives to lock based methods, offering better scalability and robustness.

4.1 Linked list-based lock-free bucket hash table

The sensitive operation within the hash table is committing a new entry into the bucket, or removing an entry. Using a linked list for realizing the buckets, the hash table can be implemented in a non-blocking fashion. Alg. 4 shows the CAS based linked-list lock-free bucket hash table algorithm. This algorithm works only if there are no deletes from the hash table. Deletes can be supported as well, with some modifications [12].

It is easy to see that this implementation is lock-free. There are no locks or mutual exclusion in the code, hence there are no indefinite delays. That is to say, there is one case: the CAS operation can fail if the target memory location has been changed by another thread. It will, however, always succeed for a single thread, thus the data structure is lock-free. As for correctness, this algorithm is linearizable, the CAS being the linearization point (where the changes appear atomically for the entire system).

4.2 Array based lock-free bucket hash table

Linked-lists are not always the best choice for bucket hash tables. Data prefetch mechanisms and cache lines containing multiple items mask the memory access latency when iterating through the items in arrays, while prefetching is of no use when traversing a linked list. This can yield an increase in the amount of cache misses, degrading performance.

The incompatibility between compare-and-swap and arrays lies in the fact that CAS works with machine word size values only, meaning that an item in the array larger than 4 or 8 bytes cannot be written atomically. The compromise is as follows. The buckets are implemented with arrays, but when a new item is to be added, a new array is allocated copying the previous contents and adding the new one, then CASing the pointer to this new array into the hash table body (see Alg. 5). Only one concurrent insert will succeed, causing the other one to retry. Deleting items works the same way, except the new array will not contain the item to remove. Search operations are not affected by this as long as they hold the pointer to the bucket as they know it.

This implementation is, similarly to the linked-list based one, lock-free but not wait-free for the same reasons discussed previously. The linearization point is the CAS operation.

This approach wastes the new array allocation and copy operation in case of contention for the same bucket, which is an obvious performance drawback. The gain on the other hand is the faster searching capability in the array.

De-allocating the unused arrays is also a problem. When a new array replaces the previous one, the old one should be de-allocated, given that no one else uses it. In garbage collected systems this is handled by the environment, but in other cases this is the responsibility of the programmer. The solution is either reference counting, or if possible, not deallocating memory until it is safe (after parallel operations have been completed).


Algorithm 4 CAS based lock-free concurrent bucket hash table with linked list.

typedef node: key, value, next

initialize:
    buckets[]: pointer to node

insert( data ):
    myItem = create( data )                  // creates the node from the data
    l = buckets[ hash( data ) ]              // linked list representing the bucket
    while( true ):                           // loop for CAS retries
        curr = l
        while curr.next != NULL:             // walk the list,
            if( matches( data, curr ) ):     // checking each item on the way
                return curr
            curr = curr.next
        if( matches( data, curr ) ):         // check the last item as well
            return curr
        if( CAS( curr.next, myItem, NULL ) ) // link the new item to the end
            return myItem                    // CAS succeeded (item is inserted)
                                             // or retry
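A C++ sketch of the same idea follows; for brevity it links new nodes at the head of the bucket instead of the tail, which still makes every concurrent insert into the bucket conflict on the single head pointer, so a failed CAS reliably signals interference. Names and types are illustrative, and deletions are not supported, matching the restriction of Alg. 4:

    #include <atomic>

    struct Node {
        long key;
        long value;
        Node* next;
    };

    // Insert-if-absent into one bucket (one atomic head pointer per bucket).
    Node* insert(std::atomic<Node*>& bucket_head, long key, long value) {
        Node* my_item = new Node{key, value, nullptr};  // create the node
        while (true) {                                  // loop for CAS retries
            Node* head = bucket_head.load();
            for (Node* curr = head; curr != nullptr; curr = curr->next) {
                if (curr->key == key) {                 // item already present
                    delete my_item;
                    return curr;
                }
            }
            my_item->next = head;                       // link in front of the old head
            if (bucket_head.compare_exchange_weak(head, my_item))
                return my_item;                         // CAS succeeded, item inserted
            // another thread changed the bucket; retry with the new head
        }
    }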

Algorithm 5 CAS based lock-free concurrent bucket hash table with array.

initialize:
    buckets[]: pointer to array

insert( data ):
    while( true ):                                    // loop for CAS retries
        h = hash( data )
        myArray = buckets[ h ]                        // the current bucket
        idx = find_in_array( myArray, data )          // check the current data
        if idx != -1:                                 // found
            return myArray[ idx ]
        newArray = allocate_new_copy( myArray, data ) // create new array, copy existing
                                                      // data and insert the new one
        if( CAS( buckets[ h ], newArray, myArray ) )  // replace bucket
            return                                    // CAS succeeded (item is inserted)
                                                      // or retry
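The copy-on-write bucket can be sketched in C++ as follows; the Bucket type is illustrative, the bucket pointer is assumed non-null, and, as discussed above, reclaiming the replaced arrays is deliberately left out (a real implementation needs reference counting or deferred reclamation):

    #include <atomic>
    #include <vector>

    struct Bucket {
        std::vector<long> items;   // treated as immutable once published
    };

    void insert(std::atomic<Bucket*>& bucket, long item) {
        while (true) {                                  // loop for CAS retries
            Bucket* current = bucket.load();            // the current bucket array
            for (long existing : current->items)
                if (existing == item) return;           // item already present
            Bucket* replacement = new Bucket(*current); // copy existing data
            replacement->items.push_back(item);         // and add the new item
            if (bucket.compare_exchange_weak(current, replacement))
                return;                                 // bucket replaced; the old array
                                                        // is leaked here (see above)
            delete replacement;                         // another thread won; retry
        }
    }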

4.3 Wait-free hash table

The last hash table presented in this section takes a different approach compared to all previously discussed implementations. All the solutions above required, to some extent, the modification of the hash table code: placing the locks and the critical sections, reorganizing the entire table into linked lists or using the compare-and-exchange operation. Let us present an algorithm which uses no locks, is not only non-blocking but completely wait-free, and requires no modifications inside the hash table. To our knowledge this is the first wait-free hash table in the literature.

The main problem with parallelism from the point of view of data structures is concurrent modification of the same storage slot. If different threads worked with different hash tables, this would not be a problem anymore. Instead of the traditional scenario where each thread is capable of executing any work, we specialize our threads by assigning individual work areas to them in a way that these working areas do not overlap (separate functions, pipeline stages, spatial domains or graph branches). A specific request (insert/find) can only be served by one thread, and it is up to the right selection of domains to provide the load balance. The selection rules are usually based on data decomposition, as functional decomposition generally provides limited scalability and we would easily face the bottleneck problem of critical sections again, on the level of control organization. To our knowledge this kind of cooperation is not mentioned in the literature of shared memory algorithms, but the idea is not unknown in distributed systems, where low cost shared memory synchronization is not available, thus a coarser grained cooperation must be maintained between the nodes by directed point-to-point messages or implicit work allocation rules (e.g. distributed file servers, horizontally partitioned databases, document groups allocated to separate web servers).

The universe of items that the concurrent hash table stores is partitioned into several subtables, and each of these hash table partitions is unambiguously mapped to a single working thread. When a thread is ready to perform work, it checks the next work item in a central queue, and decides if it should process it. The decision is completely decentralized, and all threads must come to the same conclusion: one and only one of them chooses the item for processing. There is no communication between the threads, thus no overhead.

The decision whether to process an item is based on the item itself. Each item is unambiguously mapped to a thread by calculating a hash value from the data. In case of a hash table this is quite easy, as the items have a key, which, by definition, uniquely determines them. (This concept, however, we note, is generalizable to other data structures and problems. For example, sorting can be performed this way as well: the hash function can divide the items by their value: smaller than a threshold goes to one thread, larger to another [35].)

The critical element in this case is the central queue. It must be wait-free for the hash table system to be wait-free, and it must map the items to the requesting threads. There are wait-free queue algorithms in the literature, such as the one by Michael et al. [?]. Let n be the number of processing threads (and consequently the number of hash tables).


Algorithm 6 Wait-free algorithm for hash tables.

initialize:
    hash_tables[]: hash table, one for each thread
    queues[]: wait free queues, one for each thread

addRequest( data ):
    h = hash( data )                      // maps the item to a thread
    queues[h].enqueue( data )

processingThreadFunction():
    while true
        i = queues[threadId].dequeue()    // get item from the thread's queue
        hash_tables[threadId].insert( i ) // insert into the thread's own table
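A C++ sketch of the dispatch layer is given below. WaitFreeQueue is a stand-in template parameter for any wait-free multi-producer queue from the literature (assumed to provide enqueue and a blocking dequeue), and the subtables are modeled with ordinary unordered maps, since each is only ever touched by its owner thread:

    #include <cstddef>
    #include <functional>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    template <typename Item, typename WaitFreeQueue>
    class PartitionedHashTable {
        std::vector<WaitFreeQueue> queues;                  // one queue per thread
        std::vector<std::unordered_map<long, Item>> tables; // one private subtable each

    public:
        explicit PartitionedHashTable(std::size_t n_threads)
            : queues(n_threads), tables(n_threads) {}

        // Any thread may call this: the key routes the request to its owner.
        void add_request(long key, const Item& item) {
            std::size_t owner = std::hash<long>{}(key) % queues.size();
            queues[owner].enqueue(std::make_pair(key, item));
        }

        // Body of processing thread 'tid': only this thread touches tables[tid],
        // so the subtable itself needs no synchronization at all.
        void run_processing_thread(std::size_t tid) {
            while (true) {
                std::pair<long, Item> request = queues[tid].dequeue();
                tables[tid].emplace(request.first, request.second);
            }
        }
    };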

Using n wait-free queues, every processing thread receives a dedicated queue. Assigning an item to a thread needs to hash the data and add it to the right queue. The overhead this time is the pre-calculation of the hash by the process that would generally only insert the request into a queue. Algorithm 6 presents the solution.

The biggest advantage of this solution, besides being wait-free, is that the internal hash tables are unaware that they are used in a multithreaded environment. Any hash table implementation could be used and its source code does not need to be altered.

The difference between Alg. 6 and traditional parallel hash tables is that here the threads need to be aware that they are accessing different hash tables. It is not the responsibility of the hash table to handle parallelism, but it is left up to the threads themselves. This can be viewed as a "relaxation" of the requirements of the data structure, as Shavit calls it [1]. It might require changes in the algorithms using the hash table to support this behavior, but the gain is wait-freedom. (We also note that data parallel algorithms are likely to be easier to adapt to this behavior as they already distribute work among threads in a similar fashion.)

This wait-free implementation cannot be described by any of the generally used correctness criteria. It is not linearizable, as changes are not atomic. It is not sequentially consistent [3] either, because it is not clear which point is the invocation of a hash table operation and when the method returns: all operations are split into two phases, first insertion into the right queue, and then completing the work. This type of operation is not compatible with the general description of concurrent objects by histories, invocations and response events.

5 Performance evaluation

Vastly different hash table implementations were presented in the previous sections. Some are already known in the literature, while others present new ideas. They all feature different properties: the non-blocking solutions are all more robust, while the lock based tables are simpler. Let us examine how simplicity and compromises in the implementation translate into performance.

All of the parallel hash tables were implemented in C++ with careful optimization. They are all of the same structure: a bucket hash table with either a linked list or an array representing the buckets. The tables were initialized by inserting 8 million items into them, and a subsequent 8 million find queries were executed. The wall-clock execution time was measured. The insert and find operations were executed under two circumstances: in the first case the number of buckets of the hash tables was chosen to result in a single item per bucket on average, thus the contention for the buckets is small (Fig. 2); the latter case used a tenth of the number of buckets, causing high contention (Fig. 3). The results are an average of 5 independent executions on a state-of-the-art Intel Core i7-2600 CPU with 8 (logical) execution units.

Using a single lock is not evaluated below as it is expected to have very little speedup, if any, due to high contention for the single lock. The speedup is measured by comparison to the single threaded execution of the same operations using a very similar hash table implementation without locks or CAS; as a matter of fact, the very same table is used by rule based cooperation.

The lock based (bucket lock w/ spin locks and bucket lock w/ ticket lock) and the linked list based w/ CAS have quite similar performance. The maximal speedup is between around 3-3.5 and 4-4.5 for 8 concurrent threads, depending on the number of buckets used by the table. The lock-free solutions are both faster than the lock based solutions by a small margin under little contention, but the true power of the lock-free solutions shows with high contention: array based w/ CAS is by far the best one, reaching almost 6.5 speedup.


Figure 2: Speedup of the various parallel hash tables with little contention for the buckets.

Figure 3: Speedup of the various parallel hash tables with high contention for the buckets.


Table 1: Hash table realizations for the various progress conditions of parallel data structures.

                              non-blocking, independent      blocking, dependent
    every method              wait-free w/ queues            table lock w/ ticket lock,
    makes progress                                           bucket lock w/ ticket lock
    some method               linked list-based w/ CAS,      table lock w/ spin lock,
    makes progress            array-based w/ CAS             bucket lock w/ spin lock

Considering the overhead of the constant array allocations and copy operations, this result can only be due to the good cache behavior and prefetch friendliness exhibited by the array.

The wait-free solution, rule based cooperation, achieves a 2.5-3.5 speedup at its peak performance for 7 threads. It is not as fast as either the lock-based or the lock-free solutions, but at the same time it guarantees progress for all threads at all times, and internally it can work with arbitrary hash table implementations. The drop in performance under heavy loads for 4 and 5 threads in Fig. 3 is due to uneven load balancing of items among the threads. This points out another bottleneck in this wait-free implementation.

Lock-free solutions can deliver the same, if not better, performance than lock based solutions. The cost is the complexity of the algorithms. The wait-free algorithm requires restructuring of code, but not the structure of the hash table, and in exchange for the global progress guarantee we pay with a significant performance loss.

Guaranteeing stronger progress conditions, such as starvation-freeness or wait-freeness, results in performance loss. As an engineering choice the faster but less resilient solutions seem like a good trade-off, while the complicated progress guarantees should only be chosen when they are indeed necessary. Performance in lock-based solutions is manageable by choosing the level of granularity for locking, which is closer to an engineering solution, while there is no such directly tunable parameter for lock-free algorithms.

6 Conclusion

Parallel data structures are hard to design in a way that they remain simple while delivering good performance at the same time. Lock based solutions are simpler, but cannot guarantee much in terms of safety and robustness; non-blocking algorithms are much more complicated, and a compromise is required to balance performance and simplicity. In this paper we used hash tables to illustrate the vast amount of options for creating concurrent data structures. Their structure is relatively simple and both locking and non-blocking algorithms can be implemented with some effort. Table 1 lists the various hash table types organized according to the progress conditions they satisfy [33].

Lock based solutions, when designed carefully, such as using fine-grained locking for regions in the bucket hash table, have really good performance. Such locking solutions are relatively lightweight, and inter-thread communication in multi-core machines (via the locks) is very cheap, allowing to achieve up to 4.5 times speedup for 8 concurrent threads.

CAS based non-blocking solutions represent a different type of parallel hash tables. Their internal structure requires careful redesign, whereas, when carefully tuned, they have the same performance as the lock based solutions. The usual argument that they are more robust is masked by the fact that they are complex and therefore hard to verify.

The final option we presented, rule based cooperation, is to our knowledge the first wait-free hash table in the literature. It wraps the hash table with an additional layer, which handles the parallel access, thus requires no internal modifications of the original hash table and is still able to achieve a reasonable speedup.

Acknowledgements: This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TÁMOP-4.2.2.C-11/1/KONV-2012-0013).

References:

[1] N. Shavit, Data structures in the multicore age, Communications of the ACM 54 (3) (2011) 76.
[2] J. H. Anderson, Y.-J. Kim, T. Herman, Shared-memory mutual exclusion: major research trends since 1986, Distributed Computing 16 (2-3) (2003) 75-110.
[3] L. Lamport, How to make a correct multiprocess program execute correctly on a multiprocessor, IEEE Transactions on Computers 46 (7) (1997) 779-782.
[4] M. Hohmuth, H. Härtig, Pragmatic Nonblocking Synchronization for Real-Time Systems, in: Proceedings of the General Track: 2002 USENIX Annual Technical Conference, 2002.


[5] M. C. Rinard, Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives, ACM Transactions on Computer Systems 17 (4) (1999) 337-371.
[6] P. Gill, Operating Systems Concepts, 1st Edition, Vol. 2006, Laxmi Publications, New Delhi, India, 2006.
[7] G. E. Blelloch, B. M. Maggs, Parallel algorithms, in: Algorithms and theory of computation handbook, 2010, Ch. 25, p. 25.
[8] H. Gao, J. F. Groote, W. H. Hesselink, Lock-free dynamic hash tables with open addressing, Distributed Computing 18 (1) (2005) 21-42.
[9] M. P. Herlihy, J. M. Wing, Linearizability: a correctness condition for concurrent objects, ACM Transactions on Programming Languages and Systems 12 (3) (1990) 463-492.
[10] M. P. Herlihy, J. M. Wing, Axioms for concurrent objects, in: Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages - POPL '87, ACM Press, New York, New York, USA, 1987, pp. 13-26.
[11] M. Michael, Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors, Journal of Parallel and Distributed Computing 51 (1) (1998) 1-26.
[12] J. D. Valois, Lock-free linked lists using compare-and-swap, in: Proceedings of the Fourteenth Annual ACM Symposium on Principles of Distributed Computing, ACM Press, New York, New York, USA, 1995, pp. 214-222.
[13] M. Herlihy, Wait-free synchronization, ACM Transactions on Programming Languages and Systems 13 (1) (1991) 124-149.
[14] M. P. Herlihy, Impossibility and universality results for wait-free synchronization, in: Proceedings of the seventh annual ACM Symposium on Principles of distributed computing - PODC '88, ACM Press, New York, New York, USA, 1988, pp. 276-290.
[15] M. Herlihy, V. Luchangco, M. Moir, Obstruction-Free Synchronization: Double-Ended Queues as an Example, in: Proceedings of the 23rd International Conference on Distributed Computing Systems, IEEE Computer Society, Washington, DC, USA, 2003, pp. 522-529.
[16] H. Attiya, R. Guerraoui, D. Hendler, P. Kuznetsov, The complexity of obstruction-free implementations, Journal of the ACM 56 (4) (2009) 1-33.
[17] D. E. Knuth, The Art of Computer Programming, Vol. 3, Addison-Wesley, 1973.
[18] M. Mitzenmacher, Good Hash Tables & Multiple Hash Functions, Dr. Dobb's Journal (336) (2002) 28-32.
[19] G. L. Heileman, W. Luo, How Caching Affects Hashing, in: Proc. 7th ALENEX, 2005, pp. 141-154.
[20] A. Sachedina, M. A. Huras, K. K. Romanufa, Resizable cache sensitive hash table (2003).
[21] P.-A. Larson, M. R. Krishnan, G. V. Reilly, Scaleable hash table for shared-memory multiprocessor system (Apr. 2003).
[22] D. Lea, Hash table util.concurrent.ConcurrentHashMap, revision 1.3, in JSR-166, the proposed Java Concurrency Package (2003).
[23] Á. Dudás, S. Juhász, S. Kolumbán, Performance analysis of multi-threaded locking in bucket hash tables, ANNALES Universitatis Scientiarum Budapestinensis de Rolando Eötvös Nominatae Sectio Computatorica 36 (2012) 63-74.
[24] J. M. Mellor-Crummey, M. L. Scott, Algorithms for scalable synchronization on shared-memory multiprocessors, ACM Transactions on Computer Systems 9 (1) (1991) 21-65.
[25] M. M. Michael, High performance dynamic lock-free hash tables and list-based sets, in: ACM Symposium on Parallel Algorithms and Architectures, 2002, pp. 73-82.
[26] M. Greenwald, Non-blocking Synchronization and System Design, PhD thesis, Stanford University (Aug. 1999).
[27] L. Lamport, A fast mutual exclusion algorithm, ACM Transactions on Computer Systems 5 (1) (1987) 1-11.
[28] F. Fich, V. Luchangco, M. Moir, N. Shavit, Obstruction-Free Step Complexity: Lock-Free DCAS as an Example, in: P. Fraigniaud (Ed.), Lecture Notes in Computer Science, Vol. 3724, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 493-494.
[29] M. Herlihy, A methodology for implementing highly concurrent data structures, in: Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming - PPOPP '90, ACM Press, New York, New York, USA, 1990, pp. 197-206.
[30] T. L. Harris, A Pragmatic Implementation of Non-blocking Linked-Lists, in: DISC '01 Proceedings of the 15th International Conference on Distributed Computing, 2001, pp. 300-314.


[31] O. Shalev, N. Shavit, Split-ordered lists, Journal of the ACM 53 (3) (2006) 379-405.
[32] C. Purcell, T. Harris, Non-blocking hashtables with open addressing, in: Proceedings of the 19th International Symposium on Distributed Computing, Springer-Verlag GmbH, Krakow, Poland, 2005, pp. 108-121.
[33] M. Herlihy, N. Shavit, On the Nature of Progress, in: A. Fernández Anta, G. Lipari, M. Roy (Eds.), Principles of Distributed Systems, Lecture Notes in Computer Science, Vol. 7109, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 313-328.
[34] T. Anderson, The performance of spin lock alternatives for shared-memory multiprocessors, IEEE Transactions on Parallel and Distributed Systems 1 (1) (1990) 6-16.
[35] E. Horowitz, A. Zorat, Divide-and-Conquer for Parallel Processing, IEEE Transactions on Computers C-32 (6) (1983) 582-585.
