Performance Evaluation of Blocking and Non-Blocking Concurrent Queues on GPUs
Recommended Citation: Pourmeidani, Hossein, "Performance Evaluation of Blocking and Non-Blocking Concurrent Queues on GPUs" (2019). Electronic Theses and Dissertations. 1588. https://egrove.olemiss.edu/etd/1588

PERFORMANCE EVALUATION OF BLOCKING AND NON-BLOCKING CONCURRENT QUEUES ON GPUS

A Thesis presented in partial fulfillment of the requirements for the degree of Master of Science in the Department of Computer and Information Science, The University of Mississippi

by Hossein Pourmeidani

December 2018

Copyright Hossein Pourmeidani 2018. ALL RIGHTS RESERVED.

ABSTRACT

The efficiency of concurrent data structures is crucial to the performance of multi-threaded programs in shared-memory systems. The arbitrary execution of concurrent threads, however, can result in incorrect behavior of these data structures. Graphics Processing Units (GPUs) have emerged as a powerful platform for high-performance computing. While regular data-parallel computations are straightforward to implement on traditional CPU architectures, it is challenging to implement them in a SIMD environment in the presence of thousands of active threads on GPU architectures. In this thesis, we implement a concurrent queue data structure and evaluate its performance on GPUs to understand how it behaves in a massively-parallel GPU environment.
We implement both blocking and non-blocking approaches and compare their performance and behavior using both micro-benchmarks and a real-world application. We provide a complete evaluation and analysis of our implementations on an AMD Radeon R7 GPU. Our experiments show that the non-blocking approach outperforms the blocking approach by up to 15.1 times when sufficient thread-level parallelism is present.

ACKNOWLEDGEMENTS

My first and foremost acknowledgment goes to my family. I would like to thank my parents, who have supported me until I got here. I would like to start with my mother, my role model, who pursued education and hard work to the last day of her life. I thank her for planting in me the love of learning and for believing in my abilities. I would like to thank my father for the full freedom and trust he granted me to navigate my journey of self-awareness the way I chose. My deepest thanks go to my sisters, brothers, and wife for their endless support and unconditional love. I would like to thank my cousin and best friend Mohammadreza for being my home away from home; I would not have done it without him. I would like to express my gratitude to my research adviser, Dr. Byunghyun Jang, for encouraging me to think independently and opening new doors of knowledge for me. I am grateful for all the guidance, help, and honest advice he has offered me during my Master's journey. I have been fortunate to work with him and to gain professional skills that are valuable for my future career. I am grateful to my current lab mates David Troendle and Mason Zhao, and to previous lab members Ajay Sharma, Tuan Ta, and Esraa Gad for their collaboration. I would especially like to thank David Troendle for all the insightful discussions and for the foundation on which this thesis was built. Last but not least, I would like to thank all the professors in the department for all the help they are always willing to offer. Special thanks to Dr.
Feng Wang for the valuable discussion and the constructive critique of my thesis.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
1 INTRODUCTION
2 RELATED WORK AND BACKGROUND
  2.1 Related Work
  2.2 Background
3 DESIGN AND IMPLEMENTATION
  3.1 Blocking Algorithm
  3.2 Non-blocking Algorithm
4 EXPERIMENTAL RESULTS
  4.1 Experiments Setup
  4.2 Performance Evaluation and Analysis
5 CONCLUSION
BIBLIOGRAPHY
VITA

LIST OF FIGURES

2.1 Relaxed queue for BFS.
2.2 Bad interleaving: (a) Two threads want to increment the pointer. (b) Two threads updated the pointer but the result is incorrect.
2.3 OpenCL platform [17].
2.4 GPU thread scheduler assigns each workgroup to a CU [17].
2.5 Work-group, wavefront and work-item [17].
3.1 Bad interleaving in queues: (a) Two threads want to increment the Front pointer. (b) Two threads updated the Front pointer but the result is incorrect.
3.2 Two threads are trying to add their values simultaneously in a blocking approach: (a) Lock is free. (b) Lock is acquired by W1. (c) W1 adds its value and updates q. (d) Lock is acquired by W2. (e) W2 adds its value and updates q.
3.3 Two threads are trying to add their values simultaneously in a non-blocking approach: (a) xp(W1) is consistent. (b) q is updated by W1. (c) W1 adds its value. (d) xp(W2) is consistent. (e) W2 adds its value.
3.4 Two threads are trying to delete some values simultaneously in a non-blocking approach: (a) t(W1) is consistent. (b) Front is updated by W1. (c) W1 deletes a value. (d) t(W2) is consistent. (e) W2 deletes a value.
4.1 Blocking Algorithm: (a) Add = 80%, Delete = 20%. (b) Add = 50%, Delete = 50%.
4.2 Non-blocking Algorithm: (a) Add = 80%, Delete = 20%. (b) Add = 50%, Delete = 50%.
4.3 BFS Result for Road Networks: (a) Blocking. (b) Non-blocking.

CHAPTER 1

INTRODUCTION

In shared-memory multiprocessors, multiple active threads run simultaneously and communicate and synchronize via data structures in shared memory. As the efficiency of these data structures is critical to performance, designing efficient data structures for multiprocessor machines has been studied extensively. Designing such concurrent data structures is much more difficult than designing sequential ones, since simultaneously running threads may interleave arbitrarily and produce incorrect behavior. Furthermore, scalability is a challenge in the design of concurrent data structures, as contention among threads can severely undermine it [14]. Implementations exist for various concurrent data structures, such as stacks [20], queues [2, 3, 7, 15, 21] and skip-lists [19], but most of them target multi-core CPUs. Recently, Graphics Processing Units (GPUs) have become one of the most preferred platforms for high-performance parallel computing. This computing model is generally referred to as General Purpose Computing on GPU (GPGPU) or GPU computing. While the aforementioned concurrent data structures have been implemented and evaluated on many different multi-core CPU architectures, little has been studied on GPUs. With the recent introduction of improved memory models, including atomic primitives, on GPUs, existing concurrent data structures for multi-core CPUs can be ported to GPUs. Regular data-parallel computations with little or no synchronization have been implemented efficiently on GPUs. However, irregular workloads are known to be difficult to implement due to their dynamic behavior of control flow and parallelism.
Achieving scalable performance on such workloads requires efficient concurrent data structures for thread synchronization and communication. On medium-scale parallel machines with tens of active thread contexts, supporting synchronization among threads may be manageable, but with thousands of active threads it incurs significant performance overhead on GPUs. The advent of OpenCL [16] has made general-purpose programming on GPUs easier, but the design and implementation of concurrent data structures remains challenging. In this thesis, we present an evaluation of blocking and non-blocking implementations of a concurrent queue data structure on GPUs. To the best of our knowledge, this is the first attempt to understand their behavior on GPUs in depth. All of our implementations are written in the OpenCL C++ programming model and rely on OpenCL's atomic primitives, such as atomic compare-and-exchange and atomic exchange. We evaluate our implementations using several micro-benchmarks and a real-world application. All of our evaluations are carried out on an AMD Radeon R7 GPU. The rest of this thesis is organized as follows. Chapter 2 presents related work and background. Chapter 3 describes the design and implementation of the proposed blocking and non-blocking concurrent queues. Chapter 4 shows the experimental results. Conclusions are drawn in Chapter 5.

CHAPTER 2

RELATED WORK AND BACKGROUND

In this chapter, we review previous work on concurrent data structures (CDS) and provide the background information required to understand this thesis.

2.1 Related Work

Concurrent queues have been studied for three decades. Most of this work has targeted multi-core CPUs, and only a few works have targeted GPUs. In this section, we review several CPU-based CDS implementations and the few GPU-based studies that we found. The majority of CDS implementations are non-blocking, based on Compare-and-Swap (CAS). Mellor-Crummey [9] proposed a blocking concurrent queue based on fetch-and-store. Since enqueue and dequeue operations access both Front and Rear, enqueuers and dequeuers interfere with each other's cache lines, which results in limited scalability. Min et al. [12] proposed a scalable cache-optimized queue, which is also blocking. They entirely remove CAS failures in the enqueue operation by replacing CAS with fetch-and-store, and considerably decrease cache-line interference between enqueuers and dequeuers.