A Review of CUDA, MapReduce, and Pthreads Models

Kato Mivule1, Benjamin Harvey2, Crystal Cobb3, and Hoda El Sayed4

1 Computer Science Department, Bowie State University, Bowie, MD, USA [email protected] [email protected] [email protected] [email protected]

Abstract – The advent of high performance computing (HPC) and graphics processing units (GPUs) presents an enormous computation resource for large data transactions (big data) that require parallel processing for robust and prompt data analysis. While a number of HPC frameworks have been proposed, parallel programming models present a number of challenges – for instance, how to fully utilize features in the different programming models to implement and manage parallelism via multi-threading in both CPUs and GPUs. In this paper, we take an overview of three parallel programming models: CUDA, MapReduce, and Pthreads. The goal is to explore the literature on the subject and provide a high-level view of the features presented in the programming models, to assist high performance computing users with a concise understanding of parallel programming concepts and thus faster implementation of big data projects using high performance computing.

Keywords: Parallel programming models, GPUs, Big data, CUDA, MapReduce, Pthreads

1. INTRODUCTION
The increasing volume of data (big data) generated by entities unquestionably requires high performance parallel processing models for robust and speedy data analysis. The need for parallel computing has resulted in a number of programming models being proposed for high performance computing. However, parallel programming models that interface between the high performance machines and programmers present challenges, one of the difficulties being how to fully utilize features in the different programming models to implement computational units, such as multi-threads, on both CPUs and GPUs efficiently. Yet still, with the advent of GPUs, additional computation resources for the big data challenge are presented. However, fully utilizing GPU parallelization resources when translating sequential code for HPC implementation is a challenge. Therefore, in this paper, we take an overview of three parallel programming models: CUDA, MapReduce, and Pthreads. Our goal in this study is to give an overall high-level view of the features presented in the parallel programming models, to assist high performance computing users with a faster understanding of parallel programming concepts and thus better implementation of big data parallel programming projects. Although a number of HPC models could have been chosen for this overview, we focused on the CUDA, MapReduce, and Pthreads models because of the underlying features in the three models that can be used for multi-threading. The rest of this paper is organized as follows: in Section 2, a review of the latest literature on parallel computing models and determinism is given. In Section 3, an overview of CUDA is given, while in Section 4 a review of MapReduce features is presented. In Section 5, a topographical exploration of features in Pthreads is given. Finally, in Section 6, a conclusion is given.

2. BACKGROUND AND RELATED WORK
As HPC systems increasingly become a necessity in mainstream computing, a number of researchers have done work on documenting various features in parallel computing models. Although a number of features in parallel programming models are discussed in the literature, the CUDA, MapReduce, and Pthreads models all utilize multi-threading and, as such, a look at abstraction and determinism in multi-threading is given consideration in this paper. In their survey on parallel programming models, Kasim, March, Zhang, and See (2008) described how parallelism is abstracted and presented to programmers by investigating six parallel programming models used by the HPC community, namely Pthreads, OpenMP, CUDA, MPI, UPC, and Fortress [1]. Furthermore, Kasim et al. proposed a set of criteria to evaluate how parallelism is abstracted and presented to programmers, namely: (i) system architecture, (ii) programming methodologies, (iii) worker management, (iv) workload partitioning scheme, (v) task-to-worker mapping, (vi) synchronization, and (vii) communication model [1].

On the feature of determinism in parallel programming, Bocchino, Adve, Adve, and Snir (2009) noted that with most parallel programming models, subtle coding errors can lead to inadvertent non-deterministic behavior and bugs that are hard to find and debug [2]. Therefore, Bocchino et al. argued for a parallel programming model that is deterministic by default unless the programmer unambiguously decides to use non-deterministic constructs [2]. Non-determinism is a condition in multi-threaded parallel processing in which erratic and unpredictable behavior, bugs, and errors occur among threads due to their random interleaving and implicit communication when accessing shared memory, thus making it more problematic to program in a parallel model than in the sequential von Neumann model [3] [4].

Additionally, on the feature of determinism in parallel programming languages, McCool (2010) argued that a more structured approach to the design and implementation of parallel programming languages was needed in order to reduce the complexity of developing software [5]. Moreover, McCool (2010) suggested the utilization of deterministic algorithmic skeletons and patterns in the design of parallel programs [5]. McCool (2010) generally proposed that any system that maintains a specification and composition of high-quality patterns can be used as a template to steer developers towards designing more dependable, deterministic parallel programs [5]. However, Yang, Cui, Wu, Tang, and Hu (2013) contended that determinism in parallel programming is difficult to achieve, since threads are non-deterministic by default, and that making parallel programs reliable with stable multi-threading is more practical [6]. On the other hand, Garland, Kudlur, and Zheng (2012) observed that while heterogeneous architectures and massive multi-threading concepts have been accepted in the larger HPC community, modern programming models suffer from limited concurrency, undifferentiated flat memory storage, and homogeneous processing of elements [7]. Garland et al. then proposed a unified programming model for heterogeneous machines that provides constructs for massive parallelism, synchronization, and data assignment, executed across the whole machine [7].

Dean and Ghemawat [33] presented a run-time parallel, distributed, and scalable programming model, which encompasses a map and a reduce function in order to map and reduce key/value pairs [33]. Though there were many parallel, distributed, scalable programming models in existence, there was no model whose goal through development was to create a programming model that was easy to use. The innate ability of MapReduce to do its parallel and distributed computation across large commodity clusters at run-time allows developers to express simple computations while hiding the messy details of communication, parallelization, fault tolerance, and load balancing [33].

On fully utilizing the GPU with the MapReduce model, Ji and Ma (2011) explored the prospect of a GPU MapReduce structure that uses multiple levels of the GPU memory hierarchy [8]. To achieve this goal, Ji and Ma (2011) proposed a resourceful deployment of the small but fast shared memory, which included shared memory staging area management, thread-role partitioning, and intra-block thread synchronization [8].

Furthermore, Fang, He, Luo, and Govindaraju (2011) proposed the Mars model, which combines GPU and MapReduce features to attain: (i) ease of programming GPUs, (ii) portability, such that the system can be applied to different technologies such as CUDA or Pthreads, and (iii) effective use of the high performance of GPUs [9]. To achieve these goals, the Mars model hides the programming complexity of GPUs through: (i) the use of a simple and familiar MapReduce interface, (ii) automatically managing the subdivision of tasks, (iii) managing the distribution of data, and (iv) dealing with parallelization [9]. On the other hand, for efficient utilization of resources, Stuart and Owens (2011) proposed GPMR, a stand-alone MapReduce library that takes advantage of the power of GPU clusters for large-scale computing by merging large quantities of map and reduce items into chunks and using partial reductions and accumulation to complete the computations [10].

Although the Pthreads model has been around for some time [11], application of the MapReduce model using Pthreads is accomplished via the Phoenix model – a programming paradigm that is built on Pthreads for MapReduce implementation [12]. More recently, the Phoenix++ MapReduce model, a variation of the Phoenix model, was created based on the C++ programming prototype [13]. Yet still, on the issue of determinism in Pthreads, Liu, Curtsinger, and Berger (2011) observed that enforcing determinism in threads was still problematic, and they proposed the DTHREADS multi-threading model, built in C/C++ to replace the Pthreads library and alleviate the non-determinism problem [14]. The DTHREADS model imposes determinism by separating multi-threaded programs into several processes; using private, copy-on-write mappings to shared memory; employing standard virtual memory protection to keep track of writes; and then deterministically ordering the updates made by each thread, and as such, eliminating false sharing [14].

However, the issue of multi-thread non-determinism in parallel programming models still persists, and it continues to be problematic due to the data race issue, a condition brought about by unneeded shared memory communication between threads [15]. To address this problem, Lu, Zhou, Wang, Zhang, and Li (2013) proposed RaceFree, a relaxed deterministic model that offers a data-race-free multi-threading environment in which determinism in parallel programs is only affected by synchronization races [15].

2.1 HOW GPUS WORK
In this section, a brief review of how the GPU pipeline works is given to help the programmer better understand and fully utilize the resources provided by parallelism in GPUs. In a tutorial on GPU architecture, Crawfis (2007) noted that a major difference between GPUs and CPUs, as illustrated in Fig 1, is that programs are executed in parallel on the GPU, while programs are executed serially on the CPU [16].

Figure 1. GPU has more ALUs than the CPU

Furthermore, Crawfis (2007) observed that another difference is that while CPUs have fewer execution units but higher clock speeds, GPUs are composed of several parallel execution units, higher transistor counts, deeper pipelines, and faster, more advanced memory interfaces than CPUs. This makes GPUs much more robust, faster, and more computationally powerful than CPUs [16]. Both Crawfis (2007) and Lindholm, Nickolls, Oberman, and Montrym (2008) illustrated that the basic abstract structure of the GPU pipeline, as pointed out in Fig 2, is composed of: (i) the host interface, (ii) vertex processing, (iii) triangle setup, (iv) pixel processing, and (v) the memory interface [16] [17].

The host interface: the host interface works as the communication link between the CPU and the GPU by responding to commands from the CPU, collecting data from system memory, ensuring consistency in system commands, and performing context switching [16] [17]. The vertex stage: in this stage, as observed by both Crawfis (2007) and Owens et al. (2008), the vertex processor receives information from the host interface and outputs the vertices in screen space [16] [18]. The triangle stage: at this stage, geometry data is translated into raster information by converting screen-space geometry into raster pixel information, with each vertex computed in parallel, and the output placed in screen space [16] [18]. The fragment stage: the triangle information is then used to compute the final color for the pixels; it is at this stage that computation and math operations are done [16] [18]. The memory interface stage: in this stage, the color fragments generated in the fragment stage are written to the frame buffer [16].

Crawfis (2007) noted that one of the benefits of the GPU pipeline is that vertex processing, fragment processing, and triangle setup are all programmable, giving programmers the ability to write programs that are executed for every vertex and fragment [16]. In addition, Owens et al. (2008) observed that the GPU pipeline helps attain parallelism for an application: data in multiple pipeline stages can be computed concurrently and, as such, achieve task parallelism, a crucial feature that programmers can exploit with GPUs [18].

Figure 2: The GPU pipeline process.

3. CUDA PROGRAMMING FEATURES
CUDA, also known as Compute Unified Device Architecture, was developed in 2006 by NVIDIA as a general-purpose parallel computing programming model that runs on NVIDIA GPUs for parallel computations [19]. With CUDA, programmers are granted access to GPU memory and are therefore able to utilize parallel computation not only for graphics applications but also for general-purpose processing on GPUs (GPGPU) [20]. One of the challenges in HPC is how to take full advantage of the parallelism presented by multi-core CPUs and many-core GPUs; the CUDA programming language is designed to surmount this challenge by taking advantage of parallelization in both CPUs and GPUs [19].

In this section, we take a look at some of the programming features provided by CUDA. While a number of features are made available by CUDA to the programmer, in this paper we focus on features related to threads.

Process flow: a CUDA program execution, as shown in Fig 3, is done in two parts: on the host – also known as the CPU – and on the device – also referred to as the GPU [21] [22].

CUDA programs interact with both the CPU and GPU during program execution. The process flow of a CUDA program is accomplished in the following three steps: (i) input data is copied from CPU memory to GPU memory; (ii) the GPU program is then loaded and executed; (iii) finally, the results are copied from GPU memory back to CPU memory [21] [22].

Figure 3: The CUDA process flow.

The kernel: the basic feature of a CUDA program is the kernel – the heart of the GPU program. Programmers using CUDA, a language based on the C programming language, can write functions called kernels that, when called, are executed in parallel by different CUDA threads [19] [22]. To define a kernel in CUDA, the __global__ declaration qualifier is used; to specify the number of threads that execute it, the <<<…>>> launch syntax is utilized [1] [22]. A thread: the smallest unit of execution in the CUDA programming model is a thread, as illustrated in Fig 4. In CUDA, a kernel is written as a C program for a single distinct thread that describes how that thread does its computation [17]. A thread block: another feature that CUDA provides to programmers is the ability to group a batch of threads into blocks. A thread block, as shown in Fig 4, is a group of threads that are synchronized using barriers and that communicate using shared memory [23].

Figure 4: The CUDA thread, thread block, and grid.

The grid: the grid, in CUDA, as demonstrated in Fig 4, is a group of thread blocks that can coordinate using atomic operations in a global memory space shared by all threads [23]. Threads in the grid are synchronized by means of global barriers and coordinate using global shared memory [23].

Function type qualifiers: according to the CUDA manual, CUDA provides function type qualifiers that specify whether a function is executed on the device (GPU) or the host (CPU), and likewise whether the function can be called from the device or the host [19]. Fig 5 illustrates the function type qualifiers used in the vector addition example. The function type qualifiers used in CUDA include [19]: (i) __global__ – denotes a kernel function that is called from the host and executed on the device. (ii) __device__ – denotes a device function that is called and executed on the device. (iii) __host__ – denotes a host function that is called and executed on the host. (iv) __constant__ – denotes a constant device variable that is accessible by all threads. (v) __shared__ – denotes a shared device variable available to all threads in a block.

Data types: CUDA provides built-in data types, also referred to as vector types, that are derived from the basic C language data types. These include char, short, int, long, long long, float, and double. The vector types are structures whose components are accessible through the fields x, y, z, and w, and they are created using constructor functions of the form make_<type name>; an example is float2 make_float2(float x, float y), as further presented in Fig 6 [20] [24]. A comprehensive list of the vector types provided by the CUDA programming guide includes [19] [24]: (i) Characters: char1, uchar1, char2, uchar2, char3, uchar3, char4, and uchar4. (ii) Short: short1, ushort1, short2, ushort2, short3, ushort3, short4, and ushort4. (iii) Integers: int1, uint1, int2, uint2, int3, uint3, int4, and uint4. (iv) Long: long1, ulong1, long2, ulong2, long3, ulong3, long4, and ulong4. (v) Long long: longlong1, ulonglong1, longlong2, and ulonglong2. (vi) Float: float1, float2, float3, and float4. (vii) Double: double1 and double2.

Figure 5: Vector addition example in CUDA [25].
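
Since the vector addition code in Figures 5 and 6 survives only as images, the following is a minimal sketch, not the paper's exact code, of what such a CUDA kernel and its <<<…>>> launch could look like; the vector length N and the block size of 256 threads are illustrative assumptions.

    #include <stdio.h>
    #include <cuda_runtime.h>

    #define N 500000   /* illustrative vector length, not a value from the paper */

    /* __global__ qualifier: kernel called from the host, executed on the device. */
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        /* Each thread computes one element, using the built-in index variables. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        float *d_a, *d_b, *d_c;   /* device pointers; host-side setup is shown in a later sketch */
        cudaMalloc((void **)&d_a, N * sizeof(float));
        cudaMalloc((void **)&d_b, N * sizeof(float));
        cudaMalloc((void **)&d_c, N * sizeof(float));

        /* <<<blocks, threads per block>>> defines the grid of thread blocks. */
        int threadsPerBlock = 256;
        int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
        vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, N);
        cudaDeviceSynchronize();

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Note how a thread's global index is derived from blockIdx, blockDim, and threadIdx, which is how the thread/block/grid hierarchy described above is mapped onto the data.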

Figure 6: Vector addition: data types.

Figure 8: Vector addition: vectors from CPU to GPU.
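
As a brief illustration of the built-in vector types and their make_<type name> constructors described above (a hedged sketch with our own variable names, not the contents of Figure 6):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        /* Built-in vector types are small structs with fields x, y, z, and w. */
        float2 p = make_float2(1.0f, 2.0f);    /* two-component float vector  */
        int4   q = make_int4(1, 2, 3, 4);      /* four-component int vector   */
        uchar3 c = make_uchar3(255, 128, 0);   /* e.g. an RGB color triple    */

        printf("p = (%f, %f)\n", p.x, p.y);
        printf("q = (%d, %d, %d, %d)\n", q.x, q.y, q.z, q.w);
        printf("c = (%d, %d, %d)\n", c.x, c.y, c.z);
        return 0;
    }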

Memory management: a CUDA program is always hosted on the CPU while the computation is done on the GPU (see Fig 7 and Fig 9). The results are then sent back to the CPU, and as such, CUDA provides programmers with memory management functions that allocate and free memory on both the host and the device [20] [24]: (i) cudaMalloc() – allocates memory on the device. (ii) cudaFree() – frees allocated memory on the device. (iii) cudaMemcpy() with cudaMemcpyHostToDevice – copies data from host memory to the device. (iv) cudaMemcpy() with cudaMemcpyDeviceToHost – copies device results back to host memory.

Figure 7: Vector addition: memory allocation.

Built-in variables: CUDA comes with built-in variables that indicate the grid and block sizes (see Fig 8) and the block and thread indices; they are only applicable within functions executed on the GPU. The variables include [19] [24]: (i) gridDim – the dimensions of the grid in blocks. (ii) blockDim – the dimensions of a block in threads. (iii) blockIdx – the block index within the grid. (iv) threadIdx – the thread index within the block.

Thread management: while determinism is difficult to attain in multi-threading, CUDA provides a number of thread management functions that help supervise it [20] [24]: (i) __threadfence_block() – waits until the calling thread's memory writes are visible to its thread block. (ii) __threadfence() – waits until the memory writes are visible to the thread block and the device. (iii) __threadfence_system() – waits until the memory writes are visible to the block, the device, and the host. (iv) __syncthreads() – waits until all threads in the block reach the same synchronization point.

Arithmetic functions: according to the CUDA programming guide provided by NVIDIA, the atomic arithmetic functions made available in CUDA read a 32-bit or 64-bit word X located at an address Y in global or shared memory, perform the computation, and store the result back to memory at the same address Y. The following are some of the atomic arithmetic functions provided by CUDA [19]: (i) atomicAdd() – computes an addition. (ii) atomicSub() – computes a subtraction. (iii) atomicExch() – exchanges values. (iv) atomicMin() – computes the minimum value. (v) atomicMax() – computes the maximum value.

Figure 9: Vector addition: freeing memory.
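
The host-side flow pictured in Figures 7–9 (allocating device memory, copying data in, copying results back, and freeing memory) is likewise only available as images, so the sketch below reconstructs a plausible version of that flow and also exercises __syncthreads() and atomicAdd(); the block-wise summation kernel and the problem size are our own illustrative choices, not the paper's code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Each block reduces its elements in shared memory, then thread 0 adds the
       block's partial sum into the global total with a single atomicAdd(). */
    __global__ void sumKernel(const float *in, float *total, int n)
    {
        __shared__ float partial[256];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + threadIdx.x;
        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                          /* wait for all threads in the block */

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            atomicAdd(total, partial[0]);         /* one atomic update per block */
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_in = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

        float *d_in, *d_total, h_total = 0.0f;
        cudaMalloc((void **)&d_in, bytes);                       /* allocate on the device */
        cudaMalloc((void **)&d_total, sizeof(float));
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   /* copy host -> device    */
        cudaMemcpy(d_total, &h_total, sizeof(float), cudaMemcpyHostToDevice);

        sumKernel<<<(n + 255) / 256, 256>>>(d_in, d_total, n);
        cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);  /* device -> host */

        printf("sum = %.0f (expected %d)\n", h_total, n);

        cudaFree(d_in); cudaFree(d_total);                       /* free device memory     */
        free(h_in);
        return 0;
    }

Floating-point atomicAdd() requires a GPU of compute capability 2.0 or higher, which the Kepler K20 devices mentioned in the next subsection satisfy.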

3.1 GPU VERSUS CPU EXPERIMENT
For illustration purposes, we implemented a vector addition program using CUDA on an HPC system with GPU and CPU processors.

The goal of the demonstration was to show that GPUs are faster than CPUs given an increase in workload – the number of integers in the vectors to be added. The experiment was run on a Cray XK7 HPC system with: (i) 1856 AMD cores, (ii) about 50 TFLOPS, (iii) 2.4 TB of memory, (iv) a state-of-the-art Gemini 2D torus interconnect, and (v) 32 of the latest NVIDIA Kepler K20 GPUs [25].
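
The measurement code itself is not shown in the paper; one plausible way to collect per-size timings like those in Table 1 is to time the kernel with CUDA events and the equivalent serial loop with a host clock, as in the hedged sketch below (the kernel, data size, and timer choices are our assumptions, not the paper's method).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cuda_runtime.h>

    /* Illustrative element-wise kernel; the paper's benchmark was vector addition. */
    __global__ void scale(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(void)
    {
        const int n = 5000000;                    /* one of the Table 1 load sizes */
        float *d_x;
        cudaMalloc((void **)&d_x, n * sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float));

        /* GPU timing with CUDA events. */
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        scale<<<(n + 255) / 256, 256>>>(d_x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float gpu_ms = 0.0f;
        cudaEventElapsedTime(&gpu_ms, start, stop);

        /* CPU timing of the equivalent serial loop. */
        float *h_x = (float *)calloc(n, sizeof(float));
        clock_t t0 = clock();
        for (int i = 0; i < n; ++i) h_x[i] *= 2.0f;
        double cpu_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("GPU: %.3f ms, CPU: %.3f s\n", gpu_ms, cpu_s);

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(d_x); free(h_x);
        return 0;
    }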

Table 1. Vector addition processing time (seconds)

Data size     GPU Time    CPU Time
5             0.042       0.004
50            0.042       0.004
500           0.042       0.008
5000          0.042       0.064
50000         0.043       0.582
500000        0.043       6.32
5000000       0.055       64.601
50000000      0.043       640.364
500000000     0.041       6194.293

Table 1 shows the data size of the integers in each vector to be added, ranging from 5 to 500 million integers. The GPU and CPU processing times, also shown in Table 1, depict how long it took to complete the calculations, in seconds. In Fig. 10, the CPU processing time remains below 0.1 seconds up until a load size of 5000 integers. From 50000 integers onwards, there is a steep rise in CPU processing time, from 0.582 seconds up to 6194.293 seconds, as the load size increases. In Fig 11, the GPU processing time stays roughly constant across the load sizes, with an average of about 0.042 seconds. However, there is a small rise in GPU processing time when the load size is five million integers, with the GPU processing time at 0.055 seconds. Yet still, this small rise in GPU processing time is far better than the CPU processing time for the same load size, at 64.601 seconds.

Figure 10. CPU processing time for the addition of two vectors.

Figure 11. GPU processing time for the addition of two vectors.

As the load size increases to 50 million integers, the GPU processing time falls back to the average of 0.042 seconds, as shown in Fig 11, significantly outperforming the CPU processing time for the same load size, at 640.364 seconds. Therefore, GPUs outperform CPUs as the load size increases.

4. MAPREDUCE
There are many different ways to implement MapReduce, and the correct manner depends on the environment that the program will be working in. MapReduce jobs have been created on NUMA multiprocessors, on large collections of commodity nodes in a cluster, and on shared memory machine models [33]. The functional mechanism of the MapReduce programming model takes a set of key/value input pairs and reduces it to another set of output key/value pairs. The user of the library can express this within a map or reduce function. The map function creates a mapping of all input key/value pairs in the form of intermediate key/value pairs. The MapReduce library automatically groups all intermediate key/value pairs together and passes them to the reduce function. The reduce function takes the intermediate key/value pairs as its argument; for intermediate values, the argument is a key together with the list of values produced for that key [33]. The reduce function then merges these values together to form a smaller set of values. The values are supplied to the reduce function through an iterator, which allows the function to handle lists of values that are too large to fit in memory. The map and reduce functions produce results from two different domains, and the intermediate key/value pairs allow these domains to be bridged, as follows:

map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(v2)

Figure 12. Pseudocode of MapReduce word count [33].

In Fig 12, the MapReduce pseudocode is an implementation of the programming model that addresses the problem of counting the number of times each word occurs in a document or a set of documents. The map function creates a key/value pair consisting of the word and its initial occurrence count, which is 1. The reduce function then sums all the counts that were created by the mapper. A MapReduce specification object takes as arguments the names of the input and output files associated with the initial data to be computed.

Figure 13 gives an execution overview of MapReduce and shows how the programming model has a workflow that initializes a user program, which forks tasks to different workers. A Master controls the flow of forking the different tasks/jobs to their associated workers. This allows the master to be in control of the designation of jobs to different workers and of the corrective planning for the inevitable fault tolerance issues associated with node failure. The initial data is split, or distributed, across the nodes on the network/cluster using a unique file system structure [35].

Figure 13. MapReduce execution overview [33].

The master designates jobs to workers, which must find the data associated with their job on the host node and execute the associated read/write task. The intermediate files are created on the local disks by the map phase. The reduce phase then takes over, in which the Master assigns a set of workers to complete the reduce tasks/jobs. The reduce phase then creates a smaller subset of key/value pairs to be merged together, and the results are sent to the data store [34].

Figure 14. MapReduce initialization and word count map function [33].

Figure 14 introduces the MapReduce programming environment and framework. The code demonstrates how MapReduce is initialized through the use of its Java API. The API contains the function calls needed to implement MapReduce through distribution, scaling, and parallelization, with some of the following attributes: (i) Job Configuration, (ii) Task Execution & Environment, (iii) Memory Management, (iv) Map Parameters, (v) Shuffle/Reduce Parameters, (vi) Directory Structure, (vii) Task JVM Reuse, (viii) Configured Parameters, (ix) Task Logs, and (x) Distributing Libraries.

Figure 15 introduces MapReduce's map and reduce functions and how they are used in a commodity cluster environment. The MapReduce Java API uses a write call to write the results from the map phase to intermediate key/value pairs that hold a <word, 1> pair for each word in a document.

Figure 15. MapReduce word count reduce function [33].

The reduce phase then uses the API write call to produce a new set of key/value pairs that hold the sum of the total occurrences of each word, in the form of <word, count>.

Figure 16. MapReduce command line execution overview [33].

Figure 16 depicts the command line arguments needed to compile the Java word count file. The input files that are to be parsed in order to count the words in the document are stored using the associated file system structure mentioned by Ghemawat et al. [35]. In order for MapReduce to be successfully compiled, open source equivalents of the Google File System, MapReduce, and Bigtable need to be installed. For this example, Apache's Hadoop Distributed File System (HDFS), HBase, and Hadoop MapReduce were installed. The input file was stored using HDFS.
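
The word count pseudocode in Figure 12 and the Hadoop code in Figures 14 and 15 are only available as images here. Purely to illustrate the map, group-by-key, and reduce flow they describe, and not as the paper's Java code, a small sequential C++ sketch of word count follows; the document string and function names are our own.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    /* map: for every word in the document, emit an intermediate <word, 1> pair. */
    std::vector<std::pair<std::string, int> > map_phase(const std::string &doc)
    {
        std::vector<std::pair<std::string, int> > intermediate;
        std::istringstream in(doc);
        std::string word;
        while (in >> word)
            intermediate.push_back(std::make_pair(word, 1));
        return intermediate;
    }

    /* reduce: sum the list of counts grouped under each word. The real library
       shuffles and groups pairs by key across workers; an ordered map plays
       that role here, sequentially. */
    std::map<std::string, int> reduce_phase(
        const std::vector<std::pair<std::string, int> > &intermediate)
    {
        std::map<std::string, int> counts;
        for (size_t i = 0; i < intermediate.size(); ++i)
            counts[intermediate[i].first] += intermediate[i].second;
        return counts;
    }

    int main()
    {
        const std::string doc = "the quick brown fox jumps over the lazy dog the end";
        std::map<std::string, int> counts = reduce_phase(map_phase(doc));
        for (std::map<std::string, int>::const_iterator it = counts.begin();
             it != counts.end(); ++it)
            std::cout << it->first << " : " << it->second << "\n";
        return 0;
    }

In the distributed setting described above, the map and reduce calls run on many workers at once and the grouping step is performed by the MapReduce runtime rather than by an in-memory map.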

5. PTHREADS
The first POSIX (Portable Operating System Interface) threads model was created by the IEEE Computer Society in 1988, and current versions are maintained by the Austin Common Standards Revision Group [28] [30]. One very popular application programming interface (API) for multi-threading is Pthreads, also known as POSIX threads, with the IEEE and ISO notations P1003.1c and ISO/IEC 9945-1:1990c, respectively [29] [27]. The Pthreads programming model is part of the C programming language and is composed of programming types and procedure calls that are declared in and included through the pthread.h header/thread library; however, the pthread.h header library might be part of another library in other Pthreads implementations [29] [27] [30]. Pthreads do not depend on explicit data transfer but utilize the cache-to-CPU and memory-to-CPU bandwidth for data exchange, thus making Pthreads-based computations much faster [27].

To fully exploit the computational potential presented by Pthreads, a program has to be structured into distinct, autonomous tasks which can then be implemented and run in parallel on a high performance machine [27]. An example would include sub-routine 1 and sub-routine 2 being interchanged, interleaved, and overlapped in real time, thus making them suitable for multi-threading [27].

Pthreads is composed of about 100 sub-routines, which begin with a "pthread_" prefix and are largely classified as follows [29] [27] [32]: (i) Thread management sub-routines: sub-routines that work on threads directly, for example creating, detaching, and joining. (ii) Mutex sub-routines: sub-routines that handle synchronization between threads by using mutual exclusion (mutex) locks to enforce determinism through the coordination of threads.

(iii) Condition variable sub-routines: sub-routines that handle communication between threads that share a mutex, in order to enforce synchronization. (iv) Synchronization sub-routines: sub-routines that enforce determinism in threads by managing read/write locks and barriers.

One of the drawbacks of Pthreads is how to manage the persistent problem of non-determinism [30]. POSIX threads depend on thread mutexes, condition variables, and synchronization control methods to manage determinism in threads, with mutexes only permitting one thread to enter a critical section at a time and thus avoiding race conditions [1]. The following are some of the most utilized Pthreads sub-routine prefixes [27]: (i) pthread_: the threads themselves and miscellaneous other sub-routines. (ii) pthread_attr_: thread attributes. (iii) pthread_mutex_: mutex functionality for threads. (iv) pthread_mutexattr_: mutex attributes. (v) pthread_cond_: condition variables for threads. (vi) pthread_condattr_: condition variable attributes. (vii) pthread_key_: thread-specific data keys. (viii) pthread_rwlock_: read/write locks used in managing determinism in threads. (ix) pthread_barrier_: synchronization barriers used to manage determinism in threads.

Figure 17: Pthread creation, waiting, and termination [31].

Generating and discarding threads: Pthreads provides two main functions responsible for creating and terminating threads, namely pthread_create and pthread_exit [30] [1] [27].

Figure 18: Pthreads Mutex example [31].

The pthread_create function requires the programmer to specify the thread that will run the task, its attributes, the routine to be run by the thread, and the argument passed to that routine [30] [1]. The pthread_exit function permits the programmer to stipulate how and when a thread gets discarded, either by returning normally from its routine when its work is done, or by having the thread terminated by another thread [30] [27] [1].
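
Since Figures 17 and 18 (thread creation/termination and a mutex example) are not reproduced here, the following is a minimal sketch, with our own choice of worker routine and counts rather than the code from the cited tutorial [31], showing pthread_create, pthread_join, pthread_exit, and a pthread_mutex_ lock protecting a shared counter.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define INCREMENTS  100000

    static long counter = 0;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Worker routine: each thread increments the shared counter inside the mutex,
       so only one thread is in the critical section at a time. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (int i = 0; i < INCREMENTS; ++i) {
            pthread_mutex_lock(&counter_lock);
            counter++;
            pthread_mutex_unlock(&counter_lock);
        }
        printf("thread %ld done\n", id);
        pthread_exit(NULL);                 /* explicit termination of this thread */
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        for (long t = 0; t < NUM_THREADS; ++t)
            pthread_create(&threads[t], NULL, worker, (void *)t);

        for (long t = 0; t < NUM_THREADS; ++t)
            pthread_join(threads[t], NULL); /* wait for each thread to finish */

        printf("final counter = %ld (expected %d)\n",
               counter, NUM_THREADS * INCREMENTS);
        pthread_mutex_destroy(&counter_lock);
        return 0;
    }

Without the mutex, the interleaved increments would race and the final value would be non-deterministic, which is precisely the problem the mutex and synchronization sub-routines above are meant to control. The program is compiled with the -pthread flag.
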
6. CONCLUSION
In this paper, we have endeavored to present a succinct overview of three parallel programming models: CUDA, MapReduce, and Pthreads. The goal is to assist high performance computing users with a concise understanding of parallel programming concepts and thus faster implementation of big data projects using high performance computing. Although there are many parallel, distributed, and scalable programming models currently in existence, there was no model whose main goal through development was to create a high performance, distributed, parallel, and scalable programming environment that was easy to use. The innate ability of MapReduce to do its parallel and distributed computation across large commodity clusters at run-time allows developers to express simple computations while hiding the messy details of communication, parallelization, fault tolerance, and load balancing. Furthermore, the use of job distribution through map and reduce phases, combined with a distributed file system architecture for data storage and retrieval, produces a framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a fault-tolerant manner. Future work will focus on how MapReduce could be implemented in both CUDA and Pthreads.

7. REFERENCES
[1]. H. Kasim, V. March, R. Zhang, and S. See, "Survey on parallel programming models," in Network and Parallel Computing, 2008, pp. 266–275.
[2]. R. L. Bocchino, V. S. Adve, S. V. Adve, and M. Snir, "Parallel programming must be deterministic by default," in First USENIX Workshop on Hot Topics in Parallelism (HotPar 2009), 2009.
[3]. P. Ferrara, "Static Analysis of the Determinism of Multithreaded Programs," in 2008 Sixth IEEE International Conference on Software Engineering and Formal Methods, 2008, pp. 41–50.
[4]. B. J. Burnim and K. Sen, "Asserting and Checking Determinism for Multithreaded Programs," Communications of the ACM, vol. 53, no. 6, pp. 97–105, 2010.
[5]. M. D. McCool, "Structured Parallel Programming with Deterministic Patterns," in Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, 2010, pp. 5–5.
[6]. J. Yang, H. Cui, J. Wu, Y. Tang, and G. Hu, "Determinism Is Not Enough: Making Parallel Programs Reliable with Stable Multithreading," Communications of the ACM, 2014.
[7]. M. Garland, M. Kudlur, and Y. Zheng, "Designing a unified programming model for heterogeneous machines," in 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012, pp. 1–11.
[8]. F. Ji and X. Ma, "Using Shared Memory to Accelerate MapReduce on Graphics Processing Units," in 2011 IEEE International Parallel & Distributed Processing Symposium, 2011, pp. 805–816.
[9]. W. Fang, B. He, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: Accelerating MapReduce with Graphics Processors," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 4, pp. 608–620, 2011.
[10]. J. A. Stuart and J. D. Owens, "Multi-GPU MapReduce on GPU Clusters," in 2011 IEEE International Parallel & Distributed Processing Symposium, 2011, pp. 1068–1079.
[11]. C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for Multi-core and Multiprocessor Systems," in 2007 IEEE 13th International Symposium on High Performance Computer Architecture, 2007, pp. 13–24.
[12]. R. M. Yoo, A. Romano, and C. Kozyrakis, "Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system," in 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009, pp. 198–207.
[13]. J. Talbot, R. M. Yoo, and C. Kozyrakis, "Phoenix++: Modular MapReduce for Shared-Memory Systems," in Proceedings of the Second International Workshop on MapReduce and Its Applications (MapReduce '11), 2011, pp. 9–16.
[14]. T. Liu, C. Curtsinger, and E. D. Berger, "DTHREADS: Efficient Deterministic Multithreading," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP '11), 2011, pp. 327–336.
[15]. K. Lu, X. Zhou, X. Wang, W. Zhang, and G. Li, "RaceFree: An Efficient Multi-Threading Model for Determinism," in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13), 2013, pp. 297–298.
[16]. R. Crawfis, "Modern GPU Architecture," Lecture Notes for CSE 694G Game Design and Project, Ohio State University, Spring 2007.
[17]. E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
[18]. J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879–899, 2008.
[19]. NVIDIA, "CUDA C Programming Guide," NVIDIA Corporation, July 2013.
[20]. Wikipedia, "General-purpose computing on graphics processing units," 2013. [Online]. Available: http://tiny.cc/g4pw6w. [Accessed: 21-Nov-2013].
[21]. Wikipedia, "CUDA," Wikipedia, The Free Encyclopedia, 2013. [Online]. Available: http://en.wikipedia.org/wiki/CUDA. [Accessed: 22-Nov-2013].
[22]. C. Woolley, "CUDA Overview Tutorial," Developer Technology Group, NVIDIA Corporation. [Online]. Available: http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02--overview.pdf.
[23]. J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, pp. 40–53, 2008.
[24]. N. Gupta, "CUDA Programming, Complete Reference on CUDA," CUDA Programming Blog, 2013. [Online]. Available: http://tiny.cc/2niy6w. [Accessed: 22-Nov-2013].
[25]. OLCF, "Oak Ridge National Labs Accelerated Computing Guide: CUDA Vector Addition," Accelerator Programming, 2013. [Online]. Available: https://www.olcf.ornl.gov/tutorials/cuda-vector-addition/. [Accessed: 24-Nov-2013].
[26]. GWU, "HPCL – The George Washington University High Performance Computing Lab," 2013. [Online]. Available: http://hpcl2.hpcl.gwu.edu/index.php/supercomputers. [Accessed: 25-Nov-2013].
[27]. B. Barney, "POSIX Threads Programming," Lawrence Livermore National Laboratory, 2009. [Online]. Available: https://computing.llnl.gov/tutorials/pthreads/. [Accessed: 11-Dec-2013].
[28]. Opengroup.org, "A Backgrounder on IEEE Std 1003.1, 2003 Edition," 2003. [Online]. Available: http://www.opengroup.org/austin/papers/backgrounder.html. [Accessed: 11-Dec-2013].
[29]. Wikipedia, "POSIX Threads," 2013. [Online]. Available: http://en.wikipedia.org/wiki/POSIX_Threads. [Accessed: 11-Dec-2013].
[30]. A. Niño, J. Diaz, and C. Muñoz-Caro, "A Survey of Parallel Programming Models and Tools in the Multi and Many-Core Era," IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1369–1386, 2012.
[31]. J. Cook, "Pthreads, Tutorials and Tools," Computer Science Department, New Mexico State University. [Online]. Available: http://www.cs.nmsu.edu/~jcook/Tools/pthreads/pthread_cond.html. [Accessed: 11-Dec-2013].
[32]. F. Mueller, "A Library Implementation of POSIX Threads under UNIX," in USENIX Winter, 1993, pp. 29–42.
[33]. J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107–113, 2008.
[34]. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, et al., "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol. 26, p. 4, 2008.
[35]. S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, 2003, pp. 29–43.