Introducing multithreaded programming: POSIX Threads and NVIDIA’s CUDA

Abstract

The current progression of commodity processing architectures exhibits a trend toward increasing parallelism, requiring that undergraduate students in a wide range of technical disciplines gain an understanding of problem solving in parallel environments. However, as a small comprehensive college, we cannot currently afford to dedicate an entire semester-long course to the study of parallel computing. To combat this situation, we have integrated the key components of such a course into a 300-level course on modern operating systems. In this paper, we describe a parallel computing unit that is designed to dovetail with the discussion of process and thread management common to operating systems courses. We also describe a set of self-contained projects in which students explore two parallel programming models, POSIX Threads and NVIDIA’s Compute Unified Device Architecture, that enable parallel architectures to be utilized effectively. In our experience, this unit can be integrated with traditional operating systems topics quite readily, making parallel computing accessible to undergraduate students without requiring a full course dedicated to these increasingly important topics.

1 Introduction

The many-core revolution currently underway in the design of processing architectures necessitates an early introduction to parallel computing. Commodity desktop systems with two cores per physical processor are now common, and the current processor roadmap for major manufacturers indicates a rapid progression toward systems with four, eight, or even 16 cores. At the same time, programmable graphics processing units (GPUs) have evolved from fixed-function pipelines implementing the z-buffer rendering algorithm to programmable, highly parallel machines that can be used to solve a wide range of problems. Together, these developments require that students possess an in-depth understanding of the hardware and software issues related to solving problems using many-core processing architectures.

Grove City College is a small comprehensive college, and as such, we in the Department of Computer Science must wrestle with the requisite staffing limitations. In particular, we cannot currently afford to offer an entire course dedicated to parallel computing—here defined to comprise a study of parallel processing architectures and the programming techniques necessary to utilize those architectures effectively—without sacrificing the integrity of our core computer science curriculum. This situation thus poses a dilemma: the current trajectory of processing architectures dictates an ever-increasing need for knowledge development in this area, but we are simply unable to dedicate a semester-length course to the study of these topics. In response to this situation, we instead introduce parallel computing in the context of a semester-long 300-level operating systems course, which features a 4-week unit focusing on parallel computing. This unit is specifically designed to dovetail with the treatment of process and thread management that is common to courses on modern operating systems. In this context, we motivate the opportunities and challenges introduced by parallel processing architectures, and students explore the key programming concepts via a set of self-contained parallel programming projects. We also discuss key issues arising in the context of parallel execution environments, including resource sharing, thread synchronization, and atomicity, and these topics provide a smooth transition back to traditional operating systems concepts such as semaphores, high-level synchronization constructs, and deadlock.

2 Parallel Processing

In this unit, our students explore multithreaded programming in two forms: the POSIX Threads (pthreads) interface, a standardized model for multithreaded application programming1, and NVIDIA’s Compute Unified Device Architecture (CUDA), a co-designed hardware/software architecture for massively parallel computing.2 Before discussing the details of these parallel programming models, we first motivate the multithreaded approach to parallel processing by contrasting it with two other common forms of parallel processing exploited by contemporary computer systems.

2.1 Some Common Forms of Parallelism

Modern processing architectures exploit parallelism on a number of levels, including instruction-level parallelism, multitasking, and multithreading. Whereas instruction-level parallelism and multitasking enable programmers to remain blissfully unaware of the details related to parallel execution, relying instead on optimizing compilers and operating systems to exploit parallelism automatically, multithreading requires the programmer to design a program to capitalize on potential parallelism from the outset.

Thus, to fully utilize the parallelism afforded by current and future many-core processing architectures, we believe that programmers must possess an intimate knowledge of the issues that arise in the context of multithreading.

2.1.1 Instruction-Level Parallelism

Consider the following expression involving several integer multiplications and additions:

a + (b*c) + (d*e) + f

Assuming we have a processor that requires a single cycle to evaluate each multiplication or addition operation, this expression requires five cycles to evaluate in a sequential manner: one cycle for each of the arithmetic operations in the expression (Figure 1a). However, if the processor is equipped with just one additional execution unit, then the number of cycles required to evaluate the expression can be reduced from five to three (Figure 1b).

In this example, instruction-level parallelism (ILP) capitalizes on the absence of data dependencies among operations within the full expression to efficiently utilize the processor’s multiple execution units, thereby improving performance by a factor of 1.67 over the sequential version. Typically, application-level programmers need not be concerned with parallelism at this level, and instead rely on optimizing compilers to recognize such opportunities and schedule instructions in a manner that leverages ILP. Whereas ILP can improve the overall performance of a single program, opportunities to exploit ILP depend largely on the particular sequence of instructions required to implement a program’s behavior, and vary widely from one application to the next.

Figure 1: Instruction-level parallelism. The lack of data dependencies among operations within an instruction stream can be exploited by a processor with multiple execution units.
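To make the schedule concrete, the following C fragment (variable names are ours, not drawn from the figure) spells out one way a compiler might order the operations for a two-unit processor; operations annotated with the same cycle have no data dependencies on one another and may issue simultaneously.

    /* One possible schedule for a + (b*c) + (d*e) + f on a hypothetical
     * two-unit processor.  Operations marked with the same cycle have no
     * data dependencies on one another and may issue simultaneously. */
    int ilp_example(int a, int b, int c, int d, int e, int f)
    {
        int t1 = b * c;      /* cycle 1, execution unit 0 */
        int t2 = d * e;      /* cycle 1, execution unit 1 */
        int s1 = a + t1;     /* cycle 2, execution unit 0 */
        int s2 = t2 + f;     /* cycle 2, execution unit 1 */
        return s1 + s2;      /* cycle 3 */
    }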

2.1.2 Multitasking

At a coarser level, multitasking also implements parallel processing without any special effort on the part of an application programmer. In this case, programmers rely on the operating system’s scheduler to exploit the independence of tasks within the system, rather than the compiler’s ability to exploit the independence of operations within an instruction stream. Interestingly, multitasking operating systems are able to provide the illusion of parallel execution without actually requiring truly parallel hardware. As such, multitasking is sometimes considered a form of concurrent processing, as distinguished from true parallel execution: although multiple programs appear to be executing simultaneously, each process is in reality executed sequentially—the system simply switches among the available processes so quickly that it appears as though the programs are executing in parallel.

It is important to note that multitasking does not improve the performance of any particular program, but instead improves the overall system throughput, which is a measure of the number of tasks completed by the system in a particular unit of time. With multitasking, no single task completes more quickly, but instead some collection of tasks will potentially require less time to complete than if those tasks were executed sequentially.

2.1.3 Multithreading

Though multitasking remains an important feature in the context of highly parallel processing architectures, this technique cannot exploit the independence of logical tasks within a single program. Modern systems thus afford programmers the ability to explicitly divide a process into two or more threads—logical units of computation that share many of the resources that are used across the program as a whole, including a program’s binary instructions and its global data structures (Figure 2). It is this form of parallelism with which our 4-week parallel computing unit is concerned. In particular, this unit centers around two multithreaded programming models that can be used to effectively exploit parallel processing and thereby improve program performance: pthreads and CUDA.

2.2 POSIX Threads

pthreads is a standardized model for dividing a program into subtasks whose execution can be interleaved or run in parallel. This model implements the POSIX multithreading interface (POSIX Section 1003.1c), which is part of the more general family of IEEE operating system interface standards.

Programmers experience pthreads as a set of C programming language data types and function calls that operate according to a set of implied semantics—that is, pthreads offers a standardized, programmer-friendly application programming interface (API). Specific vendors typically supply pthreads implementations in two parts:

• a header file that is included in a multithreaded program, and
• a library that is linked to the program during compilation.

Figure 2: Classic versus modern processes. A process represents a program in execution. The classic process model contains just a single execution engine, but modern processes—multithreaded processes—allow multiple execution engines to share the resources within the computational framework provided by the process.

To exploit multithreading, programmers must design and implement their programs as a series of independent tasks. Using the pthreads API, multiple threads of execution are spawned and scheduled as independent units by the operating system, proceeding independently unless the programmer explicitly synchronizes the threads’ execution.

The behavior of each thread is specified during its creation by passing the name of a C function to the thread creation routine as an argument. Other properties of the thread, including arguments to the function that define its behavior, are passed at the time of creation as well.
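For illustration, the following minimal sketch (function and structure names are ours) creates a user-specified number of threads with pthread_create(), passes each a small argument block, and waits for them with pthread_join(); it assumes the program is compiled and linked against the pthreads library (for example, with gcc -pthread).

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Per-thread argument block; a pointer to one of these is passed to
     * pthread_create() and arrives as the thread function's sole argument. */
    struct task_args {
        int id;
    };

    static void *say_hello(void *arg)
    {
        struct task_args *t = (struct task_args *)arg;
        printf("Hello, world! (thread %d)\n", t->id);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        int n = (argc > 1) ? atoi(argv[1]) : 4;   /* user-specified thread count */
        pthread_t *threads = malloc(n * sizeof(pthread_t));
        struct task_args *args = malloc(n * sizeof(struct task_args));

        for (int i = 0; i < n; i++) {
            args[i].id = i;
            pthread_create(&threads[i], NULL, say_hello, &args[i]);
        }
        for (int i = 0; i < n; i++)
            pthread_join(threads[i], NULL);       /* wait for each thread to finish */

        free(args);
        free(threads);
        return 0;
    }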

Common synchronization primitives such as barriers, locks, and higher-level constructs can be constructed using pthreads mutexes. Primitive mechanisms for inter-thread communication via shared data structures are available as well.
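For example, a mutex can coordinate access to a shared counter; the sketch below (names are ours) shows the basic lock/unlock idiom around a critical section.

    #include <pthread.h>

    /* Shared state protected by a mutex: every access to 'hits' must occur
     * between the lock and unlock calls below. */
    static long hits = 0;
    static pthread_mutex_t hits_lock = PTHREAD_MUTEX_INITIALIZER;

    void record_hit(void)
    {
        pthread_mutex_lock(&hits_lock);    /* gain exclusive access */
        hits++;                            /* critical section */
        pthread_mutex_unlock(&hits_lock);  /* release the lock */
    }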

In general, the pthreads execution model treats threads as peers. Only the main thread, which is created by the operating system when it instantiates the multithreaded process, has slightly different properties, but these differences can typically be ignored: all of the threads in a well-designed pthreads program will thus cooperate to execute the task at hand in a manner that effectively utilizes the underlying resources of the processor.

2.3 NVIDIA Compute Unified Device Architecture

Over the past several years, multicore processors have evolved from traditional central processing units (CPUs) and afford one means to exploit parallel processing through multithreaded programming. At the same time, recent advances in the programmability of so-called graphics processing units (GPUs) now permit these devices to be used for general purpose computing. In fact, these devices have spawned an entire field of academic and industrial research3, and have been demonstrated to provide significant performance improvements for various computing problems across a wide range of application domains.4

Historically, GPUs comprised a series of fixed-function pipelines that implemented the z-buffer rendering algorithm to provide the real-time graphics capabilities that have become an integral component of the modern human-computer interface. As a result, GPU devices have found wide deployment: most contemporary desktop computer systems, as well as gaming consoles, digital music players, and high-end mobile devices, are equipped with one or more such processors.

Recently, GPU manufacturers have begun to expose the low-level hardware components on which these devices are based, permitting an unprecedented level of programmability. The programming models through which programmers interact with these devices have evolved accordingly, rapidly progressing from low-level assembly language programs written specifically for a particular GPU architecture to high-level programming interfaces such as the NVIDIA Compute Unified Device Architecture.2

CUDA is a co-designed hardware and software platform designed to leverage the massively parallel compute capabilities of programmable GPUs for general purpose computing. CUDA consists of three core components:

• a massively parallel hardware execution environment based on NVIDIA-brand processors;
• a comprehensive collection of software development tools, including run-time libraries, performance analysis programs, and documentation; and
• a scalable, multithreaded programming model using extensions to the C programming language.5

As a unified hardware and software architecture, CUDA is designed to scale to thousands of threads across hundreds of cores in a manner that is extensible to both many-core CPU-based and GPU-based systems. CUDA is also designed to be usable, meaning the programmer should be able to focus on the development of efficient parallel algorithms and not the low-level implementation details required to utilize the hardware effectively. Currently, the CUDA programming model provides a view of the GPU as a highly multithreaded compute coprocessor with a local dedicated DRAM (Figure 3). Each device consists of multiple multiprocessors, each with an on-chip memory comprising a collection of 32-bit registers and a programmer-managed parallel data cache (PDC).

A globally accessible DRAM, typically ranging in size from 256 MB to 1 GB or more, permits threads to communicate across multiprocessor boundaries. However, no hardware caching mechanisms are provided. Read/write access to data in this global memory is about 100 times slower than for data in the PDC, so optimal performance depends on an algorithm’s ability to minimize access to global memory and use the PDC effectively.

Figure 3: Simplified view of a CUDA device. The GPU is a compute coprocessor, enabling computations to be decomposed into a series of parallel threads executing across multiprocessors. Each multiprocessor in turn consists of multiple thread processors, a collection of 32-bit registers, and a programmer-managed parallel data cache.

Parallel computations are decomposed into one or more kernels, each of which executes in parallel across a set of light-weight thread primitives. Threads are logically grouped into warps that execute in SIMD (single instruction, multiple data) fashion. These groups are in turn organized hierarchically into thread blocks, which indicate a group of threads that execute concurrently, cooperate via barrier synchronization, and communicate via access to the PDC. Finally, thread blocks are organized into grids: each block within the grid can execute independently, which permits parallel execution of many thread blocks across the device’s multiprocessors. A hardware execution manager provides a low-overhead threads implementation and handles details such as thread creation, scheduling, and context switching.
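As a small illustration of this hierarchy (the kernel and variable names below are ours, not taken from the CUDA programming guide), the following sketch launches a grid of thread blocks in which each thread computes one element of a vector sum; the built-in blockIdx, blockDim, and threadIdx variables locate each thread within the grid.

    #include <cuda_runtime.h>

    /* Kernel: one light-weight thread per output element.  Threads are grouped
     * into blocks, and blocks into a grid; each thread derives its global index
     * from its position within that hierarchy. */
    __global__ void vector_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* Host-side launch: 256 threads per block, enough blocks to cover n elements.
     * The arrays are assumed to reside in device (global) memory. */
    void launch_vector_add(const float *d_a, const float *d_b, float *d_c, int n)
    {
        int threads_per_block = 256;
        int blocks = (n + threads_per_block - 1) / threads_per_block;
        vector_add<<<blocks, threads_per_block>>>(d_a, d_b, d_c, n);
        cudaDeviceSynchronize();   /* block until the kernel completes */
    }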

As with POSIX Threads, CUDA provides a set of powerful mechanisms to implement programs as a collection of cooperating, communicating threads. Moreover, CUDA exposes the massively parallel execution environment provided by modern GPU hardware, enabling non-graphics applications to leverage the computational power afforded by these architectures. As many-core CPU and GPU hardware designs continue to evolve, many are predicting the eventual convergence of the designs.6,7 In this context, CUDA becomes a particularly attractive and potentially widely applicable multithreaded parallel programming model.

3 Parallel Programming Projects

As noted, we have developed a 4-week unit exploring the multithreaded programming models described in Section 2 that can be integrated with a course in modern operating systems. The core feature of this unit is a set of self-contained programming projects that enable students to explore multithreaded programming. Specifically, students develop two variations of a multithreaded program that approximates the value of π using Monte Carlo integration.

3.1 Monte Carlo Integration

Monte Carlo integration is a powerful method for approximating the value of an integral using probabilistic techniques. For example, to compute the integral of a complicated function f over the interval [a, b],

F = \int_a^b f(x)\,dx

Monte Carlo integration approximates F by computing the average value of the function over the interval using N random sample points in [a, b]:

F = \int_a^b f(x)\,dx \approx \frac{b - a}{N} \sum_{i=1}^{N} f(x_i)

This technique is particularly useful for evaluating high-dimensional integrals and is often applied in several problem domains, including computational physics, computational chemistry, and computer graphics.

More importantly, Monte Carlo integration is a so-called embarrassingly parallel application: each sample point x_i within the domain is completely independent of all other sample points. In the limit, given N threads, the N random sample points used to compute the average value can be evaluated simultaneously. Decomposition of the computational domain is thus straightforward, and students are readily able to appreciate the advantage of parallelism in this context. Finally, a basic understanding of Monte Carlo integration requires only a cursory knowledge of calculus and statistics, so the details of this problem solving technique are thus accessible to undergraduate students with a standard mathematics background.
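For instance, taking f(x) = sqrt(1 − x²) on [0, 1] gives one simple instance of the formula above, a Monte Carlo estimate of π (the projects in Section 3.2 use a closely related two-dimensional, area-sampling formulation of the same idea):

\pi = 4 \int_0^1 \sqrt{1 - x^2}\,dx \approx \frac{4}{N} \sum_{i=1}^{N} \sqrt{1 - x_i^2}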

3.2 The pthreads Programming Projects

To help students manage the complexity of the problem, the pthreads project is composed of a series of C programs, each of which implements a subset of the functionality required by the final task. Students thus construct the full Monte Carlo estimator incrementally, mastering the key parallel programming concepts along the way. The following points outline each of the tasks involved in the problem solving process, highlighting the features of the pthreads interface required to complete the task:

• Task 1: Multithreaded “Hello, world!”. Students write a C program to create a user-specified number of threads, each of which executes a function that simply outputs “Hello, world!” to the console window before exiting. The main program spawns N such threads, initiates execution, and simply waits for each of them to complete before terminating itself.

This simple task allows students to grasp the pthreads execution model, enabling them to utilize the pthreads functions related to basic thread management.

• Task 2: Modifying thread behavior with run-time arguments. In this task, students modify the program for Task 1 to communicate (possibly unique) parameters to each thread during the thread creation process. These arguments define the run-time behavior of each thread, allowing the behavior of any one thread to differ from that of each of its peers, if desired.

In particular, the program from Task 1 is modified so that each thread:

1. computes a random number of microseconds in some range (for example, 0-1000), as determined by the thread’s run-time configuration;
2. informs the user of the number of microseconds it will sleep by outputting a simple message to the console window;
3. sleeps for the specified time interval; and
4. outputs a final message to the console window before exiting.

Here, the students learn the pthreads mechanisms for communicating information to each thread during the creation process. Given unique inputs, each thread’s behavior will manifest differently than that of each of its peers, and students thus begin to grasp the truly independent nature of the threads as they execute in parallel.

• Task 3: Communication and synchronization via locks. Students build on the solution to Task 2 and modify the program so that each thread gains exclusive access to a shared data structure and manipulates that structure’s values.

In this task, the pthreads mutex is introduced as a synchronization primitive that enables coordinated access to shared data. The final (correct) value in the shared data structure is dependent on each thread obtaining mutually exclusive access to its members; thus, in order to solve this problem correctly, students must coordinate the threads’ behavior properly through the use of locks.

In addition, students begin to appreciate the potential sources of computational bottlenecks: this exercise specifically introduces a serialization point (in the form of access to the shared data structure) to demonstrate that multithreading alone does not guarantee performance improvements—algorithms must be designed carefully to exploit the potential parallelism in a manner that leads to actual performance gains.

• Task 4: Monte Carlo estimator. Finally, students combine the multithreaded programming lessons they have learned in Tasks 1-3 to implement a full Monte Carlo estimator. In particular, the students use the method to approximate the value of π by scaling the estimate for a quarter-circle over the domain [0, 1]×[0, 1].

Samples that fall within the quarter-circle, that is, points (x, y) in the unit square with x² + y² ≤ 1, are contained within the circle and thus contribute to the estimate. Because the fraction of such samples approximates π/4, the final result is scaled by a factor of four, leading to an overall estimate for the value of π. (A minimal sketch of such an estimator appears below.)

Once complete, the students conduct a series of experiments to determine the scaling properties of their implementation using the multicore machines in our computer laboratory. Specifically, students measure the wall time required to approximate the value of π using 100 million random samples distributed across one, two, four, eight, or 16 threads. A written analysis of the observed scaling behavior is submitted along with the source code for each of their multithreaded programs.
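One way to structure the final estimator is sketched below. This is a minimal sketch under our own naming and design choices, not the assignment’s reference solution: each thread receives its sample count and a private seed for rand_r(), accumulates a local hit count, and performs a single mutex-protected update to combine results, keeping the serialization point small.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define TOTAL_SAMPLES 100000000L

    static long total_hits = 0;
    static pthread_mutex_t hits_lock = PTHREAD_MUTEX_INITIALIZER;

    struct worker_args {
        long samples;        /* samples assigned to this thread */
        unsigned int seed;   /* per-thread seed for rand_r() */
    };

    /* Each thread throws 'samples' random darts at the unit square and counts
     * how many land inside the quarter circle x*x + y*y <= 1. */
    static void *count_hits(void *arg)
    {
        struct worker_args *w = (struct worker_args *)arg;
        long hits = 0;
        for (long i = 0; i < w->samples; i++) {
            double x = (double)rand_r(&w->seed) / RAND_MAX;
            double y = (double)rand_r(&w->seed) / RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
        pthread_mutex_lock(&hits_lock);    /* serialize only the final update */
        total_hits += hits;
        pthread_mutex_unlock(&hits_lock);
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        int n = (argc > 1) ? atoi(argv[1]) : 4;     /* number of threads */
        pthread_t *threads = malloc(n * sizeof(pthread_t));
        struct worker_args *args = malloc(n * sizeof(struct worker_args));

        for (int i = 0; i < n; i++) {
            args[i].samples = TOTAL_SAMPLES / n;
            args[i].seed = 1234u + i;
            pthread_create(&threads[i], NULL, count_hits, &args[i]);
        }
        for (int i = 0; i < n; i++)
            pthread_join(threads[i], NULL);

        printf("pi ~= %f\n",
               4.0 * total_hits / (double)(n * (TOTAL_SAMPLES / n)));
        free(args);
        free(threads);
        return 0;
    }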

The pthreads project is introduced first primarily because the execution environment, though requiring the students to begin thinking “in parallel”, is nevertheless more familiar than that of the GPU devices. Once students have gained a certain level of comfort with the core issues arising from a multithreaded implementation of the Monte Carlo estimator, they rewrite their estimators using CUDA.

3.3 The CUDA Programming Projects

In general, the CUDA-based programming project proceeds similarly, with a few important differences. First, because the implementation actually executes in a different memory space than standard applications (that is, in the DRAM of the GPU device and not the main memory of the host system), output to the console window directly from the CUDA program is prohibited.1

1 In fact, nvcc, the CUDA compiler, offers a “device emulation” mode in which the program is compiled for the host architecture and, when executed, simulates the actual device behavior using a standard threading model. We have found device emulation mode to be useful for debugging purposes, particularly when students run into problems during the conversion of their Monte Carlo estimator from the pthreads interface to CUDA.

Second, the thread hierarchy imposed by the CUDA programming model potentially requires more careful thought about the decomposition of the computational domain. Unlike the pthreads model, not all threads are peers in the CUDA execution model: only threads within the same block can coordinate their behavior and communicate directly (via the parallel data cache), which may require a slightly different decomposition than under pthreads.
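The sketch below shows one way the CUDA kernel for the estimator might be organized; the kernel and variable names are ours, and for simplicity it assumes the random sample coordinates are generated on the host and copied into device global memory rather than generated on the GPU. Each block reduces its threads’ results in the parallel data cache (declared with __shared__) and commits a single subtotal to global memory with atomicAdd, mirroring the block-level cooperation described above.

    #include <cuda_runtime.h>

    #define THREADS_PER_BLOCK 256

    /* Each thread tests one (x, y) sample against the quarter circle; the block
     * then reduces its per-thread results in shared memory (the PDC) and a
     * single thread commits the block's subtotal to global memory.  The kernel
     * assumes it is launched with THREADS_PER_BLOCK threads per block. */
    __global__ void count_hits(const float *xs, const float *ys, int n,
                               unsigned int *total_hits)
    {
        __shared__ unsigned int block_hits[THREADS_PER_BLOCK];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        block_hits[threadIdx.x] =
            (i < n && xs[i] * xs[i] + ys[i] * ys[i] <= 1.0f) ? 1u : 0u;
        __syncthreads();                          /* barrier within the block */

        /* Tree reduction within the block's shared memory. */
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                block_hits[threadIdx.x] += block_hits[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            atomicAdd(total_hits, block_hits[0]); /* one global update per block */
    }

    /* Host side (not shown): allocate and zero *total_hits on the device, copy
     * the sample arrays, then launch with
     *   count_hits<<<(n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK,
     *                THREADS_PER_BLOCK>>>(d_xs, d_ys, n, d_total_hits);       */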

We have found both of these programming exercises to be useful tools in helping the students grasp and overcome the issues that arise in the context of many-core application-level programming. Additionally, experience shows that these projects can be deployed quite smoothly in an existing operating systems course, permitting an initial exploration of parallel computing without the need to dedicate sometimes scarce faculty resources to an entire semester-long course on the topic.

4 Summary

We believe that the progression of commodity processing architectures will eventually culminate in the wide distribution of massively parallel many-core architectures such as those used in current graphics processors. Moreover, we believe this trajectory will require that undergraduate students in a wide range of technical disciplines possess an in-depth understanding of the hardware and software issues related to solving problems using such highly parallel architectures.

We have thus integrated the key components of a semester-length course in parallel computing into a 300-level operating systems course. The parallel programming unit of this course motivates two influential parallel programming models, with a focus on the issues of which programmers must be aware when writing applications for parallel architectures. In particular, multithreaded programming is introduced in two forms: the POSIX Threads interface and NVIDIA’s Compute Unified Device Architecture. A set of self-contained programming projects is used to highlight the core concepts required to utilize multithreading effectively. In our experience, these projects can be integrated quite readily by merging the key concepts with a discussion of the process and thread management facilities of modern operating systems.

Parallel computing obviously provides opportunities for more advanced study and undergraduate research. We hope to institute a semester-length course in parallel computing focused on topics such as parallel hardware architectures, parallel algorithms, and additional parallel programming models, when time and resources permit.

More immediately, however, as a comprehensive college with staffing constraints that limit our ability to introduce such courses easily, we hope to demonstrate that important topics in parallel computing can be made accessible to undergraduate students at a broad range of colleges and universities, both large and small. We also hope that our experiences are both insightful and useful to other instructors who may be interested in integrating parallel computing into their own operating systems courses.

References

[1] B. Nichols, D. Buttlar, & J. Farrell. Pthreads Programming. Sebastopol: O’Reilly, 1996.

[2] NVIDIA Corporation. CUDA 2.0 Programming Guide. Available at: http://www.nvidia.com/object/cuda_develop.html. Accessed 6 February 2009.

[3] GPGPU: General-Purpose Computing Using Graphics Hardware. Available at: http://www.gpgpu.org. Accessed 6 February 2009.

[4] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, & T. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. In Eurographics 2005, State of the Art Reports, pp. 21-51, August 2005.

[5] J. Nickolls, I. Buck, & M. Garland. Scalable Parallel Programming with CUDA. ACM Queue, 6(2):40-53, 2008.

[6] K. Fatahalian & M. Houston. GPUs: A closer look. ACM Queue, 6(2):18-28, 2008.

[7] W. Mark. Future Graphics Architectures. ACM Queue, 6(2):54-64, 2008.