Proceedings of the World Congress on Engineering and Computer Science 2012 Vol I

WCECS 2012, October 24-26, 2012, San Francisco, USA

Automatic Parallelization Tools

Ying Qian

Abstract—In recent years, much has been made of the computing industry's widespread shift to parallel computing. Nearly all consumer computers will ship with multicore processors. Parallel computing will no longer be relegated to exotic supercomputers or mainframes; moreover, electronic devices such as mobile phones and other portable devices have begun to incorporate parallel computing capabilities. High performance computing is becoming increasingly important. To maintain high quality solutions, programmers have to parallelize and map their algorithms efficiently. This task is far from trivial, especially across the different existing parallel computer architectures and parallel programming paradigms. To reduce the burden on programmers, automatic parallelization tools have been introduced. The purpose of this paper is to discuss different automatic parallelization tools for different parallel architectures.

Index Terms—parallel processing, automatic parallelization tools, parallel programming paradigms

Manuscript received July 15, 2012; revised August 9, 2012. This work was supported by the KAUST Supercomputing Laboratory at King Abdullah University of Science and Technology (KAUST). Y. Qian is with the KAUST Supercomputing Laboratory, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia (e-mail: [email protected]).

I. INTRODUCTION

Parallel machines are being built to satisfy the increasing demand for higher performance for parallel applications. Multi- and many-core architectures have become a hot topic in the fields of computer architecture and high-performance computing. Processor design is no longer driven by high clock frequencies; instead, more and more programmable cores are becoming available. In recent years, much has been made of the computing industry's widespread shift to parallel computing. Nearly all consumer computers will ship with multicore processors. Parallel computing will no longer be relegated to exotic supercomputers or mainframes; moreover, electronic devices such as mobile phones and other portable devices have begun to incorporate parallel computing capabilities.

Therefore, more and more software developers will need to cope with a variety of parallel computing platforms and technologies in order to provide optimum performance and fully utilize all the processor power. However, parallel programming requires in-depth knowledge of the underlying systems and parallel paradigms, which makes the whole process difficult. It would be desirable to let the programmer write sequential code and have parallelization tools parallelize the program automatically.

Depending on the parallel computer architecture or machine, parallel applications can be written using a variety of parallel programming paradigms, including message passing, shared memory, data parallel, bulk synchronous, mixed-mode and so on. The message passing and shared memory paradigms are the two most important programming paradigms.

The general procedure to parallelize an application can be summarized as follows: 1) decide which kind of parallel computer system will be used, 2) choose the right parallel programming paradigm to parallelize the code, and 3) choose a good automatic parallelization tool.

The rest of the paper is organized as follows. In Section 2, I provide background on several popular parallel computer architectures. The parallel programming paradigms designed for these architectures, including shared memory and message passing, are discussed in Section 3. Section 4 introduces automatic parallelization and the general way to employ it. Finally, existing automatic parallelization tools are discussed in Section 5.

II. PARALLEL COMPUTER ARCHITECTURES

In the past decade, high performance computers have been implemented using a variety of architectures. I briefly describe the most common parallel computer architectures here.

Based on Flynn's classification [1], there are four kinds of machine architectures: single-instruction stream single-data stream (SISD), single-instruction stream multiple-data streams (SIMD), multiple-instruction streams single-data stream (MISD) and multiple-instruction streams multiple-data streams (MIMD). SISD models conventional sequential computers. MISD is seldom used. In an SIMD machine, all processors execute the same instruction at the same time; it is a synchronous machine, mostly used for special-purpose applications. An MIMD machine is a general-purpose machine, where processors operate in parallel but asynchronously. MIMD machines are generally classified into five practical machine models: Symmetric Multiprocessors (SMP), Massively Parallel Processors (MPP), Distributed Shared Memory (DSM) multiprocessors, Cluster of Workstations (COW), and Cluster of Multiprocessors (CLUMP).

A. SMP

SMP is a Uniform Memory Access (UMA) system, where all memory locations are the same distance away from the processors, so it takes roughly the same amount of time to access any memory location. SMP systems have gained prominence in the market place. Considerable work has gone into the design of SMP systems, and several vendors such as IBM, Compaq, SGI, and HP offer small to large-scale shared memory systems [2]. A typical SMP machine architecture is shown in Fig. 1.


Fig. 1. A typical SMP machine.

B. Multi-core Cluster

Multi-core clusters are distributed-memory systems in which there are multiple nodes, each having multiple processors and its own local memory. In these systems, one node's local memory is remote memory for the other nodes; SMPs, by contrast, are called tightly coupled systems [1]. 81.4% of the top 500 supercomputers in the world are clusters [3]. A typical multi-core cluster machine is shown in Fig. 2.

Fig. 2. A typical multi-core cluster.

C. GPU

At the same time, Graphics Processing Units (GPUs) are becoming programmable. While their graphics pipeline was completely fixed a few years ago, we are now able to program all main processing elements using C-like languages such as CUDA [4] or OpenGL [5]. Multi-core and many-core architectures are emerging and becoming a commodity. Many experts believe heterogeneous processor platforms including both GPUs and CPUs will be the future trend [6]. A typical GPU architecture, as shown in Fig. 3, consists of a number of Single Instruction Multiple Thread (SIMT) processor clusters. Each cluster has access to an on-chip shared memory (smem), an L2 cache and a large off-chip memory.

Fig. 3. Schematic view of a GPU architecture. [6]

III. PARALLEL PARADIGMS

Parallel computers provide support for a wide range of parallel programming paradigms. The HPC programmer has several choices of parallel programming paradigm, including shared memory, message passing, mix-mode, and GPU programming.

A. Message passing

MPI [7] is a well-known message passing environment. MPI has good portability: programs written using MPI can run on distributed-memory systems, shared-memory multiprocessors, and networks of workstations or clusters. On top of shared memory systems, message passing is implemented as writing to and reading from the shared memory, so MPI can be implemented very efficiently on such systems. Another advantage of the MPI programming model is that the user has complete control over data distribution and process synchronization, which can provide optimal data locality and workflow distribution. The disadvantage is that existing sequential applications require a fair amount of restructuring for parallelization based on MPI. MPI provides the user with a programming model in which processes communicate with each other by calling library routines.
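As a brief illustration of this library-call model, the following minimal, hand-written C sketch (it is not taken from any of the surveyed tools, and the element count is an arbitrary assumption) computes a global sum: each process works on its own share of the index range and explicit send/receive calls gather the partial results on rank 0.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process computes a partial sum over its own block of indices. */
        long local = 0;
        for (int i = rank; i < 1000; i += size)
            local += i;

        /* Explicit communication: workers send, rank 0 receives and accumulates. */
        if (rank == 0) {
            long total = local, recv;
            for (int p = 1; p < size; p++) {
                MPI_Recv(&recv, 1, MPI_LONG, p, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                total += recv;
            }
            printf("sum = %ld\n", total);
        } else {
            MPI_Send(&local, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }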

B. Shared Memory

Message-passing codes written in MPI are portable and should transfer easily to SMP cluster systems. However, it is not immediately clear that message passing is the most efficient parallelization technique within an SMP box, where in theory a shared memory model such as OpenMP [8] should be preferable. OpenMP is a loop-level programming style. It is popular because it is easy to use and enables incremental development. Parallelizing a code involves two steps: (1) discover the parallel loop nests that contribute significantly to the computation time; (2) add directives for starting/closing parallel regions, managing the parallel threads (workload distribution, synchronization), and managing the data, as in the sketch below.
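For example, the same summation used above can be parallelized incrementally by annotating the hot loop with a single directive. This is again a minimal hand-written sketch of the loop-level style, not output from any particular tool.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        long total = 0;

        /* Step 1: the loop below dominates the computation time.      */
        /* Step 2: one directive creates the threads, distributes the  */
        /* iterations, and combines the partial sums (reduction).      */
        #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < 1000; i++)
            total += i;

        printf("sum = %ld (using up to %d threads)\n", total, omp_get_max_threads());
        return 0;
    }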


C. Mix-mode

In the mixed MPI-OpenMP programming style, each SMP node executes one MPI process that has multiple OpenMP threads. This kind of hybrid parallelization can be beneficial because it exploits the highly optimized shared memory model within each node. As small to large SMP clusters become more prominent, it is open to debate whether pure message passing or mixed MPI-OpenMP is the programming model of choice for higher performance.

D. CUDA

In comparison to the central processor's traditional data processing pipeline, performing general-purpose computations on a graphics processing unit (GPU) is a relatively new concept. CUDA [4] is specifically designed for general-purpose GPU programming. The CUDA architecture includes a unified shader pipeline, allowing each arithmetic logic unit (ALU) on the chip to be marshaled by a program intending to perform general-purpose computations. These ALUs were built to comply with IEEE requirements for single-precision floating-point arithmetic and were designed to use an instruction set tailored for general computation rather than specifically for graphics. The execution units on the GPU are also allowed arbitrary read and write access to memory, as well as to a software-managed cache known as shared memory.

Even so, the mapping process for most applications remains non-trivial. The programmer has to deal with mappings onto distributed memories, caches, and register files. Additionally, the programmer is exposed to parallel computing problems such as data dependencies, race conditions, synchronization barriers and atomic operations.

IV. AUTOMATIC PARALLELIZATION

A. Need for automatic parallelization

High performance computing is becoming increasingly common. To fully utilize these powerful machines, programmers have to efficiently parallelize and map their applications. This task is far from trivial, leading to the necessity to automate the process.

Past techniques provided solutions for languages like FORTRAN and C; however, these are not enough. They dealt with parallelization of specific constructs, such as loops or particular sections of code, with specific systems in mind. Identifying possibilities for parallelization is a crucial step in generating a parallel application. This need to parallelize applications is partially addressed by tools that analyze code to exploit parallelism. These tools use either compile-time or run-time techniques. Some techniques are built into parallelizing compilers, but the user needs to identify parallelizable code and mark it with special language constructs; the compiler then recognizes these constructs and analyzes the marked code for parallelization. Some tools only parallelize special forms of code, such as loops. Hence a fully automatic tool for converting sequential code to parallel code is still required.

B. General procedure of parallelization

The process starts with identifying code sections that the programmer feels have parallelism possibilities. This task is often difficult, since the programmer who wants to parallelize the code may not have written it originally, or may be new to the application domain. Therefore, this first stage seems easy at first, but it may not be straightforward.

The next stage is to identify the data dependency relations of the given sequential program. This is a crucial stage, used to sort out, from the identified code sections, those that can actually be parallelized. It is the most important and difficult stage, since it involves a lot of analysis.

Most research compilers for automatic parallelization consider Fortran programs, because during this data dependency identification stage Fortran makes stronger guarantees about aliasing than languages such as C and C++. Aliasing is a situation where a data location in memory can be accessed through different symbolic names in the program; modifying the data through one name implicitly modifies the values associated with all aliased names, which may not be expected by the programmer. As a result, aliasing makes it particularly difficult to determine the data dependencies of a program. Aliasing analyzers compute information that helps to understand aliasing in programs. In general, C/C++ codes in which pointers are involved are therefore difficult to analyze. The more dependencies there are in the identified code sections, the lower the possibility of parallelization.

Sometimes the dependencies can be removed by changing the code, and this is the next stage in parallelization. The code is transformed such that the functionality, and hence the output, is not changed, but the dependency, if any, on another code section or instruction is removed.

The last stage is to generate the parallel code with an appropriate parallel programming model. Functionally this code should be the same as the original sequential code, but it has additional constructs and parallel function calls that enable it to run on multiple threads, processes or both. A small end-to-end illustration of these stages is sketched below.
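As a deliberately simple, hypothetical C example of stages two through four: the first loop below carries its value through the scalar t, so a dependence analyzer must prove that t can be privatized; once it is, the dependency disappears and a directive-based parallel version can be generated. This is only a hand-written sketch of the kind of transformation the tools in Section V automate, not the output of any particular compiler.

    #include <stdio.h>
    #define N 1000

    int main(void)
    {
        double a[N], b[N];
        for (int i = 0; i < N; i++) b[i] = i;

        /* Stage 2: dependence analysis. The scalar t is written and read    */
        /* in every iteration, so the loop carries a (removable) dependency. */
        double t = 0.0;
        for (int i = 0; i < N; i++) {
            t = 2.0 * b[i];          /* t depends only on the current iteration */
            a[i] = t + 1.0;
        }

        /* Stage 3: transformation. Privatizing t removes the dependency      */
        /* without changing the result.                                       */
        /* Stage 4: code generation. A parallelizer can now emit a directive. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            double tp = 2.0 * b[i];  /* privatized temporary */
            a[i] = tp + 1.0;
        }

        printf("a[0]=%g a[%d]=%g\n", a[0], N - 1, a[N - 1]);
        return 0;
    }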
V. PARALLELIZATION TOOLS

A. OPENMP

ICU_PFC

ICU_PFC, introduced in [9], is an automatic parallelizing compiler. It receives FORTRAN sources and generates parallel FORTRAN code in which OpenMP directives for parallel execution are inserted. ICU_PFC detects the DO ALL parallel loops in the program and inserts the appropriate OpenMP directives. For parallel loop detection, a dependency matrix is designed to store the data dependency information of the statements in a loop. Finally, the parallelized code generator of ICU_PFC generates OpenMP-based parallel code.

Polaris compiler

The Polaris compiler [10] takes a Fortran77 program as input, transforms it so that it runs efficiently on a parallel computer, and outputs this program version in one of several possible parallel FORTRAN dialects. Polaris performs its transformations in several "compilation passes". In addition to many commonly known passes, Polaris includes advanced capabilities for the following tasks: array privatization, data dependence testing, induction variable recognition, interprocedural analysis, and symbolic program analysis.


Cetus

The Cetus [11] tool provides an infrastructure for research on multicore compiler optimizations that emphasizes automatic parallelization. The compiler infrastructure, which targets C programs, supports source-to-source transformations, is user oriented and easy to handle, and provides the most important parallelization passes as well as the underlying enabling techniques. The project follows Polaris [10], a research infrastructure for optimizing compilers on parallel machines; while Polaris translated Fortran, Cetus targets C programs.

SUIF

The SUIF [12] (Stanford University Intermediate Format) compiler is an automatic parallelizing compiler developed at Stanford University. It reads ordinary C or FORTRAN programs and generates parallelized intermediate code (in the Stanford University Intermediate Format), parallelized executable code for some systems, or C programs containing parallel constructs such as doall, doacross, etc. It is able to detect and generate doall loops, and is freely available. The compiler is composed of many passes; each pass has its own role and is independent of the others.

The freely available SUIF compiler finds the doall parallelism for all the loops contained in a given source program and makes an effort to maximize the granularity and to minimize communication and synchronization as much as it can. It generates intermediate code containing tags indicating "parallelizable". Using this intermediate code, the pass pgen generates parallelized intermediate code for shared memory multiprocessors.

Intel Compiler

The auto-parallelization feature of the Intel Compiler [13] automatically translates serial portions of the input program into semantically equivalent multi-threaded code. C, C++ and Fortran are all supported. The auto-parallelization feature targets SMP-type machines only. Automatic parallelization determines the loops that are good candidates for parallelization, performs data-flow analysis to verify correct parallel execution, partitions the data for threaded code generation, and inserts OpenMP directives appropriately.

B. MPI

Automatic parallelizing compilers such as SUIF [12] and Polaris [10] can be a solution for SMP machines. For distributed memory parallel processor systems or heterogeneous computers, the message passing paradigm is usually used. For such machines, a compiler backend must convert a shared memory based parallel program into a program using the message passing (send/receive) scheme.

Based on the automatic parallelizing compiler SUIF, a new parallelization tool is introduced in [14]. The original backend of SUIF is substituted with a new MPI backend, which gives it the capability of validating parallelization decisions based on overhead parameters. This backend converts a shared-memory based parallel program into a distributed-memory based parallel program with MPI function calls, without excessive parallelization that would cause performance degradation.

The tool adopts the SUIF parallelizing compiler and the LAM implementation [15] of MPI for the patchwork system. It converts implicit data sharing and explicit doall constructs into explicit data send/receive functions, and implicit doall constructs into explicit conditions for sequential execution on the master node. Eventually it converts a program that assumes a shared memory environment into one that assumes a distributed memory environment.

In a shared memory environment, all data in memory can be accessed by every processor. In a distributed environment, all data must be explicitly sent from the node that owns the data and received by the nodes that want to use it. Therefore, the remote data must be identified. This is done using the dataflow equation [16] and the DEF and USE sets provided by SUIF, for every doall loop represented in SUIF. There is no need to examine all code blocks for such data, since SUIF currently parallelizes only loops. The tool incorporates function calls for sending and receiving messages into the parallel code generator of SUIF, since MPI uses the traditional send/receive message scheme. After completion of the doall loop, the results are returned to the master.

The code generated by SUIF contains no explicit master node; the program is started on one node, and that node calls the doall function, with a function pointer to the function containing the actual doall loop, whenever a doall is reached. The parallel code generator pgen is modified to insert a conditional statement that makes the current processor execute a non-parallelized block only if its node id is the same as that of the master. In this way, the sequential part of a program that assumes shared memory is converted into one that assumes distributed memory.
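To make the general scheme concrete, the C sketch below is a hand-written illustration of how a doall-style loop can be executed with explicit send/receive calls and a master-node condition; it is not actual output of the backend in [14], and the array size and distribution are arbitrary assumptions.

    #include <stdio.h>
    #include <mpi.h>
    #define N 8

    int main(int argc, char *argv[])
    {
        int rank, size;
        double a[N] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* doall body: each process executes its own share of the iterations. */
        for (int i = rank; i < N; i += size)
            a[i] = 2.0 * i;

        /* Return results to the master: workers send the elements they own. */
        if (rank == 0) {
            for (int i = 0; i < N; i++)
                if (i % size != 0)
                    MPI_Recv(&a[i], 1, MPI_DOUBLE, i % size, i, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
            /* Only the master executes the non-parallelized (sequential) part. */
            for (int i = 0; i < N; i++)
                printf("a[%d] = %g\n", i, a[i]);
        } else {
            for (int i = rank; i < N; i += size)
                MPI_Send(&a[i], 1, MPI_DOUBLE, 0, i, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }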
C. GPU

In [17], a technique to automatically parallelize and map sequential image processing algorithms onto a GPU is presented. The work is based on skeletonization, which separates the structure of a parallel computation from the algorithm's functionality, enabling efficient implementations without requiring architecture knowledge from the programmer [17]. A number of skeleton classes are defined for image processing algorithms. Each skeleton class enables specific GPU parallelization and optimization techniques, such as automatic thread creation, on-chip memory usage and memory coalescing.

The tool uses domain specific skeletons and a fine-grained classification of algorithms. Compared to existing GPU code generators in general, skeleton-based parallelization potentially achieves higher hardware efficiency by enabling algorithm restructuring through skeletons.
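As a rough conceptual sketch of the skeleton idea in plain C (the tool in [17] generates GPU code and defines its own skeleton classes, so the names below are purely illustrative), a pixel-wise "map" skeleton owns the iteration structure while the user supplies only the per-pixel functionality:

    #include <stdio.h>

    typedef unsigned char pixel_t;

    /* The skeleton owns the structure (traversal; on a GPU: thread creation, */
    /* on-chip memory use); the functionality is the function passed in.      */
    static void map_skeleton(const pixel_t *in, pixel_t *out, int w, int h,
                             pixel_t (*f)(pixel_t))
    {
        for (int i = 0; i < w * h; i++)   /* on a GPU this loop becomes threads */
            out[i] = f(in[i]);
    }

    /* User-supplied functionality: invert a pixel. */
    static pixel_t invert(pixel_t p) { return (pixel_t)(255 - p); }

    int main(void)
    {
        pixel_t in[4] = {0, 64, 128, 255}, out[4];
        map_skeleton(in, out, 2, 2, invert);
        for (int i = 0; i < 4; i++)
            printf("%d ", out[i]);
        printf("\n");
        return 0;
    }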


D. EMBEDDED SYSTEMS

The OSCAR [18] automatic parallelizing compiler has been developed to utilize multicores for consumer electronics such as portable electronics, mobile phones, car navigation systems, digital TVs and games. Also, a new Consumer Electronics Multicore Application Program Interface (API) has been defined [19] to use the OSCAR compiler with the native sequential compilers of various kinds of multicores from different vendors. It was developed in the NEDO (New Energy and Industrial Technology Development Organization) "Multicore Technology for Realtime Consumer Electronics" project with six Japanese IT companies.

To parallelize a program on consumer electronics multicores, it is especially important to optimize memory allocation, overlap data transfers and realize low power consumption.

The API uses a subset of OpenMP. Furthermore, it enables the allocation of various kinds of memory, such as local memory, distributed shared memory and on-chip central shared memory, for real-time processing, and it realizes low power consumption on multicore memory architectures. The automatic parallelizing compiler generates parallelized programs with the API, so that the generated parallel programs can easily be executed on different processor cores and memory architectures.

VI. CONCLUSION

High performance computing is becoming increasingly important. To maintain high quality solutions, programmers have to efficiently parallelize and map their algorithms. This task is far from trivial, leading to the necessity to automate the process. The general procedure to parallelize an application can be summarized as follows: 1) decide which kind of parallel computer system will be used, 2) choose the right parallel programming paradigm to parallelize the code, and 3) choose a good automatic parallelization tool.

A number of automatic parallelization tools are discussed in this paper. Most of the tools are designed for SMP, shared memory machines, where OpenMP is supported. I believe OpenMP-style parallel codes are easier to generate automatically than codes for other parallel paradigms, such as MPI. Little work has been done to automatically parallelize MPI or CUDA codes, or to target embedded systems such as portable multi-core devices. However, in real life the highly scalable systems are mainly clusters, which are distributed memory systems. We really need automatic parallelization tools for MPI or mixed MPI/OpenMP applications.

ACKNOWLEDGMENT

I would like to thank Dr. Kai Qian for his kind support.

REFERENCES

[1] K. Hwang and Z. Xu, "Scalable Parallel Computing: Technology, Architecture, Programming".
[2] N. R. Fredrickson, A. Afsahi, and Y. Qian, "Performance Characteristics of OpenMP Constructs, and Application Benchmarks on a Large Symmetric Multiprocessor," 17th Annual ACM International Conference on Supercomputing (ICS'03), San Francisco, CA, USA, June 2003, pp. 140-149.
[3] Top500 supercomputer sites. Available: http://www.top500.org
[4] CUDA. Available: http://www.nvidia.com/object/cuda_home_new.html
[5] OpenGL. Available: http://www.opengl.org
[6] S. Lee, S.-J. Min, and R. Eigenmann, "OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization," Proc. ACM Symp. Principles and Practice of Parallel Programming (PPoPP 09), ACM Press, Feb. 2009, pp. 101-110.
[7] Message Passing Interface Forum, MPI: A Message Passing Interface Standard, Version 1.2, 1997.
[8] OpenMP C/C++ Application Programming Interface, Version 2.0, March 2002.
[9] H. Kim, Y. Yoon, S. Na, and D. Han, "ICU-PFC: An Automatic Parallelizing Compiler," International Conference on High Performance Computing in the Asia-Pacific Region (HPCASIA), 2000.
[10] D. A. Padua, R. Eigenmann, J. Hoeflinger, P. Petersen, P. Tu, S. Weatherford, and K. Faigin, "Polaris: A New-Generation Parallelizing Compiler for MPPs," Technical Report CSRD-1306, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, June 1993.
[11] C. Dave, H. Bae, S. Min, S. Lee, R. Eigenmann, and S. Midkiff, "Cetus: A Source-to-Source Compiler Infrastructure for Multicores," IEEE Computer, vol. 42, no. 12, pp. 36-42, Dec. 2009.
[12] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam, "Maximizing Multiprocessor Performance with the SUIF Compiler," IEEE Computer, December 1996.
[13] Intel compiler. Available: http://software.intel.com/en-us/articles/intel-compilers
[14] D. Kwon and S. Han, "MPI Backend for an Automatic Parallelizing Compiler," 1999 International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '99), Fremantle, Australia, June 1999.
[15] MPI Primer / Developing with LAM (manual), Ohio Supercomputer Center, 1996.
[16] K. Li, "Predicting the Performance of Partitionable Multiprocessors," Proc. of PDPTA'96 International Conference, pp. 1350-1353, 1996.
[17] C. Nugteren, H. Corporaal, and B. Mesman, "Skeleton-based Automatic Parallelization of Image Processing Algorithms for GPUs," ICSAMOS, 2011, pp. 25-32.
[18] H. Kasahara, M. Obata, K. Ishizaka, K. Kimura, H. Kaminaga, H. Nakano, K. Nagasawa, A. Murai, H. Itagaki, and J. Shirako, "Multigrain Automatic Parallelization in Japanese Millennium Project IT21 Advanced Parallelizing Compiler," PARELEC, IEEE Computer Society, 2002, pp. 105-111.
[19] T. Miyamoto, S. Asaka, H. Mikami, M. Mase, Y. Wada, H. Nakano, K. Kimura, and H. Kasahara, "Parallelization with Automatic Parallelizing Compiler Generating Consumer Electronics Multicore API," International Symposium on Parallel and Distributed Processing with Applications (ISPA '08), 2008, pp. 600-607.
