The Open Automation and Control Systems Journal, 2013, 5, 38-44

Open Access

Comparison and Analysis of Performance Using OpenMP and MPI

Shen Hua1,2,* and Zhang Yang1

1Department of Electronic Engineering, Dalian Neusoft University of Information, Dalian 116023, Liaoning, China

2Department of Biomedical Engineering, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, Liaoning, China

Abstract: The development of multi-core technology has brought big challenges to software structures. To take full advantage of the performance enhancements offered by new multi-core hardware, software programming models have made a great shift from sequential programming to parallel programming. OpenMP (Open Multi-Processing) and MPI (Message Passing Interface), the most common parallel programming models, provide different performance characteristics of parallelism in different cases. This paper compares and analyzes the parallel computing ability of OpenMP and MPI, and then offers some proposals for parallel programming. The processing tools used include the Intel VTune Performance Analyzer and the Intel Thread Checker. The findings indicate that OpenMP is easy to implement and provides good performance in shared-memory systems, and that MPI is suited to programming models in which nodes perform a large number of tasks with little communication between processes.

Keywords: Parallel computing, Multi-core, multithread programming, OpenMP, MPI.

1. INTRODUCTION

As the number of transistors that fit on a die has increased according to Moore's Law, manufacturers try packing more cores onto a single chip. However, adding cores is not synonymous with increasing computational power [1].

A new high-performance computation technique involving multiple processors on a single silicon die is quickly gaining popularity. Most applications, however, do not commonly utilize the new features of such a hardware platform, and few applications currently make efficient use of multiple processors to enhance single-application performance. Processors are assigned to separate applications, resulting in suboptimal single-application performance and frequent processor idle cycles from having too few applications available to execute. In this day and age of multi-core architectures, programming language support is urgently needed for constructing programs that can take full advantage of machines with multiple cores [2].

OpenMP (Open Multi-Processing) and MPI (Message Passing Interface) are the most popular parallel programming technologies [1]. OpenMP is an Application Program Interface (API) that can be used to explicitly direct multi-threaded, shared-memory parallelism. It comprises three primary API components: compiler directives, runtime library routines and environment variables. OpenMP employs the fork-join model, where different tasks are implemented by multiple threads within the same address space. Programs are written in high-level application-specific operations, and these operations are partially ordered according to their semantic constraints [3].

MPI is a message-passing library specification proposed as a standard by a committee of vendors, implementers and users [13]. It is designed to support the development of parallel software libraries. The primary functions of MPI are sending and receiving messages among different processes.

In this study, we compare the two implementations by showing the results achieved with OpenMP and MPI when executing the same concurrent collections application. Our experiments show that significant improvements in program performance are obtained using multi-threading and parallel computing techniques. The CPU time used in a computation is greatly reduced, which in turn reduces the usage of CPU resources.

The paper is organized as follows. Section 1 provides the research background and methodology. Section 2 reviews the parallel programming models, including OpenMP and MPI. Section 3 introduces a typical parallel programming environment, Intel Composer XE 2013, utilizing the Loop Profiler and VTune Amplifier to analyze bottlenecks and overload of programs. Section 4 provides a practical example that reflects the computation power of the multi-core processor by adding parallelism to a C++ application. Section 5 discusses the analytic results and Section 6 concludes the paper.

*Address correspondence to this author at the Department of Electronic Engineering, Dalian Neusoft University of Information, Dalian 116023, Liaoning, China; Tel: +86-411-84832209; Fax: +86-411-84832210; E-mail: [email protected]


2. PARALLEL PROGRAMMING MODELS

2.1. OpenMP

OpenMP is an Application Programming Interface (API) widely accepted as a standard for high-level shared-memory parallel programming. It is a portable, scalable programming model that provides a simple and flexible interface for developing shared-memory parallel applications in FORTRAN, C and C++. Its latest version is OpenMP 3.1. The standard is jointly defined by a group with members from major computer hardware and software vendors such as IBM, Silicon Graphics, Hewlett Packard, Intel, Sun Microsystems and the Portland Group [3, 4].

OpenMP was structured around parallel loops and was meant to handle dense numerical applications. Its appeal lies in the simplicity of its original interface, the use of a shared-memory model, and the fact that the parallelism of a program is expressed in directives that are loosely coupled to the code. Recently, OpenMP has been attracting widespread interest because of its easy-to-use, portable parallel programming model.

A. OpenMP API

The OpenMP API consists of the following components [3]:

- Compiler directives: instruct the compiler to mark the code section following the directive for parallel execution.

- Library routines: routines that affect and monitor threads, processors and environment variables. There are also routines to control thread synchronization and obtain timings.

- Environment variables: variables controlling the execution of the OpenMP program.

B. Working Mechanism of OpenMP

Parallel execution in OpenMP is based on the fork-join model, where the master thread creates a team of threads for parallel execution [3].

- Program execution begins as a single thread of execution, called the initial thread.

- A thread encountering a parallel construct becomes a master and creates a team consisting of itself and additional threads.

- All members of the team execute the code inside the parallel construct.

- Each thread has a temporary view of the memory, which acts like a cache and a private memory not accessible to other threads.

- Consistency is relaxed: a thread's view of memory is not required to be consistent with the memory at all times.

- A flush operation causes the last modified variables in the temporary view to be written to memory.

- Private variables of a thread can be copies of data from memory and cannot be accessed by other threads.

The parallel construct can be nested an arbitrary number of times; a thread encountering a parallel construct becomes the master of the new team.

OpenMP has been very successful in exploiting structured parallelism in applications. With increasing application complexity, there is a growing need for addressing irregular parallelism in the presence of complicated control structures. This is evident in various efforts by the industry and research communities to provide a solution to this challenging problem. One of the primary goals of OpenMP is to define a standard dialect to express and efficiently exploit unstructured parallelism.
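To illustrate the fork-join mechanism described in part B, the following is a minimal C++ sketch; it is a generic illustration assuming an OpenMP-capable compiler, not code taken from the application studied later in this paper.

    #include <cstdio>
    #include <omp.h>

    int main()
    {
        /* Execution begins as a single (initial) thread. */
        std::printf("before the parallel region: one thread\n");

        /* The parallel construct forks a team; every team member executes
           the enclosed block, and an implicit barrier joins the team at the end. */
        #pragma omp parallel
        {
            int id = omp_get_thread_num();   /* rank of this thread in the team */
            int n  = omp_get_num_threads();  /* size of the team                */
            std::printf("thread %d of %d inside the parallel region\n", id, n);
        }

        std::printf("after the parallel region: one thread again\n");
        return 0;
    }

Compiled with OpenMP enabled (for example -fopenmp with GCC or /Qopenmp with the Intel compiler on Windows), the program prints one line per thread in the team before the master thread continues alone.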
2.2. MPI

MPI is a specification for a standard message-passing library, defined by the MPI Forum beginning in April 1992. During the next eighteen months the MPI Forum met regularly, and Version 1.0 of the MPI Standard was completed in May 1994 [12, 14]. Some clarifications and refinements were made in the spring of 1995, and Version 1.1 of the MPI Standard is now available [13].

Multiple implementations of MPI have been developed, among them MPICH, LAM and IBM MPL [13]. In this paper we discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment.

A. Features of MPI

MPI is a library which specifies the names, calling sequences and results of functions. MPI can be invoked by C and FORTRAN programs; the MPI C++ library, consisting of classes and methods, can also be called. MPI is a specification, not a particular implementation; a correct MPI program should be able to run on all MPI implementations without change [12, 16].

- Messages and Buffers

Sending and receiving messages are the two fundamental operations of MPI. Messages can be typed with a tag integer. Message buffers can be more complex than a simple contiguous buffer: users can create their own derived data types that combine data at different addresses.

- Communicators

The notions of context and group are combined in a single object called a communicator, which becomes an argument to most point-to-point and collective operations. Thus, the destination or source specified in a send or receive operation always refers to the rank of the process in the group identified by a communicator.

- Process Groups

Processes belong to groups. Each process is ranked in its group with a linear numbering. Initially, all processes belong to one default group.

- Separating Families of Messages

MPI programs can use the notion of contexts to separate messages in different parts of the code, which is very useful for writing libraries. The context is allocated at run time by the system in response to user (or library) requests.

B. Basic Frame of MPI Functions

The initialization and termination of the MPI environment proceed as follows:

(1) Before calling any other MPI routine, each process must call MPI_INIT;

(2) Call MPI_COMM_SIZE to obtain the number of processes in the default group (communicator);

(3) Call MPI_COMM_RANK to obtain the logical number (rank, starting from 0) of the process within the default group;

(4) Send messages to other nodes, or receive messages from other nodes, through MPI_SEND and MPI_RECV according to the needs of the computation;

(5) Call MPI_FINALIZE to release the MPI environment when no further MPI routines need to be called. The process can either end at this point or continue to execute statements unrelated to MPI.

A minimum routine of an MPI program built from the six functions mentioned above is given below [14, 15]:

    int MPI_Init(int *argc, char ***argv)  /* Initialize MPI */

    int MPI_Comm_size(MPI_Comm comm, int *size)  /* Find out how many processes there are */

    int MPI_Comm_rank(MPI_Comm comm, int *rank)  /* Find out which process I am */

    int MPI_Send(void *address, int count, MPI_Datatype datatype, int destination, int tag, MPI_Comm comm)  /* Send a message */

    int MPI_Recv(void *address, int maxcount, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)  /* Receive a message */

    int MPI_Finalize()  /* Terminate MPI */
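As a concrete illustration of this six-function skeleton, the following is a minimal, self-contained C program; the message payload and tag are arbitrary choices for the sketch, and it is not the benchmark code used later in the paper.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int size, rank;
        MPI_Init(&argc, &argv);                    /* (1) initialize MPI             */
        MPI_Comm_size(MPI_COMM_WORLD, &size);      /* (2) how many processes         */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* (3) my rank, 0 .. size-1       */

        if (rank == 0) {
            double value = 3.14;
            int dest;
            for (dest = 1; dest < size; dest++)    /* (4) send to every worker       */
                MPI_Send(&value, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            printf("process 0 sent %.2f to %d worker(s)\n", value, size - 1);
        } else {
            double value;
            MPI_Status status;
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
            printf("process %d received %.2f\n", rank, value);
        }

        MPI_Finalize();                            /* (5) shut down MPI              */
        return 0;
    }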
3. INTEL PARALLEL PROGRAMMING ENVIRONMENT

Intel Composer XE 2013 is a parallel programming environment. It includes the C++ compiler, the VTune Amplifier and the Threading Building Blocks, among other modules. Together these modules deliver outstanding performance for applications that run on systems using Intel Core or Xeon processors, including Intel Xeon Phi coprocessors, and IA-compatible processors. It combines all the serial and parallel tools from Intel C++ Composer XE 2013 with those from Intel FORTRAN Composer XE 2013. Visual Studio 2008, 2010 or 2012 is a prerequisite on Windows, and the GNU tool chain is supported on Linux [8].

3.1. Intel C++ Compiler

Intel C++ compilers are not available as stand-alone compilers. They are available in packages, some of which include other build tools, such as libraries, and others which include performance and threading analysis tools. Intel C++ is part of Intel C++ Studio XE for Windows and Linux, which includes performance analysis and thread-diagnostic tools, and of Intel C++ Composer XE (available for Windows, Linux and Apple OS X) and Intel Composer XE, which also includes Intel Fortran (available for Windows and Linux); the Composer packages do not include the analysis and thread-diagnostic tools. Intel compilers are also included in Intel Cluster Studio (no analysis tools) and Intel Cluster Studio XE (analysis tools included); the cluster tools are available for Windows and Linux. Packages that include Intel C++ also include the Intel Math Kernel Library (Intel MKL), Intel Integrated Performance Primitives (Intel IPP) and Intel Threading Building Blocks (Intel TBB). Fortran-only packages include only Intel MKL [8].

The Intel C++ compiler has many advanced performance features, such as the High Performance Parallel Optimizer (HPO), Automatic Vectorization, Guided Auto Parallelization (GAP), Inter-procedural Optimization (IPO), the Loop Profiler, Profile-Guided Optimization (PGO) and OpenMP 3.1 [9].

3.2. VTune Amplifier XE 2013

VTune Amplifier XE 2013 has a lightweight hotspots analysis that uses the Performance Monitoring Unit (PMU) on Intel processors to collect data with very low overhead. The increased resolution can find hot spots in small functions that run quickly [9].

When developing, optimizing or tuning applications for performance on the Intel architecture, it is important to understand the processor micro-architecture. Insight into how an application is performing on the micro-architecture is gained through performance monitoring. The Intel VTune Performance Analyzer provides an interface to monitor the performance of the processor and gain insight into possible performance bottlenecks.

The Intel VTune Performance Analyzer is a powerful software-profiling tool available on both Microsoft Windows and Linux. VTune helps to understand the performance characteristics of software at all levels: system, application and micro-architecture. The main features of VTune are sampling, call graph and counter monitor [9]. In this study, we focus on the event-based sampling feature of VTune and on how to choose VTune events and ratios for monitoring performance.

While VTune is used to sample an application, it may not be necessary to use all the events and ratios that are available in the tool. The most commonly used events are clockticks and instructions retired, which are selected by VTune when you create a new event.

3.3. Intel Threading Building Blocks

Profiling is a technique for measuring where software programs consume resources, including CPU time and memory. Code profiling tools are visual instruments that help to expose the performance bottlenecks and hotspots that inhibit the desired code parallelism. The instrumentation is done by gathering information, such as execution times and numbers of calls, during the execution of program entities such as functions and loops. The outcome is visualized graphically on the screen and enables the programmer to detect, for example, the imbalance in either computation or communication that is present in an algorithm.

Intel Thread Profiler is a profiling tool that identifies bottlenecks that limit the parallelism of threaded applications and locates overheads due to synchronization, stalled threads and long blocking times. Thread Profiler supports applications threaded with Windows threads, POSIX threads and OpenMP [4, 11].

Thread Profiler creates two kinds of views for analyzing the behavior of a threaded application: the Profile view and the Timeline view. While the Timeline view is more intuitive to the way a sequential programmer thinks, the Profile view demands thinking in parallel, which is a different way of thinking.

Intel Threading Building Blocks (Intel TBB) is a widely used, award-winning C++ template library for creating high-performance, scalable parallel applications. It includes scalable memory allocation, load-balancing, work-stealing task scheduling, thread-safe and concurrent containers, high-level parallel algorithms, and numerous synchronization primitives.
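As a brief, generic illustration of the loop-level parallelism that TBB expresses (this sketch only assumes that the TBB headers are available; it is not code from the application studied in this paper):

    #include <cstdio>
    #include <vector>
    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    int main() {
        std::vector<double> data(1000000, 1.0);

        // parallel_for splits the index range into blocks and schedules the
        // blocks on worker threads through TBB's work-stealing task scheduler.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    data[i] = data[i] * 2.0 + 1.0;   // independent per-element work
            });

        std::printf("data[0] = %.1f\n", data[0]);
        return 0;
    }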
4. REALIZATION METHODS FOR PARALLELIZATION

In this section we use an example to demonstrate how parallel programming can be realized for a computing application, and compare the performance indicators of parallelization between OpenMP and MPI.

The parallel programming example we have chosen is an application of the Newton iteration algorithm. It is a non-linear algorithm, has multiple scales, and is easy to modulate. We assume that the Newton iteration algorithm is used to compute the temperature compensation for a platinum resistance thermometer. The central processor computes each of the temperature data points iteratively, utilizing the computation power of the multi-core processor. Dedicated threads are used for the data conversion as well as for the Newton iteration.

4.1. Newton Iteration Algorithm

The resistance of a platinum resistor, R(t), as a function of temperature t is:

R(t) = R_0[1 + at + bt^2 + c(t - 100)t^3]   (1)

where -200 <= t <= 0, and

R(t) = R_0(1 + at + bt^2)   (2)

where 0 <= t <= 850, a = 3.90802E-3, b = -5.802E-7 and c = -4.2735E-12.

Using the Newton iteration algorithm, consider

R'(t) = [R(t_n) - R(t_{n-1})] / (t_n - t_{n-1})   (3)

Rearranging, we get

t_n = t_{n-1} + [R(t_n) - R(t_{n-1})] / R'(t_{n-1})   (4)

From the expression for R(t), we obtain

t_n = t_{n-1} + [R - 100(1 + at + bt^2 - 100ct^3 + ct^4)] / [100(a + 2bt - 300ct^2 + 4ct^3)], with t = t_{n-1}   (5)

Simplifying the equation for the range 0 <= t <= 850,

t_n = t_{n-1} + [R - 100(1 + at + bt^2)] / [100(a + 2bt)], with t = t_{n-1}   (6)

Let the relationship between R and t be approximately linear. We choose the first approximation

t_0 = (R/100 - 1) / a   (7)

The stopping criterion is not a fixed number of iterations: the program iterates until t_n - t_{n-1} is less than the given temperature tolerance.
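To make the iteration concrete, the following C++ function sketches the sequential computation of Eqs. (6) and (7): it recovers the temperature from a measured Pt100 resistance (R_0 = 100 ohm) by Newton iteration. The function name, tolerance and test value are illustrative assumptions, not the authors' source code.

    #include <cmath>
    #include <cstdio>

    // Coefficients used in the text (valid for 0 <= t <= 850 C).
    const double a = 3.90802e-3;
    const double b = -5.802e-7;

    // Recover temperature t from a measured resistance R (Pt100, R0 = 100 ohm)
    // by Newton iteration on R(t) = 100*(1 + a*t + b*t^2), Eqs. (6) and (7).
    double temperature_from_resistance(double R, double tol = 1e-6)
    {
        double t = (R / 100.0 - 1.0) / a;            // first approximation, Eq. (7)
        double t_prev;
        do {
            t_prev = t;
            double f  = R - 100.0 * (1.0 + a * t_prev + b * t_prev * t_prev);
            double df = 100.0 * (a + 2.0 * b * t_prev);
            t = t_prev + f / df;                     // Newton step, Eq. (6)
        } while (std::fabs(t - t_prev) > tol);       // stopping criterion
        return t;
    }

    int main()
    {
        std::printf("R = 138.5 ohm  ->  t = %.3f C\n", temperature_from_resistance(138.5));
        return 0;
    }

For R = 138.5 ohm the routine converges to roughly 100 degrees C, in agreement with Eq. (2) and the coefficient values given above.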
4.2. Multi-Thread Model

Threaded applications can communicate via the inherently shared memory of the process, avoiding the traditionally more expensive inter-process communication facilities (pipes, sockets, etc.). Furthermore, the concurrency and synchronization of the various tasks, which generally require a large effort in a multi-process program, are easily obtained with specific system calls that coordinate the activity of the threads in a natural manner [6].

To evaluate the effectiveness of multiple threads and reduce the overhead among threads, we construct a multi-thread model. The Newton iteration algorithm is run on the central processor. To achieve high-speed data processing and demonstrate the computation power of the dual-core processor, several threads, a receive-data pool and a send-data pool are used. Fig. (1) below depicts the relationship of the main elements of this data processing. In particular, the receive thread reads data from the receive-data pool and then pushes them into the receive queue. Two process threads, performing identical functions and running in parallel, constantly monitor the status of the receive queue; if it is not empty, they push the processed data into the send-data pool. The main thread reads data from the send-data pool. A sketch of this structure is given after Fig. (1).

4.3. Analysis Indicators of Parallel Programming Performance

The testing tools used were the Intel VTune Performance Analyzer, Intel Thread Checker and Intel Thread Profiler. For each case, three parameters, memory size, CPU time and speed-up ratio, are collected for comparison. The speed-up ratio is calculated with the following formula:

Speed-up Ratio = (Sequential Execution Time - Parallel Execution Time) x 100 / Sequential Execution Time

5. DISCUSSION

In this section we introduce a special application based on the Newton iteration algorithm. First we write a sequential program, then a multi-threaded version, and then we parallelize the program. The processor is an Intel Core i5, and the development environment is Intel Composer XE 2013.

Our methodology is divided into three phases.

The first phase is to compare the performance between a single thread and dual threads, and then perform performance analysis using the Intel VTune Performance Analyzer in order to identify performance optimization opportunities and detect bottlenecks.

The second phase is to modify the original sequential program to accommodate parallel computing using OpenMP, and then conduct performance analysis using the Intel VTune Performance Analyzer to identify performance optimization opportunities and detect bottlenecks [9].

Fig. (1). The model of multithread processing.
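The structure in Fig. (1) can be outlined with standard C++11 threads. The sketch below is a simplified assumption about the model (the queue is filled before the two process threads start, and the conversion is a placeholder), not the authors' implementation.

    #include <cstdio>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    std::queue<double>  receive_queue;   // filled by the receive thread
    std::vector<double> send_pool;       // read by the main thread
    std::mutex rq_mutex, sp_mutex;

    // Receive thread: moves raw data from the receive-data pool into the queue.
    void receive_thread(const std::vector<double>& receive_pool) {
        for (double r : receive_pool) {
            std::lock_guard<std::mutex> lock(rq_mutex);
            receive_queue.push(r);
        }
    }

    // Process thread: two identical copies run in parallel, each converting
    // queued resistances and pushing results into the send-data pool.
    void process_thread() {
        for (;;) {
            double r;
            {
                std::lock_guard<std::mutex> lock(rq_mutex);
                if (receive_queue.empty()) return;       // simplified termination
                r = receive_queue.front();
                receive_queue.pop();
            }
            double t = (r / 100.0 - 1.0) / 3.90802e-3;   // placeholder conversion, cf. Eq. (7)
            std::lock_guard<std::mutex> lock(sp_mutex);
            send_pool.push_back(t);
        }
    }

    int main() {
        std::vector<double> receive_pool = {110.0, 119.4, 138.5};
        std::thread rx(receive_thread, std::cref(receive_pool));
        rx.join();                                       // fill the queue first (simplification)
        std::thread p1(process_thread), p2(process_thread);
        p1.join();
        p2.join();
        std::printf("main thread read %zu results from the send pool\n", send_pool.size());
        return 0;
    }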

Table 1. Performance using Single- and Dual-Threads

Object | Counters | Instructions Retired | Clockticks
Single thread (un-optimized, Microsoft compiler) | Processor Queue Length (1.298); Context Switches/sec (17542); Memory available bytes (619 M); Processor Time (59.367%) | 80%~90% | 25%~50%
Dual-thread (optimized, Intel compiler) | Processor Queue Length (2.941); Context Switches/sec (4995); Memory available bytes (552 M); Processor Time (51.64%) | <5% | <15%

In the third phase, we rewrite the sequential Newton iteration program using MPI, and then locate data races and memory leaks and debug the multithreaded application using the Intel Thread Checker [10, 11].

Experiments are conducted to demonstrate the performance of the three programs implementing the Newton iteration algorithm. A comparative study is carried out on the performance of the program using a single thread or dual threads, and using OpenMP or MPI, respectively.

5.1. Performance Analysis of Multithreads

The testing tools used were the Intel VTune Performance Analyzer, Intel Thread Checker and Intel Thread Profiler. The most CPU-intensive computation thread was chosen to perform the tests. The experiments are conducted for the single-thread and dual-thread cases. For each case, three parameters, counter values, instructions retired and clockticks, are collected for comparison. Table 1 summarizes the results.

From Table 1 it can be seen that significant improvements in several key parameters, such as Processor Queue Length, Context Switches/sec, Memory available bytes and Processor Time, are obtained using dual threads and the optimization technique. The number of instructions used in the computation is reduced from 80%~90% to 5% or below using dual threads and optimization. This, in turn, greatly reduces the usage of CPU resources, a drop from 25%~50% to below 15%.

5.2. OpenMP Parallel Realizations

In order to measure the performance speed-up of OpenMP over a sequential algorithm, we write a sequential program that computes the temperature compensation for platinum resistance using the Newton iteration algorithm, then modify the program to support parallel computing using OpenMP, and finally compare the run times of the two Newton iteration programs.

We use the "#pragma omp parallel" statement to enable OpenMP in this algorithm code, as sketched below. Only small changes are required in OpenMP to expose data locality, so that a compiler can transform the code. Our notion of tiled loops allows developers to easily describe data locality. Furthermore, we employ two optimization techniques: one explicitly uses a form of local memory (the thread pool) to prevent conflict cache misses, whereas the second modifies the parallel programming pattern with dynamically sized blocks to increase the number of parallel tasks [5-7].
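The kind of change referred to above amounts to parallelizing the compensation loop. The fragment below is a sketch of such a modification (temperature_from_resistance is the illustrative routine from the Section 4.1 sketch, and the static schedule is an assumption), not the authors' exact program.

    #include <vector>

    // Sequential Newton routine from the Section 4.1 sketch, declared here so
    // that the fragment compiles on its own (definition as given earlier).
    double temperature_from_resistance(double R, double tol = 1e-6);

    void compensate(const std::vector<double>& resistances,
                    std::vector<double>& temperatures)
    {
        const long n = static_cast<long>(resistances.size());

        // Each measurement is converted independently, so the loop iterations
        // can be distributed over the OpenMP thread team.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; ++i)
            temperatures[i] = temperature_from_resistance(resistances[i]);
    }

Compiled with OpenMP enabled (for example -fopenmp with GCC or /Qopenmp with the Intel compiler on Windows), the iterations are distributed over the thread team; compiled without OpenMP support, the pragma is ignored and the loop runs sequentially.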

Table 2. Comparison Between Sequential and OpenMP

Cycle Times | Memory Size, Byte | Execution Time, s (Sequential Program) | Execution Time, s (OpenMP Program) | Speed-up Ratio, %
1000 | 1023432 | 0.013 | 0.012 | 7.69
2000 | 2106430 | 0.021 | 0.016 | 23.8
5000 | 6225478 | 0.047 | 0.032 | 31.9
10000 | 12365232 | 0.082 | 0.047 | 42.7

Table 3. Execution Time Based on MPI

Cycle Times | Memory Size, Byte | Thread0 Run Time, s (C++ Clock) | Thread1 Run Time, s (C++ Clock) | Thread0 Run Time, s (MPI Clock) | Thread1 Run Time, s (MPI Clock)
1000 | 3426442 | 0.038 | 0.037 | 0.02 | 0.02
2000 | 5105630 | 0.082 | 0.088 | 0.07 | 0.07
5000 | 12542678 | 3.216 | 3.242 | 3.14 | 3.14
10000 | 27354432 | 7.023 | 7.071 | 7.02 | 7.02

In order to obtain accurate run-time measurements, each algorithm is executed in a loop, and the total run time depends on the number of loop cycles; the run time of the algorithm is the average over the loop iterations. In addition, each program takes one command-line argument, argv, giving the number of loop-backs at run time, so that the total execution time of the programs can be controlled.

Experiments were conducted to evaluate the performance of the different programming implementations of the application-specific Newton iteration algorithm. A comparative study is conducted between the sequential program and the program parallelized with OpenMP. The testing tools used were the Intel VTune Performance Analyzer, Intel Thread Checker and Intel Thread Profiler. For each case, three parameters, memory size, execution time and speed-up ratio, are collected for comparison.

We set the cycle times from 1000 to 10000, and then compare the memory size and execution time of the sequential program and the OpenMP program for the different cycle times. The results of the experiments are shown in Table 2.

Observing the data in Table 2, it can be seen that OpenMP does not obviously improve the performance of the Newton iteration algorithm over the sequential program when the cycle times are small (e.g., 1000). The speed-up ratio increases significantly when the cycle times rise from 1000 to 10000.

Overall, the experimental results in Table 2 show that OpenMP does increase the performance compared to the sequential program, and the parallel OpenMP algorithm improves the performance significantly once the cycle times reach a large value.

5.3. MPI Parallel Realizations

Similarly, comparative studies were made using the MPI programming environment. Processors P0 and P1 are used as the master and the slave, respectively. Unlike the OpenMP program, the MPI program splits the computing task into two subtasks: P0 computes the first subtask, sends the second subtask to P1, and collects the result from P1 to come up with the final result [15].

In each iteration, processors P0 and P1 both compute their data sets. Then, processor P0 performs a send operation of its first subtask to processor P1 using MPI_Send, and processor P1 receives the data from P0 using MPI_Recv and then performs the iteration of the next loop.

There are two different clocks that can be used to measure the performance of MPI programs: the C++ clock and the MPI clock [16]. Normally, run-time measurements across languages cover the time each language spends in the parallel computation (not including overhead time), but in MPI programs it is impossible to exclude this overhead using the C++ clock. Also, the C++ clock shows different time values in different processes of the same program, while the MPI clock does not. Therefore, the MPI clock is used to measure the performance instead of the C++ clock. Table 3 contains the execution times of the program measured with the different clocks in the different processes.

Analyzing the data in Table 3, we conclude that the performance of the Newton iteration program using MPI is much slower than that of the sequential algorithm. This is because the MPI_Send and MPI_Recv functions used in this program contain a synchronization mechanism, such that the program waits for the completion of the send and receive of data before proceeding to the next computation.

The delay caused by the synchronization between send and receive increases significantly as the cycle times increase. From Table 3, we conclude that the Newton iteration program does not benefit from MPI, especially when the cycle times are large, so MPI is not recommended for this type of computation.
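The master/slave split and the MPI-clock timing described above can be outlined as follows. This is a simplified sketch intended to run with two processes (e.g., mpiexec -n 2): the work is represented by a placeholder function and the split into exactly two halves is assumed from the text; it is not the measured benchmark program.

    #include <stdio.h>
    #include <mpi.h>

    /* Placeholder for one subtask's worth of Newton-iteration work. */
    static double do_half_of_the_work(int cycles) {
        double acc = 0.0;
        int i;
        for (i = 0; i < cycles; i++)
            acc += 1.0 / (i + 1.0);
        return acc;
    }

    int main(int argc, char *argv[]) {
        int rank, cycles = 10000;
        double local, remote, t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        t0 = MPI_Wtime();                         /* MPI wall-clock timer              */
        local = do_half_of_the_work(cycles);      /* P0 and P1 each compute one subtask */

        if (rank == 0) {                          /* P0 collects P1's partial result    */
            MPI_Recv(&remote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            t1 = MPI_Wtime();
            printf("result = %f, elapsed = %f s\n", local + remote, t1 - t0);
        } else if (rank == 1) {                   /* P1 sends its partial result to P0  */
            MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

The blocking MPI_Send/MPI_Recv pair is also where the synchronization delay discussed above arises: neither process proceeds until the transfer completes.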

6. CONCLUSIONS

This paper provides solutions using OpenMP and MPI, two different parallel programming methods, on the Windows platform. The goal of this study is to provide programmers with patterns for solving particular problems in parallel programming.

This study compares our experiences in parallelizing and optimizing the example application. We compare the performance of (a) the sequential program, (b) the OpenMP parallelization and (c) the MPI parallelization. All code was run on an Intel Core i5 processor.

Through the comparison and analysis of parallel program performance, we show how to choose parallel models when designing a parallel system. The basic rules are as follows:

- OpenMP is easy to implement and provides good performance in shared-memory systems. For programming models with quick, small processing tasks and no future extension, such as arithmetic computations, OpenMP should be considered the best match for parallel programming.

- MPI is well suited to programming models for large systems with long-term processing and future expansion. For example, parallel sort-like algorithms require a large amount of computation, performed by nodes, but have little communication across processes; MPI is the best candidate for these types of systems.

- We also used several tools (VTune, Thread Checker and Thread Profiler) in this work. The Intel Composer tools can help to identify and understand application performance issues; different algorithms and implementations can then be used to mitigate those issues. It is also recommended to utilize thread- and process-parallel implementations to improve multi-core processor utilization.

CONFLICT OF INTEREST

The authors confirm that this article content has no conflicts of interest.

ACKNOWLEDGEMENTS

This work was financially supported by the Research Fund of the Liaoning Provincial Bureau of Education, P. R. China (Fund #: L2012490).

REFERENCES

[1] H. Sutter, "The free lunch is over: A fundamental turn toward concurrency in software", Dr. Dobb's Journal, vol. 30, no. 3, 2005.
[2] T. Mattson, J. DeSouza, "How fast is fast? Measuring and understanding parallel performance", Intel Webinar, 2008.
[3] OpenMP Architecture Review Board, "OpenMP Application Program Interface", http://openmp.org/forum/
[4] B. Chapman, G. Jost, R. van der Pas, "Using OpenMP: Portable Shared Memory Parallel Programming", MIT Press, USA, 2007.
[5] S. Akhter, "Multi-Core Programming: Increasing Performance Through Software Multi-threading", Electrical Industry Press, vol. 3, 2007.
[6] M. J. Garzarán, M. Prvulovic, J. M. Llabería, V. Viñals, L. Rauchwerger, and J. Torrellas, "Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors", ACM Transactions on Architecture and Code Optimization, vol. 2, no. 3, pp. 247-279, 2005.
[7] K. Knobe, C. D. Offner, "TStreams: A model of parallel computation", Technical Report HPL-2004-78, HP Labs, 2004.
[8] Intel Composer XE 2013. http://software.intel.com/en-us/intel-composer-xe
[9] Intel VTune Performance Analyzers. http://www.intel.com/software/products/
[10] Intel Thread Checker. http://software.intel.com/en-us/intel-thread-checker/
[11] Intel Thread Profiler. http://software.intel.com/en-us/intel-vtune/
[12] Message Passing Interface Forum, "Document for a standard message-passing interface", Tech. Rept. CS-93-214, University of Tennessee, 1994.
[13] The MPI Forum, "The MPI message-passing interface standard", http://www.mcs.anl.gov/mpi/standard.html, 1995.
[14] Message Passing Interface Forum, "MPI: A message-passing interface standard", 2009.
[15] C.-C. Wu, L.-F. Lai, C.-T. Yang, P.-H. Chiu, "Using hybrid MPI and OpenMP programming to optimize communications in parallel loop self-scheduling schemes for multicore PC clusters", The Journal of Supercomputing, vol. 60, no. 1, pp. 31-61, 2012.
[16] R. Rabenseifner, "Some aspects of message-passing on future hybrid systems", EuroPVM/MPI, Springer, Munich, pp. 8-10, 2008.

Received: August 15, 2013 Revised: September 05, 2013 Accepted: September 28, 2013

© Hua and Yang; Licensee Bentham Open.

This is an open access article licensed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited.